PROPOSED APPROACH TO DEAL WITH LDTs - LONG DURATION TRANSIENTS EFFECTS

2 LONG DURATION TRANSIENTS EFFECTS

2.3 PROPOSED APPROACH TO DEAL WITH LDTs

As shown in the previous section, most of the currently known soft error mitigation techniques will not be useful in the future scenario, in which radiation induced transients will last longer than the cycle time of circuits and the probability of multiple simultaneous upsets will be higher due to the small distances between devices.

Time redundancy based techniques will become too expensive in terms of performance overhead, because the delay between the two or three samplings of the inputs must be longer than the expected transient width. Space redundancy based techniques, in turn, will still impose too heavy penalties in terms of area and power overheads, making them useless for application fields in which those are scarce resources, such as the embedded systems arena. Finally, software based techniques will continue to suffer from the need to modify existing software or from high area and/or performance overheads.
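The constraint on the sampling delay can be illustrated with a minimal sketch. It assumes a signal sampled three times and majority voted, with a single transient that inverts the signal during a fixed window. The function name, the parameters, and the time units are hypothetical and are not taken from the techniques reviewed here.

```python
def voted_sample(t_first, delay, transient_start, transient_width, good=1):
    """Majority vote over three samples of a one-bit signal.

    The signal holds the value `good`, except during the window
    [transient_start, transient_start + transient_width), where a
    transient inverts it. All times use the same arbitrary unit.
    """
    bad = 1 - good
    times = [t_first + k * delay for k in range(3)]
    samples = [
        bad if transient_start <= t < transient_start + transient_width else good
        for t in times
    ]
    # Majority of three one-bit samples.
    return 1 if sum(samples) >= 2 else 0


# Delay longer than the transient: at most one sample is corrupted.
masked = voted_sample(t_first=0, delay=3, transient_start=0, transient_width=2)

# Delay shorter than the transient: two samples are corrupted and the vote fails.
failed = voted_sample(t_first=0, delay=1, transient_start=0, transient_width=2)
```

In this model the vote is only guaranteed to mask the transient when the delay exceeds the transient width, so a longer transient directly lengthens the time needed to obtain one voted result.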

Given this scenario, this work proposes the development of a set of innovative low cost techniques, each one working at a different abstraction level in a complementary fashion, in order to face the challenges imposed on designers by future technologies.

While the development of a complete set of such solutions, able to harden a whole system against soft errors, is out of the scope of this thesis text, this is the ultimate goal of our research project.

Very recently, in Albrecht (2009), a generic approach to deal with the drawbacks imposed by future technologies in the design of systems-on-chip has been proposed.

The authors propose to divide the SoC into several architectural layers, each one tailored to the specific fault tolerance needs of the SoC, aiming to cope with the decreasing device reliability due to parameter variations, temperature impact, and radiation effects. While the ideas behind this proposal have some points in common with those proposed in Lisboa (ETS 2007), in Albrecht (2009) the authors suggest that the detection of errors should be implemented at lower levels, while the error correction should be performed at system level. In order to achieve this goal, mechanisms for fault detection and communication with upper levels should be implemented at the lower levels, while the upper levels would implement the error correction mechanisms and also the ability to switch the error detection mechanisms in lower levels on and off, according to the specific fault tolerance requirements of the application. That proposal therefore aims to define a configurable reliable design, able to deal not only with runtime errors, but also with design and production errors. However, no new error detection or correction technique is proposed in that work: the authors simply review existing alternatives, without commenting on their suitability to cope with the new scenario. Furthermore, the authors state that the overall system reliability is becoming more important than the MIPS-per-watt measure, thereby suggesting that performance and power overheads are not as important as the reliability features in a SoC.

In our work, in turn, we propose the development of new low cost techniques to deal with the new challenges at different abstraction levels, with reduced area, power, and performance overheads as the primary goal. While most of the solutions proposed in this thesis aim to detect errors, some of them also include error correction capabilities. As a general rule, our proposal is to correct errors through recomputation, which may sometimes seem to be a very expensive solution. However, it must be highlighted that, when dealing with radiation induced errors, the very low frequency of SETs in comparison with the operating frequencies of the circuits makes the recomputation cost almost negligible. Nevertheless, the reduction of the recomputation cost itself has also been a concern in our work, as detailed in the description of the technique presented in Chapter 3.
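The argument about the recomputation cost can be made concrete with a back-of-the-envelope sketch. The function and every number below are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def recomputation_overhead(clock_hz, errors_per_second, recompute_cycles):
    """Fraction of clock cycles spent on recomputation.

    Assumes each detected error triggers exactly one recomputation that
    costs `recompute_cycles` cycles, and that errors arrive at a constant
    average rate.
    """
    return errors_per_second * recompute_cycles / clock_hz


# Hypothetical figures: a 1 GHz circuit, one detected error per second
# (an assumed, deliberately high rate), and 1000 cycles per recomputation.
overhead = recomputation_overhead(
    clock_hz=1e9, errors_per_second=1.0, recompute_cycles=1000
)
```

Under these assumed figures, about one cycle in a million is spent on recomputation. The overhead grows linearly with both the error rate and the cost of each recomputation, which is why reducing the latter remains worthwhile.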

For the purpose of this work, we have considered the following abstraction levels in a system:

• technology level

• component level

• circuit level

• architecture level

• algorithm or software level

• system level

At the technology level, the use of built-in current sensors, proposed in Neto (2006), is a possible approach to detect soft errors, as shown in Lisboa (ITC 2007) and Albrecht (2009), but the error correction capability must then be implemented at higher abstraction levels.

At the component level, given the possibility of transients lasting even longer than the propagation times of circuits, the use of space redundancy, under the single fault assumption, is the most suitable alternative. However, the area and power overheads imposed by space redundancy preclude the use of this option when designing portable or embedded systems. Another approach is to oversize the most sensitive transistors used in the construction of the component, but then again the area overhead becomes too high. Due to those considerations, in our work there are no studies for error detection or correction at the component level.

When working at circuit level, many alternative techniques have already been proposed. As previously mentioned, those based on time redundancy will be useless in the presence of LDTs. Space redundancy techniques, such as DWC, TMR, and the use of IPs or checkers operating in parallel with the circuit to be protected, can indeed cope with LDTs, but usually impose heavy penalties in terms of area and power. With this in mind, one of the new low overhead solutions proposed in this thesis is the use of Hamming codes to protect combinational logic, as described in Chapter 6.
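As background for readers unfamiliar with Hamming codes, the sketch below shows a textbook Hamming(7,4) encoder and single-error corrector operating on data bits. It is not the technique of Chapter 6, whose application to combinational logic is described there; the function names are chosen here for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_syndrome(code):
    """Return 0 if the codeword is consistent, else the 1-based error position."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    return s1 + 2 * s2 + 4 * s3


def hamming74_correct(code):
    """Return a copy of the codeword with a single flipped bit repaired."""
    fixed = list(code)
    position = hamming74_syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1
    return fixed


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # flip one bit to emulate a single upset
repaired = hamming74_correct(corrupted)
```

Three check bits protect four data bits in this code, compared with the full duplication or triplication required by DWC and TMR.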

Considering the use of commercial off-the-shelf (COTS) processors in the implementation of the system to be hardened, the mitigation of soft errors at the architecture level is usually restricted to space redundancy techniques such as TMR. However, the growing availability of multi-processor FPGAs, which allow the addition of custom logic around the COTS processors, opens new paths to be explored in the search for soft error mitigation alternatives. As an example of this approach, the lockstep technique combined with checkpoint and rollback, already proposed in the past but precluded by the high costs involved in the design and manufacturing of ASICs, is now becoming a feasible option again. For this reason, this technique has been studied in this thesis, and an improvement at the architecture level that reduces the time required to perform checkpoints has been proposed, as described in Chapter 5.
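The general idea of lockstep combined with checkpoint and rollback can be sketched in software as follows. This is a simplified model under stated assumptions, not the architecture of Chapter 5: two replicas run the same pure `step` function, their states are compared after every step, and the `fault` hook that corrupts one replica once is a hypothetical fault injection mechanism.

```python
import copy


def run_lockstep(step, initial_state, n_steps, checkpoint_every=4, fault=None):
    """Run two replicas in lockstep and roll back to the last checkpoint on mismatch.

    `step` must return a new state without mutating its argument.
    `fault` is an optional (step_index, corrupt_fn) pair applied once to
    replica B, emulating a transient that does not recur on re-execution.
    Returns the final state and the number of rollbacks performed.
    """
    a = copy.deepcopy(initial_state)
    b = copy.deepcopy(initial_state)
    checkpoint, checkpoint_step = copy.deepcopy(initial_state), 0
    pending_fault = fault
    rollbacks = 0
    i = 0
    while i < n_steps:
        a = step(a)
        b = step(b)
        if pending_fault is not None and pending_fault[0] == i:
            b = pending_fault[1](b)
            pending_fault = None
        i += 1
        if a != b:
            # Mismatch detected: restore both replicas and re-execute.
            a = copy.deepcopy(checkpoint)
            b = copy.deepcopy(checkpoint)
            i = checkpoint_step
            rollbacks += 1
        elif i % checkpoint_every == 0:
            # States agree: record a new known-good checkpoint.
            checkpoint, checkpoint_step = copy.deepcopy(a), i
    return a, rollbacks


def example_step(state):
    # Hypothetical workload: a simple integer recurrence.
    return (state * 3 + 1) % 1000003


reference, _ = run_lockstep(example_step, 7, 10)
result, rollbacks = run_lockstep(
    example_step, 7, 10, fault=(5, lambda s: s ^ 1)
)
```

In this model the cost of each checkpoint is a full copy of the state, so the checkpoint interval trades the time spent saving state against the amount of work repeated after a rollback.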

Despite the studies aiming to deal with the problem at lower abstraction levels, our preferred alternative for dealing with the effects of LDTs, as proposed in Lisboa (ETS 2007), is to work at the algorithm level or system level. With this in mind, our efforts have been concentrated on the search for low cost alternatives to accomplish this goal.

Starting with the matrix multiplication algorithm, for which a verification technique with single element recomputation at extremely low cost has been devised, the work has continued with the use of software invariants in the runtime detection of soft errors. In both cases, the low cost goal has been achieved, and the proposed solutions seem to be good candidates for inclusion, together with future ones, in the hardening of complete systems against radiation induced long duration transient faults.
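To give a flavor of verification with single element recomputation, the sketch below uses classical row and column checksums to locate and recompute one corrupted element of a product matrix. This is a well known checksum scheme substituted here for illustration and is not necessarily the verification technique devised in this work; the function names are hypothetical, and integer matrices are assumed so that the comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def check_and_repair(a, b, c):
    """Verify c == a * b through checksums and repair one faulty element in place.

    Returns the (row, column) of the repaired element, or None when the
    checksums agree. The checksums cost O(n^2) operations, against O(n^3)
    for recomputing the whole product.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    # Expected row sums of C, computed as A * (B * 1).
    b_row_sums = [sum(row) for row in b]
    expected_rows = [
        sum(a[i][k] * b_row_sums[k] for k in range(inner)) for i in range(rows)
    ]
    # Expected column sums of C, computed as (1^T * A) * B.
    a_col_sums = [sum(a[i][k] for i in range(rows)) for k in range(inner)]
    expected_cols = [
        sum(a_col_sums[k] * b[k][j] for k in range(inner)) for j in range(cols)
    ]
    bad_rows = [i for i in range(rows) if sum(c[i]) != expected_rows[i]]
    bad_cols = [
        j for j in range(cols) if sum(c[i][j] for i in range(rows)) != expected_cols[j]
    ]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one element is corrupted")
    i, j = bad_rows[0], bad_cols[0]
    # Recompute only the element at the intersection of the bad row and column.
    c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
c[1][0] = 99  # emulate a soft error in one element of the product
repaired_at = check_and_repair(a, b, c)
```

Only the element at the intersection of the mismatching row and column is recomputed, so the correction cost is a single dot product.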

In the following chapters, the main results of our research work during the thesis development are presented and discussed.
