1 Grupo de Arquiteturas e Circuitos Integrados (GACI) – DINFO

94 XXIII SIM - South Symposium on Microelectronics

and the BGP, which computes the carry-in bits for the next adder blocks and the final carry-out, when the n-bit CLA basic block is instantiated to build the higher order CLA.

In the Carry-Select Adder (CSA) the addition is divided into a given number of sections that are computed in parallel. This speeds up significantly the carry propagation, resulting in high operation speed. Each section performs two additions at the same time: one considering a ‘0’ as carry-in and another considering a ‘1’ as carry-in. This way, the carry chain propagation is broken to be accelerated. After all the addition sections finish their operation, for each section the correct addition value is selected based on the resulting carry-out of the previous section.

The drawback of the CSA relies on its high cost in terms of hardware resources. Resource overhead comes mainly from the need for using two adders in each section, in order to allow the parallel computation of the additions. The high cost of CSA may prevent its use in several applications where hardware resources are limited. However, its redundancy may be explored in order to derive fault-tolerant adder versions.

In a previous work we have developed a new adder architecture inspired in the CSA that was named Re-computing the Inverse Carry-in Adder (RIC) [MES 07]. Although presenting the same philosophy of a CSA, the RIC adder demands less hardware because one of the two adders used in one adder section is replaced by a less expensive combinational block, called Re-computing Block. As shown in Figure 1, in each section there is one adder that generates the result assuming that the carry-in is equal to ‘0’. The Re-computing Block (RB in this figure) receives this result R and converts it to R+1. This re-computation is based on the exploration of some binary addition properties, as proposed by Kumar and Lala [KUM 03].

Fig. 1 – An 8-bit RIC adder is composed by two sections with one RCA and one Re-Computing Block.

3. Applying Fault-Tolerance Techniques to Adders

Among the existing soft error protection techniques, the Triple Modular Redundancy (TMR) is the simplest and most used one [CAR 01]. It also serves as reference, since its overhead in terms of hardware resources is known to be near 200% (masked IC implementations). The high resource overhead comes from the fact that TMR requires the use of three exemplars of the block to be protected plus a voter circuitry. The outputs of these three blocks are connected to a voter that decides, by means of a majority election, which is the correct result.

The voter is the solely block susceptible to error. If an error occurs in the voter, it may take a wrong decision.

The RIC adder protected with TMR demands only three adders per section, in contrast to the CSA adder protected with TMR that needs six adders per section. Note also that, as in the unprotected version, the delay of the RIC adder is expected to be higher than the delay of the CSA.

In the Temporal Redundancy (TR) technique the output of the block to be protected is sampled in three different moments (clk, clk+d and clk+2d). The time difference between two consecutive resulting samples must guarantee the disappearance of the fault, if it occurs. To implement this technique it is necessary to triplicate the storage elements in order to store the three samples that are used as inputs to the majority voter. In this technique the hardware overhead is smaller than in the TMR case. However, the TR version is more complex because it requires different clocks for the three storage elements. Thus, this technique presents disadvantages concerning the time necessary to perform the whole computation, which is approximately equal to clk+2d+tp, where tp is the voter delay and d is the time required to perform each computation [LIM 03].

The Duplication With Comparison + Time Redundancy technique (DWC+TR) is an alternative to TMR. It explores time and hardware redundancy in an attempt to reduce the hardware overhead. Instead of using three instances of the block to be protected, as in the TMR case, it demands only two instances (CLA0 and CLA1 in Fig. 2). The output of each instance is latched into two registers, being one triggered at instant clk and the other, at instant clk+d. Hence, four samples of the block output value are available, two per block instance. These four samples are fed to the error detection block, which is responsible for verifying if an error has occurred in one of the two instances. If that is the case, the error detector delivers the right value to be used as the third input of the majority voter. The voter also receives the output of each block instance. Its output is stored in a register at instant clk+2d. It is worth to notice that the amount of time “d”, which is the difference between two clock edges, must be long enough to allow a transient pulse to propagate through the logic that is placed between registers. Otherwise, the probability of sampling an error becomes higher.

RCA 0

0 1

P₀ A_0...3 B_0...3

4 C₀ 4

S_0...3

propagate

RCA 0

0 1

P₁ A_4...7 B_4...7

4 C₁ 4

S_4...7

C_in RCA

0 1

P₀ A_0...3 B_0...3

4 C₀ 4

S_0...3

propagate

RCA 0

0 1

P₁ A_4...7 B_4...7

4 C₁ 4

S_4...7

C_in

Fig. 2 – 8-bit CLA protected with DWC+TR.

4. Experiments and Synthesis Results

In this work we have described in VHDL unprotected and protected versions of the adders for six different data lengths, from 4 to 128 bits. To serve as a reference for the comparison, we have also described in VHDL protected and unprotected versions of the Ripple-Carry Adder (RCA) with the same data lengths. For each of the four adder architectures (RCA, CLA, CSA and RIC) three protected versions were designed: TMR, TR and DWC+TR.

All adder versions were synthesized for the EP3SE50F484C2 device (from Stratix III FPGA family) and validated through functional simulation with delay using Quartus II software from Altera (version 7.2 SP2).

Table 1 shows the number of ALUTs reported by Quartus II for each adder version and the critical delay of each version, according to the Altera Classic Timing Analyzer.

Tab.1 – Number of ALUTs and critical delay for each adder version synthesized and simulated for Stratix III.

Adder Technique

Number of ALUTs Topological critical delay (ns) 4 bits 8 bits 16

bits 32 bits

64 bits

128

bits 4 bits 8 bits 16 bits

32 bits

64 bits

128 bits RCA

unprotected 33 62 120 236 468 932 1.65 2.71 5.3 9.59 18.87 37.68 TMR 87 173 337 665 1321 2633 2.62 4.03 6.96 12.93 23.95 46.62

TR 73 138 260 571 1123 2227 10.39 14.68 24.36 47.92 93.62 184.7 DWC+TR 150 280 530 1030 2031 4288 9.73 14.08 21.41 38.72 71.97 149.8

CLA

unprotected 63 120 230 468 924 1916 3.59 3.98 4.26 5.07 6.34 6.96 TMR 196 327 637 1259 2512 5089 4.47 5.48 5.95 6.66 8.23 9.19 TR 129 231 433 838 1646 3343 12.25 20.94 19.85 23.02 28.93 31.32 DWC+TR 237 436 826 1616 3178 6450 13.59 13.31 18.02 21.62 25.09 27.69

CSA

unprotected 65 142 280 556 1110 2215 2.39 3.05 3.53 4.31 4.38 6.4 TMR 184 371 733 1465 2915 5813 3.18 3.96 4.9 5.44 6.29 8.46

TR 105 200 384 823 1619 3220 10.64 14.17 15.2 22.26 27.12 29.88 DWC+TR 182 402 776 1532 3090 6145 10.38 11.05 14.98 15.96 19.58 24.77

RIC

unprotected 55 118 232 460 918 1831 2.42 3.09 3.66 3.91 4.38 5.49 TMR 160 323 637 1273 2530 5044 3.7 4.01 4.46 5.31 6.26 7.99

TR 97 184 352 758 1490 2964 11.44 13.4 14.68 17.48 24.2 27.99 DWC+TR 166 362 696 1370 2770 5504 11.72 11.57 12.57 14.49 18.24 23.39 From the data of table 1 it is possible to conclude that the protection with TMR results in resource overheads ranging from 161.3% to 211.1%, being 175% the average overhead. In theory, the increase of TMR should be higher than 200% for full custom (masked) implementations. However, the synthesis algorithms and the granularity of FPGA programmable elements may influence this overhead.

Also from the first part of Table 1 it is possible to compute the resource overhead needed to protect adders with TR. It ranges from 37.1% up to 141.9%, with an average of 81.3%. However, a careful examination of resource overheads allows us to conclude that the impact of the extra registers (and voter) to protect with TR is higher for the RCA and for the CLA, with these overheads ranging from 116.7% to 141.9% and from 74.5% to 104.8%, respectively. This is because unprotected CSA and CLA adders require less ALUTs to be implemented than unprotected CSA or RIC adders. Therefore, the impact of the extra resources is not so prominent for CSA and RIC adders.

The resource overhead needed to protect the adders with DWC+TR ranges from 175.5% to 354.5%, with an average overhead of 245.1%. These results are in conflict with the original purpose of DWC+TR that is to reduce resource overhead, as claimed by some authors (see [LIM 03], for example). However, it is necessary to

S_(8...0)

Error Detection

V O T E R clk

Reg 0 9

9 9

clk+d Reg

clk Reg

2 9

clk+d Reg

3 9

0 1 9

A_0...7 B_0...7 C_in

CLA0

CLA1

S_(8...0)

Error Detection

V O T E R clk

Reg 0 9

9 9

clk+d Reg

clk Reg

2 9

clk+d Reg

3 9

0 1 9

A_0...7 B_0...7 C_in

CLA0

CLA1

96 XXIII SIM - South Symposium on Microelectronics

remark that our implementation of DWC+TR is quite conservative, in the sense that we use four registers, while other authors use only two.

The examination of the second part of Table 1 leads us to conclude that the protection of adders with TMR resulted in increases of critical delay ranging from 21.9% up to 58.8%. In the average, TMR results in an increase of 35.5% of the critical delay. On the other hand, the use of TR resulted in an increase of delay that ranges from 241.2% to 529.7%, with an average increase of 382.1%. These results were expected, since the implemented TR needs four clock cycles: three to sample the block output and one to vote the correct result.

The increase in critical delay resulting from the use of DWC+TR ranges from 234.4% to 489.7%, with an average of 312.2%.

Comparing the number of ALUTs and critical delays of the fast adders (CLA, CSA and RIC) protected with TMR, one may note that the CSA and the RIC present similar values. Particularly, for data lengths between 16 and 128 bits the RIC is faster or as fast as the CLA, at a lower or comparable cost.

5. Conclusions and Future Work

In order to evaluate the robustness of the proposed protection scheme against single-event transients (SETs), fault injection campaigns were performed for all 4-bit adders by means of logic-level simulation using Mentor Graphics ModelSim simulator. The voter was not considered in the robustness analysis because it was known beforehand that it is susceptible to faults. Also, we have limited the evaluation to pulses with 100 picoseconds, since it appears as a typical SET duration for the 65nm CMOS technology, as the one used in the fabrication of Stratix III devices [DHI 06].

The fault injection and simulation procedure showed that all the protection techniques used are efficient in protecting 4-bit adders against SETs. Since higher order adders are built up from such modules, it is expected that also higher order adders can be efficiently protected.

The data obtained from synthesis for Altera Stratix III FPGAs have shown that the TR technique required less extra hardware resources to implement protected adders, while the DWC+TR was the technique that has demanded more hardware. The functional simulation results have showed that adders protected with TR have the poorest performance, since the average critical delay increase was the highest: 382.1%. On the other hand, the adders protected with TMR presented an average increase in critical delay of only 35.5%, thus corresponding to the lowest increase among the three protection techniques.

As future work we intend to re-run these experiments for FPGA devices from other companies. It is also intended to compare full custom nanometer CMOS implementations of protected and unprotected fast adders.

6. References

[BAU 05] BAUMANN, R. C. Radiation-Induced Soft Errors in Advanced Semiconductor Technologies.

In: IEEE Trans. on Devices and Materials Reliability, v.5, n.3, pp.305-316, Sep. 2005.

[CAR 01] CARMICHAEL, C. Triple Module Redundancy Design Techniques for Virtex FPGA. In:

Xilinx Application Notes 197, San Jose, USA, 2001.

[COH 99] COHEN, N. et al.. Soft Error Considerations for Deep-Submicron CMOS Circuit Applications. In: Internacional Electron Devices Meeting, pp. 315-318, July 1999.

[DHI 06] DHILLON, Y. S. et al. Analysis and Optimization of Nanometer CMOS Circuits for Soft-Error Tolerance. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v.14, n.5, p.514-524, 2006.

[HWA 79] HWANG, K. Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 423 p., 1979.

[KUM 03] KUMAR, B. On-line Detection of Faults in Carry-Select Adders. In: International Test Conference, pp. 912, 2003.

[LIM 03] LIMA, F. et al. Designing Fault Tolerant Systems into SRAM-based FPGAs. In: International Design Automation Conference, DAC, 2003, Proc. New York: ACM, 2003, pp. 650 – 655.

[MES 82] MESSENGER, C. Collection of Charge on Junction Nodes from Ion Tracks. In: IEEE Transactions on Nuclear Science, v. NS-29, pp. 2024-2031, Dec. 1982.

[MES 07] MESQUITA, E. et al. RIC Fast Adder and its SET Tolerant Implementation in FPGAs. In:

17th International Conference on Field Programmable Logic and Applications, Amsterdam, Netherlands, 2007.

[SHI 02] SHIVAKUMAR, P. et al.. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In: International Conference On Dependable Systems And Networks, pp.

389 – 398, 2002.

No documento SIM 2008 23 (páginas 91-95)

1 Grupo de Arquiteturas e Circuitos Integrados (GACI) – DINFO – UFPel

3. Applying Fault-Tolerance Techniques to Adders

4. Experiments and Synthesis Results

5. Conclusions and Future Work

6. References