Multi-layer error resilience - Développement d’Architectures HW / SW Tolérantes aux Fautes et A

Interleaving SEC codes leads to larger flits (i.e. more error check bits are needed to protect the data), which has a major impact on the area / power overheads of NL-FEC protected routers. With two and four interleaved data groups, the number of error check bits increases from 6 to 10 and 16 for 32-bit networks, and from 7 to 12 and 20 for 64-bit networks. Thus, the number of intra-die / inter-die wires increases so that data and error check bits can be transmitted in parallel. Compared to link-level FEC, NL-FEC is more expensive when interleaved codes are used (i.e. LL-FEC area / power overheads are up to 28% / 40%).
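The check-bit counts quoted above follow from the standard Hamming SEC bound, 2^r >= k + r + 1 for k data bits and r check bits. The short sketch below recomputes them, assuming the flit is split into equal groups that are encoded independently; the function names are illustrative and do not come from the original work.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (Hamming SEC bound)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def interleaved_check_bits(flit_bits: int, groups: int) -> int:
    """Total check bits when the flit is split into equal, independently coded groups."""
    if flit_bits % groups != 0:
        raise ValueError("flit width must be divisible by the number of groups")
    return groups * sec_check_bits(flit_bits // groups)


for width in (32, 64):
    # One, two and four interleaved groups, as in the text.
    print(width, [interleaved_check_bits(width, g) for g in (1, 2, 4)])
# 32 [6, 10, 16]
# 64 [7, 12, 20]
```

The relative check-bit overhead is therefore 6/32 for a single 32-bit group and 7/64 for a single 64-bit group, and it grows with the number of interleaved groups.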

In the 65 nm process, flit encoding and error detection / correction can be performed within one clock cycle for 32-bit and 64-bit NL-FEC. Hence, there is no impact on network performance. This result, together with the area / power overheads and the reliability assessments above, shows that NL-FEC is more efficient when network performance is critical and multiple transients do not accumulate along the path.

Figure V-15 Comparison of link-level and network-level FEC for 32-bit routers

In both 32-bit and 64-bit networks, the path reliability of NL-FEC with intermediate correction stages is higher than in the cases where simple NL-FEC schemes are used (i.e. correction only at the destination node, NL-FEC 1). Although path reliability targets above 99.75% can be achieved without intermediate correction stages, adding correction stages significantly improves path reliability.

The reliability improvement observed for NL-FEC with correction stages is due to the fact that faults are less likely to accumulate over shorter path sections. Hence, the reliability increases as the path sections between correction stages become shorter. For example, reliability targets above 99.999% are achieved when intermediate correction stages are placed every two hops (i.e. NL-FEC 6).
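The effect of section length can be illustrated with a simple model. This is only a sketch under assumptions that are not taken from the original work: every bit of the codeword is assumed to flip independently with the same probability on each hop, a section is counted as successful when at most one bit is wrong at its end (single error correction), and the 12-hop path and the error probability are hypothetical values. The 38-bit codeword corresponds to 32 data bits plus 6 check bits.

```python
def section_reliability(codeword_bits: int, hops: int, p_bit_hop: float) -> float:
    """Probability that at most one bit is wrong after `hops` uncorrected hops."""
    # A bit is wrong at the end of the section if it flipped an odd number of times.
    q = (1.0 - (1.0 - 2.0 * p_bit_hop) ** hops) / 2.0
    no_error = (1.0 - q) ** codeword_bits
    one_error = codeword_bits * q * (1.0 - q) ** (codeword_bits - 1)
    return no_error + one_error


def path_reliability(codeword_bits: int, section_hops: list[int], p_bit_hop: float) -> float:
    """Product of section reliabilities; errors are corrected at the end of each section."""
    result = 1.0
    for hops in section_hops:
        result *= section_reliability(codeword_bits, hops, p_bit_hop)
    return result


P = 1e-4  # hypothetical per-bit, per-hop flip probability
for sections in ([12], [6, 6], [3, 3, 3, 3], [2] * 6):
    print(len(sections), "section(s):", path_reliability(38, sections, P))
```

In this model the failure probability of a section grows roughly with the square of its length, so splitting the same path into more sections lowers the overall failure probability, which matches the trend described above.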

NL-FEC with interleaved Hamming SEC codes improves path reliability. In the examples above, the path reliability of NL-FEC×4 is similar to that of NL-FEC with four intermediate retiming stages (i.e. correction every three hops) for 32-bit networks. In some cases, SEC interleaving provides better path reliability than SEC with correction stages, as double errors are more likely to accumulate along path sections than multiple uncorrectable errors along the entire path.

5.2.3.2 Area and power overheads

In 3D NoCs with NL-FEC, correction stages can be inserted such that faults can accumulate on path sections no longer than pMAX. Hence, up to six ports (i.e. NORTH, SOUTH, EAST, WEST, UP and DOWN) of the seven-port router may have correction stages. In the 65 nm process, a retiming stage is added before each correction stage in order to maintain the 1 GHz network clock frequency. In this case, flit transmission on the PHY and error detection / correction are performed in two clock cycles. The insertion of the retiming stage raises a data loss problem: by the time the buffer FULL signal of the receiver router arrives at the transmitter router, a flit could already be in the correction retiming stage. Therefore, in order to avoid data loss, an extra position is added to each input buffer that has a correction stage. TABLE IV summarizes the area and power overheads for the seven-port router with up to six correction stages.

The area / power overheads increase with the number of error correction stages. The area overheads are due to the larger flit size, the correction logic and the protected input buffers with an extra position. Compared to NL-FEC, protecting a single port increases the area overheads by 10%, while the power overheads almost double. Although the encoding / decoding complexity increases with data size, the overheads for 64-bit flits are slightly lower than for 32-bit flits, as the relative number of error correction bits is 7/64 instead of 6/32. The correction complexity outweighs the buffer size when the number of correction stages increases. In this case, the area and power overheads exceed 40%.

TABLE IV AREA / POWER OVERHEADS [%] OF MULTI-LAYER ERROR RESILIENCE

Error correction scheme    32-bit                  64-bit
                           Area (%)   Power (%)    Area (%)   Power (%)
SEC with 1 port            27.64      23.19        26.75      17.83
SEC with 2 ports           32.43      29.01        31.83      23.81
SEC with 3 ports           34.54      30.25        37.77      29.57
SEC with 4 ports           40.15      39.79        43.02      35.28
SEC with 5 ports           45.65      46.3         50.86      42.12
SEC with 6 ports           48.02      48.84        53.97      45.55

In 3D NoCs, it is assumed that correction stages are added only for inter-die links, so only two ports have error detection and correction stages. Hence, intra-die and inter-die errors will not propagate between layers, as errors are corrected before the incoming flits are stored in the router input FIFO. Compared to the link-level approach (see all links protection in TABLE II), the area overheads of the multi-layer solution are ~5% larger, but it dissipates ~10% less power. Although the network-level solution with interleaved SEC codes can ensure similar reliability levels, the multi-layer solution is more efficient than NL-FEC with two interleaved groups for 32-bit links. However, the complex Hamming SEC encoding / decoding logic and the buffer protection against flit drop make the multi-layer solution less efficient for 64-bit links. Moreover, interleaving SEC codes also requires more TSVs to transmit the error check bits.

5.2.3.3 Impact on network latency

In the multi-layer scheme, adding the correction stages has a negative impact on network latency, as one clock cycle is lost every time a flit passes through an error detection and correction stage. Because error detection and correction are performed in a single cycle, the network latency is not affected by the decoder at the destination router.

Moreover, the link latency does not vary with the error rate, as error detection and correction are always performed in a single cycle.

Let us consider 3D NoCs with intermediate correction stages inserted for all links and for inter-die links, respectively. In Figure V-16, the impact of correction stages on the network latency is represented for different 3D mesh topologies for uniform random traffic.

[Bar chart: latency overhead [%] (axis from 0 to 18) versus 3D mesh topology (4x4x2, 4x4x3, 4x4x4, 3x3x2, 5x5x2); series: inter-die links, all links]

Figure V-16 Impact of intermediate correction stages on network latency

Adding intermediate correction stages increases the link latency by one clock cycle. Hence, the average network latency increases by more than 12% when all links have correction capabilities. The variations in network latency overhead with the topology are explained by the average path lengths. Flits make on average 2.41 hops for the 3×3×2 topology and 3.78 hops for the 4×4×4 topology. Hence, the performance penalty is higher for networks where packets traverse more links with correction stages.
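The dependence on average path length can be checked with a small enumeration. The sketch below assumes minimal (Manhattan-distance) routing and uniform random traffic over all source / destination pairs with distinct nodes; the function name is illustrative. It reproduces the 2.41 hops quoted for the 3×3×2 mesh. For the larger meshes this idealized count is close to, but need not exactly match, the values reported in the text.

```python
from itertools import product


def average_hops(dims: tuple[int, ...]) -> float:
    """Mean Manhattan distance over all ordered pairs of distinct mesh nodes."""
    nodes = list(product(*(range(d) for d in dims)))
    total_hops = 0
    pairs = 0
    for src in nodes:
        for dst in nodes:
            if src != dst:
                total_hops += sum(abs(a - b) for a, b in zip(src, dst))
                pairs += 1
    return total_hops / pairs


for dims in ((3, 3, 2), (4, 4, 2), (4, 4, 3), (4, 4, 4), (5, 5, 2)):
    print(dims, round(average_hops(dims), 2))
```

Since every hop over a protected link adds one clock cycle, a larger average hop count translates directly into a larger latency penalty when all links have correction stages.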

Including error correction only for inter-die links reduces the latency overheads to less than 5%. Unlike the previous case, where all links have correction capabilities, the variations in latency overhead are explained by the average number of inter-die hops, as only paths using such links are penalized. The 5×5×2 topology has a relatively high average hop count (i.e. 3.79), but only half of its paths include inter-die links. Therefore, when only inter-die links are protected, the latency overhead is reduced from 16% to ~2%. As the number of inter-die links per path increases, so does the latency overhead. In the 4×4×4 topology, ~75% of paths are inter-die and the latency overhead is ~5%.
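The share of paths that cross at least one inter-die link can be estimated by counting source / destination pairs located on different layers. The sketch below assumes uniform random traffic between distinct nodes and treats the third dimension as the die stack; the function name is illustrative.

```python
def inter_die_fraction(dims: tuple[int, int, int]) -> float:
    """Fraction of ordered pairs of distinct nodes that sit on different layers."""
    x, y, z = dims
    nodes = x * y * z
    nodes_per_layer = x * y
    # For any source, (nodes - nodes_per_layer) destinations are on another layer,
    # out of (nodes - 1) possible destinations.
    return (nodes - nodes_per_layer) / (nodes - 1)


print(round(inter_die_fraction((5, 5, 2)), 2))  # 0.51
print(round(inter_die_fraction((4, 4, 4)), 2))  # 0.76
```

These fractions agree with the "half of its paths" and "~75% of paths" figures given above for the 5×5×2 and 4×4×4 topologies.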

From the document Développement d’Architectures HW / SW Tolérantes aux Fautes et Auto-calibrantes pour les Technologies Intégrées 3D (pages 113-116)