146 XXIII SIM - South Symposium on Microelectronics
the wasted time in the intra prediction module, since it must wait for new and valid references. The architecture proposed in this paper contributes to this goal, since it can process a complete block of 2x2 samples at each clock cycle.
The 2-D 2x2 Hadamard defined in the H.264/AVC standard is presented in Equation (1), where WD represents the four DC coefficients of the matrices resulting from the 2-D FDCT. The same calculation is applied in both the forward and the inverse 2x2 Hadamard, so the same architecture can be used in the forward and inverse transform modules.
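In matrix form, the 2x2 Hadamard of H.264/AVC is commonly written as the product below. The element names W0 to W3 and S0 to S3, and their row-by-row placement in the matrices, are assumptions chosen here to match the notation of tab. 1; the standard uses its own symbols.

```latex
\begin{bmatrix} S_0 & S_1 \\ S_2 & S_3 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} W_0 & W_1 \\ W_2 & W_3 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
```

Expanding the product gives S0 = W0 + W1 + W2 + W3, S1 = W0 - W1 + W2 - W3, S2 = W0 + W1 - W2 - W3 and S3 = W0 - W1 - W2 + W3.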
The architecture was designed based on an algorithm obtained from Equation (1). This algorithm is presented in tab. 1, where a0 to a3 are internal variables used in the implementation of the parallel architecture, W0 to W3 are the inputs to be processed by the transform, and S0 to S3 are the 2x2 Hadamard transformed coefficients.
Tab. 1 - 2-D 2x2 Hadamard algorithm.
a0 = W0 + W2    S0 = a0 + a2
a1 = W0 - W2    S1 = a0 - a2
a2 = W3 + W1    S2 = a1 - a3
a3 = W3 - W1    S3 = a1 + a3
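The two steps of tab. 1 can be sketched in software as follows. This is a minimal Python model, not the VHDL description of the architecture; the function names and the row-by-row arrangement of W0 to W3 in the 2x2 input matrix are assumptions made for illustration. The second function evaluates the matrix product directly so that the two can be compared.

```python
def hadamard2x2(w0, w1, w2, w3):
    """Two-step 2x2 Hadamard following tab. 1 (hypothetical helper name)."""
    # Step 1: the four internal variables
    a0 = w0 + w2
    a1 = w0 - w2
    a2 = w3 + w1
    a3 = w3 - w1
    # Step 2: the four transformed coefficients
    s0 = a0 + a2
    s1 = a0 - a2
    s2 = a1 - a3
    s3 = a1 + a3
    return s0, s1, s2, s3


def hadamard2x2_matrix(w0, w1, w2, w3):
    """Direct evaluation of H * W * H with H = [[1, 1], [1, -1]].

    Assumes W = [[w0, w1], [w2, w3]], i.e. row-by-row ordering (an assumption).
    """
    h = [[1, 1], [1, -1]]
    w = [[w0, w1], [w2, w3]]
    hw = [[sum(h[i][k] * w[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    y = [[sum(hw[i][k] * h[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return y[0][0], y[0][1], y[1][0], y[1][1]


print(hadamard2x2(1, 2, 3, 4))         # (10, -2, -4, 0)
print(hadamard2x2_matrix(1, 2, 3, 4))  # (10, -2, -4, 0)
```

Each output needs only additions and subtractions of previously computed values, which is what makes the split into two steps possible.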
The algorithm presented in tab. 1 was defined with its future hardware implementation in mind. It was divided into two independent steps to allow a pipelined design with consumption and production rates of four samples per clock cycle. This option allows parallel processing and a high operating frequency, owing to the simplicity of the operations in each pipeline stage.
The block diagram of the designed architecture is shown in fig. 1. The architecture was designed as a two-stage pipeline with four operators per stage, so it has a latency of two clock cycles.
Fig. 1 – 2-D 2x2 Hadamard parallel architecture.
In fig. 1, the Wi inputs have a bit-width of 8 bits and the Si outputs have a bit-width of 10 bits. The increase in dynamic range is needed to avoid errors caused by overflow in the additions and subtractions, since each of the two stages can add one bit to the width of its operands.
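The output width can be checked with a short calculation. The sketch below assumes that the 8-bit inputs are signed two's-complement values, which the text does not state explicitly; the sign patterns are those obtained by expanding tab. 1, and all names are hypothetical.

```python
from itertools import product

# Signs of (W0, W1, W2, W3) in each output S0..S3, obtained by expanding tab. 1
SIGNS = [(1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1)]


def output_range(in_bits=8):
    """Extreme output values for signed two's-complement inputs (an assumption)."""
    lo = -(1 << (in_bits - 1))
    hi = (1 << (in_bits - 1)) - 1
    values = [
        sum(s * w for s, w in zip(signs, ws))
        for signs in SIGNS
        for ws in product((lo, hi), repeat=4)  # extremes occur at corner inputs
    ]
    return min(values), max(values)


def signed_bits_needed(vmin, vmax):
    """Smallest two's-complement width that holds both vmin and vmax."""
    bits = 1
    while not (-(1 << (bits - 1)) <= vmin and vmax <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


vmin, vmax = output_range(8)
print(vmin, vmax, signed_bits_needed(vmin, vmax))  # -512 510 10
```

Under this assumption the outputs span -512 to 510, which fits in 10 signed bits but not in 9.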
To allow the pipeline implementation, time barriers (registers) were placed between the two stages. Each operator is dedicated to a single type of operation (addition or subtraction), which reduces the complexity of each operator and of the control unit. On the other hand, the parallelism increases the number of components used in the design. The operators were described as macro function adders. This type of description allows the synthesis tool to map the adders' carry propagation paths to dedicated hardware present inside the target FPGA, improving the adders' performance.
The parallel processing solution for the 2-D 2x2 Hadamard dramatically reduces the complexity of the control block, because the only function of the control, in this case, is to verify whether the data available at the inputs are valid and to indicate when the outputs contain valid values. This control is simple to design.
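The behavior of the two-stage pipeline and its valid-data control can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the VHDL design: the class and method names are hypothetical, the outputs are assumed to be registered, and `None` stands for "data not valid".

```python
class Hadamard2x2Pipeline:
    """Cycle-level model of a two-stage pipeline with a valid flag."""

    def __init__(self):
        self.stage1 = None  # barrier registers: (a0, a1, a2, a3) or None
        self.stage2 = None  # output registers: (s0, s1, s2, s3) or None

    def clock(self, w=None):
        """Advance one clock edge.

        `w` is (w0, w1, w2, w3), or None when the input is not valid.
        Returns the output registers after the edge (None means not valid).
        """
        # Second stage uses what the barrier registers held before this edge
        if self.stage1 is None:
            new_stage2 = None
        else:
            a0, a1, a2, a3 = self.stage1
            new_stage2 = (a0 + a2, a0 - a2, a1 - a3, a1 + a3)
        # First stage uses the current inputs
        if w is None:
            new_stage1 = None
        else:
            w0, w1, w2, w3 = w
            new_stage1 = (w0 + w2, w0 - w2, w3 + w1, w3 - w1)
        self.stage1, self.stage2 = new_stage1, new_stage2
        return self.stage2


pipe = Hadamard2x2Pipeline()
outputs = [pipe.clock(w) for w in [(1, 2, 3, 4), (5, 6, 7, 8), None, None]]
print(outputs)  # [None, (10, -2, -4, 0), (26, -2, -4, 0), None]
```

In this model the result of a block is visible after the second clock edge following its arrival, and one complete block leaves the pipeline per cycle while valid inputs keep arriving.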
The architecture was validated through post place-and-route simulation using the tools provided by Altera and Xilinx. After that, the architecture was also prototyped in a Xilinx Virtex II Pro FPGA. The validation and prototyping process is presented in section 4.
3. Synthesis Results
The 2-D 2x2 Hadamard transform architecture was described in VHDL. The synthesis targeted several Altera and Xilinx FPGAs. The Quartus II tool, provided by Altera, and the ISE tool, provided by Xilinx, were used to generate the results.
Tab. 2 presents synthesis results for this architecture when mapped to Altera FPGAs. The selected FPGAs were Cyclone II EP2C35F672C6, Stratix EP1S110F672C6 and Stratix II EP2S15F672C3. Tab. 3 shows the synthesis results of the 2-D Hadamard parallel architecture for Xilinx FPGA devices. The target FPGAs were Virtex 2P XC2VP2, Virtex 4 XC4VLX15, Virtex 5 XC5VLX30, Spartan 2E XC2S50E and Spartan 3E XC3S100E.
Tab. 2 - Synthesis results for Altera FPGAs
Tab. 3 - Synthesis results for Xilinx FPGAs
Among the selected Altera FPGAs, Cyclone II presented the lowest operating frequency, with a maximum of 420.17 MHz. Stratix II reached the highest operating frequency (500 MHz). In this case, the architecture reaches the maximum operating frequency allowed by the device, which shows the efficiency of the designed architecture.
Among the analyzed Xilinx FPGAs, Spartan 2E presented the lowest operating frequency, with 225.07 MHz. The highest operating frequency was reached with the Virtex 5, at 655.86 MHz.
Since the architecture processes four samples per clock cycle, the throughput for each target device can be calculated from the frequencies presented in tab. 2 and tab. 3. The results presented in tab. 4 show that this architecture easily reaches real time (30 frames per second) on all target devices, even when processing very high resolution video such as QHDTV (3840x2048 pixels) with 4:2:0 color sampling, as defined in the H.264/AVC main profile. These results show that the architecture meets its high-throughput goal, allowing its use in a hardware design that takes the performance restrictions of intra-frame prediction into account. The last column of tab. 4 shows the minimum frequency required to reach real time when QHDTV frames are processed.
Tab. 4 – Processing rates for Altera and Xilinx FPGAs
Device       Max. Throughput (Msamples/s)   QHDTV Frames/s (@ Max. Freq.)   Min. Freq. for QHDTV (MHz)
Cyclone II   1,680.7                        6,838.79                        1.84
Stratix      1,687.8                        6,867.68                        1.84
Stratix II   2,000.0                        8,138.02                        1.84
Spartan 2E   900.3                          3,663.33                        1.84
Spartan 3E   1,162.5                        4,730.22                        1.84
Virtex 2P    1,296.2                        5,274.25                        1.84
Virtex 4     2,342.2                        9,530.44                        1.84
Virtex 5     2,623.5                        10,675.05                       1.84
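The arithmetic behind tab. 4 can be sketched as below. The sketch assumes that the 2x2 Hadamard processes only the chroma DC coefficients, so that in 4:2:0 video each 16x16 macroblock contributes two 2x2 blocks (8 samples); this assumption is not spelled out in the text, but it reproduces the values in the table. Function names are hypothetical.

```python
SAMPLES_PER_CLOCK = 4  # one complete 2x2 block per clock cycle


def samples_per_frame(width=3840, height=2048):
    """Samples sent to the 2x2 Hadamard per frame (assumes 8 per macroblock)."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks * 2 * 4  # two chroma components, four DC coefficients each


def max_throughput_msamples(freq_mhz):
    """Maximum throughput in Msamples/s at a given clock frequency in MHz."""
    return freq_mhz * SAMPLES_PER_CLOCK


def frames_per_second(throughput_msamples, width=3840, height=2048):
    """Frame rate sustained at a given throughput."""
    return throughput_msamples * 1e6 / samples_per_frame(width, height)


def min_freq_mhz(fps=30, width=3840, height=2048):
    """Minimum clock frequency (MHz) needed to sustain `fps` frames per second."""
    return samples_per_frame(width, height) * fps / SAMPLES_PER_CLOCK / 1e6


print(samples_per_frame())                    # 245760
print(round(frames_per_second(2000.0), 2))    # 8138.02 (Stratix II row)
print(round(min_freq_mhz(), 2))               # 1.84
```

The minimum frequency depends only on the frame size and frame rate, which is why the last column of tab. 4 is the same for every device.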
It is important to emphasize that the performance of this architecture depends on the continuous availability of four input samples per clock cycle. If these data are not always available, the throughput of the architecture will be reduced.
4. Architecture Prototyping
The first validation step was a C implementation of the 2-D 2x2 Hadamard transform, which was used to generate the input stimuli for the next validation step and allowed a comparison between the software and hardware results. After that, the designed architecture was prototyped on a Digilent XUP-V2P board, which has one Xilinx Virtex II Pro XC2VP30 FPGA. Some embedded memory blocks of the target FPGA were used in this prototyping process. One embedded memory block was used to store the input stimuli, another to store the architecture results, and a third to store the results generated by the software implementation. This third memory was used only for validation purposes. An auxiliary architecture was also described in VHDL to perform the comparison between the results generated by the software and by the designed architecture.
One switch of the prototyping board was used to implement the architecture reset. When the reset switch is changed to zero, the process starts. One LED of the board is turned on when the process finishes. If the results generated by the architecture are equal to the results generated by the software, another LED of the board is turned on to indicate that the architecture generated the expected outputs.
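The comparison flow described above can be modeled in a few lines. The sketch below is only an illustration: the Python reference function stands in for the C implementation, the lists stand in for the embedded memories, the signed 8-bit stimulus range is an assumption, and all names are hypothetical.

```python
import random


def reference_hadamard2x2(w):
    """Software reference model, standing in for the C implementation."""
    w0, w1, w2, w3 = w
    return (w0 + w1 + w2 + w3, w0 - w1 + w2 - w3,
            w0 + w1 - w2 - w3, w0 - w1 - w2 + w3)


def validate(stimuli, device_under_test):
    """Return (done, match), mirroring the two LEDs of the prototype."""
    expected = [reference_hadamard2x2(w) for w in stimuli]  # software results
    produced = [device_under_test(w) for w in stimuli]      # architecture results
    match = produced == expected
    done = True  # reached only after every stimulus has been processed
    return done, match


random.seed(0)
stimuli = [tuple(random.randint(-128, 127) for _ in range(4)) for _ in range(64)]
print(validate(stimuli, reference_hadamard2x2))  # (True, True)
```

A device under test that produces any wrong coefficient leaves `match` false while `done` is still set, which corresponds to only the first LED being turned on.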
Fig. 2 shows the prototype of the 2x2 Hadamard transform architecture. The two LEDs are turned on and are highlighted in fig. 2. This means that the architecture finished processing the input stimuli and that the comparison with the software results was successful. Therefore, the prototype of this architecture was considered validated.
Device (Xilinx)   Freq. (MHz)   Slices / bit slices   Slice Flip-Flops / Registers   LUTs / 4-input LUTs
Spartan 2E        225.07        58 (7%)               70 (4%)                        70 (4%)
Spartan 3E        290.61        58 (6%)               70 (3%)                        70 (3%)
Virtex 2P         324.04        58 (4%)               70 (2%)                        70 (2%)
Virtex 4          575.54        58 (0%)               70 (0%)                        70 (0%)
Virtex 5          655.86        32 (29%)              70 (0%)                        70 (0%)

Device (Altera)   Freq. (MHz)   LUTs         Registers
Cyclone II        420.17        106 (<1%)    66
Stratix           421.94        108 (1%)     -
Stratix II        500           76 (<1%)     66
Fig. 2 – Architecture prototype.
5. Comparisons
This section presents a comparison with a serial design for the 2x2 Hadamard, which consumes and produces one sample per clock cycle. This serial version was previously designed in our research group [AGO 01]. It was also designed as a two-stage pipeline, but only one operator is used in each stage. Tab. 5 presents a comparison between the two architectures, both mapped to a Xilinx Virtex 2P FPGA.
Tab. 5 - Comparison between serial and parallel architectures in Xilinx Virtex 2P FPGAs.
Implementation   Freq. (MHz)   4-input LUTs   Max. Throughput (Msamples/s)
Parallel         324.04        70             1,296.2
Serial           201.7         98             201.7
The first observation on the results presented in tab. 5 concerns hardware resource consumption. The serial version of the architecture consumes more resources than the parallel version. Even though it uses four operators per pipeline stage instead of the single operator used in the serial version, the parallel architecture consumes less hardware than the serial architecture. This is mainly caused by the drastic simplification of the control and of the input and output management.
Another important observation is that the parallel version, even using fewer hardware resources, presents a throughput 6.4 times higher than the serial architecture.
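As a worked check, both observations follow directly from the values in tab. 5. The LUT saving expressed as a fraction is derived here and is not stated in the original comparison.

```latex
\frac{1296.2\ \text{Msamples/s}}{201.7\ \text{Msamples/s}} \approx 6.43,
\qquad
\frac{98 - 70}{98} \approx 0.286
```

That is, the parallel architecture delivers about 6.4 times the throughput while using roughly 29% fewer 4-input LUTs.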
6. Conclusions
This paper presented the hardware design of a parallel architecture for the 2-D 2x2 Hadamard transform.
The architecture was designed as a two-stage pipeline, described in VHDL and synthesized to several Altera and Xilinx FPGA devices.
Among the Altera FPGAs, the device that reached the best performance was the Stratix II, with a maximum operating frequency of 500 MHz, processing about 2 billion samples per second.
Among the Xilinx FPGAs, Virtex 5 reached the highest operating frequency, 655.86 MHz, with a throughput of 2.62 billion samples per second.
The designed architecture was prototyped on a Digilent XUP-V2P board and validated.
The designed architecture was compared with a previous work of our research group, and the architecture presented in this paper reaches lower hardware consumption and a much higher throughput than that previous solution.
The designed architecture, when mapped to any of the considered FPGAs, is able to reach real time even when processing very high resolution video such as QHDTV. Therefore, the goal of this paper was reached.
7. References
[AGO 01] AGOSTINI, L. V. et al. High Throughput Architecture for H.264/AVC Forward Transforms Block. In: ACM Great Lakes Symposium on VLSI, GLSVLSI, 2006. Proceedings… New York: ACM, 2006. p. 320-323.
[RIC 03] RICHARDSON, I. H.264/AVC and MPEG-4 Video Compression – Video Coding for Next-Generation Multimedia. Chichester: John Wiley and Sons, 2003.
[ITU 03] ITU-T Recommendation and final draft international standard of joint video specification (ITU-T Rec.H.264|ISO/IEC 14496-10 AVC), 2003.