A High Throughput Diamond Search Architecture for HDTV
150 XXIII SIM - South Symposium on Microelectronics
3. Designed Architecture
The designed architecture uses the QSDS algorithm with sub-sampled blocks of 8x8 pixels. SAD [KUH 99] was used as the distortion criterion. The architectural block diagram is shown in fig. 2. The architecture has nine processing units (PUs). Each PU processes eight samples in parallel, which means that one line of the sub-sampled 16x16 block is processed at a time. Thus, eight accumulations are needed to generate the final SAD result of each block, one accumulation per line.
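The line-by-line SAD accumulation described above can be sketched in software. This is only an illustrative model, not the authors' VHDL: the function name `sad_8x8` and the list-of-lists block layout are assumptions made for the example.

```python
def sad_8x8(current, candidate):
    """SAD of two 8x8 sub-sampled blocks, accumulated one line at a time.

    Hypothetical software model: each block is a list of 8 lines,
    each line a list of 8 samples.
    """
    if len(current) != 8 or len(candidate) != 8:
        raise ValueError("blocks must have 8 lines")
    acc = 0
    for cur_line, cand_line in zip(current, candidate):
        # One PU step: in hardware the eight absolute differences of a
        # line are computed in parallel; here they are simply summed.
        line_sad = sum(abs(c - r) for c, r in zip(cur_line, cand_line))
        acc += line_sad  # one accumulation per line, eight in total
    return acc


cur = [[10] * 8 for _ in range(8)]
cand = [[12] * 8 for _ in range(8)]
print(sad_8x8(cur, cand))  # 8 lines * 8 samples * |10 - 12| = 128
```

In the architecture, nine such computations run side by side, one per PU.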
The nine candidate blocks of the LDSP are calculated in parallel and the results are sent to the comparator.
Each candidate block receives an index that identifies its position in the LDSP. The comparator finds the lowest SAD and sends the index of the chosen candidate block to the control unit. The control unit analyses the block index and decides the next step of the algorithm. If the chosen block has index 4, the lowest SAD was found in the center of the LDSP, so the control unit starts the final refinement with the SDSP. In the SDSP calculation, four more candidate blocks are evaluated. The lowest SAD is identified and the control unit generates the motion vector corresponding to this block.
When the chosen block in the first step has index 0, 3, 5 or 8, the control unit starts the second step of the algorithm with a vertex search. In this case, five more candidate blocks are calculated and compared to the lowest SAD of the previous step. If the chosen block from the first step has index 1, 2, 6 or 7, the control unit starts the second step of the algorithm with an edge search, and three more candidate blocks are calculated.
The second step can be repeated n times, until the lowest SAD is found in the center of the LDSP. Then the SDSP is applied.
The number of iterations in the second step is restricted to 20 in the QSDS architecture. This restriction defines a search area and makes it easier to synchronize this module with the other encoder modules. Thus, the maximum search area of QSDS is 100x100 samples.
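The decision taken by the control unit after each comparison can be summarized as a small function. This is a minimal sketch based only on the index rules and the 20-iteration limit stated above; the function name `next_step` and its return format are hypothetical.

```python
CENTER = 4
VERTEX = {0, 3, 5, 8}
EDGE = {1, 2, 6, 7}
MAX_ITERATIONS = 20  # limit on second-step repetitions stated in the text


def next_step(best_index, iterations_done):
    """Return (mode, number_of_new_candidate_blocks) for the next step.

    Hypothetical model of the control unit decision.
    """
    if best_index == CENTER or iterations_done >= MAX_ITERATIONS:
        return ("sdsp", 4)    # final refinement: four more candidates
    if best_index in VERTEX:
        return ("vertex", 5)  # vertex search: five more candidates
    if best_index in EDGE:
        return ("edge", 3)    # edge search: three more candidates
    raise ValueError("LDSP index must be in the range 0..8")


print(next_step(4, 0))   # ('sdsp', 4)
print(next_step(0, 0))   # ('vertex', 5)
print(next_step(6, 2))   # ('edge', 3)
print(next_step(1, 20))  # ('sdsp', 4), iteration limit reached
```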
Fig. 2 – QSDS block diagram architecture.
3.1 Memory organization
The internal memory is organized in 15 different local memories, as presented in fig. 3. The local memory (LM) stores the region with the nine candidate blocks of the first LDSP and all the possible blocks for the next step. Thus, when the control unit decides which blocks must be evaluated in the next step, the LM already has this data. The LM is composed of 16 words of 128 bits. Another 13 small memories are used to store the candidate blocks (CBMs) and one stores the current block (CB). These 14 memories are composed of 8 words of 64 bits (8 samples of 8 bits) each, and each one stores one sub-sampled block of 8x8 samples.
The LM memory is read line by line and the data is stored in the CBM memories. Nine CBMs are used to store the candidate blocks from the LDSP. Four CBMs are used to store the blocks of the SDSP. These blocks are always stored, so they are ready to be processed if the control unit decides to start the SDSP. This solution speeds up the architecture and reduces the memory access latency. When the SDSP mode is active, the control unit sets the multiplexer in fig. 2 so that the PUs receive the data from these memories.
[Fig. 2 block labels: LM Memory, CBM, CBMS, CB, PU0 to PU8, SAD, Comparator, Control Unit, MV]
Fig. 3 - Memory organization.
Each line from the LM has 128 bits. However, a partition unit (Partition in fig. 3) cuts the line down to 64 bits. This unit selects the part of the 128 bits that corresponds to the candidate block. When the first line of the LM is read, the partition unit selects the part of the word that corresponds to candidate block 0, to be stored in CBM0. When the second line is read, it selects the part corresponding to the second line of candidate block 0, to be stored in CBM0, and the part corresponding to the first line of candidate block 1, to be stored in CBM1. This process finishes when the whole LM has been read and all CBMs are full.
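As an illustration of the partition step, the sketch below extracts a 64-bit window from a 128-bit LM line. It assumes that the selected part is a contiguous run of eight 8-bit samples and that sample 0 sits in the most significant byte; neither detail is stated in the text, and the function name `partition` is hypothetical.

```python
def partition(line_128, offset):
    """Select 8 consecutive 8-bit samples from a 16-sample (128-bit) line.

    Assumptions (not from the paper): contiguous window, sample 0 stored
    in the most significant byte, `offset` given in samples (0..8).
    """
    if not 0 <= offset <= 8:
        raise ValueError("offset must be in the range 0..8")
    samples = line_128.to_bytes(16, "big")           # 16 samples of 8 bits
    return int.from_bytes(samples[offset:offset + 8], "big")


# A line holding the sample values 0, 1, ..., 15.
line = int.from_bytes(bytes(range(16)), "big")
window = partition(line, 3)
print(list(window.to_bytes(8, "big")))  # [3, 4, 5, 6, 7, 8, 9, 10]
```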
A local control was developed to control the memory access; the control unit in fig. 2 is not responsible for controlling the memory read/write process. When the datapath finishes the SAD calculation, the control unit informs the memory whether the search should continue or the SDSP should be applied. If the search continues, by an edge or a vertex, the CBMs are loaded with the new candidate blocks and the LM memory is reloaded. In this case, there are 10 cycles of latency to write the first line in the nine CBMs. If the SDSP is applied, the four candidate blocks are already stored and the datapath starts with no memory latency. Even using 15 different memories, the total memory consumption is small. The search area is loaded according to the algorithm's needs, so no irrelevant data is stored. A total of 9.2 Kbits is used for all motion estimator memories.
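The 9.2 Kbits total follows from the memory sizes given in this section, as the short check below shows. The constant names are chosen only for this example.

```python
LM_BITS = 16 * 128       # one LM: 16 words of 128 bits
SMALL_MEM_BITS = 8 * 64  # each CBM or CB: 8 words of 64 bits
SMALL_MEM_COUNT = 13 + 1 # 13 CBMs plus 1 CB

total_bits = LM_BITS + SMALL_MEM_COUNT * SMALL_MEM_BITS
print(total_bits)         # 2048 + 7168 = 9216 bits
print(total_bits / 1000)  # 9.216, i.e. about 9.2 Kbits
```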
3.2 Performance Evaluation
The QSDS algorithm stops the search when the best match is found at the center of the LDSP. This condition can occur in the first application of the LDSP, in which case no edge or vertex searches are done. This is the best case, in which only thirteen candidate blocks are evaluated: nine from the first LDSP and four from the SDSP.
This architecture uses 26 clock cycles to fill all the memories and to start the SAD calculation. The PUs have a latency of four cycles and need seven more cycles to calculate the SAD of one block. The comparator uses five cycles to choose the lowest SAD. Thus, 42 clock cycles are necessary to process the first LDSP. The SDSP needs only 20 cycles to calculate the SAD for the four candidate blocks. So, in the best case, this architecture can generate a motion vector in 62 cycles.
In the cases where the LDSP is repeated, for an edge or vertex search, 10 cycles are used to write in the CBMs. The same 16 cycles are used by the PUs and the comparator. Thus, both edge and vertex searches use 26 cycles. The QSDS architecture can do up to 20 LDSP repetitions, and each one uses 26 more cycles.
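The cycle counts above can be combined into a simple model. This is a sketch that assumes every motion vector costs the first LDSP, a number of 26-cycle repetitions, and the final SDSP; the constant and function names are hypothetical.

```python
FILL_CYCLES = 26        # fill the memories before the first SAD
PU_LATENCY = 4          # PU pipeline latency
SAD_CYCLES = 7          # remaining cycles for the SAD of one block
COMPARATOR_CYCLES = 5   # choose the lowest SAD
CBM_WRITE_CYCLES = 10   # reload the CBMs for an edge or vertex search
SDSP_CYCLES = 20        # final refinement

DATAPATH_CYCLES = PU_LATENCY + SAD_CYCLES + COMPARATOR_CYCLES  # 16
FIRST_LDSP_CYCLES = FILL_CYCLES + DATAPATH_CYCLES              # 42
REPEAT_CYCLES = CBM_WRITE_CYCLES + DATAPATH_CYCLES             # 26


def cycles_per_vector(repetitions):
    """Cycles for one motion vector under the assumed model."""
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    return FIRST_LDSP_CYCLES + repetitions * REPEAT_CYCLES + SDSP_CYCLES


print(cycles_per_vector(0))  # 62, the best case
print(cycles_per_vector(3))  # 140
```

With no repetitions the model gives the 62 cycles of the best case, and with three repetitions it gives the 140 cycles quoted for the average case in Section 4.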
4. Synthesis results
The proposed architecture was described in VHDL. The ISE 8.1i tool was used to synthesize the architecture to the Virtex-4 XC4VLX15 device, and the ModelSim 6.1 tool was used to simulate and validate the architecture design. Tab. 1 presents the synthesis results.
The device hardware utilization is small, since only 3.5k LUTs are used. The synthesis results also show the high operating frequency achieved by the QSDS architecture.
Tab. 1 - Synthesis results for the QSDS architecture
Parameter                  QSDS Results
BRAMs                      32
Slices                     1964
Slice FF                   1980
LUTs                       3541
Frequency (MHz)            212.5
HDTV fps (worst case)      45
HDTV fps (average case)    187
[Fig. 3 block labels: External Memory, LM, Partition, CBM0, CBM1, ..., CBM11, CBM12, CB]
Tab. 1 also shows the architecture performance considering HDTV (1920x1080 pixels) videos. The QSDS architecture needs, in the worst case, 562 clock cycles to generate a motion vector. This number of clock cycles is enough to execute the 20 iterations predefined for the second step of the algorithm. After that, the control unit stops the search and starts the SDSP. The QSDS architecture can achieve real time for HDTV video in the worst case, because it can process up to 45 HDTV frames per second.
The average case was obtained through the software implementation of the algorithm. For a search area of 100x100 samples, the algorithm uses an average of three iterations in the second step of the algorithm. Then, the QSDS architecture needs 140 clock cycles to generate a motion vector in this average case. This implies a processing rate of 187 HDTV frames per second.
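The frame rates can be related to the cycle counts with a short calculation. The sketch below assumes one motion vector per 16x16 block and 1920 x 1080 / 256 = 8100 blocks per frame; the paper does not state how the frame rate was derived, so the function `hdtv_fps` and its parameters are assumptions made for this example.

```python
def hdtv_fps(cycles_per_vector, freq_hz=212.5e6,
             width=1920, height=1080, block=16):
    """Frames per second, assuming one motion vector per block.

    Assumed model: blocks per frame = width * height / block**2
    (8100 for 1920x1080 with 16x16 blocks).
    """
    blocks_per_frame = (width * height) / (block * block)
    return freq_hz / (cycles_per_vector * blocks_per_frame)


print(int(hdtv_fps(140)))  # 187, matching the average case
print(int(hdtv_fps(562)))  # 46
print(int(hdtv_fps(582)))  # 45
```

Under these assumptions, 140 cycles reproduce the 187 fps of the average case. The 45 fps reported for the worst case corresponds to 582 cycles in this model (562 cycles plus the 20 cycles of the SDSP), while 562 cycles would give about 46.7 fps; the text does not say which accounting was used.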
The QSDS FPGA results were compared to the FS [LAR 06], FTS (Flexible Triangle Search) [MOH 05] and FS+PS4:1 with early termination [REH 06] architectures, as presented in tab. 2. Solutions [LAR 06] and [MOH 05] have a constant throughput, while [REH 06] has a variable throughput due to the early termination. The search area used by [REH 06] is 32x32 samples; [LAR 06] and [MOH 05] do not present this information.
Tab. 2 compares the synthesis results of the related works found with those of the QSDS architecture. The processing rate in HDTV frames (1920x1080) is also presented. The number of CLBs used by [REH 06] is not presented.
Tab. 2 - Comparative results with related works
Solution   FPGA        CLBs   Frequency (MHz)   Search Area   HDTV (fps)
[LAR 06]   SPARTAN 2   939    109.8             -             1.66
[MOH 05]   SPARTAN 3   6142   74.0              -             45
[REH 06]   VIRTEX 2    -      120.0             32x32         *24
QSDS       VIRTEX 4    492    212.5             100x100       *187
*Average throughput
The hardware resources are compared in terms of the number of Configurable Logic Blocks (CLBs) used by each solution. However, the ISE 8.1i tool does not report this number directly, so it was estimated from the number of slices used. Each CLB has four slices, thus the number of CLBs used by the QSDS architecture was calculated as one quarter of the number of used slices.
The QSDS architecture is the fastest among the compared designs. Due to the efficiency of the QSDS algorithm, the designed architecture uses fewer hardware resources than the other solutions while reaching the highest throughput, even with a large search area.
5. Conclusions
This paper presented a hardware architecture for a Quarter Sub-sampled Diamond Search algorithm, named QSDS. The synthesis results showed that the designed architecture can achieve high throughput with a low hardware cost.
The presented architecture is able to run at 212.5 MHz in a Xilinx Virtex 4 FPGA. Only 3541 LUTs were used to implement the architecture. The QSDS architecture can work in real time for HDTV (1920x1080 pixels) videos. In the worst case, the QSDS architecture can process 45 HDTV frames per second. In the average case, the architecture is able to process 187 HDTV frames per second.
The QSDS architecture reaches the best results for maximum throughput and hardware resource utilization in comparison with the presented related works.
6. References
[KUH 99] KUHN, P. M. Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Boston: Kluwer Academic Publishers, 1999.
[LAR 06] LARKIN, D. et al. A Low Complexity Hardware Architecture for Motion Estimation. In: IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, 2006, pp. 2677-2688.
[MOH 05] MOHAMMADZADEH, M. et al. An Optimized Systolic Array Architecture for Full-Search Block Matching Algorithm and its Implementation on FPGA chips. In: The 3rd International IEEE-NEWCAS Conference, 2005, pp. 174-177.
[REH 06] REHAN, M. et al. An FPGA Implementation of the Flexible Triangle Search Algorithm for Block Based Motion Estimation. In: IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, 2006, pp. 521-523.