FPGA-Based Traffic-Sign Classification
César Augusto Pereira de Jesus Gouveia
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Prof. Horácio Cláudio De Campos Neto
Prof. Rui António Policarpo Duarte
Examination Committee
Chairperson: Prof. Teresa Maria Sá Ferreira Vazão Vasques
Supervisor: Prof. Horácio Cláudio De Campos Neto
Member of the Committee: Prof. Pedro Filipe Zeferino Tomas
June 2019
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
I would first like to thank my supervisor Prof. Horácio Neto for all the helpful insight, guidance and support through the development of this thesis.
I want to thank Prof. Rui Duarte and Mário Vestias for their encouragement and insightful comments.
Also, I would like to thank INESC-ID for providing me with the tools to develop the experimental work for this thesis.
A big and heartfelt thank you to my family, my girlfriend and my colleagues for all the support given to me during the course of this work.
Resumo
The objective of this work is the development of a hardware/software embedded system, on a SoC-FPGA platform, for the detection of traffic signs.
Advanced Driver Assistance Systems are a crucial component of autonomous cars. One of the main tasks of this type of system is to recognize traffic signs. Recent recognition systems are typically based on Convolutional Neural Networks.
Because these are safety-critical systems, the image processing has to be performed in real time.
The developed system uses a Convolutional Neural Network to perform the recognition.
The network is composed of two convolutional layers and two fully connected layers, and classifies 43 traffic signs (German Traffic Sign Recognition Benchmark) with an accuracy of 97.2 % (after quantization). The system is composed of a dedicated multiprocessor that accelerates the convolutional layers, which account for 93 % of the total execution time of the program, and a general-purpose processor that computes the Fully Connected layers in software.
The proposed solution, implemented on the Zynq 7020 device, achieves a classification rate of 8 classifications/s using only one processing element, which corresponds to a 5x speed-up over a software-only solution. The architecture is fully scalable (up to 36 processing elements on the Zynq 7020 device), and with only four processing elements it meets the real-time requirements (approximately 60 ms per classification) using only 17 % of the available resources.
Keywords:
Traffic Sign Recognition, Convolutional Neural Networks, Hardware/Software Co-design, Hardware Acceleration, Systems on Chip, Custom Hardware Design
Abstract
The objective of this work is to develop an efficient traffic sign classification hardware/software embedded system on a SoC-FPGA platform.
Advanced Driver Assistance Systems are a crucial component of self-driving cars. One of the main tasks of these systems is to recognize traffic signs. Modern recognition systems are typically based on Convolutional Neural Networks. Because of the safety-critical nature of these systems, the image processing task must be performed in real time.
The SoC FPGA system developed uses a Convolutional Neural Network to perform the recognition.
The network classifies 43 traffic signs, on the German Traffic Sign Recognition Benchmark, with a final accuracy of 97.2 % (after quantization) and is composed of two Convolutional and two Fully Connected layers. A dedicated multi-processor was developed to accelerate the Convolutional layers, which consume 93 % of the total execution time. The Fully Connected layers are computed in software.
The proposed solution, implemented on a Zynq 7020 device, can achieve a classification rate of 8 classifications/s using only one processing element, which is five times faster than a software-only solution executing on an ARM processor embedded on the same device. The architecture is fully scalable (up to 36 Processing Elements can be implemented in a Zynq 7020 device), and with only 4 Processing Elements it is possible to meet real-time objectives (approximately 60 ms per classification) using only 17 % of the available resources.
Keywords:
Traffic Sign Recognition, Convolutional Neural Networks, Hardware/Software Co-design, Hardware Acceleration, Systems on Chip, Custom Hardware Design.
Contents
Declaration
Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Acronyms
Nomenclature
Glossary
1 Introduction
1.1 Motivation
1.2 Thesis Outline
2 Convolutional Neural Networks Background
2.1 Artificial Neural Networks
2.1.1 Artificial Neural Network Classifier
2.1.2 Network Training
2.2 Convolutional Neural Networks
2.2.1 Convolutional Layer
2.2.2 Pooling Layer
2.2.3 Fully-connected Layer
2.2.4 Dropout
2.2.5 CNN Architecture
2.2.6 Fixed-point arithmetic
2.3 CNN Development Frameworks
2.3.1 Caffe
2.3.1.1 Data Storage
2.3.1.2 Layers
2.3.2 Ristretto
3 TSR CNN methods
3.1 GTSRB Dataset
3.1.1 Data Organization
3.1.2 Data Preparation
3.2 TSR CNNs
3.3 Conclusion
4 FPGA-based CNN
4.1 SoC FPGA
4.1.1 Zynq-7000 SoC FPGA
4.1.2 AXI Protocol
4.1.3 AXI4 interfaces: Lite and Stream
4.2 CNN Hardware Implementations
4.2.1 Data organization
4.2.1.1 Local memories usage
4.2.1.2 Data representation
4.2.1.3 Loop Tiling
4.2.2 Kernel-level computation
4.2.3 Parallelization
4.3 Conclusions
5 Evaluation of network structure
5.1 Network Training
5.1.1 Dataset preparation
5.1.2 Training results
5.2 Quantization Analysis
5.3 Conclusions
6 CNN Hardware Architecture
6.1 Convolutional Accelerator
6.1.1 Processing Element
6.1.1.1 Weights Memory
6.1.1.2 MAC Cluster
6.1.1.3 Adder Cascade
6.1.1.4 BRP Unit
6.1.2 Activations Memory
6.1.3 Accelerator Control Unit
6.2 Communication Control
6.3 PE Configuration
6.4 Conclusions
7 Traffic Sign Recognizer
7.1 Hardware/Software Architecture
7.2 Experimental Results
7.2.1 Hardware Resources
7.2.2 Execution Time
7.2.3 Read/Processing/Write Times
7.3 Architecture Scalability
7.4 4-PE Architecture
7.5 Conclusions
8 Conclusion
8.1 Future Work
8.1.1 Fully Connected Stages
8.1.2 Traffic Sign Detection
Bibliography
List of Tables
2.1 Fixed-point format notation.
3.1 Overview by date.
3.2 Metrics.
4.1 Zynq-7000 Programmable logic resources.
4.2 Performance comparison.
5.1 List of hyper-parameters used by both networks.
5.2 List of experiments.
5.3 Activations and weights from Sermanet network.
5.4 Jin’s activations and weights analysis.
5.5 Memory requirements of Top-5 accuracy configurations.
6.1 Bus width notation.
6.2 Data representation.
6.3 Local weights RAM port widths.
6.4 RAM A and B port widths.
6.5 Range of the counters for each stage.
6.6 Convolution parameters values for each stage.
6.7 Utilization in LUTs for one-PE with different configurations.
6.8 Utilization in Flip-Flops for one-PE with different configurations.
6.9 Utilization in Block RAM Tiles for one-PE with different configurations.
6.10 Utilization in DSPs for one-PE with different configurations.
6.11 Minimum clock period for one-PE with different configurations.
6.12 Utilization for Multi-PE Clusters.
7.1 HW/SW task assignment.
7.2 Resource usage.
7.3 Resource usage for the main sub-blocks of the convolutional accelerator.
7.4 Execution Times.
7.5 Read, processing and write times.
7.6 Read and processing time (for one-PE architecture).
7.7 Resource usage for the main sub-blocks of the 4-Processing Element (PE) convolutional accelerator.
7.8 Execution Times.
List of Figures
1.1 Common sequence for recognizing traffic signs, adapted from [1].
2.1 Artificial Neuron.
2.2 ANN, adapted from [9].
2.3 DNN, adapted from [9].
2.4 Example of a convolutional layer with 2 filters of size 3x3, moving at a stride s=1, a padding p=0, and that produces 2 OFMs with 3x3 size, using 3 IFMs.
2.5 Dropout approach by [16], on a CNN.
2.6 Example of a CNN topology with 2 convolutional layers (C1, C2), 2 pooling layers (P1, P2) and 2 FC layers (FC1, FC2).
2.7 4D tensor representing a set of n images with c channels and w x h size.
2.8 A Caffe convolutional layer with top and bottom blobs.
3.1 43 GTSRB traffic sign classes [4].
3.2 Image border of 10% around a traffic sign example.
3.3 Jittered image examples.
3.4 Number of samples per GTSRB class.
3.5 General CNN architecture for TSR.
3.6 Multi-scale CNN, adapted from [12].
3.7 MCDNN, adapted from [11].
3.8 MCDNN, adapted from [24].
3.9 Figures (a) and (b) correspond to Sermanet and Jin models, respectively.
4.1 Zynq-7000 SoC main block diagram, adapted from [30].
4.2 Basic DSP48E1 Block Functionality, adapted from [33].
4.3 AXI read transaction [34].
4.4 AXI write transaction [34].
4.5 Time diagram for AXI4-Stream transaction, adapted from [34].
4.6 Weights and activations between layers, adapted from Figure 2.5.
4.7 Buffer approach, adapted from [35].
4.8 Tiled convolutions, adapted from [35].
4.9 Kernel-level computation for a 3x3 size, adapted from [40].
4.10 Convolve over depth.
4.11 3D convolutions using local weights RAMs.
4.12 Figures (a) and (b) correspond to global architecture and PE, respectively, adapted from [7].
5.1 Training Flow.
5.2 Figures (a) and (b) correspond to Sermanet and Jin models, respectively, with their additional dropout layers.
5.3 Sermanet and Jin accuracy comparison.
5.4 Plots (a), (b) and (c) correspond to the activation bit-configuration tests of the Sermanet and Jin layers for 4-bit, 8-bit and 16-bit, respectively. Plot (a) presents the results of experiments 1 and 2, plot (b) the results of experiments 3 and 4 and plot (c) the results of experiments 5 and 6.
5.5 Plots (a), (b) and (c) correspond to the parameters bit-configuration tests of the Sermanet and Jin layers for 4-bit, 8-bit and 16-bit, respectively. Plot (a) presents the results of experiments 7 and 8, plot (b) the results of experiments 9 and 10, and (c) the results of experiments 11 and 12.
5.6 Sermanet and Jin accuracy comparison (after quantization).
6.1 Accelerator architecture.
6.2 Processing Element.
6.3 Ping Pong structure in the weights RAM.
6.4 MAC Cluster.
6.5 Generic Adder Cascade.
6.6 Adder Cascade implementation with DSP blocks.
6.7 BRP Unit.
6.8 RAM A and B dimensions.
6.9 RAM A and B port sizes in each stage.
6.10 Accelerator Control Unit.
6.11 Convolutional Stages (1 and 2) FSM.
6.12 Read-Processing-Write sequence for the convolutional stages.
6.13 BRAMs and DSPs consumption for different PE configurations.
7.1 Block diagram of PS+PL system.
7.2 Simplified block diagram of PS+PL system with interfaces only.
7.3 Input images reception (ILA).
7.4 Filter reception on stage 1 (ILA).
7.5 Filter reception on stage 2 (ILA).
7.6 Filter processing on stage 1 (Simulation).
7.7 Filter processing on stage 2 (Simulation).
7.8 Variation of the number of BRAMs with the number of PEs.
7.9 BRAMs and DSPs resource consumption with the number of PEs.
7.10 Variation of the execution time with the number of PEs.
7.11 Classifications per second variation with the number of PEs.
7.12 4-PE architecture.
7.13 Modified Convolutional Stages FSM.
7.14 Shift-register for controlling weight RAMs accesses.
8.1 Accelerator architecture for the Fully Connected (FC) stages.
8.2 Multi-scaled images.
Acronyms
ADAS Advanced Driver Assistance System
AMBA Advanced Micro-controller Bus Architecture
ANN Artificial Neural Network
APU Application Processing Unit
ARM Advanced RISC Machines
AXI Advanced eXtensible Interface
BRAM Block Random-Access Memory
BRP Bias ReLU Max-Pooling
BVLC Berkeley Vision and Learning Center
CLB Configurable Logic Block
CNN Convolutional Neural Network
CPU Central Processing Unit
DDR Double Data Rate
DMA Direct Memory Access
DNN Deep Neural Network
DSP Digital Signal Processing
ELU Exponential Linear Unit
FC Fully Connected
FF Flip Flop
FIFO First In First Out
FPGA Field-programmable gate array
FPU Floating Point Unit
FSM Finite State Machine
GHT Generalized Hough Transform
GPU Graphics processing unit
GP General Purpose
GTSDB German Traffic Sign Detection Benchmark
GTSRB German Traffic Sign Recognition Benchmark
HOG Histogram of Oriented Gradients
HP High Performance
I/O Input/Output
IFM Input Feature Map
ILA Integrated Logic Analyzer
L1 Level 1
L2 Level 2
LReLU Leaky Rectified Linear Unit
LRN Local Response Normalization
LUT Look-Up Table
MAC Multiply–Accumulate Unit
MCDNN Multi-Column Deep Neural Network
OCM On-Chip Memory
OFM Output Feature Map
PE Processing Element
PL Programmable Logic
PPM Portable Pixel Map
PReLU Parametric Rectified Linear Unit
PS Processing System
RAM Random Access Memory
ReLU Rectified Linear Unit
ROM Read-Only Memory
SCU Snoop Control Unit
SoC System-on-Chip
TSR Traffic Sign Recognition
Chapter 1
Introduction
The objective of this work is to develop an efficient traffic sign classification hardware/software embedded system on a SoC-FPGA platform. This chapter describes the motivation to accelerate such applications, details the main objectives of the work and presents the thesis outline.
1.1 Motivation
One of the more relevant tasks of an Advanced Driver Assistance System (ADAS) is to recognize traffic signs in real-world environments. A Traffic Sign Recognition (TSR) algorithm (Figure 1.1) is composed of two main phases: detection (one or more traffic signs are localized in the input image) and classification (the sign is classified as belonging to one of the pre-defined traffic sign classes) [1]. This work focuses on the classification phase.
One of the main challenges for a TSR system is real-time operation. The system must provide the correct sign information even at a high travelling speed. Considering an information-only system, a velocity of 180 km/h (50 m/s) and a distance from the camera to the sign of 30 m, the vehicle reaches the sign in 30/50 = 0.6 s, so the overall execution time (detection + classification) must not exceed 600 ms.
Traditional classification approaches were based on hand-crafted features and regular classifiers [2]. However, recent work on object recognition has shown that hand-crafted features, although well behaved on small-sized homogeneous datasets, do not perform as effectively on larger and heterogeneous datasets, while Convolutional Neural Networks (CNNs) can produce more accurate and robust classifications [3]. Moreover, at the International Joint Conference on Neural Networks (IJCNN), CNNs were evaluated on the German Traffic Sign Recognition Benchmark (GTSRB) dataset and achieved excellent results on the classification phase, surpassing the average human performance [4].
CNNs are growing in size and complexity, with tens of millions of parameters and computations, which can represent a challenge even for modern Central Processing Units (CPUs). Field-programmable gate arrays (FPGAs) can provide a power-efficient embedded platform to execute such networks, and therefore a considerable number of FPGA-based approaches have recently been proposed, e.g. [5, 6, 7].
1.2 Thesis Outline
This thesis is organized as follows:
• Chapter 2 provides a context overview and background knowledge. The main layers of CNNs are introduced and the fixed-point arithmetic used in this work is defined. The CNN development frameworks, used for training and inference, are presented.
• Chapter 3 reviews existing work on the use of CNN-based architectures for TSR classification and details the German Traffic Sign Recognition Benchmark. The two most promising networks are chosen for architectural exploration in Chapter 5.
• Chapter 4 describes the Zynq-7000 System-on-Chip (SoC) FPGA and the communication protocol. Previous works on FPGA-based architectures for CNNs are reviewed and parallelization methods are presented and analyzed.
• Chapter 5 compares the accuracy and resource requirements of the two networks selected in Chapter 3. The traffic sign dataset is augmented in order to standardize the contributions of each class (sign). A quantization analysis is performed in order to find the best trade-off between accuracy and memory resources for the selected networks.
• Chapter 6 describes the architecture of the hardware accelerator that implements the network chosen in Chapter 5. The processing element architecture is explored in order to find the best trade-off between performance and resource consumption.
• Chapter 7 describes the Hardware/Software architecture proposed and developed for the traffic sign classifier. The results achieved with the Hardware/Software system are compared with the Software-only approach. The architecture scalability is explored for several processing elements. An architecture with 4 processing elements is then chosen for implementation and the necessary changes regarding the base architecture (1 PE only) are described.
• Chapter 8 draws the final remarks and presents suggestions for future work.
[Figure: a detection stage followed by a classification stage; example classes: Keep Right, Give way, Pedestrian Crossing, Roundabout.]
Figure 1.1: Common sequence for recognizing traffic signs, adapted from [1].
Chapter 2
Convolutional Neural Networks Background
CNNs have emerged as one of the most efficient architectures for image recognition and classification tasks. This chapter gives an overview of the main characteristics of CNNs. Section 2.1 introduces the artificial neuron, the Deep Neural Network and the network training process. Section 2.2 presents the concept of CNN, its main layers and the fixed-point arithmetic used in this work. Lastly, Section 2.3 introduces the CNN development frameworks.
2.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are computational models inspired by the neurons and their interconnections present in the biological neural networks of the brain. The first probabilistic model of a neuron, called Perceptron, was presented by Rosenblatt in 1957 [8]. ANNs have demonstrated their capability as classifiers, even in image recognition tasks [9, 10].
2.1.1 Artificial Neural Network Classifier
Artificial neurons are the basic unit of ANNs. As represented in Figure 2.1, the main elements of artificial neurons are:
• Input nodes (x1, x2, ..., xn);
• Weighted connections (w1, w2, ..., wn);
• Activation function (Φ).
A neuron takes an arbitrary number of input signals x1, x2, ..., xn and produces an output y. The connections between the input values and the summing function are weighted with values w1, w2, ..., wn. A particular weight, which is represented as b1, acts as the threshold of the neuron and is called bias.
[Figure: input signals x1, x2, ..., xn, weights of connections w1, w2, ..., wn and bias b1 feed a weighted sum u, which passes through the activation function Φ to produce the output y.]
Figure 2.1: Artificial Neuron.
The neuron output is computed as an inner product function of the input and weight vectors (u) (Equation 2.1), which is passed through a function called activation function. The earliest ANN model (perceptron), used by Rosenblatt, uses the hard-limiter activation function (y = Φ(u)) defined in Equation 2.2.

\[ u = \sum_{i=0}^{n} w_i x_i \tag{2.1} \]

\[ \mathrm{step}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 \ (\text{or } 0) & \text{otherwise} \end{cases} \tag{2.2} \]
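Other commonly used activation functions include:
• Sigmoid:
\[ \Phi(u) = \frac{1}{1 + e^{-u}} \tag{2.3} \]
• Tanh:
\[ \Phi(u) = \tanh(u) = \frac{\sinh(u)}{\cosh(u)} \tag{2.4} \]
• Rectified Linear Units (ReLUs):
\[ \mathrm{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.5} \]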
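As an illustration of Equations 2.1 to 2.5, the following minimal Python sketch evaluates a single artificial neuron with each activation function. It is not taken from the thesis implementation: the function names are hypothetical, the 0/1 variant of the hard limiter is chosen arbitrarily, and the bias is added as a separate term, which is an assumption about how it enters the weighted sum.

```python
import math


def step(u):
    """Hard limiter of Equation 2.2 (0/1 variant, chosen for illustration)."""
    return 1.0 if u > 0 else 0.0


def sigmoid(u):
    """Sigmoid activation, Equation 2.3."""
    return 1.0 / (1.0 + math.exp(-u))


def tanh(u):
    """Hyperbolic tangent written as sinh/cosh, Equation 2.4."""
    return math.sinh(u) / math.cosh(u)


def relu(u):
    """Rectified Linear Unit, Equation 2.5."""
    return u if u > 0 else 0.0


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    # Assumption: the bias is handled as an explicit additive term.
    u = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(u)


if __name__ == "__main__":
    x = [1.0, 2.0]
    w = [0.5, -0.25]
    for phi in (step, sigmoid, tanh, relu):
        print(phi.__name__, neuron(x, w, 0.1, phi))
```

Swapping the `activation` argument is enough to compare how the same weighted sum is mapped by each function.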
Figure 2.2 shows an ANN composed of an input layer (leftmost layer), a hidden layer (middle) and an output layer (rightmost). An application for this network could be, for example, identifying whether a traffic sign image is a stop sign or not, by feeding the input neurons with the image pixel intensities scaled between 0 and 1. The output layer is a single neuron and corresponds to the prediction value. If the output value is greater than 0.5 the input image is a stop sign, otherwise it is not.
The network above has only one hidden layer; however, some networks have multiple hidden layers. Networks with two or more hidden layers are usually called Deep Neural Networks (DNNs). Figure 2.3 presents an example of a four-layer DNN with two hidden layers.
Figure 2.2: ANN, adapted from [9].
Figure 2.3: DNN, adapted from [9].
2.1.2 Network Training
The process of training a network consists of finding the weights and biases that produce the best overall classification results. As the training process is not the main focus of this project, only a simple introduction will be given in this section.
As said before, the goal is to find the set of weights and biases, given a set of training inputs x, that minimizes the distance between what the system predicts and the known classification. This distance is usually designated as the loss function L(w, b). Different loss functions may be chosen, depending on the application. One example is the L2 distance:

\[ L(w, b) = \frac{1}{2n} \sum_{x} \lVert \Phi(u) - \Phi'(u) \rVert^{2} \tag{2.6} \]

Φ(u) corresponds to the vector of outputs from the network when the input is x (Equation 2.1), Φ'(u) represents the one-hot classification vector (e.g. labels of the traffic signs) and n is the number of training inputs. To minimize the above equation an algorithm called gradient descent is used, which performs updates on each weight and bias:

\[ w_k \rightarrow w'_k = w_k - \eta \, \frac{\partial L(w, b)}{\partial w_k} \tag{2.7} \]

\[ b_l \rightarrow b'_l = b_l - \eta \, \frac{\partial L(w, b)}{\partial b_l} \tag{2.8} \]
These two equations (2.7, 2.8) are applied over multiple steps during training, so that the loss function decreases at each iteration and converges towards a minimum. η is an additional parameter, denoted as learning rate, which determines the pace at which the loss function converges [9]. The base learning rate is the value of the learning rate at the beginning of the training, when no updates have been done yet.
Training may take too long when the number of training inputs is large; for that reason a technique called stochastic gradient descent is used. Instead of computing the gradient over the whole dataset, this technique estimates it from a small number of randomly chosen samples, which is less computationally expensive.
The backpropagation algorithm aims to find the minimum of the loss function using gradient descent. Basically, the error values on the outputs are propagated backwards through the network, and the weights and biases are updated based on the previously calculated gradients.
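The update rules of Equations 2.7 and 2.8 can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it trains a single linear neuron (no activation function) with the quadratic loss of Equation 2.6, and the data set, learning rate and step count are hypothetical values chosen so that the result is easy to check.

```python
def loss(w, b, samples):
    """Quadratic loss of Equation 2.6 for a single linear neuron y = w*x + b."""
    n = len(samples)
    return sum((w * x + b - y) ** 2 for x, y in samples) / (2 * n)


def gradient_step(w, b, samples, eta):
    """One gradient descent update of w and b (Equations 2.7 and 2.8)."""
    n = len(samples)
    grad_w = sum((w * x + b - y) * x for x, y in samples) / n
    grad_b = sum((w * x + b - y) for x, y in samples) / n
    return w - eta * grad_w, b - eta * grad_b


def train(samples, eta=0.1, steps=2000):
    """Repeat the update from a zero initialisation; eta is the learning rate."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, samples, eta)
    return w, b


# Hypothetical training set generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(samples)

if __name__ == "__main__":
    print(w, b, loss(w, b, samples))
```

Passing a small random subset of `samples` to `gradient_step` at each iteration, instead of the full list, would turn this into the stochastic variant described above.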
2.2 Convolutional Neural Networks
CNNs are a category of neural networks that use convolutional layers and are able to take advantage of the specific spatial structure of (bidimensional) images. Although most of the work and research on this topic started in 2010 [11, 12], the basic structure of the CNN used in subsequent works was defined in 1998 [10]. CNNs usually have two main phases: feature extraction, which is composed of a set of stacked convolutional, activation and subsampling layers; and classification, which includes a set of stacked FC and activation layer pairs that feed into a softmax function (normalization).
2.2.1 Convolutional Layer
A convolutional layer is a set of filters or kernels that extract a set of relevant features (Output Feature Maps (OFMs)) from a set of input images or Input Feature Maps (IFMs); for each filter k, a corresponding OFM is generated. A typical filter or kernel is a small matrix, e.g. of size 5x5x3. An OFM is produced by sliding the filter over the source image or IFMs and applying dot products between a receptive field and the filter on all dimensions, as shown in Figure 2.4. A receptive field is a local region of the IFM that has the same size as the filter. The step size of the filter when traversing the IFM is called stride. Stride is usually set to 1 in the convolutional layers, but it can have higher values, especially in the earlier layers, in order to significantly reduce the input dimensions. The padding defines how the border of an input image is managed in the convolution process. Zero-padding adds zeros around the IFM so that the OFM ends up with the same number of elements as the IFM. If the padding is disabled the border elements will be discarded after each convolutional layer, therefore reducing the output size.
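The effect of kernel size, stride and padding on the output dimensions can be checked with the usual relation out = (in + 2p - k) / s + 1. This relation is standard for convolutional layers but is not stated explicitly in the text, so the sketch below, with its hypothetical function name, should be read as an illustration only.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension.

    Uses integer division, i.e. positions where the kernel would not fit
    entirely inside the (padded) input are dropped.
    """
    return (in_size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    # Configuration of Figure 2.4: 5x5 input, 3x3 filter, s=1, p=0.
    print(conv_output_size(5, 3, stride=1, padding=0))
    # Same filter with a zero-padding of 1 keeps the input size.
    print(conv_output_size(5, 3, stride=1, padding=1))
```

For the configuration of Figure 2.4 this gives a 3x3 OFM, and a padding of 1 with a 3x3 filter preserves the 5x5 input size.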
Each output (neuron) is only connected to a small block of the input image, and therefore the number of connections in a convolutional layer is much smaller than that of a fully connected layer. As such, convolutional networks are also faster to train, which allows the efficient implementation of deeper, many-layer networks.
The convolution layer procedure, detailed in Algorithm 1, is one of the most computationally intensive tasks of a CNN, reaching more than 90% of the CNN execution time in a typical implementation [13].
Algorithm 1 is performed for each convolutional layer and outputs N OFMs from D input channels or IFMs by executing N convolutions of size K x K on each input. This procedure is followed by a non-linear activation function Φ and an added bias term b. The term f represents the tensor of input feature maps, w the tensor of the filter set and f' the tensor of output feature maps.

Algorithm 1 Convolutional Layer
procedure CONVOLUTION
    for number of output feature maps (n) do
        for output feature map rows (i) do
            for output feature map columns (j) do
                $f'[n, i, j] = b[n] + \sum_{d=1}^{D} \sum_{p=1}^{K} \sum_{q=1}^{K} f[d, i+p, j+q] \cdot w[n, d, p, q]$
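Algorithm 1 can be transcribed almost directly into plain Python. The sketch below is an illustration, not the thesis implementation: it assumes a stride of 1, no padding, 0-based indices and nested lists as tensors, and the function name and the small test input are hypothetical.

```python
def conv_layer(ifm, weights, bias):
    """Direct form of Algorithm 1 (stride 1, no padding).

    ifm[d][row][col]   : D input feature maps
    weights[n][d][p][q]: N filters of size D x K x K
    bias[n]            : one bias per output feature map
    Returns ofm[n][i][j].
    """
    depth = len(ifm)
    rows, cols = len(ifm[0]), len(ifm[0][0])
    k = len(weights[0][0])
    out_rows, out_cols = rows - k + 1, cols - k + 1

    ofm = []
    for n in range(len(weights)):          # output feature maps
        fmap = []
        for i in range(out_rows):          # output rows
            row = []
            for j in range(out_cols):      # output columns
                acc = bias[n]
                for d in range(depth):     # input channels
                    for p in range(k):     # kernel rows
                        for q in range(k): # kernel columns
                            acc += ifm[d][i + p][j + q] * weights[n][d][p][q]
                row.append(acc)
            fmap.append(row)
        ofm.append(fmap)
    return ofm


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 channel
    kernel = [[[[1, 0], [0, 1]]]]                 # one 2x2 filter
    print(conv_layer(image, kernel, [1]))
```

The six nested loops make the cost visible: every output element requires D x K x K multiply-accumulate operations, which is why this layer dominates the execution time.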
[Figure: input images or input feature maps x (3x5x5), filters W0 and W1 (3x3x3), biases b0 and b1 (1x1x1), receptive fields (3x3) and output feature maps o (2x3x3).]
Figure 2.4: Example of a convolutional layer with 2 filters of size 3x3, moving at a stride s=1, a padding p=0, and that produces 2 OFMs with 3x3 size, using 3 IFMs.
2.2.2 Pooling Layer
The sub-sampling or pooling layer is usually inserted after a convolutional layer. The purpose of this layer is to gradually reduce the spatial size of the feature maps (size of each OFM) and the computation load of the network. The pooling layer consists of a small kernel, usually 2x2, that traverses the IFM with a stride factor, which is usually 2, and selects one output pixel according to a function. The most common type of function is max-pooling, which selects the pixel with the maximum value within the kernel window. Algorithm 2 describes a max-pooling process. This approach significantly reduces the memory needed to store the intermediate results.
Algorithm 2 Max-Pooling Layer
procedure POOLING
    for number of output feature maps (n) do
        for feature map rows (i) do
            for feature map columns (j) do
                $f'[n, i, j] = \max_{p, q \in [1:K]} \left( f[n, i+p, j+q] \right)$
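A possible Python rendering of the max-pooling procedure is sketched below. It is an illustration only: the stride is made an explicit parameter (the text mentions a typical value of 2), each feature map is pooled independently, and the function name and the 4x4 example are hypothetical.

```python
def max_pool(fmaps, k=2, stride=2):
    """Max-pooling over each feature map with a k x k window.

    fmaps[n][row][col] -> pooled[n][i][j]; windows that would cross the
    border of the feature map are skipped.
    """
    pooled = []
    for fmap in fmaps:
        rows, cols = len(fmap), len(fmap[0])
        out = [
            [
                max(fmap[i + p][j + q] for p in range(k) for q in range(k))
                for j in range(0, cols - k + 1, stride)
            ]
            for i in range(0, rows - k + 1, stride)
        ]
        pooled.append(out)
    return pooled


if __name__ == "__main__":
    fmap = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
    print(max_pool([fmap]))
```

With a 2x2 window and a stride of 2, each 4x4 map is reduced to 2x2, i.e. a quarter of the original storage.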
2.2.3 Fully-connected Layer
The FC layer consists of a classical ANN layer that receives as inputs all the OFMs of the previous layer(s). The activation function is computed on top of the result of an inner product and a bias offset (Algorithm 3).
The number of neurons of the last fully-connected layer is equal to the number of classes to predict. Usually, a normalization function is applied to give a probabilistic interpretation to the provided classification. One of the most common normalization functions is the softmax, which is given by:

\[ \theta_i = \frac{e^{y_i}}{\sum_{n=1}^{N} e^{y_n}} \tag{2.9} \]
Algorithm 3 Fully Connected Layer
procedure FULLYCONNECTED
    for number of output feature maps (n) do
        $f'[n] = \Phi\left( b[n] + \sum_{c=1}^{C} f[c] \cdot w[n, c] \right)$
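Algorithm 3 and the softmax of Equation 2.9 can be sketched together in Python. This is a minimal illustration with hypothetical names and values; subtracting the maximum before exponentiating is an addition for numerical stability that does not change the result of Equation 2.9.

```python
import math


def relu(u):
    """ReLU activation, used here as an example of the function Φ."""
    return u if u > 0 else 0.0


def fully_connected(f, w, b, activation):
    """Algorithm 3: f[c] inputs, w[n][c] weights, b[n] biases -> f'[n]."""
    return [
        activation(b[n] + sum(w[n][c] * f[c] for c in range(len(f))))
        for n in range(len(w))
    ]


def softmax(y):
    """Equation 2.9, computed on values shifted by max(y) for stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    features = [1.0, 2.0]
    weights = [[1.0, 0.0], [0.0, -1.0]]
    biases = [0.5, 0.5]
    scores = fully_connected(features, weights, biases, relu)
    print(scores, softmax(scores))
```

The softmax outputs are positive and sum to one, so the largest score can be read as the most probable class.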
2.2.4 Dropout
Dropout is a specific training technique used to reduce overfitting, therefore improving model generalization. Dropout acts as a regularizer, by randomly setting a percentage of the activations of a layer to zero during training [14, 15].
Dropout is not commonly used on the output of convolutional layers, since their shared weights act as regularizers themselves. However, some works (e.g. [16]) reported accuracy improvements by adding dropout in the convolutional layers (Figure 2.5), and state that applying dropout in the earlier layers provides noisy inputs for the subsequent layers, thus reducing overfitting.
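The basic mechanism can be sketched in a few lines of Python. This is a simplified illustration with a hypothetical function name: each activation is zeroed independently with probability `rate`, and the rescaling of the surviving activations that frameworks usually apply is left out because it is not described in the text.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    `rng` is a random.Random instance so that runs can be reproduced.
    """
    return [0.0 if rng.random() < rate else a for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout([0.3, 1.2, 0.7, 2.1, 0.5], rate=0.5, rng=rng))
```

A low `rate` for the earlier layers and a higher one for the later layers would correspond to the scheme illustrated in Figure 2.5.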
[Figure: convolutional layers followed by fully connected layers, each followed by a dropout layer with dropout values ranging from low to high; dropout is applied in all but the last layer.]
Figure 2.5: Dropout approach by [16], on a CNN.
2.2.5 CNN Architecture
Figure 2.6 shows a general CNN structure, which is typically composed of convolutional, pooling (sub-sampling) and FC layers. Those layers are represented as C(c, n, k), P(k, s) and FC(ne) respectively, where c stands for the number of channels, n the number of convolution kernels with k x k size (and OFMs), s corresponds to the value of the stride and ne is the number of output neurons.
[Figure: 32 x 32 RGB image, C1(3, 108, 5), P1(2, 2), C2(64, 108, 3), P2(2, 2), FC1(100), FC2(43).]
Figure 2.6: Example of a CNN topology with 2 convolutional layers (C1, C2), 2 pooling layers (P1, P2) and 2 FC layers (FC1, FC2).
2.2.6 Fixed-point arithmetic
Fixed-point format can be used to represent fractional values after the range of the values has been studied and evaluated. The Q notation is the fixed-point format used to represent such values in this work.
A Q format value is represented in a fixed-point bit format as Q[QI].[QF], where QI corresponds to the number of integer bits and QF to the number of fractional bits. The word length can be obtained by adding both parts (WL = QI + QF).
Negative values for QI represent extended resolution for fractional-only numbers, as in [17]. E.g. for an unsigned number, a negative QI gives the number of leading fractional zeros. Some format notation examples are shown in Table 2.1.
Table 2.1: Fixed-point format notation.
Decimal Number    Fixed-point format    Binary Number    Word Length
3.375             Q8.8                  11.011 000 00    16
0.078 125         Q−1.9                 0.000 101 000    8
−0.078 125        Q1.7                  1.111 011 0      8
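The entries of Table 2.1 can be reproduced with a short Python sketch. The helper names are hypothetical, and the sketch assumes round-to-nearest, a two's-complement encoding for negative values and no saturation; overflow handling is not covered.

```python
def to_fixed(value, qi, qf):
    """Bit pattern of `value` in Q[qi].[qf], as an unsigned integer.

    The value is scaled by 2**qf, rounded, and wrapped to WL = qi + qf
    bits, which yields the two's-complement pattern for negative values.
    """
    word_length = qi + qf
    scaled = round(value * (1 << qf))
    return scaled & ((1 << word_length) - 1)


def to_bits(value, qi, qf):
    """Same pattern as a zero-padded binary string of WL characters."""
    return format(to_fixed(value, qi, qf), "0{}b".format(qi + qf))


if __name__ == "__main__":
    print(to_bits(3.375, 8, 8))       # 16-bit word, 8 fractional bits
    print(to_bits(0.078125, -1, 9))   # 8-bit word, one implied leading zero
    print(to_bits(-0.078125, 1, 7))   # 8-bit word, sign bit plus 7 fractional bits
```

For Q−1.9 only 8 bits are stored, because the first fractional bit is known to be zero and is left implicit.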
2.3 CNN Development Frameworks
There are several frameworks for building CNNs, such as Caffe, TensorFlow, Theano and PyTorch [18, 19, 20, 21]. The network used in this work was built using Caffe, which is described below. Caffe provides well-documented examples for the main CNN architectures, and a fast way to train, test and deploy models. On top of Caffe, another framework called Ristretto can be used in order to quantize the weights and activations of the network to fixed-point values.
2.3.1 Caffe
Caffe is a deep learning framework developed and maintained by the Berkeley Vision and Learning Center (BVLC). It is written in C++, provides a Python interface and supports cuDNN v5 for Graphics Processing Unit (GPU) acceleration [22]. Moreover, Caffe supports both the training and the inference processes, using 32-bit floating-point networks.
2.3.1.1 Data Storage
Caffe stores and communicates data through 4-dimensional arrays named blobs.
Blobs are responsible for the dataflow in the network during the forward and backward passes, and provide a unified memory interface for holding batches of images, parameters and parameter updates [18]. Figure 2.7 presents a common representation of a 4D tensor (blob) with the following dimensions: image ID, channels, height and width.
Figure 2.7: 4D tensor representing a set of n images with c channels and w x h size.
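The 4D indexing can be made concrete with a small sketch. It assumes the row-major layout described in the Caffe documentation, in which the width index varies fastest; the function name blob_offset and the batch shape are hypothetical.

```python
def blob_offset(shape, n, c, h, w):
    """Flat memory offset of element (n, c, h, w) in a row-major N x C x H x W blob."""
    N, C, H, W = shape
    if not (0 <= n < N and 0 <= c < C and 0 <= h < H and 0 <= w < W):
        raise IndexError("index outside blob")
    return ((n * C + c) * H + h) * W + w


# Hypothetical batch: 64 RGB images of 32 x 32 pixels.
shape = (64, 3, 32, 32)
print(blob_offset(shape, 0, 0, 0, 1))  # neighbouring pixel in the same row
print(blob_offset(shape, 0, 1, 0, 0))  # first pixel of the next channel
print(blob_offset(shape, 1, 0, 0, 0))  # first pixel of the next image
```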
2.3.1.2 Layers
Networks are described in Caffe as a set of layers and blobs, as shown in Figure 2.8. A Caffe layer takes one or more blobs as input and produces one or more blobs as output [18]. Algorithm 4 illustrates the syntax used to create a convolutional layer in Caffe. This layer produces 108 OFMs, with a kernel size of 5x5 and a stride of 1. Since this is the first convolutional layer, the input data is provided by the data blob (bottom), which corresponds to the input image, and the OFMs are stored in the conv1 blob (top).
The weight filler is set to xavier, which initializes the weights with values drawn from a uniform distribution. The bias filler is set to constant, which sets all bias values to the same constant value. The values used for both activations (IFMs and OFMs) and weights are 32-bit floating-point.
Algorithm 4 Caffe Layer
layer {
  name: "convolutional1"
  type: "Convolution"
  bottom: "data"   # input blob
  top: "conv1"     # output blob
  convolution_param {
    num_output: 108   # number of OFMs
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
[Figure 2.8 labels: bottom blob "data"; convolutional layer "convolutional1"; top blob "conv1".]
Figure 2.8: A Caffe convolutional layer with top and bottom blobs.
2.3.2 Ristretto
Ristretto is an extension of Caffe that allows fixed-point networks to be trained, tested and fine-tuned, by re-implementing Caffe layers and simulating reduced word-width arithmetic [23]. Ristretto takes a trained model as input and automatically produces a condensed version of the network by performing multiple quantization-accuracy tests, in order to find a good trade-off between compression rate and network accuracy. The network is then retrained (fine-tuned) using the reduced-precision formats obtained in the first phase. The network has to adapt its model to the reduced-precision fixed-point parameters, that is, learn how to classify the same images with fewer degrees of freedom.
The network description files can be changed to quantize different layers. The bit-width used for each layer, as well as other parameters, can be set in the network configuration file. Algorithm 5 represents the same layer as Algorithm 4, but with reduced-precision fixed-point values for both activations and weights. An additional field (quantization_param) is added to the prototxt description in order to define the word size in bits (bw) and the fractional length in bits (fl). The layer input and output use 16 bits (Q8.8) and the weights use 8 bits (Q1.7).
Algorithm 5 Ristretto Layer
layer {
  name: "conv1"
  type: "ConvolutionRistretto"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 108
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
  quantization_param {
    bw_layer_in: 16    # input word size (bits)
    bw_layer_out: 16   # output word size (bits)
    bw_params: 8       # weight word size (bits)
    fl_layer_in: 8     # input fractional length (bits)
    fl_layer_out: 8    # output fractional length (bits)
    fl_params: 7       # weight fractional length (bits)
  }
}
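The effect of the bw and fl fields on a single value can be illustrated with a short sketch. This is not Ristretto's implementation: the function quantize is hypothetical, and it assumes signed values, a non-negative fl, round-to-nearest (Python's round resolves exact ties to the even code) and saturation at the ends of the range.

```python
def quantize(value, bw, fl):
    """Value actually stored when `value` is kept in bw bits with fl fractional bits."""
    scale = 1 << fl
    lowest = -(1 << (bw - 1))        # most negative integer code
    highest = (1 << (bw - 1)) - 1    # most positive integer code
    code = round(value * scale)
    code = max(lowest, min(highest, code))  # saturate instead of wrapping
    return code / scale


# Weights in Q1.7 (bw = 8, fl = 7) and activations in Q8.8 (bw = 16, fl = 8).
print(quantize(0.3, 8, 7))     # small rounding error
print(quantize(1.5, 8, 7))     # saturates just below 1.0
print(quantize(3.375, 16, 8))  # exactly representable
```

With fl = 7, the weight resolution is 1/128, so any weight outside the interval [-1, 1) is clipped.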
Chapter 3
TSR CNN methods
This chapter starts by presenting the GTSRB, how it is organized, and some techniques to augment and standardize the dataset in Section 3.1. Section 3.2 explores different networks for traffic sign classification and evaluates the accuracy, number of convolutional and FC layers, and the number of weights of such networks.
3.1 GTSRB Dataset
Reliable TSR systems rely on an appropriate dataset in order to recognize each traffic sign class (Figure 3.1). GTSRB is a benchmark dataset comprising a set of traffic signs from the German traffic sign system.
The traffic signs have simple shapes, sizes and colours, and comply with the European Union traffic system standards and regulations, which makes GTSRB a reasonable option.
Figure 3.1: 43 GTSRB traffic sign classes [4].
3.1.1 Data Organization
The GTSRB dataset is organized according to the following criteria:
1. 43 traffic sign classes;
2. 39,209 training images;
3. 12,630 testing images.
Each image of the dataset, in Portable Pixel Map (PPM) format, represents only one traffic sign.
Images are not always square and their sizes vary between 15x15 and 250x250 pixels. It is important to note that each image contains a border of 10% around the traffic sign (Figure 3.2), provided for methods that use edge detection [4].
Figure 3.2: Image border of 10% around a traffic sign example.
3.1.2 Data Preparation
The preparation of the dataset is divided into 3 phases: data splitting, augmentation and preprocessing. Data splitting consists of dividing the main dataset into three disjoint sets: train, validation and test. GTSRB provides a train and a test set; the validation set is usually obtained by holding out part of the train set (typically 30%). Secondly, the training dataset is augmented by applying a set of image transformations and by upsampling some classes. This process is applied only to the training set, as the new training samples yield a model that is more robust to potential deformations in the test set. Lastly, the training data is preprocessed before entering the network in order to produce a more uniform dataset, since the images differ in size, contrast and brightness. The same preprocessing is applied to the validation and test sets. Figure 3.3 summarizes the data preparation workflow.
The GTSRB dataset only includes a training and a test set. However, since both sets are drawn from the same distribution, the test dataset can alternatively be divided into a validation set and a test set. Sets of images are drawn from the same distribution if they are collected under the same conditions.
The GTSRB dataset has a different number of samples for each class (e.g. class 1 has 210 samples while class 2 has 2220), which makes it imbalanced, as shown in Figure 3.4. Since classes with more samples contribute more to the loss function than classes with fewer samples, the trained classification network may only learn to classify the "strong" classes correctly. Moreover, the model may not generalize well to classes with fewer samples. [1] describes three methods to address this problem: additional samples of minority classes are generated to match the class with the largest number of samples (upsampling); samples of majority classes are reduced to match the number of samples of the minority classes (downsampling); or a combination of both, thus creating a final balanced, augmented dataset.
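Upsampling, the first of these methods, can be sketched as follows. The function upsample is hypothetical and only duplicates sample indices at random; in practice the duplicated images would also be transformed so that the added samples are not exact copies.

```python
import random
from collections import Counter


def upsample(labels, seed=0):
    """Indices into `labels` in which every class is as frequent as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)  # keep every original sample once
        # Draw extra samples, with replacement, until the class reaches `target`.
        balanced.extend(rng.choice(indices) for _ in range(target - len(indices)))
    return balanced


# Two classes with the sample counts quoted in the text.
labels = [1] * 210 + [2] * 2220
balanced = upsample(labels)
counts = Counter(labels[i] for i in balanced)
print(counts[1], counts[2])
```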
[Figure 3.3 content. Splitting data: the GTSRB image collection is divided into a training and a testing dataset, the latter serving for both validation and test. Augmentation: the training dataset is augmented by performing different image transformations and/or upsampling classes. Preprocessing: before feeding the final dataset into the network, all images are preprocessed. Blocks: Main Dataset; Train Dataset; Test Dataset; Augmented Dataset 1; Preprocessed Dataset 1; Preprocessed Test Dataset.]
Figure 3.3: Jittered image examples.
Different types of image transformations can be used to augment the dataset in order to improve the network accuracy and to make it more robust to small changes in the input. [12] produces an augmented dataset by making four copies of the original dataset and applying translation, rotation and scaling deformations to each of them. Considering all four additional versions of each image, the augmented dataset consists of 126,750 training samples. [11] uses the same transformations on each preprocessed image but applies them on-the-fly during training. Besides the transformations mentioned above, recent works [24] include others such as smoothing, motion blur and random crop in order to simulate the conditions of a camera mounted on a car.
Since the size of the images varies across the dataset, and CNN implementations usually require all training images to be of the same size, the images must be downsampled or upsampled before being fed into the network. Contrast normalization is important as well, since the dataset exhibits high contrast variation between images. [12] resizes all images to 32x32 pixels and converts them to the YUV color space; contrast normalization is done only on the Y channel. [11] uses a 48x48 pixel format, and contrast normalization is done in a MATLAB color space that has image intensity as one of its components. After normalization, the image is converted back to RGB.
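The two preprocessing steps can be sketched in plain Python. The sketch makes several assumptions that are not taken from [11] or [12]: nearest-neighbour resizing, the BT.601 luma weights for the Y channel, and a simple min-max stretch as a stand-in for the contrast normalization used in those works. All function names are hypothetical.

```python
def resize_nearest(channel, out_h, out_w):
    """Resize a 2D list to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def luma(pixel):
    """Y component of an (R, G, B) pixel using BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def normalize_contrast(channel):
    """Stretch the values of a 2D list linearly to the range [0, 1]."""
    flat = [value for row in channel for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:  # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in channel]
    return [[(value - lowest) / (highest - lowest) for value in row] for row in channel]


# Tiny 2 x 2 low-contrast RGB image, upsampled to 4 x 4.
image = [[(50, 50, 50), (100, 100, 100)], [(150, 150, 150), (200, 200, 200)]]
y = [[luma(pixel) for pixel in row] for row in image]
normalized = normalize_contrast(y)
resized = resize_nearest(normalized, 4, 4)
```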
3.2 TSR CNNs
CNNs are one of the most effective methods to solve image classification problems. Traditional classification approaches are based on hand-crafted features (e.g. Histogram of Oriented Gradients (HOG)) and regular classifiers (e.g. Random Forests) [2]. However, recent work on pedestrian gender recognition has shown that hand-crafted features, although well behaved on small homogeneous datasets, do not perform effectively on larger and heterogeneous datasets. On the other hand, CNNs can produce more general features, which translates into more robust architectures [3].
Figure 3.4: Number of samples per GTSRB class.
In general, a CNN architecture (Figure 3.5) starts with a preprocessing stage, where image resizing and different types of normalization are applied. [11] uses image adjustment and adaptive histogram equalization to increase image contrast, and subsamples or upsamples the input image to 48x48. The second stage consists of a combination of convolution, pooling and non-linearity layers. The final stage is a set of FC layers followed by a function that assigns a probability to each class, usually a softmax [25].
Figure 3.5: General CNN architecture for TSR.
The choice of architecture is a major factor in the recognition of traffic signs using CNNs. [26] tested 6 different convolutional network models, with different input sizes and numbers of convolutional layers and, consequently, different numbers of parameters. An architecture with 3 convolutional and 3 FC layers yielded the best results for the set of experiments performed, indicating that beyond a certain number of convolutional layers the result does not improve.
The Sermanet [12] and IDSIA [11, 27] teams introduced CNNs in the GTSRB competition. [12] presented a multi-scale CNN architecture with 1,437,443 parameters. The structure combines the feature maps obtained from stages one and two, and feeds them both into the first FC layer, as described in Figure 3.6. The authors suggest that this type of structure benefits from combining both global and local features. A recognition rate of 98.97% was first reached and, after a few improvements to the network's architecture, they achieved a new record of 99.17%. Their final version used a 2-layer classifier with 100 hidden units, instead of the previous single-layer classifier, and a single input channel Y (ignoring color information) instead of a 3-channel input with the full YUV color space.
Figure 3.6: Multi-scale CNN, adapted from [12].
[11] implemented a Multi-Column Deep Neural Network (MCDNN) that uses a committee of 25 CNNs (Figure 3.7) trained on differently preprocessed data. This type of architecture shows that combining various DNNs trained on differently preprocessed data can give excellent results; in fact, it won the final competition phase of GTSRB 2011 with a recognition rate of 99.46%. It had 1,543,443 parameters, half of which were in the FC layers.
Figure 3.7: MCDNN, adapted from [11].
Both the [12] and [11] architectures have a large number of arithmetic operations, and they use rectified sigmoid and scaled tanh respectively. Since those activation functions are computationally heavy, [28] and [24] proposed two different approaches to address those problems.
[28] proposed a new architecture with fewer parameters (1,162,284) that takes advantage of computationally cheaper activation functions such as ReLU. The architecture is a committee of 20 CNNs and produced an accuracy of 99.65%. However, it requires a higher number of memory accesses, due to the Local Response Normalization (LRN) layers and to zero-padding the image before passing it to the convolutional layer.
[24] focuses on two points: the use of the Parametric Rectified Linear Unit (PReLU), and the division of one of the convolutional layers into two individual blocks. For the first point, the Leaky Rectified Linear Unit (LReLU) is a variation of Equation 2.5 that allows a non-zero gradient when the unit is not active, therefore introducing a multiplier instead of only the comparator. PReLU takes the concept of LReLU and lets the multiplication parameter a be learned along with the other parameters of the neural network:
\[
\mathrm{PReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
a x & \text{otherwise}
\end{cases}
\tag{3.1}
\]
For the second point, the inputs from stage 2 are divided into two equal parts and fed into two individual units of stage 3. Since the number of filters in each block is halved, so is the number of parameters of that stage. A graphical representation is provided in Figure 3.8 to better illustrate this concept.
Figure 3.8: MCDNN, adapted from [24].
Finally, this architecture differentiates the traffic signs by their pictographs; for that reason, a grayscale color space is used. An accuracy of 99.55% was reached on the classification part, which represents an increase of approximately 0.5% relative to the authors' previous work [29].
[25] proposed a network architecture with a stride of 2 in the convolutional layers and an Exponential Linear Unit (ELU) as activation function. The first point takes the subsampling property usually provided by the pooling layer and builds it into the convolutional layers. As for the second point, an ELU, which is another variation of Equation 2.5, is used in order to achieve better results:
\[
\mathrm{ELU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\exp(x) - 1 & \text{otherwise}
\end{cases}
\tag{3.2}
\]
The TSR algorithm shows an impressive recognition rate of 99.94% for the combination of detection and classification. However, the detection part is done using the Generalized Hough Transform (GHT), there are no details about the accuracy of the classification part alone, and the classifier covers only 16 classes.
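The activation functions discussed in this section differ only in how negative inputs are handled, which the following sketch makes explicit. The function names, the fixed LReLU slope of 0.01 and the PReLU value a = 0.25 are illustrative assumptions; in PReLU, a would be learned during training.

```python
import math


def relu(x):
    """Comparator only: negative inputs are set to zero."""
    return x if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """LReLU: a fixed, small multiplier for negative inputs (0.01 is an assumed value)."""
    return x if x > 0 else slope * x


def prelu(x, a):
    """PReLU (Equation 3.1): same form as LReLU, but `a` is a trained parameter."""
    return x if x > 0 else a * x


def elu(x):
    """ELU (Equation 3.2): exp(x) - 1 for non-positive inputs."""
    return x if x > 0 else math.expm1(x)


for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x), prelu(x, a=0.25), elu(x))
```

ReLU and LReLU need at most a comparator and a multiplier, whereas ELU requires an exponential for negative inputs, which is more expensive to implement in hardware.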
Table 3.1 shows the accuracies achieved by each of the referred networks, while Table 3.2 summarizes the main characteristics of the architectures analyzed.
Table 3.1: Overview by date.
Accuracy  Year  Team                   Method                          No. of classes
99.94%    2017  Shustanov et al. [25]  GHT + CNN                       16
99.55%    2016  Aghdam et al. [24]     Plain CNN                       43
99.65%    2014  Jin et al. [28]        Ensemble of 20 CNN              43
99.46%    2012  IDSIA [11]             Committee of 25 DNN / MCDNN     43
99.17%    2011  Sermanet [12]          Multi-Scale CNN                 43
Table 3.2: Metrics.
Metrics               Sermanet [12]      IDSIA [11]    Jin [28]   Aghdam [24]  Shustanov [25]  Fulco [26]
Input size            1x32x32            3x48x48       3x47x47    1x44x44      -               32x32
Color space           Grayscale (Y)      RGB           RGB        Grayscale    HSV             -
No. of CONV layers    3                  3             3          3            3               3
Filter sizes          -                  4-7           3-5        3-5          3               -
No. of filters        108                100-250       70-180     64-128       16-64           32-128
Stride                -                  1             1          1            2               -
Activation function   rectified sigmoid  scaled tanh   ReLU       PReLU        ELU             -
POOL type             -                  MAX           MAX        MAX          -               MAX
No. of FC layers      2                  2             2          3            2               4
Hidden neurons        100                300           200        300+300      512             2048+2014+512
Output neurons        43                 43            43         43           16              43
No. of weights        1,437,791          1,543,443     1,162,284  -            -               3,788,907
3.3 Conclusion
This chapter explored the GTSRB dataset and evaluated multiple networks for TSR in order to find the most promising ones for future hardware implementation. The GTSRB dataset is unbalanced, such that some classes have a greater contribution than others. Therefore, different methods to augment and standardize the dataset were detailed in this chapter. Different networks were explored, and the Jin and Sermanet networks were chosen to be analyzed in detail in the following chapters. The specifications and structure of both networks are given in Figure 3.9.
(a) 32 x 32 RGB image -> C1(3, 108, 5), ReLU1 -> P1(2, 2) -> C2(108, 108, 5), ReLU2 -> P2(2, 2) -> ... -> FC1(100) -> FC2(43)
(b) 47 x 47 RGB image -> C1(3, 70, 5), ReLU1 -> P1(3, 2), LRN1 -> C2(70, 110, 3), ReLU2 -> P2(3, 2), LRN2 -> C3(110, 180, 3), ReLU3 -> P3(3, 2), LRN3 -> ... -> FC1(200) -> FC2(43)
Figure 3.9: Figures (a) and (b) correspond to the Sermanet and Jin models respectively.
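The C(c, n, k), P(k, s) and FC(ne) notation is enough to estimate feature-map sizes and weight counts. The sketch below walks through model (a) under assumptions that are not stated in the figure: convolutions use stride 1 without padding, pooling output sizes are rounded down, the "..." is treated as a direct flattening into FC1, and biases are ignored. The function count_weights is hypothetical, and the totals need not match the weight counts reported in Table 3.2.

```python
def count_weights(input_size, layers):
    """Return (kind, output size, weights) for each layer of a C/P/FC description."""
    size = input_size   # width and height of the current feature maps
    depth = None        # number of feature maps, set by the first C layer
    neurons = None      # number of outputs of the previous FC layer
    report = []
    for layer in layers:
        kind, args = layer[0], layer[1:]
        if kind == "C":
            c, n, k = args
            size = size - k + 1          # stride 1, no padding (assumption)
            depth = n
            weights = c * n * k * k
        elif kind == "P":
            k, s = args
            size = (size - k) // s + 1   # rounded down (assumption)
            weights = 0
        else:  # "FC"
            (ne,) = args
            inputs = neurons if neurons is not None else depth * size * size
            neurons = ne
            size = 1
            weights = inputs * ne
        report.append((kind, size, weights))
    return report


model_a = [("C", 3, 108, 5), ("P", 2, 2), ("C", 108, 108, 5), ("P", 2, 2),
           ("FC", 100), ("FC", 43)]
report = count_weights(32, model_a)
for kind, size, weights in report:
    print(kind, size, weights)
print("total weights:", sum(weights for _, _, weights in report))
```

Under these assumptions, the second convolutional layer and the first FC layer hold most of the weights.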
Chapter 4
FPGA-based CNN
This chapter presents different approaches for designing efficient hardware for CNN acceleration on FPGAs. The chapter starts with an overview of the target device and its most relevant features in Section 4.1. Section 4.2 provides an overview of the most relevant work regarding the acceleration of CNNs. In particular, Section 4.2.1 describes the data organization, while Section 4.2.2 provides two solutions for kernel-level computation. Section 4.2.3 details different types of parallelization for such architectures. A final performance comparison of the implementations analyzed is provided at the end of the chapter.
4.1 SoC FPGA
A SoC FPGA device integrates microprocessors with FPGA technology. FPGAs provide fine-grained components and special-purpose components to accelerate typical patterns. This section gives insight into the most relevant features of the target hardware platform chosen to implement the CNN architecture, and how they can be used to maximize its capabilities.
4.1.1 Zynq-7000 SoC FPGA
The designed architecture makes use of several built-in components, and therefore it is important to have a thorough understanding of the target device’s main features and specifications.
The demonstration platform is a PYNQ-Z2 board from Digilent. The architecture was designed with a focus on the Zynq-7000 All Programmable SoC family (from this point on simply referred to as Zynq), which combines ARM's processor technology (Processing System (PS)) with Programmable Logic (PL). The Zynq product range comprises 7 different devices, all having the same basic architecture. The main differences are in the type and quantity of PL in each device. A block diagram of each device's PS, PL and interconnections is presented in Figure 4.1. The PS includes an Application Processing Unit (APU) with a set of associated processing resources. The APU contains a dual-core ARM Cortex-A9 processor, with a clock frequency of up to 1 GHz, depending on the particular Zynq device. Each ARM core has a memory hierarchy of two 32 KB Level 1 (L1) cache memories (one for instructions and one for data), and a 512 KB Level 2 (L2) cache that is shared between both cores. Additionally, each core includes a Floating Point Unit (FPU). A 256 KB On-Chip Memory (OCM) is provided by the PS and is also shared between the cores. A Snoop Control Unit (SCU) provides the interface between the ARM cores, the L1 and L2 caches, and the OCM. A Double Data Rate (DDR) memory controller allows straightforward interaction with an external DDR memory, offering access to an address space of up to 1 GB. Outside the APU region, a central interconnect connects the peripherals, the DDR memory, the L2 cache and the PL together. Besides the interconnect unit, an Input/Output (I/O) multiplexer combines several common I/O peripherals. For the PL-PS interconnection there is a set of Advanced Extensible Interface (AXI) ports:
• 2 General Purpose (GP) 32-bit AXI Master Ports;
• 4 High Performance (HP) 32/64-bit AXI Slave Ports.
[Figure 4.1 block labels: Processing System (two ARM processors, each with MMU, L1(D) and L1(I) caches; L2 Cache; OCM; External Memory Interface; Common Peripherals); Programmable Logic (Common Accelerators, Custom Accelerators); High Performance AXI Ports; General Purpose AXI Ports.]
Figure 4.1: Zynq-7000 SoC main block diagram, adapted from [30].
The PL is based on Xilinx Artix-7 or Kintex-7 FPGA technology, depending on the Zynq family class. The PL section can be used to instantiate standard or custom IP blocks, using standard hardware description languages such as VHDL and Verilog. The main primitives of the Zynq device are: Look-Up Tables (LUTs) (up to 277,400), Flip-Flops (FFs) (up to 554,800), Digital Signal Processing blocks (DSPs) (up to 2,020) and Block Random-Access Memories (BRAMs) (up to 3,020 KB). Table 4.1 lists the Zynq-7000 devices and the PL resources available in each of them. The basic logic primitive is a 6-input LUT; each Configurable Logic Block (CLB) contains eight 6-input LUTs and 16 FFs. The BRAMs provide on-chip data storage for dense memory requirements, and can be used to implement Random Access Memory (RAM), Read-Only Memory (ROM) and First In First Out (FIFO) buffers [31]. A 36-Kb BRAM can be configured as either two independent 18-Kb BRAMs or one 36-Kb BRAM. The BRAM can also be configured as simple dual port or true dual port, enabling independent write and read transfers on both ports. By combining two or more BRAMs, larger memories can be formed. Therefore, the Zynq device offers up to 3,020 KB of extensible BRAM memory. DSPs primarily feature a pre-adder/subtractor, a multiplier, and a post-adder/subtractor/logic unit. Figure 4.2 describes the basic functionality of the DSP48E1 block. Using the inputs A, B, C and D, a set of functions can be implemented, such as P = (A x B) + P' or P = A:B + C + PCIN, where A:B denotes the concatenation of A and B. Using the PCIN input and the post-adder, a cascade of DSP48E1 blocks can be formed [32].
[Figure 4.2 block labels: inputs A, B, C and D with input registers; pre-adder; 25x18 multiplier; A:B path; PCIN cascade input; post-adder / logic unit; output register P; all inside the DSP48E1 block.]
Figure 4.2: Basic DSP48E1 Block Functionality, adapted from [33].
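The accumulate function P = (A x B) + P' is the core of a convolution, and its behaviour can be modelled in a few lines. This is a behavioural sketch, not a description of the hardware: it assumes signed two's-complement operands, uses the 25x18 multiplier width shown in Figure 4.2, and assumes a 48-bit result width, which is not stated in the text above. The function names are hypothetical.

```python
def wrap_signed(value, bits):
    """Interpret `value` modulo 2**bits as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(pairs):
    """Accumulate P = (A x B) + P' over a sequence of (a, b) operand pairs."""
    p = 0
    for a, b in pairs:
        a = wrap_signed(a, 25)   # A input of the 25x18 multiplier
        b = wrap_signed(b, 18)   # B input of the 25x18 multiplier
        p = wrap_signed(a * b + p, 48)   # assumed 48-bit accumulator
    return p


# Dot product of three activation/weight pairs (integer fixed-point codes).
print(mac([(1, 2), (3, 4), (-5, 6)]))
```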
Table 4.1: Zynq-7000 programmable logic resources.
Device  Z-7010   Z-7015   Z-7020    Z-7030    Z-7035    Z-7045    Z-7100
LUTs    17,600   46,200   53,200    78,600    171,900   218,600   277,400
FFs     35,200   92,400   106,400   157,200   343,800   437,200   554,800
BRAMs   240 KB   380 KB   560 KB    1,060 KB  2,000 KB  2,180 KB  3,020 KB
DSPs    80       160      220       400       900       900       2,020
4.1.2 AXI Protocol
The communication protocol used by the Zynq device is AXI4, which is part of the ARM Advanced Microcontroller Bus Architecture (AMBA) specification. The AXI protocol manages all the data transfers between the device's PS and PL [34]. This section presents some introductory concepts and general operating procedures of the AXI protocol, more specifically of the AXI4-Lite and AXI4-Stream interfaces.
All AXI channels operate according to a valid/ready handshake. The sender asserts the valid signal when it is driving valid information, and the receiver asserts the ready signal when it is able to accept new data. Therefore, data is only transferred when both signals are active. There is another control signal, referred to as "tlast", which is used in the data channel and indicates that the value currently being transferred is the last element of a given burst of data. AXI transactions use 5 different channels: read address channel, read data channel, write address channel, write data channel and write response channel. The protocol operates through a master-slave communication model. A master-to-slave transfer is a write process, whereas a slave-to-master transfer is a read process. Master and slave cannot swap roles in the middle of a transfer.
A read transaction (Figure 4.3) begins with a request from the master, which provides the initial address of the data and the control information in the read address channel. The slave responds with the requested data in the read data channel.
Figure 4.3: AXI read transaction [34].
A write transaction (Figure 4.4) starts with a request from the master, which provides the destination address of the data as well as extra control information (in the write address channel).
The slave validates the request and the data is transferred in the write data channel. Whether the transfer was successful or an error occurred, the slave must report the outcome through the write response channel.
Figure 4.4: AXI write transaction [34].
4.1.3 AXI4 interfaces: Lite and Stream
The AXI4 version introduced some interfaces based on the native protocol, each one specific for different types of applications. This work will focus on two of those interfaces: AXI4-Lite and AXI4-Stream.
The AXI4-Lite interface corresponds to a lighter implementation of the AXI4-Full. This interface has the same number of channels as the previous one, however, it only supports data bus widths of 32-bit or 64-bit. The protocol implementation requires a smaller number of logic resources and the conversion to AXI4-Full is easily obtained by using the AXI interconnect blocks. The AXI4-Lite interface is, thus, suitable for short-lived transfers that require small amounts of data and minimal bandwidth requirements (e.g. the communication with the configuration registers in components).
The AXI4-Stream interface does not have an address phase, and the transaction flow is always from master to slave. Therefore, only the write data channel is used, and a transfer starts as soon as the master is ready to send and the slave is ready to receive. Similar to the AXI4-Lite protocol, this variant allows a low-resource design. Figure 4.5 shows an AXI4-Stream transaction. Values a1, a2, b1 and b2 are transmitted at times T2, T3, T6 and T7 respectively, because both the valid and ready signals are asserted on the rising clock edge. Values a2 and b2 are the last values of their data packets, since "tlast" is asserted along with the valid and ready signals.
[Figure 4.5 timing diagram: signals S_AXIS_ACLK, S_AXIS_TDATA, S_AXIS_TVALID, S_AXIS_TREADY and S_AXIS_TLAST over clock cycles T0 to T7; data values a1, a2, b1 and b2.]
Figure 4.5: Time diagram for an AXI4-Stream transaction, adapted from [34].
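The handshake rule can be checked with a small cycle-by-cycle model of the receiving side. The function receive is hypothetical, and the signal values in the idle cycles (T0, T1, T4 and T5) are assumptions, since the text only states when the transfers occur.

```python
def receive(cycles):
    """Collect packets from (tdata, tvalid, tready, tlast) samples, one per rising edge."""
    packets, current, times = [], [], []
    for t, (tdata, tvalid, tready, tlast) in enumerate(cycles):
        if tvalid and tready:        # a transfer happens only when both are asserted
            current.append(tdata)
            times.append(t)
            if tlast:                # tlast closes the current packet
                packets.append(current)
                current = []
    return packets, times


cycles = [
    (None, 0, 0, 0),  # T0: idle
    ("a1", 1, 0, 0),  # T1: master valid, slave not ready (assumed)
    ("a1", 1, 1, 0),  # T2: a1 transferred
    ("a2", 1, 1, 1),  # T3: a2 transferred, end of packet
    (None, 0, 1, 0),  # T4: slave ready, nothing valid (assumed)
    ("b1", 1, 0, 0),  # T5: master valid, slave not ready (assumed)
    ("b1", 1, 1, 0),  # T6: b1 transferred
    ("b2", 1, 1, 1),  # T7: b2 transferred, end of packet
]
packets, times = receive(cycles)
print(packets, times)
```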
4.2 CNN Hardware Implementations
In the previous section, the target platform was described, as well as the communication protocol. Several of the analyzed works are based on SoC FPGAs [7, 35, 36]. This section provides an overview of some of the most relevant work on CNN hardware implementations, exploring the data organization, different types of kernel computation and the different levels of parallelism of such architectures.
4.2.1 Data organization
In order to have an efficient memory organization, it is necessary to balance on-chip memory resources and external memory bandwidth. External memory is capable of storing large amounts of data, with the disadvantage of having limited bandwidth. On the other hand, the internal FPGA memory is limited but provides high bandwidth. Usually, the FPGA internal memory is insufficient to cache all the activations and weights of complex networks.
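A rough feasibility check illustrates this limitation, using the BRAM capacities of Table 4.1 and the weight counts of Table 3.2 taken at face value. The sketch assumes 1 KB = 1024 bytes, ignores activations, biases and BRAM packing granularity, and uses hypothetical function names.

```python
# BRAM capacity per device, in KB (Table 4.1).
BRAM_KB = {"Z-7010": 240, "Z-7015": 380, "Z-7020": 560, "Z-7030": 1060,
           "Z-7035": 2000, "Z-7045": 2180, "Z-7100": 3020}

# Number of weights per network (Table 3.2).
WEIGHTS = {"Sermanet [12]": 1437791, "IDSIA [11]": 1543443, "Jin [28]": 1162284}


def weight_kb(n_weights, bits):
    """Storage needed for the weights alone, in KB."""
    return n_weights * bits / 8 / 1024


def fits(n_weights, bits, device):
    """True if the weights alone fit in the BRAM of `device`."""
    return weight_kb(n_weights, bits) <= BRAM_KB[device]


for name, n_weights in WEIGHTS.items():
    for bits in (32, 8):
        devices = [d for d in BRAM_KB if fits(n_weights, bits, d)]
        print(name, bits, "bits:", round(weight_kb(n_weights, bits)), "KB, fits in", devices)
```

Under these assumptions, none of the three networks fits on-chip with 32-bit weights, and 8-bit weights fit only in the larger devices, before any space is reserved for activations.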