Evaluation of network structure - FPGA-Based Traffic-Sign Classification. Electrical and Comput

This chapter compares the accuracy and resource requirements of the Sermanet and Jin networks selected in Chapter 3. Section 5.1 describes how the GTSRB dataset was prepared for training, the training flow used, and the accuracy obtained. Section 5.2 presents a quantization analysis, which is used to compare the accuracy and memory requirements of the two networks. Because quantization can reduce the initial accuracy, a final finetuning step is performed for both networks.

5.1 Network Training

Network training is divided into two phases: dataset preparation and training. To prepare the two network models, the same sequence of four steps was followed for each of them (Figure 5.1). First, the input dataset was split, augmented and preprocessed, as described in Section 5.1.1. Then, the network was trained using Caffe with 32-bit floating-point values for both weights and activations. Next, the network was quantized using Ristretto to find the best reduced-precision configuration for the weights and activations. Finally, the network was finetuned, that is, retrained using the reduced-precision values obtained from quantization.

[Figure: stages Dataset preparation, Training, Quantization and Finetune; blocks Initial Dataset, Train Dataset, Test Dataset, Augmented Train Dataset, Preprocessed Train Dataset, Preprocessed Test Dataset and Network Model]

Figure 5.1: Training Flow.

5.1.1 Dataset preparation

The dataset preparation starts by dividing the initial dataset into a training dataset and a test dataset. Both sets were obtained from the same distribution, so the test set was also used as the validation set. The next step was to produce two augmented training datasets, one per network. Each dataset was augmented by upsampling all classes to 3000 samples, in order to equalize the contribution of each class, giving a total of 129000 input images per dataset. The classes were upsampled by applying the image transformations recommended by the authors of each model [12, 28]. The Jin augmented dataset was produced by drawing random samples from each class and applying three random transformations to each sample: translation within 10% of the image size, rotation between −5° and 5°, and scaling by a factor between 0.9 and 1.1. The Sermanet augmented dataset follows a similar approach: samples are randomly perturbed in position ([-2, 2] pixels) and rotated by an angle between −15° and 15°. In the last step, all images were preprocessed, including the augmented training dataset and the test set. The images were resized to 47x47 pixels for the Jin model or 32x32 pixels for the Sermanet model. Then, histogram equalization was applied for both models. The final images are in the RGB color space, with pixel values mapped to the [0, 1] range.
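The augmentation ranges above can be sketched as follows. This is only an illustrative sketch, not the scripts used in this work: the function names are hypothetical, the 43-class count is inferred from 129000 / 3000, and drawing every parameter from a uniform distribution is an assumption (the text only gives the ranges). The image warping itself is left out.

```python
import random

NUM_CLASSES = 43         # inferred from the text: 129000 images / 3000 per class
TARGET_PER_CLASS = 3000  # every class is upsampled to this size


def jin_transform_params(rng, image_size):
    """Draw one random transformation for the Jin augmented dataset.

    Assumption: each parameter is sampled uniformly within its range.
    """
    max_shift = 0.10 * image_size  # translation within 10% of the image size
    return {
        "dx": rng.uniform(-max_shift, max_shift),
        "dy": rng.uniform(-max_shift, max_shift),
        "angle_deg": rng.uniform(-5.0, 5.0),
        "scale": rng.uniform(0.9, 1.1),
    }


def sermanet_transform_params(rng):
    """Draw one random perturbation for the Sermanet augmented dataset."""
    return {
        "dx": rng.uniform(-2.0, 2.0),   # position perturbed within [-2, 2] pixels
        "dy": rng.uniform(-2.0, 2.0),
        "angle_deg": rng.uniform(-15.0, 15.0),
    }


def upsampling_plan(class_counts, target=TARGET_PER_CLASS):
    """Number of augmented samples each class still needs to reach `target`."""
    return {label: max(0, target - count) for label, count in class_counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    print(jin_transform_params(rng, image_size=47))
    print(sermanet_transform_params(rng))
    # Hypothetical class sizes, for illustration only.
    print(upsampling_plan({"class_a": 210, "class_b": 2250}))
    print(NUM_CLASSES * TARGET_PER_CLASS)
```

Each drawn parameter set would then be handed to an image library of choice to warp a randomly selected sample of the class, until the class reaches 3000 images.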

5.1.2 Training results

The training phase consists of training the Jin and Sermanet models individually and comparing their accuracy. The hyper-parameters used for training, taken from the original authors [12, 28], are presented in Table 5.1. The activation function of the Sermanet network was changed from rectified sigmoid to ReLU to obtain a less computationally expensive model. To improve generalization, dropout is applied to the output of every layer except the last, following the approach presented in [16]. The probability of a neuron being dropped is 10% in the first convolutional layer and increases by 10 percentage points in each subsequent convolutional layer. The first FC layer uses a fixed probability of 50%. Figure 5.2 shows both models with their additional dropout layers highlighted in blue. Those layers are represented as D(p), where p is the probability of a neuron being dropped. Training runs for 20000 iterations with a batch size of 64 samples for both networks, and the accuracy is recorded every 500 iterations. Figure 5.3 shows that the Sermanet model reaches a maximum accuracy of 98.17% and the Jin model 96.97%. Furthermore, the Sermanet model converges faster towards the expected accuracy and also ends at a higher value.

Table 5.1: List of hyper-parameters used by both networks.

Target Network   Base Learning Rate   Learning Rate   Weight Decay
Sermanet         0.001                0.05            0.5
Jin              0.1                  0.001           0.1

[Figure: (a) Sermanet model with dropout layers D1(0.1), D2(0.2) and D3(0.5); (b) Jin model with dropout layers D1(0.1), D2(0.2), D3(0.3) and D4(0.5)]

Figure 5.2: Figures (a) and (b) correspond to the Sermanet and Jin models, respectively, with their additional dropout layers.
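The dropout rule described in Section 5.1.2 can be written as a short helper. This is a minimal sketch: the function name is hypothetical, and the number of convolutional layers passed in (two for Sermanet, three for Jin) is an assumption read off the dropout labels in Figure 5.2.

```python
def dropout_schedule(num_conv_layers, first_p=0.1, step=0.1, fc_p=0.5):
    """Dropout probabilities for the conv layers followed by the first FC layer.

    The first convolutional layer uses `first_p`, each subsequent one adds
    `step`, and the first FC layer is fixed to `fc_p`.
    """
    conv = [round(first_p + i * step, 2) for i in range(num_conv_layers)]
    return conv + [fc_p]


if __name__ == "__main__":
    print("Sermanet:", dropout_schedule(2))
    print("Jin:     ", dropout_schedule(3))
```

Under these assumptions the helper reproduces the D(p) values shown for both models in the figure.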

5.2 Quantization Analysis

This section presents the reduced-precision format chosen for each network model, based on its quantization results. A quantization analysis was performed using Ristretto to find the fixed-point configuration that gives the best tradeoff between accuracy and resource consumption. The accuracy of the networks was evaluated for different word lengths of the weights and activations. Table 5.2 summarizes the series of experiments and the configuration of each one. In every experiment, the activations of the convolutional and FC layers share the same number of bits and fixed-point representation. The same is true for the weights.

In the first set of experiments, 4-, 8- and 16-bit word lengths were used for the activations of both convolutional and FC layers, varying the number of fractional bits (QF) within the range [-2, 8] for the 4-bit fixed-point representation, [-2, 11] for the 8-bit and [-1, 19] for the 16-bit. The weights were kept at a fixed Q16.16 configuration throughout these experiments. This ensures that any accuracy degradation is not caused by the binary configuration of the weights, isolating the effect of the activations. Figure 5.4 shows how accuracy varies with the word length and fixed-point representation of the activations. The Sermanet model obtains its maximum accuracy for 4-bit, 8-bit and 16-bit activations using the Q1.3, Q6.2 and Q8.8 fixed-point configurations, respectively. The Jin model, on the other hand, obtains it with Q1.3, Q2.6 and Q2.14, respectively.
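To make the fixed-point formats concrete, the sketch below quantizes a value to a signed grid with a given word length and number of fractional bits. It is not Ristretto's implementation. It assumes that a Qm.n label denotes m integer bits (sign included) and n fractional bits, so that m + n equals the word length, and it assumes round-to-nearest with saturation. The function names are hypothetical.

```python
import math


def quantize(x, word_bits, frac_bits):
    """Quantize x to signed fixed point with saturation.

    frac_bits may be negative or exceed word_bits, as in the QF ranges
    explored in the experiments; the grid step is 2**(-frac_bits).
    """
    scale = 2.0 ** frac_bits
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = math.floor(x * scale + 0.5)   # round to nearest (assumed rounding mode)
    q = max(q_min, min(q_max, q))     # saturate to the representable range
    return q / scale


def representable_range(word_bits, frac_bits):
    """Smallest and largest representable values for the given format."""
    scale = 2.0 ** frac_bits
    return (-(1 << (word_bits - 1)) / scale, ((1 << (word_bits - 1)) - 1) / scale)


if __name__ == "__main__":
    for name, bits, qf in [("Q1.3", 4, 3), ("Q6.2", 8, 2), ("Q8.8", 16, 8)]:
        print(name, representable_range(bits, qf), quantize(0.3, bits, qf))
```

The example shows the tradeoff explored by the sweep: with a fixed word length, more fractional bits give a finer step but a narrower range, so large activations saturate, while fewer fractional bits widen the range at the cost of resolution.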
