
Alex Silva Torres

NNC – A Flexible Neural Network Compiler

NNC – Um compilador de Rede Neural flexível

CAMPINAS

2019


NNC – A Flexible Neural Network Compiler

NNC – Um compilador de Rede Neural flexível

Dissertação apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Mestre em Ciência da Computação.

Thesis presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Master in Computer Science.

Supervisor/Orientador: Prof. Dr. Guido Costa Souza de Araújo

Este exemplar corresponde à versão final da Dissertação defendida por Alex Silva Torres e orientada pelo Prof. Dr. Guido Costa Souza de Araújo.

CAMPINAS

2019


Ana Regina Machado - CRB 8/5467

Torres, Alex Silva,

T636n NNC - a flexible neural network compiler / Alex Silva Torres. – Campinas, SP : [s.n.], 2019.

Orientador: Guido Costa Souza de Araújo.

Dissertação (mestrado) – Universidade Estadual de Campinas, Instituto de Computação.

1. Compiladores (Computadores). 2. Redes neurais (Computação). 3. Programação paralela (Computação). 4. Aprendizado de máquina. 5. Engenharia de software. I. Araújo, Guido Costa Souza de, 1962-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: NNC - um compilador de rede neural flexível
Palavras-chave em inglês:
Compilers (Electronic computers)
Neural networks (Computer science)
Parallel programming (Computer science)
Machine learning
Software engineering

Área de concentração: Ciência da Computação
Titulação: Mestre em Ciência da Computação
Banca examinadora:
Guido Costa Souza de Araújo [Orientador]
Roberto de Alencar Lotufo
Marcio Machado Pereira

Data de defesa: 26-09-2019

Programa de Pós-Graduação: Ciência da Computação

Identificação e informações acadêmicas do(a) aluno(a)

- ORCID do autor: https://orcid.org/0000-0001-5156-8032
- Currículo Lattes do autor: http://lattes.cnpq.br/6645605058708862


Alex Silva Torres

NNC – A Flexible Neural Network Compiler

NNC – Um compilador de Rede Neural flexível

Banca Examinadora:

• Prof. Dr. Guido Costa Souza de Araújo Unicamp

• Dr. Marcio Machado Pereira Unicamp

• Prof. Dr. Roberto de Alencar Lotufo Unicamp

A ata da defesa, assinada pelos membros da Comissão Examinadora, consta no SIGA/Sistema de Fluxo de Dissertação/Tese e na Secretaria do Programa da Unidade.


I thank Prof. Dr. Marcio Machado Pereira for his great help, for his patience, and for all his valuable tips and reviews, without which this work would not have been possible.

I thank my advisor, Prof. Dr. Guido Costa Souza de Araújo, for guiding me during this work and for his teachings, without which this work would not have been possible.

I thank my family for the constant support in my life, without which nothing would be possible.

I thank my friends from the Computing Institute (IC-Unicamp) and CPQD who worked with me in this area, all exchanges of ideas were fundamental to my learning.

Thanks to everyone who published code, books, articles, and tutorials that I used during this work.

Resumo

Com o crescimento de áreas como computação de borda, existe uma crescente necessidade de trazer redes neurais para uma grande diversidade de hardwares diferentes. Para isso são necessários softwares para realizar a inferência de redes neurais já treinadas nesses dispositivos. Hoje as soluções existentes são engines que interpretam modelos e executam operação a operação. No entanto, essas engines são projetadas para hardwares específicos e suas arquiteturas tornam difícil adicionar novos backends e frontends. Para prover uma maior flexibilidade no suporte a diferentes hardwares, este trabalho propõe o NNC, um compilador de redes neurais flexível, cuja arquitetura foi projetada com acoplamento fraco, para permitir ao programador adicionar suporte para diferentes hardwares de uma maneira mais flexível. Com essa abordagem, o NNC é capaz de gerar código tanto para GPUs com maior poder de processamento, como a ARM Mali, quanto para microcontroladores com menor poder de processamento.

O NNC é dividido em três camadas principais: o frontend, a camada de otimização e análise, e o backend. O frontend permite escrever parsers para diversos tipos de modelos. A camada de otimização e análise permite executar passos para melhorar a performance de execução da rede neural, bem como realizar análises como, por exemplo, determinar a memória necessária para a execução do modelo. Ao final, o backend gera código para diversas arquiteturas a partir do grafo computacional. Com uma arquitetura que utiliza uma abordagem de acoplamento fraco, é possível escrever novos backends e conectá-los facilmente. O NNC possui backends para NNAPI e ARMNN, além de possuir uma biblioteca de kernels próprios utilizando Vulkan. Vulkan é uma tecnologia que permite acesso multiplataforma a GPUs modernas usadas em uma grande variedade de dispositivos, como telefones celulares e plataformas embarcadas.

Mesmo com o NNC estando ainda em sua versão inicial, foi possível com essa abordagem conseguir resultados de desempenho semelhantes aos de várias outras engines de execução de inferência de redes neurais, tendo a vantagem de já suportar uma quantidade maior de hardware que as demais engines.

Abstract

With the growth of areas such as edge computing, there is an increasing need to bring neural networks to a wide diversity of different hardware. This requires software to perform inference of already-trained neural networks on these devices. Today, existing solutions are engines that interpret models and execute them operation by operation. However, these engines are designed for specific hardware, and their architectures make it difficult to add new backends and frontends. To provide greater flexibility in supporting different hardware, this work proposes NNC, a flexible neural network compiler whose architecture is loosely coupled, allowing the programmer to add support for different hardware more flexibly. With this approach, NNC can generate code both for GPUs with higher processing power, such as ARM Mali, and for microcontrollers with lower processing power.

NNC is divided into three main layers: the frontend, the optimization and analysis layer, and the backend. The frontend allows writing parsers for different model formats. The optimization and analysis layer runs passes to improve the execution performance of the neural network and performs analyses such as determining the memory required to run the model. Finally, the backend generates code for several platforms from the computational graph. With an architecture that uses a loose coupling approach, it is possible to write new backends and connect them easily. NNC has backends for NNAPI and ARMNN, and also has a library of its own kernels using Vulkan, a technology that allows cross-platform access to modern GPUs used in a wide variety of devices, such as mobile phones and embedded platforms.

Even though NNC is still in its initial version, this approach achieved performance results similar to several other engines for neural network inference, with the advantage of already supporting more hardware platforms than the other engines.

List of Figures

1.1  NNC Architecture
2.1  Representation of the neuron model
2.2  Anatomy of a neural network
2.3  Fully connected neural network model
2.4  Convolution operation over an image
2.5  Max pooling operation
2.6  VGG architecture
2.7  Residual learning: a building block
2.8  Resnet architecture
2.9  Inception module
2.10 Inception architecture
3.1  ARM Mali GPU architecture
3.2  ARM Mali shader architecture
3.3  Memory regions and their scope in the OpenCL memory model
3.4  Sum of two sets of four elements
4.1  Operation F
4.2  Arithmetic computational graph
4.3  TFLite architecture
4.4  NNAPI architecture
4.5  Arm NN high level architecture
4.6  Arm NN high level architecture
5.1  XLA data flow
5.2  TFLite Converter conversion process
5.3  One-to-one conversion
5.4  Standardizations
5.5  Removal of operations
5.6  Operations fusion
6.1  High level architecture
6.2  NNC high level architecture
6.3  NNC Architecture
6.4  Tensors UML diagram
6.5  Operation class UML
6.6  Inception architecture branches
6.7  Frontend interface class UML
6.12 Binary operations fusion pass in computational graph
6.13 Pad fusion pass in computational graph
6.14 Activation function fusion in computational graph
6.15 Batchnorm fusion pass in computational graph
6.16 Node lowering pass in computational graph
6.17 Inception architecture branches
6.18 Inception architecture branches
6.19 Schedule data structure
6.20 Computation graph with generic operations
6.21 Operations schedule with live tensors
6.22 Backend interface UML diagram
6.23 C++ reference backend execution flow
6.24 NNAPI backend execution flow
6.25 Vulkan backend execution flow
7.1  Reduction of the number of operations
7.2  Inference times for NNC and TFLite using NNAPI
7.3  Inference times for NNC, Tensorflow and TFLite using ARMNN kernels

List of Tables

2.1  MobileNet Architecture
5.1  Comparison of Optimizations in Computational Graphs of different Frameworks
7.1  Comparison of generated model size to each convolutional architecture

Contents

1 Introduction
  1.1 Objectives
2 Neural Networks
  2.1 The neuron model
  2.2 Anatomy of a neural network
  2.3 Fully Connected neural networks
  2.4 Convolutional neural networks
    2.4.1 Convolution operation
    2.4.2 Convolutional layer
    2.4.3 Pooling layer
    2.4.4 Batch Normalization
  2.5 Convolutional architectures
    2.5.1 VGG
    2.5.2 Resnet
    2.5.3 Inception
    2.5.4 MobileNet
3 Parallel programming
  3.1 Mobile GPU architecture
  3.2 OpenCL
    3.2.1 Memory model
    3.2.2 OpenCL programming
    3.2.3 Vectorization
4 Deep Learning Frameworks
  4.1 Introduction
    4.1.1 Tensors
    4.1.2 Computational graph
    4.1.3 ONNX
  4.2 Training frameworks
    4.2.1 Tensorflow
    4.2.2 PyTorch
  4.3 Inference frameworks
    4.3.1 TFLite
    4.3.2 NNAPI
    4.3.3 ARM Neural Network
5 Related Works
  5.1 XLA
  5.2 TVM
  5.3 TFLite Converter and Optimization
  5.4 Halide
  5.5 OptiML - specific language for machine learning
  5.6 Glow
  5.7 Comparison
6 Implementation
  6.1 Architecture
  6.2 Computational graph optimization
  6.3 Software architecture
  6.4 Tensors
  6.5 Kernel Level Computational Graph
  6.6 Frontend
    6.6.1 Class DuckTensor
    6.6.2 Class Keras
  6.7 Optimizations and transformations passes
    6.7.1 Constant folding
    6.7.2 Arithmetical conversions
    6.7.3 Inserting bias
    6.7.4 Reshapes in sequence removal
    6.7.5 Binary operations fusion
    6.7.6 Pad fusion
    6.7.7 Activation function fusion
    6.7.8 Batchnorm fusion
    6.7.9 Node lowering
    6.7.10 Dead code elimination
  6.8 Graph analysis
    6.8.1 Scheduling
    6.8.2 Liveness tensors
  6.9 Backend
    6.9.1 Reference C++ backend
    6.9.2 NNAPI backend
    6.9.3 Vulkan backend
7 Experimental results
  7.1 Toolchains
  7.2 Passes optimizations
  7.3 NNC vs TFLite using NNAPI
  7.4 NNC vs TFLite vs Tensorflow Core using ARMNN
  7.5 Backends performance analysis
  7.6 Generated model size
8 Conclusions


Chapter 1

Introduction

In recent years deep learning has advanced in different areas such as object detection[35], speech recognition[19], video classification[25], natural language processing[28], and language translation[37], to name a few examples. These advances were mainly due to frameworks that made the training of complex models easier, like Tensorflow[13], PyTorch[30], Caffe[23], and CNTK[32].

These frameworks were developed with training deep learning models on large clusters in mind. However, with the demand for applications such as face detection[16], autonomous cars[15], and fingerprint recognition[14] on IoT devices and smartphones, it becomes necessary to run the inference of deep learning models on these restricted devices.

On-device inference efficiency is a critical issue because of the many constraints: performance, low latency, power consumption, portability, etc. To meet these requirements we must use several different technologies that provide efficient data parallelism, such as OpenCL[6], Vulkan[11], or OpenGL[7], as well as specific technologies such as NEON[4] for vectorizing data on ARM processors. Currently, most applications developed for executing neural network inference in restricted environments are based on engines. They usually receive a model that describes the computational graph with already-trained weights and execute this graph node by node, also taking care of memory management during execution.

TensorFlow Lite [8], or simply TFLite, is Google's lightweight solution for mobile and embedded devices. On the other hand, frameworks for performing neural network inference on mobile devices are consolidating around the Android Neural Networks API[5] (NNAPI). NNAPI is an Android C API designed for running computationally intensive machine learning operations on mobile devices. NNAPI is intended to provide a base layer of functionality for higher-level machine learning frameworks (such as TFLite) that build and execute neural network models.

This work explores an approach closer to compilers, generating code ahead of time. Thus, an architecture similar to LLVM[26] was developed, where the frontend, optimization, and backend steps are separated into layers. The frontend defines an API to describe the computational graph. The optimization layer performs a series of passes on this graph in order to improve it from a computational point of view. Finally, the backend generates code for several platforms from the optimized computational graph.

The remainder of this work is organized as follows. Section 1.1 describes the objective of improving the architectural design of neural network inference compilers. Chapters 2, 3, and 4 give some background on Neural Networks, Parallel Programming, and Deep Learning Frameworks. Related work is described in Chapter 5. Chapter 6 explains the architecture of NNC and shows how its optimizations are performed. After the concepts developed within NNC are explained, we show some experimental results in Chapter 7. Finally, we conclude in Chapter 8.

1.1

Objectives

This work was developed in partnership with Samsung, with the objective of efficiently executing, in restricted environments, some popular computer vision architectures for image recognition coming from the Tensorflow framework. For this purpose, we have listed some goals to make it possible to execute several architectures on the different platforms and technologies used in IoT and mobile devices.

• Develop a flexible architecture, where it is possible to easily add support for executing neural network models on new hardware architectures and new technologies.

• Perform optimizations in the computational graph at compile time to improve performance during execution.

• Use existing technologies to perform inference of neural networks in Android using NNAPI, and in ARM processors using ARMNN.

• Develop new kernels using Vulkan 1 technology.

Figure 1.1 summarizes the target architecture to achieve these goals. The main components are the optimizer and the graph analyzer. The frontend and the backend communicate with the optimization and analysis layers using an API that generates and reads the computational graph. With this, the goal is to achieve the flexibility to write frontends for various input models, like Tensorflow/Keras and PyTorch, as well as backends for various hardware platforms.

1 Vulkan is a new generation graphics and compute API that provides high-efficiency, cross-platform access to modern GPUs used in a wide variety of devices, from PCs and consoles to mobile phones and embedded platforms.


Chapter 2

Neural Networks

Neural networks are models inspired by the biological nervous system found in animals. They are composed of processing units called artificial neurons or perceptrons. Perceptrons were developed in the 1950s and 1960s by scientist Frank Rosenblatt, inspired by the works of Warren McCulloch and Walter Pitts[29].

The connections between the perceptrons simulate biological synapses. These connections have associated weights in order to weigh the data received by each neuron. These weights may assume positive or negative values. What is called learning in a neural network is the adjustment of these weights, and the process that makes these adjustments is called back-propagation.

2.1

The neuron model

The mathematical model of the artificial neuron is composed of a function whose inputs are connections. Each connection receives a value and has a weight associated with it. The neuron then computes a weighted sum of its inputs, and this value is applied to a nonlinear function.

Figure 2.1 shows the representation of a neuron model, where the inputs are prefixed by i and the weights by w. The function F can be decomposed into two parts: a summation u and an activation function g.

Figure 2.1: Representation of the neuron model

Mathematically speaking, we can describe the operation of the neuron as follows. Let $I = [i_1, i_2, \ldots, i_n]^t$ be the input vector and $W = [w_1, w_2, \ldots, w_n]$ the vector of weights. The total input $u$ received by the neuron can be defined as:

$$u = \sum_{j=1}^{n} i_j w_j$$

Thus, the output $o$ of the neuron can be described by:

$$o = g(u)$$

where $g$ is any nonlinear function.
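As a concrete illustration of these equations, a minimal NumPy sketch of a single neuron is shown below. The input values, weights, and choice of g (a ReLU here) are arbitrary examples, not values taken from the text.

import numpy as np

def neuron(i, w, g):
    # o = g(u), with u being the weighted sum of the inputs
    u = np.dot(i, w)
    return g(u)

relu = lambda u: np.maximum(u, 0.0)

i = np.array([0.5, -1.0, 2.0])   # example inputs
w = np.array([0.8, 0.2, -0.3])   # example weights
print(neuron(i, w, relu))        # u = 0.4 - 0.2 - 0.6 = -0.4, so g(u) = 0.0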

2.2

Anatomy of a neural network

The training of a neural network has four main parts: layers, input data, the loss function, and the optimizer. Figure 2.2 shows, in a simplified way, how the training of a neural network works. Layers can be connected in several different ways. The arrangement of these layers forms the architecture of the neural network. This architecture maps the input into predictions. The loss function compares these predictions with targets, thus producing a measure of how far the predictions deviate from the expected values. These values are called loss values.

Figure 2.2: Anatomy of a neural network

2.3

Fully Connected neural networks

In fully connected neural networks, also called dense or densely connected networks, the neurons of one layer connect with all the neurons of their neighboring layers. Figure 2.3 shows an example of a densely connected neural network with four layers: one input layer, one output layer, and two hidden layers.


Figure 2.3: Fully connected neural network model

2.4

Convolutional neural networks

Convolutional neural networks were introduced in the historical work Gradient-Based Learning Applied to Document Recognition[27] by Yann LeCun. The idea was to use learned features to recognize handwritten images rather than engineering features, as was done at the time. The main idea of the article was to use a layer that implemented the convolution operation and a descending gradient method to update the parameters of the convolution layer; thus, the features are learned from the data.

2.4.1

Convolution operation

Before presenting how the convolution layer works in a neural network, we show how the convolution operation itself works.

The objective of the convolution operation is to extract high-level features, such as edges, from the input image.

Convolution is an integral operation on two functions of a real argument:

$$s(t) = \int x(a)\,w(t-a)\,da$$

Mathematically, the convolution operation is represented by an asterisk:

$$s(t) = (x * w)(t)$$

For this work, we adopted the convention used in the deep learning frameworks. The function x is referred to as the input, the function w is referred to as filter and the output is referred to as a feature map.

However, since we are working with discrete values, we will assume that x and w are defined only for integer values of t. We can then define the discrete convolution as:

$$s(t) = \sum_{a=-\infty}^{\infty} x(a)\,w(t-a)$$

The focus of this work is computer vision. For this, we need to define the convolution operation for images, which can be interpreted as signals in two dimensions. Consider, for example, the input image I and the two-dimensional filter F:

$$S(i,j) = (I * F)(i,j) = \sum_m \sum_n I(m,n)\,F(i-m,\,j-n)$$

As convolution is commutative, we can write the equation as follows:

$$S(i,j) = (I * F)(i,j) = \sum_m \sum_n I(i-m,\,j-n)\,F(m,n)$$

This form is more practical to implement in a deep learning framework. However, most current frameworks implement the cross-correlation operation, that is, the convolution operation without flipping the filter, and in the end call it convolution. In this work, we follow this convention and call both convolution. The cross-correlation equation can be seen below.

$$S(i,j) = (I * F)(i,j) = \sum_m \sum_n I(i+m,\,j+n)\,F(m,n)$$

2.4.2

Convolutional layer

The convolution layer is a layer that applies a convolution operation to the input feature map. Figure 2.4 shows the convolution layer with a 2-dimensional input, a filter of size 2x2, a stride of 1, and no padding. Convolutional neural networks usually have several convolutional layers, where the first layers are responsible for learning basic shapes like edges and colors, and later layers learn more complex shapes like the shape of objects.

The convolutional filter works as a window that extracts a small piece of the image, applies an element-wise multiplication between the two matrices (the filter and the piece extracted from the image), and sums all multiplied elements. This operation can be seen in each rectangle of the output feature map in Figure 2.4. In the end, an image input of size (4x3) generates an output feature map of size (3x2).


Figure 2.4: Convolution operation over an image
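To make the sliding-window computation concrete, the NumPy sketch below implements the cross-correlation form of the convolution described above (stride 1, no padding). The 4x3 input and 2x2 filter sizes mirror the example in Figure 2.4, but the values themselves are arbitrary.

import numpy as np

def cross_correlate2d(image, kernel):
    # valid cross-correlation with stride 1 and no padding
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise product of the window and the filter, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # 4x3 input as in Figure 2.4
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # arbitrary 2x2 filter
print(cross_correlate2d(image, kernel).shape)      # (3, 2) output feature map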

2.4.3

Pooling layer

The pooling layer is responsible for reducing the spatial size of the feature map, that is, down-sampling it. This property is useful for saving the computational power used for data processing and helps the next convolutional layers extract higher-level features like the shape of objects.

Pooling uses a window to extract feature map values. The extracted value can be the maximum, the average, or the minimum value. Figure 2.5 shows the max pooling operation with a window of size 2 and a stride of 2.

Pooling with a window of size 2 halves the size of the feature map. However, it is essential to note that the pooling operation does not modify the depth of the feature map.
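A minimal NumPy sketch of this 2x2, stride-2 max pooling, assuming the feature map dimensions are divisible by the stride:

import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    # max pooling over a 2-D feature map
    h, w = fmap.shape
    oh, ow = h // stride, w // stride
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 1],
                 [0, 1, 3, 4]], dtype=float)
print(max_pool2d(fmap))   # [[6. 4.], [7. 8.]]: the 4x4 map is halved to 2x2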

The convolutional layer and the pooling layer together form the i-th layer of a convolutional neural network. Depending on the complexity of the images, the number of such layers may be increased to capture low-level details even further, but at the cost of more computational power.

2.4.4

Batch Normalization

Batch normalization[22] is a technique that was developed with the aim of improving the accuracy and performance of deep network training. Training deep nets is difficult because the distribution of each layer's inputs changes during training as the parameters of the previous layers change. This causes a problem that the authors call internal covariate shift, which requires lower learning rates and makes training slow. The authors propose to solve this problem by normalizing the inputs of the hidden layers, allowing much higher learning rates and less care about initialization. Batch normalization is part of the model architecture, and the normalization is performed for each training mini-batch. It works by subtracting the batch mean from the output of a previous activation layer and dividing by the batch standard deviation, as shown in Equation 2.3. This is followed by a shift/scale of the activation outputs by parameters that are learned during training, as shown in Equation 2.4.

Consider as input to batch normalization a mini-batch $B = \{x_{1 \ldots m}\}$:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \quad (2.1)$$

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \quad (2.2)$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad (2.3)$$

$$y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i) \quad (2.4)$$

The parameters $\gamma$ and $\beta$ are learned during training.
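The following NumPy sketch applies Equations 2.1 to 2.4 to a mini-batch of scalar activations; the gamma, beta, and epsilon values are arbitrary placeholders.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # normalize a mini-batch x and apply the learned scale/shift (Eq. 2.1-2.4)
    mu = x.mean()                          # Eq. 2.1: batch mean
    var = ((x - mu) ** 2).mean()           # Eq. 2.2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. 2.3: normalization
    return gamma * x_hat + beta            # Eq. 2.4: scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])  # mini-batch of activations
print(batch_norm(x, gamma=1.5, beta=0.1))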

2.5

Convolutional architectures

2.5.1

VGG

The paper that introduced the VGG architecture was a work by Karen Simonyan and Andrew Zisserman [33], whose objective was to investigate the effect of convolutional network depth on accuracy in a large-scale image recognition setting. The main contribution of the VGG architecture was the use of filters of size (3x3), which at the time was considered small, since it was common to use larger filters. This work demonstrated a significant improvement over the prior art, achieving a depth of 16 to 19 weight layers and significantly greater accuracy than previous ConvNet architectures. The downside of the VGG architecture is that its convolutional layers require a lot of parameters, which makes training slow and makes it unworkable for use on restricted devices, such as mobile devices.

Figure 2.6 shows the VGG architecture.

Figure 2.6: VGG architecture

2.5.2

Resnet

An obstacle to stacking more layers and going deeper into neural network models is the problem of vanishing/exploding gradients. As the gradient is back-propagated to earlier layers, repeated multiplication may make the gradient infinitely small. As a result, as the network goes deeper, its performance gets saturated or even starts degrading rapidly. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun[20] proposed ResNet to solve the vanishing gradient problem using deep residual learning. Figure 2.7 shows a block using residual learning, which adds a shortcut connection from the previous layer's input.

Figure 2.7: Residual learning: a building block

Figure 2.8 shows the complete ResNet architecture.
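A hedged sketch of the building block from Figure 2.7 using tf.keras is shown below; the filter counts and kernel sizes are illustrative and do not reproduce the exact configuration of the ResNet paper.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # y = F(x) + x: two convolution layers plus the identity shortcut
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])   # shortcut connection from the block input
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)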


2.5.3

Inception

The primary purpose of the Inception model[36] was to offer an architecture with high accuracy but without being as large as the VGG architecture, that is, having a smaller number of parameters. Inception is an evolution of GoogLeNet, using 12 times fewer parameters than the architecture of Krizhevsky et al.[35], in addition to being significantly more accurate.

Figure 2.9 shows the main part of the Inception architecture, known as the Inception module.

Figure 2.9: Inception module

The Inception module has two main ideas. The first one is to combine a max-pooling layer and several convolution layers with filters of different sizes ((1x1), (3x3), and (5x5)) and concatenate their outputs into a single output for the next step. The other idea is to use a convolutional layer with filter size (1x1) to reduce the dimensionality, thus decreasing the number of parameters required in the next convolutional layer and, with that, reducing the size of the network and accelerating its training.
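The sketch below expresses these two ideas with tf.keras: parallel 1x1, 3x3, and 5x5 convolutions plus a pooling branch, with 1x1 convolutions used for dimensionality reduction before the larger filters. The filter counts and the input shape are arbitrary, not the ones from the paper.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x):
    # parallel branches concatenated along the channel axis
    b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)

    b2 = layers.Conv2D(48, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b2 = layers.Conv2D(64, 3, padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)

    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)

    return layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)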

Figure 2.10 shows the entire Inception architecture, which, in turn, consists of several Inception modules placed in series. A detail to be observed is that, in addition to the output at the end, the architecture has two earlier outputs, each just after a Softmax operation. These earlier outputs are used to aid training and decrease the effect of the vanishing gradient.


2.5.4

MobileNet

The MobileNet[21] neural network was an attempt to design a convolutional network architecture considerably smaller in size than other architectures, such as the Inception network, but still capable of exhibiting an accuracy that was not much lower.

The MobileNet architecture replaces the normal convolution by a depthwise convolution followed by a pointwise convolution, also called a depthwise separable convolution [24]. The architecture significantly reduces the number of parameters and, with this, the total number of floating-point multiplication operations. This network proved to be advantageous for mobile and embedded vision applications because of its lower computational cost.
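A minimal tf.keras sketch of one depthwise separable block (a 3x3 depthwise convolution followed by a pointwise 1x1 convolution), each followed by batch normalization and ReLU as described above; the filter count and input shape are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable(x, pointwise_filters, stride=1):
    # depthwise 3x3 convolution followed by a pointwise 1x1 convolution
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = tf.keras.Input(shape=(112, 112, 32))
outputs = depthwise_separable(inputs, pointwise_filters=64)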

In the MobileNet architecture, all layers are followed by a batch norm and a ReLU nonlinearity, except for the final fully connected layer, which has no nonlinearity and feeds into a softmax layer for classification. Table 2.1 shows the MobileNet architecture.

Table 2.1: MobileNet Architecture

Type / Stride      Filter Shape           Input Size
Conv / s2          3 x 3 x 3 x 32         224 x 224 x 3
Conv dw / s1       3 x 3 x 32 dw          112 x 112 x 32
Conv / s1          1 x 1 x 32 x 64        112 x 112 x 32
Conv dw / s2       3 x 3 x 64 dw          112 x 112 x 64
Conv / s1          1 x 1 x 64 x 128       56 x 56 x 64
Conv dw / s1       3 x 3 x 128 dw         56 x 56 x 128
Conv / s1          1 x 1 x 128 x 128      56 x 56 x 128
Conv dw / s2       3 x 3 x 128 dw         56 x 56 x 128
Conv / s1          1 x 1 x 128 x 256      28 x 28 x 128
Conv dw / s1       3 x 3 x 256 dw         28 x 28 x 256
Conv / s1          1 x 1 x 256 x 256      28 x 28 x 256
Conv dw / s2       3 x 3 x 256 dw         28 x 28 x 256
Conv / s1          1 x 1 x 256 x 512      14 x 14 x 256
5x Conv dw / s1    3 x 3 x 512 dw         14 x 14 x 512
Conv / s1          1 x 1 x 512 x 512      14 x 14 x 512
Conv dw / s2       3 x 3 x 512 dw         14 x 14 x 512
Conv / s1          1 x 1 x 512 x 1024     7 x 7 x 512
Conv dw / s2       3 x 3 x 1024 dw        7 x 7 x 1024
Conv / s1          1 x 1 x 1024 x 1024    7 x 7 x 1024
Avg Pool / s1      Pool 7 x 7             7 x 7 x 1024
FC / s1            1024 x 1000            1 x 1 x 1024


Chapter 3

Parallel programming

Although modern mobile devices are powerful architectures, they are restricted computing platforms in the sense that they need to provide the highest MIPS/Watt values with minimal battery usage. On the other hand, the graphics processing units (GPUs) embedded in these devices are increasingly powerful and generic, allowing not only the rendering of images but also the handling of general computing operations. In this way, it is interesting to explore the parallelism of the algorithms used in deep learning to take advantage of these characteristics of smartphones. This chapter discusses the architecture of GPUs used by these restricted devices and how to use parallel programming with the OpenCL framework to exploit this architecture.

3.1

Mobile GPU architecture

The main difference between a CPU and a GPU is in the number of cores. While a modern CPU reaches 8 cores, a GPU can have hundreds or even thousands of processing cores. However, this large number of GPU cores has a cost. While in the CPU each core can execute different instructions at the same time, in the GPU the cores always execute the same instruction at any given moment, changing only the data processed in each core. This becomes one of the main factors when programming in parallel for GPUs. While the GPU can speed up many tasks, such as the sum of arrays, where all cores perform sums at the same time on different data, other tasks that require each core to do different operations can slow them down. Figure 3.1 shows the ARM Mali GPU architecture. All shader cores share the L2 cache, whose size can range from 32 to 64 KB per shader core. This is important information when using the GPU's global memory in the OpenCL framework discussed in the next section.

Figure 3.1: ARM Mali GPU architecture

Figure 3.2 shows the Mali shader core architecture.

Figure 3.2: ARM Mali shader architecture

3.2

OpenCL

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs) and graphics processing units (GPUs), among others. OpenCL specifies programming languages (based on C/C++) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

The OpenCL specification defines four abstraction models:

• Platform model: An OpenCL platform consists of a host connected to one or more OpenCL devices. The platform model defines the roles of the host and the devices, and provides an abstract hardware model for devices.

• Execution model: subdivides an application into two parts: a host program, run on the host, and a collection of kernels, which are functions written in the OpenCL language. During execution, each kernel instance is identified in an address space as a work-item, and work-items are organized into groups that ensure concurrent execution, called work-groups.

• Kernel programming model: High-level abstractions that the programmer uses to define how the concurrency model is mapped to physical hardware.

• Memory model: the abstract memory hierarchy that kernels use. It defines the memory regions in OpenCL and how they interact with the kernels during an OpenCL computation.

3.2.1

Memory model

As the memory system can vary widely between different computing platforms, OpenCL defines its own memory abstraction to support portability. Two types of memory objects are defined by the standard: buffers and images. Buffers are like arrays in C, where data elements are stored contiguously in memory and can be accessed using pointers (memory addresses). Image objects are restricted to holding images. Unlike buffers, images cannot be directly referenced as if they were arrays. The advantages of using images are allowing the hardware to take advantage of spatial locality and utilizing the hardware acceleration available on many devices.

Memory regions

OpenCL classifies the memory in host memory and device memory. Host memory is directly available to the host and is set outside of OpenCL. Data is moved between the host and devices using OpenCL API functions or through a shared virtual memory interface. The device memory is the memory that is available to run the computing kernels.

The device memory is divided into four regions, as shown in Figure 3.3.

• Global memory: allows read and write access to all work-items in all work-groups. The data that is transferred from the host to the device, as well as the data that must be moved back from the device to the host, must reside in global memory.

• Constant memory: a region of global memory that remains constant during the execution of a kernel, being accessible to work-items only for reading.

• Local memory: memory that is shared between work-items within a work-group.

• Private memory: memory that is unique to an individual work-item. Local variables and non-pointer kernel arguments are private by default.

Figure 3.3: Memory regions and their scope in the OpenCL memory model.

3.2.2

OpenCL programming

The OpenCL programming model was designed with data parallelism as a primary target, but it also supports task parallelism. In OpenCL, the programmer implements the core computation using kernels whose execution is defined by the programming model. The main steps to execute a simple OpenCL application are shown below, followed by a short sketch in Python.

• Initially, a query is made on the host to look for available platforms and devices.

• After the device has been discovered, a context for the device is created on the host.

• A command queue is created per device. The host submits commands to the command queue to ask the device to perform the work.

• Buffers are created to hold data. To create a buffer, it is necessary to know the size of the buffer and the context in which the buffer will be allocated.

• The next step is to copy data from a host pointer to a buffer. This step is very slow, so we should make this copy as few times as possible.

• The kernel is created by selecting the desired function from within the program.

• After running the kernel, the output data must be copied from the device to the host. This step is very slow, so we should minimize the number of times it is performed.

• The last step is to release all allocated resources.
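The sketch below walks through these steps with PyOpenCL, assuming a platform with at least one OpenCL device is available; it adds two float vectors with a trivial kernel. Allocated objects are released when they go out of scope.

import numpy as np
import pyopencl as cl

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

# steps 1-3: discover a platform/device, create a context and a command queue
platform = cl.get_platforms()[0]
device = platform.get_devices()[0]
ctx = cl.Context([device])
queue = cl.CommandQueue(ctx)

# steps 4-5: create buffers and copy the host data into them
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# step 6: build the program and select the kernel
program = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
""").build()

# step 7: run the kernel and copy the result back to the host
program.vadd(queue, a.shape, None, a_buf, b_buf, c_buf)
result = np.empty_like(a)
cl.enqueue_copy(queue, result, c_buf)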

3.2.3

Vectorization

Vectorization is the ability of the processor to perform, for example, a sum or multiplication operation on a vector of four or eight elements as a single instruction. It is also known by the abbreviation SIMD (Single Instruction, Multiple Data). SIMD allows data parallelism at the instruction level, that is, the computation of an operation over an entire small vector, for example, adding or multiplying two vectors of 8 elements in a single instruction. Figure 3.4 shows a comparison between the two approaches for summing two sets of four elements: in the first part, four sum instructions are executed, while in the second part the two 4-element vectors are summed by a single SIMD instruction, thus being much faster.
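As an illustration, the element-wise loop below can be replaced by a single vectorized expression in NumPy; on CPUs, such whole-array operations are typically lowered to SIMD instructions by the underlying optimized libraries.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([5.0, 6.0, 7.0, 8.0])

# scalar version: one addition per element
c = np.empty_like(a)
for i in range(len(a)):
    c[i] = a[i] + b[i]

# vectorized version: the whole 4-element addition in one expression
c = a + b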


Chapter 4

Deep Learning Frameworks

4.1

Introduction

This chapter discusses the fundamental concepts used in deep learning frameworks, covering both inference and training frameworks. Thus, we introduce the concepts of tensors, kernels, variables, and computational graphs, and give a quick overview of the main training and inference frameworks.

4.1.1

Tensors

Tensors are the primary data structure used by the deep learning frameworks to represent the data. Tensors are a generalization of matrices and are represented using n-dimensional arrays.

In the deep learning frameworks, tensors are containers for the data. The data types are usually of the numeric type. The key attributes of the tensors are:

• rank: the number of axes of a tensor, that is, the dimension of the tensor. For instance, a scalar is a tensor of rank 0, a vector is a tensor of rank 1, a matrix is a tensor of rank 2, and so on.

• shape: in deep learning frameworks, the shape is generally represented by a tuple that describes how many dimensions the tensor has in each axis.

• type: is the type of data stored inside the tensor. The commonly used data types are: int8, int16, int32, int64, float32, float64.
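A quick NumPy illustration of rank, shape, and type for a scalar, a vector, and a matrix:

import numpy as np

scalar = np.array(3.0)                     # rank 0
vector = np.array([1.0, 2.0, 3.0])         # rank 1
matrix = np.zeros((2, 3), dtype=np.int32)  # rank 2

for t in (scalar, vector, matrix):
    print(t.ndim, t.shape, t.dtype)        # rank, shape, type
# 0 () float64
# 1 (3,) float64
# 2 (2, 3) int32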

4.1.2

Computational graph

Operation or Kernel

Without loss of generality, we can define an operation as a function that receives one or more variables as input and returns only a single output variable, as shown in Figure 4.1.

Figure 4.1: Operation F

Variables

Variables are a wrapper for a tensor. They are used as the inputs and outputs of an operation. Since the tensor is only a data structure, all memory optimization in the computational graph works with the variables.

Computational graph

It is a representation of a mathematical function in the language of graph theory. In deep learning frameworks, the computational graph is a directed graph where the nodes correspond to operations or variables, and the edges correspond to the tensors that flow through the graph. Generally, in figures representing computational graphs, the variables that are inputs and outputs of the operations are omitted, showing only the variables that store values such as weights and biases, and the graph input.

Figure 4.2 shows a simple arithmetic computational graph that adds X1 and X2 and multiplies the result by X3.

Figure 4.2: Arithmetic computational graph
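A minimal sketch of such a graph in plain Python, where nodes are operations and evaluation walks the graph; this is an illustrative toy, not NNC's actual data structures.

class Node:
    # a graph node: an operation applied to the outputs of its input nodes
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = inputs

    def evaluate(self, feed):
        if self.op == "input":
            return feed[self]
        args = [n.evaluate(feed) for n in self.inputs]
        return {"add": lambda a, b: a + b,
                "mul": lambda a, b: a * b}[self.op](*args)

x1, x2, x3 = Node("input"), Node("input"), Node("input")
y = Node("mul", (Node("add", (x1, x2)), x3))      # (X1 + X2) * X3, as in Figure 4.2
print(y.evaluate({x1: 2.0, x2: 3.0, x3: 4.0}))    # 20.0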

4.1.3

ONNX

Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models, thus allowing a framework to use a model generated and trained in another framework. It uses protobuf to store its data structures. The main data structure used by ONNX is the computational graph, which is composed of nodes representing the operations, where these nodes are connected through edges representing the tensors. Each operation in the ONNX model has an id predefined by the standard, so converted models have the limitation that they cannot use operations that are not defined in the standard.
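A short example of inspecting an ONNX model with the onnx Python package; the file name is a placeholder.

import onnx

# load a serialized model (protobuf) and validate it against the ONNX standard
model = onnx.load("model.onnx")        # hypothetical file name
onnx.checker.check_model(model)

# the computational graph: nodes are operations, connected by named tensors
for node in model.graph.node:
    print(node.op_type, list(node.input), list(node.output))
print("opset:", model.opset_import[0].version)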

4.2

Training frameworks

Training frameworks have the ability to calculate derivatives, using the chain rule, and to run procedures such as gradient descent over a computational graph. Thus, these frameworks can compute both the back-propagation and the forward-propagation of a deep learning model.

4.2.1

Tensorflow

With the growth of the use of deep learning, several network architectures began to emerge, and with this the need for an application that was general enough to allow describing any of these architectures and executing them optimally, distributing the processing between GPUs and clusters. Tensorflow was one of the machine learning frameworks that emerged to meet this demand.

Tensorflow[13] uses a computational graph approach to compute machine learning algorithms. Computational graphs give flexibility for describing the architecture of neural networks and also allow performing optimizations already known from the study of compilers.

In a TensorFlow graph, each node has zero or more inputs and zero or more outputs and represents the instantiation of an operation. An operation or kernel has a name and constitutes an abstract computation (e.g., “matrix multiply”, or “add”). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time to instantiate a node to perform the operation.

Tensorflow saves the model using protobuf. Protobuf is an extensible mechanism for serializing structured data, and it uses data compression in the serialization process.

Below we have Python code that describes a computational graph.

import tensorflow as tf

b = tf.Variable(tf.zeros([100])) # 100-d vector, init to zeroes

W = tf.Variable(tf.random_uniform([784,100],-1,1)) # 784x100 matrix

x = tf.placeholder(tf.float32, name="x") # Placeholder for input

relu = tf.nn.relu(tf.matmul(W, x) + b) # Relu(Wx+b)

C = [...] # Cost computation

As a complete framework for machine learning, Tensorflow has several parts: creation and optimization of the computational graph, distribution of kernels across devices and clusters, and execution. In this work, we compare against the part of TensorFlow that optimizes the computational graph.

• Common Subexpression Elimination: as the construction of computation graphs is often done by many different layers of abstraction in the client code, redundant copies of the same computation are left in the computational graph; to remove these parts, a common subexpression elimination pass was implemented by the authors.

• Controlling Data Communication and Memory Usage: scheduling of the operations can result in better performance of the system, in particular for data transfers and memory usage. The authors used algorithms to analyze critical paths of the graph to find the best way to schedule the execution of kernels for memory passing.

• Asynchronous Kernels: this is an optimization for environments where having many active threads is relatively expensive in terms of memory usage or other resources. Examples of asynchronous kernels include the Receive kernel, and the Enqueue and Dequeue kernels, which might need to block if queue space is not available or if no data is available to be read.

• Optimized Libraries for Kernel Implementations: the authors made extensive use of already-optimized libraries for numerical computation, such as BLAS and cuBLAS, or GPU libraries for convolutional kernels for deep neural nets, such as cuda-convnet and cuDNN.

• Lossy Compression: several algorithms used in deep learning, including those typically used for training neural networks, are tolerant of noise and reduced-precision arithmetic, so the authors used conversions to less accurate types that make computing more efficient.

From the optimizations used by the authors, only the common subexpression elimination operates on the computational graph; the others focus more on the execution of the training. This is one of the characteristics that distinguishes it from the present work. Since this work focuses on the inference of already-trained models, we can perform many other optimizations that would not be possible or recommended during training.

4.2.2

PyTorch

PyTorch[30] is a dynamic deep learning framework. The difference between a dynamic framework and a static one, such as Tensorflow, is that in the dynamic framework the differentiation of functions occurs while the computational graph is executed and, at the end of the execution, the graph is discarded. In a static framework, differentiation happens ahead of time, and then that graph, with its symbolic derivatives, is executed several times. The dynamic framework has several advantages over the static one, such as simplicity of debugging, execution of recurrent networks in simpler ways, and also the possibility of implementing architectures that change during training.

The core of PyTorch is implemented in C++ for performance reasons. Its programming interface is in the Python language. An example of PyTorch code can be seen below.

import torch
from torch.autograd import Variable

y = Variable(torch.rand(2, 4))
x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 2
out = z.mean()
out.backward()

This example shows the basics of how PyTorch works. PyTorch uses variables to encapsulate the tensors, and all operations are done with variables. In this case, the variables x and y are created. Then some mathematical operations are done with these variables, and the backward function performs the differentiation and the back-propagation in the computational graph.

The main difference between Tensorflow and PyTorch is that Tensorflow generates a computational graph, and that graph is immutable and used throughout the training. While in PyTorch, the computational graph is built and destroyed at each step of the training loop.

4.3

Inference frameworks

Inference is the forward-propagation operation in a computational graph. Inference frameworks execute an already trained model, so the back-propagation step is not necessary.

4.3.1

TFLite

TFLite [8] is a lightweight solution for performing neural network inference operations on mobile and embedded devices. Tensorflow Lite supports a set of operations, both in floating point and quantized. These operations are encapsulated in computing kernels. These kernels usually incorporate an activation function, for example, the convolution kernel known as conv2d has a non-linear activation function that can be chosen by the developer.

The TFLite architecture consists of two applications: a converter from Tensorflow graphs to TFLite, called TOCO, and the TFLite runtime itself. TOCO converts the model to the flatbuffer format. Flatbuffer is very similar to protobuf, but it doesn't use compression to store data and uses a lightweight serialized form that saves space in the stored file. TOCO is not only a converter; it also performs some optimizations such as constant folding and dead code elimination, and uses heuristics to find and merge operations according to the kernels already implemented in TFLite.

When TFLite finds NNAPI on the device, it simply dispatches the model to NNAPI calls. Figure 4.3 shows the complete Tensorflow Lite architecture.


Figure 4.3: TFLite architecture
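For reference, this is roughly how a converted model is executed with the TFLite Python interpreter; the model path and the input data are placeholders.

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # hypothetical path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# feed one input tensor, run inference, and read the output tensor
data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], data)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])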

4.3.2

NNAPI

NNAPI[5] is written in C++ with a C interface, designed for running computationally intensive machine learning operations on mobile devices. With NNAPI it is possible to perform the inference operation on the device itself, without the need to send a request to a server in the cloud to process the inference, thereby reducing the latency of the operation and increasing availability, because it is possible to execute the inference even without access to the internet.

NNAPI has support for driver development, so a hardware manufacturer can develop specific kernels to take advantage of some hardware specificity, leading to increased time and power efficiency; for example, we can have kernels developed specifically for GPUs or TPUs. The architecture of NNAPI can be seen in Figure 4.4.


Figure 4.4: NNAPI architecture

4.3.3

ARM Neural Network

Arm NN[1] is an inference engine for CPUs, GPUs, and NPUs. Arm NN provides an API with which it is possible to construct a computational graph, which will be used to perform the computation. Its architecture internally uses ARMCL (Arm Compute Library), which is the library where kernels such as convolution, LSTM, and fully connected, among others, are implemented. Figure 4.5 shows a high-level view of the Arm NN architecture.

Figure 4.5: Arm NN high level architecture

Figure 4.6 shows the three main steps when Arm NN executes a model: first it assembles the computational graph; then it optimizes the use of memory for the execution of this computational graph. Memory optimization uses a technique used in compilers called live variable analysis; however, in the case of Arm NN, the tensors are the variables and the kernels are the blocks that consume and modify the variables. The third step is to run the computational graph that was assembled with the ARMCL kernels.


Chapter 5

Related Works

Inference of neural networks in restricted environments is a relatively new area, that is, it is only now beginning to be developed. That said, many of the papers listed in the topics below do not necessarily focus on inference, but also on training. These works were chosen because they also perform inference or because they deal with subjects related to our work.

5.1

XLA

With the increasing use of Tensorflow, it has been observed that the execution model that works by interpreting the computational graph and running kernel by kernel is, in general, inefficient. To solve this problem, Google proposed the use of a compiler to compile the neural network, that is, the neural network is converted into machine code before being executed. In order to optimize the time spent during the training of neural networks, XLA [12] was created. XLA stands for Accelerated Linear Algebra. It is a linear algebra compiler that optimizes TensorFlow computations.

Figure 5.1 shows how the data flow works in the XLA architecture. First, the computational graph of TensorFlow is converted to a high-level representation called HLO. HLO is the instruction format used by XLA to describe the operations performed by the graph. After this conversion, optimizations over the HLO are performed, giving rise to a new optimized HLO that is then converted by XLA into low-level instructions in the LLVM format.


Figure 5.1: XLA data flow

The optimizations performed by XLA on the HLO are similar to the optimizations performed by TVM, such as operation fusion, constant folding, and dead code elimination. Because XLA compiles parts of the computational graph, it returns a pointer to the kernel with the newly compiled code. Then, TensorFlow handles the compiled code as if it were another kernel, while still managing memory.

5.2

TVM

Chen et al. [18] proposed TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across various hardware back-ends. Their primary motivation lies in the fact that current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Its compiler includes automated end-to-end optimization, which is historically a labor-intensive and highly specialized task. Our work is related to the TVM layer that performs high-level optimizations in the computational graph. TVM implements many graph-level optimizations. Examples are: (1) operator fusion, which combines several small operations together; (2) constant folding, which pre-computes the parts of the graph that can be determined statically; and (3) data layout transformations, which transform internal data layouts into back-end-friendly forms. After performing the computational graph optimization step, TVM optimizes the memory allocation for the tensors and converts the high-level operations to the target hardware platform. In this work, the authors focused on creating a set of instructions aimed at frameworks for machine learning training. However, this set of instructions is not always efficient for performing the inference of neural networks, and it also makes it difficult to generate code that uses the NNAPI standard, which is a differential of the present work, which focuses on the inference of neural networks.

5.3

TFLite Converter and Optimization

Also known as the TFLite Converter [9], it is responsible for converting Tensorflow models to the flatbuffer format. Flatbuffer is a serialization format that is interpreted by TFLite. The TFLite Converter converts and optimizes Tensorflow models, thereby attempting to reduce the number of neural network operations or to exchange them for operations with cheaper performance. Figure 5.2 shows the conversion process from a Tensorflow or Keras model to the generation of the flatbuffer file.

Figure 5.2: TFLite Converter conversion process

The optimizations performed by the TFLite Converter are:

• One-to-one conversion: used to tailor certain operations to NNAPI; for example, an ADD over n tensors becomes several ADDs with two input tensors, and squeeze kernels are also replaced by reshape operations.


Figure 5.3: One-to-one conversion

• Standardizations: in TensorFlow, kernels such as Conv, Depthwise Conv, and Fully Connected may or may not have a bias; in the case of NNAPI the bias is mandatory, so TOCO performs the insertion of a bias when it is not present.

Figure 5.4: Standardizations

• Removal of operations: substitution of binary operations with constants by a single constant value, and the exchange of several reshape operations in sequence by a single reshape.

Figure 5.5: Removal of operations

• Operations fusion: TOCO performs the merging of activation functions such as RELU into the preceding operation, and also fuses binary operations that come in sequence after a convolution into the bias of the predecessor operation.


Figure 5.6: Operations fusion
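In newer TensorFlow releases, this conversion is typically invoked through the tf.lite.TFLiteConverter Python API, which supersedes the standalone TOCO tool; a minimal sketch with a placeholder Keras model follows.

import tensorflow as tf

# a placeholder Keras model standing in for a trained network
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# convert to the FlatBuffer format consumed by TFLite; graph-level
# optimizations (constant folding, fusion, etc.) are applied here
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)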

5.4

Halide

Processing an image pipeline in real time is a costly process, and if a significant amount of time is spent on it, it becomes unviable. To solve this problem, J. Ragan-Kelley et al. proposed a systematic model to create a set of instructions focused on the operations of image processing [17]. These instructions can be represented by a computational graph that can be optimized to generate GPU code. This makes image pipeline processing more efficient.

Halide's set of instructions is generic enough to be used in applications such as cameras, sensors such as Kinect, video processing, image processing in software such as Photoshop and Gimp, medical image processing, and neural scanning, where performance must cope with the rapidly rising resolution and frame rate of image sensors and the increasing complexity of algorithms.

Stencil pipelines are widespread in the applications chosen as the focus by the authors of the work. Stencil pipelines are graphs of different stencil computations. Iteration of the same stencil occurs, but it is the exception, not the rule; most stages apply their stencil only once before passing the data to the next stage, which performs a different data-parallel computation over a different stencil.

To describe the image processing pipeline, a language called the Halide DSL was created. This language allows writing operations such as filters quickly and easily; in a few lines, it is possible to write applications that in other languages like C++ would take hundreds or thousands of lines.

An example of a separable 3 × 3 unnormalized box filter, expressed as a chain of two functions in x, y is shown below:

UniformImage in(UInt(8), 2)
Var x, y
Func blurx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y)
Func out(x,y) = blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)

The contributions cited by the project authors were:

• a systematic model of the tradeoffs between locality, parallelism, and redundant recomputation in stencil pipelines;


• a scheduling representation that spans this space of choices;

• the DSL compiler based on this representation that combines Halide programs and schedule descriptions to synthesize points anywhere in this space, using a design where the choices for how to execute a program are separated not just from the definition of what to compute, but are pulled all the way outside the black box of the compiler;

• a loop synthesizer for data-parallel pipelines based on simple interval analysis, which is simpler and less expressive than a polyhedral model, but more general in the class of expressions it can analyze;

• a code generator that produces high-quality vector code for image processing pipelines, using machinery much simpler than the polyhedral model;

• and an autotuner that can discover high-performance schedules (up to 5x faster than hand-optimized programs written by experts) for complex image processing pipelines using stochastic search.

Although our work focuses on the inference of neural networks and Halide focuses on image processing, concepts from the Halide instruction set were used to create a set of instructions for this work.

5.5 OptiML - a specific language for machine learning

With the increasing complexity of deep learning models and of the databases used by these models, the need to use modern hardware becomes evident. However, taking advantage of this hardware requires using multiple parallel programming models targeted at different devices (e.g., CPUs and GPUs), and programming these devices to run efficiently and correctly is difficult, error-prone, and results in software that is harder to read and maintain. To simplify programming for machine learning, a machine-learning-specific language that can be compiled for several devices was proposed, which the authors called OptiML [34].

OptiML is a declarative language that focuses on describing what an operation should do rather than how it should do it. The language uses data types derived from three fundamental base types: Vector, Matrix, and Graph. These data types are polymorphic and flexible. An imperative model is also supported, such as a while loop that assigns a value to each index. The example below shows how the k-means algorithm is written in OptiML syntax.

untilconverged(mu, tol){ mu =>
  // calculate distances to current centroids
  val c = (0::m){ i =>
    val allDistances = mu mapRows { centroid =>
      // distance from sample x(i) to centroid
      ((x(i) - centroid) * (x(i) - centroid)).sum
    }
    allDistances.minIndex
  }
  // move each cluster centroid to the
  // mean of the points assigned to it
  val newMu = (0::k,*) { i =>
    val (weightedpoints, points) = sum(0,m) { j =>
      if (c(i) == j){ (x(i),1) }
    }
    if (points == 0) Vector.zeros(n)
    else weightedpoints / points
  }
  newMu
}

OptiML uses a metaprogramming technique known as lightweight modular staging [34] to build an intermediate representation of a program. Several static and dynamic optimizations are performed. Static optimizations are applied as transformations on the OptiML IR before code generation, while dynamic optimizations are implemented as part of OptiML data types or control structures.

OptiML implements static optimizations such as common subexpression elimination, dead code elimination, and loop hoisting. The authors use pattern matching on the IR to optimize sequences of operations according to standard linear algebra simplification rules. Operator fusion is used, for example, in situations where k loops that iterate over data structures of the same size can be transformed into a single loop that computes k results, thereby reducing the number of main memory accesses.

The authors of OptiML focused on building a generic language for machine learning, and their optimizations target loops and language constructs at the compiler level, while the optimizations in the present work focus exclusively on the computational graph at the level of neural networks.

5.6 Glow

The purpose of the Glow [31] project was to create an efficient compiler that could be coupled to the PyTorch project. Like other compilers such as TVM and XLA, Glow also needs to support multiple targets for different domain-specific architectures (DSAs). The primary technique the compiler uses for generating efficient code is graph lowering, which gives the project its name.

The High-Level IR used in Glow maps the operations used in neural network models. These nodes can reference and access storage nodes that are owned by their containing module. Glow lowers the nodes that compute the gradient of an expression and the stochastic gradient descent (SGD) node into a sequence of low-level operators (Sub, Mul, Add, and Save). The different compiler backends therefore do not need to implement support for the DivGrad or SGD nodes. By contrast, classic machine learning frameworks that are not able to automatically generate fused kernels need to implement hundreds of CUDA and CPU compute kernels that represent the unlowered operators. This limits their ability to support new kinds of hardware and ties them to one or two major hardware vendors.

The usual approach in deep learning frameworks is to provide a kernel for each device, for example, a 2D convolution for CPU, GPU, and DSP; however, this approach does not scale as the number of devices grows every day. Glow instead uses a technique called node lowering, which is the main technique that allows Glow to compile for a large number of targets. In this technique, the compiler breaks high-level operator nodes into low-level linear algebra operator nodes. For example, the fully connected layer is represented as a matrix multiplication followed by a broadcasted add, as sketched below. Different compiler backends do not have to implement the FullyConnected layer and a dozen other high-level opcodes, just the low-level matrix multiplication. The lowering phase comes after the graph is differentiated; because the lowering transformation does not preserve the semantics of the graph, it is not possible to differentiate the graph for certain operators.
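A minimal sketch of this lowering, using NumPy and illustrative shapes rather than Glow's actual IR, is shown below.

import numpy as np

def fully_connected(x, w, b):
    # FullyConnected lowered to MatMul followed by a broadcasted Add.
    return np.matmul(x, w) + b

x = np.random.rand(8, 32)    # batch of 8 activations with 32 features
w = np.random.rand(32, 16)   # weight matrix
b = np.random.rand(16)       # bias, broadcast over the batch dimension
assert fully_connected(x, w, b).shape == (8, 16)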

The phase called IRGen is where each high-level node is translated into one or more instructions of a low-level IR. The low-level IR enables a different kind of target-independent optimization that is not possible with the high-level graph format. In the context of hardware acceleration, the low-level instruction-based representation allows the compiler to represent device-specific operations such as asynchronous DMA operations. Hiding the latency of memory operations is important for utilizing the execution units of the hardware effectively, and the instruction-based representation allows the compiler to create a schedule that hides the latency of the memory operations.

Glow, like other compilers designed for the training of neural networks, differs from the present work in the optimizations that are performed. For example, Glow cannot merge batch normalization with convolution, since during training the batch-normalization parameters change; however, when the focus is only inference, this fusion is entirely possible.
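Since the batch-normalization statistics and parameters are frozen at inference time, the fusion amounts to a per-channel rescaling of the convolution weights and bias. The sketch below assumes an OIHW weight layout; the function name and the layout are illustrative assumptions, not a description of NNC's implementation.

import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    # Per output channel: w' = gamma * w / sqrt(var + eps)
    #                     b' = gamma * (b - mean) / sqrt(var + eps) + beta
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale.reshape(-1, 1, 1, 1)   # conv weights in OIHW layout
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded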


5.7 Comparison

Optimization                 NNC   TOCO   Tensorflow   Glow   TVM
Constant folding              X     X         X          X     X
Subexpression elimination     X     X         X          X     X
Bias standardization          X     X                    X     X
Activation function fusion    X     X                    X     X
Binary operation fusion       X     X                    X     X
Batchnorm fusion              X
Quantization                  X     X         X          X     X

Table 5.1: Comparison of Optimizations in Computational Graphs of different Frameworks

Table 5.1 compares the computational graph optimizations performed by the different frameworks with those performed by NNC.

The main difference between NNC and inference engines like ARMNN, TFLite, and Caffe2 is that NNC uses a standalone code generation approach, while these inference engines are runtimes that interpret a model and perform the operations described in it.

Halide, although it does not deal with neural network inference, presents a set of instructions covering operations such as convolution. It was not placed in the comparison table because it does not address the other aspects of neural network inference.

OptiML is a language with a broader focus than neural networks, covering machine learning in general. Its instruction set served as a basis for the instruction set of this work, although the OptiML project does not focus exclusively on neural network inference.

Tensorflow is a framework composed of several internal parts. For the purpose of this comparison, we differentiate Tensorflow-core from TOCO. Tensorflow-core performs both training and inference; in this analysis we focus on the inference part. Tensorflow-core has several kernels, where each kernel performs one operation. However, it performs few optimizations, practically executing the graph as described by the user.

TOCO is the part of the TensorFlow framework that deals exclusively with the optimization of computational graphs for neural network inference in restricted environments. Of the tools discussed above, TOCO is the one that most closely resembles NNC. The main difference is that TOCO generates a flatbuffer that is interpreted by TFLite. This makes it difficult to support new platforms that are not supported by TFLite, because to run the model, TFLite must be compiled with kernels that support the target platform.


However, because the focus of compilers such as TVM and Glow is training rather than inference, they fail to perform some optimizations that can be achieved when the goal is just inference and that are valuable when seeking to execute neural networks in embedded environments. In later versions, Glow has added support for inference-oriented compilation of neural networks. However, this feature is still in the development stage, and its support for acceleration technologies used in embedded environments, such as OpenCL and Vulkan, is still quite limited.

The main difference between NNC and all the other frameworks cited here is that the NNC architecture was designed to make it easy to include new frontends and backends. This means that while the other frameworks support only one input format and one output target, NNC can support several.


Chapter 6

Implementation

This chapter describes a new Neural Network model Converter (NNC) that converts models built with Keras and PyTorch, among others, into C++ source code with NNAPI support to run on mobile devices with the Android operating system. This tool replaces the TFLite Converter (TOCO) due to its current limitations, such as the difficulty of adding new frontends. NNC performs optimizations at a level close to NNAPI, and further backends (e.g., tflite) can easily be added. Section 6.1 describes the architecture of NNC, and Section 6.2 describes the main optimizations performed by NNC.

There are several reasons not to use a training framework for inference in a restricted environment, such as mobile.

• As a training deep learning framework computes both the forward-propagation and back-propagation operations, it needs more disk space and memory to store the information in the model that is required only for the learning step.

• The model has several layers that are used only in the training step, such as regularization layers (e.g., dropout), the loss function, the criterion, and the optimizer.

• Training deep learning frameworks were developed targeting large GPU clusters. Therefore, they were made to perform well on GPUs with greater processing power and in large clusters.

Thus, it is necessary to develop a framework whose architecture is designed to run optimally in mobile environments, using GPUs with less processing power, and with less RAM, less GPU memory, and less disk space.

6.1 Architecture


Figure 6.1: High level architecture

The process starts from a model previously trained by a deep learning framework such as Tensorflow or PyTorch. The first step is to convert the computational graph into a format that the inference engine can read and execute. This step also performs the optimization of the computational graph, because, as already shown, the computational graph generated during training has layers that are unnecessary when the objective is only to run the inference operation.

This optimization process also aims to generate the smallest possible model in terms of space. For this, techniques such as operator fusion, quantization, and other compiler-based methods are used, which will be explored in the next section.
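As a concrete illustration of one of these techniques, the sketch below shows 8-bit affine quantization and dequantization of a tensor; the function names are illustrative assumptions and do not correspond to NNC's API.

import numpy as np

def quantize_uint8(x):
    # Map a float tensor to uint8 with an affine scale and zero point.
    # Assumes x is not constant (x.max() > x.min()).
    scale = (x.max() - x.min()) / 255.0
    zero_point = int(round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_uint8(x)
assert np.allclose(dequantize(q, scale, zp), x, atol=scale)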

The last step is to execute inference on the model. The inference engine must have low latency and a small footprint; in our case, inference uses the mobile GPU, when available, to accelerate the process.

6.2 Computational graph optimization

The layer developed to convert and optimize the trained model generated by a framework like Keras or PyTorch is called NNC (Neural Network Converter and Optimizer). Put simply, NNC receives a model from a framework such as Keras, Tensorflow, or PyTorch and generates an optimized computational graph in a format that the inference engine can read; in our case, this format is NNAPI or Flatbuffers. Figure 6.2 shows this process in a simplified way.
