UNIVERSIDADE ESTADUAL DE CAMPINAS

INSTITUTO DE COMPUTAÇÃO

Marcos Vinícius Guimarães Martins Filho

A Task-Parallel Approach for Neural Networks

Uma Abordagem de Paralelismo de Tarefas para Redes Neurais

CAMPINAS

2020


Uma Abordagem de Paralelismo de Tarefas para Redes Neurais

Dissertação apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Mestre em Ciência da Computação.

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Master in Computer Science.

Supervisor/Orientador: Prof. Dr. Guido Costa Souza de Araújo

Este exemplar corresponde à versão final da Dissertação defendida por Marcos Vinícius Guimarães Martins Filho e orientada pelo Prof. Dr. Guido Costa Souza de Araújo.

CAMPINAS

2020


Ana Regina Machado - CRB 8/5467

Martins Filho, Marcos Vinícius Guimarães,

M366t    A task-parallel approach for neural networks / Marcos Vinícius Guimarães Martins Filho. – Campinas, SP : [s.n.], 2020.

    Orientador: Guido Costa Souza de Araújo.

    Dissertação (mestrado) – Universidade Estadual de Campinas, Instituto de Computação.

    1. Redes neurais (Computação). 2. OpenMP (Programação paralela). 3. Paralelismo de tarefas. 4. Programação paralela (Computação). I. Araújo, Guido Costa Souza de, 1962-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Uma abordagem de paralelismo de tarefas para redes neurais
Palavras-chave em inglês:
Neural networks (Computer science)
OpenMP (Parallel programming)
Task parallelism
Parallel programming (Computer science)
Área de concentração: Ciência da Computação
Titulação: Mestre em Ciência da Computação
Banca examinadora:
Guido Costa Souza de Araújo [Orientador]
Marcio Machado Pereira
Emilio de Camargo Francesquini
Data de defesa: 11-05-2020
Programa de Pós-Graduação: Ciência da Computação

Identificação e informações acadêmicas do(a) aluno(a)
- ORCID do autor: https://orcid.org/0000-0002-9234-6839
- Currículo Lattes do autor: http://lattes.cnpq.br/5991235594200187


A Task-Parallel Approach for Neural Networks

Uma Abordagem de Paralelismo de Tarefas para Redes Neurais

Banca Examinadora:

• Prof. Dr. Guido Costa Souza de Araujo (Unicamp)

• Prof. Dr. Emilio de Camargo Francesquini (UFABC)

• Prof. Dr. Marcio Machado Pereira (Unicamp)

A ata da defesa, assinada pelos membros da Comissão Examinadora, consta no SIGA/Sistema de Fluxo de Dissertação/Tese e na Secretaria do Programa da Unidade.


I thank my parents for always being with me at every step of the way, and for having imbued me with the certainty about the role that education plays in a person's life.

I thank my brothers and sister for being such an inspiration.

I thank my whole family because, if they weren't who they are, I wouldn't be who I am ("it takes a village").

I thank my wife, Lívia, for every minute of every day. She has been my compass, my lighthouse and my anchor. Any attempt to translate into words what her love and support meant would be in vain.

I thank Pepper for offering unconditional and judgmentless support, and for all the moments shared between experiments, attempts and chapters.

I thank Prof. Guido for his guidance and advice, and for allowing me to experience the academic and research environment at a level that I had not known before.

I thank Prof. Marcio for his tireless patience, willingness to help and availability in the last couple of years.

Finally, I thank the Eldorado Research Institute for everything that made it possible for this work to be carried out.


Resumo

Redes neurais e aprendizagem profunda têm se destacado recentemente em diversas áreas, do reconhecimento de imagens e detecção de objetos à geração de música. Todo esse sucesso as transformou em ferramentas pervasivas, com uma demanda crescente por executar aplicações baseadas em aprendizagem profunda em dispositivos tão computacionalmente diferentes quanto smart watches e super clusters.

Considerando que a etapa de treinamento é computacionalmente mais significativa que a de inferência, é natural que o foco de pesquisas na área de otimização seja na busca por melhorias no tempo de treinamento. Não obstante, à medida que aplicações baseadas em redes neurais começaram a ser utilizadas na borda da rede, tanto o meio acadêmico quanto a indústria intensificaram a pesquisa e o desenvolvimento de técnicas que objetivam otimizar a execução da etapa de inferência, como a troca de pequenas margens de acurácia por modelos menores (como a Mobilenet, por exemplo) ou consumo reduzido de memória e eficiência computacional (quantização, por exemplo).

No entanto, uma área que continua pouco explorada é o uso de paralelismo de modelo, ou seja, a possibilidade de se executar várias operações da rede neural ao mesmo tempo, a fim de melhorar o tempo gasto na inferência. Apesar de suportada pelos principais frameworks de aprendizagem profunda, esta técnica só é utilizada durante o treinamento de modelos muito grandes, cujos parâmetros não cabem na memória de um único dispositivo e, por isso, precisam ser divididos em vários nós de um cluster (potencialmente heterogêneo). Além disso, geralmente recai sobre o desenvolvedor a responsabilidade de determinar como as operações paralelas do modelo serão divididas através destes nós.

Este trabalho propõe uma nova abordagem para a utilização do paralelismo de modelo durante a etapa de inferência. Sua ideia central é explorar a correspondência entre um grafo de fluxo de dados (uma abstração comum para a representação de modelos de redes neurais) e um grafo de tarefas, e utilizar o runtime de escalonamento de tarefas para permitir que as operações paralelas de uma rede neural emerjam naturalmente.

Este trabalho utiliza como base o TensorFlow XLA, um compilador de domínio específico para redes neurais que é capaz de gerar código binário altamente otimizado e customizado para uma plataforma específica a partir do modelo de uma rede neural treinada. A forma como o XLA gera esse código foi modificada para que o runtime de escalonamento de tarefas da implementação de OpenMP do Clang fosse utilizado. Dessa forma, ao executar cada operação da rede através de uma tarefa, o paralelismo presente no modelo é naturalmente traduzido pela forma como um grafo de fluxo de dados representa suas dependências na forma de tensores.

A abordagem proposta neste trabalho foi avaliada em dois modelos diferentes da família Inception de redes neurais, cujo objetivo é o reconhecimento de imagens. Speed-ups de 13.64% e 12.42% foram obtidos para as redes Inception-v3 e Inception-v4, respectivamente, o que demonstra que o paralelismo de modelo é uma estratégia promissora para a otimização do tempo de inferência de redes neurais.


Abstract

Neural Networks and Deep Learning have recently excelled in many fields, ranging from image recognition and object detection to music generation. All of this success has turned them into pervasive tools, with an ever-growing push for running deep learning-powered applications on devices as computationally different as smart watches and supercomputing clusters.

Since the training stage is significantly more computationally intensive than the inference stage, it is natural that most of the optimization research focus has been on improving the training time. Nonetheless, as neural network-based applications make their way to the edge of the network, both academia and industry started developing techniques for improving how the inference stage is executed, like trading off accuracy for reduced model sizes (Mobilenet, for example) or less memory consumption and faster operations (quantization, for example).

One area that remains poorly explored, though, is leveraging model parallelism (the possibility of executing multiple neural network operations at the same time) for speeding up inference time. Although supported by several of the main deep learning frameworks nowadays, this approach is only considered for training large models that do not fit on a single device, and therefore need to be split across a potentially heterogeneous cluster. Moreover, it is usually the software developer's responsibility to manually split the parallel operations across different devices.

This work proposes a novel approach for exploiting model parallelism during the inference stage of a neural network. The central idea of this approach is to explore the correspondence that exists between a dataflow graph (a common abstraction for representing neural network models, adopted by many of the main deep learning frameworks) and a task graph, and to use the task scheduling runtime to let the parallelism patterns that a model exhibits emerge naturally.

This work builds on TensorFlow XLA, a domain-specific compiler for neural networks that generates highly optimized platform-specific code for a neural network model. By modifying XLA's code generation, Clang's OpenMP task runtime has been used for running every operation as a task. The parallelism in the model execution naturally emerges from how the dataflow graph expresses dependencies in terms of tensors.

Our approach was evaluated on two different models of the Inception family of neural networks that target the image recognition task. Speed-ups of 13.64% and 12.42% were obtained for Inception-v3 and Inception-v4, respectively, showing that model parallelism is a strategy worth pursuing to optimize the inference time of neural networks.


List of Figures

2.3 XLA HLO
2.4 XLA JIT execution
2.5 XLA AOT compilation mode
2.6 UML class diagram of the main ahead-of-time products
2.7 The LeNet convolutional neural network
2.8 Common task parallelism approaches
2.9 A graph of tasks representing an algorithm for a particular problem
4.1 Modifications to XLA for outlining kernels in separate functions
4.2 XLA integration with the Task Runtime
4.3 One of the recurring parallelizable kernel patterns exhibited by Inception-v3
4.4 Second approach: Kernel labeling
4.5 Example DAG with convolutions in different levels
4.6 Convolution barrier
4.7 Different scheduling options for a TFG
4.8 XLA Schedule Reordering
4.9 VTune Amplifier memory analysis comparison


List of Tables

4.1 XLA Profile: Inception-v3 Kernel Type Contribution for Inference Time
4.2 First approach: Time comparison for Inception-v3 using 2 tasks
4.3 First approach: Time comparison for Inception-v3 using 4 tasks
4.4 Second approach: Time comparison for Inception-v3 using 2 tasks
4.5 Second approach: Time comparison for Inception-v3 using 4 tasks
4.6 Third approach: Time comparison for Inception-v3 using 2 tasks
4.7 Third approach: Time comparison for Inception-v3 using 3 tasks
4.8 Theoretical maximum speed-up for parallelizable kernels in Inception-v3
4.9 Cache misses comparison
4.10 Inception-v3's top 20 most expensive kernels
4.11 Third approach: Time comparison for Inception-v4 using 2 tasks


List of Listings

5 XLA AOT's convolution implementation
6 OpenMP usage in a C function
7 Fibonacci with OpenMP Tasks
8 An example of OpenMP's task dependency management
9 Clang's OpenMP task runtime functions
10 Device generation function
11 The new Device-generating Function in XLA


Contents

1 Introduction
   1.1 Neural Networks Overview
   1.2 Contributions
2 Background
   2.1 TensorFlow and XLA - eXtended Linear Algebra
   2.2 Eigen
      2.2.1 Eigen Usage by XLA
   2.3 Convolutional Neural Networks
   2.4 Task Parallelism
   2.5 OpenMP
3 Related Work
   3.1 Academic Research
   3.2 Frameworks and Commercial Solutions
4 Implementation and Results
   4.1 First Approach: Every Kernel is a Task
      4.1.1 Test Environment
      4.1.2 Model Profiling
      4.1.3 Results
   4.2 Second Approach: Multiple Kernels per Task
      4.2.1 Test Environment
      4.2.2 Results
   4.3 Third Approach: Tasks for Big Kernels Only
      4.3.1 Further Reducing Task Overhead
      4.3.2 Test Environment
      4.3.3 Results
      4.3.4 Obtaining Better Speed-ups
      4.3.5 Results on Other Neural Network Models
5 Conclusions
   5.1 Future Work
      5.1.1 Device Heterogeneity
      5.1.2 Parallelization Without Tasks


Chapter 1

Introduction

Machine Learning (ML) and, more specifically, neural networks (NNs) and deep learning (DL), are rapidly becoming powerful tools for building computational models that are able to represent, in a non-analytical way, complex problems involving a large number of variables in a multidimensional hyperspace. State-of-the-art solutions to many problems, such as image classification [58] [51], object detection [41], speech recognition [47] and natural language processing [27], for example, are based on neural networks and deep learning [45], which has led to their adoption in different areas such as medicine [18], mathematics [19], robotics [44] and music [55], to name a few.

All of this success has revolutionized different fields [20] and caused, among other things, an explosion in the number of ML-focused frameworks, a push to run ML-based applications on different devices (from smartphones [37] to supercomputing clusters) and huge investments in the creation of dedicated devices (or the adaptation of existing ones) [2] [4] [1] and in the development of libraries and toolkits [10] [12] for ML and DL tasks.

1.1 Neural Networks Overview

For a neural network to learn how to perform any of the tasks described before, the first step is called training. For the sake of illustration, suppose the goal is to have an NN classify tumors as benign or malignant, given a scan of the tumor. To do this, the neural network needs to be fed a lot of data (the tumor scans) so that it can learn to recognize the patterns that differentiate the two types of tumors (or "target classes"). Since an NN model is inspired by how neurons are connected in the human brain [20], this is done by extracting features from the input and passing them layer by layer until the end of the network. Each layer is connected to the next with associated weights, and when the network misclassifies an input the weights are updated so that it can improve next time. Roughly speaking, the learning process is an optimization problem where, for the given input data (usually called the training dataset), a loss function is minimized by finding the best weights for all layers.
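To make the optimization view concrete, a minimal formulation (the notation below is ours, introduced only for illustration and not taken from the dissertation) is that training searches for the weights W that minimize the average loss over the N examples of the training dataset:

    W^{*} = \arg\min_{W} \; \frac{1}{N} \sum_{i=1}^{N} L\big( f(x_i; W),\, y_i \big)

where f(x_i; W) denotes the network's prediction for input x_i, y_i is the corresponding label (benign or malignant, in the example above) and L is the loss function being minimized.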

Training is usually a very resource-intensive task. In order to learn well, a network needs a lot of data, on top of which it will perform a huge amount of operations. For example, for training a Chinese speech recognition model, Baidu needs 4 terabytes of data on which 20 exaflops of computations are performed [15]. Microsoft took 3 days to train a simplified 18-layer ResNet [31] network on Azure [13]. The full 152-layer network was expected to take about 3 weeks to train.

Given these levels of resource consumption, training is usually performed in the cloud using heterogeneous clusters of CPUs, GPUs, or even specialized devices, such as Google's TPUs [2].

After training completes, the other main task related to neural networks is called inference. Given a trained model, the neural network should be able to "see" new input and perform the task it has learned to do. Referring back to our previous example, it would be classifying tumors based on new scans. This is usually done in an application that uses the trained model. If training has produced a model that learned good weights for all neurons, this model is said to generalize well; that is, it is accurate in evaluating new data it has never seen before.

As one can imagine, inference is a task far less computationally complex and resource-intensive than training. The input only travels through the network in one way (forward) and there is no need to update all weights if misclassifications happen (the backpropagation operation). Because of that, inference can usually be done on resource-constrained devices, like laptops, smartphones [37], etc.

Figure 1.1 shows the typical training and inference workflow described above. Considering the resource consumption characteristics of these two tasks, most of the research and development focus has been on exploring ways of increasing the performance of the training phase, with little attention given to the task of optimizing the inference on an already trained neural network.


This work builds on TensorFlow, whose core abstraction, the dataflow graph, is used "to represent both the computation in an algorithm and the state on which the algorithm operates [17]".

The main goal of this work is to explore the possibility of improving the inference phase execution time by leveraging model parallelism [26] [34] in a dataflow graph. This is done by exploiting the direct mapping that exists between the dataflow graph that represents a neural network and a task graph [49].

Consider the Inception-v3 neural network [52], for example. Its structure is represented in Figure 1.2. Assuming the dataflow abstraction, it is possible to see that there are multiple operations in this neural network that could be executed at the same time, and these operations recur in a regular pattern throughout the entire model.

Figure 1.2: The Inception-v3 neural network

As stated previously, most frameworks offer powerful features for improving the training time, while inference receives little attention, as it is considered a "cheap" task. Given that most of the computation of Convolutional Neural Networks (CNNs) occurs in the convolution nodes (or operations) themselves, TensorFlow (TF) XLA [39] and other compiling infrastructures and deep learning frameworks focus their efforts on extracting intra-convolution parallelism. This is typically achieved by means of libraries like Eigen, which slice the input data into blocks and allocate them along the memory hierarchy, with the goal of minimizing expensive data movements across caches and memory pages.

Unfortunately, not much research has been done on exploiting inter-operation parallelism in the dataflow graph. Although convolutions can run in parallel in many CNN models, kernel execution is typically serialized and Eigen threads from one convolution do not run in parallel with threads from other convolutions.
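To make the distinction concrete, the sketch below illustrates, with OpenMP tasks (introduced in Section 2.5), how two convolutions that only share their input could run concurrently and then feed a concatenation. This is an illustrative sketch written for this text; the function names (conv_a, conv_b, concat_results) and the buffer handling are hypothetical and do not correspond to the code XLA actually generates.

#include <vector>

// Hypothetical stand-ins for two convolution kernels and a concatenation
// kernel. The bodies are placeholders; only the dependency structure matters.
static void conv_a(const std::vector<float>& input, std::vector<float>& out) { out = input; }
static void conv_b(const std::vector<float>& input, std::vector<float>& out) { out = input; }
static void concat_results(const std::vector<float>& a, const std::vector<float>& b,
                           std::vector<float>& out) {
  out = a;
  out.insert(out.end(), b.begin(), b.end());
}

void inception_like_block(const std::vector<float>& input, std::vector<float>& out) {
  std::vector<float> ra, rb;
  #pragma omp parallel
  #pragma omp single
  {
    // Both convolutions read the same input and write disjoint buffers,
    // so the task runtime is free to execute them at the same time.
    #pragma omp task shared(input, ra) depend(out: ra)
    conv_a(input, ra);

    #pragma omp task shared(input, rb) depend(out: rb)
    conv_b(input, rb);

    // The concatenation consumes both results and only becomes ready
    // once the two tasks above have completed.
    #pragma omp task shared(ra, rb, out) depend(in: ra, rb) depend(out: out)
    concat_results(ra, rb, out);
  }
}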

To the best of our knowledge, such inter-operation parallelism has not been deeply explored by any of the modern frameworks. This is the case for TensorFlow XLA, the tool on top of which this work was built.

Thus, the main research question addressed in this work is whether or not applying model parallelism to a dataow graph can yield speed-ups during the DL inference phase. Its contributions are:


• The research and investigation of the adoption of model parallelism by current deep learning frameworks and tools;

• The investigation of the direct mapping between a task graph and a dataflow graph, and the task parallelism possibilities that this mapping entails;

• The implementation of a flexible mechanism for enabling task-based model parallelism on top of TensorFlow XLA;

• The creation of a foundation that might allow more complex model parallelism configurations in the future, by exploring device heterogeneity, for example.

This dissertation resulted in a poster presentation named "Extracting Inter-convolution Parallelism in TF XLA using OpenMP Tasks", presented at the Compilers for Machine Learning (C4ML) workshop, which took place in February 2020.

This dissertation is organized as follows. Chapter 2 discusses the main concepts explored in this work, especially the functioning of TensorFlow XLA and its usage of the Eigen library, a review of convolutional neural networks, and task parallelism and its support in OpenMP [23]. Other similar works are described and put in perspective in Chapter 3. Chapter 4 details how the support for model parallelism has been implemented in XLA and analyzes the resulting performance. Chapter 5 summarizes the proposed approach and discusses future work.


Chapter 2

Background

This chapter explores the main concepts involved in our proposal, introduced in Chapter 1. As previously stated, since our main goal is to explore the possibility of applying model parallelism during inference, Section 2.1 presents TensorFlow XLA, the framework that provides the starting point for inference optimization. Eigen, a linear algebra library used by XLA for implementing different operations, is presented in Section 2.2. A specific type of neural network, called the Convolutional Neural Network, that was explored in depth during our experiments, is discussed in Section 2.3. Task parallelism, which provides the abstraction we used to explore model parallelism, is revisited in Section 2.4, and OpenMP, a library that supports task parallelism and that was also leveraged in this work, is presented in Section 2.5.

2.1 TensorFlow and XLA - eXtended Linear Algebra

TensorFlow (TF) [17] is a platform for high-performance numerical computation, originally developed with a focus on machine learning and deep learning. Created by Google and then released as an open-source project, it is currently used in production by many services, both from Google and from several other companies. TensorFlow's core abstraction is the TensorFlow Dataflow Graph (TFG), which is used "to represent both the computation in an algorithm and the state on which the algorithm operates [17]".

In a TFG, each node represents a mathematical function (sometimes also referred to as a "kernel") that transforms its inputs into outputs. Inputs and outputs (the graph's edges) are called tensors, which are multi-dimensional arrays with elements from a primitive type, such as an integer or floating-point number, for example.

As discussed previously, as neural networks and deep learning show excellent results in many fields, a clear demand arises for running NN-based applications on all kinds of devices. Especially considering their pervasive use, smartphones and mobile platforms in general started to draw a lot of attention in terms of neural network execution support. On-device ML inference brings several benefits, such as reduced latency (without the need to send data to and from remote servers), Internet-connection independence and privacy. Therefore, many frameworks added support (or created parallel projects) for their execution on constrained devices.


Figure 2.1: TensorFlow Lite Workow

TensorFlow has been no different, and there are two TF-based projects with partially overlapping goals moving in that direction, namely TensorFlow Lite and XLA.

TensorFlow Lite (TF Lite) is a collection of tools designed for running inference on constrained devices, such as smartphones, microcontrollers and embedded platforms in general. Its basic usage workflow consists of finding a pre-trained TensorFlow model, running it through the TensorFlow Lite converter tool (which is responsible for converting the model to the format that TF Lite supports) and deploying it to the target device. Optionally, a model optimization toolkit can be used to reduce the model size and to apply other optimizations that make it more suitable for constrained devices. This workflow is depicted in Figure 2.1, taken from the official TF Lite page.

XLA (Extended Linear Algebra) is TensorFlow's domain-specific compiler, responsible for optimizing the execution of machine learning models by applying a series of compiler-inspired transformations on the TFG and generating highly optimized target-specific code that implements the model [39]. Some of the main motivations for building XLA were [16]:

• Improved execution speed: target-specific code generation and optimizations such as kernel fusion and constant propagation reduce the overhead posed by the traditional TensorFlow runtime;

• Improved memory usage: generating code for a specific model allows for a priori kernel scheduling and memory planning, eliminating the need for intermediate buffers and long-lived memory regions;

• Reduced mobile footprint: eliminating the TensorFlow runtime and packing only the model-specific dependencies allows for a reduced binary size.


Figure 2.2: XLA functioning - Taken from [39]

XLA supports two distinct modes of operation, Just-in-Time (JIT) and Ahead-of-Time (AOT) compilation, with similar goals but different strategies and use cases. Regardless of the mode, its basic functioning can be seen in Figure 2.2. Given a regular TFG, XLA works by generating an equivalent graph implemented in terms of its own kernels. During this process, the graph is transformed to produce a target-specific optimized model.

Regarding the XLA graph, its main abstraction is the High-Level Optimizer, or HLO, which can be seen as the equivalent of a compiler's Intermediate Representation (IR). When compiling a model for a target device, XLA performs compiler-inspired optimizations on the HLO graph, producing binary code highly specific to the given model and device. This process can be seen in Figure 2.3.

The target-independent optimizations, as the name implies, focus on modifications to the graph that do not depend on the specific device where the model is running. Examples of such optimizations are dead code elimination, common subexpression elimination, constant folding, and target-independent kernel fusion. After this phase, the resulting HLO graph is passed to the XLA backend, where, given the knowledge of the target device, other kinds of optimizations can be performed. If the target backend is a GPU, for example, different types of kernel fusion or buffer analysis that would benefit only a GPU execution can be done.

The currently supported XLA backends target CPUs (x64 and ARM64) and GPUs (NVIDIA through NVPTX), and both of them leverage LLVM [38] for IR and code generation. Note that supporting a new backend can be relatively simple if the target hardware already has an LLVM implementation.

XLA's JIT mode can be enabled in the traditional TF training or inference workflows (which sets it apart from TensorFlow Lite) for optimizing (sub)graphs of computations, either explicitly or implicitly. The implicit use is called auto-clustering, which delegates to TensorFlow itself the task of finding clusters supported by XLA (Figure 2.4).


Figure 2.3: XLA HLO - Taken from [16]

Target-specific code is generated for these clusters (which are essentially subgraphs of the model), providing optimized execution time and memory footprint, as discussed earlier.

XLA's AOT mode works by means of tfcompile, a tool that compiles a TensorFlow neural network model into executable code. A target use case for tfcompile is to generate a model for neural network inference on mobile devices. Different from JIT, which supports GPUs as a backend, ahead-of-time compilation currently supports CPUs only.

tfcompile takes a TF (sub)graph as input and generates a single function that implements it. The output of the AOT compilation is an object file with the function implementation and a header file that exposes the function signature. These components can be used in the final application, be it a mobile (Android or iOS, for example) or a regular x86_64 environment. This process is indicated in Figure 2.5. First, a regular TFG is converted to an equivalent XLA HLO graph. As described before, this step involves applying several compiler-inspired optimizations to the model. Afterward, LLVM IR is generated for each one of the kernels in the XLA graph, which are later lowered to platform-specific code. Although still considered experimental, benchmarks show 4x binary size reductions when using AOT compilation [39].

The function that XLA produces for a model is wrapped by an auto-generated class that provides utility methods for invoking it and for controlling other aspects of the compiled model. Listing 1 depicts an example of such a class, generated for the Inception-v3 model.

Figure 2.4: XLA JIT execution - Adapted from [39]

Figure 2.5: XLA AOT compilation mode [39]

Note that the InceptionGraph class (the auto-generated one) inherits from another class called XlaCompiledCpuFunction. This is a utility base class provided by XLA that offers common methods for setting the model's inputs, retrieving the inferred outputs and controlling threading behavior. XlaCompiledCpuFunction uses another class, called ExecutableRunOptions, to control some aspects of the model function's runtime behavior. One such aspect, which will be more thoroughly discussed in Chapter 4, is XLA's multi-threading support. For now, it is sufficient to say that XLA's implementation of some kernels is based on Eigen [30], which is explored in Section 2.2. These kernels are prepared to use multiple threads for accelerating their execution, and ExecutableRunOptions allows the user to specify a thread pool for that purpose. Figure 2.6 shows a UML class diagram depicting the relationship between the main classes in the ahead-of-time compilation product.


Listing 1 Example of an XLA AOT auto-generated class wrapping the Inception-v3 model

// Generated by tfcompile, the TensorFlow graph compiler. DO NOT EDIT!
//
// This header was generated via ahead-of-time compilation of a TensorFlow
// graph. An object file corresponding to this header was also generated.
// This header gives access to the functionality in that object file.
//
// clang-format off

#ifndef TFCOMPILE_GENERATED___tensorflow_compiler_aot_models__inception_graph_H_
#define TFCOMPILE_GENERATED___tensorflow_compiler_aot_models__inception_graph_H_

#include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
#include "tensorflow/core/platform/types.h"

namespace Eigen { struct ThreadPoolDevice; }
namespace xla { class ExecutableRunOptions; }

// (Implementation detail) Entry point to the function in the object file.
extern "C" void __tensorflow_compiler_aot_models__inception_graph(
    void* result, const xla::ExecutableRunOptions* run_options,
    const void** args, void** temps, tensorflow::int64* profile_counters);

class InceptionGraph : public tensorflow::XlaCompiledCpuFunction {
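To give an idea of how such a generated class is consumed, the sketch below follows the usage pattern from XLA's tfcompile documentation. The header name, the input shape and the accessor names (arg0_data(), result0_data()) are assumptions made for illustration; the actual names and buffer sizes depend on the feeds and fetches declared when the model was compiled.

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <algorithm>
#include <vector>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "inception_graph.h"  // hypothetical name for the tfcompile-generated header

int main() {
  // Thread pool handed to the Eigen-based kernels (convolutions, matmuls).
  Eigen::ThreadPool tp(4);
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  InceptionGraph graph;            // the auto-generated class from Listing 1
  graph.set_thread_pool(&device);

  // Copy a preprocessed input image into the model's first argument buffer
  // (299x299x3 is assumed here; the real shape comes from the model's config).
  std::vector<float> image(299 * 299 * 3, 0.0f);
  std::copy(image.begin(), image.end(), graph.arg0_data());

  if (graph.Run()) {
    const float* probabilities = graph.result0_data();  // assumed single output
    (void)probabilities;  // ... consume the class scores ...
  }
  return 0;
}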

Figure 2.6: UML class diagram of the main ahead-of-time products

2.2 Eigen

Eigen is "a C++ template library for linear algebra" [30] that offers a wide range of mathematical operations based on matrices and vectors. It is implemented as a pure template library, which means it is header-only and, therefore, very compact and easy to integrate into client applications.


Eigen's Tensor module [5] (which will be discussed in more detail in the next section) provides multi-threaded implementations of operations like convolutions and contractions, which, combined with the compile-time optimizations and reduced footprint described earlier, has led to its adoption by XLA for implementing the AOT compilation versions of matrix multiplications and convolutions (that is, instead of generating code for its own versions of these kernels, XLA invokes Eigen instead). As will be discussed in depth in the next section and in Chapter 4, Eigen's efficient multi-threaded convolution implementation had a great influence on the parallelization strategy proposed and implemented in this work.

2.2.1 Eigen Usage by XLA

To better understand the parallelization opportunities regarding convolutional neural networks and their kernels, an in-depth analysis of how XLA uses Eigen to implement convolutions is required. This understanding is crucial for devising an intelligent strategy for exploiting inter-convolution parallelism, as will be discussed in Section 4.3.

XLA's documentation on how to use the Ahead of Time (AOT) mode [14] shows how Eigen is meant to be used, as can be seen in Listing 2.

First, the "Tensor" module from Eigen is included, which is a third-party contribution to Eigen's source code base [5] that defines the Tensor class. Tensors are "multidimensional arrays of elements" for which Eigen offers different implementations of operations, such as convolutions, for example, optimized for different environments. Although the currently supported environments are the CPU (single and multi-threaded) and GPU (using CUDA [3]), XLA's AOT mode only supports generating code that targets the CPU (both x86 and ARM).

The next lines in Listing 2 define two objects that are used to control the number of threads that will be utilized when running inferences with the XLA-generated code: ThreadPool and ThreadPoolDevice.

The ThreadPool class, as the name suggests, defines a pool of threads whose size is supplied upon construction. Eigen's thread pool implementation provides a Schedule() method that allows the user of the thread pool to schedule a function for execution. The function is placed in the work queue of one of the pool's threads, which is chosen randomly. All of the threads in a pool execute a loop where they wait for work on their own queues or try to steal pending work from the other threads' queues, thus ensuring maximum pool utilization at all times.
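A minimal sketch of this interface is shown below; it uses only the Schedule() entry point described above, and the pool size, the number of work items and the completion counter are arbitrary choices made for the example.

#include <atomic>
#include <cstdio>
#include "unsupported/Eigen/CXX11/ThreadPool"

int main() {
  Eigen::ThreadPool pool(4);   // four worker threads, each with its own work queue
  std::atomic<int> pending(8);

  for (int i = 0; i < 8; ++i) {
    // Each closure is pushed to the queue of a randomly chosen worker;
    // idle workers steal from the others (the work-stealing loop described above).
    pool.Schedule([i, &pending] {
      std::printf("work item %d\n", i);
      pending.fetch_sub(1);
    });
  }

  // Crude completion wait, just for the example; real code would use a
  // proper synchronization primitive.
  while (pending.load() > 0) {
  }
  return 0;
}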

The ThreadPoolDevice object is the one that is actually going to be used by Eigen.

3https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/


Listing 2 Using Eigen and XLA AOT compilation. Reproduced from XLA's official documentation

#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h" // generated

int main(int argc, char** argv) {
  Eigen::ThreadPool tp(2);  // Size the thread pool as appropriate.
  Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

  foo::bar::MatMulComp matmul;
  matmul.set_thread_pool(&device);

Listing 3 The user-defined ThreadPoolDevice is set in the ExecutableRunOptions member

// Sets the intra-op thread pool used to run individual ops concurrently.
void set_thread_pool(const Eigen::ThreadPoolDevice* pool) {
  run_options.set_intra_op_thread_pool(pool);
}

It is built as a "projection" of the actual thread pool and can be used to control how many threads a Tensor will actually use for running its operation. Although a ThreadPool might have been constructed with 32 threads, for example, a ThreadPoolDevice that makes use of it can be declared with a smaller number of threads, like 16.

The last line of Listing 2 shows how the ThreadPoolDevice instance is finally set in the object of the MatMulComp class (which, in this example, is the class that is automatically generated by tfcompile), which will, in turn, pass it along to its member of the ExecutableRunOptions class, described earlier. Listings 3 and 4 show the methods involved in this process.

Listing 4 ExecutableRunOptions exposes methods for setting and retrieving the ThreadPoolDevice

// Sets the thread pool device on which to run Eigen subcomputations.
// Does not take ownership.
ExecutableRunOptions& set_intra_op_thread_pool(
    const Eigen::ThreadPoolDevice* intra_op_thread_pool);

const Eigen::ThreadPoolDevice* intra_op_thread_pool() const;

The way this device is used during model inference is through the run_options pointer passed to the XLA-generated model function and used by all convolutions. Listing 5 shows the function that XLA invokes when generating the source code for a convolution kernel. Note how the run_options_ptr parameter is converted from a void* to an xla::ExecutableRunOptions pointer, from which the intra-op thread pool is retrieved and handed to the Eigen-based convolution implementation.

Listing 5 XLA AOT's convolution implementation. Reproduced from TensorFlow's Github public repository

TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_EigenConvF32(
    const void* run_options_ptr, float* out, float* lhs, float* rhs,
    int64 input_batch, int64 input_rows, int64 input_cols, int64 input_channels,
    int64 kernel_rows, int64 kernel_cols, int64 kernel_channels,
    int64 kernel_filters, int64 output_rows, int64 output_cols,
    int64 row_stride, int64 col_stride, int64 padding_top, int64 padding_bottom,
    int64 padding_left, int64 padding_right, int64 lhs_row_dilation,
    int64 lhs_col_dilation, int64 rhs_row_dilation, int64 rhs_col_dilation) {
  const xla::ExecutableRunOptions* run_options =
      static_cast<const xla::ExecutableRunOptions*>(run_options_ptr);
  XLA_LIGHTWEIGHT_CHECK(run_options->intra_op_thread_pool() != nullptr);
  tensorflow::xla::EigenConvImpl(
      *run_options->intra_op_thread_pool(), out, lhs, rhs, input_batch,
      input_rows, input_cols, input_channels, kernel_rows, kernel_cols,
      kernel_channels, kernel_filters, output_rows, output_cols, row_stride,
      col_stride, padding_top, padding_bottom, padding_left, padding_right,
      lhs_row_dilation, lhs_col_dilation, rhs_row_dilation, rhs_col_dilation);
}

The conclusions drawn from this analysis are that the user is completely responsible for defining how many threads Eigen will use for executing convolutions, and that this number is the same for all convolutions in the TFG, since all of them use the same ThreadPoolDevice.

2.3 Convolutional Neural Networks

As Chapter 4 will show, the experiments performed focused on a specific type of neural network called the Convolutional Neural Network (CNN). Although first proposed in the late 1980s, it was only successfully applied in the 1990s [20], and it gained popularity and started drawing a lot of attention after Alex Krizhevsky et al. [36] proposed an architecture that won the Imagenet competition in 2012. Ever since, this type of neural network has been widely applied to solving not only computer vision tasks but also speech recognition and sentence classification, to name a few.

The general operation of a CNN can be seen in Figure 2.7, taken from the LeNet [40] neural network. The first layers of these networks are formed by convolutional kernels. Their basic idea is to slide (or convolve) a series of filters over the input, each one responsible for discovering certain patterns in the data. For example, assuming the input is an image, there could be filters responsible for finding edges, curves, specific color patterns, etc. The filter and input values are multiplied and summed, transforming the input that will feed the next layer. The output of a convolutional layer is usually passed through a subsampling layer, such as a Max or Average Pooling, for example, which, in turn, feeds the next convolutional layer. The idea is that the chain of filters will discover different features in the data, each deeper layer identifying more intricate patterns than its predecessors.

Figure 2.7: The LeNet convolutional neural network

The final parts of convolutional neural networks are usually formed by fully-connected layers. They are responsible for classifying the input, given all of the features discovered by the convolutional filters. For the sake of illustration, consider a network trained to recognize images. Assuming the patterns discovered for an input indicate the presence of two eyes, a nose and a mouth, the image could be classified as a person. If the features had indicated two eyes and a beak, the output would be a bird.

2.4 Task Parallelism

Data and task parallelism are the two main forms of exploiting parallelism in the context of software development. The fundamental difference between these two approaches is that, whereas the former consists of applying the same function to different parts of the input data simultaneously, the latter applies different functions to the same, or even different, parts of the data at the same time.

Figure 2.8 shows two commonly adopted task parallelism approaches. Part (a) shows an example where the user wishes to find the minimum, maximum and average values of a very large integer array. In this case, all three (different) functions can be applied to the same data at the same time and there is no inter-task dependency. Part (b) shows an example of a pipeline where each task applies an operation (or transformation) to a different part of the input data at the same time. In this case, there might or might not be inter-task dependency.
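As a small, self-contained illustration of the pattern in part (a) (the code is ours and uses the OpenMP tasking constructs presented in Section 2.5), the three reductions can be expressed as independent tasks over the same array:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> data(1000000);
  std::iota(data.begin(), data.end(), 0);

  int min = 0;
  int max = 0;
  double avg = 0.0;

  #pragma omp parallel
  #pragma omp single
  {
    // Three different functions applied to the same data; there are no
    // inter-task dependencies, so all three tasks may run concurrently.
    #pragma omp task shared(data, min)
    min = *std::min_element(data.begin(), data.end());

    #pragma omp task shared(data, max)
    max = *std::max_element(data.begin(), data.end());

    #pragma omp task shared(data, avg)
    avg = std::accumulate(data.begin(), data.end(), 0.0) / data.size();

    #pragma omp taskwait
  }

  std::printf("min=%d max=%d avg=%.1f\n", min, max, avg);
  return 0;
}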

A task-parallel approach for solving a problem can be taken by abstracting the solution as a series of different steps that must be carried out until the completion of the algorithm. A step (or a task) might depend on the outputs of previous steps and, in turn, might produce data that becomes the input to other tasks. Therefore, the algorithm can be modeled as a dataflow graph G = (V, E), where V represents the set of tasks and E the data dependencies between any two nodes, as depicted in Figure 2.9.


Figure 2.8: Common task parallelism approaches. Sometimes, different, independent functions can be applied to the same data at the same time (a), or to different parts of the data, in a parallel pipeline (b).

Figure 2.9: A graph of tasks representing an algorithm for a particular problem

Usually, the tasks are created by a producer process (or thread) and submitted to a task scheduling runtime that is responsible for determining when a task is ready to be executed and for dispatching it to one of the available worker processes (or threads), the consumers. A task is considered to be ready once all of its dependencies have been fulfilled. A correct execution of the dataflow graph G is one that respects all of the dependencies (E) between the tasks (V).


In this way, the parallelism present in a dataflow graph can be exploited, since multiple tasks can be ready at the same time and, therefore, be assigned to different workers. In the example from Figure 2.9, tasks T3, T4, T5 and T6 can be executed at the same time, if there are enough worker threads available (just as tasks T7, T8 and T9). Thus, the task parallelism abstraction can be viewed as a coarse-grained form of the out-of-order execution commonly implemented in superscalar processors [28].

Given the power of such an abstraction, and how it constitutes a natural way of reasoning about the parallel execution of an algorithm, many frameworks and programming languages provide support for this type of parallelism, simplifying the effort of writing task-parallel code. A few examples are .NET's Task Parallel Library, Intel's Threading Building Blocks, Cilk and OpenMP. Many of these frameworks and libraries are able to automate the detection and resolution of inter-task dependencies, easing the burden of writing task-parallel code [28].

A common research field in task parallelism aims at designing efficient task scheduling runtimes, in order to avoid incurring significant task management overheads, which can degrade the potential speed-ups that this technique can produce. Many studies have proposed solutions that surpass the traditional and commonly found software-based runtimes, relying on native CPU support or even MPSoCs and FPGAs for task management ([43], [24], [25], [53], [54], [59]).

2.5 OpenMP

OpenMP [23] is a set of library routines and compiler directives that can be used to apply high-level parallelism to code written in Fortran, C and C++. It was created in the 1990s to solve the problem of specifying portable, incremental and high-level parallelism in programs targeting scalable shared memory multiprocessor (SSMP) architectures, which were starting to emerge by that time.

Before OpenMP, programmers had to resort to frameworks and libraries that were not created with SSMP architectures in mind (which added unnecessary complexity to the code) to express the parallel behavior of their applications. The alternative was to rely on directives and special instructions created by SSMP vendors, which were specific to their own architectures and, therefore, harmed portability.

Listing 6 shows a C function that uses OpenMP to fill the values of a float array based on an expression that involves accessing another array. The "omp parallel for" pragma indicates that the enclosed loop should be executed by multiple threads. Under the hood, the compiler will translate the pragma into a series of function calls to the OpenMP library, which, in turn, takes the responsibility of creating the necessary threads, controlling concurrent access to shared memory regions and cleaning everything up once the work is complete.

Listing 6 OpenMP usage in a C function

void simple(int n, float *a, float *b)
{
    int i;

    #pragma omp parallel for
    for (i = 1; i < n; i++)  /* i is private by default */
        b[i] = (a[i] + a[i-1]) / 2.0;
}

The example in Listing 6 makes it clear how much boilerplate work is taken away from the programmer by leveraging OpenMP to parallelize (parts of) the code. As stated before, this can be done incrementally, in a non-intrusive way. The "simple" function could be part of a much larger code base, where the programmer can gradually introduce "pragma" directives such as the one represented above in a controlled fashion.

4 From OpenMP's documentation at https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.

The success of the framework has led to the creation of a massive community around it, and its specification has been extended and has evolved throughout the years. Currently it is at version 5.0 (released in November 2018). Among its main features, it is worth mentioning:

• Thread affinity;

• Support for SIMD instructions (hardware-assisted vector operations, like Intel AVX, for example);

• Device heterogeneity, for offloading computations to specialized devices, such as GPUs and DSPs, for example;

• Tasking constructs, to support task parallelism.

Listing 7 A C function to calculate the nth number in the Fibonacci sequence using tasks

int fib(int n) {
    int i, j;
    if (n < 2)
        return n;
    else {
        #pragma omp task shared(i)
        i = fib(n - 1);
        #pragma omp task shared(j)
        j = fib(n - 2);
        #pragma omp taskwait
        return i + j;
    }
}

Listing 7 shows an example of OpenMP's task support in a function that calculates the nth number in the Fibonacci sequence. Note that in this case the same function ("fib") is applied to different parts of the data (n - 1 and n - 2) at the same time. It only cost the developer three extra lines of code to parallelize it. The "taskwait" pragma is a barrier used to instruct OpenMP to wait for the completion of all tasks before executing the subsequent lines. In this case it was necessary because the result of the function (i + j) depends on values that will only be available after all tasks complete their work.

5 From OpenMP's documentation at https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.

Listing 8 illustrates a powerful feature supported by OpenMP. The "depend" clause can be used to express dependency relationships between tasks, which are determined by their access patterns to shared variables. Note that, since the first task invokes a function that writes to "add", this variable is an output dependency of the first task. And, since the same variable is read by the second task (in the "divide" function), it is said to be an input dependency of that task.

Internally, the OpenMP implementation has to provide a task runtime that is responsible for managing the execution of the tasks submitted to it in a way that honors the dependency relationships determined by the programmer (note how this maps directly to the dataflow graph abstraction described in Section 2.4). Considering the example depicted in Listing 8, the second task must not start until the first one completes, or it would risk reading an invalid value from the "add" variable.

The support for expressing inter-task dependencies and, consequently, the existence of the task runtime, are very important for this work, as will be explained in Chapter 4. Currently, OpenMP is supported by several different compilers, including GCC and Clang (which also plays an important role in this work), and is used by many institutions and companies.

6 https://www.openmp.org/resources/openmp-compilers-tools/
7 https://www.openmp.org/about/whos-using-openmp/


Listing 8 An example of OpenMP's task dependency management

void sum(int a, int b, int* c)
{
    *c = a + b;
}

void divide(int x, int y, int* result)
{
    *result = x / y;
}

int main(int argc, char** argv)
{
    int s1 = 10;
    int s2 = 5;
    int add = 0;
    int div = 3;
    int result = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(add) depend(in: s1, s2) depend(out: add)
            sum(s1, s2, &add);

            #pragma omp task shared(add) depend(in: add, div) depend(out: result)
            divide(add, div, &result);

            #pragma omp taskwait
        }
    }

    return result;
}


Chapter 3

Related Work

Although numerous works have proposed different ways of improving the execution of neural networks, the research and development focus has been on the training phase, the most computationally expensive one, as previously stated. This chapter is dedicated to highlighting research and free or commercial products that have goals and/or approaches similar to the ones presented in this work. Section 3.1 revisits academic research targeting the optimization of neural network inference and how model parallelism has been approached in these works. Section 3.2 explores similar topics, but in the context of NN and DL frameworks and commercial solutions.

3.1 Academic Research

A natural way of considering model parallelism, even during the inference phase, would be to explore device heterogeneity for executing different parts of a model at the same time. For example, while a CPU is busy running a specific (sub)branch of a model, the GPU could be used for running another (sub)branch in parallel. Although one could argue that this approach would make kernel scheduling more complex (consider synchronizing dependencies when kernels are running on different devices), it would be an efficient way of exploring the full computing power of a target device and the parallelization opportunities presented by a neural network model at the same time. Nevertheless, when heterogeneous architectures are discussed, the scenario is usually a data center or cloud infrastructure, with high-end servers, GPUs and other kinds of accelerators.

One possible explanation for why heterogeneity is mostly considered for high-end servers is the lack of mainstream availability of GPUs and other co-processors in edge platforms, which could help to draw attention to model parallelism based on device heterogeneity during inference. Wu et al. [57] performed an in-depth analysis of the mobile devices and System on Chips (SoCs) landscape in 2018, considering over 1 billion devices where Facebook's neural network engine is deployed. Their investigation showed that over 70% of the primary CPU cores in use were designed 5 or more years ago. As far as Graphics Processing Units are concerned, a median device has a GPU that is just as powerful as the mobile device's CPU, and only 11% of them have a GPU that is 3 times more powerful than the CPU. Therefore, considering that the presence of a GPU is not


is to speed up inference by either creating more compact models or simplifying the arithmetic operations they perform (quantization, for example, involves reducing the number of bits that represent a number, also converting them from floating point to integers). Although these approaches usually incur accuracy penalties, it is possible to find a good cost/benefit relationship when it comes to inference time.

On the device heterogeneity trend, Mirhoseini et al. [42] proposed a method for optimizing device placement for TensorFlow computational graphs based on a sequence-to-sequence model that is used to predict which subsets of kernels should run on which devices (CPU cores or GPUs), which would be a way of enabling model parallelism. However, this approach is completely different from the task parallelism approach taken by this work. Moreover, although the paper mentions the possibility of using their work for inference, experiments and results were only described for training. Their hardware setup comprises an Intel Haswell 2300 CPU, with 18 cores, and either 2 or 4 Nvidia Tesla K80 GPUs, which are data-center-grade devices more fit for training a deep learning model. Speed-ups ranging from 19.0% to 23.5% are reported on different models.

A two-stage pipeline for optimizing deep learning models on edge devices, implemented on the TVM stack [22], is presented by Jiang et al. [35]. Although also attempting to improve the inference phase of a trained machine learning model, the approach taken is radically different from the one proposed in this work. The first stage of the pipeline involves performing compiler-inspired transformations on the dataflow graph to improve its execution time and memory consumption. These transformations, such as constant folding and kernel fusion, are a subset of the ones already implemented in TensorFlow XLA, from which our approach therefore already benefits. The second stage of the pipeline involves searching for optimal parameters (tiling, reordering of operations to improve memory locality, loop unrolling, parallelization, etc.) for each kernel in the graph, considering the target platform (what the paper calls the "kernel schedule"). The parallelization considered is limited, however, to intra-kernel parallelism, that is, different nodes from the dataflow graph are still executed sequentially, even if there is the possibility of parallelization at this level. They report 3x and 10x improvements in latency for ResNet-18 and Mobilenet, respectively, with their pipeline implementation, which takes 1 hour to find the optimal kernel parameters.

Zhihao et al. [34] propose FlexFlow, a deep learning framework that seeks to provide more comprehensive parallelization opportunities than just data and model parallelism (although both are also supported). To achieve that, the authors define the SOAP parallelization space, comprising the Sample, Operation, Attribute and Parameter dimensions. Model parallelism can be obtained through the Operation and Parameter dimensions. FlexFlow uses an operator graph abstraction, which can be seen as the equivalent of a regular DL dataflow graph, such as TensorFlow's, for example, to describe the neural network's operations and state. This graph, along with the representation of the device topology, is used as the input for the execution optimizer. The execution optimizer uses a Markov Chain Monte Carlo (MCMC) algorithm to explore the SOAP space to find parallelization strategies, which go through the execution simulator, a component that uses a delta simulation algorithm to predict the execution time of the strategy, given some predefined assumptions about how the operations are executed and how the communication paths among devices behave. When the search time budget defined by the user is consumed, a distributed runtime executes the model using the best strategy discovered thus far.

Two important differences exist between our work and FlexFlow. First, it was proposed as a new DL framework, which requires users to re-implement their NN models using its APIs. Since this work leverages a widely adopted framework, we can benefit from existing trained models and do not require existing users to re-implement their models. Second, the whole approach taken by FlexFlow is better suited for training neural network models than for inference. The SOAP space exploration and strategy simulation consume an amount of time that can be prohibitive when the focus is just on using the trained model to infer the output for new input data. In fact, all reported experiments consider only training, and were performed on large CPU-GPU clusters, typical of data center setups.

3.2 Frameworks and Commercial Solutions

Analyzing the most prominent open-source frameworks and commercial solutions available for neural networks and deep learning in general, it is possible to identify similarities with our proposal at different points of the spectrum. Nevertheless, none of these solutions offers features for model parallelism-based inference optimization.

MXNet [21] is, perhaps, the production-grade framework that most closely resembles our proposal. At the core of these similarities is its Dependency Engine, a library responsible for transparently scheduling operations based on the input and output data dependencies between them. The Dependency Engine is, essentially, a task runtime that obeys the data dependencies set by the user, very much the same as the approach taken by any implementation of a task scheduling runtime, such as OpenMP's (Section 2.5). Nevertheless, MXNet's model parallelism support depends on the user manually specifying on which device an operation should run, and it is only applicable to heterogeneous platforms. That means that two operations can only execute in parallel if one is assigned to the CPU and the other to the GPU, for example. Although our work only supports CPUs (for now), it was built on top of XLA's foundations, thus allowing model parallelism to be explored automatically, without the need for manual specification by the user.

Amazon oers a service called Amazon Elastic Inference that supports MXNet for accelerating deep learning inference [11]. It brings dynamic device scheduling for infer-ences, solving the a priori device assignment in MXNet described earlier. Nonetheless,

1https://mxnet.apache.org/api/architecture/note_engine 2https://mxnet.apache.org/api/faq/model_parallel_lstm


allelism, with a focus on exploring device heterogeneity for training very large models (in theory, inference would also be supported, though), when all of the layers and parameters do not fit into a single device's memory. In this situation, they support the specification of different devices for placing different layers of the model, thus overcoming the memory space limitation. The most significant difference with respect to our work is the fact that, in both frameworks, the user has to manually specify the devices on which each layer will be placed, whereas our proposal consists in automatically exploring the model parallelism present in a given model. Although, due to XLA AOT's limitations, our work only supports running models on CPUs, these frameworks support offloading kernels to the GPU and other accelerators (TPUs, in the case of TensorFlow). Nevertheless, our implementation could be expanded in the future to also consider device heterogeneity.

A recent subproject from TensorFlow, called Mesh TensorFlow3, aims at simplifying and improving TensorFlow's support for model parallelism. Although, once again, the main motivation is to support training large models, whose parameters and layers do not fit into a single device, this project also has the improvement of inference time as one of its goals. The central idea of Mesh TensorFlow is to consider that there is a mesh of processors available (CPUs, GPUs, TPUs etc.), and that tensor dimensions can be split across different mesh dimensions (for example, splitting the first dimension of tensors across the rows of the processor mesh). Although more powerful than TensorFlow alone, Mesh TensorFlow still requires the user to specify all processor and tensor dimensions manually, which is a very distinctive characteristic compared to the approach taken in this work.

3 https://github.com/tensorflow/mesh

TensorFlow Lite was briefly introduced in Section 2.1. It is focused on inference and has some similarities with XLA, such as applying compiler-inspired optimizations to a TensorFlow model when converting it to TF Lite's format. Despite their partially overlapping goals and approaches, TF Lite does not generate a compiled version of the trained model, relying instead on a condensed version of TensorFlow's runtime. The recently added GPU back-end is capable of executing several kernels and has brought device heterogeneity to TensorFlow Lite. In spite of that, no form of model parallelism has been implemented in the tool (the model runs completely on the CPU or on the GPU, but not on both at the same time).

OpenVINO4 is a computer vision toolkit from Intel with a workflow very similar to XLA's. Initially, a trained model is consumed by the Model Optimizer, a tool that accepts different model formats (TensorFlow, Caffe, MXNet etc.) and produces an optimized version of the network by applying transformations, such as horizontal kernel fusions, for example. The optimized model is produced in an OpenVINO-specific IR that is later consumed by the Inference Engine, a module responsible for optimizing the IR for the target hardware, using hardware-specific plugins. Here, too, the user needs to specify the target device for the network beforehand. OpenVINO supports a wide range of CPUs and accelerators across Intel's ecosystem, such as GPUs, FPGAs and machine learning accelerators (Intel Movidius, for example). A special plugin, called "Heterogeneous Plugin", allows the network developer to specify different devices for different network layers. Again, the manual assignment characteristic is present, forcing the developer to customize this configuration for each model.

TVM [22] is a compiler for deep learning models with an execution flow very similar to XLA's ahead-of-time compilation mode, with graph-level and operator-level optimizations. The latter are guided by a process where different valid implementations of an operator are generated for each hardware back-end. An optimized version is chosen by a machine learning model that takes the lowered program as input and outputs a prediction of its running time on a particular hardware target. As with other frameworks, though, TVM focuses only on exploring intra-kernel data parallelism, missing opportunities for further optimizing the inference time when different kernels could be executed in parallel.


Chapter 4

Implementation and Results

In order to assess the potential benefits that a task-parallel approach can bring to the inference stage of a neural network, different strategies have been implemented, followed by a series of tests to verify how well they performed at improving the inference execution time. The rest of this chapter provides an in-depth description of all the strategies adopted and the results obtained.

4.1 First Approach: Every Kernel is a Task

The first approach implemented to empower XLA models with task parallelism consisted of abstracting each and every kernel in the neural network as an independent task. This is the simplest approach possible, and directly matches the abstraction of the neural network model as a task graph, as described earlier.

The first step to implement this strategy was to modify the XLA code generation with function outlining, so that each kernel (a node in the neural network model) was outlined to its own function. When generating the LLVM IR, XLA "flattens" the entire model into what would be the equivalent of a single-method procedural code. That makes it difficult to determine the appropriate task boundaries, and, therefore, needed to be changed so that each kernel was executed in its own method, which was later wrapped by a task. Figure 4.1 illustrates the XLA code generation "as-is" (a) and the effects of outlining each kernel to a specific function (b).
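To make the outlining step concrete, the C++ sketch below contrasts the two shapes of generated code. It is only an analogy: the kernel names, signatures and bodies are invented, and XLA emits LLVM IR rather than C++ source.

#include <cstddef>
#include <cstdio>

// (a) Flattened generation: every kernel body is emitted inline, one after
// the other, inside a single procedural entry function.
void entry_flattened(const float *in, float *a, float *b, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) a[i] = in[i] * 2.0f;   // kernel 1
  for (std::size_t i = 0; i < n; ++i) b[i] = in[i] + 1.0f;   // kernel 2
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];  // kernel 3
}

// (b) Outlined generation: each kernel becomes its own function whose
// arguments are the kernel's input and output buffers; the entry function
// only orchestrates the calls, which makes task boundaries explicit.
static void kernel_1(const float *in, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}
static void kernel_2(const float *in, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] + 1.0f;
}
static void kernel_3(const float *a, const float *b, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
void entry_outlined(const float *in, float *a, float *b, float *out, std::size_t n) {
  kernel_1(in, a, n);
  kernel_2(in, b, n);
  kernel_3(a, b, out, n);
}

int main() {
  float in[4] = {1, 2, 3, 4}, a[4], b[4], out[4];
  entry_outlined(in, a, b, out, 4);
  std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
  return 0;
}

In the outlined form, kernel_1 and kernel_2 have no dependence on each other, which is exactly the property the task runtime exploits later.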

Therefore, XLA was modified such that, when generating the IR for each kernel, a new function is created: the kernel's inputs become the function's arguments and its output becomes the return value. LLVM's C++ API was used to generate the IR for the outlined kernels (as it already is for all code generated by XLA). The function outlining, like all of the modifications made to the source code, is optional and can be enabled or disabled by the user through a new parameter added to tfcompile.

With the outlining in place, XLA is ready to be integrated with the task runtime. For that, the code generation was modified so that, instead of directly invoking each kernel's implementing function, an OpenMP task is created and dispatched to the OpenMP runtime, which is responsible for scheduling its execution (Figure 4.2).
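As a rough illustration of the outlining mechanism described above, the fragment below uses LLVM's C++ API to create an empty outlined function for a hypothetical kernel that reads two buffers and writes a third. It is a sketch of the general idea only, not XLA's actual code generation logic; the helper name and the fixed three-buffer signature are assumptions.

#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

// Creates a function "name" taking two input buffers and one output buffer,
// into which the kernel's IR body would then be emitted.
llvm::Function *CreateOutlinedKernel(llvm::Module &module, llvm::StringRef name) {
  llvm::LLVMContext &ctx = module.getContext();
  llvm::IRBuilder<> builder(ctx);

  llvm::Type *buf_ty = builder.getFloatTy()->getPointerTo();
  llvm::FunctionType *fn_ty =
      llvm::FunctionType::get(builder.getVoidTy(), {buf_ty, buf_ty, buf_ty},
                              /*isVarArg=*/false);
  llvm::Function *fn = llvm::Function::Create(
      fn_ty, llvm::Function::ExternalLinkage, name, &module);

  llvm::BasicBlock *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
  builder.SetInsertPoint(entry);
  // ... the kernel's computation would be emitted here, reading from the
  // first two arguments and writing to the third ...
  builder.CreateRetVoid();
  return fn;
}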


Figure 4.1: Modifications to XLA for outlining kernels into separate functions. The original XLA code generation "flattens" every kernel into a procedural "main" function (a). This behavior was modified so that each kernel is outlined to a specific function (b).

Relying on a proven runtime helps to ensure the maturity and stability of the overall solution. As described in Section 2.5, the OpenMP runtime dispatches a task to be executed as soon as all of its input dependencies are satisfied; in the XLA context, a task's dependencies are all the kernels that produce data needed by the task itself. The OpenMP runtime then assigns each ready task to one of the threads in its pool to carry out its execution.
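The sketch below illustrates, at the OpenMP source level, the kind of dependence information that is handed to the runtime. The kernel functions are hypothetical stand-ins for outlined kernels, and the generated code actually calls the runtime entry points directly (see Listing 9) rather than using pragmas; the semantics, however, are the same.

#include <cstdio>

// Hypothetical outlined kernels: two parallel branches of the graph and a
// kernel that consumes both of their outputs.
static void branch_a(const float *in, float *out) { out[0] = in[0] * 2.0f; }
static void branch_b(const float *in, float *out) { out[0] = in[0] + 1.0f; }
static void join(const float *a, const float *b, float *out) { out[0] = a[0] + b[0]; }

int main() {
  float input[1] = {3.0f}, outA[1], outB[1], result[1];

  #pragma omp parallel
  #pragma omp single
  {
    // No dependence between the two branches: the runtime may run them in parallel.
    #pragma omp task depend(in: input) depend(out: outA)
    branch_a(input, outA);

    #pragma omp task depend(in: input) depend(out: outB)
    branch_b(input, outB);

    // Only dispatched after both producer tasks have completed.
    #pragma omp task depend(in: outA, outB) depend(out: result)
    join(outA, outB, result);

    #pragma omp taskwait
  }
  std::printf("result = %f\n", result[0]);
  return 0;
}

Compiled with -fopenmp, the two branch tasks may execute concurrently, while the join task is held back by the runtime until its input dependencies are satisfied, mirroring the behavior described above for the generated XLA code.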

Figure 4.2: XLA integration with the Task Runtime


Listing 9: Functions exposed by Clang's OpenMP task runtime

// Function responsible for allocating a task
kmp_task_t *__kmp_task_alloc(
    ident_t *loc_ref,
    kmp_int32 gtid,
    kmp_tasking_flags_t *flags,
    size_t sizeof_kmp_task_t,
    size_t sizeof_shareds,
    kmp_routine_entry_t task_entry);

// Function responsible for submitting a task to the runtime
kmp_int32 __kmpc_omp_task_with_deps(
    ident_t *loc_ref,
    kmp_int32 gtid,
    kmp_task_t *new_task,
    kmp_int32 ndeps,
    kmp_depend_info_t *dep_list,
    kmp_int32 ndeps_noalias,
    kmp_depend_info_t *noalias_dep_list);

4.1.1 Test Environment

All the tests for this first approach were executed on an Ubuntu 16.04.5 LTS server with an Intel Xeon E5-2640 v2 CPU clocked at 2.00GHz and 64GiB of RAM. This CPU model has 8 physical cores but supports 16 logical cores through Intel's Hyper-Threading technology.

4.1.2 Model Profiling

To help users debug and better understand their models, XLA comes with a profiling tool. It works by modifying the generated code to add instructions that query the CPU's cycle counter at the beginning and at the end of each neural network kernel, storing the difference in a separate variable for each kernel. This is achieved by using LLVM's "readcyclecounter" intrinsic1.

After an inference runs, the user can invoke the tool's API to pretty-print the collected metrics. Since only cycle counts are stored, the user has to specify the clock frequency of the CPU where the code is running as an API argument, so that the cycle counts can be converted into a time metric. The output provided by the profiling tool shows the number of cycles and the time taken by each one of the kernels executed during the inference, and also which types of kernels consume most of the inference time (the hotspots).
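As an illustration of the underlying mechanism, the sketch below uses Clang's __builtin_readcyclecounter() builtin (which lowers to the same llvm.readcyclecounter intrinsic) around a dummy kernel and converts the elapsed cycles to time using an assumed 2.0GHz clock. It mimics what the instrumentation does, but it is not XLA's actual profiler code; the kernel and the frequency constant are placeholders.

#include <cstdint>
#include <cstdio>

// Dummy stand-in for a generated kernel.
static void dummy_kernel() {
  volatile float acc = 0.0f;
  for (int i = 0; i < 1000000; ++i) acc += 1.0f;
}

int main() {
  // Clang exposes llvm.readcyclecounter as a builtin (compile with clang++).
  uint64_t start = __builtin_readcyclecounter();
  dummy_kernel();
  uint64_t end = __builtin_readcyclecounter();

  uint64_t cycles = end - start;
  const double clock_hz = 2.0e9;  // assumed clock frequency, supplied by the user
  double seconds = static_cast<double>(cycles) / clock_hz;
  std::printf("cycles = %llu, time = %.6f s\n",
              static_cast<unsigned long long>(cycles), seconds);
  return 0;
}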

It is worth mentioning that the profiling tool is not available for the ahead-of-time (AOT) compilation mode in the version of XLA used in this work (which comes with TensorFlow version 1.8). However, since it has been made available in a later release, the commit that adds this functionality2 was cherry-picked into the branch of code used in this work.

To assess the performance of this first tasking strategy, experiments were conducted with a pre-trained Inception-v3 model. This model was chosen because it exhibits several layers whose kernels could be executed in parallel, making it a good target for the task-parallel approach.

To get a better sense of how the inference time is distributed among the different kernels in Inception-v3, the XLA profiling tool was used. The results obtained can be seen in Table 4.1: 61.32% of the inference time is consumed by convolution kernels, and Dot kernels (XLA's name for matrix multiplications) account for another 21.00%. Since XLA transforms certain types of convolutions into matrix multiplications, it is accurate to say that the convolution kernels account for 82.32% of the total inference time.

Kernel Type       Total Inference Time
Convolution       61.32%
Dot               21.00%
Loop Fusion       10.73%
Reduce-window      6.94%
Other              0.01%

Table 4.1: XLA Profile: Inception-v3 Kernel Type Contribution to Inference Time

The time dominance of convolution kernels during the inference stage of CNNs is a very important piece of information, and one that guided many of the decisions taken during this project, as will become evident in the next sections.

4.1.3 Results

As described in the previous section, the Inception-v3 model was used in the experiments, where ten inferences were executed and the total time taken was recorded. The experiments were made with two and four OpenMP tasks, as the model exhibits a common recurring pattern where these numbers of kernels can be executed in parallel. Figure 4.3 shows an example of one of the recurring parallelizable patterns from the Inception-v3 neural network.


Figure 4.3: One of the recurring parallelizable kernel patterns exhibited by Inception-v3

Tables 4.2 and 4.3 show the performance obtained. A few important conclusions can be drawn from these results. First of all, it is clear that, as the number of Eigen threads increases, the task-parallel approach starts performing worse than the baseline, which indicates that any potential benefits from running kernels in parallel are overcome by Eigen's performance, up to the point where the task threads merely add overhead to the overall running time.

Second, if we consider the total number of threads in execution in the experiments, the task-parallel approach has not provided any speed-ups. Consider, for example, the mean inference time when two Eigen threads and two task threads are used against the time obtained for the baseline version with four Eigen threads (the second and third lines of Table 4.2, respectively). In both cases, the total number of threads in execution is four. However, the mean inference time for the baseline version is 268.7ms, whereas the task-parallel version takes 474.8ms, a 76.7% slowdown (474.8/268.7 ≈ 1.767). In other words, if the XLA user had a fixed number of threads to distribute between Eigen and OpenMP, it would be preferable to give all of them to Eigen, instead of dividing them with OpenMP tasks. A similar analysis can be made when we have four Eigen threads and four OpenMP tasks, versus eight Eigen threads (the third and fourth lines of Table 4.3, respectively).

As the inference time responds so drastically to changes made to the number of Eigen threads, and since Eigen is only used in convolution kernels, these results are perfectly aligned with the profiling outcome shown in Section 4.1.2, and corroborate that this kernel type is the true computational hotspot for convolutional neural networks.
