GPU support for automatic generation of finite-differences Stencil Kernels

UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE

CENTRO DE TECNOLOGIA

PROGRAMA DE PÓS-GRADUAÇÃO EM ENGENHARIA ELÉTRICA E DE COMPUTAÇÃO

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels

Vitor Hugo Mickus Rodrigues

Advisor: Prof. Dr. Samuel Xavier de Souza

Co-advisor: Dr. Lucas Costa Pereira Cavalcante

Master Thesis presented to the Graduate Program in Electrical and Computer Engineering of UFRN (area of concentration: Computer Engineering) as part of the requirements for obtaining the Master of Science degree.

PPgEEC Order Number: M613

Natal, RN, January 16th, 2020

Rodrigues, Vitor Hugo Mickus.

GPU support for automatic generation of finite-differences Stencil Kernels / Vitor Hugo Mickus Rodrigues. - Natal, 2020. 43 f.: il.

Dissertation (Master's) - Universidade Federal do Rio Grande do Norte, Centro de Tecnologia, Programa de Pós-Graduação em Engenharia Elétrica e Computação, Natal, RN, 2020.

Advisor: Prof. Dr. Samuel Xavier de Souza. Co-advisor: Dr. Lucas Costa Pereira Cavalcante.

1. Graphics processing units (GPU) - Dissertation. 2. Domain specific language - Dissertation. 3. Finite differences - Dissertation. 4. Devito - Dissertation. 5. Parallel architecture - Dissertation. 6. Stencil kernels - Dissertation. I. Souza, Samuel Xavier de. II. Cavalcante, Lucas Costa Pereira. III. Title.

RN/UF/BCZM CDU 004.3

Universidade Federal do Rio Grande do Norte - UFRN

Sistema de Bibliotecas - SISBI

Cataloging in Publication. UFRN - Biblioteca Central Zila Mamede

Abstract

The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units (GPUs) are an attractive architectural target for stencil computations because of their high degree of data parallelism. However, rapid architectural and technological progress makes it difficult for even the most proficient programmers to remain up to date with advances at the micro-architectural level. This work presents an extension to Devito, an open source transpiler designed to produce highly optimized finite difference kernels for inversion methods. Integrating it with the Oxford Parallel Domain Specific Language (OP-DSL) enabled automatic code generation for GPU architectures from a high-level representation. The aim is to let users who code at a symbolic representation level benefit from the processing capacity of GPU architectures with no additional effort. The implemented backend is evaluated on an NVIDIA® GTX Titan Z and on an NVIDIA® Tesla V100 in terms of execution time and in terms of operational intensity through the roofline model. A 3D acoustic isotropic wave propagation was used as the experiment, with stencil kernels for varying space-order discretization levels over grids with 256³ points. The backend achieves approximately 63% of the V100's peak performance and 24% of the Titan Z's peak performance. This study indicates that improving memory usage should be the most efficient strategy for improving the performance of the implemented solution on the evaluated architectures.

Keywords: GPU, Domain Specific Languages, finite-differences, stencil kernels, parallel architectures, Devito, OPS

Resumo

Obtaining numerical solutions for seismic inversion algorithms such as Full-Waveform Inversion (FWI) and Reverse Time Migration (RTM) can be accelerated by architectures with a high degree of parallelism, such as graphical processing units (GPUs). However, the rapid development of new architectures and technologies makes it difficult to maintain and update the implemented solutions. In this work, the open source transpiler Devito is extended to enable automatic conversion of finite-difference kernels to GPU architectures. The Oxford Parallel Domain Specific Language (OP-DSL) framework was used to build a new backend for Devito. The implemented solution was validated on the NVIDIA® GTX Titan Z and NVIDIA® Tesla V100 GPUs. The performance of the implementation was measured in terms of execution time and in terms of relative performance through the roofline model. The tests were carried out for several space-order discretization levels on a 3D isotropic acoustic wave propagation stencil over a grid of 256³ points. The results show that the generated kernels reached approximately 63% of the peak performance of the V100 and about 24% of the peak performance of the GTX Titan Z. The study also reveals that optimizing data transfer between CPU and GPU is one of the greatest challenges for improving performance on the evaluated architectures.

Keywords: Domain Specific Language, Finite Differences, Stencil, Kernel, Parallel Architectures, GPU, Devito, OPS.

Contents

1 Introduction
2 Theory
  2.1 Devito
    2.1.1 Intermediary Representation
    2.1.2 Optimizations
    2.1.3 Backend
  2.2 GPU Architecture
  2.3 OPS
  2.4 Roofline Model
3 Related Work
  3.1 FEniCS
  3.2 Firedrake
  3.3 esys-escript
  3.4 YASK
  3.5 Previous Work
4 Implementation
  4.1 Code Generation
    4.1.1 Expressions
    4.1.2 Block
    4.1.3 Dataset
    4.1.4 Stencil
    4.1.5 Data Argument
    4.1.6 Parallel Loop
    4.1.7 Memory Transfer
  4.2 Translation
  4.3 Compilation
  4.4 Execution
5 Experiments and Results
  5.1 Hardware Specification
  5.2 Compiler Information
  5.3 Acoustic Wave Propagation
  5.4 Correctness Verification
  5.5 Results
6 Conclusion
Bibliography

List of Figures

2.1 Devito pipeline [1].

2.2 Comparison between x86 CPU and NVIDIA GPUs (chart from NVIDIA presentation).

2.3 OPS traditional workflow.

4.1 Diagram of Devito and OPS integration.

4.2 2nd-order stencil.

4.3 8th-order stencil.

4.4 Diagram explaining the workflow to generate a shared object by the OPS backend.

5.1 Wave propagating through the Marmousi model. Subfigure (a) shows the 2D Marmousi model. Subfigure (b) shows the first instant after a source injected a perturbation at the surface. In Subfigures (c) and (d) the wave propagates through all the layers of the model; Subfigure (d) occurs after Subfigure (c).

5.2 Numerical validation. The first plot is the numerical value of the propagation for each backend. It represents the physical pressure applied to each point of the grid. The second plot is the difference between those values.

5.3 Roofline chart for the GTX Titan Z GPU. Propagation field with 256³ points and space order levels of 8, 12, 16 and 24 using Devito aggressive and basic DSE optimizations.

5.4 Roofline chart for the V100 GPU. Propagation field with 256³ points and space order levels of 8, 12, 16 and 24 using Devito aggressive and basic DSE optimizations.

List of Tables

5.1 Hardware specification used.

5.2 Graphical cards specification used.

5.3 Flags description.

5.4 Data collected from profiling propagation kernel in GTX Titan Z using nvprof.

5.5 Data collected from profiling propagation kernel in V100 using nvprof.

List of Symbols and Acronyms

API Application Programming Interface

AST Abstract Syntax Tree

CPU Central Processing Unit

DRAM Dynamic Random Access Memory

DSL Domain Specific Language

FD Finite Difference

FEM Finite Element Method

FLOPS Floating-Point Operations per Second

FWI Full-Waveform Inversion

GPU Graphical Processing Unit

HPC High Performance Computing

IET Iteration/Expression Tree

L-BFGS-B Large-scale Bound-constrained Optimization

LAPPS Laboratório de Arquiteturas Paralelas e Processamento de Sinal

NPAD Núcleo de Processamento de Alto Desempenho - IMD/UFRN

OI Operational Intensity

OP-DSL Oxford Parallel Domain Specific Languages

PDE Partial Differential Equation

RTM Reverse Time Migration

SO Space Order

UFL Unified Form Language

YASK Yet Another Stencil Kit

YLE YASK Loop Engine

Chapter 1 Introduction

A wide variety of physical phenomena, such as sound, heat, diffusion, electrostatics, electrodynamics, fluid dynamics, elasticity, and quantum mechanics, can be formalized in terms of partial differential equations (PDEs). The development of computationally efficient methods for obtaining numerical solutions of PDEs through stencil kernels is identified in [2] as one of the "seven dwarfs of computation", a key computational science and engineering challenge to be addressed for at least the next decade. In fact, large-scale PDE inversion algorithms that can be solved by finite-difference (FD) schemes in exploration seismology, such as full waveform inversion (FWI) and reverse time migration (RTM), constitute some of the most computationally demanding problems in current industrial and academic research.

In general, a stencil on a structured grid is defined as a function that updates a point based on the values of its neighbors. The stencil structure remains constant as it moves from one point in space to the next. In the context of a wave-equation solver, the stencil is described by the support (grid locations) and the coefficients of the FD scheme. Parallel designs such as graphics processing units (GPUs) have recently become the preferred choice in the commercial and scientific communities for accelerating existing stencil codes.
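As a minimal sketch of this definition, the pure-Python snippet below applies a stencil given by its support and coefficients to every interior point of a 2D grid. The function name `apply_stencil`, the 5-point second-order Laplacian, the unit grid spacing and the test field are illustrative assumptions and are not taken from Devito or OPS.

```python
def apply_stencil(grid, support, coeffs):
    """Return a new grid where each interior point is the weighted sum of its neighbours."""
    ny, nx = len(grid), len(grid[0])
    # Halo width: the largest offset used by the stencil.
    r = max(max(abs(dy), abs(dx)) for dy, dx in support)
    out = [row[:] for row in grid]  # boundary points are left unchanged
    for y in range(r, ny - r):
        for x in range(r, nx - r):
            # The same support and coefficients are reused at every point.
            out[y][x] = sum(c * grid[y + dy][x + dx]
                            for (dy, dx), c in zip(support, coeffs))
    return out


# Second-order 5-point Laplacian (unit grid spacing assumed).
LAPLACIAN_SUPPORT = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
LAPLACIAN_COEFFS = [-4.0, 1.0, 1.0, 1.0, 1.0]

# Sample field f(y, x) = x^2 + y^2, whose Laplacian equals 4 everywhere.
field = [[float(x * x + y * y) for x in range(5)] for y in range(5)]
lap = apply_stencil(field, LAPLACIAN_SUPPORT, LAPLACIAN_COEFFS)
```

Because the support and coefficients are identical at every grid point, each update is independent of the others within one sweep, which is the property that makes stencils a natural fit for data-parallel hardware.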

However, a significant and increasingly notable barrier is the difficulty of programming these systems. As hardware architectures grow in complexity, exploiting the potential of these devices requires deeper know-how in parallel programming. The issue has been further compounded by a rapidly changing hardware design space, with a wide range of parallel architectures. For example, some designs offer many simple processors instead of fewer complex processors, some depend on multi-threading, and some even replace caches with explicitly addressed local stores. As no conventional wisdom has yet emerged, it is unsustainable for domain scientists to rewrite their applications for each new type of architecture, given that developing and validating a PDE solver usually takes decades of effort.

To address the problem of algorithm sustainability, taking into account the uncertainty in future architectures, one solution involves decoupling the work of the domain scientist from that of the computer scientist. In this approach, Domain Specific Languages (DSLs) are developed by high-performance computing (HPC) specialists, and the specifics of the problem and the numerical solution method are specified in the DSL by the domain scientist. Using source-to-source translation, the numerical solver can be targeted at different hardware backends. This ensures that only the backend that interfaces with a new architecture needs to be written and supported by the translator. The underlying implementation of the solver remains the same, thereby introducing a separation of concerns that results in a direct payoff in productivity.

Interest in building generic DSLs for solving PDEs is not new, with early attempts dating back as far as 1970 [3, 4, 5]. More recently, two prominent finite element software packages, FEniCS [6] and Firedrake [7], have demonstrated the power of symbolic computation using the DSL paradigm. The optimization of regular grid and stencil computations has also produced a vast range of libraries and DSLs that aim to ease the efficient automated creation of high-performance codes [8, 9, 10, 11].

Among the DSL tools for solving PDEs, one software package has gained particular importance in academia, especially in the geophysics community. Devito is a DSL and code generation framework for the design of optimized finite difference kernels for use in inversion methods.

This work presents an implementation of automatic GPU code generation for Devito. This objective translates into extending Devito's backend so that the generated stencils are compatible with this target architecture. Currently, two backends exist in Devito: the default backend, which runs on standard central processing unit (CPU) architectures; and an alternative backend that uses the YASK (Yet Another Stencil Kit) stencil compiler to generate optimized C++ code for Intel® Xeon® and Intel® Xeon Phi™ architectures [1]. The strategy is to use one of the Oxford Parallel Domain Specific Languages (OP-DSL), called OPS, to build a third backend for Devito. OPS is a programming abstraction embedded in C/C++ for writing multi-block structured mesh algorithms. It is composed of a software library (an Application Programming Interface, API) and code translation tools (compilers). OPS enables automatic parallelization of the intermediate-level code produced (here, by Devito) using different parallel programming approaches.

As a result, it is expected that executable artifacts written in CUDA, OpenACC, OpenCL, OpenMP, and MPI can be automatically and transparently composed for a diverse range of hardware from high-level symbolic descriptions of PDEs. It has been shown that OPS-generated code is capable of matching or outperforming hand-coded and tuned implementations [12], which gives considerable confidence that such an approach can deliver high performance, code maintainability and future proofing. Although this work only generates code targeting CUDA, it paves the way to the other languages supported by OPS in future work.

It is possible to speculate that it would take much longer not only to compose complex FD problems but also to produce their various hand-coded parallel implementations, each of which would have to be then debugged and validated. Devito’s authors claim that the time savings on combining code generation with automatic parallel implementation for state-of-the-art hardware will have a significant impact on the efforts for modeling seismic inversion algorithms.

This document is organized as follows: Chapter 2 explains the tools and concepts used in this work. Chapter 3 introduces the related research in this area. The implementation produced in this work is presented in detail in Chapter 4. Tests and analysis of the proposed solution are described in Chapter 5. Finally, the conclusions of the work are summarized in Chapter 6.

Chapter 2 Theory

This Chapter presents the background knowledge required by the rest of the work. It covers computational aspects of GPU architectures, domain-related specifics of the problem used as a proof of concept for the proposed solution, and relevant considerations on the software investigated.

2.1 Devito

Devito is a tool for solving partial differential equations (PDEs). PDEs describe numerous problems that are heavily constrained by physical laws, with applications in geophysics, earth and climate science, material science, chemical and mechanical engineering, medical imaging and physics, and even economics. Devito uses a domain specific language (DSL) to simplify the development process for the user and solves PDEs using the finite difference method.

Devito automatically generates C/C++ code with different levels of optimization for finite-difference schemes from a symbolic Python representation of PDEs. Devito claims that the automatically applied optimization is competitive with, and often better than, hand-optimized implementations.

To illustrate how Devito operates, consider Equation 5.1, which describes acoustic wave propagation with a source injection and its initial conditions. Devito uses the SymPy library for an easier symbolic representation. Algorithm 2.1 shows how this equation is written; it represents a small part of the solution.

Algorithm 2.1: Example of Devito declaring an acoustic wave propagation

    from sympy import Eq, solve
    from devito import Function, TimeFunction, Grid

    size = 128  # grid points per dimension (assumed value; not given in the listing)
    grid = Grid(shape=(size, size))
    # Wavefield: second order in time, sixth order in space
    u = TimeFunction(name='u', grid=grid, space_order=6, time_order=2)
    # Spatially varying model parameter
    m = Function(name='m', grid=grid)

    # Wave equation m * d2u/dt2 - laplace(u) = 0
    eqn = Eq(m * u.dt2 - u.laplace)

    # Rearrange for the wavefield at the next time step
    stencil = solve(eqn, u.forward)[0]
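To make explicit what solving the equation for u.forward yields, the sketch below writes out the resulting explicit update by hand for a 1D grid. It is an illustration under simplifying assumptions (second-order differences in space instead of the sixth-order scheme of Algorithm 2.1, fixed zero boundaries, hypothetical names `step`, `h` and `dt`) and is not code produced by Devito.

```python
def step(u_prev, u_curr, m, dt, h):
    """One explicit update of m * u_tt = u_xx in 1D using second-order differences."""
    n = len(u_curr)
    u_next = [0.0] * n  # end points stay at zero (fixed boundary, an assumption)
    for i in range(1, n - 1):
        # Second-order approximation of the Laplacian at point i.
        laplace = (u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]) / (h * h)
        # Central difference in time, rearranged for the next time level.
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + dt * dt * laplace / m[i]
    return u_next


n = 11
m = [1.0] * n          # homogeneous model parameter (illustrative value)
u_prev = [0.0] * n
u_curr = [0.0] * n
u_curr[5] = 1.0        # unit perturbation in the middle of the grid
u_next = step(u_prev, u_curr, m, dt=0.5, h=1.0)
```

After one step the perturbation spreads to its two neighbours, which is the discrete counterpart of the wave starting to propagate.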

Devito performs just-in-time compilation and execution, so the domain expert can focus on the mathematical formulations, instead of writing low-level code. Following the example, the C code automatically generated by Devito from a Python environment is shown in Algorithm 2.2.

Algorithm 2.2: Devito auto generated C code using core backend. The code represents the propagation update for stencil of space order 2.

    for (int x = x_m; x <= x_M; x += 1)
    {
      #pragma omp simd aligned(damp,m,u:32)
      for (int y = y_m; y <= y_M; y += 1)
      {
        float r0 = 1.0F*dt*m[x + 2][y + 2][z + 2] +
          5.0e-1F*(dt*dt)*damp[x + 1][y + 1][z + 1];

        u[t1][x + 2][y + 2][z + 2] =
          1.0F*(-dt*m[x + 2][y + 2][z + 2]*u[t2][x + 2][y + 2][z + 2]/r0 +
          (dt*dt*dt)*u[t0][x + 1][y + 2][z + 2]/r0 +
          (dt*dt*dt)*u[t0][x + 2][y + 1][z + 2]/r0 +
          (dt*dt*dt)*u[t0][x + 2][y + 2][z + 1]/r0 +
          (dt*dt*dt)*u[t0][x + 2][y + 2][z + 3]/r0 +
          (dt*dt*dt)*u[t0][x + 2][y + 3][z + 2]/r0 +
          (dt*dt*dt)*u[t0][x + 3][y + 2][z + 2]/r0) +
          2.0F*dt*m[x + 2][y + 2][z + 2]*u[t0][x + 2][y + 2][z + 2]/r0 +
          5.0e-1F*(dt*dt)*damp[x + 1][y + 1][z + 1]*u[t2][x + 2][y + 2][z + 2]/r0 -
          6.0F*dt*dt*dt*u[t0][x + 2][y + 2][z + 2]/r0;
      }
    }

The user does not need to see this generated code. It is handled by Devito's compiler, and the result of its execution is brought back to the Python environment. Programming Algorithm 2.1 is much simpler and more maintainable than programming Algorithm 2.2, and the same symbolic description can be executed on different architectures with no modifications.

The key to this mechanism lies in using compiler technology to translate symbolic PDEs into an intermediate representation.

2.1.1 Intermediary Representation

Devito builds multiple layers of intermediate representations. These layers are used by a series of passes that Devito automatically applies to the code. In each pass, a set of optimizations is applied with the target architecture in mind. The passes are:

• Equations lowering.
• Local analysis.
• Clustering.
• Symbolic optimization.
• Iteration/expression tree (IET) construction.
• Synthesis.

All the steps above use a tree representation of the abstract syntactic structure of the source code. Figure 2.1 shows a detailed diagram of the described pipeline.

The IET construction is a pass that lowers the intermediate representation into a syntax tree that contains Iterations and Expressions. These are special tree nodes that play an important role in Devito: Expressions wrap equations, while Iterations represent loops.
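The relationship between the two node types can be pictured with a toy tree. The sketch below uses hypothetical `Iteration` and `Expression` dataclasses and an `emit` helper; these mimic the idea only and are not Devito's actual classes or code generator.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Expression:
    """Leaf node wrapping one assignment, kept here as plain C text."""
    text: str


@dataclass
class Iteration:
    """Loop node over dimension `dim`, holding nested Iterations or Expressions."""
    dim: str
    body: List[Union["Iteration", Expression]] = field(default_factory=list)


def emit(node, indent=0):
    """Recursively render the tree as C-like source lines."""
    pad = "  " * indent
    if isinstance(node, Expression):
        return [pad + node.text]
    d = node.dim
    lines = [f"{pad}for (int {d} = {d}_m; {d} <= {d}_M; {d} += 1)", pad + "{"]
    for child in node.body:
        lines.extend(emit(child, indent + 1))
    lines.append(pad + "}")
    return lines


# Two nested loops around a single (made-up) update expression.
tree = Iteration("x", [Iteration("y", [Expression("u[x][y] = 2.0F*v[x][y];")])])
code = "\n".join(emit(tree))
```

Walking the tree from the root and printing each node is, in spirit, what a synthesis step does: the nesting of Iterations determines the loop nest, and the Expressions become the loop body.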

The last step, Synthesis, is where Devito specializes data types for a target API, as discussed in Subsection 2.1.3.

[Figure 2.1 diagram. Stages and their inputs/outputs: Equations lowering (Input Equations → Lowered Equations); Local analysis; Clustering (Lowered Equations → Clusters; grouping, enforcement of iteration direction); Symbolic optimization [DSE] (Clusters → Clusters; invariants extraction, shift-invariants detection, factorization, common sub-expressions elimination); IET construction (Clusters → IET [abstract syntax tree]); IET analysis (IET → IET); IET optimization [DLE/YLE] (IET → IET; SIMD vectorization, loop blocking, shared-memory (hierarchical) parallelism, low-level optimization, e.g., software prefetching); Synthesis (IET → CGen AST → C/C++ string; declarations, instrumentation for profiling, header files, globals, macros); JIT compilation (C/C++ string → kernel.c → kernel.so).]

Figure 2.1: Devito pipeline [1].

2.1.2 Optimizations

Devito supports two categories of optimizations: the Devito Symbolic Engine (DSE) and the Devito Loop Engine (DLE). The DLE optimizations are not discussed in this work because the loops are replaced by a GPU kernel call, so the DLE does not apply to this case.

The DSE optimization aims to reduce the number of floating-point operations performed in an algorithm. The main techniques used in this step are common sub-expression elimination (CSE), factorization, extraction, and detection of aliases.

• CSE is a traditional compiler optimization that identifies sub-expressions repeated within a given expression. Once identified, these sub-expressions can be replaced by a variable so that the operation is executed only once. For example, the following code:

    a = b * c + g;
    d = b * c * e;

performs better if transformed into:

    tmp = b * c;
    a = tmp + g;
    d = tmp * e;

• Factorization is the standard mathematical optimization: within an expression, common terms are grouped together. For example, the expression:

    a = (b*c*d) + (e*c*d) + (f*c*d);

can be rewritten with fewer operations as:

    a = (c*d)*(b + e + f);

• Extraction is an optimization pass that trades operations for memory. In this stage, sub-expressions that match a certain condition are extracted from a larger expression. The pass applies a threshold on the operation count and extracts sub-expressions to comply with it. This threshold was determined empirically, aiming to improve performance for seismic algorithms. For example, the expression:

    5*temp*(u[t][x+1] - temp*u[t][x+2]) + 12*temp*u[t][x+3];

can be optimized, if rewritten, as:

    temp0[x] = u[t][x+1] - temp*u[t][x+2];
    5*temp*temp0[x] + 12*temp*u[t][x+3];

• Detection of aliases is an advanced optimization step. It searches for a sub-expression that is computed at different iteration points and creates an alias for it. For instance, the following expression:

    5*temp*u[t][x+1] - 15*temp*u[t][x] + 10*temp*u[t][x+2];

can be optimized, if rewritten, as:

    temp0[x] = 5*temp*u[t][x];
    temp0[x+1] - 3*temp0[x] + 2*temp0[x+2];
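The rewrites above preserve the computed values while reducing the operation count. The following sketch checks this numerically for the CSE, factorization and alias examples using integer inputs, so that equality is exact. The function names are illustrative, the time index `t` is dropped for brevity, and the way the alias `temp0` is reused at shifted indices is our reading of the example rather than Devito output.

```python
import random


def cse_before(b, c, e, g):
    return b * c + g, b * c * e             # b*c is computed twice


def cse_after(b, c, e, g):
    tmp = b * c                             # shared sub-expression computed once
    return tmp + g, tmp * e


def fact_before(b, c, d, e, f):
    return (b * c * d) + (e * c * d) + (f * c * d)   # six multiplications


def fact_after(b, c, d, e, f):
    return (c * d) * (b + e + f)                     # two multiplications


def alias_before(u, temp, x):
    return 5 * temp * u[x + 1] - 15 * temp * u[x] + 10 * temp * u[x + 2]


def alias_after(u, temp, x):
    temp0 = [5 * temp * v for v in u]       # alias evaluated once per grid point
    return temp0[x + 1] - 3 * temp0[x] + 2 * temp0[x + 2]


random.seed(0)
vals = [random.randint(-9, 9) for _ in range(5)]
u = [random.randint(-9, 9) for _ in range(8)]
checks = [
    cse_before(*vals[:4]) == cse_after(*vals[:4]),
    fact_before(*vals) == fact_after(*vals),
    all(alias_before(u, 3, x) == alias_after(u, 3, x) for x in range(6)),
]
```

With floating-point inputs the rewritten forms may differ from the originals in the last bits, because the operations are reassociated; this is one reason why numerical validation against a reference backend remains necessary.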

2.1.3 Backend

In the Synthesis stage of the Devito pipeline it is possible to specialize the generated code. This is done through the backend selection.

The idea of the backend is to provide infrastructure to enable the integration with third-party software. Prior to this work, Devito supported two backends:

• core: the default backend. It does not apply any architecture-specific changes to the generated code structure, so it relies on the DLE for loop optimization.

• yask: provides the ability to generate code using the YASK stencil compiler. It targets processors of the Intel® Xeon® and Intel® Xeon Phi™ families. Using this backend, Devito can delegate loop optimization to the YASK Loop Engine (YLE).

The main contribution of this work is to extend and evaluate Devito’s support for a new backend targeting GPU architectures.

2.2 GPU Architecture

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for developers, and it is even more difficult to understand the performance bottlenecks on GPU architectures. Current approaches rely on programmers tuning their applications by exploring the design space exhaustively, without fully understanding the performance characteristics of their applications.

Figure 2.2: Comparison between x86 CPU and NVIDIA GPUs (chart from NVIDIA presentation).

Figure 2.2 presents a comparison between x86 CPUs and NVIDIA GPUs. The left chart shows the increasing gap in floating-point throughput between the two architectures; the difference in FLOPS is about 4.5 times for the most recent model presented. The right chart shows the difference in memory bandwidth: the NVIDIA GPU delivers about 6 times the memory bandwidth of the x86 CPU for the 2017 model.

In the GPU context, it is important to separate the concepts of host and device. Host refers to the CPU and its memory, while device refers to the GPU and its memory. Code run on the host can manage memory on both the host and the device, and also launches kernels, which are functions executed on the device. These kernels are executed by many GPU threads in parallel.

For efficient kernel execution, the device microarchitecture plays an important role. Each generation of graphics cards introduces a new microarchitecture with new resources and better performance. CUDA programs are compiled using the nvcc compiler, and the target microarchitecture can be chosen during compilation to balance portability and efficiency [13]. Some of the main microarchitectures are Fermi, Kepler, Maxwell, Pascal, and Volta. The microarchitectures used in this work are Kepler and Volta.
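As an illustration of selecting the microarchitecture at compile time, the sketch below assembles an nvcc command line with the `-arch` option. The helper `nvcc_command`, the file names and the choice of one representative compute capability per family are assumptions made for this example (Kepler, for instance, also spans sm_30 and sm_37); they are not the flags used in this work, and recent CUDA toolkits no longer accept the oldest targets.

```python
# One representative compute-capability target per microarchitecture family.
SM_BY_MICROARCH = {
    "fermi": "sm_20",
    "kepler": "sm_35",
    "maxwell": "sm_50",
    "pascal": "sm_60",
    "volta": "sm_70",
}


def nvcc_command(source, output, microarch):
    """Build an nvcc argument list targeting one microarchitecture."""
    arch = SM_BY_MICROARCH[microarch.lower()]
    return ["nvcc", f"-arch={arch}", "-O3", "-o", output, source]


# Hypothetical file names; the list could be passed to subprocess.run.
cmd = nvcc_command("kernel.cu", "kernel", "Volta")
```

Keeping the target in a single table makes it straightforward to rebuild the same generated source for a different card, which is the portability aspect mentioned above.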

2.3 OPS

The Oxford Parallel library for Structured mesh solvers (OPS) was created with code maintainability, performance portability, and future proofing in mind.

The main idea behind OPS is to let the developer focus on a single programming language. It enables the automatic generation of programs for multiple parallel programming models: MPI, OpenMP, OpenACC, CUDA and OpenCL. This work relies on OPS to generate CUDA code.

OPS provides a high-level code abstraction aimed at multi-block structured grid computations. It can be embedded in C/C++, and its API provides basic structures for grid computations such as blocks, datasets, and parallel loops. An explanation of the API is presented in Section 4.1.

The diagram in Figure 2.3 shows the workflow of OPS programs: starting from the desired structured mesh application, the algorithm is specified using the OPS API, then compiled and linked with the OPS libraries, and finally the resulting artifact is executed on the targeted platform.

In this work, Devito is extended to support a new backend connected to the OPS library. The new backend enables the computation of stencil kernels in a GPU environment using the CUDA parallel computing platform.

2.4 Roofline Model

The performance of the produced solutions is analyzed in terms of their floating-point performance, operational intensity and memory performance through the roofline model. This model exposes the ratio between the achieved performance and the theoretical peak performance of the considered devices.

\[
\text{Attainable Peak Performance [GFLOP/s]} = \min\left(\text{Peak Floating-Point Performance},\ \text{Peak Memory Bandwidth} \times \text{Operational Intensity}\right) \tag{2.1}
\]

Figure 2.3: OPS traditional workflow

The Operational Intensity (OI) relates the floating-point work of a kernel to the Dynamic Random Access Memory (DRAM) traffic it generates on a particular architecture. For the devices considered in this work, each read or write transaction between the DRAM and the caches has a size of 32 bytes. Therefore, the OI of an application can be measured according to Equation 2.2.

\[
\text{OI [FLOP/Byte]} = \frac{\#\,\text{Single Precision Floating-Point Operations}}{(\#\,\text{Memory Transactions}) \times 32} \tag{2.2}
\]

The kernel performance measures the number of executed floating-point operations per second. It can be measured according to Equation 2.3.

\[
\text{Performance [FLOP/s]} = \frac{\#\,\text{Single Precision Floating-Point Operations}}{\text{Execution Time [s]}} \tag{2.3}
\]
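The three quantities of the roofline analysis can be combined in a few lines. The sketch below assumes the standard roofline bound (the minimum of the compute peak and the product of peak memory bandwidth and OI); all counter values and device peaks are hypothetical round numbers chosen for easy arithmetic and are not measurements from this work.

```python
TRANSACTION_BYTES = 32  # size of one DRAM transaction, as stated in the text


def operational_intensity(flops, transactions):
    """FLOP per byte of DRAM traffic."""
    return flops / (transactions * TRANSACTION_BYTES)


def performance(flops, seconds):
    """Achieved floating-point rate in FLOP/s."""
    return flops / seconds


def attainable(peak_flops, peak_bandwidth, oi):
    """Roofline bound: limited either by compute or by memory traffic."""
    return min(peak_flops, peak_bandwidth * oi)


# Hypothetical profiler counters for one kernel.
flops = 6.4e9          # single precision floating-point operations
transactions = 1.0e8   # DRAM read + write transactions
seconds = 0.0128       # kernel execution time

# Hypothetical device peaks.
peak_flops = 1.0e13    # 10 TFLOP/s
peak_bw = 5.0e11       # 500 GB/s

oi = operational_intensity(flops, transactions)
perf = performance(flops, seconds)
bound = attainable(peak_flops, peak_bw, oi)
fraction = perf / bound   # share of the attainable performance actually reached
```

In this made-up case the bound is set by the memory term rather than by the compute peak, which is the situation in which reducing DRAM traffic, and thus raising the OI, is the most effective optimization.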

Chapter 3 Related Work

Many tools have been developed to work with PDEs. This Chapter lists some of the relevant tools for solving partial differential equations.

3.1 FEniCS

The FEniCS project [14] started in 2005 and is now at version 1.5. The project aims to solve differential equations by finite element methods (FEM). FEniCS integrates a variety of other components to achieve its result, all of which are licensed under the GNU GPL.

FEniCS provides a DSL, based on the Python language, that allows users to produce maintainable high-performance implementations. The FEniCS domain specific language is UFL (Unified Form Language), which is also used in other projects.

Regarding GPU support in FEniCS, a prototype compiler that allows the generation of low-level code for GPUs and multicore CPUs was presented in [15].

3.2 Firedrake

Firedrake [16] is an automated system for the solution of partial differential equations using the finite element method. It relies on UFL from the FeniCS Project to enable an expressive specification of PDEs.

Some of the features in Firedrake are:

• Firedrake solves PDEs in parallel using PETSc¹.
• It works with unstructured meshes.
• Automatic optimisation, including sum factorisation for high order elements, and vectorisation.

¹ PETSc is a suite of data structures and routines for the scalable (parallel) solution of scientific

3.3 esys-escript

The esys-escript [17] is a programming tool for implementing mathematical models in Python using the finite element method. It provides a Python DSL to create scripts that can execute from desktop computers to highly parallel supercomputers.

Application areas for esys-escript include earth mantle convection, geophysical inversion, porous media flow, reactive transport, and plate subduction. Some of the main features are:

• Two- and three-dimensional finite and spectral element simulations.
• Unstructured meshes from gmsh².
• Parallelization with OpenMP and MPI support.
• Partial support for GPU use.

3.4 YASK

YASK is a framework that allows the generation of high performance code targeted at Intel Xeon and Xeon Phi processors. It provides APIs for both C++ and Python and multiple optimizations and features, including:

• Vector-folding: increases data reuse through a redesign of the data layout.
• Shared-memory parallelization with OpenMP.
• Multi-socket and multi-node parallelization with MPI.
• Space tiling: optimizes memory access by splitting the space iteration into chunks.
• Temporal tiling in multiple dimensions to further increase cache locality.

3.5 Previous Work

This work is a collaboration with Imperial College London³, which had already explored the potential of OPS for the heat propagation equation [18]. The current work strengthens the Devito-OPS development by refining existing methods and creating new ones to enable OPS support in Devito.

² Gmsh is an open source 3D finite element mesh generator with a built-in CAD engine and post-processor.

³ Imperial College London (legally Imperial College of Science, Technology and Medicine) is a public

Chapter 4 Implementation

This Chapter presents the modifications implemented in the Devito compiler to enable code generation in OPS syntax. Code generation is presented in Section 4.1 and code translation is discussed in Section 4.2. The compilation process for the proposed backend is explained in Section 4.3. Finally, the process of executing the generated binary is described in Section 4.4.

4.1 Code Generation

Considering the Devito pipeline described in Section 2.1, code generation for the specialized OPS backend is handled in the Synthesis phase. Hence, the proposed backend shares the entire compilation pipeline up to the specialization step. The diagram in Figure 4.1 gives an overview of the Devito and OPS integration, alongside the currently supported backends.

This work relies on built-in methods in Devito that are used for code generation:

• find_affine_trees: one of the main concerns when offloading data to the GPU is identifying loops that are parallelizable. This method handles this task by relying on a fundamental result from compiler theory. The $i$-th Iteration in a nest comprising $n$ Iterations is parallel if, for all dependencies $D$, expressed as distance vectors $D = (d_0, \ldots, d_{n-1})$, either $(d_1, \ldots, d_{i-1}) = 0$ [19].

• FindNodes: a class responsible for finding all instances of a given type. For a syntax tree, or a subtree, it returns a list of all nodes that match the given type. This work relies on this built-in class to find multiple nodes of interest, such as Expression and Iteration.
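The type-based search performed by FindNodes can be illustrated with a minimal sketch. The classes `Node`, `Iteration` and `Expression` and the function `find_nodes` below are hypothetical stand-ins written for this illustration; they are not Devito's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a syntax tree node (not Devito's class)."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class Iteration(Node):
    dim: str = ""


@dataclass
class Expression(Node):
    text: str = ""


def find_nodes(root: Node, node_type: type) -> List[Node]:
    """Return every node of `node_type`, visiting the tree depth-first."""
    found = [root] if isinstance(root, node_type) else []
    for child in root.children:
        found.extend(find_nodes(child, node_type))
    return found


# Toy tree: for x { for y { <expression> } }
expr = Expression(text="u[t+1][x][y] = u[t][x][y] + 1")
tree = Iteration(dim="x", children=[Iteration(dim="y", children=[expr])])

expressions = find_nodes(tree, Expression)
iterations = find_nodes(tree, Iteration)
```

In this toy tree the search for `Expression` returns the single innermost node, while the search for `Iteration` returns the two loops in nesting order.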

The following Subsections explain each part of the code generation. Subsection 4.1.1 starts with the Expressions, which are the entry point for the conversion to OPS syntax. Subsection 4.1.2 presents the Block, the base structure for GPU datasets in OPS. Subsection 4.1.3 analyses the Dataset, which is responsible for declaring the data in the GPU device. Subsection 4.1.4 describes the Stencil, which specifies the points used when updating the grid. Subsection 4.1.5 covers the Data Arguments passed to the kernels. Subsection 4.1.6 indicates the part of the code that will execute in the GPU. Finally, Subsection 4.1.7 describes the Memory Transfer aspects.


Figure 4.1: Diagram of Devito and OPS integration.

4.1.1 Expressions

Expression analysis is the most important part of the OPS backend because it provides the basis for all the subsequent structures.

All defined expressions are obtained from a Devito IET. Using FindNodes, a list of all nodes of type Expression can be extracted from a given IET. Each term of each element in the list of expression nodes is analysed, and its type is identified for translation. Algorithm A.2 is a recursive method for expression evaluation. The base cases of the algorithm occur when objects of type Constant or Indexed are identified.

The Constant objects represent a variable or a small array that will not be updated inside the parallel loop. It will remain constant throughout the execution. This object is translated into an OPS object of type Accessible, which represents a symbol to be accessed.

The Indexed objects represent the data arrays that are accessed in the expression. Indexed objects are either time independent or time dependent arrays. Time dependent arrays are modified at each time step of the loop. OPS only iterates through space dimensions, so when a time dependent array is found it is necessary to allocate a new object for each time index being accessed. Therefore, with the current solution, there will be as many Accessible objects as there are time indices accessed in the loop.

For example, an expression that initially is represented in C/C++ syntax as

(I) u[t+1][x][y] = u[t][x][y] + 1

is translated into

(II) ut1(0,0) = ut0(0,0) + 1

according to the OPS API.

The array access u in Equation I will be replaced by ut0 when u is accessed in the current time index, and by ut1 when u is accessed one time index ahead. The term (0,0) specifies which position of the stencil will be accessed. In this case, it indicates the current spatial position.
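The renaming rule described above can be sketched with a small text-based rewriter. This is only an illustration under simplifying assumptions: the function `translate` is hypothetical, it works on strings with regular expressions rather than on Devito's symbolic objects, and it only handles accesses at the current time index or ahead of it.

```python
import re

# Matches accesses such as u[t+1][x][y] or u[t][x-2][y].
# Only forward time shifts (t, t+1, t+2, ...) are handled in this sketch.
ACCESS = re.compile(r"(\w+)\[t(?:\+(\d+))?\]((?:\[\w+(?:[+-]\d+)?\])+)")
INDEX = re.compile(r"\[\w+([+-]\d+)?\]")


def translate(expr):
    """Rewrite time-indexed array accesses as per-time-level accessors."""
    def rewrite(match):
        name, t_shift, space = match.group(1), match.group(2), match.group(3)
        level = int(t_shift) if t_shift else 0
        # Each spatial index becomes an offset relative to the current point.
        offsets = [int(s) if s else 0 for s in INDEX.findall(space)]
        return f"{name}t{level}({','.join(str(o) for o in offsets)})"

    return ACCESS.sub(rewrite, expr)


translated = translate("u[t+1][x][y] = u[t][x][y] + 1")
shifted = translate("u[t][x+1][y-2]")
```

Here `translated` is `ut1(0,0) = ut0(0,0) + 1`, and a spatially shifted access such as `u[t][x+1][y-2]` becomes `ut0(1,-2)`.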

An example of the generated expressions for the wave propagation can be seen in Algorithm A.3 as well as the kernel creation.

4.1.2 Block

OPS targets computation on multi-block structured meshes. A block is a collection of structured grids. To define a block it is only necessary to define the dimensionality of the grid collection through the method ops_decl_block, whose parameters are:

• dims: dimension of the block.

• name: name used for output diagnostics.

An example of a block declaration is shown in Algorithm 4.1.

4.1.3 Dataset

Datasets are the way OPS handles data allocated in the GPU. The API method ops_decl_dat creates an ops_dat object. It requires nine parameters:

• block: structured block in which the dataset will be used.

• dim: number of dimensions of the dataset.

• size: number of elements in each dimension.

• base: the start indices where the arrays begin; for the C/C++ syntax this is 0.

• d_m: padding used in the negative direction of each dimension.

• d_p: padding used in the positive direction of each dimension.

• data: optional parameter. If the data is already allocated in the CPU memory, it can be used as the initial values for the data in the GPU. If NULL is provided, then OPS will allocate it automatically.

• type: name of the type used.

• name: name for output diagnostics.

Algorithm 4.1 shows an example of code automatically generated by Devito for building objects of type ops_dat.

Algorithm 4.1: Building objects of type ops_dat

    ops_block block = ops_decl_block(3, "block");
    int damp_base[3] = {0, 0, 0};
    int damp_d_p[3] = {0, 0, 0};
    int damp_d_m[3] = {0, 0, 0};
    int damp_dim[3] = {100, 100, 100};
    ops_dat damp_dat = ops_decl_dat(block, 1, damp_dim, damp_base,
                                    damp_d_m, damp_d_p, &(damp[0][0][0]),
                                    "float", "damp");

4.1.4 Stencil

The Stencil represents which data region is accessed. For example, Figure 4.2 represents a 2nd order stencil. It updates the central position in yellow by accessing all green positions, located one position to the left and one position to the right in each dimension.

Figure 4.2: 2nd order stencil.

Figure 4.3 represents an 8th order stencil with access to four positions for each direction in each dimension.

Figure 4.3: 8th order stencil.

To specify a stencil using OPS it is required to provide the relative position of each accessed point. Therefore, the case illustrated by Figure 4.2 requires an integer array with eighteen numbers, representing six three-dimensional offsets (excluding the central position), whereas the case illustrated by Figure 4.3 requires twenty-five three-dimensional offsets.

Algorithm 4.2 shows the stencil declaration for an 8th order stencil. Using the new backend, this declaration is handled by Devito, which simplifies the declaration process for the end user.

Algorithm 4.2: Building an object of type ops_stencil for an 8th order stencil.

    int s3d_ut0_25pt[75] = {
        0, 0, 0,  -3, 0, 0,  3, 0, 0,  1, 0, 0,  -1, 0, 0,
        -4, 0, 0,  4, 0, 0,  -2, 0, 0,  2, 0, 0,
        0, -3, 0,  0, 3, 0,  0, 1, 0,  0, -1, 0,
        0, -4, 0,  0, 4, 0,  0, -2, 0,  0, 2, 0,
        0, 0, -3,  0, 0, 3,  0, 0, 1,  0, 0, -1,
        0, 0, -4,  0, 0, 4,  0, 0, -2,  0, 0, 2};

    ops_stencil S3D_UT0_25PT = ops_decl_stencil(3, 25, s3d_ut0_25pt,
                                                "S3D_UT0_25PT");
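The list of relative positions for a star-shaped stencil can be derived from the space order. The sketch below is an illustration only: `star_stencil` and `flatten` are hypothetical helpers, and the ordering of the points is arbitrary, so it differs from the ordering in the generated array above.

```python
from typing import List, Tuple


def star_stencil(space_order: int, ndim: int = 3) -> List[Tuple[int, ...]]:
    """Offsets of a star stencil: the centre plus `space_order // 2`
    points in each direction of each dimension."""
    radius = space_order // 2
    points = [tuple([0] * ndim)]  # central position first
    for axis in range(ndim):
        for dist in range(1, radius + 1):
            for sign in (-1, 1):
                offset = [0] * ndim
                offset[axis] = sign * dist
                points.append(tuple(offset))
    return points


def flatten(points: List[Tuple[int, ...]]) -> List[int]:
    """Flatten the offsets into the one-dimensional integer array form."""
    return [component for point in points for component in point]


order8 = star_stencil(8)
order2 = star_stencil(2)
```

For a space order of 8 in three dimensions this yields 25 points, that is 75 integers, which agrees with the array size declared above. For a space order of 2 it yields the central position plus six neighbours.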

4.1.5 Data Argument

Each Accessible identified according to Subsection 4.1.1 is used for the generation of the device code. The device code is generated in a separate file with the extension .h. This file contains methods, here called kernels, each of which executes in the GPU. Kernels require the Accessible to be specified both in the method signature and in the method invocation.

The analysed Accessible can produce different kinds of argument. If the Accessible is generated by a Constant object, then the method ops_arg_gbl is used. If the Accessible is generated by an Indexed object, then the method ops_arg_dat is used. Both methods return an object of type ops_arg.

ops_arg_gbl parameters:

• data: pointer to the constant data.

• dim: dimension of the data.

• type: name of the data type.

• acc: type of access used (read, write or both).

ops_arg_dat parameters:

• data: dataset.

• dim: dataset dimension.

• stencil: stencil for accessing data.

• type: name of the data type.

• acc: type of access used (read, write or both).

These arguments are built directly in the call to the ops_par_loop method (Subsection 4.1.6); see Algorithm 4.3 for an example.
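The choice between the two argument constructors can be sketched as a small string generator. The `Accessible` record and the function `ops_argument` below are hypothetical and only mimic the decision described in this Subsection; they are not the backend's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Accessible:
    """Hypothetical record describing one kernel argument."""
    name: str                      # symbol name in the host code
    dtype: str                     # data type name, e.g. "float"
    access: str                    # OPS_READ, OPS_WRITE or OPS_RW
    stencil: Optional[str] = None  # None when it comes from a Constant


def ops_argument(acc: Accessible) -> str:
    """Emit the C text of the argument for one Accessible."""
    if acc.stencil is None:
        # Constant: passed by address as a global argument.
        return f'ops_arg_gbl(&{acc.name}, 1, "{acc.dtype}", {acc.access})'
    # Indexed: passed as a dataset together with its stencil.
    return (f'ops_arg_dat({acc.name}, 1, {acc.stencil}, '
            f'"{acc.dtype}", {acc.access})')


constant_arg = ops_argument(Accessible("h_x", "float", "OPS_READ"))
dataset_arg = ops_argument(
    Accessible("damp_dat", "float", "OPS_READ", stencil="S3D_DAMP_1PT"))
```

The two generated strings have the same shape as the arguments used in the generated host code for `h_x` and `damp_dat`.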

4.1.6 Parallel Loop

The parallel loop is responsible for indicating which region of the code will be executed in the GPU device. It is composed of two parts: the call to the parallel loop, and the kernel method to be executed.

The kernel method is defined in a C++ header file (with file extension .h). It contains the expressions translated according to Subsection 4.1.1.

The parallelizable kernel to be executed in the device needs to be called from the host code. This is done by the method ops_par_loop. Its parameters are:

• kernel: kernel method defined in the header file.

• name: name to identify the invoked kernel.

• block: block for applying the loop.

• dims: dimension of the iteration.

• range: specifies the iteration limits for each dimension. The lower limit is inclusive and the upper limit is exclusive.

• args: data arguments used in the kernel, defined according to Subsection 4.1.5.

Algorithm 4.3 shows an example of how an ops_par_loop is built.

Algorithm 4.3: Building the method ops_par_loop

    ops_par_loop(OPS_Kernel_0, "OPS_Kernel_0", block, 3,
                 (int *)OPS_Kernel_0_range,
                 ops_arg_dat(damp_dat, 1, S3D_DAMP_1PT, "float", OPS_READ),
                 ops_arg_dat(u_dat[t0], 1, S3D_UT0_25PT, "float", OPS_READ),
                 ops_arg_dat(u_dat[t1], 1, S3D_UT1_1PT, "float", OPS_WRITE),
                 ops_arg_dat(u_dat[t2], 1, S3D_UT2_1PT, "float", OPS_READ),
                 ops_arg_dat(vp_dat, 1, S3D_VP_1PT, "float", OPS_READ),
                 ops_arg_gbl(&h_x, 1, "float", OPS_READ),
                 ops_arg_gbl(&h_y, 1, "float", OPS_READ),
                 ops_arg_gbl(&h_z, 1, "float", OPS_READ));
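The semantics of the range parameter can be made concrete with a short sketch. It assumes that the range array stores one (lower, upper) pair per dimension, laid out as `[lo0, hi0, lo1, hi1, ...]`; this layout and the helper names are assumptions made for illustration.

```python
from math import prod


def iteration_counts(rng):
    """Points visited per dimension for rng = [lo0, hi0, lo1, hi1, ...].

    The lower limit is inclusive and the upper limit is exclusive."""
    if len(rng) % 2 != 0:
        raise ValueError("expected one (lower, upper) pair per dimension")
    return [hi - lo for lo, hi in zip(rng[0::2], rng[1::2])]


def total_points(rng):
    """Total number of grid points covered by the loop."""
    return prod(iteration_counts(rng))


def visited(lo, hi):
    """Indices visited along one dimension."""
    return list(range(lo, hi))


full_grid = total_points([0, 100, 0, 100, 0, 100])
first_indices = visited(0, 3)
```

Under this assumption, a range of 0 to 100 in each of three dimensions covers one million points, and the index 100 itself is never visited.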

4.1.7 Memory Transfer

Offloading data to the GPU requires a communication channel between CPU and GPU. The data in CPU memory is allocated and initialized by Devito. To avoid allocating the data twice in the CPU memory space, Devito delivers a pointer to the already allocated data.

The GPU has a separate memory system. There are two ways to update data on the GPU using OPS. The first one is to use the data parameter as initializer when creating the ops_dat object according to Subsection 4.1.3. If the ops_dat object already exists, then the method ops_dat_release_raw_data is used. It notifies OPS that the data has been updated in CPU memory and must therefore be transferred to the GPU. The signature of the method ops_dat_release_raw_data requires:

• dat: dataset to which the data will be copied.

• part: data initial position for the generated code (0 for C/C++).

• acc: type of access used in the host memory data. There are three options: OPS_READ, OPS_WRITE, and OPS_RW which respectively mean that the data has been read, written or both.

After the device executes the kernel, the computed data is delivered back to the CPU main memory, at the exact same position where the data has been allocated by Devito. The method responsible for this task is ops_dat_get_raw_pointer, which can copy the data to the allocated region. The method's signature requires:

• dat: dataset to copy data from.

• part: data initial position for the generated code (0 for C/C++).

• stencil: stencil used for this dataset.

• memspace: indicates the memory space in which the data to be copied from is allocated. To reference a GPU memory space the OPS_DEVICE constant is used.

• return: pointer to a CPU memory position to which the data will be transferred.

An example of the generated data transfer methods is presented in Algorithm 4.4.

Algorithm 4.4: Building methods ops_dat_get_raw_pointer and ops_dat_release_raw_data.

    ops_dat_get_raw_pointer(damp_dat, 0, S3D_DAMP_1PT, &memspace);
    ops_dat_get_raw_pointer(u_dat[t1], 0, S3D_UT1_1PT, &memspace);
    ops_dat_get_raw_pointer(u_dat[t2], 0, S3D_UT2_1PT, &memspace);

    /* modifies data in CPU */

    ops_dat_release_raw_data(damp_dat, 0, OPS_RW);
    ops_dat_release_raw_data(u_dat[t1], 0, OPS_RW);
    ops_dat_release_raw_data(u_dat[t2], 0, OPS_RW);

4.2 Translation

All the code generated according to Section 4.1 is written into two files. The first one has the extension .c and describes the host code that is executed by the CPU. The second one is a header file with the extension .h and describes the device code, which is parallelized.

Both files are created in a temporary folder on the file system, usually the folder /tmp on a Unix-based system. The generated files are fed to an OPS script written in Python. This script translates the API-compliant code into optimized code for the supported frameworks, such as CUDA.
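Writing the two generated files into a temporary folder can be sketched as follows. The function `write_sources`, the folder prefix and the base name are hypothetical; the sketch only covers the file creation and deliberately does not invoke the OPS translation script.

```python
import os
import tempfile


def write_sources(host_code, device_code, basename="kernel"):
    """Write host (.c) and device (.h) code into a fresh temporary folder.

    Returns the two file paths. The folder is created under the system
    temporary directory, which is usually /tmp on a Unix-based system."""
    workdir = tempfile.mkdtemp(prefix="devito-ops-")
    host_path = os.path.join(workdir, basename + ".c")
    device_path = os.path.join(workdir, basename + ".h")
    with open(host_path, "w") as host_file:
        host_file.write(host_code)
    with open(device_path, "w") as device_file:
        device_file.write(device_code)
    return host_path, device_path


host_path, device_path = write_sources(
    "/* host code executed by the CPU */\n",
    "/* device code containing the kernels */\n",
)
```

Both paths share the same folder, so a later translation step can locate the header next to the host file.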

Figure 4.4 shows a diagram including the translation pipeline.

4.3 Compilation

In the standard backend of Devito core, when the user executes an Operator by calling the method apply, the just-in-time compilation process is triggered (if this is the first execution of the operator). This method will:

1. Compile the code generated by Devito, producing a shared library file (file with extension .so).

2. Load the generated shared library into Python.

For the proposed OPS-based backend there is an additional step, according to Section 4.2. The generated code has to be translated before being compiled for a GPU architecture. Steps 1 and 2 from the default backend remain the same, but intermediate steps are added:

1.5 OPS intermediate steps.

(a) Translate the code according to Section 4.2.

(b) Compile the CUDA code using the NVIDIA compiler nvcc. This generates an object file with the extension .o.

(c) Compile the translated C code, linking the CUDA object, to generate the shared object that will be invoked from Python.
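Steps (b) and (c) can be sketched as two command lines assembled from the flags listed in Section 5.2. The file names are hypothetical, the use of `-c` and `-o` to name the outputs is an assumption, and a real build would also need the OPS include and library paths, which are not listed here. The sketch only builds the command strings and does not run them.

```python
# Flags as listed in Section 5.2.
GCC_FLAGS = ["-fopenmp", "-O3", "-fPIC", "-Wall", "-shared", "-g"]
NVCC_FLAGS = ['-Xcompiler="-std=c99 -fPIC"', "-O3"]


def build_commands(host_c, cuda_src, cuda_obj, shared_lib):
    """Return the shell commands for steps (b) and (c), in order."""
    # (b) compile the translated CUDA code into an object file
    nvcc = " ".join(["nvcc", *NVCC_FLAGS, "-c", cuda_src, "-o", cuda_obj])
    # (c) compile the host code and link the CUDA object into a shared object
    gcc = " ".join(["gcc", *GCC_FLAGS, host_c, cuda_obj, "-o", shared_lib])
    return [nvcc, gcc]


commands = build_commands(
    host_c="kernel_ops.c",       # hypothetical file names
    cuda_src="kernel_kernels.cu",
    cuda_obj="kernel_kernels.o",
    shared_lib="kernel.so",
)
```

The object file produced by the first command appears as an input of the second one, which mirrors the dependency between the two steps.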

Figure 4.4 shows a diagram with the compilation process described.

Figure 4.4: Diagram explaining the workflow to generate a shared object by the OPS backend.

All the compiler information as well as the flags used in the compilation process are summarized in Section 5.2.

4.4 Execution

After the compilation process described in Section 4.3, Devito has generated a shared object that can be invoked from Python. The execution of the loaded binary only happens when the method apply is called.

The method apply then executes the binary through a call to the method created in C. All the parameters needed to execute the binary are passed as data pointers. For the OPS-based backend it also triggers the execution on the GPU. After the execution, the computed result is brought back to Devito's allocated memory space.


Chapter 5 Experiments and Results

This Chapter presents the experiment used to evaluate the proposed solution as well as the results obtained.

Section 5.1 and Section 5.2 describe the hardware specification of the scenarios considered, and the compilers and flags adopted. The experiment procedure is described in Section 5.3. A numerical verification is presented in Section 5.4. The obtained results are shown in Section 5.5.

5.1 Hardware Specification

Two different environments were considered. The location of the hardware will be used as an identifier to reference each scenario. The first machine is located at LAPPS. The second one is at NPAD.

| Location | LAPPS | NPAD |
| --- | --- | --- |
| CPU | Intel Core i7-6700 | Intel Xeon E5-2683 |
| CPU Clock | 3.40 GHz | 2.10 GHz |
| Number of Cores/Threads | 4/8 | 14/28 |
| RAM Memory (GB) | 16 | 512 |
| Graphical card | GTX Titan Z | V100 |

Table 5.1: Hardware specification used.

It should be stressed that the GTX Titan Z graphics card combines two graphical processors, although only one of them is used in the scope of this work. This implies that only half of the available resources are utilized.


| GPU | GTX Titan Z | V100 |
| --- | --- | --- |
| Memory Bandwidth (GB/s) | 336 x2 | 900 |
| Single Precision Peak Performance (GFLOPS) | 4746 | 14000 |
| Memory (GB) | 6 x2 | 16 |
| Micro-Architecture | Kepler | Volta |

Table 5.2: Graphical cards specification used.

5.2 Compiler Information

The compilation process requires two different compilers. One is for compiling the device code, in this case written in CUDA. Another one is to link the CUDA object with the host code written in C/C++.

The host code is compiled with gcc version 7.4 using the flags -fopenmp -O3 -fPIC -Wall -shared -g. This command produces a shared object that will be called from Devito.

The device code is compiled with nvcc version 9.2 for the Kepler micro-architecture and version 10.1 for the Volta micro-architecture, using the flags -Xcompiler="-std=c99 -fPIC" -O3 to generate the CUDA object [20].

| Flag | Description |
| --- | --- |
| -fopenmp | Activate the OpenMP extension |
| -O3 | Turn on all the optimizations made available by the compiler |
| -fPIC | Generate position-independent code |
| -Wall | Enable all compiler warnings |
| -shared | Produce a shared object which can then be linked with other objects to form an executable |
| -g | Generate debug information |
| -Xcompiler | Specify options directly to the C compiler or preprocessor |
| -std=c99 | Specify code style according to the C99 standard |

Table 5.3: Flags description.

5.3 Acoustic Wave Propagation

To measure the efficiency of the proposed solution, acoustic wave propagation is chosen as the algorithm for the experiment. This choice is due to the fact that this algorithm is the basis for other geophysics algorithms, such as reverse time migration and full waveform inversion, which have a high computational intensity and represent a challenge for the available computational resources.

When a wave propagates through a medium by compression and decompression, it is called an acoustic wave propagation, and it is characterized by a velocity that depends on the medium being propagated through.

Equation 5.1 describes the mathematical formulation of the wave propagation with its initial conditions.

$$
\begin{cases}
m(x, y, z)\,\dfrac{d^2 u(x, y, z, t)}{dt^2} - \nabla^2 u(x, y, z, t) = q_s, \\
u(x, y, z, 0) = 0, \\
\left.\dfrac{d u(x, y, z, t)}{dt}\right|_{t=0} = 0,
\end{cases}
\tag{5.1}
$$

where:

• $m(x, y, z) = \frac{1}{c(x, y, z)^2}$ represents the square slowness model as a function of the three space coordinates $(x, y, z)$;

• $u(x, y, z, t)$ is the spatially varying acoustic wave field at each time step;

• $q_s$ is the source term representing the source injection.
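To make the role of each term concrete, the sketch below advances a one-dimensional version of Equation 5.1 by one explicit time step, using a second-order central difference in both time and space. It is a simplified illustration and not the generated kernel: it is one-dimensional, uses fixed zero boundaries instead of an absorbing boundary, and the function `step` and the grid values are chosen only for this example.

```python
def step(u_prev, u, m, q, dt, h):
    """One explicit time step of m * u_tt - u_xx = q in one dimension.

    Solving the discretized equation for the next time level gives
    u_next = 2*u - u_prev + (dt**2 / m) * (laplacian(u) + q)."""
    n = len(u)
    u_next = [0.0] * n  # boundary points stay at zero
    for i in range(1, n - 1):
        laplacian = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
        u_next[i] = (2.0 * u[i] - u_prev[i]
                     + (dt * dt / m[i]) * (laplacian + q[i]))
    return u_next


n = 11
c = 2000.0                   # velocity in m/s
m = [1.0 / (c * c)] * n      # square slowness
dt, h = 0.001, 10.0          # time step (s) and grid spacing (m)

zero = [0.0] * n             # initial conditions: u = 0 and du/dt = 0
q = [0.0] * n
q[n // 2] = 1.0              # source injected at the central point

u1 = step(zero, zero, m, q, dt, h)   # first step, source active
u2 = step(zero, u1, m, zero, dt, h)  # second step, source switched off
```

After the first step only the source position is non-zero. After the second step the disturbance has reached both neighbouring points with equal amplitude, as expected for a homogeneous medium.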

Figure 5.1 shows a sequence of snapshots of an acoustic wave being propagated in a two-dimensional subsurface. The subsurface chosen for the example is the experimental model Marmousi [21]. Subfigure a represents the initial model without any wave being propagated. A source then injects a disturbance at the surface, which is propagated through the whole space. Subfigure b shows the moment right after the beginning of the injection, while Subfigures c and d show later instants.

Figure 5.1: Wave propagating through the Marmousi model. Subfigure a shows the 2D Marmousi model. Subfigure b shows the first instant after a source injected a perturbation at the surface. In Subfigures c and d the wave propagates through all the layers of the model. Subfigure d happens after Subfigure c.

A three-dimensional acoustic wave propagation with the following parameters was considered:

• Time step: 0.001 s; total time: 30 s

• Domain size: 1 km x 1 km x 1 km, with grids of $32^3$, $64^3$, $128^3$, $256^3$ and $512^3$ points

• Ricker source injection with peak frequency of 10 Hz

• Space order: 4, 8, 12, 16 and 24

• Single-layer velocity model of 2 km/s

• Absorbing boundary

Algorithm A.1 represents the Python code used for this experiment.
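The size of the experiment follows directly from the parameters listed above. The short sketch below works out the number of time steps and the memory taken by a single single-precision field on each grid; the helper names are hypothetical, and the footprint ignores padding and the additional fields a real run allocates.

```python
TOTAL_TIME_S = 30.0
TIME_STEP_S = 0.001
GRID_SIZES = (32, 64, 128, 256, 512)
BYTES_PER_FLOAT32 = 4


def time_steps(total_time, dt):
    """Number of time steps needed to cover the total simulated time."""
    return round(total_time / dt)


def field_mib(points_per_side):
    """Memory of one float32 field on a cubic grid, in MiB."""
    return points_per_side ** 3 * BYTES_PER_FLOAT32 / 2 ** 20


steps = time_steps(TOTAL_TIME_S, TIME_STEP_S)
footprints = {n: field_mib(n) for n in GRID_SIZES}
```

This gives 30,000 time steps, and a single field grows from 0.125 MiB on the smallest grid to 512 MiB on the largest one.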

5.4 Correctness Verification

Before evaluating the performance of the algorithm, it is essential to check that the proposed solution computes the correct result. The numerical results for the core and ops backends are directly compared.

Figure 5.2 shows the value computed in the last iteration of the propagation for each backend, and the difference between these results. The comparison reveals that there are no numerical mismatches between propagation simulations performed by the core and ops backends.

5.5 Results

Previous studies indicate that Devito is able to utilise Intel architectures¹ with a high degree of efficiency, while maintaining the ability to increase accuracy by switching to higher order stencil discretizations dynamically [1]. Luporini et al. show that speed-ups from 3x up to 4x are attainable on those architectures in scenarios with what they call "aggressive" optimizations, which avoid redundant computation, over 3D grids with space order discretization levels varying from 4 to 16. In this study, the performance of a new backend for Devito is measured on the NVIDIA® architectures GTX Titan Z™ and Tesla V100™, considering scenarios with no symbolic optimizations (basic DSE) and with an aggressive symbolic optimization implemented by Devito (aggressive DSE). An isotropic acoustic wave propagation model with absorbing boundaries, as described by Equation 5.1, is utilized.

This study measures the ratio between the attained performance and the peak machine performance according to vendor specification, for both devices. Taking into account the roofline model described in Section 2.4, it evaluates how efficiently the generated algorithms utilize the GPU. Different space order levels of the generated propagation stencil kernels are considered. For each of the considered space orders, the propagation kernel is profiled using nvprof² in order to obtain: (a) the number of single precision floating-point operations, (b) the number of memory transactions, and (c) the kernel execution time.

Figure 5.2: Numerical validation. The first plot is the numerical value of the propagation for each backend. It represents the physical pressure applied to each point of the grid. The second plot is the difference between those values.

¹ Intel® Xeon® E5-2690v2 with 10 physical cores, and Intel® Xeon Phi™ accelerator card.

² The nvprof profiling tool enables you to collect and view profiling data from the command-line, and is

For each space order, the produced stencil kernel ran five times for 30,000 time steps. Table 5.4 shows the values collected for the GTX Titan Z and Table 5.5 shows the values collected for the V100, for basic and aggressive symbolic optimization levels, and space order levels of 8, 12, 16 and 24. The values for OI are obtained according to Equation 2.2, whereas the values for performance are obtained according to Equation 2.3.

Figures 5.3 and 5.4 display the OI (FLOP/Byte) versus performance (GFLOP/s) from the values found in Tables 5.4 and 5.5, respectively. Each of the points in those plots is characterized by two values: (i) the space order, and (ii) the percentage in relation to the peak device performance. The performance bounds were obtained from vendor peak performance specifications and are shown in Table 5.2.

The maximum attainable performance (Section 2.4) for each architecture is calculated using Equation 2.1 considering the hardware specifications described in Table 5.2. Any algorithm running in the same architecture will be bound to this very same roof.
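The bound can be sketched numerically. The code below assumes that Equation 2.1 has the usual roofline form, in which the attainable performance is the smaller of the peak performance and the product of memory bandwidth and operational intensity. The function names are hypothetical, and the hardware numbers are those of Table 5.2, taking a single GPU of the GTX Titan Z.

```python
def attainable(peak_gflops, bandwidth_gbs, oi):
    """Roofline bound in GFLOP/s for an operational intensity in FLOP/Byte."""
    return min(peak_gflops, bandwidth_gbs * oi)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


DEVICES = {
    "GTX Titan Z": {"peak": 4746.0, "bandwidth": 336.0},
    "V100": {"peak": 14000.0, "bandwidth": 900.0},
}

v100 = DEVICES["V100"]
# Basic optimization, space order 8, from Table 5.5:
# OI = 4.90 FLOP/Byte and measured performance = 693.77 GFLOP/s.
roof = attainable(v100["peak"], v100["bandwidth"], 4.90)
fraction_of_roof = 693.77 / roof
ridges = {name: ridge_point(d["peak"], d["bandwidth"])
          for name, d in DEVICES.items()}
```

Under these assumptions the ridge points lie at roughly 14.1 FLOP/Byte for the GTX Titan Z and 15.6 FLOP/Byte for the V100, above every OI value in Tables 5.4 and 5.5, so all measured kernels fall in the memory-bound region. For the example entry the roof is about 4410 GFLOP/s and the measured performance is about 15.7% of it.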

| Space Order | FP 32 Count | Memory Operations | Execution Time (s) | OI (Flop/Byte) | Performance (GFlop/s) |
| --- | --- | --- | --- | --- | --- |
| Basic Optimization | | | | | |
| 8 | 1,450,112,268 | 22,722,746 | 553.92 | 1.99 | 78.54 |
| 12 | 2,013,392,118 | 28,068,109 | 854.39 | 2.24 | 70.70 |
| 16 | 2,375,372,938 | 29,871,728 | 907.72 | 2.48 | 78.51 |
| 24 | 2,898,342,158 | 33,348,001 | 1,150.01 | 2.71 | 75.61 |
| Aggressive Optimization | | | | | |
| 8 | 641,887,345 | 22,637,047 | 135.73 | 0.89 | 141.88 |
| 12 | 760,134,906 | 27,737,029 | 179.15 | 0.86 | 127.29 |
| 16 | 842,931,505 | 29,704,549 | 180.55 | 0.89 | 140.06 |
| 24 | 929,761,776 | 32,926,331 | 219.76 | 0.88 | 126.92 |

Table 5.4: Data collected from profiling propagation kernel in GTX Titan Z using nvprof.

| Space Order | FP 32 Count | Memory Operations | Execution Time (s) | OI (Flop/Byte) | Performance (GFlop/s) |
| --- | --- | --- | --- | --- | --- |
| Basic Optimization | | | | | |
| 8 | 1,450,996,129 | 9,245,436 | 553.92 | 4.90 | 693.77 |
| 12 | 2,013,446,796 | 9,112,947 | 854.39 | 6.90 | 740.48 |
| 16 | 2,375,384,531 | 7,722,032 | 907.72 | 9.61 | 816.86 |
| 24 | 2,898,311,328 | 11,862,338 | 1,150.01 | 7.64 | 719.60 |
| Aggressive Optimization | | | | | |
| 8 | 641,882,304 | 9,256,098 | 15.31 | 2.18 | 1258.16 |
| 12 | 760,133,342 | 9,289,727 | 20.37 | 2.56 | 1119.42 |
| 16 | 842,930,745 | 8,026,245 | 20.21 | 3.28 | 1251.51 |
| 24 | 929,760,267 | 11,670,483 | 18.48 | 2.49 | 1509.60 |

Table 5.5: Data collected from profiling propagation kernel in V100 using nvprof.

[Roofline plot: Operational Intensity (FLOPs/Byte) versus Performance (GFLOPs/sec); single precision ideal peak of 4746 GFLOPs/sec. Annotated points, basic: 5.90% (SO = 8), 4.69% (SO = 12), 4.70% (SO = 16), 4.13% (SO = 24); aggressive: 23.69% (SO = 8), 21.99% (SO = 12), 23.38% (SO = 16), 21.43% (SO = 24).]

Figure 5.3: Roofline chart for GTX Titan Z GPU. Propagation field with $256^3$ points and space order levels of 8, 12, 16 and 24 using Devito aggressive and basic DSE optimizations.

[Roofline plot: Operational Intensity (FLOPs/Byte) versus Performance (GFLOPs/sec); single precision ideal peak of 14000 GFLOPs/sec. Annotated points, aggressive: 59.74% (SO = 8), 45.07% (SO = 12), 39.28% (SO = 16), 62.82% (SO = 24); basic: 15.73% (SO = 8), 11.92% (SO = 12), 9.44% (SO = 16), 10.47% (SO = 24).]

Figure 5.4: Roofline chart for V100 GPU. Propagation field with $256^3$ points and space order levels of 8, 12, 16 and 24 using Devito aggressive and basic DSE optimizations.

Figure 5.3 shows that, for the basic optimization, the operational intensity increases with the space order level. For the aggressive DSE optimization, the performance is almost identical across space orders. Figure 5.3 also shows that the aggressive optimization produces code with higher performance than the basic DSE optimization in all considered scenarios. The aggressive mode produces code that achieves approximately 24% of the machine peak performance, while code produced with the basic mode performs at less than 6% of the machine peak performance.

Figure 5.4 shows the performance results for the V100 graphics card. It shows how the performance responds to a given operational intensity for the different optimization cases. With the aggressive DSE optimization, performance goes from approximately 16% to 63%. It is worth noting that there is a decrease in OI for space order (so) 24. This result was not expected, as there are more operations at higher space orders. The data in Table 5.5 shows that the amount of data transferred for so 24 is 45% higher than for so 16, while the difference in data transfer between the other scenarios was at most 15%. The amount of data needed for so 24 is therefore much larger than expected, which indicates that memory accesses in the GPU are not coalesced in this case.

The results from the aggressive optimization corroborate results presented in a related experiment [1], which enabled Devito to generate code for the YASK framework and obtained peak performances ranging from 53% to 63% for Intel® Xeon® and Intel® Xeon Phi™ architectures.

All the points are located before the ridge point in both roofline plots in Figures 5.3 and 5.4, showing that it is possible to increase the Operational Intensity with more efficient memory transfers [2, 22]. This means that the produced codes should benefit more from optimizations that make memory exchanges more efficient than from optimizations focused on increasing throughput. Therefore, enabling FLOPs reduction and data locality optimizations, such as common sub-expression elimination, factorization, and code motion, should be considered a priority for future work.

Chapter 6 Conclusion

The open-source project Devito® [23, 1] has been attracting the attention of the academic [24, 25] and industrial [26] communities. As a DSL for seismic inversion applications, it already provides a set of automated performance optimizations during code generation that allow user applications to fully utilize the target hardware without changing the model specification, such as vectorization, shared-memory parallelism, loop blocking, auto-tuning, common sub-expression elimination, cross-iteration redundancy elimination (CIRE), expression hoisting and factorization. Devito also supports distributed-memory parallelism via MPI, and several halo-exchange schemes are available. Classic optimizations such as computation-communication overlap (relying on an asynchronous progress engine) are implemented. It can be integrated with a wide variety of methods (e.g. L-BFGS-B¹) for solving minimization problems, such as in FWI. It can perform FWI on distributed-memory parallel computers with Dask. It also implements support for standard CPU architectures, and for Intel® Xeon® and Intel® Xeon Phi™ architectures. However, support for code specialization for GPU architectures is still a work in progress.

This study created an extension of Devito to enable code generation for the OPS syntax. The proposed backend was evaluated in terms of machine peak performance for varying space order discretization levels on the NVIDIA® devices GTX Titan Z™ and Tesla V100™. As a result, the implemented backend achieves up to 62.82% of the V100's peak performance, which is consistent with results from work using Devito to generate YASK framework code [1]. It was observed that isotropic 3D wave propagation stencil kernels generated with aggressive symbolic optimizations achieved three times higher performance than kernels generated with no symbolic optimizations. This study, therefore, indicates the feasibility of combining the available power of GPU architectures with Devito code generation for efficiently solving seismic inversion algorithms.

However, some limitations are worth noting. In order to enable a seamless source-to-source translation of FWI algorithms, future work should provide support for receiver interpolation and adjoint propagation.

¹ Large-scale Bound-constrained Optimization is an algorithm for solving large nonlinear optimization

Bibliography

[1] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. Witte, P. H. J. Kelly, F. J. Herrmann, and G. J. Gorman, “Architecture and performance of devito, a system for automated stencil computation,” CoRR, vol. abs/1807.03032, jul 2018.

[2] S. Williams, A. Waterman, and D. Patterson, “Roofline,” Communications of the ACM, vol. 52, p. 65, apr 2009.

[3] A. F. Cárdenas and W. J. Karplus, “PDEL—a language for partial differential equations,” Communications of the ACM, vol. 13, pp. 184–191, mar 1970.

[4] Cook, G O Jr., “ALPAL: A tool for the development of large-scale simulation codes,” 1988.

[5] R. van Engelen, L. Wolters, and G. Cats, “CTADEL: A Generator of Multi-Platform High Performance Codes for PDE-based Scientific Applications,” in Proceedings of the 10th international conference on Supercomputing - ICS ’96, (New York, New York, USA), pp. 86–93, ACM Press, 2003.

[6] A. Logg, K. B. Olgaard, M. E. Rognes, and G. N. Wells, “FFC: the FEniCS Form Compiler,” in Automated Solution of Differential Equations by the Finite Element Method, Volume 84 of Lecture Notes in Computational Science and Engineering (A. Logg, K.-A. Mardal, and G. N. Wells, eds.), ch. 11, Springer, 2012.

[7] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, “Firedrake: automating the finite element method by composing abstractions,” jan 2015.

[8] K. Hawick and D. P. Playne, “Simulation Software Generation using a Domain-Specific Language for Partial Differential Field Equations,” in Proceedings of the International Conference on Software Engineering Research and Practice (SERP), p. 7, The Steering Committee of The World Congress in Computer Science, Com-puter Engineering and Applied Computing (WorldComp), 2013.

[9] T. Henretty, R. Veras, F. Franchetti, L.-N. Pouchet, J. Ramanujam, and P. Sadayappan, “A stencil compiler for short-vector SIMD architectures,” in Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS ’13, (New York, New York, USA), p. 13, ACM Press, 2013.


[10] R. Membarth, F. Hannig, J. Teich, and H. Kostler, “Towards domain-specific computing for stencil codes in HPC,” in Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, pp. 1133–1138, IEEE, nov 2012.

[11] Y. Zhang and F. Mueller, “Auto-generation and auto-tuning of 3D stencil codes on GPU clusters,” in Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO ’12, (New York, New York, USA), p. 155, ACM Press, 2012.

[12] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, “The OPS domain specific abstraction for multi-block structured grid computations,” in Proceedings of WOLFHPC 2014: 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Stor, pp. 58–67, 2014.

[13] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the NVIDIA Volta GPU architecture via microbenchmarking,” arXiv preprint arXiv:1804.06826, 2018.

[14] M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells, “The FEniCS Project Version 1.5,” Archive of Numerical Software, vol. 3, no. 100, 2015.

[15] G. R. Markall, D. A. Ham, and P. H. Kelly, “Towards generating optimised finite element solvers for GPUs from high-level specifications,” Procedia Computer Science, vol. 1, no. 1, pp. 1815–1823, 2010.

[16] G. Bercea, A. T. T. McRae, D. A. Ham, L. Mitchell, F. Rathgeber, L. Nardi, F. Luporini, and P. H. J. Kelly, “A numbering algorithm for finite element on extruded meshes which avoids the unstructured mesh penalty,” CoRR, vol. abs/1604.05937, 2016.

[17] L. Gross, L. Bourgouin, A. J. Hale, and H. B. Mühlhaus, “Interface modeling in incompressible media using level sets in Escript,” Physics of the Earth and Planetary Interiors, vol. 163, no. 1-4, pp. 23–34, 2007.

[18] V. Pandolfo, “Investigating the OPS intermediate representation to target GPUs in the Devito DSL,” 2019.

[19] A. V. Aho, Compilers: principles, techniques and tools (for Anna University), 2/e. Pearson Education India, 2003.

[20] J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming. John Wiley & Sons, 2014.

[21] G. S. Martin, R. Wiley, and K. J. Marfurt, “Marmousi2: An elastic upgrade for Marmousi,” The Leading Edge, vol. 25, no. 2, pp. 156–166, 2006.


[22] M. Louboutin, M. Lange, F. J. Herrmann, N. Kukreja, and G. Gorman, “Performance prediction of finite-difference solvers for different computer architectures,” Computers and Geosciences, vol. 105, pp. 148–157, aug 2017.

[23] M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman, “Devito (v3.1.0): An embedded domain-specific language for finite differences and geophysical exploration,” Geoscientific Model Development, vol. 12, no. 3, pp. 1165–1187, 2019.

[24] O. F. Mojica and N. Kukreja, “Towards automatically building starting models for full-waveform inversion using global optimization methods: A PSO approach via DEAP + Devito,” may 2019.

[25] P. A. Witte, M. Louboutin, N. Kukreja, F. Luporini, M. Lange, G. J. Gorman, and F. J. Herrmann, “A large-scale framework for symbolic implementations of seismic inversion algorithms in Julia,” GEOPHYSICS, vol. 84, pp. F57–F71, may 2019.

[26] C. Yount, J. Tobin, A. Breuer, and A. Duran, “YASK - Yet another stencil kernel: A framework for HPC stencil code-generation and tuning,” in 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pp. 30–39, IEEE, nov 2017.

[27] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software (TOMS), vol. 23, no. 4, pp. 550–560, 1997.


