Acceleration of Green's function molecular dynamics using shared memory techniques and many-core architectures


Universidade Estadual de Campinas Faculdade de Tecnologia

Fábio Andrijauskas

Acceleration of Green's function molecular dynamics using shared memory techniques and many-core architectures


Thesis presented to the School of Technology of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Technology, in the area of Materials Science.

Advisor: Prof. Dr. Vitor Rafael Coluci

This copy corresponds to the final version of the thesis defended by Fábio Andrijauskas and supervised by Prof. Dr. Vitor Rafael Coluci.


Cataloging record

Universidade Estadual de Campinas, Library of the Faculdade de Tecnologia

Felipe de Souza Bueno - CRB 8/8577

Andrijauskas, Fábio,

An28a Acceleration of Green's function molecular dynamics using shared memory techniques and many-core architectures / Fábio Andrijauskas. – Limeira, SP : [s.n.], 2019.

Advisor: Vitor Rafael Coluci.

Thesis (doctorate) – Universidade Estadual de Campinas, Faculdade de Tecnologia.

1. Molecular dynamics. 2. Green's functions. 3. Acceleration (Mechanics). I. Coluci, Vitor Rafael, 1976-. II. Universidade Estadual de Campinas. Faculdade de Tecnologia. III. Title.

Information for the Digital Library

Title in another language: Aceleração de dinâmica molecular baseada em funções de Green utilizando técnicas de memória compartilhada e arquiteturas de múltiplos núcleos

Keywords in English: Molecular dynamics; Green's functions; Acceleration (Mechanics)

Area of concentration: Materials Science

Degree: Doctor in Technology

Examination committee: Vitor Rafael Coluci [Advisor]; Sócrates de Oliveira Dantas; Marcos Henrique Degani; Alexandre Fontes da Fonseca; Ricardo Paupitz Barbosa dos Santos

Defense date: 23-07-2019

Graduate program: Technology

Author identification and academic information

- Author's ORCID: https://orcid.org/0000-0002-1254-8570
- Author's Lattes CV: http://lattes.cnpq.br/7771878233635494


APPROVAL SHEET

Listed below are the members of the examination committee of the public thesis defense session for the degree of Doctor in Technology, in the concentration area of Materials Science, to which the student Fábio Andrijauskas submitted himself on July 23, 2019, at the Faculdade de Tecnologia (FT/UNICAMP), in Limeira/SP.

Prof. Dr. Vitor Rafael Coluci

Chair of the Examination Committee

Prof. Dr. Sócrates de Oliveira Dantas

Instituto de Ciências Exatas, Departamento de Física, Universidade Federal de Juiz de Fora

Prof. Dr. Ricardo Paupitz Barbosa dos Santos

Instituto de Geociências e Ciências Exatas, Universidade Estadual Paulista

Prof. Dr. Marcos Henrique Degani

Faculdade de Ciências Aplicadas, Universidade Estadual de Campinas

Prof. Dr. Alexandre Fontes da Fonseca

Instituto de Física "Gleb Wataghin", Universidade Estadual de Campinas

The minutes of the defense, signed by the members of the examination committee, are on file in SIGA (the dissertation/thesis workflow system) and at the Graduate Office of the FT.


Acknowledgments

I thank my advisor, Professor Dr. Vitor, for all his patience and help. I thank Professors Dr. Vinod and Dr. Sócrates for their help and for the materials. And I thank my wife Gal, for lighting the way through the storms!



Abstract

The limited time scale of molecular dynamics simulations is an important obstacle to the study of material properties at realistic time scales. Several techniques address the problem of extending the simulation time. In a different approach, a technique based on Green's functions was applied to solve the equations of motion in classical simulations and extended the time scale of simulations of vibrational properties of carbon nanomaterials by eight orders of magnitude. However, its high computational cost has limited the use of this technique in materials science simulations. In this thesis, we combine high performance computing techniques to accelerate simulations based on Green's functions. Parallel versions were implemented for multiprocessor systems using OpenMP and for co-processors (accelerators such as Xeon Phi and Nvidia Tesla). Parallelism was applied mainly to the matrix operations, using a standard linear algebra library (MAGMA), and to the update of positions and velocities, using OpenACC and OpenMP (offload) for Xeon Phi. Validation and performance tests were performed on one-dimensional atomic chains, for which the atomic displacements were compared with the available exact solution. A maximum speedup of 12× was obtained for the OpenMP version and 23× for the combination of Nvidia Tesla and OpenMP. These speedups broaden the applicability of molecular dynamics simulations based on Green's functions in materials science.


List of Figures

2.1 Three main steps to perform in a molecular dynamics: 1- Energy calculation, 2- Force calculation, and 3- numerical integration to determine velocity and positions. Adapted from [20].

2.2 Predictions for the system size and time scale for a MD simulation for computations power in the petascale (dashed line) and exascale (dashed-dotted line). Blue regions indicate the system size and time scales that could be reached by using current parallel MD approaches. Adapted from [21].

3.1 Illustration of the hyperdynamics technique. The original potential (black curve) is modified to a new one (blue curve) to decrease the energy barriers around the potential minima.

4.1 Illustration of the GFMD. (a) The exact (usually unknown) trajectory of a particle. (b) The trajectory is determined for some specific discrete times in traditional integration algorithms. (c) The trajectory is determined analytically by Eq. (4.21) using GFMD.

5.1 Shared memory distributed processing.

5.2 Parallel code using OpenMP.

5.3 Distributed memory processing.

5.4 Shared and distributed memory. Adapted from [44].

5.5 Comparison between Graphics Processing Unit (GPU) and Central Processing Unit (CPU). Adapted from [50].

5.6 Xeon Phi architecture. Adapted from [51].

6.1 Representation of a linear chain of atoms.

6.2 Normalized displacement of the central atom obtained by GFMD, by the exact solution, and by the Verlet integrator.

6.3 Usage map of the CPU time consumed by each routine of the implemented serial version of GFMD. The values correspond to a one-step simulation. The colors in the boxes indicate how many times the routine is called by the main code, where red indicates more calls and blue fewer calls.

6.4 Usage map of the CPU time consumed by each routine of the implemented serial version of GFMD. The values correspond to a 1000-step simulation.

6.5 Schematic representation of the parallel version of the GFMD. The parallelized modules are marked with curved dashed lines.

7.1 Normalized displacement of the central atom obtained by Parallel Green's Function Molecular Dynamics (pGFMD), by the exact solution, and by the Verlet integrator.

7.2 Time of OpenMP GFMD simulating 6000 atoms with 20000 steps using 12 threads.

7.3 Speedup of OpenMP GFMD simulating 6000 atoms with 20000 steps using 12 threads.

7.4 Time of OpenMP GFMD simulating 12000 atoms with 20000 steps using 12 threads.

7.5 Speedup of OpenMP GFMD simulating 12000 atoms with 20000 steps using 12 threads.

7.6 Time of OpenMP GFMD simulating 20000 atoms with 20000 steps using 12 threads.

7.7 Speedup of OpenMP GFMD simulating 20000 atoms with 20000 steps using 12 threads.

7.8 Speedup of the parallel GFMD as a function of the system size. The


Chapter 1 Introduction

Computers play an important role in everyone's daily routine and are regularly required even for simple tasks. Besides theory and experiment, computer simulations are nowadays considered the third way of doing science. They are powerful tools able to predict the outcomes of experiments, especially those carried out under extreme conditions such as high temperatures or high pressures, or those whose cost is high or prohibitive [1].

Durrant and McCammon [2] quote the Nobel Prize winner Richard Feynman:

“if we were to name the most powerful assumption of all, which leads one on and on in an attempt to understand life, it is that all things are made of atoms, and that everything that living things do can be understood in terms of the jigglings and wigglings of atoms.”

Understanding and simulating the behavior of atoms is an important task in materials science, fundamental to predicting new properties and to developing new materials and new drugs. For example, although the drug industry spends around US$50 billion annually, this investment results, on average, in only 20 new drugs [3]. Thus, computer simulations have been designed to speed up drug development, aiming to save both time and money [4].

The acceleration of scientific research has led to recent advances in understanding the molecular and cellular mechanisms of pain. For example, Kasimova et al. [5] obtained insights into the modulation by lipids of voltage-gated ion channels, which are cell channels related to several pain processes. Filizola [6] simulated the behavior of some receptors and proposed new, efficient approaches to the use of opioid drugs.

Molecular Dynamics (MD) simulation is one of the most widely used techniques to study the behavior of materials by determining the motion of their atoms [7]. Classical MD solves Newton's equations of motion to determine the positions, velocities, and other properties of the atoms for systems with $\sim 10^3 - 10^6$ atoms. When electrons are explicitly taken into account, the behavior of relatively small systems ($\sim 10 - 10^2$ atoms) is described by ab initio MD [8, 9].

Simulation time is an important limitation of MD methods. Nowadays, it is possible to simulate processes lasting picoseconds to nanoseconds using classical MD with a limited number of atoms, whereas time scales of microseconds to milliseconds are not reached for larger systems ($\approx 10^6 - 10^9$ atoms) [10]. The limitation on the time scale that can be reached by MD is related to the very fast movement of atoms, with vibration periods of the order of femtoseconds. Thus, to describe the atomic motion correctly, time steps of 1 femtosecond or less are necessary for the numerical integration of Newton's equations. For this time step size, one billion integration steps are necessary to simulate a more realistic time scale of 1 microsecond [11]. Performing such a large number of integration steps is computationally intensive and can be prohibitive, even for large supercomputers.

To tackle the problem of the limited step size in MD, different strategies have been proposed. They range from the use of high-performance computing to accelerate MD simulations, to techniques that make the atoms change from one state to another more rapidly, to combinations of several methods, such as hyperdynamics and parallel replica dynamics [12]. Another strategy, proposed by Tewary in 2009 [13], uses Green's functions to solve Newton's equations of motion. This MD method, named Green's Function Molecular Dynamics (GFMD), was applied to simulate the propagation of a mechanical pulse in a one-dimensional lattice of nonlinear oscillators and in graphene from femtoseconds to microseconds, extending the time scale by up to eight orders of magnitude.

Despite the large extension of the MD time scale achieved by the GFMD technique and the promise of having solved the “timing problem in molecular modeling”¹, the technique has not been widely adopted by the scientific community. One of the reasons is the computational cost of GFMD, which is higher than that of the traditional integration techniques commonly used in MD codes, such as the Verlet or leap-frog schemes. GFMD requires the diagonalization of a matrix involving the atomic masses and the Taylor coefficients of the expansion of the interaction potential.

In this thesis, we develop a strategy to accelerate GFMD simulations by combining high performance computing techniques, such as OpenMP, GPU computing, and OpenACC. Applied and validated for a one-dimensional lattice of harmonic oscillators, the parallelized version of GFMD provided speedups of up to 30× for systems of up to 20000 atoms. This acceleration can make GFMD more competitive and allow it to be applied to the study of more realistic systems.

Chapter 2 introduces molecular dynamics simulations, the basic procedure for simulating atoms, and the major issues of this type of simulation. Chapter 3 describes how classical molecular dynamics can be accelerated using transition state theory. Chapter 4 describes GFMD and the advantages of using this method. Chapter 5 presents the high performance computing methods used to speed up GFMD. Chapter 6 describes the implementation of a parallel version of GFMD, and Chapter 7 presents the results of this implementation. Finally, Chapter 8 presents the conclusions of this thesis and Chapter 9 the future work.

¹ As stated in the press release https://www.nist.gov/news-events/news/2009/11/


Chapter 2 Molecular dynamics simulations

Molecular dynamics (MD) simulations numerically solve Newton's equations of motion to simulate the behavior of atoms [7]. These simulations can predict the behavior of materials under different conditions of temperature, pressure, and solvation, without having to perform actual experiments [14, 15, 1].

Classical MD simulations take into account only classical interactions between particles. For liquids and monoatomic gases, the classical approximation is reasonable when the de Broglie wavelength (which depends on the mass of the particle and on the temperature) is much smaller than the average distance between the particles. For molecular systems, the classical approximation can be applied when the thermal energy ($k_B T$) is less than the energy associated with the molecular vibrations [16].

The interaction potential plays a crucial role in classical MD simulations. These potentials are approximations or classical representations of quantum potentials. Because the nuclei are much heavier than the electrons (Born-Oppenheimer approximation), the motion of the nuclei can be treated independently of the motion of the electrons. Thus, the trajectory of the particles (nuclei) is determined by the energy landscape created by the electrons. Usually, the energy landscape is described by analytical expressions that depend on the interaction type. For example, the Morse potential is a good approximation for the interaction between two atoms. On the other hand, for systems with thermal energies much smaller than the depth of the minimum of the Morse potential, the harmonic potential can be used instead to describe the interaction. For example, the minimum of the interaction energy for Ne–Ne is about –0.004 eV, whereas this value for H–H is –5 eV [16]. Thus, for the thermal energy of a system at room temperature (0.026 eV), only the H–H bond is stable: no bond breaking is expected, and the harmonic potential becomes a good approximation for this interaction.

To describe the various types of interaction in a molecular system, inter- and intra-molecular interactions are incorporated through analytical expressions in the total interaction potential. The set of expressions and their parameters is known as a force field. The following potential $V$ represents a force field that describes bond stretching, angular and torsional deformations (bonded terms), and van der Waals and electrostatic interactions (nonbonded terms).
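The comparison between the Morse and harmonic descriptions can be made concrete with a minimal sketch. It assumes the standard Morse form $D_e\,(1 - e^{-a(r - r_e)})^2 - D_e$ and its second-order expansion with force constant $k = 2 D_e a^2$. The well depth of 5 eV and the thermal energy of 0.026 eV are the values quoted above, while the width parameter `a` and the equilibrium distance `r_eq` are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def morse(r, depth, a, r_eq):
    """Morse potential with its minimum at -depth for r = r_eq."""
    return depth * (1.0 - np.exp(-a * (r - r_eq))) ** 2 - depth

def harmonic(r, depth, a, r_eq):
    """Second-order Taylor expansion of the Morse potential around r_eq."""
    k = 2.0 * depth * a ** 2  # curvature of the Morse well at its minimum
    return 0.5 * k * (r - r_eq) ** 2 - depth

# Well depth (eV) and room-temperature thermal energy (eV) as quoted in the text;
# a (1/angstrom) and r_eq (angstrom) are hypothetical, chosen only for illustration.
depth, a, r_eq = 5.0, 2.0, 0.74
kT_room = 0.026

# Displacement at which the harmonic well stores one kT above the minimum.
k = 2.0 * depth * a ** 2
dr = np.sqrt(2.0 * kT_room / k)
r = r_eq + dr

# Deviation between the two potentials, measured in units of kT.
rel_err = abs(morse(r, depth, a, r_eq) - harmonic(r, depth, a, r_eq)) / kT_room
print(f"displacement = {dr:.4f} angstrom, deviation = {rel_err:.3f} kT")
```

For these parameters the two curves differ by a small fraction of $k_B T$ at thermally accessible displacements, which is the sense in which the harmonic potential is a good approximation for a deep well.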

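$$V = \sum_{\text{bond}} K_r (r - r_0)^2 + \sum_{\text{angle}} K_\theta (\theta - \theta_0)^2 \tag{2.1}$$

$$+ \sum_{\text{torsion}} V_n \left[1 - (-1)^n \cos(n\phi + \gamma_n)\right] \tag{2.2}$$

$$+ \sum_{i,j} \left\{ 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} \right\} \tag{2.3}$$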
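As an illustration of the nonbonded part of such a force field, the sketch below evaluates the Lennard-Jones and Coulomb terms of Eq. (2.3) for a small set of atoms. It is a minimal example under stated assumptions: a single pair of parameters `epsilon` and `sigma` is used for all pairs, each pair is counted once ($i < j$), and energies are in eV with distances in angstrom. The function names are hypothetical and not part of any MD package.

```python
import itertools
import math

# e^2 / (4 pi eps0) expressed in eV * angstrom
COULOMB_EV_ANGSTROM = 14.399645

def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones pair energy 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, qi, qj):
    """Electrostatic pair energy for charges given in units of e."""
    return COULOMB_EV_ANGSTROM * qi * qj / r

def nonbonded_energy(positions, charges, epsilon, sigma):
    """Sum the nonbonded terms over all unique pairs i < j (assumed convention)."""
    total = 0.0
    for i, j in itertools.combinations(range(len(positions)), 2):
        r = math.dist(positions[i], positions[j])
        total += lennard_jones(r, epsilon, sigma)
        total += coulomb(r, charges[i], charges[j])
    return total

# Two neutral atoms placed at the Lennard-Jones minimum, r = 2^(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)
energy = nonbonded_energy([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)],
                          charges=[0.0, 0.0], epsilon=0.01, sigma=1.0)
print(energy)
```

For two neutral atoms at the Lennard-Jones minimum the energy is $-\epsilon$, which gives a quick sanity check of the implementation.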

Having defined the interaction potential, the evolution of the particles (the trajectories $\mathbf{r}_i(t)$) of the system is obtained by solving the equations of motion for each particle $i$ of mass $m_i$:

$$m_i \frac{\partial^2 r_{i,\alpha}}{\partial t^2} = -\frac{\partial V}{\partial r_{i,\alpha}}, \quad \text{with } \alpha = x, y, z \tag{2.4}$$

Several numerical algorithms can be used to integrate the equations of motion. One of the most widely used is the Verlet integrator. Its basic idea is to use a Taylor expansion of the positions of the system [17] at $t \pm \Delta t$ around $\mathbf{r}_i(t)$, where $\Delta t$ is the time step size:

$$\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \Delta t\, \dot{\mathbf{r}}_i(t) + \frac{\Delta t^2}{2!} \ddot{\mathbf{r}}_i(t) + \frac{\Delta t^3}{3!} \dddot{\mathbf{r}}_i(t) + O(\Delta t^4) \tag{2.5}$$

$$\mathbf{r}_i(t - \Delta t) = \mathbf{r}_i(t) - \Delta t\, \dot{\mathbf{r}}_i(t) + \frac{\Delta t^2}{2!} \ddot{\mathbf{r}}_i(t) - \frac{\Delta t^3}{3!} \dddot{\mathbf{r}}_i(t) + O(\Delta t^4) \tag{2.6}$$

Adding Eqs. (2.5) and (2.6) we have

$$\mathbf{r}_i(t + \Delta t) = 2\mathbf{r}_i(t) - \mathbf{r}_i(t - \Delta t) + \frac{\Delta t^2}{m_i} \mathbf{F}_i(t) + O(\Delta t^4) \tag{2.7}$$

where $\mathbf{F}_i$ is the force on atom $i$.
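The recurrence of Eq. (2.7) can be sketched in a few lines. The example below is a minimal illustration, not the code used in this thesis: the function name `verlet` and its arguments are hypothetical, and the first step, which needs a position at $t = \Delta t$, is bootstrapped from Eq. (2.5) truncated at second order.

```python
import numpy as np

def verlet(force, r0, v0, mass, dt, n_steps):
    """Integrate positions with the Verlet recurrence of Eq. (2.7).

    force: callable returning the force for a given position array.
    Returns an array with n_steps + 1 positions (t = 0, dt, ..., n_steps * dt).
    """
    r_prev = np.asarray(r0, dtype=float)
    v0 = np.asarray(v0, dtype=float)
    # Bootstrap r(dt) from the Taylor expansion, Eq. (2.5), up to second order.
    r_curr = r_prev + dt * v0 + 0.5 * dt ** 2 * force(r_prev) / mass
    trajectory = [r_prev, r_curr]
    for _ in range(n_steps - 1):
        # Eq. (2.7): r(t + dt) = 2 r(t) - r(t - dt) + dt^2 F(t) / m
        r_next = 2.0 * r_curr - r_prev + dt ** 2 * force(r_curr) / mass
        r_prev, r_curr = r_curr, r_next
        trajectory.append(r_curr)
    return np.array(trajectory)

# Harmonic oscillator with unit mass and unit force constant: x(t) = cos(t).
trajectory = verlet(lambda r: -r, r0=[1.0], v0=[0.0], mass=1.0, dt=0.01, n_steps=1000)
print(trajectory[-1, 0], np.cos(10.0))
```

The velocities do not appear in the recurrence itself. When needed, they are commonly estimated from the positions by a central difference.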

The main steps of a basic MD simulation are illustrated in Fig. 2.1. In order to resolve the motion of the particles, time step sizes of ∼ 1 fs are necessary. Such a small time step requires a large number of integration steps to reach realistic times of the order of µs to ms. Although the increase in computational power (memory and floating-point operations per second) from the petascale to the exascale (Fig. 2.2) may allow the study of larger systems ($\sim 10^{15}$ atoms) for longer times (ks), current parallel MD approaches based on high performance computing techniques such as domain decomposition are still not able to simulate µs to ms time scales, even for relatively small systems ($\sim 10^3$).

There are some techniques to tackle the MD time limitations. Accelerated molecular dynamics [18, 19] provides ways to perform larger time steps or even time “jumps”. In this case, the system is stimulated to find a path to escape from a minimum of the energy landscape more quickly than in a traditional MD simulation [19].


Figure 2.1: Three main steps of a molecular dynamics simulation: (1) energy calculation, (2) force calculation, and (3) numerical integration to determine velocities and positions. Adapted from [20].

Figure 2.2: Predictions of the system size and time scale of an MD simulation for computational power at the petascale (dashed line) and exascale (dashed-dotted line). Blue regions indicate the system sizes and time scales that could be reached using current parallel MD approaches. Adapted from [21].


Chapter 3 Accelerated Molecular Dynamics Simulations

The dynamical evolution of infrequent-event systems is characterized by vibrational excursions within an energy landscape. Transition events between energy minima are infrequent and can occur after many vibrational periods. For example, an adatom on a metal surface, at a thermal energy lower than the energy barrier for a diffusive jump, can take many vibrational periods before moving to another surface position.

There are several methods to tackle the time limitations of molecular dynamics simulations. In order to reach longer times in MD simulations, techniques based on the Transition State Theory (TST) have been proposed [2, 22]. These techniques are based on the physical behavior of the systems and require changes in the usual MD algorithm shown in the previous chapter. Changes in the potential function and in the temperature of the system are commonly employed. In this chapter, we briefly summarize the commonly used techniques based on the TST.

3.1 Parallel Replica

The parallel replica technique (Fig. ??) is based on running multiple constant-temperature MD simulations in parallel, starting from atomic configurations with different momenta [23], in order to explore the energy landscape faster than a single simulation would. It is used to find configurations that are able to change the current state and to escape metastable regions present in the energy landscape. In the parallel replica technique, an initial simulation is used “to explore” the system. If this simulation lasts long enough (how long is a parameter of the method), other simulations are started to reach other minimum states.

3.2 Temperature-accelerated Dynamics

Another way to extend the simulated time in MD is to artificially increase the temperature of the simulated system, which speeds up the transitions and helps find states with a higher probability of changing [24]. Temperature-accelerated dynamics has made possible, for instance, the simulation of proteins and other structures, such as the gp120 molecule from HIV-1 [25].

3.3 Hyperdynamics

This technique consists of modifying the original potential by adding a non-negative bias potential. A trajectory on the new, biased potential surface can escape from the energy minima more rapidly, which speeds up the MD simulation (Fig. 3.1) [26]. Fichthorn and Mubin present the advantages of hyperdynamics for various MD problems [27]. Extensions of hyperdynamics have also been proposed. For example, Hirai [28] presents an adaptive technique that applies hyperdynamics to more complex potentials. Miron and Fichthorn [29] present a technique derived from hyperdynamics, called the bond-boost method, which was applied to study copper thin films [30].

Figure 3.1: Illustration of the hyperdynamics technique. The original potential (black curve) is modified to a new one (blue curve) to decrease the energy barriers around the potential minima.
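The effect of a non-negative bias on an energy barrier can be sketched with a one-dimensional double well. The bias below, $\Delta V = (E - V)^2 / (\alpha + E - V)$ for $V < E$ and zero otherwise, is one functional form used in the accelerated MD literature. It is adopted here only as an assumption for illustration and is not necessarily the bias used in the works cited above. The double-well potential and the values of the threshold `E` and of `alpha` are likewise hypothetical.

```python
import numpy as np

def double_well(x):
    """Model potential with minima at x = -1, +1 and a barrier of height 1 at x = 0."""
    return (x ** 2 - 1.0) ** 2

def bias(v, threshold, alpha):
    """Non-negative bias, active only where the potential lies below the threshold."""
    gap = np.maximum(threshold - v, 0.0)
    return gap ** 2 / (alpha + gap)

def biased_well(x, threshold=1.0, alpha=0.5):
    """Original potential plus the bias (assumed parameters)."""
    v = double_well(x)
    return v + bias(v, threshold, alpha)

barrier_original = double_well(0.0) - double_well(1.0)
barrier_biased = biased_well(0.0) - biased_well(1.0)
print(barrier_original, barrier_biased)
```

With the threshold placed at the top of the barrier, the bias vanishes there and raises the minima, so the barrier seen by the trajectory is lowered while the positions of the minima and of the barrier top are preserved.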

The MD occurs through a series of events that change the system from one state to another. In this way, other techniques that simulate such events have also been used to accelerate the dynamics. Among these techniques are conformational flooding, replica exchange, self-guided MD, and umbrella sampling [31, 32]. For example, umbrella sampling adds a compensation function, known as the umbrella, to the original potential energy function to induce the sampling of a set of conformations, which requires prior knowledge of the conformations of interest [33].

With the advent of several Accelerated Molecular Dynamics (AMD) methods, and their inherent need for a proper choice of parameters that correctly describe the system and accelerate the dynamics, Smith, Okur, and Brooks [34] present a mechanism to find these parameters without having to run several simulations, which can lead to further acceleration. The Molecular Dynamics Meta-Simulator runs parallel replicas at different temperatures, trying to reach different simulation states.

All the methods presented in this chapter to accelerate MD and to extend the simulation time are based on physical processes (e.g. heating). Another way to reach longer simulation times is by modifying the integration methods of the equations of motion, which is the theme of the next chapter.


Chapter 4 Green's Function Molecular Dynamics Simulations

Small time steps (∼ 1 fs) are required by the integration algorithms used to solve Newton's equations of motion in molecular dynamics simulations. Therefore, integration algorithms that allow larger time step sizes can be used to accelerate MD simulations and to reach realistic time scales. GFMD [13], proposed in 2009 by Vinod Tewary of NIST (USA), is one of these algorithms. The technique uses Green's functions to obtain expressions for the atomic positions and velocities.

We consider $N$ interacting atoms identified by the index $i$ ($i = 1 \ldots N$). The coordinates of atom $i$ at the initial time (i.e. $t = 0$) are specified by the vector $\vec{r}_{0i}$ and its velocity by the vector $\vec{V}_{0i}$. Thus, at a given instant $t$, the position of atom $i$ can be written as

$$\vec{r}_i(t) = \vec{r}_{0i} + \vec{u}_i(t), \tag{4.1}$$

where $\vec{u}_i(t)$ is the displacement of atom $i$.

The evolution of the position of an atom is obtained from Newton's equation

$$m_i \frac{\partial^2 \vec{r}_i}{\partial t^2} = \vec{F}_i = -\nabla_i V, \tag{4.2}$$

where $m_i$ is the mass of atom $i$ and $\vec{F}_i$ is the resultant force acting on atom $i$. If we express the force in terms of the interaction potential $V$, Eq. (4.2) can be written as

$$m_i \frac{\partial^2 r_{i,\alpha}}{\partial t^2} = -\frac{\partial V}{\partial r_{i,\alpha}}, \quad \text{with } \alpha = x, y, z. \tag{4.3}$$

Since $\vec{r}_i$ is written in terms of the displacements, it is possible to rewrite Eq. (4.3) as

$$m_i \frac{\partial^2 u_{i,\alpha}}{\partial t^2} = -\frac{\partial V}{\partial u_{i,\alpha}}. \tag{4.4}$$


The total potential energy $V = V(\{|\vec{r}_i|\})$ is a function of the positions of all atoms of the system. Expanding the potential energy in a Taylor series in powers of $|\vec{u}_i(t)|$:

$$V = V(\{|\vec{r}_i|\}) = V(\{|\vec{r}_{0i}|\}) + \sum_{i=1}^{N} \sum_{\alpha=x,y,z} \frac{\partial V}{\partial u_{i,\alpha}}(\{|\vec{r}_{0i}|\})\, u_{i,\alpha} + \frac{1}{2} \sum_{i,j=1}^{N} \sum_{\alpha,\gamma=x,y,z} \frac{\partial^2 V}{\partial u_{i,\alpha} \partial u_{j,\gamma}}(\{|\vec{r}_{0i}|\})\, u_{i,\alpha} u_{j,\gamma} + O(3), \tag{4.5}$$

where $O(3)$ denotes the third and higher order terms in the expansion. Taking the gradient of Eq. (4.5), we obtain

$$-\frac{\partial V}{\partial u_{i,\alpha}} = -\frac{\partial V}{\partial u_{i,\alpha}}(\{|\vec{r}_{0i}|\}) - \sum_{j=1}^{N} \sum_{\gamma=x,y,z} \frac{\partial^2 V}{\partial u_{i,\alpha} \partial u_{j,\gamma}}(\{|\vec{r}_{0i}|\})\, u_{j,\gamma} - O'_{i,\alpha}(3), \tag{4.6}$$

where $O'_{i,\alpha}(3)$ are the third and higher order terms of the expansion of the gradient. The first term on the right side of Eq. (4.6) is the force component when no deformation is present. This term is zero when the atom is at an equilibrium position. Using Eq. (4.6), we can rewrite Eq. (4.4) as

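$$m_i \frac{\partial^2 u_{i,\alpha}}{\partial t^2} = F_{i,\alpha} - \sum_{j=1}^{N} \sum_{\gamma=x,y,z} \frac{\partial^2 V}{\partial u_{i,\alpha} \partial u_{j,\gamma}}(\{|\vec{r}_{0i}|\})\, u_{j,\gamma}(t) + \Delta F_{i,\alpha}(t), \tag{4.7}$$

where $F_{i,\alpha} = -\frac{\partial V}{\partial u_{i,\alpha}}(\{|\vec{r}_{0i}|\})$ is the force component in the absence of deformation and $\Delta F_{i,\alpha} = -O'_{i,\alpha}(3)$ are the third and higher order, non-harmonic terms associated with the potential. Defining $w_{i,\alpha}(t) = \sqrt{m_i}\, u_{i,\alpha}(t)$, Eq. (4.7) can be written as

$$\frac{\partial^2 w_{i,\alpha}}{\partial t^2} = \frac{F_{i,\alpha}}{\sqrt{m_i}} - \sum_{j=1}^{N} \sum_{\gamma=x,y,z} \frac{1}{\sqrt{m_i}} \frac{\partial^2 V}{\partial u_{i,\alpha} \partial u_{j,\gamma}}(\{|\vec{r}_{0i}|\}) \frac{1}{\sqrt{m_j}}\, w_{j,\gamma}(t) + \frac{\Delta F_{i,\alpha}(t)}{\sqrt{m_i}}. \tag{4.8}$$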

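Eq. (4.8) represents a set of equations that can be written in matrix form, where:

1. The term on the left side of the equation is written in matrix notation as

$$\ddot{w}_{i,\alpha} = \frac{\partial^2 w_{i,\alpha}}{\partial t^2} \longrightarrow \mathbf{I} \frac{\partial^2}{\partial t^2} \mathbf{W}, \tag{4.9}$$

where $\mathbf{I}$ is the identity matrix (at most of size $dN \times dN$, with $d$ being the dimensionality of the problem) and $\mathbf{W}$ is a column vector (maximum size $dN$) containing all the deformations $w_{i,\alpha}(t)$;

2. The first term on the right side of the equation can be written as

$$\frac{F_{i,\alpha}}{\sqrt{m_i}} \longrightarrow \mathbf{F}, \tag{4.10}$$

where $\mathbf{F}$ is a column vector with maximum size $dN$;

3. The last term on the right side of the equation is

$$\frac{\Delta F_{i,\alpha}(t)}{\sqrt{m_i}} \longrightarrow \Delta\mathbf{F}(t), \tag{4.11}$$

which is also a column vector with maximum size $dN$;

4. Finally, the second term on the right can be written as

$$-\sum_{j=1}^{N} \sum_{\gamma=x,y,z} \frac{1}{\sqrt{m_i}} \frac{\partial^2 V}{\partial u_{i,\alpha} \partial u_{j,\gamma}}(\{|\vec{r}_{0i}|\}) \frac{1}{\sqrt{m_j}}\, w_{j,\gamma}(t) \longrightarrow -\mathbf{D}\mathbf{W}, \tag{4.12}$$

where $\mathbf{D}$ has a maximum size of $dN \times dN$. The $\mathbf{D}$ matrix represents the dynamical matrix of the system [35].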
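The mass weighting that defines $\mathbf{D}$ in Eq. (4.12) can be sketched as follows. This is a minimal example under stated assumptions: the system is a one-dimensional chain of atoms connected by identical springs of constant `k` with both ends attached to fixed walls, and the function names are hypothetical. The masses are deliberately non-uniform to show that $\mathbf{D}$ remains symmetric.

```python
import numpy as np

def force_constants_chain(n_atoms, k):
    """Matrix of second derivatives of V for a fixed-end harmonic chain (assumed model)."""
    return (2.0 * k * np.eye(n_atoms)
            - k * np.eye(n_atoms, k=1)
            - k * np.eye(n_atoms, k=-1))

def dynamical_matrix(phi, masses):
    """D_ij = phi_ij / sqrt(m_i m_j), the mass-weighted matrix of Eq. (4.12)."""
    inv_sqrt_m = 1.0 / np.sqrt(np.asarray(masses, dtype=float))
    return inv_sqrt_m[:, None] * phi * inv_sqrt_m[None, :]

phi = force_constants_chain(4, k=1.0)
D = dynamical_matrix(phi, masses=[1.0, 4.0, 1.0, 4.0])
print(D)
print(np.linalg.eigvalsh(D))  # eigenvalues E_i^2, all positive for this chain
```

Because the weighting is applied symmetrically on both sides, $\mathbf{D}$ inherits the symmetry of the second-derivative matrix, which is what allows the orthogonal diagonalization used later in this chapter.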

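Using these matrix notations, the matrix representation of Eq. (4.8) is

$$\left[\mathbf{I} \frac{\partial^2}{\partial t^2} + \mathbf{D}\right] \mathbf{W} = \mathbf{F}_{\mathrm{ef}} = \mathbf{F} + \Delta\mathbf{F}(t), \tag{4.13}$$

which represents a set of non-homogeneous coupled differential equations. The formal solution of Eq. (4.13) can be expressed as

$$\mathbf{W} = \left[\mathbf{I} \frac{\partial^2}{\partial t^2} + \mathbf{D}\right]^{-1} \mathbf{F}_{\mathrm{ef}}. \tag{4.14}$$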

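The Green's function method is a typical approach to solve non-homogeneous differential equations subject to initial or boundary conditions. We consider $\mathbf{G}(t - t')$ as the causal Green's function, which is zero for $t - t' < 0$ and satisfies

$$\left[\mathbf{I} \frac{\partial^2}{\partial t^2} + \mathbf{D}\right] \mathbf{G}(t - t') = \mathbf{I}\, \delta(t - t'). \tag{4.15}$$

To solve the differential equation (4.15), we can first convert it to an algebraic equation, whose solution is generally simpler to obtain than that of the original differential equation. One way to do that is by using Laplace transforms. Applying the Laplace transform to Eq. (4.15) with $t' = 0$, we obtain

$$\left[s^2 \mathbf{I} + \mathbf{D}\right] \mathcal{L}[\mathbf{G}] - s\, \mathbf{G}(0) - \mathbf{G}'(0) = \mathbf{I}, \tag{4.16}$$

where $s$ is the Laplace variable conjugate to $t$. If we consider the boundary conditions $\mathbf{G}(0) = \mathbf{G}'(0) = 0$, Eq. (4.16) becomes $\left[s^2 \mathbf{I} + \mathbf{D}\right] \mathcal{L}[\mathbf{G}] = \mathbf{I}$, and, therefore,

$$\mathcal{L}[\mathbf{G}] = \frac{\mathbf{I}}{s^2 \mathbf{I} + \mathbf{D}} = \left[s^2 \mathbf{I} + \mathbf{D}\right]^{-1}. \tag{4.17}$$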

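The Laplace transform of Eq. (4.13) is

$$\left[s^2 \mathbf{I} + \mathbf{D}\right] \mathcal{L}[\mathbf{W}] - s\, \mathbf{W}(0) - \mathbf{W}'(0) = \mathbf{I}\, \mathcal{L}[\mathbf{F}_{\mathrm{ef}}], \tag{4.18}$$

or

$$\mathcal{L}[\mathbf{W}] = \mathcal{L}[\mathbf{G}]\, \mathcal{L}[\mathbf{F}_{\mathrm{ef}}] + s\, \mathcal{L}[\mathbf{G}]\, \mathbf{W}(0) + \mathcal{L}[\mathbf{G}]\, \mathbf{W}'(0). \tag{4.19}$$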

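Because $\mathbf{D}$ is real and symmetric, we can write $\mathbf{V}^T \mathbf{D} \mathbf{V} = \mathbf{E}^2$, where $\mathbf{V}$ is the matrix with the eigenvectors of $\mathbf{D}$ and $\mathbf{E}^2$ is the diagonal matrix with the eigenvalues $E_i^2$ of $\mathbf{D}$. Multiplying Eq. (4.19) by $\mathbf{V}^T$ and using $\mathbf{I} = \mathbf{V}\mathbf{V}^T$, Eq. (4.19) is written as

$$\mathbf{W}^{L,*} = \mathbf{K}\, \mathbf{F}^{L,*}_{\mathrm{ef}} + s\, \mathbf{K}\, \mathbf{W}^{L,*}(0) + \mathbf{K}\, \mathbf{W}^{\prime L,*}(0), \tag{4.20}$$

where $\mathbf{K} = \mathbf{V}^T \mathcal{L}[\mathbf{G}] \mathbf{V}$, $\mathbf{W}^{L,*} = \mathbf{V}^T \mathcal{L}[\mathbf{W}]$, $\mathbf{F}^{L,*}_{\mathrm{ef}} = \mathbf{V}^T \mathcal{L}[\mathbf{F}_{\mathrm{ef}}]$, $\mathbf{W}^{L,*}(0) = \mathbf{V}^T \mathbf{W}(0)$, and $\mathbf{W}^{\prime L,*}(0) = \mathbf{V}^T \mathbf{W}'(0)$, with $L$ standing for Laplace transform. The superscript $*$ indicates that the vector is projected onto the space of the eigenvectors of $\mathbf{D}$.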

The inverse Laplace transform of Eq. (4.20) is L−1_WL,∗_{= L}−1

[K FL,∗eff ] + L −1

[sK WL,∗(0)] + L−1_{[K W}0L,∗(0)], where K is a diagonal matrix with elements Ki = (s2+ Ei2)

−1

.

The term of the left hand side of Eq. (4.21) is evaluated as follows: L−1[W_iL,∗] = W_i∗(t);

The terms of the right hand side of Eq. (4.21) are evaluated as follows: 1. L−1[KiL [Fi]] = L−1 1 s2 _{+ E}2 i F_i∗(s) = L−1 1 s2_{+ E}2 i ∗ L−1[F_i∗(s)] where (∗) means the convolution of two functions and it can be written as :

L−1[KiL [Fi]] = 1 Ei sin (Eit) ∗ L−1[Fi∗(s)] | {z } F_i∗(t) = 1 Ei Z t 0 F_i∗() sin (Ei)d.

(22)

Here we consider the Laplace transform: F_i∗(s) =

Z ∞

0

F_i∗(t)e−stdt,

for the case which F_i∗(s) = F

∗ i s and F_i∗(t) = F ∗ i if t ≥ 0, 0 if t < 0,

This case corresponds to the approximation $\Delta F(t) = 0$, which means that the higher-order terms in the potential expansion are neglected (harmonic approximation). Consequently,

$$\mathcal{L}^{-1}[K_i\,\mathcal{L}[F_i]] = \frac{1}{E_i}\int_0^t F_i^*\,\sin(E_i \tau)\,d\tau = \left. -\frac{F_i^*}{E_i^2}\cos(E_i \tau)\right|_0^t = -\frac{F_i^*}{E_i^2}\,[\cos(E_i t) - H(t)],$$

where $H(t)$ is the Heaviside function.

2. $\mathcal{L}^{-1}[K_i\,{W'}^*_i(0)] = \mathcal{L}^{-1}\left[\frac{{W'}^*_i(0)}{s^2 + E_i^2}\right] = \frac{{W'}^*_i(0)}{E_i}\sin(E_i t)$;

3. $\mathcal{L}^{-1}[s\,K_i\,W_i^*(0)] = \mathcal{L}^{-1}\left[\frac{s}{s^2 + E_i^2}\,W_i^*(0)\right] = W_i^*(0)\cos(E_i t)$.

Therefore, the components of $W^*$ are

$$W_i^*(t) = -\frac{F_i^*}{E_i^2}\,[\cos(E_i t) - H(t)] + \frac{{W'}^*_i(0)}{E_i}\sin(E_i t) + W_i^*(0)\cos(E_i t), \tag{4.21}$$

where $F_i^*$ is a constant, $E_i^2$ are the eigenvalues of the matrix $D$, and $D$ is associated with the second derivative of the potential energy. The projected velocities follow from ${W'}^*_i(t) = dW_i^*(t)/dt$, or

$${W'}^*_i(t) = \frac{F_i^*}{E_i}\sin(E_i t) + {W'}^*_i(0)\cos(E_i t) - W_i^*(0)\,E_i\sin(E_i t). \tag{4.22}$$

Finally, the atomic positions and velocities are obtained by $W(t) = V\,W^*(t)$ and $W'(t) = V\,{W'}^*(t)$, respectively. $W(t)$ and $W'(t)$ provide the exact solutions for the atomic positions and velocities at all times $t$ within the harmonic approximation ($\Delta F(t) = 0$). Fig. 4.1 illustrates the application of GFMD to determine the trajectory of a particle.

[Figure 4.1: panels (a), (b), and (c), each plotting x versus time; labels: dt, t1, t2, t3, “Exact path”.]

Figure 4.1: Illustration of the GFMD. (a) The exact (usually unknown) trajectory of a particle. (b) The trajectory is determined for some specific discrete times in traditional integration algorithms. (c) The trajectory is determined analytically by Eq. (4.21) using GFMD.

For non-harmonic cases, the atomic positions and velocities are then obtained by $W(t) = V(t)\,W^*(t)$ and $W'(t) = V(t)\,{W'}^*(t)$, respectively, where Eqs. (4.21) and (4.22) are used in steps $\Delta t$ small enough to keep $\Delta F(t)$ negligible, in a process similar to Fig. 4.1 (b).

Algorithm 1 shows the basic steps of a molecular dynamics code based on the GFMD technique. The construction and diagonalization of D in each step of the integration loop correspond to the general, non-harmonic case. For the harmonic case, corresponding to phonon and thermal problems where this approximation is valid, D is built and diagonalized only once.


Algorithm 1: GFMD pseudo-code

Read system information (geometry, initial conditions $W(0)$, $W'(0)$, and atomic masses);
Read simulation information (time step size $\Delta t$, number of integration steps $N_{steps}$, etc.);
Initialize variables (e.g., $t \leftarrow 0$);
Integration loop:
while step < $N_{steps}$ do
    1 Calculate the force constants $D(t)$ and the 1st Taylor's coefficients $F(t)$ for the current atomic configuration;
    2 Diagonalize $D(t)$ to obtain $E_i^2$ and $V$;
    3 Calculate the projected displacements $W^*(t) \leftarrow V^T W(t)$;
    4 Calculate the projected velocities ${W'}^*(t) \leftarrow V^T W'(t)$;
    5 Calculate the projected 1st Taylor's coefficients $F^*(t) \leftarrow V^T F(t)$;
    6 Update projected displacements $W^*(t + \Delta t)$;
    7 Update projected velocities ${W'}^*(t + \Delta t)$;
    8 Update displacements $W(t + \Delta t) \leftarrow V\,W^*(t + \Delta t)$;
    9 Update velocities $W'(t + \Delta t) \leftarrow V\,{W'}^*(t + \Delta t)$;
    Update atomic positions from the displacements $W(t + \Delta t)$;
    Reset the displacements $W(t + \Delta t) \leftarrow 0$;
    Update time $t \leftarrow t + \Delta t$;
    Calculate properties (kinetic and potential energies, temperature, etc.) using the updated atomic configuration ($W(t + \Delta t)$, $W'(t + \Delta t)$);
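The projection steps (3 to 5) and the back-projection steps (8 and 9) of Algorithm 1 are matrix-vector products with $V^T$ and $V$. The sketch below shows them as plain C loops, assuming a row-major $n \times n$ array with the eigenvectors stored in its columns; the function names `gfmd_project` and `gfmd_unproject` are hypothetical and are not taken from the thesis code.

```c
#include <assert.h>
#include <stddef.h>

/* out = V^T x (steps 3-5). V is n x n, row-major, eigenvectors in columns. */
void gfmd_project(size_t n, const double *V, const double *x, double *out)
{
    assert(n > 0 && V && x && out);
    for (size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (size_t k = 0; k < n; ++k)
            acc += V[k * n + i] * x[k];   /* (V^T)_{ik} = V_{ki} */
        out[i] = acc;
    }
}

/* out = V xs (steps 8-9): back from the eigenvector space. */
void gfmd_unproject(size_t n, const double *V, const double *xs, double *out)
{
    assert(n > 0 && V && xs && out);
    for (size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (size_t k = 0; k < n; ++k)
            acc += V[i * n + k] * xs[k];
        out[i] = acc;
    }
}
```

Because $V$ is orthogonal, projecting a vector and then back-projecting it should return the original vector up to rounding, which is a convenient test of the storage convention before the mode updates of steps 6 and 7 are inserted between the two calls.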


Chapter 5 Shared memory and many-core architectures

The simulation of complex and large atomic and molecular systems requires the use of all the computational resources available. Supercomputers with large numbers of processing units and large amounts of memory have also been used to tackle the MD simulation time issue [36]. Many High Performance Computing (HPC) techniques can be used to exploit the computational power provided by supercomputers. These HPC techniques are divided into four main groups: shared memory, distributed memory, hybrid systems, and heterogeneous systems. Several libraries and programming models can be used to access these techniques, such as OpenMP, Message Passing Interface (MPI), OpenACC, and MKL. In general, two approaches are used to obtain high computational power: (i) to increase the processor clock, memory capacity, etc.; or (ii) to divide the computational task among several computers (or several cores). The first approach requires little effort from programmers, because program instructions run faster even with no changes in the computer codes. However, several physical and technical limits are too great to be overcome in a timely manner in order to reach higher clock frequencies. For this reason, the second approach can be faster and cheaper, even though it requires a greater effort from the programmers to analyze the possibilities of parallelism, design and implement the code, test its correctness, and find the best performance settings [37].

5.1 Shared memory techniques

Shared memory programming consists of using techniques in which multiple cores or processors share the same memory (Fig. 5.1) [38]. The communication between the cores is done through the memory of the computational node. When one of the processors needs data that is in memory, it can access it directly.


Figure 5.1: Shared memory distributed processing.

This technique can be used with several libraries and frameworks. Two of them are the Portable Operating System Interface (POSIX) threads and Open Multi-Processing (OpenMP) [39]. POSIX is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. POSIX threads (different lines of execution within the same computer process) consist of an Application Programming Interface (API) for lightweight processes following the POSIX standard [40]. Using a POSIX thread consists of declaring a variable that represents it, writing a function that will be executed by that thread, and finally calling the function pthread_create to start it [41].
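The three steps just described (a thread variable, a thread function, and a call to pthread_create) can be sketched as follows. The work item `range_task` and the functions `sum_range` and `sum_in_thread` are hypothetical names chosen for illustration; only `pthread_create` and `pthread_join` belong to the POSIX API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: sum the integers in [first, last). */
typedef struct {
    long first, last;
    long sum;
} range_task;

/* Function executed by the thread; this signature is required by pthread_create. */
static void *sum_range(void *arg)
{
    range_task *task = arg;
    task->sum = 0;
    for (long i = task->first; i < task->last; ++i)
        task->sum += i;
    return NULL;
}

/* Runs sum_range in a new thread, waits for it to finish, and returns the result. */
long sum_in_thread(long first, long last)
{
    range_task task = { first, last, 0 };
    pthread_t thread;                        /* variable that represents the thread */
    int rc = pthread_create(&thread, NULL, sum_range, &task);   /* start the thread */
    assert(rc == 0);
    rc = pthread_join(thread, NULL);         /* wait until the thread has finished */
    assert(rc == 0);
    return task.sum;
}
```

Even in this small case the programmer has to pack the arguments into a structure, keep that structure alive until the thread ends, and join the thread explicitly. With several threads writing to shared data, a mutex would also be required.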

For POSIX threads to be usable in practice, other resources are needed, such as mutexes. All this complexity must be managed by the programmer. To make this task less arduous and to increase productivity, there are libraries that help in the process of parallelizing a code [38]. One of the libraries that make the development of parallel software easier is OpenMP. It consists of an API that facilitates shared memory programming [38]. Several compilers support the directives that OpenMP provides, such as the GNU Compiler Collection (GCC) and IBM XL, among others. The implementation of parallel codes with OpenMP basically consists of marking the regions that must be parallelized. An example is given in Figure 5.2.

The pragma omp parallel for directive in the source code of Figure 5.2 is enough to parallelize the loop that fills the array. The number of threads to be used is defined by the function omp_set_num_threads(n), where n is the number of threads. Several other directives and functions are provided by OpenMP [42].

With all these functionalities, OpenMP seems to be an optimal solution for most cases where it is necessary to increase the computational use and to reduce the processing time. However, it is necessary to plan and analyze the algorithms to be implemented.
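One case where such planning matters is a loop that accumulates into a single variable: marking it with a plain parallel for would let several threads update the same sum at once. The sketch below, with the hypothetical function name `sum_of_squares`, uses the OpenMP `reduction` clause for this situation. It assumes the code is compiled with OpenMP enabled; without it, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the squares of v[0..n-1], e.g. a quantity proportional to a kinetic energy. */
double sum_of_squares(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    long i;   /* signed loop index, accepted by all OpenMP versions */

    /* Each thread accumulates a private partial sum; OpenMP combines them at the end. */
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; ++i)
        total += v[i] * v[i];

    return total;
}
```

The order in which the partial sums are combined is not fixed, so for general floating-point data the last digits of the result may vary between runs with different thread counts.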


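#include <stdlib.h>

int i = 0, natom = 50000;
float *atoms = malloc(natom * sizeof *atoms);   /* allocate the array before it is filled */

/* with default(none), every variable used in the loop must be listed explicitly */
#pragma omp parallel for default(none) shared(atoms, natom)
for (i = 0; i < natom; ++i)
{
    atoms[i] = (float)i * (float)i;   /* product in float avoids int overflow for large i */
}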

Figure 5.2: Parallel code using OpenMP.

At first glance, OpenMP solves any parallelism problem with a single directive. In well-designed applications, the insertion of OpenMP tends to be both successful and simple [38]. However, because the complexity of parallelism is now delegated to the library, one can expect it to spend some cycles deciding what is to be processed and where. Given this fact, a small variation in performance is to be expected, depending on the type of problem and the number of threads used to solve it. In any case, the OpenMP functionalities have great value within the closed context of a single computer (possibly with multiple cores).

For scenarios where it is not possible to add more processing cores to a computer, the OpenMP solution is limited. To overcome this limitation, there are other programming models that distribute the computational load, in this case among several computers. When their settings are consistent with the environment used and the problem in focus, these programming models decrease the total execution time [43]. Therefore, high-performance computing techniques depend fully on the application domain and, if they are applied erroneously, they can increase the execution time.

Physical, technical and financial limitations, among others, prevent information from being processed in only one computer. To overcome such limitations, it is possible to interconnect the computers so that the load is distributed (Fig. 5.3).


In the distributed model, computers do not exchange data through the memory system, but through a communication-specific bus. To make the communication possible, a protocol is needed to define the format of the messages. In addition to the protocol, a communication technology is necessary. Thus, the scenario of POSIX threads versus OpenMP repeats itself; in this case, however, the competition is between sockets and MPI.

MPI consists of a standardized communication protocol specific to distributed processing, with several implementations [44]. Using MPI, it is possible for multiple computers to exchange messages. Among the various implementations are Open MPI and MVAPICH. All implementations follow a protocol that establishes the form of the messages. Each message carries various items relevant to the task, such as which process is sending the message and what kind of data will be processed.

All this information establishes that each node that receives the messages will know how to proceed. However, computers that are interconnected may still contain multiple processors (or multiple cores). In this new scenario, a different approach is required to fully use the computational capacity. In a scenario where there are several computers and where each computer has several processors or cores, special care is needed to distribute the tasks to each node (Fig. 5.4).

Figure 5.4: Shared and distributed memory. Adapted from [44].

Figure 5.4 shows a hybrid scenario comprising a set of computers (nodes) connected by a network, in which each node has several processors. In this way, each node can receive tasks and segment them so that each processor receives a part of the processing load [44]. In this hybrid model, it is necessary to exploit each node to the maximum and also to divide the task between different machines. MPI provides the communication between processes (which will be on different machines) and OpenMP provides the means for several processing cores to receive parts of the task. This approach is pertinent because the number of cores per machine is increasing and there are still several limitations to building a computer with a large number of cores. The hybrid approach is most commonly applied in molecular dynamics simulation packages such as LAMMPS [45], and in quantum mechanics software such as Quantum ESPRESSO [46] and VASP [47].

There are several ways to apply these computing techniques in MD simulations. In the decomposition per atom, atoms and their unconnected counterparts are distributed among processors. All the atomic coordinates are distributed before calculating the interactions. After calculating the forces, the new positions are also distributed (broadcast). This technique is characterized by heavy communication between processes and threads. In such a scenario, the use of many-core architectures or hybrid systems can be of great value because of the number of processors available. Because communication between processors is based on buses and not on networks, the communication time scales as O(N). In the decomposition by force, the matrix that represents all the combinations of pairs of atoms is divided into blocks. In this way, the broadcast of the previous scheme is replaced by more selective communication. However, the generation of such blocks has a computational cost. In this step, heterogeneous systems have more advantages because they use two types of architecture, many-core and multi-core. Thus, the matrix separation calculations and the simulation calculation can be done on the most convenient architecture, with a communication time of O(N/√p), where p is the number of processors. In the spatial decomposition, the geometric region of the simulation is divided and each part is processed in a node. In certain cases, the migration between the regions is necessary, which requires communication between the nodes. Therefore, in heterogeneous systems this approach has a computational gain due to the diversity of processor types, with a communication time of O(N/p).
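For the decomposition per atom, each processor needs to know which atoms it owns. One common way to do this, shown here only as an illustrative sketch, is a contiguous block partition of the atom indices. The names `atom_range` and `atoms_for_rank` are hypothetical, and the scheme is an assumption rather than the one used by any package cited above.

```c
#include <assert.h>

/* Half-open range [begin, end) of atom indices owned by one processor. */
typedef struct {
    long begin, end;
} atom_range;

/*
 * Contiguous block partition of natom atoms among nproc processors.
 * The first (natom % nproc) ranks receive one extra atom, so block sizes
 * differ by at most one.
 */
atom_range atoms_for_rank(long natom, int nproc, int rank)
{
    assert(natom >= 0 && nproc > 0 && rank >= 0 && rank < nproc);
    long base = natom / nproc;
    long extra = natom % nproc;
    atom_range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

For example, 10 atoms on 3 processors give the ranges [0, 4), [4, 7), and [7, 10). After the forces are computed on each block, the updated positions of all blocks still have to be exchanged, which is the broadcast cost discussed above.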

Each of the three techniques has a different communication cost. The computational cost of the communication between processing nodes and cores must be kept in proportion to the computational cost of the processing itself. Therefore, the time used to divide the task and to communicate in the three approaches above must always be less than the processing time of the forces, positions, etc. These three approaches speed up MD globally; that is, the parallel algorithm consists of dividing the MD steps among several threads, processes, or computers. However, within each step there are algorithms that can make the simulation faster.

Pure MPI systems use one MPI process per core [48]. In this way, each task keeps its data in its own memory, and explicit arrangements are needed to share data. In pure OpenMP programs, external systems are required for memory sharing between nodes, such as Cluster OpenMP for Intel compilers. When hybrid systems are used, the division of data between nodes is done by MPI and the division of processing is done by OpenMP. Within the hybrid model, the MasterOnly variant defines that the MPI information exchange is not executed by the threads, whereas in the communication-computation overlap variant the MPI calls are made inside the parallel region.

5.2 Many-core architectures

Another way to provide a faster MD simulation is to use hardware with architectures different from that of a regular CPU. Hardware with many cores can be used to reach more realistic simulation times [49]. Current many-core architectures have large numbers of processors, such as the Nvidia Tesla C1060 with 240 cores and the Intel Xeon Phi Coprocessor 7120X with 61 cores and 4 threads each. Figure 5.5 presents a comparison between CPU and GPU architectures, where it is possible to see the large number of Arithmetic Logic Units (ALUs) and the smaller share of control and cache units in the GPU.


Figure 5.5: Comparison between GPU and CPU. Adapted from [50].

The Xeon Phi architecture contains a bus for interconnecting the cores and a tag directory (TD) that plays the role of memory control and cache. The graphics double data rate (GDDR) memory is connected to this same interconnection ring [51].

Figure 5.6: Xeon Phi architecture. Adapted from [51].

The GPU architecture is highly specialized for compute-intensive (CPU-bound) workloads [50]. The Xeon Phi, which shows features similar to those of the GPU, has processors dedicated to vector processing and is designed for a low exchange of data with the memory external to the Xeon Phi.

Access to GPU and Xeon Phi is done similarly to OpenMP: certain areas of the code are marked to run on those devices. Several studies address which parts of the code should be executed on many-core and multi-core architectures. Zhu et al. [52], Shkurti et al. [53], and Hou and Ge [54] present sets of MD algorithms adapted to GPU systems, together with schemes for separating the atoms to maximize the use of the GPU; each work describes the intricacies required for its target problem. Heterogeneous systems must therefore be able to determine which architecture is best for each piece of code. The study “Best bang for your buck: GPU nodes for GROMACS biomolecular simulations” demonstrates that GPUs have gains when compared to CPUs in several scenarios. This demonstrates a need to use many-core architectures. In this way, the computational use is increased and the computational time is decreased when the correct architectures are chosen [56, 57].


Chapter 6 Implementation of Parallel GFMD

All the HPC techniques presented in the previous chapter can be applied in MD simulators. In particular, we apply those techniques to accelerate the integration algorithm proposed in GFMD.

To implement the parallel version of GFMD, we first developed the serial version following the pseudo-code shown in Chapter 4. To validate the serial version, we applied it to determine the displacements of a linear chain of atoms within the microcanonical ensemble. This system was chosen because it has an exact expression for the displacements. We also compared GFMD with the commonly used integration algorithm implemented in a widely used MD simulator, to find the limitations of both techniques. The next step was to determine the parts of GFMD most suitable for parallelization. Then, we chose the best HPC technique and optimized the number of threads, cores, and other parameters in order to reach the maximum speedup. Finally, the parallel version was implemented and validated against the exact solution.

6.1 Serial GFMD

The serial code was implemented in the C language, which allows the use of many HPC tools [58], such as OpenMP 4.0, OpenCL, OpenACC, MKL, Magma, and other libraries, as well as the GCC and ICC compilers. This implementation was based on the books by Rapaport [7] and Scherer [59]. Listing 6.1 presents the initial steps to perform the GFMD, namely the creation of the matrices and the integration of the equations of motion. MatrixM stores the square root of the atomic masses for all atoms of the system; the variables constK, alfa, and beta hold the spring constants necessary to build the D matrix (referenced as dd in the code).

DO_MATRIX {
    if (i == j) {
        MatrixM[i][j] = sqrt(carbon_mass);
    } else {
        MatrixM[i][j] = 0.0;
    }
}

double constK = prof * gama * gama / 2.0;
double alfa = 2.0 * constK;
double beta = -constK;

DO_MATRIX {
    if (i == j) {
        /* diagonal term; the first and last atoms (i == 1 and i == natom - 1)
           receive the same value as the inner atoms */
        dd[i][j] = alfa / carbon_mass;
    } else if (j == (i + 1) || j == (i - 1)) {
        /* coupling between nearest neighbors */
        dd[i][j] = beta / carbon_mass;
    } else {
        dd[i][j] = 0.0;
    }
}

Listing 6.1: Matrix initialization with all necessary parameters to solve the Newton’s motion equation using GFMD method.


Listing 6.2 presents the steps required to update the atomic positions and velocities.

void GFMDStep()
{
    double omega = 0.0;
    double m2 = 0.0;
    double cf = 0.0981758;

    int n = 0;

    /* projected displacements, Eq. (4.21) */
    for (n = 0; n < natom; n++)
    {
        omega = cf * sqrt(w[n]) * stepCount;
        m2 = w[n];
        mol[n].u1.x = -(mol[n].f.x / (m2 * carbon_mass)) * (cos(omega) - 1.0)
                    + (mol[n].v.x / (cf * sqrt(w[n]))) * sin(omega)
                    + mol[n].u.x * cos(omega);
    }

    /* projected velocities, Eq. (4.22) */
    for (n = 0; n < natom; n++)
    {
        omega = cf * sqrt(w[n]) * stepCount;
        m2 = w[n];
        /* the initial velocity multiplies cos(omega), as in Eq. (4.22) */
        mol[n].v1.x = (mol[n].f.x / (cf * sqrt(w[n]) * carbon_mass)) * sin(omega)
                    + mol[n].v.x * cos(omega)
                    - mol[n].u.x * cf * sqrt(w[n]) * sin(omega);
    }

    projetaGFMDStep();
}

Listing 6.2: GFMD integration code

Other functions were also implemented in the serial version such as the ones to read the simulation input, to measure the computational time, and to write the outputs for post-processing analysis.

6.2 Validation of the serial GFMD

The validation of the serial version was done by applying the GFMD technique to determine the atomic displacements of a linear chain of $N$ atoms of the same mass $m$ (Fig. 6.1). Fixed boundary conditions were applied to the atoms at both ends. Only nearest-neighbor interactions were considered, within the harmonic approximation, using a spring constant $\mu$. The exact solution for the displacements can be obtained (see Appendix B) and is written as

$$u_l(t) = \frac{d}{N-1} \sum_{j=-\frac{N-1}{2}}^{\frac{N-1}{2}-1} \left[\cos(k_j L)\,\cos(\omega(k_j)\,t)\right], \tag{6.1}$$

where $\omega_k^2 = (2\mu/m)[1 - \cos(ka)]$, $j = -\frac{N-1}{2}, \ldots, \frac{N-1}{2} - 1$, and $d$ is the displacement of the central atom at time zero (i.e., $u_0(0) = d$). The following parameters were used:


Figure 6.1: Representation of a linear chain of atoms.

Figure 6.2 presents the evolution of the position of the central atom (normalized by d) for a chain with N = 23 calculated by the serial code. These results were compared with the results from the exact solution and from the Verlet integrator. The results from the Verlet integrator were obtained with LAMMPS [45], one of the most used MD simulators, which allows simulations with different potentials, integrators, and ensembles. The input used in LAMMPS is shown in Listing 6.3. In the input, the Morse potential (Eq. 6.2) was used with the following parameters: $r_c = 1.6$ Å, $D_0 = 1.72345$ eV, and $\alpha = 1.0$, the same initially used in [13].

$$E = D_0\left[e^{-2\alpha(r - r_0)} - 2e^{-\alpha(r - r_0)}\right], \quad r < r_c \tag{6.2}$$

pair_style morse 1.6
pair_coeff * * 1.72345 1.0 1.0 1.6

run_style verlet
timestep 0.001

neighbor 2.5 bin
neigh_modify every 5 delay 0 check yes

group fixo id 1 23
group rest id <> 2 22
dump positions all xyz 2 filme.xyz
velocity all zero linear
fix 3 rest nve
fix freeze1 fixo setforce 0.0 0.0 0.0
thermo 100

variable tempo equal (time*1000)
variable z0 equal (z[12]-12.0)/0.0001
variable zi equal z[1]
variable zf equal z[23]
run 1000

Listing 6.3: LAMMPS input.

Figure 6.2: Normalized displacement of the central atom obtained by GFMD, by the exact solution, and by the Verlet integrator.


6.3 Computational cost of the serial GFMD

To find the most time-consuming parts of the code and, therefore, the ones to be parallelized, we used gprof2dot [60] to generate the usage map of the code routines. This tool reports the timing information for each function of the code. To use gprof2dot, it was necessary to compile the code with debug and profile flags (-g and -fno-omit-frame-pointer). All the statistics produced by gprof2dot are then exported as a usage map (gprof gfmd | gprof2dot.py | dot -Tpng -o output.png). In these maps (Figs. 6.3 and 6.4), each square represents one function of the code and the colors represent the use of each function (blue represents less usage, green and orange larger usage, and red the most used functions). The functions mapped by gprof2dot were:

1. main: initial function of the system;

2. criaMatrizesGFMD and inicializaMatrizesGFMD: create all matrices and calculate several linear algebra quantities, such as eigenvectors and eigenvalues;

3. SingleStep, GFMDStep: perform one step of the simulation;

4. SetupJob, SetParans, FinalizeJob, GetNameList and PrintNameList: read the input file and configure the total number of steps, step size, and other parameters;

5. vectorMatrix and matrixVetor: convert a vector to a matrix and a matrix to a vector, respectively;

6. PrintFrame, PrintSummary: print information for each step of the simulation;

7. AccumProps, PrintPosicao: print the atomic positions.

The routine criaMatrizesGFMD and the two auxiliary functions vectorMatrix and matrixVetor are the most used in a one-step simulation, whereas the functions SingleStep and GFMDStep and the two auxiliary functions vectorMatrix and matrixVetor are the most called in a 1000-step simulation.


Figure 6.3: Usage map of the CPU time consumed by each routine of the implemented serial version of GFMD. The values correspond to a one-step simulation. The colors in the boxes indicate how many times the routine is called by the main code, where red indicates more calls and blue fewer calls.

Figure 6.4: Usage map of the CPU time consumed by each routine of the implemented serial version of GFMD. The values correspond to a 1000-step simulation.


6.4 Parallel GFMD

The parallel version of GFMD (pGFMD) was implemented using the C language and the Intel MKL, Magma, and ESSL mathematical libraries. As a result of the implementation, six modules were produced. Figure 6.5 presents the steps of GFMD and the sequence of calls from each module. Curved dashed lines indicate the modules where it is possible to use threads to speed up the computation.

Figure 6.5: Schematic representation of the parallel version of the GFMD. The parallelized modules are marked with curved dashed lines.

The diagonalization of the D matrix is an important step of GFMD. Listing 6.4 shows two ways to calculate the eigenvectors and eigenvalues using shared memory techniques. In this case, the Intel MKL and IBM ESSL libraries were used; both are well known for mathematical processing. It was necessary to define a different number of threads for each region (“mkl_set_num_threads(mkleigen);” and “omp_set_num_threads(1);”) to reach the maximum speedup; with the IBM library it was necessary to use just one thread due to a characteristic of the library. This step was very important to find the maximum possible speedup for the shared memory approach. Another way to get more speedup is to use a many-core architecture.


#ifdef INTEL
mkl_set_num_threads(mkleigen);
LAPACKE_dsyevx(LAPACK_COL_MAJOR, jobz, range, uplo, natom, temp, n,
               0, 0, 0, 0, tp, &mqtd, w, z, n, ifail);
#endif
#ifdef IBM
jobz = 'V';
range = 'A';
uplo = 'U';
omp_set_num_threads(1);
dsyevx(&jobz, &range, &uplo, n, temp, n, 0, 0, 0, 0,
       0.149166814624004135E-153, &mqtd, w, z, n, work, 0, iwork, ifail,
       &info);
#endif

#ifdef INTEL
mkl_set_num_threads(mkltMultiplica);
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, m, m, alpha, temp, m, temp1, m, beta, temp2, n);
#endif
#ifdef IBM
omp_set_num_threads(ibmt);
dgemm(&transa, &transb, m, m, m, alpha, temp, m, temp1, m, beta, temp2, m);
#endif

Listing 6.4: Calculation of eigenvalues and eigenvectors within parallel GFMD using shared memory techniques.

The parallelization of the calculation of eigenvectors and eigenvalues using a many-core architecture is shown in Listing 6.5. In this case, we used a GPU with the Magma library, which is similar to LAPACK but is prepared for GPU processing [61, 62]. It was necessary to use Magma functions to prepare the GFMD matrices for processing: allocate the memory and load the matrices into the GPU (“magma_dmalloc_cpu”), query the best buffer parameters and execute the calculation (“magma_dsyevd_gpu”), and free all GPU memory (“magma_free”).

magma_init();
magma_int_t n = natom;

double *a, *r;
double *d_r;
double *h_work;

magma_int_t lwork;
magma_int_t *iwork;
magma_int_t liwork;

double *w1, *w2;

magma_int_t ione = 1, i, j, info;

double mione = -1.0;

magma_int_t incr = 1, inci = 1;

magma_int_t ISEED[4] = {0, 0, 0, 1};

magma_dmalloc_cpu(&w1, n);
magma_dmalloc_cpu(&w2, n);
magma_dmalloc_cpu(&a, n2);
magma_dmalloc_cpu(&r, n2);
magma_dmalloc(&d_r, n2);

double aux_work[1];

magma_int_t aux_iwork[1];
/* workspace query: lwork = liwork = -1 returns the optimal buffer sizes */
magma_dsyevd_gpu(MagmaVec, MagmaUpper, n, d_r, n, w1, r, n, aux_work, -1,
                 aux_iwork, -1, &info);
lwork = (magma_int_t) aux_work[0];

liwork = aux_iwork[0];
iwork = (magma_int_t *) malloc(liwork * sizeof(magma_int_t));

magma_dmalloc_cpu(&h_work, lwork);

/* copy the matrix to the GPU and diagonalize it */
magma_dsetmatrix(n, n, temp, n, d_r, n);
magma_dsyevd_gpu(MagmaVec, MagmaUpper, n, d_r, n, w, r, n, h_work, lwork, iwork,
                 liwork, &info);

/* buffers obtained with magma_dmalloc_cpu are released with magma_free_cpu */
magma_free_cpu(w2);
magma_free_cpu(w1);
magma_free_cpu(a);
magma_free_cpu(r);
magma_free_cpu(h_work);
free(iwork);
magma_free(d_r);
magma_finalize();

Listing 6.5: Calculation of eigenvalues and eigenvectors within parallel GFMD using many-core architectures.

Because GFMD uses intensive matrix multiplication to obtain the projected positions and velocities (e.g., $W^*(t) \leftarrow V^T W(t)$), this matrix operation was also parallelized within pGFMD. The GPU parallelization of the matrix multiplication is shown in Listing 6.6.

int matrixMultiplicacaoVetorGPU(int natom, real *m1, real *m2, real *m3,
                                int transtt)
{
    ibmt = procThreads;
    threadCria = procThreads;
    threadPasso = procThreads;
    threadProjeta = procThreads;
    threadInic = procThreads;

    /* assumed roles: d_A input vector (m1), d_B matrix (m2), d_C result (m3) */
    real *d_A, *d_B, *d_C;

    unsigned int size_A = natom * natom;
    unsigned int mem_size_A = sizeof(real) * size_A;   /* matrix size in bytes */
    unsigned int mem_size_v = sizeof(real) * natom;    /* vector size in bytes */

    cudaError_t t;

    cublasHandle_t handle;
    cublasStatus_t ret;
    ret = cublasCreate(&handle);

    if (ret != CUBLAS_STATUS_SUCCESS)
    {
        exit(EXIT_FAILURE);
    }

    /* every device buffer must be allocated before it is used */
    t = cudaMalloc((void **) &d_A, mem_size_v);
    t = cudaMalloc((void **) &d_B, mem_size_A);
    t = cudaMalloc((void **) &d_C, mem_size_v);

    t = cudaMemcpy(d_A, m1, mem_size_v, cudaMemcpyHostToDevice);
    t = cudaMemcpy(d_B, m2, mem_size_A, cudaMemcpyHostToDevice);

    /* cublasDgemv works in double precision, so real must be double */
    const real alpha = 1.0;
    const real beta = 0.0;

    if (transtt == 1)
        ret = cublasDgemv(handle, CUBLAS_OP_N, natom, natom, &alpha,
                          d_B, natom, d_A, 1, &beta, d_C, 1);
    else
        ret = cublasDgemv(handle, CUBLAS_OP_T, natom, natom, &alpha,
                          d_B, natom, d_A, 1, &beta, d_C, 1);

    t = cudaMemcpy(m3, d_C, mem_size_v, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    cublasDestroy(handle);
    return 1;
}


Chapter 7 Results

7.1 Validation of the parallel GFMD version

In order to validate the parallel implementation, we also applied pGFMD to determine the displacements of the atoms of a linear chain. Figure 7.1 shows the displacement of the central atom of a 23-atom chain. The current implementation of pGFMD was also able to correctly reproduce the exact atom displacement on a time scale for which the Verlet algorithm provides incorrect values.

Figure 7.1: Normalized displacement of the central atom obtained by pGFMD, by the exact solution, and by the Verlet’s integrator.


7.2 Speedup

A typical way to quantify the acceleration provided by a parallel algorithm is the speedup. Speedup is the ratio of the execution time of the serial implementation of the algorithm to the execution time of its parallel version. To determine the speedup of GFMD provided by pGFMD, we tested different numbers of cores and GPU threads. All tests were performed in a cluster with 2 Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, 48 GB RAM, 24 cores, and one Nvidia Tesla M2070. These tests were necessary to determine the optimum use of the available threads of the GPU. Two measurements of speedup were carried out, using only one calculation of the dynamical matrix. The first measurement accounts for the matrix processing, whereas the second one accounts for the position and velocity updating. These two independent measurements were performed because the different computational workloads of those operations affect the speedup behavior. Linear chains ranging from 10 to 20000 atoms were simulated. The number of atoms is limited mainly by the memory of the Tesla M2070 GPU.
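The definition of speedup given above amounts to a single division. A minimal C helper, with the hypothetical name `speedup`, could be written as:

```c
#include <assert.h>

/* Speedup: serial execution time divided by parallel execution time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}
```

As a purely illustrative example with made-up timings, a serial run of 120 s and a parallel run of 10 s would give a speedup of 12. Both times must be measured for the same workload, here the same number of atoms and time steps.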

Figure 7.2 shows the execution time for a simulation of 6000 atoms during 20000 time steps using the best thread configuration (12 cores). The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 80 seconds; the line with red squares shows the time using CPU threads in the atom integration process, with a best value of ≈ 380 seconds; the line with yellow circles shows the matrix processing using the GPU, with a steady value of 90 seconds; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 45 seconds. The steady values occur because all the processing is done on the GPU.

[Plot: time (seconds) versus number of cores. Series: 6000 atoms matrix processing; 6000 atoms integration; 6000 atoms matrix processing GPU; 6000 atoms integration GPU.]

Figure 7.2: Time of OpenMP GFMD simulating 6000 atoms with 20000 steps using 12 threads.

Figure 7.3 shows the speedup for 6000 atoms during 20000 time steps using the best thread configuration (12 cores). The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 11-fold using 12 cores; the line with red squares shows the speedup using CPU threads in the atom integration process, with a value of ≈ 1-fold; the line with yellow squares shows the matrix processing using the GPU, with a steady value of 26-fold; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 5-fold.

[Figure 7.3 plot. Horizontal axis: Cores (2 to 24); vertical axis: Speedup. Series: 6000 Atoms Matrix Processing; 6000 Atoms Integration; 6000 Matrix Processing GPU; 6000 Atoms Integration GPU.]

Figure 7.3: Speedup of OpenMP GFMD simulating 6000 atoms with 20000 steps using 12 threads.
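The CPU speedups quoted for 6000 atoms can be related to the number of cores through the parallel efficiency, E = S / p. Efficiency is a standard metric that the text itself does not report, so the short Python sketch below is only an illustration; the function name `parallel_efficiency` is hypothetical, and the speedup values are the approximate ones quoted above for 12 cores.

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency E = S / p, where p is the number of cores used."""
    if workers <= 0:
        raise ValueError("number of workers must be positive")
    return speedup / workers


# Approximate CPU speedups quoted in the text for 6000 atoms on 12 cores.
quoted_cpu_speedup = {"matrix processing": 11.0, "atom integration": 1.0}

for phase, s in quoted_cpu_speedup.items():
    print(f"{phase}: efficiency ~ {parallel_efficiency(s, 12):.2f}")
```

An efficiency close to 1 indicates that the added cores are well used, whereas a value near 1/12 indicates that adding CPU threads brings almost no gain for that phase.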

Figure 7.4 shows the execution time for 12000 atoms over 20000 time steps using 12 threads. The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 480 seconds using 12 cores; the line with red squares shows the time of the atom integration process using CPU threads, with a best value of ≈ 1400 seconds; the line with yellow squares shows the matrix processing using the GPU, with a steady value of 200 seconds; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 200 seconds.

[Figure 7.4 plot. Horizontal axis: Cores (2 to 24); vertical axis: Time (seconds).]

Figure 7.5 shows the speedup for 12000 atoms over 20000 time steps using 12 threads. The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 10-fold using 12 cores; the line with red squares shows the speedup of the atom integration process using CPU threads, with a value of ≈ 1-fold; the line with yellow circles shows the matrix processing using the GPU, with a steady value of 26-fold; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 8-fold.

[Figure 7.5 plot. Horizontal axis: Cores (2 to 24); vertical axis: Speedup. Series: 12000 Atoms Matrix Processing; 12000 Atoms Integration; 12000 Matrix Processing GPU; 12000 Atoms Integration GPU.]

Figure 7.5: Speedup of OpenMP GFMD simulating 12000 atoms with 20000 steps using 12 threads.

Figure 7.6 shows the execution time for 20000 atoms over 20000 time steps using the 12-thread configuration. The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 6000 seconds using 12 cores; the line with red squares shows the time of the atom integration process using CPU threads, with a best value of ≈ 4800 seconds; the line with yellow circles shows the matrix processing using the GPU, with a steady value of 250 seconds; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 200 seconds.

[Figure 7.6 plot. Horizontal axis: Cores (2 to 24); vertical axis: Time (seconds).]
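The approximate execution times quoted for Figures 7.2, 7.4, and 7.6 can be compared directly by dividing the best CPU time by the GPU time for each phase. The Python sketch below does only that arithmetic. The names `quoted_times`, `cpu_to_gpu_ratio`, and `ratio_table` are hypothetical, the inputs are the rounded values quoted in the text, and the resulting ratios are not the speedups plotted in the speedup figures, which are measured against the serial version.

```python
# Approximate times in seconds quoted in the text (20000 time steps).
# Each pair is (best CPU time, GPU time) for the given phase.
quoted_times = {
    6000: {
        "matrix processing": (80.0, 90.0),
        "atom integration": (380.0, 45.0),
    },
    12000: {
        "matrix processing": (480.0, 200.0),
        "atom integration": (1400.0, 200.0),
    },
    20000: {
        "matrix processing": (6000.0, 250.0),
        "atom integration": (4800.0, 200.0),
    },
}


def cpu_to_gpu_ratio(cpu_seconds, gpu_seconds):
    """Ratio of the best CPU time to the GPU time for one phase."""
    return cpu_seconds / gpu_seconds


def ratio_table(times):
    """Map each system size and phase to its CPU/GPU time ratio."""
    return {
        atoms: {phase: cpu_to_gpu_ratio(*pair) for phase, pair in phases.items()}
        for atoms, phases in times.items()
    }


for atoms, phases in ratio_table(quoted_times).items():
    for phase, ratio in phases.items():
        print(f"{atoms:>6} atoms, {phase}: CPU/GPU time ratio ~ {ratio:.1f}")
```

Because the inputs are rounded readings, the ratios are only indicative; they suggest how the relative advantage of the GPU changes as the number of atoms grows.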

Figure 7.7 shows the speedup for 20000 atoms over 20000 time steps using the 12-thread configuration. The line with dark blue circles represents the matrix processing using only the CPU, with a best value of ≈ 8-fold using 12 cores; the line with red squares shows the speedup of the atom integration process using CPU threads, with a value of ≈ 5-fold; the line with yellow squares shows the matrix processing using the GPU, with a mean value of 58-fold; and the line with cyan squares shows the atom integration process using the GPU, with a value of ≈ 5-fold.

