Heterogeneous Multi-core Systems
Leonel Sousa and Aleksandar Ilić
INESC-ID/IST, TU Lisbon
Motivation
Commodity Computers = Heterogeneous Systems
– Multi-core general-purpose processors (CPUs)
– Many-core graphics processing units (GPUs)
– …
– Special accelerators, co-processors, FPGAs => huge computing power
– Not yet completely explored for collaborative computing
Heterogeneity makes problems much more complex!
– Performance modeling and load balancing
– Different programming models and languages
Outline
Collaborative Environment for Heterogeneous Computers
Performance Modeling and Load Balancing
– State of the art for heterogeneous systems
– for CPU+GPU
Case Study: 2D Batch Fast Fourier Transform
Conclusions and Future Work
Desktop Heterogeneous Systems
Master-Slave paradigm
• CPU (Master)
– Global execution controller
– Accesses the whole global memory
• Interconnection Busses
– Limited and asymmetric communication bandwidth
– Potential execution bottleneck
• Underlying Devices (Slaves)
– Different architectures and programming models
– Computation performed using local memories
Redefining Tasks and Primitive Jobs
Task – basic programming unit (coarser-grained)
– Configuration Parameters
– Task: application and task dependency information
– Environment: device type, number of devices, …
– Primitive Job Wrapper
– Divisible Task – comprises several finer-grained Primitive Jobs
– Agglomerative Task – allows grouping of Primitive Jobs
Primitive Job – minimal program portion for parallel execution
– Configuration Parameters
– I/O and performance specifics, …
– Carries Per-Device-Type Implementations
– Vendor-specific programming models and tools
– Specific optimization techniques

Primitive Job Granularity   Divisible   Agglomerative
Coarser-grained             NO          --
Balanced                    YES         NO
Finer                       YES         YES
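The task/primitive-job hierarchy above can be sketched as plain data structures. This is an illustrative reconstruction; the names `Task`, `PrimitiveJob`, and all fields are hypothetical, not the environment's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PrimitiveJob:
    size: int                      # problem size handled by this job
    io_bytes: int                  # I/O specifics (input + output volume)
    impls: Dict[str, Callable] = field(default_factory=dict)  # per-device-type implementations

@dataclass
class Task:
    jobs: List[PrimitiveJob]       # a Divisible Task comprises finer-grained Primitive Jobs
    divisible: bool = True
    agglomerative: bool = True     # allows grouping of Primitive Jobs

    def agglomerate(self, group_size: int) -> List[List[PrimitiveJob]]:
        """Group Primitive Jobs into chunks (only valid for agglomerative tasks)."""
        assert self.agglomerative
        return [self.jobs[i:i + group_size] for i in range(0, len(self.jobs), group_size)]
```

A dispatcher could then hand each group to the implementation registered for the target device type.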
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Collaborative Execution Environment for Heterogeneous Systems*
Task Level Parallelism
– The Task Scheduler submits independent tasks to the Job Dispatcher, in respect to the task and environment configuration parameters and the current platform state from the Device Query structure
Data Level Parallelism
– Primitive Jobs may be arranged into Job Queues (currently, 1D–3D grid organization) for Divisible (Agglomerative) Tasks
– The Job Dispatcher uses the Device Query and Job Queue information to map (agglomerated) Primitive Jobs to the requested devices; it then initiates and controls further execution
Nested Parallelism
– If provided, the Job Dispatcher can be configured to perceive a certain number of cores of a multi-core device as a single device
*Aleksandar Ilic and Leonel Sousa. “Collaborative Execution Environment for Heterogeneous Parallel Systems”, In 12th
for Heterogeneous Systems*
Problem
How to make good dynamic load balancing decisions using performance models of the devices, aware of:
- application demands
- implementation specifics
- platform / device heterogeneity
- complex memory hierarchies
- limited asymmetric communication bandwidth
- …
Performance Models and Computation Distribution
Constant Performance Models (CPM)
– Device Performance (Speed): a constant positive number
– Typically represents the relative speed when executing a serial benchmark of a given size
– Computation Distribution: proportional to the speed of the device
Functional Performance Models (FPM)
– Device Performance (Speed): a continuous function of the problem size
– Typically requires several benchmark runs and a significant amount of time to build
– Computation Distribution: relies on the "functional speed" of the processor
FPM vs. CPM
– More Realistic: integrates the features of heterogeneous processors
– Processor heterogeneity, the heterogeneity of the memory structure, and other effects (such as paging)
– More Accurate Distribution of computation across heterogeneous devices
– Application-Centric approach: characterizes the speed of different applications with different functions
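The CPM distribution rule above reduces to a proportional split of the chunks. A minimal sketch (the rounding rule, giving leftover chunks to the fastest devices, is an assumption; the slides do not specify one):

```python
# CPM-style computation distribution: each device gets a share of the
# n_total chunks proportional to its constant speed.
def cpm_distribute(speeds, n_total):
    shares = [int(n_total * s / sum(speeds)) for s in speeds]
    leftover = n_total - sum(shares)
    # hand remaining chunks to the fastest devices (illustrative tie-break)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        shares[i] += 1
    return shares
```

For example, two devices with relative speeds 1 and 3 split 100 chunks as 25 and 75.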
Case Study: 2D FFT Batches
Part of a parallel 3D FFT procedure: H = FFT1D(FFT2D(h))
– Very high communication-to-computation ratio
Problem Definition
– Nin = Nout = N1·N2·N3 · sizeof(data)
– Performance [FLOPS]: N1·N2·N3 · log(N2·N3)
– N2 = const; N3 = const
– N1 – total number of computational chunks (Primitive Jobs)
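Plugging in the 256×256×256 case-study size used later in the experimental setup, and assuming double-precision complex elements of 16 bytes:

```python
# Worked numbers for the problem definition above: data volume moved in
# and out for the whole batch of 2D FFTs.
N1 = N2 = N3 = 256
elem = 16                             # assumed sizeof(complex double), bytes
n_in = n_out = N1 * N2 * N3 * elem    # input/output volume in bytes
print(n_in / 2**20)                   # → 256.0 (MiB each way)
```

A quarter-gigabyte in each direction explains the very high communication-to-computation ratio noted above.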
What we already know …* (1)
Performance Metric Initialization Approximation Iteration
Metric: Absolute Speed
– p devices: P1, P2, …, Pp
– N1 total #Primitive Jobs (chunks)
– Device load [chunks]: n1, n2, …, np
– Absolute speed: si(ni) = ni/ti(ni), 1≤i≤p
Solution: Optimal Load Balancing
Lies on the straight line that passes through the origin of the coordinate system, such that:
x1/s1(x1) = x2/s2(x2) = … = xp/sp(xp)
x1 + x2 + … + xp = N1
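The optimality condition above (equal per-device execution times, total load N1) can be checked numerically. A minimal sketch, with per-device speed functions passed in as plain callables and an illustrative tolerance:

```python
# Verify that a candidate partition (x1..xp) is balanced under the
# absolute-speed metric: all devices take (nearly) the same time x_i/s_i(x_i).
def is_balanced(xs, speed_fns, n_total, eps=0.05):
    if sum(xs) != n_total:
        return False
    times = [x / s(x) for x, s in zip(xs, speed_fns)]
    return (max(times) - min(times)) / max(times) <= eps
```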
What we already know …* (2)
Performance Metric Initialization Approximation Iteration
① All the p computational units execute N1/p 2D FFT batches in parallel: ni = N1/p, 1≤i≤p
② Record the execution times: ti(N1/p)
③ IF max1≤i,j≤p{(ti(N1/p) − tj(N1/p))/ti(N1/p)} ≤ ε
THEN even distribution solves the problem and the algorithm stops;
ELSE the absolute speeds of the devices are calculated, such that:
si(N1/p) = (N1/p)/ti(N1/p), 1≤i≤p
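The initialization step above, as a sketch (ε and the return convention are illustrative):

```python
# Run N1/p chunks on every device; stop if the relative spread of the
# measured times is within eps, otherwise compute the speeds s_i(N1/p).
def init_step(times, n1, p, eps=0.1):
    spread = max(abs(ti - tj) / ti for ti in times for tj in times)
    if spread <= eps:
        return None                       # even distribution already balances the load
    return [(n1 / p) / t for t in times]  # absolute speeds s_i(N1/p)
```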
*Lastovetsky, A., and R. Reddy, "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, Netherlands, Lecture Notes in Computer Science, vol. 6043, Springer, pp. 91-101, 2010.
What we already know …* (3)
Performance Metric Initialization Approximation Iteration
① The performance of each device is modeled as a constant
si(x) = si(N1/p), 1≤i≤p
What we already know …** (4–6)
Performance Metric Initialization Approximation Iteration
① Draw Upper (U) and Lower (L) lines through the points (0,0), (N1/p, maxi{si(N1/p)}) and (0,0), (N1/p, mini{si(N1/p)})
② Let xi(U), xi(L) be the intersections with si(x);
IF there exists i such that xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT 2
**Lastovetsky, A., and R. Reddy, "Data Partitioning with a Functional Performance Model of Heterogeneous Processors",
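The bisection above can be recast algebraically: a line through the origin intersects each speed curve si(x) where the execution time x/si(x) equals the line's inverse slope T, so bisecting between U and L amounts to a search on T until the intersections sum to N1. A sketch under simplifying assumptions (the bracket [0, 1000] and the linear-scan intersection are arbitrary; ties may make the sum slightly exceed N1, which a real implementation would round off):

```python
def intersect(s, T, cap=10**6):
    # largest x (<= cap) whose execution time x/s(x) stays within budget T
    x = 0
    while x < cap and (x + 1) / s(x + 1) <= T:
        x += 1
    return x

def fpm_partition(speed_fns, n1, iters=40):
    lo, hi = 0.0, 1000.0                  # assumed bracket on the time budget
    for _ in range(iters):
        T = (lo + hi) / 2
        if sum(intersect(s, T) for s in speed_fns) < n1:
            lo = T                        # too few chunks fit: raise the budget (L = M)
        else:
            hi = T                        # enough chunks fit: tighten the budget (U = M)
    return [intersect(s, hi) for s in speed_fns]
```

With constant speeds the result collapses to the CPM proportional split, e.g. speeds 10 and 30 with N1 = 100 give 25 and 75 chunks.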
What we already know … (7–8)
Performance Metric Initialization Approximation Iteration
A. The speeds of each device are modeled as constants equal to the observed values [1]
si(x) = si(ni), 1≤i≤p
B. Functional performance modeling using piecewise linear approximation [2]
[1] Galindo I., Almeida F., and Badía-Contelles J.M., "Dynamic Load Balancing on Dedicated Heterogeneous Systems", In EuroPVM/MPI 2008, Springer, pp. 64-74, 2008.
Goals in the CPU + GPU Environment
Online Performance Modeling
– Performance estimation of all heterogeneous devices during the execution
– No prior knowledge of the performance of an application is available on any of the devices
– Modeling of the overall CPU and GPU performance for different problem sizes (+ kernel-only GPU performance)
Dynamic Load Balancing
– Optimal distribution of computations (Primitive Jobs)
– Partial estimations of the performance should be built and used to decide on the optimal mapping
– The returned solution should provide load balancing within a given accuracy
Communication Awareness
– Modeling the bandwidth of the interconnection busses during the execution
– To select problem sizes that maximize the interconnection bandwidth
– The algorithm should be aware of the asymmetric bandwidth of Host-To-Device and Device-To-Host transfers
CPU+GPU Architectural Specifics
– Make use of environment-specific functions to ease performance modeling
– Asynchronous transfers and CUDA streams to overlap communication with computation
– Be aware of the diverse capabilities of different devices, and also of devices of the same type (e.g., GT200 vs. Fermi)
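The saturation-point idea behind the bandwidth-modeling goal can be sketched as follows; `measure` stands in for a real Host-To-Device or Device-To-Host timing routine, which this sketch does not provide, and the tolerance is illustrative:

```python
# Grow the transfer size until the measured bandwidth stops improving by
# more than `tol` (a saturation point), then stop the calibration.
def find_saturation(measure, start=4, tol=0.05, max_size=1 << 20):
    size, best = start, measure(start)
    while size * 2 <= max_size:
        bw = measure(size * 2)
        if bw <= best * (1 + tol):        # no significant improvement: saturated
            return size * 2, bw
        size, best = size * 2, bw
    return size, best
```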
Building Full Performance Models
Full Performance Models: Per-Device Real Performance
– Experimentally obtained using CPU+GPU platform specifics (pageable/pinned memory)
– Exhaustive search on the full range of problem sizes
– High cost of building it!

Experimental Setup
                        CPU                  GPU
                        Intel Core 2 Quad    nVIDIA GeForce 285GTX
Speed/Core (GHz)        2.83                 1.476

2D FFT Batch
FFT Type                Complex; Double Precision
Total Size (N1xN2xN3)   256x256x256
High Performance Software
in CPU + GPU Environment (1)
Performance Metric Initialization Approximation Iteration
Problem Definition: Primitive Job
– nin – input data requirements
– nout – output data requirements
– nsize – problem size
– nflops – #floating-point operations
Metric: FLOPS
– p devices: P1, P2, …, Pp
– N1 total #Primitive Jobs (chunks)
– Device load [chunks]: n1, n2, …, np
– ni = {ni,in, ni,out, ni,size, ni,flops}
– FLOPS: Perfi(ni) = ni,flops/ti(ni), 1≤i≤p
Solution: Optimal Load Balancing
Lies on the straight line that passes through the origin of the coordinate system, such that (with xi,flops = nf(xi) and xi as ni,size):
nf(x1)/Perf1(x1) = … = nf(xp)/Perfp(xp)
x1 + x2 + … + xp = N1
in CPU + GPU Environment (2–3)
Performance Metric Initialization Approximation Iteration
① All the p computational units execute N1/p 2D FFT batches in parallel: ni = N1/p, 1≤i≤p
② IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
③ Split the given computational load into streams and use asynchronous transfers to overlap communication with computation
Streaming
– Subdivide the ni computational chunks using the DIV2 strategy
– No prior knowledge of the performance of an application!
– Each stream has half the load of the previous stream
– The algorithm may continue splitting the workload until the last stream is assigned a load equal to 1
– Bandwidth-aware DIV2 strategy
– The interconnection bandwidth is subject to the amount of data to be transferred, not to the application-specific demands
– Run small pre-calibration tests for Host-To-Device and Device-To-Host transfers
– Tests can be stopped when saturation points are detected, or when transfers reach a desired value (e.g., 60% of the theoretical peak)
– Case study: nmin_size = 4
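The DIV2 subdivision above can be sketched as follows; folding in the nmin_size cutoff from the bandwidth-aware variant is a simplifying assumption (the pure strategy splits down to a load of 1):

```python
# Each stream gets half the load of the previous one, never dropping below
# the minimal workload size found by the bandwidth pre-calibration tests.
def div2_streams(n, n_min_size=4):
    streams = []
    while n > 0:
        half = max(n // 2, n_min_size)
        half = min(half, n)               # the last stream takes whatever remains
        streams.append(half)
        n -= half
    return streams
```

For example, a load of 64 chunks with nmin_size = 4 splits into streams of 32, 16, 8, 4, 4.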
in CPU + GPU Environment (4)
Performance Metric Initialization Approximation Iteration
④ Execute & record the execution times: ti(N1/p)
⑤ IF max1≤i,j≤p{(ti(N1/p) − tj(N1/p))/ti(N1/p)} ≤ ε
THEN even distribution solves the problem and the algorithm stops;
ELSE the performance of the devices is calculated, such that:
Perfi(N1/p) = nf(N1/p)/ti(N1/p), 1≤i≤p
in CPU + GPU Environment (5–10)
Performance Metric Initialization Approximation Iteration
① Traditional approach: the performance of each device is modeled as a constant
Perfi(x) = Perfi(N1/p), 1≤i≤p
② GPU-specific modeling, using the values obtained from the streaming execution:
– HostToDevice bandwidth
– DeviceToHost bandwidth
– GPU kernel performance
③ Incorporate the streaming results
in CPU + GPU Environment (11–12)
Performance Metric Initialization Approximation Iteration
① Draw Upper (U) and Lower (L) lines through the points (0,0), (N1/p, maxi{Perfi(N1/p)}) and (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x);
IF there exists i such that xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT 2
⑤ Employ the streaming strategy on the remaining load
in CPU + GPU Environment (13)
Performance Metric Initialization Approximation Iteration
Streaming
– Streaming strategy
– Results obtained using the DIV2 strategy consider the application characterization (e.g., the communication-to-computation ratio)
– The workload size for the next stream should be chosen so as to overlap transfers with computation in the previous stream
– Bandwidth-aware streaming strategy
– Reuses the minimal workload size from the DIV2 strategy (obtained via the Host-To-Device and Device-To-Host tests):
IF (ncurr ≥ nmin_size) THEN use the strategy (continue overlapping) ELSE restart the strategy on the ncurr load
– In case the load drops under nmin_size, the strategy is restarted on the remaining load
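The overlap rule above, as an illustrative sketch: the next stream's load is sized so its host-to-device transfer hides under the previous stream's kernel time. The rate parameters (chunks per second) and the restart handling are assumptions, not the presented implementation:

```python
# Pick the next stream's load so that its transfer time (load / h2d_rate)
# fits within the previous stream's compute time (prev_load / kernel_rate).
def next_stream_load(prev_load, h2d_rate, kernel_rate, remaining, n_min_size=4):
    fit = int(prev_load * h2d_rate / kernel_rate)  # chunks transferable while prev computes
    load = max(min(fit, remaining), 0)
    if load < n_min_size:                 # load dropped under the minimum:
        return min(remaining, n_min_size) # restart the strategy on the remaining load
    return load
```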
in CPU + GPU Environment (14)
Performance Metric Initialization Approximation Iteration
Streaming
– Case study: 2D FFT batch – about 83% of the total execution time goes on data transfers
– HostToDevice transfers – 46%
– Kernel execution – 17%
– DeviceToHost transfers – 37%
– Bandwidth-aware streaming strategy
– nmin_size = 4; overlap the HostToDevice transfers and Kernel execution of two streams
in CPU + GPU Environment (16–21)
Performance Metric Initialization Approximation Iteration
① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
– HostToDevice bandwidth
– DeviceToHost bandwidth
– GPU kernel performance
③ Incorporate the streaming results
– for each stream
– for each stream restart
– for every stream combination
Summary and Future Work
Collaborative Environment for Heterogeneous Computers
Traditional Approaches for Performance Modeling
– Approximate the performance using a number of points equal to the number of iterations
– In this case, 3 points per device
Presented Approach for Performance Modeling
– Models the performance using more than 30 points, in this case
– Communication-aware – schedules in respect to the limited and asymmetric interconnection bandwidth
– Employs streaming strategies to overlap communication with computation across devices
– Builds several per-device models at the same time
– Overall performance for each device + streaming GPU performance
– HostToDevice bandwidth modeling
– DeviceToHost bandwidth modeling
– GPU kernel performance modeling