Cooperative Execution on Heterogeneous Multi-core Systems

Academic year: 2021
(1)

Cooperative Execution on Heterogeneous Multi-core Systems

Leonel Sousa and Aleksandar Ilić
INESC-ID/IST, TU Lisbon

(2)

Motivation

Commodity Computers = Heterogeneous Systems
– Multi-core general-purpose processors (CPUs)
– Many-core graphics processing units (GPUs)
– Special accelerators, co-processors, FPGAs
=> HUGE COMPUTING POWER, not yet completely explored for COLLABORATIVE COMPUTING

Heterogeneity makes the problems much more complex!
– Performance modeling and load balancing
– Different programming models and languages

(3)

Outline

– Collaborative Environment for Heterogeneous Computers
– Performance Modeling and Load Balancing
  – State of the art for heterogeneous systems
  – … for CPU+GPU
– Case Study: 2D Batch Fast Fourier Transform
– Conclusions and Future Work

(4)

Desktop Heterogeneous Systems

Master-Slave paradigm

CPU (Master)
– Global execution controller
– Access to the whole global memory

Interconnection Busses
– Limited and asymmetric communication bandwidth
– Potential execution bottleneck

Underlying Devices (Slaves)
– Different architectures and programming models
– Computation performed using local memories

(5)

Redefining Tasks and Primitive Jobs

Task – basic programming unit (coarser-grained)
– Configuration parameters
  – Task: application and task dependency information
  – Environment: device type, number of devices, …
– Primitive Job Wrapper
  – Divisible Task – comprises several finer-grained Primitive Jobs
  – Agglomerative Task – allows grouping of Primitive Jobs

Primitive Job – minimal program portion for parallel execution
– Configuration parameters: I/O and performance specifics, …
– Carries per-device-type implementations
  – Vendor-specific programming models and tools
  – Specific optimization techniques

Primitive Job Granularity | Divisible Task | Agglomerative Task
Coarser-grained           | NO             | --
Balanced                  | YES            | NO
Finer/Balanced            | YES            | YES

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

(6)

… for Heterogeneous Systems*

Task-Level Parallelism
– The TaskScheduler submits independent tasks to the JobDispatcher with respect to the task and environment configuration parameters and the current platform state from the DeviceQuery structure

Data-Level Parallelism
– PrimitiveJobs may be arranged into JobQueues (currently, 1D–3D grid organization) for Divisible (Agglomerative) Tasks
– The JobDispatcher uses DeviceQuery and JobQueue information to map (agglomerated) Primitive Jobs to the requested devices; it then initiates and controls further execution

Nested Parallelism
– If provided, the JobDispatcher can be configured to perceive a certain number of cores of a multi-core device as a single device

*Aleksandar Ilic and Leonel Sousa. "Collaborative Execution Environment for Heterogeneous Parallel Systems", In 12th …

(7)

… for Heterogeneous Systems*

PROBLEM: How to make good DYNAMIC LOAD BALANCING decisions using PERFORMANCE MODELS of the devices, aware of:
– application demands
– implementation specifics
– platform / device heterogeneity
– complex memory hierarchies
– limited asymmetric communication bandwidth
– …

(8)

… and Computation Distribution

Constant Performance Models (CPM)
– Device performance (speed): a constant positive number
  – Typically represents relative speed when executing a serial benchmark of a given size
– Computation distribution: proportional to the speed of the device

Functional Performance Models (FPM)
– Device performance (speed): a continuous function of the problem size
  – Typically requires several benchmark runs and a significant amount of time to build
– Computation distribution: relies on the "functional speed" of the processor

FPM vs. CPM
– More realistic: integrates the features of a heterogeneous processor
  – Processor heterogeneity, the heterogeneity of the memory structure, and other effects (such as paging)
– More accurate distribution of computation across heterogeneous devices
– Application-centric approach: characterizes the speed for different applications with different functions
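As a minimal sketch of how a CPM-style distribution can be computed (the function name and the rounding policy for leftover chunks are illustrative, not from the slides):

```python
def cpm_distribution(speeds, n_total):
    """Distribute n_total chunks proportionally to constant device speeds,
    as a CPM would. speeds holds one constant (relative) speed per device."""
    total = sum(speeds)
    shares = [n_total * s / total for s in speeds]
    loads = [int(share) for share in shares]          # round down first
    # Hand leftover chunks to the devices with the largest fractional share.
    leftovers = n_total - sum(loads)
    by_fraction = sorted(range(len(speeds)),
                         key=lambda i: shares[i] - loads[i], reverse=True)
    for i in by_fraction[:leftovers]:
        loads[i] += 1
    return loads
```

For instance, `cpm_distribution([2.0, 1.0, 1.0], 8)` yields `[4, 2, 2]`; an FPM would instead re-evaluate the speeds as functions of the assigned problem size.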

(9)

Case Study: 2D FFT Batches

Part of a PARALLEL 3D FFT PROCEDURE: H = FFT1D(FFT2D(h))
Very HIGH COMMUNICATION-TO-COMPUTATION ratio

Problem Definition
– Nin = Nout = N1 * N2 * N3 * sizeof(data)
– Performance [FLOPS]: N1 * N2 * N3 * log(N1 * N2 * N3)
– N2 = const; N3 = const
– N1 – total number of computational chunks (PrimitiveJobs)
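The operation count behind the FLOPS metric can be written out directly; the log base is not stated on the slide, so base 2 is assumed here:

```python
import math

def fft_batch_flops(n1, n2, n3):
    """Operation count N1*N2*N3*log(N1*N2*N3) used as the performance
    metric for the 2D FFT batch (constant factors omitted, as on the
    slide; log base 2 is an assumption)."""
    n = n1 * n2 * n3
    return n * math.log2(n)
```

Since N2 and N3 are constants, the metric grows with N1, the number of computational chunks.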

(10)

What we already know …* (1)

Performance Metric | Initialization | Approximation | Iteration

Metric: Absolute Speed
– p devices: P1, P2, …, Pp
– N1 – total #Primitive Jobs (chunks)
– Device load [chunks]: n1, n2, …, np
– Absolute speed: si(ni) = ni / ti(ni), 1 ≤ i ≤ p

Solution: Optimal Load Balancing
– Lies on the straight line that passes through the origin of the coordinate system, such that:
  x1/s1(x1) = x2/s2(x2) = … = xp/sp(xp)
  x1 + x2 + … + xp = N1
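Since si(xi) = xi/ti(xi), each ratio xi/si(xi) is just the execution time ti(xi), so the optimality condition says all devices finish together. A small sketch (the tolerance argument is illustrative):

```python
def absolute_speeds(loads, times):
    """si(ni) = ni / ti(ni) for the observed loads and execution times."""
    return [n / t for n, t in zip(loads, times)]

def is_balanced(times, tol=0.05):
    """x1/s1(x1) = ... = xp/sp(xp) holds (within tol) exactly when the
    per-device execution times are (nearly) equal."""
    return (max(times) - min(times)) / max(times) <= tol
```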

(11)

What we already know …* (2)

Performance Metric | Initialization | Approximation | Iteration

① All the p computational units execute N1/p 2D FFT Batches in parallel: ni = N1/p, 1 ≤ i ≤ p
② Record the execution times: ti(N1/p)
③ IF max1≤i,j≤p {(ti(N1/p) − tj(N1/p)) / ti(N1/p)} ≤ ε
   THEN the even distribution solves the problem and the algorithm stops;
   ELSE the absolute speeds of the devices are calculated, such that:
   si(N1/p) = (N1/p) / ti(N1/p), 1 ≤ i ≤ p

*Lastovetsky, A., and R. Reddy, "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, Netherlands, Lecture Notes in Computer Science, vol. 6043, Springer, pp. 91-101, 2010.
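The initialization round above can be sketched as follows; `run` is an assumed caller-supplied benchmark callback (device index, chunk count → measured time), not an interface defined on the slides:

```python
def initialization_round(p, n1, run, eps=0.05):
    """Even split ni = N1/p, measure ti(N1/p), and either accept the even
    distribution (relative spread within eps) or return the absolute
    speeds si(N1/p) = (N1/p)/ti(N1/p) for the approximation phase."""
    chunks = n1 // p
    times = [run(i, chunks) for i in range(p)]
    spread = max(abs(ti - tj) / ti for ti in times for tj in times)
    if spread <= eps:
        return 'even', [chunks] * p
    return 'speeds', [chunks / t for t in times]
```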

(12)

What we already know …* (3)

Performance Metric | Initialization | Approximation | Iteration

① The performance of each device is modeled as a constant: si(x) = si(N1/p), 1 ≤ i ≤ p

(13)

What we already know …** (4)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{si(N1/p)})
   L: (0,0), (N1/p, mini{si(N1/p)})
② Let xi(U), xi(L) be the intersections with si(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5

**Lastovetsky, A., and R. Reddy, "Data Partitioning with a Functional Performance Model of Heterogeneous Processors", …

(14)

What we already know …** (5)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{si(N1/p)})
   L: (0,0), (N1/p, mini{si(N1/p)})
② Let xi(U), xi(L) be the intersections with si(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)

(15)

What we already know …** (6)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{si(N1/p)})
   L: (0,0), (N1/p, mini{si(N1/p)})
② Let xi(U), xi(L) be the intersections with si(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
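The steps above amount to a bisection on the slope of a line through the origin. A runnable sketch, assuming the speed functions si(x) are supplied by the caller, that si(x)/x decreases as x grows, and that step 4 tests Σ xi(M) ≤ N1:

```python
def line_intersection(speed, slope, x_max):
    """Largest chunk count x with speed(x) >= slope * x, i.e. where the
    line through the origin meets the speed curve (linear scan for clarity)."""
    best = 0
    for x in range(1, x_max + 1):
        if speed(x) >= slope * x:
            best = x
    return best

def bisect_partition(speed_fns, n1, iters=60):
    """Geometric bisection of the slides: narrow the angle between the
    upper line U (steep slope, small xi) and the lower line L (shallow
    slope, large xi) until the intersections xi sum to N1."""
    lo = min(f(n1) / n1 for f in speed_fns)   # L: shallow slope
    hi = max(f(1) for f in speed_fns)         # U: steep slope
    for _ in range(iters):
        mid = (lo + hi) / 2                   # line M bisecting U and L
        if sum(line_intersection(f, mid, n1) for f in speed_fns) <= n1:
            hi = mid                          # U = M
        else:
            lo = mid                          # L = M
    return [line_intersection(f, hi, n1) for f in speed_fns]
```

With two constant-speed devices (2 and 1 chunks per time unit) and N1 = 9, the returned partition is [6, 3], which equalizes the execution times.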

(16)

What we already know … (7)

Performance Metric | Initialization | Approximation | Iteration

A. The speeds of each device are modeled as constants equal to the observed values [1]: si(x) = si(ni), 1 ≤ i ≤ p

(17)

What we already know … (8)

Performance Metric | Initialization | Approximation | Iteration

A. The speeds of each device are modeled as constants equal to the observed values [1]: si(x) = si(ni), 1 ≤ i ≤ p
B. Functional performance modeling using piecewise linear approximation [2]

[1] Galindo I., Almeida F., and Badía-Contelles J.M., "Dynamic Load Balancing on Dedicated Heterogeneous Systems", In EuroPVM/MPI 2008, Springer, pp. 64-74, 2008.

(18)

Goals in the CPU + GPU Environment

Online Performance Modeling
– Performance estimation of all heterogeneous devices DURING THE EXECUTION – no prior knowledge of the performance of an application is available on any of the devices
– Modeling of the overall CPU and GPU performance for different problem sizes (plus kernel-only GPU performance)

Dynamic Load Balancing
– Optimal distribution of computations (Primitive Jobs)
– Partial estimations of the performance should be built and used to decide on the optimal mapping
– The returned solution should provide load balancing within a given accuracy

Communication Awareness
– Modeling the bandwidth of the interconnection busses DURING THE EXECUTION – to select problem sizes that maximize the interconnection bandwidth
– The algorithm should be aware of the asymmetric bandwidth of Host-to-Device and Device-to-Host transfers

CPU+GPU Architectural Specifics
– Make use of environment-specific functions to ease performance modeling
– Asynchronous transfers and CUDA streams to overlap communication with computation
– Be aware of the diverse capabilities of different devices, and also of devices of the same type (e.g. GT200 vs. Fermi)

(19)

Building Full Performance Models

Full Performance Models: Per-Device Real Performance
– Experimentally obtained using CPU+GPU platform specifics (pageable/pinned memory)
– Exhaustive search on the full range of problem sizes
– High cost of building it!

Experimental Setup
– CPU: Intel Core 2 Quad, 2.83 GHz per core
– GPU: nVIDIA GeForce 285GTX, 1.476 GHz per core

2D FFT Batch
– FFT type: complex, double precision
– Total size (N1×N2×N3): 256×256×256
– High-performance software: …

(20)

… in CPU + GPU Environment (1)

Performance Metric | Initialization | Approximation | Iteration

Problem Definition: Primitive Job
– n_in – input data requirements
– n_out – output data requirements
– n_size – problem size
– n_flops – #floating-point operations

Metric: FLOPS
– p devices: P1, P2, …, Pp
– N1 – total #Primitive Jobs (chunks)
– Device load [chunks]: n1, n2, …, np
– ni = {ni_in, ni_out, ni_size, ni_flops}
– FLOPS: Perfi(ni) = ni_flops / ti(ni), 1 ≤ i ≤ p

Solution: Optimal Load Balancing
– Lies on the straight line that passes through the origin of the coordinate system, such that (with xi_flops = nf(xi), and xi taken as ni_size):
  nf(x1)/Perf1(x1) = … = nf(xp)/Perfp(xp)
  x1 + x2 + … + xp = N1

(21)

… in CPU + GPU Environment (2)

Performance Metric | Initialization | Approximation | Iteration

① All the p computational units execute N1/p 2D FFT Batches in parallel: ni = N1/p, 1 ≤ i ≤ p
② IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
③ Split the given computational load into streams and use asynchronous transfers to overlap communication with computation

(22)

… in CPU + GPU Environment (3)

Performance Metric | Initialization | Approximation | Iteration

① All the p computational units execute N1/p 2D FFT Batches in parallel: ni = N1/p, 1 ≤ i ≤ p
② IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
③ Split the given computational load into streams and use asynchronous transfers to overlap communication with computation

Streaming
– Subdivide the ni computational chunks using the DIV2 strategy
  – No prior knowledge of the performance of an application!
  – The next stream has half the load of the previous stream
  – The algorithm may continue splitting the workload until the last stream is assigned a load equal to 1
– Bandwidth-aware DIV2 strategy
  – The interconnection bandwidth depends on the amount of data to be transferred, not on the application-specific demands
  – Run small pre-calibration tests for Host-to-Device and Device-to-Host transfers
  – Tests can be stopped when saturation points are detected, or when transfers reach a desired value (e.g. 60% of the theoretical bandwidth)
  – Case study: n_min_size = 4

[Figure: GPU workload subdivided into Stream 0, Stream 1, Stream 2]
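A minimal sketch of the DIV2 subdivision, assuming leftover chunks are folded into the final stream (the slides do not spell out the remainder handling):

```python
def div2_streams(n, n_min_size=1):
    """Split n chunks into streams where each next stream carries half the
    load of the previous one, never going below n_min_size (the
    bandwidth-aware variant; n_min_size=1 gives the plain DIV2 split)."""
    streams = []
    remaining = n
    load = n // 2
    while load >= n_min_size and remaining - load > 0:
        streams.append(load)
        remaining -= load
        load //= 2
    streams.append(remaining)      # whatever is left forms the last stream
    return streams
```

For example, `div2_streams(8)` yields `[4, 2, 1, 1]`, and with the case-study minimum `div2_streams(16, 4)` yields `[8, 4, 4]`.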

(23)

… in CPU + GPU Environment (4)

Performance Metric | Initialization | Approximation | Iteration

① All the p computational units execute N1/p 2D FFT Batches in parallel: ni = N1/p, 1 ≤ i ≤ p
② IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
③ Split the given computational load into streams and use asynchronous transfers to overlap communication with computation
④ Execute & record the execution times: ti(N1/p)
⑤ IF max1≤i,j≤p {(ti(N1/p) − tj(N1/p)) / ti(N1/p)} ≤ ε
   THEN the even distribution solves the problem and the algorithm stops;
   ELSE the performance of the devices is calculated, such that: Perfi(N1/p) = nf(N1/p) / ti(N1/p), 1 ≤ i ≤ p

(24)

… in CPU + GPU Environment (5)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p

(25)

… in CPU + GPU Environment (6)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth

(26)

… in CPU + GPU Environment (7)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth

(27)

… in CPU + GPU Environment (8)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance

(28)

… in CPU + GPU Environment (9)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results

(29)

… in CPU + GPU Environment (10)

Performance Metric | Initialization | Approximation | Iteration

① Traditional approach: the performance of each device is modeled as a constant: Perfi(x) = Perfi(N1/p), 1 ≤ i ≤ p
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results

(30)

… in CPU + GPU Environment (11)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2

(31)

… in CPU + GPU Environment (12)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
⑤ Employ the streaming strategy on the …

(32)

… in CPU + GPU Environment (13)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
⑤ Employ the streaming strategy on the remaining load

Streaming
– Streaming strategy
  – Results obtained using the DIV2 strategy consider the application characterization (e.g. communication-to-computation ratio)
  – The workload size for the next stream should be chosen so as to OVERLAP TRANSFERS WITH COMPUTATION in the previous stream
– Bandwidth-aware streaming strategy
  – Reuses the minimal workload size from the DIV2 strategy (obtained via the Host-to-Device and Device-to-Host tests)
  – IF (n_curr ≥ n_min_size) THEN use the strategy (continue overlapping) ELSE restart the strategy on the n_curr load
  – In case the load drops under n_min_size, the strategy is restarted on the remaining load
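The restart rule above can be sketched as follows; `overlap_ratio`, which sizes the next stream so that its transfer hides behind the previous stream's computation, is an assumed application-derived parameter (the slides derive it from the communication-to-computation ratio):

```python
def streaming_schedule(n, overlap_ratio, n_min_size):
    """Bandwidth-aware streaming sketch: each next stream is overlap_ratio
    times the previous one; when the next load would drop under n_min_size,
    the strategy restarts on the remaining load, as on the slide."""
    streams = []
    remaining = n
    curr = max(int(n * overlap_ratio), n_min_size)
    while remaining > 0:
        if curr < n_min_size:
            curr = max(int(remaining * overlap_ratio), n_min_size)  # restart
        curr = min(curr, remaining)
        streams.append(curr)
        remaining -= curr
        curr = int(curr * overlap_ratio)   # next stream overlaps this one
    return streams
```

For example, `streaming_schedule(100, 0.5, 4)` produces a halving sequence that restarts once the load falls under the minimal workload size.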

(33)

… in CPU + GPU Environment (14)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
⑤ Employ the streaming strategy on the …

Streaming – Case Study: 2D FFT Batch
– About 83% of the total execution time goes on data transfers
  – Host-to-Device transfers – 46%
  – Kernel execution – 17%
  – Device-to-Host transfers – 37%
– Bandwidth-aware streaming strategy
  – n_min_size = 4; overlap Host-to-Device transfers and kernel execution of two streams

(34)

… in CPU + GPU Environment (15)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
⑤ Employ the streaming strategy on the calculated …

(35)

… in CPU + GPU Environment (16)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results

(36)

… in CPU + GPU Environment (17)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream

(37)

… in CPU + GPU Environment (18)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart

(38)

… in CPU + GPU Environment (19)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart

(39)

… in CPU + GPU Environment (20)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(40)

… in CPU + GPU Environment (21)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(41)

… in CPU + GPU Environment (22)

Performance Metric | Initialization | Approximation | Iteration

① Draw the Upper (U) and Lower (L) lines through the following points:
   U: (0,0), (N1/p, maxi{Perfi(N1/p)})
   L: (0,0), (N1/p, mini{Perfi(N1/p)})
② Let xi(U) and xi(L) be the intersections with Perfi(x).
   IF there exists i with xi(L) − xi(U) ≥ 1 THEN go to 3 ELSE go to 5
③ Bisect the angle between U and L by the line M, and calculate the intersections xi(M)
④ IF Σi xi(M) ≤ N1 THEN U = M ELSE L = M; REPEAT from 2
⑤ Employ the streaming strategy on the calculated …

(42)

… in CPU + GPU Environment (23)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(43)

… in CPU + GPU Environment (24)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(44)

… in CPU + GPU Environment (25)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(45)

… in CPU + GPU Environment (26)

Performance Metric | Initialization | Approximation | Iteration

① Refine the performance models with the newly obtained results
② GPU-specific modeling, using the values obtained from the streaming execution:
   – Host-to-Device bandwidth
   – Device-to-Host bandwidth
   – GPU kernel performance
③ Incorporate the streaming results
   – for each stream
   – for each stream restart
   – for every stream combination

(46)

Summary and Future Work

Collaborative Environment for Heterogeneous Computers

Traditional approaches for performance modeling
– Approximate the performance using a number of points equal to the number of iterations
– In this case, 3 points per device

Presented approach for performance modeling
– Models the performance using more than 30 points, in this case
– Communication-aware: schedules with respect to the limited and asymmetric interconnection bandwidth
– Employs streaming strategies to overlap communication with computation across the devices
– Builds several per-device models at the same time
  – Overall performance for each device + streaming GPU performance
  – Host-to-Device bandwidth modeling
  – Device-to-Host bandwidth modeling
  – GPU kernel performance modeling

(47)

Thank you