energy eﬃciency

(1)

energy eﬃciency

João Guerreiro

^∗

, Aleksandar Ilic, Nuno Roma, Pedro Tomás

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal, Rua Alves Redol, 9, Lisboa 10 0 0-029, Portugal

a r t i c l e i n f o

Article history:

Received 30 November 2016 Revised 20 August 2017 Accepted 1 February 2018 Available online 9 February 2018 Keywords:

GPGPU

Application classiﬁcation DVFS

Optimal frequency Energy savings

a b s t r a c t

TheincreasingimportanceofGPUsashigh-performanceacceleratorsand thepowerand energyconstraintsofcomputingsystems,makeitfundamentaltodeveloptechniquesfor energyeﬃciencymaximizationofGPGPUapplications.Amongseveralpotentialtechniques, dynamicvoltageand frequencyscaling (DVFS)standsoutas oneofthemostpromising approaches. Hence,novel DVFS-awareperformance and power classiﬁcation models are hereinproposed thatcorrelate application characteristicsand GPU architecturefeatures.

Inparticular,byanalysingtheutilizationofgraphicsandmemorycomponentsatasingle voltageand frequency levels,theproposed classiﬁcationmethodologies areabletopre- dictthe impact of DVFSonGPGPU applications execution timeand power and energy consumption.TheaccuracyoftheproposedapproachisvalidatedontwomodernNVIDIA GPUsfrom theMaxwelland Pascal generations,byrelying on35benchmarksfrom the Rodinia,Polybench,Parboil,SHOCand CUDASDK suites.Experimentalresultsshow that theproposedapproachcantypically predicttheoptimaloperatingfrequenciesofgraph- icsandmemorysubsystems,attainingupto36%energysavings(averageof16%),which correspondtoanaveragedeviationof0.74%regarding theoptimalcase.Moreover,when consideringamaximumperformance penaltyof10%,upto 26%energysavings arestill attained.

1. Introduction

Generalpurposeacceleratorshavealreadygaineda firmpresence inmostmodern high-performancecomputing(HPC) systems.In particular, theGraphics Processing Units (GPUs),are commonlyused toincrease the resulting systemperfor- mancewhenexecutingapplicationsfrommanycommercialandscientificdomains.Thiscanbeeasilyobservedatthemost recentversion oftheTOP500list(June2017):88ofthesesystemsare equippedwithaccelerators,where72ofthemuse GPUs.However,suchanestablishedadoptionofGPUsintensifiestheimportancetofindreliablemechanismsthatensurethe maximumefficiencyofthecomputingsystem, bothintermsofperformanceand(mostimportantly)energyconsumption.

Accordingly,signiﬁcant researcheffortsare beingputforth intheinvestigationofDynamicVoltageandFrequencyScaling (DVFS)techniques(oneofthemostusedpowermanagementstrategies),duetotheinherentpotentialforsigniﬁcantpower andenergysavingsinmanyofthecomputersystemcomponents[1–5].

Studyingtheeffects ofDVFS on theresultingenergy-eﬃciency ofcomputingsystems requiresanalysing itsimpact on differentapplications,asgeneral-purposeapplicationscanlargelyvaryinthewaytheyusethecomputationalandmemory

∗ Corresponding author.

E-mail address: [email protected] (J. Guerreiro).

(2)

resourcesofthedeviceswheretheyareexecuting[6,7].Whilesomeapplicationsperformalargenumberofcomputational operationsforeach loadeddata(more compute-intensive),otherapplicationsmayperformveryfew(moreI/O-ormemory- intensive).Althoughtheresultingperformanceoftheformertypeofapplicationsismorelikelytoscaleproportionallywith thefrequencyofthecores(highestfrequency ≡ bestperformance),thisbehaviourisnotguaranteedforthelattersetofap- plications.Additionally,moststandardapplicationsfallsomewherebetweentheseextremes,increasingthecharacterization complexity. However, knowingthe type of applicationexecuting can leadto interesting opportunities forenergysavings.

For instance,certain applicationscan be executed atlower frequencylevels withnegligibleperformance drop-off. Identi- fying theseclasses ofapplications can lead to signiﬁcant energysavings, since lower operating frequencyleadsto lower powerconsumption.Hence,toconductthistypeofanalysis,itisfundamentaltoadoptmethodologiesthatallowaproper classiﬁcationoftheapplicationworkloads.

Some previous studies onworkloadcharacterization intheGPU-domainused PrincipalComponentAnalysis (PCA)and hierarchicalclustering[7–9],whichmaketheunderstandingofeachresultingclassharder(fromthecomputingarchitecture perspective)anddonotnecessarilyresultinanaccurateenergy-awareclassiﬁcation.OtherworksdependonGPUsimulators andperformancecountersthatarenon-existentinrealhardwaredevices,renderingtheseapproachesimpossibletoreplicate in realsystems.Furthermore,most existingGPU simulatorsare based onthe NVIDIATesla andFermimicroarchitectures, whichhavealreadybeenfollowedbyKepler(2013),Maxwell(2014)andPascal(2016).AdifferentapproachtowardsDVFS consists in thedevelopment ofaccurate performance andpower models that allow predictingthe GPUbehaviour under differentvoltageandfrequencyscenarios[10–21].However, whiledetailedperformanceandpowermodelsmayultimately produce moreaccurate results,the herein presentedwork showsthat applicationclassiﬁcation isa moreconvenient and viable approach not onlyto identifyremarkable energy-savingsopportunities, butalsoto achieve near-optimalresults in termsofenergysavings.

Accordingly, themaincontributionofthepresentedresearchistoprovideanewclassificationmethodologyforGPGPU applications, allowing an easy identification ofwhich applicationscan benefit fromDVFS, interms of energysavings. To that goal,separate methodologiesto classifyGPUapplicationsare proposed,focusing ontheeffectsofDVFS ontheir performance andpower consumption,based onhow theapplications exploit thedifferentGPU resources.The classifiersare trainedoffline,usingacollectionofsyntheticbenchmarks,andcanbeusedtoclassifyanyapplicationusinghardwareper- formance eventsgatheredduringits executionon asingleoperatingfrequency. Theresulting classis abletocharacterize howthefrequencyscalingwillaffecttheapplicationexecutiontimeorpowerconsumption,foralltheremainingoperating frequencies.

The proposedmethodologywasvalidatedusingasetof35applicationsfromdifferentrelevantbenchmarksuites(Par- boil [22], Rodinia[23], SHOC [24], Polybench [25] and CUDA SDK[26]), on two modern GPUdevices: GTX Titan X and Titan Xp,fromthe Maxwell andPascalmicroarchitectures, respectively.The experimental resultsshow that theproposed methodologiesareabletoaccuratelyandconsistentlyclassifytheconsidered GPUapplicationsintermsoftheirbehaviour inperformance,powerandenergyconsumption.Theproposedclassificationmethodologyallowsfindinganear-optimalpair ofoperatingfrequencies(coreandmemory),whichcanresultonaverageenergy-savingsof16%(20%),andonpeakenergy- savings that are ashighas36% (32%) on the Maxwell GPU(Pascal GPU). Additionally,in situations wherea decrease in theprocessingperformanceisnotallowedorhighlydiscouraged,theproposedmethodologyisstillabletofindrealoppor- tunities forenergy-efficiency withalimitedperformance trade-off (e.g., <10%),resultinginaverage energy-savingsof9%

on theMaxwellGPU(16%onPascal),althoughinsome classessuchsavings canevenbeashighas22% withonlya 0.2%

performancetrade-off.Accordingly,themostsigniﬁcantcontributionsofthispaperarethefollowing:

• Analysis oftheimpact ofDVFSon theperformance andpowerconsumptionofdifferenttypes ofGPUbenchmarkson realhardware;

• Novelapplication-classiﬁcationschemebasedonGPUperformanceandpowermetrics,abletocharacterizetheimpactof DVFS ontheexecutionofdifferenttypesofapplications,forawiderangeofGPUoperatingfrequencies,validatedwith standardapplicationsonrealGPUdevices;

• Application ofthe proposedclassiﬁcationmethodologiestooptimize:(i) theconsumedenergy;(ii)theenergyv.s.performancetrade-off;and(iii)theexploitableenergy-savingsranges.

Therestofthispaperisorganizedasfollows.InSection2,theprevalentGPUarchitecturesarebrieflydiscussed,aswell asthe impact ofDVFS onthe performance andpowerconsumption ofdifferent applications.Section 3presentsthe pro- posedmethodologiestoclassify GPUapplicationsintoclasseswithsimilarcharacteristics.Section4validatestheproposed methodologies usinga set of35 standard applications.Finally,Section 5applies the proposed methodologiesto findthe best operatingfrequenciesforthedifferentapplications,inordertomaximize theenergy-savings.Section6comparesthe proposedmethodologywiththecurrentstateoftheart,andfinallySection7concludesthemanuscript.

2. AnalysisoftheDVFSimpactonGPGPUapplications

Similarly to other computingsystems, thearchitecture of currentGPUdevices allows forthedifferent componentsto be clockedat distinct andindependent frequencies. Fig. 1 presentsa simpliﬁed representation ofa modern GPUdevice,

(3)

Fig. 1. Existing frequency domains in modern GPU devices ( e.g. NVIDIA Kepler, Maxwell and Pascal GPUs).

Fig. 2. DVFS impact on two distinct Rodinia applications on a NVIDIA GTX Titan X, where F CoreRef= 1164 MHz .

containingtwodistinctfrequencydomains:theGraphics(orCore)domainandtheMemorydomain.¹TheGraphicsdomain includesboth thestreaming-multiprocessors (SMs)andthe L2cache,whiletheMemory domainincludesonlythe device main memory,i.e.DRAM memory.Scaling thefrequencyofeach domain canhave differentresultsonan applicationex- ecution,largely dependingonthe consideredapplicationcharacteristics [27].While itcan be expectedthata decreaseof thecorefrequency(F_Core)andvoltage(V_Core) willcausethekernel² executiontimetoincrease (T∝_F¹

Core)andtheresulting power consumption todecrease (P_dynamic∝V_Core²F andP_static∝Ve^γ^V^Core), the applicationperformance andpower consumption overdifferentfrequencies are highlydependent onthe waytheapplication exploits eachof theGPUsubsystems. In fact,it hasbeenshownthataccurately predictingtheimpact ofDVFSin theexecutiontime orpower consumptionoften requirestheusageofcomplexpredictivemodels[28,29].Accordingly,itisimportanttounderstandtheeffectsofDVFSon both theexecutiontime (t) andpowerconsumption (P), inordertobe ableto extractmeaningful conclusionsrelativeto thebehaviouroftheresultingenergyconsumption(E)overdifferentfrequencies(E=P×t).

To illustrate the referred problem, Fig. 2 presents one example withtwo applications that have their execution time affectedverydifferentlywhenthecoreandmemoryfrequenciesarescaled,namelyHotspot(fromRodinia)andBlackscholes (fromCUDASDK).ForHotspot(seeFig.2a),thekernelexecutiontimealwaysscalesinverselywiththecorefrequency(F_Core).

However, fortheBlackscholes case(see Fig.2b),it ispossible tomaintainthe overallkernel executiontime whilescaling downthecorefrequency,aslongasthememorykeepsoperatingatthehigherfrequency.Itcanalsobeobservedthatthe effects ofscaling the memoryfrequency inthe executiontime are also very differentinthe two applications. Whilefor theBlackscholes(Fig.2b)thereisasigniﬁcantincreaseintheexecutiontimewhenthememoryfrequencyisdecreased,the performanceofHotspot(Fig.2a)isnotaffectedbythesamechange.

1This setup is currently used by several NVIDIA GPU microarchitectures, such as Kepler, Maxwell and Pascal.

2Kernel: routine to be executed in a massively parallel fashion on a GPU device by multiple threads.

(4)

Fig. 3. Examples of DVFS impact on overlapped instructions. The instructions pairs ( Mem1 , Comp1 ) and ( Mem2 , Comp2 ) require full synchronization.

Accordingly, this manuscriptproposestwo separate methodologiesto classify anyapplication accordingto theimpact of DVFS in the resulting execution time and power consumption. Finally,by combining the two approaches it is possible to infer how the energy-consumption will change when the frequency of the cores and of the memory are scaled.

Sections2.1and2.2describetheDVFS impactontheapplicationsexecutiontimeandpowerconsumption,respectively.

2.1. DVFSimpactontheapplicationsperformance

TheimpactofDVFSonanapplicationexecutiontimeisacomplexproblemthatrequiresdeepunderstandingoftheGPU architecture.Inparticular,oneoftheGPUmaindesigngoalsconcernstheuseofmultiplegroupsofparallelthreads(warps in NVIDIAnomenclature) tohide instruction latency.However, inmany applicationsit isnot always possibleto hidethe instructionlatencywithotherwarps.Therefore,whenanalysingtheapplicationsperformance,fromtheperspectiveoftheir bottlenecksandlimitingfactors,mostworkstendtoconsidertwo maintypesofapplications[3,30,31]:(1)compute-bound, wheretheexecutiontimeismainlydeterminedbytheperformanceoftheprocessingcomponents;and(2)memory-bound, wheretheexecutiontimemainlydependsonthebandwidthandlatencyofthememoryhierarchywhensatisfyingmemory accessrequests.Accordingly,theadoptedsetupintermsofthecoreandmemoryoperatingfrequencies(F_CoreandF_Mem)will resultindifferentperformanceversuspowerpatternsifacertainkernelismorememory-boundormorecompute-bound.

Moreover, while one kernel maybe compute-bound ata certain operating frequency state, it may become memory- bound atdifferentcoreand/or memoryfrequencystates.Toillustrate such condition,Fig. 3apresentsthe relativeweight variationofthememoryandcomputeoperationsofone givenkernelforthreedifferentscenarios.Atfrequencystate(F_C1, F_M1),theexecutionofbothMem1 andComp1instructionsoccur atthe sametimeandbothﬁnishtheir executionatthe same instant.In thisexample,thesubsequentinstructions Mem2andComp2 requirefull synchronizationbutsinceboth instructionsﬁnishatthesametime,thelatencyofthethreadswaitingtobeissuedisfullyhiddenbythethreadscurrently executing.However,ifthecorefrequencyisincreasedtoahighervalue(F_C2)and/orthememoryfrequencyisdecreasedto a lowervalue (FM2)such that ^F_F^M2

C2 <^FF^M1_C1, therewillbea timeintervalwhereonlythe Meminstructionsareexecuting on the GPU,meaningthatthere arenotenough threadsexecuting Compinstruction thatcan hidethelatencyofthethreads waitingonpending memoryoperations.Hence,atfrequencystate (F_C2,F_M2) theapplicationisconsideredto bememory- bound,sinceitsperformancebottleneckdependsonthelatencyofthememoryoperations.If,onthecontrary,theoperating frequenciesweresettostate(F_C3,F_M3),suchthat ^F_F^M3

C3 >^FF^M1_C1,theexecutionoftheCompinstructionswouldbelongerthan theMeminstructions,meaningtheperformanceislimitedbythecomputeinstructions,thusresultinginacompute-bound classiﬁcation.

As aresult,thecommonlyusedbinaryclassiﬁcation (compute ormemory-bound) maynot bevalidforall combinations of frequency levels. In the remainder of this work an application is considered memory-bounded at frequency F_Mem_{_}_i if with that memory frequency, there is at least one core frequency (within the range allowed by the device) where the applicationperformanceislimitedbytheaccessestotheDRAM.Ontheotherhand,ifwithmemoryfrequencysettoF_Mem_{_}_i theapplicationisneverboundedbyDRAMaccesses(forallcorefrequencies),theapplicationisconsideredtobecompute- boundedatfrequencyF_Mem_{_}_i.

(5)

Fig. 4. DVFS-aware performance classes depending on the DRAM and Graphics utilization.

Inthetwo extremescenarios,wherethememorythroughputisextremely low comparedwiththecore(^F_F^Mem

Core →0)or extremely high(^F_F^Mem

Core →∞) itistrivialto classifyanygivenapplicationinthe memory-boundandcompute-boundclasses, respectively.Accordingly,astheratio ^F_F^Mem

Core isdecreasedfrom∞(forexamplebyﬁxingF_CoreconstantanddecreasingF_Mem), moreandmore applicationsinitially classiﬁedascompute-boundwillstart becoming memory-boundateach memoryfre- quencylevel.Fig.3bpresentsone exampleofsucha scenario,wheretheperformancebehaviour offourdifferentkernels changeasthememoryfrequencyvaries.Itisimportanttostressthatsinceall possiblevaluesof ^F_F^Mem

Core canbeachievedby ﬁxingone ofthevaluesconstantandscalingtheother,forthesakeofsimpliﬁcationthevalueofF_Coreisconsidered constant,withoutlossofgenerality. Whenthememoryfrequencyisscaleddownbetweenanytwofrequencylevels(e.g. from F_Mem5toF_Mem3),theamountoftimerequiredtosatisfyallDRAMaccesseswillincrease,andtherefore,oneoffourpossible scenarioswilloccur:

1. Kernel 1: The application was already memory-bound at the highest memory frequency (F_Mem5), so it will remain memory-boundatanyotherlowermemoryfrequency;

2. Kernel2:TheapplicationwaswellbalancedasboththeMemandCompinstructionsstartandﬁnishatthesametime;

sinceadecreaseofthememoryfrequencywillmostlyaffecttheMeminstructions,theapplicationwillbecomememory- boundatF_Mem3;

3. Kernel3:Theapplicationwascompute-bound,butassoonasthememoryfrequencyreducestovalueslowerthanF_Mem4 theperformanceoftheapplicationstartsbeinglimitedbytheDRAMbandwidth,thereforebeingmemory-boundatfre- quencyF_Mem3;

4. Kernel4:Theapplicationwascompute-boundatF_Mem5 andit remainscompute-boundatthelowermemoryfrequency F_Mem3.

Hence, considering that one ofthe objectives of thiswork isto provide a DVFS-aware classiﬁcation forthe resulting performanceofGPUapplications,thisclassiﬁcationmustbeabletocharacterizehowtheexecutiontimeofeachapplication changeswhencoreandmemoryDVFSisapplied.Therefore,theproposedmethodologytocharacterizetheimpactofDVFS intheapplicationsperformancewillhavetodependonthememoryfrequencylevelsoftheGPUdevice.

SinceGPUdevicesdonotprovideanyperformancecountersthatimmediatelydeﬁnewhattypeofapplicationisrunning andhow its executionis affected byfrequency scaling,it is importantto choose therelevant countersthat can be used to indirectlyinfer thisinformation.The impact ofDVFS inthe executiontime of an applicationdependsmostlyon how it utilizesthe GPU resources. Inparticular, since GPUapplications are usually able to exploit theinherent parallelism of thedevice, theapplications executiontime isa resultoftheoverlapof theseveralinstructionsexecuted by themultiple warps.Since thememoryandcomputeinstructionsaregenerallyoverlappedduringtheexecution, theimpactofDVFS on anapplicationperformancewillbedependentontheutilizationofboththeMemoryandGraphicsresources.Additionally, since differentinstructions canbe executed on distinct functional unitsofthe Graphics domain (singleprecision, double precision, specialfunction, load/stores,etc.), the utilizationofthe Graphics domainresources willbe mostlyrelatedwith thecomponentwiththehighestutilizationofthatdomain. Hence,dependingontheutilizationratiooftheMemoryand Graphicsresources(GraphicsUtil^DRAMUtil^..),differentclassesofapplicationscanbeconsidered.Bytakingintoconsiderationwhichofthe twoGPUdomainsismoredominant,suchclassescanbegroupedinthreemainlevels(seealsoFig.4):

1. Applicationsthatarememory-boundatthehighestallowedmemoryfrequencylevel(F_Mem_{_}_high),andthereforeatallother lowermemoryfrequencylevelsiftheyexist(seeClassTAinFig.4).TheseapplicationshaveaveryhighDRAMutilization andtheir executiontime ishighlyaffected bychangesinthememoryfrequency.Additionally,asaconsequenceofthe computeinstructionsbeingoftenstalledduetomemorydependencies, theyarealsocharacterizedby alowutilization ofthemajorgraphicscomponents(sharedmemory,ﬂoating-pointunits,etc.).

2. Applicationsthatarecompute-boundatF_Mem_{_}_low,andthereforeatallotherhighermemoryfrequencylevels.Theexecu- tionoftheseapplicationsisnotsigniﬁcantlyaffectedbychangesinthememoryfrequency,butitishighlydependenton

(6)

Table 1

Power consumption for matrixMul- CUBLAS kernels with different iterations.

15 iters 1500 iters (3505, 975) 226 W 239 W (810, 595) 116 W 120 W

thecoreoperatingfrequency.Such applications(seeClassTCinFig.4) arecharacterized bya highgraphics utilization andalowmemoryutilization.

3. Applicationsthat scalewithbothmemoryandcorefrequencies(see ClassTBinFig.4). Ifmultiplememoryfrequency levels areavailable, itis possibletosubdivide thisclassinto multipleclasses,whichwillincludethe applicationsthat arecompute-boundatF_Mem_{_}_highandchangetomemory-boundatthelowermemoryfrequencylevels.Byconsideringone distinctclassforeachadditionalmemoryfrequencylevel,itwillallowthedistinctionbetweenapplicationsthatbecome memory-boundatdifferentmemoryfrequencylevels,resultinginbetteraccuracyinthecharacterization.

Section3willdetailfurtherhowtheboundariesbetweentheclassescanbeobtained,withanexampleoftheutilization oftheproposedapproachonarealGPUdevice.

2.2. DVFSimpactontheapplicationspowerconsumption

As it waspreviously seen, theexecutionof instructionsondifferentcomponents oftheGPU isoftenpartially orfully overlapped. However,the powerconsumptionoftheseveraldifferentcomponentscannotbe hiddenormasqueraded,and mustbealwayscombinedtogetherinordertoobtainthetotalpowerconsumptionoftheGPU.

When considering the instantaneous power consumption (P_Total) [18] of a CMOS circuit, one must take into account both its dynamic (P_Dynamic) and static (P_Static) fractions, as P_Total=P_Dynamic+P_Static. Additionally, when considering both voltage and frequency scaling, these two fractions of the total power consumption have different scaling behaviours, as P_Dynamic∝V²F[32]andP_Static∝VIˆ_leak[33],whereFdenotestheoperatingfrequency,VthechipsupplyvoltageandIˆ_leakisthe normalizedleakagecurrentforasingletransistor,dependentonthethresholdvoltage(V_th) andonthetemperature.These formulationsrelatetotheinstantaneouspowerconsumptionofeachseparatedevicecomponent.Inadevicewithmorethan asinglefrequencydomain(e.g.GPU)thetotalpowerconsumptioncanbeexpressedas:

PGPU=PGraphics

(

^Fcore,Vcore

)

+PMemory

(

^Fmem,Vmem

)

, (1)

where P_Graphics andPMemory representthe powerconsumption ofthegraphics andmemorydomains, respectively.Infact, each ofthesepartscanbefurther decomposedintothe severalparcelsofthepowerthat isconsumedbyeach ofthein- ternalcomponents ofthat domain(processing cores,LD/STunits,sharedmemories, L2cache, etc).However,since device manufacturersdonotfullydisclosethedesignofeacharchitecture,severaloftheparametersthatarerequiredforanaccu- ratepowermodellingareusuallyunknown.Additionally,itisnoteasytomeasureorinfertheinstantaneousstatic/dynamic power parcelsincurrentGPUdevices.In fact,mostmodernGPUs onlyprovideone singlecounter toreport theinstanta- neous powerconsumptionofthe wholeGPUboard,combininginformationfrommultiple contributingdomains (Memory andGraphics)inan undisclosedmanner, makingita veryhard taskto distinguishall theseparateeffects contributingto that value. Accordingly, this workwill not focus oncharacterizing each individual parcelof thepower consumption ofa givenapplication,butrathertheaveragetotal,atthedevicelevel,duringthewholeapplicationexecution.

Finally, toanalyze DVFS impact, thiswork focuseson therelative difference inthe GPUpower consumption between the differentfrequencylevels.The proposed methodologydoesnot take intoaccountthe influenceoftemperatureinthe GPUpowerconsumption.Whiletakingitintoconsiderationcouldpotentiallybebeneficialwhencreatingapowermodelof thearchitecture(sincethevalueofleakagecurrentIˆ_leakquadraticallyincreaseswithtemperature[18]),itislesssignificant whenclassifyingtheimpactofDVFSonthepowerconsumption(i.e.,therelativechangeinpowerconsumption).

Asanexample,Fig.5presentsthepowerconsumptionandtemperatureoftheGTXTitanXGPUduringtheexecutionof two variationsofthesameapplication, ontwofrequencyconﬁgurations: (FMem=3505MHz,FCore=975MHz) and(FMem= 810MHz,F_Core=595MHz).Thetwo applicationsexecutethesamematrixMulCUBLASkernel,repeatedadifferentnumber oftimes(15and1500),thusleadingtodifferentGPUactivetimes.Itcanbeseenthat,asexpected,theapplicationrunning foralongerperiodoftimecausestheGPUtemperaturetoincrease, whichinturnincreasestheGPUpowerconsumption.

However,althoughthereisapowerconsumptionincreaseduetotemperaturevariations(upto13Watthehighestfrequency setting,asshowninTable1),whenlookingattherelativechangesinthepowerconsumption,theimpactofthetemperature isalmostnon-existent(seeFig.6),whichmeansthetwoapplicationscanbeclassiﬁedsimilarly.

Regarding the effectsof DVFS, Fig.7 presentsan example ofthe overlap(in time) ofthe memoryandcomputational instructions oftwoidealapplications, andhowthetotal executiontime ofeachapplicationis affectedby corefrequency scaling (in both examples, thememory voltageand frequencylevels are ﬁxed to aconstant value). The applicationwith higher DRAMutilization(see Fig. 7a)hasits executionmostlydominatedby the memoryaccesses, speciﬁcallyto DRAM.

Since inthisexampleonly theF_Core ischanging (decreasing),the instantaneouspower consumptionremains constant for

(7)

Fig. 5. GPU power consumption and temperature obtained during the execution of the matrixMulCUBLAS kernel with square matrices of size 8192 with two different durations (with 15 and 1500 repetitions of the kernel, i.e. with 6 and 330 seconds, respectively). Here are presented the results for two distinct GPU conﬁgurations on the Titan X GPU: ( F Mem= 3505 MHz, FCore= 975 MHz) and ( FMem= 810 MHz, F Core= 595 MHz).

Fig. 6. Power consumption for matrixMulCUBLAS kernels with different number of iterations, on NVIDIA’s GTX Titan X.

all components whoseutilization isnot changedby the core frequencydownscaling, which includesboth the staticand dynamicpowerconsumptionofthecomponentsinthememorydomain.Furthermore,sincethetotalexecutiontime(T)does notchange,theaveragepowerconsumptionofthesecomponentswillalsoremainconstant.Onthecontrary,thegraphics domain willbe highly affectedby the corefrequency downscaling, resultingin an increase ofthe execution time ofthe compute instructions(T₂). Infact,by takingintoaccount theprevious considerationsrelative totheinstantaneous power, itcan be expectedthat thepower consumptionwill downscale onthe componentsutilizedby the computeinstructions.

Bycombining all theeffects (F_Core and V_Core downscaling),theaverage powerconsumption is expectedto decreasewith F_CoreV_Core².

Ontheotherhand,theapplicationwithhighSMutilization(seeFig.7b)presentsanincreaseofitstotalexecutiontime (fromTtoT)whenthecorefrequencyisdecreased.Similarlytothepreviouscase,theinstantaneouspowerconsumptionof

(8)

Fig. 7. Expected impact of DVFS on the execution time of two ideal applications (in both cases F Memis constant).

Fig. 8. DVFS-aware power classes depending on the DRAM and Graphics utilization. The example power consumption curves were obtained on NVIDIA’s GTX Titan X, for the Blackscholes, Lud, spmv and GEMM benchmarks.

theSMcomponentswilldownscalewithF_CoreV_Core².However,sincetheperformancebottleneckisnowassociatedwiththe compute instructions,theobservedincreaseofthetotalexecutiontimewillincrease theircontributiontothetotal power consumption ofthedevice.Inother words,thecontributionofthememorycomponentstothepowerconsumptionofthe wholeapplicationwilldecrease,sincethesamenumberofmemoryoperationsisbeingexecuted,butscatteredforalonger periodoftime,asthememorycomponentsnowspendmoretime idlingthanbefore(T₄ decreasesandT₅ increases).This willresultinagreaterdecreaseoftheaveragepowerconsumptionthanthepreviousF_CoreV_Core².

A similar approach could be applied to analyse the impact of memory frequency scaling in the average power con- sumptionofdifferentapplications,alsoresultingintwodistinctbehavioursdependingonhowtheGPUresourcesarebeing utilized.Accordingly,dependingonthecombinedresourceutilizationofthetwoGPUdomains,itispossibletoidentifythe classespresentedinFig.8.Thedeﬁnitionofeachproposedclass,dependingonhowtheirpowerconsumptionchangeswith thefrequencyofthecoresandmemory,is:

1. ClassPA:highersensibilitytomemoryfrequencychanges,lowersensibilitytocorefrequencychanges;

2. ClassPB:highersensibilitytomemoryfrequencychanges,highersensibilitytocorefrequencychanges;

(9)

Fig. 9. Overview of the proposed procedure to classify the DVFS impact on the performance/power consumption of GPU applications.

3. ClassPC:lowersensibilitytomemoryfrequencychanges,lowersensibilitytocorefrequencychanges;

4. ClassPD:lowersensibilitytomemoryfrequencychanges,highersensibilitytocorefrequencychanges.

Unliketheperformancecharacterization,thenumberofidentiﬁedclassesinthischaracterizationdoesnotneedtoscale with the number of memoryfrequency levels available in the GPU device. The only special case is when there is only onememoryfrequencyavailable(rarecaseinmodernGPUdevices),inwhichthereisnoneedtocharacterizeapplications regardingtheimpactofmemoryfrequencydownscalingontheir executions,thereforeonlybeingpossibletoidentifytwo classes(ClassesPAandPB).

Bytakingintoconsideration thesetofobservations laidout inthissection, thefollowingsectionapplies theproposed genericperformanceandpowerclassiﬁcationtoarealGPUdevice,andderivesamethodologytoobtaintheboundariesof theperformanceandpowerclasses(Section3.2).

3. Applicationcharacterizationprocedure

Thedescriptionoftheconceivedmethodologytoobtaintheperformanceandpower-awarecharacterizationsofanygiven applicationwillbeconductedbyusingarepresentativeexampleusingaNVIDIAGTXTitanXGPU.Theselectionofthispar- ticularGPUdevicetotesttheimpactofDVFSontheapplicationsexecutionarisesnotonlyfromitsgreaterofferofdifferent core frequencylevels (43 non-idle levels) andmemory (DRAM) frequency levels (3 non-idle levels),but also because it providesaconvenientinterfacetomeasurethepowerconsumptionatruntime.

Fig. 9 presents an overview of the proposed procedure to classify the DVFS impact on the performance (or powerconsumption)ofGPUapplications.Totraintheclassifiersacollectionofsyntheticbenchmarksisused,withtheirexecution timeandpowerconsumption measuredatallallowed operatingfrequencylevels.Inordertocharacterizehoweachappli- cationisstressingtheGPUcomponents,agroup ofhardwareperformanceeventsisalsomeasured atasingle(reference) operatingfrequency.Basedonthegatheredmetrics,anhierarchicalclusteringalgorithmisusedtodefine theclassofeach syntheticbenchmark,usedlaterinthesupervisedtrainingoftheneural-networkclassifier.Oncetheclassifieristrained,it allowsthecharacterizationofthatgivenarchitectureandhowDVFSimpactstheexecutionofdifferenttypesofapplications.

ThefollowingsectionsfurtherdetaileachofthestepsoftheprocedurepresentedinFig.9. 3.1. Syntheticbenchmarkingandproﬁling

In order to classify any given application it is first necessary to characterize the adopted GPU device, more specifi- callyhowDVFSimpacts theexecutiontimeandpowerconsumptionofdifferenttypesofapplications.Thisisusually done throughtheexecutionandprofilingofacontrolledsetofsyntheticbenchmarksdesignedforthisspecificpurpose.Accord- ingly,andsincetheDVFSimpactontheapplicationexecutionisrelatedtohowtheresourcesoftheGraphicsandMemory domainsareutilized,differentbenchmarkapplicationsneedtobecreated,withvaryinglevelsofutilizationofthetwoGPU domains.

Fig.10presentsthebasicstructure ofoneofthedevelopedsyntheticGPUkernels. Byexecuting multiplekernelswith distinct valuesoftheNUM_ITERSparameter,differentcombinationswithdifferentratios betweenthenumberofmemory accessesandtheamountofcomputationscanbetested.Eachsyntheticbenchmarkwasexecutedatallsupportedfrequency levels,duringwhichtheexecutiontime andpowerconsumptionwereaccuratelymeasured,inordertoevaluatehoweach applicationisaffectedby frequencyscaling.TheutilizationoftheseveralGPUresourcesby thedifferentkernelswasalso quantiﬁed, by measuring severalhardware performance eventsduring theexecution ofeach kernel. However, unlike the executiontime andpowerconsumption,thesevaluesaremeasuredatasinglefrequencylevel.Sincethehighestoperating

(10)

Fig. 10. Example PTX code from one of the developed synthetic kernels.

Table 2

Metrics used in the DVFS-aware performance and power classiﬁcation methodologies.

Metric name Description Used in

Performance Power DRAM Transactions Number of reads and writes to DRAM memory YES YES

L2 Transactions Number of reads and writes to L2 cache YES YES

Shared Transactions Number of reads and writes to shared-memory YES YES FU Single Utilization † level of the single-precision function units YES YES FU Double Utilization † level of the double-precision function units YES YES

FU Special Utilization † level of the special function units YES YES

FU Texture Utilization † level of the texture function units NO YES

Registers Number of registers used per thread NO YES

Occupancy Ratio of active warps on an SM with the maximum NO YES number supported by the SM

† Utilization: Ratio of the experimentally achieved throughput of each unit with respect to the theoretical peak.

frequencylevelistheonethatusuallyprovidesthebest-performance,theperformanceeventsweremeasuredatthislevel (onGTXTitanX:F_Mem=3505MHzandF_Core=1164MHz).

The impact of DVFS in the performance and power-consumption of applications is in both cases dependent on how theGPUcomponentsare beingutilized.However,thecomponentsmostrelevantandhowtotakethem intoaccount may differbetweenthetwoclassifications.Whileintheperformanceclassificationtheinterestingmetricsrefertotheutilization of themostpredominantcomponents, i.e.theoneswithhigherprobability to be limitingtheperformance, inthe power classificationtheconsideredmetricsaretheaggregateutilizationofallcomponents,sincethedevicepowerconsumptionis thesumofthepowerconsumptionofeachindividualcomponent.Additionally,thecomponentswiththegreatestinfluence on theperformancemayalsodifferfromthose withthehighestinfluenceonthepowerconsumption(see [19,20,34]),i.e.

theperformancemetricsthatwillbeutilizedforthetwoclassiﬁcationmethodologiesmaybedifferent.

ThesetofperformancecountersthatwereusedtoquantifytheutilizationofthetwoGPUdomainsintheNVIDIAGTX TitanXGPUarepresentedinTable2.Thenumberofregistersusedperthreadcanbeobtainedatcompiletimebyusingthe nvcccompiler[35],whiletheremainingperformancecounterscanbemeasuredusingthenvidiaproﬁler(nvprof)[36]during theexecutionofthekernels.TheDRAMbandwidthiscomputedasfollows:

DRAM_Bandwidth=DRAMTransactions×Transaction_width

Execution_time , (2)

whereDRAMTransactions andExecution_timearemeasuredby theprofiler,whileTransaction_widthisacharacteristicofthe GPUmicroarchitecture.Theshared-memoryandL2bandwidthsarecomputedsimilarly.ThequantificationoftheDRAMand Graphicdomainsutilizationthatareusedintheproposed DVFS-awareperformanceclassificationarecomputedasfollows:

DRAM_Util= DRAM_Bandwidth

DRAM_Peak_{_}_Bandwidth (3)

Graphics_Peak_{_}_Util= max

All_components

{ δ

i·Utilization_i

}

⁽⁴⁾

(11)

Fig. 11. Performance and Power characterization of synthetic GPU benchmarks on NVIDIA’s GTX TITAN X.

FortheDVFS-awarepowerclassiﬁcation,theGraphicsdomainutilizationiscomputedasfollows:

Graphics_Agg_{_}_Util= 1

i

ω

i

All_components

i

ω

i·Utilization_i (5)

The valuesofthe coeﬃcients

δ

i and

ω

i,inEqs. (4)and(5),are architecture speciﬁc andcorrespond tothe weightof thecomponenti tothe overallperformance (orpowerconsumption) oftheGraphicsdomain. Thedeterminationofthese parameterscanbedonethroughpropermodellingofthearchitecture,aspreviouslysuggestedin[34](validforGPUsfrom theFermimicroarchitecture).Nevertheless,forthesakeofsimpliﬁcation,therestofthismanuscriptwillassume

δ

i=

ω

i=1.

3.2. Deﬁningtheclassboundaries

After the executionandprofiling ofall thesynthetic kernels,it ispossible to groupthe benchmarksaccordingto the DVFSimpactontheirexecution.Inordertoseparatethemintoclusterswithsimilarcharacteristics,aclusteringapproachis used.Althoughmultiplealgorithmscanbeusedinthiscontext,techniquesleadingtosphericalclusterswereavoided(e.g., K-Means).Thehierarchicalclusteringtechniquewasselected,sinceitdirectlyspecifiesahierarchyofgroupsandtherefore simplifiestheselectionoftheoptimalnumberofclusters.

The setof features used by thisclusteringprocedure are the resultingchanges in the executiontime (or powerconsumption),causedbymemoryandcorefrequencyscaling,i.e.thevalueoftheincrease(ordecrease)oftheexecutiontime (orpowerconsumption)ineachfrequencylevel,relativetotheone achievedatthereferencefrequencyconfiguration(on GTXTitanX:F_Mem=3505MHzandF_Core=1164MHz).Duringthishierarchicalclusteringprocedure,theeuclideandistance (distancemetric) isusedtoquantifythesimilaritybetweenapplicationsandtheWard’scriteria(linkagecriteria)isusedfor clustermerging,since itminimizesthewithin-clustervariance.The consideredcut inthetreewasperformedinorderto obtainthe desirednumberofclasses.As previously stated,in ordertoaccurately identifythe benchmarksthattransition fromcompute-boundtomemory-boundateachmemoryfrequencylevel,thetotalnumberofperformanceclassesisrelated withthenumberofmemoryfrequenciesavailableintheGPUdevice.OnthisparticularGPUdevice,theperformanceclassi- ficationwillconsiderfourclassesofapplications,namelytwoextremeclassesTAandTC(hereafterreferredtoasT1andT4) andtwomiddleclassesT2andT3,correspondingtotwosubdivisionsofTBinFig.4),associatedwiththetwolowermem- oryoperatingfrequencies.Forthepowerclassificationthesuggestedsetofclasses(4)willbe considered(P1-P4). Fig.11a presentstheresultofthehierarchicalclusteringusingthepowerconsumptionfeatures.

Finally,for each classificationmethodology (performance orpower), afterall the syntheticbenchmarks havebeenas- signedtoone ofthefourclusters, aclassifiermustbe trainedto allowtheclassificationofnew(unseen) applications.Al- thoughmultiplealgorithmsexistintheliterature,neuralnetworks(NN)arehereinadoptedduetotheirrecognizedcapacity tomodelcomplexnon-linearproblems.Inordertovalidate theclassifierthe setofsyntheticbenchmarksisrandomlydi- videdintotwosubsets:training-setandvalidationset.Hence,aneuralnetworkstopologywithtwofully-connectedhidden layerswasdevised,wherethesigmoidfunctionisusedastheactivationfunctiononallneurons.Theperformancenetwork haslayersizes40and10,whilethepowernetworkhaslayersizes20and40.Theinputnodesarethenfedwiththevalues oftheGraphics andMemoryresource utilizations,withtheneural networkbeingtrainedto identifytheclassesassigned duringthehierarchicalclusteringstage.

Figs.11bandcdepictthesetofsyntheticbenchmarksusedfortheperformance andpowerclassifications,respectively, aswellastheobtainedclassesandtheir respectiveboundaries. Itcanbe seenthat,fortheperformance classificationthe DVFS impactontheperformanceissolelydependent ontheratio GraphicsUtil^DRAMUtil^..,resultinginlinearboundaries,whileforthe powerclassificationthisrelationshipisnotlinear.

Itisimportanttostressthatgiventhesmallsizesoftheneural-networks,thetimerequiredtotrainisrelativelysmall.In fact,thebiggesteffortinthetrainingphaseofourclassiﬁcationmethodsisintheexecutionofallthesyntheticbenchmarks

(12)

on the available frequency conﬁgurations. Nonetheless, thisis a step that is performedonce (oﬄine) and thereforewill not haveanyimpactintheapplicationexecutiontime.Duringapplicationexecution,itisonlynecessarytorun theneural networkforinference,corresponding,onaverage,tolessthanasecondonanInteli74500Uprocessor.

3.3. Extensiontootherarchitectures

While the proposed methodology mainly considers a NVIDIAGPU from the Maxwell microarchitecture, it is possible to makesomeconsiderations regardingits extensiontoother architectures.Theproposedapproachassumesthat theGPU device has two independent frequency domains, which is the case for all recent GPU devices (both NVIDIA and AMD).

Since theexecutionofthe syntheticbenchmarkstriestocover aconsiderablerangeof possiblecombinationsofresource utilizationfromthetwoGPUdomains,itispossibletoadapttheexecuted settotheconsideredGPUdevice.Additionally, asitwaspreviouslymentioned,thesuggestednumberofperformanceclassesdependsontheavailablenumberofmemory frequency levels on the considered device (see Section 2.1). On theother hand,the numberof suggestedpower classes, foranyGPUwithmorethana singlememoryfrequencylevelsisfour(see Section2.2).However,iffutureGPUscontinue increasing thenumberofallowedmemoryandcorefrequencylevels,itisexpectablethatthenumberofclassesshouldbe increasedaswell.Whenconsideringadifferentnumberofclassesthanthoseconsideredinthissection,convenientchanges shouldalsobeappliedtothehierarchicalclusteringstage,speciﬁcally,inthecut-off pointofthedendrogramtree,inorder toobtainthedesirednumberofclusters.

Accordingly,giventhegeneralityoftheproposedmethodologies, itisclearthattheycanbestraightforwardlyextended to other GPUs.In fact,giventhe currentevolution trendofNVIDIAGPUs, itis expectedthat theproposed approach will only keepgettingmoreinteresting,since everynewgeneration hasbeen introducinglargerrangesofallowed frequencies forboththecoreandmemorydomains.

4. Validationoftheclassiﬁcationalgorithms

Toevaluatetheproposedmethodology,severalCUDA-basedapplicationbenchmarksfromtheParboil[22],Rodinia[23], SHOC[24],Polybench[25]andCUDASDK[26]suites(seeTable3)wereexecutedontwoGPUdevicesfromdifferentNVIDIA microarchitectures: GTX Titan X (Maxwell microarchitecture) andTitan Xp (Pascal microarchitecture). The Maxwell GPU (PascalGPU)providesauser-levelinterfacetoscalethecoreoperatingfrequencybetweendifferentlevels,withinthe595- 1164MHz (582-1911MHz) range, and the DRAM frequencyin 3 (2) non-idle levels: 810, 3300 and3505MHz (5705 and 4705MHz). Additionally,atthelowest memoryfrequency,i.e.810MHz,theGTXTitan XGPUallows tofurtherdownscale the corefrequencyinto21 additionallevels, downto 135MHz.Inboth GPUs,the defaultfrequencysetup correspondsto a dynamically managedfrequency state, denoted as Auto-boost[37],used to boost the applications performance, by in- creasingGPUcoreandmemoryfrequencieswhensufficientpowerandthermalheadroomisavailable.Notwithstanding,to apply theproposedperformanceandpowerclassificationmodel,itishereinassumedthattheGPUoperatesatthehighest user-controlled coreandmemoryfrequencylevels(i.e.,F_Mem=3505 MHzandF_Core=1164MHzfortheMaxwellGPUand F_Mem=5705MHz andF_Core=1911 MHzfor thePascal GPU).Hence, each benchmark is only executed atsuch reference frequencylevel,wheretheperformancecountersvaluesweremeasured usingtheNVIDIAProfiler [36].Accordingly,based on: (i) thegathered valuesoftheperformance counters; (ii)the proposed classificationscheme(see Fig. 9);and(iii)the classifierstrainedusingthesyntheticbenchmarks(seeFigs.11bandc),eachapplicationwassubsequentlyclassifiedinone performanceclassandonepowerclass.

Finally, to validate and evaluate the attainedclassifications, each application was alsoexecuted at all user-controlled memoryandcorefrequencylevels,includingwiththeactivationoftheAuto-boostfeature.Ateachfrequencyconfiguration, theexecutiontimeofapplicationswasaccuratelymeasuredusingtheNVIDIAProfiler[36]andtheGPUpowerconsumption wasobtainedusingtheNVML[38] library,whichis aC-basedAPIthat allowsamongst other things,monitoringthestate of the GPU. In some NVIDIA GPUdevices, asis the casefor both the considered devices, itis possible to use NVML to get the current powerdraw of the device. Through experimental testing, itwas determined that the refresh rateof the valuesobtainedusingNVML isabout100ms.Accordingly, aGPUpowermeasuring tool wasimplemented,which samples theGPUpowerdrawevery25ms.Thekernelsfromtheapplicationswithsmallerexecutiontimeswererepeateduntileach kernelwasexecutedforatleast1s,inordertoincreasetheaccuracyofthemeasuredsamples.Thepowerconsumptionof each kernelwascomputedastheaverageofall gatheredsamples.Forapplicationswithmultiplekernels,thetotal power consumptionwasobtainedbyaveragingtheconsumptionofeachkernelweightedwiththerelativeexecutiontimeofeach kernel.Inordertoguaranteetheintegrityofthemeasurementstaken,allapplications(bothsyntheticandreal)areexecuted 10times.Thevaluespresentedcorrespondtotheaverageresultsoveralltheexecutedruns.

Sections4.1and4.2presenttheresultsoftheproposedperformanceandpowerclassiﬁcationmethodologies,respectively, whileSection4.3presentstheresultsofcombiningthetwoapproachesinordertoobtainaDVFS-awareenergyclassiﬁcation ofGPUapplications.

(13)

COVAR Polybench Default

FDTD-2D Polybench Default

GEMM Polybench 2048 ×2048

GRAMSCHM Polybench Default

SYRK Polybench Default

Backprop Rodinia 655360

CFD Rodinia missile.domn.0.2M

Gaussian Rodinia 2048 ×2048

Hotspot Rodinia 1024, 2, 10,0 0 0 K-Means Rodinia 30 0 0 0 0 0_34f.txt

LUD Rodinia 8192 ×8192

ParticleFilter_Float Rodinia -x 256 -y 256 -z 80 -np 50,0 0 0 ParticleFilter_Naive Rodinia -x 256 -y 256 -z 80 -np 50,0 0 0

Srad_V1 Rodinia 4096 ×4096

Srad_V2 Rodinia 4096 ×4096

Streamcluster Rodinia Default

BFS SHOC -passes 100 -size 4

FFT SHOC -passes 100 -size 4

MD5Hash SHOC -passes 5 -size 4

Reduction SHOC -iter 10 0 0 -passes 1 -size 4

S3D SHOC -passes 100 -size 4

Sort SHOC -passes 100 -size 4

SPMV SHOC -iter 400 -passes 1 -size 2 Stencil2d SHOC -passes 1 -num-iters 100 -size 4

Fig. 12. DVFS-aware performance classiﬁcation of the tested applications depending on the utilization of the DRAM and Graphics domains, on GTX Titan X (Maxwell) with F Mem= 3505 MHz and F Core = 1164 MHz.

4.1. Performanceclassiﬁcation

Fig.12presentsthecomputedvaluesfortheDRAM_Util andGraphics_Peak_{_}_Util foralltestedapplicationsandhowtheyare classiﬁedintheGTXTitanXGPUdevice accordingtothepreviouslytrainedclassiﬁer(see Section3). Additionally,Fig.13 presentshowtheexecutiontimeoftheapplicationsofeachresultingclassisaffectedbythecoreandmemoryfrequency downscaling, withthepresented valuesbeing normalizedto theexecution time ofeach applicationatthe referencefre- quencystate.Tosimplifytheanalysisoftheresults,adashed linewasaddedtothegraphsinordertoshow theexpected

(14)

Fig. 13. DVFS-Aware Performance classes on GTX Titan X (Maxwell), with F Ref= 1164 MHz. The execution times are normalized to the value at F Mem = 3505 MHz and F Core= 1164 MHz.

executiontimevariation foraperfectcompute-boundapplication,whereT∝_F¹

Core.Finally,asetofmarkers(1–5)waswere alsoaddedtotheﬁgure,tohighlightthemaintakeawaysfromtheresults.

Byanalysingtheresultspresentedinthesegraphs,itcanbeconcludedthatalthoughthisdevicehasthreememoryfre- quencylevels,theyarenotuniformlyseparated.Infact,thetwohighestmemoryfrequencylevels(3505and3300MHz)are so closetoeach other (only6% different)that theperformance resultsofall applicationsdidnot signiﬁcantlychangebe-

(15)

Fig. 14. DVFS-aware performance classiﬁcation of the tested applications depending on the utilization of the DRAM and Graphics domains, on GTX Titan X (Maxwell) with F Mem= 3505 MHz and F Core = 1164 MHz.

tweenthesetwolevels.Notwithstanding,astherearemoresigniﬁcantdifferencesinthepowerdomain,thesetwomemory levelswillbeexploitedinSection5toattainenergysavings.

Ontheother hand,whenanalysingtheclassificationresults,asetofimportantobservationsareattained,whichclearly justifiesthe partitionsoftheapplicationsinthesefourclasses.Inparticular,by observingmarker(1),itcan beseen that whenF_Mem=3505MHz,forthehighestcorefrequencies(above1013MHz,i.e.F_Ref/F_Core<1.15),theapplicationsinclassT1 haveavariationoftheirexecutiontimethatislower thanthecorefrequencyvariation.Hence,atthismemoryfrequency, theseapplicationsare considered memory-bound. However,atF_Core=1013MHz (F_{Re f}/F_Core=1.15), theseapplicationsstart havingtheir executiontime scaling atthe samerateasthecorefrequency (seeFig. 3a). Whenthe memoryfrequencyis scaleddownto810MHz(see(2)inFig.13),thereisasignificantdrop-off intheperformanceoftheseapplications,whichis expectedsincetheyhaveveryhighDRAMutilization(seeT1inFig.12).Also,asexpectedatthislowestmemoryfrequency, therangeofcorefrequencieswheretheseapplicationshavenegligibleperformancedrop-off isincreased,sinceitmeansthat atalowest memoryfrequencythecorefrequencyneedsto befurtherdecreased fortheGraphicscomponents tobecome theperformancebottleneck.

ClassT2composestheapplicationsthatarecompute-boundatF_Mem=3505MHz(see(3a)inFig.13),sincetheirexecu- tiontimealwaysscaleswithF_Core).However,theseapplicationsarememory-boundatF_Mem=810MHz(see(3b)inFig.13), sinceforhighcorefrequenciestheexecutiontimedoesnotscalewithF_Core.Additionally,theexecutiontimeoftheseappli- cationsalsosigniﬁcantlyincreaseswhenthememoryfrequencyisdecreasedtothelowestlevel.However,inalesserextent thantheapplicationsfromclusterT1(compare(2)and(3b)inFig.13),whichisconsistentwiththeirsmallerutilizationof theDRAMresources(seeT2Fig.12).

Furthermore,ClassT3includestheapplicationsthat havetheir executiontime scalingwithbothcoreandmemoryfre- quencies.However, whilethesamecouldbe potentiallysaid abouttheapplicationsintheT2class,theapplicationsinT3 always havetheir executiontimes scalewiththe corefrequency(even atF_Mem=810), which isnot thecaseforthe ap- plicationsinT2(compare (4a) and(4b)inFig.13).The behaviourofmanyof theT3applicationsis aresultoftheir low occupancyoftheGPUresources,resultinginasequentialexecutionofmanyinstructions.

Finally,classT4comprisesalltheapplicationsthatcanbeconsideredcompute-boundatallmemoryfrequencylevels(see themarkers(5)inFig.13),astheirexecutiontimealwaysscalesinverselywiththecorefrequencyandneverscaleswiththe memoryfrequency(lessthan1%performancechangewhenthememoryfrequencyischangedfrom3505MHzto810MHz).

Again,itisimportanttostressthatthismethodologyallowstheclassiﬁcationofGPUapplicationsintoclassesthatchar- acterizetheir performance atall frequencylevels,byusingtheinformationobtainedfromtheir executionatasinglecore frequencyinarealhardwaredevice.

4.2. Power-awareclassiﬁcation

Fig.14presentstheutilizationlevelofseveralcomponentsoftheGraphicsandMemorydomainsforthetestedapplica- tionsandhowtheyareclassiﬁedaccordingtotheproposedpower-awareclassiﬁer(fortheTitanXGPUdevice).Theimpact ofDVFSinthepowerconsumption oftheapplicationsofeachclassispresentedinFig.15.Asitwasthecasefortheper- formance characterization,all values are normalizedto the powerconsumption at the referencefrequencies. The dashed lineshowsthecorefrequencyvariationateach level,i.e.applicationsthatfollowthelinehavepowerconsumptionscaling

(16)

Fig. 15. DVFS-Aware Power classiﬁcation on GTX Titan X (Maxwell), with F Ref= 1164 MHz. The power consumptions are normalized to the value at F Mem= 3505 MHz and F Core= 1164 MHz.

linearlywithF_Core,whileapplicationswithapowerconsumptionvariationbelowthislinehavesuper-linearscalabilitywith F_Core.

Eachoftheresultingclassesdisplaysdifferentdegreesofsensitivitytovariationsinthecoreandmemoryfrequencies.

In particular, whencomparing theapplications relativepower consumption andwhen F_Mem decreases from3505MHzto 810MHz(at F_Core= F_Ref)-comparethemarkers(1)inFig.15-itcanbe perceivedthatapplicationsfromclassesP1and P2haveanhigherdegreeofsensitivitytomemoryfrequencychanges,sincetheyachievepowerconsumptionsintherange from70%to50%ofthereferenceconsumption(correspondingtopower-savingsbetween30%and50%).Ontheotherhand,

(17)

Fig. 16. DVFS-Aware energy classiﬁcation on GTX Titan X (Maxwell).

applications fromclassesP3 and P4 display alower degree of sensitivityto memoryfrequency changes,since they only achievepower-savingsbetween10%and30%,whenthememoryfrequencyisdecreasedbetweenthesamevalues.

Whenanalysing theapplicationspowerconsumption sensitivitytocorefrequencyvariations,it canbe concludedthat, applications from classes P1 andP3 display a lower degree of sensitivity than applications from classes P2 and P4. In particular,bycomparingtheapplicationspowerconsumption variationatF_Mem=3505 MHzwiththedashedline(seethe markers(2)inFig.15),itcanbeobservedthattheapplicationsfromclassesP1andP3presentalinearorsub-linearscaling withthedashed-lineintherangeF_Ref/F_Core∈[1; 1.25].Ontheother hand,forthesamecorefrequencyrange,applications fromclassesP2andP4showasuper-linearpowerconsumptiondrop-off.

TheexceptiontosuchtrendispresentedinClassP2,forthematrixMulCUBLASbenchmark,whichdoesnotﬁttheproﬁle of the remaining benchmarks (see the marker (3)in Fig. 15). Upon further inspection, it was observed that, duringthe executionofthiskernelatthehigherfrequencylevels,thepowerconsumption oftheGPUreachesveryhighlevels(close tothedevicepowercapof275W),thussuggestingtheeffectisduetointernalGPUpowercontrol mechanisms.Sincethe valuespresentedinFig.15arenormalizedtothemaximumpowerconsumption,whichforthematrixMulCUBLASbenchmark islimitedbythedevicecapabilities,theresultswillbeskewed.

4.3. Energy-awareclassiﬁcation

Since,ontheGTXTitanXGPU,theproposedmethodologiesgiverisetofourperformance classesandfourpowerclas- sificationsforthisspecificGPUdevice,itcanbedefinedasetof16energyclasses.However, itisobservedthat thesetof testedbenchmarksareclassifiedintoonly10outofthose16classes.Sincebothperformance- andpower-awareclassifica- tionsareconsideringtheresourceutilizationoftheGraphicsandMemorydomains,eventhoughtheyusedifferentwaysto combinetheutilizationsoftheseveralcomponents,manyofthecombinationsbetweenperformanceandpowerclassesare veryhardtoachieve.Theresultsoftheobtained10energy-awareclassesarepresentedinFig.16,whereitispresentedthe energy-savingsofall applicationsforallavailable frequencylevels.Theenergy-saving (R)atfrequencystate(F_Core,F_Mem)is

(18)

Fig. 17. DVFS-Aware energy classiﬁcation on Titan Xp (Pascal).

computedinthefollowingway:

R

(

^F^Core^,^F^Mem

)

⁼^E^{re f}⁻^E

(

^FCore,F_Mem

)

E_{re f} , (6)

whereE_ref istheenergyconsumptionatthereferencefrequencies(F_Mem=3505MHz,F_Core=1164MHz).

Fig.17presentstheobtainedenergy-awareclassesofapplicationsontheTitanXpGPU(Pascalmicroarchitecture).Since in this device there are only two memory frequencylevels, the proposed classiﬁcation methodology will result inthree performance classes.Combiningtheseclasseswiththefourpowerclasseswillultimatelyresultin12energyclasses,some ofwhichwill againbe unoccupied.Thelargerrangeofoperatingcorefrequenciesavailable inthisGPUdevice (compared withMaxwell GPU),results inalarger rangeofcorefrequencieswhere theapplicationswithvery highDRAMutilization (ClassE1)decreasepowerconsumptionwithoutlosingperformance:from1911MHzdownto1404MHz(vs.1164-1013MHz onMaxwell).Ontheotherhand,thelackofalowmemoryfrequencylevelresultsinlessinterestingenergy-savingsoppor- tunitiesforapplicationswithhighgraphicsutilizationandlowDRAMutilization(ClassE12).

As itcanbeseen,bycombiningtheresultsofthetwoproposed classificationmethodologies,asetofclassesthatsuc- cessfully grouptogetherapplicationswithsimilarenergyconsumptioncurvescanbeobtained.Theresultofthisclassifica- tion canbe usedinmanydifferentwaystomaximize theenergy-efficiency ofcomputationalsystems,some ofwhichare proposedinSection5.

5. Application-awareDVFS

ByprovidingasystematicmechanismtocharacterizetheimpactofDVFSontheenergy-consumptionofanyGPUappli- cation,theproposedDVFS-awareclassificationmethodologiescreatemanyinterestingopportunitiestoimprovetheenergy- efficiency of HPCsystems. In fact, since the impact ofDVFS on the energy consumption combines the performance and powerclassificationsoftheapplication,byconsidering theaveragetrendamongall applicationsintheclass,itispossible to predict howthe energyconsumption ofeach applicationwill changewiththefrequency scalingof eachGPU domain.

Sections 5.1and5.2addresstwo differentwaysto usetheseclassification methodologies, namelytochoose thebest operating frequencies (coreandmemory), inorderto solelymaximize energy-savingsorto tryto maximize energy-savings withoutdecreasingthe performancemorethanadefinedthreshold.Section5.3addressesanotherpossibleusageofthese classification methodologies,that combinesthepredictedfrequencyrangeswiththeenergy-savings,inorderto selectthe mostconvenientoperatingfrequencyforascenariowheremultipledistinctapplicationsaresimultaneouslyrunningonthe GPUdevice.

5.1. Optimalfrequencyformaximizingenergy-savings

Just asthe impact ofDVFS inthe energyconsumption of applications within the sameclass is very similar between them, it can alsobe observedthat, all applications of thesame class have thesame (or very similar)optimal operating

(19)

Fig. 18. Achieved energy-savings for the selected and optimal frequency levels, on GTX Titan X (Maxwell).

frequencies (memoryandcorefrequencies that minimize theenergyconsumption onthat device).Hence, by considering the average ofthe optimalfrequencies from thesynthetic benchmarksof each class (referred inthis section asselected frequencies),itispossibleto attainsigniﬁcantenergy-savings.Accordingly, Fig.18Adepictstheoptimalcoreandmemory frequenciesforeachapplicationexecutedonGTXTitanX,aswellastheselectedper-classcoreandmemoryfrequencies.

Asexpected, applicationswithvery highcoreutilizations(classesE10,E11,E12, E15andE16) areabletodecreasethe memoryfrequencytothe minimumvalue,while applicationswithhighDRAM utilization(classesE1 andE2)requirethe memoryfrequencytobesettothehighestvalue.

Ontheotherhand,itisinteresting tonotethat contrarytowhat onemightexpect,theapplicationsfromtheclassE1 (HighDRAM/LowGraphics)havehigheroptimalcorefrequenciesthantheapplicationsfromclassE16(LowDRAM/High Graphics).The former group ofapplications are able to downscale thecore frequency(with F_Mem=3505 MHz) inorder to achieve a lower energy consumption, whilethe applications fromclass E16 are able to decrease thecore frequencies evenfurther(withF_Mem=810MHz), whichisaresultoftheinteractionbetweenthetwopowercomponentsfromEq.1. However,sincetheseapplications(classE16)areverycompute-intensivetheywillhaveanhighperformancedrop-off when runningatthemiddlecorefrequencylevels,whichwillbefurtheranalysedinSection5.2.

(20)

Fig. 19. Achieved energy-savings for the selected and optimal frequency levels, when the maximum allowed performance drop-off is of 10%, on GTX Titan X (Maxwell).

Fig. 18B presents the energy-savings obtainedwith the selected frequency levels in relation to: (i) the reference frequency;(ii)totheNVIDIAAuto-boost;and(iii)totheworstcase,i.e.thehighestenergy-consumptionofeachapplication.

As itcanbeseen,allapplicationscanachieve lowerenergyconsumptionsthantheonesobtainedusingthereferencefre- quency(frequencystatewithhighestperformance),andonlythreeapplicationswouldhavelowerenergyconsumptionusing NVIDIA’sAuto-Boost.Onaverage,usingtheselectedpairoffrequencieswouldwield16%energy-savingscomparedwiththe reference,and13%inrelationtotheAuto-Boost.Additionally,someapplications(classesE12,E15andE16)canevenachieve greater energy-savings,promptedbythe factthat they candecreasethe memoryfrequencyto23% ofitsreferencevalue, makingthemachievecloseto36%energy-savings.

Althoughtheproposed proceduretochoosetheselectedfrequenciesdoesnotguaranteeoptimalenergy-savings,i.e.the selected frequencyisnot alwaysequal totheoptimalone,fromthe resultspresentedinFig.18C,it canbe seenthat the difference between optimalconﬁguration (horizontal lineat100%, representingthe lowest energy consumption)andthe oneobtainedusingtheselectedfrequenciesisverysmall,withanaverageof0.74%differencebetweentheselectedandthe optimalenergyconsumptions.

5.2. Optimalfrequencyforenergyv.s.performancetrade-off

Another interesting situation to consider is the maximization ofthe applications energy-eﬃciency without sacriﬁcing their performance. Consideringa situationwhere atmosta 10% drop-off inperformance isallowed, andcomparing itto theperformanceofthereferencefrequencystate,theresultspresentedinFig.19canbeobtained,fortheGTXTitanXGPU.

Again,bychoosingthenewlyselectedfrequencies,allapplicationsareabletoachievelowerenergyconsumptionsthanthe oneatthereferencestate.Additionally,allobtainedenergy-savingsusingtheselectedfrequencylevelsareatmost3%distant fromtheoptimalenergy-savings,withaverageenergy-savingsof9%versusthereferenceand7%versustheAuto-boostsetup.

Itisinteresting tonotethat theapplicationswiththegreatergainsarethosewitheitheraveryhighDRAMutilization andlowGraphicsutilization(classE1),ortheapplicationsintheotherextreme(classesE15andE16).Asitwaspreviously seen,thishappensmainlyintheﬁrsttypeofapplications(classE1),since atthehighestmemoryfrequencyitispossible todownscalethecorefrequencywhileachievingverynegligibleperformancedrop-off.Thesetypesofapplicationscansave up to16% withlessthan10% increase intheir executiontime.On theother hand,applicationsfromclassesE15andE16 are able to run at the lowestmemory frequency withnegligible performance drop-off. For example,if justthe memory