energy efficiency
João Guerreiro
∗, Aleksandar Ilic, Nuno Roma, Pedro Tomás
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal, Rua Alves Redol, 9, Lisboa 10 0 0-029, Portugal
a r t i c l e i n f o
Article history:
Received 30 November 2016 Revised 20 August 2017 Accepted 1 February 2018 Available online 9 February 2018 Keywords:
GPGPU
Application classification DVFS
Optimal frequency Energy savings
a b s t r a c t
TheincreasingimportanceofGPUsashigh-performanceacceleratorsand thepowerand energyconstraintsofcomputingsystems,makeitfundamentaltodeveloptechniquesfor energyefficiencymaximizationofGPGPUapplications.Amongseveralpotentialtechniques, dynamicvoltageand frequencyscaling (DVFS)standsoutas oneofthemostpromising approaches. Hence,novel DVFS-awareperformance and power classification models are hereinproposed thatcorrelate application characteristicsand GPU architecturefeatures.
Inparticular,byanalysingtheutilizationofgraphicsandmemorycomponentsatasingle voltageand frequency levels,theproposed classificationmethodologies areabletopre- dictthe impact of DVFSonGPGPU applications execution timeand power and energy consumption.TheaccuracyoftheproposedapproachisvalidatedontwomodernNVIDIA GPUsfrom theMaxwelland Pascal generations,byrelying on35benchmarksfrom the Rodinia,Polybench,Parboil,SHOCand CUDASDK suites.Experimentalresultsshow that theproposedapproachcantypically predicttheoptimaloperatingfrequenciesofgraph- icsandmemorysubsystems,attainingupto36%energysavings(averageof16%),which correspondtoanaveragedeviationof0.74%regarding theoptimalcase.Moreover,when consideringamaximumperformance penaltyof10%,upto 26%energysavings arestill attained.
© 2018ElsevierB.V.Allrightsreserved.
1. Introduction
Generalpurposeacceleratorshavealreadygaineda firmpresence inmostmodern high-performancecomputing(HPC) systems.In particular, theGraphics Processing Units (GPUs),are commonlyused toincrease the resulting systemperfor- mancewhenexecutingapplicationsfrommanycommercialandscientificdomains.Thiscanbeeasilyobservedatthemost recentversion oftheTOP500list(June2017):88ofthesesystemsare equippedwithaccelerators,where72ofthemuse GPUs.However,suchanestablishedadoptionofGPUsintensifiestheimportancetofindreliablemechanismsthatensurethe maximumefficiencyofthecomputingsystem, bothintermsofperformanceand(mostimportantly)energyconsumption.
Accordingly,significant researcheffortsare beingputforth intheinvestigationofDynamicVoltageandFrequencyScaling (DVFS)techniques(oneofthemostusedpowermanagementstrategies),duetotheinherentpotentialforsignificantpower andenergysavingsinmanyofthecomputersystemcomponents[1–5].
Studyingtheeffects ofDVFS on theresultingenergy-efficiency ofcomputingsystems requiresanalysing itsimpact on differentapplications,asgeneral-purposeapplicationscanlargelyvaryinthewaytheyusethecomputationalandmemory
∗ Corresponding author.
E-mail address: [email protected] (J. Guerreiro).
https://doi.org/10.1016/j.parco.2018.02.001 0167-8191/© 2018 Elsevier B.V. All rights reserved.
resourcesofthedeviceswheretheyareexecuting[6,7].Whilesomeapplicationsperformalargenumberofcomputational operationsforeach loadeddata(more compute-intensive),otherapplicationsmayperformveryfew(moreI/O-ormemory- intensive).Althoughtheresultingperformanceoftheformertypeofapplicationsismorelikelytoscaleproportionallywith thefrequencyofthecores(highestfrequency ≡ bestperformance),thisbehaviourisnotguaranteedforthelattersetofap- plications.Additionally,moststandardapplicationsfallsomewherebetweentheseextremes,increasingthecharacterization complexity. However, knowingthe type of applicationexecuting can leadto interesting opportunities forenergysavings.
For instance,certain applicationscan be executed atlower frequencylevels withnegligibleperformance drop-off. Identi- fying theseclasses ofapplications can lead to significant energysavings, since lower operating frequencyleadsto lower powerconsumption.Hence,toconductthistypeofanalysis,itisfundamentaltoadoptmethodologiesthatallowaproper classificationoftheapplicationworkloads.
Some previous studies onworkloadcharacterization intheGPU-domainused PrincipalComponentAnalysis (PCA)and hierarchicalclustering[7–9],whichmaketheunderstandingofeachresultingclassharder(fromthecomputingarchitecture perspective)anddonotnecessarilyresultinanaccurateenergy-awareclassification.OtherworksdependonGPUsimulators andperformancecountersthatarenon-existentinrealhardwaredevices,renderingtheseapproachesimpossibletoreplicate in realsystems.Furthermore,most existingGPU simulatorsare based onthe NVIDIATesla andFermimicroarchitectures, whichhavealreadybeenfollowedbyKepler(2013),Maxwell(2014)andPascal(2016).AdifferentapproachtowardsDVFS consists in thedevelopment ofaccurate performance andpower models that allow predictingthe GPUbehaviour under differentvoltageandfrequencyscenarios[10–21].However, whiledetailedperformanceandpowermodelsmayultimately produce moreaccurate results,the herein presentedwork showsthat applicationclassification isa moreconvenient and viable approach not onlyto identifyremarkable energy-savingsopportunities, butalsoto achieve near-optimalresults in termsofenergysavings.
Accordingly, themaincontributionofthepresentedresearchistoprovideanewclassificationmethodologyforGPGPU applications, allowing an easy identification ofwhich applicationscan benefit fromDVFS, interms of energysavings. To that goal,separate methodologiesto classifyGPUapplicationsare proposed,focusing ontheeffectsofDVFS ontheir per- formance andpower consumption,based onhow theapplications exploit thedifferentGPU resources.The classifiersare trainedoffline,usingacollectionofsyntheticbenchmarks,andcanbeusedtoclassifyanyapplicationusinghardwareper- formance eventsgatheredduringits executionon asingleoperatingfrequency. Theresulting classis abletocharacterize howthefrequencyscalingwillaffecttheapplicationexecutiontimeorpowerconsumption,foralltheremainingoperating frequencies.
The proposedmethodologywasvalidatedusingasetof35applicationsfromdifferentrelevantbenchmarksuites(Par- boil [22], Rodinia[23], SHOC [24], Polybench [25] and CUDA SDK[26]), on two modern GPUdevices: GTX Titan X and Titan Xp,fromthe Maxwell andPascalmicroarchitectures, respectively.The experimental resultsshow that theproposed methodologiesareabletoaccuratelyandconsistentlyclassifytheconsidered GPUapplicationsintermsoftheirbehaviour inperformance,powerandenergyconsumption.Theproposedclassificationmethodologyallowsfindinganear-optimalpair ofoperatingfrequencies(coreandmemory),whichcanresultonaverageenergy-savingsof16%(20%),andonpeakenergy- savings that are ashighas36% (32%) on the Maxwell GPU(Pascal GPU). Additionally,in situations wherea decrease in theprocessingperformanceisnotallowedorhighlydiscouraged,theproposedmethodologyisstillabletofindrealoppor- tunities forenergy-efficiency withalimitedperformance trade-off (e.g., <10%),resultinginaverage energy-savingsof9%
on theMaxwellGPU(16%onPascal),althoughinsome classessuchsavings canevenbeashighas22% withonlya 0.2%
performancetrade-off.Accordingly,themostsignificantcontributionsofthispaperarethefollowing:
• Analysis oftheimpact ofDVFSon theperformance andpowerconsumptionofdifferenttypes ofGPUbenchmarkson realhardware;
• Novelapplication-classificationschemebasedonGPUperformanceandpowermetrics,abletocharacterizetheimpactof DVFS ontheexecutionofdifferenttypesofapplications,forawiderangeofGPUoperatingfrequencies,validatedwith standardapplicationsonrealGPUdevices;
• Application ofthe proposedclassificationmethodologiestooptimize:(i) theconsumedenergy;(ii)theenergyv.s.per- formancetrade-off;and(iii)theexploitableenergy-savingsranges.
Therestofthispaperisorganizedasfollows.InSection2,theprevalentGPUarchitecturesarebrieflydiscussed,aswell asthe impact ofDVFS onthe performance andpowerconsumption ofdifferent applications.Section 3presentsthe pro- posedmethodologiestoclassify GPUapplicationsintoclasseswithsimilarcharacteristics.Section4validatestheproposed methodologies usinga set of35 standard applications.Finally,Section 5applies the proposed methodologiesto findthe best operatingfrequenciesforthedifferentapplications,inordertomaximize theenergy-savings.Section6comparesthe proposedmethodologywiththecurrentstateoftheart,andfinallySection7concludesthemanuscript.
2. AnalysisoftheDVFSimpactonGPGPUapplications
Similarly to other computingsystems, thearchitecture of currentGPUdevices allows forthedifferent componentsto be clockedat distinct andindependent frequencies. Fig. 1 presentsa simplified representation ofa modern GPUdevice,
Fig. 1. Existing frequency domains in modern GPU devices ( e.g. NVIDIA Kepler, Maxwell and Pascal GPUs).
Fig. 2. DVFS impact on two distinct Rodinia applications on a NVIDIA GTX Titan X, where F CoreRef= 1164 MHz .
containingtwodistinctfrequencydomains:theGraphics(orCore)domainandtheMemorydomain.1TheGraphicsdomain includesboth thestreaming-multiprocessors (SMs)andthe L2cache,whiletheMemory domainincludesonlythe device main memory,i.e.DRAM memory.Scaling thefrequencyofeach domain canhave differentresultsonan applicationex- ecution,largely dependingonthe consideredapplicationcharacteristics [27].While itcan be expectedthata decreaseof thecorefrequency(FCore)andvoltage(VCore) willcausethekernel2 executiontimetoincrease (T∝F1
Core)andtheresulting power consumption todecrease (Pdynamic∝VCore2F andPstatic∝VeγVCore), the applicationperformance andpower consump- tion overdifferentfrequencies are highlydependent onthe waytheapplication exploits eachof theGPUsubsystems. In fact,it hasbeenshownthataccurately predictingtheimpact ofDVFSin theexecutiontime orpower consumptionoften requirestheusageofcomplexpredictivemodels[28,29].Accordingly,itisimportanttounderstandtheeffectsofDVFSon both theexecutiontime (t) andpowerconsumption (P), inordertobe ableto extractmeaningful conclusionsrelativeto thebehaviouroftheresultingenergyconsumption(E)overdifferentfrequencies(E=P×t).
To illustrate the referred problem, Fig. 2 presents one example withtwo applications that have their execution time affectedverydifferentlywhenthecoreandmemoryfrequenciesarescaled,namelyHotspot(fromRodinia)andBlackscholes (fromCUDASDK).ForHotspot(seeFig.2a),thekernelexecutiontimealwaysscalesinverselywiththecorefrequency(FCore).
However, fortheBlackscholes case(see Fig.2b),it ispossible tomaintainthe overallkernel executiontime whilescaling downthecorefrequency,aslongasthememorykeepsoperatingatthehigherfrequency.Itcanalsobeobservedthatthe effects ofscaling the memoryfrequency inthe executiontime are also very differentinthe two applications. Whilefor theBlackscholes(Fig.2b)thereisasignificantincreaseintheexecutiontimewhenthememoryfrequencyisdecreased,the performanceofHotspot(Fig.2a)isnotaffectedbythesamechange.
1This setup is currently used by several NVIDIA GPU microarchitectures, such as Kepler, Maxwell and Pascal.
2Kernel: routine to be executed in a massively parallel fashion on a GPU device by multiple threads.
Fig. 3. Examples of DVFS impact on overlapped instructions. The instructions pairs ( Mem1 , Comp1 ) and ( Mem2 , Comp2 ) require full synchronization.
Accordingly, this manuscriptproposestwo separate methodologiesto classify anyapplication accordingto theimpact of DVFS in the resulting execution time and power consumption. Finally,by combining the two approaches it is possi- ble to infer how the energy-consumption will change when the frequency of the cores and of the memory are scaled.
Sections2.1and2.2describetheDVFS impactontheapplicationsexecutiontimeandpowerconsumption,respectively.
2.1. DVFSimpactontheapplicationsperformance
TheimpactofDVFSonanapplicationexecutiontimeisacomplexproblemthatrequiresdeepunderstandingoftheGPU architecture.Inparticular,oneoftheGPUmaindesigngoalsconcernstheuseofmultiplegroupsofparallelthreads(warps in NVIDIAnomenclature) tohide instruction latency.However, inmany applicationsit isnot always possibleto hidethe instructionlatencywithotherwarps.Therefore,whenanalysingtheapplicationsperformance,fromtheperspectiveoftheir bottlenecksandlimitingfactors,mostworkstendtoconsidertwo maintypesofapplications[3,30,31]:(1)compute-bound, wheretheexecutiontimeismainlydeterminedbytheperformanceoftheprocessingcomponents;and(2)memory-bound, wheretheexecutiontimemainlydependsonthebandwidthandlatencyofthememoryhierarchywhensatisfyingmemory accessrequests.Accordingly,theadoptedsetupintermsofthecoreandmemoryoperatingfrequencies(FCoreandFMem)will resultindifferentperformanceversuspowerpatternsifacertainkernelismorememory-boundormorecompute-bound.
Moreover, while one kernel maybe compute-bound ata certain operating frequency state, it may become memory- bound atdifferentcoreand/or memoryfrequencystates.Toillustrate such condition,Fig. 3apresentsthe relativeweight variationofthememoryandcomputeoperationsofone givenkernelforthreedifferentscenarios.Atfrequencystate(FC1, FM1),theexecutionofbothMem1 andComp1instructionsoccur atthe sametimeandbothfinishtheir executionatthe same instant.In thisexample,thesubsequentinstructions Mem2andComp2 requirefull synchronizationbutsinceboth instructionsfinishatthesametime,thelatencyofthethreadswaitingtobeissuedisfullyhiddenbythethreadscurrently executing.However,ifthecorefrequencyisincreasedtoahighervalue(FC2)and/orthememoryfrequencyisdecreasedto a lowervalue (FM2)such that FFM2
C2 <FFM1C1, therewillbea timeintervalwhereonlythe Meminstructionsareexecuting on the GPU,meaningthatthere arenotenough threadsexecuting Compinstruction thatcan hidethelatencyofthethreads waitingonpending memoryoperations.Hence,atfrequencystate (FC2,FM2) theapplicationisconsideredto bememory- bound,sinceitsperformancebottleneckdependsonthelatencyofthememoryoperations.If,onthecontrary,theoperating frequenciesweresettostate(FC3,FM3),suchthat FFM3
C3 >FFM1C1,theexecutionoftheCompinstructionswouldbelongerthan theMeminstructions,meaningtheperformanceislimitedbythecomputeinstructions,thusresultinginacompute-bound classification.
As aresult,thecommonlyusedbinaryclassification (compute ormemory-bound) maynot bevalidforall combinations of frequency levels. In the remainder of this work an application is considered memory-bounded at frequency FMem_i if with that memory frequency, there is at least one core frequency (within the range allowed by the device) where the applicationperformanceislimitedbytheaccessestotheDRAM.Ontheotherhand,ifwithmemoryfrequencysettoFMem_i theapplicationisneverboundedbyDRAMaccesses(forallcorefrequencies),theapplicationisconsideredtobecompute- boundedatfrequencyFMem_i.
Fig. 4. DVFS-aware performance classes depending on the DRAM and Graphics utilization.
Inthetwo extremescenarios,wherethememorythroughputisextremely low comparedwiththecore(FFMem
Core →0)or extremely high(FFMem
Core →∞) itistrivialto classifyanygivenapplicationinthe memory-boundandcompute-boundclasses, respectively.Accordingly,astheratio FFMem
Core isdecreasedfrom∞(forexamplebyfixingFCoreconstantanddecreasingFMem), moreandmore applicationsinitially classifiedascompute-boundwillstart becoming memory-boundateach memoryfre- quencylevel.Fig.3bpresentsone exampleofsucha scenario,wheretheperformancebehaviour offourdifferentkernels changeasthememoryfrequencyvaries.Itisimportanttostressthatsinceall possiblevaluesof FFMem
Core canbeachievedby fixingone ofthevaluesconstantandscalingtheother,forthesakeofsimplificationthevalueofFCoreisconsidered con- stant,withoutlossofgenerality. Whenthememoryfrequencyisscaleddownbetweenanytwofrequencylevels(e.g. from FMem5toFMem3),theamountoftimerequiredtosatisfyallDRAMaccesseswillincrease,andtherefore,oneoffourpossible scenarioswilloccur:
1. Kernel 1: The application was already memory-bound at the highest memory frequency (FMem5), so it will remain memory-boundatanyotherlowermemoryfrequency;
2. Kernel2:TheapplicationwaswellbalancedasboththeMemandCompinstructionsstartandfinishatthesametime;
sinceadecreaseofthememoryfrequencywillmostlyaffecttheMeminstructions,theapplicationwillbecomememory- boundatFMem3;
3. Kernel3:Theapplicationwascompute-bound,butassoonasthememoryfrequencyreducestovalueslowerthanFMem4 theperformanceoftheapplicationstartsbeinglimitedbytheDRAMbandwidth,thereforebeingmemory-boundatfre- quencyFMem3;
4. Kernel4:Theapplicationwascompute-boundatFMem5 andit remainscompute-boundatthelowermemoryfrequency FMem3.
Hence, considering that one ofthe objectives of thiswork isto provide a DVFS-aware classification forthe resulting performanceofGPUapplications,thisclassificationmustbeabletocharacterizehowtheexecutiontimeofeachapplication changeswhencoreandmemoryDVFSisapplied.Therefore,theproposedmethodologytocharacterizetheimpactofDVFS intheapplicationsperformancewillhavetodependonthememoryfrequencylevelsoftheGPUdevice.
SinceGPUdevicesdonotprovideanyperformancecountersthatimmediatelydefinewhattypeofapplicationisrunning andhow its executionis affected byfrequency scaling,it is importantto choose therelevant countersthat can be used to indirectlyinfer thisinformation.The impact ofDVFS inthe executiontime of an applicationdependsmostlyon how it utilizesthe GPU resources. Inparticular, since GPUapplications are usually able to exploit theinherent parallelism of thedevice, theapplications executiontime isa resultoftheoverlapof theseveralinstructionsexecuted by themultiple warps.Since thememoryandcomputeinstructionsaregenerallyoverlappedduringtheexecution, theimpactofDVFS on anapplicationperformancewillbedependentontheutilizationofboththeMemoryandGraphicsresources.Additionally, since differentinstructions canbe executed on distinct functional unitsofthe Graphics domain (singleprecision, double precision, specialfunction, load/stores,etc.), the utilizationofthe Graphics domainresources willbe mostlyrelatedwith thecomponentwiththehighestutilizationofthatdomain. Hence,dependingontheutilizationratiooftheMemoryand Graphicsresources(GraphicsUtilDRAMUtil..),differentclassesofapplicationscanbeconsidered.Bytakingintoconsiderationwhichofthe twoGPUdomainsismoredominant,suchclassescanbegroupedinthreemainlevels(seealsoFig.4):
1. Applicationsthatarememory-boundatthehighestallowedmemoryfrequencylevel(FMem_high),andthereforeatallother lowermemoryfrequencylevelsiftheyexist(seeClassTAinFig.4).TheseapplicationshaveaveryhighDRAMutilization andtheir executiontime ishighlyaffected bychangesinthememoryfrequency.Additionally,asaconsequenceofthe computeinstructionsbeingoftenstalledduetomemorydependencies, theyarealsocharacterizedby alowutilization ofthemajorgraphicscomponents(sharedmemory,floating-pointunits,etc.).
2. Applicationsthatarecompute-boundatFMem_low,andthereforeatallotherhighermemoryfrequencylevels.Theexecu- tionoftheseapplicationsisnotsignificantlyaffectedbychangesinthememoryfrequency,butitishighlydependenton
Table 1
Power consumption for matrixMul- CUBLAS kernels with different iterations.
15 iters 1500 iters (3505, 975) 226 W 239 W (810, 595) 116 W 120 W
thecoreoperatingfrequency.Such applications(seeClassTCinFig.4) arecharacterized bya highgraphics utilization andalowmemoryutilization.
3. Applicationsthat scalewithbothmemoryandcorefrequencies(see ClassTBinFig.4). Ifmultiplememoryfrequency levels areavailable, itis possibletosubdivide thisclassinto multipleclasses,whichwillincludethe applicationsthat arecompute-boundatFMem_highandchangetomemory-boundatthelowermemoryfrequencylevels.Byconsideringone distinctclassforeachadditionalmemoryfrequencylevel,itwillallowthedistinctionbetweenapplicationsthatbecome memory-boundatdifferentmemoryfrequencylevels,resultinginbetteraccuracyinthecharacterization.
Section3willdetailfurtherhowtheboundariesbetweentheclassescanbeobtained,withanexampleoftheutilization oftheproposedapproachonarealGPUdevice.
2.2. DVFSimpactontheapplicationspowerconsumption
As it waspreviously seen, theexecutionof instructionsondifferentcomponents oftheGPU isoftenpartially orfully overlapped. However,the powerconsumptionoftheseveraldifferentcomponentscannotbe hiddenormasqueraded,and mustbealwayscombinedtogetherinordertoobtainthetotalpowerconsumptionoftheGPU.
When considering the instantaneous power consumption (PTotal) [18] of a CMOS circuit, one must take into account both its dynamic (PDynamic) and static (PStatic) fractions, as PTotal=PDynamic+PStatic. Additionally, when considering both voltage and frequency scaling, these two fractions of the total power consumption have different scaling behaviours, as PDynamic∝V2F[32]andPStatic∝VIˆleak[33],whereFdenotestheoperatingfrequency,VthechipsupplyvoltageandIˆleakisthe normalizedleakagecurrentforasingletransistor,dependentonthethresholdvoltage(Vth) andonthetemperature.These formulationsrelatetotheinstantaneouspowerconsumptionofeachseparatedevicecomponent.Inadevicewithmorethan asinglefrequencydomain(e.g.GPU)thetotalpowerconsumptioncanbeexpressedas:
PGPU=PGraphics
(
Fcore,Vcore)
+PMemory(
Fmem,Vmem)
, (1)where PGraphics andPMemory representthe powerconsumption ofthegraphics andmemorydomains, respectively.Infact, each ofthesepartscanbefurther decomposedintothe severalparcelsofthepowerthat isconsumedbyeach ofthein- ternalcomponents ofthat domain(processing cores,LD/STunits,sharedmemories, L2cache, etc).However,since device manufacturersdonotfullydisclosethedesignofeacharchitecture,severaloftheparametersthatarerequiredforanaccu- ratepowermodellingareusuallyunknown.Additionally,itisnoteasytomeasureorinfertheinstantaneousstatic/dynamic power parcelsincurrentGPUdevices.In fact,mostmodernGPUs onlyprovideone singlecounter toreport theinstanta- neous powerconsumptionofthe wholeGPUboard,combininginformationfrommultiple contributingdomains (Memory andGraphics)inan undisclosedmanner, makingita veryhard taskto distinguishall theseparateeffects contributingto that value. Accordingly, this workwill not focus oncharacterizing each individual parcelof thepower consumption ofa givenapplication,butrathertheaveragetotal,atthedevicelevel,duringthewholeapplicationexecution.
Finally, toanalyze DVFS impact, thiswork focuseson therelative difference inthe GPUpower consumption between the differentfrequencylevels.The proposed methodologydoesnot take intoaccountthe influenceoftemperatureinthe GPUpowerconsumption.Whiletakingitintoconsiderationcouldpotentiallybebeneficialwhencreatingapowermodelof thearchitecture(sincethevalueofleakagecurrentIˆleakquadraticallyincreaseswithtemperature[18]),itislesssignificant whenclassifyingtheimpactofDVFSonthepowerconsumption(i.e.,therelativechangeinpowerconsumption).
Asanexample,Fig.5presentsthepowerconsumptionandtemperatureoftheGTXTitanXGPUduringtheexecutionof two variationsofthesameapplication, ontwofrequencyconfigurations: (FMem=3505MHz,FCore=975MHz) and(FMem= 810MHz,FCore=595MHz).Thetwo applicationsexecutethesamematrixMulCUBLASkernel,repeatedadifferentnumber oftimes(15and1500),thusleadingtodifferentGPUactivetimes.Itcanbeseenthat,asexpected,theapplicationrunning foralongerperiodoftimecausestheGPUtemperaturetoincrease, whichinturnincreasestheGPUpowerconsumption.
However,althoughthereisapowerconsumptionincreaseduetotemperaturevariations(upto13Watthehighestfrequency setting,asshowninTable1),whenlookingattherelativechangesinthepowerconsumption,theimpactofthetemperature isalmostnon-existent(seeFig.6),whichmeansthetwoapplicationscanbeclassifiedsimilarly.
Regarding the effectsof DVFS, Fig.7 presentsan example ofthe overlap(in time) ofthe memoryandcomputational instructions oftwoidealapplications, andhowthetotal executiontime ofeachapplicationis affectedby corefrequency scaling (in both examples, thememory voltageand frequencylevels are fixed to aconstant value). The applicationwith higher DRAMutilization(see Fig. 7a)hasits executionmostlydominatedby the memoryaccesses, specificallyto DRAM.
Since inthisexampleonly theFCore ischanging (decreasing),the instantaneouspower consumptionremains constant for
Fig. 5. GPU power consumption and temperature obtained during the execution of the matrixMulCUBLAS kernel with square matrices of size 8192 with two different durations (with 15 and 1500 repetitions of the kernel, i.e. with 6 and 330 seconds, respectively). Here are presented the results for two distinct GPU configurations on the Titan X GPU: ( F Mem= 3505 MHz, FCore= 975 MHz) and ( FMem= 810 MHz, F Core= 595 MHz).
Fig. 6. Power consumption for matrixMulCUBLAS kernels with different number of iterations, on NVIDIA’s GTX Titan X.
all components whoseutilization isnot changedby the core frequencydownscaling, which includesboth the staticand dynamicpowerconsumptionofthecomponentsinthememorydomain.Furthermore,sincethetotalexecutiontime(T)does notchange,theaveragepowerconsumptionofthesecomponentswillalsoremainconstant.Onthecontrary,thegraphics domain willbe highly affectedby the corefrequency downscaling, resultingin an increase ofthe execution time ofthe compute instructions(T2). Infact,by takingintoaccount theprevious considerationsrelative totheinstantaneous power, itcan be expectedthat thepower consumptionwill downscale onthe componentsutilizedby the computeinstructions.
Bycombining all theeffects (FCore and VCore downscaling),theaverage powerconsumption is expectedto decreasewith FCoreVCore2.
Ontheotherhand,theapplicationwithhighSMutilization(seeFig.7b)presentsanincreaseofitstotalexecutiontime (fromTtoT)whenthecorefrequencyisdecreased.Similarlytothepreviouscase,theinstantaneouspowerconsumptionof
Fig. 7. Expected impact of DVFS on the execution time of two ideal applications (in both cases F Memis constant).
Fig. 8. DVFS-aware power classes depending on the DRAM and Graphics utilization. The example power consumption curves were obtained on NVIDIA’s GTX Titan X, for the Blackscholes, Lud, spmv and GEMM benchmarks.
theSMcomponentswilldownscalewithFCoreVCore2.However,sincetheperformancebottleneckisnowassociatedwiththe compute instructions,theobservedincreaseofthetotalexecutiontimewillincrease theircontributiontothetotal power consumption ofthedevice.Inother words,thecontributionofthememorycomponentstothepowerconsumptionofthe wholeapplicationwilldecrease,sincethesamenumberofmemoryoperationsisbeingexecuted,butscatteredforalonger periodoftime,asthememorycomponentsnowspendmoretime idlingthanbefore(T4 decreasesandT5 increases).This willresultinagreaterdecreaseoftheaveragepowerconsumptionthanthepreviousFCoreVCore2.
A similar approach could be applied to analyse the impact of memory frequency scaling in the average power con- sumptionofdifferentapplications,alsoresultingintwodistinctbehavioursdependingonhowtheGPUresourcesarebeing utilized.Accordingly,dependingonthecombinedresourceutilizationofthetwoGPUdomains,itispossibletoidentifythe classespresentedinFig.8.Thedefinitionofeachproposedclass,dependingonhowtheirpowerconsumptionchangeswith thefrequencyofthecoresandmemory,is:
1. ClassPA:highersensibilitytomemoryfrequencychanges,lowersensibilitytocorefrequencychanges;
2. ClassPB:highersensibilitytomemoryfrequencychanges,highersensibilitytocorefrequencychanges;
Fig. 9. Overview of the proposed procedure to classify the DVFS impact on the performance/power consumption of GPU applications.
3. ClassPC:lowersensibilitytomemoryfrequencychanges,lowersensibilitytocorefrequencychanges;
4. ClassPD:lowersensibilitytomemoryfrequencychanges,highersensibilitytocorefrequencychanges.
Unliketheperformancecharacterization,thenumberofidentifiedclassesinthischaracterizationdoesnotneedtoscale with the number of memoryfrequency levels available in the GPU device. The only special case is when there is only onememoryfrequencyavailable(rarecaseinmodernGPUdevices),inwhichthereisnoneedtocharacterizeapplications regardingtheimpactofmemoryfrequencydownscalingontheir executions,thereforeonlybeingpossibletoidentifytwo classes(ClassesPAandPB).
Bytakingintoconsideration thesetofobservations laidout inthissection, thefollowingsectionapplies theproposed genericperformanceandpowerclassificationtoarealGPUdevice,andderivesamethodologytoobtaintheboundariesof theperformanceandpowerclasses(Section3.2).
3. Applicationcharacterizationprocedure
Thedescriptionoftheconceivedmethodologytoobtaintheperformanceandpower-awarecharacterizationsofanygiven applicationwillbeconductedbyusingarepresentativeexampleusingaNVIDIAGTXTitanXGPU.Theselectionofthispar- ticularGPUdevicetotesttheimpactofDVFSontheapplicationsexecutionarisesnotonlyfromitsgreaterofferofdifferent core frequencylevels (43 non-idle levels) andmemory (DRAM) frequency levels (3 non-idle levels),but also because it providesaconvenientinterfacetomeasurethepowerconsumptionatruntime.
Fig. 9 presents an overview of the proposed procedure to classify the DVFS impact on the performance (or power- consumption)ofGPUapplications.Totraintheclassifiersacollectionofsyntheticbenchmarksisused,withtheirexecution timeandpowerconsumption measuredatallallowed operatingfrequencylevels.Inordertocharacterizehoweachappli- cationisstressingtheGPUcomponents,agroup ofhardwareperformanceeventsisalsomeasured atasingle(reference) operatingfrequency.Basedonthegatheredmetrics,anhierarchicalclusteringalgorithmisusedtodefine theclassofeach syntheticbenchmark,usedlaterinthesupervisedtrainingoftheneural-networkclassifier.Oncetheclassifieristrained,it allowsthecharacterizationofthatgivenarchitectureandhowDVFSimpactstheexecutionofdifferenttypesofapplications.
ThefollowingsectionsfurtherdetaileachofthestepsoftheprocedurepresentedinFig.9. 3.1. Syntheticbenchmarkingandprofiling
In order to classify any given application it is first necessary to characterize the adopted GPU device, more specifi- callyhowDVFSimpacts theexecutiontimeandpowerconsumptionofdifferenttypesofapplications.Thisisusually done throughtheexecutionandprofilingofacontrolledsetofsyntheticbenchmarksdesignedforthisspecificpurpose.Accord- ingly,andsincetheDVFSimpactontheapplicationexecutionisrelatedtohowtheresourcesoftheGraphicsandMemory domainsareutilized,differentbenchmarkapplicationsneedtobecreated,withvaryinglevelsofutilizationofthetwoGPU domains.
Fig.10presentsthebasicstructure ofoneofthedevelopedsyntheticGPUkernels. Byexecuting multiplekernelswith distinct valuesoftheNUM_ITERSparameter,differentcombinationswithdifferentratios betweenthenumberofmemory accessesandtheamountofcomputationscanbetested.Eachsyntheticbenchmarkwasexecutedatallsupportedfrequency levels,duringwhichtheexecutiontime andpowerconsumptionwereaccuratelymeasured,inordertoevaluatehoweach applicationisaffectedby frequencyscaling.TheutilizationoftheseveralGPUresourcesby thedifferentkernelswasalso quantified, by measuring severalhardware performance eventsduring theexecution ofeach kernel. However, unlike the executiontime andpowerconsumption,thesevaluesaremeasuredatasinglefrequencylevel.Sincethehighestoperating
Fig. 10. Example PTX code from one of the developed synthetic kernels.
Table 2
Metrics used in the DVFS-aware performance and power classification methodologies.
Metric name Description Used in
Performance Power DRAM Transactions Number of reads and writes to DRAM memory YES YES
L2 Transactions Number of reads and writes to L2 cache YES YES
Shared Transactions Number of reads and writes to shared-memory YES YES FU Single Utilization † level of the single-precision function units YES YES FU Double Utilization † level of the double-precision function units YES YES
FU Special Utilization † level of the special function units YES YES
FU Texture Utilization † level of the texture function units NO YES
Registers Number of registers used per thread NO YES
Occupancy Ratio of active warps on an SM with the maximum NO YES number supported by the SM
† Utilization: Ratio of the experimentally achieved throughput of each unit with respect to the theoretical peak.
frequencylevelistheonethatusuallyprovidesthebest-performance,theperformanceeventsweremeasuredatthislevel (onGTXTitanX:FMem=3505MHzandFCore=1164MHz).
The impact of DVFS in the performance and power-consumption of applications is in both cases dependent on how theGPUcomponentsare beingutilized.However,thecomponentsmostrelevantandhowtotakethem intoaccount may differbetweenthetwoclassifications.Whileintheperformanceclassificationtheinterestingmetricsrefertotheutilization of themostpredominantcomponents, i.e.theoneswithhigherprobability to be limitingtheperformance, inthe power classificationtheconsideredmetricsaretheaggregateutilizationofallcomponents,sincethedevicepowerconsumptionis thesumofthepowerconsumptionofeachindividualcomponent.Additionally,thecomponentswiththegreatestinfluence on theperformancemayalsodifferfromthose withthehighestinfluenceonthepowerconsumption(see [19,20,34]),i.e.
theperformancemetricsthatwillbeutilizedforthetwoclassificationmethodologiesmaybedifferent.
ThesetofperformancecountersthatwereusedtoquantifytheutilizationofthetwoGPUdomainsintheNVIDIAGTX TitanXGPUarepresentedinTable2.Thenumberofregistersusedperthreadcanbeobtainedatcompiletimebyusingthe nvcccompiler[35],whiletheremainingperformancecounterscanbemeasuredusingthenvidiaprofiler(nvprof)[36]during theexecutionofthekernels.TheDRAMbandwidthiscomputedasfollows:
DRAMBandwidth=DRAMTransactions×Transaction_width
Execution_time , (2)
whereDRAMTransactions andExecution_timearemeasuredby theprofiler,whileTransaction_widthisacharacteristicofthe GPUmicroarchitecture.Theshared-memoryandL2bandwidthsarecomputedsimilarly.ThequantificationoftheDRAMand Graphicdomainsutilizationthatareusedintheproposed DVFS-awareperformanceclassificationarecomputedasfollows:
DRAMUtil= DRAMBandwidth
DRAMPeak_Bandwidth (3)
GraphicsPeak_Util= max
All_components
{ δ
i·Utilizationi}
(4)Fig. 11. Performance and Power characterization of synthetic GPU benchmarks on NVIDIA’s GTX TITAN X.
FortheDVFS-awarepowerclassification,theGraphicsdomainutilizationiscomputedasfollows:
GraphicsAgg_Util= 1
i
ω
iAll_components
i
ω
i·Utilizationi (5)The valuesofthe coefficients
δ
i andω
i,inEqs. (4)and(5),are architecture specific andcorrespond tothe weightof thecomponenti tothe overallperformance (orpowerconsumption) oftheGraphicsdomain. Thedeterminationofthese parameterscanbedonethroughpropermodellingofthearchitecture,aspreviouslysuggestedin[34](validforGPUsfrom theFermimicroarchitecture).Nevertheless,forthesakeofsimplification,therestofthismanuscriptwillassumeδ
i=ω
i=1.3.2. Definingtheclassboundaries
After the executionandprofiling ofall thesynthetic kernels,it ispossible to groupthe benchmarksaccordingto the DVFSimpactontheirexecution.Inordertoseparatethemintoclusterswithsimilarcharacteristics,aclusteringapproachis used.Althoughmultiplealgorithmscanbeusedinthiscontext,techniquesleadingtosphericalclusterswereavoided(e.g., K-Means).Thehierarchicalclusteringtechniquewasselected,sinceitdirectlyspecifiesahierarchyofgroupsandtherefore simplifiestheselectionoftheoptimalnumberofclusters.
The setof features used by thisclusteringprocedure are the resultingchanges in the executiontime (or powercon- sumption),causedbymemoryandcorefrequencyscaling,i.e.thevalueoftheincrease(ordecrease)oftheexecutiontime (orpowerconsumption)ineachfrequencylevel,relativetotheone achievedatthereferencefrequencyconfiguration(on GTXTitanX:FMem=3505MHzandFCore=1164MHz).Duringthishierarchicalclusteringprocedure,theeuclideandistance (distancemetric) isusedtoquantifythesimilaritybetweenapplicationsandtheWard’scriteria(linkagecriteria)isusedfor clustermerging,since itminimizesthewithin-clustervariance.The consideredcut inthetreewasperformedinorderto obtainthe desirednumberofclasses.As previously stated,in ordertoaccurately identifythe benchmarksthattransition fromcompute-boundtomemory-boundateachmemoryfrequencylevel,thetotalnumberofperformanceclassesisrelated withthenumberofmemoryfrequenciesavailableintheGPUdevice.OnthisparticularGPUdevice,theperformanceclassi- ficationwillconsiderfourclassesofapplications,namelytwoextremeclassesTAandTC(hereafterreferredtoasT1andT4) andtwomiddleclassesT2andT3,correspondingtotwosubdivisionsofTBinFig.4),associatedwiththetwolowermem- oryoperatingfrequencies.Forthepowerclassificationthesuggestedsetofclasses(4)willbe considered(P1-P4). Fig.11a presentstheresultofthehierarchicalclusteringusingthepowerconsumptionfeatures.
Finally,for each classificationmethodology (performance orpower), afterall the syntheticbenchmarks havebeenas- signedtoone ofthefourclusters, aclassifiermustbe trainedto allowtheclassificationofnew(unseen) applications.Al- thoughmultiplealgorithmsexistintheliterature,neuralnetworks(NN)arehereinadoptedduetotheirrecognizedcapacity tomodelcomplexnon-linearproblems.Inordertovalidate theclassifierthe setofsyntheticbenchmarksisrandomlydi- videdintotwosubsets:training-setandvalidationset.Hence,aneuralnetworkstopologywithtwofully-connectedhidden layerswasdevised,wherethesigmoidfunctionisusedastheactivationfunctiononallneurons.Theperformancenetwork haslayersizes40and10,whilethepowernetworkhaslayersizes20and40.Theinputnodesarethenfedwiththevalues oftheGraphics andMemoryresource utilizations,withtheneural networkbeingtrainedto identifytheclassesassigned duringthehierarchicalclusteringstage.
Figs.11bandcdepictthesetofsyntheticbenchmarksusedfortheperformance andpowerclassifications,respectively, aswellastheobtainedclassesandtheir respectiveboundaries. Itcanbe seenthat,fortheperformance classificationthe DVFS impactontheperformanceissolelydependent ontheratio GraphicsUtilDRAMUtil..,resultinginlinearboundaries,whileforthe powerclassificationthisrelationshipisnotlinear.
Itisimportanttostressthatgiventhesmallsizesoftheneural-networks,thetimerequiredtotrainisrelativelysmall.In fact,thebiggesteffortinthetrainingphaseofourclassificationmethodsisintheexecutionofallthesyntheticbenchmarks
on the available frequency configurations. Nonetheless, thisis a step that is performedonce (offline) and thereforewill not haveanyimpactintheapplicationexecutiontime.Duringapplicationexecution,itisonlynecessarytorun theneural networkforinference,corresponding,onaverage,tolessthanasecondonanInteli74500Uprocessor.
3.3. Extensiontootherarchitectures
While the proposed methodology mainly considers a NVIDIAGPU from the Maxwell microarchitecture, it is possible to makesomeconsiderations regardingits extensiontoother architectures.Theproposedapproachassumesthat theGPU device has two independent frequency domains, which is the case for all recent GPU devices (both NVIDIA and AMD).
Since theexecutionofthe syntheticbenchmarkstriestocover aconsiderablerangeof possiblecombinationsofresource utilizationfromthetwoGPUdomains,itispossibletoadapttheexecuted settotheconsideredGPUdevice.Additionally, asitwaspreviouslymentioned,thesuggestednumberofperformanceclassesdependsontheavailablenumberofmemory frequency levels on the considered device (see Section 2.1). On theother hand,the numberof suggestedpower classes, foranyGPUwithmorethana singlememoryfrequencylevelsisfour(see Section2.2).However,iffutureGPUscontinue increasing thenumberofallowedmemoryandcorefrequencylevels,itisexpectablethatthenumberofclassesshouldbe increasedaswell.Whenconsideringadifferentnumberofclassesthanthoseconsideredinthissection,convenientchanges shouldalsobeappliedtothehierarchicalclusteringstage,specifically,inthecut-off pointofthedendrogramtree,inorder toobtainthedesirednumberofclusters.
Accordingly,giventhegeneralityoftheproposedmethodologies, itisclearthattheycanbestraightforwardlyextended to other GPUs.In fact,giventhe currentevolution trendofNVIDIAGPUs, itis expectedthat theproposed approach will only keepgettingmoreinteresting,since everynewgeneration hasbeen introducinglargerrangesofallowed frequencies forboththecoreandmemorydomains.
4. Validationoftheclassificationalgorithms
Toevaluatetheproposedmethodology,severalCUDA-basedapplicationbenchmarksfromtheParboil[22],Rodinia[23], SHOC[24],Polybench[25]andCUDASDK[26]suites(seeTable3)wereexecutedontwoGPUdevicesfromdifferentNVIDIA microarchitectures: GTX Titan X (Maxwell microarchitecture) andTitan Xp (Pascal microarchitecture). The Maxwell GPU (PascalGPU)providesauser-levelinterfacetoscalethecoreoperatingfrequencybetweendifferentlevels,withinthe595- 1164MHz (582-1911MHz) range, and the DRAM frequencyin 3 (2) non-idle levels: 810, 3300 and3505MHz (5705 and 4705MHz). Additionally,atthelowest memoryfrequency,i.e.810MHz,theGTXTitan XGPUallows tofurtherdownscale the corefrequencyinto21 additionallevels, downto 135MHz.Inboth GPUs,the defaultfrequencysetup correspondsto a dynamically managedfrequency state, denoted as Auto-boost[37],used to boost the applications performance, by in- creasingGPUcoreandmemoryfrequencieswhensufficientpowerandthermalheadroomisavailable.Notwithstanding,to apply theproposedperformanceandpowerclassificationmodel,itishereinassumedthattheGPUoperatesatthehighest user-controlled coreandmemoryfrequencylevels(i.e.,FMem=3505 MHzandFCore=1164MHzfortheMaxwellGPUand FMem=5705MHz andFCore=1911 MHzfor thePascal GPU).Hence, each benchmark is only executed atsuch reference frequencylevel,wheretheperformancecountersvaluesweremeasured usingtheNVIDIAProfiler [36].Accordingly,based on: (i) thegathered valuesoftheperformance counters; (ii)the proposed classificationscheme(see Fig. 9);and(iii)the classifierstrainedusingthesyntheticbenchmarks(seeFigs.11bandc),eachapplicationwassubsequentlyclassifiedinone performanceclassandonepowerclass.
Finally, to validate and evaluate the attainedclassifications, each application was alsoexecuted at all user-controlled memoryandcorefrequencylevels,includingwiththeactivationoftheAuto-boostfeature.Ateachfrequencyconfiguration, theexecutiontimeofapplicationswasaccuratelymeasuredusingtheNVIDIAProfiler[36]andtheGPUpowerconsumption wasobtainedusingtheNVML[38] library,whichis aC-basedAPIthat allowsamongst other things,monitoringthestate of the GPU. In some NVIDIA GPUdevices, asis the casefor both the considered devices, itis possible to use NVML to get the current powerdraw of the device. Through experimental testing, itwas determined that the refresh rateof the valuesobtainedusingNVML isabout100ms.Accordingly, aGPUpowermeasuring tool wasimplemented,which samples theGPUpowerdrawevery25ms.Thekernelsfromtheapplicationswithsmallerexecutiontimeswererepeateduntileach kernelwasexecutedforatleast1s,inordertoincreasetheaccuracyofthemeasuredsamples.Thepowerconsumptionof each kernelwascomputedastheaverageofall gatheredsamples.Forapplicationswithmultiplekernels,thetotal power consumptionwasobtainedbyaveragingtheconsumptionofeachkernelweightedwiththerelativeexecutiontimeofeach kernel.Inordertoguaranteetheintegrityofthemeasurementstaken,allapplications(bothsyntheticandreal)areexecuted 10times.Thevaluespresentedcorrespondtotheaverageresultsoveralltheexecutedruns.
Sections4.1and4.2presenttheresultsoftheproposedperformanceandpowerclassificationmethodologies,respectively, whileSection4.3presentstheresultsofcombiningthetwoapproachesinordertoobtainaDVFS-awareenergyclassification ofGPUapplications.
COVAR Polybench Default
FDTD-2D Polybench Default
GEMM Polybench 2048 ×2048
GRAMSCHM Polybench Default
SYRK Polybench Default
Backprop Rodinia 655360
CFD Rodinia missile.domn.0.2M
Gaussian Rodinia 2048 ×2048
Hotspot Rodinia 1024, 2, 10,0 0 0 K-Means Rodinia 30 0 0 0 0 0_34f.txt
LUD Rodinia 8192 ×8192
ParticleFilter_Float Rodinia -x 256 -y 256 -z 80 -np 50,0 0 0 ParticleFilter_Naive Rodinia -x 256 -y 256 -z 80 -np 50,0 0 0
Srad_V1 Rodinia 4096 ×4096
Srad_V2 Rodinia 4096 ×4096
Streamcluster Rodinia Default
BFS SHOC -passes 100 -size 4
FFT SHOC -passes 100 -size 4
MD5Hash SHOC -passes 5 -size 4
Reduction SHOC -iter 10 0 0 -passes 1 -size 4
S3D SHOC -passes 100 -size 4
Sort SHOC -passes 100 -size 4
SPMV SHOC -iter 400 -passes 1 -size 2 Stencil2d SHOC -passes 1 -num-iters 100 -size 4
Fig. 12. DVFS-aware performance classification of the tested applications depending on the utilization of the DRAM and Graphics domains, on GTX Titan X (Maxwell) with F Mem= 3505 MHz and F Core = 1164 MHz.
4.1. Performanceclassification
Fig.12presentsthecomputedvaluesfortheDRAMUtil andGraphicsPeak_Util foralltestedapplicationsandhowtheyare classifiedintheGTXTitanXGPUdevice accordingtothepreviouslytrainedclassifier(see Section3). Additionally,Fig.13 presentshowtheexecutiontimeoftheapplicationsofeachresultingclassisaffectedbythecoreandmemoryfrequency downscaling, withthepresented valuesbeing normalizedto theexecution time ofeach applicationatthe referencefre- quencystate.Tosimplifytheanalysisoftheresults,adashed linewasaddedtothegraphsinordertoshow theexpected
Fig. 13. DVFS-Aware Performance classes on GTX Titan X (Maxwell), with F Ref= 1164 MHz. The execution times are normalized to the value at F Mem = 3505 MHz and F Core= 1164 MHz.
executiontimevariation foraperfectcompute-boundapplication,whereT∝F1
Core.Finally,asetofmarkers(1–5)waswere alsoaddedtothefigure,tohighlightthemaintakeawaysfromtheresults.
Byanalysingtheresultspresentedinthesegraphs,itcanbeconcludedthatalthoughthisdevicehasthreememoryfre- quencylevels,theyarenotuniformlyseparated.Infact,thetwohighestmemoryfrequencylevels(3505and3300MHz)are so closetoeach other (only6% different)that theperformance resultsofall applicationsdidnot significantlychangebe-
Fig. 14. DVFS-aware performance classification of the tested applications depending on the utilization of the DRAM and Graphics domains, on GTX Titan X (Maxwell) with F Mem= 3505 MHz and F Core = 1164 MHz.
tweenthesetwolevels.Notwithstanding,astherearemoresignificantdifferencesinthepowerdomain,thesetwomemory levelswillbeexploitedinSection5toattainenergysavings.
Ontheother hand,whenanalysingtheclassificationresults,asetofimportantobservationsareattained,whichclearly justifiesthe partitionsoftheapplicationsinthesefourclasses.Inparticular,by observingmarker(1),itcan beseen that whenFMem=3505MHz,forthehighestcorefrequencies(above1013MHz,i.e.FRef/FCore<1.15),theapplicationsinclassT1 haveavariationoftheirexecutiontimethatislower thanthecorefrequencyvariation.Hence,atthismemoryfrequency, theseapplicationsare considered memory-bound. However,atFCore=1013MHz (FRe f/FCore=1.15), theseapplicationsstart havingtheir executiontime scaling atthe samerateasthecorefrequency (seeFig. 3a). Whenthe memoryfrequencyis scaleddownto810MHz(see(2)inFig.13),thereisasignificantdrop-off intheperformanceoftheseapplications,whichis expectedsincetheyhaveveryhighDRAMutilization(seeT1inFig.12).Also,asexpectedatthislowestmemoryfrequency, therangeofcorefrequencieswheretheseapplicationshavenegligibleperformancedrop-off isincreased,sinceitmeansthat atalowest memoryfrequencythecorefrequencyneedsto befurtherdecreased fortheGraphicscomponents tobecome theperformancebottleneck.
ClassT2composestheapplicationsthatarecompute-boundatFMem=3505MHz(see(3a)inFig.13),sincetheirexecu- tiontimealwaysscaleswithFCore).However,theseapplicationsarememory-boundatFMem=810MHz(see(3b)inFig.13), sinceforhighcorefrequenciestheexecutiontimedoesnotscalewithFCore.Additionally,theexecutiontimeoftheseappli- cationsalsosignificantlyincreaseswhenthememoryfrequencyisdecreasedtothelowestlevel.However,inalesserextent thantheapplicationsfromclusterT1(compare(2)and(3b)inFig.13),whichisconsistentwiththeirsmallerutilizationof theDRAMresources(seeT2Fig.12).
Furthermore,ClassT3includestheapplicationsthat havetheir executiontime scalingwithbothcoreandmemoryfre- quencies.However, whilethesamecouldbe potentiallysaid abouttheapplicationsintheT2class,theapplicationsinT3 always havetheir executiontimes scalewiththe corefrequency(even atFMem=810), which isnot thecaseforthe ap- plicationsinT2(compare (4a) and(4b)inFig.13).The behaviourofmanyof theT3applicationsis aresultoftheir low occupancyoftheGPUresources,resultinginasequentialexecutionofmanyinstructions.
Finally,classT4comprisesalltheapplicationsthatcanbeconsideredcompute-boundatallmemoryfrequencylevels(see themarkers(5)inFig.13),astheirexecutiontimealwaysscalesinverselywiththecorefrequencyandneverscaleswiththe memoryfrequency(lessthan1%performancechangewhenthememoryfrequencyischangedfrom3505MHzto810MHz).
Again,itisimportanttostressthatthismethodologyallowstheclassificationofGPUapplicationsintoclassesthatchar- acterizetheir performance atall frequencylevels,byusingtheinformationobtainedfromtheir executionatasinglecore frequencyinarealhardwaredevice.
4.2. Power-awareclassification
Fig.14presentstheutilizationlevelofseveralcomponentsoftheGraphicsandMemorydomainsforthetestedapplica- tionsandhowtheyareclassifiedaccordingtotheproposedpower-awareclassifier(fortheTitanXGPUdevice).Theimpact ofDVFSinthepowerconsumption oftheapplicationsofeachclassispresentedinFig.15.Asitwasthecasefortheper- formance characterization,all values are normalizedto the powerconsumption at the referencefrequencies. The dashed lineshowsthecorefrequencyvariationateach level,i.e.applicationsthatfollowthelinehavepowerconsumptionscaling
Fig. 15. DVFS-Aware Power classification on GTX Titan X (Maxwell), with F Ref= 1164 MHz. The power consumptions are normalized to the value at F Mem= 3505 MHz and F Core= 1164 MHz.
linearlywithFCore,whileapplicationswithapowerconsumptionvariationbelowthislinehavesuper-linearscalabilitywith FCore.
Eachoftheresultingclassesdisplaysdifferentdegreesofsensitivitytovariationsinthecoreandmemoryfrequencies.
In particular, whencomparing theapplications relativepower consumption andwhen FMem decreases from3505MHzto 810MHz(at FCore= FRef)-comparethemarkers(1)inFig.15-itcanbe perceivedthatapplicationsfromclassesP1and P2haveanhigherdegreeofsensitivitytomemoryfrequencychanges,sincetheyachievepowerconsumptionsintherange from70%to50%ofthereferenceconsumption(correspondingtopower-savingsbetween30%and50%).Ontheotherhand,
Fig. 16. DVFS-Aware energy classification on GTX Titan X (Maxwell).
applications fromclassesP3 and P4 display alower degree of sensitivityto memoryfrequency changes,since they only achievepower-savingsbetween10%and30%,whenthememoryfrequencyisdecreasedbetweenthesamevalues.
Whenanalysing theapplicationspowerconsumption sensitivitytocorefrequencyvariations,it canbe concludedthat, applications from classes P1 andP3 display a lower degree of sensitivity than applications from classes P2 and P4. In particular,bycomparingtheapplicationspowerconsumption variationatFMem=3505 MHzwiththedashedline(seethe markers(2)inFig.15),itcanbeobservedthattheapplicationsfromclassesP1andP3presentalinearorsub-linearscaling withthedashed-lineintherangeFRef/FCore∈[1; 1.25].Ontheother hand,forthesamecorefrequencyrange,applications fromclassesP2andP4showasuper-linearpowerconsumptiondrop-off.
TheexceptiontosuchtrendispresentedinClassP2,forthematrixMulCUBLASbenchmark,whichdoesnotfittheprofile of the remaining benchmarks (see the marker (3)in Fig. 15). Upon further inspection, it was observed that, duringthe executionofthiskernelatthehigherfrequencylevels,thepowerconsumption oftheGPUreachesveryhighlevels(close tothedevicepowercapof275W),thussuggestingtheeffectisduetointernalGPUpowercontrol mechanisms.Sincethe valuespresentedinFig.15arenormalizedtothemaximumpowerconsumption,whichforthematrixMulCUBLASbenchmark islimitedbythedevicecapabilities,theresultswillbeskewed.
4.3. Energy-awareclassification
Since,ontheGTXTitanXGPU,theproposedmethodologiesgiverisetofourperformance classesandfourpowerclas- sificationsforthisspecificGPUdevice,itcanbedefinedasetof16energyclasses.However, itisobservedthat thesetof testedbenchmarksareclassifiedintoonly10outofthose16classes.Sincebothperformance- andpower-awareclassifica- tionsareconsideringtheresourceutilizationoftheGraphicsandMemorydomains,eventhoughtheyusedifferentwaysto combinetheutilizationsoftheseveralcomponents,manyofthecombinationsbetweenperformanceandpowerclassesare veryhardtoachieve.Theresultsoftheobtained10energy-awareclassesarepresentedinFig.16,whereitispresentedthe energy-savingsofall applicationsforallavailable frequencylevels.Theenergy-saving (R)atfrequencystate(FCore,FMem)is
Fig. 17. DVFS-Aware energy classification on Titan Xp (Pascal).
computedinthefollowingway:
R
(
FCore,FMem)
=Ere f−E(
FCore,FMem)
Ere f , (6)
whereEref istheenergyconsumptionatthereferencefrequencies(FMem=3505MHz,FCore=1164MHz).
Fig.17presentstheobtainedenergy-awareclassesofapplicationsontheTitanXpGPU(Pascalmicroarchitecture).Since in this device there are only two memory frequencylevels, the proposed classification methodology will result inthree performance classes.Combiningtheseclasseswiththefourpowerclasseswillultimatelyresultin12energyclasses,some ofwhichwill againbe unoccupied.Thelargerrangeofoperatingcorefrequenciesavailable inthisGPUdevice (compared withMaxwell GPU),results inalarger rangeofcorefrequencieswhere theapplicationswithvery highDRAMutilization (ClassE1)decreasepowerconsumptionwithoutlosingperformance:from1911MHzdownto1404MHz(vs.1164-1013MHz onMaxwell).Ontheotherhand,thelackofalowmemoryfrequencylevelresultsinlessinterestingenergy-savingsoppor- tunitiesforapplicationswithhighgraphicsutilizationandlowDRAMutilization(ClassE12).
As itcanbeseen,bycombiningtheresultsofthetwoproposed classificationmethodologies,asetofclassesthatsuc- cessfully grouptogetherapplicationswithsimilarenergyconsumptioncurvescanbeobtained.Theresultofthisclassifica- tion canbe usedinmanydifferentwaystomaximize theenergy-efficiency ofcomputationalsystems,some ofwhichare proposedinSection5.
5. Application-awareDVFS
ByprovidingasystematicmechanismtocharacterizetheimpactofDVFSontheenergy-consumptionofanyGPUappli- cation,theproposedDVFS-awareclassificationmethodologiescreatemanyinterestingopportunitiestoimprovetheenergy- efficiency of HPCsystems. In fact, since the impact ofDVFS on the energy consumption combines the performance and powerclassificationsoftheapplication,byconsidering theaveragetrendamongall applicationsintheclass,itispossible to predict howthe energyconsumption ofeach applicationwill changewiththefrequency scalingof eachGPU domain.
Sections 5.1and5.2addresstwo differentwaysto usetheseclassification methodologies, namelytochoose thebest op- erating frequencies (coreandmemory), inorderto solelymaximize energy-savingsorto tryto maximize energy-savings withoutdecreasingthe performancemorethanadefinedthreshold.Section5.3addressesanotherpossibleusageofthese classification methodologies,that combinesthepredictedfrequencyrangeswiththeenergy-savings,inorderto selectthe mostconvenientoperatingfrequencyforascenariowheremultipledistinctapplicationsaresimultaneouslyrunningonthe GPUdevice.
5.1. Optimalfrequencyformaximizingenergy-savings
Just asthe impact ofDVFS inthe energyconsumption of applications within the sameclass is very similar between them, it can alsobe observedthat, all applications of thesame class have thesame (or very similar)optimal operating
Fig. 18. Achieved energy-savings for the selected and optimal frequency levels, on GTX Titan X (Maxwell).
frequencies (memoryandcorefrequencies that minimize theenergyconsumption onthat device).Hence, by considering the average ofthe optimalfrequencies from thesynthetic benchmarksof each class (referred inthis section asselected frequencies),itispossibleto attainsignificantenergy-savings.Accordingly, Fig.18Adepictstheoptimalcoreandmemory frequenciesforeachapplicationexecutedonGTXTitanX,aswellastheselectedper-classcoreandmemoryfrequencies.
Asexpected, applicationswithvery highcoreutilizations(classesE10,E11,E12, E15andE16) areabletodecreasethe memoryfrequencytothe minimumvalue,while applicationswithhighDRAM utilization(classesE1 andE2)requirethe memoryfrequencytobesettothehighestvalue.
Ontheotherhand,itisinteresting tonotethat contrarytowhat onemightexpect,theapplicationsfromtheclassE1 (HighDRAM/LowGraphics)havehigheroptimalcorefrequenciesthantheapplicationsfromclassE16(LowDRAM/High Graphics).The former group ofapplications are able to downscale thecore frequency(with FMem=3505 MHz) inorder to achieve a lower energy consumption, whilethe applications fromclass E16 are able to decrease thecore frequencies evenfurther(withFMem=810MHz), whichisaresultoftheinteractionbetweenthetwopowercomponentsfromEq.1. However,sincetheseapplications(classE16)areverycompute-intensivetheywillhaveanhighperformancedrop-off when runningatthemiddlecorefrequencylevels,whichwillbefurtheranalysedinSection5.2.
Fig. 19. Achieved energy-savings for the selected and optimal frequency levels, when the maximum allowed performance drop-off is of 10%, on GTX Titan X (Maxwell).
Fig. 18B presents the energy-savings obtainedwith the selected frequency levels in relation to: (i) the reference fre- quency;(ii)totheNVIDIAAuto-boost;and(iii)totheworstcase,i.e.thehighestenergy-consumptionofeachapplication.
As itcanbeseen,allapplicationscanachieve lowerenergyconsumptionsthantheonesobtainedusingthereferencefre- quency(frequencystatewithhighestperformance),andonlythreeapplicationswouldhavelowerenergyconsumptionusing NVIDIA’sAuto-Boost.Onaverage,usingtheselectedpairoffrequencieswouldwield16%energy-savingscomparedwiththe reference,and13%inrelationtotheAuto-Boost.Additionally,someapplications(classesE12,E15andE16)canevenachieve greater energy-savings,promptedbythe factthat they candecreasethe memoryfrequencyto23% ofitsreferencevalue, makingthemachievecloseto36%energy-savings.
Althoughtheproposed proceduretochoosetheselectedfrequenciesdoesnotguaranteeoptimalenergy-savings,i.e.the selected frequencyisnot alwaysequal totheoptimalone,fromthe resultspresentedinFig.18C,it canbe seenthat the difference between optimalconfiguration (horizontal lineat100%, representingthe lowest energy consumption)andthe oneobtainedusingtheselectedfrequenciesisverysmall,withanaverageof0.74%differencebetweentheselectedandthe optimalenergyconsumptions.
5.2. Optimalfrequencyforenergyv.s.performancetrade-off
Another interesting situation to consider is the maximization ofthe applications energy-efficiency without sacrificing their performance. Consideringa situationwhere atmosta 10% drop-off inperformance isallowed, andcomparing itto theperformanceofthereferencefrequencystate,theresultspresentedinFig.19canbeobtained,fortheGTXTitanXGPU.
Again,bychoosingthenewlyselectedfrequencies,allapplicationsareabletoachievelowerenergyconsumptionsthanthe oneatthereferencestate.Additionally,allobtainedenergy-savingsusingtheselectedfrequencylevelsareatmost3%distant fromtheoptimalenergy-savings,withaverageenergy-savingsof9%versusthereferenceand7%versustheAuto-boostsetup.
Itisinteresting tonotethat theapplicationswiththegreatergainsarethosewitheitheraveryhighDRAMutilization andlowGraphicsutilization(classE1),ortheapplicationsintheotherextreme(classesE15andE16).Asitwaspreviously seen,thishappensmainlyinthefirsttypeofapplications(classE1),since atthehighestmemoryfrequencyitispossible todownscalethecorefrequencywhileachievingverynegligibleperformancedrop-off.Thesetypesofapplicationscansave up to16% withlessthan10% increase intheir executiontime.On theother hand,applicationsfromclassesE15andE16 are able to run at the lowestmemory frequency withnegligible performance drop-off. For example,if justthe memory