

5.2 Comparison - Experimental Results

As analysed in the introductory chapter, we compare our proposed scheduler with the default Linux scheduler and with two more state-of-the-art contention-aware schedulers, LLC-MRB and LBB. For this comparison, a workload was created with 17 applications from the Polyhedral Benchmark Suite [8], so that we have enough applications from all classes. The whole workload is executed for 1 hour with quanta of 1 second, as stated before; when an application finishes, it is respawned and executed again. Our scheduler is compared to LLC-MRB, LBB, the default Linux scheduler and a naive Gang scheduler in terms of fairness and total throughput.
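To make the respawning protocol concrete, the following Python sketch shows one plausible shape of such a workload driver. It is a minimal sketch under our reading of the setup above; run_workload, the polling interval and the command handling are illustrative and not taken from the actual test harness.

    # Minimal sketch of the respawning workload driver described above.
    # Assumption: each benchmark is an external command; the driver keeps
    # every application alive for the full hour and counts completions.
    import subprocess
    import time

    DURATION = 3600  # the workload runs for 1 hour, as stated in the text

    def run_workload(commands):
        """commands: tuple of argv tuples, one per benchmark instance."""
        completions = {cmd: 0 for cmd in commands}
        procs = {cmd: subprocess.Popen(cmd) for cmd in commands}
        deadline = time.time() + DURATION
        while time.time() < deadline:
            for cmd, proc in procs.items():
                if proc.poll() is not None:   # finished: count it, respawn it
                    completions[cmd] += 1
                    procs[cmd] = subprocess.Popen(cmd)
            time.sleep(0.1)                   # illustrative polling interval
        for proc in procs.values():           # stop everything at the deadline
            proc.kill()
        return completions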

The workload consisted of the following applications, compiled with various problem sizes. Table 5.1 shows the data set used for each application:

In the Gang scheduler: all available cores (8) are allocated to each application and each application is executed alone.

In the default Linux scheduler: the optimal number of cores, as defined by the prediction model, is allocated to each application, and the applications are then left to be scheduled by CFS. This means that threads of the same application might not be co-scheduled.

In LLC-MRB and LBB: the number of allocated cores for each application is defined by the prediction model, but applications are co-scheduled using a list sorted by LLC misses or memory bandwidth respectively, combining applications from the beginning of the list with applications from the end, in order to balance the LLC misses and the memory bandwidth consumption respectively. Figure 5.1 gives a brief illustration of these scheduling algorithms.
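To make the front-back combination concrete, the following Python sketch shows one plausible reading of this balancing step; balanced_pairs and metric are illustrative names and not part of the actual LLC-MRB or LBB implementations.

    # Sketch of the front-back pairing used by LLC-MRB and LBB, as described
    # above: sort by the contention metric (LLC misses for LLC-MRB, memory
    # bandwidth for LBB) and pair the heaviest with the lightest application.
    def balanced_pairs(apps, metric):
        """apps: list of applications; metric: function returning the
        per-application LLC-miss rate or memory-bandwidth consumption."""
        ordered = sorted(apps, key=metric)            # lightest ... heaviest
        pairs = []
        lo, hi = 0, len(ordered) - 1
        while lo < hi:
            pairs.append((ordered[lo], ordered[hi]))  # front combined with back
            lo += 1
            hi -= 1
        leftover = ordered[lo] if lo == hi else None  # odd count: one runs alone
        return pairs, leftover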

Table 5.1: Benchmarks Data Sets

Benchmark             Data Set
2mm                   1024
cholesky              1024
jacobi                default suite large data set
stream_d0, stream_d1  5 * LLC cache size
trmm                  3000
dynprog               (700, 350)
mvt                   600
syr2k                 900
atax                  8000
gemver                9000
gemm                  1150
cholesky              4096
atax                  18000
correlation           1500
covariance            1500
2mm                   1500


In our scheduler, called LCN: the number of allocated cores is defined by the prediction model, except for the N applications which are allocated half of the cores and scheduled twice, as described above. The workload is divided into pairs and standalone applications using the previous algorithm.
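A minimal sketch of this allocation step follows, under our reading of the description above; App, optimal_cores and paired are illustrative stand-ins for the prediction model's output and the pair-selection step, not the actual implementation.

    # Hedged sketch of LCN's per-quantum core allocation.
    from dataclasses import dataclass

    TOTAL_CORES = 8  # all available cores, as in the Gang configuration

    @dataclass(frozen=True)
    class App:
        name: str
        optimal_cores: int    # assumed output of the prediction model
        paired: bool = False  # True for the N applications that run in pairs

    def lcn_allocation(apps):
        """Return the number of cores granted to each application."""
        alloc = {}
        for app in apps:
            if app.paired:
                # paired applications share the machine: half the cores
                # each, and they are scheduled twice as often to compensate
                alloc[app.name] = TOTAL_CORES // 2
            else:
                alloc[app.name] = app.optimal_cores
        return alloc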

In Table 5.2, we can see the number of times each application was completed during the execution of the workload under each scheduler. In the last rows of the table, we can also see the total number of executions over all applications for each scheduler (corresponding to the overall throughput of each scheduler), the number of applications for which each scheduler achieved the best performance (a different measurement of overall throughput), and the standard deviation of the per-application completion counts normalised by the corresponding counts of the Gang scheduler (corresponding to the fairness).

The schedulers can be compared in terms of overall throughput using two criteria: (i) the total number of times all the applications have been completed, and (ii) the total number of applications for which the scheduler has managed to achieve the highest number of completions. They can also be compared in terms of fairness using the standard deviation of each application's completion count relative to the corresponding count of the Gang scheduler. Green colour is used to mark where a scheduler has achieved the most completions for an application, and yellow colour is used where multiple schedulers have reached the maximum number. As is evident, our scheduler outperforms CFS and the other state-of-the-art contention-aware schedulers in terms of overall throughput under both criteria. In terms of fairness, the other contention-aware schedulers appear better, with LLC-MRB being the best. However, this is not a clear-cut conclusion: it may not mean that some applications are treated unfairly by our scheduler, but rather that some applications are highly favoured by it.

Figure 5.1: LLC-MRB and LBB balancing schedulers


For a more qualitative view of the comparison between the different schedulers, we can examine the following graphs produced from Table 5.2. First of all, for each application and for each scheduler, we have calculated the gain compared to the Gang scheduler by dividing by the corresponding execution count of the Gang scheduler. More formally, the calculations used the formula cell_{i,j} = cell_{i,j} / cell_{i,1}, for j >= 2. Figure 5.2 presents the gains of the 4 schedulers separately per application. We have also calculated the average gain and the standard deviation of gains for each scheduler; Figure 5.3 presents those results.
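As an illustration of this calculation, the short Python snippet below recomputes the gains and the per-scheduler summary statistics from Table 5.2; the variable names are ours, and only two benchmark rows are spelled out for brevity.

    # Recomputing the gains of Figures 5.2 and 5.3 from Table 5.2.
    from statistics import mean, stdev

    completions = {
        # benchmark: (Gang, CFS, LLC-MRB, LBB, LCN) completion counts
        "2mm (1024)":      (54, 101, 92, 92, 101),
        "cholesky (1024)": (55, 124, 92, 92, 136),
        # ... remaining rows of Table 5.2 ...
    }

    schedulers = ("CFS", "LLC-MRB", "LBB", "LCN")

    # gain of scheduler j on application i: cell_{i,j} / cell_{i,1}
    gains = {
        s: [row[j + 1] / row[0] for row in completions.values()]
        for j, s in enumerate(schedulers)
    }

    for s in schedulers:
        print(f"{s}: average gain {mean(gains[s]):.2f}, "
              f"std dev {stdev(gains[s]):.2f}")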

Table 5.2: Comparison between schedulers

Benchmarks          Gang   CFS   LLC-MRB   LBB   LCN
2mm                   54   101        92    92   101
cholesky              55   124        92    92   136
jacobi                13    15        13    16    17
stream_d0             31    34        32    30    34
stream_d1             31    34        21    32    21
trmm                  54    75        69    69    69
dynprog               54    22        36    53    36
mvt                   43    31        39    34    30
syr2k                 72    80        92    92    92
atax                  27    20        25    19    22
gemver                54    38        69    69    39
gemm                  72    85        55    55    92
cholesky              43    49        55    55    55
atax                  43    25        46    21    39
correlation           43    62        55    55    55
covariance            43    62        55    55    55
2mm                   72    79        92    92    92
Total exec.          804   936       938   931   985
Most Improved          -     5         7     5     8
Standard Deviation     -  0.46      0.30  0.32  0.46


Figure 5.2: Comparison of gains per application between 4 schedulers

Figure 5.3: Comparison of average gain and standard deviation between 4 schedulers


Investigating the schedulers from a theoretical perspective, we can explain their differences in total throughput.

The CFS scheduler, as described in the introduction, cannot take full advantage of multicore systems. It might even separate the threads of the same application into different time quanta, which results in significant losses of parallelism gains. Data-sharing patterns that would otherwise be leveraged are ignored by CFS. Furthermore, being contention-unaware, CFS cannot locate resource contention and co-schedules threads in a contention-agnostic way, thus hurting their performance significantly.

On the other hand, both LLC-MRB and LBB cannot differentiate between class N and class C applications, since both classes exhibit low LLC misses and low memory link usage [1]. As a result, they co-schedule class L applications with class C applications, thus creating even more contention. As was clarified in chapter 3, the co-execution of C and L applications results in a dramatic slowdown of the C application. Our scheduler, however, can tell these classes apart through its classification scheme and limits the contention to a satisfactory degree.
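This blind spot can be seen with a toy example (the numbers below are made up for illustration): a sort by LLC misses places the class N and class C applications side by side at the light end of the list, so the balancer has no way to prefer the harmless N partner for the noisy L application.

    # Synthetic LLC-miss rates: N and C both look "light" to the sorter.
    apps = {"L_app": 950, "N_app": 40, "C_app": 35}
    ordered = sorted(apps, key=apps.get)        # ['C_app', 'N_app', 'L_app']
    print("heaviest paired with lightest:", (ordered[-1], ordered[0]))
    # -> ('L_app', 'C_app'): the harmful pairing that chapter 3 warns about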

This is simply the result of our classification scheme being directly tied to the memory hierarchy of modern multicore systems.

