Figure 12: Sandy Bridge microarchitecture

The pipeline of the Sandy Bridge microarchitecture includes the three parts mentioned above. Furthermore, the uOP cache holds up to 1.5K micro-ops, as can be seen in Figure 12 above. In the Sandy Bridge microarchitecture, up to six micro-ops can be issued simultaneously.

As far as the cache hierarchy is concerned, there are three levels of cache. The first level is divided into two separate caches: the instruction cache (L1ICache), which was referred to above, and the level 1 data cache (L1DCache). The L2 cache is used for both data and instructions. The Last Level Cache (LLC) is common to all cores in a package through a ring interconnect, as described below. The LLC is divided into slices which are connected by a bi-directional interconnect. From the cores’ point of view, the LLC slices appear as a shared cache with multiple ports and bandwidth that scales with the number of cores. The hit latency depends on the distance between the core and the LLC slice that holds the data: if they are far apart, the data have to travel a longer distance through the ring. As the number of cores increases, so does the number of LLC slices, and it is therefore unlikely for the ring and the LLC to become a performance bottleneck for the cores.
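
The shape of this hierarchy is visible from software. The following is a minimal sketch, assuming a Linux system that exposes the usual sysfs cache attributes (level, type, size, shared_cpu_list); it lists every cache visible to CPU 0, and on a Sandy Bridge machine it would typically show the split L1, the per-core unified L2, and the L3 shared among all cores.

/* Minimal sketch (assumes Linux sysfs): list the cache hierarchy that
 * the kernel exposes for CPU 0, together with the set of logical CPUs
 * that share each cache. */
#include <stdio.h>
#include <string.h>

/* Read one sysfs attribute of cache indexN into buf; returns 0 on failure. */
static int read_attr(int idx, const char *attr, char *buf, size_t len) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, attr);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    if (!fgets(buf, (int)len, f)) buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    return 1;
}

int main(void) {
    char level[16], type[32], size[32], shared[64];
    /* index0, index1, ...: one sysfs directory per cache visible to cpu0 */
    for (int i = 0; read_attr(i, "level", level, sizeof level); i++) {
        read_attr(i, "type", type, sizeof type);
        read_attr(i, "size", size, sizeof size);
        read_attr(i, "shared_cpu_list", shared, sizeof shared);
        printf("L%s %-12s size=%-8s shared with CPUs: %s\n",
               level, type, size, shared);
    }
    return 0;
}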

Moreover, in the Sandy Bridge microarchitecture, if Hyper-Threading is available on a given chip, the L1DCache is shared between the two threads of each core.

In conclusion, the changes introduced by the Sandy Bridge microarchitecture are aimed at reducing power consumption rather than directly improving performance.

For example, the introduction of the decoded L1ICache is aimed mostly at reducing power consumption rather than at raw performance. By reducing the power needed, it is possible to gain performance by increasing the clock frequency within the same TDP.

3.1.2 Haswell Details

The Haswell microarchitecture, also known as the fourth-generation Intel Core microarchitecture, builds on the successful Sandy Bridge and Ivy Bridge designs. It is manufactured on the same 22 nm process as Ivy Bridge. The following image is a depiction of the Haswell pipeline.

Figure 13: Haswell microarchitecture

It is obvious that there are not many changes compared to the pipeline of Sandy Bridge. Concerning the front end, a major difference between Sandy Bridge and Haswell is that not only the L1DCache but also the decoded L1ICache is shared between threads. Another enhancement provided by the Haswell microarchitecture, which improves the efficiency of the front end, is that the instruction decoders are now divided between the threads in a more sophisticated way: their use alternates between the sibling logical processors, and one logical processor can use the decoders exclusively for as long as the other one is idle. The main difference between the Sandy Bridge and Haswell microarchitectures, as far as the back end is concerned, is that the scheduler can now dispatch up to 8 micro-ops for execution simultaneously, in contrast with the 6 that could be dispatched before.

Furthermore, the execution core can handle twice the number of simultaneous floating-point operations compared with Sandy Bridge. The out-of-order engine in the Haswell microarchitecture can keep 192 micro-ops in flight, whereas the Sandy Bridge microarchitecture can only handle 168. Concerning the cache hierarchy, the only difference between the two microarchitectures is that some configurations of the Haswell microarchitecture include a level 4 cache. The level 3 cache (LLC) is the same as the one described for the Sandy Bridge microarchitecture. The following image shows the ring interconnect between the cores and the level 3 cache slices.
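
The doubled floating-point throughput is largely reached through Haswell's new fused multiply-add (FMA) units, which are accessible from C via compiler intrinsics. The sketch below is only illustrative (compile with, for example, gcc -O2 -mfma); it performs one fused multiply-add on eight single-precision values in a single instruction.

/* Illustrative use of Haswell's FMA3 instructions through intrinsics.
 * Each _mm256_fmadd_ps computes a*b + c on eight floats at once. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    __m256 r = _mm256_fmadd_ps(a, b, c);   /* r[i] = a[i]*b[i] + c[i] = 7.0 */

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%f\n", out[0]);                /* prints 7.000000 */
    return 0;
}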

Figure 14: Core and L3 Cache Communication

Microprocessors based on the Haswell microarchitecture are very power efficient.

The energy usage of a chip based on Haswell is 41% lower than that of an Ivy Bridge chip, decreasing from 17 W to 10 W. In addition, microprocessors based on the Haswell microarchitecture switch between power modes 25% faster than the ones based on Ivy Bridge (Tarush Jain, 2013). This increases the battery life of all portable devices that use such microprocessors and at the same time reduces the cooling needs, making it possible to remove fans, which makes the device lighter and easier to use.

3.2 P-States and Turbo-Boost 2.0

As discussed in the introduction and in the previous sections of this chapter, it is very important to improve the energy efficiency of the microprocessor. Dynamic Voltage and Frequency Scaling (DVFS) is a step in this direction. DVFS is a strategy for dynamically increasing and decreasing the voltage and the frequency in a computer system. If the computational load is high at a specific time, then both the frequency and the voltage are increased. On the other hand, if a part of the microprocessor is unused, these values can be lowered, resulting in lower power consumption (lower TDP) and lower temperatures.

A P-State is a predefined voltage/frequency pair that a microprocessor can be set to work at by software. In modern microprocessors there are many P-States. Generally, P0 is the P-State with the highest frequency and thus the highest voltage. Taking into consideration the formula P = C·V²·f, where C is the capacitance, V is the voltage and f is the frequency, this means that the power consumed (P) is the highest possible. The P-States from P1 to Pn are handled by the operating system (OS). According to the needs, the OS moves up and down the P-State range, optimising the energy consumption. The difference in frequency between P1 and Pn is not very big, whereas the difference between P0 and P1 is much greater. P0 is a P-State that can be entered only through hardware, in very specific cases such as single-threaded execution of a significant load. Intel’s Turbo Boost technology makes use of the P-States: depending on the number of active cores, the frequency can be increased up to a certain threshold. This means that the voltage, the TDP and the temperatures are increased too.
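
To make the formula concrete, the toy program below evaluates P = C·V²·f for a few voltage/frequency pairs. The pairs are made-up illustrative values, not taken from any datasheet; the point is only that lowering voltage and frequency together shrinks the dynamic power superlinearly.

/* Toy evaluation of the dynamic-power formula P = C*V^2*f.
 * The P-state values below are hypothetical; only relative scaling matters. */
#include <stdio.h>

int main(void) {
    struct { const char *name; double f_ghz, v; } pstate[] = {
        { "P0 (turbo)", 3.8, 1.20 },
        { "P1",         3.4, 1.10 },
        { "Pn (min)",   1.6, 0.85 },
    };
    const double C = 1.0;   /* capacitance folded into an arbitrary unit */

    for (int i = 0; i < 3; i++) {
        double p = C * pstate[i].v * pstate[i].v * pstate[i].f_ghz;
        printf("%-11s f=%.1f GHz  V=%.2f V  relative P=%.2f\n",
               pstate[i].name, pstate[i].f_ghz, pstate[i].v, p);
    }
    return 0;
}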

The microprocessors used for this study have Intel Turbo Boost enabled. On the high-performance i7 Sandy Bridge microprocessor, the maximum frequency reached when Turbo Boost was enabled was 4.12 GHz. This frequency was reached when only one core was fully occupied. On the other hand, when all the cores were active, the maximum frequency reached was 3.6 GHz. On the Ultra-Low Voltage i5 microprocessor, the maximum frequency reached was 2.6 GHz, under the same circumstances as for the high-performance one mentioned above. The frequency of the i5 when all the cores were fully occupied was 2.3 GHz.
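
Turbo behaviour of this kind can be observed directly. A minimal sketch, assuming a Linux system with the cpufreq driver loaded, is to sample scaling_cur_freq while a single-threaded load runs; the reported value rises toward the single-core turbo limit described above.

/* Sample the current frequency of CPU 0 once per second (Linux cpufreq). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 5; i++) {
        FILE *f = fopen(
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        if (!f) { perror("cpufreq not available"); return 1; }
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu0: %.2f GHz\n", khz / 1e6);  /* sysfs reports kHz */
        fclose(f);
        sleep(1);
    }
    return 0;
}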

3.3 Intel Hyper-Threading

Intel’s Hyper-Threading is a way of making a single physical core operate as multiple logical processors. This is done by holding a copy of the architecture state for each thread. The architecture state consists of registers, including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. The two threads use the same execution core, so, for Hyper-Threading to work, only a small amount of hardware needs to be added. When a microprocessor supports Hyper-Threading, it is possible for the operating system to schedule two processes on each physical core, one per thread. When one of the processes cannot continue its execution, the other process can continue if the resources requested for its execution are available. Also, when only one process is scheduled on a core, the execution must be as fast as if there were no Hyper-Threading support.

To achieve the highest performance with Hyper-Threading, the operating system must be able to distinguish a physical core from a thread. For example, if two processes should run on a system with two physical cores, each consisting of two threads, the operating system should schedule one process to each physical core so that both cores are loaded equally. If, instead, the processes were scheduled on the two threads of the same physical core, one of the physical cores would be idle and the other one would be fully loaded. In conclusion, support for Hyper-Threading should be combined with changes to the operating system that make it able to distinguish threads from physical cores in order to achieve optimal behaviour. Both of the systems where the experiments took place support Hyper-Threading.
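
For reference, whether a package can expose multiple logical processors can be queried with the CPUID instruction. The sketch below uses the <cpuid.h> helper available in GCC and Clang; note that the HTT flag (bit 28 of EDX from leaf 1) only indicates the capability, not that Hyper-Threading is actually enabled on the specific part, which also depends on the model and BIOS settings.

/* Check the CPUID HTT capability flag (leaf 1, EDX bit 28). */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    int htt = (edx >> 28) & 1;
    printf("HTT flag: %s\n",
           htt ? "set (multiple logical CPUs per package)" : "clear");
    return 0;
}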
