Using gpGPU as a Vector Co-Processor

Hermínia do Rosário Lopes Mendes
Departamento de Informática
Universidade do Minho

Abstract: The performance of scientific computation is a requirement and a challenge for many computer architectures on the market. To manipulate large data sets with the precision and fast processing required by users of large-scale scientific applications, Intel developed the Intel Extended Memory 64-bit Technology (EM64T). Nvidia put the GeForce 8800 Graphics Processing Unit (GPU) on the market for scientific computation that deals with large numbers, but on a small scale. The purpose of this paper is to compare the performance of these two architectures when operating with the large numbers required for scientific calculations in cluster environments.

1. Introduction

One of the major concerns of scientific computation is the fast processing of large data sets.

Both the Intel Extended Memory 64-bit Technology (EM64T) and the Nvidia GeForce 8800 Graphics Processing Unit (GPU) architectures support graphics applications that deal with 2D and 3D images. A 3D data type is called a vertex: a data structure with four components, the x-coordinate, y-coordinate, z-coordinate and w-coordinate (colour). Vertex values are usually 32-bit floating-point values. Pixels are typically 32 bits, usually consisting of four 8-bit channels: R (red), G (green), B (blue) and A (transparency). Both are based on floating-point operations [1], which are discussed for the two architectures in the next sections.

ICCA’07 1


2. Floating-point operations on EM64T

The Intel EM64T is a 64-bit microprocessor architecture with additional architectural features (including instructions and registers) and enlarged floating-point and multimedia units [2]. The CPUs have integer registers (GPRs) and floating-point registers (FPRs), which can hold single-precision values (32 bits) and double-precision values (64 bits). When operating with single-precision values, an FPR holds one single-precision value and the other half of the FPR (32 bits) is unused. The architecture also includes operations that put two single-precision operands in a single 64-bit floating-point register [1].
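As an illustration (not from the paper), rounding a value through the 32-bit format shows the precision that is given up when only single-precision storage is used; a minimal Python sketch:

```python
import struct

def round_to_single(x: float) -> float:
    """Round a double-precision value through the 32-bit IEEE format."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = 0.1                      # not exactly representable in binary
single = round_to_single(x)  # value after storing in 32 bits
# The single-precision copy differs from the 64-bit original,
# but only beyond the ~7 significant decimal digits of a float32.
print(single == x)           # False
print(abs(single - x) < 1e-8)
```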

A pipelined CPU can reduce the clock cycles spent on instructions with floating-point values by overlapping instructions that do not depend on one another. For larger and more complex operations, logic levels are added in each pipe stage to achieve a higher clock rate, but this solution increases the latency of the operations [1].

Floating-point operations for graphics are normally in single precision, not double precision, and often at a precision lower than the IEEE format requires.

Rather than waste a 64-bit ALU when operating on 32-bit, 16-bit, or 8-bit integers, multimedia instructions can operate on several data items at the same time. Thus, a partitioned add operation on 16-bit data with a 64-bit ALU performs four adds in a single clock cycle. These operations are called Single-Instruction Multiple-Data (SIMD) [1].
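The partitioned add can also be sketched in software. The following Python model (an illustration, not the actual EM64T instruction) performs four independent 16-bit adds inside one 64-bit word by masking off the bit that would carry between lanes:

```python
HI = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
LO = 0x7FFF_7FFF_7FFF_7FFF   # remaining bits of each lane

def packed_add16(a: int, b: int) -> int:
    """Four 16-bit adds in one 64-bit word, with no carry between lanes."""
    low = (a & LO) + (b & LO)          # add the low 15 bits of every lane
    return low ^ ((a ^ b) & HI)        # recombine each lane's top bit

def pack(lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 lowest)."""
    return sum((v & 0xFFFF) << (16 * i) for i, v in enumerate(lanes))

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

s = packed_add16(pack([1, 2, 0xFFFF, 7]), pack([1, 2, 1, 3]))
print(unpack(s))   # [2, 4, 0, 10] -- lane 2 wraps without touching lane 3
```

The point of the masking is exactly the paper's observation: one wide ALU serves several narrow operands at once, provided carries are stopped at lane boundaries.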

3. Graphics Processor Unit (GPU)

GPUs have up to 16 parallel pipelines with more than one Processing Element (PE) each, with no direct communication between them [3]. A single processing element (PE) is capable of performing an operation on a four-component vector (4-vector) of single-precision floating-point values in one clock cycle. GPUs have different floating-point formats: 16 bits, 24 bits and 32 bits [4].

Floating-point operations are in single precision in the IEEE format, and in Single-Instruction Multiple-Data (SIMD) fashion [4].
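A PE's one-cycle 4-vector operation can be modelled as a componentwise operation on four floats; a minimal sketch, in Python rather than actual shader code:

```python
def pe_mad(a, b, c):
    """One PE operation: componentwise multiply-add on 4-vectors (a*b + c)."""
    return tuple(ai * bi + ci for ai, bi, ci in zip(a, b, c))

# e.g. scaling and translating a vertex (x, y, z, w) in a single PE "cycle"
v = pe_mad((1.0, 2.0, 3.0, 1.0),   # vertex
           (2.0, 2.0, 2.0, 1.0),   # scale factors
           (0.5, 0.5, 0.5, 0.0))   # translation
print(v)   # (2.5, 4.5, 6.5, 1.0)
```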

The GPU pipeline is divided into several stages: Vertex operation, which determines the independent data-parallel sections, stored as textures in GPU memory; Primitive assembly, which invokes the range of each section by passing vertices to the GPU; Fragment operation, which generates fragments for every pixel location, producing thousands to millions of fragments; Composition, which processes the generated fragments; and Final image, which outputs a value per fragment that may be the final result or may be stored as a texture and then used in additional computations [4]. Because the pipelines of a GPU operate independently of each other on a common memory, the graphics card is similar to a shared-memory parallel computer. Neither hardware nor software supports an implementation of distributed shared memory (DSM) completely and efficiently enough to solve the problem of providing a single address space over the shared memory [5].

In data-intensive applications, the PEs spend most of their time waiting for data. The implementation of the Data Stream Based (DSB) model controls the data flow to and from the PEs. An instruction prescribes both the operation to be executed and the required data. The individual elements of the data stream can be assembled from memory before the actual processing, avoiding the memory gap. This allows the memory accesses to be optimized, minimizing latencies and maximizing bandwidth. In Instruction Stream Based (ISB) architectures, only a limited prefetch of the input data can occur, as jumps are expected; in the DSB model, when a jump does occur, it takes longer.
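The difference between the two models can be sketched as follows. This is a hypothetical Python illustration, not real GPU code: in the DSB style, the whole operand stream is gathered from memory before any computation starts, so the compute phase never stalls on memory.

```python
def dsb_process(op, indices, memory):
    """Data Stream Based: assemble the operand stream first, then compute."""
    stream = [memory[i] for i in indices]   # gather phase: accesses known up
                                            # front, so they can be overlapped
    return [op(x) for x in stream]          # compute phase: PEs never wait

def isb_process(op, indices, memory):
    """Instruction Stream Based: each instruction fetches its own operand,
    so only a limited prefetch is possible (a jump may change the accesses)."""
    return [op(memory[i]) for i in indices]

mem = [10, 20, 30, 40]
print(dsb_process(lambda x: x * x, [0, 2, 3], mem))   # [100, 900, 1600]
```

Both functions compute the same result; the difference the paper describes is in when the memory traffic happens, which this sketch only marks with comments.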

Double-precision floating point is supported in the GPU 8900.

4. GPU versus CPU

The GPU of the Nvidia GeForce 8800 has more transistors than the EM64T CPUs. In the CPU, a large share of the transistors is consumed by the L2 cache, while the GPU devotes the majority of its transistors to computation, so its caches are smaller than those of the CPU. This makes the GPU faster than the CPU on scientific computation, although it has less bandwidth to the main memory than the CPU [4].

The instruction sets of GPUs are limited compared to CPU instruction sets, because they are basically math operations.

5. Clusters

In an EM64T cluster, each processor has its own data memory and a single instruction memory, and a control processor fetches and dispatches the instructions. The same address on different processors can refer to different locations in different memories. Each processor is a separate computer connected to a local area network. The multimedia extensions are limited to SIMD parallelism [1].


A cluster of GPUs alone is not a viable solution because of the slow data transport between the graphics card and the main memory.

One challenge is to incorporate more than one GPU in a cluster: reading a texture results in a page request for a page that may be resident on the local GPU or on a remote GPU. When the CPU receives the list of global addresses, it sends a request to the remote GPU for each page that is dirty on that remote GPU. The remote GPU on that node renders the requested page into a texture and sends it to the local node. Then the CPU renders those pages back to the local GPU and updates its page tables [5].
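The page-request protocol of [5] can be sketched with a toy page table. All the names below are hypothetical, and a real implementation moves pages by rendering textures, not by copying Python dictionary entries:

```python
class GPUNode:
    """Toy model of one GPU's texture memory, split into pages."""
    def __init__(self, pages):
        self.pages = dict(pages)        # page id -> page data
        self.dirty = set(self.pages)    # pages modified on this GPU

def fetch_remote_pages(local, remote, requested):
    """CPU-mediated transfer: copy each requested page that is dirty on the
    remote GPU into the local GPU, then mark the remote copy as clean."""
    for page in requested:
        if page in remote.dirty:
            local.pages[page] = remote.pages[page]   # "render" into local texture
            remote.dirty.discard(page)               # remote copy is now clean

a = GPUNode({0: "tex0"})
b = GPUNode({1: "tex1", 2: "tex2"})
fetch_remote_pages(a, b, requested=[1, 2])
print(sorted(a.pages))   # [0, 1, 2]
```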

6. Conclusion

The Intel EM64T manipulates large data sets with double precision on a large scale.

The GPU 8800 of the Nvidia GeForce is, for scientific calculations on a small scale, faster than a CPU, even when operating with double-precision floating point.

One of the solutions implemented to date is the incorporation of GPUs in a cluster supported by CPUs for scientific applications.

References

[1] John L. Hennessy, David A. Patterson: Computer Architecture: A Quantitative Approach, Third Ed. Morgan Kaufmann Publishers, 2002, chapters 2, 3, 6, 8.

[2] http://www.intel.com

[3] Martin Rumpf, Robert Strzodka: Graphics Processor Units: New Prospects for Parallel Computing. In A. M. Bruaset and A. Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, 2005.

[4] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, Timothy J. Purcell: A Survey of General-Purpose Computation on Graphics Hardware. In Computer Graphics Forum (March 2007).

[5] Adam Moerschell, John D. Owens: Distributed Texture Memory in a Multi-GPU Environment. In Graphics Hardware, M. Olano, P. Slusallek, editors, 2006.
