4 The Efficiency of the Parallel Schur Complement Method

4.3 Solving the Interface System on Multi-GPU and Cluster-GPU

To solve the interface system (3) on Multi-GPU, a block algorithm of the conjugate gradient method was implemented, with computations distributed between np GPUs with the help of OpenMP. In this case, Algorithm 3 is implemented in a parallel region created by the directive omp parallel. The matrix S = {s_ij} of (3) is partitioned into blocks. The results of the PCG inner products computed in separate OpenMP threads are stored in shared variables and summed using the atomic directive, followed by a barrier synchronization.
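As an illustration of this scheme, the sketch below shows how one OpenMP thread per GPU could accumulate its partial inner product into a shared variable with the atomic directive and a barrier. This is a minimal sketch, not the authors' implementation; the names (shared_dot, pcg_dot_step, local_dot) are illustrative assumptions.

```cpp
// Minimal sketch: one OpenMP thread drives one GPU, and the partial PCG
// inner products are summed into a shared variable with "atomic" followed
// by a barrier, as described in the text.
#include <omp.h>
#include <cuda_runtime.h>

double shared_dot = 0.0;               // shared accumulator for an inner product

void pcg_dot_step(int np /* number of GPUs */) {
  #pragma omp parallel num_threads(np)
  {
    int k = omp_get_thread_num();      // block / GPU identifier
    cudaSetDevice(k);                  // bind this thread to GPU k

    #pragma omp single
    shared_dot = 0.0;                  // reset the accumulator (implicit barrier)

    // ... local sparse matrix-vector product and vector updates on GPU k ...
    double local_dot = 0.0;            // partial inner product of block k,
                                       // e.g. computed with cublasDdot

    #pragma omp atomic
    shared_dot += local_dot;           // sum the partial results

    #pragma omp barrier                // all threads now see the global value
    // ... shared_dot is used to form the CG scalars (alpha, beta, rho) ...
  }
}
```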

To divide the matrix S, we represent it as a graph G_S(V, E), where V = {i} is the set of vertices corresponding to the row indices (the number of vertices is equal to the dimension of S), and E = {(i, j)} is the set of edges whose endpoints correspond to the row and column indices of the nonzero elements of S. The graph G_S is divided into np parts by the multilevel algorithm [14]. After that, every vertex of the graph is assigned a GPU identifier k ∈ [1, np]. According to their GPU identifiers, the vertices are divided into internal and boundary vertices. The latter are adjacent to at least one vertex with a different GPU identifier.
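For reference, the multilevel k-way scheme of [14] is available through the METIS library. The sketch below, assuming the graph G_S is stored in CSR form (xadj, adjncy), partitions it into np parts and marks the boundary vertices; the function and variable names are ours, not the authors'.

```cpp
// Sketch: partition G_S with METIS (multilevel k-way, [14]) and mark the
// boundary vertices, i.e. vertices adjacent to a vertex of another part.
#include <metis.h>
#include <vector>

void partition_graph(idx_t nvtxs, std::vector<idx_t>& xadj,
                     std::vector<idx_t>& adjncy, idx_t np,
                     std::vector<idx_t>& part,        // GPU identifier per vertex
                     std::vector<char>& is_boundary)  // 1 if boundary vertex
{
  part.assign(nvtxs, 0);
  idx_t ncon = 1, objval = 0;
  METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                      nullptr, nullptr, nullptr,      // no vertex/edge weights
                      &np, nullptr, nullptr, nullptr, // default options
                      &objval, part.data());

  is_boundary.assign(nvtxs, 0);
  for (idx_t i = 0; i < nvtxs; ++i)
    for (idx_t j = xadj[i]; j < xadj[i + 1]; ++j)
      if (part[adjncy[j]] != part[i]) { is_boundary[i] = 1; break; }
}
```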

After partitioning, each block S_k contains several matrices:

– S_k[i_k, i_k] is the matrix associated with the internal vertices;

– S_k[i_k, b_k], S_k[b_k, i_k] are the matrices associated with the internal and boundary vertices;

– S_k[b_k, b_m] is the matrix associated with the boundary vertices of the k-th and m-th blocks.

Here k ≠ m, k, m ∈ [1, np], and np is the number of blocks.

The matrix S can be rewritten as follows:

\[
S = \begin{pmatrix}
S_1[i_1,i_1] & S_1[i_1,b_1] & \cdots & 0 & 0 & \cdots & 0 & 0 \\
S_1[b_1,i_1] & S_1[b_1,b_1] & \cdots & 0 & S_1[b_1,b_k] & \cdots & 0 & S_1[b_1,b_{n_p}] \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & S_k[i_k,i_k] & S_k[i_k,b_k] & \cdots & 0 & 0 \\
0 & S_k[b_k,b_1] & \cdots & S_k[b_k,i_k] & S_k[b_k,b_k] & \cdots & 0 & S_k[b_k,b_{n_p}] \\
\vdots & \vdots & & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & S_{n_p}[i_{n_p},i_{n_p}] & S_{n_p}[i_{n_p},b_{n_p}] \\
0 & S_{n_p}[b_{n_p},b_1] & \cdots & 0 & S_{n_p}[b_{n_p},b_k] & \cdots & S_{n_p}[b_{n_p},i_{n_p}] & S_{n_p}[b_{n_p},b_{n_p}]
\end{pmatrix}.
\]

When the matrix S is multiplied by the vector p, two vectors are computed on each GPU:

\[
q_b^k = S_k[b_k,i_k]\, p_{i_k} + \sum_{m=1}^{n_p} S_k[b_k,b_m]\, p_{b_m}, \qquad
q_i^k = S_k[i_k,i_k]\, p_{i_k} + S_k[i_k,b_k]\, p_{b_k}, \tag{4}
\]

where k is the GPU identifier and p^T = (p_{i_1}, p_{b_1}, \ldots, p_{i_k}, p_{b_k}, \ldots, p_{i_{n_p}}, p_{b_{n_p}}). This implementation of the matrix-vector product reduces communications between the blocks at each iteration of the conjugate gradient method. Indeed, to perform the subsequent steps, only the exchange of the vectors q_b^k is required, and their size is much smaller than that of the vector q.
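A plain CPU sketch of the block product (4) may clarify the data flow; in the actual solver these operations run on GPU k, and the type and routine names below (Csr, spmv_add, multiply_block) are illustrative assumptions, not the authors' code.

```cpp
// Sketch of the block matrix-vector product (4) for block k, written as a
// plain CPU routine for clarity (in the solver these would be GPU kernels).
#include <vector>
using std::vector;

struct Csr {                       // a sparse block in CSR format
  int rows = 0;
  vector<int> row_ptr, col;
  vector<double> val;
};

// y += A * x
static void spmv_add(const Csr& A, const vector<double>& x, vector<double>& y) {
  for (int i = 0; i < A.rows; ++i)
    for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
      y[i] += A.val[j] * x[A.col[j]];
}

// Computes q_i^k and q_b^k of (4) for block k.
void multiply_block(int k, int np,
                    const Csr& S_ii, const Csr& S_ib, const Csr& S_bi,
                    const vector<Csr>& S_bb,            // S_k[b_k,b_m], m = 1..np
                    const vector<double>& p_i,
                    const vector<vector<double>>& p_b,  // p_{b_m} of every block
                    vector<double>& q_i, vector<double>& q_b) {
  // q_i^k = S_k[i_k,i_k] p_{i_k} + S_k[i_k,b_k] p_{b_k}
  spmv_add(S_ii, p_i, q_i);
  spmv_add(S_ib, p_b[k], q_i);
  // q_b^k = S_k[b_k,i_k] p_{i_k} + sum_m S_k[b_k,b_m] p_{b_m}
  spmv_add(S_bi, p_i, q_b);
  for (int m = 0; m < np; ++m)
    spmv_add(S_bb[m], p_b[m], q_b);
}
```

Only q_b computed here has to be exchanged with the other blocks before the next CG step, which is exactly the communication saving described above.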

The results in Table 4 show that the GPU algorithm of the conjugate gradient method significantly speeds up the solution of the interface system. For 16 subdomains, the speedup was s^{(1)}_{CPU} = 72; for 1024 subdomains, s^{(1)}_{CPU} = 94.

Multiple GPUs provided the speedup s^{(8)}_{CPU} = 251 for nΩ = 64 and s^{(8)}_{GPU} = 3.5 for nΩ = 16. Increasing the number of subdomains reduces the number of nonzero elements in the Schur complement matrix. Thus, the efficiency of using multiple GPUs to solve the system (3) is reduced. For example, the speedup was s^{(8)}_{GPU} = 1.3 for 1024 subdomains.

In our experiments, the minimal total time to form and solve (3) was obtained in the case nΩ = 1024 (see Tables 2 and 4). It took 1 hour and 48 minutes to perform these steps on the CPU only. In the case of a single GPU, the cost of constructing and solving this system was reduced by a factor of 22. The minimal computing time, t^{(8)}_{GPU} = 2 min, was reached when eight GPUs were involved. It took one and a half minutes to form the system (3) on the CPU and to solve it on the GPUs.

Table 4. The time to solve (3) on one computational module, sec.

nΩ     CPU      1 GPU   2 GPU   4 GPU   6 GPU   8 GPU
16     >28800   912     421     205     281     264
32     >28800   983     630     430     317     274
64     18067    287     174     98      82      72
128    —        —       127     78      66      60
256    —        —       106     70      61      56
512    —        —       85      63      56      53
1024   6502     69      76      59      54      52

We obtained the lowest total cost when the system was formed on the CPU and then solved on eight GPUs.

The system of equations (3) can also be solved on several computational modules with GPUs (Cluster-GPU). We performed experiments with the following configurations:

– eight MPI processes in a single module (1×8);

– four processes in each of two modules (2×4);

– two processes in each of four modules (4×2);

– one process in each of eight modules (8×1).

For communications, we used the MPI functions Allgatherv, to assemble the vector q, and Allreduce, to sum the scalars α, β, ρ. These functions were called by one of the OpenMP threads running inside each MPI process.
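A minimal sketch of how these two collectives could be issued from a single OpenMP thread in each MPI process is shown below; the buffer names, the use of omp master, and the packing of α, β, ρ into one array of three doubles are assumptions made for illustration.

```cpp
// Sketch: the boundary parts of q are assembled with MPI_Allgatherv and the
// CG scalars are summed with MPI_Allreduce; both calls are issued by a
// single OpenMP thread per MPI process.
#include <mpi.h>
#include <omp.h>

void exchange_step(double* q_b_local, int n_b_local,  // local boundary part
                   double* q_b_global,                // gathered boundary vector
                   int* counts, int* displs,          // per-process sizes/offsets
                   double local_scalars[3],           // partial alpha, beta, rho
                   double global_scalars[3])
{
  #pragma omp master
  {
    MPI_Allgatherv(q_b_local, n_b_local, MPI_DOUBLE,
                   q_b_global, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Allreduce(local_scalars, global_scalars, 3, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
  }
  #pragma omp barrier   // make the results visible to the other threads
}
```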

Experiments show that the execution time of (3) significantly depends on the distribution of the computations of the matrix-vector product between the CPU and GPU. In the case of nΩ = 16 (see Table 5), the most computationally expensive part is
\[
\hat{q}_b^k = \sum_{m=1,\; m \ne k}^{n_p} S_k[b_k,b_m]\, p_{b_m}
\]
of (4), which is computed on the CPU. This is caused by the large sizes of the blocks S_k[b_k,b_m] and by their uneven distribution between parallel processes/threads. In such cases, it is possible to transfer this operation to the GPU. The results given in Table 5 show that distributing the computations between different computing modules results in an acceleration of 1.2–1.5 times, most likely because the competition for CPU cache memory is reduced.

A more uniform partition of the boundary vertices between the subgraphs of G_S and the smaller sizes of S_k[b_k,b_m], m ≠ k, lead to more efficient computation of \hat{q}_b^k on the CPU for nΩ = 1024 and, therefore, to the domination of vector operations in the execution time (see Algorithm 3, steps 2b, 2f). In this case, invocations of cublasDdot for the scalar values take about 90% of the execution time (variants 1×8, 2×4, 4×2).
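For reference, each such inner product maps to one cublasDdot call per block, i.e. a small kernel launch plus a single-scalar device-to-host transfer, which is why these calls can dominate once the local matrix-vector product becomes cheap. The sketch below shows the call; the handle and vector names are ours.

```cpp
// Sketch: computing one CG inner product on a GPU with cuBLAS. Each call is
// a small kernel plus a device-to-host transfer of a single double.
#include <cublas_v2.h>

double block_dot(cublasHandle_t handle, int n,
                 const double* d_x, const double* d_y)  // device vectors
{
  double result = 0.0;                        // returned on the host
  cublasDdot(handle, n, d_x, 1, d_y, 1, &result);
  return result;                              // then summed over blocks (atomic / Allreduce)
}
```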

The acceleration of calculations with a larger number of subdomains results from sparser systems of equations (Nnz is reduced while N is increased). Therefore, the total number of arithmetic operations in the multiplication of the matrix SBB by the vector p in (3) is decreased. Distributing the matrix blocks between computing modules removes the restriction on the size of the interface system (3), but requires further improvement of the communication algorithms.

Table 5. The time to solve (3) on 1 to 8 computational modules, sec.

nΩ     N       Nnz         1×8   2×4   4×2   8×1
16     66030   266826696   321   268   208   233
64     70620   90931750    78    72    71    95
128    75732   58592886    68    60    59    86
256    81753   38066007    66    57    66    87
512    88248   23955996    63    55    68    102
1024   95523   14749329    65    54    54    110

We do not provide a comparison of this solver with CUSP, because CUSP does not support multi-GPU computations. The presented results were obtained on the hybrid cluster Uranus at the IMM UB RAS, which consists of dual-processor nodes with Intel Xeon E5675 CPUs and eight NVIDIA Tesla M2090 GPUs.

5 Conclusion

The hybrid implementation of the Schur complement method presented in this paper allowed us to distribute computation between CPU and GPU in a balanced way. The optimal choice of the algorithm to form the Schur complement depends on the number and size of subdomains into which the mesh is divided.

If one subdomain contains a relatively small number of mesh elements (< 5000) or unknowns (< 1500 for internal and < 2500 for boundary nodes), then it is more efficient to use direct methods to find the inverse matrix and, therefore, to use only the CPU. For large problems, iterative algorithms executed on several GPUs are the most efficient. The interface system of equations should be solved on GPUs, which can accelerate this step by tens to hundreds of times.

Acknowledgments. This research is supported by RFBR (projects 11-01-00275-a and 12-07-31114-mol_a) and by the joint program No. 18 of the Presidium of RAS and the Ural Branch of RAS (project 12-P-1-1005).

References

1. Giraud, L., Haidar, A., Saad, Y.: Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D. Numerical Mathematics 3, 276–294 (2010)

2. Gaidamour, J., Henon, P.: A parallel direct/iterative solver based on a Schur complement approach. In: IEEE 11th International Conference on Computational Science and Engineering, Sao Paulo, Brazil, pp. 98–105 (2008)

3. Agullo, E., Giraud, L., Guermouche, A., Roman, J.: Parallel hierarchical hybrid linear solvers for emerging computing platforms. Comptes Rendus Mecanique 339(2-3), 96–103 (2011)

4. Yamazaki, I., Li, X.S.: On techniques to improve robustness and scalability of a parallel hybrid linear solver. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 421–434. Springer, Heidelberg (2011)

5. Rajamanickam, S., Boman, E.G., Heroux, M.A.: ShyLU: A hybrid-hybrid solver for multicore platforms. In: IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), pp. 631–643 (2012)

6. Przemieniecki, J.: Theory of Matrix Structural Analysis. McGraw-Hill, New York (1968)

7. Kopysov, S.P., Krasnoperov, I.V., Rychkov, V.N.: An object-oriented method for domain decomposition. Numerical Methods and Programming 4, 176–193 (2003)

8. Kopysov, S.P., Krasnopyorov, I.V., Novikov, A.K., Rychkov, V.N.: Parallel Distributed Object-Oriented Framework for Domain Decomposition, pp. 605–614. Springer (2006)

9. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM (2003)

10. Kopysov, S.: Optimal domain decomposition for parallel substructuring method. In: Mesh Methods for Boundary-Value Problems and Applications. Proceedings of the 5th Russian Seminar, pp. 121–124. Kazan University, Kazan (2004)

11. Kopysov, S.P., Novikov, A.K.: Parallel adaptive mesh refinement with load balancing on heterogeneous cluster, pp. 425–432. Nova Science Publishers (2006)

12. Kopysov, S.P., Krasnoperov, I.V., Rychkov, V.N.: Implementation of an object-oriented model of domain decomposition on the basis of parallel distributed CORBA components. Numerical Methods and Programming 4(1), 19–36 (2003)

13. Kopysov, S.P., Novikov, A.K., Sagdeeva, Y.A.: Solving of discontinuous Galerkin method systems on GPU. Bulletin of Udmurt University. Mathematics. Mechanics. Computer Science (4), 121–131 (2011)

14. Karypis, G., Kumar, V.: Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review 41(2), 278–300 (1999)