3 Experiments

3.6 Halo Exchange Latency

The 3-D halo exchange communication pattern is arguably the most commonly used data exchange in scientific applications. Each MPI task requires data produced on the ranks with neighboring coordinates (i.e., the halo), and therefore has to exchange halo data with its neighbors. We use the halo exchange benchmark with four different halo exchange implementation strategies, discussed in detail in [7]. In the following, we summarize the key features of the implementation.

The halo exchange benchmark investigates several communication strategies, denoted Sendrecv, Isend-recv, Isend-Irecv, and AllAtOnce. In the Sendrecv strategy, each task communicates along the dimensions in order; within each dimension, the tasks issue two blocking MPI_Sendrecv calls to communicate with the peers in the forward and backward directions. In the Isend-recv strategy, a task first issues a non-blocking Isend request and then synchronously receives the data from its peer, without waiting for the send to complete. A barrier ensures that communication in one direction finishes for all ranks before moving on to the next direction. This strategy typically leads to a flood of unexpected messages, which can degrade performance; however, it is a valid strategy observed in production ALCF workloads.

Table 3. Latency, in microseconds, of multi-rack BG/Q collectives for various ranks per node (RPN)

                            World communicator          Half-World sub-communicator
RPN                      8 racks  32 racks  48 racks     8 racks  32 racks  48 racks

barrier
 1                          3.92      4.84      5.03        3.84      4.41      4.64
16                          5.42      6.93      7.54        5.33      5.96      6.19
64                         11.02     11.88     14.46       10.74     11.73     11.79

broadcast, 16 bytes
 1                          4.88      5.84      6.05        4.78      5.44      5.64
16                          6.26      7.25      7.48        6.20      6.84      7.06
64                         10.74     11.52     11.90       10.68     11.28     11.57

broadcast, 8192 bytes
 1                          9.85     10.80     11.02        9.73     10.39     10.59
16                         11.41     12.48     12.68       11.20     11.98     12.13
64                         18.66     19.83     19.98       18.56     19.21     19.61

reduction, 1 double
 1                          5.32      5.32      6.55        5.21      6.04      6.13
16                          7.95      8.96      9.15        7.85      8.53      8.79
64                         16.62     17.71     17.83       16.74     17.38     17.56

reduction, 256 doubles
 1                          6.69      7.69      7.92        6.58      7.27      7.50
16                         10.84     11.85     12.06       10.84     11.57     11.75
64                         21.84     22.94     22.99       21.75     22.59     22.51

The Isend-Irecv strategy relaxes synchronization for receives: a task posts non-blocking sends and receives, completes them with a waitall, and then starts the next direction. Finally, the AllAtOnce strategy posts the six non-blocking sends and receives for all dimensions simultaneously, after which a waitall ensures that all outstanding requests are complete. All benchmarks use the communicator returned by the MPI_Cart_create call with default task placement.
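To make the strategies concrete, the following C sketch contrasts the blocking Sendrecv strategy with the AllAtOnce strategy on a 3-D Cartesian communicator. This is a minimal illustration, not the benchmark's actual code: the per-face buffer layout (sendbuf/recvbuf indexed by 2*d plus direction), the message count, and the tags are our assumptions.

#include <mpi.h>

/* Sketch of two of the four strategies; buffer layout and tags are
 * illustrative assumptions. lo/hi are the neighbors in dimension d. */
void sendrecv_exchange(MPI_Comm cart, double *sendbuf[6],
                       double *recvbuf[6], int count)
{
    for (int d = 0; d < 3; d++) {     /* one dimension at a time, in order */
        int lo, hi;
        MPI_Cart_shift(cart, d, 1, &lo, &hi);
        MPI_Sendrecv(sendbuf[2*d],   count, MPI_DOUBLE, hi, 0,  /* forward  */
                     recvbuf[2*d],   count, MPI_DOUBLE, lo, 0,
                     cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf[2*d+1], count, MPI_DOUBLE, lo, 1,  /* backward */
                     recvbuf[2*d+1], count, MPI_DOUBLE, hi, 1,
                     cart, MPI_STATUS_IGNORE);
    }
}

void all_at_once_exchange(MPI_Comm cart, double *sendbuf[6],
                          double *recvbuf[6], int count)
{
    MPI_Request req[12];              /* 6 sends + 6 receives, all posted */
    for (int d = 0; d < 3; d++) {
        int lo, hi;
        MPI_Cart_shift(cart, d, 1, &lo, &hi);
        MPI_Irecv(recvbuf[2*d],   count, MPI_DOUBLE, lo, 0, cart, &req[4*d]);
        MPI_Irecv(recvbuf[2*d+1], count, MPI_DOUBLE, hi, 1, cart, &req[4*d+1]);
        MPI_Isend(sendbuf[2*d],   count, MPI_DOUBLE, hi, 0, cart, &req[4*d+2]);
        MPI_Isend(sendbuf[2*d+1], count, MPI_DOUBLE, lo, 1, cart, &req[4*d+3]);
    }
    MPI_Waitall(12, req, MPI_STATUSES_IGNORE);  /* complete everything at once */
}

Note that with non-periodic boundaries MPI_Cart_shift returns MPI_PROC_NULL at the grid edges, and the corresponding sends and receives degenerate to no-ops, so the same code covers boundary ranks.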

The performance of the halo exchange pattern is sensitive to load imbalance. When tasks run unevenly, some nodes start sending messages to their peers before the matching receive is posted. The low-level communication library may either buffer the message on the sending side, causing extra overhead and increasing the chance of conflicts in the messaging unit, or buffer the message on the receiving side, placing it into the queue of unexpected messages. To study the system response, the benchmark inserts a "sleep time" (delay) before the halo exchanges to mimic the imbalance. The reported time is the "true" communication time, with the "sleep time" excluded.
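A sketch of how such a measurement can be structured, assuming a uniform, caller-supplied delay and any of the exchange functions above bound into the do_exchange callback (both names are ours, not the benchmark's):

#include <mpi.h>
#include <unistd.h>

/* Sketch of the imbalance experiment: sleep before the exchange, then
 * report the elapsed time with the injected delay subtracted. */
double timed_exchange(MPI_Comm cart, long delay_us,
                      void (*do_exchange)(MPI_Comm))
{
    MPI_Barrier(cart);                 /* common starting point for all ranks */
    double t0 = MPI_Wtime();
    usleep((useconds_t)delay_us);      /* mimic uneven compute before exchange */
    do_exchange(cart);                 /* halo exchange under test */
    double total = MPI_Wtime() - t0;
    return total - delay_us * 1e-6;    /* "true" time: sleep excluded */
}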

Both Intrepid and Mira exhibit a similar response to the different strategies; therefore, we present only the Mira results in Figure 3a. For messages close to 64-256 doubles (512 to 2,048 bytes), the effects of the communication pattern become more visible. Expectedly, the less synchronous AllAtOnce strategy is the fastest, and Isend-recv is the slowest. Figure 3b shows the results for the imbalanced halo exchange overhead, which is the total halo exchange time minus the specified amount of delay (100 ms in our case). A small imbalance among tasks helps the processes hide the communication progress behind the compute part, and we observed that all communication strategies, except Isend-recv, benefit from such imbalance. Isend-recv performs the worst because the blocking receive inhibits a task from processing other messages; therefore, messages sent by the previous non-blocking send are likely to reach the receiver before the corresponding receive is posted, resulting in extra overhead to manage unexpected messages.
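For reference, the per-direction structure of the Isend-recv strategy looks roughly as follows (same assumed buffer layout as the sketch above); the blocking MPI_Recv is what keeps a rank from servicing other incoming traffic:

#include <mpi.h>

/* Sketch of the Isend-recv strategy: non-blocking send, then a blocking
 * receive, with a barrier closing each direction. Buffer layout and tags
 * are illustrative assumptions. */
void isend_recv_exchange(MPI_Comm cart, double *sendbuf[6],
                         double *recvbuf[6], int count)
{
    for (int d = 0; d < 3; d++) {
        int lo, hi;
        MPI_Cart_shift(cart, d, 1, &lo, &hi);
        int to[2]   = { hi, lo };            /* forward, then backward */
        int from[2] = { lo, hi };
        for (int dir = 0; dir < 2; dir++) {
            MPI_Request sreq;
            MPI_Isend(sendbuf[2*d+dir], count, MPI_DOUBLE, to[dir], dir,
                      cart, &sreq);
            /* Blocking receive: early arrivals from faster peers land in
             * the unexpected-message queue while this rank sits in Recv. */
            MPI_Recv(recvbuf[2*d+dir], count, MPI_DOUBLE, from[dir], dir,
                     cart, MPI_STATUS_IGNORE);
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);
            MPI_Barrier(cart);               /* sync before the next direction */
        }
    }
}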

We further compare the halo exchange latency while varying the number of MPI tasks per node. As Figure 4a shows, the BG/Q messaging unit is not entirely saturated. However, 64 MPI ranks per node incur significantly more overhead than fewer ranks per node, primarily due to the inability to use a dedicated communication thread. Small messages cannot saturate the available bandwidth; therefore, the latencies are comparable.

Fig. 3. Various halo exchange implementations with 16 tasks per node on Mira: (a) halo exchange only; (b) a 100 ms delay before the exchange. Latency (milliseconds) versus message size in doubles for the Sendrecv, Isend-recv, Isend-Irecv, and AllAtOnce strategies.

Fig. 4. The AllAtOnce implementation at various numbers of tasks per node on Mira: (a) halo exchange only; (b) a 100 ms delay before the exchange. Latency (milliseconds) versus message size in doubles for BG/Q at 1, 4, 16, and 64 RPN (MPI).

For 1,024 doubles and larger, the messaging unit is saturated, and performance degrades when more tasks are placed on a node. When a delay is added to mimic computation before the halo exchanges, the MPI ranks on the same node are less likely to communicate at the same time; therefore, the above degradation becomes negligible, as shown in Figure 4b.

According to the benchmark study, our general recommendations for programming halo exchange applications on Blue Gene systems are very generic and entirely expected: a) do not synchronize processes without a reason; and b) do not predefine any order of communications. The Blue Gene interconnect is generally very fast, and a predefined ordering of send and receive requests brings no benefit; on the contrary, it degrades communication performance.