Our first case study centered on point-to-point communications. Now, let us turn our attention to one based on collective operations. We examine another popular benchmark, which sorts sets of keys. The benchmark was run under Prism on a single, shared-memory node using 16 processes. Once again, we begin by setting Sun MPI environment variables
% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1
since we are interested in the performance of this benchmark as a dedicated job.
The message-passing part of the code involves a bucket sort, implemented with an MPI_Allreduce, an MPI_Alltoall, and an MPI_Alltoallv, though no such knowledge is required for effective profiling with Prism. Instead, running the code under Prism, we quickly see that the most time-consuming MPI calls are MPI_Alltoallv and MPI_Allreduce. (See "Summary Statistics of MPI Usage".) Navigating a small section of the timeline window (see Figure 7-5), we see that there is actually a tight succession of MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv calls. One such iteration is shown in Figure 7-8. (We have shaded and labeled the time-consuming sections.)
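To make the pattern concrete, here is a minimal sketch of how such a bucket-sort exchange phase might look. The function and variable names, the bucket count, and the use of integer keys are our own illustrative assumptions, not code taken from the benchmark.

/* Illustrative bucket-sort exchange phase (assumed names and types).
 * Each process bins its keys locally, then three collectives
 * redistribute them among the processes. */
#include <mpi.h>

void exchange_keys(int nbuckets, int nprocs,
                   int *bucket_size,       /* local count of keys per bucket     */
                   int *bucket_size_total, /* global count of keys per bucket    */
                   int *send_count,        /* keys to send to each process       */
                   int *send_keys,         /* keys packed by destination process */
                   int *recv_keys)         /* buffer for incoming keys           */
{
    int recv_count[nprocs], send_displ[nprocs], recv_displ[nprocs];
    int i;

    /* Global key histogram: a synchronizing reduction across all processes. */
    MPI_Allreduce(bucket_size, bucket_size_total, nbuckets,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange per-process send counts: one int per peer, so with 16
     * processes each process contributes only 64 bytes. */
    MPI_Alltoall(send_count, 1, MPI_INT,
                 recv_count, 1, MPI_INT, MPI_COMM_WORLD);

    /* Convert the counts into displacements. */
    send_displ[0] = recv_displ[0] = 0;
    for (i = 1; i < nprocs; i++) {
        send_displ[i] = send_displ[i-1] + send_count[i-1];
        recv_displ[i] = recv_displ[i-1] + recv_count[i-1];
    }

    /* Redistribute the keys themselves: the dominant data movement,
     * roughly 2 Mbytes per process in the run profiled here. */
    MPI_Alltoallv(send_keys, send_count, send_displ, MPI_INT,
                  recv_keys, recv_count, recv_displ, MPI_INT,
                  MPI_COMM_WORLD);
}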
The reason MPI_Allreduce costs so much time may already be apparent from this timeline view: the start edge of the MPI_Allreduce region is ragged, while the end edge is flat.
We can see even more data at a glance by going to a scatter plot. In Figure 7-9, the time spent in MPI_Allreduce (its latency) is plotted against the finishing time of each call to this MPI routine. There is one warm-up iteration, followed by a brief gap, and then ten more iterations, evenly spaced. In each iteration, an MPI process might spend as long as 10 to 30 ms in the MPI_Allreduce call, while other processes spend vanishingly little time in the reduce. The issue is not that the operation is all that time consuming, but simply that it is a synchronizing operation, so early arrivers have to spend some time waiting for latecomers.
We see another view of the same behavior by selecting the time of the MPI_Allreduce_start event, rather than the MPI_Allreduce_end event, for the X axis. Clicking on Refresh produces the view seen in Figure 7-10. This curious view is much like Figure 7-9, but now the lines of points slant up to the left instead of standing straight up. The slopes indicate that high latency is exactly correlated with early entry into the synchronizing call. For example, a 30-ms latency corresponds to entering the MPI_Allreduce call 30 ms early. This is simply another indication of what we saw in Figure 7-9: processes enter the call at different times, but they all exit almost immediately once the last process has arrived. At that point, all processes are fairly well synchronized.
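The effect is easy to reproduce outside the benchmark. The toy program below (our own illustration, not part of the benchmark) staggers each process's arrival at an MPI_Allreduce by 2 ms per rank and reports how long each one spends inside the call; the earliest arrivers report latencies that roughly match their head start, while the last arriver reports almost none.

/* Toy illustration (not from the benchmark): ranks arrive at a reduction
 * at staggered times; the time each rank spends inside MPI_Allreduce is
 * essentially its wait for the last arriver. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    usleep(rank * 2000);              /* stagger arrival by 2 ms per rank */

    t0 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* All ranks exit at nearly the same time, so time spent in the call
     * is largest for the ranks that entered earliest. */
    printf("rank %2d: spent %6.2f ms in MPI_Allreduce\n",
           rank, (t1 - t0) * 1e3);

    MPI_Finalize();
    return 0;
}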
The next MPI call is to MPI_Alltoall, but from our Prism profile we discover that it occurs among well-synchronized processes (thanks to the preceding MPI_Allreduce operation) and uses very small messages (64 bytes). It consumes very little time.
The chief MPI call in this case study is the MPI_Alltoallv operation. The processes are still well synchronized, as we saw in Figure 7-8, but we learn from the Table display that, on average, 2 Mbytes of data are sent or received per process. Clicking on the Histogram tab, we get the view seen in Figure 7-11. There are a few high-latency outliers, which a scatter plot would show occur during the first warm-up iteration. Most of the calls, however, take roughly 40 ms. The effective bandwidth for this operation is therefore
(2 Mbyte / process) * 16 processes / 40 ms = 800 Mbyte/second
Basically, each datum undergoes two copies (one to shared memory and one from shared memory) and each copy entails two memory operations (a load and a store), so this figure represents a memory bandwidth of 4 * 800 Mbyte/s = 3.2 Gbyte/s. This benchmark was run on an HPC 6000 server, whose backplane is rated at 2.6 Gbyte/s. Our calculation is approximate, but it nevertheless indicates that we are seeing saturation of the SMP backplane and we cannot expect to do much better with our MPI_Alltoallv operation.
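For readers who want to retrace the arithmetic, the following small program simply re-derives the two bandwidth figures from the numbers quoted above; it performs no measurements of its own.

/* Back-of-the-envelope check of the bandwidth estimate above
 * (inputs are the figures quoted in the text, not new measurements). */
#include <stdio.h>

int main(void)
{
    double mbyte_per_proc = 2.0;    /* data sent or received per process */
    int    nprocs         = 16;
    double seconds        = 0.040;  /* typical MPI_Alltoallv latency     */

    double eff_bw = mbyte_per_proc * nprocs / seconds;    /* Mbyte/s     */

    /* Two copies per datum, two memory operations per copy. */
    double mem_bw = 4.0 * eff_bw / 1000.0;                /* Gbyte/s     */

    printf("effective bandwidth:    %4.0f Mbyte/s\n", eff_bw);  /* 800 */
    printf("implied memory traffic: %.1f Gbyte/s\n", mem_bw);   /* 3.2 */
    return 0;
}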