The message-passing part of the code involves a bucket sort, implemented with an MPI_Allreduce, an MPI_Alltoall, and an MPI_Alltoallv, though no such knowledge is required for effective profiling with Prism. Instead, running the code under Prism, we quickly see that the most time-consuming MPI calls are MPI_Alltoallv and MPI_Allreduce. (See "Summary Statistics of MPI Usage.") Navigating a small section of the timeline window (see Figure 7-5), we see that there is actually a tight succession of MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv calls. One such iteration is shown in Figure 7-8. (We have shaded and labeled the time-consuming sections.)
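To make the communication pattern concrete, here is a minimal sketch (not the application's actual source) of how a bucket sort typically uses these three collectives: MPI_Allreduce to obtain global bucket sizes, MPI_Alltoall to exchange per-process send counts, and MPI_Alltoallv to redistribute the keys. Names such as NKEYS, NBUCKETS, and bucket_exchange are illustrative assumptions, and the bucket-to-process mapping is elided.

```c
#include <mpi.h>
#include <stdlib.h>

#define NKEYS    (1 << 16)   /* keys per process (assumed)   */
#define NBUCKETS 1024        /* number of key-range buckets  */

void bucket_exchange(int *keys, int nprocs, MPI_Comm comm)
{
    int  local_counts[NBUCKETS] = {0};
    int  global_counts[NBUCKETS];
    int *send_counts = calloc(nprocs, sizeof(int));
    int *recv_counts = malloc(nprocs * sizeof(int));
    int *send_displs = malloc(nprocs * sizeof(int));
    int *recv_displs = malloc(nprocs * sizeof(int));

    /* Count how many local keys fall into each bucket. */
    for (int i = 0; i < NKEYS; i++)
        local_counts[keys[i] % NBUCKETS]++;

    /* Global bucket sizes: needed by every process to divide the
     * buckets evenly among processes.                             */
    MPI_Allreduce(local_counts, global_counts, NBUCKETS,
                  MPI_INT, MPI_SUM, comm);

    /* ... assign bucket ranges to processes and fill send_counts ... */

    /* Tell each process how many keys it will receive from us.    */
    MPI_Alltoall(send_counts, 1, MPI_INT,
                 recv_counts, 1, MPI_INT, comm);

    /* Build displacements and redistribute the keys themselves.   */
    send_displs[0] = recv_displs[0] = 0;
    int total_recv = recv_counts[0];
    for (int p = 1; p < nprocs; p++) {
        send_displs[p] = send_displs[p - 1] + send_counts[p - 1];
        recv_displs[p] = recv_displs[p - 1] + recv_counts[p - 1];
        total_recv += recv_counts[p];
    }
    int *recv_keys = malloc(total_recv * sizeof(int));
    MPI_Alltoallv(keys,      send_counts, send_displs, MPI_INT,
                  recv_keys, recv_counts, recv_displs, MPI_INT, comm);

    free(send_counts); free(recv_counts);
    free(send_displs); free(recv_displs); free(recv_keys);
}
```

Again, none of this structure needs to be known in advance; Prism identifies the expensive calls directly from the timeline.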
The reason MPI_Allreduce costs so much time may already be apparent from this timeline view. The start edge of the MPI_Allreduce region is ragged, while the end edge is flat: processes enter the call at different times, but they all leave together, so the early arrivals spend much of their MPI_Allreduce time simply waiting for the slowest process.
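One simple way to confirm this interpretation (a sketch of a generic technique, not something Prism requires) is to time an explicit MPI_Barrier immediately before the reduction. If most of the apparent MPI_Allreduce time moves into the barrier, the collective itself is cheap and the cost is really load-imbalance wait time. The function and buffer names below are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Separate the load-imbalance wait from the reduction proper by
 * timing a barrier and the MPI_Allreduce individually.            */
void timed_allreduce(int *local, int *global, int n, MPI_Comm comm)
{
    double t0, t_wait, t_reduce;
    int rank;
    MPI_Comm_rank(comm, &rank);

    t0 = MPI_Wtime();
    MPI_Barrier(comm);               /* absorbs the wait for late ranks */
    t_wait = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Allreduce(local, global, n, MPI_INT, MPI_SUM, comm);
    t_reduce = MPI_Wtime() - t0;     /* cost of the reduction itself    */

    printf("rank %d: wait %.6f s, reduce %.6f s\n", rank, t_wait, t_reduce);
}
```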