The Sun MPI library uses the TCP protocol to communicate over a variety of networks. MPI depends on TCP to ensure reliable, correct data flow. TCP's reliability compensates for unreliability in the underlying network, because the TCP retransmission algorithms handle any segments that are lost or corrupted. In most cases, this approach works well and performs well. However, when an application performs all-to-all or all-to-one communication over certain networks, a large number of TCP segments can be lost, resulting in poor performance.
You can compensate for this diminished performance over TCP in these ways:
When writing your own algorithms, avoid flooding one node with a lot of data.
If you need to do all-to-all or all-to-one communication, use one of the Sun MPI routines to do so. They are implemented in a way that avoids congesting a single node with lots of data. The following routines fall into this category:
MPI_Alltoall and MPI_Alltoallv - These have been implemented using a pairwise communication pattern, so that every rank is communicating with only one other rank at a given time.
MPI_Gather/MPI_Gatherv - The root process sends ready-to-send packets to each nonroot process to tell those processes to send their data. In this way, the root process can regulate how much data it receives at any one time. This ready-to-send method does, however, carry a minor performance cost. For this reason, you can override it by setting the MPI_TCPSAFEGATHER environment variable to 0. (See the Sun MPI user's guides for information about environment variables.)