Sun S3L 3.0 Programming and Reference Guide

Sorting and Ranking S3L Arrays

Sun S3L includes sorting routines for parallel sorting of one-dimensional arrays or multidimensional arrays along a user-specified axis. It also includes routines for computing the grade (rank) of the elements of an array.

The sorting and grading routines are most efficient when the arrays are block distributed. For multidimensional sorts and grades, performance is best when the axis to be sorted or graded is local to a process.

The sort routines use a variation of the sample sort algorithm. In a coordinated operation, all processors extract a random sample of the data. The distribution of this sample should match as closely as possible the distribution of the data to be sorted. Based on this sample, all processes split their data into np packets, each destined for a particular process. A global interprocess communication stage then collects the packets into the appropriate processes. Each process then independently sorts its own data using a quicksort algorithm. Finally, all the processes coordinate to restore the data to its original distribution.

The first communication stage, where packets of data are sent to the appropriate processes, is more intensive than the operation that restores the original distribution. This is because the second communication stage only involves exchanges between processes with neighboring ranks.

In general, the parallel S3L sort routines exhibit good scaling characteristics. While some data distribution patterns can affect the quality of load balancing and the performance of local sorts, the internal parameters of the parallel algorithm have been chosen to achieve good performance for most cases of data distribution.

Sorting single-precision floating-point numbers is faster than sorting double-precision floating-point numbers. Also, sorting 64-byte integers can be slower than sorting 64-byte floating-point numbers.