Many Sun S3L routines use a serial algorithm when called from an application running on a single process and a different, parallel algorithm when called from a multiprocess application. When those routines run on only two or three processes, they are likely to be slower than the serial version running on a single process, because the overhead of the parallel algorithm can outweigh any gains from parallelizing the operation.
This means that, in general, MPI applications that call Sun S3L routines should run on at least four processes.
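
The trade-off can be illustrated with a toy cost model. The following is only a sketch: it assumes the work divides evenly across processes and that the parallel algorithm adds a fixed overhead. The function names and the values of `SERIAL_TIME` and `OVERHEAD` are hypothetical, chosen so that the model reproduces the behavior described above; they are not measurements of any Sun S3L routine.

```python
def parallel_time(serial_time, overhead, nprocs):
    """Toy model of run time on nprocs processes (an assumption, not S3L's real cost)."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    if nprocs == 1:
        # Single process: the serial algorithm runs, with no parallel overhead.
        return serial_time
    # Multiple processes: work is split evenly, plus a fixed parallel overhead.
    return serial_time / nprocs + overhead


def min_procs_for_speedup(serial_time, overhead, max_procs=64):
    """Return the smallest process count > 1 that beats the serial time, or None."""
    for nprocs in range(2, max_procs + 1):
        if parallel_time(serial_time, overhead, nprocs) < serial_time:
            return nprocs
    return None


# Hypothetical, unitless values used only to illustrate the effect.
SERIAL_TIME = 1.0
OVERHEAD = 0.7

for n in (1, 2, 3, 4, 8):
    print(n, round(parallel_time(SERIAL_TIME, OVERHEAD, n), 3))

print("first process count faster than serial:",
      min_procs_for_speedup(SERIAL_TIME, OVERHEAD))
```

With these assumed values, the modeled run time on two or three processes is higher than the serial time, and four is the first process count at which the parallel version comes out ahead. The actual break-even point for a given routine depends on the problem size and the platform.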