Sun Performance Library User's Guide


Chapter 2

Using Sun Performance Library

This chapter describes how to use the Sun Performance Library to improve the execution speed of applications written in FORTRAN 77, Fortran 95, or C. Although some modifications to applications might be required to gain peak performance, many applications can benefit significantly from using Sun Performance Library without making source code changes or recompiling.

Improving Application Performance

Use Sun Performance Library in the following ways to improve the speed of user code without making any code changes:

Replacing Routines With Sun Performance Library Routines

Many applications are built using one or more of the base Netlib libraries supported by the Sun Performance Library. Third-party vendors can also use BLAS and LAPACK as building blocks in their applications. Because Sun Performance Library maintains the same interfaces and functionality as these libraries, base Netlib routines can be replaced with Sun Performance Library routines.

Sun Performance Library can be included in a user's development environment to improve application performance on single processor and multiprocessor (MP) platforms. Sun Performance Library routines can be faster than the corresponding Netlib routines or routines provided by other vendors that perform similar functions. The serial speed of many Sun Performance Library routines has been increased, and many routines have been parallelized that might be serial in other products.

Improving Performance of Other Libraries

Users of other mathematical libraries can replace the BLAS in their library with the BLAS in Sun Performance Library, while leaving other routines unchanged. This is helpful when an application has a dependency on proprietary interfaces in another library that prevent the other library from being completely replaced. Many commercial math libraries are built around a core of generic BLAS and LAPACK routines, so replacing those generic routines with the highly optimized BLAS and LAPACK routines in Sun Performance Library can give speed improvements on both serial and MP platforms. Because replacing the core routines does not require any code changes, the proprietary library features can still be used.

Even libraries that already have fast core routines may get additional speedups by using Sun Performance Library. For example, if another vendor's core routines are based on BLAS, these routines can be replaced with Sun Performance Library routines, which have SPARC-specific optimizations. Many Sun Performance Library routines have also been parallelized.

Using Tools to Restructure Code

In some cases, other libraries may not directly use the routines in the Sun Performance Library; however, there might be conversion aids available. For example, EISPACK users can refer to a conversion chart in the LAPACK Users' Manual that shows how to convert EISPACK calls to LAPACK calls.

Several vendors market automatic code restructuring tools that replace existing code with Sun Performance Library code. For example, a source-to-source conversion tool can replace existing BLAS code structures with calls to the BLAS in Sun Performance Library. These tools can also recognize many user-written matrix multiplications and replace them with calls to the matrix multiplication subroutine in Sun Performance Library.

Fortran f77/f95 Interfaces

The Sun Performance Library routines can be called from within a FORTRAN 77, Fortran 95, or a C program. However, C programs must still use the FORTRAN 77 calling sequence.

Sun Performance Library f77/f95 interfaces use the following conventions:

When calling Sun Performance Library routines:

Using Fortran 95 Features

This release supports Fortran 95 language features. To use the Sun Performance Library Fortran 95 modules and definitions, include the USE SUNPERF statement in the program. The USE SUNPERF statement enables the following features:

For example, the SAXPY routine is defined as follows in the man page:

SUBROUTINE SAXPY([N], ALPHA, X, [INCX], Y, [INCY])
REAL ALPHA
INTEGER INCX, INCY, N
REAL X(*), Y(*)

Note that the arguments N, INCX, and INCY are optional.

Suppose the user tries to call the SAXPY routine with the following arguments:

USE SUNPERF
COMPLEX ALPHA
REAL    X(100), Y(100), XA(100,100), RALPHA
INTEGER INCX, INCY

If mismatches in the type, shape, or number of arguments occur, the compiler would issue the following error message:

 	 ERROR: No specific match can be found for the generic subprogram call "AXPY".

Using the arguments defined above, the following examples show incorrect calls to the SAXPY routine due to type, shape, or number mismatches.

The following calls to the AXPY interface are valid.

	 CALL AXPY(N,RALPHA,X,Y=Y,INCY=INCY)
	 CALL AXPY(N,RALPHA,X,INCX,Y)
	 CALL AXPY(N,RALPHA,X,Y=Y)
	 CALL AXPY(ALPHA=RALPHA,X=X,Y=Y)

Fortran Examples

Getting peak performance from Sun Performance Library for single processor applications is a matter of identifying code constructs in an application that can be replaced by calls to subroutines in Sun Performance Library. Multiprocessor applications can get additional speed by identifying opportunities for parallelization.

The easiest situation occurs when a block of user code exactly duplicates a capability of Sun Performance Library. Consider the code below:

      DO I = 1, N
          DO J = 1, N
              Y(I) = Y(I) + A(I,J) * X(J)
          END DO
      END DO

This is the matrix-vector product y ← Ax + y, which can be performed with the DGEMV subroutine.
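For reference, the correspondence between the loop and the routine can be written out explicitly. In the standard BLAS definition, DGEMV computes a scaled matrix-vector product plus a scaled vector; the loop above is the special case with both scale factors equal to 1 and no transpose. The symbols α and β are the usual BLAS names for those scale factors and do not appear in the loop itself.

```latex
y_i \leftarrow y_i + \sum_{j=1}^{N} A_{ij}\, x_j , \quad i = 1, \dots, N
\qquad\Longleftrightarrow\qquad
y \leftarrow \alpha A x + \beta y \quad \text{with } \alpha = \beta = 1 .
```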

As another example, consider the following code fragment:

      DO I = 1, N
          IF (V2(I,K) .LT. 0.0) THEN
              V2(I,K) = 0.0
          ELSE
              DO J = 1, M
                  X(J,I) = X(J,I) + V1(J,K) * V2(I,K)
              END DO
          END IF 
      END DO

In other cases, a block of code can be equivalent to several Sun Performance Library calls or contain a mixture of code that can be replaced together with code that has no natural replacement in Sun Performance Library. One way to rewrite the code with Sun Performance Library is shown below:

      DO I = 1, N
          IF (V2(I,K) .LT. 0.0) THEN
             V2(I,K) = 0.0
          END IF 
      END DO
      CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)

An f95-specific version is shown below.

WHERE (V2(1:N,K) .LT. 0.0)
       V2(1:N,K) = 0.0
END WHERE
CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)

The code to replace negative numbers with zero in V2 has no natural analog in Sun Performance Library, so that code is pulled out of the outer loop. With that code removed to its own loop, the rest of the loop can be recognized as a rank-1 update of the general matrix X, which can be accomplished using the DGER routine from BLAS.
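The equivalence of the two forms can be checked with a small self-contained sketch. It is written in C with column-major indexing to mirror the Fortran arrays, and it uses plain loops rather than the library so that it runs anywhere. The names `update_original`, `update_restructured`, and `IDX` are illustrative assumptions, not part of Sun Performance Library; the second loop nest in `update_restructured` is the part that a DGER call would perform.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index, mirroring Fortran's X(J,I) with leading dimension ld. */
#define IDX(j, i, ld) ((size_t)(i) * (size_t)(ld) + (size_t)(j))

/* Original form: clamping and updating are interleaved in one loop nest.
   v1 has m entries (column K of V1), v2 has n entries (column K of V2). */
void update_original(int m, int n, double *x, int ldx,
                     const double *v1, double *v2)
{
    assert(m > 0 && n > 0 && ldx >= m);
    for (int i = 0; i < n; i++) {
        if (v2[i] < 0.0) {
            v2[i] = 0.0;
        } else {
            for (int j = 0; j < m; j++)
                x[IDX(j, i, ldx)] += v1[j] * v2[i];
        }
    }
}

/* Restructured form: clamp first, then apply an unconditional
   rank-1 update X <- X + v1 * v2^T. */
void update_restructured(int m, int n, double *x, int ldx,
                         const double *v1, double *v2)
{
    assert(m > 0 && n > 0 && ldx >= m);
    for (int i = 0; i < n; i++)
        if (v2[i] < 0.0)
            v2[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            x[IDX(j, i, ldx)] += v1[j] * v2[i];
}
```

For finite inputs both functions leave X and V2 in the same state, because a clamped entry contributes only zeros to the update. If V1 can contain Inf or NaN, the unconditional update multiplies those values by the clamped zeros and produces NaN where the original loop skipped the column, so the rewrite assumes finite data.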

Note that if there are many negative or zero values in V2, it may be that the majority of the time is not spent in the rank-1 update and so replacing that code with the call to DGER might not bring a large payoff. It might be worthwhile to evaluate the reference to K. If it is a loop index, it may be that the loops shown here are part of a larger code structure, and loops over DGEMV or DGER can often be converted to some form of matrix multiplication. If so, a single call to a matrix multiplication routine will probably bring a much larger payoff than a loop over calls to DGER.
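The conversion from a loop of rank-1 updates to a matrix multiplication rests on a standard identity. As an assumption for illustration, suppose K runs from 1 to an upper bound P (a name introduced here, not taken from the code), and write v^(1)_k and v^(2)_k for column k of V1 and of the already clamped V2. The accumulated updates then collapse into one matrix product:

```latex
X + \sum_{k=1}^{P} v^{(1)}_{k} \bigl( v^{(2)}_{k} \bigr)^{T}
  \;=\; X + V_1 V_2^{T},
\qquad V_1 \in \mathbb{R}^{M \times P}, \quad V_2 \in \mathbb{R}^{N \times P} .
```

In standard BLAS, the right-hand side has the form computed by DGEMM with the second operand transposed and both scale factors equal to 1. This holds only if every column of V2 is clamped before the product is formed and nothing else in the enclosing K loop depends on intermediate values of X.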

All Sun Performance Library routines are MT-safe (multithread safe). Because the routines are MT-safe, additional performance is possible on MP platforms by using the auto-parallelizing compiler to parallelize loops that contain calls to Sun Performance Library.

The following example shows an effective combination of a Sun Performance Library routine with an auto-parallelizing compiler directive.

C$PAR DOALL
      DO I = 1, N
          CALL DGBMV ('No transpose', N, N, KL, KU, ALPHA, A, LDA,
     $                B(1,I), 1, BETA, C(1,I), 1)
      END DO

Sun Performance Library contains a routine named DGBMV to multiply a banded matrix by a vector. By putting this routine into a properly constructed loop, it is possible to use the routines in Sun Performance Library to multiply a banded matrix by a matrix. The compiler will not parallelize this loop by default because the presence of subroutine calls in a loop inhibits parallelization. However, because Sun Performance Library routines are MT-safe, a user may use parallelization directives as shown above to instruct the compiler to parallelize this loop.
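The reason the loop is safe to parallelize is that each iteration reads one column of B and writes one column of C, with no data shared between iterations other than the read-only band matrix. The sketch below illustrates this in C under stated assumptions: `ref_gbmv` is a hypothetical plain-loop stand-in for a banded matrix-vector product (it is not the library's DGBMV), it handles only the square, unit-stride, no-transpose case, and it assumes the standard BLAS band storage in which entry (i, j) of the matrix is kept at row `ku + i - j` of column j (0-based).

```c
#include <assert.h>
#include <stddef.h>

/* y <- alpha*A*x + beta*y for an n-by-n band matrix A with kl subdiagonals
   and ku superdiagonals, stored column-major in ab with leading dimension lda. */
void ref_gbmv(int n, int kl, int ku, double alpha, const double *ab, int lda,
              const double *x, double beta, double *y)
{
    assert(n > 0 && kl >= 0 && ku >= 0 && lda >= kl + ku + 1);
    for (int i = 0; i < n; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        int ilo = (j - ku > 0) ? j - ku : 0;          /* first row in the band */
        int ihi = (j + kl < n - 1) ? j + kl : n - 1;  /* last row in the band  */
        for (int i = ilo; i <= ihi; i++)
            y[i] += alpha * ab[(size_t)j * lda + (size_t)(ku + i - j)] * x[j];
    }
}

/* C <- alpha*A*B + beta*C, one column of B and C per iteration.
   Iterations touch disjoint columns, so they are independent of each other. */
void band_times_matrix(int n, int ncols, int kl, int ku, double alpha,
                       const double *ab, int lda,
                       const double *b, int ldb,
                       double beta, double *c, int ldc)
{
    assert(ncols > 0 && ldb >= n && ldc >= n);
    for (int col = 0; col < ncols; col++)
        ref_gbmv(n, kl, ku, alpha, ab, lda,
                 b + (size_t)col * ldb, beta, c + (size_t)col * ldc);
}
```

The outer loop in `band_times_matrix` plays the role of the DO loop above. It is written serially here; the independence of its iterations is what a parallelization directive relies on.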

Note that a user can also use compiler directives to parallelize a loop with a subroutine call that ordinarily would not be parallelizable. For example, it is ordinarily not possible to parallelize a loop containing a call to some of the linear system solvers, because some vendors have implemented those routines using code that is not MT-safe. Loops containing calls to the expert drivers of the linear system solvers (routines whose names end in SVX) are usually not parallelizable with other implementations of LAPACK. The implementation of LAPACK in Sun Performance Library allows parallelization of loops containing such calls. Because the versions in Sun Performance Library are MT-safe, users of MP platforms can get additional performance by parallelizing these loops.

C Interfaces

Sun Performance Library contains native C interfaces for each of the routines contained in LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK. The Sun Performance Library C interfaces have the following features:

The following example compares the standard LAPACK Fortran interface and the Sun Performance Library C interfaces for the DGBCON routine.

 CALL DGBCON (NORM, N, NSUB, NSUPER, DA, LDA, IPIVOT, DANORM,
              DRCOND, DWORK, IWORK2, INFO)

 void dgbcon(char norm, int n, int nsub, int nsuper, double *da,
             int lda, int *ipivot, double danorm, double *drcond,
             int *info)

Note that the names of the arguments are the same and that arguments with the same name have the same base type. Scalar arguments that are used only as input values, such as NORM and N, are passed by value in the C version. Arrays and scalars that will be used to return values are passed by reference.

The Sun Performance Library C interfaces improve on CLAPACK, available on Netlib, which is an f2c translation of the standard libraries. For example, all of the CLAPACK routines are followed by a trailing underscore to maintain compatibility with Fortran compilers, which often postfix routine names in the object (.o) file with an underscore. The Sun Performance Library C interfaces do not require a trailing underscore.

Sun Performance Library C interfaces use the following conventions:

Sun Performance Library uses global integer registers %g2, %g3, and %g4 in 32-bit mode and %g2 through %g5 in 64-bit mode as scratch registers. User code should not keep temporary data in these registers across a call to a Sun Performance Library routine, because the routine overwrites them.

C Examples

The key to using Sun Performance Library to get peak performance from applications is to recognize opportunities to transform user-written code sequences into calls to Sun Performance Library functions. The following code sequence adapted from LAPACK shows one example:

int   i;
float a[n], b[n], largest;

largest = a[0];
for (i = 0; i < n; i++)
{
    if (a[i] > largest)
        largest = a[i];
    if (b[i] > largest)
        largest = b[i];
}

There is no subroutine in Sun Performance Library that exactly replicates the functionality of the code above. However, the code can be accelerated by replacing it with several calls to Sun Performance Library as shown below:

int    i, large_index; 
float a[n], b[n], largest;
 
large_index = isamax (n, a, 1);
largest = a[large_index];
large_index = isamax (n, b, 1);
if (b[large_index] > largest) 
     largest = b[large_index];

Note the differences between the call to the native C isamax in Sun Performance Library above and the call shown below to a comparable function in CLAPACK:

/* 1. Declare scratch variable to allow 1 to be passed by value */ 
int one = 1;
/* 2. Append underscore to conform to FORTRAN naming system     */
/* 3. Pass all arguments, even scalar input-only, by reference  */
/* 4. Subtract one to convert from FORTRAN indexing conventions */
large_index = isamax_ (&n, a, &one) - 1;
largest = a[large_index];
large_index = isamax_ (&n, b, &one) - 1;
if (b[large_index] > largest) 
     largest = b[large_index];
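One caveat applies to the two-call replacement shown earlier: in standard BLAS, ISAMAX locates the element with the largest absolute value, so the replacement matches the original loop only when that element is also the largest signed value, for example when all data are nonnegative. The sketch below makes this concrete. `ref_isamax` is a hypothetical plain-C stand-in written for this illustration (it is not the library routine), and it returns a 0-based index on the assumption that the native C interface behaves as the example above implies.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for isamax: 0-based position of the first element with the
   largest absolute value, reading every inc-th entry of x. */
int ref_isamax(int n, const float *x, int inc)
{
    assert(n > 0 && inc > 0);
    int   best = 0;
    float best_abs = (x[0] < 0.0f) ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        float v = x[(size_t)i * (size_t)inc];
        if (v < 0.0f)
            v = -v;
        if (v > best_abs) {
            best_abs = v;
            best = i;
        }
    }
    return best;
}

/* The original hand-written loop: largest signed value in a and b. */
float largest_loop(int n, const float *a, const float *b)
{
    float largest = a[0];
    for (int i = 0; i < n; i++) {
        if (a[i] > largest) largest = a[i];
        if (b[i] > largest) largest = b[i];
    }
    return largest;
}

/* The two-call replacement, using the stand-in above. */
float largest_two_calls(int n, const float *a, const float *b)
{
    int   large_index = ref_isamax(n, a, 1);
    float largest = a[large_index];
    large_index = ref_isamax(n, b, 1);
    if (b[large_index] > largest)
        largest = b[large_index];
    return largest;
}
```

With `a = {1, 5, 2}` and `b = {3, 4, 7}` both functions return 7. With `a = {-9, 1}` and `b = {-8, 0.5}` the loop returns 1 while the two-call version returns -8, because -9 and -8 have the largest magnitudes.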

As an example of a program that uses Sun Performance Library routines from user-managed threads, consider a real-time signal processing application running on a 4-processor server with one processor dedicated to acquiring the data, two processors dedicated to performing FFTs on the data, and one processor dedicated to postprocessing the data after the FFTs. It begins by creating multiple running instances of the function that performs the FFT:

for (i = 0; i < NCPUS_FOR_FFT; i++) {
   who[i] = i;
   do_fft[i] = 0;
   fft_done_buff_available[i] = 1;
   (void)thr_create ((void *)0, (size_t)0, fft_func,
                     (void *)&who[i], (long)0, (thread_t *)0);
}

The code below is a simplified implementation of part of fft_func started by thr_create in the loop above. Note that production code should check the return value from thr_create above and should use semaphores rather than busy waits at the synchronization points in the code below.
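The two production-code recommendations in the note above can be sketched as follows. This is a minimal illustration under stated assumptions, not the guide's implementation: it substitutes POSIX threads and unnamed POSIX semaphores for the Solaris `thr_create` interface, it assumes a platform that supports `sem_init` (such as Linux or Solaris), and the names `worker_t`, `worker_func`, and `run_one_job` are hypothetical. A doubling operation stands in for the FFT call, and error checks on the semaphore wait and post calls are omitted for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical per-worker state shared between producer and worker. */
typedef struct {
    sem_t work_ready;   /* posted by the producer when input is ready  */
    sem_t work_done;    /* posted by the worker when output is ready   */
    int   input;
    int   output;
} worker_t;

static void *worker_func(void *arg)
{
    worker_t *w = (worker_t *)arg;
    sem_wait(&w->work_ready);      /* blocks instead of spinning        */
    w->output = 2 * w->input;      /* stand-in for the FFT computation  */
    sem_post(&w->work_done);
    return NULL;
}

/* Runs one job on a worker thread. Returns 0 on success, -1 if a semaphore
   could not be initialized, or the error code from pthread_create. */
int run_one_job(int input, int *output)
{
    worker_t  w;
    pthread_t tid;
    int       rc;

    assert(output != NULL);
    w.input = input;
    w.output = 0;
    if (sem_init(&w.work_ready, 0, 0) != 0)
        return -1;
    if (sem_init(&w.work_done, 0, 0) != 0) {
        sem_destroy(&w.work_ready);
        return -1;
    }
    rc = pthread_create(&tid, NULL, worker_func, &w);
    if (rc == 0) {                 /* proceed only if the thread started */
        sem_post(&w.work_ready);
        sem_wait(&w.work_done);
        pthread_join(tid, NULL);
        *output = w.output;
    }
    sem_destroy(&w.work_done);
    sem_destroy(&w.work_ready);
    return rc;
}
```

Calling `run_one_job(21, &out)` returns 0 and stores 42 in `out` when thread creation succeeds. The semaphores take the place of the `do_fft` and `fft_done_buff_available` flags: a waiting thread sleeps in `sem_wait` rather than consuming a processor in an empty `while` loop.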

cpu_id = *who_am_i; 
while (1) {
  while (!do_fft[cpu_id]) {} 
  rfftf (n, &dataset[0][cpu_id], &scratch[0][cpu_id]); 
  while (!fft_done_buff_available[cpu_id]) {}
  fft_done_buff_available[cpu_id] = 0; 
  scopy (n, &dataset[0][cpu_id], 1, &fft_done_buff[0][cpu_id], 1);
  do_fft[cpu_id] = 0;
}


Sun Microsystems, Inc.
Copyright information. All rights reserved.