Fortran Programming Guide
Performance and Optimization
This chapter considers some optimization techniques that may improve the performance of numerically intense Fortran programs. Proper use of algorithms, compiler options, library routines, and coding practices can bring significant performance gains. This chapter does not cover cache, I/O, or system environment tuning; parallelization issues are treated in the next chapter.
Some of the issues considered here are:
- Compiler options that may improve performance
- Compiling with feedback from runtime performance profiles
- Use of optimized library routines for common procedures
- Coding strategies to improve performance of key loops
The subject of optimization and performance tuning is much too complex to be treated exhaustively here. However, this discussion should provide the reader with a useful introduction to these issues. A list of books that cover the subject much more deeply appears at the end of the chapter.
Optimization and performance tuning is an art that depends heavily on being able to determine what to optimize or tune.
Choice of Compiler Options
Choice of the proper compiler options is the first step in improving performance. Sun compilers offer a wide range of options that affect the object code. In the default case, where no options are explicitly stated on the compile command line, most options are off. To improve performance, these options must be explicitly selected.
Performance options are normally off by default because most optimizations force the compiler to make assumptions about a user's source code. Programs that conform to standard coding practices and do not introduce hidden side effects should optimize correctly. However, programs that take liberties with standard practices might run afoul of some of the compiler's assumptions. The resulting code might run faster, but the computational results might not be correct.
Recommended practice is to first compile with all options off, verify that the computational results are correct and accurate, and use these initial results and performance profile as a baseline. Then, proceed in steps--recompiling with additional options and comparing execution results and performance against the baseline. If numerical results change, the program might have questionable code, which needs careful analysis to locate and reprogram.
If performance does not improve significantly, or degrades, as a result of adding optimization options, the coding might not provide the compiler with opportunities for further performance improvements. The next step would then be to analyze and restructure the program at the source code level to achieve better performance.
Performance Option Reference
The compiler options described in the following sections provide the user with a repertoire of strategies to improve the performance of a program over default compilation. Only some of the compilers' more potent performance options appear here. A more complete list can be found in the Fortran User's Guide.
Some of these options increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.
-fast
This single option selects a number of performance options that, working together, produce object code optimized for execution speed without an excessive increase in compilation time.
The options selected by -fast are subject to change from one release to another, and not all are available on each platform:
- -native generates code optimized for the host architecture.
- -O5 sets the optimization level.
- -libmil inlines calls to some simple library functions.
- -fsimple=2 simplifies floating-point code.
- -dalign uses faster, double-word loads and stores.
- -xlibmopt uses the optimized libm math library.
- -fns selects non-standard floating-point mode.
- -ftrap=%none turns off all trapping for f77, or -ftrap=common selects common floating-point trapping for f95.
- -depend analyzes loops for data dependencies.
- -pad=common improves cache performance.
- -xvector=yes invokes vectorized library functions in loops.
-fast provides a quick way to engage much of the optimizing power of the compilers. Each of the composite options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User's Guide). Following -fast with additional options adds further optimizations. For example:
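The command line below is a sketch: prog.f is a stand-in source file, and -xarch=v9 is assumed here to be the flag that selects 64-bit SPARC code generation.
demo% f95 -fast -xarch=v9 prog.f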
This compiles for an UltraSPARC 64-bit enabled Solaris platform.
Note: -fast includes -dalign and -native. These options may have unexpected side effects for some programs.
-On
The compilers perform no optimizations unless a -O option is specified explicitly (or implicitly with macro options like -fast). In nearly all cases, specifying an optimization level for compilation improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.
For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds automatic inlining of calls to routines contained in the same source file as the caller routine, among other things. (See the Fortran User's Guide for further information about subprogram call inlining.)
Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be applied only to the routines that make up the most compute-intensive parts of the program and thereby have a high certainty of improving performance. (There is no problem linking together parts of a program compiled with different optimization levels.)
PRAGMA OPT=n
Use the C$PRAGMA SUN OPT=n directive to set different optimization levels for individual routines in a source file. This directive will override the -On flag on the compiler command line, but must be used with the -xmaxopt=n flag to set a maximum optimization level. See the f77(1) and f95(1) man pages for details.
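As a sketch, assuming the directive is placed immediately before the routine it governs and the file is compiled with, say, -O4 -xmaxopt=4 (routine and variable names are illustrative):

C$PRAGMA SUN OPT=2
      SUBROUTINE TOUCHY(A, N)
      INTEGER N, I
      REAL A(N)
C     This routine is compiled at level 2 while the rest of the
C     file gets the command-line -O4.
      DO I = 1, N
         A(I) = 2.0*A(I)
      END DO
      END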
Optimization With Runtime Profile Feedback
The compiler applies its optimization strategies at level -O3 and above much more efficiently if combined with -xprofile=use. With this option, the optimizer is directed by a runtime execution profile produced by the program (compiled with -xprofile=collect) with typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This may be particularly important with -O5. Here's a typical example of profile collection with higher optimization levels:
demo% f95 -o prg -fast -xprofile=collect prg.f ...
demo% prg
demo% f95 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
demo% prgx
The first compilation in the example generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.
(See the Fortran User's Guide for details on the -xprofile options.)
-dalign
With -dalign, the compiler is able to generate double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.
However, users should be aware that using -dalign (and therefore -fast) may cause problems with some programs that have been coded expecting a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double (and quad) precision data (either REAL or COMPLEX) are aligned on double-word boundaries, with the result that:
- COMMON blocks might be larger than expected due to added padding.
- All program units sharing COMMON must be compiled with -dalign if any one of them is compiled with -dalign.
For example, a program that writes data by aliasing an entire COMMON block of mixed data types as a single array might not work properly with -dalign, because the block will be larger (due to padding of double and quad precision variables) than the program expects.
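A minimal sketch of the kind of layout assumption that breaks (block and variable names are illustrative):

      INTEGER I4
      DOUBLE PRECISION D8
      COMMON /MIXED/ I4, D8
C     Without -dalign this block occupies 12 bytes. With -dalign the
C     compiler may insert 4 bytes of padding after I4 so that D8 falls
C     on a double-word boundary, making the block 16 bytes. Code that
C     aliases /MIXED/ as a 12-byte array no longer matches its size.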
-depend
Adding -depend to optimization levels -O3 and higher (on the SPARC platform) extends the compiler's ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration loop dependencies to determine whether or not certain transformations of the loop structure can be performed. Only loops without dependencies can be restructured. However, the added analysis might increase compilation time.
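For example, consider this illustrative pair of loops (not taken from the compiler documentation):

      INTEGER I, N
      REAL A(1000), B(1000), C(1000)
      DO I = 2, N
C        A(I) reads A(I-1), written in the previous iteration: an
C        inter-iteration dependence, so the optimizer cannot freely
C        restructure this loop.
         A(I) = A(I-1) + B(I)
      END DO
      DO I = 1, N
C        No iteration reads a value written by another iteration,
C        so this loop is a candidate for restructuring.
         C(I) = A(I) + B(I)
      END DO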
-fsimple=2
Unless directed to, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). With the -fast option, -fsimple=1 is used and some conservative assumptions are made. Adding -fsimple=2 enables the optimizer to make further simplifications with the understanding that this might cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be similarly compiled to ensure consistent numerical accuracy.
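The rounding effects come from transformations such as reassociation. A small sketch of why reassociation can change results (this program only illustrates floating-point behavior; it is not tied to any particular flag):

      PROGRAM REASSOC
      REAL A, B, C
      A = 1.0E20
      B = -1.0E20
      C = 1.0
C     Mathematically equal, but in single precision the first line
C     prints 1.0 and the second prints 0.0, because B + C rounds
C     back to B.
      PRINT *, (A + B) + C
      PRINT *, A + (B + C)
      END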
-unroll=n
Unrolling short loops with long iteration counts can be profitable for some routines. However, unrolling can also increase program size and might even degrade performance of other loops. With n=1, the default, no loops are unrolled automatically by the optimizer. With n greater than 1, the optimizer attempts to unroll loops up to a depth of n.
The compiler's code generator makes its decision to unroll loops depending on a number of factors. The compiler might decline to unroll a loop even though this option is specified with n>1.
If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled. A runtime test on iteration count determines whether or not executing the unrolled loop is appropriate. Loop unrolling, especially with simple one or two statement loops, increases the amount of computation done per iteration and provides the optimizer with better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation might be needed.
The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):
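A sketch of the effect (illustrative; the compiler performs this transformation internally):

C     Original loop:
      DO I = 1, 20000
         X(I) = X(I) + Y(I)*A(I)
      END DO

C     Unrolled to a depth of four, it executes as if written:
      DO I = 1, 20000, 4
         X(I)   = X(I)   + Y(I)*A(I)
         X(I+1) = X(I+1) + Y(I+1)*A(I+1)
         X(I+2) = X(I+2) + Y(I+2)*A(I+2)
         X(I+3) = X(I+3) + Y(I+3)*A(I+3)
      END DO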
This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.
-xtarget=platform
The performance of some programs might improve if the compiler has an accurate description of the target computer hardware. When program performance is critical, the proper specification of the target hardware could be very important. This is especially true when running on the newer SPARC processors. However, for most programs and older SPARC processors, the performance gain could be negligible and a generic specification might be sufficient.
The Fortran User's Guide lists all the system names recognized by -xtarget=. For any given system name (for example, ultra2, for UltraSPARC II), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.
The special setting -xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture. Therefore, -xtarget=generic is the default, even though it might produce suboptimal performance.
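As a sketch, compiling for the ultra2 system named above might look like this (prog.f is a stand-in source file):
demo% f95 -O3 -xtarget=ultra2 prog.f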
Other Performance Strategies
Assuming that you have experimented with using a variety of optimization options, compiling your program and measuring actual runtime performance, the next step might be to look closely at the Fortran source program to see what further tuning can be tried.
Focusing on just those parts of the program that use most of the compute time, you might consider the following strategies:
- Replace handwritten procedures with calls to equivalent optimized libraries.
- Remove I/O, calls, and unnecessary conditional operations from key loops.
- Eliminate aliasing that might inhibit optimization.
- Rationalize tangled, spaghetti-like code to use block IF.
These are some of the good programming practices that tend to lead to better performance. It is possible to go further, hand-tuning the source code for a specific hardware configuration. However, these attempts might only further obscure the code and make it even more difficult for the compiler's optimizer to achieve significant performance improvements. Excessive hand-tuning of the source code can hide the original intent of the procedure and could have a significantly detrimental effect on performance for different architectures.
Using Optimized Libraries
In most situations, optimized commercial or shareware libraries perform standard computational procedures far more efficiently than you could by coding them by hand.
For example, the Sun Performance Library is a suite of highly optimized mathematical subroutines based on the standard LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK libraries. Performance improvement using these routines can be significant when compared with hand coding. See the Sun Performance Library User's Guide for details.
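For instance, a hand-coded triple loop for the matrix product C = A*B can be replaced by one call to the standard BLAS level-3 routine DGEMM (the dimensions here are illustrative):

      DOUBLE PRECISION A(100,100), B(100,100), C(100,100)
C     Computes C = 1.0*A*B + 0.0*C; 'N' means neither matrix is
C     transposed, and the 100s give the problem sizes and the
C     leading dimensions of the arrays.
      CALL DGEMM('N', 'N', 100, 100, 100, 1.0D0, A, 100,
     &           B, 100, 0.0D0, C, 100)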
Eliminating Performance Inhibitors
Use the Sun WorkShop Performance Analyzer to identify the key computational parts of the program. Then, carefully analyze the loop or loop nest to eliminate coding that might either inhibit the optimizer from generating optimal code or otherwise degrade performance. Many of the nonstandard coding practices that make portability difficult might also inhibit optimization by the compiler.
Reprogramming techniques that improve performance are dealt with in more detail in some of the reference books listed at the end of the chapter. Three major approaches are worth mentioning here:
Removing I/O From Key Loops
I/O within a loop or loop nest enclosing the significant computational work of a program will seriously degrade performance. The amount of CPU time spent in the I/O library might be a major portion of the time spent in the loop. (I/O also causes process interrupts, thereby degrading program throughput.) By moving I/O out of the computation loop wherever possible, the number of calls to the I/O library can be greatly reduced.
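A sketch of the idea (array names are illustrative; unit 12 is assumed to be open):

      INTEGER I, N
      REAL A(10000), B(10000)
C     I/O inside the hot loop: one call to the I/O library (and a
C     possible process interrupt) per iteration.
      DO I = 1, N
         A(I) = SQRT(B(I))
         WRITE(12) A(I)
      END DO
C     Moved out: the loop is pure computation, and a single
C     implied-DO write replaces N separate calls.
      DO I = 1, N
         A(I) = SQRT(B(I))
      END DO
      WRITE(12) (A(I), I = 1, N)

Note that the second form writes one record rather than N, so any matching READ must change accordingly.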
Eliminating Subprogram Calls
Subroutines called deep within a loop nest could be called thousands of times. Even if the time spent in each routine per call is small, the total effect might be substantial. Also, subprogram calls inhibit optimization of the loop that contains them because the compiler cannot make assumptions about the state of registers over the call.
Automatic inlining of subprogram calls (using -inline=x,y,..z, or -O4) is one way to let the compiler replace the actual call with the subprogram itself (pulling the subprogram into the loop). The subprogram source code for the routines that are to be inlined must be found in the same file as the calling routine.
There are other ways to eliminate subprogram calls:
- Use statement functions. If the external function being called is a simple math function, it might be possible to rewrite the function as a statement function or set of statement functions. Statement functions are compiled in-line and can be optimized. (Both techniques are sketched after this list.)
- Push the loop into the subprogram. That is, rewrite the subprogram so that it can be called fewer times (outside the loop) and operate on a vector or array of values per call.
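A sketch of both techniques (routine and variable names are illustrative):

      INTEGER I, N
      REAL X(1000), Y(1000), POLY, T
C     1. A statement function is compiled in-line, so this loop
C        contains no external call:
      POLY(T) = T*(2.0*T + 3.0)
      DO I = 1, N
         Y(I) = POLY(X(I))
      END DO

C     2. Pushing the loop into the subprogram: instead of calling
C        a routine once per element,
C           DO I = 1, N
C              CALL SCL(X(I), Y(I))
C           END DO
C        rewrite it to operate on the whole vector in one call:
      CALL SCLV(X, Y, N)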
Rationalizing Tangled Code
Complicated conditional operations within a computationally intensive loop can dramatically inhibit the compiler's attempt at optimization. In general, a good rule to follow is to eliminate all arithmetic and logical IF's, replacing them with block IF's:
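For example, a three-way arithmetic IF (a hypothetical fragment):

      IF (X - Y) 10, 20, 30

can be recast as a block IF, with the code at each labeled target becoming one branch:

      IF (X .LT. Y) THEN
C        code formerly at label 10
      ELSE IF (X .EQ. Y) THEN
C        code formerly at label 20
      ELSE
C        code formerly at label 30
      END IF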
Using block IF not only improves the opportunities for the compiler to generate optimal code, it also improves readability and assures portability.
Further Reading
The following reference books provide more details:
- Numerical Computation Guide, Sun Microsystems, Inc.
- Analyzing Program Performance with Sun WorkShop, Sun Microsystems, Inc.
- FORTRAN Optimization, by Michael Metcalf, Academic Press, 1985
- High Performance Computing, by Kevin Dowd, O'Reilly & Associates, 1993