C H A P T E R  9

Performance and Optimization

This chapter considers some optimization techniques that can improve the performance of numerically intensive Fortran programs. Proper use of algorithms, compiler options, library routines, and coding practices can bring significant performance gains. This discussion does not cover cache, I/O, or system environment tuning. Parallelization issues are treated in the next chapter.

Some of the issues considered here are the choice of compiler options, optimization with runtime profile feedback, the use of optimized libraries, the elimination of coding practices that inhibit optimization, and the compiler commentary available through the performance analysis tools.

The subject of optimization and performance tuning is much too complex to be treated exhaustively here. However, this discussion should provide the reader with a useful introduction to these issues. A list of books that cover the subject much more deeply appears at the end of the chapter.

Optimization and performance tuning is an art that depends heavily on being able to determine what to optimize or tune.


9.1 Choice of Compiler Options

Choice of the proper compiler options is the first step in improving performance. Sun compilers offer a wide range of options that affect the object code. In the default case, where no options are explicitly stated on the compile command line, most options are off. To improve performance, these options must be explicitly selected.

Performance options are normally off by default because most optimizations force the compiler to make assumptions about a user's source code. Programs that conform to standard coding practices and do not introduce hidden side effects should optimize correctly. However, programs that take liberties with standard practices might run afoul of some of the compiler's assumptions. The resulting code might run faster, but the computational results might not be correct.

Recommended practice is to first compile with all options off, verify that the computational results are correct and accurate, and use these initial results and performance profile as a baseline. Then, proceed in steps: recompile with additional options and compare execution results and performance against the baseline. If numerical results change, the program might contain questionable code that needs careful analysis to locate and reprogram.
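A stepwise session might look like the following (program and file names here are illustrative):

demo% f95 -o prog0 prog.f
demo% prog0 > results0.out
demo% f95 -O3 -o prog3 prog.f
demo% prog3 > results3.out
demo% diff results0.out results3.out

If the outputs differ by more than acceptable rounding, investigate the coding before adding further options.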

If performance does not improve significantly, or degrades, as a result of adding optimization options, the coding might not provide the compiler with opportunities for further performance improvements. The next step would then be to analyze and restructure the program at the source code level to achieve better performance.

9.1.1 Performance Options

The compiler options listed in the following table provide the user with a repertoire of strategies to improve the performance of a program over default compilation. Only some of the compilers' more potent performance options appear in the table. A more complete list can be found in the Fortran User's Guide.


TABLE 9-1 Some Effective Performance Options

Action                                                 Option
Uses a combination of optimization options together    -fast
Sets the compiler optimization level to n              -On (-O = -O3)
Specifies general target hardware                      -xtarget=sys
Specifies a particular instruction set architecture    -xarch=isa
Optimizes using performance profile data (with -O5)    -xprofile=use
Unrolls loops by n                                     -unroll=n
Permits simplification of floating-point arithmetic    -fsimple=1|2
Performs dependency analysis to optimize loops         -depend
Performs interprocedural optimizations                 -xipo


Some of these options increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.

9.1.1.1 -fast

This single option selects a number of performance options.



Note - This option is defined as a particular selection of other options that is subject to change from one release to another, and between compilers. Also, some of the options selected by -fast might not be available on all platforms. Compile with the -dryrun flag to see the expansion of -fast.



-fast provides high performance for certain benchmark applications. However, the particular choice of options may or may not be appropriate for your application. Use -fast as a good starting point for compiling your application for best performance, but additional tuning may still be required. If your program behaves improperly when compiled with -fast, look closely at the individual options that make up -fast and invoke only those appropriate to your program that preserve correct behavior.

Note also that a program compiled with -fast may show good performance and accurate results with some data sets, but not with others. Avoid compiling with -fast those programs that depend on particular properties of floating-point arithmetic.

Because some of the options selected by -fast have linking implications, if you compile and link in separate steps be sure to link with -fast also.

-fast selects a number of performance options; among them are -dalign, -fns, and -fsimple=2, discussed below.

-fast provides a quick way to engage much of the optimizing power of the compilers. Each of the composite options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User's Guide). Note also that the exact expansion of -fast may change with each compiler release. Compiling with -dryrun will show the expansion of all command-line flags.

Following -fast with additional options adds further optimizations. For example:

f95 -fast -xarch=v9a ...

compiles for a 64-bit enabled, UltraSPARC Solaris platform.

Because -fast invokes -dalign, -fns, and -fsimple=2, programs compiled with -fast can use nonstandard floating-point arithmetic, nonstandard alignment of data, and nonstandard ordering of expression evaluation. These selections might not be appropriate for all programs.

9.1.1.2 -On

The compiler performs no optimizations unless a -O option is specified explicitly (or implicitly with macro options like -fast). In nearly all cases, specifying an optimization level at compilation improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.

For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds automatic inlining of calls to routines contained in the same source file as the caller routine, among other things. (See the Fortran User's Guide for further information about subprogram call inlining.)

Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be specified only for those routines that make up the most compute-intensive parts of the program and thereby have a high certainty of improving performance. (There is no problem linking together parts of a program compiled with different optimization levels.)
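For example, assuming the compute-intensive routines have been isolated in one file (file names here are illustrative):

demo% f95 -O5 -c hotspots.f
demo% f95 -O3 -c rest.f
demo% f95 -o prog hotspots.o rest.o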

9.1.1.3 PRAGMA OPT=n

Use the C$ PRAGMA SUN OPT=n directive to set different optimization levels for individual routines in a source file. This directive will override the -On flag on the compiler command line, but must be used with the -xmaxopt=n flag to set a maximum optimization level. See the f95(1) man page for details.
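A minimal sketch (routine and file names hypothetical; the directive is shown just before the routine it affects, but see the f95 man page for the exact placement rules): with the following directive, DELICATE is compiled at level 2 even though the rest of the file is compiled at -O4.

C$PRAGMA SUN OPT=2
      SUBROUTINE DELICATE(A, N)
      INTEGER N, I
      REAL A(N)
      DO I = 1, N
         A(I) = A(I) * 2.0
      END DO
      END

demo% f95 -O4 -xmaxopt=5 -c delicate.f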

9.1.1.4 Optimization With Runtime Profile Feedback

The compiler applies its optimization strategies at level -O3 and above much more effectively when combined with -xprofile=use. With this option, the optimizer is guided by a runtime execution profile produced by the program (compiled with -xprofile=collect) running with typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This can be particularly important with -O5. Here is a typical example of profile collection with higher optimization levels:


demo% f95 -o prg -fast -xprofile=collect prg.f ...
demo% prg    
demo% f95 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
demo% prgx  

The first compilation in the example generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.

(See the Fortran User's Guide for details on -xprofile options.)

9.1.1.5 -dalign

With -dalign the compiler is able to generate double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.

However, users should be aware that using -dalign (and therefore -fast) may cause problems with some programs that have been coded to expect a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double (and quad) precision data (either REAL or COMPLEX) are aligned on double-word boundaries, with the result that the size and layout of COMMON blocks may change.

For example, a program that writes data by aliasing an entire COMMON block of mixed data types as a single array might not work properly with -dalign because the block will be larger (due to padding of double and quad precision variables) than the program expects.
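For instance, in this sketch (names illustrative), -dalign may place a 4-byte pad after K so that D starts on an 8-byte boundary, growing the block from 12 bytes to 16:

C     Without -dalign: K occupies bytes 0-3 and D bytes 4-11.
C     With -dalign: a pad may follow K so that D starts at byte 8.
      INTEGER K
      DOUBLE PRECISION D
      COMMON /MIXED/ K, D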

9.1.1.6 -depend

Adding -depend to optimization levels -O3 and higher extends the compiler's ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration data dependences to determine whether or not certain transformations of the loop structure can be performed. Only loops without data dependences can be restructured. However, the added analysis might increase compilation time.
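For example (array names illustrative), the first loop below carries an inter-iteration dependence and cannot be restructured, while the second can:

C     Inter-iteration dependence: iteration I reads A(I-1), which
C     iteration I-1 wrote, so the loop cannot be safely restructured.
      DO I = 2, N
         A(I) = A(I-1) + B(I)
      END DO

C     No dependence: the iterations are independent, so with -depend
C     the optimizer is free to restructure this loop.
      DO I = 1, N
         C(I) = A(I) + B(I)
      END DO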

9.1.1.7 -fsimple=2

Unless directed otherwise, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). -fsimple=2 enables the optimizer to make aggressive simplifications, with the understanding that this might cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be similarly compiled to ensure consistent numerical accuracy. See the Fortran User's Guide for important information about this option.
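For example, floating-point addition is not associative: the two sums below are mathematically equal but differ in single precision, which is the kind of result change that reassociation permitted under -fsimple=2 can introduce (a sketch; which simplifications the optimizer actually performs is release-dependent):

      PROGRAM REASSOC
      REAL A, B, C
      A = 1.0E20
      B = -1.0E20
      C = 1.0
C     (A + B) + C evaluates to 0.0 + 1.0 = 1.0
      PRINT *, (A + B) + C
C     A + (B + C): C is absorbed by B's magnitude, giving 0.0
      PRINT *, A + (B + C)
      END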

9.1.1.8 -unroll=n

Unrolling short loops with long iteration counts can be profitable for some routines. However, unrolling can also increase program size and might even degrade performance of other loops. With n=1, the default, no loops are unrolled automatically by the optimizer. With n greater than 1, the optimizer attempts to unroll loops up to a depth of n.

The compiler's code generator makes its decision to unroll loops depending on a number of factors. The compiler might decline to unroll a loop even though this option is specified with n>1.

If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled. A runtime test on iteration count determines if it is appropriate to execute the unrolled loop. Loop unrolling, especially with simple one or two statement loops, increases the amount of computation done per iteration and provides the optimizer with better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation might be needed.

The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):


Original Loop:

    DO I=1,20000
       X(I) = X(I) + Y(I)*A(I)
    END DO
 

Unrolled by 4 compiles as if it were written:

    DO I=1,19997,4
       TEMP1 = X(I) + Y(I)*A(I)
       TEMP2 = X(I+1) + Y(I+1)*A(I+1)
       TEMP3 = X(I+2) + Y(I+2)*A(I+2)
       X(I+3) = X(I+3) + Y(I+3)*A(I+3)
       X(I) = TEMP1
       X(I+1) = TEMP2
       X(I+2) = TEMP3
    END DO

This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.

9.1.1.9 -xtarget=platform

The performance of some programs might improve if the compiler has an accurate description of the target computer hardware. When program performance is critical, the proper specification of the target hardware could be very important. This is especially true when running on the newer SPARC processors. However, for most programs and older SPARC processors, the performance gain could be negligible and a generic specification might be sufficient.

The Fortran User's Guide lists all the system names recognized by -xtarget=. For any given system name (for example, ultra2, for UltraSPARC-II), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.

The special setting -xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture. Therefore, -xtarget=generic is the default, even though it might produce suboptimal performance.

UltraSPARC-III and UltraSPARC-IV Support

Both the -xtarget and -xchip flags accept ultra3 and ultra4 variants and will generate optimized code for UltraSPARC-III and UltraSPARC-IV processors. When compiling and running an application on the latest UltraSPARC platforms, specify the -fast flag to automatically select the proper compiler optimization options for that platform.

For cross-compilations (compiling on a platform other than the latest UltraSPARC platforms but generating binaries intended to run on an UltraSPARC-III processor), use these flags:

-fast -xtarget=ultra3 -xarch=v8plusb (or -xarch=v9b)

Use -xarch=v9b to compile for 64-bit code generation.

See the Fortran User's Guide for a list of -xtarget flags for the latest UltraSPARC processors.

Note that programs compiled specifically for the UltraSPARC-III and UltraSPARC-IV platforms with -xarch=v8plusb or v9b will not run on earlier UltraSPARC platforms. Use -xarch=v8plusa (or v9a for 64-bit code generation) to compile programs that run compatibly on UltraSPARC-I, UltraSPARC-II, and UltraSPARC-III platforms.

Performance profiling, with -xprofile=collect and -xprofile=use, is particularly effective on the UltraSPARC-III and UltraSPARC-IV platforms because it allows the compiler to identify the most frequently executed sections of the program and perform localized optimizations to best advantage.

64-Bit x86 Platform Support

The Sun Studio Fortran compiler supports the compilation of 32-bit and 64-bit code for Solaris x86 platforms.

The -xtarget=pentium3 flag expands to:
-xarch=sse -xchip=pentium3 -xcache=16/32/4:256/32/4.

For Pentium 4 systems, -xtarget=pentium4 expands to:
-xarch=sse2 -xchip=pentium4 -xcache=8/64/4:256/128/8.

A new -xarch option, -xarch=amd64, specifies compilation for the 64-bit AMD instruction set.

A new -xtarget option, -xtarget=opteron, specifies the -xarch, -xchip, and -xcache settings for 32-bit AMD compilation.

You must specify -xarch=amd64 after -fast and -xtarget on the command line to generate 64-bit code. The new -xtarget=opteron option does not automatically generate 64-bit code. It expands to -xarch=sse2, -xchip=opteron, and -xcache=64/64/2:1024/64/16, which results in 32-bit code. The -fast option also results in 32-bit code because it is a macro that defines an -xtarget value. All the current -xtarget values (except -xtarget=native64 and -xtarget=generic64) result in 32-bit code, so it is necessary to specify -xarch=amd64 after (to the right of) -fast or -xtarget to compile 64-bit code, as in:

% f95 -fast -xarch=amd64

or

% f95 -xtarget=opteron -xarch=amd64

Also, the existing -xarch=generic64 option now supports the x86 platform in addition to SPARC platforms.

The compilers now predefine __amd64 and __x86_64 when you specify -xarch=amd64.

Additional information about compilation and performance on 32-bit and 64-bit x86 platforms can be found in the Fortran User's Guide.

9.1.1.10 Interprocedural Optimization With -xipo

This new f95 compiler flag, introduced with the release of Forte Developer 6 update 2, performs whole-program optimizations by invoking an interprocedural analysis pass. Unlike -xcrossfile, -xipo optimizes across all object files at the link step and is not limited to just the source files on the compile command.

-xipo is particularly useful when compiling and linking large multi-file applications. Object files compiled with -xipo have analysis information saved within them. This enables interprocedural analysis across source and pre-compiled program files.
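A typical use (file names illustrative) specifies -xipo at both the compile and link steps, because the interprocedural optimization happens at link time:

demo% f95 -xipo -O4 -c main.f subs.f
demo% f95 -xipo -O4 -o prog main.o subs.o

Omitting -xipo from the link step forfeits the cross-file analysis.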

For details on how to use interprocedural analysis effectively, see the Fortran User's Guide.

9.1.1.11 Add PRAGMA ASSUME Assertions

By adding ASSUME directives at strategic points in the source code, you can help guide the compiler's optimization strategy by revealing important information about the program that is not determinable any other way. For example, you can let the compiler know that the trip count of a DO loop is always greater than some value, or that there is a high probability that an IF branch will not be taken. The compiler can use this information to generate better code, based on these assertions.
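As a rough sketch only (the directive form shown here is an assumption; the Fortran User's Guide defines the exact syntax and the assertion expressions that are supported), a trip-count assertion might look like:

C$PRAGMA ASSUME (N .GT. 1000)
C     (Assumed directive form; see the User's Guide for exact syntax.)
      DO I = 1, N
         X(I) = X(I) + Y(I)
      END DO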

As an added bonus, the programmer can use the ASSUME pragma to validate the execution of the program by enabling warning messages to be issued whenever an assertion turns out to be false at run time.

For details, see the description of the ASSUME pragma in Chapter 2 of the Fortran User's Guide, and the -xassume_control compiler command-line option in Chapter 3 of that manual.

9.1.2 Other Performance Strategies

Assuming that you have experimented with using a variety of optimization options, compiling your program and measuring actual runtime performance, the next step might be to look closely at the Fortran source program to see what further tuning can be tried.

Focusing on just those parts of the program that use most of the compute time, you might consider strategies such as replacing hand-coded procedures with routines from optimized libraries and eliminating coding practices that inhibit optimization; both are discussed in the sections that follow.

Such good programming practices tend to lead to better performance. It is possible to go further, hand-tuning the source code for a specific hardware configuration. However, these attempts might only further obscure the code and make it even more difficult for the compiler's optimizer to achieve significant performance improvements. Excessive hand-tuning can hide the original intent of the procedure and can have a significantly detrimental effect on performance across different architectures.

9.1.3 Using Optimized Libraries

In most situations, optimized commercial or shareware libraries perform standard computational procedures far more efficiently than you could by coding them by hand.

For example, the Sun Performance Library is a suite of highly optimized mathematical subroutines based on the standard LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK libraries. Performance improvement using these routines can be significant when compared with hand coding. See the Sun Performance Library User's Guide for details.
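For instance, a matrix multiply is usually better left to the library's DGEMM routine than to a hand-coded triple loop. A minimal sketch, linked with -xlic_lib=sunperf (the flag documented for linking the Sun Performance Library):

      PROGRAM GEMMDEMO
      INTEGER N
      PARAMETER (N = 100)
      DOUBLE PRECISION A(N,N), B(N,N), C(N,N)
      A = 1.0D0
      B = 2.0D0
      C = 0.0D0
C     C := 1.0*A*B + 0.0*C, computed by the optimized BLAS routine
      CALL DGEMM('N', 'N', N, N, N, 1.0D0, A, N, B, N, 0.0D0, C, N)
      PRINT *, 'C(1,1) =', C(1,1)
      END

demo% f95 -fast -xlic_lib=sunperf gemmdemo.f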

9.1.4 Eliminating Performance Inhibitors

Use the Sun Studio Performance Analyzer to identify the key computational parts of the program. Then, carefully analyze the loop or loop nest to eliminate coding that might either inhibit the optimizer from generating optimal code or otherwise degrade performance. Many of the nonstandard coding practices that make portability difficult might also inhibit optimization by the compiler.

Reprogramming techniques that improve performance are dealt with in more detail in some of the reference books listed at the end of the chapter. Three major approaches are worth mentioning here: removing I/O from key loops, eliminating subprogram calls, and rationalizing tangled code.

9.1.4.1 Removing I/O From Key Loops

I/O within a loop or loop nest enclosing the significant computational work of a program will seriously degrade performance. The amount of CPU time spent in the I/O library might be a major portion of the time spent in the loop. (I/O also causes process interrupts, thereby degrading program throughput.) By moving I/O out of the computation loop wherever possible, the number of calls to the I/O library can be greatly reduced.
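A sketch of the transformation (names illustrative); note that the single implied-DO WRITE produces one record rather than N, so the required output format must allow it:

C     I/O inside the hot loop: one call into the I/O library
C     per iteration.
      DO I = 1, N
         S(I) = A(I) * B(I)
         WRITE (10, *) S(I)
      END DO

C     Restructured: compute first, then write once.
      DO I = 1, N
         S(I) = A(I) * B(I)
      END DO
      WRITE (10, *) (S(I), I = 1, N)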

9.1.4.2 Eliminating Subprogram Calls

Subroutines called deep within a loop nest could be called thousands of times. Even if the time spent in each routine per call is small, the total effect might be substantial. Also, subprogram calls inhibit optimization of the loop that contains them because the compiler cannot make assumptions about the state of registers over the call.

Automatic inlining of subprogram calls (using -inline=x,y,..z, or -O4) is one way to let the compiler replace the actual call with the subprogram itself (pulling the subprogram into the loop). The subprogram source code for the routines that are to be inlined must be found in the same file as the calling routine.

There are other ways to eliminate subprogram calls. A common one is to push the loop into the subprogram, passing the whole array in a single call rather than calling the routine once per element, as the sketch below illustrates.
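For example (routine names hypothetical):

C     Before: the routine is called N times, once per element,
C     and the call inhibits optimization of the loop.
      DO I = 1, N
         CALL SCALE1(X(I), F)
      END DO

C     After: a single call; the loop runs inside the routine.
      CALL SCALEN(X, N, F)

      SUBROUTINE SCALEN(X, N, F)
      INTEGER N, I
      REAL X(N), F
      DO I = 1, N
         X(I) = X(I) * F
      END DO
      END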

9.1.4.3 Rationalizing Tangled Code

Complicated conditional operations within a computationally intensive loop can dramatically inhibit the compiler's attempts at optimization. In general, a good rule to follow is to eliminate arithmetic and logical IF statements, replacing them with block IF constructs:


Original Code:
    IF(A(I)-DELTA) 10,10,11
10  XA(I) = XB(I)*B(I,I)
    XY(I) = XA(I) - A(I)
    GOTO 13
11  XA(I) = Z(I)
    XY(I) = Z(I)
    IF(QZDATA.LT.0.) GOTO 12
    ICNT = ICNT + 1
    ROX(ICNT) = XA(I)-DELTA/2.
12  SUM = SUM + X(I)
13  SUM = SUM + XA(I)
 
Untangled Code:
    IF(A(I).LE.DELTA) THEN
      XA(I) = XB(I)*B(I,I)
      XY(I) = XA(I) - A(I)
    ELSE
      XA(I) = Z(I)
      XY(I) = Z(I)
      IF(QZDATA.GE.0.) THEN
        ICNT = ICNT + 1
        ROX(ICNT) = XA(I)-DELTA/2.
      ENDIF
      SUM = SUM + X(I)
    ENDIF
    SUM = SUM + XA(I)

Using block IF not only improves the opportunities for the compiler to generate optimal code, it also improves readability and helps ensure portability.

9.1.5 Viewing Compiler Commentary

If you compile with the -g debugging option, you can view source code annotations generated by the compiler by using the er_src(1) utility, part of the Sun Studio Performance Analysis Tools. This utility can also be used to view the source code annotated with the generated assembly language. Here is an example of the commentary produced by er_src on a simple DO loop:


demo% f95 -c -g -O4 do.f
demo% er_src do.o
Source file: /home/user21/do.f
Object file: do.o
Load Object: do.o
 
     1.         program do
     2.         common aa(100),bb(100)
 
   Function x inlined from source file do.f into the code for the following line
   Loop below pipelined with steady-state cycle count = 3 before unrolling
   Loop below unrolled 5 times
   Loop below has 2 loads, 1 stores, 0 prefetches, 1 FPadds, 1 FPmuls, and 0 FPdivs per iteration
     3.         call x(aa,bb,100)
     4.         end
     5.                 subroutine x(a,b,n)
     6.                 real a(n), b(n)
     7.                 v = 5.
     8.                 w = 10.
 
   Loop below pipelined with steady-state cycle count = 3 before unrolling
   Loop below unrolled 5 times
   Loop below has 2 loads, 1 stores, 0 prefetches, 1 FPadds, 1 FPmuls, and 0 FPdivs per iteration
     9.                 do 1 i=1,n
    10. 1                       a(i) = a(i)+v*b(i)
    11.                 return
    12.                 end

Commentary messages detail the optimization actions taken by the compiler. In the example we can see that the compiler has inlined the call to the subroutine and unrolled the loop 5 times. Reviewing this information might provide clues as to further optimization strategies you can use.

For detailed information about compiler commentary and disassembled code, see the Sun Studio Performance Analyzer manual.


9.2 Further Reading

The following reference books provide more details: