Analyzing Program Performance With Sun WorkShop

How Optimization Affects Loops

As you might infer from the descriptions of the compiler hints, associating optimized code with source code can be tricky. Clearly, you would prefer to see information from the compiler presented to you in a way that relates as directly as possible to your source code. Unfortunately, the compiler optimizer "reads" your program in terms of its internal language, and although it tries to relate that to your source code, it is not always successful.

Some particular optimizations that can cause confusion are described in the following sections.

Inlining

Inlining is an optimization applied only at optimization level -O4 and only for functions contained within one file. For example, if a file contains 17 Fortran functions, 16 of which can be inlined into the first, and you compile at -O4, then the code for those 16 functions may be copied into the body of the first function. When further optimizations are then applied, it becomes difficult to determine which loop on which source line was subjected to which optimization.

If the compiler hints seem particularly opaque, consider compiling with -O3 -parallel -Zlp, so that you can see what the compiler says about your loops before it tries to inline any of your functions.

In particular, "phantom" loops--that is, loops that the compiler claims exist, but you know do not exist in your source code--could well be a symptom of inlining.
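As a rough illustration of how a phantom loop can arise, the following C sketch shows a function that contains no loop in its source, next to a hand-written approximation of what it might look like after its helper is inlined. The names sum, mean, and mean_inlined are hypothetical, C is used only for illustration, and the code the compiler actually generates will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the only loop written in this file lives here. */
static double sum(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* As written: the source of mean() contains no loop at all. */
double mean(const double *x, size_t n)
{
    assert(n > 0);
    return sum(x, n) / (double)n;
}

/* Hand-written approximation of mean() after sum() has been inlined.
 * The loop now sits in the caller's body, so a report on this function
 * could mention a loop that its source lines never show. */
double mean_inlined(const double *x, size_t n)
{
    assert(n > 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)   /* the "phantom" loop, copied from sum() */
        s += x[i];
    return s / (double)n;
}
```

Both versions compute the same value; only the place where the loop appears has changed, which is what makes a hint attached to it hard to trace back to the source.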

Loop Transformations--Unrolling, Jamming, Splitting, and Transposing

The compiler performs many loop optimizations that radically change the body of the loop. These optimizations include unrolling, jamming, splitting, and transposing.

LoopTool and LoopReport attempt to provide hints that make as much sense as possible, but because associating optimized code with source code is inherently difficult, the hints may be misleading.
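The following C sketch shows hand-written versions of these four transformations, to suggest why a transformed loop may no longer resemble the source. It assumes that jamming means merging two adjacent loops into one, splitting means the reverse, and transposing means interchanging the loops of a nest. All function names are hypothetical, and the compiler's real output will differ.

```c
#include <assert.h>
#include <stddef.h>

#define N 3  /* hypothetical matrix dimension for this sketch */

/* Two adjacent loops over the same range (the "split" form). */
void split_form(double *a, double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= 2.0;
    for (size_t i = 0; i < n; i++)
        b[i] += 1.0;
}

/* Jamming: both bodies share one loop. Splitting is the reverse step. */
void jammed_form(double *a, double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] *= 2.0;
        b[i] += 1.0;
    }
}

/* Reference loop for the unrolling example. */
double sum_plain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolling by two: the body is repeated, and a cleanup loop
 * handles the leftover iteration when n is odd. */
double sum_unrolled(const double *x, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s += x[i];
        s += x[i + 1];
    }
    for (; i < n; i++)
        s += x[i];
    return s;
}

/* Original nest: i is the outer loop, j the inner loop. */
void fill_ij(double m[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = (double)(i * N + j);
}

/* Transposed nest: the two loops have traded places. */
void fill_ji(double m[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            m[i][j] = (double)(i * N + j);
}

/* Usage: each transformed form gives the same results as the original. */
void check_equivalence(void)
{
    double a1[5] = {1, 2, 3, 4, 5}, b1[5] = {0, 0, 0, 0, 0};
    double a2[5] = {1, 2, 3, 4, 5}, b2[5] = {0, 0, 0, 0, 0};
    double p[N][N], q[N][N];

    split_form(a1, b1, 5);
    jammed_form(a2, b2, 5);
    for (size_t i = 0; i < 5; i++)
        assert(a1[i] == a2[i] && b1[i] == b2[i]);

    assert(sum_plain(a1, 5) == sum_unrolled(a1, 5));

    fill_ij(p);
    fill_ji(q);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(p[i][j] == q[i][j]);
}
```

In each pair the results are identical, but the number of loops, their bodies, or their nesting order differs from the original, so a hint about the transformed loop may not map cleanly onto one source line.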

Parallel Loops Nested Inside Serial Loops

If a parallel loop is nested inside a serial loop, the runtime information reported by LoopTool and LoopReport may be misleading, because each loop is charged the wall-clock time of each of its iterations. A parallelized inner loop is therefore assigned the wall-clock time of every one of its iterations, even though some of those iterations run at the same time.

The outer loop, however, is assigned only the runtime of its child, the parallel loop, and that is the runtime of the longest parallel instantiation of the inner loop. This difference in accounting leads to the anomaly of the outer loop apparently consuming less time than the inner loop.
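The arithmetic behind this anomaly can be sketched with a small model. The C code below is not LoopTool's or LoopReport's actual algorithm; it is a simplified assumption in which every inner iteration takes the same time, iterations are divided evenly among threads, and the outer loop does no work of its own. The type attributed_time and the function attribute are hypothetical names.

```c
#include <assert.h>

/* Hypothetical record of the time charged to each loop, in seconds. */
typedef struct {
    double inner;   /* time charged to the parallel inner loop */
    double outer;   /* time charged to the serial outer loop   */
} attributed_time;

/* Simplified model (an assumption, not the tools' real method). */
attributed_time attribute(int outer_iters, int inner_iters,
                          int threads, double secs_per_iter)
{
    assert(outer_iters > 0 && inner_iters > 0 && threads > 0);

    /* Iterations handled by the busiest thread (ceiling division). */
    int per_thread = (inner_iters + threads - 1) / threads;

    attributed_time t;
    /* Inner loop: charged for every iteration, as if run one after another. */
    t.inner = outer_iters * inner_iters * secs_per_iter;
    /* Outer loop: charged only the elapsed time of the longest
     * parallel instance of the inner loop, once per outer iteration. */
    t.outer = outer_iters * per_thread * secs_per_iter;
    return t;
}
```

Under these assumptions, attribute(10, 8, 4, 0.5) charges 40 seconds to the inner loop but only 10 seconds to the outer loop that contains it. With a single thread the two figures are equal, which suggests that the discrepancy comes from the parallel accounting rather than from the program itself.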