Sun Studio 12 Update 1: Performance Analyzer

Compiler Commentary

Compiler commentary indicates how compiler-optimized code has been generated. Compiler commentary lines are displayed in blue, to distinguish them from index lines and original source lines. Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The compiler commentary describes many of the transformations which have been made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining and pipelining. The following shows an example of compiler commentary.


0.   	0.   	0.   	0.   	 28.       SUBROUTINE dgemv_g2 (transa, m, n, alpha, b, ldb,   &
     	     	     	     	 29.      &                   c, incc, beta, a, inca)
     	     	     	     	 30.       CHARACTER (KIND=1) :: transa
                                	 31.       INTEGER   (KIND=4) :: m, n, incc, inca, ldb
                                	 32.       REAL      (KIND=8) :: alpha, beta
                                	 33.       REAL      (KIND=8) :: a(1:m), b(1:ldb,1:n), c(1:n)
                                	 34.       INTEGER            :: i, j
                                	 35.       REAL      (KIND=8) :: tmr, wtime, tmrend
                                	 36.       COMMON/timer/ tmr
                                	 37. 
                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
0.   	0.   	0.   	0.   	 38.       tmrend = tmr + wtime()


                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
                             Discovered loop below has tag L16
0.   	0.   	0.   	0.   	 39.       DO WHILE(wtime() < tmrend)
                          	
                         	Array statement below generated loop L4
0.   	0.   	0.   	0.   	 40.       a(1:m) = 0.0
                                	 41. 
     	     	     	     	
                         	Source loop below has tag L6
0.   	0.   	0.   	0.   	 42.       DO j = 1, n       ! <=-----\ swapped loop indices
   	
                         	Source loop below has tag L5
                             L5 cloned for unrolling-epilog.  Clone is L19
                             All 8 copies of L19 are fused together as part of unroll and jam
                             L19 scheduled with steady-state cycle count = 9
                             L19 unrolled 4 times
                             L19 has 9 loads, 1 stores, 8 prefetches, 8 FPadds, 
                             8 FPmuls, and 0 FPdivs per iteration
                             L19 has 0 int-loads, 0 int-stores, 11 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
                             L5 scheduled with steady-state cycle count = 2
                             L5 unrolled 4 times
                             L5 has 2 loads, 1 stores, 1 prefetches, 1 FPadds, 1 FPmuls,
                             and 0 FPdivs per iteration
                             L5 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
0.210	0.210	0.210	0.   	 43.          DO i = 1, m    !    <=--/
4.003	4.003	4.003	0.050	 44.             a(i) = a(i) + b(i,j) * c(j)
0.240	0.240	0.240	0.   	 45.          END DO  
0.   	0.   	0.   	0.   	 46.       END DO  
                              	   47.       END DO
                         	        48. 
0.   	0.   	0.   	0.   	 49.       RETURN
0.   	0.   	0.   	0.   	 50.       END

You can set the types of compiler commentary displayed in the Source tab using the Source/Disassembly tab in the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.

Loop Optimizations

The compiler can do several types of loop optimization. Some of the more common ones are as follows:

Loop unrolling consists of repeating several iterations of a loop within the loop body, and adjusting the loop index accordingly. As the body of the loop becomes larger, the compiler can schedule the instructions more efficiently. Also reduced is the overhead caused by the loop index increment and conditional check operations. The remainder of the loop is handled using loop peeling.

Loop peeling consists of removing a number of loop iterations from the loop, and moving them in front of or after the loop, as appropriate.

Loop interchange changes the ordering of nested loops to minimize memory stride, to maximize cache-line hit rates.

Loop fusion consists of combining adjacent or closely located loops into a single loop. The benefits of loop fusion are similar to loop unrolling. In addition, if common data is accessed in the two pre-optimized loops, cache locality is improved by loop fusion, providing the compiler with more opportunities to exploit instruction-level parallelism.

Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements. Sometimes it is possible to split the loops into two: one with the conditional statement and one without. This can increase opportunities for software pipelining in the loop without the conditional statement.

Sometimes, with nested loops, the compiler applies loop fission to split a loop apart, and then performs loop fusion to recombine the loop in a different way to increase performance. In this case, you see compiler commentary similar to the following:


    Loop below fissioned into 2 loops
    Loop below fused with loop on line 116
    [116]    for (i=0;i<nvtxs;i++) {

Inlining of Functions

With an inline function, the compiler inserts the function instructions directly at the locations where it is called instead of making actual function calls. Thus, similar to a C/C++ macro, the instructions of an inline function are replicated at each call location. The compiler performs explicit or automatic inlining at high optimization levels (4 and 5). Inlining saves the cost of a function call and provides more instructions for which register usage and instruction scheduling can be optimized, at the cost of a larger code footprint in memory. The following is an example of inlining compiler commentary.


                Function initgraph inlined from source file ptralias.c 
                    into the code for the following line
0.       0.         44.       initgraph(rows);

Note –

The compiler commentary does not wrap onto two lines in the Source tab of the Analyzer.


Parallelization

If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution on multiple processors. The compiler commentary indicates where parallelization has and has not been performed, and why. The following shows an example of parallelization computer commentary.


0.       6.324       9. c$omp  parallel do shared(a,b,c,n) private(i,j,k)
                   Loop below parallelized by explicit user directive
                   Loop below interchanged with loop on line 12
0.010    0.010     [10]            do i = 2, n-1

                   Loop below not parallelized because it was nested in a parallel loop
                   Loop below interchanged with loop on line 12
0.170    0.170      11.               do j = 2, i

For more details about parallel execution and compiler-generated body functions, refer to Overview of OpenMP Software Execution.