Performance Analyzer Source View Layout

The Source view is divided into columns, with fixed-width columns for individual metrics on the left and the annotated source taking up the remaining width on the right.

Identifying the Original Source Lines

All lines displayed in black in the annotated source are taken from the original source file. The number at the start of a line in the annotated source column corresponds to the line number in the original source file. Any lines with characters displayed in a different color are either index lines or compiler commentary lines.

Index Lines in the Source View

A source file is any file compiled to produce an object file or interpreted into bytecode. An object file normally contains one or more regions of executable code corresponding to functions, subroutines, or methods in the source code. Performance Analyzer analyzes the object file, identifies each executable region as a function, and attempts to map the functions it finds in the object code to the functions, routines, subroutines, or methods in the source file associated with the object code. When Performance Analyzer succeeds, it adds an index line in the annotated source file in the location corresponding to the first instruction in the function found in the object code.

The annotated source shows an index line for every function, including inline functions, even though inline functions do not appear in the list shown by the Function view. The Source view displays index lines in red italics with text in angle brackets. The simplest type of index line corresponds to the function’s default context. The default source context for any function is defined as the source file to which the first instruction in that function is attributed. The following example shows an index line for a C function icputime.

                    578. int
                    579. icputime(int k)
0.       0.         580. {
                         <Function: icputime>

As the example shows, the index line appears on the line following the first instruction. For C source, the first instruction corresponds to the opening brace at the start of the function body. In Fortran source, the index line for each subroutine follows the line containing the subroutine keyword. Also, a main function index line follows the first Fortran source instruction executed when the application starts, as shown in the following example:

                            1. ! Copyright (c) 2006, 2010, Oracle and/or its affiliates. All Rights Reserved.
                            2. ! @(#)omptest.f 1.11 10/03/24 SMI
                            3. ! Synthetic f90 program, used for testing openmp directives and the
                            4. !       analyzer
                            5.
0.      0.      0.      0.  6.        program omptest
                               <Function: MAIN>
                            7.
                            8. !$PRAGMA C (gethrtime, gethrvtime)

Sometimes, Performance Analyzer might not be able to map a function it finds in the object code to any programming instructions in the source file associated with that object code. For example, the code might be #included or inlined from another file, such as a header file.

Also displayed in red are special index lines and other special lines that are not compiler commentary. For example, as a result of compiler optimization, a special index line might be created for a function in the object code that does not correspond to code written in any source file. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Compiler Commentary

Compiler commentary indicates how compiler-optimized code has been generated. Compiler commentary lines are displayed in blue to distinguish them from index lines and original source lines. Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The compiler commentary describes many of the transformations that have been made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining, and pipelining. The following example shows compiler commentary.

0.      0.      0.      0.       28.       SUBROUTINE dgemv_g2 (transa, m, n, alpha, b, ldb,   &
                                 29.      &                   c, incc, beta, a, inca)
                                 30.       CHARACTER (KIND=1) :: transa
                                 31.       INTEGER   (KIND=4) :: m, n, incc, inca, ldb
                                 32.       REAL      (KIND=8) :: alpha, beta
                                 33.       REAL      (KIND=8) :: a(1:m), b(1:ldb,1:n), c(1:n)
                                 34.       INTEGER            :: i, j
                                 35.       REAL      (KIND=8) :: tmr, wtime, tmrend
                                 36.       COMMON/timer/ tmr
                                 37.
                                Function wtime_ not inlined because the compiler has not seen
                                the body of the routine
0.      0.      0.      0.       38.       tmrend = tmr + wtime()

                                Function wtime_ not inlined because the compiler has not seen
                                the body of the routine
                                Discovered loop below has tag L16
0.      0.      0.      0.       39.       DO WHILE(wtime() < tmrend)

                                Array statement below generated loop L4
0.      0.      0.      0.       40.       a(1:m) = 0.0
                                 41.

                                Source loop below has tag L6
0.      0.      0.      0.       42.       DO j = 1, n       ! <=-----\ swapped loop indices

                                Source loop below has tag L5
                                L5 cloned for unrolling-epilog.  Clone is L19
                                All 8 copies of L19 are fused together as part of unroll and jam
                                L19 scheduled with steady-state cycle count = 9
                                L19 unrolled 4 times
                                L19 has 9 loads, 1 stores, 8 prefetches, 8 FPadds,
                                8 FPmuls, and 0 FPdivs per iteration
                                L19 has 0 int-loads, 0 int-stores, 11 alu-ops, 0 muls,
                                0 int-divs and 0 shifts per iteration
                                L5 scheduled with steady-state cycle count = 2
                                L5 unrolled 4 times
                                L5 has 2 loads, 1 stores, 1 prefetches, 1 FPadds, 1 FPmuls,
                                and 0 FPdivs per iteration
                                L5 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls,
                                0 int-divs and 0 shifts per iteration
0.210   0.210   0.210   0.       43.          DO i = 1, m
4.003   4.003   4.003   0.050    44.             a(i) = a(i) + b(i,j) * c(j)
0.240   0.240   0.240   0.       45.          END DO
0.      0.      0.      0.       46.       END DO
                                 47.       END DO
                                 48.
0.      0.      0.      0.       49.       RETURN
0.      0.      0.      0.       50.       END

You can set the types of compiler commentary displayed in the Source view using the Source/Disassembly tab in the Settings dialog box; for details, see Configuration Settings.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.
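The effect can be pictured with a small C sketch. The function names are hypothetical and the second function only approximates what an optimizer might generate; it is not taken from any actual compiler output.

```c
#include <assert.h>

/* As written: the quotient total / count appears in both branches. */
int adjust_as_written(int total, int count, int round_up)
{
    int result;
    assert(count != 0);
    if (round_up) {
        result = total / count + 1;   /* first occurrence of the expression */
    } else {
        result = total / count - 1;   /* second occurrence of the expression */
    }
    return result;
}

/* Roughly what the optimizer may generate: one division before the branch. */
int adjust_as_optimized(int total, int count, int round_up)
{
    int quotient;
    assert(count != 0);
    /* The division instructions keep the line number of ONE of the
       occurrences above, so its metrics can show up on a branch that
       the program never takes. */
    quotient = total / count;
    return round_up ? quotient + 1 : quotient - 1;
}
```

If a program only ever calls the function with round_up set to 0, but the compiler tagged the shared division with the line in the first branch, the annotated source would show time on that untaken line.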

Loop Optimizations

The compiler can do several types of loop optimization. Some of the more common ones are as follows:

Loop unrolling
Loop peeling
Loop interchange
Loop fission
Loop fusion

Loop unrolling consists of repeating several iterations of a loop within the loop body, and adjusting the loop index accordingly. As the body of the loop becomes larger, the compiler can schedule the instructions more efficiently. Unrolling also reduces the overhead caused by the loop index increment and conditional check operations. Any iterations left over when the iteration count is not a multiple of the unroll factor are handled using loop peeling.

Loop peeling consists of removing a number of loop iterations from the loop, and moving them in front of or after the loop, as appropriate.
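Unrolling and peeling can be sketched by hand in C. The function names and the unroll factor of 4 are assumptions chosen for illustration; a compiler picks its own factor and performs the transformation on its internal representation, not on the source.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: one element per iteration. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    size_t i;
    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4, with the leftover iterations peeled off after the loop. */
double sum_unrolled(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    assert(a != NULL || n == 0);
    /* Main loop: four elements per trip, one index update and one test. */
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    /* Peeled remainder: at most three iterations. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return identical results. In the annotated source, all copies of the unrolled body are still attributed to the original loop's source lines.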

Loop interchange changes the ordering of nested loops to minimize memory stride, in order to maximize cache-line hit rates.
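A minimal C sketch of the idea follows, assuming a small row-major array; the names and dimensions are hypothetical. C stores the elements of each row contiguously, so the second version walks memory with a stride of one element.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 6

/* Column index in the outer loop: consecutive accesses are COLS elements apart. */
double sum_strided(double (*m)[COLS])
{
    double s = 0.0;
    int i, j;
    assert(m != NULL);
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* After interchange: the inner loop walks along a row, so accesses are contiguous. */
double sum_interchanged(double (*m)[COLS])
{
    double s = 0.0;
    int i, j;
    assert(m != NULL);
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```

Note that Fortran arrays are stored in column-major order, so in Fortran code the favorable loop order is the reverse of the one shown here.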

Loop fusion consists of combining adjacent or closely located loops into a single loop. The benefits of loop fusion are similar to loop unrolling. In addition, if common data is accessed in the two pre-optimized loops, cache locality is improved by loop fusion, providing the compiler with more opportunities to exploit instruction-level parallelism.
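As an illustration, the following C sketch shows two adjacent loops that read the same array and a fused equivalent. The function names are hypothetical, and the sketch assumes the output arrays do not overlap the input array, which is what makes the fusion safe.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two passes over c. */
void scale_and_offset_separate(double *a, double *b, const double *c, size_t n)
{
    size_t i;
    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    for (i = 0; i < n; i++)
        a[i] = c[i] * 2.0;
    for (i = 0; i < n; i++)
        b[i] = c[i] + 1.0;
}

/* After fusion: one pass, so each c[i] is loaded once and reused. */
void scale_and_offset_fused(double *a, double *b, const double *c, size_t n)
{
    size_t i;
    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    for (i = 0; i < n; i++) {
        a[i] = c[i] * 2.0;
        b[i] = c[i] + 1.0;
    }
}
```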

Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements. Sometimes it is possible to split the loop into two loops: one with the conditional statement and one without. This approach can increase opportunities for software pipelining in the loop without the conditional statement.
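The conditional case can be sketched in C as follows. The function names are hypothetical, and the sketch assumes the three arrays do not overlap, so that splitting the loop does not change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Before fission: the conditional sits in the same loop as the arithmetic. */
void update_combined(double *a, double *b, const double *c, size_t n)
{
    size_t i;
    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    for (i = 0; i < n; i++) {
        a[i] = c[i] * 2.0;
        if (c[i] < 0.0)
            b[i] = 0.0;
    }
}

/* After fission: the first loop is branch-free and easier to pipeline. */
void update_fissioned(double *a, double *b, const double *c, size_t n)
{
    size_t i;
    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    for (i = 0; i < n; i++)
        a[i] = c[i] * 2.0;
    for (i = 0; i < n; i++) {
        if (c[i] < 0.0)
            b[i] = 0.0;
    }
}
```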

Sometimes, with nested loops, the compiler applies loop fission to split a loop apart, and then performs loop fusion to recombine the loop in a different way to increase performance. In this case, you see compiler commentary similar to the following example:

    Loop below fissioned into 2 loops
    Loop below fused with loop on line 116
    [116]    for (i=0;i<nvtxs;i++) {

Inlining of Functions

With an inline function, the compiler inserts the function's instructions directly at each location where the function is called instead of making actual function calls. Thus, similar to a C/C++ macro, the instructions of an inline function are replicated at each call location. The compiler performs explicit or automatic inlining at high optimization levels (4 and 5).

Inlining saves the cost of a function call and provides more instructions for which register usage and instruction scheduling can be optimized, at the cost of a larger code footprint in memory. The following example shows inlining compiler commentary.

                Function initgraph inlined from source file ptralias.c 
                    into the code for the following line
0.       0.         44.       initgraph(rows);

Note - The compiler commentary does not wrap onto two lines in the Source view of Performance Analyzer.

Parallelization

Code that contains Sun, Cray, or OpenMP parallelization directives can be compiled for parallel execution on multiple processors. The compiler commentary indicates where parallelization has and has not been performed, and why. The following example shows parallelization compiler commentary.

0.       6.324       9. c$omp  parallel do shared(a,b,c,n) private(i,j,k)
                   Loop below parallelized by explicit user directive
                   Loop below interchanged with loop on line 12
0.010    0.010     [10]            do i = 2, n-1

                   Loop below not parallelized because it was nested in a parallel loop
                   Loop below interchanged with loop on line 12
0.170    0.170      11.               do j = 2, i

For more details about parallel execution and compiler-generated body functions, refer to Overview of OpenMP Software Execution.

Special Lines in the Annotated Source

Several other annotations for special cases can be shown under the Source view, either in the form of compiler commentary or as special lines displayed in the same color as index lines. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Source Line Metrics

Source code metrics are displayed, for each line of executable code, in fixed-width columns. The metrics are the same as in the function list. You can change the defaults for an experiment using a .er.rc file. For details, see Setting Defaults in .er.rc Files. You can also change the metrics displayed and the highlighting thresholds in Performance Analyzer using the Settings dialog box. For details, see Configuration Settings.

Annotated source code shows the metrics of an application at the source-line level. It is produced by taking the PCs (program counters) that are recorded in the application’s call stack, and mapping each PC to a source line. To produce an annotated source file, Performance Analyzer first determines all of the functions that are generated in a particular object module (.o file) or load object, then scans the data for all PCs from each function.
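The mapping step can be pictured with the following C sketch. It is only an illustration of the idea, not Performance Analyzer's implementation: the line_entry structure, the function names, and the assumption of a sorted table of starting PCs are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical line-table entry: instructions from start_pc up to the next
   entry's start_pc are attributed to source line 'line'. */
struct line_entry {
    unsigned long start_pc;
    int line;
};

/* Returns the source line for pc, or 0 if pc precedes the first entry.
   The table must be sorted by ascending start_pc. */
int line_for_pc(const struct line_entry *table, size_t n, unsigned long pc)
{
    int line = 0;
    size_t i;
    assert(table != NULL || n == 0);
    for (i = 0; i < n && table[i].start_pc <= pc; i++)
        line = table[i].line;
    return line;
}

/* Adds each sample's metric value to the total for the line its PC maps to.
   per_line must have at least max_line + 1 elements. */
void attribute_samples(const struct line_entry *table, size_t n,
                       const unsigned long *pcs, const double *values,
                       size_t nsamples, double *per_line, int max_line)
{
    size_t s;
    assert(per_line != NULL && max_line >= 0);
    for (s = 0; s < nsamples; s++) {
        int line = line_for_pc(table, n, pcs[s]);
        if (line > 0 && line <= max_line)
            per_line[line] += values[s];
    }
}
```

In this sketch, a sample whose PC does not map to any line is dropped.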

In order to produce annotated source, Performance Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines. It must be able to read the source file to produce an annotated copy, which is displayed. See How the Tools Find Source Code for a description of the process used to find an experiment's source code.

The compilation process goes through many stages, depending on the level of optimization requested, and transformations take place which can confuse the mapping of instructions to source lines. For some optimizations, source line information might be completely lost, while for others, it might be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.

Interpreting Source Line Metrics

Metrics for an instruction must be interpreted as metrics accrued while waiting for the instruction to be executed. If the instruction being executed when an event is recorded comes from the same source line as the leaf PC, the metrics can be interpreted as due to execution of that source line. However, if the leaf PC comes from a different source line than the instruction being executed, at least some of the metrics for the source line that the leaf PC belongs to must be interpreted as metrics accumulated while this line was waiting to be executed. An example is when a value that is computed on one source line is used on the next source line.

For hardware-counter overflow profiling using a precise hardware counter (as indicated in the output from collect -h), the leaf PC is the PC of the instruction that causes the counter to overflow, not the next instruction to be executed. For non-precise hardware counters, the leaf PC reported might be several instructions past the instruction that causes the overflow, because the kernel mechanism for recognizing when the overflow occurs has a variable amount of skid.

The issue of how to interpret the metrics matters most when a substantial delay occurs in execution, such as at a cache miss or a resource queue stall, or when an instruction is waiting for a result from a previous instruction. In such cases, the metrics for the source lines can seem to be unreasonably high. Look at other nearby lines in the code to find the line responsible for the high metric value.
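The following C sketch shows the kind of code where this matters. The function name is hypothetical, and where the wait is actually charged depends on the processor, the compiler, and the type of profiling data, so treat the comments as one plausible outcome rather than a rule.

```c
#include <assert.h>
#include <stddef.h>

/* Sums table elements selected through an index array. */
double gather_sum(const double *table, const size_t *index, size_t n)
{
    double sum = 0.0;
    size_t i;
    assert((table != NULL && index != NULL) || n == 0);
    for (i = 0; i < n; i++) {
        double v = table[index[i]];  /* load issued here; may miss the cache */
        sum += v;                    /* first use of v; time spent waiting for
                                        the load may be reported on this line */
    }
    return sum;
}
```

If the addition line shows an unexpectedly high metric, the load on the preceding line is the likely cause.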

Metric Formats

The four possible formats for the metrics that can appear on a line of annotated source code are explained in Table 7-1.

Table 7-1 Annotated Source-Code Metrics

Metric	Significance
(Blank)	No PC in the program corresponds to this line of code. This case should always apply to comment lines, and applies to apparent code lines in the following circumstances: (1) all the instructions from the apparent piece of code have been eliminated during optimization; (2) the code is repeated elsewhere, and the compiler performed common subexpression recognition and tagged all the instructions with the lines for the other copy; (3) the compiler tagged an instruction from that line with an incorrect line number.
`0.`	Some PCs in the program were tagged as derived from this line, but no data referred to those PCs: they were never in a call stack that was sampled statistically or traced. The `0.` metric does not indicate that the line was not executed, only that it did not show up statistically in a profiling data packet or a recorded tracing data packet.
`0.000`	At least one PC from this line appeared in the data, but the computed metric value rounded to zero.
`1.234`	The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.
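
The rules in the table can be restated as a short C sketch. The function and its parameters are hypothetical and exist only to illustrate the four cases; the three-decimal format is inferred from the examples in the table.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the metric cell text for one source line into buf.
   line_has_pcs: nonzero if any PC in the program is tagged with this line.
   pcs_in_data:  nonzero if any of those PCs appeared in the recorded data.
   value:        sum of the metric over all PCs attributed to the line. */
void format_metric_cell(int line_has_pcs, int pcs_in_data, double value,
                        char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (!line_has_pcs)
        buf[0] = '\0';                        /* (Blank) */
    else if (!pcs_in_data)
        snprintf(buf, size, "0.");            /* PCs exist, none in the data */
    else
        snprintf(buf, size, "%.3f", value);   /* 0.000 or a non-zero value */
}
```

With these assumptions, a value of 0.0004 is shown as 0.000, which distinguishes a line that appeared in the data from a line shown as `0.`.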