Sun Studio 12 Update 1: Performance Analyzer

Chapter 8 Understanding Annotated Source and Disassembly Data

Annotated source code and annotated disassembly code are useful for determining which source lines or instructions within a function are responsible for poor performance, and to view commentary on how the compiler has performed transformations on the code. This section describes the annotation process and some of the issues involved in interpreting the annotated code.

Annotated Source Code

Annotated source code for an experiment can be viewed in the Performance Analyzer by selecting the Source tab in the left pane of the Analyzer window. Alternatively, annotated source code can be viewed without running an experiment, using the er_src utility. This section of the manual describes how source code is displayed in the Performance Analyzer. For details on viewing annotated source code with the er_src utility, see Viewing Source/Disassembly Without An Experiment.

Annotated source in the Analyzer contains the following information:

The contents of the original source file
The performance metrics of each line of executable source code
Highlighting of code lines with metrics exceeding a specific threshold
Index lines
Compiler commentary

Performance Analyzer Source Tab Layout

The Source tab is divided into columns, with fixed-width columns for individual metrics on the left and the annotated source taking up the remaining width on the right.

Identifying the Original Source Lines

All lines displayed in black in the annotated source are taken from the original source file. The number at the start of a line in the annotated source column corresponds to the line number in the original source file. Any lines with characters displayed in a different color are either index lines or compiler commentary lines.

Index Lines in the Source Tab

A source file is any file compiled to produce an object file or interpreted into byte code. An object file normally contains one or more regions of executable code corresponding to functions, subroutines, or methods in the source code. The Analyzer analyzes the object file, identifies each executable region as a function, and attempts to map the functions it finds in the object code to the functions, routines, subroutines, or methods in the source file associated with the object code. When the analyzer succeeds, it adds an index line in the annotated source file in the location corresponding to the first instruction in the function found in the object code.

The annotated source shows an index line for every function, including inline functions, even though inline functions are not displayed in the list displayed by the Function tab. The Source tab displays index lines in red italics with text in angle-brackets. The simplest type of index line corresponds to the function’s default context. The default source context for any function is defined as the source file to which the first instruction in that function is attributed. The following example shows an index line for a C function icputime.

                    578. int
                    579. icputime(int k)
0.       0.         580. {
                         <Function: icputime>

As can be seen from the above example, the index line appears on the line following the first instruction. For C source, the first instruction corresponds to the opening brace at the start of the function body. In Fortran source, the index line for each subroutine follows the line containing the subroutine keyword. Also, a main function index line follows the first Fortran source instruction executed when the application starts, as shown in the following example:

   	                           	 1. ! Copyright 02/27/09 Sun Microsystems, Inc. All Rights Reserved
     	     	     	     	   2. ! @(#)omptest.f 1.10 09/02/27 SMI
     	     	     	     	   3. ! Synthetic f90 program, used for testing openmp directives and the
     	     	     	     	   4. !       analyzer
     	     	     	     	   5. 
0.   	0.   	0.   	0.   	   6.         program omptest
                                          <Function: MAIN>  
                                       7.
                                       8. !$PRAGMA C (gethrtime, gethrvtime)

Sometimes, the Analyzer might not be able to map a function it finds in the object code with any programming instructions in the source file associated with that object code; for example, code may be #included or inlined from another file, such as a header file.

Also displayed in red are special index lines and other special lines that are not compiler commentary. For example, as a result of compiler optimization, a special index line might be created for a function in the object code that does not correspond to code written in any source file. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Compiler Commentary

Compiler commentary indicates how compiler-optimized code has been generated. Compiler commentary lines are displayed in blue, to distinguish them from index lines and original source lines. Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The compiler commentary describes many of the transformations which have been made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining and pipelining. The following shows an example of compiler commentary.

0.   	0.   	0.   	0.   	 28.       SUBROUTINE dgemv_g2 (transa, m, n, alpha, b, ldb,   &
     	     	     	     	 29.      &                   c, incc, beta, a, inca)
     	     	     	     	 30.       CHARACTER (KIND=1) :: transa
                                	 31.       INTEGER   (KIND=4) :: m, n, incc, inca, ldb
                                	 32.       REAL      (KIND=8) :: alpha, beta
                                	 33.       REAL      (KIND=8) :: a(1:m), b(1:ldb,1:n), c(1:n)
                                	 34.       INTEGER            :: i, j
                                	 35.       REAL      (KIND=8) :: tmr, wtime, tmrend
                                	 36.       COMMON/timer/ tmr
                                	 37. 
                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
0.   	0.   	0.   	0.   	 38.       tmrend = tmr + wtime()


                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
                             Discovered loop below has tag L16
0.   	0.   	0.   	0.   	 39.       DO WHILE(wtime() < tmrend)
                          	
                         	Array statement below generated loop L4
0.   	0.   	0.   	0.   	 40.       a(1:m) = 0.0
                                	 41. 
     	     	     	     	
                         	Source loop below has tag L6
0.   	0.   	0.   	0.   	 42.       DO j = 1, n       ! <=-----\ swapped loop indices
   	
                         	Source loop below has tag L5
                             L5 cloned for unrolling-epilog.  Clone is L19
                             All 8 copies of L19 are fused together as part of unroll and jam
                             L19 scheduled with steady-state cycle count = 9
                             L19 unrolled 4 times
                             L19 has 9 loads, 1 stores, 8 prefetches, 8 FPadds, 
                             8 FPmuls, and 0 FPdivs per iteration
                             L19 has 0 int-loads, 0 int-stores, 11 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
                             L5 scheduled with steady-state cycle count = 2
                             L5 unrolled 4 times
                             L5 has 2 loads, 1 stores, 1 prefetches, 1 FPadds, 1 FPmuls,
                             and 0 FPdivs per iteration
                             L5 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
0.210	0.210	0.210	0.   	 43.          DO i = 1, m    !    <=--/
4.003	4.003	4.003	0.050	 44.             a(i) = a(i) + b(i,j) * c(j)
0.240	0.240	0.240	0.   	 45.          END DO  
0.   	0.   	0.   	0.   	 46.       END DO  
                              	   47.       END DO
                         	        48. 
0.   	0.   	0.   	0.   	 49.       RETURN
0.   	0.   	0.   	0.   	 50.       END

You can set the types of compiler commentary displayed in the Source tab using the Source/Disassembly tab in the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.

Loop Optimizations

The compiler can do several types of loop optimization. Some of the more common ones are as follows:

Loop unrolling
Loop peeling
Loop interchange
Loop fission
Loop fusion

Loop unrolling consists of repeating several iterations of a loop within the loop body, and adjusting the loop index accordingly. As the body of the loop becomes larger, the compiler can schedule the instructions more efficiently. Also reduced is the overhead caused by the loop index increment and conditional check operations. The remainder of the loop is handled using loop peeling.

Loop peeling consists of removing a number of loop iterations from the loop, and moving them in front of or after the loop, as appropriate.

Loop interchange changes the ordering of nested loops to minimize memory stride, to maximize cache-line hit rates.

Loop fusion consists of combining adjacent or closely located loops into a single loop. The benefits of loop fusion are similar to loop unrolling. In addition, if common data is accessed in the two pre-optimized loops, cache locality is improved by loop fusion, providing the compiler with more opportunities to exploit instruction-level parallelism.

Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements. Sometimes it is possible to split the loops into two: one with the conditional statement and one without. This can increase opportunities for software pipelining in the loop without the conditional statement.

Sometimes, with nested loops, the compiler applies loop fission to split a loop apart, and then performs loop fusion to recombine the loop in a different way to increase performance. In this case, you see compiler commentary similar to the following:

    Loop below fissioned into 2 loops
    Loop below fused with loop on line 116
    [116]    for (i=0;i<nvtxs;i++) {

Inlining of Functions

With an inline function, the compiler inserts the function instructions directly at the locations where it is called instead of making actual function calls. Thus, similar to a C/C++ macro, the instructions of an inline function are replicated at each call location. The compiler performs explicit or automatic inlining at high optimization levels (4 and 5). Inlining saves the cost of a function call and provides more instructions for which register usage and instruction scheduling can be optimized, at the cost of a larger code footprint in memory. The following is an example of inlining compiler commentary.

                Function initgraph inlined from source file ptralias.c 
                    into the code for the following line
0.       0.         44.       initgraph(rows);

Note –

The compiler commentary does not wrap onto two lines in the Source tab of the Analyzer.

Parallelization

If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution on multiple processors. The compiler commentary indicates where parallelization has and has not been performed, and why. The following shows an example of parallelization computer commentary.

0.       6.324       9. c$omp  parallel do shared(a,b,c,n) private(i,j,k)
                   Loop below parallelized by explicit user directive
                   Loop below interchanged with loop on line 12
0.010    0.010     [10]            do i = 2, n-1

                   Loop below not parallelized because it was nested in a parallel loop
                   Loop below interchanged with loop on line 12
0.170    0.170      11.               do j = 2, i

For more details about parallel execution and compiler-generated body functions, refer to Overview of OpenMP Software Execution.

Special Lines in the Annotated Source

Several other annotations for special cases can be shown under the Source tab, either in the form of compiler commentary, or as special lines displayed in the same color as index lines. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Source Line Metrics

Source code metrics are displayed, for each line of executable code, in fixed-width columns. The metrics are the same as in the function list. You can change the defaults for an experiment using a .er.rc file; for details, see Commands That Set Defaults. You can also change the metrics displayed and highlighting thresholds in the Analyzer using the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.

Annotated source code shows the metrics of an application at the source-line level. It is produced by taking the PCs (program counts) that are recorded in the application’s call stack, and mapping each PC to a source line. To produce an annotated source file, the Analyzer first determines all of the functions that are generated in a particular object module (.o file) or load object, then scans the data for all PCs from each function. In order to produce annotated source, the Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines, and it must be able to read the source file to produce an annotated copy, which is displayed. The Analyzer searches for the source file, object file, and executable files in the following default locations in turn, and stops when it finds a file of the correct basename:

The archive directories of experiments
The current working directory
The absolute pathname as recorded in the executables or compilation objects

The default can be changed by the addpath or setpath directive, or by the Analyzer GUI.

If a file cannot be found using the path list set by addpath or setpath, you can specify one or more path remappings with the pathmap command. With the pathmap command you specify an old-prefix and new-prefix. In any pathname for a source file, object file, or shared object that begins with the prefix specified with old-prefix, the old prefix is replaced by the prefix specified with new-prefix. The resulting path is then used to find the file. Multiple pathmap commands can be supplied, and each is tried until the file is found.

The compilation process goes through many stages, depending on the level of optimization requested, and transformations take place which can confuse the mapping of instructions to source lines. For some optimizations, source line information might be completely lost, while for others, it might be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.

Interpreting Source Line Metrics

Metrics for an instruction must be interpreted as metrics accrued while waiting for the instruction to be executed. If the instruction being executed when an event is recorded comes from the same source line as the leaf PC, the metrics can be interpreted as due to execution of that source line. However, if the leaf PC comes from a different source line than the instruction being executed, at least some of the metrics for the source line that the leaf PC belongs to must be interpreted as metrics accumulated while this line was waiting to be executed. An example is when a value that is computed on one source line is used on the next source line.

The issue of how to interpret the metrics matters most when there is a substantial delay in execution, such as at a cache miss or a resource queue stall, or when an instruction is waiting for a result from a previous instruction. In such cases the metrics for the source lines can seem to be unreasonably high, and you should look at other nearby lines in the code to find the line responsible for the high metric value.

Metric Formats

The four possible formats for the metrics that can appear on a line of annotated source code are explained in Table 8–1.

Table 8–1 Annotated Source-Code Metrics


Metric	Significance
(Blank)	No PC in the program corresponds to this line of code. This case should always apply to comment lines, and applies to apparent code lines in the following circumstances: All the instructions from the apparent piece of code have been eliminated during optimization. The code is repeated elsewhere, and the compiler performed common subexpression recognition and tagged all the instructions with the lines for the other copy. The compiler tagged an instruction from that line with an incorrect line number.
`0.`	Some PCs in the program were tagged as derived from this line, but no data referred to those PCs: they were never in a call stack that was sampled statistically or traced. The `0.` metric does not indicate that the line was not executed, only that it did not show up statistically in a profiling data packet or a recorded tracing data packet.
`0.000`	At least one PC from this line appeared in the data, but the computed metric value rounded to zero.
`1.234`	The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.

Annotated Disassembly Code

Annotated disassembly provides an assembly-code listing of the instructions of a function or object module, with the performance metrics associated with each instruction. Annotated disassembly can be displayed in several ways, determined by whether line-number mappings and the source file are available, and whether the object module for the function whose annotated disassembly is being requested is known:

If the object module is not known, the Analyzer disassembles the instructions for just the specified function, and does not show any source lines in the disassembly.
If the object module is known, the disassembly covers all functions within the object module.
If the source file is available, and line number data is recorded, the Analyzer can interleave the source with the disassembly, depending on the display preference.
If the compiler has inserted any commentary into the object code, it too, is interleaved in the disassembly if the corresponding preferences are set.

Each instruction in the disassembly code is annotated with the following information.

A source line number, as reported by the compiler
Its relative address
The hexadecimal representation of the instruction, if requested
The assembler ASCII representation of the instruction

Where possible, call addresses are resolved to symbols (such as function names). Metrics are shown on the lines for instructions, and can be shown on any interleaved source code if the corresponding preference is set. Possible metric values are as described for source-code annotations, in Table 8–1.

The disassembly listing for code that is #included in multiple locations repeats the disassembly instructions once for each time that the code has been #included. The source code is interleaved only for the first time a repeated block of disassembly code is shown in a file. For example, if a block of code defined in a header called inc_body.h is #included by four functions named inc_body , inc_entry, inc_middle, and inc_exit, then the block of disassembly instructions appears four times in the disassembly listing for inc_body.h, but the source code is interleaved only in the first of the four blocks of disassembly instructions. Switching to Source tab reveals index lines corresponding to each of the times that the disassembly code was repeated.

Index lines can be displayed in the Disassembly tab. Unlike with the Source tab, these index lines cannot be used directly for navigation purposes. However, placing the cursor on one of the instructions immediately below the index line and selecting the Source tab navigates you to the file referenced in the index line.

Files that #include code from other files show the included code as raw disassembly instructions without interleaving the source code. However, placing the cursor on one of these instructions and selecting the Source tab opens the file containing the #included code. Selecting the Disassembly tab with this file displayed shows the disassembly code with interleaved source code.

Source code can be interleaved with disassembly code for inline functions, but not for macros.

When code is not optimized, the line numbers for each instruction are in sequential order, and the interleaving of source lines and disassembled instructions occurs in the expected way. When optimization takes place, instructions from later lines sometimes appear before those from earlier lines. The Analyzer’s algorithm for interleaving is that whenever an instruction is shown as coming from line N, all source lines up to and including line N are written before the instruction. One effect of optimization is that source code can appear between a control transfer instruction and its delay slot instruction. Compiler commentary associated with line N of the source is written immediately before that line.

Interpreting Annotated Disassembly

Interpreting annotated disassembly is not straightforward. The leaf PC is the address of the next instruction to execute, so metrics attributed to an instruction should be considered as time spent waiting for the instruction to execute. However, the execution of instructions does not always happen in sequence, and there might be delays in the recording of the call stack. To make use of annotated disassembly, you should become familiar with the hardware on which you record your experiments and the way in which it loads and executes instructions.

The next few subsections discuss some of the issues of interpreting annotated disassembly.

Instruction Issue Grouping

Instructions are loaded and issued in groups known as instruction issue groups . Which instructions are in the group depends on the hardware, the instruction type, the instructions already being executed, and any dependencies on other instructions or registers. As a result, some instructions might be underrepresented because they are always issued in the same clock cycle as the previous instruction, so they never represent the next instruction to be executed. And when the call stack is recorded, there might be several instructions that could be considered the next instruction to execute.

Instruction issue rules vary from one processor type to another, and depend on the instruction alignment within cache lines. Since the linker forces instruction alignment at a finer granularity than the cache line, changes in a function that might seem unrelated can cause different alignment of instructions. The different alignment can cause a performance improvement or degradation.

The following artificial situation shows the same function compiled and linked in slightly different circumstances. The two output examples shown below are the annotated disassembly listings from the er_print utility. The instructions for the two examples are identical, but the instructions are aligned differently.

In this example the instruction alignment maps the two instructions cmp and bl,a to different cache lines, and a significant amount of time is used waiting to execute these two instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.010     0.010              [ 6]    1066c:  clr         %o0
   0.        0.                 [ 6]    10670:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    10674:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10678:  inc         2, %o0
## 1.360     1.360              [ 7]    1067c:  cmp         %o0, %o5
## 1.510     1.510              [ 7]    10680:  bl,a        0x1067c
   0.        0.                 [ 7]    10684:  inc         2, %o0
   0.        0.                 [ 7]    10688:  retl
   0.        0.                 [ 7]    1068c:  nop
                             8.     return i;
                             9. }

In this example, the instruction alignment maps the two instructions cmp and bl,a to the same cache line, and a significant amount of time is used waiting to execute only one of these instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.        0.                 [ 6]    10684:  clr         %o0
   0.        0.                 [ 6]    10688:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    1068c:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10690:  inc         2, %o0
## 1.440     1.440              [ 7]    10694:  cmp         %o0, %o5
   0.        0.                 [ 7]    10698:  bl,a        0x10694
   0.        0.                 [ 7]    1069c:  inc         2, %o0
   0.        0.                 [ 7]    106a0:  retl
   0.        0.                 [ 7]    106a4:  nop
                             8.     return i;
                             9. }

Instruction Issue Delay

Sometimes, specific leaf PCs appear more frequently because the instruction that they represent is delayed before issue. This appearance can occur for a number of reasons, some of which are listed below:

The previous instruction takes a long time to execute and is not interruptible, for example when an instruction traps into the kernel.
An arithmetic instruction needs a register that is not available because the register contents were set by an earlier instruction that has not yet completed. An example of this sort of delay is a load instruction that has a data cache miss.
A floating-point arithmetic instruction is waiting for another floating-point instruction to complete. This situation occurs for instructions that cannot be pipelined, such as square root and floating-point divide.
The instruction cache does not include the memory word that contains the instruction (I-cache miss).
On UltraSPARC ® III processors, a cache miss on a load instruction blocks all instructions that follow it until the miss is resolved, regardless of whether these instructions use the data that is being loaded. UltraSPARC II processors only block instructions that use the data item that is being loaded.

Attribution of Hardware Counter Overflows

Apart from TLB misses on UltraSPARC platforms, the call stack for a hardware counter overflow event is recorded at some point further on in the sequence of instructions than the point at which the overflow occurred, for various reasons including the time taken to handle the interrupt generated by the overflow. For some counters, such as cycles or instructions issued, this delay does not matter. For other counters, such as those counting cache misses or floating point operations, the metric is attributed to a different instruction from that which is responsible for the overflow. Often the PC that caused the event is only a few instructions before the recorded PC, and the instruction can be correctly located in the disassembly listing. However, if there is a branch target within this instruction range, it might be difficult or impossible to tell which instruction corresponds to the PC that caused the event. For hardware counters that count memory access events, the Collector searches for the PC that caused the event if the counter name is prefixed with a plus, +.

Special Lines in the Source, Disassembly and PCs Tabs

Outline Functions

Outline functions can be created during feedback-optimized compilations. They are displayed as special index lines in the Source tab and Disassembly tab. In the Source tab, an annotation is displayed in the block of code that has been converted into an outline function.

                   Function binsearchmod inlined from source file ptralias2.c into the 
0.       0 .         58.         if( binsearchmod( asize, &element ) ) {
0.240    0.240       59.             if( key != (element << 1) ) {
0.       0.          60.                 error |= BINSEARCHMODPOSTESTFAILED;
                        <Function: main -- outline code from line 60 [_$o1B60.main]>
0.040    0.040     [ 61]                 break;
0.       0.          62.             }
0.       0.          63.         }

In the Disassembly tab, the outline functions are typically displayed at the end of the file.

                        <Function: main -- outline code from line 85 [_$o1D85.main]>
0.       0.             [ 85] 100001034:  sethi       %hi(0x100000), %i5
0.       0.             [ 86] 100001038:  bset        4, %i3
0.       0.             [ 85] 10000103c:  or          %i5, 1, %l7
0.       0.             [ 85] 100001040:  sllx        %l7, 12, %l5
0.       0.             [ 85] 100001044:  call        printf ! 0x100101300
0.       0.             [ 85] 100001048:  add         %l5, 336, %o0
0.       0.             [ 90] 10000104c:  cmp         %i3, 0
0.       0.             [ 20] 100001050:  ba,a        0x1000010b4
                        <Function: main -- outline code from line 46 [_$o1A46.main]>
0.       0.             [ 46] 100001054:  mov         1, %i3
0.       0.             [ 47] 100001058:  ba          0x100001090
0.       0.             [ 56] 10000105c:  clr         [%i2]
                        <Function: main -- outline code from line 60 [_$o1B60.main]>
0.       0.             [ 60] 100001060:  bset        2, %i3
0.       0.             [ 61] 100001064:  ba          0x10000109c
0.       0.             [ 74] 100001068:  mov         1, %o3

The name of the outline function is displayed in square brackets, and encodes information about the section of outlined code, including the name of the function from which the code was extracted and the line number of the beginning of the section in the source code. These mangled names can vary from release to release. The Analyzer provides a readable version of the function name. For further details, refer to Outline Functions.

If an outline function is called when collecting the performance data for an application, the Analyzer displays a special line in the annotated disassembly to show inclusive metrics for that function. For further details, see Inclusive Metrics.

Compiler-Generated Body Functions

When a compiler parallelizes a loop in a function, or a region that has parallelization directives, it creates new body functions that are not in the original source code. These functions are described in Overview of OpenMP Software Execution.

The compiler assigns mangled names to body functions that encode the type of parallel construct, the name of the function from which the construct was extracted, the line number of the beginning of the construct in the original source, and the sequence number of the parallel construct. These mangled names vary from release to release of the microtasking library, but are shown demangled into more comprehensible names.

The following shows a typical compiler-generated body function as displayed in the functions list.

7.415      14.860      psec_ -- OMP sections from line 9 [_$s1A9.psec_]
3.873       3.903      craydo_ -- MP doall from line 10 [_$d1A10.craydo_]

As can be seen from the above examples, the name of the function from which the construct was extracted is shown first, followed by the type of parallel construct, followed by the line number of the parallel construct, followed by the mangled name of the compiler-generated body function in square brackets. Similarly, in the disassembly code, a special index line is generated.

0.       0.            <Function: psec_ -- OMP sections from line 9 [_$s1A9.psec_]>
0.       7.445         [24]    1d8cc:  save        %sp, -168, %sp
0.       0.            [24]    1d8d0:  ld          [%i0], %g1
0.       0.            [24]    1d8d4:  tst         %i1

0.       0.            <Function: craydo_ -- MP doall from line 10 [_$d1A10.craydo_]>
0.       0.030         [ ?]    197e8:  save        %sp, -128, %sp
0.       0.            [ ?]    197ec:  ld          [%i0 + 20], %i5
0.       0.            [ ?]    197f0:  st          %i1, [%sp + 112]
0.       0.            [ ?]    197f4:  ld          [%i5], %i3

With Cray directives, the function may not be correlated with source code line numbers. In such cases, a [ ?] is displayed in place of the line number. If the index line is shown in the annotated source code, the index line indicates instructions without line numbers, as shown below.

                     9. c$mic  doall shared(a,b,c,n) private(i,j,k)
                  
                   Loop below fused with loop on line 23
                   Loop below not parallelized because autoparallelization 
                     is not enabled
                   Loop below autoparallelized
                   Loop below interchanged with loop on line 12
                   Loop below interchanged with loop on line 12
3.873     3.903         <Function: craydo_ -- MP doall from line 10 [_$d1A10.craydo_],
                      instructions without line numbers>
0.        3.903     10.            do i = 2, n-1

Note –

Index lines and compiler-commentary lines do not wrap in the real displays.

Dynamically Compiled Functions

Dynamically compiled functions are functions that are compiled and linked while the program is executing. The Collector has no information about dynamically compiled functions that are written in C or C++, unless the user supplies the required information using the Collector API function collector_func_load(). Information displayed by the Function tab, Source tab, and Disassembly tab depends on the information passed to collector_func_load() as follows:

If information is not supplied, collector_func_load() is not called; the dynamically compiled and loaded function appears in the function list as <Unknown>. Neither function source nor disassembly code is viewable in the Analyzer.
If no source file name and no line-number table is provided, but the name of the function, its size, and its address are provided, the name of the dynamically compiled and loaded function and its metrics appear in the function list. The annotated source is available, and the disassembly instructions are viewable, although the line numbers are specified by [?] to indicate that they are unknown.
If the source file name is given, but no line-number table is provided, the information displayed by the Analyzer is similar to the case where no source file name is given, except that the beginning of the annotated source displays a special index line indicating that the function is composed of instructions without line numbers. For example:
1.121 1.121 <Function func0, instructions without line numbers> 1. #include <stdio.h>
If the source file name and line-number table is provided, the function and its metrics are displayed by the Function tab, Source tab, and Disassembly tab in the same way as conventionally compiled functions.

For more information about the Collector API functions, see Dynamic Functions and Modules.

For Java programs, most methods are interpreted by the JVM software. The Java HotSpot virtual machine, running in a separate thread, monitors performance during the interpretive execution. During the monitoring process, the virtual machine may decide to take one or more interpreted methods, generate machine code for them, and execute the more-efficient machine-code version, rather than interpret the original.

For Java programs, there is no need to use the Collector API functions; the Analyzer signifies the existence of Java HotSpot-compiled code in the annotated disassembly listing using a special line underneath the index line for the method, as shown in the following example.

                     11.    public int add_int () {
                     12.       int       x = 0;
                         <Function: Routine.add_int()>
2.832     2.832          Routine.add_int() <HotSpot-compiled leaf instructions>
0.        0.             [ 12] 00000000: iconst_0
0.        0.             [ 12] 00000001: istore_1

The disassembly listing only shows the interpreted byte code, not the compiled instructions. By default, the metrics for the compiled code are shown next to the special line. The exclusive and inclusive CPU times are different than the sum of all the inclusive and exclusive CPU times shown for each line of interpreted byte code. In general, if the method is called on several occasions, the CPU times for the compiled instructions are greater than the sum of the CPU times for the interpreted byte code, because the interpreted code is executed only once when the method is initially called, whereas the compiled code is executed thereafter.

The annotated source does not show Java HotSpot-compiled functions. Instead, it displays a special index line to indicate instructions without line numbers. For example, the annotated source corresponding to the disassembly extract shown above is as follows:

                     11.    public int add_int () {
2.832     2.832        <Function: Routine.add_int(), instructions without line numbers>
0.        0.         12.       int       x = 0;
                       <Function: Routine.add_int()>

Java Native Functions

Native code is compiled code originally written in C, C++, or Fortran, called using the Java Native Interface (JNI) by Java code. The following example is taken from the annotated disassembly of file jsynprog.java associated with demo program jsynprog.

                     5. class jsynprog
                        <Function: jsynprog.<init>()>
0.       5.504          jsynprog.JavaCC() <Java native method>
0.       1.431          jsynprog.JavaCJava(int) <Java native method>
0.       5.684          jsynprog.JavaJavaC(int) <Java native method>
0.       0.             [  5] 00000000: aload_0
0.       0.             [  5] 00000001: invokespecial <init>()
0.       0.             [  5] 00000004: return

Because the native methods are not included in the Java source, the beginning of the annotated source for jsynprog.java shows each Java native method using a special index line to indicate instructions without line numbers.

0.       5.504          <Function: jsynprog.JavaCC(), instructions without line 
                           numbers>
0.       1.431          <Function: jsynprog.JavaCJava(int), instructions without line 
                           numbers>
0.       5.684          <Function: jsynprog.JavaJavaC(int), instructions without line 
                           numbers>

Note –

The index lines do not wrap in the real annotated source display.

Cloned Functions

The compilers have the ability to recognize calls to a function for which extra optimization can be performed. An example of such is a call to a function where some of the arguments passed are constants. When the compiler identifies particular calls that it can optimize, it creates a copy of the function, which is called a clone, and generates optimized code.

In the annotated source, compiler commentary indicates if a cloned function has been created:

0.       0.       Function foo from source file clone.c cloned, 
                   creating cloned function _$c1A.foo; 
                   constant parameters propagated to clone
0.       0.570     27.    foo(100, 50, a, a+50, b);

Note –

Compiler commentary lines do not wrap in the real annotated source display.

The clone function name is a mangled name that identifies the particular call. In the above example, the compiler commentary indicates that the name of the cloned function is _$c1A.foo. This function can be seen in the function list as follows:

0.350     0.550     foo
0.340     0.570     _$c1A.foo

Each cloned function has a different set of instructions, so the annotated disassembly listing shows the cloned functions separately. They are not associated with any source file, and therefore the instructions are not associated with any source line numbers. The following shows the first few lines of the annotated disassembly for a cloned function.

0.       0.           <Function: _$c1A.foo>
0.       0.           [?]    10e98:  save        %sp, -120, %sp
0.       0.           [?]    10e9c:  sethi       %hi(0x10c00), %i4
0.       0.           [?]    10ea0:  mov         100, %i3
0.       0.           [?]    10ea4:  st          %i3, [%i0]
0.       0.           [?]    10ea8:  ldd         [%i4 + 640], %f8

Static Functions

Static functions are often used within libraries, so that the name used internally in a library does not conflict with a name that the user might use. When libraries are stripped, the names of static functions are deleted from the symbol table. In such cases, the Analyzer generates an artificial name for each text region in the library containing stripped static functions. The name is of the form <static>@0x12345, where the string following the @ sign is the offset of the text region within the library. The Analyzer cannot distinguish between contiguous stripped static functions and a single such function, so two or more such functions can appear with their metrics coalesced. Examples of static functions can be found in the functions list of the jsynprog demo, reproduced below.

0.       0.       <static>@0x18780
0.       0.       <static>@0x20cc
0.       0.       <static>@0xc9f0
0.       0.       <static>@0xd1d8
0.       0.       <static>@0xe204

In the PCs tab, the above functions are represented with an offset, as follows:

0.       0.       <static>@0x18780 + 0x00000818
0.       0.       <static>@0x20cc + 0x0000032C
0.       0.       <static>@0xc9f0 + 0x00000060
0.       0.       <static>@0xd1d8 + 0x00000040
0.       0.       <static>@0xe204 + 0x00000170

An alternative representation in the PCs tab of functions called within a stripped library is:

<library.so> -- no functions found + 0x0000F870

Inclusive Metrics

In the annotated disassembly, special lines exist to tag the time taken by outline functions.

The following shows an example of the annotated disassembly displayed when an outline function is called:

0.       0.        43.         else
0.       0.        44.         {
0.       0.        45.                 printf("else reached\n");
0.       2.522         <inclusive metrics for outlined functions>

Branch Target

An artificial line, <branch target>, shown in the annotated disassembly listing, corresponds to a PC of an instruction where the backtracking to find its effective address fails because the backtracking algorithm runs into a branch target.

Viewing Source/Disassembly Without An Experiment

You can view annotated source code and annotated disassembly code using the er_src utility, without running an experiment. The display is generated in the same way as in the Analyzer, except that it does not display any metrics. The syntax of the er_src command is

er_src [ -func | -{source,src} item tag | -disasm item tag |
-{cc,scc,dcc} com_spec | -outfile filename | -V ] object

object is the name of an executable, a shared object, or an object file (.o file).

item is the name of a function or of a source or object file used to build the executable or shared object. item can also be specified in the form function’file’, in which case er_src displays the source or disassembly of the named function in the source context of the named file.

tag is an index used to determine which item is being referred to when multiple functions have the same name. It is required, but is ignored if not necessary to resolve the function.

The special item and tag, all -1, tells er_src to generate the annotated source or disassembly for all functions in the object.

Note –

The output generated as a result of using all -1 on executables and shared objects may be very large.

The following sections describe the options accepted by the er_src utility.

`-func`

List all the functions from the given object.

`-{source,src}` `item tag`

Show the annotated source for the listed item.

`-disasm` `item tag`

Include the disassembly in the listing. The default listing does not include the disassembly. If there is no source available, a listing of the disassembly without compiler commentary is produced.

`-{c,scc,dcc}` `com-spec`

Specify which classes of compiler commentary classes to show. com-spec is a list of classes separated by colons. The com-spec is applied to source compiler commentary if the -scc option is used, to disassembly commentary if the -dcc option is used, or to both source and disassembly commentary if -c is used. See Commands That Control the Source and Disassembly Listings for a description of these classes.

The commentary classes can be specified in a defaults file. The system wide er.rc defaults file is read first, then an .er.rc file in the user’s home directory, if present, then an .er.rc file in the current directory. Defaults from the .er.rc file in your home directory override the system defaults, and defaults from the .er.rc file in the current directory override both home and system defaults. These files are also used by the Analyzer and the er_print utility, but only the settings for source and disassembly compiler commentary are used by the er_src utility. See Commands That Set Defaults for a description of the defaults files. Commands in a defaults file other than scc and dcc are ignored by the er_src utility.

`-outfile` `filename`

Open the file filename for output of the listing. By default, or if the filename is a dash (-), output is written to stdout.

`-V`

Print the current release version.

Chapter 8 Understanding Annotated Source and Disassembly Data

Annotated Source Code

Performance Analyzer Source Tab Layout

Identifying the Original Source Lines

Index Lines in the Source Tab

Compiler Commentary

Common Subexpression Elimination

Loop Optimizations

Inlining of Functions

Parallelization

Special Lines in the Annotated Source

Source Line Metrics

Interpreting Source Line Metrics

Metric Formats

Annotated Disassembly Code

Interpreting Annotated Disassembly

Instruction Issue Grouping

Instruction Issue Delay

Attribution of Hardware Counter Overflows

Special Lines in the Source, Disassembly and PCs Tabs

Outline Functions

Compiler-Generated Body Functions

Dynamically Compiled Functions

Java Native Functions

Cloned Functions

Static Functions

Inclusive Metrics

Branch Target

Viewing Source/Disassembly Without An Experiment

-func

-{source,src} item tag

-disasm item tag

-{c,scc,dcc} com-spec

-outfile filename

-V

`-func`

`-{source,src}` `item tag`

`-disasm` `item tag`

`-{c,scc,dcc}` `com-spec`

`-outfile` `filename`

`-V`