JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Oracle Solaris Studio 12.2: Performance Analyzer
search filter icon
search icon

Document Information


1.  Overview of the Performance Analyzer

2.  Performance Data

3.  Collecting Performance Data

4.  The Performance Analyzer Tool

5.  The er_print Command Line Performance Analysis Tool

6.  Understanding the Performance Analyzer and Its Data

7.  Understanding Annotated Source and Disassembly Data

Annotated Source Code

Performance Analyzer Source Tab Layout

Identifying the Original Source Lines

Index Lines in the Source Tab

Compiler Commentary

Common Subexpression Elimination

Loop Optimizations

Inlining of Functions


Special Lines in the Annotated Source

Source Line Metrics

Interpreting Source Line Metrics

Metric Formats

Annotated Disassembly Code

Interpreting Annotated Disassembly

Instruction Issue Grouping

Instruction Issue Delay

Attribution of Hardware Counter Overflows

Special Lines in the Source, Disassembly and PCs Tabs

Outline Functions

Compiler-Generated Body Functions

Dynamically Compiled Functions

Java Native Functions

Cloned Functions

Static Functions

Inclusive Metrics

Branch Target

Annotations for Store and Load Instructions

Viewing Source/Disassembly Without An Experiment


-{source,src} item tag

-{disasm,dis} item tag

-{cc,scc,dcc} com-spec

-outfile filename


8.  Manipulating Experiments

9.  Kernel Profiling


Annotated Source Code

Annotated source code for an experiment can be viewed in the Performance Analyzer by selecting the Source tab in the left pane of the Analyzer window. Alternatively, annotated source code can be viewed without running an experiment, using the er_src utility. This section of the manual describes how source code is displayed in the Performance Analyzer. For details on viewing annotated source code with the er_src utility, see Viewing Source/Disassembly Without An Experiment.

Annotated source in the Analyzer contains the following information:

Performance Analyzer Source Tab Layout

The Source tab is divided into columns, with fixed-width columns for individual metrics on the left and the annotated source taking up the remaining width on the right.

Identifying the Original Source Lines

All lines displayed in black in the annotated source are taken from the original source file. The number at the start of a line in the annotated source column corresponds to the line number in the original source file. Any lines with characters displayed in a different color are either index lines or compiler commentary lines.

Index Lines in the Source Tab

A source file is any file compiled to produce an object file or interpreted into byte code. An object file normally contains one or more regions of executable code corresponding to functions, subroutines, or methods in the source code. The Analyzer analyzes the object file, identifies each executable region as a function, and attempts to map the functions it finds in the object code to the functions, routines, subroutines, or methods in the source file associated with the object code. When the analyzer succeeds, it adds an index line in the annotated source file in the location corresponding to the first instruction in the function found in the object code.

The annotated source shows an index line for every function, including inline functions, even though inline functions are not displayed in the list displayed by the Function tab. The Source tab displays index lines in red italics with text in angle-brackets. The simplest type of index line corresponds to the function’s default context. The default source context for any function is defined as the source file to which the first instruction in that function is attributed. The following example shows an index line for a C function icputime.

                    578. int
                    579. icputime(int k)
0.       0.         580. {
                         <Function: icputime>

As can be seen from the above example, the index line appears on the line following the first instruction. For C source, the first instruction corresponds to the opening brace at the start of the function body. In Fortran source, the index line for each subroutine follows the line containing the subroutine keyword. Also, a main function index line follows the first Fortran source instruction executed when the application starts, as shown in the following example:

                                   1. ! Copyright (c) 2006, 2010, Oracle and/or its affiliates. All Rights Reserved.
                                   2. ! @(#)omptest.f 1.11 10/03/24 SMI
                                   3. ! Synthetic f90 program, used for testing openmp directives and the
                                   4. !       analyzer
0.       0.       0.       0.      6.        program omptest
                                      <Function: MAIN>  
                                   8. !$PRAGMA C (gethrtime, gethrvtime)

Sometimes, the Analyzer might not be able to map a function it finds in the object code with any programming instructions in the source file associated with that object code; for example, code may be #included or inlined from another file, such as a header file.

Also displayed in red are special index lines and other special lines that are not compiler commentary. For example, as a result of compiler optimization, a special index line might be created for a function in the object code that does not correspond to code written in any source file. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Compiler Commentary

Compiler commentary indicates how compiler-optimized code has been generated. Compiler commentary lines are displayed in blue, to distinguish them from index lines and original source lines. Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The compiler commentary describes many of the transformations which have been made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining and pipelining. The following shows an example of compiler commentary.

0.       0.       0.       0.        28.       SUBROUTINE dgemv_g2 (transa, m, n, alpha, b, ldb,   &
                                     29.      &                   c, incc, beta, a, inca)
                                     30.       CHARACTER (KIND=1) :: transa
                                     31.       INTEGER   (KIND=4) :: m, n, incc, inca, ldb
                                     32.       REAL      (KIND=8) :: alpha, beta
                                     33.       REAL      (KIND=8) :: a(1:m), b(1:ldb,1:n), c(1:n)
                                     34.       INTEGER            :: i, j
                                     35.       REAL      (KIND=8) :: tmr, wtime, tmrend
                                     36.       COMMON/timer/ tmr
                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
0.       0.       0.       0.        38.       tmrend = tmr + wtime()

                             Function wtime_ not inlined because the compiler has not seen 
                             the body of the routine
                             Discovered loop below has tag L16
0.       0.       0.       0.        39.       DO WHILE(wtime() < tmrend)
                             Array statement below generated loop L4
0.       0.       0.       0.        40.       a(1:m) = 0.0
                             Source loop below has tag L6
0.       0.       0.       0.        42.       DO j = 1, n       ! <=-----\ swapped loop indices
                             Source loop below has tag L5
                             L5 cloned for unrolling-epilog.  Clone is L19
                             All 8 copies of L19 are fused together as part of unroll and jam
                             L19 scheduled with steady-state cycle count = 9
                             L19 unrolled 4 times
                             L19 has 9 loads, 1 stores, 8 prefetches, 8 FPadds, 
                             8 FPmuls, and 0 FPdivs per iteration
                             L19 has 0 int-loads, 0 int-stores, 11 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
                             L5 scheduled with steady-state cycle count = 2
                             L5 unrolled 4 times
                             L5 has 2 loads, 1 stores, 1 prefetches, 1 FPadds, 1 FPmuls,
                             and 0 FPdivs per iteration
                             L5 has 0 int-loads, 0 int-stores, 4 alu-ops, 0 muls, 
                             0 int-divs and 0 shifts per iteration
0.210    0.210    0.210    0.        43.          DO i = 1, m    
4.003    4.003    4.003    0.050     44.             a(i) = a(i) + b(i,j) * c(j)
0.240    0.240    0.240    0.        45.          END DO  
0.       0.       0.       0.        46.       END DO  
                                     47.       END DO
0.       0.       0.       0.        49.       RETURN
0.       0.       0.       0.        50.       END

You can set the types of compiler commentary displayed in the Source tab using the Source/Disassembly tab in the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.

Loop Optimizations

The compiler can do several types of loop optimization. Some of the more common ones are as follows:

Loop unrolling consists of repeating several iterations of a loop within the loop body, and adjusting the loop index accordingly. As the body of the loop becomes larger, the compiler can schedule the instructions more efficiently. Also reduced is the overhead caused by the loop index increment and conditional check operations. The remainder of the loop is handled using loop peeling.

Loop peeling consists of removing a number of loop iterations from the loop, and moving them in front of or after the loop, as appropriate.

Loop interchange changes the ordering of nested loops to minimize memory stride, to maximize cache-line hit rates.

Loop fusion consists of combining adjacent or closely located loops into a single loop. The benefits of loop fusion are similar to loop unrolling. In addition, if common data is accessed in the two pre-optimized loops, cache locality is improved by loop fusion, providing the compiler with more opportunities to exploit instruction-level parallelism.

Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements. Sometimes it is possible to split the loops into two: one with the conditional statement and one without. This can increase opportunities for software pipelining in the loop without the conditional statement.

Sometimes, with nested loops, the compiler applies loop fission to split a loop apart, and then performs loop fusion to recombine the loop in a different way to increase performance. In this case, you see compiler commentary similar to the following:

    Loop below fissioned into 2 loops
    Loop below fused with loop on line 116
    [116]    for (i=0;i<nvtxs;i++) {
Inlining of Functions

With an inline function, the compiler inserts the function instructions directly at the locations where it is called instead of making actual function calls. Thus, similar to a C/C++ macro, the instructions of an inline function are replicated at each call location. The compiler performs explicit or automatic inlining at high optimization levels (4 and 5). Inlining saves the cost of a function call and provides more instructions for which register usage and instruction scheduling can be optimized, at the cost of a larger code footprint in memory. The following is an example of inlining compiler commentary.

                Function initgraph inlined from source file ptralias.c 
                    into the code for the following line
0.       0.         44.       initgraph(rows);

Note - The compiler commentary does not wrap onto two lines in the Source tab of the Analyzer.


If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution on multiple processors. The compiler commentary indicates where parallelization has and has not been performed, and why. The following shows an example of parallelization computer commentary.

0.       6.324       9. c$omp  parallel do shared(a,b,c,n) private(i,j,k)
                   Loop below parallelized by explicit user directive
                   Loop below interchanged with loop on line 12
0.010    0.010     [10]            do i = 2, n-1

                   Loop below not parallelized because it was nested in a parallel loop
                   Loop below interchanged with loop on line 12
0.170    0.170      11.               do j = 2, i

For more details about parallel execution and compiler-generated body functions, refer to Overview of OpenMP Software Execution.

Special Lines in the Annotated Source

Several other annotations for special cases can be shown under the Source tab, either in the form of compiler commentary, or as special lines displayed in the same color as index lines. For details, refer to Special Lines in the Source, Disassembly and PCs Tabs.

Source Line Metrics

Source code metrics are displayed, for each line of executable code, in fixed-width columns. The metrics are the same as in the function list. You can change the defaults for an experiment using a .er.rc file; for details, see Commands That Set Defaults. You can also change the metrics displayed and highlighting thresholds in the Analyzer using the Set Data Presentation dialog box; for details, see Setting Data Presentation Options.

Annotated source code shows the metrics of an application at the source-line level. It is produced by taking the PCs (program counts) that are recorded in the application’s call stack, and mapping each PC to a source line. To produce an annotated source file, the Analyzer first determines all of the functions that are generated in a particular object module (.o file) or load object, then scans the data for all PCs from each function. In order to produce annotated source, the Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines, and it must be able to read the source file to produce an annotated copy, which is displayed. The Analyzer searches for the source file, object file, and executable files in the following default locations in turn, and stops when it finds a file of the correct basename:

The default can be changed by the addpath or setpath directive, or in the Analyzer using the Set Data Presentation dialog box.

If a file cannot be found using the path list set by addpath or setpath, you can specify one or more path remappings with the pathmap command. With the pathmap command you specify an old-prefix and new-prefix. In any pathname for a source file, object file, or shared object that begins with the prefix specified with old-prefix, the old prefix is replaced by the prefix specified with new-prefix. The resulting path is then used to find the file. Multiple pathmap commands can be supplied, and each is tried until the file is found.

The compilation process goes through many stages, depending on the level of optimization requested, and transformations take place which can confuse the mapping of instructions to source lines. For some optimizations, source line information might be completely lost, while for others, it might be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.

Interpreting Source Line Metrics

Metrics for an instruction must be interpreted as metrics accrued while waiting for the instruction to be executed. If the instruction being executed when an event is recorded comes from the same source line as the leaf PC, the metrics can be interpreted as due to execution of that source line. However, if the leaf PC comes from a different source line than the instruction being executed, at least some of the metrics for the source line that the leaf PC belongs to must be interpreted as metrics accumulated while this line was waiting to be executed. An example is when a value that is computed on one source line is used on the next source line.

The issue of how to interpret the metrics matters most when there is a substantial delay in execution, such as at a cache miss or a resource queue stall, or when an instruction is waiting for a result from a previous instruction. In such cases the metrics for the source lines can seem to be unreasonably high, and you should look at other nearby lines in the code to find the line responsible for the high metric value.

Metric Formats

The four possible formats for the metrics that can appear on a line of annotated source code are explained in Table 7-1.

Table 7-1 Annotated Source-Code Metrics

No PC in the program corresponds to this line of code. This case should always apply to comment lines, and applies to apparent code lines in the following circumstances:
  • All the instructions from the apparent piece of code have been eliminated during optimization.

  • The code is repeated elsewhere, and the compiler performed common subexpression recognition and tagged all the instructions with the lines for the other copy.

  • The compiler tagged an instruction from that line with an incorrect line number.

Some PCs in the program were tagged as derived from this line, but no data referred to those PCs: they were never in a call stack that was sampled statistically or traced. The 0. metric does not indicate that the line was not executed, only that it did not show up statistically in a profiling data packet or a recorded tracing data packet.
At least one PC from this line appeared in the data, but the computed metric value rounded to zero.
The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.