Oracle® Developer Studio 12.5: Performance Analyzer

Updated: June 2016

Annotated Disassembly Code

Annotated disassembly provides an assembly-code listing of the instructions of a function or object module, with the performance metrics associated with each instruction. Annotated disassembly can be displayed in several ways, depending on whether line number mappings and the source file are available, and whether the object module for the requested function is known.

  • If the object module is not known, Performance Analyzer disassembles the instructions for just the specified function, and does not show any source lines in the disassembly.

  • If the object module is known, the disassembly covers all functions within the object module.

  • If the source file is available and line number data is recorded, Performance Analyzer can interleave the source with the disassembly, depending on the display preference.

  • If the compiler has inserted any commentary into the object code, it too is interleaved in the disassembly if the corresponding preferences are set.
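You typically generate such a listing with the er_print utility's disasm command or by selecting the Disassembly view in Performance Analyzer. For example, assuming an experiment named test.1.er that contains the ifunc function shown later in this section (the experiment name here is illustrative), a command along the following lines prints the annotated disassembly of that function:

$ er_print -disasm ifunc test.1.er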

Each instruction in the disassembly code is annotated with the following information:

  • A source line number, as reported by the compiler

  • Its relative address

  • The hexadecimal representation of the instruction, if requested

  • The assembler ASCII representation of the instruction

Where possible, call addresses are resolved to symbols (such as function names). Metrics are shown on the lines for instructions. They can be shown on any interleaved source code if the corresponding preference is set. Possible metric values are as described for source-code annotations in Table 14.

The disassembly listing for code that is #included in multiple locations repeats the disassembly instructions once for each time that the code has been #included. The source code is interleaved only for the first time a repeated block of disassembly code is shown in a file. For example, if a block of code defined in a header called inc_body.h is #included by four functions named inc_body, inc_entry, inc_middle, and inc_exit, the block of disassembly instructions appears four times in the disassembly listing for inc_body.h, but the source code is interleaved only in the first of the four blocks. Switching to Source view reveals index lines corresponding to each of the times that the disassembly code was repeated.
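The layout being described might look like the following minimal sketch, which reuses the file and function names from the example above; the code itself is illustrative and not taken from an actual program:

    /* inc_body.h: a block of code that is textually #included elsewhere */
    for (i = 0; i < 1000; i++) {
        total += i;
    }

    /* Each function that #includes the header gets its own copy of the
       compiled instructions, so the corresponding block of disassembly
       appears once per function in the listing for inc_body.h. */
    int
    inc_body(void)
    {
        int i, total = 0;
    #include "inc_body.h"
        return total;
    }

    int
    inc_entry(void)
    {
        int i, total = 0;
    #include "inc_body.h"
        return total;
    }

    /* inc_middle() and inc_exit() follow the same pattern. */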

Index lines can be displayed in the Disassembly view. Unlike in the Source view, these index lines cannot be used directly for navigation. However, placing the cursor on one of the instructions immediately below an index line and selecting the Source view takes you to the file referenced in that index line.

Files that #include code from other files show the included code as raw disassembly instructions without interleaving the source code. Placing the cursor on one of these instructions and selecting the Source view opens the file containing the #included code. Selecting the Disassembly view with this file displayed shows the disassembly code with interleaved source code.

Source code can be interleaved with disassembly code for inline functions but not for macros.

When code is not optimized, the line numbers for each instruction are in sequential order, and the interleaving of source lines and disassembled instructions occurs in the expected way. When optimization takes place, instructions from later lines sometimes appear before those from earlier lines. The Analyzer’s algorithm for interleaving is that whenever an instruction is shown as coming from line N, all source lines up to and including line N are written before the instruction. One effect of optimization is that source code can appear between a control transfer instruction and its delay slot instruction. Compiler commentary associated with line N of the source is written immediately before that line.
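The following C sketch models that interleaving rule. It is only an illustration of the rule as stated above, not Performance Analyzer's actual implementation, and the sample instruction stream is contrived:

    #include <stdio.h>

    struct insn {
        int line;          /* source line the compiler attributed to the instruction */
        const char *text;  /* disassembled instruction text */
    };

    /* Write source lines and instructions in annotated-disassembly order:
       before an instruction attributed to line N, write every source line
       up to and including N that has not been written yet. */
    static void
    interleave(const char *src[], int nsrc, const struct insn ins[], int nins)
    {
        int written = 0;   /* number of source lines already written */

        for (int i = 0; i < nins; i++) {
            while (written < ins[i].line && written < nsrc) {
                printf("%4d. %s\n", written + 1, src[written]);
                written++;
            }
            printf("      [%2d] %s\n", ins[i].line, ins[i].text);
        }
    }

    int
    main(void)
    {
        const char *src[] = {
            "for (i = 0; i < n; i++)",
            "    sum += a[i];",
            "return sum;",
        };
        /* Contrived optimized ordering: the branch on line 2 has a delay-slot
           instruction generated from line 3, so source line 3 is written
           between the branch and its delay slot; the final instruction comes
           from the earlier line 2 and gets no source text above it. */
        const struct insn ins[] = {
            { 1, "instruction A (from line 1)" },
            { 2, "instruction B (branch, from line 2)" },
            { 3, "instruction C (delay slot, from line 3)" },
            { 2, "instruction D (from line 2)" },
        };

        interleave(src, 3, ins, 4);
        return 0;
    }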

Interpreting Annotated Disassembly

Interpreting annotated disassembly is not straightforward. The leaf PC is the address of the next instruction to execute, so metrics attributed to an instruction should be considered as time spent waiting for the instruction to execute. However, the execution of instructions does not always happen in sequence, and delays might occur in the recording of the call stack. To make use of annotated disassembly, you should become familiar with the hardware on which you record your experiments and the way in which it loads and executes instructions.

The next few subsections discuss some of the issues of interpreting annotated disassembly.

Instruction Issue Grouping

Instructions are loaded and issued in groups known as instruction issue groups. Which instructions are in a group depends on the hardware, the instruction type, the instructions already being executed, and any dependencies on other instructions or registers. As a result, some instructions might be underrepresented because they are always issued in the same clock cycle as the previous instruction, so they never represent the next instruction to be executed. When the call stack is recorded, there might be several instructions that could be considered the next instruction to execute.

Instruction issue rules vary from one processor type to another, and depend on the instruction alignment within cache lines. Because the linker forces instruction alignment at a finer granularity than the cache line, changes in a function that might seem unrelated can cause different alignment of instructions. The different alignment can cause a performance improvement or degradation.

The following artificial situation shows the same function compiled and linked in slightly different circumstances. The two output examples shown below are the annotated disassembly listings from the er_print utility. The instructions in the two examples are identical, but they are aligned differently.

In the following output example, the instruction alignment maps the two instructions cmp and bl,a to different cache lines. A significant amount of time is spent waiting to execute these two instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.010     0.010              [ 6]    1066c:  clr         %o0
   0.        0.                 [ 6]    10670:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    10674:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10678:  inc         2, %o0
## 1.360     1.360              [ 7]    1067c:  cmp         %o0, %o5
## 1.510     1.510              [ 7]    10680:  bl,a        0x1067c
   0.        0.                 [ 7]    10684:  inc         2, %o0
   0.        0.                 [ 7]    10688:  retl
   0.        0.                 [ 7]    1068c:  nop
                             8.     return i;
                             9. }

In the following output example, the instruction alignment maps the two instructions cmp and bl,a to the same cache line. A significant amount of time is spent waiting to execute only one of these instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.        0.                 [ 6]    10684:  clr         %o0
   0.        0.                 [ 6]    10688:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    1068c:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10690:  inc         2, %o0
## 1.440     1.440              [ 7]    10694:  cmp         %o0, %o5
   0.        0.                 [ 7]    10698:  bl,a        0x10694
   0.        0.                 [ 7]    1069c:  inc         2, %o0
   0.        0.                 [ 7]    106a0:  retl
   0.        0.                 [ 7]    106a4:  nop
                             8.     return i;
                             9. }

Instruction Issue Delay

Sometimes, specific leaf PCs appear more frequently because the instruction that they represent is delayed before issue. This appearance can occur for a number of reasons, some of which are listed below:

  • The previous instruction takes a long time to execute and is not interruptible, for example, when an instruction traps into the kernel.

  • An arithmetic instruction needs a register that is not available because the register contents were set by an earlier instruction that has not yet completed. An example of this sort of delay is a load instruction that has a data cache miss.

  • A floating-point arithmetic instruction is waiting for another floating-point instruction to complete. This situation occurs for instructions that cannot be pipelined, such as square root and floating-point divide.

  • The instruction cache does not include the memory word that contains the instruction (I-cache miss).

Attribution of Hardware Counter Overflows

Except for precise counters and TLB misses on some platforms, the call stack for a hardware counter overflow event is recorded at some point further on in the sequence of instructions than the point at which the overflow occurred. This delay occurs for various reasons, including the time taken to handle the interrupt generated by the overflow. For some counters, such as cycles or instructions issued, this delay does not matter. For other counters, such as those counting cache misses or floating-point operations, the metric is attributed to a different instruction from the one responsible for the overflow.

Often the PC that caused the event is only a few instructions before the recorded PC, and the instruction can be correctly located in the disassembly listing. However, if a branch target is within this instruction range, it might be difficult or impossible to determine which instruction corresponds to the PC that caused the event.

Systems with processors whose counters are labeled with the precise keyword allow memoryspace profiling without any special compilation of binaries. For example, the SPARC T4 and M7 processors provide several precise counters. Run the collect -h command and look for the precise keyword to determine whether your system allows memoryspace profiling.

For example, running the following command on a system with the SPARC T7 processor shows the precise raw counters available:

$ collect -h | grep -i precise | grep -v alias
           Counters labeled as precise in the list below will collect memory-space data by default.
    loads      Instr_ld       precise load-store     events 0123 Load Instructions
    stores     Instr_st       precise load-store     events 0123 Store Instructions
    dcm        DC_miss_commit precise load-store     events 0123 L1 D-cache Misses
        use the 'precise' counter:
    Instr_ld                  precise load-store     events 0123
    Instr_st                  precise load-store     events 0123
    Instr_SPR_ring_ops        precise load-store     events 0123
    Instr_atomic              precise load-store     events 0123
    Instr_SW_prefetch         precise load-store     events 0123
    Instr_block_ld_st         precise load-store     events 0123
    DC_miss_L2_L3_hit_commit  precise load-store     events 0123
                              precise load-store     events 0123
                              precise load-store     events 0123
    DC_miss_commit            precise load-store     events 0123
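
Once a suitable precise counter has been identified, you can record an experiment with it; because precise counters collect memory-space data by default, the resulting experiment supports memoryspace analysis. A command along the following lines might be used, where the experiment name and target program are illustrative and the counter's default rate is assumed:

$ collect -h Instr_ld -o test.1.er ./a.out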