Oracle® Solaris Studio 12.4: Performance Analyzer

Interpreting Annotated Disassembly

Interpreting annotated disassembly is not straightforward. The leaf PC is the address of the next instruction to execute, so metrics attributed to an instruction should be considered as time spent waiting for the instruction to execute. However, the execution of instructions does not always happen in sequence, and delays might occur in the recording of the call stack. To make use of annotated disassembly, you should become familiar with the hardware on which you record your experiments and the way in which it loads and executes instructions.
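
Annotated disassembly listings like those shown later in this section can be produced with the disasm command of the er_print utility. For example, assuming an experiment named test.1.er and a function of interest named ifunc (both names are illustrative only), a command similar to the following prints the annotated disassembly of that function:

% er_print -disasm ifunc test.1.er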

The next few subsections discuss some of the issues of interpreting annotated disassembly.

Instruction Issue Grouping

Instructions are loaded and issued in groups known as instruction issue groups. Which instructions are in a group depends on the hardware, the instruction type, the instructions already being executed, and any dependencies on other instructions or registers. As a result, some instructions might be underrepresented because they are always issued in the same clock cycle as the previous instruction, so they never represent the next instruction to be executed. When the call stack is recorded, there might be several instructions that could be considered the next instruction to execute.

Instruction issue rules vary from one processor type to another, and depend on the instruction alignment within cache lines. Because the linker forces instruction alignment at a finer granularity than the cache line, changes in a function that might seem unrelated can cause different alignment of instructions. The different alignment can cause a performance improvement or degradation.

The following artificial situation shows the same function compiled and linked in slightly different circumstances. The two output examples shown below are annotated disassembly listings from the er_print utility. The instructions in the two examples are identical, but they are aligned differently in memory.

In the following output example, the instruction alignment maps the two instructions cmp and bl,a to different cache lines. A significant amount of time is used waiting to execute these two instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.010     0.010              [ 6]    1066c:  clr         %o0
   0.        0.                 [ 6]    10670:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    10674:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10678:  inc         2, %o0
## 1.360     1.360              [ 7]    1067c:  cmp         %o0, %o5
## 1.510     1.510              [ 7]    10680:  bl,a        0x1067c
   0.        0.                 [ 7]    10684:  inc         2, %o0
   0.        0.                 [ 7]    10688:  retl
   0.        0.                 [ 7]    1068c:  nop
                             8.     return i;
                             9. }

In the following output example, the instruction alignment maps the two instructions cmp and bl,a to the same cache line. A significant amount of time is used waiting to execute only one of these instructions.

   Excl.     Incl.
User CPU  User CPU
    sec.      sec.
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5.
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.        0.                 [ 6]    10684:  clr         %o0
   0.        0.                 [ 6]    10688:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    1068c:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10690:  inc         2, %o0
## 1.440     1.440              [ 7]    10694:  cmp         %o0, %o5
   0.        0.                 [ 7]    10698:  bl,a        0x10694
   0.        0.                 [ 7]    1069c:  inc         2, %o0
   0.        0.                 [ 7]    106a0:  retl
   0.        0.                 [ 7]    106a4:  nop
                             8.     return i;
                             9. }
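
The instruction addresses account for the difference between the two listings. Assuming, for illustration, a 32-byte instruction cache line, line boundaries fall at addresses that are multiples of 0x20. In the first example, the cmp at 0x1067c lies in the line that ends at 0x1067f while the bl,a at 0x10680 begins the next line, so the two instructions fall in different cache lines. In the second example, the cmp at 0x10694 and the bl,a at 0x10698 both lie in the line that begins at 0x10680, so they share a cache line and the wait time is reported against only one of them. The same grouping holds if the line size is 64 bytes.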

Instruction Issue Delay

Sometimes, specific leaf PCs appear more frequently because the instruction that they represent is delayed before issue. This appearance can occur for a number of reasons, some of which are listed below:

  • The previous instruction takes a long time to execute and is not interruptible, for example, when an instruction traps into the kernel.

  • An arithmetic instruction needs a register that is not available because the register contents were set by an earlier instruction that has not yet completed. An example of this sort of delay is a load instruction that has a data cache miss.

  • A floating-point arithmetic instruction is waiting for another floating-point instruction to complete. This situation occurs for instructions that cannot be pipelined, such as square root and floating-point divide.

  • The instruction cache does not include the memory word that contains the instruction (I-cache miss).

Attribution of Hardware Counter Overflows

Apart from TLB misses on some platforms and precise counters, the call stack for a hardware counter overflow event is recorded at some point further on in the sequence of instructions than the point at which the overflow occurred. This delay occurs for various reasons, including the time taken to handle the interrupt generated by the overflow. For some counters, such as cycles or instructions issued, this delay does not matter. For other counters, such as those counting cache misses or floating-point operations, the metric is attributed to a different instruction from the one responsible for the overflow.

Often the PC that caused the event is only a few instructions before the recorded PC, and the instruction can be correctly located in the disassembly listing. However, if a branch target is within this instruction range, it might be difficult or impossible to determine which instruction corresponds to the PC that caused the event.

Systems whose processors have counters labeled with the precise keyword allow memoryspace profiling without any special compilation of binaries. For example, the SPARC T2, SPARC T3, and SPARC T4 processors provide several precise counters. Run the collect -h command and look for the precise keyword to determine whether your system allows memoryspace profiling.

For example, running the following command on a system with the SPARC T4 processor shows the precise raw counters available:

% collect -h |& grep -i precise | grep -v alias
    Instr_ld[/{0|1|2|3}],1000003 (precise load-store events)
    Instr_st[/{0|1|2|3}],1000003 (precise load-store events)
    SW_prefetch[/{0|1|2|3}],1000003 (precise load-store events)
    Block_ld_st[/{0|1|2|3}],1000003 (precise load-store events)
    DC_miss_L2_L3_hit_nospec[/{0|1|2|3}],1000003 (precise load-store events)
    DC_miss_local_hit_nospec[/{0|1|2|3}],1000003 (precise load-store events)
    DC_miss_remote_L3_hit_nospec[/{0|1|2|3}],1000003 (precise load-store events)
    DC_miss_nospec[/{0|1|2|3}],1000003 (precise load-store events)
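
Once you have identified a precise counter, you can collect an experiment that uses it. The following command is a sketch only, assuming a target program named a.out and using the DC_miss_nospec counter listed above; the resulting experiment can then be examined with the er_print utility or Performance Analyzer:

% collect -h DC_miss_nospec,on ./a.out

Because the counter is precise, its events are attributed to the instructions that caused them rather than to a later PC.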