CHAPTER 5

Understanding the Performance Analyzer and Its Data

The Performance Analyzer reads the event data that is collected by the Collector and converts it into performance metrics. The metrics are computed for various elements in the structure of the target program, such as instructions, source lines, functions, and load objects. In addition to a header, the data recorded for each event collected has two parts: data that is common to all event types, and data that is specific to the event type (see Interpreting Performance Metrics).

The process of associating the metrics with the program structure is not always straightforward, due to the insertions, transformations, and optimizations made by the compiler. This chapter describes the process and discusses the effect on what you see in the Performance Analyzer displays.

This chapter covers the following topics:

How Data Collection Works
Interpreting Performance Metrics
Call Stacks and Program Execution
Mapping Addresses to Program Structure
Mapping Data Addresses to Program Data Objects


How Data Collection Works

The output from a data collection run is an experiment, which is stored as a directory with various internal files and subdirectories in the file system.

Experiment Format

All experiments must have three files:

In addition, experiments have binary data files representing the profile events in the life of the process. Each data file has a series of events, as described below under Interpreting Performance Metrics. Separate files are used for each type of data, but each file is shared by all LWPs in the target. The data files are named as follows:

TABLE 5-1 Data Types and Corresponding File Names

Data Type                  File Name
Clock profiling            profile
HW counter profiling       hwcounters
Synchronization tracing    synctrace
Heap tracing               heaptrace
MPI tracing                mpitrace


For clock profiling or hardware counter overflow profiling, the data is written in a signal handler invoked by the clock tick or counter overflow. For synchronization tracing, heap tracing, or MPI tracing, data is written from libcollector.so routines that are interposed, by means of LD_PRELOAD, on the normal user-invoked routines. Each interposition routine partially fills in a data record, invokes the normal user-invoked routine, fills in the rest of the data record when that routine returns, and then writes the record to the data file.
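The following C sketch illustrates the general interposition pattern just described, using pthread_mutex_lock() as the traced routine. It is not the actual libcollector.so code; the helpers now_ns() and record_sync_event() and the record contents are invented for illustration.

    /* Hypothetical sketch of the LD_PRELOAD interposition pattern described
       above; not the actual libcollector.so implementation. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <time.h>

    static long long now_ns(void)              /* stand-in for a high-resolution timer */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    static void record_sync_event(void *obj, long long request, long long grant)
    {
        /* libcollector.so would complete a trace record (request and grant
           timestamps, synchronization object address, call stack) and write
           it to the memory-mapped synctrace file if the wait time exceeds
           the threshold.  Here it is only a placeholder. */
        (void)obj; (void)request; (void)grant;
    }

    /* Interposed, via LD_PRELOAD, on the normal user-invoked routine. */
    int pthread_mutex_lock(pthread_mutex_t *mp)
    {
        static int (*real_lock)(pthread_mutex_t *);

        if (real_lock == NULL)                  /* locate the real routine */
            real_lock = (int (*)(pthread_mutex_t *))dlsym(RTLD_NEXT, "pthread_mutex_lock");

        long long request = now_ns();           /* partially fill in the record */
        int ret = real_lock(mp);                /* invoke the normal routine */
        record_sync_event(mp, request, now_ns()); /* fill in the rest and write it */
        return ret;
    }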

All data files are memory-mapped, and filled in blocks. The records are filled in such a way as to always have a valid record structure, so that experiments can be read as they are being written. The buffer management strategy is designed to minimize contention and serialization between LWPs.

The archive Subdirectory

Each experiment has an archive subdirectory, which contains binary files describing each load object referenced in the loadobjects file. These files are produced by er_archive, which runs at the end of data collection. If the process terminates abnormally, er_archive may not be invoked, in which case the archive files are written by er_print or the Analyzer when it is first invoked on the experiment.

Descendant Processes

Descendant processes write their experiments into subdirectories within the founder process's experiment. These subdirectories are named by adding an underscore, a code letter (f for fork and x for exec), and a number to the immediate creator's experiment name, giving the genealogy of the descendant. For example, if the experiment name for the founder process is test.1.er, the experiment for the child process created by its third fork is test.1.er/_f3.er. If that child process execs a new image, the corresponding experiment name is test.1.er/_f3_x1.er. Descendant experiments consist of the same files as the parent experiment, but they do not have descendant experiments (all descendants are represented by subdirectories in the founder experiment), and they do not have archive subdirectories (all archiving is done into the founder experiment).

Dynamic Functions

Experiments where the target creates dynamic functions have additional records in the loadobjects file describing those functions, and an additional file, dyntext, containing a copy of the actual instructions of the dynamic functions. The copy is needed to produce annotated disassembly of dynamic functions.

Java Experiments

Java experiments also have additional records in the loadobjects file, both for dynamic functions created by the JVM for its internal purposes, and for dynamically-compiled (HotSpot) versions of the target Java methods.

In addition, Java experiments have a JAVA_CLASSES file, containing information about all of the user's Java classes that are invoked.

Java heap tracing and synchronization tracing data is recorded using a JVMPI agent, which is part of libcollector.so. The agent receives events that are mapped into the recorded trace events. It also receives events for class loading and HotSpot compilation, which are used to write the JAVA_CLASSES file and the Java compiled-method records in the loadobjects file.

Recording Experiments

There are three different ways to record an experiment: with the collect command, with dbx creating a process, and with dbx creating an experiment from a running process.

collect Experiments

When collect is used to record an experiment, the collect program itself creates the experiment directory and sets LD_PRELOAD to ensure that libcollector.so is preloaded into the target's address space. It then sets environment variables to inform libcollector.so of the experiment name and the data collection options, and execs the target on top of itself.

libcollector.so is responsible for writing all experiment files.
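The following C fragment is a much-simplified, hypothetical sketch of that launch mechanism: preload a collector library, export an experiment name, and exec the target on top of the current process. The environment variable name SP_COLLECTOR_EXPERIMENT is invented here; the real collect command sets its own internal variables and performs considerably more setup.

    /* Hypothetical sketch of the collect launch mechanism described above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s target [args...]\n", argv[0]);
            return 1;
        }
        /* Ensure libcollector.so is preloaded into the target's address space. */
        setenv("LD_PRELOAD", "libcollector.so", 1);
        /* Invented name: tell the preloaded library where to write the experiment. */
        setenv("SP_COLLECTOR_EXPERIMENT", "test.1.er", 1);

        execvp(argv[1], &argv[1]);   /* exec the target on top of this process */
        perror("execvp");
        return 1;
    }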

dbx Experiments, Creating the Process

When dbx is used to launch a process with data collection enabled, it also creates the experiment directory and ensures preloading of libcollector.so. It stops the process at a breakpoint before its first instruction, and then calls an initialization routine in libcollector.so to start the data collection.

Java experiments cannot be collected by dbx, because dbx uses a JVMDI agent for debugging, and that agent cannot coexist with the JVMPI agent needed for data collection.

dbx Experiments, on a Running Process

When dbx is used to start an experiment on a running process, it creates the experiment directory but cannot use LD_PRELOAD. It makes an interactive function call into the target to dlopen libcollector.so, and then calls the libcollector.so initialization routine, just as it does when creating the process. Data is written by libcollector.so just as in a collect experiment.

Since libcollector.so was not in the target address space when the process started, any data collection that depends on interposition on user-callable functions (synchronization tracing, heap tracing, MPI tracing) may not work. In general, the symbols will already have been resolved to the underlying functions, so the interposition cannot happen. Furthermore, following descendant processes also depends on interposition, and does not work properly for experiments created by dbx on a running process.

If the user has explicitly LD_PRELOAD'ed libcollector.so before starting the process with dbx, or before using dbx to attach to the running process, tracing data may be collected.


Interpreting Performance Metrics

The data for each event contains a high-resolution timestamp, a thread ID, an LWP ID, and a processor ID. The first three of these can be used to filter the metrics in the Performance Analyzer by time, thread or LWP. See the getcpuid(2) man page for information on processor IDs. On systems where getcpuid is not available, the processor ID is -1, which maps to Unknown.

In addition to the common data, each event generates specific raw data, which is described in the following sections. Each section also contains a discussion of the accuracy of the metrics derived from the raw data and the effect of data collection on the metrics.

Clock-Based Profiling

The event-specific data for clock-based profiling consists of an array of profiling interval counts for each of the ten microstates maintained by the kernel for each LWP. At the end of the profiling interval, the count for the microstate of each LWP is incremented by 1, and a profiling signal is scheduled. The array is only recorded and reset when the LWP is in user mode in the CPU. If the LWP is in user mode when the profiling signal is scheduled, the array element for the User-CPU state is 1, and the array elements for all the other states are 0. If the LWP is not in user mode, the data is recorded when the LWP next enters user mode, and the array can contain an accumulation of counts for various states.

The call stack is recorded at the same time as the data. If the LWP is not in user mode at the end of the profiling interval, the call stack cannot change until the LWP enters user mode again. Thus the call stack always accurately records the position of the program counter at the end of each profiling interval.
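The following C structure is a hypothetical sketch of the shape of a clock-profiling packet as described above; the actual record format written by libcollector.so is internal to the tools and may differ, and all field names here are invented.

    /* Hypothetical layout of a clock-profiling packet; field names invented. */
    #include <stdint.h>

    #define N_MICROSTATES 10        /* LMS_USER ... LMS_WAIT_CPU; see TABLE 5-2 */

    typedef struct {
        uint64_t timestamp;                  /* high-resolution timestamp (common data) */
        uint32_t thread_id;                  /* common data: thread, LWP, and CPU IDs */
        uint32_t lwp_id;
        int32_t  cpu_id;                     /* -1 where getcpuid(2) is unavailable */
        uint32_t ticks[N_MICROSTATES];       /* profiling-interval counts per microstate */
        /* ... followed by the recorded call stack ... */
    } clock_profile_packet;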

The metrics to which each of the microstates contributes are shown in TABLE 5-2.

TABLE 5-2 How Kernel Microstates Contribute to Metrics

Kernel Microstate   Description                                 Metric Name
LMS_USER            Running in user mode                        User CPU Time
LMS_SYSTEM          Running in system call or page fault        System CPU Time
LMS_TRAP            Running in any other trap                   System CPU Time
LMS_TFAULT          Asleep in user text page fault              Text Page Fault Time
LMS_DFAULT          Asleep in user data page fault              Data Page Fault Time
LMS_KFAULT          Asleep in kernel page fault                 Other Wait Time
LMS_USER_LOCK       Asleep waiting for user-mode lock           User Lock Time
LMS_SLEEP           Asleep for any other reason                 Other Wait Time
LMS_STOPPED         Stopped (/proc, job control, or lwp_stop)   Other Wait Time
LMS_WAIT_CPU        Waiting for CPU                             Wait CPU Time


Accuracy of Timing Metrics

Timing data is collected on a statistical basis, and is therefore subject to all the errors of any statistical sampling method. For very short runs, in which only a small number of profile packets is recorded, the call stacks might not represent the parts of the program which consume the most resources. You should run your program for long enough or enough times to accumulate hundreds of profile packets for any function or source line you are interested in.

In addition to statistical sampling errors, there are specific errors that arise from the way the data is collected and attributed and the way the program progresses through the system. Some of the circumstances in which inaccuracies or distortions can appear in the timing metrics are described in what follows.

In addition to the inaccuracies just described, timing metrics are distorted by the process of collecting data. The time spent recording profile packets never appears in the metrics for the program, because the recording is initiated by the profiling signal. (This is another instance of correlation.) The user CPU time spent in the recording process is distributed over whatever microstates are recorded. The result is an underaccounting of the User CPU Time metric and an overaccounting of other metrics. The amount of time spent recording data is typically less than one percent of the CPU time for the default profiling interval.

Comparisons of Timing Metrics

If you compare timing metrics obtained from the profiling done in a clock-based experiment with times obtained by other means, you should be aware of the following issues.

For a single-threaded application, the total LWP time recorded for a process is usually accurate to a few tenths of a percent, compared with the values returned by gethrtime(3C) for the same process. The CPU time can vary by several percentage points from the values returned by gethrvtime(3C) for the same process. Under heavy load, the variation might be even more pronounced. However, the CPU time differences do not represent a systematic distortion, and the relative times reported for different functions, source-lines, and such are not substantially distorted.

For multithreaded applications using unbound threads, differences in values returned by gethrvtime() could be meaningless. This is because gethrvtime() returns values for an LWP, and a thread can change from one LWP to another.

The LWP times that are reported in the Performance Analyzer can differ substantially from the times that are reported by vmstat, because vmstat reports times that are summed over CPUs. If the target process has more LWPs than the system on which it is running has CPUs, the Performance Analyzer shows more wait time than vmstat reports.

The microstate timings that appear in the Statistics tab of the Performance Analyzer and the er_print statistics display are based on process file system usage reports, for which the times spent in the microstates are recorded to high accuracy. See the proc(4) man page for more information. You can compare these timings with the metrics for the <Total> function, which represents the program as a whole, to gain an indication of the accuracy of the aggregated timing metrics. However, the values displayed in the Statistics tab can include other contributions that are not included in the timing metric values for <Total>. These contributions come from the following sources:

Synchronization Wait Tracing

The Collector collects synchronization delay events by tracing calls to the functions in the threads library, libthread.so, or to the real time extensions library, librt.so. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), and the address of the synchronization object (the mutex lock being requested, for example). The thread and LWP IDs are the IDs at the time the data is recorded. The wait time is the difference between the request time and the grant time. Only events for which the wait time exceeds the specified threshold are recorded. The synchronization wait tracing data is recorded in the experiment at the time of the grant.

If the program uses bound threads, the LWP on which the waiting thread is scheduled cannot perform any other work until the event that caused the delay is completed. The time spent waiting appears both as Synchronization Wait Time and as User Lock Time. User Lock Time can be larger than Synchronization Wait Time because the synchronization delay threshold screens out delays of short duration.

If the program uses unbound threads, it is possible for the LWP on which the waiting thread is scheduled to have other threads scheduled on it and continue to perform user work. The User Lock Time is zero if all LWPs are kept busy while some threads are waiting for a synchronization event. However, the Synchronization Wait Time is not zero because it is associated with a particular thread, not with the LWP on which the thread is running.

The wait time is distorted by the overhead for data collection. The overhead is proportional to the number of events collected. The fraction of the wait time spent in overhead can be minimized by increasing the threshold for recording events.

Hardware-Counter Overflow Profiling

Hardware-counter overflow profiling data includes a counter ID and the overflow value. The value can be larger than the value at which the counter is set to overflow, because the processor executes some instructions between the overflow and the recording of the event. This is especially true of cycle and instruction counters, which are incremented much more frequently than counters such as floating-point operations or cache misses. The delay in recording the event also means that the program counter address recorded with the call stack does not correspond exactly to the overflow event.

See Attribution of Hardware Counter Overflows for more information. See also the discussion of Traps. Traps and trap handlers can cause significant differences between reported User CPU time and time reported by the cycle counter.

The amount of data collected depends on the overflow value. Choosing a value that is too small can have the following consequences.

Choosing a value that is too large can result in too few overflows for good statistics. The counts that are accrued after the last overflow are attributed to the collector function collector_final_counters. If you see a substantial fraction of the counts in this function, the overflow value is too large.

Heap Tracing

The Collector records tracing data for calls to the memory allocation and deallocation functions malloc, realloc, memalign and free by interposing on these functions. If your program bypasses these functions to allocate memory, tracing data is not recorded. Tracing data is not recorded for Java memory management, which uses a different mechanism.

The functions that are traced could be loaded from any of a number of libraries. The data that you see in the Performance Analyzer might depend on the library from which a given function is loaded.

If a program makes a large number of calls to the traced functions in a short space of time, the time taken to execute the program can be significantly lengthened. The extra time is used in recording the tracing data.

MPI Tracing

MPI tracing records information about calls to MPI library functions. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), the number of send and receive operations and the number of bytes sent or received. Tracing is done by interposing on the calls to the MPI library. The interposing functions do not have detailed information about the optimization of data transmission, nor about transmission errors, so the information that is presented represents a simple model of the data transmission, which is explained in the following paragraphs.

The number of bytes received is the length of the buffer as defined in the call to the MPI function. The actual number of bytes received is not available to the interposing function.

Some of the Global Communication functions have a single origin or a single receiving process known as the root. The accounting for such functions is done as follows:

The following examples illustrate the accounting procedure. In these examples, G is the size of the group.

For a call to MPI_Bcast(),

For a call to MPI_Allreduce(),

For a call to MPI_Reduce_scatter(),


Call Stacks and Program Execution

A call stack is a series of program counter addresses (PCs) representing instructions from within the program. The first PC, called the leaf PC, is at the bottom of the stack, and is the address of the next instruction to be executed. The next PC is the address of the call to the function containing the leaf PC; the next PC is the address of the call to that function, and so forth, until the top of the stack is reached. Each such address is known as a return address. The process of recording a call stack involves obtaining the return addresses from the program stack and is referred to as unwinding the stack. For information on unwind failures, see Incomplete Stack Unwinds.

The leaf PC in a call stack is used to assign exclusive metrics from the performance data to the function in which that PC is located. Each PC on the stack, including the leaf PC, is used to assign inclusive metrics to the function in which it is located.

Most of the time, the PCs in the recorded call stack correspond in a natural way to functions as they appear in the source code of the program, and the Performance Analyzer's reported metrics correspond directly to those functions. Sometimes, however, the actual execution of the program does not correspond to a simple intuitive model of how the program would execute, and the Performance Analyzer's reported metrics might be confusing. See Mapping Addresses to Program Structure for more information about such cases.

Single-Threaded Execution and Function Calls

The simplest case of program execution is that of a single-threaded program calling functions within its own load object.

When a program is loaded into memory to begin execution, a context is established for it that includes the initial address to be executed, an initial register set, and a stack (a region of memory used for scratch data and for keeping track of how functions call each other). The initial address is always at the beginning of the function _start(), which is built into every executable.

When the program runs, instructions are executed in sequence until a branch instruction is encountered, which among other things could represent a function call or a conditional statement. At the branch point, control is transferred to the address given by the target of the branch, and execution proceeds from there. (Usually the next instruction after the branch is already committed for execution: this instruction is called the branch delay slot instruction. However, some branch instructions annul the execution of the branch delay slot instruction.)

When the instruction sequence that represents a call is executed, the return address is put into a register, and execution proceeds at the first instruction of the function being called.

In most cases, somewhere in the first few instructions of the called function, a new frame (a region of memory used to store information about the function) is pushed onto the stack, and the return address is put into that frame. The register used for the return address can then be used when the called function itself calls another function. When the function is about to return, it pops its frame from the stack, and control returns to the address from which the function was called.

Function Calls Between Shared Objects

When a function in one shared object calls a function in another shared object, the execution is more complicated than in a simple call to a function within the program. Each shared object contains a Procedure Linkage Table, or PLT, which contains entries for every function external to that shared object that is referenced from it. Initially the address for each external function in the PLT is actually an address within ld.so, the dynamic linker. The first time such a function is called, control is transferred to the dynamic linker, which resolves the call to the real external function and patches the PLT address for subsequent calls.

If a profiling event occurs during the execution of one of the three PLT instructions, the PLT PCs are deleted, and exclusive time is attributed to the call instruction. If a profiling event occurs during the first call through a PLT entry, but the leaf PC is not one of the PLT instructions, any PCs that arise from the PLT and code in ld.so are replaced by a call to an artificial function, @plt, which accumulates inclusive time. There is one such artificial function for each shared object. If the program uses the LD_AUDIT interface, the PLT entries might never be patched, and non-leaf PCs from @plt can occur more frequently.

Signals

When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system function, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another.

The Performance Analyzer treats the frames resulting from signal delivery as ordinary frames. The user code at the point at which the signal was delivered is shown as calling the system function sigacthandler(), and it in turn is shown as calling the user's signal handler. Inclusive metrics from both sigacthandler() and any user signal handler, and any other functions they call, appear as inclusive metrics for the interrupted function.

The Collector interposes on sigaction() to ensure that its handlers are the primary handlers for the SIGPROF signal when clock data is collected, and for the SIGEMT signal when hardware counter data is collected.

Traps

Traps can be issued by an instruction or by the hardware, and are caught by a trap handler. System traps are traps that are initiated from an instruction and trap into the kernel. All system calls are implemented using trap instructions, for example. Some examples of hardware traps are those issued from the floating-point unit when it is unable to complete an instruction (such as the fitos instruction on the UltraSPARC® III platform), or when the instruction is not implemented in the hardware.

When a trap is issued, the LWP enters system mode. The microstate is usually switched from User CPU state to Trap state, and then to System state. The time spent handling the trap can show as a combination of System CPU time and User CPU time, depending on the point at which the microstate is switched. The time is attributed to the instruction in the user's code from which the trap was initiated (or to the system call).

For some system calls, it is considered critical to provide as efficient handling of the call as possible. The traps generated by these calls are known as fast traps. Among the system functions which generate fast traps are gethrtime and gethrvtime. In these functions, the microstate is not switched because of the overhead involved.

In other circumstances it is also considered critical to provide as efficient handling of the trap as possible. Some examples of these are TLB (translation lookaside buffer) misses and register window spills and fills, for which the microstate is not switched.

In both cases, the time spent is recorded as User CPU time. However, the hardware counters are turned off because the mode has been switched to system mode. The time spent handling these traps can therefore be estimated by taking the difference between User CPU time and Cycles time, preferably recorded in the same experiment.

There is one case in which the trap handler switches back to user mode, and that is the misaligned memory reference trap for an 8-byte integer which is aligned on a 4-byte boundary in Fortran. A frame for the trap handler appears on the stack, and a call to the handler can appear in the Performance Analyzer, attributed to the integer load or store instruction.

When an instruction traps into the kernel, the instruction following the trapping instruction appears to take a long time, because it cannot start until the kernel has finished executing the trapping instruction.

Tail-Call Optimization

The compiler can do one particular optimization whenever the last thing a particular function does is to call another function. Rather than generating a new frame, the callee re-uses the frame from the caller, and the return address for the callee is copied from the caller. The motivation for this optimization is to reduce the size of the stack, and, on SPARC® platforms, to reduce the use of register windows.

Suppose that the call sequence in your program source looks like this:

A -> B -> C -> D

When B and C are tail-call optimized, the call stack looks as if function A calls functions B, C, and D directly.

A -> B
A -> C
A -> D

That is, the call tree is flattened. When code is compiled with the -g option, tail-call optimization takes place only at a compiler optimization level of 4 or higher. When code is compiled without the -g option, tail-call optimization takes place at a compiler optimization level of 2 or higher.
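The following C sketch (with hypothetical function names) shows where such tail calls arise. At optimization levels that enable the transformation, the calls made by B and C are tail-call optimized, so a profile taken while D executes shows A as the apparent caller of D.

    /* Hypothetical example of tail calls; A's call to B is not a tail call. */
    #include <stdio.h>

    int D(int x) { return x + 1; }
    int C(int x) { return D(x * 2); }   /* last action is a call: a tail call */
    int B(int x) { return C(x + 3); }   /* likewise a tail call */
    int A(int x) { return B(x) + 5; }   /* work remains after B returns: not a tail call */

    int main(void)
    {
        printf("%d\n", A(7));
        return 0;
    }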

Explicit Multithreading

A simple program executes in a single thread, on a single LWP (light-weight process). Multithreaded executables make calls to a thread creation function, to which the target function for execution is passed. When the target exits, the thread is destroyed by the threads library. Newly-created threads begin execution at a function called _thread_start(), which calls the function passed in the thread creation call. For any call stack involving the target as executed by this thread, the top of the stack is _thread_start(), and there is no connection to the caller of the thread creation function. Inclusive metrics associated with the created thread therefore only propagate up as far as _thread_start() and the <Total> function.
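The following minimal pthread example (with hypothetical names) illustrates this. With the threads library described above, profile packets recorded while worker() runs have call stacks rooted at _thread_start() rather than at main(), so the inclusive metrics for worker() do not propagate into main().

    /* Minimal sketch of thread creation; worker() is an illustrative target function. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        long n = 0;
        (void)arg;
        for (long i = 0; i < 100000000L; i++)   /* burn some CPU time to profile */
            n += i;
        return (void *)n;
    }

    int main(void)
    {
        pthread_t tid;
        void *result;

        pthread_create(&tid, NULL, worker, NULL);   /* the thread creation call */
        pthread_join(tid, &result);                 /* wait for the worker to exit */
        printf("worker returned %ld\n", (long)result);
        return 0;
    }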

In addition to creating the threads, the threads library also creates LWPs to execute the threads. Threading can be done either with bound threads, where each thread is bound to a specific LWP, or with unbound threads, where each thread can be scheduled on a different LWP at different times.

As an example of the scheduling of unbound threads, when a thread is at a synchronization barrier such as a mutex_lock, the threads library can schedule a different thread on the LWP on which the first thread was executing. The time spent waiting for the lock by the thread that is at the barrier appears in the Synchronization Wait Time metric, but since the LWP is not idle, the time is not accrued into the User Lock Time metric.

In addition to the user threads, the standard threads library in the Solaris Operating System versions 7 and 8 creates some threads that are used to perform signal handling and other tasks. If the program uses bound threads, additional LWPs are also created for these threads. Performance data is not collected or displayed for these threads, which spend most of their time sleeping. However, the time spent in these threads is included in the process statistics and in the times recorded in the sample data. The threads library in the Solaris 9 Operating System and the alternate threads library in the Solaris 8 Operating System do not create these extra threads.

Overview of Java Technology-Based Software Execution

To the typical developer, a Java™ technology-based application runs just like any other program. It begins at a main entry point, typically named class.main, which may call other methods, just as a C or C++ application does.

To the operating system, an application written in the Java programming language (pure or mixed with C/C++) runs as a process instantiating the Java virtual machine (JVM). The JVM™ software is compiled from C++ sources and starts execution at _start, which calls main, and so forth. It reads bytecode from .class and/or .jar files, and performs the operations specified in that program. Among the operations that can be specified is the dynamic loading of a shared object, and calls into various functions or methods contained within that object.

During execution of a Java technology-based application, most methods are interpreted by the JVM software; these methods are referred to in this document as interpreted methods. Other methods may be dynamically compiled by the Java HotSpot™ virtual machine, and are referred to as compiled methods. Dynamically compiled methods are loaded into the data space of the application, and may be unloaded at some later point in time. For any particular method, there will be an interpreted version, and there may also be one or more compiled versions. Code written in the Java programming language may also call directly into native-compiled code, either C, C++, or native-compiled (SBA SPARC Bytecode Accelerator) Java; the targets of such calls are referred to as native methods.

The JVM software does a number of things that are typically not done by applications written in traditional languages. At startup, it creates a number of regions of dynamically-generated code in its data space. One of these is the actual interpreter code used to process the application's bytecode methods.

During the interpretive execution, the Java HotSpot virtual machine monitors performance, and may decide to take one or more methods that it has been interpreting, generate machine code for them, and execute the more-efficient machine code version, rather than interpret the original. That generated machine code is also in the data space of the process. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.

Applications written in the Java programming language are inherently multithreaded, and have one JVM software thread for each thread in the user's program. The JVM software also has several housekeeping threads used for signal handling, memory management, and Java HotSpot virtual machine compilation. Depending on the version of libthread.so used, there may be a one-to-one correspondence between threads and LWPs, or a more complex relationship. But for any version of the library, at any instant a thread may be unscheduled, or scheduled onto an LWP. Data for a thread is not collected while that thread is not scheduled onto an LWP.

The performance tools collect their data by recording events in the life of each LWP of the process, along with the callstack at the time of the event. At any point in the execution of any application, Java technology-based or otherwise, the callstack represents where the program is in its execution, and how it got there. One important way that mixed-model Java technology-based applications differ from traditional C, C++, and Fortran applications is that at any instant during the run of the target there are two callstacks that are meaningful: a Java callstack and a machine callstack. Both stacks are collected and correlated with each other.

Java Processing Representations

There are three representations for displaying performance data for applications written in the Java programming language: the Java representation, the Expert-Java representation, and the Machine representation. The Java Representation is shown by default where the data supports it. The following section summarizes the main differences between these three representations.

The Java Representation

The Java representation shows compiled and interpreted Java methods by name, and shows native methods in their natural form. During execution, there may be many instances of a particular Java method executed: the interpreted version, and, perhaps, one or more compiled versions. In the Java representation all methods are shown aggregated as a single method. This mode is selected in the analyzer by default.

The Expert-Java Representation

The Expert-Java representation is similar to the Java Representation, except that some details of the JVM internals that are suppressed in the Java Representation are exposed in the Expert-Java Representation.

The Machine Representation

The Machine representation shows functions from the JVM itself, rather than from the application being interpreted by the JVM. It also shows all compiled and native methods. The machine representation looks the same as that of applications written in traditional languages. The callstack shows JVM frames, native frames, and compiled-method frames. Some of the JVM frames represent transition code between interpreted Java, compiled Java, and native code.

Parallel Execution and Compiler-Generated Body Functions

If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution. OpenMP is a feature available with the Sun compilers and tools. Refer to the OpenMP API User's Guide and the relevant sections in the Fortran Programming Guide and C User's Guide, or visit the web site defining the OpenMP standard, http://www.openmp.org.

When a loop or other parallel construct is compiled for parallel execution, the compiler-generated code is executed by multiple threads, coordinated by the microtasking library. Parallelization by the Sun compilers follows the procedure outlined below.

Generation of Body Functions

When the compiler encounters a parallel construct, it sets up the code for parallel execution by placing the body of the construct in a separate body function and replacing the construct with a call to a microtasking library function. The microtasking library function is responsible for dispatching threads to execute the body function. The address of the body function is passed to the microtasking library function as an argument.

If the parallel construct is delimited with one of the directives in the following list, then the construct is replaced with a call to the microtasking library function __mt_MasterFunction_().

A loop that is parallelized automatically by the compiler is also replaced by a call to __mt_MasterFunction_().

If an OpenMP parallel construct contains one or more worksharing do, for, or sections directives, each worksharing construct is replaced by a call to the microtasking library function __mt_Worksharing_(), and a new body function is created for each.

The compiler assigns names to body functions that encode the type of parallel construct, the name of the function from which the construct was extracted, the line number of the beginning of the construct in the original source, and the sequence number of the parallel construct. These mangled names vary from release to release of the microtasking library.
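As a concrete illustration, the following hypothetical OpenMP fragment in C contains one parallel construct. When it is compiled for parallel execution, the compiler extracts the loop body into a compiler-generated body function and replaces the construct with a call to __mt_MasterFunction_(); the mangled name of that body function encodes the construct type, the parent function name (here scale), and the line number of the construct.

    /* Hypothetical OpenMP example; the parallel loop body becomes a
       compiler-generated body function dispatched by the microtasking library. */
    void scale(double *a, double s, int n)
    {
        #pragma omp parallel for            /* parallel construct */
        for (int i = 0; i < n; i++)
            a[i] = s * a[i];
    }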

Parallel Execution Sequence

The program begins execution with only one thread, the main thread. The first time the program calls __mt_MasterFunction_(), this function calls the Solaris threads library function thr_create() to create worker threads. Each worker thread executes the microtasking library function __mt_SlaveFunction_(), which was passed as an argument to thr_create().

In addition to worker threads, the standard threads library in the Solaris Operating System versions 7 and 8 creates some threads to perform signal handling and other tasks. Performance data is not collected for these threads, which spend most of their time sleeping. However, the time spent in these threads is included in the process statistics and the times recorded in the sample data. The threads library in the Solaris 9 Operating System and the alternate threads library in the Solaris 8 Operating System do not create these extra threads.

Once the threads have been created, __mt_MasterFunction_() manages the distribution of available work among the main thread and the worker threads. If work is not available, __mt_SlaveFunction_() calls __mt_WaitForWork_(), in which the worker thread waits for available work. As soon as work becomes available, the thread returns to __mt_SlaveFunction_().

When work is available, each thread executes a call to __mt_run_my_job_(), to which information about the body function is passed. The sequence of execution from this point depends on whether the body function was generated from a parallel sections directive, a parallel do (or parallel for) directive, a parallel workshare directive, or a parallel directive.

 FIGURE 5-1 Schematic Call Tree for a Multithreaded Program That Contains a Parallel Do or Parallel For Construct

When all parallel work is finished, the threads return to either __mt_MasterFunction_() or __mt_SlaveFunction_() and call __mt_EndOfTaskBarrier_() to perform any synchronization work involved in the termination of the parallel construct. The worker threads then call __mt_WaitForWork_() again, while the main thread continues to execute in the serial region.

The call sequence described here applies not only to a program running in parallel, but also to a program compiled for parallelization but running on a single-CPU machine, or on a multiprocessor machine using only one LWP.

The call sequence for a simple parallel do construct is illustrated in FIGURE 5-1. The call stack for a worker thread begins with the threads library function _thread_start(), the function which actually calls __mt_SlaveFunction_(). The dotted arrow indicates the initiation of the thread as a consequence of a call from __mt_MasterFunction_() to thr_create(). The continuing arrows indicate that there might be other function calls which are not represented here.

The call sequence for a parallel region in which there is a worksharing do construct is illustrated in FIGURE 5-2. The caller of __mt_run_my_job_() is either __mt_MasterFunction_() or __mt_SlaveFunction_(). The entire diagram can replace the call to __mt_run_my_job_() in FIGURE 5-1.

 FIGURE 5-2 Schematic Call Tree for a Parallel Region With a Worksharing Do or Worksharing For Construct

In these call sequences, all the compiler-generated body functions are called from the same function (or functions) in the microtasking library, which makes it difficult to associate the metrics from the body function with the original user function. The Performance Analyzer inserts an imputed call to the body function from the original user function, and the microtasking library inserts an imputed call from the body function to the barrier function, __mt_EndOfTaskBarrier_(). The metrics due to the synchronization are therefore attributed to the body function, and the metrics for the body function are attributed to the original function. With these insertions, inclusive metrics from the body function propagate directly to the original function rather than through the microtasking library functions. The side effect of these imputed calls is that the body function appears as a callee of both the original user function and the microtasking functions. In addition, the user function appears to have microtasking library functions as its callers, and can appear to call itself. Double-counting of inclusive metrics is avoided by the mechanism used for recursive function calls (see How Recursion Affects Function-Level Metrics).

Worker threads typically use CPU time while they are in __mt_WaitForWork_() in order to reduce latency when new work arrives, that is, when the main thread reaches a new parallel construct. This is known as a busy-wait. However, you can set an environment variable to specify a sleep wait, which shows up in the Performance Analyzer as Other Wait time instead of User CPU time. There are generally two situations where the worker threads spend time waiting for work, where you might want to redesign your program to reduce the waiting:

By default, the microtasking library uses threads that are bound to LWPs. You can override this default in the Solaris Operating System versions 7 and 8 by setting the environment variable MT_BIND_LWP to FALSE.



Note - The multiprocessing dispatch process is implementation-dependent and might change from release to release.



Incomplete Stack Unwinds

Stack unwind may fail for a number of reasons:

On any platform, hand-written assembler code may violate the conventions.

Intermediate Files

If you generate intermediate files using the -E or -P compiler options, the Performance Analyzer uses the intermediate file for annotated source code, not the original source file. The #line directives generated with -E can cause problems in the assignment of metrics to source lines.

The following line appears in annotated source if there are instructions from a function that do not have line numbers referring to the source file that was compiled to generate the function:

function_name -- <instructions without line numbers>

Line numbers can be absent under the following circumstances:


Mapping Addresses to Program Structure

Once a call stack is processed into PC values, the Performance Analyzer maps those PCs to shared objects, functions, source lines, and disassembly lines (instructions) in the program. This section describes those mappings.

The Process Image

When a program is run, a process is instantiated from the executable for that program. The process has a number of regions in its address space, some of which are text and represent executable instructions, and some of which are data which is not normally executed. PCs as recorded in the call stack normally correspond to addresses within one of the text segments of the program.

The first text section in a process derives from the executable itself. Others correspond to shared objects that are loaded with the executable, either at the time the process is started, or dynamically loaded by the process. The PCs in a call stack are resolved based on the executable and shared objects loaded at the time the call stack was recorded. Executables and shared objects are very similar, and are collectively referred to as load objects.

Because shared objects can be loaded and unloaded in the course of program execution, any given PC might correspond to different functions at different times during the run. In addition, different PCs might correspond to the same function, when a shared object is unloaded and then reloaded at a different address.

Load Objects and Functions

Each load object, whether an executable or a shared object, contains a text section with the instructions generated by the compiler, a data section for data, and various symbol tables. All load objects must contain an ELF symbol table, which gives the names and addresses of all the globally-known functions in that object. Load objects compiled with the -g option contain additional symbolic information, which can augment the ELF symbol table and provide information about functions that are not global, additional information about object modules from which the functions came, and line number information relating addresses to source lines.

The term function is used to describe a set of instructions that represent a high-level operation described in the source code. The term covers subroutines as used in Fortran, methods as used in C++ and Java, and the like. Functions are described cleanly in the source code, and normally their names appear in the symbol table representing a set of addresses; if the program counter is within that set, the program is executing within that function.

In principle, any address within the text segment of a load object can be mapped to a function. Exactly the same mapping is used for the leaf PC and all the other PCs on the call stack. Most of the functions correspond directly to the source model of the program. Some do not; these functions are described in the following sections.

Aliased Functions

Typically, functions are defined as global, meaning that their names are known everywhere in the program. The name of a global function must be unique within the executable. If there is more than one global function of a given name within the address space, the runtime linker resolves all references to one of them. The others are never executed, and so do not appear in the function list. In the Summary tab, you can see the shared object and object module that contain the selected function.

Under various circumstances, a function can be known by several different names. A very common example of this is the use of so-called weak and strong symbols for the same piece of code. A strong name is usually the same as the corresponding weak name, except that it has a leading underscore. Many of the functions in the threads library also have alternate names for pthreads and Solaris threads, as well as strong and weak names and alternate internal symbols. In all such cases, only one name is used in the function list of the Performance Analyzer. The name chosen is the last symbol at the given address in alphabetic order. This choice most often corresponds to the name that the user would use. In the Summary tab, all the aliases for the selected function are shown.
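The following small C example (with hypothetical names, using the #pragma weak supported by the Sun and GNU compilers) shows how one piece of code can acquire both a strong and a weak name. Both symbols resolve to the same address, so the Performance Analyzer lists the function once, under the name chosen as described above.

    /* Hypothetical weak/strong aliasing: _lockit is the strong name and
       lockit is a weak alias for the same code. */
    int _lockit(void)
    {
        return 0;                     /* the single underlying implementation */
    }

    extern int lockit(void);
    #pragma weak lockit = _lockit     /* weak alias without the leading underscore */

    int main(void)
    {
        return lockit() + _lockit();  /* both names call the same code */
    }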

Non-Unique Function Names

While aliased functions reflect multiple names for the same piece of code, there are circumstances under which multiple pieces of code have the same name:

Static Functions From Stripped Shared Libraries

Static functions are often used within libraries, so that the name used internally in a library does not conflict with a name that the user might use. When libraries are stripped, the names of static functions are deleted from the symbol table. In such cases, the Performance Analyzer generates an artificial name for each text region in the library containing stripped static functions. The name is of the form <static>@0x12345, where the string following the @ sign is the offset of the text region within the library. The Performance Analyzer cannot distinguish between contiguous stripped static functions and a single such function, so two or more such functions can appear with their metrics coalesced.

Stripped static functions are shown as called from the correct caller, except when the PC from the static function is a leaf PC that appears after the save instruction in the static function. Without the symbolic information, the Performance Analyzer does not know the save address, and cannot tell whether to use the return register as the caller. It always ignores the return register. Since several functions can be coalesced into a single <static>@0x12345 function, the real caller or callee might not be distinguished from the adjacent functions.

Fortran Alternate Entry Points

Fortran provides a way of having multiple entry points to a single piece of code, allowing a caller to call into the middle of a function. When such code is compiled, it consists of a prologue for the main entry point, a prologue to the alternate entry point, and the main body of code for the function. Each prologue sets up the stack for the function's eventual return and then branches or falls through to the main body of code.

The prologue code for each entry point always corresponds to a region of text that has the name of that entry point, but the code for the main body of the subroutine receives only one of the possible entry point names. The name received varies from one compiler to another.

The prologues rarely account for any significant amount of time, and the functions corresponding to entry points other than the one that is associated with the main body of the subroutine rarely appear in the Performance Analyzer. Call stacks representing time in Fortran subroutines with alternate entry points usually have PCs in the main body of the subroutine, rather than the prologue, and only the name associated with the main body will appear as a callee. Likewise, all calls from the subroutine are shown as being made from the name associated with the main body of the subroutine.

Cloned Functions

The compilers have the ability to recognize calls to a function for which extra optimization can be performed. An example of such calls is a call to a function for which some of the arguments are constants. When the compiler identifies particular calls that it can optimize, it creates a copy of the function, which is called a clone, and generates optimized code. The clone function name is a mangled name that identifies the particular call. The Analyzer demangles the name, and presents each instance of a cloned function separately in the function list. Each cloned function has a different set of instructions, so the annotated disassembly listing shows the cloned functions separately. Each cloned function has the same source code, so the annotated source listing sums the data over all copies of the function.

Inlined Functions

An inlined function is a function for which the instructions generated by the compiler are inserted at the call site of the function instead of an actual call. There are two kinds of inlining, both of which are done to improve performance, and both of which affect the Performance Analyzer.

Both kinds of inlining have the same effect on the display of metrics. Functions that appear in the source code but have been inlined do not show up in the function list, nor do they appear as callees of the functions into which they have been inlined. Metrics that would otherwise appear as inclusive metrics at the call site of the inlined function, representing time spent in the called function, are actually shown as exclusive metrics attributed to the call site, representing the instructions of the inlined function.



Note - Inlining can make data difficult to interpret, so you might want to disable inlining when you compile your program for performance analysis.



In some cases, even when a function is inlined, a so-called out-of-line function is left. Some call sites call the out-of-line function, but others have the instructions inlined. In such cases, the function appears in the function list but the metrics attributed to it represent only the out-of-line calls.

Compiler-Generated Body Functions

When a compiler parallelizes a loop in a function, or a region that has parallelization directives, it creates new body functions that are not in the original source code. These functions are described in Parallel Execution and Compiler-Generated Body Functions.

The Performance Analyzer shows these functions as normal functions, and assigns a name to them based on the function from which they were extracted, in addition to the compiler-generated name. Their exclusive and inclusive metrics represent the time spent in the body function. In addition, the function from which the construct was extracted shows inclusive metrics from each of the body functions. The means by which this is achieved is described in Parallel Execution Sequence.

When a function containing parallel loops is inlined, the names of its compiler-generated body functions reflect the function into which it was inlined, not the original function.



Note - The names of compiler-generated body functions can only be demangled for modules compiled with -g.



Outline Functions

Outline functions can be created during feedback optimization. They represent code that is not normally executed. Specifically, it is code that is not executed during the training run that is used to generate the feedback. To improve paging and instruction-cache behavior, such code is moved elsewhere in the address space, and is made into a separate function. The name of the outline function encodes information about the section of outlined code, including the name of the function from which the code was extracted and the line number of the beginning of the section in the source code. These mangled names can vary from release to release. The Performance Analyzer provides a readable version of the function name.

Outline functions are not really called, but rather are jumped to; similarly they do not return, they jump back. In order to make the behavior more closely match the user's source code model, the Performance Analyzer imputes an artificial call from the main function to its outline portion.

Outline functions are shown as normal functions, with the appropriate inclusive and exclusive metrics. In addition, the metrics for the outline function are added as inclusive metrics in the function from which the code was outlined.

Dynamically Compiled Functions

Dynamically compiled functions are functions that are compiled and linked while the program is executing. The Collector has no information about dynamically compiled functions that are written in C or C++, unless the user supplies the required information using the Collector API functions. See Dynamic Functions and Modules for information about the API functions. If information is not supplied, the function appears in the performance analysis tools as <Unknown>.

For Java programs, the Collector obtains information on methods that are compiled by the Java HotSpot™ virtual machine, and there is no need to use the API functions to provide the information. For other methods, the performance tools show information for the Java™ virtual machine that executes the methods. In Java mode, all methods are merged with the interpreted version. In Expert-Java mode, the HotSpot-compiled and interpreted versions of each method are both shown separately. In the machine mode, each HotSpot-compiled version is shown separately, and JVM functions are shown for each interpreted method.

The <Unknown> Function

Under some circumstances, a PC does not map to a known function. In such cases, the PC is mapped to the special function named <Unknown>.

The following circumstances show PCs mapping to <Unknown>:

Callers and callees of the <Unknown> function represent the previous and next PCs in the call stack, and are treated normally.

The <no Java callstack recorded> Function

The <no Java callstack recorded> function is similar to the <Unknown> function, but for Java threads, in the Java representation only. When the collector receives an event from a Java thread, it unwinds the native stack and calls into the Java virtual machine to obtain the corresponding Java stack. If that call fails for any reason, the event is shown in the analyzer with the artificial function <no Java callstack recorded>. The JVM may refuse to report a callstack either to avoid deadlock, or when unwinding the Java stack would cause excessive synchronization.

The <Total> Function

The <Total> function is an artificial construct used to represent the program as a whole. All performance metrics, in addition to being attributed to the functions on the call stack, are attributed to the special function <Total>. It appears at the top of the function list and its data can be used to give perspective on the data for other functions. In the Callers-Callees list, it is shown as the nominal caller of _start() in the main thread of execution of any program, and also as the nominal caller of _thread_start() for created threads. If the stack unwind was incomplete, the <Total> function can appear as the caller of other functions.

Functions Related to HW Counter Profiling

The following functions are related to HW counter profiling:


Mapping Data Addresses to Program Data Objects

Once a PC from a hardware counter event corresponding to a memory operation has been successfully backtracked to a likely causal memory-referencing instruction, the Performance Analyzer uses instruction identifiers and descriptors, provided by the compiler in its hardware profiling support information, to derive the associated program data object.

The term dataobject is used to refer to program constants, variables, arrays, and aggregates such as structures and unions, along with distinct aggregate elements, described in source code. Depending on the source language, dataobject types and their sizes vary. Many dataobjects are explicitly named in source programs, while others may be unnamed. Some dataobjects are derived or aggregated from other (simpler) dataobjects, resulting in a rich, often complex, set of dataobjects.

Each dataobject has an associated scope, the region of the source program where it is defined and can be referenced, which may be global (such as a load object), a particular compilation unit (an object file), or a function. Identical data objects may be defined with different scopes, or particular data objects may be referred to differently in different scopes.

Data-derived metrics from hardware counter events for memory operations collected with backtracking enabled are attributed to the associated program dataobject. They propagate to any aggregates containing the dataobject, and to the artificial <Total> dataobject, which is considered to contain all dataobjects (including <Unknown> and <Scalars>). The different subtypes of <Unknown> propagate up to the <Unknown> aggregate. The following sections describe the <Total>, <Scalars>, and <Unknown> dataobjects.

Dataobject Descriptors

Data objects are fully described by a combination of their declared type and name. A simple scalar dataobject {int i} describes a variable called i of type int, while {const+pointer+int p} describes a constant pointer to a type int called p. Types containing spaces have the spaces replaced with underscores (_), and unnamed data objects are represented with a name of dash (-), for example: {double_precision_complex -}.

An entire aggregate is similarly represented {structure:foo_t foo} for a structure of type foo_t called foo. A single element of an aggregate requires the additional specification of its container, for example, {structure:foo_t foo}.{int i} for a member i of type int of the previous structure foo. Aggregates may also themselves be elements of (larger) aggregates, with their corresponding descriptor constructed as a concatenation of aggregate descriptors and ultimately a scalar descriptor.

While a fully-qualified descriptor may not always be necessary to disambiguate dataobjects, it provides a generic complete specification to assist with dataobject identification.
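
For example, the following C declarations (hypothetical names) would give rise to the descriptors shown in the comments, using the notation above.

    /* Hypothetical declarations; the comments show the corresponding
     * dataobject descriptors in the notation described above. */
    typedef struct foo_t {
        int i;                   /* {structure:foo_t foo}.{int i} */
    } foo_t;

    foo_t      foo;              /* {structure:foo_t foo}         */
    int        i;                /* {int i}                       */
    int *const p = &i;           /* {const+pointer+int p}         */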

The <Total> Dataobject

The <Total> dataobject is an artificial construct used to represent the program's data objects as a whole. All performance metrics, in addition to being attributed to a distinct dataobject (and any aggregate to which it belongs), are attributed to the special dataobject <Total>. It appears at the top of the dataobject list and its data can be used to give perspective to the data for other data objects.

The <Scalars> Dataobject

While aggregate elements have their performance metrics additionally attributed into the metric value for their associated aggregate, all of the scalar constants and variables have their performance metrics additionally attributed into the metric value for the artificial <Scalars> dataobject.

The <Unknown> Dataobject and Its Elements

Under various circumstances, event data cannot be mapped to a particular dataobject. In such cases, the data is mapped to the special dataobject named <Unknown> and to one of its elements, as described below.

  • One or more compilation objects were compiled without hardware profiling support, so it is not possible to ascertain the data objects associated with memory-referencing instructions or to validate backtracking.
  • The hardware profiling support information provided in the compilation object was insufficient to verify the validity of backtracking.
  • Event backtracking encountered a control transfer target and, being unable to resolve whether the control transfer actually occurred, could not determine the likely causal memory-referencing instruction (and its associated dataobject).
  • Backtracking determined the likely causal memory-referencing instruction, but its associated dataobject was not specified by the compiler.
  • Backtracking determined the likely causal memory-referencing instruction, but the instruction was not identified by the compiler, so the associated dataobject cannot be determined. Compiler temporaries are generally unidentified.


Annotated Code Listings

Annotated source code and annotated disassembly code are useful for determining which source lines or instructions within a function are responsible for poor performance. This section describes the annotation process and some of the issues involved in interpreting the annotated code.

Annotated Source Code

Annotated source code shows the resource consumption of an application at the source-line level. It is produced by taking the PCs that are recorded in the application's call stack, and mapping each PC to a source line. To produce an annotated source file, the Performance Analyzer first determines all of the functions that are generated in a particular object module (.o file) or load object, then scans the data for all PCs from each function. To produce the annotated source, the Performance Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines, and it must be able to read the source file to produce an annotated copy, which is displayed. The Performance Analyzer searches for the source, object, and executable files in the following locations in turn, and stops when it finds a file of the correct basename:

The compilation process goes through many stages, depending on the level of optimization requested, and transformations take place which can confuse the mapping of instructions to source lines. For some optimizations, source line information might be completely lost, while for others, it might be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.

Interpreting Source Line Metrics

Metrics for an instruction must be interpreted as metrics accrued while waiting for the instruction to be executed. If the instruction being executed when an event is recorded comes from the same source line as the leaf PC, the metrics can be interpreted as due to execution of that source line. However, if the leaf PC comes from a different source line than the instruction being executed, at least some of the metrics for the source line that the leaf PC belongs to must be interpreted as metrics accumulated while this line was waiting to be executed. An example is when a value that is computed on one source line is used on the next source line.
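
A minimal illustration (hypothetical code) of metrics accruing on the line that waits:

    /* Illustrative only: if the load on the first line misses in the
     * cache, the processor stalls while the second line waits for the
     * result.  The leaf PC recorded during the stall belongs to the
     * second line, so its metrics include time caused by the first. */
    double scale(const double *a, long i, double factor)
    {
        double v = a[i];      /* value computed (loaded) on this line */
        return v * factor;    /* value used here; stall time can show
                                 up on this line instead              */
    }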

The issue of how to interpret the metrics matters most when there is a substantial delay in execution, such as at a cache miss or a resource queue stall, or when an instruction is waiting for a result from a previous instruction. In such cases the metrics for the source lines can seem to be unreasonably high, and you should look at other lines in the code to find the line responsible for the high metric value.

Metric Formats

The four possible formats for the metrics that can appear on a line of annotated source code are explained in TABLE 5-3.

TABLE 5-3 Annotated Source-Code Metrics

Metric

Significance

(Blank)

No PC in the program corresponds to this line of code. This case should always apply to comment lines, and applies to apparent code lines in the following circumstances:

  • All the instructions from the apparent piece of code have been eliminated during optimization.
  • The code is repeated elsewhere, and the compiler performed common subexpression recognition and tagged all the instructions with the lines for the other copy.
  • The compiler tagged an instruction with an incorrect line number.

0.

Some PCs in the program were tagged as derived from this line, but there was no data that referred to those PCs: they were never in a call stack that was sampled statistically or traced for thread-synchronization data. The 0. metric does not indicate that the line was not executed, only that it did not show up statistically in a profiling data packet or a tracing data packet.

0.000

At least one PC from this line appeared in the data, but the computed metric value rounded to zero.

1.234

The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.


Compiler Commentary

Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source code. When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The compiler commentary describes many of the transformations that were made to the source code to optimize it. These transformations include loop optimizations, parallelization, inlining, and pipelining.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned to the common code correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source shows metrics on lines within the branch that is not taken.
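
A hypothetical example of this effect:

    /* Hypothetical illustration: the subexpression a[i] * b[i] appears
     * in both branches.  A compiler may compute it once, before the if,
     * and tag the hoisted instructions with the line number of one
     * branch.  If the program always takes the other branch at run
     * time, the metrics for the hoisted code still appear on the branch
     * that is not taken. */
    double f(const double *a, const double *b, int i, int flag)
    {
        double r;
        if (flag)
            r = a[i] * b[i] + 1.0;   /* hoisted code may be tagged with this line */
        else
            r = a[i] * b[i] - 1.0;   /* ... even if this branch always executes   */
        return r;
    }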

Parallelization Directives

When the compiler generates body functions from code that contains parallelization directives, inclusive metrics for the parallel loop or section are attributed to the parallelization directive, because this line is the call site for the compiler-generated body function. Inclusive and exclusive metrics also appear on the code in the loops or sections. These metrics sum to the inclusive metrics on the parallelization directives.
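
The sketch below uses an OpenMP directive as one example of a parallelization directive; the attribution works as described regardless of the directive style, and the function name is illustrative.

    /* Sketch using an OpenMP directive as one example of a
     * parallelization directive.  Inclusive metrics for the whole loop
     * are attributed to the directive line, which is the call site of
     * the compiler-generated body function; the loop body carries its
     * own inclusive and exclusive metrics, which sum to the inclusive
     * metrics on the directive line. */
    void scale(double *x, double s, int n)
    {
        int i;
    #pragma omp parallel for        /* inclusive metrics for the loop appear here */
        for (i = 0; i < n; i++) {
            x[i] *= s;              /* body metrics appear here */
        }
    }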

Special Lines in the Source

Whenever the source line for a PC cannot be determined, the metrics for that PC are attributed to a special source line that is inserted at the top of the annotated source file. High metrics on that line indicate that part of the code from the given object module does not have line mappings. Annotated disassembly can help you determine the instructions that do not have mappings. The special lines are:

Annotated Disassembly Code

Annotated disassembly provides an assembly-code listing of the instructions of a function or object module, with the performance metrics associated with each instruction. Annotated disassembly can be displayed in several ways, determined by whether line-number mappings and the source file are available, and whether the object module for the function whose annotated disassembly is being requested is known:

Each instruction in the disassembly code is annotated with the following information.

Where possible, call addresses are resolved to symbols (such as function names). Metrics are shown on the lines for instructions, and can be shown on any interleaved source code if the corresponding preference is set. Possible metric values are as described for source-code annotations, in TABLE 5-3.

When code is not optimized, the line numbers for each instruction are in sequential order, and the interleaving of source lines and disassembled instructions occurs in the expected way. When optimization takes place, instructions from later lines sometimes appear before those from earlier lines. The Performance Analyzer's algorithm for interleaving is that whenever an instruction is shown as coming from line N, all source lines up to and including line N are written before the instruction. One effect of optimization is that source code can appear between a control transfer instruction and its delay slot instruction. Compiler commentary associated with line N of the source is written immediately before that line.
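
The interleaving rule can be expressed as a short sketch (not the Analyzer's actual code):

    /* Sketch only, not the Analyzer's implementation: for an instruction
     * attributed to source line N, emit every source line up to and
     * including N before emitting the instruction. */
    #include <stdio.h>

    struct insn { const char *text; int line; };  /* disassembled instruction */

    static void interleave(const char *src[], int nlines,
                           const struct insn *insns, int ninsns)
    {
        int next = 1;                     /* next source line not yet printed */
        for (int k = 0; k < ninsns; k++) {
            while (next <= insns[k].line && next <= nlines) {
                printf("%4d. %s\n", next, src[next - 1]);
                next++;
            }
            printf("      %s\n", insns[k].text);
        }
        while (next <= nlines) {          /* any trailing source lines */
            printf("%4d. %s\n", next, src[next - 1]);
            next++;
        }
    }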

Interpreting annotated disassembly is not straightforward. The leaf PC is the address of the next instruction to execute, so metrics attributed to an instruction should be considered as time spent waiting for the instruction to execute. However, the execution of instructions does not always happen in sequence, and there might be delays in the recording of the call stack. To make use of annotated disassembly, you should become familiar with the hardware on which you record your experiments and the way in which it loads and executes instructions.

The next few subsections discuss some of the issues of interpreting annotated disassembly.

Instruction Issue Grouping

Instructions are loaded and issued in groups known as instruction issue groups. Which instructions are in the group depends on the hardware, the instruction type, the instructions already being executed, and any dependencies on other instructions or registers. This means that some instructions might be underrepresented because they are always issued in the same clock cycle as the previous instruction, so they never represent the next instruction to be executed. It also means that when the call stack is recorded, there might be several instructions which could be considered the next instruction to execute.

Instruction issue rules vary from one processor type to another, and depend on the instruction alignment within cache lines. Since the linker forces instruction alignment at a finer granularity than the cache line, changes in a function that might seem unrelated can cause different alignment of instructions. The different alignment can cause a performance improvement or degradation.

The following artificial situation shows the same function compiled and linked in slightly different circumstances. The two output examples shown below are the annotated disassembly listings from er_print. The instructions for the two examples are identical, but the instructions are aligned differently.

In this example the instruction alignment maps the two instructions cmp and bl,a to different cache lines, and a significant amount of time is used waiting to execute these two instructions.

   Excl.     Incl.     
User CPU  User CPU  
    sec.      sec. 
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5. 
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.010     0.010              [ 6]    1066c:  clr         %o0
   0.        0.                 [ 6]    10670:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    10674:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10678:  inc         2, %o0
## 1.360     1.360              [ 7]    1067c:  cmp         %o0, %o5
## 1.510     1.510              [ 7]    10680:  bl,a        0x1067c
   0.        0.                 [ 7]    10684:  inc         2, %o0
   0.        0.                 [ 7]    10688:  retl        
   0.        0.                 [ 7]    1068c:  nop         
                             8.     return i;
                             9. }

In this example, the instruction alignment maps the two instructions cmp and bl,a to the same cache line, and a significant amount of time is used waiting to execute only one of these instructions.

   Excl.     Incl.     
User CPU  User CPU  
    sec.      sec. 
                             1. static int
                             2. ifunc()
                             3. {
                             4.     int i;
                             5. 
                             6.     for (i=0; i<10000; i++)
                                <function: ifunc>
   0.        0.                 [ 6]    10684:  clr         %o0
   0.        0.                 [ 6]    10688:  sethi       %hi(0x2400), %o5
   0.        0.                 [ 6]    1068c:  inc         784, %o5
                             7.         i++;
   0.        0.                 [ 7]    10690:  inc         2, %o0
## 1.440     1.440              [ 7]    10694:  cmp         %o0, %o5
   0.        0.                 [ 7]    10698:  bl,a        0x10694
   0.        0.                 [ 7]    1069c:  inc         2, %o0
   0.        0.                 [ 7]    106a0:  retl        
   0.        0.                 [ 7]    106a4:  nop         
                             8.     return i;
                             9. }

Instruction Issue Delay

Sometimes, specific leaf PCs appear more frequently because the instruction that they represent is delayed before issue. This can occur for a number of reasons, some of which are listed below:

Attribution of Hardware Counter Overflows

Apart from TLB misses, the call stack for a hardware counter overflow event is recorded at some point further on in the sequence of instructions than the point at which the overflow occurred, for various reasons including the time taken to handle the interrupt generated by the overflow. For some counters, such as cycles or instructions issued, this does not matter. For other counters, such as those counting cache misses or floating point operations, the metric is attributed to a different instruction from that which is responsible for the overflow. Often the PC that caused the event is only a few instructions before the recorded PC, and the instruction can be correctly located in the disassembly listing. However, if there is a branch target within this instruction range, it might be difficult or impossible to tell which instruction corresponds to the PC that caused the event. For hardware counters that count memory access events, the Collector searches for the PC that caused the event if the counter name is prefixed with a plus, +.

Special Lines in the Disassembly and PCs Tabs

There are a number of special lines that may appear in the Disassembly and PCs tabs. This section describes such lines using examples as they might appear in the analyzer.

Program Linkage Table (PLT) Instructions

When a function in one load object calls a function in a different shared object, the actual call transfers first to a three-instruction sequence in the PLT, and then to the real destination. The analyzer removes PCs that correspond to the PLT, and assigns the metrics for these PCs to the call instruction. Therefore, if a call instruction has an unexpectedly high metric, it could be due to the PLT instructions rather than the call instructions. See also Function Calls Between Shared Objects.


1 (Footnote) The terms "Java virtual machine" and "JVM" mean a virtual machine for the Java™ platform.