C H A P T E R 7 - Understanding the Performance Analyzer and Its Data

C H A P T E R 7

Understanding the Performance Analyzer and Its Data

The Performance Analyzer reads the event data that is collected by the Collector and converts it into performance metrics. The metrics are computed for various elements in the structure of the target program, such as instructions, source lines, functions, and load objects. In addition to a header, the data recorded for each event collected has two parts:

Some event-specific data that is used to compute metrics

A call stack of the application that is used to associate those metrics with the program structure

The process of associating the metrics with the program structure is not always straightforward, due to the insertions, transformations, and optimizations made by the compiler. This chapter describes the process and discusses the effect on what you see in the Performance Analyzer displays.

This chapter covers the following topics:

How Data Collection Works

Interpreting Performance Metrics

Call Stacks and Program Execution

Mapping Addresses to Program Structure

Mapping Data Addresses to Program Data Objects

How Data Collection Works

The output from a data collection run is an experiment, which is stored as a directory with various internal files and subdirectories in the file system.

Experiment Format

All experiments must have three files:

A log file; an ASCII file that contains information about what data was collected, the versions of various components, a record of various events during the life of the target, and the word size of the target.

A map file; an ASCII file that records the time-dependent information about what loadobjects are loaded into the address space of the target, and the times at which they are loaded or unloaded.

An overview file; a binary file containing usage information recorded at every sample point in the experiment.

In addition, experiments have binary data files representing the profile events in the life of the process. Each data file has a series of events, as described below under Interpreting Performance Metrics. Separate files are used for each type of data, but each file is shared by all LWPs in the target. The data files are named as follows:

TABLE 7-1 Data Types and Corresponding File Names
Data Type	File Name
Clock-based profiling	`profile`
Hardware counter overflow profiling	`hwcounters`
Synchronization tracing	`synctrace`
Heap tracing	`heaptrace`
MPI tracing	`mpitrace`

For clock-based profiling, or hardware counter overflow profiling, the data is written in a signal handler invoked by the clock tick or counter overflow. For synchronization tracing, heap tracing, or MPI tracing, data is written from libcollector.so routines that are interposed by the LD_PRELOAD environment variable on the normal user-invoked routines. Each such interposition routine partially fills in a data record, then invokes the normal user-invoked routine, and fills in the rest of the data record when that routine returns, and writes the record to the data file.

All data files are memory-mapped and written in blocks. The records are filled in such a way as to always have a valid record structure, so that experiments can be read as they are being written. The buffer management strategy is designed to minimize contention and serialization between LWPs.

An experiment can optionally contain an ASCII file with the filename of notes. This file is automatically created when using the -C comment argument to the collect command. You can create or edit the file manually after the experiment has been created. The contents of the file are prepended to the experiment header.

The `archives` Directory

Each experiment has an archives directory that contains binary files describing each load object referenced in the loadobjects file. These files are produced by the er_archive utility, which runs at the end of data collection. If the process terminates abnormally, the er_archive utility may not be invoked, in which case, the archive files are written by the er_print utility or the Analyzer when first invoked on the experiment.

Descendant Processes

Descendant processes write their experiments into subdirectories within the founder-process' experiment. These subdirectories are named with an underscore, a code letter (f for fork, x for exec, and c for combination), and a number are added to its immediate creator's experiment name, giving the genealogy of the descendant. For example, if the experiment name for the founder process is test.1.er, the experiment for the child process created by its third fork is test.1.er/_f3.er. If that child process executes a new image, the corresponding experiment name is test.1.er/_f3_x1.er. Descendant experiments consist of the same files as the parent experiment, but they do not have descendant experiments (all descendants are represented by subdirectories in the founder experiment), and they do not have archive subdirectories (all archiving is done into the founder experiment).

Dynamic Functions

An experiment where the target creates dynamic functions has additional records in the loadobjects file describing those functions, and an additional file, dyntext, containing a copy of the actual instructions of the dynamic functions. The copy is needed to produce annotated disassembly of dynamic functions.

Java Experiments

A Java experiment has additional records in the loadobjects file, both for dynamic functions created by the JVM software for its internal purposes, and for dynamically-compiled (HotSpot) versions of the target Java methods.

In addition, a Java experiment has a JAVA_CLASSES file, containing information about all of the user's Java classes invoked.

Java heap tracing data and synchronization tracing data are recorded using a JVMPI agent, which is part of libcollector.so. The agent receives events that are mapped into the recorded trace events. The agent also receives events for class loading and HotSpot compilation, that are used to write the JAVA_CLASSES file, and the Java-compiled method records in the loadobjects file.

Recording Experiments

You can record an experiment in three different ways:

With the collect command

With dbx creating a process

With dbx creating an experiment from a running process

The Performance Tools Collect window in the Analyzer GUI runs a collect experiment; the Collector dialog in the IDE runs a dbx experiment.

collect Experiments

When you use the collect command to record an experiment, the collect utility creates the experiment directory and sets the LD_PRELOAD environment variable to ensure that libcollector.so is preloaded into the target's address space. It then sets environment variables to inform libcollector.so about the experiment name, and data collection options, and executes the target on top of itself.

libcollector.so is responsible for writing all experiment files.

dbx Experiments That Create a Process

When dbx is used to launch a process with data collection enabled, dbx also creates the experiment directory and ensures preloading of libcollector.so. dbx stops the process at a breakpoint before its first instruction, and then calls an initialization routine in libcollector.so to start the data collection.

Java experiments can not be collected by dbx, since dbx uses a Java trademark Virtual Machine Debug Interface (JVMDI) agent for debugging, and that agent can not coexist with the Java Virtual Machine Profiling Interface (JVMPI) agent needed for data collection.

dbx Experiments, on a Running Process

When dbx is used to start an experiment on a running process, it creates the experiment directory, but cannot use the LD_PRELOAD environment variable. dbx makes an interactive function call into the target tolopen libcollector.so, and then calls the libcollector.so initialization routine, just as it does when creating the process. Data is written by libcollector.so just as in a collect experiment.

Since libcollector.so was not in the target address space when the process started, any data collection that depends on interposition on user-callable functions (synchronization tracing, heap tracing, MPI tracing) might not work. In general, the symbols have already been resolved to the underlying functions, so the interposition can not happen. Furthermore, the following of descendant processes also depends on interposition, and does not work properly for experiments created by dbx on a running process.

If you have explicitly preloaded libcollector.so before starting the process with dbx, or before using dbx to attach to the running process, you can collect tracing data.

Interpreting Performance Metrics

The data for each event contains a high-resolution timestamp, a thread ID, an LWP ID, and a processor ID. The first three of these can be used to filter the metrics in the Performance Analyzer by time, thread or LWP. See the getcpuid(2) man page for information on processor IDs. On systems where getcpuid is not available, the processor ID is -1, which maps to Unknown.

In addition to the common data, each event generates specific raw data, which is described in the following sections. Each section also contains a discussion of the accuracy of the metrics derived from the raw data and the effect of data collection on the metrics.

Clock-Based Profiling

The event-specific data for clock-based profiling consists of an array of profiling interval counts. On the Solaris OS, an interval counter is provided At the end of the profiling interval, the appropriate interval counter is incremented by 1, and another profiling signal is scheduled. The array is recorded and reset only when the Solaris LWP thread enters CPU user mode. Resetting the array consists of setting the array element for the User-CPU state to 1, and the array elements for all the other states to 0. The array data is recorded on entry to user mode before the array is reset. Thus, the array contains an accumulation of counts for each microstate that was entered since the previous entry into user mode. for each of the ten microstates maintained by the kernel for each Solaris LWP. On the Linux OS, microstates do not exist; the only interval counter is User CPU Time.

The call stack is recorded at the same time as the data. If the Solaris LWP is not in user mode at the end of the profiling interval, the call stack cannot change until the LWP or thread enters user mode again. Thus the call stack always accurately records the position of the program counter at the end of each profiling interval.

The metrics to which each of the microstates contributes on the Solaris OS are shown in TABLE 7-2.

TABLE 7-2 How Kernel Microstates Contribute to Metrics
Kernel Microstate	Description	Metric Name
`LMS_USER`	Running in user mode	User CPU Time
`LMS_SYSTEM`	Running in system call or page fault	System CPU Time
`LMS_TRAP`	Running in any other trap	System CPU Time
`LMS_TFAULT`	Asleep in user text page fault	Text Page Fault Time
`LMS_DFAULT`	Asleep in user data page fault	Data Page Fault Time
`LMS_KFAULT`	Asleep in kernel page fault	Other Wait Time
`LMS_USER_LOCK`	Asleep waiting for user-mode lock	User Lock Time
`LMS_SLEEP`	Asleep for any other reason	Other Wait Time
`LMS_STOPPED`	Stopped (`/proc`, job control, or `lwp_stop`)	Other Wait Time
`LMS_WAIT_CPU`	Waiting for CPU	Wait CPU Time

Accuracy of Timing Metrics

Timing data is collected on a statistical basis, and is therefore subject to all the errors of any statistical sampling method. For very short runs, in which only a small number of profile packets is recorded, the call stacks might not represent the parts of the program which consume the most resources. Run your program for long enough or enough times to accumulate hundreds of profile packets for any function or source line you are interested in.

In addition to statistical sampling errors, specific errors arise from the way the data is collected and attributed and the way the program progresses through the system. The following are some of the circumstances in which inaccuracies or distortions can appear in the timing metrics:

When a Solaris LWP or Linux thread is created, the time spent before the first profile packet is recorded is less than the profiling interval, but the entire profiling interval is ascribed to the microstate recorded in the first profile packet. If many LWP or threads are created, the error can be many times the profiling interval.

When a Solaris LWP or Linux thread is destroyed, some time is spent after the last profile packet is recorded. If many LWPs or threads are destroyed, the error can be many times the profiling interval.

Rescheduling of LWPs or threads can occur during a profiling interval. As a consequence, the recorded state of the LWP might not represent the microstate in which it spent most of the profiling interval. The errors are likely to be larger when there are more Solaris LWPs or Linux threads to run than there are processors to run them.

A program can behave in a way that is correlated with the system clock. In this case, the profiling interval always expires when the Solaris LWP or Linux thread is in a state that might represent a small fraction of the time spent, and the call stacks recorded for a particular part of the program are overrepresented. On a multiprocessor system, the profiling signal can induce a correlation: processors that are interrupted by the profiling signal while they are running LWPs for the program are likely to be in the Trap-CPU microstate when the microstate is recorded.

The kernel records the microstate value when the profiling interval expires. When the system is under heavy load, that value might not represent the true state of the process. On the Solaris OS, this situation is likely to result in overaccounting of the Trap-CPU or Wait-CPU microstate.

The threads library sometimes discards profiling signals when it is in a critical section, resulting in an underaccounting of timing metrics. The problem applies to the default threads library on the Solaris 8 OS only.

When the system clock is being synchronized with an external source, the time stamps recorded in profile packets do not reflect the profiling interval but include any adjustment that was made to the clock. The clock adjustment can make it appear that profile packets are lost. The time period involved is usually several seconds, and the adjustments are made in increments.

In addition to the inaccuracies just described, timing metrics are distorted by the process of collecting data. The time spent recording profile packets never appears in the metrics for the program, because the recording is initiated by the profiling signal. (This is another instance of correlation.) The user CPU time spent in the recording process is distributed over whatever microstates are recorded. The result is an underaccounting of the User CPU Time metric and an overaccounting of other metrics. The amount of time spent recording data is typically less than a few percent of the CPU time for the default profiling interval.

Comparisons of Timing Metrics

If you compare timing metrics obtained from the profiling done in a clock-based experiment with times obtained by other means, you should be aware of the following issues.

For a single-threaded application, the total Solaris LWP or Linux thread time recorded for a process is usually accurate to a few tenths of a percent, compared with the values returned by gethrtime(3C) for the same process. The CPU time can vary by several percentage points from the values returned by gethrvtime(3C) for the same process. Under heavy load, the variation might be even more pronounced. However, the CPU time differences do not represent a systematic distortion, and the relative times reported for different functions, source-lines, and such are not substantially distorted.

For multithreaded applications using unbound threads on the Solaris OS, differences in values returned by gethrvtime() could be meaningless because gethrvtime() returns values for an LWP, and a thread can change from one LWP to another.

The LWP times that are reported in the Performance Analyzer can differ substantially from the times that are reported by vmstat, because vmstat reports times that are summed over CPUs. If the target process has more LWPs than the system on which it is running has CPUs, the Performance Analyzer shows more wait time than vmstat reports.

The microstate timings that appear in the Statistics tab of the Performance Analyzer and the er_print statistics display are based on process file system /proc usage reports, for which the times spent in the microstates are recorded to high accuracy. See the proc(4) man page for more information. You can compare these timings with the metrics for the <Total> function, which represents the program as a whole, to gain an indication of the accuracy of the aggregated timing metrics. However, the values displayed in the Statistics tab can include other contributions that are not included in the timing metric values for <Total>. These contributions come from the following sources:

Threads that are created by the system that are not profiled. The standard threads library in the Solaris 8 OS creates system threads that are not profiled. These threads spend most of their time sleeping, and the time shows in the Statistics tab as Other Wait time.

Periods of time in which data collection is paused.

User CPU time and hardware counter cycle time differ because the hardware counters are turned off when the CPU mode has been switched to system mode. For more information, see Traps.

Synchronization Wait Tracing

Synchronization wait tracing is available only on Solaris platforms. The Collector collects synchronization delay events by tracing calls to the functions in the threads library, libthread.so, or to the real time extensions library, librt.so. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), and the address of the synchronization object (the mutex lock being requested, for example). The thread and LWP IDs are the IDs at the time the data is recorded. The wait time is the difference between the request time and the grant time. Only events for which the wait time exceeds the specified threshold are recorded. The synchronization wait tracing data is recorded in the experiment at the time of the grant.

If the program uses bound threads, the LWP on which the waiting thread is scheduled cannot perform any other work until the event that caused the delay is completed. The time spent waiting appears both as Synchronization Wait Time and as User Lock Time. User Lock Time can be larger than Synchronization Wait Time because the synchronization delay threshold screens out delays of short duration.

If the program uses unbound threads, it is possible for the LWP on which the waiting thread is scheduled to have other threads scheduled on it and continue to perform user work. The User Lock Time is zero if all LWPs are kept busy while some threads are waiting for a synchronization event. However, the Synchronization Wait Time is not zero because it is associated with a particular thread, not with the LWP on which the thread is running.

The wait time is distorted by the overhead for data collection. The overhead is proportional to the number of events collected. You can minimize the fraction of the wait time spent in overhead by increasing the threshold for recording events.

Hardware Counter Overflow Profiling

Hardware counter overflow profiling is available only on Solaris platforms. Hardware counter overflow profiling data includes a counter ID and the overflow value. The value can be larger than the value at which the counter is set to overflow, because the processor executes some instructions between the overflow and the recording of the event. The value is especially likely to be larger for cycle and instruction counters, which are incremented much more frequently than counters such as floating-point operations or cache misses. The delay in recording the event also means that the program counter address recorded with call stack does not correspond exactly to the overflow event. See Attribution of Hardware Counter Overflows for more information. See also the discussion of Traps. Traps and trap handlers can cause significant differences between reported User CPU time and time reported by the cycle counter.

The amount of data collected depends on the overflow value. Choosing a value that is too small can have the following consequences.

The amount of time spent collecting data can be a substantial fraction of the execution time of the program. The collection run might spend most of its time handling overflows and writing data instead of running the program.

A substantial fraction of the counts can come from the collection process. These counts are attributed to the collector function collector_record_counters. If you see high counts for this function, the overflow value is too small.

The collection of data can alter the behavior of the program. For example, if you are collecting data on cache misses, the majority of the misses could come from flushing the collector instructions and profiling data from the cache and replacing it with the program instructions and data. The program would appear to have a lot of cache misses, but without data collection there might in fact be very few cache misses.

Choosing a value that is too large can result in too few overflows for good statistics. The counts that are accrued after the last overflow are attributed to the collector function collector_final_counters. If you see a substantial fraction of the counts in this function, the overflow value is too large.

Heap Tracing

The Collector records tracing data for calls to the memory allocation and deallocation functions malloc, realloc, memalign, and free by interposing on these functions. If your program bypasses these functions to allocate memory, tracing data is not recorded. Tracing data is not recorded for Java memory management, which uses a different mechanism.

The functions that are traced could be loaded from any of a number of libraries. The data that you see in the Performance Analyzer might depend on the library from which a given function is loaded.

If a program makes a large number of calls to the traced functions in a short space of time, the time taken to execute the program can be significantly lengthened. The extra time is used in recording the tracing data.

Dataspace Profiling

A dataspace profile is a data collection in which memory- related events, such as cache misses, are reported against the data-object references that cause the events rather than just the instructions where the memory-related events occur. Dataspace profiling is not available on Linux systems.

To allow dataspace profiling, the target must be a C program, compiled for the SPARC architecture, with the -xhwcprof flag and -xdebugformat=dwarf -g flag. Furthermore, the data collected must be hardware counter profiles for memory-related counters and the optional + sign must be prepended to the counter name. The Performance Analyzer now includes two tabs related to dataspace profiling, the DataObject tab and the DataLayout tab, and various tabs for memory objects.

MPI Tracing

MPI tracing is available only on Solaris platforms. MPI tracing records information about calls to MPI library functions. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), the number of send and receive operations and the number of bytes sent or received. Tracing is done by interposing on the calls to the MPI library. The interposing functions do not have detailed information about the optimization of data transmission, nor about transmission errors, so the information that is presented represents a simple model of the data transmission, which is explained in the following paragraphs.

The number of bytes received is the length of the buffer as defined in the call to the MPI function. The actual number of bytes received is not available to the interposing function.

Some of the Global Communication functions have a single origin or a single receiving process known as the root. The accounting for such functions is done as follows:

Root sends data to all processes, itself included.

Root receives data from all processes. itself included.

Each process communicates with each process, itself included

The following examples illustrate the accounting procedure. In these examples, G is the size of the group.

For a call to MPI_Bcast(),

Root sends G packets of N bytes, one packet to each process, including itself

All G processes in the group (including root) receive N bytes

For a call to MPI_Allreduce(),

Each process sends G packets of N bytes

Each process receives G packets of N bytes

For a call to MPI_Reduce_scatter(),

Each process sends G packets of N/G bytes

Each process receives G packets of N/G bytes

Call Stacks and Program Execution

A call stack is a series of program counter addresses (PCs) representing instructions from within the program. The first PC, called the leaf PC, is at the bottom of the stack, and is the address of the next instruction to be executed. The next PC is the address of the call to the function containing the leaf PC; the next PC is the address of the call to that function, and so forth, until the top of the stack is reached. Each such address is known as a return address. The process of recording a call stack involves obtaining the return addresses from the program stack and is referred to as unwinding the stack. For information on unwind failures, see Incomplete Stack Unwinds.

The leaf PC in a call stack is used to assign exclusive metrics from the performance data to the function in which that PC is located. Each PC on the stack, including the leaf PC, is used to assign inclusive metrics to the function in which it is located.

Most of the time, the PCs in the recorded call stack correspond in a natural way to functions as they appear in the source code of the program, and the Performance Analyzer's reported metrics correspond directly to those functions. Sometimes, however, the actual execution of the program does not correspond to a simple intuitive model of how the program would execute, and the Performance Analyzer's reported metrics might be confusing. See Mapping Addresses to Program Structure for more information about such cases.

Single-Threaded Execution and Function Calls

The simplest case of program execution is that of a single-threaded program calling functions within its own load object.

When a program is loaded into memory to begin execution, a context is established for it that includes the initial address to be executed, an initial register set, and a stack (a region of memory used for scratch data and for keeping track of how functions call each other). The initial address is always at the beginning of the function _start(), which is built into every executable.

When the program runs, instructions are executed in sequence until a branch instruction is encountered, which among other things could represent a function call or a conditional statement. At the branch point, control is transferred to the address given by the target of the branch, and execution proceeds from there. (Usually the next instruction after the branch is already committed for execution: this instruction is called the branch delay slot instruction. However, some branch instructions annul the execution of the branch delay slot instruction).

When the instruction sequence that represents a call is executed, the return address is put into a register, and execution proceeds at the first instruction of the function being called.

In most cases, somewhere in the first few instructions of the called function, a new frame (a region of memory used to store information about the function) is pushed onto the stack, and the return address is put into that frame. The register used for the return address can then be used when the called function itself calls another function. When the function is about to return, it pops its frame from the stack, and control returns to the address from which the function was called.

Function Calls Between Shared Objects

When a function in one shared object calls a function in another shared object, the execution is more complicated than in a simple call to a function within the program. Each shared object contains a Program Linkage Table, or PLT, which contains entries for every function external to that shared object that is referenced from it. Initially the address for each external function in the PLT is actually an address within ld.so, the dynamic linker. The first time such a function is called, control is transferred to the dynamic linker, which resolves the call to the real external function and patches the PLT address for subsequent calls.

If a profiling event occurs during the execution of one of the three PLT instructions, the PLT PCs are deleted, and exclusive time is attributed to the call instruction. If a profiling event occurs during the first call through a PLT entry, but the leaf PC is not one of the PLT instructions, any PCs that arise from the PLT and code in ld.so are replaced by a call to an artificial function, @plt, which accumulates inclusive time. There is one such artificial function for each shared object. If the program uses the LD_AUDIT interface, the PLT entries might never be patched, and non-leaf PCs from @plt can occur more frequently.

Signals

When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system function, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another.

The Performance Analyzer treats the frames resulting from signal delivery as ordinary frames. The user code at the point at which the signal was delivered is shown as calling the system function sigacthandler(), and sigacthandler() in turn is shown as calling the user's signal handler. Inclusive metrics from both sigacthandler() and any user signal handler, and any other functions they call, appear as inclusive metrics for the interrupted function.

The Collector interposes on sigaction() to ensure that its handlers are the primary handlers for the SIGPROF signal when clock data is collected and SIGEMT signal when hardware counter overflow data is collected.

Traps

Traps can be issued by an instruction or by the hardware, and are caught by a trap handler. System traps are traps that are initiated from an instruction and trap into the kernel. All system calls are implemented using trap instructions, for example. Some examples of hardware traps are those issued from the floating point unit when it is unable to complete an instruction (such as the fitos instruction for some register-content values on the UltraSPARC® III platform), or when the instruction is not implemented in the hardware.

When a trap is issued, the Solaris LWP or Linux kernel enters system mode. On the Solaris OS, the microstate is usually switched from User CPU state to Trap state then to System state. The time spent handling the trap can show as a combination of System CPU time and User CPU time, depending on the point at which the microstate is switched. The time is attributed to the instruction in the user's code from which the trap was initiated (or to the system call).

For some system calls, it is considered critical to provide as efficient handling of the call as possible. The traps generated by these calls are known as fast traps. Among the system functions that generate fast traps are gethrtime and gethrvtime. In these functions, the microstate is not switched because of the overhead involved.

In other circumstances it is also considered critical to provide as efficient handling of the trap as possible. Some examples of these are TLB (translation lookaside buffer) misses and register window spills and fills, for which the microstate is not switched.

In both cases, the time spent is recorded as User CPU time. However, the hardware counters are turned off because the CPU mode has been switched to system mode. The time spent handling these traps can therefore be estimated by taking the difference between User CPU time and Cycles time, preferably recorded in the same experiment.

In one case the trap handler switches back to user mode, and that is the misaligned memory reference trap for an 8-byte integer which is aligned on a 4-byte boundary in Fortran. A frame for the trap handler appears on the stack, and a call to the handler can appear in the Performance Analyzer, attributed to the integer load or store instruction.

When an instruction traps into the kernel, the instruction following the trapping instruction appears to take a long time, because it cannot start until the kernel has finished executing the trapping instruction.

Tail-Call Optimization

The compiler can do one particular optimization whenever the last thing a particular function does is to call another function. Rather than generating a new frame, the callee re-uses the frame from the caller, and the return address for the callee is copied from the caller. The motivation for this optimization is to reduce the size of the stack, and, on SPARC platforms, to reduce the use of register windows.

Suppose that the call sequence in your program source looks like this:

A -> B -> C -> D

When B and C are tail-call optimized, the call stack looks as if function A calls functions B, C, and D directly.

A -> B

A -> C

A -> D

That is, the call tree is flattened. When code is compiled with the -g option, tail-call optimization takes place only at a compiler optimization level of 4 or higher. When code is compiled without the -g option, tail-call optimization takes place at a compiler optimization level of 2 or higher.

Explicit Multithreading

A simple program executes in a single thread, on a single LWP (lightweight process) in the Solaris OS. Multithreaded executables make calls to a thread creation function, to which the target function for execution is passed. When the target exits, the thread is destroyed by the threads library. Newly-created threads begin execution at a function called _thread_start(), which calls the function passed in the thread creation call. For any call stack involving the target as executed by this thread, the top of the stack is _thread_start(), and there is no connection to the caller of the thread creation function. Inclusive metrics associated with the created thread therefore only propagate up as far as _thread_start() and the <Total> function.

In addition to creating the threads, the threads library also creates LWPs on Solaris to execute the threads. Threading can be done either with bound threads, where each thread is bound to a specific LWP, or with unbound threads, where each thread can be scheduled on a different LWP at different times.

If bound threads are used, the threads library creates one LWP per thread.

If unbound threads are used, the threads library decides how many LWPs to create to run efficiently, and which LWPs to schedule the threads on. The threads library can create more LWPs at a later time if they are needed. Unbound threads are not part of the Solaris 9 OS or of the alternate threads library in the Solaris 8 OS.

As an example of the scheduling of unbound threads, when a thread is at a synchronization barrier such as a mutex_lock, the threads library can schedule a different thread on the LWP on which the first thread was executing. The time spent waiting for the lock by the thread that is at the barrier appears in the Synchronization Wait Time metric, but since the LWP is not idle, the time is not accrued into the User Lock Time metric.

In addition to the user threads, the standard threads library in the Solaris 8 OS creates some threads that are used to perform signal handling and other tasks. If the program uses bound threads, additional LWPs are also created for these threads. Performance data is not collected or displayed for these threads, which spend most of their time sleeping. However, the time spent in these threads is included in the process statistics and in the times recorded in the sample data. The threads library in the Solaris 9 OS and the alternate threads library in the Solaris 8 OS do not create these extra threads.

The Linux OS provides P-threads (POSIX threads) for explicit multithreading. The data type pthread_attr_t controls the behavioral attributes of a thread. To create a bound thread, the attribute's scope must be set to PTHREAD_SCOPE_SYSTEM using the pthread_attr_setscope() function. Threads are unbound by default, or if the attribute scope is set to PTHREAD_SCOPE_PROCESS. To create a new thread, the application calls the P-thread API function pthread_create(), passing a pointer to an application-defined start routine as one of the function arguments. When the new thread starts execution, it runs in a Linux-specific system function, clone(), which calls another internal initialization function, pthread_start_thread(), which in turn calls the user-defined start routine originally passed to pthread_create(). The Linux metrics-gathering functions available to the Collector are thread-specific, whether the thread is bound to an LWP or not. Therefore, when the collect utility runs, it interposes a metrics-gathering function, named collector_root(), between pthread_start_thread() and the application-defined thread start routine.

Overview of Java Technology-Based Software Execution

To the typical developer, a Java technology-based application runs just like any other program. The application begins at a main entry point, typically named class.main, which may call other methods, just as a C or C++ application does.

To the operating system, an application written in the Java programming language, (pure or mixed with C/C++), runs as a process instantiating the JVM software. The JVM software is compiled from C++ sources and starts execution at _start, which calls main, and so forth. It reads bytecode from .class and/or .jar files, and performs the operations specified in that program. Among the operations that can be specified is the dynamic loading of a native shared object, and calls into various functions or methods contained within that object.

During execution of a Java technology-based application, most methods are interpreted by the JVM software; these methods are referred to in this document as interpreted methods. Other methods may be dynamically compiled by the Java HotSpot virtual machine, and are referred to as compiled methods. Dynamically compiled methods are loaded into the data space of the application, and may be unloaded at some later point in time. For any particular method, there is an interpreted version, and there may also be one or more compiled versions. Code written in the Java programming language might also call directly into native-compiled code, either C, C++, or Fortran; the targets of such calls are referred to as native methods.

The JVM software does a number of things that are typically not done by applications written in traditional languages. At startup, it creates a number of regions of dynamically-generated code in its data space. One of these regions is the actual interpreter code used to process the application's bytecode methods.

During the interpretive execution, the Java HotSpot virtual machine monitors performance, and may decide to take one or more methods that it has been interpreting, generate machine code for them, and execute the more-efficient machine code version, rather than interpret the original. That generated machine code is also in the data space of the process. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.

Applications written in the Java programming language are inherently multithreaded, and have one JVM software thread for each thread in the user's program. Java applications also have several housekeeping threads used for signal handling, memory management, and Java HotSpot virtual machine compilation. Depending on the version of libthread.so used, there may be a one-to-one correspondence between threads and LWPs, or a more complex relationship. For the default libthread.so thread library on the Solaris 8 OS, a thread might be unscheduled at any instant, or scheduled onto an LWP. Data for a thread is not collected while that thread is not scheduled onto an LWP. A thread is never unscheduled when using the alternate libthread.so library on the Solaris 8 OS nor when using the Solaris 9 OS threads.

Data collection is implemented with various methods in the JVMPI in J2SE 1.4.2 and the JVMTI in J2SE 5.0.

Java Call Stacks and Machine Call Stacks

The performance tools collect their data by recording events in the life of each Solaris LWP or Linux thread, along with the call stack at the time of the event. At any point in the execution of any application, the call stack represents where the program is in its execution, and how it got there. One important way that mixed-model Java applications differ from traditional C, C++, and Fortran applications is that at any instant during the run of the target there are two callstacks that are meaningful: a Java call stack, and a machine callstack. Both call stacks are recorded during profiling, and are reconciled during analysis.

Clock-based Profiling and Hardware Counter Overflow Profiling

Clock-based profiling and hardware counter overflow profiling for Java programs work just as for C, C++, and Fortran programs, except that both Java call stacks and machine call stacks are collected.

Synchronization Tracing

Synchronization tracing for Java programs is based on events generated when a thread attempts to acquire a Java Monitor. Both machine cal lstacks and Java call stacks are collected for these events, but no synchronization tracing data is collected for internal locks used within the JVM software.

Heap Tracing

Heap tracing data records object-allocation events, generated by the user code, and object-deallocation events, generated by the garbage collector. In addition, any use of C/C++ memory-management functions, such as malloc and free, also generates events that are recorded. Those events might come from native code, or from the JVM software itself.

Java Processing Representations

There are three representations for displaying performance data for applications written in the Java programming language: the Java representation, the Expert-Java representation, and the Machine representation. The Java representation is shown by default where the data supports it. The following section summarizes the main differences between these three representations.

The User Representation

The User representation shows compiled and interpreted Java methods by name, and shows native methods in their natural form. During execution, there might be many instances of a particular Java method executed: the interpreted version, and, perhaps, one or more compiled versions. In the Java representation all methods are shown aggregated as a single method. This representation is selected in the Analyzer by default.

A PC for a Java method in the Java representation corresponds to the method-id and a bytecode index into that method; a PC for a native function correspond to a machine PC. The call stack for a Java thread may have a mixture of Java PCs and machine PCs. It does not have any frames corresponding to Java housekeeping code, which does not have a Java representation. Under some circumstances, the JVM software cannot unwind the Java stack, and a single frame with the special function, <no Java callstack recorded>, is returned. Typically, it amounts to no more than 5-10% of the total time.

The function list in the Java representation shows metrics against the Java methods and any native methods called. The caller-callee panel shows the calling relationships in the Java representation.

Source for a Java method corresponds to the source code in the .java file from which it was compiled, with metrics on each source line. The disassembly of any Java method shows the bytecode generated for it, with metrics against each bytecode, and interleaved Java source, where available.

The Timeline in the Java representation shows only Java threads. The call stack for each thread is shown with its Java methods.

All Java programs may have explicit synchronization, usually performed by calling the monitor-enter routine.

Synchronization-delay tracing in the Java representation is based on the JVMPI synchronization events. Data from the normal synchronization tracing is not shown in the Java representation.

Data space profiling in the Java representation is not currently supported

The Expert-User Representation

The Expert-Java representation is similar to the Java Representation, except that some details of the JVM internals that are suppressed in the Java Representation are exposed in the Expert-Java Representation. With the Expert-Java representation, the Timeline shows all threads; the call stack for housekeeping threads is a native call stack.

The Machine Representation

The Machine representation shows functions from the JVM software itself, rather than from the application being interpreted by the JVM software. It also shows all compiled and native methods. The machine representation looks the same as that of applications written in traditional languages. The call stack shows JVM frames, native frames, and compiled-method frames. Some of the JVM frames represent transition code between interpreted Java, compiled Java, and native code.

Source from compiled methods are shown against the Java source; the data represents the specific instance of the compiled-method selected. Disassembly for compiled methods show the generated machine assembler code, not the Java bytecode. Caller-callee relationships show all overhead frames, and all frames representing the transitions between interpreted, compiled, and native methods.

The Timeline in the machine representation shows bars for all threads, LWPs, or CPUs, and the call stack in each is the machine-representation call stack.

In the machine representation, thread synchronization devolves into calls to _lwp_mutex_lock. No synchronization data is shown, since these calls are not traced.

Overview of OpenMP Software Execution

The actual execution model of OpenMP Applications is described in the OpenMP specifications (See, for example, OpenMP Application Program Interface, Version 2.5, section 1.3.) The specification, however, does not describe some implementation details that may be important to users, and the actual implementation at Sun Microsystems is such that directly recorded profiling information does not easily allow the user to understand how the threads interact.

As any single-threaded program runs, its call stack shows its current location, and a trace of how it got there, starting from the beginning instructions in a routine called _start, which calls main, which then proceeds and calls various subroutines within the program. When a subroutine contains a loop, the program executes the code inside the loop repeatedly until the loop exit criterion is reached. The execution then proceeds to the next sequence of code, and so forth.

When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. An intuitive model of that behavior has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for differents chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.

When the compiler generates code for a parallel region or loop (or any other OpenMP construct), the code inside it is extracted and made into an independent function, called an mfunction. (It may also be referred to as an outlined function, or a loop-body-function.) The name of the function encodes the OpenMP construct type, the name of the function from which it was extracted, and the line number of the source line at which the construct appears. The names of these functions are shown in the Analyzer in the following form, where the name in brackets is the actual symbol-table name of the function.:

bardo_ -- OMP parallel region from line 9 [_$p1C9.bardo_]atomsum_ -- MP doall from line 7 [_$d1A7.atomsum_]

There are other forms of such functions, derived from other source constructs, for which the OMP parallel region in the name is replaced by MP construct, MP doall, or OMP sections. In the following discussion, all of these are referred to generically as "parallel region".

Each thread executing the code within the parallel loop can invoke its mfunction multiple times, with each invocation doing a chunk of the work within the loop. When all the chunks of work are complete, each thread calls synchronization or reduction routines in the library; the master thread then continues, while the slave threads become idle, waiting for the master thread to enter the next parallel region. All of the scheduling and synchronization are handled by calls to the OpenMP runtime.

During its execution, the code within the parallel region might be doing a chunk of the work, or it might be synchronizing with other threads or picking up additional chunks of work to do. It might also call other functions, which may in turn call still others. A slave thread (or the master thread) executing within a parallel region, might itself, or from a function it calls, act as a master thread, and enter its own parallel region, giving rise to nested parallelism.

The Analyzer collects data based on statistical sampling of call stacks, and aggregates its data across all threads and shows metrics of performance based on the type of data collected, against functions, callers and callees, source lines, and instructions. It presents information on the performance of OpenMP programs in either of two modes, User mode and Machine mode. (A third mode, Expert mode, is supported, but is identical to User mode.)

User Mode Display of OpenMP Profile Data

The User mode presentation of the profile data attempts to present the information as if the program really executed according to the model describedin Overview of OpenMP Software Execution. The actual data captures the implementation details of the runtime library, libmtsk.so, which does not correspond to the model. In User mode, the presentation of profile data is altered to match the model better, and differs from the recorded data and Machine mode presentation in three ways:

Artificial functions, are constructed representing the state of each thread from the point of view of the OpenMP runtime library.

Call stacks are manipulated to report data corresponding to the model of how the code runs, as described above.

Two additional metrics of performance are constructed for clock-based profiling experiments, corresponding to time spent doing useful work and time spent waiting in the OpenMP runtime.

Artificial Functions

Artificial functions are constructed and put onto the User mode call stacks reflecting events in which a thread was in some state within the OpenMP runtime library.

The following artificial functions are defined; each is followed by a description of its function:

<OMP-overhead> -- executing in the OpenMP library

<OMP-idle> -- slave thread, waiting for work

<OMP-reduction> -- thread performing a reduction operations

<OMP-implicit_barrier> -- thread waiting at an implicit barrier

<OMP-explicit_barrier> -- thread waiting at an explicit barrier

<OMP-lock_wait> -- thread waiting for a lock

<OMP-critical_section_wait> -- thread waiting to enter a critical section

<OMP-ordered_section_wait> -- thread waiting for its turn to enter an ordered section

When a thread is in an OpenMP runtime state corresponding to one of those functions, the corresponding function is added as the leaf function on the stack. When a thread's leaf function is anywhere in the OpenMP runtime, it is replaced by <OMP-overhead> as the leaf function. Otherwise, all PCs from the OpenMP runtime are omitted from the user-mode stack.

User Mode Call Stacks

The easiest way to understand this model is to look at the call stacks of an OpenMP program at various points in its execution. This section considers a simple program that has a main program that calls one subroutine, foo. That subroutine has a single parallel loop, in which the threads do work, contend for, acquire, and release a lock, and enter and leave a critical section. An additional set of call stacks is shown, reflecting the state when one slave thread has called another function, bar, which enters a nested parallel region.

In this presentation, all the inclusive time spent in a parallel region is included in the inclusive time in the function from which it was extracted, including time spent in the OpenMP runtime, and that inclusive time is propagated all the way up to main and _start

The call stacks that represent the behavior in this model appear as shown in the subsections that follow. The actual names of the parallel region functions are of the following form, as described above:

foo -- OMP parallel region from line 9[ [_$p1C9.foo] bar -- OMP parallel region from line 5[ [_$p1C5.bar]

For clarity, the following shortened forms are used in the descriptions:

foo-OMP... bar-OMP...

In the descriptions, call stacks from all threadsare shown at an instant during execution of the program. The call stack for each thread is shown as a stack of frames, matching the data from selecting an individual profile event in the Analyzer Timeline tab for a single thread, with the leaf PC at the top. In the Timeline tab, each frame is shown with a PC offset, which is omitted below. The stacks from all the threads are shown in a horizontal array, while in the Analyzer Timeline tab, the stacks for other threads would appear in profile bars stacked vertically. Furthermore, in the representation presented, the stacks for all the threads are shown as if they were captured at exactly the same instant, while in a real experiment, the stacks are captured independently in each thread, and may be skewed relative to each other.

The call stacks shown represent the data as it is presented with a vew mode of User in the Analyzer or in the er_print utility.

1. Before the first parallel region

Before the first parallel region is entered, there is only the one thread, the master thread.

Master
`foo`
`main`
`_start`

2. Upon entering the first parallel region

At this point, the library has created the slave threads, and all of the threads, master and slaves, are about to start processing their chunks of work. All threads are shown as having called into the code for the parallel region, foo-OMP..., from foo at the line on which the OpenMP directive for the construct appears, or from the line containing the loop statement that was autoparallelized. The code for the parallel region in each thread is calling into the OpenMP support library, shown as the <OMP-overhead> function, from the first instruction in the parallel region.

Master	Slave 1	Slave 2	Slave 3
<OMP-overhead>	<OMP-overhead>	<OMP-overhead>	<OMP-overhead>
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

The window in which <OMP-overhead> might appear is quite small, so that function might not appear in any particular experiment.

3. While executing within a parallel region

All four of the threads are doing useful work in the parallel region.

Master	Slave 1	Slave 2	Slave 3
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

4. While executing within a parallel region between chunks of work

All four of the threads are doing useful work, but one has finished one chunk of work, and is obtaining its next chunk.

Master	Slave 1	Slave 2	Slave 3
	<OMP-overhead>
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

5. While executing in a critical section within the parallel region

All four of the threads are executing, each within the parallel region. One of them is in the critical section, while one of the others is running before reaching the critical section (or after finishing it). The remaining two are waiting to enter the critical he data collected does not distinguish between the call stack of the thread that is executing in the critical section, and that of the thread that has not yet reached, or has already passed the critical section.

The <OMP-overhead> is unlikely to appear in a real experiment.

5. While executing in a critical section within the parallel region

Master	Slave 1	Slave 2	Slave 3
`<OMP-critical_section_wait>`			`<OMP-critical_section_wait>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

The data collected does not distinguish between the call stack of the thread that is executing in the critical section, and that of the thread that has not yet reached, or has already passed the critical section.

6. While executing around a lock within the parallel region

A section of code around a lock is completely analogous to a critical section. All four of the threads are executing within the parallel region. One thread is executing while holding the lock, one is executing before acquiring the lock (or after acquiring and releasing it), and the other two threads are waiting for the lock.

Master	Slave 1	Slave 2	Slave 3
<OMP-lock_wait>			<OMP-lock_wait>
foo-OMP...	foo-OMP...	foo-OMP...	foo-OMP...
foo	foo	foo	foo
main	main	main	main
_start	_start	_start	_start

As in the critical section example, the data collected does not distinguish between the call stack of a thread holding the lock and executing, or executing before it acquires the lock or after it releases it.

7. Near the end of a parallel region

At this point, three of the threads have finished all their chunks of work, but one of them is still working. The OpenMP construct in this case implicitly specified a barrier; if the user code had explicitly specified the barrier, the <OMP-implicit_barrier> function would be replaced by <OMP-explicit_barrier>.

Master	Slave 1	Slave 2	Slave 3
`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`		`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

8. Near the end of a parallel region, with one or more reduction variables

At this point, two of the threads have finished all their chunks of work, and are performing the reduction computations, but one of them is still working, and the fourth has finished its part of the reduction, and is waiting at the barrier.

Master	Slave 1	Slave 2	Slave 3
`<OMP-reduction>`	`<OMP-implicit_barrier>`		`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

While one thread is shown in the <OMP-reduction> function, the actual time spent in doing the reduction is usually quite small, and is rarely captured in a call stack sample.

9. At the end of a parallel region

At this point, all threads have finished all chunks of work within the parallel region, and have reached the barrier.

Master	Slave 1	Slave 2	Slave 3
`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

Since all the threads have reached the barrier, they may all proceed, and it is unlikely that an experiment would ever find all the threads in this state.

10. After leaving the parallel region

At this point, all the slave threads are waiting for entry into the next parallel region, either spinning or sleeping, depending on the various environment variables set by the user. The program is in serial execution.

Master	Slave 1	Slave 2	Slave 3
`foo`
`main`
`_start`	`<OMP-idle>`	`<OMP-idle>`	`<OMP-idle>`

11. While executing in a nested parallel region

All four of the threads are working, each within the outer parallel region. One of the slave threads has called another function, bar, and it has created a nested parallel region, and an additional slave thread is created to work with it.

Master	Slave 1	Slave 2	Slave 3	Slave 4
	bar-OMP...			bar-OMP...
	bar			bar
foo-OMP...	foo-OMP...	foo-OMP...	foo-OMP...	foo-OMP...
foo	foo	foo	foo	foo
main	main	main	main	main
_start	_start	_start	_start	_start

OpenMP Metrics

When processing a clock-profile event for an OpenMP program, two metrics corresponding to the time spent in each of two states in the OpenMP system are shown. They are "OMP work" and "OMP wait".

Time is accumulated in "OMP work" whenever a thread is executing from the user code, whether in serial or parallel. Time is accumulated in "OMP wait" whenever a thread is waiting for something before it can proceed, whether the wait is a busy-wait (spin-wait), or sleeping. The sum of these two metrics matches the "Total LWP Time" metric in the clock profiles.

Machine Presentation of OpenMP Profiling Data

The real callstacks of the program during various phases of execution are quite different from the ones portrayed above in the intuitive model. The Machine mode of presentation shows the callstacks as measured, with no transformations done, and no artificial functions constructed. The clock-profiling metrics are, however, still shown.

In each the callstacks below, libmtsk represents one or more frames in the callstack within the OpenMP runtime library. The details of which functions appear and in which order change from release to release, as does the internal implementation of code for a barrier, or to perform a reduction.

1. Before the first parallel region

Before the first parallel region is entered, there is only the one thread, the master thread. The callstack is identical to that in User mode.

Master
`foo`
`main`
`_start`

2. During execution in a parallel region

Master	Slave 1	Slave 2	Slave 3
`foo-OMP...`
`libmtsk`
`foo`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`main`	libmtsk	libmtsk	libmtsk
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

In Machine mode, the slave threads are shown as starting in _lwp_start, rather than in _start where the master starts. (In some versions of the thread library, that function may appear as _thread_start.)

3. At the point at which all threads are at a barrier

Master	Slave 1	Slave 2	Slave 3
`libmtsk`
`foo-OMP...`
`foo`	libmtsk	libmtsk	libmtsk
`main`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

Unlike when the threads are executing in the parallel region, when the threads are waiting at a barrier there are no frames from the OpenMP runtime between foo and the parallel region code, foo-OMP.... The reason is that the real execution does not include the OMP parallel region function, but the OpenMP runtime manipulates registers so that the stack unwind shows a call from the last-executed parallel region function to the runtime barrier code. Without it, there would be no way to determine which parallel region is related to the barrier call in Machine mode.

4. After leaving the parallel region

Master	Slave 1	Slave 2	Slave 3
`foo`
`main`	libmtsk	libmtsk	libmtsk
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

In the slave threads, no user frames are on the call stack.

5. When in a nested parallel region

Master	Slave 1	Slave 2	Slave 3	Slave 4
	bar-OMP...
`foo-OMP...`	libmtsk
libmtsk	`bar`
`foo`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`bar-OMP...`
`main`	libmtsk	libmtsk	libmtsk	libmtsk
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

Incomplete Stack Unwinds

Stack unwind might fail for a number of reasons:

If the stack has been corrupted by the user code; if so, the program might core dump, or the data collection code might core dump, depending on exactly how the stack was corrupted.

If the user code does not follow the standard ABI conventions for function calls. In particular, on the SPARC platform, if the return register, %o7, is altered before a save instruction is executed.

On any platform, hand-written assembler code might violate the conventions.

On the x86 platform, if C or Fortran code is compiled at high optimization, which means that it does not have frame pointers to assist in the unwind.

On the x86 platform, if C++ code is compiled at any optimization level with the -noex or -features=no%except options.

If the leaf PC is in a function after the callee's frame is popped from the stack, but before the function returns.

If the call stack contains more than about 250 frames, the Collector does not have the space to completely unwind the call stack. In this case, PCs for functions from _start to some point in the call stack are not recorded in the experiment. The artificial function <Truncated-stack> is shown as called from <Total> to tally the topmost frames recorded.

On x86 and x64 platforms, stack unwind can be problematic if the code does not preserve frame pointers. To preserve frame pointers, compile with the -xreg=no%frameptr option.

Intermediate Files

If you generate intermediate files using the -E or -P compiler options, the Analyzer uses the intermediate file for annotated source code, not the original source file. The #line directives generated with -E can cause problems in the assignment of metrics to source lines.

The following line appears in annotated source if there are instructions from a function that do not have line numbers referring to the source file that was compiled to generate the function:

function_name -- <instructions without line numbers>

Line numbers can be absent under the following circumstances:

You compiled without specifying the -g option.

The debugging information was stripped after compilation, or the executables or object files that contain the information are moved or deleted or subsequently modified.

The function contains code that was generated from #include files rather than from the original source file.

At high optimization, if code was inlined from a function in a different file.

The source file has #line directives referring to some other file; compiling with the -E option, and then compiling the resulting .i file is one way in which this happens. It may also happen when you compile with the -P flag.

The object file cannot be found to read line number information.

The file was compiled without the -g flag, or the file was stripped.

The compiler used generates incomplete line number tables.

Mapping Addresses to Program Structure

Once a call stack is processed into PC values, the Analyzer maps those PCs to shared objects, functions, source lines, and disassembly lines (instructions) in the program. This section describes those mappings.

The Process Image

When a program is run, a process is instantiated from the executable for that program. The process has a number of regions in its address space, some of which are text and represent executable instructions, and some of which are data that is not normally executed. PCs as recorded in the call stack normally correspond to addresses within one of the text segments of the program.

The first text section in a process derives from the executable itself. Others correspond to shared objects that are loaded with the executable, either at the time the process is started, or dynamically loaded by the process. The PCs in a call stack are resolved based on the executable and shared objects loaded at the time the call stack was recorded. Executables and shared objects are very similar, and are collectively referred to as load objects.

Because shared objects can be loaded and unloaded in the course of program execution, any given PC might correspond to different functions at different times during the run. In addition, different PCs at different times might correspond to the same function, when a shared object is unloaded and then reloaded at a different address.

Load Objects and Functions

Each load object, whether an executable or a shared object, contains a text section with the instructions generated by the compiler, a data section for data, and various symbol tables. All load objects must contain an ELF symbol table, which gives the names and addresses of all the globally-known functions in that object. Load objects compiled with the -g option contain additional symbolic information, which can augment the ELF symbol table and provide information about functions that are not global, additional information about object modules from which the functions came, and line number information relating addresses to source lines.

The term function is used to describe a set of instructions that represent a high-level operation described in the source code. The term covers subroutines as used in Fortran, methods as used in C++ and the Java programming language, and the like. Functions are described cleanly in the source code, and normally their names appear in the symbol table representing a set of addresses; if the program counter is within that set, the program is executing within that function.

In principle, any address within the text segment of a load object can be mapped to a function. Exactly the same mapping is used for the leaf PC and all the other PCs on the call stack. Most of the functions correspond directly to the source model of the program. Some do not; these functions are described in the following sections.

Aliased Functions

Typically, functions are defined as global, meaning that their names are known everywhere in the program. The name of a global function must be unique within the executable. If there is more than one global function of a given name within the address space, the runtime linker resolves all references to one of them. The others are never executed, and so do not appear in the function list. In the Summary tab, you can see the shared object and object module that contain the selected function.

Under various circumstances, a function can be known by several different names. A very common example of this is the use of so-called weak and strong symbols for the same piece of code. A strong name is usually the same as the corresponding weak name, except that it has a leading underscore. Many of the functions in the threads library also have alternate names for pthreads and Solaris threads, as well as strong and weak names and alternate internal symbols. In all such cases, only one name is used in the function list of the Analyzer. The name chosen is the last symbol at the given address in alphabetic order. This choice most often corresponds to the name that the user would use. In the Summary tab, all the aliases for the selected function are shown.

Non-Unique Function Names

While aliased functions reflect multiple names for the same piece of code, under some circumstances, multiple pieces of code have the same name:

Sometimes, for reasons of modularity, functions are defined as static, meaning that their names are known only in some parts of the program (usually a single compiled object module). In such cases, several functions of the same name referring to quite different parts of the program appear in the Analyzer. In the Summary tab, the object module name for each of these functions is given to distinguish them from one another. In addition, any selection of one of these functions can be used to show the source, disassembly, and the callers and callees of that specific function.

Sometimes a program uses wrapper or interposition functions that have the weak name of a function in a library and supersede calls to that library function. Some wrapper functions call the original function in the library, in which case both instances of the name appear in the Analyzer function list. Such functions come from different shared objects and different object modules, and can be distinguished from each other in that way. The Collector wraps some library functions, and both the wrapper function and the real function can appear in the Analyzer.

Static Functions From Stripped Shared Libraries

Static functions are often used within libraries, so that the name used internally in a library does not conflict with a name that you might use. When libraries are stripped, the names of static functions are deleted from the symbol table. In such cases, the Analyzer generates an artificial name for each text region in the library containing stripped static functions. The name is of the form <static>@0x12345, where the string following the @ sign is the offset of the text region within the library. The Analyzer cannot distinguish between contiguous stripped static functions and a single such function, so two or more such functions can appear with their metrics coalesced.

Stripped static functions are shown as called from the correct caller, except when the PC from the static function is a leaf PC that appears after the save instruction in the static function. Without the symbolic information, the Analyzer does not know the save address, and cannot tell whether to use the return register as the caller. It always ignores the return register. Since several functions can be coalesced into a single <static>@0x12345 function, the real caller or callee might not be distinguished from the adjacent functions.

Fortran Alternate Entry Points

Fortran provides a way of having multiple entry points to a single piece of code, allowing a caller to call into the middle of a function. When such code is compiled, it consists of a prologue for the main entry point, a prologue to the alternate entry point, and the main body of code for the function. Each prologue sets up the stack for the function's eventual return and then branches or falls through to the main body of code.

The prologue code for each entry point always corresponds to a region of text that has the name of that entry point, but the code for the main body of the subroutine receives only one of the possible entry point names. The name received varies from one compiler to another.

The prologues rarely account for any significant amount of time, and the functions corresponding to entry points other than the one that is associated with the main body of the subroutine rarely appear in the Analyzer. Call stacks representing time in Fortran subroutines with alternate entry points usually have PCs in the main body of the subroutine, rather than the prologue, and only the name associated with the main body appears as a callee. Likewise, all calls from the subroutine are shown as being made from the name associated with the main body of the subroutine.

Cloned Functions

The compilers have the ability to recognize calls to a function for which extra optimization can be performed. An example of such calls is a call to a function for which some of the arguments are constants. When the compiler identifies particular calls that it can optimize, it creates a copy of the function, which is called a clone, and generates optimized code. The clone function name is a mangled name that identifies the particular call. The Analyzer demangles the name, and presents each instance of a cloned function separately in the function list. Each cloned function has a different set of instructions, so the annotated disassembly listing shows the cloned functions separately. Each cloned function has the same source code, so the annotated source listing sums the data over all copies of the function.

Inlined Functions

An inlined function is a function for which the instructions generated by the compiler are inserted at the call site of the function instead of an actual call. There are two kinds of inlining, both of which are done to improve performance, and both of which affect the Analyzer.

C++ inline function definitions. The rationale for inlining in this case is that the cost of calling a function is much greater than the work done by the inlined function, so it is better to simply insert the code for the function at the call site, instead of setting up a call. Typically, access functions are defined to be inlined, because they often only require one instruction. When you compile with the -g option, inlining of functions is disabled; compilation with -g0 permits inlining of functions, and is recommended.

Explicit or automatic inlining performed by the compiler at high optimization levels (4 and 5). Explicit and automatic inlining is performed even when -g is turned on. The rationale for this type of inlining can be to save the cost of a function call, but more often it is to provide more instructions for which register usage and instruction scheduling can be optimized.

Both kinds of inlining have the same effect on the display of metrics. Functions that appear in the source code but have been inlined do not show up in the function list, nor do they appear as callees of the functions into which they have been inlined. Metrics that would otherwise appear as inclusive metrics at the call site of the inlined function, representing time spent in the called function, are actually shown as exclusive metrics attributed to the call site, representing the instructions of the inlined function.

Note - Inlining can make data difficult to interpret, so you might want to disable inlining when you compile your program for performance analysis.

In some cases, even when a function is inlined, a so-called out-of-line function is left. Some call sites call the out-of-line function, but others have the instructions inlined. In such cases, the function appears in the function list but the metrics attributed to it represent only the out-of-line calls.

Compiler-Generated Body Functions

When a compiler parallelizes a loop in a function, or a region that has parallelization directives, it creates new body functions that are not in the original source code. These functions are described in Overview of OpenMP Software Execution.

The Analyzer shows these functions as normal functions, and assigns a name to them based on the function from which they were extracted, in addition to the compiler-generated name. Their exclusive metrics and inclusive metrics represent the time spent in the body function. In addition, the function from which the construct was extracted shows inclusive metrics from each of the body functions. The means by which this is achieved is described in Overview of OpenMP Software Execution.

When a function containing parallel loops is inlined, the names of its compiler-generated body functions reflect the function into which it was inlined, not the original function.

Note - The names of compiler-generated body functions can only be demangled for modules compiled with -g

Outline Functions

Outline functions can be created during feedback-optimized compilations. They represent code that is not normally executed, specifically code that is not executed during the training run used to generate the feedback for the final optimized compilation. A typical example is code that performs error checking on the return value from library functions; the error-handling code is never normally run. To improve paging and instruction-cache behavior, such code is moved elsewhere in the address space, and is made into a separate function. The name of the outline function encodes information about the section of outlined code, including the name of the function from which the code was extracted and the line number of the beginning of the section in the source code. These mangled names can vary from release to release. The Analyzer provides a readable version of the function name.

Outline functions are not really called, but rather are jumped to; similarly they do not return, they jump back. In order to make the behavior more closely match the user's source code model, the Analyzer imputes an artificial call from the main function to its outline portion.

Outline functions are shown as normal functions, with the appropriate inclusive and exclusive metrics. In addition, the metrics for the outline function are added as inclusive metrics in the function from which the code was outlined.

For further details on feedback-optimized compilations, refer to the description of the -xprofile compiler option in Appendix B of the C User's Guide, Appendix A of the C++ User's Guide, or Chapter 3 of the Fortran User's Guide.

Dynamically Compiled Functions

Dynamically compiled functions are functions that are compiled and linked while the program is executing. The Collector has no information about dynamically compiled functions that are written in C or C++, unless the user supplies the required information using the Collector API functions. See Dynamic Functions and Modules for information about the API functions. If information is not supplied, the function appears in the performance analysis tools as <Unknown>.

For Java programs, the Collector obtains information on methods that are compiled by the Java HotSpot virtual machine, and there is no need to use the API functions to provide the information. For other methods, the performance tools show information for the JVM software that executes the methods. In the Java representation, all methods are merged with the interpreted version. In the machine representation, each HotSpot-compiled version is shown separately, and JVM functions are shown for each interpreted method.

The `<Unknown>` Function

Under some circumstances, a PC does not map to a known function. In such cases, the PC is mapped to the special function named <Unknown>.

The following circumstances show PCs mapping to <Unknown>:

When a function written in C or C++ is dynamically generated, and information about the function is not provided to the Collector using the Collector API functions. See Dynamic Functions and Modules for more information about the Collector API functions.

When a Java method is dynamically compiled but Java profiling is disabled.

When the PC corresponds to an address in the data section of the executable or a shared object. One case is the SPARC V7 version of libc.so, which has several functions (.mul and .div, for example) in its data section. The code is in the data section so that it can be dynamically rewritten to use machine instructions when the library detects that it is executing on a SPARC V8 or SPARC V9 platform.

When the PC corresponds to a shared object in the address space of the executable that is not recorded in the experiment.

When the PC is not within any known load object. The most likely cause is an unwind failure, where the value recorded as a PC is not a PC at all, but rather some other word. If the PC is the return register, and it does not seem to be within any known load object, it is ignored, rather than attributed to the <Unknown> function.

When a PC maps to an internal part of the JVM software for which the Collector has no symbolic information.

Callers and callees of the <Unknown> function represent the previous and next PCs in the call stack, and are treated normally.

New and OpenMP Special Functions

Artificial functions are constructed and put onto the User mode call stacks reflecting events in which a thread was in some state within the OpenMP runtime library. The following artificial functions are defined; each is followed by a description of its function:

<OMP-overhead> -- executing in the OpenMP library

<OMP-idle> -- slave thread, waiting for work

<OMP-reduction> -- thread performing a reduction operations

<OMP-implicit_barrier> -- thread waiting at an implicit barrier

<OMP-explicit_barrier> -- thread waiting at an explicit barrier

<OMP-lock_wait> -- thread waiting for a lock

<OMP-critical_section_wait> -- thread waiting to enter a critical section

<OMP-ordered_section_wait> -- thread waiting for its turn to enter an ordered section

The `<JVM-System>` Function

In the User representation, the <JVM-System> function represents time used by the JVM software performing actions other than running a Java program. In this time interval, the JVM software is performing tasks such as garbage collection and HotSpot compilation. By default, <JVM-System> is visible in the Function list.

The `<no` `Java` `callstack` `recorded>` Function

The <no Java callstack recorded> function is similar to the <Unknown> function, but for Java threads, in the Java representation only. When the Collector receives an event from a Java thread, it unwinds the native stack and calls into the JVM software to obtain the corresponding Java stack. If that call fails for any reason, the event is shown in the Analyzer with the artificial function <no Java callstack recorded>. The JVM software might refuse to report a call stack either to avoid deadlock, or when unwinding the Java stack would cause excessive synchronization.

The `<Truncated-stack>` Function

The size of the buffer used by the Analyzer for recording the metrics of individual functions in the call stack is limited. If the size of the call stack becomes so large that the buffer becomes full, any further increase in size of the callstack will force the analyzer to drop function profile information. Since in most programs the bulk of exclusive CPU time is spent in the leaf functions, the Analyzer drops the metrics for functions the less critical functions at the bottom of the stack, starting with the entry functions _start() and main(). The metrics for the dropped functions are consolidated into the single artificial <Truncated-stack> function. The <Truncated-stack> function may also appear in Java programs.

The `<Total>` Function

The <Total> function is an artificial construct used to represent the program as a whole. All performance metrics, in addition to being attributed to the functions on the call stack, are attributed to the special function <Total>. The function appears at the top of the function list and its data can be used to give perspective on the data for other functions. In the Callers-Callees list, it is shown as the nominal caller of _start() in the main thread of execution of any program, and also as the nominal caller of _thread_start() for created threads. If the stack unwind was incomplete, the <Total> function can appear as the caller of <Truncated-stack>.

Functions Related to Hardware Counter Overflow Profiling

The following functions are related to hardware countr overflow profiling:

collector_not_program_related: The counter does not relate to the program.

collector_lost_hwc_overflow: The counter appears to have exceeded the overflow value without generating an overflow signal. The value is recorded and the counter reset.

collector_lost_sigemt: The counter appears to have exceeded the overflow value and been halted but the overflow signal appears to be lost. The value is recorded and the counter reset.

collector_hwc_ABORT: Reading the hardware counters has failed, typically when a privileged process has taken control of the counters, resulting in the termination of hardware counter collection.

collector_final_counters: The values of the counters taken immediately before suspending or terminating collection, with the count since the previous overflow. If this corresponds to a significant fraction of the <Total> count, a smaller overflow interval (that is, a higher resolution configuration) is recommended.

collector_record_counters: The counts accumulated while handling and recording hardware counter events, partially accounting for hardware counter overflow profiling overhead. If this corresponds to a significant fraction of the <Total> count, a larger overflow interval (that is, a lower resolution configuration) is recommended.

Mapping Data Addresses to Program Data Objects

Once a PC from a hardware counter event corresponding to a memory operation has been processed to successfully backtrack to a likely causal memory-referencing instruction, the Analyzer uses instruction identifiers and descriptors provided by the compiler in its hardware profiling support information to derive the associated program data object.

The term data object is used to refer to program constants, variables, arrays and aggregates such as structures and unions, along with distinct aggregate elements, described in source code. Depending on the source language, data object types and their sizes vary. Many data objects are explicitly named in source programs, while others may be unnamed. Some data objects are derived or aggregated from other (simpler) data objects, resulting in a rich, often complex, set of data objects.

Each data object has an associated scope, the region of the source program where it is defined and can be referenced, which may be global (such as a load object), a particular compilation unit (an object file), or function. Identical data objects may be defined with different scopes, or particular data objects referred to differently in different scopes.

Data-derived metrics from hardware counter events for memory operations collected with backtracking enabled are attributed to the associated program data object type and propagate to any aggregates containing the data object and the artificial <Total>, which is considered to contain all data objects (including <Unknown> and <Scalars>). The different subtypes of <Unknown> propagate up to the <Unknown> aggregate. The following section describes the <Total>, <Scalars>, and <Unknown> data objects.

Data Object Descriptors

Data objects are fully described by a combination of their declared type and name. A simple scalar data object {int i} describes an variable called i of type int, while {const+pointer+int p} describes a constant pointer to a type int called p. Spaces in the type names are replaced with underscore (_), and unnamed data objects are represented with a name of dash (-), for example: {double_precision_complex -}.

An entire aggregate is similarly represented {structure:foo_t} for a structure of type foo_t. An element of an aggregate requires the additional specification of its container, for example, {structure:foo_t}.{int i} for a member i of type int of the previous structure of type foo_t. Aggregates can also themselves be elements of (larger) aggregates, with their corresponding descriptor constructed as a concatenation of aggregate descriptors and ultimately a scalar descriptor.

While a fully-qualified descriptor may not always be necessary to disambiguate dataobjects, it provides a generic complete specification to assist with dataobject identification.

The <Total> Data Object

The <Total> data object is an artificial construct used to represent the program's data objects as a whole. All performance metrics, in addition to being attributed to a distinct data object (and any aggregate to which it belongs), are attributed to the special data object <Total>. It appears at the top of the data object list and its data can be used to give perspective to the data for other data objects.

The <Scalars> Data Object

While aggregate elements have their performance metrics additionally attributed into the metric value for their associated aggregate, all of the scalar constants and variables have their performance metrics additionally attributed into the metric value for the artificial <Scalars> data object.

The <Unknown> Data Object and Its Elements

Under various circumstances, event data can not be mapped to a particular data object. In such cases, the data is mapped to the special data object named <Unknown> and one of its elements as described below.

Module with trigger PC not compiled with -xhwcprof

No event-causing instruction or data object was identified because the object code was not compiled with hardware counter profiling support.

Backtracking failed to find a valid branch target

No event-causing instruction was identified because the hardware profiling support information provided in the compilation object was insufficient to verify the validity of backtracking.

Backtracking traversed a branch target

No event-causing instruction or data object was identified because backtracking encountered a control transfer target in the instruction stream.

No identifying descriptor provided by the compiler

Backtracking determined the likely causal memory-referencing instruction, but its associated data object was not specified by the compiler.

No type information

Backtracking determined the likely event-causing instruction, but the instruction was not identified by the compiler as a memory-referencing instruction.

Not determined from the symbolic information provided by the compiler

Backtracking determined the likely causal memory-referencing instruction, but it was not identified by the compiler and associated data object determination is therefore not possible. Compiler temporaries are generally unidentified.

Backtracking was prevented by a jump or call instruction

No event-causing instructions were identified because backtracking encountered a branch or call instruction in the instruction stream.

Backtracking did not find trigger PC

No event-causing instructions were found within the maximum backtracking range.

Could not determine VA because registers changed after trigger instruction

The virtual address of the data object was not determined because registers were overwritten during hardware counter skid.

Memory-referencing instruction did not specify a valid VA

The virtual address of the data object did not appear to be valid.

Memory Objects

Memory objects are components in the memory subsystem, such as cache-lines, pages, and memory-banks. The object is determined from an index computed from the virtual and/or physical address as recorded. Memory objects are predefined for virtual pages and physical pages, for sizes of 8KB, 64KB, 512KB, and 4 MB. You can define others with the mobj_define command in the er_print utility. You can also define custom memory objects using theAdd Memory Objects dialog box in the Analyzer, which you can open by clicking the Add Custom Object button in the Set Data Presentation dialog box.

How Data Collection Works

Experiment Format

The archives Directory

Descendant Processes

Dynamic Functions

Java Experiments

Recording Experiments

collect Experiments

dbx Experiments That Create a Process

dbx Experiments, on a Running Process

Interpreting Performance Metrics

Clock-Based Profiling

Accuracy of Timing Metrics

Comparisons of Timing Metrics

Synchronization Wait Tracing

Hardware Counter Overflow Profiling

Heap Tracing

Dataspace Profiling

MPI Tracing

Call Stacks and Program Execution

Single-Threaded Execution and Function Calls

Function Calls Between Shared Objects

Signals

Traps

Tail-Call Optimization

Explicit Multithreading

Overview of Java Technology-Based Software Execution

Java Call Stacks and Machine Call Stacks

Clock-based Profiling and Hardware Counter Overflow Profiling

Synchronization Tracing

Heap Tracing

Java Processing Representations

The User Representation

The Expert-User Representation

The Machine Representation

Overview of OpenMP Software Execution

User Mode Display of OpenMP Profile Data

Artificial Functions

User Mode Call Stacks

OpenMP Metrics

Machine Presentation of OpenMP Profiling Data

Incomplete Stack Unwinds

Intermediate Files

Mapping Addresses to Program Structure

The Process Image

Load Objects and Functions

Aliased Functions

Non-Unique Function Names

Static Functions From Stripped Shared Libraries

Fortran Alternate Entry Points

Cloned Functions

Inlined Functions

Compiler-Generated Body Functions

Outline Functions

Dynamically Compiled Functions

The <Unknown> Function

New and OpenMP Special Functions

The <JVM-System> Function

The <no Java callstack recorded> Function

The <Truncated-stack> Function

The <Total> Function

Functions Related to Hardware Counter Overflow Profiling

Mapping Data Addresses to Program Data Objects

Data Object Descriptors

The <Total> Data Object

The <Scalars> Data Object

The <Unknown> Data Object and Its Elements

Memory Objects

The `archives` Directory

The `<Unknown>` Function

The `<JVM-System>` Function

The `<no` `Java` `callstack` `recorded>` Function

The `<Truncated-stack>` Function

The `<Total>` Function