Sun Studio 12 Update 1: Performance Analyzer

Chapter 7 Understanding the Performance Analyzer and Its Data

The Performance Analyzer reads the event data that is collected by the Collector and converts it into performance metrics. The metrics are computed for various elements in the structure of the target program, such as instructions, source lines, functions, and load objects. In addition to a header containing a timestamp, thread id, LWP id, and CPU id, the data recorded for each event collected has two parts: the event-specific data for that event type, and the call stack of the application at the time of the event.

The process of associating the metrics with the program structure is not always straightforward, due to the insertions, transformations, and optimizations made by the compiler. This chapter describes the process and discusses the effect on what you see in the Performance Analyzer displays.

This chapter covers the following topics:

How Data Collection Works

The output from a data collection run is an experiment, which is stored as a directory with various internal files and subdirectories in the file system.

Experiment Format

All experiments must have three files:

In addition, experiments have binary data files representing the profile events in the life of the process. Each data file has a series of events, as described below under Interpreting Performance Metrics. Separate files are used for each type of data, but each file is shared by all LWPs in the target.

For clock-based profiling or hardware counter overflow profiling, the data is written in a signal handler invoked by the clock tick or counter overflow. For synchronization tracing, heap tracing, MPI tracing, or OpenMP tracing, data is written from libcollector routines that are interposed, by means of the LD_PRELOAD environment variable, on the normal user-invoked routines. Each such interposition routine partially fills in a data record, then invokes the normal user-invoked routine; when that routine returns, the interposition routine fills in the rest of the data record and writes the record to the data file.
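The interposition pattern can be sketched in C as follows. This is a minimal illustration only: the record layout, the helper names, and the text output are invented for the example, and the real libcollector writes binary, memory-mapped data files as described below.

/* interpose_malloc.c: sketch of an interposition wrapper; build it as a shared
 * object and preload it (for example with LD_PRELOAD=./interpose_malloc.so).  */
#define _GNU_SOURCE          /* needed for RTLD_NEXT on Linux */
#include <dlfcn.h>           /* dlsym, RTLD_NEXT */
#include <stdio.h>           /* snprintf */
#include <stdlib.h>
#include <time.h>            /* clock_gettime */
#include <unistd.h>          /* write */

/* Hypothetical trace record; the real libcollector record formats differ. */
typedef struct { long long t_request, t_grant; size_t size; void *result; } alloc_record_t;

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void write_record(const alloc_record_t *r)   /* stand-in for the data-file writer */
{
    char buf[128];
    int n = snprintf(buf, sizeof buf, "malloc(%zu) = %p, %lld ns\n",
                     r->size, r->result, r->t_grant - r->t_request);
    if (n > 0)
        write(2, buf, (size_t)n);
}

/* A production interposer must also guard against recursion (dlsym or the
 * writer may themselves allocate); that handling is omitted for brevity.   */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);
    if (real_malloc == NULL)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    alloc_record_t rec;
    rec.t_request = now_ns();          /* partially fill in the data record      */
    rec.size      = size;
    rec.result    = real_malloc(size); /* invoke the normal user-invoked routine */
    rec.t_grant   = now_ns();          /* fill in the rest, then write the record */
    write_record(&rec);
    return rec.result;
}

Such a wrapper takes effect only when the shared object containing it is preloaded ahead of the normal library, which is what the LD_PRELOAD setting arranges.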

All data files are memory-mapped and written in blocks. The records are filled in such a way as to always have a valid record structure, so that experiments can be read as they are being written. The buffer management strategy is designed to minimize contention and serialization between LWPs.

An experiment can optionally contain an ASCII file named notes. This file is created automatically when you use the -C comment argument to the collect command. You can create or edit the file manually after the experiment has been created. The contents of the file are prepended to the experiment header.

The archives Directory

Each experiment has an archives directory that contains binary files describing each load object referenced in the map.xml file. These files are produced by the er_archive utility, which runs at the end of data collection. If the process terminates abnormally, the er_archive utility may not be invoked, in which case, the archive files are written by the er_print utility or the Analyzer when first invoked on the experiment.

Descendant Processes

Descendant processes write their experiments into subdirectories within the founder-process’ experiment directory.

These new experiments are named to indicate their lineage as follows:

For example, if the experiment name for the founder process is test.1.er, the experiment for the child process created by its third fork is test.1.er/_f3.er. If that child process executes a new image, the corresponding experiment name is test.1.er/_f3_x1.er. Descendant experiments consist of the same files as the parent experiment, but they do not have descendant experiments (all descendants are represented by subdirectories in the founder experiment), and they do not have archive subdirectories (all archiving is done into the founder experiment).

Dynamic Functions

An experiment where the target creates dynamic functions has additional records in the map.xml file describing those functions, and an additional file, dyntext, containing a copy of the actual instructions of the dynamic functions. The copy is needed to produce annotated disassembly of dynamic functions.

Java Experiments

A Java experiment has additional records in the map.xml file, both for dynamic functions created by the JVM software for its internal purposes, and for dynamically-compiled (HotSpot) versions of the target Java methods.

In addition, a Java experiment has a JAVA_CLASSES file, containing information about all of the user’s Java classes that were invoked.

Java tracing data is recorded using a JVMTI agent, which is part of libcollector.so. The agent receives events that are mapped into the recorded trace events. The agent also receives events for class loading and HotSpot compilation, which are used to write the JAVA_CLASSES file and the Java-compiled method records in the map.xml file.

Recording Experiments

You can record an experiment in three different ways: with the collect command, with dbx creating a process, or with dbx on a running process.

The Performance Tools Collect window in the Analyzer GUI runs a collect experiment; the Collector dialog in the IDE runs a dbx experiment.

collect Experiments

When you use the collect command to record an experiment, the collect utility creates the experiment directory and sets the LD_PRELOAD environment variable to ensure that libcollector.so and other libcollector modules are preloaded into the target’s address space. It then sets environment variables to inform libcollector.so about the experiment name and data collection options, and executes the target on top of itself.

libcollector.so and its associated modules are responsible for writing all experiment files.

dbx Experiments That Create a Process

When dbx is used to launch a process with data collection enabled, dbx also creates the experiment directory and ensures preloading of libcollector.so. Then dbx stops the process at a breakpoint before its first instruction, and calls an initialization routine in libcollector.so to start the data collection.

Java experiments cannot be collected by dbx, since dbx uses a Java Virtual Machine Debug Interface (JVMDI) agent for debugging, and that agent cannot coexist with the Java Virtual Machine Tools Interface (JVMTI) agent needed for data collection.

dbx Experiments on a Running Process

When dbx is used to start an experiment on a running process, it creates the experiment directory, but cannot use the LD_PRELOAD environment variable. dbx makes an interactive function call into the target to open libcollector.so, and then calls the libcollector.so initialization routine, just as it does when creating the process. Data is written by libcollector.so and its modules just as in a collect experiment.

Since libcollector.so was not in the target address space when the process started, any data collection that depends on interposition on user-callable functions (synchronization tracing, heap tracing, MPI tracing) might not work. In general, the symbols have already been resolved to the underlying functions, so the interposition cannot happen. Furthermore, the following of descendant processes also depends on interposition, and does not work properly for experiments created by dbx on a running process.

If you have explicitly preloaded libcollector.so before starting the process with dbx, or before using dbx to attach to the running process, you can collect tracing data.

Interpreting Performance Metrics

The data for each event contains a high-resolution timestamp, a thread ID, an LWP ID, and a processor ID. The first three of these can be used to filter the metrics in the Performance Analyzer by time, thread, LWP, or CPU. See the getcpuid(2) man page for information on processor IDs. On systems where getcpuid is not available, the processor ID is -1, which maps to Unknown.
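Conceptually, the common data for each event can be pictured as a small C record like the following. This is a hypothetical layout for illustration only; the actual on-disk format is not described in this manual.

#include <stdint.h>

typedef struct {
    uint64_t timestamp_ns;   /* high-resolution timestamp of the event    */
    uint32_t thread_id;      /* thread on which the event was recorded    */
    uint32_t lwp_id;         /* Solaris LWP (or Linux thread) identifier  */
    int32_t  cpu_id;         /* processor ID, or -1 when not available    */
} event_header_t;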

In addition to the common data, each event generates specific raw data, which is described in the following sections. Each section also contains a discussion of the accuracy of the metrics derived from the raw data and the effect of data collection on the metrics.

Clock-Based Profiling

The event-specific data for clock-based profiling consists of an array of profiling interval counts. On the Solaris OS, an interval counter is provided for each of the ten microstates maintained by the kernel for each Solaris LWP. At the end of the profiling interval, the interval counter corresponding to the current microstate is incremented by 1, and another profiling signal is scheduled. The array is recorded and reset only when the Solaris LWP enters user mode. Resetting the array consists of setting the array element for the User-CPU state to 1, and the array elements for all the other states to 0. The array data is recorded on entry to user mode before the array is reset. Thus, the array contains an accumulation of counts for each microstate that was entered since the previous entry into user mode. On the Linux OS, microstates do not exist; the only interval counter is User CPU Time.

The call stack is recorded at the same time as the data. If the Solaris LWP is not in user mode at the end of the profiling interval, the call stack cannot change until the LWP or thread enters user mode again. Thus the call stack always accurately records the position of the program counter at the end of each profiling interval.
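The accumulation and reset logic can be pictured with the following C sketch. It is illustrative only: the real work is divided between the kernel, the profiling signal handler, and libcollector, and all names here are hypothetical.

#include <string.h>

#define N_MICROSTATES 10                 /* the ten Solaris kernel microstates */

static unsigned interval_counts[N_MICROSTATES];   /* per-LWP profiling array   */

/* Stand-in for writing the counts, together with the call stack, to the
   experiment's data file.                                                      */
static void record_event(const unsigned counts[N_MICROSTATES])
{
    (void)counts;
}

/* At the end of each profiling interval, count one tick in the current
   microstate and schedule the next profiling signal.                           */
void profiling_tick(int current_microstate)
{
    interval_counts[current_microstate]++;
    /* ... schedule another profiling signal ... */
}

/* When the LWP next enters user mode, record the accumulated counts and
   reset the array: the User-CPU element to 1, all other elements to 0.         */
void on_enter_user_mode(void)
{
    record_event(interval_counts);
    memset(interval_counts, 0, sizeof interval_counts);
    interval_counts[0] = 1;              /* assume index 0 is the User-CPU state */
}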

The metrics to which each of the microstates contributes on the Solaris OS are shown in Table 7–1.

Table 7–1 How Kernel Microstates Contribute to Metrics

Kernel Microstate    Description                                    Metric Name

LMS_USER             Running in user mode                           User CPU Time
LMS_SYSTEM           Running in system call or page fault           System CPU Time
LMS_TRAP             Running in any other trap                      System CPU Time
LMS_TFAULT           Asleep in user text page fault                 Text Page Fault Time
LMS_DFAULT           Asleep in user data page fault                 Data Page Fault Time
LMS_KFAULT           Asleep in kernel page fault                    Other Wait Time
LMS_USER_LOCK        Asleep waiting for user-mode lock              User Lock Time
LMS_SLEEP            Asleep for any other reason                    Other Wait Time
LMS_STOPPED          Stopped (/proc, job control, or lwp_stop)      Other Wait Time
LMS_WAIT_CPU         Waiting for CPU                                Wait CPU Time

Accuracy of Timing Metrics

Timing data is collected on a statistical basis, and is therefore subject to all the errors of any statistical sampling method. For very short runs, in which only a small number of profile packets is recorded, the call stacks might not represent the parts of the program which consume the most resources. Run your program for long enough or enough times to accumulate hundreds of profile packets for any function or source line you are interested in.

In addition to statistical sampling errors, specific errors arise from the way the data is collected and attributed and the way the program progresses through the system. The following are some of the circumstances in which inaccuracies or distortions can appear in the timing metrics:

In addition to the inaccuracies just described, timing metrics are distorted by the process of collecting data. The time spent recording profile packets never appears in the metrics for the program, because the recording is initiated by the profiling signal. (This is another instance of correlation.) The user CPU time spent in the recording process is distributed over whatever microstates are recorded. The result is an underaccounting of the User CPU Time metric and an overaccounting of other metrics. The amount of time spent recording data is typically less than a few percent of the CPU time for the default profiling interval.

Comparisons of Timing Metrics

If you compare timing metrics obtained from the profiling done in a clock-based experiment with times obtained by other means, you should be aware of the following issues.

For a single-threaded application, the total Solaris LWP or Linux thread time recorded for a process is usually accurate to a few tenths of a percent, compared with the values returned by gethrtime(3C) for the same process. The CPU time can vary by several percentage points from the values returned by gethrvtime(3C) for the same process. Under heavy load, the variation might be even more pronounced. However, the CPU time differences do not represent a systematic distortion, and the relative times reported for different functions, source-lines, and such are not substantially distorted.

For multithreaded applications using unbound threads on the Solaris OS, differences in values returned by gethrvtime() could be meaningless because gethrvtime() returns values for an LWP, and a thread can change from one LWP to another.

The LWP times that are reported in the Performance Analyzer can differ substantially from the times that are reported by vmstat, because vmstat reports times that are summed over CPUs. If the target process has more LWPs than the system on which it is running has CPUs, the Performance Analyzer shows more wait time than vmstat reports.

The microstate timings that appear in the Statistics tab of the Performance Analyzer and the er_print statistics display are based on process file system /proc usage reports, for which the times spent in the microstates are recorded to high accuracy. See the proc(4) man page for more information. You can compare these timings with the metrics for the <Total> function, which represents the program as a whole, to gain an indication of the accuracy of the aggregated timing metrics. However, the values displayed in the Statistics tab can include other contributions that are not included in the timing metric values for <Total>. These contributions come from the periods of time in which data collection is paused.

User CPU time and hardware counter cycle time differ because the hardware counters are turned off when the CPU mode has been switched to system mode. For more information, see Traps.

Synchronization Wait Tracing

The Collector collects synchronization delay events by tracing calls to the functions in the threads library, libthread.so, or to the real time extensions library, librt.so. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), and the address of the synchronization object (the mutex lock being requested, for example). The thread and LWP IDs are the IDs at the time the data is recorded. The wait time is the difference between the request time and the grant time. Only events for which the wait time exceeds the specified threshold are recorded. The synchronization wait tracing data is recorded in the experiment at the time of the grant.
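For example, a contended mutex acquisition such as the one below is the kind of call that gets traced; if the time between entering and returning from pthread_mutex_lock() exceeds the threshold, a synchronization delay event is recorded against that call site. The program itself is a hypothetical illustration.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* request timestamp: taken on entry to the traced call;
           grant timestamp: taken when it returns holding the lock */
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}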

The LWP on which the waiting thread is scheduled cannot perform any other work until the event that caused the delay is completed. The time spent waiting appears both as Synchronization Wait Time and as User Lock Time. User Lock Time can be larger than Synchronization Wait Time because the synchronization delay threshold screens out delays of short duration.

The wait time is distorted by the overhead for data collection. The overhead is proportional to the number of events collected. You can minimize the fraction of the wait time spent in overhead by increasing the threshold for recording events.

Hardware Counter Overflow Profiling

Hardware counter overflow profiling data includes a counter ID and the overflow value. The value can be larger than the value at which the counter is set to overflow, because the processor executes some instructions between the overflow and the recording of the event. The value is especially likely to be larger for cycle and instruction counters, which are incremented much more frequently than counters such as floating-point operations or cache misses. The delay in recording the event also means that the program counter address recorded with call stack does not correspond exactly to the overflow event. See Attribution of Hardware Counter Overflows for more information. See also the discussion of Traps. Traps and trap handlers can cause significant differences between reported User CPU time and time reported by the cycle counter.

The amount of data collected depends on the overflow value. Choosing a value that is too small can have the following consequences.

Heap Tracing

The Collector records tracing data for calls to the memory allocation and deallocation functions malloc, realloc, memalign, and free by interposing on these functions. If your program bypasses these functions to allocate memory, tracing data is not recorded. Tracing data is not recorded for Java memory management, which uses a different mechanism.

The functions that are traced could be loaded from any of a number of libraries. The data that you see in the Performance Analyzer might depend on the library from which a given function is loaded.

If a program makes a large number of calls to the traced functions in a short space of time, the time taken to execute the program can be significantly lengthened. The extra time is used in recording the tracing data.

Dataspace Profiling

A dataspace profile is a data collection in which memory-related events, such as cache misses, are reported against the data-object references that cause the events rather than just the instructions where the memory-related events occur. Dataspace profiling is not available on Linux systems.

To allow dataspace profiling, the target must be a C program, compiled for the SPARC architecture with the -xhwcprof and -xdebugformat=dwarf -g flags. Furthermore, the data collected must be hardware counter profiles for memory-related counters, and the optional + sign must be prepended to the counter name. The Performance Analyzer includes two tabs related to dataspace profiling, the DataObject tab and the DataLayout tab, and various tabs for memory objects.

Dataspace profiling can also be done with clock-profiling, by prepending a plus sign (+) to the profiling interval.

MPI Tracing

MPI tracing is based on a modified VampirTrace data collector. For more information, see the VampirTrace User Manual on the Technische Universität Dresden web site.

Call Stacks and Program Execution

A call stack is a series of program counter addresses (PCs) representing instructions from within the program. The first PC, called the leaf PC, is at the bottom of the stack, and is the address of the next instruction to be executed. The next PC is the address of the call to the function containing the leaf PC; the next PC is the address of the call to that function, and so forth, until the top of the stack is reached. Each such address is known as a return address. The process of recording a call stack involves obtaining the return addresses from the program stack and is referred to as unwinding the stack. For information on unwind failures, see Incomplete Stack Unwinds.

The leaf PC in a call stack is used to assign exclusive metrics from the performance data to the function in which that PC is located. Each PC on the stack, including the leaf PC, is used to assign inclusive metrics to the function in which it is located.
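As a hypothetical illustration, consider the small program below. A profile event whose leaf PC falls inside do_work() contributes exclusive time to do_work(), and inclusive time to do_work(), process(), and main(), because all three functions appear on the recorded call stack.

static double do_work(int n)          /* leaf frame: exclusive time accrues here */
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (double)i * (double)i;
    return s;
}

static double process(int n)          /* on the stack: receives inclusive time   */
{
    return do_work(n);
}

int main(void)
{
    return process(10000000) > 0.0 ? 0 : 1;   /* main also receives inclusive time */
}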

Most of the time, the PCs in the recorded call stack correspond in a natural way to functions as they appear in the source code of the program, and the Performance Analyzer’s reported metrics correspond directly to those functions. Sometimes, however, the actual execution of the program does not correspond to a simple intuitive model of how the program would execute, and the Performance Analyzer’s reported metrics might be confusing. See Mapping Addresses to Program Structure for more information about such cases.

Single-Threaded Execution and Function Calls

The simplest case of program execution is that of a single-threaded program calling functions within its own load object.

When a program is loaded into memory to begin execution, a context is established for it that includes the initial address to be executed, an initial register set, and a stack (a region of memory used for scratch data and for keeping track of how functions call each other). The initial address is always at the beginning of the function _start(), which is built into every executable.

When the program runs, instructions are executed in sequence until a branch instruction is encountered, which among other things could represent a function call or a conditional statement. At the branch point, control is transferred to the address given by the target of the branch, and execution proceeds from there. (Usually the next instruction after the branch is already committed for execution: this instruction is called the branch delay slot instruction. However, some branch instructions annul the execution of the branch delay slot instruction.)

When the instruction sequence that represents a call is executed, the return address is put into a register, and execution proceeds at the first instruction of the function being called.

In most cases, somewhere in the first few instructions of the called function, a new frame (a region of memory used to store information about the function) is pushed onto the stack, and the return address is put into that frame. The register used for the return address can then be used when the called function itself calls another function. When the function is about to return, it pops its frame from the stack, and control returns to the address from which the function was called.

Function Calls Between Shared Objects

When a function in one shared object calls a function in another shared object, the execution is more complicated than in a simple call to a function within the program. Each shared object contains a Procedure Linkage Table, or PLT, which contains entries for every function external to that shared object that is referenced from it. Initially the address for each external function in the PLT is actually an address within ld.so, the dynamic linker. The first time such a function is called, control is transferred to the dynamic linker, which resolves the call to the real external function and patches the PLT address for subsequent calls.

If a profiling event occurs during the execution of one of the three PLT instructions, the PLT PCs are deleted, and exclusive time is attributed to the call instruction. If a profiling event occurs during the first call through a PLT entry, but the leaf PC is not one of the PLT instructions, any PCs that arise from the PLT and code in ld.so are replaced by a call to an artificial function, @plt, which accumulates inclusive time. There is one such artificial function for each shared object. If the program uses the LD_AUDIT interface, the PLT entries might never be patched, and non-leaf PCs from @plt can occur more frequently.

Signals

When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system function, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another.

The Performance Analyzer treats the frames resulting from signal delivery as ordinary frames. The user code at the point at which the signal was delivered is shown as calling the system function sigacthandler(), and sigacthandler() in turn is shown as calling the user’s signal handler. Inclusive metrics from both sigacthandler() and any user signal handler, and any other functions they call, appear as inclusive metrics for the interrupted function.

The Collector interposes on sigaction() to ensure that its handlers are the primary handlers for the SIGPROF signal when clock data is collected and SIGEMT signal when hardware counter overflow data is collected.
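For reference, user code installs its handlers through the same sigaction() interface that the Collector interposes on. The following is a minimal, self-contained sketch, unrelated to profiling itself; the signal and handler are chosen for the example.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void handler(int sig)
{
    got_signal = sig;                       /* only async-signal-safe work here */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);

    sigaction(SIGUSR1, &sa, NULL);          /* the call the Collector interposes on */

    kill(getpid(), SIGUSR1);
    printf("handled signal %d\n", (int)got_signal);
    return 0;
}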

Traps

Traps can be issued by an instruction or by the hardware, and are caught by a trap handler. System traps are traps that are initiated from an instruction and trap into the kernel. All system calls are implemented using trap instructions. Some examples of hardware traps are those issued from the floating point unit when it is unable to complete an instruction (such as the fitos instruction for some register-content values on the UltraSPARC® III platform), or when the instruction is not implemented in the hardware.

When a trap is issued, the Solaris LWP or Linux kernel enters system mode. On the Solaris OS, the microstate is usually switched from User CPU state to Trap state then to System state. The time spent handling the trap can show as a combination of System CPU time and User CPU time, depending on the point at which the microstate is switched. The time is attributed to the instruction in the user’s code from which the trap was initiated (or to the system call).

For some system calls, it is considered critical to provide as efficient handling of the call as possible. The traps generated by these calls are known as fast traps. Among the system functions that generate fast traps are gethrtime and gethrvtime. In these functions, the microstate is not switched because of the overhead involved.

In other circumstances it is also considered critical to provide as efficient handling of the trap as possible. Some examples of these are TLB (translation lookaside buffer) misses and register window spills and fills, for which the microstate is not switched.

In both cases, the time spent is recorded as User CPU time. However, the hardware counters are turned off because the CPU mode has been switched to system mode. The time spent handling these traps can therefore be estimated by taking the difference between User CPU time and Cycles time, preferably recorded in the same experiment.

In one case the trap handler switches back to user mode, and that is the misaligned memory reference trap for an 8-byte integer which is aligned on a 4-byte boundary in Fortran. A frame for the trap handler appears on the stack, and a call to the handler can appear in the Performance Analyzer, attributed to the integer load or store instruction.

When an instruction traps into the kernel, the instruction following the trapping instruction appears to take a long time, because it cannot start until the kernel has finished executing the trapping instruction.

Tail-Call Optimization

The compiler can perform a particular optimization whenever the last thing a function does is call another function. Rather than generating a new frame, the callee reuses the frame from the caller, and the return address for the callee is copied from the caller. The motivation for this optimization is to reduce the size of the stack, and, on SPARC platforms, to reduce the use of register windows.

Suppose that the call sequence in your program source looks like this:

A -> B -> C -> D

When B and C are tail-call optimized, the call stack looks as if function A calls functions B, C, and D directly.

A -> B
A -> C
A -> D

That is, the call tree is flattened. When code is compiled with the -g option, tail-call optimization takes place only at a compiler optimization level of 4 or higher. When code is compiled without the -g option, tail-call optimization takes place at a compiler optimization level of 2 or higher.
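A small C illustration of the effect (function names invented for the example): when b() and c() end by directly returning the value of the function they call, the compiler can reuse their frames, so a call stack captured inside d() shows d() as called directly from a().

int d(int x) { return x + 1; }

int c(int x) { return d(x); }      /* tail call: c's frame can be reused by d */

int b(int x) { return c(x); }      /* tail call: b's frame can be reused by c */

int a(int x) { return b(x) + 1; }  /* not a tail call: a stays on the stack   */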

Explicit Multithreading

A simple program executes in a single thread, on a single LWP (lightweight process) in the Solaris OS. Multithreaded executables make calls to a thread creation function, to which the target function for execution is passed. When the target exits, the thread is destroyed.

The Solaris OS supports two thread implementations: Solaris threads and POSIX threads (Pthreads). Beginning with the Solaris 10 OS, both thread implementations are included in libc.so.

With Solaris threads, newly-created threads begin execution at a function called _thread_start(), which calls the function passed in the thread creation call. For any call stack involving the target as executed by this thread, the top of the stack is _thread_start(), and there is no connection to the caller of the thread creation function. Inclusive metrics associated with the created thread therefore only propagate up as far as _thread_start() and the <Total> function. In addition to creating the threads, the Solaris threads implementation also creates LWPs on Solaris to execute the threads. Each thread is bound to a specific LWP.

Pthreads is available in the Solaris 10 OS as well as in the Linux OS for explicit multithreading.

In both environments, to create a new thread, the application calls the Pthread API function pthread_create(), passing a pointer to an application-defined start routine as one of the function arguments.
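A minimal example of such a call follows; the thread start routine and its argument are hypothetical names chosen for the illustration.

#include <pthread.h>
#include <stdio.h>

static void *start_routine(void *arg)      /* the application-defined start routine */
{
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, start_routine, (void *)1L);
    pthread_join(tid, NULL);
    return 0;
}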

On the Solaris OS, when a new pthread starts execution, it calls the _lwp_start() function. On the Solaris 10 OS, _lwp_start() calls an intermediate function _thr_setup(), which then calls the application-defined start routine that was specified in pthread_create().

On the Linux OS, when the new pthread starts execution, it runs a Linux-specific system function, clone(), which calls another internal initialization function, pthread_start_thread(), which in turn calls the application-defined start routine that was specified in pthread_create(). The Linux metrics-gathering functions available to the Collector are thread-specific. Therefore, when the collect utility runs, it interposes a metrics-gathering function, named collector_root(), between pthread_start_thread() and the application-defined thread start routine.

Overview of Java Technology-Based Software Execution

To the typical developer, a Java technology-based application runs just like any other program. The application begins at a main entry point, typically named class.main, which may call other methods, just as a C or C++ application does.

To the operating system, an application written in the Java programming language, (pure or mixed with C/C++), runs as a process instantiating the JVM software. The JVM software is compiled from C++ sources and starts execution at _start, which calls main, and so forth. It reads bytecode from .class and/or .jar files, and performs the operations specified in that program. Among the operations that can be specified is the dynamic loading of a native shared object, and calls into various functions or methods contained within that object.

The JVM software does a number of things that are typically not done by applications written in traditional languages. At startup, it creates a number of regions of dynamically-generated code in its data space. One of these regions is the actual interpreter code used to process the application’s bytecode methods.

During execution of a Java technology-based application, most methods are interpreted by the JVM software; these methods are referred to as interpreted methods. The Java HotSpot virtual machine monitors performance as it interprets the bytecode to detect methods that are frequently executed. Methods that are repeatedly executed might then be compiled by the Java HotSpot virtual machine to generate machine code for those methods. The resulting methods are referred to as compiled methods. The virtual machine executes the more efficient compiled methods thereafter, rather than interpret the original bytecode for the methods. Compiled methods are loaded into the data space of the application, and may be unloaded at some later point in time. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.

Code written in the Java programming language might also call directly into native-compiled code, either C, C++, or Fortran; the targets of such calls are referred to as native methods.

Applications written in the Java programming language are inherently multithreaded, and have one JVM software thread for each thread in the user’s program. Java applications also have several housekeeping threads used for signal handling, memory management, and Java HotSpot virtual machine compilation.

Data collection is implemented with various methods in the JVMTI in J2SE 5.0.

Java Call Stacks and Machine Call Stacks

The performance tools collect their data by recording events in the life of each Solaris LWP or Linux thread, along with the call stack at the time of the event. At any point in the execution of any application, the call stack represents where the program is in its execution, and how it got there. One important way that mixed-model Java applications differ from traditional C, C++, and Fortran applications is that at any instant during the run of the target there are two call stacks that are meaningful: a Java call stack, and a machine call stack. Both call stacks are recorded during profiling, and are reconciled during analysis.

Clock-based Profiling and Hardware Counter Overflow Profiling

Clock-based profiling and hardware counter overflow profiling for Java programs work just as for C, C++, and Fortran programs, except that both Java call stacks and machine call stacks are collected.

Java Processing Representations

There are three representations for displaying performance data for applications written in the Java programming language: the Java representation, the Expert-Java representation, and the Machine representation. The Java representation is shown by default where the data supports it. The following section summarizes the main differences between these three representations.

The Java Representation

The Java representation shows compiled and interpreted Java methods by name, and shows native methods in their natural form. During execution, there might be many instances of a particular Java method executed: the interpreted version, and, perhaps, one or more compiled versions. In the Java representation all methods are shown aggregated as a single method. This representation is selected in the Analyzer by default.

A PC for a Java method in the Java representation corresponds to the method-id and a bytecode index into that method; a PC for a native function corresponds to a machine PC. The call stack for a Java thread may have a mixture of Java PCs and machine PCs. It does not have any frames corresponding to Java housekeeping code, which does not have a Java representation. Under some circumstances, the JVM software cannot unwind the Java stack, and a single frame with the special function, <no Java callstack recorded>, is returned. Typically, it amounts to no more than 5-10% of the total time.

The function list in the Java representation shows metrics against the Java methods and any native methods called. The caller-callee panel shows the calling relationships in the Java representation.

Source for a Java method corresponds to the source code in the .java file from which it was compiled, with metrics on each source line. The disassembly of any Java method shows the bytecode generated for it, with metrics against each bytecode, and interleaved Java source, where available.

The Timeline in the Java representation shows only Java threads. The call stack for each thread is shown with its Java methods.

Data space profiling in the Java representation is not currently supported.

The Expert-Java Representation

The Expert-Java representation is similar to the Java representation, except that some details of the JVM internals that are suppressed in the Java representation are exposed in the Expert-Java representation. With the Expert-Java representation, the Timeline shows all threads; the call stack for housekeeping threads is a native call stack.

The Machine Representation

The Machine representation shows functions from the JVM software itself, rather than from the application being interpreted by the JVM software. It also shows all compiled and native methods. The machine representation looks the same as that of applications written in traditional languages. The call stack shows JVM frames, native frames, and compiled-method frames. Some of the JVM frames represent transition code between interpreted Java, compiled Java, and native code.

Source for compiled methods is shown against the Java source; the data represents the specific instance of the compiled method selected. Disassembly for compiled methods shows the generated machine assembler code, not the Java bytecode. Caller-callee relationships show all overhead frames, and all frames representing the transitions between interpreted, compiled, and native methods.

The Timeline in the machine representation shows bars for all threads, LWPs, or CPUs, and the call stack in each is the machine-representation call stack.

Overview of OpenMP Software Execution

The actual execution model of OpenMP applications is described in the OpenMP specifications (see, for example, the OpenMP Application Program Interface, Version 3.0, Section 1.3). The specification, however, does not describe some implementation details that may be important to users, and the actual implementation at Sun Microsystems is such that directly recorded profiling information does not easily allow the user to understand how the threads interact.

As any single-threaded program runs, its call stack shows its current location, and a trace of how it got there, starting from the initial instructions in a routine called _start, which calls main, which in turn calls various subroutines within the program. When a subroutine contains a loop, the program executes the code inside the loop repeatedly until the loop exit criterion is reached. The execution then proceeds to the next sequence of code, and so forth.

When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. An intuitive model of that behavior has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for different chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.

When the compiler generates code for a parallel region or loop (or any other OpenMP construct), the code inside it is extracted and made into an independent function, called an mfunction. (It may also be referred to as an outlined function, or a loop-body-function.) The name of the function encodes the OpenMP construct type, the name of the function from which it was extracted, and the line number of the source line at which the construct appears. The names of these functions are shown in the Analyzer in the following form, where the name in brackets is the actual symbol-table name of the function:

bardo_ -- OMP parallel region from line 9 [_$p1C9.bardo_]
atomsum_ -- MP doall from line 7 [_$d1A7.atomsum_]

There are other forms of such functions, derived from other source constructs, for which the OMP parallel region in the name is replaced by MP construct, MP doall, or OMP sections. In the following discussion, all of these are referred to generically as “parallel regions”.
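For example, a C function such as the one below, compiled with the compiler's OpenMP support enabled, has the body of its parallel loop extracted into an mfunction whose name encodes the construct type, the enclosing function name (scale), and the source line of the #pragma. The function itself is a hypothetical illustration; the exact mangled and demangled forms depend on the compiler release.

#include <omp.h>

void scale(double *a, int n, double f)
{
    /* The compiler outlines the body of this construct into an mfunction;
       each thread calls that mfunction once for every chunk of iterations
       it is given.                                                         */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= f;
}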

Each thread executing the code within the parallel loop can invoke its mfunction multiple times, with each invocation doing a chunk of the work within the loop. When all the chunks of work are complete, each thread calls synchronization or reduction routines in the library; the master thread then continues, while the slave threads become idle, waiting for the master thread to enter the next parallel region. All of the scheduling and synchronization are handled by calls to the OpenMP runtime.

During its execution, the code within the parallel region might be doing a chunk of the work, or it might be synchronizing with other threads or picking up additional chunks of work to do. It might also call other functions, which may in turn call still others. A slave thread (or the master thread) executing within a parallel region, might itself, or from a function it calls, act as a master thread, and enter its own parallel region, giving rise to nested parallelism.

The Analyzer collects data based on statistical sampling of call stacks, and aggregates its data across all threads and shows metrics of performance based on the type of data collected, against functions, callers and callees, source lines, and instructions. It presents information on the performance of OpenMP programs in one of three modes: User mode, Expert mode, and Machine mode.

For more detailed information, see An OpenMP Runtime API for Profiling at the OpenMP user community web site.

User Mode Display of OpenMP Profile Data

The User mode presentation of the profile data attempts to present the information as if the program really executed according to the model described in Overview of OpenMP Software Execution. The actual data captures the implementation details of the runtime library, libmtsk.so, which does not correspond to the model. In User mode, the presentation of profile data is altered to match the model better, and differs from the recorded data and Machine mode presentation in three ways:

Artificial Functions

Artificial functions are constructed and put onto the User mode call stacks reflecting events in which a thread was in some state within the OpenMP runtime library.

The following artificial functions are defined:

<OMP-overhead>                  Executing in the OpenMP library
<OMP-idle>                      Slave thread, waiting for work
<OMP-reduction>                 Thread performing a reduction operation
<OMP-implicit_barrier>          Thread waiting at an implicit barrier
<OMP-explicit_barrier>          Thread waiting at an explicit barrier
<OMP-lock_wait>                 Thread waiting for a lock
<OMP-critical_section_wait>     Thread waiting to enter a critical section
<OMP-ordered_section_wait>      Thread waiting for its turn to enter an ordered section

When a thread is in an OpenMP runtime state corresponding to one of those functions, the corresponding function is added as the leaf function on the stack. When a thread’s leaf function is anywhere in the OpenMP runtime, it is replaced by <OMP-overhead> as the leaf function. Otherwise, all PCs from the OpenMP runtime are omitted from the user-mode stack.

User Mode Call Stacks

For OpenMP experiments, user mode shows reconstructed call stacks similar to those obtained when the program is compiled without OpenMP.

OpenMP Metrics

When processing a clock-profile event for an OpenMP program, two metrics corresponding to the time spent in each of two states in the OpenMP system are shown: OpenMP Work and OpenMP Wait.

Time is accumulated in OpenMP Work whenever a thread is executing from the user code, whether in serial or parallel. Time is accumulated in OpenMP Wait whenever a thread is waiting for something before it can proceed, whether the wait is a busy-wait (spin-wait), or sleeping. The sum of these two metrics matches the Total LWP Time metric in the clock profiles.

Machine Presentation of OpenMP Profiling Data

The real call stacks of the program during various phases of execution are quite different from the ones portrayed above in the intuitive model. The Machine mode of presentation shows the call stacks as measured, with no transformations done, and no artificial functions constructed. The clock-profiling metrics are, however, still shown.

In each of the call stacks below, libmtsk represents one or more frames in the call stack within the OpenMP runtime library. The details of which functions appear and in which order change from release to release, as does the internal implementation of the code for a barrier or for performing a reduction.

  1. Before the first parallel region

    Before the first parallel region is entered, there is only the one thread, the master thread. The call stack is identical to that in User mode.

    Master
    foo
    main
    _start

  2. During execution in a parallel region

    Master         Slave 1        Slave 2        Slave 3
    foo-OMP...
    libmtsk
    foo            foo-OMP...     foo-OMP...     foo-OMP...
    main           libmtsk        libmtsk        libmtsk
    _start         _lwp_start     _lwp_start     _lwp_start

    In Machine mode, the slave threads are shown as starting in _lwp_start, rather than in _start where the master starts. (In some versions of the thread library, that function may appear as _thread_start.)

  3. At the point at which all threads are at a barrier

    Master         Slave 1        Slave 2        Slave 3
    libmtsk
    foo-OMP...
    foo            libmtsk        libmtsk        libmtsk
    main           foo-OMP...     foo-OMP...     foo-OMP...
    _start         _lwp_start     _lwp_start     _lwp_start

    Unlike when the threads are executing in the parallel region, when the threads are waiting at a barrier there are no frames from the OpenMP runtime between foo and the parallel region code, foo-OMP.... The reason is that the real execution does not include the OMP parallel region function, but the OpenMP runtime manipulates registers so that the stack unwind shows a call from the last-executed parallel region function to the runtime barrier code. Without it, there would be no way to determine which parallel region is related to the barrier call in Machine mode.

  4. After leaving the parallel region

    Master         Slave 1        Slave 2        Slave 3
    foo
    main           libmtsk        libmtsk        libmtsk
    _start         _lwp_start     _lwp_start     _lwp_start

    In the slave threads, no user frames are on the call stack.

  5. When in a nested parallel region

    Master         Slave 1        Slave 2        Slave 3        Slave 4
                   bar-OMP...
    foo-OMP...     libmtsk
    libmtsk        bar
    foo            foo-OMP...     foo-OMP...     foo-OMP...     bar-OMP...
    main           libmtsk        libmtsk        libmtsk        libmtsk
    _start         _lwp_start     _lwp_start     _lwp_start     _lwp_start

Incomplete Stack Unwinds

Stack unwind might fail for a number of reasons:

Intermediate Files

If you generate intermediate files using the -E or -P compiler options, the Analyzer uses the intermediate file for annotated source code, not the original source file. The #line directives generated with -E can cause problems in the assignment of metrics to source lines.

The following line appears in annotated source if there are instructions from a function that do not have line numbers referring to the source file that was compiled to generate the function:

function_name -- <instructions without line numbers>

Line numbers can be absent under the following circumstances:

Mapping Addresses to Program Structure

Once a call stack is processed into PC values, the Analyzer maps those PCs to shared objects, functions, source lines, and disassembly lines (instructions) in the program. This section describes those mappings.

The Process Image

When a program is run, a process is instantiated from the executable for that program. The process has a number of regions in its address space, some of which are text and represent executable instructions, and some of which are data that is not normally executed. PCs as recorded in the call stack normally correspond to addresses within one of the text segments of the program.

The first text section in a process derives from the executable itself. Others correspond to shared objects that are loaded with the executable, either at the time the process is started, or dynamically loaded by the process. The PCs in a call stack are resolved based on the executable and shared objects loaded at the time the call stack was recorded. Executables and shared objects are very similar, and are collectively referred to as load objects.

Because shared objects can be loaded and unloaded in the course of program execution, any given PC might correspond to different functions at different times during the run. In addition, different PCs at different times might correspond to the same function, when a shared object is unloaded and then reloaded at a different address.

Load Objects and Functions

Each load object, whether an executable or a shared object, contains a text section with the instructions generated by the compiler, a data section for data, and various symbol tables. All load objects must contain an ELF symbol table, which gives the names and addresses of all the globally-known functions in that object. Load objects compiled with the -g option contain additional symbolic information, which can augment the ELF symbol table and provide information about functions that are not global, additional information about object modules from which the functions came, and line number information relating addresses to source lines.

The term function is used to describe a set of instructions that represent a high-level operation described in the source code. The term covers subroutines as used in Fortran, methods as used in C++ and the Java programming language, and the like. Functions are described cleanly in the source code, and normally their names appear in the symbol table representing a set of addresses; if the program counter is within that set, the program is executing within that function.

In principle, any address within the text segment of a load object can be mapped to a function. Exactly the same mapping is used for the leaf PC and all the other PCs on the call stack. Most of the functions correspond directly to the source model of the program. Some do not; these functions are described in the following sections.

Aliased Functions

Typically, functions are defined as global, meaning that their names are known everywhere in the program. The name of a global function must be unique within the executable. If there is more than one global function of a given name within the address space, the runtime linker resolves all references to one of them. The others are never executed, and so do not appear in the function list. In the Summary tab, you can see the shared object and object module that contain the selected function.

Under various circumstances, a function can be known by several different names. A very common example of this is the use of so-called weak and strong symbols for the same piece of code. A strong name is usually the same as the corresponding weak name, except that it has a leading underscore. Many of the functions in the threads library also have alternate names for pthreads and Solaris threads, as well as strong and weak names and alternate internal symbols. In all such cases, only one name is used in the function list of the Analyzer. The name chosen is the last symbol at the given address in alphabetic order. This choice most often corresponds to the name that the user would use. In the Summary tab, all the aliases for the selected function are shown.

Non-Unique Function Names

While aliased functions reflect multiple names for the same piece of code, under some circumstances, multiple pieces of code have the same name:

Static Functions From Stripped Shared Libraries

Static functions are often used within libraries, so that the name used internally in a library does not conflict with a name that you might use. When libraries are stripped, the names of static functions are deleted from the symbol table. In such cases, the Analyzer generates an artificial name for each text region in the library containing stripped static functions. The name is of the form <static>@0x12345, where the string following the @ sign is the offset of the text region within the library. The Analyzer cannot distinguish between contiguous stripped static functions and a single such function, so two or more such functions can appear with their metrics coalesced.

Stripped static functions are shown as called from the correct caller, except when the PC from the static function is a leaf PC that appears after the save instruction in the static function. Without the symbolic information, the Analyzer does not know the save address, and cannot tell whether to use the return register as the caller. It always ignores the return register. Since several functions can be coalesced into a single <static>@0x12345 function, the real caller or callee might not be distinguished from the adjacent functions.

Fortran Alternate Entry Points

Fortran provides a way of having multiple entry points to a single piece of code, allowing a caller to call into the middle of a function. When such code is compiled, it consists of a prologue for the main entry point, a prologue to the alternate entry point, and the main body of code for the function. Each prologue sets up the stack for the function’s eventual return and then branches or falls through to the main body of code.

The prologue code for each entry point always corresponds to a region of text that has the name of that entry point, but the code for the main body of the subroutine receives only one of the possible entry point names. The name received varies from one compiler to another.

The prologues rarely account for any significant amount of time, and the functions corresponding to entry points other than the one that is associated with the main body of the subroutine rarely appear in the Analyzer. Call stacks representing time in Fortran subroutines with alternate entry points usually have PCs in the main body of the subroutine, rather than the prologue, and only the name associated with the main body appears as a callee. Likewise, all calls from the subroutine are shown as being made from the name associated with the main body of the subroutine.

Cloned Functions

The compilers have the ability to recognize calls to a function for which extra optimization can be performed. An example of such calls is a call to a function for which some of the arguments are constants. When the compiler identifies particular calls that it can optimize, it creates a copy of the function, which is called a clone, and generates optimized code. The clone function name is a mangled name that identifies the particular call. The Analyzer demangles the name, and presents each instance of a cloned function separately in the function list. Each cloned function has a different set of instructions, so the annotated disassembly listing shows the cloned functions separately. Each cloned function has the same source code, so the annotated source listing sums the data over all copies of the function.

Inlined Functions

An inlined function is a function for which the instructions generated by the compiler are inserted at the call site of the function instead of an actual call. There are two kinds of inlining, both of which are done to improve performance, and both of which affect the Analyzer.

Both kinds of inlining have the same effect on the display of metrics. Functions that appear in the source code but have been inlined do not show up in the function list, nor do they appear as callees of the functions into which they have been inlined. Metrics that would otherwise appear as inclusive metrics at the call site of the inlined function, representing time spent in the called function, are actually shown as exclusive metrics attributed to the call site, representing the instructions of the inlined function.
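A small hypothetical C example of the effect: if square() below is inlined into total(), square() does not appear in the function list, and the cycles it consumes are reported as exclusive time on the line in total() that contains the call.

static inline double square(double x)   /* a candidate for inlining */
{
    return x * x;
}

double total(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += square(v[i]);      /* when inlined, this line receives the exclusive time */
    return s;
}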


Note –

Inlining can make data difficult to interpret, so you might want to disable inlining when you compile your program for performance analysis.


In some cases, even when a function is inlined, a so-called out-of-line function is left. Some call sites call the out-of-line function, but others have the instructions inlined. In such cases, the function appears in the function list but the metrics attributed to it represent only the out-of-line calls.

Compiler-Generated Body Functions

When a compiler parallelizes a loop in a function, or a region that has parallelization directives, it creates new body functions that are not in the original source code. These functions are described in Overview of OpenMP Software Execution.

The Analyzer shows these functions as normal functions, and assigns a name to them based on the function from which they were extracted, in addition to the compiler-generated name. Their exclusive metrics and inclusive metrics represent the time spent in the body function. In addition, the function from which the construct was extracted shows inclusive metrics from each of the body functions. The means by which this is achieved is described in Overview of OpenMP Software Execution.

When a function containing parallel loops is inlined, the names of its compiler-generated body functions reflect the function into which it was inlined, not the original function.


Note –

The names of compiler-generated body functions can only be demangled for modules compiled with -g.


Outline Functions

Outline functions can be created during feedback-optimized compilations. They represent code that is not normally executed, specifically code that is not executed during the training run used to generate the feedback for the final optimized compilation. A typical example is code that performs error checking on the return value from library functions; the error-handling code is never normally run. To improve paging and instruction-cache behavior, such code is moved elsewhere in the address space, and is made into a separate function. The name of the outline function encodes information about the section of outlined code, including the name of the function from which the code was extracted and the line number of the beginning of the section in the source code. These mangled names can vary from release to release. The Analyzer provides a readable version of the function name.

Outline functions are not really called, but rather are jumped to; similarly they do not return, they jump back. In order to make the behavior more closely match the user’s source code model, the Analyzer imputes an artificial call from the main function to its outline portion.

Outline functions are shown as normal functions, with the appropriate inclusive and exclusive metrics. In addition, the metrics for the outline function are added as inclusive metrics in the function from which the code was outlined.
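For illustration, the error branch below is typical of code that a feedback-optimized build might outline; the file name and message are made up, and whether the branch is actually outlined depends on the training run and the compiler.

    #include <stdio.h>
    #include <stdlib.h>

    void read_config(const char *path)
    {
        FILE *f = fopen(path, "r");
        if (f == NULL) {
            /* Not executed during a typical training run, so a
             * feedback-optimized build may move this block into a
             * separate outline function whose name encodes read_config
             * and this line number. */
            fprintf(stderr, "cannot open %s\n", path);
            exit(1);
        }
        /* ... normal processing ... */
        fclose(f);
    }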

For further details on feedback-optimized compilations, refer to the description of the -xprofile compiler option in Appendix B of the C User’s Guide, Appendix A of the C++ User’s Guide, or Chapter 3 of the Fortran User’s Guide.

Dynamically Compiled Functions

Dynamically compiled functions are functions that are compiled and linked while the program is executing. The Collector has no information about dynamically compiled functions that are written in C or C++, unless the user supplies the required information using the Collector API functions. See Dynamic Functions and Modules for information about the API functions. If information is not supplied, the function appears in the performance analysis tools as <Unknown>.
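The following sketch shows how such information might be supplied for code generated at run time, assuming the collector_func_load() interface described in Dynamic Functions and Modules; the header name, argument values, and the use of a null line-number table are assumptions to be checked against libcollector.h rather than a definitive usage.

    #include "libcollector.h"   /* assumed header for the Collector API */

    /* Hypothetical helper, called after code for one function has been
     * generated at address 'code' with length 'size' bytes. */
    static void register_dynamic_function(void *code, int size)
    {
        collector_func_load("generated_func",   /* name shown by the Analyzer */
                            NULL,               /* alias */
                            "generated.c",      /* source name (illustrative) */
                            code, size,         /* address and size of the code */
                            0, NULL);           /* no line-number table */
    }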

For Java programs, the Collector obtains information on methods that are compiled by the Java HotSpot virtual machine, and there is no need to use the API functions to provide the information. For other methods, the performance tools show information for the JVM software that executes the methods. In the Java representation, all methods are merged with the interpreted version. In the machine representation, each HotSpot-compiled version is shown separately, and JVM functions are shown for each interpreted method.

The <Unknown> Function

Under some circumstances, a PC does not map to a known function. In such cases, the PC is mapped to the special function named <Unknown>.

The following circumstances show PCs mapping to <Unknown>:

Callers and callees of the <Unknown> function represent the previous and next PCs in the call stack, and are treated normally.

OpenMP Special Functions

Artificial functions are constructed and placed on the User mode call stacks to reflect events in which a thread was in some state within the OpenMP runtime library. The following artificial functions are defined:

<OMP-overhead>                  Executing in the OpenMP library
<OMP-idle>                      Slave thread, waiting for work
<OMP-reduction>                 Thread performing a reduction operation
<OMP-implicit_barrier>          Thread waiting at an implicit barrier
<OMP-explicit_barrier>          Thread waiting at an explicit barrier
<OMP-lock_wait>                 Thread waiting for a lock
<OMP-critical_section_wait>     Thread waiting to enter a critical section
<OMP-ordered_section_wait>      Thread waiting for its turn to enter an ordered section

The <JVM-System> Function

In the User representation, the <JVM-System> function represents time used by the JVM software performing actions other than running a Java program. In this time interval, the JVM software is performing tasks such as garbage collection and HotSpot compilation. By default, <JVM-System> is visible in the Function list.

The <no Java callstack recorded> Function

The <no Java callstack recorded> function is similar to the <Unknown> function, but for Java threads, in the Java representation only. When the Collector receives an event from a Java thread, it unwinds the native stack and calls into the JVM software to obtain the corresponding Java stack. If that call fails for any reason, the event is shown in the Analyzer with the artificial function <no Java callstack recorded>. The JVM software might refuse to report a call stack either to avoid deadlock, or when unwinding the Java stack would cause excessive synchronization.

The <Truncated-stack> Function

The size of the buffer used by the Analyzer for recording the metrics of individual functions in the call stack is limited. If the call stack grows so large that the buffer becomes full, any further growth forces the Analyzer to drop function profile information. Because most programs spend the bulk of their exclusive CPU time in leaf functions, the Analyzer drops the metrics for the less critical functions at the bottom of the stack, starting with the entry functions _start() and main(). The metrics for the dropped functions are consolidated into the single artificial <Truncated-stack> function. The <Truncated-stack> function may also appear in Java programs.

The <Total> Function

The <Total> function is an artificial construct used to represent the program as a whole. All performance metrics, in addition to being attributed to the functions on the call stack, are attributed to the special function <Total>. The function appears at the top of the function list and its data can be used to give perspective on the data for other functions. In the Callers-Callees list, it is shown as the nominal caller of _start() in the main thread of execution of any program, and also as the nominal caller of _thread_start() for created threads. If the stack unwind was incomplete, the <Total> function can appear as the caller of <Truncated-stack>.

Functions Related to Hardware Counter Overflow Profiling

The following functions are related to hardware counter overflow profiling:

Mapping Performance Data to Index Objects

Index objects represent sets of things whose index can be computed from the data recorded in each packet. Index-object sets that are predefined include: Threads, Cpus, Samples, and Seconds. Other index objects may be defined through the er_print indxobj_define command, issued directly or from a .er.rc file. In the IDE, you can define index objects by selecting Set Data Presentation from the View menu, selecting the Tabs tab, and clicking the Add Custom Index Object button.

For each packet, the index is computed and the metrics associated with the packet are added to the Index Object at that index. An index of -1 maps to the <Unknown> Index Object. All metrics for index objects are exclusive metrics, as no hierarchical representation of index objects is meaningful.

Mapping Data Addresses to Program Data Objects

When a PC from a hardware counter event corresponding to a memory operation has been successfully backtracked to a likely causal memory-referencing instruction, the Analyzer uses instruction identifiers and descriptors, provided by the compiler in its hardware profiling support information, to derive the associated program data object.

The term data object is used to refer to program constants, variables, arrays and aggregates such as structures and unions, along with distinct aggregate elements, described in source code. Depending on the source language, data object types and their sizes vary. Many data objects are explicitly named in source programs, while others may be unnamed. Some data objects are derived or aggregated from other (simpler) data objects, resulting in a rich, often complex, set of data objects.

Each data object has an associated scope, the region of the source program where it is defined and can be referenced, which may be global (such as a load object), a particular compilation unit (an object file), or a function. Identical data objects may be defined with different scopes, or a particular data object may be referred to differently in different scopes.

Data-derived metrics from hardware counter events for memory operations collected with backtracking enabled are attributed to the associated program data object type and propagate to any aggregates containing the data object and the artificial <Total>, which is considered to contain all data objects (including <Unknown> and <Scalars>). The different subtypes of <Unknown> propagate up to the <Unknown> aggregate. The following section describes the <Total>, <Scalars>, and <Unknown> data objects.

Data Object Descriptors

Data objects are fully described by a combination of their declared type and name. A simple scalar data object {int i} describes a variable called i of type int, while {const+pointer+int p} describes a constant pointer to type int called p. Spaces in the type names are replaced with underscores (_), and unnamed data objects are represented with a name of dash (-), for example: {double_precision_complex -}.

An entire aggregate is similarly represented {structure:foo_t} for a structure of type foo_t. An element of an aggregate requires the additional specification of its container, for example, {structure:foo_t}.{int i} for a member i of type int of the previous structure of type foo_t. Aggregates can also themselves be elements of (larger) aggregates, with their corresponding descriptor constructed as a concatenation of aggregate descriptors and ultimately a scalar descriptor.
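As an illustrative sketch, the following declarations pair source-level data objects with descriptors written in the format just described; the descriptors in the comments follow that format but are not copied from Analyzer output.

    typedef struct foo_t {
        int i;              /* {structure:foo_t}.{int i}    */
        double d;           /* {structure:foo_t}.{double d} */
    } foo_t;

    int i;                  /* {int i}               */
    int *const p = &i;      /* {const+pointer+int p} */
    foo_t f;                /* {structure:foo_t}     */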

While a fully-qualified descriptor may not always be necessary to disambiguate data objects, it provides a generic complete specification to assist with data object identification.

The <Total> Data Object

The <Total> data object is an artificial construct used to represent the program’s data objects as a whole. All performance metrics, in addition to being attributed to a distinct data object (and any aggregate to which it belongs), are attributed to the special data object <Total>. It appears at the top of the data object list and its data can be used to give perspective to the data for other data objects.

The <Scalars> Data Object

While aggregate elements have their performance metrics additionally attributed into the metric value for their associated aggregate, all of the scalar constants and variables have their performance metrics additionally attributed into the metric value for the artificial <Scalars> data object.

The <Unknown> Data Object and Its Elements

Under various circumstances, event data cannot be mapped to a particular data object. In such cases, the data is mapped to the special data object named <Unknown> and to one of its elements, as described below.

Mapping Performance Data to Memory Objects

Memory objects are components in the memory subsystem, such as cache-lines, pages, and memory-banks. The object is determined from an index computed from the virtual and/or physical address as recorded. Memory objects are predefined for virtual pages and physical pages, for sizes of 8 KB, 64 KB, 512 KB, and 4 MB. You can define others with the mobj_define command in the er_print utility. You can also define custom memory objects using the Add Memory Objects dialog box in the Analyzer, which you can open by clicking the Add Custom Memory Object button in the Set Data Presentation dialog box.