JavaScript is required to for searching.
Skip Navigation Links
Exit Print View
Oracle Solaris Studio 12.3: Performance Analyzer     Oracle Solaris Studio 12.3 Information Library
search filter icon
search icon

Document Information


1.  Overview of the Performance Analyzer

2.  Performance Data

What Data the Collector Collects

Clock Data

Clock-based Profiling Under Oracle Solaris

Clock-based Profiling Under Linux

Clock-based Profiling for MPI Programs

Clock-based Profiling for OpenMP Programs

Clock-based Profiling for the Oracle Solaris Kernel

Hardware Counter Overflow Profiling Data

Hardware Counter Lists

Format of the Aliased Hardware Counter List

Format of the Raw Hardware Counter List

Synchronization Wait Tracing Data

Heap Tracing (Memory Allocation) Data

MPI Tracing Data

Global (Sampling) Data

How Metrics Are Assigned to Program Structure

Function-Level Metrics: Exclusive, Inclusive, and Attributed

Interpreting Attributed Metrics: An Example

How Recursion Affects Function-Level Metrics

3.  Collecting Performance Data

4.  The Performance Analyzer Tool

5.  The er_print Command Line Performance Analysis Tool

6.  Understanding the Performance Analyzer and Its Data

7.  Understanding Annotated Source and Disassembly Data

8.  Manipulating Experiments

9.  Kernel Profiling


What Data the Collector Collects

The Collector collects various kinds of data using several methods:

Both profiling data and tracing data contain information about specific events, and both types of data are converted into performance metrics. Global data is not converted into metrics, but is used to provide markers that can be used to divide the program execution into time segments. The global data gives an overview of the program execution during that time segment.

The data packets collected at each profiling event or tracing event include the following information:

For more information on threads and lightweight processes, see Chapter 6, Understanding the Performance Analyzer and Its Data.

In addition to the common data, each event-specific data packet contains information specific to the data type.

The data types and how you might use them are described in the following subsections:

Clock Data

When you are doing clock-based profiling, the data collected depends on the metrics provided by the operating system.

Clock-based Profiling Under Oracle Solaris

In clock-based profiling under Oracle Solaris, the state of each thread is stored at regular time intervals. This time interval is called the profiling interval. The information is stored in an integer array: one element of the array is used for each of the ten microaccounting states maintained by the kernel. The data collected is converted by the Performance Analyzer into times spent in each state, with a resolution of the profiling interval. The default profiling interval is approximately 10 milliseconds (10 ms). The Collector provides a high-resolution profiling interval of approximately 1 ms and a low-resolution profiling interval of approximately 100 ms, and, where the operating system permits, allows arbitrary intervals. Running the collect -h command with no other arguments prints the range and resolution allowable on the system on which it is run.

The metrics that are computed from clock-based data are defined in the following table.

Table 2-1 Solaris Timing Metrics

User CPU time
Time spent running in user mode on the CPU.
Wall time
Elapsed time spent in Thread 1. This is usually the “wall clock time”
Total thread time
Sum of all thread times.
System CPU time
Thread time spent running in kernel mode on the CPU or in a trap state.
Wait CPU time
Thread time spent waiting for a CPU.
User lock time
Thread time spent waiting for a lock.
Text page fault time
Thread time spent waiting for a text page.
Data page fault time
Thread time spent waiting for a data page.
Other wait time
Thread time spent waiting for a kernel page, or time spent sleeping or stopped.

For multithreaded experiments, times other than wall clock time are summed across all threads. Wall time as defined is not meaningful for multiple-program multiple-data (MPMD) targets.

Timing metrics tell you where your program spent time in several categories and can be used to improve the performance of your program.

Clock-based Profiling Under Linux

Under Linux operating systems, the only metric available is User CPU time. Although the total CPU utilization time reported is accurate, it may not be possible for the Analyzer to determine the proportion of the time that is actually System CPU time as accurately as for Oracle Solaris. Although the Analyzer displays the information as if the data were for a lightweight process (LWP), in reality there are no LWP’s on Linux; the displayed LWP ID is actually the thread ID.

Clock-based Profiling for MPI Programs

Clock-profiling data can be collected on an MPI experiment that is run with Oracle Message Passing Toolkit, formerly known as Sun HPC ClusterTools. The Oracle Message Passing Toolkit must be at least version 8.1.

The Oracle Message Passing Toolkit is made available as part of the Oracle Solaris 11 release. If it is installed on your system, you can find it in /usr/openmpi. If it is not already installed on your Oracle Solaris 11 system, you can search for the package with the command pkg search openmpi if a package repository is configured for the system. See the manual Adding and Updating Oracle Solaris 11 Software Packages in the Oracle Solaris 11 documentation library for more information about installing software in Oracle Solaris 11.

When you collect clock-profiling data on an MPI experiment, two additional metrics can be shown:

On Oracle Solaris, MPI Work accumulates when work is being done either serially or in parallel. MPI Wait accumulates when the MPI runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel, but the thread is not scheduled on a CPU.

On Linux, MPI Work and MPI Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that MPI should do a busy wait, MPI Wait on Linux is not useful.

Note - If your are using Linux with Oracle Message Passing Toolkit 8.2 or 8.2.1, you might need a workaround. The workaround is not needed for version 8.1 or 8.2.1c, or for any version if you are using an Oracle Solaris Studio compiler.

The Oracle Message Passing Toolkit version number is indicated by the installation path such as /opt/SUNWhpc/HPC8.2.1, or you can type mpirun —V to see output as follows where the version is shown in italics:

mpirun (Open MPI) 1.3.4r22104-ct8.2.1-b09d-r70

If your application is compiled with a GNU or Intel compiler, and you are using Oracle Message Passing Toolkit 8.2 or 8.2.1 for MPI, to obtain MPI state data you must use the -WI and --enable-new-dtags options with the Oracle Message Passing Toolkit link command. These options cause the executable to define RUNPATH in addition to RPATH, allowing the MPI State libraries to be enabled with the LD_LIBRARY_PATH environment variable.

Clock-based Profiling for OpenMP Programs

If clock-based profiling is performed on an OpenMP program, two additional metrics are provided: OpenMP Work and OpenMP Wait.

On Oracle Solaris, OpenMP Work accumulates when work is being done either serially or in parallel. OpenMP Wait accumulates when the OpenMP runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel, but the thread is not scheduled on a CPU.

On the Linux operating system, OpenMP Work and OpenMP Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that OpenMP should do a busy wait, OpenMP Wait on Linux is not useful.

Data for OpenMP programs can be displayed in any of three view modes. In User mode, slave threads are shown as if they were really cloned from the master thread, and have call stacks matching those from the master thread. Frames in the call stack coming from the OpenMP runtime code ( are suppressed. In Expert user mode, the master and slave threads are shown differently, and the explicit functions generated by the compiler are visible, and the frames from the OpenMP runtime code ( are suppressed. For Machine mode, the actual native stacks are shown.

Clock-based Profiling for the Oracle Solaris Kernel

The er_kernel utility can collect clock-based profile data on the Oracle Solaris kernel.

The er_kernel utility captures kernel profile data and records the data as an Analyzer experiment in the same format as an experiment created on user programs by the collect utility. The experiment can be processed by the er_print utility or the Performance Analyzer. A kernel experiment can show function data, caller-callee data, instruction-level data, and a timeline, but not source-line data (because most Oracle Solaris modules do not contain line-number tables).

See Chapter 9, Kernel Profiling for more information.

Hardware Counter Overflow Profiling Data

Hardware counters keep track of events like cache misses, cache stall cycles, floating-point operations, branch mispredictions, CPU cycles, and instructions executed. In hardware counter overflow profiling, the Collector records a profile packet when a designated hardware counter of the CPU on which a thread is running overflows. The counter is reset and continues counting. The profile packet includes the overflow value and the counter type.

Various processor chip families support from two to eighteen simultaneous hardware counter registers. The Collector can collect data on one or more registers. For each register the Collector allows you to select the type of counter to monitor for overflow, and to set an overflow value for the counter. Some hardware counters can use any register, others are only available on a particular register. Consequently, not all combinations of hardware counters can be chosen in a single experiment.

Hardware counter overflow profiling can also be done on the kernel with the er_kernel utility. See Chapter 9, Kernel Profiling for more information.

Hardware counter overflow profiling data is converted by the Performance Analyzer into count metrics. For counters that count in cycles, the metrics reported are converted to times; for counters that do not count in cycles, the metrics reported are event counts. On machines with multiple CPUs, the clock frequency used to convert the metrics is the harmonic mean of the clock frequencies of the individual CPUs. Because each type of processor has its own set of hardware counters, and because the number of hardware counters is large, the hardware counter metrics are not listed here. The next subsection tells you how to find out what hardware counters are available.

One use of hardware counters is to diagnose problems with the flow of information into and out of the CPU. High counts of cache misses, for example, indicate that restructuring your program to improve data or text locality or to increase cache reuse can improve program performance.

Some of the hardware counters correlate with other counters. For example, branch mispredictions and instruction cache misses are often related because a branch misprediction causes the wrong instructions to be loaded into the instruction cache, and these must be replaced by the correct instructions. The replacement can cause an instruction cache miss, or an instruction translation lookaside buffer (ITLB) miss, or even a page fault.

Hardware counter overflows are often delivered one or more instructions after the instruction which caused the event and the corresponding event counter to overflow: this is referred to as “skid” and it can make counter overflow profiles difficult to interpret. In the absence of hardware support for precise identification of the causal instruction, an apropos backtracking search for a candidate causal instruction may be attempted.

When such backtracking is supported and specified during collection, hardware counter profile packets additionally include the PC (program counter) and EA (effective address) of a candidate memory-referencing instruction appropriate for the hardware counter event. (Subsequent processing during analysis is required to validate the candidate event PC and EA.) This additional information about memory-referencing events facilitates various data-oriented analyses, known as dataspace profiling. Backtracking is supported only on SPARC based platforms running the Oracle Solaris operating system.

On some SPARC chips, the counter interrupts are precise, and no backtracking is needed. Such counters are indicated by the word precise following the event type.

If you prepend a + sign to precise counters that are related to memory, you enable memoryspace profiling, which can help you to determine which program lines and memory addresses are causing memory-related program delays. See Dataspace Profiling and Memoryspace Profiling for more information about memoryspace profiling.

Backtracking and recording of a candidate event PC and EA can also be specified for clock-profiling, although the data might be difficult to interpret. Backtracking on hardware counters is more reliable.

Hardware Counter Lists

Hardware counters are processor-specific, so the choice of counters available to you depends on the processor that you are using. The performance tools provide aliases for a number of counters that are likely to be in common use. You can obtain a list of available hardware counters on any particular system from the Collector by typing collect -h with no other arguments in a terminal window on that system. If the processor and system support hardware counter profiling, the collect -h command prints two lists containing information about hardware counters. The first list contains hardware counters that are aliased to common names; the second list contains raw hardware counters. If neither the performance counter subsystem nor the collect command know the names for the counters on a specific system, the lists are empty. In most cases, however, the counters can be specified numerically.

Here is an example that shows the entries in the counter list. The counters that are aliased are displayed first in the list, followed by a list of the raw hardware counters. Each line of output in this example is formatted for print.

Aliased HW counters available for profiling:
cycles[/{0|1|2|3}],31599989 (`CPU Cycles', alias for Cycles_user; CPU-cycles)
insts[/{0|1|2|3}],31599989 (`Instructions Executed', alias for Instr_all; events)
loads[/{0|1|2|3}],9999991 (`Load Instructions', alias for Instr_ld; 
      precise load-store events)
stores[/{0|1|2|3}],1000003 (`Store Instructions', alias for Instr_st; 
      precise load-store events)
dcm[/{0|1|2|3}],1000003 (`L1 D-cache Misses', alias for DC_miss_nospec; 
      precise load-store events)
Raw HW counters available for profiling:
Cycles_user[/{0|1|2|3}],1000003 (CPU-cycles)
Instr_all[/{0|1|2|3}],1000003 (events)
Instr_ld[/{0|1|2|3}],1000003 (precise load-store events)
Instr_st[/{0|1|2|3}],1000003 (precise load-store events)
DC_miss_nospec[/{0|1|2|3}],1000003 (precise load-store events)
Format of the Aliased Hardware Counter List

In the aliased hardware counter list, the first field (for example, cycles) gives the alias name that can be used in the -h counter... argument of the collect command. This alias name is also the identifier to use in the er_print command.

The second field lists the available registers for the counter; for example, [/{0|1|2|3}].

The third field, for example, 9999991, is the default overflow value for the counter. For aliased counters, the default value has been chosen to provide a reasonable sample rate. Because actual rates vary considerably, you might need to specify a non-default value.

The fourth field, in parentheses, contains type information. It provides a short description (for example, CPU Cycles), the raw hardware counter name (for example, Cycles_user), and the type of units being counted (for example, CPU-cycles).

Possible entries in the type information field include the following:

If the last or only word of the type information is:

In the aliased hardware counter list in the example, the type information contains one word, CPU-cycles for the first counter and events for the second counter. For the third counter, the type information contains two words, load-store events.

Format of the Raw Hardware Counter List

The information included in the raw hardware counter list is a subset of the information in the aliased hardware counter list. Each line in the raw hardware counter list includes the internal counter name as used by cputrack(1), the register numbers on which that counter can be used, the default overflow value, the type information, and the counter units, which can be either CPU-cycles or events.

If the counter measures events unrelated to the program running, the first word of type information is not-program-related. For such a counter, profiling does not record a call stack, but instead shows the time being spent in an artificial function, collector_not_program_related . Thread and LWP ID’s are recorded, but are meaningless.

The default overflow value for raw counters is 1000003. This value is not ideal for most raw counters, so you should specify overflow values when specifying raw counters.

Synchronization Wait Tracing Data

In multithreaded programs, the synchronization of tasks performed by different threads can cause delays in execution of your program, because one thread might have to wait for access to data that has been locked by another thread, for example. These events are called synchronization delay events and are collected by tracing calls to the Solaris or pthread thread functions. The process of collecting and recording these events is called synchronization wait tracing. The time spent waiting for the lock is called the wait time.

Events are only recorded if their wait time exceeds a threshold value, which is given in microseconds. A threshold value of 0 means that all synchronization delay events are traced, regardless of wait time. The default threshold is determined by running a calibration test, in which calls are made to the threads library without any synchronization delay. The threshold is the average time for these calls multiplied by an arbitrary factor (currently 6). This procedure prevents the recording of events for which the wait times are due only to the call itself and not to a real delay. As a result, the amount of data is greatly reduced, but the count of synchronization events can be significantly underestimated.

Synchronization tracing is not supported for Java programs.

Synchronization wait tracing data is converted into the following metrics.

Table 2-2 Synchronization Wait Tracing Metrics

Synchronization delay events.
The number of calls to a synchronization routine where the wait time exceeded the prescribed threshold.
Synchronization wait time.
Total of wait times that exceeded the prescribed threshold.

From this information you can determine if functions or load objects are either frequently blocked, or experience unusually long wait times when they do make a call to a synchronization routine. High synchronization wait times indicate contention among threads. You can reduce the contention by redesigning your algorithms, particularly restructuring your locks so that they cover only the data for each thread that needs to be locked.

Heap Tracing (Memory Allocation) Data

Calls to memory allocation and deallocation functions that are not properly managed can be a source of inefficient data usage and can result in poor program performance. In heap tracing, the Collector traces memory allocation and deallocation requests by interposing on the C standard library memory allocation functions malloc, realloc, valloc, and memalign and the deallocation function free. Calls to mmap are treated as memory allocations, which allows heap tracing events for Java memory allocations to be recorded. The Fortran functions allocate and deallocate call the C standard library functions, so these routines are traced indirectly.

Heap profiling for Java programs is not supported.

Heap tracing data is converted into the following metrics.

Table 2-3 Memory Allocation (Heap Tracing) Metrics

The number of calls to the memory allocation functions.
Bytes allocated
The sum of the number of bytes allocated in each call to the memory allocation functions.
The number of calls to the memory allocation functions that did not have a corresponding call to a deallocation function.
Bytes leaked
The number of bytes that were allocated but not deallocated.

Collecting heap tracing data can help you identify memory leaks in your program or locate places where there is inefficient allocation of memory.

Another definition of memory leaks that is commonly used, such as in the dbx debugging tool, says a memory leak is a dynamically-allocated block of memory that has no pointers pointing to it anywhere in the data space of the program. The definition of leaks used here includes this alternative definition, but also includes memory for which pointers do exist.

MPI Tracing Data

The Collector can collect data on calls to the Message Passing Interface (MPI) library.

MPI tracing is implemented using the open source VampirTrace 5.5.3 release. It recognizes the following VampirTrace environment variables:

Controls whether or not call stacks are recorded in the data. The default setting is 1. Setting VT_STACKS to 0 disables call stacks.
Controls the size of the internal buffer of the MPI API trace collector. The default value is 64M (64 MBytes).
Controls the number of times the buffer is flushed before terminating MPI tracing. The default value is 0, which allows the buffer to be flushed to disk whenever it is full. Setting VT_MAX_FLUSHES to a positive number sets a limit for the number of times the buffer is flushed.
Turns on various error and status messages. The default value is 1, which turns on critical error and status messages. Set the variable to 2 if problems arise.

For more information on these variables, see the Vampirtrace User Manual on the Technische Universität Dresden web site.

MPI events that occur after the buffer limits have been reached are not written into the trace file resulting in an incomplete trace.

To remove the limit and get a complete trace of an application, set the VT_MAX_FLUSHES environment variable to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full.

To change the size of the buffer, set the VT_BUFFER_SIZE environment variable. The optimal value for this variable depends on the application that is to be traced. Setting a small value increases the memory available to the application, but triggers frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value such as 2G minimizes buffer flushes by the MPI API trace collector, but decreases the memory available to the application. If not enough memory is available to hold the buffer and the application data, parts of the application might be swapped to disk leading to a significant change in the behavior of the application.

The functions for which data is collected are listed below.


MPI tracing data is converted into the following metrics.

Table 2-4 MPI Tracing Metrics

MPI Sends
Number of MPI point-to-point sends started
MPI Bytes Sent
Number of bytes in MPI Sends
MPI Receives
Number of MPI point-to-point receives completed
MPI Bytes Received
Number of bytes in MPI Receives
MPI Time
Time spent in all calls to MPI functions
Other MPI Events
Number of calls to MPI functions that neither send nor receive point-to-point messages

MPI Time is the total thread time spent in the MPI function. If MPI state times are also collected, MPI Work Time plus MPI Wait Time for all MPI functions other than MPI_Init and MPI_Finalize should approximately equal MPI Work Time. On Linux, MPI Wait and Work are based on user+system CPU time, while MPI Time is based on real tine, so the numbers will not match.

MPI byte and message counts are currently collected only for point-to-point messages; they are not recorded for collective communication functions. The MPI Bytes Received metric counts the actual number of bytes received in all messages. MPI Bytes Sent counts the actual number of bytes sent in all messages. MPI Sends counts the number of messages sent, and MPI Receives counts the number of messages received.

Collecting MPI tracing data can help you identify places where you have a performance problem in an MPI program that could be due to MPI calls. Examples of possible performance problems are load balancing, synchronization delays, and communications bottlenecks.

Global (Sampling) Data

Global data is recorded by the Collector in packets called sample packets. Each packet contains a header, a timestamp, execution statistics from the kernel such as page fault and I/O data, context switches, and a variety of page residency (working-set and paging) statistics. The data recorded in sample packets is global to the program and is not converted into performance metrics. The process of recording sample packets is called sampling.

Sample packets are recorded in the following circumstances:

The performance tools use the data recorded in the sample packets to group the data into time periods, which are called samples. You can filter the event-specific data by selecting a set of samples, so that you see only information for these particular time periods. You can also view the global data for each sample.

The performance tools make no distinction between the different kinds of sample points. To make use of sample points for analysis you should choose only one kind of point to be recorded. In particular, if you want to record sample points that are related to the program structure or execution sequence, you should turn off periodic sampling, and use samples recorded when dbx stops the process, or when a signal is delivered to the process that is recording data using the collect command, or when a call is made to the Collector API functions.