Profiling data is collected by recording profile events at regular intervals. The interval is either a time interval obtained by using the system clock or a number of hardware events of a specific type. When the interval expires, a signal is delivered to the system and the data is recorded at the next opportunity.
Tracing data is collected by interposing a wrapper function on various system functions and library functions so that calls to the functions can be intercepted and data recorded about the calls.
Sample data is collected by calling various system routines to obtain global information.
Function and instruction count data is collected by instrumenting the executable and any shared objects for the executable, including shared objects that are dynamically opened or statically linked to the executable. The number of times each function or basic block was executed is recorded.
Thread analysis data is collected to support the Thread Analyzer.
Both profiling data and tracing data contain information about specific events, and both types of data are converted into performance metrics. Sample data is not converted into metrics, but is used to provide markers that can be used to divide the program execution into time segments. The sample data gives an overview of the program execution during that time segment.
The data packets collected at each profiling event or tracing event include the following information:
A header identifying the data.
A high-resolution timestamp.
A thread ID.
A processor (CPU) ID, when available from the operating system.
A copy of the call stack. For Java programs, two call stacks are recorded: the machine call stack and the Java call stack.
For OpenMP programs, an identifier for the current parallel region and the OpenMP state are also collected.
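The packet contents listed above can be pictured as a simple record. The following Python sketch is illustrative only: the class name, field names, and sample values are hypothetical and do not reflect the Collector's actual packet format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventPacket:
    """Hypothetical in-memory view of one profiling or tracing packet."""
    header: str                       # identifies the kind of data
    timestamp_ns: int                 # high-resolution timestamp
    thread_id: int
    cpu_id: Optional[int]             # None when the OS does not report a CPU ID
    call_stack: List[str]             # machine call stack, innermost frame first
    java_call_stack: List[str] = field(default_factory=list)  # Java programs only
    omp_region_id: Optional[int] = None   # OpenMP programs only
    omp_state: Optional[str] = None       # OpenMP programs only


# Example packet for a non-Java, non-OpenMP program on a system
# that does not report a CPU ID.
packet = EventPacket(
    header="clock-profile",
    timestamp_ns=1_250_000,
    thread_id=3,
    cpu_id=None,
    call_stack=["compute", "main"],
)
```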
For more information on threads and lightweight processes, see Understanding Performance Analyzer and Its Data.
The data types and how you might use them are described in the following subsections:
When you are doing clock profiling, the data collected depends on the information provided by the operating system.
In clock profiling under Oracle Solaris, the state of each thread is stored at regular time intervals. This time interval is called the profiling interval. The data collected is converted into times spent in each state, with a resolution of the profiling interval.
The default profiling interval is approximately 10 milliseconds (10 ms). You can specify a high-resolution profiling interval of approximately 1 ms and a low-resolution profiling interval of approximately 100 ms. If the operating system permits, you can also specify a custom interval. Run the collect -h command with no other arguments to print the range and resolution allowable on the system.
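As a rough illustration of how periodically sampled thread states become times with a resolution of one profiling interval, consider the following Python sketch. The state names and the input format are assumptions made for the example; this is not the Collector's actual implementation.

```python
from collections import Counter


def time_per_state(sampled_states, interval_ms=10.0):
    """Convert a sequence of sampled thread states into time per state.

    Each sample stands for one profiling interval, so the time attributed
    to a state is (number of samples in that state) * interval.
    """
    ticks = Counter(sampled_states)
    return {state: count * interval_ms for state, count in ticks.items()}


# Five hypothetical samples taken at the default 10 ms interval.
samples = ["user", "user", "system", "user", "wait"]
times = time_per_state(samples)
```

With a 10 ms interval, no state can be resolved more finely than 10 ms, which is why a shorter interval gives higher resolution at the cost of more data.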
The following table shows the performance metrics that Performance Analyzer and er_print can display when an experiment contains clock profiling data. Note that the metrics from all threads are added together.
Timing metrics tell you where your program spent time in several categories and can be used to improve the performance of your program.
High user CPU time tells you where the program did most of the work. You can use it to find the parts of the program where you might gain the most from redesigning the algorithm.
High system CPU time tells you that your program is spending a lot of time in calls to system routines.
High wait CPU time tells you that more threads are ready to run than there are CPUs available, or that other processes are using the CPUs.
High user lock time tells you that threads are unable to obtain the lock that they request.
High text page fault time means that the code ordered by the linker is organized in memory so that many calls or branches cause a new page to be loaded.
High data page fault time indicates that access to the data is causing new pages to be loaded. Reorganizing the data structure or the algorithm in your program can fix this problem.
On Linux platforms, the clock data can only be shown as Total CPU time. Linux CPU time is the sum of user CPU time and system CPU time.
If clock profiling is performed on an OpenMP program, additional metrics are provided: Master Thread Time, OpenMP Work, and OpenMP Wait.
On Oracle Solaris, Master Thread Time is the total time spent in the master thread and corresponds to wall-clock time. The metric is not available on Linux.
On Oracle Solaris, OpenMP Work accumulates when work is being done either serially or in parallel. OpenMP Wait accumulates when the OpenMP runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel but the thread is not scheduled on a CPU.
On the Linux operating system, OpenMP Work and OpenMP Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that OpenMP should do a busy wait, OpenMP Wait on Linux is not useful.
Data for OpenMP programs can be displayed in any of three view modes. In User mode, slave threads are shown as if they were cloned from the master thread, with call stacks matching those of the master thread, and frames in the call stack that come from the OpenMP runtime code (libmtsk.so) are suppressed. In Expert user mode, the master and slave threads are shown differently, the explicit functions generated by the compiler are visible, and the frames from the OpenMP runtime code (libmtsk.so) are suppressed. In Machine mode, the actual native stacks are shown.
The er_kernel utility can collect clock-based profile data on the Oracle Solaris kernel. You can profile the kernel by running the er_kernel utility directly from the command line or by choosing Profile Kernel from the File menu in Performance Analyzer.
The er_kernel utility captures kernel profile data and records the data as a Performance Analyzer experiment in the same format as an experiment created on user programs by the collect utility. The experiment can be processed by the er_print utility or Performance Analyzer. A kernel experiment can show function data, caller-callee data, instruction-level data, and a timeline, but not source-line data (because most Oracle Solaris modules do not contain line-number tables).
The er_kernel utility can also record a user-level experiment on any processes running at the time for which the user has permissions. Such experiments are similar to the experiments that the collect utility creates, but they have data only for User CPU Time and System CPU Time and do not support Java or OpenMP profiling.
See Kernel Profiling for more information.
Clock profiling data can be collected on an MPI experiment that is run with Oracle Message Passing Toolkit, formerly known as Sun HPC ClusterTools. The Oracle Message Passing Toolkit must be at least version 8.1.
The Oracle Message Passing Toolkit is available as part of the Oracle Solaris 11 release. If it is installed on your system, you can find it in /usr/openmpi. If it is not already installed on your Oracle Solaris 11 system, you can search for the package with the command pkg search openmpi if a package repository is configured for the system. See Adding and Updating Software in Oracle Solaris 11 for more information about installing software in Oracle Solaris 11.
When you collect clock profiling data on an MPI experiment, you can view two additional metrics:
MPI Work, which accumulates when the process is inside the MPI runtime doing work, such as processing requests or messages
MPI Wait, which accumulates when the process is inside the MPI runtime but waiting for an event, buffer, or message
On Oracle Solaris, MPI Work accumulates when work is being done either serially or in parallel. MPI Wait accumulates when the MPI runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel but the thread is not scheduled on a CPU.
On Linux, MPI Work and MPI Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that MPI should do a busy wait, MPI Wait on Linux is not useful.
The Oracle Message Passing Toolkit version number is indicated by the installation path, such as /opt/SUNWhpc/HPC8.2.1. Alternatively, type mpirun -V to see output such as the following, where the toolkit version (8.2.1 in this example) follows ct:
mpirun (Open MPI) 1.3.4r22104-ct8.2.1-b09d-r70
If your application is compiled with a GNU or Intel compiler, and you are using Oracle Message Passing Toolkit 8.2 or 8.2.1 for MPI, you must use the -Wl and --enable-new-dtags options with the Oracle Message Passing Toolkit link command to obtain MPI state data. These options cause the executable to define RUNPATH in addition to RPATH, allowing the MPI State libraries to be enabled with the LD_LIBRARY_PATH environment variable.
Hardware counters keep track of events like cache misses, cache stall cycles, floating-point operations, branch mispredictions, CPU cycles, and instructions executed. In hardware counter profiling, the Collector records a profile packet when a designated hardware counter of the CPU on which a thread is running overflows. The counter is reset and continues counting. The profile packet includes the overflow value and the counter type.
Various processor chip families support from two to eighteen simultaneous hardware counter registers. The Collector can collect data on one or more registers. For each register you can select the type of counter to monitor for overflow, and set an overflow value for the counter. Some hardware counters can use any register, while others are only available on a particular register. Consequently, not all combinations of hardware counters can be chosen in a single experiment.
Hardware counter profiling can also be done on the kernel in Performance Analyzer and with the er_kernel utility. See Kernel Profiling for more information.
Hardware counter profiling data is converted by Performance Analyzer into count metrics. For counters that count in cycles, the metrics reported are converted to times. For counters that do not count in cycles, the metrics reported are event counts. On machines with multiple CPUs, the clock frequency used to convert the metrics is the harmonic mean of the clock frequencies of the individual CPUs. Because each type of processor has its own set of hardware counters, and because the number of hardware counters is large, the hardware counter metrics are not listed here. Hardware Counter Lists tells you how to find out what hardware counters are available.
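The cycles-to-time conversion described above can be sketched in Python as follows. The function name and the frequencies are hypothetical; the sketch only shows the arithmetic of dividing a cycle count by the harmonic mean of the CPU clock frequencies.

```python
from statistics import harmonic_mean


def cycles_to_seconds(cycles, cpu_frequencies_hz):
    """Convert a cycle count to seconds using the harmonic mean frequency."""
    return cycles / harmonic_mean(cpu_frequencies_hz)


# Hypothetical machine with one 2 GHz CPU and one 3 GHz CPU.
# The harmonic mean of the two frequencies is 2.4 GHz.
freqs = [2.0e9, 3.0e9]
seconds = cycles_to_seconds(4.8e9, freqs)
```

On a machine where all CPUs run at the same frequency, the harmonic mean is simply that frequency.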
If two specific counters, "cycles" and "insts", are collected, two additional metrics are available, "CPI" and "IPC", meaning cycles-per-instruction and instructions-per-cycle, respectively. They are always shown as a ratio, and not as a time, count, or percentage. A high value of CPI or low value of IPC indicates code that runs inefficiently in the machine; conversely, a low value of CPI or a high value of IPC indicates code that runs efficiently in the pipeline.
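The two ratios are reciprocals of each other. A minimal Python sketch of the arithmetic follows; the function name and the counts are made up for illustration.

```python
def cpi_and_ipc(cycles, insts):
    """Return (cycles per instruction, instructions per cycle)."""
    if cycles <= 0 or insts <= 0:
        raise ValueError("both counts must be positive")
    return cycles / insts, insts / cycles


# Hypothetical counts for one function.
cpi, ipc = cpi_and_ipc(cycles=3_000_000, insts=1_500_000)
```

Here the function needs two cycles per instruction on average (CPI 2.0, IPC 0.5).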
One use of hardware counters is to diagnose problems with the flow of information into and out of the CPU. High counts of cache misses, for example, indicate that restructuring your program to improve data or text locality or to increase cache reuse can improve program performance.
Some of the hardware counters correlate with other counters. For example, branch mispredictions and instruction cache misses are often related because a branch misprediction causes the wrong instructions to be loaded into the instruction cache. These must be replaced by the correct instructions. The replacement can cause an instruction cache miss, an instruction translation lookaside buffer (ITLB) miss, or even a page fault.
For many hardware counters, the overflows are often delivered one or more instructions after the instruction that caused the overflow event. Such behavior is referred to as "skid", and it can make counter overflow profiles difficult to interpret: the performance data will be associated with an instruction executed after the one that actually caused the observed event. For many chips and counters, the number of instructions skidded over is not deterministic; for some it is precisely deterministic (usually zero or one).
Some memory-related counters have a deterministic skid and are available for Memoryspace Profiling. They are labelled as "precise load-store" counters in the list described in Hardware Counter Lists. For such counters, the triggering PC and EA can be determined and memoryspace/dataspace data are captured by default. See Dataspace Profiling and Memoryspace Profiling for more information.
Hardware counters are processor-specific, so the choice of counters available depends on the processor that you are using. The performance tools provide aliases for a number of counters that are likely to be in common use. You can determine the maximum number of hardware counter definitions for profiling on the current machine, and see the full list of available hardware counters, as well as the default counter set, by running collect -h with no other arguments on the current machine.
If the processor and system support hardware counter profiling, the collect -h command prints two lists containing information about hardware counters. The first list contains hardware counters that are aliased to common names. The second list contains raw hardware counters. If neither the performance counter subsystem nor the collect command has the names for the counters on a specific system, the lists are empty. In most cases, however, the counters can be specified numerically.
The following example shows entries in the counter list. The counters that are aliased are displayed first in the list, followed by a list of the raw hardware counters. Each line of output in this example is formatted for print.
Aliases for most useful HW counters:

    alias     raw name        type                units       regs  description
    cycles    Cycles_user                         CPU-cycles  0123  CPU Cycles
    insts     Instr_all                           events      0123  Instructions Executed
    c_stalls  Commit_0_cyc                        CPU-cycles  0123  Stall Cycles
    loads     Instr_ld        precise load-store  events      0123  Load Instructions
    stores    Instr_st        precise load-store  events      0123  Store Instructions
    dcm       DC_miss_commit  precise load-store  events      0123  L1 D-cache Misses
    ...

Raw HW counters:

    name                type  units       regs  description
    Sel_pipe_drain_cyc        CPU-cycles  0123
    Sel_0_wait_cyc            CPU-cycles  0123
    Sel_0_ready_cyc           CPU-cycles  0123
    ...
In the aliased hardware counter list, the first field (for example, cycles) gives the alias name that can be used in the -h counter... argument of the collect command. This alias name is also the identifier to use in the er_print command.
The second field gives the raw name of the counter. For example, loads is short for Instr_ld.
The third field contains type information (for example, precise load-store), which might be empty.
Possible entries in the type information field include the following:
memoryspace, the memory-based counter interrupt occurs with a known, precise skid, and is supported for memoryspace profiling. For such counters the Performance Analyzer can, and by default will, collect memoryspace and dataspace data. See the MemoryObjects Views description for details.
load, store, or load-store, the counter is memory-related.
not-program-related, the counter captures events initiated by some other program, such as CPU-to-CPU cache snoops. Using the counter for profiling generates a warning and profiling does not record a call stack.
The fourth field contains the type of units being counted (for example, events).
The unit can be one of the following:
CPU-cycles, the counter can be used to provide a time-based metric. The metrics reported for such counters are converted by default to inclusive and exclusive times, but can optionally be shown as event counts.
events, the metric is inclusive and exclusive event counts, and cannot be converted to a time.
The fifth field lists the available registers for the counter. For example, 0123.
The sixth field gives a short description of the counter, for example Load Instructions.
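The field layout described above can be made concrete with a small parser. The following Python sketch is an assumption-laden illustration, not part of the performance tools: it assumes that an aliased entry is whitespace-separated, that the alias and raw name are single words, and that the units field is exactly CPU-cycles or events, which allows the possibly empty type field to be recovered.

```python
UNITS = ("CPU-cycles", "events")


def parse_alias_line(line):
    """Split one aliased counter entry into its six fields."""
    tokens = line.split()
    # The units field is the first token after alias and raw name that
    # matches a known unit; everything between is the type field.
    unit_index = next(
        (i for i, tok in enumerate(tokens) if i >= 2 and tok in UNITS), None
    )
    if unit_index is None or unit_index + 1 >= len(tokens):
        raise ValueError("not a counter entry: " + line)
    return {
        "alias": tokens[0],
        "raw_name": tokens[1],
        "type": " ".join(tokens[2:unit_index]),   # may be empty
        "units": tokens[unit_index],
        "regs": tokens[unit_index + 1],
        "description": " ".join(tokens[unit_index + 2:]),
    }


entry = parse_alias_line(
    "loads Instr_ld precise load-store events 0123 Load Instructions"
)
no_type = parse_alias_line("cycles Cycles_user CPU-cycles 0123 CPU Cycles")
```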
The information included in the raw hardware counter list is a subset of the information in the aliased hardware counter list. Each line in the raw hardware counter list includes the internal counter name as used by cputrack(1), the type information, the counter units, which can be either CPU-cycles or events, and the register numbers on which that counter can be used.
If the counter measures events unrelated to the program running, the first word of type information is not-program-related. For such a counter, profiling does not record a call stack, but instead shows the time being spent in an artificial function, collector_not_program_related. Thread and LWP IDs are recorded, but are meaningless.
The default overflow value for raw counters is 1000003. This value is not ideal for most raw counters, so you should specify overflow values when specifying raw counters.
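To see why a single default overflow value suits few raw counters, it helps to estimate how many profile packets a given event rate produces. The Python sketch below is only arithmetic: the event rates and the target of 100 packets per second are hypothetical values chosen for the example, not recommendations from the tools.

```python
DEFAULT_RAW_OVERFLOW = 1_000_003  # default overflow value for raw counters


def packets_per_second(event_rate_hz, overflow_value=DEFAULT_RAW_OVERFLOW):
    """Profile packets recorded per second for a counter at event_rate_hz."""
    return event_rate_hz / overflow_value


def overflow_for_rate(event_rate_hz, target_packets_per_second=100.0):
    """Overflow value giving roughly the target packet rate (at least 1)."""
    return max(1, round(event_rate_hz / target_packets_per_second))


# A rare event (5,000 per second) almost never overflows with the default,
# while a very frequent one (2 billion per second) overflows constantly.
rare_rate = packets_per_second(5_000)
frequent_rate = packets_per_second(2_000_000_000)
suggested = overflow_for_rate(5_000)
```

A counter that yields far less than one packet per second gives too little data to analyze, and one that yields thousands per second adds overhead, so the overflow value is best matched to the expected event rate.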
In multithreaded programs, the synchronization of tasks performed by different threads can cause delays in execution of your program. For example, one thread might have to wait for access to data that has been locked by another thread. These events are called synchronization delay events and are collected by tracing calls to the Solaris or pthread thread functions. The process of collecting and recording these events is called synchronization wait tracing. The time spent waiting for the lock is called the synchronization wait time.
Events are only recorded if their wait time exceeds a threshold value, which is given in microseconds. A threshold value of 0 means that all synchronization delay events are traced, regardless of wait time. The default threshold is determined by running a calibration test, in which calls are made to the threads library without any synchronization delay. The threshold is the average time for these calls multiplied by an arbitrary factor (currently 6). This procedure prevents the recording of events for which the wait times are due only to the call itself and not to a real delay. As a result, the amount of data is greatly reduced, but the count of synchronization events can be significantly underestimated.
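The calibration rule can be sketched in a few lines of Python. The function names and timing values are hypothetical; only the factor of 6 and the meaning of a zero threshold come from the description above.

```python
from statistics import mean

CALIBRATION_FACTOR = 6  # the arbitrary factor mentioned in the text


def calibrated_threshold_us(uncontended_call_times_us):
    """Default threshold: average uncontended call time times the factor."""
    return mean(uncontended_call_times_us) * CALIBRATION_FACTOR


def traced_events(wait_times_us, threshold_us):
    """Keep only the events whose wait time exceeds the threshold.

    A threshold of 0 means every event is traced, regardless of wait time.
    """
    if threshold_us == 0:
        return list(wait_times_us)
    return [t for t in wait_times_us if t > threshold_us]


# Hypothetical calibration run: calls take 2, 3, and 4 microseconds.
threshold = calibrated_threshold_us([2.0, 3.0, 4.0])
recorded = traced_events([5.0, 18.0, 40.0, 250.0], threshold)
```

In this example two of the four events fall at or below the threshold and are dropped, which illustrates how the event count can be underestimated.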
For Java programs, synchronization tracing might cover Java method calls in the profiled program, native synchronization calls, or both.
From this information you can determine whether functions or load objects are either frequently blocked or experience unusually long wait times when they do make a call to a synchronization routine. High synchronization wait times indicate contention among threads. You can reduce the contention by redesigning your algorithms, particularly restructuring your locks so that they cover only the data for each thread that needs to be locked.
Calls to memory allocation and deallocation functions that are not properly managed can be a source of inefficient data usage and can result in poor program performance. In heap tracing, the Collector traces memory allocation and deallocation requests by interposing on the C standard library memory allocation functions malloc, realloc, valloc, and memalign and the deallocation function free. Calls to mmap are treated as memory allocations, which enables heap tracing events for Java memory allocations to be recorded. The Fortran functions allocate and deallocate call the C standard library functions, so these routines are traced indirectly.
Heap profiling for Java programs is not supported.
Collecting heap tracing data can help you identify memory leaks in your program or locate places where there is inefficient allocation of memory.
When you look at the Leaks view with filters applied, the leaks shown are for memory allocations that were done under the filtering criteria and not deallocated at any time. Leaks are not restricted to those allocations that are not deallocated under the filtering criteria.
A memory leak here is defined as a block of memory that is dynamically allocated, but never freed, independent of whether a pointer to it exists in the process address space. (Other tools might define a leak differently, as a block of memory that is allocated, not freed, but which no longer has a pointer to it in the process address space.)
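Under this definition, finding leaks amounts to matching allocations against deallocations and reporting whatever remains. The following Python sketch assumes a hypothetical trace of (kind, address, size) tuples; it does not model the Collector's data format or the handling of realloc.

```python
def find_leaks(events):
    """Return {address: size} for blocks that were allocated but never freed.

    events is an iterable of (kind, address, size) tuples, where kind is
    "alloc" or "free". Whether a pointer to the block still exists is
    deliberately ignored, matching the definition above.
    """
    live = {}
    for kind, address, size in events:
        if kind == "alloc":
            live[address] = size
        elif kind == "free":
            live.pop(address, None)  # ignore frees of unknown addresses
    return live


trace = [
    ("alloc", 0x1000, 64),
    ("alloc", 0x2000, 128),
    ("free", 0x1000, 0),
    ("alloc", 0x3000, 32),
]
leaks = find_leaks(trace)
```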
I/O data collection traces input/output system calls, including reads and writes. It measures the duration of the calls, tracks the files and descriptors, and records the amount of data transferred. You can use the I/O metrics to identify the files, file handles, and call stacks that have high byte transfer volumes and total thread time.
Process-wide resource-utilization samples contain statistics from the kernel such as page fault and I/O data, context switches, and a variety of page residency (working-set and paging) statistics. The data is attributed to the process and does not map to function-level metrics. Process-wide resource-utilization samples are recorded in the following circumstances:
When the program stops for any reason during debugging in dbx, such as at a breakpoint if the option to do this is set.
When you use the dbx collector sample record command to manually record a sample.
At a call to collector_sample if you have put calls to this routine in your code (see Program Control of Data Collection Using libcollector Library).
When a specified signal is delivered if you have used the -l option with the collect command (see the collect(1) man page).
When collection is initiated and terminated.
When you pause collection with the dbx collector pause command (just before the pause) and when you resume collection with the dbx collector resume command (just after the resume).
Before and after a descendant process is created.
The performance tools use the data recorded in the sample packets to group the data into time periods, which are called samples. You can filter the event-specific data by selecting a set of samples so that you see only information for these particular time periods. You can also view the global data for each sample.
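Grouping events into samples and filtering on them can be pictured as a lookup of each event's timestamp among the sample-point timestamps. The Python sketch below is a hypothetical illustration: the zero-based sample numbering, the tuple format, and the function names are assumptions and may differ from how the tools number and store samples.

```python
from bisect import bisect_right


def sample_index(sample_times, event_time):
    """Index of the time period containing event_time.

    sample_times is a sorted list of sample-point timestamps; period i
    runs from sample_times[i] up to, but not including, sample_times[i + 1].
    """
    return bisect_right(sample_times, event_time) - 1


def filter_by_samples(events, sample_times, selected):
    """Keep the (timestamp, payload) events that fall in selected periods."""
    return [e for e in events if sample_index(sample_times, e[0]) in selected]


sample_times = [0.0, 1.0, 2.0, 3.0]
events = [(0.2, "a"), (1.5, "b"), (1.9, "c"), (2.4, "d")]
kept = filter_by_samples(events, sample_times, {1})
```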
The performance tools make no distinction between the different kinds of sample points. To make use of sample points for analysis you should choose only one kind of point to be recorded. In particular, if you want to record sample points that are related to the program structure or execution sequence, you should turn off periodic sampling and use samples recorded when dbx stops the process, or when a signal is delivered to the process that is recording data using the collect command, or when a call is made to the Collector API functions.
For more information on the VT_MAX_FLUSHES and VT_BUFFER_SIZE environment variables, see the Vampirtrace User Manual on the Technische Universität Dresden web site.
MPI events that occur after the buffer limits have been reached are not written into the trace file, resulting in an incomplete trace.
To remove the limit and get a complete trace of an application, set the VT_MAX_FLUSHES environment variable to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full.
To change the size of the buffer, set the VT_BUFFER_SIZE environment variable. The optimal value for this variable depends on the application that is to be traced. Setting a small value increases the memory available to the application but triggers frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value such as 2G minimizes buffer flushes by the MPI API trace collector but decreases the memory available to the application. If not enough memory is available to hold the buffer and the application data, parts of the application might be swapped to disk, leading to a significant change in the behavior of the application.
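When a collection run is launched from a script, these variables can be set in the environment passed to the child process. The Python sketch below only builds such an environment; the helper name is hypothetical, and the values shown (0 flushes limit removed, a 2G buffer) are the examples from the text rather than recommendations.

```python
import os


def tracing_env(max_flushes="0", buffer_size=None, base=None):
    """Return an environment dict with the MPI trace buffer variables set.

    base defaults to a copy of the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["VT_MAX_FLUSHES"] = max_flushes        # "0" removes the flush limit
    if buffer_size is not None:
        env["VT_BUFFER_SIZE"] = buffer_size    # for example "2G"
    return env


# Start from an empty base here so the result is easy to inspect.
env = tracing_env(buffer_size="2G", base={})
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the command that collects the experiment.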
MPI Time is the total thread time spent in the MPI function. If MPI state times are also collected, MPI Work Time plus MPI Wait Time for all MPI functions other than MPI_Init and MPI_Finalize should approximately equal MPI Time. On Linux, MPI Wait and Work are based on user+system CPU time, while MPI Time is based on real time, so the numbers will not match.
MPI byte and message counts are currently collected only for point‐to‐point messages. They are not recorded for collective communication functions. The MPI Bytes Received metric counts the actual number of bytes received in all messages. MPI Bytes Sent counts the actual number of bytes sent in all messages. MPI Sends counts the number of messages sent, and MPI Receives counts the number of messages received.
Collecting MPI tracing data can help you identify places where you have a performance problem in an MPI program that could be due to MPI calls. Examples of possible performance problems are load balancing, synchronization delays, and communications bottlenecks.