Skip Navigation Links | |
Exit Print View | |
Oracle Solaris Studio 12.3: Performance Analyzer Oracle Solaris Studio 12.3 Information Library |
1. Overview of the Performance Analyzer
3. Collecting Performance Data
4. The Performance Analyzer Tool
5. The er_print Command Line Performance Analysis Tool
6. Understanding the Performance Analyzer and Its Data
dbx Experiments That Create a Process
dbx Experiments on a Running Process
Interpreting Performance Metrics
Hardware Counter Overflow Profiling
Call Stacks and Program Execution
Single-Threaded Execution and Function Calls
Function Calls Between Shared Objects
Overview of Java Technology-Based Software Execution
Java Call Stacks and Machine Call Stacks
Clock-based Profiling and Hardware Counter Overflow Profiling
User View Mode of Java Profiling Data
Expert View Mode of Java Profiling Data
Machine View Mode of Java Profiling Data
Overview of OpenMP Software Execution
User View Mode of OpenMP Profile Data
Expert View Mode of OpenMP Profiling Data
Machine View Mode of OpenMP Profiling Data
Mapping Addresses to Program Structure
Static Functions From Stripped Shared Libraries
Fortran Alternate Entry Points
Compiler-Generated Body Functions
Dynamically Compiled Functions
The <no Java callstack recorded> Function
The <Truncated-stack> Function
Functions Related to Hardware Counter Overflow Profiling
Mapping Performance Data to Index Objects
Mapping Data Addresses to Program Data Objects
The <Unknown> Data Object and Its Elements
Mapping Performance Data to Memory Objects
The data for each event contains a high-resolution timestamp, a thread ID, and a CPU ID. These can be used to filter the metrics in the Performance Analyzer by time, thread, or CPU. See the getcpuid(2) man page for information on CPU IDs. On systems where getcpuid is not available, the processor ID is -1, which maps to Unknown.
In addition to the common data, each event generates specific raw data, which is described in the following sections. Each section also contains a discussion of the accuracy of the metrics derived from the raw data and the effect of data collection on the metrics.
The event-specific data for clock-based profiling consists of an array of profiling interval counts. On Oracle Solaris, an interval counter is provided. At the end of the profiling interval, the appropriate interval counter is incremented by 1, and another profiling signal is scheduled. The array is recorded and reset only when the Solaris thread enters CPU user mode. Resetting the array consists of setting the array element for the User-CPU state to 1, and the array elements for all the other states to 0. The array data is recorded on entry to user mode before the array is reset. Thus, the array contains an accumulation of counts for each microstate that was entered since the previous entry into user mode, for each of the ten microstates maintained by the kernel for each Solaris thread. On the Linux operating system, microstates do not exist; the only interval counter is User CPU Time.
The call stack is recorded at the same time as the data. If the Solaris thread is not in user mode at the end of the profiling interval, the call stack cannot change until the thread enters user mode again. Thus the call stack always accurately records the position of the program counter at the end of each profiling interval.
The metrics to which each of the microstates contributes on Oracle Solaris are shown in Table 6-1.
Table 6-1 How Kernel Microstates Contribute to Metrics
|
Timing data is collected on a statistical basis, and is therefore subject to all the errors of any statistical sampling method. For very short runs, in which only a small number of profile packets is recorded, the call stacks might not represent the parts of the program which consume the most resources. Run your program for long enough or enough times to accumulate hundreds of profile packets for any function or source line you are interested in.
In addition to statistical sampling errors, specific errors arise from the way the data is collected and attributed and the way the program progresses through the system. The following are some of the circumstances in which inaccuracies or distortions can appear in the timing metrics:
When a thread is created, the time spent before the first profile packet is recorded is less than the profiling interval, but the entire profiling interval is ascribed to the microstate recorded in the first profile packet. If many threads are created, the error can be many times the profiling interval.
When a thread is destroyed, some time is spent after the last profile packet is recorded. If many threads are destroyed, the error can be many times the profiling interval.
Rescheduling of threads can occur during a profiling interval. As a consequence, the recorded state of the thread might not represent the microstate in which it spent most of the profiling interval. The errors are likely to be larger when there are more threads to run than there are processors to run them.
A program can behave in a way that is correlated with the system clock. In this case, the profiling interval always expires when the thread is in a state that might represent a small fraction of the time spent, and the call stacks recorded for a particular part of the program are overrepresented. On a multiprocessor system, the profiling signal can induce a correlation: processors that are interrupted by the profiling signal while they are running threads for the program are likely to be in the Trap-CPU microstate when the microstate is recorded.
The kernel records the microstate value when the profiling interval expires. When the system is under heavy load, that value might not represent the true state of the process. On Oracle Solaris, this situation is likely to result in overaccounting of the Trap-CPU or Wait-CPU microstate.
When the system clock is being synchronized with an external source, the time stamps recorded in profile packets do not reflect the profiling interval but include any adjustment that was made to the clock. The clock adjustment can make it appear that profile packets are lost. The time period involved is usually several seconds, and the adjustments are made in increments.
Experiments recorded on machines that dynamically change their operating clock frequency may induce inaccuracies in profiling.
In addition to the inaccuracies just described, timing metrics are distorted by the process of collecting data. The time spent recording profile packets never appears in the metrics for the program, because the recording is initiated by the profiling signal. (This is another instance of correlation.) The user CPU time spent in the recording process is distributed over whatever microstates are recorded. The result is an underaccounting of the User CPU Time metric and an overaccounting of other metrics. The amount of time spent recording data is typically less than a few percent of the CPU time for the default profiling interval.
If you compare timing metrics obtained from the profiling done in a clock-based experiment with times obtained by other means, you should be aware of the following issues.
For a single-threaded application, the total thread time recorded for a process is usually accurate to a few tenths of a percent, compared with the values returned by gethrtime(3C) for the same process. The CPU time can vary by several percentage points from the values returned by gethrvtime(3C) for the same process. Under heavy load, the variation might be even more pronounced. However, the CPU time differences do not represent a systematic distortion, and the relative times reported for different functions, source-lines, and such are not substantially distorted.
The thread times that are reported in the Performance Analyzer can differ substantially from the times that are reported by vmstat, because vmstat reports times that are summed over CPUs. If the target process has more LWPs than the system on which it is running has CPUs, the Performance Analyzer shows more wait time than vmstat reports.
The microstate timings that appear in the Statistics tab of the Performance Analyzer and the er_print statistics display are based on process file system /proc usage reports, for which the times spent in the microstates are recorded to high accuracy. See the proc (4) man page for more information. You can compare these timings with the metrics for the <Total> function, which represents the program as a whole, to gain an indication of the accuracy of the aggregated timing metrics. However, the values displayed in the Statistics tab can include other contributions that are not included in the timing metric values for <Total>. These contributions come from the periods of time in which data collection is paused.
User CPU time and hardware counter cycle time differ because the hardware counters are turned off when the CPU mode has been switched to system mode. For more information, see Traps.
Hardware counter overflow profiling data includes a counter ID and the overflow value. The value can be larger than the value at which the counter is set to overflow, because the processor executes some instructions between the overflow and the recording of the event. The value is especially likely to be larger for cycle and instruction counters, which are incremented much more frequently than counters such as floating-point operations or cache misses. The delay in recording the event also means that the program counter address recorded with call stack does not correspond exactly to the overflow event. See Attribution of Hardware Counter Overflows for more information. See also the discussion of Traps. Traps and trap handlers can cause significant differences between reported User CPU time and time reported by the cycle counter.
Experiments recorded on machines that dynamically change their operating clock frequency will show inaccuracies in the conversion of cycle-based count to time.
The amount of data collected depends on the overflow value. Choosing a value that is too small can have the following consequences.
The amount of time spent collecting data can be a substantial fraction of the execution time of the program. The collection run might spend most of its time handling overflows and writing data instead of running the program.
A substantial fraction of the counts can come from the collection process. These counts are attributed to the collector function collector_record_counters . If you see high counts for this function, the overflow value is too small.
The collection of data can alter the behavior of the program. For example, if you are collecting data on cache misses, the majority of the misses could come from flushing the collector instructions and profiling data from the cache and replacing it with the program instructions and data. The program would appear to have a lot of cache misses, but without data collection there might in fact be very few cache misses.
Dataspace profiling is an extension to hardware counter profiling that is used for memory references. Hardware counter profiling can attribute metrics to user functions, source lines, and instructions, but not to data objects that are being referenced. By default, the Collector only captures the user instruction addresses. When dataspace profiling is enabled, the Collector also captures data addresses. Backtracking is the technique used on some machines to get the performance information that supports dataspace profiling. When backtracking is enabled, the Collector looks back at the load or store instructions that occurred before a hardware counter event to find a candidate instruction that could cause the event. On some systems, counters are precise and no backtracking is needed. Such counters are indicated by the word precise in the output of the collect -h command.
A dataspace profile is a data collection in which memory-related events, such as cache misses, are reported against the data object references that cause the events rather than just the instructions where the memory-related events occur.
A memoryspace profile is similar to a dataspace profile except that in a memoryspace profile the events are reported against components of the memory subsystem such as cache-lines or pages, rather than data objects in the program. Memoryspace profiling occurs when you prepend a + sign to a precise counter that is related to memory.
To allow dataspace profiling, the target must be a C, C++, or Fortran program, compiled for the SPARC architecture, with the -xhwcprof flag and -xdebugformat=dwarf -g flags. Furthermore, the data collected must be hardware counter profiles for memory-related counters and the + sign must be prepended to the counter name. If the optional + is prepended to one memory-related counter, but not all, the counters without the + report dataspace data against the <Unknown> data object, with subtype Dataspace data not requested during data collection.
On machines with precise interrupts no backtracking is required, and memoryspace profiling does not require the -xhwcprof flag and -xdebugformat=dwarf -g flags for compilation. Dataspace profiling, even on such machines, does require the flags.
When an experiment includes a dataspace or memoryspace profile, the er_print utility allows three additional commands: data_objects, data_single, and data_layout, as well as various commands relating to memory objects. See Commands That Control Hardware Counter Dataspace and Memory Object Lists for more information.
In addition, the Performance Analyzer includes two tabs related to dataspace profiling, the DataObjects tab and the DataLayout tab, and various tabs for memory objects. See The DataObjects Tab and The DataLayout Tab and The MemoryObjects Tabs.
Running collect -h with no additional arguments lists hardware counters, and specifies whether they are load, store, or load-store related and whether they are precise. See Hardware Counter Overflow Profiling Data.
The Collector collects synchronization delay events by tracing calls to the functions in the threads library, libthread.so, or to the real time extensions library, librt.so. The event-specific data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), and the address of the synchronization object (the mutex lock being requested, for example). The thread and LWP IDs are the IDs at the time the data is recorded. The wait time is the difference between the request time and the grant time. Only events for which the wait time exceeds the specified threshold are recorded. The synchronization wait tracing data is recorded in the experiment at the time of the grant.
The waiting thread cannot perform any other work until the event that caused the delay is completed. The time spent waiting appears both as Synchronization Wait Time and as User Lock Time. User Lock Time can be larger than Synchronization Wait Time because the synchronization delay threshold screens out delays of short duration.
The wait time is distorted by the overhead for data collection. The overhead is proportional to the number of events collected. You can minimize the fraction of the wait time spent in overhead by increasing the threshold for recording events.
The Collector records tracing data for calls to the memory allocation and deallocation functions malloc, realloc, memalign, and free by interposing on these functions. If your program bypasses these functions to allocate memory, tracing data is not recorded. Tracing data is not recorded for Java memory management, which uses a different mechanism.
The functions that are traced could be loaded from any of a number of libraries. The data that you see in the Performance Analyzer might depend on the library from which a given function is loaded.
If a program makes a large number of calls to the traced functions in a short space of time, the time taken to execute the program can be significantly lengthened. The extra time is used in recording the tracing data.
MPI tracing is based on a modified VampirTrace data collector. For more information, see the VampirTrace User Manual on the Technische Universität Dresden web site.