CHAPTER 2

Performance Data

The performance tools work by recording data about specific events while a program is running, and converting the data into measurements of program performance called metrics.

This chapter describes the data collected by the performance tools, how it is processed and displayed, and how it can be used for performance analysis. Because there is more than one tool that collects performance data, the term Collector is used to refer to any of these tools. Likewise, because there is more than one tool that analyzes performance data, the term analysis tools is used to refer to any of these tools.

This chapter covers the following topics.

See Chapter 3 for information on collecting and storing performance data.

See Chapter 4 for information on analyzing performance data with the performance analyzer.

See Chapter 5 for information on profiling the kernel while the Solaris OS is running a load.

See Chapter 6 for information on analyzing performance data with the er_print utility.


What Data the Collector Collects

The Collector collects three different kinds of data: profiling data, tracing data and global data.

Both profiling data and tracing data contain information about specific events, and both types of data are converted into performance metrics. Global data is not converted into metrics, but is used to provide markers that can be used to divide the program execution into time segments. The global data gives an overview of the program execution during that time segment.

The data packets collected at each profiling event or tracing event include the following information:

For more information on threads and lightweight processes, see Chapter 7.

In addition to the common data, each event-specific data packet contains information specific to the data type. The five types of data that the Collector can record are:

These five data types, the metrics that are derived from them, and how you might use them, are described in the following subsections. A sixth type of data, global sampling data, cannot be converted to metrics because it does not include call stack information.

Clock Data

When you are doing clock-based profiling, the data collected depends on the metrics provided by the operating system.

Clock-based Profiling Under the Solaris OS

In clock-based profiling under the Solaris OS, the state of each LWP is stored at regular time intervals. This time interval is called the profiling interval. The information is stored in an integer array: one element of the array is used for each of the ten microaccounting states maintained by the kernel. The data collected is converted by the Performance Analyzer into times spent in each state, with a resolution of the profiling interval. The default profiling interval is approximately 10 milliseconds (10 ms). The Collector provides a high-resolution profiling interval of approximately 1 ms and a low-resolution profiling interval of approximately 100 ms, and, where the OS permits, allows arbitrary intervals. Running the collect command with no arguments prints the range and resolution allowable on the system on which it is run.
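
For example, with the default 10 ms profiling interval, times are accumulated in units of roughly 10 ms: a state that is recorded in 500 profile packets for an LWP accounts for approximately 500 x 10 ms = 5 seconds of time in that state, and behavior that occupies much less than one profiling interval might not be observed at all.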

The metrics that are computed from clock-based data are defined in the following table.


TABLE 2-1 Solaris Timing Metrics

Metric                 Definition

User CPU time          LWP time spent running in user mode on the CPU.
Wall time              LWP time spent in LWP 1. This is usually the "wall clock time".
Total LWP time         Sum of all LWP times.
System CPU time        LWP time spent running in kernel mode on the CPU or in a trap state.
Wait CPU time          LWP time spent waiting for a CPU.
User lock time         LWP time spent waiting for a lock.
Text page fault time   LWP time spent waiting for a text page.
Data page fault time   LWP time spent waiting for a data page.
Other wait time        LWP time spent waiting for a kernel page, or time spent sleeping or stopped.


For multithreaded experiments, times other than wall clock time are summed across all LWPs. Wall time as defined is not meaningful for multiple-program multiple-data (MPMD) programs.

Timing metrics tell you where your program spent time in several categories and can be used to improve the performance of your program.

Clock-based Profiling Under the Linux OS

Under the Linux OS, the only metric available is User CPU time. Although the total CPU utilization time reported is accurate, the Analyzer cannot determine the proportion of that time that is actually System CPU time as accurately as it can for the Solaris OS. Although the Analyzer displays the information as if the data were for a lightweight process (LWP), in reality there are no LWPs on a Linux OS; the displayed LWP ID is actually the thread ID.

Hardware Counter Overflow Profiling Data

Hardware counters keep track of events like cache misses, cache stall cycles, floating-point operations, branch mispredictions, CPU cycles, and instructions executed. In hardware counter overflow profiling, the Collector records a profile packet when a designated hardware counter of the CPU on which an LWP is running overflows. The counter is reset and continues counting. The profile packet includes the overflow value and the counter type.

Various CPU families support from two to eighteen simultaneous hardware counter registers. The Collector can collect data on one or more registers. For each register the Collector allows you to select the type of counter to monitor for overflow, and to set an overflow value for the counter. Some hardware counters can use any register; others are only available on a particular register. Consequently, not all combinations of hardware counters can be chosen in a single experiment.

Hardware counter overflow profiling data is converted by the Performance Analyzer into count metrics. For counters that count in cycles, the metrics reported are converted to times; for counters that do not count in cycles, the metrics reported are event counts. On machines with multiple CPUs, the clock frequency used to convert the metrics is the harmonic mean of the clock frequencies of the individual CPUs. Because each type of processor has its own set of hardware counters, and because the number of hardware counters is large, the hardware counter metrics are not listed here. The next subsection tells you how to find out what hardware counters are available.
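
As a rough worked example, on a hypothetical 2 GHz processor a cycles counter with an overflow value of 9999991 delivers one overflow event for every 9999991 / (2 x 10^9) seconds, or about 5 ms, of CPU time; 1,000 recorded overflow events for a function therefore convert to approximately 5 seconds of time attributed to that function.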

One use of hardware counters is to diagnose problems with the flow of information into and out of the CPU. High counts of cache misses, for example, indicate that restructuring your program to improve data or text locality or to increase cache reuse can improve program performance.

Some of the hardware counters provide similar or related information. For example, branch mispredictions and instruction cache misses are often related because a branch misprediction causes the wrong instructions to be loaded into the instruction cache, and these must be replaced by the correct instructions. The replacement can cause an instruction cache miss, or an instruction translation lookaside buffer (ITLB) miss, or even a page fault.

Hardware counter overflows are often delivered one or more instructions after the instruction that caused the event and the corresponding event counter to overflow: this is referred to as "skid", and it can make counter overflow profiles difficult to interpret. In the absence of hardware support for precise identification of the causal instruction, an appropriate backtracking search for a candidate causal instruction may be attempted.

When such backtracking is supported and specified during collection, hardware counter profile packets additionally include the PC (program counter) and EA (effective address) of a candidate memory-referencing instruction appropriate for the hardware counter event. (Subsequent processing during analysis is required to validate the candidate event PC and EA.) This additional information about memory-referencing events facilitates various data-oriented analyses.

Hardware Counter Lists

Hardware counters are processor-specific, so the choice of counters available to you depends on the processor that you are using. The performance tools provide aliases for a number of counters that are likely to be in common use. You can obtain a list of available hardware counters on any particular system from the Collector by typing collect with no arguments in a terminal window on that system. If the processor and system support hardware counter profiling, the collect command prints two lists containing information about hardware counters. The first list contains "well-known" (aliased) hardware counters; the second list contains raw hardware counters.

Here is an example that shows the entries in the counter list. The counters that are considered well-known are displayed first in the list, followed by a list of the raw hardware counters. Each line of output in this example is formatted for print.


Well known HW counters available for profiling:
cycles[/{0|1}],9999991 ('CPU Cycles', alias for Cycle_cnt; CPU-cycles)
insts[/{0|1}],9999991 ('Instructions Executed', alias for Instr_cnt; events)
dcrm[/1],100003 ('D$ Read Misses', alias for DC_rd_miss; load events)
...
Raw HW counters available for profiling:
Cycle_cnt[/{0|1}],1000003 (CPU-cycles)
Instr_cnt[/{0|1}],1000003 (events)
DC_rd[/0],1000003 (load events)

Format of the Well-Known Hardware Counter List

In the well-known hardware counter list, the first field (for example, cycles) gives the alias name that can be used in the -h counter... argument of the collect command. This alias name is also the identifier to use in the er_print command.

The second field lists the available registers for the counter; for example, [/{0|1}].

The third field, for example, 9999991, is the default overflow value for the counter. For well-known counters, the default value has been chosen to provide a reasonable sample rate. Because actual event rates vary considerably, you might need to specify a non-default overflow value.

The fourth field, in parentheses, contains type information. It provides a short description (for example, CPU Cycles), the raw hardware counter name (for example, Cycle_cnt), and the type of units being counted (for example, CPU-cycles), which can include up to two words.

If the first word of type information is:

If the second or only word of the type information is:

In the well-known hardware counter list in the example, the type information contains one word, CPU-cycles for the first counter and events for the second counter. For the third counter, the type information contains two words, load events.

Format of the Raw Hardware Counter List

The information included in the raw hardware counter list is a subset of the information in the well-known hardware counter list. Each line includes the internal counter name as used by cputrack(1), the register number(s) on which that counter can be used, the default overflow value, and the counter units, which can be either CPU-cycles or Events.

If the counter measures events unrelated to the program running, the first word of type information is not-program-related. For such a counter, profiling does not record a call stack, but instead shows the time being spent in an artificial function, collector_not_program_related. Thread and LWP IDs are recorded, but are meaningless.

The default overflow value for raw counters is 1000003. This value is not ideal for most raw counters, so you should specify overflow values when specifying raw counters.

Synchronization Wait Tracing Data

In multithreaded programs, the synchronization of tasks performed by different threads can cause delays in execution of your program, because one thread might have to wait for access to data that has been locked by another thread, for example. These events are called synchronization delay events and are collected by tracing calls to the Solaris or pthread thread functions. The process of collecting and recording these events is called synchronization wait tracing. The time spent waiting for the lock is called the wait time. Currently, synchronization wait tracing is only available for systems running the Solaris OS.
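
The following minimal C sketch (not taken from this manual) shows the kind of call that generates such events: when another thread holds the mutex, the time a thread spends blocked in pthread_mutex_lock is the wait time for that synchronization delay event, provided it exceeds the threshold described below.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

void *thread_work(void *arg)
{
    /* If another thread holds the mutex, this call blocks; the time spent
       blocked is recorded as a synchronization delay event. */
    pthread_mutex_lock(&lock);
    shared_counter++;               /* critical section protected by the lock */
    pthread_mutex_unlock(&lock);
    return arg;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_work, NULL);
    pthread_create(&t2, NULL, thread_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}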

Events are only recorded if their wait time exceeds a threshold value, which is given in microseconds. A threshold value of 0 means that all synchronization delay events are traced, regardless of wait time. The default threshold is determined by running a calibration test, in which calls are made to the threads library without any synchronization delay. The threshold is the average time for these calls multiplied by an arbitrary factor (currently 6). This procedure prevents the recording of events for which the wait times are due only to the call itself and not to a real delay. As a result, the amount of data is greatly reduced, but the count of synchronization events can be significantly underestimated.

Synchronization tracing for Java programs is based on events generated when a thread attempts to acquire a Java Monitor. Both machine and Java call stacks are collected for these events, but no synchronization tracing data is collected for internal locks used within the Java(TM) Virtual Machine (JVM)[1] software. In the machine representation, thread synchronization devolves into calls to _lwp_mutex_lock, and no synchronization data is shown, since these calls are not traced.

Synchronization wait tracing data is converted into the following metrics:


TABLE 2-2 Synchronization Wait Tracing Metrics

Metric                         Definition

Synchronization delay events   The number of calls to a synchronization routine where the wait time exceeded the prescribed threshold.
Synchronization wait time      Total of wait times that exceeded the prescribed threshold.


From this information you can determine if functions or load objects are either frequently blocked, or experience unusually long wait times when they do make a call to a synchronization routine. High synchronization wait times indicate contention among threads. You can reduce the contention by redesigning your algorithms, particularly restructuring your locks so that they cover only the data for each thread that needs to be locked.

Heap Tracing (Memory Allocation) Data

Calls to memory allocation and deallocation functions that are not properly managed can be a source of inefficient data usage and can result in poor program performance. In heap tracing, the Collector traces memory allocation and deallocation requests by interposing on the C standard library memory allocation functions malloc, realloc, valloc, and memalign and the deallocation function free. Calls to mmap are treated as memory allocations, which allows heap tracing events for Java memory allocations to be recorded. The Fortran functions allocate and deallocate call the C standard library functions, so these routines are also traced indirectly.
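
As a minimal sketch (not taken from this manual), the following C fragment shows the kinds of calls the Collector interposes on; each allocation request is recorded, and an allocation with no matching call to free is reported as a leak:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(64);      /* traced: one allocation of 64 bytes          */
    if (buf == NULL)
        return 1;
    buf = realloc(buf, 256);     /* realloc is also interposed on and traced    */
    if (buf == NULL)
        return 1;
    strcpy(buf, "traced");
    free(buf);                   /* traced: matches the allocation above        */

    char *lost = malloc(128);    /* traced allocation with no matching free:    */
    (void)lost;                  /* reported as one leak of 128 bytes           */
    return 0;
}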

Heap profiling for Java programs is not supported.

Heap tracing data is converted into the following metrics:


TABLE 2-3 Memory Allocation (Heap Tracing) Metrics

Metric            Definition

Allocations       The number of calls to the memory allocation functions.
Bytes allocated   The sum of the number of bytes allocated in each call to the memory allocation functions.
Leaks             The number of calls to the memory allocation functions that did not have a corresponding call to a deallocation function.
Bytes leaked      The number of bytes that were allocated but not deallocated.


Collecting heap tracing data can help you identify memory leaks in your program or locate places where there is inefficient allocation of memory.

Another commonly used definition of a memory leak, such as the one used by the dbx debugging tool, is a dynamically allocated block of memory that has no pointers pointing to it anywhere in the data space of the program. The definition of leaks used here includes this alternative definition, but also includes memory for which pointers do exist.
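
As a hypothetical illustration of the difference, both blocks in the following fragment count as leaks under the definition used here, but only the first is a leak under the dbx-style definition:

#include <stdlib.h>

static char *still_referenced;      /* global pointer keeps the second block reachable */

int main(void)
{
    char *p = malloc(100);          /* pointer overwritten below: unreachable, so a
                                       leak by both definitions                        */
    p = NULL;
    (void)p;

    still_referenced = malloc(200); /* never freed but still reachable: a leak only
                                       under the definition used here (no matching
                                       call to free)                                   */
    return 0;
}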

MPI Tracing Data

The Collector can collect data on calls to the Message Passing Interface (MPI) library. Currently, MPI tracing is only available for systems running the Solaris OS. The functions for which data is collected are listed below.


MPI_Allgather     MPI_Irecv             MPI_Sendrecv
MPI_Allgatherv    MPI_Isend             MPI_Sendrecv_replace
MPI_Allreduce     MPI_Recv              MPI_Ssend
MPI_Alltoall      MPI_Reduce            MPI_Wait
MPI_Alltoallv     MPI_Reduce_scatter    MPI_Waitall
MPI_Barrier       MPI_Rsend             MPI_Waitany
MPI_Bcast         MPI_Scan              MPI_Waitsome
MPI_Bsend         MPI_Scatter           MPI_Win_fence
MPI_Gather        MPI_Scatterv          MPI_Win_lock
MPI_Gatherv       MPI_Send


MPI tracing data is converted into the following metrics:


TABLE 2-4 MPI Tracing Metrics

Metric               Definition

MPI Receives         Number of receive operations in MPI functions that receive data
MPI Bytes Received   Number of bytes received in MPI functions
MPI Sends            Number of send operations in MPI functions that send data
MPI Bytes Sent       Number of bytes sent in MPI functions
MPI Time             Time spent in all calls to MPI functions
Other MPI Calls      Number of calls to other MPI functions


The number of bytes recorded as received or sent is the buffer size given in the call. This number might be larger than the actual number of bytes received or sent. In the global communication functions and collective communication functions, the number of bytes sent or received is the maximum number, assuming direct interprocessor communication and no optimization of the data transfer or re-transmission of the data.
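
For example, in the hypothetical exchange below, the MPI Bytes Received metric for the receiving rank records the full posted buffer size of 1024 integers, even though the matching send transfers only 10:

#include <mpi.h>

int main(int argc, char **argv)
{
    int buffer[1024];
    int data[10] = {0};
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* 10 integers are actually sent. */
        MPI_Send(data, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The posted receive buffer is 1024 integers, so MPI tracing records
           1024 * sizeof(int) bytes received, although only 10 integers arrive. */
        MPI_Recv(buffer, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}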

The functions from the MPI library that are traced are listed in TABLE 2-5, categorized as MPI send functions, MPI receive functions, MPI send and receive functions, and other MPI functions.


TABLE 2-5 Classification of MPI Functions Into Send, Receive, Send and Receive, and Other

Category                         Functions

MPI send functions               MPI_Bsend, MPI_Isend, MPI_Rsend, MPI_Send, MPI_Ssend
MPI receive functions            MPI_Irecv, MPI_Recv
MPI send and receive functions   MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Reduce, MPI_Reduce_scatter, MPI_Scan, MPI_Scatter, MPI_Scatterv, MPI_Sendrecv, MPI_Sendrecv_replace
Other MPI functions              MPI_Barrier, MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome, MPI_Win_fence, MPI_Win_lock


Collecting MPI tracing data can help you identify places where you have a performance problem in an MPI program that could be due to MPI calls. Examples of possible performance problems are poor load balancing, synchronization delays, and communications bottlenecks.

Global (Sampling) Data

Global data is recorded by the Collector in packets called sample packets. Each packet contains a header, a timestamp, execution statistics from the kernel such as page fault and I/O data, context switches, and a variety of page residency (working-set and paging) statistics. The data recorded in sample packets is global to the program and is not converted into performance metrics. The process of recording sample packets is called sampling.

Sample packets are recorded in the following circumstances:

The performance tools use the data recorded in the sample packets to group the data into time periods, which are called samples. You can filter the event-specific data by selecting a set of samples, so that you see only information for these particular time periods. You can also view the global data for each sample.

The performance tools make no distinction between the different kinds of sample points. To make use of sample points for analysis, you should choose only one kind of point to be recorded. In particular, if you want to record sample points that are related to the program structure or execution sequence, you should turn off periodic sampling and use samples recorded when dbx stops the process, when a signal is delivered to the process that is recording data using the collect command, or when a call is made to the Collector API functions.
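
As a sketch only, a program can mark phase boundaries with explicit sample points through the Collector API; the header name and the collector_sample() function shown here are assumptions to be checked against the chapter on collecting performance data:

/* Sketch only: the header name and the collector_sample() call are
   assumptions about the Collector API and should be checked against
   the chapter on collecting performance data. */
#include "libcollector.h"

extern void initialize(void);
extern void compute(void);
extern void write_results(void);

int main(void)
{
    initialize();
    collector_sample("init done");      /* sample point: end of setup phase */
    compute();
    collector_sample("compute done");   /* sample point: end of computation */
    write_results();
    return 0;
}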


How Metrics Are Assigned to Program Structure

Metrics are assigned to program instructions using the call stack that is recorded with the event-specific data. If the information is available, each instruction is mapped to a line of source code and the metrics assigned to that instruction are also assigned to the line of source code. See Chapter 7 for a more detailed explanation of how this is done.

In addition to source code and instructions, metrics are assigned to higher level objects: functions and load objects. The call stack contains information on the sequence of function calls made to arrive at the instruction address recorded when a profile was taken. The Performance Analyzer uses the call stack to compute metrics for each function in the program. These metrics are called function-level metrics.

Function-Level Metrics: Exclusive, Inclusive, and Attributed

The Performance Analyzer computes three types of function-level metrics: exclusive metrics, inclusive metrics and attributed metrics.

For a function that only appears at the bottom of call stacks (a leaf function), the exclusive and inclusive metrics are the same.

Exclusive and inclusive metrics are also computed for load objects. Exclusive metrics for a load object are calculated by summing the function-level metrics over all functions in the load object. Inclusive metrics for load objects are calculated in the same way as for functions.

Exclusive and inclusive metrics for a function give information about all recorded paths through the function. Attributed metrics give information about particular paths through a function. They show how much of a metric came from a particular function call. The two functions involved in the call are described as a caller and a callee. For each function in the call tree:

Comparison of attributed and inclusive metrics for the caller or the callee gives further information:

To locate places where you could improve the performance of your program:

Interpreting Attributed Metrics: An Example

Exclusive, inclusive and attributed metrics are illustrated in FIGURE 2-1, which contains a fragment of a call tree. The focus is on the central function, function C. There may be calls to other functions that do not appear in this figure.


FIGURE 2-1 Call Tree Illustrating Exclusive, Inclusive, and Attributed Metrics



Function C calls two functions, function E and function F, and attributes 10 units of its inclusive metric to function E and 10 units to function F. These are the callee attributed metrics. Their sum (10+10) added to the exclusive metric of function C (5) equals the inclusive metric of function C (25).

The callee attributed metric and the callee inclusive metric are the same for function E but different for function F. This means that function E is only called by function C but function F is called by some other function or functions. The exclusive metric and the inclusive metric are the same for function E but different for function F. This means that function F calls other functions, but function E does not.

Function C is called by two functions: function A and function B, and attributes 10 units of its inclusive metric to function A and 15 units to function B. These are the caller attributed metrics. Their sum (10+15) equals the inclusive metric of function C.

The caller attributed metric is equal to the difference between the inclusive and exclusive metric for function A, but it is not equal to this difference for function B. This means that function A only calls function C, but function B calls other functions besides function C. (In fact, function A might call other functions but the time is so small that it does not appear in the experiment.)

How Recursion Affects Function-Level Metrics

Recursive function calls, whether direct or indirect, complicate the calculation of metrics. The Performance Analyzer displays metrics for a function as a whole, not for each invocation of a function: the metrics for a series of recursive calls must therefore be compressed into a single metric. This does not affect exclusive metrics, which are calculated from the function at the bottom of the call stack (the leaf function), but it does affect inclusive and attributed metrics.

Inclusive metrics are computed by adding the metric for the event to the inclusive metric of the functions in the call stack. To ensure that the metric is not counted multiple times in a recursive call stack, the metric for the event is added only once to the inclusive metric of each unique function in the stack.

Attributed metrics are computed from inclusive metrics. In the simplest case of recursion, a recursive function has two callers: itself and another function (the initiating function). If all the work is done in the final call, the inclusive metric for the recursive function is attributed to itself and not to the initiating function. This attribution occurs because the inclusive metric for all the higher invocations of the recursive function is regarded as zero to avoid multiple counting of the metric. The initiating function, however, correctly attributes to the recursive function as a callee the portion of its inclusive metric due to the recursive call.
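
As a hypothetical illustration, consider the direct recursion below: however deep the recursion goes, an event recorded in the innermost call is added to the inclusive metric of count_down only once, and the metric that start attributes to count_down as a callee covers the entire recursive chain.

/* Hypothetical example of direct recursion. */
void count_down(int n)
{
    if (n <= 0)
        return;            /* an event here has count_down on the stack many times,
                              but adds to its inclusive metric only once            */
    count_down(n - 1);
}

void start(void)
{
    count_down(100000);    /* initiating call: the callee-attributed metric that
                              start gives to count_down covers the whole chain      */
}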


1 (Footnote) The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.