Advanced Topics: Understanding the Sampling Analyzer and Its Data

Analyzing Program Performance with Sun WorkShop

Chapter 6

Advanced Topics: Understanding the Sampling Analyzer and Its Data

The Sampling Analyzer reads the data collected by the Sampling Collector and converts the data to performance metrics, which are computed against various elements in the structure of the target program. Each event collected has two parts:

Some event-specific data that is used to compute metrics
A call stack of the application that is used to associate those metrics with the program structure.

This chapter covers the following topics:

Event-Specific Data and What It Means
Call Stacks and Program Execution
Mapping Addresses to Program Structure
Annotated Source Code and Disassembly Code
Understanding Performance Costs

Event-Specific Data and What It Means

The event-specific data for each event recorded contains a high-resolution timestamp, a thread ID, and an LWP ID. The timestamp can be used to select only part of a run, while the thread and LWP IDs can be used to select a subset of threads and LWPs. In addition, each event generates specific raw data, which is described in the following sections:

"Clock-Based Profiling"
"Synchronization Wait Tracing"
"Hardware-Counter Overflow Profiling"

Clock-Based Profiling

Clock-based profiling data consists of a set of tick-counts delivered with each profiling signal to each LWP. There is a separate tick for each of the microaccounting states maintained by the kernel. Some of those states are aggregated into System CPU Time, while others are aggregated into System Wait Time. The remaining states are presented individually in the Analyzer.

When the LWP is in user-mode in the CPU, the tick array delivered typically contains a 1 (one) for the User-CPU state, and zeros for all the other states. When the LWP is in one of the other states, the ticks are accumulated, but a profile signal is not sent until the process returns to user-CPU state.

Since the ticks are integer counts, each representing one profile interrupt interval, while LWP scheduling is done at finer granularity, there is an inherent uncertainty in the state as attributed in the profile packet. Typically, the total LWP time, computed by summing all the ticks in all states, is accurate to a few tenths of a percent, as compared with the values returned by gethrtime() in the process. The CPU time may vary by several percentage points, compared with values returned by gethrvtime() in the process. Under heavy load, the variation may be even more pronounced. However, the CPU time differences do not represent a systematic distortion, and the relative times reported for different routines, source-lines, and such are not substantially distorted.

For information about gethrtime() and gethrvtime(), see the man pages for these functions.

Note – Be careful in comparing the LWP times reported in the Analyzer with the numbers from vmstat. The Analyzer times represent the sum of the various microstate accounting times during the lifetime of each LWP, whereas vmstat reports times summed over physical CPUs. If, for example, the target process has many more LWPs than the system on which it is running has CPUs, the Analyzer shows much more wait time than vmstat reports. In the simplest such case, with two CPU-bound LWPs and one physical CUP, the Analyzer reports the sum of the two LWPs, as well as each LWP separately, as having approximately 50% wait (idle) time; vmstat reports no idle time. The CPU is busy all the time, but each LWP is spending half its time waiting while the other LWP is running.

Synchronization Wait Tracing

Synchronization wait tracing events are collected by tracing calls to the functions in the threads library. The event data consists of high-resolution timestamps for the request and the grant (beginning and end of the call that is traced), and the address of the synchronization object (the mutex lock being requested, for example). Only events for which the difference between request and grant times exceeds the specified threshold are recorded. Synchronization trace data is accurate to within a few tenths of a percent, compared with time stamps recorded in the process itself.

Hardware-Counter Overflow Profiling

Hardware-counter overflow profiling allows you to specify a hardware counter and an overflow value (number of increments) for that counter on the CPU on which a given LWP is running. Hardware counters typically tally instruction-cache misses, data-cache misses, clock ticks, instructions executed, and the like. When the designated counter reaches the overflow value, the Collector records the call stack for the LWP and includes a timestamp and the IDs of the LWP and the thread running on it. You can access this data in the Analyzer and use it to support count metrics.

Hardware counters are system-specific, so the choice of counters available to you depend on the system you are using. Many systems do not support hardware-counter overflow profiling. On these machines, the feature is disabled.

Call Stacks and Program Execution

A call stack is a series of program addresses (PCs) representing instructions from within the program. The first PC, called the leaf PC, is at the bottom of the stack, and is the address of the next instruction to be executed. The next PC is the address of the call to the function containing the leaf PC; the next PC is the address of the call to that function, and so forth, until the top of the stack is reached. The process of recording a call stack is referred to as "unwinding the stack" and is described in Unwinding the Stack.

The leaf PC in a call stack is used to attribute exclusive metrics from the performance data to the function in which that PC is found. All the other PCs on the stack are used to attribute inclusive metrics to the function in which they are found.

Most of the time, the PCs in the recorded call stack correspond in a natural way to functions as they appear in the source code of the program, and the Analyzer's reported metrics correspond directly to those functions. Sometimes, however, the actual execution of the program may not correspond to a simple intuitive model of how the program would execute, and the Analyzer's reported metrics may be confusing. See Mapping Addresses to Program Structure for more information about such cases.

Single-Threaded Execution and Function Calls

The simplest case of program execution is that of a single-threaded program calling functions within its own load object.

When a program is loaded into memory to begin execution, a context is established for it that includes the initial address to be executed, an initial register set, and a stack (a region of memory used for scratch data and for keeping track of how functions call each other). The initial address is always in the function _start(), built into every executable.

When the program runs, each instruction executes in sequence until an instruction is encountered that represents a call, jump, or branch. At that point, control is transferred to the address given by the target of the branch, and execution proceeds from there.

When the instruction sequence that represents a call is executed, the return address is put into a register, and execution proceeds at the first instruction of the function being called.

In most cases, somewhere in the first few instructions of the called function, a new frame is pushed onto the stack, and the return address is put into that frame. The register used for the return address can then be used when the called function itself calls another function. When the function is about to return, it pops its frame from the stack, and control returns to the address from which the function was called.

Function Calls Between Shared Objects

When a function in one shared object calls a function in another shared object, the call is more complicated than in a simple call to a function within the program. Each shared object contains a Program Linkage Table, or PLT, which contains entries for every function external to that shared object that is referenced from it. Initially the address for each external function in the PLT is actually an address within ld.so, the dynamic linker. The first time such a function is called, control is transferred to the dynamic linker, which resolves the call to the real external function and patches the PLT address for subsequent calls.

Signals

When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system routine, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another. The Analyzer treats the stack frames for such calls normally, although the stack frames can make it look as though any instruction can generate a call.

Fast Traps

Some instructions trap into the kernel and then are passed back to user mode in a lightweight version of signals, known as fast traps. The Analyzer knows about one of these, the exception for misaligned integer memory references in SPARC-v9. In that case, frames for the misaligned integer trap appear, as if the trapping instruction called the handler.

Kernel Traps

Some instructions trap into the kernel, and are emulated in the kernel. One example is the fitos instruction on the UltraSPARC-III platform, which converts a large integer to single-precision floating point. No special handling is done in the Analyzer, but the instruction following the trapping instruction appears to take a long time, because it cannot issue until the kernel is through.

Tail-Call Optimization

One particular optimization can be done whenever the last thing a particular routine does is to call another routine. Rather than actually making the call and then popping the frame from the stack and returning, the caller pops the stack and then calls its callee. The motivation for this optimization is to reduce the size of the stack, and, on SPARC machines, to reduce the use of register windows.

In effect, your program source implies that it behaved like this:
A -> B -> C -> D
But when B and C are tail-call optimized, the call stack looks as if the program is doing this:
A -> B

A -> C

A -> D
That is, the call tree is flattened. When code is compiled with the -g option, tail-call optimization takes place only at O4 or higher. When code is compiled without the -g option, tail-call optimization takes place at O2 or higher.

Explicit Multithreading

A simple program executes in a single thread, on a single LWP (light-weight process). Multithreaded executables make calls to a thread creation routine, which creates additional LWPs to run the threads. The operating system controls the assignment of LWPs to CPUs for execution, while the threads library controls the scheduling of threads onto LWPs. Newly created threads begin execution at a routine called _thread_start(), which calls the function passed in the thread creation call. Threading can be done with either bound threads, where each thread is bound to a specific LWP, or with unbound threads, where each thread may be scheduled on a different LWP at different times.

Parallel Execution and Compiler-Generated Body Functions

If your code contains Sun, Cray, or OpenMP parallelization directives, it can be compiled for parallel execution. (OpenMP is a feature available only for Fortran 95. You might want to refer to the chapters on parallelization and OpenMP in the Fortran Programming Guide for background on parallelization strategies and OpenMP directives.)

When a loop or other parallel construct is compiled for parallel execution, the compiler-generated code is executed by multiple threads, coordinated by the microtasking library.

When the compiler encounters a parallel construct, it sets up the code for parallel execution by placing the body of the construct in a separate body function and replacing the construct with a call to a microtasking library routine. The microtasking library routine is responsible for dispatching threads to execute the body function. The address of the body function is passed to the microtasking library routine as an argument.

If the parallel construct is delimited with one of the following:

The Sun directive c$par doall
The Cray directive c$mic doall
An OpenMP c$omp PARALLEL, c$omp PARALLEL DO, or c$omp PARALLEL SECTIONS directive

then the construct is replaced with a call to the microtasking library routine __mt_MasterFunction_(). A loop that is parallelized automatically by the compiler is also replaced by a call to __mt_MasterFunction_().

If a c$omp PARALLEL construct contains one or more worksharing c$omp DO or c$omp SECTIONS directives, each worksharing construct is replaced by a call to the microtasking library routine __mt_Worksharing_().

The compiler assigns names to body functions of the form:
_$1$mf_string1_$namelength$functionname$linenumber$string2
string1 denotes the type of parallel construct, either parallel, sections, doall, or DOALL.
namelength is the number of characters in functionname.
functionname is the name of the function from which the construct was extracted, usually ending in an underscore.
linenumber is the line number of the construct in the original source.
string2 is related to the name of the source file.

To make the data easier to analyze, the Analyzer provides these functions with a more readable name, in addition to the compiler-generated name.

At run time, initially only the main thread executes. The first time it executes a call to __mt_MasterFunction_(), __mt_MasterFunction_() initiates the creation of multiple worker threads, the number based on the value specified by the environment variable PARALLEL or OMP_NUM_THREADS, or by a call to the OpenMP run-time routine omp_set_num_threads(). Thereafter __mt_MasterFunction_() manages the distribution of available work among the master thread and the worker threads.

In the main thread, __mt_MasterFunction_() calls a sequence of dispatcher functions that eventually call the body function. (This is also the behavior you see for code compiled for parallelization but running on a single-CPU machine, or on a multiprocessor machine using only one thread.)

Worker threads are created using the Solaris threads library. The call stack for a worker thread begins with the threads library routine _thread_start(). _thread_start() makes a call to __mt_SlaveFunction_(), which the thread continues to execute during its lifetime. __mt_SlaveFunction_() calls __mt_WaitForWork_(), in which the thread waits for available work. When work becomes available, the thread returns to __mt_SlaveFunction_(), which then initiates a call to the body function. When the work is finished, the worker thread returns to __mt_SlaveFunction_(), which calls __mt_WaitForWork_() again. You can observe the control flow for a thread in the Analyzer Callers-Callees window. See Examining Caller-Callee Metrics for a Function for information about how to use this window.

Note – In these call sequences, the Analyzer shows an imputed call to the function from which the compiler-generated body functions were extracted. This call is inserted as if the original function called the compiler-generated body functions, so that inclusive data is reported against the original function.

Worker threads typically use CPU time while they are in __mt_WaitForWork_() in order to reduce latency when new work arrives, that is, when the main thread reaches a new parallel construct. (This is known as a busy-wait.) However, you can set an environment variable to specify a sleep wait, which shows up in the Analyzer as LWP time, but not CPU time. There are generally two situations where the worker threads spend time waiting for work, where you might want to redesign your program to reduce the waiting:

When the main thread is executing in a serial region and there is nothing for the worker threads to do
When the work load is unbalanced, and some threads have finished and are waiting while others are still executing

By default, the microtasking library uses threads that are bound to LWPs. You can override this default by setting the FLAG variable in the makefile to UNBOUND before you build the program, or by setting the environment variable MT_BIND_LWP to FALSE.

Loops with a long long index call somewhat different microtasking library routines than loops with an integer or long index.

Note – The whole multiprocessing dispatch process is implementation-dependent, and may change from release to release.

Unwinding the Stack

When the Collector records an event, it records the call stack of the process at the time of the event. The call stack recorded consists of the address of the next instruction to be executed (the leaf PC), the contents of the return register, and the contents of the return address on each frame of the stack, eventually reaching the address of the call instruction in _start() for the main thread and _thread_start() for the worker threads.

The Collector always records the return register, and the Analyzer uses a heuristic to determine whether or not the return address has been pushed on the stack. If it has, the return register is ignored; if it has not, the return register is used as the calling PC. A specific register, known as the frame pointer, is used to find the first frame on the stack; each frame contains a previous frame pointer used to find the frame of its caller. On Intel machines, for optimized code, a previous frame pointer is not maintained in each stack frame, and a heuristic is used to unwind the stack.

Mapping Addresses to Program Structure

Once a call stack is processed into PC values, the Analyzer maps those PCs to shared objects, functions, source lines, and disassembly lines (instructions) in the program. This section describes those mappings.

The Process Image

When a program is run, a process is instantiated from the executable for that program. The process has a number of regions in its address space, some of which are text and represent executable instructions, and some of which are data which is not normally executed. PCs as recorded in the call stack normally correspond to addresses within one of the text segments of the program.

The first text section in a process derives from the executable itself. Others correspond to shared objects that are loaded with the executable, either at the time the process is started, or dynamically loaded by the process. The PCs in a call stack are resolved based on the executable and shared objects loaded at the time the call stack was recorded. Executables and shared objects are very similar, and are collectively referred to as load objects.

Because shared objects can be loaded and unloaded in the course of program execution, any given PC may correspond to different functions at different times during the run. In addition, different PCs may correspond to the same function, when a shared object is unloaded and then reloaded at a different address.

Load Objects and Functions

Each load object, whether an executable or a shared object, contains a text section with the instructions generated, a data section for data, and various symbol tables. All load objects must contain an ELF symbol table, which gives the names and addresses of all the globally-known functions in that object. Load objects compiled with the -g option contain additional symbolic information, which can augment the ELF symbol table and provide information about functions that are not global, additional information about object modules from which the functions came, and line number information relating addresses to source lines.

The term function is used to describe a set of instructions that represent a high-level operation described in the source code. The term covers subroutines as used in Fortran, methods as used in C++, and the like. Functions are described cleanly in the source code, and normally their names appear in the symbol table representing a set of addresses; if the program counter is within that set, the program is executing within that function.

In principle, any address within the text segment of a load object can be mapped to a function. Exactly the same mapping is used for the leaf PC and all the other PCs on the call stack. Most of the functions correspond directly to the source model of the program. Some do not; these functions are described in the following sections:

Aliased Functions
Non-Unique Function Names
Static Functions from Stripped Shared Libraries
Fortran Alternate Entry Points
Inlined Functions
Compiler-Generated Body Functions
Outline Functions
The <Unknown> Function
The <Total> Function

Aliased Functions

Typically, functions are defined as global, meaning that their names are known everywhere in the program. The name of a global function must be unique within the executable. If there is more than one global function of a given name within the address space, the runtime linker resolves all references to one of them, and the others are never executed, and so do not appear in the function list. From the Summary Metrics window, you can see the shared object and object module that contain the selected function.

Under various circumstances, a function may be known by several different names. A very common example of this is the use of so-called weak and strong symbols for the same piece of code. A strong name is typically the same as the corresponding weak name, except that it has a leading underscore. Many of the functions in the thread library also have alternate names for pthreads and Solaris threads, as well as strong and weak names and alternate internal symbols. In all such cases, only one name is used in the function list of the Analyzer. The name chosen is the last symbol at the given address in alphabetic order. This choice most often corresponds to the name that the user would use. In the Summary Metrics window, all the aliases for the selected function are shown.

Non-Unique Function Names

While aliased functions reflect multiple names for the same piece of code, there are circumstances under which multiple pieces of code have the same name:

Sometimes, for reasons of modularity, functions are defined as static, meaning that their names are known only in some parts of the program (typically a single compiled object module). In such cases, the Analyzer may see several functions of the same name referring to quite different parts of the program. In the Summary Metrics window, the object module name for each of these functions is given to distinguish them from one another. In addition, any selection of one of these functions can be used to show the source, disassembly, and the callers and callees of that specific function.
Sometimes a program may use wrapper or interposition functions that have the weak name of a function in a library and supersede calls to that library function. Some wrapper functions call the original function in the library, in which case both instances of the name appear in the Analyzer function list. Such functions come from different shared objects and different object modules, and can be distinguished from each other in that way. The Collector also wraps some library functions, and both the wrapper function and the real function can appear in the Analyzer.

Static Functions from Stripped Shared Libraries

Static functions are often used within libraries, so that the name used internally in a library does not conflict with a name that the user might use. When libraries are stripped, the names of static functions are deleted. In such cases, the Analyzer generates an artificial name of the form <static>@0x12345, where the string following the @ sign is the offset of that function within the library. The Analyzer cannot distinguish between contiguous stripped static functions and a single such function, so two or more such functions may appear with their metrics coalesced.

Fortran Alternate Entry Points

Fortran provides a way of having multiple entry points to a single piece of code, allowing a caller to call into the middle of a function. When such code is compiled, it consists of a prologue for the main entry point, a prologue to the alternate entry point, and the main body of code for the function. Each prologue sets up the stack for the function's eventual return and then branches or falls through the the main body of code.

Different compilers order the pieces of a Fortran subroutine with alternate entry points differently. The prologue code for each entry point always corresponds to a region of text that has the name of that entry point, but the code for the main body of the routine can receive either of the two entry point names.

On the SPARC platform, the WS 6 compilers generate the alternate entry point prologue first, and the main entry point and body code second, and all metrics from the main body receive the name of the main entry point.
On the x86 platform, the compilers generate the prologue for the main entry point first, then the prologue for the alternate entry point and the main body. Metrics for the main body are associated with the alternate entry point name. The prologues rarely account for any significant amount of time, and the "functions" corresponding to entry points other than the one that is associated with the main body of the subroutine rarely appear.

Inlined Functions

An inlined function is code defined as a function in the source, which compiles to instructions that are inserted at the call site of the function, instead of an actual call. There are two kinds of inlining, both of which will affect the Analyzer:

C++ inline function definitions
Explicit or automatic inlining performed at high optimization (levels O4 and O5)

Both versions are done to improve performance.

To specify C++ inlining, either include the body of a method in the class definition for that method, or tag the method explicitly as being an inline function. The rationale for inlining in this case is that the cost of calling a function is much greater than the work done by the inlined function, so it is better to simply insert the code for the function at the call site, instead of setting up a call. Typically, access functions are defined to be inlined, because they often only require one instruction. Normally, when you compile with the -g option, even the functions defined as being inlined are compiled as normal functions. However, if you compile C++ with -g0, even with no other optimizations, all functions defined as inlined are compiled as such.

Explicit and automatic inlining is performed at high optimization, even when -g is turned on. The rationale for this type of inlining can be to save the cost of a function call, but more often it is to provide more instructions that can be subject to register usage and instruction scheduling optimizations.

Both kinds of inlining have the same effect on the function list. Functions that appear in the source code but have been inlined do not show up in the function list, and metrics that would normally be thought of as inclusive metrics at the call site of the inlined function (representing time spent in the called function) are actually shown as exclusive metrics (representing the instructions of the inlined function, attributed to the call site).

Note – In many cases, inlining can make data difficult to interpret, so you might want to disable inlining when you measure performance.

In some cases, even when a function is inlined, a so-called out-of-line function is left. Sometimes some call sites do call the out-of-line version, but others have the instructions inlined. In such cases, the functions may appear in the function list, as if they were never inlined, but the metrics attributed to them reflect only the out-of-line calls.

Compiler-Generated Body Functions

When a compiler parallelizes a loop in a function, or a region that has parallelization directives, it creates new body functions that do not explicitly appear in the source model of the program. These functions are described in more detail in Parallel Execution and Compiler-Generated Body Functions.

The Analyzer shows these functions as normal functions, and assigns a label to them based on the function from which they were extracted, in addition to the compiler-generated name. Their exclusive and inclusive metrics represent the time spent in the body function. In addition, the function from which the construct was extracted shows inclusive metrics from each of the body functions.

When a function containing parallel loops is inlined, the names of its compiler-generated body functions reflect the function into which it was inlined, not the original function.

Outline Functions

Outline functions can be created during feedback optimization. They represent code that is not normally expected to be executed. Specifically, it is code that is not executed during the "training run" used to generate the feedback. To improve paging and instruction-cache behavior, such code is moved elsewhere in the address space, and is made into a function with a name of the form:
_$1$outlinestring1$namelength$functionname$linenumber$string2
string1 is related to the specific section of outlined code.
namelength is the number of characters in functionname.
functionname is the name of the function from which the construct was extracted.
linenumber is the line number of the construct in the original source.
string2 is related to a compiler internal name.

Outline functions are shown as normal functions, with the appropriate inclusive and exclusive metrics. In addition, the metrics for the outline function are added as inclusive metrics in the function from which the code was outlined.

As with compiler-generated body functions, the Analyzer displays an imputed call from the function from which the outline function is derived.

The <Unknown> Function

Under some circumstances, a PC does not map to a known function. In such cases, the PC is mapped to the special function named <Unknown>.

The following circumstances will show PCs mapping to <Unknown>:

When the PC belongs to the dynamic linker, ld.so.
When the PC corresponds to the PLT (Program Linkage Table) in a load object. This happens whenever a function in one load object calls a function in a different shared object. The actual call transfers first to a three-instruction sequence in the PLT, and then to the real destination.
When the PC corresponds to an address in the data section of the executable or a shared object; normally data is not executable, so a data address never appears in a call stack. Sometimes, however, if code is self-modifying, the program writes these writable instructions into the program data space before executing them. The SPARC v7 version of libc.so has several functions (.mul and .div, for example) in its data section (the code is in the data section so that it can be dynamically rewritten to use machine instructions when the library detects that it is executing on a SPARC v8 or v9 machine).
When the PC is not within any known load object. The most likely cause of this is an unwind failure, where the value recorded as a PC is not a PC at all, but rather some other word. (If the PC is the return register, and it does not seem to be within any known load object, it is ignored, rather than attributed to the <Unknown> function.)

The <Total> Function

The <Total> function is an artificial construct used to represent the program as a whole. All performance metrics, in addition to being attributed to the functions on the call stack, are attributed to the special function <Total>. It appears at the top of the function list and its data can be used to give perspective on the data for other functions.

The Callers-Callees Window

This section discusses the Callers-Callees window, and how the program execution is reflected in that window.

The <Total> Function

The special function <Total> is shown as the nominal caller of _start() in the main thread of execution of any program, and also as the nominal caller of _thread_start() for created threads.

Fortran Alternate Entry Points

Call stacks representing time in Fortran subroutines with alternate entry points usually have PCs in the main body of the subroutine, rather than the prologue, and only the name associated with the main body will appear as a callee. In any case, the collected data does not allow the Analyzer to distinguish between calls to the main entry point and calls to the alternate entry point.

Likewise, all calls from the subroutine are shown as being made from the name associated with the main body of the subroutine.

Inlined Functions

Inlined functions do not show up as callees of the routines into which they have been inlined. Be careful of interpreting data for functions that are inlined in some places, but appear as normal functions elsewhere. Only the metrics on the regular function show up in the Analyzer, and this usage may represent a small fraction of the total metrics for all the instances of that function, inlined and normal.

Compiler-Generated Body Functions

Compiler-generated body functions are directly called by routines in the microtasking library, as described in Parallel Execution and Compiler-Generated Body Functions. However, in order to make the behavior shown in the Analyzer more closely related to the source model of execution, the Analyzer imputes an artificial call from the function from which the loop routine was extracted, at the line from which it was extracted. Thus in the Analyzer, the function from which a body routine was extracted appears as the caller, and inclusive time propagates up to it.

Outline Functions

Outline functions are not really called, but rather are jumped to; similarly they do not return, they jump back. In order to make the behavior more closely match the user's source model, the Analyzer imputes an artificial call from the main routine to its outline portion.

Tail-Call Optimization

Intermediate calls that have been tail-call optimized may not appear explicitly in the Callers-Callees window.

Signals

The Analyzer treats the frames resulting from signal delivery as ordinary frames. The user code at the point at which the signal was delivered is shown as "calling" the system routine sigacthandler(), and it in turn is shown as calling the user's signal-handler. Inclusive metrics from both sigacthandler() and any user signal handler, and any other functions they call, appear as inclusive metrics for the interrupted routine.

Stripped Static Functions

Stripped static functions are shown as called from the correct caller, except when the PC from the static function is a leaf PC that appears after the save instruction in the static function. Without the symbolic information, the Analyzer does not know the save address, and cannot tell whether to use the return register as the caller. It always ignores the return register. Since several functions can be coalesced into a single <static>@0x12345 function, the real caller or callee might not be distinguished from the adjacent routines.

The <Unknown> Function

Callers and callees of the <Unknown> function represent the previous and next PCs in the call stack, and are treated normally.

Recursive Calls

A recursive call is one in which a function calls itself. In the Callers-Callees window, the recursive function is shown as a caller of itself, but not as a callee.

Annotated Source Code and Disassembly Code

The annotated source code and disassembly code features of the Analyzer are useful for which operations within a function are causing poor performance.

Annotated Source Code

Annotated source shows the resource consumption of an application at the source-line level. It is produced by taking the PCs that are recorded in the application's call stack, and mapping each PC to a source line. To produce an annotated source file, the Analyzer first determines all of the functions that are generated in a particular object module (.o file), then scans the data for all PCs from each function. In order to produce annotated source, the Analyzer must be able to find and read the object module or load object to determine the mapping from PCs to source lines, and it must be able to read the source file to produce an annotated copy, which is displayed.

The compilation process goes through many stages, depending on the level of optimization requested, and transformation take place which may confuse the mapping of instructions to source lines. For some optimizations, source line information may be completely lost, while for others, it may be confusing. The compiler relies on various heuristics to track the source line for an instruction, and these heuristics are not infallible.

The four types of metrics that can appear on a line of annotated source code are explained in TABLE 6-1.

TABLE 6-1 Annotated Source-Code Metrics
Metric Significance

(Blank) No PC in the program corresponds to this line of code. This should always happen for comment lines. It also happens for apparent code lines in the following circumstances:
All the instructions from the apparent piece of code have been optimized away.
The code is repeated elsewhere, and the compiler did common subexpression recognition and tagged all the instructions with the lines for the other copy.
The compiler simply mistagged the instruction that really came from that line with an incorrect line number.

0. Some PCs in the program were tagged as derived from this line, but there were no data that referred to those PCs: they were never in a call stack that was sampled statistically or traced for thread-synchronization data. The 0. metric does not mean that the line was not executed, only that it did not show up statistically in a profile and that a thread-synchronization call from that line never had a delay exceeding the threshold.

0.000 At least one PC from this line appeared in the data, but the computed metric value rounded to zero.

1.234 The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.

Compiler Commentary

Various parts of the compiler can incorporate commentary into the executable. Each comment is associated with a specific line of source.

Some of the commentary is inserted by the f95 compiler, reflecting potential performance costs attributable to copy-in and or copy-out required to pass an array section to a subroutine. When code is compiled for parallel analysis, additional commentary reflecting the parallelization state of loops is inserted.

When the annotated source is written, the compiler commentary for any source line appears immediately preceding the source line.

The Unknown Line: <sum of all instructions without line numbers>

Whenever the source line for a PC can not be determined, the metrics for that PC are attributed to a special source line that is inserted at the top of the annotated source file. High metrics on that line indicates that part of the code from the given object module does not have line-mappings. Annotated disassembly can help you determine what the instructions do that do not have mappings.

Common Subexpression Elimination

One very common optimization recognizes that the same expression appears in more than one place, and that performance can be improved by generating the code for that expression in one place. For example, if the same operation appears in both the if and the else branches of a block of code, the compiler can move that operation to just before the if statement. When it does so, it assigns line numbers to the instructions based on one of the previous occurrences of the expression. If the line numbers assigned correspond to one branch of an if structure, and the code actually always takes the other branch, the annotated source might show metrics on lines within the branch that is not taken.

Annotated Disassembly

Annotated disassembly provides an assembly-code listing of the instructions of a function or object module, with the performance metrics associated with each instruction. The more frequently a given instruction or set of instructions appears, the more time is being spent in that function. Annotated disassembly can be displayed in several ways, determined by whether line-number mappings and the source file are available, and whether the object module for the function whose annotated disassembly is being requested is known.

If the object module is not known, the Analyzer disassembles the instructions for just the specified function, and does not show any source lines within the disassembly.
If the object module is known, the disassembly covers all functions within the object module.
If the source file is available, and line number data is recorded, the Analyzer interleaves the source with the disassembly.
If the compiler has inserted any commentary into the object code, it too, is interleaved in the annotated disassembly.

When code is not optimized, line numbers are simple, and the interleaving of source and disassembled instructions appears natural. When optimization takes place, instructions from later lines sometimes appear before those from earlier lines. The Analyzer's algorithm for interleaving is that whenever an instruction is shown as coming from line N, all source lines up to and including line N are written before the instruction. Compiler commentary associated with line N of the source are written immediately before that line.

Each instruction in the disassembly code is annotated with the following information:

A source line number, as reported by the compiler
Its relative address
The hexadecimal representation of the instruction
The assembler ASCII representation of the instruction

Where possible, call addresses are resolved to symbols. Metrics are shown on the lines for instructions, but not on any interleaved source or commentary. Possible metric values are as described for source-code annotations, in TABLE 6-1.

Understanding Performance Costs

You can examine metric values at the function level, the source-line level, or the disassembly instruction level. High metric values at each of these levels reveal different ways in which you can refine your code to make it more efficient

Performance at the Function-Level

Functions have high metric values either because they are being executed many times, or because each execution of the function takes a long time.

If the function is being executed many times, the performance-improvement opportunities lie in reducing the number of calls, or in inlining the function.
If each execution of a function takes a long time, the performance improvement opportunities lie in making the function's algorithms more efficient.

It is usually easiest to identify opportunities for increasing performance efficiency by examining the annotated source of the function.

Performance at the Source Line Level

Lines that have high metric values in the annotated source represent the places in the function where most of the execution time is being spent. Performance improvement opportunities lie in improving or rewriting the algorithm, or increasing the optimization level for the function. Where the algorithm seems efficient and well-optimized, performance improvement opportunities can be identified by looking at the annotated disassembly.

Performance at the Instruction Level

Typically, the burden of generating efficient code at the instruction level is on the compiler. Sometimes, specific leaf PCs appear more frequently because the instruction that they represent is delayed before issue. Sometimes a specific leaf PC appears because the previous instruction takes a long time to execute and is not interruptible, for example when an instruction traps into the kernel.

There are several causes of instruction issue delays, and each represents a potential opportunity for improving performance. Instructions issue delays can be caused by an arithmetic instruction needing a register that is not available because the register contents were set by an earlier instruction that has not yet completed. Two examples of this sort of delay are load instructions that have data cache misses, and floating-point arithmetic instructions that require more than one cycle to execute, such as floating-divide.

Instructions may also seem overrepresented because the instruction cache does not include the memory word that contains the instruction. Instructions may seem underrepresented because they are always issued in the same clock as the previous instruction, so they never represent the next instruction to be executed.

A -> B -> C -> D

A -> B A -> C A -> D

_$1$mf_string1_$namelength$functionname$linenumber$string2

_$1$outlinestring1$namelength$functionname$linenumber$string2

**TABLE 6-1** Annotated Source-Code Metrics
Metric	Significance
(Blank)	No PC in the program corresponds to this line of code. This should always happen for comment lines. It also happens for apparent code lines in the following circumstances: All the instructions from the apparent piece of code have been optimized away. The code is repeated elsewhere, and the compiler did common subexpression recognition and tagged all the instructions with the lines for the other copy. The compiler simply mistagged the instruction that really came from that line with an incorrect line number.
`0.`	Some PCs in the program were tagged as derived from this line, but there were no data that referred to those PCs: they were never in a call stack that was sampled statistically or traced for thread-synchronization data. The `0.` metric does not mean that the line was not executed, only that it did not show up statistically in a profile and that a thread-synchronization call from that line never had a delay exceeding the threshold.
`0.000`	At least one PC from this line appeared in the data, but the computed metric value rounded to zero.
`1.234`	The metrics for all PCs attributed to this line added up to the non-zero numerical value shown.

Library | Contents | Previous | Next | Index

Chapter 6

Advanced Topics: Understanding the Sampling Analyzer and Its Data

Event-Specific Data and What It Means

Clock-Based Profiling

Synchronization Wait Tracing

Hardware-Counter Overflow Profiling

Call Stacks and Program Execution

Single-Threaded Execution and Function Calls

Function Calls Between Shared Objects

Signals

Fast Traps

Kernel Traps

Tail-Call Optimization

Explicit Multithreading

Parallel Execution and Compiler-Generated Body Functions

Unwinding the Stack

Mapping Addresses to Program Structure

The Process Image

Load Objects and Functions

Aliased Functions

Non-Unique Function Names

Static Functions from Stripped Shared Libraries

Fortran Alternate Entry Points

Inlined Functions

Compiler-Generated Body Functions

Outline Functions

The <Unknown> Function

The <Total> Function

The Callers-Callees Window

The <Total> Function

Fortran Alternate Entry Points

Inlined Functions

Compiler-Generated Body Functions

Outline Functions

Tail-Call Optimization

Signals

Stripped Static Functions

The <Unknown> Function

Recursive Calls

Annotated Source Code and Disassembly Code

Annotated Source Code

Compiler Commentary

The Unknown Line: <sum of all instructions without line numbers>

Common Subexpression Elimination

Annotated Disassembly

Understanding Performance Costs

Performance at the Function-Level

Performance at the Source Line Level

Performance at the Instruction Level

The `<Unknown>` Function

The `<Total>` Function

The `<Total>` Function

The `<Unknown>` Function

The Unknown Line: `<sum of all instructions without line numbers>`