A call stack is a series of program counter addresses (PCs) representing instructions from within the program. The first PC, called the leaf PC, is at the bottom of the stack, and is the address of the next instruction to be executed. The next PC is the address of the call to the function containing the leaf PC; the next PC is the address of the call to that function, and so forth, until the top of the stack is reached. Each such address is known as a return address. The process of recording a call stack involves obtaining the return addresses from the program stack and is referred to as unwinding the stack. For information on unwind failures, see Incomplete Stack Unwinds.
The leaf PC in a call stack is used to assign exclusive metrics from the performance data to the function in which that PC is located. Each PC on the stack, including the leaf PC, is used to assign inclusive metrics to the function in which it is located.
Most of the time, the PCs in the recorded call stack correspond in a natural way to functions as they appear in the source code of the program, and the Performance Analyzer’s reported metrics correspond directly to those functions. Sometimes, however, the actual execution of the program does not correspond to a simple intuitive model of how the program would execute, and the Performance Analyzer’s reported metrics might be confusing. See Mapping Addresses to Program Structure for more information about such cases.
When a program is loaded into memory to begin execution, a context is established for it that includes the initial address to be executed, an initial register set, and a stack (a region of memory used for scratch data and for keeping track of how functions call each other). The initial address is always at the beginning of the function _start(), which is built into every executable.
When the program runs, instructions are executed in sequence until a branch instruction is encountered, which among other things could represent a function call or a conditional statement. At the branch point, control is transferred to the address given by the target of the branch, and execution proceeds from there. (Usually, the next instruction after the branch is already committed for execution; this instruction is called the branch delay slot instruction. However, some branch instructions annul the execution of the branch delay slot instruction.)
When the instruction sequence that represents a call is executed, the return address is put into a register, and execution proceeds at the first instruction of the function being called.
In most cases, somewhere in the first few instructions of the called function, a new frame (a region of memory used to store information about the function) is pushed onto the stack, and the return address is put into that frame. The register used for the return address can then be used when the called function itself calls another function. When the function is about to return, it pops its frame from the stack, and control returns to the address from which the function was called.
When a function in one shared object calls a function in another shared object, the execution is more complicated than in a simple call to a function within the program. Each shared object contains a Program Linkage Table, or PLT, which contains entries for every function external to that shared object that is referenced from it. Initially the address for each external function in the PLT is actually an address within ld.so, the dynamic linker. The first time such a function is called, control is transferred to the dynamic linker, which resolves the call to the real external function and patches the PLT address for subsequent calls.
If a profiling event occurs during the execution of one of the three PLT instructions, the PLT PCs are deleted, and exclusive time is attributed to the call instruction. If a profiling event occurs during the first call through a PLT entry, but the leaf PC is not one of the PLT instructions, any PCs that arise from the PLT and code in ld.so are attributed to an artificial function, @plt, which accumulates inclusive time. There is one such artificial function for each shared object. If the program uses the LD_AUDIT interface, the PLT entries might never be patched, and non-leaf PCs from @plt can occur more frequently.
When a signal is sent to a process, various register and stack operations occur that make it look as though the leaf PC at the time of the signal is the return address for a call to a system function, sigacthandler(). sigacthandler() calls the user-specified signal handler just as any function would call another.
The Performance Analyzer treats the frames resulting from signal delivery as ordinary frames. The user code at the point at which the signal was delivered is shown as calling the system function sigacthandler(), and sigacthandler() in turn is shown as calling the user’s signal handler. Inclusive metrics from both sigacthandler() and any user signal handler, and any other functions they call, appear as inclusive metrics for the interrupted function.
The Collector interposes on sigaction() to ensure that its handlers are the primary handlers for the SIGPROF signal when clock data is collected and SIGEMT signal when hardware counter overflow data is collected.
Traps can be issued by an instruction or by the hardware, and are caught by a trap handler. System traps are traps that are initiated from an instruction and trap into the kernel. All system calls are implemented using trap instructions. Some examples of hardware traps are those issued from the floating point unit when it is unable to complete an instruction (such as the fitos instruction for some register-content values on the UltraSPARC III platform), or when the instruction is not implemented in the hardware.
When a trap is issued, the Solaris LWP or Linux kernel enters system mode. On the Solaris OS, the microstate is usually switched from User CPU state to Trap state then to System state. The time spent handling the trap can show as a combination of System CPU time and User CPU time, depending on the point at which the microstate is switched. The time is attributed to the instruction in the user’s code from which the trap was initiated (or to the system call).
For some system calls, it is critical to handle the call as efficiently as possible. The traps generated by these calls are known as fast traps. Among the system functions that generate fast traps are gethrtime and gethrvtime. In these functions, the microstate is not switched because of the overhead involved.

In other circumstances, it is also critical to handle the trap as efficiently as possible. Examples include TLB (translation lookaside buffer) misses and register window spills and fills, for which the microstate is not switched.
In both cases, the time spent is recorded as User CPU time. However, the hardware counters are turned off because the CPU mode has been switched to system mode. The time spent handling these traps can therefore be estimated by taking the difference between User CPU time and Cycles time, preferably recorded in the same experiment.
In one case, the trap handler does switch back to user mode: the misaligned-memory-reference trap for an 8-byte integer that is aligned on a 4-byte boundary in Fortran. A frame for the trap handler appears on the stack, and a call to the handler can appear in the Performance Analyzer, attributed to the integer load or store instruction.
When an instruction traps into the kernel, the instruction following the trapping instruction appears to take a long time, because it cannot start until the kernel has finished executing the trapping instruction.
The compiler can do one particular optimization whenever the last thing a particular function does is to call another function. Rather than generating a new frame, the callee reuses the frame from the caller, and the return address for the callee is copied from the caller. The motivation for this optimization is to reduce the size of the stack, and, on SPARC platforms, to reduce the use of register windows.
Suppose that the call sequence in your program source looks like this:
A -> B -> C -> D
When B and C are tail-call optimized, the call stack looks as if function A calls functions B, C, and D directly.
A -> B
A -> C
A -> D
That is, the call tree is flattened. When code is compiled with the -g option, tail-call optimization takes place only at a compiler optimization level of 4 or higher. When code is compiled without the -g option, tail-call optimization takes place at a compiler optimization level of 2 or higher.
A simple program executes in a single thread, on a single LWP (lightweight process) in the Solaris OS. Multithreaded executables make calls to a thread creation function, to which the target function for execution is passed. When the target exits, the thread is destroyed.
The Solaris OS supports two thread implementations: Solaris threads and POSIX threads (Pthreads). Beginning with the Solaris 10 OS, both thread implementations are included in libc.so.
With Solaris threads, newly-created threads begin execution at a function called _thread_start(), which calls the function passed in the thread creation call. For any call stack involving the target as executed by this thread, the top of the stack is _thread_start(), and there is no connection to the caller of the thread creation function. Inclusive metrics associated with the created thread therefore only propagate up as far as _thread_start() and the <Total> function. In addition to creating the threads, the Solaris threads implementation also creates LWPs on Solaris to execute the threads. Each thread is bound to a specific LWP.
Pthreads is available in the Solaris 10 OS as well as in the Linux OS for explicit multithreading.
In both environments, to create a new thread, the application calls the Pthread API function pthread_create(), passing a pointer to an application-defined start routine as one of the function arguments.
On the Solaris OS, when a new pthread starts execution, it calls the _lwp_start() function. On the Solaris 10 OS, _lwp_start() calls an intermediate function _thr_setup(), which then calls the application-defined start routine that was specified in pthread_create().
On the Linux OS, when the new pthread starts execution, it runs a Linux-specific system function, clone(), which calls another internal initialization function, pthread_start_thread(), which in turn calls the application-defined start routine that was specified in pthread_create(). The Linux metrics-gathering functions available to the Collector are thread-specific. Therefore, when the collect utility runs, it interposes a metrics-gathering function, named collector_root(), between pthread_start_thread() and the application-defined thread start routine.
To the typical developer, a Java technology-based application runs just like any other program. The application begins at a main entry point, typically named class.main, which may call other methods, just as a C or C++ application does.
To the operating system, an application written in the Java programming language (pure or mixed with C/C++) runs as a process instantiating the JVM software. The JVM software is compiled from C++ sources and starts execution at _start, which calls main, and so forth. It reads bytecode from .class and/or .jar files, and performs the operations specified in that program. Among the operations that can be specified is the dynamic loading of a native shared object, and calls into various functions or methods contained within that object.
The JVM software does a number of things that are typically not done by applications written in traditional languages. At startup, it creates a number of regions of dynamically-generated code in its data space. One of these regions is the actual interpreter code used to process the application’s bytecode methods.
During execution of a Java technology-based application, most methods are interpreted by the JVM software; these methods are referred to as interpreted methods. The Java HotSpot virtual machine monitors performance as it interprets the bytecode to detect methods that are frequently executed. Methods that are repeatedly executed might then be compiled by the Java HotSpot virtual machine to generate machine code for those methods. The resulting methods are referred to as compiled methods. The virtual machine executes the more efficient compiled methods thereafter, rather than interpret the original bytecode for the methods. Compiled methods are loaded into the data space of the application, and may be unloaded at some later point in time. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.
Code written in the Java programming language might also call directly into native-compiled code, either C, C++, or Fortran; the targets of such calls are referred to as native methods.
Applications written in the Java programming language are inherently multithreaded, and have one JVM software thread for each thread in the user’s program. Java applications also have several housekeeping threads used for signal handling, memory management, and Java HotSpot virtual machine compilation.
Data collection is implemented with various methods in the JVMTI in J2SE.
The performance tools collect their data by recording events in the life of each Solaris LWP or Linux thread, along with the call stack at the time of the event. At any point in the execution of any application, the call stack represents where the program is in its execution, and how it got there. One important way that mixed-model Java applications differ from traditional C, C++, and Fortran applications is that at any instant during the run of the target there are two call stacks that are meaningful: a Java call stack, and a machine call stack. Both call stacks are recorded during profiling, and are reconciled during analysis.
Clock-based profiling and hardware counter overflow profiling for Java programs work just as for C, C++, and Fortran programs, except that both Java call stacks and machine call stacks are collected.
There are three representations for displaying performance data for applications written in the Java programming language: the Java representation, the Expert-Java representation, and the Machine representation. The Java representation is shown by default where the data supports it. The following section summarizes the main differences between these three representations.
The Java representation shows compiled and interpreted Java methods by name, and shows native methods in their natural form. During execution, there might be many instances of a particular Java method executed: the interpreted version and, perhaps, one or more compiled versions. In the Java representation all methods are shown aggregated as a single method. This representation is selected in the Analyzer by default.
A PC for a Java method in the Java representation corresponds to the method-id and a bytecode index into that method; a PC for a native function corresponds to a machine PC. The call stack for a Java thread may have a mixture of Java PCs and machine PCs. It does not have any frames corresponding to Java housekeeping code, which does not have a Java representation. Under some circumstances, the JVM software cannot unwind the Java stack, and a single frame with the special function, <no Java callstack recorded>, is returned. Typically, it amounts to no more than 5-10% of the total time.
The function list in the Java representation shows metrics against the Java methods and any native methods called. The caller-callee panel shows the calling relationships in the Java representation.
Source for a Java method corresponds to the source code in the .java file from which it was compiled, with metrics on each source line. The disassembly of any Java method shows the bytecode generated for it, with metrics against each bytecode, and interleaved Java source, where available.
The Timeline in the Java representation shows only Java threads. The call stack for each thread is shown with its Java methods.
Data space profiling in the Java representation is not currently supported.
The Expert-Java representation is similar to the Java representation, except that some details of the JVM internals that are suppressed in the Java representation are exposed in the Expert-Java representation. With the Expert-Java representation, the Timeline shows all threads; the call stack for housekeeping threads is a native call stack.
The Machine representation shows functions from the JVM software itself, rather than from the application being interpreted by the JVM software. It also shows all compiled and native methods. The machine representation looks the same as that of applications written in traditional languages. The call stack shows JVM frames, native frames, and compiled-method frames. Some of the JVM frames represent transition code between interpreted Java, compiled Java, and native code.
Source from compiled methods is shown against the Java source; the data represents the specific instance of the compiled method selected. Disassembly for compiled methods shows the generated machine assembler code, not the Java bytecode. Caller-callee relationships show all overhead frames, and all frames representing the transitions between interpreted, compiled, and native methods.
The Timeline in the machine representation shows bars for all threads, LWPs, or CPUs, and the call stack in each is the machine-representation call stack.
The actual execution model of OpenMP applications is described in the OpenMP specifications (See, for example, OpenMP Application Program Interface, Version 3.0, section 1.3.) The specification, however, does not describe some implementation details that may be important to users, and the actual implementation from Oracle is such that directly recorded profiling information does not easily allow the user to understand how the threads interact.
As any single-threaded program runs, its call stack shows its current location, and a trace of how it got there, starting from the beginning instructions in a routine called _start, which calls main, which then proceeds and calls various subroutines within the program. When a subroutine contains a loop, the program executes the code inside the loop repeatedly until the loop exit criterion is reached. The execution then proceeds to the next sequence of code, and so forth.
When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. An intuitive model of the parallelized program has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for different chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.
The actual behavior of the parallelized program is not so straightforward. When the compiler generates code for a parallel region or loop (or any other OpenMP construct), the code inside it is extracted and made into an independent function, called an mfunction in the Oracle implementation. (It may also be referred to as an outlined function, or a loop-body-function.) The name of the mfunction encodes the OpenMP construct type, the name of the function from which it was extracted, and the line number of the source line at which the construct appears. The names of these functions are shown in the Analyzer's Expert mode and Machine mode in the following form, where the name in brackets is the actual symbol-table name of the function:
bardo_ -- OMP parallel region from line 9 [_$p1C9.bardo_]
atomsum_ -- MP doall from line 7 [_$d1A7.atomsum_]
There are other forms of such functions, derived from other source constructs, for which the OMP parallel region in the name is replaced by MP construct, MP doall, or OMP sections. In the following discussion, all of these are referred to generically as parallel regions.
Each thread executing the code within the parallel loop can invoke its mfunction multiple times, with each invocation doing a chunk of the work within the loop. When all the chunks of work are complete, each thread calls synchronization or reduction routines in the library; the master thread then continues, while the slave threads become idle, waiting for the master thread to enter the next parallel region. All of the scheduling and synchronization are handled by calls to the OpenMP runtime.
During its execution, the code within the parallel region might be doing a chunk of the work, or it might be synchronizing with other threads or picking up additional chunks of work to do. It might also call other functions, which may in turn call still others. A slave thread (or the master thread) executing within a parallel region, might itself, or from a function it calls, act as a master thread, and enter its own parallel region, giving rise to nested parallelism.
The Analyzer collects data based on statistical sampling of call stacks, and aggregates its data across all threads and shows metrics of performance based on the type of data collected, against functions, callers and callees, source lines, and instructions. The Analyzer presents information on the performance of OpenMP programs in one of three modes: User mode, Expert mode, and Machine mode.
For more detailed information about data collection for OpenMP programs, see An OpenMP Runtime API for Profiling at the OpenMP user community web site.
The User mode presentation of the profile data attempts to present the information as if the program really executed according to the intuitive model described in Overview of OpenMP Software Execution. The actual data, shown in the Machine mode, captures the implementation details of the runtime library, libmtsk.so, which does not correspond to the model. The Expert mode shows a mix of data altered to fit the model, and the actual data.
In User mode, the presentation of profile data is altered to match the model better, and differs from the recorded data and Machine mode presentation in three ways:
Artificial functions are constructed representing the state of each thread from the point of view of the OpenMP runtime library.
Call stacks are manipulated to report data corresponding to the model of how the code runs, as described above.
Two additional metrics of performance are constructed for clock-based profiling experiments, corresponding to time spent doing useful work and time spent waiting in the OpenMP runtime. The metrics are OpenMP Work and OpenMP Wait.
For OpenMP 3.0 programs, a third metric OpenMP Overhead is constructed.
Artificial functions, with names of the form <OMP-*>, are defined for these runtime states.
When a thread is in an OpenMP runtime state corresponding to one of the artificial functions, the artificial function is added as the leaf function on the stack. When a thread’s actual leaf function is anywhere in the OpenMP runtime, it is replaced by <OMP-overhead> as the leaf function. Otherwise, all PCs from the OpenMP runtime are omitted from the user-mode stack.
For OpenMP 3.0 programs, the <OMP-overhead> artificial function is not used. The artificial function is replaced by an OpenMP Overhead metric.
For OpenMP experiments, User mode shows reconstructed call stacks similar to those obtained when the program is compiled without OpenMP. The goal is to present profile data in a manner that matches the intuitive understanding of the program rather than showing all the details of the actual processing. The call stacks of the master thread and slave threads are reconciled and the artificial <OMP-*> functions are added to the call stack when the OpenMP runtime library is performing certain operations.
When processing a clock-profile event for an OpenMP program, two metrics corresponding to the time spent in each of two states in the OpenMP system are shown: OpenMP Work and OpenMP Wait.
Time is accumulated in OpenMP Work whenever a thread is executing from the user code, whether in serial or parallel. Time is accumulated in OpenMP Wait whenever a thread is waiting for something before it can proceed, whether the wait is a busy-wait (spin-wait), or sleeping. The sum of these two metrics matches the Total LWP Time metric in the clock profiles.
The OpenMP Wait and OpenMP Work metrics are shown in User mode, Expert Mode, and Machine mode.
When you look at OpenMP experiments in Expert view mode you see the artificial functions of the form <OMP-*> when the OpenMP runtime is performing certain operations, similar to User view mode. However, Expert view mode separately shows compiler-generated mfunctions that represent parallelized loops, tasks, and so on. In User mode, these compiler-generated mfunctions are aggregated with user functions.
Machine mode shows native call stacks for all threads and outline functions generated by the compiler.
The real call stacks of the program during various phases of execution are quite different from the ones portrayed above in the intuitive model. The Machine mode shows the call stacks as measured, with no transformations done, and no artificial functions constructed. The clock-profiling metrics are, however, still shown.
In each of the call stacks below, libmtsk represents one or more frames in the call stack within the OpenMP runtime library. The details of which functions appear and in which order change from release to release of OpenMP, as does the internal implementation of code for a barrier, or to perform a reduction.
Before the first parallel region
Before the first parallel region is entered, there is only the one thread, the master thread. The call stack is identical to that in User mode and Expert mode.
During execution in a parallel region
In Machine mode, the slave threads are shown as starting in _lwp_start, rather than in _start where the master starts. (In some versions of the thread library, that function may appear as _thread_start.) The calls to foo-OMP... represent the mfunctions that are generated for parallelized regions.
At the point at which all threads are at a barrier
Unlike when the threads are executing in the parallel region, when the threads are waiting at a barrier there are no frames from the OpenMP runtime between foo and the parallel region code, foo-OMP.... The reason is that the real execution does not include the OMP parallel region function, but the OpenMP runtime manipulates registers so that the stack unwind shows a call from the last-executed parallel region function to the runtime barrier code. Without it, there would be no way to determine which parallel region is related to the barrier call in Machine mode.
After leaving the parallel region
In the slave threads, no user frames are on the call stack.
When in a nested parallel region
Stack unwind is defined in Call Stacks and Program Execution.
Stack unwind might fail for a number of reasons:
If the stack has been corrupted by the user code; if so, the program might core dump, or the data collection code might core dump, depending on exactly how the stack was corrupted.
If the user code does not follow the standard ABI conventions for function calls. In particular, on the SPARC platform, if the return register, %o7, is altered before a save instruction is executed.
On any platform, hand-written assembler code might violate the conventions.
If the leaf PC is in a function after the callee’s frame is popped from the stack, but before the function returns.
If the call stack contains more than about 250 frames, the Collector does not have the space to completely unwind the call stack. In this case, PCs for functions from _start to some point in the call stack are not recorded in the experiment. The artificial function <Truncated-stack> is shown as called from <Total> to tally the topmost frames recorded.
If the Collector fails to unwind the frames of optimized functions on x86 platforms.
If you generate intermediate files using the -E or -P compiler options, the Analyzer uses the intermediate file for annotated source code, not the original source file. The #line directives generated with -E can cause problems in the assignment of metrics to source lines.
The following line appears in annotated source if there are instructions from a function that do not have line numbers referring to the source file that was compiled to generate the function:
function_name -- <instructions without line numbers>
Line numbers can be absent under the following circumstances:
You compiled without specifying the -g option.
The debugging information was stripped after compilation, or the executables or object files that contain the information were moved, deleted, or subsequently modified.
The function contains code that was generated from #include files rather than from the original source file.
At high optimization, if code was inlined from a function in a different file.
The source file has #line directives referring to some other file; compiling with the -E option, and then compiling the resulting .i file is one way in which this happens. It may also happen when you compile with the -P flag.
The object file cannot be found to read line number information.
The compiler used generates incomplete line number tables.