Overview of OpenMP Software Execution (Sun Studio 12 Update 1: Performance Analyzer)

Sun Studio 12 Update 1: Performance Analyzer

Overview of OpenMP Software Execution

The actual execution model of OpenMP Applications is described in the OpenMP specifications (See, for example, OpenMP Application Program Interface, Version 3.0, section 1.3.) The specification, however, does not describe some implementation details that may be important to users, and the actual implementation at Sun Microsystems is such that directly recorded profiling information does not easily allow the user to understand how the threads interact.

As any single-threaded program runs, its call stack shows its current location, and a trace of how it got there, starting from the beginning instructions in a routine called _start, which calls main, which then proceeds and calls various subroutines within the program. When a subroutine contains a loop, the program executes the code inside the loop repeatedly until the loop exit criterion is reached. The execution then proceeds to the next sequence of code, and so forth.

When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. An intuitive model of that behavior has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for different chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.

When the compiler generates code for a parallel region or loop (or any other OpenMP construct), the code inside it is extracted and made into an independent function, called an mfunction. (It may also be referred to as an outlined function, or a loop-body-function.) The name of the function encodes the OpenMP construct type, the name of the function from which it was extracted, and the line number of the source line at which the construct appears. The names of these functions are shown in the Analyzer in the following form, where the name in brackets is the actual symbol-table name of the function:

bardo_ -- OMP parallel region from line 9 [_$p1C9.bardo_]
atomsum_ -- MP doall from line 7 [_$d1A7.atomsum_]

There are other forms of such functions, derived from other source constructs, for which the OMP parallel region in the name is replaced by MP construct, MP doall, or OMP sections. In the following discussion, all of these are referred to generically as “parallel regions”.

Each thread executing the code within the parallel loop can invoke its mfunction multiple times, with each invocation doing a chunk of the work within the loop. When all the chunks of work are complete, each thread calls synchronization or reduction routines in the library; the master thread then continues, while the slave threads become idle, waiting for the master thread to enter the next parallel region. All of the scheduling and synchronization are handled by calls to the OpenMP runtime.

During its execution, the code within the parallel region might be doing a chunk of the work, or it might be synchronizing with other threads or picking up additional chunks of work to do. It might also call other functions, which may in turn call still others. A slave thread (or the master thread) executing within a parallel region, might itself, or from a function it calls, act as a master thread, and enter its own parallel region, giving rise to nested parallelism.

The Analyzer collects data based on statistical sampling of call stacks, and aggregates its data across all threads and shows metrics of performance based on the type of data collected, against functions, callers and callees, source lines, and instructions. It presents information on the performance of OpenMP programs in one of three modes, User mode , Expert mode, and Machine mode.

For more detailed information, see An OpenMP Runtime API for Profiling at the OpenMP user community web site.

User Mode Display of OpenMP Profile Data

The User mode presentation of the profile data attempts to present the information as if the program really executed according to the model described in Overview of OpenMP Software Execution. The actual data captures the implementation details of the runtime library, libmtsk.so , which does not correspond to the model. In User mode, the presentation of profile data is altered to match the model better, and differs from the recorded data and Machine mode presentation in three ways:

Artificial functions are constructed representing the state of each thread from the point of view of the OpenMP runtime library.
Call stacks are manipulated to report data corresponding to the model of how the code runs, as described above.
Two additional metrics of performance are constructed for clock-based profiling experiments, corresponding to time spent doing useful work and time spent waiting in the OpenMP runtime.

Artificial Functions

Artificial functions are constructed and put onto the User mode call stacks reflecting events in which a thread was in some state within the OpenMP runtime library.

The following artificial functions are defined:

`<OMP-overhead>`	Executing in the OpenMP library
`<OMP-idle>`	Slave thread, waiting for work
`<OMP-reduction>`	Thread performing a reduction operation
`<OMP-implicit_barrier>`	Thread waiting at an implicit barrier
`<OMP-explicit_barrier>`	Thread waiting at an explicit barrier
`<OMP-lock_wait>`	Thread waiting for a lock
`<OMP-critical_section_wait>`	Thread waiting to enter a critical section
`<OMP-ordered_section_wait>`	Thread waiting for its turn to enter an ordered section

When a thread is in an OpenMP runtime state corresponding to one of those functions, the corresponding function is added as the leaf function on the stack. When a thread’s leaf function is anywhere in the OpenMP runtime, it is replaced by <OMP-overhead> as the leaf function. Otherwise, all PCs from the OpenMP runtime are omitted from the user-mode stack.

User Mode Call Stacks

For OpenMP experiments, user mode shows reconstructed call stacks similar to those obtained when the program is compiled without OpenMP.

OpenMP Metrics

When processing a clock-profile event for an OpenMP program, two metrics corresponding to the time spent in each of two states in the OpenMP system are shown: OpenMP Work and OpenMP Wait.

Time is accumulated in OpenMP Work whenever a thread is executing from the user code, whether in serial or parallel. Time is accumulated in OpenMP Wait whenever a thread is waiting for something before it can proceed, whether the wait is a busy-wait (spin-wait), or sleeping. The sum of these two metrics matches the Total LWP Time metric in the clock profiles.

Machine Presentation of OpenMP Profiling Data

The real call stacks of the program during various phases of execution are quite different from the ones portrayed above in the intuitive model. The Machine mode of presentation shows the call stacks as measured, with no transformations done, and no artificial functions constructed. The clock-profiling metrics are, however, still shown.

In each of the call stacks below, libmtsk represents one or more frames in the call stack within the OpenMP runtime library. The details of which functions appear and in which order change from release to release, as does the internal implementation of code for a barrier, or to perform a reduction.

Before the first parallel region

Before the first parallel region is entered, there is only the one thread, the master thread. The call stack is identical to that in User mode.

Master

foo

main

_start

Master
`foo`
`main`
`_start`

During execution in a parallel region

Master	Slave 1	Slave 2	Slave 3
`foo-OMP...`
`libmtsk`
`foo`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`main`	`libmtsk`	`libmtsk`	`libmtsk`
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

In Machine mode, the slave threads are shown as starting in _lwp_start , rather than in _start where the master starts. (In some versions of the thread library, that function may appear as _thread_start .)

At the point at which all threads are at a barrier

Master	Slave 1	Slave 2	Slave 3
`libmtsk`
`foo-OMP...`
`foo`	`libmtsk`	`libmtsk`	`libmtsk`
`main`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

Unlike when the threads are executing in the parallel region, when the threads are waiting at a barrier there are no frames from the OpenMP runtime between foo and the parallel region code, foo-OMP.... The reason is that the real execution does not include the OMP parallel region function, but the OpenMP runtime manipulates registers so that the stack unwind shows a call from the last-executed parallel region function to the runtime barrier code. Without it, there would be no way to determine which parallel region is related to the barrier call in Machine mode.

After leaving the parallel region

Master

Slave 1

Slave 2

Slave 3

foo

main

libmtsk

libmtsk

libmtsk

_start

_lwp_start

_lwp_start

_lwp_start

In the slave threads, no user frames are on the call stack.

Master	Slave 1	Slave 2	Slave 3
`foo`
`main`	`libmtsk`	`libmtsk`	`libmtsk`
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`

When in a nested parallel region

Master	Slave 1	Slave 2	Slave 3	Slave 4
	`bar-OMP...`
`foo-OMP...`	`libmtsk`
`libmtsk`	`bar`
`foo`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`bar-OMP...`
`main`	`libmtsk`	`libmtsk`	`libmtsk`	`libmtsk`
`_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`	`_lwp_start`