User Mode Call Stacks (Sun Studio 12: Performance Analyzer)

User Mode Call Stacks

The easiest way to understand this model is to look at the call stacks of an OpenMP program at various points in its execution. This section considers a simple program that has a main program that calls one subroutine, foo. That subroutine has a single parallel loop, in which the threads do work, contend for, acquire, and release a lock, and enter and leave a critical section. An additional set of call stacks is shown, reflecting the state when one slave thread has called another function, bar, which enters a nested parallel region.
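The kind of program described above can be sketched in C as follows. This is only an illustrative sketch, not the program used to produce the call stacks in this section: the helper `work`, the variables `shared_sum` and `shared_count`, the chunk size, and the loop body are all assumptions. The lock calls are guarded so that the code also builds serially when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* lock API; used only when OpenMP is enabled */
#endif

static double shared_sum;    /* updated while holding the lock */
static long   shared_count;  /* updated inside the critical section */

/* Hypothetical workload standing in for the "useful work" of one iteration. */
static double work(int i) { return 0.5 * (double)i; }

/* foo holds the single parallel loop; a main() would simply call foo(n). */
double foo(int n)
{
    assert(n >= 0);
    shared_sum = 0.0;
    shared_count = 0;
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);
#endif

    /* The compiler extracts this loop into a parallel-region function,
       which appears in call stacks as foo-OMP... */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        double r = work(i);          /* useful work in the parallel region */

#ifdef _OPENMP
        omp_set_lock(&lock);         /* threads blocked here are waiting for the lock */
#endif
        shared_sum += r;
#ifdef _OPENMP
        omp_unset_lock(&lock);
#endif

        #pragma omp critical
        {
            shared_count++;          /* one thread at a time in the critical section */
        }
    }   /* implicit barrier at the end of the parallel loop */

#ifdef _OPENMP
    omp_destroy_lock(&lock);
#endif
    return shared_sum;
}
```

With dynamic scheduling, a thread that finishes one chunk of 16 iterations returns to the OpenMP runtime to obtain its next chunk, which corresponds to the "between chunks of work" state shown later in this section.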

In this presentation, all the inclusive time spent in a parallel region is included in the inclusive time of the function from which it was extracted, including time spent in the OpenMP runtime, and that inclusive time is propagated all the way up to main and _start.
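The attribution rule can be illustrated with a small calculation over call stack samples. The sketch below is not the Analyzer's implementation; the `sample` structure, the function names, and the 10 ms sample values are assumptions chosen for illustration. Each sample's time is credited inclusively to every function on its stack, so time spent in a parallel region or in the OpenMP runtime also counts toward foo, main, and _start.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 8

/* One profile sample: frames listed leaf first, as in the Timeline tab. */
struct sample {
    const char *frames[MAX_DEPTH];
    int depth;
    long ms;   /* time attributed to this sample, in milliseconds */
};

/* Hypothetical samples: useful work, waiting at the barrier, serial code. */
const struct sample demo_samples[3] = {
    { { "foo-OMP...", "foo", "main", "_start" }, 4, 10 },
    { { "<OMP-implicit_barrier>", "foo-OMP...", "foo", "main", "_start" }, 5, 10 },
    { { "foo", "main", "_start" }, 3, 10 },
};

/* Inclusive time: total of all samples whose stack contains func anywhere. */
long inclusive_ms(const struct sample *s, int n, const char *func)
{
    long total = 0;
    assert(s != 0 && func != 0);
    for (int i = 0; i < n; i++) {
        for (int d = 0; d < s[i].depth; d++) {
            if (strcmp(s[i].frames[d], func) == 0) {
                total += s[i].ms;
                break;   /* count each sample once, even if func recurs */
            }
        }
    }
    return total;
}

/* Exclusive time: total of the samples in which func is the leaf frame. */
long exclusive_ms(const struct sample *s, int n, const char *func)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        if (s[i].depth > 0 && strcmp(s[i].frames[0], func) == 0)
            total += s[i].ms;
    return total;
}
```

For these three samples, foo, main, and _start each receive all 30 ms of inclusive time, while only 10 ms is exclusive to foo.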

The call stacks that represent the behavior in this model appear as shown in the subsections that follow. The actual names of the parallel region functions are of the following form, as described above:

foo -- OMP parallel region from line 9 [_$p1C9.foo]
bar -- OMP parallel region from line 5 [_$p1C5.bar]

For clarity, the following shortened forms are used in the descriptions:

foo-OMP...
bar-OMP...

In the descriptions, call stacks from all threads are shown at an instant during execution of the program. The call stack for each thread is shown as a stack of frames, matching the data from selecting an individual profile event in the Analyzer Timeline tab for a single thread, with the leaf PC at the top. In the Timeline tab, each frame is shown with a PC offset, which is omitted below. The stacks from all the threads are shown in a horizontal array, while in the Analyzer Timeline tab, the stacks for other threads would appear in profile bars stacked vertically. Furthermore, in the representation presented, the stacks for all the threads are shown as if they were captured at exactly the same instant, while in a real experiment, the stacks are captured independently in each thread, and may be skewed relative to each other.

The call stacks shown represent the data as it is presented with a view mode of User in the Analyzer or in the er_print utility.

Before the first parallel region

Before the first parallel region is entered, there is only the one thread, the master thread.

Master
`foo`
`main`
`_start`

Upon entering the first parallel region

At this point, the library has created the slave threads, and all of the threads, master and slaves, are about to start processing their chunks of work. All threads are shown as having called into the code for the parallel region, foo-OMP..., from foo at the line on which the OpenMP directive for the construct appears, or from the line containing the loop statement that was autoparallelized. The code for the parallel region in each thread is calling into the OpenMP support library, shown as the <OMP-overhead> function, from the first instruction in the parallel region.

Master	Slave 1	Slave 2	Slave 3
`<OMP-overhead>`	`<OMP-overhead>`	`<OMP-overhead>`	`<OMP-overhead>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

The window in which <OMP-overhead> might appear is quite small, so that function might not appear in any particular experiment.

While executing within a parallel region

All four of the threads are doing useful work in the parallel region.

Master	Slave 1	Slave 2	Slave 3
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

While executing within a parallel region between chunks of work

All four of the threads are doing useful work, but one has finished one chunk of work, and is obtaining its next chunk.

Master	Slave 1	Slave 2	Slave 3
	`<OMP-overhead>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

While executing in a critical section within the parallel region

All four of the threads are executing, each within the parallel region. One of them is in the critical section, while one of the others is running before reaching the critical section (or after finishing it). The remaining two are waiting to enter the critical section themselves.

Master	Slave 1	Slave 2	Slave 3
`<OMP-critical_section_wait>`			`<OMP-critical_section_wait>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

The data collected does not distinguish between the call stack of the thread that is executing in the critical section and that of the thread that has not yet reached, or has already passed, the critical section.

While executing around a lock within the parallel region

A section of code around a lock is completely analogous to a critical section. All four of the threads are executing within the parallel region. One thread is executing while holding the lock, one is executing before acquiring the lock (or after acquiring and releasing it), and the other two threads are waiting for the lock.

Master	Slave 1	Slave 2	Slave 3
`<OMP-lock_wait>`			`<OMP-lock_wait>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

As in the critical section example, the data collected does not distinguish between the call stack of a thread that is executing while holding the lock and that of a thread that is executing before it acquires the lock or after it releases it.

Near the end of a parallel region

At this point, three of the threads have finished all their chunks of work, but one of them is still working. The OpenMP construct in this case implicitly specified a barrier; if the user code had explicitly specified the barrier, the <OMP-implicit_barrier> function would be replaced by <OMP-explicit_barrier>.

Master	Slave 1	Slave 2	Slave 3
`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`		`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`
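The difference between the two kinds of barrier can be seen in source form. In the minimal sketch below, the function name `fill_then_double` and the loop bodies are assumptions; the first worksharing loop ends with the barrier that OpenMP implies, while the second suppresses it with `nowait` and is followed by a barrier written explicitly by the user.

```c
#include <assert.h>

/* Fill a[0..n-1] with i, then double every element, in one parallel region. */
void fill_then_double(double *a, int n)
{
    assert(n >= 0);
    #pragma omp parallel
    {
        /* Worksharing loop: threads that finish early wait at the
           barrier implied at the end of the construct. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = (double)i;

        /* nowait removes the implied barrier from this loop ... */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;

        /* ... so the synchronization point is the explicit barrier instead. */
        #pragma omp barrier
    }
}
```

Threads waiting after the first loop would be expected to appear under the implicit barrier function, and threads waiting at the `barrier` directive under the explicit one.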

Near the end of a parallel region, with one or more reduction variables

At this point, three of the threads have finished all their chunks of work. One of them is performing the reduction computations, and the other two have finished their parts of the reduction and are waiting at the barrier. The fourth thread is still working.

Master	Slave 1	Slave 2	Slave 3
`<OMP-reduction>`	`<OMP-implicit_barrier>`		`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

While one thread is shown in the <OMP-reduction> function, the actual time spent in doing the reduction is usually quite small, and is rarely captured in a call stack sample.
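A loop with a reduction variable might look like the following minimal sketch. The function name `dot` and its arguments are assumptions for illustration. Each thread accumulates into a private copy of `sum`, and the private copies are combined near the end of the parallel region, which is the short step described above.

```c
#include <assert.h>

/* Dot product of x and y, with sum as the reduction variable. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    assert(n >= 0);
    /* Each thread gets a private sum; the copies are added together
       when the threads finish their chunks of work. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```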

At the end of a parallel region

At this point, all threads have finished all chunks of work within the parallel region, and have reached the barrier.

Master	Slave 1	Slave 2	Slave 3
`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`	`<OMP-implicit_barrier>`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`

Since all the threads have reached the barrier, they may all proceed, and it is unlikely that an experiment would ever find all the threads in this state.

After leaving the parallel region

At this point, all the slave threads are waiting for entry into the next parallel region, either spinning or sleeping, depending on the various environment variables set by the user. The program is in serial execution.

Master	Slave 1	Slave 2	Slave 3
`foo`
`main`
`_start`	`<OMP-idle>`	`<OMP-idle>`	`<OMP-idle>`

While executing in a nested parallel region

All four of the threads are working, each within the outer parallel region. One of the slave threads has called another function, bar, which has entered a nested parallel region, and an additional slave thread has been created to work with it.

Master	Slave 1	Slave 2	Slave 3	Slave 4
	`bar-OMP...`			`bar-OMP...`
	`bar`			`bar`
`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`	`foo-OMP...`
`foo`	`foo`	`foo`	`foo`	`foo`
`main`	`main`	`main`	`main`	`main`
`_start`	`_start`	`_start`	`_start`	`_start`
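A nested region of this kind can arise from code shaped like the sketch below. It is a variant of foo written only for illustration: the loop bodies, the choice of iteration 1 as the one that calls bar, and the `num_threads(2)` clause are assumptions. Which thread executes that iteration depends on the schedule, and the inner region runs with more than one thread only if nested parallelism is enabled in the runtime (for example through the standard `OMP_NESTED` environment variable); otherwise it is executed by the calling thread alone.

```c
#include <assert.h>

/* bar contains the nested parallel region (bar-OMP... in the call stacks). */
long bar(int m)
{
    long total = 0;
    assert(m >= 0);
    #pragma omp parallel for num_threads(2) reduction(+:total)
    for (int j = 0; j < m; j++)
        total += j;
    return total;
}

/* Outer parallel region; exactly one iteration calls bar. */
long foo(int n)
{
    long result = 0;
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++) {
        if (i == 1)
            result += bar(100);   /* frames: bar-OMP... / bar / foo-OMP... / foo */
        else
            result += 1;
    }
    return result;
}
```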