Sun Studio 12 Update 1: Performance Analyzer

Machine Presentation of OpenMP Profiling Data

The real call stacks of the program during various phases of execution are quite different from the ones portrayed above in the intuitive model. The Machine mode of presentation shows the call stacks as measured, with no transformations done, and no artificial functions constructed. The clock-profiling metrics are, however, still shown.

In each of the call stacks below, libmtsk represents one or more frames in the call stack within the OpenMP runtime library. The details of which functions appear, and in which order, change from release to release, as does the internal implementation of barriers and reductions.

  1. Before the first parallel region

    Before the first parallel region is entered, only one thread exists, the master thread. The call stack is identical to that in User mode.

    Master
    ----------
    foo
    main
    _start

  2. During execution in a parallel region

    Master      Slave 1     Slave 2     Slave 3
    ----------  ----------  ----------  ----------
    foo-OMP...
    libmtsk
    foo         foo-OMP...  foo-OMP...  foo-OMP...
    main        libmtsk     libmtsk     libmtsk
    _start      _lwp_start  _lwp_start  _lwp_start

    In Machine mode, the slave threads are shown as starting in _lwp_start, rather than in _start where the master starts. (In some versions of the thread library, that function may appear as _thread_start.)

  3. At the point at which all threads are at a barrier

    Master      Slave 1     Slave 2     Slave 3
    ----------  ----------  ----------  ----------
    libmtsk
    foo-OMP...
    foo         libmtsk     libmtsk     libmtsk
    main        foo-OMP...  foo-OMP...  foo-OMP...
    _start      _lwp_start  _lwp_start  _lwp_start

    When the threads are waiting at a barrier, unlike when they are executing in the parallel region, there are no frames from the OpenMP runtime between foo and the parallel region code, foo-OMP.... The reason is that the real execution stack at the barrier does not include the OMP parallel region function; instead, the OpenMP runtime manipulates registers so that the stack unwind shows a call from the last-executed parallel region function to the runtime barrier code. Without this manipulation, there would be no way to determine in Machine mode which parallel region the barrier call belongs to.

  4. After leaving the parallel region

    Master      Slave 1     Slave 2     Slave 3
    ----------  ----------  ----------  ----------
    foo
    main        libmtsk     libmtsk     libmtsk
    _start      _lwp_start  _lwp_start  _lwp_start

    In the slave threads, no user frames are on the call stack.

  5. When in a nested parallel region

    Master      Slave 1     Slave 2     Slave 3     Slave 4
    ----------  ----------  ----------  ----------  ----------
                bar-OMP...
    foo-OMP...  libmtsk
    libmtsk     bar
    foo         foo-OMP...  foo-OMP...  foo-OMP...  bar-OMP...
    main        libmtsk     libmtsk     libmtsk     libmtsk
    _start      _lwp_start  _lwp_start  _lwp_start  _lwp_start