Once you have a correct, working OpenMP program, it is worth considering its overall performance. There are some general techniques that you can utilize to improve the efficiency and scalability of an OpenMP application, as well as techniques specific to the Sun platforms. These are discussed briefly here.
For additional information, see Solaris Application Programming, by Darryl Gove, which is available from http://www.sun.com/books/catalog/solaris_app_programming.xml
Also, visit the Sun Developer portal for occasional articles and case studies regarding performance analysis and optimization of OpenMP applications, at http://developers.sun.com/sunstudio/.
The following are some general techniques for improving performance of OpenMP applications.
Minimize synchronization.
Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and locks.
Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For example, there is always an implied barrier at the end of a parallel region, so adding NOWAIT to the final DO in the region eliminates one redundant barrier (see the sketch after this list).
Use named CRITICAL sections for fine-grained locking.
Use explicit FLUSH with care. Flushes can cause data cache restores to memory, and subsequent data accesses may require reloads from memory, all of which decrease efficiency.
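The NOWAIT idea looks like this in C (a minimal, hypothetical sketch; the function and array names are arbitrary). The final worksharing loop carries nowait, so threads proceed directly to the implicit barrier at the end of the parallel region instead of synchronizing twice:

    #define N 1000

    void scale(double *a, double *b, const double *c)
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < N; i++)
                a[i] = a[i] * 2.0;

            /* nowait: no barrier here; the implicit barrier at the
               end of the parallel region is sufficient. */
            #pragma omp for nowait
            for (int i = 0; i < N; i++)
                b[i] = c[i] + 1.0;
        }   /* implicit barrier at the end of the parallel region */
    }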
By default, idle threads are put to sleep after a certain timeout period. The default timeout period might not be appropriate for your application, causing threads to go to sleep too soon or too late. The SUNW_MP_THR_IDLE environment variable can be used to override the default timeout period, even to the point where idle threads are never put to sleep and remain active all the time.
Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example:
This construct is less efficient:

    !$OMP PARALLEL
      ....
      !$OMP DO
        ....
      !$OMP END DO
      ....
    !$OMP END PARALLEL

    !$OMP PARALLEL
      ....
      !$OMP DO
        ....
      !$OMP END DO
      ....
    !$OMP END PARALLEL

than this one:

    !$OMP PARALLEL
      ....
      !$OMP DO
        ....
      !$OMP END DO
      ....
      !$OMP DO
        ....
      !$OMP END DO
    !$OMP END PARALLEL
Use PARALLEL DO/FOR instead of worksharing DO/FOR directives in parallel regions. The PARALLEL DO/FOR is implemented more efficiently than a general parallel region containing possibly several loops. For example:
This construct is less efficient:

    !$OMP PARALLEL
      !$OMP DO
        ....
      !$OMP END DO
    !$OMP END PARALLEL

than this one:

    !$OMP PARALLEL DO
      ....
    !$OMP END PARALLEL DO
Use SUNW_MP_PROCBIND to bind threads to processors. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will be in the local cache from a previous invocation of a parallel region. See 2.3 Processor Binding.
Use MASTER instead of SINGLE wherever possible.
The MASTER directive is implemented as an IF-statement with no implicit BARRIER:

    IF (omp_get_thread_num() == 0) {...}
The SINGLE directive is implemented like other worksharing constructs. Keeping track of which thread reached the SINGLE first adds runtime overhead, and there is an implicit BARRIER if NOWAIT is not specified, so it is less efficient.
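A minimal C sketch of the difference (hypothetical code; the printed messages are arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* MASTER: only thread 0 executes this block; no barrier
               is implied at the end of the construct. */
            #pragma omp master
            printf("master: thread %d\n", omp_get_thread_num());

            /* SINGLE: the first thread to arrive executes this block;
               all threads wait at the implicit barrier unless nowait
               is added. */
            #pragma omp single
            printf("single: thread %d\n", omp_get_thread_num());
        }
        return 0;
    }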
Choose the appropriate loop scheduling.
STATIC causes no synchronization overhead and can maintain data locality when data fits in cache. However, STATIC may lead to load imbalance.
DYNAMIC and GUIDED incur a synchronization overhead to keep track of which chunks have been assigned. While these schedules can lead to poor data locality, they can improve load balancing. Experiment with different chunk sizes.
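For illustration, a hedged C sketch of the two styles (the array size and chunk size are arbitrary and should be tuned):

    #define N 100000

    void update(double *a, const double *b)
    {
        int i;

        /* STATIC: iterations are divided up front, so there is no
           scheduling synchronization and locality is good when the
           data fits in cache, but uneven work causes imbalance. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            a[i] += b[i];

        /* DYNAMIC with a chunk size of 64: chunks are handed out at
           run time, which balances uneven work at the cost of some
           synchronization per chunk. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (i = 0; i < N; i++)
            a[i] += b[i];
    }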
Use LASTPRIVATE with care, as it has the potential of high overhead.
Data needs to be copied from private to shared storage upon return from the parallel construct.
The compiled code checks which thread executes the logically last iteration. This imposes extra work at the end of each chunk in a parallel DO/FOR. The overhead adds up if there are many chunks.
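A small hypothetical C example of where this cost comes from: x must be copied out of the thread that executes the logically last iteration, so the runtime checks at the end of every chunk whether that chunk contained it:

    #define N 1000

    double last_value(const double *a)
    {
        int i;
        double x = 0.0;

        /* lastprivate(x): after the loop, x holds the value assigned
           in the logically last iteration (i == N-1). */
        #pragma omp parallel for lastprivate(x)
        for (i = 0; i < N; i++)
            x = a[i] * a[i];

        return x;
    }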
Use efficient thread-safe memory management.
Applications could be using malloc() and free() explicitly, or implicitly in the compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in libc have a high synchronization overhead caused by internal locking. Faster versions can be found in the libmtmalloc library. Link with -lmtmalloc to use libmtmalloc.
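As an illustration (hypothetical code; the buffer size is arbitrary), the following loop allocates and frees a scratch buffer in every iteration. With the default libc allocator, the internal lock serializes these calls across threads; linking the same code with -lmtmalloc avoids most of that contention without any source change:

    #include <stdlib.h>

    #define N 10000

    void work(double *out)
    {
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            /* per-iteration scratch storage */
            double *scratch = malloc(64 * sizeof(double));
            scratch[0] = (double)i;
            /* ... compute with scratch ... */
            out[i] = scratch[0];
            free(scratch);
        }
    }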
Small data sets may cause OpenMP parallel loops to underperform. Use the IF clause on PARALLEL constructs to indicate that a loop should run in parallel only in those cases where some performance gain can be expected.
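For example (a hypothetical sketch; the threshold is arbitrary and should be tuned for the target machine):

    void add(double *a, const double *b, int n)
    {
        int i;

        /* Run in parallel only when n is large enough for the speedup
           to outweigh the parallelization overhead. */
        #pragma omp parallel for if (n > 10000)
        for (i = 0; i < n; i++)
            a[i] += b[i];
    }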
When possible, merge loops. For example:
Merge these two loops:

    !$omp parallel do
    do i = ...
       statements_1
    end do

    !$omp parallel do
    do i = ...
       statements_2
    end do

into a single loop:

    !$omp parallel do
    do i = ...
       statements_1
       statements_2
    end do
Try nested parallelism if your application lacks scalability beyond a certain level. See 1.2 Special Conventions Used Here for more information about nested parallelism in OpenMP.
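A minimal C sketch of nested parallelism (hypothetical; nested parallelism must be enabled, for example with omp_set_nested() or the OMP_NESTED environment variable):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_nested(1);        /* enable nested parallelism */
        omp_set_num_threads(2);   /* outer team of two threads */

        #pragma omp parallel
        {
            int outer = omp_get_thread_num();

            /* Each outer thread becomes the master of its own
               inner team. */
            #pragma omp parallel num_threads(2)
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
        return 0;
    }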
Careless use of shared memory structures with OpenMP applications can result in poor performance and limited scalability. Multiple processors updating adjacent shared data in memory can result in excessive traffic on the multiprocessor interconnect and, in effect, cause serialization of computations.
Most high-performance processors, such as UltraSPARC processors, insert a cache between slow main memory and the high-speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same memory location, or to those around it, can probably be satisfied out of the cache until the system determines it is necessary to maintain coherency between cache and memory.
However, simultaneous updates of individual elements in the same cache line coming from different processors invalidate the entire cache line, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid, and are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element they access has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. The result is an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.
False sharing degrades performance when all of the following conditions occur.
Shared data is modified by multiple processors.
Multiple processors update data within the same cache line.
This updating occurs very frequently (for example, in a tight loop).
Note that shared data that is read-only in a loop does not lead to false sharing.
Careful analysis of those parallel loops that play a major part in the execution of an application can reveal performance scalability problems caused by false sharing. In general, false sharing can be reduced by
making use of private data as much as possible;
utilizing the compiler’s optimization features to eliminate memory loads and stores.
In specific cases, the impact of false sharing may be less visible when dealing with larger problem sizes, as there might be less sharing.
Techniques for tackling false sharing are very much dependent on the particular application. In some cases, a change in the way the data is allocated can reduce false sharing. In other cases, changing the mapping of iterations to threads so that each thread gets more work per chunk (by changing the chunksize value) can also lead to a reduction in false sharing.
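As an illustration of the problem and of the private-data remedy (hypothetical code; the array size and thread limit are arbitrary), per-thread partial sums packed into one small array share cache lines, whereas a reduction keeps each partial sum private:

    #include <omp.h>

    #define N 1000000
    #define MAX_THREADS 16

    /* Prone to false sharing: adjacent elements of sum[] are updated
       by different threads but live in the same cache line. */
    double sum_false_sharing(const double *a)
    {
        double sum[MAX_THREADS] = {0.0};   /* assumes <= MAX_THREADS threads */
        double total = 0.0;
        int i, t;

        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            #pragma omp for
            for (i = 0; i < N; i++)
                sum[id] += a[i];           /* contended cache line */
        }
        for (t = 0; t < MAX_THREADS; t++)
            total += sum[t];
        return total;
    }

    /* Better: accumulate into private storage (here via a reduction
       clause) and combine the partial results once at the end. */
    double sum_private(const double *a)
    {
        double total = 0.0;
        int i;

        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < N; i++)
            total += a[i];
        return total;
    }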
New features have been introduced in the Solaris OS that improve the performance of OpenMP programs. These include Memory Placement Optimizations, locality groups, and Multiple Page Size Support.
The concept of a locality group (lgroup) has been introduced in Solaris to represent a set of CPU-like and memory-like hardware resources that are within some latency of each other.
Solaris OS assigns a thread to an lgroup when the thread is created. That lgroup is called the thread's home lgroup. Solaris OS runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible. If resources from the home lgroup are unavailable, Solaris allocates resources from other lgroups. When a thread has affinity for more than one lgroup, the OS allocates resources from lgroups chosen in order of affinity strength.
The lgroup APIs export the lgroup abstraction for applications to use for observability and performance tuning. A new library, called liblgrp, contains the new APIs. Applications can use the APIs to perform the following tasks:
Traverse the lgroup hierarchy
Discover the contents and characteristics of a given lgroup
Affect the thread and memory placement on lgroups
For example, the lgrp_affinity_set() function sets the affinity that a thread or set of threads have for a given lgroup. The OS uses the lgroup affinities as advice as to where to run a thread and allocate its memory.
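A hedged sketch of these calls in C (hypothetical; assumes a Solaris system and linking with -llgrp; see the liblgrp man pages for exact behavior):

    #include <stdio.h>
    #include <sys/lgrp_user.h>   /* liblgrp interfaces */
    #include <sys/procset.h>     /* P_LWPID, P_MYID */

    int main(void)
    {
        /* Open a snapshot of the lgroup hierarchy as seen by the caller. */
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
        if (cookie == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return 1;
        }

        /* The calling thread's home lgroup. */
        lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
        printf("home lgroup: %d\n", (int)home);

        /* Advise the OS to keep this thread, and the memory it
           allocates, in its home lgroup. */
        if (lgrp_affinity_set(P_LWPID, P_MYID, home, LGRP_AFF_STRONG) != 0)
            perror("lgrp_affinity_set");

        lgrp_fini(cookie);
        return 0;
    }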
The madvise() Standard C library function can be used to advise the OS that a region of user virtual memory is expected to follow a particular pattern of use. For example, calling madvise() with the MADV_ACCESS_LWP argument tells the kernel that the next thread to touch the specified region of memory will access it most heavily, so the kernel will try to allocate the memory and other resources for this range and the thread accordingly. The madvise() function can increase system performance when used by programs that have specific knowledge of their memory access patterns; the kernel needs this information to allocate memory resources for the application efficiently.
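A minimal sketch in C (hypothetical; the region size is arbitrary). The advice is given before the worker thread first touches the region, so the kernel can place the pages near that thread:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/mman.h>   /* mmap(), madvise(), MADV_ACCESS_LWP */

    #define NBYTES (64 * 1024 * 1024)

    int main(void)
    {
        /* Map an anonymous, page-aligned region. */
        char *buf = mmap(NULL, NBYTES, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Tell the kernel that the next thread to touch this range
           will access it most heavily. */
        if (madvise((caddr_t)buf, NBYTES, MADV_ACCESS_LWP) != 0)
            perror("madvise");

        /* ... the worker thread now initializes and uses buf ... */

        munmap(buf, NBYTES);
        return 0;
    }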
For more information about locality groups, refer to the manual Solaris: Memory and Thread Placement Optimization Developer's Guide.
The Multiple Page Size Support (MPSS) feature allows a program to use different page sizes for different regions of virtual memory. The default Solaris page size is relatively small (8 KB on UltraSPARC processors and 4 KB on AMD64 Opteron processors). The default page size on a specific platform can be obtained with the Solaris OS command pagesize. The -a option on this command lists all the supported page sizes. (See the pagesize(1) man page for details.)
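The same value can also be queried programmatically; a small hedged C sketch that simply mirrors what pagesize(1) reports:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The default page size, as also reported by pagesize(1). */
        long pgsz = sysconf(_SC_PAGESIZE);
        printf("default page size: %ld bytes\n", pgsz);
        return 0;
    }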
Applications that suffer from too many TLB misses may experience a performance boost by using a larger page size. TLB misses can be measured using the Sun Performance Analyzer.
There are three ways to change the default page size for an application:
Use the Solaris OS command ppgsz(1)
Use MPSS specific environment variables. See the mpss.so.1(1) man page for details.
Compile the application with the -xpagesize, -xpagesize_heap, or the -xpagesize_stack options. (See the compiler man pages for details.)
The Collector and Performance Analyzer are a pair of Sun Studio tools that can be used to collect and analyze performance data for an application. The Collector tool collects performance data using a statistical method called profiling and by tracing function calls. The Performance Analyzer processes the data recorded by the Collector, and displays various metrics of performance at program, function, OpenMP parallel region, OpenMP task, source-line, and assembly instruction levels. The Performance Analyzer can also display the raw data in a graphical format as a function of time.
See the collect(1) and analyzer(1) man pages and the Sun Studio Performance Analyzer manuals for more details.