C H A P T E R 7 - Performance Considerations

C H A P T E R 7

Performance Considerations

Once you have a correct, working OpenMP program, it is worth considering its overall performance. There are some general techniques that you can utilize to improve the efficiency and scalability of an OpenMP application, as well as techniques specific to the Sun platforms. These are discussed briefly here.

For additional information, see Techniques for Optimizing Applications: High Performance Computing, by Rajat Garg and Ilya Sharapov, which is available from http://www.sun.com/books/catalog/garg.xml

Also, visit the Sun Developer portal for occasional articles and case studies regarding performance analysis and optimization of OpenMP applications, at http://developers.sun.com/prodtech/cc/.

7.1 Some General Recommendations

The following are some general techniques for improving performance of OpenMP applications.

Minimize synchronization.

Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and locks.

Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For example, there is always an implied barrier at the end of a parallel region. Adding NOWAIT to a final DO in the region eliminates one redundant barrier.

Use named CRITICAL sections for fine-grained locking.

Use explicit FLUSH with care. Flushes can cause data cache restores to memory, and subsequent data accesses may require reloads from memory, all of which decrease efficiency.

If a SHARED variable in a parallel region is read by the threads executing the region, but not written to by any of the threads, then specify that variable to be FIRSTPRIVATE instead of SHARED. This avoids accessing the variable by dereferencing a pointer, and avoids cache conflicts.

Avoid wasting resources by requesting more threads than you plan to use. Experiment with SUNW_MP_THR_IDLE, to put spinning worker threads to sleep when not needed. See Section 5.3, OpenMP Environment Variables.

Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example:

This construct is less efficient:

!$OMP PARALLEL

  ....

  !$OMP DO

    ....

  !$OMP END DO

  ....

!$OMP END PARALLEL

!$OMP PARALLEL

  ....

   !$OMP DO

     ....

   !$OMP END DO

  ....

!$OMP END PARALLEL

than this one:

!$OMP PARALLEL

  ....

  !$OMP DO

    ....

  !$OMP END DO

  .....

  !$OMP DO

    ....

  !$OMP END DO

!$OMP END PARALLEL

Use PARALLEL DO/FOR instead of worksharing DO/FOR directives in parallel regions. The PARALLEL DO/FOR is implemented more efficiently than a general parallel region containing possibly several loops. For example:

This construct is less efficient:

!$OMP PARALLEL

  !$OMP DO

    .....

  !$OMP END DO

!$OMP END PARALLEL

than this one:

!$OMP PARALLEL DO

   ....

!$OMP END PARALLEL

Use SUNW_MP_PROCBIND to bind lightweight processes (LWPs) to processors. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will be in the local cache from a previous invocation of a parallel region. See Section 5.4, Processor Binding.

Use MASTER instead of SINGLE wherever possible.

The MASTER directive is implemented as an IF-statement with no implicit BARRIER :
IF(omp_get_thread_num() == 0) {...}

The SINGLE directive is implemented similar to other worksharing constructs. Keeping track of which thread reached SINGLE first adds additional runtime overhead. There is an implicit BARRIER if NOWAIT is not specified. It is less efficient.

Choose the appropriate loop scheduling.

STATIC causes no synchronization overhead and can maintain data locality when data fits in cache. However, STATIC may lead to load imbalance.

DYNAMIC,GUIDED incurs a synchronization overhead to keep track of which chunks have been assigned. And, while these schedules could lead to poor data locality, they can improve load balancing. Experiment with different chunk sizes.

Use LASTPRIVATE with care, as it has the potential of high overhead.

Data needs to be copied from private to shared storage upon return from the parallel construct.

The compiled code checks which thread executes the logically last iteration. This imposes extra work at the end of each chunk in a parallel DO/FOR. The overhead adds up if there are many chunks.

Use efficient thread-safe memory management.

Applications could be using malloc() and free() explicitly, or implicitly in the compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.

The thread-safe malloc() and free() in libc have a high synchronization overhead caused by internal locking. Faster versions can be found in the libmtmalloc library. Link with -lmtmalloc to use libmtmalloc.

7.2 False Sharing And How To Avoid It

Careless use of shared memory structures with OpenMP applications can result in poor performance and limited scalability. Multiple processors updating adjacent shared data in memory can result in excessive traffic on the multiprocessor interconnect and, in effect, cause serialization of computations.

7.2.1 What Is False Sharing?

Most high performance processors, such as UltraSPARC III, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory and restore the cache line back to memory.

However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.

This situation is called false sharing, and could be a significant cause of poor performance and scalability in OpenMP applications.

False sharing degrades performance when all of the following conditions occur.

Shared data is modified by multiple processors.

Multiple processors update data within the same cache line.

This updating occurs very frequently (for example, in a tight loop).

Note that shared data that is read-only in a loop does not lead to false sharing.

7.2.2 Reducing False Sharing

Careful analysis of those parallel loops that play a major part in the execution of an application can reveal performance scalability problems caused by false sharing. In general, false sharing can be reduced by

changing the structure and use of shared data into private data,

increasing the problem size (iteration length),

changing the mapping of iterations to processors to give each processor more work per iteration (chunk size),

utilizing the compiler's optimization features to eliminate memory loads and stores.

7.3 Operating System Tuning Features

Starting with the Solaris 9 release, the operating system provides scalability and high performance for the SunFire systems. New features introduced with Solaris 9 OS that improve the performance of OpenMP programs without hardware upgrades are Memory Placement Optimizations (MPO) and Multiple Page Size Support (MPSS), among others.

MPO allows the OS to allocate pages close to the processors that access those pages. SunFire 6800, SunFire 15K, and SunFire E25K systems have different memory latencies within the same UniBoard versus between different UniBoards. The default MPO policy, called first-touch, allocates memory on the UniBoard containing the processor that first touches the memory. The first-touch policy can greatly improve the performance of applications where data accesses are made mostly to the memory local to each processor with first-touch placement. Compared to a random memory placement policy where the memory is evenly distributed throughout the system, the memory latencies for applications can be lowered and the bandwidth increased, leading to higher performance.

MPSS allows a program to use different page sizes for different regions of virtual memory. The default page size for Solaris 9 OS is 8KB. With the 8KB page size, applications that use large memory can have a lot of TLB misses, since the number of TLB entries on UltraSPARC III Cu and UltraSPARC IV allow accesses to only a few megabytes of memory. UltraSPARC III Cu and UltraSPARC IV support four different page sizes: 8 KB, 64 KB, 512 KB, and 4MB. With MPSS, user processes can request one of these four page sizes. Thus MPSS can significantly reduce the number of TLB misses and lead to improved performance for applications that use a large amount of memory.