Oracle Solaris Studio 12.3: OpenMP API User's Guide
Some general techniques for improving performance of OpenMP applications are:
Minimize synchronization.
Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and locks.
Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For example, there is always an implied barrier at the end of a parallel region. Adding NOWAIT to the final DO in the region eliminates one redundant barrier, since the barrier at the end of the parallel region already synchronizes the threads.
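As a sketch of this in C (the clause is spelled nowait on #pragma omp for; the function name fill_arrays and the array contents are hypothetical, chosen only for illustration):

```c
#include <stddef.h>

/* Two independent loops inside one parallel region. The nowait clause
   removes the barrier after the first loop, so threads proceed directly
   to the second loop. The implicit barrier at the end of the parallel
   region still guarantees both loops are complete on return. */
void fill_arrays(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = 3.0 * i;
    }
}
```

Dropping the barrier is safe here only because the second loop does not read anything written by the first.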
Use named CRITICAL sections for fine-grained locking.
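A minimal C sketch of named CRITICAL sections (the counters, names xlock and ylock, and the function count_points are all hypothetical). Threads updating one counter do not block threads updating the other, which they would under a single unnamed critical section:

```c
/* Two unrelated shared counters, each protected by its own named
   critical section. With one unnamed critical, every update to either
   counter would serialize against every other update. */
long xcount = 0, ycount = 0;   /* hypothetical shared counters */

void count_points(const double *x, const double *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (x[i] > 0.0) {
            #pragma omp critical (xlock)
            xcount++;
        }
        if (y[i] > 0.0) {
            #pragma omp critical (ylock)
            ycount++;
        }
    }
}
```

For plain counters like these a reduction would be cheaper still; the named sections are shown only to illustrate fine-grained locking of unrelated updates.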
Use explicit FLUSH with care. Flushes can cause data cache restores to memory, and subsequent data accesses might require reloads from memory, all of which decrease efficiency.
By default, idle threads are put to sleep after a certain timeout period. If a thread does not find work by the end of the timeout period, it goes to sleep, avoiding wasted processor cycles at the expense of other threads. The default timeout period might not be appropriate for your application, causing the threads to go to sleep too soon or too late. In general, if an application has dedicated processors to run on, SPIN gives the best application performance. If an application shares processors with other applications, then the default setting, or a setting that does not let the threads spin for too long, is best for system throughput. Use the SUNW_MP_THR_IDLE environment variable to override the default timeout period, even up to the point where the idle threads are never put to sleep and remain active all the time.
Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example, this construct is less efficient:
!$OMP PARALLEL
....
!$OMP DO
....
!$OMP END DO
....
!$OMP END PARALLEL

!$OMP PARALLEL
....
!$OMP DO
....
!$OMP END DO
....
!$OMP END PARALLEL
A more efficient construct:
!$OMP PARALLEL
....
!$OMP DO
....
!$OMP END DO
....
!$OMP DO
....
!$OMP END DO
!$OMP END PARALLEL
Use PARALLEL DO/FOR instead of worksharing DO/FOR directives in parallel regions. The PARALLEL DO/FOR is implemented more efficiently than a general parallel region containing possibly several loops. For example, this construct is less efficient:
!$OMP PARALLEL
!$OMP DO
....
!$OMP END DO
!$OMP END PARALLEL
while this construct is more efficient:
!$OMP PARALLEL DO
....
!$OMP END PARALLEL DO
On Oracle Solaris systems, use SUNW_MP_PROCBIND to bind threads to processors. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will be in the local cache from a previous invocation of a parallel region. See 2.3 Processor Binding.
Use MASTER instead of SINGLE wherever possible.
The MASTER directive is implemented as an IF statement with no implicit BARRIER: IF (omp_get_thread_num() == 0) {...}
The SINGLE directive is implemented similarly to other worksharing constructs. Keeping track of which thread reached SINGLE first adds runtime overhead, and there is an implicit BARRIER if NOWAIT is not specified, which is less efficient.
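A C sketch of the MASTER pattern (the function run_step and the flag setup_done are hypothetical). Because MASTER carries no barrier, an explicit barrier must be added when other threads depend on the master's work:

```c
/* MASTER is just a thread-id test with no barrier. If later code in
   the region needs the result of the master block, an explicit
   barrier is required; SINGLE would have supplied one implicitly. */
int run_step(void)
{
    int setup_done = 0;
    #pragma omp parallel shared(setup_done)
    {
        #pragma omp master
        setup_done = 1;        /* only thread 0 executes this */

        #pragma omp barrier    /* needed: MASTER implies no barrier */

        /* all threads may now rely on setup_done being set */
    }
    return setup_done;
}
```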
Choose the appropriate loop scheduling.
STATIC causes no synchronization overhead and can maintain data locality when data fits in cache. However, STATIC could lead to load imbalance.
DYNAMIC and GUIDED incur a synchronization overhead to keep track of which chunks have been assigned. While these schedules could lead to poor data locality, they can improve load balancing. Experiment with different chunk sizes.
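In C, the schedule is selected with the schedule clause; a sketch comparing the two kinds on the same reduction loop (function names and the chunk size 16 are illustrative choices, not recommendations):

```c
/* schedule(static): each thread gets one contiguous block of
   iterations, with no scheduling synchronization at run time. */
double sum_static(const double *v, int n)
{
    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* schedule(dynamic, 16): chunks of 16 iterations are handed out on
   demand, trading data locality for load balance when iteration
   costs vary. */
double sum_dynamic(const double *v, int n)
{
    double s = 0.0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:s)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

Both produce the same sum; only the assignment of iterations to threads differs.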
Use LASTPRIVATE with care, as it can incur high overhead.
Data needs to be copied from private to shared storage upon return from the parallel construct.
The compiled code checks which thread executes the logically last iteration. This imposes extra work at the end of each chunk in a parallel DO/FOR. The overhead adds up if there are many chunks.
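A C sketch of the cost being described (the function scale_and_report and its arrays are hypothetical). The runtime must determine which thread executes iteration n-1 and copy that thread's private value out:

```c
/* lastprivate(last_val): on exit from the loop, last_val holds the
   value from the sequentially last iteration. The bookkeeping to
   identify that iteration is paid at the end of every chunk. */
double scale_and_report(double *a, int n)
{
    double last_val = 0.0;
    #pragma omp parallel for lastprivate(last_val)
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * 0.5;
        last_val = a[i];   /* value from iteration n-1 survives */
    }
    return last_val;       /* equals a[n-1] after scaling */
}
```

When only the final value is needed, reading a[n-1] after the loop avoids the clause entirely.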
Use efficient thread-safe memory management.
Applications might use malloc() and free() explicitly, or implicitly in the compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in libc have a high synchronization overhead caused by internal locking. Faster versions can be found in the libmtmalloc library. Link with -lmtmalloc to use libmtmalloc.
Parallel loops can underperform on small data sets because the overhead of starting and synchronizing threads outweighs the work. Use the IF clause on PARALLEL constructs to indicate that a loop should run in parallel only in those cases where some performance gain can be expected.
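A C sketch of the IF clause (the function axpy and the cutoff of 10000 iterations are arbitrary illustrations; a real threshold should be measured):

```c
/* The if clause makes the parallel region conditional: for n above
   the threshold the loop runs in parallel, otherwise the region
   executes serially and skips thread start-up cost. */
void axpy(double *y, const double *x, double alpha, int n)
{
    #pragma omp parallel for if(n > 10000)
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

The results are identical either way; only the execution mode changes with n.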
When possible, merge loops. For example, these two loops:
!$omp parallel do
do i = ...
   ...statements_1...
end do
!$omp parallel do
do i = ...
   ...statements_2...
end do
can be merged into a single loop:
!$omp parallel do
do i = ...
   ...statements_1...
   ...statements_2...
end do
Try nested parallelism if your application lacks scalability beyond a certain level. See 1.2 Special Conventions for more information about nested parallelism in OpenMP.
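A C sketch of enabling nested parallelism with omp_set_nested() (the Fortran equivalent is the OMP_SET_NESTED routine or the OMP_NESTED environment variable; the function nested_team_size is hypothetical and guarded so it also compiles without OpenMP):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Counts how many threads execute the innermost region. With nesting
   enabled, each of the 2 outer threads forms its own inner team of 2,
   giving 4; if inner regions serialize, the count is 2; in a serial
   build (pragmas ignored) it is 1. */
int nested_team_size(void)
{
    int count = 0;
#ifdef _OPENMP
    omp_set_nested(1);              /* allow nested parallel regions */
    omp_set_max_active_levels(2);   /* permit two levels of parallelism */
#endif
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

Whether inner regions actually get new teams depends on the runtime and available processors, so the count should be treated as diagnostic, not guaranteed.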