The following are some general techniques for improving performance of OpenMP applications.
Minimize synchronization.
Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and locks.
Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For example, there is always an implied barrier at the end of a parallel region. Adding NOWAIT to a final DO in the region eliminates one redundant barrier.
Use named CRITICAL sections for fine-grained locking.
Use explicit FLUSH with care. A flush can force modified cached data to be written back to memory, and subsequent accesses may then have to reload that data from memory, both of which decrease efficiency.
By default, idle threads are put to sleep after a certain timeout period. The default timeout period might not be appropriate for your application, causing threads to go to sleep too soon or too late. The SUNW_MP_THR_IDLE environment variable can be used to override the default timeout period, including keeping idle threads spinning the whole time so that they are never put to sleep.
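A configuration sketch; the `SPIN` and `SLEEP(time)` value forms follow the Sun runtime's documented syntax, and the application name is hypothetical:

```shell
# Keep idle threads spinning (never put them to sleep); appropriate
# when the machine is not oversubscribed.
export SUNW_MP_THR_IDLE=SPIN

# Or: let idle threads spin for 50 ms before going to sleep.
# export SUNW_MP_THR_IDLE="SLEEP(50 ms)"

./my_openmp_app    # hypothetical application name
```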
Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example:
This construct is less efficient:

!$OMP PARALLEL
   ....
   !$OMP DO
      ....
   !$OMP END DO
   ....
!$OMP END PARALLEL
!$OMP PARALLEL
   ....
   !$OMP DO
      ....
   !$OMP END DO
   ....
!$OMP END PARALLEL

than this one:

!$OMP PARALLEL
   ....
   !$OMP DO
      ....
   !$OMP END DO
   ....
   !$OMP DO
      ....
   !$OMP END DO
!$OMP END PARALLEL
Use PARALLEL DO/FOR instead of worksharing DO/FOR directives in parallel regions. The PARALLEL DO/FOR is implemented more efficiently than a general parallel region containing possibly several loops. For example:
This construct is less efficient:

!$OMP PARALLEL
   !$OMP DO
      ....
   !$OMP END DO
!$OMP END PARALLEL

than this one:

!$OMP PARALLEL DO
   ....
!$OMP END PARALLEL DO
On Solaris systems, use SUNW_MP_PROCBIND to bind threads to processors. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will be in the local cache from a previous invocation of a parallel region. See 2.3 Processor Binding.
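A configuration sketch; the `TRUE` and explicit-processor-list forms follow the Sun runtime's documented syntax, and the application name is hypothetical:

```shell
# Bind OpenMP threads to processors starting from processor 0.
export SUNW_MP_PROCBIND=TRUE

# Or: bind thread k to the k-th processor ID in an explicit list.
# export SUNW_MP_PROCBIND="0 1 2 3"

./my_openmp_app    # hypothetical application name
```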
Use MASTER instead of SINGLE wherever possible.
The MASTER directive is implemented as an IF-statement with no implicit BARRIER: IF (omp_get_thread_num() == 0) {...}
The SINGLE directive is implemented similarly to other worksharing constructs: the runtime must keep track of which thread reaches the SINGLE first, which adds overhead, and there is an implicit BARRIER at the end unless NOWAIT is specified. This makes SINGLE less efficient than MASTER.
Choose the appropriate loop scheduling.
STATIC scheduling incurs no synchronization overhead and can maintain data locality when the data fits in cache. However, STATIC may lead to load imbalance.
DYNAMIC and GUIDED scheduling incur a synchronization overhead to keep track of which chunks have been assigned, and they can lead to poor data locality, but they improve load balancing when iteration costs vary. Experiment with different chunk sizes.
Use LASTPRIVATE with care, as it has the potential of high overhead.
Data needs to be copied from private to shared storage upon return from the parallel construct.
The compiled code checks which thread executes the logically last iteration. This imposes extra work at the end of each chunk in a parallel DO/FOR. The overhead adds up if there are many chunks.
Use efficient thread-safe memory management.
An application may call malloc() and free() explicitly, or implicitly through compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
The thread-safe malloc() and free() in libc have a high synchronization overhead caused by internal locking. Faster versions can be found in the libmtmalloc library. Link with -lmtmalloc to use libmtmalloc.
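A link-line sketch, assuming the Sun Studio C compiler and its -xopenmp flag; adjust for your compiler and source files:

```shell
# Link against the multithread-optimized allocator instead of libc malloc.
cc -xopenmp -O -o app app.c -lmtmalloc
```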
When the data set is small, the overhead of creating and scheduling a parallel loop may outweigh the benefit. Use the IF clause on PARALLEL constructs so that a loop runs in parallel only when some performance gain can be expected.
When possible, merge loops. For example, merge these two loops:

!$omp parallel do
do i = ...
   statements_1
end do

!$omp parallel do
do i = ...
   statements_2
end do

into a single loop:

!$omp parallel do
do i = ...
   statements_1
   statements_2
end do
Try nested parallelism if your application lacks scalability beyond a certain level. See 1.2 Special Conventions Used Here for more information about nested parallelism in OpenMP.