Oracle® Developer Studio 12.5: OpenMP API User's Guide

Updated: July 2016

8.1 General Performance Recommendations

This section describes some general techniques for improving the performance of OpenMP applications.

  • Minimize synchronization.

    • Avoid or minimize the use of synchronization mechanisms such as barrier, critical, ordered, taskwait, and locks.

    • Use the nowait clause where possible to eliminate redundant or unnecessary barriers. For example, a parallel region always has an implied barrier at its end. If a worksharing loop is the last code in the region, its own implied barrier is redundant with the region's barrier, and adding nowait to the loop eliminates it (see the sketch below).

    • Use named critical sections for fine-grained locking where appropriate, so that unrelated critical sections in the program do not all contend for the same default lock.
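
    A minimal sketch combining nowait and named critical sections (the function and variable names are illustrative, not from this guide):

    double ysum = 0.0;      /* shared results */
    long   nneg = 0;

    void transform(double *x, double *y, int n)
    {
       int i;
       #pragma omp parallel
       {
          double mysum = 0.0;     /* per-thread partial results */
          long   myneg = 0;

          #pragma omp for nowait  /* no barrier needed: the next loop
                                     reads y, which this loop never
                                     touches */
          for (i = 0; i < n; i++)
             x[i] *= 2.0;

          #pragma omp for nowait  /* no barrier needed here either: each
                                     thread enters the critical sections
                                     with its own completed partials */
          for (i = 0; i < n; i++)
          {
             mysum += y[i];
             if (y[i] < 0.0)
                myneg++;
          }

          #pragma omp critical (sum_lock)
          ysum += mysum;
          #pragma omp critical (neg_lock)  /* a different name, hence a
                                              different lock: the two
                                              updates do not serialize
                                              against each other */
          nneg += myneg;
       }
    }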

  • Use the OMP_WAIT_POLICY, SUNW_MP_THR_IDLE, or SUNW_MP_WAIT_POLICY environment variables to control how waiting threads behave. By default, a thread that does not find work within a timeout period goes to sleep, so that it does not waste processor cycles at the expense of other threads. The default timeout might not be appropriate for your application, causing threads to go to sleep too soon or too late. In general, if an application has dedicated processors to run on, an active wait policy (for example, OMP_WAIT_POLICY=ACTIVE), which keeps waiting threads spinning, gives better performance. If an application runs simultaneously with other applications, a passive wait policy (OMP_WAIT_POLICY=PASSIVE), which puts waiting threads to sleep, is better for system throughput.

  • Parallelize at the highest level possible, such as outermost loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example, this construct is less efficient:

    #pragma omp parallel
    {
       #pragma omp for
       for (i=1; i<N; i++)
       {
          ...
       }
    }

    #pragma omp parallel
    {
       #pragma omp for
       for (i=1; i<N; i++)
       {
          ...
       }
    }

    A more efficient construct:

    #pragma omp parallel
    {
       #pragma omp for
       for (i=1; i<N; i++)
       {
          ...
       }

       #pragma omp for
       for (i=1; i<N; i++)
       {
          ...
       }
    }
    
  • When a parallel region contains only a single worksharing for/do loop, use a combined parallel for/do construct instead of a worksharing for/do construct nested inside a parallel construct. For example, this construct is less efficient:

    #pragma omp parallel
    {
       #pragma omp for
       for (i=1; i<N; i++)
       {
          ... statements ...
       }
    }

    This construct is more efficient:

    #pragma omp parallel for
    for (i=1; i<N; i++)
    {
       ... statements ...
    }
  • When possible, merge adjacent parallel loops to avoid repeated parallelization overhead. For example, merge these two parallel for loops:

    #pragma omp parallel for
    for (i=1; i<N; i++)
       {
         ... statements 1 ...
       }
    
    #pragma omp parallel for
    for (i=1; i<N; i++)
       {
         ... statements 2 ...
       }
    

    The resulting single parallel for loop is more efficient:

    #pragma omp parallel for
    for (i=1; i<N; i++)
       {
         ... statements 1 ...
         ... statements 2 ...
       }
  • Use the OMP_PROC_BIND or SUNW_MP_PROCBIND environment variable to bind threads to processors. Processor binding, used together with static scheduling, benefits applications in which the data a thread accesses in one parallel region is still in that thread's local cache from a previous invocation of a parallel region. See Processor Binding (Thread Affinity).
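
    A minimal sketch of this reuse pattern (names are illustrative): with schedule(static), both loops assign the same iterations to the same thread, so with binding enabled (for example, OMP_PROC_BIND=TRUE) the elements a thread touched in the first loop are still warm in its local cache during the second.

    #define N 1000000
    static double a[N];

    void two_passes(void)
    {
       int i;
       #pragma omp parallel for schedule(static)
       for (i = 0; i < N; i++)
          a[i] = i * 0.5;       /* first pass touches a[i] */

       #pragma omp parallel for schedule(static)
       for (i = 0; i < N; i++)
          a[i] = a[i] * a[i];   /* same thread, same processor,
                                   same cache lines */
    }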

  • Use master instead of single where possible.

    • The master directive is implemented as an if statement with no implicit barrier: if (omp_get_thread_num() == 0) {...}

    • The single construct is implemented like the other worksharing constructs. Keeping track of which thread reaches the single region first adds runtime overhead. Moreover, unless nowait is specified, there is an implicit barrier at the end of the construct, which is less efficient.
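
    A minimal sketch (the progress message is illustrative):

    #include <stdio.h>
    #include <omp.h>

    void step(void)
    {
       #pragma omp parallel
       {
          #pragma omp master     /* no bookkeeping and no barrier: the
                                    other threads simply skip this block
                                    and continue */
          printf("running on %d threads\n", omp_get_num_threads());

          /* ... work executed by all threads ... */
       }
    }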

  • Choose the appropriate loop schedule.

    • The static loop schedule requires no synchronization and can maintain data locality when the data fits in cache. However, a static schedule can lead to load imbalance if iterations perform unequal amounts of work.

    • The dynamic and guided loop schedules incur a synchronization overhead to keep track of which chunks have been assigned. While these schedules could lead to poor data locality, they can improve load balancing. Experiment with different chunk sizes.
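
    A minimal sketch contrasting the two schedules (work() is an assumed function whose cost varies with i):

    extern double work(int i);

    void run(double *a, double *b, int n)
    {
       int i;

       #pragma omp parallel for schedule(static)
       for (i = 0; i < n; i++)
          a[i] = 2.0 * a[i];    /* uniform cost: static divides the
                                   iterations evenly with no bookkeeping */

       #pragma omp parallel for schedule(dynamic, 64)
       for (i = 0; i < n; i++)
          b[i] = work(i);       /* irregular cost: dynamic balances the
                                   load, at the price of handing out
                                   chunks of 64 iterations at runtime */
    }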

  • Use efficient thread-safe memory management. An application could be using malloc() and free() functions explicitly, or implicitly in the compiler-generated code for dynamic arrays, allocatable arrays, vectorized intrinsics, and so on. The thread-safe malloc() and free() in the standard C library, libc.so, have a high synchronization overhead caused by internal locking. Faster versions are available in other libraries, such as libmtmalloc.so. Specify -lmtmalloc to link with libmtmalloc.so.
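
    A minimal sketch (names are illustrative) in which every thread allocates and frees concurrently, so the allocator's internal locking matters; linking with -lmtmalloc swaps in the faster allocator without any source change:

    #include <stdlib.h>

    void scratch_work(int n)
    {
       #pragma omp parallel
       {
          /* concurrent malloc()/free() on every thread */
          double *scratch = malloc((size_t)n * sizeof(double));
          if (scratch != NULL)
          {
             int i;
             for (i = 0; i < n; i++)    /* private, per-thread work */
                scratch[i] = 0.0;
             /* ... use scratch ... */
             free(scratch);
          }
       }
    }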

  • For small data sets, the overhead of an OpenMP parallel region can outweigh any gain from running in parallel. Use the if clause on the parallel construct so that the region runs in parallel only in those cases where some performance gain can be expected.
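
    A minimal sketch (the threshold 10000 is illustrative; measure the break-even point for the real loop body and machine):

    void scale(double *a, int n)
    {
       int i;
       #pragma omp parallel for if (n > 10000)
       for (i = 0; i < n; i++)    /* runs serially when n is small */
          a[i] *= 2.0;
    }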

  • Try nested parallelism if your application lacks scalability beyond a certain level. However, use nested parallelism with care: it adds synchronization overhead, because the thread team of every nested parallel region has to synchronize at a barrier, and it can oversubscribe the machine, leading to degraded performance.
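
    A minimal sketch: each of the two outer threads creates its own inner team of two, so four threads run in total and each inner region ends with its own barrier.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
       omp_set_nested(1);                     /* nesting is off by default */

       #pragma omp parallel num_threads(2)    /* outer team */
       {
          int outer = omp_get_thread_num();

          #pragma omp parallel num_threads(2) /* inner team per outer thread */
          printf("outer %d, inner %d\n", outer, omp_get_thread_num());
       }                                      /* 2 x 2 = 4 threads in total */
       return 0;
    }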

  • Use lastprivate with care, as it can incur high overhead.

    • The value from the sequentially last iteration must be copied from a thread's private copy to the shared variable before the region exits.

    • Extra checks are added for lastprivate. For example, the compiled code for a worksharing loop with the lastprivate clause checks which thread executes the sequentially last iteration. This imposes extra work at the end of each chunk in the loop, which may add up if there are many chunks.
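
    A minimal sketch (names are illustrative): lastprivate(x) copies the value that x has in the sequentially last iteration (i == n-1) back to the shared x, which is exactly the bookkeeping described above:

    double last_square(double *a, int n)
    {
       int i;
       double x = 0.0;
       #pragma omp parallel for lastprivate(x)
       for (i = 0; i < n; i++)
          x = a[i] * a[i];
       return x;    /* the value the serial loop would leave behind */
    }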

  • Use explicit flush with care. A flush forces data to be stored to memory, and subsequent data accesses may require a reload from memory, both of which decrease efficiency.
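
    A minimal sketch of the classic pattern that motivates explicit flush, publishing data behind a flag; every flush shown forces a store to or a reload from memory, which is exactly the cost described above:

    #include <omp.h>

    int data, flag = 0;

    void publish(void)
    {
       #pragma omp parallel num_threads(2)
       {
          if (omp_get_thread_num() == 0)
          {
             data = 42;
             #pragma omp flush(flag, data)  /* push data out first */
             flag = 1;
             #pragma omp flush(flag)        /* then publish the flag */
          }
          else
          {
             #pragma omp flush(flag)
             while (flag == 0)
             {
                #pragma omp flush(flag)     /* reload flag on each pass */
             }
             #pragma omp flush(flag, data)  /* reload data before use */
             /* ... use data ... */
          }
       }
    }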