CHAPTER 5 - Compiling for OpenMP

This chapter describes how to compile programs that utilize the OpenMP API.

To run a parallelized program in a multithreaded environment, you must set the OMP_NUM_THREADS environment variable before the program runs. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set OMP_NUM_THREADS to a value no larger than the number of processors available on the target platform. Because OMP_DYNAMIC defaults to TRUE, the runtime can adjust the number of threads; set OMP_DYNAMIC to FALSE to use the number of threads specified by OMP_NUM_THREADS.
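As a minimal sketch of the guidance above, the following Bourne/Korn-style shell fragment caps a requested thread count at the number of available processors before exporting OMP_NUM_THREADS. The function name choose_num_threads and the example counts (16 threads requested, 8 processors available) are assumptions made for illustration; on a real system the processor count would come from a tool such as psrinfo.

```shell
# choose_num_threads REQUESTED AVAILABLE
# Hypothetical helper: print REQUESTED, capped at AVAILABLE processors.
choose_num_threads() {
  requested=$1
  available=$2
  if [ "$requested" -gt "$available" ]; then
    echo "$available"
  else
    echo "$requested"
  fi
}

# Example values only: 16 threads requested on an 8-processor machine.
OMP_NUM_THREADS=$(choose_num_threads 16 8)
OMP_DYNAMIC=FALSE   # use the thread count given by OMP_NUM_THREADS
export OMP_NUM_THREADS OMP_DYNAMIC
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS OMP_DYNAMIC=$OMP_DYNAMIC"
```

C shell users would use setenv instead of the assignment and export shown here.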

The latest information regarding Sun Studio compilers and OpenMP can be found on the Sun Developer Network portal, http://developers.sun.com/sunstudio

5.1 Compiler Options To Use

To enable explicit parallelization with OpenMP directives, compile your program with the cc, CC, or f95 option flag -xopenmp. This flag can take an optional keyword argument. (The f95 compiler accepts both -xopenmp and -openmp as synonyms.)

The -xopenmp flag accepts the following keyword sub-options.

`-xopenmp=parallel`	Enables recognition of OpenMP pragmas. The minimum optimization level for `-xopenmp=parallel` is `-xO3`. The compiler changes the optimization from a lower level to `-xO3` if necessary, and issues a warning.
`-xopenmp=noopt`	Enables recognition of OpenMP pragmas. The compiler does not raise the optimization level if it is lower than `-xO3`. If you explicitly set the optimization level lower than `-xO3`, as in `-xO2 -xopenmp=noopt`, the compiler issues an error. If you do not specify an optimization level with `-xopenmp=noopt`, the OpenMP pragmas are recognized and the program is parallelized accordingly, but no optimization is done. (This sub-option applies to `cc` and `f95` only; `CC` issues a warning if it is specified, and no OpenMP parallelization is done.)
`-xopenmp=stubs`	This option is no longer supported. An OpenMP stubs library is provided for users' convenience. To compile an OpenMP program that calls OpenMP library routines but ignores the OpenMP pragmas, compile the program without an `-xopenmp` option, and link the object files with the `libompstubs.a` library. For example: `% cc omp_ignore.c -lompstubs`. Linking with both `libompstubs.a` and the OpenMP runtime library `libmtsk.so` is unsupported and may result in unexpected behavior.
`-xopenmp=none`	Disables recognition of OpenMP pragmas and does not change the optimization level.

Additional Notes:

If you do not specify -xopenmp on the command line, the compiler assumes -xopenmp=none (disabling recognition of OpenMP pragmas).

If you specify -xopenmp but without a keyword sub-option, the compiler assumes -xopenmp=parallel.

Do not specify -xopenmp together with -xparallel or -xexplicitpar on the command line.

Specifying -xopenmp=parallel or -xopenmp=noopt defines the _OPENMP preprocessor token as YYYYMM (specifically, 200505L for C/C++ and 200505 for Fortran 95).

When debugging OpenMP programs with dbx, compile with -xopenmp=noopt -g.

The default optimization level for -xopenmp might change in future releases. Compilation warning messages can be avoided by specifying an appropriate optimization level explicitly.

With Fortran 95, -xopenmp, -xopenmp=parallel, and -xopenmp=noopt add -stackvar automatically.

If you compile with -xopenmp when building a dynamic (.so) library, you must also specify -xopenmp when linking the executable, and the compiler used to create the executable must be at least as new as the compiler that built the dynamic library with -xopenmp. Using different compiler versions with -xopenmp to create the executable and the library can result in unexpected behavior.

Use the -xvpara C and Fortran 95 option to display compiler parallelization messages.
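The rule against combining -xopenmp with -xparallel or -xexplicitpar can be checked before the compiler is invoked. The sketch below shows one hypothetical way to do that in a build script. The name check_openmp_flags is illustrative and is not part of the compilers, and the function inspects only the flags named in the notes above.

```shell
# check_openmp_flags FLAG...
# Hypothetical pre-flight check for a build script: fail if an OpenMP flag
# (-xopenmp, or the f95 synonym -openmp) is combined with -xparallel or
# -xexplicitpar. All other flags are ignored.
check_openmp_flags() {
  has_openmp=0
  has_conflict=0
  for flag in "$@"; do
    case "$flag" in
      -xopenmp|-xopenmp=*|-openmp|-openmp=*) has_openmp=1 ;;
      -xparallel|-xexplicitpar)              has_conflict=1 ;;
    esac
  done
  if [ "$has_openmp" -eq 1 ] && [ "$has_conflict" -eq 1 ]; then
    echo "error: do not combine -xopenmp with -xparallel or -xexplicitpar" >&2
    return 1
  fi
  return 0
}

# A debugging build as recommended above passes the check.
check_openmp_flags -xopenmp=noopt -g && echo "flags ok"
```

A build script could call the function with the same flag list it is about to pass to cc, CC, or f95 and stop if the function fails.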

5.2 Fortran 95 OpenMP Validation

You can obtain a static, interprocedural validation of a Fortran 95 program's OpenMP directives by using the f95 compiler's global program checking feature. Enable OpenMP checking by compiling with the -XlistMP flag. (Diagnostic messages from -XlistMP appear in a separate file created with the name of the source file and a .lst extension.) The compiler will diagnose the following violations and parallelization inhibitors:

Violations in the specifications of parallel directives, including improper nesting.

Parallelization inhibitors due to data usage, detected by interprocedural dependence analysis.

Parallelization inhibitors detected by interprocedural pointer analysis.

For example, compiling a source file ord.f with -XlistMP produces a diagnostic file ord.lst:

FILE  "ord.f"
     1  !$OMP PARALLEL
     2  !$OMP DO ORDERED
     3                  do i=1,100
     4                          call work(i)
     5                  end do
     6  !$OMP END DO
     7  !$OMP END PARALLEL
     9  !$OMP PARALLEL
    10  !$OMP DO
    11                  do i=1,100
    12                          call work(i)
    13                  end do
    14  !$OMP END DO
    15  !$OMP END PARALLEL
    16                  end
    17                  subroutine work(k)
    18  !$OMP ORDERED
**** ERR-OMP:  It is illegal for an ORDERED directive to bind to a
directive (ord.f, line 10, column 2) that does not have the
ORDERED clause specified.
    19                  write(*,*) k
    20  !$OMP END ORDERED
    21                  return
    22                  end

In this example, the ORDERED directive in subroutine WORK receives a diagnostic that refers to the second DO directive because it lacks an ORDERED clause.

5.3 OpenMP Environment Variables

The OpenMP specification defines four environment variables that control the execution of OpenMP programs. These are summarized in Table 5-1. Additional multiprocessing environment variables, which are not part of the OpenMP specification, also affect the execution of OpenMP programs. These are summarized in Table 5-2.

TABLE 5-1 OpenMP Environment Variables
Environment Variable	Function
OMP_SCHEDULE	Sets the schedule type for DO, PARALLEL DO, for, and parallel for directives/pragmas that specify the schedule type RUNTIME. If not defined, a default value of STATIC is used. The value is "type[,chunk]". Example: `setenv OMP_SCHEDULE "GUIDED,4"`
OMP_NUM_THREADS or PARALLEL	Sets the number of threads to use during execution of a parallel region. You can override this value with a NUM_THREADS clause or a call to OMP_SET_NUM_THREADS(). If not set, a default of 1 is used. The value is a positive integer. For compatibility with legacy programs, setting the PARALLEL environment variable has the same effect as setting OMP_NUM_THREADS. However, if both are set to different values, the runtime library issues an error message. Example: `setenv OMP_NUM_THREADS 16`
OMP_DYNAMIC	Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. If not set, a default value of TRUE is used. The value is either TRUE or FALSE. Example: `setenv OMP_DYNAMIC FALSE`
OMP_NESTED	Enables or disables nested parallelism. The value is either TRUE or FALSE. The default is FALSE. Example: `setenv OMP_NESTED FALSE`
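To illustrate the "type[,chunk]" form of the OMP_SCHEDULE value, here is a small sketch that splits such a string into its two parts and falls back to STATIC when the value is empty, mirroring the default described in the table. The function split_omp_schedule is a hypothetical name. It only splits the string and does not check that the type is one the runtime accepts.

```shell
# split_omp_schedule VALUE
# Hypothetical helper: print the schedule type and chunk size encoded in an
# OMP_SCHEDULE-style value. An empty VALUE falls back to STATIC.
split_omp_schedule() {
  value=${1:-STATIC}
  case "$value" in
    *,*) kind=${value%%,*}; chunk=${value#*,} ;;  # "type,chunk"
    *)   kind=$value;       chunk="" ;;           # "type" only
  esac
  echo "type=$kind chunk=${chunk:-unspecified}"
}

split_omp_schedule "GUIDED,4"   # prints: type=GUIDED chunk=4
split_omp_schedule ""           # prints: type=STATIC chunk=unspecified
```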

TABLE 5-2 Multiprocessing Environment Variables
Environment Variable	Function
SUNW_MP_WARN	Controls warning messages issued by the OpenMP runtime library. If set to TRUE, the runtime library issues warning messages to `stderr`; FALSE disables warning messages. The default is FALSE. The OpenMP runtime library can check for many common OpenMP violations, such as incorrect nesting and deadlocks. Runtime checking adds overhead to the execution of the program. See Runtime Warnings. Example: `setenv SUNW_MP_WARN TRUE`
SUNW_MP_THR_IDLE	Controls the end-of-task status of each helper thread executing the parallel part of a program. You can set the value to SPIN, SLEEP ns, or SLEEP nms. The default is SLEEP: the thread sleeps after completing a parallel task until a new parallel task arrives. Choosing SLEEP time specifies the amount of time a helper thread should spin-wait after completing a parallel task. If, while a thread is spinning, a new task arrives for the thread, the thread executes the new task immediately. Otherwise, the thread goes to sleep and is awakened when a new task arrives. The time may be specified in seconds, (ns) or just (n), or in milliseconds, (nms). SLEEP with no argument puts the thread to sleep immediately after completing a parallel task. SLEEP, SLEEP (0), SLEEP (0s), and SLEEP (0ms) are all equivalent. Example: `setenv SUNW_MP_THR_IDLE SLEEP(50ms)`
SUNW_MP_PROCBIND	The `SUNW_MP_PROCBIND` environment variable can be used to bind threads of an OpenMP program to processors. Performance can be enhanced with processor binding, but performance degradation will occur if multiple threads are bound to the same processor. See Section 5.4, Processor Binding for details.
SUNW_MP_MAX_POOL_THREADS	Specifies the maximum size of the thread pool. The thread pool contains only non-user threads that the OpenMP runtime library creates. It does not contain the master thread or any threads created explicitly by the user's program. If this environment variable is set to zero, the thread pool will be empty and all parallel regions will be executed by one thread. The default, if not specified, is 1023. See Section 2.2, Control of Nested Parallelism for details.
SUNW_MP_MAX_NESTED_LEVELS	Specifies the maximum depth of active nested parallel regions. Any parallel region that has an active nested depth greater than the value of this environment variable will be executed by only one thread. A parallel region is considered not active if it is an OpenMP parallel region that has a false IF clause. The default, if not specified, is 4. See Section 2.2, Control of Nested Parallelism for details.
STACKSIZE	Sets the stack size for each thread. The value is in kilobytes. The default thread stack sizes are 4 MB on 32-bit SPARC V8 and x86 platforms, and 8 MB on 64-bit SPARC V9 and x86 platforms. Example: `setenv STACKSIZE 8192` sets the thread stack size to 8 MB. The STACKSIZE environment variable also accepts numerical values with a suffix of B, K, M, or G for bytes, kilobytes, megabytes, or gigabytes, respectively. If no suffix is given, the value is in kilobytes.
SUNW_MP_GUIDED_WEIGHT	Sets the weighting factor used to determine the size of chunks assigned to threads in loops with GUIDED scheduling. The value should be a positive floating-point number, and will apply to all loops with GUIDED scheduling in the program. If not set, the default value assumed is 2.0.
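The suffix rules for STACKSIZE can be made concrete with a short conversion sketch. The function below, stacksize_to_kb (a hypothetical name), normalizes a STACKSIZE-style value to kilobytes using the B, K, M, and G suffixes described in the table. Rounding byte values down to whole kilobytes is an assumption of this sketch, not documented runtime behavior.

```shell
# stacksize_to_kb SPEC
# Hypothetical helper: convert a STACKSIZE-style value to kilobytes.
# A value with no suffix is taken as kilobytes.
stacksize_to_kb() {
  spec=$1
  case "$spec" in
    *B) num=${spec%B}; unit=B ;;
    *K) num=${spec%K}; unit=K ;;
    *M) num=${spec%M}; unit=M ;;
    *G) num=${spec%G}; unit=G ;;
    *)  num=$spec;     unit=K ;;
  esac
  # Reject anything that is not a plain non-negative integer.
  case "$num" in
    ''|*[!0-9]*) echo "invalid STACKSIZE value: $spec" >&2; return 1 ;;
  esac
  case "$unit" in
    B) echo $((num / 1024)) ;;          # assumption: round down
    K) echo $((num)) ;;
    M) echo $((num * 1024)) ;;
    G) echo $((num * 1024 * 1024)) ;;
  esac
}

stacksize_to_kb 16M    # prints 16384
stacksize_to_kb 8192   # prints 8192
```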

5.4 Processor Binding

With processor binding, the programmer instructs the Operating System (Solaris) that a thread in the program should run on the same processor throughout the execution of the program.

Processor binding, when used with static scheduling, benefits applications that exhibit a particular data reuse pattern: data accessed by a thread in a parallel or worksharing region is still in the local cache from a previous invocation of a parallel or worksharing region.

From the hardware point of view, a computer system is composed of one or more physical processors. From the Operating System (Solaris) point of view, each of these physical processors maps to one or more virtual processors onto which threads in a program can be run. For example, each UltraSPARC IV physical processor has two cores. From the Solaris OS point of view, each of these cores is a virtual processor onto which a thread can be scheduled to run.

When the operating system binds threads to processors, they are in effect bound to specific virtual processors, not physical processors.

To bind threads in an OpenMP program to specific virtual processors, set the SUNW_MP_PROCBIND environment variable. The value specified for SUNW_MP_PROCBIND can be one of the following:

The string "TRUE" or "FALSE" (or lower case "true" or "false"). For example,
% setenv SUNW_MP_PROCBIND "false"

A non-negative integer. For example,
% setenv SUNW_MP_PROCBIND "2"

A list of two or more non-negative integers separated by one or more spaces. For example,
% setenv SUNW_MP_PROCBIND "0 2 4 6"

Two non-negative integers, n1 and n2, separated by a minus ("-"); n1 must be less than or equal to n2. For example,
% setenv SUNW_MP_PROCBIND "0-6"

Note that the non-negative integers referred to above denote logical identifiers (IDs). Logical IDs may be different from virtual processor IDs. The difference will be explained below.

Virtual Processor IDs:

Each virtual processor in a system has a unique processor ID. You can use the Solaris OS psrinfo(1M) command to display information about the processors in a system, including their processor IDs. Moreover, you can use the prtdiag(1M) command to display system configuration and diagnostic information.

On later Solaris releases, you can use psrinfo -pv to list all physical processors in the system and the virtual processors that are associated with each physical processor.

Virtual processor IDs may be sequential or there may be gaps in the IDs. For example, on a Sun Fire 4810 with 8 UltraSPARC IV processors (16 cores), the virtual processor IDs may be: 0, 1, 2, 3, 8, 9, 10, 11, 512, 513, 514, 515, 520, 521, 522, 523.

Logical IDs:

As mentioned above, the non-negative integers specified for SUNW_MP_PROCBIND are logical IDs. Logical IDs are consecutive integers that start with 0. If the number of virtual processors available in the system is n, then their logical IDs are 0, 1, ..., n-1, in the order presented by psrinfo(1M). The following Korn shell script can be used to display the mapping from virtual processor IDs to logical IDs.

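#!/bin/ksh
# Count the on-line virtual processors and collect the IDs listed by psrinfo.
NUMV=`psrinfo | fgrep "on-line" | wc -l`
set -A VID `psrinfo | cut -f1`
echo "Total number of on-line virtual processors = $NUMV"
echo
let "I=0"
let "J=0"
# Logical IDs are assigned consecutively, in the order presented by psrinfo.
while [[ $I -lt $NUMV ]]
do
  echo "Virtual processor ID ${VID[I]} maps to logical ID ${J}"
  let "I=I+1"
  let "J=J+1"
done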

On systems where a single physical processor maps to several virtual processors, it may be useful to know which logical IDs correspond to virtual processors that belong to the same physical processor. The following Korn shell script can be used with later Solaris releases to display this information.

#!/bin/ksh
NUMV=`psrinfo | grep "on-line" | wc -l`
set -A VLIST `psrinfo | cut -f1`
set -A CHECKLIST `psrinfo | cut -f1`
let "I=0"
while [ $I -lt $NUMV ]
do
  let "COUNT=0"
  SAMELIST="$I"
  let "J=I+1"
  while [ $J -lt $NUMV ]
  do
    # Skip entries already grouped with an earlier processor (marked -1).
    if [ ${CHECKLIST[J]} -ne -1 ]
    then
      # The script expects psrinfo -p to report 1 when both virtual
      # processors belong to the same physical processor.
      if [ `psrinfo -p ${VLIST[I]} ${VLIST[J]}` = 1 ]
      then
        SAMELIST="$SAMELIST $J"
        let "CHECKLIST[J]=-1"
        let "COUNT=COUNT+1"
      fi
    fi
    let "J=J+1"
  done
  if [ $COUNT -gt 0 ]
  then
    echo "The following logical IDs belong to the same physical processor:"
    echo "$SAMELIST"
    echo " "
  fi
  let "I=I+1"
done

Interpreting the Value Specified for SUNW_MP_PROCBIND:

If the value specified for SUNW_MP_PROCBIND is a non-negative integer, then that integer denotes the starting logical ID of the virtual processor to which threads should be bound. Threads will be bound to virtual processors in a round-robin fashion, starting with the processor with the specified logical ID, and wrapping around to the processor with logical ID 0, after binding to the processor with logical ID n-1.

If the value specified for SUNW_MP_PROCBIND is a list of two or more non-negative integers, then threads will be bound in a round-robin fashion to virtual processors with the specified logical IDs. Processors with logical IDs other than those specified will not be used.

If the value specified for SUNW_MP_PROCBIND is two non-negative integers separated by a minus ("-"), then threads will be bound in a round-robin fashion to virtual processors in the range that begins with the first logical ID and ends with the second logical ID. Processors with logical IDs other than those included in the range will not be used.

If the value specified for SUNW_MP_PROCBIND does not conform to one of the forms described above, or if an invalid logical ID is given, then an error message will be emitted and execution of the program will terminate.

Note that the number of threads created by the microtasking library, libmtsk, depends on environment variables, API calls in the user's program, and the num_threads clause. SUNW_MP_PROCBIND specifies the logical IDs of virtual processors to which the threads should be bound. Threads will be bound to that set of processors in a round-robin fashion. If the number of threads used in the program is less than the number of logical IDs specified by SUNW_MP_PROCBIND, then some virtual processors will not be used by the program. If the number of threads is greater than the number of logical IDs specified by SUNW_MP_PROCBIND, then some virtual processors will have more than one thread bound to them.
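The interpretation rules above can be sketched as a small shell model. It expands a numeric SUNW_MP_PROCBIND value (a single starting ID, a list, or a range) into the ordered set of logical IDs, then assigns threads to those IDs in round-robin order. The function names is_logical_id, expand_procbind, and show_binding are hypothetical, the processor count is passed in by hand, and the TRUE and FALSE forms are deliberately not modeled. This is an illustration of the described behavior, not the runtime library's implementation.

```shell
# is_logical_id ID NPROCS: succeed if ID is an integer in 0..NPROCS-1.
is_logical_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -lt "$2" ]
}

# expand_procbind VALUE NPROCS: print the logical IDs, in binding order,
# selected by a numeric SUNW_MP_PROCBIND value. Returns non-zero for a
# malformed value or an invalid logical ID (where the real runtime is
# described as emitting an error and terminating the program).
expand_procbind() {
  value=$1
  nprocs=$2
  ids=""
  case "$value" in
    *-*)    # range "n1-n2"
      lo=${value%%-*}
      hi=${value#*-}
      is_logical_id "$lo" "$nprocs" && is_logical_id "$hi" "$nprocs" || return 1
      [ "$lo" -le "$hi" ] || return 1
      i=$lo
      while [ "$i" -le "$hi" ]; do
        ids="$ids $i"
        i=$((i + 1))
      done
      ;;
    *" "*)  # explicit list "a b c"
      for id in $value; do
        is_logical_id "$id" "$nprocs" || return 1
        ids="$ids $id"
      done
      ;;
    *)      # single starting ID: wrap around to 0 after n-1
      is_logical_id "$value" "$nprocs" || return 1
      k=0
      while [ "$k" -lt "$nprocs" ]; do
        ids="$ids $(( (value + k) % nprocs ))"
        k=$((k + 1))
      done
      ;;
  esac
  echo $ids
}

# show_binding NTHREADS ID...: assign threads to the given IDs round-robin.
show_binding() {
  nthreads=$1
  shift
  [ $# -gt 0 ] || return 1
  t=0
  while [ "$t" -lt "$nthreads" ]; do
    for id in "$@"; do
      [ "$t" -lt "$nthreads" ] || break
      echo "thread $t -> logical ID $id"
      t=$((t + 1))
    done
  done
}

# Six threads over the list "0 2 4 6" on a hypothetical 8-processor system:
# threads 4 and 5 wrap around and share logical IDs 0 and 2.
show_binding 6 $(expand_procbind "0 2 4 6" 8)
```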

5.5 Stacks and Stack Sizes

The executing program maintains a main memory stack for the initial thread executing the program, as well as distinct stacks for each slave thread. Stacks are temporary memory address spaces used to hold arguments and automatic variables during invocation of a subprogram or function reference.

In general, the default main stack size is 8 megabytes. Compiling Fortran programs with the f95 -stackvar option forces the allocation of local variables and arrays on the stack as if they were automatic variables. For explicitly parallelized OpenMP programs, -stackvar is implied, because it improves the optimizer's ability to parallelize calls in loops. (See the Fortran User's Guide for a discussion of the -stackvar flag.) However, this may lead to stack overflow if not enough memory is allocated for the stack.

Use the limit C-shell command, or the ulimit ksh/sh command, to display or set the size of the main stack.

Each slave thread of an OpenMP program has its own thread stack. This stack mimics the initial (or main) thread stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 4 megabytes on 32-bit SPARC V8 and x86 platforms, and 8 megabytes on 64-bit SPARC V9 and x86 platforms. The size of the helper thread stack is set with the STACKSIZE environment variable.

demo% setenv STACKSIZE 16384   <-Set thread stack size to 16 MB (C shell)
demo$ STACKSIZE=16384          <-Same, using Bourne/Korn shell
demo$ export STACKSIZE

The best stack size might have to be determined by trial and error. If the stack size is too small for a thread to run, it may cause silent data corruption in neighboring threads, or segmentation faults. If you are unsure about stack overflows, compile your Fortran, C, or C++ programs with the -xcheck=stkovf flag to force a segmentation fault on stack overflow. This stops the program before any data corruption can occur.
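One way to organize the trial-and-error search is a loop that doubles STACKSIZE until the program completes successfully. The sketch below assumes the program exits with a non-zero status when a thread stack is too small, which is most dependable when the program was compiled with -xcheck=stkovf so that an overflow faults instead of corrupting data. The function name find_stacksize, the starting size, and the upper limit are all illustrative choices.

```shell
# find_stacksize START_KB MAX_KB COMMAND [ARGS...]
# Hypothetical helper: run COMMAND with STACKSIZE set to START_KB, doubling
# the value after each failed run. Prints the first size (in kilobytes) at
# which COMMAND exits successfully; fails if MAX_KB is exceeded first.
find_stacksize() {
  size=$1
  max=$2
  shift 2
  while [ "$size" -le "$max" ]; do
    if STACKSIZE=$size "$@" >/dev/null 2>&1; then
      echo "$size"
      return 0
    fi
    size=$((size * 2))
  done
  return 1
}

# Example use (a.out stands for your own OpenMP executable):
#   find_stacksize 4096 65536 ./a.out
```

A successful run shows only that the program completed at that size, so it is reasonable to leave some headroom above the value the loop reports.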