CHAPTER 5

Compiling for OpenMP
This chapter describes how to compile programs that utilize the OpenMP API.
To run a parallelized program in a multithreaded environment, you must set the OMP_NUM_THREADS environment variable prior to program execution. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set OMP_NUM_THREADS to a value no larger than the available number of processors on the target platform. Set OMP_DYNAMIC to FALSE to use the number of threads specified by OMP_NUM_THREADS.
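For example, under a Bourne or Korn shell the two variables can be set before launching the program. This is a sketch; a.out stands in for your compiled OpenMP program:

```shell
# Request at most 4 threads and pin the team size to that value
# by disabling dynamic thread adjustment.
export OMP_NUM_THREADS=4
export OMP_DYNAMIC=FALSE
# ./a.out    # run the OpenMP program with these settings in effect
echo "requested $OMP_NUM_THREADS threads, dynamic=$OMP_DYNAMIC"
```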
The compiler readme files contain information about limitations and known deficiencies of the OpenMP implementation. View a readme file directly by invoking the compiler with the -xhelp=readme flag, or by pointing an HTML browser to the documentation index for the installed software at
file:/opt/SUNWspro/docs/index.html
To enable explicit parallelization with OpenMP directives, compile your program with the cc, CC, or f95 option flag -xopenmp. This flag can take an optional keyword argument. (The f95 compiler accepts both -xopenmp and -openmp as synonyms.)
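As a sketch, typical compile lines for the C and Fortran front ends might look as follows. prog.c and prog.f are hypothetical source files, and the commands are assembled as strings so their shape is visible without the Sun compilers installed:

```shell
# Sample compile lines using the -xopenmp flag described above.
# (-xO3 is shown explicitly for the C line; flags beyond -xopenmp
# are illustrative assumptions, not requirements stated here.)
C_CMD="cc -xopenmp -xO3 -o prog prog.c"
F_CMD="f95 -xopenmp -o prog prog.f"
echo "$C_CMD"
echo "$F_CMD"
```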
The -xopenmp flag accepts the following keyword sub-options.
You can obtain a static, interprocedural validation of a Fortran 95 program's OpenMP directives by using the f95 compiler's global program checking feature. Enable OpenMP checking by compiling with the -XlistMP flag. (Diagnostic messages from -XlistMP appear in a separate file created with the name of the source file and a .lst extension.) The compiler diagnoses directive violations and parallelization inhibitors.
For example, compiling a source file ord.f with -XlistMP produces a diagnostic file, ord.lst.
In this example, the ORDERED directive in subroutine WORK receives a diagnostic that refers to the second DO directive because it lacks an ORDERED clause.
The OpenMP specifications define four environment variables that control the execution of OpenMP programs. Additional multiprocessing environment variables, which are not part of the OpenMP specifications, also affect execution of OpenMP programs. Both sets are summarized in the tables that follow.
SUNW_MP_WARN
Controls warning messages issued by the OpenMP runtime library. If set to TRUE, the runtime library issues warning messages to stderr; FALSE disables warning messages. The default is FALSE. The OpenMP runtime library can check for many common OpenMP violations, such as incorrect nesting and deadlocks, but runtime checking adds overhead to the execution of the program. See Runtime Warnings.
SUNW_MP_THR_IDLE
Controls the end-of-task status of each helper thread executing the parallel part of a program. You can set the value to SPIN, SLEEP, or SLEEP(time). The default is SLEEP: the thread sleeps after completing a parallel task until a new parallel task arrives. SLEEP(time) specifies how long a helper thread spin-waits after completing a parallel task. If a new task arrives while the thread is spinning, the thread executes it immediately; otherwise, the thread goes to sleep and is awakened when a new task arrives. time may be specified in seconds (ns, or just n) or milliseconds (nms). SLEEP with no argument puts the thread to sleep immediately after completing a parallel task; SLEEP, SLEEP(0), SLEEP(0s), and SLEEP(0ms) are all equivalent.
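As a sketch, assuming the Sun Studio variable name SUNW_MP_THR_IDLE for the control described above, the possible settings look like this (each export replaces the previous one; only the last is in effect):

```shell
# Helper threads spin-wait indefinitely after finishing a parallel task:
export SUNW_MP_THR_IDLE=SPIN
# Spin for 2 seconds, then sleep until the next task arrives:
export SUNW_MP_THR_IDLE="SLEEP(2s)"
# Sleep immediately after finishing a task (the default behavior):
export SUNW_MP_THR_IDLE=SLEEP
```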
SUNW_MP_PROCBIND
Binds LWPs (lightweight processes) of an OpenMP program to processors. Processor binding can enhance performance, but performance degrades if multiple LWPs are bound to the same processor. See Section 5.4, Processor Binding, for details.
SUNW_MP_MAX_POOL_THREADS
Specifies the maximum size of the thread pool. The thread pool contains only the non-user threads that the OpenMP runtime library creates; it does not contain the master thread or any threads created explicitly by the user's program. If this environment variable is set to zero, the thread pool is empty and all parallel regions are executed by one thread. The default, if not specified, is 1023. See Section 2.2, Control of Nested Parallelism, for details.
SUNW_MP_MAX_NESTED_LEVELS
Specifies the maximum depth of active nested parallel regions. Any parallel region whose active nesting depth is greater than the value of this environment variable is executed by only one thread. A parallel region is considered inactive if it has an IF clause that evaluates to false. The default, if not specified, is 4. See Section 2.2, Control of Nested Parallelism, for details.
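A sketch of limiting nested parallelism, assuming the Sun Studio variable names SUNW_MP_MAX_POOL_THREADS and SUNW_MP_MAX_NESTED_LEVELS for the pool-size and nesting-depth controls described above:

```shell
# Allow the runtime library to create at most 16 non-user threads...
export SUNW_MP_MAX_POOL_THREADS=16
# ...and serialize any parallel region nested more than 2 levels deep.
export SUNW_MP_MAX_NESTED_LEVELS=2
```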
STACKSIZE
Sets the stack size for each thread, in kilobytes. The default thread stack size is 4 megabytes on 32-bit SPARC V8 and x86 platforms, and 8 megabytes on 64-bit SPARC V9 and x86 platforms.
SUNW_MP_GUIDED_WEIGHT
Sets the weighting factor used to determine the size of chunks assigned to threads in loops with GUIDED scheduling. The value should be a positive floating-point number and applies to all loops with GUIDED scheduling in the program. If not set, the default value is 2.0.
Processor binding, when used along with static scheduling, benefits applications that exhibit a data reuse pattern in which the data accessed by a thread in a parallel region is still in that processor's local cache from a previous invocation of the parallel region.
By default, lightweight processes, LWPs, are not bound to processors. It is left up to the Solaris OS to schedule LWPs onto processors. The multitasking routines in the OpenMP runtime library, libmtsk, always use a one-to-one threading model; that is, each thread corresponds to a single LWP.
The value specified by the SUNW_MP_PROCBIND environment variable denotes the "logical" processor identifiers (IDs) to which the LWPs are to be bound. Logical processor IDs are consecutive integers that start with 0, and may or may not be identical to the actual processor IDs. If n processors are available online, their logical processor IDs are 0, 1, ..., n-1, in the order presented by psrinfo(1M).
The mapping between logical processor IDs and real processor IDs is dependent on the system. On most systems, real processor IDs are sequential; however, removing system boards may cause holes in the range. On some systems, IDs are in groups of 4 with gaps of 32 between the beginning of each group; thus processors would be numbered 0, 1, 2, 3, 32, 33, 34, 35 and so forth.
The number of threads created by libmtsk is determined by environment variables and/or API calls in the user's program. SUNW_MP_PROCBIND specifies a set of logical processors as described below. LWPs are bound to that set of logical processors in a cyclic fashion. If the number of LWPs is less than the number of processors, some processors have no LWPs bound to them; if the number of LWPs is greater than the number of processors, some processors have more than one LWP bound to them.
The value specified for SUNW_MP_PROCBIND can be one of the following:
If the value specified for SUNW_MP_PROCBIND is FALSE, then no processor binding is performed. This is the default behavior.
If the value specified for SUNW_MP_PROCBIND is TRUE, the effect is the same as specifying the integer 0.
If the value specified for SUNW_MP_PROCBIND is a non-negative integer, then that integer specifies the starting logical processor ID to which LWPs should be bound. LWPs will be bound to processors in a round-robin fashion, starting with the specified logical processor ID, and wrapping around to logical processor ID 0 after logical processor ID n-1.
If the value specified for SUNW_MP_PROCBIND is a list of two or more non-negative integers, then LWPs will be bound in a round-robin fashion to the specified logical processor IDs. No IDs other than those in the list will be used.
If the value specified for SUNW_MP_PROCBIND is two non-negative integers separated by a minus ("-"), then LWPs will be bound in a round-robin fashion to processors in the range that begins with the first logical processor ID and ends with the second logical processor ID. No IDs other than those mentioned in the range will be used.
If the value specified for SUNW_MP_PROCBIND does not conform to one of the forms described above, or if an invalid logical processor ID is given, then the environment variable SUNW_MP_PROCBIND will be ignored and LWPs will not be bound to processors. If warnings are enabled, a warning message will be issued in this case.
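The accepted forms described above can be summarized in shell terms. Each export stands alone as an alternative; only the last setting is in effect:

```shell
export SUNW_MP_PROCBIND=FALSE     # no binding (the default behavior)
export SUNW_MP_PROCBIND=TRUE      # same effect as specifying 0
export SUNW_MP_PROCBIND=2         # round-robin starting at logical ID 2
export SUNW_MP_PROCBIND="0 2 4"   # bind only to logical IDs 0, 2, and 4
export SUNW_MP_PROCBIND="0-3"     # bind to the range of logical IDs 0..3
```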
The executing program maintains a main memory stack for the initial thread executing the program, as well as distinct stacks for each slave thread. Stacks are temporary memory address spaces used to hold arguments and automatic variables during invocation of a subprogram or function reference.
In general, the default main stack size is 8 megabytes. Compiling Fortran programs with the f95 -stackvar option forces the allocation of local variables and arrays on the stack, as if they were automatic variables. Use of -stackvar is implied for explicitly parallelized OpenMP programs because it improves the optimizer's ability to parallelize calls in loops. (See the Fortran User's Guide for a discussion of the -stackvar flag.) However, -stackvar may lead to stack overflow if not enough memory is allocated for the stack.
Use the limit C-shell command, or the ulimit ksh/sh command, to display or set the size of the main stack.
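A sketch of inspecting the main stack limit with the Bourne/Korn shell builtin. The raise commands are shown as comments because a larger value may exceed the shell's hard limit:

```shell
# Display the current main stack limit (in kilobytes, or "unlimited"):
MAIN_STACK=$(ulimit -s)
echo "main stack limit: $MAIN_STACK"
# To raise the limit to, for example, 32 megabytes you would run:
#   ulimit -s 32768         # sh/ksh
#   limit stacksize 32768   # C shell equivalent
```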
Each slave thread of an OpenMP program has its own thread stack. This stack mimics the initial (or main) thread stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 4 megabytes on 32-bit SPARC V8 and x86 platforms, and 8 megabytes on 64-bit SPARC V9 and x86 platforms. The size of the helper thread stack is set with the STACKSIZE environment variable.
demo% setenv STACKSIZE 16384    <- Set thread stack size to 16 megabytes (C shell)

demo% STACKSIZE=16384           <- Same, using Bourne/Korn shell
demo% export STACKSIZE
The best stack size might have to be found by trial and error. A stack that is too small for a thread to run can cause silent data corruption in neighboring threads, or segmentation faults. If you are unsure about stack overflows, compile your Fortran, C, or C++ programs with the -xcheck=stkovf flag to force a segmentation fault on stack overflow. This stops the program before any data corruption can occur.
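For example, a Fortran compile line with overflow checking enabled might look like this. It is sketched as a string (prog.f is a hypothetical source file, and -stackvar is shown explicitly even though -xopenmp implies it for f95 per the discussion above):

```shell
# Compile with OpenMP support, stack-allocated locals, and
# stack-overflow checking enabled.
CMD="f95 -xopenmp -stackvar -xcheck=stkovf -o prog prog.f"
echo "$CMD"
```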
Copyright © 2004, Sun Microsystems, Inc. All Rights Reserved.