CHAPTER 2
This chapter discusses the features of OpenMP nested parallelism.
OpenMP uses a fork-join model of parallel execution. When a thread encounters a parallel construct, the thread creates a team composed of itself and zero or more additional threads. The encountering thread becomes the master of the new team. The other threads of the team are called slave threads of the team. All team members execute the code inside the parallel construct. When a thread finishes its work within the parallel construct, it waits at the implicit barrier at the end of the parallel construct. When all team members have arrived at the barrier, the threads can leave the barrier. The master thread continues execution of user code beyond the end of the parallel construct, while the slave threads wait to be summoned to join other teams.
OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.
The OpenMP runtime library maintains a pool of threads that can be used as slave threads in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, it checks the pool and takes idle threads from it, making them slave threads of the team. The master thread might get fewer slave threads than it needs if the pool does not contain enough idle threads. When the team finishes executing the parallel region, the slave threads return to the pool.
Nested parallelism can be controlled at runtime by setting various environment variables prior to execution of the program.
Nested parallelism can be enabled or disabled by setting the OMP_NESTED environment variable or by calling the omp_set_nested() function (see the OMP_SET_NESTED Routine section).
The following example shows a team of more than one thread executing a nested parallel region when nested parallelism is enabled.
void report_num_threads(int level)
Compiling and running this program with nested parallelism enabled produces the following output:
Compare with running the same program but with nested parallelism disabled:
The OpenMP runtime library maintains a pool of threads that can be used as slave threads in parallel regions. Setting the SUNW_MP_MAX_POOL_THREADS environment variable controls the number of threads in the pool. The default value is 1023.
The thread pool consists of only non-user threads that the runtime library creates. It does not include the master thread or any thread created explicitly by the user's program. If this environment variable is set to zero, the thread pool will be empty and all parallel regions will be executed by one thread.
The following example shows that a parallel region can get fewer threads if there are not sufficient threads in the pool. The code is the same as above. For all the parallel regions to be active at the same time, 8 threads are needed, so the pool must contain at least 7 idle threads (the initial thread does not come from the pool). If SUNW_MP_MAX_POOL_THREADS is set to 5, two of the four innermost parallel regions may not be able to get all the slave threads they ask for. One possible result is shown below.
The SUNW_MP_MAX_NESTED_LEVELS environment variable controls the maximum depth of nested parallel regions that require more than one thread.
Any parallel region that has an active nested depth greater than the value of this environment variable will be executed by only one thread. A parallel region whose IF clause evaluates to false is not considered active.
The following code will create 4 levels of nested parallel regions. If SUNW_MP_MAX_NESTED_LEVELS is set to 2, then the nested parallel regions at nesting depths 3 and 4 are executed single-threaded.
Compiling and running this program with a maximum nesting level of 4 gives the following possible output. (Actual results will depend on how the OS schedules threads.)
Running with the nesting level set at 2 gives the following as a possible result:
Calls to the OpenMP 'set' and 'get' routines, such as omp_set_nested() and omp_get_nested(), within nested parallel regions deserve some discussion.
The 'set' calls affect only the parallel regions at the same or inner nesting levels encountered by the calling thread. They do not affect parallel regions encountered by other threads, and they do not affect parallel regions the calling thread will later encounter in any outer levels.
The 'get' calls will return the values set by the calling thread. When a team is created, the slave threads will inherit the values from the master thread.
Compiling and running this program gives the following as one possible result:
For example, suppose you have a program that contains two levels of parallelism, and the degree of parallelism at each level is 2. Your system has four CPUs, and you want to use all four to speed up the execution of this program. Parallelizing only one level will use just two CPUs, so you want to parallelize both levels.
For example, suppose you have a program that contains two levels of parallelism. The degree of parallelism at the outer level is 4 and the load is balanced. You have a system with four CPUs and want to use all four CPUs to speed up the execution of this program. Then, in general, using all 4 threads for the outer level could yield better performance than using 2 threads for the outer parallel region and the other 2 threads as slave threads for the inner parallel regions.