CHAPTER 4

Parallel Processing

This chapter describes using the Sun Performance Library in multiprocessor environments.


4.1 Run-Time Issues

At run time, if running with compiler parallelization, Sun Performance Library uses the same pool of threads that the compiler does. The per-thread stack size must be set to at least 4 Mbytes on 32-bit platforms and 8 Mbytes on 64-bit platforms. This is done with the STACKSIZE environment variable (units in Kbytes). To set the per-thread stack size to 4 Mbytes in a 32-bit environment:


my_host% setenv STACKSIZE 4000

To set the per-thread stack size to 8 Mbytes in a 64-bit environment:


my_host% setenv STACKSIZE 8000

Setting the STACKSIZE environment variable is not required for programs running with POSIX or Solaris threads. In this case, user-created threads that call Sun Performance Library routines must have a stack size of at least 4 Mbytes. An inadequate stack size might cause stack overflow, whose symptoms include run-time failures that can be difficult to diagnose. For more information on setting the stack size of user-created threads, see the pthread_create(3THR), pthread_attr_init(3THR), and pthread_attr_setstacksize(3THR) man pages for POSIX threads or the thr_create(3THR) man page for Solaris threads.
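As an illustration of the stack-size requirement for user-created POSIX threads, the following C sketch creates one thread with an explicit 4 Mbyte stack. The names PERF_LIB_MIN_STACK, worker, and run_worker_with_large_stack are hypothetical and chosen only for this example; a real worker would call Sun Performance Library routines where the comment indicates.

```c
#include <pthread.h>
#include <stddef.h>

/* 4 Mbytes: the minimum stack size stated above for threads that
 * call Sun Performance Library routines. (Hypothetical macro name.) */
#define PERF_LIB_MIN_STACK ((size_t)4 * 1024 * 1024)

/* Hypothetical worker. A real program would call Sun Performance
 * Library routines here; this sketch only records that it ran. */
static void *worker(void *arg)
{
    int *done = arg;
    *done = 1;
    return NULL;
}

/* Create one thread with an explicit 4 Mbyte stack and wait for it.
 * Returns 0 on success or a pthread error code on failure. */
int run_worker_with_large_stack(int *done)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Request the larger stack before the thread is created. */
    rc = pthread_attr_setstacksize(&attr, PERF_LIB_MIN_STACK);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, done);

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}
```

A caller would pass the address of an int flag and check for a zero return value. Solaris threads take the stack size as an argument to thr_create instead; see the man page cited above.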


4.2 Degree of Parallelism

Selected routines in the Sun Performance Library are parallelized using compiler directives, library routines, and environment variables from the OpenMP Fortran Application Program Interface. The number of threads these routines use is controlled by the environment variable OMP_NUM_THREADS, which the user sets at run time. The PARALLEL environment variable can also be used, but if both are set they must have the same value; otherwise, a fatal error occurs at execution. Both environment variables can be overridden by calling the Sun Performance Library routine USE_THREADS or the OpenMP routine OMP_SET_NUM_THREADS in the user code.
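Because a mismatch between OMP_NUM_THREADS and PARALLEL is fatal at execution, a program or launch wrapper might check the two variables before any parallel work starts. The following C sketch shows one way to do that. The function names are hypothetical, and comparing the values as decimal integers is an assumption of this example, not a statement of how the run-time system compares them.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the two settings are compatible
 * (at most one is set, or both hold the same integer value) and 0
 * otherwise. Comparing as decimal integers is an assumption. */
int thread_env_consistent(const char *omp, const char *par)
{
    if (omp == NULL || par == NULL)
        return 1;   /* only one (or neither) is set: no conflict */
    return strtol(omp, NULL, 10) == strtol(par, NULL, 10);
}

/* Hypothetical pre-flight check against the current environment. */
int check_thread_env(void)
{
    return thread_env_consistent(getenv("OMP_NUM_THREADS"),
                                 getenv("PARALLEL"));
}
```

A program could call check_thread_env() early in main and exit with its own diagnostic message if the result is 0.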

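A user code can be parallelized by compiling with -xopenmp=parallel, for code that contains OpenMP directives, or with -xautopar, for automatic parallelization by the compiler.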

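The Sun Performance Library routines execute in parallel if the number of threads is greater than 1, whether set through OMP_NUM_THREADS (or PARALLEL) or by a call to USE_THREADS or OMP_SET_NUM_THREADS, and the routine is not called from within a parallel region.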

The Sun Performance Library employs OpenMP directives in its parallelization and does not support nested parallelism. If the user code is parallelized as stated above, when a Sun Performance Library routine is called it will execute in serial if it detects that it is being called from a parallel region; otherwise, it will execute in parallel.

POSIX or Solaris threads can also be created to execute selected regions of the user code in parallel. When called under this parallel model, a Sun Performance Library routine cannot detect that it is being called from a parallel region. Therefore, the environment variable OMP_NUM_THREADS must be set to 1 (or must be unset), or a call to USE_THREADS(1) must be made in appropriate places in the user code. Otherwise, nested parallelism with undefined results will occur.

For example, if the program containing the following code segment is linked with -xopenmp=parallel and OMP_NUM_THREADS is set to 4, the loop will execute in parallel, and there will be four instances of DGEMM running concurrently. However, each DGEMM instance will run in serial since only one level of parallelization is supported.


!$OMP PARALLEL DO
    DO I = 1, N
       CALL DGEMM(...)
    END DO
!$OMP END PARALLEL DO

In the following code example, if the program is not linked with -xautopar, the loop will not be parallelized, but with OMP_NUM_THREADS set to 4, each call to DGEMM will be executed by four threads.


    DO I = 1, N
      CALL DGEMM(...)
    END DO

If the program containing the following code segment is linked with -xopenmp=parallel and if OMP_NUM_THREADS is set to a value greater than 1, the region shown will be executed by a single thread. However, each DGEMM call will be executed by OMP_NUM_THREADS threads.


!$OMP SINGLE
    DO I = 1, N
      CALL DGEMM(...)
    END DO
!$OMP END SINGLE

In the following code example, there will be at most two-way parallelism, regardless of the number of OpenMP threads available for execution. Only one level of parallelism exists: the two sections. Further parallelism within a DGEMM call is suppressed.


!$OMP PARALLEL SECTIONS
!$OMP SECTION
    DO I = 1, N / 2
      CALL DGEMM(...)
    END DO
!$OMP SECTION
    DO I = N / 2 + 1, N
      CALL DGEMM(...)
    END DO
!$OMP END PARALLEL SECTIONS


4.3 Synchronization Mechanisms

One characteristic of the POSIX/Solaris threading model is that bound threads of a running application relinquish the CPUs when they are idle, thus providing good throughput and resource usage in a shared (over-subscribed) environment. By default, bound threads in a compiler-parallelized code spin-wait when they are idle, which can result in suboptimal throughput when other applications in the system are competing for CPU resources. In this case, the SUNW_MP_THR_IDLE environment variable can be used to control the behavior of a thread after it finishes its share of a parallel job:


my_host% setenv SUNW_MP_THR_IDLE value

Here, value can be spin, sleep n s, or sleep n ms; spin is the default. The sleep setting puts the thread to sleep after spin-waiting for n units, where the unit can be seconds (s, the default unit) or milliseconds (ms). sleep with no arguments puts the thread to sleep immediately after it completes a parallel task. If SUNW_MP_THR_IDLE is not set or contains an illegal value, spin is used.

The following settings would cause threads to spin-wait (default behavior), spin for 2 seconds before sleeping, or spin for 100 milliseconds before sleeping, respectively. Using Sun Performance Library routines does not change the spin-wait behavior of the code.


my_host% setenv SUNW_MP_THR_IDLE spin
my_host% setenv SUNW_MP_THR_IDLE 2s
my_host% setenv SUNW_MP_THR_IDLE 100ms


4.4 Parallel Processing Examples

This section demonstrates using the OMP_NUM_THREADS environment variable along with compile and link options to create code that executes serially or in parallel.

To create a serial application:

The following example shows how to compile and link with the shared Sun Performance Library, libsunperf.so.


my_host% cc -xmemalign=8s -xarch=native any.c -xlic_lib=sunperf

or


my_host% f95 -dalign -xarch=native any.f95 -xlic_lib=sunperf

To create a parallel application that executes on multiple processors:

For example, to use 24 processors, type the following commands:


my_host% f95 -dalign -xarch=native my_app.f -xlic_lib=sunperf
my_host% setenv OMP_NUM_THREADS 24
my_host% ./a.out

The previous example allows Sun Performance Library routines to run in parallel, but no part of the user code my_app.f will run in parallel. For the compiler to attempt to parallelize my_app.f, either -xopenmp=parallel or -xautopar is required on the compile line:


my_host% f95 -dalign -xopenmp=parallel my_app.f -xlic_lib=sunperf
my_host% setenv OMP_NUM_THREADS 24
my_host% ./a.out