CHAPTER 8

Using the DTrace Utility With Open MPI

This chapter describes how to use the Solaris™ Dynamic Tracing (DTrace) utility with Open MPI. DTrace is a comprehensive dynamic tracing utility that you can use to monitor the behavior of application programs as well as the operating system itself. You can use DTrace on live production systems to understand those systems’ behavior and to track down any problems that might be occurring.

The D language is the programming language used to create the source code for DTrace programs.

The content of this chapter assumes knowledge of the D language and how to use DTrace.

The following topics are covered in this chapter:

Checking the mpirun Privileges
Running DTrace with MPI Programs
Simple MPI Tracing
Tracking Down Resource Leaks
Using the DTrace mpiperuse Provider

For more information about the D language and DTrace, refer to the Solaris Dynamic Tracing Guide (Part Number 817-6223). This guide is part of the Solaris 10 OS Software Developer Collection.

Solaris 10 OS documentation can be found on the web at the following location:

http://www.sun.com/documentation

Follow these links to the Solaris Dynamic Tracing Guide:

Solaris Operating Systems -> Solaris 10 -> Solaris 10 Software Developer Collection



Note - The programs and scripts mentioned in the sections that follow are located at:

/opt/SUNWhpc/examples/mpi/dtrace



Checking the mpirun Privileges

Before you run a program under DTrace, you need to make sure that you have the correct mpirun privileges.

To run a DTrace script under mpirun, make sure that you have the dtrace_proc and dtrace_user privileges. Otherwise, DTrace returns the following error because it does not have sufficient privileges:


dtrace:  failed to initialize dtrace:  DTrace requires additional privileges


To Determine the Correct Privileges on the Cluster

To determine whether you have the appropriate privileges on the entire cluster, perform the following steps:

1. Use your favorite text editor to create the following shell script, called mpppriv.sh:


#!/bin/sh 
# mpppriv.sh -  run ppriv under a shell so you can get the privileges 
#		of the process that mpirun creates 
ppriv $$

2. Type the following command, replacing host1 and host2 with the names of hosts in your cluster:


% mpirun -np 2 --host host1,host2 mpppriv.sh

If the output of ppriv shows that the E privilege set has the dtrace privileges, then you will be able to run dtrace under mpirun (see the following two examples). Otherwise, you must adjust your system to get dtrace access.

The following example shows the output from ppriv when the privileges have not been set:


% ppriv $$ 
4084:  -csh 
flags = <none> 
	E:  basic 
	I:  basic 
	P:  basic 
	L:  all

This example shows ppriv output when the privileges have been set:


% ppriv $$ 
2075:  tcsh 
flags = <none> 
	E:basic,dtrace_proc,dtrace_user 
	I:basic,dtrace_proc,dtrace_user 
	P:basic,dtrace_proc,dtrace_user 
	L:  all



Note - To update your privileges, ask your system administrator to add the dtrace_user and dtrace_proc privileges to your account in the /etc/user_attr file.


After the privileges have been changed, you can use the ppriv command to view the changed privileges.
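
For reference, a defaultpriv entry in /etc/user_attr that grants these privileges might look like the following sketch (joeuser is a placeholder user name; your administrator might prefer to make the equivalent change with the usermod command):


joeuser::::defaultpriv=basic,dtrace_proc,dtrace_user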


Running DTrace with MPI Programs

There are two ways to use dynamic tracing with MPI programs:

Run the MPI program directly under DTrace (see Running an MPI Program Under DTrace).
Attach DTrace to a running MPI program (see Attaching DTrace to a Running MPI Program).

Running an MPI Program Under DTrace

For illustration purposes, assume you have a program named mpiapp.


To Trace a Program Using the mpitrace.d Script

Type the following command:


% mpirun -np 4 dtrace -s mpitrace.d -c mpiapp

The advantage of tracing an MPI program in this way is that all the processes in the job will be traced from the beginning. This method is probably most useful in doing performance measurements, when you need to start at the beginning of an application and you need all the processes in a job to participate in collecting data.

This approach also has some disadvantages. One disadvantage of running a program as shown in the above example is that all the tracing output for all four processes is directed to standard output (stdout). One way around this problem is to create a script similar to the one in the following section:


To Trace a Parallel Program and Get Separate Trace Files

1. Create a shell script (called partrace.sh in this example) similar to the following:


#!/bin/sh 
# partrace.sh - a helper script to dtrace Open MPI jobs from the 
#		start of the job.  
dtrace -s $1 -c $2 -o $2.$OMPI_COMM_WORLD_RANK.trace

2. Type the following command to run the partrace.sh shell script:


% mpirun -np 4 partrace.sh mpitrace.d mpiapp

This will run mpiapp under dtrace using the mpitrace.d script. The script saves the trace output for each process in a job under a separate file name, based on the program name and rank of the process. Note that subsequent runs will append the data into the existing trace files.
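
Because trace data is appended, you might want to remove stale trace files before starting a new run. For example, assuming the trace files from the previous run were written to the current working directory:


% rm -f mpiapp.*.trace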



Note - The status of the OMPI_COMM_WORLD_RANK environment variable is unstable and subject to change. Use this variable with caution.


Attaching DTrace to a Running MPI Program

The second way to use dtrace with Open MPI is to attach dtrace to a running MPI program.


To Attach DTrace to a Running MPI Program

Perform the following procedure:

1. Log in to the node in which you are interested.

2. Type a command similar to the following to get the process IDs (PIDs) of the running program on the node of interest:


% prstat 0 1 | grep mpiapp
24768 joeuser     526M 3492K sleep   59    0   0:00:08 0.1% mpiapp/1
24770 joeuser     518M 3228K sleep   59    0   0:00:08 0.1% mpiapp/1
 

3. Decide which rank you want to use to attach dtrace.

The lower PID number is usually the lower rank on the node.
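
If you would rather not rely on PID ordering, you can confirm a rank by inspecting the process environment, assuming the OMPI_COMM_WORLD_RANK environment variable is set for the process (as the partrace.sh script above also assumes):


% pargs -e 24770 | grep OMPI_COMM_WORLD_RANK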

4. Type the following command to attach to the rank 1 process (identified by its process ID, which is 24770 in the example) and run the DTrace script mpitrace.d:


% dtrace -p 24770 -s mpitrace.d

Simple MPI Tracing

DTrace enables you to easily trace programs. When used with MPI and the more than 200 functions defined in the MPI standard, DTrace provides an easy way to determine which functions might be in error during debugging, or which functions are otherwise of interest. After you determine the function showing the error, it is easy to locate the desired job, process, and rank on which to run your scripts. As demonstrated above, DTrace allows you to make these determinations while the program is running.

Although the MPI standard provides the MPI profiling interface (PMPI), using DTrace offers a number of advantages, including the following:

You do not need to recompile or relink your application in order to trace it.
You can attach DTrace to a program that is already running.
DTrace scripts are easy to modify, so you can quickly refocus tracing on the functions of interest.

The following example shows a simple script that traces the entry and exit into all the MPI API calls.


mpitrace.d:
pid$target:libmpi:MPI_*:entry
{
printf("Entered %s...", probefunc);
}

pid$target:libmpi:MPI_*:return
{
printf("exiting, return value = %d\n", arg1);
}

When you use this example script to attach DTrace to a job that performs send and recv operations, the output looks similar to the following:


% dtrace -q -p 24770 -s mpitrace.d 
Entered MPI_Send...exiting, return value = 0
Entered MPI_Recv...exiting, return value = 0 
Entered MPI_Send...exiting, return value = 0 
Entered MPI_Recv...exiting, return value = 0 
Entered MPI_Send...exiting, return value = 0 ...

You can easily modify the mpitrace.d script to include an argument list. The resulting output resembles truss output. For example:


mpitruss.d:
pid$target:libmpi:MPI_Send:entry,
pid$target:libmpi:MPI_*send:entry,
pid$target:libmpi:MPI_Recv:entry,
pid$target:libmpi:MPI_*recv:entry
{
printf("%s(0x%x, %d, 0x%x, %d, %d, 0x%x)", probefunc, arg0, arg1, arg2, arg3, arg4, arg5);
}
pid$target:libmpi:MPI_Send:return,
pid$target:libmpi:MPI_*send:return,
pid$target:libmpi:MPI_Recv:return,
pid$target:libmpi:MPI_*recv:return
{
printf("\t\t = %d\n", arg1);
}

The mpitruss.d script shows how you can specify wildcard names to match functions. Together, the probe clauses match all send- and receive-type function calls in the MPI library. The entry clause shows how to use the built-in arg variables to print the argument list of the function being traced.

Take care when wildcarding the entry points and formatting the argument output, because you could end up printing either too many or too few arguments for certain functions. For example, in the above case, the MPI_Irecv and MPI_Isend functions will not have their request handle parameters printed out.
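
If you do want the request handles, one option (a sketch, not part of the mpitruss.d script above) is to give the nonblocking calls their own entry clause. This assumes the standard MPI_Isend and MPI_Irecv signatures, in which the request pointer is the seventh argument (arg6):


pid$target:libmpi:MPI_Isend:entry,
pid$target:libmpi:MPI_Irecv:entry
{
/* print the usual six arguments plus the MPI_Request pointer (arg6) */
printf("%s(0x%x, %d, 0x%x, %d, %d, 0x%x, req=0x%x)", probefunc, arg0, arg1, arg2, arg3, arg4, arg5, arg6);
}

Note that the wildcard MPI_*send and MPI_*recv probes in mpitruss.d also match MPI_Isend and MPI_Irecv, so if you add this clause to that script you will see duplicate lines for those calls unless you narrow the wildcards.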

The following example shows a sample output of the mpitruss.d script:


% dtrace -q -p 24770 -s mpitruss.d 
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0 
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0 
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0 ...


Tracking Down Resource Leaks

One of the biggest issues with programming is the unintentional leaking of resources (such as memory). With MPI, tracking and repairing resource leaks can be somewhat more challenging because the objects being leaked are in the middleware, and thus are not easily detected by the use of memory checkers.

DTrace helps with debugging such problems using variables, the profile provider, and a callstack function. The mpicommcheck.d script (shown in the example below) probes for all the MPI communicator calls that allocate and deallocate communicators, and keeps track of the stack each time the function is called. Every 10 seconds the script dumps out the current count of MPI communicator calls and the total calls for the allocation and deallocation of communicators. When the dtrace session ends (usually by pressing Ctrl-C, if you attached to a running MPI program), the script will print out the totals and all the different stack traces, as well as the number of times those stack traces were reached.

In order to perform these tasks, the script uses DTrace features such as variables, associative arrays, built-in functions (count, ustack) and the predefined variable probefunc.

The following example shows the mpicommcheck.d script.


mpicommcheck.d:  
BEGIN 
{ 
  allocations = 0; 
  deallocations = 0; 
  prcnt = 0; 
}
 
pid$target:libmpi:MPI_Comm_create:entry, 
pid$target:libmpi:MPI_Comm_dup:entry,
pid$target:libmpi:MPI_Comm_split:entry 
{ 
  ++allocations; 
  @counts[probefunc] = count(); 
  @stacks[ustack()] = count(); 
}
 
pid$target:libmpi:MPI_Comm_free:entry 
{ 
  ++deallocations; 
  @counts[probefunc] = count(); 
  @stacks[ustack()] = count(); 
}
 
profile:::tick-1sec 
/++prcnt > 10/ 
{
  printf("=====================================================================");
  printa(@counts); 
  printf("Communicator Allocations = %d \n", allocations);
  printf("Communicator Deallocations = %d\n", deallocations); 
  prcnt = 0; 
}
 
END 
{ 
  printf("Communicator Allocations = %d, Communicator Deallocations = %d\n",
  	allocations, deallocations); 
}

Use this script to monitor a program, or a section of a program, that you suspect of leaking resources. If, while the script runs, you see the printed totals for allocations and deallocations steadily diverge, you might have a resource leak. Depending on how your program is designed, it might take some time and observation of the allocation/deallocation totals to determine definitively that the code contains a resource leak. Once you determine that a resource leak is occurring, press Ctrl-C to break out of the dtrace session. Then, using the stack traces that are dumped, you can try to determine where the leak occurs.

The following example shows code containing a resource leak, and the output that is displayed using the mpicommcheck.d script.

The sample MPI program containing the resource leak is called mpicommleak. This program performs three MPI_Comm_dup operations and two MPI_Comm_free operations. The program thus “leaks” one communicator operation with each iteration of a loop.

When you attach dtrace to mpicommleak using the mpicommcheck.d script above, you will see a 10-second periodic output. This output shows that the count of the allocated communicators is growing faster than the count of deallocations.

When you finally end the dtrace session by pressing Ctrl-C, the session outputs a total of five stack traces, showing the three distinct MPI_Comm_dup and two distinct MPI_Comm_free call stacks, as well as the number of times each stack trace was reached.

For example:


% prstat 0 1 | grep mpicommleak
 24952 joeuser    518M 3212K sleep   59    0   0:00:01 1.8% mpicommleak/1
 24950 joeuser    518M 3212K sleep   59    0   0:00:00 0.2% mpicommleak/1
% dtrace -q -p 24952  -s mpicommcheck.d
=====================================================================
  MPI_Comm_free                                                     4
  MPI_Comm_dup                                                      6
Communicator Allocations = 6
Communicator Deallocations = 4
=====================================================================
  MPI_Comm_free                                                     8
  MPI_Comm_dup                                                     12
Communicator Allocations = 12
Communicator Deallocations = 8
=====================================================================
  MPI_Comm_free                                                    12
  MPI_Comm_dup                                                     18
Communicator Allocations = 18
Communicator Deallocations = 12
^C
Communicator Allocations = 21, Communicator Deallocations = 14
 
libmpi.so.0.0.0`MPI_Comm_free
              mpicommleak`deallocate_comms+0x19
              mpicommleak`main+0x6d
              mpicommleak`0x805081a
                7
 
              libmpi.so.0.0.0`MPI_Comm_free
              mpicommleak`deallocate_comms+0x26
              mpicommleak`main+0x6d
              mpicommleak`0x805081a
                7
 
              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x1e
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7
 
              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x30
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7
 
              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x42
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7
 


Using the DTrace mpiperuse Provider

PERUSE is an MPI interface that allows you to obtain detailed information about the performance and interactions of processes, software, and MPI. PERUSE provides a greater level of detail about process performance than does the standard MPI profiling interface (PMPI).

For more information about PERUSE and the current PERUSE specification, see:

http://www.mpi-peruse.org

Open MPI includes a DTrace provider named mpiperuse. This provider enables Open MPI to be configured with support for DTrace probes in the Open MPI shared library, libmpi.

DTrace Support in the ClusterTools Software

In Sun HPC ClusterTools 8.2.1c software, there are preconfigured executables and libraries with the mpiperuse provider probes built in. They are located in the /opt/SUNWhpc/HPC8.2.1c/sun/instrument directory. Use the wrappers and utilities located in this directory to access the mpiperuse provider.



Note - No recompilation is necessary in order to use the mpiperuse provider. Just run the application to be DTraced using /opt/SUNWhpc/HPC8.2.1c/sun/instrument/bin/mpirun.


Available mpiperuse Probes

The DTrace mpiperuse probes expose the events specified in the current PERUSE specification. These events track the life cycle of requests within the MPI library. For more information about this life cycle and the actual events provided by PERUSE, see Section 4 of the PERUSE Specification.

Sections 4.3.1 and 4.4 of the PERUSE Specification list and describe the individual events exposed by PERUSE.

The mpiperuse provider makes these events available to DTrace. The probe names correspond to the event names listed in Sections 4.3.1 and 4.4 of the PERUSE specification. For each event, the corresponding probe name is similar, except that the leading PERUSE is removed, the probe name is all lowercase, and underscores are replaced with hyphens. For example, the probe for PERUSE_COMM_MSG_ARRIVED is comm-msg-arrived.

All of the probes are classified under the mpiperuse provider. This means that you can find the probe names by looking under the mpiperuse provider name. It also means that a single DTrace probe description can match all of the probes simply by specifying the mpiperuse provider and leaving the probe name as a wildcard.
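
For example, the following D fragment (a sketch) names only the mpiperuse provider, leaving the probe name as an implicit wildcard, and counts every mpiperuse probe that fires, keyed by the built-in probename variable:


mpiperuse$target:::
{
@probes[probename] = count();
}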

Specifying an mpiperuse Probe in a D Script

In the D scripting language, specifying an mpiperuse provider takes the following form:


mpiperuse$target:::probe-name

where probe-name is the name of the mpiperuse probe you want to use.

For example, to specify a probe to capture a PERUSE_COMM_REQ_ACTIVATE event, add the following line to a D script:


 mpiperuse$target:::comm-req-activate

This alerts DTrace that you want to use the mpiperuse provider to capture the PERUSE_COMM_REQ_ACTIVATE event. In this example, the optional module and function fields in the probe description are omitted. This directs DTrace to match all occurrences of the comm-req-activate probe in the MPI library and its plugins, instead of a specific probe. This is necessary because certain probes can fire in multiple places in the MPI library.
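
As a small sketch of a complete clause, the following counts comm-req-activate events by direction, using the op argument described in TABLE 8-1 below (0 for a send request, 1 for a receive request):


mpiperuse$target:::comm-req-activate
{
@activations[args[2] == 0 ? "send" : "recv"] = count();
}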

For more information about the D language and its syntax, refer to the Solaris Dynamic Tracing Guide (Part Number 817-6223). This guide is part of the Solaris 10 OS Software Developer Collection.

Available Arguments

All of the mpiperuse probes receive the following arguments:


TABLE 8-1 Available mpiperuse Arguments

args[0] = mpiconninfo_t *i

The basic source and destination for the request, and the protocol expected to be used for the transfer. This typedef is defined in /usr/lib/dtrace/mpi.d.

args[1] = uintptr_t uid

The PERUSE unique ID for the request that fired the probe (as defined by the PERUSE specification). For Open MPI, this is the address of the actual request.

args[2] = uint_t op

Indicates whether the probe fired for a send (op == 0) or receive (op == 1) request.

args[3] = mpicomm_spec_t *cs

A structure defined in /usr/lib/dtrace/mpi.d that mimics the spec structure defined on page 22 of the PERUSE specification.
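
As a sketch that combines these arguments, the following clause builds a running total of message sizes by remote peer and direction at transfer completion. It assumes, as the quantize examples later in this chapter do, that mcs_count reflects the size of the transferred message:


mpiperuse$target:::comm-req-xfer-end
{
@msgsize[args[0]->ci_remote, args[2] == 0 ? "send" : "recv"] = sum(args[3]->mcs_count);
}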


How To Use mpiperuse Probes to See Message Queues

To use the mpiperuse provider, make reference to the appropriate mpiperuse provider probes and arguments in a DTrace script, as you would for any other provider (such as the pid provider).

The procedure for running scripts with mpiperuse probes follows the same steps as those shown in Running an MPI Program Under DTrace and Attaching DTrace to a Running MPI Program, except that you must edit the partrace.sh script before you run it.

Change partrace.sh to include a -Z switch after the dtrace command, as shown in the following example.


#!/bin/sh 
# partrace.sh - a helper script to dtrace Open MPI jobs from the 
#		start of the job.  
dtrace -Z -s $1 -c $2 -o $2.$OMPI_COMM_WORLD_RANK.trace

This change allows probes that do not exist at initial load time to be used in a script (that is, the probes are in plugins that have not been dlopened).

The following example shows how to use the mpiperuse probes when running a DTrace script. It uses the example script provided in /opt/SUNWhpc/HPC8.2.1c/sun/examples/dtrace/mpistat.d.

1. Compile and run a test program.

In this example, the test program is called dtest.c. Substitute the name and path of your program for dtest.c.


% /opt/SUNWhpc/HPC8.2.1c/sun/instrument/bin/mpicc ~myhomedir/scraps/usdt/examples/dtest.c -o dtest
% /opt/SUNWhpc/HPC8.2.1c/sun/instrument/bin/mpirun -np 2 dtest
Initing MPI...
Initing MPI...
Do communications...
Do communications...
attach to pid 13371 to test tracing.

2. In another window, type the following command:


% dtrace -q -p 13371 -s /opt/SUNWhpc/HPC8.2.1c/sun/examples/dtrace/mpistat.d
input(Total) Q-sizes      Q-Matches            output
bytes active posted unexp posted unexp         bytes active
    0      0      0     0      0     0          0      0 
    0      0      0     0      0     0          0      0 
    0      0      0     0      0     0          0      0 
    0      0      0     0      0     0          0      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      1     0          5      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      2     0         10      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 
    0      0      0     0      3     0         15      0 

 

mpiperuse Usage Examples

The examples in this section show how to perform the described DTrace operations from the command line.


To Count the Number of Messages To or From a Host

Issue the following DTrace command, substituting the process ID of the process you want to monitor for pid:


dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_remote] = count(); }'

DTrace returns a result similar to the following. In this example, the process ID is 25428 and the host name is joe-users-host2.


% dtrace -p 25428 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_remote] = count();}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
joe-users-host2 recv 3
joe-users-host2 send 3


To Count the Number of Messages To or From Specific BTLs

Issue the following DTrace command, substituting the process ID of the process you want to monitor for pid:


dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_protocol] = count(); }'

DTrace returns a result similar to the following. In this example, the process ID is 25445.


% dtrace -p 25445 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_protocol] = count();}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
 
  sm 60


To Obtain Distribution Plots of Message Sizes Sent or Received From a Host

Issue the following DTrace command, substituting the process ID of the process you want to monitor for pid:


dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_remote] = quantize(args[3]->mcs_count); }'

DTrace returns a result similar to the following. In this example, the process ID is 25445.


% dtrace -p 25445 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_remote] = quantize(args[3]->mcs_count);}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
 
  myhost 
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4        
               8 |                                         0        


To Create Distribution Plots of Message Sizes By Communicator, Rank, and Send/Receive

Issue the following DTrace command, substituting the process ID of the process you want to monitor for pid:


dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end {@[args[3]->mcs_comm, args[3]->mcs_peer, args[3]->mcs_op] = quantize(args[3]->mcs_count);}'

DTrace returns a result similar to the following. In this example, the process ID is 24937.


% dtrace -p 24937 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[3]->mcs_comm, args[3]->mcs_peer, args[3]->mcs_op] = quantize(args[3]->mcs_count);}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 19 probes
^C
        134614864        1  recv                                              
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9        
               8 |                                         0        
 
        134614864        1  send                                              
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9        
               8 |                                         0