C H A P T E R 9 - Using the DTrace Utility With Sun MPI

This chapter discusses how to use the Solaris Dynamic Tracing utility (DTrace) with Sun MPI. DTrace is a comprehensive dynamic tracing utility that you can use to monitor the behavior of applications programs as well as the operating system itself. You can use DTrace on live production systems to understand those systems' behavior and to track down any problems that might be occurring.

The D language is the programming language used to create the source code for DTrace programs.

The material in this chapter assumes knowledge of the D language and how to use DTrace.

For more information about the D language and DTrace, refer to the Solaris Dynamic Tracing Guide (Part Number 817-6223). This guide is part of the Solaris 10 OS Software Developer Collection.

Solaris Operating Systems -> Solaris 10 -> Solaris 10 Software Developer Collection

mprun Privileges

Before you run a program under DTrace, you need to make sure that you have the correct mprun privileges.

In order to run the script under mprun, make sure that you have dtrace_proc and dtrace_user privileges. Otherwise, DTrace will return the following error because it does not have sufficient privileges:

dtrace:  failed to initialize dtrace:  DTrace requires additional privileges

To determine whether you have the appropriate privileges on the entire cluster, perform the following steps:

1. Use your favorite text editor to create the following shell script, called mpppriv.sh:

#!/bin/sh

# mpppriv.sh -  run ppriv under a shell so you can get the privileges

#		of the process that mprun creates

ppriv $$

% mprun -np 0 -Ns mpppriv.sh

If the output of ppriv shows that the E privilege set has the dtrace privileges, then you will be able to run dtrace under mprun (see the two examples below) Otherwise, you will need to adjust your system to get dtrace access.

The following example shows the output from ppriv if the correct user privileges have not been set:

% ppriv $$

4084:  -csh

flags = <none>

	E:  basic

	I:  basic

	P:  basic

	L:  all

% ppriv $$

2075:  tcsh

flags = <none>

	E:basic,dtrace_proc,dtrace_user

	I:basic,dtrace_proc,dtrace_user

	P:basic,dtrace_proc,dtrace_user

	L:  all

After the privileges have been changed, you can use the ppriv command to execute the dtrace commands under mprun.

Running DTrace with MPI Programs

Running an MPI Program Under DTrace

For illustration purposes, assume you have a program named mpiapp. To trace the program mpiapp using the mpitrace.d script, type the following command:

% mprun -np 4 dtrace -s mpitrace.d -c mpiapp

The advantage of tracing an MPI program in this way is that all the processes in the job will be traced from the beginning. This method is probably most useful in doing performance measurements, when you need to start at the beginning of an application and you need all the processes in a job to participate in collecting data.

This approach also has some disadvantages. One disadvantage of running a program like the one in the above example is that all the tracing output for all four processes is directed to standard output (stdout). One way around this problem is to create a script similar to the following:

#!/bin/sh

# partrace.sh - a helper script to dtrace Sun MPI jobs from the

#		start of the job.

dtrace -s $1 -c $2 -o $2.$MP_JOBID.$SUNHPC_PROC_RANK.trace

To trace a parallel program and get separate trace files, type the following command to run the partrace.sh shell script:

% mprun -np 4 partrace.sh mpitrace.d mpiapp

This will run mpiapp under dtrace using the mpitrace.d script. The script saves the trace output for each process in a job under a separate file name, based on the job ID and rank of the process.

Attaching to MPI Processes

The second way to use dtrace with Sun MPI is to attach dtrace to a running process. Perform the following procedure:

1. Type the following command to get the process ID (PID) of the running process and the nodes on which it is running.

% mpps -p

	JOBNAME 	NPROC 				UID 			STATE 		AOUT

	cre.1 			2 		joeuser 	 				RUN 			mpiapp

			RANK 			PID 			STATE		NODE

			0 			6390 			RUN 		mynode

			1 			6391 			RUN 		mynode

2. Decide which rank you want to use to attach dtrace, and then log in to the node that contains the rank you want to use (in this example, rank 1 for the job cre.1). In the example, you would log in to the node mynode.

3. Type the following command to attach to the rank 1 process (identified by its process ID, which is 6391 in the example) and run the DTrace script mpitrace.d:

% dtrace -p 6391 -s mpitrace.d

Simple MPI Tracing

DTrace enables you to easily trace programs. When used in conjunction with MPI and the more than 200 functions defined in the MPI standard, DTrace provides an easy way to determine which functions might be in error during the debugging process, or those functions which might be of interest. After you determine the function showing the error, it is easy to locate the desired job, process, and rank on which to run your scripts. As demonstrated above, DTrace allows you to perform these determinations while the program is running

Although the MPI standard provides the MPI profiling interface, using DTrace does provide a number of advantages. The advantages of using DTrace include the following:

The following example shows a simple script that traces the entry and exit into all the MPI API calls.

mpitrace.d:

pid$target:libmpi:MPI_*:entry

printf("Entered %s...", probefunc);

pid$target:libmpi:MPI_*:return

printf("exiting, return value = %d\n", arg1);

When you use this example script to attach DTrace to a job that performs send and recv operations, the output looks similar to the following:

% dtrace -q -p 6391 -s mpitrace.d

Entered MPI_Send...exiting, return value = 0

Entered MPI_Recv...exiting, return value = 0

Entered MPI_Send...exiting, return value = 0

Entered MPI_Recv...exiting, return value = 0

Entered MPI_Send...exiting, return value = 0 ...

You can easily modify the mpitrace.d script to include an argument list. The resulting output resembles truss output. For example:

mpitruss.d:

pid$target:libmpi:MPI_Send:entry,

pid$target:libmpi:MPI_*send:entry,

pid$target:libmpi:MPI_Recv:entry,

pid$target:libmpi:MPI_*recv:entry

printf("%s(0x%x, %d, 0x%x, %d, %d, 0x%x)",probefunc, arg0, arg1, arg2, arg3, arg4, arg5);

pid$target:libmpi:MPI_Send:return,

pid$target:libmpi:MPI_*send:return,

pid$target:libmpi:MPI_Recv:return,

pid$target:libmpi:MPI_*recv:return

printf("\t\t = %d\n", arg1);

The mpitruss.d script shows how you can specify wildcard names to match the functions. Both probes will match all send and receive type function calls in the MPI library. The first probe shows the usage of the built-in arg variables to print out the arglist of the function being traced.

Take care when wildcarding the entrypoint and the formatting argument output, because you could end up printing either too many arguments, or not enough arguments, for certain functions. For example, in the above case, the MPI_Irecv and MPI_Isend functions will not have their Request handle parameters printed out.

% dtrace -q -p 6391 -s mpitruss.d

MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1,0x8060d48) = 0

MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0

MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0

MPI_Recv(0x80470a8, 1,0x8060f48, 0, 0, 0x8060d48) = 0 ...

Tracking Down Resource Leaks

One of the biggest issues with programming is the unintentional leaking of resources (such as memory). With MPI, tracking and repairing resource leaks can be somewhat more challenging because the objects being leaked are in the middleware, and thus are not easily detected by the use of memory checkers.

DTrace helps with debugging such problems using variables, the profile provider, and a callstack function. The mpicommcheck.d script (shown in the example below) probes for all the the MPI communicator calls that allocate and deallocate communicators, and keeps track of the stack each time the function is called. Every 10 seconds the script dumps out the current count of MPI communicator calls and the total calls for the allocation and deallocation of communicators. When the dtrace session ends (usually by typing Ctrl-C, if you attached to a running MPI program), the script will print out the totals and all the different stack traces, as well as the number of times those stack traces were reached.

In order to perform these tasks, the script uses DTrace features such as variables, associative arrays, built-in functions (count, ustack) and the predefined variable probefunc.

mpicommcheck.d:

BEGIN

  allocations = 0;

  deallocations = 0;

  prcnt = 0;

pid$target:libmpi:MPI_Comm_create:entry,

pid$target:libmpi:MPI_Comm_dup:entry,

pid$target:libmpi:MPI_Comm_split:entry

  ++allocations;

  @counts[probefunc] = count();

  @stacks[ustack()] = count();

pid$target:libmpi:MPI_Comm_free:entry

  ++deallocations;

  @counts[probefunc] = count();

  @stacks[ustack()] = count();

profile:::tick-1sec

/++prcnt > 10/

  printf("=====================================================================");

  printa(@counts);

  printf("Communicator Allocations = %d \n", allocations);

  printf("Communicator Deallocations = %d\n", deallocations);

  prcnt = 0;

END

  printf("Communicator Allocations = %d, Communicator Deallocations = %d\n",

  	allocations, deallocations);

This script attaches dtrace to a suspect section of code in your program (that is, a section of code that might contain a resource leak). If, during the process of running the script, you see that the printed totals for allocations and deallocations are starting to steadily diverge, you might have a resource leak. Depending on how your program is designed, it might take some time and observation of the allocation/deallocation totals in order to definitively determine that the code contains a resource leak. Once you do determine that a resource leak is definitely occurring, you can type Ctrl-C to break out of the dtrace session. Next, using the stack traces dumped, you can try to determine where the issue might be occurring.

The following example shows code containing a resource leak, and the output that is displayed using the mpicommcheck.d script.

The sample MPI program containing the resource leak is called mpicommleak. This program performs three MPI_Comm_dup operations and two MPI_Comm_free operations. The program thus "leaks" one communicator operation with each iteration of a loop.

When you attach dtrace to mpicommleak using the mpicommcheck.d script above, you will see a 10-second periodic output. This output shows that the count of the allocated communicators is growing faster than the count of deallocations.

When you finally end the dtrace session by typing Ctrl-C, the session will have output a total of five stack traces, showing the distinct three MPI_Comm_dup and two MPI_Comm_free call stacks, as well as the number of times each call stack was encountered.

% dtrace -q -p 6581 -s mpicommcheck.d

================================================================

  MPI_Comm_free 													4

  MPI_Comm_dup 													6

Communicator Allocations = 6

Communicator Deallocations = 4

================================================================

  MPI_Comm_free 													8

  MPI_Comm_dup 													12

Communicator Allocations = 12

Communicator Deallocations = 8

================================================================

  MPI_Comm_free 													12

  MPI_Comm_dup 													18

Communicator Allocations = 18

Communicator Deallocations = 12

^C

Communicator Allocations = 21, Communicator Deallocations = 14

	libmpi.so.1`MPI_Comm_dup

	mpicommleak`allocate_comms+0x1e

	mpicommleak`main+0x5b

	mpicommleak`0x805091a

	libmpi.so.1`MPI_Comm_dup

	mpicommleak`allocate_comms+0x30

	mpicommleak`main+0x5b

	mpicommleak`0x805091a

	libmpi.so.1`MPI_Comm_dup

	mpicommleak`allocate_comms+0x42

	mpicommleak`main+0x5b

	mpicommleak`0x805091a

	libmpi.so.1`MPI_Comm_free

	mpicommleak`deallocate_comms+0x19

	mpicommleak`main+0x6a

	mpicommleak`0x805091a

	libmpi.so.1`MPI_Comm_free

	mpicommleak`deallocate_comms+0x26

	mpicommleak`main+0x6a

	mpicommleak`0x805091a

mprun Privileges

Running DTrace with MPI Programs

Running an MPI Program Under DTrace

Attaching to MPI Processes

Simple MPI Tracing

Tracking Down Resource Leaks

`mprun` Privileges