Oracle® Developer Studio 12.5: Performance Analyzer

Updated: June 2016

Collecting Data From MPI Programs

The Collector can collect performance data from multi-process programs that use the Message Passing Interface (MPI).

The Collector supports the Oracle Message Passing Toolkit 8 (formerly known as Sun HPC ClusterTools 8) and its updates. The Collector can recognize other versions of MPI; the list of valid MPI versions is shown when you run collect -h with no additional arguments.
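
For example, to list the MPI versions that the Collector recognizes, you can run the collect command with only the help option; the valid values for the -M option appear in the output:

collect -h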

The Oracle Message Passing Toolkit MPI software for Oracle Solaris 10 and Linux systems is available at http://www.oracle.com/us/products/tools/message-passing-toolkit-070499.html.

The Oracle Message Passing Toolkit is made available as part of the Oracle Solaris 11 release. If it is installed on your system, you can find it in /usr/openmpi. If it is not already installed on your Oracle Solaris 11 system, you can search for the package with the command pkg search openmpi if a package repository is configured for the system. See the manual Adding and Updating Oracle Solaris 11 Software Packages in the Oracle Solaris 11 documentation library for more information about installing software in Oracle Solaris 11.
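
For example, on an Oracle Solaris 11 system with a configured package repository, a sequence such as the following could be used to locate and then install the toolkit. The package name on the install line is only illustrative; use the name that the search reports:

pkg search openmpi
pkg install openmpi    # illustrative package name; substitute the name reported by pkg search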

For information about MPI and the MPI standard, see the MPI web site http://www.mcs.anl.gov/mpi/. For more information about Open MPI, see the web site http://www.open-mpi.org/.

To collect data from MPI jobs, you must use the collect command; the dbx collector subcommands cannot be used to start MPI data collection. Details are provided in Running the collect Command for MPI.

Running the collect Command for MPI

The collect command can be used to trace and profile MPI applications.

To collect data, use the following syntax:

collect [collect-arguments] mpirun [mpirun-arguments] -- program-name [program-arguments]

For example, the following command runs MPI tracing and profiling on each of the 16 MPI processes, storing the data in a single MPI experiment:

collect -M OMPT mpirun -np 16 -- a.out 3 5

The -M OMPT option specifies that MPI profiling is to be performed and that the Oracle Message Passing Toolkit is the MPI version in use.

The initial collect process rewrites the mpirun command so that the collect command, with appropriate arguments, is run on each of the individual MPI processes.

The -- argument immediately before the program-name is required for MPI profiling. If you do not include the -- argument, the collect command displays an error message and no experiment is collected.


Note -  The technique of using the mpirun command to spawn explicit collect commands on the MPI processes is no longer supported for collecting MPI trace data. You can still use this technique for collecting other types of data.

Storing MPI Experiments

Because multiprocessing environments can be complex, you should be aware of some issues about storing MPI experiments when you collect performance data from MPI programs. These issues concern the efficiency of data collection and storage, and the naming of experiments. See Where the Data Is Stored for information on naming experiments, including MPI experiments.

Each MPI process that collects performance data creates its own subexperiment. While an MPI process creates an experiment, it locks the experiment directory; all other MPI processes must wait until the lock is released before they can use the directory. Store your experiments on a file system that is accessible to all MPI processes.
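
For example, assuming that /shared/experiments is a directory on a file system mounted on all of the nodes, the -d option of the collect command could be used to place the experiment there (the directory name is only illustrative):

collect -M OMPT -d /shared/experiments mpirun -np 16 -- a.out 3 5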

If you do not specify an experiment name, the default experiment name is used. Within the experiment, the Collector will create one subexperiment for each MPI rank. The Collector uses the MPI rank to construct a subexperiment name with the form M_rm.er, where m is the MPI rank.
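
For example, for the 16-process run shown earlier, and assuming the default experiment name test.1.er, the resulting experiment would contain one subexperiment per rank:

test.1.er/
    M_r0.er
    M_r1.er
    ...
    M_r15.er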

If you plan to move the experiment to a different location after it is complete, then specify the -A copy option with the collect command. To copy or move the experiment, do not use the UNIX cp or mv command. Instead, use the er_cp or er_mv command as described in Manipulating Experiments.
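
For example, an experiment that will later be relocated could be collected with archiving enabled and then moved with the er_mv command (the experiment and destination names are only illustrative):

collect -M OMPT -A copy mpirun -np 16 -- a.out 3 5
er_mv test.1.er /archive/test.1.er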

MPI tracing creates temporary files in /tmp/a.*.z on each node. These files are removed during the MPI_Finalize() function call. Make sure that the file systems have enough space for the experiments. Before collecting data on a long-running MPI application, do a short-duration trial run to verify file sizes. Also see Estimating Storage Requirements for information on how to estimate the space needed.

MPI profiling is based on the open source VampirTrace 5.5.3 release. It recognizes several standard VampirTrace environment variables, as well as a new one, VT_STACKS, which controls whether call stacks are recorded in the data. For more information about the meaning of these variables, see the VampirTrace 5.5.3 documentation.

The default value of the environment variable VT_BUFFER_SIZE limits the internal buffer of the MPI API trace collector to 64 Mbytes. When the buffer of a particular MPI process fills, it is flushed to disk, provided the VT_MAX_FLUSHES limit has not been reached. By default, VT_MAX_FLUSHES is set to 0, which places no limit on the number of flushes, so the MPI API trace collector flushes the buffer to disk whenever the buffer is full. If you set VT_MAX_FLUSHES to a positive number, you limit the number of flushes allowed. If the buffer fills up and cannot be flushed, events are no longer written into the trace file for that process. The result can be an incomplete experiment, and, in some cases, the experiment might not be readable.

To change the size of the buffer, use the environment variable VT_BUFFER_SIZE. The optimal value for this variable depends on the application that is to be traced. Setting a small value increases the memory available to the application but triggers frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value such as 2 Gbytes minimizes buffer flushes by the MPI API trace collector but decreases the memory available to the application. If not enough memory is available to hold the buffer and the application data, parts of the application might be swapped to disk, which also leads to a significant change in the behavior of the application.

Another important variable is VT_VERBOSE, which turns on various error and status messages. Set this variable to 2 or higher if problems arise.
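
For example, in a Bourne-compatible shell, the VampirTrace variables can be exported before the collect command is run. The values shown are only illustrative and assume the usual VampirTrace size suffixes:

VT_BUFFER_SIZE=128M; export VT_BUFFER_SIZE    # larger trace buffer, fewer flushes
VT_MAX_FLUSHES=1; export VT_MAX_FLUSHES       # allow at most one flush per process
VT_VERBOSE=2; export VT_VERBOSE               # report error and status messages
collect -M OMPT mpirun -np 16 -- a.out 3 5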

Normally, MPI trace output data is post-processed when the mpirun target exits. A processed data file is written to the experiment and information about the post-processing time is written into the experiment header. MPI post-processing is not done if MPI tracing is explicitly disabled with -m off. In the event of a failure in post-processing, an error is reported, and no MPI Tabs or MPI tracing metrics are available.

If the mpirun target does not actually invoke MPI, an experiment is still recorded but no MPI trace data is produced. The experiment reports an MPI post-processing error, and no MPI Tabs or MPI tracing metrics will be available.

If the environment variable VT_UNIFY is set to 0, the post-processing routines are not run by collect. Instead, they are run the first time er_print or analyzer is invoked on the experiment.
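
For example, to defer MPI trace post-processing from the data-collection run to the first analysis of the experiment, VT_UNIFY could be set to 0 before collection (the experiment name is only illustrative):

VT_UNIFY=0; export VT_UNIFY
collect -M OMPT mpirun -np 16 -- a.out 3 5
er_print -functions test.1.er    # post-processing runs here, on first use of the experiment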


Note -  If you copy or move experiments between computers or nodes, you cannot view the annotated source code or source lines in the annotated disassembly code unless you have access to the source files or a copy with the same timestamp. You can put a symbolic link to the original source file in the current directory in order to see the annotated source. You can also use settings in the Settings dialog box. Use the Search Path tab (see Search Path Settings) to manage a list of directories to be used for searching for source files. Use the Pathmaps tab (see Pathmaps Settings) to map the leading part of a file path from one location to another.
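
For example, if an experiment was moved to a node where the original source file is visible under a different path, a symbolic link in the current directory makes the annotated source available again (the paths shown are hypothetical):

ln -s /net/buildhost/export/src/a.c ./a.c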