The Collector can collect performance data from multi-process programs that use the Message Passing Interface (MPI).
The Collector supports the Oracle Message Passing Toolkit 8 (formerly known as Sun HPC ClusterTools 8) and its updates. The Collector can recognize other versions of MPI; the list of valid MPI versions is shown when you run collect with no arguments.
The Oracle Message Passing Toolkit MPI software is available at http://www.oracle.com/us/products/tools/message-passing-toolkit-070499.html
To collect data from MPI jobs, you must use the collect command; the dbx collector subcommands cannot be used to start MPI data collection. Details are provided in Running the collect Command for MPI.
The collect command can be used to trace and profile MPI applications.
collect [collect-arguments] mpirun [mpirun-arguments] -- program-name [program-arguments]
For example, the following command runs MPI tracing and profiling on each of the 16 MPI processes, storing the data in a single MPI experiment:
collect -M OMPT mpirun -np 16 -- a.out 3 5
The -M OMPT option indicates MPI profiling is to be done and Oracle Message Passing Toolkit is the MPI version.
The initial collect process reformats the mpirun command to specify running the collect command with appropriate arguments on each of the individual MPI processes.
The -- argument immediately before the program_name is required for MPI profiling. If you do not include the -- argument, the collect command displays an error message and no experiment is collected.
Note - The technique of using the mpirun command to spawn explicit collect commands on the MPI processes is no longer supported for collecting MPI trace data. You can still use this technique for collecting other types of data.
Because multiprocessing environments can be complex, you should be aware of some issues about storing MPI experiments when you collect performance data from MPI programs. These issues concern the efficiency of data collection and storage, and the naming of experiments. See Where the Data Is Stored for information on naming experiments, including MPI experiments.
Each MPI process that collects performance data creates its own subexperiment. While an MPI process creates an experiment, it locks the experiment directory; all other MPI processes must wait until the lock is released before they can use the directory. Store your experiments on a file system that is accessible to all MPI processes.
If you do not specify an experiment name, the default experiment name is used. Within the experiment, the Collector will create one subexperiment for each MPI rank. The Collector uses the MPI rank to construct a subexperiment name with the form M_rm.er, where m is the MPI rank.
If you plan to move the experiment to a different location after it is complete, then specify the -A copy option with the collect command. To copy or move the experiment, do not use the UNIX® cp or mv command; instead, use the er_cp or er_mv command as described in Chapter 8, Manipulating Experiments.
MPI tracing creates temporary files in /tmp/a.*.z on each node. These files are removed during the MPI_finalize() function call. Make sure that the file systems have enough space for the experiments. Before collecting data on a long running MPI application, do a short duration trial run to verify file sizes. Also see Estimating Storage Requirements for information on how to estimate the space needed.
MPI profiling is based on the open source VampirTrace 5.5.3 release. It recognizes several supported VampirTrace environment variables, and a new one, VT_STACKS, which controls whether or not call stacks are recorded in the data. For further information on the meaning of these variables, see the VampirTrace 5.5.3 documentation.
The default value of the environment variable VT_BUFFER_SIZE limits the internal buffer of the MPI API trace collector to 64 Mbytes. After the limit has been reached for a particular MPI process, the buffer is flushed to disk, if the VT_MAX_FLUSHES limit has not been reached. By default VT_MAX_FLUSHES is set to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full. If you set VT_MAX_FLUSHES to a positive number, you limit the number of flushes allowed. If the buffer fills up and cannot be flushed, events are no longer written into the trace file for that process. The result can be an incomplete experiment, and in some cases, the experiment might not be readable.
To change the size of the buffer, use the environment variable VT_BUFFER_SIZE. The optimal value for this variable depends on the application that is to be traced. Setting a small value will increase the memory available to the application but will trigger frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value, like 2 Gbytes, will minimize buffer flushes by the MPI API trace collector, but decrease the memory available to the application. If not enough memory is available to hold the buffer and the application data this might cause parts of the application to be swapped to disk leading also to a significant change in the behavior of the application.
Normally, MPI trace output data is post-processed when the mpirun target exits; a processed data file is written to the experiment, and information about the post-processing time is written into the experiment header. MPI post-processing is not done if MPI tracing is explicitly disabled with -m off. In the event of a failure in post-processing, an error is reported, and no MPI Tabs or MPI tracing metrics are available.
If the mpirun target does not actually invoke MPI, an experiment is still recorded, but no MPI trace data is produced. The experiment reports an MPI post-processing error, and no MPI Tabs or MPI tracing metrics will be available.
Note - If you copy or move experiments between computers or nodes, you cannot view the annotated source code or source lines in the annotated disassembly code unless you have access to the source files or a copy with the same timestamp. You can put a symbolic link to the original source file in the current directory in order to see the annotated source. You can also use settings in the Set Data Presentation dialog box: the Search Path tab (see Search Path Tab) lets you manage a list of directories to be used for searching for source files, the Pathmaps tab (see Pathmaps Tab) enables you to map the leading part of a file path from one location to another.