While Prism's profiling support offers programmers many diagnostic tools, a trained eye is still helpful. Your experience will grow with time. In addition to the lessons of the case study, other conditions to watch for are:
Deadlock - If the code truly deadlocks, then Prism's other features may be used to examine the state that causes the program to jam. In particular, click on Interrupt (from the Execute menu) and then Where (from the Debug menu) to see where the program is deadlocked; a commands-only sketch of this appears after this list. (The program should be compiled with -g so that debugging information will be available to Prism. Chapter 2, Using Prism has helpful information describing compilation with both -g and optimization flags turned on.) Prism also has a message-queue visualizer that may be used to study all outstanding messages at the time of deadlock.
Many outstanding messages - Performance may be degraded if a program overloads MPI buffers, either by having many outstanding messages or a few large ones.
Sends and receives out of order - Performance can be degraded if sends are not matched promptly by receives, again stressing buffering capabilities. For example, receives may be posted in a different order than the corresponding sends, or collective operations may be interspersed among the point-to-point calls.
Blocks of processes may be highly synchronized - Visual inspection of timelines can reveal blocks of processes marching in lockstep for long periods at a time. This suggests that they are synchronized by a high volume of message passing among them. If such a job is launched on a multinode cluster, processes within a block should be colocated on a node. Figure 7-2 shows a subtle example of this.
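As a commands-only sketch of the deadlock diagnosis described in the first item above - assuming the interrupt and where commands behave like their Execute and Debug menu counterparts - one might type

(prism all) interrupt
(prism all) where

The where output then shows where execution is stopped on each process, which typically reveals the blocked MPI call.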
Prism's MPI performance analysis can collect a lot of data. TNF probe data collection employs buffer wraparound, so that once a buffer file is filled, newer events overwrite older ones. Thus, final traces do not necessarily report events starting at the beginning of the program, and the time at which events start to be reported may vary slightly from one MPI process to another, depending on the amount of probed activity on each process. Nevertheless, trace files generally show representative profiles of an application, since the newer, surviving events tend to represent execution during steady state.
If buffer wraparound is an issue, then solutions include:
Using larger trace buffers
Selective enabling of probes
Profiling isolated sections of code
Prism's MPI performance analysis can disturb an application's performance characteristics, so it is sometimes desirable to focus data collection even if larger trace buffers are an option.
To increase the size of trace buffers beyond the default value, use the Prism command
(prism all) tnffile filename size
where size is the size in Kbytes of the output file for each process. The default value is 128 Kbytes.
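For example, to quadruple the per-process buffer size and write the trace to a file of your choosing (the file name and size below are purely illustrative), you could enter

(prism all) tnffile mytrace.tnf 512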
By default, trace buffers are placed in /usr/tmp before they are merged into the user's trace file. If that partition is too small for very large traces, the buffers can be redirected to another directory with the PRISM_TNFDIR environment variable. To minimize the profile disruption caused by writing very large trace files to disk, use a local file system such as /usr/tmp or /tmp whenever possible, rather than a file system that is mounted over a network.
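For example, to stage the per-process buffers in a local scratch directory before starting Prism (the directory name is only an illustration; Bourne-shell users would use export instead), one might type

% setenv PRISM_TNFDIR /tmp/tnf_buffers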
While Prism generally cleans up trace buffers after the final merge, abnormal conditions could leave large files behind. Users who abort profiling sessions with large traces should check /usr/tmp periodically for large, unwanted files.
One might focus data collection on the events believed to be most relevant to performance, either to reduce the size of the buffer files or to make profiling less intrusive. TNF probes are organized into probe groups, and the TNF-instrumented version of the Sun MPI library defines a hierarchy of such groups.
Some TNF probes belong to more than one group in the TNF-instrumented version of the Sun MPI library. For example, there are several probes that belong to both the mpi_request group and the mpi_pt2pt group. For further information about probe groups, see the Sun MPI 4.0 Programming and Reference Guide.
For message-passing performance, typically the most important groups are
mpi_pt2pt - point-to-point message passing
mpi_request - other probes for asynchronous point-to-point calls
mpi_coll - collectives
mpi_io_rw - file I/O
If there is heavy use of MPI_Pack and MPI_Unpack, their probes should also be enabled.
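A sketch of selective enabling: this assumes Prism provides tnfenable and tnfdisable commands that accept probe-group names (verify the exact command names and syntax in the Prism Reference Manual). Here collection is started with all probes on, the file-I/O probes are then disabled, and the point-to-point probes are explicitly confirmed on:

(prism all) tnfcollection on
(prism all) tnfdisable mpi_io_rw
(prism all) tnfenable mpi_pt2pt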
Another way of controlling trace sizes is to profile only isolated sections of code. Prism supports this by allowing you to turn collection on and off whenever program execution is stopped - say, at a breakpoint or by using the interrupt command.
If the profiled section will be entered and exited many times, data collection may be turned on and off automatically using tracepoints. Note that the term "trace" is now being used in a different sense: in TNF usage, a trace is a probe, whereas for Prism and other debuggers a tracepoint is a point where execution stops and an action may take place but, unlike at a breakpoint, program execution then resumes.
For example, if data collection should be turned on at line 128 but then off again at line 223, one may specify
(prism all) trace at 128 {tnfcollection on}
(prism all) trace at 223 {tnfcollection off}
If the application was compiled and linked with high degrees of optimization, then specification of line numbers may be meaningless. If the application was compiled and linked without -g, then specification of line numbers will simply not work. In such cases, data collection may be turned on and off at entry points to routines using trace in routine syntax.
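For example, assuming the application has routines named solver and postprocess (the names are purely illustrative), data collection could be turned on at entry to the first and off at entry to the second:

(prism all) trace in solver {tnfcollection on}
(prism all) trace in postprocess {tnfcollection off}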
TNF data collection can also be turned on and off within user source code using the routines tnf_process_disable, tnf_process_enable, tnf_thread_disable, and tnf_thread_enable. Since these are C functions, one must call them as follows from Fortran:
call tnf_process_disable()
!$pragma c(tnf_process_disable)
call tnf_process_enable()
!$pragma c(tnf_process_enable)
call tnf_thread_disable()
!$pragma c(tnf_thread_disable)
call tnf_thread_enable()
!$pragma c(tnf_thread_enable)
Whether these functions are called from C or Fortran, one must then link with -ltnfprobe. For more information, see the Solaris man pages on these functions.
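From C, the same routines can be called directly. The following minimal sketch (the setup_work and kernel_work routines are hypothetical stand-ins for application code) suppresses collection during setup and collects data only around the section of interest; it, too, must be linked with -ltnfprobe:

#include <tnf/probe.h>   /* declares tnf_process_enable() and tnf_process_disable() */

static void setup_work(void)  { /* initialization that need not be profiled */ }
static void kernel_work(void) { /* the section whose profile is wanted */ }

int main(void)
{
    tnf_process_disable();   /* stop TNF data collection during setup */
    setup_work();

    tnf_process_enable();    /* collect data only around the kernel */
    kernel_work();
    tnf_process_disable();

    return 0;
}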
While Sun HPC ClusterTools libraries have TNF probes for performance profiling, user code probably will not. You can add probes manually, but since they are C macros you can add them only to C and C++ code. To use TNF probes from Fortran code, you must make calls to C code, such as in this C file, probes.c:
#include <tnf/probe.h>

/* Marks the start of the profiled interval and records a descriptive label. */
void my_probe_start_(char *label_val)
{
    TNF_PROBE_1(my_probe_start, "user_probes", "",
                tnf_string, label, label_val);
}

/* Marks the end of the interval and records the operation count. */
void my_probe_end_(double *ops_val)
{
    TNF_PROBE_1(my_probe_end, "user_probes", "",
                tnf_double, ops, *ops_val);
}
The start routine accepts a descriptive string, while the end routine takes a double-precision operation count. Then, using Fortran, you might write in main.f:
      DOUBLE PRECISION OPERATION_COUNT

      OPERATION_COUNT = 2.D0 * N
      CALL MY_PROBE_START("DOTPRODUCT")
      XSUM = 0.D0
      DO I = 1, N
         XSUM = XSUM + X(I) * Y(I)
      END DO
      CALL MY_PROBE_END(OPERATION_COUNT)
Fortran converts routine names to lowercase and appends an underscore character, which is why the C routines are named my_probe_start_ and my_probe_end_.
To compile and link, use
% cc -c probes.c
% f77 main.f probes.o -ltnfprobe
By default, the Prism command tnfcollection on enables all probes. Alternatively, these sample probes could be controlled selectively through their probe group, user_probes. Profile analysis can then use the interval my_probe, which is defined by the paired my_probe_start and my_probe_end events.
For more information on TNF probes, consult the man page for TNF_PROBE(3X).
For more involved data collection experiments, you can collect TNF profiling information in batch mode, for viewing and analysis in a later, interactive session. Such collection may be performed using Prism in commands-only mode, invoked with prism -C. For example, the simplest data collection experiment would be
tnfcollection on
run
wait
quit
The wait command is needed to keep the trace-file merge from happening until the program has finished running. One way of feeding commands to Prism is through the .prisminit file (note the leading "."), which Prism reads upon startup. See Appendix A, Commands-Only Prism and Chapter 10, Customizing Prism for more information on commands-only mode and .prisminit files, respectively.
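For example, a .prisminit file for such a batch collection might contain the commands shown above, optionally preceded by a tnffile command (the file name and size here are only illustrative):

tnffile mytrace.tnf 512
tnfcollection on
run
wait
quit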
Sometimes it is hard to account for MPI activity properly. For example, if one issues an asynchronous send or receive (MPI_Isend or MPI_Irecv), the data movement may occur during that call, during the corresponding MPI_Wait or MPI_Test call, or during any other MPI call in between.
Similarly, general polling (such as with the environment variable MPI_POLLALL) may skew accounting. For example, an incoming message may be read during a send call, because general polling causes the library to check aggressively for arrivals during any MPI call.
In sum, it will not generally be possible to produce pictures like Figure 7-7, from which one can extract an effective bandwidth estimate.