It is generally desirable to perform code profiling and tuning on "stripped down" runs, so that many profiling experiments can be performed. The following precautions are recommended.
Try to maintain the same problem size, since changing the size of your data set can change the performance characteristics of your code. If the problem size must be reduced because only a limited number of processors is available, try to determine how the data set should be scaled to preserve the same performance behavior. For many algorithms, it makes the most sense to maintain a fixed subgrid size per processor. For example, if a full 8-Gbyte dataset is expected to run on 64 processors, the fixed subgrid size of 128 Mbytes per processor suggests profiling a 512-Mbyte dataset on 4 processors.
Try to shorten experiments by running fewer iterations. Turn timers on only after the dataset has been initialized if you expect the initialization phase to be unimportant in production runs. Perform a minimum of one warmup iteration before timing. First iterations of many operations are not representative of the steady-state performance, which is what impacts production run time. Perform several iterations and verify that the performance characteristics have stabilized before trying to draw steady-state conclusions about a code's behavior.
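As a minimal sketch of this timing discipline (the subroutines INIT_DATA and DO_ITERATION, and the iteration count NITER, are hypothetical placeholders for your own code, and MPI is assumed to have been initialized already), the pattern might look like:

      INCLUDE 'mpif.h'
      DOUBLE PRECISION T_START, T_END
      INTEGER I, NITER
      PARAMETER (NITER = 10)
C     initialization is excluded from the timed region
      CALL INIT_DATA()
C     one untimed warmup iteration before starting the timer
      CALL DO_ITERATION()
      T_START = MPI_WTIME()
      DO 10 I = 1, NITER
         CALL DO_ITERATION()
   10 CONTINUE
      T_END = MPI_WTIME()

Timing the loop for a few different values of NITER and checking that the per-iteration time has stabilized is a simple way to confirm that steady state has been reached.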
For multiprocess codes, perform a barrier (using MPI_Barrier, for example) before starting a timer. To see why, consider the following code, in which the timer is started without a preceding barrier:
CALL MPI_COMM_RANK(MPI_COMM_WORLD,ME,IER)
IF ( ME .EQ. 0 ) THEN
   initialization
END IF
T_START = MPI_WTIME()
   timed portion
T_END = MPI_WTIME()
In this case, most processes may accumulate time in the interesting, timed portion while waiting for process 0 to emerge from the uninteresting initialization. This would skew your program's timings.
When stopping a timer, remember that measurements of elapsed time will differ from process to process. So, either execute another barrier before the "stop" timer, or report the maximum elapsed time over all processes.
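Putting these pieces together, a sketch of the maximum-elapsed-time variant might look like the following; the variable names are illustrative, and rank 0 is chosen arbitrarily as the root of the reduction.

      INCLUDE 'mpif.h'
      DOUBLE PRECISION T_START, T_END, T_LOCAL, T_MAX
      INTEGER ME, IER
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, ME, IER)
C     barrier so that no process starts its clock early
      CALL MPI_BARRIER(MPI_COMM_WORLD, IER)
      T_START = MPI_WTIME()
C     ... timed portion ...
      T_END = MPI_WTIME()
      T_LOCAL = T_END - T_START
C     take the maximum elapsed time over all processes
      CALL MPI_REDUCE(T_LOCAL, T_MAX, 1, MPI_DOUBLE_PRECISION,
     &                MPI_MAX, 0, MPI_COMM_WORLD, IER)
      IF ( ME .EQ. 0 ) PRINT *, 'elapsed seconds =', T_MAX

Reporting the maximum, rather than the time seen on any one process, reflects the fact that the slowest process determines the run time of the job.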
Do not time very small fragments of code. This is good advice even for uniprocessor codes, and the consequences are greater with many processors. Code fragments perform differently when timed in isolation, and the introduction of barrier calls for timing purposes can be disruptive for short intervals.
A high-quality timer called gethrtime is available on UltraSPARC-based systems. From C, your code can call it using:
#include <sys/time.h>
hrtime_t gethrtime(void);
This timer returns time in nanoseconds since some arbitrary point in the past (effectively, the time since system power-on). This time is well defined on any one node, but it varies considerably from node to node. Consult the gethrtime man page for details. From Sun Fortran 77, you can write
INTEGER*8 GETHRTIME
!$PRAGMA C(GETHRTIME)
DOUBLE PRECISION TSECONDS
TSECONDS = 1.D-9 * GETHRTIME()
which converts nanoseconds to seconds.
The overhead and resolution for gethrtime from user code are usually better than one microsecond.
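As a usage example (the variable names and the timed section are purely illustrative), an elapsed interval in seconds can be measured from Fortran as follows:

      INTEGER*8 GETHRTIME
!$PRAGMA C(GETHRTIME)
      DOUBLE PRECISION T1, T2
      T1 = 1.D-9 * GETHRTIME()
C     ... section of code to be timed ...
      T2 = 1.D-9 * GETHRTIME()
      PRINT *, 'elapsed seconds =', T2 - T1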
The MPI standard supports a profiling interface, which allows any user to profile either individual MPI calls or the entire library. This facility is provided by supporting two equivalent APIs for each MPI routine. One has the prefix MPI_, while the other has PMPI_. User codes typically call the MPI_ routines. A profiling routine or library will typically provide wrappers for the MPI_ APIs that simply call the PMPI_ ones, with timer calls around the PMPI_ call.
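For instance, a minimal timing wrapper for the Fortran MPI_Send binding might look like the sketch below. The common block /SENDPROF/ and the accumulator SEND_TIME are hypothetical names used only for illustration; the accumulator would have to be initialized and reported elsewhere (for example, just before MPI_Finalize), and the wrapper would be linked ahead of -lmpi, or preloaded, so that it intercepts the application's calls.

      subroutine MPI_Send(buf, count, datatype, dest, tag, comm, ier)
      include 'mpif.h'
      integer buf(*), count, datatype, dest, tag, comm, ier
      double precision t1, t2, send_time
      common /sendprof/ send_time
      t1 = MPI_WTIME()
      call PMPI_Send(buf, count, datatype, dest, tag, comm, ier)
      t2 = MPI_WTIME()
c     accumulate the total time spent in MPI_Send
      send_time = send_time + (t2 - t1)
      end

A Fortran wrapper of this kind intercepts only Fortran calls to MPI_Send; C calls would require a corresponding C wrapper built on PMPI_Send.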
More generally, you may use this interface to change the behavior of MPI routines without modifying your source code. For example, suppose you believe that most of the time spent in some collective call, such as MPI_Allreduce, is due to the synchronization of processes that is implicit in such a call. Then you might compile a wrapper such as the one shown below and link it into your code before -lmpi. The effect is that time profiled in MPI_Allreduce calls will be due exclusively to the all-reduce operation itself, with the synchronization costs attributed to barrier operations.
subroutine MPI_Allreduce(x,y,n,type,op,comm,ier)
integer x(*), y(*), n, type, op, comm, ier
call PMPI_Barrier(comm,ier)
call PMPI_Allreduce(x,y,n,type,op,comm,ier)
end
Profiling wrappers or libraries may be used even with application binaries that have already been linked. See the Solaris man page for ld for more information on the environment variable LD_PRELOAD.
Profiling libraries are available from independent sources for use with Sun MPI. Typically, their functionality is rather limited compared to that of Prism with TNF, but for certain applications their use may be more convenient or they may represent useful springboards for particular, customized profiling activities. An example of a profiling library is included in the multiprocessing environment (MPE) from Argonne National Laboratory. For more information on this library and on the MPI profiling interface, see the Sun MPI 4.0 Programming and Reference Guide.
The Solaris utility gprof may be used for multiprocess codes, such as those that use MPI. It can be helpful for profiling user routines, which are not automatically instrumented with TNF probes by Sun HPC ClusterTools 3.0 software. Several points should be noted:
Codes should be compiled and linked with -pg for Fortran or -xpg for C.
Set the environment variable PROFDIR to a directory in which profile files should be written; each process of a multiprocess job, such as an MPI job, then writes its own profile file there rather than overwriting a single gmon.out file.
Run gprof after program execution to gather summary statistics, either for individual processes or for multiprocess aggregates.
The Sun MPI libraries are not profiled. Typically, on-node data transfers will appear as memcpy calls while transfers between nodes will appear as read and write calls in gprof profiles.
There may be no obvious relationship between the process ids used to tag the profile files and the MPI process ranks.
There is a very small chance that profiles from different processes will overwrite one another if a multiprocess job spans multiple nodes, since process ids on different nodes can coincide.
For more information about gprof, see the gprof man page.
You can implement custom post-processing of TNF data using the tnfdump utility, which converts TNF trace files, such as the one produced by Prism, into an ASCII listing of timestamps, time differentials, events, and probe arguments.
To use this command, specify
% tnfdump filename
where filename is the name of the TNF trace data file produced by Prism (not by prex).
The resulting ASCII listing can be several times larger than the tracefile and may require a wide window for viewing. Nevertheless, it is full of valuable information.
For more information about the tnfdump command, see the tnfdump(1) man page.
Because Prism invokes TNF utilities to perform data collection, it is also possible to profile MPI programs with those utilities directly, without using Prism. Prism provides a number of ease-of-use facilities, such as labeling process timelines according to MPI rank and reconciling timestamps when a job is distributed over many nodes whose clocks are not synchronized with one another. On the other hand, Prism's own processes may affect the profiling activity, so in certain cases it is desirable to bypass Prism during data collection. The utility that performs TNF data collection directly is prex. To enable all probes, place the following commands in your .prexrc file. (Note the leading "." in the file name.)
enable $all
trace $all
continue
Then, remove old buffer files, run prex, and merge and view the data, as shown below.
Because prex does not correct for the effects of clock skew, it is useful only for MPI programs running on a single SMP. Also, data collected by prex does not identify MPI ranks; if you attempt to display prex data in tnfview, the VIDs (ranks) will appear in random order.
% rm /tmp/trace-*
% mprun -np 4 prex -s 128 a.out

or, if the job is launched under LSF rather than with mprun:

% bsub -I -n 4 prex -s 128 a.out

Then merge and view the data:

% /opt/SUNWhpc/bin/sparcv7/tnfmerge -o a.tnf /tmp/trace-*
% /opt/SUNWhpc/bin/sparcv7/tnfview a.tnf
For more information on prex, see its Solaris man page.