It is generally desirable to profile and tune code on "stripped-down" runs, so that many profiling experiments can be completed quickly. The following precautions are recommended.
Try to maintain the same problem size, since changing the size of your data set can change the performance characteristics of your code. If the problem size must be reduced due to a limited number of processors being available, try to determine how the data set should be scaled to maintain the same performance behavior. For many algorithms, it makes most sense to maintain a fixed subgrid size. For example, if a full dataset of 8 Gbytes is expected to run on 64 processors, then a fixed subgrid size of 128 Mbyte per processor suggests profiling a 512-Mbyte dataset on 4 processors.
Try to shorten experiments by running fewer iterations. Turn timers on only after the dataset has been initialized if you expect the initialization phase to be unimportant in production runs. Perform a minimum of one warmup iteration before timing. First iterations of many operations are not representative of the steady-state performance, which is what impacts production run time. Perform several iterations and verify that the performance characteristics have stabilized before trying to draw steady-state conclusions about a code's behavior.
For multiprocess codes, execute a barrier (with MPI_Barrier, for example) before starting a timer. Consider the following example, which omits such a barrier:
      CALL MPI_COMM_RANK(MPI_COMM_WORLD,ME,IER)
      IF ( ME .EQ. 0 ) THEN
        initialization
      END IF
      T_START = MPI_WTIME()
        timed portion
      T_END = MPI_WTIME()
In this case, most processes may accumulate time in the interesting, timed portion while they wait for process 0 to emerge from the uninteresting initialization. This would skew your program's timings.
When stopping a timer, remember that measurements of elapsed time will differ from process to process. So, execute another barrier before stopping the timer. Alternatively, report the maximum elapsed time over all processes.
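Putting these pieces together, the pattern of a barrier before starting the timer followed by a maximum reduction of the per-process elapsed times might look like the following C sketch. It requires an MPI installation to compile and run, and do_work() is a placeholder for the timed portion.

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for the timed portion of the code. */
static void do_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    int rank;
    double t_start, t_local, t_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Barrier so no process starts its timer while others are
     * still in uninteresting setup code. */
    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    do_work();

    t_local = MPI_Wtime() - t_start;

    /* Report the maximum elapsed time over all processes rather
     * than inserting a second barrier before stopping the timers. */
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("elapsed: %.6f s\n", t_max);

    MPI_Finalize();
    return 0;
}
```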
Do not time very small fragments of code. This caution applies even to uniprocessor codes, and the consequences grow with the number of processors. Code fragments perform differently when timed in isolation, and introducing barrier calls for timing purposes can be disruptive over short intervals.
A high-quality timer called gethrtime is available on UltraSPARC-based systems. From C, your code can call it using:
      #include <sys/time.h>

      hrtime_t gethrtime(void);
This timer returns the time in nanoseconds since some arbitrary point in the past (since system power-on). This time is well defined for any given node, but varies considerably from node to node. Consult the gethrtime man page for details. From Sun Fortran 77, you can write
      INTEGER*8 GETHRTIME
!$PRAGMA C(GETHRTIME)
      DOUBLE PRECISION TSECONDS
      TSECONDS = 1.D-9 * GETHRTIME()
which converts nanoseconds to seconds.
The overhead and resolution for gethrtime from user code are usually better than one microsecond.
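For completeness, a C caller might use gethrtime as in the sketch below. This is Solaris-specific code: hrtime_t and gethrtime are declared in <sys/time.h> on Solaris, and the example will not compile on other systems.

```c
#include <sys/time.h>   /* declares hrtime_t gethrtime(void) on Solaris */
#include <stdio.h>

int main(void)
{
    hrtime_t t0 = gethrtime();

    /* ... timed portion ... */

    hrtime_t t1 = gethrtime();

    /* Convert the nanosecond difference to seconds. */
    double seconds = 1e-9 * (double)(t1 - t0);
    printf("elapsed: %.9f s\n", seconds);
    return 0;
}
```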