Simple Performance Optimization Tool (SPOT) 2.0 User's Guide

The spot_diff Report

The script spot_diff is automatically run by SPOT after each new set of SPOT data is gathered. This tool compares each new run with the preceding ones. The output from the spot_diff script is the spot_diff.html file that is found in the directory where the experiments are being recorded. The spot_diff.html file contains several tables that compare SPOT experiment data in a tabular HTML format. Large differences are highlighted to alert the user to possible performance problems.

It is also possible to call spot_diff from the command line when greater control over the particular experiments is required. An example of such a command line is:

spot_diff -e <experiment1> -e <experiment2> -o <output_file>

The spot_diff man page, included in the CMT Developer Tools distribution, contains complete usage information.

To explain spot_diff output, this section examines a spot_diff.html file that was automatically generated after running two SPOT experiments based on the code in Example 2-3. The first run was compiled with the Sun Studio 12 compiler with -xO2 optimization and the second run used -fast. The output from the run with -xO2 optimization was recorded in the directory -O2_1, and the output from the run with -fast optimization was recorded in the directory fast_1.

Figure 3–15 Summary of Key Experiment Metrics

The Summary of Key Metrics section compares several top-level metrics for the two experiments. Enabling higher compiler optimization decreases both the runtime and the number of executed instructions. The total number of bytes read from and written to the bus is similar in the two experiments, but because the -fast experiment ran more quickly, its bus bandwidth is correspondingly higher.
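The relationship between bus traffic, runtime, and bandwidth can be illustrated with a short calculation. This is only a sketch: the function name and all figures below are hypothetical and are not taken from the report, and it assumes bandwidth is simply total bytes divided by runtime.

```python
def bus_bandwidth(bytes_transferred: float, runtime_seconds: float) -> float:
    """Average bus bandwidth in bytes per second (assumed: bytes / runtime)."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return bytes_transferred / runtime_seconds


# Hypothetical figures: both runs move the same number of bytes,
# but the second run finishes in half the time.
total_bytes = 4.0e9
slow_run = bus_bandwidth(total_bytes, 20.0)
fast_run = bus_bandwidth(total_bytes, 10.0)

print(f"slower run: {slow_run:.3e} bytes/s")
print(f"faster run: {fast_run:.3e} bytes/s")
```

With identical traffic, halving the runtime doubles the computed bandwidth, which is why a higher bandwidth figure in the faster experiment does not by itself indicate more bus activity.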

Figure 3–16 Summary of Top Stalls

The top causes of stalls are printed in two tables, one as a percentage of execution time and the other in absolute seconds. Depending on the application under observation or user preference, one or the other may be more useful in identifying a performance problem. In the example used here, the table in seconds is likely to be more useful: the two runs do the same work, so absolute stall times can be compared directly, whereas percentages also shift with the total runtime.
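The difference between the two views can be seen in a small worked example. The numbers and the function name here are hypothetical, chosen only to show the effect, and the formula assumes the percentage is stall time divided by total runtime.

```python
def stall_percent(stall_seconds: float, runtime_seconds: float) -> float:
    """Stall time as a percentage of total execution time."""
    return 100.0 * stall_seconds / runtime_seconds


# Hypothetical: a stall costs 2.0 seconds in both runs,
# but the optimized run has half the total runtime.
stall_seconds = 2.0
percent_slow = stall_percent(stall_seconds, 20.0)
percent_fast = stall_percent(stall_seconds, 10.0)

print(f"slower run: {stall_seconds} s = {percent_slow}% of runtime")
print(f"faster run: {stall_seconds} s = {percent_fast}% of runtime")
```

The stall time is unchanged in seconds, yet its share of execution time doubles in the faster run. When both runs do the same work, the seconds view avoids this distortion.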

The table shows that the optimizations enabled by -fast significantly reduce the cache-related stalls but have little effect on the Data TLB stall time. Floating Point Use stalls were also nearly eliminated in the -fast run. Clicking the column heading hyperlinks to go to the individual SPOT experiments' profiles shows that:

Prefetch instructions are responsible for reducing the cache stalls.
Better code scheduling eliminated back-to-back floating point operations, which in turn reduced the Floating Point Use stalls.

Figure 3–17 Bit Instruction Counts Report

The binary was compiled with -xbinopt=prepare, so SPOT was able to gather instruction count data. The difference in instruction count between the binary compiled at -xO2 and at -fast is mostly due to unrolling (and, to a much lesser extent, inlining) done by the compiler at -fast, which greatly reduces the number of branches and loop-related calculations. The prefetch instructions that appear only with -fast optimization are also listed in this table, and are largely responsible for the better cache performance in the -fast experiment. Only instructions that show both high variance between experiments and a high total count are printed in this table. For example, both experiments have a large number of floating point loads, which are not listed because the counts were largely the same in the two experiments. Detailed Bit data can be seen by clicking down into the individual SPOT experiments.
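The selection rule described above (high variance between experiments combined with a high total count) can be sketched as a simple filter. The actual criteria used by spot_diff are not given in this guide, so the function, the thresholds, and the counts below are all hypothetical.

```python
def notable_instructions(counts_a, counts_b, min_total, min_rel_diff):
    """Return (name, count_a, count_b) rows whose combined count is at
    least min_total and whose relative difference is at least min_rel_diff.
    Both thresholds are illustrative assumptions, not spot_diff settings."""
    rows = []
    for name in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(name, 0)
        b = counts_b.get(name, 0)
        total = a + b
        if total == 0:
            continue  # instruction never executed in either run
        rel_diff = abs(a - b) / max(a, b)
        if total >= min_total and rel_diff >= min_rel_diff:
            rows.append((name, a, b))
    return rows


# Hypothetical instruction counts for two experiments.
run_o2 = {"branch": 900, "prefetch": 0, "fp_load": 1000, "nop": 3}
run_fast = {"branch": 300, "prefetch": 200, "fp_load": 1010, "nop": 1}

table = notable_instructions(run_o2, run_fast, min_total=100, min_rel_diff=0.5)
for row in table:
    print(row)
```

In this sketch the floating point loads are dropped because their counts barely differ, and the rarely executed instruction is dropped because its total is too small, leaving only the branch and prefetch rows.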

Figure 3–18 Flags Report

Here we see that the only difference in the compiler flags between the two experiments is the optimization level, as expected.

Figure 3–19 Trap Rate Report

While the total number of Data TLB traps is roughly the same in the two experiments, the trap rate, as reported, is higher in the -fast experiment because it runs in less time. All other trap rates (which can be seen in the hyperlinked SPOT reports) were too low to report in this example.

Figure 3–20 Time Spent in Top Functions

As in the Summary of Top Stalls section, these tables are presented both as a percentage of execution time and in seconds of execution time. In either table it is apparent that the functions cache_miss(), fp_routine() and tlb_miss() are inlined when compiling at -fast but not at -xO2.