CHAPTER 3

Profiler
This chapter discusses the Netra DPS profiler used in the Netra Data Plane software.
The Netra DPS profiler is a set of API calls that help you collect various critical data during the execution of an application. You can profile one or more areas of your application such as CPU utilization, I/O wait times, and so on. Information gathered using the profiler helps you decide where to direct performance-tuning efforts. The profiler uses special counters and resources available in the system hardware to collect critical information about the application.
As with instrumentation-based profiling, there is a slight overhead for collecting data during the application run. The profiler keeps this overhead as small as possible so that the presented data is very close to that of an application run without the profiler API in place.
You enable the profiler with the -pg command-line option to tejacc. You can insert the API calls wherever you want to start collecting profiling data. The profiler configures the hardware resources to capture the requested data and, at the same time, reserves and sets up the memory buffer where the data will be stored. You can insert calls to update the profiler data at later points in the application; at each such call, the profiler reads the current values of the data and stores the values in memory.
There is an option to store additional user data in the memory along with each update capture. Storing this data helps you analyze the application in the context of different application-specific data.
You can also obtain the current profiler data in the application and use the data as desired. With the assistance of other communication mechanisms you can send the data to the host or other parts of the application.
By demarcating the portions that are being profiled, you can dump the collected data to the console. The data is presented as a comma-delimited table that can be further processed for report generation.
To minimize the amount of memory needed for the profile capture, the profiler stores the data in a circular buffer. In a circular buffer, the earliest and the most recent data are preserved, while intermediate data is overwritten when the buffer becomes full.
The profiling data is captured into different groups based on the significance of the data. For example, with the CPU performance group, events such as completed instruction cycles, data cache misses, and secondary cache misses are captured. In the memory performance group, events such as memory queue and memory cycles are captured. Refer to the Profiler API chapter of the Netra Data Plane Software Suite 2.0 Reference Manual for the different groups and different events that are captured and measured on the target.
The profiler output consists of one line per profiler record. Each line commonly has a format of nine comma-delimited fields. The fields contain values in hexadecimal. If a record is prefixed with a -1, the buffer allocated for the profiler records has overrun. When a buffer overrun occurs, you should increase the value of the profiler_buffer_size property as described in the Configuration API chapter of the Netra Data Plane Software Suite 2.0 Reference Manual, and run the application again.
TABLE 3-1 describes the fields of the profiler record:
Refer to Profiler Output Example for an example of dump output.
For profiler API function descriptions, refer to the Netra Data Plane Software Suite 2.0 Reference Manual.
CODE EXAMPLE 3-1 provides an example of profiler API usage.
You can change the profiler configuration in the software architecture. The following example shows the profiler properties that you can change per process.
main_process is the process object that was created using the teja_process_create call. The property values are applied to all threads mapped to the process specified using main_process.
The following is an example of the profiler output.
The string, ver2.0, is the profiler dump format version. The string is used as an identifier of the output format. The string helps scripts written to process the output validate the format before processing further.
In the first record, call type 1 represents teja_profiler_start. The values 100 and 4 seen in the event_hi and event_lo columns are the types of events in group 1 being measured. In the record with ID 30e6, call type 2 represents teja_profiler_update, so the values 36c2ba96 and ce are the values of the event types 100 and 1 respectively.
Cycle counts are monotonically increasing, so the difference between any two of them gives the exact number of cycles between two profiler API calls. Dividing that difference by the processor frequency yields the actual time between the two calls.
IDs 2be4 and 2bf6 represent the source location of the profiler API call. The records/profiler_call_locations.txt file lists a table that maps IDs and actual source locations.
Profiling consists of instrumenting your application to extract performance information that can be used to analyze, diagnose, and tune your application.
Netra DPS provides an interface to assist you to obtain this information from your application. In general, profiling information consists of hardware performance counters and a few user-defined counters. This section defines the profiling information and how to obtain it.
Profiling is a disruptive activity that can have a significant performance effect. Take care to minimize profiling code and also to measure the effects of the profiling code. This can be done by measuring performance with and without the profiling code. One of the most disruptive parts of profiling is printing the profiling data to the console. To reduce the effects of prints, try to aggregate profiling statistics for many periods before printing, and print only in a designated strand.
The hardware counters for the CPU, DRAM controllers, and JBus are described in TABLE 3-2, TABLE 3-3, and TABLE 3-4 respectively.
instr_cnt: Number of completed instructions. Annulled, mispredicted, or trapped instructions are not counted.[1]

SB_full: Number of store buffer full cycles.[2]

FP_instr_cnt: Number of completed floating-point instructions.[3] Annulled or trapped instructions are not counted.

IC_miss: Number of instruction cache (L1) misses.

DC_miss: Number of data cache (L1) misses for loads (store misses are not included because the cache is write-through nonallocating).

ITLB_miss: Number of instruction TLB miss traps taken (includes real_translation misses).

DTLB_miss: Number of data TLB miss traps taken (includes real_translation misses).

L2_imiss: Number of secondary cache (L2) misses due to instruction cache requests.

L2_dmiss_ld: Number of secondary cache (L2) misses due to data cache load requests.[4]
The JBus and DRAM controller counters include: jbus_cycles, dma_reads, dma_read_latency, dma_writes, dma_write8, ordering_waits, pio_reads, pio_read_latency, pio_writes, aok_dok_off_cycles, aok_off_cycles, and dok_off_cycles.
Each strand has its own set of CPU counters that track only that strand's events and can be accessed only by that strand. There are only two CPU counter registers, each 32 bits wide, so to prevent overflows the measurement period should not exceed 6 seconds. In general, keep the measurement period between 1 and 5 seconds.

When taking measurements, ensure that the application behavior is in a steady state. To check this, measure the event a few times and verify that it does not vary by more than a few percent between measurements. Because only two counters are available, measuring all nine CPU counters requires eight measurements, and the application's behavior should be consistent over the entire collection period.

To profile each strand of a 32-thread application, each thread must include code to read and set the counters. Sample code is provided in CODE EXAMPLE 3-1. You must compile your own aggregate statistics across multiple strands or a core.
The JBus and DRAM controller counters are less useful. Since these resources are shared across all strands, only one thread should gather these counters.
The key user-defined statistic is the count of packets processed by the thread. Another statistic that can be important is a measure of idle time, which is the number of times the thread polled for a packet and did not find any packets to process.
The following example shows how to measure idle time. Assume that the workload looks like the following:
User-defined counters count the number of times through the loop where no work was done. Measure the cost of one pass through the idle loop by running the idle loop alone (idle_loop_time). Then run the real workload, counting the number of idle passes (idle_loop_count). The total idle time is idle_loop_count multiplied by idle_loop_time.
You can calculate the following metrics after collecting the appropriate hardware counter data using the Netra DPS profiling infrastructure. Use the metrics to quantify performance effects and help in optimizing the application performance.
Calculate this metric by dividing instruction count by the total number of ticks during a time period when the thread is in a stable state. You can also calculate the IPC for a specific section of code. The highest number possible is 1 IPC, which is the maximum throughput of 1 core of the UltraSPARC T1 processor.
This metric is the inverse of IPC. This metric is useful for estimating the effect of various stalls in the CPU.
Multiplying this number with the L1 cache miss latency helps estimate the cost, in cycles, of instruction cache misses. Compare this number to the overall CPI to see if this is the cause of a performance bottleneck.
This metric indicates the number of instructions that miss in the L2 cache, and enables you to calculate the contribution of instruction misses to overall CPI.
Data cache miss rate in combination with the L2 cache miss rate quantifies the effect of memory accesses. Multiplying this metric with data cache miss latency provides an indication of its effect (contribution) on CPI.
Similar to data cache miss rate, this metric has higher cost in terms of cycles of contribution to overall CPI. This metric also enables you to estimate the memory bandwidth requirements.
The profiler script summarizes the profiling output generated by the profiler. The script (written in Perl) converts the raw profiler output to a summarized format that is easy to read and interpret.
Two scripts are available: profiler.pl and profiler_n2.pl. profiler.pl parses output generated on a Sun UltraSPARC T1 (CMT1) processor; profiler_n2.pl parses output generated on a Sun UltraSPARC T2 (CMT2) processor.
For Sun UltraSPARC T1 platforms (such as a Sun Fire T2000 system):
For Sun UltraSPARC T2 platforms (such as a Sun SPARC Enterprise T5220 system):
This file consists of raw profile data generated by the Netra DPS profiler. Typically, this data is captured on the console and saved into a file with a .csv suffix, indicating a CSV (comma-separated values) file, for example, input_file.csv.
This file is generated by redirecting the outputs of the profiler.pl script to an output file. This file should also be in CSV format. For example, output_file.csv.
Note - If there is no redirection (that is, the output_file is not specified), the output of the script will display on the console.
Raw profile data is the direct output from the profiler.
The following shows an example of the raw profile data output from a Sun UltraSPARC T1 processor:
The following shows an example of the raw profile data output from the Sun UltraSPARC T2 processor:
Summarized profile data is the processed data generated by profiler.pl and profiler_n2.pl for the Sun UltraSPARC T1 (CMT1) and Sun UltraSPARC T2 (CMT2) processors, respectively.
For the Sun UltraSPARC T1 processor, the summary displays as in the following example:
TABLE 3-5 describes each field in the top section of the summarized Sun UltraSPARC T1 profile data output:
For the Sun UltraSPARC T2 processor, the summary displays as in the following example:
6393524 2050737 0 0 0 48603479 0 0 2636500 0 0 0 184283 13328505 0 150 0 0 0 0 74964356 210256 1032899
TABLE 3-6 describes each field in the top section of the summarized Sun UltraSPARC T2 profile data output:
You can use the output values of the summarized data to derive various important performance parameters. This section lists performance parameters and the method from which they are derived.
This can be obtained from the Userdata.1 field.
Average number of instructions executed in a packet.
Formula: value = (Instr_cnt / pkts_per_interval)
Average number of instructions executed per cycle.
Formula: value = (Instr_cnt / cycle)
Average number of packets processed per second (in kilopackets per second).
Formula: value = ((pkts_per_interval / (cycle / cpu_frequency)) / 1000)
Average number of SB_full occurrences per 1000 instructions executed.
Formula: value = ((SB_full / Instr_cnt) * 1000)
Average number of FP_instr_cnt occurrences per 1000 instructions executed.
Formula: value = ((FP_Instr_cnt / Instr_cnt) * 1000)
Average number of IC_miss occurrences per 1000 instructions executed.
Formula: value = ((IC_miss / Instr_cnt) * 1000)
Average number of DC_miss occurrences per 1000 instructions executed.
Formula: value = ((DC_miss / Instr_cnt) * 1000)
Average number of ITLB_miss occurrences per 1000 instructions executed.
Formula: value = ((ITLB_miss / Instr_cnt) * 1000)
Average number of DTLB_miss occurrences per 1000 instructions executed.
Formula: value = ((DTLB_miss / Instr_cnt) * 1000)
Average number of L2_miss occurrences per 1000 instructions executed.
Formula: value = ((L2_miss / Instr_cnt) * 1000)
Average number of L2_Dmiss_LD occurrences per 1000 instructions executed.
Formula: value = ((L2_Dmiss_LD / Instr_cnt) * 1000)
Average number of instructions executed in a packet.
Formula: value = (All_instr / pkts_per_interval)
Average number of instructions executed per cycle.
Formula: value = (All_instr / cycle)
Average number of Store instructions executed per packet.
Formula: value = (Store_instr / pkts_per_interval)
Average number of Load instructions executed per packet.
Formula: value = (Load_instr / pkts_per_interval)
Average number of L2 cache Load misses per packet.
Formula: value = (L2_load_misses / pkts_per_interval)
Average number of L1 Icache misses per 1000 packets.
Formula: value = ((Icache_misses * 1000) / pkts_per_interval)
Average number of L1 Dcache misses per packet.
Formula: value = (Dcache_misses / pkts_per_interval)
Average number of packets processed per second (in kilopackets per second).
Formula: value = ((pkts_per_interval / (cycle / cpu_frequency)) / 1000)
Note - Not all possible parameters are shown here. You can derive any parameter with any formula using the data outputs from the summary.

Note - These formulas can easily be inserted into a spreadsheet program.

To Use a Spreadsheet For Performance Analysis
1. Import the profiler output file into a spreadsheet program, for example, an output_file.csv generated by profiler.pl.
2. Insert formulas into the spreadsheet.
3. Save the spreadsheet for future reference.
Copyright © 2008, Sun Microsystems, Inc. All Rights Reserved.