Profiler

C H A P T E R 3

Profiler

This chapter discusses the Sun Netra DPS profiler used in the Sun Netra Data Plane software. Topics include:

Profiler Introduction

How the Profiler Works

Groups and Events

Profiler Output

Profiler Examples

Profiling Application Performance

User-Defined Statistics

Profiling Metrics

Using the Profiler Script

Profiler Scripts

Profiler Introduction

The Sun Netra DPS profiler is a set of API calls that help you collect various critical data during the execution of an application. The user can profile one or more areas of your application such as CPU utilization, I/O wait times, and so on. Information gathered using the profiler helps you decide where to direct performance-tuning efforts. The profiler uses special counters and resources available in the system hardware to collect critical information about the application.

As with instrumentation-based profiling, there is a slight overhead for collecting data during the application run. The profiler uses as little overhead as possible so that the presented data is very close to the actual application run without the profiler API in place.

How the Profiler Works

The user enables the profiler with the -pg command-line option (tejacc). Insert the API calls at desired places to start collecting profiling data. The profiler configures and sets the hardware resources to capture the requested data. At the same time, the profiler reserves and sets up the memory buffer where the data will be stored. Insert calls to update the profiler data at any further location in the application. With this setup, the profiler reads the current values of the data and stores the values in memory.

There is an option to store additional user data in the memory along with each update capture. Storing this data helps you analyze the application in the context of different application-specific data.

The user can also obtain the current profiler data in the application and use the data as desired. With the assistance of other communication mechanisms you can send the data to the host or other parts of the application.

By demarking the portions that are being profiled, you can dump the collected data to the console. The data is presented as a comma-delimited table that can be further processed for report generation.

To minimize the amount of memory space needed for the profile capture, the profiler uses a circular buffer mechanism to store the data. In a circular buffer, the start and the end data is preserved, yet the intermediate data is overwritten when the buffer becomes full.

Groups and Events

The profiling data is captured into different groups. For example, with the CPU performance group, events such as completed instruction cycles, data cache misses and secondary cache misses are captured. In the memory performance group, events such as memory queue and memory cycles are captured. Refer to the Profiler API chapter of the Sun Netra Data Plane Software Suite 2.1 Update 1 Reference Manual for the different groups and different events that are captured and measured on the target.

Profiler Output

The profiler output consists of one line per profiler record. Each line commonly has a format of nine comma-delimited fields. The fields contain values in hexadecimal. If a record is prefixed with a -1, the buffer allocated for the profiler records has overrun. When a buffer overrun occurs, you should increase the value of the profiler_buffer_size property as described in the Configuration API chapter of the Sun Netra Data Plane Software Suite 2.1 Update 1 Reference Manual, and run the application again.

TABLE 3-1 describes the fields of the profiler record:

TABLE 3-1 Profiler Record Fields
Field	Description
CPU ID	Represents the CPU ID where the current profiler call was made.
Caller ID	Represents the source location of the `teja_profiler` call. The your-build-directory/`reports/profiler_calls_location.txt` file lists all of the IDs and their corresponding source locations. The `profiler_calls_location.txt` is generated when the application is successfully built.
Call Type	Type of `teja_profiler` call. The values listed are defined in the `teja_profiler.h` file.
Completed Cycles	Running total of completed clock cycles so far. The user can use this value to calculate the time between two entries.
Program Counter	Value of the program counter when the current profiler call was invoked.
Group Type	Group number of the events started or being measured.
Event Values	Value of the events. This value can be one or more columns depending on the target processor. The target-dependent values are described in the Profiler API chapter in the Sun Netra Data Plane Software Suite 2.1 Update 1 Reference Manual. The order of the events are the same as the location of the bit set in the event bit mask, passed to `teja_profiler_start`, starting from left to right. For the entry that represents `teja_profiler_start`, the values represent the event types. There are two events per record (group) in the dump output: `event_hi` - represents the higher bit set in the event mask `event_lo` - represents the lower bit set in the event mask Overflow values consist of the following: `0x0` - no overflow `0x1` - overflow of the event_lo `0x2` - overflow of the event_hi `0x3` - overflow of both `event_hi` and `event_lo`
Overflow	Overflow information of one or more events being measured. The value is target-dependent.
User Data	Value of the user-defined data. Zero or more columns, depending on the number of counters allocated and recorded by the user.

Refer to Profiler Output Example for an example of dump output.

Profiler Examples

For profiler API function descriptions, refer to the Sun Netra Data Plane Software Suite 2.1 Update 1 Reference Manual.

Profiler API

This section includes profiler API usage for both Sun UltraSPARC T1 and Sun UltraSPARC T2 processors.

Profiler API Usage for the Sun UltraSPARC T1 Processor

The only difference when profiling functions are used for the Sun UltraSPARC T1 processor is in the teja_profiler_start function call for CPU group of events. Profiling CPU group on the Sun UltraSPARC T1 processor enables the measuring of only one additional event along with the completed instruction count that is always an available event for this group.

EXAMPLE 3-1 provides an example of profiler API usage for the Sun UltraSPARC T1 processor.

EXAMPLE 3-1 Sample Profiler API Usage for the Sun UltraSPARC T1 Processor
main() { /* ...user code... / teja_profiler_start(TEJA_PROFILER_CMT_CPU, TEJA_PROFILER_CMT_CPU_IC_MISS); / ...user code... / while (packet) { / ...user code... */ teja_profiler_update(TEJA_PROFILER_CMT_CPU, num_pkt); if (num_pkt == 100) teja_profiler_dump(generator_thread); num_pkt = 0; } } teja_profiler_stop(TEJA_PROFILER_CMT_CPU); }

EXAMPLE 3-1 Sample Profiler API Usage for the Sun UltraSPARC T1 Processor

main() 
{ 
    /* ...user code... */ 
    teja_profiler_start(TEJA_PROFILER_CMT_CPU, 
                                            TEJA_PROFILER_CMT_CPU_IC_MISS); 
    /*   ...user code... */ 
    while (packet) { 
        /*  ...user code... */ 
        teja_profiler_update(TEJA_PROFILER_CMT_CPU, num_pkt); 
        if (num_pkt == 100) 
            teja_profiler_dump(generator_thread);
            num_pkt = 0;
        }
    }
    teja_profiler_stop(TEJA_PROFILER_CMT_CPU);
}

Profiler API Usage for the Sun UltraSPARC T2 Processor

EXAMPLE 3-2 provides an example of profiler API usage for the Sun UltraSPARC T2 processor.

EXAMPLE 3-2 Sample Profiler API Usage for the Sun UltraSPARC T2 Processor
main() { /* ...user code... / teja_profiler_start(TEJA_PROFILER_CMT_CPU, TEJA_PROFILER_CMT_CPU_IC_MISS \| TEJA_PROFILER_CMT_CPU_DC_MISS); / ...user code... / while (packet) { / ...user code... */ teja_profiler_update(TEJA_PROFILER_CMT_CPU, num_pkt); if (num_pkt == 100) teja_profiler_dump(generator_thread); num_pkg = 0 } } teja_profiler_stop(TEJA_PROFILER_CMT_CPU); }

EXAMPLE 3-2 Sample Profiler API Usage for the Sun UltraSPARC T2 Processor

main() 
{ 
    /* ...user code... */ 
    teja_profiler_start(TEJA_PROFILER_CMT_CPU, 
                                            TEJA_PROFILER_CMT_CPU_IC_MISS |  
                                                                          TEJA_PROFILER_CMT_CPU_DC_MISS); 
    /*   ...user code... */ 
    while (packet) { 
        /*  ...user code... */ 
        teja_profiler_update(TEJA_PROFILER_CMT_CPU, num_pkt); 
        if (num_pkt == 100) 
            teja_profiler_dump(generator_thread);
            num_pkg = 0
        }
    }
    teja_profiler_stop(TEJA_PROFILER_CMT_CPU);
}

Profiler Configuration

You can change the profiler configuration in the software architecture C-based file. The following example shows the profiler properties that you can change per process.

teja_process_set_property(main_process, “profiler_log_table_size”,"4096");

main_process is the process object that was created using the teja_process_create call. The property values are applied to all threads mapped to the process specified using main_process.

Profiler Output Example

The following is an example of the profiler output.

TEJA_PROFILE_DUMP_START,ver1.1
CPUID,ID,Type,Cycles,PC,Grp,Evt_Hi,Evt_Lo,Overflow,User Data 
4,15136,1,4d048ad5c4,521f08,1,100,2 
4,30e6,2,4d162a0db0,5128f0,1,36c2ba96,ce,0,1e8480,3da594c
4,18236,1,4cf2eb9ce4,521f08,1,100,1 
4,3a2f,2,4d048acb40,5128f0,1,31cffa4,c2a,0,1b7740,3da594c
TEJA_PROFILE_DUMP_END

The string, ver1.1, is the profiler dump format version. The string is used as an identifier of the output format. The string helps scripts written to process the output validate the format before processing further.

Each profiler record (which normally consists of a lot more lines than the above example) consists of a start delimiter, TEJA_PROFILE_DUMP_START, and an end delimiter, TEJA_PROFILE_DUMP_END. All profiled data records for a thread are displayed between the start and end delimiter.

In the first record, call type 1 represents teja_profiler_start. The values 100 and 1 seen in the event_hi and event_lo columns are the types of events in group 1 being measured. In the record with ID 30e6, call type 2 represents teja_profiler_update, so the values 36c2ba96 and ce are the values of the event types 100 and 2, respectively.

Cycle counts are accumulative. Thus, the difference between two of them provides the exact number of cycle counts between two profiler API calls. The difference divided by the processor frequency calculates the actual time between two calls.

IDs 18236 and 15136 represent the source location of the profiler API call. The
your-build-directory/reports/profiler_calls_location.txt file lists a table that maps IDs and actual source locations.

Profiling Application Performance

Profiling consists of instrumenting your application to extract performance information that can be used to analyze, diagnose, and tune your application. Sun Netra DPS provides an interface to assist you to obtain this information from your application. In general, profiling information consists of hardware performance counters and a few user-defined counters. This section defines the profiling information and how to obtain it.

Profiling is a disruptive activity that can have a significant performance effect. Take care to minimize profiling code and also to measure the effects of the profiling code. This can be done by measuring performance with and without the profiling code. One of the most disruptive parts of profiling is printing the profiling data to the console. To reduce the effects of prints, try to aggregate profiling statistics for many periods before printing, and print only in a designated strand.

Sun UltraSPARC T1 Performance Counters

The CPU, DRAM, and JBus performance counters for Sun UltraSPARC T1 processor are described in TABLE 3-2, TABLE 3-3, and TABLE 3-4, respectively.

TABLE 3-2 Sun UltraSPARC T1 CPU Performance Counters
Event Name	Description
instr_cnt	Number of completed instructions. Annulled, mispredicted, or trapped instructions are not counted.^[1]
SB_full	Number of store buffer full cycles.^[2]
FP_instr_cnt	Number of completed floating-point instructions. ^[3] Annulled or trapped instruction are not counted.
IC_miss	Number of instruction cache (L1) misses.
DC_miss	Number of data cache (L1) misses for loads (store misses are not included because the cache is write-through nonallocating).
ITLB_miss	Number of instruction TLB miss trap taken (includes `real_translation` misses).
DTLB_miss	Number of data TLB miss trap taken (includes `real_translation` misses).
L2_imiss	Number of secondary cache (L2) misses due to instruction cache requests.
L2_dmiss_Id	Number of secondary cache (L2) misses due to data cache load requests.^[4]

TABLE 3-3 DRAM Performance Counters
Event Name	Description
mem_reads	Number of read transactions.
mem_writes	Number of write transactions.
bank_busy_stalls	Number of bank busy stalls (when transactions are pending).
rd_queue_latency	Read queue latency (incremented by number of read transactions in the queue each cycle).
wr_queue_latency	Write queue latency (incremented by number of write transactions in the queue each cycle).
rw_queue_latency	Read and write queue latency (incremented by number of write transactions in the queue each cycle).
wr_buf_hits	Writeback buffer hits (incremented by 1 each time a read is deferred due to conflicts with pending writes).

TABLE 3-4 JBus Performance Counters
Event Name	Description
jbus_cycles	JBus cycles (time).
dma_reads	DMA read transactions (inbound).
dma_read_latency	Total DMA read latency.
dma_writes	DMA write transactions.
dma_write8	DMA WR8 sub transactions.
ordering_waits	Ordering waits (JBI to L2 queues blocked each cycle).
pio_reads	PIO read transactions (outbound).
pio_read_latency	Total PIO read latency.
pio_writes	PIO write transactions.
aok_dok_off_cycles	AOK or DOK off cycles seen.
aok_off_cycles	AOK_OFF cycles seen.
dok_off_cycles	DOK_OFF cycles seen.

Each strand has its own set of CPU counters that only tracks its own events and can only be accessed by that strand. Performance counters are 32 bits wide so they can measure the values in range from 0 to 2³². If measured event has value greater than 2³² the corresponding counter will overflow as it will be indicated in the Overflow field of the output record. If the counter will overflow or not depends on properties of the code that is profiled, the clock frequency of the processor, the measured event and the profiling period. In the case of performance counter overflow it is suggested to the user to decrease the profiling period. When taking measurements, ensure that the application behavior is in a steady state. To check this behavior, measure the event a few times to see that it does not vary by more than a few percent between measurements. To measure all nine CPU counters, eight measurements are required. The application’s behavior should be consistent over the entire collection period. To profile each strand on a 32-thread application, each thread must have code to read and set the counters. The user must compile their own aggregate statistics across multiple strands or a core.

Since the JBus and DRAM performance counters are shared across all strands, only one thread should gather these counters.

Sun UltraSPARC T2 Performance Counters

The CPU performance counters for the Sun UltraSPARC T2 processor are described in TABLE 3-5.

TABLE 3-5 Sun UltraSPARC T2 CPU Performance Counters
Event Name	Description
`Completed_branches`	Number of completed branches.
`Taken_branches`	Number of branches taken.
`FGU_arithmatic_instr`	Number of floating-point arithmetic instructions executed.
`Load_instr`	Number of load instructions executed.
`Store_instr`	Number of store Instructions executed.
`sethi_instr`	Number of `sethi` instructions executed.
`Other_instr`	Number of all other instructions executed.
`Atomics`	Number of atomic operations executed.
`All_instr`	Total number of instructions executed.
`Icache_misses`	Number of instruction cache misses.
`Dcache_misses`	Number of L1 data cache misses.
`L2_instr_misses`	Number of secondary cache (L2) misses due to instruction cache requests.
`L2_load_misses`	Measures the number of secondary cache (L2) misses due to data cache load requests.
`ITLB_ref_L2`	For each ITLB miss, this counts the number of accesses the ITLB hardware tablewalk makes to L2 when hardware tablewalk is enabled.
`DTLB_ref_L2`	For each DTLB miss, this counts the number of accesses the DTLB hardware tablewalk makes to L2 when hardware tablewalk is enabled.
`ITLB_miss_L2`	For each ITLB miss, this counts the number of accesses the ITLB hardware tablewalk makes to L2 which misses in L2 when hardware tablewalk is enabled. Note: Depending on the hardware tablewalk configuration, each ITLB miss may issue from 1 to 4 requests to L2 to search TSB’s.
`DTLB_miss_L2`	For each DTLB miss, this counts the number of accesses the DTLB hardware tablewalk makes to L2 which misses in L2 when hardware tablewalk is enabled. Note: Depending on the hardware tablewalk configuration, each DTLB miss may issue from 1 to 4 requests to L2 to search TSB’s.
`Stream_LD_to_PCX`	Counts the number of SPU load operations to L2.
`Stream_ST_to_PCX`	Counts the number of SPU store operations to L2.
`CPU_LD_to_PCX`	Counts the number of CPU loads to L2.
`CPU_Ifetch_to_PCX`	Counts the number of I-fetches to L2.
`CPU_ST_to_PCX`	Counts the number of CPU stores to L2.
`MMU_LD_to_PCX`	Counts the number of MMU loads to L2.
`DES_3DES_OP`	Increments for each CWQ or ASI operation that uses DES/3DES unit.
`AES_OP`	Increments for each CWQ or ASI operation which uses AES unit.
`RC4_OP`	Increments for each CWQ or ASI operation which uses RC4.
`MD5_SHA1_SHA256_OP`	Increments for each CWQ or ASI operation which uses MD5, SHA-1, or SHA-256.
`MA_OP`	Increments for each CWQ or ASI modular arithmetic operation.
`CRC_TCPIP_Cksum_OP`	Increments for each iSCSI CRC or TCP/IP checksum operation.
`DES_3DES_Busy_cycle`	Increments each cycle when DES/3DES unit is busy.
`AES_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the AES operation.
`RC4_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the RC4 operation.
`MD5_SHA1_SHA256_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the `MD5_SHA1_SHA256` operation.
`MA_Busy`	Increments each cycle when modular arithmetic unit is busy.
`CRC_MPA_Cksum`	Increments each cycle when CRC/MPA/checksum unit is busy.
`ITLB_miss`	Includes all misses (successful and unsuccessful tablewalks).
`DTLB_miss`	Includes all misses (successful and unsuccessful tablewalks).
`TLB_miss`	Counts both ITLB and DTLB misses (successful and unsuccessful tablewalks).

Note - The final output of the profiler displays the Event names, shown in TABLE 3-5, which are the same as the events listed in Sun Netra Data Plane Software Suite 2.1 Update 1 Reference Manual.

When taking measurements, ensure that the application behavior is in a steady state. To check this behavior, measure the event a few times to see that it does not vary by more than a few percent between measurements. Since a user can measure any two events at a time, in order to measure all 38 CPU counters, 19 measurements are required. The application behavior should be consistent over the entire collection period. To profile each strand on a 64-thread application, each thread must have code to read and set the counters. Sample code is provided in EXAMPLE 3-2 (Sample Profiler API Usage for the Sun UltraSPARC T2 Processor). The user must compile their own aggregate statistics across multiple strands or a core.

The Sun UltraSPARC T2 DRAM Performance Counters are the same as the Sun UltraSPARC T1 DRAM Performance Counters described in TABLE 3-3.

User-Defined Statistics

The key user-defined statistic is the count of packets processed by the thread. Another statistic that can be important is a measure of idle time, which is the number of times the thread polled for a packet and did not find any packets to process.

The following example shows how to measure idle time. Assume that the workload looks like the following:

while(1){
       If( work_to_do ) {
           Do work
           Increment work_count
       } else {
           Increment idle_loop_count
       }
}

User-defined counters count the number of times through the loop where no work was done. Measure the time of the idle loop by running idle loop alone (idle_loop_time). Then run real workload, counting the number of idle loops (idle_loop_count)

Idle_time = idle_loop_count * idle_loop_time

Profiling Metrics

The user can calculate the following metrics after collecting the appropriate hardware counter data using the Sun Netra DPS profiling infrastructure. Use the metrics to quantify performance effects and help in optimizing the application performance.

Instructions per cycle (IPC)

Calculate this metric by dividing instruction count by the total number of ticks during a time period when the thread is in a stable state. The user can also calculate the IPC for a specific section of code. The highest number possible is 1 IPC, which is the maximum throughput of 1 core of the UltraSPARC T processor.

Cycles per instructions (CPI)

This metric is the inverse of IPC. This metric is useful for estimating the effect of various stalls in the CPU.

Instruction cache misses per instruction (IC_miss per instruction)

Multiplying this number with the L1 cache miss latency helps estimate the cost, in cycles, of instruction cache misses. Compare this number to the overall CPI to see if this is the cause of a performance bottleneck.

L2 instruction cache misses per instruction (L2_imiss per instruction)

This metric indicates the number of instructions that miss in the L2 cache, and enables you to calculate the contribution of instruction misses to overall CPI.

Data cache misses per instruction (DC_miss per instruction)

Data cache miss rate in combination with the L2 cache miss rate quantifies the effect of memory accesses. Multiplying this metric with data cache miss latency provides an indication of its effect (contribution) on CPI.

L2 cache misses per instruction (L2_miss per instruction)

Similar to data cache miss rate, this metric has higher cost in terms of cycles of contribution to overall CPI. This metric also enables you to estimate the memory bandwidth requirements.

Using the Profiler Script

The profiler script is used to summarize the profiling output generated from the profiler. The profiler script (written in perl) converts the raw profiler output to a summarized format that is easy to read and interpret.

Profiler Scripts

Two scripts are available, profiler.pl and profiler_n2.pl. profiler.pl is used for parsing outputs generated from a Sun UltraSPARC T1 (CMT1) processor. profile_n2.pl is used for parsing outputs generated from a Sun UltraSPARC T2 (CMT2) processor.

Usage

For Sun UltraSPARC T1 platforms (such as a Sun Fire T2000 system):

% profiler.pl input_file > output_file

For Sun UltraSPARC T2 platforms (such as a Sun SPARC Enterprise T5220 system):

% profiler_n2.pl input_file > output_file

input_file

This file consists of raw profile data generated by the Sun Netra DPS profiler. Typically, this data is captured on the console and saved into a file with .csv suffix, indicating that this is a CSV (comma-separated values) file. For example, input_file.csv

output_file

This file is generated by redirecting the outputs of the profiler.pl script to an output file. This file should also be in CSV format. For example, output_file.csv.

Note - If there is no redirection (that is, the output_file is not specified), the output of the script will display on the console.

Raw Profile Data

Raw profile data is the direct output from the profiler.

The following shows an example of the raw profile data output from a Sun UltraSPARC T1 processor:

TEJA_PROFILE_DUMP_START,ver1.1
CPUID,ID,Type,Cycles,PC,Grp,Evt_Hi,Evt_Lo,Overflow,User Data
4,18236,1,4cf2eb9ce4,521f08,1,100,1
4,3a2f,2,4d048acb40,5128f0,1,31cffa4,c2a,0,1b7740,3da594c
4,18236,1,4d048ad5c4,521f08,1,100,2
4,3a2f,2,4d162a0db0,5128f0,1,31d274e,0,0,1e8480,3da594c
4,18236,1,4d162a1888,521f08,1,100,4
4,3a2f,2,4d27c951cc,5128f0,1,31d2e36,50e,0,2191c0,3da594c
4,18236,1,4d27c95c28,521f08,1,100,8
4,3a2f,2,4d396893a0,5128f0,1,31d238f,25b863,0,249f00,3da594c
4,18236,1,4d39689dd8,521f08,1,100,10
4,3a2f,2,4d4b07cca0,5128f0,1,31cf8de,0,0,27ac40,3da594c
4,18236,1,4d4b07d708,521f08,1,100,20
4,3a2f,2,4d5ca70e88,5128f0,1,31d183c,0,0,2ab980,3da594c
4,18236,1,4d5ca7194c,521f08,1,100,40
4,3a2f,2,4d6e4654ac,5128f0,1,31d2bd3,1b2,0,2dc6c0,3da594c
4,18236,1,4d6e465ef4,521f08,1,100,80
TEJA_PROFILE_DUMP_END

The following shows an example of the raw profile data output from the Sun UltraSPARC T2 processor:

TEJA_PROFILE_DUMP_START,ver1.1
CPUID,ID,Type,Cycles,PC,Grp,Evt_Hi,Evt_Lo,Overflow,User Data
2,315,1,d8a403c78c,52cf10,1,12,12
2,21c9,2,d8a403e3b1,514fe8,1,e,e,0,927c0,1d905b
2,4cd,1,d8a403eca2,52cf10,1,22,22
2,21c9,2,d8b8cd3be2,514fe8,1,5e89cc,5e89cc,0,30d40,0
2,4cd,1,d8b8cd3fee,52cf10,1,42,42
2,21c9,2,d8cd9812d0,514fe8,1,0,0,0,30d40,0
2,4cd,1,d8cd98178a,52cf10,1,82,82
2,21c9,2,d8e2636b16,514fe8,1,db21ac,db21ac,0,30d40,0
2,4cd,1,d8e2636f18,52cf10,1,102,102
2,21c9,2,d8f72f1c5c,514fe8,1,46042d,46042d,0,30d40,0
2,4cd,1,d8f72f2058,52cf10,1,202,202
2,21c9,2,d90bfa2d22,514fe8,1,0,0,0,30d40,0
2,4cd,1,d90bfa3181,52cf10,1,402,402
2,21c9,2,d920c5ce6c,514fe8,1,24ea141,24ea141,0,30d40,0
2,4cd,1,d920c5d301,52cf10,1,802,802
2,21c9,2,d93590ffc6,514fe8,1,8fb2c,8fb2c,0,30d40,0
2,4cd,1,d9359103dc,52cf10,1,fd2,fd2
2,21c9,2,d94a5cf7e3,514fe8,1,3f5f51c,3f5f51c,0,30d40,0
2,4cd,1,d94a5cfc19,52cf10,1,13,13
2,21c9,2,d95f283398,514fe8,1,0,0,0,30d40,0
2,4cd,1,d95f28379f,52cf10,1,23,23
2,21c9,2,d973f413a1,514fe8,1,2734a8,2734a8,0,30d40,0
2,4cd,1,d973f417ba,52cf10,1,103,103
2,21c9,2,d988bfbbca,514fe8,1,0,0,0,30d40,0
2,4cd,1,d988bfbfe1,52cf10,1,203,203
2,21c9,2,d99d8be47f,514fe8,1,61aa,61aa,0,30d40,0
2,4cd,1,d99d8be94f,52cf10,1,44,44
2,21c9,2,d9b257ba5a,514fe8,1,0,0,0,30d40,0
2,4cd,1,d9b257be48,52cf10,1,84,84
2,21c9,2,d9c7237ebc,514fe8,1,0,0,0,30d40,0
2,4cd,1,d9c72382f0,52cf10,1,104,104
2,21c9,2,d9dbee7725,514fe8,1,0,0,0,30d40,0
2,4cd,1,d9dbee7b2f,52cf10,1,204,204
2,21c9,2,d9f0b99d84,514fe8,1,0,0,0,30d40,0
2,4cd,1,d9f0b9a1c5,52cf10,1,15,15
2,21c9,2,da05853c14,514fe8,1,0,0,0,30d40,0
2,4cd,1,da05854024,52cf10,1,25,25
2,21c9,2,da1a5067bf,514fe8,1,0,0,0,30d40,0

2,4cd,1,da1a506bdd,52cf10,1,45,45
2,21c9,2,da2f1c54fd,514fe8,1,300388,300388,0,30d40,0
2,4cd,1,da2f1c5948,52cf10,1,85,85
2,21c9,2,da43e87245,514fe8,1,0,0,0,30d40,0
2,4cd,1,da43e876d0,52cf10,1,105,105
2,21c9,2,da58b3416a,514fe8,1,3d0910,3d0910,0,30d40,0
2,4cd,1,da58b3457e,52cf10,1,205,205
2,21c9,2,da6d7e5a3b,514fe8,1,0,0,0,30d40,0
2,4cd,1,da6d7e5e5d,52cf10,1,16,16
2,21c9,2,da824aa191,514fe8,1,0,0,0,30d40,0
2,4cd,1,da824aa5e5,52cf10,1,26,26
2,21c9,2,da9715c92e,514fe8,1,0,0,0,30d40,0
2,4cd,1,da9715cd85,52cf10,1,46,46
2,21c9,2,daabe167f2,514fe8,1,0,0,0,30d40,0
2,4cd,1,daabe16c18,52cf10,1,86,86
2,21c9,2,dac0ad6c8d,514fe8,1,0,0,0,30d40,0
2,4cd,1,dac0ad7142,52cf10,1,106,106
2,21c9,2,dad5792613,514fe8,1,0,0,0,30d40,0
2,4cd,1,dad5792a2b,52cf10,1,206,206
2,21c9,2,daea449364,514fe8,1,0,0,0,30d40,0
2,4cd,1,daea44979f,52cf10,1,17,17
2,21c9,2,daff0f72f4,514fe8,1,0,0,0,30d40,0
2,4cd,1,daff0f76fd,52cf10,1,27,27
2,21c9,2,db13db2e84,514fe8,1,0,0,0,30d40,0
2,4cd,1,db13db32cc,52cf10,1,47,47
2,21c9,2,db28a68860,514fe8,1,0,0,0,30d40,0
2,4cd,1,db28a68c8d,52cf10,1,87,87
2,21c9,2,db3d7120a0,514fe8,1,0,0,0,30d40,0
2,4cd,1,db3d7125a6,52cf10,1,107,107
2,21c9,2,db523c58b1,514fe8,1,0,0,0,30d40,0
2,4cd,1,db523c5cdf,52cf10,1,207,207
2,21c9,2,db6707bf3f,514fe8,1,0,0,0,30d40,0
2,4cd,1,db6707c3ea,52cf10,1,4b,4b
2,21c9,2,db7bd4202d,514fe8,1,0,0,0,30d40,0
2,4cd,1,db7bd42494,52cf10,1,8b,8b
2,21c9,2,db909fb827,514fe8,1,0,0,0,30d40,0
2,4cd,1,db909fbc6c,52cf10,1,cb,cb
2,21c9,2,dba56a6332,514fe8,1,0,0,0,30d40,0
2,4cd,1,dba56a67dd,52cf10,1,12,12
TEJA_PROFILE_DUMP_END

Summarized Profile Data

Summarized profile data is the processed data generated from the profiler.pl and the profile_n2.pl for the Sun UltraSPARC T1 (CMT1) and (Sun UltraSPARC T2 (CMT2) processors, respectively.

Sun UltraSPARC T1 Processor Profiler Output

For the Sun UltraSPARC T1 processor, the summary displays as in the following example:

cpuid , cycle ,  SB_full ,ITLB_miss ,Instr_cnt ,FP_instr_cnt ,DTLB_miss ,IC_miss ,L2_Imiss ,DC_miss ,L2_Dmiss_LD ,userdata1 ,userdata2 ,
4 , 289219777 ,3121, 0, 51104522, 0, 0, 1080, 433, 2471858, 236191, 2600000  ,64641356 ,
CPU,StartPC,UpdatePC,Cycles,Instr_cnt,CntrName,Value,UserData.1,UserData.2,
4,0x521f08,0x5128f0,295649212,52240523,FP_instr_cnt,0,400000,64641356,
4,0x521f08,0x5128f0,147824128,26122620,IC_miss,689,600000,64641356,
4,0x521f08,0x5128f0,295647284,52238312,DC_miss,2472263,800000,64641356,
4,0x521f08,0x5128f0,295646420,52234078,ITLB_miss,0,1000000,64641356,
4,0x521f08,0x5128f0,295644896,52241803,DTLB_miss,0,1200000,64641356,
4,0x521f08,0x5128f0,295649084,52246157,L2_Imiss,434,1400000,64641356,
4,0x521f08,0x5128f0,295646316,52250156,L2_Dmiss_LD,236270,1600000,64641356,
4,0x521f08,0x5128f0,295644764,52232100,SB_full,3114,1800000,64641356,

TABLE 3-6 describes each field in the top section of the summarized Sun UltraSPARC T1 profile data output:

TABLE 3-6 Sun UltraSPARC T1 Profile Data Output Field Descriptions
Field	Description
`cpuid`	CPU ID found in the first column of the raw profile data. Note: If profiling is done for multiple strands, then multiple rows of summarized data (with different CPU IDs) are shown in the top section.
`cycle`	Average number of clock cycles elapsed per profiling interval.
`SB_full`	Average number of `SB_full` occurrences per profiling interval.
`ITLB_miss`	Average number of `ITLB_miss` occurrences per profiling interval.
`Instr_cnt`	Average number of instructions executed per profiling interval.
`FP_instr_cnt`	Average number of floating point instructions executed per profiling interval.
DTLB_miss	Average number of `DTLB_miss` occurrences per profiling interval.
IC_miss	Average number of `IC_miss` occurrences per profiling interval.
L2_Imiss	Average number of `L2_Imiss` occurrences per profiling interval.
DC_miss	Average number of `DC_miss` occurrences per profiling interval.
L2_Dmiss_LD	Average number of `L2_Dmiss_LD` occurrences per profiling interval.
UserData.1	Average number taken from the User Defined Data1 column.
UserData.2	Average number taken from the User Defined Data2 column.

Sun UltraSPARC T2 Processor Profiler Output

For the Sun UltraSPARC T2 processor, the summary displays as in the following example:

CPUid                        8
Cycles                       213357798
Store_instr                  5157787
L2_instr_misses              549
ITLB_miss_L2                 0
CPU_ST_to_PCX                4801072
MA_OP                        0
MA_Busy                      0
Completed_branches           8346953
Icache_misses                1932
Stream_LD_to_PCX             0
DES_3DES_OP                  0
DES_3DES_Busy_cycle          0
Sethi_instr                  0
L2_load_misses               59993
DTLB_miss_L2                 0
MMU_LD_to_PCX                0
CRC_TCPIP_Cksum_OP           0
CRC_MPA_Cksum                0
Taken_branches               5334546
Dcache_misses                1024428
Stream_ST_to_PCX             0
AES_OP                       0
AES_Busy_cycle               0
Other_instr                  37370926
FGU_arithmatic_instr         0
ITLB_ref_L2                  0
CPU_LD_to_PCX                1779478
RC4_OP                       0
RC4_Busy_cycle               0
ITLB_miss                    0
Atomics                      347142
Load_instr                   14564094
DTLB_ref_L2                  0
CPU_Ifetch_to_PCX            2603
MD5_SHA1_SHA256_OP           0
MD5_SHA1_SHA256_Busy_cycle   0
DTLB_miss                    0
TLB_miss                     0
All_instr                    65033422
Userdata.1                   200000
Userdata.2                   0

Note - The data in the second and third sections of the Sun UltraSPARC T2 summary are identical. The format of the first section is the field header. The format in the second section matches the layout of the field header. The format in the third section is in one single column. This layout enables you to easily transfer data to a spreadsheet file column.

TABLE 3-7 describes each field in the top section of the summarized Sun UltraSPARC T2 profile data output:

TABLE 3-7 Sun UltraSPARC T2 Profile Data Output Field Descriptions
Field	Description
`CPUid`	CPU ID found in the first column of the raw profile data. Note: If profiling is done for multiple strands, then multiple rows of summarized data (with different CPU IDs) are shown in the top section.
`cycles`	Average number of clock cycles elapsed per profiling interval.
`Completed_branches`	Number of completed branches per profiling interval.
`Taken_branches`	Number of branches taken per profiling interval.
`FGU_arithmatic_instr`	Number of Floating-point arithmetic instructions executed per profiling interval.
`Load_instr`	Number of Load instructions executed per profiling interval.
`Store_instr`	Number of Store Instructions executed per profiling interval.
`sethi_instr`	Number of `sethi` instructions executed per profiling interval.
`Other_instr`	Number of all other instructions executed per profiling interval.
`Atomics`	Number of atomic operations executed per profiling interval.
`All_instr`	Total number of instructions executed per profiling interval.
`Icache_misses`	Number of Instruction Cache misses per profiling interval.
`Dcache_misses`	Number of L1 Data Cache misses per profiling interval.
`L2_instr_misses`	Number of L2 cache instruction misses per profiling interval.
`L2_load_misses`	Number of L2 cache load misses per profiling interval.
`ITLB_ref_L2`	For each ITLB miss, this is the number of accesses the ITLB hardware tablewalk makes to L2 per profiling interval when hardware tablewalk is enabled.
`DTLB_ref_L2`	For each DTLB miss, this is the number of accesses the DTLB hardware tablewalk makes to L2 per profiling interval when hardware tablewalk is enabled.
`ITLB_miss_L2`	For each ITLB miss, this is the number of accesses the ITLB hardware tablewalk makes to L2 which misses in L2 per profiling interval when hardware tablewalk is enabled. Note: Depending on the hardware tablewalk configuration, each ITLB miss may issue from 1 to 4 requests to L2 to search TSB’s.
`DTLB_miss_L2`	For each DTLB miss, this is the number of accesses the DTLB hardware tablewalk makes to L2 which misses in L2 per profiling interval when hardware tablewalk is enabled. Note: Depending on the hardware tablewalk configuration, each DTLB miss may issue from 1 to 4 requests to L2 to search TSB’s.
`Stream_LD_to_PCX`	Number of SPU load operations to L2 per profiling interval.
`Stream_ST_to_PCX`	Number of SPU store operations to L2 per profiling interval.
`CPU_LD_to_PCX`	Number of CPU loads to L2 per profiling interval.
`CPU_Ifetch_to_PCX`	Number of I-fetches to L2 per profiling interval.
`CPU_ST_to_PCX`	Number of CPU stores to L2 per profiling interval.
`MMU_LD_to_PCX`	Number of MMU loads to L2 per profiling interval.
`DES_3DES_OP`	Number of increments for each CWQ or ASI operation which uses DES/3DES unit per profiling interval.
`AES_OP`	Number of increments for each CWQ or ASI operation which uses AES unit per profiling unit.
`RC4_OP`	Number of increments for each CWQ or ASI operation which uses RC4 per profiling interval.
`MD5_SHA1_SHA256_OP`	Number of increments for each CWQ or ASI operation which uses MC5, SHA-1, or SHA-256 per profiling interval.
`MA_OP`	Number of increments for each CWQ or ASI modular arithmetic operation per profiling interval.
`CRC_TCPIP_Cksum_OP`	Number of increments for each iSCSI CRC or TCP/IP checksum operation per profiling interval.
`DES_3DES_Busy_cycle`	Number of increments per profiling interval for each cycle when DES/3DES unit is busy
`AES_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the AES operation.
`RC4_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the RC4 operation.
`MD5_SHA1_SHA256_Busy_cycle`	Number of busy cycles encountered per profiling interval when attempting to execute the `MD5_SHA1_SHA256` operation.
`MA_Busy`	Number of increments per profiling interval for each cycle when modular arithmetic unit is busy.
`CRC_MPA_Cksum`	Number of increments per profiling interval for each cycle when CRC/MPA/checksum unit is busy.
`ITLB_miss`	Number of misses (successful and unsuccessful tablewalks) per profiling interval.
`DTLB_miss`	Number of misses (successful and unsuccessful tablewalks) per profiling interval.
`TLB_miss`	Number of both ITLB and DTLB misses, including successful and unsuccessful tablewalks per profiling interval.
`Userdata.1`	Average number taken from the User Defined Data1 column.
`Userdata.2`	Average number taken from the User Defined Data2 column.

Performance Parameters Calculations

Use the output values of the summarized data to derive various important performance parameters. This section lists performance parameters and the method from which they are derived.

Key for this section:

Division: /

Multiplication: *

pkts_per_interval = Number of packets per interval (for example, 200000)

This can be obtained from the Userdata.1 field.

cpu_frequency = CPU frequency in Hz (for example, 1200000000 for Sun Fire T2000 system)

Sun UltraSPARC T1 Processor

Instructions per Packet:

Average number of instructions executed in a packet.

Formula: value = (Instr_cnt / pkts_per_interval)

Instructions per Cycle (IPC):

Average number of instructions executed per cycle.

Formula: value = (Instr_cnt / cycle)

Packet Rate:

Average number of packets executed per second (in Kilo-packets per second).

Formula: value = ((pkts_per_interval / (cycle / cpu_frequency)) / 1000)

`SB_full` per thousand instructions:

Average number of SB_full occurrences per 1000 instructions executed.

Formula: value = ((SB_full / Instr_cnt) * 1000)

`FP_instr_cnt` per thousand instructions:

Average number of FP_instr_cnt occurrences per 1000 instructions executed.

Formula: value = ((FP_Instr_cnt / Instr_cnt) * 1000)

`IC_miss` per thousand instructions:

Average number of IC_miss occurrences per 1000 instructions executed.

Formula: value = ((IC_miss / Instr_cnt) * 1000)

`DC_miss` per thousand instructions:

Average number of DC_miss occurrences per 1000 instructions executed.

Formula: value = ((DC_miss / Instr_cnt) * 1000)

`ITLB_miss` per thousand instructions:

Average number of ITLB_miss occurrences per 1000 instructions executed.

Formula: value = ((ITLB_miss / Instr_cnt) * 1000)

`DTLB_miss` per thousand instructions:

Average number of DTLB_miss occurrences per 1000 instructions executed.

Formula: value = ((DTLB_miss / Instr_cnt) * 1000)

`L2_imiss` per thousand instructions:

Average number of L2_miss occurrences per 1000 instructions executed.

Formula: value = ((L2_miss / Instr_cnt) * 1000)

`L2_dmiss_LD` per thousand instructions:

Average number of L2_Dmiss_LD occurrences per 1000 instructions executed.

Formula: value = ((L2_miss / Instr_cnt) * 1000)

Sun UltraSPARC T2 Processor

Instruction per Packet:

Average number of instructions executed in a packet.

Formula: value = (All_instr / pkts_per_interval)

Instructions per Cycle (IPC):

Average number of instructions executed per cycle.

Formula: value = (All_instr / cycle)

Note - The Sun UltraSPARC T2 processor has two pipelines in each core. The maximum IPC number of each pipeline is 1. Therefore, the maximum IPC number of each core is 2. Pipeline utilization is this number of each pipeline multiplied by 100%. For example, if the IPC is 0.8, then the pipeline utilization of that pipeline is 80%.

Store Instructions per Packet:

Average number of Store instructions executed per packet.

Formula: value = (Store_instr / pkts_per_interval)

Load Instructions per Packet:

Average number of Load instructions executed per packet.

Formula: value = (Load_instr / pkts_per_interval)

L2 Load misses per Packet:

Average number of L2 cache Load misses per packet.

Formula: value = (L2_load_misses / pkts_per_interval)

Icache misses per 1000 Packets:

Average number of L1 Icache misses per 1000 packet.

Formula: value = (Icache_misses x 1000) / pkts_per_interval)

Dcache misses per Packet:

Average number of L1 Icache misses per packet.

Formula: value = (Dcache_misses / pkts_per_interval)

Packet Rate:

Average number of packets executed per second (in Kilo-packets per second).

Formula: value = ((pkts_per_interval / (cycle / cpu_frequency)) / 1000)

Note - Not all possible parameters are shown here. The user can derive any parameter with any formula using the data outputs from the summary.

Note - These formulas can easily be inserted into a spreadsheet program.

To Use a Spreadsheet for Performance Analysis

1. Open the summary file.

For example, an output_file.csv generated by profiler.pl (for UltraSPARC T1) or by profiler_n2.pl (for UltraSPARC T2).

2. Insert formulas into the spreadsheet.

See the sample_analysis.sxc spreadsheet provided as part of the software package. You can open with an OpenOffice compatible software. This file is included in the SUNWndps/src/libs/profile directory. The first spreadsheet in this template (click on the Output from profile script tab) consists of sample output generated from
Step 1. The second spreadsheet in this template (click on the Analysis tab) consists of formulas for computing the data in the first spreadsheet. The format in the Analysis spreadsheet is designed so that you can compare the data generated on each thread side by side.

3. Save the spreadsheet for future reference.

You can form your own spreadsheet templates for your own analysis. For example, each application can have its own data imported to a spreadsheet for analysis.

^{1 (TableFootnote) Tcc instructions that are cancelled due to encountering a higher-priority trap are still counted.}

^{2 (TableFootnote) SB_full increments every cycle a strand (virtual processor) is stalled due to a full store buffer, regardless of whether other strands are able to keep the processor busy. The overflow trap for SB_full is not precise to the instruction following the event that occurs when ovfl is set. The trap might occur on the instruction following the event or the following two instructions.}

^{3 (TableFootnote) Only floating-point instructions that execute in the shared FPU are counted. The following instructions are executed in the shared FPU: FADDS, FADDD, FSUBS, FSUBD, FMULS, FMULD, FDIVS, FDIVD, FSMULD, FSTOX, FDTOX, FXTOS, FXTOD, FITOS, FDTOS, FITOD, FSTOD, FSTOI, FDTOI, FCMPS, FCMPD, FCMPES, FCMPED.}

^{4 (TableFootnote) L2 misses because stores cannot be counted by the performance instrumentation logic.}

Profiler Introduction

How the Profiler Works

Groups and Events

Profiler Output

Profiler Examples

Profiler API

Profiler API Usage for the Sun UltraSPARC T1 Processor

Profiler API Usage for the Sun UltraSPARC T2 Processor

Profiler Configuration

Profiler Output Example

Profiling Application Performance

Sun UltraSPARC T1 Performance Counters

Sun UltraSPARC T2 Performance Counters

User-Defined Statistics

Profiling Metrics

Using the Profiler Script

Profiler Scripts

Usage

input_file

output_file

Raw Profile Data

Summarized Profile Data

Sun UltraSPARC T1 Processor Profiler Output

Sun UltraSPARC T2 Processor Profiler Output

Performance Parameters Calculations

Sun UltraSPARC T1 Processor

Instructions per Packet:

Instructions per Cycle (IPC):

Packet Rate:

SB_full per thousand instructions:

FP_instr_cnt per thousand instructions:

IC_miss per thousand instructions:

DC_miss per thousand instructions:

ITLB_miss per thousand instructions:

DTLB_miss per thousand instructions:

L2_imiss per thousand instructions:

L2_dmiss_LD per thousand instructions:

Sun UltraSPARC T2 Processor

Instruction per Packet:

Instructions per Cycle (IPC):

Store Instructions per Packet:

Load Instructions per Packet:

L2 Load misses per Packet:

Icache misses per 1000 Packets:

Dcache misses per Packet:

Packet Rate:

To Use a Spreadsheet for Performance Analysis

`SB_full` per thousand instructions:

`FP_instr_cnt` per thousand instructions:

`IC_miss` per thousand instructions:

`DC_miss` per thousand instructions:

`ITLB_miss` per thousand instructions:

`DTLB_miss` per thousand instructions:

`L2_imiss` per thousand instructions:

`L2_dmiss_LD` per thousand instructions: