Sun Directory Server Enterprise Edition 7.0 Troubleshooting Guide

Troubleshooting Drops in Performance

This section describes how to begin troubleshooting a drop in performance. It describes possible causes of performance drops, the information you need to collect if you experience a performance drop, and how to analyze that information.

Possible Causes of a Drop in Performance

Make certain that you have not mistaken an active or passive hang for a performance drop. If you are experiencing a performance drop, it could be for one of the following reasons:

Network latency between clients and the server

Insufficient memory, with the system running low on swap space

Saturated disks or disk controllers

Other processes competing for CPU

Expensive operations, such as searches and updates involving static groups or roles, for which the server is not properly tuned

Collecting Data About a Drop in Performance

Collect information about disk, CPU, memory, and process stack use during the period in which performance is dropping.

Collecting Disk, CPU, and Memory Statistics

If your CPU usage is very low (at or around 10%), try to determine whether the problem is network related by using the netstat command as follows, where port is the port on which the Directory Server listens:


# netstat -an | grep port

A performance drop may be the result of the network if a client is not receiving information even though the access logs show that results were sent immediately. Running the ping and traceroute commands can help you determine whether network latency is responsible for the problem.
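
For example, you might run the following commands from a client machine to get a first indication of round-trip time and of the route packets take to the server. The host name ds.example.com is a placeholder for your Directory Server host; on Solaris, ping -s reports the round-trip time of each packet.

# ping -s ds.example.com
# traceroute ds.example.com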

Collect swap information to see if you are running out of memory. Memory may be your problem if the amount of free swap reported by the following commands is small.

Solaris 

swap -l

HP-UX 

swapinfo

Linux 

free

Windows 

Already provided in C:\report.txt
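
On Solaris, the swap -s command is a quick alternative summary. It reports allocated, reserved, used, and available swap on a single line; a low available figure points to memory pressure.

# swap -s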

On Solaris, use the output of the prstat command to identify whether other processes could be impacting system performance. On Linux and HP-UX, use the top command.
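
For example, the following Solaris command lists the ten processes consuming the most CPU, sampling every 5 seconds for 3 samples. The interval and counts here are arbitrary choices, not required values.

# prstat -s cpu -n 10 5 3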

Collecting Consecutive Process Stacks on Solaris

Collect consecutive pstack and prstat output of the Directory Server during the period when the performance drops, as described in Analyzing Data About an Unresponsive Process: an Example. For example, you could use the following script on Solaris to gather pstack and prstat information:


#!/bin/sh

# Collect ten consecutive prstat and pstack samples, one second apart,
# for the Directory Server process ID passed as the first argument.
i=0
while [ "$i" -lt "10" ]
do
        echo "$i"
        date=`date "+%y%m%d:%H%M%S"`
        prstat -L -p $1 0 1 > /tmp/prstat.$date
        pstack $1 > /tmp/pstack.$date
        i=`expr $i + 1`
        sleep 1
done
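
As a usage sketch, assuming the script is saved as /tmp/stacks.sh and that a single Directory Server instance is running under the process name ns-slapd, you might invoke it as follows:

# chmod +x /tmp/stacks.sh
# /tmp/stacks.sh `pgrep -x ns-slapd`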

Analyzing Data Collected About a Performance Problem

In general, look through your data for patterns and commonalities in the errors encountered. For example, if all operation problems are associated with searches on static groups, modifications to static groups, and searches on roles, this indicates that Directory Server is not properly tuned to handle these expensive operations. For instance, the nsslapd-search-tune attribute might not be configured correctly for static group related searches, or the uniqueMember attribute might be indexed for substring searches, which slows group related updates. If you notice that problems are associated with unrelated operations but all occur at a particular time, this might indicate a memory access problem or a disk access problem.
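
As one way to check the indexing hypothesis, the index configuration can be read over LDAP. The following is a sketch only: it assumes the default userRoot backend, a server on localhost port 389, and cn=Directory Manager credentials (password is a placeholder); adjust all of these for your deployment.

# ldapsearch -h localhost -p 389 -D "cn=Directory Manager" -w password \
  -b "cn=uniquemember,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" \
  -s base "(objectclass=*)" nsIndexType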

You can take information culled from your pstack output to SunSolve and search for it along with the phrase unresponsive events to see if anything similar to your problem has already been encountered and solved. SunSolve is located at http://sunsolve.sun.com/pub-cgi/show.pl?target=tous

The remainder of this section provides additional tips to help you analyze the data you collected in the previous steps.

Analyzing the Access Log Using the logconv Command

You can use the logconv command to analyze the Directory Server access logs. This command extracts usage statistics and counts the occurrences of significant events. For more information about this tool, see logconv(1).

For example, run the logconv command as follows:


# logconv -s 50 -efcibaltnxgju access > analysis.access

Check the output file for the following:

The number and types of errors returned

Unindexed searches

Abandoned operations

Operations with long elapsed times (etimes)

The most frequent clients, bind DNs, and search filters
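
If logconv is not available, a rough list of the slowest operations can be pulled from the access log with standard shell tools. This is a sketch that assumes the default access log format, in which each RESULT line carries an etime= field:

# awk '{for (i = 1; i <= NF; i++) if ($i ~ /^etime=/) print $i}' access | sort -t= -k2 -rn | head -10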

Identifying Capacity Limitations: an Exercise

Often a capacity limitation manifests itself as a performance issue. To differentiate between the two: performance might be defined as “how fast the system is going,” while capacity is “the maximum performance of the system or an individual component.”

If your CPU usage is very low (at or around 10%), try to determine whether the disk controllers are fully loaded and whether input/output is the cause. To determine whether your problem is disk related, use the iostat tool as follows:


# iostat -xnMCz -T d 10

Consider, for example, a directory that is available on the Internet. Customers submit searches from multiple sites, and the Service Level Agreement (SLA) allows no more than 5% of requests to have response times over 3 seconds. Currently, 15% of requests take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with 12 900-MHz CPUs.

The vmstat output looks as follows:


procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m1 m1   in   sy   cs us sy id
 0 2 0 8948920 5015176 374 642 10 12 13 0 2 1  2  1  2  132 2694 1315 14  3 83
 0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19  4 77
 0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16  5 79
 0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
 0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17  5 78

We look at the three rightmost columns, us (user), sy (system), and id (idle), which show that over 50% of the CPU is idle and therefore available; the CPU is not the cause of the performance problem. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, that is, if the scan rate goes over 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left has 18 or 19 processes in it while the run queue is empty. This suggests that the process is blocking somewhere in Solaris without using all of the available CPU.
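
Statistics like those above can be gathered with a command such as the following, which takes 10 samples at 5-second intervals. The interval and count are arbitrary.

# vmstat 5 10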

Next, we look at the I/O subsystem. The iostat command has a switch, -C, which aggregates I/Os at the controller level. We run the iostat command as follows:


#  iostat -xnMCz -T d
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  396.4   10.7    6.6    0.1  0.0 20.3    0.0   49.9   0 199 c1
  400.2    8.8    6.7    0.0  0.0 20.2    0.0   49.4   0 199 c3
  199.3    6.0    3.3    0.0  0.0 10.1    0.0   49.4   0  99 c1t0d0
  197.1    4.7    3.3    0.0  0.0 10.2    0.0   50.4   0 100 c1t1d0
  198.2    3.7    3.4    0.0  0.0  9.4    0.0   46.3   0  99 c3t0d0
  202.0    5.1    3.3    0.0  0.0 10.8    0.0   52.4   0 100 c3t1d0

On controller 1 we are doing 396 reads per second, and on controller 3 we are doing 400 reads per second. On the right side of the data, the output shows each controller as almost 200% busy, and each individual disk as 100% busy doing almost 200 reads per second. That leads us to a rule of thumb: individual disks perform at approximately 150 I/Os per second. Note that this rule does not apply to LUNs or LDEVs presented by large disk arrays. Our examination of the numbers therefore suggests adding 2 disks to each controller and re-laying out the data.
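
A quick arithmetic check of that sizing, using the 150 I/Os per second rule of thumb and the roughly 400 reads per second observed per controller: each controller carries about 2.7 disks' worth of work, so four disks per controller leaves headroom at about 100 reads per second each.

# echo "400 / 150" | bc -l
2.66666666666666666666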

In this exercise we looked at all the numbers and attempted to locate the precise nature of the problem. Do not assume that adding CPUs and memory will fix all performance problems. In this case, the search load exceeded the capacity of the disk drives, which manifested itself as a performance problem: transactions with extreme response times. All those CPUs were waiting on the disk drives.