Sun Directory Server Enterprise Edition 7.0 Troubleshooting Guide

Troubleshooting an Unresponsive Process

The first step in troubleshooting a Directory Server that is still running but no longer responding to client application requests is to identify which of three types of performance problem it is experiencing. The type of problem depends on the level of CPU usage, as described in the following table.

Table 5–1 CPU Level Associated With Performance Problems


CPU Level	Problem Description
CPU = 0%	Passive hang, the server is completely unresponsive
10% < CPU < 90%	Performance drop, the server is operating but not at the expected rate
CPU = 100%	Active hang, the server is completely unresponsive
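
The table can be read as a simple decision rule. The following is a minimal sketch only: the function name classify_cpu is hypothetical, and the way it treats the ranges the table leaves open (below 10% and above 90%) is an assumption, not something this guide specifies.

```shell
#!/bin/sh
# classify_cpu PERCENT
# Maps an integer CPU percentage to the problem types in Table 5-1.
# Assumption: values below 10 are treated as a passive hang and values
# above 90 as an active hang, because the table does not cover them.
classify_cpu() {
    cpu=$1
    if [ "$cpu" -lt 10 ]; then
        echo "passive hang"
    elif [ "$cpu" -le 90 ]; then
        echo "performance drop"
    else
        echo "active hang"
    fi
}
```

For example, classify_cpu 45 reports a performance drop, which points to the procedures in Troubleshooting Drops in Performance.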

The remainder of this section describes the following troubleshooting procedures:

Symptoms of an Unresponsive Process

If your error log contains errors about not being able to open file descriptors, this is usually a symptom of an unresponsive process. For example, the error log may contain a message such as the following:

[17/APR/2009:01:41:13 +0000] - ERROR<12293> - Connection  - conn=-1 
op=-1 msgId=-1 - fd limit exceeded Too many open file descriptors - not listening 
on new connection

Other symptoms of an unresponsive process include LDAP connections that do not answer or that hang, no messages in the error or access logs, or an access log that is never updated.

Collecting Data About an Unresponsive Process

The prstat -L tool tells you the amount of CPU being used by each thread. If you collect a process stack using the pstack utility at the same time you run the prstat tool, you can then use the pstack output to see what the thread was doing when it had trouble. If you run the prstat and pstack commands simultaneously several times, you can see over time whether the same thread was causing the problem and whether it was encountering the problem during the same function call. If you are experiencing a performance drop, run the commands simultaneously every 2 seconds. If you are experiencing a passive or active hang, run the commands with a slightly longer delay, for example every 10 seconds.

Analyzing Data About an Unresponsive Process: an Example

For example, you try running an ldapsearch on your Directory Server as follows:

# ldapsearch -p 5389 -D "cn=Directory Manager" -w secret 
-b "o=test" description=*

Suppose this command runs for 40 seconds and does not give any results. To analyze why the process is unresponsive, first get the process ID using the following command:

# ps -aef | grep slapd | grep slapd-server1
   mares 15013 24159  0 13:06:20 pts/32   0:00 grep slapd-server1
   mares 14993     1  1 13:05:36 ?        0:04 ./ns-slapd -D
/local/dsInst -i /local/dsInst

Next, rerun the search and during the search run the prstat and pstack commands simultaneously for the Directory Server process, which in the output above has a process ID of 14993.

prstat -L -p 14993 0 1 > prstat.output ; pstack 14993 > pstack.output

We rerun the commands three times, with an interval of two seconds between each consecutive run.

The output of the first prstat command appears as follows:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID      
 14993 mares     128M  110M cpu0    59    0   0:00.02 3.0% ns-slapd/51
 14993 mares     128M  110M sleep   59    0   0:00.49 1.3% ns-slapd/32
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/16
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/15
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/14
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/13
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/12
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/11
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/10
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/9
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/8
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/6
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/5
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/4
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/3
Total: 1 processes, 51 lwps, load averages: 0.36, 0.29, 0.17

The problem appears to be occurring in thread 51. Next, we look for thread 51 in the output of the first pstack command and it appears as follows:

-----------------  lwp# 51 / thread# 51  --------------------
 ffffffff7eb55a78 ???????? (1, 102183a10, ffffffff70c1d340, 1001c5390, 0, 
ffffffff7ecea248)
 ffffffff77925fe0 id2entry (1002b7610, 1a09, 0, ffffffff70c1e7f4, 0, ffffffff77a6faa8) 
+ 3e8
 ffffffff7795ed20 ldbm_back_next_search_entry_ext (101cfcb90, 10190fd60, 0, 101b877b0, 
1a08, 45b4aa34) + 300
 ffffffff7ebaf6f8 ???????? (101cfcb90, 1002b7610, 1, ffffffff70c1eaf4, 0, 0)
 ffffffff7ebafbc4 ???????? (101cfcb90, 1, ffffffff70c1eaf4, 0, 10190fd60, 
ffffffff70c1e980)
 ffffffff7ebaf170 op_shared_search (101cfcb90, 0, 1015ad240, 0, ffffffffffffffff, 
ffffffff7ecea248) + 8c0
 ffffffff7e92efcc search_core_pb (101cfcb90, 2, 1000, 4000, ffffffff7ea4c810, 
ffffffff7ea56088) + 6c4
 ffffffff7e93a710 dispatch_operation_core_pb (101cfcb90, 101cfcb90, c00, 
ffffffff7ea4c810, 0, d10) + cc
 ffffffff7e926420 ???????? (101f3fe80, 102fd3250, 2, 63, 2, 200000)
 ffffffff7e92672c ldap_frontend_main_using_core_api (101f3fe80, 102fd3250, 2, 
101da1218, 10133db10, 0) + fc
 ffffffff7e927764 ???????? (220, 101c97310, ffffffffffffffff, 800, 958, 101f3fe80)
 ffffffff7d036a7c _pt_root (101c97310, ffffffff70b00000, 0, 0, 20000, ffffffff70c1ff48)
 + d4
 ffffffff7c1173bc _lwp_start (0, 0, 0, 0, 0, 0)

Note –

The ends of the lines in this example have been wrapped so that they fit on the page.

The output of the second and third pstack commands shows the same results, with thread 51 doing the same types of operation.

All three pstack outputs taken at two second intervals show thread 51 doing the same search operations. The first parameter of the op_shared_search function contains the address of the operation taking place, which is 101cfcb90. The same operation address occurs in each of the three stacks, meaning that the same search is taking place during the four seconds that elapsed between the first and the last pstack run. Moreover, the prstat output always shows thread 51 as the thread using the most CPU.
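
The manual comparison above can be partly scripted. The following is a minimal sketch, assuming the prstat -L and pstack output layouts shown in this example. The function names busiest_lwp and lwp_stack are hypothetical helpers, not part of Directory Server or Solaris.

```shell
#!/bin/sh
# busiest_lwp FILE
# Prints the LWP id with the highest CPU value in saved `prstat -L` output.
# Assumes the column layout shown above: CPU in field 9, PROCESS/LWPID in field 10.
busiest_lwp() {
    awk '$1 ~ /^[0-9]+$/ && $10 ~ /\// {
             cpu = $9; sub(/%/, "", cpu)
             if (best == "" || cpu + 0 > max) {
                 max = cpu + 0
                 split($10, part, "/")
                 best = part[2]
             }
         }
         END { if (best != "") print best }' "$1"
}

# lwp_stack LWP FILE
# Prints the stack of one LWP from saved pstack output, from its
# "lwp# N / thread# N" header up to the next header.
lwp_stack() {
    awk -v lwp="$1" '
        /^-+ +lwp# / { printing = ($3 == lwp) }
        printing' "$2"
}

# Example, using the file names from the command above:
#   lwp=`busiest_lwp prstat.output`
#   lwp_stack "$lwp" pstack.output
```

Running both helpers against each pair of output files makes it easier to confirm that the same thread, and the same operation address, appears in every sample.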

If we check the access log for the result of the search operations at the time the hang was observed, we find that the hang results from a search on the unindexed description attribute. Creating an index on the description attribute avoids this hang.

Troubleshooting Drops in Performance

This section describes how to begin troubleshooting a drop in performance. It describes possible causes of performance drops, the information you need to collect if you experience a performance drop, and how to analyze this information.

Possible Causes of a Drop in Performance

Make certain that you have not mistaken an active or passive hang for a performance drop. If you are experiencing a performance drop, it could be for one of the following reasons:

Other processes are affecting CPU or disk access
Network problems
High input/output rate
Memory swapping
Unindexed searches, such as when an index is missing or when a “!” filter is used
Complex searches, such as searches on static groups, class of service, and roles
Complex updates, such as to static groups, class of service, and roles
Sub-optimum hardware
Sub-optimum system settings, such as fds or keepalive
Directory Server tuned incorrectly

Collecting Data About a Drop in Performance

Collect information about disk, CPU, memory, and process stack use during the period in which performance is dropping.

Collecting Disk, CPU, and Memory Statistics

If your CPU is very low (at or around 10%), try to determine if the problem is network related using the netstat command as follows:

# netstat -an | grep port

A performance drop may be the result of the network if a client is not receiving information even though the access logs show that results were sent immediately. Running the ping and traceroute commands can help you determine if network latency is responsible for the problem.
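
To get a quick overview of the connections on the Directory Server port, you can summarize the netstat output by connection state. The following is a minimal sketch: the function name conn_states is hypothetical, and it assumes that the state is the last field of each TCP line and that addresses end in .port (Solaris style) or :port (Linux style).

```shell
#!/bin/sh
# conn_states PORT
# Reads `netstat -an` output on standard input and counts, per TCP state,
# the lines whose local or remote address ends in PORT.
# Assumption: the connection state is the last field of the line.
conn_states() {
    awk -v port="$1" '
        {
            hit = 0
            for (i = 1; i < NF; i++)
                if ($i ~ ("[.:]" port "$")) hit = 1
            if (hit && $NF ~ /^[A-Z_0-9]+$/) n[$NF]++
        }
        END { for (s in n) print s, n[s] }' | sort
}

# Example:
#   netstat -an | conn_states 5389
```

A large number of connections in a state other than ESTABLISHED or LISTEN can be a hint that the problem lies in the network or in the client applications rather than in the server.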

Collect swap information to see if you are running out of memory. Memory may be your problem if the swap command reports that little swap space is free.

Solaris	`swap -l`
HP-UX	`swapinfo`
Linux	`free`
Windows	Already provided in `C:\report.txt`
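
If you collect data on several platforms with one script, the table above can be expressed as a small lookup. This is a sketch only: the function name swap_command is hypothetical, it relies on the operating system names reported by uname -s, and it does not cover Windows.

```shell
#!/bin/sh
# swap_command [OS_NAME]
# Prints the swap-reporting command from the table above for the given
# operating system name, or for the current system if no name is given.
swap_command() {
    os=${1:-$(uname -s)}
    case "$os" in
        SunOS) echo "swap -l" ;;
        HP-UX) echo "swapinfo" ;;
        Linux) echo "free" ;;
        *)     echo "unsupported platform: $os" >&2; return 1 ;;
    esac
}

# Example: run the matching command and keep its output.
#   `swap_command` > /tmp/swap.output
```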

On Solaris, use the output of the prstat command to identify if other processes could be impacting the system performance. On Linux and HP-UX, use the top command.

Collecting Consecutive Process Stacks on Solaris

Collect consecutive pstack and prstat output of the Directory Server during the period when the performance drops, as described in Analyzing Data About an Unresponsive Process: an Example. For example, you could use the following script on Solaris to gather pstack and prstat information:

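#!/bin/sh
# Usage: pass the Directory Server process ID as the first argument.
# Collects 10 prstat and pstack samples, one second apart, in /tmp.

i=0
while [ "$i" -lt 10 ]
do
        echo "$i"
        # Timestamp used to pair each prstat file with its pstack file
        date=`date "+%y%m%d:%H%M%S"`
        prstat -L -p $1 0 1 > /tmp/prstat.$date
        pstack $1 > /tmp/pstack.$date
        i=`expr $i + 1`
        sleep 1
done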

Analyzing Data Collected About a Performance Problem

In general, look through your data for patterns and commonalities in the errors encountered. For example, if all operation problems are associated with searches on static groups, modifications to static groups, and searches on roles, this indicates that Directory Server is not properly tuned to handle these expensive operations. For example, the nsslapd-search-tune attribute might not be configured correctly for static group related searches, or a substring index on the uniqueMember attribute might be slowing group related updates. If you notice that problems are associated with unrelated operations but all occur at a particular time, this might indicate a memory access problem or a disk access problem.

You can take information culled from your pstacks to SunSolve and search for it along with the phrase unresponsive events to see if anything similar to your problem has already been encountered and solved. SunSolve is located at http://sunsolve.sun.com/pub-cgi/show.pl?target=tous

The remainder of this section provides additional tips to help you analyze the data you collected in the previous steps.

Analyzing the Access Log Using the `logconv` Command

You can use the logconv command to analyze the Directory Server access logs. This command extracts usage statistics and counts the occurrences of significant events. For more information about this tool, see logconv(1).

For example, run the logconv command as follows:

# logconv -s 50 -efcibaltnxgju access > analysis.access

Check the output file for the following:

Unindexed searches (notes=U)

If unindexed searches are present, search for the associated indexes using the dsconf list-indexes command. If the index exists, then you may be reaching the limit of your all-ids-threshold property. This property defines the maximum number of values per index key in an index list. Increase the all-ids-threshold and reindex.

If the index does not exist, then you need to create the index and then reindex. For information about creating an index, see To Create Indexes in Sun Directory Server Enterprise Edition 7.0 Administration Guide.
High file descriptor consumption

To manage a problem with file descriptor consumption, you may need to increase the number of file descriptors available at the system level. You may also want to reduce the number of persistent searches (notes=persistent), modify the client applications that do not disconnect, or reduce the idle timeout value set by the nsslapd-idletimeout property.
Searches with long etimes or that return many entries

For example, if the etime is 344, grep the access log for etime 344. The access log tells you the connection and operation. You can use this information to see what the operation was doing when the performance drop occurred, when the connection was opened, and who the binding user was. If all operations of the same type have long etimes, that points to a problem with that particular operation. If the same binding user is always associated with a long etime, this suggests an ACI issue.

If you suspect an ACI problem with the binding user, prove it by running the same operation with the Directory Manager user, who is not subject to ACIs.
Searches on the uniquemember attribute or on the wrong filters

Look on SunSolve for static group performance hot patches. Run your search by specifying the nsslapd-search-tune attribute.
Long ADD and MOD operations
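
Some of the checks in the list above can also be run directly against the access log. The following is a minimal sketch, assuming that each RESULT line in the access log carries conn=, op=, and etime= fields and that unindexed searches are flagged with notes=U. The function names slow_ops and unindexed_count are hypothetical.

```shell
#!/bin/sh
# slow_ops MIN_ETIME FILE
# Prints the connection, operation, and etime of every RESULT line whose
# etime is at least MIN_ETIME seconds.
# Assumption: RESULT lines contain conn=N, op=N and etime=N fields.
slow_ops() {
    awk -v min="$1" '
        / RESULT / {
            conn = ""; op = ""; et = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^conn=/)  conn = $i
                if ($i ~ /^op=/)    op = $i
                if ($i ~ /^etime=/) { et = $i; sub(/^etime=/, "", et) }
            }
            if (et != "" && et + 0 >= min + 0) print conn, op, "etime=" et
        }' "$2"
}

# unindexed_count FILE
# Counts the lines flagged as unindexed searches (notes=U).
unindexed_count() {
    grep -c 'notes=U' "$1"
}

# Example:
#   slow_ops 30 access
#   unindexed_count access
```

The conn and op values printed by slow_ops can then be used to grep the access log for the matching request and bind lines.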

Identifying Capacity Limitations: an Exercise

Often a capacity limitation manifests itself as a performance issue. To differentiate between performance and capacity, performance might be defined as “How fast the system is going” while capacity is “the maximum performance of the system or an individual component.”

If your CPU is very low (at or around 10%), try to determine if the disk controllers are fully loaded and if input/output is the cause. To determine if your problem is disk related, use the iostat tool as follows:

# iostat -xnMCz -T d 10

For example, a company makes a directory available on the internet. Its customers submit searches from multiple sites, and the Service Level Agreement (SLA) allows no more than 5% of requests to have response times of over 3 seconds. Currently 15% of requests take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with 12x900MHz CPUs.

The vmstat output looks as follows:

procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m1 m1   in   sy   cs us sy id
 0 2 0 8948920 5015176 374 642 10 12 13 0 2 1  2  1  2  132 2694 1315 14  3 83
 0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19  4 77
 0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16  5 79
 0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
 0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17  5 78

We look at the three rightmost columns, us (user), sy (system), and id (idle), which show that well over 50% of the CPU is idle, so CPU is not the limiting resource. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, that is, if the scan rate rises above 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left of the display has 18 or 19 processes in it, but there are no processes in the run queue. This suggests that processes are blocking somewhere in Solaris without using all of the available CPU.
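
The same reading of the vmstat columns can be sketched as a small filter. This is an illustration only: the function name vmstat_check is hypothetical, and the field positions (r in field 1, b in field 2, sr in field 12, id in the last field) assume the Solaris vmstat layout shown above.

```shell
#!/bin/sh
# vmstat_check
# Reads Solaris vmstat output on standard input and, for each sample line,
# prints the run queue, blocked queue, scan rate, and idle CPU, followed by:
#   scanning - the page scanner is running (sr > 0)
#   blocked  - more processes are blocked than runnable (b > r)
# Assumption: sr is field 12 and id is the last field, as in the output above.
vmstat_check() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 12 {
            r = $1; b = $2; sr = $12; idle = $NF
            flags = ""
            if (sr + 0 > 0) flags = flags " scanning"
            if (b + 0 > r + 0) flags = flags " blocked"
            printf "r=%s b=%s sr=%s idle=%s%%%s\n", r, b, sr, idle, flags
        }'
}

# Example:
#   vmstat 5 5 | vmstat_check
```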

Next, we look at the I/O subsystem. The iostat command has a switch, -C, which will aggregate I/Os at the controller level. We run the iostat command as follows:

#  iostat -xnMCz -T d
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  396.4   10.7    6.6    0.1  0.0 20.3    0.0   49.9   0 199 c1
  400.2    8.8    6.7    0.0  0.0 20.2    0.0   49.4   0 199 c3
  199.3    6.0    3.3    0.0  0.0 10.1    0.0   49.4   0  99 c1t0d0
  197.1    4.7    3.3    0.0  0.0 10.2    0.0   50.4   0 100 c1t1d0
  198.2    3.7    3.4    0.0  0.0  9.4    0.0   46.3   0  99 c3t0d0
  202.0    5.1    3.3    0.0  0.0 10.8    0.0   52.4   0 100 c3t1d0

On controller 1 we are doing 396 reads per second and on controller 3 we are doing 400 reads per second. On the right side of the data, the output shows each controller as almost 200% busy. The individual disks are each doing almost 200 reads per second, and the output shows them as 100% busy. As a rule of thumb, an individual disk performs approximately 150 I/Os per second, so these disks are running beyond their expected capacity. This rule does not apply to LUNs or LDEVs from the big disk arrays. Our examination of the numbers therefore leads us to suggest adding 2 disks to each controller and redistributing the data across them.
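
The check performed above can be sketched as a filter over the iostat output. This is an illustration only: the function name busy_devices is hypothetical, and the field positions assume the 11-column iostat -xn layout shown above, with %b in field 10 and the device name in field 11.

```shell
#!/bin/sh
# busy_devices [PERCENT]
# Reads `iostat -xn` style output on standard input and prints every device
# whose %b value is at least PERCENT (default 90), together with its total
# I/O rate (r/s + w/s) for comparison with the 150 I/Os per second rule.
# Assumption: 11 columns per device line, as in the output above.
busy_devices() {
    awk -v limit="${1:-90}" '
        NF == 11 && $1 ~ /^[0-9.]+$/ && $10 + 0 >= limit + 0 {
            printf "%s iops=%.1f busy=%s%%\n", $11, $1 + $2, $10
        }'
}

# Example:
#   iostat -xnMCz 10 2 | busy_devices 90
```

In the output above, all four disks and both controllers exceed a 90% threshold, which matches the conclusion that the disks, not the CPUs, limit capacity.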

In this exercise we looked at all the numbers and attempted to locate the precise nature of the problem. Do not assume adding CPUs and memory will fix all performance problems. In this case, the search programs were exceeding the capacity of the disk drives which manifested itself as a performance problem of transactions with extreme response times. All those CPUs were waiting on the disk drives.

Troubleshooting Process Hangs

This section describes how to troubleshoot a totally unresponsive Directory Server process. A totally unresponsive process is called a hang, and there are two types of hang you might experience:

Active hang, when the CPU level is at 100%. For example, the process encounters an infinite loop, meaning that it loops forever while servicing a request.
Passive hang, when the CPU level is at 0%. For example, the process encounters a deadlock where two or more threads of a process are waiting for the other to finish, and thus neither ever does.

The remainder of this section describes how to troubleshoot each of these types of process hang.

Troubleshooting an Active Hang

A hang is active if the top or vmstat 1 output shows CPU levels of over 95%.

This section describes the causes of an active hang, how to collect information about an active hang, and how to analyze this data.

Possible Causes of an Active Hang

Possible causes of an active hang include the following:

An infinite loop
Retry of an unsuccessful operation, such as a replication operation or a bad commit

Collecting and Analyzing Data About an Active Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging, using the Solaris pstack utility. You should also collect statistics about the active process using the Solaris prstat -L utility. You must collect this information while the server is hanging.

The consecutive pstack and prstat data should be collected every second.

Troubleshooting a Passive Hang

A hang is passive if the top or vmstat 1 output shows low CPU levels.

Possible Causes of a Passive Hang

Possible causes of a passive hang include the following:

A deadlock resulting from locks or condition variables
A defunct thread

Collecting and Analyzing Data About a Passive Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging, using the Solaris pstack utility. You must collect this information while the server is hanging. The consecutive pstack data should be collected every three seconds.

Collect several core files that show the state of the server threads while the server is hanging. Do this by generating a core file using the gcore command, changing the name of the core file, waiting 30 seconds, and generating another core file. Repeat the process at least once to get a minimum of three sets of core files and related data.

For more information about generating a core file, see Generating a Core File.
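
The core file collection described above can be sketched as a loop. This is an illustration only: the function name collect_cores and the core.N.timestamp naming scheme are hypothetical. It assumes that your gcore accepts the -o option to set the first part of the core file name, which avoids the separate renaming step, and it writes the core files to the current directory.

```shell
#!/bin/sh
# collect_cores PID [COUNT] [DELAY]
# Generates COUNT core files (default 3) of process PID, waiting DELAY
# seconds (default 30) between them.
# Assumption: `gcore -o PREFIX PID` writes a file named PREFIX.PID.
# GCORE can be set to the full path of gcore if it is not in your PATH.
GCORE=${GCORE:-gcore}

collect_cores() {
    pid=$1
    count=${2:-3}
    delay=${3:-30}
    n=1
    while [ "$n" -le "$count" ]
    do
        stamp=`date "+%y%m%d-%H%M%S"`
        # A numbered, timestamped prefix keeps each core file distinct
        $GCORE -o "core.$n.$stamp" "$pid" || return 1
        # No need to wait after the last core file
        [ "$n" -lt "$count" ] && sleep "$delay"
        n=`expr $n + 1`
    done
}

# Example, for the process ID used earlier in this chapter:
#   collect_cores 14993
```

Make sure the current directory has enough free space before you run the loop, because each core file can be as large as the memory footprint of the server process.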