Sun Java System Directory Server Enterprise Edition 6.1 Troubleshooting Guide

Chapter 5 Troubleshooting Directory Server Problems

This chapter describes how to troubleshoot general problems with Directory Server. It includes information about the following topics:

Troubleshooting a Crash

This section describe how to begin troubleshooting a crashed Directory Server process. It describes possible causes of a crash, what pieces of information you need to collect to help identify the problem, and how to analyze the information you collect.

Possible Causes of a Crash

A crash could be caused by one or more of the following:

Buffer overflows
Out of resources, such as memory, disk, or file descriptors
Memory allocation problems, such as double frees or free unallocated memory
NULL de-referencing
Other programmatic errors

If a Directory Server process crashes, you need to open a service request with the Sun Support Center.

Collecting Data About a Crash

This section describes the data you need to collect when the server crashes. The most critical data to collect is the core file.

Note –

If you contact the Sun Support Center about a crashed Directory Server process, you must provide a core file and logs.

Generating a Core File

Core file and crash dumps are generated when a process or application terminates abnormally. You must configure your system to allow Directory Server to generate a core file if the server crashes. The core file contains a snapshot of the Directory Server process at the time of the crash, and can be indispensable in determining what led to the crash. Core files are written to the same directory as the errors logs, by default, instance-path/logs/. Core files can be quite large, as they include the entry cache.

If a core file was not generated automatically, you can configure your operating system to allow core dumping by using the commands described in the following table and then waiting for the next crash to retrieve the data.

Solaris

coreadm

ulimit -c unlimited
ulimit -H -c unlimited

Linux

ulimit -c unlimited
ulimit -H -c unlimited

HPUX/AIX

ulimit -c

Windows

Windows crashdump

For example, on Solaris OS, you enable applications to generate core files using the following command:

# coreadm -g /path-to-file/%f.%n.%p.core -e global -e process \
 -e global-setid -e proc-setid -e log

The path-to-file specifies the full path to the core file you want to generate. The file will be named using the executable file name (%f), the system node name (%n), and the process ID (%p).

If after enabling core file generation your system still does not create a core file, you may need to change the file-size writing limits set by your operating system. Use the ulimit command to change the maximum core file size and maximum stack segment size as follows:

# ulimit -c unlimited 
# ulimit -s unlimited

Check that the limits are set correctly using the -a option as follows:

# ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
coredump(blocks)     unlimited
nofiles(descriptors) 256
vmemory(kbytes)      unlimited

For information about configuring core file generate on Red Hat Linux and Windows, see Configuring the Operating System to Generate Core Files in Sun Gathering Debug Data for Sun Java System Directory Server 5.

Next, verify that applications can generate core files using the kill -11 process-id command. The cores should be generated in either the specified directory or in the default instance-name/logs directory.

# cd /var/cores
# sleep 100000 &
[1] process-id
# kill -11 process-id
# ls

Getting the Core and Shared Libraries

Get all the libraries and binaries associated with the slapd process for core file analysis. Collect the libraries using the pkg_app script . The pkg_app script packages an executable and all of its shared libraries into one compressed tar file. You provide the process ID of the application and, optionally, the name of the core file to be opened. For more information about the pkg_app script see Using the pkg_app Script on Solaris.

As superuser, run the pkg_app script as follows:

# pkg_app server-pid core-file

Note –

You can also run the pkg_app script without a core file. This reduces the size of the script's output. You need to later set the variable to the correct location of the core file.

Additional Information

To look at the log files created at the time the problem occurred, check the following files:

# install-path/instance-name/logs/errors*
# install-path/instance-name/logs/access*

If the crash is related to the operating system running out of disk or memory, retrieve the system logs. For example, on Solaris OS check the /var/adm/messages file and the /var/log/syslogs file for hardware or memory failures.

To get complete version output, use the following commands:

# cd install-path/bin/slapd/server
# ./ns-slapd -D install-path/instance-name -V

Analyzing Crash Data

Whenever the Directory Server crashes, it generates a core. With this core file and the process stack of the core file you obtained from the ns-slapd binary directory, you can analyze the problem.

This section describes how to analyze the core file crash data on a Solaris OS.

Examining a Core File on Solaris

Once you have obtained a core file, run the pstack and pmap Solaris utilities on the file. The pmap utility shows the process map, which includes a list of virtual addresses, where the dynamic libraries are loaded, and where the variables are declared. The pstack utility shows the process stack. For each thread in the process, it describes the exact stack of instruction the thread was executing at the moment when the process died or when the pstack command was executed.

These utilities must be run from the directory that contains the ns-slapd binary, root-dir/bin/slapd/server. Run the utilities as follows:

# pstack core-file

# pmap core-file

If the results of the pstack utility are almost empty, all of the lines in the output look as follows:

0002c3cc ???????? (1354ea0, f3400, 1354ea0, 868, 2fc, 1353ff8)

If your pstack output looks like this, confirm that you are running the utilities from the ns-slapd binary directory. If you did not run the utility from the ns-slapd binary directory, then go to the directory and rerun the utility.

You can also use the mdb command instead of the pstack command to know the stack of the core. Run the mdb command as follows:

# mdb $path-to-executable $path-to-core
$C to show the core stack
$q to quit

The output of the mdb and the pstack commands provide helpful information about the process stack at the time of the crash. The mdb $C command output provides the exact thread that caused the crash.

On Solaris 8 and 9, the first thread of the pstack output often contains the thread responsible for the crash. On Solaris 10, use mdb to find the crashing thread or, if using the pstack command, analyze the stack by looking for threads that do not contain lwp-park, poll, and pollsys.

For example, the following core process stack occurs during the call of a plug-in function:

core '../../../slapd-psvmrr3-27/logs/core' of 18301:    ./ns-slapd \
-D /opt/iplanet/servers/slapd-psvmrr3-27 
-i /opt/iplanet/se
-----------------  lwp# 13 / thread# 25  --------------------
 ff2b3148 strlen   (0, fde599fb, 0, fbed1, 706d2d75, fde488a8) + 1c
 ff307ef8 sprintf  (7fffffff, fde488a0, fde599d8, fde599ec, 706d2d75, fde599fc) + 3c
 fde47cf8 ???????? (1354ea0, 850338, fde59260, e50243, 923098, 302e3800) + f8
 fde429cc ???????? (1354ea0, 3, 440298, 154290, 345c10, 154290) + 614
 ff164018 plugin_call_exop_plugins (1354ea0, 8462a0, d0c, ff1e7c70, ff202a94, 1353ff8) + d0
 0002c3cc ???????? (1354ea0, f3400, 1354ea0, 868, 2fc, 1353ff8)
 00025e08 ???????? (0, 1353ff8, fdd02a68, f3400, f3000, fbc00)
 fef47d18 _pt_root (362298, fe003d10, 0, 5, 1, fe401000) + a4
 fed5b728 _thread_start (362298, 0, 0, 0, 0, 0) + 40

When analyzing process stacks from cores, concentrate on the operations in the middle of the thread. Processes at the bottom are too general and processes at the top are too specific. The commands in the middle of the thread are specific to the Directory Server and can thus help you identify at which point during processing the operation failed. In the above example, we see the plugin_call_exop_plugins process call indicates a problem calling an external operation in the custom plug-in.

If the problem is related to the Directory Server, you can use the function call that seems like the most likely cause of the problem to search on SunSolve for known problems associated with this function call. SunSolve is located at http://sunsolve.sun.com/.

If you do locate a problem related to the one you are experiencing, confirm that it applies to the version of Directory Server that you are running. To get information about the version you are running, use the following command:

# ns-slapd -V

If after doing a basic analysis of your core files you cannot identify the problem, collect the binaries and libraries using the pkg_app script and contact the Sun Support Center.

Troubleshooting an Unresponsive Process

The type of performance problem you are experiencing depends on the level of CPU available as described in the following table. The first step in troubleshooting a Directory Server that is still running but no longer responding to client application requests is to identify which of the three types of performance issue it corresponds to.

Table 5–1 CPU Level Associated With Performance Problems


CPU Level	Problem Description
CPU = 0%	Passive hang, the server is completely unresponsive
CPU > 10% CPU < 90%	Performance drop, the server is operating but not at the expected rate
CPU = 100%	Active hang, the server is completely unresponsive

The remainder of this section describes the following troubleshooting procedures:

Symptoms of an Unresponsive Process

If your error log contains errors about not being able to open file descriptors, this is usually a symptom of an unresponsive process. For example, the error log may contain a message such as the following:

[17/Jan/2007:01:41:13 +0000] - ERROR<12293> - Connection  - conn=-1 
op=-1 msgId=-1 - fd limit exceeded Too many open file descriptors - not listening 
on new connection

Other symptoms of an unresponsive process include LDAP connections that do not answer or that hang, no messages in the error or access logs, or an access log that is never updated.

Collecting Data About an Unresponsive Process

The prstat tool tells you the amount of CPU being used for each thread. If you collect a process stack using the pstack utility at the same time you run the prstat tool, you can then use the pstack output to see what the thread was doing when it had trouble. If you run the prstat and pstack simultaneously several times, then you can see over time if the same thread was causing the problem and if it was encountering the problem during the same function call. If you are experiencing a performance drop, then run the commands simultaneously every 2 seconds. If you are experiencing a passive or active hang, run the commands with a slightly longer delay, for example every 10 seconds or so.

Analyzing Data About a Unresponsive Process: an Example

For example, you try running an ldapsearch on your Directory Server as follows:

# ldapsearch -p 5389 -D "cn=Directory Manager" -w secret 
-b"o=test" description=*

This command generates a 40 second search with no results. To analyze why the process in unresponsive, first get the process ID using the following command:

# ps -aef | grep slapd | grep slapd-server1
   mares 15013 24159  0 13:06:20 pts/32   0:00 grep slapd-server1
   mares 14993     1  1 13:05:36 ?        0:04 ./ns-slapd -D
/u1/SUNWdsee/user1/52/slapd-server1 -i /u1/SUNWdsee/user1/52/slapd-s

Next, rerun the search and during the search run the prstat and pstack commands simultaneously for the Directory Server process, which in the output above has a process ID of 14993.

prstat -L -p 14993 0 1 > prstat.output ; pstack 14993 > pstack.output

We rerun the commands three times, with an interval of two seconds between each consecutive run.

The output of the first prstat command appears as follows:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID      
 14993 mares     128M  110M cpu0    59    0   0:00.02 3.0% ns-slapd/51
 14993 mares     128M  110M sleep   59    0   0:00.49 1.3% ns-slapd/32
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/16
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/15
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/14
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/13
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/12
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/11
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/10
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/9
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/8
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/6
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/5
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/4
 14993 mares     128M  110M sleep   59    0   0:00.00 0.0% ns-slapd/3
Total: 1 processes, 51 lwps, load averages: 0.36, 0.29, 0.17

The problem appears to be occurring in thread 51. Next, we look for thread 51 in the output of the first pstack command and it appears as follows:

-----------------  lwp# 51 / thread# 51  --------------------
 ffffffff7eb55a78 ???????? (1, 102183a10, ffffffff70c1d340, 1001c5390, 0, 
ffffffff7ecea248)
 ffffffff77925fe0 id2entry (1002b7610, 1a09, 0, ffffffff70c1e7f4, 0, ffffffff77a6faa8) 
+ 3e8
 ffffffff7795ed20 ldbm_back_next_search_entry_ext (101cfcb90, 10190fd60, 0, 101b877b0, 
1a08, 45b4aa34) + 300
 ffffffff7ebaf6f8 ???????? (101cfcb90, 1002b7610, 1, ffffffff70c1eaf4, 0, 0)
 ffffffff7ebafbc4 ???????? (101cfcb90, 1, ffffffff70c1eaf4, 0, 10190fd60, 
ffffffff70c1e980)
 ffffffff7ebaf170 op_shared_search (101cfcb90, 0, 1015ad240, 0, ffffffffffffffff, 
ffffffff7ecea248) + 8c0
 ffffffff7e92efcc search_core_pb (101cfcb90, 2, 1000, 4000, ffffffff7ea4c810, 
ffffffff7ea56088) + 6c4
 ffffffff7e93a710 dispatch_operation_core_pb (101cfcb90, 101cfcb90, c00, 
ffffffff7ea4c810, 0, d10) + cc
 ffffffff7e926420 ???????? (101f3fe80, 102fd3250, 2, 63, 2, 200000)
 ffffffff7e92672c ldap_frontend_main_using_core_api (101f3fe80, 102fd3250, 2, 
101da1218, 10133db10, 0) + fc
 ffffffff7e927764 ???????? (220, 101c97310, ffffffffffffffff, 800, 958, 101f3fe80)
 ffffffff7d036a7c _pt_root (101c97310, ffffffff70b00000, 0, 0, 20000, ffffffff70c1ff48)
 + d4
 ffffffff7c1173bc _lwp_start (0, 0, 0, 0, 0, 0)

Note –

The ends of the lines in this example have been wrapped so that they fit on the page.

The output of the second and third pstack command show the same results, with thread 51 doing the same types of operation.

All three pstack outputs taken at two second intervals show thread 51 doing the same search operations. The first parameter of the op_shared_search function contains the address of the operations taking place, which is 101cfcb90. The same operation occurs in each of the three stacks, meaning that the same search is taking place during the four seconds that elapsed between the first and the last pstack run. Moreover, the prstat output always shows thread 51 as the thread taking the highest amount of CPU.

If you check the access log for the result of the search operations at the time the hang was observed, we find that it is a result of the search on the unindexed description entry. By creating a description index, this hang will be avoided.

Troubleshooting Drops in Performance

This section describes how to begin troubleshooting a drop in performance. It describes possible causes of performance drops, describes the information you need to consult if you experience a performance drop, and how to analyze this information.

Possible Causes of a Drop in Performance

Make certain that you have not mistaken an active or passive hang for a performance drop. If you are experiencing a performance drop, it could be for one of the following reasons:

Other processes are affecting CPU or disk access
Network problems
High input/ouput rate
Memory swapping
Unindexed searches, such as when an index is missing or when a “!” filter is used
Complex searches, such as searches on static groups, class of service, and roles
Complex updates, such as to static groups, class of service, and roles
Sub-optimum hardware
Sub-optimum system settings, such as fds or keepalive
Directory Server tuned incorrectly

Collecting Data About a Drop in Performance

Collect information about disk, CPU, memory, and process stack use during the period in which performance is dropping.

Collecting Disk, CPU, and Memory Statistics

If your CPU is very low (at or around 10%), try to determine if the problem is network related using the netstat command as follows:

# netstat -an | grep port

A performance drop may be the result of the network if a client is not receiving information despite the fact that access logs show that results work sent immediately. Running the ping andtraceroute commands can help you determine if network latency is responsible for the problem.

Collect swap information to see if you are running out of memory. Memory may be your problem if the output of the swap command is small.

Solaris	`swap` `-l`
HP-UX	`swapinfo`
Linux	`free`
Windows	Already provided in `C:\report.txt`

On Solaris, use the output of the prstat command to identify if other processes could be impacting the system performance. On Linux and HP-UX, use the top command.

Collecting Consecutive Process Stacks on Solaris

Collect consecutive pstack and prstat output of the Directory Server during the period when the performance drops as described in Analyzing Data About a Unresponsive Process: an Example. For example, you could use the following script on Solaris to gather pstack and prstat information:

#!/bin/sh

i=0
while [ "$i" -lt "10" ]
do
        echo "$i/n"
        date= `date"+%y%m%d:%H%M%S"
        prstate -L -p $1 0 1 > /tmp/prstate.$date
        pstack $1 > /tmp/pstack.$date
        i=`expr $i + 1`
        sleep 1
done

Using the `idsktune` Command

The idsktune command provides information about system parameters, patch level, and tuning recommendations. You can use the output of this command to detect problems in thread libraries or patches that are missing. For more information about the idsktune command, see idsktune(1M).

Analyzing Data Collected About a Performance Problem

In general, look through your data for patterns and commonalities in the errors encountered. For example, if all operation problems are associated with searches to static groups, modifies to static groups, and searches on roles, this indicates that Directory Server is not properly tuned to handle these expensive operations. For example, the nsslapd-search-tune attribute is not configured correctly for static group related searches, or maybe the uniqueMember attribute indexed in a substring affects the group related updates. If you notice that problems are associated with unrelated operations but all at a particular time, this might indicate a memory access problem or a disk access problem.

You can take information culled from you pstacks to SunSolve and search for them along with the phrase unresponsive events to see if anything similar to your problem has already been encountered and solved. SunSolve is located at http://sunsolve.sun.com/pub-cgi/show.pl?target=tous

The remainder of this section provides additional tips to help you analyze the data you collected in the previous steps.

Analyzing the Access Log Using the `logconv` Command

You can use the logconv command to analyze the Directory Server access logs. This command extracts usage statistics and counts the occurrences of significant events. For more information about this tool, see logconv(1).

For example, run the logconv command as follows:

# logconv -s 50 -efcibaltnxgju access > analysis.access

Check the output file for the following:

Unindexed searches (notes=U)

If unindexed searches are present, search for the associated indexes using the dsconf list-indexes command. If the index exists, then you may be reaching the limit of your all-ids-threshold property. This property defines the maximum number of values per index key in an index list. Increase the all-ids-threshold and reindex.

If the index does not exist, then you need to create the index and then reindex. For information about creating an index, see To Create Indexes in Sun Java System Directory Server Enterprise Edition 6.1 Administration Guide.
High file descriptor consumption

To manage a problem with file descriptor consumption you may need to request to increase the file descriptors available at the system level. You may want to reduce the number of persistent searches (notes=persistent), modify the client applications that do not disconnect, or reduce the idle time-out value set by the nsslapd-idletimeout property.
Searches with long etimes or that return many entries

For example. if the etime is 344, grep the access log for etime 344. The access log tells you the connection and operation. You can use this information to see what the operation was doing when the performance drop occurred, when the connection was opened, and who was the binding user. If all of the same operations have long etimes, that points to a problem with a particular operation. If the same binding user is always associated with a long etime, this suggests an ACI issue.

If you suspect an ACI problem with the binding user, prove it by running the same operation with the Directory Manager user, who is not subject to ACIs.
Searches on the uniquemember attribute or on the wrong filters.

Look on SunSolve for static group performance hot patches. Run your search by specifying the nsslapd-search-tune attribute.
Long ADDand MOD operations

Identifying Capacity Limitations: an Exercise

Often a capacity limitation manifests itself as a performance issue. To differentiate between performance and capacity, performance might be defined as “How fast the system is going” while capacity is “the maximum performance of the system or an individual component.”

If your CPU is very low (at or around 10%), try to determine if the disk controllers are fully loaded and if input/output is the cause. To determine if your problem is disk related, use the iostat tool as follows:

# iostat -xnMCz -T d 10

For example, a directory is available on the internet. Their customers submit searches from multiple sites and the Service Level Agreement (SLA) was no more than 5% of requests with response times of over 3 seconds. Currently 15% of request take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with 12x900MHz CPUs.

The vmstat output looks as follows:

procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m1 m1   in   sy   cs us sy id
 0 2 0 8948920 5015176 374 642 10 12 13 0 2 1  2  1  2  132 2694 1315 14  3 83
 0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19  4 77
 0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16  5 79
 0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
 0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17  5 78

We look at the right 3 columns, us=user, sy=system and id=idle, which show that over 50% of the CPU is idle and available for the performance problem. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, or the scan rate gets over 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left of the display has 18 or 19 processes in it but there are no processes in the run queue. This suggests that the process is blocking somewhere in Solaris without using all of the available CPU.

Next, we look at the I/O subsystem. With Solaris 8, the iostat command has a switch, -C, which will aggregate I/Os at the controller level. We run the iostat command as follows:

#  iostat -xnMCz -T d
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  396.4   10.7    6.6    0.1  0.0 20.3    0.0   49.9   0 199 c1
  400.2    8.8    6.7    0.0  0.0 20.2    0.0   49.4   0 199 c3
  199.3    6.0    3.3    0.0  0.0 10.1    0.0   49.4   0  99 c1t0d0
  197.1    4.7    3.3    0.0  0.0 10.2    0.0   50.4   0 100 c1t1d0
  198.2    3.7    3.4    0.0  0.0  9.4    0.0   46.3   0  99 c3t0d0
  202.0    5.1    3.3    0.0  0.0 10.8    0.0   52.4   0 100 c3t1d0

On controller 1 we are doing 396 reads per second and on controller 3 we are doing 400 reads per second. On the right side of the data, we see that the output shows the controller is almost 200% busy. So the individual disks are doing almost 200 reads per second and the output shows the disks as 100% busy. That leads us to a rule of thumb that individual disks perform at approximately 150 I/Os per second. This does not apply to LUNs or LDEVs from the big disk arrays. So our examination of the numbers leads us to suggest adding 2 disks to each controller and relaying out the data.

In this exercise we looked at all the numbers and attempted to locate the precise nature of the problem. Do not assume adding CPUs and memory will fix all performance problems. In this case, the search programs were exceeding the capacity of the disk drives which manifested itself as a performance problem of transactions with extreme response times. All those CPUs were waiting on the disk drives.

Troubleshooting Process Hangs

This section describes how to troubleshoot a totally unresponsive Directory Server process. A totally unresponsive process is called a hang, and there are two types of hang you might experience:

Active hang, when the CPU level is at 100%. For example, the process encounters an infinite loop meaning it waits forever waiting for and servicing a request.
Passive hang, when the CPU level is at 0%. For example, the process encounters a deadlock where two or more threads of a process are waiting for the other to finish, and thus neither ever does.

The remainder of this section describes how to troubleshoot each of these types of process hang.

Troubleshooting an Active Hang

A hang is active if the top or vmstat 1 output show CPU levels of over 95%.

This section describes the causes of an active hang, how to collect information about an active hang, and out to analyze this data.

Possible Causes of an Active Hang

Possible causes of an active hang include the following:

An infinite loop
Retry of an unsuccessful operation, such as a replication operation or a bad commit

Collecting and Analyzing Data About an Active Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging using the Solaris pstack utility. Run the command from the root-dir/bin/slapd/server directory. You should also collect statistics about the active process using the Solaris prstat utility. You must collect this information while the server is hanging.

The consecutive pstack and prstat data should be collected every second.

Troubleshooting a Passive Hang

A hang is passive if the top or vmstat 1 output show low CPU levels.

Possible Causes of a Passive Hang

Possible causes of a passive hang include the following:

A deadlock resulting from locks or conditional variables
A defunct thread

Collecting and Analyzing Data About a Passive Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging using the Solaris pstack utility. Run the command from the root-dir/bin/slapd/server directory. You must collect this information while the server is hanging. The consecutive pstack data should be collected every three seconds.

Collect several core files that show the state of the server threads while the server is hanging. Do this by generating a core file using the gcore command, changing the name of the core file, waiting 30 seconds, and generating another core file. Repeat the process as least once to get a minimum of three sets of core files and related data.

For more information about generating a core file, see Generating a Core File.

Troubleshooting Database Problems

This section describes how to troubleshoot an inaccessible database

Possible Causes of Database Problems

The Directory Server database may be inaccessible for one of the following reasons:

Database corruption
index corruption
Shared region file corruption
Missing change log
Corrupted change log
Database offline, for example it is being reimported
Missing transaction log

To Troubleshoot a Database Problem

If the server refuses to start, remove the guardian file and all shared memory files while the server is offline and then retry a new start.
# install-path/instance-name/db/guardian # install-path/instance-name/db/_db.00*
If the start succeeds and the database still cannot be loaded, continue with this procedure.

Backup up all database file stored in the db/ directory.

Collect error and access log files from the time during which the database was inaccessible.
# install-path/instance-name/logs/errors* # install-path/instance-name/access*

Troubleshooting Memory Leaks

This section describes how to troubleshoot a memory leak.

Possible Causes of a Memory Leak

Memory leaks are caused by problems allocating memory, either in Directory Server itself or in custom plug-ins. Troubleshooting these problems can be very difficult, particularly in the case of custom plug-ins.

Collecting Data About a Memory Leak

It is important to do the following before collecting data about your memory leak:

Disable any custom plug-ins
Reduce the cache setting to very low values
Enable the audit log

One you have done the above, run a test that proves your memory leak. During the life of the test run, gather output from the pmonitor utility as follows:

The pmonitor utility is a process monitor.

Collect the generic Directory Server data, as described in Collecting Generic Data. This data includes the version of Directory Server that you are running, logs from the test run, in particular the audit log, and the Directory Server configuration file.

With the data you collected, you can now contact the Sun Support Center for assistance with your problem.

Analyzing Memory Leaks Using the `libumem` Library

On Solaris systems, the libumem library is a memory agent library that tracks all of the address allocated into the process memory footprint. Usually it is not used in a production environment because it is much slower. However, it is helpful for analyzing the cause of a memory leak. For more information about the libumem library, see the technical article at the following location: http://access1.sun.com/techarticles/libumem.html

Restart the Directory Server using the following command:

# SUN_SUPPORT_SLAPD_NOSH=true LD_PRELOAD=libumem.so \
UMEM_DEBUG=contents,audit=40,guards UMEM_LOGGING=transaction ./start-slapd

The libumem library is now loaded before the Directory Server starts, instead of using SmartHeap.

Next, run the gcore command several times, once before the memory use started to grow and once after. The gcore command provides a list of addresses and pointers. Use these to read the libumem library.

# cd install-root/bin/slapd/server
gcore -o /tmp/directory-core process-id

Finally, use the mdb and splitrec tools to analyze the results. The splitrec tool compares the results to see the complete stack of the leak.

# cd install-root/bin/slapd/server
echo "::umausers -e" | mdb ./ns-slapd path_gcore1 > res.1
eacho "::umausers -e" | mdb ./ns-slapd path_gcore2 > res.2
splitrec -1 res.1 res.2

The splitrec tool is available through Sun Support. This tool provides a summary of the stacks that have been identified as responsible for leaking allocation stacks. Sun Support can use the contents of these stacks to identify known memory leaks in the SunSolve database. Sometimes the splitrec tool does not provide any output because by default it is configured to report leaks only for stacks that have been identified as leaking more than 100 times. Configure this limit to a lower value using the splitrec -l option.

Chapter 5 Troubleshooting Directory Server Problems

Troubleshooting a Crash

Possible Causes of a Crash

Collecting Data About a Crash

Generating a Core File

Getting the Core and Shared Libraries

Additional Information

Analyzing Crash Data

Examining a Core File on Solaris

Troubleshooting an Unresponsive Process

Symptoms of an Unresponsive Process

Collecting Data About an Unresponsive Process

Analyzing Data About a Unresponsive Process: an Example

Troubleshooting Drops in Performance

Possible Causes of a Drop in Performance

Collecting Data About a Drop in Performance

Collecting Disk, CPU, and Memory Statistics

Collecting Consecutive Process Stacks on Solaris

Using the idsktune Command

Analyzing Data Collected About a Performance Problem

Analyzing the Access Log Using the logconv Command

Identifying Capacity Limitations: an Exercise

Troubleshooting Process Hangs

Troubleshooting an Active Hang

Possible Causes of an Active Hang

Collecting and Analyzing Data About an Active Hang

Troubleshooting a Passive Hang

Possible Causes of a Passive Hang

Collecting and Analyzing Data About a Passive Hang

Troubleshooting Database Problems

Possible Causes of Database Problems

To Troubleshoot a Database Problem

Troubleshooting Memory Leaks

Possible Causes of a Memory Leak

Collecting Data About a Memory Leak

Analyzing Memory Leaks Using the libumem Library

Using the `idsktune` Command

Analyzing the Access Log Using the `logconv` Command

Analyzing Memory Leaks Using the `libumem` Library