Oracle® Fusion Middleware Troubleshooting Guide for Oracle Directory Server Enterprise Edition
11g Release 1 (126.96.36.199.0)
Part Number E28966-01
This chapter describes how to troubleshoot general problems with Directory Server, including crashed processes, unresponsive processes, inaccessible databases, and memory leaks.
This section describes how to begin troubleshooting a crashed Directory Server process. It describes possible causes of a crash, what pieces of information you need to collect to help identify the problem, and how to analyze the information you collect.
A crash could be caused by one or more of the following:
Out of resources, such as memory, disk, or file descriptors
Memory allocation problems, such as double frees or freeing unallocated memory
Other programmatic errors
If a Directory Server process crashes, you need to open a service request with the Sun Support Center.
This section describes the data you need to collect when the server crashes. The most critical data to collect is the core file.
If you contact the Sun Support Center about a crashed Directory Server process, you must provide a core file and logs.
Core files and crash dumps are generated when a process or application terminates abnormally. You must configure your system to allow Directory Server to generate a core file if the server crashes. The core file contains a snapshot of the Directory Server process at the time of the crash, and can be indispensable in determining what led to the crash. Core files are written to the same directory as the error logs, by default /logs/. Core files can be quite large, as they include the entry cache.
If a core file was not generated automatically, you can configure your operating system to allow core dumping by using the commands described in the following table and then waiting for the next crash to retrieve the data.
ulimit -c unlimited
ulimit -H -c unlimited
For example, on Solaris OS, you enable applications to generate core files using the following command:
# coreadm -g /path-to-file/%f.%n.%p.core -e global -e process \
-e global-setid -e proc-setid -e log
The path-to-file specifies the full path to the core file you want to generate. The file will be named using the executable file name (%f), the system node name (%n), and the process ID (%p).
If your system still does not create a core file after you enable core file generation, you may need to change the file-size writing limits set by your operating system. Use the ulimit command to change the maximum core file size and maximum stack segment size as follows:
# ulimit -c unlimited
# ulimit -s unlimited
Check that the limits are set correctly using the -a option as follows:
# ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) 256
vmemory(kbytes) unlimited
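The same limits can also be read from a script, which may be convenient when checking several hosts. The following is a minimal sketch in Python, not part of the product tooling. It assumes a UNIX-like system where the standard resource module is available, and the function names are hypothetical. It reports the limits of the Python process itself, which are inherited from the shell that started it.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, using "unlimited" as ulimit -a does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def core_and_stack_limits():
    """Return the soft and hard limits for core file size and stack size.

    Values are in bytes, whereas ulimit -a reports blocks and kbytes.
    """
    report = {}
    for name, key in (("coredump", resource.RLIMIT_CORE),
                      ("stack", resource.RLIMIT_STACK)):
        soft, hard = resource.getrlimit(key)
        report[name] = {"soft": describe_limit(soft),
                        "hard": describe_limit(hard)}
    return report


if __name__ == "__main__":
    for name, limits in core_and_stack_limits().items():
        print(name, "soft:", limits["soft"], "hard:", limits["hard"])
```

If the coredump soft limit is reported as 0, core files are disabled for processes started from that shell.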
For information about configuring core file generation on Red Hat Linux and Windows, see the respective operating system documentation.
Next, verify that applications can generate core files using the kill -11 process-id command. The cores should be generated in either the specified directory or in the default location.
# cd /var/cores
# sleep 100000 &
process-id
# kill -11 process-id
# ls
Get all the libraries and binaries associated with the slapd process for core file analysis. Collect the libraries using the pkgapp script. The pkgapp script packages an executable and all of its shared libraries into one compressed tar file. You provide the process ID of the application and, optionally, the name of the core file to be opened. For more information about the pkgapp script, see Using the pkgapp Script on Solaris.
As superuser, run the pkgapp script as follows:
# pkgapp server-pid core-file
You can also run the pkgapp script without a core file. This reduces the size of the script's output. You must later set the variable to the correct location of the core file.
To look at the log files created at the time the problem occurred, check the following files:
# instance-name/logs/errors*
# instance-name/logs/access*
If the crash is related to the operating system running out of disk or memory, retrieve the system logs. For example, on Solaris OS check the /var/adm/messages file and the /var/log/syslogs file for hardware or memory failures.
To get complete version output, use the following commands:
# dsadm -V
Whenever the Directory Server crashes, it generates a core. With this core file and the process stack of the core file you obtained from the ns-slapd binary directory, you can analyze the problem.
This section describes how to analyze the core file crash data on a Solaris OS.
Once you have obtained a core file, run the pstack and pmap Solaris utilities on the file. The pmap utility shows the process map, which includes a list of virtual addresses, where the dynamic libraries are loaded, and where the variables are declared. The pstack utility shows the process stack. For each thread in the process, it describes the exact stack of instructions the thread was executing at the moment when the process died or when the pstack command was executed.
# pstack core-file
# pmap core-file
If the results of the pstack utility are almost empty, all of the lines in the output look as follows:
0002c3cc ???????? (1354ea0, f3400, 1354ea0, 868, 2fc, 1353ff8)
In this case, make sure to run pstack on the machine where the core file was generated.
You can also use the mdb command instead of the pstack command to view the stack of the core. Run the mdb command as follows:
# mdb $path-to-executable $path-to-core
$C to show the core stack
$q to quit
The output of the mdb and pstack commands provides helpful information about the process stack at the time of the crash. The mdb $C command output identifies the exact thread that caused the crash.
On Solaris 9, the first thread of the pstack output often contains the thread responsible for the crash. On Solaris 10, use mdb to find the crashing thread or, if using the pstack command, analyze the stack by looking for threads that do not contain
For example, the following core process stack occurs during the call of a plug-in function:
core '/local/dsInst/logs/core' of 18301: ./ns-slapd \
-D /local/dsInst -i /local/dsInst
----------------- lwp# 13 / thread# 25 --------------------
ff2b3148 strlen (0, fde599fb, 0, fbed1, 706d2d75, fde488a8) + 1c
ff307ef8 sprintf (7fffffff, fde488a0, fde599d8, fde599ec, 706d2d75, fde599fc) \
+ 3c
fde47cf8 ???????? (1354ea0, 850338, fde59260, e50243, 923098, 302e3800) + f8
fde429cc ???????? (1354ea0, 3, 440298, 154290, 345c10, 154290) + 614
ff164018 plugin_call_exop_plugins (1354ea0, 8462a0, d0c, ff1e7c70, ff202a94, \
1353ff8) + d0
0002c3cc ???????? (1354ea0, f3400, 1354ea0, 868, 2fc, 1353ff8)
00025e08 ???????? (0, 1353ff8, fdd02a68, f3400, f3000, fbc00)
fef47d18 _pt_root (362298, fe003d10, 0, 5, 1, fe401000) + a4
fed5b728 _thread_start (362298, 0, 0, 0, 0, 0) + 40
When analyzing process stacks from cores, concentrate on the function calls in the middle of the thread. Calls at the bottom are too general and calls at the top are too specific. The calls in the middle of the thread are specific to Directory Server and can thus help you identify at which point during processing the operation failed. In the above example, the plugin_call_exop_plugins call indicates a problem calling an extended operation in the custom plug-in.
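When a core contains many threads, it can help to list only the frames that pstack resolved to a function name. The following Python sketch illustrates the idea under stated assumptions: the function name named_frames is hypothetical, the sample is the stack shown above with wrapped lines rejoined, and the parser assumes one frame per line in the form address, function, arguments.

```python
SAMPLE_STACK = """\
core '/local/dsInst/logs/core' of 18301: ./ns-slapd -D /local/dsInst -i /local/dsInst
----------------- lwp# 13 / thread# 25 --------------------
ff2b3148 strlen (0, fde599fb, 0, fbed1, 706d2d75, fde488a8) + 1c
ff307ef8 sprintf (7fffffff, fde488a0, fde599d8, fde599ec, 706d2d75, fde599fc) + 3c
fde47cf8 ???????? (1354ea0, 850338, fde59260, e50243, 923098, 302e3800) + f8
fde429cc ???????? (1354ea0, 3, 440298, 154290, 345c10, 154290) + 614
ff164018 plugin_call_exop_plugins (1354ea0, 8462a0, d0c, ff1e7c70, ff202a94, 1353ff8) + d0
0002c3cc ???????? (1354ea0, f3400, 1354ea0, 868, 2fc, 1353ff8)
00025e08 ???????? (0, 1353ff8, fdd02a68, f3400, f3000, fbc00)
fef47d18 _pt_root (362298, fe003d10, 0, 5, 1, fe401000) + a4
fed5b728 _thread_start (362298, 0, 0, 0, 0, 0) + 40
"""


def named_frames(pstack_text):
    """Map each thread header to the function names in its stack, top first.

    Frames that pstack could not resolve (shown as ????????) are skipped.
    """
    threads = {}
    current = None
    for line in pstack_text.splitlines():
        line = line.strip()
        if line.startswith("-") and "lwp#" in line:
            # Thread separator, for example "--- lwp# 13 / thread# 25 ---".
            current = line.strip("- ")
            threads[current] = []
        elif current is not None and "(" in line:
            parts = line.split()
            # A frame line looks like: <address> <function> (<args>) + <offset>
            if len(parts) >= 2 and parts[1] != "????????":
                threads[current].append(parts[1])
    return threads


if __name__ == "__main__":
    for thread, frames in named_frames(SAMPLE_STACK).items():
        print(thread, "->", " / ".join(frames))
```

For the sample, the only Directory Server specific name left between the C library calls at the top and the thread start-up calls at the bottom is plugin_call_exop_plugins, which is the call the analysis above focuses on.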
If the problem is related to the Directory Server, you can use the function call that seems like the most likely cause of the problem to search on SunSolve for known problems associated with this function call. SunSolve is located at
If you do locate a problem related to the one you are experiencing, confirm that it applies to the version of Directory Server that you are running. To get information about the version you are running, use the following command:
# dsadm -V
If after doing a basic analysis of your core files you cannot identify the problem, collect the binaries and libraries using the pkgapp script and contact the Sun Support Center.
The type of performance problem you are experiencing depends on the level of CPU available as described in the following table. The first step in troubleshooting a Directory Server that is still running but no longer responding to client application requests is to identify which of the three types of performance issue it corresponds to.
Table 5-1 CPU Level Associated With Performance Problems
|CPU Level||Problem Description|
|CPU = 0%||Passive hang, the server is completely unresponsive|
|CPU < 90%||Performance drop, the server is operating but not at the expected rate|
|CPU = 100%||Active hang, the server is completely unresponsive|
The remainder of this section describes the following troubleshooting procedures:
If your error log contains errors about not being able to open file descriptors, this is usually a symptom of an unresponsive process. For example, the error log may contain a message such as the following:
[17/APR/2009:01:41:13 +0000] - ERROR<12293> - Connection - conn=-1 op=-1 msgId=-1 - fd limit exceeded Too many open file descriptors - not listening on new connection
Other symptoms of an unresponsive process include LDAP connections that do not answer or that hang, no messages in the error or access logs, or an access log that is never updated.
The prstat -L tool tells you the amount of CPU being used by each thread. If you collect a process stack using the pstack utility at the same time you run the prstat tool, you can then use the pstack output to see what the thread was doing when it had trouble. If you run the prstat and pstack commands simultaneously several times, you can see over time whether the same thread was causing the problem and whether it was encountering the problem during the same function call. If you are experiencing a performance drop, run the commands simultaneously every 2 seconds. If you are experiencing a passive or active hang, run the commands with a slightly longer delay, for example every 10 seconds.
For example, you try running an ldapsearch on your Directory Server as follows:
# ldapsearch -p 5389 -D "cn=Directory Manager" -w secret -b "o=test" description=*
Suppose this command runs for 40 seconds and does not give any results. To analyze why the process is unresponsive, first get the process ID using the following command:
# ps -aef | grep slapd | grep slapd-server1
mares 15013 24159 0 13:06:20 pts/32 0:00 grep slapd-server1
mares 14993 1 1 13:05:36 ? 0:04 ./ns-slapd -D /local/dsInst -i /local/dsInst
Next, rerun the search and during the search run the prstat and pstack commands simultaneously for the Directory Server process, which in the output above has a process ID of 14993.
prstat -L -p 14993 0 1 > prstat.output ; pstack 14993 > pstack.output
We rerun the commands three times, with an interval of two seconds between each consecutive run.
The output of the first prstat command appears as follows:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
14993 mares 128M 110M cpu0 59 0 0:00.02 3.0% ns-slapd/51
14993 mares 128M 110M sleep 59 0 0:00.49 1.3% ns-slapd/32
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/16
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/15
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/14
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/13
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/12
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/11
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/10
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/9
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/8
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/6
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/5
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/4
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/3
Total: 1 processes, 51 lwps, load averages: 0.36, 0.29, 0.17
The problem appears to be occurring in thread 51. Next, we look for thread 51 in the output of the first pstack command and it appears as follows:
----------------- lwp# 51 / thread# 51 --------------------
ffffffff7eb55a78 ???????? (1, 102183a10, ffffffff70c1d340, 1001c5390, 0, ffffffff7ecea248)
ffffffff77925fe0 id2entry (1002b7610, 1a09, 0, ffffffff70c1e7f4, 0, ffffffff77a6faa8) + 3e8
ffffffff7795ed20 ldbm_back_next_search_entry_ext (101cfcb90, 10190fd60, 0, 101b877b0, 1a08, 45b4aa34) + 300
ffffffff7ebaf6f8 ???????? (101cfcb90, 1002b7610, 1, ffffffff70c1eaf4, 0, 0)
ffffffff7ebafbc4 ???????? (101cfcb90, 1, ffffffff70c1eaf4, 0, 10190fd60, ffffffff70c1e980)
ffffffff7ebaf170 op_shared_search (101cfcb90, 0, 1015ad240, 0, ffffffffffffffff, ffffffff7ecea248) + 8c0
ffffffff7e92efcc search_core_pb (101cfcb90, 2, 1000, 4000, ffffffff7ea4c810, ffffffff7ea56088) + 6c4
ffffffff7e93a710 dispatch_operation_core_pb (101cfcb90, 101cfcb90, c00, ffffffff7ea4c810, 0, d10) + cc
ffffffff7e926420 ???????? (101f3fe80, 102fd3250, 2, 63, 2, 200000)
ffffffff7e92672c ldap_frontend_main_using_core_api (101f3fe80, 102fd3250, 2, 101da1218, 10133db10, 0) + fc
ffffffff7e927764 ???????? (220, 101c97310, ffffffffffffffff, 800, 958, 101f3fe80)
ffffffff7d036a7c _pt_root (101c97310, ffffffff70b00000, 0, 0, 20000, ffffffff70c1ff48) + d4
ffffffff7c1173bc _lwp_start (0, 0, 0, 0, 0, 0)
The ends of the lines in this example have been wrapped so that they fit on the page.
The output of the second and third pstack commands shows the same results, with thread 51 doing the same types of operation.
The pstack outputs taken at two-second intervals all show thread 51 doing the same search operations. The first parameter of the op_shared_search function contains the address of the operation taking place, which is 101cfcb90. The same operation occurs in each of the three stacks, meaning that the same search is taking place during the four seconds that elapsed between the first and the last pstack run. Moreover, the prstat output always shows thread 51 as the thread taking the highest amount of CPU.
If you check the access log for the result of the search operation at the time the hang was observed, you find that the hang results from the search on the unindexed description attribute. Creating an index on the description attribute avoids this hang.
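The two checks used in this example, finding the thread with the highest CPU use in the prstat output and confirming that the same operation address appears in each pstack sample, can be sketched in Python as follows. This is an illustration only. The function names are hypothetical, the sample data is abbreviated from the output above, and the parsing assumes the prstat -L column layout shown in this section.

```python
import re

PRSTAT_SAMPLE = """\
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
14993 mares 128M 110M cpu0 59 0 0:00.02 3.0% ns-slapd/51
14993 mares 128M 110M sleep 59 0 0:00.49 1.3% ns-slapd/32
14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/16
Total: 1 processes, 51 lwps, load averages: 0.36, 0.29, 0.17
"""

PSTACK_FRAME = ("ffffffff7ebaf170 op_shared_search (101cfcb90, 0, 1015ad240, "
                "0, ffffffffffffffff, ffffffff7ecea248) + 8c0")


def busiest_lwp(prstat_text):
    """Return (lwp_id, cpu_percent) for the LWP using the most CPU, or None."""
    best = None
    for line in prstat_text.splitlines():
        fields = line.split()
        # Data rows end with CPU% and PROCESS/LWPID, e.g. "3.0% ns-slapd/51".
        if len(fields) >= 2 and fields[-2].endswith("%") and "/" in fields[-1]:
            cpu = float(fields[-2].rstrip("%"))
            lwp = int(fields[-1].rsplit("/", 1)[1])
            if best is None or cpu > best[1]:
                best = (lwp, cpu)
    return best


def first_argument(pstack_text, function):
    """Return the first argument pstack printed for function, or None."""
    pattern = r"\b" + re.escape(function) + r"\s*\(([0-9a-f]+)"
    match = re.search(pattern, pstack_text)
    return match.group(1) if match else None


def same_operation(pstack_samples, function="op_shared_search"):
    """True if every sample shows function working on the same address."""
    addresses = {first_argument(sample, function) for sample in pstack_samples}
    return len(addresses) == 1 and None not in addresses


if __name__ == "__main__":
    print(busiest_lwp(PRSTAT_SAMPLE))
    print(first_argument(PSTACK_FRAME, "op_shared_search"))
```

Passing the three saved pstack outputs to same_operation gives a quick confirmation that one long-running search, rather than a series of different searches, is keeping the thread busy.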
This section describes how to begin troubleshooting a drop in performance. It describes possible causes of performance drops, describes the information you need to consult if you experience a performance drop, and how to analyze this information.
Make certain that you have not mistaken an active or passive hang for a performance drop. If you are experiencing a performance drop, it could be for one of the following reasons:
Other processes are affecting CPU or disk access
High input/output rate
Unindexed searches, such as when an index is missing or when a "!" filter is used
Complex searches, such as searches on static groups, class of service, and roles
Complex updates, such as to static groups, class of service, and roles
Sub-optimum system settings, such as
Directory Server tuned incorrectly
Collect information about disk, CPU, memory, and process stack use during the period in which performance is dropping.
If your CPU usage is very low (at or around 10%), try to determine if the problem is network related using the netstat command as follows:
# netstat -an | grep port
A performance drop may be the result of the network if a client is not receiving information despite the fact that access logs show that results were sent immediately. Running the traceroute commands can help you determine if network latency is responsible for the problem.
Collect swap information to see if you are running out of memory. Memory may be your problem if the output of the swap command is small.
|Platform||Memory Loss Indicator|
Already provided in
On Solaris, use the output of the
prstat command to identify if other processes could be impacting the system performance. On Linux and HP-UX, use the
prstat output of the Directory Server during the period when the performance drops as described in Analyzing Data About a Unresponsive Process: an Example. For example, you could use the following script on Solaris to gather
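#!/bin/sh
# Collect 10 prstat and pstack samples, one second apart,
# for the process ID passed as the first argument.
i=0
while [ "$i" -lt "10" ]
do
    echo "$i"
    date=`date "+%y%m%d:%H%M%S"`
    prstat -L -p $1 0 1 > /tmp/prstat.$date
    pstack $1 > /tmp/pstack.$date
    i=`expr $i + 1`
    sleep 1
done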
In general, look through your data for patterns and commonalities in the errors encountered. For example, if all operation problems are associated with searches to static groups, modifies to static groups, and searches on roles, this indicates that Directory Server is not properly tuned to handle these expensive operations. For example, the nsslapd-search-tune attribute might not be configured correctly for static group related searches, or a substring index on the uniqueMember attribute might be affecting the group related updates. If you notice that problems are associated with unrelated operations but all at a particular time, this might indicate a memory access problem or a disk access problem.
You can take information culled from your pstack output to SunSolve and search for it along with the phrase unresponsive events to see if anything similar to your problem has already been encountered and solved. SunSolve is located at
The remainder of this section provides additional tips to help you analyze the data you collected in the previous steps.
You can use the logconv command to analyze the Directory Server access logs. This command extracts usage statistics and counts the occurrences of significant events. For more information about this tool, see logconv.
For example, run the logconv command as follows:
# logconv -s 50 -efcibaltnxgju access > analysis.access
Check the output file for the following:
Unindexed searches (
If unindexed searches are present, search for the associated indexes using the dsconf list-indexes command. If the index exists, then you may be reaching the limit of your all-ids-threshold property. This property defines the maximum number of values per index key in an index list. Increase the all-ids-threshold and reindex.
If the index does not exist, then you need to create the index and then reindex. For information about creating an index, see To Create Indexes in Administrator's Guide for Oracle Directory Server Enterprise Edition.
High file descriptor consumption
To manage a problem with file descriptor consumption you may need to request to increase the file descriptors available at the system level. You may want to reduce the number of persistent searches (
notes=persistent), modify the client applications that do not disconnect, or reduce the idle timeout value set by the
Searches with long etimes or that return many entries
For example, if the etime is 344, grep the access log for etime 344. The access log tells you the connection and operation. You can use this information to see what the operation was doing when the performance drop occurred, when the connection was opened, and who was the binding user. If all of the same operations have long etimes, that points to a problem with a particular operation. If the same binding user is always associated with a long etime, this suggests an ACI issue.
If you suspect an ACI problem with the binding user, prove it by running the same operation with the Directory Manager user, who is not subject to ACIs.
Searches on the
uniquemember attribute or on the wrong filters.
Look on SunSolve for static group performance hot patches. Run your search by specifying the
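The search for operations with long etimes described above can also be scripted. The following Python sketch is an illustration under stated assumptions, not the logconv implementation. The function name is hypothetical, and the sample lines assume RESULT records that carry conn=, op=, and etime= fields in that order. Check the field layout against your own access log before relying on it.

```python
import re

# Assumed layout of a RESULT record (illustrative sample data only).
SAMPLE_LOG = [
    "[17/Apr/2009:01:41:13 +0000] conn=12 op=3 msgId=4 - RESULT err=0 tag=101 nentries=1 etime=344",
    "[17/Apr/2009:01:41:14 +0000] conn=12 op=4 msgId=5 - RESULT err=0 tag=101 nentries=1 etime=0",
    "[17/Apr/2009:01:41:15 +0000] conn=13 op=0 msgId=1 - BIND dn=\"cn=Directory Manager\" method=128 version=3",
]

RESULT = re.compile(
    r"conn=(-?\d+) op=(-?\d+) .*\bRESULT\b.*\betime=(\d+(?:\.\d+)?)")


def slow_operations(lines, threshold):
    """Yield (conn, op, etime) for RESULT lines with etime >= threshold."""
    for line in lines:
        match = RESULT.search(line)
        if match and float(match.group(3)) >= threshold:
            yield int(match.group(1)), int(match.group(2)), float(match.group(3))


if __name__ == "__main__":
    for conn, op, etime in slow_operations(SAMPLE_LOG, threshold=100):
        print("conn=%d op=%d etime=%s" % (conn, op, etime))
```

The connection and operation numbers returned can then be used to find the matching request and bind lines in the access log, as described above.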
Often a capacity limitation manifests itself as a performance issue. To differentiate between performance and capacity, performance might be defined as "How fast the system is going" while capacity is "the maximum performance of the system or an individual component."
If your CPU usage is very low (at or around 10%), try to determine if the disk controllers are fully loaded and if input/output is the cause. To determine if your problem is disk related, use the iostat tool as follows:
# iostat -xnMCz -T d 10
For example, a directory is available on the internet. Customers submit searches from multiple sites, and the Service Level Agreement (SLA) allows no more than 5% of requests to have response times of over 3 seconds. Currently 15% of requests take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with 12x900MHz CPUs.
The vmstat output looks as follows:
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m1 m1 in sy cs us sy id
0 2 0 8948920 5015176 374 642 10 12 13 0 2 1 2 1 2 132 2694 1315 14 3 83
0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19 4 77
0 19 0 4089232 188304 430 529 91 211 211 0 34 8 6 5 4 6956 9611 2377 16 5 79
0 18 0 4085680 188168 556 758 96 218 217 0 40 12 4 6 4 6979 7659 2354 18 6 77
0 18 0 4077656 188128 520 501 75 217 216 0 46 9 3 5 2 7044 8044 2188 17 5 78
We look at the right 3 columns, us=user, sy=system, and id=idle, which show that over 50% of the CPU is idle and available for the performance problem. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, or the scan rate gets over 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left of the display has 18 or 19 processes in it but there are no processes in the run queue. This suggests that the process is blocking somewhere in Solaris without using all of the available CPU.
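The checks described above (idle CPU, scan rate, and blocked versus runnable processes) can be applied to saved vmstat output with a short script. The following Python sketch is a minimal illustration. The function names are hypothetical, and the parser assumes the 22-column layout of this particular example, which has four disk columns. A system with a different number of disk columns needs different indexes.

```python
VMSTAT_SAMPLE = """\
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m1 m1 in sy cs us sy id
0 2 0 8948920 5015176 374 642 10 12 13 0 2 1 2 1 2 132 2694 1315 14 3 83
0 19 0 4089432 188224 466 474 50 276 278 0 55 5 5 4 3 7033 6191 2198 19 4 77
"""


def parse_vmstat_rows(text):
    """Return one dict per numeric row of vmstat output (22-column layout)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Header lines are skipped because they are not purely numeric.
        if len(fields) == 22 and all(f.isdigit() for f in fields):
            values = [int(f) for f in fields]
            rows.append({
                "run": values[0],         # r: processes in the run queue
                "blocked": values[1],     # b: processes blocked on resources
                "scan_rate": values[11],  # sr: pages scanned by the page scanner
                "user": values[19],
                "system": values[20],
                "idle": values[21],
            })
    return rows


def blocked_without_runnable(rows):
    """Return rows with blocked processes but an empty run queue."""
    return [r for r in rows if r["blocked"] > 0 and r["run"] == 0]


def scanner_active(rows):
    """True if the scan rate is above 0 in any row."""
    return any(r["scan_rate"] > 0 for r in rows)


if __name__ == "__main__":
    rows = parse_vmstat_rows(VMSTAT_SAMPLE)
    print("rows blocked with empty run queue:", len(blocked_without_runnable(rows)))
    print("page scanner active:", scanner_active(rows))
```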
Next, we look at the I/O subsystem. The iostat command has a switch, -C, which will aggregate I/Os at the controller level. We run the iostat command as follows:
# iostat -xnMCz -T d
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
396.4 10.7 6.6 0.1 0.0 20.3 0.0 49.9 0 199 c1
400.2 8.8 6.7 0.0 0.0 20.2 0.0 49.4 0 199 c3
199.3 6.0 3.3 0.0 0.0 10.1 0.0 49.4 0 99 c1t0d0
197.1 4.7 3.3 0.0 0.0 10.2 0.0 50.4 0 100 c1t1d0
198.2 3.7 3.4 0.0 0.0 9.4 0.0 46.3 0 99 c3t0d0
202.0 5.1 3.3 0.0 0.0 10.8 0.0 52.4 0 100 c3t1d0
On controller 1 we are doing 396 reads per second and on controller 3 we are doing 400 reads per second. On the right side of the data, the output shows each controller as almost 200% busy. The individual disks are doing almost 200 reads per second, and the output shows the disks as 100% busy. As a rule of thumb, individual disks perform at approximately 150 I/Os per second, so these disks are running beyond that rate. This rule does not apply to LUNs or LDEVs from the big disk arrays. Our examination of the numbers leads us to suggest adding 2 disks to each controller and laying out the data again.
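The comparison against the 150 I/Os per second rule of thumb can be sketched in Python as follows. This is an illustration only. The function name and the 99% busy threshold are assumptions, and the parser assumes the 11-column iostat -xnMCz layout shown above.

```python
RULE_OF_THUMB_IOPS = 150  # approximate per-disk figure quoted in the text

IOSTAT_SAMPLE = """\
extended device statistics
r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
396.4 10.7 6.6 0.1 0.0 20.3 0.0 49.9 0 199 c1
199.3 6.0 3.3 0.0 0.0 10.1 0.0 49.4 0 99 c1t0d0
197.1 4.7 3.3 0.0 0.0 10.2 0.0 50.4 0 100 c1t1d0
"""


def busy_devices(text, busy_threshold=99):
    """Return (device, total I/Os per second, %busy) for saturated devices."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue
        try:
            reads, writes = float(fields[0]), float(fields[1])
            busy = int(fields[9])
        except ValueError:
            continue  # column header line
        if busy >= busy_threshold:
            result.append((fields[10], round(reads + writes, 1), busy))
    return result


if __name__ == "__main__":
    for device, iops, busy in busy_devices(IOSTAT_SAMPLE):
        over = "above" if iops > RULE_OF_THUMB_IOPS else "within"
        print("%s: %.1f I/Os per second, %d%% busy, %s rule of thumb"
              % (device, iops, busy, over))
```

Controller rows such as c1 are reported as well, because with the -C switch their busy figure is the aggregate of the disks behind them.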
In this exercise we looked at all the numbers and attempted to locate the precise nature of the problem. Do not assume adding CPUs and memory will fix all performance problems. In this case, the search programs were exceeding the capacity of the disk drives which manifested itself as a performance problem of transactions with extreme response times. All those CPUs were waiting on the disk drives.
This section describes how to troubleshoot a totally unresponsive Directory Server process. A totally unresponsive process is called a hang, and there are two types of hang you might experience:
Active hang, when the CPU level is at 100%. For example, the process enters an infinite loop and never finishes servicing a request.
Passive hang, when the CPU level is at 0%. For example, the process encounters a deadlock where two or more threads of a process are each waiting for another to finish, and thus none ever does.
The remainder of this section describes how to troubleshoot each of these types of process hang.
A hang is active if the vmstat 1 output shows CPU levels of over 95%.
This section describes the causes of an active hang, how to collect information about an active hang, and how to analyze this data.
Possible causes of an active hang include the following:
An infinite loop
Retry of an unsuccessful operation, such as a replication operation or a bad commit
On a Solaris system, collect several traces of the Directory Server process stack that is hanging, using the Solaris pstack utility. You should also collect statistics about the active process using the Solaris prstat -L utility. You must collect this information while the server is hanging. The consecutive pstack and prstat data should be collected every second.
A hang is passive if the vmstat 1 output shows low CPU levels.
Possible causes of a passive hang include the following:
A deadlock resulting from locks or conditional variables
A defunct thread
On a Solaris system, collect several traces of the Directory Server process stack that is hanging, using the Solaris pstack utility. You must collect this information while the server is hanging. The consecutive pstack data should be collected every three seconds.
Collect several core files that show the state of the server threads while the server is hanging. Do this by generating a core file using the gcore command, changing the name of the core file, waiting 30 seconds, and generating another core file. Repeat the process at least once to get a minimum of three sets of core files and related data.
For more information about generating a core file, see Generating a Core File.
This section describes how to troubleshoot an inaccessible database.
The Directory Server database may be inaccessible for one of the following reasons:
Shared region file corruption
Missing change log
Corrupted change log
Database offline, for example it is being reimported
Missing transaction log
Analyze the error log to find the required information.
This section describes how to troubleshoot a memory leak.
Memory leaks are caused by problems allocating memory, either in Directory Server itself or in custom plug-ins. Troubleshooting these problems can be very difficult, particularly in the case of custom plug-ins.
It is important to do the following before collecting data about your memory leak:
Disable any custom plug-ins
Reduce the cache setting to very low values
Enable the audit log
Once you have done the above, run the prstat -L utility and check whether the VSZ column grows.
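Whether the process size is growing is easier to judge from a series of samples than from a single reading. The following Python sketch shows one way to compare size values collected over time. It is an illustration only. The function names are hypothetical, and it assumes size strings with a K, M, or G suffix, such as the 128M value in the SIZE column of the prstat output shown earlier in this chapter.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(size):
    """Convert a size string such as "128M" to a number of bytes."""
    return int(float(size[:-1]) * UNITS[size[-1].upper()])


def grows_steadily(samples):
    """True if the sizes never shrink and the last is larger than the first."""
    sizes = [size_to_bytes(s) for s in samples]
    never_shrinks = all(b >= a for a, b in zip(sizes, sizes[1:]))
    return len(sizes) >= 2 and never_shrinks and sizes[-1] > sizes[0]


if __name__ == "__main__":
    # Hypothetical sizes collected at regular intervals during a test run.
    print(grows_steadily(["128M", "131M", "140M"]))
    print(grows_steadily(["128M", "128M", "128M"]))
```

A process that grows while caches are filling and then levels off is not necessarily leaking, which is why the cache settings are reduced to very low values before the test.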
Collect the generic Directory Server data, as described in Collecting Generic Data. This data includes the version of Directory Server that you are running, logs from the test run, in particular the audit log, and the Directory Server configuration file.
With the data you collected, you can now contact the Sun Support Center for assistance with your problem.
On Solaris systems, the libumem library is a memory agent library that is helpful for analyzing the cause of a memory leak. For more information about the libumem library, see the technical article at the following location:
Restart the Directory Server using the following command:
# SUN_SUPPORT_SLAPD_NOSH=true LD_PRELOAD=libumem.so \
UMEM_DEBUG=contents,audit=40,guards UMEM_LOGGING=transaction ./dsadm start
The libumem library is now loaded before the Directory Server starts and is used instead of the Directory Server memory allocation.
Next, run the gcore command at least twice, once before the memory use starts to grow and once after. The gcore command dumps a memory image in the current directory.
# gcore core.process-id
Finally, use the mdb tool to analyze the results.