For example, you try running an ldapsearch on your Directory Server as follows:
# ldapsearch -p 5389 -D "cn=Directory Manager" -w secret -b"o=test" description=* |
This command generates a 40 second search with no results. To analyze why the process in unresponsive, first get the process ID using the following command:
# ps -aef | grep slapd | grep slapd-server1 mares 15013 24159 0 13:06:20 pts/32 0:00 grep slapd-server1 mares 14993 1 1 13:05:36 ? 0:04 ./ns-slapd -D /u1/SUNWdsee/user1/52/slapd-server1 -i /u1/SUNWdsee/user1/52/slapd-s |
Next, rerun the search and during the search run the prstat and pstack commands simultaneously for the Directory Server process, which in the output above has a process ID of 14993.
prstat -L -p 14993 0 1 > prstat.output ; pstack 14993 > pstack.output |
We rerun the commands three times, with an interval of two seconds between each consecutive run.
The output of the first prstat command appears as follows:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID 14993 mares 128M 110M cpu0 59 0 0:00.02 3.0% ns-slapd/51 14993 mares 128M 110M sleep 59 0 0:00.49 1.3% ns-slapd/32 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/16 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/15 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/14 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/13 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/12 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/11 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/10 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/9 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/8 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/6 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/5 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/4 14993 mares 128M 110M sleep 59 0 0:00.00 0.0% ns-slapd/3 Total: 1 processes, 51 lwps, load averages: 0.36, 0.29, 0.17 |
The problem appears to be occurring in thread 51. Next, we look for thread 51 in the output of the first pstack command and it appears as follows:
----------------- lwp# 51 / thread# 51 -------------------- ffffffff7eb55a78 ???????? (1, 102183a10, ffffffff70c1d340, 1001c5390, 0, ffffffff7ecea248) ffffffff77925fe0 id2entry (1002b7610, 1a09, 0, ffffffff70c1e7f4, 0, ffffffff77a6faa8) + 3e8 ffffffff7795ed20 ldbm_back_next_search_entry_ext (101cfcb90, 10190fd60, 0, 101b877b0, 1a08, 45b4aa34) + 300 ffffffff7ebaf6f8 ???????? (101cfcb90, 1002b7610, 1, ffffffff70c1eaf4, 0, 0) ffffffff7ebafbc4 ???????? (101cfcb90, 1, ffffffff70c1eaf4, 0, 10190fd60, ffffffff70c1e980) ffffffff7ebaf170 op_shared_search (101cfcb90, 0, 1015ad240, 0, ffffffffffffffff, ffffffff7ecea248) + 8c0 ffffffff7e92efcc search_core_pb (101cfcb90, 2, 1000, 4000, ffffffff7ea4c810, ffffffff7ea56088) + 6c4 ffffffff7e93a710 dispatch_operation_core_pb (101cfcb90, 101cfcb90, c00, ffffffff7ea4c810, 0, d10) + cc ffffffff7e926420 ???????? (101f3fe80, 102fd3250, 2, 63, 2, 200000) ffffffff7e92672c ldap_frontend_main_using_core_api (101f3fe80, 102fd3250, 2, 101da1218, 10133db10, 0) + fc ffffffff7e927764 ???????? (220, 101c97310, ffffffffffffffff, 800, 958, 101f3fe80) ffffffff7d036a7c _pt_root (101c97310, ffffffff70b00000, 0, 0, 20000, ffffffff70c1ff48) + d4 ffffffff7c1173bc _lwp_start (0, 0, 0, 0, 0, 0) |
The ends of the lines in this example have been wrapped so that they fit on the page.
The output of the second and third pstack command show the same results, with thread 51 doing the same types of operation.
All three pstack outputs taken at two second intervals show thread 51 doing the same search operations. The first parameter of the op_shared_search function contains the address of the operations taking place, which is 101cfcb90. The same operation occurs in each of the three stacks, meaning that the same search is taking place during the four seconds that elapsed between the first and the last pstack run. Moreover, the prstat output always shows thread 51 as the thread taking the highest amount of CPU.
If you check the access log for the result of the search operations at the time the hang was observed, we find that it is a result of the search on the unindexed description entry. By creating a description index, this hang will be avoided.