Troubleshoot Process Hangs and Loops

6 Troubleshoot Process Hangs and Loops

This chapter provides information and guidance about some specific procedures for troubleshooting hanging or looping processes.

Problems can occur that involve hanging or looping processes. A hang can occur for many reasons, but often stems from a deadlock in an application code, API code, or library code. A hang can be due to a bug in the Java HotSpot VM.

Sometimes an apparent hang turns out to be, in fact, a loop. For example, a bug in a VM process that causes one or more threads to go into an infinite loop can consume all available CPU cycles.

The initial step when you diagnose a hang is to find out if the VM process is idle or consuming all available CPU cycles. You can do this using a native operating system (OS) utility. If the process appears to be busy and is consuming all available CPU cycles, then it is likely that the issue is a looping thread rather than a deadlock. On the Oracle Solaris operating system, for example, the command prstat -L -p pid can be used to report the statistics for all lightweight processes (LWPs) in the target process and therefore will identify the threads that are consuming a lot of CPU cycles.

This chapter contains the following sections:

Diagnose a Loop Process

If a VM process appears to be looping, try to get a thread dump. A thread dump often makes it clear which thread is looping, and the trace stack in the thread dump can provide the direction on where (and maybe why) the thread is looping.

If the application console (standard input/output) is available, then press the Control+\ key combination (on Oracle Solaris or Linux) or the Control+Break key combination (on Windows) to cause the HotSpot VM to print a thread dump, including thread state. On Oracle Solaris and Linux operating systems the thread dump can also be obtained by sending a SIGQUIT to the process (command kill -QUIT pid). In this case, the thread dump is printed to the standard output of the target process. The output might be directed to a file, depending on how the process was started.

If the Java process is started with the -XX:+PrintClassHistogram command-line option, then the Control+Break handler will produce a heap histogram.

If a thread dump can be obtained, then a good place to start is the thread stacks of the threads that are in the RUNNABLE state. See Thread Dump, for more information about the format of the thread dump, as well as a table of the possible thread states in the thread dump. In some cases, it might be necessary to get a sequence of thread dumps in order to determine which threads appear to be continuously busy.

If the application console is not available (for example, the process is running in the background, or the VM output is directed to an unknown location), then the jstack utility or the jhsdb jstack utility can be used to get the stack thread. See The jstack Utility or the jstack mode of jhsdb for more about the output of these utilities. The jstack utility or the jhsdb jstack utility should also be used if the thread dump does not provide any evidence that a Java thread is looping.

When reviewing the output of the jstack utility, focus initially on the threads that are in the RUNNABLE state. This is the most likely state for threads that are busy and possibly looping. It might be necessary to execute jstack a number of times to get a better idea of which threads are looping. If a thread appears to be always in the RUNNABLE state, then use jhsdb jstack with the --mixed option to print the native frames and provide a further hint about what the thread is doing. If a thread appears to be looping continuously while in the RUNNABLE state, then this situation can indicate a potential HotSpot VM bug that needs further investigation.

If the VM does not respond to Control+\, then this could indicate a VM bug rather than an issue with the application or library code. In this case, use jhsdb jstack with the --mixed option to get a thread stack for all threads. The output will include the thread stacks for VM internal threads. In this stack trace, identify threads that do not appear to be waiting. For example, on the Oracle Solaris operating system, you identify the threads that are not in functions such as __lwp_cond_wait, __lwp_park, ___pollsys, or other blocking functions. If it appears that the looping is caused by a VM bug, then collect as much data as possible and submit a bug report. See Submit a Bug Report for more about data collection.

Diagnose a Hung Process

Use the thread dump to diagnose a hung process.

If the application appears to be hung and the process appears to be idle, then the first step is to try to get a thread dump. If the application console is available, then press Control+\ (on Oracle Solaris or Linux), or Control+Break (on Windows) to cause the HotSpot VM to print a thread dump. On the Oracle Solaris and Linux operating systems, the thread dump can also be obtained by sending a SIGQUIT to the process (command kill -QUIT pid). If the hung process can generate a thread dump, then the output is printed to the standard output of the target process.

After printing the thread dump, the HotSpot VM executes a deadlock detection algorithm.

The following sections describe various situations for a hung process.

Deadlock Detected

If a deadlock is detected, then it will be printed along with the stack trace of the threads involved in the deadlock.

The following example shows the stack trace for this situation.

Found one Java-level deadlock:
=============================
"AWT-EventQueue-0":
  waiting to lock monitor 0x000ffbf8 (object 0xf0c30560, a java.awt.Component$AWTTreeLock),
  which is held by "main"
"main":
  waiting to lock monitor 0x000ffe38 (object 0xf0c41ec8, a java.util.Vector),
  which is held by "AWT-EventQueue-0"

Java stack information for the threads listed above:
===================================================
"AWT-EventQueue-0":
        at java.awt.Container.removeNotify(Container.java:2503)
        - waiting to lock <0xf0c30560> (a java.awt.Component$AWTTreeLock)
        at java.awt.Window$1DisposeAction.run(Window.java:604)
        at java.awt.Window.doDispose(Window.java:617)
        at java.awt.Dialog.doDispose(Dialog.java:625)
        at java.awt.Window.dispose(Window.java:574)
        at java.awt.Window.disposeImpl(Window.java:584)
        at java.awt.Window$1DisposeAction.run(Window.java:598)
        - locked <0xf0c41ec8> (a java.util.Vector)
        at java.awt.Window.doDispose(Window.java:617)
        at java.awt.Window.dispose(Window.java:574)
        at javax.swing.SwingUtilities$SharedOwnerFrame.dispose(SwingUtilities.java:1743)
        at javax.swing.SwingUtilities$SharedOwnerFrame.windowClosed(SwingUtilities.java:1722)
        at java.awt.Window.processWindowEvent(Window.java:1173)
        at javax.swing.JDialog.processWindowEvent(JDialog.java:407)
        at java.awt.Window.processEvent(Window.java:1128)
        at java.awt.Component.dispatchEventImpl(Component.java:3922)
        at java.awt.Container.dispatchEventImpl(Container.java:2009)
        at java.awt.Window.dispatchEventImpl(Window.java:1746)
        at java.awt.Component.dispatchEvent(Component.java:3770)
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:463)
        at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:214)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:163)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:157)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:149)
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:110)
"main":
        at java.awt.Window.getOwnedWindows(Window.java:844)
        - waiting to lock <0xf0c41ec8> (a java.util.Vector)
        at javax.swing.SwingUtilities$SharedOwnerFrame.installListeners(SwingUtilities.java:1697)
        at javax.swing.SwingUtilities$SharedOwnerFrame.addNotify(SwingUtilities.java:1690)
        at java.awt.Dialog.addNotify(Dialog.java:370)
        - locked <0xf0c30560> (a java.awt.Component$AWTTreeLock)
        at java.awt.Dialog.conditionalShow(Dialog.java:441)
        - locked <0xf0c30560> (a java.awt.Component$AWTTreeLock)
        at java.awt.Dialog.show(Dialog.java:499)
        at java.awt.Component.show(Component.java:1287)
        at java.awt.Component.setVisible(Component.java:1242)
        at test01.main(test01.java:10)

Found 1 deadlock.

The default deadlock detection works with locks that are obtained using the synchronized keyword, as well as with locks that are obtained using the java.util.concurrent package. If the Java VM flag -XX:+PrintConcurrentLocks is set, then the stack trace also shows a list of lock owners.

If a deadlock is detected, then you must examine the output in more detail in order to understand the deadlock. In the previous example, the thread main is locking object 0xf0c30560 and is waiting to enter 0xf0c41ec8, which is locked by thread AWT-EventQueue-0. However, thread AWT-EventQueue-0 is waiting to enter 0xf0c30560, which is locked by main.

The detail in the stack traces provides information to help you find the deadlock.

Deadlock Not Detected

If the thread dump is printed and no deadlocks are found, then the issue might be a bug in which a thread is waiting for a monitor that is never notified. This could be a timing issue or a general logic bug.

To find out more about the issue, examine each of the threads in the thread dump and each thread that is blocked in Object.wait(). The caller frame in the stack trace indicates the class and method that is invoking the wait() method. If the code was compiled with line number information (the default), then this provides a direction as to the code to examine. In most cases, you must have some knowledge of the application logic or library in order to diagnose this issue further. In general, you must understand how the synchronization works in the application and the details and conditions for when and where the monitors are notified.

No Thread Dump

If the VM is deadlocked or hung, use the jstack or jhsdb jstack command.

If the VM does not respond to Control+\ or Control+Break, then it is possible that the VM is deadlocked or hung for some other reason. In that case, use The jstack Utility or jhsdb jstack to get a thread dump. This also applies in the case when the application is not accessible, or the output is directed to an unknown location.

In the thread dump, examine each of the threads in the BLOCKED state. The top frame can sometimes indicate why the thread is blocked (for example, Object.wait or Thread.sleep). The rest of the stack will give an indication of what the thread is doing. This is particularly true when the source is compiled with line number information (the default), and you can cross-reference the source code.

If a thread is in the BLOCKED state and the reason is not clear, then use jhsdb jstack --mixed to get a mixed stack. With the mixed stack output, it should be possible to identify why the thread is blocked. If a thread is blocked trying to enter a synchronized method or block, then you will see frames such as ObjectMonitor::enter near the top of the stack. The following example shows a sample, mixed-stack output.

----------------- t@13 -----------------
0xff31e8b8      ___lwp_cond_wait + 0x4
0xfea8c810      void ObjectMonitor::EnterI(Thread*) + 0x2b8
0xfeac86b8      void ObjectMonitor::enter2(Thread*) + 0x250
:

Threads in the RUNNABLE state might also be blocked. The top frames in the mixed stack should indicate what the thread is doing.

One specific thread to check is VMThread. This is the special thread used to execute operations like garbage collection (GC). It can be identified as the thread that is executing VMThread::run() in its initial frames. On the Oracle Solaris, it is typically t@4. On Linux, it should be identifiable using the C++ mangled name _ZN8VMThread4loopEv.

In general, the VM thread is in one of three states: waiting to execute a VM operation, synchronizing all threads in preparation for a VM operation, or executing a VM operation. If you suspect that a hang is a HotSpot VM bug rather than an application or class library deadlock, then pay special attention to the VM thread.

If the VM thread appears to be stuck in SafepointSynchronize::begin, then this could indicate an issue bringing the VM to a safepoint. A safepoint indicates that all threads executing in the VM are blocked and waiting for a special operation, such as GC, to complete.

If the VM thread appears to be stuck in function, where function ends in doit, then this could also indicate a VM problem.

In general, if you can execute the application from the command line, and you get to a state where the VM does not respond to Control+\ or Control+Break, it is more likely that you have uncovered a VM bug, a thread library issue, or a bug in another library. When this occurs, get a crash dump. See Collect Core Dumps for instructions about gathering as much information as possible, and submit a bug report or call support.

One other tool to mention in the context of hung processes is the pstack utility on the Oracle Solaris operating system. On the Oracle Solaris 8 and 9 operating systems, this utility prints the thread stacks for LWPs in the target process. On the Oracle Solaris 10 operating system and starting with the JDK 5.0 release, the output of pstack is similar, though not identical, to the output from jhsdb jstack --mixed.The Oracle Solaris 10 operating system implementation of pstack prints the fully qualified class name, method name, and bytecode index (BCI). It will also print line numbers for cases where the source was compiled with line number information (the default). This is useful for developers and administrators who are familiar with the other utilities on the Oracle Solaris operating system that exercise features of the /proc file system.

The equivalent tool of pstack on Linux is lsstack. This utility is included in some distributions and otherwise obtained from sourceforge. At the time of this writing, lsstack reported native frames only.

Oracle Solaris 8 Thread Library

The default thread library on the Oracle Solaris 8 operating system is often referred to as the T1 library. This thread library implemented the m:n threading model, where m user threads are mapped to n kernel-level threads (LWPs). The Oracle Solaris 8 operating system also shipped with an alternative and newer thread library in /usr/lib/lwp. The alternative thread library is often referred to as the T2 library, and it became the default thread library in the Oracle Solaris 9 and 10 operating systems. In older releases of J2SE (pre-1.4.0 in particular), there were a number of issues with the default thread library (for example, bugs in the thread library, LWP synchronization problems, or LWP starvation). LWP starvation is a scenario in which there are user threads in the RUNNABLE state, but there are no kernel level threads available.

Although the issues cited are historical, it should be noted that when the JDK software is deployed on the Oracle Solaris 8 operating system, it still uses the T1 library by default. LWP starvation type issues do not happen because the JDK release uses "bound threads" so that each user thread is bound to a kernel thread. However, in the event that you encounter an issue, such as a hang, that you believe is a thread library issue, then you can instruct the HotSpot VM to use the T2 library by adding /usr/lib/lwp to the LD_LIBRARY_PATH. To check if the T2 library is in use, issue the command pldd pid to list the libraries loaded by the specified process.