If one or more servers on your site stop responding unaccountably after running under load for a certain period of time, there are a few possible causes:
HTTP servers not sending requests to your application.
A Java deadlock.
Some resource that your application depends on is itself hung (such as the database or some service with which the application communicates via sockets). For example, if a single client opens up hundreds of connections to request pages and then stops reading the response data, this could lock up a server without any real failure of any ATG components.
You may also have consumed all of the memory in your JVM. If this happens, you’ll usually see OutOfMemory errors in your console right before the server hangs. This may appear as a hang because the server will do a garbage collection to reclaim a few bytes, run a few lines of code, then walk through the heap again trying to find another few bytes to reclaim.
An infinite loop in some code.
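To tell memory exhaustion apart from the other causes, it can help to look at how full the heap is. The following is a minimal sketch, not an ATG facility: it uses the standard java.lang.management API and would have to run inside the same JVM as the application (for example from a diagnostic component of your own). The class name HeapCheck and the method heapUsedFraction are hypothetical names chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; not part of ATG.
public class HeapCheck {

    /** Returns used heap as a fraction of the maximum heap, or -1 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        if (max <= 0) {
            return -1.0;
        }
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        double fraction = heapUsedFraction();
        if (fraction < 0) {
            System.out.println("Maximum heap size is undefined");
        } else {
            System.out.printf("Heap used: %.1f%%%n", fraction * 100);
        }
    }
}
```

A value that stays close to 1.0 while the server makes almost no progress is consistent with the garbage collection behavior described above, although a single reading is not conclusive.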
Here are some steps you can take to attempt to identify the cause of the server hang.
Check the CPU utilization of the machine and particularly the Java process running your ATG application. If CPU utilization is 100%, it is either an OutOfMemory problem or a CPU burning thread.
Check the server logs to see if any errors right before the hang indicate why the server has failed. You might see a “server not responding” message or an OutOfMemory error.
Get a thread dump from your Java VM. A thread dump can help you recognize all of these problems.
If all threads are waiting in system calls such as socket read/write, then they are waiting for a resource to respond (for instance, the database or the network). You should look to this resource for answers. If the resource is a database, try using a third-party database tool to make a query. It is possible that the tables used by your ATG application are locked by some other operation, in which case the application's threads will wait until that operation has completed.
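A thread dump is usually taken from outside the process (for example with the JDK's jstack tool), but the same information is available in-process through the standard java.lang.management API. The sketch below is illustrative only and is not an ATG facility: it reports any deadlock the JVM detects and counts threads whose top stack frames suggest socket I/O. The class name ThreadDumpSketch, its method names, and the "class name contains Socket" test are assumptions made for this example; the test is a rough heuristic that also matches idle threads waiting to accept connections.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class; not part of ATG.
public class ThreadDumpSketch {

    /** Reports deadlocked threads, if the JVM finds any. */
    static String describeDeadlocks(ThreadMXBean threads) {
        // findDeadlockedThreads also covers java.util.concurrent locks, but is
        // only usable when the JVM supports synchronizer monitoring.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        if (ids == null) {
            return "No deadlocked threads found";
        }
        StringBuilder report = new StringBuilder("Deadlocked threads:\n");
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // null if the thread has since terminated
                report.append("  ").append(info.getThreadName())
                      .append(" is waiting for ").append(info.getLockName())
                      .append('\n');
            }
        }
        return report.toString();
    }

    /** Heuristic (an assumption): a "Socket" class in the top frames suggests socket I/O. */
    static boolean looksLikeSocketIo(StackTraceElement[] stack) {
        int depth = Math.min(stack.length, 5);
        for (int i = 0; i < depth; i++) {
            if (stack[i].getClassName().contains("Socket")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println(describeDeadlocks(threads));

        // Request lock details only where the JVM supports them.
        ThreadInfo[] dump = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        int inSocketIo = 0;
        for (ThreadInfo info : dump) {
            if (looksLikeSocketIo(info.getStackTrace())) {
                inSocketIo++;
            }
        }
        System.out.println(inSocketIo + " of " + dump.length
                + " threads appear to be in socket I/O");
    }
}
```

If nearly every request-handling thread is counted as being in socket I/O and no deadlock is reported, that points toward an external resource such as the database rather than a fault in the application itself.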