Description (Sun Java System Application Server Enterprise Edition 8.1 2005Q2 Troubleshooting Guide)

Sun Java System Application Server Enterprise Edition 8.1 2005Q2 Troubleshooting Guide

Description

A disk contention can have a negative impact on user data read/writes to the disk devices, as well as on HADB writing to history files. Severe disk contention may delay or abort user transactions. Delay in history file writing may cause node restarts and, in the worst case, lead to double failures.

The disk contention can be identified by monitoring the disk I/O from the OS, for the disks used for data, log devices and history files. History file write delays are written to the HADB history files. This can be identified by “NSUP BEWARE timestamp Last flush took too long (x msecs).”

This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:

Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.

This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor’s request checking the heartbeat of the trans process. This causes the nsup to believe that the trans has died and then restarts it.

When the operating system is overloaded with too many processes (many HADB nodes co-located with other processes), the scheduling of I/O operations may be delayed. When the HADB related I/O work is delayed, HADB nodes write the following in the history files, “HADB warning: Schedule of async <read,write\> operation took ...”

This problem is observed especially in Red Hat AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.