Is There Disk Contention? (Sun Java System Application Server Enterprise Edition 8.1 2005Q2 Troubleshooting Guide)

Sun Java System Application Server Enterprise Edition 8.1 2005Q2 Troubleshooting Guide

Is There Disk Contention?

Description

A disk contention can have a negative impact on user data read/writes to the disk devices, as well as on HADB writing to history files. Severe disk contention may delay or abort user transactions. Delay in history file writing may cause node restarts and, in the worst case, lead to double failures.

The disk contention can be identified by monitoring the disk I/O from the OS, for the disks used for data, log devices and history files. History file write delays are written to the HADB history files. This can be identified by “NSUP BEWARE timestamp Last flush took too long (x msecs).”

This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:

Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.

This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor’s request checking the heartbeat of the trans process. This causes the nsup to believe that the trans has died and then restarts it.

When the operating system is overloaded with too many processes (many HADB nodes co-located with other processes), the scheduling of I/O operations may be delayed. When the HADB related I/O work is delayed, HADB nodes write the following in the history files, “HADB warning: Schedule of async <read,write\> operation took ...”

This problem is observed especially in Red Hat AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.

Solution

Use one disk per node to place the devices used by that node. If the node has more than one data devices and the disk contention is observed, move one data device to another disk. The same applies to the history file as well.

Make sure that all data and log devices and all history files reside on local disks (not NFS-mounted or other remotely mounted disks).

If the monitoring tools still show contention on the HADB disks, the data buffer pool size can be increased.