By default, the server fault monitor restarts the database after the second consecutive timed-out probe. If the database is lightly loaded, two consecutive timed-out probes should be sufficient to indicate that the database is hanging. However, during periods of heavy load, a server fault monitor probe might time out even if the database is functioning correctly. To prevent the server fault monitor from restarting the database unnecessarily, increase the maximum number of consecutive timed-out probes.
Increasing the maximum number of consecutive timed-out probes increases the time that is required to detect that the database is hanging.
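The trade-off can be estimated with a rough calculation. The following sketch assumes a simplified model in which every timed-out probe consumes its full timeout and consecutive probes are separated by a fixed interval. The function name, the parameter names, and the sample values are hypothetical and are not taken from the product.

```python
def worst_case_detection_seconds(max_timeouts, probe_timeout, probe_interval):
    """Estimate how long a hang can go undetected (simplified model).

    Assumptions (not from the product documentation):
    - each timed-out probe waits for the full probe_timeout, in seconds
    - probe_interval seconds elapse between consecutive probes
    """
    if max_timeouts < 1:
        raise ValueError("max_timeouts must be at least 1")
    return max_timeouts * probe_timeout + (max_timeouts - 1) * probe_interval


if __name__ == "__main__":
    # Hypothetical values: 300-second probe timeout, 60-second probe interval.
    for allowed in (2, 5):
        seconds = worst_case_detection_seconds(allowed, 300, 60)
        print(f"{allowed} consecutive timeouts: about {seconds} seconds")
```

Under these assumed values, raising the maximum from two to five roughly triples the estimated detection time, so choose the smallest maximum that avoids unnecessary restarts under load.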
To change the maximum number of consecutive timed-out probes allowed, create one entry in a custom action file for each consecutive timed-out probe that is allowed except the first timed-out probe.
You are not required to create an entry for the first timed-out probe. The action that the server fault monitor performs in response to the first timed-out probe is preset.
For the last allowed timed-out probe, create an entry in which the keywords are set as follows:
ERROR_TYPE is set to TIMEOUT.
ERROR is set to the maximum number of consecutive timed-out probes that are allowed.
ACTION is set to RESTART.
For each remaining consecutive timed-out probe except the first timed-out probe, create an entry in which the keywords are set as follows:
ERROR_TYPE is set to TIMEOUT.
ERROR is set to the sequence number of the timed-out probe. For example, for the second consecutive timed-out probe, set this keyword to 2. For the third consecutive timed-out probe, set this keyword to 3.
ACTION is set to NONE.
To facilitate debugging, specify a message that indicates the sequence number of the timed-out probe.
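The rules above are mechanical enough to script. The following is a minimal sketch, not a supplied tool, that generates the entries for a chosen maximum. The function name `timeout_entries` is hypothetical, and the keyword values (including `TIMEOUT`, `CONNECTION_STATE=*`, and `NEW_STATE=*`) follow the example entries in this section.

```python
def timeout_entries(max_timeouts):
    """Build custom action file entries for timed-out probes 2..max_timeouts.

    No entry is generated for the first timed-out probe, because the
    response to it is preset.
    """
    if max_timeouts < 2:
        raise ValueError("max_timeouts must be at least 2")
    entries = []
    for n in range(2, max_timeouts + 1):
        is_last = n == max_timeouts
        # Only the last allowed timed-out probe triggers a restart.
        action = "RESTART" if is_last else "NONE"
        message = f"Timeout #{n} has occurred."
        if is_last:
            message += " Restarting."
        entries.append(
            "{\n"
            "ERROR_TYPE=TIMEOUT;\n"
            f"ERROR={n};\n"
            f"ACTION={action};\n"
            "CONNECTION_STATE=*;\n"
            "NEW_STATE=*;\n"
            f'MESSAGE="{message}";\n'
            "}"
        )
    return entries


if __name__ == "__main__":
    # Print the entries for a maximum of five consecutive timed-out probes.
    print("\n".join(timeout_entries(5)))
```

Review the generated text before adding it to a custom action file; the sketch does not validate the file or check for conflicts with entries that the file already contains.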
The following example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five.
{
ERROR_TYPE=TIMEOUT;
ERROR=2;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #2 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=3;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #3 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=4;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #4 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=5;
ACTION=RESTART;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #5 has occurred. Restarting.";
}
These entries specify the following behavior:
The server fault monitor ignores the second consecutive timed-out probe through the fourth consecutive timed-out probe.
In response to the fifth consecutive timed-out probe, the server fault monitor restarts the database.
The entries apply regardless of the state of the connection between the database and the server fault monitor when the timeout occurs.
The state of the connection between the database and the server fault monitor must remain unchanged after the timeout occurs.
When any of the second through fourth consecutive timed-out probes occurs, a message of the following form is printed to the resource's log file:
Timeout #number has occurred.
When the fifth consecutive timed-out probe occurs, the following message is printed to the resource's log file:
Timeout #5 has occurred. Restarting.
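The behavior described above can be illustrated with a small simulation. This sketch is a simplified model and not the fault monitor's implementation. It assumes that a successful probe resets the count of consecutive timeouts and that the count also resets after a restart. The names `simulate` and `ACTIONS` and the placeholder `PRESET` are hypothetical.

```python
# Actions taken from the example entries: timeouts 2 through 4 are
# ignored, and the fifth consecutive timeout triggers a restart.
ACTIONS = {2: "NONE", 3: "NONE", 4: "NONE", 5: "RESTART"}


def simulate(probe_timed_out, actions=ACTIONS):
    """Return (sequence number, action) for each timed-out probe.

    probe_timed_out is an iterable of booleans, True meaning the probe
    timed out. The response to the first timed-out probe is preset, so
    it is reported with the placeholder "PRESET".
    """
    consecutive = 0
    log = []
    for timed_out in probe_timed_out:
        if not timed_out:
            consecutive = 0  # assumption: a successful probe resets the count
            continue
        consecutive += 1
        if consecutive == 1:
            log.append((1, "PRESET"))
            continue
        action = actions[consecutive]
        log.append((consecutive, action))
        if action == "RESTART":
            consecutive = 0  # assumption: the count restarts after a restart
    return log


if __name__ == "__main__":
    for number, action in simulate([True] * 5):
        print(f"Timeout #{number}: {action}")
```

In this model, a run of timeouts that is interrupted by a successful probe never reaches the fifth entry, which is the intended effect of raising the maximum during periods of heavy load.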