Sun Cluster Data Service for Oracle Guide for Solaris OS

Customizing the Sun Cluster HA for Oracle Server Fault Monitor

Customizing the Sun Cluster HA for Oracle server fault monitor enables you to modify the behavior of the server fault monitor as follows:

Overriding the preset action for an error
Specifying an action for an error for which no action is preset

Caution –

Before you customize the Sun Cluster HA for Oracle server fault monitor, consider the effects of your customizations, especially if you change an action from restart or switch over to ignore or stop monitoring. If errors remain uncorrected for long periods, the errors might cause problems with the database. If you encounter problems with the database after customizing the Sun Cluster HA for Oracle server fault monitor, revert to using the preset actions. Reverting to the preset actions enables you to determine if the problem is caused by your customizations.

Customizing the Sun Cluster HA for Oracle server fault monitor involves the following activities:

Defining custom behavior for errors
Propagating a custom action file to all nodes in a cluster
Specifying the custom action file that a server fault monitor should use

Defining Custom Behavior for Errors

The Sun Cluster HA for Oracle server fault monitor detects the following types of errors:

DBMS errors that occur during a probe of the database by the server fault monitor
Alerts that Oracle logs in the alert log file
Timeouts that result from a failure to receive a response within the time that is set by the Probe_timeout extension property

To define custom behavior for these types of errors, create a custom action file.

Custom Action File Format

A custom action file is a plain text file. The file contains one or more entries that define the custom behavior of the Sun Cluster HA for Oracle server fault monitor. Each entry defines the custom behavior for a single DBMS error, a single time-out error, or several logged alerts. A maximum of 1024 entries is allowed in a custom action file.

Note –

Each entry in a custom action file overrides the preset action for an error, or specifies an action for an error for which no action is preset. Create entries in a custom action file only for the preset actions that you are overriding or for errors for which no action is preset. Do not create entries for actions that you are not changing.

An entry in a custom action file consists of a sequence of keyword-value pairs that are separated by semicolons. Each entry is enclosed in braces.

The format of an entry in a custom action file is as follows:

{
[ERROR_TYPE=DBMS_ERROR|SCAN_LOG|TIMEOUT_ERROR;]
ERROR=error-spec; 
[ACTION=SWITCH|RESTART|STOP|NONE;]
[CONNECTION_STATE=co|di|on|*;]
[NEW_STATE=co|di|on|*;]
[MESSAGE="message-string"]
}

White space may be used between keyword-value pairs and between entries to format the file.
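The entry format above can be read mechanically. The following Python sketch is illustrative only and is not part of Sun Cluster: the function name parse_entries is hypothetical, and the sketch assumes that quoted values contain no braces or embedded double quotes. It does not reproduce the parsing that the server fault monitor itself performs.

```python
import re

# Assumption: quoted values contain no braces, so a non-greedy match on
# "{...}" isolates each entry.
ENTRY_RE = re.compile(r"\{(.*?)\}", re.DOTALL)
# A keyword, an equals sign, then either a quoted string or a bare token.
PAIR_RE = re.compile(r'([A-Z_]+)\s*=\s*("[^"]*"|[^;\s]+)')


def parse_entries(text):
    """Return one dict of keyword-value pairs per entry in the text."""
    entries = []
    for body in ENTRY_RE.findall(text):
        entry = {}
        for keyword, value in PAIR_RE.findall(body):
            if value.startswith('"'):
                value = value[1:-1]  # drop the enclosing double quotes
            entry[keyword] = value
        entries.append(entry)
    return entries


sample = """
{
ERROR_TYPE=DBMS_ERROR;
ERROR=4031;
ACTION=RESTART;
MESSAGE="Insufficient memory in shared pool.";
}
{ ERROR_TYPE=SCAN_LOG; ERROR="ORA-00600: internal error"; ACTION=RESTART; }
"""

for entry in parse_entries(sample):
    print(entry)
```

The second sample entry is written on one line to show that white space between pairs does not change the result.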

The meaning and permitted values of the keywords in a custom action file are as follows:

ERROR_TYPE

Indicates the type of the error that the server fault monitor has detected. The following values are permitted for this keyword:

DBMS_ERROR: Specifies that the error is a DBMS error.
SCAN_LOG: Specifies that the error is an alert that is logged in the alert log file.
TIMEOUT_ERROR: Specifies that the error is a timeout.

The ERROR_TYPE keyword is optional. If you omit this keyword, the error is assumed to be a DBMS error.

ERROR

Identifies the error. The data type and the meaning of error-spec are determined by the value of the ERROR_TYPE keyword as shown in the following table.

ERROR_TYPE	Data Type	Meaning
DBMS_ERROR	Integer	The error number of a DBMS error that is generated by Oracle
SCAN_LOG	Quoted regular expression	A string in an error message that Oracle has logged to the Oracle alert log file
TIMEOUT_ERROR	Integer	The number of consecutive timed-out probes since the server fault monitor was last started or restarted

You must specify the ERROR keyword. If you omit this keyword, the entry in the custom action file is ignored.
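As an illustration of how the data type of error-spec depends on ERROR_TYPE, the following Python sketch converts a raw error-spec string into an integer or a compiled regular expression. The function name interpret_error_spec is hypothetical. The sketch also assumes that Python's regular expression syntax is close enough to the syntax that the server fault monitor accepts, which this document does not specify.

```python
import re


def interpret_error_spec(error_type, error_spec):
    """Convert a raw error-spec string according to ERROR_TYPE.

    An omitted ERROR_TYPE (None) is treated as DBMS_ERROR, as the
    ERROR_TYPE description states.
    """
    if error_type is None:
        error_type = "DBMS_ERROR"
    if error_type in ("DBMS_ERROR", "TIMEOUT_ERROR"):
        return int(error_spec)  # error number or consecutive timeout count
    if error_type == "SCAN_LOG":
        return re.compile(error_spec)  # pattern matched against alert text
    raise ValueError(f"unknown ERROR_TYPE: {error_type}")


print(interpret_error_spec("DBMS_ERROR", "4031"))
print(interpret_error_spec("TIMEOUT_ERROR", "5"))
pattern = interpret_error_spec("SCAN_LOG", "ORA-00600: internal error")
print(bool(pattern.search("ORA-00600: internal error code")))
```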

ACTION

Specifies the action that the server fault monitor is to perform in response to the error. The following values are permitted for this keyword:

NONE: Specifies that the server fault monitor ignores the error.
STOP: Specifies that the server fault monitor is stopped.
RESTART: Specifies that the server fault monitor stops and restarts the entity that is specified by the value of the Restart_type extension property of the SUNW.oracle_server resource.
SWITCH: Specifies that the server fault monitor switches over the database server resource group to another node.

The ACTION keyword is optional. If you omit this keyword, the server fault monitor ignores the error.

CONNECTION_STATE

Specifies the required state of the connection between the database and the server fault monitor when the error is detected. The entry applies only if the connection is in the required state when the error is detected. The following values are permitted for this keyword:

*: Specifies that the entry always applies, regardless of the state of the connection.
co: Specifies that the entry applies only if the server fault monitor is attempting to connect to the database.
on: Specifies that the entry applies only if the server fault monitor is online. The server fault monitor is online if it is connected to the database.
di: Specifies that the entry applies only if the server fault monitor is disconnecting from the database.

The CONNECTION_STATE keyword is optional. If you omit this keyword, the entry always applies, regardless of the state of the connection.

NEW_STATE

Specifies the state of the connection between the database and the server fault monitor that the server fault monitor must attain after the error is detected. The following values are permitted for this keyword:

*: Specifies that the state of the connection must remain unchanged.
co: Specifies that the server fault monitor must disconnect from the database and reconnect immediately to the database.
di: Specifies that the server fault monitor must disconnect from the database. The server fault monitor reconnects when it next probes the database.

The NEW_STATE keyword is optional. If you omit this keyword, the state of the database connection remains unchanged after the error is detected.

MESSAGE

Specifies an additional message that is printed to the resource's log file when this error is detected. The message must be enclosed in double quotes. This message is additional to the standard message that is defined for the error.

The MESSAGE keyword is optional. If you omit this keyword, no additional message is printed to the resource's log file when this error is detected.
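The defaults for omitted keywords that are described above can be summarized in a short Python sketch. The names DEFAULTS and with_defaults are hypothetical, and the sketch only restates the documented defaults. It is not the server fault monitor's implementation.

```python
# Defaults taken from the keyword descriptions: an omitted ERROR_TYPE means
# DBMS_ERROR, an omitted ACTION means the error is ignored (NONE), and an
# omitted CONNECTION_STATE or NEW_STATE behaves like "*".
DEFAULTS = {
    "ERROR_TYPE": "DBMS_ERROR",
    "ACTION": "NONE",
    "CONNECTION_STATE": "*",
    "NEW_STATE": "*",
    "MESSAGE": None,  # no additional message is printed
}


def with_defaults(entry):
    """Fill in omitted optional keywords, or return None if ERROR is missing."""
    if "ERROR" not in entry:
        return None  # an entry without ERROR is ignored
    return {**DEFAULTS, **entry}


print(with_defaults({"ERROR": "4031", "ACTION": "RESTART"}))
print(with_defaults({"ACTION": "RESTART"}))
```

An entry that contains only ERROR therefore behaves like an explicit instruction to ignore that DBMS error in every connection state.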

Changing the Response to a DBMS Error

The action that the server fault monitor performs in response to each DBMS error is preset as listed in Table B–1. To determine whether you need to change the response to a DBMS error, consider whether the preset action is appropriate given the effect of that error on your database. For examples, see the subsections that follow.

To change the response to a DBMS error, create an entry in a custom action file in which the keywords are set as follows:

ERROR_TYPE is set to DBMS_ERROR.
ERROR is set to the error number of the DBMS error.
ACTION is set to the action that you require.

Responding to an Error Whose Effects Are Major

If an error that the server fault monitor ignores affects more than one session, action by the server fault monitor might be required to prevent a loss of service.

For example, no action is preset for Oracle error 4031: unable to allocate num-bytes bytes of shared memory. However, this Oracle error indicates that the shared global area (SGA) has insufficient memory, is badly fragmented, or both. If this error affects only a single session, ignoring the error might be appropriate. However, if this error affects more than one session, consider specifying that the server fault monitor restart the database.

The following example shows an entry in a custom action file for changing the response to a DBMS error to restart.

Example 1–1 Changing the Response to a DBMS Error to Restart

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4031; 
ACTION=restart;
CONNECTION_STATE=*; 
NEW_STATE=*;
MESSAGE="Insufficient memory in shared pool.";
}

This example shows an entry in a custom action file that specifies an action for DBMS error 4031, for which no action is preset. This entry specifies the following behavior:

In response to DBMS error 4031, the action that the server fault monitor performs is restart.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
The following message is printed to the resource's log file when this error is detected:
Insufficient memory in shared pool.

Ignoring an Error Whose Effects Are Minor

If the effects of an error to which the server fault monitor responds are minor, ignoring the error might be less disruptive than responding to the error.

For example, the preset action for Oracle error 4030: out of process memory when trying to allocate num-bytes bytes is restart. This Oracle error indicates that the server fault monitor could not allocate private heap memory. One possible cause of this error is that insufficient memory is available to the operating system. If this error affects more than one session, restarting the database might be appropriate. However, this error might not affect other sessions because these sessions do not require further private memory. In this situation, consider specifying that the server fault monitor ignore the error.

The following example shows an entry in a custom action file for ignoring a DBMS error.

Example 1–2 Ignoring a DBMS Error

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4030;
ACTION=none;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="";
}

This example shows an entry in a custom action file that overrides the preset action for DBMS error 4030. This entry specifies the following behavior:

The server fault monitor ignores DBMS error 4030.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
No additional message is printed to the resource's log file when this error is detected.

Changing the Response to Logged Alerts

The Oracle software logs alerts in a file that is identified by the Alert_log_file extension property. The server fault monitor scans this file and performs actions in response to alerts for which an action is defined.

Logged alerts for which an action is preset are listed in Table B–2. Change the response to logged alerts to change the preset action, or to define new alerts to which the server fault monitor responds.

To change the response to logged alerts, create an entry in a custom action file in which the keywords are set as follows:

ERROR_TYPE is set to SCAN_LOG.
ERROR is set to a quoted regular expression that identifies a string in an error message that Oracle has logged to the Oracle alert log file.
ACTION is set to the action that you require.

The server fault monitor processes the entries in a custom action file in the order in which the entries occur. Only the first entry that matches a logged alert is processed. Later entries that match are ignored. If you are using regular expressions to specify actions for several logged alerts, ensure that more specific entries occur before more general entries. Specific entries that occur after general entries might be ignored.

For example, a custom action file might define different actions for errors that are identified by the regular expressions ORA-65 and ORA-6. To ensure that the entry that contains the regular expression ORA-65 is not ignored, ensure that this entry occurs before the entry that contains the regular expression ORA-6.
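The effect of entry order can be demonstrated with a short Python sketch. The function first_matching_action, the actions that are paired with each expression, and the sample alert text are all hypothetical. The sketch assumes only the documented rule that the first matching entry is processed and that later matching entries are ignored.

```python
import re


def first_matching_action(entries, alert_text):
    """Return the action of the first entry whose pattern matches the alert."""
    for pattern, action in entries:
        if re.search(pattern, alert_text):
            return action
    return None  # no entry matches the alert


# Hypothetical actions, chosen only to make the two entries distinguishable.
specific_first = [("ORA-65", "RESTART"), ("ORA-6", "NONE")]
general_first = [("ORA-6", "NONE"), ("ORA-65", "RESTART")]

alert = "ORA-65 sample alert text"  # hypothetical alert line

print(first_matching_action(specific_first, alert))
print(first_matching_action(general_first, alert))
```

With the specific entry first, the alert is handled by the ORA-65 entry. With the general entry first, the expression ORA-6 also matches the text ORA-65, so the ORA-65 entry is never reached.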

The following example shows an entry in a custom action file for changing the response to a logged alert.

Example 1–3 Changing the Response to a Logged Alert

{
ERROR_TYPE=SCAN_LOG;
ERROR="ORA-00600: internal error";
ACTION=RESTART;
}

This example shows an entry in a custom action file that overrides the preset action for logged alerts about internal errors. This entry specifies the following behavior:

In response to logged alerts that contain the text ORA-00600: internal error, the action that the server fault monitor performs is restart.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
No additional message is printed to the resource's log file when this error is detected.

Changing the Maximum Number of Consecutive Timed-Out Probes

By default, the server fault monitor restarts the database after the second consecutive timed-out probe. If the database is lightly loaded, two consecutive timed-out probes should be sufficient to indicate that the database is hanging. However, during periods of heavy load, a server fault monitor probe might time out even if the database is functioning correctly. To prevent the server fault monitor from restarting the database unnecessarily, increase the maximum number of consecutive timed-out probes.

Caution –

Increasing the maximum number of consecutive timed-out probes increases the time that is required to detect that the database is hanging.

To change the maximum number of consecutive timed-out probes allowed, create one entry in a custom action file for each allowed consecutive timed-out probe except the first.

Note –

You are not required to create an entry for the first timed‐out probe. The action that the server fault monitor performs in response to the first timed‐out probe is preset.

For the last allowed timed-out probe, create an entry in which the keywords are set as follows:

ERROR_TYPE is set to TIMEOUT_ERROR.
ERROR is set to the maximum number of consecutive timed‐out probes that are allowed.
ACTION is set to RESTART.

For each remaining consecutive timed‐out probe except the first timed-out probe, create an entry in which the keywords are set as follows:

ERROR_TYPE is set to TIMEOUT_ERROR.
ERROR is set to the sequence number of the timed-out probe. For example, for the second consecutive timed-out probe, set this keyword to 2. For the third consecutive timed-out probe, set this keyword to 3.
ACTION is set to NONE.

Tip –

To facilitate debugging, specify a message that indicates the sequence number of the timed-out probe.

The following example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five.

Example 1–4 Changing the Maximum Number of Consecutive Timed-Out Probes

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=2;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #2 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=3;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #3 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=4;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #4 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=5;
ACTION=RESTART;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #5 has occurred. Restarting.";
}

This example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five. These entries specify the following behavior:

The server fault monitor ignores the second through the fourth consecutive timed-out probes.
In response to the fifth consecutive timed-out probe, the action that the server fault monitor performs is restart.
The entries apply regardless of the state of the connection between the database and the server fault monitor when the timeout occurs.
The state of the connection between the database and the server fault monitor must remain unchanged after the timeout occurs.
When each of the second through the fourth consecutive timed-out probes occurs, a message of the following form is printed to the resource's log file:
Timeout #number has occurred.
When the fifth consecutive timed-out probe occurs, the following message is printed to the resource's log file:
Timeout #5 has occurred. Restarting.
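Because the entries for consecutive timed-out probes follow a fixed pattern, they can be generated rather than typed by hand. The following Python sketch is one possible approach. The function name timeout_entries is hypothetical, and the sketch uses the ERROR_TYPE value TIMEOUT_ERROR that is given in the keyword description.

```python
def timeout_entries(max_timeouts):
    """Build entries that allow max_timeouts consecutive timed-out probes.

    No entry is created for the first timed-out probe, whose action is
    preset. Probes 2 through max_timeouts - 1 are ignored, and probe
    max_timeouts triggers a restart.
    """
    if max_timeouts < 2:
        raise ValueError("max_timeouts must be 2 or greater")
    entries = []
    for number in range(2, max_timeouts + 1):
        is_last = number == max_timeouts
        action = "RESTART" if is_last else "NONE"
        message = f"Timeout #{number} has occurred."
        if is_last:
            message += " Restarting."
        entries.append(
            "{\n"
            "ERROR_TYPE=TIMEOUT_ERROR;\n"
            f"ERROR={number};\n"
            f"ACTION={action};\n"
            "CONNECTION_STATE=*;\n"
            "NEW_STATE=*;\n"
            f'MESSAGE="{message}";\n'
            "}\n"
        )
    return "\n".join(entries)


print(timeout_entries(5))
```

Calling timeout_entries(5) produces four entries, for the second through the fifth timed-out probes, in the same form as the entries in the preceding example.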

Propagating a Custom Action File to All Nodes in a Cluster

A server fault monitor must behave consistently on all cluster nodes. Therefore, the custom action file that the server fault monitor uses must be identical on all cluster nodes. After creating or modifying a custom action file, ensure that this file is identical on all cluster nodes by propagating the file to all cluster nodes. To propagate the file to all cluster nodes, use the method that is most appropriate for your cluster configuration:

Locating the file on a file system that all nodes share
Locating the file on a highly available local file system
Copying the file to the local file system of each cluster node by using operating system commands such as the rcp(1) command or the rdist(1) command
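If you copy the file to each node, one way to confirm that the copies are identical is to compare checksums. The following Python sketch is a minimal illustration. The function names are hypothetical, and the sketch assumes that each node's copy is readable at some path from the machine where the check runs, for example after the copies are collected into a staging directory.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 digest of the file's contents as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_identical(paths):
    """Return True if every file in paths has the same contents."""
    digests = {file_digest(path) for path in paths}
    return len(digests) == 1


# Example call with hypothetical paths, one collected copy per node:
# copies_identical(["/staging/node1/custom_actions",
#                   "/staging/node2/custom_actions"])
```

A result of False indicates that at least one copy differs and that the file should be propagated again.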

Specifying the Custom Action File That a Server Fault Monitor Should Use

To apply customized actions to a server fault monitor, you must specify the custom action file that the fault monitor should use. Customized actions are applied to a server fault monitor when the server fault monitor reads a custom action file. A server fault monitor reads a custom action file when you specify the file.

Specifying a custom action file also validates the file. If the file contains syntax errors, an error message is displayed. Therefore, after modifying a custom action file, specify the file again to validate the file.

Caution –

If syntax errors in a modified custom action file are detected, correct the errors before the fault monitor is restarted. If the syntax errors remain uncorrected when the fault monitor is restarted, the fault monitor reads the erroneous file, ignoring entries that occur after the first syntax error.
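A local pre-check can catch some mistakes before you specify the file. The following Python sketch is a rough illustration and does not replace the validation that is performed when the file is specified. The function name lint_custom_action_file is hypothetical. The sketch compares keyword values without regard to case, which is an assumption based on the lowercase action values that appear in the examples in this document. The limit of 1024 entries comes from the description of the file format.

```python
import re

MAX_ENTRIES = 1024
# Permitted values, stored in uppercase for a case-insensitive comparison
# (an assumption; see the lead-in paragraph).
PERMITTED = {
    "ERROR_TYPE": {"DBMS_ERROR", "SCAN_LOG", "TIMEOUT_ERROR"},
    "ACTION": {"SWITCH", "RESTART", "STOP", "NONE"},
    "CONNECTION_STATE": {"CO", "DI", "ON", "*"},
    "NEW_STATE": {"CO", "DI", "ON", "*"},
}
FREE_FORM = {"ERROR", "MESSAGE"}
PAIR_RE = re.compile(r'([A-Za-z_]+)\s*=\s*("[^"]*"|[^;\s]+)')


def lint_custom_action_file(text):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    # Blank out quoted values so their contents cannot confuse the checks.
    stripped = re.sub(r'"[^"]*"', '""', text)
    if stripped.count("{") != stripped.count("}"):
        problems.append("unbalanced braces")
    bodies = re.findall(r"\{(.*?)\}", stripped, re.DOTALL)
    if len(bodies) > MAX_ENTRIES:
        problems.append(f"{len(bodies)} entries exceed the limit of {MAX_ENTRIES}")
    for number, body in enumerate(bodies, start=1):
        pairs = dict(PAIR_RE.findall(body))
        for keyword, value in pairs.items():
            if keyword in PERMITTED:
                if value.upper() not in PERMITTED[keyword]:
                    problems.append(
                        f"entry {number}: {value} is not permitted for {keyword}"
                    )
            elif keyword not in FREE_FORM:
                problems.append(f"entry {number}: unknown keyword {keyword}")
        if "ERROR" not in pairs:
            problems.append(f"entry {number}: ERROR keyword is missing")
    return problems


print(lint_custom_action_file('{\nERROR=4031;\nACTION=RESTART;\n}\n'))
print(lint_custom_action_file('{\nACTION=REBOOT;\n}\n'))
```

The entry numbers in the reported problems indicate where to look first, which matters because entries after the first syntax error might be ignored.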

How to Specify the Custom Action File That a Server Fault Monitor Should Use

On a cluster node, become superuser.

Set the Custom_action_file extension property of the SUNW.oracle_server resource.

Set this property to the absolute path of the custom action file.
# scrgadm -c -j server-resource -x custom_action_file=filepath
-j server-resource

Specifies the SUNW.oracle_server resource

-x custom_action_file=filepath

Specifies the absolute path of the custom action file
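If you script this step, the command line shown above can be assembled with a check that the path is absolute. The following Python sketch only builds the argument list and does not run the command. The function name scrgadm_args, the resource name, and the file path are hypothetical, and the sketch uses only the options that appear in the preceding command.

```python
import posixpath


def scrgadm_args(server_resource, action_file):
    """Build the argument list for setting the custom action file.

    Raises ValueError if action_file is not an absolute path, because the
    property must be set to the absolute path of the custom action file.
    """
    if not posixpath.isabs(action_file):
        raise ValueError(f"not an absolute path: {action_file}")
    return [
        "scrgadm", "-c",
        "-j", server_resource,
        "-x", f"custom_action_file={action_file}",
    ]


# Hypothetical resource name and path, for illustration only.
print(" ".join(scrgadm_args("oracle-server-rs", "/global/oracle/custom_actions")))
```

The resulting list could be passed to a process-launching function on a cluster node by a user who has superuser privileges.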