Sun Cluster Data Service for Oracle Guide for Solaris OS

Customizing the Sun Cluster HA for Oracle Server Fault Monitor

Customizing the Sun Cluster HA for Oracle server fault monitor enables you to modify the behavior of the server fault monitor as follows:


Caution – Caution –

Before you customize the Sun Cluster HA for Oracle server fault monitor, consider the effects of your customizations, especially if you change an action from restart or switch over to ignore or stop monitoring. If errors remain uncorrected for long periods, the errors might cause problems with the database. If you encounter problems with the database after customizing the Sun Cluster HA for Oracle server fault monitor, revert to using the preset actions. Reverting to the preset actions enables you to determine if the problem is caused by your customizations.


Customizing the Sun Cluster HA for Oracle server fault monitor involves the following activities:

  1. Defining custom behavior for errors

  2. Propagating a custom action file to all nodes in a cluster

  3. Specifying the custom action file that a server fault monitor should use

Defining Custom Behavior for Errors

The Sun Cluster HA for Oracle server fault monitor detects the following types of errors:

To define custom behavior for these types of errors, create a custom action file.

Custom Action File Format

A custom action file is a plain text file. The file contains one or more entries that define the custom behavior of the Sun Cluster HA for Oracle server fault monitor. Each entry defines the custom behavior for a single DBMS error, a single time-out error, or several logged alerts. A maximum of 1024 entries is allowed in a custom action file.


Note –

Each entry in a custom action file overrides the preset action for an error, or specifies an action for an error for which no action is preset. Create entries in a custom action file only for the preset actions that you are overriding or for errors for which no action is preset. Do not create entries for actions that you are not changing.


An entry in a custom action file consists of a sequence of keyword-value pairs that are separated by semicolons. Each entry is enclosed in braces.

The format of an entry in a custom action file is as follows:

{
[ERROR_TYPE=DBMS_ERROR|SCAN_LOG|TIMEOUT_ERROR;]
ERROR=error-spec; 
[ACTION=SWITCH|RESTART|STOP|NONE;]
[CONNECTION_STATE=co|di|on|*;]
[NEW_STATE=co|di|on|*;]
[MESSAGE="message-string"]
}

White space may be used between separated keyword-value pairs and between entries to format the file.

The meaning and permitted values of the keywords in a custom action file are as follows:

ERROR_TYPE

Indicates the type of the error that the server fault monitor has detected. The following values are permitted for this keyword:

DBMS_ERROR

Specifies that the error is a DBMS error.

SCAN_LOG

Specifies that the error is an alert that is logged in the alert log file.

TIMEOUT_ERROR

Specifies that the error is a timeout.

The ERROR_TYPE keyword is optional. If you omit this keyword, the error is assumed to be a DBMS error.

ERROR

Identifies the error. The data type and the meaning of error-spec are determined by the value of the ERROR_TYPE keyword as shown in the following table.

ERROR_TYPE

Data Type 

Meaning 

DBMS_ERROR

Integer 

The error number of a DBMS error that is generated by Oracle 

SCAN_LOG

Quoted regular expression 

A string in an error message that Oracle has logged to the Oracle alert log file 

TIMEOUT_ERROR

Integer 

The number of consecutive timed‐out probes since the server fault monitor was last started or restarted 

You must specify the ERROR keyword. If you omit this keyword, the entry in the custom action file is ignored.

ACTION

Specifies the action that the server fault monitor is to perform in response to the error. The following values are permitted for this keyword:

NONE

Specifies that the server fault monitor ignores the error.

STOP

Specifies that the server fault monitor is stopped.

RESTART

Specifies that the server fault monitor stops and restarts the entity that is specified by the value of the Restart_type extension property of the SUNW.oracle_server resource.

SWITCH

Specifies that the server fault monitor switches over the database server resource group to another node.

The ACTION keyword is optional. If you omit this keyword, the server fault monitor ignores the error.

CONNECTION_STATE

Specifies the required state of the connection between the database and the server fault monitor when the error is detected. The entry applies only if the connection is in the required state when the error is detected. The following values are permitted for this keyword:

*

Specifies that the entry always applies, regardless of the state of the connection.

co

Specifies that the entry applies only if the server fault monitor is attempting to connect to the database.

on

Specifies that the entry applies only if the server fault monitor is online. The server fault monitor is online if it is connected to the database.

di

Specifies that the entry applies only if the server fault monitor is disconnecting from the database.

The CONNECTION_STATE keyword is optional. If you omit this keyword, the entry always applies, regardless of the state of the connection.

NEW_STATE

Specifies the state of the connection between the database and the server fault monitor that the server fault monitor must attain after the error is detected. The following values are permitted for this keyword:

*

Specifies that the state of the connection must remain unchanged.

co

Specifies that the server fault monitor must disconnect from the database and reconnect immediately to the database.

di

Specifies that the server fault monitor must disconnect from the database. The server fault monitor reconnects when it next probes the database.

The NEW_STATE keyword is optional. If you omit this keyword, the state of the database connection remains unchanged after the error is detected.

MESSAGE

Specifies an additional message that is printed to the resource's log file when this error is detected. The message must be enclosed in double quotes. This message is additional to the standard message that is defined for the error.

The MESSAGE keyword is optional. If you omit this keyword, no additional message is printed to the resource's log file when this error is detected.

Changing the Response to a DBMS Error

The action that the server fault monitor performs in response to each DBMS error is preset as listed in Table B–1. To determine whether you need to change the response to a DBMS error, consider the effect of DBMS errors on your database to determine if the preset actions are appropriate. For examples, see the subsections that follow.

To change the response to a DBMS error, create an entry in a custom action file in which the keywords are set as follows:

Responding to an Error the Effects of Which Are Major

If an error that the server fault monitor ignores affects more than one session, action by the server fault monitor might be required to prevent a loss of service.

For example, no action is preset for Oracle error 4031: unable to allocate num-bytes bytes of shared memory. However, this Oracle error indicates that the shared global area (SGA) has insufficient memory, is badly fragmented, or both states apply. If this error affects only a single session, ignoring the error might be appropriate. However, if this error affects more than one session, consider specifying that the server fault monitor restart the database.

The following example shows an entry in a custom action file for changing the response to a DBMS error to restart.


Example 1–1 Changing the Response to a DBMS Error to Restart

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4031; 
ACTION=restart;
CONNECTION_STATE=*; 
NEW_STATE=*;
MESSAGE="Insufficient memory in shared pool.";
}

This example shows an entry in a custom action file that overrides the preset action for DBMS error 4031. This entry specifies the following behavior:


Ignoring an Error the Effects of Which Are Minor

If the effects of an error to which the server fault monitor responds are minor, ignoring the error might be less disruptive than responding to the error.

For example, the preset action for Oracle error 4030: out of process memory when trying to allocate num-bytes bytes is restart. This Oracle error indicates that the server fault monitor could not allocate private heap memory. One possible cause of this error is that insufficient memory is available to the operating system. If this error affects more than one session, restarting the database might be appropriate. However, this error might not affect other sessions because these sessions do not require further private memory. In this situation, consider specifying that the server fault monitor ignore the error.

The following example shows an entry in a custom action file for ignoring a DBMS error.


Example 1–2 Ignoring a DBMS Error

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4030;
ACTION=none;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="";
}

This example shows an entry in a custom action file that overrides the preset action for DBMS error 4030. This entry specifies the following behavior:


Changing the Response to Logged Alerts

The Oracle software logs alerts in a file that is identified by the Alert_log_file extension property. The server fault monitor scans this file and performs actions in response to alerts for which an action is defined.

Logged alerts for which an action is preset are listed in Table B–2. Change the response to logged alerts to change the preset action, or to define new alerts to which the server fault monitor responds.

To change the response to logged alerts, create an entry in a custom action file in which the keywords are set as follows:

The server fault monitor processes the entries in a custom action file in the order in which the entries occur. Only the first entry that matches a logged alert is processed. Later entries that match are ignored. If you are using regular expressions to specify actions for several logged alerts, ensure that more specific entries occur before more general entries. Specific entries that occur after general entries might be ignored.

For example, a custom action file might define different actions for errors that are identified by the regular expressions ORA-65 and ORA-6. To ensure that the entry that contains the regular expression ORA-65 is not ignored, ensure that this entry occurs before the entry that contains the regular expression ORA-6.

The following example shows an entry in a custom action file for changing the response to a logged alert.


Example 1–3 Changing the Response to a Logged Alert

{
ERROR_TYPE=SCAN_LOG;
ERROR="ORA-00600: internal error";
ACTION=RESTART;
}

This example shows an entry in a custom action file that overrides the preset action for logged alerts about internal errors. This entry specifies the following behavior:


Changing the Maximum Number of Consecutive Timed-Out Probes

By default, the server fault monitor restarts the database after the second consecutive timed-out probe. If the database is lightly loaded, two consecutive timed-out probes should be sufficient to indicate that the database is hanging. However, during periods of heavy load, a server fault monitor probe might time out even if the database is functioning correctly. To prevent the server fault monitor from restarting the database unnecessarily, increase the maximum number of consecutive timed-out probes.


Caution – Caution –

Increasing the maximum number of consecutive timed-out probes increases the time that is required to detect that the database is hanging.


To change the maximum number of consecutive timed-out probes allowed, create one entry in a custom action file for each consecutive timed‐out probe that is allowed except the first timed-out probe.


Note –

You are not required to create an entry for the first timed‐out probe. The action that the server fault monitor performs in response to the first timed‐out probe is preset.


For the last allowed timed-out probe, create an entry in which the keywords are set as follows:

For each remaining consecutive timed‐out probe except the first timed-out probe, create an entry in which the keywords are set as follows:


Tip –

To facilitate debugging, specify a message that indicates the sequence number of the timed-out probe.


The following example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five.


Example 1–4 Changing the Maximum Number of Consecutive Timed-Out Probes

{
ERROR_TYPE=TIMEOUT;
ERROR=2;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #2 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=3;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #3 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=4;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #4 has occurred.";
}

{
ERROR_TYPE=TIMEOUT;
ERROR=5;
ACTION=RESTART;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #5 has occurred. Restarting.";
}

This example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five. These entries specify the following behavior:


Propagating a Custom Action File to All Nodes in a Cluster

A server fault monitor must behave consistently on all cluster nodes. Therefore, the custom action file that the server fault monitor uses must be identical on all cluster nodes. After creating or modifying a custom action file, ensure that this file is identical on all cluster nodes by propagating the file to all cluster nodes. To propagate the file to all cluster nodes, use the method that is most appropriate for your cluster configuration:

Specifying the Custom Action File That a Server Fault Monitor Should Use

To apply customized actions to a server fault monitor, you must specify the custom action file that the fault monitor should use. Customized actions are applied to a server fault monitor when the server fault monitor reads a custom action file. A server fault monitor reads a custom action file when the you specify the file.

Specifying a custom action file also validates the file. If the file contains syntax errors, an error message is displayed. Therefore, after modifying a custom action file, specify the file again to validate the file.


Caution – Caution –

If syntax errors in a modified custom action file are detected, correct the errors before the fault monitor is restarted. If the syntax errors remain uncorrected when the fault monitor is restarted, the fault monitor reads the erroneous file, ignoring entries that occur after the first syntax error.


How to Specify the Custom Action File That a Server Fault Monitor Should Use

  1. On a cluster node, become superuser.

  2. Set the Custom_action_file extension property of the SUNW.oracle_server resource.

    Set this property to the absolute path of the custom action file.


    # scrgadm -c -j server-resource\
      -x custom_action_file=filepath
    
    -j server-resource

    Specifies the SUNW.oracle_server resource

    -x custom_action_file=filepath

    Specifies the absolute path of the custom action file