Sun Cluster Data Service for Oracle RAC Guide for Solaris OS

Chapter 4 Administering Sun Cluster Support for Oracle RAC

This chapter explains how to administer Sun Cluster Support for Oracle RAC on your Sun Cluster nodes.

Overview of Administration Tasks for Sun Cluster Support for Oracle RAC

Table 4–1 summarizes the administration tasks for Sun Cluster Support for Oracle RAC.

Perform these tasks whenever they are required.

Table 4–1 Administration Tasks for Sun Cluster Support for Oracle RAC


Task	Instructions
Administer Oracle RAC databases from Sun Cluster	Administering Oracle RAC Databases From Sun Cluster
Tune Sun Cluster Support for Oracle RAC extension properties	Tuning Sun Cluster Support for Oracle RAC
Tune Sun Cluster Support for Oracle RAC fault monitors	Tuning the Sun Cluster Support for Oracle RAC Fault Monitors
Customize the Oracle 9i RAC server fault monitor	Customizing the Oracle 9i RAC Server Fault Monitor
Troubleshoot Sun Cluster Support for Oracle RAC	Chapter 5, Troubleshooting Sun Cluster Support for Oracle RAC

Examples in this chapter provide the long forms of the Sun Cluster maintenance commands. Most commands also have short forms. Except for the forms of the command names, the commands are identical. For a list of the commands and their short forms, see Appendix A, Sun Cluster Object-Oriented Commands, in Sun Cluster 3.1 - 3.2 Hardware Administration Manual for Solaris OS.

Automatically Generated Names for Sun Cluster Objects

When the clsetup utility or Sun Cluster Manager is used to create resources, these tools assign preset names to the resources. If you are administering resources that were created by using the clsetup utility or Sun Cluster Manager, see the following table for these names.

Resource Type	Resource Name
`SUNW.rac_svm`	`rac-svm-rs`
`SUNW.rac_cvm`	`rac-cvm-rs`
`SUNW.rac_udlm`	`rac-udlm-rs`
`SUNW.rac_framework`	`rac-framework-rs`
`SUNW.scalable_rac_server`	`ora-sid-rs`, where `ora-sid` is the SID on the primary node without any numbers in the SID
`SUNW.scalable_rac_listener`	`rac-listener-rs`
`SUNW.scalable_rac_server_proxy`	`rac_server_proxy-rs`
`SUNW.crs_framework`	`crs_framework-rs`
`SUNW.ScalDeviceGroup`	`scaldg-name-rs`, where `dg-name` is the name of the device group that the resource represents
`SUNW.ScalMountPoint`	`scal-mp-dir-rs`, where `mp-dir` is the mount point of the file system, with `/` replaced by `-`
`SUNW.qfs`	`qfs-mp-dir-rs`, where `mp-dir` is the mount point of the file system, with `/` replaced by `-`
`SUNW.LogicalHostname`	`lh-name`, where `lh-name` is the logical hostname that you specified when you created the resource

Administering Oracle RAC Databases From Sun Cluster

Administering Oracle RAC databases from Sun Cluster involves using Sun Cluster administration tools to modify the states of Sun Cluster resources for Oracle RAC database instances. For information about how to create these resources, see Configuring Resources for Oracle RAC Database Instances.

The software architectures of Oracle 9i, Oracle 10g R1, and Oracle 10g R2 are different. As a result of these differences, the resources for Oracle RAC database instances that Sun Cluster requires depend on the version of Oracle that you are using. Consequently, the administration of Oracle RAC databases from Sun Cluster also depends on the version of Oracle that you are using.

Note –

If you are using Oracle 10g R1, you cannot administer Oracle RAC databases from Sun Cluster. Instead, use Oracle CRS utilities to start and shut down Oracle RAC database instances.

The effects of changes to the states of Sun Cluster resources on Oracle database components are explained in the subsections that follow.

Effects of State Changes to Sun Cluster Resources for Oracle 10g R2 RAC Database Instances

In Oracle 10g, Oracle Cluster Ready Services (CRS) manage the startup and shutdown of Oracle database instances, listeners, and other components that are configured in CRS. Oracle CRS are a mandatory component of Oracle 10g. Oracle CRS also monitor the components that they start and, if failures are detected, perform actions to recover from the failures.

Because Oracle CRS manage the startup and shutdown of Oracle database components, these components cannot be stopped and started exclusively under the control of the Sun Cluster Resource Group Manager (RGM). Instead, Oracle CRS and the Sun Cluster RGM interoperate so that when Oracle RAC database instances are started and stopped by Oracle CRS, the state of the database instances is propagated to Sun Cluster resources.

Table 4–2 Propagation of State Changes Between Sun Cluster Resources and Oracle CRS Resources


Trigger	Initial State of Sun Cluster Resource	Initial State of Oracle CRS Resource	Resulting State of Sun Cluster Resource	Resulting State of Oracle CRS Resource
Sun Cluster command to take offline a resource	Enabled and online	Enabled and online	Enabled and offline	Enabled and offline
Oracle CRS command to stop a resource	Enabled and online	Enabled and online	Enabled and offline	Enabled and offline
Sun Cluster command to bring online a resource	Enabled and offline	Enabled and offline	Enabled and online	Enabled and online
Oracle CRS command to start a resource	Enabled and offline	Enabled and offline	Enabled and online	Enabled and online
Sun Cluster command to disable a resource	Enabled and online	Enabled and online	Disabled and offline	Disabled and offline
Oracle CRS command to disable a resource	Enabled and online	Enabled and online	Enabled and online	Disabled and online
Oracle `SQLPLUS` command to shut down the database	Enabled and online	Enabled and online	Enabled and offline	Enabled and offline
Sun Cluster command to enable a resource	Disabled and offline	Disabled and offline	Enabled and online or offline	Enabled and online or offline
Oracle CRS command to enable a resource	Disabled and offline	Disabled and offline	Disabled and offline	Enabled and offline

The names of the states of Sun Cluster resources and Oracle CRS resources are identical. However, the meaning of each state name is different for Sun Cluster resources and Oracle CRS resources. For more information, see the following table.

Table 4–3 Comparisons of States for Sun Cluster Resources and Oracle CRS Resources


State	Meaning for Sun Cluster Resources	Meaning for Oracle CRS Resources
Enabled	The resource is available to the Sun Cluster RGM for automatic startup, failover, or restart. A resource that is enabled can also be in either the online state or the offline state.	The resource is available to run under Oracle CRS for automatic startup, failover, or restart. A resource that is enabled can also be in either the online state or the offline state.
Disabled	The resource is unavailable to the Sun Cluster RGM for automatic startup, failover, or restart. A resource that is disabled is also offline.	The resource is unavailable to run under the Oracle CRS for automatic startup, failover, or restart. A resource that is disabled can also be in either the online state or the offline state.
Online	The resource is running and providing service.	The resource is running and providing service. A resource that is online can also be in either the disabled state or the enabled state.
Offline	The resource is stopped and not providing service.	The resource is stopped and not providing service. A resource that is offline can also be in either the disabled state or the enabled state.

For detailed information about the state of Sun Cluster resources, see Resource and Resource Group States and Settings in Sun Cluster Concepts Guide for Solaris OS.

For detailed information about the state of Oracle CRS resources, see your Oracle documentation.

Effects of State Changes to Sun Cluster Resources for Oracle 9i RAC Database Instances

In Oracle 9i, Oracle database components can be stopped and started exclusively under the control of the Sun Cluster RGM. The effects of state changes to Sun Cluster resources for Oracle 9i RAC database instances are as follows:

Bringing online a resource for an Oracle 9i RAC database component starts the component on the nodes where the resource is brought online.
Taking offline a resource for an Oracle 9i RAC database component stops the component on the nodes where the resource is taken offline.

Tuning Sun Cluster Support for Oracle RAC

To tune the Sun Cluster Support for Oracle RAC data service, you modify the extension properties of the resources for this data service. For details about these extension properties, see Appendix C, Sun Cluster Support for Oracle RAC Extension Properties. Typically, you use the option -p property=value of the clresource(1CL) command to set extension properties of Sun Cluster Support for Oracle RAC resources when you create the resources. You can also use the procedures in Chapter 2, Administering Data Service Resources, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS to configure the resources later.

Guidelines for Setting Timeouts

Many of the extension properties for Sun Cluster Support for Oracle RAC specify timeouts for steps in reconfiguration processes. The optimum values for most of these timeouts are independent of your cluster configuration. Therefore, you should not need to change the timeouts from their default values.

Timeouts that depend on your cluster configuration are described in the subsections that follow. If timeouts occur during reconfiguration processes, increase the values of these timeout properties to accommodate your cluster configuration.

SPARC: VxVM Component Reconfiguration Step 4 Timeout

The time that is required for step 4 of a reconfiguration of the VxVM component of Sun Cluster Support for Oracle RAC is affected by the size and complexity of your VERITAS shared-disk group configuration. If your VERITAS shared-disk group configuration is large or complex and the reconfiguration of the VxVM component times out, increase the timeout for step 4 of a reconfiguration of the VxVM component.

To increase the timeout for step 4 of a reconfiguration of the VxVM component, increase the value of the Cvm_step4_timeout extension property of the SUNW.rac_cvm resource.

For more information, see SPARC: SUNW.rac_cvm Extension Properties.

Example 4–1 Setting the VxVM Component Reconfiguration Step 4 Timeout

# clresource set -p cvm_step4_timeout=1200 rac-cvm-rs

This example sets the timeout for step 4 of a reconfiguration of the VxVM component to 1200 seconds. This example assumes that the VxVM component is represented by an instance of the SUNW.rac_cvm resource type that is named rac-cvm-rs.

Reservation Step Timeout

The time that is required for reservation commands to run is affected by the following factors:

The number of shared physical disks in the cluster
The load on the cluster

If the number of shared physical disks in the cluster is large, or if your cluster is heavily loaded, the reconfiguration of Sun Cluster Support for Oracle RAC might time out. If such a timeout occurs, increase the reservation step timeout.

To increase the reservation step timeout, increase the Reservation_timeout extension property of the SUNW.rac_framework resource.

For more information, see SUNW.rac_framework Extension Properties.

Example 4–2 Setting the Reservation Step Timeout

# clresource set -p reservation_timeout=350 rac-framework-rs

This example sets the timeout for the reservation step of a reconfiguration of Sun Cluster Support for Oracle RAC to 350 seconds. This example assumes that the RAC framework component is represented by an instance of the SUNW.rac_framework resource type that is named rac-framework-rs.

SPARC: Guidelines for Setting the Communications Port Range for the Oracle UDLM

An application other than the Oracle UDLM on a cluster node might use a range of communications ports that conflicts with the range for the Oracle UDLM. If such a conflict occurs, modify the range of communications ports that the Oracle UDLM uses.

The range of communications ports that the Oracle UDLM uses is determined by the values of the following extension properties of the SUNW.rac_udlm resource type:

Port. Specifies the communications port number that the Oracle UDLM uses. The first number in the range of communications port numbers that the Oracle UDLM uses is the value of Port.
Num_ports. Specifies the number of communications ports that the Oracle UDLM uses. The last number in the range of communications port numbers that the Oracle UDLM uses is the sum of the values of Port and Num_ports.

For more information, see SPARC: SUNW.rac_udlm Extension Properties.
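The relationship between Port, Num_ports, and the resulting range can be checked with simple arithmetic before you change either property. The following sketch is illustrative only. The function name udlm_port_range and the value 32 for Num_ports are assumptions chosen for this example, not values taken from the product.

```shell
# Hypothetical helper: print the first and last port numbers of the
# Oracle UDLM range, using the rule stated above
# (first = Port, last = Port + Num_ports).
udlm_port_range() {
    port=$1
    num_ports=$2
    echo "$port $((port + num_ports))"
}

# Example values only: Port=7000, Num_ports=32
udlm_port_range 7000 32
```

Compare the printed range with the ports that other applications on the node use to confirm that the new range does not introduce a conflict.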

Example 4–3 Setting the Communications Port Number for the Oracle UDLM

# clresource set -p port=7000 rac-udlm-rs

This example sets the communications port number that the Oracle UDLM uses to 7000. The following assumptions apply to this example:

The Oracle UDLM component is represented by an instance of the SUNW.rac_udlm resource type that is named rac-udlm-rs.
The command in this example is run as part of the procedure for modifying an extension property that is tunable only when disabled. For more information, see How to Modify an Extension Property That Is Tunable Only When a Resource Is Disabled.

How to Modify an Extension Property That Is Tunable Only When a Resource Is Disabled

Restrictions apply to the circumstances in which you can modify an extension property that is tunable only when a resource is disabled. Those circumstances depend on the resource type as follows:

SUNW.rac_udlm – Only when the Oracle UDLM is not running on any cluster node
SUNW.rac_cvm – Only when VxVM is not running in cluster mode on any cluster node

This procedure provides the long forms of the Sun Cluster maintenance commands. Most commands also have short forms. Except for the forms of the command names, the commands are identical. For a list of the commands and their short forms, see Appendix A, Sun Cluster Object-Oriented Commands, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

Disable each resource that the RAC framework resource group contains and bring the RAC framework resource group into the UNMANAGED state.

Disable the instance of the SUNW.rac_framework resource only after you have disabled all other resources that the RAC framework resource group contains. The other resources in the RAC framework resource group depend on the SUNW.rac_framework resource.

For detailed instructions, see Disabling Resources and Moving Their Resource Group Into the UNMANAGED State in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

Reboot all the nodes that are in the node list of the RAC framework resource group.

Use the clresource command to set the property to its new value.
# clresource set -p property=value resource
property

Specifies the name of the property that you are changing.

value

The new value of the property.

resource

Specifies the name of the resource for which you are modifying an extension property. If this resource was created by using the clsetup utility, the name depends on the resource type, as shown in Automatically Generated Names for Sun Cluster Objects.

Bring the RAC framework resource group and its resources online.
# clresourcegroup online -emM resource-group
-emM

Specifies that the resources and their monitors are to be enabled and that the resource group is to be moved to the MANAGED state.

resource-group

Specifies the name of the RAC framework resource group that is to be moved to the MANAGED state and brought online. If this resource group was created by using the clsetup utility, the name of the resource group is rac-framework-rg.
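The procedure above can be sketched as two shell functions, one for the steps before the reboot and one for the steps after it. This is a minimal sketch under stated assumptions, not a supported script. The function names are hypothetical, the use of clresourcegroup offline and clresourcegroup unmanage to reach the UNMANAGED state is an assumption based on the general Sun Cluster procedure that this section refers to, and the -emM options are assumed to be required to re-enable the resources and return the group to the MANAGED state.

```shell
# Phase 1: run before rebooting the nodes.
#   $1 = RAC framework resource group
#   $2 = the SUNW.rac_framework resource
#   remaining arguments = all other resources in the group
prepare_rac_framework_change() {
    rg=$1
    framework_rs=$2
    shift 2
    # Disable the dependent resources first and the
    # SUNW.rac_framework resource last.
    for rs in "$@"; do
        clresource disable "$rs" || return 1
    done
    clresource disable "$framework_rs" || return 1
    # Assumed commands for moving the group to the UNMANAGED state.
    clresourcegroup offline "$rg" || return 1
    clresourcegroup unmanage "$rg" || return 1
}

# Phase 2: run after all nodes in the node list have been rebooted.
#   $1 = resource group, $2 = resource, $3 = property, $4 = new value
apply_rac_framework_change() {
    clresource set -p "$3=$4" "$2" || return 1
    clresourcegroup online -emM "$1"
}

# Example invocation (names follow the clsetup defaults listed earlier):
#   prepare_rac_framework_change rac-framework-rg rac-framework-rs rac-udlm-rs
#   ... reboot all nodes in the node list of the resource group ...
#   apply_rac_framework_change rac-framework-rg rac-udlm-rs port 7000
```

Splitting the work into two functions reflects the reboot in the middle of the procedure, which cannot be performed from within a single script run.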

Tuning the Sun Cluster Support for Oracle RAC Fault Monitors

Fault monitoring for the Sun Cluster Support for Oracle RAC data service is provided by fault monitors for the following resources:

Scalable device group resource
Scalable file-system mount-point resource
Oracle 9i RAC server resource
Oracle 9i RAC listener resource

Each fault monitor is contained in a resource whose resource type is shown in the following table.

Table 4–4 Resource Types for Sun Cluster Support for Oracle RAC Fault Monitors


Fault Monitor	Resource Type
Scalable device group	`SUNW.ScalDeviceGroup`
Scalable file-system mount point	`SUNW.ScalMountPoint`
Oracle 9i RAC server	`SUNW.scalable_rac_server`
Oracle 9i RAC listener	`SUNW.scalable_rac_listener`

System properties and extension properties of these resources control the behavior of the fault monitors. The default values of these properties determine the preset behavior of the fault monitors. The preset behavior should be suitable for most Sun Cluster installations. Therefore, you should tune the Sun Cluster Support for Oracle RAC fault monitors only if you need to modify this preset behavior.

Tuning the Sun Cluster Support for Oracle RAC fault monitors involves the following tasks:

Setting the interval between fault monitor probes
Setting the timeout for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource

For more information, see Tuning Fault Monitors for Sun Cluster Data Services in Sun Cluster Data Services Planning and Administration Guide for Solaris OS. Information about the Sun Cluster Support for Oracle RAC fault monitors that you need to perform these tasks is provided in the subsections that follow.

Operation of the Fault Monitor for a Scalable Device Group

By default, the fault monitor monitors all logical volumes in the device group that the resource represents. If you require only a subset of the logical volumes in a device group to be monitored, set the LogicalDeviceList extension property.

The status of the device group is derived from the statuses of the individual logical volumes that are monitored. If all monitored logical volumes are healthy, the device group is healthy. If any monitored logical volume is faulty, the device group is faulty. If a device group is discovered to be faulty, monitoring of the resource that represents the group is stopped and the resource is put into the disabled state.

The status of an individual logical volume is obtained by querying the volume's volume manager. If the status of a Solaris Volume Manager for Sun Cluster volume cannot be determined from a query, the fault monitor performs file input/output (I/O) operations to determine the status.

Note –

For mirrored disks, if one submirror is faulty, the device group is still considered to be healthy.

If reconfiguration of userland cluster membership causes an I/O error, the monitoring of device group resources by fault monitors is suspended while userland cluster membership monitor (UCMM) reconfigurations are in progress.
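The rule that derives the status of a device group from the statuses of its monitored logical volumes can be expressed as a short function. The following sketch only illustrates that rule. The function name and the status strings healthy and faulty are hypothetical and do not reflect the fault monitor's internal representation.

```shell
# Hypothetical helper: derive a device group status from the statuses of
# its monitored logical volumes. Each argument is "healthy" or "faulty".
# The group is faulty as soon as any monitored volume is not healthy.
# With no arguments, the function prints "healthy".
device_group_status() {
    for volume_status in "$@"; do
        if [ "$volume_status" != "healthy" ]; then
            echo faulty
            return 0
        fi
    done
    echo healthy
}

device_group_status healthy healthy healthy
device_group_status healthy faulty healthy
```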

Operation of the Fault Monitor for Scalable File-System Mount Points

To determine if the mounted file system is available, the fault monitor performs I/O operations such as opening, reading, and writing to a test file on the file system. If an I/O operation is not completed within the timeout period, the fault monitor reports an error. To specify the timeout for I/O operations, set the IOTimeout extension property.

The response to an error depends on the type of the file system, as follows:

If the file system is an NFS file system on a Network Appliance NAS device, the response is as follows:
- Monitoring of the resource is stopped on the current node.
- The resource is placed into the disabled state on the current node, causing the file system to be unmounted from that node.
If the file system is a Sun StorEdge™ QFS shared file system, the response is as follows:
- If the node on which the error occurred is hosting the metadata server resource, the metadata server resource is failed over to another node.
- The file system is unmounted.
If the failover attempt fails, the file system remains unmounted and a warning is given.
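The kind of I/O probe that this section describes (open, write, and read a test file on the file system) can be approximated as follows. This sketch is not the fault monitor's implementation. The function name and the test file name are hypothetical, and no timeout is enforced, whereas the real fault monitor bounds each operation by the value of IOTimeout.

```shell
# Illustrative probe: write a test file, read it back, and remove it.
# Returns 0 if every step succeeds and 1 otherwise.
probe_mount_point() {
    mount_point=$1
    test_file="$mount_point/.probe_test.$$"    # hypothetical file name
    token="probe-$$"

    # Open and write.
    { echo "$token" > "$test_file"; } 2>/dev/null || return 1

    # Read back; clean up even if the read fails.
    read_back=$(cat "$test_file" 2>/dev/null) || { rm -f "$test_file"; return 1; }
    rm -f "$test_file"

    # The probe succeeds only if the data that was read matches the data written.
    [ "$read_back" = "$token" ]
}
```

A caller would treat a nonzero return value in the same way that the fault monitor treats an I/O error.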

Operation of the Oracle 9i RAC Server Fault Monitor

The fault monitor for the Oracle 9i RAC server uses a request to the server to query the health of the server.

The server fault monitor is started through pmfadm to make the monitor highly available. If the monitor is killed for any reason, the Process Monitor Facility (PMF) automatically restarts the monitor.

The server fault monitor consists of the following processes:

A main fault monitor process
A database client fault probe

Operation of the Main Fault Monitor

The main fault monitor determines that an operation is successful if the database is online and no errors are returned during the transaction.

Operation of the Database Client Fault Probe

The database client fault probe performs the following operations:

Monitoring the partition for archived redo logs
If the partition is healthy, determining whether the database is operational

The probe uses the timeout value that is set in the resource property Probe_timeout to determine how much time to allocate to successfully probe Oracle.

Operations to Monitor the Partition for Archived Redo Logs

The database client fault probe queries the dynamic performance view v$archive_dest to determine all possible destinations for archived redo logs. For every active destination, the probe determines whether the destination is healthy and has sufficient free space for storing archived redo logs.

If the destination is healthy, the probe determines the amount of free space in the destination's file system. If the amount of free space is less than 10% of the file system's capacity and is less than 20 Mbytes, the probe prints a message to syslog.
If the destination is in ERROR status, the probe prints a message to syslog and disables operations to determine whether the database is operational. The operations remain disabled until the error condition is cleared.
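The free-space condition for a healthy destination combines two thresholds, and both must be met before a message is logged. The following sketch shows that condition. The function name is hypothetical and sizes are assumed to be given in Kbytes, for example as reported by df -k.

```shell
# Hypothetical helper: return 0 when a message would be logged for an
# archive destination, using the rule stated above: free space is
# below 10% of capacity AND below 20 Mbytes. Sizes are in Kbytes.
archive_dest_space_low() {
    capacity_kb=$1
    free_kb=$2
    # free < 10% of capacity is equivalent to free * 10 < capacity.
    [ $((free_kb * 10)) -lt "$capacity_kb" ] && [ "$free_kb" -lt $((20 * 1024)) ]
}

# 100-Mbyte file system with 5 Mbytes free: both thresholds are crossed.
if archive_dest_space_low 102400 5120; then
    echo "low free space in archive destination"
fi
```

Because both conditions must hold, a large file system with less than 10% free but more than 20 Mbytes free does not trigger the message.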

Operations to Determine Whether the Database Is Operational

If the partition for archived redo logs is healthy, the database client fault probe queries the dynamic performance view v$sysstat to obtain database performance statistics. Changes to these statistics indicate that the database is operational. If these statistics remain unchanged between consecutive queries, the fault probe performs database transactions to determine if the database is operational. These transactions involve the creation, updating, and dropping of a table in the user table space.

The database client fault probe performs all its transactions as the Oracle user. The ID of this user is specified during the preparation of the nodes or zones as explained in How to Create the DBA Group and the DBA User Accounts.

Actions by the Server Fault Monitor in Response to a Database Transaction Failure

If a database transaction fails, the server fault monitor performs an action that is determined by the error that caused the failure. To change the action that the server fault monitor performs, customize the server fault monitor as explained in Customizing the Oracle 9i RAC Server Fault Monitor.

If the action requires an external program to be run, the program is run as a separate process in the background.

Possible actions are as follows:

Ignore. The server fault monitor ignores the error.
Stop monitoring. The server fault monitor is stopped without shutting down the database.
Restart. The server fault monitor stops and restarts the Oracle 9i RAC server resource.

Scanning of Logged Alerts by the Server Fault Monitor

The Oracle software logs alerts in an alert log file. The absolute path of this file is specified by the alert_log_file extension property of the SUNW.scalable_rac_server resource. The server fault monitor scans the alert log file for new alerts at the following times:

When the server fault monitor is started
Each time that the server fault monitor queries the health of the server

If an action is defined for a logged alert that the server fault monitor detects, the server fault monitor performs the action in response to the alert.

Preset actions for logged alerts are listed in Table B–2. To change the action that the server fault monitor performs, customize the server fault monitor as explained in Customizing the Oracle 9i RAC Server Fault Monitor.

Operation of the Oracle 9i RAC Listener Fault Monitor

The Oracle 9i RAC listener fault monitor checks the status of an Oracle listener.

If the listener is running, the Oracle 9i RAC listener fault monitor considers a probe successful. If the fault monitor detects an error, the listener is restarted.

Note –

The listener resource does not provide a mechanism for setting the listener password. If Oracle listener security is enabled, a probe by the listener fault monitor might return Oracle error TNS-01169. Because the listener is able to respond, the listener fault monitor treats the probe as a success. This action does not cause a failure of the listener to remain undetected. A failure of the listener returns a different error, or causes the probe to time out.

The listener probe is started through pmfadm to make the probe highly available. If the probe is killed, PMF automatically restarts the probe.

If a problem occurs with the listener during a probe, the probe tries to restart the listener. The value that is set for the resource property retry_count determines the maximum number of times that the probe attempts the restart. If, after trying for the maximum number of times, the probe is still unsuccessful, the probe stops the fault monitor.
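The bounded restart behavior described above can be pictured as a simple loop. The following is a sketch of the idea only. restart_listener and probe_listener are hypothetical commands that the caller must supply, and the real probe might also take other resource properties into account.

```shell
# Illustrative only: try to restart the listener at most retry_count
# times. Returns 0 as soon as a restart is followed by a successful
# probe, or 1 when the maximum number of attempts is reached.
restart_with_retries() {
    retry_count=$1
    attempt=0
    while [ "$attempt" -lt "$retry_count" ]; do
        attempt=$((attempt + 1))
        if restart_listener && probe_listener; then
            return 0    # listener is responding again
        fi
    done
    return 1            # give up; the caller stops the fault monitor
}
```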

Obtaining Core Files for Troubleshooting DBMS Timeouts

To facilitate troubleshooting of unexplained DBMS timeouts, you can enable the fault monitor to create a core file when a probe timeout occurs. The contents of the core file relate to the fault monitor process. The fault monitor creates the core file in the / directory. To enable the fault monitor to create a core file, use the coreadm command to enable set-id core dumps. For more information, see the coreadm(1M) man page.

Customizing the Oracle 9i RAC Server Fault Monitor

Customizing the Oracle 9i RAC server fault monitor enables you to modify the behavior of the server fault monitor as follows:

Overriding the preset action for an error
Specifying an action for an error for which no action is preset

Caution –

Before you customize the Oracle 9i RAC server fault monitor, consider the effects of your customizations, especially if you change an action from restart to ignore or stop monitoring. If errors remain uncorrected for long periods, the errors might cause problems with the database. If you encounter problems with the database after customizing the Oracle 9i RAC server fault monitor, revert to using the preset actions. Reverting to the preset actions enables you to determine if the problem is caused by your customizations.

Customizing the Oracle 9i RAC server fault monitor involves the following activities:

Defining custom behavior for errors
Propagating a custom action file to all nodes in a cluster
Specifying the custom action file that a server fault monitor should use

Defining Custom Behavior for Errors

The Oracle 9i RAC server fault monitor detects the following types of errors:

DBMS errors that occur during a probe of the database by the server fault monitor
Alerts that Oracle logs in the alert log file
Timeouts that result from a failure to receive a response within the time that is set by the Probe_timeout extension property

To define custom behavior for these types of errors, create a custom action file.

Custom Action File Format

A custom action file is a plain text file. The file contains one or more entries that define the custom behavior of the Oracle 9i RAC server fault monitor. Each entry defines the custom behavior for a single DBMS error, a single timeout error, or several logged alerts. A maximum of 1024 entries is allowed in a custom action file.

Note –

Each entry in a custom action file overrides the preset action for an error, or specifies an action for an error for which no action is preset. Create entries in a custom action file only for the preset actions that you are overriding or for errors for which no action is preset. Do not create entries for actions that you are not changing.

An entry in a custom action file consists of a sequence of keyword-value pairs that are separated by semicolons. Each entry is enclosed in braces.

The format of an entry in a custom action file is as follows:

{
[ERROR_TYPE=DBMS_ERROR|SCAN_LOG|TIMEOUT_ERROR;]
ERROR=error-spec; 
[ACTION=RESTART|STOP|NONE;]
[CONNECTION_STATE=co|di|on|*;]
[NEW_STATE=co|di|on|*;]
[MESSAGE="message-string"]
}

White space may be used between separated keyword-value pairs and between entries to format the file.

The meaning and permitted values of the keywords in a custom action file are as follows:

ERROR_TYPE

Indicates the type of the error that the server fault monitor has detected. The following values are permitted for this keyword:

DBMS_ERROR: Specifies that the error is a DBMS error.
SCAN_LOG: Specifies that the error is an alert that is logged in the alert log file.
TIMEOUT_ERROR: Specifies that the error is a timeout.

The ERROR_TYPE keyword is optional. If you omit this keyword, the error is assumed to be a DBMS error.

ERROR

Identifies the error. The data type and the meaning of error-spec are determined by the value of the ERROR_TYPE keyword as shown in the following table.

`ERROR_TYPE`	Data Type	Meaning
`DBMS_ERROR`	Integer	The error number of a DBMS error that is generated by Oracle
`SCAN_LOG`	Quoted regular expression	A string in an error message that Oracle has logged to the Oracle alert log file
`TIMEOUT_ERROR`	Integer	The number of consecutive timed-out probes since the server fault monitor was last started or restarted

You must specify the ERROR keyword. If you omit this keyword, the entry in the custom action file is ignored.

ACTION

Specifies the action that the server fault monitor is to perform in response to the error. The following values are permitted for this keyword:

NONE: Specifies that the server fault monitor ignores the error.
STOP: Specifies that the server fault monitor is stopped.
RESTART: Specifies that the server fault monitor stops and restarts the Oracle 9i RAC server resource.

The ACTION keyword is optional. If you omit this keyword, the server fault monitor ignores the error.

CONNECTION_STATE

Specifies the required state of the connection between the database and the server fault monitor when the error is detected. The entry applies only if the connection is in the required state when the error is detected. The following values are permitted for this keyword:

*: Specifies that the entry always applies, regardless of the state of the connection.
co: Specifies that the entry applies only if the server fault monitor is attempting to connect to the database.
on: Specifies that the entry applies only if the server fault monitor is online. The server fault monitor is online if it is connected to the database.
di: Specifies that the entry applies only if the server fault monitor is disconnecting from the database.

The CONNECTION_STATE keyword is optional. If you omit this keyword, the entry always applies, regardless of the state of the connection.

NEW_STATE

Specifies the state of the connection between the database and the server fault monitor that the server fault monitor must attain after the error is detected. The following values are permitted for this keyword:

*: Specifies that the state of the connection must remain unchanged.
co: Specifies that the server fault monitor must disconnect from the database and reconnect immediately to the database.
di: Specifies that the server fault monitor must disconnect from the database. The server fault monitor reconnects when it next probes the database.

The NEW_STATE keyword is optional. If you omit this keyword, the state of the database connection remains unchanged after the error is detected.

MESSAGE

Specifies an additional message that is printed to the resource's log file when this error is detected. The message must be enclosed in double quotes. This message is additional to the standard message that is defined for the error.

The MESSAGE keyword is optional. If you omit this keyword, no additional message is printed to the resource's log file when this error is detected.
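The keywords described above can be combined as in the following sketch, which writes a sample custom action file and then applies two basic checks. Everything specific in this sketch is hypothetical: the file path, the regular expression, the timeout count, the messages, and the function name check_action_file. The checks also assume a layout with one keyword-value pair per line and each opening brace at the start of a line.

```shell
# Write a sample custom action file (hypothetical path and contents).
action_file=${ACTION_FILE:-./custom_action_file.sample}

cat > "$action_file" <<'EOF'
{
ERROR_TYPE=SCAN_LOG;
ERROR="ORA-00600: internal error";
ACTION=NONE;
MESSAGE="Internal error found in alert log.";
}
{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=2;
ACTION=RESTART;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Second consecutive probe timeout.";
}
EOF

# Basic sanity checks taken from the rules above: the file contains
# between 1 and 1024 entries, and every entry has an ERROR keyword.
check_action_file() {
    entries=$(grep -c '^{' "$1")
    errors=$(grep -c '^ERROR=' "$1")
    [ "$entries" -ge 1 ] && [ "$entries" -le 1024 ] && [ "$errors" -eq "$entries" ]
}

check_action_file "$action_file" && echo "$action_file: basic checks passed"
```

The first entry records an additional message for a logged alert without taking action. The second entry restarts the Oracle 9i RAC server resource after two consecutive timed-out probes. Neither entry is a recommendation; choose actions only after considering their effect on your database.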

Changing the Response to a DBMS Error

The action that the server fault monitor performs in response to each DBMS error is preset as listed in Table B–1. To decide whether to change the response to a DBMS error, consider the effect of that error on your database and whether the preset action is appropriate. For examples, see the subsections that follow.

To change the response to a DBMS error, create an entry in a custom action file in which the keywords are set as follows:

ERROR_TYPE is set to DBMS_ERROR.
ERROR is set to the error number of the DBMS error.
ACTION is set to the action that you require.

Responding to an Error Whose Effects Are Major

If an error that the server fault monitor ignores affects more than one session, action by the server fault monitor might be required to prevent a loss of service.

For example, no action is preset for Oracle error 4031: unable to allocate num-bytes bytes of shared memory. However, this Oracle error indicates that the shared global area (SGA) has insufficient memory, is badly fragmented, or both. If this error affects only a single session, ignoring the error might be appropriate. However, if this error affects more than one session, consider specifying that the server fault monitor restart the database.

The following example shows an entry in a custom action file for changing the response to a DBMS error to restart.

Example 4–4 Changing the Response to a DBMS Error to Restart

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4031; 
ACTION=restart;
CONNECTION_STATE=*; 
NEW_STATE=*;
MESSAGE="Insufficient memory in shared pool.";
}

This example shows an entry in a custom action file that overrides the preset action for DBMS error 4031. This entry specifies the following behavior:

In response to DBMS error 4031, the action that the server fault monitor performs is restart.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
The following message is printed to the resource's log file when this error is detected:
Insufficient memory in shared pool.

Ignoring an Error Whose Effects Are Minor

If the effects of an error to which the server fault monitor responds are minor, ignoring the error might be less disruptive than responding to the error.

For example, the preset action for Oracle error 4030: out of process memory when trying to allocate num-bytes bytes is restart. This Oracle error indicates that the server fault monitor could not allocate private heap memory. One possible cause of this error is that insufficient memory is available to the operating system. If this error affects more than one session, restarting the database might be appropriate. However, this error might not affect other sessions because these sessions do not require further private memory. In this situation, consider specifying that the server fault monitor ignore the error.

The following example shows an entry in a custom action file for ignoring a DBMS error.

Example 4–5 Ignoring a DBMS Error

{
ERROR_TYPE=DBMS_ERROR;
ERROR=4030;
ACTION=none;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="";
}

This example shows an entry in a custom action file that overrides the preset action for DBMS error 4030. This entry specifies the following behavior:

The server fault monitor ignores DBMS error 4030.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
No additional message is printed to the resource's log file when this error is detected.

Changing the Response to Logged Alerts

The Oracle software logs alerts in a file that is identified by the alert_log_file extension property. The server fault monitor scans this file and performs actions in response to alerts for which an action is defined.

Logged alerts for which an action is preset are listed in Table B–2. Change the response to logged alerts to change the preset action, or to define new alerts to which the server fault monitor responds.

To change the response to logged alerts, create an entry in a custom action file in which the keywords are set as follows:

ERROR_TYPE is set to SCAN_LOG.
ERROR is set to a quoted regular expression that identifies a string in an error message that Oracle has logged to the Oracle alert log file.
ACTION is set to the action that you require.

The server fault monitor processes the entries in a custom action file in the order in which the entries occur. Only the first entry that matches a logged alert is processed. Later entries that match are ignored. If you are using regular expressions to specify actions for several logged alerts, ensure that more specific entries occur before more general entries. Specific entries that occur after general entries might be ignored.

For example, a custom action file might define different actions for errors that are identified by the regular expressions ORA-65 and ORA-6. To ensure that the entry that contains the regular expression ORA-65 is not ignored, ensure that this entry occurs before the entry that contains the regular expression ORA-6.
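
The effect of entry order can be illustrated with a minimal sketch. The following Python fragment is not the server fault monitor's implementation: Python's re module stands in for the monitor's regular-expression matching, which may differ, and the function name, the sample alert line, and the actions paired with each pattern are hypothetical.

```python
import re

# Illustrative only: models "first matching entry wins" for SCAN_LOG entries.
# Python's re module stands in for the monitor's regular-expression engine,
# which may differ. The actions paired with each pattern are hypothetical.


def first_matching_action(entries, alert_line):
    """Return the action of the first entry whose pattern matches, else None."""
    for pattern, action in entries:
        if re.search(pattern, alert_line):
            return action
    return None


alert = "ORA-65: example alert text"  # hypothetical log line

specific_first = [("ORA-65", "RESTART"), ("ORA-6", "NONE")]
general_first = [("ORA-6", "NONE"), ("ORA-65", "RESTART")]

print(first_matching_action(specific_first, alert))  # RESTART
print(first_matching_action(general_first, alert))   # NONE
```

With the general entry first, the ORA-6 pattern also matches the ORA-65 alert, so the more specific entry is never reached.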

The following example shows an entry in a custom action file for changing the response to a logged alert.

Example 4–6 Changing the Response to a Logged Alert

{
ERROR_TYPE=SCAN_LOG;
ERROR="ORA-00600: internal error";
ACTION=RESTART;
}

This example shows an entry in a custom action file that overrides the preset action for logged alerts about internal errors. This entry specifies the following behavior:

In response to logged alerts that contain the text ORA-00600: internal error, the action that the server fault monitor performs is restart.
This entry applies regardless of the state of the connection between the database and the server fault monitor when the error is detected.
The state of the connection between the database and the server fault monitor must remain unchanged after the error is detected.
No additional message is printed to the resource's log file when this error is detected.

Changing the Maximum Number of Consecutive Timed-Out Probes

By default, the server fault monitor restarts the database after the second consecutive timed-out probe. If the database is lightly loaded, two consecutive timed-out probes should be sufficient to indicate that the database is hanging. However, during periods of heavy load, a server fault monitor probe might time out even if the database is functioning correctly. To prevent the server fault monitor from restarting the database unnecessarily, increase the maximum number of consecutive timed-out probes.

Caution –

Increasing the maximum number of consecutive timed-out probes increases the time that is required to detect that the database is hanging.

To change the maximum number of consecutive timed-out probes allowed, create one entry in a custom action file for each consecutive timed-out probe that is allowed except the first timed-out probe.

Note –

You are not required to create an entry for the first timed-out probe. The action that the server fault monitor performs in response to the first timed-out probe is preset.

For the last allowed timed-out probe, create an entry in which the keywords are set as follows:

ERROR_TYPE is set to TIMEOUT_ERROR.
ERROR is set to the maximum number of consecutive timed-out probes that are allowed.
ACTION is set to RESTART.

For each remaining consecutive timed-out probe except the first timed-out probe, create an entry in which the keywords are set as follows:

ERROR_TYPE is set to TIMEOUT_ERROR.
ERROR is set to the sequence number of the timed-out probe. For example, for the second consecutive timed-out probe, set this keyword to 2. For the third consecutive timed-out probe, set this keyword to 3.
ACTION is set to NONE.

Tip –

To facilitate debugging, specify a message that indicates the sequence number of the timed-out probe.

The following example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five.

Example 4–7 Changing the Maximum Number of Consecutive Timed-Out Probes

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=2;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #2 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=3;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #3 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=4;
ACTION=NONE;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #4 has occurred.";
}

{
ERROR_TYPE=TIMEOUT_ERROR;
ERROR=5;
ACTION=RESTART;
CONNECTION_STATE=*;
NEW_STATE=*;
MESSAGE="Timeout #5 has occurred. Restarting.";
}

This example shows the entries in a custom action file for increasing the maximum number of consecutive timed-out probes to five. These entries specify the following behavior:

The server fault monitor ignores the second consecutive timed-out probe through the fourth consecutive timed-out probe.
In response to the fifth consecutive timed-out probe, the action that the server fault monitor performs is restart.
The entries apply regardless of the state of the connection between the database and the server fault monitor when the timeout occurs.
The state of the connection between the database and the server fault monitor must remain unchanged after the timeout occurs.
When each of the second through fourth consecutive timed-out probes occurs, a message of the following form is printed to the resource's log file:
Timeout #number has occurred.
When the fifth consecutive timed-out probe occurs, the following message is printed to the resource's log file:
Timeout #5 has occurred. Restarting.
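
For larger limits, writing one entry per timed-out probe by hand is repetitive. The following Python fragment is a minimal sketch, not a Sun Cluster tool, that generates the entries for a chosen maximum. The function name timeout_entries is an assumption. The ERROR_TYPE value TIMEOUT_ERROR follows the keyword settings stated earlier in this section, and the message wording follows the tip about including the sequence number.

```python
# Illustrative only: generates custom action file entries for a chosen
# maximum number of consecutive timed-out probes. The function name is an
# assumption; the keyword values follow the rules described in this section.


def timeout_entries(max_timeouts, error_type="TIMEOUT_ERROR"):
    """Return the text of entries for probes 2 through max_timeouts."""
    if max_timeouts < 2:
        raise ValueError("max_timeouts must be at least 2")
    blocks = []
    for n in range(2, max_timeouts + 1):
        last = n == max_timeouts
        # Only the last allowed timed-out probe triggers a restart.
        action = "RESTART" if last else "NONE"
        message = f"Timeout #{n} has occurred."
        if last:
            message += " Restarting."
        blocks.append(
            "{\n"
            f"ERROR_TYPE={error_type};\n"
            f"ERROR={n};\n"
            f"ACTION={action};\n"
            "CONNECTION_STATE=*;\n"
            "NEW_STATE=*;\n"
            f'MESSAGE="{message}";\n'
            "}"
        )
    return "\n\n".join(blocks)


print(timeout_entries(5))
```

No entry is generated for the first timed-out probe because its action is preset. Review the generated text before adding it to a custom action file.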

Propagating a Custom Action File to All Nodes in a Cluster

A server fault monitor must behave consistently on all cluster nodes. Therefore, the custom action file that the server fault monitor uses must be identical on all cluster nodes. After creating or modifying a custom action file, ensure that this file is identical on all cluster nodes by propagating the file to all cluster nodes. To propagate the file to all cluster nodes, use the method that is most appropriate for your cluster configuration:

Locating the file on a file system that all nodes share
Locating the file on a highly available local file system
Copying the file to the local file system of each cluster node by using operating system commands such as the rcp(1) command or the rdist(1) command
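
If you copy the file to each node, you might want to confirm that the copies are identical. The following Python fragment is a minimal sketch of one way to do so by comparing SHA-256 digests. It is not part of Sun Cluster, the function names are assumptions, and the temporary files in the demonstration merely stand in for per-node copies that you have made reachable from one host.

```python
import hashlib
import os
import tempfile

# Illustrative only: checks that copies of a custom action file are
# byte-for-byte identical. How the per-node copies are made reachable from
# one host is left to the administrator.


def file_digest(path):
    """Return the SHA-256 hex digest of the file at path."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def copies_identical(paths):
    """Return True if every file in paths has the same contents."""
    digests = {file_digest(p) for p in paths}
    return len(digests) <= 1


# Demonstration with temporary files standing in for per-node copies.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "node1": b"ACTION=NONE;\n",
        "node2": b"ACTION=NONE;\n",
        "node3": b"ACTION=RESTART;\n",
    }
    paths = {}
    for name, data in contents.items():
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as f:
            f.write(data)
    two_match = copies_identical([paths["node1"], paths["node2"]])
    all_match = copies_identical(list(paths.values()))

print(two_match, all_match)
```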

Specifying the Custom Action File That a Server Fault Monitor Should Use

To apply customized actions to a server fault monitor, you must specify the custom action file that the fault monitor should use. Customized actions are applied to a server fault monitor when the server fault monitor reads a custom action file. A server fault monitor reads a custom action file when you specify the file.

Specifying a custom action file also validates the file. If the file contains syntax errors, an error message is displayed. Therefore, after modifying a custom action file, specify the file again to validate the file.

Caution –

If syntax errors in a modified custom action file are detected, correct the errors before the fault monitor is restarted. If the syntax errors remain uncorrected when the fault monitor is restarted, the fault monitor reads the erroneous file, ignoring entries that occur after the first syntax error.

How to Specify the Custom Action File That a Server Fault Monitor Should Use

On a cluster node, become superuser or assume a role that provides solaris.cluster.modify RBAC authorization.

Set the Custom_action_file extension property of the SUNW.scalable_rac_server resource.

Set this property to the absolute path of the custom action file.
# clresource set -p custom_action_file=filepath server-resource
-p custom_action_file=filepath

Specifies the absolute path of the custom action file.

server-resource

Specifies the SUNW.scalable_rac_server resource.
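
If you script this step, checking that the path is absolute before you run the command can catch a common mistake. The following Python fragment is a minimal sketch that only builds the argument list for the command shown above; it does not run the command. The function name, the example path, and the example resource name are hypothetical.

```python
import posixpath

# Illustrative only: builds the argument list for the clresource command
# shown in this procedure. The function name is an assumption.


def clresource_set_command(filepath, server_resource):
    """Return argv for setting custom_action_file, rejecting relative paths."""
    if not posixpath.isabs(filepath):
        raise ValueError("custom_action_file must be an absolute path")
    return [
        "clresource",
        "set",
        "-p",
        f"custom_action_file={filepath}",
        server_resource,
    ]


# Hypothetical path and resource name, for illustration only.
argv = clresource_set_command("/global/oracle/custom_actions", "rac-server-rs")
print(" ".join(argv))
```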