Oracle® Solaris Cluster Data Service for Oracle Real Application Clusters Guide

Tuning the Support for Oracle RAC Fault Monitors

Fault monitoring for the Support for Oracle RAC data service is provided by fault monitors for the following resources:

  • Scalable device group resource

  • Scalable file-system mount-point resource

Each fault monitor is contained in a resource whose resource type is shown in the following table.

Table 17  Resource Types for Support for Oracle RAC Fault Monitors

Fault Monitor                        Resource Type
---------------------------------    ----------------------
Scalable device group                SUNW.ScalDeviceGroup
Scalable file-system mount point     SUNW.ScalMountPoint

Standard properties and extension properties of these resources control the behavior of the fault monitors. The default values of these properties determine the preset behavior of the fault monitors. The preset behavior should be suitable for most Oracle Solaris Cluster installations. Therefore, you should tune the Support for Oracle RAC fault monitors only if you need to modify this preset behavior.

Tuning the Support for Oracle RAC fault monitors involves the following tasks:

  • Setting the interval between fault monitor probes

  • Setting the timeout for fault monitor probes

  • Defining the criteria for persistent faults

  • Specifying the failover behavior of a resource

For more information, see Tuning Fault Monitors for Oracle Solaris Cluster Data Services in Oracle Solaris Cluster 4.3 Data Services Planning and Administration Guide. Information about the Support for Oracle RAC fault monitors that you need to perform these tasks is provided in the subsections that follow.
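
For example, commands similar to the following could tune the standard properties that control these behaviors for a scalable device group resource. The resource name rac-scal-dg-rs is a placeholder for the resource in your configuration, and the values shown are illustrative only. The timeout for fault monitor probes is typically controlled by a resource-type-specific extension property, such as the IOTimeout property that is described later for scalable file-system mount-point resources.

# clresource set -p Thorough_probe_interval=120 rac-scal-dg-rs
# clresource set -p Retry_count=2 -p Retry_interval=300 rac-scal-dg-rs
# clresource set -p Failover_mode=SOFT rac-scal-dg-rs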

Operation of the Fault Monitor for a Scalable Device Group

By default, the fault monitor monitors all logical volumes in the device group that the resource represents. If you require only a subset of the logical volumes in a device group to be monitored, set the LogicalDeviceList extension property.
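
For example, a command similar to the following might restrict monitoring to two volumes. The resource name scal-dg-rs and the volume names d100 and d200 are placeholders; see the SUNW.ScalDeviceGroup(5) man page for the exact format of the LogicalDeviceList value.

# clresource set -p LogicalDeviceList=d100,d200 scal-dg-rs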

The status of the device group is derived from the statuses of the individual logical volumes that are monitored. If all monitored logical volumes are healthy, the device group is healthy. If any monitored logical volume is faulty, the device group is faulty. If a device group is discovered to be faulty, monitoring of the resource that represents the group is stopped and the resource is put into the disabled state.

The status of an individual logical volume is obtained by querying the volume's volume manager. If the status of a Solaris Volume Manager for Sun Cluster volume cannot be determined from a query, the fault monitor performs file input/output (I/O) operations to determine the status.


Note -  For mirrored disks, if one submirror is faulty, the device group is still considered to be healthy.

Because a reconfiguration of userland cluster membership might cause an I/O error, fault monitors suspend the monitoring of device group resources while userland cluster membership monitor (UCMM) reconfigurations are in progress.

Operation of the Fault Monitor for Scalable File-System Mount Points

To determine if the mounted file system is available, the fault monitor performs I/O operations such as opening, reading, and writing to a test file on the file system. If an I/O operation is not completed within the timeout period, the fault monitor reports an error. To specify the timeout for I/O operations, set the IOTimeout extension property.
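
For example, a command similar to the following might set the I/O timeout to 120 seconds for a scalable file-system mount-point resource. The resource name scal-mp-rs is a placeholder for the resource in your configuration.

# clresource set -p IOTimeout=120 scal-mp-rs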

The response to an error depends on the type of the file system, as follows:

  • If the file system is an NFS file system on a qualified NAS device, the response is as follows:

    • Monitoring of the resource is stopped on the current cluster node.

    • The resource is placed into the disabled state on the current cluster node, causing the file system to be unmounted from that node.

  • If the file system is a StorageTek QFS shared file system, the response is as follows:

    • If the cluster node on which the error occurred is hosting the metadata server resource, the metadata server resource is failed over to another node.

    • The file system is unmounted.

    If the failover attempt fails, the file system remains unmounted and a warning is given.

Obtaining Core Files for Troubleshooting DBMS Timeouts

To facilitate troubleshooting of unexplained DBMS timeouts, you can enable the fault monitor to create a core file when a probe timeout occurs. The contents of the core file relate to the fault monitor process. The fault monitor creates the core file in the root (/) directory. To enable the fault monitor to create a core file, use the coreadm command to enable set-id core dumps.

# coreadm -g /var/cores/%f.%n.%p.core -e global -e process \
-e global-setid -e proc-setid -e log
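
To confirm the resulting configuration, you can run the coreadm command with no arguments, which displays the current core file settings.

# coreadm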

For more information, see the coreadm(1M) man page.