Tuning Fault Monitors for Oracle Solaris Cluster Data Services

Each data service that is supplied with the Oracle Solaris Cluster product has a built-in fault monitor. The fault monitor performs the following functions:

Detecting the unexpected termination of processes for the data service server
Checking the health of the data service

The fault monitor is contained in the resource that represents the application for which the data service was written. You create this resource when you register and configure the data service. For more information, see the documentation for the data service.

Standard properties and extension properties of this resource control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. The preset behavior should be suitable for most Oracle Solaris Cluster installations. Therefore, you should tune a fault monitor only if you need to modify this preset behavior.

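Tuning a fault monitor involves the following tasks:

Setting the interval between fault monitor probes
Setting the timeout for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource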

Perform these tasks when you register and configure the data service. For more information, see the documentation for the data service.

Note - A resource's fault monitor is started when you bring online the resource group that contains the resource. You do not need to start the fault monitor explicitly.

Setting the Interval Between Fault Monitor Probes

To determine whether a resource is operating correctly, the fault monitor probes this resource periodically. The interval between fault monitor probes affects the availability of the resource and the performance of your system as follows:

The interval between fault monitor probes affects the length of time that is required to detect a fault and respond to the fault. Therefore, if you decrease the interval between fault monitor probes, the time that is required to detect a fault and respond to the fault is also decreased. This decrease enhances the availability of the resource.
Each fault monitor probe consumes system resources such as processor cycles and memory. Therefore, if you decrease the interval between fault monitor probes, the performance of the system is degraded.

The optimum interval between fault monitor probes also depends on the time that is required to respond to a fault in the resource. This time depends on the complexity of the resource, which affects how long operations such as restarting the resource take.

To set the interval between fault monitor probes, set the Thorough_probe_interval standard property of the resource to the interval in seconds that you require.

Setting the Timeout for Fault Monitor Probes

The timeout for fault monitor probes specifies the length of time that a fault monitor waits for a response from a resource to a probe. If the fault monitor does not receive a response within this timeout, the fault monitor treats the resource as faulty. The time that a resource requires to respond to a fault monitor probe depends on the operations that the fault monitor performs to probe the resource. For information about operations that a data service's fault monitor performs to probe a resource, see the documentation for the data service.

The time that is required for a resource to respond also depends on factors that are unrelated to the fault monitor or the application, for example:

System configuration
Cluster configuration
System load
Amount of network traffic

To set the timeout for fault monitor probes, set the Probe_timeout extension property of the resource to the timeout in seconds that you require.

For fault monitor probes of most resource types, you can also configure the Timeout_threshold property to send a notification when a probe's execution time is near the timeout limit. Such notifications can help you identify probe timeouts that are set too low and might therefore cause a false failover. For more information about the Timeout_threshold property, see the r_properties(5) man page.

Defining the Criteria for Persistent Faults

To minimize the disruption that transient faults in a resource cause, a fault monitor restarts the resource in response to such faults. For persistent faults, more disruptive action than restarting the resource is required:

For a failover resource, the fault monitor fails over the resource to another node.
For a scalable resource, the fault monitor takes the resource offline.

A fault monitor treats a fault as persistent if the number of complete failures of a resource within the retry interval exceeds the retry count. The retry count is specified by the Retry_count standard property, and the retry interval is specified by the Retry_interval standard property. Defining the criteria for persistent faults enables you to set the retry count and the retry interval to accommodate the performance characteristics of your cluster and your availability requirements.

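This section describes the following topics:

Complete Failures and Partial Failures of a Resource
Dependencies of the Retry Count and the Retry Interval on Other Properties
Standard Properties for Setting the Retry Count and the Retry Interval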

Complete Failures and Partial Failures of a Resource

A fault monitor treats some faults as a complete failure of a resource. A complete failure typically causes a complete loss of service. The following failures are examples of a complete failure:

Unexpected termination of the process for a data service server
Inability of a fault monitor to connect to a data service server

A complete failure causes the fault monitor to increase by 1 the count of complete failures in the retry interval.

A fault monitor treats other faults as a partial failure of a resource. A partial failure is less serious than a complete failure and typically causes a degradation of service, but not a complete loss of service. An example of a partial failure is an incomplete response from a data service server before a fault monitor probe times out.

A partial failure causes the fault monitor to increase by a fractional amount the count of complete failures in the retry interval. Partial failures are still accumulated over the retry interval.

The following characteristics of partial failures depend on the data service:

The types of faults that the fault monitor treats as partial failure
The fractional amount that each partial failure adds to the count of complete failures

For information about faults that a data service's fault monitor detects, see the documentation for the data service.
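The accounting described above can be pictured with a small model. The following Python sketch is illustrative only: the class name FailureWindow, the method record, and the sliding-window bookkeeping are assumptions made for this example, not the actual implementation of any fault monitor. It treats a complete failure as a weight of 1, a partial failure as a fractional weight, and reports a persistent fault when the accumulated weight within the retry interval exceeds the retry count.

```python
from collections import deque


class FailureWindow:
    """Hypothetical model of failure accounting within a retry interval."""

    def __init__(self, retry_count: int, retry_interval: float):
        self.retry_count = retry_count
        self.retry_interval = retry_interval  # seconds
        self._events = deque()  # (timestamp, weight) pairs, oldest first

    def record(self, timestamp: float, weight: float = 1.0) -> bool:
        """Record a failure and return True if the fault now counts as persistent.

        weight is 1.0 for a complete failure and a fraction for a partial failure.
        The fraction is data-service specific; any value used here is an assumption.
        """
        self._events.append((timestamp, weight))
        # Discard failures that are older than the retry interval.
        while self._events and timestamp - self._events[0][0] > self.retry_interval:
            self._events.popleft()
        total = sum(w for _, w in self._events)
        return total > self.retry_count


# Illustrative values: retry count of 2, retry interval of 300 seconds.
window = FailureWindow(retry_count=2, retry_interval=300)
first = window.record(0)          # complete failure, total 1.0
second = window.record(100, 0.5)  # partial failure, total 1.5
third = window.record(200)        # complete failure, total 2.5
later = window.record(600)        # earlier failures have aged out, total 1.0
print(first, second, third, later)
```

In this model, only the third failure pushes the total above the retry count, and a failure that arrives after the earlier ones have aged out of the retry interval starts the count again.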

Dependencies of the Retry Count and the Retry Interval on Other Properties

The maximum length of time that is required for a single restart of a faulty resource is the sum of the values of the following properties:

Thorough_probe_interval standard property
Probe_timeout extension property

To ensure that you allow enough time for the retry count to be reached within the retry interval, use the following expression to calculate values for the retry interval and the retry count:

retry_interval >= 2 × retry_count × (thorough_probe_interval + probe_timeout)

The factor of 2 accounts for partial probe failures that do not immediately cause the resource to be failed over or taken offline.
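As a worked example of this expression, the following Python sketch computes the smallest retry interval that satisfies it. The function name min_retry_interval and the sample values are assumptions chosen for illustration, not recommended settings.

```python
def min_retry_interval(retry_count: int,
                       thorough_probe_interval: int,
                       probe_timeout: int) -> int:
    """Return the smallest retry interval, in seconds, that satisfies
    retry_interval >= 2 * retry_count * (thorough_probe_interval + probe_timeout).
    """
    if retry_count < 0 or thorough_probe_interval < 0 or probe_timeout < 0:
        raise ValueError("property values must be non-negative")
    return 2 * retry_count * (thorough_probe_interval + probe_timeout)


# Illustrative values only: 2 retries, 60-second probe interval, 30-second probe timeout.
print(min_retry_interval(2, 60, 30))  # 2 * 2 * (60 + 30) = 360
```

With these sample values, any retry interval of 360 seconds or more satisfies the expression.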

Standard Properties for Setting the Retry Count and the Retry Interval

To set the retry count and the retry interval, set the following standard properties of the resource:

To set the retry count, set the Retry_count standard property to the maximum allowed number of complete failures.
To set the retry interval, set the Retry_interval standard property to the interval in seconds that you require.
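The following Python sketch shows one way to assemble a command line that sets both properties at once. It assumes the clresource set command with one -p name=value option per property. The helper name clresource_set_command, the resource name example-rs, and the property values are hypothetical. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def clresource_set_command(resource: str, properties: dict) -> str:
    """Build, but do not run, a clresource command line that sets properties.

    Assumes the form: clresource set -p name=value ... resource
    """
    args = ["clresource", "set"]
    for name, value in properties.items():
        args += ["-p", f"{name}={value}"]
    args.append(resource)
    return shlex.join(args)


# Hypothetical resource name and values, for illustration only.
command = clresource_set_command(
    "example-rs", {"Retry_count": 2, "Retry_interval": 360}
)
print(command)
```

Review the generated command against the clresource man page for your release before you run it on a cluster node.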

Specifying the Failover Behavior of a Resource

The failover behavior of a resource determines how the RGM responds to the following faults:

Failure of the resource to start
Failure of the resource to stop
Failure of the resource's fault monitor to stop

To specify the failover behavior of a resource, set the Failover_mode standard property of the resource. For information about the possible values of this property, see the description of the Failover_mode standard property in the r_properties(5) man page.