Defining the Criteria for Persistent Faults

To minimize the disruption that transient faults in a resource cause, a fault monitor restarts the resource in response to such faults. For persistent faults, more disruptive action than restarting the resource is required:

For a failover resource, the fault monitor fails over the resource to another node.
For a scalable resource, the fault monitor takes the resource offline.

A fault monitor treats a fault as persistent if the number of complete failures of a resource exceeds a specified threshold within a specified retry interval. Defining the criteria for persistent faults enables you to set the threshold and the retry interval to accommodate the performance characteristics of your cluster and your availability requirements.
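The decision described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the fault monitor's actual implementation: the names action_for_fault and ResourceKind and the returned strings are hypothetical, and the count of complete failures in the retry interval is assumed to be already known.

```python
from enum import Enum


class ResourceKind(Enum):
    """Hypothetical labels for the two resource types discussed above."""
    FAILOVER = "failover"
    SCALABLE = "scalable"


def action_for_fault(failure_count: float, threshold: int, kind: ResourceKind) -> str:
    """Return the action a fault monitor would take under the stated rule.

    failure_count is the count of complete failures within the retry interval.
    """
    if failure_count <= threshold:
        # Transient fault: restarting is the least disruptive response.
        return "restart"
    # Persistent fault: the count exceeds the threshold.
    if kind is ResourceKind.FAILOVER:
        return "fail over to another node"
    return "take offline"


print(action_for_fault(2, 2, ResourceKind.FAILOVER))  # restart
print(action_for_fault(3, 2, ResourceKind.FAILOVER))  # fail over to another node
print(action_for_fault(3, 2, ResourceKind.SCALABLE))  # take offline
```

Note that the sketch treats a count equal to the threshold as still transient, because the text says the count must exceed the threshold.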

Complete Failures and Partial Failures of a Resource

A fault monitor treats some faults as a complete failure of a resource. A complete failure typically causes a complete loss of service. The following failures are examples of a complete failure:

Unexpected termination of the process for a data service server
Inability of a fault monitor to connect to a data service server

Each complete failure increases the count of complete failures in the retry interval by 1.

A fault monitor treats other faults as a partial failure of a resource. A partial failure is less serious than a complete failure and typically causes a degradation of service, but not a complete loss of service. An example of a partial failure is an incomplete response from a data service server before a fault monitor probe times out.

Each partial failure increases the count of complete failures in the retry interval by a fractional amount. Partial failures therefore still accumulate over the retry interval.

The following characteristics of partial failures depend on the data service:

The types of faults that the fault monitor treats as partial failures
The fractional amount that each partial failure adds to the count of complete failures

For information about faults that a data service's fault monitor detects, see the documentation for the data service.
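To make the accumulation concrete, the following Python sketch sums complete and partial failures over a retry interval. It is an illustration under stated assumptions: the function name failure_count, the event list, and the timestamps are hypothetical, and the weight of 0.25 for a partial failure is an arbitrary example value, because the real fraction depends on the data service.

```python
COMPLETE = 1.0
PARTIAL = 0.25  # hypothetical; the real fraction depends on the data service


def failure_count(events, now, retry_interval):
    """Sum the weights of failures that occurred within the retry interval.

    events is a list of (timestamp_seconds, weight) tuples.
    """
    window_start = now - retry_interval
    return sum(weight for timestamp, weight in events if timestamp >= window_start)


events = [
    (0, COMPLETE),    # falls outside a 300-second interval that ends at t=400
    (150, PARTIAL),
    (200, PARTIAL),
    (350, COMPLETE),
]

print(failure_count(events, now=400, retry_interval=300))  # 1.5
```

With these assumed values, two partial failures and one complete failure inside the interval give a count of 1.5, so a threshold of 1 would be exceeded even though only one complete failure occurred in that interval.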

Dependencies of the Threshold and the Retry Interval on Other Properties

The maximum length of time that is required for a single restart of a faulty resource is the sum of the values of the following properties:

Thorough_probe_interval system property
Probe_timeout extension property

To ensure that you allow enough time for the threshold to be reached within the retry interval, use the following expression to calculate values for the retry interval and the threshold:

retry_interval ≥ 2 × threshold × (thorough_probe_interval + probe_timeout)

The factor of 2 accounts for partial probe failures that do not immediately cause the resource to be failed over or taken offline.
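As a worked example of the expression above, the following Python sketch computes the smallest retry interval that satisfies it. The function names and the sample values (a threshold of 2, a thorough probe interval of 60 seconds, and a probe timeout of 30 seconds) are assumptions chosen for illustration, not recommended settings.

```python
def min_retry_interval(threshold, thorough_probe_interval, probe_timeout):
    """Smallest retry interval, in seconds, that satisfies the expression above."""
    return 2 * threshold * (thorough_probe_interval + probe_timeout)


def satisfies_expression(retry_interval, threshold, thorough_probe_interval, probe_timeout):
    """Check a proposed retry interval against the expression above."""
    return retry_interval >= min_retry_interval(
        threshold, thorough_probe_interval, probe_timeout
    )


# Assumed sample values: 2 x 2 x (60 + 30) = 360 seconds.
print(min_retry_interval(2, 60, 30))         # 360
print(satisfies_expression(300, 2, 60, 30))  # False
print(satisfies_expression(600, 2, 60, 30))  # True
```

Under these assumed values, a retry interval of 300 seconds would be too short for the threshold to be reached, while 600 seconds leaves a margin.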

System Properties for Setting the Threshold and the Retry Interval

To set the threshold and the retry interval, set the following system properties of the resource:

To set the threshold, set the Retry_count system property to the maximum allowed number of complete failures.
To set the retry interval, set the Retry_interval system property to the interval in seconds that you require.