To minimize the disruption that transient faults in a resource cause, a fault monitor restarts the resource in response to such faults. For persistent faults, more disruptive action than restarting the resource is required:
For a failover resource, the fault monitor fails over the resource to another node.
For a scalable resource, the fault monitor takes the resource offline.
A fault monitor treats a fault as persistent if the number of complete failures of a resource exceeds a specified threshold within a specified retry interval. Defining the criteria for persistent faults enables you to set the threshold and the retry interval to accommodate the performance characteristics of your cluster and your availability requirements.
A fault monitor treats some faults as a complete failure of a resource. A complete failure typically causes a complete loss of service. The following failures are examples of a complete failure:
Unexpected termination of the process for a data service server
Inability of a fault monitor to connect to a data service server
A complete failure causes the fault monitor to increase the count of complete failures in the retry interval by 1.
A fault monitor treats other faults as a partial failure of a resource. A partial failure is less serious than a complete failure and typically causes a degradation of service, but not a complete loss of service. An example of a partial failure is an incomplete response from a data service server before a fault monitor probe times out.
A partial failure causes the fault monitor to increase the count of complete failures in the retry interval by a fractional amount. Partial failures accumulate over the retry interval, so enough partial failures within the interval can cause the count to exceed the threshold.
The following characteristics of partial failures depend on the data service:
The types of faults that the fault monitor treats as partial failure
The fractional amount that each partial failure adds to the count of complete failures
For information about faults that a data service's fault monitor detects, see the documentation for the data service.
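The counting rules above can be sketched as a sliding window of weighted failures. This is only an illustrative model, not the fault monitor's actual implementation: the class `FailureWindow`, the function `next_action`, and the weight of 0.5 used for a partial failure in the usage lines are all hypothetical, and the real fractional amount depends on the data service.

```python
from collections import deque


class FailureWindow:
    """Hypothetical model of failure counting within a retry interval."""

    def __init__(self, threshold, retry_interval):
        self.threshold = threshold            # allowed count of complete failures
        self.retry_interval = retry_interval  # window length, in seconds
        self._events = deque()                # (timestamp, weight) pairs, oldest first

    def record(self, now, weight=1.0):
        """Record a failure: weight 1.0 is complete, a fraction is partial."""
        self._events.append((now, weight))
        self._expire(now)

    def _expire(self, now):
        # Drop failures that are older than the retry interval.
        while self._events and now - self._events[0][0] > self.retry_interval:
            self._events.popleft()

    def count(self, now):
        """Return the accumulated failure count inside the retry interval."""
        self._expire(now)
        return sum(weight for _, weight in self._events)

    def is_persistent(self, now):
        # The fault is persistent once the count exceeds the threshold.
        return self.count(now) > self.threshold


def next_action(window, now, resource_kind):
    """Pick the response: restart for transient faults, otherwise escalate."""
    if not window.is_persistent(now):
        return "restart"
    return "fail over" if resource_kind == "failover" else "take offline"


# Usage with assumed values: threshold 2, retry interval 300 seconds.
window = FailureWindow(threshold=2, retry_interval=300)
window.record(0)         # complete failure
window.record(60, 0.5)   # partial failure (assumed weight)
window.record(120, 0.5)  # partial failure (assumed weight)
print(next_action(window, 120, "failover"))  # count is 2.0, not above threshold
window.record(180)       # another complete failure pushes the count to 3.0
print(next_action(window, 180, "failover"))
print(next_action(window, 180, "scalable"))
```

In this sketch, failures that fall outside the retry interval stop contributing to the count, which is why a longer retry interval makes the fault monitor more likely to treat repeated faults as persistent.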
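The values that you can choose for the retry interval and the threshold depend on the following properties of the resource:
Thorough_probe_interval system property
Probe_timeout extension property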
To ensure that you allow enough time for the threshold to be reached within the retry interval, use the following expression to calculate values for the retry interval and the threshold:
retry_interval >= 2 × threshold × (thorough_probe_interval + probe_timeout)
The factor of 2 accounts for partial probe failures that do not immediately cause the resource to be failed over or taken offline.
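As a worked example of the expression, the following sketch computes the smallest retry interval that satisfies it. The function name `min_retry_interval` and the sample values are assumptions chosen for illustration, not product defaults, and all times are assumed to be in seconds.

```python
def min_retry_interval(threshold, thorough_probe_interval, probe_timeout):
    """Smallest retry interval that satisfies
    retry_interval >= 2 * threshold * (thorough_probe_interval + probe_timeout).
    """
    return 2 * threshold * (thorough_probe_interval + probe_timeout)


# Illustrative values only: threshold 2, probe every 60 s, probe timeout 30 s.
lower_bound = min_retry_interval(threshold=2,
                                 thorough_probe_interval=60,
                                 probe_timeout=30)
print(lower_bound)  # 2 * 2 * (60 + 30) = 360
```

With these sample values, any retry interval of at least 360 seconds satisfies the expression. Raising the threshold or either probe property raises this lower bound proportionally.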
To set the threshold and the retry interval, set the following system properties of the resource: