Sun Cluster Data Services Developer's Guide for Solaris OS

Implementing a Fault Monitor

The DSDL absorbs much of the complexity of implementing a fault monitor by providing a predetermined model. A Monitor_start method launches the fault monitor, under the control of PMF, when the resource starts on a node. The fault monitor runs in a loop for as long as the resource is running on the node. The high-level logic of a DSDL fault monitor is as follows.

The scds_fm_sleep function uses the Thorough_probe_interval property to determine the amount of time between probes. Any application process failure that PMF detects during this interval leads to a restart of the resource.
The probe itself returns a value that indicates the severity of the failure, from 0 (no failure) to 100 (complete failure).
The probe return value is sent to the scds_fm_action function, which maintains a cumulative failure history within the interval of the Retry_interval property.
The scds_fm_action function determines what to do in the event of failure, as follows.
- If the cumulative failure is below 100, do nothing.
- If the cumulative failure reaches 100 (complete failure), restart the data service. If Retry_interval is exceeded, reset the history.
- If the number of restarts exceeds the value of the Retry_count property within the time specified by Retry_interval, fail over the data service.
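
The decision rules above can be illustrated with a small stand-alone model. The following is a minimal sketch, not the DSDL implementation: the names fm_history_t, fm_history_init, fm_decide, and the ACTION_* values are hypothetical and do not exist in the DSDL. The sketch also assumes a single fixed window for the failure history and the restart count, that the cumulative failure is cleared after each restart, and that times are plain seconds. The real library may track its history differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the fault monitor decision rules.
 * None of these names are part of the DSDL. */
#define FULL_FAILURE 100

typedef enum { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER } fm_action_t;

typedef struct {
    long retry_interval; /* models Retry_interval, in seconds */
    int  retry_count;    /* models Retry_count */
    int  cumulative;     /* accumulated probe failure in this window */
    long window_start;   /* time at which the current window began */
    int  restarts;       /* restarts performed in this window */
} fm_history_t;

/* Start an empty history at time `now`. */
void fm_history_init(fm_history_t *h, long retry_interval,
                     int retry_count, long now)
{
    assert(h != NULL);
    h->retry_interval = retry_interval;
    h->retry_count = retry_count;
    h->cumulative = 0;
    h->window_start = now;
    h->restarts = 0;
}

/* Feed one probe result (0 = no failure, 100 = complete failure)
 * observed at time `now`, and return the action to take. */
fm_action_t fm_decide(fm_history_t *h, int probe_status, long now)
{
    /* Clamp the probe result to the documented 0..100 range. */
    if (probe_status < 0)
        probe_status = 0;
    if (probe_status > FULL_FAILURE)
        probe_status = FULL_FAILURE;

    /* Retry_interval exceeded: reset the history (assumed fixed window). */
    if (now - h->window_start > h->retry_interval) {
        h->cumulative = 0;
        h->restarts = 0;
        h->window_start = now;
    }

    h->cumulative += probe_status;

    /* Cumulative failure below 100: do nothing. */
    if (h->cumulative < FULL_FAILURE)
        return ACTION_NONE;

    /* Complete failure: count the restart and clear the accumulated
     * failure (an assumption of this sketch). */
    h->cumulative = 0;
    h->restarts++;

    /* More restarts than Retry_count within the window: fail over. */
    if (h->restarts > h->retry_count)
        return ACTION_FAILOVER;

    return ACTION_RESTART;
}
```

With a Retry_interval of 300 seconds and a Retry_count of 2, two partial failures of 50 reported within the window add up to a complete failure and produce a restart. A second complete failure produces another restart, and a third within the same window produces a failover. If the second partial failure instead arrives after the window has expired, the history is reset first and no action is taken.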