Implementing a Fault Monitor

The DSDL absorbs much of the complexity of implementing a fault monitor by providing a predetermined model. A Monitor_start method starts the fault monitor, under the control of the PMF, when the resource starts on a node. The fault monitor runs in a loop as long as the resource is running on the node.

The high-level logic of a DSDL fault monitor is as follows:

- The scds_fm_sleep() function uses the Thorough_probe_interval property to determine the amount of time between probes. Any application process failure that the PMF detects during this interval leads to a restart of the resource.
- The probe itself returns a value that indicates the severity of the failure, from 0 (no failure) to 100 (complete failure).
- The probe return value is passed to the scds_fm_action() function, which maintains a cumulative failure history within the interval specified by the Retry_interval property.
- The scds_fm_action() function determines what to do in the event of a failure, as follows:
  - If the cumulative failure is below 100, do nothing.
  - If the cumulative failure reaches 100 (complete failure), restart the data service. If Retry_interval is exceeded, reset the history.
  - If the number of restarts exceeds the value of the Retry_count property within the time specified by Retry_interval, fail over the data service.
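
The decision rules above can be illustrated with a small, self-contained C sketch. It does not call the DSDL library. The names fm_history_t, fm_history_init(), and fm_decide() are hypothetical and exist only for this illustration. The sketch also assumes a simple fixed window that starts at initialization or at the last reset; the history that the DSDL actually keeps may be tracked differently.

```c
#include <assert.h>

/* Possible outcomes of one decision step (hypothetical names). */
typedef enum { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER } fm_action_t;

/* Simplified failure history; all times are in seconds. */
typedef struct {
    long retry_interval;  /* stands in for the Retry_interval property */
    int  retry_count;     /* stands in for the Retry_count property */
    long window_start;    /* time at which the current history began */
    int  cumulative;      /* accumulated probe severity, 0 to 100+ */
    int  restarts;        /* restarts requested in the current window */
} fm_history_t;

void fm_history_init(fm_history_t *h, long retry_interval,
                     int retry_count, long now)
{
    h->retry_interval = retry_interval;
    h->retry_count = retry_count;
    h->window_start = now;
    h->cumulative = 0;
    h->restarts = 0;
}

/*
 * Decide what to do with one probe result.
 * severity: 0 (no failure) to 100 (complete failure).
 * now: current time in seconds, supplied by the caller.
 */
fm_action_t fm_decide(fm_history_t *h, int severity, long now)
{
    assert(severity >= 0 && severity <= 100);

    /* Assumption: once the window is older than retry_interval,
     * the whole history is discarded and a new window begins. */
    if (now - h->window_start > h->retry_interval) {
        h->window_start = now;
        h->cumulative = 0;
        h->restarts = 0;
    }

    h->cumulative += severity;
    if (h->cumulative < 100)
        return ACTION_NONE;          /* below complete failure */

    /* Complete failure reached: count a restart and start
     * accumulating severity again from zero. */
    h->cumulative = 0;
    h->restarts++;

    if (h->restarts > h->retry_count)
        return ACTION_FAILOVER;      /* too many restarts in the window */
    return ACTION_RESTART;
}
```

With Retry_count set to 2 and Retry_interval set to 300 seconds in this sketch, two partial failures of severity 50 trigger the first restart, a third complete failure inside the window triggers a failover, and a failure that arrives after the window has expired is treated as a fresh first restart.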