Oracle® Solaris Cluster Data Services Developer's Guide


Updated: July 2014, E39646-01
 
 

Implementing a Fault Monitor

The DSDL absorbs much of the complexity of implementing a fault monitor by providing a predetermined model. A Monitor_start method starts the fault monitor, under the control of the PMF, when the resource starts on a node. The fault monitor runs in a loop as long as the resource is running on the node.
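
The loop described above can be sketched as a small simulation. This is an illustration under stated assumptions, not DSDL code: every name in it (sim_monitor_t, sim_probe(), sim_action(), run_monitor()) is hypothetical. The clock is simulated, so advancing it by the Thorough_probe_interval value only stands in for scds_fm_sleep(). A fixed iteration count replaces the real monitor's open-ended loop so that the example terminates.

```c
#include <assert.h>

/* Hypothetical simulation context; none of these names come from the DSDL. */
typedef struct {
    const int *script;  /* scripted probe results, 0 (healthy) .. 100 (complete failure) */
    int len;            /* number of scripted results */
    int next;           /* index of the next scripted result */
    long last_time;     /* simulated time of the last action call, in seconds */
    int last_status;    /* last probe result handed to the action step */
    int actions;        /* number of probe results handed to the action step */
} sim_monitor_t;

/* Stand-in for the data-service-specific probe: returns the next scripted result. */
static int sim_probe(sim_monitor_t *m)
{
    assert(m->next < m->len);
    int status = m->script[m->next++];
    assert(status >= 0 && status <= 100);
    return status;
}

/* Stand-in for the action step: this sketch only records what it was given. */
static void sim_action(sim_monitor_t *m, long now, int probe_status)
{
    m->last_time = now;
    m->last_status = probe_status;
    m->actions++;
}

/* Runs `iterations` probe cycles on a simulated clock and returns the final
 * simulated time. A real monitor loops for as long as the resource runs. */
long run_monitor(sim_monitor_t *m, long thorough_probe_interval, int iterations)
{
    long now = 0;
    int i;

    for (i = 0; i < iterations; i++) {
        now += thorough_probe_interval;  /* stands in for scds_fm_sleep() */
        int status = sim_probe(m);       /* probe the application */
        sim_action(m, now, status);      /* hand the result to the action step */
    }
    return now;
}
```

With a Thorough_probe_interval of 10 seconds and three scripted results, the monitor wakes at 10, 20, and 30 seconds and passes each probe result on for evaluation.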

The high-level logic of a DSDL fault monitor is as follows:

  • The scds_fm_sleep() function uses the Thorough_probe_interval property to determine the amount of time between probes. Any application process failures that are detected by the PMF during this interval lead to a restart of the resource.

  • The probe itself returns a value that indicates the severity of the failure, from 0 (no failure) to 100 (complete failure).

  • The probe return value is passed to the scds_fm_action() function, which maintains a cumulative failure history within the interval that is specified by the Retry_interval property.

  • The scds_fm_action() function determines what to do in the event of a failure, as follows:

    • If the cumulative failure is below 100, do nothing.

    • If the cumulative failure reaches 100 (complete failure), restart the data service. If Retry_interval is exceeded, reset the history.

    • If the number of restarts within the time specified by Retry_interval exceeds the value of the Retry_count property, fail over the data service.
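
The decision rules above can be modeled with a small failure-history structure. This is a sketch under stated assumptions, not the DSDL implementation. All names are hypothetical. The history is treated as a sliding window of Retry_interval seconds, so failures and restarts older than that are dropped, which is one reading of "reset the history." Accumulated failures are cleared once a restart or failover is chosen. The function only returns a decision, whereas the real action function performs the restart or failover itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; not part of the DSDL API. */
#define COMPLETE_FAILURE 100
#define MAX_EVENTS 100  /* at most 99 partial failures (>= 1 each) can stay below 100 */

typedef enum { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER } fm_decision_t;

typedef struct {
    long time;   /* simulated time of the probe, in seconds */
    int status;  /* probe result, 1..100 */
} failure_t;

typedef struct {
    long retry_interval;             /* models Retry_interval, in seconds */
    int retry_count;                 /* models Retry_count (assumed <= MAX_EVENTS) */
    failure_t failures[MAX_EVENTS];  /* failures inside the window */
    size_t nfailures;
    long restarts[MAX_EVENTS];       /* times of restarts inside the window */
    size_t nrestarts;
} fm_history_t;

void fm_history_init(fm_history_t *h, long retry_interval, int retry_count)
{
    h->retry_interval = retry_interval;
    h->retry_count = retry_count;
    h->nfailures = 0;
    h->nrestarts = 0;
}

/* Assumption: drop failures and restarts older than retry_interval. */
static void prune(fm_history_t *h, long now)
{
    size_t i, k = 0;

    for (i = 0; i < h->nfailures; i++)
        if (now - h->failures[i].time <= h->retry_interval)
            h->failures[k++] = h->failures[i];
    h->nfailures = k;

    k = 0;
    for (i = 0; i < h->nrestarts; i++)
        if (now - h->restarts[i] <= h->retry_interval)
            h->restarts[k++] = h->restarts[i];
    h->nrestarts = k;
}

/* Records one probe result and decides: nothing, restart, or failover. */
fm_decision_t fm_decide(fm_history_t *h, long now, int probe_status)
{
    int total = 0;
    size_t i;

    assert(probe_status >= 0 && probe_status <= COMPLETE_FAILURE);
    prune(h, now);

    if (probe_status > 0 && h->nfailures < MAX_EVENTS) {
        h->failures[h->nfailures].time = now;
        h->failures[h->nfailures].status = probe_status;
        h->nfailures++;
    }
    for (i = 0; i < h->nfailures; i++)
        total += h->failures[i].status;

    if (total < COMPLETE_FAILURE)
        return ACTION_NONE;          /* cumulative failure below 100 */

    h->nfailures = 0;                /* assumption: history cleared after acting */
    if ((int)h->nrestarts >= h->retry_count)
        return ACTION_FAILOVER;      /* one more restart would exceed Retry_count */
    h->restarts[h->nrestarts++] = now;
    return ACTION_RESTART;
}
```

For example, with Retry_interval set to 300 and Retry_count set to 2, two partial failures of 50 at 0 and 10 seconds add up to 100 and trigger a restart. A complete failure at 20 seconds triggers a second restart, and another at 30 seconds triggers a failover because a third restart within 300 seconds would exceed Retry_count.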