Oracle® Solaris Cluster Data Services Developer's Guide

Updated: September 2015

Designing the Fault Monitor Daemon

    Resource type implementations that use the DSDL typically have a fault monitor daemon that carries out the following responsibilities:

  • Periodically monitors the health of the application that is being managed. This responsibility depends largely on the application itself and can vary widely from resource type to resource type. The DSDL contains built-in utility functions that perform health checks for simple TCP-based services. You can use these utilities to implement fault monitoring for applications that use ASCII-based protocols, such as HTTP, NNTP, IMAP, and POP3.

  • Keeps track of the problems that are encountered by the application by using the resource properties Retry_interval and Retry_count. When the application fails completely, the fault monitor needs to determine whether the PMF action script should restart the service or whether the application failures have accumulated so rapidly that a failover needs to be carried out. The DSDL utilities scds_fm_action() and scds_fm_sleep() are intended to aid you in implementing this mechanism.

  • Takes action, typically either restarting the application or attempting a failover of the containing resource group. The DSDL utility scds_fm_action() implements this algorithm by computing the accumulated probe failures within the past Retry_interval seconds.

  • Updates the resource state so that the state of the application's health is available to the Oracle Solaris Cluster administrative commands, as well as to the cluster management GUI.

The DSDL utilities are designed so that the main loop of the fault monitor daemon can be represented by the pseudo code at the end of this section.

    Keep the following factors in mind when you implement a fault monitor with the DSDL:

  • scds_fm_sleep() detects the death of an application process rapidly because notification of the process's death through the PMF is asynchronous. This reduces the fault detection time significantly, thereby increasing the availability of the service. Without this notification, a fault monitor would have to wake up periodically to check on the service's health, only then finding that the application process had died.

  • If the RGM rejects an attempt to fail over the service through the scha_control API, scds_fm_action() resets, or forgets, its current failure history, because that history already exceeds Retry_count. If the history were retained and the monitor daemon's next health check of the application failed, the monitor daemon would call scha_control() again, and that call would probably be rejected again because the situation that led to the last rejection is still valid. Resetting the history ensures that, in the next iteration, the fault monitor at least attempts to correct the situation locally (for example, by restarting the application).

  • scds_fm_action() does not reset application failure history in case of restart failures, as you would typically like to issue scha_control() quickly thereafter if the situation does not correct itself.

  • The utility scds_fm_action() updates the resource status to SCHA_RSSTATUS_OK, SCHA_RSSTATUS_DEGRADED, or SCHA_RSSTATUS_FAULTED, depending on the failure history. This status is then available to cluster system management.

In most cases, you can implement the application-specific health check action in a separate stand-alone utility (svc_probe(), for example). You can integrate it with the following generic main loop.

for (;;) {
    /* Sleep for a duration of thorough_probe_interval between
     * successive probes.
     */
    (void) scds_fm_sleep(scds_handle,
        scds_get_rs_thorough_probe_interval(scds_handle));

    /* Now probe all ipaddresses we use. Loop over:
     * 1. All net resources we use.
     * 2. All ipaddresses in a given resource.
     * For each of the ipaddresses that is probed,
     * compute the failure history.
     */
    probe_result = 0;

    /* Iterate through all the resources to get each
     * IP address to use for calling svc_probe().
     */
    for (ip = 0; ip < netaddr->num_netaddrs; ip++) {
        /* Grab the hostname and port on which the
         * health has to be monitored.
         */
        hostname = netaddr->netaddrs[ip].hostname;
        port = netaddr->netaddrs[ip].port_proto.port;

        /* HA-XFS supports only one port, and
         * hence obtain the port value from the
         * first entry in the array of ports.
         */

        /* Latch probe start time. */
        ht1 = gethrtime();

        probe_result = svc_probe(scds_handle, hostname, port, timeout);

        /* Update service probe history,
         * take action if necessary.
         * Latch probe end time.
         */
        ht2 = gethrtime();

        /* Convert to milliseconds. */
        dt = (ulong_t)((ht2 - ht1) / 1e6);

        /* Compute failure history and take
         * action if needed.
         */
        (void) scds_fm_action(scds_handle,
            probe_result, (long)dt);
    }       /* Each net resource */
}       /* Keep probing forever */