Sun Cluster Data Services Planning and Administration Guide for Solaris OS

Sun Cluster Data Service Fault Monitors

This section provides general information about data service fault monitors. Each data service that Sun supplies includes a fault monitor that is built into the package. The fault monitor (or fault probe) is a process that periodically checks the health of the data service.

Fault Monitor Invocation

The RGM invokes the fault monitor when you bring a resource group and its resources online. The RGM performs this invocation internally by calling the MONITOR_START method for the data service.
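
For data services that are built with the DSDL, the MONITOR_START method is typically a small program that starts the fault probe under PMF control. The following minimal sketch assumes DSDL-style calls (scds_initialize, scds_pmf_start, scds_close) and a hypothetical probe program named myapp_probe; treat the exact function signatures and names as illustrative rather than definitive.

    /*
     * Minimal sketch of a MONITOR_START method that launches the fault
     * probe under the PMF.  The DSDL calls and the probe name
     * "myapp_probe" are assumptions made for this example.
     */
    #include <rgm/libdsdev.h>

    int
    main(int argc, char *argv[])
    {
        scds_handle_t handle;

        /* Hand the arguments that the RGM supplies to the DSDL. */
        if (scds_initialize(&handle, argc, argv) != SCHA_ERR_NOERR)
            return (1);

        /*
         * Start the probe program under PMF so that PMF restarts the
         * probe itself if the probe dies unexpectedly.
         */
        if (scds_pmf_start(handle, SCDS_PMF_TYPE_MON,
            SCDS_PMF_SINGLE_INSTANCE, "myapp_probe", 0) != SCHA_ERR_NOERR) {
            scds_close(&handle);
            return (1);
        }

        scds_close(&handle);
        return (0);
    }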

The fault monitor performs the following two functions.

Monitoring of the Abnormal Exit of the Server Process

The Process Monitor Facility (PMF) monitors the data service processes.

The data service fault probe runs in an infinite loop and sleeps for an adjustable amount of time that the resource property Thorough_probe_interval sets. While sleeping, the probe checks with the PMF to determine whether the data service process has exited. If the process has exited, the probe sets the status of the data service to “Service daemon not running” and takes action. The action can be to restart the data service locally or to fail over the data service to a secondary cluster node. To decide whether to restart or to fail over the data service, the probe checks the values that are set in the Retry_count and Retry_interval resource properties of the data service application resource.
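
The following sketch illustrates this decision in ordinary C. The failure-history representation and the restart_service and request_failover placeholders are hypothetical; they only stand in for the mechanisms that a real fault monitor would use, and the property values in main are example numbers.

    /*
     * Sketch: decide between a local restart and a failover, based on how
     * many failures have occurred within the last Retry_interval seconds.
     * All names here are placeholders, not Sun Cluster interfaces.
     */
    #include <stdio.h>
    #include <time.h>

    #define MAX_HISTORY 64

    static time_t failure_times[MAX_HISTORY];
    static int failure_count = 0;

    /* Drop history records that are older than retry_interval seconds. */
    static void
    purge_old_failures(time_t now, int retry_interval)
    {
        int i, kept = 0;

        for (i = 0; i < failure_count; i++) {
            if (now - failure_times[i] <= retry_interval)
                failure_times[kept++] = failure_times[i];
        }
        failure_count = kept;
    }

    /* Record an abnormal exit and choose the corrective action. */
    static void
    handle_process_exit(int retry_count, int retry_interval)
    {
        time_t now = time(NULL);

        purge_old_failures(now, retry_interval);
        if (failure_count < MAX_HISTORY)
            failure_times[failure_count++] = now;

        if (failure_count < retry_count) {
            /* Fewer than Retry_count recent failures: restart locally. */
            printf("Service daemon not running: restarting locally\n");
            /* restart_service(); -- hypothetical local restart */
        } else {
            /* Too many recent failures: request a failover. */
            printf("Service daemon not running: requesting failover\n");
            /* request_failover(); -- hypothetical giveover request */
        }
    }

    int
    main(void)
    {
        /* Example: two abnormal exits with hypothetical property values. */
        handle_process_exit(2, 370);
        handle_process_exit(2, 370);
        return (0);
    }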

Checking the Health of the Data Service

Typically, the probe checks the health of the data service either by running a dedicated command or by successfully connecting to the specified data service port.
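
As an illustration of the port-connection style of check, the following sketch attempts a TCP connection to the data service port and treats a failure to connect within Probe_timeout seconds as a probe failure. This is plain sockets code; the address, port, and return convention are assumptions made for the example and are not part of any Sun Cluster interface.

    /*
     * Sketch: probe a service by connecting to its TCP port within a
     * Probe_timeout bound.  Returns 0 if the connection succeeds, -1
     * otherwise.  Address, port, and return convention are assumptions.
     */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/select.h>

    static int
    probe_port(const char *addr, int port, int probe_timeout)
    {
        struct sockaddr_in sin;
        struct timeval tv;
        fd_set wfds;
        int sock, err;
        socklen_t len = sizeof (err);

        sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            return (-1);

        /* Use a nonblocking connect so that Probe_timeout bounds the wait. */
        fcntl(sock, F_SETFL, O_NONBLOCK);

        memset(&sin, 0, sizeof (sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        inet_pton(AF_INET, addr, &sin.sin_addr);

        if (connect(sock, (struct sockaddr *)&sin, sizeof (sin)) < 0 &&
            errno != EINPROGRESS) {
            close(sock);
            return (-1);        /* immediate connection failure */
        }

        FD_ZERO(&wfds);
        FD_SET(sock, &wfds);
        tv.tv_sec = probe_timeout;
        tv.tv_usec = 0;

        /* Wait at most Probe_timeout seconds for the connection to finish. */
        if (select(sock + 1, NULL, &wfds, NULL, &tv) <= 0 ||
            getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) < 0 ||
            err != 0) {
            close(sock);
            return (-1);        /* timed out or connection refused */
        }

        close(sock);
        return (0);
    }

    int
    main(void)
    {
        /* Example: probe a hypothetical service on localhost port 8080. */
        if (probe_port("127.0.0.1", 8080, 5) == 0)
            printf("Service is online\n");
        else
            printf("Probe failed\n");
        return (0);
    }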

The logic that the probe uses is roughly as follows. A code sketch of the complete loop appears after the list.

  1. Sleep (Thorough_probe_interval).

  2. Perform health checks, bounded by the time-out that the Probe_timeout property specifies. Probe_timeout is a resource extension property of each data service that you can set.

  3. If Step 2 succeeds, that is, the service is healthy, the probe updates the success/failure history by purging any history records that are older than the value that is set for the resource property Retry_interval. The probe sets the status message for the resource to “Service is online” and returns to Step 1.

    If Step 2 fails, the probe updates the failure history and then computes the total number of times that the health check has failed.

    The result of the health check can range from a complete failure to complete success, and its interpretation depends on the specific data service. Consider a scenario where the probe can successfully connect to the server and send a handshake message, but receives only a partial response before it times out. This scenario is most likely the result of system overload. If some action is taken (such as restarting the service), the clients reconnect to the service and overload the system further. For this reason, a data service fault monitor can decide not to treat such a “partial” failure as fatal. Instead, the monitor can record the failure as a nonfatal, partial probe failure. Partial failures are still accumulated over the interval that the Retry_interval property specifies.

    However, if the probe cannot connect to the server at all, the failure can be considered fatal. A partial failure increments the failure count by only a fractional amount, whereas a fatal failure counts as a complete failure. Whenever the accumulated count reaches a complete failure (either through a single fatal failure or through the accumulation of partial failures), the probe restarts or fails over the data service in an attempt to correct the situation.

  4. If the result of the computation in Step 3 (the number of failures in the history interval) is less than the value of the resource property Retry_count, the probe attempts to correct the situation locally (for example, by restarting the service). The probe sets the status message of the resource to “Service is degraded” and returns to Step 1.

  5. If the number of failures within Retry_interval exceeds Retry_count, the probe calls scha_control with the “giveover” option, which requests failover of the service. If this request succeeds, the fault probe stops running on this node. The probe sets the status message for the resource to “Service has failed.”

  6. The Sun Cluster framework can deny the scha_control request that was issued in the previous step for various reasons. The return code of scha_control identifies the reason, and the probe checks this return code. If the request is denied, the probe resets the failure/success history and starts afresh. The probe resets the history because the number of failures is already above Retry_count, so the fault probe would otherwise attempt to issue scha_control in each subsequent iteration, only to be denied again. Repeating the request would place additional load on the system and would increase the likelihood of further service failures.

    The probe then returns to Step 1.
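
The following sketch pulls the steps above together into a single loop, in ordinary C. The property values, the health-check routine, and the restart and giveover calls are hypothetical stubs; in a real fault monitor the giveover request is made with scha_control, and the fractional increment that is used for a partial failure is an arbitrary example value. The history purge is also simplified to a single window reset rather than per-record pruning.

    /*
     * Sketch of the probe loop described above.  Property values, the
     * health check, and the restart/giveover calls are hypothetical stubs;
     * a real fault monitor requests the giveover with scha_control.
     */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical property values, normally read from the resource. */
    #define THOROUGH_PROBE_INTERVAL 60      /* seconds between probes */
    #define RETRY_INTERVAL          370     /* history window, in seconds */
    #define RETRY_COUNT             2

    /* Health-check outcomes. */
    #define PROBE_OK        0
    #define PROBE_PARTIAL   1       /* for example, response timed out */
    #define PROBE_FATAL     2       /* for example, could not connect */

    /* Stubs: a real probe would contact the service and act on it here. */
    static int check_health(void) { return (PROBE_OK); }
    static int restart_service(void) { return (0); }
    static int request_giveover(void) { return (-1); }     /* -1 = denied */

    int
    main(void)
    {
        double failures = 0.0;      /* failures accumulated in the window */
        time_t window_start = time(NULL);

        for (;;) {
            sleep(THOROUGH_PROBE_INTERVAL);                 /* Step 1 */

            /*
             * Step 3 (simplified): discard history that is older than
             * Retry_interval by resetting the whole window.
             */
            if (time(NULL) - window_start > RETRY_INTERVAL) {
                failures = 0.0;
                window_start = time(NULL);
            }

            switch (check_health()) {                       /* Step 2 */
            case PROBE_OK:
                printf("Service is online\n");
                continue;
            case PROBE_PARTIAL:
                failures += 0.5;    /* fractional increment (example value) */
                break;
            case PROBE_FATAL:
                failures += 1.0;    /* a fatal failure counts in full */
                break;
            }

            if (failures < RETRY_COUNT) {                   /* Step 4 */
                printf("Service is degraded\n");
                (void) restart_service();
            } else {                                        /* Steps 5 and 6 */
                if (request_giveover() == 0) {
                    printf("Service has failed\n");
                    break;          /* monitoring stops on this node */
                }
                /* Giveover denied: reset the history to avoid repeating. */
                failures = 0.0;
                window_start = time(NULL);
            }
        }
        return (0);
    }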