Sun Cluster 3.0 Data Services Installation and Configuration Guide

Health Checks of the Data Service

Typically, communication between the probe and the data service occurs through a dedicated command or a successful connection to the specified data service port.
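
For a service that is checked by connecting to its port, the check can be as simple as a timed TCP connect. The following minimal sketch illustrates one way such a check might look. It is not taken from any shipped agent; the function name probe_connect() and its arguments are hypothetical stand-ins for values that a real probe would read from the resource's properties.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/select.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <errno.h>

    /*
     * Return 0 if a TCP connection to host:port succeeds within
     * timeout_sec, -1 otherwise.  host is assumed to be dotted-decimal.
     */
    static int
    probe_connect(const char *host, int port, int timeout_sec)
    {
            struct sockaddr_in addr;
            struct timeval tv;
            fd_set wfds;
            int sock, rc = -1;

            if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                    return (-1);

            (void) memset(&addr, 0, sizeof (addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons((unsigned short)port);
            addr.sin_addr.s_addr = inet_addr(host);

            /* Non-blocking connect so the probe enforces its own time-out. */
            (void) fcntl(sock, F_SETFL, O_NONBLOCK);
            if (connect(sock, (struct sockaddr *)&addr, sizeof (addr)) == 0) {
                    rc = 0;                       /* connected immediately */
            } else if (errno == EINPROGRESS) {
                    FD_ZERO(&wfds);
                    FD_SET(sock, &wfds);
                    tv.tv_sec = timeout_sec;
                    tv.tv_usec = 0;
                    if (select(sock + 1, NULL, &wfds, NULL, &tv) > 0) {
                            int err = 0;
                            socklen_t len = sizeof (err);

                            /* Writable: check whether the connect succeeded. */
                            if (getsockopt(sock, SOL_SOCKET, SO_ERROR,
                                &err, &len) == 0 && err == 0)
                                    rc = 0;
                    }
            }
            (void) close(sock);
            return (rc);
    }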

If no messages have been received on the control socket, the probe sleeps for the interval specified by Thorough_probe_interval and then checks the health of the data service. The probe follows this logic (a code sketch of the full loop appears after the list):

  1. Sleep (Thorough_probe_interval).

  2. Perform the health checks, bounded by the time-out that the Probe_timeout property specifies. Probe_timeout is a resource extension property of each data service that you can set.

  3. If Step 2 succeeds, that is, the service is healthy, the probe updates the success/failure history by purging any history records that are older than the value of the resource property Retry_interval. The probe sets the status message for the resource to "Service is online" and returns to Step 1.

    If Step 2 resulted in a failure, the probe updates the failure history. It then computes the total number of times the health check failed.

    The result of the health check can range from total failure to success, and how a result is interpreted depends on the specific data service. Consider a scenario in which the probe can successfully connect to the server and send it a handshake message, but receives only a partial response before timing out. This scenario is most likely the result of system overload. If some action is taken, such as restarting the service, the clients must reconnect to the service, further overloading the system. In that case, a data service fault monitor can decide not to treat this partial failure as fatal. Instead, the monitor can track the failure as a nonfatal probe failure. Such partial failures are still accumulated over the interval that Retry_interval specifies.

    However, if the probe cannot connect to the server at all, the failure is considered fatal. A partial failure increments the failure count by a fraction, whereas a fatal (complete) failure always increments the failure count by 1. Each time the failure count increases by 1, whether through a single fatal failure or through the accumulation of partial failures, the probe attempts to correct the situation by restarting or failing over the data service.

  4. If the result of the computation in Step 3 (the number of failures within the history interval) is less than the value of the resource property Retry_count, the probe attempts to correct the situation locally, for example, by restarting the service. The probe sets the status message for the resource to "Service is degraded" and returns to Step 1.

  5. If the number of failures within Retry_interval equals or exceeds Retry_count, the probe calls scha_control with the "giveover" option, which requests failover of the service. If this request succeeds, the fault probe stops on this node. The probe sets the status message for the resource to "Service has failed."

  6. The scha_control request issued in the previous step can be denied by the Sun Cluster framework because of various reasons; the reason is identified by the return code of scha_control. The probe checks the return code. If the scha_control is denied, the probe resets the failure/success history and starts afresh. The reason for this action is that because the number of failures is already above Retry_count, the fault probe would attempt to issue scha_control in each subsequent iteration (which is to be denied again). This request would place additional load on the system and increase the likelihood of further service failures in the case where they have been triggered by an overloaded system. The probe then returns to Step 1.