C.1.2 Sanity Checking of Probing Node (Sun Cluster 2.2 System Administration Guide)

C.1.2 Sanity Checking of Probing Node

Both network fault probing and data service-specific fault probing require each node to probe another node for a response. Before performing a takeover, the probing node runs a number of basic sanity checks on itself. These checks attempt to ensure that the problem does not actually lie with the probing node, and that taking over from the server that appears to be failing really will improve the situation. Without the sanity checks, false takeovers would be likely: a sick node would wrongly blame another node for a lack of response and would take over from the healthier server.

The probing node performs the following sanity checks on itself before doing a takeover from another node:

The probing node checks its own ability to use the public network, as described in "C.2 Public Network Monitoring (PNM)".
The probing node also checks whether its own HA data services are responding. All the HA data services that the probing node is already running are checked. If any of them is not responsive, takeover is inhibited, on the assumption that the probing node will not do any better running another node's services if it cannot run its own. Furthermore, the failure of the probing node's own HA data services to respond might indicate an underlying problem with the probing node that could itself be causing the probe of the other node to fail. Sun Cluster HA for NFS provides an important example of this phenomenon: to lock a file on another node, the probing node's own lockd and statd daemons must be working. By checking the response of its own lockd and statd daemons, the probing node rules out the scenario where its own daemons' failure to respond makes the other node look unresponsive.
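
The decision logic of the two sanity checks above can be illustrated with a minimal Python sketch. This is not Sun Cluster code: the function name, the SanityResult type, and the probe callables are all hypothetical, and the sketch additionally assumes that a probe that raises an exception counts as unresponsive. It shows only the rule that takeover is inhibited if the probing node's public network check fails or if any of its own HA data services does not respond.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SanityResult:
    """Hypothetical result type: whether takeover may proceed, and why not."""
    takeover_allowed: bool
    reasons: List[str] = field(default_factory=list)


def _responds(probe: Callable[[], bool]) -> bool:
    # Assumption: a probe that raises is treated the same as no response.
    try:
        return bool(probe())
    except Exception:
        return False


def sanity_check_before_takeover(
    public_network_ok: Callable[[], bool],
    local_service_probes: Dict[str, Callable[[], bool]],
) -> SanityResult:
    """Run the probing node's self-checks; any failure inhibits takeover."""
    reasons: List[str] = []

    # Check 1: the probing node's own ability to use the public network.
    if not _responds(public_network_ok):
        reasons.append("public network check failed on probing node")

    # Check 2: every HA data service the probing node already runs.
    for name, probe in local_service_probes.items():
        if not _responds(probe):
            reasons.append(f"local HA data service not responding: {name}")

    return SanityResult(takeover_allowed=not reasons, reasons=reasons)


if __name__ == "__main__":
    # Example mirroring the NFS case: the local statd probe fails,
    # so the probing node must not take over.
    result = sanity_check_before_takeover(
        public_network_ok=lambda: True,
        local_service_probes={"lockd": lambda: True, "statd": lambda: False},
    )
    print(result.takeover_allowed)  # False
    print(result.reasons)
```

In this sketch all checks are run even after the first failure, so that every reason for inhibiting the takeover is reported; a real implementation might well stop at the first failed check.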