Sun Cluster 2.2 System Administration Guide

C.4.1 Sun Cluster HA for NFS Fault Probes

The probing server runs two types of periodic probes against another server's NFS service.

  1. The probing server sends a NULL RPC to all daemon processes on the target node that are required to provide NFS service; these daemons are rpcbind, mountd, nfsd, lockd, and statd.

  2. The probing server does an end-to-end test: it tries to mount an NFS file system from the other node, and then to read and write a file in that file system. It does this end-to-end test for every file system that the other node is currently sharing. Because the mount is expensive, it is executed less often than the other probes.

If any of these probes fail, the probing node will consider doing a takeover from the serving node. However, certain conditions might inhibit the takeover from occurring immediately:

After passing these Sun Cluster HA for NFS-specific tests, the process of considering whether or not to do a takeover continues with calls to hactl(1M) (see "C.1.2 Sanity Checking of Probing Node").

The probing server also checks its own NFS service. The logic is similar to the probes of the other server, but instead of doing takeovers, error messages are logged to syslog and an attempt is made to restart any daemons whose process no longer exists. In other words, the restart of a daemon process is performed only when the daemon process has exited or crashed. The restart of a daemon process is not attempted if the daemon process still exists but is not responding, because that would require killing the daemon without knowing which data structures it is updating. The restart is also not done if a local restart has been attempted too recently (within the last hour). Instead, the other server is told to consider doing a takeover (provided the other server passes its own sanity checks). Finally, the rpcbind daemon is never restarted, because there is no way to inform processes that had registered with rpcbind that they need to re-register.