Sun Cluster Data Service for Network File System (NFS) Guide for Solaris OS

Sun Cluster HA for NFS Fault Monitor

The Sun Cluster HA for NFS fault monitor uses the following processes:

Fault Monitor Startup

An NFS resource MONITOR_START method starts the NFS system fault monitor. This start method first checks whether the NFS system fault monitor (nfs_daemons_probe) is already running under the process monitor daemon (rpc.pmfd). If the NFS system fault monitor is not running, the start method starts the nfs_daemons_probe process under the control of the process monitor. The start method then starts the resource fault monitor (nfs_probe), also under the control of the process monitor.
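The ordering can be pictured with a short C sketch. The pmf_* helpers and the nametags here are hypothetical stand-ins for the real process monitor (rpc.pmfd) interfaces, not the actual method implementation:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the process monitor (rpc.pmfd) interfaces;
     * the real MONITOR_START method talks to rpc.pmfd, not to these stubs. */
    static bool pmf_tag_running(const char *tag) { (void)tag; return false; }
    static void pmf_start(const char *tag, const char *cmd)
    {
        printf("starting under PMF: tag=%s cmd=%s\n", tag, cmd);
    }

    int main(void)
    {
        /* The NFS system fault monitor is per node: start it only if it
         * is not already running under the process monitor. */
        if (!pmf_tag_running("nfs.sysmon"))               /* hypothetical tag */
            pmf_start("nfs.sysmon", "nfs_daemons_probe");

        /* Each NFS resource then gets its own resource fault monitor. */
        pmf_start("nfs.rsmon.nfs-res", "nfs_probe");      /* hypothetical tag */
        return 0;
    }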

Fault Monitor Stop

The NFS resource MONITOR_STOP method first stops the resource fault monitor. Then, if no other NFS resource fault monitor is running on the local node, this method also stops the NFS system fault monitor.
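A matching sketch of the stop ordering, again with hypothetical helpers in place of the real PMF interfaces:

    #include <stdio.h>

    /* Hypothetical stand-ins for PMF stop/query operations. */
    static void pmf_stop(const char *tag) { printf("stopping PMF tag %s\n", tag); }
    static int  nfs_resource_monitors_on_node(void) { return 0; }  /* stub */

    int main(void)
    {
        /* Stop this resource's fault monitor first. */
        pmf_stop("nfs.rsmon.nfs-res");                    /* hypothetical tag */

        /* Stop the shared system fault monitor only if no other NFS
         * resource fault monitor is still running on the local node. */
        if (nfs_resource_monitors_on_node() == 0)
            pmf_stop("nfs.sysmon");
        return 0;
    }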

NFS System Fault Monitor Process

The NFS system fault monitor probes rpcbind, statd, lockd, nfsd, and mountd on the local node by checking for the presence of each process and its response to a null rpc call (a probe of this kind is sketched after the property list). This monitor uses the following NFS extension properties:

Rpcbind_nullrpc_timeout

Lockd_nullrpc_timeout

Nfsd_nullrpc_timeout

Rpcbind_nullrpc_reboot

Mountd_nullrpc_timeout

Nfsd_nullrpc_restart

Statd_nullrpc_timeout

Mountd_nullrpc_restart

See Configuring Sun Cluster HA for NFS Extension Properties to review or set extension properties.
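For illustration, a null rpc call of the kind that the monitor issues can be made through the standard ONC RPC client routines. The following C sketch is not the monitor's code; the host, the timeout, and the nfsd program and version numbers (100003, version 3) are example values:

    #include <rpc/rpc.h>
    #include <stdio.h>

    /*
     * Probe an RPC service with a null call (procedure 0), the same kind
     * of check the system fault monitor applies to rpcbind, statd, lockd,
     * nfsd, and mountd.  Returns 0 if the service answers in time.
     */
    static int null_rpc_probe(const char *host, unsigned long prog,
                              unsigned long vers, long timeout_sec)
    {
        CLIENT *clnt = clnt_create(host, prog, vers, "tcp");
        if (clnt == NULL)
            return -1;                /* service not registered or reachable */

        struct timeval tv;
        tv.tv_sec  = timeout_sec;
        tv.tv_usec = 0;
        enum clnt_stat st = clnt_call(clnt, NULLPROC,
                                      (xdrproc_t)xdr_void, NULL,
                                      (xdrproc_t)xdr_void, NULL, tv);
        clnt_destroy(clnt);
        return (st == RPC_SUCCESS) ? 0 : -1;
    }

    int main(void)
    {
        /* Example: probe nfsd (program 100003, version 3) on the local
         * node, with a timeout in the spirit of Nfsd_nullrpc_timeout. */
        if (null_rpc_probe("localhost", 100003, 3, 10) != 0)
            fprintf(stderr, "null rpc call to nfsd failed\n");
        return 0;
    }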

Each system fault-monitor probe cycle performs the following steps in a loop.

  1. Sleeps for Cheap_probe_interval.

  2. Probes rpcbind.

    If the process terminates unexpectedly but a warm restart of the daemon is in progress, the fault monitor continues to probe the other daemons.

    If the process terminates unexpectedly and no warm restart is in progress, the fault monitor reboots the node.

    If a null rpc call to the daemon terminates unexpectedly, Rpcbind_nullrpc_reboot=True, and Failover_mode=HARD, then the fault monitor reboots the node.

  3. Probes statd first, and then lockd.

    If statd or lockd terminates unexpectedly, the system fault monitor attempts to restart both daemons.

    If a null rpc call to these daemons terminates unexpectedly, the fault monitor logs a message to syslog but does not restart statd or lockd.

  4. Probes mountd.

    If mountd terminates unexpectedly, the fault monitor attempts to restart the daemon.

    If the null rpc call to the daemon terminates unexpectedly and Mountd_nullrpc_restart=True, the fault monitor attempts to restart mountd if the cluster file system is available.

  5. Probes nfsd.

    If nfsd terminates unexpectedly, the fault monitor attempts to restart the daemon.

    If the null rpc call to the daemon terminates unexpectedly and Nfsd_nullrpc_restart=True, then the fault monitor attempts to restart nfsd if the cluster file system is available. (The probe-and-restart pattern that steps 3 through 5 share is sketched after this list.)
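Steps 3 through 5 share one probe-and-restart pattern. The following C sketch captures that pattern with hypothetical helpers for the presence check, the null rpc call, and the restart request; note that a real statd or lockd failure restarts both daemons, which the sketch simplifies:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helpers standing in for the monitor's internals. */
    static bool daemon_present(const char *name)  { (void)name; return true; }
    static bool null_rpc_ok(const char *name)     { (void)name; return true; }
    static void request_restart(const char *name) { printf("restart %s\n", name); }

    /*
     * One probe of a single NFS daemon: restart on unexpected termination,
     * and restart on a failed null rpc call only when the matching
     * *_nullrpc_restart extension property is set (and, per the text,
     * the cluster file system is available).
     */
    static void probe_daemon(const char *name, bool restart_on_nullrpc_failure)
    {
        if (!daemon_present(name)) {
            request_restart(name);
            return;
        }
        if (!null_rpc_ok(name)) {
            if (restart_on_nullrpc_failure)
                request_restart(name);
            else
                fprintf(stderr, "%s: null rpc failed, logging only\n", name);
        }
    }

    int main(void)
    {
        probe_daemon("statd",  false);  /* statd/lockd: log only on rpc failure */
        probe_daemon("lockd",  false);
        probe_daemon("mountd", true);   /* honors Mountd_nullrpc_restart */
        probe_daemon("nfsd",   true);   /* honors Nfsd_nullrpc_restart */
        return 0;
    }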

If any of the above NFS daemons (except rpcbind) fail to restart during a probe cycle, the NFS system fault monitor retries the restart in the next cycle. When all of the NFS daemons are restarted and healthy, the resource status is set to ONLINE. The monitor tracks unexpected terminations of NFS daemons within the last Retry_interval. When the total number of unexpected terminations reaches Retry_count, the system fault monitor issues a scha_control giveover. If the giveover call fails, the monitor attempts to restart the failed NFS daemon.
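The failure accounting amounts to counting terminations inside a sliding Retry_interval window. The sketch below assumes the scha_control(3HA) interface with the SCHA_GIVEOVER tag; the helper, the names, and the property values shown are illustrative:

    #include <scha.h>
    #include <stdio.h>
    #include <time.h>

    #define RETRY_COUNT     4      /* illustrative values for Retry_count  */
    #define RETRY_INTERVAL  370    /* and Retry_interval (seconds)         */

    static time_t failures[RETRY_COUNT];  /* timestamps of recent terminations */
    static int    nfailures;

    /* Record one unexpected daemon termination and report whether the
     * failure history within Retry_interval has reached Retry_count. */
    static int record_failure_and_check(void)
    {
        time_t now = time(NULL);
        int kept = 0;
        for (int i = 0; i < nfailures; i++)        /* drop aged-out entries */
            if (now - failures[i] <= RETRY_INTERVAL)
                failures[kept++] = failures[i];
        nfailures = kept;
        if (nfailures < RETRY_COUNT)
            failures[nfailures++] = now;
        return nfailures >= RETRY_COUNT;
    }

    static void on_daemon_termination(const char *rgname, const char *rname)
    {
        if (record_failure_and_check()) {
            /* Too many terminations within Retry_interval: ask the RGM to
             * give the resource group over to another node. */
            if (scha_control(SCHA_GIVEOVER, rgname, rname) != SCHA_ERR_NOERR)
                fprintf(stderr, "giveover failed; retrying daemon restart\n");
        }
    }

    int main(void)
    {
        /* Example with hypothetical resource-group and resource names. */
        on_daemon_termination("nfs-rg", "nfs-res");
        return 0;
    }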

At the end of each probe cycle, if all daemons are healthy, the monitor clears the history of failures.

NFS Resource Fault Monitor Process

Before the resource fault monitor starts probing, all of the shared paths are read from the dfstab file and stored in memory. In each probe cycle, the monitor probes every stored path by performing stat() on it.
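As an illustration of that in-memory list, the following sketch takes the last field of each share command in /etc/dfs/dfstab as a shared path. Real dfstab entries can span lines or quote arguments, so this parsing is a deliberate simplification:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_PATHS 256

    static char paths[MAX_PATHS][PATH_MAX];
    static int  npaths;

    /* Simplified: treat the last whitespace-separated field of each
     * "share" line in dfstab as the shared path.  A production parser
     * would handle continuations, quoting, and leading whitespace. */
    static int load_shared_paths(const char *dfstab)
    {
        FILE *fp = fopen(dfstab, "r");
        char line[1024];

        if (fp == NULL)
            return -1;                 /* read error: resource -> FAULTED */
        npaths = 0;
        while (fgets(line, sizeof line, fp) != NULL && npaths < MAX_PATHS) {
            char *tok, *last = NULL;
            if (strncmp(line, "share", 5) != 0)
                continue;              /* skip comments and blank lines */
            for (tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
                last = tok;
            if (last != NULL && last[0] == '/')
                snprintf(paths[npaths++], PATH_MAX, "%s", last);
        }
        fclose(fp);
        return 0;
    }

    int main(void)
    {
        if (load_shared_paths("/etc/dfs/dfstab") == 0)
            for (int i = 0; i < npaths; i++)
                printf("shared path: %s\n", paths[i]);
        return 0;
    }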

Each resource fault monitor probe cycle performs the following steps in a loop.

  1. Sleeps for Thorough_probe_interval.

  2. Refreshes the in-memory list of shared paths if dfstab has changed since the last read.

    If an error occurs while reading the dfstab file, the resource status is set to FAULTED, and the monitor skips the remainder of the checks in the current probe cycle.

  3. Probes all of the shared paths by performing stat() on each path.

    If any path is not functional, the resource status is set to FAULTED.

  4. Probes for the presence of the NFS daemons (nfsd, mountd, lockd, statd) and rpcbind on the local node.

    If any of these daemons are down, the resource status is set to FAULTED.

If all shared paths are valid and NFS daemons are present, the resource status is reset to ONLINE at the end of the probe cycle.
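Putting the cycle together, one resource-probe pass might be sketched as follows. The status-setting and reload helpers are hypothetical (the real monitor reports status through the cluster framework), and the daemon presence check of step 4 is omitted:

    #include <sys/stat.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-ins for the monitor's internals. */
    static const char *paths[] = { "/export/home", "/export/ws" };  /* example */
    static const int   npaths  = 2;
    static time_t      last_read;              /* when dfstab was last read */

    static void set_status(const char *s) { printf("resource status: %s\n", s); }
    static int  reload_paths(void)        { return 0; }   /* re-read dfstab */

    /* One resource fault monitor probe pass: refresh the path list if
     * dfstab changed since the last read, then stat() every shared path. */
    static void probe_once(void)
    {
        struct stat sb;

        /* Step 2: refresh memory if dfstab changed since the last read. */
        if (stat("/etc/dfs/dfstab", &sb) != 0 ||
            (sb.st_mtime > last_read && reload_paths() != 0)) {
            set_status("FAULTED");     /* read error: skip remaining checks */
            return;
        }
        last_read = sb.st_mtime;

        /* Step 3: probe every shared path. */
        for (int i = 0; i < npaths; i++)
            if (stat(paths[i], &sb) != 0) {
                set_status("FAULTED");
                return;
            }

        /* (Step 4, the daemon presence check, is omitted here.) */
        set_status("ONLINE");          /* all paths healthy this cycle */
    }

    int main(void)
    {
        probe_once();
        return 0;
    }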