NFS System Fault Monitoring Process (Sun Cluster Data Service for NFS Guide for Solaris OS)

NFS System Fault Monitoring Process

The NFS system fault monitor probe monitors the NFS daemons (nfsd, mountd, statd, and lockd) and the RPC portmapper service daemon (rpcbind) on the local node. For each daemon, the probe checks that the process is present and that it responds to a null RPC call. This monitor uses the following NFS extension properties:

Rpcbind_nullrpc_timeout
Rpcbind_nullrpc_reboot
Statd_nullrpc_timeout
Lockd_nullrpc_timeout
Mountd_nullrpc_timeout
Mountd_nullrpc_restart
Nfsd_nullrpc_timeout
Nfsd_nullrpc_restart

See Setting Sun Cluster HA for NFS Extension Properties.
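
To make the idea of a null RPC call concrete, the following Python sketch builds the ONC RPC call message for procedure 0 (the NULL procedure) and checks a reply for success. It illustrates the wire format only and is not the code that the fault monitor runs. The function names and the choice of a UDP socket are assumptions made for this example; the program numbers are the standard ONC RPC assignments for these services.

```python
import socket
import struct

# Standard ONC RPC program numbers for the daemons that the monitor probes.
RPC_PROGRAMS = {
    "rpcbind": 100000,
    "nfsd": 100003,
    "mountd": 100005,
    "lockd": 100021,
    "statd": 100024,
}

MSG_CALL, MSG_REPLY = 0, 1
RPC_VERSION = 2
AUTH_NONE = 0
NULL_PROCEDURE = 0


def build_null_call(xid, program, version):
    """Return the 40-byte call message for procedure 0 with AUTH_NONE."""
    return struct.pack(
        ">10I",
        xid, MSG_CALL, RPC_VERSION, program, version, NULL_PROCEDURE,
        AUTH_NONE, 0,   # credential: flavor, body length
        AUTH_NONE, 0,   # verifier: flavor, body length
    )


def reply_is_success(data, xid):
    """True if data is an accepted reply to xid with accept status SUCCESS."""
    if len(data) < 20:
        return False
    rxid, msg_type, reply_stat, _flavor, verf_len = struct.unpack(">5I", data[:20])
    if (rxid, msg_type, reply_stat) != (xid, MSG_REPLY, 0):
        return False
    offset = 20 + (verf_len + 3) // 4 * 4   # skip the padded verifier body
    if len(data) < offset + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[offset:offset + 4])
    return accept_stat == 0


def null_rpc_udp(host, port, program, version, timeout_seconds, xid=1):
    """Send one null call over UDP (hypothetical helper for this sketch).

    Returns False on timeout, socket error, or an unsuccessful reply.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_seconds)
        try:
            sock.sendto(build_null_call(xid, program, version), (host, port))
            data, _addr = sock.recvfrom(1024)
        except OSError:   # socket.timeout is a subclass of OSError
            return False
    return reply_is_success(data, xid)
```

For example, `null_rpc_udp("localhost", 111, RPC_PROGRAMS["rpcbind"], 2, timeout_seconds=5)` would probe rpcbind version 2 on its well-known port. The timeout argument plays the role that the *_nullrpc_timeout properties play for the monitor; the value 5 is only a placeholder.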

Each NFS system fault monitor probe cycle performs the following steps in a loop. The system property Cheap_probe_interval specifies the interval between probes.

The fault monitor probes rpcbind.

If the process terminates unexpectedly but a warm restart of the daemon is in progress, the fault monitor continues to probe the other daemons.

If the process terminates unexpectedly and no warm restart is in progress, the fault monitor reboots the node.

If a null RPC call to the daemon terminates unexpectedly, Rpcbind_nullrpc_reboot=True, and Failover_mode=HARD, then the fault monitor reboots the node.

The fault monitor probes statd first, and then lockd.

If statd or lockd terminates unexpectedly, the system fault monitor attempts to restart both daemons.

If a null RPC call to either daemon terminates unexpectedly, the fault monitor logs a message to syslog but does not restart statd or lockd.

The fault monitor probes mountd.

If mountd terminates unexpectedly, the fault monitor attempts to restart the daemon.

If a null RPC call to the daemon terminates unexpectedly and Mountd_nullrpc_restart=True, the fault monitor attempts to restart mountd, provided that the cluster file system is available.

The fault monitor probes nfsd.

If nfsd terminates unexpectedly, the fault monitor attempts to restart the daemon.

If a null RPC call to the daemon terminates unexpectedly and Nfsd_nullrpc_restart=True, the fault monitor attempts to restart nfsd, provided that the cluster file system is available.
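
The per-daemon rules in the steps above can be summarized as a small decision function. The following Python sketch is only a reading aid under stated assumptions: the function name, the action strings, and the dictionary of properties are hypothetical, and the "no_action" result stands in for cases that the text does not describe.

```python
# Order in which the steps above probe the daemons.
PROBE_ORDER = ["rpcbind", "statd", "lockd", "mountd", "nfsd"]


def decide_action(daemon, process_alive, null_rpc_ok, props,
                  warm_restart_in_progress=False, cluster_fs_available=True):
    """Map one probe observation to the action described in the steps above.

    props is a plain dict of property names to values (an assumption of
    this sketch, not a Sun Cluster data structure).
    """
    if daemon == "rpcbind":
        if not process_alive:
            if warm_restart_in_progress:
                return "continue_probing"
            return "reboot_node"
        if (not null_rpc_ok and props.get("Rpcbind_nullrpc_reboot")
                and props.get("Failover_mode") == "HARD"):
            return "reboot_node"
        return "no_action"   # assumption: the text names no other action

    if daemon in ("statd", "lockd"):
        if not process_alive:
            return "restart_statd_and_lockd"
        return "no_action" if null_rpc_ok else "log_to_syslog"

    if daemon in ("mountd", "nfsd"):
        # Builds "Mountd_nullrpc_restart" or "Nfsd_nullrpc_restart".
        restart_prop = daemon.capitalize() + "_nullrpc_restart"
        if not process_alive:
            return "restart_" + daemon
        if not null_rpc_ok and props.get(restart_prop) and cluster_fs_available:
            return "restart_" + daemon
        return "no_action"   # assumption: the text names no other action

    raise ValueError("unknown daemon: " + daemon)
```

A caller would iterate over PROBE_ORDER once per probe cycle, feed each observation to decide_action, and then wait for the interval given by Cheap_probe_interval before the next cycle.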

If any of the above NFS daemons (except rpcbind) fails to restart during a probe cycle, the NFS system fault monitor retries the restart in the next cycle. When all of the NFS daemons are restarted and healthy, the resource status is set to ONLINE. The monitor tracks unexpected terminations of NFS daemons during the most recent Retry_interval. When the total number of unexpected daemon terminations reaches Retry_count, the system fault monitor issues a scha_control giveover. If the giveover call fails, the monitor attempts to restart the failed NFS daemon.

At the end of each probe cycle, if all daemons are healthy, the monitor clears the history of failures.
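
The interaction between Retry_count, Retry_interval, and the failure history can be sketched as a sliding window of timestamps. The class below is a minimal illustration, assuming that times are expressed in seconds and that only terminations inside the most recent Retry_interval count toward Retry_count; the class and method names are hypothetical and do not come from Sun Cluster.

```python
from collections import deque


class FailureHistory:
    """Tracks unexpected daemon terminations inside a sliding time window."""

    def __init__(self, retry_count, retry_interval):
        self.retry_count = retry_count
        self.retry_interval = retry_interval
        self._times = deque()

    def record_termination(self, now):
        """Remember one unexpected termination observed at time now."""
        self._times.append(now)

    def should_giveover(self, now):
        """True once Retry_count terminations fall within Retry_interval."""
        cutoff = now - self.retry_interval
        while self._times and self._times[0] < cutoff:
            self._times.popleft()   # drop terminations older than the window
        return len(self._times) >= self.retry_count

    def clear(self):
        """Forget all failures, as the monitor does after a healthy cycle."""
        self._times.clear()
```

In this sketch, a monitor loop would call record_termination whenever a daemon is found dead, request a giveover when should_giveover returns True (falling back to a restart if the giveover fails), and call clear at the end of any cycle in which every daemon is healthy.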