NFS System Fault Monitor Process (Sun Cluster Data Service for Network File System (NFS) Guide for Solaris OS)

NFS System Fault Monitor Process

The NFS system fault monitor probes rpcbind, statd, lockd, nfsd, and mountd on the local node by checking for the presence of each process and its response to a null RPC call. This monitor uses the following NFS extension properties.

`Rpcbind_nullrpc_timeout`	`Lockd_nullrpc_timeout`
`Nfsd_nullrpc_timeout`	`Rpcbind_nullrpc_reboot`
`Mountd_nullrpc_timeout`	`Nfsd_nullrpc_restart`
`Statd_nullrpc_timeout`	`Mountd_nullrpc_restart`

See Configuring Sun Cluster HA for NFS Extension Properties to review or set extension properties.

Each system fault-monitor probe cycle performs the following steps in a loop.

1. Sleeps for Cheap_probe_interval.

2. Probes rpcbind.

   - If the process terminates unexpectedly, but a warm restart of the daemon is in progress, the fault monitor continues to probe the other daemons.

   - If the process terminates unexpectedly and no warm restart is in progress, the fault monitor reboots the node.

   - If a null RPC call to the daemon terminates unexpectedly, Rpcbind_nullrpc_reboot=True, and Failover_mode=HARD, the fault monitor reboots the node.

3. Probes statd first, and then lockd.

   - If statd or lockd terminates unexpectedly, the system fault monitor attempts to restart both daemons.

   - If a null RPC call to these daemons terminates unexpectedly, the fault monitor logs a message to syslog but does not restart statd or lockd.

4. Probes mountd.

   - If mountd terminates unexpectedly, the fault monitor attempts to restart the daemon.

   - If the null RPC call to the daemon terminates unexpectedly and Mountd_nullrpc_restart=True, the fault monitor attempts to restart mountd if the cluster file system is available.

5. Probes nfsd.

   - If nfsd terminates unexpectedly, the fault monitor attempts to restart the daemon.

   - If the null RPC call to the daemon terminates unexpectedly and Nfsd_nullrpc_restart=True, the fault monitor attempts to restart nfsd if the cluster file system is available.
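
The decision rules in the probe steps above can be summarized in a short Python sketch. This is not the fault monitor's implementation: the `Probe`, `Settings`, and `Action` types and every function name are hypothetical, the sleep step is omitted, and the sketch assumes that a reboot decision for rpcbind ends the cycle. It only models which action follows from each probe outcome.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    LOG_ONLY = "log to syslog"
    RESTART = "restart daemon"
    REBOOT_NODE = "reboot node"


@dataclass(frozen=True)
class Probe:
    """Outcome of probing one daemon (hypothetical structure)."""
    alive: bool        # the daemon process is present
    nullrpc_ok: bool   # the null RPC call completed normally


@dataclass(frozen=True)
class Settings:
    """Decision inputs; the first four fields mirror properties named in the text."""
    rpcbind_nullrpc_reboot: bool
    mountd_nullrpc_restart: bool
    nfsd_nullrpc_restart: bool
    failover_mode: str
    warm_restart_in_progress: bool  # assumption: the monitor can observe this
    cluster_fs_available: bool      # assumption: the monitor can observe this


def decide_rpcbind(p: Probe, s: Settings) -> Action:
    if not p.alive:
        # A warm restart in progress is tolerated; otherwise the node reboots.
        return Action.NONE if s.warm_restart_in_progress else Action.REBOOT_NODE
    if not p.nullrpc_ok and s.rpcbind_nullrpc_reboot and s.failover_mode == "HARD":
        return Action.REBOOT_NODE
    return Action.NONE


def decide_statd_lockd(statd: Probe, lockd: Probe) -> Action:
    if not (statd.alive and lockd.alive):
        return Action.RESTART   # both daemons are restarted together
    if not (statd.nullrpc_ok and lockd.nullrpc_ok):
        return Action.LOG_ONLY  # a null RPC failure is logged, with no restart
    return Action.NONE


def decide_restartable(p: Probe, nullrpc_restart: bool, s: Settings) -> Action:
    """Shared rule for mountd and nfsd."""
    if not p.alive:
        return Action.RESTART
    if not p.nullrpc_ok and nullrpc_restart and s.cluster_fs_available:
        return Action.RESTART
    return Action.NONE


def plan_cycle(probes: dict, s: Settings) -> dict:
    """Return the action for each probe step, in the order given in the text."""
    plan = {"rpcbind": decide_rpcbind(probes["rpcbind"], s)}
    if plan["rpcbind"] is Action.REBOOT_NODE:
        return plan  # assumption: a reboot pre-empts the remaining probes
    plan["statd+lockd"] = decide_statd_lockd(probes["statd"], probes["lockd"])
    plan["mountd"] = decide_restartable(probes["mountd"], s.mountd_nullrpc_restart, s)
    plan["nfsd"] = decide_restartable(probes["nfsd"], s.nfsd_nullrpc_restart, s)
    return plan
```

For example, with `Nfsd_nullrpc_restart` modeled as `False`, a failed null RPC call to a running nfsd yields `Action.NONE`, whereas the same failure for mountd with `Mountd_nullrpc_restart` modeled as `True` yields `Action.RESTART`.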

If any of the NFS daemons listed above (except rpcbind) fails to restart during a probe cycle, the NFS system fault monitor retries the restart in the next cycle. When all of the NFS daemons are restarted and healthy, the resource status is set to ONLINE. The monitor tracks unexpected terminations of NFS daemons during the last Retry_interval. When the total number of unexpected daemon terminations reaches Retry_count, the system fault monitor issues a scha_control giveover. If the giveover call fails, the monitor attempts to restart the failed NFS daemon.

At the end of each probe cycle, if all daemons are healthy, the monitor clears the history of failures.
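
The failure bookkeeping described above can be pictured as a sliding window over termination times. The following Python sketch is illustrative only: the `FailureHistory` class and its methods are hypothetical names, timestamps are assumed to be in the same unit as Retry_interval, and the sketch does not call scha_control itself. It only reports when a giveover would be due.

```python
from collections import deque


class FailureHistory:
    """Sliding window of unexpected daemon terminations (hypothetical helper)."""

    def __init__(self, retry_count: int, retry_interval: float):
        self.retry_count = retry_count
        self.retry_interval = retry_interval
        self._times = deque()  # termination timestamps, oldest first

    def record(self, now: float) -> bool:
        """Record one unexpected termination at time `now`.

        Returns True when the number of terminations within the last
        retry_interval has reached retry_count, meaning a giveover is due.
        """
        self._times.append(now)
        # Drop terminations that are older than the retry interval.
        while self._times and now - self._times[0] > self.retry_interval:
            self._times.popleft()
        return len(self._times) >= self.retry_count

    def end_of_cycle(self, all_healthy: bool) -> None:
        """Clear the history at the end of a cycle if every daemon is healthy."""
        if all_healthy:
            self._times.clear()

    def __len__(self) -> int:
        return len(self._times)
```

With `retry_count=3` and `retry_interval=300`, terminations at times 0, 100, and 350 do not trigger a giveover, because the first one has aged out by the time the third is recorded. A further termination at time 380 does, since three terminations then fall inside the window.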