NFS Fault Monitor Process (Sun Cluster 3.1 Data Service for Network File System (NFS))

NFS Fault Monitor Process

The system fault monitor probes rpcbind, statd, lockd, nfsd, and mountd to check that each process is present and that it responds to a null rpc call. This monitor uses the following NFS extension properties.

`Rpcbind_nullrpc_timeout`	`Lockd_nullrpc_timeout`
`Nfsd_nullrpc_timeout`	`Rpcbind_nullrpc_reboot`
`Mountd_nullrpc_timeout`	`Nfsd_nullrpc_restart`
`Statd_nullrpc_timeout`	`Mountd_nullrpc_restart`

See Configuring Sun Cluster HA for NFS Extension Properties to review or set extension properties.

If a daemon needs to be stopped, the calling method sends a kill signal to the process ID (PID) and waits to verify that the PID disappears. The amount of time that the calling method waits is a fraction of the method's timeout. If the process does not stop within that period of time, the fault monitor assumes that the process failed.

Note –

If the process needs more time to stop, increase the timeout of the method that was running when the process was sent the kill signal.
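The stop-and-wait behavior described above can be sketched as follows. This is a minimal illustration, not the data service's actual implementation: the function names, the use of `SIGKILL`, the `fraction` value of 0.5, and the polling interval are all assumptions, because the source says only that a kill signal is sent and that the wait is "a fraction" of the method's timeout.

```python
import os
import signal
import time


def pid_exists(pid):
    """Return True if a process with this PID is still present."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without killing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_daemon(pid, method_timeout, fraction=0.5, poll_interval=0.1,
                send_signal=os.kill, exists=pid_exists):
    """Send a kill signal, then wait for the PID to disappear.

    The wait is limited to fraction * method_timeout seconds (the
    fraction is an assumed value). Returns True if the PID disappeared
    in time, False if the process is assumed to have failed to stop.
    """
    deadline = time.monotonic() + method_timeout * fraction
    try:
        send_signal(pid, signal.SIGKILL)  # assumed signal
    except ProcessLookupError:
        return True  # the process was already gone
    while time.monotonic() < deadline:
        if not exists(pid):
            return True
        time.sleep(poll_interval)
    return not exists(pid)
```

Raising `method_timeout` in this sketch lengthens the wait proportionally, which mirrors the advice in the note to increase the timeout of the method that was running.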

After the daemon is stopped, the fault monitor restarts the daemon and waits until the daemon is registered under RPC. If a new RPC handle can be created, the status of the daemon is reported internally to the fault monitor as a success. If the RPC handle cannot be created, the status of the daemon is returned to the fault monitor as unknown, and no error messages are printed.
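The wait for RPC registration might look like the sketch below. The `create_handle` callable is hypothetical and stands in for whatever the monitor uses to create an RPC handle. The explicit `timeout` and `poll_interval` parameters are also assumptions, because the source does not say how long the monitor waits.

```python
import time


def wait_for_rpc_registration(create_handle, timeout, poll_interval=1.0):
    """Poll until an RPC handle can be created or the timeout expires.

    create_handle is a hypothetical callable that returns a handle
    object, or None while the daemon is not yet registered under RPC.
    Returns "success" or "unknown", matching the two outcomes in the text.
    """
    deadline = time.monotonic() + timeout
    while True:
        if create_handle() is not None:
            return "success"
        if time.monotonic() >= deadline:
            return "unknown"  # reported silently; no error message is printed
        time.sleep(poll_interval)
```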

Each system fault-monitor probe cycle performs the following steps in a loop.

Sleeps for Cheap_probe_interval.

Probes rpcbind.

If the process fails and Failover_mode=HARD, then the fault monitor reboots the node.

If a null rpc call to the daemon fails, Rpcbind_nullrpc_reboot=True, and Failover_mode=HARD, then the fault monitor reboots the node.

Probes statd and lockd.

If statd or lockd fails, the fault monitor attempts to restart both daemons. If the fault monitor cannot restart the daemons, all of the NFS resources fail over to another node.

If a null rpc call to these daemons fails, the fault monitor logs a message to syslog but does not restart statd or lockd.

Probes mountd.

If mountd fails, the fault monitor attempts to restart the daemon.

If the kstat counter, nfs_server:calls, is not increasing, the following actions occur.
1. A null rpc call is sent to mountd.
2. If the null rpc call fails and Mountd_nullrpc_restart=True, the fault monitor attempts to restart mountd if the cluster file system is available.
3. If the fault monitor cannot restart mountd and the number of failures reaches Retry_count, then all of the NFS resources fail over to another node.

Probes nfsd.

If nfsd fails, the fault monitor attempts to restart the daemon.

If the kstat counter, nfs_server:calls, is not increasing, the following actions occur.
1. A null rpc call is sent to nfsd.
2. If the null rpc call fails and Nfsd_nullrpc_restart=True, then the fault monitor attempts to restart nfsd.
3. If the fault monitor cannot restart nfsd and the number of failures reaches Retry_count, then all of the NFS resources fail over to another node.
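The mountd and nfsd steps in the loop above follow the same pattern, which the sketch below expresses as two small decision functions. All names and return strings are illustrative assumptions. The `"no_action"` result for a disabled restart property, and the way failures are counted against `Retry_count`, are interpretations, because the source does not spell those cases out. The `filesystem_available` argument models the cluster file system check that the text mentions only for mountd.

```python
def probe_action(process_running, calls_increasing, null_rpc_ok,
                 nullrpc_restart, filesystem_available=True):
    """Decide what to do for mountd or nfsd in one probe cycle.

    null_rpc_ok is a callable so that the null rpc call is made only
    when the nfs_server:calls counter has stopped increasing.
    nullrpc_restart is Mountd_nullrpc_restart or Nfsd_nullrpc_restart.
    """
    if not process_running:
        return "restart"
    if calls_increasing:
        return "healthy"  # requests are being served; no null rpc call needed
    if null_rpc_ok():
        return "healthy"
    if nullrpc_restart and filesystem_available:
        return "restart"
    return "no_action"  # assumed outcome; not described in the source


def restart_outcome(restart_succeeded, failures, retry_count):
    """Return (new_failure_count, action) after a restart attempt."""
    if restart_succeeded:
        return failures, "continue"
    failures += 1
    if failures >= retry_count:
        return failures, "failover"  # all NFS resources move to another node
    return failures, "continue"
```

For example, `probe_action(True, False, lambda: False, True)` returns `"restart"`, and a failed restart that brings the failure count up to `retry_count` yields `"failover"`.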

If any of the NFS daemons fails to restart, the status of all of the online NFS resources is set to FAULTED. When all of the NFS daemons are restarted and healthy, the resource status is set to ONLINE again.
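The status rule above reduces to a single check across all daemons. The sketch below assumes a hypothetical mapping from daemon name to a health flag; it only illustrates the rule and is not the cluster's status API.

```python
def resource_status(daemon_healthy):
    """Return the status for all online NFS resources.

    daemon_healthy maps each daemon name to True if the daemon is
    restarted and healthy, or False if it failed to restart.
    """
    return "ONLINE" if all(daemon_healthy.values()) else "FAULTED"
```

A single daemon that fails to restart is enough to mark every online NFS resource as FAULTED, and the status returns to ONLINE only when every flag is true again.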