A Monitored Daemon Fails Causing a Node to Reboot at Run Time

When a monitored daemon fails, the Daemon Monitor triggers a recovery response. The recovery response is often to restart the failed daemon. If the daemon fails to restart correctly, the Daemon Monitor reboots the node. The failure of a monitored daemon is the most common cause of a node reboot.
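The escalation described above can be sketched as a two-step decision. This is an illustrative model only, not the Daemon Monitor's implementation: the function name `recovery_response` and the `restart` callback are hypothetical, and the real set of recovery responses is the one listed in the nhpmd man page.

```python
from typing import Callable


def recovery_response(restart: Callable[[], bool]) -> str:
    """Model the restart-then-reboot escalation for one failed daemon.

    `restart` is a hypothetical callback that returns True when the
    daemon came back up correctly and False otherwise.
    """
    # First response: try to restart the failed daemon.
    if restart():
        return "daemon restarted"
    # The restart did not succeed, so escalate to a node reboot.
    return "node reboot"


if __name__ == "__main__":
    print(recovery_response(lambda: True))
    print(recovery_response(lambda: False))
```

The sketch models only the common case in which the first response is a restart.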

If the system recovers correctly, the daemon's core file and error message might be the only evidence of the failure. You must take the failure seriously even though the system has recovered.

For a list of recovery responses made by the Daemon Monitor, see the nhpmd(1M) man page. For a summary of the causes of daemon failure during startup, see A Monitored Daemon Fails Causing a Master-Eligible Node to Reboot at Startup and A Monitored Daemon Fails Causing a Diskless Node or Dataless Node to Reboot at Startup.

TABLE 6-1 and TABLE 6-2 summarize the events that can cause a monitored daemon to fail at run time. To recover from daemon failure, perform the procedure in To Recover From Daemon Failure.

**TABLE 6-1 Possible Causes of Daemon Failure at Run Time**
| Failed Daemon | Possible Cause at Run Time |
| --- | --- |
| `nhcmmd` | The `nhcmmd` daemon was killed. |
| `nhcmmd` | The failing node does not see its presence in the `cluster_nodes_table`. |
| `nhprobed` | The `nhprobed` daemon was killed. |

**TABLE 6-2 Causes of Daemon Failure on Master-Eligible Nodes During Failover or Switchover**
| Failed Daemon | Possible Cause During Failover or Switchover |
| --- | --- |
| `nhcrfsd` | The `nhcrfsd` daemon was killed during the failover or switchover. |
| `nhcmmd` | The node cannot connect to the `nhprobed` daemon. |
| `nhprobed` | The node cannot create the required threads, sockets, or pipe. |
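When triaging a failure, it can help to have both tables available as a single lookup keyed by daemon name. The following is a minimal sketch that copies the table entries verbatim. The names `RUN_TIME_CAUSES`, `FAILOVER_CAUSES`, and `possible_causes` are illustrative and are not part of the Foundation Services software.

```python
from typing import Dict, List

# Entries copied from TABLE 6-1 (run time).
RUN_TIME_CAUSES: Dict[str, List[str]] = {
    "nhcmmd": [
        "The nhcmmd daemon was killed.",
        "The failing node does not see its presence in the cluster_nodes_table.",
    ],
    "nhprobed": ["The nhprobed daemon was killed."],
}

# Entries copied from TABLE 6-2 (failover or switchover on master-eligible nodes).
FAILOVER_CAUSES: Dict[str, List[str]] = {
    "nhcrfsd": ["The nhcrfsd daemon was killed during the failover or switchover."],
    "nhcmmd": ["The node cannot connect to the nhprobed daemon."],
    "nhprobed": ["The node cannot create the required threads, sockets, or pipe."],
}


def possible_causes(daemon: str) -> List[str]:
    """Return every cause listed for `daemon` in either table."""
    return RUN_TIME_CAUSES.get(daemon, []) + FAILOVER_CAUSES.get(daemon, [])


if __name__ == "__main__":
    for cause in possible_causes("nhcmmd"):
        print(cause)
```

A daemon that appears in neither table yields an empty list, which means only that the tables do not cover it.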

To Recover From Daemon Failure

Examine the core file produced by the failed daemon.

The core file is located in the `/var/tmp/SUNWcgha/core` directory on the Solaris OS and in the `/var/sun/nhas/core` directory on Linux. The file name has the format `core.node_name.executable_file_name.process_ID.time`.

For more information about core dumps, see the coreadm(1M) man page on Solaris or the coreadm(8) man page on Linux.
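The core file name format can be split into its fields programmatically, which is convenient when several core files have accumulated. This is a sketch under stated assumptions: it assumes that the executable name, process ID, and time fields contain no dots (any extra dots are attributed to the node name), and it keeps the time field as a string because its encoding is not described here. The names `CoreFileName` and `parse_core_file_name`, and the sample file name, are hypothetical.

```python
from typing import NamedTuple


class CoreFileName(NamedTuple):
    node_name: str
    executable: str
    process_id: int
    time: str  # kept as text; the time encoding is not assumed


def parse_core_file_name(file_name: str) -> CoreFileName:
    """Split core.node_name.executable_file_name.process_ID.time into fields."""
    parts = file_name.split(".")
    if len(parts) < 5 or parts[0] != "core":
        raise ValueError(f"not a recognized core file name: {file_name!r}")
    # Assumption: only the node name may itself contain dots.
    *node_parts, executable, pid, time = parts[1:]
    return CoreFileName(".".join(node_parts), executable, int(pid), time)


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(parse_core_file_name("core.node1.nhcmmd.1234.1200000000"))
```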

Examine the system log files for an error message produced by the failed daemon.

For example, the following error message is produced by the failure of a daemon launched by the rpc nametag:
```
[ID 615790 local0.notice] "rpc" Failed to stay up.
```
For information about which nametag launches which daemon, see the nhpmd(1M) man page.
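To pull the nametag out of messages like the one above, a small regular expression is sufficient. The sketch below assumes that every such message has the same shape as the example (a quoted nametag followed by `Failed to stay up.`). The helper name `failed_nametag` is illustrative.

```python
import re
from typing import Optional

# Matches a quoted nametag followed by the text shown in the example message.
FAILED_TO_STAY_UP = re.compile(r'"(?P<nametag>[^"]+)" Failed to stay up\.')


def failed_nametag(line: str) -> Optional[str]:
    """Return the nametag from a 'Failed to stay up.' message, or None."""
    match = FAILED_TO_STAY_UP.search(line)
    return match.group("nametag") if match else None


if __name__ == "__main__":
    sample = '[ID 615790 local0.notice] "rpc" Failed to stay up.'
    print(failed_nametag(sample))
```

The returned nametag can then be looked up in the nhpmd man page to find the daemon that it launches.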

Identify the cause of the daemon failure.

Use the information obtained from the core file (Step 1) and the system log files (Step 2), together with TABLE 6-1 and TABLE 6-2.

Fix the underlying problem, if necessary.

Confirm that the recovery procedure has been carried out by searching the system log files for local0 information.
- If your system log file is not configured for local0 information, reconfigure it.
  
  For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.
- If local0 information is logged to a file, search the file for the string nhpmd.
  
  Lines containing the string nhpmd describe the recovery response performed by the Daemon Monitor.
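The search for nhpmd entries can be scripted as a simple line filter. This is a minimal sketch: the function name `nhpmd_lines` is illustrative, the sample lines are placeholders rather than real log output, and the location of the file that receives local0 information depends on your system log configuration.

```python
from typing import Iterable, List


def nhpmd_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the lines that contain the string nhpmd, without trailing newlines."""
    return [line.rstrip("\n") for line in log_lines if "nhpmd" in line]


if __name__ == "__main__":
    # Placeholder lines for illustration; real log entries will differ.
    sample = [
        "placeholder entry from another component\n",
        "placeholder entry mentioning nhpmd\n",
    ]
    for line in nhpmd_lines(sample):
        print(line)
```

To filter a real log, pass an open file object instead of the sample list, for example `nhpmd_lines(open(path))`, where `path` is the file that your configuration uses for local0 information.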
