CHAPTER 6

Recovering From Node Reboot at Run Time

For information about the causes of node reboot at run time, see A Monitored Daemon Fails Causing a Node to Reboot at Run Time.


A Monitored Daemon Fails Causing a Node to Reboot at Run Time

When a monitored daemon fails, the Daemon Monitor triggers a recovery response. The recovery response is often to restart the failed daemon. If the daemon fails to restart correctly, the Daemon Monitor reboots the node. The failure of a monitored daemon is the most common cause of a node reboot.

If the system recovers correctly, the core file and the error message produced by the daemon might be the only evidence of the failure. Investigate the failure even though the system has recovered.

For a list of recovery responses made by the Daemon Monitor, see the nhpmd(1M) man page. For a summary of the causes of daemon failure during startup, see A Monitored Daemon Fails Causing a Master-Eligible Node to Reboot at Startup and A Monitored Daemon Fails Causing a Diskless Node or Dataless Node to Reboot at Startup.

TABLE 6-1 and TABLE 6-2 summarize the events that can cause a monitored daemon to fail at run time. To recover from daemon failure, perform the procedure in To Recover From Daemon Failure.


TABLE 6-1   Possible Causes of Daemon Failure at Run Time

Failed Daemon   Possible Cause at Run Time
nhcmmd          The nhcmmd daemon was killed.
                The failing node does not see its presence in the cluster_nodes_table.
nhprobed        The nhprobed daemon was killed.


TABLE 6-2   Causes of Daemon Failure on Master-Eligible Nodes During Failover or Switchover

Failed Daemon   Possible Cause During Failover or Switchover
nhcrfsd         The nhcrfsd daemon was killed during the failover or switchover.
nhcmmd          The node cannot connect to the nhprobed daemon.
nhprobed        The node cannot create the required threads, sockets, or pipe.
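
The two tables can also be kept as a small lookup structure when triaging a failure. The following is a minimal sketch only, not part of the product: the names RUN_TIME_CAUSES, FAILOVER_CAUSES, and possible_causes are hypothetical, and the entries are transcribed from TABLE 6-1 and TABLE 6-2.

```python
# Possible causes transcribed from TABLE 6-1 (run time).
RUN_TIME_CAUSES = {
    "nhcmmd": [
        "The nhcmmd daemon was killed.",
        "The failing node does not see its presence in the cluster_nodes_table.",
    ],
    "nhprobed": ["The nhprobed daemon was killed."],
}

# Possible causes transcribed from TABLE 6-2 (master-eligible nodes,
# during failover or switchover).
FAILOVER_CAUSES = {
    "nhcrfsd": ["The nhcrfsd daemon was killed during the failover or switchover."],
    "nhcmmd": ["The node cannot connect to the nhprobed daemon."],
    "nhprobed": ["The node cannot create the required threads, sockets, or pipe."],
}


def possible_causes(daemon, during_failover=False):
    """Return candidate causes for a failed daemon (hypothetical helper).

    The run-time causes are always returned. The failover or switchover
    causes are appended only when during_failover is True.
    """
    causes = list(RUN_TIME_CAUSES.get(daemon, []))
    if during_failover:
        causes += FAILOVER_CAUSES.get(daemon, [])
    return causes


if __name__ == "__main__":
    for cause in possible_causes("nhcmmd", during_failover=True):
        print(cause)
```

Appending the TABLE 6-2 causes to the TABLE 6-1 causes is a design choice of this sketch, made on the assumption that a failover or switchover is itself a run-time event. A daemon that appears in neither table yields an empty list.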

To Recover From Daemon Failure

  1. Examine the core file produced by the failed daemon.

    The core file is located in the /var/tmp/SUNWcgha/core directory on the Solaris OS and in the /var/sun/nhas/core directory on Linux. The file name has the following format:

    core.node_name.executable_file_name.process_ID.time

    For more information about core dumps, see the coreadm(1M) man page on the Solaris OS or the coreadm(8) man page on Linux.

  2. Examine the system log files for an error message produced by the failed daemon.

    For example, the following error message is produced by the failure of a daemon launched by the rpc nametag:

    [ID 615790 local0.notice] "rpc" Failed to stay up.

    For information about which nametag launches which daemon, see the nhpmd(1M) man page.

  3. Identify the cause of the daemon failure.

    Use the information obtained in Step 1, Step 2, TABLE 6-1, and TABLE 6-2.

  4. Fix the underlying problem, if necessary.

  5. Confirm that the recovery procedure has been carried out by searching the system log files for local0 information.

    • If your system log file is not configured for local0 information, reconfigure it.

      For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

    • If local0 information is logged to a file, search the file for the string nhpmd.

      Lines containing the string nhpmd describe the recovery response performed by the Daemon Monitor.
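
Parts of Step 1 and Step 5 can be scripted. The following is a minimal sketch under stated assumptions, not a supported tool: the function names parse_core_name and recovery_lines are hypothetical, the executable file name is assumed to contain no periods, and the time field is kept as an opaque string because its format is not described here. The log file path is supplied by the caller, because it depends on how local0 is configured on your system.

```python
import sys


def parse_core_name(file_name):
    """Split core.node_name.executable_file_name.process_ID.time into fields.

    Assumption: the executable file name, process ID, and time fields
    contain no periods, so the last three periods delimit them.
    """
    prefix = "core."
    if not file_name.startswith(prefix):
        raise ValueError("not a core file name: %r" % file_name)
    parts = file_name[len(prefix):].rsplit(".", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError("unexpected core file name: %r" % file_name)
    node_name, executable, process_id, time_field = parts
    return {
        "node_name": node_name,
        "executable_file_name": executable,
        "process_id": process_id,
        "time": time_field,
    }


def recovery_lines(log_lines):
    """Return the log lines that contain the string nhpmd."""
    return [line.rstrip("\n") for line in log_lines if "nhpmd" in line]


if __name__ == "__main__":
    # Usage: python script.py <path to the file that receives local0 messages>
    with open(sys.argv[1], errors="replace") as log_file:
        for line in recovery_lines(log_file):
            print(line)
```

The plain substring test mirrors the instruction to search for the string nhpmd. It makes no assumption about the layout of the rest of the log line.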