Sun Cluster Concepts Guide for Solaris OS

Failfast Mechanism

The failfast mechanism detects a critical problem in either the global zone or a non-global zone on a node. The action that Sun Cluster takes when failfast detects a problem depends on which kind of zone the problem occurs in.

If the critical problem is located in the global zone, Sun Cluster forcibly shuts down the node. Sun Cluster then removes the node from cluster membership.

If the critical problem is located in a non-global zone, Sun Cluster reboots that non-global zone.

If a node loses connectivity with other nodes, the node attempts to form a cluster with the nodes with which communication is possible. If that set of nodes does not form a quorum, Sun Cluster software halts the node and “fences” the node from shared storage. See About Failure Fencing for details about this use of failfast.
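
The quorum check described above can be sketched as follows. This is only an illustration, not Sun Cluster code: it assumes that a quorum means a strict majority of the configured votes, that every node carries one vote, and that no quorum devices are involved. The names has_quorum and partition_action are hypothetical.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True if the reachable votes are a strict majority.

    Assumption: quorum means more than half of all configured votes.
    """
    return reachable_votes > total_votes // 2


def partition_action(reachable_nodes: int, total_nodes: int) -> str:
    """Decide what a node does after it loses connectivity.

    Assumption: one vote per node, no quorum devices.
    reachable_nodes counts this node plus the nodes it can still reach.
    """
    if has_quorum(reachable_nodes, total_nodes):
        return "form cluster"
    # Without quorum the node is halted and fenced from shared storage.
    return "halt and fence"


# In a three-node cluster, a node that still reaches one peer keeps running,
# while a fully isolated node is halted and fenced.
two_of_three = partition_action(2, 3)
one_of_three = partition_action(1, 3)
```

In an even split, such as two of four nodes, neither side holds a strict majority under these assumptions, so both sides would halt.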

If one or more cluster-specific daemons die, Sun Cluster software declares that a critical problem has occurred. Sun Cluster software runs cluster-specific daemons in both the global zone and in non-global zones. If the daemon died in the global zone, Sun Cluster shuts down the node and removes it from cluster membership. If the daemon died in a non-global zone, Sun Cluster reboots that zone.
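
The rule that maps the zone of a failed daemon to the failfast response can be summarized in a few lines. This is a minimal sketch for illustration; the function name failfast_action and the returned strings are hypothetical and do not correspond to any Sun Cluster interface.

```python
GLOBAL_ZONE = "global"  # zone name that appears in the failfast messages


def failfast_action(zone: str) -> str:
    """Return the response to a daemon death in the given zone."""
    if zone == GLOBAL_ZONE:
        # A failure in the global zone takes down the whole node.
        return "shut down node and remove it from cluster membership"
    # A failure in a non-global zone affects only that zone.
    return f"reboot non-global zone {zone}"


global_result = failfast_action("global")
zone_result = failfast_action("zone4")
```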

When a cluster-specific daemon that runs in a non-global zone fails, a message similar to the following is displayed on the console.


cl_runtime: NOTICE: Failfast: Aborting because "pmfd" died in zone "zone4" (zone id 3)
35 seconds ago.

When a cluster-specific daemon that runs in the global zone fails and the node panics, a message similar to the following is displayed on the console.


panic[cpu1]/thread=2a10007fcc0: Failfast: Aborting because "pmfd" died in zone "global" (zone id 0)
35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0
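
Both console messages above share the same "Failfast: Aborting because ..." text, so a log-monitoring script could extract the daemon name, zone, and elapsed time with one pattern. The following sketch assumes that real messages match the two samples shown here, which the guide describes only as "similar to" the actual output. The names FAILFAST_RE, FailfastEvent, and parse_failfast are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the two sample messages only (an assumption).
# \s+ before the seconds count tolerates the line wrap in the samples.
FAILFAST_RE = re.compile(
    r'Failfast: Aborting because "(?P<daemon>[^"]+)" died '
    r'in zone "(?P<zone>[^"]+)" \(zone id (?P<zone_id>\d+)\)'
    r'\s+(?P<seconds>\d+) seconds ago'
)


class FailfastEvent(NamedTuple):
    daemon: str
    zone: str
    zone_id: int
    seconds_ago: int


def parse_failfast(text: str) -> Optional[FailfastEvent]:
    """Return the first failfast event found in text, or None."""
    match = FAILFAST_RE.search(text)
    if match is None:
        return None
    return FailfastEvent(
        daemon=match.group("daemon"),
        zone=match.group("zone"),
        zone_id=int(match.group("zone_id")),
        seconds_ago=int(match.group("seconds")),
    )


sample = (
    'cl_runtime: NOTICE: Failfast: Aborting because "pmfd" died '
    'in zone "zone4" (zone id 3)\n35 seconds ago.'
)
event = parse_failfast(sample)
```

A zone id of 0 with the zone name "global" identifies the global-zone case, in which the node itself panics.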

After the panic, the node might reboot and attempt to rejoin the cluster. Alternatively, if the cluster is composed of SPARC based systems, the node might remain at the OpenBoot™ PROM (OBP) prompt. The next action of the node is determined by the setting of the auto-boot? parameter. You can set auto-boot? either with the eeprom command or at the OpenBoot PROM ok prompt. See the eeprom(1M) man page.