Failfast Mechanism (Sun Cluster Concepts Guide for Solaris OS)

Sun Cluster Concepts Guide for Solaris OS

Failfast Mechanism

If the CMM detects a critical problem with a node, it notifies the cluster framework to forcibly shut down (panic) the node and to remove it from the cluster membership. The mechanism by which this occurs is called failfast. Failfast causes a node to shut down in two ways.

If a node leaves the cluster and then attempts to start a new cluster without having quorum, it is “fenced” from accessing the shared disks. See About Failure Fencing for details about this use of failfast.
If one or more cluster-specific daemons die (clexecd, rpc.pmfd, rgmd, or rpc.ed) the failure is detected by the CMM and the node panics.

When the death of a cluster daemon causes a node to panic, a message similar to the following is displayed on the console for that node.

panic[cpu0]/thread=40e60: Failfast: Aborting because "pmfd" died 35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0

After the panic, the node might reboot and attempt to rejoin the cluster. Alternatively, if the cluster is composed of SPARC based systems, the node might remain at the OpenBoot^TM PROM (OBP) prompt. The next action of the node is determined by the setting of the auto-boot? parameter. You can set auto-boot? with eeprom(1M), at the OpenBoot PROM ok prompt.