Failfast Mechanism (Sun Cluster Concepts Guide for Solaris OS)

Failfast Mechanism

The failfast mechanism detects a critical problem on either a global-cluster voting node or a global-cluster non-voting node. The action that Sun Cluster takes when failfast detects a problem depends on whether the problem occurs in a voting node or a non-voting node.

If the critical problem is located in a voting node, Sun Cluster forcibly shuts down the node. Sun Cluster then removes the node from cluster membership.

If the critical problem is located in a non-voting node, Sun Cluster reboots that non-voting node.
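The two cases above can be summarized in a minimal sketch. This is an illustration of the documented behavior only, not Sun Cluster code; the names NodeType and failfast_actions are hypothetical.

```python
from enum import Enum


class NodeType(Enum):
    """Hypothetical labels for the two kinds of global-cluster node."""
    VOTING = "voting"
    NON_VOTING = "non-voting"


def failfast_actions(node_type: NodeType) -> list[str]:
    """Return, in order, the actions the text describes for a critical problem."""
    if node_type is NodeType.VOTING:
        # A voting node is forcibly shut down, then dropped from membership.
        return ["forcibly shut down node", "remove node from cluster membership"]
    # A non-voting node is only rebooted.
    return ["reboot non-voting node"]
```

For example, failfast_actions(NodeType.NON_VOTING) yields a single reboot action, while the voting case yields the shutdown followed by the membership removal.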

If a node loses connectivity with other nodes, the node attempts to form a cluster with the nodes with which communication is possible. If that set of nodes does not form a quorum, Sun Cluster software halts the node and “fences” it from the shared disks, that is, prevents it from accessing them.
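The quorum decision can be sketched as follows. The sketch assumes that a quorum means a strict majority of the configured votes; the function names and the vote counts are hypothetical, and the real software also accounts for details that this page does not describe.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Assumption: quorum is a strict majority of all configured votes."""
    return 2 * reachable_votes > total_votes


def partition_outcome(reachable_votes: int, total_votes: int) -> str:
    """Describe what happens to a node after it loses connectivity."""
    if has_quorum(reachable_votes, total_votes):
        return "form cluster with reachable nodes"
    # Without quorum the node is halted and fenced from the shared disks.
    return "halt node and fence it from shared disks"
```

Under this assumption, a partition that holds 2 of 4 votes does not have a quorum, because exactly half is not a majority.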

You can turn off fencing for selected disks or for all disks.

Caution: If you turn off fencing under the wrong circumstances, your data can be vulnerable to corruption during application failover. Examine this possibility carefully before you turn off fencing. Turn off fencing if your shared storage device, such as a Serial Advanced Technology Attachment (SATA) disk, does not support the SCSI protocol, or if you want to allow access to the cluster's storage from hosts outside the cluster.

If one or more cluster-specific daemons die, Sun Cluster software declares that a critical problem has occurred. Sun Cluster software runs cluster-specific daemons on both voting nodes and non-voting nodes. If a critical problem occurs, Sun Cluster either shuts down the voting node and removes it from cluster membership or reboots the non-voting node, depending on where the problem occurred.

When a cluster-specific daemon that runs on a non-voting node fails, a message similar to the following is displayed on the console.

cl_runtime: NOTICE: Failfast: Aborting because "pmfd" died in zone "zone4" (zone id 3)
35 seconds ago.

When a cluster-specific daemon that runs on a voting node fails and the node panics, a message similar to the following is displayed on the console.

panic[cpu1]/thread=2a10007fcc0: Failfast: Aborting because "pmfd" died in zone "global" (zone id 0)
35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0
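When scanning console logs for these messages, a small parser can help. The following is a minimal sketch that assumes the message wording matches the two samples above; the pattern, the function name parse_failfast, and the returned field names are hypothetical, and real messages might differ.

```python
import re

# Pattern derived only from the two sample messages shown above.
FAILFAST_RE = re.compile(
    r'Failfast: Aborting because "(?P<daemon>[^"]+)" died in zone '
    r'"(?P<zone>[^"]+)" \(zone id (?P<zone_id>\d+)\)\s+'
    r'(?P<seconds>\d+) seconds ago'
)


def parse_failfast(text: str):
    """Return the failfast details found in text, or None if there are none."""
    match = FAILFAST_RE.search(text)
    if match is None:
        return None
    return {
        "daemon": match["daemon"],
        "zone": match["zone"],
        "zone_id": int(match["zone_id"]),
        "seconds_ago": int(match["seconds"]),
        # Assumption: zone "global" marks a voting node, as in the panic sample.
        "voting_node": match["zone"] == "global",
    }
```

The \s+ between the zone ID and the elapsed time lets the pattern match whether or not the console wraps the message onto a second line.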

After the panic, the Solaris host might reboot and the node might attempt to rejoin the cluster. Alternatively, if the cluster is composed of SPARC based systems, the host might remain at the OpenBoot PROM (OBP) prompt. The setting of the auto-boot? parameter determines which of these actions the host takes. You can set auto-boot? with the eeprom command or at the OpenBoot PROM ok prompt. See the eeprom(1M) man page.
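If you script this setting, the eeprom invocation can be built as in the following sketch. It assumes the parameter=value form that the eeprom(1M) man page describes; the helper names are hypothetical, and the command must run with sufficient privileges on the Solaris host.

```python
import subprocess


def eeprom_autoboot_argv(enabled: bool) -> list[str]:
    """Build the argument list for: eeprom auto-boot?=true|false."""
    value = "true" if enabled else "false"
    return ["eeprom", f"auto-boot?={value}"]


def set_autoboot(enabled: bool) -> None:
    """Run eeprom; raises CalledProcessError if the command fails."""
    subprocess.run(eeprom_autoboot_argv(enabled), check=True)
```

Passing the arguments as a list avoids a shell, so the ? character in the parameter name is not treated as a wildcard.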