Sun Cluster 2.2 Software Installation Guide

Failure Fencing in Greater Than Two-Node Clusters

The difficulty with the SCSI-2 reservation model used in two-node clusters is that the SCSI reservations are host-specific. If a host has issued reservations on shared devices, it effectively shuts out every other node that can access the device, faulty or not. Consequently, this model breaks down when more than two nodes are connected to the multihost disks in a shared disk environment such as OPS.

For example, if one node hangs in a three-node cluster, the other two nodes reconfigure. However, neither of the surviving nodes can issue SCSI reservations to protect the underlying shared devices from the faulty node, as this action also shuts out the other surviving node. But without the reservations, the faulty node might revive and issue I/O to the shared devices, despite the fact that its view of the cluster is no longer current.

Consider a four-node cluster with storage devices directly accessible from all the cluster nodes. If one node hangs, and the other three nodes reconfigure, none of them can issue the reservations to protect the underlying devices from the faulty node, as the reservations will also prevent some of the valid cluster members from issuing any I/O to the devices. But without the reservations, we have the real danger of the faulty node reviving and issuing I/O to shared devices despite the fact that its view of the cluster is no longer current.
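
The exclusivity that causes this problem can be seen in a toy model of SCSI-2 reserve semantics. The model below is illustrative only; the class and node names are hypothetical, and the real behavior is enforced by the device itself, not by software like this.

# Toy model of SCSI-2 reserve semantics (hypothetical names, not Sun Cluster
# code): the reservation is held by exactly one host, and every other host is
# refused access, whether it is the faulty node or a healthy cluster member.

class SharedDevice:
    def __init__(self):
        self.reserved_by = None        # host currently holding the reservation

    def reserve(self, host):
        """SCSI-2 style reserve: succeeds only if no other host holds it."""
        if self.reserved_by in (None, host):
            self.reserved_by = host
            return True
        return False

    def issue_io(self, host):
        """I/O is rejected for every host other than the reservation holder."""
        return self.reserved_by is None or self.reserved_by == host

disk = SharedDevice()
disk.reserve("node1")                  # a surviving node fences the device...
print(disk.issue_io("node3"))          # False: the hung node is shut out,
print(disk.issue_io("node2"))          # False: ...but so is a healthy member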

Now consider the problem of split-brain situations. In a four-node cluster, a variety of interconnect failures are possible. We define a partition as a set of cluster nodes in which each node can communicate with every other member of that partition, but not with any cluster node outside the partition (see the sketch after this paragraph). Due to interconnect failures, two partitions can form with two nodes in each, or with three nodes in one partition and one node in the other. A four-node cluster can even degenerate into four partitions with one node in each. In all such cases, Sun Cluster attempts to arrive at a consistent distributed consensus on which partition should stay up and which should abort. Consider the following two cases.
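
A partition, in this sense, is simply a connected component of the private interconnect graph. The following sketch is illustrative only; the node names and link table are hypothetical, and this is not the CMM's actual membership algorithm.

# Illustrative only: compute partitions as the connected components of the
# private interconnect graph. Node names and links are hypothetical.

def partitions(nodes, links):
    """Group nodes that can still reach one another over the interconnect."""
    remaining, result = set(nodes), []
    while remaining:
        seed = remaining.pop()
        part, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            for a, b in links:
                other = b if a == current else a if b == current else None
                if other in remaining:
                    remaining.discard(other)
                    part.add(other)
                    frontier.append(other)
        result.append(sorted(part))
    return result

nodes = ["node0", "node1", "node2", "node3"]
working_links = [("node0", "node1"), ("node2", "node3")]      # a 2-2 split
print(partitions(nodes, working_links))
# e.g. [['node0', 'node1'], ['node2', 'node3']] (partition order may vary)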

Case 1. Two partitions, with two nodes in each partition. As in the case of the one-one split in a two-node cluster, the CMMs in neither partition have quorum to decide which partition should stay up and which should abort. To meet the goals of data integrity and high availability, the two partitions must not both stay up, nor must they both go down. As in the case of a two-node cluster, it is possible to adjudicate by means of an external device (the quorum disk). A designated node in each partition can race for the reservation on the designated quorum device, and whichever partition claims the reservation first is declared the winner. However, because of the nature of the SCSI-2 reservation, the node that successfully obtains the reservation on the quorum device also prevents the other node in its own partition from accessing the device. This model is not ideal, because the quorum device contains data useful to both nodes.
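
The tie-break race in Case 1 can be sketched as follows. This is an illustrative toy model, not Sun Cluster code; the node names and in-memory "reservation" stand in for the SCSI-2 reserve on the quorum device.

# Toy model of the Case 1 tie-break (hypothetical names, not Sun Cluster code):
# one designated node per partition races for a single reservation; the first
# to reserve wins, and the losing partition aborts.

quorum_reserved_by = None

def race_for_quorum(node):
    """First caller obtains the SCSI-2 style reservation; later callers lose."""
    global quorum_reserved_by
    if quorum_reserved_by is None:
        quorum_reserved_by = node
        return True
    return quorum_reserved_by == node

# 2-2 split: node0 races for partition {node0, node1},
#            node2 races for partition {node2, node3}.
print(race_for_quorum("node0"))   # True  -> partition {node0, node1} stays up
print(race_for_quorum("node2"))   # False -> partition {node2, node3} aborts
# Side effect noted in the text: the reservation now also blocks node1, the
# winner's own partner, from accessing the quorum device and its data.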

Case 2. Two partitions, with three nodes in the first partition and one node in the second. Though the majority partition in this case has adequate quorum, the crux of the problem is that the single isolated node has no knowledge of the activities of the other three nodes. Perhaps they formed a valid cluster and this node should abort. Or perhaps all three nodes actually failed, in which case the single isolated node must stay up to maintain availability. With total loss of communication and without an external device to mediate, it is impossible to decide. Racing for the reservation of a configured external quorum device leads to a situation worse than in Case 1. If one of the nodes in the majority partition reserves the quorum device, it excludes the other two nodes in its own partition from accessing the device. Worse, if the single isolated node wins the race for the reservation, the result can be the loss of three potentially healthy nodes from the cluster. Once again, the disk reservation solution does not work well.

The inability to use the disk reservation technique also leaves the system vulnerable to the formation of multiple independent clusters, each in its own isolated partition, in the presence of interconnect failures and operator errors. Consider Case 2 above: assume that the CMMs or some external entity somehow decides that the three nodes in the majority partition should stay up and the single isolated node should abort. Assume that at some later point in time the administrator attempts to start up the aborted node without repairing the interconnect. The node still would be unable to communicate with any of the surviving members and, thinking it is the only node in the cluster, would attempt to reserve the quorum device. It would succeed, because the three surviving nodes cannot hold the quorum reservation without locking one another out, so no reservation is in effect. The node would then form its own independent cluster with itself as the sole member.

Therefore, the simple quorum reservation scheme is unusable for three- and four-node clusters with storage devices directly accessible from all nodes. We need new techniques to solve the following three problems:

  1. How to resolve all split-brain situations in three- and four-node clusters?

  2. How to failure fence faulty nodes from shared devices?

  3. How to prevent isolated partitions from forming multiple independent clusters?

To handle these split-brain situations in three- and four-node clusters, a combination of heuristics and manual intervention is used, with the caveat that operator error during the manual intervention phase can destroy the integrity of the cluster. In "Quorum Devices (VxVM)", we discussed the policies that can be specified to determine behavior in the event of a cluster partition in clusters of more than two nodes. If you choose the interventionist policy, the CMMs in all partitions suspend all cluster operations in each partition while waiting for manual operator input as to which partition should continue to form a valid cluster and which partitions should abort. It is the operator's responsibility to let the desired partition continue and to abort all other partitions. Allowing more than one partition to form a valid cluster can result in data corruption.

If you choose the pre-deterministic policy, you are asked to designate a preferred node (the node with either the highest or the lowest node ID in the cluster). When a split-brain situation occurs, the partition containing the preferred node automatically becomes the new cluster, if it is able. All other partitions must be aborted manually. The selected quorum device is used solely to break the tie when a split-brain occurs between exactly two nodes. Note that this situation is possible even in a four-node cluster, if only two cluster members are active when the split-brain occurs. The quorum device still plays a role, but in a much more limited capacity.
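
The two policies can be summarized as a small decision function. The following sketch is illustrative only, under the assumptions stated above; the function and parameter names are hypothetical and are not part of the Sun Cluster 2.2 interfaces.

# Hypothetical sketch of the two split-brain policies described above; the
# names are illustrative and not Sun Cluster 2.2 interfaces.

def partition_decision(policy, partition, preferred_node=None):
    """Return what a partition should do when a split-brain is detected.

    policy         -- "interventionist" or "pre-deterministic"
    partition      -- set of node IDs in this partition
    preferred_node -- node ID designated in advance for the pre-deterministic
                      policy (the highest or lowest node ID in the cluster)
    """
    if policy == "interventionist":
        # Every partition suspends cluster operations and waits for the
        # operator to continue exactly one partition and abort the rest.
        return "wait for operator (abortpartition or continuepartition)"
    if policy == "pre-deterministic":
        if preferred_node in partition:
            return "continue automatically as the new cluster"
        return "must be aborted manually"
    raise ValueError("unknown policy: %r" % policy)

# A 3-1 split with node 0 designated as the preferred node:
print(partition_decision("pre-deterministic", {0, 1, 2}, preferred_node=0))
print(partition_decision("pre-deterministic", {3}, preferred_node=0))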

Once a partition has been selected to stay up, the next question is how to effectively protect the data from other partitions that should have aborted. Even though we require the operator to abort all other partitions, the command to abort a partition may not succeed immediately, and without an effective failure fencing mechanism there is always the danger of hung nodes reviving and issuing pending I/O to shared devices before processing the abort request. For this reason, the faulty nodes are reset before a valid cluster is formed in any partition.

To prevent a failed node from reviving and issuing I/O to the multihost disks, the faulty node is forcefully terminated by one of the surviving nodes. It is taken down to the OpenBoot PROM through the Terminal Concentrator or System Service Processor (on Sun Enterprise 10000 systems), and the hung image of the operating system is terminated. This termination prevents you from accidentally resuming the system by typing go at the OpenBoot PROM prompt. The surviving cluster members wait for a positive acknowledgment of the termination before proceeding with cluster reconfiguration.

If there is no response to the termination command, the hardware power sequencer (if present) is tripped to power cycle the faulty node. If the power cycle is also unsuccessful, the system displays the following message, requesting operator input before cluster reconfiguration continues:


*** ISSUE ABORTPARTITION OR CONTINUEPARTITION ***
You must ensure that the unreachable node no longer has access to the shared data.
You may allow the proposed cluster to form after you have ensured that the
unreachable node has aborted or is down.

Proposed cluster partition: 

/opt/SUNWcluster/bin/scadmin continuepartition localnode clustername


Caution -

You should ensure that the faulty node has been successfully terminated before issuing the scadmin continuepartition command on the surviving nodes.
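
The escalation just described (forceful termination through the Terminal Concentrator or SSP, then a power cycle through the hardware power sequencer, then a prompt to the operator) can be sketched as follows. The helper functions are hypothetical placeholders for the hardware-specific operations; this is not the actual reconfiguration code.

# Hypothetical sketch of the fencing escalation described above. The helper
# functions stand in for the hardware-specific operations and are not real
# Sun Cluster 2.2 interfaces.

def terminate_via_console(node):
    """Drop the node to the OpenBoot PROM via the Terminal Concentrator or SSP
    and kill the hung OS image; return True on a positive acknowledgment."""
    return False          # placeholder

def trip_power_sequencer(node):
    """Power cycle the node through the hardware power sequencer."""
    return False          # placeholder

def fence_faulty_node(node, has_power_sequencer):
    # Step 1: forceful termination; the surviving members wait for a positive
    # acknowledgment before proceeding with cluster reconfiguration.
    if terminate_via_console(node):
        return "terminated"
    # Step 2: no response to the termination command, so power cycle the node.
    if has_power_sequencer and trip_power_sequencer(node):
        return "power cycled"
    # Step 3: neither step worked; the operator must confirm the node is down,
    # then issue scadmin continuepartition (or abortpartition) as appropriate.
    return "ISSUE ABORTPARTITION OR CONTINUEPARTITION"

print(fence_faulty_node("node3", has_power_sequencer=False))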


Partitioned, isolated, and terminated nodes do eventually boot up. If, due to some oversight, the administrator tries to join such a node into the cluster without repairing the interconnects, the node must be prevented from forming a valid cluster partition of its own while it is unable to communicate with the existing cluster.

Assume a case where two partitions are formed, with three nodes in one partition and one node in the other. A designated node in the majority partition terminates the isolated node, and the three nodes form a valid cluster in their own partition. The isolated node, on booting up, tries to form a cluster of its own because an administrator runs the startcluster(1M) command and replies in the affirmative when asked for confirmation. Because the isolated node believes it is the only node in the cluster, it tries to reserve the quorum device, and it actually succeeds, because none of the three nodes in the valid partition can hold the reservation on the quorum device without locking each other out.

To resolve this problem, Sun Cluster 2.2 uses a nodelock: a designated cluster node opens a telnet(1) session to an unused port on the Terminal Concentrator as part of its cluster reconfiguration and keeps this session alive for as long as it is a cluster member. If this node leaves the membership, the nodelock is passed to one of the remaining cluster members. In the example above, if the isolated node were to try to form its own cluster, it would try to acquire this lock and fail, because one of the nodes in the existing membership (in the other partition) would be holding the lock. If a valid cluster member is unable to acquire the lock for any reason, this is not a fatal error, but it is logged as an error requiring immediate attention. The locking facility should be considered a safety feature rather than a mechanism critical to the operation of the cluster, and its failure should not be considered catastrophic. To detect faults in this area quickly, processes in the Sun Cluster framework monitor whether the Terminal Concentrator is accessible from the cluster nodes.
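
In effect, the nodelock amounts to holding a single connection open to an otherwise unused Terminal Concentrator port. A minimal sketch follows, assuming the Terminal Concentrator accepts only one connection per port; the host name and port number are hypothetical, and this is not the Sun Cluster 2.2 implementation.

# Minimal nodelock sketch, assuming the Terminal Concentrator accepts only one
# telnet connection per port. The host name and port number are hypothetical;
# this is not the Sun Cluster 2.2 implementation.

import socket

class NodeLock:
    def __init__(self, tc_host="cluster-tc", tc_port=8001):
        self.addr = (tc_host, tc_port)
        self.sock = None

    def acquire(self):
        """Hold the lock by keeping a connection to the unused TC port open."""
        try:
            self.sock = socket.create_connection(self.addr, timeout=10)
            return True
        except OSError:
            # Refused or busy: another cluster member (possibly in another
            # partition) already holds the lock. Per the text, this is logged
            # as an error needing attention rather than treated as fatal.
            return False

    def release(self):
        """Drop the lock, for example when passing it to another member."""
        if self.sock is not None:
            self.sock.close()
            self.sock = None

lock = NodeLock()
if not lock.acquire():
    print("nodelock held elsewhere; refuse to form an independent cluster")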