Sun Cluster 2.2 Software Installation Guide

1.3.4 Failure Fencing

In any clustering system, once a node is no longer in the cluster, it must be prevented from continuing to write to the multihost disks; otherwise, data corruption could ensue. The surviving nodes of the cluster must be able to read from and write to the multihost disks. If the node that is no longer in the cluster continues to write to the multihost disks, its writes will conflict with, and ultimately corrupt, the updates that the surviving nodes are performing.

Preventing a node that is no longer in the cluster from writing to the disks is called failure fencing. Failure fencing protects data integrity by preventing an isolated node from coming up in its own partition as a separate cluster while the actual cluster exists in a different partition.


Caution -

It is very important to prevent the faulty node from performing I/O, because the two cluster nodes now have very different views of the cluster. The faulty node's view includes both cluster members (because it has not been reconfigured), while the surviving node's view consists of a one-node cluster (itself).


In a two-node cluster, if one node hangs or fails, the other node detects the missing heartbeats from the faulty node and reconfigures itself to become the sole cluster member. Part of this reconfiguration involves fencing the shared devices to prevent the failed node from performing I/O on the multihost disks. In all two-node Sun Cluster configurations, this is accomplished through SCSI-2 reservations on the multihost disks. The surviving node reserves the disks, preventing the failed node from performing I/O on them. The SCSI-2 reservation is atomic: if two nodes simultaneously attempt to reserve a device, one is guaranteed to succeed and the other is guaranteed to fail.
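The atomicity of the reservation race can be illustrated with a small model. This is an illustrative Python sketch only, not Sun Cluster code; the class and method names are invented for the example.

```python
import threading

class Scsi2Disk:
    """Toy model of a disk honoring a SCSI-2 exclusive reservation."""
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None            # node currently holding the reservation

    def reserve(self, node):
        # The reservation is atomic: of two simultaneous attempts,
        # exactly one succeeds and the other fails.
        with self._lock:
            if self.owner is None:
                self.owner = node
                return True
            return False

    def write(self, node):
        # A node that does not hold the reservation is fenced off.
        if self.owner is not None and self.owner != node:
            raise PermissionError("Reservation_Conflict")
        return "ok"

# Two nodes race for the reservation, as during reconfiguration.
disk = Scsi2Disk()
results = {}
threads = [threading.Thread(target=lambda n=n: results.update({n: disk.reserve(n)}))
           for n in ("node0", "node1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.values()))   # exactly one True and one False
```

Whichever node wins the race may continue issuing I/O; the loser receives Reservation_Conflict on every attempt, which is the essence of the fence.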

1.3.4.1 Failure Fencing (SSVM and CVM)

Failure fencing is done differently depending on the cluster topology. The simplest case is a two-node cluster.

Failure Fencing in Two-Node Clusters

In a two-node cluster, the quorum device determines which node remains in the cluster; the failed node is prevented from starting its own cluster because it cannot reserve the quorum device. A SCSI-2 reservation is used to fence the failed node and prevent it from updating the multihost disks.

Failure Fencing in Clusters With More Than Two Nodes

The difficulty with the SCSI-2 reservation model used in two-node clusters is that the SCSI reservations are host-specific. If a host has issued reservations on shared devices, it effectively shuts out every other node that can access the device, faulty or not. Consequently, this model breaks down when more than two nodes are connected to the multihost disks in a shared disk environment such as OPS.

For example, if one node hangs in a three-node cluster, the other two nodes reconfigure. However, neither of the surviving nodes can issue SCSI reservations to protect the underlying shared devices from the faulty node, as this action also shuts out the other surviving node. But without the reservations, the faulty node might "wake up" and issue I/O to the shared devices, despite the fact that its view of the cluster is no longer current.

Consider a four-node cluster with storage devices directly accessible from all the cluster nodes. If one node hangs, and the other three nodes reconfigure, none of them can issue the reservations to protect the underlying devices from the faulty node, as the reservations will also prevent some of the valid cluster members from issuing any I/O to the devices. But without the reservations, we have the real danger of the faulty node "waking up" and issuing I/O to shared devices despite the fact that its view of the cluster is no longer current.
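The reason the host-specific reservation breaks down for larger clusters can be shown with the same kind of toy model. This is illustrative Python only; the names are invented for the example.

```python
class Scsi2Disk:
    """Toy model: a SCSI-2 reservation is held by at most one host."""
    def __init__(self):
        self.owner = None

    def reserve(self, node):
        if self.owner is None:
            self.owner = node
            return True
        return False

    def write(self, node):
        if self.owner is not None and self.owner != node:
            raise PermissionError("Reservation_Conflict")
        return "ok"

disk = Scsi2Disk()
survivors = ["node1", "node2", "node3"]   # node0 is hung
disk.reserve("node1")                     # node1 fences the hung node0 ...

# ... but the reservation also shuts out the healthy survivors,
# because SCSI-2 reservations cannot distinguish faulty hosts
# from valid cluster members.
fenced = [n for n in ["node0"] + survivors if n != disk.owner]
for n in fenced:
    try:
        disk.write(n)
    except PermissionError:
        print(n, "is fenced")
```

Note that `fenced` includes node2 and node3, the two healthy survivors, which is exactly why this model cannot be used when more than two nodes share the storage.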

Now consider the problem of split-brain situations. In the case of a four-node cluster, a variety of interconnect failures are possible. We will define a partition as a set of cluster nodes where each node can communicate with every other member within that partition, but not with any other cluster node that is outside the partition. There can be situations where, due to interconnect failures, two partitions are formed with two nodes in one partition and two nodes in the other partition, or with three nodes in one partition and one node in the other partition. Or there can even be cases where a four-node cluster can degenerate into four different partitions with one node in each partition. In all such cases, Sun Cluster attempts to arrive at a consistent distributed consensus on which partition should stay up and which partition should abort. Consider the following two cases.
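The definition of a partition above corresponds to a connected component of the graph of surviving interconnect links. A hypothetical sketch (treating reachability over surviving links as transitive, as the cluster framework does):

```python
def partitions(nodes, links):
    """Group nodes into partitions: maximal sets in which every node
    can reach every other node over surviving interconnect links."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        # Walk outward from n over surviving links.
        comp, frontier = {n}, [n]
        while frontier:
            cur = frontier.pop()
            for m in adj[cur] - comp:
                comp.add(m)
                frontier.append(m)
        seen |= comp
        parts.append(sorted(comp))
    return sorted(parts)

# Four-node cluster whose interconnect has split two-and-two:
print(partitions([0, 1, 2, 3], [(0, 1), (2, 3)]))   # [[0, 1], [2, 3]]
```

With no surviving links at all, the same function yields four one-node partitions, the fully degenerate case described above.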

Case 1. Two partitions with two nodes in each partition. As in the one-one split of a two-node cluster, the CMMs in either partition do not have quorum to conclude decisively which partition should stay up and which should abort. To meet the goals of data integrity and high availability, both partitions should not stay up and both partitions should not go down. As in the case of a two-node cluster, it is possible to adjudicate by means of an external device (the quorum disk): a designated node in each partition races for the reservation on the designated quorum device, and the partition that wins the race stays up. However, the node that successfully obtains the reservation on the quorum device also shuts out the other node in its own partition from accessing the device, due to the nature of the SCSI-2 reservation. Because the quorum device contains useful data, this is not desirable.

Case 2. Two partitions with three nodes in one partition and one node in the other partition. Even though the majority partition in this case has adequate quorum, the crux of the problem here is that the single isolated node has no idea what happened to the other three nodes. Perhaps they formed a valid cluster and this node should abort. But perhaps they did not; perhaps all three nodes did really fail for some reason. In this case, the single isolated node must stay up to maintain availability. With total loss of communication and without an external device to mediate, it is impossible to decide. Racing for the reservation of a configured external quorum device leads to a situation worse than in case 1. If one of the nodes in the majority partition reserves the quorum device, it shuts out the other two nodes in its own partition from accessing the device. But what is worse is that if the single isolated node wins the race for the reservation, it may lead to the loss of three potentially healthy nodes from the cluster. Once again, the disk reservation solution does not work well.

The inability to use the disk reservation technique also renders the system vulnerable to the formation of multiple independent clusters, each in its own isolated partition, in the presence of interconnect failures and operator errors. Consider case 2 above: assume that the CMMs or some external entity somehow decides that the three nodes in the majority partition should stay up and the single isolated node should abort. Assume that at some later point in time the administrator attempts to start up the aborted node without repairing the interconnect. The node would still be unable to communicate with any of the surviving members and, thinking it is the only node in the cluster, would attempt to reserve the quorum device. It would succeed, because no quorum reservations are in effect (for the reasons elucidated above), and would form its own independent cluster with itself as the sole cluster member.

Therefore, the simple quorum reservation scheme is unusable for three- and four-node clusters with storage devices directly accessible from all nodes. We need new techniques to solve the following three problems:

  1. How do we resolve split-brain situations in three- and four-node clusters?

  2. How do we failure fence faulty nodes from shared devices?

  3. How do we prevent isolated partitions from forming multiple independent clusters?

To solve the different types of split-brain situations in three- and four-node clusters, a combination of heuristics and manual intervention is used, with the caveat that operator error during the manual intervention phase can destroy the integrity of the cluster. In "1.3.3 Quorum Devices (SSVM and CVM)", we discussed the policies that can be specified to determine what occurs in the event of a cluster partition in clusters of more than two nodes. If you choose the interventionist policy, the CMMs in all partitions suspend all cluster operations, waiting for manual operator input as to which partition should continue to form a valid cluster and which partitions should abort. It is the operator's responsibility to allow the desired partition to continue and to abort all other partitions. Allowing more than one partition to form a valid cluster can result in irretrievable data corruption.

If you choose the pre-deterministic policy, you are asked to designate a preferred node (either the highest or lowest node ID in the cluster); when a split-brain situation occurs, the partition containing the preferred node automatically becomes the new cluster if it is able to do so. All other partitions must be aborted manually by operator intervention. The selected quorum device is used solely to break the tie in the case of a split-brain between two cluster members. Note that this situation can occur even in a configured four-node cluster, if only two cluster members are active when the split-brain occurs. The quorum device still plays a role, but in a much more limited capacity.
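The two partition policies can be summarized in a small decision sketch. This is illustrative Python only; Sun Cluster implements this logic inside the CMM, and the function and value names here are invented.

```python
def partition_action(policy, partition, preferred_node=None):
    """Decide what a partition does on split-brain under the two
    policies described above.  'ask' means the partition suspends
    cluster operations and waits for operator input."""
    if policy == "interventionist":
        # Every partition suspends and waits for the operator to issue
        # scadmin abortpartition or scadmin continuepartition.
        return "ask"
    if policy == "predeterministic":
        # Only the partition holding the preferred node proceeds
        # automatically; all other partitions must be aborted manually.
        return "continue" if preferred_node in partition else "ask"
    raise ValueError("unknown policy")

# Three-and-one split, preferred node is node 0:
print(partition_action("predeterministic", {0, 1, 2}, preferred_node=0))  # continue
print(partition_action("predeterministic", {3}, preferred_node=0))        # ask
```

Under the interventionist policy, both calls above would return "ask", since every partition then depends on the operator.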

Once a partition has been selected to stay up, the next question is how to effectively protect the data from the other partitions, which should have aborted. Even though we require the operator to abort all other partitions, the command to abort a partition may not succeed immediately, and without an effective failure fencing mechanism there is always the danger of hung nodes suddenly "waking up" and issuing pending I/O to shared devices before processing the abort request. For this reason, the faulty nodes are reset before a valid cluster is formed in any partition.

To prevent a failed node from "waking up" and issuing I/O to the multihost disks, the faulty node is forcefully terminated by one of the surviving nodes: it is dropped down to the OpenBoot PROM via the Terminal Concentrator or System Service Processor (on Sun Enterprise 10000 systems), and the hung image of the operating system is terminated. This termination operation prevents you from accidentally resuming the system by typing go at the OpenBoot PROM. The surviving cluster members wait for a positive acknowledgment of the termination operation before proceeding with the cluster reconfiguration process.

If there is no response to the termination command, the hardware power sequencer (if present) is tripped to power-cycle the faulty node. If that also fails, the system displays the following message, requesting operator input before cluster reconfiguration continues:

\007*** ISSUE ABORTPARTITION OR CONTINUEPARTITION ***
You must ensure that the unreachable node no longer has access to the 
shared data. You may allow the proposed cluster to form after you have 
ensured that the unreachable node has aborted or is down.
Proposed cluster partition: 

/opt/SUNWcluster/bin/scadmin continuepartition localnode clustername

Note -

You should ensure that the faulty node has been successfully terminated before issuing the scadmin continuepartition command on the surviving nodes.
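The escalation sequence described above (forced termination, then power-cycle, then operator prompt) can be sketched as follows. The helper callables stand in for the Terminal Concentrator/SSP and power-sequencer operations and are hypothetical; this is not the Sun Cluster implementation.

```python
def fence_faulty_node(terminate, power_cycle):
    """Escalate until the faulty node is known to be stopped.

    terminate()   -- drop the node to the OpenBoot PROM via the
                     Terminal Concentrator or SSP; returns True on
                     positive acknowledgment.
    power_cycle() -- trip the hardware power sequencer, if present;
                     returns True on success.
    Returns the step that succeeded, or 'operator' if manual
    intervention (scadmin continuepartition) is required.
    """
    if terminate():
        return "terminated"
    if power_cycle():
        return "power-cycled"
    # Neither worked: suspend and ask the operator, as in the
    # ABORTPARTITION / CONTINUEPARTITION message shown above.
    return "operator"

# Termination fails, but the power sequencer succeeds:
print(fence_faulty_node(lambda: False, lambda: True))   # power-cycled
```

Only when both automatic steps fail does reconfiguration block on the operator, which is why the message above insists that the unreachable node be verified down before continuing.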


Partitioned, isolated, and terminated nodes do eventually boot up. If, due to some oversight, the administrator tries to join such a node into the cluster without repairing the interconnects, the node must be prevented from forming a valid cluster partition of its own while it is unable to communicate with the existing cluster.

Assume a case where two partitions are formed with three nodes in one partition and one node in the other partition. A designated node in the majority partition terminates the isolated node and the three nodes form a valid cluster in their own partition. The isolated node, on booting up, tries to form a cluster of its own due to an administrator running the startcluster(1M) command and replying in the affirmative when asked for confirmation. Because the isolated node believes it is the only node in the cluster, it tries to reserve the quorum device and actually succeeds in doing so, because there are three nodes in the valid partition and none of them can reserve the quorum device without locking the others out.

To resolve this problem, a new concept is needed: the nodelock. A designated node in the cluster membership opens a telnet(1) session to an unused port on the Terminal Concentrator as part of its cluster reconfiguration and keeps this session alive as long as it is a cluster member. If this node leaves the membership, the nodelock is passed on to one of the remaining cluster members. In the above example, if the isolated node were to try to form its own cluster, it would try to acquire this lock and fail, because one of the nodes in the existing membership (in the other partition) holds the lock. If one of the valid cluster members is unable to acquire this lock for some reason, the failure is not considered fatal, but is logged as an error requiring immediate attention. The locking facility should be considered a safety feature rather than a mechanism critical to the operation of the cluster, and its failure is not catastrophic. To detect faults in this area quickly, monitoring processes in the Sun Cluster framework check whether the Terminal Concentrator is accessible from the cluster nodes.
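The nodelock works because only one telnet session can be held on the designated unused Terminal Concentrator port at a time, so keeping the connection open acts as an exclusive cluster-wide lock. The following toy model captures that behavior in-process; it is an illustrative sketch, not the actual mechanism, and all names are invented.

```python
import threading

class TerminalConcentratorLock:
    """Toy model of the nodelock: the Terminal Concentrator allows only
    one session on the designated unused port, so holding a connection
    open there acts as an exclusive, cluster-wide lock."""
    def __init__(self):
        self._holder = None
        self._mutex = threading.Lock()

    def acquire(self, node):
        with self._mutex:
            if self._holder is None:
                self._holder = node   # session established and kept alive
                return True
            return False              # port busy: another member holds it

    def release(self, node):
        # When the holding node leaves the membership, the lock is
        # freed so it can pass to one of the remaining members.
        with self._mutex:
            if self._holder == node:
                self._holder = None

tc = TerminalConcentratorLock()
assert tc.acquire("node1")        # a cluster member takes the nodelock
print(tc.acquire("node4"))        # isolated node cannot form its own cluster: False
tc.release("node1")               # holder leaves the membership
print(tc.acquire("node2"))        # lock passes to a remaining member: True
```

The isolated node's failed acquire is exactly the check that stops it from starting an independent cluster after an accidental startcluster(1M).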

1.3.4.2 Failure Fencing (Solstice DiskSuite)

In Sun Cluster configurations using Solstice DiskSuite as the volume manager, Solstice DiskSuite itself determines cluster quorum and provides failure fencing. No distinction is made between different cluster topologies for failure fencing; that is, two-node and greater-than-two-node clusters are treated identically. This is possible because, in Solstice DiskSuite configurations, disks can be connected to at most two nodes.

Disk fencing is accomplished in the following manner.

  1. After a node is removed from the cluster, a remaining node performs a SCSI reserve of the disk. From then on, other nodes--including the one no longer in the cluster--are prevented by the disk itself from reading or writing to it; the disk returns a Reservation_Conflict error to the read or write command. In Solstice DiskSuite configurations, the SCSI reserve is accomplished by issuing the Sun multihost ioctl MHIOCTKOWN.

  2. Nodes that are in the cluster continuously enable the MHIOCENFAILFAST ioctl for the disks that they are accessing. This ioctl is a directive to the disk driver, giving the node the capability to panic itself if it cannot access the disk because the disk has been reserved by some other node. The MHIOCENFAILFAST ioctl causes the driver to check the error return of every read and write that the node issues to the disk for the Reservation_Conflict error code, and also, periodically in the background, to issue a test operation to the disk to check for Reservation_Conflict. Both the foreground and background code paths panic the node if Reservation_Conflict is returned.

  3. The MHIOCENFAILFAST ioctl is not specific to dual-hosted disks. If the node that has enabled the MHIOCENFAILFAST for a disk loses access to that disk due to another node reserving the disk (by SCSI-2 exclusive reserve), the node panics.
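The interaction between the reservation (MHIOCTKOWN) and the failfast directive (MHIOCENFAILFAST) in the steps above can be modeled roughly as follows. This is illustrative Python only; the real mechanism lives in the Solaris disk driver, and the class and exception names are invented.

```python
class PanicError(Exception):
    """Stands in for the kernel panic that failfast triggers."""

class FencedDisk:
    """Toy model of a multihost disk with reservation and failfast."""
    def __init__(self):
        self.owner = None

    def take_ownership(self, node):
        # Models MHIOCTKOWN: the surviving node reserves the disk.
        self.owner = node

    def _check(self, node, failfast):
        if self.owner is not None and self.owner != node:
            if failfast:
                # Models MHIOCENFAILFAST: a Reservation_Conflict
                # causes the issuing node to panic itself.
                raise PanicError(f"{node} panics on Reservation_Conflict")
            return "Reservation_Conflict"
        return "ok"

    def write(self, node, failfast=True):
        # Foreground I/O path: error return checked on every write.
        return self._check(node, failfast)

    def probe(self, node, failfast=True):
        # Periodic background test operation issued by the driver.
        return self._check(node, failfast)

disk = FencedDisk()
disk.take_ownership("node1")                 # surviving node reserves the disk
print(disk.write("node1"))                   # ok
try:
    disk.probe("node0")                      # fenced node's background probe
except PanicError as e:
    print(e)                                 # node0 panics on Reservation_Conflict
```

The key property modeled here is step 3 above: even if the fenced node never issues foreground I/O, the background probe still detects the reservation and panics the node.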

This solution to disk fencing relies on the SCSI-2 concept of disk reservation, which requires that a disk be reserved by exactly one node.

For Solstice DiskSuite configurations, the installation program scinstall(1M) does not prompt for a quorum device, a node preference, or a failure fencing policy, as is done in SSVM and CVM configurations. When Solstice DiskSuite is specified as the volume manager, you cannot configure direct-attach devices, that is, devices that attach directly to more than two nodes. Disks can be connected only to pairs of nodes.


Note -

Although the scconf(1M) command allows you to specify the +D flag to enable configuring direct-attach devices, you should not do so in Solstice DiskSuite configurations.