Sun Cluster 2.2 Software Installation Guide

Quorum, Quorum Devices, and Failure Fencing

The concept of quorum comes into play quite often in distributed systems. Fundamentally, quorum is a majority consensus used to determine the best solution in an ambiguous situation. The actual number that constitutes an acceptable quorum varies from situation to situation; some situations may require a simple greater-than-50% consensus, while others may require a two-thirds majority. In a distributed system, a set of communicating processes comprises the potential members of the quorum. To ensure that the system operates effectively and to make critical decisions about the behavior of the system, the processes need to agree on the desired quorum and then try to obtain consensus on some underlying issue by communicating messages until a quorum is obtained.

In Sun Cluster, two different types of quorum are used: CMM quorum and CCD quorum.

CMM Quorum

The Sun Cluster and Solstice HA clustering products determine CMM quorum by different methods. In previous Sun Cluster releases, including Sun Cluster 2.0 and 2.1, the cluster framework determined CMM quorum. In Solstice HA, quorum was determined by the volume manager, Solstice DiskSuite. Sun Cluster 2.2 is an integrated release based on both Sun Cluster 2.1 and Solstice HA 1.3. In Sun Cluster 2.2, determining CMM quorum depends on the volume manager (Solstice DiskSuite or VxVM). If Solstice DiskSuite is the volume manager, CMM quorum is determined by a quorum of metadevice state database replicas managed by Solstice DiskSuite. If VxVM is used as the volume manager, CMM quorum is determined by the cluster framework.

For Sun Cluster 2.2, CMM quorum is determined by the following:

  - In Solstice DiskSuite configurations, by a quorum of the metadevice state database replicas managed by Solstice DiskSuite

  - In VxVM configurations, by the cluster framework

It is necessary to determine cluster quorum when nodes join or leave the cluster, and in the event that the cluster interconnect fails. In Solstice HA 1.3, cluster interconnect failure was considered a double failure; the software guaranteed data integrity, but did not guarantee that the cluster could continue without user intervention. Manual intervention for dual failures was part of the system design, as the safest method to ensure data integrity.

Sun Cluster 2.2 software attempts to preserve data integrity and also to maintain cluster availability without user intervention. To preserve cluster availability, Sun Cluster 2.2 implements several new mechanisms, including quorum devices and the use of the Terminal Concentrator or System Service Processor. Note that just as Solstice HA 1.3 used Solstice DiskSuite to determine cluster quorum, Sun Cluster 2.2 also uses the volume manager as the primary factor in determining cluster quorum and cluster behavior upon failure of the cluster interconnect. The results of cluster interconnect failure are described in "Quorum Devices (VxVM)".

CCD Quorum

The Cluster Configuration Database (CCD) needs to obtain quorum to elect a valid and consistent copy of the CCD. Refer to "Cluster Configuration Database" for an overview of the CCD.

Sun Cluster does not have a storage topology that guarantees direct access from all cluster nodes to the underlying storage devices in all configurations. This precludes the possibility of using a single logical volume to store the CCD database, which would guarantee that updates were propagated correctly across restarts of the cluster framework. The CCD communicates with its peers through the cluster interconnect, and this logical link is unavailable on nodes that are not cluster members. We will illustrate the CCD quorum requirement with a simple example.

Assume a three-node cluster consisting of nodes A, B, and C. Node A exits the cluster leaving B and C as the surviving cluster members. The CCD is updated and the updates are propagated to nodes B and C. Now, nodes B and C leave the cluster. Subsequently, node A is restarted. However, A does not have the most recent copy of the CCD database because it has no means of knowing the updates that happened on nodes B and C. In fact, irrespective of which node is started first, it is not possible to determine absolutely which node has the most recent copy of the CCD database. Only when all three nodes are restarted is there sufficient information to determine the most recent copy of the CCD. If a valid CCD could not be elected, all query or update operations on the CCD would fail with an invalid CCD error.

In practice, starting all cluster nodes before determining a valid copy of the CCD is too restrictive a condition. This condition can be relaxed by imposing a restriction on the update operation.

If n is the number of nodes currently configured in the cluster, at least floor(n/2) + 1 nodes (a strict majority) must be up for updates to be propagated. In that case, it is sufficient for ceiling(n/2) identical copies to be present to elect a valid database on a cluster restart. The valid CCD is then propagated to all cluster nodes that do not already have it.
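
This arithmetic is easy to express in code. The following C fragment is an illustrative sketch only (it is not Sun Cluster source); it computes both thresholds for a given number of configured nodes n:

#include <stdio.h>

/*
 * Nodes that must be up for a CCD update to be propagated:
 * a strict majority, floor(n/2) + 1.
 */
static int
ccd_update_quorum(int n)
{
        return (n / 2 + 1);
}

/*
 * Identical copies sufficient to elect a valid CCD on a cluster
 * restart: ceiling(n/2).
 */
static int
ccd_restart_quorum(int n)
{
        return ((n + 1) / 2);
}

int
main(void)
{
        int n;

        for (n = 2; n <= 4; n++)
                (void) printf("n=%d: updates need %d nodes up, "
                    "restart election needs %d copies\n",
                    n, ccd_update_quorum(n), ccd_restart_quorum(n));
        return (0);
}

For a three-node cluster this prints an update quorum of 2 and a restart election quorum of 2, which matches the example above.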

Note that even if the CCD is invalid, a node is allowed to join the cluster. However, the CCD can be neither updated nor queried in this state. This implies that all components of the cluster framework that rely on the CCD remain in a dysfunctional state. In particular, logical hosts cannot be mastered and data services cannot be activated in this state. The CCD is enabled only after a sufficient number of nodes join the cluster for quorum to be reached. Alternatively, an administrator can restore the CCD database from the copy with the highest CCD generation number.

CCD quorum problems can be avoided if at least one node stays up during a reconfiguration. In this case, the valid copy on any of these nodes is propagated to the newly joining nodes. Another alternative is to ensure that the cluster is started up on the node that has the most recent copy of the CCD database. Nevertheless, it is quite possible that after a system crash while a database update was in progress, the recovery algorithm will find inconsistent CCD copies. In such cases, it is the responsibility of the administrator to restore the database using the restore option of the ccdadm(1M) command (see the man page for details). The CCD also provides a checkpoint facility to back up the current contents of the database. It is good practice to make a backup copy of the CCD database after any change to the system configuration; the backup copy can subsequently be used to restore the database. The CCD is very small compared to conventional relational databases, and the backup and restore operations take no more than a few seconds to complete.

CCD Quorum in Two-Node Clusters

In the case of two-node clusters, the previously discussed quorum majority rule would require both nodes to be cluster members for updates to succeed, which is too restrictive. On the other hand, if updates are allowed in this configuration while only one node is up, the database must be made consistent manually before the cluster is restarted. This can be accomplished by either restarting the node with the most recent copy first, or by restoring the database with the ccdadm(1M) restore operation after both nodes have joined the cluster. In the latter case, even though both nodes are able to join the cluster, the CCD will be in an invalid state until the restore operation is complete.

This problem is solved by configuring persistent storage for the database on a shared disk device. The shared copy is used only when a single node is active. When the second node joins, the shared CCD is copied to each node.

Whenever a node leaves the cluster, the shared CCD is reactivated by copying the local CCD onto the shared copy. In this way, updates are enabled even when only a single node is in the cluster membership. This also ensures reliable propagation of updates across cluster restarts.
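
A minimal sketch of this selection logic, assuming a two-node cluster, follows; the function names are hypothetical stand-ins, not Sun Cluster interfaces:

#include <stdio.h>

/* Hypothetical stand-ins for the real copy operations. */
static void
copy_local_ccd_to_shared(void)
{
        (void) puts("local CCD -> shared CCD on the shared disk");
}

static void
copy_shared_ccd_to_local(void)
{
        (void) puts("shared CCD -> local CCD on each node");
}

/* Invoked on each membership change in a two-node cluster. */
static void
ccd_membership_change(int active_nodes)
{
        if (active_nodes == 1)
                copy_local_ccd_to_shared();     /* reactivate the shared copy */
        else if (active_nodes == 2)
                copy_shared_ccd_to_local();     /* both nodes up again */
}

int
main(void)
{
        ccd_membership_change(2);       /* second node joins */
        ccd_membership_change(1);       /* one node leaves */
        return (0);
}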

The downside of using a shared storage device for the shared copy of the CCD is that two disks must be allocated exclusively for this purpose; the volume manager precludes the use of these disks for anything else. The use of the two disks can be avoided if some application downtime, as described above, can be tolerated in a production environment.

Similar to the Sun Cluster 2.2 integration issues with the CMM quorum, a shared CCD is not supported in all Sun Cluster configurations. If Solstice DiskSuite is the volume manager, the shared CCD is not supported. Because the shared CCD is only used when one node is active, the failure addressed by the shared CCD is not common.

Quorum Devices (VxVM)

In certain cases--for example, in a two-node cluster when both cluster interconnects fail and both cluster nodes are still members--Sun Cluster needs assistance from a hardware device to solve the problem of cluster quorum. This device is called the quorum device.

Quorum devices must be used in clusters running VERITAS Volume Manager (VxVM), regardless of the number of cluster nodes. Solstice DiskSuite assures cluster quorum through the use of its own metadevice state database replicas, and as such, does not need a quorum device. Quorum devices are neither required nor supported in Solstice DiskSuite configurations. When you install a cluster using Solstice DiskSuite, the scinstall(1M) program will not ask for, or accept a quorum device.

The quorum device is merely a disk or a controller which is specified during the cluster installation procedure by using the scinstall(1M) command. The quorum device is a logical concept; there is nothing special about the specific piece of hardware chosen as the quorum device. VxVM does not allow a portion of a disk to be in a separate disk group, so an entire disk and its plex (mirror) are required for the quorum device.

A quorum device ensures that at any point in time only one node can update the multihost disks that are shared between nodes. The quorum device comes into use if the cluster interconnect is lost between nodes. Each node (or set of nodes in a greater than two-node cluster) should not attempt to update shared data unless it can establish that it is part of the majority quorum. The nodes take a vote, or quorum, to decide which nodes remain in the cluster. Each node determines how many other nodes it can communicate with. If it can communicate with more than half of the cluster, then it is in the majority quorum and is allowed to remain a cluster member. If it is not in the majority quorum, the node aborts from the cluster.
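
The membership rule just described can be sketched in a few lines of C. This is an illustration of the rule, not Sun Cluster source, and it assumes the reachable count includes the node itself:

#include <stdio.h>

/*
 * A node remains a cluster member only if it can communicate with
 * more than half of the configured nodes (counting itself).
 */
static int
in_majority(int reachable, int configured)
{
        return (2 * reachable > configured);
}

int
main(void)
{
        /* A four-node cluster split 3/1 and 2/2. */
        (void) printf("3 of 4 reachable: %s\n",
            in_majority(3, 4) ? "stay in cluster" : "abort");
        (void) printf("2 of 4 reachable: %s\n",
            in_majority(2, 4) ? "stay in cluster" : "abort");
        return (0);
}

Note that in the 2-of-4 case neither half of the cluster has a majority; that tie is exactly what the quorum device and the partition policies described later are designed to resolve.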

The quorum device acts as the "third vote" to prevent a tie. For example, in a two-node cluster, if the cluster interconnect is lost, each node will "race" to reserve the quorum device. Figure 1-9 shows a two-node cluster with a quorum device located in one of the multihost disk enclosures.

Figure 1-9 Two-Node Cluster With Quorum Device


The node that reserves the quorum device then has two votes toward quorum versus the remaining node that has only one vote. The node with the quorum will then start its own cluster (mastering the multihost disks) and the other node will abort.
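
The race and the resulting vote count can be sketched as follows. This is illustrative only; try_reserve() is a hypothetical stand-in for the atomic SCSI-2 reservation described in "Failure Fencing", and the device path is an example:

#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the atomic disk reservation; exactly one
 * of the two racing nodes succeeds.
 */
static int
try_reserve(const char *quorum_device)
{
        (void) quorum_device;
        return (1);             /* assume this node won the race */
}

int
main(void)
{
        int votes = 1;                  /* this node's own vote */
        const int total_votes = 3;      /* two nodes plus quorum device */

        if (try_reserve("/dev/rdsk/c2t0d0s2"))
                votes++;                /* the quorum device's vote */

        if (2 * votes > total_votes) {
                (void) printf("majority (%d of %d): form the cluster\n",
                    votes, total_votes);
        } else {
                (void) printf("no majority: abort from the cluster\n");
                exit(1);
        }
        return (0);
}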

Before each cluster reconfiguration, the set of nodes and the quorum device vote to approve the new system configuration. Reconfiguration proceeds only if a majority quorum is reached. After a reconfiguration, a node remains in the cluster only if it is part of the majority partition.


Note -

In greater than two-node clusters, each set of nodes that share access to multihost disks must be configured to use a quorum device.


The concept of quorum device changes somewhat in greater than two-node clusters. If there is an even split for nodes that do not share a quorum device--referred to as a "split-brain" partition--you must be able to decide which set of nodes will become a new cluster and which set will abort. This situation is not handled by the quorum device. Instead, as part of the installation process, when you configure the quorum device(s), you are asked questions that determine what will happen when such a partition occurs. One of two events occurs in this partition situation depending on whether you requested to have the cluster software automatically select the new cluster membership or whether you specified manual intervention.

For example, consider a four-node cluster (that might or might not share a storage device common to all nodes) where a network failure results in nodes 0 and 1 communicating with each other and nodes 2 and 3 communicating with each other. In this situation, the automatic or manual quorum decision would be used. The cluster monitor software is quite intelligent; it tries to determine on its own which nodes should be cluster members and which should not. It resorts to the quorum device to break a tie, or to the manual or automatic selection of cluster partitions, only in extreme situations.


Note -

The failure of a quorum device is similar to the failure of a node in a two-node cluster. Although the failure of a quorum device does not cause a failover of services, it does reduce the high availability of a two-node cluster in that no further node failures can be tolerated. A failed quorum device can be reconfigured or replaced while the cluster is running. The cluster can remain running as long as no other component failure occurs while the quorum repair or replacement is in progress.


The quorum device on its own cannot account for all scenarios where a decision must be made on cluster membership. For example, consider a fully operational three-node cluster, where all of the nodes share access to the multihost disks, such as the Sun StorEdge A5000. If one node aborts or loses both cluster interconnects, and the other two nodes are still able to communicate to each other, the two remaining nodes do not have to reserve the quorum device to break a tie. Instead, the majority voting that comes into play (two votes out of three) determines that the two nodes that can communicate with each other can form the cluster. However, the two nodes that form the cluster must still prevent the crashed or hung node from coming back online and corrupting the shared data. They do this by using a technique called failure fencing, as described in "Failure Fencing ".

Failure Fencing

In any clustering system, once a node is no longer in the cluster, it must be prevented from continuing to write to the multihost disks. Otherwise, data corruption could ensue. The surviving nodes of the cluster need to be able to start reading from and writing to the multihost disk. If the node that is no longer in the cluster is continuing to write to the multihost disk, its writes would confuse and ultimately corrupt the updates that the surviving nodes are performing.

Preventing a node that is no longer in the cluster from writing to the disk is called failure fencing. Failure fencing is very important for ensuring data integrity by preventing an isolated node from coming up in its own partition as a separate cluster when the actual cluster exists in a different partition.


Caution -

It is very important to prevent the faulty node from performing I/O as the two cluster nodes now have very different views. The faulty node's cluster view includes both cluster members (because it has not been reconfigured), while the surviving node's cluster view consists of a one-node cluster (itself).


In a two-node cluster, if one node hangs or fails, the other node detects the missing heartbeats from the faulty node and reconfigures itself to become the sole cluster member. Part of this reconfiguration involves fencing the shared devices to prevent the faulty node from performing I/O on the multihost disks. In all Sun Cluster configurations with only two nodes, this fencing is accomplished through the use of SCSI-2 reservations on the multihost disks. The surviving node reserves the disks and prevents the failed node from performing I/O on the reserved disks. SCSI-2 reservation is atomic; if two nodes attempt to reserve the device simultaneously, one node succeeds and one node fails.

Failure Fencing (VxVM)

Failure fencing is done differently depending on cluster topology. The simplest case is a two-node cluster.

Failure Fencing in Two-Node Clusters

In a two-node cluster, the quorum device determines which node remains in the cluster. The failed node is prevented from starting its own cluster because it cannot reserve the quorum device. SCSI-2 reservation is used to fence a failed node and prevent it from updating the multihost disks.

Failure Fencing in Greater Than Two-Node Clusters

The difficulty with the SCSI-2 reservation model used in two-node clusters is that the SCSI reservations are host-specific. If a host has issued reservations on shared devices, it effectively shuts out every other node that can access the device, faulty or not. Consequently, this model breaks down when more than two nodes are connected to the multihost disks in a shared disk environment such as OPS.

For example, if one node hangs in a three-node cluster, the other two nodes reconfigure. However, neither of the surviving nodes can issue SCSI reservations to protect the underlying shared devices from the faulty node, as this action also shuts out the other surviving node. But without the reservations, the faulty node might revive and issue I/O to the shared devices, despite the fact that its view of the cluster is no longer current.

Consider a four-node cluster with storage devices directly accessible from all the cluster nodes. If one node hangs, and the other three nodes reconfigure, none of them can issue the reservations to protect the underlying devices from the faulty node, as the reservations will also prevent some of the valid cluster members from issuing any I/O to the devices. But without the reservations, we have the real danger of the faulty node reviving and issuing I/O to shared devices despite the fact that its view of the cluster is no longer current.

Now consider the problem of split-brain situations. In the case of a four-node cluster, a variety of interconnect failures are possible. We will define a partition as a set of cluster nodes where each node can communicate with every other member within that partition, but not with any other cluster node that is outside the partition. There can be situations where, due to interconnect failures, two partitions are formed with two nodes in one partition and two nodes in the other partition, or with three nodes in one partition and one node in the other partition. Or there can even be cases where a four-node cluster can degenerate into four different partitions with one node in each partition. In all such cases, Sun Cluster attempts to arrive at a consistent distributed consensus on which partition should stay up and which partition should abort. Consider the following two cases.

Case 1. Two partitions, with two nodes in each partition. As in the case of the one-one split in a two-node cluster, the CMMs in either partition do not have quorum to conclude which partition should stay up and which partition should abort. To meet the goals of data integrity and high availability, both partitions should not stay up and both partitions should not go down. As in the case of a two-node cluster, it is possible to adjudicate by means of an external device (the quorum disk). A designated node in each partition can race for the reservation on the designated quorum device, and whichever partition claims the reservation first is declared the winner. However, the node that successfully obtains the reservation on the quorum device prevents the other node from accessing the device, due to the nature of the SCSI-2 reservation. This model is not ideal, because the quorum device contains data useful to both nodes.

Case 2. Two partitions, with three nodes in first partition and one node in second partition. Though the majority partition in this case has adequate quorum, the crux of the problem is that the single isolated node has no knowledge of activities of the other three nodes. Perhaps they formed a valid cluster and this node should abort. Or perhaps all three nodes actually failed, in which case the single isolated node must stay up to maintain availability. With total loss of communication and without an external device to mediate, it is impossible to decide. Racing for the reservation of a configured external quorum device leads to a situation worse than in case 1. If one of the nodes in the majority partition reserves the quorum device, it excludes the other two nodes in its own partition from accessing the device. But worse, if the single isolated node wins the race for the reservation, this may lead to the loss of three potentially healthy nodes from the cluster. Once again, the disk reservation solution does not work well.

The inability to use the disk reservation technique also renders the system vulnerable to the formation of multiple independent clusters, each in its own isolated partition, in the presence of interconnect failures and operator errors. Consider case 2 above: Assume that the CMMs or some external entity somehow decides that the three nodes in the majority partition should stay up and the single isolated node should abort. Assume that at some later point in time the administrator attempts to start up the aborted node, without repairing the interconnect. The node still would be unable to communicate with any of the surviving members, and thinking it is the only node in the cluster, would attempt to reserve the quorum device. It would succeed because there are no quorum reservations in effect, and would form its own independent cluster with itself as the sole member.

Therefore, the simple quorum reservation scheme is unusable for three- and four-node clusters with storage devices directly accessible from all nodes. We need new techniques to solve the following three problems:

  1. How to resolve all split-brain situations in three- and four-node clusters?

  2. How to failure fence faulty nodes from shared devices?

  3. How to prevent isolated partitions from forming multiple independent clusters?

To handle these split-brain situations in three- and four-node clusters, a combination of heuristics and manual intervention is used, with the caveat that operator error during the manual intervention phase can destroy the integrity of the cluster. In "Quorum Devices (VxVM)", we discussed the policies that can be specified to determine behavior in the event of a cluster partition for greater than two-node clusters. If you choose the interventionist policy, the CMMs in all partitions suspend all cluster operations while waiting for manual operator input as to which partition should continue to form a valid cluster and which partitions should abort. It is the operator's responsibility to let the desired partition continue and to abort all other partitions. Allowing more than one partition to form a valid cluster can result in data corruption.

If you choose a pre-deterministic policy, a preferred node is requested (either the highest or lowest node id in the cluster). When a split-brain situation occurs, the partition containing the preferred node automatically becomes the new cluster, if it is able. All other partitions must be aborted manually. The selected quorum device is used solely to break a tie in the case of a split-brain for two-node clusters. Note that this situation is possible even in a four-node cluster where only two cluster members are active when the split-brain occurs. The quorum device still plays a role, but in a much more limited capacity.
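
The pre-deterministic policy amounts to a simple membership test, sketched below. This is an illustration of the policy, not Sun Cluster source; the node ids and the choice of preferred node are examples:

#include <stdio.h>

/*
 * A partition continues automatically only if it contains the
 * preferred node chosen at installation time.
 */
static int
partition_may_continue(const int *members, int count, int preferred_node)
{
        int i;

        for (i = 0; i < count; i++)
                if (members[i] == preferred_node)
                        return (1);
        return (0);
}

int
main(void)
{
        int part_a[] = { 0, 1 };        /* one half of a 2/2 split */
        int part_b[] = { 2, 3 };        /* the other half */
        int preferred = 0;              /* lowest node id policy */

        (void) printf("partition {0,1}: %s\n",
            partition_may_continue(part_a, 2, preferred) ?
            "continues automatically" : "must be aborted manually");
        (void) printf("partition {2,3}: %s\n",
            partition_may_continue(part_b, 2, preferred) ?
            "continues automatically" : "must be aborted manually");
        return (0);
}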

Once a partition has been selected to stay up, the next question is how to effectively protect the data from other partitions that should have aborted. Even though we require the operator to abort all other partitions, the command to abort a partition may not succeed immediately, and without an effective failure fencing mechanism there is always the danger of hung nodes reviving and issuing pending I/O to shared devices before processing the abort request. To handle this case, the faulty nodes are reset before a valid cluster is formed in any partition.

To prevent a failed node from reviving and issuing I/O to the multihost disks, the faulty node is forcefully terminated by one of the surviving nodes. It is taken down to the OpenBoot PROM through the Terminal Concentrator or System Service Processor (Sun Enterprise 10000 systems), and the hung image of the operating system is terminated. This termination operation prevents anyone from accidentally resuming the system by typing go at the Boot PROM. The surviving cluster members wait for a positive acknowledgment from the termination operation before proceeding with cluster reconfiguration.

If there is no response to the termination command, then the hardware power sequencer (if present) is tripped to power cycle the faulty node. If tripping is not successful, then the system displays the following message requesting information to continue cluster reconfiguration:


*** ISSUE ABORTPARTITION OR CONTINUEPARTITION ***
You must ensure that the unreachable node no longer has access to the shared data.
You may allow the proposed cluster to form after you have ensured that the unreachable node has aborted or is down.
Proposed cluster partition:

To allow the proposed partition to continue, run the following command on a node in that partition:

/opt/SUNWcluster/bin/scadmin continuepartition localnode clustername


Caution -

You should ensure that the faulty node has been successfully terminated before issuing the scadmin continuepartition command on the surviving nodes.


Partitioned, isolated, and terminated nodes do eventually boot up. If, due to some oversight, the administrator tries to join such a node into the cluster without repairing the interconnects, the node must be prevented from forming a valid cluster partition of its own while it is unable to communicate with the existing cluster.

Assume a case where two partitions are formed with three nodes in one partition and one node in the other partition. A designated node in the majority partition terminates the isolated node and the three nodes form a valid cluster in their own partition. The isolated node, on booting up, tries to form a cluster of its own due to an administrator running the startcluster(1M) command and replying in the affirmative when asked for confirmation. Because the isolated node believes it is the only node in the cluster, it tries to reserve the quorum device and actually succeeds in doing so, because none of the three nodes in the valid partition can reserve the quorum device without locking each other out.

To resolve this problem, Sun Cluster 2.2 uses nodelock, whereby a designated cluster node opens a telnet(1) session to an unused port in the Terminal Concentrator as part of its cluster reconfiguration, and keeps this session alive as long as it is a cluster member. If this node were to leave the membership, the nodelock would be passed on to one of the remaining cluster members. In the above example, if the isolated node were to try to form its own cluster, it would try to acquire this lock and fail, because one of the nodes in the existing membership (in the other partition) would be holding the lock. If a valid cluster member is unable to acquire the lock for any reason, it is not a fatal error, but is logged as an error requiring immediate attention. The locking facility should be considered a safety feature rather than a mechanism critical to the operation of the cluster, and its failure should not be considered catastrophic. In order to speedily detect faults in this area, processes in the Sun Cluster framework monitor whether the Terminal Concentrator is accessible from the cluster nodes.
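
In essence, holding the nodelock means holding an open TCP session to the chosen Terminal Concentrator port. The following C fragment is a hedged sketch of that idea; the address and port number are examples, the exact behavior of a busy TC port may differ, and the real logic lives inside the Sun Cluster framework:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct sockaddr_in tc;
        int fd;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
                perror("socket");
                return (1);
        }
        (void) memset(&tc, 0, sizeof (tc));
        tc.sin_family = AF_INET;
        tc.sin_port = htons(5002);                      /* example unused TC port */
        tc.sin_addr.s_addr = inet_addr("192.168.1.10"); /* example TC address */

        /*
         * If another cluster member already holds the lock, the TC
         * port is busy and this connection attempt fails.
         */
        if (connect(fd, (struct sockaddr *)&tc, sizeof (tc)) < 0) {
                perror("nodelock held elsewhere");
                (void) close(fd);
                return (1);
        }
        (void) printf("nodelock acquired; holding while a cluster member\n");
        (void) pause();         /* keep the session alive */
        return (0);
}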

Failure Fencing (Solstice DiskSuite)

In Sun Cluster configurations using Solstice DiskSuite as the volume manager, it is Solstice DiskSuite itself that determines cluster quorum and provides failure fencing. There is no distinction between different cluster topologies for failure fencing. That is, two-node and greater than two-node clusters are treated identically. This is possible for two reasons:

Disk fencing is accomplished in the following manner.

  1. After a node is removed from the cluster, a remaining node takes a SCSI reservation of the disk. After this, other nodes (including the node no longer in the cluster) are prevented by the disk itself from reading or writing to the disk. The disk returns a Reservation_Conflict error to the read or write command. In Solstice DiskSuite configurations, the SCSI reservation is accomplished by issuing the Sun multihost ioctl MHIOCTKOWN.

  2. Nodes that are in the cluster continuously enable the MHIOCENFAILFAST ioctl for the disks they are accessing. This ioctl is a directive to the disk driver, and gives the node the capability to panic itself if it cannot access the disk because the disk has been reserved by some other node. The MHIOCENFAILFAST ioctl causes the driver to check the error return from every read and write that this node issues to the disk for the Reservation_Conflict error code, and it also--periodically, in the background--issues a test operation to the disk to check for Reservation_Conflict. Both the foreground and background control flow paths cause a panic if Reservation_Conflict is returned.

  3. The MHIOCENFAILFAST ioctl is not specific to dual-hosted disks. If the node that has enabled the MHIOCENFAILFAST for a disk loses access to that disk due to another node reserving the disk (by SCSI-2 exclusive reservation), the node panics.

This solution to disk fencing relies on the SCSI-2 concept of disk reservation, which requires that a disk be reserved by exactly one node.
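
The two ioctls named in the steps above belong to the Solaris multihost disk interface, mhd(7I). The fragment below is a hedged sketch of how they might be issued; the device path and failfast interval are examples, the exact argument semantics should be checked against the man page, and in practice these calls are made by Solstice DiskSuite rather than by administrator code:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/mhd.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        const char *disk = "/dev/rdsk/c1t2d0s2";        /* example disk */
        int ff_interval = 1000;                         /* example: 1000 ms */
        int fd;

        fd = open(disk, O_RDWR);
        if (fd < 0) {
                perror("open");
                return (1);
        }

        /* Step 1: take the SCSI reservation on the disk. */
        if (ioctl(fd, MHIOCTKOWN, NULL) < 0)
                perror("MHIOCTKOWN");

        /*
         * Step 2: enable failfast so that this node panics if it ever
         * sees Reservation_Conflict, that is, if another node has
         * fenced it off.
         */
        if (ioctl(fd, MHIOCENFAILFAST, &ff_interval) < 0)
                perror("MHIOCENFAILFAST");

        (void) close(fd);
        return (0);
}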

For Solstice DiskSuite configurations, the installation program scinstall(1M) does not prompt you to specify a quorum device or node preference, or to select a failure fencing policy, as is done in VxVM configurations.

Preventing Partitioned Clusters (VxVM)

Two-Node Clusters

If lost interconnects occur in a two-node cluster, both nodes attempt to start the cluster reconfiguration process with only the local node in the cluster membership (because each has lost the heartbeat from the other node). The first node that succeeds in reserving the configured quorum device remains as the sole surviving member of the cluster. The node that failed to reserve the quorum device aborts.

If you try to start up the aborted node without repairing the faulty interconnect, the aborted node (which is still unable to contact the surviving node) attempts to reserve the quorum device, because it sees itself as the only node in the cluster. This attempt will fail because the reservation on the quorum device is held by the other node. This action effectively prevents a partitioned node from forming its own cluster.

Three- or Four-Node Clusters

If a node drops out of a four-node cluster as a result of a reset issued via the terminal concentrator (TC), the surviving cluster nodes cannot use a quorum device reservation to fence it, since a reservation by any one of them would prevent the other two healthy nodes from accessing the device. However, if you erroneously ran the scadmin startcluster command on the partitioned node, the partitioned node would form its own cluster, since it is unable to communicate with any other node. There are no quorum reservations in effect to prevent it from forming its own cluster.

Instead of the quorum scheme, Sun Cluster resorts to a cluster-wide lock (nodelock) mechanism. An unused port in the TC of the cluster, or the SSP, is used. (Multiple TCs are used for campus-wide clusters.) During installation, you choose the TC or SSP for this node-locking mechanism. This information is stored in the CCD. One of the cluster members always holds this lock for the lifetime of a cluster activation; that is, from the time the first node successfully forms a new cluster until the last node leaves the cluster. If the node holding the lock fails, the lock is automatically moved to another node.

The only function of the nodelock is to prevent operator error from starting a new cluster in a split-brain scenario.


Note -

The first node joining the cluster aborts if it is unable to obtain this lock. However, node failures or aborts do not occur if the second and subsequent nodes of the cluster are unable to obtain this lock.


Node locking functions in this way: