Sun Cluster Concepts Guide for Solaris OS

Quorum and Quorum Devices

This section contains the following topics:

Note –

For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.

Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time because multiple active partitions might cause data corruption. The Cluster Membership Monitor (CMM) and quorum algorithm guarantee that at most one instance of the same cluster is operational at any time, even if the cluster interconnect is partitioned.

For an introduction to quorum and CMM, see Cluster Membership in Sun Cluster Overview for Solaris OS.

Two types of problems arise from cluster partitions:

Split brain
Amnesia

Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters. Each partition “believes” that it is the only partition because the nodes in one partition cannot communicate with the node in the other partition.

Amnesia occurs when the cluster restarts after a shutdown with cluster configuration data older than at the time of the shutdown. This problem can occur when you start the cluster on a node that was not in the last functioning cluster partition.

Sun Cluster software avoids split brain and amnesia by:

Assigning each node one vote
Mandating a majority of votes for an operational cluster

A partition with the majority of votes gains quorum and is allowed to operate. This majority vote mechanism prevents split brain and amnesia when more than two nodes are configured in a cluster. However, counting node votes alone is not sufficient when more than two nodes are configured in a cluster. In a two-node cluster, a majority is two. If such a two-node cluster becomes partitioned, an external vote is needed for either partition to gain quorum. This external vote is provided by a quorum device.

About Quorum Vote Counts

Use the clquorum show command to determine the following information:

Total configured votes
Current present votes
Votes required for quorum

See the cluster(1CL) man page.

Both nodes and quorum devices contribute votes to the cluster to form quorum.

A node contributes votes depending on the node's state:

A node has a vote count of one when it boots and becomes a cluster member.
A node has a vote count of zero when the node is being installed.
A node has a vote count of zero when an system administrator places the node into maintenance state.

Quorum devices contribute votes that are based on the number of votes that are connected to the device. When you configure a quorum device, Sun Cluster software assigns the quorum device a vote count of N-1 where N is the number of connected votes to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).

A quorum device contributes votes if one of the following two conditions are true:

At least one of the nodes to which the quorum device is currently attached is a cluster member.
At least one of the nodes to which the quorum device is currently attached is booting, and that node was a member of the last cluster partition to own the quorum device.

You configure quorum devices during the cluster installation, or afterwards, by using the procedures that are described in Chapter 6, Administering Quorum, in Sun Cluster System Administration Guide for Solaris OS.

About Failure Fencing

A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When split brain occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might “believe” it has sole access and ownership to the multihost devices. When multiple nodes attempt to write to the disks, data corruption can occur.

Failure fencing limits node access to multihost devices by physically preventing access to the disks. Failure fencing applies only to nodes, not to zones. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, resulting in data integrity.

Device services provide failover capability for services that use multihost devices. When a cluster member that currently serves as the primary (owner) of the device group fails or becomes unreachable, a new primary is chosen. The new primary enables access to the device group to continue with only minor interruption. During this process, the old primary must forfeit access to the devices before the new primary can be started. However, when a member drops out of the cluster and becomes unreachable, the cluster cannot inform that node to release the devices for which it was the primary. Thus, you need a means to enable surviving members to take control of and access global devices from failed members.

The Sun Cluster software uses SCSI disk reservations to implement failure fencing. Using SCSI reservations, failed nodes are “fenced” away from the multihost devices, preventing them from accessing those disks.

SCSI-2 disk reservations support a form of reservations, which either grants access to all nodes attached to the disk (when no reservation is in place). Alternatively, access is restricted to a single node (the node that holds the reservation).

When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure fencing procedure to prevent the other node from accessing shared disks. When this failure fencing occurs, the fenced node panics with a “reservation conflict” message on its console.

The discovery that a node is no longer a cluster member, triggers a SCSI reservation on all the disks that are shared between this node and other nodes. The fenced node might not be “aware” that it is being fenced and if it tries to access one of the shared disks, it detects the reservation and panics.

Failfast Mechanism for Failure Fencing

The mechanism by which the cluster framework ensures that a failed node cannot reboot and begin writing to shared storage is called failfast.

Nodes that are cluster members continuously enable a specific ioctl, MHIOCENFAILFAST, for the disks to which they have access, including quorum disks. This ioctl is a directive to the disk driver. The ioctl gives a node the capability to panic itself if it cannot access the disk due to the disk being reserved by some other node.

The MHIOCENFAILFAST ioctl causes the driver to check the error return from every read and write that a node issues to the disk for the Reservation_Conflict error code. The ioctl periodically, in the background, issues a test operation to the disk to check for Reservation_Conflict. Both the foreground and background control flow paths panic if Reservation_Conflict is returned.

For SCSI-2 disks, reservations are not persistent. Reservations do not survive node reboots. For SCSI-3 disks with Persistent Group Reservation (PGR), reservation information is stored on the disk and persists across node reboots. The failfast mechanism works the same, whether you have SCSI-2 disks or SCSI-3 disks.

If a node loses connectivity to other nodes in the cluster, and it is not part of a partition that can achieve quorum, it is forcibly removed from the cluster by another node. Another node that is part of the partition that can achieve quorum places reservations on the shared disks. When the node that does not have quorum attempts to access the shared disks, it receives a reservation conflict and panics as a result of the failfast mechanism.

After the panic, the node might reboot and attempt to rejoin the cluster or, if the cluster is composed of SPARC based systems, stay at the OpenBoot PROM (OBP) prompt. The action that is taken is determined by the setting of the auto-boot parameter. You can set auto-boot with eeprom, at the OpenBoot PROM ok prompt in a SPARC based cluster. See the eeprom(1M) man page. Alternatively, you can set up this parameter with the SCSI utility that you optionally run after the BIOS boots in an x86 based cluster.

About Quorum Configurations

The following list contains facts about quorum configurations:

Quorum devices can contain user data.
In an N+1 configuration where N quorum devices are each connected to one of the 1 through N nodes and the N+1 node, the cluster survives the death of either all 1 through N nodes or any of the N/2 nodes. This availability assumes that the quorum device is functioning correctly.
In an N-node configuration where a single quorum device connects to all nodes, the cluster can survive the death of any of the N-1 nodes. This availability assumes that the quorum device is functioning correctly.
In an N-node configuration where a single quorum device connects to all nodes, the cluster can survive the failure of the quorum device if all cluster nodes are available.

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Adhering to Quorum Device Requirements

Ensure that Sun Cluster software supports your specific device as a quorum device. If you ignore this requirement, you might compromise your cluster's availability.

Note –

For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.

Sun Cluster software supports the following types of quorum devices:

Multihosted shared disks that support SCSI-3 PGR reservations
Dual-hosted shared disks that support SCSI-2 reservations
A Network-Attached Storage (NAS) device from Sun Microsystems, Incorporated or from Network Appliance, Incorporated
A quorum server process that runs on the quorum server machine

Note –

A replicated device is not supported as a quorum device.

In a two–node configuration, you must configure at least one quorum device to ensure that a single node can continue if the other node fails. See Figure 3–2.

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Adhering to Quorum Device Best Practices

Use the following information to evaluate the best quorum configuration for your topology:

Do you have a device that is capable of being connected to all nodes of the cluster?
- If yes, configure that device as your one quorum device. You do not need to configure another quorum device because your configuration is the most optimal configuration.
  
  Caution –
  If you ignore this requirement and add another quorum device, the additional quorum device reduces your cluster's availability.
- If no, configure your dual-ported device or devices.
Ensure that the total number of votes contributed by quorum devices is strictly less than the total number of votes contributed by nodes. Otherwise, your nodes cannot form a cluster if all disks are unavailable, even if all nodes are functioning.

Note –
In particular environments, you might desire to reduce overall cluster availability to meet your needs. In these situations, you can ignore this best practice. However, not adhering to this best practice decreases overall availability. For example, in the configuration that is outlined in Atypical Quorum Configurations the cluster is less available: the quorum votes exceed the node votes. In a cluster, if access to the shared storage between Nodes A and Node B is lost, the entire cluster fails.

See Atypical Quorum Configurations for the exception to this best practice.
Specify a quorum device between every pair of nodes that shares access to a storage device. This quorum configuration speeds the failure fencing process. See Quorum in Greater Than Two–Node Configurations.
In general, if the addition of a quorum device makes the total cluster vote even, the total cluster availability decreases.
Quorum devices slightly slow reconfigurations after a node joins or a node dies. Therefore, do not add more quorum devices than are necessary.

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Recommended Quorum Configurations

This section shows examples of quorum configurations that are recommended. For examples of quorum configurations you should avoid, see Bad Quorum Configurations.

Quorum in Two–Node Configurations

Two quorum votes are required for a two-node cluster to form. These two votes can derive from the two cluster nodes, or from just one node and a quorum device.

Figure 3–2 Two–Node Configuration

Illustration: Shows Node A and Node B with one quorum
device that is connected to two nodes.

Quorum in Greater Than Two–Node Configurations

You can configure a greater than two-node cluster without a quorum device. However, if you do so, you cannot start the cluster without a majority of nodes in the cluster.

Illustration: Config1: NodeA-D. A/B connect to (->) QD1.
C/D -> QD2. Config2: NodeA-C. A/C -> QD1. B/C -> QD2. Config3: NodeA-C ->
one QD.

Atypical Quorum Configurations

Figure 3–3 assumes you are running mission-critical applications (Oracle database, for example) on Node A and Node B. If Node A and Node B are unavailable and cannot access shared data, you might want the entire cluster to be down. Otherwise, this configuration is suboptimal because it does not provide high availability.

For information about the best practice to which this exception relates, see Adhering to Quorum Device Best Practices.

Figure 3–3 Atypical Configuration

Illustration: NodeA-D. Node A/B connect to QD1-4. NodeC
connect to QD4. NodeD connect to QD4. Total votes = 10. Votes required for
quorum = 6.

Bad Quorum Configurations

This section shows examples of quorum configurations you should avoid. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Illustration: Config1: NodeA-B. A/B connect to -> QD1/2.
Config2: NodeA-D. A/B -> QD1/2. Config3: NodeA-C. A/B-> QD1/2 & C -> QD2.