Sun Cluster 2.2 Cluster Volume Manager Guide

2.1.2 How Cluster Volume Management Works

CVM works together with an externally provided cluster manager, a daemon that informs CVM of changes in cluster membership. Each node starts up independently and has its own copies of the operating environment, CVM, and the cluster manager. When a node joins a cluster, it gains access to shared disks; when a node leaves a cluster, it loses access to those shared disks. The system administrator joins a node to a cluster by starting the cluster manager on that node.
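Under Sun Cluster 2.2, the cluster manager is normally started through the scadmin utility. The subcommands and arguments vary with the cluster software release, so the lines below are only an illustrative sketch:

    # scadmin startcluster     (first node of a new cluster)
    # scadmin startnode        (each additional node joining the cluster)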

Figure 2-1 illustrates a simple cluster arrangement. All of the nodes are connected by a network and have access to a cluster-shareable disk group. To the cluster manager, all nodes are the same. However, CVM requires that one node act as the master node; the other nodes are slave nodes. The master node is responsible for coordinating certain volume manager activities. CVM software determines which node performs the master function (any node is capable of becoming the master); the role changes only if the master node leaves the cluster, in which case one of the slave nodes becomes the new master. In this example, Node 1 is the master node and Node 2, Node 3, and Node 4 are the slave nodes.

Figure 2-1 Example of a Four-Node Cluster

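To check which role a particular node currently holds, the vxdctl command can be queried in cluster mode. The option and the output format vary by CVM release, so treat the following as an illustrative sketch only:

    # vxdctl -c mode
    mode: enabled: cluster active - MASTER

A slave node reports SLAVE rather than MASTER.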

The system administrator designates a disk group as cluster-shareable using the vxdg utility. Refer to "2.2.3 The vxdg Command" for further information. Once a disk group is imported as cluster-shareable for one node, the disk headers are marked with the cluster ID. When other nodes join the cluster, they recognize the disk group as cluster-shareable and import it. The system administrator can import or deport a shared disk group at any time; the operation takes place in a distributed fashion on all nodes.
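For example, assuming a hypothetical disk group named shared_dg, the -s option to vxdg is typically used to import it as a cluster-shareable (shared) disk group, and vxdg deport removes it from use. The exact syntax for a given release is described in "2.2.3 The vxdg Command"; the lines below are only a sketch:

    # vxdg -s import shared_dg
    # vxdg deport shared_dg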

Each physical disk is marked with a unique disk ID. When the cluster starts up on the master, it imports all the shared disk groups (except for any that have the noautoimport attribute set). When a slave tries to join, the master sends it a list of the disk IDs it has imported and the slave checks to see if it can access all of them. If the slave cannot access one of the imported disks on the list, it abandons its attempt to join the cluster. If it can access all of the disks on the list, it imports the same set of shared disk groups as the master and joins the cluster. When a node leaves the cluster, it deports all its imported shared disk groups, but they remain imported on the surviving nodes. This is true if either type of node, master or slave, leaves the cluster.
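One way to confirm which disk groups a node currently treats as shared is to list them with vxdg; on many releases, shared disk groups are flagged in the STATE column. The disk group name and the output layout below are illustrative only:

    # vxdg list
    NAME          STATE              ID
    rootdg        enabled            ...
    shared_dg     enabled,shared     ...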

Any reconfiguration of a shared disk group is performed with the cooperation of all nodes. Configuration changes to the disk group are identical on all nodes and are atomic in nature: they either occur simultaneously on all nodes or do not occur at all.

All members of the cluster have simultaneous read and write access to any cluster-shareable disk group. Access by the active nodes of the cluster is not affected by a failure in any other node. The data contained in a cluster-shareable disk group is available as long as at least one node is active in the cluster. Regardless of which node accesses the cluster-shareable disk group, the configuration of the disk group looks the same. Applications running on each node can access the data on the volume manager disks simultaneously.


Note -

CVM does not protect against simultaneous writes to shared volumes by more than one node. It is assumed that any consistency control is done at the application level (using a distributed lock manager, for example).


2.1.2.1 Configuration and Initialization

Before any nodes can join a new cluster for the first time, the system administrator must supply certain configuration information. This information is supplied during cluster manager setup and is normally stored in some type of cluster manager configuration database. The precise content and format of this information is dependent on the characteristics of the cluster manager.

The type of information required by CVM is as follows:

2.1.2.2 Cluster Reconfiguration

Any time there is a change in the state of the cluster (in the form of a node leaving or joining), a cluster reconfiguration occurs. The cluster manager for each node monitors other nodes in the cluster and calls the cluster reconfiguration utility, vxclust, when there is a change in cluster membership. vxclust coordinates cluster reconfigurations and provides communication between CVM and the cluster manager. The cluster manager and vxclust work together to ensure that each step in the cluster reconfiguration is completed in the correct order.

During a cluster reconfiguration, I/O to shared disks is suspended. It is resumed when the reconfiguration completes. Applications can therefore appear to be frozen for a short time.

If other operations (such as volume manager operations or recoveries) are in progress, the cluster reconfiguration can be delayed until those operations have completed. Volume reconfigurations (described later) do not take place at the same time as cluster reconfigurations. Depending on the circumstances, an operation can be held up and restarted later. In most cases, cluster reconfiguration takes precedence. However, if the volume reconfiguration is in the commit stage, it will complete first.

For further information on cluster reconfiguration, refer to "2.2.1 The vxclust Command".

2.1.2.3 Volume Reconfiguration

Volume reconfiguration is the process of creating, changing, and removing the volume manager objects in the configuration (such as disk groups, volumes, mirrors, and so forth). In a cluster, this process is performed with the cooperation of all nodes. Volume reconfiguration is distributed to all nodes; identical configuration changes occur on all nodes simultaneously.

The vxconfigd daemons play an active role in volume reconfiguration. For the reconfiguration to succeed, vxconfigd must be running on all nodes.
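A simple way to verify that vxconfigd is running and enabled on a node is to query its mode with vxdctl; the output shown below is illustrative:

    # vxdctl mode
    mode: enabled

If the daemon is not running and enabled on every node, the reconfiguration cannot proceed.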

The reconfiguration is initiated and coordinated by the initiating node, which is the node on which the system administrator is running the utility that is requesting changes to volume manager objects.

The utility on the initiating node contacts its local vxconfigd daemon, which performs some local checking to make sure that a requested change is reasonable. For instance, it fails an attempt to create a new disk group when one with the same name already exists. vxconfigd on the initiating node then sends messages with the details of the changes to the vxconfigd daemons on all other nodes in the cluster. Each non-initiating vxconfigd then performs its own checking. For example, a non-initiating node checks that it does not have a private disk group with the same name as the one being created; if the operation involves a new disk, each node checks that it can access that disk.

When the vxconfigd daemons on all nodes agree that the proposed change is reasonable, each vxconfigd notifies its kernel and the kernels then cooperate to either commit or abort the transaction. Before the transaction can commit, all of the kernels ensure that no I/O is underway. The master is responsible for initiating a reconfiguration and coordinating the transaction commit.

If vxconfigd on any node stops running during a reconfiguration, all nodes are notified and the operation fails. If any node leaves the cluster, the operation fails unless the master has already committed it. If the master leaves the cluster, the new master (previously a slave) either completes or fails the operation, depending on whether it received notification of successful completion from the previous master. This notification is done in such a way that if the new master did not receive it, neither did any other slave.

If a node attempts to join the cluster while a volume reconfiguration is being performed, the results depend on how far the reconfiguration has progressed. If the kernel is not yet involved, the volume reconfiguration is suspended and restarted when the join is complete. If the kernel is involved, the join waits until the reconfiguration is complete.

When an error occurs (such as when a check on a slave fails or a node leaves the cluster), the error is returned to the utility and a message is issued to the console on the initiating node to identify the node on which the error occurred.

2.1.2.4 Node Shutdown

The system administrator can shut down the cluster on a given node by invoking the cluster manager shutdown procedure on that node. This terminates cluster components after cluster applications have been stopped. CVM supports clean node shutdown, which is the ability of a node to leave the cluster gracefully when all access to shared volumes has ceased. The host is still operational, but cluster applications cannot be run on it.
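On Sun Cluster 2.2, the cluster manager shutdown procedure for a single node is normally invoked with the scadmin utility; the exact form depends on the cluster software release, so the line below is only a sketch:

    # scadmin stopnode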

CVM maintains global state information for each volume. This enables CVM to determine accurately which volumes need recovery when a node crashes. When a node leaves the cluster due to a crash or by some other means that is not clean, CVM determines which volumes may have incomplete writes, and the master resynchronizes those volumes. If Dirty Region Logging (DRL) is active for any of those volumes, it is used.
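DRL is enabled on a volume by associating a log with it, typically through vxassist addlog. The disk group and volume names below are placeholders, and the attributes accepted vary by release:

    # vxassist -g shared_dg addlog vol01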

Clean node shutdown should be used after, or in conjunction with, a procedure to halt all cluster applications. Depending on the characteristics of the clustered application and its shutdown procedure, it can be a long time before the shutdown is successful (minutes to hours). For instance, many applications have the concept of draining, where they accept no new work, but complete any work in progress before exiting. This process can take a long time if, for instance, a long-running transaction is active.

When the CVM shutdown procedure is invoked, it checks all volumes in all shared disk groups on the node that is being shut down, and then either proceeds with the shutdown (if none of those volumes remain in use) or fails the shutdown request (if any of them are still open).

2.1.2.5 Node Abort

If a node does not leave cleanly, it is either because the host crashed or because some cluster component decided to make the node leave on an emergency basis. The ensuing cluster reconfiguration calls the CVM abort function. This function attempts to halt all access to shared volumes immediately, although it waits for I/O that has already been issued to the disks to complete. I/O operations that have not yet been started are failed, and the shared volumes are removed. Applications that were accessing the shared volumes therefore fail with errors.

After a node abort or crash, the shared volumes must be recovered (either by a surviving node or by a subsequent cluster restart) because it is very likely that there are unsynchronized mirrors.
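Recovery of shared volumes is normally started automatically by the surviving master, but it can also be initiated manually with vxrecover. The disk group name below is a placeholder and the options shown are only a sketch:

    # vxrecover -g shared_dg -b

The -b option runs the recovery operations in the background.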

2.1.2.6 Cluster Shutdown

When all the nodes in the cluster leave, whether the shared volumes need to be recovered must be determined at the next cluster startup. If all nodes left cleanly, there is no need for recovery. If the last node left cleanly and resynchronization resulting from the non-clean leave of earlier nodes was complete, there is no need for recovery. However, recovery must be performed if the last node did not leave cleanly or if resynchronization from previous leaves was not complete.