Sun Cluster 2.2 Cluster Volume Manager Guide

2.1 Cluster Volume Manager Overview

Cluster Volume Manager (CVM) is an extension of Sun StorEdge Volume Manager (SSVM) that enables multiple hosts to simultaneously access and manage a given set of disks under volume manager control. A cluster is a set of hosts sharing a set of disks; each host is referred to as a node in the cluster. The nodes are connected across a network. If one node fails, the other nodes can still access the disks. CVM presents the same logical view of the disk configurations (including changes) on all nodes.


Note -

CVM currently supports up to four nodes per cluster.


2.1.1 Shared Volume Manager Objects

When used with CVM, volume manager objects can be shared by all of the nodes in a given cluster.

CVM allows for two types of disk groups: private disk groups, which belong to (and are imported by) only one node, and cluster-shareable (shared) disk groups, which are imported by all nodes in the cluster and whose disks must therefore be accessible from every node that can join the cluster. With CVM, most disk groups are shared. However, the root disk group (rootdg) is always a private disk group.

The volume manager disks within a shared disk group are shared by all of the nodes in a cluster, so multiple nodes in a cluster can access a given volume manager disk simultaneously. Likewise, all volumes in a shared disk group can be shared by all nodes in a cluster. A volume manager disk or volume is considered shared when it is accessed by more than one node at the same time.
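
Whether a disk group is currently imported as private or shared on a node can be checked with the vxdg utility (a representative check; shared disk groups are typically flagged as shared in the state information that vxdg reports):

    # vxdg list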


Note -

Only raw device access is performed through CVM. CVM does not support shared volumes with file systems. CVM does not currently support RAID5 volumes in cluster-shareable disk groups. RAID5 volumes can, however, be used in private disk groups attached to specific nodes of a cluster.


2.1.2 How Cluster Volume Management Works

CVM works together with an externally-provided cluster manager, which is a daemon that informs CVM of changes in cluster membership. Each node starts up independently and has its own copies of the operating environment, CVM, and the cluster manager. When a node joins a cluster, it gains access to shared disks. When a node leaves a cluster, it no longer has access to those shared disks. The system administrator joins a node to a cluster by starting the cluster manager on that node.

Figure 2-1 illustrates a simple cluster arrangement. All of the nodes are connected by a network, and all are connected to the disks of a cluster-shareable disk group. To the cluster manager, all nodes are the same. However, CVM requires that one node act as the master node; the other nodes are slave nodes. The master node is responsible for coordinating certain volume manager activities. CVM software determines which node performs the master function (any node is capable of being a master node), and this role changes only if the master leaves the cluster, in which case one of the slave nodes becomes the new master. In this example, Node 1 is the master node and Node 2, Node 3, and Node 4 are the slave nodes.

Figure 2-1 Example of a Four-Node Cluster

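To determine whether a particular node is currently the master or a slave, the cluster-related options of the vxdctl utility can be used, assuming those options are available in this release (a representative check):

    # vxdctl -c mode

The command reports whether the node on which it is run is acting as the cluster master or as a slave.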

The system administrator designates a disk group as cluster-shareable using the vxdg utility. Refer to "2.2.3 The vxdg Command" for further information. Once a disk group is imported as cluster-shareable for one node, the disk headers are marked with the cluster ID. When other nodes join the cluster, they recognize the disk group as cluster-shareable and import it. The system administrator can import or deport a shared disk group at any time; the operation takes place in a distributed fashion on all nodes.
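
For example, assuming a shared disk group named datadg (a hypothetical name), the disk group can be imported as cluster-shareable, and later deported, with:

    # vxdg -s import datadg
    # vxdg deport datadg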

Each physical disk is marked with a unique disk ID. When the cluster starts up on the master, it imports all the shared disk groups (except for any that have the noautoimport attribute set). When a slave tries to join, the master sends it a list of the disk IDs it has imported and the slave checks to see if it can access all of them. If the slave cannot access one of the imported disks on the list, it abandons its attempt to join the cluster. If it can access all of the disks on the list, it imports the same set of shared disk groups as the master and joins the cluster. When a node leaves the cluster, it deports all its imported shared disk groups, but they remain imported on the surviving nodes. This is true whether the departing node is the master or a slave.
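
Before a node joins the cluster, the administrator can confirm that it sees the shared disks by listing them on that node (a representative check; the device name shown is hypothetical):

    # vxdisk list
    # vxdisk list c1t1d0s2

The detailed listing for a disk includes its unique disk ID and, if the disk is in an imported disk group, the disk group to which it belongs.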

Any reconfiguration of a shared disk group is performed with the cooperation of all nodes. Configuration changes to the disk group happen simultaneously on all nodes and the changes are identical. These changes are atomic in nature, so they either occur simultaneously on all nodes or do not occur at all.

All members of the cluster have simultaneous read and write access to any cluster-shareable disk group. Access by the active nodes of the cluster is not affected by a failure in any other node. The data contained in a cluster-shareable disk group is available as long as at least one node is active in the cluster. Regardless of which node accesses the cluster-shareable disk group, the configuration of the disk group looks the same. Applications running on each node can access the data on the volume manager disks simultaneously.


Note -

CVM does not protect against simultaneous writes to shared volumes by more than one node. It is assumed that any consistency control is done at the application level (using a distributed lock manager, for example).


2.1.2.1 Configuration and Initialization

Before any nodes can join a new cluster for the first time, the system administrator must supply certain configuration information. This information is supplied during cluster manager setup and is normally stored in some type of cluster manager configuration database. The precise content and format of this information depend on the characteristics of the cluster manager.

The type of information required by CVM is as follows:

2.1.2.2 Cluster Reconfiguration

Any time there is a change in the state of the cluster (in the form of a node leaving or joining), a cluster reconfiguration occurs. The cluster manager for each node monitors other nodes in the cluster and calls the cluster reconfiguration utility, vxclust, when there is a change in cluster membership. vxclust coordinates cluster reconfigurations and provides communication between CVM and the cluster manager. The cluster manager and vxclust work together to ensure that each step in the cluster reconfiguration is completed in the correct order.

During a cluster reconfiguration, I/O to shared disks is suspended. It is resumed when the reconfiguration completes. Applications can therefore appear to be frozen for a short time.

If other operations (such as volume manager operations or recoveries) are in progress, the cluster reconfiguration can be delayed until those operations have completed. Volume reconfigurations (described later) do not take place at the same time as cluster reconfigurations. Depending on the circumstances, an operation can be held up and restarted later. In most cases, cluster reconfiguration takes precedence. However, if the volume reconfiguration is in the commit stage, it will complete first.

For further information on cluster reconfiguration, refer to "2.2.1 The vxclust Command".

2.1.2.3 Volume Reconfiguration

Volume reconfiguration is the process of creating, changing, and removing volume manager objects in the configuration (such as disk groups, volumes, mirrors, and so forth). In a cluster, this process is performed with the cooperation of all nodes. Volume reconfiguration is distributed to all nodes; identical configuration changes occur on all nodes simultaneously.

The vxconfigd daemons play an active role in volume reconfiguration. For the reconfiguration to succeed, vxconfigd must be running on all nodes.

The reconfiguration is initiated and coordinated by the initiating node, which is the node on which the system administrator is running the utility that is requesting changes to volume manager objects.
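
For example, a mirrored volume might be created in a shared disk group by running vxassist on whichever node the administrator chooses; that node becomes the initiating node for the resulting reconfiguration (the disk group and volume names are hypothetical):

    # vxassist -g datadg make vol01 1g layout=mirror nmirror=2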

The utility on the initiating node contacts its local vxconfigd daemon, which performs some local checking to make sure that a requested change is reasonable. For instance, it fails an attempt to create a new disk group when one with the same name already exists. vxconfigd on the initiating node then sends messages with the details of the changes to the vxconfigd daemons on all other nodes in the cluster. The vxconfigd daemons on the non-initiating nodes then perform their own checking. For example, a non-initiating node checks that it does not have a private disk group with the same name as the one being created; if the operation involves a new disk, each node checks that it can access that disk. When the vxconfigd daemons on all the nodes agree that the proposed change is reasonable, each notifies its kernel, and the kernels then cooperate to either commit or abort the transaction. Before the transaction can commit, all of the kernels ensure that no I/O is underway. The master is responsible for coordinating the commit of the transaction.

If vxconfigd on any node goes away during a reconfiguration process, all nodes will be notified and the operation will fail. If any node leaves the cluster, the operation will fail unless the master has already committed the operation. If the master leaves the cluster, the new master (which was a slave previously) either completes or fails the operation. This depends on whether or not it received notification of successful completion from the previous master. This notification is done in such a way that if the new master did not receive it, neither did any other slave.

If a node attempts to join the cluster while a volume reconfiguration is being performed, the results depend on how far the reconfiguration has progressed. If the kernel is not yet involved, the volume reconfiguration is suspended and restarted when the join is complete. If the kernel is involved, the join waits until the reconfiguration is complete.

When an error occurs (such as when a check on a slave fails or a node leaves the cluster), the error is returned to the utility and a message is issued to the console on the initiating node to identify the node on which the error occurred.

2.1.2.4 Node Shutdown

The system administrator can shut down the cluster on a given node by invoking the cluster manager shutdown procedure on that node. This terminates cluster components after cluster applications have been stopped. CVM supports clean node shutdown, which is the ability of a node to leave the cluster gracefully when all access to shared volumes has ceased. The host is still operational, but cluster applications cannot be run on it.

CVM maintains global state information for each volume. This enables CVM to accurately determine which volumes need recovery when a node crashes. When a node leaves the cluster because of a crash or some other unclean departure, CVM determines which volumes may have incomplete writes, and the master resynchronizes those volumes. If Dirty Region Logging (DRL) is active for any of those volumes, it is used.

Clean node shutdown should be used after, or in conjunction with, a procedure to halt all cluster applications. Depending on the characteristics of the clustered application and its shutdown procedure, it can be a long time before the shutdown is successful (minutes to hours). For instance, many applications have the concept of draining, where they accept no new work, but complete any work in progress before exiting. This process can take a long time if, for instance, a long-running transaction is active.

When the CVM shutdown procedure is invoked, it checks all volumes in all shared disk groups on the node that is being shut down. If none of those shared volumes is still open, the shutdown proceeds; if any shared volume is still in use on that node, the shutdown procedure fails.

2.1.2.5 Node Abort

If a node does not leave cleanly, it is either because the host crashed or because some cluster component decided to make the node leave on an emergency basis. The ensuing cluster reconfiguration calls the CVM abort function. This function attempts to halt all access to shared volumes at once, although it waits for I/O that is already at the disk to complete. I/O operations that have not yet been started are failed, and the shared volumes are removed. Applications that were accessing the shared volumes therefore fail with errors.

After a node abort or crash, the shared volumes must be recovered (either by a surviving node or by a subsequent cluster restart) because it is very likely that there are unsynchronized mirrors.

2.1.2.6 Cluster Shutdown

When all the nodes in the cluster leave, the determination of whether or not the shared volumes should be recovered has to be made at the next cluster startup. If all nodes left cleanly, there is no need for recovery. If the last node left cleanly and resynchronization resulting from the non-clean leave of earlier nodes was complete, there is no need for recovery. However, recovery must be performed if the last node did not leave cleanly or if resynchronization from previous leaves was not complete.

2.1.3 Disks and CVM

The nodes in a cluster must always agree on the status of a disk. In particular, if one node cannot write to a given disk, all nodes must stop accessing that disk before the results of the write operation are returned to the caller. Therefore, if a node cannot contact a disk, it should contact another node to check on the disk's status. If the disk has failed, no node will be able to access it and the nodes can agree to detach the disk. If the disk has not failed, but rather the access paths from some of the nodes have failed, the nodes cannot agree on the status of the disk. Some policy must exist to resolve this type of discrepancy. The current policy is to detach the disk, even though it has not failed. While this resolves the status issue, it leaves mirrored volumes with fewer mirrors and leaves users of non-mirrored volumes with no access to their data.


Note -

This policy reflects the current implementation. In a future release of CVM, the system administrator will be able to select another policy. For example, an alternative policy would be for any node that cannot access the disk to leave the cluster; this maintains the integrity of data access at the cost of keeping some nodes outside the cluster.


2.1.4 Dirty Region Logging and CVM

Dirty Region Logging (DRL) is an optional property of a volume that provides speedy recovery of mirrored volumes after a system failure. Dirty Region Logging is supported in cluster-shareable disk groups. This section provides a brief overview of DRL and outlines differences between SSVM DRL and the CVM implementation of DRL. For more information on DRL, refer to Chapter 1 in the applicable Sun StorEdge Volume Manager 2.6 System Administrator's Guide.

DRL keeps track of the regions that have changed due to I/O writes to a mirrored volume and uses this information to recover only the portions of the volume that need to be recovered. DRL logically divides a volume into a set of consecutive regions and maintains a dirty region log that contains a status bit representing each region of the volume. Log subdisks store the dirty region log of a volume that has DRL enabled. A volume with DRL has at least one log subdisk, which is associated with one of the volume's plexes.

Before writing any data to the volume, the regions being written are marked dirty in the log. If a write causes a log region to become dirty when it was previously clean, the log is synchronously written to disk before the write operation can occur. On system restart, the volume manager recovers only those regions of the volume which are marked as dirty in the dirty region log.

The CVM implementation of DRL differs slightly from the SSVM implementation. The following sections outline some of the differences and discuss some aspects of the CVM implementation.

2.1.4.1 Log Format and Size

As with SSVM, the CVM dirty region log exists on a log subdisk in a mirrored volume.

An SSVM dirty region log has a recovery map and a single active map. A CVM dirty region log, however, has one recovery map and multiple active maps (one for each node in the cluster). Unlike SSVM, CVM places its recovery map at the beginning of the log.

The CVM dirty region log is typically larger than an SSVM dirty region log, as it must accommodate active maps for all nodes in the cluster plus a recovery map. The size of each map within the dirty region log is one or more whole blocks. vxassist automatically takes care of allocating a sufficiently large dirty region log.

The log size depends on the volume size and the number of nodes. The log must be large enough to accommodate all maps (one map per node plus a recovery map). Each map should be one block long for each two gigabytes of volume size. For a two-gigabyte volume in a two-node cluster, a log size of three blocks (one block per map) should be sufficient; this is the minimum log size. A four-gigabyte volume in a four-node cluster requires a log size of ten blocks, and so on.
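
For example, a dirty region log can be added to an existing mirrored volume in a shared disk group with vxassist (the disk group and volume names are hypothetical); when this is done in the cluster, vxassist allocates a log large enough to hold an active map for each node plus the recovery map:

    # vxassist -g datadg addlog vol01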

2.1.4.2 Compatibility

Except for the addition of a CVM-specific magic number, CVM DRL headers are the same as their SSVM counterparts.

It is possible to import an SSVM disk group (and its volumes) as a CVM shared disk group, and vice versa. However, the dirty region logs of the imported disk group may be considered invalid, and a full recovery can result.

If a CVM shared disk group is imported by an SSVM system, SSVM considers the logs of the CVM volumes to be invalid and conducts a full volume recovery. After this recovery completes, the volume manager uses CVM Dirty Region Logging.

CVM is capable of performing a DRL recovery on a non-shared SSVM volume. However, if an SSVM volume is moved to a CVM system and imported as shared, the dirty region log will probably be too small to accommodate all the nodes in the cluster. CVM therefore marks the log invalid and performs a full recovery anyway. Similarly, moving a DRL volume from a two-node cluster to a four-node cluster results in a log that is too small, which CVM handles with a full volume recovery. In both cases, the system administrator is responsible for allocating a new log of sufficient size.
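
In these situations, the undersized log can be removed and a suitably sized log added with vxassist (a representative sequence; the disk group and volume names are hypothetical):

    # vxassist -g datadg remove log vol01
    # vxassist -g datadg addlog vol01

When run in the cluster, vxassist allocates a replacement log large enough for the current number of nodes, as described in "2.1.4.1 Log Format and Size".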

2.1.4.3 Recovery With DRL

When one or more nodes in a cluster crash, DRL needs to be able to handle the recovery of all volumes in use by those nodes when the crash(es) occurred. On initial cluster startup, all active maps are incorporated into the recovery map; this is done during the volume start operation.

Nodes that crash (that is, leave the cluster in a dirty state) are not allowed to rejoin the cluster until their DRL active maps have been incorporated into the recovery maps on all affected volumes. The recovery utilities compare a crashed node's active maps with the recovery map and make any necessary updates before the node can rejoin the cluster and resume I/O to the volume (which overwrites the active map). During this time, other nodes can continue to perform I/O.

The CVM kernel tracks which nodes have crashed. If multiple node recoveries are underway in a cluster at a given time, their respective recoveries and recovery map updates can compete with each other. The CVM kernel therefore tracks changes in the DRL recovery state and prevents I/O operation collisions.

The master performs volatile tracking of DRL recovery map updates for each volume and prevents multiple utilities from changing the recovery map simultaneously.