2.4.1 Disk/Controller/Storage Device Failure (Sun Cluster 2.2 Cluster Volume Manager Guide)

2.4.1 Disk/Controller/Storage Device Failure

Failure of a disk, controller, or other storage device may make one or more devices inaccessible from one or more nodes. If a device was being accessed at the time of failure, that device is detached from the disk group. The data layout of a mirrored device should be such that no single failure can make all of its mirrors unavailable.

The first step of recovery is to make the failed device(s) accessible again, which includes:

Replacing failed hardware components (if any)
Executing storage-device-specific recovery or startup actions (for example, using the Recovery Guru on Sun StorEdge A3000 or luxadm on Sun StorEdge A5000)
Updating the Solaris device tree (drvconfig or boot -r)

For the exact sequence of steps to perform, refer to the storage-specific administration manual.

The volume manager needs to recognize when a device becomes accessible again. Usually, this is achieved by running vxdctl enable, after which CVM can perform the recovery actions involving the device. Devices can be reattached using vxreattach, vxdiskadm (option 5), or the vxva GUI. All of these utilities attach the disk using vxdg -k adddisk. Once the disk has been attached, the volumes must be recovered using vxrecover. The exact operations required to recover depend on the kstate (kernel state) and on the state of the disk media (dm), volume, and plex records. For an explanation of the state values, refer to the Sun StorEdge Volume Manager 2.6 System Administrator's Guide. A brief discussion on recovering from various states follows.

If you notice devices in the NODEVICE state, you must reattach them using vxreattach/vxdiskadm/vxva. vxreattach is convenient to use, as it tries to figure out disk media and device access names. However, if a disk was replaced, you must attach it using vxdg/vxdiskadm/vxva. When using vxva/vxdiskadm you must specify which disk to use for the disk media. Disks that are in the REMOVED state must be attached by using the vxva/vxdisk

Note -
If the replacement disk is not initialized, you must first initialize it.
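The choice of reattach utility for a NODEVICE disk can be summarized as a small decision function. This is a minimal sketch, not part of the volume manager: the function name and its parameters are hypothetical, and the returned strings name the utilities from the text rather than exact command lines. The REMOVED case is deliberately left out.

```python
from typing import List


def reattach_steps(dm_state: str, disk_replaced: bool,
                   initialized: bool = True) -> List[str]:
    """Return the ordered steps for reattaching a disk in NODEVICE state.

    Hypothetical helper: it only encodes the rules stated in the text.
    """
    if dm_state != "NODEVICE":
        raise ValueError("this sketch only covers the NODEVICE case")

    steps: List[str] = []
    if disk_replaced:
        # A replacement disk must be initialized before it can be attached.
        if not initialized:
            steps.append("initialize the replacement disk")
        # vxreattach is not an option once the disk has been replaced.
        steps.append("attach with vxdg, vxdiskadm, or vxva")
    else:
        # vxreattach works out disk media and device access names itself.
        steps.append("reattach with vxreattach, vxdiskadm, or vxva")
    return steps
```

For example, reattach_steps("NODEVICE", disk_replaced=True, initialized=False) lists the initialization step first and the attach step second.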

A volume enters kstate ENABLED when it is started, and becomes DISABLED when it is stopped (or as a result of critical errors that render it unusable). If one or more volumes are not in kstate ENABLED, they can be started by using vxvol/vxrecover/vxva. A volume may not start if none of its plexes is in the CLEAN or ACTIVE state. In that case, use vxmend to change the state of a selected plex to CLEAN or ACTIVE, and then start the volume.
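The start-up rule described above can be modelled as follows. This is a minimal sketch under stated assumptions: the Plex and Volume records and the start_plan function are hypothetical illustrations, not volume manager interfaces, and the plan entries name utilities rather than full command lines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plex:
    name: str
    state: str  # for example "CLEAN" or "ACTIVE"


@dataclass
class Volume:
    name: str
    kstate: str  # "ENABLED" or "DISABLED"
    plexes: List[Plex] = field(default_factory=list)


# Plex states from which a volume can be started, per the text.
STARTABLE_PLEX_STATES = {"CLEAN", "ACTIVE"}


def start_plan(volume: Volume) -> List[str]:
    """Return the ordered actions needed to bring a volume to kstate ENABLED."""
    if volume.kstate == "ENABLED":
        return []  # already started, nothing to do

    plan: List[str] = []
    if not any(p.state in STARTABLE_PLEX_STATES for p in volume.plexes):
        # No usable plex: one must be marked CLEAN or ACTIVE first.
        plan.append("use vxmend to set a selected plex to CLEAN or ACTIVE")
    plan.append("start the volume with vxvol, vxrecover, or vxva")
    return plan
```

A DISABLED volume whose only plex is in some other state yields a two-step plan (vxmend, then start), whereas a DISABLED volume with a CLEAN plex only needs to be started.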

A volume may enter the NEEDSYNC state if one or more nodes leave the cluster abruptly. In this case vxrecover is started by the cluster framework to perform the necessary synchronization. When a volume is being synchronized, it will be in the SYNC state, and it will move to the ACTIVE state once complete. If a process doing recovery is killed, the volume may not transition from SYNC to the ACTIVE state. In this case, it must be recovered using vxvol -f resync.
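The NEEDSYNC, SYNC, and ACTIVE progression can be sketched as a simple lookup of the next step. The function name and the recovery_running flag are assumptions introduced for illustration; how you determine whether a recovery process is still running is outside this sketch.

```python
def next_action(volume_state: str, recovery_running: bool = True) -> str:
    """Suggest the next step for a volume after a node leaves abruptly.

    Hypothetical helper that only encodes the transitions described
    in the text: NEEDSYNC -> SYNC -> ACTIVE.
    """
    if volume_state == "NEEDSYNC":
        # The cluster framework normally starts this automatically.
        return "vxrecover (started by the cluster framework)"
    if volume_state == "SYNC":
        if recovery_running:
            return "wait for synchronization to finish"
        # The recovery process was killed, so the volume is stuck in SYNC.
        return "vxvol -f resync"
    if volume_state == "ACTIVE":
        return "nothing to do"
    raise ValueError(f"state not covered by this sketch: {volume_state}")
```

Only the case of a volume in SYNC with no recovery process running calls for manual intervention.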

Plexes that are associated with a volume but are detached have kstate DISABLED. You can recover these plexes using vxrecover, which in turn calls vxplex att.

The following procedure should enable you to recover from most common failures:

Rectify the fault condition (hardware and/or software) and make sure the devices are accessible again.

Run vxdctl enable on all nodes of the cluster.

Run vxreattach on the master node.

Run vxreattach on the other nodes that have non-shared disk groups.

Verify (by running vxprint) that the devices have been reattached. Under certain circumstances, vxreattach may not reattach removed or replaced disks. These disks must be reattached manually using vxdg/vxdiskadm/vxva.

Run vxrecover -sb on the master node.

Run vxrecover -g <dg> -sb on each other node that has a non-shared disk group, where <dg> is the name of that disk group.
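The procedure above can be laid out as an ordered, per-node command plan. The following is a minimal sketch that only builds the list and does not execute anything. The function name, the node names, and the mapping of nodes to non-shared disk groups are hypothetical, and running vxprint on every node where vxreattach was run is an assumption, since the text does not say where to verify. The first step (rectifying the fault) is manual and is not represented.

```python
from typing import Dict, List, Tuple


def recovery_plan(master: str,
                  private_dgs: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Build an ordered list of (node, command) pairs for the procedure.

    private_dgs maps every cluster node to its non-shared disk groups
    (an empty list if it has none). Nothing is executed here.
    """
    if master not in private_dgs:
        raise ValueError("the master node must be listed in private_dgs")

    # Other nodes that own at least one non-shared disk group.
    others = [n for n, dgs in private_dgs.items() if n != master and dgs]

    plan: List[Tuple[str, str]] = []
    # Make the volume manager rescan devices on every node.
    plan += [(node, "vxdctl enable") for node in private_dgs]
    # Reattach on the master first, then on nodes with non-shared groups.
    plan += [(node, "vxreattach") for node in [master] + others]
    # Assumption: verify wherever vxreattach was run.
    plan += [(node, "vxprint") for node in [master] + others]
    # Recover volumes on the master, then per disk group elsewhere.
    plan.append((master, "vxrecover -sb"))
    for node in others:
        for dg in private_dgs[node]:
            plan.append((node, f"vxrecover -g {dg} -sb"))
    return plan
```

Calling recovery_plan("node1", {"node1": [], "node2": ["localdg"]}) returns the vxdctl enable entries for both nodes first and ends with ("node2", "vxrecover -g localdg -sb"). Keeping the plan as data makes it easy to review before any command is run by hand.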