Sun Cluster 2.2 System Administration Guide

Recovering From Power Loss

When power is lost to one SPARCstorage Array, I/O operations generate errors that are detected by the volume management software. Errors are not reported until I/O transactions are made to the disk. Hot spare activity can be initiated if affected devices are set up for hot sparing.

You should monitor the configuration for these events. See Chapter 2, Sun Cluster Administration Tools, for more information on monitoring the configuration.

How to Recover From Power Loss (Solstice DiskSuite)

These are the high-level steps to recover from power loss to a SPARCstorage Array in a Solstice DiskSuite configuration:

  - Identify the errored metadevice state database replicas
  - Return the replicas to service
  - Identify the errored metadevices
  - Return the errored metadevices to service with the metareplace(1M) command

These are the detailed steps to recover from power loss to a SPARCstorage Array in a Solstice DiskSuite configuration.

  1. When power is restored, run the metadb(1M) command to identify the errored replicas.


    # metadb -s diskset
    

  2. Return replicas to service.

    After the loss of power, all metadevice state database replicas on the affected SPARCstorage Array chassis enter an errored state. Because metadevice state database replica recovery is not automatic, it is safest to perform the recovery immediately after the SPARCstorage Array returns to service. Otherwise, a subsequent failure can leave a majority of the replicas out of service and cause a kernel panic. This is the expected behavior of Solstice DiskSuite when too few replicas are available.

    While these errored replicas will be reclaimed at the next takeover (haswitch(1M) or reboot(1M)), it is best to return them to service manually by first deleting them, then adding them back.


    Note -

    Make sure that you add back the same number of replicas that were deleted on each slice. You can delete multiple replicas with a single metadb(1M) command. If you need multiple copies of replicas on one slice, you must add them in one invocation of metadb(1M) using the -c flag.
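
    For example, if the errored replicas reside on slice c1t4d0s7 and that slice held two replicas (the diskset name, slice name, and replica count here are placeholders; substitute the values reported by metadb(1M) for your configuration), you could delete and re-add them as follows:


    # metadb -s diskset -d c1t4d0s7
    # metadb -s diskset -a -c 2 c1t4d0s7
    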


  3. Run the metastat(1M) command to identify the errored metadevices.


    # metastat -s diskset
    

  4. Return errored metadevices to service by using the metareplace(1M) command, which will cause a resync of the disks.


    # metareplace -s diskset -e mirror component
    

    The -e option transitions the component (slice) to the Available state and performs a resync.

    Components that have been replaced by a hot spare should be replaced last by using the metareplace(1M) command. If the hot spare is replaced first, it could replace another errored submirror as soon as it becomes available.

    You can perform a resync on only one component of a submirror (metadevice) at a time. If all components of a submirror were affected by the power outage, each component must be replaced separately. It takes approximately 10 minutes to resync a 1.05GB disk.

    If more than one diskset was affected by the power outage, you can resync each diskset's affected submirrors concurrently. Log into each host separately to recover that host's diskset by running the metareplace(1M) command on each.


    Note -

    Depending on the number of submirrors and the number of components in these submirrors, the resync actions can require a considerable amount of time. A single submirror made up of 30 1.05GB drives might take about five hours to resync, while a more manageable configuration of five-component submirrors might take only 50 minutes.


How to Recover From Power Loss (VxVM)

A power failure can detach disk drives and cause their plexes to become detached, and therefore unavailable. In a mirrored volume, however, the volume remains active because the remaining plexes are still available. You can reattach the disk drives and recover from this condition without halting any nodes in the cluster.

These are the high-level steps to recover from power loss to a SPARCstorage Array in a VxVM configuration:

  - Identify the errored plexes and disks
  - Restore power to the failed disks
  - Make the nodes rediscover the drives and rescan the disk configuration
  - Reattach the disks and recover the volumes

These are the detailed steps to recover from power loss to a SPARCstorage Array in a VxVM configuration.

  1. Run the vxprint command to view the errored plexes.

    Optionally, specify a disk group with the -g diskgroup option.
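
    For example, where diskgroup is a placeholder for the name of an affected disk group:


    # vxprint -g diskgroup
    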

  2. Run the vxdisk command to identify the errored disks.


    # vxdisk list
    DEVICE       TYPE      DISK         GROUP        STATUS
    ...
    -            -         c1t5d0       toi          failed was:c1t5d0s2
    ...

  3. Fix the condition that resulted in the problem so that power is restored to all failed disks.

    Be sure that the disks are spun up before proceeding.

  4. Enter the following commands on all nodes in the cluster.

    In some cases, the drive(s) must be rediscovered by the node(s).


    # drvconfig
    # disks
    

  5. Enter the following commands on all nodes in the cluster.

    VxVM must scan the current disk configuration again.


    # vxdctl enable
    # vxdisk -a online
    

  6. Enter the following command on all nodes in the cluster.


    Note -

    If you are using the VxVM cluster feature (used with Oracle Parallel Server), enter the command on the master node first, then on the slave nodes.


    This will reattach disks that had transitory failures.


    # vxreattach
    

  7. Run the vxdisk command again and check the output for any remaining errors.


    # vxdisk list
    

    If there are still errors, rerun the vxreattach command as described in Step 6.

  8. VxVM cluster feature (OPS) only: If you have shared disk groups, and if media was replaced from the master node, repeat the following command for each disk that has been disconnected.

    The physical disk and the volume manager access name for that disk must be reconnected.


    # vxdg -g disk-group-name -k adddisk medianame=accessname
    

    The values for medianame and accessname appear at the end of the vxdisk list command output.

    For example:


    # vxdg -g toi -k adddisk c1t5d0=c1t5d0s2
    # vxdg -g toi -k adddisk c1t5d1=c1t5d1s2
    # vxdg -g toi -k adddisk c1t5d2=c1t5d2s2
    # vxdg -g toi -k adddisk c1t5d3=c1t5d3s2
    # vxdg -g toi -k adddisk c1t5d4=c1t5d4s2
    

    You can also use the vxdiskadm command, or the graphical user interface, to reconnect the disks.

  9. From the node, or from the master node for shared disk groups, start volume recovery.


    # vxrecover -bv [-g diskgroup]

  10. (Optional) Run the vxprint command with the -g diskgroup option to view the changes.
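
    For example, where diskgroup is a placeholder for the name of the recovered disk group:


    # vxprint -g diskgroup
    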