Sun Cluster 2.2 System Administration Guide

14.1.1 How to Recover From Power Loss (Solstice DiskSuite)

These are the high-level steps to recover from power loss to a disk enclosure in a Solstice DiskSuite environment:

Identifying the errored replicas
Returning the errored replicas to service
Identifying the errored devices
Returning the errored devices to service
Resyncing the disks

These are the detailed steps to recover from power loss to a disk enclosure in a Solstice DiskSuite environment.

When power is restored, use the metadb(1M) command to identify the errored replicas:
# metadb -s diskset
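
The status letters that metadb(1M) prints in front of each replica can be filtered mechanically. The following is a minimal sketch, not part of Solstice DiskSuite. It assumes that uppercase status letters (W, M, D, F, S, R) mark a replica with a problem, as the metadb(1M) flag legend describes, and that the status columns appear before the first number on each line. The function name errored_replicas is hypothetical.

```shell
# errored_replicas: hypothetical helper, not part of Solstice DiskSuite.
# Reads metadb(1M) output on stdin and prints the device of every replica
# whose status columns contain an uppercase (error) flag.
# Assumption: error flags are the uppercase letters W, M, D, F, S and R,
# and the status columns come before the first number on the line.
errored_replicas() {
    awk '/\/dev\// {
        flags = $0
        sub(/[0-9].*$/, "", flags)      # keep only the status columns
        if (flags ~ /[WMDFSR]/) print $NF
    }'
}
```

A possible invocation is "metadb -s diskset | errored_replicas". Check the result against the full metadb(1M) output before deleting any replica.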

Return replicas to service.

After the loss of power, all metadevice state database replicas on the affected disk enclosure chassis enter an errored state. Because metadevice state database replica recovery is not automatic, it is safest to perform the recovery immediately after the disk enclosure returns to service. Otherwise, a new failure can take a majority of replicas out of service and cause a kernel panic, which is the expected behavior of Solstice DiskSuite when too few replicas are available.

While these errored replicas will be reclaimed at the next takeover (haswitch(1M) or reboot(1M)), you might want to return them to service manually by first deleting and then adding them back.

Note -
Make sure that you add back the same number of replicas that were deleted on each slice. You can delete multiple replicas with a single metadb(1M) command. If you need multiple copies of replicas on one slice, you must add them in one invocation of the metadb(1M) command using the -c flag.
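
The delete-then-add sequence described above can be sketched as a small helper that only prints the commands, so that they can be reviewed before being run. This is an illustration under stated assumptions: the function name replica_readd_commands is hypothetical, and the diskset and slice names in the usage note are placeholders. The -d, -a, and -c options are those documented in metadb(1M).

```shell
# replica_readd_commands: hypothetical helper that only prints commands.
# Arguments: diskset name, slice, number of replicas that were on the slice.
replica_readd_commands() {
    if [ "$#" -ne 3 ]; then
        echo "usage: replica_readd_commands diskset slice count" >&2
        return 1
    fi
    # Delete the errored replicas on the slice in one invocation...
    printf 'metadb -s %s -d %s\n' "$1" "$2"
    # ...then add the same number back in one invocation with -c.
    printf 'metadb -s %s -a -c %s %s\n' "$1" "$3" "$2"
}
```

For example, "replica_readd_commands hahost1 c1t0d0s7 2" prints a delete command followed by an add command with -c 2 for the placeholder slice c1t0d0s7 in the placeholder diskset hahost1.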

Use the metastat(1M) command to identify the errored metadevices.
# metastat -s diskset

Return errored metadevices to service using the metareplace(1M) command, and resync the disks.
# metareplace -s diskset -e mirror component
The -e option transitions the component (slice) to the available state and performs a resync.

Components that have been replaced by a hot spare should be the last devices replaced using the metareplace(1M) command. If a hot-spared component is replaced first, its hot spare is released and could replace a component in another errored submirror as soon as it becomes available.

You can perform a resync on only one component of a submirror (metadevice) at a time. If all components of a submirror were affected by the power outage, each component must be replaced separately. It takes approximately 10 minutes to resync a 1.05GB disk.
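
Because each component must be replaced separately, it can help to generate one metareplace(1M) command per component and run them in order. The following is a minimal sketch that only prints the commands. The function name metareplace_commands is hypothetical, and the caller is assumed to list components that were replaced by a hot spare last.

```shell
# metareplace_commands: hypothetical helper that only prints commands.
# Arguments: diskset name, mirror metadevice, then one or more components.
# List components that were replaced by a hot spare last.
metareplace_commands() {
    if [ "$#" -lt 3 ]; then
        echo "usage: metareplace_commands diskset mirror component..." >&2
        return 1
    fi
    set_name=$1
    mirror=$2
    shift 2
    # One command per component, in the order given by the caller.
    for component in "$@"; do
        printf 'metareplace -s %s -e %s %s\n' "$set_name" "$mirror" "$component"
    done
}
```

Run each printed command only after metastat(1M) shows that the previous resync on that submirror has finished.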

If both disksets in a symmetric configuration were affected by the power outage, you can resync each diskset's affected submirrors concurrently. Log in to each host separately and run metareplace(1M) there to recover that host's diskset.

Note -
Depending on the number of submirrors and the number of components in these submirrors, the resync actions can require a considerable amount of time. A single submirror made up of thirty 1.05GB drives might take about five hours to complete. A more manageable configuration made up of five-component submirrors might take only 50 minutes to complete.
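
The figures in this note follow from the estimate of roughly 10 minutes per 1.05GB component, resynced one at a time. The following sketch reproduces that arithmetic for rough planning. It assumes resync time scales linearly with the number of components, which is an approximation, and the function name estimate_resync_minutes is hypothetical.

```shell
# estimate_resync_minutes: hypothetical helper for rough planning only.
# Arguments: number of components in the submirror,
#            optional minutes per component (default 10, for a 1.05GB disk).
# Assumption: components are resynced one at a time and time scales linearly.
estimate_resync_minutes() {
    components=$1
    per_component=${2:-10}
    echo $(( components * per_component ))
}
```

With the default of 10 minutes, 30 components give 300 minutes (five hours) and 5 components give 50 minutes, matching the estimates above.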

© 2010, Oracle Corporation and/or its affiliates