Solaris Volume Manager Administration Guide

Overview of Replacing and Enabling Components in RAID 1 and RAID 5 Volumes

Solaris Volume Manager can replace and enable components within RAID 1 (mirror) and RAID 5 volumes.

In Solaris Volume Manager terms, replacing a component is a way to substitute an available component on the system for a selected component in a submirror or RAID 5 volume. You can think of this process as logical replacement, as opposed to physically replacing the component. (See Replacing a Component With Another Available Component.)

Enabling a component means “activating” it, that is, substituting the component with itself (the component name stays the same). See Enabling a Component.

Note –

When recovering from disk errors, scan /var/adm/messages to see what kind of errors occurred. If the errors are transitory and the disks themselves do not have problems, try enabling the failed components. You can also use the format command to test a disk.

Enabling a Component

You can enable a component when any of the following conditions exist:

Solaris Volume Manager could not access the physical drive. This problem might have occurred, for example, due to a power loss, or a loose drive cable. In this case, Solaris Volume Manager puts the components in the “Maintenance” state. You need to make sure that the drive is accessible (restore power, reattach cables, and so on), and then enable the components in the volumes.
You suspect that a physical drive is having transitory problems that are not disk-related. You might be able to fix a component in the “Maintenance” state by simply enabling it. If this does not fix the problem, then you need to either physically replace the disk drive and enable the component, or replace the component with another available component on the system.

When you physically replace a drive, be sure to partition it like the old drive to ensure adequate space on each used component.
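Once the drive is accessible again (or has been physically replaced and partitioned), the enable step itself is a single metareplace invocation with the -e option. The following Python helper is a minimal sketch that only assembles that command line. The helper name is hypothetical, and the volume d0 and slice c1t4d0s2 are placeholder names, not values from this guide.

```python
def enable_command(volume, component):
    """Return the argument list for enabling a component in place.

    'metareplace -e' substitutes the component with itself, so only
    one component name is passed after the volume name.
    """
    if not volume or not component:
        raise ValueError("volume and component are required")
    return ["metareplace", "-e", volume, component]


# Placeholder names; substitute the volume and slice that metastat reports.
print(" ".join(enable_command("d0", "c1t4d0s2")))
```

The sketch builds the command rather than running it, so it can be reviewed before anything touches the volume.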

Note –

Always check for state database replicas and hot spares on the drive being replaced. Delete any state database replica that is shown to be in error before you replace the disk. After you enable the component, recreate the replicas at the same size. Treat hot spares in the same manner.
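The replica handling in this note maps onto two metadb invocations: one to delete the replicas on a slice and one to add them back. The sketch below assembles both command lines. The helper name, the slice c0t1d0s7, and the replica count and length are illustrative assumptions; take the real values from your own metadb output.

```python
def replica_commands(slice_name, count=1, length_blocks=None):
    """Return (delete_argv, add_argv) for the replicas on one slice.

    count         -- number of replicas to recreate on the slice
    length_blocks -- optional replica size in blocks, passed with -l so
                     the recreated replicas match the originals
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    delete_argv = ["metadb", "-d", slice_name]
    add_argv = ["metadb", "-a", "-c", str(count)]
    if length_blocks is not None:
        add_argv += ["-l", str(length_blocks)]
    add_argv.append(slice_name)
    return delete_argv, add_argv


# Placeholder slice name and sizes, for illustration only.
delete_argv, add_argv = replica_commands("c0t1d0s7", count=2, length_blocks=8192)
print(" ".join(delete_argv))
print(" ".join(add_argv))
```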

Replacing a Component With Another Available Component

Use the metareplace command to replace or swap an existing component with a different component that is available and not in use on the system.

You can use this command when any of the following conditions exist:

A disk drive has problems, and you do not have a replacement drive, but you do have available components elsewhere on the system.

You might want to use this strategy if a replacement is absolutely necessary but you do not want to shut down the system.
You are seeing soft errors.

Physical disks might report soft errors even though Solaris Volume Manager shows the mirror/submirror or RAID 5 volume in the “Okay” state. Replacing the component in question with another available component enables you to perform preventative maintenance and potentially prevent hard errors from occurring.
You want to do performance tuning.

For example, by using the performance monitoring feature available from the Enhanced Storage tool within the Solaris Management Console, you see that a particular component in a RAID 5 volume is experiencing a high load average, even though it is in the “Okay” state. To balance the load on the volume, you can replace that component with a component from a disk that is less utilized. You can perform this type of replacement online without interrupting service to the volume.
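For any of the situations above, the replacement is expressed as metareplace followed by the volume, the component to remove, and the available component that takes its place. The following sketch assembles that command line. The helper name is hypothetical, and d0, c0t0d0s2, and c0t2d0s2 are placeholder names.

```python
def replace_command(volume, old_component, new_component):
    """Return the argument list for swapping one component for another."""
    if old_component == new_component:
        # Same name on both sides is an enable, not a replacement.
        raise ValueError("use 'metareplace -e' to enable a component in place")
    return ["metareplace", volume, old_component, new_component]


# Placeholder names: replace c0t0d0s2 with the unused slice c0t2d0s2 in d0.
print(" ".join(replace_command("d0", "c0t0d0s2", "c0t2d0s2")))
```

Rejecting identical names keeps the distinction between replacing and enabling explicit in the calling code.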

Maintenance and Last Erred States

When a component in a mirror or RAID 5 volume experiences errors, Solaris Volume Manager puts the component in the “Maintenance” state. No further reads or writes are performed to a component in the “Maintenance” state. Subsequent errors on other components in the same volume are handled differently, depending on the type of volume. A RAID 1 volume might be able to tolerate many components in the “Maintenance” state and still be read from and written to. A RAID 5 volume, by definition, can only tolerate a single component in the “Maintenance” state.

When a component in a RAID 0 or RAID 5 volume experiences errors and there are no redundant components to read from, the component goes into the “Last Erred” state. For example, in a RAID 5 volume, after one component goes into the “Maintenance” state, no redundancy is available, so the next component to fail goes into the “Last Erred” state. When either a mirror or RAID 5 volume has a component in the “Last Erred” state, I/O is still attempted to the component marked “Last Erred.” This happens because a “Last Erred” component contains the last good copy of data from Solaris Volume Manager's point of view. With a component in the “Last Erred” state, the volume behaves like a normal device (disk) and returns I/O errors to an application. Usually, at this point some data has been lost.

Always replace components in the “Maintenance” state first, followed by those in the “Last Erred” state. After a component is replaced and resynchronized, use the metastat command to verify its state, then validate the data to make sure it is good.

Mirrors – If components are in the “Maintenance” state, no data has been lost. You can safely replace or enable the components in any order. If a component is in the “Last Erred” state, you cannot replace it until you first replace all the other mirrored components in the “Maintenance” state. Replacing or enabling a component in the “Last Erred” state usually means that some data has been lost. Be sure to validate the data on the mirror after you repair it.

RAID 5 Volumes – A RAID 5 volume can tolerate a single component failure. You can safely replace a single component in the “Maintenance” state without losing data. If an error on another component occurs, it is put into the “Last Erred” state. At this point, the RAID 5 volume is a read-only device. You need to perform some type of error recovery so that the state of the RAID 5 volume is stable and the possibility of data loss is reduced. If a RAID 5 volume reaches a “Last Erred” state, there is a good chance it has lost data. Be sure to validate the data on the RAID 5 volume after you repair it.
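The ordering rule in this section (repair “Maintenance” components first, then “Last Erred” components) can be sketched as a small sort. The function names and the (name, state) tuple layout are assumptions made for illustration. The state strings are the ones used in the text, and the slice names are placeholders.

```python
# Lower number means "repair earlier"; states not listed need no repair.
STATE_PRIORITY = {"Maintenance": 0, "Last Erred": 1}


def replacement_order(components):
    """Return names of components needing repair, in the order to handle them.

    components is a list of (name, state) tuples. The sort is stable, so
    components that share a state keep their original relative order.
    """
    failed = [(name, state) for name, state in components
              if state in STATE_PRIORITY]
    failed.sort(key=lambda item: STATE_PRIORITY[item[1]])
    return [name for name, _ in failed]


def data_loss_likely(components):
    """True if any component is in the "Last Erred" state."""
    return any(state == "Last Erred" for _, state in components)


# Placeholder slice names, for illustration only.
volume = [
    ("c1t1d0s0", "Okay"),
    ("c1t2d0s0", "Last Erred"),
    ("c1t3d0s0", "Maintenance"),
]
print(replacement_order(volume))
print(data_loss_likely(volume))
```

When data_loss_likely returns True, the guidance above applies: validate the data after the repair rather than relying on the volume state alone.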

Background Information For Replacing and Enabling Slices in RAID 1 and RAID 5 Volumes

When you replace components in a mirror or a RAID 5 volume, follow these guidelines:

Always replace components in the “Maintenance” state first, followed by those components in the “Last Erred” state.
After a component is replaced and resynchronized, use the metastat command to verify the volume's state, then validate the data to make sure it is good. Replacing or enabling a component in the “Last Erred” state usually means that some data has been lost, so be sure to validate the data on the volume after you repair it. For a UFS file system, run the fsck command to validate the “metadata” (the structure of the file system), then check the actual user data. (In practice, users will have to examine their files.) A database or other application must have its own way of validating its internal data structure.
Always check for state database replicas and hot spares when you replace components. Any state database replica shown to be in error should be deleted before you replace the physical disk. The state database replica should be added back before enabling the component. The same procedure applies to hot spares.
RAID 5 volumes – During component replacement, data is recovered, either from a hot spare currently in use, or using the RAID level 5 parity, when no hot spare is in use.
RAID 1 volumes – When you replace a component, Solaris Volume Manager automatically starts resynchronizing the new component with the rest of the mirror. When the resynchronization completes, the replaced component becomes readable and writable. If the failed component has been replaced with data from a hot spare, the hot spare is placed in the “Available” state and made available for other hot spare replacements.
The new component must be large enough to replace the old component.
As a precaution, back up all data before you replace “Last Erred” devices.
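The RAID 5 guideline above mentions recovering data from parity. As a general illustration of that principle, the sketch below rebuilds one lost block by XOR-ing the surviving blocks with the parity block. It is a toy model under stated assumptions: the block contents are made up, and it does not represent Solaris Volume Manager's actual on-disk layout or resynchronization code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data blocks and the parity computed across them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Pretend the second block is lost; rebuild it from the survivors and parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Because XOR with parity can cancel only one unknown block, this also shows why a second failed component leaves nothing to rebuild from.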

Note –

A submirror or RAID 5 volume might be using a hot spare in place of a failed component. When that failed component is enabled or replaced by using the procedures in this section, the hot spare is marked “Available” in the hot spare pool, and is ready for use.