This section describes replacing a multihost disk without interrupting Sun Cluster services (online replacement) when the volume manager is reporting problems such as:
Components in the Needs Maintenance state
Hot spare replacement
Intermittent disk errors
Consult your volume management software documentation for offline replacement procedures.
Use the following procedure if you have determined that a disk has components in the Needs Maintenance state, a hot spare has replaced a component, or a disk is generating intermittent errors.
These are the high-level steps to replace a Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration:
Determining which disk needs replacement
Determining which disk expansion unit holds the disk to be replaced
Removing the bad disk from the diskset
Spinning down the disk and opening the disk enclosure
Replacing the disk drive
Running the scdidadm -R command
Adding the new disk to the diskset
Reserving and enabling failfast on the disk
Partitioning the new disk
Running the metastat(1M) command to verify the problem has been fixed
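The detailed steps that follow can be sketched as a single annotated command outline. This is a dry-run sketch, not a script to execute as-is: hahost1, d6, c3t3d4, d40, and d50 are the illustrative names used in this section's examples, and the run wrapper only echoes each command.

```shell
# Dry-run outline of the DiskSuite disk-replacement sequence.
# Substitute your own diskset, DID, and metadevice names; the wrapper
# echoes instead of executing so the outline is safe to review.
run() { echo "WOULD RUN: $*"; }

run metadetach -s hahost1 -f d40 d50   # detach submirror with failed component
run metaclear  -s hahost1 -f d50       # clear the detached submirror
run metaset    -s hahost1 -d d6        # remove the failed drive from the diskset
# ... physically replace the disk and let it spin up ...
run scdidadm -R d6                     # update DID database (skip on HA 1.3 upgrades)
run metaset    -s hahost1 -a d6        # add the drive back; replicas are restored
run scadmin reserve c3t3d4             # reserve and enable failfast
run metainit   -s hahost1 d50          # re-create the cleared submirror
run metattach  -s hahost1 d40 d50      # reattach; resync begins
run metastat   -s hahost1              # verify the repair
```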
These are the detailed steps to replace a failed Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration.
Run the procedure on the host that masters the diskset in which the bad disk resides. This might require you to switch over the diskset using the haswitch(1M) command.
Identify the disk to be replaced.
Use the metastat(1M) command and /var/adm/messages output.
When metastat(1M) reports that a device is in maintenance state or some of the components have been replaced by hot spares, you must locate and replace the device. A sample metastat(1M) output follows. In this example, device c3t3d4s0 is in maintenance state:
phys-hahost1# metastat -s hahost1
...
d50:Submirror of hahost1/d40
     State: Needs Maintenance
     Stripe 0:
         Device       Start Block  Dbase  State  Hot Spare
         c3t3d4s0     0            No     Okay   c3t5d4s0
...
Check /var/adm/messages to see what kind of problem has been detected.
...
Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor `CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...
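When the log is long, a simple filter pulls out the lines that matter. The sketch below runs grep against an inline sample so the pattern is easy to verify; on a live system you would read /var/adm/messages instead. Note that syslog can wrap a message mid-word (as in the example above, where "Error Level" is split across two lines), so match on short, stable substrings.

```shell
# Pull likely disk-failure lines out of a messages log. The here-document
# is a stand-in sample; on a live system, filter /var/adm/messages instead.
matches=$(grep -E 'Fatal|Media Error' <<'EOF'
Jun 1 16:15:26 host1 unix: Error for command 'write(I)'
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:28 host1 unix: last message repeated 1 time
EOF
)
echo "$matches"
```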
Determine the location of the problem disk.
Use the mount(1M) or format(1M) command to determine the controller number.
If the problem disk contains replicas, make a record of the slice name and the number of replicas, then delete the replicas.
Use the metadb(1M) command to delete the replicas.
Detach all submirrors with components on the disk being replaced.
If you are detaching a submirror that has a failed component, you must force the detach using the metadetach -f option. The following example detaches submirror d50 from metamirror d40.
phys-hahost1# metadetach -s hahost1 -f d40 d50
Use the metaclear(1M) command to clear the submirrors detached in Step 5.
phys-hahost1# metaclear -s hahost1 -f d50
If the problem disk contains hot spares, make a record of the hot spare device names and the hot spare pools that contain them, then delete the hot spares.
Use the metahs(1M) command to delete hot spares.
You need to record the information before deleting the objects so that the actions can be reversed following the disk replacement.
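One way to keep that record is to capture the current layout to files before deleting anything. This is a sketch only: hahost1 and the /var/tmp file names are illustrative placeholders, and the wrapper echoes the intended save rather than executing it.

```shell
# Capture replica and hot-spare layout before deleting anything, so the
# configuration can be restored verbatim after the disk swap.
# "hahost1" and the /var/tmp paths are placeholders; the wrapper echoes
# instead of executing so the sketch is safe to dry-run.
record() { echo "WOULD SAVE: $*"; }

record "metadb -s hahost1    -> /var/tmp/replicas.before"   # replica slices and counts
record "metahs -s hahost1 -i -> /var/tmp/hotspares.before"  # hot spare pools and devices
```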
Use the metaset(1M) command to remove the failed disk from the diskset.
The command syntax is as follows, where diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3):
phys-hahost1# metaset -s diskset -d drive
This can take up to fifteen minutes or more, depending on the size of your configuration and the number of disks.
Replace the bad disk.
Refer to the hardware service manuals for your disk enclosure for details on this procedure.
Make sure the new disk spins up.
The disk should spin up automatically.
Update the DID driver's database with the new device ID.
If you upgraded from HA 1.3, your installation does not use the DID driver, so skip this step.
Use the -l flag to scdidadm(1M) to identify the DID name for the lower level device name of the drive to be replaced. Then update the DID drive database using the -R flag to scdidadm(1M). Refer to the Sun Cluster 2.2 Software Installation Guide for details on the DID pseudo driver.
phys-hahost1# scdidadm -o name -l /dev/rdsk/c3t3d4
6       phys-hahost1:/dev/rdsk/c3t3d4   /dev/did/rdsk/d6
phys-hahost1# scdidadm -R d6
Add the new disk back into the diskset using the metaset(1M) command.
This step automatically adds back the correct number of replicas that were deleted from the failed disk. The command syntax is shown below, where diskset is the name of the diskset containing the failed disk and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster) or cNtYdZ (for installations that upgraded from HA 1.3).
phys-hahost1# metaset -s diskset -a drive
This operation can take up to fifteen minutes or more, depending on the size of your configuration and the number of disks.
Use the scadmin(1M) command to reserve and enable failfast on the specified disk that has just been added back to the diskset.
phys-hahost1# scadmin reserve c3t3d4
Use the format(1M) or fmthard(1M) command to repartition the new disk.
Make sure that you partition the new disk exactly as the disk it replaces was partitioned. (Saving the disk format information was recommended in Chapter 1, Preparing for Sun Cluster Administration.)
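A common way to reproduce a partition table is the prtvtoc/fmthard pair: save the label while the disk is healthy, then stamp the saved label onto the replacement. The sketch below assumes the label was saved to /var/tmp/c3t3d4.vtoc before the failure (the file name is illustrative; the device is this section's example disk), and the wrapper echoes each command rather than executing it.

```shell
# Re-create the original partition table on the replacement disk.
# /var/tmp/c3t3d4.vtoc is assumed to be a label saved in advance, as
# recommended in Chapter 1; the wrapper echoes instead of executing.
run() { echo "WOULD RUN: $*"; }

# Saving the label while the disk is still healthy:
run "prtvtoc /dev/rdsk/c3t3d4s2 > /var/tmp/c3t3d4.vtoc"
# Restoring it onto the replacement disk:
run "fmthard -s /var/tmp/c3t3d4.vtoc /dev/rdsk/c3t3d4s2"
```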
Use the metainit(1M) command to reinitialize disks that were cleared in Step 6.
phys-hahost1# metainit -s hahost1 d50
Attach submirrors that were detached in Step 5.
Use the metattach(1M) command to perform this step. See the metattach(1M) man page for details.
phys-hahost1# metattach -s hahost1 d40 d50
Restore all hot spares that were deleted in Step 7.
Use metahs(1M) to add back the hot spares. See the metahs(1M) man page for details.
phys-hahost1# metahs -s hahost1 -a hsp000 c3t2d5s0
Verify that the replacement corrected the problem.
phys-hahost1# metastat -s hahost1
These are the high-level steps to replace a Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a VxVM configuration:
Removing the failed disk in the disk enclosure by using the vxdiskadm command
Replacing the failed disk
Replacing the disk removed earlier by using the vxdiskadm command
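The VxVM steps above can also be sketched as a dry-run outline. vxdiskadm is menu-driven, so the option numbers below refer to the menu entries this section walks through; c2t8d0 is the example disk, and the wrapper only echoes each command.

```shell
# Dry-run outline of the VxVM replacement sequence. c2t8d0 is the example
# disk name used in this section; the wrapper echoes instead of executing.
run() { echo "WOULD RUN: $*"; }

run vxdctl -c mode                  # shared disk groups: identify the master node
run "vxdiskadm  (option 4: Remove a disk for replacement)"  # keeps the disk name
# ... physically swap the drive ...
run "vxdiskadm  (option 5: Replace a failed or removed disk)"
run vxdisk list                     # confirm the disk is no longer REMOVED
run vxprint
```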
For systems not running shared disk groups, master node refers to the node that has imported the disk group.
If you are running shared disk groups, determine the master and slave node by entering the following command on all nodes in the cluster:
# vxdctl -c mode
Complete the following steps from the master node.
Determine if the disk in question had failures and is in the NODEVICE state.
If this is not the case, skip to Step 8.
Run the vxdiskadm utility and enter 4 (Remove a disk for replacement).
This option removes a physical disk while retaining the disk name. The utility then queries you for the particular device that you want to replace.
Enter the disk name or list.
The following example illustrates the removal of disk c2t8d0.
Enter disk name [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE
...

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE
dm c1t2d0       c2t2d0s2     sliced    1519     4152640  -
dm c1t3d0       c2t3d0s2     sliced    1519     4152640  -
dm c1t4d0       c2t4d0s2     sliced    1519     4152640  -
dm c1t5d0       c2t5d0s2     sliced    1519     4152640  -
dm c1t8d0       c2t8d0s2     sliced    1519     4152640  -
dm c1t9d0       c2t9d0s2     sliced    1519     4152640  -
dm c2t2d0       c1t2d0s2     sliced    1519     4152640  -
dm c2t3d0       c1t3d0s2     sliced    1519     4152640  -
dm c2t4d0       c1t4d0s2     sliced    1519     4152640  -
dm c2t5d0       c1t5d0s2     sliced    1519     4152640  -
dm c2t8d0       c1t8d0s2     sliced    1519     4152640  -
dm c2t9d0       c1t9d0s2     sliced    1519     4152640  -

Enter disk name [<disk>,list,q,?] c2t8d0

The requested operation is to remove disk c2t8d0 from disk group demo.
The disk name will be kept, along with any volumes using the disk,
allowing replacement of the disk. Select "Replace a failed or removed
disk" from the main menu when you wish to replace the disk.
Enter y or press Return to continue.
Continue with operation? [y,n,q,?] (default: y) y

Removal of disk c2t8d0 completed successfully.
Enter q to quit the utility.
Remove another disk? [y,n,q,?] (default: n) q
Enter vxdisk list and vxprint to view the changes.
In the example, disk c2t8d0 has been removed.
# vxdisk list
.
.
.
c2t3d0s2     sliced    c1t3d0       demo         online shared
c2t4d0s2     sliced    c1t4d0       demo         online shared
c2t5d0s2     sliced    c1t5d0       demo         online shared
c2t8d0s2     sliced    c1t8d0       demo         online shared
c2t9d0s2     sliced    c1t9d0       demo         online shared
-            -         c2t8d0       demo         removed

# vxprint
.
.
.
dm c2t3d0       c1t3d0s2     -        4152640  -        -        -  -
dm c2t4d0       c1t4d0s2     -        4152640  -        -        -  -
dm c2t5d0       c1t5d0s2     -        4152640  -        -        -  -
dm c2t8d0       -            -        -        -        REMOVED  -  -
dm c2t9d0       c1t9d0s2     -        4152640  -        -        -  -

pl demo05-02    -            DISABLED 51200    -        REMOVED  -  -
sd c2t8d0-1     demo05-02    DISABLED 51200    0        REMOVED  -  -
.
.
.
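Spotting every record still flagged REMOVED can be scripted with a one-line awk filter. The sketch below runs against an inline, trimmed copy of the output shown above; on a live system you would pipe vxprint into awk instead.

```shell
# Print record type and name for every vxprint line flagged REMOVED.
# The here-document is a trimmed stand-in for real `vxprint` output.
removed=$(awk '/REMOVED/ { print $1, $2 }' <<'EOF'
dm c2t4d0       c1t4d0s2     -        4152640  -        -        -  -
dm c2t8d0       -            -        -        -        REMOVED  -  -
pl demo05-02    -            DISABLED 51200    -        REMOVED  -  -
sd c2t8d0-1     demo05-02    DISABLED 51200    0        REMOVED  -  -
EOF
)
echo "$removed"
```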
Replace the physical drive without powering off any component.
For further information, refer to the documentation accompanying the disk enclosure unit.
As you replace the drive, you might see messages on the system console similar to those in the following example. These messages do not necessarily indicate a problem; proceed with the replacement as described in the steps that follow.
Nov 3 17:44:00 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:00 updb10a unix: SCSI transport failed: reason 'incomplete': retrying command
Nov 3 17:44:03 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:03 updb10a unix: disk not responding to selection
Run the vxdiskadm utility and enter 5 (Replace a failed or removed disk).
Enter the disk name.
You can enter list to see a list of disks in the REMOVED state.
The disk may appear in the NODEVICE state if it had failures.
Select a removed or failed disk [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE
...

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE
dm c2t8d0       -            -        -        -        REMOVED

Select a removed or failed disk [<disk>,list,q,?] c2t8d0
The vxdiskadm utility detects the new device and asks you whether the new device should replace the removed device.
If there are other unused disks attached to the system, vxdiskadm also presents these disks as viable choices.
Enter the device name, or if the utility lists the device as the default, press Return.
The following devices are available as replacements:
        c1t8d0s2

You can choose one of these disks to replace c2t8d0.
Choose "none" to initialize another disk to replace c2t8d0.

Choose a device, or select "none" [<device>,none,q,?] (default: c1t8d0s2) <Return>

The requested operation is to use the initialized device c1t8d0s2
to replace the removed or failed disk c2t8d0 in disk group demo.
Enter y or press Return to verify that you want this device (in the example, c1t8d0s2) to be the replacement disk.
Continue with operation? [y,n,q,?] (default: y) <Return>

Replacement of disk c2t8d0 in group demo with disk device c1t8d0s2
completed successfully.
Enter n or press Return to quit this utility.
Replace another disk? [y,n,q,?] (default: n) <Return>
Enter vxdisk list and vxprint to see the changes.
In the example, disk c2t8d0 is no longer in the REMOVED state.