Use the following procedure if you have determined that a disk has components in the Needs Maintenance state, a hot spare has replaced a component, or a disk is generating intermittent errors.
These are the high-level steps to replace a Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration:
Determining which disk needs replacement
Determining which disk expansion unit holds the disk to be replaced
Removing the bad disk from the diskset
Spinning down the disk and opening the disk enclosure
Replacing the disk drive
Running the scdidadm -R command
Adding the new disk to the diskset
Reserving and enabling failfast on the disk
Partitioning the new disk
Running the metastat(1M) command to verify the problem has been fixed
These are the detailed steps to replace a failed Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration.
Run the procedure on the host that masters the diskset in which the bad disk resides. This might require you to switch over the diskset using the haswitch(1M) command.
Identify the disk to be replaced.
Use the metastat(1M) command and /var/adm/messages output.
When metastat(1M) reports that a device is in maintenance state or some of the components have been replaced by hot spares, you must locate and replace the device. A sample metastat(1M) output follows. In this example, device c3t3d4s0 is in maintenance state:
phys-hahost1# metastat -s hahost1
...
d50:Submirror of hahost1/d40
    State: Needs Maintenance
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c3t3d4s0    0            No     Okay   c3t5d4s0
...
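On a large diskset the metastat(1M) output can be long, so a small filter can help pick out the metadevices that need attention. The following is a minimal sketch, assuming the output follows the layout of the sample above (a metadevice name followed by a colon, then a State: line). The function name needs_maintenance is hypothetical and is not part of Solstice DiskSuite.

```shell
# Print the name of every metadevice whose State line reads
# "Needs Maintenance". Reads metastat output on standard input.
# Assumption: each metadevice block begins with "dNN:" (optionally
# prefixed with "setname/"), as in the sample output.
needs_maintenance() {
    awk '
        # Remember the most recent metadevice header seen.
        $1 ~ /^d[0-9]+:/ || $1 ~ /\/d[0-9]+:/ {
            split($1, parts, ":")
            current = parts[1]
        }
        # Report the remembered metadevice when its state is bad.
        /State:[ \t]*Needs Maintenance/ && current != "" { print current }
    '
}
```

It could be used as: metastat -s hahost1 | needs_maintenance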
Check /var/adm/messages to see what kind of problem has been detected.
...
Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor `CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...
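To summarize what kind of errors have been logged, the Sense Key lines can be pulled out of the messages file. This is a minimal sketch, assuming the log lines look like the sample above. The function name sense_keys is hypothetical.

```shell
# Print only the sense key text (for example "Media Error") from
# syslog-style lines read on standard input.
sense_keys() {
    sed -n 's/.*Sense Key: //p'
}
```

It could be used as: sense_keys < /var/adm/messages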
Determine the location of the problem disk.
Use the mount(1M) or format(1M) command to determine the controller number.
If the problem disk contains replicas, record the slice that holds them and the number of replicas, then delete the replicas.
Use the metadb(1M) command to delete the replicas.
Detach all submirrors with components on the disk being replaced.
If you are detaching a submirror that has a failed component, you must force the detach by using the -f option to metadetach(1M). The following example detaches submirror d50 from metamirror d40.
phys-hahost1# metadetach -s hahost1 -f d40 d50
Use the metaclear(1M) command to clear the submirrors detached in Step 5.
phys-hahost1# metaclear -s hahost1 -f d50
If the problem disk contains hot spares, record the names of the devices and the list of devices that contain hot spare pools, then delete the hot spares.
Use the metahs(1M) command to delete hot spares.
You need to record the information before deleting the objects so that the actions can be reversed following the disk replacement.
Use the metaset(1M) command to remove the failed disk from the diskset.
The command syntax is as follows, where diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3):
phys-hahost1# metaset -s diskset -d drive
This operation can take fifteen minutes or more, depending on the size of your configuration and the number of disks.
Replace the bad disk.
Refer to the hardware service manuals for your disk enclosure for details on this procedure.
Make sure the new disk spins up.
The disk should spin up automatically.
Update the DID driver's database with the new device ID.
If you upgraded from HA 1.3, your installation does not use the DID driver, so skip this step.
Use the -l flag to scdidadm(1M) to identify the DID name for the lower-level device name of the drive to be replaced. Then update the DID driver's database using the -R flag to scdidadm(1M). Refer to the Sun Cluster 2.2 Software Installation Guide for details on the DID pseudo driver.
phys-hahost1# scdidadm -o name -l /dev/rdsk/c3t3d4
6 phys-hahost1:/dev/rdsk/c3t3d4 /dev/did/rdsk/d6
phys-hahost1# scdidadm -R d6
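When the DID name is needed in a script, it can be taken from the last column of the listing. The following is a minimal sketch, assuming the listing has the three-column layout shown in the example above (instance number, host and device path, DID path). The function name did_name_from_listing is hypothetical.

```shell
# Read an scdidadm listing on standard input and print the DID name
# (for example "d6") taken from the /dev/did/rdsk/... column.
did_name_from_listing() {
    awk 'NF >= 3 && $3 ~ /^\/dev\/did\/rdsk\// {
        n = split($3, parts, "/")   # last path element is the DID name
        print parts[n]
    }'
}
```

It could be used as: scdidadm -o name -l /dev/rdsk/c3t3d4 | did_name_from_listing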
Add the new disk back into the diskset using the metaset(1M) command.
This step automatically adds back the proper number of replicas that were deleted from the failed disk. The syntax of the command is shown below. In this example, diskset is the name of the diskset containing the failed disk and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3).
phys-hahost1# metaset -s diskset -a drive
This operation can take fifteen minutes or more, depending on the size of your configuration and the number of disks.
Use the scadmin(1M) command to reserve and enable failfast on the specified disk that has just been added back to the diskset.
phys-hahost1# scadmin reserve c3t3d4
Use the format(1M) or fmthard(1M) command to repartition the new disk.
Make sure that you partition the new disk exactly as the replaced disk was partitioned. (Saving the disk format information was recommended in Chapter 1, Preparing for Sun Cluster Administration.)
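One common way to carry a partition table from a saved copy to a new disk on Solaris is to feed saved prtvtoc(1M) output to fmthard(1M) with the -s option. The following is a minimal sketch under that assumption; the method described in Chapter 1 may differ. The function names save_vtoc and restore_vtoc and the file location are hypothetical, and the saved copy must have been taken while the original disk was still readable.

```shell
# Save the label (VTOC) of a disk to a file. Slice 2 conventionally
# represents the whole disk. Arguments: disk name (cNtYdZ), file path.
save_vtoc() {
    prtvtoc "/dev/rdsk/${1}s2" > "$2"
}

# Write a previously saved label onto a replacement disk.
# Arguments: disk name (cNtYdZ), file path of the saved label.
restore_vtoc() {
    fmthard -s "$2" "/dev/rdsk/${1}s2"
}
```

For example, save_vtoc c3t3d4 /var/tmp/c3t3d4.vtoc could be run while the disk is healthy, and restore_vtoc c3t3d4 /var/tmp/c3t3d4.vtoc after the replacement.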
Use the metainit(1M) command to reinitialize the submirrors that were cleared in Step 6.
phys-hahost1# metainit -s hahost1 d50
Attach submirrors that were detached in Step 5.
Use the metattach(1M) command to perform this step. See the metattach(1M) man page for details.
phys-hahost1# metattach -s hahost1 d40 d50
Restore all hot spares that were deleted in Step 7.
Use metahs(1M) to add back the hot spares. See the metahs(1M) man page for details.
phys-hahost1# metahs -s hahost1 -a hsp000 c3t2d5s0
Verify that the replacement corrected the problem.
phys-hahost1# metastat -s hahost1