As part of standard Sun Cluster administration, you should monitor the status of the configuration. See Chapter 2, Sun Cluster Administration Tools, for information about monitoring methods. During the monitoring process you might discover problems with multihost disks. The following procedures describe how to correct these problems.
Sun Cluster supports different disk types. Refer to the hardware service manual for your multihost disk expansion unit for a description of your disk enclosure.
In a symmetric configuration, the disk enclosure might contain disks from multiple disk groups. In that case, a single node must own all of the affected disk groups.
These are the high-level steps to add a Sun StorEdge MultiPack or Sun StorEdge D1000 disk:
Identifying the controller for this new disk and locating an empty slot in the disk enclosure
Adding the new disk
Performing the administrative actions to prepare the disk for use by Sun Cluster
Creating the /devices special files and /dev/dsk and /dev/rdsk links
Adding the disk to the disk group
Formatting and partitioning the disk, if necessary
Performing the volume management-related administrative tasks
These are the detailed steps to add a new Sun StorEdge MultiPack or Sun StorEdge D1000 disk.
Determine the controller number of the disk enclosure to which the disk will be added.
Use the mount(1M) or format(1M) command to determine the controller number.
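For example, the controller number appears in each device name reported by format(1M); the disks and device paths in this sketch are hypothetical:

phys-hahost1# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
          /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
       1. c2t3d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
          /sbus@1f,0/SUNW,fas@2,8800000/sd@3,0

In this hypothetical output, disk c2t3d0 is attached to controller c2.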
Locate an appropriate empty disk slot in the disk enclosure for the disk being added.
Identify the empty slots either by observing the disk drive LEDs on the front of the disk enclosure, or by removing the left side cover of the unit. The target address IDs corresponding to the slots appear on the middle partition of the drive bay.
In the following steps, Tray 2 is used as an example. The slot selected for the new disk is Tray 2 Slot 7. The new disk will be known as c2t3d1.
Add the new disk.
Use the instructions in your disk enclosure unit service manual to perform the hardware procedure of adding the disk.
Run the drvconfig(1M) and disks(1M) commands to create the new entries in /devices, /dev/dsk, and /dev/rdsk for all new disks.
phys-hahost1# drvconfig
phys-hahost1# disks
Switch ownership of the logical hosts to the other cluster node to which this disk is connected.
phys-hahost1# haswitch phys-hahost2 hahost1 hahost2
Run the drvconfig(1M) and disks(1M) commands on the node that now owns the disk group to which the disk will be added.
phys-hahost2# drvconfig
phys-hahost2# disks
Add the disk to a disk group using your volume management software.
For Solstice DiskSuite, the command syntax is as follows, where diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3):
# metaset -s diskset -a drive
For SSVM or CVM, you can use the command line or graphical user interface to add the disk to the disk group.
If you are using Solstice DiskSuite, the metaset(1M) command might repartition this disk automatically. See the Solstice DiskSuite documentation for more information.
(Solstice DiskSuite configurations only) After adding the disks to the diskset by using the metaset(1M) command, use the scadmin(1M) command to reserve and enable failfast on the specified disks.
phys-hahost2# scadmin reserve drivename
Perform the usual administration actions on the new disk.
You can now perform the usual administration steps that are performed when a new drive is brought into service. See your volume management software documentation for more information on these tasks.
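For example, in a Solstice DiskSuite configuration you might build a new submirror on the disk and attach it to an existing mirror; the metadevice names and slice in this sketch are hypothetical:

phys-hahost2# metainit -s hahost1 d61 1 1 /dev/did/rdsk/d7s0   (create a submirror; names hypothetical)
phys-hahost2# metattach -s hahost1 d40 d61                     (attach it to mirror d40)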
If necessary, switch logical hosts back to their default masters.
This section describes replacing a multihost disk without interrupting Sun Cluster services (online replacement) when the volume manager is reporting problems such as:
Components in the Needs Maintenance state
Hot spare replacement
Intermittent disk errors
Consult your volume management software documentation for offline replacement procedures.
Use the following procedure if you have determined that a disk has components in the Needs Maintenance state, a hot spare has replaced a component, or a disk is generating intermittent errors.
These are the high-level steps to replace a Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration:
Determining which disk needs replacement
Determining which disk expansion unit holds the disk to be replaced
Removing the bad disk from the diskset
Spinning down the disk and opening the disk enclosure
Replacing the disk drive
Running the scdidadm -R command
Adding the new disk to the diskset
Reserving and enabling failfast on the disk
Partitioning the new disk
Running the metastat(1M) command to verify the problem has been fixed
These are the detailed steps to replace a failed Sun StorEdge MultiPack or Sun StorEdge D1000 disk in a Solstice DiskSuite configuration.
Run the procedure on the host that masters the diskset in which the bad disk resides. This might require you to switch over the diskset using the haswitch(1M) command.
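For example, to make phys-hahost1 the master of the diskset associated with logical host hahost1:

phys-hahost2# haswitch phys-hahost1 hahost1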
Identify the disk to be replaced.
Use the metastat(1M) command and /var/adm/messages output.
When metastat(1M) reports that a device is in maintenance state or some of the components have been replaced by hot spares, you must locate and replace the device. A sample metastat(1M) output follows. In this example, device c3t3d4s0 is in maintenance state:
phys-hahost1# metastat -s hahost1
...
d50: Submirror of hahost1/d40
   State: Needs Maintenance
   Stripe 0:
      Device      Start Block  Dbase  State  Hot Spare
      c3t3d4s0    0            No     Okay   c3t5d4s0
...
Check /var/adm/messages to see what kind of problem has been detected.
...
Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor `CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...
Determine the location of the problem disk.
Use the mount(1M) or format(1M) command to determine the controller number.
If the problem disk contains replicas, make a record of the slice and the number of replicas on it, then delete the replicas.
Use the metadb(1M) command to delete the replicas.
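For example, to list the replicas in the diskset and then delete the replicas on slice 7 of the failed disk (the slice and replica locations here are hypothetical):

phys-hahost1# metadb -s hahost1                  (record the replica locations)
phys-hahost1# metadb -s hahost1 -d c3t3d4s7      (slice shown is hypothetical)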
Detach all submirrors with components on the disk being replaced.
If you are detaching a submirror that has a failed component, you must force the detach using the metadetach -f option. The following example detaches submirror d50 from metamirror d40.
phys-hahost1# metadetach -s hahost1 -f d40 d50
Use the metaclear(1M) command to clear the submirrors detached in Step 5.
phys-hahost1# metaclear -s hahost1 -f d50
If the problem disk contains hot spares, make a record of the hot spare devices and the hot spare pools that contain them, then delete the hot spares.
Use the metahs(1M) command to delete hot spares.
You need to record the information before deleting the objects so that the actions can be reversed following the disk replacement.
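For example, to display the hot spare configuration and then delete a hot spare on the failed disk (the pool name hsp000 is hypothetical):

phys-hahost1# metahs -s hahost1 -i               (record the hot spare configuration)
phys-hahost1# metahs -s hahost1 -d hsp000 c3t3d4s0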
Use the metaset(1M) command to remove the failed disk from the diskset.
The command syntax is as follows, where diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3):
phys-hahost1# metaset -s diskset -d drive
This operation can take fifteen minutes or more, depending on the size of your configuration and the number of disks.
Replace the bad disk.
Refer to the hardware service manuals for your disk enclosure for details on this procedure.
Make sure the new disk spins up.
The disk should spin up automatically.
Update the DID driver's database with the new device ID.
If you upgraded from HA 1.3, your installation does not use the DID driver, so skip this step.
Use the -l flag to scdidadm(1M) to identify the DID name for the lower level device name of the drive to be replaced. Then update the DID drive database using the -R flag to scdidadm(1M). Refer to the Sun Cluster 2.2 Software Installation Guide for details on the DID pseudo driver.
phys-hahost1# scdidadm -o name -l /dev/rdsk/c3t3d4
6       phys-hahost1:/dev/rdsk/c3t3d4   /dev/did/rdsk/d6
phys-hahost1# scdidadm -R d6
Add the new disk back into the diskset using the metaset(1M) command.
This step automatically adds back the proper number of replicas that were deleted from the failed disk. The command syntax is shown below. In this example, diskset is the name of the diskset containing the failed disk and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3).
phys-hahost1# metaset -s diskset -a drive
This operation can take fifteen minutes or more, depending on the size of your configuration and the number of disks.
Use the scadmin(1M) command to reserve and enable failfast on the specified disk that has just been added back to the diskset.
phys-hahost1# scadmin reserve c3t3d4
Use the format(1M) or fmthard(1M) command to repartition the new disk.
Make sure that you partition the new disk exactly as the disk that was replaced. (Saving the disk format information was recommended in Chapter 1, Preparing for Sun Cluster Administration.)
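For example, if you saved the VTOC of the original disk with prtvtoc(1M), or if another disk in the diskset has an identical layout, you can copy the label to the new disk with fmthard(1M); the source disk and file name in this sketch are hypothetical:

phys-hahost1# prtvtoc /dev/rdsk/c3t5d4s2 > /tmp/c3t3d4.vtoc    (source disk hypothetical)
phys-hahost1# fmthard -s /tmp/c3t3d4.vtoc /dev/rdsk/c3t3d4s2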
Use the metainit(1M) command to reinitialize disks that were cleared in Step 6.
phys-hahost1# metainit -s hahost1 d50
Attach submirrors that were detached in Step 5.
Use the metattach(1M) command to perform this step. See the metattach(1M) man page for details.
phys-hahost1# metattach -s hahost1 d40 d50
Restore all hot spares that were deleted in Step 7.
Use metahs(1M) to add back the hot spares. See the metahs(1M) man page for details.
phys-hahost1# metahs -s hahost1 -a hsp000 c3t2d5s0
Verify that the replacement corrected the problem.
phys-hahost1# metastat -s hahost1
These are the high-level steps to replace a Sun StorEdge MultiPack or Sun StorEdge D1000 disk in an SSVM or CVM configuration:
Removing the failed disk in the disk enclosure by using the vxdiskadm command
Replacing the failed disk
Replacing the disk removed earlier by using the vxdiskadm command
For systems not running shared disk groups, master node refers to the node that has imported the disk group.
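You can confirm which node has imported a disk group by running vxdg(1M) on each node; a sketch with hypothetical disk group IDs:

# vxdg list
NAME         STATE           ID
rootdg       enabled         896622412.1025.phys-hahost1
demo         enabled         896622489.1123.phys-hahost1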
If you are running shared disk groups, determine the master and slave node by entering the following command on all nodes in the cluster:
# vxdctl -c mode
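The output reports each node's role; a minimal sketch of what to expect, using the host names from the examples in this chapter:

phys-hahost1# vxdctl -c mode
mode: enabled: cluster active - MASTER
phys-hahost2# vxdctl -c mode
mode: enabled: cluster active - SLAVE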
Complete the following steps from the master node.
Determine if the disk in question had failures and is in the NODEVICE state.
If this is the case, skip to Step 8; a disk in the NODEVICE state has already been detached by the volume manager, so it does not need to be removed with vxdiskadm first.
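For example, a failed disk appears with NODEVICE in the STATE column of the vxprint(1M) output for its disk media record (output abridged; disk and group names follow the example used below):

phys-hahost2# vxprint -g demo
...
dm c2t8d0       -            -        -        -        NODEVICE
...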
Run the vxdiskadm utility and enter 4 (Remove a disk for replacement).
This option removes a physical disk while retaining the disk name. The utility then queries you for the particular device that you want to replace.
Enter the disk name or list.
The following example illustrates the removal of disk c2t8d0.
Enter disk name [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c0t0d0s7     c0t0d0s7     simple    1024     20255    -

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c1t2d0       c2t2d0s2     sliced    1519     4152640  -
dm c1t3d0       c2t3d0s2     sliced    1519     4152640  -
dm c1t4d0       c2t4d0s2     sliced    1519     4152640  -
dm c1t5d0       c2t5d0s2     sliced    1519     4152640  -
dm c1t8d0       c2t8d0s2     sliced    1519     4152640  -
dm c1t9d0       c2t9d0s2     sliced    1519     4152640  -
dm c2t2d0       c1t2d0s2     sliced    1519     4152640  -
dm c2t3d0       c1t3d0s2     sliced    1519     4152640  -
dm c2t4d0       c1t4d0s2     sliced    1519     4152640  -
dm c2t5d0       c1t5d0s2     sliced    1519     4152640  -
dm c2t8d0       c1t8d0s2     sliced    1519     4152640  -
dm c2t9d0       c1t9d0s2     sliced    1519     4152640  -

Enter disk name [<disk>,list,q,?] c2t8d0

The requested operation is to remove disk c2t8d0 from disk group
demo. The disk name will be kept, along with any volumes using
the disk, allowing replacement of the disk.

Select "Replace a failed or removed disk" from the main menu when
you wish to replace the disk.
Enter y or press Return to continue.
Continue with operation? [y,n,q,?] (default: y) y

Removal of disk c2t8d0 completed successfully.
Enter q to quit the utility.
Remove another disk? [y,n,q,?] (default: n) q
Enter vxdisk list and vxprint to view the changes.
The example disk c2t8d0 is removed.
# vxdisk list
.
c2t3d0s2     sliced    c1t3d0       demo         online shared
c2t4d0s2     sliced    c1t4d0       demo         online shared
c2t5d0s2     sliced    c1t5d0       demo         online shared
c2t8d0s2     sliced    c1t8d0       demo         online shared
c2t9d0s2     sliced    c1t9d0       demo         online shared
-            -         c2t8d0       demo         removed

# vxprint
.
dm c2t3d0       c1t3d0s2     -        4152640  -        -        -       -
dm c2t4d0       c1t4d0s2     -        4152640  -        -        -       -
dm c2t5d0       c1t5d0s2     -        4152640  -        -        -       -
dm c2t8d0       -            -        -        -        REMOVED  -       -
dm c2t9d0       c1t9d0s2     -        4152640  -        -        -       -

pl demo05-02    -            DISABLED 51200    -        REMOVED  -       -
sd c2t8d0-1     demo05-02    DISABLED 51200    0        REMOVED  -       -
.
.
.
Replace the physical drive without powering off any component.
For further information, refer to the documentation accompanying the disk enclosure unit.
As you replace the drive, you may see messages on the system console similar to those in the following example. Do not become alarmed, as these messages may not indicate a problem. Instead, proceed with the replacement as described in the next steps.
Nov 3 17:44:00 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:00 updb10a unix: SCSI transport failed: reason 'incomplete': \
retrying command
Nov 3 17:44:03 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:03 updb10a unix: disk not responding to selection
Run the vxdiskadm utility and enter 5 (Replace a failed or removed disk).
Enter the disk name.
You can enter list to see a list of disks in the REMOVED state.
The disk may appear in the NODEVICE state if it had failures.
Select a removed or failed disk [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c2t8d0       -            -         -        -        REMOVED

Select a removed or failed disk [<disk>,list,q,?] c2t8d0
The vxdiskadm utility detects the new device and asks you whether the new device should replace the removed device.
If there are other unused disks attached to the system, vxdiskadm also presents these disks as viable choices.
Enter the device name, or if the utility lists the device as the default, press Return.
The following devices are available as replacements:
        c1t8d0s2

You can choose one of these disks to replace c2t8d0.
Choose "none" to initialize another disk to replace c2t8d0.

Choose a device, or select "none" [<device>,none,q,?] (default: c1t8d0s2) <Return>

The requested operation is to use the initialized device c1t8d0s2
to replace the removed or failed disk c2t8d0 in disk group demo.
Enter y or press Return to verify that you want this device (in the example, c1t8d0s2) to be the replacement disk.
Continue with operation? [y,n,q,?] (default: y) <Return>

Replacement of disk c2t8d0 in group demo with disk device
c1t8d0s2 completed successfully.
Enter n or press Return to quit this utility.
Replace another disk? [y,n,q,?] (default: n) <Return>
Enter vxdisk list and vxprint to see the changes.
The example disk, c2t8d0, is no longer in the REMOVED state.
# vxdisk list
.
c2t2d0s2     sliced    c1t2d0       demo         online shared
c2t3d0s2     sliced    c1t3d0       demo         online shared
c2t4d0s2     sliced    c1t4d0       demo         online shared
c2t5d0s2     sliced    c1t5d0       demo         online shared
c2t8d0s2     sliced    c1t8d0       demo         online shared
c2t9d0s2     sliced    c1t9d0       demo         online shared

# vxprint
.
dm c2t4d0       c1t4d0s2     -        4152640  -        -        -       -
dm c2t5d0       c1t5d0s2     -        4152640  -        -        -       -
dm c2t8d0       c1t8d0s2     -        4152640  -        -        -       -
dm c2t9d0       c1t9d0s2     -        4152640  -        -        -       -
.
This section describes how to replace an entire Sun StorEdge MultiPack or Sun StorEdge D1000 enclosure running SSVM or CVM.
These are the high-level steps for replacing an entire failed Sun StorEdge MultiPack or Sun StorEdge D1000 in an SSVM or CVM configuration:
Removing all the disks in the defective disk enclosure by using the vxdiskadm command
Replacing the failed disk enclosure
Replacing all the disks removed earlier into the new disk enclosure by using the vxdiskadm command
For systems not running shared disk groups, master node refers to the node that has imported the disk group.
If you are running shared disk groups, determine the master and slave node by entering the following command on all nodes in the cluster:
# vxdctl -c mode
Complete the following steps from the master node.
Remove all the disks on the failed disk enclosure by running the vxdiskadm utility and entering 4 (Remove a disk for replacement).
This option enables you to remove only one disk at a time. Repeat this procedure for each disk.
Enter the list command.
In the following example, assume that the disk enclosure on controller c2 needs replacement. Based on the list output, the SSVM or CVM names for these disks are c2t2d0, c2t3d0, c2t4d0, c2t5d0, c2t8d0, and c2t9d0.
Remove a disk for replacement
Menu: VolumeManager/Disk/RemoveForReplace

Use this menu operation to remove a physical disk from a disk
group, while retaining the disk name. This changes the state
for the disk name to a "removed" disk. If there are any
initialized disks that are not part of a disk group, you will be
given the option of using one of these disks as a replacement.

Enter disk name [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c0t0d0s7     c0t0d0s7     simple    1024     20255    -

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c1t2d0       c2t2d0s2     sliced    1519     4152640  -
dm c1t3d0       c2t3d0s2     sliced    1519     4152640  -
dm c1t4d0       c2t4d0s2     sliced    1519     4152640  -
dm c1t5d0       c2t5d0s2     sliced    1519     4152640  -
dm c1t8d0       c2t8d0s2     sliced    1519     4152640  -
dm c1t9d0       c2t9d0s2     sliced    1519     4152640  -
dm c2t2d0       c1t2d0s2     sliced    1519     4152640  -
dm c2t3d0       c1t3d0s2     sliced    1519     4152640  -
dm c2t4d0       c1t4d0s2     sliced    1519     4152640  -
dm c2t5d0       c1t5d0s2     sliced    1519     4152640  -
dm c2t8d0       c1t8d0s2     sliced    1519     4152640  -
dm c2t9d0       c1t9d0s2     sliced    1519     4152640  -
Enter the disk name (in this example, c2t2d0).
Enter disk name [<disk>,list,q,?] c2t2d0

The following volumes will lose mirrors as a result of this
operation:

        demo-1

No data on these volumes will be lost.

The requested operation is to remove disk c2t2d0 from disk group
demo. The disk name will be kept, along with any volumes using
the disk, allowing replacement of the disk.

Select "Replace a failed or removed disk" from the main menu when
you wish to replace the disk.
Enter y or press Return to verify that you want to remove the disk.
Continue with operation? [y,n,q,?] (default: y) <Return>

Removal of disk c2t2d0 completed successfully.
Enter y to continue.
Remove another disk? [y,n,q,?] (default: n) y

Remove a disk for replacement
Menu: VolumeManager/Disk/RemoveForReplace

Use this menu operation to remove a physical disk from a disk
group, while retaining the disk name. This changes the state
for the disk name to a "removed" disk. If there are any
initialized disks that are not part of a disk group, you will be
given the option of using one of these disks as a replacement.
Enter the next example disk name, c2t3d0.
Enter disk name [<disk>,list,q,?] c2t3d0

The following volumes will lose mirrors as a result of this
operation:

        demo-2

No data on these volumes will be lost.

The following devices are available as replacements:

        c1t2d0

You can choose one of these disks now, to replace c2t3d0.
Select "none" if you do not wish to select a replacement disk.
Enter none, if necessary.
This query arises whenever the utility recognizes a good disk in the system. If there are no good disks, you will not see this query.
Choose a device, or select "none" [<device>,none,q,?] (default: c1t2d0) none
Enter y or press Return to verify that you want to remove the disk.
The requested operation is to remove disk c2t3d0 from disk group
demo. The disk name will be kept, along with any volumes using
the disk, allowing replacement of the disk.

Select "Replace a failed or removed disk" from the main menu when
you wish to replace the disk.

Continue with operation? [y,n,q,?] (default: y) <Return>

Removal of disk c2t3d0 completed successfully.
Repeat Step 6 through Step 9 for each disk you identified in Step 3.
Power off and replace the disk enclosure.
For more information, refer to the disk enclosure documentation.
As you replace the disk enclosure, you may see messages on the system console similar to those in the following example. Do not become alarmed, as these messages may not indicate a problem. Instead, proceed with the replacement as described in the next section.
Nov 3 17:44:00 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:00 updb10a unix: SCSI transport failed: reason 'incomplete': \
retrying command
Nov 3 17:44:03 updb10a unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@2,0 (sd17):
Nov 3 17:44:03 updb10a unix: disk not responding to selection
Power on the disk enclosure.
For more information, refer to your disk enclosure service manual.
Attach all the disks removed earlier by running the vxdiskadm utility and entering 5 (Replace a failed or removed disk).
This option enables you to replace only one disk at a time. Repeat this procedure for each disk.
Enter the list command to see a list of disk names now in the REMOVED state.
Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk

Use this menu operation to specify a replacement disk for a disk
that you removed with the "Remove a disk for replacement" menu
operation, or that failed during use. You will be prompted for a
disk name to replace and a disk device to use as a replacement.
You can choose an uninitialized disk, in which case the disk will
be initialized, or you can choose a disk that you have already
initialized using the Add or initialize a disk menu operation.

Select a removed or failed disk [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

Disk group: demo

DM NAME         DEVICE       TYPE      PRIVLEN  PUBLEN   STATE

dm c2t2d0       -            -         -        -        REMOVED
dm c2t3d0       -            -         -        -        REMOVED
dm c2t4d0       -            -         -        -        REMOVED
dm c2t5d0       -            -         -        -        REMOVED
dm c2t8d0       -            -         -        -        REMOVED
dm c2t9d0       -            -         -        -        REMOVED
Enter the disk name (in this example, c2t2d0).
Select a removed or failed disk [<disk>,list,q,?] c2t2d0

The following devices are available as replacements:

        c1t2d0s2 c1t3d0s2 c1t4d0s2 c1t5d0s2 c1t8d0s2 c1t9d0s2
The vxdiskadm utility detects the new devices and asks you whether the new devices should replace the removed devices.
Enter the "replacement" or "new" device name, or if the utility lists the device as the default, press Return.
You can choose one of these disks to replace c2t2d0.
Choose "none" to initialize another disk to replace c2t2d0.

Choose a device, or select "none" [<device>,none,q,?] (default: c1t2d0s2) <Return>
Enter y or press Return to verify that you want this device (in the example, c1t2d0s2) to be the replacement disk.
The requested operation is to use the initialized device c1t2d0s2
to replace the removed or failed disk c2t2d0 in disk group demo.

Continue with operation? [y,n,q,?] (default: y) <Return>

Replacement of disk c2t2d0 in group demo with disk device
c1t2d0s2 completed successfully.
Enter y to continue.
Replace another disk? [y,n,q,?] (default: n) y
Repeat Step 15 through Step 18 for each of the REMOVED/NODEVICE disk names.