Sun Cluster 2.2 System Administration Guide

11.6.5 How to Replace a SPARCstorage Array Disk (Solstice DiskSuite)

This procedure describes how to replace a failed multihost disk in a Solstice DiskSuite configuration. Some of the steps apply only to configurations using the SPARCstorage Array 100 series, or the SPARCstorage Array 200 series with the differential SCSI tray.

  1. Switch ownership of the affected logical hosts to other nodes by using the haswitch(1M) command.

    phys-hahost1# haswitch phys-hahost1 hahost1 hahost2
    

    The SPARCstorage Array tray containing the failed disk might contain disks included in more than one logical host. If this is the case, switch ownership of all logical hosts with disks using this tray to another node in the cluster.

  2. Identify the disk to be replaced by examining metastat(1M) and /var/adm/messages output.

    When metastat(1M) reports that a device is in maintenance state, or that some of its components have been replaced by hot spares, you must locate and replace the device. In the following sample metastat(1M) output, submirror d50 needs maintenance because component c3t3d4s0 has failed; hot spare c3t5d4s0 has taken over for it, so c3t3d4 is the disk to replace.

    phys-hahost1# metastat -s hahost1
    ...
      d50: Submirror of hahost1/d40
           State: Needs Maintenance
           Stripe 0:
               Device       Start Block      Dbase      State          Hot Spare
               c3t3d4s0     0                No         Okay           c3t5d4s0
     ...

    Check /var/adm/messages to see what kind of problem has been detected.

    ...
    Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
    Jun 1 16:15:26 host1 unix: Error for command `write(I))' Error Level: Fatal
    Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
    Jun 1 16:15:27 host1 unix: Sense Key: Media Error
    Jun 1 16:15:27 host1 unix: Vendor `CONNER':
    Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error), ASCQ=0x0, FRU=0x15
    ...
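
    The scan for failed components can be scripted. The following is a minimal parsing sketch, not part of the product: given saved metastat(1M) output in the format shown above, it prints each submirror whose state is Needs Maintenance. The sample text inlined below is taken from this step; on a cluster node you would pipe in live `metastat -s hahost1` output instead.

    ```shell
    # Print the name of every submirror reported as "Needs Maintenance".
    # The here-document stands in for real metastat output.
    awk '/Submirror of/            { sub(":$", "", $1); name = $1 }
         /State: Needs Maintenance/ { print name }' <<'EOF'
    d50: Submirror of hahost1/d40
         State: Needs Maintenance
         Stripe 0:
             Device       Start Block      Dbase      State          Hot Spare
             c3t3d4s0     0                No         Okay           c3t5d4s0
    EOF
    ```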
  3. Determine the location of the problem disk by running the luxadm(1M) command.

    The luxadm(1M) command lists the trays and the drives associated with them. The output differs for each SPARCstorage Array series. This example shows output from a SPARCstorage Array 100 series array. The damaged drive, c3t3d4, appears as Drive:3,4 in TRAY2.

    phys-hahost1# luxadm display c3
             SPARCstorage Array Configuration
     Controller path:
     /devices/iommu@f,e0000000/sbus@f,e0001000/SUNW,soc@0,0/SUNW,pln@
     a0000000,779a16:ctlr
              DEVICE STATUS
              TRAY1          TRAY2          TRAY3
     Slot
     1        Drive:0,0      Drive:2,0      Drive:4,0
     2        Drive:0,1      Drive:2,1      Drive:4,1
     3        Drive:0,2      Drive:2,2      Drive:4,2
     4        Drive:0,3      Drive:2,3      Drive:4,3
     5        Drive:0,4      Drive:2,4      Drive:4,4
     6        Drive:1,0      Drive:3,0      Drive:5,0
     7        Drive:1,1      Drive:3,1      Drive:5,1
     8        Drive:1,2      Drive:3,2      Drive:5,2
     9        Drive:1,3      Drive:3,3      Drive:5,3
     10       Drive:1,4      Drive:3,4      Drive:5,4
    
              CONTROLLER STATUS
     Vendor:    SUN
     Product ID:  SSA110
     Product Rev: 1.0
     Firmware Rev: 3.9
     Serial Num: 000000741022
     Accumulate performance Statistics: Enabled
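
    The tray can also be derived arithmetically. The following quick sketch assumes the 100-series layout printed above, in which SCSI targets 0-1 sit in tray 1, targets 2-3 in tray 2, and targets 4-5 in tray 3; it extracts the target from a cNtTdD disk name and computes the tray number.

    ```shell
    # Derive the SSA 100-series tray number from a cNtTdD disk name.
    disk=c3t3d4                 # the failed disk identified in Step 2
    target=${disk#*t}           # strip through the first 't'  -> "3d4"
    target=${target%d*}         # strip from the 'd' onward    -> "3"
    echo "Tray $(( target / 2 + 1 ))"
    ```

    For c3t3d4 this prints Tray 2, matching the luxadm display above.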
  4. Detach all submirrors with components on the disk being replaced.

    If you are detaching a submirror that has a failed component, you must force the detach using the metadetach -f command. The following example command detaches submirror d50 from metamirror d40.

    phys-hahost1# metadetach -s hahost1 -f d40 d50
    
  5. Use the metaclear(1M) command to clear the submirrors detached in Step 4.

    phys-hahost1# metaclear -s hahost1 -f d50
    
  6. Before deleting replicas and hot spares, record the location (slice), the number of replicas, and the hot spare information (device names and the list of devices that contain hot spare pools), so that these actions can be reversed after the disk replacement.

  7. Delete all hot spares that have Available status and are in the same tray as the problem disk.

    This includes all hot spares, regardless of their logical host assignment. In the following example, the metahs(1M) command reports hot spares on hahost1 but shows that none are present on hahost2. Hot spare c3t2d5s0 is in the same tray as the problem disk, so it is deleted; record its name so that it can be added back in Step 25.

    phys-hahost1# metahs -s hahost1 -i
     hahost1:hsp000 2 hot spares
             c1t4d0s0                Available       2026080 blocks
             c3t2d5s0                Available       2026080 blocks
     phys-hahost1# metahs -s hahost1 -d hsp000 c3t2d5s0
     hahost1:hsp000:
             Hotspare is deleted
     phys-hahost1# metahs -s hahost2 -i
     phys-hahost1# metahs -s hahost1 -i
     hahost1:hsp000 1 hot spare
             c1t4d0s0                Available       2026080 blocks
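
    In a larger configuration, you can generate the delete commands rather than type them. The following sketch (not part of the product) parses saved `metahs -s hahost1 -i` output and prints one metahs(1M) delete command for each Available hot spare whose device sits on the affected tray (c3t2*/c3t3* in this example). The here-document stands in for the live output; review the printed commands before running them.

    ```shell
    # Print a "metahs -d" command for each Available hot spare in the
    # affected tray (controller c3, targets 2 and 3 in this example).
    awk '$1 ~ /^hahost1:/ { split($1, a, ":"); pool = a[2] }
         $1 ~ /^c3t[23]d/ && $2 == "Available" {
             print "metahs -s hahost1 -d " pool " " $1
         }' <<'EOF'
    hahost1:hsp000 2 hot spares
            c1t4d0s0                Available       2026080 blocks
            c3t2d5s0                Available       2026080 blocks
    EOF
    ```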
  8. Use the metaset(1M) command to remove the failed disk from the diskset.

    The syntax for the command is shown below. In this example, diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3).

    # metaset -s diskset -d drive
    

    This operation can take fifteen minutes or more, depending on the size of your configuration and the number of disks.

  9. Delete any metadevice state database replicas that are on disks in the tray to be serviced.

    The metadb(1M) command with the -s option reports replicas in a specified diskset.

    phys-hahost1# metadb -s hahost1
    phys-hahost1# metadb -s hahost2
    phys-hahost1# metadb -s hahost1 -d replicas-in-tray
    phys-hahost1# metadb -s hahost2 -d replicas-in-tray
    
  10. Locate the submirrors using components that reside in the affected tray.

    One method is to use the metastat(1M) command to create temporary files that contain the names of all metadevices. For example:

    phys-hahost1# metastat -s hahost1 > /usr/tmp/hahost1.stat
    phys-hahost1# metastat -s hahost2 > /usr/tmp/hahost2.stat
    

    Search the temporary files for the components in question (c3t3dn and c3t2dn in this example). The information in the temporary files will look like this:

    ...
     hahost1/d35: Submirror of hahost1/d15
        State: Okay
        Hot Spare pool: hahost1/hsp100
        Size: 2026080 blocks
        Stripe 0:
           Device      Start Block     Dbase     State      Hot Spare
           c3t3d3s0    0               No        Okay      
     hahost1/d54: Submirror of hahost1/d24
        State: Okay
        Hot Spare pool: hahost1/hsp106
        Size: 21168 blocks
        Stripe 0:
           Device      Start Block     Dbase     State      Hot Spare
           c3t3d3s6    0               No        Okay      
     ...
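
    The search through the temporary files can be automated. The following sketch (assuming the metastat(1M) output format shown above) prints the name of each submirror that has a component on c3t2 or c3t3. The sample text is inlined below; on a cluster node you would read /usr/tmp/hahost1.stat instead.

    ```shell
    # List each submirror with a component on controller c3, target 2 or 3.
    awk '/Submirror of/   { sub(":$", "", $1); name = $1 }
         $1 ~ /^c3t[23]d/ { print name }' <<'EOF'
    hahost1/d35: Submirror of hahost1/d15
       Stripe 0:
          c3t3d3s0    0               No        Okay
    hahost1/d54: Submirror of hahost1/d24
       Stripe 0:
          c3t3d3s6    0               No        Okay
    EOF
    ```

    For the sample above this prints hahost1/d35 and hahost1/d54, the two submirrors taken off line in Step 11.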
  11. Take offline all other submirrors that have components in the affected tray.

    Using the output from the temporary files in Step 10, run the metaoffline(1M) command on all submirrors in the affected tray.

    phys-hahost1# metaoffline -s hahost1 d15 d35
    phys-hahost1# metaoffline -s hahost1 d24 d54
    ...

    Run metaoffline(1M) as many times as necessary to take all the submirrors off line. This forces Solstice DiskSuite to stop using the submirror components.
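
    The repeated invocations can be driven by a loop. The following dry-run sketch walks the mirror/submirror pairs found in Step 10 and prints one metaoffline(1M) command per pair; remove the leading `echo` to run the commands on the node that owns the diskset.

    ```shell
    # Dry run: print one metaoffline command per mirror/submirror pair.
    # The pairs below are the ones from this example; substitute your own.
    for pair in "d15 d35" "d24 d54"; do
        echo metaoffline -s hahost1 $pair
    done
    ```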

  12. If enabled, flush the NVRAM on the controller, tray, individual disk or disks.

    phys-hahost1# luxadm sync_cache pathname
    

    A confirmation appears, indicating that NVRAM has been flushed. See "11.7.3 Flushing and Purging NVRAM", for details on flushing NVRAM data.

  13. Spin down all disks in the affected SPARCstorage Array tray(s).

    Use the luxadm stop command to spin down the disks. Refer to the luxadm(1M) man page for details.

    phys-hahost1# luxadm stop -t 2 c3
    

    Caution -

    Do not run any Solstice DiskSuite commands while a SPARCstorage Array tray is spun down because the commands might have the side effect of spinning up some or all of the drives in the tray.


  14. Replace the disk.

    Refer to the hardware service manuals for your SPARCstorage Array for details on this procedure.

  15. Update the DID driver's database with the new device ID.

    Use the -l flag to scdidadm(1M) to identify the DID name for the lower-level device name of the drive to be replaced. Then update the DID drive database using the -R flag to scdidadm(1M). Refer to the Sun Cluster 2.2 Software Installation Guide for details on the DID pseudo driver.

    phys-hahost1# scdidadm -o name -l /dev/rdsk/c3t3d4
    6	phys-hahost1:/dev/rdsk/c3t3d4	/dev/did/rdsk/d6
     phys-hahost1# scdidadm -R d6
    
  16. Make sure all disks in the affected multihost disk expansion unit spin up.

    The disks in the multihost disk expansion unit should spin up automatically. If the tray fails to spin up within two minutes, force the action by using the following command:

    phys-hahost1# luxadm start -t 2 c3
    
  17. Add the new disk back into the diskset by using the metaset(1M) command.

    This step automatically adds back the number of replicas that were deleted from the failed disk. The command syntax is as follows, where diskset is the name of the diskset containing the failed disk, and drive is the DID name of the disk in the form dN (for new installations of Sun Cluster), or cNtYdZ (for installations that upgraded from HA 1.3):

    # metaset -s diskset -a drive
    
  18. (Optional) If you deleted replicas that belonged to other disksets from disks that were in the same tray as the errored disk, use the metadb(1M) command to add back the replicas.

    phys-hahost1# metadb -s hahost2 -a deleted-replicas
    

    To add multiple replicas to the same slice, use the -c option.

  19. Use the scadmin(1M) command to reserve and enable failfast on the specified disk that has just been added to the diskset.

    phys-hahost2# scadmin reserve c3t3d4
    
  20. Use the format(1M) or fmthard(1M) command to repartition the new disk.

    Make sure that you partition the new disk exactly as the disk it replaces was partitioned. (Saving the disk format information was recommended in Chapter 1, Preparing for Sun Cluster Administration.)
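
    If the old disk's label was saved with prtvtoc(1M), fmthard(1M) can stamp it onto the replacement. The following is a dry-run sketch with hypothetical paths (/usr/tmp/c3t3d4.vtoc is assumed to hold the saved partition table); remove the leading `echo` on a real cluster node.

    ```shell
    # Dry run: restore a saved VTOC onto the replacement disk.
    echo "prtvtoc /dev/rdsk/c3t3d4s2 > /usr/tmp/c3t3d4.vtoc   # run before the failure"
    echo "fmthard -s /usr/tmp/c3t3d4.vtoc /dev/rdsk/c3t3d4s2"
    ```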

  21. Use the metainit(1M) command to reinitialize disks that were cleared in Step 5.

    phys-hahost1# metainit -s hahost1 d50
    
  22. Bring online all submirrors that were taken off line in Step 11.

    phys-hahost1# metaonline -s hahost1 d15 d35
    phys-hahost1# metaonline -s hahost1 d24 d54
    ...

    Run the metaonline(1M) command as many times as necessary to bring online all the submirrors.

    When the submirrors are brought back online, Solstice DiskSuite automatically performs resyncs on all the submirrors, bringing all data up-to-date.


    Note -

    Running the metastat(1M) command at this time would show that all metadevices with components residing in the affected tray are resyncing.


  23. Attach submirrors that were detached in Step 4.

    Use the metattach(1M) command to perform this step. See the metattach(1M) man page for details.

    phys-hahost1# metattach -s hahost1 d40 d50
    
  24. Replace any hot spares in use in the submirrors attached in Step 23.

    If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to Available status.

    phys-hahost1# metareplace -s hahost1 -e d40 c3t3d4s0
    
  25. Restore all hot spares that were deleted in Step 7.

    Use the metahs(1M) command to add back the hot spares. See the metahs(1M) man page for details.

    phys-hahost1# metahs -s hahost1 -a hsp000 c3t2d5s0
    
  26. If necessary, switch logical hosts back to their default masters.

    phys-hahost1# haswitch phys-hahost2 hahost2
    
  27. Verify that the replacement corrected the problem.

    phys-hahost1# metastat -s hahost1