How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool) (Solstice DiskSuite 4.2.1 User's Guide)


The steps to replace a SPARCstorage Array disk in a DiskSuite environment depend largely on how the slices on the disk are used and on how the disks are cabled to the system. They also depend on whether the slices are used directly, by DiskSuite, or both.

Note -

This procedure applies to a SPARCstorage Array 100. The steps to replace a disk in a SPARCstorage Array 200 are similar.

The high-level steps in this task are:

Identifying the disk that needs replacing and determining its location
Deleting hot spares marked "Available" that are in the tray that must be pulled
Deleting state database replicas that are on the disks in the tray that must be pulled
Locating submirrors using disks in the tray that must be pulled
Detaching submirrors with slices on the disk that is being replaced
Offlining other submirrors using disks in the tray
Spinning down all disks in the tray
Pulling the tray and replacing the disk
Making sure that all disks in the tray spin back up
Repartitioning the new disk
Bringing submirrors in the tray back online
Attaching detached submirrors in the tray
Replacing hot spares that are in use in the reattached submirrors
Adding the hot spares that were deleted back to their hot spare pools
Adding metadevice state database replicas that were deleted

Note -

You can use this procedure if a submirror is in the "Maintenance" state, replaced by a hot spare, or is generating intermittent errors.

To locate and replace the disk, perform the following steps:

Identify the disk to be replaced, either by using DiskSuite Tool to look at the Status fields of objects, or by examining metastat and /var/adm/messages output.

# metastat
...
 d50:Submirror of d40
      State: Needs Maintenance
...
# tail -f /var/adm/messages
...
Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor `CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...

The metastat command shows that a submirror is in the "Needs Maintenance" state, and the /var/adm/messages file reports a disk drive error. To locate the disk drive, use the ls command as follows, and match the device path that each symbolic link points to against the path in the /var/adm/messages output.

# ls -l /dev/rdsk/*
...
lrwxrwxrwx   1 root     root          90 Mar  4 13:26 /dev/rdsk/c3t3d4s0 -> ../../devices/io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49)
...

Together, this information and the metastat output show that drive c3t3d4 must be replaced.
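
The lookup above can be wrapped in a small shell function. This is a minimal sketch, not part of DiskSuite or Solaris: the name find_links and its optional directory argument are assumptions made for illustration, and the pattern is interpreted as a grep regular expression.

```shell
# find_links PATTERN [DIR]
# List the symbolic links in DIR (default /dev/rdsk) whose targets contain
# PATTERN, for example the tail of the path reported in /var/adm/messages.
# find_links is a hypothetical helper, not a DiskSuite or Solaris command.
find_links() {
    pattern=$1
    dir=${2:-/dev/rdsk}
    ls -l "$dir"/* | grep "$pattern"
}

# Example call on a live system (path fragment taken from the log above):
#   find_links 'pln@a0000000,741022/ssd@3,4'
```

Matching on the controller and ssd portion of the path, rather than on the ssd49 instance name, keeps the search tied to the physical location that the log message reports.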

Determine the affected tray by using DiskSuite Tool.

To find the SPARCstorage Array tray where the problem disk resides, use the Disk View window.

Click Disk View to display the Disk View window.

Drag the problem metadevice (in this example, a mirror) from the Objects list to the Disk View window.

The Disk View window shows the logical to physical device mappings by coloring the physical slices that make up the metadevice. You can see at a glance which tray contains the problem disk.

An alternate way to find the SPARCstorage Array tray where the problem disk resides is to use the ssaadm(1M) command.

host1# ssaadm display c3
         SPARCstorage Array Configuration
Controller path: /devices/io-unit@f,e1200000/sbi@0.0/SUNW,soc@0,0/SUNW,pln@a0000000,741022:ctlr
         DEVICE STATUS
         TRAY1          TRAY2          TRAY3
Slot
1        Drive:0,0      Drive:2,0      Drive:4,0
2        Drive:0,1      Drive:2,1      Drive:4,1
3        Drive:0,2      Drive:2,2      Drive:4,2
4        Drive:0,3      Drive:2,3      Drive:4,3
5        Drive:0,4      Drive:2,4      Drive:4,4
6        Drive:1,0      Drive:3,0      Drive:5,0
7        Drive:1,1      Drive:3,1      Drive:5,1
8        Drive:1,2      Drive:3,2      Drive:5,2
9        Drive:1,3      Drive:3,3      Drive:5,3
10       Drive:1,4      Drive:3,4      Drive:5,4
 
         CONTROLLER STATUS
Vendor:    SUNW
Product ID:  SSA100
Product Rev: 1.0
Firmware Rev: 2.3
Serial Num: 000000741022
Accumulate performance Statistics: Enabled

The ssaadm output for controller c3 shows that Drive 3,4 (c3t3d4) is in slot 10 of TRAY2, the middle tray. It is the drive closest to you when you pull out that tray.
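
The table in the ssaadm output follows a regular pattern: targets 0 and 1 are in TRAY1, targets 2 and 3 in TRAY2, and targets 4 and 5 in TRAY3, with even targets in slots 1 to 5 and odd targets in slots 6 to 10. The sketch below encodes that pattern. The name ssa_slot is hypothetical, the formula is inferred only from the SSA100 table shown here, and a POSIX-style shell with arithmetic expansion is assumed.

```shell
# ssa_slot TARGET DISK
# Print the tray and slot for drive TARGET,DISK, following the layout in the
# ssaadm display table above (hypothetical helper; SSA100 layout assumed).
ssa_slot() {
    target=$1
    disk=$2
    tray=$(( target / 2 + 1 ))
    slot=$(( (target % 2) * 5 + disk + 1 ))
    echo "tray $tray slot $slot"
}

ssa_slot 3 4    # drive c3t3d4 prints: tray 2 slot 10
```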

[Optional] If you have a diskset, locate the diskset that contains the affected drive.

The following commands locate drive c3t3d4. Note that no output was displayed when the command was run with logicalhost2, but logicalhost1 reported that the name was present. In the reported output, the yes field indicates that the disk contains a state database replica.

host1# metaset -s logicalhost2 | grep c3t3d4
host1# metaset -s logicalhost1 | grep c3t3d4
c3t3d4 yes

Note -

If you are using Solstice HA servers, you must switch ownership of both logical hosts to one Solstice HA server. Refer to the Solstice HA documentation.
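
When there are several disksets to check, the grep shown above can be replaced by a small filter that prints only the replica field for the drive. This is a sketch under assumptions: dbase_flag is a hypothetical helper name, the input is assumed to have the drive name in the first column and the replica field in the second (as in the output above), and an awk that accepts the -v option (such as nawk) is assumed.

```shell
# dbase_flag DRIVE
# Read `metaset -s <set>` output on standard input and print the second
# column of the line for DRIVE. Prints nothing if the drive is not listed.
# dbase_flag is a hypothetical helper, not a DiskSuite command.
dbase_flag() {
    awk -v drive="$1" '$1 == drive { print $2 }'
}

# Example loop on a live system (set names are those from the example above):
#   for set in logicalhost1 logicalhost2; do
#       echo "$set: $(metaset -s "$set" | dbase_flag c3t3d4)"
#   done
```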

Determine other DiskSuite objects on the affected tray.

Because you must pull the tray to replace the disk, determine what other objects will be affected in the process.
1. In DiskSuite Tool, display the Disk View window. Select the tray. From the Object menu, choose Device Mappings. The Physical to Logical Device Mapping window appears.
2. Note all affected objects, including state database replicas, metadevices, and hot spares that appear in the window.

Prepare for disk replacement by preparing other DiskSuite objects in the affected tray.
1. Delete all hot spares that have a status of "Available" and that are in the same tray as the problem disk.
  
  Record all the information about the hot spares so they can be added back to the hot spare pools following the replacement procedure.
2. Delete any state database replicas that are on disks in the tray that must be pulled. Keep track of this information, because you must add these replicas back later in the procedure, in the step "Add all state database replicas that were deleted from disks on the tray."
  
  There may be multiple replicas on the same disk. Make sure you record the number of replicas deleted from each slice.
3. Locate the submirrors that are using slices that reside in the tray.
4. Detach all submirrors with slices on the disk that is being replaced.
5. Take all other submirrors that have slices in the tray offline.
  
  This forces DiskSuite to stop using the submirror slices in the tray so that the drives can be spun down.
  
  To remove objects, refer to Chapter 5, Removing DiskSuite Objects. To detach and offline submirrors, refer to "Working With Mirrors".
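
For reference, the substeps above correspond roughly to the following DiskSuite commands. This is a hedged sketch rather than the procedure the guide prescribes for DiskSuite Tool. Only d40, d50, and c3t3d4 come from the example; the hot spare pool hsp001, the mirror d41, the submirror d51, and the slice names are hypothetical. The script only prints the commands unless DRYRUN=no is set.

```shell
# Print each command instead of running it unless DRYRUN=no is set.
run() {
    if [ "${DRYRUN:-yes}" = no ]; then "$@"; else echo "+ $*"; fi
}

prepare_tray() {
    # Delete an "Available" hot spare in the tray (pool and slice are hypothetical).
    run metahs -d hsp001 c3t3d1s0
    # Delete the state database replicas on a slice in the tray (slice is hypothetical).
    run metadb -d c3t3d4s3
    # Detach the submirror that has slices on the failed disk (d50 of mirror d40).
    run metadetach d40 d50
    # Take another submirror that uses the tray offline (d41 and d51 are hypothetical).
    run metaoffline d41 d51
}

prepare_tray
```

Reviewing the printed commands before setting DRYRUN=no gives a chance to compare them against the objects noted in the Physical to Logical Device Mapping window.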

Spin down all disks in the SPARCstorage Array tray.

Refer to "How to Stop a Disk (DiskSuite Tool)".

Note -
Do not remove the SPARCstorage Array tray while the LED on the tray is illuminated. Also, do not run any DiskSuite commands while the tray is spun down, because doing so can spin up some or all of the drives in the tray.
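
From the command line, a tray can be spun down with ssaadm stop. The sketch below assumes the tray and controller from the example (tray 2 on controller c3) and prints the command unless DRYRUN=no is set; the run wrapper is a hypothetical convenience, not part of ssaadm.

```shell
# Print each command instead of running it unless DRYRUN=no is set.
run() {
    if [ "${DRYRUN:-yes}" = no ]; then "$@"; else echo "+ $*"; fi
}

# Spin down all drives in tray 2 of controller c3.
run ssaadm stop -t 2 c3
```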

Pull the tray and replace the bad disk.

Instructions for the hardware procedure are found in the SPARCstorage Array Model 100 Series Service Manual and the SPARCcluster High Availability Server Service Manual.

Make sure all disks in the tray of the SPARCstorage Array spin up.

The disks in the SPARCstorage Array tray should automatically spin up following the hardware replacement procedure. If the tray fails to spin up automatically within two minutes, force the action by using the following command.
# ssaadm start -t 2 c3

Use format(1M), fmthard(1M), or Storage Manager to repartition the new disk. Make sure you partition the new disk exactly as the replaced disk was partitioned.

Saving the disk format information before problems occur is always a good idea.
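
One common way to save and restore the partition information is to pair prtvtoc(1M) with fmthard(1M). The sketch below is illustrative only: save_vtoc and restore_vtoc are hypothetical helper names, the file location is up to you, and slice 2 is assumed to represent the whole disk, as is conventional.

```shell
# save_vtoc DISK FILE
# Save the label of DISK (for example c3t3d4) to FILE while the disk is healthy.
save_vtoc() {
    prtvtoc "/dev/rdsk/${1}s2" > "$2"
}

# restore_vtoc DISK FILE
# Write the label saved in FILE to the replacement DISK.
restore_vtoc() {
    fmthard -s "$2" "/dev/rdsk/${1}s2"
}

# Example (the file name is an assumption):
#   save_vtoc c3t3d4 /var/tmp/c3t3d4.vtoc       # before any failure
#   restore_vtoc c3t3d4 /var/tmp/c3t3d4.vtoc    # after replacing the disk
```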

Bring all submirrors that were taken offline back online.

Refer to "Working With Mirrors".

When the submirrors are brought back online, DiskSuite automatically resyncs all the submirrors, bringing the data up-to-date.
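
To see whether any object is still resyncing or needs maintenance, the State lines in metastat output can be filtered. The function below is a minimal sketch: not_okay is a hypothetical name, and it assumes that each object starts with a line such as "d50: Submirror of d40" followed by a "State:" line, as in the metastat excerpt earlier in this procedure.

```shell
# not_okay
# Read metastat output on standard input and print every metadevice whose
# State line is anything other than "Okay" (hypothetical helper).
not_okay() {
    awk '
        # Remember the most recent metadevice name, for example d50.
        /^[ \t]*d[0-9]+:/ { name = $1; sub(/:.*/, "", name) }
        # Report any State line that does not say Okay.
        /State:/ && $0 !~ /State:[ \t]*Okay/ {
            state = $0
            sub(/^.*State:[ \t]*/, "", state)
            print name ": " state
        }
    '
}

# Example call on a live system:
#   metastat | not_okay
```

An empty result suggests that every reported State line reads Okay; it does not replace checking the data itself.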

Attach submirrors that were detached.

Refer to "Working With Mirrors".

Replace any hot spares in use in the submirrors attached in the previous step.

If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to the "Available" status.

Add all hot spares that were deleted back to their hot spare pools, using the information you recorded earlier.

Add all state database replicas that were deleted from disks on the tray.

Use the information saved previously to add back the same number of state database replicas on each slice.
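
The restore steps above (bringing submirrors online, attaching submirrors, and adding hot spares and replicas back) correspond roughly to the following DiskSuite commands. As before, this is a hedged sketch and not the DiskSuite Tool procedure itself. Only d40 and d50 come from the example; hsp001, d41, d51, the slice names, and the replica count of 2 are hypothetical. Commands are printed, not run, unless DRYRUN=no is set.

```shell
# Print each command instead of running it unless DRYRUN=no is set.
run() {
    if [ "${DRYRUN:-yes}" = no ]; then "$@"; else echo "+ $*"; fi
}

restore_tray() {
    # Bring a submirror that was taken offline back online (names are hypothetical).
    run metaonline d41 d51
    # Reattach the submirror that was detached (d50 of mirror d40).
    run metattach d40 d50
    # Add a deleted hot spare back to its pool (pool and slice are hypothetical).
    run metahs -a hsp001 c3t3d1s0
    # Add back the recorded number of replicas on a slice (count and slice are hypothetical).
    run metadb -a -c 2 c3t3d4s3
}

restore_tray
```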

[Optional] If using Solstice HA servers, switch each logical host back to its default master.

Refer to the Solstice HA documentation.

Validate the data.

Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.