Solstice DiskSuite 4.2.1 User's Guide

Working With SPARCstorage Arrays

This section describes how to troubleshoot SPARCstorage Arrays using DiskSuite. The tasks in this section include:

• How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)

• How to Replace a Failed SPARCstorage Array Disk in a RAID5 Metadevice (DiskSuite Tool)

• How to Remove a SPARCstorage Array Tray (Command Line)

• How to Replace a SPARCstorage Array Tray

• How to Recover From SPARCstorage Array Power Loss (Command Line)

• How to Move SPARCstorage Array Disks Between Hosts (Command Line)

Installation

The SPARCstorage Array should be installed according to the SPARCstorage Array Software instructions found with the SPARCstorage Array CD. The SPARCstorage Array Volume Manager need not be installed if you are only using DiskSuite.

Device Naming

DiskSuite accesses SPARCstorage Array disks exactly like any other disks, with one important exception: the disk names differ from non-SPARCstorage Array disks.

The SPARCstorage Array 100 disk naming convention is:

c[0-n]t[0-5]d[0-4]s[0-7]

In this name:

• c[0-n] is the controller number.

• t[0-5] is the target; each SSA100 tray holds a pair of targets.

• d[0-4] is the drive within the target.

• s[0-7] is the slice on the drive.

The SPARCstorage Array 200 disk naming convention is:

c[0-n]t[0-5]d[0-6]s[0-7]

In this name:

• c[0-n] is the controller number.

• t[0-5] is the target; each SSA200 target corresponds to one tray.

• d[0-6] is the drive within the tray.

• s[0-7] is the slice on the drive.


Note -

Older trays hold up to six disks; newer trays can hold up to seven.


The main difference between the SSA100 and SSA200 is that the SSA100 arranges pairs of targets into a tray, whereas the SSA200 has a separate tray for each target.
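Either naming convention can be taken apart mechanically. The following Bourne shell sketch (the helper function name is illustrative, not a DiskSuite command) splits a device name such as c3t3d4s0 into its component numbers:

```shell
# parse_ssa_name is a hypothetical helper, not a DiskSuite command.
# It splits a name of the form c<ctlr>t<target>d<disk>s<slice>
# into its four component numbers.
parse_ssa_name() {
  name=$1
  ctlr=${name#c};   ctlr=${ctlr%%t*}    # controller number
  tgt=${name#*t};   tgt=${tgt%%d*}      # target (0-5)
  disk=${name#*d};  disk=${disk%%s*}    # disk within the target
  slice=${name##*s}                     # slice (0-7)
  echo "controller=$ctlr target=$tgt disk=$disk slice=$slice"
}

parse_ssa_name c3t3d4s0    # prints: controller=3 target=3 disk=4 slice=0
```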

Preliminary Information for Replacing SPARCstorage Array Components

The SPARCstorage Array components that can be replaced include the disks, fan tray, battery, tray, power supply, backplane, controller, optical module, and fibre channel cable.

Some of the SPARCstorage Array components can be replaced without powering down the SPARCstorage Array. Other components require the SPARCstorage Array to be powered off. Consult the SPARCstorage Array documentation for details.

To replace SPARCstorage Array components that require powering off without interrupting services, perform the steps necessary for tray removal for all trays in the SPARCstorage Array before turning off the power. This includes taking submirrors offline, deleting hot spares from hot spare pools, deleting state database replicas from the drives, and spinning down the trays.

After these preparations, the SPARCstorage Array can be powered down and the components replaced.


Note -

Because the SPARCstorage Array controller contains a unique World Wide Name, which identifies it to Solaris, special procedures apply for SPARCstorage Array controller replacement. Contact your service provider for assistance.


How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)

The steps to replace a SPARCstorage Array disk in a DiskSuite environment depend a great deal on how the slices on the disk are being used and how the disks are cabled to the system. They also depend on whether the disk slices are used directly, by DiskSuite, or both.


Note -

This procedure applies to a SPARCstorage Array 100. The steps to replace a disk in a SPARCstorage Array 200 are similar.


The high-level steps in this task are:

• Identifying the failed disk and the tray that contains it

• Identifying the other DiskSuite objects (state database replicas, metadevices, and hot spares) on the affected tray

• Preparing those objects, spinning down the tray, and replacing the disk

• Repartitioning the new disk, then restoring the submirrors, hot spares, and state database replicas, and validating the data


Note -

You can use this procedure if a submirror is in the "Needs Maintenance" state, has been replaced by a hot spare, or is generating intermittent errors.


To locate and replace the disk, perform the following steps:

  1. Identify the disk to be replaced, either by using DiskSuite Tool to look at the Status fields of objects, or by examining metastat and /var/adm/messages output.


    # metastat
    ...
     d50: Submirror of d40
          State: Needs Maintenance
    ...
    # tail -f /var/adm/messages
    ...
    Jun 1 16:15:26 host1 unix: WARNING: /io-
    unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):  
    Jun 1 16:15:26 host1 unix: Error for command `write(I))' Err
    Jun 1 16:15:27 host1 unix: or Level: Fatal
    Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
    Jun 1 16:15:27 host1 unix: Sense Key: Media Error
    Jun 1 16:15:27 host1 unix: Vendor `CONNER':
    Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
    ...

    The metastat command shows that a submirror is in the "Needs Maintenance" state. The /var/adm/messages file reports a disk drive that has an error. To locate the disk drive, use the ls command as follows, matching the symbolic link name to that from the /var/adm/messages output.


    # ls -l /dev/rdsk/*
    ...
    lrwxrwxrwx   1 root     root          90 Mar  4 13:26 /dev/rdsk/c3t3d4s0 -
    > ../../devices/io-
    unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49)
    ...

    Based on the /var/adm/messages and metastat output, drive c3t3d4 must be replaced.
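    The matching step can also be scripted. The following sketch (the variable and the sample line are illustrative, abbreviated from the output above; only sed is assumed) pulls the ssd instance name out of a /var/adm/messages line so it can be grepped against the ls -l /dev/rdsk/* listing:

```shell
# Extract the ssd instance (e.g. ssd49) from a messages line.
# The sample line below is abbreviated from the output above.
line='unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):'
inst=$(echo "$line" | sed -n 's/.*(\(ssd[0-9]*\)).*/\1/p')
echo "$inst"    # prints: ssd49
```

    The resulting instance name can then be matched against the device links, for example with ls -l /dev/rdsk/* | grep "(ssd49)".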

  2. Determine the affected tray by using DiskSuite Tool.

    To find the SPARCstorage Array tray where the problem disk resides, use the Disk View window.

    1. Click Disk View to display the Disk View window.

    2. Drag the problem metadevice (in this example, a mirror) from the Objects list to the Disk View window.

      The Disk View window shows the logical to physical device mappings by coloring the physical slices that make up the metadevice. You can see at a glance which tray contains the problem disk.

    3. Alternatively, you can find the SPARCstorage Array tray where the problem disk resides by using the ssaadm(1M) command.


      host1# ssaadm display c3
               SPARCstorage Array Configuration
      Controller path: /devices/io-
      unit@f,e1200000/sbi@0.0/SUNW,soc@0,0/SUNW,pln@a0000000,741022:ctlr
               DEVICE STATUS
               TRAY1          TRAY2          TRAY3
      Slot
      1        Drive:0,0      Drive:2,0      Drive:4,0
      2        Drive:0,1      Drive:2,1      Drive:4,1
      3        Drive:0,2      Drive:2,2      Drive:4,2
      4        Drive:0,3      Drive:2,3      Drive:4,3
      5        Drive:0,4      Drive:2,4      Drive:4,4
      6        Drive:1,0      Drive:3,0      Drive:5,0
      7        Drive:1,1      Drive:3,1      Drive:5,1
      8        Drive:1,2      Drive:3,2      Drive:5,2
      9        Drive:1,3      Drive:3,3      Drive:5,3
      10       Drive:1,4      Drive:3,4      Drive:5,4
       
               CONTROLLER STATUS
      Vendor:    SUNW
      Product ID:  SSA100
      Product Rev: 1.0
      Firmware Rev: 2.3
      Serial Num: 000000741022
      Accumulate Performance Statistics: Enabled

      The ssaadm output for controller (c3) shows that Drive 3,4 (c3t3d4) is the closest to you when you pull out the middle tray.
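      The tray arithmetic for an SSA100 can also be computed directly. In this sketch (the helper name is illustrative), each tray holds a pair of targets, so the tray number is the target number divided by two, plus one:

```shell
# Hypothetical helper for an SSA100: targets 0-1 sit in tray 1,
# targets 2-3 in tray 2, and targets 4-5 in tray 3.
tray_for_target() {
  echo $(( $1 / 2 + 1 ))
}

tray_for_target 3    # prints: 2 (drive c3t3d4 is in the middle tray)
```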

  3. [Optional] If you have a diskset, locate the diskset that contains the affected drive.

    The following commands locate drive c3t3d4. Note that no output was displayed when the command was run with logicalhost2, but logicalhost1 reported that the name was present. In the reported output, the yes field indicates that the disk contains a state database replica.


    host1# metaset -s logicalhost2 | grep c3t3d4
    host1# metaset -s logicalhost1 | grep c3t3d4
    c3t3d4 yes

    Note -

    If you are using Solstice HA servers, you'll need to switch ownership of both logical hosts to one Solstice HA server. Refer to the Solstice HA documentation.


  4. Determine other DiskSuite objects on the affected tray.

    Because you must pull the tray to replace the disk, determine what other objects will be affected in the process.

    1. In DiskSuite Tool, display the Disk View window. Select the tray. From the Object menu, choose Device Mappings. The Physical to Logical Device Mapping window appears.

    2. Note all affected objects, including state database replicas, metadevices, and hot spares that appear in the window.

  5. Prepare for disk replacement by preparing other DiskSuite objects in the affected tray.

    1. Delete all hot spares that have a status of "Available" and that are in the same tray as the problem disk.

      Record all the information about the hot spares so they can be added back to the hot spare pools following the replacement procedure.

    2. Delete any state database replicas that are on disks in the tray that must be pulled. You must keep track of this information because you must replace these replicas in Step 14.

      There may be multiple replicas on the same disk. Make sure you record the number of replicas deleted from each slice.

    3. Locate the submirrors that are using slices that reside in the tray.

    4. Detach all submirrors with slices on the disk that is being replaced.

    5. Take all other submirrors that have slices in the tray offline.

      This forces DiskSuite to stop using the submirror slices in the tray so that the drives can be spun down.

      To remove objects, refer to Chapter 5, Removing DiskSuite Objects. To detach and offline submirrors, refer to "Working With Mirrors".

  6. Spin down all disks in the SPARCstorage Array tray.

    Refer to "How to Stop a Disk (DiskSuite Tool)".


    Note -

    The SPARCstorage Array tray should not be removed as long as the LED on the tray is illuminated. Also, you should not run any DiskSuite commands while the tray is spun down as this may have the side effect of spinning up some or all of the drives in the tray.


  7. Pull the tray and replace the bad disk.

    Instructions for the hardware procedure are found in the SPARCstorage Array Model 100 Series Service Manual and the SPARCcluster High Availability Server Service Manual.

  8. Make sure all disks in the tray of the SPARCstorage Array spin up.

    The disks in the SPARCstorage Array tray should automatically spin up following the hardware replacement procedure. If the disks fail to spin up within two minutes, force the action by using the following command.


    # ssaadm start -t 2 c3
    
  9. Use the format(1M) or fmthard(1M) command, or Storage Manager, to repartition the new disk. Make sure you partition the new disk exactly like the disk that was replaced.

    Saving the disk format information before problems occur is always a good idea.
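    One common way to duplicate the partitioning, sketched here under the assumption that another disk in the array (c3t2d0 in this example) has the same geometry as the replaced disk, is to copy its label with prtvtoc(1M) and fmthard(1M):

```shell
# Sketch only: c3t2d0 is an assumed surviving disk with the same
# geometry; c3t3d4 is the replacement disk from this example.
# prtvtoc /dev/rdsk/c3t2d0s2 | fmthard -s - /dev/rdsk/c3t3d4s2
```

    If the VTOC was saved to a file before the failure, that file can be fed to fmthard -s instead.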

  10. Bring all submirrors that were taken offline back online.

    Refer to "Working With Mirrors".

    When the submirrors are brought back online, DiskSuite automatically resyncs all the submirrors, bringing the data up-to-date.

  11. Attach submirrors that were detached.

    Refer to "Working With Mirrors".

  12. Replace any hot spares in use in the submirrors attached in Step 11.

    If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to the "Available" status.

  13. Add all hot spares that were deleted.

  14. Add all state database replicas that were deleted from disks on the tray.

    Use the information saved previously to replace the state database replicas.

  15. [Optional] If using Solstice HA servers, switch each logical host back to its default master.

    Refer to the Solstice HA documentation.

  16. Validate the data.

    Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.

How to Replace a Failed SPARCstorage Array Disk in a RAID5 Metadevice (DiskSuite Tool)

When setting up RAID5 metadevices for online repair, you will have to use a minimum RAID5 width of three slices. While this is not an optimal configuration for RAID5, it is still slightly less expensive than mirroring, in terms of the overhead of the redundant data. You should place each of the three slices of each RAID5 metadevice within a separate tray. If all disks in a SPARCstorage Array are configured this way (or in combination with mirrors as described above), the tray containing the failed disk may be removed without losing access to any of the data.


Caution -

Any applications using non-replicated disks in the tray containing the failed drive should first be suspended or terminated.


  1. Refer to Step 1 through Step 9 in the previous procedure, "How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)".

    You are going to locate the problem disk and tray, locate other affected DiskSuite objects, prepare the disk to be replaced, replace, then repartition the drive.

  2. Use the metareplace -e command to enable the new drive in the tray.

  3. Refer to Step 12 through Step 16 in the previous procedure, "How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)".

How to Remove a SPARCstorage Array Tray (Command Line)

Before removing a SPARCstorage Array tray, halt all I/O and spin down all drives in the tray. The drives automatically spin up if I/O requests are made. Thus, it is necessary to stop all I/O before the drives are spun down.

  1. Stop DiskSuite I/O activity.

    Refer to the metaoffline(1M) command, which takes the submirror offline. When the submirrors on a tray are taken offline, the corresponding mirrors will only provide one-way mirroring (that is, there will be no data redundancy), unless the mirror uses three-way mirroring. When the submirror is brought back online, an automatic resync occurs.
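    Using the d40/d50 mirror and submirror names from the previous procedure, the two operations look like this (a sketch; an offlined submirror gets an optimized resync when brought back online, while a detached one requires a full resync after metattach):

```shell
# metaoffline d40 d50    take submirror d50 of mirror d40 offline
# metaonline  d40 d50    bring it back online; optimized resync follows
# metadetach  d40 d50    detach d50 entirely (for drive replacement)
# metattach   d40 d50    reattach d50; full resync follows
```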


    Note -

    If you are replacing a drive that contains a submirror, use the metadetach(1M) command to detach the submirror.


  2. Use the metastat(1M) command to identify all submirrors containing slices on the tray to be removed, and use the metadb(1M) command to identify any replicas on the tray. Also identify any available hot spare devices and their associated submirrors by using the metahs(1M) command.

    With all affected submirrors offline, I/O to the tray will be stopped.

  3. Refer to "How to Stop a Disk (DiskSuite Tool)".

    Using either DiskSuite Tool or the ssaadm command, spin down the tray. When the tray lock light is out, the tray may be removed and the required task performed.

How to Replace a SPARCstorage Array Tray

When you have completed work on a SPARCstorage Array tray, replace the tray in the chassis. The disks will automatically spin up.

However, if the disks fail to spin up, you can use DiskSuite Tool (or the ssaadm command) to manually spin up the entire tray. There is a short delay (several seconds) between starting the drives in the SPARCstorage Array.

After the disks have spun up, you must place online all the submirrors that were taken offline. When you bring a submirror online, an optimized resync operation automatically brings the submirrors up-to-date. The optimized resync copies only the regions of the disk that were modified while the submirror was offline. This is typically a very small fraction of the submirror capacity. You must also replace all state database replicas and add back hot spares.


Note -

If you used metadetach(1M) to detach the submirror rather than metaoffline, the entire submirror must be resynced. This typically takes about 10 minutes per Gbyte of data.


How to Recover From SPARCstorage Array Power Loss (Command Line)

When power is lost to one SPARCstorage Array, the following occurs:

• Slices on the affected disks enter an errored state, and the submirrors that use those slices go into the "Needs maintenance" state.

• State database replicas on the affected chassis enter an errored state.

You must monitor the configuration for these events using the metastat(1M) command as explained in "Checking Status of DiskSuite Objects".

You may need to perform the following after power is restored:

  1. After power is restored, use the metastat command to identify the errored devices.


    # metastat
    ...
    d10: Trans
        State: Okay
        Size: 11423440 blocks
        Master Device: d20
        Logging Device: d15
     
    d20: Mirror
        Submirror 0: d30
          State: Needs maintenance
        Submirror 1: d40
          State: Okay
    ...
    d30: Submirror of d20
        State: Needs maintenance
    ...
  2. Return errored devices to service using the metareplace command:


    # metareplace -e metadevice slice
    

    The -e option transitions the state of the slice to the "Available" state and resyncs the failed slice.


    Note -

    Slices that have been replaced by a hot spare should be the last devices replaced using the metareplace command. If the hot spare is replaced first, it could replace another errored slice in a submirror as soon as it becomes available.


    A resync can be performed on only one slice of a submirror (metadevice) at a time. If all slices of a submirror were affected by the power outage, each slice must be replaced separately. It takes approximately 10 minutes for a resync to be performed on a 1.05-Gbyte disk.

    Depending on the number of submirrors and the number of slices in these submirrors, the resync actions can require a considerable amount of time. A single submirror that is made up of 30 1.05-Gbyte drives might take about five hours to complete. A more realistic configuration made up of five-slice submirrors might take only 50 minutes to complete.

  3. After the loss of power, all state database replicas on the affected SPARCstorage Array chassis will enter an errored state. While these will be reclaimed at the next reboot, you may want to manually return them to service by first deleting and then adding them back.


    # metadb -d slice
    # metadb -a slice
    

    Note -

    Make sure you add back the same number of state database replicas that were deleted on each slice. A single metadb -d command can delete multiple replicas, but the replicas on a given slice must be restored together: if a slice held multiple replicas, add them all back in one invocation of metadb -a using the -c flag. Refer to the metadb(1M) man page for more information.


    Because state database replica recovery is not automatic, it is safest to manually perform the recovery immediately after the SPARCstorage Array returns to service. Otherwise, a new failure may cause a majority of state database replicas to be out of service and cause a kernel panic. This is the expected behavior of DiskSuite when too few state database replicas are available.
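    As a sketch, assuming slice c3t3d4s3 (an assumed example) held three replicas before the power loss, the delete and add-back look like this:

```shell
# metadb -d c3t3d4s3         deletes all replicas on the slice
# metadb -a -c 3 c3t3d4s3    adds three replicas back in one invocation
```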

How to Move SPARCstorage Array Disks Between Hosts (Command Line)

This procedure explains how to move disks containing DiskSuite objects from one SPARCstorage Array to another.

  1. Repair any devices that are in an errored state or that have been replaced by hot spares on the disks that are to be moved.

  2. Identify the state database replicas, metadevices, and hot spares on the disks that are to be moved, by using the output from the metadb and metastat -p commands.

  3. Physically move the disks to the new host, being careful to connect them in a similar fashion so that the device names are the same.

  4. Recreate the state database replicas.


    # metadb -a [-f] slice ...
    

    Be sure to use the same slice names that contained the state database replicas as identified in Step 2. You might need to use the -f option to force the creation of the state database replicas.

  5. Copy the output from the metastat -p command in Step 2 to the md.tab file.

  6. Edit the md.tab file, making the following changes:

    • Delete metadevices that you did not move.

    • Change the old metadevice names to new names.

    • Make any mirrors into one-way mirrors for the time being, selecting the smallest submirror (if appropriate).
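    A hypothetical md.tab fragment illustrating the last point (all metadevice and slice names here are assumed examples): d40 was a two-way mirror of submirrors d50 and d60; it is recreated as a one-way mirror of d50, and d60 is recreated as well so it can be reattached later with metattach.

```shell
# Hypothetical md.tab entries after editing (examples only)
d50 1 1 c3t3d4s0
d60 1 1 c3t5d4s0
d40 -m d50
```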

  7. Check the syntax of the md.tab file.


    # metainit -a -n
    
  8. Recreate the moved metadevices and hot spare pools.


    # metainit -a
    
  9. Make the one-way mirrors into multi-way mirrors using the metattach(1M) command as necessary.

  10. Edit the /etc/vfstab file for file systems that are to be automatically mounted at boot. Then remount file systems on the new metadevices as necessary.