Solstice DiskSuite 4.2.1 User's Guide

Chapter 7 Troubleshooting the System

This chapter describes how to troubleshoot DiskSuite.

Use the sections that follow to proceed directly to the step-by-step instructions for a particular task.

Overview of Troubleshooting the System

This chapter describes some common DiskSuite problems and their solutions. It is not intended to be all-inclusive, but rather to present common scenarios and recovery procedures.

Prerequisites for Troubleshooting the System

Here are the prerequisites for the steps in this section:

General Guidelines for Troubleshooting DiskSuite

Have the following information on hand when troubleshooting a DiskSuite problem:

Recovering the DiskSuite Configuration

The /etc/lvm/md.cf file is a backup file of the DiskSuite configuration for a "local" diskset. Whenever you make a configuration change, the md.cf file is automatically updated (except for hot sparing). Do not edit the md.cf file directly.

If your system loses the information maintained in the metadevice state database, and as long as no metadevices were created or changed in the meantime, you can use the md.cf file to recover your DiskSuite configuration.


Note -

The md.cf file does not maintain information on active hot spares. Thus, if hot spares were in use when the DiskSuite configuration was lost, those metadevices that were hot-spared will likely be corrupted.


How to Use the md.cf File to Recover a DiskSuite Configuration


Caution -

Only use this procedure if you have experienced a complete loss of your DiskSuite configuration.


  1. Recreate the state database replicas.

    Refer to Chapter 1, Getting Started for information on creating state database replicas.

  2. Make a backup copy of the /etc/lvm/md.tab file.

  3. Copy the information from the md.cf file to the md.tab file.

  4. Edit the "new" md.tab file so that:

    • All mirrors are one-way mirrors. If a mirror's submirrors are not the same size, be sure to use the smallest submirror for this one-way mirror. Otherwise data could be lost.

    • RAID5 metadevices are recreated with the -k option, to prevent reinitialization of the device. (Refer to the metainit(1M) man page for more information on this option.)
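
    For example, the edited md.tab entries for a mirror and a RAID5 metadevice might look like the following. The metadevice and slice names here are hypothetical; use the names from your own md.cf file:


    # submirrors d10 and d20; mirror d0 is recreated as a one-way mirror of d10
    d10 1 1 c0t3d0s0
    d20 1 1 c0t2d0s0
    d0 -m d10
    # RAID5 metadevice d45 is recreated with -k so that its slices are not reinitialized
    d45 -r c1t0d0s2 c2t0d0s2 c3t0d0s2 -k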

  5. Run the metainit(1M) command to check the syntax of the md.tab file entries.


    # metainit -n -a
    
  6. After verifying that the syntax of the md.tab file entries is correct, run the metainit(1M) command to recreate the metadevices and hot spare pools from the md.tab file.


    # metainit -a
    
  7. Run the metattach(1M) command to make the one-way mirrors into multi-way mirrors.
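
    For example, assuming the mirror is d0 and the submirror being attached is d20 (hypothetical names from your own configuration), the command would be:


    # metattach d0 d20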

  8. Validate the data on the metadevices.

Changing DiskSuite Defaults

By default, the DiskSuite configuration supports 128 metadevices and state database replicas that are 1034 blocks in size. The default number of disksets is four. All of these values can be changed if necessary, and the tasks in this section tell you how.

Preliminary Information for Metadevices

How to Increase the Number of Default Metadevices (Command Line)

This task describes how to increase the number of metadevices from the default value of 128.


Caution -

If you lower this number, any metadevice existing between the old number and the new number may not be available, potentially resulting in data loss. If you see a message such as "md: d20: not configurable, check /kernel/drv/md.conf," you will need to edit the md.conf file as explained in this task.


  1. After checking the prerequisites ("Prerequisites for Troubleshooting the System") and the preliminary information ("Preliminary Information for Metadevices"), edit the /kernel/drv/md.conf file.

  2. Change the value of the nmd field. Values are supported up to 1024.

  3. Save your changes.

  4. Perform a reconfiguration reboot to build the metadevice names.


    ok boot -r
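
    If you prefer to initiate the reconfiguration reboot from the running system rather than from the ok prompt, the following standard Solaris command (not specific to DiskSuite) is equivalent:


    # reboot -- -r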
    

Example -- md.conf File

Here is a sample md.conf file configured for 256 metadevices.


#
#ident "@(#)md.conf   1.7     94/04/04 SMI"
#
# Copyright (c) 1992, 1993, 1994 by Sun Microsystems, Inc.
#
name="md" parent="pseudo" nmd=256 md_nsets=4;

Preliminary Information for Disksets

The default number of disksets for a system is 4. If you need to configure more than the default, you can increase this value up to 32. The number of shared disksets is always one less than the md_nsets value, because the local set is included in md_nsets.

How to Increase the Number of Default Disksets (Command Line)

This task shows you how to increase the number of disksets from the default value of 4.


Caution -

If you lower this number, any diskset existing between the old number and the new number may not be persistent.


  1. After checking the prerequisites ("Prerequisites for Troubleshooting the System"), edit the /kernel/drv/md.conf file.

  2. Change the value of the md_nsets field. Values are supported up to 32.

  3. Save your changes.

  4. Perform a reconfiguration reboot to build the metadevice names.


    ok boot -r
    

Example -- md.conf File

Here is a sample md.conf file configured for five disksets. The value of md_nsets is six, which results in five disksets and one local diskset.


#
#ident "@(#)md.conf   1.7     94/04/04 SMI"
#
# Copyright (c) 1992, 1993, 1994 by Sun Microsystems, Inc.
#
name="md" parent="pseudo" nmd=255 md_nsets=6;

Preliminary Information for State Database Replicas

How to Add Larger State Database Replicas (Command Line)

After checking the prerequisites ("Prerequisites for Troubleshooting the System"), and reading the preliminary information ("Preliminary Information for State Database Replicas"), use the metadb command to add larger state database replicas, then to delete the old, smaller state database replicas. Refer to the metadb(1M) man page for more information.

Example -- Adding Larger State Database Replicas


# metadb -a -l 2068 c1t0d0s3 c1t1d0s3 c2t0d0s3 c2t1d0s3
# metadb -d c1t0d0s7 c1t1d0s7 c2t0d0s7 c2t1d0s7

The first metadb command adds state database replicas whose size is specified by the -l 2068 option (2068 blocks). This is double the default replica size of 1034 blocks. The second metadb command removes those smaller state database replicas from the system.

Checking For Errors

When DiskSuite encounters a problem, such as being unable to write to a metadevice due to physical errors at the slice level, it changes the status of the metadevice, for example, to "Maintenance." However, unless you are constantly looking at DiskSuite Tool or running metastat(1M), you might not learn of these status changes in a timely fashion.

There are two ways to check for DiskSuite errors automatically: by using SNMP alerts, or by running a checking script from cron.

The first method is described in "Integrating SNMP Alerts With DiskSuite".

The following section describes the kind of script you can use to check for DiskSuite errors.

How to Automate Checking for Slice Errors in Metadevices (Command Line)

One way to continually and automatically check for a bad slice in a metadevice is to write a script that is invoked by cron. Here is an example:


#!/bin/sh
#
#ident "@(#)metacheck.sh   1.3     96/06/21 SMI"
#
# Copyright (c) 1992, 1993, 1994, 1995, 1996 by Sun Microsystems, Inc.
#
 
#
# DiskSuite Commands
#
MDBIN=/usr/sbin
METADB=${MDBIN}/metadb
METAHS=${MDBIN}/metahs
METASTAT=${MDBIN}/metastat
 
#
# System Commands
#
AWK=/usr/bin/awk
DATE=/usr/bin/date
MAILX=/usr/bin/mailx
RM=/usr/bin/rm
 
#
# Initialization
#
eval=0
date=`${DATE} '+%a %b %e %Y'`
SDSTMP=/tmp/sdscheck.${$}
${RM} -f ${SDSTMP}
 
MAILTO=${*:-"root"}			# default to root, or use arg list
 
#
# Check replicas for problems, capital letters in the flags indicate an error.
#
dbtrouble=`${METADB} | tail +2 | \
    ${AWK} '{ fl = substr($0,1,20); if (fl ~ /[A-Z]/) print $0 }'`
if [ "${dbtrouble}" ]; then
        echo ""   >>${SDSTMP}
        echo "SDS replica problem report for ${date}"	>>${SDSTMP}
        echo ""   >>${SDSTMP}
        echo "Database replicas are not active:"     >>${SDSTMP}
        echo ""   >>${SDSTMP}
        ${METADB} -i >>${SDSTMP}
        eval=1
fi
 
#
# Check the metadevice state, if the state is not Okay, something is up.
#
mdtrouble=`${METASTAT} | \
    ${AWK} '/State:/ { if ( $2 != "Okay" ) print $0 }'`
if [ "${mdtrouble}" ]; then
        echo ""  >>${SDSTMP}
        echo "SDS metadevice problem report for ${date}"  >>${SDSTMP}
        echo ""  >>${SDSTMP}
        echo "Metadevices are not Okay:"  >>${SDSTMP}
        echo ""  >>${SDSTMP}
        ${METASTAT} >>${SDSTMP}
        eval=1
fi
 
#
# Check the hotspares to see if any have been used.
#
hstrouble=`${METAHS} -i | \
    ${AWK} ' /blocks/ { if ( $2 != "Available" ) print $0 }'`
if [ "${hstrouble}" ]; then
        echo ""  >>${SDSTMP}
        echo "SDS Hot spares in use  ${date}"  >>${SDSTMP}
        echo ""  >>${SDSTMP}
        echo "Hot spares in usage:"  >>${SDSTMP}
        echo ""  >>${SDSTMP}
        ${METAHS} -i >>${SDSTMP}
        eval=1
fi
#
# If any errors occurred, then mail the report to root, or whoever was called
# out in the command line.
#
if [ ${eval} -ne 0 ]; then
        ${MAILX} -s "SDS problems ${date}" ${MAILTO} <${SDSTMP}
        ${RM} -f ${SDSTMP}
fi
 
exit ${eval}

For information on invoking scripts in this way, refer to the cron(1M) man page.
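
For example, a root crontab entry that runs the script every day at 6:00 a.m. and mails any report to root might look like the following. The path /usr/local/bin/metacheck.sh is an assumption; use the location where you installed the script:


0 6 * * * /usr/local/bin/metacheck.sh root

You would add the entry by running crontab -e as root.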


Note -

This script serves as a starting point for automating DiskSuite error checking. You may need to modify it for your own configuration.


Boot Problems

Because DiskSuite enables you to mirror root (/), swap, and /usr, special problems can arise when you boot the system, due either to hardware failures or to operator error. The tasks in this section provide solutions to such potential problems.

Table 7-1 describes these problems and points you to the appropriate solution.

Table 7-1 Common DiskSuite Boot Problems

System Does Not Boot Because ...                         Refer To ...

The /etc/vfstab file contains incorrect information.     "How to Recover From Improper /etc/vfstab Entries (Command Line)"

There are not enough state database replicas.            "How to Recover From Insufficient State Database Replicas (Command Line)"

A boot device (disk) has failed.                         "How to Recover From a Boot Device Failure (Command Line)"

The boot mirror has failed.                              "SPARC: How to Boot From the Alternate Device (Command Line)" or
                                                         "x86: How to Boot From the Alternate Device (Command Line)"

Preliminary Information for Boot Problems

How to Recover From Improper /etc/vfstab Entries (Command Line)

If you have made an incorrect entry in the /etc/vfstab file, for example, when mirroring root (/), the system will at first appear to boot properly and then fail. To remedy this situation, you need to edit /etc/vfstab while in single-user mode.

The high-level steps to recover from improper /etc/vfstab file entries are:

Example -- Recovering the root (/) Mirror

In the following example, root (/) is mirrored with a two-way mirror, d0. The root (/) entry in /etc/vfstab has somehow reverted back to the original slice of the file system, but the information in /etc/system still shows booting to be from the mirror d0. The most likely reason is that the metaroot(1M) command was not used to maintain /etc/system and /etc/vfstab, or an old copy of /etc/vfstab was copied back.

The incorrect /etc/vfstab file would look something like the following:


#device        device          mount          FS      fsck   mount    mount
#to mount      to fsck         point          type    pass   at boot  options
#
/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0  /       ufs      1     no       -
/dev/dsk/c0t3d0s1 -                   -       swap     -     no       -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr    ufs      2     no       -
#
/proc             -                  /proc    proc     -     no       -
fd                -                  /dev/fd  fd       -     no       -
swap              -                  /tmp     tmpfs    -     yes      -

Because of the errors, you automatically go into single-user mode when the machine is booted:


ok boot
...
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1995, Sun Microsystems, Inc.
configuring network interfaces: le0.
Hostname: antero
mount: /dev/dsk/c0t3d0s0 is not this fstype.
setmnt: Cannot open /etc/mnttab for writing

INIT: Cannot create /var/adm/utmp or /var/adm/utmpx

INIT: failed write of utmpx entry:"  "

INIT: failed write of utmpx entry:"  "

INIT: SINGLE USER MODE

Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>

At this point, root (/) and /usr are mounted read-only. Follow these steps:

  1. Run fsck(1M) on the root (/) mirror.


    Note -

    Be careful to use the correct metadevice for root.



    # fsck /dev/md/rdsk/d0
    ** /dev/md/rdsk/d0
    ** Currently Mounted on /
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    2274 files, 11815 used, 10302 free (158 frags, 1268 blocks,
    0.7% fragmentation)
  2. Remount root (/) read/write so you can edit the /etc/vfstab file.


    # mount -o rw,remount /dev/md/dsk/d0 /
    mount: warning: cannot lock temp file </etc/.mnt.lock>
  3. Run the metaroot(1M) command.


    # metaroot d0
    

    This edits the /etc/system and /etc/vfstab files to specify that the root (/) file system is now on metadevice d0.

  4. Verify that the /etc/vfstab file contains the correct metadevice entries.

    The root (/) entry in the /etc/vfstab file should appear as follows so that the entry for the file system correctly references the mirror:


    #device           device              mount    FS      fsck   mount   mount
    #to mount         to fsck             point    type    pass   at boot options
    #
    /dev/md/dsk/d0    /dev/md/rdsk/d0     /        ufs     1      no      -
    /dev/dsk/c0t3d0s1 -                   -        swap    -      no      -
    /dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr     ufs     2      no      -
    #
    /proc             -                  /proc     proc    -      no      -
    fd                -                  /dev/fd   fd      -      no      -
    swap              -                  /tmp      tmpfs   -      yes     -
  5. Reboot.

    The system returns to normal operation.

How to Recover From Insufficient State Database Replicas (Command Line)

If for some reason the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted. In DiskSuite terms, the state database has gone "stale." This task explains how to recover.

The high-level steps in this task are:

Example -- Recovering From Stale State Database Replicas

In the following example, a disk containing two replicas has gone bad. This leaves the system with only two good replicas, and the system cannot reboot into multiuser mode.

  1. Boot the machine to determine which state database replicas are down.


    ok boot
    ...
    Hostname: demo
    metainit: demo: stale databases
     
    Insufficient metadevice database replicas located.
     
    Use metadb to delete databases which are broken.
    Ignore any "Read-only file system" error messages.
    Reboot the system when finished to reload the metadevice
    database.
    After reboot, repair any broken database replicas which were
    deleted.
     
    Type Ctrl-d to proceed with normal startup,
    (or give root password for system maintenance): <root-password>
    Entering System Maintenance Mode
     
    SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
  2. Use the metadb(1M) command to look at the metadevice state database and see which state database replicas are not available.


    # metadb -i
       flags      first blk      block count
        a m  p  lu    16                1034                  /dev/dsk/c0t3d0s3
        a   p  l      1050              1034                  /dev/dsk/c0t3d0s3
        M  p        unknown      unknown                      /dev/dsk/c1t2d0s3
        M  p        unknown      unknown                      /dev/dsk/c1t2d0s3
    ...

    The system can no longer detect state database replicas on slice /dev/dsk/c1t2d0s3, which is part of the failed disk. The metadb command flags the replicas on this slice as having a problem with the master blocks.

  3. Delete the state database replicas on the bad disk using the -d option to the metadb(1M) command.

    At this point, the root (/) file system is read-only. You can ignore the mddb.cf error messages:


    # metadb -d -f c1t2d0s3
    metadb: demo: /etc/lvm/mddb.cf.new: Read-only file
    system
  4. Verify that the replicas were deleted.


    # metadb -i
        flags        first blk       block count
         a m  p  lu         16              1034            /dev/dsk/c0t3d0s3
         a    p  l          1050            1034            /dev/dsk/c0t3d0s3
  5. Reboot.

  6. Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format(1M) command or the fmthard(1M) command to partition the disk as it was before the failure.


    # halt
    ...
    ok boot
    ...
    # format /dev/rdsk/c1t2d0s0
    ...
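
    If another disk in the system is still partitioned identically to the failed disk, one way to copy its partition table to the replacement disk is to pipe prtvtoc(1M) output into fmthard(1M). The device names below are hypothetical:


    # prtvtoc /dev/rdsk/c1t3d0s2 | fmthard -s - /dev/rdsk/c1t2d0s2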
  7. Use the metadb(1M) command to add back the state database replicas and to determine that the state database replicas are correct.


    # metadb -a -c 2 c1t2d0s3
    # metadb
       flags        first blk  block count
      a m  p  luo      16           1034         /dev/dsk/c0t3d0s3
      a    p  luo      1050         1034         /dev/dsk/c0t3d0s3
      a       u        16           1034         /dev/dsk/c1t2d0s3
      a       u        1050         1034         /dev/dsk/c1t2d0s3

    The metadb command with the -c 2 option adds two state database replicas to the same slice.

How to Recover From a Boot Device Failure (Command Line)

If you have a root (/) mirror and your boot device fails, you'll need to set up an alternate boot device.

The high-level steps in this task are:

Example -- Recovering From a Boot Device Failure

In the following example, the boot device containing two of the six state database replicas and the root (/), swap, and /usr submirrors fails.

Initially, when the boot device fails, you'll see a message similar to the following. This message may differ among various architectures.


Rebooting with command:
Boot device: /iommu/sbus/dma@f,81000/esp@f,80000/sd@3,0   File and args: kadb
kadb: kernel/unix
The selected SCSI device is not responding
Can't open boot device
...

When you see this message, note the device. Then, follow these steps:

  1. Boot from another root (/) submirror.

    Since only two of the six state database replicas in this example are in error, you can still boot. If this were not the case, you would need to delete the stale state database replicas in single-user mode. This procedure is described in "How to Recover From Insufficient State Database Replicas (Command Line)".

    When you created the mirror for the root (/) file system, you should have recorded the alternate boot device as part of that procedure. In this example, disk2 is that alternate boot device.


    ok boot disk2
    ...
    SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
    Copyright (c) 1983-1995, Sun Microsystems, Inc.
     
    Hostname: demo
    ...
    demo console login: root
    Password: <root-password>
    Last login: Wed Dec 16 13:15:42 on console
    SunOS Release 5.1 Version Generic [UNIX(R) System V Release 4.0]
    ...
  2. Use the metadb(1M) command to determine that two state database replicas have failed.


    # metadb
           flags         first blk    block count
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo      16           1034         /dev/dsk/c0t1d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t1d0s3

    The system can no longer detect state database replicas on slice /dev/dsk/c0t3d0s3, which is part of the failed disk.

  3. Use the metastat(1M) command to determine that half of the root (/), swap, and /usr mirrors have failed.


    # metastat
    d0: Mirror
        Submirror 0: d10
          State: Needs maintenance
        Submirror 1: d20
          State: Okay
    ...
     
    d10: Submirror of d0
        State: Needs maintenance
        Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
        Size: 47628 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s0          0     No    Maintenance 
     
    d20: Submirror of d0
        State: Okay
        Size: 47628 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s0          0     No    Okay  
     
    d1: Mirror
        Submirror 0: d11
          State: Needs maintenance
        Submirror 1: d21
          State: Okay
    ...
     
    d11: Submirror of d1
        State: Needs maintenance
        Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
        Size: 69660 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s1          0     No    Maintenance 
     
    d21: Submirror of d1
        State: Okay
        Size: 69660 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s1          0     No    Okay        
     
    d2: Mirror
        Submirror 0: d12
          State: Needs maintenance
        Submirror 1: d22
          State: Okay
    ...
     
     
    d12: Submirror of d2
        State: Needs maintenance
        Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"
        Size: 286740 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s6          0     No    Maintenance 
     
     
    d22: Submirror of d2
        State: Okay
        Size: 286740 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s6          0     No    Okay  

    In this example, the metastat output shows that the following submirrors need maintenance:

    • Submirror d10, device c0t3d0s0

    • Submirror d11, device c0t3d0s1

    • Submirror d12, device c0t3d0s6

  4. Halt the system, repair the disk, and use the format(1M) command or the fmthard(1M) command, to partition the disk as it was before the failure.


    # halt
    ...
    Halted
    ...
    ok boot
    ...
    # format /dev/rdsk/c0t3d0s0
    
  5. Reboot.

    Note that you must reboot from the other half of the root (/) mirror. You should have recorded the alternate boot device when you created the mirror.


    # halt
    ...
    ok boot disk2
    
  6. To delete the failed state database replicas and then add them back, use the metadb(1M) command.


    # metadb
           flags         first blk    block count
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo      16           1034         /dev/dsk/c0t1d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t1d0s3
    # metadb -d c0t3d0s3
    # metadb -c 2 -a c0t3d0s3
    # metadb
           flags         first blk    block count
         a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
         a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
         a    p  luo     16           1034         /dev/dsk/c0t1d0s3
         a    p  luo     1050         1034         /dev/dsk/c0t1d0s3
         a        u      16           1034         /dev/dsk/c0t3d0s3
         a        u      1050         1034         /dev/dsk/c0t3d0s3
  7. Use the metareplace(1M) command to re-enable the submirrors.


    # metareplace -e d0 c0t3d0s0
    Device /dev/dsk/c0t3d0s0 is enabled
     
    # metareplace -e d1 c0t3d0s1
    Device /dev/dsk/c0t3d0s1 is enabled
     
    # metareplace -e d2 c0t3d0s6
    Device /dev/dsk/c0t3d0s6 is enabled

    After some time, the resyncs will complete. You can now return to booting from the original device.

How to Record the Path to the Alternate Boot Device (Command Line)

When mirroring root (/), you might need the path to the alternate boot device later if the primary device fails.

Example -- SPARC: Recording the Alternate Boot Device Path

In this example, you would determine the path to the alternate root device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.


# ls -l /dev/rdsk/c1t3d0s0
lrwxrwxrwx 1  root root  55 Mar 5 12:54  /dev/rdsk/c1t3d0s0 -> \ 
../../devices/sbus@1,f8000000/esp@1,200000/sd@3,0:a

Here you would record the string that follows the /devices directory: /sbus@1,f8000000/esp@1,200000/sd@3,0:a.

On some newer Sun hardware, you must change the sd@ portion of the /devices path name to disk@.

On systems with the OpenBoot PROM, you can use the nvalias command to define a "backup_root" device alias for the secondary root mirror. For example:


ok  nvalias backup_root /sbus@1,f8000000/esp@1,200000/sd@3,0:a

Then, in the event of a primary root disk failure, you would only need to enter:


ok  boot backup_root

Example -- x86: Recording the Alternate Boot Device Path

In this example, you would determine the path to the alternate boot device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.


# ls -l /dev/rdsk/c1t0d0s0
lrwxrwxrwx 1  root root  55 Mar 5 12:54  /dev/rdsk/c1t0d0s0 -> \ 
../../devices/eisa/eha@1000,0/cmdk@1,0:a

Here you would record the string that follows the /devices directory: /eisa/eha@1000,0/cmdk@1,0:a

SPARC: How to Boot From the Alternate Device (Command Line)

To boot a SPARC system from the alternate boot device, type:


ok boot alternate-boot-device

The procedure "How to Record the Path to the Alternate Boot Device (Command Line)" describes how to determine the alternate boot device.

x86: How to Boot From the Alternate Device (Command Line)

Use this task to boot an x86 system from the alternate boot device.

  1. Boot your system from the Multiple Device Boot (MDB) diskette.

    After a moment, a screen similar to the following is displayed:


    Solaris/x86 Multiple Device Boot Menu
    Code    Device    Vendor     Model/Desc          Rev
    ============================================================
     
    10      DISK      COMPAQ      C2244              0BC4
    11      DISK      SEAGATE     ST11200N SUN1.05   8808
    12      DISK      MAXTOR      LXT-213S SUN0207   4.24
    13      CD        SONY        CD-ROM CDU-8812    3.0a
    14      NET       SMC/WD      I/O=300 IRQ=5
    80      DISK      First IDE drive (Drive C:)
    81      DISK      Second IDE drive (Drive D:)
     
    Enter the boot device code:
  2. Enter your alternate disk code from the choices listed on the screen. The following is displayed:


    Solaris 2.4 for x86                Secondary Boot Subsystem,vsn 2.11
     
                      <<<Current Boot Parameters>>>
    Boot path:/eisa/eha@1000,0/cmdk@0,0:a
    Boot args:/kernel/unix
     
    Type b[file-name] [boot-flags] <ENTER>     to boot with options
    or   i<ENTER>                              to enter boot interpreter
    or   <ENTER>                               to boot with defaults
     
                        <<<timeout in 5 seconds>>>
  3. Type i to select the interpreter.

  4. Type the following commands:


    >setprop boot-path /eisa/eha@1000,0/cmdk@1,0:a
    >^D
    

    The Control-D character sequence quits the interpreter.

Replacing SCSI Disks

This section describes how to replace SCSI disks that are not part of a SPARCstorage Array in a DiskSuite environment.

How to Replace a Failed SCSI Disk (Command Line)

The high-level steps to replace a SCSI disk that is not part of a SPARCstorage Array are:

  1. Identify the disk to be replaced by examining /var/adm/messages and metastat output.

  2. Locate any local metadevice state database replicas that may have been placed on the problem disk. Use the metadb command to find the replicas.

    Errors may be reported for the replicas located on the failed disk. In this example, c0t1d0 is the problem device.


    # metadb
       flags       first blk        block count
      a m     u        16               1034            /dev/dsk/c0t0d0s4
      a       u        1050             1034            /dev/dsk/c0t0d0s4
      a       u        2084             1034            /dev/dsk/c0t0d0s4
      W   pc luo       16               1034            /dev/dsk/c0t1d0s4
      W   pc luo       1050             1034            /dev/dsk/c0t1d0s4
      W   pc luo       2084             1034            /dev/dsk/c0t1d0s4

    The output above shows three state database replicas on Slice 4 of each of the local disks, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. The three replicas on the c0t0d0s4 slice are still good.


    Caution -

    If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. This will ensure that your system reboots correctly.


  3. Record the slice name where the replicas reside and the number of replicas, then delete the state database replicas.

    The number of replicas is obtained by counting the number of appearances of a slice in metadb output in Step 2. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.


    # metadb -d c0t1d0s4
    
  4. Locate any submirrors using slices on the problem disk and detach them.

    The metastat command can show the affected mirrors. In this example, one submirror, d10, is also using c0t1d0s4. The mirror is d20.


    # metadetach d20 d10
    d20: submirror d10 is detached
  5. Delete hot spares on the problem disk.


    # metahs -d hsp000 c0t1d0s6
    hsp000: Hotspare is deleted
  6. Halt the system and boot to single-user mode.


    # halt
    ...
    ok boot -s
    ...
  7. Physically replace the problem disk.

  8. Repartition the new disk.

    Use the format(1M) command or the fmthard(1M) command to partition the disk with the same slice information as the failed disk.

  9. If you deleted replicas in Step 3, add the same number back to the appropriate slice.

    In this example, /dev/dsk/c0t1d0s4 is used.


    # metadb -a -c 3 c0t1d0s4
    
  10. Depending on how the disk was used, you may have a variety of things to do. Use the following table to decide what to do next.

    Table 7-2 SCSI Disk Replacement Decision Table

    Type of Device 

    Do the Following ... 

    Slice 

    Use normal data recovery procedures. 

    Unmirrored Stripe or Concatenation 

    If the stripe/concat is used for a file system, run newfs(1M), mount the file system, then restore the data from backup. If the stripe/concat is used by an application that uses the raw device, that application must have its own recovery procedures.

    Mirror (Submirror) 

    Run metattach(1M) to reattach a detached submirror.

    RAID5 metadevice 

    Run metareplace(1M) to re-enable the slice. This causes the resyncs to start.

    Trans metadevice 

    Run fsck(1M) to repair the trans metadevice.

  11. Replace hot spares that were deleted, and add them to the appropriate hot spare pool(s).


    # metahs -a hsp000 c0t1d0s6
    hsp000: Hotspare is added
  12. Validate the data.

    Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.

Working With SPARCstorage Arrays

This section describes how to troubleshoot SPARCstorage Arrays using DiskSuite. The tasks in this section include:

Installation

The SPARCstorage Array should be installed according to the SPARCstorage Array Software instructions found with the SPARCstorage Array CD. The SPARCstorage Array Volume Manager need not be installed if you are only using DiskSuite.

Device Naming

DiskSuite accesses SPARCstorage Array disks exactly like any other disks, with one important exception: the disk names differ from non-SPARCstorage Array disks.

The SPARCstorage Array 100 disk naming convention is:

c[0-n]t[0-5]d[0-4]s[0-7]

In this name:

The SPARCstorage Array 200 disk naming convention is:

c[0-n]t[0-5]d[0-6]s[0-7]

In this name:


Note -

Older trays hold up to six disks; newer trays can hold up to seven.


The main difference between the SSA100 and SSA200 is that the SSA100 arranges pairs of targets into a tray, whereas the SSA200 has a separate tray for each target.

Preliminary Information for Replacing SPARCstorage Array Components

The SPARCstorage Array components that can be replaced include the disks, fan tray, battery, tray, power supply, backplane, controller, optical module, and fibre channel cable.

Some of the SPARCstorage Array components can be replaced without powering down the SPARCstorage Array. Other components require the SPARCstorage Array to be powered off. Consult the SPARCstorage Array documentation for details.

To replace SPARCstorage Array components that require power off without interrupting services, you perform the steps necessary for tray removal for all trays in the SPARCstorage Array before turning off the power. This includes taking submirrors offline, deleting hot spares from hot spare pools, deleting state database replicas from drives, and spinning down the trays.

After these preparations, the SPARCstorage Array can be powered down and the components replaced.


Note -

Because the SPARCstorage Array controller contains a unique World Wide Name, which identifies it to Solaris, special procedures apply for SPARCstorage Array controller replacement. Contact your service provider for assistance.


How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)

The steps to replace a SPARCstorage Array disk in a DiskSuite environment depend a great deal on how the slices on the disk are being used, and how the disks are cabled to the system. They also depend on whether the disk slices are being used as is, or by DiskSuite, or both.


Note -

This procedure applies to a SPARCstorage Array 100. The steps to replace a disk in a SPARCstorage Array 200 are similar.


The high-level steps in this task are:


Note -

You can use this procedure if a submirror is in the "Maintenance" state, has been replaced by a hot spare, or is generating intermittent errors.


To locate and replace the disk, perform the following steps:

  1. Identify the disk to be replaced, either by using DiskSuite Tool to look at the Status fields of objects, or by examining metastat and /var/adm/messages output.


    # metastat
    ...
     d50:Submirror of d40
          State: Needs Maintenance
    ...
    # tail -f /var/adm/messages
    ...
    Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
    Jun 1 16:15:26 host1 unix: Error for command `write(I))' Error Level: Fatal
    Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
    Jun 1 16:15:27 host1 unix: Sense Key: Media Error
    Jun 1 16:15:27 host1 unix: Vendor `CONNER':
    Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
    ...

    The metastat command shows that a submirror is in the "Needs Maintenance" state. The /var/adm/messages file reports a disk drive that has an error. To locate the disk drive, use the ls command as follows, matching the symbolic link name to that from the /var/adm/messages output.


    # ls -l /dev/rdsk/*
    ...
    lrwxrwxrwx   1 root     root          90 Mar  4 13:26 /dev/rdsk/c3t3d4s0 -> \ 
    ../../devices/io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49)
    ...

    Based on the above information and metastat output, it is determined that drive c3t3d4 must be replaced.

  2. Determine the affected tray by using DiskSuite Tool.

    To find the SPARCstorage Array tray where the problem disk resides, use the Disk View window.

    1. Click Disk View to display the Disk View window.

    2. Drag the problem metadevice (in this example, a mirror) from the Objects list to the Disk View window.

      The Disk View window shows the logical to physical device mappings by coloring the physical slices that make up the metadevice. You can see at a glance which tray contains the problem disk.

    3. An alternate way to find the SPARCstorage Array tray where the problem disk resides is to use the ssaadm(1M) command.


      host1# ssaadm display c3
               SPARCstorage Array Configuration
      Controller path: /devices/io-unit@f,e1200000/sbi@0.0/SUNW,soc@0,0/SUNW,pln@a0000000,741022:ctlr
               DEVICE STATUS
               TRAY1          TRAY2          TRAY3
      Slot
      1        Drive:0,0      Drive:2,0      Drive:4,0
      2        Drive:0,1      Drive:2,1      Drive:4,1
      3        Drive:0,2      Drive:2,2      Drive:4,2
      4        Drive:0,3      Drive:2,3      Drive:4,3
      5        Drive:0,4      Drive:2,4      Drive:4,4
      6        Drive:1,0      Drive:3,0      Drive:5,0
      7        Drive:1,1      Drive:3,1      Drive:5,1
      8        Drive:1,2      Drive:3,2      Drive:5,2
      9        Drive:1,3      Drive:3,3      Drive:5,3
      10       Drive:1,4      Drive:3,4      Drive:5,4
       
               CONTROLLER STATUS
      Vendor:    SUNW
      Product ID:  SSA100
      Product Rev: 1.0
      Firmware Rev: 2.3
      Serial Num: 000000741022
      Accumulate performance Statistics: Enabled

      The ssaadm output for controller (c3) shows that Drive 3,4 (c3t3d4) is the closest to you when you pull out the middle tray.

  3. [Optional] If you have a diskset, locate the diskset that contains the affected drive.

    The following commands locate drive c3t3d4. Note that no output was displayed when the command was run for logicalhost2, but logicalhost1 reported that the drive was present. In the reported output, the yes field indicates that the disk contains a state database replica.


    host1# metaset -s logicalhost2 | grep c3t3d4
    host1# metaset -s logicalhost1 | grep c3t3d4
    c3t3d4 yes

    Note -

    If you are using Solstice HA servers, you'll need to switch ownership of both logical hosts to one Solstice HA server. Refer to the Solstice HA documentation.


  4. Determine other DiskSuite objects on the affected tray.

    Because you must pull the tray to replace the disk, determine what other objects will be affected in the process.

    1. In DiskSuite Tool, display the Disk View window. Select the tray. From the Object menu, choose Device Mappings. The Physical to Logical Device Mapping window appears.

    2. Note all affected objects, including state database replicas, metadevices, and hot spares that appear in the window.

  5. Prepare for disk replacement by preparing other DiskSuite objects in the affected tray.

    1. Delete all hot spares that have a status of "Available" and that are in the same tray as the problem disk.

      Record all the information about the hot spares so they can be added back to the hot spare pools following the replacement procedure.

    2. Delete any state database replicas that are on disks in the tray that must be pulled. You must keep track of this information because you must replace these replicas in Step 14.

      There may be multiple replicas on the same disk. Make sure you record the number of replicas deleted from each slice.

    3. Locate the submirrors that are using slices that reside in the tray.

    4. Detach all submirrors with slices on the disk that is being replaced.

    5. Take all other submirrors that have slices in the tray offline.

      This forces DiskSuite to stop using the submirror slices in the tray so that the drives can be spun down.

      To remove objects, refer to Chapter 5, Removing DiskSuite Objects. To detach and offline submirrors, refer to "Working With Mirrors".

  6. Spin down all disks in the SPARCstorage Array tray.

    Refer to "How to Stop a Disk (DiskSuite Tool)".


    Note -

    The SPARCstorage Array tray should not be removed as long as the LED on the tray is illuminated. Also, you should not run any DiskSuite commands while the tray is spun down as this may have the side effect of spinning up some or all of the drives in the tray.


  7. Pull the tray and replace the bad disk.

    Instructions for the hardware procedure are found in the SPARCstorage Array Model 100 Series Service Manual and the SPARCcluster High Availability Server Service Manual.

  8. Make sure all disks in the tray of the SPARCstorage Array spin up.

    The disks in the SPARCstorage Array tray should automatically spin up following the hardware replacement procedure. If the tray fails to spin up automatically within two minutes, force the action by using the following command.


    # ssaadm start -t 2 c3
    
  9. Use format(1M), fmthard(1M), or Storage Manager to repartition the new disk. Make sure you partition the new disk exactly like the disk that was replaced.

    Saving the disk format information before problems occur is always a good idea.
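
    For example, a VTOC saved with prtvtoc(1M) before the failure can be written back to the replacement disk with fmthard(1M). The device and file names here are hypothetical:


    # prtvtoc /dev/rdsk/c3t3d4s2 > /var/tmp/c3t3d4.vtoc
    ...
    # fmthard -s /var/tmp/c3t3d4.vtoc /dev/rdsk/c3t3d4s2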

  10. Bring all submirrors that were taken offline back online.

    Refer to "Working With Mirrors".

    When the submirrors are brought back online, DiskSuite automatically resyncs all the submirrors, bringing the data up-to-date.

  11. Attach submirrors that were detached.

    Refer to "Working With Mirrors".

  12. Replace any hot spares in use in the submirrors attached in Step 11.

    If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to the "Available" status.

  13. Add all hot spares that were deleted.

  14. Add all state database replicas that were deleted from disks on the tray.

    Use the information saved previously to replace the state database replicas.

  15. [Optional] If using Solstice HA servers, switch each logical host back to its default master.

    Refer to the Solstice HA documentation.

  16. Validate the data.

    Check the user/application data on all metadevices. You may have to run an application-level consistency checker or use some other method to check the data.

How to Replace a Failed SPARCstorage Array Disk in a RAID5 Metadevice (DiskSuite Tool)

When setting up RAID5 metadevices for online repair, you will have to use a minimum RAID5 width of three slices. While this is not an optimal configuration for RAID5, it is still slightly less expensive than mirroring, in terms of the overhead of the redundant data. You should place each of the three slices of each RAID5 metadevice within a separate tray. If all disks in a SPARCstorage Array are configured this way (or in combination with mirrors as described above), the tray containing the failed disk may be removed without losing access to any of the data.


Caution -

Any applications using non-replicated disks in the tray containing the failed drive should first be suspended or terminated.


  1. Refer to "" through Step 9 in the previous procedure, "How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)".

    You are going to locate the problem disk and tray, locate other affected DiskSuite objects, prepare the disk to be replaced, replace, then repartition the drive.

  2. Use the metareplace -e command to enable the new drive in the tray.
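
    For example, assuming the RAID5 metadevice is d45 and the replaced slice is c3t3d4s0 (hypothetical names), the command would be similar to:


    # metareplace -e d45 c3t3d4s0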

  3. Refer to Step 12 through Step 16 in the previous procedure, "How to Replace a Failed SPARCstorage Array Disk in a Mirror (DiskSuite Tool)".

How to Remove a SPARCstorage Array Tray (Command Line)

Before removing a SPARCstorage Array tray, halt all I/O and spin down all drives in the tray. The drives automatically spin up if I/O requests are made. Thus, it is necessary to stop all I/O before the drives are spun down.

  1. Stop DiskSuite I/O activity.

    Use the metaoffline(1M) command to take the affected submirrors offline. When the submirrors on a tray are taken offline, the corresponding mirrors provide only one-way mirroring (that is, there is no data redundancy), unless the mirror uses three-way mirroring. When the submirrors are brought back online, an automatic resync occurs.
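
    For example, to take submirror d10 of mirror d20 offline (hypothetical metadevice names), you would run:


    # metaoffline d20 d10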


    Note -

    If you are replacing a drive that contains a submirror, use the metadetach(1M) command to detach the submirror.


  2. Use the metastat(1M) command to identify all submirrors containing slices on the tray to be removed. Also, use the metadb(1M) command to identify any replicas on the tray. Use the metahs(1M) command to identify any available hot spare devices and their associated submirrors.

    With all affected submirrors offline, I/O to the tray will be stopped.

  3. Refer to "How to Stop a Disk (DiskSuite Tool)".

    Either using DiskSuite Tool or the ssaadm command, spin down the tray. When the tray lock light is out, the tray can be removed and the required task performed.

How to Replace a SPARCstorage Array Tray

When you have completed work on a SPARCstorage Array tray, replace the tray in the chassis. The disks will automatically spin up.

However, if the disks fail to spin up, you can use DiskSuite Tool (or the ssaadm command) to manually spin up the entire tray. There is a short delay (several seconds) between starting drives in the SPARCstorage Array.

After the disks have spun up, you must place online all the submirrors that were taken offline. When you bring a submirror online, an optimized resync operation automatically brings the submirrors up-to-date. The optimized resync copies only the regions of the disk that were modified while the submirror was offline. This is typically a very small fraction of the submirror capacity. You must also replace all state database replicas and add back hot spares.
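
For example, to bring submirror d10 of mirror d20 back online (hypothetical metadevice names), you would run:


# metaonline d20 d10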


Note -

If you used metadetach(1M) to detach the submirror rather than metaoffline, the entire submirror must be resynced. This typically takes about 10 minutes per Gbyte of data.


How to Recover From SPARCstorage Array Power Loss (Command Line)

When power is lost to one SPARCstorage Array, the following occurs:

You must monitor the configuration for these events using the metastat(1M) command as explained in "Checking Status of DiskSuite Objects".

You may need to perform the following after power is restored:

  1. After power is restored, use the metastat command to identify the errored devices.


    # metastat
    ...
    d10: Trans
        State: Okay
        Size: 11423440 blocks
        Master Device: d20
        Logging Device: d15
     
    d20: Mirror
        Submirror 0: d30
          State: Needs maintenance
        Submirror 1: d40
          State: Okay
    ...
    d30: Submirror of d20
        State: Needs maintenance
    ...
  2. Return errored devices to service using the metareplace command:


    # metareplace -e metadevice slice
    

    The -e option transitions the state of the slice to the "Available" state and resyncs the failed slice.


    Note -

    Slices that have been replaced by a hot spare should be the last devices replaced using the metareplace command. If the hot spare is replaced first, it could replace another errored slice in a submirror as soon as it becomes available.


    A resync can be performed on only one slice of a submirror (metadevice) at a time. If all slices of a submirror were affected by the power outage, each slice must be replaced separately. It takes approximately 10 minutes for a resync to be performed on a 1.05-Gbyte disk.

    Depending on the number of submirrors and the number of slices in these submirrors, the resync actions can require a considerable amount of time. A single submirror that is made up of 30 1.05-Gbyte drives might take about five hours to complete. A more realistic configuration made up of five-slice submirrors might take only 50 minutes to complete.

  3. After the loss of power, all state database replicas on the affected SPARCstorage Array chassis will enter an errored state. While these will be reclaimed at the next reboot, you may want to manually return them to service by first deleting and then adding them back.


    # metadb -d slice
    # metadb -a slice
    

    Note -

    Make sure you add back the same number of state database replicas that were deleted on each slice. Multiple state database replicas can be deleted with a single metadb -d command, but adding them back may require one invocation of metadb -a per slice, because all of the replica copies for a given slice must be added in a single invocation using the -c flag. Refer to the metadb(1M) man page for more information.
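
    For example, if three replicas were deleted from a single slice, add all three back with one command (the slice name is hypothetical):


    # metadb -a -c 3 c1t0d0s3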


    Because state database replica recovery is not automatic, it is safest to manually perform the recovery immediately after the SPARCstorage Array returns to service. Otherwise, a new failure may cause a majority of state database replicas to be out of service and cause a kernel panic. This is the expected behavior of DiskSuite when too few state database replicas are available.

How to Move SPARCstorage Array Disks Between Hosts (Command Line)

This procedure explains how to move disks containing DiskSuite objects from one SPARCstorage Array to another.

  1. Repair any devices that are in an errored state or that have been replaced by hot spares on the disks that are to be moved.

  2. Identify the state database replicas, metadevices, and hot spares on the disks that are to be moved, by using the output from the metadb and metastat -p commands.

  3. Physically move the disks to the new host, being careful to connect them in a similar fashion so that the device names are the same.

  4. Recreate the state database replicas.


    # metadb -a [-f] slice ...
    

    Be sure to use the same slice names that contained the state database replicas as identified in Step 2. You might need to use the -f option to force the creation of the state database replicas.

  5. Copy the output from the metastat -p command in Step 2 to the md.tab file.
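
    For example, you might capture the configuration on the original host and carry the file to the new host (the file name is arbitrary):


    # metastat -p > /var/tmp/md.tab.save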

  6. Edit the md.tab file, making the following changes:

    • Delete metadevices which you did not move.

    • Change the old metadevice names to new names.

    • Make any mirrors into one-way mirrors for the time being, selecting the smallest submirror (if appropriate).

  7. Check the syntax of the md.tab file.


    # metainit -a -n
    
  8. Recreate the moved metadevices and hot spare pools.


    # metainit -a
    
  9. Make the one-way mirrors into multi-way mirrors using the metattach(1M) command as necessary.

  10. Edit the /etc/vfstab file for file systems that are to be automatically mounted at boot. Then remount file systems on the new metadevices as necessary.

Using the SPARCstorage Array as a System Disk

This section contains information on making a SPARCstorage Array function as a system disk (boot device).

Making a SPARCstorage Array Bootable

The minimum boot requirements for the SPARCstorage Array are:

To update or check the Fcode revision, use the fc_update program, which is supplied on the SPARCstorage Array CD, in its own subdirectory.

Consult the SPARCstorage Array documentation for more details.

How to Make SPARCstorage Array Disks Available Early in the Boot Process

Add the following forceload entries to the /etc/system file to ensure that the SPARCstorage Array disks are made available early in the boot process. This is necessary to make the SPARCstorage Array function as a system disk (boot device).


*ident	"@(#)system	1.15	92/11/14 SMI" /* SVR4 1.5 */
*
* SYSTEM SPECIFICATION FILE
*
...
* forceload:
*
*	Cause these modules to be loaded at boot time, (just before mounting
*	the root filesystem) rather than at first reference. Note that
* 	forceload expects a filename which includes the directory. Also
*	note that loading a module does not necessarily imply that it will
*	be installed.
*
forceload: drv/ssd
forceload: drv/pln
forceload: drv/soc
...

Note -

When creating a root (/) mirror on a SPARCstorage Array disk, running the metaroot(1M) command puts the above entries in the /etc/system file automatically.