Solaris Volume Manager Administration Guide

Chapter 25 Troubleshooting Solaris Volume Manager (Tasks)

This chapter describes how to troubleshoot problems related to Solaris Volume Manager. It provides both general troubleshooting guidelines and specific procedures for resolving some known problems.

This chapter includes the following information:

This chapter describes some Solaris Volume Manager problems and their appropriate solutions. It is not intended to be all-inclusive but rather to present common scenarios and recovery procedures.

Troubleshooting Solaris Volume Manager (Task Map)

The following task map identifies some procedures needed to troubleshoot Solaris Volume Manager.

Task 

Description 

Instructions 

Replace a failed disk 

Replace a disk, then update state database replicas and logical volumes on the new disk. 

How to Replace a Failed Disk

Recover from disk movement problems 

Restore disks to original locations or contact product support. 

Recovering from Disk Movement Problems

Recover from improper /etc/vfstab entries

Use the fsck command on the mirror, then edit the /etc/vfstab file so the system will boot correctly.

How to Recover From Improper /etc/vfstab Entries

Recover from a boot device failure 

Boot from a different submirror.  

How to Recover From a Boot Device Failure

Recover from insufficient state database replicas 

Delete unavailable replicas by using the metadb command.

How to Recover From Insufficient State Database Replicas

Recover configuration data for a lost soft partition 

Use the metarecover command to recover configuration data for soft partitions.

How to Recover Configuration Data for a Soft Partition

Recover a Solaris Volume Manager configuration from salvaged disks 

Attach disks to a new system and have Solaris Volume Manager rebuild the configuration from the existing state database replicas. 

How to Recover a Configuration

Overview of Troubleshooting the System

Prerequisites for Troubleshooting the System

To troubleshoot storage management problems related to Solaris Volume Manager, you need to do the following:

General Guidelines for Troubleshooting Solaris Volume Manager

You should have the following information on hand when you troubleshoot Solaris Volume Manager problems:


Tip –

Any time you update your Solaris Volume Manager configuration, or make other storage or operating environment-related changes to your system, generate fresh copies of this configuration information. You could also generate this information automatically with a cron job.
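
For example, a pair of root crontab entries along the following lines could capture the key configuration output nightly (the output file locations shown here are illustrative assumptions, not requirements):


# capture Solaris Volume Manager configuration nightly at 2:00 a.m.
0 2 * * * /usr/sbin/metastat -p > /etc/lvm/metastat-p.out 2>&1
0 2 * * * /usr/sbin/metadb -i > /etc/lvm/metadb-i.out 2>&1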


General Troubleshooting Approach

Although there is no one procedure that will enable you to evaluate all problems with Solaris Volume Manager, the following process provides one general approach that might help.

  1. Gather information about the current configuration.

  2. Look at the current status indicators, including the output from the metastat and metadb commands. The output from these commands should indicate which component is faulty (see the example after this list).

  3. Check the hardware for obvious points of failure. (Is everything connected properly? Was there a recent electrical outage? Have you recently added or changed equipment?)
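
For example, the status check described in step 2 might begin with commands like the following (output is omitted here):


# metadb -i
# metastat
# metastat -p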

Replacing Disks

This section describes how to replace disks in a Solaris Volume Manager environment.


Caution –

If you have soft partitions on a failed disk or on volumes built on a failed disk, you must put the new disk in the same physical location, with the same c*t*d* number as the disk it replaces.


How to Replace a Failed Disk

  1. Identify the failed disk to be replaced by examining the /var/adm/messages file and the metastat command output.

  2. Locate any state database replicas that might have been placed on the failed disk.

    Use the metadb command to find the replicas.

    The metadb command might report errors for the state database replicas located on the failed disk. In this example, c0t1d0 is the problem device.


    # metadb
       flags       first blk        block count
      a m     u        16               1034            /dev/dsk/c0t0d0s4
      a       u        1050             1034            /dev/dsk/c0t0d0s4
      a       u        2084             1034            /dev/dsk/c0t0d0s4
      W   pc luo       16               1034            /dev/dsk/c0t1d0s4
      W   pc luo       1050             1034            /dev/dsk/c0t1d0s4
      W   pc luo       2084             1034            /dev/dsk/c0t1d0s4

    The output shows three state database replicas on slice 4 of each of the local disks, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. The three replicas on the c0t0d0s4 slice are still good.

  3. Record the slice name where the state database replicas reside and the number of state database replicas, then delete the state database replicas.

    The number of state database replicas is obtained by counting the number of appearances of a slice in the metadb command output. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.


    # metadb -d c0t1d0s4
    

    Caution –

    If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. This will help ensure that configuration information remains intact.


  4. Locate and delete any hot spares on the failed disk. Use the metastat command to find hot spares. In this example, hot spare pool hsp000 included c0t1d0s6, which is then deleted from the pool.


    # metahs -d hsp000 c0t1d0s6
    hsp000: Hotspare is deleted
  5. Physically replace the failed disk.

  6. Logically replace the failed disk using the devfsadm command, cfgadm command, luxadm command, or other commands as appropriate for your hardware and environment.
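
    For example, on a system where the replacement disk is attached to a SCSI controller, a sequence such as the following might be used (the cfgadm attachment point shown here is an assumption for illustration; check the cfgadm -al output for the correct attachment point on your system):


    # cfgadm -al
    # cfgadm -c configure c0::dsk/c0t1d0
    # devfsadm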

  7. Update the Solaris Volume Manager state database with the device ID for the new disk using the metadevadm -u cntndn command.

    In this example, the new disk is c0t1d0.


    # metadevadm -u c0t1d0
    
  8. Repartition the new disk.

    Use the format command or the fmthard command to partition the disk with the same slice information as the failed disk. If you have the prtvtoc output from the failed disk, you can format the replacement disk with fmthard -s /tmp/failed-disk-prtvtoc-output, as shown in the example below.
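
    For example, assuming the VTOC of the failed disk was saved to /tmp/failed-disk-prtvtoc-output before the failure, you could label the replacement disk (c0t1d0 in this example) as follows:


    # fmthard -s /tmp/failed-disk-prtvtoc-output /dev/rdsk/c0t1d0s2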

  9. If you deleted state database replicas, add the same number back to the appropriate slice.

    In this example, /dev/dsk/c0t1d0s4 is used.


    # metadb -a -c 3 c0t1d0s4
    
  10. If any slices on the disk are components of RAID 5 volumes or are components of RAID 0 volumes that are in turn submirrors of RAID 1 volumes, run the metareplace -e command for each slice.

    In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.


    # metareplace -e d10 c0t1d0s4
    
  11. If any soft partitions are built directly on slices on the replaced disk, run the metarecover -d -p command on each slice containing soft partitions to regenerate the extent headers on disk.

    In this example, /dev/dsk/c0t1d0s4 needs to have the soft partition markings on disk regenerated, so it is scanned and the markings are reapplied based on the information in the state database replicas.


    # metarecover c0t1d0s4 -d -p 
    
  12. If any soft partitions on the disk are components of RAID 5 volumes or are components of RAID 0 volumes that are submirrors of RAID 1 volumes, run the metareplace -e command for each slice.

    In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.


    # metareplace -e d10 c0t1d0s4
    
  13. If any RAID 0 volumes have soft partitions built on them, run the metarecover command for each RAID 0 volume.

    In this example, RAID 0 volume d17 has soft partitions built on it.


    # metarecover d17 -m -p
    
  14. Replace hot spares that were deleted, and add them to the appropriate hot spare pool or pools.


    # metahs -a hsp000 c0t0d0s6
    hsp000: Hotspare is added
  15. If soft partitions or non-redundant volumes were affected by the failure, restore data from backups. If only redundant volumes were affected, then validate your data.

    Check the user/application data on all volumes. You might have to run an application-level consistency checker or use some other method to check the data.

Example—Replacing a Failed Disk

In the following example, a disk (/dev/dsk/c0t1d0) has failed and needs to be replaced.


panic[cpu0]/thread=70a41e00: md: state database problem


ok boot -s
Resetting ... 


Jun  7 08:57:25 su: 'su root' succeeded for root on /dev/console
Sun Microsystems Inc.   SunOS 5.9       s81_39  May 2002
# metadb
        flags           first blk       block count
     a m  p  lu         16              8192            /dev/dsk/c0t0d0s7
     a    p  l          8208            8192            /dev/dsk/c0t0d0s7
     a    p  l          16400           8192            /dev/dsk/c0t0d0s7
#  


Recovering from Disk Movement Problems

This section describes how to recover from unexpected problems after moving disks in the Solaris Volume Manager environment.

Disk Movement and Device ID Overview

Solaris Volume Manager uses device IDs, which are associated with a specific disk, to track all disks used in a Solaris Volume Manager configuration. When disks are moved to a different controller or when the SCSI target numbers change, Solaris Volume Manager usually correctly identifies the movement and updates all related Solaris Volume Manager records accordingly, and no system administrator intervention is required. In isolated cases, Solaris Volume Manager cannot completely update the records and reports an error on boot.

Resolving Unnamed Devices Error Message

If you add new hardware or move hardware (for example, move a string of disks from one controller to another controller), Solaris Volume Manager will check the device IDs associated with the disks that moved, and update the c*t*d* names in internal Solaris Volume Manager records accordingly. If the records cannot be updated, the boot processes spawned by /etc/rc2.d/S95svm.sync (linked to /etc/init.d/svm.sync) will report an error to the console at boot time:


Unable to resolve unnamed devices for volume management.
Please refer to the Solaris Volume Manager documentation,
Troubleshooting section, at http://docs.sun.com or from
your local copy.

No data loss has occurred, and none will occur as a direct result of this problem. This error message indicates that the Solaris Volume Manager name records have been only partially updated, so output from the metastat command will likely show some of the c*t*d* names previously used, and some of the c*t*d* names reflecting the state after the move.

If you need to update your Solaris Volume Manager configuration while this condition exists, you must use the c*t*d* names reported by the metastat command when you issue any meta* commands.

If this error condition occurs, you can do one of the following to resolve the condition:

Recovering From Boot Problems

Because Solaris Volume Manager enables you to mirror the root (/), swap, and /usr directories, special problems can arise when you boot the system, either through hardware failures or operator error. The tasks in this section provide solutions to such potential problems.

The following table describes these problems and points you to the appropriate solution.

Table 25–1 Common Solaris Volume Manager Boot Problems

Reason for the Boot Problem 

Instructions 

The /etc/vfstab file contains incorrect information.

How to Recover From Improper /etc/vfstab Entries

There are not enough state database replicas. 

How to Recover From Insufficient State Database Replicas

A boot device (disk) has failed. 

How to Recover From a Boot Device Failure

The boot mirror has failed. 

 

Background Information for Boot Problems

How to Recover From Improper /etc/vfstab Entries

If you have made an incorrect entry in the /etc/vfstab file, for example, when mirroring root (/), the system will appear at first to boot properly and then fail. To remedy this situation, you need to edit the /etc/vfstab file while in single-user mode.

The high-level steps to recover from improper /etc/vfstab file entries are as follows:

  1. Booting the system to single-user mode

  2. Running the fsck command on the mirror volume

  3. Remounting file system read-write

  4. Optional: running the metaroot command for a root (/) mirror

  5. Verifying that the /etc/vfstab file correctly references the volume for the file system entry

  6. Rebooting

Example—Recovering the root (/) RAID 1 (Mirror) Volume

In the following example, root (/) is mirrored with a two-way mirror, d0. The root (/) entry in the /etc/vfstab file has somehow reverted to the original slice of the file system, but the information in the /etc/system file still shows booting to be from the mirror d0. The most likely reason is that the metaroot command was not used to maintain the /etc/system and /etc/vfstab files, or an old copy of the /etc/vfstab file was copied back.

The incorrect /etc/vfstab file would look something like the following:

#device        device          mount          FS      fsck   mount    mount
#to mount      to fsck         point          type    pass   at boot  options
#
/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0  /       ufs      1     no       -
/dev/dsk/c0t3d0s1 -                   -       swap     -     no       -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr    ufs      2     no       -
#
/proc             -                  /proc    proc     -     no       -
swap              -                  /tmp     tmpfs    -     yes      -

Because of the errors, you automatically go into single-user mode when the system is booted:


ok boot
...
configuring network interfaces: hme0.
Hostname: lexicon
mount: /dev/dsk/c0t3d0s0 is not this fstype.
setmnt: Cannot open /etc/mnttab for writing

INIT: Cannot create /var/adm/utmp or /var/adm/utmpx

INIT: failed write of utmpx entry:"  "

INIT: failed write of utmpx entry:"  "

INIT: SINGLE USER MODE

Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>

At this point, root (/) and /usr are mounted read-only. Follow these steps:

  1. Run the fsck command on the root (/) mirror.


    Note –

    Be careful to use the correct volume for root.



    # fsck /dev/md/rdsk/d0
    ** /dev/md/rdsk/d0
    ** Currently Mounted on /
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    2274 files, 11815 used, 10302 free (158 frags, 1268 blocks,
    0.7% fragmentation)
  2. Remount root (/) read/write so you can edit the /etc/vfstab file.


    # mount -o rw,remount /dev/md/dsk/d0 /
    mount: warning: cannot lock temp file </etc/.mnt.lock>
  3. Run the metaroot command.


    # metaroot d0
    

    This command edits the /etc/system and /etc/vfstab files to specify that the root (/) file system is now on volume d0.

  4. Verify that the /etc/vfstab file contains the correct volume entries.

    The root (/) entry in the /etc/vfstab file should appear as follows so that the entry for the file system correctly references the RAID 1 volume:

    #device           device              mount    FS      fsck   mount   mount
    #to mount         to fsck             point    type    pass   at boot options
    #
    /dev/md/dsk/d0    /dev/md/rdsk/d0     /        ufs     1      no      -
    /dev/dsk/c0t3d0s1 -                   -        swap    -      no      -
    /dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr     ufs     2      no      -
    #
    /proc             -                  /proc     proc    -      no      -
    swap              -                  /tmp      tmpfs   -      yes     -
  5. Reboot the system.

    The system returns to normal operation.

How to Recover From a Boot Device Failure

If you have a root (/) mirror and your boot device fails, you'll need to set up an alternate boot device.

The high-level steps in this task are as follows:

  1. Booting from the alternate root (/) submirror

  2. Determining the errored state database replicas and volumes

  3. Repairing the failed disk

  4. Restoring state database and volumes to their original state

In the following example, the boot device, which contains two of the six state database replicas and the root (/), swap, and /usr submirrors, fails.

Initially, when the boot device fails, you'll see a message similar to the following. This message might differ among various architectures.


Rebooting with command:
Boot device: /iommu/sbus/dma@f,81000/esp@f,80000/sd@3,0   
The selected SCSI device is not responding
Can't open boot device
...

When you see this message, note the device. Then, follow these steps:

  1. Boot from another root (/) submirror.

    Since only two of the six state database replicas in this example are in error, you can still boot. If this were not the case, you would need to delete the inaccessible state database replicas in single-user mode. This procedure is described in How to Recover From Insufficient State Database Replicas.

    When you created the mirror for the root (/) file system, you should have recorded the alternate boot device as part of that procedure. In this example, disk2 is that alternate boot device.


    ok boot disk2
    SunOS Release 5.9 Version s81_51 64-bit
    Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
    Hostname: demo
    ...
    demo console login: root
    Password: <root-password>
    Dec 16 12:22:09 lexicon login: ROOT LOGIN /dev/console
    Last login: Wed Dec 12 10:55:16 on console
    Sun Microsystems Inc.   SunOS 5.9       s81_51  May 2002
    ...
  2. Determine that two state database replicas have failed by using the metadb command.


    # metadb
           flags         first blk    block count
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo      16           1034         /dev/dsk/c0t1d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t1d0s3

    The system can no longer detect state database replicas on slice /dev/dsk/c0t3d0s3, which is part of the failed disk.

  3. Determine that half of the root (/), swap, and /usr mirrors have failed by using the metastat command.


    # metastat
    d0: Mirror
        Submirror 0: d10
          State: Needs maintenance
        Submirror 1: d20
          State: Okay
    ...
     
    d10: Submirror of d0
        State: Needs maintenance
        Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
        Size: 47628 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s0          0     No    Maintenance 
     
    d20: Submirror of d0
        State: Okay
        Size: 47628 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s0          0     No    Okay  
     
    d1: Mirror
        Submirror 0: d11
          State: Needs maintenance
        Submirror 1: d21
          State: Okay
    ...
     
    d11: Submirror of d1
        State: Needs maintenance
        Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
        Size: 69660 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s1          0     No    Maintenance 
     
    d21: Submirror of d1
        State: Okay
        Size: 69660 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s1          0     No    Okay        
     
    d2: Mirror
        Submirror 0: d12
          State: Needs maintenance
        Submirror 1: d22
          State: Okay
    ...
     
    d12: Submirror of d2
        State: Needs maintenance
        Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"
        Size: 286740 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t3d0s6          0     No    Maintenance 
     
     
    d22: Submirror of d2
        State: Okay
        Size: 286740 blocks
        Stripe 0:
    	Device              Start Block  Dbase State        Hot Spare
    	/dev/dsk/c0t2d0s6          0     No    Okay  

    In this example, the metastat command shows that the following submirrors need maintenance:

    • Submirror d10, device c0t3d0s0

    • Submirror d11, device c0t3d0s1

    • Submirror d12, device c0t3d0s6

  4. Halt the system, replace the disk, and use the format command or the fmthard command to partition the disk as it was before the failure.


    Tip –

    If the new disk is identical to the existing disk (the intact side of the mirror in this example), use prtvtoc /dev/rdsk/c0t2d0s2 | fmthard -s - /dev/rdsk/c0t3d0s2 to quickly format the new disk (c0t3d0 in this example).
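
    As a single command line, the copy described in this tip looks like the following:


    # prtvtoc /dev/rdsk/c0t2d0s2 | fmthard -s - /dev/rdsk/c0t3d0s2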



    # halt
    ...
    Halted
    ...
    ok boot
    ...
    # format /dev/rdsk/c0t3d0s0
    
  5. Reboot.

    Note that you must reboot from the other half of the root (/) mirror. You should have recorded the alternate boot device when you created the mirror.


    # halt
    ...
    ok boot disk2
    
  6. To delete the failed state database replicas and then add them back, use the metadb command.


    # metadb
           flags         first blk    block count
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        M     p          unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo      16           1034         /dev/dsk/c0t1d0s3
        a    p  luo      1050         1034         /dev/dsk/c0t1d0s3
    # metadb -d c0t3d0s3
    # metadb -c 2 -a c0t3d0s3
    # metadb
           flags         first blk    block count
         a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
         a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
         a    p  luo     16           1034         /dev/dsk/c0t1d0s3
         a    p  luo     1050         1034         /dev/dsk/c0t1d0s3
         a        u      16           1034         /dev/dsk/c0t3d0s3
         a        u      1050         1034         /dev/dsk/c0t3d0s3
  7. Re-enable the submirrors by using the metareplace command.


    # metareplace -e d0 c0t3d0s0
    Device /dev/dsk/c0t3d0s0 is enabled
     
    # metareplace -e d1 c0t3d0s1
    Device /dev/dsk/c0t3d0s1 is enabled
     
    # metareplace -e d2 c0t3d0s6
    Device /dev/dsk/c0t3d0s6 is enabled

    After some time, the resynchronization will complete. You can now return to booting from the original device.

Recovering From State Database Replica Failures

How to Recover From Insufficient State Database Replicas

If the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted into multiuser mode. This situation could follow a panic (when Solaris Volume Manager discovers that fewer than half the state database replicas are available) or could occur if the system is rebooted with exactly half or fewer functional state database replicas. In Solaris Volume Manager terms, the state database has gone “stale.” This task explains how to recover from this problem.

  1. Boot the system to determine which state database replicas are down.

  2. Determine which state database replicas are unavailable.

    Use the following format of the metadb command:


    metadb -i
    
  3. If one or more disks are known to be unavailable, delete the state database replicas on those disks. Otherwise, delete enough errored state database replicas (W, M, D, F, or R status flags reported by metadb) to ensure that a majority of the existing state database replicas are not errored.

    Delete the state database replica on the bad disk using the metadb -d command.
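
    For example, using the failed slice from the example that follows this procedure:


    # metadb -d c1t1d0s0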


    Tip –

    State database replicas with a capitalized status flag are in error, while those with lowercase status flags are functioning normally.


  4. Verify that the replicas have been deleted by using the metadb command.

  5. Reboot.

  6. If necessary, you can replace the disk, format it appropriately, then add any state database replicas that are needed to the disk, following the instructions in Creating State Database Replicas.

    Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format command or the fmthard command to partition the disk as it was configured before the failure.
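
    For example, once the replacement disk is partitioned, you could add the replicas back with a command such as the following (the slice name and replica count here are illustrative):


    # metadb -a -c 2 c1t1d0s0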

Example—Recovering From Stale State Database Replicas

In the following example, a disk containing seven replicas has gone bad. This leaves the system with only three good replicas, and the system panics, then cannot reboot into multi-user mode.


panic[cpu0]/thread=70a41e00: md: state database problem

403238a8 md:mddb_commitrec_wrapper+6c (2, 1, 70a66ca0, 40323964, 70a66ca0, 3c)
  %l0-7: 0000000a 00000000 00000001 70bbcce0 70bbcd04 70995400 00000002 00000000
40323908 md:alloc_entry+c4 (70b00844, 1, 9, 0, 403239e4, ff00)
  %l0-7: 70b796a4 00000001 00000000 705064cc 70a66ca0 00000002 00000024 00000000
40323968 md:md_setdevname+2d4 (7003b988, 6, 0, 63, 70a71618, 10)
  %l0-7: 70a71620 00000000 705064cc 70b00844 00000010 00000000 00000000 00000000
403239f8 md:setnm_ioctl+134 (7003b968, 100003, 64, 0, 0, ffbffc00)
  %l0-7: 7003b988 00000000 70a71618 00000000 00000000 000225f0 00000000 00000000
40323a58 md:md_base_ioctl+9b4 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, ff1b5470)
  %l0-7: ff3f2208 ff3f2138 ff3f26a0 00000000 00000000 00000064 ff1396e9 00000000
40323ad0 md:md_admin_ioctl+24 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, 0)
  %l0-7: 00005605 ffbffa3c 00100003 0157ffff 0aa64245 00000000 7efefeff 81010100
40323b48 md:mdioctl+e4 (157ffff, 5605, ffbffa3c, 100003, 7016db60, 40323c7c)
  %l0-7: 0157ffff 00005605 ffbffa3c 00100003 0003ffff 70995598 70995570 0147c800
40323bb0 genunix:ioctl+1dc (3, 5605, ffbffa3c, fffffff8, ffffffe0, ffbffa65)
  %l0-7: 0114c57c 70937428 ff3f26a0 00000000 00000001 ff3b10d4 0aa64245 00000000

panic: 
stopped at      edd000d8:       ta      %icc,%g0 + 125
Type  'go' to resume

ok boot -s
Resetting ... 

Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz), No Keyboard
OpenBoot 3.11, 128 MB memory installed, Serial #9841776.
Ethernet address 8:0:20:96:2c:70, Host ID: 80962c70.



Rebooting with command: boot -s                                       
Boot device: /pci@1f,0/pci@1,1/ide@3/disk@0,0:a  File and args: -s
SunOS Release 5.9 Version s81_39 64-bit

Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
configuring IPv4 interfaces: hme0.
Hostname: dodo

metainit: dodo: stale databases

Insufficient metadevice database replicas located.

Use metadb to delete databases which are broken.
Ignore any "Read-only file system" error messages.
Reboot the system when finished to reload the metadevice database.
After reboot, repair any broken database replicas which were deleted.

Type control-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

Jun  7 08:57:25 su: 'su root' succeeded for root on /dev/console
Sun Microsystems Inc.   SunOS 5.9       s81_39  May 2002
# metadb -i
        flags           first blk       block count
     a m  p  lu         16              8192            /dev/dsk/c0t0d0s7
     a    p  l          8208            8192            /dev/dsk/c0t0d0s7
     a    p  l          16400           8192            /dev/dsk/c0t0d0s7
    M     p             16              unknown         /dev/dsk/c1t1d0s0
    M     p             8208            unknown         /dev/dsk/c1t1d0s0
    M     p             16400           unknown         /dev/dsk/c1t1d0s0
    M     p             24592           unknown         /dev/dsk/c1t1d0s0
    M     p             32784           unknown         /dev/dsk/c1t1d0s0
    M     p             40976           unknown         /dev/dsk/c1t1d0s0
    M     p             49168           unknown         /dev/dsk/c1t1d0s0
# metadb -d c1t1d0s0
# metadb
        flags           first blk       block count
     a m  p  lu         16              8192            /dev/dsk/c0t0d0s7
     a    p  l          8208            8192            /dev/dsk/c0t0d0s7
     a    p  l          16400           8192            /dev/dsk/c0t0d0s7
#  

The system panicked because it could no longer detect state database replicas on slice /dev/dsk/c1t1d0s0, which is part of the failed disk or attached to a failed controller. The first metadb -i command identifies the replicas on this slice as having a problem with the master blocks.

When you delete the stale state database replicas, the root (/) file system is read-only. You can ignore the mddb.cf error messages displayed.

At this point, the system is again functional, although it probably has fewer state database replicas than it should, and any volumes that used part of the failed storage are also either failed, errored, or hot-spared; those issues should be addressed promptly.

Repairing Transactional Volumes

Because a transactional volume is a “layered” volume, consisting of a master device and a logging device, and because the logging device can be shared among file systems, repairing a failed transactional volume requires special recovery tasks.

Any device errors or panics must be managed by using the command line utilities.

Panics

If a file system detects any internal inconsistencies while it is in use, it will panic the system. If the file system is configured for logging, it notifies the transactional volume that it needs to be checked at reboot. The transactional volume transitions itself to the “Hard Error” state. All other transactional volumes that share the same log device also go into the “Hard Error” state.

At reboot, fsck checks and repairs the file system and transitions the file system back to the “Okay” state. fsck completes this process for all transactional volumes listed in the /etc/vfstab file for the affected log device.

Transactional Volume Errors

If a device error occurs on either the master device or the log device while the transactional volume is processing logged data, the device transitions from the “Okay” state to the “Hard Error” state. If the device is in either the “Hard Error” or the “Error” state, either a device error or a panic has occurred.

Any devices sharing the failed log device also go into the “Error” state.

Recovering From Soft Partition Problems

The following sections show how to recover configuration information for soft partitions. You should only use these techniques if all of your state database replicas have been lost and you do not have a current or accurate copy of metastat -p output, the md.cf file, or an up-to-date md.tab file.

How to Recover Configuration Data for a Soft Partition

At the beginning of each soft partition extent, a sector is used to mark the start of that extent. These hidden sectors are called extent headers and do not appear to the user of the soft partition. If all Solaris Volume Manager configuration is lost, the disk can be scanned in an attempt to generate the configuration data.

This procedure is a last option to recover lost soft partition configuration information. The metarecover command should only be used when you have lost both your metadb and your md.cf files, and your md.tab is lost or out of date.


Note –

This procedure only works to recover soft partition information, and does not assist in recovering from other lost configurations or for recovering configuration information for other Solaris Volume Manager volumes.



Note –

If your configuration included other Solaris Volume Manager volumes that were built on top of soft partitions, you should recover the soft partitions before attempting to recover the other volumes.


Configuration information about your soft partitions is stored on your devices and in your state database. Since either of these sources could be corrupt, you must tell the metarecover command which source is reliable.

First, use the metarecover command to determine whether the two sources agree. If they do agree, the metarecover command cannot be used to make any changes. If the metarecover command reports an inconsistency, however, you must examine its output carefully to determine whether the disk or the state database is corrupt, then you should use the metarecover command to rebuild the configuration based on the appropriate source.

  1. Read the Configuration Guidelines for Soft Partitions.

  2. Review the soft partition recovery information by using the metarecover command.


    # metarecover component -p -d
    

    In this case, component is the c*t*d*s* name of the raw component. The -d option specifies that the physical slice should be scanned for soft partition extent headers.

    For more information, see the metarecover(1M) man page.

Example—Recovering Soft Partitions from On-Disk Extent Headers


# metarecover c1t1d0s1 -p -d
The following soft partitions were found and will be added to
your metadevice configuration.
 Name            Size     No. of Extents
    d10           10240         1
    d11           10240         1
    d12           10240         1
# metarecover c1t1d0s1 -p -d
The following soft partitions were found and will be added to
your metadevice configuration.
 Name            Size     No. of Extents
    d10           10240         1
    d11           10240         1
    d12           10240         1
WARNING: You are about to add one or more soft partition
metadevices to your metadevice configuration.  If there
appears to be an error in the soft partition(s) displayed
above, do NOT proceed with this recovery operation.
Are you sure you want to do this (yes/no)?yes
c1t1d0s1: Soft Partitions recovered from device.
bash-2.05# metastat
d10: Soft Partition
    Device: c1t1d0s1
    State: Okay
    Size: 10240 blocks
        Device              Start Block  Dbase Reloc
        c1t1d0s1                   0     No    Yes

        Extent              Start Block              Block count
             0                        1                    10240

d11: Soft Partition
    Device: c1t1d0s1
    State: Okay
    Size: 10240 blocks
        Device              Start Block  Dbase Reloc
        c1t1d0s1                   0     No    Yes

        Extent              Start Block              Block count
             0                    10242                    10240

d12: Soft Partition
    Device: c1t1d0s1
    State: Okay
    Size: 10240 blocks
        Device              Start Block  Dbase Reloc
        c1t1d0s1                   0     No    Yes

        Extent              Start Block              Block count
             0                    20483                    10240

This example recovers three soft partitions from disk, after the state database replicas were accidentally deleted.

Recovering Configuration From a Different System

You can recover a Solaris Volume Manager configuration, even onto a different system from the original. For example, assume you have a system with an external Multipack of six disks in it, and a Solaris Volume Manager configuration, including at least one state database replica, on some of those disks. If you experience a system failure, you can attach the Multipack to a different system and recover the complete configuration from the local disk set.


Note –

Only recover a Solaris Volume Manager configuration onto a system with no preexisting Solaris Volume Manager configuration. Otherwise, you risk replacing a logical volume on your system with a logical volume that you are recovering, and possibly corrupting your system.



Note –

This process only works to recover volumes from the local disk set.


How to Recover a Configuration

  1. Attach the disk or disks that contain the Solaris Volume Manager configuration to a system with no preexisting Solaris Volume Manager configuration.

  2. Do a reconfiguration reboot to ensure that the system recognizes the newly added disks.


    # reboot -- -r
    
  3. Determine the major/minor number for a slice containing a state database replica on the newly added disks.

    Use the ls -Ll command, and note the two numbers between the group name and the date. Those are the major/minor numbers for this slice.


    # ls -Ll /dev/dsk/c1t9d0s7
    brw-r-----   1 root     sys       32, 71 Dec  5 10:05 /dev/dsk/c1t9d0s7

  4. If necessary, determine the major name corresponding to the major number by looking it up in /etc/name_to_major.


    # grep " 32" /etc/name_to_major 
    sd 32
    

  5. Update the /kernel/drv/md.conf file with two entries: one entry to tell Solaris Volume Manager where to find a valid state database replica on the newly added disks, and one entry to tell it to trust the new replica and ignore any conflicting device ID information on the system.

    In the line in this example that begins with mddb_bootlist1, replace sd with the major name you found in the previous step, and replace 71 with the minor number you identified in Step 3.


    #pragma ident   "@(#)md.conf    2.1     00/07/07 SMI"
    #
    # Copyright (c) 1992-1999 by Sun Microsystems, Inc.
    # All rights reserved.
    #
    name="md" parent="pseudo" nmd=128 md_nsets=4;
    # Begin MDD database info (do not edit)
    mddb_bootlist1="sd:71:16:id0"; 
    md_devid_destroy=1;# End MDD database info (do not edit)

  6. Reboot to force Solaris Volume Manager to reload your configuration.

    You will see messages similar to the following displayed to the console.


    volume management starting.
    Dec  5 10:11:53 lexicon metadevadm: Disk movement detected
    Dec  5 10:11:53 lexicon metadevadm: Updating device names in 
    Solaris Volume Manager
    The system is ready.

  7. Verify your configuration by using the metadb and metastat commands.


    # metadb
            flags           first blk       block count
         a m  p  luo        16              8192            /dev/dsk/c1t9d0s7
         a       luo        16              8192            /dev/dsk/c1t10d0s7
         a       luo        16              8192            /dev/dsk/c1t11d0s7
         a       luo        16              8192            /dev/dsk/c1t12d0s7
         a       luo        16              8192            /dev/dsk/c1t13d0s7
    # metastat
    d12: RAID
        State: Okay         
        Interlace: 32 blocks
        Size: 125685 blocks
    Original device:
        Size: 128576 blocks
            Device              Start Block  Dbase State        Reloc  Hot Spare
            c1t11d0s3                330     No    Okay         Yes    
            c1t12d0s3                330     No    Okay         Yes    
            c1t13d0s3                330     No    Okay         Yes    
    
    d20: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent              Start Block              Block count
                 0                     3592                     8192
    
    d21: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent              Start Block              Block count
                 0                    11785                     8192
    
    d22: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent              Start Block              Block count
                 0                    19978                     8192
    
    d10: Mirror
        Submirror 0: d0
          State: Okay         
        Submirror 1: d1
          State: Okay         
        Pass: 1
        Read option: roundrobin (default)
        Write option: parallel (default)
        Size: 82593 blocks
    
    d0: Submirror of d10
        State: Okay         
        Size: 118503 blocks
        Stripe 0: (interlace: 32 blocks)
            Device              Start Block  Dbase State        Reloc  Hot Spare
            c1t9d0s0                   0     No    Okay         Yes    
            c1t10d0s0               3591     No    Okay         Yes    
    
    
    d1: Submirror of d10
        State: Okay         
        Size: 82593 blocks
        Stripe 0: (interlace: 32 blocks)
            Device              Start Block  Dbase State        Reloc  Hot Spare
            c1t9d0s1                   0     No    Okay         Yes    
            c1t10d0s1                  0     No    Okay         Yes    
    
    
    Device Relocation Information:
    Device       Reloc    Device ID
    c1t9d0       Yes      id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3487980000U00907AZ
    c1t10d0      Yes      id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3397070000W0090A8Q
    c1t11d0      Yes      id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3449660000U00904NZ
    c1t12d0      Yes      id1,sd@SSEAGATE_ST39103LCSUN9.0GLS32655400007010H04J
    c1t13d0      Yes      id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3461190000701001T0
    # 
    # metastat -p
    d12 -r c1t11d0s3 c1t12d0s3 c1t13d0s3 -k -i 32b
    d20 -p d10 -o 3592 -b 8192 
    d21 -p d10 -o 11785 -b 8192 
    d22 -p d10 -o 19978 -b 8192 
    d10 -m d0 d1 1
    d0 1 2 c1t9d0s0 c1t10d0s0 -i 32b
    d1 1 2 c1t9d0s1 c1t10d0s1 -i 32b
    #