Solstice DiskSuite 4.2.1 User's Guide

Boot Problems

Because DiskSuite enables you to mirror root (/), swap, and /usr, special problems caused by either hardware failure or operator error can arise when you boot the system. The tasks in this section are solutions to such potential problems.

Table 7-1 describes these problems and points you to the appropriate solution.

Table 7-1 Common DiskSuite Boot Problems


System Does Not Boot Because ...	Refer To ...
The `/etc/vfstab` file contains incorrect information.	"How to Recover From Improper `/etc/vfstab` Entries (Command Line)"
There are not enough state database replicas.	"How to Recover From Insufficient State Database Replicas (Command Line)"
A boot device (disk) has failed.	"How to Recover From a Boot Device Failure (Command Line)"
The boot mirror has failed.	"SPARC: How to Boot From the Alternate Device (Command Line)" or "x86: How to Boot From the Alternate Device (Command Line)"

Preliminary Information for Boot Problems

If the metadevice driver takes a metadevice offline due to errors, unmount all file systems on the disk where the failure occurred. Because each disk slice is independent, multiple file systems may be mounted on a single disk. If the metadisk driver has encountered a failure, other slices on the same disk will likely experience failures soon. File systems mounted directly on disk slices do not have the protection of metadisk driver error handling, and leaving such file systems mounted can leave you vulnerable to crashing the system and losing data.
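
To decide which file systems to unmount, it can help to list every mount point that sits directly on a slice of the affected disk. The following is a minimal sketch, not part of DiskSuite: the function name mounts_on_disk and the sample data are hypothetical, and the input is assumed to follow the /etc/mnttab layout (device first, mount point second).

```shell
#!/bin/sh
# Hypothetical helper: print the mount points whose device is a slice of
# the given disk. Reads mount-table lines (device, mount point, ...) on stdin.
mounts_on_disk() {
    awk -v disk="$1" '
        # Match only direct slices such as /dev/dsk/c0t3d0s6, not metadevices.
        $1 ~ ("^/dev/dsk/" disk "s[0-9]+$") { print $2 }
    '
}

# Illustrative sample data, not taken from a real system.
sample_mnttab='/dev/md/dsk/d0 / ufs rw 0
/dev/dsk/c0t3d0s6 /usr ufs rw 0
/dev/dsk/c0t3d0s7 /export/home ufs rw 0
/dev/dsk/c0t2d0s5 /opt ufs rw 0'

# Prints /usr and /export/home, one per line.
printf '%s\n' "$sample_mnttab" | mounts_on_disk c0t3d0
```

On a live system the same function could presumably be fed from the mount table, for example mounts_on_disk c0t3d0 < /etc/mnttab. An awk that supports the -v option is assumed.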

Minimize the amount of time you run with submirrors disabled or offline. During resyncing and online backup intervals, the full protection of mirroring is gone.

How to Recover From Improper `/etc/vfstab` Entries (Command Line)

If you have made an incorrect entry in the /etc/vfstab file, for example, when mirroring root (/), the system at first appears to boot properly and then fails. To remedy this situation, you need to edit /etc/vfstab while in single-user mode.

The high-level steps to recover from improper /etc/vfstab file entries are:

Booting the system to single-user mode
Running fsck(1M) on the mirror metadevice
Remounting file system read-write
Optional: running the metaroot(1M) command for a root (/) mirror
Verifying that the /etc/vfstab file correctly references the metadevice for the file system entry
Rebooting

Example -- Recovering the root (/) Mirror

In the following example, root (/) is mirrored with a two-way mirror, d0. The root (/) entry in /etc/vfstab has somehow reverted to the original slice of the file system, but the information in /etc/system still shows booting to be from the mirror d0. The most likely reason is that the metaroot(1M) command was not used to maintain /etc/system and /etc/vfstab, or an old copy of /etc/vfstab was copied back.

The incorrect /etc/vfstab file would look something like the following:

#device        device          mount          FS      fsck   mount    mount
#to mount      to fsck         point          type    pass   at boot  options
#
/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0  /       ufs      1     no       -
/dev/dsk/c0t3d0s1 -                   -       swap     -     no       -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr    ufs      2     no       -
#
/proc             -                  /proc    proc     -     no       -
fd                -                  /dev/fd  fd       -     no       -
swap              -                  /tmp     tmpfs    -     yes      -
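
A disagreement of this kind between /etc/system and /etc/vfstab can be spotted with a simple text comparison. The sketch below is illustrative only. It assumes that metaroot records a metadevice root in /etc/system as a line beginning with rootdev:/pseudo/md@, which this section does not state; the function name root_mismatch and both sample inputs are hypothetical.

```shell
#!/bin/sh
# Hypothetical check. $1 is the text of /etc/system; vfstab-format lines
# are read from stdin.
root_mismatch() {
    # Device named in the vfstab root (/) entry, skipping comment lines.
    root_dev=$(awk '$1 !~ /^#/ && $3 == "/" { print $1 }')
    # Assumption: a metadevice root appears as "rootdev:/pseudo/md@...".
    if printf '%s\n' "$1" | grep '^rootdev:/pseudo/md@' >/dev/null; then
        case $root_dev in
            /dev/md/dsk/*) echo "consistent: $root_dev" ;;
            *)             echo "mismatch: vfstab root is $root_dev" ;;
        esac
    else
        echo "no metadevice rootdev found"
    fi
}

# Illustrative inputs; the rootdev line is an assumed example.
system_text='rootdev:/pseudo/md@0:0,0,blk'
vfstab_text='/dev/dsk/c0t3d0s0 /dev/rdsk/c0t3d0s0 / ufs 1 no -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6 /usr ufs 2 no -'

# Prints: mismatch: vfstab root is /dev/dsk/c0t3d0s0
printf '%s\n' "$vfstab_text" | root_mismatch "$system_text"
```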

Because of the errors, you automatically go into single-user mode when the machine is booted:

ok boot
...
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1995, Sun Microsystems, Inc.
configuring network interfaces: le0.
Hostname: antero
mount: /dev/dsk/c0t3d0s0 is not this fstype.
setmnt: Cannot open /etc/mnttab for writing

INIT: Cannot create /var/adm/utmp or /var/adm/utmpx

INIT: failed write of utmpx entry:"  "

INIT: failed write of utmpx entry:"  "

INIT: SINGLE USER MODE

Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>

At this point, root (/) and /usr are mounted read-only. Follow these steps:

Run fsck(1M) on the root (/) mirror.

Note -

Be careful to use the correct metadevice for root.

# fsck /dev/md/rdsk/d0
** /dev/md/rdsk/d0
** Currently Mounted on /
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2274 files, 11815 used, 10302 free (158 frags, 1268 blocks, 0.7% fragmentation)

Remount root (/) read/write so you can edit the /etc/vfstab file.

# mount -o rw,remount /dev/md/dsk/d0 /
mount: warning: cannot lock temp file </etc/.mnt.lock>

Run the metaroot(1M) command.
# metaroot d0
This edits the /etc/system and /etc/vfstab files to specify that the root (/) file system is now on metadevice d0.

Verify that the /etc/vfstab file contains the correct metadevice entries.

The root (/) entry in the /etc/vfstab file should appear as follows so that the entry for the file system correctly references the mirror:

#device           device              mount    FS      fsck   mount   mount
#to mount         to fsck             point    type    pass   at boot options
#
/dev/md/dsk/d0    /dev/md/rdsk/d0     /        ufs     1      no      -
/dev/dsk/c0t3d0s1 -                   -        swap    -      no      -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr     ufs     2      no      -
#
/proc             -                  /proc     proc    -      no      -
fd                -                  /dev/fd   fd      -      no      -
swap              -                  /tmp      tmpfs   -      yes     -
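
The verification in this step can also be scripted. The following is a minimal sketch under stated assumptions: the function name vfstab_root_uses is hypothetical, and the input is expected in vfstab format with the device to mount, the device to fsck, and the mount point as the first three fields.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the root (/) entry read from stdin
# uses both the block and raw device paths of metadevice $1.
vfstab_root_uses() {
    awk -v blk="/dev/md/dsk/$1" -v raw="/dev/md/rdsk/$1" '
        $1 !~ /^#/ && $3 == "/" { found = 1; ok = ($1 == blk && $2 == raw) }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Sample input modeled on the corrected file shown above.
vfstab_text='#device to mount  device to fsck      mount point ...
/dev/md/dsk/d0    /dev/md/rdsk/d0     /        ufs     1      no      -
/dev/dsk/c0t3d0s6 /dev/rdsk/c0t3d0s6  /usr     ufs     2      no      -'

if printf '%s\n' "$vfstab_text" | vfstab_root_uses d0; then
    echo "root entry references d0"
else
    echo "root entry does not reference d0"
fi
```

Against the real file, the call would presumably be vfstab_root_uses d0 < /etc/vfstab.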

Reboot.

The system returns to normal operation.

How to Recover From Insufficient State Database Replicas (Command Line)

If for some reason the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted. In DiskSuite terms, the state database has gone "stale." This task explains how to recover.

The high-level steps in this task are:

Deleting the stale state database replicas and rebooting
Repairing the problem disk
Adding back the state database replica(s)

Example -- Recovering From Stale State Database Replicas

In the following example, a disk containing two replicas has gone bad. This leaves the system with only two good replicas, and the system cannot reboot.
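
The arithmetic behind this example can be sketched as follows. The rule used here is an assumption that this section does not spell out: a reboot succeeds only when more than half of all state database replicas are available. The function name can_reboot is hypothetical.

```shell
#!/bin/sh
# Sketch of the majority rule (an assumption, see the text above):
# rebooting requires more than half of all replicas to be available.
can_reboot() {
    total=$1
    good=$2
    # "More than half" in integer arithmetic: good * 2 > total.
    if [ $((good * 2)) -gt "$total" ]; then
        echo yes
    else
        echo no
    fi
}

can_reboot 4 2   # two of four replicas survive: prints no
can_reboot 6 4   # four of six replicas survive: prints yes
```

With four replicas split evenly across two disks, losing either disk leaves exactly half, which is why the system in this example cannot reboot.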

Boot the machine to determine which state database replicas are down.

ok boot
...
Hostname: demo
metainit: demo: stale databases
 
Insufficient metadevice database replicas located.
 
Use metadb to delete databases which are broken.
Ignore any "Read-only file system" error messages.
Reboot the system when finished to reload the metadevice
database.
After reboot, repair any broken database replicas which were
deleted.
 
Type Ctrl-d to proceed with normal startup,
(or give root password for system maintenance): <root-password>
Entering System Maintenance Mode
 
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]

Use the metadb(1M) command to look at the metadevice state database and see which state database replicas are not available.

# metadb -i
   flags      first blk      block count
    a m  p  lu    16                1034                  /dev/dsk/c0t3d0s3
    a   p  l      1050              1034                  /dev/dsk/c0t3d0s3
    M  p        unknown      unknown                      /dev/dsk/c1t2d0s3
    M  p        unknown      unknown                      /dev/dsk/c1t2d0s3
...

The system can no longer detect state database replicas on slice /dev/dsk/c1t2d0s3, which is part of the failed disk. The metadb command flags the replicas on this slice as having a problem with the master blocks.
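
When many replicas are listed, the affected slices can be picked out mechanically. The sketch below is illustrative: the function name bad_replica_slices is hypothetical, and it assumes that a failed replica shows M as the first flag on its line, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print each slice that has at least one replica
# flagged "M" (master block problem) in metadb output read from stdin.
bad_replica_slices() {
    # Require a /dev path in the last field so legend lines are skipped.
    awk '$1 == "M" && $NF ~ /^\/dev\// { print $NF }' | sort -u
}

# Sample input copied from the metadb -i listing above.
metadb_output='   flags      first blk      block count
    a m  p  lu    16                1034                  /dev/dsk/c0t3d0s3
    a   p  l      1050              1034                  /dev/dsk/c0t3d0s3
    M  p        unknown      unknown                      /dev/dsk/c1t2d0s3
    M  p        unknown      unknown                      /dev/dsk/c1t2d0s3'

# Prints: /dev/dsk/c1t2d0s3
printf '%s\n' "$metadb_output" | bad_replica_slices
```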

Delete the state database replicas on the bad disk using the -d option to the metadb(1M) command.

At this point, the root (/) file system is read-only. You can ignore the mddb.cf error messages:
# metadb -d -f c1t2d0s3
metadb: demo: /etc/lvm/mddb.cf.new: Read-only file system

Verify that the replicas were deleted.

# metadb -i
    flags        first blk       block count
     a m  p  lu         16              1034            /dev/dsk/c0t3d0s3
     a    p  l          1050            1034            /dev/dsk/c0t3d0s3

Reboot.

Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format(1M) command or the fmthard(1M) command to partition the disk as it was before the failure.
# halt
...
ok boot
...
# format /dev/rdsk/c1t2d0s0
...

Use the metadb(1M) command to add back the state database replicas and to determine that the state database replicas are correct.

# metadb -a -c 2 c1t2d0s3
# metadb
   flags        first blk  block count
  a m  p  luo      16           1034         /dev/dsk/c0t3d0s3
  a    p  luo      1050         1034         /dev/dsk/c0t3d0s3
  a       u        16           1034         /dev/dsk/c1t2d0s3
  a       u        1050         1034         /dev/dsk/c1t2d0s3

The metadb command with the -c 2 option adds two state database replicas to the same slice.
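
To confirm that each slice holds the expected number of replicas, the metadb listing can be tallied by slice. This is a minimal sketch: the function name replicas_per_slice is hypothetical, and the sample input is modeled on the listing above with full /dev/dsk paths.

```shell
#!/bin/sh
# Hypothetical helper: count replicas per slice in metadb output (stdin).
replicas_per_slice() {
    awk '$NF ~ /^\/dev\// { count[$NF]++ }
         END { for (s in count) print count[s], s }' | sort -k 2
}

# Sample input modeled on the metadb listing above.
metadb_output='   flags        first blk  block count
  a m  p  luo      16           1034         /dev/dsk/c0t3d0s3
  a    p  luo      1050         1034         /dev/dsk/c0t3d0s3
  a       u        16           1034         /dev/dsk/c1t2d0s3
  a       u        1050         1034         /dev/dsk/c1t2d0s3'

# Prints "2 /dev/dsk/c0t3d0s3" and "2 /dev/dsk/c1t2d0s3", one per line.
printf '%s\n' "$metadb_output" | replicas_per_slice
```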

How to Recover From a Boot Device Failure (Command Line)

If you have a root (/) mirror and your boot device fails, you'll need to set up an alternate boot device.

The high-level steps in this task are:

Booting from the alternate root (/) submirror
Determining the errored state database replicas and metadevices
Repairing the problem disk
Restoring metadevice state database and metadevices to their original state

Example -- Recovering From a Boot Device Failure

In the following example, the boot device containing two of the six state database replicas and the root (/), swap, and /usr submirrors fails.

Initially, when the boot device fails, you'll see a message similar to the following. This message may differ among various architectures.

Rebooting with command:
Boot device: /iommu/sbus/dma@f,81000/esp@f,80000/sd@3,0   File and args: kadb
kadb: kernel/unix
The selected SCSI device is not responding
Can't open boot device
...

When you see this message, note the device. Then, follow these steps:

Boot from another root (/) submirror.

Since only two of the six state database replicas in this example are in error, you can still boot. If this were not the case, you would need to delete the stale state database replicas in single-user mode. This procedure is described in "How to Recover From Insufficient State Database Replicas (Command Line)".

When you created the mirror for the root (/) file system, you should have recorded the alternate boot device as part of that procedure. In this example, disk2 is that alternate boot device.

ok boot disk2
...
SunOS Release 5.5 Version Generic [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1995, Sun Microsystems, Inc.
 
Hostname: demo
...
demo console login: root
Password: <root-password>
Last login: Wed Dec 16 13:15:42 on console
SunOS Release 5.1 Version Generic [UNIX(R) System V Release 4.0]
...

Use the metadb(1M) command to determine that two state database replicas have failed.

# metadb
       flags         first blk    block count
    M     p          unknown      unknown      /dev/dsk/c0t3d0s3
    M     p          unknown      unknown      /dev/dsk/c0t3d0s3
    a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
    a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
    a    p  luo      16           1034         /dev/dsk/c0t1d0s3
    a    p  luo      1050         1034         /dev/dsk/c0t1d0s3

The system can no longer detect state database replicas on slice /dev/dsk/c0t3d0s3, which is part of the failed disk.

Use the metastat(1M) command to determine that half of the root (/), swap, and /usr mirrors have failed.

# metastat
d0: Mirror
    Submirror 0: d10
      State: Needs maintenance
    Submirror 1: d20
      State: Okay
...
 
d10: Submirror of d0
    State: Needs maintenance
    Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
    Size: 47628 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t3d0s0          0     No    Maintenance 
 
d20: Submirror of d0
    State: Okay
    Size: 47628 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t2d0s0          0     No    Okay  
 
d1: Mirror
    Submirror 0: d11
      State: Needs maintenance
    Submirror 1: d21
      State: Okay
...
 
d11: Submirror of d1
    State: Needs maintenance
    Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
    Size: 69660 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t3d0s1          0     No    Maintenance 
 
d21: Submirror of d1
    State: Okay
    Size: 69660 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t2d0s1          0     No    Okay        
 
d2: Mirror
    Submirror 0: d12
      State: Needs maintenance
    Submirror 1: d22
      State: Okay
...
 
d12: Submirror of d2
    State: Needs maintenance
    Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"
    Size: 286740 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t3d0s6          0     No    Maintenance 
 
 
d22: Submirror of d2
    State: Okay
    Size: 286740 blocks
    Stripe 0:
	Device              Start Block  Dbase State        Hot Spare
	/dev/dsk/c0t2d0s6          0     No    Okay

In this example, the metastat output shows that the following submirrors need maintenance:

Submirror d10, device c0t3d0s0
Submirror d11, device c0t3d0s1
Submirror d12, device c0t3d0s6
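
A summary like the one above can be derived from metastat output with a short filter. The following is a sketch only: the function name maintenance_components is hypothetical, and it assumes the layout shown in this example, where each submirror block starts with a line of the form "dNN: Submirror of dN" and failed components end with the word Maintenance.

```shell
#!/bin/sh
# Hypothetical helper: print "submirror device" for every component in the
# Maintenance state, reading metastat output from stdin.
maintenance_components() {
    awk '
        # Remember which submirror block we are in, e.g. "d10".
        /^d[0-9]+: Submirror of/ { sm = $1; sub(/:$/, "", sm) }
        # Component lines start with a /dev path and end with the state.
        $1 ~ /^\/dev\// && $NF == "Maintenance" {
            dev = $1
            sub(/^\/dev\/dsk\//, "", dev)
            print sm, dev
        }
    '
}

# Abbreviated sample modeled on the metastat listing above.
metastat_output='d10: Submirror of d0
    State: Needs maintenance
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        /dev/dsk/c0t3d0s0          0     No    Maintenance
d20: Submirror of d0
    State: Okay
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        /dev/dsk/c0t2d0s0          0     No    Okay
d11: Submirror of d1
    State: Needs maintenance
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        /dev/dsk/c0t3d0s1          0     No    Maintenance'

# Prints "d10 c0t3d0s0" and "d11 c0t3d0s1", one per line.
printf '%s\n' "$metastat_output" | maintenance_components
```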

Halt the system, repair the disk, and use the format(1M) command or the fmthard(1M) command to partition the disk as it was before the failure.
# halt
...
Halted
...
ok boot
...
# format /dev/rdsk/c0t3d0s0

Reboot.

Note that you must reboot from the other half of the root (/) mirror. You should have recorded the alternate boot device when you created the mirror.
# halt
...
ok boot disk2

To delete the failed state database replicas and then add them back, use the metadb(1M) command.

# metadb
       flags         first blk    block count
    M     p          unknown      unknown      /dev/dsk/c0t3d0s3
    M     p          unknown      unknown      /dev/dsk/c0t3d0s3
    a m  p  luo      16           1034         /dev/dsk/c0t2d0s3
    a    p  luo      1050         1034         /dev/dsk/c0t2d0s3
    a    p  luo      16           1034         /dev/dsk/c0t1d0s3
    a    p  luo      1050         1034         /dev/dsk/c0t1d0s3
# metadb -d c0t3d0s3
# metadb -c 2 -a c0t3d0s3
# metadb
       flags         first blk    block count
     a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
     a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
     a    p  luo     16           1034         /dev/dsk/c0t1d0s3
     a    p  luo     1050         1034         /dev/dsk/c0t1d0s3
     a        u      16           1034         /dev/dsk/c0t3d0s3
     a        u      1050         1034         /dev/dsk/c0t3d0s3

Use the metareplace(1M) command to re-enable the submirrors.

# metareplace -e d0 c0t3d0s0
Device /dev/dsk/c0t3d0s0 is enabled
 
# metareplace -e d1 c0t3d0s1
Device /dev/dsk/c0t3d0s1 is enabled
 
# metareplace -e d2 c0t3d0s6
Device /dev/dsk/c0t3d0s6 is enabled
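
The metareplace -e commands used in this step can be derived from the Invoke lines that metastat prints. The sketch below only prints the commands and does not run them. It assumes that the repaired disk keeps the same slice names, as in this example; the function name enable_commands is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn metastat "Invoke:" hints (stdin) into
# metareplace -e command lines. The commands are printed, not executed.
enable_commands() {
    awk '$1 == "Invoke:" {
        dev = $4
        sub(/^\/dev\/dsk\//, "", dev)   # keep only the slice name
        print "metareplace -e", $3, dev
    }'
}

# Invoke lines copied from the metastat listing earlier in this task.
invoke_lines='    Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
    Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
    Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"'

printf '%s\n' "$invoke_lines" | enable_commands
```

Review the printed commands before running any of them as root.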

After some time, the resyncs will complete. You can now return to booting from the original device.

How to Record the Path to the Alternate Boot Device (Command Line)

When mirroring root (/), you might need the path to the alternate boot device later if the primary device fails.

Example -- SPARC: Recording the Alternate Boot Device Path

In this example, you would determine the path to the alternate root device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.

# ls -l /dev/rdsk/c1t3d0s0
lrwxrwxrwx 1  root root  55 Mar 5 12:54  /dev/rdsk/c1t3d0s0 -> \ 
../../devices/sbus@1,f8000000/esp@1,200000/sd@3,0:a

Here you would record the string that follows the /devices directory: /sbus@1,f8000000/esp@1,200000/sd@3,0:a.
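
Extracting that string can be sketched in one line of sed. The function name boot_path_from_link is hypothetical, and the input is the link target exactly as ls -l prints it.

```shell
#!/bin/sh
# Hypothetical helper: reduce a symbolic-link target to the part that
# follows the /devices directory.
boot_path_from_link() {
    printf '%s\n' "$1" | sed 's|^.*/devices/|/|'
}

# Prints: /sbus@1,f8000000/esp@1,200000/sd@3,0:a
boot_path_from_link '../../devices/sbus@1,f8000000/esp@1,200000/sd@3,0:a'
```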

On some newer Sun hardware, you must change the name taken from the /devices directory from sd@ to disk@.
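
As a sketch, that rename is a single substitution on the recorded path. The function name to_disk_name is hypothetical, and whether the rename is needed depends on the hardware.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the sd@ component of a device path as disk@.
to_disk_name() {
    printf '%s\n' "$1" | sed 's|/sd@|/disk@|'
}

# Prints: /sbus@1,f8000000/esp@1,200000/disk@3,0:a
to_disk_name '/sbus@1,f8000000/esp@1,200000/sd@3,0:a'
```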

DiskSuite users whose systems have an OpenBoot PROM can use the OpenBoot nvalias command to define a "backup_root" devalias for the secondary root mirror. For example:

ok  nvalias backup_root /sbus@1,f8000000/esp@1,200000/sd@3,0:a

If the primary root disk fails, you would then need to enter only:

ok  boot backup_root

Example -- x86: Recording the Alternate Boot Device Path

In this example, you would determine the path to the alternate boot device by using the ls -l command on the slice that is being attached as the second submirror to the root (/) mirror.

# ls -l /dev/rdsk/c1t0d0s0
lrwxrwxrwx 1  root root  55 Mar 5 12:54  /dev/rdsk/c1t0d0s0 -> \
../../devices/eisa/eha@1000,0/cmdk@1,0:a

Here you would record the string that follows the /devices directory: /eisa/eha@1000,0/cmdk@1,0:a

SPARC: How to Boot From the Alternate Device (Command Line)

To boot a SPARC system from the alternate boot device, type:

ok boot alternate-boot-device

The procedure "How to Record the Path to the Alternate Boot Device (Command Line)" describes how to determine the alternate boot device.

x86: How to Boot From the Alternate Device (Command Line)

Use this task to boot an x86 system from the alternate boot device.

Boot your system from the Multiple Device Boot (MDB) diskette.

After a moment, a screen similar to the following is displayed:

Solaris/x86 Multiple Device Boot Menu
Code    Device    Vendor     Model/Desc          Rev
============================================================
 
10      DISK      COMPAQ      C2244              0BC4
11      DISK      SEAGATE     ST11200N SUN1.05   8808
12      DISK      MAXTOR      LXT-213S SUN0207   4.24
13      CD        SONY        CD-ROM CDU-8812    3.0a
14      NET       SMC/WD      I/O=300 IRQ=5
80      DISK      First IDE drive (Drive C:)
81      DISK      Second IDE drive (Drive D:)
 
Enter the boot device code:

Enter your alternate disk code from the choices listed on the screen. The following is displayed:

Solaris 2.4 for x86                Secondary Boot Subsystem,vsn 2.11
 
                  <<<Current Boot Parameters>>>
Boot path:/eisa/eha@1000,0/cmdk@0,0:a
Boot args:/kernel/unix
 
Type b[file-name] [boot-flags] <ENTER>     to boot with options
or   i<ENTER>                              to enter boot interpreter
or   <ENTER>                               to boot with defaults
 
                    <<<timeout in 5 seconds>>>

Type i to select the interpreter.

Type the following commands:
>setprop boot-path /eisa/eha@1000,0/cmdk@1,0:a
>^D
The Control-D character sequence quits the interpreter.