CHAPTER 8

Recovering From Catastrophic Failure

Certain events, such as flooding in a computer room, can be classified as catastrophic failures. This chapter describes the procedure to follow after such an event. You might require the assistance of your ASP or Sun Microsystems customer support.

This chapter contains two sections: Recovery Task Overview and Recovery Procedures.


Recovery Task Overview

Do not recover any system component, software element, or SAM-QFS file system that has not failed. However, you might need to reconfigure the SAM-QFS file system on a restored system to regain access to file systems or to determine whether any file system has failed. For details on performing these tasks, see the other sections of this chapter.

The process of recovering from a catastrophic failure involves the following steps:

1. Determining the failed system component

See To Restore Failed System Components.

2. Disabling the archiver and the recycler until all files are restored

See To Disable the Archiver and Recycler Until All Files Are Restored.

3. Comparing previous and current configuration files, and reconciling inconsistencies

See To Keep and Compare Previous and Current Configuration and Log Files.

4. Repairing disks

See To Repair Disks.

5. Restoring or building new library catalog files

See To Restore or Build New Library Catalog Files.

6. Making new file systems and restoring from samfsdump output

See To Make New File Systems and Restore From samfsdump Output.


Recovery Procedures

This section details the procedures involved in recovering from a catastrophic failure.


To Restore Failed System Components

1. Ascertain which components have failed.

2. If a hardware component has failed, restore it to operation, preserving any available data.

If the failing component is a disk drive that has not totally failed, preserve as much information as possible. Before replacing or reformatting the disk, identify any salvageable files, and copy these files to a tape or to another disk for future use in the recovery process. Salvageable files to identify and copy include the following:

3. If the Solaris Operating System (OS) has failed, restore it to operation.

See Recovering From Failure of the Operating Environment Disk. Verify that the Solaris OS is functioning correctly before proceeding.

4. If the Sun StorageTek SAM or Sun StorageTek QFS package has been damaged, remove and reinstall it from a backup copy or from its distribution file.

You can verify whether a package has been damaged by using the pkgchk(1M) utility.

5. If disk hardware used by the Sun StorageTek SAM software was repaired or replaced in Step 2, configure the disks (for RAID binding or mirroring) as necessary.

Reformat disks only if they have been replaced or if it is otherwise absolutely necessary.
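The package check in Step 4 can be sketched as a small shell function. The package names used here (SUNWsamfsr and SUNWsamfsu) are assumptions for illustration; list the packages that are actually installed on your system before relying on them. The guard around pkgchk only keeps the sketch from failing on a host where the utility is absent.

```shell
# Report the pkgchk result for each package name given as an argument.
check_pkgs() {
    if ! command -v pkgchk >/dev/null 2>&1; then
        echo "pkgchk not available on this host"
        return 0
    fi
    for p in "$@"; do
        # pkgchk exits with a nonzero status when it finds a problem.
        if pkgchk "$p"; then
            echo "$p: no problems reported"
        else
            echo "$p: damaged or not installed"
        fi
    done
}

# Package names are assumptions; substitute the ones installed at your site.
check_pkgs SUNWsamfsr SUNWsamfsu
```

A package that is reported as damaged is a candidate for the remove-and-reinstall step described above.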


To Disable the Archiver and Recycler Until All Files Are Restored




Caution - If the recycler is enabled so that it runs before all files are restored, cartridges with good archive copies might be improperly relabeled.



1. Add a single global wait directive to the archiver.cmd file or add a file-system-specific wait directive for each file system for which you want to disable archiving.

a. Open the /etc/opt/SUNWsamfs/archiver.cmd file for editing and find the section in which you want to insert the wait directive.

In the following sample file, local archiving directives exist for two file systems, samfs1 and samfs2.


# vi /etc/opt/SUNWsamfs/archiver.cmd
...
fs = samfs1
allfiles   .
1   10s
fs = samfs2
allfiles   .
1   10s

b. Add the wait directive.

2. Add a global ignore directive to the recycler.cmd file, or add a library-specific ignore directive for each library for which you want to disable recycling.

a. Open the /etc/opt/SUNWsamfs/recycler.cmd file for editing, as shown in the following example.


# vi /etc/opt/SUNWsamfs/recycler.cmd
...
         logfile = /var/adm/recycler.log
         lt20 -hwm 75 -mingain 60 
         lt20 75 60 
         hp30 -hwm 90 -mingain 60 -mail root
         gr47 -hwm 95 -mingain 60 -mail root

b. Add the ignore directives.

The following example shows ignore directives added for three libraries.


#  recycler.cmd.after  - example recycler.cmd file
#
         logfile = /var/adm/recycler.log
         lt20 -hwm 75 -mingain 60 -ignore
         hp30 -hwm 90 -mingain 60 -ignore -mail root
         gr47 -hwm 95 -mingain 60 -ignore -mail root
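The two edits in this procedure can be rehearsed on scratch copies before the live files are touched. The following is a minimal sketch, not a supported tool. It assumes that a global wait directive belongs ahead of the first fs = line and that a file-system-specific wait directive belongs directly after its fs = line. It also treats any recycler.cmd line that carries -hwm as a library line, which is a heuristic introduced here. The scratch directory and the sample recycler.cmd, in which one library was deliberately left without -ignore, are illustrative.

```shell
# Scratch copies keep the live files under /etc/opt/SUNWsamfs untouched.
work=$(mktemp -d)

cat > "$work/archiver.cmd" <<'EOF'
fs = samfs1
allfiles   .
1   10s
fs = samfs2
allfiles   .
1   10s
EOF

# Global form: a single "wait" ahead of the first "fs =" line.
awk '!seen && /^fs[ \t]*=/ { print "wait"; seen = 1 } { print }' \
    "$work/archiver.cmd" > "$work/archiver.cmd.global"

# File-system-specific form: "wait" directly after each "fs =" line.
awk '{ print } /^fs[ \t]*=/ { print "wait" }' \
    "$work/archiver.cmd" > "$work/archiver.cmd.perfs"

# Sample recycler.cmd in which one library line was deliberately missed.
cat > "$work/recycler.cmd" <<'EOF'
logfile = /var/adm/recycler.log
lt20 -hwm 75 -mingain 60 -ignore
hp30 -hwm 90 -mingain 60 -ignore -mail root
gr47 -hwm 95 -mingain 60 -mail root
EOF

# Heuristic (an assumption): lines carrying -hwm are library lines.
missing=$(grep -e '-hwm' "$work/recycler.cmd" | grep -v -e '-ignore' | awk '{ print $1 }')
echo "libraries without -ignore: ${missing:-none}"
```

With the sample above, the final check names gr47, which is the kind of omission that the preceding caution warns about.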


To Keep and Compare Previous and Current Configuration and Log Files

Follow these steps before rebuilding the system.

1. Recover any available Sun StorageTek SAM configuration files or archiver log files from the system's disks.

2. Compare the restored versions of all configuration files represented in the SAMreport with those restored from the system backups.

3. If inconsistencies exist, determine the effect of the inconsistencies and reinstall the Sun StorageTek QFS file system, if necessary, using the configuration information in the SAMreport file.

For more information on the SAMreport file, see the samexplorer(1M) man page.
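The comparison in Step 2 can be sketched with diff(1) run over two directories, one holding the files recovered from the system's disks and one holding the files restored from backup. The directory names and file contents below are stand-ins created only so that the sketch runs on its own.

```shell
# Stand-in directories; in practice these would hold the recovered
# and the restored configuration files.
work=$(mktemp -d)
mkdir "$work/from_disk" "$work/from_backup"
printf 'fs = samfs1\nallfiles   .\n1   10s\n' > "$work/from_disk/archiver.cmd"
printf 'fs = samfs1\nallfiles   .\n1   20s\n' > "$work/from_backup/archiver.cmd"

# diff -r exits 0 when the two trees match and 1 when they differ.
if diff -r "$work/from_disk" "$work/from_backup" > "$work/config.diff"; then
    echo "configuration files match"
else
    echo "inconsistencies found; review $work/config.diff"
fi
```

Saving the differences to a file gives you a record to work through when you decide, in Step 3, what effect each inconsistency has.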


To Repair Disks

For SAM-QFS file systems that reside on disks that have not been replaced, run the samfsck(1M) utility to repair small inconsistencies, reclaim lost blocks, and so on.

For command-line options to the samfsck utility, see the samfsck(1M) man page.
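A cautious way to apply this step is to run a report-only pass first and repair afterward. The sketch below assumes the family set name samfs1 from this chapter's examples. It relies on samfsck reporting without changing anything when run with no options and on the -F option requesting repairs; confirm both, and any requirement to unmount the file system first, against the man page for your release.

```shell
# Run a report-only check of one file system.
check_fs() {
    fs=$1
    if command -v samfsck >/dev/null 2>&1; then
        samfsck "$fs"
    else
        echo "samfsck not found; run this on the SAM-QFS host"
    fi
}

check_fs samfs1

# After reviewing the report, a repair pass would be:
#   samfsck -F samfs1
```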


To Restore or Build New Library Catalog Files

1. Restore the most recent library catalog file copies from the removable media files, from the Sun StorageTek SAM server disks, or from the most recent file system archive copies.

2. If the library catalogs are unavailable, build new catalogs by using the build.cat(1M) command and the library catalog section of the most recent SAMreport as input.

Use the newest library catalog copy available for each automated library.



Note - Sun StorageTek SAM systems automatically rebuild library catalogs for SCSI-attached automated libraries. This does not occur for ACSLS-attached automated libraries. Tape usage statistics are lost.




To Make New File Systems and Restore From samfsdump Output

Follow these steps for SAM-QFS file systems that were partially or completely resident on disks that were replaced or reformatted.

1. Obtain the most recent copy of the samfsdump(1M) output file.

2. Make a new file system and restore the SAM-QFS file system using the samfsdump output file.

a. Use the sammkfs(1M) command to make a new file system.


# mkdir /sam1
# sammkfs samfs1
# mount samfs1

b. Use the samfsrestore(1M) command with the -f and -g options, in the following syntax:


samfsrestore -f output-file-location -g log-file

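where output-file-location is the location of the samfsdump output file, and log-file is the path of the log file that the restore creates. This log lists the files that were online at the time of the dump and is used in Step 3a.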

For example:


# cd /sam1
# samfsrestore -f /dump_sam1/dumps/040120 -g /var/adm/messages/restore_log



Note - Once all file systems have been restored, the system can be made available to users in degraded mode.



3. On the file systems you have just restored, perform the following steps:

a. Run the restore.sh script against the log file, and stage all files that were known to be online before the outage. In a shared environment, this script must be run on the metadata server.

b. Run the sfind(1M) command against the SAM-QFS file system to determine which files are labeled as damaged.

These files might or might not be restorable from tape, depending on the content of the archive log files. Determine the most recently available archive log files from one of the following sources, in this order:

c. Run the grep(1) command against the most recent archive log file to search for the damaged files.

This will enable you to determine whether any of the damaged files were archived to tape after the last time the samfsdump(1M) command was run.

d. Examine the archive log files to identify any archived files that do not exist in the file system.

e. Use the star(1M) command to restore the damaged and nonexistent files identified in Step c and Step d.

4. Reimplement disaster recovery scripts, methods, and cron(1M) jobs using information from the backup copies.
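The search in Step 3c can be sketched as a fixed-string grep that reads the damaged file names from a list. The file names damaged.list and archiver.log are hypothetical, and the log lines below are stand-ins that only contain a path; they do not reproduce the real archiver log layout.

```shell
work=$(mktemp -d)

# damaged.list: one path per line, as collected in Step 3b.
printf '%s\n' 'projects/a.dat' 'projects/b.dat' > "$work/damaged.list"

# archiver.log: stand-in lines, NOT the real archiver log format.
cat > "$work/archiver.log" <<'EOF'
sample-entry-1 projects/a.dat
sample-entry-2 projects/c.dat
EOF

# -F treats each pattern as a fixed string; -f reads the patterns from a file.
grep -F -f "$work/damaged.list" "$work/archiver.log" > "$work/archived_damaged.txt"
cat "$work/archived_damaged.txt"
```

Each line written to archived_damaged.txt points to a damaged file that has an archive log entry and is therefore a candidate for the star restore in Step 3e.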