CHAPTER 8

Recovering From Catastrophic Failure

Certain events can be classified as catastrophic failures. These include the damage caused by natural disasters, such as flooding in a computer room. This chapter provides a procedure to follow after such an event. You might require assistance from your ASP or from Sun Microsystems customer support to successfully complete the procedures described in this chapter.


To Recover From a Catastrophic Failure

Do not recover any system component, software element, or SAM-QFS file system that has not failed. However, you might need to reconfigure the SAM-QFS file system on a restored system to regain access to file systems or to determine whether any file system has failed. For details on performing these tasks, see the other chapters of this manual.

1. Determine the failed system component.

See To Restore Failed System Components.

2. Disable the archiver and the recycler until all files are restored.

See To Disable the Archiver and Recycler Until All Files Are Restored.

3. Compare previous and current configuration files, and reconcile inconsistencies.

See To Keep and Compare Previous and Current Configuration and Log Files.

4. Repair disks.

See To Repair Disks.

5. Restore or build new library catalog files.

See To Restore or Build New Library Catalog Files.

6. Make new file systems and restore from samfsdump output.

See To Make New File Systems and Restore From samfsdump Output.


To Restore Failed System Components

1. Ascertain which components have failed.

The following steps describe how to restore failed hardware, the Solaris operating environment, and the Sun StorEdge SAM-FS or Sun StorEdge QFS packages.

2. If a hardware component has failed, restore it to operation, preserving any available data.

If the failing component is a disk drive that has not totally failed, preserve any information possible. Before replacing or reformatting the disk, identify any salvageable files (including those in the following list), and copy these files to a tape or to another disk for future use in the recovery process.

3. If the Solaris operating environment has failed, restore it to operation.

See Recovering From Failure of the Operating Environment Disk. Verify that the Solaris operating environment is functioning correctly before proceeding.

4. If the Sun StorEdge SAM-FS or Sun StorEdge QFS packages have been damaged, remove them and reinstall them from a backup copy or from their distribution files.

You can verify whether a package has been damaged by using the pkgchk(1M) utility.

5. If disk hardware used by the Sun StorEdge SAM-FS software was repaired or replaced in Step 2, configure the disks (RAID binding or mirroring) if necessary.

Reformat disks only if they have been replaced or if it is otherwise absolutely necessary, because reformatting destroys all the file system information.


To Disable the Archiver and Recycler Until All Files Are Restored



Caution - If the recycler is enabled so that it runs before all files are restored, cartridges with good archive copies may be improperly relabeled.



1. Add a single global wait directive to the archiver.cmd file or add a file-system-specific wait directive for each file system for which you want to disable archiving.



Note - The wait directive can be applied globally or individually to one or more file systems.



a. Open the /etc/opt/SUNWsamfs/archiver.cmd file for editing and find the section where you want to insert the wait directive.

CODE EXAMPLE 8-1 shows the vi(1) command being used to edit the file. In the example, local archiving directives exist for two file systems, samfs1 and samfs2.


CODE EXAMPLE 8-1 Example archiver.cmd File
# vi /etc/opt/SUNWsamfs/archiver.cmd
...
fs = samfs1
allfiles   .
1   10s
fs = samfs2
allfiles   .
1   10s

b. Add the wait directive.

CODE EXAMPLE 8-2 shows a global wait directive inserted before the first fs = command (fs = samfs1).


CODE EXAMPLE 8-2 Example archiver.cmd File with a Global wait Directive
wait
fs = samfs1
allfiles   .
1   10s
fs = samfs2
allfiles   .
1   10s
:wq

CODE EXAMPLE 8-3 shows two file-system-specific wait directives inserted after the first and second fs = commands (fs = samfs1 and fs = samfs2).


CODE EXAMPLE 8-3 Example archiver.cmd File with File-System-Specific wait Directives
fs = samfs1
wait
allfiles   .
1   10s
fs = samfs2
wait
allfiles   .
1   10s
:wq

2. Add a global ignore directive to the recycler.cmd file or add a library-specific ignore directive for each library for which you want to disable recycling.

a. Open the /etc/opt/SUNWsamfs/recycler.cmd file for editing.

CODE EXAMPLE 8-4 shows using the vi(1) command to edit the file.


CODE EXAMPLE 8-4 Example recycler.cmd File
# vi /etc/opt/SUNWsamfs/recycler.cmd
...
         logfile = /var/adm/recycler.log
         lt20 -hwm 75 -mingain 60 
         lt20 75 60 
         hp30 -hwm 90 -mingain 60 -mail root
         gr47 -hwm 95 -mingain 60 -mail root

b. Add the ignore directives.

CODE EXAMPLE 8-5 shows ignore directives added for three libraries.


CODE EXAMPLE 8-5 Example recycler.cmd File with ignore Directives
#  recycler.cmd.after  - example recycler.cmd file
#
         logfile = /var/adm/recycler.log
         lt20 -hwm 75 -mingain 60 -ignore
         hp30 -hwm 90 -mingain 60 -ignore -mail root
         gr47 -hwm 95 -mingain 60 -ignore -mail root
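
The manual edits in this procedure can also be scripted. The following is a minimal sketch, not part of the Sun StorEdge SAM-FS software. The function names add_global_wait and add_ignore are hypothetical, and both functions only print the modified file to standard output so that you can review the result before you replace the real file. The sketch assumes the file layouts shown in CODE EXAMPLE 8-1 and CODE EXAMPLE 8-4, a POSIX shell, and a POSIX awk (on Solaris systems, nawk or /usr/xpg4/bin/awk rather than the older /usr/bin/awk).

```shell
#!/bin/sh
# Sketch only: add_global_wait and add_ignore are hypothetical helper
# names, not SAM-FS commands. Both print to stdout and never edit in place.

# Print an archiver.cmd with "wait" inserted before the first "fs =" line.
# Nothing is added if a wait line already precedes the first "fs =" line,
# or if the file contains no "fs =" line at all.
add_global_wait() {
    awk '
        !seen_fs && /^[ \t]*wait[ \t]*$/ { has_wait = 1 }
        !seen_fs && /^[ \t]*fs[ \t]*=/ {
            seen_fs = 1
            if (!has_wait) print "wait"
        }
        { print }
    ' "$1"
}

# Print a recycler.cmd with -ignore added to the line of each named library.
# Usage: add_ignore FILE LIBRARY...
add_ignore() {
    file=$1
    shift
    awk -v libs="$*" '
        BEGIN {
            n = split(libs, a, " ")
            for (i = 1; i <= n; i++) want[a[i]] = 1
        }
        ($1 in want) && $0 !~ /-ignore/ {
            sub(/[ \t]+$/, "")                        # drop trailing blanks
            if ($0 ~ /[ \t]-mail[ \t]/)
                sub(/[ \t]+-mail[ \t]/, " -ignore&")  # place -ignore ahead of -mail
            else
                $0 = $0 " -ignore"
        }
        { print }
    ' "$file"
}
```

For example, add_global_wait /etc/opt/SUNWsamfs/archiver.cmd prints the file with a global wait directive, and add_ignore /etc/opt/SUNWsamfs/recycler.cmd lt20 hp30 gr47 prints the file with the three library lines changed as in CODE EXAMPLE 8-5. Redirect the output to a scratch file, compare it with the original, and install it only after you have reviewed it.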


To Keep and Compare Previous and Current Configuration and Log Files

1. Recover any available Sun StorEdge SAM-FS configuration files or archiver log files from the system's disks before rebuilding the system.

2. Compare the restored versions of all configuration files represented in the SAMreport with those restored from the system backups.

3. If inconsistencies exist, determine the effect of the inconsistencies and reinstall the Sun StorEdge QFS file system, if necessary, using the configuration information in the SAMreport.

For more information on the SAMreport file, see the samexplorer(1M) man page.
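
The comparison in Step 2 can be sketched with standard utilities. The following example is an illustration under stated assumptions, not a Sun-supplied tool: compare_configs is a hypothetical function name, and the two directory arguments stand for wherever you have placed the two sets of recovered configuration files. It checks files by name in one direction only, so files that exist only in the second directory are not reported.

```shell
#!/bin/sh
# Sketch only: compare_configs is a hypothetical helper, and the directory
# arguments are whatever locations hold your two sets of recovered files.

# For every regular file in directory $1, report how it compares with the
# file of the same name in directory $2. Returns 1 if any file is missing
# from $2 or differs, and 0 otherwise.
compare_configs() {
    status=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if [ ! -f "$2/$name" ]; then
            echo "MISSING   $name"
            status=1
        elif cmp -s "$f" "$2/$name"; then
            echo "SAME      $name"
        else
            echo "DIFFERENT $name"
            status=1
        fi
    done
    return $status
}
```

For each file reported as DIFFERENT, run the diff(1) command on the two copies to see the specific lines that need to be reconciled.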


To Repair Disks

For SAM-QFS file systems that reside on disks that have not been replaced, run the samfsck(1M) utility to repair small inconsistencies, reclaim lost blocks, and so on.

For command line options to the samfsck utility, see the samfsck(1M) man page.


To Restore or Build New Library Catalog Files

1. Restore the most recent library catalog file copies from the removable media files, from the Sun StorEdge SAM-FS server disks, or from the most recent file system archive copies (which are likely to be slightly out of date).

2. If the library catalogs are unavailable, build new catalogs by using the build.cat(1M) command and the library catalog section of the most recent SAMreport as input.

Use the newest library catalog copy available for each automated library.



Note - Sun StorEdge SAM-FS systems automatically rebuild library catalogs for SCSI-attached automated libraries. This does not occur for ACSLS-attached automated libraries. Tape usage statistics are lost.




To Make New File Systems and Restore From samfsdump Output

For those SAM-QFS file systems that were resident (partially or totally) on disks that were replaced or reformatted, perform the following procedure.

1. Obtain the most recent copy of the samfsdump(1M) output file.

2. Make a new file system and restore the SAM-QFS file system using the samfsdump output file.

a. Use the sammkfs(1M) command to make a new file system.

CODE EXAMPLE 8-6 shows this process.


CODE EXAMPLE 8-6 Using the sammkfs(1M) Command
# mkdir /sam1
# sammkfs samfs1
# mount samfs1

b. Use the samfsrestore(1M) command with the -f option and the -g option.

Specify the location of the samfsdump output file after the -f option. Specify the name of a log file after the -g option. The -g option creates a log of the files that had been online. The following example shows this:


# cd /sam1
# samfsrestore -f /dump_sam1/dumps/040120 -g /var/adm/messages/restore_log



Note - Once all file systems have been restored, the system can be made available to users in degraded mode.



3. On the file systems restored in Step 2, perform the following steps:

a. Run the restore.sh(1M) script against the log file created in Step 2b to stage all files that were known to be online before the outage.

b. Run the sfind(1M) command against the SAM-QFS file system to determine which files are labeled as damaged.

These files might or might not be restorable from tape, depending on the content of the archive log files. Determine the most recently available archive log files from one of the following sources:

c. Run the grep(1) command against the most recent archive log file to search for the damaged files, to determine whether any of the damaged files were archived to tape since the last time the samfsdump(1M) command was run.

d. Examine the archive log files to identify any archived files that do not exist in the file system.

e. Use the star(1M) command to restore files from the archive media and to restore files that have been labeled as damaged.

These are the files identified in Step c and Step d.

4. Reimplement disaster recovery scripts, methods, and cron(1M) jobs using information from the backup copies.
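
The archive log search in Step 3c can be sketched as follows. This is a minimal illustration, not a Sun-supplied script: find_in_archive_log and not_in_archive_log are hypothetical function names, and the sketch assumes that you have saved the list of damaged files (for example, from sfind output) to a text file with one path per line, written in the same form in which paths appear in your archive log. On Solaris systems, the -F and -f options require /usr/xpg4/bin/grep.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.
#   $1  file that lists one damaged path per line, in the same form
#       in which paths appear in the archive log
#   $2  archive log file

# Print the archive log lines that mention any listed path.
# Matching is by fixed substring, so "dir/a" also matches "dir/ab".
find_in_archive_log() {
    grep -F -f "$1" "$2"
}

# Print the listed paths that have no entry in the log at all. These
# files cannot be located on archive media by using this log.
not_in_archive_log() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        grep -F -q -- "$path" "$2" || printf '%s\n' "$path"
    done < "$1"
}
```

The lines printed by find_in_archive_log identify the archive entries to pass to the restore in Step 3e. Because the match is a substring match, review the output for unintended hits before you rely on it.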