2 Stabilizing the Situation

Whenever you are faced with recovering from a significant file-system failure or potential data loss, your first steps should stabilize the affected systems, minimize the chances of further losses, and preserve diagnostic information, where possible. This chapter outlines the actions that you need to take: stopping archiving and recycling processes, preserving unarchived data, and preserving configuration and state information.

Stopping Archiving and Recycling Processes

When you have to restore an archiving file system or significant numbers of lost files, you should first stop the archiving and recycling processes for the file system. You want to stabilize and isolate the archive until you have assessed the situation and, ideally, restored everything to normal. Otherwise, ongoing archiving and recycling operations can, in some situations, make matters worse. Archiving and staging processes may propagate corrupted files. Recycling processes may delete the only remaining copies of valid data.

So, whenever possible, take both of the precautions described below: stop archiving, and then stop recycling.

Once recovery operations are complete, you can reverse the changes below and restore normal file system behavior.

Stop Archiving

  1. Log in to the file-system metadata server as root.

    root@mds1:~# 
    
  2. Open the /etc/opt/SUNWsamfs/archiver.cmd file in a text editor, and scroll down to the first fs (file-system) directive.

    In the example, we use the vi editor:

    root@mds1:~# vi /etc/opt/SUNWsamfs/archiver.cmd
    # Configuration file for Oracle HSM archiving file systems
    #-----------------------------------------------------------------------
    # General Directives
    archivemeta = off
    examine = noscan
    #-----------------------------------------------------------------------
    # Archive Set Assignments 
    fs = hsmfs1
    logfile = /var/adm/hsmfs1.archive.log
    all .
        1 -norelease 15m
        2 -norelease 15m
        3 -norelease 15m
    fs = hsmfs2
    logfile = /var/adm/hsmfs2.archive.log
    all .
    ...
    
  3. If you need to stop archiving on all file systems, insert a wait directive just before the first fs directive in the archiver.cmd file. Then save the file, and close the editor.

    In the example, we insert the wait directive just before the directive for the hsmfs1 file system, where it will apply to all file systems configured for archiving:

    root@mds1:~# vi /etc/opt/SUNWsamfs/archiver.cmd
    ...
    #-----------------------------------------------------------------------
    # Archive Set Assignments
    wait
    fs = hsmfs1
    logfile = /var/adm/hsmfs1.archive.log
    all .
        1 -norelease 15m
        2 -norelease 15m
        3 -norelease 15m
    fs = hsmfs2
    ...
    :wq
    root@mds1:~# 
    
  4. If you need to stop archiving on only one file system, insert a wait directive just after the fs directive for that file system. Save the archiver.cmd file, and close the editor.

    In the example, we stop archiving activity on the hsmfs1 file system:

    root@mds1:~# vi /etc/opt/SUNWsamfs/archiver.cmd
    ...
    #-----------------------------------------------------------------------
    # Archive Set Assignments
    fs = hsmfs1
    wait
    logfile = /var/adm/hsmfs1.archive.log
    all .
        1 -norelease 15m
        2 -norelease 15m
        3 -norelease 15m
    fs = hsmfs2
    ...
    :wq
    root@mds1:~# 
    
  5. Next, stop recycling.
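
If you prefer to script the edits shown in the steps above, a minimal sketch follows. It assumes a POSIX shell and a POSIX awk (nawk on older Solaris releases). The function names add_global_wait and add_fs_wait and the .bak backup suffix are hypothetical, not part of Oracle HSM. Review the edited file before you rely on it.

```shell
# Sketch: add a wait directive to an archiver.cmd-style file.
# add_global_wait, add_fs_wait, and the .bak suffix are hypothetical names.

# Insert "wait" just before the first fs directive (stops all archiving).
add_global_wait() {
    cmd_file=$1
    # Keep a copy of the file as it was before this edit.
    cp -p "$cmd_file" "$cmd_file.bak" || return 1
    awk '
        !done && /^[ \t]*fs[ \t]*=/ {
            # Skip the insert if the previous line is already "wait".
            if (prev != "wait") print "wait"
            done = 1
        }
        { prev = $1; print }
    ' "$cmd_file.bak" > "$cmd_file"
}

# Insert "wait" just after the fs directive for one file system.
add_fs_wait() {
    cmd_file=$1
    fs_name=$2
    cp -p "$cmd_file" "$cmd_file.bak" || return 1
    awk -v fs="$fs_name" '
        pending {
            # Line that follows the matching fs directive.
            if ($1 != "wait") print "wait"
            pending = 0
        }
        { print }
        /^[ \t]*fs[ \t]*=/ {
            name = $0
            sub(/^[ \t]*fs[ \t]*=[ \t]*/, "", name)
            sub(/[ \t]*$/, "", name)
            if (name == fs) pending = 1
        }
        END { if (pending) print "wait" }
    ' "$cmd_file.bak" > "$cmd_file"
}
```

For example, add_fs_wait /etc/opt/SUNWsamfs/archiver.cmd hsmfs1 would place the directive after the fs = hsmfs1 line, as in step 4. Each run overwrites the previous .bak copy, so move the first backup aside if you need to run a function more than once.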

Stop Recycling

  1. Log in to the file-system metadata server as root.

    root@mds1:~# 
    
  2. Open the /etc/opt/SUNWsamfs/recycler.cmd file in a text editor.

    In the example, we use the vi editor:

    root@mds1:~# vi /etc/opt/SUNWsamfs/recycler.cmd
    # Configuration file for Oracle HSM archiving file systems
    #-----------------------------------------------------------------------
    logfile = /var/adm/recycler.log
    no_recycle tp VOL[0-9][2-9][0-9]
    lib1 -hwm 95 -mingain 60
    
  3. Add the -ignore parameter to each recycling directive in the recycler.cmd file. Then save the file, and close the editor.

    The recycler.cmd file does not contain recycling directives unless you have configured recycling by library, rather than by archive sets. But check it now.

    In the example, we have one recycling directive for tape library lib1:

    root@mds1:~# vi /etc/opt/SUNWsamfs/recycler.cmd
    # Configuration file for Oracle HSM archiving file systems
    #-----------------------------------------------------------------------
    logfile = /var/adm/recycler.log
    no_recycle tp VOL[0-9][2-9][0-9]
    lib1 -hwm 95 -mingain 60 -ignore
    :wq
    root@mds1:~# 
    
  4. If you are recovering from loss of or damage to one or more archiving file systems, back up unarchived files before proceeding.

  5. If you are recovering from a server problem or from loss or damage to file systems, save the Oracle HSM configuration before proceeding.

  6. If you need to restore directories and files, decide whether you need to save the Oracle HSM configuration or go directly to Chapter 5, "Recovering Lost and Damaged Files".
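
The edit in step 3 can also be scripted. The sketch below appends -ignore to the recycling directive of each library that you name, and leaves logfile, no_recycle, and all other lines untouched. The function name add_recycler_ignore and the .bak backup suffix are hypothetical, and the caller is assumed to know the library names (lib1 in the example above).

```shell
# Sketch: append -ignore to the recycling directives of named libraries.
# add_recycler_ignore and the .bak suffix are hypothetical names.
add_recycler_ignore() {
    cmd_file=$1
    shift
    libs="$*"                      # remaining arguments are library names
    cp -p "$cmd_file" "$cmd_file.bak" || return 1
    awk -v libs="$libs" '
        BEGIN {
            n = split(libs, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        # A directive line starts with a library name. Leave it alone
        # if it already carries the -ignore parameter.
        ($1 in want) && $0 !~ /(^|[ \t])-ignore([ \t]|$)/ {
            $0 = $0 " -ignore"
        }
        { print }
    ' "$cmd_file.bak" > "$cmd_file"
}
```

For example, add_recycler_ignore /etc/opt/SUNWsamfs/recycler.cmd lib1 would produce the line lib1 -hwm 95 -mingain 60 -ignore from the file shown in step 2. Running the function again leaves that line unchanged.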

Preserving Unarchived Data

Unarchived files may remain in the disk cache of a damaged archiving file system. No copies of these files exist in the archive. So, if you can, back them up to a recovery point file now. Proceed as follows:

Back Up Unarchived Files

  1. Log in to the file-system metadata server as root.

    root@mds1:~# 
    
  2. Select a safe storage location for the recovery point.

    In the example, we create a subdirectory, unarchived/, under a directory that we created for recovery points during initial configuration. The /zfs1 file system has no devices in common with /hsmfs1, the file system that we are recovering:

    root@mds1:~# mkdir /zfs1/hsmfs_recovery/unarchived/
    root@mds1:~# 
    
  3. Change to the file system's root directory.

    In the example, we change to the mount-point directory /hsmfs1:

    root@mds1:~# cd /hsmfs1
    root@mds1:~# 
    
  4. Back up any unarchived files that remain in the disk cache. Use the command samfsdump -u -f recovery-point, where recovery-point is the path and file name of the output file.

    The -u option causes the samfsdump command to back up any data files that have not been archived. In the example, we save the recovery point file 20160325 to the remote directory /zfs1/hsmfs_recovery/unarchived/:

    root@mds1:~# samfsdump -u -f /zfs1/hsmfs_recovery/unarchived/20160325
    root@mds1:~# 
    
  5. If you are recovering from a server problem or from loss or damage to file systems, save the Oracle HSM configuration before proceeding.

  6. If you need to restore directories and files, decide whether you need to save the Oracle HSM configuration or go directly to Chapter 5, "Recovering Lost and Damaged Files".
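
Steps 2 through 4 can be combined in a small helper, sketched below. The function name backup_unarchived is hypothetical, and the date-based file name simply follows the 20160325 pattern used in the example. The check that the destination lies outside the damaged file system only compares path names. It does not verify that the two file systems use separate devices, so you still have to confirm that yourself.

```shell
# Sketch: back up unarchived files from one file system to a recovery point.
# backup_unarchived is a hypothetical helper; samfsdump -u -f is the command
# given in step 4 of the procedure.
backup_unarchived() {
    fs_mount=$1      # mount point of the damaged file system, e.g. /hsmfs1
    dest_dir=$2      # directory on an independent file system

    # Refuse a destination that sits inside the file system being saved.
    case "$dest_dir/" in
        "$fs_mount"/*)
            echo "error: $dest_dir is inside $fs_mount" >&2
            return 1 ;;
    esac

    mkdir -p "$dest_dir" || return 1
    recovery_point="$dest_dir/$(date +%Y%m%d)"

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$fs_mount" && samfsdump -u -f "$recovery_point" ) || return 1

    # Report where the recovery point file was written.
    echo "$recovery_point"
}
```

For example, backup_unarchived /hsmfs1 /zfs1/hsmfs_recovery/unarchived would run samfsdump from /hsmfs1 and print the path of the resulting recovery point file.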

Preserving Configuration and State Information

Even when you have safely stored backup copies of all configuration files and scripts needed for restoring the Oracle HSM software and file systems, it pays to preserve the current state of a failed system, if you can. Surviving configuration files and scripts may contain changes that were implemented since the complete configuration was last backed up. This can mean the difference between restoring the system to almost its exact pre-failure state and merely getting close. Log and trace files contain information that helps restore files and clarifies the causes of failures. For this reason, you should preserve whatever remains before you do anything else.

Save the Oracle HSM Configuration

  1. If possible, log in to the file-system metadata server as root.

    root@mds1:~# 
    
  2. Run the samexplorer command, create a SAMreport, and save the report in the directory that holds your backup configuration information. Use the command samexplorer path/hostname.YYYYMMDD.hhmmz.tar.gz, where path is the path to the chosen directory, hostname is the name of the Oracle HSM file system host, and YYYYMMDD.hhmmz is a date and time stamp.

    The default file name is /tmp/SAMreport.hostname.YYYYMMDD.hhmmz.tar.gz. In the example, we already have a directory for saving SAMreports, /zfs1/hsmcfg/. So we create the report in this directory:

    root@mds1:~# samexplorer /zfs1/hsmcfg/samhost1.20160325.1659MST.tar.gz
         Report name:     /zfs1/hsmcfg/samhost1.20160325.1659MST.tar.gz
         Lines per file:  1000
         Output format:   tar.gz (default) Use -u for unarchived/uncompressed.
     
         Please wait.............................................
         Please wait.............................................
         Please wait......................................
     
         The following files should now be ftp'ed to your support provider
         as ftp type binary.
     
         /zfs1/hsmcfg/samhost1.20160325.1659MST.tar.gz
    
  3. Copy the /etc/opt/SUNWsamfs/ directory and its contents to an independent file system.

    The /etc/opt/SUNWsamfs/ directory may contain any or all of the following:

    • mcf (the master configuration file for the Oracle HSM file systems)

    • archiver.cmd (configures the archiving process)

    • inquiry.conf (lists the vendor and product identification strings that SCSI devices report in response to an inquiry command)

    • scripts/* (locally customized Oracle HSM scripts)

    • defaults.conf (overrides specified, default parameter values)

    • diskvols.conf (identifies disk storage that is used for archiving)

    • hosts.family-set-name (defines server and client host names and IP addresses for a shared file system)

    • hosts.family-set-name.local (a local hosts file that specifies the network interfaces that a host uses to communicate with the other hosts of a shared file system)

    • preview.cmd (customizes the priorities of archiving and staging requests for volumes that are not currently loaded)

    • recycler.cmd (customizes the recycling process)

    • releaser.cmd (customizes the releasing process)

    • rft.cmd (controls the Oracle HSM file transfer service)

    • samfs.cmd (defines file system mount parameters)

    • stager.cmd (customizes the staging process)

    • samremote (the SAM-Remote server configuration file)

    • family-set-name (a SAM-Remote client configuration file)

    • network-attached-library (a parameters file for a network-attached library)

  4. Back up all surviving library catalogs, including the historian catalog. For each catalog, use the command dump_cat -V catalog-file, where catalog-file is the path and name of the catalog file. Redirect the output to dump-file in an independent file system.

    We will use the output of the dump_cat command to rebuild the catalogs on a replacement system. In the example, we first dump the catalog data for lib1 to the file lib1cat1.dump in a directory on the independent NFS-mounted file system zfs1. Then we dump the historian catalog:

    root@mds1:~# dump_cat -V /var/opt/SUNWsamfs/catalog/lib1 > /zfs1/hsmcfg/lib1cat1.dump
    root@mds1:~# dump_cat -V /var/opt/SUNWsamfs/catalog/historian > /zfs1/hsmcfg/historian1.dump
    
  5. Copy system configuration files that were modified during Oracle HSM installation and configuration to an independent file system. These may include:

    /etc/
         syslog.conf
         system
         vfstab
    /kernel/drv/
         sgen.conf
         samst.conf
         samrd.conf
         sd.conf
         ssd.conf
         st.conf
    /usr/kernel/drv/dst.conf
    
  6. Copy any custom shell scripts and crontab entries that you created as part of the Oracle HSM configuration to an independent file system.

    For example, if you created a crontab entry to manage creation of recovery points, you would save a copy now.

  7. Create a readme file that records the revision level of the currently installed software. Include Oracle HSM, Solaris, and Solaris Cluster (if applicable). Save the file on an independent file system with the other recovery information.

  8. If possible, save copies of downloaded Oracle HSM, Solaris, and Solaris Cluster packages on an independent file system.

    If you have the packages readily available, you can restore the software quickly, should it become necessary.

  9. If you are recovering from the loss of an Oracle HSM server host, go to Chapter 3, "Restoring the Oracle HSM Configuration".

  10. If you need to restore one or more Oracle HSM file systems, go to Chapter 4, "Recovering File Systems".

  11. If you need to restore directories and files, go to Chapter 5, "Recovering Lost and Damaged Files".
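
Steps 2 through 5 of the procedure above lend themselves to a single collection script, sketched below. The function name save_hsm_state, the variables HSM_ETC_DIR and HSM_CATALOG_DIR, the system/ subdirectory, the README.versions file name, and the catalog-name.dump naming are all hypothetical. The default paths and the samexplorer and dump_cat -V commands are the ones given in the steps. Because the goal is to preserve whatever remains of a failed system, the sketch warns and continues when an individual item cannot be saved.

```shell
# Sketch: collect Oracle HSM configuration and state in one directory.
# save_hsm_state, HSM_ETC_DIR, and HSM_CATALOG_DIR are hypothetical names.
save_hsm_state() {
    dest_dir=$1      # directory on an independent file system
    etc_dir=${HSM_ETC_DIR:-/etc/opt/SUNWsamfs}
    catalog_dir=${HSM_CATALOG_DIR:-/var/opt/SUNWsamfs/catalog}
    mkdir -p "$dest_dir" || return 1

    # Step 2: SAMreport named hostname.YYYYMMDD.hhmmz.tar.gz
    stamp=$(date +%Y%m%d.%H%M%Z)
    samexplorer "$dest_dir/$(hostname).$stamp.tar.gz" ||
        echo "warning: samexplorer failed" >&2

    # Step 3: copy the configuration directory and its contents.
    cp -rp "$etc_dir" "$dest_dir/" ||
        echo "warning: could not copy $etc_dir" >&2

    # Step 4: dump every surviving catalog, including the historian.
    for catalog in "$catalog_dir"/*; do
        [ -f "$catalog" ] || continue
        dump_cat -V "$catalog" > "$dest_dir/$(basename "$catalog").dump" ||
            echo "warning: could not dump $catalog" >&2
    done

    # Step 5: copy the modified system files that still exist.
    for f in /etc/syslog.conf /etc/system /etc/vfstab \
             /kernel/drv/sgen.conf /kernel/drv/samst.conf \
             /kernel/drv/samrd.conf /kernel/drv/sd.conf \
             /kernel/drv/ssd.conf /kernel/drv/st.conf \
             /usr/kernel/drv/dst.conf; do
        [ -f "$f" ] || continue
        mkdir -p "$dest_dir/system$(dirname "$f")" &&
            cp -p "$f" "$dest_dir/system$f" ||
            echo "warning: could not copy $f" >&2
    done

    # Step 7 (partial): record the operating system release only. Add the
    # Oracle HSM and Solaris Cluster revision levels by hand.
    uname -a > "$dest_dir/README.versions"
    return 0
}
```

For example, save_hsm_state /zfs1/hsmcfg would gather the report, configuration files, and catalog dumps in the directory used in the examples. Custom scripts, crontab entries, and software packages (steps 6 and 8) still have to be copied separately.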