Sun StorEdge SAM-FS Troubleshooting Guide
Troubleshooting Sun StorEdge SAM-FS Software
This chapter describes how to troubleshoot basic Sun StorEdge SAM-FS functions. It covers troubleshooting the archiver, the releaser, and the recycler.
Troubleshooting the Archiver
The archiver automatically writes SAM-QFS files to archive media. Operator intervention is not required to archive and stage the files. The archiver starts automatically when a SAM-QFS file system is mounted. You can customize the archiver's operations for your site by inserting archiving directives into the following file:
/etc/opt/SUNWsamfs/archiver.cmd
Upon initial setup, the archiver might not perform the tasks as intended. Make sure that you are using the following tools to monitor the archiving activity of the system:
- The File System Manager software. To display archiving activity go to the Servers page and click the name of the server for which you want to display archiving activity. Click the Jobs tab to display the Current Jobs Summary page. Choose whether you want to display current, pending, or all archiving activity by clicking the appropriate local tab under the Jobs tab. From the Filter menu, choose Archive Copy or Archive Scan to view all jobs of either type.
For complete information on using the File System Manager to monitor jobs, see the File System Manager online help file.
- The samu(1M) utility's a display. This display shows archiver activity for each file system. It also displays archiver errors and warning messages, such as the following:
Errors in archiver commands - no archiving will be done
The a display includes messages for each file system. It indicates when the archiver will next scan the .inodes file and which files are currently being archived.
- Archive logs. You can define these logs in the archiver.cmd file, and you should monitor them regularly to ensure that files are archived to volumes. Archive logs can become excessively large and should be reduced regularly either manually or by using a cron(1) job. Archive these log files for safekeeping because the information enables data recovery.
- sfind(1). Use this command to check periodically for unarchived files. If you have unarchived files, make sure you know why they are not being archived.
- sls(1). Files are not considered for release unless a valid archive copy exists. The sls -D command displays inode information for a file, including copy information.
Note - Output from the sls -D command might show the word archdone on a file. This is not an indication that the file has an archive copy. It is only an indication that the file has been scanned by the archiver and that all the work associated with the archiver itself has been completed. An archive copy exists only when you can view the copy information displayed by the sls(1) command.
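The distinction drawn in the note can be checked mechanically. The following is a minimal sketch, not a supported tool: has_archive_copy is a hypothetical helper name, and the pattern assumes that sls -D prints one line per archive copy beginning with "copy N:". Verify that layout against the output on your own system before relying on it.

```sh
# has_archive_copy: read `sls -D` output for ONE file on stdin.
# Exits 0 if at least one "copy N:" line is present, 1 otherwise.
# Assumption (not from this guide): copy lines look like
#   "  copy 1: ---- May 21 10:29 ..."
# The archdone flag is deliberately ignored, because it does not
# prove that an archive copy exists.
has_archive_copy() {
    grep -Eq '^[[:space:]]*copy[[:space:]]+[0-9]+:'
}
```

For example, sls -D somefile | has_archive_copy || echo "no archive copy yet" reports a file that shows archdone but has no copy information.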
Occasionally, you might see messages to indicate that the archiver either has run out of space on cartridges or has no cartridges. These messages are as follows:
- When the archiver has no cartridges assigned to an archive set, it issues the following message:
No volumes available for Archive Set setname
- When the archiver has no space on the cartridges assigned to an archive set, it issues the following message:
No space available on Archive Set setname
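If these messages are captured in a text file, the affected archive sets can be listed with a short filter. This is a sketch only: starved_archive_sets is a hypothetical name, the two message texts are taken from the list above, and it is an assumption that the messages reach the file you feed it (for example, the sam-log file).

```sh
# starved_archive_sets: read archiver messages on stdin and print
# "<reason> <setname>" once for each archive set that is out of media.
#   no-volumes = no cartridges assigned to the archive set
#   no-space   = assigned cartridges are full
starved_archive_sets() {
    sed -n \
        -e 's/.*No volumes available for Archive Set \([^[:space:]]*\).*/no-volumes \1/p' \
        -e 's/.*No space available on Archive Set \([^[:space:]]*\).*/no-space \1/p' |
    sort -u
}
```

A typical invocation would be starved_archive_sets < /var/adm/sam-log. Repeated messages for the same archive set collapse to a single line.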
Why Files Are Not Archiving
The following checklist includes reasons why your Sun StorEdge SAM-FS environment might not be archiving files.
- The archiver.cmd file has a syntax error. Run the archiver -lv command to identify the error, then correct the flagged lines.
- The archiver.cmd file has a wait directive in it. Either remove the wait directive or override it by using the samu(1M) utility's :arrun command.
- No volumes are available. You can view this from archiver(1M) -lv command output. Add more volumes as needed. You might have to export existing cartridges to free up slots in the automated library.
- The volumes for an archive set are full. You can export cartridges and replace them with new cartridges (make sure that the new cartridges are labeled), or you can recycle the cartridges. For more information on recycling, see .
- The VSN section of the archiver.cmd file fails to list correct media. Check your regular expressions and VSN pools to ensure that they are correctly defined.
- There is not enough space to archive any file on the available volumes. If you have larger files and it appears that the volumes are nearly full, the cartridges might be as full as the Sun StorEdge SAM-FS environment allows. If this is the case, add cartridges or recycle.
If you have specified the -join path parameter, and there is not enough space to archive all the files in the directory to any volume, no archiving occurs. You should add cartridges, recycle, or use one of the following parameters: -sort path or -rsort path.
- The archiver.cmd file has the no_archive directive set for directories or file systems that contain large files.
- The archive(1) -n (archive never) command has been used to specify too many directories, and the files are never archived.
- Large files are busy. Thus, they never reach their archive age and are not archived.
- Hardware or configuration problems exist with the automated library.
- Network connection problems exist between client and server. Ensure that the client and the server have established communications.
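Two items on this checklist, the wait directive and no_archive assignments, can be spotted by scanning the archiver.cmd file directly. The sketch below is illustrative and does not replace the archiver -lv check: blocking_directives is a hypothetical helper, and it assumes that # begins a comment and that the directive is the first word on its line.

```sh
# blocking_directives FILE
# Print "<line number>: <text>" for each non-comment line whose first
# word is "wait" or "no_archive".
# Assumptions: "#" starts a comment; the directive is the first word.
blocking_directives() {
    awk '
        {
            sub(/#.*/, "")        # drop comments
            sub(/[ \t]+$/, "")    # drop trailing white space
        }
        $1 == "wait" || $1 == "no_archive" { print FNR ": " $0 }
    ' "$1"
}
```

For example, blocking_directives /etc/opt/SUNWsamfs/archiver.cmd lists the lines to review. A no_archive line is not necessarily an error; the point is only to confirm that each one is intended.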
Additional Archiver Diagnostics
In addition to examining the items on the previous list, you should check the following when troubleshooting the archiver.
- The syslog file (by default, /var/adm/sam-log). This file can contain archiver messages that can indicate the source of a problem.
- Volume capacity. Ensure that all required volumes are available and have sufficient space on them for archiving.
- If the archiver appears to cause excessive, unexplainable cartridge activity or appears to be doing nothing, turn on the trace facility and examine the trace file. For information on trace files, see the defaults.conf(4) man page.
- You can use the truss(1) -p pid command on the archiver process (sam-archiverd) to determine the system call that is not responding. For more information on the truss(1) command, see the truss(1) man page.
- The showqueue(1M) command displays the content of the archiver queue files. You can use this command to observe the state of archiver requests that are being scheduled or archived. Any archive request that cannot be scheduled generates a message that indicates the reason. This command also displays the progress of archiving.
Why Files Are Not Releasing
The archiver and the releaser work together to balance the amount of data available on the disk cache. The main reason that files are not released automatically from disk cache is that they have not yet been archived.
For more information on why files are not being released, see the following section.
Troubleshooting the Releaser
There can be several reasons for the releaser to not release a file. Some possible reasons are as follows:
- Files can be released only after they are archived. There might not be an archive copy. For more information about this, see Why Files Are Not Archiving.
- The archiver requested that a file not be released. This can occur under the following conditions:
- The archiver has just staged an offline file to make an additional copy.
- The -norelease directive in the archiver.cmd file was set, and not all of the copies flagged -norelease have been archived yet. Note that the releaser summary output displays the total number of files with the archnodrop flag set.
- The file is set for partial release, and the file size is less than or equal to the partial size rounded up to the disk allocation unit (DAU) size (block size).
- The file changed residence in the last min_residence_age minutes.
- The release -n command has been used to prevent directories and files from being released.
- The archiver.cmd file has the -release n option set for too many directories and files.
- The releaser high watermark is set too high, and automatic releasing occurs too late. Verify this in the samu(1M) utility's m display or with File System Manager, and lower this value.
- The releaser low watermark is set too high, and automatic releasing stops too soon. Check this in the samu(1M) utility's m display, or with File System Manager, and lower this value.
- Large files are busy. They will never reach their archive age, never be archived, and never be released.
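The partial-release condition in the list above is a small piece of arithmetic: round the partial size up to a whole number of DAUs, then compare the file size against it. The sketch below works through that comparison. The function name is hypothetical, and the units (file size in bytes, partial size and DAU in kilobytes) are assumptions made for the example.

```sh
# partial_release_blocked FILE_BYTES PARTIAL_KB DAU_KB
# Exits 0 when the file size is less than or equal to the partial size
# rounded up to the DAU size, that is, when the condition in the list
# above holds. Exits 1 otherwise.
# Assumed units: bytes for the file, kilobytes for partial size and DAU.
partial_release_blocked() {
    file_bytes=$1 partial_kb=$2 dau_kb=$3
    # Round the partial size up to the next multiple of the DAU.
    rounded_kb=$(( (partial_kb + dau_kb - 1) / dau_kb * dau_kb ))
    [ "$file_bytes" -le $(( rounded_kb * 1024 )) ]
}
```

With a 16-kilobyte partial size and a 64-kilobyte DAU, the rounded partial size is 64 kilobytes (65536 bytes), so a 60000-byte file meets the condition and a 70000-byte file does not.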
Troubleshooting the Recycler
The most frequent problem encountered with the recycler occurs when the recycler generates a message similar to the following when it is invoked:
Waiting for VSN mo:OPT000 to drain, it still has 123 active archive copies.
One of the following conditions can cause the recycler to generate this message:
- Condition 1: The archiver fails to rearchive the 123 archive copies on the volume.
- Condition 2: The 123 archive copies do not refer to files in the file system. Rather, they refer to 123 metadata archive copies.
Condition 1 can exist for one of the following reasons:
- Files that need to be rearchived are marked no_archive.
- Files that need to be rearchived are in the no_archive archive set.
- Files cannot be archived because there are no available VSNs.
- The archiver.cmd file contains a wait directive.
To determine which condition is in effect, run the recycler with the -v option. As CODE EXAMPLE 2-1 shows, this option displays the path names of the files associated with the 123 archive copies in the recycler log file.
CODE EXAMPLE 2-1 Recycler Messages
Archive copy 2 of /sam/fast/testA resides on VSN LSDAT1
Archive copy 1 of /sam3/tmp/dir2/filex resides on VSN LSDAT1
Archive copy 1 of Cannot find pathname for file system /sam3 inum/gen 30/1 resides on VSN LSDAT1
Archive copy 1 of /sam7/hgm/gunk/tstfilA00 resides on VSN LSDAT1
Archive copy 1 of /sam7/hgm/gunk/tstfilF82 resides on VSN LSDAT1
Archive copy 1 of /sam7/hgm/gunk/tstfilV03 resides on VSN LSDAT1
Archive copy 1 of /sam7/hgm/gink/tstfilA06 resides on VSN LSDAT1
Archive copy 1 of /sam7/hgm/gink/tstfilA33 resides on VSN LSDAT1
Waiting for VSN dt:LSDAT1 to drain, it still has 8 active archive copies.
In this example output, messages containing seven path names are displayed along with one message that includes Cannot find pathname... text. To correct the problem with LSDAT1 not draining, you need to determine why the seven files cannot be rearchived. After the seven files are rearchived, only one archive copy is not associated with a file. Note that this condition should occur only as the result of a system crash that partially corrupted the .inodes file.
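The split between path names and unresolved inodes can be tallied with a short filter over the recycler log. This is a sketch under one assumption: the log lines appear exactly as in CODE EXAMPLE 2-1, with no timestamp or other prefix. The name recycler_blockers is hypothetical.

```sh
# recycler_blockers: read `recycler -v` log lines on stdin.
# Prints each path name that still has an archive copy on the VSN,
# then one summary line. Lines containing "Cannot find pathname" are
# counted separately as unresolved inodes.
recycler_blockers() {
    awk '
        /^Archive copy [0-9]+ of Cannot find pathname/ { orphans++; next }
        /^Archive copy [0-9]+ of / {
            path = $0
            sub(/^Archive copy [0-9]+ of /, "", path)
            sub(/ resides on VSN .*$/, "", path)
            print path
            files++
        }
        END {
            printf "%d path(s) to rearchive, %d unresolved inode(s)\n", files, orphans
        }
    '
}
```

Fed the lines from CODE EXAMPLE 2-1, the filter prints the seven path names followed by a summary line reporting 7 paths and 1 unresolved inode. The seven paths are the files to investigate first.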
To solve the problem of finding the path name, run samfsck(1M) to reclaim orphan inodes. If you choose not to run samfsck(1M), or if you are unable to unmount the file system to run samfsck(1M), you can manually relabel the cartridge after verifying that the recycler -v output is clean of valid archive copies. However, because the recycler continues to encounter the invalid inode remaining in the .inodes file, the same problem might recur the next time the VSN is a recycle candidate.
Another recycler problem occurs when the recycler fails to select any VSNs for recycling. To determine why each VSN was rejected, you can run the recycler with the -d option. This displays information on how the recycler selects VSNs for recycling.
819-2756-10
Copyright © 2005, Sun Microsystems, Inc. All Rights Reserved.