CHAPTER 1
Troubleshooting Overview
Sun StorEdge SAM-FS problems are frequently symptoms of incorrect hardware and software configuration during installation or upgrade. This chapter provides basic information on diagnosing and troubleshooting such problems in the Sun StorEdge SAM-FS environment and also discusses preparing a disaster recovery plan.
This chapter includes the following subsections:
The following sections provide an overview of some of the hardware and software configuration issues that may be encountered in the Sun StorEdge SAM-FS environment.
The following topics are covered:
The following sections describe the daemons that can be present in a SAM-QFS environment and show how to verify the functionality of these daemons.
The process spawner, init(1M), starts the sam-fsd(1M) daemon based on information defined in inittab(4). The sam-fsd(1M) daemon provides overall control of the initialization of the SAM-QFS environment. As part of this process, it starts a number of child daemons. These child daemons are as follows:
It is possible to determine which daemons and processes should be running for a given configuration based on a knowledge of the SAM-QFS daemons and processes and the circumstances under which they are started. You can check that the expected daemons or processes are running by using the ps(1) and ptree(1) commands.
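As a rough aid when reading ps(1) output, the following sketch tallies SAM-QFS daemon processes by name and flags defunct entries. The function name count_sam_daemons is hypothetical and is not part of the SAM-QFS software. It assumes only that the daemon names begin with sam-, as the daemons named in this chapter do.

```shell
# Hypothetical helper (not shipped with SAM-QFS): reads `ps -ef` style
# output on stdin and prints one "name count" line per sam-* process,
# plus any <defunct> entries it sees.
count_sam_daemons() {
    awk '
        /<defunct>/ { print "defunct: " $0; next }
        {
            # Scan every field because the STIME column may span two fields.
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, "/")
                name = parts[n]            # strip any leading directory
                if (name ~ /^sam-[A-Za-z0-9]+$/) { seen[name]++; break }
            }
        }
        END { for (name in seen) print name, seen[name] }
    ' | LC_ALL=C sort
}
```

A typical invocation would be `ps -ef | count_sam_daemons`. A count greater than 1 is not necessarily an error, because some processes legitimately run once per file system or per library.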
CODE EXAMPLE 1-1 assumes that the ps(1) command is issued in a SAM-QFS environment that includes a StorageTek L700 library connected via ACSLS to a SAM-QFS system with two mounted file systems, samfs1 and samfs2. In this example, the sam-stkd(1M) daemon is running. This daemon controls the network-attached StorageTek media changers through the ACSAPI interface implemented by the ACSLS software. If such equipment were present, similar daemons would be started for network-attached IBM (sam-ibm3494d(1M)) and Sony (sam-sonyd(1M)) automated libraries, and for standard direct-attached automated libraries that conform to the SCSI-II standard for media changers (sam-genericd(1M)).
The following steps show you what to look for in the ps(1) command's output.
1. Check the output for missing or duplicate daemon processes and defunct processes.
There should be only one of each of these processes, with few exceptions, as follows:
2. Check the configuration files.
The sam-fsd(1M) daemon reads the following configuration files: mcf(4), defaults.conf(4), diskvols.conf(4), and samfs.cmd(4). Verify that these configuration files are error free by issuing the sam-fsd(1M) command manually and watching for error messages. As CODE EXAMPLE 1-2 shows, if sam-fsd(1M) encounters errors when processing these files, it exits without starting up the SAM-QFS environment.
After the software packages have been installed, you need to tailor the SAM-QFS configuration files to the site installation in order to bring the system into an operational state. Syntactical and typographical errors in these configuration files manifest themselves in unexpected behavior. TABLE 1-1 shows the relevant files.
Many of these files are described in the following sections:
The later chapters of this manual address the rest of the files introduced in TABLE 1-1.
Using the appropriate log and trace files can greatly facilitate the diagnosis of SAM-QFS problems. TABLE 1-2 shows the relevant files.
The following sections describe how to use the log and trace files when troubleshooting:
The SAM-QFS software makes log entries using the standard Solaris syslog interface (see syslogd(1M), syslog.conf(4), syslog(3C)). All logging is done based on a level and a facility. The level describes the severity of the reported condition. The facility describes the component of the system sharing information with the syslogd(1M) daemon. The SAM-QFS software uses facility local7 by default.
To enable the syslogd(1M) daemon to receive information from the SAM-QFS software for system logging, perform the following steps:
1. Add a line to the /etc/syslog.conf file to enable logging.
For example, add a line similar to the following:
You can copy this line from /opt/SUNWsamfs/examples/syslog.conf_changes. This entry is all one line, and it has a TAB character (not a space) between the fields.
2. Use touch(1) to create an empty /var/adm/sam-log file.
3. Send the syslogd(1M) process a SIGHUP signal.
skeeball # ps -ef | grep syslogd | grep -v grep
root 216 1 0 Jun 20 ? 0:00 /usr/sbin/syslogd
skeeball # kill -HUP 216
4. Use vi(1) or another editor to open the defaults.conf file and add the debugging level. (Optional)
Perform this step only if you want to increase the logging level.
You can use the debug keyword in the defaults.conf file to set the default level for the debug flags used by the SAM-QFS daemons for logging system messages. The syntax for this line is as follows:
The default debug level is logging, so debug=logging is the default specification. For option-list, specify a space-separated list of debug options. For more information on the options available, see the samset(1M) and defaults.conf(4) man pages.
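Because a space in place of the TAB character in the syslog.conf entry is easy to overlook, a quick check along the following lines can help. This is a minimal sketch: the function name check_sam_syslog_entry is hypothetical, and it assumes the entry uses the default local7 facility with a single selector.

```shell
# Hypothetical helper: report whether a syslog.conf-style file contains a
# local7 entry whose selector and action are separated by TABs.
# Assumes the default facility (local7) and a single selector per line.
check_sam_syslog_entry() {
    conf=${1:-/etc/syslog.conf}
    tab=$(printf '\t')
    if grep -q "^local7\.[a-z]*${tab}${tab}*/" "$conf"; then
        echo "ok: local7 entry uses TAB separators"
    elif grep -q "^local7\.[a-z]* " "$conf"; then
        echo "error: local7 entry is separated by spaces, not TABs"
        return 1
    else
        echo "missing: no local7 entry found in $conf"
        return 1
    fi
}
```

Run it with no arguments to check /etc/syslog.conf, or pass the path of a draft file before copying it into place.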
The robot daemon, sam-robotsd(1M), starts and monitors the execution of the media changer control daemons in SAM-QFS systems. The sam-amld(1M) daemon automatically starts the sam-robotsd(1M) daemon if there are any media changers defined in the mcf file. For more information, see the sam-robotsd(1M) man page.
The sam-robotsd(1M) daemon executes the /opt/SUNWsamfs/sbin/dev_down.sh notification script when any removable media device is marked down or off. By default, it sends email to root with the relevant information. It can be tailored to use syslogd(1M) or to interface with the systems management software in use at a site. For more information, see the dev_down.sh(4) man page.
You can enable daemon tracing by configuring settings in the defaults.conf(4) file. CODE EXAMPLE 1-4 shows the syntax to use in the defaults.conf(4) file.
CODE EXAMPLE 1-4 enables daemon tracing for all daemons. The system writes trace files for each daemon to the following default location:
Alternatively, trace files can be turned on individually for the sam-archiverd(1M), sam-catserverd(1M), sam-fsd(1M), sam-ftpd(1M), sam-recycler(1M), and sam-stagerd(1M) processes. CODE EXAMPLE 1-5 enables daemon tracing for the archiver in /var/opt/SUNWsamfs/trace/sam-archiverd, sets the name of the archiver trace file to filename, and defines a list of optional trace events or elements to be included in the trace file as defined in option-list.
trace
sam-archiverd = on
sam-archiverd.file = filename
sam-archiverd.options = option-list
sam-archiverd.size = 10M
endtrace
Note that daemon trace files are not automatically rotated by default. As a result, the trace files can become very large, and they might eventually fill the /var file system. You can enable automatic trace file rotation in the defaults.conf(4) file by using the daemon-name.size parameter.
The sam-fsd(1M) daemon invokes the trace_rotate.sh(1M) script when a trace file reaches the specified size. The current trace file is renamed filename.1; the next newest is renamed filename.2; and so on, for up to 7 generations. CODE EXAMPLE 1-5 specifies that the archiver trace file should be rotated when its size reaches 10 megabytes.
For detailed information on the events that can be selected, see the defaults.conf(4) man page.
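The generation-numbering scheme described above (filename.1 is the newest rotated file, up to 7 generations) can be sketched as follows. This is an illustration of the renaming order only, not the trace_rotate.sh(1M) script itself; the function name rotate_generations is hypothetical, and the sketch makes no attempt to coordinate with a daemon that still has the file open.

```shell
# Illustrative only: NOT the shipped trace_rotate.sh script.
# Shifts file.N to file.N+1 (oldest first), then renames file to file.1.
rotate_generations() {
    file=$1
    max=${2:-7}                      # keep at most this many generations
    [ -f "$file" ] || return 0       # nothing to rotate
    gen=$max
    while [ "$gen" -gt 1 ]; do
        older=$gen
        gen=$((gen - 1))
        if [ -f "$file.$gen" ]; then
            mv -f "$file.$gen" "$file.$older"   # overwrites the oldest copy
        fi
    done
    mv -f "$file" "$file.1"
}
```

The same renaming order could serve as a starting point for a site policy that rolls over logs the system does not rotate on its own.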
SAM-QFS systems write messages for archiving devices (automated libraries and tape drives) in log files stored in /var/opt/SUNWsamfs/devlog. In this directory of files, there is one log file for each device, and these files contain device-specific information. Each removable-media device has its own device log, which is named after its Equipment Ordinal (eq) as defined in the mcf file. There is also a device log for the Historian (Equipment Type hy) with a file name equal to the highest eq defined in the mcf file incremented by one.
You can use the devlog keyword in the defaults.conf(4) file to set up device logging using the following syntax:
If eq is set to all, the event flags specified in option-list are set for all devices.
For option-list, specify a space-separated list of devlog event options. If option-list is omitted, the default event options are err, retry, syserr, and date. For information on the list of possible event options, see the samset(1M) man page.
You can use the samset(1M) command to turn on device logging from the command line. Note that the device logs are not maintained by the system, so you must implement a policy at your site to ensure that the log files are routinely rolled over.
CODE EXAMPLE 1-6 shows sample device log output using the default output settings. It shows the first initialization of a 9840A tape drive. The drive is specified as Equipment Ordinal 31 in the mcf file.
CODE EXAMPLE 1-6 shows a 9840A device being initialized and, some three hours later, a tape from slot 0 being loaded into the tape drive for archiving. The tape is checked three times for its VSN label, and each time the system reports that the media is blank. After three checks, the system concludes that the tape is blank, labels it, and then reports the VSN label (700181), the date, the time, and the media block size.
The SAM-QFS software supports several troubleshooting utilities and one diagnostic report, the samexplorer(1M) script (called info.sh(1M) in versions prior to 4U1). The following sections describe these tools.
TABLE 1-3 lists the utilities that are helpful in diagnosing SAM-QFS configuration problems.
Utility | Description
sam-fsd(1M) | Initializes the environment. Debugs basic configuration problems, particularly with new installations.
samu(1M) | Full-screen operator interface to SAM-QFS systems. Comprehensive display shows status of file systems and devices. Allows operator to control file systems and removable media devices.
sls(1) | Sun Microsystems extended version of the GNU ls(1) command. The -D option displays extended SAM-QFS attributes.
samexplorer(1M) | Generates SAM-QFS diagnostic reports. Also described in The samexplorer(1M) Script.
TABLE 1-3 briefly describes the general form of these utilities. For more information, consult the relevant man pages and the SAM-QFS documentation, particularly the Sun StorEdge QFS Configuration and Administration Guide and the Sun StorEdge SAM-FS Storage and Archive Management Guide.
The samexplorer(1M) script (called info.sh(1M) in versions prior to 4U1) collates information from a SAM-QFS environment and writes this to file /tmp/SAMreport. The information contained in the SAMreport is an important aid to diagnosing complex SAM-QFS problems, and it is needed by an engineer in the event of an escalation.
The SAMreport includes the following information:
If log files are not routinely collected, an important source of diagnostic information is missing from the SAMreport. It is important to ensure that sites implement a comprehensive logging policy as part of their standard system administration procedures.
It is recommended that the SAMreport be generated in the following circumstances:
Run the samexplorer script and save the SAMreport file before attempting recovery. Ensure that SAMreport is moved from /tmp before rebooting. The functionality of samexplorer has been fully incorporated into the Sun Explorer Data Collector, release 4U0. However, samexplorer provides a focused set of data tuned to the SAM-QFS environment that can be quickly and simply collected and sent to escalation engineers for rapid diagnosis.
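Because the report must be moved out of /tmp before a reboot, a small wrapper such as the following can copy it somewhere persistent under a unique name. This is a sketch under stated assumptions: the function name save_samreport is hypothetical, and /var/tmp is only an example destination; use whatever location suits your site.

```shell
# Hypothetical helper: copy the SAMreport out of /tmp under a name that
# includes the host name and a timestamp, and print the new path.
# The default destination (/var/tmp) is an assumption, not a requirement.
save_samreport() {
    src=${1:-/tmp/SAMreport}
    dest_dir=${2:-/var/tmp}
    if [ ! -f "$src" ]; then
        echo "no report found at $src" >&2
        return 1
    fi
    dest="$dest_dir/SAMreport.$(uname -n).$(date +%Y%m%d%H%M%S)"
    mkdir -p "$dest_dir" && cp -p "$src" "$dest" && echo "$dest"
}
```

For example, `save_samreport /tmp/SAMreport /export/support` would leave the original in place and print the path of the saved copy.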
The following sections describe various system configuration problems that can be diagnosed and remedied:
SAM-QFS problems can turn out to be hardware related. Before embarking on an extensive troubleshooting exercise, ascertain the following:
It is easiest to verify the hardware configuration by performing the following procedure. However, this procedure requires you to shut down the system. If the system cannot be shut down, consult the /var/adm/messages file for the device check-in messages from the last reboot.
To verify that the Solaris OS can communicate with the devices attached to the server, perform the following steps:
2. Issue the probe-scsi-all command at the ok prompt.
3. Monitor the boot-up sequence messages.
While monitoring the messages, identify the check-in of the expected devices.
CODE EXAMPLE 1-7 shows the st tape devices checking in.
If devices do not respond, consult your Solaris documentation for information on configuring the devices for the Solaris OS.
If you have verified that the hardware has been installed and configured correctly and that no hardware faults are present, the next step in diagnosing an installation or configuration problem is to check that the expected SAM-QFS daemons are running. For more information on the daemons, see Daemons.
SAN-attached devices, such as fibre channel drives and automated libraries, should be checked to ensure that they are configured and that they are visible to the Solaris OS through the cfgadm(1M) command. CODE EXAMPLE 1-8 illustrates this for a fabric-attached library controller and drives.
If devices are in an unconfigured state, use the cfgadm(1M) command with its -c configure option to configure the devices into the Solaris environment. It is important to understand the SAN configuration rules for Fibre Channel tape devices and libraries. For more information, see the latest Sun StorEdge Open SAN Architecture or SAN Foundation Kit Package documentation.
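To pick out the unconfigured devices from a long cfgadm(1M) listing, a filter like the following may help. It is a minimal sketch that assumes the usual five-column `cfgadm -al` layout (Ap_Id, Type, Receptacle, Occupant, Condition); the function name list_unconfigured is hypothetical.

```shell
# Hypothetical helper: reads `cfgadm -al` output on stdin and prints the
# Ap_Id of every attachment point whose Occupant column is "unconfigured".
# Assumes the columns are Ap_Id, Type, Receptacle, Occupant, Condition.
list_unconfigured() {
    awk 'NR > 1 && $4 == "unconfigured" { print $1 }'
}
```

Typical usage would be `cfgadm -al | list_unconfigured`, followed by `cfgadm -c configure` with each Ap_Id that is reported.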
This section describes specific troubleshooting procedures for identifying issues with the Sun StorEdge SAM-FS and Sun StorEdge QFS configuration files.
The mcf(4) file defines the SAM-QFS devices and device family sets.
The mcf file is read when sam-fsd(1M) is started. It can be changed at any time, even while sam-fsd is running, but sam-fsd(1M) recognizes mcf file changes only when the daemon is restarted. CODE EXAMPLE 1-9 shows an mcf file for a SAM-QFS environment.
The Sun StorEdge QFS Configuration and Administration Guide describes the format of the mcf file in detail.
The most common problems with the mcf file are syntactical and typographical errors. The sam-fsd(1M) command is a useful tool in debugging the mcf file. If an error is encountered by sam-fsd(1M) as it processes the mcf file, it writes error messages to the Sun StorEdge SAM-FS log file (if configured). Errors detected in the following other files, if present, are also reported:
For a newly created or modified mcf file, run the sam-fsd(1M) command and check for error messages. If necessary, correct the mcf file and rerun the sam-fsd(1M) command to ensure that the errors have been corrected. Repeat this process until all errors have been eliminated. When the mcf file is error free, reinitialize the sam-fsd(1M) daemon by sending it the SIGHUP signal. CODE EXAMPLE 1-10 shows this process.
Enable the changes to the mcf file for a running system by running the samd(1M) command with its config option (as shown in CODE EXAMPLE 1-10) or by sending the SIGHUP signal to sam-fsd(1M). Note that the procedure for reinitializing sam-fsd(1M) so that it recognizes mcf file modifications varies depending on the nature of the changes. See the Sun StorEdge QFS Configuration and Administration Guide for the procedures to follow in specific circumstances.
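Before running sam-fsd(1M), it can be convenient to catch one common typographical error, a repeated or non-numeric Equipment Ordinal, with a quick pre-check. The following is a sketch only and is no substitute for sam-fsd(1M). It assumes that the Equipment Ordinal is the second whitespace-separated field on each mcf line and that the file lives at /etc/opt/SUNWsamfs/mcf; the function name check_mcf_ordinals is hypothetical.

```shell
# Hypothetical pre-check, not a replacement for sam-fsd(1M).
# Assumes field 2 of each non-comment mcf line is the Equipment Ordinal.
check_mcf_ordinals() {
    mcf=${1:-/etc/opt/SUNWsamfs/mcf}     # assumed default location
    awk '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        $2 !~ /^[0-9]+$/ {
            printf "line %d: Equipment Ordinal \"%s\" is not a number\n", NR, $2
            bad = 1; next
        }
        $2 in seen {
            printf "line %d: Equipment Ordinal %s already used on line %d\n", NR, $2, seen[$2]
            bad = 1; next
        }
        { seen[$2] = NR }
        END { exit (bad ? 1 : 0) }
    ' "$mcf"
}
```

The function prints one message per problem line and returns a nonzero status if it finds any, so it can be used as a gate in a script before sam-fsd(1M) is run.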
For libraries with more than a single drive, the order in which drive entries appear in the mcf file must match the order in which they are identified by the library controller. The drive that the library controller identifies as the first drive must be the first drive entry for that library in the mcf, and so on. To check the drive order for a direct-attached library, follow the instructions in the "Checking the Drive Order" section of the Sun StorEdge SAM-FS Installation and Upgrade Guide.
Network-attached libraries use different procedures from direct-attached libraries because the drive order for a network-attached library is defined by the library control software.
For example, for a network-attached StorageTek library, the drive mapping in the ACSLS parameters file must match the drives as presented by the ACSLS interface. In this case, the procedure is similar to that for a library without a front panel, except that an additional check is necessary to ensure that the ACSLS parameters file mapping is correct.
Some tape devices that are compatible with SAM-QFS software are not supported by default in the Solaris operating system (OS) kernel. The file /kernel/drv/st.conf is the Solaris st(7D) tape driver configuration file for all supported tape drives. The file can be modified to enable normally unsupported drives to work with a SAM-QFS system. Attempting to use any such device in the SAM-QFS environment without updating the st.conf file, or with an incorrectly modified file, causes the system to write messages such as the following to the device log file:
If your configuration is to include devices not supported by the Solaris OS, consult the following file for instructions on how to modify the st.conf file:
For example, the IBM LTO drive is not supported by default in the Solaris kernel. CODE EXAMPLE 1-11 shows the lines you need to add to the st.conf file in order to include IBM LTO drives in a SAM-QFS environment.
"IBM ULTRIUM-TD1", "IBM Ultrium", "CLASS_3580",
CLASS_3580 = 1,0x24,0,0x418679,2,0x00,0x01,0;
The st.conf file is read only when the st driver is loaded, so if the /kernel/drv/st.conf file is modified, perform one of the following actions in order to direct the system to recognize the changes:
The samst(7) driver for SCSI media changers and optical drives is used for direct-attached SCSI or Fibre Channel tape libraries and for magneto-optical drives and libraries.
As part of the installation process, the SAM-QFS software creates entries in the /dev/samst directory for all devices that were attached and recognized by the system before the pkgadd(1M) command was entered to begin the installation.
If you add devices after running the pkgadd(1M) command, you must use the devfsadm(1M) command, as follows, to create the appropriate device entries in /dev/samst:
After the command is issued, verify that the device entries have been created in /dev/samst. If they have not, then perform a reconfiguration reboot and attempt to create the entries again.
If the /dev/samst device is not present for the automated library controller, the samst.conf file might need to be updated. In general, Fibre Channel libraries, libraries with targets greater than 7, and libraries with LUNs greater than 0 require the samst.conf file to be updated. To add support for such libraries, add a line similar to the following to the /kernel/drv/samst.conf file:
In the previous example line, 500104f00041182b is the WWN port number of the Fibre-attached automated library. If you need to, you can obtain the WWN port number from the cfgadm(1M) command's output. CODE EXAMPLE 1-12 shows this command.
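If the cfgadm(1M) listing is long, the WWN can be extracted from it with a small filter. The sketch below assumes that fabric-attached libraries appear in `cfgadm -al` output with an Ap_Id of the form controller::WWN and a Type of med-changer; the function name library_wwns is hypothetical.

```shell
# Hypothetical helper: reads `cfgadm -al` output on stdin and prints the
# WWN portion of the Ap_Id for each entry whose Type is "med-changer".
# Assumes Ap_Ids of the form <controller>::<WWN>, optionally with ",<lun>".
library_wwns() {
    awk '$2 == "med-changer" && index($1, "::") {
        wwn = $1
        sub(/^.*::/, "", wwn)   # drop the controller prefix
        sub(/,.*$/, "", wwn)    # drop any trailing LUN suffix
        print wwn
    }'
}
```

Typical usage would be `cfgadm -al | library_wwns`.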
For network-attached tape libraries such as a StorageTek library controlled by ACSLS, the samst driver is not used, and no /dev/samst device entries are created.
The /etc/opt/SUNWsamfs/inquiry.conf file defines vendor and product identification strings for recognized SCSI or fibre devices and matches these with SAM-QFS product strings. If you have devices that are not defined in inquiry.conf, you need to update the file with the appropriate device entries. This is rarely necessary because the great majority of devices are already defined in the file. CODE EXAMPLE 1-13 shows a fragment of the inquiry.conf file.
If changes to this file are required, you must make them and then reinitialize your SAM-QFS software by issuing the following commands:
If the system detects errors in the inquiry.conf file during reinitialization, it writes messages to the Sun StorEdge SAM-FS log file. Check for error messages similar to those shown in CODE EXAMPLE 1-15 after making changes to inquiry.conf and reinitializing the SAM-QFS software.
The defaults.conf configuration file allows you to establish certain default parameter values for a SAM-QFS environment. The system reads the defaults.conf file when sam-fsd(1M) is started or reconfigured. The file can be changed at any time while the sam-fsd(1M) daemon is running. The changes take effect when the sam-fsd(1M) daemon is restarted, or when it is sent the SIGHUP signal. Temporary changes to many values can be made using the samset(1M) command.
The sam-fsd(1M) command is also useful for debugging the defaults.conf(4) file. If the sam-fsd(1M) daemon encounters an error as it processes the defaults.conf(4) file, it writes error messages to the Sun StorEdge SAM-FS log file.
For a newly created or modified defaults.conf(4) file, run the sam-fsd(1M) command and check for error messages. If necessary, correct the file and rerun the sam-fsd(1M) command to ensure that the errors have been corrected. Repeat this process until all errors have been eliminated.
If you modify the defaults.conf(4) file on a running system, you need to reinitialize it by restarting the sam-fsd(1M) daemon. You can use the samd(1M) command with its config option to restart sam-fsd(1M). See the Sun StorEdge QFS Configuration and Administration Guide for the procedures to be followed in specific circumstances.
Data must be backed up and disaster recovery processes must be put in place so that data can be retrieved if any of the following occur:
Chapter 4 provides the information you need to know about backing up metadata and other important configuration data. The remaining chapters in this manual describe how to use the data you back up to recover from various types of disasters.
Setting up processes for doing backups and system dumps is only part of preparing to recover from a disaster. The following tasks are also necessary:
When a disk containing the operating environment for a system fails, after you replace the defective disk(s), you need to do what is called bare metal recovery before you can do anything else. Two bare metal recovery approaches are available:
This process is slower than the second alternative described below.
Image backups need to be made only when system configuration changes are made. The downside to this approach is that it is difficult to safely transport hard disks to off-site storage.
After you have done all the recovery preparations described in this chapter, do the tests described in the following sections:
Always test backup scripts and cron(1) jobs on a development or test system before rolling them out to all systems.
Use the information in the other chapters of this manual to run the following tests, which verify how well your disaster recovery process works. Run these tests periodically, and in particular whenever you make changes to the software.
Copyright © 2005, Sun Microsystems, Inc. All Rights Reserved.