CHAPTER 1
Diagnosing Server Performance and Faults
This chapter describes the diagnostic tools available for use with the Sun Fire V215 and V245 servers. This chapter contains the following diagnostic sections:
Sun provides a range of diagnostic tools for use with the Sun Fire V215 and V245 servers. TABLE 1-1 contains summaries of the diagnostic tools.
ALOM: Monitors environmental conditions, performs environmental fault isolation, and provides remote console access to the system.
Status indicators (LEDs): Indicate the operational status of the overall system and of subassemblies that have status indicators. Accessed from the system chassis. Available whenever power is available.
POST: Provides test coverage for CPUs, CPU caches, system memory, CPU interconnects, I/O bridges, and system buses. Runs automatically on startup. Available when the operating system is not running.
OpenBoot Diagnostics: Provides test coverage specifically for the I/O subsystems and plug-in cards. Test coverage consists of I/O channels, boot controllers (SCSI, IDE, USB, Ethernet), non-core devices (Flash, I2C, environmental controls, NVRAM), and plug-in cards with native Fcode drivers that support IEEE 1275 self-test mechanisms. OpenBoot Diagnostics provides Fcode self-tests for on-board hardware devices. Runs automatically or interactively. Available when the operating system is not running.
OpenBoot commands: Display various system information (see Section 1.4.3, OpenBoot Diagnostic Commands).
SunVTS: Exercises and stresses the system, running tests in parallel.
Sun Management Center: Monitors both hardware environmental conditions and software performance of multiple machines. Generates alerts for various conditions. Requires the operating system to be running on both the monitored and master servers. Requires a dedicated database on the master server.
Hardware Diagnostic Suite: Exercises an operational system by running sequential tests. Also reports failed FRUs. A separately purchased optional add-on to Sun Management Center. Requires the operating system and Sun Management Center.
This section helps you choose the right tool to isolate a failed part in a Sun Fire V215 or V245 server. Consider the following questions when selecting a tool.
1. Have you checked the status indicators?
Certain system components have status indicators that can alert you when a component requires replacement.
3. Do you intend to run the tests remotely?
Sun Management Center, ALOM, and the Hardware Diagnostic Suite software enable you to run tests from a remote server. ALOM also provides a means of redirecting system console output, enabling you to remotely view and run tests, like POST diagnostics, that usually require physical proximity to the serial port on the back panel.
Note - The SunVTS software also enables you to run tests remotely by using tty mode through a remote login or a Telnet session.
4. Will the tool test the suspected sources of the problem?
Use a diagnostic tool capable of testing the suspected problem sources. TABLE 1-1 shows which parts can be isolated by each fault isolating tool.
5. Is the problem intermittent or software-related?
If a problem is not caused by a defective hardware component, use a system exerciser tool rather than a fault isolation tool.
FIGURE 1-1 Choosing a Tool to Isolate Hardware Faults
POST is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU modules, motherboard, memory, and some on-board I/O devices. POST also generates messages that can help determine the nature of a hardware failure. POST can run even if the system is unable to boot. POST resides in a PROM located on the MBC board (ALOM) and detects most persistent fault conditions.
POST can run under the following four conditions:
1. POST will run automatically when power is applied to the system.
2. POST will run in service mode when the system is reset with the reset-all command from the ok prompt.
3. POST will run when the keyswitch is set to the diag position.
4. POST will run when the post command is issued from the ok prompt.
If diag-level is set to menu, a menu of all the tests executed at power up is displayed.
POST diagnostic and error message reports are displayed on a console.
where level specifies the level of diagnostics (min, max, menu, off) and verbosity specifies the diagnostic verbosity (debug, max, normal, min, none).
Status and error messages are displayed in the console window. If POST detects an error, it displays an error message describing the failure.
You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables generally take effect only after the server is restarted. TABLE 1-2 lists the most important and useful of these variables.
Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
After POST diagnostics have finished running, POST reports the status of each test it has run to the OpenBoot firmware. Control then returns to the OpenBoot firmware code.
If POST diagnostics do not uncover a fault, and the server still does not start up, run OpenBoot Diagnostics tests. See FIGURE 1-1 for additional information.
Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the OpenBoot PROM.
Become superuser, and then type init 0.
This command displays the OpenBoot Diagnostics menu.
Note - If a PCI card is installed in the server, additional tests appear on the obdiag menu.
Where n represents the number corresponding to the test you want to run.
A summary of the tests is available. At the obdiag> prompt, type:
Most of the OpenBoot configuration variables you use to control POST (see TABLE 1-2) also affect OpenBoot Diagnostics tests.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 1-3.
If you want to customize the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
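The comma-separated form described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the firmware or of any Sun tool: the function name `build_test_args` is hypothetical, and the keyword names used in the usage line (`debug`, `loopback`, `media`) are assumed for illustration. TABLE 1-3 remains the authoritative list of reserved keywords.

```python
def build_test_args(keywords):
    """Join keywords into the comma-separated form that test-args expects.

    Hypothetical helper: it only checks the shape of each keyword
    (non-empty, no commas, no whitespace). It does not know which
    keywords the firmware actually accepts.
    """
    cleaned = []
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword or "," in keyword or any(ch.isspace() for ch in keyword):
            raise ValueError(f"invalid test-args keyword: {keyword!r}")
        cleaned.append(keyword)
    return ",".join(cleaned)


# Assumed keyword names, for illustration only.
command = "setenv test-args " + build_test_args(["debug", "loopback", "media"])
print(command)
```

The printed line is the text you would type at the ok prompt. Building it programmatically is mainly useful when the same setting is applied to several servers through a remote console.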
You can run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V215 and V245 servers.
To customize an individual test, you can use test-args as follows:
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command:
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
OpenBoot Diagnostics error messages are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. displays an example OpenBoot Diagnostics error message.
OpenBoot commands that can provide useful diagnostic information are:
The probe-scsi and probe-scsi-all commands diagnose problems with SCSI devices.
Caution - If you use the halt command or the Stop-A key sequence to reach the ok prompt and then issue the probe-scsi or probe-scsi-all command, the system might hang.
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.
For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.
The following is example output from the probe-scsi command.
The following is example output from the probe-scsi-all command.
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This bus is the internal system bus for media devices such as the optional DVD super-multi drive.
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
The following is example output from the probe-ide command.
The show-devs command lists the hardware device paths for each device in the firmware device tree. The following shows some example output.
1. Halt the system to reach the ok prompt.
How you do this depends on the system’s condition. If possible, warn users before you shut the system down. One method is to become superuser and then type the init 0 command.
2. Type the appropriate OpenBoot command at the console prompt.
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. After the server is running in multiuser mode, you have access to the software-based diagnostic tools SunVTS and Hardware Diagnostic Suite. These tools enable you to monitor the server, exercise it, and isolate faults.
Note - If you set the auto-boot OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.
In addition to the tools mentioned above, you can refer to error and system message log files, and Solaris system information commands.
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
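Because many sources write to this file, filtering by severity is often the quickest way to find relevant entries. The following Python sketch assumes that messages carry the Solaris message ID tag of the form `[ID number facility.level]`. That tag can be disabled in the syslog configuration, so treat the pattern as an assumption. The function name and the sample lines are made up for illustration.

```python
import re

# Assumed tag format, for example "[ID 222222 kern.warning]".
MSG_TAG = re.compile(r"\[ID \d+ (\w+)\.(\w+)\]")

# Levels treated as worth a closer look (an arbitrary choice for this sketch).
SEVERE = {"emerg", "alert", "crit", "err", "error", "warning"}


def severe_messages(lines, levels=SEVERE):
    """Return (facility, level, line) for each line whose level is in `levels`."""
    hits = []
    for line in lines:
        match = MSG_TAG.search(line)
        if match and match.group(2) in levels:
            hits.append((match.group(1), match.group(2), line.rstrip("\n")))
    return hits


# Made-up sample lines, not output from a real server.
SAMPLE = [
    "Jan  5 10:15:02 host1 genunix: [ID 111111 kern.notice] example notice text",
    "Jan  5 10:15:09 host1 scsi: [ID 222222 kern.warning] WARNING: example warning text",
    "Jan  5 10:16:40 host1 app[321]: [ID 333333 daemon.error] example error text",
]

for facility, level, line in severe_messages(SAMPLE):
    print(facility, level, line)
```

On a running server, the same function can be given the open log file instead of the sample list, for example `severe_messages(open("/var/adm/messages"))`, provided the account has read access to the file.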
The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V215 or V245 server: prtconf, prtdiag, prtfru, psrinfo, and showrev.
This section describes the information that these commands give you. More information on using these commands is contained in the appropriate man page.
The prtconf command displays the Solaris device tree. This tree includes all of the devices probed by the OpenBoot firmware, as well as additional devices, like individual disks that are exposed to the operating system only. The output of prtconf also includes the total amount of system memory.
The prtconf -p option produces output similar to the OpenBoot show-devs command. This output lists only those devices that are compiled by the system firmware.
The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on the system.
The verbose option (-v) includes information about the front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
In the event of an overtemperature condition, prtdiag reports an error in the Status column.
System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0            62              OK
CPU1            102             ERROR
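When prtdiag output is collected from several servers, the Status column can be checked automatically. The following Python sketch parses a temperature table laid out like the example above and returns the rows whose status is not OK. The layout is assumed from that example only. As noted earlier, the prtdiag display format can vary with the Solaris version, so the pattern may need adjusting. The function names are hypothetical.

```python
import re

# Text laid out like the example above (assumed format).
SAMPLE = """\
System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0            62              OK
CPU1            102             ERROR
"""

# A data row is: device name, integer temperature, status word.
ROW = re.compile(r"^(\S+)\s+(-?\d+)\s+(\S+)\s*$")


def parse_temperatures(text):
    """Return a list of (device, temperature, status) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match.group(1), int(match.group(2)), match.group(3)))
    return rows


def not_ok(rows):
    """Return only the rows whose status is something other than OK."""
    return [row for row in rows if row[2] != "OK"]


print(not_ok(parse_temperatures(SAMPLE)))
```

Header and separator lines are skipped because they do not match the row pattern, which requires an integer in the second column.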
If there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
The Sun Fire V215 and V245 servers maintain a hierarchical list of all FRUs in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs.
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:
The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed.
The showrev command displays revision information for the current hardware and software. When used with the -p option, this command displays installed patches.
Summaries of the results from the most recent POST and OpenBoot Diagnostics tests are saved across power cycles.
2. Do either of the following:
This command produces a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot Diagnostics tests.
You can use the probe-scsi, probe-scsi-all, probe-ide, watch-net, and watch-net-all commands to perform additional diagnostic tests on specific devices. This section contains procedures for using these commands.
The probe-scsi command transmits an inquiry to SCSI devices connected to the system’s internal SCSI interface. If a SCSI device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.
Caution - If you use the halt command or the Stop-A key sequence to reach the ok prompt and then issue the probe-scsi or probe-scsi-all command, the system might hang.
Become superuser and type init 0.
The following is an example of the output from the probe-scsi command.
The probe-scsi-all command transmits an inquiry to all SCSI devices connected to both the system’s internal and external SCSI interfaces.
The following shows example output from a server with no externally connected SCSI devices but containing two 73 Gbyte hard drives, both of them active.
The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system’s on-board IDE interface.
The following example output shows an optional DVD super-multi drive installed (as Device 0) and active in a server.
The watch-net diagnostics test monitors Ethernet packets on the primary network interface. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.
{1}ok watch-net
100Mbps FDX Link up
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet
Type any key to stop.
...............................
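If the console output is captured to a file, the packet markers can be tallied to give a rough error rate. The following Python sketch counts the period and X markers in a marker string. It assumes that the caller passes only the marker stream printed after the "Type any key to stop" line, with the banner text and any error descriptions already removed, because those also contain periods and the letter X. The function name is hypothetical.

```python
def summarize_markers(markers):
    """Count good ('.') and bad ('X') packet markers in a marker stream.

    Assumes `markers` holds only the marker characters, not the
    banner lines or the error descriptions printed with bad packets.
    """
    good = markers.count(".")
    bad = markers.count("X")
    total = good + bad
    error_rate = bad / total if total else 0.0
    return {"good": good, "bad": bad, "error_rate": error_rate}


# Illustrative stream: 14 good packets and 2 bad packets.
stream = "." * 6 + "X" + "." * 3 + "X" + "." * 5
print(summarize_markers(stream))
```

A stream consisting only of periods, as in the example output above, gives an error rate of zero.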
The watch-net-all diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.
Become superuser and type init 0.
2. Type watch-net-all at the prompt.
Copyright © 2008, Sun Microsystems, Inc. All Rights Reserved.