This chapter covers the diagnostic tools available for the system and how to use them. It also describes the error indications and software commands that help you determine which component of the system needs to be replaced.
The system provides both firmware-based and software-based diagnostic tools to help you identify and isolate hardware problems. These tools include:
Power-on self-test (POST) diagnostics
OpenBoot Diagnostics (OBDiag)
SunVTS(TM) software
Solstice SyMON software
Remote System Control (RSC) software
POST diagnostics verify the core functionality of the system, including the main logic board, system memory, and any on-board I/O devices. You can run POST even if the system is unable to boot. For more information about POST, see "About Power-On Self-Test (POST) Diagnostics" and "How to Use POST Diagnostics".
OBDiag tests focus on system I/O and peripheral devices. As with POST, you can run OBDiag even if the system is unable to boot. For more information about OBDiag, see "About OpenBoot Diagnostics (OBDiag)" and "How to Use OpenBoot Diagnostics (OBDiag)".
The SunVTS system exerciser is a graphics-oriented UNIX application that permits the continuous exercising of system resources and internal and external peripheral equipment. For more information about SunVTS, see "About SunVTS Software".
Solstice SyMON allows you to monitor system hardware status and operating system performance of your server. For information about SyMON, see "About Solstice SyMON Software".
Remote System Control (RSC) is a server management tool that provides remote system administration for geographically distributed or physically inaccessible systems. The RSC software works with the System Service Processor (SSP) on the Enterprise 250 main logic board. For more information about RSC and SSP, see "About Remote System Control (RSC)".
Which method or tool you use to diagnose system problems depends on the nature of those problems:
If your machine isn't able to boot its operating system software, you need to run POST and OBDiag tests.
If your machine is "healthy" enough to start up and load its operating system software, you can use Solstice SyMON software and SunVTS software to diagnose system problems.
If your machine is at a remote location, use RSC to diagnose problems remotely.
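The selection rules above can be summarized in a few lines of code. The following is only an illustrative sketch; the function name `suggest_tools` and its parameters are hypothetical and are not part of any Sun software.

```python
from typing import List


def suggest_tools(boots_os: bool, is_remote: bool = False) -> List[str]:
    """Return the diagnostic tools suggested for a given situation.

    Hypothetical helper that encodes the three rules listed above.
    """
    tools: List[str] = []
    if is_remote:
        # A machine at a remote location is diagnosed through RSC.
        tools.append("RSC")
    if boots_os:
        # The operating system loads: use the software-based tools.
        tools += ["Solstice SyMON", "SunVTS"]
    else:
        # The operating system cannot boot: use the firmware-based tools.
        tools += ["POST", "OBDiag"]
    return tools


print(suggest_tools(boots_os=False))  # ['POST', 'OBDiag']
print(suggest_tools(boots_os=True, is_remote=True))
```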
The following chart provides an overview of when to use the various diagnostic tools to diagnose hardware problems.
The POST diagnostic code resides in flash PROM on the main logic board. It runs whenever the system is turned on or when a system reset is issued. POST tests the following system components:
CPU modules
Memory modules
NVRAM
Main logic board
POST reports its test results via LEDs located on the system keyboard and on the system front panel. See "Error Indications" for more information about LEDs and error messages.
POST displays detailed diagnostic and error messages on a local terminal, if one is attached to the system's serial port A. You can also choose to display POST output remotely on a Remote System Control (RSC) console.
The System Service Processor (SSP) runs its own POST diagnostics, separate from the main POST diagnostics. To view detailed diagnostic and error messages from SSP POST, you must attach a local terminal to the SSP (RSC) serial port prior to running SSP POST.
For more information about RSC and the System Service Processor, see "About Remote System Control (RSC)". For information about running POST, see "How to Use POST Diagnostics".
When you turn on the system power, POST diagnostics run automatically if any of the following conditions apply:
The OpenBoot PROM variable diag-switch? is set to true when you power on the system.
You hold down the keyboard's Stop and D keys as you power on the system.
You power on the system by turning the front panel keyswitch to the Diagnostics position.
In the event of an automatic system reset, POST diagnostics run under either of the following conditions:
The diag-switch? variable is set to true and the diag-trigger variable is set to error-reset or soft-reset.
The front panel keyswitch is in the Diagnostics position and the diag-trigger variable is set to error-reset or soft-reset.
For information about the various keyswitch positions, see "About the Status and Control Panel".
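The conditions above amount to a small piece of boolean logic. The sketch below restates them in code for clarity; the function names and the keyswitch label `"diagnostics"` are hypothetical choices made for this example, not firmware interfaces.

```python
# diag-trigger settings that allow POST to run after an automatic reset.
RESET_TRIGGERS = {"error-reset", "soft-reset"}


def post_runs_at_power_on(diag_switch: bool, stop_d_held: bool, keyswitch: str) -> bool:
    """Any one of the three power-on conditions is sufficient."""
    return diag_switch or stop_d_held or keyswitch == "diagnostics"


def post_runs_on_auto_reset(diag_switch: bool, keyswitch: str, diag_trigger: str) -> bool:
    """Diagnostics must be enabled AND diag-trigger must cover the reset."""
    diagnostics_enabled = diag_switch or keyswitch == "diagnostics"
    return diagnostics_enabled and diag_trigger in RESET_TRIGGERS


# With the default diag-trigger (power-reset), an automatic reset does not run POST.
print(post_runs_on_auto_reset(True, "power-on", "power-reset"))  # False
print(post_runs_on_auto_reset(True, "power-on", "error-reset"))  # True
```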
You can choose to view POST diagnostic and error messages locally on an attached terminal or remotely on an RSC console.
To view POST diagnostic messages on the local system, you need to connect an alphanumeric terminal or establish a tip connection to another Sun system. For more information, see "About Setting Up a Console".
To view POST diagnostic messages remotely on an RSC console, you need to configure the RSC software before starting POST. For information about using the RSC software, see the Remote System Control (RSC) User's Guide.
By default, POST output is displayed locally on an attached terminal or through a tip connection. If your server has been reconfigured to display POST output on an RSC console, POST results will not display locally. To redirect POST output to the local system, you must issue the OpenBoot PROM command diag-output-to ttya from the RSC console. See the Remote System Control (RSC) User's Guide for additional details.
You can choose to run an abbreviated POST with concise error and status reporting or run an extensive POST with more detailed messages. For more information, see "How to Set the Diagnostic Level for POST and OBDiag".
Ensure that the front panel keyswitch is in the Standby position.
For descriptions of the various keyswitch settings, see "About the Status and Control Panel".
Turn the keyswitch to the Diagnostics position.
The system runs the POST diagnostics. POST displays status and error messages on the system console or on an RSC console, if the RSC software is configured to display POST output. For more information, see the "Results" section below.
Upon successful completion of POST, the system will run OBDiag. For more information about OBDiag, see "About OpenBoot Diagnostics (OBDiag)" and "How to Use OpenBoot Diagnostics (OBDiag)".
While POST is running, you can observe its progress and any error indications in the following locations:
System console or Remote System Control (RSC) console
Front panel fault LEDs
Keyboard LEDs (if a keyboard is present)
You can also obtain a summary of POST results by using the .post and .rsc commands.
As POST runs, it displays detailed diagnostic status messages on the system console (or on an RSC console, if POST output has been redirected to an RSC console). If POST detects an error, it displays an error message on either the system console or the RSC console that indicates the failing part. A sample error message is provided below:
Power On Self Test Failed. Cause: DIMM U0702 or System Board
ok
POST status and error conditions are indicated by the general fault LED on the system front panel. The LED blinks slowly to indicate that POST is running. It remains lit if POST detects a fault.
If a Sun Type-5 keyboard is attached, POST status and error indications are also displayed via the four LEDs on the keyboard. When POST starts, all four keyboard LEDs blink on and off simultaneously. After that, the Caps Lock LED blinks slowly to indicate POST is running. If an error is detected, the pattern of the lit LEDs provides an error indication. See "Error Indications" for more information.
If POST detects an error condition that prevents the system from booting, it will halt operation and display the ok prompt. The last message displayed by POST prior to the ok prompt indicates which part you need to replace.
Use the .post command at the ok prompt to view a summary of POST results.
ok .post
System status: OK
CPU0: OK
CPU1: OK
SC-MP: OK
Psycho@1f: OK
Cheerio: OK
SCSI: OK
Mem Bank0: OK
Mem Bank1: OK
Mem Bank2: OK
Mem Bank3: OK
PROM: OK
NVRAM: OK
TTY: OK
SuperIO: OK
PCI Slots: OK
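If you capture the .post summary as text (for example, from a tip session log), a short script can pick out the entries that are not OK. This is a minimal sketch that assumes one "name: status" pair per line. The status string `FAILED` in the sample is invented for illustration, since the source shows only OK results, and both function names are hypothetical.

```python
from typing import Dict, List


def parse_post_summary(text: str) -> Dict[str, str]:
    """Parse 'name: status' lines into a dictionary, skipping other lines."""
    results: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, status = line.partition(":")
        if sep and status.strip():
            results[name.strip()] = status.strip()
    return results


def components_needing_attention(results: Dict[str, str]) -> List[str]:
    """Return every component whose status is anything other than OK."""
    return [name for name, status in results.items() if status != "OK"]


# Sample input; the FAILED status is an assumed, illustrative value.
sample = """ok .post
System status: OK
CPU0: OK
Psycho@1f: OK
Mem Bank0: OK
Mem Bank1: FAILED
NVRAM: OK
"""

print(components_needing_attention(parse_post_summary(sample)))  # ['Mem Bank1']
```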
Use the .rsc command at the ok prompt to view a summary of SSP POST results.
ok .rsc
SEEPROM: OK
I2C: OK
Ethernet: OK
Ethernet (2): OK
CPU: OK
RAM: OK
Console: OK
RSC Console line: OK
RSC Control line: OK
FlashRAM Boot CRC: OK
FlashRAM Main CRC: OK
RSC Console Link: OK
Console Selection: ttya
OpenBoot Diagnostics (OBDiag) reside in flash PROM on the main logic board. OBDiag can isolate errors in the following system components:
Main logic board
Diskette drive
CD-ROM drive
Tape drive
Disk drives
Any option card that contains an on-board self-test
On the main logic board, OBDiag tests not only the main logic board but also its interfaces:
PCI
SCSI
Ethernet
Serial
Parallel
Keyboard/mouse
RSC/SSP
OBDiag reports test results via the LEDs located on the system front panel. See "Error Indications" for more information about LEDs and error messages.
OBDiag displays detailed diagnostic and error messages on a local console or terminal, if one is attached to the system. Alternatively, you can display OBDiag output remotely on a Remote System Control (RSC) console. For more information about RSC, see "About Remote System Control (RSC)".
OBDiag tests run automatically under certain conditions. You can also run OBDiag interactively from the system ok prompt. For information about running OBDiag, see "How to Use OpenBoot Diagnostics (OBDiag)".
When you run OBDiag interactively from the ok prompt, you invoke the OBDiag menu, which lets you select which tests you want to perform. For information about the OBDiag menu, see "OBDiag Menu".
The system also provides configuration variables that you can set to affect the operation of the OBDiag tests. For information about the configuration variables, see "OBDiag Configuration Variables".
The OBDiag menu is created dynamically whenever you invoke OBDiag in interactive mode. OBDiag determines whether any optional devices are installed in the system. If the device has an on-board self-test, OBDiag incorporates the test name into the list of menu entries. It sorts the menu entries in alphabetical order and numbers them accordingly. Therefore, the menu entries may vary from system to system, depending on the system configuration.
The OBDiag menu always displays the core tests that exercise parts of the basic system. These tests include envctrltwo, ebus, ecpp, eeprom, fdthree, network, scsi@3, scsi@3,1, se, su, and rsc. For information about each test, see "OBDiag Test Descriptions". For a description of the interactive commands for running OBDiag, see "OBDiag Commands".
Once you invoke OBDiag as described in "How to Use OpenBoot Diagnostics (OBDiag)", the OBDiag menu is displayed.
The following table provides information about the OBDiag interactive commands that are available at the OBDiag command prompt:
Table 12-1
Command | Description |
---|---|
exit | Exits the OBDiag tool and returns to the ok prompt. |
help | Displays a brief description of each command and OpenBoot PROM variable used to run OBDiag. |
printenvs | Displays the value of all of the OBDiag variables. (See "OBDiag Configuration Variables" for information about settings.) |
setenv variable value | Sets the value for an OpenBoot PROM configuration variable. (See "OBDiag Configuration Variables" for information about settings.) |
test-all | Runs all of the tests displayed in the menu. |
test #,#, | Runs only the test(s) identified by menu entry number (#) in the command line. |
except #,#, | Runs all tests except those identified by menu entry number (#) in the command line. |
what #,#, | Displays selected properties of the device(s) identified by menu entry number (#) in the command line. The exact information provided varies according to device type. |
The following table provides information about OpenBoot PROM configuration variables that affect the operation of OBDiag. Use the printenvs command to show current values and the setenv command to set or change a value. Both commands are described in "OBDiag Commands".
Table 12-2
Variable | Setting | Description | Default |
---|---|---|---|
diag-level | off | No tests are run at power up. | min |
diag-level | min | Performs minimal testing of core functionality. | min |
diag-level | med | Performs functional tests for all system functions. | min |
diag-level | max | Runs exhaustive tests for all functions except external loopbacks. External loopback tests are run only if diag-targets is set to loopback, loopback3, device&loopback, or device&loopback,3. | min |
diag-continue? | false | Stops testing within a test routine and prints a message as soon as an error is detected. OBDiag then skips to the next test routine in the sequence. | false |
diag-continue? | true | Causes OBDiag to run all subtests within a test, even if an error is detected. | false |
diag-passes | n | Repeats each test the number of times specified by n. Works with the test, except, and test-all commands. | 1 |
diag-targets | none | Runs internal tests only, no I/O testing. | none |
diag-targets | iopath | Extends testing to external device interfaces (connectors/cables). | none |
diag-targets | media | Extends testing to external devices and media, if present. | none |
diag-targets | device | Invokes built-in self-test (BIST) on PCI cards and external devices. | none |
diag-targets | loopback | Runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, and RSC serial ports. | none |
diag-targets | loopbacks | Not for use on Enterprise 250 servers. | none |
diag-targets | loopback2 | Not for use on Enterprise 250 servers. | none |
diag-targets | loopback3 | Runs external loopback tests on the RSC Ethernet port. | none |
diag-targets | nomem | Performs tests without testing system memory. | none |
diag-targets | device&loopback | Runs built-in self-test (BIST) on PCI cards and external devices, then runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, and RSC serial ports. | none |
diag-targets | device&loopbacks | Not for use on Enterprise 250 servers. | none |
diag-targets | device&loopback,3 | Runs built-in self-test (BIST) on PCI cards and external devices, then runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, RSC serial, and RSC Ethernet ports. | none |
diag-trigger | power-reset | Runs diagnostics only on power-on resets. | power-reset |
diag-trigger | error-reset | Runs diagnostics only on power-on resets, fatal hardware errors, and watchdog reset events. | power-reset |
diag-trigger | soft-reset | Runs diagnostics on all resets (except XIR). | power-reset |
diag-verbosity | 0 | Prints one line that indicates the device being tested and its pass/fail status. | 0 |
diag-verbosity | 1 | Prints more detailed test status, which varies in content from test to test. | 0 |
diag-verbosity | 2 | Prints subtest names. | 0 |
diag-verbosity | 4 | Prints debug messages. | 0 |
diag-verbosity | 8 | Prints back trace of callers on error. | 0 |
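The relationship between the diag-targets setting and the ports covered by external loopback tests can be hard to read from the table. The sketch below restates it as a lookup, using only the settings and port names listed in Table 12-2. The names `EXTERNAL_LOOPBACK_PORTS` and `external_loopback_ports` are hypothetical and exist only for this example.

```python
from typing import Dict, List

# Ports covered by the loopback and device&loopback settings (Table 12-2).
BASE_PORTS = ["parallel", "serial", "keyboard", "mouse", "TPE", "RSC serial"]

# diag-targets settings that enable external loopback tests on this server.
EXTERNAL_LOOPBACK_PORTS: Dict[str, List[str]] = {
    "loopback": BASE_PORTS,
    "loopback3": ["RSC Ethernet"],
    "device&loopback": BASE_PORTS,
    "device&loopback,3": BASE_PORTS + ["RSC Ethernet"],
}


def external_loopback_ports(diag_targets: str) -> List[str]:
    """Return the ports tested with external loopbacks for a given setting.

    Any other setting (none, iopath, media, device, nomem, and so on)
    yields an empty list because it does not enable external loopbacks.
    """
    return list(EXTERNAL_LOOPBACK_PORTS.get(diag_targets, []))


print(external_loopback_ports("loopback3"))  # ['RSC Ethernet']
print(external_loopback_ports("none"))       # []
```

Remember that each external loopback test also needs the matching loopback connector or network connection attached to the port before it can pass.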
The following table provides information about the tests available through OBDiag. It provides the test name, a brief description of the test, and any special considerations involved in running the test.
Table 12-3
Test Name | Description | Special Considerations |
---|---|---|
SUNW,envctrltwo @14,60000 | Verifies that the fans are operational. Checks that the temperature in the enclosure and at the CPUs does not exceed the maximum allowable range. Also tests the disk and front panel LEDs. | |
ebus@1 | Tests the on-board ASIC that interfaces the following devices with the PCI bus: parallel port, serial port, keyboard, mouse, diskette drive, NVRAM, and the environmental monitoring and control system. | |
ecpp @14,3043bc | Tests parallel port I/O logic, including internal and external loopback tests. | To run external loopback tests, you must have a special passive loopback connector attached to the parallel port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the parallel port loopback connector is 501-2965-01. |
eeprom@14,0 | Tests the NVRAM functionality. | |
fdthree @14,3023f0 | Tests diskette drive control logic and the operation of the drive. The test does not differentiate among a drive, media, or main logic board error; if any of these fail, it reports the diskette drive as the FRU. | A formatted diskette must be inserted into the drive. |
network@1,1 | Tests the on-board Ethernet logic, including internal and external loopback tests. | To run external loopback tests on the TPE port, you must have a TPE loopback connector attached to the TPE port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the TPE loopback connector is 501-4689-01. |
scsi@3 [Depending on your system configuration, the OBDiag menu may include tests for additional SCSI interfaces, such as scsi@4, scsi@4,1, scsi@5, scsi@5,1, etc.] | Tests the on-board SCSI controller and SCSI bus subsystem for internal disk drives and removable media devices. Checks associated registers and performs a DMA transfer. | |
scsi@3,1 | Tests the main logic board external SCSI interface. Checks associated registers and performs a DMA transfer. | |
se@14,40000 | Tests serial port control and I/O logic, including internal and external loopback tests. The test checks I/O logic only if the external loopback test is enabled. | Port A tests are not run if ttya is being used as the input/output device. To run external loopback tests, you must have a special passive loopback connector attached to each serial port, and the variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. There is one passive connector available for this test: Sun part number 501-4205-01. Use 501-4205-01 when ports A and B are not attached to external devices. |
su@14,3062f8 | Tests keyboard control and input logic, including internal and external loopback tests. | Keyboard tests run only when a keyboard is used as the input device. To run external loopback tests, you must have a special passive loopback connector attached to the keyboard/mouse port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the loopback connector is 501-4690-01. |
su@14,3083f8 | Tests mouse control and input logic, including internal and external loopback tests. | Mouse tests are not run if a keyboard is used as an input device. To run external loopback tests, you must have a special passive loopback connector attached to the keyboard/mouse port, and the variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the loopback connector is 501-4690-01. |
rsc | Tests RSC (SSP) hardware, including RSC serial and Ethernet ports. For additional details, see "About Remote System Control (RSC)". | This test is not run if RSC is being used as the console device. To run external loopback tests on the RSC Ethernet port, the port must be connected to a 10-Mbps Ethernet network. The variable diag-targets must also be set to loopback3 or device&loopback,3. To run external loopback tests on the RSC serial port, a special passive serial loopback connector must be attached to the port. The variable diag-targets must also be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the passive serial loopback connector is 501-4205-01. |
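Before you run external loopback tests, it helps to list the connectors you need. The following sketch collects the part numbers given in Table 12-3 into a lookup. The short test names used as keys (without unit addresses) and the function name `connectors_for` are simplifications chosen for this example.

```python
from typing import Dict, Iterable, Tuple

# Passive loopback connectors and Sun part numbers, taken from Table 12-3.
# The RSC Ethernet loopback test is not listed because it requires a
# connection to a 10-Mbps Ethernet network rather than a connector.
LOOPBACK_CONNECTORS: Dict[str, Tuple[str, str]] = {
    "ecpp": ("parallel port", "501-2965-01"),
    "network": ("TPE port", "501-4689-01"),
    "se": ("serial port", "501-4205-01"),
    "su": ("keyboard/mouse port", "501-4690-01"),
    "rsc": ("RSC serial port", "501-4205-01"),
}


def connectors_for(tests: Iterable[str]) -> Dict[str, str]:
    """Map each port that needs a loopback connector to its part number."""
    needed: Dict[str, str] = {}
    for test in tests:
        if test in LOOPBACK_CONNECTORS:
            port, part_number = LOOPBACK_CONNECTORS[test]
            needed[port] = part_number
    return needed


print(connectors_for(["ecpp", "scsi@3"]))  # {'parallel port': '501-2965-01'}
```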
When you turn on the system power, OBDiag runs automatically if any of the following conditions apply:
The OpenBoot PROM variable diag-switch? is set to true.
You hold down the keyboard's Stop and D keys as you power on the system.
You power on the system by turning the front panel keyswitch to the Diagnostics position.
In the event of an automatic system reset, OBDiag runs under either of the following conditions:
The diag-switch? variable is set to true and the diag-trigger variable is set to error-reset or soft-reset.
The front panel keyswitch is in the Diagnostics position and the diag-trigger variable is set to error-reset or soft-reset.
For information about the various keyswitch positions, see "About the Status and Control Panel".
OBDiag tests run automatically, without operator intervention, under the conditions described above. However, you can also run OBDiag in an interactive mode and select which tests you want to perform. The following procedure describes how to run OBDiag interactively from the system ok prompt.
Perform this procedure with the power on and the keyswitch in the Power-on position.
With the keyswitch in the Power-on position, press the Break key on your alphanumeric terminal's keyboard, or enter the Stop-a sequence on a Sun keyboard.
To enter the Stop-a sequence, press the Stop key and the a key simultaneously. The ok prompt is displayed.
(Optional) Select a diagnostic level.
Four different levels of diagnostic testing are available for OBDiag; see "How to Set the Diagnostic Level for POST and OBDiag".
(Optional) Select a diagnostic target.
You can choose to run OBDiag with or without external loopback tests by using the OpenBoot PROM variable diag-targets. For more information, see "OBDiag Configuration Variables".
Enter obdiag at the ok prompt:
ok obdiag
When the OBDiag menu appears, enter the appropriate command and test name/number at the command prompt.
For command usage and descriptions, see "OBDiag Commands".
For more information about OBDiag tests, see "About OpenBoot Diagnostics (OBDiag)".
Four different levels of diagnostic testing are available for power-on self-test (POST) and OpenBoot Diagnostics (OBDiag): max (maximum level), med (medium level), min (minimum level), and off (no testing). The system runs the appropriate level of diagnostics based on the setting of the OpenBoot PROM variable called diag-level.
The default setting for diag-level is min.
If your server is set up without a local console, you'll need to set up a monitor or console before setting the diagnostic level. See "About Setting Up a Console".
Perform this procedure with the power on and the keyswitch set to the Power-on position.
With the keyswitch in the Power-on position, press the Break key on your alphanumeric terminal's keyboard, or enter the Stop-a sequence on a Sun keyboard.
To enter the Stop-a sequence, press the Stop key and the a key simultaneously. The ok prompt is displayed.
To set the diag-level variable, enter the following:
ok setenv diag-level value
The value can be off, min, med, or max. See "OBDiag Configuration Variables" for information about each setting.
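If you script console input (for example, when sending commands over a tip connection), it can be worth validating the level before building the command. The following is a small sketch; the helper name `setenv_diag_level` is hypothetical, and the script only builds the command text and does not talk to the firmware.

```python
# The four diagnostic levels accepted by the diag-level variable.
DIAG_LEVELS = ("off", "min", "med", "max")


def setenv_diag_level(value: str) -> str:
    """Return the command to type at the ok prompt, or raise on a bad value."""
    if value not in DIAG_LEVELS:
        raise ValueError(
            "diag-level must be one of " + ", ".join(DIAG_LEVELS) + "; got " + repr(value)
        )
    return "setenv diag-level " + value


print(setenv_diag_level("max"))  # setenv diag-level max
```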
SunVTS, the Sun Validation and Test Suite, is an online diagnostics tool and system exerciser for verifying the configuration and functionality of hardware controllers, devices, and platforms. You can run SunVTS using any of these interfaces: a command line interface, a tty interface, or a graphical interface that runs within a windowed desktop environment.
SunVTS software lets you view and control a testing session over modem lines or over a network. Using a remote system, you can view the progress of a SunVTS testing session, change testing options, and control all testing features of another system on the network.
Useful tests to run on your system include:
Table 12-4
SunVTS Test |
Description |
---|---|
ecpptest |
Verifies the ECP1284 parallel port printer functionality |
cdtest |
Tests the CD-ROM drive by reading the disc and verifying the CD table of contents (TOC), if it exists |
disktest |
Verifies local disk drives |
env2test |
Tests the I2C environment control system including all fans, front panel LEDs and keyswitch, disk backplane LEDs, power supplies, and thermistor readings |
fputest |
Checks the floating-point unit |
fstest |
Tests the integrity of the software's file systems |
m64test |
Tests the PGX frame buffer card |
mptest |
Verifies multiprocessor features (for systems with more than one processor) |
nettest |
Checks all the hardware associated with networking (for example, Ethernet, token ring, quad Ethernet, fiber optic, 100-Mbit per second Ethernet devices) |
pmem |
Tests the physical memory (read only) |
sptest |
Tests the system's on-board serial ports |
tapetest |
Tests the various Sun tape devices |
rsctest |
Verifies the RSC/SSP functionality, including SSP Ethernet and serial ports, I2C, and SSP Flash RAM. |
vmem |
Tests the virtual memory (a combination of the swap partition and the physical memory) |
The following documents provide information about SunVTS software. They are available on Solaris on Sun Hardware AnswerBook. This AnswerBook documentation is provided on the SMCC Updates CD for the Solaris release you are running.
SunVTS User's Guide
This document describes the SunVTS environment and its features, including how to start and control the various user interfaces.
SunVTS Test Reference Manual
This document contains descriptions of each test SunVTS software runs in the SunVTS environment. Each test description explains the various test options and gives command line arguments.
SunVTS Quick Reference Card
This card gives an overview of the main features of the SunVTS Open Look interface.
SunVTS software is an optional package that may or may not have been loaded when your system software was installed.
To check whether SunVTS is installed, you must access your system either from a console (see "About Setting Up a Console"), or from a remote machine logged in to the system.
% pkginfo -l SUNWvts
If SunVTS software is loaded, information about the package will be displayed.
If necessary, use the pkgadd utility to load the SUNWvts package onto your system from the SMCC Updates CD.
Note that /opt/SUNWvts is the default directory for installing SunVTS software.
For more information, refer to the appropriate Solaris documentation, as well as the pkgadd reference manual page.
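The same check can be scripted. The sketch below wraps the pkginfo command shown above and treats a non-zero exit status as "not installed"; that interpretation of the exit status is an assumption you should confirm against the pkginfo manual page. The function name `sunvts_installed` and its `run` parameter are hypothetical, and a Python interpreter on the server is assumed.

```python
import subprocess


def sunvts_installed(run=subprocess.run) -> bool:
    """Return True if 'pkginfo -l SUNWvts' exits with status 0.

    The run parameter exists only so the command runner can be replaced
    when testing; by default the real pkginfo command is executed.
    """
    try:
        result = run(
            ["pkginfo", "-l", "SUNWvts"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # pkginfo itself is not available on this host.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("SunVTS installed:", sunvts_installed())
```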
If your system passes the firmware-based diagnostics and boots the operating system, yet does not function correctly, you can use SunVTS, the Sun Validation and Test Suite, to run additional tests. These tests verify the configuration and functionality of most hardware controllers and devices.
You'll need root or superuser access to run SunVTS tests.
This procedure assumes you'll test your Enterprise 250 server remotely by running a SunVTS session from a workstation using the SunVTS graphical interface. For information about other SunVTS interfaces and options, see "About Diagnostic Tools".
You can also run SunVTS remotely from a Remote System Control (RSC) console. For information about using the RSC with SunVTS, see the Remote System Control (RSC) User's Guide.
Use xhost to give the remote server access to the workstation display.
On the workstation from which you will be running the SunVTS graphical interface, enter:
% /usr/openwin/bin/xhost +remote_hostname
Substitute the name of the Enterprise 250 server for remote_hostname. Among other things, this command gives the server display permissions to run the SunVTS graphical interface in the OpenWindows(TM) environment of the workstation.
Remotely log in to the server as superuser (root).
Check whether SunVTS software is loaded on the server.
SunVTS is an optional package that may or may not have been loaded when the server software was installed. For more information, see "How to Check Whether SunVTS Software Is Installed".
To start the SunVTS software, enter:
# cd /opt/SUNWvts/bin
# ./sunvts -display local_hostname:0
Substitute the name of the workstation you are using for local_hostname. Note that /opt/SUNWvts/bin is the default /bin directory for SunVTS software. If you've installed SunVTS software in a different directory, use the appropriate path instead.
When you start SunVTS software, the SunVTS kernel probes the test system devices. The results of this probe are displayed on the Test Selection panel. For each hardware device on your system, there is an associated SunVTS test.
Fine-tune your testing session by selecting only the tests you want to run.
Click to select and deselect tests. (A check mark in the box indicates the item is selected.)
If SunVTS tests indicate an impaired or defective part, replace it by following the procedures in Chapter 6, "Removing and Installing Main Logic Board Components," through Chapter 9, "Removing and Installing Backplanes and Cables."
Solstice SyMON is a GUI-based diagnostic tool designed to monitor system hardware status and operating system performance. It offers simple, yet powerful monitoring capabilities that allow you to:
Diagnose and address potential problems such as capacity problems or bottlenecks
Display physical and logical views of your exact server configuration
Monitor your server remotely from any location in the network
Isolate potential problems or failed components
Access SunVTS diagnostics to diagnose hardware problems
Solstice SyMON software is included on the SMCC Updates CD for the Solaris release you are running. For instructions on installing and using Solstice SyMON software, see the Solstice SyMON User's Guide included in the Solaris on Sun Hardware AnswerBook on the SMCC Updates CD.
Remote System Control (RSC) is a secure server management tool that lets you monitor and control your server over modem lines or over a network. RSC provides remote system administration for geographically distributed or physically inaccessible systems. The RSC software works with the System Service Processor (SSP) on the Enterprise 250 main logic board. The SSP provides both serial and Ethernet ports for connections to a remote console.
Once RSC is configured to manage your server, you can use it to run diagnostic tests, view diagnostic and error messages, reboot your server, and display environmental status information from a remote console. If the operating system is down, RSC will notify a central host of any power failures, hardware failures, or other important events that may be occurring on your server.
The RSC provides the following features:
Remote system monitoring and error reporting (including output from POST and OBDiag)
Remote reboot on demand
Ability to monitor system environmental conditions remotely
Ability to run POST and OBDiag tests and use SunVTS from a remote console
Remote event notification for over-temperature conditions, power supply failures, fatal system errors, or system crashes
Remote access to detailed event and error logs
Remote console functions on serial and Ethernet ports
For information about configuring and using RSC, see the Remote System Control (RSC) User's Guide, provided with the RSC software.
By default, diagnostic status and error messages are displayed on the local system console or terminal. If your server has been reconfigured to display output on an RSC console, diagnostic results will not display locally. To redirect diagnostic messages to the local console, you must use the OpenBoot PROM command diag-output-to and modify the OpenBoot PROM variables input-device and output-device. For additional details, see the Remote System Control (RSC) User's Guide.
The system provides the following features to help you identify and isolate hardware problems:
Error indications
Software commands
Diagnostic tools
This section describes the error indications and software commands provided to help you troubleshoot your system. Diagnostic tools are covered in "About Diagnostic Tools".
The system provides error indications via LEDs and error messages. Using the two in combination, you can isolate a problem to a particular field-replaceable unit (FRU) with a high degree of confidence.
The system provides fault LEDs in the following places:
Front panel
Keyboard
Power supplies
Disk drives
Error messages are logged in the /var/adm/messages file and are also displayed on the system console by the diagnostic tools.
Front panel LEDs provide your first indication if there is a problem with your system. Usually, a front panel LED is not the sole indication of a problem. Error messages and even other LEDs can help to isolate the problem further.
The front panel has a general fault indicator that lights whenever POST or OBDiag detects any kind of fault. In addition, it has LEDs that indicate problems with the internal disk drives, power supply subsystem, or fans. See "About the Status and Control Panel" for more information on these LEDs and their meanings.
Four LEDs on the Sun Type-5 keyboard are used to indicate the progress and results of POST diagnostics. These LEDs are on the Caps Lock, Compose, Scroll Lock, and Num Lock keys, as shown below.
To indicate the beginning of POST diagnostics, the four LEDs briefly light all at once. The monitor screen remains blank, and the Caps Lock LED blinks for the duration of the testing.
If the system passes all POST diagnostic tests, all four LEDs light again and then go off. Once the system banner appears on the monitor screen, the keyboard LEDs assume their normal functions and should no longer be interpreted as diagnostic error indicators.
If the system fails any test, one or more LEDs will light to form an error code that indicates the nature of the problem.
The LED error code may be lit continuously, or for just a few seconds, so it is important to observe the LEDs closely while POST is running.
The following table provides error code definitions.
Table 12-5
| Caps Lock LED | Compose LED | Scroll Lock LED | Num Lock LED | Failing FRU |
|---|---|---|---|---|
| X | | | | Main logic board |
| | X | | | CPU 0 |
| | X | | X | CPU 1 |
| X | | | X | No memory detected |
| X | X | | | Memory bank 0 |
| X | X | | X | Memory bank 1 |
| X | X | X | | Memory bank 2 |
| X | X | X | X | Memory bank 3 |
| | | | X | NVRAM |
The Caps Lock LED blinks on and off to indicate that the POST diagnostics are running. When it lights steadily, it indicates an error.
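The error code table lends itself to a simple lookup. The following Python sketch is illustrative only and is not part of any Sun software; the names LED_ERROR_CODES and decode_post_leds are hypothetical. It transcribes Table 12-5, treating a lit LED as True.

```python
# Map (Caps Lock, Compose, Scroll Lock, Num Lock) LED states to the
# failing FRU, transcribed from Table 12-5. True means the LED is lit.
LED_ERROR_CODES = {
    (True, False, False, False): "Main logic board",
    (False, True, False, False): "CPU 0",
    (False, True, False, True): "CPU 1",
    (True, False, False, True): "No memory detected",
    (True, True, False, False): "Memory bank 0",
    (True, True, False, True): "Memory bank 1",
    (True, True, True, False): "Memory bank 2",
    (True, True, True, True): "Memory bank 3",
    (False, False, False, True): "NVRAM",
}


def decode_post_leds(caps_lock, compose, scroll_lock, num_lock):
    """Return the failing FRU for an LED pattern, or None if the
    pattern does not appear in the table."""
    pattern = (bool(caps_lock), bool(compose), bool(scroll_lock), bool(num_lock))
    return LED_ERROR_CODES.get(pattern)


if __name__ == "__main__":
    # Caps Lock, Compose, and Num Lock lit; Scroll Lock off.
    print(decode_post_leds(True, True, False, True))
```

A pattern that is not listed returns None rather than a guess, because the table defines only nine codes.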
Power supply LEDs are visible from the rear of the system. The following figure shows the LEDs on the power supply in bay 0.
The following table provides a description of each LED.
Table 12-6
| LED Name | Icon | Description |
|---|---|---|
| AC-Present-Status | | This green LED is lit to indicate that the primary circuit has power. When this LED is lit, the power supply is providing standby power to the system. |
| DC Status | | This green LED is lit to indicate that all DC outputs from the power supply are functional. |
The disk LEDs are visible from the front of the system when the bottom door is open, as shown in the following figure.
When a disk LED lights steadily and is green, it indicates that the slot is populated and that the drive is receiving power. When an LED is green and blinking, it indicates that there is activity on the disk. Some applications may use the LED to indicate a fault on the disk drive. In this case, the LED changes color to yellow and remains lit. The disk drive LEDs retain their state even when the system is powered off.
Error messages and other system messages are saved in the file /var/adm/messages.
The two firmware-based diagnostic tools, POST and OBDiag, provide error messages either locally on the system console or remotely on an RSC console. These error messages can help to further refine your problem diagnosis. The amount of error information displayed in diagnostic messages is determined by the value of the OpenBoot PROM variable diag-verbosity. See "OBDiag Configuration Variables" for additional details.
System software provides Solaris and OBP commands that you can use to diagnose problems. For more information on Solaris commands, see the appropriate man pages. For additional information on OBP commands, see the OpenBoot 3.x Command Reference Manual. (An online version of the manual is included with the Solaris System Administrator AnswerBook that ships with Solaris software.)
The prtdiag command is a UNIX shell command used to display system configuration and diagnostic information. You can use the prtdiag command to display:
System configuration, including information about clock frequencies, CPUs, memory, and I/O card types
Diagnostic information
Failed field-replaceable units (FRUs)
% /usr/platform/sun4u/sbin/prtdiag
To isolate an intermittent failure, it may be helpful to maintain a prtdiag history log. Use prtdiag with the -l (log) option to send output to a log file in /var/adm.
Refer to the prtdiag man page for additional information.
An example of prtdiag output follows. The exact format of prtdiag output depends on which version of the Solaris operating environment is running on your system.
% /usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Ultra Enterprise 250(2 X UltraSPARC-II 248MHz)
System clock frequency: 83 MHz
Memory size: 640 Megabytes

========================= CPUs ========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
SYS   0      0      248     1.0   US-II    1.1
SYS   1      1      248     1.0   US-II    1.1

========================= Memory =========================

       Interlv.  Socket   Size
Bank    Group     Name    (MB)  Status
----    -----    ------   ----  ------
  0     none     U0801     32     OK
  0     none     U0701     32     OK
  0     none     U1001     32     OK
  0     none     U0901     32     OK
  1     none     U0802     64     OK
  1     none     U0702     64     OK
  1     none     U1002     64     OK
  1     none     U0902     64     OK
  2     none     U0803     32     OK
  2     none     U0703     32     OK
  2     none     U1003     32     OK
  2     none     U0903     32     OK
  3     none     U0804     32     OK
  3     none     U0704     32     OK
  3     none     U1004     32     OK
  3     none     U0904     32     OK

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                Model
---  ----  ----  ----  ------------------  ----------------------
SYS  PCI    33     0   SUNW,m64B           ATY,GT-B
SYS  PCI    33     1   pciclass,078000
SYS  PCI    33     2   pciclass,078000
SYS  PCI    33     3   glm                 Symbios,53C875

No failures found in System
===========================

========================= Environmental Status =========================

System Temperatures (Celsius):
------------------------------
CPU0   44
CPU1   52
MB0    32
MB1    26
PDB    26
SCSI   24
=================================

Front Status Panel:
-------------------
Keyswitch position is in On mode.

System LED Status:   DISK ERROR            POWER
                     [OFF]                 [ ON]
                     POWER SUPPLY ERROR    ACTIVITY
                     [OFF]                 [OFF]
                     GENERAL ERROR         THERMAL ERROR
                     [OFF]                 [OFF]
=================================

Disk LED Status:   OK = GREEN   ERROR = YELLOW
DISK 5: [OK]   DISK 3: [OK]   DISK 1: [OK]
DISK 4: [OK]   DISK 2: [OK]   DISK 0: [OK]
=================================

Fan Bank :
----------

Bank   Speed     Status
       (0-255)
----   -----     ------
SYS     140        OK
=================================

Power Supplies:
---------------

Supply   Status
------   ------
   0       OK

========================= HW Revisions =========================

ASIC Revisions:
---------------
STP2223BGA: Rev 4
STP2003QFP: Rev 1

System PROM revisions:
----------------------
OBP 3.5.145 1997/10/15 14:50
POST 5.0.5 1997/10/09 16:52
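The temperature readings in the Environmental Status section can be compared against the thresholds quoted in Table 12-7. The following Python sketch is illustrative only and is not part of Solaris or prtdiag. The function name classify_temperatures is hypothetical, and the sketch assumes that each sensor is reported as a name followed by an integer reading, as in the sample output; pass it only the temperature section of the output.

```python
import re

# (warning, critical) thresholds in degrees C, as quoted in Table 12-7.
CPU_LIMITS = (60, 65)       # CPU sensors
AMBIENT_LIMITS = (53, 58)   # PDB, SCSI backplane, MB0, MB1

# Assumed layout: sensor name, whitespace, integer reading.
SENSOR_RE = re.compile(r"\b(CPU\d|MB\d|PDB|SCSI)\s+(\d+)\b")


def classify_temperatures(section_text):
    """Return {sensor: (reading, status)} for the temperature section,
    where status is 'ok', 'warning', or 'critical'."""
    results = {}
    for name, value in SENSOR_RE.findall(section_text):
        reading = int(value)
        warning, critical = CPU_LIMITS if name.startswith("CPU") else AMBIENT_LIMITS
        if reading > critical:
            status = "critical"
        elif reading > warning:
            status = "warning"
        else:
            status = "ok"
        results[name] = (reading, status)
    return results


if __name__ == "__main__":
    sample = "CPU0 44\nCPU1 52\nMB0 32\nMB1 26\nPDB 26\nSCSI 24\n"
    for sensor, (reading, status) in classify_temperatures(sample).items():
        print(sensor, reading, status)
```

Logging this classification alongside a prtdiag history may help show whether an intermittent failure coincides with rising temperatures.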
If you are working from the OBP prompt (ok), you can use the OBP show-devs command to list the devices in the system configuration.
Use the OBP printenv command to display the OpenBoot PROM configuration variables stored in the system NVRAM. The display includes the current values for these variables as well as the default values.
To diagnose problems with the SCSI subsystem, you can use the OBP probe-scsi and probe-scsi-all commands. Both commands require that you halt the system.
When it is not practical to halt the system, you can use SunVTS as an alternate method of testing the SCSI interfaces. See "About Diagnostic Tools" for more information.
The probe-scsi command transmits an inquiry command to all SCSI devices connected to the main logic board SCSI interfaces. This includes any tape or CD-ROM drive in the removable media assembly (RMA), any internal disk drive, and any device connected to the external SCSI connector on the system rear panel. For any SCSI device that is connected and active, its target address, unit number, device type, and manufacturer name are displayed.
The probe-scsi-all command transmits an inquiry command to all SCSI devices connected to the system SCSI host adapters, including any host adapters installed in PCI slots. The first identifier listed in the display is the SCSI host adapter address in the system device tree followed by the SCSI device identification data.
The first example that follows shows a probe-scsi output message. The second example shows a probe-scsi-all output message.
ok probe-scsi
This command may hang the system if a Stop-A or halt command
has been executed. Please type reset-all to reset the system
before executing this command.
Do you wish to continue? (y/n) n
ok reset-all
ok probe-scsi
Primary UltraSCSI bus:
Target 0 Unit 0 Disk SEAGATE ST34371W SUN4.2G3862
Target 4 Unit 0 Removable Tape ARCHIVE Python 02635-XXX5962
Target 6 Unit 0 Removable Read Only device TOSHIBA XM5701TASUN12XCD0997
Target 9 Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
Target b Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
ok
ok probe-scsi-all
This command may hang the system if a Stop-A or halt command
has been executed. Please type reset-all to reset the system
before executing this command.
Do you wish to continue? (y/n) y
/pci@1f,4000/scsi@4,1
Target 2 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target 3 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target 4 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target 5 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target 8 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target 9 Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target a Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target b Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target c Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target d Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target e Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target f Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
/pci@1f,4000/scsi@4
Target 2 Unit 0 Disk SEAGATE ST32550W SUN2.1G0416
Target 3 Unit 0 Disk SEAGATE ST32550W SUN2.1G0416
Target 4 Unit 0 Disk SEAGATE ST32550W SUN2.1G0416
Target 5 Unit 0 Disk SEAGATE ST32430W SUN2.1G0666
Target 8 Unit 0 Disk SEAGATE ST32550W SUN2.1G0416
Target 9 Unit 0 Disk SEAGATE ST32550W SUN2.1G0416
Target a Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target b Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target c Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target d Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target e Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
Target f Unit 0 Disk SEAGATE ST32550W SUN2.1G0418
/pci@1f,4000/scsi@3,1
/pci@1f,4000/scsi@3
Target 0 Unit 0 Disk SEAGATE ST34371W SUN4.2G3862
Target 4 Unit 0 Removable Tape ARCHIVE Python 02635-XXX5962
Target 6 Unit 0 Removable Read Only device TOSHIBA XM5701TASUN12XCD0997
Target 9 Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
Target b Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
/pci@1f,4000/pci@5/SUNW,isptwo@4
Target 1 Unit 0 Disk SEAGATE ST34371W SUN4.2G8246
Target 2 Unit 0 Disk SEAGATE ST34371W SUN4.2G8254
Target 3 Unit 0 Disk SEAGATE ST34371W SUN4.2G8246
Target 4 Unit 0 Disk SEAGATE ST34371W SUN4.2G8246
Target 5 Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
Target 6 Unit 0 Disk SEAGATE ST34371W SUN4.2G7462
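When comparing probe results against an expected configuration, it can help to reduce a captured transcript to a list of devices per host adapter. The following Python sketch is illustrative only. The function name parse_probe_scsi is hypothetical, and the sketch assumes the captured text has one device per line in the form "Target t Unit u description", with each host adapter path on its own line beginning with a slash; actual console layout may differ.

```python
import re

# Target addresses are single hexadecimal digits (0-f) in the examples.
DEVICE_RE = re.compile(r"Target\s+([0-9a-f])\s+Unit\s+(\d+)\s+(.+)")


def parse_probe_scsi(output):
    """Return a list of (adapter, target, unit, description) tuples.

    adapter is the most recent device-tree path seen, or None when the
    transcript (for example, plain probe-scsi) lists no adapter path.
    """
    devices = []
    adapter = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if line.startswith("/"):
            adapter = line
            continue
        match = DEVICE_RE.match(line)
        if match:
            target, unit, description = match.groups()
            devices.append((adapter, int(target, 16), int(unit), description.strip()))
    return devices


if __name__ == "__main__":
    sample = (
        "/pci@1f,4000/scsi@3\n"
        "Target 0 Unit 0 Disk SEAGATE ST34371W SUN4.2G3862\n"
        "Target b Unit 0 Disk SEAGATE ST34371W SUN4.2G7462\n"
    )
    for device in parse_probe_scsi(sample):
        print(device)
```

A drive that is installed but missing from the parsed list is the one to investigate first.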
The system is unable to communicate over the network.
Your system conforms to the Ethernet 10/100BASE-T standard, which states that the Ethernet 10BASE-T link integrity test function should always be enabled on both the host system and the Ethernet hub. The system cannot communicate with a network if this function is not set identically for both the system and the network hub (either enabled for both or disabled for both). This problem applies only to 10BASE-T network hubs, where the Ethernet link integrity test is optional. This is not a problem for 100BASE-T networks, where the test is enabled by default. Refer to the documentation provided with your Ethernet hub for more information about the link integrity test function.
If you connect the system to a network and the network does not respond, use the OpenBoot PROM command watch-net-all to display conditions for all network connections:
ok watch-net-all
For most PCI Ethernet cards, the link integrity test function can be enabled or disabled with a hardware jumper on the PCI card, which you must set manually. (See the documentation supplied with the card.) For the standard TPE and MII main logic board ports, the link test is enabled or disabled through software, as shown below.
Remember also that the TPE and MII ports share the same circuitry and as a result, only one port can be used at a time.
Some hub designs permanently enable (or disable) the link integrity test through a hardware jumper. In this case, refer to the hub installation or user manual for details of how the test is implemented.
To enable or disable the link integrity test for the standard Ethernet interface, or for a PCI-based Ethernet interface, you must first know the device name of the desired Ethernet interface. To list the device name:
Shut down the operating system and take the system to the ok prompt.
Determine the device name for the desired Ethernet interface:
Use this method while the operating system is running:
Become superuser.
# eeprom nvramrc="probe-all install-console banner apply disable-link-pulse device-name"
(Repeat for any additional device names.)
# eeprom "use-nvramrc?"=true
Reboot the system (when convenient) to make the changes effective.
Use this alternate method when the system is already in OpenBoot:
ok nvedit
0: probe-all install-console banner
1: apply disable-link-pulse device-name
(Repeat this step for other device names as needed.)
(Press CONTROL-C to exit nvedit.)
ok nvstore
ok setenv use-nvramrc? true
Reboot the system to make the changes effective.
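When several Ethernet interfaces need the same treatment, the nvramrc contents can be assembled programmatically before being entered with either method. The following Python sketch is illustrative only. The function name build_nvramrc is hypothetical, the device names passed to it are placeholders for the names you determined earlier, and placing each apply command on its own line follows the nvedit example rather than any documented requirement.

```python
def build_nvramrc(device_names):
    """Return nvramrc text that disables the link integrity test
    (link pulse) for each named Ethernet device."""
    lines = ["probe-all install-console banner"]
    for name in device_names:
        # One apply line per device, mirroring the nvedit example.
        lines.append("apply disable-link-pulse " + name)
    return "\n".join(lines)


if __name__ == "__main__":
    # "device-name" is a placeholder, not a real device path.
    print(build_nvramrc(["device-name"]))
```

The generated text still has to be stored in NVRAM and use-nvramrc? set to true, as described in the steps above.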
The system attempts to power up but does not boot or initialize the monitor.
Run POST diagnostics.
Observe POST results.
The front panel general fault LED should blink slowly to indicate that POST is running. Check the POST output using a locally attached terminal, tip connection, or RSC console.
By default, POST output is displayed locally on an attached terminal or through a tip connection. If your server has been reconfigured to display POST output on an RSC console, POST results will not display locally. To redirect POST output to the local system, you must execute the OpenBoot PROM command diag-output-to ttya from the RSC console. See the Remote System Control (RSC) User's Guide for additional details.
If you see no front panel LED activity, a power supply may be defective.
See "Power Supply LEDs".
If the general fault LED remains lit, or the POST output contains an error message, then POST has failed.
The most probable cause for this type of failure is the main logic board. However, before replacing the main logic board you should:
No video at the system monitor.
Check that the power cord is connected to the monitor and to the wall outlet.
Verify with a volt-ohmmeter that the wall outlet is supplying AC power.
Verify that the video cable connection is secure between the monitor and the video output port.
Use a volt-ohmmeter to perform the continuity test on the video cable.
If the cables and their connections are okay, then troubleshoot the monitor and the graphics card.
A disk drive read, write, or parity error is reported by the operating system or a software application.
A CD-ROM drive read error or parity error is reported by the operating system or a software application.
Disk drive or CD-ROM drive fails to boot or is not responding to commands.
Test the drive response to the probe-scsi-all command as follows:
At the system ok prompt, enter:
ok reset-all
ok probe-scsi-all
If the SCSI device responds correctly to probe-scsi-all, a message similar to the one above is printed out.
If the device responds and a message is displayed, the system SCSI controller has successfully probed the device. This indicates that the main logic board is operating correctly.
If one drive does not respond to the SCSI controller probe but the others do, replace the unresponsive drive.
If only one internal disk drive is configured with the system and the probe-scsi-all test fails to show the device in the message, replace the drive. If the problem is still evident after replacing the drive, replace the main logic board. If replacing both the disk drive and the main logic board does not correct the problem, replace the associated UltraSCSI data cable and UltraSCSI backplane.
To check whether the main logic board SCSI controllers are defective, test the drive response to the probe-scsi command. To test additional SCSI host adapters added to the system, use the probe-scsi-all command. You can use the OBP printenv command to display the OpenBoot PROM configuration variables stored in the system NVRAM. The display includes the current values for these variables as well as the default values. See "OBP printenv Command" for more information.
ok probe-scsi
If a message is displayed for each installed disk, the system SCSI controllers have successfully probed the devices. This indicates that the main logic board is working correctly.
If a disk doesn't respond:
If the problem persists, replace the unresponsive drive.
If the problem remains after replacing the drive, replace the main logic board.
If the problem persists, replace the associated SCSI cable and backplane.
If there is a problem with a power supply, POST lights the general fault indicator and the power supply fault indicator on the front panel. If you have more than one power supply, then you can use the LEDs located on the power supplies themselves to identify the faulty supply. The power supply LEDs will indicate any problem with the AC input or DC output. See "Power Supply LEDs" for more information about the LEDs.
SunVTS and POST diagnostics can report memory errors encountered during program execution. Memory error messages typically indicate the DIMM location number ("U" number) of the failing module.
Use the following diagram to identify the location of a failing memory module from its U number:
After you have identified the defective DIMM, remove it according to the instructions in "How to Remove a Memory Module". Install the replacement DIMM according to the directions in "How to Install a Memory Module".
The environmental monitoring subsystem monitors the temperature of the system as well as the operation of the fans and power supplies. For more information on the environmental monitoring subsystem, see "Environmental Monitoring and Control".
In response to an environmental error condition, the monitoring subsystem generates error messages that are displayed on the system console and logged in the /var/adm/messages file. These error messages are described in the table below.
Table 12-7
| Message | Type | Description |
|---|---|---|
| TEMPERATURE WARNING: X degrees celsius at location Y. | Warning | Indicates that the temperature measured at location Y has exceeded the warning threshold; if the temperature continues to rise, the system will shut down. If location Y is a CPU sensor (CP0 or CP1), the temperature X has exceeded 60 degrees C. If location Y is a sensor on the PDB (power distribution board), the SCSI backplane, or the main logic board (MB0 or MB1), the ambient temperature X has exceeded 53 degrees C. |
| TEMPERATURE CRITICAL: X degrees celsius at location Y. | Warning | Indicates that the temperature measured at location Y has exceeded a critical threshold. After this warning message, the system automatically shuts down. If location Y is a CPU sensor (CP0 or CP1), the temperature X has exceeded 65 degrees C. If location Y is a sensor on the PDB (power distribution board), the SCSI backplane, or the main logic board (MB0 or MB1), the ambient temperature X has exceeded 58 degrees C. |
| Power Supply X NOT okay. | Warning | Indicates that there is something wrong with the DC output of the supply. The system may shut down abruptly if the redundant power supply fails. The value X identifies the power supply: PS0 is the lower power supply; PS1 is the upper power supply. |
| Power supply X inserted | Advisory | Hot-swap notification that the power supply identified by X was installed without service disruption. |
| Power supply X removed | Advisory | Hot-swap notification that the power supply identified by X was removed without service disruption. |
| WARNING: Fan failure has been detected | Warning | Indicates a fan failure in the fan tray assembly. |
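Because these messages are also logged in /var/adm/messages, they can be picked out of the log with simple pattern matching. The following Python sketch is illustrative only. The function name classify_env_message and the category labels are hypothetical; the message wording and types are taken from Table 12-7, and no assumption is made about any timestamp or host prefix that precedes the message on a log line.

```python
import re

# (pattern, label, type) triples; wording and types follow Table 12-7.
ENV_PATTERNS = [
    (re.compile(r"TEMPERATURE WARNING: (\d+) degrees celsius at location (\w+)"),
     "temperature warning", "Warning"),
    (re.compile(r"TEMPERATURE CRITICAL: (\d+) degrees celsius at location (\w+)"),
     "temperature critical", "Warning"),
    (re.compile(r"Power Supply (\w+) NOT okay"),
     "power supply fault", "Warning"),
    (re.compile(r"Power supply (\w+) inserted"),
     "power supply inserted", "Advisory"),
    (re.compile(r"Power supply (\w+) removed"),
     "power supply removed", "Advisory"),
    (re.compile(r"Fan failure has been detected"),
     "fan failure", "Warning"),
]


def classify_env_message(line):
    """Return (label, type, captured_values) for an environmental
    message, or None if the line matches no known message."""
    for pattern, label, message_type in ENV_PATTERNS:
        match = pattern.search(line)
        if match:
            return label, message_type, match.groups()
    return None


if __name__ == "__main__":
    print(classify_env_message(
        "TEMPERATURE WARNING: 61 degrees celsius at location CP0."))
```

Lines that match none of the patterns return None, so unrelated log entries are skipped rather than misclassified.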
If the environmental monitoring system detects a temperature problem, it also lights the temperature LED on the status and control panel. If it detects a power supply problem, it lights the power supply fault LED on the panel. The LEDs located on the power supplies themselves will help to further identify the problem. For information about system LEDs, see:
Enterprise 250 power supplies will shut down automatically in response to certain over-temperature and power fault conditions (see "Environmental Monitoring and Control"). To recover from an automatic shutdown, you must disconnect the AC power cord, wait approximately 10 seconds, and then reconnect the power cord.