Sun Enterprise 250 Server Owner's Guide

Chapter 12 Diagnostics and Troubleshooting

This chapter covers the diagnostic tools available for the system and explains how to use them. It also describes the error indications and software commands that help you determine which component of the system needs to be replaced.

Tasks covered in this chapter include:

Other information covered in this chapter includes:

About Diagnostic Tools

The system provides both firmware-based and software-based diagnostic tools to help you identify and isolate hardware problems. These tools include:

POST diagnostics verify the core functionality of the system, including the main logic board, system memory, and any on-board I/O devices. You can run POST even if the system is unable to boot. For more information about POST, see "About Power-On Self-Test (POST) Diagnostics" and "How to Use POST Diagnostics".

OBDiag tests focus on system I/O and peripheral devices. Like POST, you can run OBDiag even if the system is unable to boot. For more information about OBDiag, see "About OpenBoot Diagnostics (OBDiag)" and "How to Use OpenBoot Diagnostics (OBDiag)".

The SunVTS system exerciser is a graphics-oriented UNIX application that permits the continuous exercising of system resources and internal and external peripheral equipment. For more information about SunVTS, see "About SunVTS Software".

Solstice SyMON allows you to monitor system hardware status and operating system performance of your server. For information about SyMON, see "About Solstice SyMON Software".

Remote System Control (RSC) is a server management tool that provides remote system administration for geographically distributed or physically inaccessible systems. The RSC software works with the System Service Processor (SSP) on the Enterprise 250 main logic board. For more information about RSC and SSP, see "About Remote System Control (RSC)".

Which method or tool you use to diagnose system problems depends on the nature of those problems:

Figure 12-1  


About Power-On Self-Test (POST) Diagnostics

The POST diagnostic code resides in flash PROM on the main logic board. It runs whenever the system is turned on or when a system reset is issued. POST tests the following system components:

POST reports its test results via LEDs located on the system keyboard and on the system front panel. See "Error Indications" for more information about LEDs and error messages.

POST displays detailed diagnostic and error messages on a local terminal, if one is attached to the system's serial port A. You can also choose to display POST output remotely on a Remote System Control (RSC) console.

The System Service Processor (SSP) runs its own POST diagnostics, separate from the main POST diagnostics. To view detailed diagnostic and error messages from SSP POST, you must attach a local terminal to the SSP (RSC) serial port prior to running SSP POST.

For more information about RSC and the System Service Processor, see "About Remote System Control (RSC)". For information about running POST, see "How to Use POST Diagnostics".

How to Use POST Diagnostics

When you turn on the system power, POST diagnostics run automatically if any of the following conditions apply:

In the event of an automatic system reset, POST diagnostics run under either of the following conditions:

For information about the various keyswitch positions, see "About the Status and Control Panel".

Before You Begin

You can choose to view POST diagnostic and error messages locally on an attached terminal or remotely on an RSC console.

To view POST diagnostic messages on the local system, you need to connect an alphanumeric terminal or establish a tip connection to another Sun system. For more information, see "About Setting Up a Console".

To view POST diagnostic messages remotely on an RSC console, you need to configure the RSC software before starting POST. For information about using the RSC software, see the Remote System Control (RSC) User's Guide.


Note -

By default, POST output is displayed locally on an attached terminal or through a tip connection. If your server has been reconfigured to display POST output on an RSC console, POST results will not display locally. To redirect POST output to the local system, you must issue the OpenBoot PROM command diag-output-to ttya from the RSC console. See the Remote System Control (RSC) User's Guide for additional details.


You can choose to run an abbreviated POST with concise error and status reporting or run an extensive POST with more detailed messages. For more information, see "How to Set the Diagnostic Level for POST and OBDiag".

What to Do

  1. Ensure that the front panel keyswitch is in the Standby position.

    For descriptions of the various keyswitch settings, see "About the Status and Control Panel".

  2. Turn the keyswitch to the Diagnostics position.

    The system runs the POST diagnostics. POST displays status and error messages on the system console or on an RSC console, if the RSC software is configured to display POST output. For more information, see the "Results" section below.

    Upon successful completion of POST, the system will run OBDiag. For more information about OBDiag, see "About OpenBoot Diagnostics (OBDiag)" and "How to Use OpenBoot Diagnostics (OBDiag)".

Results

While POST is running, you can observe its progress and any error indications in the following locations:

You can also obtain a summary of POST results by using the .post and .rsc commands.

Observing POST in Progress

As POST runs, it displays detailed diagnostic status messages on the system console (or on an RSC console, if POST output has been redirected to an RSC console). If POST detects an error, it displays an error message on either the system console or the RSC console that indicates the failing part. A sample error message is provided below:


Power On Self Test Failed. Cause: DIMM U0702 or System Board
ok

POST status and error conditions are indicated by the general fault LED on the system front panel. The LED blinks slowly to indicate that POST is running. It remains lit if POST detects a fault.

If a Sun Type-5 keyboard is attached, POST status and error indications are also displayed via the four LEDs on the keyboard. When POST starts, all four keyboard LEDs blink on and off simultaneously. After that, the Caps Lock LED blinks slowly to indicate POST is running. If an error is detected, the pattern of the lit LEDs provides an error indication. See "Error Indications" for more information.

If POST detects an error condition that prevents the system from booting, it will halt operation and display the ok prompt. The last message displayed by POST prior to the ok prompt indicates which part you need to replace.
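If you capture the console session to a text file (for example, from a tip connection), the last POST failure message can be extracted with a short script. The following Python sketch assumes the message keeps the "Cause:" form shown in the sample error message above. The function name and the idea of a captured log are hypothetical; they are not part of POST or of any Sun tool.

```python
import re

# Assumed message shape, taken from the sample error message in this section:
#   Power On Self Test Failed. Cause: DIMM U0702 or System Board
CAUSE_PATTERN = re.compile(r"Power On Self Test Failed\.\s*Cause:\s*(.+)")


def last_post_failure(console_text):
    """Return the cause text of the last POST failure message, or None."""
    cause = None
    for line in console_text.splitlines():
        match = CAUSE_PATTERN.search(line)
        if match:
            # Keep overwriting so that the last failure message wins.
            cause = match.group(1).strip()
    return cause
```

Called on the sample message above, last_post_failure returns the text that follows "Cause:". It returns None when no failure message is present in the captured text.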

Obtaining a Summary of POST Results

Use the .post command at the ok prompt to view a summary of POST results.


ok .post
System status: OK
CPU0:          OK
CPU1:          OK
SC-MP:         OK
Psycho@1f:     OK
Cheerio:       OK
SCSI:          OK
Mem Bank0:     OK
Mem Bank1:     OK
Mem Bank2:     OK
Mem Bank3:     OK
PROM:          OK
NVRAM:         OK
TTY:           OK
SuperIO:       OK
PCI Slots:     OK

Use the .rsc command at the ok prompt to view a summary of SSP POST results.


ok .rsc
SEEPROM:           OK
I2C:               OK
Ethernet:          OK
Ethernet (2):      OK
CPU:               OK
RAM:               OK
Console:           OK
RSC Console line:  OK
RSC Control line:  OK
FlashRAM Boot CRC: OK
FlashRAM Main CRC: OK
RSC Console Link:  OK
Console Selection: ttya
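If the .post or .rsc summary is captured as text, it can be checked automatically. The following Python sketch is one possible approach, not a Sun-supplied utility. The samples above show only OK values, so treating any other value as a problem is an assumption, as is treating the Console Selection line as informational. Both function names are hypothetical.

```python
def parse_summary(console_text):
    """Map each 'Name: value' line of a .post or .rsc summary to its value."""
    summary = {}
    for line in console_text.splitlines():
        line = line.strip()
        # Skip blank lines and the echoed command line (for example 'ok .post').
        if not line or line.startswith("ok "):
            continue
        name, separator, value = line.partition(":")
        if separator:
            summary[name.strip()] = value.strip()
    return summary


def not_ok(summary, informational=("Console Selection",)):
    """List entries whose value is not 'OK', skipping informational fields."""
    return [name for name, value in summary.items()
            if name not in informational and value != "OK"]
```

An empty list from not_ok means that every status line in the captured summary reported OK.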

About OpenBoot Diagnostics (OBDiag)

OpenBoot Diagnostics (OBDiag) reside in flash PROM on the main logic board. OBDiag can isolate errors in the following system components:

OBDiag tests not only the main logic board but also its interfaces:

OBDiag reports test results via the LEDs located on the system front panel. See "Error Indications" for more information about LEDs and error messages.

OBDiag displays detailed diagnostic and error messages on a local console or terminal, if one is attached to the system. Alternatively, you can display OBDiag output remotely on a Remote System Control (RSC) console. For more information about RSC, see "About Remote System Control (RSC)".

OBDiag tests run automatically under certain conditions. You can also run OBDiag interactively from the system ok prompt. For information about running OBDiag, see "How to Use OpenBoot Diagnostics (OBDiag)".

When you run OBDiag interactively from the ok prompt, you invoke the OBDiag menu, which lets you select which tests you want to perform. For information about the OBDiag menu, see "OBDiag Menu".

The system also provides configuration variables that you can set to affect the operation of the OBDiag tests. For information about the configuration variables, see "OBDiag Configuration Variables".

OBDiag Menu

The OBDiag menu is created dynamically whenever you invoke OBDiag in interactive mode. OBDiag determines whether any optional devices are installed in the system. If a device has an on-board self-test, OBDiag incorporates the test name into the list of menu entries. It sorts the menu entries in alphabetical order and numbers them accordingly. Therefore, the menu entries may vary from system to system, depending on the system configuration.

The OBDiag menu always displays the core tests that exercise parts of the basic system. These tests include envctrltwo, ebus, ecpp, eeprom, fdthree, network, scsi@3, scsi@3,1, se, su, and rsc. For information about each test, see "OBDiag Test Descriptions". For a description of the interactive commands for running OBDiag, see "OBDiag Commands".

Once you invoke OBDiag as described in "How to Use OpenBoot Diagnostics (OBDiag)", the OBDiag menu is displayed.

Figure 12-2  


OBDiag Commands

The following table provides information about the OBDiag interactive commands that are available at the OBDiag command prompt:

Table 12-1  

Command                Description
exit                   Exits the OBDiag tool and returns to the ok prompt.
help                   Displays a brief description of each command and OpenBoot PROM variable used to run OBDiag.
printenvs              Displays the value of all of the OBDiag variables. (See "OBDiag Configuration Variables" for information about settings.)
setenv variable value  Sets the value for an OpenBoot PROM configuration variable. (See "OBDiag Configuration Variables" for information about settings.)
test-all               Runs all of the tests displayed in the menu.
test #,#,              Runs only the test(s) identified by menu entry number (#) in the command line.
except #,#,            Runs all tests except those identified by menu entry number (#) in the command line.
what #,#,              Displays selected properties of the device(s) identified by menu entry number (#) in the command line. The exact information provided varies according to device type.

OBDiag Configuration Variables

The following table provides information about OpenBoot PROM configuration variables that affect the operation of OBDiag. Use the printenvs command to show current values and the setenv command to set or change a value. Both commands are described in "OBDiag Commands".

Table 12-2  

Variable        Setting            Default      Description
diag-level      off                             No tests are run at power up.
                min                min          Performs minimal testing of core functionality.
                med                             Performs functional tests for all system functions.
                max                             Runs exhaustive tests for all functions except external loopbacks. External loopback tests are run only if diag-targets is set to loopback, loopback3, device&loopback, or device&loopback,3.
diag-continue?  false              false        Stops testing within a test routine and prints a message as soon as an error is detected. OBDiag then skips to the next test routine in the sequence.
                true                            Causes OBDiag to run all subtests within a test, even if an error is detected.
diag-passes     n                  1            Repeats each test the number of times specified by n. Works with the test, except, and test-all commands.
diag-targets    none               none         Runs internal tests only, no I/O testing.
                iopath                          Extends testing to external device interfaces (connectors/cables).
                media                           Extends testing to external devices and media, if present.
                device                          Invokes built-in self-test (BIST) on PCI cards and external devices.
                loopback                        Runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, and RSC serial ports.
                loopbacks                       Not for use on Enterprise 250 servers.
                loopback2                       Not for use on Enterprise 250 servers.
                loopback3                       Runs external loopback tests on the RSC Ethernet port.
                nomem                           Performs tests without testing system memory.
                device&loopback                 Runs built-in self-test (BIST) on PCI cards and external devices, then runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, and RSC serial ports.
                device&loopbacks                Not for use on Enterprise 250 servers.
                device&loopback,3               Runs built-in self-test (BIST) on PCI cards and external devices, then runs external loopback tests on the parallel, serial, keyboard, mouse, TPE, RSC serial, and RSC Ethernet ports.
diag-trigger    power-reset        power-reset  Runs diagnostics only on power-on resets.
                error-reset                     Runs diagnostics only on power-on resets, fatal hardware errors, and watchdog reset events.
                soft-reset                      Runs diagnostics on all resets (except XIR).
diag-verbosity  0                  0            Prints one line that indicates the device being tested and its pass/fail status.
                1                               Prints more detailed test status, which varies in content from test to test.
                2                               Prints subtest names.
                4                               Prints debug messages.
                8                               Prints back trace of callers on error.
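The settings in Table 12-2 can be encoded as a small lookup so that a setenv line is checked before it is typed at the prompt. The Python sketch below is illustrative only: the function names and the reset-kind labels are hypothetical, treating n as a positive integer is an assumption, and the three settings marked "Not for use on Enterprise 250 servers" are deliberately left out.

```python
# Settings copied from Table 12-2, omitting those marked
# "Not for use on Enterprise 250 servers".
OBDIAG_SETTINGS = {
    "diag-level": ["off", "min", "med", "max"],
    "diag-continue?": ["false", "true"],
    "diag-targets": ["none", "iopath", "media", "device", "loopback",
                     "loopback3", "nomem", "device&loopback",
                     "device&loopback,3"],
    "diag-trigger": ["power-reset", "error-reset", "soft-reset"],
    "diag-verbosity": ["0", "1", "2", "4", "8"],
}


def setenv_command(variable, value):
    """Return the 'setenv variable value' line, or raise ValueError."""
    value = str(value)
    if variable == "diag-passes":
        # Assumption: n is a positive whole number (the default is 1).
        if not value.isdigit() or int(value) < 1:
            raise ValueError("diag-passes expects a positive repeat count")
    elif variable not in OBDIAG_SETTINGS:
        raise ValueError("unknown OBDiag variable: " + variable)
    elif value not in OBDIAG_SETTINGS[variable]:
        raise ValueError(value + " is not a listed setting for " + variable)
    return "setenv " + variable + " " + value


# Reset kinds named in the diag-trigger rows; the labels are hypothetical.
TRIGGERED_BY = {
    "power-reset": {"power-on"},
    "error-reset": {"power-on", "fatal-hardware-error", "watchdog"},
}


def diagnostics_run(diag_trigger, reset_kind):
    """Return True if the diag-trigger setting covers the given reset kind."""
    if diag_trigger == "soft-reset":
        return reset_kind != "xir"  # all resets except XIR
    return reset_kind in TRIGGERED_BY[diag_trigger]
```

For example, setenv_command("diag-level", "max") produces the line setenv diag-level max, while an unlisted value raises ValueError instead of producing a command.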

 

OBDiag Test Descriptions

The following table provides information about the tests available through OBDiag. It provides the test name, a brief description of the test, and any special considerations involved in running the test.

Table 12-3  

Test Name: SUNW,envctrltwo@14,60000
Description: Verifies that the fans are operational. Checks that the temperature in the enclosure and at the CPUs does not exceed the maximum allowable range. Also tests the disk and front panel LEDs.

Test Name: ebus@1
Description: Tests the on-board ASIC that interfaces the following devices with the PCI bus: parallel port, serial port, keyboard, mouse, diskette drive, NVRAM, and the environmental monitoring and control system.

Test Name: ecpp@14,3043bc
Description: Tests parallel port I/O logic, including internal and external loopback tests.
Special Considerations: To run external loopback tests, you must have a special passive loopback connector attached to the parallel port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the parallel port loopback connector is 501-2965-01.

Test Name: eeprom@14,0
Description: Tests the NVRAM functionality.

Test Name: fdthree@14,3023f0
Description: Tests diskette drive control logic and the operation of the drive. The test does not differentiate among a drive, media, or main logic board error; if any of these fail, it reports the diskette drive as the FRU.
Special Considerations: A formatted diskette must be inserted into the drive.

Test Name: network@1,1
Description: Tests the on-board Ethernet logic, including internal and external loopback tests.
Special Considerations: To run external loopback tests on the TPE port, you must have a TPE loopback connector attached to the TPE port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the TPE loopback connector is 501-4689-01.

Test Name: scsi@3
Note: Depending on your system configuration, the OBDiag menu may include tests for additional SCSI interfaces, such as scsi@4, scsi@4,1, scsi@5, scsi@5,1, etc.
Description: Tests the on-board SCSI controller and SCSI bus subsystem for internal disk drives and removable media devices. Checks associated registers and performs a DMA transfer.

Test Name: scsi@3,1
Description: Tests the main logic board external SCSI interface. Checks associated registers and performs a DMA transfer.

Test Name: se@14,40000
Description: Tests serial port control and I/O logic, including internal and external loopback tests. The test checks I/O logic only if the external loopback test is enabled.
Special Considerations: Port A tests are not run if ttya is being used as the input/output device. To run external loopback tests, you must have a special passive loopback connector attached to each serial port, and the variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. There is one passive connector available for this test: Sun part number 501-4205-01. Use 501-4205-01 when ports A and B are not attached to external devices.

Test Name: su@14,3062f8
Description: Tests keyboard control and input logic, including internal and external loopback tests.
Special Considerations: Keyboard tests run only when a keyboard is used as the input device. To run external loopback tests, you must have a special passive loopback connector attached to the keyboard/mouse port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the loopback connector is 501-4690-01.

Test Name: su@14,3083f8
Description: Tests mouse control and input logic, including internal and external loopback tests.
Special Considerations: Mouse tests are not run if a keyboard is used as an input device. To run external loopback tests, you must have a special passive loopback connector attached to the keyboard/mouse port. The variable diag-targets must be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the loopback connector is 501-4690-01.

Test Name: rsc
Description: Tests RSC (SSP) hardware, including RSC serial and Ethernet ports. For additional details, see "About Remote System Control (RSC)".
Special Considerations: This test is not run if RSC is being used as the console device. To run external loopback tests on the RSC Ethernet port, the port must be connected to a 10-Mbps Ethernet network. The variable diag-targets must also be set to loopback3 or device&loopback,3. To run external loopback tests on the RSC serial port, a special passive serial loopback connector must be attached to the port. The variable diag-targets must also be set to loopback, device&loopback, or device&loopback,3. The Sun part number for the passive serial loopback connector is 501-4205-01.
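The external loopback prerequisites in Table 12-3 follow a regular pattern, which the Python sketch below collects in one place. It is an illustration rather than part of OBDiag: the dictionary layout, port labels, and function name are hypothetical, while the part numbers and diag-targets settings are copied from the table.

```python
# diag-targets settings that enable external loopback tests on most ports.
LOOPBACK = ("loopback", "device&loopback", "device&loopback,3")

# Per-port requirements copied from the Special Considerations in Table 12-3.
EXTERNAL_LOOPBACK = {
    "parallel port": {"connector": "501-2965-01", "diag_targets": LOOPBACK},
    "TPE port": {"connector": "501-4689-01", "diag_targets": LOOPBACK},
    "serial ports": {"connector": "501-4205-01", "diag_targets": LOOPBACK},
    "keyboard/mouse port": {"connector": "501-4690-01",
                            "diag_targets": LOOPBACK},
    "RSC serial port": {"connector": "501-4205-01", "diag_targets": LOOPBACK},
    # The RSC Ethernet test needs a 10-Mbps network connection rather than
    # a loopback connector, so no part number is recorded for it.
    "RSC Ethernet port": {"connector": None,
                          "diag_targets": ("loopback3", "device&loopback,3")},
}


def external_loopback_ports(diag_targets):
    """Return the ports whose external loopback test the setting enables."""
    return [port for port, need in EXTERNAL_LOOPBACK.items()
            if diag_targets in need["diag_targets"]]
```

With diag-targets set to loopback3, only the RSC Ethernet port is returned; with device&loopback,3, all six entries are returned.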

How to Use OpenBoot Diagnostics (OBDiag)

When you turn on the system power, OBDiag runs automatically if any of the following conditions apply:

In the event of an automatic system reset, OBDiag runs under either of the following conditions:

For information about the various keyswitch positions, see "About the Status and Control Panel".

OBDiag tests run automatically, without operator intervention, under the conditions described above. However, you can also run OBDiag in an interactive mode and select which tests you want to perform. The following procedure describes how to run OBDiag interactively from the system ok prompt.

What to Do


Note -

Perform this procedure with the power on and the keyswitch in the Power-on position.


  1. With the keyswitch in the Power-on position, press the Break key on your alphanumeric terminal's keyboard, or enter the Stop-a sequence on a Sun keyboard.

    To enter the Stop-a sequence, press the Stop key and the a key simultaneously. The ok prompt is displayed.

  2. (Optional) Select a diagnostic level.

    Four different levels of diagnostic testing are available for OBDiag; see "How to Set the Diagnostic Level for POST and OBDiag".

  3. (Optional) Select a diagnostic target.

    You can choose to run OBDiag with or without external loopback tests by using the OpenBoot PROM variable diag-targets. For more information, see "OBDiag Configuration Variables".

  4. Enter obdiag at the ok prompt:


    ok obdiag
    

  5. When the OBDiag menu appears, enter the appropriate command and test name/number at the command prompt.

    For command usage and descriptions, see "OBDiag Commands".

    Figure 12-3  



    Note -

    For more information about OBDiag tests, see "About OpenBoot Diagnostics (OBDiag)".


How to Set the Diagnostic Level for POST and OBDiag

Before You Begin

Four different levels of diagnostic testing are available for power-on self-test (POST) and OpenBoot Diagnostics (OBDiag): max (maximum level), med (medium level), min (minimum level), and off (no testing). The system runs the appropriate level of diagnostics based on the setting of the OpenBoot PROM variable called diag-level.

The default setting for diag-level is min.

If your server is set up without a local console, you'll need to set up a monitor or console before setting the diagnostic level. See "About Setting Up a Console".

What to Do


Note -

Perform this procedure with the power on and the keyswitch set to the Power-on position.


  1. With the keyswitch in the Power-on position, press the Break key on your alphanumeric terminal's keyboard, or enter the Stop-a sequence on a Sun keyboard.

    To enter the Stop-a sequence, press the Stop key and the a key simultaneously. The ok prompt is displayed.

  2. To set the diag-level variable, enter the following:


    ok setenv diag-level value
    

The value can be off, min, med, or max. See "OBDiag Configuration Variables" for information about each setting.

About SunVTS Software

SunVTS, the Sun Validation and Test Suite, is an online diagnostics tool and system exerciser for verifying the configuration and functionality of hardware controllers, devices, and platforms. You can run SunVTS using any of these interfaces: a command line interface, a tty interface, or a graphical interface that runs within a windowed desktop environment.

SunVTS software lets you view and control a testing session over modem lines or over a network. Using a remote system, you can view the progress of a SunVTS testing session, change testing options, and control all testing features of another system on the network.

Useful tests to run on your system include:

Table 12-4  

SunVTS Test  Description
ecpptest     Verifies the ECP1284 parallel port printer functionality
cdtest       Tests the CD-ROM drive by reading the disc and verifying the CD table of contents (TOC), if it exists
disktest     Verifies local disk drives
env2test     Tests the I2C environment control system including all fans, front panel LEDs and keyswitch, disk backplane LEDs, power supplies, and thermistor readings
fputest      Checks the floating-point unit
fstest       Tests the integrity of the software's file systems
m64test      Tests the PGX frame buffer card
mptest       Verifies multiprocessor features (for systems with more than one processor)
nettest      Checks all the hardware associated with networking (for example, Ethernet, token ring, quad Ethernet, fiber optic, 100-Mbit per second Ethernet devices)
pmem         Tests the physical memory (read only)
sptest       Tests the system's on-board serial ports
tapetest     Tests the various Sun tape devices
rsctest      Verifies the RSC/SSP functionality, including SSP Ethernet and serial ports, I2C, and SSP Flash RAM.
vmem         Tests the virtual memory (a combination of the swap partition and the physical memory)

For More Information

The following documents provide information about SunVTS software. They are available on Solaris on Sun Hardware AnswerBook. This AnswerBook documentation is provided on the SMCC Updates CD for the Solaris release you are running.

This document describes the SunVTS environment, including how to start and control the various user interfaces. SunVTS features are described in this document.

This document contains descriptions of each test SunVTS software runs in the SunVTS environment. Each test description explains the various test options and gives command line arguments.

This card gives an overview of the main features of the SunVTS Open Look interface.

How to Check Whether SunVTS Software Is Installed

SunVTS software is an optional package that may or may not have been loaded when your system software was installed.

To check whether SunVTS is installed, you must access your system either from a console (see "About Setting Up a Console"), or from a remote machine logged in to the system.

What to Do

  1. Enter the following:


    % pkginfo -l SUNWvts
    

    • If SunVTS software is loaded, information about the package will be displayed.

    • If SunVTS software is not loaded, you'll see an error message:


       ERROR: information for "SUNWvts" was not found

  2. If necessary, use the pkgadd utility to load the SUNWvts package onto your system from the SMCC Update CD.

    Note that /opt/SUNWvts is the default directory for installing SunVTS software.
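The same check can be scripted. The following Python sketch assumes that a Python 3 interpreter is available on the server, which is not part of the system software described in this guide, and that pkginfo exits with a nonzero status when the package is not found. The function name and the run parameter are hypothetical.

```python
import subprocess


def is_package_installed(name, run=subprocess.run):
    """Return True when 'pkginfo -l name' exits with status 0.

    The run parameter exists only so that the command runner can be
    replaced, for example when trying the function on a system that
    has no pkginfo command.
    """
    result = run(["pkginfo", "-l", name],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling is_package_installed("SUNWvts") on the server mirrors step 1 above and returns True or False instead of printing the package details.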

What Next

For more information, refer to the appropriate Solaris documentation, as well as the pkgadd reference manual page.

How to Use SunVTS Software

Before You Begin

If your system passes the firmware-based diagnostics and boots the operating system, yet does not function correctly, you can use SunVTS, the Sun Validation and Test Suite, to run additional tests. These tests verify the configuration and functionality of most hardware controllers and devices.

You'll need root or superuser access to run SunVTS tests.

What to Do

This procedure assumes you'll test your Enterprise 250 server remotely by running a SunVTS session from a workstation using the SunVTS graphical interface. For information about other SunVTS interfaces and options, see "About Diagnostic Tools".

You can also run SunVTS remotely from a Remote System Control (RSC) console. For information about using the RSC with SunVTS, see the Remote System Control (RSC) User's Guide.

  1. Use xhost to give the remote server access to the workstation display.

    On the workstation from which you will be running the SunVTS graphical interface, enter:


    % /usr/openwin/bin/xhost + remote_hostname
    

    Substitute the name of the Enterprise 250 server for remote_hostname. Among other things, this command gives the server display permissions to run the SunVTS graphical interface in the OpenWindows(TM) environment of the workstation.

  2. Remotely log in to the server as superuser (root).

  3. Check whether SunVTS software is loaded on the server.

    SunVTS is an optional package that may or may not have been loaded when the server software was installed. For more information, see "How to Check Whether SunVTS Software Is Installed".

  4. To start the SunVTS software, enter:


    # cd /opt/SUNWvts/bin
    # ./sunvts -display local_hostname:0

    Substitute the name of the workstation you are using for local_hostname. Note that /opt/SUNWvts/bin is the default /bin directory for SunVTS software. If you've installed SunVTS software in a different directory, use the appropriate path instead.

    When you start SunVTS software, the SunVTS kernel probes the test system devices. The results of this probe are displayed on the Test Selection panel. For each hardware device on your system, there is an associated SunVTS test.

  5. Fine-tune your testing session by selecting only the tests you want to run.

    Click to select and deselect tests. (A check mark in the box indicates the item is selected.)

    Figure 12-4  


Results

If SunVTS tests indicate an impaired or defective part, see the replacement procedures in Chapter 6, "Removing and Installing Main Logic Board Components," through Chapter 9, "Removing and Installing Backplanes and Cables," to replace the defective part.

About Solstice SyMON Software

Solstice SyMON is a GUI-based diagnostic tool designed to monitor system hardware status and operating system performance. It offers simple, yet powerful monitoring capabilities that allow you to:

Solstice SyMON software is included on the SMCC Updates CD for the Solaris release you are running. For instructions on installing and using Solstice SyMON software, see the Solstice SyMON User's Guide included in the Solaris on Sun Hardware AnswerBook on the SMCC Updates CD.

About Remote System Control (RSC)

Remote System Control (RSC) is a secure server management tool that lets you monitor and control your server over modem lines or over a network. RSC provides remote system administration for geographically distributed or physically inaccessible systems. The RSC software works with the System Service Processor (SSP) on the Enterprise 250 main logic board. The SSP provides both serial and Ethernet ports for connections to a remote console.

Once RSC is configured to manage your server, you can use it to run diagnostic tests, view diagnostic and error messages, reboot your server, and display environmental status information from a remote console. If the operating system is down, RSC will notify a central host of any power failures, hardware failures, or other important events that may be occurring on your server.

The RSC provides the following features:

For More Information

For information about configuring and using RSC, see the Remote System Control (RSC) User's Guide, provided with the RSC software.


Note -

By default, diagnostic status and error messages are displayed on the local system console or terminal. If your server has been reconfigured to display output on an RSC console, diagnostic results will not display locally. To redirect diagnostic messages to the local console, you must use the OpenBoot PROM command diag-output-to and modify the OpenBoot PROM variables input-device and output-device. For additional details, see the Remote System Control (RSC) User's Guide.


About Troubleshooting Your System

The system provides the following features to help you identify and isolate hardware problems:

This section describes the error indications and software commands provided to help you troubleshoot your system. Diagnostic tools are covered in "About Diagnostic Tools".

Error Indications

The system provides error indications via LEDs and error messages. Using the two in combination, you can isolate a problem to a particular field-replaceable unit (FRU) with a high degree of confidence.

The system provides fault LEDs in the following places:

Error messages are logged in the /var/adm/messages file and are also displayed on the system console by the diagnostic tools.

Front Panel LEDs

Front panel LEDs provide your first indication if there is a problem with your system. Usually, a front panel LED is not the sole indication of a problem. Error messages and even other LEDs can help to isolate the problem further.

The front panel has a general fault indicator that lights whenever POST or OBDiag detects any kind of fault. In addition, it has LEDs that indicate problems with the internal disk drives, power supply subsystem, or fans. See "About the Status and Control Panel" for more information on these LEDs and their meanings.

Keyboard LEDs

Four LEDs on the Sun Type-5 keyboard are used to indicate the progress and results of POST diagnostics. These LEDs are on the Caps Lock, Compose, Scroll Lock, and Num Lock keys, as shown below.

Figure 12-5  


To indicate the beginning of POST diagnostics, the four LEDs briefly light all at once. The monitor screen remains blank, and the Caps Lock LED blinks for the duration of the testing.

If the system passes all POST diagnostic tests, all four LEDs light again and then go off. Once the system banner appears on the monitor screen, the keyboard LEDs assume their normal functions and should no longer be interpreted as diagnostic error indicators.

If the system fails any test, one or more LEDs will light to form an error code that indicates the nature of the problem.


Note -

The LED error code may be lit continuously, or for just a few seconds, so it is important to observe the LEDs closely while POST is running.


The following table provides error code definitions.

Table 12-5  

LED 

 

Caps Lock 

Compose 

Scroll Lock 

Num Lock 

Failing FRU 

 

 

 

Main logic board 

 

 

 

CPU 0 

 

 

CPU 1 

 

 

No memory detected 

 

 

Memory bank 0 

 

Memory bank 1 

 

Memory bank 2 

Memory bank 3 

 

 

 

NVRAM 


Note -

The Caps Lock LED blinks on and off to indicate that the POST diagnostics are running. When it lights steadily, it indicates an error.


Power Supply LEDs

Power supply LEDs are visible from the rear of the system. The following figure shows the LEDs on the power supply in bay 0.

Figure 12-6  


The following table provides a description of each LED.

Table 12-6  

LED Name: AC-Present-Status
Description: This green LED is lit to indicate that the primary circuit has power. When this LED is lit, the power supply is providing standby power to the system.

LED Name: DC Status
Description: This green LED is lit to indicate that all DC outputs from the power supply are functional.

Disk LEDs

The disk LEDs are visible from the front of the system when the bottom door is open, as shown in the following figure.

Figure 12-7  


When a disk LED lights steadily and is green, it indicates that the slot is populated and that the drive is receiving power. When an LED is green and blinking, it indicates that there is activity on the disk. Some applications may use the LED to indicate a fault on the disk drive. In this case, the LED changes color to yellow and remains lit. The disk drive LEDs retain their state even when the system is powered off.
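
The LED states described above can be summarized as a small lookup. This is only an illustrative sketch: the function name and the returned strings are made up for this example, and any state the paragraph does not mention is reported as undescribed rather than guessed.

```python
def describe_disk_led(color, blinking):
    """Return the meaning of a disk LED state as described in the text above."""
    color = color.strip().lower()
    if color == "green" and not blinking:
        return "slot populated; drive receiving power"
    if color == "green" and blinking:
        return "activity on the disk"
    if color == "yellow" and not blinking:
        # Only some applications use the LED this way.
        return "fault flagged by an application"
    return "state not described in this guide"


for state in [("green", False), ("green", True), ("yellow", False)]:
    print(state, "->", describe_disk_led(*state))
```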

Error Messages

Error messages and other system messages are saved in the file /var/adm/messages.

The two firmware-based diagnostic tools, POST and OBDiag, provide error messages either locally on the system console or remotely on an RSC console. These error messages can help to further refine your problem diagnosis. The amount of error information displayed in diagnostic messages is determined by the value of the OpenBoot PROM variable diag-verbosity. See "OBDiag Configuration Variables" for additional details.

Software Commands

System software provides Solaris and OBP commands that you can use to diagnose problems. For more information on Solaris commands, see the appropriate man pages. For additional information on OBP commands, see the OpenBoot 3.x Command Reference Manual. (An online version of the manual is included with the Solaris System Administrator AnswerBook that ships with Solaris software.)

Solaris prtdiag Command

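The prtdiag command is a UNIX shell command used to display system configuration and diagnostic information. As the sample output later in this section shows, this includes the CPU, memory, and I/O card configuration, the environmental status (temperatures, front panel and disk LEDs, fans, and power supplies), and hardware revisions.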

To run prtdiag, type:


% /usr/platform/sun4u/sbin/prtdiag

To isolate an intermittent failure, it may be helpful to maintain a prtdiag history log. Use prtdiag with the -l (log) option to send output to a log file in /var/adm.


Note -

Refer to the prtdiag man page for additional information.
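
If you keep successive prtdiag outputs as text files, comparing two of them can help you spot what changed between runs. The following is a minimal sketch that assumes the two outputs have already been read into strings. The function name is hypothetical and the two short snapshots are invented sample data, not real prtdiag output.

```python
import difflib


def diff_snapshots(old, new):
    """Return the removed ('-') and added ('+') lines between two saved outputs."""
    lines = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="", n=0,
    ))
    # The first two lines are the '---'/'+++' file headers; '@@' lines mark hunks.
    return [line for line in lines[2:] if not line.startswith("@@")]


# Invented snapshots: only the CPU1 reading differs.
previous = "CPU0    44\nCPU1    52\nMB0    32"
current = "CPU0    44\nCPU1    58\nMB0    32"
for line in diff_snapshots(previous, current):
    print(line)
```

Skipping the two header lines by position, rather than by their leading dashes, keeps the rows of dashes that prtdiag prints under its column headings from being mistaken for diff headers.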


An example of prtdiag output follows. The exact format of prtdiag output depends on which version of the Solaris operating environment is running on your system.

prtdiag output:


% /usr/platform/sun4u/sbin/prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun Ultra Enterprise 250(2 X UltraSPARC-II 248MHz)
System clock frequency: 83 MHz
Memory size: 640 Megabytes

========================= CPUs ========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
SYS     0     0      248     1.0   US-II    1.1
SYS     1     1      248     1.0   US-II    1.1

========================= Memory =========================

       Interlv.  Socket   Size
Bank    Group     Name    (MB)  Status
----    -----    ------   ----  ------
  0      none     U0801    32      OK
  0      none     U0701    32      OK
  0      none     U1001    32      OK
  0      none     U0901    32      OK
  1      none     U0802    64      OK
  1      none     U0702    64      OK
  1      none     U1002    64      OK
  1      none     U0902    64      OK
  2      none     U0803    32      OK
  2      none     U0703    32      OK
  2      none     U1003    32      OK
  2      none     U0903    32      OK
  3      none     U0804    32      OK
  3      none     U0704    32      OK
  3      none     U1004    32      OK
  3      none     U0904    32      OK

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  ------------------ ----------------------
SYS   PCI    33     0   SUNW,m64B                         ATY,GT-B              
SYS   PCI    33     1   pciclass,078000                                         
SYS   PCI    33     2   pciclass,078000                                         
SYS   PCI    33     3   glm                               Symbios,53C875        

No failures found in System
===========================

========================= Environmental Status =========================

System Temperatures (Celsius):
------------------------------
      CPU0    44
      CPU1    52
       MB0    32
       MB1    26
       PDB    26
      SCSI    24


=================================
Front Status Panel:
-------------------
Keyswitch position is in On mode.

System LED Status:  DISK ERROR      POWER  
                      [OFF]         [ ON]      
                POWER SUPPLY ERROR  ACTIVITY 
                      [OFF]         [OFF]      
                    GENERAL ERROR   THERMAL ERROR  
                      [OFF]         [OFF]      
=================================
Disk LED Status:	OK = GREEN	ERROR = YELLOW
		DISK  5:    [OK]	DISK  3:    [OK]	DISK  1:    [OK]
		DISK  4:    [OK]	DISK  2:    [OK]	DISK  0:    [OK]

=================================
Fan Bank :
----------

Bank      Speed     Status
         (0-255)	
----      -----     ------
 SYS       140        OK

=================================

Power Supplies:
---------------

Supply     Status
------     ------
  0          OK  

========================= HW Revisions =========================

ASIC Revisions:
---------------
STP2223BGA: Rev 4
STP2003QFP: Rev 1

System PROM revisions:
----------------------
  OBP 3.5.145 1997/10/15 14:50   POST 5.0.5 1997/10/09 16:52
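
One quick consistency check on output like the sample above is to add up the DIMM sizes in the Memory table and compare the total with the reported "Memory size". The sketch below assumes the row layout shown in the sample. Because the format depends on the Solaris version, the regular expressions and the function name are illustrative assumptions, not a documented interface.

```python
import re

# Matches rows such as "  0      none     U0801    32      OK".
MEMORY_ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(U\d+)\s+(\d+)\s+(\S+)\s*$")
MEMORY_SIZE = re.compile(r"Memory size:\s*(\d+)\s*Megabytes")


def memory_summary(prtdiag_text):
    """Return (reported size in MB, {bank number: summed DIMM size in MB})."""
    reported = None
    per_bank = {}
    for line in prtdiag_text.splitlines():
        size_match = MEMORY_SIZE.search(line)
        if size_match:
            reported = int(size_match.group(1))
            continue
        row = MEMORY_ROW.match(line)
        if row:
            bank, size = int(row.group(1)), int(row.group(4))
            per_bank[bank] = per_bank.get(bank, 0) + size
    return reported, per_bank


# Memory lines copied from the sample output above.
SAMPLE = """Memory size: 640 Megabytes
  0      none     U0801    32      OK
  0      none     U0701    32      OK
  0      none     U1001    32      OK
  0      none     U0901    32      OK
  1      none     U0802    64      OK
  1      none     U0702    64      OK
  1      none     U1002    64      OK
  1      none     U0902    64      OK
  2      none     U0803    32      OK
  2      none     U0703    32      OK
  2      none     U1003    32      OK
  2      none     U0903    32      OK
  3      none     U0804    32      OK
  3      none     U0704    32      OK
  3      none     U1004    32      OK
  3      none     U0904    32      OK
"""

reported, per_bank = memory_summary(SAMPLE)
print(reported, per_bank, sum(per_bank.values()) == reported)
```

For the sample, the four banks total 128 + 256 + 128 + 128 = 640 MB, which agrees with the reported memory size.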

OBP show-devs Command

If you are working from the OBP prompt (ok), you can use the OBP show-devs command to list the devices in the system configuration.

OBP printenv Command

Use the OBP printenv command to display the OpenBoot PROM configuration variables stored in the system NVRAM. The display includes the current values for these variables as well as the default values.

OBP probe-scsi and probe-scsi-all Commands

To diagnose problems with the SCSI subsystem, you can use the OBP probe-scsi and probe-scsi-all commands. Both commands require that you halt the system.


Note -

When it is not practical to halt the system, you can use SunVTS as an alternate method of testing the SCSI interfaces. See "About Diagnostic Tools" for more information.


The probe-scsi command transmits an inquiry command to all SCSI devices connected to the main logic board SCSI interfaces. This includes any tape or CD-ROM drive in the removable media assembly (RMA), any internal disk drive, and any device connected to the external SCSI connector on the system rear panel. For any SCSI device that is connected and active, its target address, unit number, device type, and manufacturer name are displayed.

The probe-scsi-all command transmits an inquiry command to all SCSI devices connected to the system SCSI host adapters, including any host adapters installed in PCI slots. The first identifier listed in the display is the SCSI host adapter address in the system device tree followed by the SCSI device identification data.

The first example that follows shows a probe-scsi output message. The second example shows a probe-scsi-all output message.

probe-scsi output:


ok probe-scsi
This command may hang the system if a Stop-A or halt command
has been executed. Please type reset-all to reset the system
before executing this command.
Do you wish to continue? (y/n) n
ok reset-all

ok probe-scsi
Primary UltraSCSI bus:
Target 0 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G3862
Target 4 
  Unit 0   Removable Tape     ARCHIVE Python 02635-XXX5962
Target 6 
  Unit 0   Removable Read Only device TOSHIBA XM5701TASUN12XCD0997
Target 9 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
Target b 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
ok

probe-scsi-all output:


ok probe-scsi-all
This command may hang the system if a Stop-A or halt command 
has been executed. Please type reset-all to reset the system
before executing this command.
Do you wish to continue? (y/n) y

/pci@1f,4000/scsi@4,1
Target 2 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target 3 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target 4 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target 5 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target 8 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target 9 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target a 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target b 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target c 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target d 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target e 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target f 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418

/pci@1f,4000/scsi@4
Target 2 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0416
Target 3 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0416
Target 4 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0416
Target 5 
  Unit 0   Disk     SEAGATE ST32430W SUN2.1G0666
Target 8 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0416
Target 9 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0416
Target a 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target b 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target c 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target d 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target e 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418
Target f 
  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418

/pci@1f,4000/scsi@3,1

/pci@1f,4000/scsi@3
Target 0 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G3862
Target 4 
  Unit 0   Removable Tape     ARCHIVE Python 02635-XXX5962
Target 6 
  Unit 0   Removable Read Only device TOSHIBA XM5701TASUN12XCD0997
Target 9 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
Target b 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462

/pci@1f,4000/pci@5/SUNW,isptwo@4
Target 1 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G8246
Target 2 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G8254
Target 3 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G8246
Target 4 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G8246
Target 5 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
Target 6 
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
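
When you capture this output from a tip or RSC console session, it can be convenient to turn it into a list of devices. The sketch below assumes the layout shown in the examples above: a host adapter path on its own line, then "Target" and "Unit" lines, with target addresses printed in hexadecimal. The function name and tuple layout are choices made for this example.

```python
import re


def parse_probe_scsi_all(text):
    """Return a list of (adapter path, target, unit, description) tuples."""
    devices = []
    adapter = None
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # A device tree path introduces a new host adapter.
            adapter, target = line, None
        elif line.startswith("Target "):
            target = int(line.split()[1], 16)  # targets a-f are hexadecimal
        else:
            unit = re.match(r"Unit\s+(\d+)\s+(.*)", line)
            if unit and target is not None:
                description = " ".join(unit.group(2).split())
                devices.append((adapter, target, int(unit.group(1)), description))
    return devices


# Lines taken from the probe-scsi-all example above.
SAMPLE = """/pci@1f,4000/scsi@3,1

/pci@1f,4000/scsi@3
Target 0
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G3862
Target 4
  Unit 0   Removable Tape     ARCHIVE Python 02635-XXX5962
Target b
  Unit 0   Disk     SEAGATE ST34371W SUN4.2G7462
"""

for device in parse_probe_scsi_all(SAMPLE):
    print(device)
```

For plain probe-scsi output, which has no adapter path lines, the adapter field is simply None.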

About Diagnosing Specific Problems

Network Communications Failure

Symptom

The system is unable to communicate over the network.

Action

Your system conforms to the Ethernet 10/100BASE-T standard, which states that the Ethernet 10BASE-T link integrity test function should always be enabled on both the host system and the Ethernet hub. The system cannot communicate with a network if this function is not set identically for both the system and the network hub (either enabled for both or disabled for both). This problem applies only to 10BASE-T network hubs, where the Ethernet link integrity test is optional. This is not a problem for 100BASE-T networks, where the test is enabled by default. Refer to the documentation provided with your Ethernet hub for more information about the link integrity test function.

If you connect the system to a network and the network does not respond, use the OpenBoot PROM command watch-net-all to display conditions for all network connections:


ok watch-net-all

For most PCI Ethernet cards, the link integrity test function can be enabled or disabled with a hardware jumper on the PCI card, which you must set manually. (See the documentation supplied with the card.) For the standard TPE and MII main logic board ports, the link test is enabled or disabled through software, as shown below.

Remember also that the TPE and MII ports share the same circuitry and as a result, only one port can be used at a time.


Note -

Some hub designs permanently enable (or disable) the link integrity test through a hardware jumper. In this case, refer to the hub installation or user manual for details of how the test is implemented.


Determining the Device Name of the Ethernet Interface

To enable or disable the link integrity test for the standard Ethernet interface, or for a PCI-based Ethernet interface, you must first know the device name of the desired Ethernet interface. To list the device name:

  1. Shut down the operating system and take the system to the ok prompt.

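  2. Determine the device name for the desired Ethernet interface, for example by listing the devices in the system configuration with the OBP show-devs command (see "OBP show-devs Command").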

Solution 1

Use this method while the operating system is running:

  1. Become superuser.

  2. Type:


    # eeprom nvramrc="probe-all install-console banner apply disable-link-pulse device-name"
      (Repeat for any additional device names.)
    # eeprom "use-nvramrc?"=true
    

  3. Reboot the system (when convenient) to make the changes effective.

Solution 2

Use this alternate method when the system is already in OpenBoot:

  1. At the ok prompt, type:


    ok nvedit
    0: probe-all install-console banner
    1: apply disable-link-pulse device-name
    (Repeat this step for other device names as needed.) 
    (Press CONTROL-C to exit nvedit.)
    ok nvstore
    ok setenv use-nvramrc? true
    

  2. Reboot the system to make the changes effective.
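
When several interfaces are involved, the nvramrc contents used in both solutions follow a simple pattern: the three fixed words, then one "apply disable-link-pulse device-name" group per interface. The sketch below only assembles the command strings for review and does not run them. The function names are hypothetical, and the device names in the usage example are placeholders, not names taken from a real system.

```python
def build_nvramrc(device_names):
    """Assemble the nvramrc contents for the given Ethernet device names."""
    words = ["probe-all", "install-console", "banner"]
    for name in device_names:
        # One group per interface, as in "Repeat for any additional device names".
        words += ["apply", "disable-link-pulse", name]
    return " ".join(words)


def eeprom_commands(device_names):
    """Return the two superuser commands from Solution 1 as strings."""
    return [
        'eeprom nvramrc="{}"'.format(build_nvramrc(device_names)),
        'eeprom "use-nvramrc?"=true',
    ]


# Placeholder device names, for illustration only.
for command in eeprom_commands(["device-name-1", "device-name-2"]):
    print(command)
```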

Power-on Failures

Symptom

The system attempts to power up but does not boot or initialize the monitor.

Action

  1. Run POST diagnostics.

    See "How to Use POST Diagnostics".

  2. Observe POST results.

    The front panel general fault LED should blink slowly to indicate that POST is running. Check the POST output using a locally attached terminal, tip connection, or RSC console.


    Note -

    By default, POST output is displayed locally on an attached terminal or through a tip connection. If your server has been reconfigured to display POST output on an RSC console, POST results will not display locally. To redirect POST output to the local system, you must execute the OpenBoot PROM command diag-output-to ttya from the RSC console. See the Remote System Control (RSC) User's Guide for additional details.


  3. If you see no front panel LED activity, a power supply may be defective.

    See "Power Supply LEDs".

  4. If the general fault LED remains lit, or the POST output contains an error message, then POST has failed.

    The most probable cause for this type of failure is the main logic board. However, before replacing the main logic board you should:

    1. Remove optional PCI cards.

    2. Remove optional DIMMs.

      Leave only the four DIMMs in Bank A.

    3. Repeat POST to determine if any of these modules caused the failure.

    4. If POST still fails, then replace the main logic board.

Video Output Failure

Symptom

No video at the system monitor.

Action

  1. Check that the power cord is connected to the monitor and to the wall outlet.

  2. Verify with a volt-ohmmeter that the wall outlet is supplying AC power.

  3. Verify that the video cable connection is secure between the monitor and the video output port.

    Use a volt-ohmmeter to perform the continuity test on the video cable.

  4. If the cables and their connections are okay, then troubleshoot the monitor and the graphics card.

Disk or CD-ROM Drive Failure

Symptom

A disk drive read, write, or parity error is reported by the operating system or a software application.

A CD-ROM drive read error or parity error is reported by the operating system or a software application.

Action

  1. Replace the drive indicated by the failure message.

Symptom

Disk drive or CD-ROM drive fails to boot or is not responding to commands.

Action

Test the drive response to the probe-scsi-all command as follows:

  1. At the system ok prompt, enter:


    ok reset-all
    ok probe-scsi-all
    

  2. If the SCSI device responds correctly to probe-scsi-all, a message similar to the one above is printed out.

    If the device responds and a message is displayed, the system SCSI controller has successfully probed the device. This indicates that the main logic board is operating correctly.

    1. If one drive does not respond to the SCSI controller probe but the others do, replace the unresponsive drive.

    2. If only one internal disk drive is configured with the system and the probe-scsi-all test fails to show the device in the message, replace the drive. If the problem is still evident after replacing the drive, replace the main logic board. If replacing both the disk drive and the main logic board does not correct the problem, replace the associated UltraSCSI data cable and UltraSCSI backplane.

SCSI Controller Failures

To check whether the main logic board SCSI controllers are defective, test the drive response to the probe-scsi command. To test additional SCSI host adapters added to the system, use the probe-scsi-all command. You can use the OBP printenv command to display the OpenBoot PROM configuration variables stored in the system NVRAM. The display includes the current values for these variables as well as the default values. See "OBP printenv Command" for more information.

  1. At the ok prompt, enter:


    ok probe-scsi
    

    If a message is displayed for each installed disk, the system SCSI controllers have successfully probed the devices. This indicates that the main logic board is working correctly.

  2. If a disk doesn't respond:

  3. If the problem persists, replace the unresponsive drive.

  4. If the problem remains after replacing the drive, replace the main logic board.

  5. If the problem persists, replace the associated SCSI cable and backplane.

Power Supply Failure

If there is a problem with a power supply, POST lights the general fault indicator and the power supply fault indicator on the front panel. If you have more than one power supply, then you can use the LEDs located on the power supplies themselves to identify the faulty supply. The power supply LEDs will indicate any problem with the AC input or DC output. See "Power Supply LEDs" for more information about the LEDs.

DIMM Failure

SunVTS and POST diagnostics can report memory errors encountered during program execution. Memory error messages typically indicate the DIMM location number ("U" number) of the failing module.

Use the following diagram to identify the location of a failing memory module from its U number:

Figure 12-8  


After you have identified the defective DIMM, remove it according to the instructions in "How to Remove a Memory Module". Install the replacement DIMM according to the directions in "How to Install a Memory Module".

Environmental Failures

The environmental monitoring subsystem monitors the temperature of the system as well as the operation of the fans and power supplies. For more information on the environmental monitoring subsystem, see "Environmental Monitoring and Control".

In response to an environmental error condition, the monitoring subsystem generates error messages that are displayed on the system console and logged in the /var/adm/messages file. These error messages are described in the table below.

Table 12-7  

Message: TEMPERATURE WARNING: X degrees celsius at location Y.
Type: Warning
Description: Indicates that the temperature measured at location Y has exceeded the warning threshold. If the temperature continues to rise, the system will shut down. If location Y is a sensor on a CPU (CP0 or CP1), the temperature X has exceeded 60 degrees C. If location Y is a sensor on the PDB (power distribution board), the SCSI backplane, or MB0 or MB1 (main logic board), the ambient temperature X has exceeded 53 degrees C.

Message: TEMPERATURE CRITICAL: X degrees celsius at location Y.
Type: Warning
Description: Indicates that the temperature measured at location Y has exceeded a critical threshold. After this warning message, the system automatically shuts down. If location Y is a sensor on a CPU (CP0 or CP1), the temperature X has exceeded 65 degrees C. If location Y is a sensor on the PDB (power distribution board), the SCSI backplane, or MB0 or MB1 (main logic board), the ambient temperature X has exceeded 58 degrees C.

Message: Power Supply X NOT okay.
Type: Warning
Description: Indicates that there is something wrong with the DC output of the supply. The system may shut down abruptly if the redundant power supply fails. The value X identifies the power supply: PS0 is the lower power supply; PS1 is the upper power supply.

Message: Power supply X inserted
Type: Advisory
Description: A hot-swap message indicating that the power supply identified by X was installed without service disruption.

Message: Power supply X removed
Type: Advisory
Description: A hot-swap message indicating that the power supply identified by X was removed without service disruption.

Message: WARNING: Fan failure has been detected
Type: Warning
Description: Indicates a fan failure in the fan tray assembly.
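
The temperature thresholds in the table can be expressed as a short check. The sketch below assumes that a message in /var/adm/messages contains the template text from the table with X and Y filled in, and that CPU sensor names begin with "CP" (CP0, CP1, or CPU0, CPU1 as in prtdiag output). The function names, the regular expression, and the sample message are illustrative assumptions.

```python
import re

# (warning, critical) thresholds in degrees C, taken from the table above.
THRESHOLDS = {"cpu": (60, 65), "ambient": (53, 58)}

TEMP_MSG = re.compile(
    r"TEMPERATURE (WARNING|CRITICAL): (\d+) degrees celsius at location (\S+?)\.?$"
)


def parse_temperature_message(line):
    """Return (level, degrees, location), or None if the line does not match."""
    match = TEMP_MSG.search(line.strip())
    if not match:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def classify(location, degrees):
    """Classify a reading as 'ok', 'warning', or 'critical' for its sensor type."""
    kind = "cpu" if location.upper().startswith("CP") else "ambient"
    warning, critical = THRESHOLDS[kind]
    if degrees > critical:
        return "critical"
    if degrees > warning:
        return "warning"
    return "ok"


# Invented message built from the template in the table.
print(parse_temperature_message(
    "TEMPERATURE WARNING: 62 degrees celsius at location CPU1."))

# Readings from the sample prtdiag output earlier in this chapter.
for location, degrees in [("CPU0", 44), ("CPU1", 52), ("MB0", 32), ("SCSI", 24)]:
    print(location, degrees, classify(location, degrees))
```

Both thresholds are treated as "exceeded" only when the reading is strictly greater than the listed value, which follows the wording in the table.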

If the environmental monitoring system detects a temperature problem, it also lights the temperature LED on the status and control panel. If it detects a power supply problem, it lights the power supply fault LED on the panel. The LEDs located on the power supplies themselves will help to further identify the problem. For information about system LEDs, see "About the Status and Control Panel" and "Power Supply LEDs".


Note -

Enterprise 250 power supplies will shut down automatically in response to certain over-temperature and power fault conditions (see "Environmental Monitoring and Control"). To recover from an automatic shutdown, you must disconnect the AC power cord, wait approximately 10 seconds, and then reconnect the power cord.