Sun Enterprise 6500/5500/4500 Systems Reference Manual

Chapter 9 Troubleshooting Overview

This chapter contains these topics:

Using a Terminal

If the system does not have a console, you can log in remotely or attach a terminal directly to the system.

To attach a terminal to the system:

  1. Halt the system and turn off power.

  2. Connect the terminal to serial port A on the clock+ board.

    The clock+ board is located at the back of the system, near the top of the card cage. Figure 9-1 shows the Enterprise 6500/5500 cabinet server. In the 8-slot Enterprise 4500 standalone server, the clock+ board is also near the top of the card cage.

    Figure 9-1 TTY Serial Port A on the Clock+ Board


  3. Power on the terminal.

  4. Set up the terminal.

    Refer to the OpenBoot Command Reference for instructions for using the set-defaults and printenv commands.

    The settings will vary with the terminal type, but these settings are often used:

    • 9600 bps

    • 8 data bits

    • 1 stop bit

    • Parity off

    • Full duplex

  5. Turn the key switch to the diagnostic position.

    The system will turn on. The diagnostic position puts POST in interactive mode and enables extensive POST tests.
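The terminal settings in step 4 have to match the serial port settings on the system side. OpenBoot reports those in the ttya-mode variable as baud rate, data bits, parity, stop bits, and handshake. The following Python sketch formats a set of terminal settings in that form so they can be compared with the printenv output. The function and dictionary names are illustrative and are not part of any Sun software.

```python
# Sketch: format terminal settings in the comma-separated form that
# OpenBoot uses for ttya-mode (baud,data-bits,parity,stop-bits,handshake).
# All names below are illustrative.

PARITY_CODES = {"none": "n", "even": "e", "odd": "o", "mark": "m", "space": "s"}
HANDSHAKE_CODES = {"none": "-", "hardware": "h", "software": "s"}


def tty_mode(baud, data_bits, parity, stop_bits, handshake="none"):
    """Return a ttya-mode style string for the given terminal settings."""
    return "{},{},{},{},{}".format(
        baud,
        data_bits,
        PARITY_CODES[parity],
        stop_bits,
        HANDSHAKE_CODES[handshake],
    )
```

For example, tty_mode(9600, 8, "none", 1) produces a string to check against the ttya-mode line shown by printenv. Substitute the parity that the terminal is actually set to.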

Hardware Indicator LEDs

LEDs indicate system status. The front panel and the boards have three LEDs (Figure 9-2). Power supply modules have two LEDs.

Figure 9-2 LED Symbols


The LEDs on the system front panel or the clock+ board indicate the status of the system as a whole. The LEDs on individual boards and power supplies indicate the status of the individual board or power supply. Many of the LED codes (Table 9-1) are common to the system front panel and various types of boards. Table 9-2 lists specific exceptions for LED codes for system boards.

System Front Panel LEDs

Table 9-1 lists the LED codes for system operations.

Table 9-1 System Status Codes

Power   Service    Cycling    Condition
Off     Off        Off        No power or the key switch is in the Off position.
Off     On         Off        Failure mode. System has electrical power.
Off     Off        On         Failure mode. System has electrical power.
Off     On         On         Failure mode. System has electrical power.
On      Off        Off        System is hung, either in POST/OBP or in the operating system.
On      Off        On         Hung in OS.
On      On         Off        (Hung in POST/OBP) or (hung in OS and failed component in system).
On      On         On         (Hung in POST/OBP) or (hung in OS and failed component in system).
On      Off        Flashing   OS running. System is operating normally.
On      On         Flashing   OS running and failed component in system.
On      Flashing   Off        Slow flash = POST. Fast flash = OBP.
On      Flashing   On         OS or OBP error.

LEDs in the system are controlled by OpenBoot(TM) PROM programming (OBP).
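When the LED states are recorded by hand or reported by an operator, Table 9-1 can be treated as a simple lookup. The Python sketch below encodes the table as a dictionary keyed by the Power, Service, and Cycling states. The names SYSTEM_STATUS and system_condition are illustrative only.

```python
# Sketch: Table 9-1 as a lookup keyed by (Power, Service, Cycling).
# The dictionary and function names are illustrative.

SYSTEM_STATUS = {
    ("off", "off", "off"): "No power or the key switch is in the Off position.",
    ("off", "on", "off"): "Failure mode. System has electrical power.",
    ("off", "off", "on"): "Failure mode. System has electrical power.",
    ("off", "on", "on"): "Failure mode. System has electrical power.",
    ("on", "off", "off"): "System is hung, either in POST/OBP or in the operating system.",
    ("on", "off", "on"): "Hung in OS.",
    ("on", "on", "off"): "(Hung in POST/OBP) or (hung in OS and failed component in system).",
    ("on", "on", "on"): "(Hung in POST/OBP) or (hung in OS and failed component in system).",
    ("on", "off", "flashing"): "OS running. System is operating normally.",
    ("on", "on", "flashing"): "OS running and failed component in system.",
    ("on", "flashing", "off"): "Slow flash = POST. Fast flash = OBP.",
    ("on", "flashing", "on"): "OS or OBP error.",
}


def system_condition(power, service, cycling):
    """Return the Table 9-1 condition for the observed LED states."""
    key = (power.lower(), service.lower(), cycling.lower())
    return SYSTEM_STATUS.get(key, "Combination not listed in Table 9-1.")
```

Combinations that do not appear in the table return a fixed "not listed" string rather than a guess.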

Clock+ Board LEDs

The clock+ board also displays system status. The LED codes are the same as for the front panel (Table 9-1).

CPU/Memory+ and I/O+ Board LEDs

Table 9-2 summarizes LED codes for boards. The Power, Service, and Cycling symbols are marked on the card cage frame above the respective LEDs. Note that many but not all of the LED codes are the same as the system codes (Table 9-1).

Table 9-2 Board Status LED Codes

Power   Service    Cycling    Condition
Off     Off        Off        Board has no electrical power.
Off     On         Off        Board is in low-power mode, can be unplugged.
Off     Off        Flashing   Undefined.
Off     On         Flashing   Undefined.
On      Off        Off        System is hung, either in POST/OBP or OS.
On      Off        On         Hung in OS.
On      On         Off        (Hung in POST/OBP) or (hung in OS and failed component on board).
On      On         On         (Hung in POST/OBP) or (hung in OS and failed component on board).
On      Off        Flashing   OS running. System is operating normally.
On      On         Flashing   OS running and failed component on board.
On      Flashing   Off        Slow flash = POST. Fast flash = OBP.
On      Flashing   On         OS or OBP error.


Note -

For boards, Off-On-Off indicates that the board is in low-power mode and is ready for removal. (For the system, Off-On-Off indicates a failure.)



Caution -

If the Power LED is lit, do not remove the board. Removing a board that is not in low-power mode will damage the board and the system.
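The Note and Caution above reduce to a single rule: a board is ready for removal only when its LEDs show Off-On-Off. The Python sketch below expresses that rule. The function name is illustrative, and the check is deliberately conservative: every other combination, including Off-Off-Off, is reported as not ready because the text does not describe it as a removal state.

```python
# Sketch: a board is ready for removal only in the Off-On-Off state
# (low-power mode, Table 9-2). The function name is illustrative.

def board_ready_for_removal(power, service, cycling):
    """Return True only for the Off-On-Off LED pattern."""
    state = (power.lower(), service.lower(), cycling.lower())
    return state == ("off", "on", "off")
```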


Basic Troubleshooting for Boards

Disk Board LEDs

The board status LED codes correspond to those shown in Table 9-2 for the CPU/Memory+ and I/O+ boards. The Disk board has two additional LEDs on the opposite side of the board to show the status of the two onboard disk drives. The LED for disk drive 1 is nearer to the side of the Disk board, and the LED for disk drive 0 is closer to the center of the board.

Power Supplies

A system has one peripheral power supply and up to four or eight CPU/IO modular power supplies. All the power supplies have one green LED and one yellow LED.

The control and status signals of all power supply modules connect to the clock+ board. If the clock+ board LEDs indicate a problem, inspect the LEDs on the power supplies to locate a faulty module, if any.

Peripheral Power Supply (PPS)

The green LED is to the right of the yellow LED on the peripheral power supply. The green LED indicates that the peripheral power supply is operating, but does not necessarily indicate that the DC outputs are within specification.

When the peripheral power supply module yellow LED is lit, a DC power output has malfunctioned or the voltage level is out of specification.

The peripheral power supply produces +5 VDC and +12 VDC current. The current is available for peripherals such as a tape drive and/or CD-ROM drive. In addition, the +5 VDC output of the peripheral power supply is available at the center plane for current sharing with the +5 VDC outputs of the power supply modules.

Power/Cooling Modules (PCMs)

For a PCM at the front of the card cage, the green LED is to the left of the yellow LED. At the back of the card cage, the LED positions are reversed and the green LED is to the right of the yellow LED. See Table 9-3.

When the yellow LED is lit, a fan or a DC output has malfunctioned. Each modular power supply contains two fans and three DC supplies (+3.3 VDC, +5 VDC, and +2 VDC).

The green LED indicates that the DC supplies are operating, but does not guarantee that the DC outputs are within specification.

Table 9-3 Modular Power Supply LED Codes

Green   Yellow   Condition
Off     Off      No AC input or key switch is turned off.
On      Off      Normal operation.
On      On       A fan has failed or one or more voltages are out of specification.
Off     On       One or more DC outputs have failed, or the voltages are out of specification, or the system is in the low power state.

The PCMs operate in redundant current share mode. If a module fails, the remaining modules may or may not provide enough current to continue system operation. The system's ability to continue operations depends on the total demand for current.
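The dependence on total demand can be illustrated with simple arithmetic: the system keeps running after a module failure only if the combined capacity of the remaining modules still covers the load. The Python sketch below shows that comparison. The function name is illustrative, and any ampere values passed to it are the caller's own figures; this manual section does not give per-module ratings.

```python
# Sketch: current-share check after a module failure.
# The function name is illustrative; supply real ratings from the
# power supply specifications, which are not given in this section.

def can_sustain_load(working_modules, amps_per_module, demand_amps):
    """Return True if the working modules can jointly supply the demand."""
    return working_modules * amps_per_module >= demand_amps
```

With made-up figures of 100 A per module and a 250 A demand, three working modules are enough and two are not.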

Disk Tray Indicators

The availability and type of status information vary with the disk tray type used in a system. Refer to the disk tray user manual for specific status information.

Diagnosing Problems

When LED codes (Table 9-1, Table 9-2, Table 9-3) indicate a hardware problem, several types of software programs are available to supply information about the problem.

Error Messages

Error messages and other system messages are saved in the /var/adm/messages file.
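Because the messages file can be long, it is often convenient to filter it for lines of interest. The Python sketch below yields lines that contain any of a set of keywords, ignoring case. The keyword list and the function name are assumptions chosen for illustration; adjust the keywords to the messages you are looking for.

```python
# Sketch: filter a messages file for lines containing selected keywords.
# The default keyword list and the names below are illustrative.

MESSAGES_FILE = "/var/adm/messages"  # path given in the text above


def matching_lines(lines, keywords=("error", "warning", "panic")):
    """Yield lines that contain any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line.rstrip("\n")
```

To use it, open MESSAGES_FILE for reading and pass the file object to matching_lines, then print each line that it yields.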

SunVTS

The latest version of SunVTS(TM) (online validation test suite) has several modes of testing, including low-impact testing, which can run with minimal effect on customer applications.

SunVTS can also be used to stress-test Sun hardware, either in or out of the Solaris operating environment. By running multiple and multithreaded diagnostic hardware tests, the SunVTS software verifies the system configuration and functionality of most hardware controllers and devices.

SunVTS tests many board and system functions, as well as Fibre Channel, SCSI, and SBus interfaces. SunVTS accepts user-written scripts for automated testing.

Refer to the SunVTS User's Guide for starting and operating instructions.

prtdiag(1M)

You can use the prtdiag command to display system configuration and diagnostic information.

Refer to the prtdiag man page for instructions.

History Log Option

To isolate an intermittent failure, it may be helpful to maintain a prtdiag history log. Use the prtdiag command with the -l (log) option to send output to a log file in the /var/adm directory.

Running prtdiag

To run prtdiag, type:


% /usr/platform/sun4u/sbin/prtdiag
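As an alternative way to build a history for an intermittent failure, the command output can be captured periodically with a timestamp. The Python sketch below runs a command and appends its output to a file. It does not use the -l option; it simply records standard output and the exit status. The function name and the log file location are hypothetical.

```python
# Sketch: append timestamped command output to a history file.
# This captures standard output only; it is not the prtdiag -l mechanism.
# The function name and any log file name are hypothetical.
import subprocess
import time

PRTDIAG = "/usr/platform/sun4u/sbin/prtdiag"  # path from the text above


def append_snapshot(command, log_path):
    """Run command and append its output, with a timestamp header, to log_path."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("=== {} (exit status {}) ===\n".format(stamp, result.returncode))
        log.write(result.stdout)
    return result.returncode
```

A call such as append_snapshot([PRTDIAG], "prtdiag-history.log") could be run from a scheduled job, so that successive snapshots can be compared after a failure.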

POST and OpenBoot

POST and OpenBoot work together in the system to test and manage system hardware.

POST resides in the OpenBoot PROM on each CPU/Memory+ board, I/O+ board, and Disk board. When the system is turned on, or if a system reset is issued, POST detects and tests buses, power supplies, boards, CPUs, SIMMs, and many board functions. POST controls the status LEDs on the system front panel and all boards. POST displays diagnostic and error messages on a console terminal, if available.

Only POST can configure the system hardware, and only POST can enable hot-pluggable boards. If a new unit (board or modular power supply) is added to the card cage after the system has booted, the new unit will not work until the system is rebooted, at which time POST reconfigures the system, using the units that are found in the system at that time.


Note -

POST does not test drives or internal parts of SBus cards. To test these devices, run OBP diagnostics manually after the system has booted. Refer to the OpenBoot Command Reference manual for instructions.


OpenBoot provides basic environmental monitoring, including detection of overheating conditions and out-of-tolerance voltages. For example, if an overheated board is found, OpenBoot issues a warning message. If the temperature passes the danger level, POST will put the overheated board(s) in low power mode.

OpenBoot also provides a set of commands and diagnostics at the ok prompt. For example, you can use OpenBoot to set NVRAM variables that reserve a board or a set of SIMMs for hot-sparing.

The following OpenBoot commands may be useful for diagnosing problems:

show-devs Command

Use the show-devs command to list the devices that are included in the system configuration.

printenv Command

Use the printenv command to display the system configuration variables stored in the system NVRAM. The display includes the current values for these variables, as well as the default values.

If the system cannot communicate with a 10BASE-T network, the Ethernet link test setting for the port may be incompatible with the setting at the network hub. See "Failure of Network Communications" for further details.

probe-scsi Command

The probe-scsi command locates and tests SCSI devices attached to the system. probe-scsi is run from the OpenBoot prompt.

When it is not practical to halt the system, you can use SunVTS as an alternate method of testing the SCSI interfaces.

Reference Documents for POST/OpenBoot

For more information, refer to:

Solstice SyMON

The Solstice(TM) SyMON(TM) program monitors system functioning and features a graphical user interface (GUI) to continuously display system status. Solstice SyMON is intended to complement system management tools such as SunVTS.

Solstice SyMON is accessible through an SNMP interface from network tools such as Solstice(TM) SunNet Manager(TM).

Refer to the Solstice SyMON User's Guide, part number 802-5355, for starting and operating instructions.

Specific Problems and Solutions

Failure of Network Communications

Description of the Problem

The system cannot communicate with a network if the system and the network hub are not set in the same way for the Ethernet Link Integrity Test. This problem particularly applies to 10BASE-T network hubs, where the Ethernet Link Integrity Test is optional. This is not a problem for 100BASE-T networks, where the test is enabled by default.

If you connect the system to a network and the network does not respond, use the OpenBoot command watch-net-all to display conditions for all network connections:


ok watch-net-all

For SBus Ethernet cards, the test can be enabled or disabled with a hardware jumper, which you must set manually. For the TPE and MII onboard ports on the I/O+ board, the link test is enabled or disabled through software, as shown below.


Note -

The TPE and MII ports share some circuitry, so do not try to use both ports at the same time.



Note -

Some hub designs do not use a software command to enable/disable the test, but instead permanently enable (or disable) the test through a hardware jumper. Refer to the hub installation or user manual for details of how the test is implemented.


Determining the Device Names of the I/O+ Boards

To enable or disable the link test for an on-board TPE (hme) port, you must first know the device name for the I/O+ board. To list the device names:

  1. Shut down the operating system and bring the system to the OpenBoot ok prompt.

  2. Determine the device names of the I/O+ boards:

    1. Type:


      ok show-devs
      

    2. In the show-devs listing, find the node names.

      Node names take the general form /sbus@3,0/SUNW,hme@3,8c00000.
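If the show-devs listing has been captured to a text file (for example, from a terminal session log), the hme nodes can be picked out mechanically. The Python sketch below assumes one device path per line and keeps the paths whose last component begins with SUNW,hme@, following the general form shown above. The function name is illustrative.

```python
# Sketch: pick the on-board hme device paths out of captured show-devs
# output. Assumes one device path per line; the function name is illustrative.

def hme_nodes(show_devs_lines):
    """Return device paths whose last component is a SUNW,hme node."""
    nodes = []
    for line in show_devs_lines:
        path = line.strip()
        last_component = path.rsplit("/", 1)[-1]
        if last_component.startswith("SUNW,hme@"):
            nodes.append(path)
    return nodes
```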

Solution 1

Use this method while the operating system is running:

  1. Become superuser.

  2. Type:


    # eeprom nvramrc="probe-all install-console banner apply disable-link-pulse 
    device-name "
      (Repeat for any additional device names.)
    # eeprom "use-nvramrc?"=true
    

  3. Reboot the system (when convenient) to make the changes effective.

Solution 2

Use this alternate method when the system is already in OpenBoot:

  1. At the OpenBoot ok prompt, type:


    ok nvedit
    0: probe-all install-console banner
    1: apply disable-link-pulse device-name
    (Repeat this step for other device names as needed.) 
    (Press CONTROL-C to exit nvedit.)
    ok nvstore
    ok setenv use-nvramrc? true

  2. Reboot to make the changes effective.
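Both solutions store the same script in nvramrc: one fixed first line, followed by one apply disable-link-pulse line per device name. When several I/O+ boards are involved, the script text can be generated rather than typed. The Python sketch below builds that text from a list of device names. The function name is illustrative; the output still has to be entered through eeprom or nvedit as described above.

```python
# Sketch: build the nvramrc script text used in Solutions 1 and 2.
# The function name is illustrative; device names come from show-devs.

def nvramrc_script(device_names):
    """Return the nvramrc text that disables the link test on each device."""
    lines = ["probe-all install-console banner"]
    for name in device_names:
        lines.append("apply disable-link-pulse {}".format(name))
    return "\n".join(lines) + "\n"
```

For example, passing a list that contains /sbus@3,0/SUNW,hme@3,8c00000 yields the two-line script shown in Solution 2 with that device name filled in.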

Resetting and Power Cycling the System from a Remote Console

It is possible to reset the system or cycle power from the remote console under these conditions:

Table 9-4 Remote Console Commands

Command                   Enter this sequence
Remote power off/on       <CR> <CR> <~> <Control-Shift-p>
Remote system reset       <CR> <CR> <~> <Control-Shift-r>
Remote XIR (CPU) reset    <CR> <CR> <~> <Control-Shift-x>

Key: <CR> = ASCII 0d hexadecimal, <~> = ASCII 7e hexadecimal, <Control-Shift-p> = 10 hexadecimal, <Control-Shift-r> = 12 hexadecimal, <Control-Shift-x> = 18 hexadecimal.


Note -

The remote console logic circuit continues to receive power even if you have commanded system power off.


The remote system reset command is useful for resetting the system under general conditions. The remote XIR reset command is used for software development and debugging.
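The key to Table 9-4 gives each sequence as four byte values. The Python sketch below assembles those bytes for each command so that they can be checked or passed to whatever serial tool is in use. It only builds the byte strings and does not send them. The names are illustrative, and any timing requirements between characters are outside what this sketch covers.

```python
# Sketch: the Table 9-4 sequences as raw bytes, using the hexadecimal
# values from the table key. Names are illustrative; nothing is sent.

CR = 0x0D      # <CR>
TILDE = 0x7E   # <~>

CONTROL_CODES = {
    "power": 0x10,  # <Control-Shift-p>: remote power off/on
    "reset": 0x12,  # <Control-Shift-r>: remote system reset
    "xir": 0x18,    # <Control-Shift-x>: remote XIR (CPU) reset
}


def remote_sequence(command):
    """Return the four-byte sequence for 'power', 'reset', or 'xir'."""
    return bytes([CR, CR, TILDE, CONTROL_CODES[command]])
```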