C H A P T E R 4 - Troubleshooting the System

This chapter gives instructions for troubleshooting the Netra CT server. You can troubleshoot the system in several ways:

Section 4.1, Troubleshooting the System Using the System Status Panel

Section 4.2, Troubleshooting the System Using prtdiag

Section 4.3, Troubleshooting the System Using Diagnostic Software

Section 4.4, Troubleshooting the System Using the Power-On Self Test (POST)

Section 4.5, Troubleshooting the System Using the Alarm Card Software

Section 4.6, Troubleshooting a Power Supply Using the Power Supply Unit LEDs

Section 4.7, Troubleshooting a CPU Card

In addition, Appendix C lists the error messages that might appear when you are operating or servicing your Netra CT server.

4.1 Troubleshooting the System Using the System Status Panel

You can use the system status panel to troubleshoot the Netra CT server.

4.1.1 Locating and Understanding the System Status Panel

The system status panel on the Netra CT server gives the majority of the troubleshooting information that you will need for your server. FIGURE 4-1 shows the locations of the system status panels on the Netra CT servers. FIGURE 4-2 shows the system status panel for the Netra CT 810 server, and FIGURE 4-3 shows the system status panel for the Netra CT 410 server.

FIGURE 4-1 System Status Panel Locations

FIGURE 4-2 System Status Panel (Netra CT 810 Server)

FIGURE 4-3 System Status Panel (Netra CT 410 Server)

4.1.2 Using the System Status Panel LEDs to Troubleshoot the System

When you first power on the Netra CT server, some or all of the green Power LEDs on the system status panel flash on and off for several seconds. Do not attempt to troubleshoot the system until after the LEDs have gone through their initial power-on testing.

Each major component in the Netra CT 810 server or Netra CT 410 server has a set of LEDs on the system status panel that gives the status of that particular component. Each component has either the green Power and amber Okay to Remove LEDs (FIGURE 4-4) or the green Power and amber Fault LEDs (FIGURE 4-5).

FIGURE 4-4 Power and Okay to Remove LEDs

FIGURE 4-5 Power and Fault LEDs

TABLE 4-1 describes which combination of LEDs is used for each component in the Netra CT 810 server, and TABLE 4-2 describes which combination of LEDs is used for each component in the Netra CT 410 server. Note that the components in the Netra CT servers all have the green Power LED, and they will have either the amber Okay to Remove LED or the amber Fault LED, but not both.


TABLE 4-1 System Status Panel LEDs for the Netra CT 810 Server

LED	LEDs Available	Component
HDD 0	Power and Okay to Remove	Upper hard disk drive
HDD 1	Power and Okay to Remove	Lower hard disk drive
Slot 1	Power and Okay to Remove	Host CPU card installed in slot 1
Slots 2 - 7	Power and Okay to Remove	I/O card or satellite CPU card (●) installed in slots 2 - 7
Slot 8	Power and Okay to Remove	Alarm card (■) installed in slot 8
SCB	Power and Fault	System controller board (behind the system status panel)
FAN 1	Power and Fault	Upper fan tray (behind the system status panel)
FAN 2	Power and Fault	Lower fan tray (behind the system status panel)
RMM	Power and Okay to Remove	Removable media module
PDU 1 (DC only)	Power and Fault	Leftmost power distribution unit (behind the server)
PDU 2 (DC only)	Power and Fault	Rightmost power distribution unit (behind the server)
PSU 1	Power and Okay to Remove	Leftmost power supply unit
PSU 2	Power and Okay to Remove	Rightmost power supply unit


TABLE 4-2 System Status Panel LEDs for the Netra CT 410 Server

LED	LEDs Available	Component
Slot 1	Power and Okay to Remove	Alarm card (■) installed in slot 1
Slot 2	Power and Okay to Remove	I/O card or satellite CPU card (●) installed in slot 2
Slot 3	Power and Okay to Remove	Host CPU card installed in slot 3
Slots 4 and 5	Power and Okay to Remove	I/O cards or satellite CPU cards (●) installed in slots 4 and 5
HDD 0	Power and Okay to Remove	Hard disk drive
SCB	Power and Fault	System controller board (behind the system status panel)
FAN 1	Power and Fault	Upper fan tray (behind the system status panel)
FAN 2	Power and Fault	Lower fan tray (behind the system status panel)
FTC	Power and Fault	Host CPU front transition card or host CPU front termination board
PDU 1 (DC only)	Power and Fault	Power distribution unit (behind the server)
PSU 1	Power and Okay to Remove	Power supply

TABLE 4-3 gives the LED states and meanings for any CompactPCI boards installed in a slot in the Netra CT 810 server or Netra CT 410 server.

TABLE 4-4 gives the LED states and meanings for any component other than a CompactPCI board that has the green Power and amber Okay to Remove LEDs.

TABLE 4-5 gives the LED states and meanings for any component other than a CompactPCI board that has the green Power and amber Fault LEDs.

Note - Do not use the information in TABLE 4-4 to troubleshoot a power supply unit in a server that has only one power supply unit (a Netra CT 410 server or a Netra CT 810 server with only one power supply). To troubleshoot the power supply in a single power supply system, use the LEDs on the power supply itself. Refer to Section 4.6, Troubleshooting a Power Supply Using the Power Supply Unit LEDs for more information. The information given in TABLE 4-4 applies to all other components in the Netra CT 810 server or Netra CT 410 server, including the power supplies in a two power supply Netra CT 810 server.


TABLE 4-3 CompactPCI Board LED States and Meanings

Green Power LED state	Amber Okay to Remove LED state	Meaning	Action
Off	Off	The slot is empty or the system thinks that the slot is empty because the system did not detect the card when it was inserted.	If there is a card installed in this slot, then one of the following components is faulty: the card installed in the slot, the alarm card, or the system controller board. Remove and replace the failed component to clear this state.
Blinking	Off	The card is coming up or going down.	Do not remove the card in this state.
On	Off	The card is up and running.	Do not remove the card in this state.
Off	On	The card is powered off.	You can remove the card in this state.
Blinking	On	The card is powered on, but it is offline for some reason (for example, a fault was detected on the card).	Wait several seconds to see if the green Power LED stops blinking. If it does not stop blinking after several seconds, enter `cfgadm` and verify that the card is in the `unconfigured` state, then perform the necessary action, depending on the card. Alarm card--You can remove the alarm card in this state. All other cards--Power off the slot through the alarm card software, then remove the card.
On	On	The card is powered on and is in use, but a fault has been detected on the card.	Deactivate the card using one of the following methods: (1) Use the `cfgadm -f -c unconfigure` command to deactivate the card. Note that in some cases, this may cause the system to panic, depending on the nature of the card hardware or software. (2) Halt the system and power off the slot through the alarm card software, then remove the card. The green Power LED will then give status information: if the green Power LED goes off, you can remove the card; if the green Power LED remains on, you must halt the system and power off the slot through the alarm card software.
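The LED combinations for CompactPCI boards can also be written down as a small lookup, which is handy when you log panel observations from several slots. The following Python sketch is illustrative only: the names `CPCI_LED_STATES` and `decode_cpci_leds` are hypothetical, and the "safe to remove now" flag is set only for the one state where the table says the card can be removed without any further step.

```python
# Illustrative lookup built from the CompactPCI board LED table above.
# Keys are (green Power LED, amber Okay to Remove LED).
# Values are (meaning, safe_to_remove_now).
CPCI_LED_STATES = {
    ("off", "off"): ("Slot is empty, or the card was not detected", False),
    ("blinking", "off"): ("Card is coming up or going down", False),
    ("on", "off"): ("Card is up and running", False),
    ("off", "on"): ("Card is powered off", True),
    ("blinking", "on"): ("Card is powered on but offline", False),
    ("on", "on"): ("Card is in use, but a fault was detected", False),
}


def decode_cpci_leds(green, amber):
    """Return (meaning, safe_to_remove_now) for one CompactPCI slot."""
    key = (green.strip().lower(), amber.strip().lower())
    if key not in CPCI_LED_STATES:
        raise ValueError(f"unknown LED combination: {key}")
    return CPCI_LED_STATES[key]


meaning, removable = decode_cpci_leds("Off", "On")
print(meaning, "- safe to remove:", removable)
```

The Blinking/On and On/On states are marked as not immediately removable because the table requires an extra step first (checking `cfgadm`, or powering off the slot through the alarm card software).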


TABLE 4-4 Meanings of Power and Okay to Remove LEDs

LED State	Power LED	Okay to Remove LED
On, Solid	Component is installed and configured.	Component is Okay to Remove. You can remove the component from the system, if necessary.
On, Flashing	Component is installed but is unconfigured or is going through the configuration process.	Not applicable.
Off	Component was not recognized by the system or is not installed in the slot.	Component is not Okay to Remove. Do not remove the component while the system is running.


TABLE 4-5 Meanings of Power and Fault LEDs

LED State	Power LED	Fault LED
On, Solid	Component is installed and configured.	Component has failed. Replace the component.
On, Flashing	Component is installed but is unconfigured or is going through the configuration process.	Not applicable.
Off	Component was not recognized by the system or is not installed in the slot.	Component is functioning properly.

4.2 Troubleshooting the System Using `prtdiag`

You can troubleshoot the system using the prtdiag command. Log in to the server console and, as root, enter:

# /usr/platform/sun4u/sbin/prtdiag

If you have a Netra CT 810 server, you should get output on the console similar to the following:

CODE EXAMPLE 4-1 prtdiag Output for a Netra CT 810 Server

System Configuration: Sun Microsystems  sun4u SPARCengine CP2000 model 140
(UltraSPARC-IIi 648MHz)
Memory size: 512 Megabytes
platform is : SUNW,NetraCT-810
=============================== FRU Information ===============================
FRU         FRU      FRU        Green     Amber     Miscellaneous
Type        Unit#    Present    LED       LED       Information
----------  -----    -------    -----     -----     --------------------------
Midplane    1        Yes                            Netra ct800
                                                    Properties:
                                                      Version=0
                                                      Maximum Slots=8
SCB         1        Yes        on        off       System Controller Board
                                                    Properties:
                                                      Version=2
                                                      hotswap-mode=basic
SSB         1        Yes                            System Status Panel
CPU         1        Yes        on        off       CPU board
                                                      temperature(celsius):38
I/O         2        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,hme
                                                        SUNW,isptwo
I/O         3        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,hme
                                                        SUNW,isptwo
I/O         4        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,hme
                                                        SUNW,isptwo
I/O         5        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,hme
                                                        SUNW,isptwo
I/O         6        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
I/O         7        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci1176,608
I/O         8        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Alarm Card
                                                    Devices:
                                                      pci
                                                        ebus
                                                        ethernet
PDU         1        Yes        on        off       Power Distribution Unit
PDU         2        Yes        on        off       Power Distribution Unit
PSU         1        Yes        on        on        Power Supply Unit
                                                      condition:ok
                                                      temperature:ok
                                                      ps fan:ok
                                                      supply:on
PSU         2        Yes        on        on        Power Supply Unit
                                                      condition:ok
                                                      temperature:ok
                                                      ps fan:ok
                                                      supply:on
FAN         1        Yes        on        off       Fan Tray
                                                      condition:ok
                                                      fan speed:low
FAN         2        Yes        on        off       Fan Tray
                                                      condition:ok
                                                      fan speed:low
HDD         0        Yes        on        off       Hard Disk Drive
                                                      condition:ok
HDD         1        Yes        on        off       Hard Disk Drive
                                                      condition:ok
RMM                  Yes        on        on        Removable Media Module
                                                      condition:Unknown
System Board PROM revision:
---------------------------
OBP 3.14.1 2000/04/28 12:56

If you have a Netra CT 410 server, you should get output on the console similar to the following:

CODE EXAMPLE 4-2 prtdiag Output for a Netra CT 410 Server

System Configuration: Sun Microsystems  sun4u SPARCengine CP2000 model 140
(UltraSPARC-IIi 648MHz)
Memory size: 512 Megabytes
platform is : SUNW,NetraCT-410
=============================== FRU Information ===============================
FRU         FRU      FRU        Green     Amber     Miscellaneous
Type        Unit#    Present    LED       LED       Information
----------  -----    -------    -----     -----     --------------------------
Midplane    1        Yes                            Netra ct400
                                                    Properties:
                                                      Version=0
                                                      Maximum Slots=5
SCB         1        Yes        on        off       System Controller Board
                                                    Properties:
                                                      Version=2
                                                      hotswap-mode=basic
SSB         1        Yes                            System Status Panel
I/O         1        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Alarm Card
                                                    Devices:
                                                      pci
                                                        ebus
                                                        ethernet
I/O         2        Yes        off       off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
CPU         3        Yes        on        off       CPU board
                                                      temperature(celsius):38
I/O         4        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,hme
                                                        SUNW,isptwo
I/O         5        Yes        on        off       CompactPCI IO Slot
                                                    Properties:
                                                      auto-config=disabled
                                                    Board Type:Unknown
                                                    Devices:
                                                      pci
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
                                                        pci108e,1000
                                                        SUNW,qfe
PDU         1        Yes        on        off       Power Distribution Unit
PSU         1        Yes        on        off       Power Supply Unit
                                                      condition:ok
                                                      temperature:ok
                                                      ps fan:ok
                                                      supply:on
FAN         1        Yes        on        off       Fan Tray
                                                      condition:ok
                                                      fan speed:low
FAN         2        Yes        on        off       Fan Tray
                                                      condition:ok
                                                      fan speed:low
HDD         0        Yes        on        off       Hard Disk Drive
                                                      condition:ok
System Board PROM revision:
---------------------------
OBP 3.14.1 2000/04/28 12:56
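If you capture prtdiag output to a file, a short script can pick out the FRU entries that deserve a closer look. The following Python sketch is a minimal example under stated assumptions: it expects the column layout shown in the code examples above, the function names `parse_prtdiag` and `needs_attention` are hypothetical, and the rule it applies (green LED off, or a `condition` value other than `ok`) is a heuristic for illustration, not a rule defined by prtdiag.

```python
import re

# First line of a FRU entry: type, optional unit number, presence flag,
# optional green/amber LED states, then free-form description text.
FRU_ROW = re.compile(
    r"^(\S+)\s+(\d+)?\s*(Yes|No)\b\s*(?:(on|off)\s+(on|off)\s+)?(.*)$"
)


def parse_prtdiag(text):
    """Parse the FRU Information table into a list of dictionaries."""
    frus = []
    for line in text.splitlines():
        row = FRU_ROW.match(line)
        if row:
            fru_type, unit, present, green, amber, info = row.groups()
            frus.append({
                "type": fru_type, "unit": unit, "present": present == "Yes",
                "green": green, "amber": amber, "info": info.strip(),
                "details": {},
            })
        elif frus and line[:1].isspace():
            # Indented continuation line such as "condition:ok".
            key, sep, value = line.strip().partition(":")
            if sep and value:
                frus[-1]["details"][key] = value
    return frus


def needs_attention(frus):
    """Heuristic: green LED off, or a condition other than 'ok'."""
    flagged = []
    for fru in frus:
        condition = fru["details"].get("condition")
        if fru["present"] and (
            fru["green"] == "off" or condition not in (None, "ok")
        ):
            flagged.append((fru["type"], fru["unit"]))
    return flagged


# Shortened excerpt of the Netra CT 410 output shown above.
SAMPLE = """\
FRU         FRU      FRU        Green     Amber     Miscellaneous
Type        Unit#    Present    LED       LED       Information
----------  -----    -------    -----     -----     --------------------------
Midplane    1        Yes                            Netra ct400
I/O         2        Yes        off       off       CompactPCI IO Slot
CPU         3        Yes        on        off       CPU board
                                                      temperature(celsius):38
PSU         1        Yes        on        off       Power Supply Unit
                                                      condition:ok
HDD         0        Yes        on        off       Hard Disk Drive
                                                      condition:ok
"""

print(needs_attention(parse_prtdiag(SAMPLE)))
```

A flagged entry is only a prompt to check the corresponding LEDs on the system status panel. For example, slot 2 in the sample reports both LEDs off, which for a CompactPCI slot means the slot is empty or the card was not detected.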

4.3 Troubleshooting the System Using Diagnostic Software

There are several software packages that allow you to run diagnostic tests on your system, such as SunVTS. SunVTS is a validation test suite that is provided as a supplement to the Solaris operating environment. The individual tests can stress a device, system, or resource so as to detect and pinpoint specific hardware and software failures and provide users with informational messages to resolve any problems found. SunVTS runs at the operating system level.

There are several tests that are particularly useful when troubleshooting a Netra CT server:

alarm2test--alarm2test is part of SunVTS, but it is used specifically to test the alarm card installed in the Netra CT server by invoking the alarmdiag test on the alarm card. alarm2test runs at the operating system level.

obdiag--obdiag is similar to the alarm2test, in that it invokes the alarmdiag test on the alarm card; however, obdiag is run from the firmware level, not the operating system level.

Apost--Apost is part of the Chorus operating system image on the alarm card. It runs a basic test on the alarm card to verify that the alarm card is operating properly before bringing up Chorus on the alarm card.

A new utility called diagconf, which is also part of the Chorus operating system image on the alarm card, is now available. You can use diagconf to set or display the configuration settings for Apost, allowing you to make the tests run on the alarm card more or less thoroughly before the Chorus operating system is brought up on the alarm card.

To display the values currently set for Apost, access the alarm card command line interface (CLI), and, through the alarm card CLI, enter the following command:

hostname cli> diagconf -d

You should see output similar to the following, giving you the values currently set for the Apost test on the alarm card:

diag-switch        False
verb-mode          True
stop-on-error      False
diag-level         Max
mfg-mode           Off
hdr-checksum       0xaa
time-stamp         0
record-format-ver  49
post-version       02
reset-status       0xd0000000
post-status        ...
post-msg           Watchdog Reset-------- POST Passed-------------------
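If you save this output from a console session, it can be turned into a dictionary for later comparison. The following Python sketch assumes only the two-column layout shown above (a name, whitespace, then the value); the function name `parse_diagconf` is hypothetical and is not part of the alarm card software.

```python
def parse_diagconf(output):
    """Map each setting name to its value, both kept as strings."""
    settings = {}
    for line in output.splitlines():
        # Split on the first run of whitespace only, so values that
        # contain spaces (such as post-msg) stay intact.
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


captured = """\
diag-switch        False
verb-mode          True
stop-on-error      False
diag-level         Max
"""

settings = parse_diagconf(captured)
print(settings["diag-level"])
```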

Some values are hard-set and cannot be changed by a user, while others can be changed to make that particular test more or less thorough. To change the value for a particular test, enter the following command:

hostname cli> diagconf -s command value

where command is the name of the Apost setting that you want to change, and value is the new value to assign to it.

The following table lists the Apost tests that can be changed by a user and the allowable values for each. Any tests not listed in TABLE 4-6 are either hard-set and cannot be changed, or should not be changed by a user.


TABLE 4-6 Apost Tests That Can Be Changed by a User

Command	Value
diag-switch	True--Turns the diag-switch test on. False--Turns the diag-switch test off.
verb-mode	True--Turns the verb-mode test on. False--Turns the verb-mode test off.
stop-on-error	True--Stops the Apost testing when the first error is encountered. False--Continues Apost testing, regardless of the number of errors encountered.
diag-level	Off--Turns the diag-level test off. Min--Sets the diag-level test to the minimum level of testing. Max--Sets the diag-level test to the maximum level of testing.
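Because only these four settings should be changed, it can help to check a command and value pair before typing it at the alarm card CLI. The following Python sketch is illustrative: the names `ALLOWED_VALUES` and `diagconf_set_command` are hypothetical, and it assumes the values are entered exactly as spelled in the table (whether the CLI accepts other capitalization is not stated here).

```python
# User-changeable Apost settings and their allowable values,
# copied from the table above.
ALLOWED_VALUES = {
    "diag-switch": ("True", "False"),
    "verb-mode": ("True", "False"),
    "stop-on-error": ("True", "False"),
    "diag-level": ("Off", "Min", "Max"),
}


def diagconf_set_command(command, value):
    """Return the 'diagconf -s' line to enter, or raise ValueError."""
    if command not in ALLOWED_VALUES:
        raise ValueError(f"{command!r} should not be changed by a user")
    if value not in ALLOWED_VALUES[command]:
        allowed = ", ".join(ALLOWED_VALUES[command])
        raise ValueError(f"{command} accepts only: {allowed}")
    return f"diagconf -s {command} {value}"


print(diagconf_set_command("diag-level", "Min"))
```

The script only builds the text of the command. You still enter it yourself at the `hostname cli>` prompt.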

For more information on these and other tests in the SunVTS test suite, refer to the Computer Systems Release Notes Supplement for Sun Hardware document or the SunVTS documentation on the Solaris on Sun Hardware Answerbook, both included with your Solaris operating environment.

4.4 Troubleshooting the System Using the Power-On Self Test (POST)

When you first power up the Netra CT server, some or all of the green Power LEDs on the system status panel will flash on and off for several seconds. The green Power LED for the I/O slot holding the CPU card (slot 1 in the Netra CT 810 server and slot 3 in the Netra CT 410 server) will go to solid green while the green Power LEDs for the remaining components are still flashing on and off; this is an indication that the CPU card has passed the power-on self test (POST).

Before any processing can occur on a system, it must successfully complete the POST. Messages are displayed for each step in the POST process. If there is a critical failure, the system will not complete POST and will not boot. To monitor this process, you must be connected to the TTY A port on the CPU card or CPU transition card. See Section 5.2.1, Logging In to the Netra CT Server.

OpenBoot PROM (OBP) variables control the console port. The variables and their possible settings are described below.

To see the console output device, enter:

ok printenv output-device

The screen will display something similar to the following:

output-device         ttya

The possible settings for this variable are:

ttya (default)

ttyb

screen

rsc

ttya and ttyb represent the serial ports on the CPU card. screen represents the display attached to the first frame buffer installed in the system (not present on the Netra CT server). rsc is used by the alarm card.

To see the console input device, enter:

ok printenv input-device

The screen will display something similar to the following:

input-device          ttya

The possible settings for this variable are:

ttya (default)

ttyb

keyboard

rsc

ttya and ttyb represent the serial ports on the CPU card. keyboard represents the standard system keyboard (not present on the Netra CT server). rsc is used by the alarm card. If no system keyboard is connected, the console port defaults to ttya.

Note - Be sure the two variables are consistent with each other. For example, do not set the output-device to screen and the input-device to ttya.
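The consistency rule in the note can be expressed as a simple check. The following Python sketch is an illustration only. The text above names just one inconsistent combination (output-device set to screen with input-device set to ttya), so the list of matching pairs below is an assumption: each serial port pairs with itself, keyboard pairs with screen, and rsc pairs with rsc. The function name `console_devices_consistent` is hypothetical.

```python
# Assumed consistent (input-device, output-device) pairs.
CONSISTENT_PAIRS = {
    ("ttya", "ttya"),
    ("ttyb", "ttyb"),
    ("keyboard", "screen"),
    ("rsc", "rsc"),
}


def console_devices_consistent(input_device, output_device):
    """Return True if the two OBP console settings match each other."""
    return (input_device, output_device) in CONSISTENT_PAIRS


# The combination the note warns against:
print(console_devices_consistent("ttya", "screen"))
```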

Another OBP variable, diag-level, controls the behavior of the POST process. By default, this variable is set to max, which means POST runs a more thorough and more verbose set of tests against the hardware. This variable can also be set to min, which runs a less stringent set of tests against the hardware. A minimum level of POST testing also takes less time, so the Solaris operating environment can boot more quickly on a machine with diag-level set to min.

To run the maximum number of POST tests, enter:

ok setenv diag-level max

To run the minimum number of POST tests, enter:

ok setenv diag-level min

4.5 Troubleshooting the System Using the Alarm Card Software

For information on troubleshooting using the alarm card software, refer to the Netra CT Server System Administration Guide (816-2483-xx).

4.6 Troubleshooting a Power Supply Using the Power Supply Unit LEDs

There are two LEDs on each power supply unit: a green LED and an amber LED. You can use the LEDs on the power supply unit to troubleshoot each power supply unit; however, because there is one power supply unit in the Netra CT 410 server and two power supply units in the Netra CT 810 server, the actions to take are different.

4.6.1 Troubleshooting the Power Supply Unit in the Netra CT 410 Server

Following are the states of the LEDs on the power supply unit in the Netra CT 410 server:

Green, flashing--The power supply unit is in the standby mode; the power supply unit is powered on, but it is not supplying power to the server.

Green, solid--Both the server and the power supply unit are powered on and functioning properly.

Amber--A fault was found in the power supply unit. Replace the power supply unit. See Section 10.5, Power Supply Unit for those instructions.

Off--One of the following conditions applies:

The power supply locking mechanism is in the upper, unlocked position.

The accompanying cable is disconnected from the DC power distribution unit or the AC power entry unit.

The accompanying power distribution unit has failed.

The power supply unit has failed.

4.6.2 Troubleshooting the Power Supply Units in the Netra CT 810 Server

When both power supply units in a Netra CT 810 server are up and running properly, the green LEDs on both power supply units will be ON (note that these are the LEDs on the power supply units themselves, not the LEDs on the system status panel).

If a power supply unit fails, the amber LED on the power supply unit might light, depending on the type of failure that has occurred:

If a soft-fault occurs, such as a stuck fan or a temperature warning, you should get a notification of the error; however, the amber LED on the power supply unit will not light for a soft-fault condition. The power supply unit is still supplying power to the system during a soft-fault condition.

If a hard-fault occurs, such as a voltage problem, you should get a notification of the error. In addition, the amber LED on the power supply unit does light for a hard-fault condition. The power supply unit does not supply power to the system during a hard-fault condition.

If one power supply unit fails (either a soft-fault or a hard-fault), but the other power supply unit is still functioning normally, you should replace the faulty power supply unit as soon as possible to keep the system up and running. If both power supply units fail, the action you should take varies depending on which of the two types of fault has occurred:

If	Then
Both power supply units go through a soft-fault	Replace one power supply unit at a time in order to keep the system up and running.
One power supply unit goes through a soft-fault and the other power supply unit goes through a hard-fault	Replace the power supply unit that has gone through a hard-fault first in order to keep the system up and running.
Both power supply units go through a hard-fault	The system is down and you should replace at least one of the power supply units to bring the system back up again.
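The replacement guidance for the Netra CT 810 server can be expressed as a small decision function. This is a minimal Python sketch, assuming each power supply unit is reported as "none", "soft", or "hard"; those labels and the `psu_action` function are hypothetical and exist only for this example.

```python
VALID_FAULTS = {"none", "soft", "hard"}


def psu_action(fault_a: str, fault_b: str) -> str:
    """Return the recommended action for the two power supply units.

    Each argument is "none", "soft", or "hard" (illustrative labels).
    """
    states = (fault_a, fault_b)
    for state in states:
        if state not in VALID_FAULTS:
            raise ValueError(f"unknown fault state: {state!r}")

    hard = states.count("hard")
    soft = states.count("soft")

    if hard == 2:
        return "System is down: replace at least one power supply unit."
    if hard == 1 and soft == 1:
        return "Replace the hard-faulted power supply unit first."
    if soft == 2:
        return "Replace one power supply unit at a time."
    if hard + soft == 1:
        return "Replace the faulty power supply unit as soon as possible."
    return "No action required."


print(psu_action("soft", "hard"))
```

The order of the arguments does not matter, because the function only counts how many units are in each fault state.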

4.7 Troubleshooting a CPU Card

This section describes how to troubleshoot problems related to the CPU card. The information provided here primarily covers those situations when the system containing the CPU card does not boot up or when the CPU card is not fully functional after boot up. Only general troubleshooting tips are provided here. No component level troubleshooting information is included in this section.

The following topics are covered:

General troubleshooting tips

General troubleshooting requirements

Mechanical failures

Power-on failures

Failures subsequent to power-on

Troubleshooting during POST/OBP and during boot process

The following diagnostic procedures are also described:

OpenBoot PROM on-board diagnostics

OpenBoot diagnostics

4.7.1 General Troubleshooting Tips

Caution - High voltages are present in the Netra CT server. To avoid physical injury, follow all the safety rules specified in the Netra CT Server Safety and Compliance Manual when opening the enclosure and/or removing and installing the board.

The following general troubleshooting tips are useful in isolating the problems related to the CPU card:

1. Make sure the CPU card is installed properly in the correct slot in the Netra CT server.

The CPU card should be installed in slot 1 in the Netra CT 810 server and in slot 3 in the Netra CT 410 server.

2. Make sure all the necessary cables are attached properly to the CPU transition card.

The following figures show the connectors on the different CPU transition cards:

CPU front transition card, Netra CT 410 server--FIGURE 4-6

CPU rear transition card--FIGURE 4-7

Note - The CPU rear transition card is the same for both the Netra CT 810 server and the Netra CT 410 server; only the location in the rear card cage differs.

FIGURE 4-6 Connectors on the CPU Front Transition Card (Netra CT 410 Server)

FIGURE 4-7 Connectors on the CPU Rear Transition Card

4.7.2 General Troubleshooting Requirements

The following devices are generally required to take some of the recommended actions in this section:

Network interface

TTYA and TTYB connection or an ASCII terminal connection to serial port

Parallel port interface

Loopback connectors

4.7.3 Mechanical Failures

Symptom

Unable to insert the CPU card into the backplane.

Action

1. Verify that there are no mechanical and physical obstructions in the slot where the CPU card is going to be installed.

2. Make sure no pins on the board connectors or the CompactPCI backplane connectors are bent or damaged.

4.7.4 Power-On Failures

This section provides examples of power-on failure symptoms and suggested actions. There can be several reasons for the power-on failures.

Make sure the CPU card is installed properly.

Note - If both Ready and Alarm LEDs on the CPU card are green, the board is partially functional and capable of running POST (power on self-test). It means that the basic functionality of the board is present. If none of the aforementioned LEDs is green, and the board is installed properly, the board is not functional. In that case, contact your Sun supplier or field service engineer.

4.7.5 Failures Subsequent to Power-On

Symptom

Cannot connect successfully to a TTY serial port; no POST messages are displayed and keyboard input is not accepted.

Action

1. Check the TTY cable for proper setup.

2. If you do not see any output after connecting the TTY terminal to the CPU transition card, remove it and connect it to the COM port of the CPU card and try again.

4.7.6 Troubleshooting During POST/OBP and During Boot Process

This section describes certain possible problems encountered while running POST and OBP and during the boot process.

Symptom

POST error message displays:

cannot establish network service

Action

This might be a hardware address problem. Check that the media access control (MAC) address and the IP address of this system are entered correctly at the server, and add them if they are missing.

Symptom

POST detects Ecache error and a message similar to the one below is displayed:

STATUS =FAILED

TEST =Memory Addr w/ Ecache

SUSPECT=U5201 and U5202

MESSAGE=Mem Addr line compare error

addr 00000000.00000000

exp 00000000.00000000

obs 88888888.88888888

Action

This might be a mounting issue with the CPU Mylar film, socket, or heatsink, which could have occurred during transportation or due to severe vibration. Contact Sun's Enterprise Services Solution Center.

Caution - Any attempt to disassemble or replace the aforementioned devices will void the warranty.

4.7.7 OpenBoot PROM On-Board Diagnostics

There are several OBP variables specific to the Netra CT server, such as:

pcia-probe-list--Probes the bus that runs the first ethernet port (front connection) and standard I/O devices (by default: 1, 2)

pcib-probe-list--Probes the bus that runs the second ethernet port (rear connection) (by default: 1, 2, 3)

cpci-probe-list--Probes the bus that runs connections to all cPCI slots in the ct400 or ct800 (by default: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f)

The following section describes the OBP on-board diagnostics. To execute the OBP on-board diagnostics, the system must be at the ok prompt. The OBP on-board diagnostics are listed as follows:

watch-clock

watch-net and watch-net-all

probe-scsi

test alias name, device path, -all

4.7.7.1 `watch-clock`

The watch-clock command reads a register in the NVRAM/TOD chip and displays the result as a seconds counter. During normal operation, the seconds counter repeatedly increments from 0 to 59 until interrupted by pressing any key on the PS/2 keyboard. The following identifies the watch-clock output message.

ok watch-clock

Watching the seconds register of the real time clock chip

It should be ticking once a second

Type any key to stop

ok
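The expected behavior of the seconds counter (incrementing by one and wrapping from 59 back to 0) can be checked with simple modular arithmetic. The following Python sketch assumes the counter values have been noted down once per second; the `ticks_every_second` function is a hypothetical helper, not an OBP command.

```python
from typing import Sequence


def ticks_every_second(samples: Sequence[int]) -> bool:
    """Return True if each sample follows the previous one by one second.

    The counter is expected to run from 0 to 59 and then wrap to 0, so the
    difference between neighbours is compared modulo 60.
    """
    if any(not 0 <= s <= 59 for s in samples):
        return False
    return all((b - a) % 60 == 1 for a, b in zip(samples, samples[1:]))


print(ticks_every_second([57, 58, 59, 0, 1]))  # True
print(ticks_every_second([10, 10, 11]))        # False: the counter stalled
```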

4.7.7.2 `watch-net` and `watch-net-all`

The watch-net and watch-net-all commands monitor Ethernet packets on the Ethernet interfaces connected to the system. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description. CODE EXAMPLE 4-3 identifies the watch-net output message and CODE EXAMPLE 4-4 identifies the watch-net-all output message.


CODE EXAMPLE 4-3 watch-net Output Message

ok watch-net

Hme register test --- succeeded.

Internal loopback test -- succeeded.

Transceiver check -- Using Onboard Transceiver - Link Up. passed

Using Onboard Transceiver - Link Up.

Looking for Ethernet Packets.

.  is a Good Packet.

X  is a Bad Packet.

Type any key to stop. .................................................. ................................................................ ................................................................ ........................................................

ok


CODE EXAMPLE 4-4 watch-net-all Output Message

ok watch-net-all

/pci@1f,0/pci@1,1/network@1,1

Hme register test --- succeeded.

Internal loopback test -- succeeded.

Transceiver check -- Using Onboard Transceiver - Link Up. passed

Using Onboard Transceiver - Link Up.

Looking for Ethernet Packets.

.  is a Good Packet.

X  is a Bad Packet.

Type any key to stop. ........ ........ ........................................................ ................................................................ ................................................................ ....................................

ok
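When a captured watch-net session is long, the good and bad packet markers can be tallied with a short script. The Python sketch below is an illustration under stated assumptions: it only looks at the text after the "Type any key to stop." line, and it assumes that text contains just the `.` and `X` markers (any error description printed alongside an `X` would distort the counts). The `count_packets` function is a hypothetical helper.

```python
from typing import Tuple


def count_packets(output: str) -> Tuple[int, int]:
    """Return (good, bad) packet counts from captured watch-net output."""
    marker = "Type any key to stop."
    _, found, tail = output.partition(marker)
    if not found:
        raise ValueError("marker line not found in output")

    tail = tail.strip()
    # Drop the trailing ok prompt, if the capture includes it.
    if tail.endswith("ok"):
        tail = tail[:-2]
    return tail.count("."), tail.count("X")


sample = (
    "Looking for Ethernet Packets.\n"
    ". is a Good Packet. X is a Bad Packet.\n"
    "Type any key to stop.\n"
    "....X..\n"
    "ok"
)
print(count_packets(sample))  # (6, 1)
```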

4.7.7.3 `probe-scsi`

The probe-scsi command transmits an inquiry command to SCSI devices connected to the system unit on-board SCSI interface. If the SCSI device is connected and active, the target address, unit number, device type, and manufacturer name are displayed. CODE EXAMPLE 4-5 identifies the probe-scsi output message.


CODE EXAMPLE 4-5 probe-scsi Output Message

ok probe-scsi

Primary UltraSCSI bus:

Target 0 Unit 0 Disk SEAGATE ST32272W 0876

Target 6 Unit 0 Removable Read Only device TOSHIBA CD-ROM XM-6201TA1037

ok
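Captured probe-scsi output can be split into one record per device for logging or comparison. The following Python sketch is a rough illustration, assuming one unit per target and the "Target n Unit n description" layout shown in CODE EXAMPLE 4-5; the `parse_probe_scsi` function and its regular expression are hypothetical and may need adjusting for other output layouts.

```python
import re
from typing import List, Tuple

# A device entry runs from "Target n Unit n" up to the next entry,
# a trailing ok prompt, or the end of the text.
DEVICE_RE = re.compile(
    r"Target (\d+) Unit (\d+) (.+?)(?= Target \d+ Unit| ok$|$)"
)


def parse_probe_scsi(output: str) -> List[Tuple[int, int, str]]:
    """Return (target, unit, description) tuples from probe-scsi output."""
    text = " ".join(output.split())  # collapse line breaks and spacing
    return [(int(t), int(u), desc) for t, u, desc in DEVICE_RE.findall(text)]


sample = """ok probe-scsi
Primary UltraSCSI bus:
Target 0 Unit 0 Disk SEAGATE ST32272W 0876
Target 6 Unit 0 Removable Read Only device TOSHIBA CD-ROM XM-6201TA1037
ok"""

for target, unit, description in parse_probe_scsi(sample):
    print(target, unit, description)
```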

4.7.7.4 `test` alias name, device path, -all

The test command, combined with a device alias or device path, enables a device self-test program. If a device has no self-test program, the message: No selftest method for device name is displayed. To enable the self-test program for a device, type the test command followed by the device alias or device path name. TABLE 4-7 lists test alias name selections, a description of the selection, and preparation.


Type of Test	Description	Preparation
test screen	Tests system video graphics hardware and monitor.	The diag-switch? NVRAM parameter must be true for the test to execute.
test floppy	Tests diskette drive response to commands.	A formatted diskette must be inserted into the diskette drive.
test net	Performs internal/external loopback test of the system auto-selected Ethernet interface.	An Ethernet cable must be attached to the system and to an Ethernet tap or hub, or the external loopback test fails.
test ttya, test ttyb	Outputs an alphanumeric test pattern on the system serial ports: ttya, serial port A; ttyb, serial port B.	A terminal must be connected to the port being tested to observe the output.
test keyboard	Executes the keyboard self-test.	Four keyboard LEDs should flash once and a message is displayed: Keyboard Present.
test -all	Sequentially tests system-configured devices containing self-test.	Tests are sequentially executed in device-tree order (viewed with the show-devs command).

4.7.8 OpenBoot Diagnostics (OB Diag)

OpenBoot Diagnostics is an interactive tool that tests various hardware and peripheral devices. When obdiag is typed at the ok prompt in OBP, the menu shown in CODE EXAMPLE 4-6 is displayed on the screen.

OBDiag performs root-cause failure analysis on the referenced devices by testing internal registers, confirming subsystem integrity, and verifying device functionality. To run OBDiag:

1. At the ok prompt, enter obdiag.

This displays the OBDiag menu as shown in CODE EXAMPLE 4-6.

2. At the OBDiag menu prompt, enter a number from the menu (for example, 17 to toggle script-debug messages).


CODE EXAMPLE 4-6 OBDiag Menu

0 .... PCI/Cheerio

1 .... EBUS DMA/TCR Registers

2 .... Ethernet

3 .... Ethernet2 <Inactive>

4 .... Parallel Port

5 .... Serial Port C (on optional I/O board) <Inactive>

6 .... Serial Port D (on optional I/O board) <Inactive>

7 .... NVRAM

8 .... Floppy

9 .... Serial port A

10 ... Serial port B

11 ... RAS

12 ... User Flash1

13 ... User Flash2

14 ... All Above

15 ... Quit

16 ... Display this Menu

17 ... Toggle Script-debug

18 ... Enable External Loopback Tests

19 ... Disable External Loopback Tests

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

Caution - Prior to running obdiag, do not run any other OBP command that may change the hardware state of the board. After obdiag tests are run, always reset the system to bring it to a known state.

At this point, type the relevant numbers to run all or some of the tests. If an error is detected, the error message is displayed on the screen. For example, if an error is detected while testing the floppy disk drive, a message similar to the following is displayed:

TEST= floppy_test

STATUS= FAILED

SUBTEST= floppy_id0_read_test

ERRORS= 1

TTF= 66

SPEED= 440 MHz

PASSES= 1

MESSAGE= Error: Recalibrate failed. floppy missing, improperly connected, or defective.
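An error report like the one above consists of `KEY= value` lines, so it can be turned into a dictionary for scripted checks of a captured console log. This Python sketch is illustrative; the `parse_obdiag_report` function is a hypothetical helper and only recognizes lines whose key is written in uppercase.

```python
from typing import Dict


def parse_obdiag_report(text: str) -> Dict[str, str]:
    """Collect KEY= value lines from a captured OBDiag error report."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        # Only accept lines that look like an uppercase field name.
        if sep and key.isupper():
            fields[key] = value.strip()
    return fields


report = """TEST= floppy_test
STATUS= FAILED
SUBTEST= floppy_id0_read_test
ERRORS= 1
TTF= 66
SPEED= 440 MHz
PASSES= 1
MESSAGE= Error: Recalibrate failed. floppy missing, improperly connected, or defective."""

fields = parse_obdiag_report(report)
if fields.get("STATUS") == "FAILED":
    print(fields["TEST"], "failed in", fields["SUBTEST"])
```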

Some of the individual items on the OBDiag menu are described in further detail in the following paragraphs.

4.7.8.1 PCI/PCIO

The PCI/PCIO diagnostic performs the following:

vendor_ID_test: Verifies that the PCIO ASIC vendor ID is 108e.

device_ID_test: Verifies that the PCIO ASIC device ID is 1000.

mixmode_read: Verifies that the PCI configuration space is accessible as half-word bytes by reading the EBus2 vendor ID address.

e2_class_test: Verifies the address class code. Address class codes include bridge device (0xB, 0x6), other bridge device (0xA and 0x80), and programmable interface (0x9 and 0x0).

status_reg_walk1: Performs walk-one test on status register with mask 0x280 (PCIO ASIC is accepting fast back-to-back transactions, DEVSEL timing is 0x1).

line_size_walk1: Performs tests a through e.

latency_walk1: Performs walk-one test on latency timer.

line_walk1: Performs walk-one test on interrupt line.

pin_test: Verifies that the interrupt pin is logic-level high (1) after reset.
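Several of the subtests above are walk-one tests: a single set bit is written to each bit position of a register in turn and read back. The following Python sketch shows the general technique against a simulated register. It is not the OBDiag implementation; the `SimulatedRegister` class, the `stuck_low` parameter, and the function names are assumptions made for illustration, and the sketch does not model how OBDiag applies a mask such as 0x280.

```python
from typing import Iterator, Optional


def walking_ones(width: int) -> Iterator[int]:
    """Yield patterns with exactly one bit set, from bit 0 to bit width-1."""
    for bit in range(width):
        yield 1 << bit


class SimulatedRegister:
    """Stand-in for a hardware register; stuck_low bits always read as 0."""

    def __init__(self, width: int, stuck_low: int = 0) -> None:
        self._mask = (1 << width) - 1
        self._stuck_low = stuck_low
        self._value = 0

    def write(self, value: int) -> None:
        self._value = value & self._mask

    def read(self) -> int:
        return self._value & ~self._stuck_low & self._mask


def walk_one_test(reg: SimulatedRegister, width: int) -> Optional[int]:
    """Return the first failing pattern, or None if all patterns read back."""
    for pattern in walking_ones(width):
        reg.write(pattern)
        if reg.read() != pattern:
            return pattern
    return None


print(walk_one_test(SimulatedRegister(8), 8))                  # None
print(walk_one_test(SimulatedRegister(8, stuck_low=0x10), 8))  # 16
```

Because only one bit is set at a time, a failing pattern points directly at the bit position that could not be written or read.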

CODE EXAMPLE 4-7 identifies the PCI/PCIO output message.


CODE EXAMPLE 4-7 PCI/PCIO Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 0

TEST= all_pci/PCIO_test

SUBTEST= vendor_id_test

SUBTEST= device_id_test

SUBTEST= mixmode_read

SUBTEST= e2_class_test

SUBTEST= status_reg_walk1

SUBTEST= line_size_walk1

SUBTEST= latency_walk1

SUBTEST= line_walk1

SUBTEST= pin_test

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.2 EBus DMA/TCR Registers

The EBUS DMA/TCR registers diagnostic performs the following:

The dma_reg_test: Performs a walking ones bit test for control status register, address register, and byte count register of each channel. Verifies that the control status register is set properly.

The dma_func_test: Validates the DMA capabilities and FIFOs. The test is executed in a DMA diagnostic loopback mode. It initializes the data of transmitting memory with its address, performs a DMA read and write, and verifies that the data received is correct. This is repeated for four channels.

CODE EXAMPLE 4-8 identifies the EBus DMA/TCR registers output message.


CODE EXAMPLE 4-8 EBus DMA/TCR Registers Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 1

TEST= all_dma/ebus_test

SUBTEST= dma_reg_test

SUBTEST= dma_func_test

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.3 Ethernet

The Ethernet diagnostic performs the following:

my_channel_reset resets the Ethernet channel.

hme_reg_test performs a walk-one test on the following register set: global register 1, global register 2, bmac xif register, bmac tx register, and the mif register.

mac_internal_loopback_test performs Ethernet channel engine internal loopback.

10mb_xcvr_loopback_test enables the 10Base-T data present at the transmit MII data inputs to be routed back to the receive MII data outputs.

100mb_phy_loopback_test enables MII transmit data to be routed to the MII receive data path.

100mb_twister_loopback_test forces the twisted-pair transceiver into loopback mode.

CODE EXAMPLE 4-9 identifies the Ethernet output message.


CODE EXAMPLE 4-9 Ethernet Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 2

TEST= ethernet_test

SUBTEST= my_channel_reset

SUBTEST= hme_reg_test

SUBTEST= global_reg1_test

SUBTEST= global_reg2_test

SUBTEST= bmac_xif_reg_test

SUBTEST= bmac_tx_reg_test

SUBTEST= mif_reg_test

Test only supported for National Phy DP83840A

SUBTEST= 10mb_xcvr_loopback_test

selecting internal transceiver

Test only supported for National Phy DP83840A

SUBTEST= 100mb_phy_loopback_test

selecting internal transceiver

Test only supported for National Phy DP83840A

SUBTEST= 100mb_twister_loopback_test

selecting internal transceiver

Test only supported for National Phy DP83840A

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.4 Parallel Port

The parallel port diagnostic performs the dma_read. This enables ECP mode and ECP DMA configuration, and FIFO test mode. It transfers 16 bytes of data from the memory to the parallel port device and then verifies that the data is in TFIFO. CODE EXAMPLE 4-10 identifies the parallel port output message.


CODE EXAMPLE 4-10 Parallel Port Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 4

TEST= parallel_port_test

SUBTEST= dma_read

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.5 Serial Port A

The serial port A diagnostic invokes the uart_loopback test. This test transmits and receives 128 characters and checks the transaction validity. CODE EXAMPLE 4-11 identifies the serial port A output message.


CODE EXAMPLE 4-11 Serial Port A Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 9

TEST= uarta_test

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

Note - The serial port A diagnostic will stall if the TIP line is installed on serial port A. CODE EXAMPLE 4-12 identifies the serial port A output message when the TIP line is installed on serial port A.


CODE EXAMPLE 4-12 Serial Port A Output Message With TIP Line Installed

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 9

TEST= uarta_test

UART A in use as console - Test not run.

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.6 Serial Port B

The serial port B diagnostic is identical to the serial port A diagnostic. CODE EXAMPLE 4-13 identifies the serial port B output message.

Note - The serial port B diagnostic will stall if the TIP line is installed on serial port B.


CODE EXAMPLE 4-13 Serial Port B Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 10

TEST= uartb_test

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>

4.7.8.7 NVRAM

The NVRAM diagnostic verifies the NVRAM operation by performing a write and read to the NVRAM. CODE EXAMPLE 4-14 identifies the NVRAM output message.


CODE EXAMPLE 4-14 NVRAM Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 7

TEST= nvram_test

SUBTEST= write/read_patterns

SUBTEST= write/read_inverted_patterns

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>
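The two NVRAM subtests write patterns, read them back, and then repeat with the inverted patterns. The Python sketch below shows that general technique against simulated memory. It is not the OBDiag implementation: the pattern values, the `StuckBitMemory` class, and the `pattern_test` function are assumptions made for illustration, and the actual patterns used by OBDiag are not documented here.

```python
from typing import Sequence

# Assumed test patterns; the real diagnostic may use different values.
PATTERNS = (0x55, 0xAA, 0x00, 0xFF)


class StuckBitMemory:
    """Simulated byte memory in which the stuck_mask bits always read as 0."""

    def __init__(self, size: int, stuck_mask: int = 0) -> None:
        self._cells = [0] * size
        self._stuck = stuck_mask

    def __len__(self) -> int:
        return len(self._cells)

    def __getitem__(self, index: int) -> int:
        return self._cells[index] & ~self._stuck & 0xFF

    def __setitem__(self, index: int, value: int) -> None:
        self._cells[index] = value & 0xFF


def pattern_test(mem, patterns: Sequence[int] = PATTERNS,
                 invert: bool = False) -> bool:
    """Write each pattern to every cell and verify that it reads back."""
    for pattern in patterns:
        value = (~pattern & 0xFF) if invert else pattern
        for i in range(len(mem)):
            mem[i] = value
        for i in range(len(mem)):
            if mem[i] != value:
                return False
    return True


good = StuckBitMemory(16)
bad = StuckBitMemory(16, stuck_mask=0x01)
print(pattern_test(good), pattern_test(good, invert=True))  # True True
print(pattern_test(bad))                                    # False
```

A real test of this kind would normally save and restore the original contents, which this sketch omits for brevity.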

4.7.8.8 All Above

The All Above diagnostic validates the system unit. CODE EXAMPLE 4-15 shows an example of the All Above option output message.


CODE EXAMPLE 4-15 All Above Output Message

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===> 14

TEST= all_pci/cheerio_test

SUBTEST= vendor_id_test

SUBTEST= device_id_test

...

SUBTEST= bmac_xif_reg_test

SUBTEST= bmac_tx_reg_test

SUBTEST= mif_reg_test

SUBTEST= mac_internal_loopback_test

selecting internal transceiver

Test only supported for National Phy DP83840A

...

SUBTEST= 100mb_twister_loopback_test

selecting internal transceiver

Test only supported for National Phy DP83840A

TEST= ethernet2_test

TEST= parallel_port_test

SUBTEST= dma_read

TEST= uarta_test

...

SUBTEST= write/read_patterns

...

ttya in use as console - Test not run.

TEST= usi_test

ttyb in use as console - Test not run.

TEST= ras_test  env-monitor = disabled

SUBTEST= obd-init-i2c-test

...

TEST= flash_test

SUBTEST= flash-supported?

TEST= flash_test

SUBTEST= flash-supported?

Enter (0-14 tests, 15 -Quit, 16 -Menu) ===>
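Output from the All Above option is long, so it can help to group each SUBTEST line under the TEST line that precedes it when reviewing a captured log. The following Python sketch is illustrative; the `group_subtests` function is a hypothetical helper that relies only on the `TEST=` and `SUBTEST=` prefixes visible in the examples in this section.

```python
from typing import List, Tuple


def group_subtests(output: str) -> List[Tuple[str, List[str]]]:
    """Return (test name, [subtest names]) pairs in order of appearance."""
    results: List[Tuple[str, List[str]]] = []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("SUBTEST="):
            name = line[len("SUBTEST="):].strip()
            if results and name:
                results[-1][1].append(name)
        elif line.startswith("TEST="):
            # Keep only the test name; ignore trailing status text.
            parts = line[len("TEST="):].split()
            if parts:
                results.append((parts[0], []))
    return results


sample = """TEST= parallel_port_test
SUBTEST= dma_read
TEST= uarta_test
ttya in use as console - Test not run.
TEST= ras_test  env-monitor = disabled
SUBTEST= obd-init-i2c-test"""

for test, subtests in group_subtests(sample):
    print(test, subtests)
```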
