CHAPTER 10

Isolating Failed Parts

The most important use of diagnostic tools is to isolate a failed hardware component so that a qualified service technician can quickly remove and replace it. Because servers are complex machines with many failure modes, no single diagnostic tool can isolate all hardware faults under all conditions. However, Sun provides a variety of tools that can help you determine which component needs replacing.

This chapter guides you in choosing the best tools and describes how to use these tools to reveal a failed part in your Sun Fire V490 server. It also explains how to use the Locator LED to isolate a failed system in a large equipment room.

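Tasks covered in this chapter include:

How to Operate the Locator LED
How to Put the Server in Service Mode
How to Put the Server in Normal Mode
How to Isolate Faults Using LEDs
How to Isolate Faults Using POST Diagnostics
How to Isolate Faults Using Interactive OpenBoot Diagnostics Tests
How to View Diagnostic Test Results After the Fact
How to View and Set OpenBoot Configuration Variables

Other information in this chapter includes:

Reference for Choosing a Fault Isolation Tool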

If you want background information about the tools, turn to the section:



Note - Many of the procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to enter the OpenBoot environment. For background information, refer to About the ok Prompt. For instructions, refer to How to Get to the ok Prompt.






Caution - Do not attempt to access any internal components unless you are a qualified service technician. Detailed service instructions can be found in the Sun Fire V490 Server Parts Installation and Removal Guide, which is included on the Sun Fire V490 Documentation CD.




How to Operate the Locator LED

The Locator LED helps you quickly find a specific system among dozens of systems in a room. For background information about system LEDs, refer to LED Status Indicators.

You can turn the Locator LED on and off from the system console, from the system controller (SC) command-line interface (CLI), or by using the RSC software's graphical user interface (GUI).



Note - It is also possible to use Sun Management Center software to turn the Locator LED on and off. Consult Sun Management Center documentation for details.



Before You Begin

Either log in as root, or access the RSC software's graphical user interface.

What to Do

1. Turn the Locator LED on.

Do one of the following:

Refer to the illustration under Step 5 in How to Monitor the System Using the System Controller and RSC Software. With each click, the LED will change state from off to on, or vice versa.

2. Turn the Locator LED off.

Do one of the following:

Refer to the illustration under Step 5 in How to Monitor the System Using the System Controller and RSC Software. With each click, the LED will change state from on to off, or vice versa.


How to Put the Server in Service Mode

Before You Begin

In normal mode, firmware-based diagnostic tests can be configured (and even disabled) to expedite the server's startup process. If you have set OpenBoot configuration variables to bypass diagnostic tests, you can always reset those variables to their default values to run tests.

Alternatively, putting the server into service mode according to the following procedure ensures that POST and OpenBoot Diagnostics tests do run during startup.

For a full description of service mode, refer to:

This document is included on the Sun Fire V490 Documentation CD.

What to Do

1. Set up a console for viewing diagnostic messages.

Access the system console using an ASCII terminal or tip line. For information on system console options, refer to About Communicating With the System.

2. Do one of the following, whichever is more convenient:

If either of these switches is set as described, the next reset will cause diagnostic tests to run at Sun-specified coverage, levels, and verbosity.

3. Type:


ok reset-all

What Next

Should you want to restore the system to normal mode in order to control the depth of diagnostic coverage, the tests run, and the verbosity of the output, refer to:


How to Put the Server in Normal Mode

Before You Begin

If you have set the server to run in service mode, you can follow this procedure to return the system to normal mode. Putting the system in normal mode gives you control over diagnostic testing. For more information, refer to:

What to Do

1. Set up a console for viewing diagnostic messages.

Access the system console using an ASCII terminal or tip line. For information on system console options, refer to About Communicating With the System.

2. Turn the system control switch to the Normal position.

3. At the ok prompt, type:


ok setenv service-mode? false

The system will not actually enter normal mode until the next reset.

4. Type:


ok reset-all
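The relationship between the two settings and the next reset can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, and `switch_selects_service_mode` stands for "the system control switch is in a position other than Normal that forces service mode", which this section does not name.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticModeSettings:
    """Hypothetical model of the two settings described in this procedure."""
    switch_selects_service_mode: bool  # assumption: True when the control switch forces service mode
    service_mode_variable: bool        # value of the OpenBoot variable service-mode?


def mode_after_next_reset(settings: DiagnosticModeSettings) -> str:
    """Return the mode the server would enter at its next reset.

    Either setting alone is enough to select service mode, so normal mode
    requires both the switch and the variable to be set for normal operation.
    """
    if settings.switch_selects_service_mode or settings.service_mode_variable:
        return "service"
    return "normal"


if __name__ == "__main__":
    # Switch turned to Normal, but service-mode? still true: service mode persists.
    print(mode_after_next_reset(DiagnosticModeSettings(False, True)))
    # Both steps of the procedure completed: normal mode after reset-all.
    print(mode_after_next_reset(DiagnosticModeSettings(False, False)))
```

The sketch reflects why the procedure has two separate steps before `reset-all`: changing only one of the two settings leaves the server in service mode.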

What Next

For detailed descriptions of service and normal modes, refer to:

This document is included on the Sun Fire V490 Documentation CD.


How to Isolate Faults Using LEDs

Although LEDs are not a deep, formal diagnostic tool, those located on the chassis and on selected system components can serve as front-line indicators of a limited set of hardware failures.

Before You Begin

You can view LED status by direct inspection of the system's front or back panels.



Note - Most LEDs available on the front panel are also duplicated on the back panel.



You can also view LED status remotely using RSC and Sun Management Center software, if you set up these tools ahead of time. For details on setting up RSC and Sun Management Center software, refer to:

What to Do

1. Check the system LEDs.

There is a group of three LEDs located near the top left corner of the front panel and duplicated on the back panel. Their status can tell you the following.


| LED | Indicates | Action |
|---|---|---|
| Locator (left) | A system administrator can turn this on to flag a system that needs attention. | Identify the system. |
| Fault (middle) | If lit, hardware or software has detected a problem with the system. | Check other LEDs or run diagnostics to determine the problem source. |
| Power/OK (right) | If off, power is not reaching the system from the power supplies. | Check AC power source and check the power supplies. |


The Locator and Fault LEDs are powered by the system's 5-volt standby power source and remain lit for any fault condition that results in a system shutdown.

2. Check the power supply LEDs.

Each power supply has a set of four LEDs located on the front panel and duplicated on the back panel. Their status can tell you the following.


| LED | Indicates | Action |
|---|---|---|
| OK-to-Remove (top) | If lit, power supply can safely be removed. | Remove power supply as needed. |
| Fault (2nd from top) | If lit, there is a problem with the power supply or one of its internal fans. | Replace the power supply. |
| DC Present (3rd from top) | If off, inadequate DC power is being produced by the supply. | Remove and reseat the power supply. If this does not help, replace the supply. |
| AC Present (bottom) | If off, AC power is not reaching the supply. | Check power cord and the outlet to which it connects. |


3. Check the fan tray LEDs.

There are two LEDs located behind the media door, just under the system control switch. One LED on the left is for Fan Tray 0 (CPU) and one LED on the right is for Fan Tray 1 (PCI). If either is lit, it indicates that the corresponding fan tray needs reseating or replacement.

4. Check the disk drive LEDs.

There are two sets of three LEDs, one for each disk drive. These are located behind the media door, just to the left of each disk drive. Their status can tell you the following.


| LED | Indicates | Action |
|---|---|---|
| OK-to-Remove (top) | If lit, disk can safely be removed. | Remove disk as needed. |
| Fault (middle) | If lit, there is a problem with the disk. | Perform software commands to take the disk offline. Refer to the Sun Fire V490 Server Parts Installation and Removal Guide. |
| Activity (bottom) | If lit or blinking, disk is operating normally. | Not applicable. |


5. (Optional) Check the Ethernet LEDs.

There are two LEDs for each Ethernet port, located close to the right side of each Ethernet receptacle on the back panel. If the Sun Fire V490 system is connected to an Ethernet network, the status of the Ethernet LEDs can tell you the following.


| LED | Indicates | Action |
|---|---|---|
| Activity (top, amber) | If lit or blinking, data is either being transmitted or received. | None. The condition of these LEDs can help you narrow down the source of a network problem. |
| Link Up (bottom, green) | If lit, a link is established with a link partner. | None. The condition of these LEDs can help you narrow down the source of a network problem. |
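The LED checks in this procedure amount to a lookup from an observed LED state to a recommended action. The following Python sketch encodes only the fault-indicating rows of the system and power supply tables above. The dictionary and function names are hypothetical, chosen for illustration, and the state strings "lit" and "off" are an assumed encoding of what an operator observes.

```python
# Rows taken from the system LED and power supply LED tables in this section.
# Key: (LED name, state that signals a condition). Value: the listed action.
SYSTEM_LED_ACTIONS = {
    ("Fault", "lit"): "Check other LEDs or run diagnostics to determine the problem source.",
    ("Power/OK", "off"): "Check AC power source and check the power supplies.",
}

POWER_SUPPLY_LED_ACTIONS = {
    ("Fault", "lit"): "Replace the power supply.",
    ("DC Present", "off"): "Remove and reseat the power supply. "
                           "If this does not help, replace the supply.",
    ("AC Present", "off"): "Check power cord and the outlet to which it connects.",
}


def suggest_actions(observed, table):
    """Return the actions whose LED condition matches the observed states.

    observed maps an LED name to "lit" or "off" (hypothetical encoding).
    Actions are returned in the order the rows appear in the table.
    """
    return [action for (led, state), action in table.items()
            if observed.get(led) == state]


if __name__ == "__main__":
    seen = {"Fault": "off", "DC Present": "lit", "AC Present": "off"}
    for action in suggest_actions(seen, POWER_SUPPLY_LED_ACTIONS):
        print(action)
```

A table-driven lookup keeps the sketch close to the documentation: adding the disk drive or fan tray LEDs would mean adding rows, not logic.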


What Next

If LEDs do not disclose the source of a suspected problem, try running power-on self-tests (POST). Refer to:


How to Isolate Faults Using POST Diagnostics

This section explains how to run power-on self-test (POST) diagnostics to isolate faults in a Sun Fire V490 server. For background information about POST diagnostics and the boot process, refer to Chapter 6.

Before You Begin

You must ensure that the system is configured to run diagnostic tests. Refer to:

You must additionally decide whether you want to view POST diagnostic output locally, via a terminal or tip connection to the machine's serial port, or remotely after redirecting system console output to the system controller (SC).



Note - A server can have only one system console at a time, so if you redirect output to the system controller, no information appears at the serial port (ttya).



What to Do

1. Set up a console for viewing POST messages.

Connect an alphanumeric terminal to the Sun Fire V490 server or establish a tip connection to another Sun system. Refer to:

2. (Optional) Redirect console output to the system controller, if desired.

For instructions, refer to How to Redirect the System Console to the System Controller.

3. Start POST diagnostics. Type:


ok post

The system runs the POST diagnostics and displays status and error messages via either the local serial terminal (ttya) or the redirected (system controller) system console.

4. Examine the POST output.

Each POST error message includes a "best guess" as to which field-replaceable unit (FRU) was the source of failure. In some cases, there may be more than one possible source, and these are listed in order of decreasing likelihood.



Note - Should the POST output contain code names and acronyms with which you are unfamiliar, see TABLE 6-13, Reference for Terms in Diagnostic Output.



What Next

Have a qualified service technician replace the FRU or FRUs indicated by POST error messages, if any. For replacement instructions, refer to:

If the POST diagnostics did not disclose any problems, but your system does not start, try running the interactive OpenBoot Diagnostics tests.


How to Isolate Faults Using Interactive OpenBoot Diagnostics Tests

Before You Begin

Because OpenBoot Diagnostics tests require access to some of the same hardware resources used by the operating system, they cannot be operated reliably after an operating system halt or Stop-A key sequence. You need to reset the system before running OpenBoot Diagnostics tests, and then reset the system again after testing. Instructions for doing this follow.

This procedure assumes you have established a system console. Refer to:

What to Do

1. Halt the server to reach the ok prompt.

How you do this depends on the system's condition. If possible, you should warn users and shut down the system gracefully. For information, refer to About the ok Prompt.

2. Set the auto-boot? diagnostic configuration variable to false. Type:


ok setenv auto-boot? false

3. Reset or power cycle the system.

4. Invoke the OpenBoot Diagnostics tests. Type:


ok obdiag

The obdiag prompt and test menu appear. The menu is shown in FIGURE 6-4.

5. Type the appropriate command and numbers for the tests you want to run.

For example, to run all available OpenBoot Diagnostics tests, type:


obdiag> test-all

To run a particular test, type:


obdiag> test #

where # represents the number of the desired test.

For a list of OpenBoot Diagnostics test commands, refer to Interactive OpenBoot Diagnostics Commands. The numbered menu of tests is shown in FIGURE 6-4.

6. When you are done running OpenBoot Diagnostics tests, exit the test menu. Type:


obdiag> exit

The ok prompt reappears.

7. Set the auto-boot? diagnostic configuration variable back to true. Type:


ok setenv auto-boot? true

This allows the operating system to resume starting up automatically after future system resets or power cycles.
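As a recap, the commands in the procedure above can be listed in order. The Python sketch below only assembles the command strings, prompts included, as a checklist; it does not communicate with a server. The function name is hypothetical, and using `reset-all` for Step 3 is an assumption, since that step also permits a power cycle.

```python
from typing import List, Optional


def obdiag_session_commands(test_number: Optional[int] = None) -> List[str]:
    """Return the commands of the interactive OpenBoot Diagnostics procedure.

    With no test_number, the list runs every available test (test-all);
    otherwise it runs the single numbered test from the obdiag menu.
    """
    if test_number is None:
        run_step = "obdiag> test-all"
    else:
        run_step = f"obdiag> test {test_number}"
    return [
        "ok setenv auto-boot? false",  # Step 2: stop the OS from starting after reset
        "ok reset-all",                # Step 3: assumption, a power cycle also qualifies
        "ok obdiag",                   # Step 4: open the test menu
        run_step,                      # Step 5: run the chosen tests
        "obdiag> exit",                # Step 6: return to the ok prompt
        "ok setenv auto-boot? true",   # Step 7: restore automatic startup
    ]


if __name__ == "__main__":
    for line in obdiag_session_commands():
        print(line)
```

The checklist makes the symmetry of the procedure visible: the `auto-boot?` change made at the start is undone at the end so that later resets start the operating system automatically.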

What Next

Have a qualified service technician replace the FRU or FRUs indicated by OpenBoot Diagnostics error messages, if any. For replacement instructions, refer to:

This document is included on the Sun Fire V490 Documentation CD.


How to View Diagnostic Test Results After the Fact

Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.

Before You Begin

You must set up a system console. Refer to:

Then halt the server to reach the ok prompt. Refer to:

What to Do

To view a summary of the most recent POST results, type:


ok show-post-results

To view a summary of the most recent OpenBoot Diagnostics test results, type:


ok show-obdiag-results

What Next

You should see a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot Diagnostics tests.


How to View and Set OpenBoot Configuration Variables

Switches and diagnostic configuration variables stored by the system firmware determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, refer to TABLE 6-2.

Before You Begin

Halt the server to reach the ok prompt. Refer to:

What to Do

To display the current values of all OpenBoot configuration variables, use the printenv command.

The following example shows a short excerpt of this command's output.


ok printenv
Variable Name         Value                          Default Value
 
diag-level            min                            max
diag-switch?          false                          false

To set or change the value of an OpenBoot configuration variable, use the setenv command:


ok setenv diag-level max
diag-level = max

To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space:


ok setenv post-trigger power-on-reset error-reset
post-trigger = power-on-reset error-reset



Note - The test-args variable operates differently from other OpenBoot configuration variables. It requires a single argument consisting of a comma-separated list of keywords. For details, refer to Controlling OpenBoot Diagnostics Tests.
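The difference between space-separated keywords and the comma-separated argument of test-args can be illustrated with a small Python helper that formats a setenv line. The helper name is hypothetical, and the test-args keywords in the usage example (`keyword1`, `keyword2`) are placeholders, not real keywords.

```python
from typing import Iterable


def setenv_command(variable: str, keywords: Iterable[str]) -> str:
    """Format a setenv line for a variable that takes one or more keywords.

    Most variables take keywords separated by spaces. The test-args variable
    takes a single argument made of comma-separated keywords.
    """
    separator = "," if variable == "test-args" else " "
    return f"setenv {variable} {separator.join(keywords)}"


if __name__ == "__main__":
    # Matches the post-trigger example shown in this section.
    print(setenv_command("post-trigger", ["power-on-reset", "error-reset"]))
    # Placeholder keywords, for illustration of the comma-separated form only.
    print(setenv_command("test-args", ["keyword1", "keyword2"]))
```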



What Next

Changes to OpenBoot configuration variables usually take effect upon the next reboot.
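Before rebooting, it can help to review which variables differ from their defaults. The Python sketch below parses captured printenv text in the three-column layout shown in the excerpt earlier in this section and reports the variables whose value differs from the default. It is an illustration only: the function names are hypothetical, and the parser assumes columns are separated by two or more spaces, so rows with an empty value column are skipped.

```python
import re

# The printenv excerpt shown earlier in this section.
PRINTENV_EXCERPT = """\
Variable Name         Value                          Default Value

diag-level            min                            max
diag-switch?          false                          false
"""


def parse_printenv(text):
    """Return {name: (value, default)} from three-column printenv text."""
    settings = {}
    for line in text.splitlines():
        # Columns are assumed to be separated by runs of two or more spaces,
        # so multi-keyword values with single spaces stay in one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 3 or fields[0] == "Variable Name":
            continue  # skip the header, blank lines, and rows this sketch cannot parse
        name, value, default = fields
        settings[name] = (value, default)
    return settings


def non_default(settings):
    """Return the sorted names of variables whose value differs from the default."""
    return sorted(name for name, (value, default) in settings.items()
                  if value != default)


if __name__ == "__main__":
    print(non_default(parse_printenv(PRINTENV_EXCERPT)))
```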


Reference for Choosing a Fault Isolation Tool

This section helps you choose the right tool to isolate a failed part in a Sun Fire V490 system. Consider the following questions when selecting a tool.

1. Have you checked the LEDs?

Certain system components have built-in LEDs that can alert you when that component requires replacement. For detailed instructions, refer to How to Isolate Faults Using LEDs.

2. Does the system have main power?

If there is no main power to the system, standby power from the SC card may enable you to check the status of some components. Refer to About Monitoring the System.

3. Does the system boot?


FIGURE 10-1 Choosing a Tool to Isolate Hardware Faults (flowchart showing how to choose the appropriate fault isolation tool)


4. Do you intend to run the tests remotely?

Both Sun Management Center and RSC software enable you to run tests from a remote computer. In addition, RSC software provides a means of redirecting system console output, allowing you to remotely view and run tests, such as POST diagnostics, that usually require physical proximity to the serial port on the system's back panel.

5. Will the tool test the suspected source(s) of the problem?

Perhaps you already have some idea of what the problem is. If so, you want to use a diagnostic tool capable of testing the suspected problem sources.

6. Is the problem intermittent or software-related?

If a problem is not caused by a clearly defective hardware component, then you may want to use a system exerciser tool rather than a fault isolation tool. Refer to Chapter 12 for instructions and About Exercising the System for background information.
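The questions above can be read as a short checklist. The Python sketch below turns the answers that this section describes in text into a list of hints. It is a partial illustration only: the function and parameter names are hypothetical, and question 3 (whether the system boots) is omitted because its guidance is given in FIGURE 10-1 rather than in the text.

```python
def fault_isolation_hints(leds_checked, has_main_power, run_remotely,
                          intermittent_or_software):
    """Return hints drawn from the questions in this section, in question order."""
    hints = []
    if not leds_checked:
        hints.append("Check the LEDs first (How to Isolate Faults Using LEDs).")
    if not has_main_power:
        hints.append("Standby power from the SC card may let you check the status "
                     "of some components (About Monitoring the System).")
    if run_remotely:
        hints.append("Use Sun Management Center or RSC software; RSC software can "
                     "redirect system console output.")
    if intermittent_or_software:
        hints.append("Consider a system exerciser tool rather than a fault "
                     "isolation tool (Chapter 12).")
    return hints


if __name__ == "__main__":
    for hint in fault_isolation_hints(leds_checked=True, has_main_power=True,
                                      run_remotely=True,
                                      intermittent_or_software=False):
        print(hint)
```

The hints do not replace question 5: whichever tool is chosen still has to be capable of testing the suspected source of the problem.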