C H A P T E R 6

Diagnostic Tools

The Sun Fire V490 server and its accompanying software contain many tools and features that help you:

Isolate problems when there is a failure of a field-replaceable component

Monitor the status of a functioning system

Exercise the system to disclose an intermittent or incipient problem

This chapter introduces the tools that let you accomplish these goals, and helps you to understand how the various tools fit together.

Topics in this chapter include:

About the Diagnostic Tools

About Diagnostics and the Boot Process

About Isolating Faults in the System

About Monitoring the System

About Exercising the System

Reference for OpenBoot Diagnostics Test Descriptions

Reference for Decoding I2C Diagnostic Test Messages

Reference for Terms in Diagnostic Output

If you only want instructions for using diagnostic tools, skip this chapter and turn to
Part Three of this manual. There, you can find chapters that tell you how to isolate failed parts (Chapter 10), monitor the system (Chapter 11), and exercise the system (Chapter 12).

About the Diagnostic Tools

Sun provides a wide spectrum of diagnostic tools for use with the Sun Fire V490 server. These tools range from the formal--like Sun's comprehensive Validation Test Suite (SunVTS), to the informal--like log files that may contain clues helpful in narrowing down the possible sources of a problem.

The diagnostic tool spectrum also ranges from standalone software packages, to firmware-based power-on self-tests (POST), to hardware LEDs that tell you when the power supplies are operating.

Some diagnostic tools enable you to examine many computers from a single console, others do not. Some diagnostic tools stress the system by running tests in parallel, while other tools run sequential tests, enabling the machine to continue its normal functions. Some diagnostic tools function even when power is absent or the machine is out of commission, while others require the operating system to be up and running.

The full palette of tools discussed in this manual is summarized in TABLE 6-1.

TABLE 6-1 Summary of Diagnostic Tools
Diagnostic Tool	Type	What It Does	Accessibility and Availability	Remote Capability
LEDs	Hardware	Indicate status of overall system and particular components	Accessed from system chassis. Available anytime power is available	Local, but can be viewed via SC
POST	Firmware	Tests core components of system	Runs automatically on startup. Available when the operating system is not running	Local, but can be viewed via SC
OpenBoot Diagnostics	Firmware	Tests system components, focusing on peripherals and I/O devices	Runs automatically or interactively. Available when the operating system is not running	Local, but can be viewed via SC
OpenBoot commands	Firmware	Display various kinds of system information	Available whether or not the operating system is running	Local, but can be accessed via SC
Solaris commands	Software	Display various kinds of system information	Requires operating system	Local, but can be accessed via SC
SunVTS	Software	Exercises and stresses the system, running tests in parallel	Requires operating system. Optional package may need to be installed	View and control over network
SC card and RSC software	Hardware and software	Monitors environmental conditions, performs basic fault isolation, and provides remote console access	Can function on standby power and without operating system	Designed for remote access
Sun Management Center	Software	Monitors both hardware environmental conditions and software performance of multiple machines. Generates alerts for various conditions	Requires operating system to be running on both monitored and master servers. Requires a dedicated database on the master server	Designed for remote access
Hardware Diagnostic Suite	Software	Exercises an operational system by running sequential tests. Also reports failed FRUs	Separately purchased optional add-on to Sun Management Center. Requires operating system and Sun Management Center	Designed for remote access

Why are there so many different diagnostic tools?

There are a number of reasons for the lack of a single all-in-one diagnostic test, starting with the complexity of the server systems.

Consider the data bus built into every Sun Fire V490 server. This bus features a five-way switch called a CDX that interconnects all processors and high-speed I/O interfaces (refer to FIGURE 6-1). This data switch enables multiple simultaneous transfers over its private data paths. This sophisticated high-speed interconnect represents just one facet of the Sun Fire V490 server's advanced architecture.

FIGURE 6-1 Simplified Schematic View of a Sun Fire V490 System

This illustration presents a simplified schematic view of a Sun Fire V490 system

Consider also that some diagnostics must function even when the system fails to start. Any diagnostic capable of isolating problems when the system fails to start up must be independent of the operating system. But any diagnostic that is independent of the operating system will also be unable to make use of the operating system's considerable resources for getting at the more complex causes of failures.

Another complicating factor is that different installations have different diagnostic requirements. You may be administering a single computer or a whole data center full of equipment racks. Alternatively, your systems may be deployed remotely-- perhaps in areas that are physically inaccessible.

Finally, consider the different tasks you expect to perform with your diagnostic tools:

Isolating faults to a specific replaceable hardware component

Exercising the system to disclose more subtle problems that may or may not be hardware related

Monitoring the system to catch problems before they become serious enough to cause unplanned downtime

Not every diagnostic tool can be optimized for all these varied tasks.

Instead of one unified diagnostic tool, Sun provides a palette of tools each of which has its own specific strengths and applications. To appreciate how each tool fits into the larger picture, it is necessary to have some understanding of what happens when the server starts up, during the so-called boot process.

About Diagnostics and the Boot Process

You have probably had the experience of powering on a Sun system and watching as it goes through its boot process. Perhaps you have watched as your console displays messages that look like the following:

0:0>

0:0>@(#) Sun Fire[TM] V480/V490 POST 4.15 2004/04/09 16:27

0:0>Copyright © 2004 Sun Microsystems, Inc. All rights reserved

  SUN PROPRIETARY/CONFIDENTIAL.

  Use is subject to license terms.

0:0>Jump from OBP->POST.

0:0>Diag level set to MIN.

0:0>Verbosity level set to NORMAL.

0:0>

0:0>Start selftest...

0:0>CPUs present in system: 0:0 1:0 2:0 3:0

0:0>Test CPU(s)....Done

It turns out these messages are not quite so inscrutable once you understand the boot process. These kinds of messages are discussed later.

It is important to understand that almost all of the firmware-based diagnostics can be disabled so as to minimize the amount of time it takes the server to start up. In the following discussion, assume that the system is configured to run its firmware-based tests.

Prologue: System Controller Boot

As soon as you plug in the Sun Fire V490 server to an electrical outlet, and before you turn on power to the server, the system controller (SC) inside the server begins its self-diagnostic and boot cycle. During this time, the locator LED blinks. Running off standby power, the system controller card begins functioning before the server itself comes up.

The system controller provides access to a number of control and monitoring functions through Remote System Control (RSC) software. For more information about RSC software, refer to Sun Remote System Control Software.

Stage One: OpenBoot Firmware and POST

Every Sun Fire V490 server includes a chip holding about 2 Mbytes of firmware-based code. This chip is called the Boot PROM. After you turn on system power, the first thing the system does is execute code that resides in the Boot PROM.

This code, which is referred to as the OpenBoot firmware, is a small-scale operating system unto itself. However, unlike a traditional operating system that can run multiple applications for multiple simultaneous users, OpenBoot firmware runs in single-user mode and is designed solely to test, configure, and boot the system, thereby ensuring that the hardware is sufficiently "healthy" to run its normal operating system software.

When system power is turned on, the OpenBoot firmware begins running directly out of the Boot PROM, since at this stage system memory has not been verified to work properly.

Soon after power is turned on, the system hardware determines that at least one processor is powered on, and is submitting a bus access request, which indicates that the processor in question is at least partly functional. This becomes the master processor, and is responsible for executing OpenBoot firmware instructions.

The OpenBoot firmware's first actions are to check whether to run the power-on self-test (POST) diagnostics and other tests. The POST diagnostics constitute a separate chunk of code stored in a different area of the Boot PROM (refer to FIGURE 6-2).

FIGURE 6-2 Boot PROM and IDPROM

This illustration presents a block diagram view of the Boot PROM and IDPROM

The extent of these power-on self-tests, and whether they are performed at all, is controlled by configuration variables stored in a separate firmware memory device called the IDPROM. These OpenBoot configuration variables are discussed in Controlling POST Diagnostics.

As soon as POST diagnostics can verify that some subset of system memory is functional, tests are loaded into system memory.

The Purpose of POST Diagnostics

The POST diagnostics verify the core functionality of the system. A successful execution of the POST diagnostics does not ensure that there is nothing wrong with the server, but it does ensure that the server can proceed to the next stage of the boot process.

For a Sun Fire V490 server, this means:

At least one of the processors is working.

At least a subset of system memory is functional.

Cache memory is functional.

Data switches located both on the CPU/Memory boards and the centerplane are functioning.

Input/output bridges located on the centerplane are functioning.

The PCI bus is intact--that is, there are no electrical shorts.

It is possible for a system to pass all POST diagnostics and still be unable to boot the operating system. However, you can run POST diagnostics even when a system fails to boot, and these tests are likely to disclose the source of most hardware problems.

POST generally reports errors that are persistent in nature. To catch intermittent problems, consider running a system exercising tool. Refer to About Exercising the System.

What POST Diagnostics Do

Each POST diagnostic is a low-level test designed to pinpoint faults in a specific hardware component. For example, individual memory tests called address bitwalk and data bitwalk ensure that binary 0s and 1s can be written on each address and data line. During such a test, the POST may display output similar to this:

1:0>Data Bitwalk on Slave 3

1:0>	Test Bank 0.

In this example, processor 1 is the master processor, as indicated by the prompt 1:0>, and it is about to test the memory associated with processor 3, as indicated by the message "Slave 3."

Note - The x:y numbering system identifies processors that have multiple cores.

The failure of such a test reveals precise information about particular integrated circuits, the memory registers inside them, or the data paths connecting them:

1:0>ERROR: TEST = Data Bitwalk on Slave 3

1:0>H/W under test = CPU3 Memory

1:0>MSG = ERROR:	miscompare on mem test!

	Address: 00000030.001b0038

	Expected: 00000000.00100000

	Observed: 00000000.00000000

What POST Error Messages Tell You

When a specific power-on self-test discloses an error, it reports different kinds of information about the error:

The specific test that failed

The specific circuit or subcomponent that is most likely at fault

The field-replaceable units (FRUs) most likely to require replacement, in order of likelihood

Here is an excerpt of POST output showing another error message.

CODE EXAMPLE 6-1 POST Error Message

0:0>Schizo unit 1 PCI DMA C test

0:0>	FAILED

0:0>ERROR: TEST = Schizo unit 1 PCI DMA C test

0:0>H/W under test = Motherboard/Centerplane Schizo 1, I/O Board, CPU

0:0>MSG =

0:0>	Schizo Error - 16bit Data miss compare

0:0>	address  0000060300012800

0:0>	expected 0001020304050607

0:0>	observed 0000000000000000

0:0>END_ERROR

Identifying FRUs

An important feature of POST error messages is the H/W under test line. (Refer to the arrow in CODE EXAMPLE 6-1.)

The H/W under test line indicates which FRU or FRUs may be responsible for the error. Note that in CODE EXAMPLE 6-1, three different FRUs are indicated. Using TABLE 6-13 to decode some of the terms, you can refer to that this POST error was most likely caused by a bad system interconnect circuit (Schizo) on the centerplane. However, the error message also indicates that the PCI riser board (I/O board) may be at fault. In the least likely case, the error might stem from the master processor, in this case processor 0.

Why a POST Error May Implicate Multiple FRUs

Because each test operates at such a low level, the POST diagnostics are often more definite in reporting the minute details of the error, like the numerical values of expected and observed results, than they are about reporting which FRU is responsible. If this seems counter-intuitive, consider the block diagram of one data path within a Sun Fire V490 server, shown in FIGURE 6-3.

FIGURE 6-3 POST Diagnostic Running Across FRUs

This illustration depicts the FRU boundaries in a single data path within a Sun Fire V490 server

The dashed lines in FIGURE 6-3 represent boundaries between FRUs. Suppose a POST diagnostic is running in the processor in the left part of the diagram. This diagnostic attempts to initiate a built-in self-test in a PCI device located in the right side of the diagram.

If this built-in self-test fails, there could be a fault in the PCI controller, or, less likely, in one of the data paths or components leading to that PCI controller. The POST diagnostic can tell you only that the test failed, but not why. So, though the POST may present very precise data about the nature of the test failure, any of three different FRUs could be implicated.

Controlling POST Diagnostics

You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables in the IDPROM. Changes to OpenBoot configuration variables generally take effect only after the machine is restarted. These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.

TABLE 6-2 lists the most important and useful of these variables. You can find more extensive lists and descriptions in OpenBoot PROM Enhancements for Diagnostic Operation and OpenBoot 4.x Command Reference Manual. The former is included on the Sun Fire V490 Documentation CD. The latter is included with the Solaris Software Supplement CD that ships with Solaris software.

You can find instructions for changing OpenBoot configuration variables in How to View and Set OpenBoot Configuration Variables.

TABLE 6-2 OpenBoot Configuration Variables
OpenBoot Configuration Variable	Description and Keywords
`auto-boot`	Determines whether the operating system automatically starts up. Default is `true`. `true`--Operating system automatically starts once firmware tests finish. `false`--System remains at `ok` prompt until you type `boot`.
`auto-boot-on-error?`	Determines whether the system attempts to boot after a nonfatal error. Default is `true`. `true`--System automatically boots after a nonfatal error if the variable `auto-boot?` is also set to `true`. `false`--System remains at the `ok` prompt.
`diag-level`	Determines the level or type of diagnostics executed. Default is `max`. `off`--No testing. `min`--Only basic tests are run. `max`--More extensive tests may be run, depending on the device.
`diag-out-console`	Redirects diagnostic and console messages to the system controller. Default is `false`. `true`--Display diagnostic messages via the SC console. `false`--Display diagnostic messages via the serial port `ttya` or a graphics terminal.
`diag-script`	Determines which devices are tested by OpenBoot Diagnostics. Default is `normal`. `none`--No devices are tested. `normal`--On-board (centerplane-based) devices that have self-tests are tested. `all`--All devices that have self-tests are tested.
`diag-switch?`	Controls diagnostic execution in normal mode. Default is `false`. `true`--Diagnostics are only executed on power-on reset events, but the level of test coverage, verbosity, and output is determined by user-defined settings. `false`--Diagnostics are executed upon next system reset, but only for those class of reset events specified by the OpenBoot configuration variable `diag-trigger`. The level of test coverage, verbosity, and output is determined by user-defined settings. Note: The above behaviors only apply to server machines like the Sun Fire V490 server. Workstations behave differently. For details, refer to OpenBoot PROM Enhancements for Diagnostic Operation.
`diag-trigger`	Specifies the class of reset event that causes diagnostic tests to run. This variable can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, refer to How to View and Set OpenBoot Configuration Variables. Default is `power-on-reset` and `error-reset`. `error-reset`--Reset that is caused by certain hardware error events such as RED State Exception Reset, Watchdog Reset, Software-Instruction Reset, or Hardware Fatal Reset. `power-on-reset`--Reset that is caused by power cycling the system. `user-reset`--Reset that is initiated by an operating system panic or by user-initiated commands from OpenBoot (`reset-all` or `boot`) or from Solaris (`reboot`, `shutdown`, or `init`). `all-resets`--Any kind of system reset. `none`--No power-on self-tests or OpenBoot Diagnostics tests run.
`input-device`	Selects where console input is taken from. Default is `keyboard`. `ttya`--From built-in serial port. `keyboard`--From attached keyboard that is part of a graphics terminal. `rsc-console`--From the system controller. Note: Should the specified input device be unavailable, the system automatically reverts to `ttya`.
`output-device`	Selects where diagnostic and other console output is displayed. Default is `screen`. `ttya`--To built-in serial port. `screen`--To attached screen that is part of a graphics terminal. `rsc-console`--To the system controller. Note: POST messages cannot be displayed on a graphics terminal. They are sent to `ttya` even when `output-device` is set to `screen`. Should the specified output device be unavailable, the system automatically reverts to `ttya`.
`service-mode?`	Controls whether the system is in service mode. Default is `false`. `true`--Service mode. Diagnostics are executed at Sun-specified levels, overriding but preserving user settings. `false`--Normal mode, unless overridden by the system control switch. Diagnostics execution depends entirely on the settings of `diag-switch?` and other user-defined OpenBoot configuration variables. Note: If the system control switch is in Diagnostics position, the system will boot in service mode even if the `service-mode?` variable is `false`.

Stage Two: OpenBoot Diagnostics Tests

Once POST diagnostics have finished running, POST reports back to the OpenBoot firmware the status of each test it has run. Control then reverts back to the OpenBoot firmware code.

OpenBoot firmware code compiles a hierarchical "census" of all devices in the system. This census is called a device tree. Though different for every system configuration, the device tree generally includes both built-in system components and optional PCI bus devices.

Following the successful execution of POST diagnostics, the OpenBoot firmware proceeds to run OpenBoot Diagnostics tests. Like the POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the Boot PROM.

What Are OpenBoot Diagnostics Tests For?

OpenBoot Diagnostics tests focus on system I/O and peripheral devices. Any device in the device tree, regardless of manufacturer, that includes an IEEE 1275-compatible self-test is included in the suite of OpenBoot Diagnostics tests. On a Sun Fire V490 server, OpenBoot Diagnostics test the following system components:

I/O interfaces; including USB and serial ports

System controller

Keyboard, mouse, and video (when present)

On-board boot devices (Ethernet, disk controller)

Any PCI option card with an IEEE 1275-compatible built-in self-test

By default, the OpenBoot Diagnostics tests run automatically via a script when you start up the system. However, you can also run OpenBoot Diagnostics tests manually, as explained in the next section.

Controlling OpenBoot Diagnostics Tests

When you restart the system, you can run OpenBoot Diagnostics tests either interactively from a test menu, or by entering commands directly from the ok prompt.

Most of the same OpenBoot configuration variables you use to control POST (refer to TABLE 6-2) also affect OpenBoot Diagnostics tests. Notably, you can determine OpenBoot Diagnostics testing level--or suppress testing entirely--by appropriately setting the diag-level variable.

In addition, the OpenBoot Diagnostics tests use a special variable called test-args that enables you to customize how the tests operate. By default, test-args is set to contain an empty string. However, you can set test-args to one or more of the reserved keywords, each of which has a different effect on OpenBoot Diagnostics tests. TABLE 6-3 lists the available keywords.

TABLE 6-3 Keywords for the `test-args` OpenBoot Configuration Variable
Keyword	What It Does
`bist`	Invokes built-in self-test (BIST) on external and peripheral devices
`debug`	Displays all debug messages
`iopath`	Verifies bus/interconnect integrity
`loopback`	Exercises external loopback path for the device
`media`	Verifies external and peripheral device media accessibility
`restore`	Attempts to restore original state of the device if the previous execution of the test failed
`silent`	Displays only errors rather than the status of each test
`subtests`	Displays main test and each subtest that is called
`verbose`	Displays detailed messages of status of all tests
`callers=N`	Displays backtrace of N callers when an error occurs `callers=0`--Displays backtrace of all callers before the error
`errors=N`	Continues executing the test until N errors are encountered `errors=0`--Displays all error reports without terminating testing

If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:

ok setenv test-args debug,loopback,media

From the OpenBoot Diagnostics Test Menu

It is easiest to run OpenBoot Diagnostics tests interactively from a menu. You access the menu by typing obdiag at the ok prompt. Refer to How to Isolate Faults Using Interactive OpenBoot Diagnostics Tests for full instructions.

The obdiag> prompt and the OpenBoot Diagnostics interactive menu (FIGURE 6-4) appear. For a brief explanation of each OpenBoot Diagnostics test, refer to TABLE 6-10 in Reference for OpenBoot Diagnostics Test Descriptions.

FIGURE 6-4 OpenBoot Diagnostics Interactive Test Menu

This figure depicts the interactive OpenBoot Diagnostics test menu for a Sun Fire V490 server

Interactive OpenBoot Diagnostics Commands

You run individual OpenBoot Diagnostics tests from the obdiag> prompt by typing:

obdiag> test n

where n represents the number associated with a particular menu item.

There are several other commands available to you from the obdiag> prompt. For descriptions of these commands, refer to TABLE 6-11 in Reference for OpenBoot Diagnostics Test Descriptions.

You can obtain a summary of this same information by typing help at the obdiag> prompt.

From the `ok` Prompt: The `test` and `test-all` Commands

You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:

ok test /pci@x,y/SUNW,qlc@2

Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V490 system.

To customize an individual test, you can use test-args as follows:

ok test /usb@1,3:test-args={verbose,debug}

This affects only the current test without changing the value of the test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:

ok test-all

If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all connected devices with self-tests:

ok test-all /pci@9,700000/usb@1,3

What OpenBoot Diagnostics Error Messages Tell You

OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 6-2 displays a sample OpenBoot Diagnostics error message.

CODE EXAMPLE 6-2 OpenBoot Diagnostics Error Message

Testing /pci@9,700000/ebus@1/rsc-control@1,3062f8

   ERROR   : SC card is not present in system, or SC card is broken.

   DEVICE  : /pci@9,700000/ebus@1/rsc-control@1,3062f8

   SUBTEST : selftest

   CALLERS : main

   MACHINE : Sun Fire V490

   SERIAL# : 705459

   DATE    : 11/28/2001 14:46:21  GMT

   CONTR0LS: diag-level=min test-args=media,verbose,subtests

Error: /pci@9,700000/ebus@1/rsc-control@1,3062f8 selftest failed, return code = 1

Selftest at /pci@9,700000/ebus@1/rsc-control@1,3062f8 (errors=1) ...... failed

Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:0

I2C Bus Device Tests

The i2c@1,2e and i2c@1,30 OpenBoot Diagnostics tests examine and report on environmental monitoring and control devices connected to the Sun Fire V490 server's Inter-IC (I²C) bus.

Error and status messages from the i2c@1,2e and i2c@1,30 OpenBoot Diagnostics tests include the hardware addresses of I²C bus devices:

Testing /pci@9,700000/ebus@1/i2c@1,2e/fru@2,a8

The I²C device address is given at the very end of the hardware path. In this example, the address is 2,a8, which indicates a device located at hexadecimal address A8 on segment 2 of the I²C bus.

To decode this device address, refer to Reference for Decoding I2C Diagnostic Test Messages. Using TABLE 6-12, you can refer to that fru@2,a8 corresponds to an I²C device on DIMM 4 on processor 2. If the i2c@1,2e test were to report an error against fru@2,a8, you would need to replace this memory module.

Other OpenBoot Commands

Beyond the formal firmware-based diagnostic tools, there are a few commands you can invoke from the ok prompt. These OpenBoot commands display information that can help you assess the condition of a Sun Fire V490 server. These include the following commands:

.env command

printenv command

probe-scsi and probe-scsi-all commands

probe-ide command

show-devs command

This section describes the information these commands give you. For instructions on using these commands, turn to How to Use OpenBoot Information Commands, or look up the appropriate man page.

`.env` Command

The .env command displays the current environmental status, including fan speeds; and voltages, currents, and temperatures measured at various system locations. For more information, refer to About OpenBoot Environmental Monitoring, and How to Obtain OpenBoot Environmental Status Information.

`printenv` Command

The printenv command displays the OpenBoot configuration variables. The display includes the current values for these variables as well as the default values. For details, refer to How to View and Set OpenBoot Configuration Variables.

For more information about printenv, refer to the printenv man page. For a list of some important OpenBoot configuration variables, refer to TABLE 6-2.

`probe-scsi` and `probe-scsi-all` Commands

The probe-scsi and probe-scsi-all commands check the presence of SCSI or FC-AL devices and verify that the bus itself is operating properly.

Caution - If you used the haltcommand or the Stop-A key sequence to reach the okprompt, then issuing the probe-scsior probe-scsi-allcommand can hang the system.

The probe-scsi command communicates with all SCSI and FC-AL devices connected to on-board SCSI and FC-AL controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.

For any SCSI or FC-AL device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.

The following is sample output from the probe-scsi command.

CODE EXAMPLE 6-3 probe-scsi Command Output

ok probe-scsi

LiD HA LUN  --- Port WWN ---  ----- Disk description -----

 0   0   0  2100002037cdaaca  SEAGATE ST336704FSUN36G 0726

 1   1   0  2100002037a9b64e  SEAGATE ST336704FSUN36G 0726

The following is sample output from the probe-scsi-all command.

CODE EXAMPLE 6-4 probe-scsi-all Command Output

ok probe-scsi-all

/pci@9,600000/SUNW,qlc@2

LiD HA LUN  --- Port WWN ---  ----- Disk description -----

 0   0   0  2100002037cdaaca  SEAGATE ST336704FSUN36G 0726

 1   1   0  2100002037a9b64e  SEAGATE ST336704FSUN36G 0726

/pci@8,600000/scsi@1,1

Target 4

  Unit 0   Disk     SEAGATE ST32550W SUN2.1G0418

/pci@8,600000/scsi@1

/pci@8,600000/pci@2/SUNW,qlc@5

/pci@8,600000/pci@2/SUNW,qlc@4

LiD HA LUN  --- Port WWN ---  ----- Disk description -----

 0   0   0  2200002037cdaaca  SEAGATE ST336704FSUN36G 0726

 1   1   0  2200002037a9b64e  SEAGATE ST336704FSUN36G 0726

Note that the probe-scsi-all command lists dual-ported devices twice. This is because these FC-AL devices (refer to the qlc@2 entry in CODE EXAMPLE 6-4) can be accessed through two separate controllers: the on-board Loop-A controller and the optional Loop-B controller provided through a PCI card.

`probe-ide` Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.

Caution - If you used the haltcommand or the Stop-A key sequence to reach the okprompt, then issuing the probe-idecommand can hang the system.

The following is sample output from the probe-ide command.

CODE EXAMPLE 6-5 probe-scsi-all Command Output

ok probe-ide

  Device 0  ( Primary Master )

         Removable ATAPI Model: TOSHIBA DVD-ROM SD-C2512

  Device 1  ( Primary Slave )

         Not Present

`show-devs` Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 6-6 shows some sample output (edited for brevity).

CODE EXAMPLE 6-6 show-devs Command Output

/pci@9,600000

/pci@9,700000

/pci@8,600000

/pci@8,700000

/memory-controller@3,400000

/SUNW,UltraSPARC-IV@3,0

/memory-controller@1,400000

/SUNW,UltraSPARC-IV@1,0

/virtual-memory

/memory@m0,20

/pci@9,600000/SUNW,qlc@2

/pci@9,600000/network@1

/pci@9,600000/SUNW,qlc@2/fp@0,0

/pci@9,600000/SUNW,qlc@2/fp@0,0/disk

Stage Three: The Operating System

If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have recourse to software-based diagnostic tools, like SunVTS and Sun Management Center. These tools can help you with more advanced monitoring, exercising, and fault isolating capabilities.

Note - If you set the auto-boot OpenBoot configuration variable to false, the operating system does not boot automatically following completion of the firmware-based tests.

In addition to the formal tools that run on top of Solaris OS software, there are other resources that you can use when assessing or monitoring the condition of a Sun Fire V490 server. These include:

Error and system message log files

Solaris system information commands

Error and System Message Log Files

Error and other system messages are saved in the file /var/adm/messages. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.

For information about /var/adm/messages and other sources of system information, refer to your Solaris system administration documentation.

Solaris System Information Commands

Some Solaris commands display data that you can use when assessing the condition of a Sun Fire V490 server. These include the following commands:

prtconf command

prtdiag command

prtfru command

psrinfo command

showrev command

This section describes the information these commands give you. For instructions on using these commands, turn to How to Use Solaris System Information Commands, or look up the appropriate man page.

`prtconf` Command

The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating system software "knows" about. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 6-7 shows an excerpt of prtconf output (edited to save space).

CODE EXAMPLE 6-7 prtconf Command Output

System Configuration:  Sun Microsystems  sun4u

Memory size: 1024 Megabytes

System Peripherals (Software Nodes):

SUNW,Sun-Fire-V490

    packages (driver not attached)

        SUNW,builtin-drivers (driver not attached)

...

    SUNW,UltraSPARC-IV (driver not attached)

    memory-controller, instance #3

    pci, instance #0

        SUNW,qlc, instance #5

            fp (driver not attached)

                disk (driver not attached)

...

    pci, instance #2

        ebus, instance #0

            flashprom (driver not attached)

            bbc (driver not attached)

            power (driver not attached)

            i2c, instance #1

                fru, instance #17

The prtconf command's -p option produces output similar to the OpenBoot
show-devs command (refer to show-devs Command). This output lists only those devices compiled by the system firmware.

`prtdiag` Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components.

The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of some of the output produced by prtdiag on a healthy Sun Fire V490 system running Solaris 8, Update 7.

CODE EXAMPLE 6-8 prtdiag Command Output

System Configuration:  Sun Microsystems  sun4u Sun Fire V490

System clock frequency: 150 MHz

Memory size: 4096 Megabytes

========================= CPUs ===============================================

          Run   E$    CPU     CPU

Brd  CPU  MHz   MB   Impl.    Mask

---  ---  ---  ----  -------  ----

 A    0   900  8.0   US-IV 2.1

 A    2   900  8.0   US-IV 2.1

========================= Memory Configuration ===============================

          Logical  Logical  Logical

     MC   Bank     Bank     Bank         DIMM    Interleave  Interleaved

Brd  ID   num      size     Status       Size    Factor      with

---  ---  ----     ------   -----------  ------  ----------  -----------

 A    0     0       512MB   no_status     256MB     8-way        0

 A    0     1       512MB   no_status     256MB     8-way        0

 A    0     2       512MB   no_status     256MB     8-way        0

 A    0     3       512MB   no_status     256MB     8-way        0

 A    2     0       512MB   no_status     256MB     8-way        0

 A    2     1       512MB   no_status     256MB     8-way        0

 A    2     2       512MB   no_status     256MB     8-way        0

 A    2     3       512MB   no_status     256MB     8-way        0

========================= IO Cards =========================

                    Bus  Max

 IO  Port Bus       Freq Bus  Dev,

Type  ID  Side Slot MHz  Freq Func State Name                       Model

---- ---- ---- ---- ---- ---- ---- ----- -------------------------  ----------------

PCI   8    B    3    33   33    3,0  ok    TECH-SOURCE,gfxp            GFXP

PCI    8    B     5     33    33    5,1   ok    SUNW,hme-pci108e,1001        SUNW,qsi

In addition to that information, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.

CODE EXAMPLE 6-9 prtdiag Verbose Output

System Temperatures (Celsius):

-------------------------------

Device			Temperature				Status

---------------------------------------

CPU0             59             OK

CPU2             64             OK

DBP0             22             OK

In the event of an overtemperature condition, prtdiag reports an error in the Status column.

CODE EXAMPLE 6-10 prtdiag Overtemperature Indication Output

System Temperatures (Celsius):

-------------------------------

Device				Temperature				Status

---------------------------------------

CPU0				62				OK

CPU1		 		102				ERROR

Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.

CODE EXAMPLE 6-11 prtdiag Fault Indication Output

Fan Status:

-----------

Bank             RPM    Status

----            -----   ------

CPU0             4166   [NO_FAULT]

CPU1             0000   [FAULT]

`prtfru` Command

The Sun Fire V490 system maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 6-12 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.

CODE EXAMPLE 6-12 prtfru -l Command Output

/frutree

/frutree/chassis (fru)

/frutree/chassis/io-board (container)

/frutree/chassis/rsc-board (container)

/frutree/chassis/fcal-backplane-slot

CODE EXAMPLE 6-13 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.

CODE EXAMPLE 6-13 prtfru -c Command Output

/frutree/chassis/rsc-board (container)

   SEGMENT: SD

      /ManR

      /ManR/UNIX_Timestamp32: Fri Apr 27 00:12:36 EDT 2001

      /ManR/Fru_Description: SC PLAN B

      /ManR/Manufacture_Loc: BENCHMARK,HUNTSVILLE,ALABAMA,USA

      /ManR/Sun_Part_No: 5015856

      /ManR/Sun_Serial_No: 001927

      /ManR/Vendor_Name: AVEX Electronics

      /ManR/Initial_HW_Dash_Level: 02

      /ManR/Initial_HW_Rev_Level: 50

      /ManR/Fru_Shortname: SC

Data displayed by the prtfru command varies depending on the type of FRU. In general, this information includes:

FRU description

Manufacturer name and location

Part number and serial number

Hardware revision levels

Information about the following Sun Fire V490 FRUs is displayed by the prtfru command:

Centerplane

CPU/Memory boards

DIMMs

FC-AL disk backplane

FC-AL disk drive

PCI riser

Power distribution board

Power supplies

System controller card

`psrinfo` Command

The psrinfo command displays the date and time each processor came online. With the verbose (-v) option, the command displays additional information about the processors, including their clock speed. The following is sample output from the psrinfo command with the -v option.

CODE EXAMPLE 6-14 psrinfo -v Command Output

Status of processor 0 as of: 04/11/03 12:03:45

  Processor has been on-line since 04/11/03 10:53:03.

  The sparcv9 processor operates at 900 MHz,

        and has a sparcv9 floating point processor.

Status of processor 2 as of: 04/11/03 12:03:45

  Processor has been on-line since 04/11/03 10:53:05.

  The sparcv9 processor operates at 900 MHz,

        and has a sparcv9 floating point processor.

`showrev` Command

The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 6-15 shows sample output of the showrev command.

CODE EXAMPLE 6-15 showrev Command Output

Hostname: abc-123

Hostid: cc0ac37f

Release: 5.8

Kernel architecture: sun4u

Application architecture: sparc

Hardware provider: Sun_Microsystems

Domain: Sun.COM

Kernel version: SunOS 5.8 cstone_14:08/01/01 2001

When used with the -p option, this command displays installed patches. CODE EXAMPLE 6-16 shows a partial sample output from the showrev command with the -p option.

CODE EXAMPLE 6-16 showrev -p Command Output

Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 109783-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 109807-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 109809-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 110905-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 110910-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 110914-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu

Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr

Tools and the Boot Process: A Summary

Different diagnostic tools are available to you at different stages of the boot process. TABLE 6-4 summarizes what tools are available to you and when they are available.

TABLE 6-4 Diagnostic Tool Availability
Stage	Available Diagnostic Tools
Stage	Fault Isolation	System Monitoring	System Exercising
Before the operating system starts	- LEDs - POST - OpenBoot Diagnostics	- RSC software - OpenBoot commands	-none-
After the operating system starts	- LEDs	- RSC software - Sun Management Center - Solaris info commands - OpenBoot commands	- SunVTS - Hardware Diagnostic Suite
When the system is down and power is not available	-none-	- RSC software	-none-

About Isolating Faults in the System

Each of the tools available for fault isolation discloses faults in different field-replaceable units (FRUs). The row headings along the left of TABLE 6-5 list the FRUs in a Sun Fire V490 system. The available diagnostic tools are shown in column headings across the top. A check mark (✔) in this table indicates that a fault in a particular FRU can be isolated by a particular diagnostic.

TABLE 6-5 FRU Coverage of Fault Isolating Tools
	LEDs	POST	OpenBoot Diags
CPU/Memory Boards		Yes
IDPROM			Yes
DIMMs		Yes
DVD Drive			Yes
FC-AL Disk Drive	Yes		Yes
Centerplane		Yes	Yes
SC Card			Yes
PCI Riser		Yes	Yes
FC-AL Disk Backplane			Yes
Power Supplies	Yes
Fan Tray 0 (CPU)	Yes
Fan Tray 1 (PCI)	Yes

In addition to the FRUs listed in TABLE 6-5, there are several minor replaceable system components--mostly cables--that cannot directly be isolated by any system diagnostic. For the most part, you determine when these components are faulty by eliminating other possibilities. These FRUs are listed in TABLE 6-6.

TABLE 6-6 FRUs Not Directly Isolated by Diagnostic Tools
FRU	Notes
FC-AL power cable FC-AL signal cable	If OpenBoot Diagnostics tests indicate a disk problem, but replacing the disk does not fix the problem, you should suspect the FC-AL signal and power cables are either defective or improperly connected.
FC-AL power cable FC-AL signal cable
Fan Tray 0 power cable	If the system is powered on and the fan does not spin, or if the Power/OK LED does not come on, but the system is up and running, you should suspect this cable.
Power distribution board	Any power issue that cannot be traced to the power supplies should lead you to suspect the power distribution board. Particular scenarios include: The system will not power on, but the power supply LEDs indicate DC Present System is running, but RSC indicates a missing power supply
Removable media bay board and cable assembly	If OpenBoot Diagnostics tests indicate a problem with the CD/DVD drive, but replacing the drive does not fix the problem, you should suspect this assembly is either defective or improperly connected.
System control switch/power button cable	If the system control switch and Power button appear unresponsive, you should suspect this cable is loose or defective.

About Monitoring the System

Sun provides two tools that can give you advance warning of difficulties and prevent future downtime. These are:

Sun Remote System Control (RSC)

Sun Management Center

These monitoring tools let you specify system criteria that bear watching. For instance, you can set a threshold for system temperature and be notified if that threshold is exceeded.

Monitoring the System Using Remote System Control Software

Sun Remote System Control (RSC) software, working in conjunction with the system controller (SC) card, enables you to monitor and control your server over a serial port or a network. RSC software provides both graphical and command-line interfaces for remotely administering geographically distributed or physically inaccessible machines.

You can also redirect the server's system console to the system controller, which lets you remotely run diagnostics (like POST) that would otherwise require physical proximity to the machine's serial port.

The system controller card runs independently, and uses standby power from the server. Therefore, the SC and its RSC software continue to be effective when the server operating system goes offline.

RSC software lets you monitor the following on the Sun Fire V490 server.

TABLE 6-7 What RSC Software Monitors
Item Monitored	What RSC Software Reveals
Disk drives	Whether each slot has a drive present, and whether it reports OK status
Fan trays	Fan speed and whether the fan trays report OK status
CPU/Memory boards	The presence of a CPU/Memory board, the temperature measured at each processor, and any thermal warning or failure conditions
Power supplies	Whether each bay has a power supply present, and whether it reports OK status
System temperature	System ambient temperature as measured at several locations in the system, as well as any thermal warning or failure conditions
Server front panel	System control switch position and status of LEDs

Before you can start using RSC software, you must install and configure it on the server and client systems. Instructions for doing this are given in the Sun Remote System Control (RSC) 2.2 User's Guide, which is included on the Sun Fire V490 Documentation CD.

You also have to make any needed physical connections and set OpenBoot configuration variables that redirect the console output to the system controller. The latter task is described in How to Redirect the System Console to the System Controller.

For instructions on using RSC software to monitor a Sun Fire V490 system, refer to How to Monitor the System Using the System Controller and RSC Software.

Monitoring the System Using Sun Management Center

Sun Management Center software provides enterprise-wide monitoring of Sun servers and workstations, including their subsystems, components, and peripheral devices. The system being monitored must be up and running, and you need to install all the proper software components on various systems in your network.

Sun Management Center lets you monitor the following on the Sun Fire V490 server.

TABLE 6-8 What Sun Management Center Software Monitors
Item Monitored	What Sun Management Center Reveals
Disk drives	Whether each slot has a drive present, and whether it reports OK status
Fan trays	Whether the fan trays report OK status
CPU/Memory boards	The presence of a CPU/Memory board, the temperature measured at each processor, and any thermal warning or failure conditions
Power supplies	Whether each bay has a power supply present, and whether it reports OK status
System temperature	System ambient temperature as measured at several locations in the system, as well as any thermal warning or failure conditions

How Sun Management Center Works

The Sun Management Center product comprises three software entities:

Agent components

Server component

Monitor components

You install agents on systems to be monitored. The agents collect system status information from log files, device trees, and platform-specific sources, and report that data to the server component.

The server component maintains a large database of status information for a wide range of Sun platforms. This database is updated frequently, and includes information about boards, tapes, power supplies, and disks as well as operating system parameters like load, resource usage, and disk space. You can create alarm thresholds and be notified when these are exceeded.

The monitor components present the collected data to you in a standard format. Sun Management Center software provides both a standalone Java application and a Web browser-based interface. The Java interface affords physical and logical views of the system for highly-intuitable monitoring.

Other Sun Management Center Features

Sun Management Center software provides you with additional tools in the form of an informal tracking mechanism and an optional add-on diagnostics suite. In a heterogeneous computing environment, the product can interoperate with management utilities made by other companies.

Informal Tracking

Sun Management Center agent software must be loaded on any system you want to monitor. However, the product lets you informally track a supported platform even when the agent software has not been installed on it. In this case, you do not have full monitoring capability, but you can add the system to your browser, have Sun Management Center periodically check whether it is up and running, and notify you if it goes out of commission.

Add-On Diagnostic Suite

The Hardware Diagnostic Suite is available as a premium package you can purchase as an add-on to the Sun Management Center product. This suite lets you exercise a system while it is still up and running in a production environment. Refer to Exercising the System Using Hardware Diagnostic Suite for more information.

Interoperability With Third-Party Monitoring Tools

If you administer a heterogeneous network and use a third-party network-based system monitoring or management tool, you may be able to take advantage of Sun Management Center software's support for Tivoli Enterprise Console, BMC Patrol, and HP Openview.

Who Should Use Sun Management Center?

Sun Management Center software is geared primarily toward system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a more modest installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.

The servers being monitored must be up and running if you want to use Sun Management Center, since this tool relies on the Solaris OS. For instructions, refer to How to Monitor the System Using Sun Management Center Software. For detailed information about the product, refer to the Sun Management Center User's Guide.

Obtaining the Latest Information

For the latest information about this product, go to the Sun Management Center Web site at: http://www.sun.com/sunmanagementcenter.

About Exercising the System

It is relatively easy to detect when a system component fails outright. However, when a system has an intermittent problem or seems to be "behaving strangely," a software tool that stresses or exercises the computer's many subsystems can help disclose the source of the emerging problem and prevent long periods of reduced functionality or system downtime.

Sun provides two tools for exercising Sun Fire V490 systems:

Sun Validation Test Suite (SunVTS)

Hardware Diagnostic Suite

TABLE 6-9 shows the FRUs that each system exercising tool is capable of isolating. Note that individual tools do not necessarily test all the components or paths of a particular FRU.

TABLE 6-9 FRU Coverage of System Exercising Tools
	SunVTS	Hardware Diagnostic Suite
CPU/Memory Boards	Yes	Yes
IDPROM	Yes
DIMMs	Yes	Yes
DVD Drive	Yes	Yes
FC-AL Disk Drive	Yes	Yes
Centerplane	Yes	Yes
SC Card	Yes
PCI Riser	Yes	Yes
FC-AL Disk Backplane	Yes

Exercising the System Using SunVTS Software

SunVTS software validation test suite performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.

You can run SunVTS software in five different test modes:

Connection mode - SunVTS software verifies the presence of device controllers on all subsystems. This typically takes no more than a few minutes and is a good way to "sanity check" system connections.

Functional mode - SunVTS software exercises only the specific subsystems you choose. This is the default mode. In Functional mode, selected tests are run in parallel. This mode uses system resources heavily, so you should not run any other applications at the same time.

Auto Config mode - SunVTS software automatically detects all subsystems and exercises them in one of two ways:

Confidence testing -SunVTS software performs one pass of tests on all subsystems, and then stops. For typical system configurations, this requires one or two hours.

Comprehensive testing - SunVTS software exhaustively and repeatedly tests all subsystems for up to 24 hours.

Exclusive mode - SunVTS software exercises only the specific subsystems you choose. Selected tests are run one at a time. A few tests are only available in this mode, including: l1dcachetest, l2cachetest, l2sramtest, mpconstest, mptest, qlctest, ramtest, ssptest, and systest.

Online mode - SunVTS software exercises only the specific subsystems you choose. Selected tests are run one at a time until one complete system pass is achieved. This mode is useful for performing tests while other applications are running.

Since SunVTS software can run many tests in parallel and consume many system resources, you should take care when using it on a production system. If you are stress-testing a system using SunVTS software's Comprehensive test mode, you should not run anything else on that system at the same time.

The Sun Fire V490 server to be tested must be up and running if you want to use SunVTS software, since it relies on the Solaris operating system. Since SunVTS software packages are optional, they may not be installed on your system. Turn to How to Check Whether SunVTS Software Is Installed for instructions.

It is important to use the most-up-to-date version of SunVTS available, to ensure you have the latest suite of tests. To download the most recent SunVTS software, point your Web browser to: http://www.sun.com/oem/products/vts/.

For instructions on running SunVTS software to exercise the Sun Fire V490 server, refer to How to Exercise the System Using SunVTS Software. For more information about the product, refer to:

SunVTS User's Guide - Describes SunVTS features as well as how to start and control the various user interfaces.

SunVTS Test Reference Manual - Describes each SunVTS test, option, and command-line argument.

SunVTS Quick Reference Card - Gives an overview of the main features of the graphical user interface (GUI).

SunVTS Documentation Supplement - Describes the latest product enhancements and documentation updates not included in the SunVTS User's Guide and SunVTS Test Reference Manual.

These documents are available on the Solaris Software Supplement CD and on the Web at: http://docs.sun.com. You should also consult the SunVTS README file located at /opt/SUNWvts/. This document provides late-breaking information about the installed version of the product.

SunVTS Software and Security

During SunVTS software installation, you must choose between Basic or Sun Enterprise Authentication Mechanism (SEAM) security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. SEAM security is based on Kerberos--the standard network authentication protocol--and provides secure user authentication, data integrity, and privacy for transactions over networks.

If your site uses SEAM security, you must have the SEAM client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use SEAM security, do not choose the SEAM option during SunVTS software installation.

If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you choose, you may find yourself unable to run SunVTS tests. For more information, refer to the SunVTS User's Guide and the instructions accompanying the SEAM software.

Exercising the System Using Hardware Diagnostic Suite

The Sun Management Center product features an optional Hardware Diagnostic Suite, which you can purchase as an add-on. The Hardware Diagnostic Suite is designed to exercise a production system by running tests sequentially.

Sequential testing means the Hardware Diagnostic Suite has a low impact on the system. Unlike SunVTS, which stresses a system by consuming its resources with many parallel tests (refer to Exercising the System Using SunVTS Software), the Hardware Diagnostic Suite lets the server run other applications while testing proceeds.

When to Run Hardware Diagnostic Suite

The best use of the Hardware Diagnostic Suite is to disclose a suspected or intermittent problem with a noncritical part on an otherwise functioning machine. Examples might include questionable disk drives or memory modules on a machine that has ample or redundant disk and memory resources.

In cases like these, the Hardware Diagnostic Suite runs unobtrusively until it identifies the source of the problem. The machine under test can be kept in production mode until and unless it must be shut down for repair. If the faulty part is hot-pluggable or hot-swappable, the entire diagnose-and-repair cycle can be completed with minimal impact to system users.

Requirements for Using Hardware Diagnostic Suite

Since it is a part of Sun Management Center, you can only run Hardware Diagnostic Suite if you have set up your data center to run Sun Management Center. This means you have to dedicate a master server to run the Sun Management Center server software that supports Sun Management Center software's database of platform status information. In addition, you must install and set up Sun Management Center agent software on the systems to be monitored. Finally, you need to install the console portion of Sun Management Center software, which serves as your interface to the Hardware Diagnostic Suite.

Instructions for setting up Sun Management Center, as well as for using the Hardware Diagnostic Suite, can be found in the Sun Management Center User's Guide.

Reference for OpenBoot Diagnostics Test Descriptions

This section describes the OpenBoot Diagnostics tests and commands available to you. For background information about these tests, refer to Stage Two: OpenBoot Diagnostics Tests.

TABLE 6-10 OpenBoot Diagnostics Menu Tests
Test Name	What It Does	FRU(s) Tested
`SUNW,qlc@2`	Tests the registers of the Fibre Channel-Arbitrated Loop (FC-AL) subsystem. With `diag-level` set to `max`, verifies each disk can be written to, and with `test-args` set to `media`, performs more extensive disk tests.	Centerplane, FC-AL disk backplane
`bbc@1,0`	Tests all writable registers in the Boot Bus Controller. Also verifies that at least one system processor has Boot Bus access	Centerplane
`ebus@1`	Tests the PCI configuration registers, DMA control registers, and EBus mode registers. Also tests DMA controller functions	Centerplane
`flashprom@0,0`	Performs a checksum test on the Boot PROM	Centerplane
`i2c@1,2e`	Tests segments 0-4 of the I2C environmental monitoring subsystem, which includes various temperature and other sensors located throughout the system	Multiple. Refer to Reference for Decoding I2C Diagnostic Test Messages.
`i2c@1,30`	Same as above, for segment 5 of the I2C environmental monitoring subsystem
`ide@6`	Tests the on-board IDE controller and IDE bus subsystem that controls the DVD drive	PCI riser board, DVD drive
`network@1`	Tests the on-board Ethernet logic, running internal loopback tests. Can also run external loopback tests, but only if you install a loopback connector (not provided)	Centerplane
`network@2`	Same as above, for the other on-board Ethernet controller	Centerplane
`pmc@1,300700`	Tests the registers of the power management controller	PCI riser board
`rsc-control@1,3062f8`	Tests SC hardware, including the SC serial and Ethernet ports	SC card
`rtc@1,300070`	Tests the registers of the real-time clock and then tests the interrupt rates	PCI riser board
`serial@1,400000`	Tests all possible baud rates supported by the `ttya` serial line. Performs an internal and external loopback test on each line at each speed	Centerplane, PCI riser board
`usb@1,3`	Tests the writable registers of the USB open host controller	Centerplane

TABLE 6-11 describes the commands you can type from the obdiag> prompt.

TABLE 6-11 OpenBoot Diagnostics Test Menu Commands
Command	Description
`exit`	Exits OpenBoot Diagnostics tests and returns to the `ok` prompt
`help`	Displays a brief description of each OpenBoot Diagnostics command and OpenBoot configuration variable
`setenv` variable value	Sets the value for an OpenBoot configuration variable (also available from the `ok` prompt)
`test-all`	Tests all devices displayed in the OpenBoot Diagnostics test menu (also available from the `ok` prompt)
`test` #	Tests only the device identified by the given menu entry number. (A similar function is available from the `ok` prompt. Refer to From the ok Prompt: The test and test-all Commands.)
`test #,#`	Tests only the devices identified by the given menu entry numbers
`except #,#`	Tests all devices in the OpenBoot Diagnostics test menu except those identified by the specified menu entry numbers
`versions`	Displays the version, last modified date, and manufacturer of each self-test in the OpenBoot Diagnostics test menu and library
`what #,#`	Displays selected properties of the devices identified by menu entry numbers. The information provided varies according to device type

Reference for Decoding I2C Diagnostic Test Messages

TABLE 6-12 describes each I2C device in a Sun Fire V490 system, and helps you associate each I2C address with the proper FRU. For more information about I2C tests, refer to I2C Bus Device Tests.

TABLE 6-12 Sun Fire V490 I2C Bus Devices
Address	Associated FRU	What the Device Does
fru@0,a0	processor 0, DIMM 0	Provides configuration information for processor 0 DIMMs
fru@0,a2	processor 0, DIMM 1
fru@0,a4	processor 0, DIMM 2
fru@0,a6	processor 0, DIMM 3
fru@0,a8	processor 0, DIMM 4
fru@0,aa	processor 0, DIMM 5
fru@0,ac	processor 0, DIMM 6
fru@0,ae	processor 0, DIMM 7
fru@1,a0	processor 1, DIMM 0	Provides configuration information for processor 1 DIMMs
fru@1,a2	processor 1, DIMM 1
fru@1,a4	processor 1, DIMM 2
fru@1,a6	processor 1, DIMM 3
fru@1,a8	processor 1, DIMM 4
fru@1,aa	processor 1, DIMM 5
fru@1,ac	processor 1, DIMM 6
fru@1,ae	processor 1, DIMM 7
fru@2,a0	processor 2, DIMM 0	Provides configuration information for processor 2 DIMMs
fru@2,a2	processor 2, DIMM 1
fru@2,a4	processor 2, DIMM 2
fru@2,a6	processor 2, DIMM 3
fru@2,a8	processor 2, DIMM 4
fru@2,aa	processor 2, DIMM 5
fru@2,ac	processor 2, DIMM 6
fru@2,ae	processor 2, DIMM 7
fru@3,a0	processor 3, DIMM 0	Provides configuration information for processor 3 DIMMs
fru@3,a2	processor 3, DIMM 1
fru@3,a4	processor 3, DIMM 2
fru@3,a6	processor 3, DIMM 3
fru@3,a8	processor 3, DIMM 4
fru@3,aa	processor 3, DIMM 5
fru@3,ac	processor 3, DIMM 6
fru@3,ae	processor 3, DIMM 7
fru@4,a0	CPU/Mem board, slot A	Provides configuration information for the CPU/Memory board in slot A
fru@4,a2	CPU/Mem Board, slot B	Provides configuration information for the CPU/Memory board in slot B
nvram@4,a4	PCI riser	Provides system configuration information (IDPROM)
fru@4,a8	Centerplane	Provides centerplane configuration information
fru@4,aa	PCI riser	Provides PCI riser board configuration information
fru@5,10	Centerplane	Provides communication and control for I2C subsystem
fru@5,14	RSC card	Provides communication and control for the RSC card
temperature@5,30	CPU/Mem board A	Monitors processor 0 temperature
temperature@5,32	CPU/Mem board B	Monitors processor 1 temperature
temperature@5,34	CPU/Mem board A	Monitors processor 2 temperature
temperature@5,52	CPU/Mem board B	Monitors processor 3 temperature
ioexp@5,44	FC-AL disk backplane	Monitors drive status/LED control
ioexp@5,46	FC-AL disk backplane	Monitors Loop B control
ioexp@5,4c	Power distribution board	Monitors power distribution board status
ioexp@5,70	Power Supply 0	Monitors Power Supply 0 status
ioexp@5,72	Power Supply 1	Monitors Power Supply 1 status
ioexp@5,80	Centerplane	Monitors I/O port expander
ioexp@5,82	PCI riser	Monitors I/O port expander
temperature@5,98	Reserved	Reserved for thermal monitoring
temperature-sensor@5,9c	FC-AL disk backplane	Monitors ambient temperature at disk backplane
fru@5,a0	Power Supply 0	Provides configuration information for Power Supply 0
fru@5,a2	Power Supply 1	Provides configuration information for Power Supply 1
fru@5,a6	SC card	Provides SC card configuration information
fru@5,a8	FC-AL disk backplane	Provides disk backplane configuration information
fru@5,ae	Power distribution board	Provides configuration information for the power distribution board and the enclosure
fru@5,d0	SC card	Monitors SC card's real-time clock

Reference for Terms in Diagnostic Output

The status and error messages displayed by POST diagnostics and OpenBoot Diagnostics tests occasionally include acronyms or abbreviations for hardware sub-components. TABLE 6-13 is included to assist you in decoding this terminology and associating the terms with specific FRUs, where appropriate.

TABLE 6-13 Abbreviations or Acronyms in Diagnostic Output
Term	Description	Associated FRU(s)
ADC	Analog-to-Digital Converter	PCI riser board
APC	Advanced Power Control - A function provided by the SuperIO integrated circuit	PCI riser board
BBC	Boot Bus Controller - Interface between the processors and components on many other buses	Centerplane
CDX	Data Crossbar - Part of the system bus	Centerplane
CRC	Cyclic Redundancy Check	N/A
DAR	Address Repeater - Part of the system bus	Centerplane
DCDS	Dual Data Switch - Part of the system bus	CPU/Memory board
DMA	Direct Memory Access - In diagnostic output, usually refers to a controller on a PCI card	PCI card
EBus	A byte-wide bus for low-speed devices	Centerplane, PCI riser board
HBA	Host Bus Adapter	Centerplane, various others
I2C	Inter-Integrated Circuit (also written as I2C) - A bidirectional, two-wire serial data bus. Used mainly for environmental monitoring and control	Various. Refer to TABLE 6-12.
I/O Board	PCI Riser	PCI riser
JTAG	Joint Test Access Group - An IEEE subcommittee standard (1149.1) for scanning system components	N/A
MAC	Media Access Controller - Hardware address of a device connected to a network	Centerplane
MII	Media Independent Interface - Part of Ethernet controller	Centerplane
Motherboard	Centerplane	Centerplane
NVRAM	IDPROM	IDPROM, located on PCI riser board
OBP	Refers to OpenBoot firmware	N/A
PDB	Power Distribution Board	Power distribution board
PMC	Power Management Controller	PCI riser board
POST	Power-On Self-Test	N/A
RIO	Multifunction integrated circuit bridging the PCI bus with EBus and USB	PCI riser board
RTC	Real-Time Clock	PCI riser board
RX	Receive - Communication protocol	Centerplane
Safari	The system interconnect architecture--that is, the data and address buses	CPU/Memory board, centerplane
Schizo	System bus to PCI bridge integrated circuit	Centerplane
Scan	A means for monitoring and altering the content of ASICs and system components, as provided for in the IEEE 1149.1 standard	N/A
SIO	SuperIO integrated circuit - Controls the SC UART port and more	PCI riser
TX	Transmit - Communication protocol	Centerplane
UART	Universal Asynchronous Receiver Transmitter - Serial port hardware	Centerplane, PCI riser board, SC card