CHAPTER 2

Diagnostics and the Boot Process

This chapter introduces the tools that let you accomplish the goals of isolating faults and monitoring and exercising systems. It also helps you to understand how the various tools fit together.

Topics in this chapter include:

If you only want instructions for using diagnostic tools, skip this chapter and turn to:

You may also find it helpful to turn to the Netra 440 Server System Administration Guide for information about the system console.


Diagnostics and the Boot Process

You have probably had the experience of powering on a Sun system and watching as it goes through its boot process. Perhaps you have watched as your console displays messages that look like the following.

0>@(#) Netra[TM] 440 POST 4.10.0 2003/04/01 22:28 
 
/export/work/staff/firmware_re/post/post-build
4.10.0/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1 2 3
0>OBP->POST Call with %o0=00000000.01008000.
0>Diag level set to MAX.
0>MFG scrpt mode set to NONE 
0>I/O port set to TTYA.
0>
0>Start selftest...

It turns out these messages are not quite so inscrutable as they first appear once you understand the boot process. These kinds of messages are discussed later.

It is possible to bypass firmware-based diagnostic tests in order to minimize how long it takes a server to reboot. However, in the following discussion, assume that the system is attempting to boot in diagnostics mode, during which the firmware-based tests run. See Putting the System in Diagnostics Mode for instructions.

The boot process requires several stages, detailed in these sections:

System Controller Boot

As soon as you connect the Netra 440 server to an electrical outlet, and before you turn on power to the server, the system controller inside the server begins its self-diagnostic and boot cycle. The system controller is incorporated into the Sun™ Advanced Lights Out Manager (ALOM) card installed in the Netra 440 server chassis. Running off standby power, the card begins functioning before the server itself comes up.

The system controller provides access to a number of control and monitoring functions through the ALOM command-line interface. For more information about ALOM, see Monitoring the System Using Advanced Lights Out Manager.

OpenBoot Firmware and POST

Every Netra 440 server includes a chip holding about 2 Mbyte of firmware-based code. This chip is called the boot PROM. After you turn on system power, the first thing the system does is execute code that resides in the boot PROM.

This code, which is referred to as the OpenBoot™ firmware, is a small-scale operating system unto itself. However, unlike a traditional operating system that can run multiple applications for multiple simultaneous users, OpenBoot firmware runs in single-user mode and is designed solely to configure and boot the system. OpenBoot firmware also initiates firmware-based diagnostics that test the system, thereby ensuring that the hardware is sufficiently "healthy" to run its normal operating environment.

When system power is turned on, the OpenBoot firmware begins running directly out of the boot PROM, since at this stage system memory has not been verified to work properly.

Soon after power is turned on, the system hardware determines that at least one CPU is powered on, and is submitting a bus access request, which indicates that the CPU in question is at least partly functional. This becomes the master CPU, and is responsible for executing OpenBoot firmware instructions.

The OpenBoot firmware's first action is to check whether to run the power-on self-test (POST) diagnostics and other tests. The POST diagnostics constitute a separate chunk of code stored in a different area of the boot PROM (see FIGURE 2-1).

  FIGURE 2-1 Boot PROM and SCC

This figure shows schematically the relationship between the Netra 440 system's major firmware components.

The extent of these power-on self-tests, and whether they are performed at all, is controlled by configuration variables stored in the removable system configuration card (SCC). These OpenBoot configuration variables are discussed in Controlling POST Diagnostics.

As soon as POST diagnostics can verify that some subset of system memory is functional, tests are loaded into system memory.

Purpose of POST Diagnostics

The POST diagnostics verify the core functionality of the system. A successful execution of the POST diagnostics does not ensure that there is nothing wrong with the server, but it does ensure that the server can proceed to the next stage of the boot process.

For a Netra 440 server, this means:

It is possible for a system to pass all POST diagnostics and still be unable to boot the operating system. However, you can run POST diagnostics even when a system fails to boot, and these tests are likely to disclose the source of most hardware problems.

POST generally reports errors that are persistent in nature. To catch intermittent problems, consider running a system exercising tool. See Exercising the System.

What POST Diagnostics Do

Each POST diagnostic is a low-level test designed to pinpoint faults in a specific hardware component. For example, individual memory tests called address bitwalk and data bitwalk ensure that binary 0s and 1s can be written on each address and data line. During such a test, the POST may display output similar to this example.

1>Data Bitwalk on Slave 3
1>     Test Bank 0.

In this example, CPU 1 is the master CPU, as indicated by the prompt 1>, and it is about to test the memory associated with CPU 3, as indicated by the message Slave 3.

The failure of such a test reveals precise information about particular integrated circuits, the memory registers inside them, or the data paths connecting them.

1>ERROR: TEST = Data Bitwalk on Slave 3
1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3
1>Repair Instructions: Replace items in order listed by 'H/W under test' above
1>MSG = ERROR: miscompare on mem test!
               Address: 00000030.001b0040
               Expected: ffffffff.fffffffe
               Observed: fffffbff.fffffff6

In this case, the DIMM labeled J0602, associated with CPU 3, was found to be faulty. For information about the several ways firmware messages identify memory, see Identifying Memory Modules.
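
To see exactly which data bits disagreed in a miscompare, you can XOR the Expected and Observed values. The following Python sketch does this for the values in the preceding error message. It assumes that the hhhhhhhh.llllllll notation represents a single 64-bit value written as two 32-bit hexadecimal halves; the function name is illustrative and is not part of any Sun tool.

```python
def differing_bits(expected: str, observed: str) -> list[int]:
    """Return bit positions (0 = least significant) where two POST values differ.

    Assumes the 'hhhhhhhh.llllllll' notation is one 64-bit value written
    as two 32-bit hexadecimal halves.
    """
    exp = int(expected.replace(".", ""), 16)
    obs = int(observed.replace(".", ""), 16)
    diff = exp ^ obs  # XOR leaves a 1 wherever the two values disagree
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Values taken from the error message above.
print(differing_bits("ffffffff.fffffffe", "fffffbff.fffffff6"))  # [3, 42]
```

Mapping these bit positions to physical pins or data lines requires hardware documentation beyond the scope of this sketch.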

What POST Error Messages Tell You

When a specific power-on self-test discloses an error, it reports the following kinds of information about the error:

Here is an excerpt of POST output showing another error message.

CODE EXAMPLE 2-1 POST Error Message
1>ERROR: TEST = IO-Bridge unit 0 PCI id    test 
1>H/W under test = Motherboard IO-Bridge 0, CPU
1>Repair Instructions: Replace items in order listed by 'H/W under test' above
1>MSG = ERROR: PCI Master Abort Detected for 
    TOMATILLO:0, PCI BUS: A, DEVICE NUMBER:2. 
    DEVICE NAME: SCSI
1>END_ERROR
 
1>
1>ERROR: TEST = IO-Bridge unit 0 PCI id    test 
1>H/W under test = Motherboard IO-Bridge 0, CPU
1>MSG = 
        *** Test Failed!! ***
 
1>END_ERROR

Identifying FRUs

An important feature of POST error messages is that the H/W under test line (the second line in CODE EXAMPLE 2-1) indicates which FRU or FRUs may be responsible for the error. Note that in CODE EXAMPLE 2-1, two different FRUs are indicated. Using TABLE 2-13 to decode some of the terms, you can see that this POST error was most likely caused by bad integrated circuits (IO-Bridge) or electrical pathways on the motherboard. However, the error message also indicates that the master CPU, in this case CPU 1, may be at fault. For information on how Netra 440 CPUs are numbered, see Identifying CPU/Memory Modules.

Though beyond the scope of this manual, it is worth noting that POST error messages provide fault isolation capability beyond the FRU level. In the current example, the MSG line located immediately below the H/W under test line specifies the particular integrated circuit (DEVICE NAME: SCSI) most likely at fault. This level of isolation is most useful at the repair depot.
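
If you capture console output to a file, a short script can pull the candidate FRUs out of each POST error. The following Python sketch parses the first error block from CODE EXAMPLE 2-1. The function and field names are illustrative, and the sketch assumes that FRUs on the H/W under test line are separated by commas, as they are in the examples in this chapter.

```python
import re

SAMPLE = """\
1>ERROR: TEST = IO-Bridge unit 0 PCI id    test
1>H/W under test = Motherboard IO-Bridge 0, CPU
1>Repair Instructions: Replace items in order listed by 'H/W under test' above
1>MSG = ERROR: PCI Master Abort Detected for
    TOMATILLO:0, PCI BUS: A, DEVICE NUMBER:2.
    DEVICE NAME: SCSI
1>END_ERROR
"""


def parse_post_error(text: str) -> dict:
    """Extract the reporting CPU, test name, and suspect FRUs from one POST error."""
    result = {"cpu": None, "test": None, "frus": []}
    for line in text.splitlines():
        match = re.match(r"(\d+)>(.*)", line)
        if not match:
            continue  # continuation lines of MSG carry no prompt prefix
        cpu, body = int(match.group(1)), match.group(2).strip()
        if body.startswith("ERROR: TEST ="):
            result["cpu"] = cpu
            # Collapse the runs of spaces that appear inside the test name.
            result["test"] = " ".join(body.split("=", 1)[1].split())
        elif body.startswith("H/W under test ="):
            # FRUs are listed in suggested replacement order.
            result["frus"] = [f.strip() for f in body.split("=", 1)[1].split(",")]
    return result


print(parse_post_error(SAMPLE))
```

The frus list preserves the order given in the message, which matches the repair instruction to replace items in the order listed.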

Why a POST Error Might Implicate Multiple FRUs

Because each test operates at such a low level, the POST diagnostics are often more definite in reporting the minute details of the error, like the numerical values of expected and observed results, than they are about reporting which FRU is responsible. If this seems counterintuitive, consider the block diagram of one data path within a Netra 440 server, shown in FIGURE 2-2.

  FIGURE 2-2 POST Diagnostic Running Across FRUs

This figure is a block diagram showing bus connections between a CPU, an I/O bridge, and a PCI device.

The dashed line in FIGURE 2-2 represents a boundary between FRUs. Suppose a POST diagnostic is running in the CPU in the left part of the diagram. This diagnostic attempts to access registers in a PCI device located in the right side of the diagram.

If this access fails, there could be a fault in the PCI device, or, less likely, in one of the data paths or components leading to that PCI device. The POST diagnostic can tell you only that the test failed, but not why. So, though the POST diagnostic may present very precise data about the nature of the test failure, potentially several different FRUs could be implicated.

Controlling POST Diagnostics

You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables in the system configuration card. Changes to OpenBoot configuration variables generally take effect only after the server is reset.

TABLE 2-1 lists the most important and useful of these variables, which are more fully documented in the OpenBoot Command Reference Manual. You can find instructions for changing OpenBoot configuration variables in Viewing and Setting OpenBoot Configuration Variables.

TABLE 2-1 OpenBoot Configuration Variables

OpenBoot Configuration Variable

Description and Keywords

auto-boot?

Determines whether the operating system automatically starts up. Default is true.

  • true--Operating system automatically starts once OpenBoot firmware completes initialization.
  • false--System remains at ok prompt until you type boot.

diag-level

Determines the level or type of diagnostics executed. Default is .

  • off--No testing.
  • min--Only basic tests are run.
  • max--More extensive tests may be run, depending on the device. Memory is especially thoroughly checked.

diag-script

Determines which devices are tested by OpenBoot Diagnostics. Default is none.

  • none--No devices are tested.
  • normal--On-board (motherboard-based) devices that have self-tests are tested.
  • all--All devices that have self-tests are tested.

diag-switch?

 

  • true--POST diagnostics and OpenBoot Diagnostics tests run if post-trigger and obdiag-trigger conditions, respectively, are satisfied. Causes system to boot using diag-device and diag-file parameters.
  • false--Diagnostics do not run, even if post-trigger and obdiag-trigger conditions are satisfied. Causes system to boot using boot-device and boot-file parameters.

NOTE: You can put the system in diagnostics mode either by setting this variable to true or by setting the system control rotary switch to the Diagnostics position. For details, see Putting the System in Diagnostics Mode.

post-trigger

obdiag-trigger

Specifies the class of reset event that causes POST diagnostics or OpenBoot Diagnostics tests to run. These variables can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, see Viewing and Setting OpenBoot Configuration Variables.

  • error-reset--A reset caused by certain nonrecoverable hardware error conditions. In general, an error reset occurs when a hardware problem corrupts system state data and the machine becomes "confused." Examples include CPU and system watchdog resets, fatal errors, and certain CPU reset events (default).
  • power-on-reset--A reset caused by pressing the Power button (default).
  • user-reset--A reset initiated by the user or the operating system. Examples of user resets include the OpenBoot boot and reset-all commands, as well as the Solaris reboot command.
  • all-resets--Any kind of system reset.
  • none--No POST diagnostics or OpenBoot Diagnostics tests run.

input-device

Selects where system console input is taken from. Default is ttya.

  • ttya--From serial and network management ports.
  • ttyb--From built-in serial port B.*
  • keyboard--From attached keyboard that is part of a local graphics monitor.*

output-device

Selects where diagnostic and other system console output is displayed. Default is ttya.

  • ttya--To serial and network management ports.
  • ttyb--To built-in serial port B.*
  • screen--To attached screen that is part of a local graphics monitor.*



Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
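
The way diag-switch?, diag-level, and the trigger variables combine can be easier to follow as a small decision function. The following Python sketch is a simplified model of TABLE 2-1, not the firmware's actual logic. It assumes that diagnostics mode (diag-switch? set to true, or the rotary switch in the Diagnostics position) is a precondition, that a diag-level of off suppresses testing, and that all-resets and none behave as their descriptions suggest. The function name and parameters are hypothetical.

```python
RESET_CLASSES = {"error-reset", "power-on-reset", "user-reset"}


def diagnostics_run(diag_switch: bool, trigger: str, reset_event: str,
                    diag_level: str = "max") -> bool:
    """Simplified model of whether a reset event triggers firmware diagnostics.

    trigger is the value of post-trigger or obdiag-trigger, for example
    "power-on-reset error-reset". This illustrates the table above and is
    not the firmware's actual decision logic.
    """
    if reset_event not in RESET_CLASSES:
        raise ValueError(f"unknown reset class: {reset_event}")
    if not diag_switch or diag_level == "off":
        return False
    keywords = set(trigger.split())
    if "none" in keywords:
        return False
    return "all-resets" in keywords or reset_event in keywords


# The two keywords marked (default) in the table.
default_trigger = "power-on-reset error-reset"
print(diagnostics_run(True, default_trigger, "power-on-reset"))  # True
print(diagnostics_run(True, default_trigger, "user-reset"))      # False
print(diagnostics_run(True, "all-resets", "user-reset"))         # True
print(diagnostics_run(False, "all-resets", "power-on-reset"))    # False
```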



Diagnostics: Reliability versus Availability

The OpenBoot configuration variables described in TABLE 2-1 let you control not only how diagnostic tests proceed, but also what triggers them.

Bypassing diagnostic tests can create a situation where a server with faulty hardware gets locked into a cycle of repeated booting and crashing. Depending on the type of problem, the cycle may repeat intermittently. Because diagnostic tests are never invoked, the crashes may occur without leaving behind any log entries or meaningful console messages.

The section Putting the System in Diagnostics Mode provides instructions for ensuring that your server runs diagnostics when starting up. The section Bypassing Firmware Diagnostics explains how to disable firmware diagnostics.

Temporarily Bypassing Diagnostics

Even if you set up the server to run diagnostic tests automatically on reboot, it is still possible to bypass diagnostic tests for a single boot cycle. This can be useful in cases where you are reconfiguring the server, or on those rare occasions when POST or OpenBoot Diagnostics tests themselves stall or "hang," leaving the server unable to boot and in an unusable state. These "hangs" most commonly result from firmware corruption of some sort, especially from having flashed an incompatible firmware image into the server's PROMs.

If you do find yourself needing to skip diagnostic tests for a single boot cycle, the ALOM system controller provides a convenient way to do this. See Bypassing Diagnostics Temporarily for instructions.

Maximizing Reliability

By default, diagnostics do not run following a user- or operating system-initiated reset. This means the system does not run diagnostics in the event of an operating system panic. To ensure the maximum reliability, especially for automatic system recovery (ASR), you can configure the system to run its firmware-based diagnostic tests following all resets. For instructions, see Maximizing Diagnostic Testing.

OpenBoot Diagnostics Tests

Once POST diagnostics have finished running, POST marks the status of any faulty device as "FAILED," and returns control to OpenBoot firmware.

OpenBoot firmware compiles a hierarchical "census" of all devices in the system. This census is called a device tree. Though different for every system configuration, the device tree generally includes both built-in system components and optional PCI bus devices. The device tree does not include any components marked as "FAILED" by POST diagnostics.

Following the successful execution of POST diagnostics, the OpenBoot firmware proceeds to run OpenBoot Diagnostics tests. Like the POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the boot PROM.

Purpose of OpenBoot Diagnostics Tests

OpenBoot Diagnostics tests focus on system I/O and peripheral devices. Any device in the device tree, regardless of manufacturer, that includes an IEEE 1275-compatible self-test is included in the suite of OpenBoot Diagnostics tests. On a Netra 440 server, OpenBoot Diagnostics examine the following system components:

The OpenBoot Diagnostics tests run automatically through a script when you start up the system in diagnostics mode. However, you can also run OpenBoot Diagnostics tests manually, as explained in the next section.

Like POST diagnostics, OpenBoot Diagnostics tests catch persistent errors. To disclose intermittent problems, consider running a system exercising tool. See Exercising the System.

Controlling OpenBoot Diagnostics Tests

When you restart the system, you can run OpenBoot Diagnostics tests either interactively from a test menu, or by entering commands directly from the ok prompt.



Note - You cannot reliably run OpenBoot Diagnostics tests following an operating system halt, since the halt leaves system memory in an unpredictable state. Best practice is to reset the system before running these tests.



Most of the same OpenBoot configuration variables you use to control POST (see TABLE 2-1) also affect OpenBoot Diagnostics tests. Notably, you can determine OpenBoot Diagnostics testing level--or suppress testing entirely--by appropriately setting the diag-level variable.

In addition, the OpenBoot Diagnostics tests use a special variable called test-args that enables you to customize how the tests operate. By default, test-args is set to contain an empty string. However, you can set test-args to one or more of the reserved keywords, each of which has a different effect on OpenBoot Diagnostics tests. TABLE 2-2 lists the available keywords.

TABLE 2-2 Keywords for the test-args OpenBoot Configuration Variable

Keyword

What It Does

bist

Invokes built-in self-test (BIST) on external and peripheral devices

debug

Displays all debug messages

iopath

Verifies bus and interconnect integrity

loopback

Exercises external loopback path for the device

media

Verifies external and peripheral device media accessibility

restore

Attempts to restore original state of the device if the previous execution of the test failed

silent

Displays only errors rather than the status of each test

subtests

Displays main test and each subtest that is called

verbose

Displays detailed messages of status of all tests

callers=N

Displays backtrace of N callers when an error occurs

  • callers=0 -- Displays backtrace of all callers before the error

errors=N

Continues executing the test until N errors are encountered

  • errors=0 -- Displays all error reports without terminating testing

If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:

ok setenv test-args debug,loopback,media
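
If you generate OpenBoot commands from a script, it can help to check keywords against TABLE 2-2 before sending them to the console. The following Python sketch builds the comma-separated value for test-args and rejects anything not listed in the table. The helper name is hypothetical, and the sketch does not check whether a given combination of keywords is meaningful for a particular device.

```python
FLAG_KEYWORDS = {"bist", "debug", "iopath", "loopback", "media",
                 "restore", "silent", "subtests", "verbose"}
COUNT_KEYWORDS = {"callers", "errors"}  # these take the form keyword=N


def build_test_args(*keywords: str) -> str:
    """Validate keywords against TABLE 2-2 and join them with commas."""
    for keyword in keywords:
        name, sep, value = keyword.partition("=")
        if sep:
            if name not in COUNT_KEYWORDS or not value.isdigit():
                raise ValueError(f"bad keyword: {keyword!r}")
        elif name not in FLAG_KEYWORDS:
            raise ValueError(f"bad keyword: {keyword!r}")
    return ",".join(keywords)


print("setenv test-args " + build_test_args("debug", "loopback", "media"))
# setenv test-args debug,loopback,media
print("setenv test-args " + build_test_args("verbose", "errors=0"))
# setenv test-args verbose,errors=0
```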

From the OpenBoot Diagnostics Test Menu

It is easiest to run OpenBoot Diagnostics tests interactively from a menu. You access the menu by typing obdiag at the ok prompt. See Isolating Faults Using Interactive OpenBoot Diagnostics Tests for full instructions.

The obdiag> prompt and the OpenBoot Diagnostics interactive menu (FIGURE 2-3) appear. Only the devices detected by OpenBoot firmware appear in this menu. For a brief explanation of each OpenBoot Diagnostics test, see TABLE 2-10 in OpenBoot Diagnostics Test Descriptions.

  FIGURE 2-3 OpenBoot Diagnostics Interactive Test Menu

This figure shows the selections of the OpenBoot Diagnostics interactive test menu.

Interactive OpenBoot Diagnostics Commands

You run individual OpenBoot Diagnostics tests from the obdiag> prompt by typing:

obdiag> test n

where n represents the number associated with a particular menu item.



Note - You cannot reliably run OpenBoot Diagnostics commands following an operating system halt, since the halt leaves system memory in an unpredictable state. Best practice is to reset the system before running these commands.



There are several other commands available to you from the obdiag> prompt. For descriptions of these commands, see TABLE 2-11 in OpenBoot Diagnostics Test Descriptions.

You can obtain a summary of this same information by typing help at the obdiag> prompt.

From the ok Prompt: The test and test-all Commands

You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:

ok test /pci@1c,600000/scsi@2,1



Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Netra 440 server. If you lack this knowledge, it may help to use the OpenBoot show-devs command (see show-devs Command), which displays a list of all configured devices.



To customize an individual test, you can use test-args as follows:

ok test /pci@1e,600000/usb@b:test-args={verbose,subtests}

This affects only the current test without changing the value of the test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:

ok test-all

If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the bus at /pci@1f,700000 and all devices with self-tests that are connected to that bus:

ok test-all /pci@1f,700000



Note - You cannot reliably run OpenBoot Diagnostics commands following an operating system halt, since the halt leaves system memory in an unpredictable state. Best practice is to reset the system before running these commands.



What OpenBoot Diagnostics Error Messages Tell You

OpenBoot Diagnostics error messages are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 2-2 displays a sample OpenBoot Diagnostics error message, one that suggests a failure of the IDE controller.

CODE EXAMPLE 2-2 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/ide@d
 
    ERROR   : IDE device did not reset, busy bit not set
    DEVICE  : /pci@1e,600000/ide@d
    MACHINE : Netra 440
    SERIAL# : 51994289
    DATE    : 10/17/2002 20:17:43  GMT
    CONTR0LS: diag-level=min test-args=
 
Error: /pci@1e,600000/ide@d selftest failed, return code = 1
Selftest at /pci@1e,600000/ide@d (errors=1) ........................... failed
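
Because the error report uses a regular "FIELD : value" layout, it is straightforward to turn into a lookup table for logging or comparison. The following Python sketch shows one way to do this, using a sample based on CODE EXAMPLE 2-2. The regular expression and function name are illustrative; the sketch assumes that field names are upper case and indented, as in the sample.

```python
import re

SAMPLE = """\
Testing /pci@1e,600000/ide@d

    ERROR   : IDE device did not reset, busy bit not set
    DEVICE  : /pci@1e,600000/ide@d
    MACHINE : Netra 440
    SERIAL# : 51994289
    DATE    : 10/17/2002 20:17:43  GMT
    CONTR0LS: diag-level=min test-args=

Error: /pci@1e,600000/ide@d selftest failed, return code = 1
"""

# An indented upper-case field name, a colon, then the value.
FIELD = re.compile(r"^\s+([A-Z0-9#]+)\s*:\s*(.*?)\s*$")


def parse_obdiag_error(text: str) -> dict:
    """Collect the upper-case 'FIELD : value' lines of one error report."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


report = parse_obdiag_error(SAMPLE)
print(report["DEVICE"])    # /pci@1e,600000/ide@d
# The key is spelled with a zero, exactly as it appears in the sample output.
print(report["CONTR0LS"])  # diag-level=min test-args=
```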

I2C Bus Device Tests

The i2c@0,320 OpenBoot Diagnostics test examines and reports on environmental monitoring and control devices connected to the Netra 440 server's Inter-Integrated Circuit (I2C) bus.

Error and status messages from the i2c@0,320 OpenBoot Diagnostics test include the hardware addresses of I2C bus devices.

Testing /pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b6

The I2C device address is given at the very end of the hardware path. In this example, the address is 0,b6, which indicates a device located at hexadecimal address b6 on segment 0 of the I2C bus.

To decode this device address, see Decoding I2C Diagnostic Test Messages. Using TABLE 2-12, you can see that dimm-spd@0,b6 corresponds to DIMM 0 on CPU/memory module 0. If the i2c@0,320 test were to report an error against dimm-spd@0,b6, you would need to replace this DIMM.
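
The first step of that decoding, splitting the device name, segment, and address out of the hardware path, can be sketched in Python as follows. The function name is illustrative. The sketch assumes the segment number is hexadecimal like the address, and it does not reproduce the lookup in TABLE 2-12.

```python
def decode_i2c_node(path: str) -> tuple[str, int, int]:
    """Split the last path component into (device name, segment, address)."""
    node = path.rstrip("/").rsplit("/", 1)[-1]  # for example "dimm-spd@0,b6"
    name, _, unit = node.partition("@")
    segment, _, address = unit.partition(",")
    # Assumption: both numbers are hexadecimal, like the b6 address.
    return name, int(segment, 16), int(address, 16)


name, segment, address = decode_i2c_node(
    "/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b6")
print(name, segment, hex(address))  # dimm-spd 0 0xb6
```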

Other OpenBoot Commands

Beyond the formal firmware-based diagnostic tools, there are a few commands you can invoke from the ok prompt. These OpenBoot commands display information that can help you assess the condition of a Netra 440 server. These include the following:

The following sections describe the information these commands give you. For instructions on using these commands, turn to Using OpenBoot Information Commands, or look up the appropriate man page.

printenv Command

The printenv command displays the OpenBoot configuration variables. The display includes the current values for these variables as well as the default values. For details, see Viewing and Setting OpenBoot Configuration Variables.

For a list of some important OpenBoot configuration variables, see TABLE 2-1.

probe-scsi and probe-scsi-all Commands

The probe-scsi and probe-scsi-all commands diagnose problems with attached and internal SCSI devices.



caution icon

Caution - If you used the halt command or the L1-A (Stop-A) key sequence to reach the ok prompt, then issuing the probe-scsi or probe-scsi-all command can hang the system.



The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.

For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its target and unit numbers, and a device description that includes type and manufacturer.

The following is sample output from the probe-scsi command.

CODE EXAMPLE 2-3 probe-scsi Command Output
ok probe-scsi
Target 0 
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB
Target 1 
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB

The following is sample output from the probe-scsi-all command.

CODE EXAMPLE 2-4 probe-scsi-all Command Output
ok probe-scsi-all
/pci@1f,700000/scsi@2,1
 
/pci@1f,700000/scsi@2
Target 0 
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB
Target 1 
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB
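
When you need to compare probe output across several servers, you can parse the Target and Unit lines into records. The following Python sketch handles output shaped like CODE EXAMPLE 2-3. The regular expression is an assumption based on that sample only, and other device types may print different fields. The capacity check assumes 512-byte blocks.

```python
import re

SAMPLE = """\
Target 0
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB
Target 1
  Unit 0   Disk     FUJITSU MAN3367M SUN36G 1502    71132959 Blocks, 34732 MB
"""

UNIT = re.compile(r"^\s+Unit (\d+)\s+(\S+)\s+(.*?)\s+(\d+) Blocks, (\d+) MB\s*$")


def parse_probe_scsi(text: str) -> list[dict]:
    """Return one record per 'Unit' line, tagged with its enclosing target."""
    devices, target = [], None
    for line in text.splitlines():
        if line.startswith("Target "):
            target = int(line.split()[1])
        elif (match := UNIT.match(line)):
            unit, kind, desc, blocks, mb = match.groups()
            devices.append({"target": target, "unit": int(unit), "type": kind,
                            "description": desc, "blocks": int(blocks),
                            "mb": int(mb)})
    return devices


for dev in parse_probe_scsi(SAMPLE):
    # Assuming 512-byte blocks, 2048 blocks make one megabyte.
    print(dev["target"], dev["unit"], dev["blocks"] // 2048)
```

For both disks in the sample, the computed value matches the 34732 MB that the firmware reports.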

probe-ide Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD-ROM drive.



caution icon

Caution - If you used the halt command or the L1-A (Stop-A) key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.



The following is sample output from the probe-ide command.

CODE EXAMPLE 2-5 probe-ide Command Output
ok probe-ide
  Device 0  ( Primary Master ) 
         Removable ATAPI Model: TOSHIBA DVD-ROM SD-C2512                
 
  Device 1  ( Primary Slave ) 
         Not Present

show-devs Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 2-6 shows some sample output (edited for brevity).

CODE EXAMPLE 2-6 show-devs Command Output
ok show-devs
/i2c@1f,464000
/pci@1f,700000
/ppm@1e,0
/pci@1e,600000
/pci@1d,700000
/ppm@1c,0
/pci@1c,600000
/memory-controller@2,0
/SUNW,UltraSPARC-IIIi@2,0
/virtual-memory
/memory@m0,10
/aliases
/options
/openprom
/packages
/i2c@1f,464000/idprom@0,50

Operating System

If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating environment. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have recourse to software-based diagnostic tools, like SunVTS and Sun™ Management Center software. These tools can help you with more advanced monitoring, exercising, and fault isolating capabilities.



Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating environment does not boot following completion of the firmware-based tests.



In addition to the formal tools that run on top of Solaris OS software, there are other resources that you can use when assessing or monitoring the condition of a Netra 440 server. These resources include the following:

Error and System Message Log Files

Error and other system messages are saved in the file /var/adm/messages. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.

In the case of Solaris OS software, the syslogd daemon and its configuration file (/etc/syslog.conf) control how error messages are handled.

For information about /var/adm/messages and other sources of system information, refer to "How to Customize System Message Logging" in the System Administration Guide: Advanced Administration, which is part of the Solaris System Administration Collection.

Solaris System Information Commands

Some Solaris commands display data that you can use when assessing the condition of a Netra 440 server. These commands include the following:

The following sections describe the information these commands give you. For instructions on using these commands, turn to Using Solaris System Information Commands, or look up the appropriate man page.

prtconf Command

The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating environment software "knows" about. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 2-7 shows an excerpt of prtconf output (edited for brevity).

CODE EXAMPLE 2-7 prtconf Command Output
System Configuration:  Sun Microsystems  sun4u
Memory size: 16384 Megabytes
System Peripherals (Software Nodes):
 
SUNW,Netra-440
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
 
[...]
 
    pci, instance #1
        isa, instance #0
            flashprom (driver not attached)
            rtc (driver not attached)
            i2c, instance #0
                i2c-bridge (driver not attached)
                i2c-bridge (driver not attached)
                temperature (driver not attached)
 
[...]
 

The prtconf command's -p option produces output similar to the OpenBoot show-devs command (see show-devs Command). This output lists only those devices compiled by the system firmware.

prtdiag Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components.

The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following are several excerpts of the output produced by prtdiag on a "healthy" Netra 440 server running Solaris 8 software.

CODE EXAMPLE 2-8 prtdiag CPU and I/O Output
System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 183 MHZ
Memory size: 16GB       
 
==================================== CPUs ====================================
               E$          CPU                  CPU
CPU  Freq      Size        Implementation       Mask    Status      Location
---  --------  ----------  -------------------  -----   ------      --------
  0  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -      
  1  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -      
  2  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -      
  3  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -      
 
================================= IO Devices =================================
Bus   Freq      Slot +  Name +
Type  MHz       Status  Path                          Model
----  ----  ----------  ----------------------------  --------------------
pci    66           MB  pci108e,abba (network)        SUNW,pci-ce        
                  okay  /pci@1c,600000/network@2
 
pci    33           MB  isa/su (serial)                                  
                  okay  /pci@1e,600000/isa@7/serial@0,3f8
 
pci    33           MB  isa/su (serial)                                  
                  okay  /pci@1e,600000/isa@7/serial@0,2e8
 
pci    66           MB  pci108e,abba (network)        SUNW,pci-ce        
                  okay  /pci@1f,700000/network@1
 
pci    66           MB  scsi-pci1000,30 (scsi-2)      LSI,1030           
                  okay  /pci@1f,700000/scsi@2
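
A monitoring script can scan the CPU table for any processor whose status is not online. The following Python sketch parses rows shaped like those in CODE EXAMPLE 2-8. Because the prtdiag format varies between Solaris releases, treat the regular expression as an assumption tied to this sample; the function name is illustrative.

```python
import re

SAMPLE = """\
CPU  Freq      Size        Implementation       Mask    Status      Location
---  --------  ----------  -------------------  -----   ------      --------
  0  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -
  1  1281 MHz  1MB         SUNW,UltraSPARC-IIIi  2.3    online       -
"""

# CPU ID, frequency, then three columns to skip before the Status column.
CPU_ROW = re.compile(r"^\s*(\d+)\s+(\d+) MHz\s+\S+\s+\S+\s+\S+\s+(\S+)")


def cpu_status(text: str) -> dict[int, str]:
    """Map each CPU ID to the word in its Status column."""
    rows = (CPU_ROW.match(line) for line in text.splitlines())
    return {int(m.group(1)): m.group(3) for m in rows if m}


status = cpu_status(SAMPLE)
print(status)  # {0: 'online', 1: 'online'}
print([cpu for cpu, state in status.items() if state != "online"])  # []
```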

The prtdiag command produces a great deal of output about the system memory configuration. Another excerpt follows.

CODE EXAMPLE 2-9 prtdiag Memory Configuration Output
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                4GB               16          BankIDs 0,1,2,3, ... ,15
0x1000000000       4GB               16          BankIDs 16,17,18, ... ,31
0x2000000000       4GB               16          BankIDs 32,33,34, ... ,47
0x3000000000       4GB               2           BankIDs 48,49
 
Bank Table:
-----------------------------------------------------------
        Physical       Location
ID      ControllerID   GroupID   Size       Interleave Way
-----------------------------------------------------------
0        0             0         256MB      0,1,2,3, ... ,15
1        0             0         256MB  
  
[...]
 
48       3             0         2GB        0,1
49       3             0         2GB             
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels         Status
--------------------------------------------------
0              0        C0/P0/B0/D0    
0              0        C0/P0/B0/D1    
 
[...]
 
3              0        C3/P0/B0/D1    
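The segment and bank figures in CODE EXAMPLE 2-9 can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the values are transcribed by hand from the Segment Table, the names SEGMENTS and bank_size_mb are hypothetical, and the even split of a segment across its banks is an assumption based on the interleave factors shown.

```python
# Segment Table rows from CODE EXAMPLE 2-9, transcribed by hand:
# (base address, segment size in GB, number of banks in the segment).
SEGMENTS = [
    (0x0, 4, 16),
    (0x1000000000, 4, 16),
    (0x2000000000, 4, 16),
    (0x3000000000, 4, 2),
]

def bank_size_mb(segment_gb, bank_count):
    """Size of each bank in MB, assuming the segment is split evenly."""
    return segment_gb * 1024 // bank_count

# Total installed memory is the sum of all segment sizes.
total_gb = sum(size for _, size, _ in SEGMENTS)

# Per-bank size for each segment, in MB.
per_bank_mb = [bank_size_mb(size, banks) for _, size, banks in SEGMENTS]

print(total_gb)      # 16
print(per_bank_mb)   # [256, 256, 256, 2048]
```

The total agrees with the Memory size line at the top of the prtdiag output, and the per-bank sizes agree with the 256MB and 2GB entries in the Bank Table.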

In addition to the preceding information, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.

CODE EXAMPLE 2-10 prtdiag Verbose Output
Temperature sensors:
---------------------------------------------------------------
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
SCSIBP     T_AMB         26C     -11C    0C   65C   75C okay
C0/P0      T_CORE        55C     -10C    0C   97C  102C okay

In the event of an overtemperature condition, prtdiag reports warning or failed in the Status column.

CODE EXAMPLE 2-11 prtdiag Overtemperature Indication Output
Temperature sensors:
---------------------------------------------------------------
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
SCSIBP     T_AMB         26C     -11C    0C   65C   75C okay
C0/P0      T_CORE        99C     -10C    0C   97C  102C failed
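If you capture prtdiag -v output to a file, a short script can pick out the sensors that are not reporting okay. The following Python sketch assumes the eight-column layout shown in CODE EXAMPLE 2-11; the function name parse_temperature_rows is hypothetical, and the script reads the Status column as printed rather than recomputing it from the thresholds.

```python
def parse_temperature_rows(text):
    """Return one dict per sensor row of a prtdiag temperature table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A sensor row has 8 fields and a reading that ends in "C";
        # this skips the header and the dashed rule.
        if len(fields) != 8 or not fields[2].endswith("C"):
            continue
        location, sensor, temp, _lo, _lo_warn, hi_warn, hi, status = fields
        rows.append({
            "location": location,
            "sensor": sensor,
            "temp_c": int(temp.rstrip("C")),
            "hi_warn_c": int(hi_warn.rstrip("C")),
            "hi_c": int(hi.rstrip("C")),
            "status": status,
        })
    return rows

# Sample text copied from CODE EXAMPLE 2-11.
SAMPLE = """\
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
SCSIBP     T_AMB         26C     -11C    0C   65C   75C okay
C0/P0      T_CORE        99C     -10C    0C   97C  102C failed
"""

problems = [r for r in parse_temperature_rows(SAMPLE) if r["status"] != "okay"]
for row in problems:
    print(row["location"], row["sensor"], row["temp_c"], row["status"])
```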

Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.

CODE EXAMPLE 2-12 prtdiag Fault Indication Output
Fan Status:
---------------------------------------
Location       Sensor          Status          
---------------------------------------
FT1/F0         F0              failed (0 rpm)

Here is an example of how the prtdiag command displays the status of system LEDs.

CODE EXAMPLE 2-13 prtdiag LED Status Display
Led State:
--------------------------------------------------
Location   Led                   State       Color
--------------------------------------------------
MB         ACT                   on          green           
MB         SERVICE               on          amber           
MB         LOCATE                off         white           
PS0        POK                   off         green           
PS0        STBY                  off         green    

prtfru Command

The Netra 440 server maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 2-14 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.

CODE EXAMPLE 2-14 prtfru -l Command Output
/frutree
/frutree/chassis (fru)
/frutree/chassis/SYS?Label=SYS
/frutree/chassis/SYS?Label=SYS/led-location (fru)
/frutree/chassis/SYS?Label=SYS/key-location (fru)
/frutree/chassis/SYS?Label=SYS/key-location/SYSCTRL?Label=SYSCTRL
/frutree/chassis/SC?Label=SC
[...]
/frutree/chassis/HDD0?Label=HDD0
/frutree/chassis/HDD0?Label=HDD0/disk (fru)
/frutree/chassis/HDD1?Label=HDD1
/frutree/chassis/HDD1?Label=HDD1/disk (fru)
/frutree/chassis/HDD2?Label=HDD2
/frutree/chassis/HDD2?Label=HDD2/disk (fru)
/frutree/chassis/HDD3?Label=HDD3
/frutree/chassis/HDD3?Label=HDD3/disk (fru)
/frutree/chassis/DVD?Label=DVD
/frutree/chassis/DVD?Label=DVD/cdrom (fru)
/frutree/chassis/SCC?Label=SCC
/frutree/chassis/SCC?Label=SCC/scc (fru)
/frutree/chassis/ALARM?Label=ALARM
/frutree/chassis/ALARM?Label=ALARM/alarm (container)
[...]
/frutree/chassis/PDB?Label=PDB
/frutree/chassis/PDB?Label=PDB/pdb (container)
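The hierarchical list lends itself to simple text processing. As a rough sketch, the following Python function extracts the entries that prtfru -l marks as (fru) or (container), together with the nearest Label value in the path. The function name list_frus and the interpretation of the ?Label= component as a location label are assumptions made for illustration, based only on the excerpt in CODE EXAMPLE 2-14.

```python
def list_frus(text):
    """Return (location_label, name, kind) for each marked prtfru -l path."""
    frus = []
    for line in text.splitlines():
        line = line.strip()
        # Only lines ending in "(fru)" or "(container)" describe a FRU.
        if not line.endswith(")") or " (" not in line:
            continue
        path, kind = line.rsplit(" (", 1)
        kind = kind.rstrip(")")
        parts = path.split("/")
        # Collect every "?Label=" component; the last one is the closest.
        labels = [p.split("?Label=", 1)[1] for p in parts if "?Label=" in p]
        frus.append((labels[-1] if labels else None, parts[-1], kind))
    return frus

# A few lines copied from CODE EXAMPLE 2-14.
SAMPLE = """\
/frutree
/frutree/chassis (fru)
/frutree/chassis/HDD0?Label=HDD0
/frutree/chassis/HDD0?Label=HDD0/disk (fru)
[...]
/frutree/chassis/PDB?Label=PDB/pdb (container)
"""

for entry in list_frus(SAMPLE):
    print(entry)
```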

CODE EXAMPLE 2-15 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.

CODE EXAMPLE 2-15 prtfru -c Command Output
/frutree/chassis/SC?Label=SC/system-controller (container)
   SEGMENT: SD
      /ManR
      /ManR/UNIX_Timestamp32: Wed Dec 31 19:00:00 EST 1969
      /ManR/Fru_Description: ASSY,ALOM Card
      /ManR/Manufacture_Loc: 
      /ManR/Sun_Part_No: 5016346
      /ManR/Sun_Serial_No: 
      /ManR/Vendor_Name: NO JEDEC CODE FOR THIS VENDOR
      /ManR/Initial_HW_Dash_Level: 03
      /ManR/Initial_HW_Rev_Level: 
      /ManR/Fru_Shortname: ALOM_Card
      /SpecPartNo: 885-0084-05
/frutree/chassis/MB?Label=MB/system-board (container)
   SEGMENT: SD
      /ManR
      /ManR/UNIX_Timestamp32: Mon Nov  4 15:35:24 EST 2002
      /ManR/Fru_Description: ASSY,A42,MOTHERBOARD
      /ManR/Manufacture_Loc: Celestica,Toronto,Ontario
      /ManR/Sun_Part_No: 5016344
      /ManR/Sun_Serial_No: 000001
      /ManR/Vendor_Name: Celestica
      /ManR/Initial_HW_Dash_Level: 03
      /ManR/Initial_HW_Rev_Level: 06
      /ManR/Fru_Shortname: A42_MB
      /SpecPartNo: 885-0060-02
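For inventory purposes, it can be convenient to turn this output into key-value records. The following Python sketch groups the fields under the FRU path that precedes them. It assumes the layout shown in CODE EXAMPLE 2-15 (a /frutree path line followed by indented "name: value" lines); the function name parse_prtfru_records is hypothetical.

```python
def parse_prtfru_records(text):
    """Map each /frutree path to a dict of its SEEPROM fields."""
    records = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("/frutree"):
            # Drop the trailing " (container)" or " (fru)" marker.
            current = stripped.split(" (", 1)[0]
            records[current] = {}
        elif current is not None and stripped.startswith("/"):
            key, sep, value = stripped.partition(":")
            if sep:  # skip bare record names such as "/ManR"
                records[current][key] = value.strip()
    return records

# Lines copied from CODE EXAMPLE 2-15 (abbreviated).
SAMPLE = """\
/frutree/chassis/SC?Label=SC/system-controller (container)
   SEGMENT: SD
      /ManR
      /ManR/Fru_Description: ASSY,ALOM Card
      /ManR/Sun_Part_No: 5016346
      /ManR/Sun_Serial_No: 
/frutree/chassis/MB?Label=MB/system-board (container)
   SEGMENT: SD
      /ManR
      /ManR/UNIX_Timestamp32: Mon Nov  4 15:35:24 EST 2002
      /ManR/Sun_Part_No: 5016344
"""

records = parse_prtfru_records(SAMPLE)
for path, fields in records.items():
    print(path, fields.get("/ManR/Sun_Part_No"))
```

Empty fields, such as the blank serial number on the ALOM card, are kept as empty strings so that you can tell a blank field from a missing one.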

The prtfru command displays varied data depending on the type of FRU. In general, this information includes:

Information about the following Netra 440 server FRUs is displayed by the prtfru command:

Similar information is provided by the ALOM system controller showfru command. For more information about showfru and other ALOM commands, see Monitoring the System Using Sun Advanced Lights Out Manager.

psrinfo Command

The psrinfo command displays the date and time each CPU came online. With the verbose option (-v), the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.

CODE EXAMPLE 2-16 psrinfo -v Command Output
Status of processor 0 as of: 04/11/03 12:03:45
  Processor has been on-line since 04/11/03 10:53:03.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 04/11/03 12:03:45
  Processor has been on-line since 04/11/03 10:53:05.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
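The verbose output is regular enough to parse with two patterns. The following Python sketch collects the online time and clock speed for each processor. The patterns are derived only from the sample in CODE EXAMPLE 2-16, so treat them as assumptions; the function name parse_psrinfo is hypothetical.

```python
import re

def parse_psrinfo(text):
    """Return {cpu_id: {"online_since": str, "mhz": int}} from psrinfo -v text."""
    cpus = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"Status of processor (\d+) as of", line)
        if match:
            # Start a new record for this processor.
            current = cpus.setdefault(int(match.group(1)), {})
            continue
        if current is None:
            continue
        match = re.search(r"on-line since (.+)\.$", line)
        if match:
            current["online_since"] = match.group(1)
        match = re.search(r"operates at (\d+) MHz", line)
        if match:
            current["mhz"] = int(match.group(1))
    return cpus

# Sample text copied from CODE EXAMPLE 2-16.
SAMPLE = """\
Status of processor 0 as of: 04/11/03 12:03:45
  Processor has been on-line since 04/11/03 10:53:03.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 04/11/03 12:03:45
  Processor has been on-line since 04/11/03 10:53:05.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
"""

cpus = parse_psrinfo(SAMPLE)
for cpu_id, info in sorted(cpus.items()):
    print(cpu_id, info["mhz"], info["online_since"])
```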

showrev Command

The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 2-17 shows sample output of the showrev command.

CODE EXAMPLE 2-17 showrev Command Output
Hostname: wgs94-111
Hostid: 83195f01
Release: 5.8
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: Ecd.East.Sun.COM
Kernel version: SunOS 5.8 system28_11:12/03/02 2002
    SunOS Internal Development: root 12/03/02 [system28-gate]

When used with the -p option, this command displays installed patches. CODE EXAMPLE 2-18 shows a partial sample output from the showrev command with the -p option.

CODE EXAMPLE 2-18 showrev -p Command Output
Patch: 112663-01 Obsoletes:  Requires: 108652-44 Incompatibles:  Packages: SUNWxwplt
Patch: 111382-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWxwplt
Patch: 111626-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWolrte, SUNWolslb
Patch: 111741-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWxwmod, SUNWxwmox
Patch: 111844-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWxwopt
Patch: 112781-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWxwopt
Patch: 108714-07 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWdtbas, SUNWdtbax
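When you need to check whether a particular patch or one of its prerequisites is installed, it may help to parse this listing. The following Python sketch splits each line into its fields. The field order is taken from CODE EXAMPLE 2-18; the names PATCH_RE, split_list, and parse_patches are hypothetical.

```python
import re

# Field order as it appears in CODE EXAMPLE 2-18.
PATCH_RE = re.compile(
    r"Patch:\s*(?P<patch>\S+)\s+Obsoletes:(?P<obsoletes>.*?)"
    r"Requires:(?P<requires>.*?)Incompatibles:(?P<incompatibles>.*?)"
    r"Packages:(?P<packages>.*)"
)

def split_list(value):
    """Split a comma- or space-separated field into a list of items."""
    return [item for item in re.split(r"[,\s]+", value.strip()) if item]

def parse_patches(text):
    """Return one dict per patch line of showrev -p output."""
    patches = []
    for line in text.splitlines():
        match = PATCH_RE.match(line.strip())
        if not match:
            continue
        entry = {"patch": match.group("patch")}
        for key in ("obsoletes", "requires", "incompatibles", "packages"):
            entry[key] = split_list(match.group(key))
        patches.append(entry)
    return patches

# Two lines copied from CODE EXAMPLE 2-18.
SAMPLE = """\
Patch: 112663-01 Obsoletes:  Requires: 108652-44 Incompatibles:  Packages: SUNWxwplt
Patch: 111626-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWolrte, SUNWolslb
"""

patches = parse_patches(SAMPLE)
for entry in patches:
    print(entry["patch"], entry["requires"], entry["packages"])
```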

Tools and the Boot Process: A Summary

Different diagnostic tools are available to you at different stages of the boot process. TABLE 2-3 summarizes what tools are available to you and when they are available.

TABLE 2-3 Diagnostic Tool Availability

                                 Available Diagnostic Tools
Stage                            Fault Isolation         System Monitoring         System Exercising
-------------------------------  ----------------------  ------------------------  ---------------------------
Before the operating system      - LEDs                  - ALOM                    -none-
starts                           - POST                  - OpenBoot commands
                                 - OpenBoot Diagnostics

After the operating system       - LEDs                  - ALOM                    - SunVTS
starts                                                   - Solaris info commands   - Hardware Diagnostic Suite

When the system is turned off    -none-                  - ALOM                    -none-
but standby power is available



Isolating Faults in the System

Each of the tools available for fault isolation discloses faults in different field-replaceable units (FRUs). The row headings along the left of TABLE 2-4 list the FRUs in a Netra 440 server. The available diagnostic tools are shown in column headings across the top. A check mark in this table indicates that a fault in a particular FRU can be isolated by a particular diagnostic.

TABLE 2-4 FRU Coverage of Fault-Isolating Tools

FRU                               ALOM  LEDs         LEDs      OpenBoot  POST
                                        (Enclosure)  (On FRU)  Diags
--------------------------------  ----  -----------  --------  --------  ----
ALOM system controller card       ✓                  ✓         ✓
Connector board assembly          No coverage. See TABLE 2-5 for fault isolation hints.
CPU/memory module                 ✓     ✓                                ✓
DIMMs                                   ✓                                ✓
Hard drive                        ✓     ✓            ✓         ✓
DVD drive                                            ✓         ✓
Fan tray 3                        ✓     ✓
Fan trays 0-2                     ✓     ✓
Motherboard                       ✓     ✓                      ✓         ✓
Power supply                      ✓     ✓            ✓
SCSI backplane                    No coverage. See TABLE 2-5 for fault isolation hints.
System configuration card reader  No coverage. See TABLE 2-5 for fault isolation hints.
System configuration card         No coverage. See TABLE 2-5 for fault isolation hints.

In addition to the FRUs listed in TABLE 2-4, there are several minor replaceable system components--mostly cables--that cannot directly be isolated by any system diagnostic. For the most part, you determine when these components are faulty by eliminating other possibilities. Some of these FRUs are listed in TABLE 2-5, along with hints on how to discern problems with them.

TABLE 2-5 FRUs Not Directly Isolated by Fault-Isolating Tools

FRU

Diagnostic Hints

Connector board assembly

This is difficult to distinguish from other problems with similar symptoms. The firmware generates many error messages about being unable to access OpenBoot configuration variables, for example: Could not read diag-level from NVRAM! ALOM shows the front panel Service Required indicator is lit.

Connector board power cable

If ALOM is able to read the system rotary switch position, but reports that none of the fans are spinning, you should suspect that this cable is loose or defective.

DVD drive cable

If OpenBoot Diagnostics tests indicate a problem with the DVD drive, but replacing the drive does not fix the problem, you should suspect (primarily) that this cable is either defective or improperly connected, or (secondarily) that there is a problem with the motherboard.

SCSI backplane

Though not an exhaustive diagnostic, some SunVTS tests (i2c2test and disktest) exercise certain SCSI backplane paths. You can also monitor the backplane's ambient temperature using the ALOM system controller showenvironment command (see Monitoring the System Using Sun Advanced Lights Out Manager).

SCSI data cable

This is difficult to distinguish from problems with similar symptoms. The firmware generates many error messages about being unable to access OpenBoot configuration variables, for example: Could not read diag-level from NVRAM! ALOM shows the front panel Service Required indicator is lit.

System configuration card reader and system configuration card reader cable

If the system control rotary switch and On/Standby button appear unresponsive, and if the power supplies are known to be good, you should suspect the SCC reader and its cable. To test these components, access ALOM, issue the resetsc command, log in again to ALOM, and remove the system configuration card (SCC). If an alert message appears ("SCC card has been removed"), the card reader is functioning and the cable is intact.

System control rotary switch cable

If the system control rotary switch appears unresponsive (ALOM cannot read rotary switch position), but the Power button works and the system stays powered on, you should suspect either that this cable is loose or defective, or (less likely) that there is a problem with the system configuration card reader.




Note - Most replacement cables for the Netra 440 server are available only as part of a cable kit, Sun part number F595-7286.




Monitoring the System

Sun provides the Sun Advanced Lights Out Manager (ALOM) tool, which can give you advance warning of difficulties and help you prevent downtime.

This monitoring tool lets you specify system criteria that bear watching. For instance, you can enable alerts for system events (such as excessive temperatures, power supply or fan failures, system resets), and be notified if those events occur. Warnings can be reported by icons in the software's graphical user interface, or you can be notified by email whenever a problem occurs.

Monitoring the System Using Advanced Lights Out Manager

Advanced Lights Out Manager (ALOM) enables you to monitor and control your server over a serial port or a network interface. The ALOM system controller provides a command-line interface that enables you to administer the server from remote locations. This may be especially useful when servers are geographically distributed or physically inaccessible.

ALOM also lets you remotely access the system console and run diagnostics (like POST) that would otherwise require physical proximity to the server's serial port. ALOM can send email notification of hardware failures or other server events.

The ALOM system controller runs independently, and uses standby power from the server. Therefore, ALOM firmware and software continue to be effective when the server operating system goes offline, or when power to the server itself is turned off.

TABLE 2-6 lists the items that ALOM enables you to monitor on the Netra 440 server.

TABLE 2-6 What ALOM Monitors

Item Monitored           What ALOM Reveals                                     Command to Type
-----------------------  ----------------------------------------------------  ---------------
Hard drives              Whether each slot has a drive present, and whether    showenvironment
                         the drive reports OK status

Fan trays                Fan speed and whether the fan trays report OK status  showenvironment

CPU/memory modules       The presence of a CPU/memory module and the           showenvironment
                         temperature measured at each CPU, as well as any
                         thermal warning

Operating system status  Whether the operating system is running, stopped,     showplatform
                         initializing, or in some other state

Power supplies           Whether each bay has a power supply present, and      showenvironment
                         whether the power supply reports OK status

System temperature       Ambient and CPU core temperatures as measured at      showenvironment
                         several locations in the system, as well as any
                         thermal warning

Server front panel       System control rotary switch position and status of   showenvironment
                         LEDs

User sessions            Which users are logged in to ALOM, and through which  showusers
                         connections


For instructions on using ALOM to monitor a Netra 440 system, see Monitoring the System Using Sun Advanced Lights Out Manager.


Exercising the System

It is relatively easy to detect when a system component fails outright. However, when a system has an intermittent problem or seems to be "behaving strangely," a software tool that stresses or exercises the computer's many subsystems can help disclose the source of the emerging problem and prevent long periods of reduced functionality or system downtime.

Sun provides two tools for exercising Netra 440 servers: SunVTS software and Hardware Diagnostic Suite.

TABLE 2-7 shows the FRUs that each system exercising tool is capable of isolating. Note that individual tools do not necessarily test all the components or paths of a particular FRU.

TABLE 2-7 FRU Coverage of System-Exercising Tools

FRU                               SunVTS  Hardware Diagnostic Suite
--------------------------------  ------  -------------------------
ALOM system controller card       ✓
Connector board assembly          No coverage. See TABLE 2-5 for fault isolation hints.
CPU/memory module                 ✓       ✓
DIMMs                             ✓       ✓
Hard drive                        ✓       ✓
DVD drive                         ✓
Fan tray 3                        No coverage. See TABLE 2-8 for fault isolation hints.
Fan trays 0-2                     No coverage. See TABLE 2-8 for fault isolation hints.
Motherboard                       ✓       ✓
Power supply                      ✓
SCSI backplane                    ✓
System configuration card reader  No coverage. See TABLE 2-5 for fault isolation hints.
System configuration card         ✓


Some FRUs are not isolated by any system-exercising tool. TABLE 2-8 lists them, along with diagnostic hints.

TABLE 2-8 FRUs Not Directly Isolated by System-Exercising Tools

FRU

Diagnostic Hints

Connector board assembly

See TABLE 2-5.

DVD drive cable

See TABLE 2-5.

Fan tray 3

If this FRU fails, ALOM issues an alert message:
SC Alert: PCI_FAN @ FT0 Failed.

Fan trays 0-2

If this FRU fails, ALOM issues an alert message:
SC Alert: CPU_FAN @ FT1 Failed.

SCSI data cable

See TABLE 2-5.

Connector board power cable

See TABLE 2-5.


Exercising the System Using SunVTS Software

The SunVTS software validation test suite performs system and subsystem stress testing. You can view and control a SunVTS session over a network: from a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.

You can run SunVTS software in five different test modes:

Since SunVTS software can run many tests in parallel and can consume many system resources, you should take care when using it on a production system. If you are stress-testing a system using SunVTS software's Comprehensive test mode, you should not run anything else on that system at the same time.

The Netra 440 server to be tested must be up and running if you want to use SunVTS software, since it relies on the Solaris OS. Since SunVTS software packages are optional, they may not be installed on your system. Turn to Checking Whether SunVTS Software Is Installed for instructions.

It is important to use the most up-to-date version of SunVTS available, to ensure that you have the latest suite of tests. You can download the most recent SunVTS software from http://www.sun.com/oem/products/vts/.

For instructions on running SunVTS software to exercise the Netra 440 server, see Exercising the System Using SunVTS Software. For more information about the product, refer to:

These documents are available on the Solaris Supplement CD and on the Web at: http://www.sun.com/documentation. You should also consult the SunVTS README file located at /opt/SUNWvts/. This document provides late-breaking information about the installed version of the product.

SunVTS Software and Security

During SunVTS software installation, you must choose between Basic or Sun Enterprise Authentication Mechanism (SEAM) security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. SEAM security is based on Kerberos--the standard network authentication protocol--and provides secure user authentication, data integrity, and privacy for transactions over networks.

If your site uses SEAM security, you must have the SEAM client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use SEAM security, do not choose the SEAM option during SunVTS software installation.

If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you chose, you may find yourself unable to run SunVTS tests. For more information, refer to the SunVTS User's Guide and the instructions accompanying the SEAM software.


Identifying Memory Modules

System firmware, including POST, has multiple ways of referring to memory. In most cases, such as when running tests or displaying configuration information, firmware refers to memory "banks." These are logical and not physical banks (see CODE EXAMPLE 2-19).

CODE EXAMPLE 2-19 POST Reference to Logical Memory Banks
0>Memory interleave set to 0
0>    Bank 0  512MB : 00000000.00000000 -> 00000000.20000000.
0>    Bank 1  512MB : 00000001.00000000 -> 00000001.20000000.
0>    Bank 2  512MB : 00000002.00000000 -> 00000002.20000000.
0>    Bank 3  512MB : 00000003.00000000 -> 00000003.20000000.
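The address ranges in this output can be checked against the reported bank sizes. The following Python sketch assumes that each address is a 64-bit value printed as two 32-bit hexadecimal halves separated by a period (upper half first); that reading is an assumption, not something the POST output states. The function name parse_dotted_hex is hypothetical.

```python
def parse_dotted_hex(value):
    """Convert 'HHHHHHHH.LLLLLLLL' (a trailing '.' is allowed) to an integer."""
    high, low = value.rstrip(".").split(".")
    # Assumption: the first group is the upper 32 bits, the second the lower.
    return (int(high, 16) << 32) | int(low, 16)

# Bank 1 line from CODE EXAMPLE 2-19.
start = parse_dotted_hex("00000001.00000000")
end = parse_dotted_hex("00000001.20000000.")

size_mb = (end - start) // (1024 * 1024)
print(hex(start), size_mb)  # 0x100000000 512
```

Under that assumption, each range spans 0x20000000 bytes, which matches the 512MB size that POST prints for every bank.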

However, in POST error output (see CODE EXAMPLE 2-20), the firmware provides a memory slot identifier (B0/D1 J0602). Note that B0/D1 identifies the memory slot and is visible on the circuit board when the DIMM is installed. The label J0602 also identifies the memory slot, but is not visible unless you remove the DIMM from the slot.

CODE EXAMPLE 2-20 POST Reference to Physical ID and Logical Bank

1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3
 

Adding to the potential confusion, when configuring system memory, you must also contend with the separate notion of physical memory banks: DIMMs must be installed as pairs of the same capacity and type within each physical bank.

The following sections clarify how memory is identified.

Physical Identifiers

Each CPU/memory module's circuit board contains silk-screened labels that uniquely identify every DIMM on that board. Each label is in this form:

Bx/Dy

Where x indicates the physical bank, and y the DIMM number within the bank.

In addition, a "J" number silk-screened on the circuit board uniquely identifies each DIMM slot. However, this slot number is not readily visible unless the DIMM is removed from the slot.

If you run POST and it finds a memory error, the error message will include the physical ID of the failed DIMM and the "J" number of the failed DIMM's slot, making it easy to determine which parts you need to replace.



Note - To ensure compatibility and maximize system uptime, you should replace DIMMs in pairs. Treat both DIMMs in a physical bank as one FRU.



Logical Banks

Logical banks reflect the system's internal memory architecture and not the architecture of the system's field-replaceable units. In the Netra 440 server, each logical bank spans two physical DIMMs. Since firmware-generated status messages refer only to logical banks, it is not possible to use these status messages to isolate a memory problem to a single failed DIMM. POST error messages, on the other hand, specify failures to the FRU level.



Note - To isolate faults in the memory subsystem, run POST diagnostics.



Correspondence Between Logical and Physical Banks

TABLE 2-9 shows the logical-to-physical memory bank mapping for the Netra 440 server.

TABLE 2-9 Logical and Physical Memory Banks in a Netra 440 Server

Logical Bank                   Physical Identifiers         Physical Bank
(As Given in Firmware Output)  (As Shown on Circuit Board)
-----------------------------  ---------------------------  -------------
Bank 0 and Bank 1              B0/D0 and B0/D1              Bank 0
Bank 2 and Bank 3              B1/D0 and B1/D1              Bank 1

FIGURE 2-4 depicts the same mapping graphically.

  FIGURE 2-4 How Logical Memory Banks Map to DIMMs

This figure shows that logical memory banks cross the boundaries of physical memory modules, and specifies how the logical and physical banks are related.
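The mapping in TABLE 2-9 is regular: two consecutive logical banks share one physical bank, and each physical bank holds two DIMMs. A minimal Python sketch of that lookup follows; the function name dimms_for_logical_bank is hypothetical, and the integer-division rule is simply a compact restatement of the table, valid only for logical banks 0 through 3.

```python
def dimms_for_logical_bank(logical_bank):
    """Return (physical_bank, [DIMM labels]) for a logical bank 0-3."""
    if logical_bank not in (0, 1, 2, 3):
        raise ValueError("TABLE 2-9 covers logical banks 0 through 3 only")
    # Logical banks 0 and 1 map to physical bank 0; 2 and 3 map to bank 1.
    physical_bank = logical_bank // 2
    labels = [f"B{physical_bank}/D{dimm}" for dimm in (0, 1)]
    return physical_bank, labels

for bank in range(4):
    print(bank, dimms_for_logical_bank(bank))
```

Because each logical bank spans both DIMMs of a physical bank, the lookup returns a pair of labels and cannot narrow a fault to a single DIMM.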

Identifying CPU/Memory Modules

Since each CPU/memory module has its own set of DIMMs, you need to determine the CPU/memory module in which a faulty DIMM resides. This information is given in the POST error message:

1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3

In this example, the cited module is CPU Module C3.

The processors are numbered according to the slot in which they are installed, and these slots are numbered 0 to 3, left to right, as you look down on the Netra 440 server's chassis from the front (see FIGURE 2-5).

  FIGURE 2-5 CPU/Memory Module Numbering

This figure calls out the location of CPU slots in the Netra 440 server chassis.

For example, if a Netra 440 server has only two CPU/memory modules installed, and if those are located in the leftmost and rightmost slots, then the firmware will refer to the two system processors as CPU 0 and CPU 3.

The failed DIMM called out by the previous POST error message, then, resides in the rightmost CPU/memory module (C3), and is labeled B0/D1 on that module's circuit board.
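If you collect POST output in a log, the fields of this message can be extracted with a regular expression. The following Python sketch is based only on the single sample message shown in this section, so other POST messages may not match it; the names HW_UNDER_TEST_RE and parse_hw_under_test are hypothetical.

```python
import re

# Pattern derived from the one sample message in this section.
HW_UNDER_TEST_RE = re.compile(
    r"H/W under test = CPU(?P<cpu>\d+) (?P<dimm>B\d+/D\d+) (?P<slot>J\d+) "
    r"side (?P<side>\d+) \(Bank (?P<bank>\d+)\), CPU Module (?P<module>C\d+)"
)

def parse_hw_under_test(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = HW_UNDER_TEST_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("cpu", "side", "bank"):
        info[key] = int(info[key])
    return info

message = "1>H/W under test = CPU3 B0/D1 J0602 side 1 (Bank 1), CPU Module C3"
info = parse_hw_under_test(message)
print(info["module"], info["dimm"], info["slot"], info["bank"])
```

For the sample message, the result identifies module C3, the DIMM labeled B0/D1, slot J0602, and logical bank 1.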


OpenBoot Diagnostics Test Descriptions

This section describes the OpenBoot Diagnostics tests and commands available to you. For background information about these tests, see OpenBoot Diagnostics Tests.

TABLE 2-10 OpenBoot Diagnostics Menu Tests

Test Name       What It Does                                              FRU(s) Tested
--------------  --------------------------------------------------------  ------------------------------
flashprom@2,0   Performs a checksum test on the boot PROM.                Motherboard

i2c@0,320       Tests the I2C environmental monitoring subsystem, which   Motherboard, power supplies,
                includes various temperature and other sensors located    SCSI disks, CPU/memory modules
                on the motherboard and on other FRUs.

ide@d           Tests the on-board IDE controller and IDE bus subsystem   Motherboard, DVD-ROM drive
                that controls the DVD-ROM drive.

network@1       Tests the on-board Ethernet controller, running internal  Motherboard
                loopback tests. Can also run external loopback tests,
                but only if you install a loopback connector (not
                provided).

network@2       Same as network@1, for the other on-board Ethernet        Motherboard
                controller.

rmc-comm@0,3e8  Tests communication with the ALOM system controller, and  ALOM card
                requests that ALOM diagnostics run.

rtc@0,70        Tests the registers of the real-time clock and verifies   Motherboard
                that it is running.

scsi@2          Tests internal SCSI hard drives.                          Motherboard, SCSI backplane,
                                                                          SCSI disks

scsi@2,1        Tests any external SCSI hard drives attached.             Motherboard, SCSI cable,
                                                                          SCSI disks

serial@0,3f8    Tests all possible baud rates supported by the ttya and   Motherboard
serial@0,2e8    ttyb serial lines. Performs internal and external
                loopback tests on each line at each speed.

usb@a           Tests the writable registers of the USB open host         Motherboard
usb@b           controller.


TABLE 2-11 describes the commands you can type from the obdiag> prompt.

TABLE 2-11 OpenBoot Diagnostics Test Menu Commands

Command                 Description
----------------------  ------------------------------------------------------------------------
exit                    Exits OpenBoot Diagnostics tests and returns to the ok prompt.

help                    Displays a brief description of each OpenBoot Diagnostics command and
                        OpenBoot configuration variable.

set-default variable    Restores the default value of an OpenBoot configuration variable.

setenv variable value   Sets the value for an OpenBoot configuration variable (also available
                        from the ok prompt).

test-all                Tests all devices displayed in the OpenBoot Diagnostics test menu (also
                        available from the ok prompt).

test #                  Tests only the device identified by the menu entry number. (A similar
                        function is available from the ok prompt. See From the ok Prompt: The
                        test and test-all Commands.)

test #,#                Tests only the devices identified by the menu entry numbers.

except #,#              Tests all devices in the OpenBoot Diagnostics test menu except those
                        identified by the menu entry numbers.

what #,#                Displays selected properties of the devices identified by the menu entry
                        numbers. The information provided varies according to device type.



Decoding I2C Diagnostic Test Messages

TABLE 2-12 describes each I2C device in a Netra 440 server, and helps you associate each I2C address with the proper FRU. For more information about I2C tests, see I2C Bus Device Tests.

TABLE 2-12 I2C Bus Devices in a Netra 440 Server

Address                     Associated FRU               What the Device Does
--------------------------  ---------------------------  -----------------------------------------------------------------
alarm-fru-prom@0,ac         Dry Contact Alarm            Dry Contact Alarm Board FRUID
clock-generator@0,d2        Motherboard                  Controls PCI bus clock
cpu-fru-prom@0,be           CPU 0                        Contains FRU configuration information
cpu-fru-prom@0,ce           CPU 1                        Contains FRU configuration information
cpu-fru-prom@0,de           CPU 2                        Contains FRU configuration information
cpu-fru-prom@0,ee           CPU 3                        Contains FRU configuration information
dimm-spd@0,b6               CPU/memory module 0, DIMM 0  Contains FRU configuration information
dimm-spd@0,b8               CPU/memory module 0, DIMM 1  Contains FRU configuration information
dimm-spd@0,ba               CPU/memory module 0, DIMM 2  Contains FRU configuration information
dimm-spd@0,bc               CPU/memory module 0, DIMM 3  Contains FRU configuration information
dimm-spd@0,c6               CPU/memory module 1, DIMM 0  Contains FRU configuration information
dimm-spd@0,c8               CPU/memory module 1, DIMM 1  Contains FRU configuration information
dimm-spd@0,ca               CPU/memory module 1, DIMM 2  Contains FRU configuration information
dimm-spd@0,cc               CPU/memory module 1, DIMM 3  Contains FRU configuration information
dimm-spd@0,d6               CPU/memory module 2, DIMM 0  Contains FRU configuration information
dimm-spd@0,d8               CPU/memory module 2, DIMM 1  Contains FRU configuration information
dimm-spd@0,da               CPU/memory module 2, DIMM 2  Contains FRU configuration information
dimm-spd@0,dc               CPU/memory module 2, DIMM 3  Contains FRU configuration information
dimm-spd@0,e6               CPU/memory module 3, DIMM 0  Contains FRU configuration information
dimm-spd@0,e8               CPU/memory module 3, DIMM 1  Contains FRU configuration information
dimm-spd@0,ea               CPU/memory module 3, DIMM 2  Contains FRU configuration information
dimm-spd@0,ec               CPU/memory module 3, DIMM 3  Contains FRU configuration information
gpio@0,38                   Power supply 0               PSU0 Status/Control REG
gpio@0,3a                   Power supply 1               PSU1 Status/Control REG
gpio@0,3c                   Power Distribution Board     PSU0_1 Status/Control REG
gpio@0,42                   SCSI backplane               Indicates rotary switch status and drives Activity LEDs
gpio@0,44                   Motherboard                  Indicates power supply and CPU status
gpio@0,46                   SCSI backplane               Indicates disk status and drives fault and Ok-to-Remove indicators
gpio@0,48                   Motherboard                  Drives system LEDs and CPU overtemperature indication
gpio@0,e0                   Power supply 2               PSU2 Status/Control REG
gpio@0,e2                   Power supply 3               PSU3 Status/Control REG
gpio@0,e4                   Power Distribution Board     PSU2_3 Status/Control REG
hardware-monitor@0,5c       Motherboard                  Monitors temperatures, voltages, and fan speeds
i2c-bridge@0,16             Motherboard                  Translates I2C bus addresses and isolates bus devices
i2c-bridge@0,18             Motherboard                  Translates I2C bus addresses and isolates bus devices
motherboard-fru-prom@0,a2   Motherboard                  Contains FRU configuration information
pdb-fru-prom@0,7c           Power Distribution Board     PDB FRUID
power-supply-fru-prom@0,70  Power supply 2               PSU2 FRUID
power-supply-fru-prom@0,72  Power supply 3               PSU3 FRUID
power-supply-fru-prom@0,a4  Power supply                 Contains FRU configuration information
power-supply-fru-prom@0,c0  Power supply 0               PSU0 FRUID
power-supply-fru-prom@0,c2  Power supply 1               PSU1 FRUID
rmc-fru-prom@0,a6           ALOM card                    Contains FRU configuration information
scsi-fru-prom@0,a8          SCSI backplane               Contains FRU configuration information
temperature-sensor@0,9c     SCSI backplane               Senses system ambient temperature
temperature@0,30            CPU 0                        Senses CPU die temperature
temperature@0,64            CPU 1                        Senses CPU die temperature
temperature@0,80            CPU 2                        Senses CPU die temperature
temperature@0,90            CPU 3                        Senses CPU die temperature
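The dimm-spd and cpu-fru-prom addresses in TABLE 2-12 follow a regular pattern: the address rises by 0x10 for each CPU/memory module and, for DIMMs, by 2 for each DIMM within a module. The following Python sketch reproduces those table entries. The pattern is an observation drawn from the table, not a documented rule, and the function names are hypothetical.

```python
def dimm_spd_address(module, dimm):
    """Return the dimm-spd device address listed in TABLE 2-12."""
    if module not in range(4) or dimm not in range(4):
        raise ValueError("module and dimm must each be 0 through 3")
    # Observed pattern: base 0xb6, +0x10 per module, +2 per DIMM.
    return f"dimm-spd@0,{0xB6 + 0x10 * module + 2 * dimm:x}"

def cpu_fru_prom_address(cpu):
    """Return the cpu-fru-prom device address listed in TABLE 2-12."""
    if cpu not in range(4):
        raise ValueError("cpu must be 0 through 3")
    # Observed pattern: base 0xbe, +0x10 per CPU.
    return f"cpu-fru-prom@0,{0xBE + 0x10 * cpu:x}"

print(dimm_spd_address(0, 0))    # dimm-spd@0,b6
print(dimm_spd_address(2, 1))    # dimm-spd@0,d8
print(cpu_fru_prom_address(3))   # cpu-fru-prom@0,ee
```

Going the other way, an address such as dimm-spd@0,ca reported by an I2C test corresponds to CPU/memory module 1, DIMM 2 in the table.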



Terms in Diagnostic Output

The status and error messages displayed by POST diagnostics and OpenBoot Diagnostics tests occasionally include acronyms or abbreviations for hardware subcomponents. TABLE 2-13 is included to assist you in decoding this terminology and associating the terms with specific FRUs, where appropriate.

TABLE 2-13 Abbreviations or Acronyms in Diagnostic Output

Term         Description                                               Associated FRU(s)
-----------  --------------------------------------------------------  ---------------------------
ADC          Analog-to-Digital Converter                               Motherboard

APC          Advanced Power Control - A function provided by the       Motherboard
             Southbridge integrated circuit

Bell         A repeater circuit element that forms part of the system  Motherboard
             bus

CRC          Cyclic Redundancy Check                                   Not applicable

DMA          Direct Memory Access - In diagnostic output, usually      PCI card
             refers to a controller on a PCI card

HBA          Host Bus Adapter                                          Motherboard, various others

I2C          Inter-Integrated Circuit (also written as I²C) - A        Various, see TABLE 2-12
             bidirectional, two-wire serial data bus. Used mainly for
             environmental monitoring and control

IO-Bridge    System bus to PCI bridge integrated circuit (same as      Motherboard
             "Tomatillo")

JBus         The system interconnect architecture--that is, the data   Motherboard
             and address buses

JTAG         Joint Test Action Group - An IEEE subcommittee standard   Not applicable
             (1149.1) for scanning system components

MAC          Media Access Controller - Hardware address of a device    Motherboard
             connected to a network

MII          Media Independent Interface - Part of the Ethernet        Motherboard
             controller

NVRAM        Refers to the system configuration card (SCC)             System configuration card

OBP          Refers to OpenBoot firmware                               Not applicable

PHY          Physical Interface - Part of the Ethernet control         Motherboard
             circuit

POST         Power-On Self-Test                                        Not applicable

RTC          Real-Time Clock                                           Motherboard

RX           Receive - Communication protocol                          Motherboard

Scan         A means for monitoring and altering the content of ASICs  Not applicable
             and system components, as provided for in the IEEE
             1149.1 standard

Southbridge  Integrated circuit that controls the ALOM UART port and   Motherboard
             more

Tomatillo    System bus to PCI bridge integrated circuit               Motherboard

TX           Transmit - Communication protocol                         Motherboard

UART         Universal Asynchronous Receiver Transmitter - Serial      Motherboard, ALOM card
             port hardware

UIE          Update-ended Interrupt Enable - A function provided by    Motherboard
             the real-time clock

XBus         A byte-wide bus for low-speed devices                     Motherboard


 


1 POST messages cannot be displayed on a local graphics monitor. They are sent to ttya even when output-device is set to screen. Likewise, POST can accept input only from ttya.