CHAPTER 6

Diagnostics

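This chapter describes the diagnostic tools available for the Sun Fire V210 and V240 servers. The chapter contains the following sections:

  • Section 6.1, Overview of Diagnostic Tools
  • Section 6.2, Status Indicators
  • Section 6.3, Sun Advanced Lights Out Manager
  • Section 6.4, POST Diagnostics
  • Section 6.5, OpenBoot Diagnostics
  • Section 6.6, OpenBoot Commands
  • Section 6.7, Operating System Diagnostic Tools
  • Section 6.8, Recent Diagnostic Test Results
  • Section 6.9, OpenBoot Configuration Variables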


6.1 Overview of Diagnostic Tools

Sun provides a range of diagnostic tools for use with the Sun Fire V210 and V240 servers.

These diagnostic tools are summarized in TABLE 6-1.


TABLE 6-1 Summary of Diagnostic Tools

| Diagnostic Tool | Type | What It Does | Accessibility and Availability | Remote Capability |
| --- | --- | --- | --- | --- |
| LEDs | Hardware | Indicate status of overall system and particular components. | Accessed from system chassis. Available anytime power is available. | Local, but can be viewed via ALOM |
| ALOM | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access. | Can function on standby power and without operating system. | Designed for remote access |
| POST | Firmware | Tests core components of system. | Runs automatically on startup. Available when the operating system is not running. | Local, but can be viewed via ALOM |
| OpenBoot Diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices. | Runs automatically or interactively. Available when the operating system is not running. | Local, but can be viewed via ALOM |
| OpenBoot commands | Firmware | Display various kinds of system information. | Available when the operating system is not running. | Local, but can be accessed via ALOM |
| Solaris commands | Software | Display various kinds of system information. | Requires operating system. | Local, but can be accessed via ALOM |
| SunVTS | Software | Exercises and stresses the system, running tests in parallel. | Requires operating system functionality. Optional package may need to be installed. | View and control over network |
| Sun Management Center | Software | Monitors both hardware environmental conditions and software performance of multiple machines. Generates alerts for various conditions. | Requires operating system to be running on both monitored and master servers. Requires a dedicated database on the master server. | Designed for remote access |
| Hardware Diagnostic Suite | Software | Exercises an operational system by running sequential tests. Also reports failed FRUs. | Separately purchased optional add-on to Sun Management Center. Requires operating system and Sun Management Center. | Designed for remote access |




6.2 Status Indicators

For a summary of the server's LED status indicators, see Section 1.2.1, Server Status Indicators.


6.3 Sun Advanced Lights Out Manager

Both the Sun Fire V210 server and the Sun Fire V240 server are shipped with Sun™ Advanced Lights Out Manager (ALOM) pre-installed.

ALOM enables you to monitor and control your server through either a serial connection (using the SERIAL MGT port) or an Ethernet connection (using the NET MGT port).

ALOM can send email notification of hardware failures or other server events.

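The ALOM circuitry uses standby power from the server. This means that ALOM can function while the server is on standby power and when the operating system is not running.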

See TABLE 3-1 for a list of the components monitored by ALOM and the information it provides for each.



Tip - For additional information, see the Advanced Lights Out Management User's Guide (817-5481).




6.4 POST Diagnostics

POST is a firmware program that is useful in determining if a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module or modules, motherboard, memory, and some on-board I/O devices. POST generates messages that can be useful in determining the nature of a hardware failure. POST can be run even if the system is unable to boot.

POST detects most system faults and is located in the motherboard OpenBoot™ PROM. The OpenBoot firmware can be set to run POST at power-up by setting two environment variables, diag-switch? and diag-level, which are stored on the system configuration card.

POST runs automatically when the system power is applied and all of the following conditions apply:

POST also runs automatically when the system is reset and all of the following conditions apply:

If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively.

If diag-level is set to menus, a menu of all the tests executed at power up is displayed.

POST diagnostic and error message reports are displayed on a console.

6.4.1 To Start POST Diagnostics--Method 1

There are two methods for starting POST diagnostics. Both are described in the following procedures.

1. Go to the ok prompt.

2. Type:


ok setenv diag-switch? true

3. Type:


ok setenv diag-level value

Where value is either min or max depending on the desired range of coverage.

4. Power cycle the server.

After you have powered the server off, wait 60 seconds before powering the server on. POST executes after the server is powered on.



Note - Status and error messages could be displayed in the console window. If POST detects an error, it displays an error message describing the failure.



5. When you have finished running POST, restore the value of diag-switch? to false by typing:


ok setenv diag-switch? false

Resetting diag-switch? to false minimizes boot time.

6.4.2 To Start POST Diagnostics--Method 2

1. Go to the ok prompt.

2. Type:


ok setenv diag-switch? false

3. Type:


ok setenv diag-level value

Where value is either min or max depending on the desired range of coverage.

4. Type:


ok setenv diag-trigger user-reset

5. Type:


ok setenv diag-trigger all-resets



Note - Status and error messages could be displayed in the console window. If POST detects an error, it displays an error message describing the failure.



6.4.3 Controlling POST Diagnostics

You control POST diagnostics and other aspects of the boot process by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables generally take effect only after the system is restarted. TABLE 6-2 lists the most important and useful of these variables. You can find instructions for changing OpenBoot configuration variables in Section 6.9, OpenBoot Configuration Variables.


TABLE 6-2 OpenBoot Configuration Variables

OpenBoot Configuration Variable

Description and Keywords

auto-boot?

Determines whether the operating system automatically starts up. Default is true.

  • true - Operating system automatically starts once firmware tests finish.
  • false - System remains at ok prompt until you type boot.

diag-level

Determines the level or type of diagnostics executed. Default is min.

  • off - No testing.
  • min - Only basic tests are run.
  • max - More extensive tests may be run, depending on the device.

diag-script

Determines which devices are tested by OpenBoot Diagnostics. Default is none.

  • none - No devices are tested.
  • normal - On-board (centerplane-based) devices that have self-tests are tested.
  • all - All devices that have self-tests are tested.

diag-switch?

Toggles the system in and out of diagnostic mode. Default is false.

  • true - Diagnostic mode: POST diagnostics and OpenBoot Diagnostics tests may run.
  • false - Default mode: Do not run POST or OpenBoot Diagnostics tests.

diag-trigger

Specifies the class of reset event that causes power-on self-test (POST) and OpenBoot Diagnostics to run. This variable accepts single keywords as well as combinations of the first three keywords separated by spaces. For details, see Section 6.9.1, To View and Set OpenBoot Configuration Variables.

  • error-reset - A reset caused by certain non-recoverable hardware error conditions. In general, an error reset occurs when a hardware problem corrupts system data. Examples include CPU and system watchdog resets, fatal errors, and certain CPU reset events (default).
  • power-on-reset - A reset caused by pressing the Power button (default).
  • user-reset - A reset initiated by the user or the operating system.
  • all-resets - Any kind of system reset.
  • none - No Power-On Self-Tests or OpenBoot Diagnostics tests run.

input-device

Selects where console input is taken from. Default is TTYA.

  • TTYA - From built-in SERIAL MGT port.
  • TTYB - From built-in general purpose serial port (10101)
  • keyboard - From attached keyboard that is part of a graphics terminal.

output-device

Selects where diagnostic and other console output is displayed. Default is TTYA.

  • TTYA - To built-in SERIAL MGT port.
  • TTYB - To built-in general purpose serial port (10101)
  • screen - To attached screen that is part of a graphics terminal. [1]

[1] POST messages cannot be displayed on a graphics terminal. They are sent to TTYA even when output-device is set to screen.

Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
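
The keyword rule for diag-trigger in TABLE 6-2 can be sketched as a small check. The following Python sketch is illustrative only and is not part of the OpenBoot firmware. The function name is_valid_diag_trigger is hypothetical, and the sketch assumes that "the first three keywords" means error-reset, power-on-reset, and user-reset, and that a keyword is not repeated within a combination.

```python
# Keywords taken from TABLE 6-2. Only the first three may be combined.
COMBINABLE = {"error-reset", "power-on-reset", "user-reset"}
STANDALONE = {"all-resets", "none"}


def is_valid_diag_trigger(value):
    """Return True if value looks like an acceptable diag-trigger setting.

    Hypothetical helper: accepts one keyword, or a space-separated
    combination of the combinable keywords with no repeats (assumption).
    """
    words = value.split()
    if not words:
        return False
    if len(words) == 1:
        return words[0] in COMBINABLE | STANDALONE
    return all(w in COMBINABLE for w in words) and len(set(words)) == len(words)


print(is_valid_diag_trigger("power-on-reset error-reset"))
print(is_valid_diag_trigger("all-resets user-reset"))
```

A value that passes this check would then be supplied to setenv at the ok prompt, with the keywords separated by spaces.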



Once POST diagnostics have finished running, POST reports back to the OpenBoot firmware the status of each test it has run. Control then reverts back to the OpenBoot firmware code.

If POST diagnostics do not uncover a fault, and your server still does not start up, run OpenBoot Diagnostics tests.


6.5 OpenBoot Diagnostics

Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the OpenBoot PROM.

6.5.1 To Start OpenBoot Diagnostics

1. Type:


ok setenv diag-switch? true
ok setenv diag-level max
ok setenv auto-boot? false
ok reset-all

2. Type:


ok obdiag

This command displays the OpenBoot Diagnostics menu. See TABLE 6-3.


TABLE 6-3 Sample obdiag menu

                              obdiag

 1 flashprom@2,0     4 network@2       7 scsi@2           10 serial@0,3f8
 2 i2c@0,320         5 network@2,1     8 scsi@2,1         11 usb@a
 3 ide@d             6 rtc@0,70        9 serial@0,2e8     12 usb@b

Commands: test test-all except help what setenv set-default exit

diag-passes=1 diag-level=max test-args=subtests, verbose



Note - If you have a PCI card installed in the server, then additional tests are displayed on the OBDiag menu.



3. Type:


obdiag> test n

Where n represents the number corresponding to the test you want to run.

A summary of the tests is available. At the obdiag> prompt, type:


obdiag> help

6.5.2 Controlling OpenBoot Diagnostics Tests

Most of the OpenBoot configuration variables you use to control POST (see TABLE 6-2) also affect OpenBoot Diagnostics tests.

By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 6-4.


TABLE 6-4 Keywords for the test-args OpenBoot Configuration Variable

| Keyword | What It Does |
| --- | --- |
| bist | Invokes built-in self-test (BIST) on external and peripheral devices. |
| debug | Displays all debug messages. |
| iopath | Verifies bus/interconnect integrity. |
| loopback | Exercises external loopback path for the device. |
| media | Verifies external and peripheral device media accessibility. |
| restore | Attempts to restore original state of the device if the previous execution of the test failed. |
| silent | Displays only errors rather than the status of each test. |
| subtests | Displays main test and each subtest that is called. |
| verbose | Displays detailed messages of status of all tests. |
| callers=n | Displays backtrace of n callers when an error occurs. callers=0 displays backtrace of all callers before the error. Default is callers=1. |
| errors=n | Continues executing the test until n errors are encountered. errors=0 displays all error reports without terminating testing. Default is errors=1. |



If you want to customize the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:


ok setenv test-args debug,loopback,media
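
If you assemble test-args values in a script before typing them at the ok prompt, a check against the keywords in TABLE 6-4 can catch typing errors early. The following Python sketch is one way to do this. The function name build_test_args is hypothetical, and the sketch assumes that callers and errors always take the form keyword=n with a non-negative integer n.

```python
# Reserved keywords from TABLE 6-4.
FLAG_KEYWORDS = {"bist", "debug", "iopath", "loopback", "media",
                 "restore", "silent", "subtests", "verbose"}
COUNT_KEYWORDS = {"callers", "errors"}  # written as keyword=n


def build_test_args(*keywords):
    """Return the setenv command line for test-args, or raise ValueError."""
    for kw in keywords:
        name, sep, count = kw.partition("=")
        if sep:
            # keyword=n form: only callers and errors take a count
            if name not in COUNT_KEYWORDS or not count.isdigit():
                raise ValueError("bad keyword: " + kw)
        elif kw not in FLAG_KEYWORDS:
            raise ValueError("bad keyword: " + kw)
    # Keywords are joined into a comma-separated list
    return "setenv test-args " + ",".join(keywords)


print(build_test_args("debug", "loopback", "media"))
```

The string the function returns is only text to type at the ok prompt. The sketch does not communicate with the firmware.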

6.5.2.1 test and test-all Commands

You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:


ok test /pci@x,y/SUNW,qlc@2

Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V210 and V240 servers.



Tip - Use the show-devs command to list the hardware device paths.



To customize an individual test, you can use test-args as follows:


ok test /usb@1,3:test-args={verbose,debug}

This affects only the current test without changing the value of the test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:


ok test-all

If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:


ok test-all /pci@9,700000/usb@1,3

6.5.2.2 What OpenBoot Diagnostics Error Messages Tell You

OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 6-1 displays a sample OpenBoot Diagnostics error message.

CODE EXAMPLE 6-1 OpenBoot Diagnostics Error Message

Testing /pci@1e,600000/isa@7/flashprom@2,0
 
    ERROR   : There is no POST in this FLASHPROM or POST header is unrecognized
    DEVICE  : /pci@1e,600000/isa@7/flashprom@2,0
    SUBTEST : selftest:crc-subtest
    MACHINE : Sun Fire V210
    SERIAL# : 51347798
    DATE    : 03/05/2003 15:17:31  GMT
    CONTROLS: diag-level=max test-args=errors=1
 
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) ............. failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1

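
If you capture console output to a file, the labeled fields of an OpenBoot Diagnostics error report can be pulled out with a few lines of code. The following Python sketch is illustrative only. The function name parse_obdiag_error is hypothetical, and the sketch assumes that each field label is written in uppercase, is followed by a colon, and that its value fits on one line.

```python
import re

# An uppercase label (optionally containing '#'), a colon, then the value.
FIELD = re.compile(r"^\s*([A-Z#]+)\s*:\s*(.*)$")


def parse_obdiag_error(text):
    """Return a dict mapping field labels (ERROR, DEVICE, ...) to values."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields


# Excerpt of the sample message in CODE EXAMPLE 6-1.
SAMPLE = """\
Testing /pci@1e,600000/isa@7/flashprom@2,0

    ERROR   : There is no POST in this FLASHPROM or POST header is unrecognized
    DEVICE  : /pci@1e,600000/isa@7/flashprom@2,0
    SUBTEST : selftest:crc-subtest
    MACHINE : Sun Fire V210

Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
"""

report = parse_obdiag_error(SAMPLE)
print(report["DEVICE"], report["SUBTEST"])
```

Lines such as the final "Error:" summary are skipped because their labels are not entirely uppercase.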

To change the system defaults and the diagnostics settings after initial boot, refer to the OpenBoot PROM Enhancements for Diagnostic Operation (817-6957). You can view or print this document by going to:
http://www.sun.com/documentation


6.6 OpenBoot Commands

OpenBoot commands are commands you type at the ok prompt. The OpenBoot commands that can provide useful diagnostic information are:

  • probe-scsi
  • probe-ide
  • show-devs

6.6.1 probe-scsi Command

The probe-scsi command is used to diagnose problems with SCSI devices.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi command can hang the system.



The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers.

For any SCSI device that is connected and active, the probe-scsi command displays its target ID, logical unit number, and a device description that includes type and manufacturer.

The following is sample output from the probe-scsi command.

CODE EXAMPLE 6-2 Sample probe-scsi Command Output

{1} ok probe-scsi
Target 0 
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238
Target 1 
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238
Target 2 
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238
Target 3 
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238

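
Captured probe-scsi output can be turned into a list of devices for an inventory or a comparison between servers. The following Python sketch is illustrative only. The function name parse_probe_scsi is hypothetical, and the sketch assumes the layout shown in CODE EXAMPLE 6-2: a Target line followed by one or more indented Unit lines.

```python
import re


def parse_probe_scsi(text):
    """Return a list of (target, unit, type, description) tuples."""
    devices = []
    target = None
    for line in text.splitlines():
        m = re.match(r"Target\s+(\S+)", line)
        if m:
            target = m.group(1)
            continue
        # Indented line: unit number, device type, then free-form description
        m = re.match(r"\s+Unit\s+(\S+)\s+(\S+)\s+(.+)", line)
        if m and target is not None:
            devices.append((target, m.group(1), m.group(2), m.group(3).strip()))
    return devices


# Excerpt of the sample output in CODE EXAMPLE 6-2.
SAMPLE = """\
{1} ok probe-scsi
Target 0
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238
Target 1
  Unit 0   Disk     SEAGATE ST336605LSUN36G 0238
"""

for device in parse_probe_scsi(SAMPLE):
    print(device)
```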

6.6.2 probe-ide Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.



The following is sample output from the probe-ide command.

CODE EXAMPLE 6-3 Sample probe-ide Command Output

{1} ok probe-ide
  Device 0  ( Primary Master ) 
         Removable ATAPI Model: DV-28E-B                                
 
  Device 1  ( Primary Slave ) 
         Not Present
 
  Device 2  ( Secondary Master ) 
         Not Present
 
  Device 3  ( Secondary Slave ) 
         Not Present


6.6.3 show-devs Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. The following code example shows sample output from the show-devs command.


CODE EXAMPLE 6-4 show-devs Command Output

ok show-devs
/pci@1d,700000
/pci@1c,600000
/pci@1e,600000
/pci@1f,700000
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/pci@1d,700000/network@2,1
/pci@1d,700000/network@2
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
/pci@1c,600000/scsi@2,1/tape
/pci@1c,600000/scsi@2,1/disk
/pci@1c,600000/scsi@2/tape
/pci@1c,600000/scsi@2/disk
/pci@1e,600000/ide@d
/pci@1e,600000/usb@a
/pci@1e,600000/pmu@6
/pci@1e,600000/isa@7
/pci@1e,600000/ide@d/cdrom
/pci@1e,600000/ide@d/disk
/pci@1e,600000/pmu@6/gpio@80000000,8a
/pci@1e,600000/pmu@6/i2c@0,0
/pci@1e,600000/isa@7/rmc-comm@0,3e8
/pci@1e,600000/isa@7/serial@0,2e8
/pci@1e,600000/isa@7/serial@0,3f8
/pci@1e,600000/isa@7/power@0,800
/pci@1e,600000/isa@7/i2c@0,320
/pci@1e,600000/isa@7/rtc@0,70
/pci@1e,600000/isa@7/flashprom@2,0
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,70
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,88
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,68
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,4a
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,46
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,44
/pci@1e,600000/isa@7/i2c@0,320/idprom@0,50
/pci@1e,600000/isa@7/i2c@0,320/nvram@0,50
/pci@1e,600000/isa@7/i2c@0,320/rscrtc@0,d0
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,c8
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,c6
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b8
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b6
/pci@1e,600000/isa@7/i2c@0,320/power-supply-fru-prom@0,a4
/pci@1e,600000/isa@7/i2c@0,320/power-supply-fru-prom@0,b0
/pci@1e,600000/isa@7/i2c@0,320/chassis-fru-prom@0,a8
/pci@1e,600000/isa@7/i2c@0,320/motherboard-fru-prom@0,a2
/pci@1e,600000/isa@7/i2c@0,320/i2c-bridge@0,18
/pci@1e,600000/isa@7/i2c@0,320/i2c-bridge@0,16
/pci@1f,700000/network@2,1
/pci@1f,700000/network@2
/openprom/client-services
/packages/obdiag-menu
/packages/obdiag-lib
/packages/SUNW,asr
/packages/SUNW,fru-device
/packages/SUNW,i2c-ram-device
/packages/obp-tftp
/packages/kbd-translator
/packages/dropins
/packages/terminal-emulator
/packages/disk-label
/packages/deblocker
/packages/SUNW,builtin-drivers
{1} ok
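
When you pass a path to test-all (see Section 6.5.2.1), the specified device and its children are tested. Given a saved show-devs listing, you can preview which paths fall under a device. The following Python sketch is illustrative only. The function name subtree is hypothetical, and treating "children" as the paths that begin with the device path is an assumption about how the device tree is laid out, not a description of how the firmware selects devices.

```python
def subtree(paths, root):
    """Return root and every path beneath it, in listing order."""
    root = root.rstrip("/")
    prefix = root + "/"
    return [p for p in paths if p == root or p.startswith(prefix)]


# A few paths taken from CODE EXAMPLE 6-4.
SHOW_DEVS = [
    "/pci@1e,600000/ide@d",
    "/pci@1e,600000/usb@a",
    "/pci@1e,600000/ide@d/cdrom",
    "/pci@1e,600000/ide@d/disk",
]

for path in subtree(SHOW_DEVS, "/pci@1e,600000/ide@d"):
    print(path)
```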

6.6.4 To Run OpenBoot Commands



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi command can hang the system.



1. Halt the system to reach the ok prompt.

How you do this depends on the system's condition. If possible, you should warn users before you shut the system down.

2. Type the appropriate command at the console prompt.

 


6.7 Operating System Diagnostic Tools

If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based diagnostic tools SunVTS and Sun Management Center. These tools enable you to monitor the server, exercise it, and isolate faults.



Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.



In addition to the tools mentioned, you can refer to error and system message log files, and Solaris system information commands.

6.7.1 Error and System Message Log Files

Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.

6.7.2 Solaris System Information Commands

The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V210 or V240 server:

  • prtconf
  • prtdiag
  • prtfru
  • psrinfo
  • showrev

This section describes the information these commands give you. More information about using each command is contained in the appropriate man page.

6.7.2.1 prtconf Command

The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating system software can detect. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 6-5 shows an excerpt of prtconf output.

CODE EXAMPLE 6-5 prtconf Command Output

# prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
 
SUNW,Sun-Fire-V240
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        SUNW,i2c-ram-device (driver not attached)
        SUNW,fru-device (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #0
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #1 ...


The prtconf command's -p option produces output similar to the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.

6.7.2.2 prtdiag Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of the output produced by prtdiag on a healthy Sun Fire V240 server running Solaris OS 8, PSR1.

CODE EXAMPLE 6-6 prtdiag Command Output

# prtdiag
System Configuration: Sun Microsystems  sun4u Sun Fire V240
System clock frequency: 160 MHZ
Memory size: 1GB        
 
==================================== CPUs ====================================
                      E$          CPU     CPU       Temperature         Fan
       CPU  Freq      Size        Impl.   Mask     Die    Ambient   Speed   Unit
       ---  --------  ----------  ------  ----  --------  --------  -----   ----
     MB/P0   960 MHz  1MB         US-IIIi   2.0     -     -  
     MB/P1   960 MHz  1MB         US-IIIi   2.0     -     -  
 
================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   pci    66            2  network-SUNW,bge (network)                       
 0   pci    66            2  scsi-pci1000,21.1 (scsi-2)                       
 0   pci    66            2  scsi-pci1000,21.1 (scsi-2)                       
 0   pci    66            2  network-SUNW,bge (network)                       
 0   pci    33            7  isa/serial-su16550 (serial)                      
 0   pci    33            7  isa/serial-su16550 (serial)                      
 0   pci    33            7  isa/rmc-comm-rmc_comm (seria+                    
 0   pci    33           13  ide-pci10b9,5229.c4 (ide)                        
 
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                512MB             1           GroupID 0 
0x1000000000       512MB             1           GroupID 0 
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        MB/P0/B0/D0,MB/P0/B0/D1
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
1              0        MB/P1/B0/D0,MB/P1/B0/D1 


In addition to the information in CODE EXAMPLE 6-6, prtdiag with the verbose option (-v) reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.

CODE EXAMPLE 6-7 prtdiag Verbose Output

System Temperatures (Celsius):
-------------------------------
Device           Temperature    Status
---------------------------------------
CPU0             59             OK
CPU2             64             OK
DBP0             22             OK


In the event of an overtemperature condition, prtdiag reports an error in the Status column for that device.

CODE EXAMPLE 6-8 prtdiag Overtemperature Indication Output

System Temperatures (Celsius):
-------------------------------
Device           Temperature    Status
---------------------------------------
CPU0             62             OK
CPU1             102            ERROR


Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.

CODE EXAMPLE 6-9 prtdiag Fault Indication Output

Fan Status:
-----------
 
Bank             RPM    Status
----            -----   ------
CPU0             4166   [NO_FAULT]
CPU1             0000   [FAULT]

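
Because prtdiag reports faults in a fixed Status column, saved prtdiag -v output can be scanned for failed components. The following Python sketch is illustrative only. The function name faulted_fans is hypothetical, and the sketch assumes the Fan Status layout shown in CODE EXAMPLE 6-9, which can vary with the Solaris OS version.

```python
import re

# A fan row: bank name, RPM digits, then a bracketed status.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+\[(\w+)\]\s*$")


def faulted_fans(text):
    """Return the bank names whose Status column reads [FAULT]."""
    matches = map(ROW.match, text.splitlines())
    return [m.group(1) for m in matches if m and m.group(3) == "FAULT"]


# Sample taken from CODE EXAMPLE 6-9.
SAMPLE = """\
Fan Status:
-----------

Bank             RPM    Status
----            -----   ------
CPU0             4166   [NO_FAULT]
CPU1             0000   [FAULT]
"""

print(faulted_fans(SAMPLE))
```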

6.7.2.3 prtfru Command

The Sun Fire V210 and V240 servers maintain a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs.

CODE EXAMPLE 6-10 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.

CODE EXAMPLE 6-10 prtfru -l Command Output

# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT/battery (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0

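
In the prtfru -l listing, entries that end in "(fru)" are FRUs and entries that end in "(container)" are containers. The following Python sketch extracts only the FRU paths from saved output. It is illustrative only, and the function name list_frus is hypothetical.

```python
def list_frus(text):
    """Return the paths in prtfru -l output that are marked (fru)."""
    frus = []
    for line in text.splitlines():
        # Split off the trailing marker, if any, at the last space
        path, _, kind = line.strip().rpartition(" ")
        if kind == "(fru)":
            frus.append(path)
    return frus


# Excerpt of the sample output in CODE EXAMPLE 6-10.
SAMPLE = """\
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
"""

for path in list_frus(SAMPLE):
    print(path)
```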

CODE EXAMPLE 6-11 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.

CODE EXAMPLE 6-11 prtfru -c Command Output

# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
   SEGMENT: SD
      /SpecPartNo: 885-0092-02
      /ManR
      /ManR/UNIX_Timestamp32: Wednesday April 10 11:34:49 BST 2002
      /ManR/Fru_Description: FRUID,INSTR,M'BD,0CPU,0MB,ENXU
      /ManR/Manufacture_Loc: HsinChu, Taiwan
      /ManR/Sun_Part_No: 3753107
      /ManR/Sun_Serial_No: abcdef
      /ManR/Vendor_Name: Mitac International
      /ManR/Initial_HW_Dash_Level: 02
      /ManR/Initial_HW_Rev_Level: 01


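Data displayed by the prtfru command varies depending on the type of FRU. In general, as CODE EXAMPLE 6-11 shows, it includes a FRU description, the manufacturer's name and location, part and serial numbers, and hardware revision levels.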

6.7.2.4 psrinfo Command

The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.

CODE EXAMPLE 6-12 psrinfo -v Command Output

# psrinfo -v
Status of processor 0 as of: 09/20/02 11:35:49
  Processor has been on-line since 09/20/02 11:30:53.
  The sparcv9 processor operates at 960 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 09/20/02 11:35:49
  Processor has been on-line since 09/20/02 11:30:52.
  The sparcv9 processor operates at 960 MHz,
        and has a sparcv9 floating point processor.

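
To compare CPU clock speeds across several servers, you can extract them from saved psrinfo -v output. The following Python sketch is illustrative only. The function name processor_speeds is hypothetical, and the sketch assumes the message wording shown in CODE EXAMPLE 6-12.

```python
import re


def processor_speeds(text):
    """Return a dict mapping processor ID to clock speed in MHz."""
    speeds = {}
    cpu = None
    for line in text.splitlines():
        m = re.match(r"Status of processor (\d+)", line)
        if m:
            cpu = int(m.group(1))
            continue
        m = re.search(r"operates at (\d+) MHz", line)
        if m and cpu is not None:
            speeds[cpu] = int(m.group(1))
    return speeds


# Sample taken from CODE EXAMPLE 6-12.
SAMPLE = """\
Status of processor 0 as of: 09/20/02 11:35:49
  Processor has been on-line since 09/20/02 11:30:53.
  The sparcv9 processor operates at 960 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 09/20/02 11:35:49
  Processor has been on-line since 09/20/02 11:30:52.
  The sparcv9 processor operates at 960 MHz,
        and has a sparcv9 floating point processor.
"""

print(processor_speeds(SAMPLE))
```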

6.7.2.5 showrev Command

The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 6-13 shows sample output of the showrev command.

CODE EXAMPLE 6-13 showrev Command Output

# showrev
Hostname: griffith
Hostid: 830f8192
Release: 5.8
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: 
Kernel version: SunOS 5.8 Generic 108528-16 August 2002


When used with the -p option, this command displays installed patches. CODE EXAMPLE 6-14 shows a partial sample output from the showrev command with the -p option.

CODE EXAMPLE 6-14 showrev -p Command Output

# showrev -p 
Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109783-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109807-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109809-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110905-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110910-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110914-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr

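
To check whether a particular patch is installed, you can search saved showrev -p output. The following Python sketch is illustrative only. The function name installed_patches is hypothetical, and the sketch assumes that when a patch applies to several packages, the package names are separated by commas. The sample output above shows only one package per patch.

```python
import re

# Patch ID after "Patch:", package list after "Packages:".
PATCH = re.compile(r"^Patch:\s*(\S+).*?Packages:\s*(.*)$")


def installed_patches(text):
    """Return a dict mapping patch ID to its list of package names."""
    patches = {}
    for line in text.splitlines():
        m = PATCH.match(line)
        if m:
            names = [p.strip() for p in m.group(2).split(",")]
            patches[m.group(1)] = [p for p in names if p]
    return patches


# Excerpt of the sample output in CODE EXAMPLE 6-14.
SAMPLE = """\
Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr
"""

patches = installed_patches(SAMPLE)
print("109729-01" in patches, patches["108964-04"])
```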

6.7.3 To Run Solaris System Information Commands

1. Decide on the type of system information you want to display.

For more information, see Section 6.7.2, Solaris System Information Commands.

2. Type the appropriate command at a console prompt.

See TABLE 6-5 for a summary of the commands.


TABLE 6-5 Using Solaris Information Display Commands

| Command | What It Displays | What to Type | Notes |
| --- | --- | --- | --- |
| prtconf | System configuration information | /usr/sbin/prtconf | -- |
| prtdiag | Diagnostic and configuration information | /usr/platform/sun4u/sbin/prtdiag | Use the -v option for additional detail. |
| prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data. |
| psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data. |
| showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches. |


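When these commands are driven from a script, it can help to keep the paths and options from TABLE 6-5 in one place. The following Python sketch is one possible arrangement. The names COMMANDS, build_command, and run_command, and the option keys such as "verbose", are hypothetical. The run_command function is expected to work only on a host where the listed Solaris commands exist.

```python
import subprocess

# Paths and options as listed in TABLE 6-5. The option keys are hypothetical
# labels chosen for this sketch.
COMMANDS = {
    "prtconf": ("/usr/sbin/prtconf", {}),
    "prtdiag": ("/usr/platform/sun4u/sbin/prtdiag", {"verbose": "-v"}),
    "prtfru": ("/usr/sbin/prtfru", {"hierarchy": "-l", "seeprom": "-c"}),
    "psrinfo": ("/usr/sbin/psrinfo", {"verbose": "-v"}),
    "showrev": ("/usr/bin/showrev", {"patches": "-p"}),
}


def build_command(name, option=None):
    """Return the argument list for a command, with an optional flag."""
    path, options = COMMANDS[name]
    argv = [path]
    if option is not None:
        argv.append(options[option])
    return argv


def run_command(name, option=None):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_command(name, option),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

For example, build_command("prtdiag", "verbose") returns the prtdiag path followed by -v, and an unknown command or option name raises KeyError rather than running anything.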


6.8 Recent Diagnostic Test Results

Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.

6.8.1 To View Recent Test Results

1. Go to the ok prompt.

2. To see a summary of the most recent POST results, type:


ok show-post-results


6.9 OpenBoot Configuration Variables

Switches and diagnostic configuration variables stored in the IDPROM determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 6-2.

Changes to OpenBoot configuration variables usually take effect upon the next reboot.

6.9.1 To View and Set OpenBoot Configuration Variables

6.9.1.1 To View OpenBoot Configuration Variables

1. Halt the server to reach the ok prompt.

2. To display the current values of all OpenBoot configuration variables, use the printenv command.

The following example shows a short excerpt of this command's output.


ok printenv
Variable Name         Value                          Default Value
 
diag-level            min                            min
diag-switch?          false                          false
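If printenv output is captured from the console (for example, through an ALOM console session log), a short script can list the variables whose values differ from their defaults. The following Python sketch assumes the three-column layout shown above, with columns separated by two or more spaces. Lines that do not split into exactly three columns are skipped. The function names are hypothetical.

```python
import re


def parse_printenv(text):
    """Return a dict mapping variable name to a (value, default) tuple."""
    variables = {}
    for line in text.splitlines():
        # Assumption: columns are separated by runs of two or more spaces.
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) != 3 or columns[0] == "Variable Name":
            continue
        name, value, default = columns
        variables[name] = (value, default)
    return variables


def changed_from_default(variables):
    """Return the sorted names of variables whose value differs from the default."""
    return sorted(
        name for name, (value, default) in variables.items() if value != default
    )


# Excerpt copied from the printenv example above.
PRINTENV_SAMPLE = """\
ok printenv
Variable Name         Value                          Default Value

diag-level            min                            min
diag-switch?          false                          false
"""
```

For the excerpt above, changed_from_default(parse_printenv(PRINTENV_SAMPLE)) returns an empty list, because both variables are at their default values.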

6.9.1.2 To Set OpenBoot Configuration Variables

1. Halt the server to reach the ok prompt.

2. To set or change the value of an OpenBoot configuration variable, use the setenv command:


ok setenv diag-level max

diag-level = max

To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.



Note - Keywords for the OpenBoot configuration variable test-args must be separated by commas.
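The two separator rules can be captured in a small helper that builds the setenv command line to type at the ok prompt. The following Python sketch is illustrative. The function name format_setenv is hypothetical, and the keyword values used in the usage note are examples only, so check them against the variable descriptions for your system.

```python
# Variables whose keywords are separated by commas instead of spaces.
COMMA_SEPARATED = {"test-args"}


def format_setenv(name, keywords):
    """Return the setenv command line for a variable and its keywords."""
    separator = "," if name in COMMA_SEPARATED else " "
    return "setenv {} {}".format(name, separator.join(keywords))
```

For example, format_setenv("boot-device", ["disk", "net"]) produces a space-separated value, while format_setenv("test-args", ["debug", "loopback"]) produces a comma-separated value.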




6.10 Additional Diagnostic Tests for Specific Devices

6.10.1 Using the probe-scsi Command to Confirm That Hard Drives are Active

The probe-scsi command transmits an inquiry to SCSI devices connected to the system's internal SCSI interface. If a SCSI device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.


CODE EXAMPLE 6-15 probe-scsi Output Message

ok probe-scsi
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1 
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
 


The probe-scsi-all command transmits an inquiry to all SCSI devices connected to both the system's internal and its external SCSI interfaces. CODE EXAMPLE 6-16 shows sample output from a server with no externally connected SCSI devices but containing two 36 GB hard drives, both of them active.


CODE EXAMPLE 6-16 probe-scsi-all Output Message

ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1
 
/pci@1f,0/pci@1/scsi@8
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1 
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
 

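Captured probe-scsi or probe-scsi-all output can be turned into a device list for comparison against the expected drive configuration. The following Python sketch assumes the layout shown in CODE EXAMPLE 6-15 and CODE EXAMPLE 6-16: an optional device path line, a Target line, and indented Unit lines. The function name and dictionary keys are hypothetical.

```python
import re

TARGET = re.compile(r"^Target\s+(\w+)")
UNIT = re.compile(r"^Unit\s+(\d+)\s+(\S+)\s+(.+)$")


def parse_probe_scsi(text):
    """Return a list of dicts, one per SCSI unit reported in the output."""
    devices = []
    path = None
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # A device path line starts a new controller section.
            path, target = line, None
            continue
        match = TARGET.match(line)
        if match:
            target = match.group(1)
            continue
        match = UNIT.match(line)
        if match and target is not None:
            devices.append({
                "path": path,
                "target": target,
                "unit": int(match.group(1)),
                "type": match.group(2),
                "description": match.group(3),
            })
    return devices


# Sample text copied from CODE EXAMPLE 6-16.
PROBE_SCSI_ALL_SAMPLE = """\
ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1

/pci@1f,0/pci@1/scsi@8
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
"""
```

With this sample, the function returns two entries, both under the /pci@1f,0/pci@1/scsi@8 path. A controller with no attached devices, such as the first path in the sample, contributes no entries.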

6.10.2 Using the probe-ide Command to Confirm That the DVD or CD-ROM Drive is Connected

The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system's on-board IDE interface. The following sample output reports a DVD drive installed (as Device 0) and active in a server.


CODE EXAMPLE 6-17 probe-ide Output Message

ok probe-ide
 Device 0  ( Primary Master ) 
 Removable ATAPI Model: DV-28E-B
 
 Device 1  ( Primary Slave )
 Not Present
 
 Device 2  ( Secondary Master ) 
 Not Present
 
 Device 3  ( Secondary Slave )
 Not Present
 

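The same approach works for probe-ide output, where each Device line is followed by either a description or the text Not Present. The following Python sketch assumes the layout shown in CODE EXAMPLE 6-17 and records only the first line that follows each Device line. The function name and dictionary keys are hypothetical.

```python
import re

DEVICE = re.compile(r"^Device\s+(\d+)\s+\(\s*(.+?)\s*\)$")


def parse_probe_ide(text):
    """Return a dict mapping device number to position, presence, and detail."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = DEVICE.match(line)
        if match:
            current = int(match.group(1))
            devices[current] = {
                "position": match.group(2),
                "present": False,
                "detail": "",
            }
        elif current is not None and line:
            # The first non-blank line after a Device line describes it.
            if line != "Not Present":
                devices[current]["present"] = True
                devices[current]["detail"] = line
            current = None
    return devices


# Sample text copied from CODE EXAMPLE 6-17.
PROBE_IDE_SAMPLE = """\
ok probe-ide
 Device 0  ( Primary Master )
 Removable ATAPI Model: DV-28E-B

 Device 1  ( Primary Slave )
 Not Present

 Device 2  ( Secondary Master )
 Not Present

 Device 3  ( Secondary Slave )
 Not Present
"""
```

With this sample, device 0 is reported as present with the detail line for the DVD drive, and devices 1 through 3 are reported as not present.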

6.10.3 Using the watch-net and watch-net-all Commands to Check the Network Connections

The watch-net diagnostics test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.

Start the watch-net diagnostic test by typing the watch-net command at the ok prompt. For the watch-net-all diagnostic test, type watch-net-all at the ok prompt.


CODE EXAMPLE 6-18 watch-net Diagnostic Output Message

{1} ok watch-net
100 Mbps FDX Link up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
................................


CODE EXAMPLE 6-19 watch-net-all Diagnostic Output Message

{1} ok watch-net-all
/pci@1d,700000/network@2,1
Timed out waiting for Autonegotation to complete
Check cable and try again
Link Down
 
/pci@1f,700000/network@2
100 Mbps FDX Link up
................................
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
................................
{1} ok

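If the output of a watch-net run is captured, the good and bad packet markers can be tallied with a few lines of code. The following Python sketch counts markers only on lines that consist entirely of period and X characters, so that message lines containing periods are not miscounted. This is an assumption about the layout. A line that mixes markers with error description text is ignored by this sketch. The function name is hypothetical.

```python
def count_packets(output):
    """Return (good, bad) packet counts from captured watch-net output."""
    good = 0
    bad = 0
    for raw in output.splitlines():
        line = raw.strip()
        # Only count lines made up solely of packet markers.
        if line and set(line) <= {".", "X"}:
            good += line.count(".")
            bad += line.count("X")
    return good, bad
```

For example, a capture containing the message line "Looking for Ethernet Packets." followed by the marker line "....X..." yields seven good packets and one bad packet.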

For additional information about diagnostic tests for the OpenBoot PROM, see OpenBoot PROM Enhancements for Diagnostic Operation (817-6957-10).


6.11 Automatic System Recovery



Note - Automatic System Recovery (ASR) is not the same as Automatic Server Restart, which the Sun Fire V210 and V240 servers also support. For additional information about Automatic Server Restart, see Section 3.1.3, Automatic Server Restart.



Automatic System Recovery (ASR) consists of self-test features and an auto-configuring capability to detect failed hardware components and unconfigure them. By doing this, the server is able to resume operating after certain non-fatal hardware faults or failures have occurred.

If a component is one that is monitored by ASR, and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails.

ASR monitors memory modules:

If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.

If a fault occurs on a running server, and it is possible for the server to run without the failed component, the server automatically reboots. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.

To support degraded boot capability, OpenBoot firmware uses the 1275 Client interface (via the device tree) to mark a device as either failed or disabled. This creates an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem so marked.

As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.



Note - ASR is not enabled until you activate it.



6.11.1 Auto-Boot Options

The auto-boot? setting controls whether or not the firmware automatically boots the operating system after each reset. The default setting is true.

The auto-boot-on-error? setting controls whether the system attempts a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true to enable an automatic degraded boot.

To set the switches, type:

ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. Therefore, the system does not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal non-recoverable error, even if degraded booting is enabled. For examples of fatal non-recoverable errors, see Error Handling Summary.



6.11.2 Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot Diagnostics detects a non-fatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.



6.11.3 Reset Scenarios

Two OpenBoot configuration variables, diag-switch? and diag-trigger, control how the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST and OpenBoot Diagnostics unless diag-switch? is set to true or diag-trigger is set to a reset event. The default setting for diag-switch? is false. Because ASR relies on firmware diagnostics to detect faulty devices, diag-switch? must be set to true for ASR to run. For instructions, see Section 6.11.4, To Enable ASR.

To control which reset events, if any, automatically initiate firmware diagnostics, use diag-trigger. For detailed explanations of these variables and their uses, see Section 6.4.3, Controlling POST Diagnostics.
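The interaction between the two variables can be sketched as a simple predicate. The following Python function is a simplified reading of the paragraphs above, not a description of the firmware's actual logic. It assumes that diag-trigger holds either none or one or more space-separated reset event names, and that diagnostics run when diag-switch? is true or when the reset event is named in diag-trigger. The function name is hypothetical.

```python
def diagnostics_run(diag_switch, diag_trigger, reset_event):
    """Return True if firmware diagnostics are expected to run for a reset.

    diag_switch  -- value of diag-switch? as a bool
    diag_trigger -- value of diag-trigger as a string, for example "none"
                    or "power-on-reset error-reset"
    reset_event  -- the reset event that occurred, for example "user-reset"
    """
    # Assumption: multiple trigger keywords are separated by spaces.
    triggers = diag_trigger.split()
    return diag_switch or reset_event in triggers
```

Under these assumptions, diagnostics_run(False, "none", "power-on-reset") is false, which corresponds to the standard reset protocol bypassing POST and OpenBoot Diagnostics.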

6.11.4 To Enable ASR

1. At the system ok prompt, type:


ok setenv diag-switch? true

2. Set the diag-trigger variable to power-on-reset, error-reset, or user-reset. For example, type:


ok setenv diag-trigger user-reset

3. Type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true

4. Type:


ok reset-all

The system permanently stores the parameter changes and boots automatically if the OpenBoot variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power-cycle the system using the front panel Power switch.
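After completing the procedure, you can confirm the result by checking the four variables it sets. The following Python sketch takes a dictionary of variable names and values (for example, transcribed from printenv output) and reports which settings still prevent ASR from being fully enabled. The function name, the message wording, and the assumption that diag-trigger may hold several space-separated keywords are illustrative.

```python
# Reset events named in the procedure as valid diag-trigger settings.
ASR_TRIGGERS = {"power-on-reset", "error-reset", "user-reset"}


def asr_problems(variables):
    """Return a list of reasons why ASR is not fully enabled (empty if none)."""
    problems = []
    for name in ("diag-switch?", "auto-boot?", "auto-boot-on-error?"):
        if variables.get(name) != "true":
            problems.append("{} is not true".format(name))
    triggers = set(variables.get("diag-trigger", "").split())
    if not triggers & ASR_TRIGGERS:
        problems.append("diag-trigger does not name a reset event")
    return problems
```

An empty list means that all four variables match the settings in this procedure. With the default value of false for auto-boot-on-error?, the list includes an entry for that variable.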



6.11.5 To Disable ASR

1. At the system ok prompt, type:


ok setenv diag-switch? false
ok setenv diag-trigger none

2. Type:

 


ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power-cycle the system using the front panel Power switch.