CHAPTER 1

Troubleshooting Tools

This chapter describes the diagnostic tools available for the Netra 240 server. The chapter contains the following sections:


Overview of Diagnostic Tools

Sun provides a range of diagnostic tools for use with the Netra 240 server, as summarized in the following table.


TABLE 1-1 Summary of Troubleshooting Tools

| Diagnostic Tool | Type | Description | Accessibility and Availability | Remote Capability |
|---|---|---|---|---|
| ALOM | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access. | Can function on standby power and without operating system. | Designed for remote access. |
| LEDs | Hardware | Indicate status of overall system and particular components. | Accessed from system chassis. Available anytime power is available. | Local, but can be viewed by means of ALOM. |
| Power-on self-test (POST) | Firmware | Tests core components of system. | Runs automatically on startup. Available when the operating system is not running. | Local, but can be viewed by means of ALOM. |
| OpenBoot commands | Firmware | Display various kinds of system information. | Available when the operating system is not running. | Local, but can be accessed by means of ALOM. |
| OpenBoot diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices. | Runs automatically or interactively. Available when the operating system is not running. | Local, but can be viewed by means of ALOM. |
| Solaris software commands | Software | Display various kinds of system information. | Requires operating system. | Local, but can be accessed by means of ALOM. |
| SunVTS™ software | Software | Exercises and stresses the system, running tests in parallel. | Requires operating system. Optional package. | Viewable and controllable over network. |



System Prompts

The following default server prompts are used by the Netra 240 server:

FIGURE 1-1 shows the relationship between the three prompts and how to change from one to the other.


FIGURE 1-1 System Prompt Flow

This is a system prompt flow diagram showing the relationship between three default server prompts and how to change from one to the other. The three server prompts are OpenBoot, ALOM, and Solaris prompts.


The following commands are in the flow diagram in FIGURE 1-1:


Advanced Lights Out Manager

Sun™ Advanced Lights Out Manager (ALOM) for the Netra 240 server provides a series of LED status indicators. This section details the meaning of their status and how to turn them on and off. For more information on ALOM, see Chapter 3.


FIGURE 1-2 Location of Front Panel Indicators

This figure shows the location of the front panel indicators. From left to right they are the four dry contact alarm card indicators and the three server status indicators.


Server Status Indicators

The server has three LED status indicators. They are located on the front bezel (FIGURE 1-2) and are repeated on the rear panel. A summary of the indicators is provided in TABLE 1-2.


TABLE 1-2 Server Status Indicators (Front and Rear)

| Indicator | LED Color | LED State | Meaning |
|---|---|---|---|
| Activity | Green | On | The server is powered on and is running the Solaris OS. |
| Activity | Green | Off | Either power is not present or the Solaris OS is not running. |
| Service Required | Yellow | On | The server has detected a problem and requires the attention of service personnel. |
| Service Required | Yellow | Off | The server has no detected faults. |
| Locator | White | On | Lights continuously to distinguish the server from others in a rack when the setlocator command is used. |


You can turn the Locator LED on and off either from the system console or the ALOM command-line interface (CLI).


To Display Locator LED Status

• Do one of the following:


To Turn the Locator LED On

• Do one of the following:


To Turn the Locator LED Off

• Do one of the following:

Alarm Status Indicators

The dry contact alarm card has four LED status indicators that are supported by ALOM. They are located vertically on the front bezel (FIGURE 1-2). Information about the alarm indicators and dry contact alarm states is provided in TABLE 1-3. For more information about alarm indicators, see the Sun Advanced Lights Out Manager Software User's Guide for the Netra 240 Server (part number 817-3174). For more information about an API to control the alarm indicators, see Appendix A.


TABLE 1-3 Alarm Indicators and Dry Contact Alarm States

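| Indicator and Relay Labels | Indicator Color | Application or Server State | Condition or Action | System Indicator State | Alarm Indicator State | Relay NC[1] State | Relay NO[2] State | Comments |
|---|---|---|---|---|---|---|---|---|
| Critical (Alarm0) | Red | Server state (Power on/off and Solaris OS functional/not functional) | No power input. | Off | Off | Closed | Open | Default state. |
| Critical (Alarm0) | Red | Server state | System power off. | Off | Off[3] | Closed | Open | Input power connected. |
| Critical (Alarm0) | Red | Server state | System power turns on; Solaris OS not fully loaded. | Off | Off[iii] | Closed | Open | Transient state. |
| Critical (Alarm0) | Red | Server state | Solaris OS successfully loaded. | On | Off | Open | Closed | Normal operating state. |
| Critical (Alarm0) | Red | Server state | Watchdog timeout. | Off | On | Closed | Open | Transient state; reboot Solaris OS. |
| Critical (Alarm0) | Red | Server state | Solaris OS shutdown initiated by user[4]. | Off | Off[iii] | Closed | Open | Transient state. |
| Critical (Alarm0) | Red | Server state | Lost input power. | Off | Off | Closed | Open | Default state. |
| Critical (Alarm0) | Red | Server state | System power shutdown initiated by user. | Off | Off[iii] | Closed | Open | Transient state. |
| Critical (Alarm0) | Red | Application state | User sets Critical alarm on[5]. | -- | On | Closed | Open | Critical fault detected. |
| Critical (Alarm0) | Red | Application state | User sets Critical alarm off[ii]. | -- | Off | Open | Closed | Critical fault cleared. |
| Major (Alarm1) | Red | Application state | User sets Major alarm on[ii]. | -- | On | Open | Closed | Major fault detected. |
| Major (Alarm1) | Red | Application state | User sets Major alarm off[ii]. | -- | Off | Closed | Open | Major fault cleared. |
| Minor (Alarm2) | Amber | Application state | User sets Minor alarm on[ii]. | -- | On | Open | Closed | Minor fault detected. |
| Minor (Alarm2) | Amber | Application state | User sets Minor alarm off[ii]. | -- | Off | Closed | Open | Minor fault cleared. |
| User (Alarm3) | Amber | Application state | User sets User alarm on[ii]. | -- | On | Open | Closed | User fault detected. |
| User (Alarm3) | Amber | Application state | User sets User alarm off[ii]. | -- | Off | Closed | Open | User fault cleared. |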


In all cases when the user sets an alarm, a message is displayed on the console. For example, when the critical alarm is set, the following message is displayed on the console:

SC Alert: CRITICAL ALARM is set

Note that in some instances when the critical alarm is set, the associated alarm indicator is not lit. This implementation is subject to change in future releases (see Footnote iii of TABLE 1-3).
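
The relay behavior listed in TABLE 1-3 can be captured in a small lookup. The following Python sketch is illustrative only: the names RELAY_STATES and relay_states are hypothetical and are not part of ALOM or of the alarm API described in Appendix A. It encodes only the application-state rows of the table.

```python
# Application-state rows of TABLE 1-3: (alarm, is_set) -> (NC relay, NO relay).
# These names are hypothetical; this is not the ALOM or Solaris alarm API.
RELAY_STATES = {
    ("critical", True): ("closed", "open"),
    ("critical", False): ("open", "closed"),
    ("major", True): ("open", "closed"),
    ("major", False): ("closed", "open"),
    ("minor", True): ("open", "closed"),
    ("minor", False): ("closed", "open"),
    ("user", True): ("open", "closed"),
    ("user", False): ("closed", "open"),
}


def relay_states(alarm, is_set):
    """Return the (NC, NO) relay states that TABLE 1-3 lists for an alarm."""
    return RELAY_STATES[(alarm.lower(), is_set)]


if __name__ == "__main__":
    for name in ("critical", "major", "minor", "user"):
        print(name, "set:", relay_states(name, True))
```

As the table shows, the Critical relay states are the reverse of those for the Major, Minor, and User alarms, which is easy to overlook when wiring external equipment.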


Power-On Self-Test Diagnostics

Power-on self-test (POST) is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O devices. The software then generates messages that can be useful in determining the nature of a hardware failure. You can run POST even if the system is unable to boot.

POST detects most system faults and is located in the motherboard OpenBoot PROM. You can program the OpenBoot software to run POST at power-on by setting two OpenBoot configuration variables, diag-switch? and diag-level. These two variables are stored on the system configuration card (SCC).



Note - The SCC contains information about the system's identity, including the Host ID, MAC address and NVRAM settings. To increase the speed by which a system can be brought back online, your Sun Service representative might transfer the SCC and the drive on which it resides to another server, enabling it to inherit the old server's identity without being reconfigured. This procedure should only be done by trained Sun personnel.



POST runs automatically when the system power is applied, or following an automatic system reset, if all of the following conditions apply:

If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively.

If diag-level is set to menus, a menu of all the tests executed at power up is displayed.

POST diagnostic and error message reports are displayed on a console.

Controlling POST Diagnostics

You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables take effect only after the system is restarted. TABLE 1-4 lists the most important and useful of these variables. You can find instructions for changing OpenBoot configuration variables in To View and Set OpenBoot Configuration Variables.


TABLE 1-4 OpenBoot Configuration Variables

OpenBoot Configuration Variable

Description and Keywords

auto-boot?

Determines whether the operating system automatically starts up. Default is true.

  • true--Operating system automatically starts once firmware tests have finished running.
  • false--System remains at ok prompt until you type boot.

diag-level

Determines the level or type of diagnostics executed. Default is min.

  • off--No testing.
  • min--Only basic tests are run.
  • max--More extensive tests may be run, depending on the device.
  • menus--Menu-driven tests at POST levels can be individually run.

diag-script

Determines which devices are tested by OpenBoot diagnostics. Default is none.

  • none--No devices are tested.
  • normal--On-board (centerplane-based) devices that have self-tests are tested.
  • all--All devices that have self-tests are tested.

diag-switch?

Toggles the system in and out of diagnostic mode. Default is false.

  • true--Diagnostic mode: POST diagnostics and OpenBoot diagnostics tests are run.
  • false--Default mode: Do not run POST or OpenBoot diagnostics tests.

post-trigger

obdiag-trigger[6]

These two variables specify the class of reset event that causes power-on self-tests (or OpenBoot diagnostics tests) to run. These variables can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, see To View and Set OpenBoot Configuration Variables.

  • error-reset--A reset caused by certain nonrecoverable hardware error conditions. In general, an error reset occurs when a hardware problem corrupts system state data. Examples include CPU and system watchdog resets, fatal errors, and certain CPU reset events (default).
  • power-on-reset--A reset caused by pressing the On/Standby button (default).
  • user-reset--A reset initiated by the user or the operating system.
  • all-resets--Any kind of system reset.
  • none--No power-on self-tests (or OpenBoot diagnostics tests) are run.

input-device

Selects where console input is taken from. Default is ttya.

  • ttya--From built-in SERIAL MGT port.
  • ttyb--From built-in general purpose serial port (10101).
  • keyboard--From attached keyboard that is part of a graphics terminal.

output-device

Selects where diagnostic and other console output is displayed. Default is ttya.

  • ttya--To built-in SERIAL MGT port.
  • ttyb--To built-in general purpose serial port (10101).
  • screen--To attached screen that is part of a graphics terminal.[7]



Note - These variables affect OpenBoot diagnostics tests as well as POST diagnostics.



Once POST diagnostics have finished running, POST reports the status of each test that was run to the OpenBoot firmware. Control then returns to the OpenBoot firmware code.

If POST diagnostics do not uncover a fault, and your server still does not start up, run OpenBoot diagnostics tests.


To Start POST Diagnostics

1. Go to the ok prompt.

2. Type:


ok setenv diag-switch? true

3. Type:


ok setenv diag-level value

Where value is min, max, or menus, depending on the quantity of diagnostic information you want to see.

4. Type:


ok reset-all

The system runs POST diagnostics if post-trigger is set to user-reset. Status and error messages are displayed in the console window. If POST detects an error, it displays an error message describing the failure.

5. When you have finished running POST, restore the value of diag-switch? to false by typing:


ok setenv diag-switch? false

Resetting diag-switch? to false minimizes boot time.
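
How the variables in TABLE 1-4 interact can be sketched in Python. This is a simplified model, not the firmware's logic: it assumes that POST runs only when diag-switch? is true, diag-level is not off, and the class of the reset matches post-trigger. The function name post_will_run and the matching rule are assumptions made for illustration.

```python
def post_will_run(diag_switch, diag_level, post_trigger, reset_class):
    """Simplified, assumed model of whether POST runs after a reset.

    diag_switch  -- value of diag-switch? (True or False)
    diag_level   -- "off", "min", "max", or "menus"
    post_trigger -- space-separated keywords, e.g. "error-reset power-on-reset"
    reset_class  -- "error-reset", "power-on-reset", or "user-reset"
    """
    # diag-switch? false or diag-level off means no testing (TABLE 1-4).
    if not diag_switch or diag_level == "off":
        return False
    triggers = post_trigger.split()
    if "none" in triggers:
        return False
    # all-resets matches any reset; otherwise the class must be listed.
    return "all-resets" in triggers or reset_class in triggers


if __name__ == "__main__":
    print(post_will_run(True, "max", "user-reset", "user-reset"))
```

Under this model, a user-initiated reset with post-trigger left at error-reset power-on-reset would not start POST, which is consistent with the condition stated in Step 4.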


OpenBoot Commands

OpenBoot commands are commands you type from the ok prompt. OpenBoot commands that can provide useful diagnostic information are as follows:

  • probe-scsi and probe-scsi-all
  • probe-ide
  • show-devs

probe-scsi and probe-scsi-all Commands

The probe-scsi and probe-scsi-all commands diagnose problems with the SCSI devices.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-scsi or probe-scsi-all command can hang the system.



The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command also accesses devices connected to any host adapters installed in PCI slots.

For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop ID, host adapter, logical unit number, unique world-wide name (WWN), and a device description that includes type and manufacturer.

The following sample output is from the probe-scsi command.

CODE EXAMPLE 1-1 probe-scsi Command Output

{1} ok probe-scsi
Target 0 
  Unit 0   Disk     SEAGATE ST373307LSUN72G 0207
Target 1 
  Unit 0   Disk     SEAGATE ST336607LSUN36G 0207
{1} ok 

The following sample output is from the probe-scsi-all command.

CODE EXAMPLE 1-2 probe-scsi-all Command Output

{1} ok probe-scsi-all
/pci@1c,600000/scsi@2,1
 
/pci@1c,600000/scsi@2
Target 0 
  Unit 0   Disk     SEAGATE ST373307LSUN72G 0207
Target 1 
  Unit 0   Disk     SEAGATE ST336607LSUN36G 0207
 
{1} ok 
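
If you capture this output from the console (for example, through a terminal session log), a short script can turn it into a structure that is easier to compare between servers. The following Python sketch is an illustration under that assumption; the function name parse_probe_scsi_all is hypothetical, and the parser handles only the layout shown in CODE EXAMPLE 1-2.

```python
def parse_probe_scsi_all(text):
    """Map each controller path to a list of (target, unit, description)."""
    controllers = {}
    current = None
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # A device path starts a new controller section.
            current = controllers.setdefault(line, [])
            target = None
        elif line.startswith("Target "):
            target = line.split()[1]
        elif line.startswith("Unit ") and current is not None and target is not None:
            parts = line.split(None, 2)
            description = " ".join(parts[2].split())
            current.append((target, parts[1], description))
    return controllers


SAMPLE = """\
{1} ok probe-scsi-all
/pci@1c,600000/scsi@2,1

/pci@1c,600000/scsi@2
Target 0
  Unit 0   Disk     SEAGATE ST373307LSUN72G 0207
Target 1
  Unit 0   Disk     SEAGATE ST336607LSUN36G 0207

{1} ok
"""

if __name__ == "__main__":
    for path, devices in parse_probe_scsi_all(SAMPLE).items():
        print(path, devices)
```

Target and unit numbers are kept as strings so that the sketch makes no assumption about their number base.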

probe-ide Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, issuing the probe-ide command can hang the system.



The following sample output is from the probe-ide command.

CODE EXAMPLE 1-3 probe-ide Command Output

{1} ok probe-ide
Device 0  ( Primary Master ) 
         Not Present
 
  Device 1  ( Primary Slave ) 
         Not Present
 
  Device 2  ( Secondary Master ) 
         Not Present
 
  Device 3  ( Secondary Slave ) 
         Not Present
 
{1} ok 

show-devs Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 1-4 shows some sample output.


CODE EXAMPLE 1-4 show-devs Command Output
/pci@1d,700000
/pci@1c,600000
/pci@1e,600000
/pci@1f,700000
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/pci@1d,700000/network@2,1
/pci@1d,700000/network@2
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
/pci@1c,600000/scsi@2,1/tape
/pci@1c,600000/scsi@2,1/disk
/pci@1c,600000/scsi@2/tape
/pci@1c,600000/scsi@2/disk     
/pci@1e,600000/ide@d
/pci@1e,600000/usb@a
/pci@1e,600000/pmu@6
/pci@1e,600000/isa@7
/pci@1e,600000/ide@d/cdrom
/pci@1e,600000/ide@d/disk
.........


To Run OpenBoot Commands

1. Halt the system to reach the ok prompt.

Inform users before you shut down the system.

2. Type the appropriate command at the console prompt.


OpenBoot Diagnostics

Like POST diagnostics, OpenBoot diagnostics code is firmware-based and resides in the Boot PROM.


To Start OpenBoot Diagnostics

1. Type:


ok setenv diag-switch? true
ok setenv auto-boot? false
ok reset-all

2. Type:


ok obdiag

This command displays the OpenBoot diagnostics menu.


ok obdiag
_____________________________________________________________________________
|                                 o b d i a g                                |
|_________________________ __________________________________________________|
|                         |                         |                        |
|  1 flashprom@2,0        |  2 i2c@0,320            |  3 ide@d               |
|  4 network@2            |  5 network@2            |  6 network@2,1         |
|  7 network@2,1          |  8 rmc-comm@0,3e8       |  9 rtc@0,70            |
| 10 scsi@2               | 11 scsi@2,1             | 12 serial@0,2e8        |
| 13 serial@0,3f8         |                         |                        |
|_________________________|_________________________|________________________|
|   Commands: test test-all except help what setenv set-default exit         |
|____________________________________________________________________________|



Note - If you have a PCI card installed inside the server, additional tests appear on the obdiag menu.



3. Type:


obdiag> test n

Where n represents the number corresponding to the test you want to run.

A summary of the tests is available. At the obdiag> prompt, type:


obdiag> help

Controlling OpenBoot Diagnostics Tests

Most of the OpenBoot configuration variables you use to control POST (see TABLE 1-4) also affect OpenBoot diagnostics tests.

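You can further customize OpenBoot diagnostics testing with the test-args OpenBoot configuration variable. By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 1-5.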


TABLE 1-5 Keywords for the test-args OpenBoot Configuration Variable

| Keyword | Description |
|---|---|
| bist | Invokes built-in self-test (BIST) on external and peripheral devices. |
| debug | Displays all debug messages. |
| iopath | Verifies bus and interconnect integrity. |
| loopback | Exercises external loopback path for the device. |
| media | Verifies external and peripheral device media accessibility. |
| restore | Attempts to restore original state of the device if the previous execution of the test failed. |
| silent | Displays only errors rather than the status of each test. |
| subtests | Displays main test and each subtest that is called. |
| verbose | Displays detailed status messages for all tests. |
| callers=n | Displays backtrace of n callers when an error occurs. callers=0 displays backtrace of all callers before the error. |
| errors=n | Continues executing the test until n errors are encountered. errors=0 displays all error reports without terminating testing. |


If you want to customize the OpenBoot diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:


ok setenv test-args debug,loopback,media
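
When test-args values are generated by a script (for example, as part of a documented test plan), it can help to check the keywords against TABLE 1-5 before typing them at the ok prompt. The following Python sketch is one possible way to do that. The function name build_test_args is hypothetical, and the rule that callers= and errors= take a non-negative integer is an assumption based on the table.

```python
# Keywords from TABLE 1-5. callers= and errors= take a numeric argument.
SIMPLE_KEYWORDS = {
    "bist", "debug", "iopath", "loopback", "media",
    "restore", "silent", "subtests", "verbose",
}
NUMERIC_KEYWORDS = {"callers", "errors"}


def build_test_args(*keywords):
    """Validate keywords and return a comma-separated test-args value."""
    for keyword in keywords:
        name, sep, value = keyword.partition("=")
        if sep:
            if name not in NUMERIC_KEYWORDS or not value.isdigit():
                raise ValueError("unsupported keyword: " + keyword)
        elif name not in SIMPLE_KEYWORDS:
            raise ValueError("unsupported keyword: " + keyword)
    return ",".join(keywords)


if __name__ == "__main__":
    # Prints the command line to type at the ok prompt.
    print("setenv test-args " + build_test_args("debug", "loopback", "media"))
```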

test and test-all Commands

You can also run OpenBoot diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:


ok test /pci@x,y/SUNW,qlc@2

To customize an individual test, you can use test-args, as follows:


ok test /usb@1,3:test-args={verbose,debug}

This syntax affects only the current test without changing the value of the
test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:


ok test-all

CODE EXAMPLE 1-5 displays a sample OpenBoot diagnostics test report where all tests have passed.

CODE EXAMPLE 1-5 OpenBoot Diagnostics Test Report

Hit the spacebar to interrupt testing
Testing /pci@1e,600000/isa@7/flashprom@2,0 ............................passed
Testing /pci@1e,600000/isa@7/i2c@0,320 ................................passed
Testing /pci@1e,600000/ide@d ..........................................passed
Testing /pci@1f,700000/network@2 ......................................passed
Testing /pci@1d,700000/network@2 ......................................passed
Testing /pci@1f,700000/network@2,1 ....................................passed
Testing /pci@1d,700000/network@2,1 ....................................passed
Testing /pci@1e,600000/isa@7/rmc-comm@0,3e8 ...........................passed
Testing /pci@1e,600000/isa@7/rtc@0,70 .................................passed
Testing /pci@1c,600000/scsi@2 .........................................passed
Testing /pci@1c,600000/scsi@2,1 .......................................passed
Testing /pci@1e,600000/isa@7/serial@0,2e8 .............................passed
Testing /pci@1e,600000/isa@7/serial@0,3f8 .............................passed
Pass:1 (of 1) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:0:25
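
A report like the one in CODE EXAMPLE 1-5 can be summarized with a few lines of Python once it has been captured to text. The sketch below is illustrative: parse_report is a hypothetical name, and the pattern covers only the one-line "Testing path ....result" layout shown above. Treating a trailing "failed" on the same line as a failure is an assumption.

```python
import re

# Matches lines such as "Testing /pci@1e,600000/ide@d ......passed".
RESULT_LINE = re.compile(r"^Testing\s+(\S+)\s+\.+\s*(passed|failed)\s*$")


def parse_report(text):
    """Return {device_path: result} for each one-line test result."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            results[match.group(1)] = match.group(2)
    return results


SAMPLE = """\
Hit the spacebar to interrupt testing
Testing /pci@1e,600000/isa@7/flashprom@2,0 ............................passed
Testing /pci@1e,600000/ide@d ..........................................passed
Pass:1 (of 1) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:0:25
"""

if __name__ == "__main__":
    for path, result in parse_report(SAMPLE).items():
        print(result, path)
```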

If you specify a path argument to test-all, only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:


ok test-all /pci@9,700000/usb@1,3

OpenBoot Diagnostics Error Messages

OpenBoot diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 1-6 displays a sample OpenBoot diagnostics error message.

CODE EXAMPLE 1-6 OpenBoot Diagnostics Error Message

Testing /pci@1e,600000/isa@7/flashprom@2,0 
 
   ERROR   : FLASHPROM CRC-32 is incorrect
   SUMMARY : Obs=0x729f6392 Exp=0x3d6cdf53 XOR=0x4ff3bcc1 Addr=0xfeebbffc 
   DEVICE  : /pci@1e,600000/isa@7/flashprom@2,0
   SUBTEST : selftest:crc-subtest
   MACHINE : Netra 240
   SERIAL# : 52965531 
   DATE    : 03/05/2003 01:33:59  GMT 
   CONTR0LS: diag-level=max test-args=
 
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) ............. failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:27
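
The labeled fields in an error report lend themselves to simple parsing, for example when attaching the details to a service request. The following Python sketch assumes the "LABEL : value" layout shown in CODE EXAMPLE 1-6. The function name parse_error_block is hypothetical.

```python
def parse_error_block(text):
    """Collect upper-case 'LABEL : value' lines from an error report."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Field labels in the sample are single upper-case words.
        if sep and key and key.isupper() and " " not in key:
            fields[key] = value.strip()
    return fields


SAMPLE = """\
Testing /pci@1e,600000/isa@7/flashprom@2,0

   ERROR   : FLASHPROM CRC-32 is incorrect
   SUMMARY : Obs=0x729f6392 Exp=0x3d6cdf53 XOR=0x4ff3bcc1 Addr=0xfeebbffc
   DEVICE  : /pci@1e,600000/isa@7/flashprom@2,0
   SUBTEST : selftest:crc-subtest
   MACHINE : Netra 240

Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:27
"""

if __name__ == "__main__":
    for label, value in parse_error_block(SAMPLE).items():
        print(label, "=", value)
```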


Operating System Diagnostic Tools

When the system passes OpenBoot diagnostics tests, it attempts to boot the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based diagnostic tools and the SunVTS software. These tools enable you to monitor the server, exercise it, and isolate faults.



Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.



In addition to the tools just mentioned, you can refer to error and system message log files and to Solaris software information commands.

Error and System Message Log Files

Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
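
Because this file collects messages from many sources, filtering it is often the first step. The following Python sketch shows one way to pick out lines of interest. The function name filter_messages and the default keywords are assumptions for illustration; this document does not specify the format of entries in /var/adm/messages.

```python
def filter_messages(lines, keywords=("SC Alert", "WARNING")):
    """Yield log lines that contain any of the given keywords.

    The default keywords are illustrative assumptions, not a list of
    strings that the Solaris OS is documented to emit.
    """
    for line in lines:
        if any(keyword in line for keyword in keywords):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Reading the log typically requires appropriate privileges.
    with open("/var/adm/messages") as log:
        for entry in filter_messages(log):
            print(entry)
```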

Solaris Software System Information Commands

The following Solaris software system information commands display data that you can use when assessing the condition of a Netra 240 server:

  • prtconf
  • prtdiag
  • prtfru
  • psrinfo
  • showrev

This section describes the information that these commands give you. For more information about using these commands, refer to the appropriate man page.

prtconf Command

The prtconf command displays the Solaris software device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, such as individual disks that only the operating system software recognizes. The output of prtconf also includes the total size of system memory. CODE EXAMPLE 1-7 shows an excerpt of prtconf output.

CODE EXAMPLE 1-7 prtconf Command Output

# prtconf
 
System Configuration:  Sun Microsystems  sun4u
Memory size: 5120 Megabytes
System Peripherals (Software Nodes):
 
SUNW,Netra-240
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        SUNW,i2c-ram-device (driver not attached)
        SUNW,fru-device (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #0
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #1
    pci, instance #0
........

The -p option of the prtconf command produces output similar to that of the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
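
The memory size reported by prtconf is a convenient value to check from a script, for example after memory modules have been replaced. The following Python sketch extracts it from captured output. It assumes the "Memory size: N Megabytes" line format shown in CODE EXAMPLE 1-7, and the function name memory_megabytes is hypothetical.

```python
import re


def memory_megabytes(text):
    """Return N from the 'Memory size: N Megabytes' line, or None."""
    match = re.search(r"^Memory size:\s*(\d+)\s+Megabytes", text, re.MULTILINE)
    return int(match.group(1)) if match else None


SAMPLE = """\
System Configuration:  Sun Microsystems  sun4u
Memory size: 5120 Megabytes
System Peripherals (Software Nodes):
"""

if __name__ == "__main__":
    print(memory_megabytes(SAMPLE))
```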

prtdiag Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. The following code example is an excerpt of some of the output produced by prtdiag on a functional Netra 240 server running Solaris software.


CODE EXAMPLE 1-8 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems  sun4u Netra 240
System clock frequency: 160 MHZ
Memory size: 2GB 
==================================== CPUs ====================================
                      E$          CPU     CPU       Temperature         Fan
       CPU  Freq      Size        Impl.   Mask     Die    Ambient   Speed   Unit
       ---  --------  ----------  ------  ----  --------  --------  -----   ----
     MB/P0  1280 MHz  1MB         US-IIIi   2.3     -     -  
     MB/P1  1280 MHz  1MB         US-IIIi   2.3     -     - 
================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   pci    66            2  network-pci14e4,1648.108e.16+                    
 0   pci    66            2  network-pci14e4,1648.108e.16+                    
 0   pci    66            2  scsi-pci1000,21.1000.1000.1 +                    
 0   pci    66            2  scsi-pci1000,21.1000.1000.1 +                    
 0   pci    66            2  network-pci14e4,1648.108e.16+                    
 0   pci    66            2  network-pci14e4,1648.108e.16+                    
 0   pci    33            7  isa/serial-su16550 (serial)                      
 0   pci    33            7  isa/serial-su16550 (serial)                      
 0   pci    33            7  isa/rmc-comm-rmc_comm (seria+                    
 0   pci    33           13  ide-pci10b9,5229.c4 (ide) 
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                1GB               1           GroupID 0 
0x1000000000       1GB               1           GroupID 0 
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        MB/P0/B0/D0,MB/P0/B0/D1
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
1              0        MB/P1/B0/D0,MB/P1/B0/D1

In addition to the information in CODE EXAMPLE 1-8, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures (see CODE EXAMPLE 1-9).

CODE EXAMPLE 1-9 prtdiag Verbose Output

---------------------------------------------------------------
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
MB         T_ENC           22C     -7C   -5C   55C   58C   okay
MB/P0      T_CORE          57C     -       -    110C  115C okay
MB/P1      T_CORE          54C     -       -    110C  115C okay
PS0        FF_OT           -       -       -    -       -  okay
PS1        FF_OT           -       -       -    -       -  okay

In the event of an overtemperature condition, prtdiag reports an error in the Status column (CODE EXAMPLE 1-10).

CODE EXAMPLE 1-10 prtdiag Overtemperature Indication Output

---------------------------------------------------------------
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
MB         T_ENC           22C    -7C    -5C     55C   58C okay
MB/P0      T_CORE         118C     -       -    110C  115C failed
MB/P1      T_CORE         112C     -       -    110C  115C warning
PS0        FF_OT           -       -       -    -       -  okay
PS1        FF_OT           -       -       -    -       -  okay
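
A script that watches for overtemperature conditions only needs the Status column. The following Python sketch flags every sensor whose status is not okay. It assumes the eight-column layout shown in CODE EXAMPLE 1-10, which can vary with the Solaris OS version, and the function name find_temperature_problems is hypothetical.

```python
def find_temperature_problems(text):
    """Return (location, sensor, temperature, status) for non-okay rows."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows in the sample have eight whitespace-separated columns.
        if len(fields) != 8 or fields[0] == "Location":
            continue
        status = fields[7]
        if status != "okay":
            problems.append((fields[0], fields[1], fields[2], status))
    return problems


SAMPLE = """\
---------------------------------------------------------------
Location   Sensor      Temperature  Lo LoWarn HiWarn  Hi Status
---------------------------------------------------------------
MB         T_ENC           22C    -7C    -5C     55C   58C okay
MB/P0      T_CORE         118C     -       -    110C  115C failed
MB/P1      T_CORE         112C     -       -    110C  115C warning
PS0        FF_OT           -       -       -    -       -  okay
PS1        FF_OT           -       -       -    -       -  okay
"""

if __name__ == "__main__":
    for problem in find_temperature_problems(SAMPLE):
        print(problem)
```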

Similarly, if a particular component fails, prtdiag reports a fault in the appropriate status column (CODE EXAMPLE 1-11).

CODE EXAMPLE 1-11 prtdiag Fault Indication Output

Fan Speeds:
-----------------------------------------
Location       Sensor      Status   Speed
-----------------------------------------
MB/P0/F0       RS          failed   0 rpm         
MB/P0/F1       RS          okay     3994 rpm         
F2             RS          okay     2896 rpm         
PS0            FF_FAN      okay         
F3             RS          okay     2576 rpm         
PS1            FF_FAN      okay         
---------------------------------

prtfru Command

The Netra 240 server maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 1-12 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.


CODE EXAMPLE 1-12 prtfru -l Command Output
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT/battery (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0/fan-unit (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1/fan-unit (fru)
........

CODE EXAMPLE 1-13 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option. This output displays only the containers and their data and does not print the FRU tree hierarchy.

CODE EXAMPLE 1-13 prtfru -c Command Output

# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
   SEGMENT: SD
      /ManR
      /ManR/UNIX_Timestamp32: Mon Dec  2 19:47:38 PST 2002
      /ManR/Fru_Description: FRUID,INSTR,M'BD,2X1.28GHZ,CPU
      /ManR/Manufacture_Loc: Hsinchu,Taiwan
      /ManR/Sun_Part_No: 3753120
      /ManR/Sun_Serial_No: 000615
      /ManR/Vendor_Name: Mitac International
      /ManR/Initial_HW_Dash_Level: 02
      /ManR/Initial_HW_Rev_Level: 0E
      /ManR/Fru_Shortname: MOTHERBOARD
      /SpecPartNo: 885-0076-11
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D0?Label=D0/mem-module (container)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D1?Label=D1/mem-module (container)
........

Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes the following:

psrinfo Command

The psrinfo command displays the date and time that each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. CODE EXAMPLE 1-14 shows sample output from the psrinfo command with the -v option.

CODE EXAMPLE 1-14 psrinfo -v Command Output

# psrinfo -v
Status of processor 0 as of: 07/28/2003 14:43:29
  Processor has been on-line since 07/21/2003 18:43:37.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 07/28/2003 14:43:29
  Processor has been on-line since 07/21/2003 18:43:36.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
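When the CPU status has to be checked by a script rather than by eye, the verbose output can be reduced to one record per processor. The Python sketch below is an illustration, not part of the Solaris software: parse_psrinfo_v is a hypothetical helper, and it assumes the line layout shown in CODE EXAMPLE 1-14.

```python
import re

# Sample text in the layout of CODE EXAMPLE 1-14.
SAMPLE = """\
Status of processor 0 as of: 07/28/2003 14:43:29
  Processor has been on-line since 07/21/2003 18:43:37.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
Status of processor 1 as of: 07/28/2003 14:43:29
  Processor has been on-line since 07/21/2003 18:43:36.
  The sparcv9 processor operates at 1280 MHz,
        and has a sparcv9 floating point processor.
"""


def parse_psrinfo_v(text):
    """Return {cpu_id: {"online_since": str, "mhz": int}} from psrinfo -v text."""
    cpus = {}
    cpu_id = None
    for line in text.splitlines():
        header = re.match(r"Status of processor (\d+) as of:", line)
        if header:
            cpu_id = int(header.group(1))
            cpus[cpu_id] = {}
            continue
        if cpu_id is None:
            continue  # ignore anything before the first processor header
        since = re.search(r"on-line since (\d\d/\d\d/\d{4} \d\d:\d\d:\d\d)", line)
        if since:
            cpus[cpu_id]["online_since"] = since.group(1)
        speed = re.search(r"operates at (\d+) MHz", line)
        if speed:
            cpus[cpu_id]["mhz"] = int(speed.group(1))
    return cpus


for cpu, info in sorted(parse_psrinfo_v(SAMPLE).items()):
    print(cpu, info["mhz"], info["online_since"])
```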

showrev Command

The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 1-15 shows sample output from the showrev command.

CODE EXAMPLE 1-15 showrev Command Output

# showrev
Hostname: vsp78-36
Hostid: 8328c87b
Release: 5.8
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: vsplab.SFBay.Sun.COM
Kernel version: SunOS 5.8 Generic 108528-18 November 2002

When used with the -p option, the showrev command displays installed patches. CODE EXAMPLE 1-16 shows a partial sample output from the showrev command with the -p option.

CODE EXAMPLE 1-16 showrev -p Command Output

# showrev -p
Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109783-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109807-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109809-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110905-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110910-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110914-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr
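Because each patch occupies one line with fixed field labels, the listing can be searched programmatically, for example to confirm that a particular patch is installed. The following Python sketch is a hedged illustration: parse_showrev_p is a hypothetical name, the sample lines are copied from CODE EXAMPLE 1-16, and the treatment of multi-entry fields (split on commas or spaces) is an assumption.

```python
import re

# Two lines copied from CODE EXAMPLE 1-16.
SAMPLE = """\
Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr
"""

LINE_RE = re.compile(
    r"Patch:\s+(\S+)\s+Obsoletes:(.*?)Requires:(.*?)"
    r"Incompatibles:(.*?)Packages:(.*)$"
)


def _items(field):
    """Split a field on commas or whitespace and drop empty entries."""
    return [item for item in re.split(r"[,\s]+", field.strip()) if item]


def parse_showrev_p(text):
    """Return one dict per patch line found in showrev -p text."""
    patches = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip prompts, blank lines, and anything unexpected
        patch, obsoletes, requires, incompatibles, packages = match.groups()
        patches.append({
            "patch": patch,
            "obsoletes": _items(obsoletes),
            "requires": _items(requires),
            "incompatibles": _items(incompatibles),
            "packages": _items(packages),
        })
    return patches


installed = {entry["patch"] for entry in parse_showrev_p(SAMPLE)}
print("108964-04" in installed)
```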


To Run Solaris Platform System Information Commands

At a command prompt, type the command for the kind of system information you want to display.

For more information, see Solaris Software System Information Commands. See TABLE 1-6 for a summary of the commands.


TABLE 1-6 Solaris Platform Information Display Commands

Command  What It Displays                                           What to Type                      Notes
prtconf  System configuration information                           /usr/sbin/prtconf                 --
prtdiag  Diagnostic and configuration information                   /usr/platform/sun4u/sbin/prtdiag  Use the -v option for additional detail.
prtfru   FRU hierarchy and SEEPROM memory contents                  /usr/sbin/prtfru                  Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo  Date and time each CPU came online; processor clock speed  /usr/sbin/psrinfo                 Use the -v option to obtain clock speed and other data.
showrev  Hardware and software revision information                 /usr/bin/showrev                  Use the -p option to show software patches.
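For a script that collects this information in one pass, the full paths from TABLE 1-6 can be kept in a single mapping. The Python sketch below is one possible arrangement, not a supplied tool: build_argv and run_if_present are hypothetical helpers, and the code assumes Python 3.7 or later is available on the host.

```python
import os
import subprocess

# Full paths copied from TABLE 1-6.
COMMAND_PATHS = {
    "prtconf": "/usr/sbin/prtconf",
    "prtdiag": "/usr/platform/sun4u/sbin/prtdiag",
    "prtfru": "/usr/sbin/prtfru",
    "psrinfo": "/usr/sbin/psrinfo",
    "showrev": "/usr/bin/showrev",
}


def build_argv(name, *options):
    """Return the argument list for one of the TABLE 1-6 commands."""
    return [COMMAND_PATHS[name], *options]


def run_if_present(name, *options):
    """Run the command and return its output, or None if it is not installed."""
    argv = build_argv(name, *options)
    if not os.path.exists(argv[0]):
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


print(build_argv("prtfru", "-c"))
```

Returning None for a missing binary lets the same script run on a workstation without these commands, which can be convenient when testing the surrounding logic.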



Recent Diagnostic Test Results

Summaries of the results from the most recent power-on self-test (POST) diagnostics tests are saved across power cycles.


To View Recent POST Test Results

1. Go to the ok prompt.

2. To see a summary of the most recent POST results, type:


ok show-post-results

This command produces a system-dependent list of hardware components, along with an indication of which components passed and which failed POST diagnostics tests.


OpenBoot Configuration Variables

Switches and diagnostic configuration variables stored in the IDPROM determine how and when POST diagnostics and OpenBoot diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 1-4.

Changes to OpenBoot configuration variables take effect at the next reboot.


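To View and Set OpenBoot Configuration Variables

Halt the server to display the ok prompt.

To display the current and default values of the OpenBoot configuration variables, type printenv. To change a value, type setenv followed by the variable name and the new value.

The following example shows a short excerpt of the printenv command's output.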


ok printenv
Variable Name         Value                          Default Value
 
diag-level            min                            min
diag-switch?          false                          false
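If the console session is captured to a file, the printenv listing can be compared against expected values automatically. The Python sketch below is an illustration under stated assumptions: parse_printenv is a hypothetical helper, columns are assumed to be separated by two or more spaces, and values that themselves contain runs of spaces are not handled.

```python
import re

# Sample text copied from the printenv excerpt above.
SAMPLE = """\
ok printenv
Variable Name         Value                          Default Value

diag-level            min                            min
diag-switch?          false                          false
"""


def parse_printenv(text):
    """Return {variable: (value, default)} from captured printenv output."""
    variables = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("ok ", "Variable Name")):
            continue  # skip blank lines, the prompt, and the column header
        parts = re.split(r"\s{2,}", stripped)
        # Pad so a variable with an empty value or default still unpacks.
        name, value, default = (parts + ["", ""])[:3]
        variables[name] = (value, default)
    return variables


env = parse_printenv(SAMPLE)
changed = [name for name, (value, default) in env.items() if value != default]
print(env["diag-switch?"], changed)
```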

Using the watch-net and watch-net-all Commands to Check the Network Connections

The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Each good packet received by the system is indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated by an X and an associated error description.

To start the watch-net diagnostic test, type the watch-net command at the ok prompt (CODE EXAMPLE 1-17).


CODE EXAMPLE 1-17 watch-net Diagnostic Output Message

{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.................................
 

To start the watch-net-all diagnostic test, type watch-net-all at the ok prompt (CODE EXAMPLE 1-18).


CODE EXAMPLE 1-18 watch-net-all Diagnostic Output Message

{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up 
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
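A captured console log from either test can be summarized by counting the packet markers that follow the banner. The Python sketch below is a rough illustration only: count_packets is a hypothetical helper, the sample stream is invented for the example, and the code assumes the stream contains only period and X markers. Error descriptions printed after an X are not modeled and would distort the counts.

```python
# The last banner line printed before the packet markers begin.
MARKER = "Type any key to stop."

# Invented sample: five good packets, one bad packet, four good packets.
SAMPLE = (
    "Looking for Ethernet Packets.\n"
    "`.' is a Good Packet. `X' is a Bad Packet.\n"
    + MARKER + "." * 5 + "X" + "." * 4 + "\n"
)


def count_packets(console_text):
    """Count good (.) and bad (X) markers that follow the watch-net banner."""
    _, found, stream = console_text.partition(MARKER)
    if not found:
        raise ValueError("watch-net banner not found in console text")
    return {"good": stream.count("."), "bad": stream.count("X")}


print(count_packets(SAMPLE))
```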
 


Automatic System Recovery



Note - Automatic System Recovery (ASR) is not the same as Automatic Server Restart, which the Netra 240 server also supports. For information about Automatic Server Restart, see Chapter 3.



Automatic System Recovery (ASR) consists of self-test features and an auto-configuring capability to detect failed hardware components and unconfigure them. When ASR is enabled, the server can resume operating after certain nonfatal hardware faults or failures have occurred.

If a component is monitored by ASR and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails. This capability keeps a faulty hardware component from stopping the entire system or causing the system to fail repeatedly.

If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.

To support this degraded boot capability, the OpenBoot firmware uses the IEEE 1275 Client Interface (by means of the device tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem marked in this way.

As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.

Once a failed or disabled device is replaced with a new one, the OpenBoot firmware automatically modifies the status of the device upon reboot.



Note - ASR is not enabled until you activate it (see To Enable ASR).



Auto-Boot Options

The auto-boot? setting controls whether the firmware automatically boots the operating system after each reset. The default setting is true.

The auto-boot-on-error? setting controls whether the system attempts a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true to enable an automatic degraded boot.

To set the switches, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. Therefore, the system does not attempt a degraded boot unless you change this setting to true. In addition, the system does not attempt a degraded boot in response to any fatal non-recoverable error, even if degraded booting is enabled. For examples of fatal non-recoverable errors, see Error-Handling Summary.
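The interaction of the two settings can be restated as a small decision function. The Python sketch below models only the rules given in this section; it is not the firmware's logic. The function name boot_action and the returned strings are hypothetical labels chosen for the illustration.

```python
def boot_action(auto_boot, auto_boot_on_error, nonfatal_errors=False, fatal_error=False):
    """Return the boot outcome implied by the auto-boot rules in this section.

    auto_boot and auto_boot_on_error mirror the auto-boot? and
    auto-boot-on-error? settings; the two error flags describe what
    the firmware diagnostics detected.
    """
    if fatal_error:
        # A fatal non-recoverable error rules out any degraded boot.
        return "no automatic boot"
    if not auto_boot:
        return "no automatic boot"
    if nonfatal_errors:
        # A degraded boot requires both settings to be true.
        return "degraded boot" if auto_boot_on_error else "no automatic boot"
    return "normal boot"


# Default settings (auto-boot? true, auto-boot-on-error? false) with a nonfatal fault:
print(boot_action(True, False, nonfatal_errors=True))
# Both settings true with the same fault:
print(boot_action(True, True, nonfatal_errors=True))
```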



Error-Handling Summary

Error handling during the power-on sequence can be summarized in the following three ways:



Note - If POST or OpenBoot diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.



Reset Scenarios

Three OpenBoot configuration variables (diag-switch?, obdiag-trigger, and post-trigger) control how the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST and OpenBoot diagnostics unless diag-switch? is set to true. The default setting for this variable is false. Because ASR relies on firmware diagnostics to detect faulty devices, diag-switch? must be set to true for ASR to run. For instructions, see To Enable ASR.

To control which reset events, if any, automatically initiate firmware diagnostics, use obdiag-trigger and post-trigger. For detailed explanations of these variables and their uses, see Controlling POST Diagnostics and Controlling OpenBoot Diagnostics Tests.


To Enable ASR

1. At the system ok prompt, type:


ok setenv diag-switch? true
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true

2. Set the obdiag-trigger variable to power-on-reset, error-reset, or user-reset.

For example, type:


ok setenv obdiag-trigger user-reset

3. Type:


ok reset-all

The system permanently stores the parameter changes and boots automatically if the OpenBoot variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.
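The settings in this procedure can be verified against a parsed printenv listing. The Python sketch below is a hedged illustration: asr_problems is a hypothetical helper, the input is assumed to be a plain mapping of variable names to current values, and accepting a space-separated list of obdiag-trigger values is an assumption rather than something stated in this procedure.

```python
# Values required by the To Enable ASR procedure.
REQUIRED_TRUE = ("diag-switch?", "auto-boot?", "auto-boot-on-error?")
ACCEPTED_TRIGGERS = {"power-on-reset", "error-reset", "user-reset"}


def asr_problems(settings):
    """Return a list of messages for settings that do not match the procedure."""
    problems = []
    for name in REQUIRED_TRUE:
        value = settings.get(name, "unset")
        if value != "true":
            problems.append(f"{name} should be true (currently {value})")
    # Assumption: the variable may hold several space-separated trigger names.
    triggers = set(settings.get("obdiag-trigger", "").split())
    if not triggers & ACCEPTED_TRIGGERS:
        problems.append(
            "obdiag-trigger should be power-on-reset, error-reset, or user-reset"
        )
    return problems


example = {
    "diag-switch?": "false",
    "auto-boot?": "true",
    "auto-boot-on-error?": "true",
    "obdiag-trigger": "user-reset",
}
for message in asr_problems(example):
    print(message)
```

An empty list means every setting named in the procedure has the expected value.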




To Disable ASR

1. At the system ok prompt, type:


ok setenv diag-switch? false

2. Type:


ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.




1 NC state is the normally closed state. This state represents the default mode of the relay contacts in the normally closed state.
2 NO state is the normally open state. This state represents the default mode of the relay contacts in the normally open state.
3 The implementation of this alarm indicator state is subject to change.
4 The user can shut down the system using commands such as init 0 and init 6. This does not include the system power shutdown.
5 Based on a determination of the fault conditions, the user can turn the alarm on using the Solaris platform alarm API or ALOM CLI. For more information about the alarm API, see Appendix A. For more information about the ALOM CLI, refer to the Sun Advanced Lights Out Manager Software User's Guide for the Netra 240 Server (part number 817-3174).
6 The post-trigger and obdiag-trigger variables are obsolete in releases of OBP after 4.16.2.
7 POST messages cannot be displayed on a graphics terminal. They are sent to ttya even when output-device is set to screen.