Sun Fire V210 and V240 Servers Administration Guide
This chapter describes the diagnostics tools available to the Sun Fire V210 and V240 servers. The chapter contains the sections:
- Section 6.1, Overview of Diagnostic Tools
- Section 6.2, Status Indicators
- Section 6.3, Sun Advanced Lights Out Manager
- Section 6.4, POST Diagnostics
- Section 6.5, OpenBoot Diagnostics
- Section 6.6, OpenBoot Commands
- Section 6.7, Operating System Diagnostic Tools
- Section 6.8, Recent Diagnostic Test Results
- Section 6.9, OpenBoot Configuration Variables
- Section 6.10, Additional Diagnostic Tests for Specific Devices
- Section 6.11, Automatic System Recovery
6.1 Overview of Diagnostic Tools
Sun provides a range of diagnostic tools for use with the Sun Fire V210 and V240 servers.
These diagnostic tools are summarized in TABLE 6-1.
TABLE 6-1 Summary of Diagnostic Tools
Diagnostic Tool | Type | What It Does | Accessibility and Availability | Remote Capability
LEDs | Hardware | Indicate status of overall system and particular components. | Accessed from system chassis. Available anytime power is available. | Local, but can be viewed via ALOM
ALOM | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access. | Can function on standby power and without operating system. | Designed for remote access
POST | Firmware | Tests core components of system. | Runs automatically on startup. Available when the operating system is not running. | Local, but can be viewed via ALOM
OpenBoot Diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices. | Runs automatically or interactively. Available when the operating system is not running. | Local, but can be viewed via ALOM
OpenBoot commands | Firmware | Display various kinds of system information. | Available when the operating system is not running. | Local, but can be accessed via ALOM
Solaris commands | Software | Display various kinds of system information. | Requires operating system. | Local, but can be accessed via ALOM
SunVTS | Software | Exercises and stresses the system, running tests in parallel. | Requires operating system functionality. Optional package may need to be installed. | View and control over network
Sun Management Center | Software | Monitors both hardware environmental conditions and software performance of multiple machines. Generates alerts for various conditions. | Requires operating system to be running on both monitored and master servers. Requires a dedicated database on the master server. | Designed for remote access
Hardware Diagnostic Suite | Software | Exercises an operational system by running sequential tests. Also reports failed FRUs. | Separately purchased optional add-on to Sun Management Center. Requires operating system and Sun Management Center. | Designed for remote access
6.2 Status Indicators
For a summary of the server's LED status indicators, see Section 1.2.1, Server Status Indicators.
6.3 Sun Advanced Lights Out Manager
Both the Sun Fire V210 server and the Sun Fire V240 server are shipped with Sun Advanced Lights Out Manager (ALOM) pre-installed.
ALOM enables you to monitor and control your server through either a serial connection (using the SERIAL MGT port) or an Ethernet connection (using the NET MGT port).
ALOM can send email notification of hardware failures or other server events.
The ALOM circuitry uses standby power from the server. This means that:
- ALOM is active as soon as the server is connected to a power source, and until power is removed by unplugging the power cable.
- ALOM continues to be effective when the server operating system goes off-line.
See TABLE 3-1 for a list of the components monitored by ALOM and the information it provides for each.
Tip - For additional information, see the Advanced Lights Out Management User's Guide (817-5481).
|
6.4 POST Diagnostics
POST is a firmware program that is useful in determining if a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module or modules, motherboard, memory, and some on-board I/O devices. POST generates messages that can be useful in determining the nature of a hardware failure. POST can be run even if the system is unable to boot.
POST, which resides in the motherboard OpenBoot PROM, detects most system faults. The OpenBoot firmware can run POST at power-up when you set two OpenBoot configuration variables, diag-switch? and diag-level, which are stored on the system configuration card.
POST runs automatically when the system power is applied and all of the following conditions apply:
- diag-switch? is set to true (default is false)
- diag-level is set to min, max or menus (default is min)
POST also runs automatically when the system is reset and all of the following conditions apply:
- diag-switch? is set to false (default is false)
- the current type of system reset matches any of the reset types set in post-trigger
- diag-level is set to min, max or menus (default is min)
If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively.
If diag-level is set to menus, a menu of all the tests executed at power up is displayed.
POST diagnostic and error message reports are displayed on a console.
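The two condition lists above can be sketched as a small decision function. This is an illustration only: the function name, the "power-on" event label, and the choice to report False for any case the text does not cover are assumptions, not firmware behavior confirmed by this guide.

```python
# Sketch only: encodes the two condition lists above. The event names and
# the function itself are hypothetical, not part of the OpenBoot firmware.
RUN_LEVELS = {"min", "max", "menus"}


def post_runs(event, diag_switch, diag_level, post_trigger=()):
    """Return True if the stated conditions for an automatic POST run hold.

    event        -- "power-on", or the name of a reset type
    diag_switch  -- value of diag-switch? as a bool
    diag_level   -- value of diag-level as a string
    post_trigger -- reset types configured in post-trigger
    """
    if diag_level not in RUN_LEVELS:
        return False
    if event == "power-on":
        return diag_switch
    # Reset case: diag-switch? must be false and the reset type must match.
    return (not diag_switch) and event in post_trigger


print(post_runs("power-on", True, "min"))
print(post_runs("user-reset", False, "max", ("user-reset",)))
```

Cases that the lists above do not mention, such as a reset while diag-switch? is true, are treated as "POST does not run" in this sketch. Check the behavior on your own system before relying on that assumption.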
6.4.1 To Start POST Diagnostics--Method 1
There are two methods for starting POST diagnostics. Both are described in the following procedures.
1. Go to the ok prompt.
2. Type:
ok setenv diag-switch? true
|
3. Type:
ok setenv diag-level value
|
Where value is either min or max depending on the desired range of coverage.
4. Power cycle the server.
After you have powered the server off, wait 60 seconds before powering the server on. POST executes after the server is powered on.
Note - Status and error messages could be displayed in the console window. If POST detects an error, it displays an error message describing the failure.
|
5. When you have finished running POST, restore the value of diag-switch? to false by typing:
ok setenv diag-switch? false
|
Resetting diag-switch? to false minimizes boot time.
6.4.2 To Start POST Diagnostics--Method 2
1. Go to the ok prompt.
2. Type:
ok setenv diag-switch? false
|
3. Type:
ok setenv diag-level value
|
Where value is either min or max depending on the desired range of coverage.
4. Type:
ok setenv diag-trigger user-reset
|
5. Type:
ok setenv diag-trigger all-resets
|
Note - Status and error messages could be displayed in the console window. If POST detects an error, it displays an error message describing the failure.
|
6.4.3 Controlling POST Diagnostics
You control POST diagnostics and other aspects of the boot process by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables generally take effect only after the system is restarted. TABLE 6-2 lists the most important and useful of these variables. You can find instructions for changing OpenBoot configuration variables in Section 6.9, OpenBoot Configuration Variables.
TABLE 6-2 OpenBoot Configuration Variables
OpenBoot Configuration Variable | Description and Keywords
auto-boot? | Determines whether the operating system automatically starts up. Default is true.
- true - Operating system automatically starts once firmware tests finish.
- false - System remains at ok prompt until you type boot.
diag-level | Determines the level or type of diagnostics executed. Default is min.
- off - No testing.
- min - Only basic tests are run.
- max - More extensive tests may be run, depending on the device.
diag-script | Determines which devices are tested by OpenBoot Diagnostics. Default is none.
- none - No devices are tested.
- normal - On-board (centerplane-based) devices that have self-tests are tested.
- all - All devices that have self-tests are tested.
diag-switch? | Toggles the system in and out of diagnostic mode. Default is false.
- true - Diagnostic mode: POST diagnostics and OpenBoot Diagnostics tests may run.
- false - Default mode: Do not run POST or OpenBoot Diagnostics tests.
diag-trigger | Specifies the class of reset event that causes Power-On Self-Test and OpenBoot Diagnostics to run. This variable can accept single keywords as well as combinations of the first three keywords separated by spaces. For details, see To View and Set OpenBoot Configuration Variables.
- error-reset - A reset caused by certain non-recoverable hardware error conditions. In general, an error reset occurs when a hardware problem corrupts system data. Examples include CPU and system watchdog resets, fatal errors, and certain CPU reset events (default).
- power-on-reset - A reset caused by pressing the Power button (default).
- user-reset - A reset initiated by the user or the operating system.
- all-resets - Any kind of system reset.
- none - No Power-On Self-Tests or OpenBoot Diagnostics tests run.
input-device | Selects where console input is taken from. Default is TTYA.
- TTYA - From built-in SERIAL MGT port.
- TTYB - From built-in general purpose serial port (10101).
- keyboard - From attached keyboard that is part of a graphics terminal.
output-device | Selects where diagnostic and other console output is displayed. Default is TTYA.
- TTYA - To built-in SERIAL MGT port.
- TTYB - To built-in general purpose serial port (10101).
- screen - To attached screen that is part of a graphics terminal.
Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.
|
Once POST diagnostics have finished running, POST reports back to the OpenBoot firmware the status of each test it has run. Control then reverts back to the OpenBoot firmware code.
If POST diagnostics do not uncover a fault, and your server still does not start up, run OpenBoot Diagnostics tests.
6.5 OpenBoot Diagnostics
Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the OpenBoot PROM.
6.5.1 To Start OpenBoot Diagnostics
1. Type:
ok setenv diag-switch? true
ok setenv diag-level max
ok setenv auto-boot? false
ok reset-all
|
2. Type:
ok obdiag
This command displays the OpenBoot Diagnostics menu. See TABLE 6-3.
TABLE 6-3 Sample obdiag menu
obdiag
1 flashprom@2,0 | 2 i2c@0,320 | 3 ide@d
4 network@2 | 5 network@2,1 | 6 rtc@0,70
7 scsi@2 | 8 scsi@2,1 | 9 serial@0,2e8
10 serial@0,3f8 | 11 usb@a | 12 usb@b
Commands: test test-all except help what setenv set-default exit
diag-passes=1 diag-level=max test-args=subtests, verbose
Note - If you have a PCI card installed in the server, then additional tests are displayed on the OBDiag menu.
|
3. Type:
obdiag> test n
Where n represents the number corresponding to the test you want to run.
A summary of the tests is available. At the obdiag> prompt, type:
obdiag> help
6.5.2 Controlling OpenBoot Diagnostics Tests
Most of the OpenBoot configuration variables you use to control POST (see TABLE 6-2) also affect OpenBoot Diagnostics tests.
- Use the diag-level variable to control the OpenBoot Diagnostics testing level.
- Use test-args to customize how the tests run.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 6-4.
TABLE 6-4 Keywords for the test-args OpenBoot Configuration Variable
Keyword | What It Does
bist | Invokes built-in self-test (BIST) on external and peripheral devices.
debug | Displays all debug messages.
iopath | Verifies bus/interconnect integrity.
loopback | Exercises external loopback path for the device.
media | Verifies external and peripheral device media accessibility.
restore | Attempts to restore original state of the device if the previous execution of the test failed.
silent | Displays only errors rather than the status of each test.
subtests | Displays main test and each subtest that is called.
verbose | Displays detailed messages of status of all tests.
callers=n | Displays backtrace of n callers when an error occurs. callers=0 - displays backtrace of all callers before the error. Default is callers=1.
errors=n | Continues executing the test until n errors are encountered. errors=0 - displays all error reports without terminating testing. Default is errors=1.
If you want to customize the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
ok setenv test-args debug,loopback,media
|
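If you assemble test-args values in a script before typing them at the ok prompt, a check against the keywords in TABLE 6-4 can catch typing mistakes early. The following is a minimal sketch under that assumption. The function name build_test_args is hypothetical, and the sketch checks only spelling and the n value of callers and errors, not whether a given device supports a keyword.

```python
# Keywords taken from TABLE 6-4. Everything else here is illustrative.
FLAG_KEYWORDS = {"bist", "debug", "iopath", "loopback", "media",
                 "restore", "silent", "subtests", "verbose"}
COUNT_KEYWORDS = {"callers", "errors"}  # written as callers=n, errors=n


def build_test_args(*keywords):
    """Validate keywords and join them into a comma-separated value."""
    for keyword in keywords:
        name, sep, value = keyword.partition("=")
        if sep:
            # Only callers=n and errors=n take a value, and n must be a number.
            if name not in COUNT_KEYWORDS or not value.isdigit():
                raise ValueError("bad test-args keyword: " + keyword)
        elif name not in FLAG_KEYWORDS:
            raise ValueError("bad test-args keyword: " + keyword)
    return ",".join(keywords)


print("setenv test-args " + build_test_args("debug", "loopback", "media"))
```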
6.5.2.1 test and test-all Commands
You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
ok test /pci@x,y/SUNW,qlc@2
|
Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V210 and V240 servers.
Tip - Use the show-devs command to list the hardware device paths.
|
To customize an individual test, you can use test-args as follows:
ok test /usb@1,3:test-args={verbose,debug}
|
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command:
ok test-all
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
ok test-all /pci@9,700000/usb@1,3
|
6.5.2.2 What OpenBoot Diagnostics Error Messages Tell You
OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 6-1 displays a sample OpenBoot Diagnostics error message.
CODE EXAMPLE 6-1 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/isa@7/flashprom@2,0
ERROR : There is no POST in this FLASHPROM or POST header is
unrecognized
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
MACHINE : Sun Fire V210
SERIAL# : 51347798
DATE : 03/05/2003 15:17:31 GMT
CONTROLS: diag-level=max test-args=errors=1
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) .............
failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1
|
To change the system defaults and the diagnostics settings after initial boot, refer to the OpenBoot PROM Enhancements for Diagnostic Operation (817-6957). You can view or print this document by going to:
http://www.sun.com/documentation
6.6 OpenBoot Commands
OpenBoot commands are commands you type from the ok prompt. OpenBoot commands which can provide useful diagnostic information are:
- probe-scsi
- probe-ide
- show-devs
6.6.1 probe-scsi Command
The probe-scsi command is used to diagnose problems with SCSI devices.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi command can hang the system.
|
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers.
For any SCSI device that is connected and active, the probe-scsi command displays its target ID, unit number, and a device description that includes type and manufacturer.
The following is sample output from the probe-scsi command.
CODE EXAMPLE 6-2 Sample probe-scsi Command Output
{1} ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
Target 2
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
Target 3
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
|
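If you capture probe-scsi output from the console, for example in an ALOM console log, the Target and Unit layout shown in CODE EXAMPLE 6-2 is regular enough to parse. The following is a minimal sketch that assumes the output has exactly that layout. The function name parse_probe_scsi and the dictionary keys are hypothetical.

```python
import re


def parse_probe_scsi(text):
    """Return one dict per Unit line, tagged with the enclosing Target."""
    devices = []
    target = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"Target\s+(\w+)$", line)
        if match:
            target = match.group(1)
            continue
        match = re.match(r"Unit\s+(\d+)\s+(\S+)\s+(.*)$", line)
        if match and target is not None:
            devices.append({
                "target": target,
                "unit": int(match.group(1)),
                "type": match.group(2),
                "description": match.group(3).strip(),
            })
    return devices


sample = """Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0238
"""
for device in parse_probe_scsi(sample):
    print(device["target"], device["unit"], device["type"], device["description"])
```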
6.6.2 probe-ide Command
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
|
The following is sample output from the probe-ide command.
CODE EXAMPLE 6-3 Sample probe-ide Command Output
{1} ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
6.6.3 show-devs Command
The show-devs command lists the hardware device paths for each device in the firmware device tree. The following code example shows sample output from the show-devs command.
CODE EXAMPLE 6-4 show-devs Command Output
ok show-devs
/pci@1d,700000
/pci@1c,600000
/pci@1e,600000
/pci@1f,700000
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/pci@1d,700000/network@2,1
/pci@1d,700000/network@2
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
/pci@1c,600000/scsi@2,1/tape
/pci@1c,600000/scsi@2,1/disk
/pci@1c,600000/scsi@2/tape
/pci@1c,600000/scsi@2/disk
/pci@1e,600000/ide@d
/pci@1e,600000/usb@a
/pci@1e,600000/pmu@6
/pci@1e,600000/isa@7
/pci@1e,600000/ide@d/cdrom
/pci@1e,600000/ide@d/disk
/pci@1e,600000/pmu@6/gpio@80000000,8a
/pci@1e,600000/pmu@6/i2c@0,0
/pci@1e,600000/isa@7/rmc-comm@0,3e8
/pci@1e,600000/isa@7/serial@0,2e8
/pci@1e,600000/isa@7/serial@0,3f8
/pci@1e,600000/isa@7/power@0,800
/pci@1e,600000/isa@7/i2c@0,320
/pci@1e,600000/isa@7/rtc@0,70
/pci@1e,600000/isa@7/flashprom@2,0
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,70
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,88
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,68
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,4a
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,46
/pci@1e,600000/isa@7/i2c@0,320/gpio@0,44
/pci@1e,600000/isa@7/i2c@0,320/idprom@0,50
/pci@1e,600000/isa@7/i2c@0,320/nvram@0,50
/pci@1e,600000/isa@7/i2c@0,320/rscrtc@0,d0
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,c8
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,c6
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b8
/pci@1e,600000/isa@7/i2c@0,320/dimm-spd@0,b6
/pci@1e,600000/isa@7/i2c@0,320/power-supply-fru-prom@0,a4
/pci@1e,600000/isa@7/i2c@0,320/power-supply-fru-prom@0,b0
/pci@1e,600000/isa@7/i2c@0,320/chassis-fru-prom@0,a8
/pci@1e,600000/isa@7/i2c@0,320/motherboard-fru-prom@0,a2
/pci@1e,600000/isa@7/i2c@0,320/i2c-bridge@0,18
/pci@1e,600000/isa@7/i2c@0,320/i2c-bridge@0,16
/pci@1f,700000/network@2,1
/pci@1f,700000/network@2
/openprom/client-services
/packages/obdiag-menu
/packages/obdiag-lib
/packages/SUNW,asr
/packages/SUNW,fru-device
/packages/SUNW,i2c-ram-device
/packages/obp-tftp
/packages/kbd-translator
/packages/dropins
/packages/terminal-emulator
/packages/disk-label
/packages/deblocker
/packages/SUNW,builtin-drivers
{1} ok
|
6.6.4 To Run OpenBoot Commands
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi command can hang the system.
|
1. Halt the system to reach the ok prompt.
How you do this depends on the system's condition. If possible, you should warn users before you shut the system down.
2. Type the appropriate command at the console prompt.
6.7 Operating System Diagnostic Tools
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based diagnostic tools SunVTS and Sun Management Center. These tools enable you to monitor the server, exercise it, and isolate faults.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.
|
In addition to the tools mentioned, you can refer to error and system message log files, and Solaris system information commands.
6.7.1 Error and System Message Log Files
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
6.7.2 Solaris System Information Commands
The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V210 or V240 server:
- prtconf
- prtdiag
- prtfru
- psrinfo
- showrev
This section describes the information these commands give you. More information about using each command is contained in the appropriate man page.
6.7.2.1 prtconf Command
The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks, that only the operating system software can detect. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 6-5 shows an excerpt of prtconf output.
CODE EXAMPLE 6-5 prtconf Command Output
# prtconf
System Configuration: Sun Microsystems sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
SUNW,Sun-Fire-V240
packages (driver not attached)
SUNW,builtin-drivers (driver not attached)
deblocker (driver not attached)
disk-label (driver not attached)
terminal-emulator (driver not attached)
dropins (driver not attached)
kbd-translator (driver not attached)
obp-tftp (driver not attached)
SUNW,i2c-ram-device (driver not attached)
SUNW,fru-device (driver not attached)
ufs-file-system (driver not attached)
chosen (driver not attached)
openprom (driver not attached)
client-services (driver not attached)
options, instance #0
aliases (driver not attached)
memory (driver not attached)
virtual-memory (driver not attached)
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #0
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #1 ...
|
The prtconf command's -p option produces output similar to the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
6.7.2.2 prtdiag Command
The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of the output produced by prtdiag on a healthy Sun Fire V240 server running Solaris OS 8, PSR1.
CODE EXAMPLE 6-6 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 160 MHZ
Memory size: 1GB
==================================== CPUs ====================================
E$ CPU CPU Temperature Fan
CPU Freq Size Impl. Mask Die Ambient Speed Unit
--- -------- ---------- ------ ---- -------- -------- ----- ----
MB/P0 960 MHz 1MB US-IIIi 2.0 - -
MB/P1 960 MHz 1MB US-IIIi 2.0 - -
================================= IO Devices =================================
Bus Freq
Brd Type MHz Slot Name Model
--- ---- ---- ---------- ---------------------------- --------------------
0 pci 66 2 network-SUNW,bge (network)
0 pci 66 2 scsi-pci1000,21.1 (scsi-2)
0 pci 66 2 scsi-pci1000,21.1 (scsi-2)
0 pci 66 2 network-SUNW,bge (network)
0 pci 33 7 isa/serial-su16550 (serial)
0 pci 33 7 isa/serial-su16550 (serial)
0 pci 33 7 isa/rmc-comm-rmc_comm (seria+
0 pci 33 13 ide-pci10b9,5229.c4 (ide)
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address Size Interleave Factor Contains
-----------------------------------------------------------------------
0x0 512MB 1 GroupID 0
0x1000000000 512MB 1 GroupID 0
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels
--------------------------------------------------
0 0 MB/P0/B0/D0,MB/P0/B0/D1
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels
--------------------------------------------------
1 0 MB/P1/B0/D0,MB/P1/B0/D1
|
In addition to the information in CODE EXAMPLE 6-6, prtdiag with the verbose option (-v) reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
CODE EXAMPLE 6-7 prtdiag Verbose Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 59 OK
CPU2 64 OK
DBP0 22 OK
|
In the event of an overtemperature condition, prtdiag reports an error in the Status column for that device.
CODE EXAMPLE 6-8 prtdiag Overtemperature Indication Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 62 OK
CPU1 102 ERROR
|
Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
CODE EXAMPLE 6-9 prtdiag Fault Indication Output
Fan Status:
-----------
Bank RPM Status
---- ----- ------
CPU0 4166 [NO_FAULT]
CPU1 0000 [FAULT]
|
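The Status columns shown in CODE EXAMPLE 6-8 and CODE EXAMPLE 6-9 lend themselves to a simple scripted check. The following is a minimal sketch that assumes the three-column layout shown in those excerpts and treats only OK and [NO_FAULT] as healthy. The function name find_faults is hypothetical, and other Solaris releases may format prtdiag output differently.

```python
# Status strings that the excerpts above show for healthy components.
HEALTHY = {"OK", "[NO_FAULT]"}


def find_faults(text):
    """Return (name, reading, status) for every row whose status is not healthy."""
    faults = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have three fields with a numeric reading in the middle;
        # header and separator rows fail the isdigit() check and are skipped.
        if len(parts) == 3 and parts[1].isdigit() and parts[2] not in HEALTHY:
            faults.append((parts[0], int(parts[1]), parts[2]))
    return faults


sample = """Device Temperature Status
---------------------------------------
CPU0 62 OK
CPU1 102 ERROR
Bank RPM Status
---- ----- ------
CPU0 4166 [NO_FAULT]
CPU1 0000 [FAULT]
"""
for name, reading, status in find_faults(sample):
    print(name, reading, status)
```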
6.7.2.3 prtfru Command
The Sun Fire V210 and V240 servers maintain a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs.
CODE EXAMPLE 6-10 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
CODE EXAMPLE 6-10 prtfru -l Command Output
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT/battery (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0
|
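In the prtfru -l listing, entries that are themselves FRUs carry a (fru) suffix. The following is a minimal sketch that extracts those paths from captured output, assuming the suffix convention shown in CODE EXAMPLE 6-10. The function name list_frus is hypothetical.

```python
def list_frus(text):
    """Return the paths of all lines marked with the (fru) suffix."""
    suffix = "(fru)"
    frus = []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(suffix):
            # Drop the suffix and the space that precedes it.
            frus.append(line[:-len(suffix)].strip())
    return frus


sample = """/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
"""
for path in list_frus(sample):
    print(path)
```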
CODE EXAMPLE 6-11 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
CODE EXAMPLE 6-11 prtfru -c Command Output
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
SEGMENT: SD
/SpecPartNo: 885-0092-02
/ManR
/ManR/UNIX_Timestamp32: Wednesday April 10 11:34:49 BST 2002
/ManR/Fru_Description: FRUID,INSTR,M'BD,0CPU,0MB,ENXU
/ManR/Manufacture_Loc: HsinChu, Taiwan
/ManR/Sun_Part_No: 3753107
/ManR/Sun_Serial_No: abcdef
/ManR/Vendor_Name: Mitac International
/ManR/Initial_HW_Dash_Level: 02
/ManR/Initial_HW_Rev_Level: 01
|
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:
- FRU description
- Manufacturer name and location
- Part number and serial number
- Hardware revision levels
6.7.2.4 psrinfo Command
The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.
CODE EXAMPLE 6-12 psrinfo -v Command Output
# psrinfo -v
Status of processor 0 as of: 09/20/02 11:35:49
Processor has been on-line since 09/20/02 11:30:53.
The sparcv9 processor operates at 960 MHz,
and has a sparcv9 floating point processor.
Status of processor 1 as of: 09/20/02 11:35:49
Processor has been on-line since 09/20/02 11:30:52.
The sparcv9 processor operates at 960 MHz,
and has a sparcv9 floating point processor.
|
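To compare clock speeds across several servers, you can extract them from saved psrinfo -v output. The following is a minimal sketch that assumes the wording shown in CODE EXAMPLE 6-12. The function name parse_psrinfo is hypothetical.

```python
import re


def parse_psrinfo(text):
    """Map each processor ID to its clock speed in MHz."""
    speeds = {}
    current = None
    for line in text.splitlines():
        match = re.search(r"Status of processor (\d+)", line)
        if match:
            current = int(match.group(1))
            continue
        match = re.search(r"operates at (\d+) MHz", line)
        if match and current is not None:
            speeds[current] = int(match.group(1))
    return speeds


sample = """Status of processor 0 as of: 09/20/02 11:35:49
Processor has been on-line since 09/20/02 11:30:53.
The sparcv9 processor operates at 960 MHz,
and has a sparcv9 floating point processor.
Status of processor 1 as of: 09/20/02 11:35:49
Processor has been on-line since 09/20/02 11:30:52.
The sparcv9 processor operates at 960 MHz,
and has a sparcv9 floating point processor.
"""
print(parse_psrinfo(sample))
```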
6.7.2.5 showrev Command
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 6-13 shows sample output of the showrev command.
CODE EXAMPLE 6-13 showrev Command Output
# showrev
Hostname: griffith
Hostid: 830f8192
Release: 5.8
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain:
Kernel version: SunOS 5.8 Generic 108528-16 August 2002
|
When used with the -p option, this command displays installed patches. CODE EXAMPLE 6-14 shows a partial sample output from the showrev command with the -p option.
CODE EXAMPLE 6-14 showrev -p Command Output
# showrev -p
Patch: 109729-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109783-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109807-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109809-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110905-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110910-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110914-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 108964-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr
|
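Each line of showrev -p output follows the same labeled layout, so a patch list can be turned into structured data for comparison between servers. The following is a minimal sketch that assumes the field order shown in CODE EXAMPLE 6-14 and that multiple values within a field are separated by commas or spaces. The function name parse_patch_line is hypothetical.

```python
import re

# Field order as it appears in the sample output above.
PATCH_RE = re.compile(
    r"^Patch:\s*(\S+)\s+Obsoletes:(.*?)Requires:(.*?)"
    r"Incompatibles:(.*?)Packages:(.*)$"
)


def _items(field):
    """Split a field on commas or whitespace, dropping empty pieces."""
    return [item for item in re.split(r"[,\s]+", field.strip()) if item]


def parse_patch_line(line):
    """Return a dict for one showrev -p line, or None if it does not match."""
    match = PATCH_RE.match(line.strip())
    if match is None:
        return None
    return {
        "patch": match.group(1),
        "obsoletes": _items(match.group(2)),
        "requires": _items(match.group(3)),
        "incompatibles": _items(match.group(4)),
        "packages": _items(match.group(5)),
    }


line = "Patch: 109729-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu"
print(parse_patch_line(line))
```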
6.7.3 To Run Solaris System Information Commands
1. Decide what kind of system information you want to display.
For more information, see Solaris System Information Commands.
2. Type the appropriate command at a console prompt.
See TABLE 6-5 for a summary of the commands.
TABLE 6-5 Using Solaris Information Display Commands
Command | What It Displays | What to Type | Notes
prtconf | System configuration information | /usr/sbin/prtconf | --
prtdiag | Diagnostic and configuration information | /usr/platform/sun4u/sbin/prtdiag | Use the -v option for additional detail.
prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data.
showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches.
6.8 Recent Diagnostic Test Results
Summaries of the results from the most recent power-on self-test (POST) and OpenBoot Diagnostics tests are saved across power cycles.
6.8.1 To View Recent Test Results
1. Go to the ok prompt.
2. Type the following:
To see a summary of the most recent POST results.
6.9 OpenBoot Configuration Variables
Switches and diagnostic configuration variables stored in the IDPROM determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 6-2.
Changes to OpenBoot configuration variables usually take effect upon the next reboot.
6.9.1 To View and Set OpenBoot Configuration Variables
6.9.1.1 To View OpenBoot Configuration Variables
1. Halt the server to reach the ok prompt.
2. To display the current values of all OpenBoot configuration variables, use the printenv command.
The following example shows a short excerpt of this command's output.
ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
|
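When printenv output is captured through a remote console log, it can be useful to list the variables that differ from their defaults. The following sketch assumes each value is a single token, as in the excerpt above; values that contain spaces would need column-based parsing instead. The function names are illustrative.

```python
def parse_printenv(text):
    """Map variable name -> (value, default) for three-token lines."""
    settings = {}
    for line in text.splitlines():
        fields = line.split()
        # The prompt line and the header line do not have exactly
        # three tokens, so they are skipped here.
        if len(fields) == 3:
            name, value, default = fields
            settings[name] = (value, default)
    return settings

def changed_from_default(settings):
    """Return the names whose current value differs from the default."""
    return sorted(
        name for name, (value, default) in settings.items()
        if value != default
    )

excerpt = """ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
"""
settings = parse_printenv(excerpt)
print(changed_from_default(settings))
```

For the excerpt shown, no variable differs from its default, so the list is empty.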
6.9.1.2 To Set OpenBoot Configuration Variables
1. Halt the server to reach the ok prompt.
2. To set or change the value of an OpenBoot configuration variable, use the setenv command:
ok setenv diag-level max
diag-level = max
|
To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.
Note - Keywords for the OpenBoot configuration variable test-args must be separated by commas.
|
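The separator rule can be captured in a small helper when setenv commands are generated by a script. This is a sketch under stated assumptions: the function name setenv_command is hypothetical, and the keywords in the example are placeholders (see TABLE 6-2 for the keywords each variable accepts).

```python
def setenv_command(variable, keywords):
    """Build a setenv command line for a multi-keyword variable."""
    # test-args takes comma-separated keywords; other multi-keyword
    # variables take space-separated keywords.
    separator = "," if variable == "test-args" else " "
    return "setenv {} {}".format(variable, separator.join(keywords))

# Placeholder keywords, for illustration only.
print(setenv_command("test-args", ["verbose", "subtests"]))
print(setenv_command("boot-device", ["disk", "net"]))
```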
6.10 Additional Diagnostic Tests for Specific Devices
6.10.1 Using the probe-scsi Command to Confirm That Hard Drives are Active
The probe-scsi command transmits an inquiry to SCSI devices connected to the system's internal SCSI interface. If a SCSI device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.
CODE EXAMPLE 6-15 probe-scsi Output Message
ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
Code example displays output from running the probe-scsi command.
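If probe-scsi output is logged from a remote console, the target and unit lines can be turned into records for comparison against the expected drive inventory. The following sketch assumes the two-line Target/Unit layout shown in CODE EXAMPLE 6-15; the function name and the dictionary keys are illustrative.

```python
def parse_probe_scsi(text):
    """Return one record per 'Unit' line, tagged with its target."""
    devices = []
    target = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "Target":
            target = fields[1]  # kept as text; no number base is assumed
        elif len(fields) >= 4 and fields[0] == "Unit" and target is not None:
            devices.append({
                "target": target,
                "unit": fields[1],
                "type": fields[2],
                "description": " ".join(fields[3:]),
            })
    return devices

sample = """ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
"""
devices = parse_probe_scsi(sample)
for device in devices:
    print(device["target"], device["type"], device["description"])
```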
The probe-scsi-all command transmits an inquiry to all SCSI devices connected to both the system's internal and its external SCSI interfaces. CODE EXAMPLE 6-16 shows sample output from a server with no externally connected SCSI devices but containing two 36 GB hard drives, both of them active.
CODE EXAMPLE 6-16 probe-scsi-all Output Message
ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1
/pci@1f,0/pci@1/scsi@8
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
Code example displays output from running probe-scsi-all command.
6.10.2 Using probe-ide Command to Confirm That the DVD or CD-ROM Drive is Connected
The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system's on-board IDE interface. The following sample output reports a DVD drive installed (as Device 0) and active in a server.
CODE EXAMPLE 6-17 probe-ide Output Message
ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
Code example displays output from running probe-ide command.
6.10.3 Using watch-net and watch-net-all Commands to Check the Network Connections
The watch-net diagnostics test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.
Start the watch-net diagnostic test by typing the watch-net command at the ok prompt. For the watch-net-all diagnostic test, type watch-net-all at the ok prompt.
CODE EXAMPLE 6-18 watch-net Diagnostic Output Message
{1} ok watch-net
100 Mbps FDX Link up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
................................
|
Code example displays output from running watch-net command.
CODE EXAMPLE 6-19 watch-net-all Diagnostic Output Message
{1} ok watch-net-all
/pci@1d,700000/network@2,1
Timed out waiting for Autonegotation to complete
Check cable and try again
Link Down
/pci@1f,700000/network@2
100 Mbps FDX Link up
................................
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
................................
{1} ok
|
Code example displays output from running watch-net-all command.
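A captured watch-net log can be summarized by counting the packet markers. The following sketch counts only lines that consist entirely of period and X characters, which is an assumption about the log layout; an X that appears on the same line as an error description would not be counted. The function name is illustrative.

```python
def count_packets(text):
    """Return (good, bad) marker counts from captured watch-net output."""
    good = bad = 0
    for line in text.splitlines():
        line = line.strip()
        # Ignore message lines; count only pure marker lines.
        if line and set(line) <= {".", "X"}:
            good += line.count(".")
            bad += line.count("X")
    return good, bad

sample = "\n".join([
    "100 Mbps FDX Link up",
    "Looking for Ethernet Packets.",
    "`.' is a Good Packet. `X' is a Bad Packet.",
    "Type any key to stop.",
    "." * 32,
])
good, bad = count_packets(sample)
print(good, bad)
```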
For additional information about diagnostic tests for the OpenBoot PROM, see OpenBoot PROM Enhancements for Diagnostic Operation (817-6957-10).
6.11 Automatic System Recovery
Note - Automatic System Recovery (ASR) is not the same as Automatic Server Restart, which the Sun Fire V210 and V240 servers also support. For additional information about Automatic Server Restart see Section 3.1.3, Automatic Server Restart.
|
Automatic System Recovery (ASR) consists of self-test features and an auto-configuring capability to detect failed hardware components and unconfigure them. By doing this, the server is able to resume operating after certain non-fatal hardware faults or failures have occurred.
If a component is one that is monitored by ASR, and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails.
ASR monitors memory modules:
- If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.
- If a fault occurs on a running server, and it is possible for the server to run without the failed component, the server automatically reboots. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.
To support degraded boot capability, OpenBoot firmware uses the 1275 Client interface (via the device tree) to mark a device as either failed or disabled. This creates an appropriate status property in the device tree node. The Solaris OS does not activate a driver for any subsystem so marked.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system reboots automatically and resumes operation while a service call is made.
Note - ASR is not enabled until you activate it.
|
6.11.1 Auto-Boot Options
The auto-boot? setting controls whether or not the firmware automatically boots the operating system after each reset. The default setting is true.
The auto-boot-on-error? setting controls whether the system attempts a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true to enable an automatic degraded boot.
To set the switches, type:
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
Note - The default setting for auto-boot-on-error? is false. Therefore, the system does not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal non-recoverable error, even if degraded booting is enabled. For examples of fatal non-recoverable errors, see Section 6.11.2, Error Handling Summary.
|
6.11.2 Error Handling Summary
Error handling during the power-on sequence falls into one of the following three cases:
- If no errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true.
- If only non-fatal errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true and auto-boot-on-error? is true.
Note - If POST or OpenBoot Diagnostics detects a non-fatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
|
- If a fatal error is detected by POST or OpenBoot Diagnostics, the system does not boot regardless of the settings of auto-boot? or auto-boot-on-error?. Fatal non-recoverable errors include the following:
- All CPUs failed
- All logical memory banks failed
- Flash RAM cyclical redundancy check (CRC) failure
- Critical field-replaceable unit (FRU) PROM configuration data failure
- Critical application-specific integrated circuit (ASIC) failure
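The three cases above can be expressed as a small decision function. This is an illustrative model of the documented behavior, not firmware code; the function name and the severity labels "none", "non-fatal", and "fatal" are assumptions made for the sketch.

```python
def attempts_boot(worst_error, auto_boot, auto_boot_on_error):
    """Model whether the system attempts to boot after diagnostics.

    worst_error is the most severe result reported by POST or
    OpenBoot Diagnostics: "none", "non-fatal", or "fatal".
    """
    if worst_error == "fatal":
        # Fatal errors block booting regardless of either setting.
        return False
    if worst_error == "non-fatal":
        # A degraded boot requires both settings to be true.
        return auto_boot and auto_boot_on_error
    if worst_error == "none":
        return auto_boot
    raise ValueError("unknown severity: " + repr(worst_error))

# Default settings: auto-boot? true, auto-boot-on-error? false.
print(attempts_boot("none", True, False))
print(attempts_boot("non-fatal", True, False))
```

With the default settings, the model boots after a clean diagnostic run but does not attempt a degraded boot.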
6.11.3 Reset Scenarios
Two OpenBoot configuration variables, diag-switch? and diag-trigger, control how the system runs firmware diagnostics in response to system reset events.
The standard system reset protocol bypasses POST and OpenBoot Diagnostics unless diag-switch? is set to true or diag-trigger is set to a reset event. The default setting for diag-switch? is false. Because ASR relies on firmware diagnostics to detect faulty devices, diag-switch? must be set to true for ASR to run. For instructions, see Section 6.11.4, To Enable ASR.
To control which reset events, if any, automatically initiate firmware diagnostics, use diag-trigger. For detailed explanations of these variables and their uses, see Section 6.4.3, Controlling POST Diagnostics.
6.11.4 To Enable ASR
1. At the system ok prompt, type:
ok setenv diag-switch? true
|
2. Set the diag-trigger variable to power-on-reset, error-reset, or user-reset. For example, type:
ok setenv diag-trigger user-reset
|
3. Type:
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
4. Type:
The system permanently stores the parameter changes and boots automatically if the OpenBoot variable auto-boot? is set to true (its default value).
Note - To store parameter changes, you can also power-cycle the system using the front panel Power switch.
|
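The settings made in this procedure can be checked against a saved copy of the configuration variables. The following sketch assumes the variables have already been read into a dictionary of strings and that diag-trigger holds a single keyword; the function name asr_problems is illustrative.

```python
# Trigger keywords listed in step 2 of the procedure.
ASR_TRIGGERS = {"power-on-reset", "error-reset", "user-reset"}

def asr_problems(settings):
    """Return the reasons why the given settings do not enable ASR."""
    problems = []
    for name in ("diag-switch?", "auto-boot?", "auto-boot-on-error?"):
        if settings.get(name) != "true":
            problems.append(name + " must be true")
    if settings.get("diag-trigger") not in ASR_TRIGGERS:
        problems.append(
            "diag-trigger must be one of " + ", ".join(sorted(ASR_TRIGGERS))
        )
    return problems

enabled = {
    "diag-switch?": "true",
    "diag-trigger": "user-reset",
    "auto-boot?": "true",
    "auto-boot-on-error?": "true",
}
print(asr_problems(enabled))
```

An empty list means that all four settings match the values given in the procedure.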
6.11.5 To Disable ASR
1. At the system ok prompt, type:
ok setenv diag-switch? false
ok setenv diag-trigger none
|
2. Type:
The system permanently stores the parameter change.
Note - To store parameter changes, you can also power-cycle the system using the front panel Power switch.
|
819-4208-10
Copyright © 2004, Sun Microsystems, Inc. All Rights Reserved.