CHAPTER 1

Diagnosing Server Performance and Faults

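This chapter describes the diagnostic tools available for use with the Sun Fire V215 and V245 servers. This chapter contains the following sections:

  • Section 1.1, Diagnostic Tools Overview
  • Section 1.2, Choosing a Fault Isolation Tool
  • Section 1.3, POST Diagnostics
  • Section 1.4, OpenBoot Diagnostics
  • Section 1.5, Operating System Diagnostic Tools
  • Section 1.6, Recent Diagnostic Test Results
  • Section 1.7, Additional Diagnostic Tests for Specific Devices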


1.1 Diagnostic Tools Overview

Sun provides a range of diagnostic tools for use with the Sun Fire V215 and V245 servers. TABLE 1-1 contains summaries of the diagnostic tools.


TABLE 1-1 Summary of Diagnostic Tools

| Diagnostic Tool | Type | What It Does | Accessibility and Availability | Remote Capability |
|---|---|---|---|---|
| ALOM | Hardware and Software | Monitors environmental conditions, performs environmental fault isolation, and provides remote console access to system. | Can function on standby power and without operating system. | Designed for remote system access |
| Status indicators | Hardware | Indicates operational status of the overall system and sub-assemblies that have status indicators. | Accessed from system chassis. Is available anytime power is available. | Local, but operational status can be viewed in ALOM |
| POST | Firmware | Provides test coverage for CPUs, CPU caches, system memory, CPU interconnects, I/O bridges, and system buses. | Runs automatically on startup. Is available when the operating system is not running. | Local, but operation can be viewed in ALOM |
| OpenBoot™ Diagnostics | Firmware | Provides test coverage specifically on the I/O sub-systems and plug-in cards. Test coverage consists of I/O channels, boot controllers (SCSI, IDE, USB, Ethernet), non core devices (Flash, I2C, environmental controls, NVRAM), and plug-in cards with native Fcode drivers which support IEEE 1275 self test mechanisms. OpenBoot Diagnostics provides Fcode self-tests for on-board hardware devices. | Runs automatically or interactively. Is available when the operating system is not running. | Local, but operation can be viewed in ALOM |
| OpenBoot Diagnostic commands | Firmware | Displays various system information (See Section 1.4.3, OpenBoot Diagnostic Commands) | Available when the operating system is not running | Local, but can be accessed in ALOM |
| Solaris OS commands | Software | Displays various system information | Requires operating system | Local, but can be accessed in ALOM |
| SunVTS™ | Software | Exercises and stresses the system, running tests in parallel | Requires operating system. | View and control over network |
| Sun Management Center | Software | Monitors both hardware environmental conditions and software performance of multiple machines. Generates alerts for various conditions | Requires operating system to be running on both monitored and master servers. Requires a dedicated database on the master server | Designed for remote access |
| Hardware Diagnostic Suite | Software | Exercises an operational system by running sequential tests. Also reports failed FRUs | A separately purchased optional add-on to Sun Management Center. Requires operating system and Sun Management Center | Designed for remote access |



1.2 Choosing a Fault Isolation Tool

This section helps you choose the right tool to isolate a failed part in a Sun Fire V215 or V245 server. Consider the following questions when selecting a tool.

1. Have you checked the status indicators?

Certain system components have status indicators that can alert you when a component requires replacement.

2. Does the server boot?

3. Do you intend to run the tests remotely?

Sun Management Center, ALOM, and the Hardware Diagnostic Suite software enable you to run tests from a remote server. ALOM also provides a means of redirecting system console output, enabling you to remotely view and run tests, like POST diagnostics, that usually require physical proximity to the serial port on the back panel.



Note - The SunVTS software also enables you to run tests remotely by using the tty-mode through a remote login or a Telnet session.


4. Will the tool test the suspected sources of the problem?

Use a diagnostic tool capable of testing the suspected problem sources. TABLE 1-1 shows which parts can be isolated by each fault isolating tool.

5. Is the problem intermittent or software-related?

If a problem is not caused by a defective hardware component, use a system exerciser tool rather than a fault isolation tool.

FIGURE 1-1 Choosing a Tool to Isolate Hardware Faults


This flowchart shows in what order and under what conditions one might choose to run various diagnostic tools to isolate a hardware failure.


1.3 POST Diagnostics

POST is a firmware program that helps determine whether a portion of the system has failed. POST verifies the core functionality of the system, including the CPU modules, motherboard, memory, and some on-board I/O devices. POST also generates messages that can help identify the nature of a hardware failure. POST can run even if the system is unable to boot. POST resides in a PROM located on the MBC board (ALOM) and detects most persistent fault conditions.

POST can run under the following four conditions:

1. POST will run automatically when power is applied to the system.

2. POST will run in service mode when the system is reset with the reset-all command from the ok prompt.

3. POST will run when the keyswitch is set to the diag position.

4. POST will run when the post command is issued from the ok prompt.

If diag-level is set to menu, a menu of all the tests executed at power up is displayed.

POST diagnostic and error message reports are displayed on a console.

1.3.1 Starting POST Diagnostics

1. Obtain the ok prompt.

2. Type:


ok post level verbosity

where level specifies the level of diagnostics (min, max, menu, off) and verbosity specifies the diagnostic verbosity (debug, max, normal, min, none).

Status and error messages are displayed in the console window. If POST detects an error, it displays an error message describing the failure.

1.3.2 Controlling POST Diagnostics

You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables. Changes to OpenBoot configuration variables generally take effect only after the server is restarted. TABLE 1-2 lists the most important and useful of these variables.


TABLE 1-2 OpenBoot Configuration Variables

OpenBoot Configuration Variable

Description and Keywords

auto-boot?

Determines whether the operating system starts automatically. Default is true.

  • true - System automatically boots operating system, once firmware diagnostics and initialization complete.
  • false - System remains at ok prompt until you type boot.

diag-level

Determines the level or type of diagnostics executed. Default is max.

  • off - No testing.
  • min - Only basic tests are run.
  • max - More extensive tests might be run, depending on the device.
  • menu - Displays the Diagnostics Engineering Monitor menu.

verbosity

Controls the amount and detail of the messages displayed on the console.

  • max - Displays detailed progress and informational messages.
  • normal - Keeps regular output to a minimum.
  • min - Displays notice, warning, error, and fatal messages.
  • none - Displays only error and fatal messages.

diag-script

Determines which devices are tested by OpenBoot Diagnostics. Default is normal.

  • none - No devices are tested.
  • normal - On-board devices are tested.
  • all - All devices that have self-tests are tested.

diag-trigger

Determines which reset events cause POST and OpenBoot Diagnostics to run. Default is power-on-reset and error-reset.

  • none - Diagnostics do not run on any reset event (requires diag-switch? = false).
  • user-reset - Diagnostics run on a user-invoked reset.
  • error-reset - Diagnostics run when the system encounters an error reset event (Red State Exception, Watchdog, Fatal).
  • power-on-reset - Diagnostics run when power is applied to the system.

input-device

Selects where console input is taken from. Default is ttya.

  • ttya - From built-in SER MGT port.
  • ttyb - From built-in general purpose serial port (SER TTYB).
  • keyboard - From attached keyboard that is part of a graphics terminal.

output-device

Selects where diagnostic and other console output is displayed. Default is ttya.

  • ttya - To built-in SER MGT port.
  • ttyb - To built-in general purpose serial port (SER TTYB).
  • screen - To attached screen that is part of a graphics terminal. (See footnote 1.)

Footnote 1 - POST messages cannot be displayed on a graphics terminal. Messages are sent to TTYA when the output-device is set to screen.

Note - These variables affect OpenBoot Diagnostics tests as well as POST diagnostics.


After POST diagnostics have finished running, POST reports the status of each test it has run to the OpenBoot firmware. Control then returns to the OpenBoot firmware code.

If POST diagnostics do not uncover a fault, and the server still does not start up, run OpenBoot Diagnostics tests. See FIGURE 1-1 for additional information.


1.4 OpenBoot Diagnostics

Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the OpenBoot PROM.

1.4.1 Starting OpenBoot Diagnostics

1. Obtain the ok prompt.

Become superuser, and then type init 0.

2. Type:


ok obdiag

This command displays the OpenBoot Diagnostics menu.



Note - If you have a PCI card installed in the server, then additional tests will appear on the obdiag menu.


3. Run an obdiag test. Type:


obdiag> test n

where n is the number corresponding to the test you want to run.

A summary of the tests is available. At the obdiag> prompt, type:


obdiag> help

1.4.2 Controlling OpenBoot Diagnostics Tests

Most of the OpenBoot configuration variables you use to control POST (see TABLE 1-2) also affect OpenBoot Diagnostics tests.

In addition, you can customize OpenBoot Diagnostics testing with the test-args variable. By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 1-3.


TABLE 1-3 Keywords for the test-args OpenBoot Configuration Variable

| Keyword | What It Does |
|---|---|
| bist | Invokes built-in self-test (BIST) on external and peripheral devices |
| debug | Displays all debug messages |
| iopath | Verifies bus and interconnect integrity |
| loopback | Exercises external loopback path for the device |
| media | Verifies external and peripheral device media accessibility |
| restore | Attempts to restore original state of the device if the previous execution of the test failed |
| silent | Displays only errors rather than the status of each test |
| subtests | Displays main test and each subtest that is called |
| verbose | Displays detailed messages of status of all tests |
| callers=N | Displays backtrace of N callers when an error occurs. callers=0 displays backtrace of all callers before the error |
| errors=N | Continues executing the test until N errors are encountered. errors=0 displays all error reports without terminating testing |

If you want to customize the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:


ok setenv test-args debug,loopback,media
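
As a rough illustration of how a test-args value is assembled, the following Python sketch checks a list of keywords against TABLE 1-3 and joins them with commas. The names PLAIN, COUNTED, and build_test_args are hypothetical helpers introduced for this example and are not part of the OpenBoot firmware. Only the keywords themselves come from the table.

```python
import re

# Plain keywords from TABLE 1-3 (illustrative copy, not a firmware API).
PLAIN = {"bist", "debug", "iopath", "loopback", "media",
         "restore", "silent", "subtests", "verbose"}
# The two keywords that take a count: callers=N and errors=N.
COUNTED = re.compile(r"^(callers|errors)=\d+$")


def build_test_args(keywords):
    """Join keywords into a test-args value, rejecting anything not in TABLE 1-3."""
    for keyword in keywords:
        if keyword not in PLAIN and not COUNTED.match(keyword):
            raise ValueError(f"not a test-args keyword: {keyword}")
    return ",".join(keywords)


if __name__ == "__main__":
    # Reproduces the value used in the setenv example above.
    print("setenv test-args " + build_test_args(["debug", "loopback", "media"]))
```

A check of this kind only catches misspelled keywords before you type the command. It does not tell you whether a given device's self-test honors a particular keyword.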

1.4.2.1 test and test-all Commands

You can run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:


ok test /ebus@1f,464000/serial@2,40



Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V215 and V245 servers.


To customize an individual test, you can use test-args as follows:


ok test /ebus@1f,464000/serial@2,40:test-args={verbose,debug}

This affects only the current test without changing the value of the test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:


ok test-all

If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:


ok test-all /pci@9,700000/usb@1,3

1.4.2.2 What OpenBoot Diagnostics Error Messages Tell You

OpenBoot Diagnostics error messages are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. The following is an example OpenBoot Diagnostics error message.


Testing /ebus@1f,464000/flashprom@0,0
 
    ERROR   : FLASHPROM CRC-32 is incorrect
    SUMMARY : Obs=0x1ea5bc20 Exp=0x5c896226 XOR=0x422cde06 Addr=0xfeb1fffc
    DEVICE  : /ebus@1f,464000/flashprom@0,0
    SUBTEST : selftest
    MACHINE : Sun Fire V215
    SERIAL# : 64196915
    DATE    : 04/07/2006 23:27:45 GMT
    CONTROLS: diag-level=max test-args=
 
Error: /ebus@1f,464000/flashprom@0,0 selftest failed, return code = 1
Selftest at /ebus@1f,464000/flashprom@0,0 (errors=1) ............. failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1
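
If you capture console output to a file, the labeled fields of such a report are simple to extract. The following Python sketch is one possible approach, assuming the report keeps the "LABEL : value" layout shown above. The function names parse_report and summary_values are hypothetical, and the sample text is an excerpt of the example message.

```python
import re

SAMPLE = """\
Testing /ebus@1f,464000/flashprom@0,0

    ERROR   : FLASHPROM CRC-32 is incorrect
    SUMMARY : Obs=0x1ea5bc20 Exp=0x5c896226 XOR=0x422cde06 Addr=0xfeb1fffc
    DEVICE  : /ebus@1f,464000/flashprom@0,0
    SUBTEST : selftest
"""

# A report field is an upper-case label, optional spaces, a colon, then the value.
FIELD = re.compile(r"^\s*([A-Z#]+)\s*:\s*(.*)$")


def parse_report(text):
    """Collect the labeled fields of an error report into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def summary_values(summary):
    """Turn 'Obs=0x.. Exp=0x..' pairs into a dictionary of integers."""
    return {key: int(value, 16)
            for key, value in (pair.split("=") for pair in summary.split())}


if __name__ == "__main__":
    report = parse_report(SAMPLE)
    values = summary_values(report["SUMMARY"])
    print(report["DEVICE"], report["SUBTEST"])
    # In this sample, the XOR field equals the observed value XOR the expected value.
    print(values["Obs"] ^ values["Exp"] == values["XOR"])
```

In the sample message, the XOR field is the bitwise difference between the observed (Obs) and expected (Exp) values, which shows at a glance which bits of the checksum disagree.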

1.4.3 OpenBoot Diagnostic Commands

The following OpenBoot commands can provide useful diagnostic information:

  • probe-scsi and probe-scsi-all
  • probe-ide
  • show-devs

1.4.3.1 probe-scsi and probe-scsi-all Commands

The probe-scsi and probe-scsi-all commands diagnose problems with SCSI devices.



Caution - If you use the halt command or the Stop-A key sequence to reach the ok prompt and then issue the probe-scsi or probe-scsi-all command, the system might hang.


The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-scsi-all command additionally accesses devices connected to any host adapters installed in PCI slots.

For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its target and unit numbers, a device description that includes type and manufacturer, its size in blocks, and its SAS address and PHY number.

The following is example output from the probe-scsi command.


{1} ok probe-scsi
MPT Version 1.05, Firmware version 0.03.23.00
 
Target 0 
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73 GB SASAddress 500000e011772152 PhyNum 0
Target 1 
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73 GB SASAddress 500000e0115adf42 PhyNum 1
 

The following is example output from the probe-scsi-all command.


{1} ok probe-scsi-all
/pci@1e,600000/pci@0/pci@a/pci@0/pci@8/scsi@1
 
MPT Version 1.05 Firmware Version 0.03.23.00
 
Target 0
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73GB SASAddress 500000e011772152 PhyNum 0
Target 1
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73GB SASAddress 500000e0115adf42 PhyNum 1
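
When you need to compare probe output across several servers, it can help to reduce each listing to one record per device. The following Python sketch assumes the Target and Unit layout shown in the examples above, with each device on a single line. The names parse_probe, TARGET, and UNIT are hypothetical, and the 512-byte block size used in the size check is an assumption, not something the output states.

```python
import re

SAMPLE = """\
Target 0
  Unit 0   Disk   FUJITSU MAY2073RCSUN72G 0401   143374738 Blocks, 73GB SASAddress 500000e011772152 PhyNum 0
Target 1
  Unit 0   Disk   FUJITSU MAY2073RCSUN72G 0401   143374738 Blocks, 73GB SASAddress 500000e0115adf42 PhyNum 1
"""

TARGET = re.compile(r"^Target\s+(\d+)")
# Unit number, device type, free-form description, then the block count.
UNIT = re.compile(r"^\s*Unit\s+(\d+)\s+(\S+)\s+(.*?)\s+(\d+)\s+Blocks")


def parse_probe(text):
    """Return one dictionary per device listed under a Target line."""
    devices = []
    target = None
    for line in text.splitlines():
        match = TARGET.match(line)
        if match:
            target = int(match.group(1))
            continue
        match = UNIT.match(line)
        if match and target is not None:
            devices.append({
                "target": target,
                "unit": int(match.group(1)),
                "type": match.group(2),
                "description": " ".join(match.group(3).split()),
                "blocks": int(match.group(4)),
            })
    return devices


if __name__ == "__main__":
    for device in parse_probe(SAMPLE):
        # Assumes 512-byte blocks when converting the block count to gigabytes.
        size_gb = device["blocks"] * 512 // 10**9
        print(device["target"], device["unit"], device["description"], size_gb)
```

A device that is missing from the parsed list, or a list that is empty, corresponds to a drive that did not answer the inquiry.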
 

1.4.3.2 probe-ide Command

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This bus is the internal system bus for media devices such as the optional DVD super-multi drive.



Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.


The following is example output from the probe-ide command.


{1} ok probe-ide
  Device 0  ( Primary Master ) 
         Removable ATAPI Model: MATSHITADVD-RAM UJ-845S
 
  Device 1  ( Primary Slave ) 
         Not Present
 
  Device 2  ( Secondary Master ) 
         Not Present
 
  Device 3  ( Secondary Slave ) 
         Not Present

1.4.3.3 show-devs Command

The show-devs command lists the hardware device paths for each device in the firmware device tree. The following shows some example output.


{1} ok show-devs
/i2c@1f,530000
/ebus@1f,464000
/pci@1f,700000
/pci@1e,600000
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi+0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/i2c@1f,530000/dimm-spd@0,e2
/i2c@1f,530000/dimm-spd@0,e0
/i2c@1f,530000/clock-generator@0,dc
/i2c@1f,530000/rscrtc@0,d0
/i2c@1f,530000/hardware-monitor@0,b0
/i2c@1f,530000/riser-fru-prom@0,a8
/i2c@1f,530000/idprom@0,a6
/i2c@1f,530000/nvram,a6
 

1.4.3.4 Running OpenBoot Commands

1. Halt the system to reach the ok prompt.

How you do this depends on the system’s condition. If possible, warn users before you shut the system down. One method is to become superuser and then type the init 0 command.

2. Type the appropriate OpenBoot command at the console prompt.


1.5 Operating System Diagnostic Tools

If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser operating system. For most Sun systems, this means the Solaris OS. After the server is running in multiuser mode, you have access to the software-based diagnostic tools SunVTS and Hardware Diagnostic Suite. These tools enable you to monitor the server, exercise it, and isolate faults.



Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot following completion of the firmware-based tests.


In addition to the tools mentioned above, you can refer to error and system message log files, and Solaris system information commands.

1.5.1 Error and System Message Log Files

Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the operating system, the environmental control subsystem, and various software applications.
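
Because many sources write to this file, it is often useful to filter it before reading. The following Python sketch is a minimal example that prints only the lines mentioning certain words. The keyword list (warning, error, fault) and the function name suspicious_lines are assumptions made for this illustration. Adjust the keywords to match the messages you are looking for.

```python
import sys


def suspicious_lines(lines, keywords=("warning", "error", "fault")):
    """Yield log lines that mention any keyword, ignoring case."""
    for line in lines:
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Pass the log path as the first argument, for example /var/adm/messages.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for hit in suspicious_lines(log):
                print(hit)
```

Reading /var/adm/messages might require superuser privileges, depending on how the system is configured.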

1.5.2 Solaris System Information Commands

The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V215 or V245 server:

  • prtconf
  • prtdiag
  • prtfru
  • psrinfo
  • showrev

This section describes the information that these commands give you. More information on using these commands is contained in the appropriate man page.

1.5.2.1 prtconf Command

The prtconf command displays the Solaris device tree. This tree includes all of the devices probed by the OpenBoot firmware, as well as additional devices, like individual disks that are exposed to the operating system only. The output of prtconf also includes the total amount of system memory.

The -p option produces output similar to the OpenBoot show-devs command. This output lists only those devices that are compiled by the system firmware.

1.5.2.2 prtdiag Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components. The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on the system.

The verbose option (-v) includes information about the front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.

In the event of an overtemperature condition, prtdiag reports an error in the Status column.


System Temperatures (Celsius):
-------------------------------
Device          Temperature     Status
---------------------------------------
CPU0            62              OK
CPU1            102             ERROR

If there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.


Fan Status:
-----------
 
Bank             RPM    Status
----            -----   ------
CPU0             4166   [NO_FAULT]
CPU1             0000   [FAULT]
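
A status table like this one can be checked by a script, for example as part of a periodic health check. The following Python sketch picks out the banks marked [FAULT], assuming the three-column Fan Status layout shown above. Because the prtdiag display format can vary with the Solaris OS version, treat the column positions as an assumption. The function name faulted_banks is hypothetical.

```python
SAMPLE = """\
Fan Status:
-----------

Bank             RPM    Status
----            -----   ------
CPU0             4166   [NO_FAULT]
CPU1             0000   [FAULT]
"""


def faulted_banks(text):
    """Return the bank names whose Status column is [FAULT]."""
    faulted = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three columns: bank, numeric RPM, bracketed status.
        if len(parts) == 3 and parts[1].isdigit() and parts[2] == "[FAULT]":
            faulted.append(parts[0])
    return faulted


if __name__ == "__main__":
    print(faulted_banks(SAMPLE))
```

For the sample table, the only bank reported is CPU1, whose fan shows 0000 RPM.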

1.5.2.3 prtfru Command

The Sun Fire V215 and V245 servers maintain a hierarchical list of all FRUs in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable programmable read-only memory (SEEPROM) devices located on many FRUs.

Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:

1.5.2.4 psrinfo Command

The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed.

1.5.2.5 showrev Command

The showrev command displays revision information for the current hardware and software. When used with the -p option, this command displays installed patches.


1.6 Recent Diagnostic Test Results

Summaries of the results from the most recent POST and OpenBoot Diagnostics tests are saved across power cycles.

1.6.1 Viewing Recent Test Results

1. Obtain the ok prompt.

2. Do either of the following:

This command produces a system-dependent list of hardware components, along with an indication of which components passed and which failed POST or OpenBoot diagnostics tests.


1.7 Additional Diagnostic Tests for Specific Devices

You can use the probe-scsi, probe-scsi-all, probe-ide, watch-net, and watch-net-all commands to perform additional diagnostic tests on specific devices. This section contains procedures for using these commands.

1.7.1 Confirming That the Internal Hard Drives Are Active

The probe-scsi command transmits an inquiry to SCSI devices connected to the system’s internal SCSI interface. If a SCSI device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.



Caution - If you use the halt command or the Stop-A key sequence to reach the ok prompt and then issue the probe-scsi or probe-scsi-all command, the system might hang.


1. Obtain the ok prompt.

Become superuser and type init 0.

2. Type probe-scsi.

The following is an example of the output from the probe-scsi command.


MPT Version 1.05, Firmware version 0.03.23.00
 
Target 0 
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73 GB SASAddress 500000e011772152 PhyNum 0
Target 1 
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73 GB SASAddress 500000e0115adf42 PhyNum 1
 

1.7.2 Confirming That the External Hard Drives Are Active

The probe-scsi-all command transmits an inquiry to all SCSI devices connected to both the system’s internal and external SCSI interfaces.

1. Obtain the ok prompt.

2. Type probe-scsi-all.

The following shows example output from a server with no externally connected SCSI devices but containing two 73 Gbyte hard drives, both of them active.


{1} ok probe-scsi-all
/pci@1e,600000/pci@0/pci@a/pci@0/pci@8/scsi@1
 
MPT Version 1.05 Firmware Version 0.03.23.00
 
Target 0
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73GB SASAddress 500000e011772152 PhyNum 0
Target 1
  Unit 0   Disk     FUJITSU MAY2073RCSUN72G 0401     143374738 Blocks, 73GB SASAddress 500000e0115adf42 PhyNum 1
 

1.7.3 Confirming That the DVD Super-Multi Drive Is Connected

The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system’s on-board IDE interface.

1. Obtain the ok prompt.

2. Type probe-ide.

The following example output shows an optional DVD super-multi drive installed (as Device 0) and active in a server.


{1} ok probe-ide
  Device 0  ( Primary Master ) 
         Removable ATAPI Model: MATSHITADVD-RAM UJ-845S
 
  Device 1  ( Primary Slave ) 
         Not Present
 
  Device 2  ( Secondary Master ) 
         Not Present
 
  Device 3  ( Secondary Slave ) 
         Not Present

1.7.4 Checking Network Connections on the Primary Network

The watch-net diagnostics test monitors Ethernet packets on the primary network interface. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.

1. Obtain the ok prompt.

2. Type watch-net.


{1} ok watch-net
100Mbps FDX Link up
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet
Type any key to stop................................
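
If you log a watch-net session, the run of packet marks can be tallied afterward. The following Python sketch counts the period and X marks that follow the "Type any key to stop" prompt. The function name count_packets is hypothetical, and the sketch assumes the captured text contains only packet marks after the prompt. Error descriptions that accompany an X would inflate the counts and would need to be filtered out first.

```python
SAMPLE = """\
100Mbps FDX Link up
Looking for Ethernet Packets.
‘.’ is a Good Packet. ‘X’ is a Bad Packet
Type any key to stop................................
"""

MARKER = "Type any key to stop"


def count_packets(text):
    """Count good ('.') and bad ('X') packet marks that follow the prompt line."""
    _, found, tail = text.partition(MARKER)
    if not found:
        return 0, 0
    return tail.count("."), tail.count("X")


if __name__ == "__main__":
    good, bad = count_packets(SAMPLE)
    print(f"good={good} bad={bad}")
```

A steady run of periods with no X marks, as in the example above, indicates that the interface is receiving good packets.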
 

1.7.5 Checking Network Connections on Additional Network Interfaces

The watch-net-all diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as the framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.

1. Obtain the ok prompt.

Become superuser and type init 0.

2. Type watch-net-all at the prompt.


{1} ok watch-net-all
/pci@1e,600000/pci@0/pci@a/pci@0/network@4.1
100Mbps FDX Link up
Looking for Ethernet Packets
‘.’ is a Good Packet. ‘X’ is a Bad Packet
Type any key to stop................................