Sun Fire V445 Server Administration Guide
This chapter describes the diagnostic tools available for the Sun Fire V445 server.
Topics in this chapter include:
Diagnostic Tools Overview
Sun provides a range of diagnostic tools for use with the Sun Fire V445 server.
The diagnostic tools are summarized in TABLE 8-1.
TABLE 8-1 Summary of Diagnostic Tools

Diagnostic Tool | Type | What It Does | Accessibility and Availability | Remote Capability
ALOM system controller | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access | Can function on standby power and without the OS | Designed for remote access
LED indicators | Hardware | Indicate status of the overall system and particular components | Accessed from the system chassis; available whenever power is available | Local, but can be viewed with the ALOM system console
POST | Firmware | Tests core components of the system | Runs automatically on startup; available when the OS is not running | Local, but can be viewed with the ALOM system controller
OpenBoot Diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices | Runs automatically or interactively; available when the OS is not running | Local, but can be viewed with the ALOM system controller
OpenBoot commands | Firmware | Display various kinds of system information | Available when the OS is not running | Local, but can be accessed with the ALOM system controller
Solaris 10 Predictive Self-Healing | Software | Monitors system errors; reports and disables faulty hardware | Runs in the background when the OS is running | Local, but can be accessed with the ALOM system controller
Traditional Solaris OS commands | Software | Display various kinds of system information | Require the OS | Local, but can be accessed with the ALOM system controller
SunVTS | Software | Exercises and stresses the system, running tests in parallel | Requires the OS; optional package that must be installed separately | View and control over the network
Sun Management Center | Software | Monitors both hardware environmental conditions and software performance of multiple machines; generates alerts for various conditions | Requires the OS to be running on both monitored and master servers; requires a dedicated database on the master server | Designed for remote access
Hardware Diagnostic Suite | Software | Exercises an operational system by running sequential tests; also reports failed FRUs | Separately purchased optional add-on to Sun Management Center; requires the OS and Sun Management Center | Designed for remote access
|
About Sun Advanced Lights-Out Manager 1.0 (ALOM)
The Sun Fire V445 server ships with Sun Advanced Lights Out Manager (ALOM) 1.0 installed. The system console is directed to ALOM by default and is configured to show server console information on startup.
ALOM enables you to monitor and control your server over either a serial connection (using the SERIAL MGT port), or Ethernet connection (using the NET MGT port). For information on configuring an Ethernet connection, refer to the ALOM Online Help.
Note - The ALOM serial port, labeled SERIAL MGT, is for server management only. If you need a general-purpose serial port, use the serial port labeled TTYB.
ALOM can send email notification of hardware failures and other events related to the server or to ALOM.
The ALOM circuitry uses standby power from the server. This means that:
- ALOM is active as soon as the server is connected to a power source, and until power is removed by unplugging the power cable.
- ALOM firmware and software continue to be effective when the server OS goes offline.
See TABLE 8-2 for a list of the components monitored by ALOM and the information it provides for each.
TABLE 8-2 What ALOM Monitors

Component | Information
Hard disk drives | Presence and status
System and CPU fans | Speed and status
CPUs | Presence, temperature, and any thermal warning or failure conditions
Power supplies | Presence and status
System temperature | Ambient temperature and any thermal warning or failure conditions
Server front panel | Status indicator
Voltage | Status and thresholds
SAS and USB circuit breakers | Status
ALOM Management Ports
The default management port is labeled SERIAL MGT. This port uses an RJ-45 connector and is for server management only - it supports only ASCII connections to an external console. Use this port when you first begin to operate the server.
Another serial port - labeled TTYB - is available for general-purpose serial data transfer. This port uses a DB-9 connector. For information on pinouts, refer to the Sun Fire V445 Server Installation Guide.
In addition, the server has one 10BASE-T Ethernet management domain interface, labeled NET MGT. To use this port, ALOM configuration is required. For more information, see the ALOM Online Help.
Setting the admin Password for ALOM
When you switch to the ALOM prompt after initial power-on, you will be logged in as the admin user and prompted to set a password. You must set this password in order to execute certain commands.
If you are prompted to do so, set a password for the admin user.
The password must:
- contain at least two alphabetic characters
- contain at least one numeric or one special character
- be at least six characters long
Once the password is set, the admin user has full permissions and can execute all ALOM CLI commands.
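The three password rules above can be expressed as a small check. The following is an illustrative sketch only (the function name is hypothetical, and ALOM performs its own validation; this simply restates the documented rules):

```python
import string

def valid_alom_password(pw: str) -> bool:
    """Check a candidate password against the ALOM admin password rules
    listed above. Hypothetical helper, not the ALOM validator itself."""
    alpha = sum(c.isalpha() for c in pw)                            # at least two alphabetic
    num_or_special = sum(c.isdigit() or c in string.punctuation for c in pw)
    return alpha >= 2 and num_or_special >= 1 and len(pw) >= 6      # at least six chars total

print(valid_alom_password("ab3def"))   # True: meets all three rules
print(valid_alom_password("abcdef"))   # False: no numeric or special character
```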
Basic ALOM Functions
This section covers some basic ALOM functions. For comprehensive documentation, refer to the ALOM Online Help.
To Switch to the ALOM Prompt
Type the default keystroke sequence:
To Switch to the Server Console Prompt
Type:
More than one ALOM user can be connected to the server console stream at a time, but only one user is permitted to type input characters to the console.
If another user is logged on and has write capability, you will see the message below after issuing the console command:
TABLE 8-5
sc> Console session already in use. [view mode]
To take console write capability away from another user, type:
About Status Indicators
For a summary of the server's LED status indicators, see Front Panel Indicators and Back Panel Indicators.
About POST Diagnostics
POST is a firmware program that is useful in determining if a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O devices, and generates messages that can determine the nature of a hardware failure. POST can be run even if the system is unable to boot.
POST detects CPU and memory subsystem faults and is located in a SEEPROM on the MBC (ALOM) board. POST can be set to run by the OpenBoot program at power-on by setting three OpenBoot configuration variables: diag-switch?, diag-trigger, and diag-level.
POST runs automatically when the system power is applied, or following a noncritical error reset, if all of the following conditions apply:
- diag-switch? is set to true or false (default is false)
- diag-level is set to min, max, or menus (default is min)
- diag-trigger is set to power-on-reset and error-reset (default is power-on-reset and error-reset)
If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively. If diag-level is set to menus, a menu of all the tests executed at power-up is displayed. POST diagnostic and error message reports are displayed on a console.
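The conditions above can be summarized as a small decision function. This is a simplified sketch of the documented behavior, not firmware logic (the function name and the set-based representation of diag-trigger are assumptions for illustration):

```python
def post_runs_automatically(diag_level: str, diag_trigger: set, reset_event: str) -> bool:
    """Model of the automatic POST conditions described above: POST runs at
    a reset event when diag-level enables testing and the event's class is
    listed in diag-trigger. Illustrative only."""
    if diag_level not in ("min", "max", "menus"):   # off disables testing
        return False
    return reset_event in diag_trigger

# Defaults: diag-level = min, diag-trigger = power-on-reset and error-reset
defaults = {"power-on-reset", "error-reset"}
print(post_runs_automatically("min", defaults, "power-on-reset"))  # True
print(post_runs_automatically("off", defaults, "power-on-reset"))  # False
```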
For information on starting and controlling POST diagnostics, see About the post Command.
OpenBoot PROM Enhancements for Diagnostic Operation
This section describes the diagnostic operation enhancements provided by OpenBoot PROM Version 4.15 and later and presents information about how to use the resulting new operational features. Note that the behavior of certain operational features on your system might differ from the behavior described in this section.
What's New in Diagnostic Operation
The following features are the diagnostic operation enhancements:
- New and redefined configuration variables simplify diagnostic controls and allow you to customize a "normal mode" of diagnostic operation for your environment. See About the New and Redefined Configuration Variables.
- New standard (default) configuration enables and runs diagnostics and enables Automatic System Restoration (ASR) capabilities at power-on and after error reset events. See About the Default Configuration.
- Service mode establishes a Sun prescribed methodology for isolating and diagnosing problems. See About Service Mode.
- The post command executes the power-on self-test (POST) and provides options that enable you to specify the level of diagnostic testing and verbosity of diagnostic output. See About the post Command.
About the New and Redefined Configuration Variables
New and redefined configuration variables simplify diagnostic operation and provide you with more control over the amount of diagnostic output. The following list summarizes the configuration variable changes. See TABLE 8-7 for complete descriptions of the variables.
- New variables:
- service-mode? - Diagnostics are executed at a Sun-prescribed level.
- diag-trigger - Replaces and consolidates the functions of post-trigger and obdiag-trigger.
- verbosity - Controls the amount and detail of firmware output.
- Redefined variable:
- The diag-switch? parameter has modified behavior for controlling diagnostic execution in normal mode on Sun UltraSPARC-based volume servers. Behavior of the diag-switch? parameter is unchanged on Sun workstations.
- Default value changes:
- auto-boot-on-error? - New default value is true.
- diag-level - New default value is max.
- error-reset-recovery - New default value is sync.
About the Default Configuration
The new standard (default) configuration runs diagnostic tests and enables full ASR capabilities during power-on and after the occurrence of an error reset (RED State Exception Reset, CPU Watchdog Reset, System Watchdog Reset, Software-Instruction Reset, or Hardware Fatal Reset). This is a change from the previous default configuration, which did not run diagnostic tests. When you power on your system for the first time, the change will be visible to you through the increased boot time and the display of approximately two screens of diagnostic output produced by POST and OpenBoot Diagnostics.
Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
The visible changes are due to the default settings of two configuration variables, diag-level (max) and verbosity (normal):
- diag-level (max) specifies maximum diagnostic testing, including extensive memory testing, which increases system boot time. See Reference for Estimating System Boot Time (to the ok Prompt) for more information about the increased boot time.
- verbosity (normal) specifies that diagnostic messages and information will be displayed, which usually produces approximately two screens of output. See Reference for Sample Outputs for diagnostic output samples of verbosity settings min and normal.
After initial power-on, you can customize the standard (default) configuration by setting the configuration variables to define a "normal mode" of operation that is appropriate for your production environment. TABLE 8-7 lists and describes the defaults and keywords of the OpenBoot configuration variables that control diagnostic testing and ASR capabilities. These are the variables you will set to define your normal mode of operation.
Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.
TABLE 8-7 OpenBoot Configuration Variables That Control Diagnostic Testing and Automatic System Restoration

auto-boot?
Determines whether the system boots automatically. Default is true.
- true - System automatically boots after initialization, provided no firmware-based (diagnostics or OpenBoot) errors are detected.
- false - System remains at the ok prompt until you type boot.

auto-boot-on-error?
Determines whether the system attempts a degraded boot after a nonfatal error. Default is true.
- true - System automatically boots after a nonfatal error if auto-boot? is also set to true.
- false - System remains at the ok prompt.

boot-device
Specifies the name of the default boot device, which is also the normal mode boot device.

boot-file
Specifies the default boot arguments, which are also the normal mode boot arguments.

diag-device
Specifies the name of the boot device that is used when diag-switch? is true.

diag-file
Specifies the boot arguments that are used when diag-switch? is true.

diag-level
Specifies the level or type of diagnostics that are executed. Default is max.
- off - No testing.
- min - Basic tests are run.
- max - More extensive tests might be run, depending on the device. Memory is extensively checked.

diag-out-console
Redirects system console output to the system controller.
- true - Redirects output to the system controller.
- false - Restores output to the local console.
Note: See your system documentation for information about redirecting system console output to the system controller. (Not all systems are equipped with a system controller.)

diag-passes
Specifies the number of consecutive executions of OpenBoot Diagnostics self-tests that are run from the OpenBoot Diagnostics (obdiag) menu. Default is 1.
Note: diag-passes applies only to systems with firmware that contains OpenBoot Diagnostics and has no effect outside the OpenBoot Diagnostics menu.

diag-script
Determines which devices are tested by OpenBoot Diagnostics. Default is normal.
- none - OpenBoot Diagnostics do not run.
- normal - Tests all devices that are expected to be present in the system's baseline configuration and for which self-tests exist.
- all - Tests all devices that have self-tests.

diag-switch?
Controls diagnostic execution in normal mode. Default is false.
For servers:
- true - Diagnostics are executed only on power-on reset events; the level of test coverage, verbosity, and output is determined by user-defined settings.
- false - Diagnostics are executed upon the next system reset, but only for the classes of reset events specified by the OpenBoot configuration variable diag-trigger. The level of test coverage, verbosity, and output is determined by user-defined settings.
For workstations:
- true - Diagnostics are executed only on power-on reset events; the level of test coverage, verbosity, and output is determined by user-defined settings.
- false - Diagnostics are disabled.

diag-trigger
Specifies the class of reset event that causes diagnostics to run automatically. Default setting is power-on-reset error-reset.
- none - Diagnostic tests are not executed.
- error-reset - Reset that is caused by certain hardware error events, such as RED State Exception Reset, Watchdog Resets, Software-Instruction Reset, or Hardware Fatal Reset.
- power-on-reset - Reset that is caused by power cycling the system.
- user-reset - Reset that is initiated by an OS panic or by user-initiated commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
- all-resets - Any kind of system reset.
Note: Both POST and OpenBoot Diagnostics run at the specified reset event if the variable diag-script is set to normal or all. If diag-script is set to none, only POST runs.

error-reset-recovery
Specifies the recovery action after an error reset. Default is sync.
- none - No recovery action.
- boot - System attempts to boot.
- sync - Firmware attempts to execute a Solaris sync callback routine.

service-mode?
Controls whether the system is in service mode. Default is false.
- true - Service mode. Diagnostics are executed at Sun-specified levels, overriding but preserving user settings.
- false - Normal mode. Diagnostics execution depends entirely on the settings of diag-switch? and other user-defined OpenBoot configuration variables.

test-args
Customizes OpenBoot Diagnostics tests. Allows a text string of reserved keywords (separated by commas) to be specified in the following ways:
- As an argument to the test command at the ok prompt.
- As an OpenBoot variable to the setenv command at the ok or obdiag prompt.
Note: The variable test-args applies only to systems with firmware that contains OpenBoot Diagnostics. See your system documentation for a list of keywords.

verbosity
Controls the amount and detail of OpenBoot, POST, and OpenBoot Diagnostics output. Default is normal.
- none - Only error and fatal messages are displayed on the system console. The banner is not displayed.
Note: Problems in systems with verbosity set to none might be deemed not diagnosable, rendering the system unserviceable by Sun.
- min - Notice, error, warning, and fatal messages are displayed on the system console. Transitional states and the banner are also displayed.
- normal - Summary progress and operational messages are displayed on the system console in addition to the messages displayed by the min setting. The work-in-progress indicator shows the status and progress of the boot sequence.
- max - Detailed progress and operational messages are displayed on the system console in addition to the messages displayed by the min and normal settings.
About Service Mode
Service mode is an operational mode defined by Sun that facilitates fault isolation and recovery of systems that appear to be nonfunctional. When initiated, service mode overrides the settings of key OpenBoot configuration variables.
Note that service mode does not change your stored settings. After initialization (at the ok prompt), all OpenBoot PROM configuration variables revert to the user-defined settings. In this way, you or your service provider can quickly invoke a known and maximum level of diagnostics and still preserve your normal mode settings.
TABLE 8-8 lists the OpenBoot configuration variables that are affected by service mode and the overrides that are applied when you select service mode.
TABLE 8-8 Service Mode Overrides

OpenBoot Configuration Variable | Service Mode Override
auto-boot? | false
diag-level | max
diag-trigger | power-on-reset error-reset user-reset
input-device | Factory default
output-device | Factory default
verbosity | max

The following apply only to systems with firmware that contains OpenBoot Diagnostics:
diag-script | normal
test-args | subtests,verbose
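The key property described above - service mode layers Sun-prescribed values over your settings without changing what is stored in NVRAM - can be sketched as a simple override map. This is an illustrative model only (the function and variable names are assumptions, not firmware interfaces), using the overrides from TABLE 8-8:

```python
# Service mode overrides, per TABLE 8-8; stored (user) settings are untouched.
SERVICE_MODE_OVERRIDES = {
    "auto-boot?": "false",
    "diag-level": "max",
    "diag-trigger": "power-on-reset error-reset user-reset",
    "input-device": "factory default",
    "output-device": "factory default",
    "verbosity": "max",
}

def effective_settings(stored: dict, service_mode: bool) -> dict:
    """Return the settings in effect for this boot. In service mode the
    overrides are layered on top of the stored values; the stored dict
    itself is never modified, mirroring the behavior described above."""
    if not service_mode:
        return dict(stored)
    return {**stored, **SERVICE_MODE_OVERRIDES}

stored = {"auto-boot?": "true", "diag-level": "min", "verbosity": "normal"}
eff = effective_settings(stored, service_mode=True)
print(eff["diag-level"])      # max
print(stored["diag-level"])   # min -- user settings preserved
```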
About Initiating Service Mode
Enhancements provide a software mechanism for specifying service mode:
- service-mode? configuration variable - When set to true, initiates service mode. (Service mode should be used only by authorized Sun service providers.)
Note - The diag-switch? configuration variable should remain at the default setting (false) for normal operation. To specify diagnostic testing for your OS, see To Initiate Normal Mode.
For instructions, see To Initiate Service Mode.
About Overriding Service Mode Settings
When the system is in service mode, three commands can override service mode settings. TABLE 8-9 describes the effect of each command.
TABLE 8-9 Scenarios for Overriding Service Mode Settings

Command | Issued From | What It Does
post | ok prompt | OpenBoot firmware forces a one-time execution of normal mode diagnostics.
bootmode diag | system controller | OpenBoot firmware overrides service mode settings and forces a one-time execution of normal mode diagnostics.
bootmode skip_diag | system controller | OpenBoot firmware suppresses service mode and bypasses all firmware diagnostics.
Note - Not all systems are equipped with a system controller.
About Normal Mode
Normal mode is the customized operational mode that you define for your environment. To define normal mode, set the values of the OpenBoot configuration variables that control diagnostic testing. See TABLE 8-7 for the list of variables that control diagnostic testing.
Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.
When you are deciding whether to enable diagnostic testing in your normal environment, remember that you should always run diagnostics to troubleshoot an existing problem, and after the following events:
- Initial system installation
- New hardware installation and replacement of defective hardware
- Hardware configuration modification
- Hardware relocation
- Firmware upgrade
- Power interruption or failure
- Hardware errors
- Severe or inexplicable software problems
About Initiating Normal Mode
If you define normal mode for your environment, you can specify normal mode with the following method:
System controller bootmode diag command - When you issue this command, it specifies normal mode with the configuration values defined by you - with the following exceptions:
- If you defined diag-level = off, bootmode diag specifies diagnostics at diag-level = min.
- If you defined verbosity = none, bootmode diag specifies diagnostics at verbosity = min.
Note - The next reset cycle must occur within 10 minutes of issuing the bootmode diag command, or the bootmode command is cleared and normal mode is not initiated.
For instructions, see To Initiate Normal Mode.
About the post Command
The post command enables you to easily invoke POST diagnostics and to control the level of testing and the amount of output. When you issue the post command, OpenBoot firmware performs the following actions:
- Initiates a user reset
- Triggers a one-time execution of POST at the test level and verbosity that you specify
- Clears old test results
- Displays and logs the new test results
Note - The post command overrides service mode settings and pending system controller bootmode diag and bootmode skip_diag commands.
The syntax for the post command is:
post [level [verbosity]]
where:
- level = min or max
- verbosity = min, normal, or max
The level and verbosity options provide the same functions as the OpenBoot configuration variables diag-level and verbosity. To determine which settings you should use for the post command options, see TABLE 8-7 for descriptions of the keywords for diag-level and verbosity.
You can specify settings for:
- Both level and verbosity
- level only (If you specify a verbosity setting, you must also specify a level setting.)
- Neither level nor verbosity
If you specify a setting for level only, the post command uses the normal mode value for verbosity with the following exception:
- If the normal mode value of verbosity = none, post uses verbosity = min.
If you specify settings for neither level nor verbosity, the post command uses the normal mode values you specified for the configuration variables diag-level and verbosity, with two exceptions:
- If the normal mode value of diag-level = off, post uses level = min.
- If the normal mode value of verbosity = none, post uses verbosity = min.
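The defaulting rules above can be summarized in one resolution function. This is an illustrative sketch of the documented fallback behavior (the function name and keyword parameters are assumptions for illustration, not an OpenBoot interface):

```python
def resolve_post_options(level=None, verbosity=None,
                         normal_diag_level="max", normal_verbosity="normal"):
    """Resolve the effective level/verbosity for the post command per the
    rules above: omitted options fall back to the normal mode values, except
    that diag-level = off and verbosity = none are promoted to min."""
    if level is None:
        level = "min" if normal_diag_level == "off" else normal_diag_level
    if verbosity is None:
        verbosity = "min" if normal_verbosity == "none" else normal_verbosity
    return level, verbosity

print(resolve_post_options())                                # defaults apply
print(resolve_post_options(level="min", normal_verbosity="none"))
```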
To Initiate Service Mode
For background information, see About Service Mode.
1. Set the service-mode? variable. At the ok prompt, type:
ok setenv service-mode? true
For service mode to take effect, you must reset the system.
2. At the ok prompt, type:
To Initiate Normal Mode
For background information, see About Normal Mode.
1. At the ok prompt, type:
ok setenv service-mode? false
The system will not actually enter normal mode until the next reset.
2. Type:
Reference for Estimating System Boot Time (to the ok Prompt)
Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
The measurement of system boot time begins when you power on (or reset) the system and ends when the OpenBoot ok prompt appears. During the boot time period, the firmware executes diagnostics (POST and OpenBoot Diagnostics) and performs OpenBoot initialization. The time required to run OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization is generally similar for all systems, though it varies with the number of I/O cards installed when diag-script is set to all. However, at the default settings (diag-level = max and verbosity = normal), POST executes extensive memory tests, which increase system boot time.
System boot time will vary from system to system, depending on the configuration of system memory and the number of CPUs:
- Because each CPU tests its associated memory and the memory tests run simultaneously, memory test time depends on the amount of memory on the most populated CPU.
- Because competition for system resources makes CPU testing a less linear process than memory testing, CPU test time depends on the number of CPUs.
If you need to know the approximate boot time of your new system before you power on for the first time, the following sections describe two methods you can use to estimate boot time:
Boot Time Estimates for Typical Configurations
The following are three typical configurations and the approximate boot time you can expect for each:
- Small configuration (2 CPUs and 4 Gbytes of memory) - Boot time is approximately 5 minutes.
- Medium configuration (4 CPUs and 16 Gbytes of memory) - Boot time is approximately 10 minutes.
- Large configuration (4 CPUs and 32 Gbytes of memory) - Boot time is approximately 15 minutes.
Estimating Boot Time for Your System
Generally, for systems configured with default settings, the times required to execute OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization are the same for all systems:
- 1 minute for OpenBoot Diagnostics testing (systems with a greater number of devices to be tested might require more time)
- 2 minutes for OpenBoot setup, configuration, and initialization
To estimate the time required to run POST memory tests, you need to know the amount of memory associated with the most populated CPU. To estimate the time required to run POST CPU tests, you need to know the number of CPUs. Use the following guidelines to estimate memory and CPU test times:
- 2 minutes per Gbyte of memory associated with the most populated CPU
- 1 minute per CPU
The following example shows how to estimate the system boot time of a sample configuration consisting of 4 CPUs and 32 Gbytes of system memory, with 8 Gbytes of memory on the most populated CPU.
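Applying the guidelines above to that sample configuration gives a concrete estimate. The following sketch (the function name is illustrative) simply adds the fixed OpenBoot times to the per-gigabyte and per-CPU guideline figures:

```python
def estimate_boot_minutes(num_cpus, gb_on_most_populated_cpu):
    """Rough boot-time estimate (to the ok prompt) using the guidelines
    above: 1 min OpenBoot Diagnostics + 2 min OpenBoot setup/config/init
    + 2 min per GB on the most populated CPU + 1 min per CPU."""
    obdiag = 1
    openboot_init = 2
    memory_test = 2 * gb_on_most_populated_cpu
    cpu_test = 1 * num_cpus
    return obdiag + openboot_init + memory_test + cpu_test

# Sample configuration from the text: 4 CPUs, 8 GB on the most populated CPU
print(estimate_boot_minutes(4, 8))  # 23
```

For the sample configuration, the estimate is 1 + 2 + (2 x 8) + (1 x 4) = 23 minutes.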
Reference for Sample Outputs
At the default setting of verbosity = normal, POST and OpenBoot Diagnostics generate less diagnostic output (about 2 pages) than was produced before the OpenBoot PROM enhancements (over 10 pages). This section includes output samples for verbosity settings at min and normal.
Note - The diag-level configuration variable also affects how much output the system generates. The following samples were produced with diag-level set to max, the default setting.
The following sample shows the firmware output after a power reset when verbosity is set to min. At this verbosity setting, OpenBoot firmware displays notice, error, warning, and fatal messages but does not display progress or operational messages. Transitional states and the power-on banner are also displayed. Since no error conditions were encountered, this sample shows only the POST execution message, the system's install banner, and the device self-tests conducted by OpenBoot Diagnostics.
Executing POST w/%o0 = 0000.0400.0101.2041
Sun Fire V445, Keyboard Present
Copyright 1998-2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok
The following sample shows the diagnostic output after a power reset when verbosity is set to normal, the default setting. At this verbosity setting, the OpenBoot firmware displays summary progress or operational messages in addition to the notice, error, warning, and fatal messages; transitional states; and install banner displayed by the min setting. On the console, the work-in-progress indicator shows the status and progress of the boot sequence.
Sun Fire V445, Keyboard Present
Copyright 1998-2004 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok
Reference for Determining Diagnostic Mode
The flowchart in FIGURE 8-7 summarizes graphically how various system controller and OpenBoot variables affect whether a system boots in normal or service mode, as well as whether any overrides occur.
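Since the flowchart cannot be reproduced in text, the core decision logic it summarizes can be sketched as follows. This is a simplified model under stated assumptions (only service-mode? and a pending system controller bootmode command are considered; the function name and return strings are illustrative):

```python
def diagnostic_mode(service_mode, bootmode=None):
    """Simplified sketch of the mode selection summarized by FIGURE 8-7:
    bootmode commands take precedence over service mode, which in turn
    overrides the user-defined normal mode settings."""
    if bootmode == "skip_diag":
        return "no firmware diagnostics"        # bypasses service mode too
    if bootmode == "diag":
        return "normal mode diagnostics"        # one-time override
    if service_mode:
        return "service mode diagnostics"       # Sun-prescribed levels
    return "normal mode (user-defined settings)"

print(diagnostic_mode(service_mode=True, bootmode="diag"))
# normal mode diagnostics
```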
CODE EXAMPLE 8-1
{3} ok post
SC Alert: Host System has Reset
Executing Power On Self Test
Q#0>
0>@(#)Sun Fire[TM] V445 POST 4.22.11 2006/06/12 15:10
/export/delivery/delivery/4.22/4.22.11/post4.22.x/Fiesta/boston/integrated (root)
0>Copyright © 2006 Sun Microsystems, Inc. All rights reserved
SUN PROPRIETARY/CONFIDENTIAL.
Use is subject to license terms.
0>OBP->POST Call with %o0=00000800.01012000.
0>Diag level set to MIN.
0>Verbosity level set to NORMAL.
0>Start Selftest.....
0>CPUs present in system: 0 1 2 3
0>Test CPU(s)....Done
0>Interrupt Crosscall....Done
0>Init Memory....|
SC Alert: Host System has Reset
'Done
0>PLL Reset....Done
0>Init Memory....Done
0>Test Memory....Done
0>IO-Bridge Tests....Done
0>INFO:
0> POST Passed all devices.
0>
0>POST: Return to OBP.
SC Alert: Host System has Reset
Configuring system memory & CPU(s)
Probing system devices
Probing memory
Probing I/O buses
screen not found.
keyboard not found.
Keyboard not present. Using ttya for input and output.
Probing system devices
Probing memory
Probing I/O buses
Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.11, 24576 MB memory installed, Serial #64548465.
Ethernet address 0:3:ba:d8:ee:71, Host ID: 83d8ee71.
FIGURE 8-7 Diagnostic Mode Flowchart
Quick Reference for Diagnostic Operation
TABLE 8-10 summarizes the effects of the following user actions on diagnostic operation:
- Set service-mode? to true
- Issue the bootmode commands, bootmode diag or bootmode skip_diag
- Issue the post command
TABLE 8-10 Summary of Diagnostic Operation
User Action
|
Sets Configuration Variables
|
And Initiates
|
Service Mode
|
Set service-mode? to true
|
Note: Service mode overrides the settings of the following configuration variables without changing your stored settings:
- auto-boot? = false
- diag-level = max
- diag-trigger = power-on-reset
error-reset user reset
- input-device = Factory default
- output-device = Factory default
- verbosity = max
The following apply only to systems with firmware that contains OpenBoot Diagnostics:
- diag-script = normal
- test-args = subtests,verbose
|
Service mode
(defined by Sun)
|
Normal Mode
|
Set service-mode? to false
|
- auto-boot? = user-defined setting
- auto-boot-on-error? = user-defined setting
- diag-level = user-defined setting
- verbosity = user-defined setting
- diag-script = user-defined setting
- diag-trigger = user-defined setting
- input-device = user-defined setting
- output-device = user-defined setting
|
Normal mode
(user-defined)
|
bootmode Commands
|
Issue bootmode diag command
|
Overrides service mode settings and uses normal mode settings with the following exceptions:
- diag-level = min if normal mode value = off
- verbosity = min if normal mode value = none
|
Normal mode diagnostics with the exceptions in the preceding column.
|
Issue bootmode skip_diag command
|
|
OpenBoot initialization without running diagnostics
|
post Command
Note: If the value of diag-script = normal or all, OpenBoot Diagnostics also run.
|
Issue post command
|
|
POST diagnostics
|
Specify both level and
verbosity
|
level and verbosity = user-defined values
|
|
Specify neither level nor verbosity
|
level and verbosity = normal mode values with the following exceptions:
- level = min if normal mode value of diag-level = none
- verbosity = min if normal mode value of verbosity = none
|
|
Specify level only
|
level = user-defined value
verbosity = normal mode value for verbosity (Exception: verbosity = min if normal mode value of verbosity = none)
|
|
OpenBoot Diagnostics
Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the boot PROM.
To Start OpenBoot Diagnostics
|
1. Type:
TABLE 8-11
ok setenv diag-switch? true
ok setenv auto-boot? false
ok reset-all
|
2. Type:
ok obdiag
This command displays the OpenBoot Diagnostics menu. See TABLE 8-13.
TABLE 8-13 Sample obdiag Menu
|
1 LSILogic,sas@1
4 rmc-comm@0,c28000 serial@3,fffff8
|
2 flashprom@0,0
5 rtc@0,70
|
3 network@0
6 serial@0,c2c000
|
Commands: test test-all except help what setenv set-default exit
|
diag-passes=1 diag-level=min test-args=args
|
Note - If you have a PCI card installed in the server, then additional tests will appear on the obdiag menu.
|
3. Type:
TABLE 8-14
obdiag> test n
|
where n represents the number corresponding to the test you want to run.
To see a summary of the tests, type help at the obdiag> prompt.
4. To run all tests, type:
TABLE 8-16
obdiag> test-all
Hit the spacebar to interrupt testing
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1 ......... passed
Testing /ebus@1f,464000/flashprom@0,0 ................................. passed
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0 Internal loopback test -- succeeded.
Link is -- up
........ passed
Testing /ebus@1f,464000/rmc-comm@0,c28000 ............................. passed
Testing /pci@1f,700000/pci@0/pci@1/pci@0/isa@1e/rtc@0,70 .............. passed
Testing /ebus@1f,464000/serial@0,c2c000 ............................... passed
Testing /ebus@1f,464000/serial@3,fffff8 ............................... passed
Pass:1 (of 1) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:1:1
Hit any key to return to the main menu
|
Note - From the obdiag prompt you can select a device from the list and test it. However, at the ok prompt you must use the full device path. In addition, the device must have a self-test method; otherwise, errors result.
|
Controlling OpenBoot Diagnostics Tests
Most of the OpenBoot configuration variables you use to control POST (see TABLE 8-7) also affect OpenBoot Diagnostics tests.
- Use the diag-level variable to control the OpenBoot Diagnostics testing level.
- Use test-args to customize how the tests run.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 8-17.
TABLE 8-17 Keywords for the test-args OpenBoot Configuration Variable
Keyword
|
What It Does
|
bist
|
Invokes built-in self-test (BIST) on external and peripheral devices
|
debug
|
Displays all debug messages
|
iopath
|
Verifies bus/interconnect integrity
|
loopback
|
Exercises external loopback path for the device
|
media
|
Verifies external and peripheral device media accessibility
|
restore
|
Attempts to restore original state of the device if the previous execution of the test failed
|
silent
|
Displays only errors rather than the status of each test
|
subtests
|
Displays main test and each subtest that is called
|
verbose
|
Displays detailed messages of status of all tests
|
callers=N
|
Displays backtrace of N callers when an error occurs
- callers=0 - displays backtrace of all callers before the error
|
errors=N
|
Continues executing the test until N errors are encountered
- errors=0 - displays all error reports without terminating testing
|
If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
TABLE 8-18
ok setenv test-args debug,loopback,media
|
test and test-all Commands
You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
TABLE 8-19
ok test /pci@x,y/SUNW,qlc@2
|
Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V445 system.
|
To customize an individual test, you can use test-args as follows:
TABLE 8-20
ok test /usb@1,3:test-args={verbose,debug}
|
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command.
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
TABLE 8-22
ok test-all /pci@9,700000/usb@1,3
|
OpenBoot Diagnostics Error Messages
OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. The following example displays a sample OpenBoot Diagnostics error message.
CODE EXAMPLE 8-2 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/isa@7/flashprom@2,0
ERROR : There is no POST in this FLASHPROM or POST header is
unrecognized
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
MACHINE : Sun Fire V445
SERIAL# : 51347798
DATE : 03/05/2003 15:17:31 GMT
CONTR0LS: diag-level=max test-args=errors=1
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) .............
failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1
|
About OpenBoot Commands
OpenBoot commands are commands you type from the ok prompt. OpenBoot commands that can provide useful diagnostic information are:
- probe-scsi-all
- probe-ide
- show-devs
probe-scsi-all
The probe-scsi-all command diagnoses problems with the SAS devices.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi-all command can hang the system.
|
The probe-scsi-all command communicates with all SAS devices connected to on-board SAS controllers and accesses devices connected to any host adapters installed in PCI slots.
For any SAS device that is connected and active, the probe-scsi-all command displays its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.
The following is sample output from the probe-scsi-all command.
CODE EXAMPLE 8-3 Sample probe-scsi-all Command Output
{3} ok probe-scsi-all
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
MPT Version 1.05, Firmware Version 1.08.04.00
Target 0
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246b35 PhyNum 0
Target 1
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246bc1 PhyNum 1
Target 4 Volume 0
Unit 0 Disk LSILOGICLogical Volume 3000 16515070 Blocks, 8455 MB
Target 6
Unit 0 Disk FUJITSU MAV2073RCSUN72G 0301 143374738 Blocks, 73 GB
SASAddress 500000e0116a81c2 PhyNum 6
{3} ok
|
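As an illustration (not part of the firmware itself), output like the sample above can be captured from the console and mined with standard text tools. The sketch below, under the assumption that such a capture exists as a plain file, lists each attached disk's PHY number and World Wide Name; the /tmp path is hypothetical.

```shell
# Sketch: extract SAS WWNs from a captured probe-scsi-all console log.
# The here-document reproduces lines from the sample output above; the
# log file name is an assumption for the example.
cat > /tmp/probe-scsi.log <<'EOF'
Target 0
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246b35 PhyNum 0
Target 1
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246bc1 PhyNum 1
Target 6
Unit 0 Disk FUJITSU MAV2073RCSUN72G 0301 143374738 Blocks, 73 GB
SASAddress 500000e0116a81c2 PhyNum 6
EOF

# Field 2 is the WWN, field 4 is the PHY number on each SASAddress line.
awk '/^SASAddress/ { print "PhyNum " $4 ": " $2 }' /tmp/probe-scsi.log
```

Comparing the listed WWNs against a previous capture is a quick way to confirm that every expected disk is still visible to the controller.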
probe-ide
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
|
The following is sample output from the probe-ide command.
CODE EXAMPLE 8-4 Sample probe-ide Command Output
{1} ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
show-devs
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 8-5 shows some sample output.
CODE EXAMPLE 8-5 show-devs Command Output (Truncated)
/i2c@1f,520000
/ebus@1f,464000
/pci@1f,700000
/pci@1e,600000
/memory-controller@3,0
/SUNW,UltraSPARC-IIIi@3,0
/memory-controller@2,0
/SUNW,UltraSPARC-IIIi@2,0
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/i2c@1f,520000/cpu-fru-prom@0,e8
/i2c@1f,520000/dimm-spd@0,e6
/i2c@1f,520000/dimm-spd@0,e4
.
.
.
/pci@1f,700000/pci@0
/pci@1f,700000/pci@0/pci@9
/pci@1f,700000/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2
/pci@1f,700000/pci@0/pci@1
/pci@1f,700000/pci@0/pci@2/pci@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2/pci@0/network@4,1
/pci@1f,700000/pci@0/pci@2/pci@0/network@4
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/disk
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/tape
|
To Run OpenBoot Commands
|
1. Halt the system to reach the ok prompt.
How you do this depends on the system's condition. If possible, you should warn users before you shut the system down.
2. Type the appropriate command at the console prompt.
About Predictive Self-Healing
In Solaris 10 systems, the Solaris Predictive Self-Healing (PSH) technology enables the Sun Fire V445 server to diagnose problems while the Solaris OS is running, and to mitigate many problems before they negatively affect operations.
The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun's knowledge article database.
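The two identifiers in that notification, the event UUID and the message ID, are what you feed to fmdump and to Sun's knowledge article site. As a minimal sketch, assuming the console message has been saved to a file (the /tmp path is hypothetical), both can be extracted with sed:

```shell
# Sketch: pull the event UUID and message ID out of a saved PSH console
# message so they can be passed to fmdump -v -u and looked up at
# http://www.sun.com/msg/. The two lines reproduce part of the sample
# message shown later in this chapter (TABLE 8-23).
cat > /tmp/psh.msg <<'EOF'
Jul 1 14:30:20 sunrise EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4
Jul 1 14:30:20 sunrise Refer to http://sun.com/msg/SUN4-8000-0Y for more information.
EOF

event_id=$(sed -n 's/.*EVENT-ID: //p' /tmp/psh.msg)
msg_id=$(sed -n 's|.*sun\.com/msg/\([A-Z0-9-]*\).*|\1|p' /tmp/psh.msg)
echo "Next step: fmdump -v -u $event_id   (knowledge article: $msg_id)"
```

On a live system you would run the suggested fmdump command directly; the sketch simply shows where each identifier lives in the message.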
The Predictive Self-Healing technology covers the following Sun Fire V445 server components:
- UltraSPARC IIIi processors
- Memory
- I/O bus
The PSH console message provides the following information:
- Type
- Severity
- Description
- Automated Response
- Impact
- Suggested Action for System Administrator
If the Solaris PSH facility has detected a faulty component, use the fmdump command (described in the following subsections) to identify the fault. Faulty FRUs are identified in fault messages using the FRU name.
Use the following web site to interpret faults and obtain information on a fault:
http://www.sun.com/msg/
This web site directs you to provide the message ID that your system displayed. The web site then provides knowledge articles about the fault and corrective action to resolve the fault. The fault information and documentation at this web site are updated regularly.
You can find more detailed descriptions of Solaris 10 Predictive Self-Healing at the following web site:
http://www.sun.com/bigadmin/features/articles/selfheal.html
Predictive Self-Healing Tools
In summary, the Solaris Fault Manager daemon (fmd) performs the following functions:
- Receives telemetry information about problems detected by the system software.
- Diagnoses the problems and provides system-generated messages.
- Initiates proactive self-healing activities such as disabling faulty components.
TABLE 8-23 shows a typical message generated when a fault occurs on your system. The message appears on your console and is recorded in the /var/adm/messages file.
Note - The messages in TABLE 8-23 indicate that the fault has already been diagnosed. Any corrective action that the system can perform has already taken place. If your server is still running, it continues to run.
|
TABLE 8-23 System Generated Predictive Self-Healing Message
Output Displayed
|
Description
|
Jul 1 14:30:20 sunrise EVENT-TIME: Tue Nov 1 16:30:20 PST 2005
|
EVENT-TIME: the time stamp of the diagnosis.
|
Jul 1 14:30:20 sunrise PLATFORM: SUNW,A70, CSN: -, HOSTNAME: sunrise
|
PLATFORM: A description of the system encountering the problem
|
Jul 1 14:30:20 sunrise SOURCE: eft, REV: 1.13
|
SOURCE: Information on the Diagnosis Engine used to determine the fault
|
Jul 1 14:30:20 sunrise EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4
|
EVENT-ID: The Universally Unique event ID (UUID) for this fault
|
Jul 1 14:30:20 sunrise DESC: A problem was detected in the PCI-Express subsystem
|
DESC: A basic description of the failure
|
Jul 1 14:30:20 sunrise Refer to http://sun.com/msg/SUN4-8000-0Y for more information.
|
WEBSITE: Where to find specific information and actions for this fault
|
Jul 1 14:30:20 sunrise AUTO-RESPONSE: One or more device instances may be disabled
|
AUTO-RESPONSE: What, if anything, the system did to alleviate any follow-on issues
|
Jul 1 14:30:20 sunrise IMPACT: Loss of services provided by the device instances associated with this fault
|
IMPACT: A description of what that response may have done
|
Jul 1 14:30:20 sunrise REC-ACTION: Schedule a repair procedure to replace the affected device. Use fmdump -v -u EVENT_ID to identify the device or contact Sun for support.
|
REC-ACTION: A short description of what the system administrator should do
|
Using the Predictive Self-Healing Commands
For complete information about Predictive Self-Healing commands, refer to the Solaris 10 man pages. This section describes some details of the following commands:
- fmdump(1M)
- fmadm(1M)
- fmstat(1M)
Using the fmdump Command
After the message in TABLE 8-23 is displayed, more information about the fault is available. The fmdump command displays the contents of any log files associated with the Solaris Fault Manager.
The fmdump command produces output similar to the following example, which assumes there is only one fault.
TABLE 8-24
# fmdump
TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
|
fmdump -V
The -V option provides more details.
TABLE 8-25
# fmdump -V -u 0ee65618-2218-4997-c0dc-b5c410ed8ec2
TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
100% fault.io.fire.asic
FRU: hc://product-id=SUNW,A70/motherboard=0
rsrc: hc:///motherboard=0/hostbridge=0/pciexrc=0
|
Three lines of new output are delivered with the -V option.
- The first line is a summary of information displayed previously in the console message but includes the timestamp, the UUID, and the Message-ID.
- The second line is a declaration of the certainty of the diagnosis. In this case the failure is in the ASIC described. If the diagnosis could involve multiple components, two lines would be displayed here with 50 percent in each, for example.
- The FRU line declares the part that needs to be replaced to return the system to a fully operational state.
- The rsrc line describes what component was taken out of service as a result of this fault.
fmdump -e
To get information about the errors that caused this failure, use the -e option.
TABLE 8-26
# fmdump -e
TIME CLASS
Nov 02 10:04:14.3008 ereport.io.fire.jbc.mb_per
|
Using the fmadm faulty Command
The fmadm utility lists and modifies system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault.
TABLE 8-27
# fmadm faulty
STATE RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
0ee65618-2218-4997-c0dc-b5c410ed8ec2
|
The PCI device is degraded and is associated with the same UUID as seen above. You may also see faulted states.
fmadm config
The fmadm config command output shows the version numbers of the diagnosis engines in use by your system, and also displays their current state. You can check these versions against information on the http://sunsolve.sun.com web site to determine if your server is using the latest diagnostic engines.
TABLE 8-28
# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.5 active UltraSPARC-III/IV CPU/Memory Diagnosis
cpumem-retire 1.1 active CPU/Memory Retire Agent
eft 1.16 active eft diagnosis engine
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
|
Using the fmstat Command
The fmstat command reports statistics associated with the Solaris Fault Manager, including information about diagnosis engine (DE) performance. In the example below, the eft DE (also seen in the console output) has received an event that it accepted. A case is opened for that event, and a diagnosis is performed to determine the cause of the failure.
TABLE 8-29
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-diagnosis 0 0 0.0 0.0 0 0 0 0 3.0K 0
cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0
eft 0 0 0.0 0.0 0 0 0 0 713K 0
fmd-self-diagnosis 0 0 0.0 0.0 0 0 0 0 0 0
io-retire 0 0 0.0 0.0 0 0 0 0 0 0
snmp-trapgen 0 0 0.0 0.0 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 6704.4 1 0 0 0 0 0
syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 0.0 0 0 0 0 0 0
|
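When scanning fmstat output by hand, the columns of interest are usually ev_acpt (events the engine accepted) and open (open cases). The sketch below filters captured fmstat output for engines with activity; the eft figures here are hypothetical, illustrating the accepted-event case described in the text rather than the idle sample above.

```shell
# Sketch: list fault-manager modules that have accepted events or hold
# open cases, from captured fmstat output. The eft row is a hypothetical
# example of a busy diagnosis engine.
cat > /tmp/fmstat.out <<'EOF'
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis         0       0  0.0    0.0   0   0     0     0   3.0K      0
eft                      4       4  0.0    2.1   0   0     1     3   713K      0
zfs-diagnosis            0       0  0.0    0.0   0   0     0     0      0      0
EOF

# Column 3 is ev_acpt; column 8 is open cases. Skip the header row.
awk 'NR > 1 && ($3 > 0 || $8 > 0) { print $1 }' /tmp/fmstat.out
```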
About Traditional Solaris OS Diagnostic Tools
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser OS. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based exerciser tools, SunVTS and Sun Management Center. These tools enable you to monitor the server, exercise it, and isolate faults.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the OS does not boot following completion of the firmware-based tests.
|
In addition to the tools mentioned above, you can refer to error and system message log files, and Solaris system information commands.
Error and System Message Log Files
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the OS, the environmental control subsystem, and various software applications.
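Because PSH fault notifications are logged there too, a simple grep on the messages file is often the quickest way to find past faults. This sketch operates on a stand-in file so it is self-contained; the SUNW-MSG-ID line format follows the samples in this chapter, and the surrounding log lines are illustrative only.

```shell
# Sketch: search a messages file for Predictive Self-Healing fault entries.
# /tmp/messages stands in for /var/adm/messages; the non-fault lines are
# made-up filler to show the filter at work.
cat > /tmp/messages <<'EOF'
Jul 1 14:30:19 sunrise sendmail[612]: restarting
Jul 1 14:30:20 sunrise SUNW-MSG-ID: SUN4-8000-0Y, TYPE: Fault, SEVERITY: Critical
Jul 1 14:30:21 sunrise last message repeated 1 time
EOF

grep 'SUNW-MSG-ID' /tmp/messages
```

On a live server you would point grep at /var/adm/messages itself.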
Solaris System Information Commands
The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V445 server:
- prtconf
- prtdiag
- prtfru
- psrinfo
- showrev
This section describes the information these commands give you. For more information on using these commands, refer to the Solaris man pages.
Using the prtconf Command
The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 8-6 shows an excerpt of prtconf output (truncated to save space).
CODE EXAMPLE 8-6 prtconf Command Output (Truncated)
# prtconf
System Configuration: Sun Microsystems sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
SUNW,Sun-Fire-V445
packages (driver not attached)
SUNW,builtin-drivers (driver not attached)
deblocker (driver not attached)
disk-label (driver not attached)
terminal-emulator (driver not attached)
dropins (driver not attached)
kbd-translator (driver not attached)
obp-tftp (driver not attached)
SUNW,i2c-ram-device (driver not attached)
SUNW,fru-device (driver not attached)
ufs-file-system (driver not attached)
chosen (driver not attached)
openprom (driver not attached)
client-services (driver not attached)
options, instance #0
aliases (driver not attached)
memory (driver not attached)
virtual-memory (driver not attached)
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #0
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #1 ...
|
The prtconf command with the -p option produces output similar to that of the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
Using the prtdiag Command
The prtdiag command displays a table of diagnostic information that summarizes the status of system components.
The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of some of the output produced by prtdiag on a Sun Fire V445 server.
CODE EXAMPLE 8-7 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems sun4u Sun Fire V445
System clock frequency: 199 MHZ
Memory size: 24GB
==================================== CPUs ====================================
E$ CPU CPU
CPU Freq Size Implementation Mask Status Location
--- -------- ---------- --------------------- ----- ------ --------
0 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C0/P0
1 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C1/P0
2 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C2/P0
3 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C3/P0
================================= IO Devices =================================
Bus Freq Slot + Name +
Type MHz Status Path Model
------ ---- ---------- ---------------------------- --------------------
pci 199 MB/PCI4 LSILogic,sas-pci1000,54 (scs+ LSI,1068
okay /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
pci 199 MB/PCI5 pci108e,abba (network) SUNW,pci-ce
okay /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
pciex 199 MB pci14e4,1668 (network)
okay /pci@1e,600000/pci/pci/pci/network
pciex 199 MB pci14e4,1668 (network)
okay /pci@1e,600000/pci/pci/pci/network
pciex 199 MB pci10b9,5229 (ide)
okay /pci@1f,700000/pci@0/pci@1/pci@0/ide
pciex 199 MB pci14e4,1668 (network)
okay /pci@1f,700000/pci@0/pci@2/pci@0/network
pciex 199 MB pci14e4,1668 (network)
okay /pci@1f,700000/pci@0/pci@2/pci@0/network
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address Size Interleave Factor Contains
-----------------------------------------------------------------------
0x0 8GB 16 BankIDs 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0x1000000000 8GB 16 BankIDs 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0x2000000000 4GB 4 BankIDs 32,33,34,35
0x3000000000 4GB 4 BankIDs 48,49,50,51
Bank Table:
-----------------------------------------------------------
Physical Location
ID ControllerID GroupID Size Interleave Way
-----------------------------------------------------------
0 0 0 512MB 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1 0 0 512MB
2 0 1 512MB
3 0 1 512MB
4 0 0 512MB
5 0 0 512MB
6 0 1 512MB
7 0 1 512MB
8 0 1 512MB
9 0 1 512MB
10 0 0 512MB
11 0 0 512MB
12 0 1 512MB
13 0 1 512MB
14 0 0 512MB
15 0 0 512MB
16 1 0 512MB 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
17 1 0 512MB
18 1 1 512MB
19 1 1 512MB
20 1 0 512MB
21 1 0 512MB
22 1 1 512MB
23 1 1 512MB
24 1 1 512MB
25 1 1 512MB
26 1 0 512MB
27 1 0 512MB
28 1 1 512MB
29 1 1 512MB
30 1 0 512MB
31 1 0 512MB
32 2 0 1GB 0,1,2,3
33 2 1 1GB
34 2 1 1GB
35 2 0 1GB
48 3 0 1GB 0,1,2,3
49 3 1 1GB
50 3 1 1GB
51 3 0 1GB
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels Status
--------------------------------------------------
0 0 MB/C0/P0/B0/D0
0 0 MB/C0/P0/B0/D1
0 1 MB/C0/P0/B1/D0
0 1 MB/C0/P0/B1/D1
1 0 MB/C1/P0/B0/D0
1 0 MB/C1/P0/B0/D1
1 1 MB/C1/P0/B1/D0
1 1 MB/C1/P0/B1/D1
2 0 MB/C2/P0/B0/D0
2 0 MB/C2/P0/B0/D1
2 1 MB/C2/P0/B1/D0
2 1 MB/C2/P0/B1/D1
3 0 MB/C3/P0/B0/D0
3 0 MB/C3/P0/B0/D1
3 1 MB/C3/P0/B1/D0
3 1 MB/C3/P0/B1/D1
=============================== usb Devices ===============================
Name Port#
------------ -----
hub HUB0
bash-3.00#
The following verbose output shows a failed fan tachometer:
============================ Environmental Status ============================
Fan Status:
-------------------------------------------
Location Sensor Status
-------------------------------------------
MB/FT0/F0 TACH okay
MB/FT1/F0 TACH failed (0 rpm)
MB/FT2/F0 TACH okay
MB/FT5/F0 TACH okay
PS1 FF_FAN okay
PS3 FF_FAN okay
Temperature sensors:
-----------------------------------------
Location Sensor Status
-----------------------------------------
MB/C0/P0 T_CORE okay
MB/C1/P0 T_CORE okay
MB/C2/P0 T_CORE okay
MB/C3/P0 T_CORE okay
MB/C0 T_AMB okay
MB/C1 T_AMB okay
MB/C2 T_AMB okay
MB/C3 T_AMB okay
MB T_CORE okay
MB IO_T_AMB okay
MB/FIOB T_AMB okay
MB T_AMB okay
PS1 FF_OT okay
PS3 FF_OT okay
------------------------------------
Current sensors:
----------------------------------------
Location Sensor Status
----------------------------------------
MB/USB0 I_USB0 okay
MB/USB1 I_USB1 okay
|
In addition to the information in CODE EXAMPLE 8-7, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
CODE EXAMPLE 8-8 prtdiag Verbose Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 59 OK
CPU2 64 OK
DBP0 22 OK
|
In the event of an overtemperature condition, prtdiag reports an error in the Status column.
CODE EXAMPLE 8-9 prtdiag Overtemperature Indication Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 62 OK
CPU1 102 ERROR
|
Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
CODE EXAMPLE 8-10 prtdiag Fault Indication Output
Fan Status:
-----------
Bank RPM Status
---- ----- ------
CPU0 4166 [NO_FAULT]
CPU1 0000 [FAULT]
|
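Status columns like these are easy to screen mechanically: anything other than an okay (or OK / NO_FAULT) status deserves attention. A minimal sketch over the fan-status format shown above:

```shell
# Sketch: flag any non-okay entries in captured prtdiag fan-status output.
# The input reproduces the failed-tachometer sample above.
cat > /tmp/fans.out <<'EOF'
MB/FT0/F0           TACH      okay
MB/FT1/F0           TACH      failed (0 rpm)
MB/FT2/F0           TACH      okay
PS1                 FF_FAN    okay
EOF

# Column 1 is the location, column 3 the status word.
awk '$3 != "okay" { print "ATTENTION:", $1, $3 }' /tmp/fans.out
```

The same pattern applies to the temperature and current sensor tables, adjusting the column index for each table's layout.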
Using the prtfru Command
The Sun Fire V445 system maintains a hierarchical list of all FRUs in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 8-11 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
CODE EXAMPLE 8-11 prtfru -l Command Output (Truncated)
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT3?Label=FT3
/frutree/chassis/MB?Label=MB/system-board/FT4?Label=FT4
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module (container)
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu/B0?Label=B0
|
CODE EXAMPLE 8-12 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
CODE EXAMPLE 8-12 prtfru -c Command Output
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
SEGMENT: FD
/Customer_DataR
/Customer_DataR/UNIX_Timestamp32: Wed Dec 31 19:00:00 EST 1969
/Customer_DataR/Cust_Data:
/InstallationR (4 iterations)
/InstallationR[0]
/InstallationR[0]/UNIX_Timestamp32: Fri Dec 31 20:47:13 EST 1999
/InstallationR[0]/Fru_Path: MB.SEEPROM
/InstallationR[0]/Parent_Part_Number: 5017066
/InstallationR[0]/Parent_Serial_Number: BM004E
/InstallationR[0]/Parent_Dash_Level: 05
/InstallationR[0]/System_Id:
/InstallationR[0]/System_Tz: 238
/InstallationR[0]/Geo_North: 15658734
/InstallationR[0]/Geo_East: 15658734
/InstallationR[0]/Geo_Alt: 238
/InstallationR[0]/Geo_Location:
/InstallationR[1]
/InstallationR[1]/UNIX_Timestamp32: Mon Mar 6 10:08:30 EST 2006
/InstallationR[1]/Fru_Path: MB.SEEPROM
/InstallationR[1]/Parent_Part_Number: 3753302
/InstallationR[1]/Parent_Serial_Number: 0001
/InstallationR[1]/Parent_Dash_Level: 03
/InstallationR[1]/System_Id:
/InstallationR[1]/System_Tz: 238
/InstallationR[1]/Geo_North: 15658734
/InstallationR[1]/Geo_East: 15658734
/InstallationR[1]/Geo_Alt: 238
/InstallationR[1]/Geo_Location:
/InstallationR[2]
/InstallationR[2]/UNIX_Timestamp32: Tue Apr 18 10:00:45 EDT 2006
/InstallationR[2]/Fru_Path: MB.SEEPROM
/InstallationR[2]/Parent_Part_Number: 5017066
/InstallationR[2]/Parent_Serial_Number: BM004E
/InstallationR[2]/Parent_Dash_Level: 05
/InstallationR[2]/System_Id:
/InstallationR[2]/System_Tz: 0
/InstallationR[2]/Geo_North: 12704
/InstallationR[2]/Geo_East: 1
/InstallationR[2]/Geo_Alt: 251
/InstallationR[2]/Geo_Location:
/InstallationR[3]
/InstallationR[3]/UNIX_Timestamp32: Fri Apr 21 08:50:32 EDT 2006
/InstallationR[3]/Fru_Path: MB.SEEPROM
/InstallationR[3]/Parent_Part_Number: 3753302
/InstallationR[3]/Parent_Serial_Number: 0001
/InstallationR[3]/Parent_Dash_Level: 03
/InstallationR[3]/System_Id:
/InstallationR[3]/System_Tz: 0
/InstallationR[3]/Geo_North: 1
/InstallationR[3]/Geo_East: 16531457
/InstallationR[3]/Geo_Alt: 251
/InstallationR[3]/Geo_Location:
/Status_EventsR (0 iterations)
SEGMENT: PE
/Power_EventsR (50 iterations)
/Power_EventsR[0]
/Power_EventsR[0]/UNIX_Timestamp32: Mon Jul 10 12:34:20 EDT 2006
/Power_EventsR[0]/Event: power_on
/Power_EventsR[1]
/Power_EventsR[1]/UNIX_Timestamp32: Mon Jul 10 12:34:49 EDT 2006
/Power_EventsR[1]/Event: power_off
/Power_EventsR[2]
/Power_EventsR[2]/UNIX_Timestamp32: Mon Jul 10 12:35:27 EDT 2006
/Power_EventsR[2]/Event: power_on
/Power_EventsR[3]
/Power_EventsR[3]/UNIX_Timestamp32: Mon Jul 10 12:58:43 EDT 2006
/Power_EventsR[3]/Event: power_off
/Power_EventsR[4]
/Power_EventsR[4]/UNIX_Timestamp32: Mon Jul 10 13:07:27 EDT 2006
/Power_EventsR[4]/Event: power_on
/Power_EventsR[5]
/Power_EventsR[5]/UNIX_Timestamp32: Mon Jul 10 14:07:20 EDT 2006
/Power_EventsR[5]/Event: power_off
/Power_EventsR[6]
/Power_EventsR[6]/UNIX_Timestamp32: Mon Jul 10 14:07:21 EDT 2006
/Power_EventsR[6]/Event: power_on
/Power_EventsR[7]
/Power_EventsR[7]/UNIX_Timestamp32: Mon Jul 10 14:17:01 EDT 2006
/Power_EventsR[7]/Event: power_off
/Power_EventsR[8]
/Power_EventsR[8]/UNIX_Timestamp32: Mon Jul 10 14:40:22 EDT 2006
/Power_EventsR[8]/Event: power_on
/Power_EventsR[9]
/Power_EventsR[9]/UNIX_Timestamp32: Mon Jul 10 14:42:38 EDT 2006
/Power_EventsR[9]/Event: power_off
/Power_EventsR[10]
/Power_EventsR[10]/UNIX_Timestamp32: Mon Jul 10 16:12:35 EDT 2006
/Power_EventsR[10]/Event: power_on
/Power_EventsR[11]
/Power_EventsR[11]/UNIX_Timestamp32: Tue Jul 11 08:53:47 EDT 2006
/Power_EventsR[11]/Event: power_off
/Power_EventsR[12]
|
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:
- FRU description
- Manufacturer name and location
- Part number and serial number
- Hardware revision levels
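As an illustrative sketch (not part of the standard toolset), the part and serial numbers can be pulled out of saved prtfru output with a short awk filter. The sample records below are copied from the installation records shown above; on a live system you would pipe /usr/sbin/prtfru -c into awk instead of using a saved sample.

```shell
# Illustrative only: extract each installation record's parent part and
# serial number from saved `prtfru -c` output. The sample records are
# copied from the output shown above.
prtfru_sample='/InstallationR[2]/Fru_Path: MB.SEEPROM
/InstallationR[2]/Parent_Part_Number: 5017066
/InstallationR[2]/Parent_Serial_Number: BM004E
/InstallationR[3]/Fru_Path: MB.SEEPROM
/InstallationR[3]/Parent_Part_Number: 3753302
/InstallationR[3]/Parent_Serial_Number: 0001'

printf '%s\n' "$prtfru_sample" |
  awk -F': ' '/Parent_Part_Number/   { part = $2 }
              /Parent_Serial_Number/ { print "part " part " serial " $2 }'
```

The filter prints one line per installation record, pairing each part number with its serial number.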
Using the psrinfo Command
The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.
CODE EXAMPLE 8-13 psrinfo -v Command Output
# psrinfo -v
Status of virtual processor 0 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 3 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:24.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
|
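Because psrinfo -v emits several lines per processor, a small awk filter can condense the output to one line per CPU. This is a hedged sketch, not a documented Solaris utility; the sample text mirrors CODE EXAMPLE 8-13, and on a live system you would pipe /usr/sbin/psrinfo -v into awk instead.

```shell
# Illustrative only: summarize `psrinfo -v` output as one line per CPU.
psrinfo_sample='Status of virtual processor 0 as of: 07/13/2006 14:18:39
  on-line since 07/13/2006 14:01:26.
  The sparcv9 processor operates at 1592 MHz,
Status of virtual processor 1 as of: 07/13/2006 14:18:39
  on-line since 07/13/2006 14:01:26.
  The sparcv9 processor operates at 1592 MHz,'

printf '%s\n' "$psrinfo_sample" |
  awk '/virtual processor/ { cpu = $5 }
       /operates at/       { sub(",", "", $7); print "cpu " cpu ": " $6 " " $7 }'
```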
Using the showrev Command
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 8-14 shows sample output of the showrev command.
CODE EXAMPLE 8-14 showrev Command Output
# showrev
Hostname: sunrise
Hostid: 83d8ee71
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: Ecd.East.Sun.COM
Kernel version: SunOS 5.10 Generic_118833-17
bash-3.00#
|
When used with the -p option, this command displays installed patches. TABLE 8-30 shows a partial sample output from the showrev command with the -p option.
TABLE 8-30 showrev -p Command Output
Patch: 109729-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109783-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109807-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109809-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110905-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110910-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110914-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 108964-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr
|
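When scripting around this output, the patch ID in the second field is what you typically match on. The following is a minimal sketch (the patch_installed helper is hypothetical, not a Solaris command); the patch list is a fragment of the sample in TABLE 8-30, and on a live system you would substitute the output of /usr/bin/showrev -p.

```shell
# Illustrative only: check whether a given patch ID appears in
# `showrev -p`-style output.
patches='Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr'

# Hypothetical helper: succeed if any "Patch:" line starts with the
# given base patch ID (revision suffix ignored).
patch_installed() {
  printf '%s\n' "$patches" |
    awk -v id="$1" '$1 == "Patch:" && index($2, id "-") == 1 { found = 1 }
                    END { exit !found }'
}

patch_installed 109729 && echo "patch 109729 is installed"
patch_installed 999999 || echo "patch 999999 is not installed"
```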
To Run Solaris System Information Commands
|
1. Decide what kind of system information you want to display.
For more information, see Solaris System Information Commands.
2. Type the appropriate command at a console prompt.
See TABLE 8-31 for a summary of the commands.
TABLE 8-31 Using Solaris Information Display Commands
Command | What It Displays | What to Type | Notes
fmadm | Fault management information | /usr/sbin/fmadm | Lists information and changes settings.
fmdump | Fault management information | /usr/sbin/fmdump | Use the -v option for additional detail.
prtconf | System configuration information | /usr/sbin/prtconf | -
prtdiag | Diagnostic and configuration information | /usr/platform/sun4u/sbin/prtdiag | Use the -v option for additional detail.
prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data.
showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches.
Viewing Recent Diagnostic Test Results
A summary of the results of the most recent power-on self-test (POST) is saved across power cycles.
To View Recent Test Results
|
1. Obtain the ok prompt.
2. To see a summary of the most recent POST results, type:
TABLE 8-32
ok show-post-results
|
Setting OpenBoot Configuration Variables
Switches and diagnostic configuration variables stored in the IDPROM determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 8-7.
Changes to OpenBoot configuration variables usually take effect upon the next reboot.
To View and Set OpenBoot Configuration Variables
|
1. Obtain the ok prompt.
- To display the current values of all OpenBoot configuration variables, use the printenv command.
The following example shows a short excerpt of this command's output.
TABLE 8-33
ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
|
- To set or change the value of an OpenBoot configuration variable, use the setenv command:
TABLE 8-34
ok setenv diag-level max
diag-level =
max
|
To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.
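For example, the diag-trigger variable described later under Reset Scenarios accepts reset-event keywords such as power-on-reset and error-reset. Setting both at once would look like the following; the echoed confirmation line is illustrative, patterned on the setenv output shown above:

```
ok setenv diag-trigger power-on-reset error-reset
diag-trigger =        power-on-reset error-reset
```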
Additional Diagnostic Tests for Specific Devices
Using the probe-scsi Command to Confirm That Hard Disk Drives Are Active
The probe-scsi command transmits an inquiry to SAS devices connected to the system's internal SAS interface. If a SAS device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.
CODE EXAMPLE 8-15 probe-scsi Output Message
ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
The probe-scsi-all command transmits an inquiry to all SAS devices connected to both the system's internal and external SAS interfaces. CODE EXAMPLE 8-16 shows sample output from a server with no externally connected SAS devices but with two 36-Gbyte hard disk drives, both of them active.
CODE EXAMPLE 8-16 probe-scsi-all Output Message
ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1
/pci@1f,0/pci@1/scsi@8
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
Using the probe-ide Command to Confirm That the DVD Drive Is Connected
The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system's on-board IDE interface. The following sample output reports a DVD drive installed (as Device 0) and active in a server.
CODE EXAMPLE 8-17 probe-ide Output Message
ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
Using the watch-net and watch-net-all Commands to Check the Network Connections
The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated by an X and an associated error description.
Start the watch-net diagnostic test by typing the watch-net command at the ok prompt. For the watch-net-all diagnostic test, type watch-net-all at the ok prompt.
CODE EXAMPLE 8-18 watch-net Diagnostic Output Message
{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.................................
|
CODE EXAMPLE 8-19 watch-net-all Diagnostic Output Message
{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
|
About Automatic Server Restart
Note - Automatic Server Restart is not the same as Automatic System Restoration (ASR), which the Sun Fire V445 server also supports.
|
Automatic Server Restart is a functional part of ALOM. It monitors the Solaris OS while the OS is running and, by default, responds to a hang by capturing CPU register and memory contents to the dump device using the firmware-level sync command.
ALOM uses a watchdog process to monitor only the kernel. ALOM does not restart the server if a process hangs while the kernel is still running. The ALOM watchdog parameters for the watchdog patting interval and the watchdog timeout are not user-configurable.
If the kernel hangs and the watchdog times out, ALOM reports and logs the event and performs one of three user-configurable actions:
- xir: This is the default action. It causes the server to capture CPU register and memory contents to the dump device using the firmware-level sync command. If the sync itself hangs, ALOM falls back to a hard reset after 15 minutes.
Note - Do not confuse this OpenBoot sync command with the Solaris OS sync command, which results in I/O writes of buffered data to the disk drives, prior to unmounting file systems.
|
- Reset: This is a hard reset. It results in rapid system recovery, but diagnostic data about the hang is not stored, and file system damage may result.
- None: This leaves the system in the hung state indefinitely after the watchdog timeout has been reported.
For more information, see the sys_autorestart section of the ALOM Online Help.
About Automatic System Restoration
Note - Automatic System Restoration (ASR) is not the same as Automatic Server Restart, which the Sun Fire V445 server also supports.
|
Automatic System Restoration (ASR) consists of self-test features and an auto-configuring capability that detect failed hardware components and unconfigure them. This enables the server to resume operating after certain nonfatal hardware faults or failures have occurred.
If ASR monitors a component and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails.
ASR monitors the following components:
If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.
If a fault occurs on a running server, and it is possible for the server to run without the failed component, the server automatically reboots. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.
To support such a degraded boot capability, the OpenBoot firmware uses the 1275 Client Interface (via the device tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The Solaris OS will not activate a driver for any subsystem so marked.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system will reboot automatically and resume operation while a service call is made.
Note - ASR is enabled by default.
|
Auto-Boot Options
The OpenBoot firmware stores configuration variables called auto-boot? and auto-boot-on-error? on a ROM chip. The default setting on the Sun Fire V445 server for both of these variables is true.
The auto-boot? setting controls whether or not the firmware automatically boots the OS after each reset. The auto-boot-on-error? setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true (default) to enable an automatic degraded boot.
To Set the Auto-Boot Switches
|
1. Type:
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
Note - With both of these variables set to true, the system attempts a degraded boot in response to any nonfatal error.
|
Error Handling Summary
Error handling during the power-on sequence falls into one of the following three cases:
- If no errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true.
- If only nonfatal errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true and auto-boot-on-error? is true. Non-fatal errors include the following:
- SAS subsystem failure. In this case, a working alternate path to the boot disk is required. For more information, see About Multipathing Software.
- Ethernet interface failure.
- USB interface failure.
- Serial interface failure.
- PCI card failure.
- Memory failure.
Given a failed DIMM, the firmware unconfigures the entire logical bank associated with the failed module. Another nonfailing logical bank must be present in the system for the system to attempt a degraded boot. See About the CPU/Memory Modules.
Note - If POST or OpenBoot Diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
|
- If a critical or fatal error is detected by POST or OpenBoot Diagnostics, the system will not boot regardless of the settings of auto-boot? or auto-boot-on-error?. Critical and fatal nonrecoverable errors include the following:
- Any CPU failed
- All logical memory banks failed
- Flash RAM cyclical redundancy check (CRC) failure
- Critical field-replaceable unit (FRU) PROM configuration data failure
- Critical application-specific integrated circuit (ASIC) failure
For more information about troubleshooting fatal errors, see Chapter 9.
Reset Scenarios
Two OpenBoot configuration variables, diag-switch? and diag-trigger, control whether the system executes firmware diagnostics in response to system reset events.
POST is enabled as the default for power-on-reset and error-reset events. When the diag-switch? variable is set to true, diagnostics are executed using user-defined settings. If the diag-switch? variable is set to false, diagnostics are executed depending on the diag-trigger variable setting.
In addition, ASR is enabled by default because diag-trigger is set to power-on-reset and error-reset. This default remains in effect when the diag-switch? variable is set to false. The auto-boot? and auto-boot-on-error? variables are set to true by default.
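You can confirm these settings individually by passing a variable name to the printenv command. Output along the following lines would be expected; the values shown are the documented defaults, not captured from a live system:

```
ok printenv diag-trigger
diag-trigger =        power-on-reset error-reset
ok printenv auto-boot?
auto-boot? =          true
```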
Automatic System Restoration User Commands
The OpenBoot commands .asr, asr-disable, and asr-enable are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see Unconfiguring a Device Manually.
Enabling Automatic System Restoration
The ASR feature is enabled by default. ASR is always enabled when the diag-switch? OpenBoot variable is set to true, and when the diag-trigger setting is set to error-reset.
To activate any parameter changes, type the following at the ok prompt:
ok reset-all
The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (default).
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
|
Disabling Automatic System Restoration
After you disable the automatic system restoration (ASR) feature, it is not activated again until you enable it at the system ok prompt.
To Disable Automatic System Restoration
|
1. At the ok prompt, type:
ok setenv auto-boot-on-error? false
|
2. To activate the parameter change, type:
ok reset-all
The system permanently stores the parameter change.
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
|
Displaying Automatic System Restoration Information
Use the following command to display information about the status of the ASR feature.
At the ok prompt, type:
ok .asr
In the .asr command output, any devices marked disabled have been manually unconfigured using the asr-disable command. The .asr command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the OpenBoot ASR feature.
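As a sketch of how these commands fit together, the following sequence unconfigures a device, confirms its state, and then reconfigures it. The device name here is a hypothetical placeholder; use the names reported by the .asr listing on your own system:

```
ok asr-disable device-name
ok .asr
ok asr-enable device-name
```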
About SunVTS
SunVTS is a software suite that performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.
You can run SunVTS software in the following test modes:
- Connection test mode provides a low-stress, quick testing of the availability and connectivity of selected devices. These tests are nonintrusive, meaning they release the devices after a quick test, and they do not place a heavy load on system activity.
- Functional test mode provides robust testing of your system and devices. It uses your system resources for thorough testing and it assumes that no other applications are running.
- Exclusive test mode enables you to perform tests that require that no other SunVTS tests or applications be running at the same time.
- Online test mode enables SunVTS testing to run while other customer applications are running.
- Auto Config automatically detects all subsystems and exercises them in one of two ways:
- Confidence testing - Performs one pass of tests on all subsystems, and then stops. For typical system configurations, this requires one or two hours.
- Comprehensive testing - Tests all subsystems repeatedly for up to 24 hours.
Since SunVTS software can run many tests in parallel and consume many system resources, you should be cautious when using it on a production system. If you are stress-testing a system using the Functional test mode, do not run anything else on that system at the same time.
To install and use SunVTS, a system must be running a Solaris OS release compatible with the SunVTS version. Since SunVTS software packages are optional, they may not be installed on your system. See To Find Out Whether SunVTS Is Installed for instructions.
SunVTS Software and Security
During SunVTS software installation, you must choose between Basic and Sun Enterprise Authentication Mechanism security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. Sun Enterprise Authentication Mechanism security is based on Kerberos, the standard network authentication protocol, and provides secure user authentication, data integrity, and privacy for transactions over networks.
If your site uses Sun Enterprise Authentication Mechanism security, you must have the Sun Enterprise Authentication Mechanism client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use Sun Enterprise Authentication Mechanism security, do not choose the Sun Enterprise Authentication Mechanism option during SunVTS software installation.
If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you choose, you may find yourself unable to run SunVTS tests. For more information, see the SunVTS User's Guide and the instructions accompanying the Sun Enterprise Authentication Mechanism software.
Using SunVTS
SunVTS, the Sun Validation and Test Suite, is an online diagnostics tool that you can use to verify the configuration and functionality of hardware controllers, devices, and platforms. It runs in the Solaris OS and presents the following interfaces:
- Command line interface
- Serial (TTY) interface
SunVTS software enables you to view and control testing sessions on a remotely connected server. TABLE 8-35 lists some of the tests that are available:
TABLE 8-35 SunVTS Tests
SunVTS Test | Description
cputest | Tests the CPU
disktest | Tests the local disk drives
dvdtest | Tests the DVD-ROM drive
fputest | Tests the floating-point unit
nettest | Tests the Ethernet hardware on the system board and the networking hardware on any optional PCI cards
netlbtest | Performs a loopback test to check that the Ethernet adapter can send and receive packets
pmemtest | Tests the physical memory (read only)
sutest | Tests the server's on-board serial ports
vmemtest | Tests the virtual memory (a combination of the swap partition and the physical memory)
env6test | Tests the environmental devices
ssptest | Tests ALOM hardware devices
i2c2test | Tests I2C devices for correct operation
To Find Out Whether SunVTS Is Installed
|
Type:
TABLE 8-36
# pkginfo -l SUNWvts
|
If SunVTS software is loaded, information about the package will be displayed.
If SunVTS software is not loaded, you will see the following error message:
TABLE 8-37
ERROR: information for "SUNWvts" was not found
|
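In a script, pkginfo's exit status alone distinguishes the two cases, so you do not need to parse the error message. A minimal sketch:

```shell
# Illustrative only: branch on pkginfo's exit status to decide whether
# SunVTS is present. On a machine without the pkginfo command (or
# without the SUNWvts package), the else branch runs.
if pkginfo -l SUNWvts >/dev/null 2>&1; then
  echo "SUNWvts is installed"
else
  echo "SUNWvts is not installed"
fi
```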
Installing SunVTS
By default, SunVTS is not installed on the Sun Fire V445 server. However, it is available on the Solaris 10 DVD supplied in the Solaris Media Kit, in the Solaris_10/ExtraValue/CoBundled/SunVTS_X.X directory. For information about downloading SunVTS from the Sun Download Center, refer to the Sun Hardware Platform Guide for the Solaris release you are using.
To find out more about using SunVTS, refer to the SunVTS documentation that corresponds to the Solaris release that you are running.
Viewing SunVTS Documentation
The SunVTS documents are accessible in the Solaris on Sun Hardware documentation collection at http://docs.sun.com.
For further information, you can also consult the following SunVTS documents:
- SunVTS User's Guide describes how to install, configure, and run the SunVTS diagnostic software.
- SunVTS Quick Reference Card provides an overview of how to use the SunVTS graphical user interface.
- SunVTS Test Reference Manual for SPARC Platforms provides details about each individual SunVTS test.
About Sun Management Center
Sun Management Center software provides enterprise-wide monitoring of Sun servers and workstations, including their subsystems, components, and peripheral devices. The system being monitored must be up and running, and you need to install all the proper software components on various systems in your network.
Sun Management Center enables you to monitor the following on the Sun Fire V445 server.
TABLE 8-38 What Sun Management Center Monitors
Item Monitored | What Sun Management Center Monitors
Disk drives | Status
Fans | Status
CPUs | Temperature and any thermal warning or failure conditions
Power supply | Status
System temperature | Temperature and any thermal warning or failure conditions
Sun Management Center software extends and enhances the management capability of Sun's hardware and software products.
TABLE 8-39 Sun Management Center Features
Feature | Description
System management | Monitors and manages the system at the hardware and operating system levels. Monitored hardware includes boards, tapes, power supplies, and disks.
Operating system management | Monitors and manages operating system parameters including load, resource usage, disk space, and network statistics.
Application and business system management | Provides technology to monitor business applications such as trading systems, accounting systems, inventory systems, and real-time control systems.
Scalability | Provides an open, scalable, and flexible solution to configure and manage multiple management administrative domains (consisting of many systems) spanning an enterprise. The software can be configured and used in a centralized or distributed fashion by multiple users.
Sun Management Center software is geared primarily toward system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a more modest installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.
The servers being monitored must be up and running if you want to use Sun Management Center, since this tool relies on the Solaris OS. For instructions on using this tool to monitor a Sun Fire V445 server, see Chapter 8.
How Sun Management Center Works
Sun Management Center consists of three components:
You install agents on systems to be monitored. The agents collect system status information from log files, device trees, and platform-specific sources, and report that data to the server component.
The server component maintains a large database of status information for a wide range of Sun platforms. This database is updated frequently, and includes information about boards, tapes, power supplies, and disks as well as OS parameters like load, resource usage, and disk space. You can create alarm thresholds and be notified when these are exceeded.
The monitor components present the collected data to you in a standard format. Sun Management Center software provides both a standalone Java application and a web browser-based interface. The Java interface affords physical and logical views of the system for highly intuitive monitoring.
Using Sun Management Center
Sun Management Center software is aimed at system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a smaller installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.
The servers to be monitored must be up and running, since Sun Management Center relies on the Solaris OS for its operation.
For detailed instructions, see the Sun Management Center Software User's Guide.
Other Sun Management Center Features
Sun Management Center software provides you with additional tools, which can operate with management utilities made by other companies.
The tools are an informal tracking mechanism and the optional Hardware Diagnostic Suite add-on.
Informal Tracking
Sun Management Center agent software must be loaded on any system you want to monitor. However, the product enables you to informally track a supported platform even when the agent software has not been installed on it. In this case, you do not have full monitoring capability, but you can add the system to your browser, have Sun Management Center periodically check whether it is up and running, and notify you if it goes out of commission.
Hardware Diagnostic Suite
The Hardware Diagnostic Suite is a package that you can purchase as an add-on to Sun Management Center. The suite enables you to exercise a system while it is still up and running in a production environment. See Hardware Diagnostic Suite for more information.
Interoperability With Third-Party Monitoring Tools
If you administer a heterogeneous network and use a third-party network-based system monitoring or management tool, you might be able to take advantage of Sun Management Center software's support for Tivoli Enterprise Console, BMC Patrol, and HP Openview.
Obtaining the Latest Information
For the latest information about this product, go to the Sun Management Center web site: http://www.sun.com/sunmanagementcenter
Hardware Diagnostic Suite
Sun Management Center features an optional Hardware Diagnostic Suite, which you can purchase as an add-on. The Hardware Diagnostic Suite is designed to exercise a production system by running tests sequentially.
Sequential testing means the Hardware Diagnostic Suite has a low impact on the system. Unlike SunVTS, which stresses a system by consuming its resources with many parallel tests (see About SunVTS), the Hardware Diagnostic Suite lets the server run other applications while testing proceeds.
When to Run Hardware Diagnostic Suite
The best use of the Hardware Diagnostic Suite is to disclose a suspected or intermittent problem with a noncritical part on an otherwise functioning machine. Examples might include questionable disk drives or memory modules on a machine that has ample or redundant disk and memory resources.
In cases like these, the Hardware Diagnostic Suite runs unobtrusively until it identifies the source of the problem. The machine under test can be kept in production mode until and unless it must be shut down for repair. If the faulty part is hot-pluggable or hot-swappable, the entire diagnose-and-repair cycle can be completed with minimal impact to system users.
Requirements for Using Hardware Diagnostic Suite
Since it is a part of Sun Management Center, you can only run Hardware Diagnostic Suite if you have set up your data center to run Sun Management Center. This means you have to dedicate a master server to run the Sun Management Center server software that supports Sun Management Center software's database of platform status information. In addition, you must install and set up Sun Management Center agent software on the systems to be monitored. Finally, you need to install the console portion of Sun Management Center software, which serves as your interface to the Hardware Diagnostic Suite.
Instructions for setting up Sun Management Center, as well as for using the Hardware Diagnostic Suite, can be found in the Sun Management Center Software User's Guide.
Sun Fire V445 Server Administration Guide, 819-3741-13
Copyright © 2007, Sun Microsystems, Inc. All Rights Reserved.