Sun Fire V445 Server Administration Guide
This chapter describes the diagnostic tools available for the Sun Fire V445 server.
Topics in this chapter include:
Diagnostic Tools Overview
Sun provides a range of diagnostic tools for use with the Sun Fire V445 server.
The diagnostic tools are summarized in TABLE 8-1.
TABLE 8-1 Summary of Diagnostic Tools

Diagnostic Tool | Type | What It Does | Accessibility and Availability | Remote Capability
ALOM system controller | Hardware and software | Monitors environmental conditions, performs basic fault isolation, and provides remote console access | Can function on standby power and without the OS | Designed for remote access
LED indicators | Hardware | Indicate status of the overall system and particular components | Accessed from the system chassis; available whenever power is available | Local, but can be viewed with the ALOM system console
POST | Firmware | Tests core components of the system | Runs automatically on startup; available when the OS is not running | Local, but can be viewed with the ALOM system controller
OpenBoot Diagnostics | Firmware | Tests system components, focusing on peripherals and I/O devices | Runs automatically or interactively; available when the OS is not running | Local, but can be viewed with the ALOM system controller
OpenBoot commands | Firmware | Display various kinds of system information | Available when the OS is not running | Local, but can be accessed with the ALOM system controller
Solaris 10 Predictive Self-Healing | Software | Monitors system errors; reports and disables faulty hardware | Runs in the background when the OS is running | Local, but can be accessed with the ALOM system controller
Traditional Solaris OS commands | Software | Display various kinds of system information | Require the OS | Local, but can be accessed with the ALOM system controller
SunVTS | Software | Exercises and stresses the system, running tests in parallel | Requires the OS; optional package that must be installed separately | View and control over the network
Sun Management Center | Software | Monitors both hardware environmental conditions and software performance of multiple machines; generates alerts for various conditions | Requires the OS to be running on both monitored and master servers; requires a dedicated database on the master server | Designed for remote access
Hardware Diagnostic Suite | Software | Exercises an operational system by running sequential tests; also reports failed FRUs | Separately purchased optional add-on to Sun Management Center; requires the OS and Sun Management Center | Designed for remote access
|
About Sun Advanced Lights-Out Manager 1.0 (ALOM)
The Sun Fire V445 server ships with Sun Advanced Lights Out Manager (ALOM) 1.0 installed. The system console is directed to ALOM by default and is configured to show server console information on startup.
ALOM enables you to monitor and control your server over either a serial connection (using the SERIAL MGT port), or Ethernet connection (using the NET MGT port). For information on configuring an Ethernet connection, refer to the ALOM Online Help.
Note - The ALOM serial port, labeled SERIAL MGT, is for server management only. If you need a general-purpose serial port, use the serial port labeled TTYB.
ALOM can send email notification of hardware failures and other events related to the server or to ALOM.
The ALOM circuitry uses standby power from the server. This means that:
- ALOM is active as soon as the server is connected to a power source, and until power is removed by unplugging the power cable.
- ALOM firmware and software continue to be effective when the server OS goes offline.
See TABLE 8-2 for a list of the components monitored by ALOM and the information it provides for each.
TABLE 8-2 What ALOM Monitors

Component | Information
Hard disk drives | Presence and status
System and CPU fans | Speed and status
CPUs | Presence, temperature, and any thermal warning or failure conditions
Power supplies | Presence and status
System temperature | Ambient temperature and any thermal warning or failure conditions
Server front panel | Status indicator
Voltage | Status and thresholds
SAS and USB circuit breakers | Status
ALOM Management Ports
The default management port is labeled SERIAL MGT. This port uses an RJ-45 connector and is for server management only - it supports only ASCII connections to an external console. Use this port when you first begin to operate the server.
Another serial port - labeled TTYB - is available for general-purpose serial data transfer. This port uses a DB-9 connector. For information on pinouts, refer to the Sun Fire V445 Server Installation Guide.
In addition, the server has one 10BASE-T Ethernet management domain interface, labeled NET MGT. To use this port, ALOM configuration is required. For more information, see the ALOM Online Help.
Setting the admin Password for ALOM
When you switch to the ALOM prompt after initial power-on, you will be logged in as the admin user and prompted to set a password. You must set this password in order to execute certain commands.
If you are prompted to do so, set a password for the admin user.
The password must:
- contain at least two alphabetic characters
- contain at least one numeric or one special character
- be at least six characters long
Once the password is set, the admin user has full permissions and can execute all ALOM CLI commands.
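The three password rules above can be expressed as a small check. The following is an illustrative sketch only (the function name is hypothetical, and ALOM performs its own validation; this simply restates the documented rules):

```python
import string

def valid_alom_password(pw: str) -> bool:
    """Check a candidate password against the ALOM admin password rules
    listed above. Hypothetical helper, not the ALOM validator itself."""
    alpha = sum(c.isalpha() for c in pw)                            # at least two alphabetic
    num_or_special = sum(c.isdigit() or c in string.punctuation for c in pw)
    return alpha >= 2 and num_or_special >= 1 and len(pw) >= 6      # at least six chars total

print(valid_alom_password("ab3def"))   # True: meets all three rules
print(valid_alom_password("abcdef"))   # False: no numeric or special character
```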
Basic ALOM Functions
This section covers some basic ALOM functions. For comprehensive documentation, refer to the ALOM Online Help.
To Switch to the ALOM Prompt
Type the default keystroke sequence:
To Switch to the Server Console Prompt
Type:
More than one ALOM user can be connected to the server console stream at a time, but only one user is permitted to type input characters to the console.
If another user is logged on and has write capability, you will see the message below after issuing the console command:
TABLE 8-5
sc> Console session already in use. [view mode]
To take console write capability away from another user, type:
About Status Indicators
For a summary of the server's LED status indicators, see Front Panel Indicators and Back Panel Indicators.
About POST Diagnostics
POST is a firmware program that is useful in determining if a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O devices, and generates messages that can determine the nature of a hardware failure. POST can be run even if the system is unable to boot.
POST detects CPU and memory subsystem faults and is located in a SEEPROM on the MBC (ALOM) board. POST can be set to run by the OpenBoot program at power-on by setting three OpenBoot configuration variables: diag-switch?, diag-trigger, and diag-level.
POST runs automatically when the system power is applied, or following a noncritical error reset, if all of the following conditions apply:
- diag-switch? is set to true or false (default is false)
- diag-level is set to min, max, or menus (default is min)
- diag-trigger is set to power-on-reset and error-reset (default is power-on-reset and error-reset)
If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively. If diag-level is set to menus, a menu of all the tests executed at power-up is displayed. POST diagnostic and error message reports are displayed on a console.
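The conditions above can be summarized as a small decision function. This is a simplified sketch of the documented behavior, not firmware logic (the function name and the set-based representation of diag-trigger are assumptions for illustration):

```python
def post_runs_automatically(diag_level: str, diag_trigger: set, reset_event: str) -> bool:
    """Model of the automatic POST conditions described above: POST runs at
    a reset event when diag-level enables testing and the event's class is
    listed in diag-trigger. Illustrative only."""
    if diag_level not in ("min", "max", "menus"):   # off disables testing
        return False
    return reset_event in diag_trigger

# Defaults: diag-level = min, diag-trigger = power-on-reset and error-reset
defaults = {"power-on-reset", "error-reset"}
print(post_runs_automatically("min", defaults, "power-on-reset"))  # True
print(post_runs_automatically("off", defaults, "power-on-reset"))  # False
```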
For information on starting and controlling POST diagnostics, see About the post Command.
OpenBoot PROM Enhancements for Diagnostic Operation
This section describes the diagnostic operation enhancements provided by OpenBoot PROM Version 4.15 and later and presents information about how to use the resulting new operational features. Note that the behavior of certain operational features on your system might differ from the behavior described in this section.
What's New in Diagnostic Operation
The following features are the diagnostic operation enhancements:
- New and redefined configuration variables simplify diagnostic controls and allow you to customize a "normal mode" of diagnostic operation for your environment. See About the New and Redefined Configuration Variables.
- New standard (default) configuration enables and runs diagnostics and enables Automatic System Restoration (ASR) capabilities at power-on and after error reset events. See About the Default Configuration.
- Service mode establishes a Sun prescribed methodology for isolating and diagnosing problems. See About Service Mode.
- The post command executes the power-on self-test (POST) and provides options that enable you to specify the level of diagnostic testing and verbosity of diagnostic output. See About the post Command.
About the New and Redefined Configuration Variables
New and redefined configuration variables simplify diagnostic operation and provide you with more control over the amount of diagnostic output. The following list summarizes the configuration variable changes. See TABLE 8-7 for complete descriptions of the variables.
- New variables:
- service-mode? - Diagnostics are executed at a Sun-prescribed level.
- diag-trigger - Replaces and consolidates the functions of post-trigger and obdiag-trigger.
- verbosity - Controls the amount and detail of firmware output.
- Redefined variable:
- The diag-switch? parameter has modified behavior for controlling diagnostic execution in normal mode on Sun UltraSPARC-based volume servers. Behavior of the diag-switch? parameter is unchanged on Sun workstations.
- Default value changes:
- auto-boot-on-error? - New default value is true.
- diag-level - New default value is max.
- error-reset-recovery - New default value is sync.
About the Default Configuration
The new standard (default) configuration runs diagnostic tests and enables full ASR capabilities during power-on and after the occurrence of an error reset (RED State Exception Reset, CPU Watchdog Reset, System Watchdog Reset, Software-Instruction Reset, or Hardware Fatal Reset). This is a change from the previous default configuration, which did not run diagnostic tests. When you power on your system for the first time, the change will be visible to you through the increased boot time and the display of approximately two screens of diagnostic output produced by POST and OpenBoot Diagnostics.
Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
The visible changes are due to the default settings of two configuration variables, diag-level (max) and verbosity (normal):
- diag-level (max) specifies maximum diagnostic testing, including extensive memory testing, which increases system boot time. See Reference for Estimating System Boot Time (to the ok Prompt) for more information about the increased boot time.
- verbosity (normal) specifies that diagnostic messages and information will be displayed, which usually produces approximately two screens of output. See Reference for Sample Outputs for diagnostic output samples of verbosity settings min and normal.
After initial power-on, you can customize the standard (default) configuration by setting the configuration variables to define a "normal mode" of operation that is appropriate for your production environment. TABLE 8-7 lists and describes the defaults and keywords of the OpenBoot configuration variables that control diagnostic testing and ASR capabilities. These are the variables you will set to define your normal mode of operation.
Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.
TABLE 8-7 OpenBoot Configuration Variables That Control Diagnostic Testing and Automatic System Restoration

auto-boot?
Determines whether the system boots automatically. Default is true.
- true - System automatically boots after initialization, provided no firmware-based (diagnostics or OpenBoot) errors are detected.
- false - System remains at the ok prompt until you type boot.

auto-boot-on-error?
Determines whether the system attempts a degraded boot after a nonfatal error. Default is true.
- true - System automatically boots after a nonfatal error if auto-boot? is also set to true.
- false - System remains at the ok prompt.

boot-device
Specifies the name of the default boot device, which is also the normal mode boot device.

boot-file
Specifies the default boot arguments, which are also the normal mode boot arguments.

diag-device
Specifies the name of the boot device that is used when diag-switch? is true.

diag-file
Specifies the boot arguments that are used when diag-switch? is true.

diag-level
Specifies the level or type of diagnostics that are executed. Default is max.
- off - No testing.
- min - Basic tests are run.
- max - More extensive tests might be run, depending on the device. Memory is extensively checked.

diag-out-console
Redirects system console output to the system controller.
- true - Redirects output to the system controller.
- false - Restores output to the local console.
Note: See your system documentation for information about redirecting system console output to the system controller. (Not all systems are equipped with a system controller.)

diag-passes
Specifies the number of consecutive executions of OpenBoot Diagnostics self-tests that are run from the OpenBoot Diagnostics (obdiag) menu. Default is 1.
Note: diag-passes applies only to systems with firmware that contains OpenBoot Diagnostics and has no effect outside the OpenBoot Diagnostics menu.

diag-script
Determines which devices are tested by OpenBoot Diagnostics. Default is normal.
- none - OpenBoot Diagnostics do not run.
- normal - Tests all devices that are expected to be present in the system's baseline configuration and for which self-tests exist.
- all - Tests all devices that have self-tests.

diag-switch?
Controls diagnostic execution in normal mode. Default is false.
For servers:
- true - Diagnostics are executed only on power-on reset events; the level of test coverage, verbosity, and output is determined by user-defined settings.
- false - Diagnostics are executed upon the next system reset, but only for the classes of reset events specified by the OpenBoot configuration variable diag-trigger. The level of test coverage, verbosity, and output is determined by user-defined settings.
For workstations:
- true - Diagnostics are executed only on power-on reset events; the level of test coverage, verbosity, and output is determined by user-defined settings.
- false - Diagnostics are disabled.

diag-trigger
Specifies the class of reset event that causes diagnostics to run automatically. Default setting is power-on-reset error-reset.
- none - Diagnostic tests are not executed.
- error-reset - Reset that is caused by certain hardware error events, such as RED State Exception Reset, Watchdog Resets, Software-Instruction Reset, or Hardware Fatal Reset.
- power-on-reset - Reset that is caused by power cycling the system.
- user-reset - Reset that is initiated by an OS panic or by user-initiated commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
- all-resets - Any kind of system reset.
Note: Both POST and OpenBoot Diagnostics run at the specified reset event if the variable diag-script is set to normal or all. If diag-script is set to none, only POST runs.

error-reset-recovery
Specifies the recovery action after an error reset. Default is sync.
- none - No recovery action.
- boot - System attempts to boot.
- sync - Firmware attempts to execute a Solaris sync callback routine.

service-mode?
Controls whether the system is in service mode. Default is false.
- true - Service mode. Diagnostics are executed at Sun-specified levels, overriding but preserving user settings.
- false - Normal mode. Diagnostics execution depends entirely on the settings of diag-switch? and other user-defined OpenBoot configuration variables.

test-args
Customizes OpenBoot Diagnostics tests. Allows a text string of reserved keywords (separated by commas) to be specified in the following ways:
- As an argument to the test command at the ok prompt.
- As an OpenBoot variable to the setenv command at the ok or obdiag prompt.
Note: The variable test-args applies only to systems with firmware that contains OpenBoot Diagnostics. See your system documentation for a list of keywords.

verbosity
Controls the amount and detail of OpenBoot, POST, and OpenBoot Diagnostics output. Default is normal.
- none - Only error and fatal messages are displayed on the system console. The banner is not displayed.
Note: Problems in systems with verbosity set to none might be deemed not diagnosable, rendering the system unserviceable by Sun.
- min - Notice, error, warning, and fatal messages are displayed on the system console. Transitional states and the banner are also displayed.
- normal - Summary progress and operational messages are displayed on the system console in addition to the messages displayed by the min setting. The work-in-progress indicator shows the status and progress of the boot sequence.
- max - Detailed progress and operational messages are displayed on the system console in addition to the messages displayed by the min and normal settings.
About Service Mode
Service mode is an operational mode defined by Sun that facilitates fault isolation and recovery of systems that appear to be nonfunctional. When initiated, service mode overrides the settings of key OpenBoot configuration variables.
Note that service mode does not change your stored settings. After initialization (at the ok prompt), all OpenBoot PROM configuration variables revert to the user-defined settings. In this way, you or your service provider can quickly invoke a known and maximum level of diagnostics and still preserve your normal mode settings.
TABLE 8-8 lists the OpenBoot configuration variables that are affected by service mode and the overrides that are applied when you select service mode.
TABLE 8-8 Service Mode Overrides

OpenBoot Configuration Variable | Service Mode Override
auto-boot? | false
diag-level | max
diag-trigger | power-on-reset error-reset user-reset
input-device | Factory default
output-device | Factory default
verbosity | max

The following apply only to systems with firmware that contains OpenBoot Diagnostics:
diag-script | normal
test-args | subtests,verbose
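The key property described above - service mode layers Sun-prescribed values over your settings without changing what is stored in NVRAM - can be sketched as a simple override map. This is an illustrative model only (the function and variable names are assumptions, not firmware interfaces), using the overrides from TABLE 8-8:

```python
# Service mode overrides, per TABLE 8-8; stored (user) settings are untouched.
SERVICE_MODE_OVERRIDES = {
    "auto-boot?": "false",
    "diag-level": "max",
    "diag-trigger": "power-on-reset error-reset user-reset",
    "input-device": "factory default",
    "output-device": "factory default",
    "verbosity": "max",
}

def effective_settings(stored: dict, service_mode: bool) -> dict:
    """Return the settings in effect for this boot. In service mode the
    overrides are layered on top of the stored values; the stored dict
    itself is never modified, mirroring the behavior described above."""
    if not service_mode:
        return dict(stored)
    return {**stored, **SERVICE_MODE_OVERRIDES}

stored = {"auto-boot?": "true", "diag-level": "min", "verbosity": "normal"}
eff = effective_settings(stored, service_mode=True)
print(eff["diag-level"])      # max
print(stored["diag-level"])   # min -- user settings preserved
```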
About Initiating Service Mode
Enhancements provide a software mechanism for specifying service mode:
- service-mode? configuration variable - When set to true, initiates service mode. (Service mode should be used only by authorized Sun service providers.)
Note - The diag-switch? configuration variable should remain at the default setting (false) for normal operation. To specify diagnostic testing for your OS, see To Initiate Normal Mode.
For instructions, see To Initiate Service Mode.
About Overriding Service Mode Settings
When the system is in service mode, three commands can override service mode settings. TABLE 8-9 describes the effect of each command.
TABLE 8-9 Scenarios for Overriding Service Mode Settings

Command | Issued From | What It Does
post | ok prompt | OpenBoot firmware forces a one-time execution of normal mode diagnostics.
bootmode diag | system controller | OpenBoot firmware overrides service mode settings and forces a one-time execution of normal mode diagnostics.
bootmode skip_diag | system controller | OpenBoot firmware suppresses service mode and bypasses all firmware diagnostics.
Note - Not all systems are equipped with a system controller.
About Normal Mode
Normal mode is the customized operational mode that you define for your environment. To define normal mode, set the values of the OpenBoot configuration variables that control diagnostic testing. See TABLE 8-7 for the list of variables that control diagnostic testing.
Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.
When you are deciding whether to enable diagnostic testing in your normal environment, remember that you should always run diagnostics to troubleshoot an existing problem, and after the following events:
- Initial system installation
- New hardware installation and replacement of defective hardware
- Hardware configuration modification
- Hardware relocation
- Firmware upgrade
- Power interruption or failure
- Hardware errors
- Severe or inexplicable software problems
About Initiating Normal Mode
If you define normal mode for your environment, you can specify normal mode with the following method:
System controller bootmode diag command - When you issue this command, it specifies normal mode with the configuration values defined by you - with the following exceptions:
- If you defined diag-level = off, bootmode diag specifies diagnostics at diag-level = min.
- If you defined verbosity = none, bootmode diag specifies diagnostics at verbosity = min.
Note - The next reset cycle must occur within 10 minutes of issuing the bootmode diag command, or the bootmode command is cleared and normal mode is not initiated.
For instructions, see To Initiate Normal Mode.
About the post Command
The post command enables you to easily invoke POST diagnostics and to control the level of testing and the amount of output. When you issue the post command, OpenBoot firmware performs the following actions:
- Initiates a user reset
- Triggers a one-time execution of POST at the test level and verbosity that you specify
- Clears old test results
- Displays and logs the new test results
Note - The post command overrides service mode settings and pending system controller bootmode diag and bootmode skip_diag commands.
The syntax for the post command is:
post [level [verbosity]]
where:
- level = min or max
- verbosity = min, normal, or max
The level and verbosity options provide the same functions as the OpenBoot configuration variables diag-level and verbosity. To determine which settings you should use for the post command options, see TABLE 8-7 for descriptions of the keywords for diag-level and verbosity.
You can specify settings for:
- Both level and verbosity
- level only (If you specify a verbosity setting, you must also specify a level setting.)
- Neither level nor verbosity
If you specify a setting for level only, the post command uses the normal mode value for verbosity with the following exception:
- If the normal mode value of verbosity = none, post uses verbosity = min.
If you specify settings for neither level nor verbosity, the post command uses the normal mode values you specified for the configuration variables diag-level and verbosity, with two exceptions:
- If the normal mode value of diag-level = off, post uses level = min.
- If the normal mode value of verbosity = none, post uses verbosity = min.
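The defaulting rules above can be summarized in one resolution function. This is an illustrative sketch of the documented fallback behavior (the function name and keyword parameters are assumptions for illustration, not an OpenBoot interface):

```python
def resolve_post_options(level=None, verbosity=None,
                         normal_diag_level="max", normal_verbosity="normal"):
    """Resolve the effective level/verbosity for the post command per the
    rules above: omitted options fall back to the normal mode values, except
    that diag-level = off and verbosity = none are promoted to min."""
    if level is None:
        level = "min" if normal_diag_level == "off" else normal_diag_level
    if verbosity is None:
        verbosity = "min" if normal_verbosity == "none" else normal_verbosity
    return level, verbosity

print(resolve_post_options())                                # defaults apply
print(resolve_post_options(level="min", normal_verbosity="none"))
```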
To Initiate Service Mode
For background information, see About Service Mode.
1. Set the service-mode? variable. At the ok prompt, type:
ok setenv service-mode? true
For service mode to take effect, you must reset the system.
2. At the ok prompt, type:
To Initiate Normal Mode
For background information, see About Normal Mode.
1. At the ok prompt, type:
ok setenv service-mode? false
The system will not actually enter normal mode until the next reset.
2. Type:
Reference for Estimating System Boot Time (to the ok Prompt)
Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
The measurement of system boot time begins when you power on (or reset) the system and ends when the OpenBoot ok prompt appears. During the boot time period, the firmware executes diagnostics (POST and OpenBoot Diagnostics) and performs OpenBoot initialization. The time required to run OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization is generally similar for all systems, though it varies with the number of I/O cards installed when diag-script is set to all. However, at the default settings (diag-level = max and verbosity = normal), POST executes extensive memory tests, which increase system boot time.
System boot time will vary from system to system, depending on the configuration of system memory and the number of CPUs:
- Because each CPU tests its associated memory and the memory tests run simultaneously, memory test time depends on the amount of memory on the most populated CPU.
- Because competition for system resources makes CPU testing a less linear process than memory testing, CPU test time depends on the number of CPUs.
If you need to know the approximate boot time of your new system before you power on for the first time, the following sections describe two methods you can use to estimate boot time:
Boot Time Estimates for Typical Configurations
The following are three typical configurations and the approximate boot time you can expect for each:
- Small configuration (2 CPUs and 4 Gbytes of memory) - Boot time is approximately 5 minutes.
- Medium configuration (4 CPUs and 16 Gbytes of memory) - Boot time is approximately 10 minutes.
- Large configuration (4 CPUs and 32 Gbytes of memory) - Boot time is approximately 15 minutes.
Estimating Boot Time for Your System
Generally, for systems configured with default settings, the times required to execute OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization are the same for all systems:
- 1 minute for OpenBoot Diagnostics testing (systems with a greater number of devices to be tested might require more time)
- 2 minutes for OpenBoot setup, configuration, and initialization
To estimate the time required to run POST memory tests, you need to know the amount of memory associated with the most populated CPU. To estimate the time required to run POST CPU tests, you need to know the number of CPUs. Use the following guidelines to estimate memory and CPU test times:
- 2 minutes per Gbyte of memory associated with the most populated CPU
- 1 minute per CPU
The following example shows how to estimate the system boot time of a sample configuration consisting of 4 CPUs and 32 Gbytes of system memory, with 8 Gbytes of memory on the most populated CPU.
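Applying the guidelines above to that sample configuration gives a concrete estimate. The following sketch (the function name is illustrative) simply adds the fixed OpenBoot times to the per-gigabyte and per-CPU guideline figures:

```python
def estimate_boot_minutes(num_cpus, gb_on_most_populated_cpu):
    """Rough boot-time estimate (to the ok prompt) using the guidelines
    above: 1 min OpenBoot Diagnostics + 2 min OpenBoot setup/config/init
    + 2 min per GB on the most populated CPU + 1 min per CPU."""
    obdiag = 1
    openboot_init = 2
    memory_test = 2 * gb_on_most_populated_cpu
    cpu_test = 1 * num_cpus
    return obdiag + openboot_init + memory_test + cpu_test

# Sample configuration from the text: 4 CPUs, 8 GB on the most populated CPU
print(estimate_boot_minutes(4, 8))  # 23
```

For the sample configuration, the estimate is 1 + 2 + (2 x 8) + (1 x 4) = 23 minutes.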
Reference for Sample Outputs
At the default setting of verbosity = normal, POST and OpenBoot Diagnostics generate less diagnostic output (about 2 pages) than was produced before the OpenBoot PROM enhancements (over 10 pages). This section includes output samples for verbosity settings at min and normal.
Note - The diag-level configuration variable also affects how much output the system generates. The following samples were produced with diag-level set to max, the default setting.
The following sample shows the firmware output after a power reset when verbosity is set to min. At this verbosity setting, OpenBoot firmware displays notice, error, warning, and fatal messages but does not display progress or operational messages. Transitional states and the power-on banner are also displayed. Since no error conditions were encountered, this sample shows only the POST execution message, the system's install banner, and the device self-tests conducted by OpenBoot Diagnostics.
Executing POST w/%o0 = 0000.0400.0101.2041
Sun Fire V445, Keyboard Present
Copyright 1998-2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok
The following sample shows the diagnostic output after a power reset when verbosity is set to normal, the default setting. At this verbosity setting, the OpenBoot firmware displays summary progress or operational messages in addition to the notice, error, warning, and fatal messages; transitional states; and install banner displayed by the min setting. On the console, the work-in-progress indicator shows the status and progress of the boot sequence.
Sun Fire V445, Keyboard Present
Copyright 1998-2004 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok
Reference for Determining Diagnostic Mode
The flowchart in FIGURE 8-7 summarizes graphically how various system controller and OpenBoot variables affect whether a system boots in normal or service mode, as well as whether any overrides occur.
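Since the flowchart cannot be reproduced in text, the core decision logic it summarizes can be sketched as follows. This is a simplified model under stated assumptions (only service-mode? and a pending system controller bootmode command are considered; the function name and return strings are illustrative):

```python
def diagnostic_mode(service_mode, bootmode=None):
    """Simplified sketch of the mode selection summarized by FIGURE 8-7:
    bootmode commands take precedence over service mode, which in turn
    overrides the user-defined normal mode settings."""
    if bootmode == "skip_diag":
        return "no firmware diagnostics"        # bypasses service mode too
    if bootmode == "diag":
        return "normal mode diagnostics"        # one-time override
    if service_mode:
        return "service mode diagnostics"       # Sun-prescribed levels
    return "normal mode (user-defined settings)"

print(diagnostic_mode(service_mode=True, bootmode="diag"))
# normal mode diagnostics
```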
CODE EXAMPLE 8-1
{3} ok post
SC Alert: Host System has Reset
Executing Power On Self Test
Q#0>
0>@(#)Sun Fire[TM] V445 POST 4.22.11 2006/06/12 15:10
/export/delivery/delivery/4.22/4.22.11/post4.22.x/Fiesta/boston/integrated (root)
0>Copyright © 2006 Sun Microsystems, Inc. All rights reserved
SUN PROPRIETARY/CONFIDENTIAL.
Use is subject to license terms.
0>OBP->POST Call with %o0=00000800.01012000.
0>Diag level set to MIN.
0>Verbosity level set to NORMAL.
0>Start Selftest.....
0>CPUs present in system: 0 1 2 3
0>Test CPU(s)....Done
0>Interrupt Crosscall....Done
0>Init Memory....|
SC Alert: Host System has Reset
'Done
0>PLL Reset....Done
0>Init Memory....Done
0>Test Memory....Done
0>IO-Bridge Tests....Done
0>INFO:
0> POST Passed all devices.
0>
0>POST: Return to OBP.
SC Alert: Host System has Reset
Configuring system memory & CPU(s)
Probing system devices
Probing memory
Probing I/O buses
screen not found.
keyboard not found.
Keyboard not present. Using ttya for input and output.
Probing system devices
Probing memory
Probing I/O buses
Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.11, 24576 MB memory installed, Serial #64548465.
Ethernet address 0:3:ba:d8:ee:71, Host ID: 83d8ee71.
FIGURE 8-7 Diagnostic Mode Flowchart
Quick Reference for Diagnostic Operation
TABLE 8-10 summarizes the effects of the following user actions on diagnostic operation:
- Set service-mode? to true
- Issue the bootmode commands, bootmode diag or bootmode skip_diag
- Issue the post command
TABLE 8-10 Summary of Diagnostic Operation
User Action
|
Sets Configuration Variables
|
And Initiates
|
Service Mode
|
Set service-mode? to true
|
Note: Service mode overrides the settings of the following configuration variables without changing your stored settings:
- auto-boot? = false
- diag-level = max
- diag-trigger = power-on-reset
error-reset user reset
- input-device = Factory default
- output-device = Factory default
- verbosity = max
The following apply only to systems with firmware that contains OpenBoot Diagnostics:
- diag-script = normal
- test-args = subtests,verbose
|
Service mode
(defined by Sun)
|
Normal Mode
|
Set service-mode? to false
|
- auto-boot? = user-defined setting
- auto-boot-on-error? = user-defined setting
- diag-level = user-defined setting
- verbosity = user-defined setting
- diag-script = user-defined setting
- diag-trigger = user-defined setting
- input-device = user-defined setting
- output-device = user-defined setting
|
Normal mode
(user-defined)
|
bootmode Commands
|
Issue bootmode diag command
|
Overrides service mode settings and uses normal mode settings with the following exceptions:
- diag-level = min if normal mode value = off
- verbosity = min if normal mode value = none
|
Normal mode diagnostics with the exceptions in the preceding column.
|
Issue bootmode skip_diag command
|
|
OpenBoot initialization without running diagnostics
|
post Command
Note: If the value of diag-script = normal or all, OpenBoot Diagnostics also run.
|
Issue post command
|
|
POST diagnostics
|
Specify both level and
verbosity
|
level and verbosity = user-defined values
|
|
Specify neither level nor verbosity
|
level and verbosity = normal mode values with the following exceptions:
- level = min if normal mode value of diag-level = none
- verbosity = min if normal mode value of verbosity = none
|
|
Specify level only
|
level = user-defined value
verbosity = normal mode value for verbosity (Exception: verbosity = min if normal mode value of verbosity = none)
|
|
OpenBoot Diagnostics
Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the boot PROM.
To Start OpenBoot Diagnostics
|
1. Type:
TABLE 8-11
ok setenv diag-switch? true
ok setenv auto-boot? false
ok reset-all
|
2. Type:
ok obdiag
This command displays the OpenBoot Diagnostics menu. See TABLE 8-13.
TABLE 8-13 Sample obdiag Menu
|
1 LSILogic,sas@1
4 rmc-comm@0,c28000 serial@3,fffff8
|
2 flashprom@0,0
5 rtc@0,70
|
3 network@0
6 serial@0,c2c000
|
Commands: test test-all except help what setenv set-default exit
|
diag-passes=1 diag-level=min test-args=args
|
Note - If you have a PCI card installed in the server, then additional tests will appear on the obdiag menu.
|
3. Type:
TABLE 8-14
obdiag> test n
|
where n represents the number corresponding to the test you want to run.
To see a summary of the tests, type help at the obdiag> prompt.
4. To run all tests, type:
TABLE 8-16
obdiag> test-all
Hit the spacebar to interrupt testing
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1 ......... passed
Testing /ebus@1f,464000/flashprom@0,0 ................................. passed
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0 Internal loopback test -- succeeded.
Link is -- up
........ passed
Testing /ebus@1f,464000/rmc-comm@0,c28000 ............................. passed
Testing /pci@1f,700000/pci@0/pci@1/pci@0/isa@1e/rtc@0,70 .............. passed
Testing /ebus@1f,464000/serial@0,c2c000 ............................... passed
Testing /ebus@1f,464000/serial@3,fffff8 ............................... passed
Pass:1 (of 1) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:1:1
Hit any key to return to the main menu
|
Note - From the obdiag prompt you can select a device from the list and test it. However, at the ok prompt you must use the full device path. In addition, the device must have a self-test method; otherwise, errors result.
|
Controlling OpenBoot Diagnostics Tests
Most of the OpenBoot configuration variables you use to control POST (see TABLE 8-7) also affect OpenBoot Diagnostics tests.
- Use the diag-level variable to control the OpenBoot Diagnostics testing level.
- Use test-args to customize how the tests run.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 8-17.
TABLE 8-17 Keywords for the test-args OpenBoot Configuration Variable
Keyword
|
What It Does
|
bist
|
Invokes built-in self-test (BIST) on external and peripheral devices
|
debug
|
Displays all debug messages
|
iopath
|
Verifies bus/interconnect integrity
|
loopback
|
Exercises external loopback path for the device
|
media
|
Verifies external and peripheral device media accessibility
|
restore
|
Attempts to restore original state of the device if the previous execution of the test failed
|
silent
|
Displays only errors rather than the status of each test
|
subtests
|
Displays main test and each subtest that is called
|
verbose
|
Displays detailed messages of status of all tests
|
callers=N
|
Displays backtrace of N callers when an error occurs
- callers=0 - displays backtrace of all callers before the error
|
errors=N
|
Continues executing the test until N errors are encountered
- errors=0 - displays all error reports without terminating testing
|
If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:
TABLE 8-18
ok setenv test-args debug,loopback,media
|
test and test-all Commands
You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:
TABLE 8-19
ok test /pci@x,y/SUNW,qlc@2
|
Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V445 system.
|
To customize an individual test, you can use test-args as follows:
TABLE 8-20
ok test /usb@1,3:test-args={verbose,debug}
|
This affects only the current test without changing the value of the test-args OpenBoot configuration variable.
You can test all the devices in the device tree with the test-all command.
If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
TABLE 8-22
ok test-all /pci@9,700000/usb@1,3
|
OpenBoot Diagnostics Error Messages
OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. The following example displays a sample OpenBoot Diagnostics error message.
CODE EXAMPLE 8-2 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/isa@7/flashprom@2,0
ERROR : There is no POST in this FLASHPROM or POST header is
unrecognized
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
MACHINE : Sun Fire V445
SERIAL# : 51347798
DATE : 03/05/2003 15:17:31 GMT
CONTR0LS: diag-level=max test-args=errors=1
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) .............
failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1
|
About OpenBoot Commands
OpenBoot commands are commands you type from the ok prompt. OpenBoot commands that can provide useful diagnostic information are:
- probe-scsi-all
- probe-ide
- show-devs
probe-scsi-all
The probe-scsi-all command diagnoses problems with the SAS devices.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi-all command can hang the system.
|
The probe-scsi-all command communicates with all SAS devices connected to on-board SAS controllers and accesses devices connected to any host adapters installed in PCI slots.
For any SAS device that is connected and active, the probe-scsi-all command displays its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.
The following is sample output from the probe-scsi-all command.
CODE EXAMPLE 8-3 Sample probe-scsi-all Command Output
{3} ok probe-scsi-all
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
MPT Version 1.05, Firmware Version 1.08.04.00
Target 0
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246b35 PhyNum 0
Target 1
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246bc1 PhyNum 1
Target 4 Volume 0
Unit 0 Disk LSILOGICLogical Volume 3000 16515070 Blocks, 8455 MB
Target 6
Unit 0 Disk FUJITSU MAV2073RCSUN72G 0301 143374738 Blocks, 73 GB
SASAddress 500000e0116a81c2 PhyNum 6
{3} ok
|
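As an illustration (not part of the firmware itself), output like the sample above can be captured from the console and mined with standard text tools. The sketch below, under the assumption that such a capture exists as a plain file, lists each attached disk's PHY number and World Wide Name; the /tmp path is hypothetical.

```shell
# Sketch: extract SAS WWNs from a captured probe-scsi-all console log.
# The here-document reproduces lines from the sample output above; the
# log file name is an assumption for the example.
cat > /tmp/probe-scsi.log <<'EOF'
Target 0
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246b35 PhyNum 0
Target 1
Unit 0 Disk SEAGATE ST973401LSUN72G 0356 143374738 Blocks, 73 GB
SASAddress 5000c50000246bc1 PhyNum 1
Target 6
Unit 0 Disk FUJITSU MAV2073RCSUN72G 0301 143374738 Blocks, 73 GB
SASAddress 500000e0116a81c2 PhyNum 6
EOF

# Field 2 is the WWN, field 4 is the PHY number on each SASAddress line.
awk '/^SASAddress/ { print "PhyNum " $4 ": " $2 }' /tmp/probe-scsi.log
```

Comparing the listed WWNs against a previous capture is a quick way to confirm that every expected disk is still visible to the controller.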
probe-ide
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.
|
Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.
|
The following is sample output from the probe-ide command.
CODE EXAMPLE 8-4 Sample probe-ide Command Output
{1} ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
show-devs
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 8-5 shows some sample output.
CODE EXAMPLE 8-5 show-devs Command Output (Truncated)
/i2c@1f,520000
/ebus@1f,464000
/pci@1f,700000
/pci@1e,600000
/memory-controller@3,0
/SUNW,UltraSPARC-IIIi@3,0
/memory-controller@2,0
/SUNW,UltraSPARC-IIIi@2,0
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/i2c@1f,520000/cpu-fru-prom@0,e8
/i2c@1f,520000/dimm-spd@0,e6
/i2c@1f,520000/dimm-spd@0,e4
.
.
.
/pci@1f,700000/pci@0
/pci@1f,700000/pci@0/pci@9
/pci@1f,700000/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2
/pci@1f,700000/pci@0/pci@1
/pci@1f,700000/pci@0/pci@2/pci@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2/pci@0/network@4,1
/pci@1f,700000/pci@0/pci@2/pci@0/network@4
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/disk
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/tape
|
To Run OpenBoot Commands
|
1. Halt the system to reach the ok prompt.
How you do this depends on the system's condition. If possible, you should warn users before you shut the system down.
2. Type the appropriate command at the console prompt.
About Predictive Self-Healing
In Solaris 10 systems, the Solaris Predictive Self-Healing (PSH) technology enables the Sun Fire V445 server to diagnose problems while the Solaris OS is running, and to mitigate many problems before they negatively affect operations.
The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun's knowledge article database.
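The two identifiers in that notification, the event UUID and the message ID, are what you feed to fmdump and to Sun's knowledge article site. As a minimal sketch, assuming the console message has been saved to a file (the /tmp path is hypothetical), both can be extracted with sed:

```shell
# Sketch: pull the event UUID and message ID out of a saved PSH console
# message so they can be passed to fmdump -v -u and looked up at
# http://www.sun.com/msg/. The two lines reproduce part of the sample
# message shown later in this chapter (TABLE 8-23).
cat > /tmp/psh.msg <<'EOF'
Jul 1 14:30:20 sunrise EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4
Jul 1 14:30:20 sunrise Refer to http://sun.com/msg/SUN4-8000-0Y for more information.
EOF

event_id=$(sed -n 's/.*EVENT-ID: //p' /tmp/psh.msg)
msg_id=$(sed -n 's|.*sun\.com/msg/\([A-Z0-9-]*\).*|\1|p' /tmp/psh.msg)
echo "Next step: fmdump -v -u $event_id   (knowledge article: $msg_id)"
```

On a live system you would run the suggested fmdump command directly; the sketch simply shows where each identifier lives in the message.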
The Predictive Self-Healing technology covers the following Sun Fire V445 server components:
- UltraSPARC IIIi processors
- Memory
- I/O bus
The PSH console message provides the following information:
- Type
- Severity
- Description
- Automated Response
- Impact
- Suggested Action for System Administrator
If the Solaris PSH facility has detected a faulty component, use the fmdump command (described in the following subsections) to identify the fault. Faulty FRUs are identified in fault messages using the FRU name.
Use the following web site to interpret faults and obtain information on a fault:
http://www.sun.com/msg/
This web site directs you to provide the message ID that your system displayed. The web site then provides knowledge articles about the fault and corrective action to resolve the fault. The fault information and documentation at this web site are updated regularly.
You can find more detailed descriptions of Solaris 10 Predictive Self-Healing at the following web site:
http://www.sun.com/bigadmin/features/articles/selfheal.html
Predictive Self-Healing Tools
In summary, the Solaris Fault Manager daemon (fmd) performs the following functions:
- Receives telemetry information about problems detected by the system software.
- Diagnoses the problems and provides system-generated messages.
- Initiates proactive self-healing activities such as disabling faulty components.
TABLE 8-23 shows a typical message generated when a fault occurs on your system. The message appears on your console and is recorded in the /var/adm/messages file.
Note - The messages in TABLE 8-23 indicate that the fault has already been diagnosed. Any corrective action that the system can perform has already taken place. If your server is still running, it continues to run.
|
TABLE 8-23 System Generated Predictive Self-Healing Message
Output Displayed
|
Description
|
Jul 1 14:30:20 sunrise EVENT-TIME: Tue Nov 1 16:30:20 PST 2005
|
EVENT-TIME: the time stamp of the diagnosis.
|
Jul 1 14:30:20 sunrise PLATFORM: SUNW,A70, CSN: -, HOSTNAME: sunrise
|
PLATFORM: A description of the system encountering the problem
|
Jul 1 14:30:20 sunrise SOURCE: eft, REV: 1.13
|
SOURCE: Information on the Diagnosis Engine used to determine the fault
|
Jul 1 14:30:20 sunrise EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4
|
EVENT-ID: The Universally Unique event ID (UUID) for this fault
|
Jul 1 14:30:20 sunrise DESC: A problem was detected in the PCI-Express subsystem
|
DESC: A basic description of the failure
|
Jul 1 14:30:20 sunrise Refer to http://sun.com/msg/SUN4-8000-0Y for more information.
|
WEBSITE: Where to find specific information and actions for this fault
|
Jul 1 14:30:20 sunrise AUTO-RESPONSE: One or more device instances may be disabled
|
AUTO-RESPONSE: What, if anything, the system did to alleviate any follow-on issues
|
Jul 1 14:30:20 sunrise IMPACT: Loss of services provided by the device instances associated with this fault
|
IMPACT: A description of what that response may have done
|
Jul 1 14:30:20 sunrise REC-ACTION: Schedule a repair procedure to replace the affected device. Use fmdump -v -u EVENT_ID to identify the device or contact Sun for support.
|
REC-ACTION: A short description of what the system administrator should do
|
Using the Predictive Self-Healing Commands
For complete information about Predictive Self-Healing commands, refer to the Solaris 10 man pages. This section describes some details of the following commands:
- fmdump(1M)
- fmadm(1M)
- fmstat(1M)
Using the fmdump Command
After the message in TABLE 8-23 is displayed, more information about the fault is available. The fmdump command displays the contents of any log files associated with the Solaris Fault Manager.
The fmdump command produces output similar to the following example, which assumes there is only one fault.
TABLE 8-24
# fmdump
TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
|
fmdump -V
The -V option provides more details.
TABLE 8-25
# fmdump -V -u 0ee65618-2218-4997-c0dc-b5c410ed8ec2
TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
100% fault.io.fire.asic
FRU: hc://product-id=SUNW,A70/motherboard=0
rsrc: hc:///motherboard=0/hostbridge=0/pciexrc=0
|
Three lines of new output are delivered with the -V option.
- The first line is a summary of information displayed previously in the console message but includes the timestamp, the UUID, and the Message-ID.
- The second line is a declaration of the certainty of the diagnosis. In this case the failure is in the ASIC described. If the diagnosis could involve multiple components, two lines would be displayed here with 50 percent in each, for example.
- The FRU line declares the part that needs to be replaced to return the system to a fully operational state.
- The rsrc line describes what component was taken out of service as a result of this fault.
fmdump -e
To get information about the errors that caused this failure, use the -e option.
TABLE 8-26
# fmdump -e
TIME CLASS
Nov 02 10:04:14.3008 ereport.io.fire.jbc.mb_per
|
Using the fmadm faulty Command
The fmadm utility lists and modifies system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault.
TABLE 8-27
# fmadm faulty
STATE RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
0ee65618-2218-4997-c0dc-b5c410ed8ec2
|
The PCI device is degraded and is associated with the same UUID as seen above. You may also see faulted states.
fmadm config
The fmadm config command output shows the version numbers of the diagnosis engines in use by your system, and also displays their current state. You can check these versions against information on the http://sunsolve.sun.com web site to determine if your server is using the latest diagnostic engines.
TABLE 8-28
# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.5 active UltraSPARC-III/IV CPU/Memory Diagnosis
cpumem-retire 1.1 active CPU/Memory Retire Agent
eft 1.16 active eft diagnosis engine
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
|
Using the fmstat Command
The fmstat command reports statistics associated with the Solaris Fault Manager, including information about diagnosis engine (DE) performance. In the example below, the eft DE (also seen in the console output) has received an event that it accepted. A case is opened for that event, and a diagnosis is performed to determine the cause of the failure.
TABLE 8-29
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-diagnosis 0 0 0.0 0.0 0 0 0 0 3.0K 0
cpumem-retire 0 0 0.0 0.0 0 0 0 0 0 0
eft 0 0 0.0 0.0 0 0 0 0 713K 0
fmd-self-diagnosis 0 0 0.0 0.0 0 0 0 0 0 0
io-retire 0 0 0.0 0.0 0 0 0 0 0 0
snmp-trapgen 0 0 0.0 0.0 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 6704.4 1 0 0 0 0 0
syslog-msgs 0 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 0.0 0 0 0 0 0 0
|
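When scanning fmstat output by hand, the columns of interest are usually ev_acpt (events the engine accepted) and open (open cases). The sketch below filters captured fmstat output for engines with activity; the eft figures here are hypothetical, illustrating the accepted-event case described in the text rather than the idle sample above.

```shell
# Sketch: list fault-manager modules that have accepted events or hold
# open cases, from captured fmstat output. The eft row is a hypothetical
# example of a busy diagnosis engine.
cat > /tmp/fmstat.out <<'EOF'
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis         0       0  0.0    0.0   0   0     0     0   3.0K      0
eft                      4       4  0.0    2.1   0   0     1     3   713K      0
zfs-diagnosis            0       0  0.0    0.0   0   0     0     0      0      0
EOF

# Column 3 is ev_acpt; column 8 is open cases. Skip the header row.
awk 'NR > 1 && ($3 > 0 || $8 > 0) { print $1 }' /tmp/fmstat.out
```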
About Traditional Solaris OS Diagnostic Tools
If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser OS. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based exerciser tools, SunVTS and Sun Management Center. These tools enable you to monitor the server, exercise it, and isolate faults.
Note - If you set the auto-boot? OpenBoot configuration variable to false, the OS does not boot following completion of the firmware-based tests.
|
In addition to the tools mentioned above, you can refer to error and system message log files, and Solaris system information commands.
Error and System Message Log Files
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the OS, the environmental control subsystem, and various software applications.
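Because PSH fault notifications are logged there too, a simple grep on the messages file is often the quickest way to find past faults. This sketch operates on a stand-in file so it is self-contained; the SUNW-MSG-ID line format follows the samples in this chapter, and the surrounding log lines are illustrative only.

```shell
# Sketch: search a messages file for Predictive Self-Healing fault entries.
# /tmp/messages stands in for /var/adm/messages; the non-fault lines are
# made-up filler to show the filter at work.
cat > /tmp/messages <<'EOF'
Jul 1 14:30:19 sunrise sendmail[612]: restarting
Jul 1 14:30:20 sunrise SUNW-MSG-ID: SUN4-8000-0Y, TYPE: Fault, SEVERITY: Critical
Jul 1 14:30:21 sunrise last message repeated 1 time
EOF

grep 'SUNW-MSG-ID' /tmp/messages
```

On a live server you would point grep at /var/adm/messages itself.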
Solaris System Information Commands
The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V445 server:
- prtconf
- prtdiag
- prtfru
- psrinfo
- showrev
This section describes the information these commands give you. For more information on using these commands, refer to the Solaris man pages.
Using the prtconf Command
The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 8-6 shows an excerpt of prtconf output (truncated to save space).
CODE EXAMPLE 8-6 prtconf Command Output (Truncated)
# prtconf
System Configuration: Sun Microsystems sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
SUNW,Sun-Fire-V445
packages (driver not attached)
SUNW,builtin-drivers (driver not attached)
deblocker (driver not attached)
disk-label (driver not attached)
terminal-emulator (driver not attached)
dropins (driver not attached)
kbd-translator (driver not attached)
obp-tftp (driver not attached)
SUNW,i2c-ram-device (driver not attached)
SUNW,fru-device (driver not attached)
ufs-file-system (driver not attached)
chosen (driver not attached)
openprom (driver not attached)
client-services (driver not attached)
options, instance #0
aliases (driver not attached)
memory (driver not attached)
virtual-memory (driver not attached)
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #0
SUNW,UltraSPARC-IIIi (driver not attached)
memory-controller, instance #1 ...
|
The prtconf command with the -p option produces output similar to that of the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
Using the prtdiag Command
The prtdiag command displays a table of diagnostic information that summarizes the status of system components.
The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of some of the output produced by prtdiag on a Sun Fire V445 server.
CODE EXAMPLE 8-7 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems sun4u Sun Fire V445
System clock frequency: 199 MHZ
Memory size: 24GB
==================================== CPUs ====================================
E$ CPU CPU
CPU Freq Size Implementation Mask Status Location
--- -------- ---------- --------------------- ----- ------ --------
0 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C0/P0
1 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C1/P0
2 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C2/P0
3 1592 MHz 1MB SUNW,UltraSPARC-IIIi 3.4 on-line MB/C3/P0
================================= IO Devices =================================
Bus Freq Slot + Name +
Type MHz Status Path Model
------ ---- ---------- ---------------------------- --------------------
pci 199 MB/PCI4 LSILogic,sas-pci1000,54 (scs+ LSI,1068
okay /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
pci 199 MB/PCI5 pci108e,abba (network) SUNW,pci-ce
okay /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
pciex 199 MB pci14e4,1668 (network)
okay /pci@1e,600000/pci/pci/pci/network
pciex 199 MB pci14e4,1668 (network)
okay /pci@1e,600000/pci/pci/pci/network
pciex 199 MB pci10b9,5229 (ide)
okay /pci@1f,700000/pci@0/pci@1/pci@0/ide
pciex 199 MB pci14e4,1668 (network)
okay /pci@1f,700000/pci@0/pci@2/pci@0/network
pciex 199 MB pci14e4,1668 (network)
okay /pci@1f,700000/pci@0/pci@2/pci@0/network
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address Size Interleave Factor Contains
-----------------------------------------------------------------------
0x0 8GB 16 BankIDs 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0x1000000000 8GB 16 BankIDs 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0x2000000000 4GB 4 BankIDs 32,33,34,35
0x3000000000 4GB 4 BankIDs 48,49,50,51
Bank Table:
-----------------------------------------------------------
Physical Location
ID ControllerID GroupID Size Interleave Way
-----------------------------------------------------------
0 0 0 512MB 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1 0 0 512MB
2 0 1 512MB
3 0 1 512MB
4 0 0 512MB
5 0 0 512MB
6 0 1 512MB
7 0 1 512MB
8 0 1 512MB
9 0 1 512MB
10 0 0 512MB
11 0 0 512MB
12 0 1 512MB
13 0 1 512MB
14 0 0 512MB
15 0 0 512MB
16 1 0 512MB 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
17 1 0 512MB
18 1 1 512MB
19 1 1 512MB
20 1 0 512MB
21 1 0 512MB
22 1 1 512MB
23 1 1 512MB
24 1 1 512MB
25 1 1 512MB
26 1 0 512MB
27 1 0 512MB
28 1 1 512MB
29 1 1 512MB
30 1 0 512MB
31 1 0 512MB
32 2 0 1GB 0,1,2,3
33 2 1 1GB
34 2 1 1GB
35 2 0 1GB
48 3 0 1GB 0,1,2,3
49 3 1 1GB
50 3 1 1GB
51 3 0 1GB
Memory Module Groups:
--------------------------------------------------
ControllerID GroupID Labels Status
--------------------------------------------------
0 0 MB/C0/P0/B0/D0
0 0 MB/C0/P0/B0/D1
0 1 MB/C0/P0/B1/D0
0 1 MB/C0/P0/B1/D1
1 0 MB/C1/P0/B0/D0
1 0 MB/C1/P0/B0/D1
1 1 MB/C1/P0/B1/D0
1 1 MB/C1/P0/B1/D1
2 0 MB/C2/P0/B0/D0
2 0 MB/C2/P0/B0/D1
2 1 MB/C2/P0/B1/D0
2 1 MB/C2/P0/B1/D1
3 0 MB/C3/P0/B0/D0
3 0 MB/C3/P0/B0/D1
3 1 MB/C3/P0/B1/D0
3 1 MB/C3/P0/B1/D1
=============================== usb Devices ===============================
Name Port#
------------ -----
hub HUB0
bash-3.00#
The following verbose output shows a failed fan tachometer:
============================ Environmental Status ============================
Fan Status:
-------------------------------------------
Location Sensor Status
-------------------------------------------
MB/FT0/F0 TACH okay
MB/FT1/F0 TACH failed (0 rpm)
MB/FT2/F0 TACH okay
MB/FT5/F0 TACH okay
PS1 FF_FAN okay
PS3 FF_FAN okay
Temperature sensors:
-----------------------------------------
Location Sensor Status
-----------------------------------------
MB/C0/P0 T_CORE okay
MB/C1/P0 T_CORE okay
MB/C2/P0 T_CORE okay
MB/C3/P0 T_CORE okay
MB/C0 T_AMB okay
MB/C1 T_AMB okay
MB/C2 T_AMB okay
MB/C3 T_AMB okay
MB T_CORE okay
MB IO_T_AMB okay
MB/FIOB T_AMB okay
MB T_AMB okay
PS1 FF_OT okay
PS3 FF_OT okay
------------------------------------
Current sensors:
----------------------------------------
Location Sensor Status
----------------------------------------
MB/USB0 I_USB0 okay
MB/USB1 I_USB1 okay
|
In addition to the information in CODE EXAMPLE 8-7, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.
CODE EXAMPLE 8-8 prtdiag Verbose Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 59 OK
CPU2 64 OK
DBP0 22 OK
|
In the event of an overtemperature condition, prtdiag reports an error in the Status column.
CODE EXAMPLE 8-9 prtdiag Overtemperature Indication Output
System Temperatures (Celsius):
-------------------------------
Device Temperature Status
---------------------------------------
CPU0 62 OK
CPU1 102 ERROR
|
Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.
CODE EXAMPLE 8-10 prtdiag Fault Indication Output
Fan Status:
-----------
Bank RPM Status
---- ----- ------
CPU0 4166 [NO_FAULT]
CPU1 0000 [FAULT]
|
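Status columns like these are easy to screen mechanically: anything other than an okay (or OK / NO_FAULT) status deserves attention. A minimal sketch over the fan-status format shown above:

```shell
# Sketch: flag any non-okay entries in captured prtdiag fan-status output.
# The input reproduces the failed-tachometer sample above.
cat > /tmp/fans.out <<'EOF'
MB/FT0/F0           TACH      okay
MB/FT1/F0           TACH      failed (0 rpm)
MB/FT2/F0           TACH      okay
PS1                 FF_FAN    okay
EOF

# Column 1 is the location, column 3 the status word.
awk '$3 != "okay" { print "ATTENTION:", $1, $3 }' /tmp/fans.out
```

The same pattern applies to the temperature and current sensor tables, adjusting the column index for each table's layout.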
Using the prtfru Command
The Sun Fire V445 system maintains a hierarchical list of all FRUs in the system, as well as specific information about various FRUs.
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 8-11 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
CODE EXAMPLE 8-11 prtfru -l Command Output (Truncated)
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT3?Label=FT3
/frutree/chassis/MB?Label=MB/system-board/FT4?Label=FT4
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module (container)
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu/B0?Label=B0
|
CODE EXAMPLE 8-12 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
CODE EXAMPLE 8-12 prtfru -c Command Output
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
SEGMENT: FD
/Customer_DataR
/Customer_DataR/UNIX_Timestamp32: Wed Dec 31 19:00:00 EST 1969
/Customer_DataR/Cust_Data:
/InstallationR (4 iterations)
/InstallationR[0]
/InstallationR[0]/UNIX_Timestamp32: Fri Dec 31 20:47:13 EST 1999
/InstallationR[0]/Fru_Path: MB.SEEPROM
/InstallationR[0]/Parent_Part_Number: 5017066
/InstallationR[0]/Parent_Serial_Number: BM004E
/InstallationR[0]/Parent_Dash_Level: 05
/InstallationR[0]/System_Id:
/InstallationR[0]/System_Tz: 238
/InstallationR[0]/Geo_North: 15658734
/InstallationR[0]/Geo_East: 15658734
/InstallationR[0]/Geo_Alt: 238
/InstallationR[0]/Geo_Location:
/InstallationR[1]
/InstallationR[1]/UNIX_Timestamp32: Mon Mar 6 10:08:30 EST 2006
/InstallationR[1]/Fru_Path: MB.SEEPROM
/InstallationR[1]/Parent_Part_Number: 3753302
/InstallationR[1]/Parent_Serial_Number: 0001
/InstallationR[1]/Parent_Dash_Level: 03
/InstallationR[1]/System_Id:
/InstallationR[1]/System_Tz: 238
/InstallationR[1]/Geo_North: 15658734
/InstallationR[1]/Geo_East: 15658734
/InstallationR[1]/Geo_Alt: 238
/InstallationR[1]/Geo_Location:
/InstallationR[2]
/InstallationR[2]/UNIX_Timestamp32: Tue Apr 18 10:00:45 EDT 2006
/InstallationR[2]/Fru_Path: MB.SEEPROM
/InstallationR[2]/Parent_Part_Number: 5017066
/InstallationR[2]/Parent_Serial_Number: BM004E
/InstallationR[2]/Parent_Dash_Level: 05
/InstallationR[2]/System_Id:
/InstallationR[2]/System_Tz: 0
/InstallationR[2]/Geo_North: 12704
/InstallationR[2]/Geo_East: 1
/InstallationR[2]/Geo_Alt: 251
/InstallationR[2]/Geo_Location:
/InstallationR[3]
/InstallationR[3]/UNIX_Timestamp32: Fri Apr 21 08:50:32 EDT 2006
/InstallationR[3]/Fru_Path: MB.SEEPROM
/InstallationR[3]/Parent_Part_Number: 3753302
/InstallationR[3]/Parent_Serial_Number: 0001
/InstallationR[3]/Parent_Dash_Level: 03
/InstallationR[3]/System_Id:
/InstallationR[3]/System_Tz: 0
/InstallationR[3]/Geo_North: 1
/InstallationR[3]/Geo_East: 16531457
/InstallationR[3]/Geo_Alt: 251
/InstallationR[3]/Geo_Location:
/Status_EventsR (0 iterations)
SEGMENT: PE
/Power_EventsR (50 iterations)
/Power_EventsR[0]
/Power_EventsR[0]/UNIX_Timestamp32: Mon Jul 10 12:34:20 EDT 2006
/Power_EventsR[0]/Event: power_on
/Power_EventsR[1]
/Power_EventsR[1]/UNIX_Timestamp32: Mon Jul 10 12:34:49 EDT 2006
/Power_EventsR[1]/Event: power_off
/Power_EventsR[2]
/Power_EventsR[2]/UNIX_Timestamp32: Mon Jul 10 12:35:27 EDT 2006
/Power_EventsR[2]/Event: power_on
/Power_EventsR[3]
/Power_EventsR[3]/UNIX_Timestamp32: Mon Jul 10 12:58:43 EDT 2006
/Power_EventsR[3]/Event: power_off
/Power_EventsR[4]
/Power_EventsR[4]/UNIX_Timestamp32: Mon Jul 10 13:07:27 EDT 2006
/Power_EventsR[4]/Event: power_on
/Power_EventsR[5]
/Power_EventsR[5]/UNIX_Timestamp32: Mon Jul 10 14:07:20 EDT 2006
/Power_EventsR[5]/Event: power_off
/Power_EventsR[6]
/Power_EventsR[6]/UNIX_Timestamp32: Mon Jul 10 14:07:21 EDT 2006
/Power_EventsR[6]/Event: power_on
/Power_EventsR[7]
/Power_EventsR[7]/UNIX_Timestamp32: Mon Jul 10 14:17:01 EDT 2006
/Power_EventsR[7]/Event: power_off
/Power_EventsR[8]
/Power_EventsR[8]/UNIX_Timestamp32: Mon Jul 10 14:40:22 EDT 2006
/Power_EventsR[8]/Event: power_on
/Power_EventsR[9]
/Power_EventsR[9]/UNIX_Timestamp32: Mon Jul 10 14:42:38 EDT 2006
/Power_EventsR[9]/Event: power_off
/Power_EventsR[10]
/Power_EventsR[10]/UNIX_Timestamp32: Mon Jul 10 16:12:35 EDT 2006
/Power_EventsR[10]/Event: power_on
/Power_EventsR[11]
/Power_EventsR[11]/UNIX_Timestamp32: Tue Jul 11 08:53:47 EDT 2006
/Power_EventsR[11]/Event: power_off
/Power_EventsR[12]
|
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:
- FRU description
- Manufacturer name and location
- Part number and serial number
- Hardware revision levels
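As an illustrative sketch (not part of the standard toolset), the part and serial numbers can be pulled out of saved prtfru output with a short awk filter. The sample records below are copied from the installation records shown above; on a live system you would pipe /usr/sbin/prtfru -c into awk instead of using a saved sample.

```shell
# Illustrative only: extract each installation record's parent part and
# serial number from saved `prtfru -c` output. The sample records are
# copied from the output shown above.
prtfru_sample='/InstallationR[2]/Fru_Path: MB.SEEPROM
/InstallationR[2]/Parent_Part_Number: 5017066
/InstallationR[2]/Parent_Serial_Number: BM004E
/InstallationR[3]/Fru_Path: MB.SEEPROM
/InstallationR[3]/Parent_Part_Number: 3753302
/InstallationR[3]/Parent_Serial_Number: 0001'

printf '%s\n' "$prtfru_sample" |
  awk -F': ' '/Parent_Part_Number/   { part = $2 }
              /Parent_Serial_Number/ { print "part " part " serial " $2 }'
```

The filter prints one line per installation record, pairing each part number with its serial number.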
Using the psrinfo Command
The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.
CODE EXAMPLE 8-13 psrinfo -v Command Output
# psrinfo -v
Status of virtual processor 0 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:26.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
Status of virtual processor 3 as of: 07/13/2006 14:18:39
on-line since 07/13/2006 14:01:24.
The sparcv9 processor operates at 1592 MHz,
and has a sparcv9 floating point processor.
|
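Because psrinfo -v emits several lines per processor, a small awk filter can condense the output to one line per CPU. This is a hedged sketch, not a documented Solaris utility; the sample text mirrors CODE EXAMPLE 8-13, and on a live system you would pipe /usr/sbin/psrinfo -v into awk instead.

```shell
# Illustrative only: summarize `psrinfo -v` output as one line per CPU.
psrinfo_sample='Status of virtual processor 0 as of: 07/13/2006 14:18:39
  on-line since 07/13/2006 14:01:26.
  The sparcv9 processor operates at 1592 MHz,
Status of virtual processor 1 as of: 07/13/2006 14:18:39
  on-line since 07/13/2006 14:01:26.
  The sparcv9 processor operates at 1592 MHz,'

printf '%s\n' "$psrinfo_sample" |
  awk '/virtual processor/ { cpu = $5 }
       /operates at/       { sub(",", "", $7); print "cpu " cpu ": " $6 " " $7 }'
```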
Using the showrev Command
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 8-14 shows sample output of the showrev command.
CODE EXAMPLE 8-14 showrev Command Output
# showrev
Hostname: sunrise
Hostid: 83d8ee71
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: Ecd.East.Sun.COM
Kernel version: SunOS 5.10 Generic_118833-17
bash-3.00#
|
When used with the -p option, this command displays installed patches. TABLE 8-30 shows a partial sample output from the showrev command with the -p option.
TABLE 8-30 showrev -p Command Output
Patch: 109729-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109783-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109807-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 109809-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110905-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110910-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 110914-01 Obsoletes: Requires: Incompatibles: Packages: SUNWcsu
Patch: 108964-04 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr
|
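When scripting around this output, the patch ID in the second field is what you typically match on. The following is a minimal sketch (the patch_installed helper is hypothetical, not a Solaris command); the patch list is a fragment of the sample in TABLE 8-30, and on a live system you would substitute the output of /usr/bin/showrev -p.

```shell
# Illustrative only: check whether a given patch ID appears in
# `showrev -p`-style output.
patches='Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr'

# Hypothetical helper: succeed if any "Patch:" line starts with the
# given base patch ID (revision suffix ignored).
patch_installed() {
  printf '%s\n' "$patches" |
    awk -v id="$1" '$1 == "Patch:" && index($2, id "-") == 1 { found = 1 }
                    END { exit !found }'
}

patch_installed 109729 && echo "patch 109729 is installed"
patch_installed 999999 || echo "patch 999999 is not installed"
```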
To Run Solaris System Information Commands
|
1. Decide what kind of system information you want to display.
For more information, see Solaris System Information Commands.
2. Type the appropriate command at a console prompt.
See TABLE 8-31 for a summary of the commands.
TABLE 8-31 Using Solaris Information Display Commands
Command | What It Displays | What to Type | Notes
fmadm | Fault management information | /usr/sbin/fmadm | Lists information and changes settings.
fmdump | Fault management information | /usr/sbin/fmdump | Use the -v option for additional detail.
prtconf | System configuration information | /usr/sbin/prtconf | -
prtdiag | Diagnostic and configuration information | /usr/platform/sun4u/sbin/prtdiag | Use the -v option for additional detail.
prtfru | FRU hierarchy and SEEPROM memory contents | /usr/sbin/prtfru | Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.
psrinfo | Date and time each CPU came online; processor clock speed | /usr/sbin/psrinfo | Use the -v option to obtain clock speed and other data.
showrev | Hardware and software revision information | /usr/bin/showrev | Use the -p option to show software patches.
Viewing Recent Diagnostic Test Results
A summary of the results of the most recent power-on self-test (POST) is saved across power cycles.
To View Recent Test Results
|
1. Obtain the ok prompt.
2. To see a summary of the most recent POST results, type:
TABLE 8-32
ok show-post-results
|
Setting OpenBoot Configuration Variables
Switches and diagnostic configuration variables stored in the IDPROM determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 8-7.
Changes to OpenBoot configuration variables usually take effect upon the next reboot.
To View and Set OpenBoot Configuration Variables
|
1. Obtain the ok prompt.
- To display the current values of all OpenBoot configuration variables, use the printenv command.
The following example shows a short excerpt of this command's output.
TABLE 8-33
ok printenv
Variable Name Value Default Value
diag-level min min
diag-switch? false false
|
- To set or change the value of an OpenBoot configuration variable, use the setenv command:
TABLE 8-34
ok setenv diag-level max
diag-level =
max
|
To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.
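For example, the diag-trigger variable described later under Reset Scenarios accepts reset-event keywords such as power-on-reset and error-reset. Setting both at once would look like the following; the echoed confirmation line is illustrative, patterned on the setenv output shown above:

```
ok setenv diag-trigger power-on-reset error-reset
diag-trigger =        power-on-reset error-reset
```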
Additional Diagnostic Tests for Specific Devices
Using the probe-scsi Command to Confirm That Hard Disk Drives Are Active
The probe-scsi command transmits an inquiry to SAS devices connected to the system's internal SAS interface. If a SAS device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.
CODE EXAMPLE 8-15 probe-scsi Output Message
ok probe-scsi
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
The probe-scsi-all command transmits an inquiry to all SAS devices connected to both the system's internal and external SAS interfaces. CODE EXAMPLE 8-16 shows sample output from a server with no externally connected SAS devices but with two 36-Gbyte hard disk drives, both of them active.
CODE EXAMPLE 8-16 probe-scsi-all Output Message
ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1
/pci@1f,0/pci@1/scsi@8
Target 0
Unit 0 Disk SEAGATE ST336605LSUN36G 4207
Target 1
Unit 0 Disk SEAGATE ST336605LSUN36G 0136
|
Using the probe-ide Command to Confirm That the DVD Drive Is Connected
The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system's on-board IDE interface. The following sample output reports a DVD drive installed (as Device 0) and active in a server.
CODE EXAMPLE 8-17 probe-ide Output Message
ok probe-ide
Device 0 ( Primary Master )
Removable ATAPI Model: DV-28E-B
Device 1 ( Primary Slave )
Not Present
Device 2 ( Secondary Master )
Not Present
Device 3 ( Secondary Slave )
Not Present
|
Using the watch-net and watch-net-all Commands to Check the Network Connections
The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated by an X and an associated error description.
Start the watch-net diagnostic test by typing the watch-net command at the ok prompt. For the watch-net-all diagnostic test, type watch-net-all at the ok prompt.
CODE EXAMPLE 8-18 watch-net Diagnostic Output Message
{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.................................
|
CODE EXAMPLE 8-19 watch-net-all Diagnostic Output Message
{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
|
About Automatic Server Restart
Note - Automatic Server Restart is not the same as Automatic System Restoration (ASR), which the Sun Fire V445 server also supports.
|
Automatic Server Restart is a functional part of ALOM. It monitors the Solaris OS while the OS is running and, by default, responds to a hang by capturing CPU register and memory contents to the dump device using the firmware-level sync command.
ALOM uses a watchdog process to monitor only the kernel. ALOM does not restart the server if a process hangs while the kernel is still running. The ALOM watchdog parameters for the watchdog patting interval and the watchdog timeout are not user-configurable.
If the kernel hangs and the watchdog times out, ALOM reports and logs the event and performs one of three user-configurable actions:
- xir: This is the default action. It causes the server to capture CPU register and memory contents to the dump device using the firmware-level sync command. If the sync itself hangs, ALOM falls back to a hard reset after 15 minutes.
Note - Do not confuse this OpenBoot sync command with the Solaris OS sync command, which results in I/O writes of buffered data to the disk drives, prior to unmounting file systems.
|
- Reset: This is a hard reset. It results in rapid system recovery, but diagnostic data about the hang is not stored, and file system damage may result.
- None: This leaves the system in the hung state indefinitely after the watchdog timeout has been reported.
For more information, see the sys_autorestart section of the ALOM Online Help.
About Automatic System Restoration
Note - Automatic System Restoration (ASR) is not the same as Automatic Server Restart, which the Sun Fire V445 server also supports.
|
Automatic System Restoration (ASR) consists of self-test features and an auto-configuring capability that detect failed hardware components and unconfigure them. This enables the server to resume operating after certain nonfatal hardware faults or failures have occurred.
If ASR monitors a component and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails.
ASR monitors the following components:
If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.
If a fault occurs on a running server, and it is possible for the server to run without the failed component, the server automatically reboots. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.
To support such a degraded boot capability, the OpenBoot firmware uses the 1275 Client Interface (via the device tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The Solaris OS will not activate a driver for any subsystem so marked.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system will reboot automatically and resume operation while a service call is made.
Note - ASR is enabled by default.
|
Auto-Boot Options
The OpenBoot firmware stores configuration variables called auto-boot? and auto-boot-on-error? on a ROM chip. The default setting on the Sun Fire V445 server for both of these variables is true.
The auto-boot? setting controls whether or not the firmware automatically boots the OS after each reset. The auto-boot-on-error? setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true (default) to enable an automatic degraded boot.
To Set the Auto-Boot Switches
|
1. Type:
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
|
Note - With both of these variables set to true, the system attempts a degraded boot in response to any nonfatal error.
|
Error Handling Summary
Error handling during the power-on sequence falls into one of the following three cases:
- If no errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true.
- If only nonfatal errors are detected by POST or OpenBoot Diagnostics, the system attempts to boot if auto-boot? is true and auto-boot-on-error? is true. Non-fatal errors include the following:
- SAS subsystem failure. In this case, a working alternate path to the boot disk is required. For more information, see About Multipathing Software.
- Ethernet interface failure.
- USB interface failure.
- Serial interface failure.
- PCI card failure.
- Memory failure.
Given a failed DIMM, the firmware unconfigures the entire logical bank associated with the failed module. Another nonfailing logical bank must be present in the system for the system to attempt a degraded boot. See About the CPU/Memory Modules.
Note - If POST or OpenBoot Diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
|
- If a critical or fatal error is detected by POST or OpenBoot Diagnostics, the system will not boot regardless of the settings of auto-boot? or auto-boot-on-error?. Critical and fatal nonrecoverable errors include the following:
- Any CPU failed
- All logical memory banks failed
- Flash RAM cyclical redundancy check (CRC) failure
- Critical field-replaceable unit (FRU) PROM configuration data failure
- Critical application-specific integrated circuit (ASIC) failure
For more information about troubleshooting fatal errors, see Chapter 9.
Reset Scenarios
Two OpenBoot configuration variables, diag-switch? and diag-trigger, control whether the system executes firmware diagnostics in response to system reset events.
POST is enabled as the default for power-on-reset and error-reset events. When the diag-switch? variable is set to true, diagnostics are executed using user-defined settings. If the diag-switch? variable is set to false, diagnostics are executed depending on the diag-trigger variable setting.
In addition, ASR is enabled by default because diag-trigger is set to power-on-reset and error-reset. This default remains in effect when the diag-switch? variable is set to false. The auto-boot? and auto-boot-on-error? variables are set to true by default.
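You can confirm these settings individually by passing a variable name to the printenv command. Output along the following lines would be expected; the values shown are the documented defaults, not captured from a live system:

```
ok printenv diag-trigger
diag-trigger =        power-on-reset error-reset
ok printenv auto-boot?
auto-boot? =          true
```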
Automatic System Restoration User Commands
The OpenBoot commands .asr, asr-disable, and asr-enable are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see Unconfiguring a Device Manually.
Enabling Automatic System Restoration
The ASR feature is enabled by default. ASR is always enabled when the diag-switch? OpenBoot variable is set to true, and when the diag-trigger setting is set to error-reset.
To activate any parameter changes, type the following at the ok prompt:
ok reset-all
The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (default).
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
|
Disabling Automatic System Restoration
After you disable the automatic system restoration (ASR) feature, it is not activated again until you enable it at the system ok prompt.
To Disable Automatic System Restoration
|
1. At the ok prompt, type:
ok setenv auto-boot-on-error? false
|
2. To activate the parameter change, type:
ok reset-all
The system permanently stores the parameter change.
Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
|
Displaying Automatic System Restoration Information
Use the following command to display information about the status of the ASR feature.
At the ok prompt, type:
ok .asr
In the .asr command output, any devices marked disabled have been manually unconfigured using the asr-disable command. The .asr command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the OpenBoot ASR feature.
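As a sketch of how these commands fit together, the following sequence unconfigures a device, confirms its state, and then reconfigures it. The device name here is a hypothetical placeholder; use the names reported by the .asr listing on your own system:

```
ok asr-disable device-name
ok .asr
ok asr-enable device-name
```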
About SunVTS
SunVTS is a software suite that performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.
You can run SunVTS software in the following test modes:
- Connection test mode provides a low-stress, quick testing of the availability and connectivity of selected devices. These tests are nonintrusive, meaning they release the devices after a quick test, and they do not place a heavy load on system activity.
- Functional test mode provides robust testing of your system and devices. It uses your system resources for thorough testing and it assumes that no other applications are running.
- Exclusive test mode enables you to perform tests that require that no other SunVTS tests or applications be running at the same time.
- Online test mode enables SunVTS testing to run while other customer applications are running.
- Auto Config automatically detects all subsystems and exercises them in one of two ways:
- Confidence testing - Performs one pass of tests on all subsystems, and then stops. For typical system configurations, this requires one or two hours.
- Comprehensive testing - Tests all subsystems repeatedly for up to 24 hours.
Since SunVTS software can run many tests in parallel and consume many system resources, you should be cautious when using it on a production system. If you are stress-testing a system using the Functional test mode, do not run anything else on that system at the same time.
To install and use SunVTS, a system must be running a Solaris OS release compatible with the SunVTS version. Since SunVTS software packages are optional, they may not be installed on your system. See To Find Out Whether SunVTS Is Installed for instructions.
SunVTS Software and Security
During SunVTS software installation, you must choose between Basic and Sun Enterprise Authentication Mechanism security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. Sun Enterprise Authentication Mechanism security is based on Kerberos, the standard network authentication protocol, and provides secure user authentication, data integrity, and privacy for transactions over networks.
If your site uses Sun Enterprise Authentication Mechanism security, you must have the Sun Enterprise Authentication Mechanism client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use Sun Enterprise Authentication Mechanism security, do not choose the Sun Enterprise Authentication Mechanism option during SunVTS software installation.
If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you choose, you may find yourself unable to run SunVTS tests. For more information, see the SunVTS User's Guide and the instructions accompanying the Sun Enterprise Authentication Mechanism software.
Using SunVTS
SunVTS, the Sun Validation and Test Suite, is an online diagnostics tool that you can use to verify the configuration and functionality of hardware controllers, devices, and platforms. It runs in the Solaris OS and presents the following interfaces:
- Command line interface
- Serial (TTY) interface
SunVTS software enables you to view and control testing sessions on a remotely connected server. TABLE 8-35 lists some of the tests that are available:
TABLE 8-35 SunVTS Tests
SunVTS Test | Description
cputest | Tests the CPU
disktest | Tests the local disk drives
dvdtest | Tests the DVD-ROM drive
fputest | Tests the floating-point unit
nettest | Tests the Ethernet hardware on the system board and the networking hardware on any optional PCI cards
netlbtest | Performs a loopback test to check that the Ethernet adapter can send and receive packets
pmemtest | Tests the physical memory (read only)
sutest | Tests the server's on-board serial ports
vmemtest | Tests the virtual memory (a combination of the swap partition and the physical memory)
env6test | Tests the environmental devices
ssptest | Tests ALOM hardware devices
i2c2test | Tests I2C devices for correct operation
To Find Out Whether SunVTS Is Installed
|
Type:
TABLE 8-36
# pkginfo -l SUNWvts
|
If SunVTS software is loaded, information about the package will be displayed.
If SunVTS software is not loaded, you will see the following error message:
TABLE 8-37
ERROR: information for "SUNWvts" was not found
|
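In a script, pkginfo's exit status alone distinguishes the two cases, so you do not need to parse the error message. A minimal sketch:

```shell
# Illustrative only: branch on pkginfo's exit status to decide whether
# SunVTS is present. On a machine without the pkginfo command (or
# without the SUNWvts package), the else branch runs.
if pkginfo -l SUNWvts >/dev/null 2>&1; then
  echo "SUNWvts is installed"
else
  echo "SUNWvts is not installed"
fi
```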
Installing SunVTS
By default, SunVTS is not installed on the Sun Fire V445 server. However, it is available on the Solaris 10 DVD supplied in the Solaris Media Kit, in the Solaris_10/ExtraValue/CoBundled/SunVTS_X.X directory. For information about downloading SunVTS from the Sun Download Center, refer to the Sun Hardware Platform Guide for the Solaris release you are using.
To find out more about using SunVTS, refer to the SunVTS documentation that corresponds to the Solaris release that you are running.
Viewing SunVTS Documentation
The SunVTS documents are accessible in the Solaris on Sun Hardware documentation collection at http://docs.sun.com.
For further information, you can also consult the following SunVTS documents:
- SunVTS User's Guide describes how to install, configure, and run the SunVTS diagnostic software.
- SunVTS Quick Reference Card provides an overview of how to use the SunVTS graphical user interface.
- SunVTS Test Reference Manual for SPARC Platforms provides details about each individual SunVTS test.
About Sun Management Center
Sun Management Center software provides enterprise-wide monitoring of Sun servers and workstations, including their subsystems, components, and peripheral devices. The system being monitored must be up and running, and you need to install all the proper software components on various systems in your network.
Sun Management Center enables you to monitor the following on the Sun Fire V445 server.
TABLE 8-38 What Sun Management Center Monitors
Item Monitored | What Sun Management Center Monitors
Disk drives | Status
Fans | Status
CPUs | Temperature and any thermal warning or failure conditions
Power supply | Status
System temperature | Temperature and any thermal warning or failure conditions
Sun Management Center software extends and enhances the management capability of Sun's hardware and software products.
TABLE 8-39 Sun Management Center Features
Feature | Description
System management | Monitors and manages the system at the hardware and operating system levels. Monitored hardware includes boards, tapes, power supplies, and disks.
Operating system management | Monitors and manages operating system parameters including load, resource usage, disk space, and network statistics.
Application and business system management | Provides technology to monitor business applications such as trading systems, accounting systems, inventory systems, and real-time control systems.
Scalability | Provides an open, scalable, and flexible solution to configure and manage multiple management administrative domains (consisting of many systems) spanning an enterprise. The software can be configured and used in a centralized or distributed fashion by multiple users.
Sun Management Center software is geared primarily toward system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a more modest installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.
The servers being monitored must be up and running if you want to use Sun Management Center, since this tool relies on the Solaris OS. For instructions on using this tool to monitor a Sun Fire V445 server, see Chapter 8.
How Sun Management Center Works
Sun Management Center consists of three components:
You install agents on systems to be monitored. The agents collect system status information from log files, device trees, and platform-specific sources, and report that data to the server component.
The server component maintains a large database of status information for a wide range of Sun platforms. This database is updated frequently, and includes information about boards, tapes, power supplies, and disks as well as OS parameters like load, resource usage, and disk space. You can create alarm thresholds and be notified when these are exceeded.
The monitor components present the collected data to you in a standard format. Sun Management Center software provides both a standalone Java application and a web browser-based interface. The Java interface affords physical and logical views of the system for highly intuitive monitoring.
Using Sun Management Center
Sun Management Center software is aimed at system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a smaller installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.
The servers to be monitored must be up and running, since Sun Management Center relies on the Solaris OS for its operation.
For detailed instructions, see the Sun Management Center Software User's Guide.
Other Sun Management Center Features
Sun Management Center software provides you with additional tools, which can operate with management utilities made by other companies.
The tools are an informal tracking mechanism and the optional Hardware Diagnostic Suite add-on.
Informal Tracking
Sun Management Center agent software must be loaded on any system you want to monitor. However, the product enables you to informally track a supported platform even when the agent software has not been installed on it. In this case, you do not have full monitoring capability, but you can add the system to your browser, have Sun Management Center periodically check whether it is up and running, and notify you if it goes out of commission.
Hardware Diagnostic Suite
The Hardware Diagnostic Suite is a package that you can purchase as an add-on to Sun Management Center. The suite enables you to exercise a system while it is still up and running in a production environment. See Hardware Diagnostic Suite for more information.
Interoperability With Third-Party Monitoring Tools
If you administer a heterogeneous network and use a third-party network-based system monitoring or management tool, you might be able to take advantage of Sun Management Center software's support for Tivoli Enterprise Console, BMC Patrol, and HP Openview.
Obtaining the Latest Information
For the latest information about this product, go to the Sun Management Center web site: http://www.sun.com/sunmanagementcenter
Hardware Diagnostic Suite
Sun Management Center features an optional Hardware Diagnostic Suite, which you can purchase as an add-on. The Hardware Diagnostic Suite is designed to exercise a production system by running tests sequentially.
Sequential testing means the Hardware Diagnostic Suite has a low impact on the system. Unlike SunVTS, which stresses a system by consuming its resources with many parallel tests (see About SunVTS), the Hardware Diagnostic Suite lets the server run other applications while testing proceeds.
When to Run Hardware Diagnostic Suite
The best use of the Hardware Diagnostic Suite is to disclose a suspected or intermittent problem with a noncritical part on an otherwise functioning machine. Examples might include questionable disk drives or memory modules on a machine that has ample or redundant disk and memory resources.
In cases like these, the Hardware Diagnostic Suite runs unobtrusively until it identifies the source of the problem. The machine under test can be kept in production mode until and unless it must be shut down for repair. If the faulty part is hot-pluggable or hot-swappable, the entire diagnose-and-repair cycle can be completed with minimal impact to system users.
Requirements for Using Hardware Diagnostic Suite
Since it is a part of Sun Management Center, you can only run Hardware Diagnostic Suite if you have set up your data center to run Sun Management Center. This means you have to dedicate a master server to run the Sun Management Center server software that supports Sun Management Center software's database of platform status information. In addition, you must install and set up Sun Management Center agent software on the systems to be monitored. Finally, you need to install the console portion of Sun Management Center software, which serves as your interface to the Hardware Diagnostic Suite.
Instructions for setting up Sun Management Center, as well as for using the Hardware Diagnostic Suite, can be found in the Sun Management Center Software User's Guide.
Sun Fire V445 Server Administration Guide, 819-3741-13
Copyright © 2007, Sun Microsystems, Inc. All Rights Reserved.