CHAPTER 8

Diagnostics

This chapter describes the diagnostic tools available for the Sun Fire V445 server.

Topics in this chapter include:

• Diagnostic Tools Overview
• About Sun Advanced Lights-Out Manager 1.0 (ALOM)
• About Status Indicators
• About POST Diagnostics
• OpenBoot PROM Enhancements for Diagnostic Operation
• OpenBoot Diagnostics
• About OpenBoot Commands
Diagnostic Tools Overview

Sun provides a range of diagnostic tools for use with the Sun Fire V445 server.

The diagnostic tools are summarized in TABLE 8-1.


TABLE 8-1 Summary of Diagnostic Tools

ALOM system controller
  Type:               Hardware and software
  What it does:       Monitors environmental conditions, performs basic fault
                      isolation, and provides remote console access
  Availability:       Can function on standby power and without the OS
  Remote capability:  Designed for remote access

LED indicators
  Type:               Hardware
  What it does:       Indicate status of the overall system and particular
                      components
  Availability:       Accessed from the system chassis; available anytime
                      power is available
  Remote capability:  Local, but can be viewed with the ALOM system console

POST
  Type:               Firmware
  What it does:       Tests core components of the system
  Availability:       Runs automatically on startup; available when the OS is
                      not running
  Remote capability:  Local, but can be viewed with the ALOM system controller

OpenBoot Diagnostics
  Type:               Firmware
  What it does:       Tests system components, focusing on peripherals and I/O
                      devices
  Availability:       Runs automatically or interactively; available when the
                      OS is not running
  Remote capability:  Local, but can be viewed with the ALOM system controller

OpenBoot commands
  Type:               Firmware
  What it does:       Display various kinds of system information
  Availability:       Available when the OS is not running
  Remote capability:  Local, but can be accessed with the ALOM system
                      controller

Solaris 10 Predictive Self-Healing
  Type:               Software
  What it does:       Monitors system errors, and reports and disables faulty
                      hardware
  Availability:       Runs in the background when the OS is running
  Remote capability:  Local, but can be accessed with the ALOM system
                      controller

Traditional Solaris OS commands
  Type:               Software
  What it does:       Display various kinds of system information
  Availability:       Requires the OS
  Remote capability:  Local, but can be accessed with the ALOM system
                      controller

SunVTS
  Type:               Software
  What it does:       Exercises and stresses the system, running tests in
                      parallel
  Availability:       Requires the OS; optional package that needs to be
                      installed separately
  Remote capability:  View and control over a network

Sun Management Center
  Type:               Software
  What it does:       Monitors both hardware environmental conditions and
                      software performance of multiple machines; generates
                      alerts for various conditions
  Availability:       Requires the OS to be running on both monitored and
                      master servers; requires a dedicated database on the
                      master server
  Remote capability:  Designed for remote access

Hardware Diagnostic Suite
  Type:               Software
  What it does:       Exercises an operational system by running sequential
                      tests; also reports failed FRUs
  Availability:       Separately purchased optional add-on to Sun Management
                      Center; requires the OS and Sun Management Center
  Remote capability:  Designed for remote access



About Sun Advanced Lights-Out Manager 1.0 (ALOM)

The Sun Fire V445 server ships with Sun Advanced Lights-Out Manager (ALOM) 1.0 installed. The system console is directed to ALOM by default and is configured to show server console information on startup.

ALOM enables you to monitor and control your server over either a serial connection (using the SERIAL MGT port), or Ethernet connection (using the NET MGT port). For information on configuring an Ethernet connection, refer to the ALOM Online Help.



Note - The ALOM serial port, labeled SERIAL MGT, is for server management only. If you need a general-purpose serial port, use the serial port labeled TTYB.


ALOM can send email notification of hardware failures and other events related to the server or to ALOM.

The ALOM circuitry uses standby power from the server. This means that ALOM is active as soon as the server is connected to a power source, and it remains active until power is removed by unplugging the power cables.

See TABLE 8-2 for a list of the components monitored by ALOM and the information it provides for each.


TABLE 8-2 What ALOM Monitors

Component                     Information
Hard disk drives              Presence and status
System and CPU fans           Speed and status
CPUs                          Presence, temperature, and any thermal warning or failure conditions
Power supplies                Presence and status
System temperature            Ambient temperature and any thermal warning or failure conditions
Server front panel            Status indicator
Voltage                       Status and thresholds
SAS and USB circuit breakers  Status


ALOM Management Ports

The default management port is labeled SERIAL MGT. This port uses an RJ-45 connector and is for server management only - it supports only ASCII connections to an external console. Use this port when you first begin to operate the server.

Another serial port - labeled TTYB - is available for general-purpose serial data transfer. This port uses a DB-9 connector. For information on pinouts, refer to the Sun Fire V445 Server Installation Guide.

In addition, the server has one 10BASE-T Ethernet management interface, labeled NET MGT. To use this port, ALOM configuration is required. For more information, see the ALOM Online Help.

Setting the admin Password for ALOM

When you switch to the ALOM prompt after initial power-on, you will be logged in as the admin user and prompted to set a password. You must set this password in order to execute certain commands.

If you are prompted to do so, set a password for the admin user.

The password must:

Once the password is set, the admin user has full permissions and can execute all ALOM CLI commands.

 

Basic ALOM Functions

This section covers some basic ALOM functions. For comprehensive documentation, refer to the ALOM Online Help.


procedure icon  To Switch to the ALOM Prompt

single-step bullet  Type the default keystroke sequence:


TABLE 8-3
#.



Note - When you switch to the ALOM prompt, you will be logged in with the userid admin. See Setting the admin Password for ALOM.



procedure icon  To Switch to the Server Console Prompt

single-step bullet  Type:


TABLE 8-4
sc> console

More than one ALOM user can be connected to the server console stream at a time, but only one user is permitted to type input characters to the console.

If another user is logged on and has write capability, you will see the message below after issuing the console command:


TABLE 8-5
sc> Console session already in use. [view mode]

To take console write capability away from another user, type:


TABLE 8-6
sc> console -f

 


About Status Indicators

For a summary of the server's LED status indicators, see Front Panel Indicators and Back Panel Indicators.


About POST Diagnostics

POST is a firmware program that is useful in determining if a portion of the system has failed. POST verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O devices, and generates messages that can determine the nature of a hardware failure. POST can be run even if the system is unable to boot.

POST detects CPU and memory subsystem faults and is located in a SEEPROM on the MBC (ALOM) board. POST can be set to run by the OpenBoot program at power-on by setting three configuration variables: diag-switch?, diag-trigger, and diag-level.

POST runs automatically when the system power is applied, or following a noncritical error reset, if all of the following conditions apply:

If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively. If diag-level is set to menus, a menu of all the tests executed at power-up is displayed. POST diagnostic and error message reports are displayed on a console.

For information on starting and controlling POST diagnostics, see About the post Command.


OpenBoot PROM Enhancements for Diagnostic Operation

This section describes the diagnostic operation enhancements provided by OpenBoot PROM Version 4.15 and later and presents information about how to use the resulting new operational features. Note that the behavior of certain operational features on your system might differ from the behavior described in this section.

What's New in Diagnostic Operation

Diagnostic operation includes the following enhancements:

About the New and Redefined Configuration Variables

New and redefined configuration variables simplify diagnostic operation and provide you with more control over the amount of diagnostic output. The following list summarizes the configuration variable changes. See TABLE 8-7 for complete descriptions of the variables.

About the Default Configuration

The new standard (default) configuration runs diagnostic tests and enables full ASR capabilities during power-on and after the occurrence of an error reset (RED State Exception Reset, CPU Watchdog Reset, System Watchdog Reset, Software-Instruction Reset, or Hardware Fatal Reset). This is a change from the previous default configuration, which did not run diagnostic tests. When you power on your system for the first time, the change will be visible to you through the increased boot time and the display of approximately two screens of diagnostic output produced by POST and OpenBoot Diagnostics.



Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).


The visible changes are due to the default settings of two configuration variables, diag-level (max) and verbosity (normal):

After initial power-on, you can customize the standard (default) configuration by setting the configuration variables to define a "normal mode" of operation that is appropriate for your production environment. TABLE 8-7 lists and describes the defaults and keywords of the OpenBoot configuration variables that control diagnostic testing and ASR capabilities. These are the variables you will set to define your normal mode of operation.



Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.



TABLE 8-7 OpenBoot Configuration Variables That Control Diagnostic Testing and Automatic System Restoration

OpenBoot Configuration Variable

Description and Keywords

auto-boot?

Determines whether the system automatically boots. Default is true.

  • true - System automatically boots after initialization, provided no firmware-based (diagnostics or OpenBoot) errors are detected.
  • false - System remains at the ok prompt until you type boot.

auto-boot-on-error?

Determines whether the system attempts a degraded boot after a nonfatal error. Default is true.

  • true - System automatically boots after a nonfatal error if the variable
    auto-boot? is also set to true.
  • false - System remains at the ok prompt.

boot-device

Specifies the name of the default boot device, which is also the normal mode boot device.

boot-file

Specifies the default boot arguments, which are also the normal mode boot arguments.

diag-device

Specifies the name of the boot device that is used when diag-switch? is true.

diag-file

Specifies the boot arguments that are used when diag-switch? is true.

diag-level

Specifies the level or type of diagnostics that are executed. Default is max.

  • off - No testing.
  • min - Basic tests are run.
  • max - More extensive tests might be run, depending on the device. Memory is extensively checked.

diag-out-console

Redirects system console output to the system controller.

  • true - Redirects output to the system controller.
  • false - Restores output to the local console.

Note: See your system documentation for information about redirecting system console output to the system controller. (Not all systems are equipped with a system controller.)

diag-passes

Specifies the number of consecutive executions of OpenBoot Diagnostics self-tests that are run from the OpenBoot Diagnostics (obdiag) menu. Default is 1.

Note: diag-passes applies only to systems with firmware that contains OpenBoot Diagnostics and has no effect outside the OpenBoot Diagnostics menu.

diag-script

Determines which devices are tested by OpenBoot Diagnostics. Default is normal.

  • none - OpenBoot Diagnostics do not run.
  • normal - Tests all devices that are expected to be present in the system's baseline configuration for which self-tests exist.
  • all - Tests all devices that have self-tests.

diag-switch?

Controls diagnostic execution in normal mode. Default is false.

For servers:

  • true - Diagnostics are executed only on power-on reset events, but the level of test coverage, verbosity, and output is determined by user-defined settings.
  • false - Diagnostics are executed upon the next system reset, but only for the classes of reset events specified by the OpenBoot configuration variable
    diag-trigger. The level of test coverage, verbosity, and output is determined by user-defined settings.

For workstations:

  • true - Diagnostics are executed only on power-on reset events, but the level of test coverage, verbosity, and output is determined by user-defined settings.
  • false - Diagnostics are disabled.

diag-trigger

Specifies the class of reset event that causes diagnostics to run automatically. Default setting is power-on-reset error-reset.

  • none - Diagnostic tests are not executed.
  • error-reset - Reset that is caused by certain hardware error events such as RED State Exception Reset, Watchdog Resets, Software-Instruction Reset, or Hardware Fatal Reset.
  • power-on-reset - Reset that is caused by power cycling the system.
  • user-reset - Reset that is initiated by an OS panic or by user-initiated commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).
  • all-resets - Any kind of system reset.

Note: Both POST and OpenBoot Diagnostics run at the specified reset event if the variable diag-script is set to normal or all. If diag-script is set to none, only POST runs.

error-reset-recovery

Specifies recovery action after an error reset. Default is sync.

  • none - No recovery action.
  • boot - System attempts to boot.
  • sync - Firmware attempts to execute a Solaris sync callback routine.

service-mode?

Controls whether the system is in service mode. Default is false.

  • true - Service mode. Diagnostics are executed at Sun-specified levels, overriding but preserving user settings.
  • false - Normal mode. Diagnostics execution depends entirely on the settings of diag-switch? and other user-defined OpenBoot configuration variables.

test-args

Customizes OpenBoot Diagnostics tests. Allows a text string of reserved keywords (separated by commas) to be specified in the following ways:

  • As an argument to the test command at the ok prompt.
  • As an OpenBoot variable to the setenv command at the ok or obdiag prompt.

Note: The variable test-args applies only to systems with firmware that contains OpenBoot Diagnostics. See your system documentation for a list of keywords.

verbosity

Controls the amount and detail of OpenBoot, POST, and OpenBoot Diagnostics output.
Default is normal.

  • none - Only error and fatal messages are displayed on the system console. Banner is not displayed.
    Note: Problems in systems with verbosity set to none might be deemed not diagnosable, rendering the system unserviceable by Sun.
  • min - Notice, error, warning, and fatal messages are displayed on the system console. Transitional states and banner are also displayed.
  • normal - Summary progress and operational messages are displayed on the system console in addition to the messages displayed by the min setting. The work-in-progress indicator shows the status and progress of the boot sequence.
  • max - Detailed progress and operational messages are displayed on the system console in addition to the messages displayed by the min and normal settings.
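The verbosity keywords above form a simple ordered filter. The following sketch (illustrative Python, not Sun firmware code) models how each setting gates which message kinds reach the system console, per the descriptions in TABLE 8-7:

```python
# Minimal model of verbosity gating: error and fatal messages
# always appear; warnings and notices appear at min and above;
# summary progress messages at normal and above; detailed
# progress messages only at max. Names here are illustrative.
LEVELS = {"none": 0, "min": 1, "normal": 2, "max": 3}

MESSAGE_THRESHOLD = {
    "fatal": 0,      # shown at every setting
    "error": 0,
    "warning": 1,    # min and above
    "notice": 1,
    "summary": 2,    # normal and above
    "detail": 3,     # max only
}

def is_shown(message_kind, verbosity):
    """Return True if a message of this kind reaches the console."""
    return LEVELS[verbosity] >= MESSAGE_THRESHOLD[message_kind]

print(is_shown("summary", "min"))   # → False
print(is_shown("error", "none"))    # → True
```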

About Service Mode

Service mode is an operational mode defined by Sun that facilitates fault isolation and recovery of systems that appear to be nonfunctional. When initiated, service mode overrides the settings of key OpenBoot configuration variables.

Note that service mode does not change your stored settings. After initialization (at the ok prompt), all OpenBoot PROM configuration variables revert to the user-defined settings. In this way, you or your service provider can quickly invoke a known and maximum level of diagnostics and still preserve your normal mode settings.

TABLE 8-8 lists the OpenBoot configuration variables that are affected by service mode and the overrides that are applied when you select service mode.


TABLE 8-8 Service Mode Overrides

OpenBoot Configuration Variable    Service Mode Override
auto-boot?                         false
diag-level                         max
diag-trigger                       power-on-reset error-reset user-reset
input-device                       Factory default
output-device                      Factory default
verbosity                          max

The following apply only to systems with firmware that contains OpenBoot Diagnostics:

diag-script                        normal
test-args                          subtests,verbose
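Because service mode overrides but does not change the stored settings, it can be pictured as layering fixed values on top of the user configuration. A minimal sketch (illustrative names, not OpenBoot firmware code) of this non-destructive layering:

```python
# Service mode applies the fixed overrides of TABLE 8-8 on top
# of the user's stored settings without modifying them, so
# normal mode still sees the original values afterward.
SERVICE_MODE_OVERRIDES = {
    "auto-boot?": "false",
    "diag-level": "max",
    "diag-trigger": "power-on-reset error-reset user-reset",
    "verbosity": "max",
}

def effective_config(stored, service_mode):
    """Return the settings in effect; stored settings are never mutated."""
    return {**stored, **SERVICE_MODE_OVERRIDES} if service_mode else dict(stored)

user_settings = {"auto-boot?": "true", "diag-level": "min", "verbosity": "normal"}
print(effective_config(user_settings, True)["diag-level"])  # → max
print(user_settings["diag-level"])  # stored value untouched → min
```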


About Initiating Service Mode

Enhancements provide a software mechanism for specifying service mode:

service-mode? configuration variable - When set to true, initiates service mode. (Service mode should be used only by authorized Sun service providers.)



Note - The diag-switch? configuration variable should remain at the default setting (false) for normal operation. To specify diagnostic testing for your OS, see To Initiate Normal Mode.


For instructions, see To Initiate Service Mode.

About Overriding Service Mode Settings

When the system is in service mode, three commands can override service mode settings. TABLE 8-9 describes the effect of each command.


TABLE 8-9 Scenarios for Overriding Service Mode Settings

Command             Issued From        What It Does
post                ok prompt          OpenBoot firmware forces a one-time
                                       execution of normal mode diagnostics.
bootmode diag       system controller  OpenBoot firmware overrides service mode
                                       settings and forces a one-time execution
                                       of normal mode diagnostics.(1)
bootmode skip_diag  system controller  OpenBoot firmware suppresses service
                                       mode and bypasses all firmware
                                       diagnostics.(1)

1 - If the system is not reset within 10 minutes of issuing the bootmode system controller command, the command is cleared.

Note - Not all systems are equipped with a system controller.


About Normal Mode

Normal mode is the customized operational mode that you define for your environment. To define normal mode, set the values of the OpenBoot configuration variables that control diagnostic testing. See TABLE 8-7 for the list of variables that control diagnostic testing.



Note - The standard (default) configuration is recommended for improved fault isolation and system restoration, and for increased system availability.


When you are deciding whether to enable diagnostic testing in your normal environment, remember that you should always run diagnostics to troubleshoot an existing problem or after the following events:

About Initiating Normal Mode

If you define normal mode for your environment, you can specify normal mode with the following method:

System controller bootmode diag command - When you issue this command, it specifies normal mode with the configuration values defined by you - with the following exceptions:



Note - The next reset cycle must occur within 10 minutes of issuing the
bootmode diag command or the bootmode command is cleared and normal mode is not initiated.


For instructions, see To Initiate Normal Mode.

About the post Command

The post command enables you to easily invoke POST diagnostics and to control the level of testing and the amount of output. When you issue the post command, OpenBoot firmware performs the following actions:



Note - The post command overrides service mode settings and pending system controller bootmode diag and bootmode skip_diag commands.


The syntax for the post command is:

post [level [verbosity]]

where:

The level and verbosity options provide the same functions as the OpenBoot configuration variables diag-level and verbosity. To determine which settings you should use for the post command options, see TABLE 8-7 for descriptions of the keywords for diag-level and verbosity.

You can specify settings for:

If you specify a setting for level only, the post command uses the normal mode value for verbosity with the following exception:

If you specify settings for neither level nor verbosity, the post command uses the normal mode values you specified for the configuration variables,
diag-level and verbosity, with two exceptions:


procedure icon  To Initiate Service Mode

For background information, see About Service Mode.

1. Set the service-mode? variable. At the ok prompt, type:


ok setenv service-mode? true

For service mode to take effect, you must reset the system.

2. At the ok prompt, type:


ok reset-all


procedure icon  To Initiate Normal Mode

For background information, see About Normal Mode.

1. At the ok prompt, type:


ok setenv service-mode? false

The system will not actually enter normal mode until the next reset.

2. Type:


ok reset-all

Reference for Estimating System Boot Time (to the ok Prompt)



Note - The standard (default) configuration does not increase system boot time after a reset that is initiated by user commands from OpenBoot (reset-all or boot) or from Solaris (reboot, shutdown, or init).


The measurement of system boot time begins when you power on (or reset) the system and ends when the OpenBoot ok prompt appears. During the boot time period, the firmware executes diagnostics (POST and OpenBoot Diagnostics) and performs OpenBoot initialization. The time required to run OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization is generally similar for all systems, although it varies with the number of I/O cards installed when diag-script is set to all. However, at the default settings (diag-level = max and verbosity = normal), POST executes extensive memory tests, which increase system boot time.

System boot time will vary from system to system, depending on the configuration of system memory and the number of CPUs:

If you need to know the approximate boot time of your new system before you power on for the first time, the following sections describe two methods you can use to estimate boot time:

Boot Time Estimates for Typical Configurations

The following are three typical configurations and the approximate boot time you can expect for each:

Estimating Boot Time for Your System

Generally, for systems configured with default settings, the times required to execute OpenBoot Diagnostics and to perform OpenBoot setup, configuration, and initialization are the same for all systems:

To estimate the time required to run POST memory tests, you need to know the amount of memory associated with the most populated CPU. To estimate the time required to run POST CPU tests, you need to know the number of CPUs. Use the following guidelines to estimate memory and CPU test times:

The following example shows how to estimate the system boot time of a sample configuration consisting of 4 CPUs and 32 Gbytes of system memory, with 8 Gbytes of memory on the most populated CPU.


This figure shows the calculation for estimating system boot time for a sample configuration.
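The estimate above can be sketched as a simple sum: a roughly fixed OpenBoot setup time, plus POST memory-test time scaling with the memory on the most populated CPU, plus POST CPU-test time scaling with the number of CPUs. The per-unit rates in this sketch are assumed placeholders, not figures from this manual; substitute the guideline rates for your hardware.

```python
# Illustrative boot-time estimate. All three rate parameters are
# ASSUMPTIONS for demonstration, not values from the Sun manual.
def estimate_boot_seconds(num_cpus, mem_on_most_populated_cpu_gb,
                          openboot_overhead=60,  # assumed fixed setup time (s)
                          secs_per_gb=30,        # assumed POST memory-test rate
                          secs_per_cpu=15):      # assumed POST CPU-test rate
    return (openboot_overhead
            + mem_on_most_populated_cpu_gb * secs_per_gb
            + num_cpus * secs_per_cpu)

# Sample configuration from the text: 4 CPUs, 32 GB total memory,
# 8 GB on the most populated CPU.
print(estimate_boot_seconds(4, 8))  # → 360 with these placeholder rates
```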

Reference for Sample Outputs

At the default setting of verbosity = normal, POST and OpenBoot Diagnostics generate less diagnostic output (about 2 pages) than was produced before the OpenBoot PROM enhancements (over 10 pages). This section includes output samples for verbosity settings at min and normal.



Note - The diag-level configuration variable also affects how much output the system generates. The following samples were produced with diag-level set to max, the default setting.


The following sample shows the firmware output after a power reset when verbosity is set to min. At this verbosity setting, OpenBoot firmware displays notice, error, warning, and fatal messages but does not display progress or operational messages. Transitional states and the power-on banner are also displayed. Since no error conditions were encountered, this sample shows only the POST execution message, the system's install banner, and the device self-tests conducted by OpenBoot Diagnostics.

 


Executing POST w/%o0 = 0000.0400.0101.2041
Sun Fire V445, Keyboard Present
Copyright 1998-2006 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok 

The following sample shows the diagnostic output after a power reset when verbosity is set to normal, the default setting. At this verbosity setting, the OpenBoot firmware displays summary progress or operational messages in addition to the notice, error, warning, and fatal messages; transitional states; and install banner displayed by the min setting. On the console, the work-in-progress indicator shows the status and progress of the boot sequence.

 


Sun Fire V445, Keyboard Present
Copyright 1998-2004 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.15.0, 4096 MB memory installed, Serial #12980804.
Ethernet address 8:0:20:c6:12:44, Host ID: 80c61244.
Running diagnostic script obdiag/normal
Testing /pci@8,600000/network@1
Testing /pci@8,600000/SUNW,qlc@2
Testing /pci@9,700000/ebus@1/i2c@1,2e
Testing /pci@9,700000/ebus@1/i2c@1,30
Testing /pci@9,700000/ebus@1/i2c@1,50002e
Testing /pci@9,700000/ebus@1/i2c@1,500030
Testing /pci@9,700000/ebus@1/bbc@1,0
Testing /pci@9,700000/ebus@1/bbc@1,500000
Testing /pci@8,700000/scsi@1
Testing /pci@9,700000/network@1,1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1/gpio@1,300600
Testing /pci@9,700000/ebus@1/pmc@1,300700
Testing /pci@9,700000/ebus@1/rtc@1,300070
{7} ok 

Reference for Determining Diagnostic Mode

The flowchart in FIGURE 8-7 summarizes graphically how various system controller and OpenBoot variables affect whether a system boots in normal or service mode, as well as whether any overrides occur.
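The flowchart's decision logic can be approximated in a short sketch. This is a simplification under stated assumptions: the real firmware also honors pending bootmode overrides from the system controller, and the function and variable names here are illustrative.

```python
# Simplified model of the diagnostic-mode decision summarized in
# FIGURE 8-7, for servers: service mode forces diagnostics at
# Sun-specified levels; with diag-switch? true, diagnostics run
# only on power-on resets; otherwise diag-trigger determines
# which reset classes run diagnostics.
def diagnostics_run(service_mode, diag_switch, diag_trigger, reset_class):
    if service_mode:
        return True                              # Sun-specified maximum diagnostics
    if diag_switch:
        return reset_class == "power-on-reset"   # servers: power-on resets only
    if "all-resets" in diag_trigger:
        return True
    return reset_class in diag_trigger           # normal mode: user-defined triggers

# Default configuration: power-on and error resets trigger
# diagnostics; user-initiated resets do not.
default_trigger = {"power-on-reset", "error-reset"}
print(diagnostics_run(False, False, default_trigger, "user-reset"))     # → False
print(diagnostics_run(False, False, default_trigger, "power-on-reset")) # → True
```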


CODE EXAMPLE 8-1
{3} ok post
SC Alert: Host System has Reset
 
 
Executing Power On Self Test
Q#0>
0>@(#)Sun Fire[TM] V445 POST 4.22.11 2006/06/12 15:10
        /export/delivery/delivery/4.22/4.22.11/post4.22.x/Fiesta/boston/integrated  (root)
0>Copyright © 2006 Sun Microsystems, Inc. All rights reserved
   SUN PROPRIETARY/CONFIDENTIAL.
   Use is subject to license terms.
0>OBP->POST Call with %o0=00000800.01012000.
0>Diag level set to MIN.
0>Verbosity level set to NORMAL.
0>Start Selftest.....
0>CPUs present in system: 0 1 2 3
0>Test CPU(s)....Done
0>Interrupt Crosscall....Done
0>Init Memory....|
SC Alert: Host System has Reset
'Done
0>PLL Reset....Done
0>Init Memory....Done
0>Test Memory....Done
0>IO-Bridge Tests....Done
0>INFO:
0>    POST Passed all devices.
0>
0>POST:    Return to OBP.
 
SC Alert: Host System has Reset
 
Configuring system memory & CPU(s)
 
Probing system devices
Probing memory
Probing I/O buses
screen not found.
keyboard not found.
Keyboard not present.  Using ttya for input and output.
Probing system devices
Probing memory
Probing I/O buses
 
 
Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.11, 24576 MB memory installed, Serial #64548465.
Ethernet address 0:3:ba:d8:ee:71, Host ID: 83d8ee71.


This flowchart depicts how various OpenBoot configuration variables affect the diagnostic mode.

FIGURE 8-7 Diagnostic Mode Flowchart

Quick Reference for Diagnostic Operation

TABLE 8-10 summarizes the effects of the following user actions on diagnostic operation:


OpenBoot Diagnostics

Like POST diagnostics, OpenBoot Diagnostics code is firmware-based and resides in the boot PROM.


procedure icon  To Start OpenBoot Diagnostics

1. Type:


TABLE 8-11
ok setenv diag-switch? true
ok setenv auto-boot? false
ok reset-all

2. Type:


TABLE 8-12
ok obdiag

This command displays the OpenBoot Diagnostics menu. See TABLE 8-13.


TABLE 8-13 Sample obdiag Menu

obdiag

1 LSILogic,sas@1     4 rmc-comm@0,c28000   7 serial@3,fffff8
2 flashprom@0,0      5 rtc@0,70
3 network@0          6 serial@0,c2c000

Commands: test test-all except help what setenv set-default exit

diag-passes=1 diag-level=min test-args=args




Note - If you have a PCI card installed in the server, then additional tests will appear on the obdiag menu.


3. Type:


TABLE 8-14
obdiag> test n

where n represents the number corresponding to the test you want to run.

A summary of the tests is available. At the obdiag> prompt, type:


TABLE 8-15
obdiag> help

4. To run all tests, type:


TABLE 8-16
obdiag> test-all
Hit the spacebar to interrupt testing
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1 ......... passed
Testing /ebus@1f,464000/flashprom@0,0 ................................. passed
Testing /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0 Internal loopback test -- succeeded.
Link is  -- up
........ passed
Testing /ebus@1f,464000/rmc-comm@0,c28000 ............................. passed
Testing /pci@1f,700000/pci@0/pci@1/pci@0/isa@1e/rtc@0,70 .............. passed
Testing /ebus@1f,464000/serial@0,c2c000 ............................... passed
Testing /ebus@1f,464000/serial@3,fffff8 ............................... passed
Pass:1 (of 1) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:1:1
 
 
Hit any key to return to the main menu



Note - From the obdiag prompt you can select a device from the list and test it. However, at the ok prompt you need to use the full device path. In addition, the device needs to have a self-test method; otherwise, errors result.


Controlling OpenBoot Diagnostics Tests

Most of the OpenBoot configuration variables you use to control POST (see TABLE 8-7) also affect OpenBoot Diagnostics tests.

By default, test-args is set to contain an empty string. You can modify test-args using one or more of the reserved keywords shown in TABLE 8-17.


TABLE 8-17 Keywords for the test-args OpenBoot Configuration Variable

Keyword     What It Does
bist        Invokes built-in self-test (BIST) on external and peripheral devices
debug       Displays all debug messages
iopath      Verifies bus/interconnect integrity
loopback    Exercises external loopback path for the device
media       Verifies external and peripheral device media accessibility
restore     Attempts to restore original state of the device if the previous
            execution of the test failed
silent      Displays only errors rather than the status of each test
subtests    Displays main test and each subtest that is called
verbose     Displays detailed messages of status of all tests
callers=N   Displays backtrace of N callers when an error occurs
            • callers=0 - displays backtrace of all callers before the error
errors=N    Continues executing the test until N errors are encountered
            • errors=0 - displays all error reports without terminating testing

If you want to make multiple customizations to the OpenBoot Diagnostics testing, you can set test-args to a comma-separated list of keywords, as in this example:


TABLE 8-18
ok setenv test-args debug,loopback,media

test and test-all Commands

You can also run OpenBoot Diagnostics tests directly from the ok prompt. To do this, type the test command, followed by the full hardware path of the device (or set of devices) to be tested. For example:


TABLE 8-19
ok test /pci@x,y/SUNW,qlc@2



Note - Knowing how to construct an appropriate hardware device path requires precise knowledge of the hardware architecture of the Sun Fire V445 system.


To customize an individual test, you can use test-args as follows:


TABLE 8-20
ok test /usb@1,3:test-args={verbose,debug}

This affects only the current test without changing the value of the test-args OpenBoot configuration variable.

You can test all the devices in the device tree with the test-all command:


TABLE 8-21
ok test-all

If you specify a path argument to test-all, then only the specified device and its children are tested. The following example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:


ok test-all /pci@9,700000/usb@1,3

OpenBoot Diagnostics Error Messages

OpenBoot Diagnostics error results are reported in a tabular format that contains a short summary of the problem, the hardware device affected, the subtest that failed, and other diagnostic information. The following example displays a sample OpenBoot Diagnostics error message.


CODE EXAMPLE 8-2 OpenBoot Diagnostics Error Message
Testing /pci@1e,600000/isa@7/flashprom@2,0
 
    ERROR   : There is no POST in this FLASHPROM or POST header is 
unrecognized
    DEVICE  : /pci@1e,600000/isa@7/flashprom@2,0
    SUBTEST : selftest:crc-subtest
    MACHINE : Sun Fire V445
    SERIAL# : 51347798
    DATE    : 03/05/2003 15:17:31  GMT
    CONTROLS: diag-level=max test-args=errors=1
 
Error: /pci@1e,600000/isa@7/flashprom@2,0 selftest failed, return code = 1
Selftest at /pci@1e,600000/isa@7/flashprom@2,0 (errors=1) ............. 
failed
Pass:1 (of 1) Errors:1 (of 1) Tests Failed:1 Elapsed Time: 0:0:0:1


About OpenBoot Commands

OpenBoot commands are commands you type from the ok prompt. The OpenBoot commands that can provide useful diagnostic information are probe-scsi-all, probe-ide, and show-devs, each described in the sections that follow.

probe-scsi-all

The probe-scsi-all command diagnoses problems with the SAS devices.



caution icon Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-scsi-all command can hang the system.


The probe-scsi-all command communicates with all SAS devices connected to on-board SAS controllers and accesses devices connected to any host adapters installed in PCI slots.

For any SAS device that is connected and active, the probe-scsi-all command displays its loop ID, host adapter, logical unit number, unique World Wide Name (WWN), and a device description that includes type and manufacturer.

The following is sample output from the probe-scsi-all command.


CODE EXAMPLE 8-3 Sample probe-scsi-all Command Output
{3} ok probe-scsi-all
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
 
MPT Version 1.05, Firmware Version 1.08.04.00
 
Target 0
   Unit 0   Disk     SEAGATE ST973401LSUN72G 0356    143374738 Blocks, 73 GB
   SASAddress 5000c50000246b35  PhyNum 0
Target 1
   Unit 0   Disk     SEAGATE ST973401LSUN72G 0356    143374738 Blocks, 73 GB
   SASAddress 5000c50000246bc1  PhyNum 1
Target 4 Volume 0
   Unit 0   Disk     LSILOGICLogical Volume  3000    16515070 Blocks, 8455 MB
Target 6
   Unit 0   Disk     FUJITSU MAV2073RCSUN72G 0301    143374738 Blocks, 73 GB
   SASAddress 500000e0116a81c2  PhyNum 6
 
{3} ok

probe-ide

The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus. This is the internal system bus for media devices such as the DVD drive.



caution icon Caution - If you used the halt command or the Stop-A key sequence to reach the ok prompt, then issuing the probe-ide command can hang the system.


The following is sample output from the probe-ide command.


CODE EXAMPLE 8-4 Sample probe-ide Command Output
{1} ok probe-ide
  Device 0  ( Primary Master ) 
         Removable ATAPI Model: DV-28E-B                                
 
  Device 1  ( Primary Slave ) 
         Not Present
 
  Device 2  ( Secondary Master ) 
         Not Present
 
  Device 3  ( Secondary Slave ) 
         Not Present

show-devs

The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE EXAMPLE 8-5 shows some sample output.


CODE EXAMPLE 8-5 show-devs Command Output (Truncated)
/i2c@1f,520000
/ebus@1f,464000
/pci@1f,700000
/pci@1e,600000
/memory-controller@3,0
/SUNW,UltraSPARC-IIIi@3,0
/memory-controller@2,0
/SUNW,UltraSPARC-IIIi@2,0
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/i2c@1f,520000/cpu-fru-prom@0,e8
/i2c@1f,520000/dimm-spd@0,e6
/i2c@1f,520000/dimm-spd@0,e4
.
.
.
/pci@1f,700000/pci@0
/pci@1f,700000/pci@0/pci@9
/pci@1f,700000/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2
/pci@1f,700000/pci@0/pci@1
/pci@1f,700000/pci@0/pci@2/pci@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8
/pci@1f,700000/pci@0/pci@2/pci@0/network@4,1
/pci@1f,700000/pci@0/pci@2/pci@0/network@4
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/disk
/pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1/tape


procedure icon  To Run OpenBoot Commands

1. Halt the system to reach the ok prompt.

How you do this depends on the system's condition. If possible, you should warn users before you shut the system down.

2. Type the appropriate command at the console prompt.


About Predictive Self-Healing

In Solaris 10 systems, the Solaris Predictive Self-Healing (PSH) technology enables the Sun Fire V445 server to diagnose problems while the Solaris OS is running and to mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating it with data from previous errors and other related information to diagnose the problem. Once the problem is diagnosed, the fault manager daemon assigns it a Universally Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun's knowledge article database.
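Because the knowledge-article URL is embedded directly in the fault notification, the message ID can be recovered from a logged console line with a short script. The following is a hypothetical sketch, not part of any Sun tool; the sample line is modeled on the console message shown later in this chapter, and the variable names are illustrative.

```shell
# Hypothetical sketch: extract the Sun message ID (MSGID) from a PSH console
# message line so it can be looked up at http://www.sun.com/msg/.
# The sample line is modeled on the console message shown in this chapter.
msg='Jul  1 14:30:20 sunrise  Refer to http://sun.com/msg/SUN4-8000-0Y for more information.'

# The message ID is the last path component of the sun.com/msg URL.
msgid=$(printf '%s\n' "$msg" | sed -n 's|.*http://sun\.com/msg/\([A-Z0-9-]*\).*|\1|p')

echo "Knowledge article: http://www.sun.com/msg/$msgid"
```

On a live system the same extraction would be applied to lines captured from the console or from the messages file.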

The Predictive Self-Healing technology covers the following Sun Fire V445 server components:

The PSH console message provides the following information:

If the Solaris PSH facility has detected a faulty component, use the fmdump command (described in the following subsections) to identify the fault. Faulty FRUs are identified in fault messages using the FRU name.

Use the following web site to interpret faults and obtain information on a fault:

http://www.sun.com/msg/

This web site directs you to provide the message ID that your system displayed. The web site then provides knowledge articles about the fault and corrective action to resolve the fault. The fault information and documentation at this web site is updated regularly.

You can find more detailed descriptions of Solaris 10 Predictive Self-Healing at the following web site:

http://www.sun.com/bigadmin/features/articles/selfheal.html

Predictive Self-Healing Tools

In summary, the Solaris Fault Manager daemon (fmd) performs the following functions:

TABLE 8-23 shows a typical message generated when a fault occurs on your system. The message appears on your console and is recorded in the /var/adm/messages file.



Note - The messages in TABLE 8-23 indicate that the fault has already been diagnosed. Any corrective action that the system can perform has already taken place. If your server is still running, it continues to run.



TABLE 8-23 System Generated Predictive Self-Healing Message

Output Displayed

Description

Jul 1 14:30:20 sunrise EVENT-TIME: Tue Nov 1 16:30:20 PST 2005

EVENT-TIME: the time stamp of the diagnosis.

Jul 1 14:30:20 sunrise PLATFORM: SUNW,A70, CSN: -, HOSTNAME: sunrise

PLATFORM: A description of the system encountering the problem

Jul 1 14:30:20 sunrise SOURCE: eft, REV: 1.13

SOURCE: Information on the Diagnosis Engine used to determine the fault

Jul 1 14:30:20 sunrise EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4

EVENT-ID: The Universally Unique event ID (UUID) for this fault

Jul 1 14:30:20 sunrise DESC: A problem was detected in the PCI-Express subsystem

DESC: A basic description of the failure

Jul 1 14:30:20 sunrise  Refer to http://sun.com/msg/SUN4-8000-0Y for more information.

WEBSITE: Where to find specific information and actions for this fault

Jul 1 14:30:20 sunrise AUTO-RESPONSE: One or more device instances may be disabled

AUTO-RESPONSE: What, if anything, the system did to alleviate any follow-on issues

Jul 1 14:30:20 sunrise IMPACT: Loss of services provided by the device instances associated with this fault

IMPACT: A description of what that response may have done

Jul 1 14:30:20 sunrise REC-ACTION: Schedule a repair procedure to replace the affected device. Use fmdump -v -u EVENT_ID to identify the device or contact Sun for support.

REC-ACTION: A short description of what the system administrator should do


Using the Predictive Self-Healing Commands

For complete information about Predictive Self-Healing commands, refer to the Solaris 10 man pages. This section describes some details of the fmdump, fmadm, and fmstat commands.

Using the fmdump Command

After the message in TABLE 8-23 is displayed, more information about the fault is available. The fmdump command displays the contents of any log files associated with the Solaris Fault Manager.

The fmdump command produces output similar to the following example. This example assumes there is only one fault.


# fmdump 
TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
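The UUID in this summary line can also be captured programmatically, for example to pass to other fault manager commands. The following is a hypothetical sketch; the embedded text stands in for live fmdump output, and the variable names are illustrative.

```shell
# Hypothetical sketch: capture the UUID from an fmdump summary line.
# The sample text mirrors the fmdump output shown above.
fmdump_out='TIME UUID SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y'

# Skip the header row; the UUID is the fourth whitespace-separated field.
uuid=$(printf '%s\n' "$fmdump_out" | awk 'NR > 1 { print $4 }')

echo "fmdump -V -u $uuid"
```

On a live system, the same field extraction would be applied to the actual fmdump output rather than embedded sample text.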

fmdump -V

The -V option provides more details.


# fmdump -V -u  0ee65618-2218-4997-c0dc-b5c410ed8ec2
TIME                 UUID                                  SUNW-MSG-ID
Jul 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2  SUN4-8000-0Y
100% fault.io.fire.asic
FRU: hc://product-id=SUNW,A70/motherboard=0
rsrc: hc:///motherboard=0/hostbridge=0/pciexrc=0

The -V option delivers three additional lines of output.

fmdump -e

To get information about the errors that caused this failure, use the -e option.


# fmdump -e
TIME                 CLASS
Nov 02 10:04:14.3008 ereport.io.fire.jbc.mb_per

Using the fmadm faulty Command

The fmadm faulty command lists and modifies system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault.


# fmadm faulty
STATE		RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
		0ee65618-2218-4997-c0dc-b5c410ed8ec2

The PCI device is degraded and is associated with the same UUID as seen above. You may also see faulted states.
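A quick way to surface only the impaired resources is to filter the fmadm faulty output on its state column. The following is a hypothetical sketch; the embedded text mirrors the fmadm faulty output above, and the variable names are illustrative.

```shell
# Hypothetical sketch: list resources that fmadm faulty reports as degraded
# or faulted. The sample text mirrors the fmadm faulty output above.
fmadm_out='STATE    RESOURCE / UUID
-------- ------------------------------------------
degraded dev:////pci@1e,600000
         0ee65618-2218-4997-c0dc-b5c410ed8ec2'

# Match only rows whose first field is a fault state.
bad=$(printf '%s\n' "$fmadm_out" | awk '$1 == "degraded" || $1 == "faulted" { print $2 }')

echo "Impaired resources: $bad"
```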

fmadm config

The fmadm config command output shows the version numbers of the diagnosis engines in use by your system, and also displays their current state. You can check these versions against information on the http://sunsolve.sun.com web site to determine if your server is using the latest diagnostic engines.


# fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-diagnosis         1.5     active  UltraSPARC-III/IV CPU/Memory Diagnosis
cpumem-retire            1.1     active  CPU/Memory Retire Agent
eft                      1.16    active  eft diagnosis engine
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                1.0     active  I/O Retire Agent
snmp-trapgen             1.0     active  SNMP Trap Generation Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine

Using the fmstat Command

The fmstat command can report statistics associated with the Solaris Fault Manager. The fmstat command shows information about diagnosis engine (DE) performance. In the example below, the eft DE (also seen in the console output) has received an event that it accepted. A case is opened for that event and a diagnosis is performed to determine the cause of the failure.


# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis         0       0  0.0    0.0   0   0     0     0   3.0K      0
cpumem-retire            0       0  0.0    0.0   0   0     0     0      0      0
eft                      0       0  0.0    0.0   0   0     0     0   713K      0
fmd-self-diagnosis       0       0  0.0    0.0   0   0     0     0      0      0
io-retire                0       0  0.0    0.0   0   0     0     0      0      0
snmp-trapgen             0       0  0.0    0.0   0   0     0     0    32b      0
sysevent-transport       0       0  0.0 6704.4   1   0     0     0      0      0
syslog-msgs              0       0  0.0    0.0   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    0.0   0   0     0     0      0      0
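Modules that have received events or have open cases can be picked out of fmstat output by column position. The following is a hypothetical sketch; the nonzero counts for eft are invented for illustration (the live output above shows all zeros), and the column layout mirrors the fmstat output above.

```shell
# Hypothetical sketch: flag fault manager modules with received events or
# open cases in fmstat output. The nonzero counts for eft are invented here
# for illustration; the column layout mirrors the fmstat output above.
fmstat_out='module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
eft                      2       2  0.0    0.5   0   0     1     1   713K      0
syslog-msgs              0       0  0.0    0.0   0   0     0     0      0      0'

# Column 2 is ev_recv and column 8 is open cases.
active=$(printf '%s\n' "$fmstat_out" | awk 'NR > 1 && ($2 > 0 || $8 > 0) { print $1 }')

echo "Modules with activity: $active"
```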


About Traditional Solaris OS Diagnostic Tools

If a system passes OpenBoot Diagnostics tests, it normally attempts to boot its multiuser OS. For most Sun systems, this means the Solaris OS. Once the server is running in multiuser mode, you have access to the software-based exerciser tools, SunVTS and Sun Management Center. These tools enable you to monitor the server, exercise it, and isolate faults.



Note - If you set the auto-boot OpenBoot configuration variable to false, the OS does not boot following completion of the firmware-based tests.


In addition to the tools mentioned above, you can refer to error and system message log files, and Solaris system information commands.

Error and System Message Log Files

Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from many sources, including the OS, the environmental control subsystem, and various software applications.
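Since PSH notifications are written to this file with their message IDs, a messages file can be scanned for IDs of the form SUNx-xxxx-xx. The following is a hypothetical sketch that assumes a grep supporting -o; the sample lines are modeled on messages shown in this chapter.

```shell
# Hypothetical sketch: scan messages-file text for PSH message IDs of the
# form SUNx-xxxx-xx. The sample lines are modeled on /var/adm/messages
# content shown in this chapter.
messages='Jul  1 14:30:20 sunrise genunix: [ID 540533 kern.notice] SunOS Release 5.10
Jul  1 14:30:20 sunrise  Refer to http://sun.com/msg/SUN4-8000-0Y for more information.'

ids=$(printf '%s\n' "$messages" | grep -o 'SUN[0-9]-[0-9A-Z]*-[0-9A-Z]*' | sort -u)

echo "PSH message IDs found: $ids"
```

On a live system, the grep would read /var/adm/messages directly instead of embedded sample text.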

Solaris System Information Commands

The following Solaris commands display data that you can use when assessing the condition of a Sun Fire V445 server: prtconf, prtdiag, prtfru, psrinfo, and showrev.

This section describes the information these commands give you. For more information on using these commands, refer to the Solaris man pages.

Using the prtconf Command

The prtconf command displays the Solaris device tree. This tree includes all the devices probed by OpenBoot firmware, as well as additional devices, like individual disks. The output of prtconf also includes the total amount of system memory. CODE EXAMPLE 8-6 shows an excerpt of prtconf output, truncated to save space.


CODE EXAMPLE 8-6 prtconf Command Output (Truncated)
# prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
 
SUNW,Sun-Fire-V445
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        SUNW,i2c-ram-device (driver not attached)
        SUNW,fru-device (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #0
    SUNW,UltraSPARC-IIIi (driver not attached)
    memory-controller, instance #1 ...

The prtconf command with the -p option produces output similar to that of the OpenBoot show-devs command. This output lists only those devices compiled by the system firmware.
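For scripted checks, single values such as the total memory size can be isolated from prtconf output. The following is a hypothetical sketch; the embedded text is copied from the first lines of CODE EXAMPLE 8-6, and the variable names are illustrative.

```shell
# Hypothetical sketch: isolate the total memory size from prtconf output.
# The embedded text is copied from the first lines of CODE EXAMPLE 8-6.
prtconf_out='System Configuration:  Sun Microsystems  sun4u
Memory size: 1024 Megabytes'

# Split on ": " and keep the value portion of the "Memory size" line.
memsize=$(printf '%s\n' "$prtconf_out" | awk -F': ' '/^Memory size/ { print $2 }')

echo "Total system memory: $memsize"
```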

Using the prtdiag Command

The prtdiag command displays a table of diagnostic information that summarizes the status of system components.

The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on your system. Following is an excerpt of some of the output produced by prtdiag on a Sun Fire V445 server.


CODE EXAMPLE 8-7 prtdiag Command Output
# prtdiag
System Configuration: Sun Microsystems  sun4u Sun Fire V445
System clock frequency: 199 MHZ
Memory size: 24GB
 
==================================== CPUs ====================================
                E$          CPU                    CPU
CPU  Freq      Size        Implementation         Mask    Status      Location
---  --------  ----------  ---------------------  -----   ------      --------
0    1592 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/C0/P0
1    1592 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/C1/P0
2    1592 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/C2/P0
3    1592 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/C3/P0
 
================================= IO Devices =================================
Bus     Freq  Slot +      Name +
Type    MHz   Status      Path                          Model
------  ----  ----------  ----------------------------  --------------------
pci     199   MB/PCI4     LSILogic,sas-pci1000,54 (scs+ LSI,1068
               okay        /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/LSILogic,sas@1
 
pci     199   MB/PCI5     pci108e,abba (network)        SUNW,pci-ce
               okay        /pci@1f,700000/pci@0/pci@2/pci@0/pci@8/pci@2/network@0
 
pciex   199   MB          pci14e4,1668 (network)
               okay        /pci@1e,600000/pci/pci/pci/network
 
pciex   199   MB          pci14e4,1668 (network)
               okay        /pci@1e,600000/pci/pci/pci/network
 
pciex   199   MB          pci10b9,5229 (ide)
               okay        /pci@1f,700000/pci@0/pci@1/pci@0/ide
 
pciex   199   MB          pci14e4,1668 (network)
               okay        /pci@1f,700000/pci@0/pci@2/pci@0/network
 
pciex   199   MB          pci14e4,1668 (network)
               okay        /pci@1f,700000/pci@0/pci@2/pci@0/network
 
 
============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                8GB               16          BankIDs 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0x1000000000       8GB               16          BankIDs 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0x2000000000       4GB               4           BankIDs 32,33,34,35
0x3000000000       4GB               4           BankIDs 48,49,50,51
 
Bank Table:
-----------------------------------------------------------
            Physical Location
ID       ControllerID  GroupID   Size       Interleave Way
-----------------------------------------------------------
0        0             0         512MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1        0             0         512MB
2        0             1         512MB
3        0             1         512MB
4        0             0         512MB
5        0             0         512MB
6        0             1         512MB
7        0             1         512MB
8        0             1         512MB
9        0             1         512MB
10       0             0         512MB
11       0             0         512MB
12       0             1         512MB
13       0             1         512MB
14       0             0         512MB
15       0             0         512MB
16       1             0         512MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
17       1             0         512MB
18       1             1         512MB
19       1             1         512MB
20       1             0         512MB
21       1             0         512MB
22       1             1         512MB
23       1             1         512MB
24       1             1         512MB
25       1             1         512MB
26       1             0         512MB
27       1             0         512MB
28       1             1         512MB
29       1             1         512MB
30       1             0         512MB
31       1             0         512MB
32       2             0         1GB             0,1,2,3
33       2             1         1GB
34       2             1         1GB
35       2             0         1GB
48       3             0         1GB             0,1,2,3
49       3             1         1GB
50       3             1         1GB
51       3             0         1GB
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels         Status
--------------------------------------------------
0              0        MB/C0/P0/B0/D0
0              0        MB/C0/P0/B0/D1
0              1        MB/C0/P0/B1/D0
0              1        MB/C0/P0/B1/D1
1              0        MB/C1/P0/B0/D0
1              0        MB/C1/P0/B0/D1
1              1        MB/C1/P0/B1/D0
1              1        MB/C1/P0/B1/D1
2              0        MB/C2/P0/B0/D0
2              0        MB/C2/P0/B0/D1
2              1        MB/C2/P0/B1/D0
2              1        MB/C2/P0/B1/D1
3              0        MB/C3/P0/B0/D0
3              0        MB/C3/P0/B0/D1
3              1        MB/C3/P0/B1/D0
3              1        MB/C3/P0/B1/D1
 
=============================== usb Devices ===============================
 
Name          Port#
------------  -----
hub           HUB0
 
 
The following excerpt of prtdiag verbose output shows a fan tach failure:
 
============================ Environmental Status ============================
Fan Status:
-------------------------------------------
Location             Sensor          Status
-------------------------------------------
MB/FT0/F0            TACH            okay
MB/FT1/F0            TACH            failed (0 rpm)
MB/FT2/F0            TACH            okay
MB/FT5/F0            TACH            okay
PS1                  FF_FAN          okay
PS3                  FF_FAN          okay
 
Temperature sensors:
-----------------------------------------
Location       Sensor              Status
-----------------------------------------
MB/C0/P0       T_CORE              okay
MB/C1/P0       T_CORE              okay
MB/C2/P0       T_CORE              okay
MB/C3/P0       T_CORE              okay
MB/C0          T_AMB               okay
MB/C1          T_AMB               okay
MB/C2          T_AMB               okay
MB/C3          T_AMB               okay
MB             T_CORE              okay
MB             IO_T_AMB            okay
MB/FIOB        T_AMB               okay
MB             T_AMB               okay
PS1            FF_OT               okay
PS3            FF_OT               okay
------------------------------------
Current sensors:
----------------------------------------
Location             Sensor       Status
----------------------------------------
MB/USB0              I_USB0       okay
MB/USB1              I_USB1       okay

In addition to the information in CODE EXAMPLE 8-7, prtdiag with the verbose option (-v) also reports on front panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures.


CODE EXAMPLE 8-8 prtdiag Verbose Output
System Temperatures (Celsius):
-------------------------------
Device          Temperature    Status
---------------------------------------
CPU0            59             OK
CPU2            64             OK
DBP0            22             OK

In the event of an overtemperature condition, prtdiag reports an error in the Status column.


CODE EXAMPLE 8-9 prtdiag Overtemperature Indication Output
System Temperatures (Celsius):
-------------------------------
Device          Temperature    Status
---------------------------------------
CPU0            62             OK
CPU1            102            ERROR

Similarly, if there is a failure of a particular component, prtdiag reports a fault in the appropriate Status column.


CODE EXAMPLE 8-10 prtdiag Fault Indication Output
Fan Status:
-----------
 
Bank             RPM    Status
----            -----   ------
CPU0             4166   [NO_FAULT]
CPU1             0000   [FAULT]
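Status columns like these lend themselves to automated scanning: anything whose status is not "okay" (or its equivalent) deserves attention. The following is a hypothetical sketch; the sample rows mirror the fan status table shown earlier in this chapter, and the variable names are illustrative.

```shell
# Hypothetical sketch: scan prtdiag environmental status lines for any
# component whose status is not "okay". The sample rows mirror the fan
# status table shown earlier in this chapter.
prtdiag_out='MB/FT0/F0            TACH            okay
MB/FT1/F0            TACH            failed (0 rpm)
MB/FT2/F0            TACH            okay'

# Field 1 is the location, field 3 is the status.
failed=$(printf '%s\n' "$prtdiag_out" | awk '$3 != "okay" { print $1 }')

echo "Components not okay: $failed"
```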

Using the prtfru Command

The Sun Fire V445 system maintains a hierarchical list of all FRUs in the system, as well as specific information about various FRUs.

The prtfru command can display this hierarchical list, as well as data contained in the serial electrically erasable programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 8-11 shows an excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.


CODE EXAMPLE 8-11 prtfru -l Command Output (Truncated)
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT0?Label=FT0/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT1?Label=FT1/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT2?Label=FT2/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/FT3?Label=FT3
/frutree/chassis/MB?Label=MB/system-board/FT4?Label=FT4
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray (fru)
/frutree/chassis/MB?Label=MB/system-board/FT5?Label=FT5/fan-tray/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module (container)
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu
/frutree/chassis/MB?Label=MB/system-board/C0?Label=C0/cpu-module/P0?Label=P0/cpu/B0?Label=B0

CODE EXAMPLE 8-12 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.


CODE EXAMPLE 8-12 prtfru -c Command Output
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
    SEGMENT: FD
       /Customer_DataR
       /Customer_DataR/UNIX_Timestamp32: Wed Dec 31 19:00:00 EST 1969
       /Customer_DataR/Cust_Data:
       /InstallationR (4 iterations)
       /InstallationR[0]
       /InstallationR[0]/UNIX_Timestamp32: Fri Dec 31 20:47:13 EST 1999
       /InstallationR[0]/Fru_Path: MB.SEEPROM
       /InstallationR[0]/Parent_Part_Number: 5017066
       /InstallationR[0]/Parent_Serial_Number: BM004E
       /InstallationR[0]/Parent_Dash_Level: 05
       /InstallationR[0]/System_Id:
       /InstallationR[0]/System_Tz: 238
       /InstallationR[0]/Geo_North: 15658734
       /InstallationR[0]/Geo_East: 15658734
       /InstallationR[0]/Geo_Alt: 238
       /InstallationR[0]/Geo_Location:
       /InstallationR[1]
       /InstallationR[1]/UNIX_Timestamp32: Mon Mar  6 10:08:30 EST 2006
       /InstallationR[1]/Fru_Path: MB.SEEPROM
       /InstallationR[1]/Parent_Part_Number: 3753302
       /InstallationR[1]/Parent_Serial_Number: 0001
       /InstallationR[1]/Parent_Dash_Level: 03
       /InstallationR[1]/System_Id:
       /InstallationR[1]/System_Tz: 238
       /InstallationR[1]/Geo_North: 15658734
       /InstallationR[1]/Geo_East: 15658734
       /InstallationR[1]/Geo_Alt: 238
       /InstallationR[1]/Geo_Location:
       /InstallationR[2]
       /InstallationR[2]/UNIX_Timestamp32: Tue Apr 18 10:00:45 EDT 2006
       /InstallationR[2]/Fru_Path: MB.SEEPROM
       /InstallationR[2]/Parent_Part_Number: 5017066
       /InstallationR[2]/Parent_Serial_Number: BM004E
       /InstallationR[2]/Parent_Dash_Level: 05
       /InstallationR[2]/System_Id:
       /InstallationR[2]/System_Tz: 0
       /InstallationR[2]/Geo_North: 12704
       /InstallationR[2]/Geo_East: 1
       /InstallationR[2]/Geo_Alt: 251
       /InstallationR[2]/Geo_Location:
       /InstallationR[3]
       /InstallationR[3]/UNIX_Timestamp32: Fri Apr 21 08:50:32 EDT 2006
       /InstallationR[3]/Fru_Path: MB.SEEPROM
       /InstallationR[3]/Parent_Part_Number: 3753302
       /InstallationR[3]/Parent_Serial_Number: 0001
       /InstallationR[3]/Parent_Dash_Level: 03
       /InstallationR[3]/System_Id:
       /InstallationR[3]/System_Tz: 0
       /InstallationR[3]/Geo_North: 1
       /InstallationR[3]/Geo_East: 16531457
       /InstallationR[3]/Geo_Alt: 251
       /InstallationR[3]/Geo_Location:
       /Status_EventsR (0 iterations)
    SEGMENT: PE
       /Power_EventsR (50 iterations)
       /Power_EventsR[0]
       /Power_EventsR[0]/UNIX_Timestamp32: Mon Jul 10 12:34:20 EDT 2006
       /Power_EventsR[0]/Event: power_on
       /Power_EventsR[1]
       /Power_EventsR[1]/UNIX_Timestamp32: Mon Jul 10 12:34:49 EDT 2006
       /Power_EventsR[1]/Event: power_off
       /Power_EventsR[2]
       /Power_EventsR[2]/UNIX_Timestamp32: Mon Jul 10 12:35:27 EDT 2006
       /Power_EventsR[2]/Event: power_on
       /Power_EventsR[3]
       /Power_EventsR[3]/UNIX_Timestamp32: Mon Jul 10 12:58:43 EDT 2006
       /Power_EventsR[3]/Event: power_off
       /Power_EventsR[4]
       /Power_EventsR[4]/UNIX_Timestamp32: Mon Jul 10 13:07:27 EDT 2006
       /Power_EventsR[4]/Event: power_on
       /Power_EventsR[5]
       /Power_EventsR[5]/UNIX_Timestamp32: Mon Jul 10 14:07:20 EDT 2006
       /Power_EventsR[5]/Event: power_off
       /Power_EventsR[6]
       /Power_EventsR[6]/UNIX_Timestamp32: Mon Jul 10 14:07:21 EDT 2006
       /Power_EventsR[6]/Event: power_on
       /Power_EventsR[7]
       /Power_EventsR[7]/UNIX_Timestamp32: Mon Jul 10 14:17:01 EDT 2006
       /Power_EventsR[7]/Event: power_off
       /Power_EventsR[8]
       /Power_EventsR[8]/UNIX_Timestamp32: Mon Jul 10 14:40:22 EDT 2006
       /Power_EventsR[8]/Event: power_on
       /Power_EventsR[9]
       /Power_EventsR[9]/UNIX_Timestamp32: Mon Jul 10 14:42:38 EDT 2006
       /Power_EventsR[9]/Event: power_off
       /Power_EventsR[10]
       /Power_EventsR[10]/UNIX_Timestamp32: Mon Jul 10 16:12:35 EDT 2006
       /Power_EventsR[10]/Event: power_on
       /Power_EventsR[11]
       /Power_EventsR[11]/UNIX_Timestamp32: Tue Jul 11 08:53:47 EDT 2006
       /Power_EventsR[11]/Event: power_off
       /Power_EventsR[12]

Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes:

Using the psrinfo Command

The psrinfo command displays the date and time each CPU came online. With the verbose (-v) option, the command displays additional information about the CPUs, including their clock speed. The following is sample output from the psrinfo command with the -v option.


CODE EXAMPLE 8-13 psrinfo -v Command Output
# psrinfo -v
Status of virtual processor 0 as of: 07/13/2006 14:18:39
   on-line since 07/13/2006 14:01:26.
   The sparcv9 processor operates at 1592 MHz,
         and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 07/13/2006 14:18:39
   on-line since 07/13/2006 14:01:26.
   The sparcv9 processor operates at 1592 MHz,
         and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 07/13/2006 14:18:39
   on-line since 07/13/2006 14:01:26.
   The sparcv9 processor operates at 1592 MHz,
         and has a sparcv9 floating point processor.
Status of virtual processor 3 as of: 07/13/2006 14:18:39
   on-line since 07/13/2006 14:01:24.
   The sparcv9 processor operates at 1592 MHz,
         and has a sparcv9 floating point processor.
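Processor counts and clock speeds can be summarized from this verbose output. The following is a hypothetical sketch; the sample lines mirror CODE EXAMPLE 8-13, and the variable names are illustrative.

```shell
# Hypothetical sketch: count processors and collect their clock speeds from
# psrinfo -v style output. The sample lines mirror CODE EXAMPLE 8-13.
psrinfo_out='Status of virtual processor 0 as of: 07/13/2006 14:18:39
   The sparcv9 processor operates at 1592 MHz,
Status of virtual processor 1 as of: 07/13/2006 14:18:39
   The sparcv9 processor operates at 1592 MHz,'

# One "Status of virtual processor" line per CPU; the speed precedes "MHz,".
ncpu=$(printf '%s\n' "$psrinfo_out" | grep -c '^Status of virtual processor')
speeds=$(printf '%s\n' "$psrinfo_out" | awk '/operates at/ { print $(NF-1) }' | sort -u)

echo "$ncpu processors at $speeds MHz"
```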

Using the showrev Command

The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 8-14 shows sample output of the showrev command.


CODE EXAMPLE 8-14 showrev Command Output
# showrev
Hostname: sunrise
Hostid: 83d8ee71
Release: 5.10
Kernel architecture: sun4u
Application architecture: sparc
Hardware provider: Sun_Microsystems
Domain: Ecd.East.Sun.COM
Kernel version: SunOS 5.10 Generic_118833-17
bash-3.00#

When used with the -p option, this command displays installed patches. TABLE 8-30 shows a partial sample output from the showrev command with the -p option.


TABLE 8-30 showrev -p Command Output
Patch: 109729-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109783-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109807-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 109809-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110905-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110910-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 110914-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr
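
The patch lines above follow a fixed field layout, so they can be turned into structured records, for example to find which patches deliver a given package. A minimal Python sketch, with a hypothetical helper name; the field names come from the output above:

```python
import re

def parse_patch_line(line):
    """Split one `showrev -p` output line into its labeled fields."""
    m = re.match(
        r"Patch: (\S+) Obsoletes: (.*?) Requires: (.*?) "
        r"Incompatibles: (.*?) Packages: (.*)",
        line,
    )
    if not m:
        return None
    patch, obsoletes, requires, incompat, packages = (s.strip() for s in m.groups())
    return {
        "patch": patch,
        "obsoletes": obsoletes.split(),
        "requires": requires.split(),
        "incompatibles": incompat.split(),
        "packages": packages.split(),
    }

# One line from TABLE 8-30 above.
line = "Patch: 108964-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr"
print(parse_patch_line(line))
```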


procedure icon  To Run Solaris System Information Commands

1. Decide what kind of system information you want to display.

For more information, see Solaris System Information Commands.

2. Type the appropriate command at a console prompt.

See TABLE 8-31 for a summary of the commands.


TABLE 8-31 Using Solaris Information Display Commands

Command

What It Displays

What to Type

Notes

fmadm

Fault management information

/usr/sbin/fmadm

Lists information and changes settings.

fmdump

Fault management information

/usr/sbin/fmdump

Use the -v option for additional detail.

prtconf

System configuration information

/usr/sbin/prtconf

-

prtdiag

Diagnostic and configuration information

/usr/platform/sun4u/sbin/prtdiag

Use the -v option for additional detail.

prtfru

FRU hierarchy and SEEPROM memory contents

/usr/sbin/prtfru

Use the -l option to display hierarchy. Use the -c option to display SEEPROM data.

psrinfo

Date and time each CPU came online; processor clock speed

/usr/sbin/psrinfo

Use the -v option to obtain clock speed and other data.

showrev

Hardware and software revision information

/usr/bin/showrev

Use the -p option to show software patches.



Viewing Recent Diagnostic Test Results

A summary of the results of the most recent power-on self-test (POST) is saved across power cycles.


procedure icon  To View Recent Test Results

1. Obtain the ok prompt.

2. To see a summary of the most recent POST results, type:


TABLE 8-32
ok show-post-results


Setting OpenBoot Configuration Variables

Switches and diagnostic configuration variables stored in the IDPROM determine how and when power-on self-test (POST) diagnostics and OpenBoot Diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration variables. For a list of important OpenBoot configuration variables, see TABLE 8-7.

Changes to OpenBoot configuration variables usually take effect upon the next reboot.


procedure icon  To View and Set OpenBoot Configuration Variables

1. Obtain the ok prompt.

2. To view the current values of all OpenBoot configuration variables, type the printenv command.

The following example shows a short excerpt of this command's output.


TABLE 8-33
ok printenv
Variable Name         Value                          Default Value
 
diag-level            min                            min
diag-switch?          false                          false

Use the setenv command to change an OpenBoot configuration variable, for example, ok setenv diag-switch? true. To set a variable that accepts multiple keywords, separate the keywords with a space.


Additional Diagnostic Tests for Specific Devices

Using the probe-scsi Command to Confirm That Hard Disk Drives Are Active

The probe-scsi command transmits an inquiry to SAS devices connected to the system's internal SAS interface. If a SAS device is connected and active, the command displays the unit number, device type, and manufacturer name for that device.


CODE EXAMPLE 8-15 probe-scsi Output Message

ok probe-scsi
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1 
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
 

The probe-scsi-all command transmits an inquiry to all SAS devices connected to both the system's internal and external SAS interfaces. CODE EXAMPLE 8-16 shows sample output from a server with no externally connected SAS devices but with two 36-Gbyte hard disk drives, both of them active.


CODE EXAMPLE 8-16 probe-scsi-all Output Message

ok probe-scsi-all
/pci@1f,0/pci@1/scsi@8,1
 
/pci@1f,0/pci@1/scsi@8
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1 
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
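
Output in this form can also be checked by script, for example to confirm an expected disk count after maintenance. A Python sketch (the helper name is illustrative, and the layout assumptions are taken from the example above):

```python
import re

def count_disks(probe_output):
    """Count `Unit ... Disk` lines in probe-scsi / probe-scsi-all output."""
    return sum(
        1 for line in probe_output.splitlines()
        if re.match(r"\s*Unit \d+\s+Disk\b", line)
    )

# Sample taken from the probe-scsi-all output above.
sample = """\
/pci@1f,0/pci@1/scsi@8,1

/pci@1f,0/pci@1/scsi@8
Target 0
 Unit 0   Disk     SEAGATE ST336605LSUN36G 4207
Target 1
 Unit 0   Disk     SEAGATE ST336605LSUN36G 0136
"""
print(count_disks(sample))  # 2
```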
 

Using the probe-ide Command to Confirm That the DVD Drive Is Connected

The probe-ide command transmits an inquiry command to internal and external IDE devices connected to the system's on-board IDE interface. The following sample output is from a server with a DVD drive installed as Device 0 and active.


CODE EXAMPLE 8-17 probe-ide Output Message

ok probe-ide
 Device 0  ( Primary Master ) 
 Removable ATAPI Model: DV-28E-B
 
 Device 1  ( Primary Slave )
 Not Present
 
 Device 2  ( Secondary Master ) 
 Not Present
 
 Device 3  ( Secondary Slave )
 Not Present
 

Using the watch-net and watch-net-all Commands to Check the Network Connections

The watch-net diagnostic test monitors Ethernet packets on the primary network interface. The watch-net-all diagnostic test monitors Ethernet packets on the primary network interface and on any additional network interfaces connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as framing errors and cyclic redundancy check (CRC) errors are indicated by an X and an associated error description.

Start the watch-net diagnostic test by typing the watch-net command at the ok prompt. For the watch-net-all diagnostic test, type watch-net-all at the ok prompt.


CODE EXAMPLE 8-18 watch-net Diagnostic Output Message

{0} ok watch-net
Internal loopback test -- succeeded.
Link is -- up
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.................................
 


CODE EXAMPLE 8-19 watch-net-all Diagnostic Output Message

{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Internal loopback test -- succeeded.
Link is -- up 
Looking for Ethernet Packets.
`.' is a Good Packet. `X' is a Bad Packet.
Type any key to stop.
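
The period and X legend above lends itself to a quick summary when a capture is saved to a log. A Python sketch, purely illustrative (the function name and the sample marker string are not from any Solaris tool):

```python
def summarize_packets(capture):
    """Tally good ('.') and bad ('X') packet markers from a watch-net capture."""
    good = capture.count(".")
    bad = capture.count("X")
    total = good + bad
    error_rate = (bad / total * 100) if total else 0.0
    return good, bad, error_rate

# Hypothetical captured marker stream: 22 good packets, 2 bad packets.
good, bad, rate = summarize_packets("...........X....X.......")
print(good, bad, round(rate, 1))  # 22 2 8.3
```

Note that a real watch-net transcript also contains literal periods in its banner text ("Type any key to stop....."), so a production script would first strip those lines before counting.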
 


About Automatic Server Restart



Note - Automatic Server Restart is not the same as Automatic System Restoration (ASR), which the Sun Fire V445 server also supports.


Automatic Server Restart is a functional part of ALOM. It monitors the Solaris OS while the OS is running and, by default, captures CPU register and memory contents to the dump device using the firmware-level sync command.

ALOM uses a watchdog process to monitor only the kernel. ALOM does not restart the server if a process hangs while the kernel is still running. The ALOM watchdog parameters for the watchdog patting interval and the watchdog timeout are not user configurable.

If the kernel hangs and the watchdog times out, ALOM reports and logs the event and performs one of three user-configurable actions.



Note - Do not confuse this OpenBoot sync command with the Solaris OS sync command, which writes buffered data to the disk drives before file systems are unmounted.


For more information, see the sys_autorestart section of the ALOM Online Help.


About Automatic System Restoration



Note - Automatic System Restoration (ASR) is not the same as Automatic Server Restart, which the Sun Fire V445 server also supports.


Automatic System Restoration (ASR) consists of self-test features and an auto-configuring capability to detect failed hardware components and unconfigure them. As a result, the server can resume operating after certain nonfatal hardware faults or failures have occurred.

If ASR monitors a component and the server is capable of operating without it, the server automatically reboots if that component develops a fault or fails.

ASR monitors the following components:

If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of functioning, the boot sequence continues.

If a fault occurs on a running server, and it is possible for the server to run without the failed component, the server automatically reboots. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.

To support such a degraded boot capability, the OpenBoot firmware uses the 1275 Client Interface (via the device tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The Solaris OS will not activate a driver for any subsystem so marked.

As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the system will reboot automatically and resume operation while a service call is made.



Note - ASR is enabled by default.


Auto-Boot Options

The OpenBoot firmware stores two configuration variables on a ROM chip: auto-boot? and auto-boot-on-error?. The default setting on the Sun Fire V445 server for both of these variables is true.

The auto-boot? setting controls whether or not the firmware automatically boots the OS after each reset. The auto-boot-on-error? setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true (default) to enable an automatic degraded boot.


procedure icon  To Set the Auto-Boot Switches

1. Type:


 
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - With both of these variables set to true, the system attempts a degraded boot in response to any fatal nonrecoverable error.


Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:

Given a failed DIMM, the firmware unconfigures the entire logical bank associated with the failed module. Another nonfailing logical bank must be present in the system for the system to attempt a degraded boot. See About the CPU/Memory Modules.



Note - If POST or OpenBoot Diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.


For more information about troubleshooting fatal errors, see Chapter 9.

Reset Scenarios

Two OpenBoot configuration variables, diag-switch? and diag-trigger, control whether the system executes firmware diagnostics in response to system reset events.

POST is enabled as the default for power-on-reset and error-reset events. When the diag-switch? variable is set to true, diagnostics are executed using user-defined settings. If the diag-switch? variable is set to false, diagnostics are executed depending on the diag-trigger variable setting.

In addition, ASR is enabled by default because diag-trigger is set to power-on-reset and error-reset. This default setting remains when the diag-switch? variable is set to false. auto-boot? and auto-boot-on-error? are set to true by default.

Automatic System Restoration User Commands

The OpenBoot commands .asr, asr-disable, and asr-enable are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see Unconfiguring a Device Manually.

Enabling Automatic System Restoration

The ASR feature is enabled by default. ASR is always enabled when the diag-switch? OpenBoot variable is set to true and the diag-trigger variable is set to error-reset.

To activate any parameter changes, type the following at the ok prompt:


 
ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (default).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.


Disabling Automatic System Restoration

After you disable the automatic system restoration (ASR) feature, it is not activated again until you enable it at the system ok prompt.


procedure icon  To Disable Automatic System Restoration

1. At the ok prompt, type:


 
ok setenv auto-boot-on-error? false

2. To activate the parameter change, type:


 
ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.


Displaying Automatic System Restoration Information

Use the following command to display information about the status of the ASR feature.

single-step bullet  At the ok prompt, type:


 
ok .asr

In the .asr command output, any devices marked disabled have been manually unconfigured using the asr-disable command. The .asr command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the OpenBoot ASR feature.


About SunVTS

SunVTS is a software suite that performs system and subsystem stress testing. You can view and control a SunVTS session over a network. Using a remote machine, you can view the progress of a testing session, change testing options, and control all testing features of another machine on the network.

You can run SunVTS software in four different test modes:

Because SunVTS software can run many tests in parallel and consume many system resources, use caution when running it on a production system. If you are stress-testing a system using the Functional test mode, do not run anything else on that system at the same time.

To install and use SunVTS, a system must be running a Solaris OS version compatible with the SunVTS version. Since SunVTS software packages are optional, they may not be installed on your system. See To Find Out Whether SunVTS Is Installed for instructions.

SunVTS Software and Security

During SunVTS software installation, you must choose between Basic and Sun Enterprise Authentication Mechanism security. Basic security uses a local security file in the SunVTS installation directory to limit the users, groups, and hosts permitted to use SunVTS software. Sun Enterprise Authentication Mechanism security is based on Kerberos, the standard network authentication protocol, and provides secure user authentication, data integrity, and privacy for transactions over networks.

If your site uses Sun Enterprise Authentication Mechanism security, you must have the Sun Enterprise Authentication Mechanism client and server software installed in your networked environment and configured properly in both Solaris and SunVTS software. If your site does not use Sun Enterprise Authentication Mechanism security, do not choose the Sun Enterprise Authentication Mechanism option during SunVTS software installation.

If you enable the wrong security scheme during installation, or if you improperly configure the security scheme you choose, you may find yourself unable to run SunVTS tests. For more information, see the SunVTS User's Guide and the instructions accompanying the Sun Enterprise Authentication Mechanism software.

Using SunVTS

SunVTS, the Sun Validation and Test Suite, is an online diagnostics tool that you can use to verify the configuration and functionality of hardware controllers, devices, and platforms. It runs in the Solaris OS and presents the following interfaces:

SunVTS software enables you to view and control testing sessions on a remotely connected server. TABLE 8-35 lists some of the tests that are available:


TABLE 8-35 SunVTS Tests

SunVTS Test

Description

cputest

Tests the CPU

disktest

Tests the local disk drives

dvdtest

Tests the DVD-ROM drive

fputest

Tests the floating-point unit

nettest

Tests the Ethernet hardware on the system board and the networking hardware on any optional PCI cards

netlbtest

Performs a loopback test to check that the Ethernet adapter can send and receive packets

pmemtest

Tests the physical memory (read only)

sutest

Tests the server's on-board serial ports

vmemtest

Tests the virtual memory (a combination of the swap partition and the physical memory)

env6test

Tests the environmental devices

ssptest

Tests ALOM hardware devices

i2c2test

Tests I2C devices for correct operation



procedure icon  To Find Out Whether SunVTS Is Installed

single-step bullet  Type:


TABLE 8-36
# pkginfo -l SUNWvts

If SunVTS software is loaded, information about the package will be displayed.

If SunVTS software is not loaded, you will see the following error message:


TABLE 8-37
ERROR: information for "SUNWvts" was not found

Installing SunVTS

By default, SunVTS is not installed on Sun Fire V445 servers. However, it is available in the Solaris_10/ExtraValue/CoBundled/SunVTS_X.X directory of the Solaris 10 DVD supplied in the Solaris Media Kit. For information about downloading SunVTS from the Sun Download Center, refer to the Sun Hardware Platform Guide for the Solaris version you are using.

To find out more about using SunVTS, refer to the SunVTS documentation that corresponds to the Solaris release that you are running.

Viewing SunVTS Documentation

The SunVTS documents are accessible in the Solaris on Sun Hardware documentation collection at http://docs.sun.com.

For further information, you can also consult the following SunVTS documents:


About Sun Management Center

Sun Management Center software provides enterprise-wide monitoring of Sun servers and workstations, including their subsystems, components, and peripheral devices. The system being monitored must be up and running, and you need to install all the proper software components on various systems in your network.

Sun Management Center enables you to monitor the following on the Sun Fire V445 server.


TABLE 8-38 What Sun Management Center Monitors

Item Monitored

What Sun Management Center Monitors

Disk drives

Status

Fans

Status

CPUs

Temperature and any thermal warning or failure conditions

Power supply

Status

System temperature

Temperature and any thermal warning or failure conditions


Sun Management Center software extends and enhances the management capability of Sun's hardware and software products.


TABLE 8-39 Sun Management Center Features

Feature

Description

System management

Monitors and manages the system at the hardware and operating system levels. Monitored hardware includes boards, tapes, power supplies, and disks.

Operating system management

Monitors and manages operating system parameters including load, resource usage, disk space, and network statistics.

Application and business system management

Provides technology to monitor business applications such as trading systems, accounting systems, inventory systems, and real-time control systems.

Scalability

Provides an open, scalable, and flexible solution to configure and manage multiple management administrative domains (consisting of many systems) spanning an enterprise. The software can be configured and used in a centralized or distributed fashion by multiple users.


Sun Management Center software is geared primarily toward system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a more modest installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.

The servers being monitored must be up and running if you want to use Sun Management Center, since this tool relies on the Solaris OS. For instructions on using this tool to monitor a Sun Fire V445 server, see Chapter 8.

How Sun Management Center Works

Sun Management Center consists of three components:

You install agents on systems to be monitored. The agents collect system status information from log files, device trees, and platform-specific sources, and report that data to the server component.

The server component maintains a large database of status information for a wide range of Sun platforms. This database is updated frequently, and includes information about boards, tapes, power supplies, and disks as well as OS parameters like load, resource usage, and disk space. You can create alarm thresholds and be notified when these are exceeded.

The monitor components present the collected data to you in a standard format. Sun Management Center software provides both a standalone Java application and a web browser-based interface. The Java interface affords physical and logical views of the system for highly intuitive monitoring.

Using Sun Management Center

Sun Management Center software is aimed at system administrators who have large data centers to monitor or other installations that have many computer platforms to monitor. If you administer a smaller installation, you need to weigh Sun Management Center software's benefits against the requirement of maintaining a significant database (typically over 700 Mbytes) of system status information.

The servers to be monitored must be up and running because Sun Management Center relies on the Solaris OS for its operation.

For detailed instructions, see the Sun Management Center Software User's Guide.

Other Sun Management Center Features

Sun Management Center software provides you with additional tools, which can operate with management utilities made by other companies.

The tools are an informal tracking mechanism and an optional add-on, the Hardware Diagnostic Suite.

Informal Tracking

Sun Management Center agent software must be loaded on any system you want to monitor. However, the product enables you to informally track a supported platform even when the agent software has not been installed on it. In this case, you do not have full monitoring capability, but you can add the system to your browser, have Sun Management Center periodically check whether it is up and running, and notify you if it goes out of commission.

Hardware Diagnostic Suite

The Hardware Diagnostic Suite is a package that you can purchase as an add-on to Sun Management Center. The suite enables you to exercise a system while it is still up and running in a production environment. See Hardware Diagnostic Suite for more information.

Interoperability With Third-Party Monitoring Tools

If you administer a heterogeneous network and use a third-party network-based system monitoring or management tool, you might be able to take advantage of Sun Management Center software's support for Tivoli Enterprise Console, BMC Patrol, and HP OpenView.

Obtaining the Latest Information

For the latest information about this product, go to the Sun Management Center web site: http://www.sun.com/sunmanagementcenter


Hardware Diagnostic Suite

Sun Management Center features an optional Hardware Diagnostic Suite, which you can purchase as an add-on. The Hardware Diagnostic Suite is designed to exercise a production system by running tests sequentially.

Sequential testing means the Hardware Diagnostic Suite has a low impact on the system. Unlike SunVTS, which stresses a system by consuming its resources with many parallel tests (see About SunVTS), the Hardware Diagnostic Suite lets the server run other applications while testing proceeds.

When to Run Hardware Diagnostic Suite

The best use of the Hardware Diagnostic Suite is to disclose a suspected or intermittent problem with a noncritical part on an otherwise functioning machine. Examples might include questionable disk drives or memory modules on a machine that has ample or redundant disk and memory resources.

In cases like these, the Hardware Diagnostic Suite runs unobtrusively until it identifies the source of the problem. The machine under test can be kept in production mode until and unless it must be shut down for repair. If the faulty part is hot-pluggable or hot-swappable, the entire diagnose-and-repair cycle can be completed with minimal impact to system users.

Requirements for Using Hardware Diagnostic Suite

Because the Hardware Diagnostic Suite is part of Sun Management Center, you can run it only if you have set up your data center to run Sun Management Center. This means you must dedicate a master server to run the Sun Management Center server software, which maintains the database of platform status information. In addition, you must install and set up Sun Management Center agent software on the systems to be monitored. Finally, you need to install the console portion of Sun Management Center software, which serves as your interface to the Hardware Diagnostic Suite.

Instructions for setting up Sun Management Center, as well as for using the Hardware Diagnostic Suite, can be found in the Sun Management Center Software User's Guide.