CHAPTER 1

Server Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the server.

The following topics are covered:

Fault on Initial Power Up

Server Diagnostics Overview

Using LEDs to Identify the State of Devices

Using the Service Processor Firmware for Diagnosis and Repair Verification

Running POST

Using the Solaris Predictive Self-Healing Feature

Collecting Information From Solaris OS Files and Commands

Managing Components With Automatic System Recovery Commands

Exercising the System With SunVTS Software

Obtaining the Chassis Serial Number

Additional Service Related Information

1.1 Fault on Initial Power Up

If, on initial power up after installation, you see errors indicating faults with the Fully Buffered DIMMs (FB-DIMMs), PCI cards, or other components, the suspect component might have worked loose during shipment.

Conduct a visual inspection of the server internals and its components. Remove the top cover and physically reseat the cable connections, the PCI cards, and the FB-DIMMs. See:

Prerequisite Tasks for Component Replacement

Replacing PCI-X, PCIe/XAUI Cards

Replacing FB-DIMMs

If these tasks do not resolve the problem, continue to Server Diagnostics Overview.

1.2 Server Diagnostics Overview

There are a variety of diagnostic tools, commands, and indicators you can use to monitor and troubleshoot a server:

LEDs - These indicators provide a quick visual notification of the status of the server and of some of the FRUs.

Fault management architecture - FMA provides simplified fault diagnostics through use of the /var/adm/messages file, the fmdump command, and a Sun Microsystems web site.

ILOM firmware - This system firmware runs on the service processor. In addition to providing the interface between the hardware and OS, ILOM also tracks and reports the health of key server components. ILOM works closely with POST and Solaris Predictive Self-Healing technology to keep the system up and running even when there is a faulty component.

Power-on self-test (POST) - POST performs diagnostics on system components upon system reset to ensure the integrity of those components. POST is configurable and works with ILOM to take faulty components offline if needed.

Solaris OS Predictive Self-Healing (PSH) - This technology continuously monitors the health of the CPU and memory, and works with ILOM to take a faulty component offline if needed. The Predictive Self-Healing technology enables Sun systems to accurately predict component failures and mitigate many serious problems before they occur.

Log files and console messages - These provide the standard Solaris OS log files and investigative commands that can be accessed and displayed on the device of your choice.

SunVTS - An application that exercises the system, provides hardware validation, and discloses possible faulty components with recommendations for repair.

The LEDs, ILOM, Solaris OS PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes information to ILOM, where the fault is also logged. Depending on the fault, one or more LEDs might light.

The diagnostic flowchart in FIGURE 1-1 and the actions in TABLE 1-1 describe an approach for using the server diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.

The flowchart assumes that you have already performed some rudimentary troubleshooting such as verification of proper installation, visual inspection of cables and power, and possibly performed a reset of the server (refer to the server installation guide and server administration guide for details).

Use this flowchart to understand what diagnostics are available to troubleshoot faulty hardware. Use TABLE 1-1 to find more information about each diagnostic in this chapter.

FIGURE 1-1 Diagnostic Flowchart

TABLE 1-1 Diagnostic Flowchart Actions
Action No.	Diagnostic Action	Resulting Action	Additional Information
1.	Check Power OK and Input OK LEDs on the server.	The Power OK LED is located on the front and rear of the chassis. The Input OK LED is located on the rear of the server on each power supply. If these LEDs are not on, check the power source and power connections to the server.	Using LEDs to Identify the State of Devices
2.	Run the ALOM CMT CLI `showfaults` command to check for faults.	The `showfaults` command displays the following kinds of faults: environmental faults, Solaris Predictive Self-Healing (PSH) detected faults, and POST detected faults. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.	Displaying System Faults
3.	Check the Solaris log files for fault information.	The Solaris message buffer and log files record system events and provide information about faults. If system messages indicate a faulty device, replace the FRU. To obtain more diagnostic information, go to Action 4.	Collecting Information From Solaris OS Files and Commands
4.	Run SunVTS.	SunVTS is an application you can run to exercise and diagnose FRUs. To run SunVTS, the server must be running the Solaris OS. If SunVTS reports a faulty device, replace the FRU. If SunVTS does not report a faulty device, go to Action 5.	Exercising the System With SunVTS Software
5.	Run POST.	POST performs basic tests of the server components and reports faulty FRUs. If POST indicates a faulty FRU, replace the FRU. If POST does not indicate a faulty FRU, go to Action 9.	Running POST
6.	Determine if the fault is an environmental fault.	If the fault listed by the `showfaults` command displays a temperature or voltage fault, then the fault is an environmental fault. Environmental faults can be caused by faulty FRUs (power supply, fan, or blower), or by environmental conditions such as when computer room ambient temperature is too high, or the server airflow is blocked. When the environmental condition is corrected, the fault will automatically clear.	Displaying System Faults
		If the fault indicates that a fan, blower, or power supply is bad, you can perform a hot-swap of the FRU. You can also use the fault LEDs on the server to identify the faulty FRU (fans, blower, and power supplies).	Using LEDs to Identify the State of Devices
7.	Determine if the fault was detected by PSH.	If the fault message displays the following text, the fault was detected by the Solaris Predictive Self-Healing software: `Host detected fault`	Using the Solaris Predictive Self-Healing Feature
		If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU.	Clearing PSH Detected Faults
		After replacing the FRU, perform the procedure to clear PSH detected faults.
8.	Determine if the fault was detected by POST.	POST performs basic tests of the server components and reports faulty FRUs. When POST detects a faulty FRU, it logs the fault and, if possible, takes the FRU offline. Faults detected by POST display the following text in the fault message: FRU-name `deemed faulty and disabled`	Running POST
		In this case, replace the FRU and run the procedure to clear POST detected faults.	Clearing POST Detected Faults

1.2.1 Memory Configuration and Fault Handling

A variety of features play a role in how the memory subsystem is configured and how memory faults are handled. Understanding the underlying features helps you identify and repair memory problems. This section describes how the memory is configured and how the server deals with memory faults.

1.2.1.1 Memory Configuration

The server has 16 slots that hold DDR-2 FB-DIMMs in the following capacities:

1 Gbyte (maximum of 16 Gbyte)

2 Gbyte (maximum of 32 Gbyte)

4 Gbyte (maximum of 64 Gbyte)

FB-DIMMs are installed in groups of 8, called ranks (ranks 0 and 1). At minimum, rank 0 must be fully populated with eight FB-DIMMs of the same capacity. A second rank of FB-DIMMs of the same capacity can be added to fill rank 1.

See Replacing FB-DIMMs for instructions about adding memory to a server.
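The population rules above can be sketched as a small check. This is only an illustration of the rules stated in this section; the function and constant names are hypothetical and are not part of any Sun tool.

```python
# Hypothetical helper: check an FB-DIMM population against the rules in this section.
VALID_SIZES_GB = (1, 2, 4)   # FB-DIMM capacities listed in this section
RANK_SIZE = 8                # FB-DIMMs per rank
SLOT_COUNT = 16              # total FB-DIMM slots (ranks 0 and 1)


def total_memory_gb(dimm_sizes_gb):
    """Return total capacity in Gbytes for a list of installed FB-DIMM sizes.

    Raises ValueError if the population breaks a rule stated in this section.
    """
    count = len(dimm_sizes_gb)
    if count not in (RANK_SIZE, SLOT_COUNT):
        raise ValueError("install exactly 8 (rank 0) or 16 (ranks 0 and 1) FB-DIMMs")
    sizes = set(dimm_sizes_gb)
    if len(sizes) != 1:
        raise ValueError("all FB-DIMMs must have the same capacity")
    size = sizes.pop()
    if size not in VALID_SIZES_GB:
        raise ValueError(f"unsupported FB-DIMM capacity: {size} Gbyte")
    return count * size


if __name__ == "__main__":
    print(total_memory_gb([4] * 16))  # both ranks filled with 4-Gbyte FB-DIMMs
    print(total_memory_gb([2] * 8))   # rank 0 only, 2-Gbyte FB-DIMMs
```

A fully populated server with 4-Gbyte FB-DIMMs gives 16 x 4 = 64 Gbyte, which matches the maximum listed above.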

1.2.1.2 Memory Fault Handling

The server uses an advanced ECC technology, called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as all of the bits are in the same DRAM. If a DRAM fails, the FB-DIMM continues to function.

The following server features independently manage memory faults:

POST - Based on ILOM configuration variables, POST runs when the server is powered on.

For correctable memory errors (CEs), POST forwards the error to the Solaris Predictive Self-Healing (PSH) daemon for error handling. If an uncorrectable memory fault is detected or if a “storm” of CEs is detected, POST displays the fault with the device name of the faulty FB-DIMMs, logs the fault, and disables the faulty FB-DIMMs by placing them in the ASR blacklist. Depending on the memory configuration and the location of the faulty FB-DIMM, POST disables half of physical memory in the system, or half the physical memory and half the processor threads. When this offlining process occurs in normal operation, you must replace the faulty FB-DIMMs based on the fault message. You then must enable the disabled FB-DIMMs with the ALOM CMT CLI enablecomponent command.

Solaris Predictive Self-Healing (PSH) technology - A feature of the Solaris OS that uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID) and logged. PSH reports the fault and recommends proactive replacement of the FB-DIMMs associated with the fault.

1.2.1.3 Troubleshooting Memory Faults

If you suspect that the server has a memory problem, follow the flowchart (FIGURE 1-1). Run the showfaults command from the ALOM CMT compatibility CLI (in ILOM). See Using the ALOM CMT Compatibility CLI in ILOM and Displaying System Faults. The showfaults command lists memory faults and the specific FB-DIMMs that are associated with each fault. Once you identify which FB-DIMMs to replace, see Replacing FB-DIMMs for replacement instructions. You must perform the instructions in that chapter to clear the faults and enable the replaced FB-DIMMs.

1.3 Using LEDs to Identify the State of Devices

The server provides the following groups of LEDs:

Front and Rear Panel LEDs

Hard Drive LEDs

Power Supply LEDs

Ethernet Port LEDs

These LEDs provide a quick visual check of the state of the system.

1.3.1 Front and Rear Panel LEDs

The seven front panel LEDs (FIGURE 1-2) are located in the upper left corner of the server chassis. Three of these LEDs are also provided on the rear panel (FIGURE 1-3).

FIGURE 1-2 Location of the Bezel Server Status and Alarm Status Indicators

Figure showing the location of the server and alarm status indicators on the front bezel

Figure Legend
1	User (amber) Alarm Status Indicator	5	Locator LED and Button
2	Minor (amber) Alarm Status Indicator	6	Fault LED
3	Major (red) Alarm Status Indicator	7	Activity LED
4	Critical (red) Alarm Status Indicator	8	Power OK LED

FIGURE 1-3 Rear Panel Connectors, LEDs, and Features on the Sun Netra T5220 Server

Figure showing the rear panel connectors, LEDs, and features

Figure Legend
1	Power Supply 0 LEDs top to bottom: Locator LED and Button, Service Required LED, Power OK LED	11	Alarm Port
2	Power Supply 0	12	USB ports left to right: USB0, USB1
3	Power Supply 1 LEDs top to bottom: Locator LED and Button, Service Required LED, Power OK LED	13	TTYA Serial Port
4	Power Supply 1	14	Captive screw for securing motherboard (2 of 2)
5	Captive screw for securing motherboard (1 of 2)	15	PCI-X Slot 3
6	System LEDs left to right: Locator LED and Button, Service Required LED, Power OK LED	16	PCIe or XAUI Slot 0
7	Service Processor Serial Management Port	17	PCI-X Slot 4
8	Service Processor Network Management Port	18	PCIe or XAUI Slot 1
9	Captive screws for securing the bottom PCI cards. Note that there are two screws on either side of each bottom PCI card (total 6).	19	PCIe Slot 5
10	Gigabit Ethernet Ports left to right: NET0, NET1, NET2, NET3	20	PCIe Slot 2

TABLE 1-2 lists and describes the front and rear panel LEDs.

TABLE 1-2 Front and Rear Panel LEDs
LED	Location	Color	Description
Locator LED and Button	Front upper left and rear center	White	Enables you to identify a particular server. The LED is activated using one of the following methods: Issuing the `setlocator on` or `off` command. Pressing the button to toggle the indicator on or off. This LED provides the following indications: Off - Normal operating state. Fast blink - The server received a signal as a result of one of the preceding methods.
Fault LED	Front upper left and rear center	Amber	If on, indicates that service is required. The ALOM CMT CLI `showfaults` command provides details about any faults that cause this indicator to be lit.
Activity LED	Front upper left	Green	On - Drives are receiving power. Solidly lit if drive is idle. Flashing - Drives are processing a command. Off - Power is off.
Power Button	Front upper left		Turns the host system on and off. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button.
Alarm:Critical LED	Front left	Red	Indicates a critical alarm. Refer to the server administration guide for a description of alarm states.
Alarm:Major LED	Front left	Red	Indicates a major alarm.
Alarm:Minor LED	Front left	Amber	Indicates a minor alarm.
Alarm:User LED	Front left	Amber	Indicates a user alarm.
Power OK LED	Rear center	Green	The LED provides the following indications: Off - The system is unavailable. Either the system has no power or ILOM is not running. Steady on - Indicates that the system is powered on and is running in its normal operating state. Standby blink - Indicates that the service processor is running while the system is running at a minimum level in Standby mode, and is ready to be returned to its normal operating state. Slow blink - Indicates that a normal transitory activity is taking place. The system diagnostics might be running, or the system might be booting.

1.3.2 Hard Drive LEDs

The hard drive LEDs (FIGURE 1-4 and TABLE 1-3) are located on the front of each hard drive that is installed in the server chassis.

FIGURE 1-4 Hard Drive LEDs

Figure showing the hard drive LEDs.

Figure Legend
1	OK to Remove
2	Fault
3	Activity

TABLE 1-3 Hard Drive LEDs
LED	Color	Description
OK to Remove	Blue	On - The drive is ready for hot-plug removal. Off - Normal operation.
Fault	Amber	On - The drive has a fault and requires attention. Off - Normal operation.
Activity	Green	On - The drive is receiving power. Solidly lit if drive is idle. Flashing - The drive is processing a command. Off - Power is off.

1.3.3 Power Supply LEDs

The power supply LEDs (FIGURE 1-5 and TABLE 1-4) are located on the rear of each power supply.

FIGURE 1-5 Power Supply LEDs

Figure showing the power supply LEDs

Figure Legend
1	Power OK power supply LED
2	Fault power supply LED
3	Input OK power supply LED

TABLE 1-4 Power Supply LEDs
LED	Color	Description
Power OK	Green	On - Normal operation. DC output voltage is within normal limits. Off - Power is off.
Fault	Amber	On - Power supply has detected a failure. Off - Normal operation.
Input OK	Green	On - Normal operation. Input power is within normal limits. Off - No input voltage, or input voltage is below limits.

1.3.4 Ethernet Port LEDs

The ILOM management Ethernet port and the four 10/100/1000 Mbps Ethernet ports each have two LEDs, as shown in FIGURE 1-6 and described in TABLE 1-5.

FIGURE 1-6 Ethernet Port LEDs

Figure showing the Ethernet LEDs

Figure Legend
1	Link/Activity indicator LED (Same location for all Ethernet ports)
2	Speed indicator LED (Same location for all Ethernet ports)

TABLE 1-5 Ethernet Port LEDs
LED	Color	Description
Left LED	Green	Link/Activity indicator: Steady On - a link is established. Blinking - there is activity on this port. Off - No link is established.
Right LED	Amber or Green	Speed indicator: Amber On - The link is operating as a Gigabit connection (1000-Mbps). Green On - The link is operating as a 100-Mbps connection. Off - The link is operating as a 10-Mbps connection.

Note - The NET MGT port operates only at 100 Mbps or 10 Mbps, so the speed indicator LED can be green or off (never amber).

1.4 Using the Service Processor Firmware for Diagnosis and Repair Verification

The Sun Integrated Lights Out Manager (ILOM) firmware runs on the service processor in the server and enables you to remotely manage and administer your server.

ILOM enables you to remotely run diagnostics, such as power-on self-test (POST), that would otherwise require physical proximity to the server’s serial port. You can also configure ILOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ILOM.

The service processor runs independently of the server, using the server’s standby power. Therefore, ILOM firmware and software continue to function when the server operating system goes offline or when the server is powered off.

Note - ILOM provides an ALOM CMT compatibility CLI. Refer to the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server for comprehensive ILOM and ALOM CMT compatibility information.

Faults detected by ILOM, POST, and the Solaris Predictive Self-Healing (PSH) technology are forwarded to ILOM for fault handling (FIGURE 1-7).

In the event of a system fault, ILOM ensures that the fault LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed (faulty FRUs are identified in fault messages using the FRU name). For a list of FRU names, see TABLE 2-1.

FIGURE 1-7 ILOM Fault Management

Figure showing the fault source interfaces.

The service processor detects when a fault is no longer present and clears the fault in several ways:

Fault recovery - The system automatically detects that the fault condition is no longer present. ILOM extinguishes the Service Required LED and updates the FRU’s PROM, indicating that the fault is no longer present.

Fault repair - The fault has been repaired by human intervention. In most cases, the service processor detects the repair and extinguishes the Service Required LED. If the service processor does not perform these actions, you must perform these tasks manually with the clearfault or enablecomponent commands.

The service processor also detects the removal of a FRU, in many cases even if the FRU is removed while the service processor is powered off (that is, if the system power cables are unplugged during service procedures). This situation enables ILOM to know that a fault, diagnosed to a specific FRU, has been repaired.

Note - ILOM does not automatically detect hard drive replacement.

Many environmental faults can automatically recover. A temperature that is exceeding a threshold might return to normal limits. An unplugged power supply can be plugged in, and so on. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

fru at location is OK.

sensor at location is within normal range.

Environmental faults can be repaired through hot-removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring, and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals is:

fru at location has been removed.

There is no ILOM command to manually repair an environmental fault.
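The three message forms above can be recognized with simple patterns, for example when scanning a saved event log. The following is a minimal sketch that assumes the messages appear exactly in the forms shown in this section and that a location contains no spaces. The function name and the sample messages are hypothetical.

```python
import re

# Message forms as given in this section. "subject" stands for the fru or
# sensor placeholder and "location" for the location placeholder.
_PATTERNS = [
    ("recovered", re.compile(r"^(?P<subject>.+?) at (?P<location>\S+) is OK\.$")),
    ("recovered", re.compile(r"^(?P<subject>.+?) at (?P<location>\S+) is within normal range\.$")),
    ("removed", re.compile(r"^(?P<subject>.+?) at (?P<location>\S+) has been removed\.$")),
]


def classify_event(message):
    """Return (kind, subject, location) for a recovery or removal message.

    Returns None if the message matches none of the three forms.
    """
    text = message.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group("subject"), match.group("location")
    return None


if __name__ == "__main__":
    # Hypothetical sample messages built from the forms above.
    for sample in (
        "TACH at /SYS/FANBD0/FM0/F1 is within normal range.",
        "PS0 at /SYS/PS0 has been removed.",
        "Unrelated console message",
    ):
        print(classify_event(sample))
```

Real alert text might carry prefixes such as timestamps, in which case the patterns would need to be loosened.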

The Solaris Predictive Self-Healing technology does not monitor the hard drive for faults. As a result, the service processor does not recognize hard drive faults, and will not light the fault LEDs on either the chassis or the hard drive itself. Use the Solaris message files to view hard drive faults. See Collecting Information From Solaris OS Files and Commands.

1.4.1 Using the ALOM CMT Compatibility CLI in ILOM

There are three methods of interacting with the service processor:

ILOM CLI (default)

ILOM browser interface (BI)

ALOM CMT compatibility CLI (ALOM CMT CLI in ILOM)

Note - The examples in this section use the ALOM CMT compatibility CLI.

The ALOM CMT CLI emulates the ALOM CMT interface supported on the previous generation of CMT servers. With few exceptions, the commands you use in the ALOM CMT CLI resemble the ALOM CMT commands. The ILOM CLI and the ALOM CMT compatibility CLI are compared in the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server.

The service processor sends alerts to all logged-in ALOM CMT CLI users, sends the alert by email to a configured email address, and writes the event to the ILOM event log.

1.4.2 Creating an ALOM CMT CLI Shell

To create an ALOM CMT CLI shell, do the following:

1. Log in to the service processor with username: root.

When powered on, the service processor boots to the ILOM login prompt. The factory default password is changeme.

SUNSPxxxxxxxxxxxx login: root
Password:
Waiting for daemons to initialize...
 
Daemons ready
 
Sun(TM) Integrated Lights Out Manager
 
Version 2.0.0.0
 
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
 
Warning: password is set to factory default.

2. Create a new user, set the account role to Administrator and the CLI mode to alom.

-> create /SP/users/admin
Creating user...
Enter new password: ********
Enter new password again: ********
Created /SP/users/admin
-> set /SP/users/admin role=Administrator
Set 'role' to 'Administrator'
-> set /SP/users/admin cli_mode=alom
Set 'cli_mode' to 'alom'

Note - The asterisks in the example will not appear when you enter your password.

You can combine the create and set commands on a single line:

-> create /SP/users/admin role=Administrator cli_mode=alom
Creating user...
Enter new password: ********
Enter new password again: ********
Created /SP/users/admin

3. Log out of the root account after you have finished creating the new account.

-> exit

4. Log in to the ALOM CMT CLI (indicated by the sc> prompt) from the ILOM login prompt.

SUNSPxxxxxxxxxxxx login: admin
Password:
Waiting for daemons to initialize...
 
Daemons ready
 
Sun(TM) Integrated Lights Out Manager
 
Version 2.0.0.0
 
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
 
sc>

Note - Multiple service processor accounts can be active concurrently. A user can be logged in under one account using the ILOM CLI, and another account using the ALOM CMT CLI.

1.4.3 Running ALOM CMT CLI Service-Related Commands

This section describes commands commonly used for service-related activities.

1.4.3.1 Connecting to ALOM CMT CLI

Before you can run ALOM CMT CLI commands, you must connect to the service processor in one of two ways:

Connect an ASCII terminal directly to the serial management port.

Use the ssh command to connect to the service processor through an Ethernet connection on the network management port.

Note - Refer to the Sun Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server for instructions on configuring and connecting to the service processor.

1.4.3.2 Switching Between the System Console and Service Processor

To switch from the console output to the ALOM CMT CLI sc> prompt, type #. (Hash-Period).

To switch from the sc> prompt to the console, type console.

1.4.3.3 Service-Related ALOM CMT CLI Commands

TABLE 1-6 describes the typical ALOM CMT CLI commands for servicing a server. For descriptions of all ALOM CMT CLI commands, issue the help command or refer to the Integrated Lights Out Management User’s Guide.

TABLE 1-6 Service-Related ALOM CMT CLI Commands
ALOM CMT Command	Description
`help` [command]	Displays a list of all ALOM CMT CLI commands with syntax and descriptions. Specifying a command name as an option displays help for that command.
`break` [`-y`][`-c`][`-D`]	Takes the host server from the OS to either `kmdb` or OpenBoot PROM (equivalent to a Stop-A), depending on the mode in which the Solaris software was booted. `-y` skips the confirmation question. `-c` executes a `console` command after the `break` command completes. `-D` forces a core dump of the Solaris OS.
`clearfault` UUID	Manually clears host-detected faults. The UUID is the unique fault ID of the fault to be cleared.
`console` [`-f`]	Connects you to the host system. The `-f` option forces the console to have read and write capabilities.
`consolehistory` [`-b` lines | `-e` lines | `-v`] [`-g` lines] [`boot` | `run`]	Displays the contents of the system’s console buffer. The following options enable you to specify how the output is displayed: `-g` lines specifies the number of lines to display before pausing. `-e` lines displays n lines from the end of the buffer. `-b` lines displays n lines from the beginning of the buffer. `-v` displays the entire buffer. `boot` | `run` specifies the log to display (`run` is the default log).
`bootmode` [`normal` | `reset_nvram` | `bootscript=`string]	Enables control of the firmware during system initialization with the following options: `normal` is the default boot mode. `reset_nvram` resets OpenBoot PROM parameters to their default values. `bootscript=`string enables the passing of a string to the `boot` command.
`powercycle` [`-f`]	Performs a `poweroff` followed by `poweron`. The `-f` option forces an immediate `poweroff`, otherwise the command attempts a graceful shutdown.
`poweroff` [`-y`] [`-f`]	Powers off the host server. The `-y` option enables you to skip the confirmation question. The `-f` option forces an immediate shutdown.
`poweron` [`-c`]	Powers on the host server. Using the `-c` option executes a `console` command after completion of the `poweron` command.
`removefru` `PS0` | `PS1`	Indicates if it is okay to perform a hot-swap of a power supply. This command does not perform any action, but it provides a warning if the power supply should not be removed because the other power supply is not enabled.
`reset` [`-y`] [`-c`]	Generates a hardware reset on the host server. The `-y` option enables you to skip the confirmation question. The `-c` option executes a `console` command after completion of the `reset` command.
`resetsc` [`-y`]	Reboots the service processor. The `-y` option enables you to skip the confirmation question.
`setkeyswitch` [`-y`] `normal` | `stby` | `diag` | `locked`	Sets the virtual keyswitch. The `-y` option enables you to skip the confirmation question when setting the keyswitch to `stby`.
`setlocator` [`on` | `off`]	Turns the Locator LED on the server on or off.
`showenvironment`	Displays the environmental status of the host server. This information includes system temperatures, power supply, front panel LED, hard drive, fan, voltage, and current sensor status. See Displaying the Server’s Environmental Status.
`showfaults` [`-v`]	Displays current system faults. See Displaying System Faults.
`showfru` [`-g` lines] [`-s` | `-d`] [FRU]	Displays information about the FRUs in the server. `-g` lines specifies the number of lines to display before pausing the output to the screen. `-s` displays static information about system FRUs (defaults to all FRUs, unless one is specified). `-d` displays dynamic information about system FRUs (defaults to all FRUs, unless one is specified). See Displaying FRU Information.
`showkeyswitch`	Displays the status of the virtual keyswitch.
`showlocator`	Displays the current state of the Locator LED as either on or off.
`showlogs` [`-b` lines | `-e` lines | `-v`] [`-g` lines] [`-p` logtype[`r`|`p`]]	Displays the history of all events logged in the ALOM CMT event buffers (in RAM or the persistent buffers).
`showplatform` [`-v`]	Displays information about the host system’s hardware configuration, the system serial number, and whether the hardware is providing service.

Note - See TABLE 1-10 for the ALOM CMT CLI automatic system recovery (ASR) commands.

1.4.4 Displaying System Faults

The ALOM CMT CLI showfaults command displays the following kinds of faults:

Environmental or configuration faults - System configuration faults, or temperature or voltage problems that might be caused by faulty FRUs (power supplies, fans, or blower), or by room temperature or blocked air flow to the server.

POST detected faults - Faults on devices detected by the power-on self-test diagnostics.

PSH detected faults - Faults detected by the Solaris Predictive Self-Healing (PSH) technology.

Use the showfaults command for the following reasons:

To see if any faults have been diagnosed in the system.

To verify that the replacement of a FRU has cleared the fault and not generated any additional faults.

At the sc> prompt, type the showfaults command.

The following showfaults command examples show the different kinds of output from the showfaults command:

Example of the showfaults command when no faults are present:

sc> showfaults
Last POST run: THU MAR 09 16:52:44 2006
POST status: Passed all devices
 
No failures found in System

Example of the showfaults command displaying an environmental fault:

sc> showfaults
Last POST Run: Wed Jul 18 11:44:47 2007
 
Post Status: Passed all devices
 ID FRU               Fault
  0 /SYS/FANBD0/FM0   SP detected fault: TACH at /SYS/FANBD0/FM0/F1 has exceeded low non-recoverable threshold.

Example showing a fault that was detected by POST. These kinds of faults are identified by the message Forced fail reason where reason is the name of the power-on routine that detected the failure.

sc> showfaults
Last POST Run: Wed Jun 27 21:29:02 2007
 
Post Status: Passed all devices
 ID FRU                     Fault
  0 /SYS/MB/CMP0/BR3/CH1/D1 SP detected fault: /SYS/MB/CMP0/BR3/CH1/D1 Forced fail (POST)

Example showing a fault that was detected by the PSH technology. These kinds of faults are identified by the text Host detected fault and by a UUID.

sc> showfaults -v
Last POST Run: Wed Jun 29 11:29:02 2007
 
Post Status: Passed all devices
 ID  Time                 FRU         Fault
  0  Jun 30 22:13:02      /SYS/MB     Host detected fault, MSGID: SUN4V-8000-N3  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
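When showfaults output has been captured to a file or a terminal log, the distinguishing text described above (Host detected fault, Forced fail) can be used to sort faults by source. The following is a minimal sketch that assumes the row layout shown in these examples: an ID, an optional time (with -v), a FRU path beginning with /SYS, and the fault text. The function names are hypothetical.

```python
import re

# One fault row: ID, optional time column (present with -v), FRU path, fault text.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<time>.*?)\s*(?P<fru>/SYS\S*)\s+(?P<fault>.+)$"
)
UUID = re.compile(
    r"UUID:\s*(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def fault_source(fault_text):
    """Classify a fault using the identifying text described in this section."""
    if "Host detected fault" in fault_text:
        return "PSH"
    if "Forced fail" in fault_text:
        return "POST"
    return "environmental or configuration"


def parse_showfaults(output):
    """Return a list of dicts, one per fault row in captured showfaults output."""
    faults = []
    for line in output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, POST status, or blank line
        fault = match.group("fault").strip()
        uuid = UUID.search(fault)
        faults.append({
            "id": int(match.group("id")),
            "time": match.group("time") or None,
            "fru": match.group("fru"),
            "source": fault_source(fault),
            "uuid": uuid.group("uuid") if uuid else None,
            "fault": fault,
        })
    return faults


SAMPLE = """\
Last POST Run: Wed Jun 29 11:29:02 2007

Post Status: Passed all devices
 ID  Time                 FRU         Fault
  0  Jun 30 22:13:02      /SYS/MB     Host detected fault, MSGID: SUN4V-8000-N3  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
"""

if __name__ == "__main__":
    for entry in parse_showfaults(SAMPLE):
        print(entry["fru"], entry["source"], entry["uuid"])
```

The UUID extracted for a PSH detected fault is the value that the clearfault command expects.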

1.4.5 Manually Clearing PSH Diagnosed Faults

The ALOM CMT CLI clearfault command enables you to manually clear PSH diagnosed faults from the service processor, either when no FRU was replaced or when the service processor was unable to detect the FRU replacement automatically.

At the sc> prompt, type the clearfault command.

Example showing a fault being cleared manually using the clearfault command:

sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
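Because clearfault takes the UUID as its only argument, a mistyped UUID is an easy error to make. If you script service steps, a small check such as the following can validate the UUID format before the command line is sent to the sc> prompt. This is a sketch only. The function name is hypothetical, and the check covers the format of the UUID, not whether the fault exists.

```python
import re

# Format of the UUID shown in the showfaults -v example: 8-4-4-4-12 hex digits.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def clearfault_command(fault_uuid):
    """Return the clearfault command line for a well-formed UUID.

    Raises ValueError if the string does not look like a UUID.
    """
    fault_uuid = fault_uuid.strip()
    if not UUID_RE.fullmatch(fault_uuid):
        raise ValueError(f"not a well-formed UUID: {fault_uuid!r}")
    return f"clearfault {fault_uuid}"


if __name__ == "__main__":
    print(clearfault_command("7ee0e46b-ea64-6565-e684-e996963f7b86"))
```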

1.4.6 Displaying the Server’s Environmental Status

The showenvironment command displays a snapshot of the server’s environmental status. This command displays system temperatures, hard drive status, power supply and fan status, front panel LED status, and voltage and current sensors. The output uses a format similar to that of the Solaris OS command prtdiag(1M).

At the sc> prompt, type the showenvironment command.

The output differs according to your system’s model and configuration.
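If you capture showenvironment output to a file, the temperature rows can be checked against their own warning thresholds. The following is a minimal sketch that assumes temperature rows have the nine columns shown in the System Temperatures header (Sensor, Status, Temp, and six thresholds). The function names are hypothetical, and treating a reading equal to a warning threshold as out of range is an assumption, not documented firmware behavior.

```python
# Threshold column names, in the order of the "System Temperatures" header.
FIELDS = ("LowHard", "LowSoft", "LowWarn", "HighWarn", "HighSoft", "HighHard")


def parse_temperatures(output):
    """Return one dict per temperature row found in showenvironment output."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Temperature rows have 9 fields: sensor, status, temp, 6 thresholds.
        if len(parts) != 9 or not parts[0].startswith("/SYS/"):
            continue
        try:
            values = [int(p) for p in parts[2:]]
        except ValueError:
            continue  # some other table row, not a temperature row
        row = {"sensor": parts[0], "status": parts[1], "temp": values[0]}
        row.update(zip(FIELDS, values[1:]))
        rows.append(row)
    return rows


def outside_warning_range(row):
    """True if the reading is at or beyond either warning threshold (assumed)."""
    return row["temp"] <= row["LowWarn"] or row["temp"] >= row["HighWarn"]


SAMPLE = """\
Sensor                         Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
/SYS/MB/T_AMB                  OK         29   -10     -5      0      50      55      60
/SYS/MB/CMP0/T_TCORE           OK         50   -14     -9     -4      86      96     106
/SYS/HDD0  OK               OFF           OFF
"""

if __name__ == "__main__":
    for row in parse_temperatures(SAMPLE):
        print(row["sensor"], row["temp"], outside_warning_range(row))
```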

EXAMPLE 1-1 shows abridged output of the showenvironment command.

EXAMPLE 1-1 showenvironment Command Output

sc> showenvironment
 
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor                         Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
/SYS/MB/T_AMB                  OK         29   -10     -5      0      50      55      60
/SYS/MB/CMP0/T_TCORE           OK         50   -14     -9     -4      86      96     106
/SYS/MB/CMP0/T_BCORE           OK         51   -14     -9     -4      86      96     106
/SYS/MB/CMP0/BR0/CH0/D0/T_AMB  OK         41   -10     -8     -5      95     100     105
...
------------------------------------------------------------------------------
System Indicator Status:
------------------------------------------------------------------------------
/SYS/LOCATE          /SYS/SERVICE         /SYS/ACT            
OFF                  OFF                  ON                  
------------------------------------------------------------------------------
/SYS/PSU_FAULT       /SYS/TEMP_FAULT      /SYS/FAN_FAULT      
OFF                  OFF                  OFF                 
 
------------------------------------------------------------------------------
System Disks:
------------------------------------------------------------------------------
Disk      Status           Service        OK2RM
------------------------------------------------------------------------------
/SYS/HDD0  OK               OFF           OFF     
/SYS/HDD1  NOT PRESENT      OFF           OFF     
...
------------------------------------------------------------------------------
Fan Status:
------------------------------------------------------------------------------
Fans (Speeds Revolution Per Minute):
Sensor                    Status       Speed     Warn      Low
------------------------------------------------------------------------------
/SYS/FANBD0/FM0/F0/TACH   OK            7000     4000     2400
...
------------------------------------------------------------------------------
Voltage sensors (in Volts):
------------------------------------------------------------------------------
Sensor               Status     Voltage LowSoft LowWarn HighWarn HighSoft
------------------------------------------------------------------------------
/SYS/MB/V_+3V3_STBY  OK         3.39    3.13     3.17     3.53      3.58
...
------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply     Status            Fan_Fault  Temp_Fault  Volt_Fault  Cur_Fault
------------------------------------------------------------------------------
/SYS/PS0    OK                   OFF       OFF          OFF         OFF
...

Note - Some environmental information might not be available when the server is in standby mode.
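When the showenvironment output is captured to a file (for example, from a logged service processor session), the temperature table can be checked with a script. The following Python sketch is illustrative only and is not part of ALOM CMT. It assumes the nine-column row layout shown in EXAMPLE 1-1, and the function names parse_temperatures and headroom are hypothetical.

```python
# Sample rows copied from the temperature section of EXAMPLE 1-1.
SAMPLE = """\
/SYS/MB/T_AMB                  OK         29   -10     -5      0      50      55      60
/SYS/MB/CMP0/T_TCORE           OK         50   -14     -9     -4      86      96     106
/SYS/MB/CMP0/T_BCORE           OK         51   -14     -9     -4      86      96     106
/SYS/MB/CMP0/BR0/CH0/D0/T_AMB  OK         41   -10     -8     -5      95     100     105
"""

# Column names after the sensor path and status word (assumed layout).
COLUMNS = ("temp", "low_hard", "low_soft", "low_warn",
           "high_warn", "high_soft", "high_hard")


def parse_temperatures(text):
    """Return one dict per line that looks like a temperature row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Temperature rows have a /SYS/ path, a status, and seven integers.
        if len(fields) != 9 or not fields[0].startswith("/SYS/"):
            continue
        try:
            values = [int(v) for v in fields[2:]]
        except ValueError:
            continue  # for example, a voltage row with decimal values
        row = {"sensor": fields[0], "status": fields[1]}
        row.update(zip(COLUMNS, values))
        rows.append(row)
    return rows


def headroom(rows):
    """Map each sensor to the degrees remaining before HighWarn."""
    return {r["sensor"]: r["high_warn"] - r["temp"] for r in rows}


if __name__ == "__main__":
    for sensor, margin in headroom(parse_temperatures(SAMPLE)).items():
        print(sensor, margin)
```

Rows from the disk, fan, and voltage sections do not have nine fields, so this sketch skips them.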

1.4.7 Displaying FRU Information

The showfru command displays information about the FRUs in the server. Use this command to see information about an individual FRU, or for all the FRUs.

Note - By default, the output of the showfru command for all FRUs is very long.

At the sc> prompt, enter the showfru command.

In the following example, the showfru command is used to get information about the motherboard (MB).

sc> showfru /SYS/MB
/SYS/MB (container)
   SEGMENT: FL
      /Configured_LevelR
      /Configured_LevelR/UNIX_Timestamp32: Thu Jun  7 20:12:17 GMT 2007
      /Configured_LevelR/Sun_Part_No: 5412153
      /Configured_LevelR/Configured_Serial_No: BBX053
      /Configured_LevelR/Initial_HW_Dash_Level: 02
   SEGMENT: FD
      /InstallationR (1 iterations)
      /InstallationR[0]
      /InstallationR[0]/UNIX_Timestamp32: Thu Jun 21 19:37:57 GMT 2007
      /InstallationR[0]/Fru_Path: /SYS/MB
      /InstallationR[0]/Parent_Part_Number: 5017813
      /InstallationR[0]/Parent_Serial_Number: 110508
      /InstallationR[0]/Parent_Dash_Level: 01
      /InstallationR[0]/System_Id: 0721BBB050
      /InstallationR[0]/System_Tz: 0
...

1.5 Running POST

Power-on self-test (POST) is a group of PROM-based tests that run when the server is powered on or reset. POST checks the basic integrity of the critical hardware components in the server (CPU, memory, and I/O buses).

If POST detects a faulty component, the component is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core will be disabled, and the system will boot and run using the remaining cores.

1.5.1 Controlling How POST Runs

The server can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using ALOM CMT CLI variables.

TABLE 1-7 lists the ALOM CMT CLI variables used to configure POST. FIGURE 1-8 shows how the variables work together.

Note - Use the ALOM CMT CLI setsc command to set all the parameters in TABLE 1-7 except setkeyswitch.

TABLE 1-7 ALOM CMT CLI Parameters Used for POST Configuration
Parameter	Values	Description
setkeyswitch	`normal`	The system can power on and run POST (based on the other parameter settings). For details see FIGURE 1-8. This parameter overrides all other commands.
	`diag`	The system runs POST based on predetermined settings.
	`stby`	The system cannot power on.
	`locked`	The system can power on and run POST, but no flash updates can be made.
`diag_mode`	`off`	POST does not run.
	`normal`	Runs POST according to `diag_level` value.
	`service`	Runs POST with preset values for `diag_level` and `diag_verbosity`.
`diag_level`	`max`	If `diag_mode` = `normal`, runs all the minimum tests plus extensive CPU and memory tests.
	`min`	If `diag_mode` = `normal`, runs minimum set of tests.
`diag_trigger`	`none`	Does not run POST on reset.
	`user_reset`	Runs POST upon user-initiated resets.
	`power_on_reset`	Only runs POST for the first power on. This option is the default.
	`error_reset`	Runs POST if fatal errors are detected.
	`all_resets`	Runs POST after any reset.
`diag_verbosity`	`none`	No POST output is displayed.
	`min`	POST output displays functional tests with a banner and pinwheel.
	`normal`	POST output displays all test and informational messages.
	`max`	POST displays all test, informational, and some debugging messages.

FIGURE 1-8 Flowchart of ALOM CMT CLI Variables for POST Configuration

Figure showing POST flow chart

TABLE 1-8 shows typical combinations of ALOM CMT CLI variables and associated POST modes.

TABLE 1-8 ALOM CMT CLI Parameters and POST Modes
Parameter	Normal Diagnostic Mode (Default Settings)	No POST Execution	Diagnostic Service Mode	Keyswitch Diagnostic Preset Values
`diag_mode`	`normal`	`off`	`service`	`normal`
`setkeyswitch^[1]`	`normal`	`normal`	`normal`	`diag`
`diag_level`	`max`	n/a	`max`	`max`
`diag_trigger`	`power-on-reset error-reset`	`none`	`all-resets`	`all-resets`
`diag_verbosity`	`normal`	n/a	`max`	`max`
Description of POST execution	This is the default POST configuration. This configuration tests the system thoroughly, and suppresses some of the detailed POST output.	POST does not run, resulting in quick system initialization. This is not a suggested configuration.	POST runs the full spectrum of tests with the maximum output displayed.	POST runs the full spectrum of tests with the maximum output displayed.
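The way the variables in TABLE 1-7 and TABLE 1-8 combine can be sketched as a small decision function. This Python sketch is one reading of the two tables, not a reproduction of FIGURE 1-8 or of the firmware logic. It ignores diag_trigger, and the function name effective_post_settings is hypothetical.

```python
def effective_post_settings(keyswitch, diag_mode, diag_level, diag_verbosity):
    """Return the POST level and verbosity implied by the tables.

    Returns None when the system cannot power on (keyswitch stby).
    This is an interpretation of TABLE 1-7 and TABLE 1-8 only.
    """
    if keyswitch == "stby":
        return None  # TABLE 1-7: the system cannot power on
    if keyswitch == "diag":
        # TABLE 1-8, keyswitch preset column: max level, max verbosity.
        return {"post_enabled": True, "level": "max", "verbosity": "max"}
    if diag_mode == "off":
        return {"post_enabled": False, "level": None, "verbosity": None}
    if diag_mode == "service":
        # TABLE 1-7: service mode uses preset level and verbosity values.
        return {"post_enabled": True, "level": "max", "verbosity": "max"}
    # diag_mode normal: POST follows diag_level and diag_verbosity.
    return {"post_enabled": True, "level": diag_level,
            "verbosity": diag_verbosity}


if __name__ == "__main__":
    print(effective_post_settings("normal", "normal", "max", "normal"))
    print(effective_post_settings("diag", "off", "min", "none"))
```

The keyswitch check comes first because TABLE 1-7 states that setkeyswitch overrides the other settings.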

1.5.2 Changing POST Parameters

1. Access the ALOM CMT CLI sc> prompt:

At the console, issue the #. key sequence:

#.

2. Use the ALOM CMT CLI sc> prompt to change the POST parameters.

Refer to TABLE 1-7 for a list of ALOM CMT CLI POST parameters and their values.

The setkeyswitch parameter sets the virtual keyswitch, so this parameter does not use the setsc command. For example, to change the POST parameters using the setkeyswitch command, enter the following:

sc> setkeyswitch diag

To change the POST parameters using the setsc command, you must first set the setkeyswitch parameter to normal. Then you can change the POST parameters using the setsc command:

sc> setkeyswitch normal
sc> setsc parameter value

For example:

sc> setkeyswitch normal
sc> setsc diag_mode service

1.5.3 Reasons to Run POST

You can use POST for basic hardware verification and diagnosis, and for troubleshooting as described in the following sections.

1.5.3.1 Verifying Hardware Functionality

POST tests critical hardware components to verify functionality before the system boots and accesses software. If POST detects an error, the faulty component is disabled automatically, preventing faulty hardware from potentially harming software.

1.5.3.2 Diagnosing the System Hardware

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in maximum mode (diag_mode=service, setkeyswitch=diag, diag_level=max) for thorough test coverage and verbose output.

1.5.4 Running POST in Maximum Mode

This procedure describes how to run POST when you want maximum testing, as in the case when you are troubleshooting a server or verifying a hardware upgrade or repair.

1. Switch from the system console prompt to the sc> prompt by issuing the #. escape sequence.

ok #.
sc>

2. Set the virtual keyswitch to diag so that POST will run in service mode.

sc> setkeyswitch diag

3. Reset the system so that POST runs.

There are several ways to initiate a reset. EXAMPLE 1-2 shows the powercycle command. For other methods, refer to the Sun Netra T5220 Server Administration Guide.

EXAMPLE 1-2 Initiating POST Using the powercycle Command

sc> powercycle
Are you sure you want to powercycle the system (y/n)? y
Powering host off at Fri Jul 27 08:11:52 2007
Waiting for host to Power Off; hit any key to abort.
Audit | minor: admin : Set : object = /SYS/power_state : value = soft : success
Chassis | critical: Host has been powered off
Powering host on at Fri Jul 27 08:13:08 2007
Audit | minor: admin : Set : object = /SYS/power_state : value = on : success
Chassis | major: Host has been powered on

4. Switch to the system console to view the POST output:

sc> console

EXAMPLE 1-3 depicts abridged POST output.

EXAMPLE 1-3 POST Output (Abridged)

sc> console
Enter #. to return to ALOM.
2007-07-03 10:25:12.081 0:0:0>@(#)Sun Netra[TM] T5220 POST 4.x.build_119 2007/06/06 09:48 
/export/delivery/delivery/4.x/4.x.build_119/post4.x/UltraSPARC/NetraT5220/integrated  (root)  
2007-07-03 10:25:12.386 0:0:0>Copyright 2007 Sun Microsystems, Inc. All rights reserved
2007-07-03 10:25:12.550 0:0:0>VBSC cmp0 arg is: 00ff00ff.ffffffff
2007-07-03 10:25:12.653 0:0:0>POST enabling threads: 00ff00ff.ffffffff
2007-07-03 10:25:12.766 0:0:0>VBSC mode is: 00000000.00000001
2007-07-03 10:25:12.867 0:0:0>VBSC level is: 00000000.00000001
2007-07-03 10:25:12.966 0:0:0>VBSC selecting POST MAX Testing.
2007-07-03 10:25:13.066 0:0:0>VBSC setting verbosity level 3
2007-07-03 10:25:13.161 0:0:0>	UltraSPARCT2, Version 2.1
2007-07-03 10:25:13.247 0:0:0>	Serial Number: 0fac006b.0e654482
2007-07-03 10:25:13.353 0:0:0>Basic Memory Tests.....
2007-07-03 10:25:13.456 0:0:0>Begin: Branch Sanity Check
2007-07-03 10:25:13.569 0:0:0>End  : Branch Sanity Check
2007-07-03 10:25:13.668 0:0:0>Begin: DRAM Memory BIST
2007-07-03 10:25:13.793 0:0:0>................................................................................................
2007-07-03 10:25:38.399 0:0:0>End  : DRAM Memory BIST
2007-07-03 10:25:39.547 0:0:0>Sys 166 MHz, CPU 1166 MHz, Mem 332 MHz 
2007-07-03 10:25:39.658 0:0:0>L2 Bank EFuse = 00000000.000000ff 
2007-07-03 10:25:39.760 0:0:0>L2 Bank status = 00000000.00000f0f 
2007-07-03 10:25:39.864 0:0:0>Core available Efuse = ffff00ff.ffffffff 
2007-07-03 10:25:39.982 0:0:0>Test Memory.....
2007-07-03 10:25:40.070 0:0:0>Begin: Probe and Setup Memory
2007-07-03 10:25:40.181 0:0:0>INFO:	  4096MB at Memory Branch 0 
...
 
2007-07-03 10:29:21.683 0:0:0>INFO:
2007-07-03 10:29:21.686 0:0:0>	POST Passed all devices.
2007-07-03 10:29:21.692 0:0:0>POST:	Return to VBSC.

5. Perform further investigation if needed.

If no faults were detected, the system will boot.

If POST detects a faulty device, the fault is displayed and the fault information is passed to ALOM CMT CLI for fault handling. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.

a. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number, s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

In EXAMPLE 1-4, POST reports a memory error at FB-DIMM location /SYS/MB/CMP0/BR2/CH0/D0. The error was detected by POST running on core 7, strand 2.

EXAMPLE 1-4 POST Error Message

7:2>
7:2>ERROR: TEST = Data Bitwalk
7:2>H/W under test = /SYS/MB/CMP0/BR2/CH0/D0
7:2>Repair Instructions: Replace items in order listed by 'H/W under test' above.
7:2>MSG = Pin 149 failed on /SYS/MB/CMP0/BR2/CH0/D0 (J2001)
7:2>END_ERROR
 
7:2>Decode of Dram Error Log Reg Channel 2 bits 60000000.0000108c
7:2> 1 MEC 62 R/W1C Multiple corrected errors, one or more CE not logged
7:2> 1 DAC 61 R/W1C Set to 1 if the error was a DRAM access CE
7:2> 108c SYND 15:0 RW ECC syndrome.
7:2>
7:2> Dram Error AFAR channel 2 = 00000000.00000000
7:2> L2 AFAR channel 2 = 00000000.00000000

b. Run the showfaults command to obtain additional fault information.

The fault is captured by ALOM CMT CLI, where the fault is logged, the Service Required LED is lit, and the faulty component is disabled.

Example:

EXAMPLE 1-5 showfaults Output

ok #.
sc> showfaults
Last POST Run: Wed Jun 27 21:29:02 2007
Post Status: Passed all devices
ID FRU                      Fault
0  /SYS/MB/CMP0/BR2/CH0/D0  SP detected fault: /SYS/MB/CMP0/BR2/CH0/D0 Forced fail (POST)

In this example, /SYS/MB/CMP0/BR2/CH0/D0 is disabled. The system can boot using memory that was not disabled until the faulty component is replaced.

Note - You can use ASR commands to display and control disabled components. See Managing Components With Automatic System Recovery Commands.
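When POST output is captured from the console, the error syntax described in Step 5a can be picked apart with a short script. The following Python sketch is illustrative only. It assumes the c:s> prefix form shown in EXAMPLE 1-4 (it does not handle the timestamped prefix shown in EXAMPLE 1-3), and the function name parse_post_errors is hypothetical.

```python
import re

# Lines copied from EXAMPLE 1-4, with wrapped lines joined.
SAMPLE = """\
7:2>
7:2>ERROR: TEST = Data Bitwalk
7:2>H/W under test = /SYS/MB/CMP0/BR2/CH0/D0
7:2>Repair Instructions: Replace items in order listed by 'H/W under test' above.
7:2>MSG = Pin 149 failed on /SYS/MB/CMP0/BR2/CH0/D0 (J2001)
7:2>END_ERROR
"""

# core:strand> followed by the message body.
PREFIX = re.compile(r"^(\d+):(\d+)\s*>\s*(.*)$")


def parse_post_errors(text):
    """Return a list of dicts, one per ERROR ... END_ERROR block."""
    errors = []
    current = None
    for line in text.splitlines():
        match = PREFIX.match(line)
        if not match:
            continue
        body = match.group(3)
        if body.startswith("ERROR: TEST ="):
            current = {
                "core": int(match.group(1)),
                "strand": int(match.group(2)),
                "test": body.split("=", 1)[1].strip(),
            }
        elif current is None:
            continue  # ignore lines outside an error block
        elif body.startswith("H/W under test ="):
            current["fru"] = body.split("=", 1)[1].strip()
        elif body.startswith("MSG ="):
            current["msg"] = body.split("=", 1)[1].strip()
        elif body.startswith("END_ERROR"):
            errors.append(current)
            current = None
    return errors


if __name__ == "__main__":
    for error in parse_post_errors(SAMPLE):
        print(error)
```

The fru value is the FRU name to pass to commands such as showfaults or enablecomponent.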

1.5.5 Clearing POST Detected Faults

In most cases, when POST detects a faulty component, POST logs the fault and automatically takes the failed component out of operation by placing the component in the ASR blacklist (see Managing Components With Automatic System Recovery Commands).

In most cases, the replacement of the faulty FRU is detected when the service processor is reset or power cycled. In this case, the fault is automatically cleared from the system. This procedure describes how to identify POST detected faults and, if necessary, manually clear the fault.

1. After replacing a faulty FRU, at the ALOM CMT CLI prompt use the showfaults command to identify POST detected faults.

POST detected faults are distinguished from other kinds of faults by the text:
Forced fail, and no UUID number is reported.

Example:

EXAMPLE 1-6 POST Detected Fault

sc> showfaults
Last POST Run: Wed Jun 27 21:29:02 2007
Post Status: Passed all devices
ID FRU                      Fault
0  /SYS/MB/CMP0/BR2/CH0/D0  SP detected fault: /SYS/MB/CMP0/BR2/CH0/D0 Forced fail (POST)

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

2. Use the enablecomponent command to clear the fault and remove the component from the ASR blacklist.

Use the FRU name that was reported in the fault in Step 1.

EXAMPLE 1-7 Using the enablecomponent Command

sc> enablecomponent /SYS/MB/CMP0/BR2/CH0/D0

The fault is cleared and should not show up when you run the showfaults command. Additionally, the Service Required LED is no longer on.

3. Power cycle the server.

You must reboot the server for the enablecomponent command to take effect.

4. At the ALOM CMT CLI prompt, use the showfaults command to verify that no faults are reported.

TABLE 1-9 Verifying Cleared Faults Using the showfaults Command

sc> showfaults
Last POST run: THU MAR 09 16:52:44 2006
POST status: Passed all devices
No failures found in System

1.6 Using the Solaris Predictive Self-Healing Feature

The Solaris Predictive Self-Healing (PSH) technology enables the server to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.

The Predictive Self-Healing technology covers the following server components:

UltraSPARC® T2 multicore processor

Memory

I/O bus

The PSH console message provides the following information:

Type

Severity

Description

Automated response

Impact

Suggested action for system administrator

If the Solaris PSH facility detects a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 2-1.

1.6.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris console message similar to EXAMPLE 1-8 is displayed.

EXAMPLE 1-8 Console Message Showing Fault Detected by PSH

SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-T5220, CSN: -, HOSTNAME: hostname
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.
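If console messages are collected for later review, the fields of a PSH message such as EXAMPLE 1-8 can be extracted with a script. The following Python sketch is illustrative only and assumes the KEY: value layout shown in the example. The function name parse_psh_message and the FREE_TEXT set are hypothetical.

```python
import re

# Lines copied from EXAMPLE 1-8 (abridged).
SAMPLE = """\
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-T5220, CSN: -, HOSTNAME: hostname
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
IMPACT: Total system memory capacity will be reduced as pages are retired.
"""

# Keys whose value is a sentence and must not be split on commas.
FREE_TEXT = {"DESC", "AUTO-RESPONSE", "IMPACT", "REC-ACTION"}

# A comma and space that is followed by another upper-case KEY: marker.
PAIR_SPLIT = re.compile(r", (?=[A-Z][A-Z-]*: )")


def parse_psh_message(text):
    """Return a dict mapping field names to values."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(": ")
        if not sep:
            continue  # not a KEY: value line
        if key in FREE_TEXT:
            fields[key] = rest.strip()
            continue
        for pair in PAIR_SPLIT.split(line):
            name, _, value = pair.partition(": ")
            fields[name] = value.strip()
    return fields


if __name__ == "__main__":
    message = parse_psh_message(SAMPLE)
    print(message["SUNW-MSG-ID"], message["EVENT-ID"])
```

The EVENT-ID value is the UUID to pass to fmdump -v -u, as the REC-ACTION line in the example suggests.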

Faults detected by the Solaris PSH facility are also reported through service processor alerts. EXAMPLE 1-9 depicts an ALOM CMT CLI alert of the same fault reported by Solaris PSH in EXAMPLE 1-8.

EXAMPLE 1-9 ALOM CMT CLI Alert of PSH Diagnosed Fault
SC Alert: Host detected fault, MSGID: SUN4V-8000-DX

The ALOM CMT CLI showfaults command provides summary information about the fault. See Displaying System Faults for more information about the showfaults command.

Note - The Service Required LED also turns on for PSH diagnosed faults.

1.6.1.1 Using the `fmdump` Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).

Do not use fmdump to verify a FRU replacement has cleared a fault because the output of fmdump is the same after the FRU has been replaced. Use the fmadm faulty command to verify the fault has cleared.

1. Check the event log using the fmdump command with -v for verbose output:

EXAMPLE 1-10 Output from the fmdump -v Command

# fmdump -v -u fd940ac2-d21e-c94a-f258-f8a9bb69d05b
TIME                 UUID                                 SUNW-MSG-ID
Jul 31 12:47:42.2007 fd940ac2-d21e-c94a-f258-f8a9bb69d05b SUN4V-8000-JA
  100%  fault.cpu.ultraSPARC-T2.misc_regs
 
        Problem in: cpu:///cpuid=16/serial=5D67334847
           Affects: cpu:///cpuid=16/serial=5D67334847
               FRU: hc://:serial=101083:part=541215101/motherboard=0
          Location: MB

In EXAMPLE 1-10, a fault is displayed, indicating the following details:

Date and time of the fault (Jul 31 12:47:42.2007)

Universal Unique Identifier (UUID). This is unique for every fault (fd940ac2-d21e-c94a-f258-f8a9bb69d05b)

Sun message identifier, which can be used to obtain additional fault information (SUN4V-8000-JA)

Faulted FRU. The information provided in the example includes the part number of the FRU (part=541215101) and the serial number of the FRU (serial=101083). The Location field provides the name of the FRU. In EXAMPLE 1-10 the FRU name is MB, meaning the motherboard.

Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.

2. Use the Sun message ID to obtain more information about this type of fault.

a. Obtain the message ID from the console output or the ALOM CMT CLI showfaults command.

b. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

In EXAMPLE 1-11, the message ID SUN4V-8000-JA provides information for corrective action:

EXAMPLE 1-11 PSH Message Output

CPU errors exceeded acceptable levels
 
Type
    Fault 
Severity
    Major 
Description
    The number of errors associated with this CPU has exceeded acceptable levels. 
Automated Response
    The fault manager will attempt to remove the affected CPU from service. 
Impact
    System performance may be affected. 
 
Suggested Action for System Administrator
    Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>. 
 
Details
    The Message ID:  SUN4V-8000-JA indicates diagnosis has determined that a CPU is faulty. The Solaris fault manager arranged an automated attempt to disable this CPU. The recommended action for the system administrator is to contact Sun support so a Sun service technician can replace the affected component.

3. Follow the suggested actions to repair the fault.
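The fields called out for EXAMPLE 1-10 (time, UUID, message ID, and FRU part and serial numbers) can also be pulled from saved fmdump -v output with a script. The following Python sketch is illustrative only and assumes the layout shown in EXAMPLE 1-10. The function name parse_fmdump_event is hypothetical.

```python
import re

# Output copied from EXAMPLE 1-10.
SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID
Jul 31 12:47:42.2007 fd940ac2-d21e-c94a-f258-f8a9bb69d05b SUN4V-8000-JA
  100%  fault.cpu.ultraSPARC-T2.misc_regs

        Problem in: cpu:///cpuid=16/serial=5D67334847
           Affects: cpu:///cpuid=16/serial=5D67334847
               FRU: hc://:serial=101083:part=541215101/motherboard=0
          Location: MB
"""

# Timestamp, UUID, and message ID on the event summary line.
EVENT = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:.]+)\s+"
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+(\S+)$"
)

DETAIL_KEYS = ("Problem in", "Affects", "FRU", "Location")


def parse_fmdump_event(text):
    """Return a dict describing the first event found in the text."""
    event = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = EVENT.match(line)
        if match:
            event["time"], event["uuid"], event["msgid"] = match.groups()
            continue
        key, sep, value = line.partition(": ")
        if sep and key in DETAIL_KEYS:
            event[key.lower().replace(" ", "_")] = value.strip()
    # Pull the serial and part numbers out of the FRU string, if present.
    for name in ("serial", "part"):
        found = re.search(name + r"=(\w+)", event.get("fru", ""))
        if found:
            event["fru_" + name] = found.group(1)
    return event


if __name__ == "__main__":
    print(parse_fmdump_event(SAMPLE))
```

Because fmdump reads the event log, the parsed entry remains the same after the FRU is replaced. Use fmadm faulty to confirm that a fault has cleared.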

1.6.2 Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. In most cases, after the fault is repaired, the system detects the corrected state and clears the fault condition automatically. However, you must verify that the fault has cleared and, if it has not, clear the fault manually.

1. After replacing a faulty FRU, power on the server.

2. At the ALOM CMT CLI prompt, use the showfaults command to identify PSH detected faults.

PSH detected faults are distinguished from other kinds of faults by the text:
Host detected fault.

Example:

sc> showfaults -v
Last POST Run: Wed Jun 29 11:29:02 2007
 
Post Status: Passed all devices
ID  Time              FRU                      Fault
0  Jun 30 22:13:02   /SYS/MB/CMP0/BR2/CH0/D0  Host detected fault, MSGID: SUN4V-8000-DX  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

If a fault is reported, perform Step 3 and Step 4.

3. Run the ALOM CMT CLI clearfault command with the UUID provided in the showfaults output.

Example:

sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
Clearing fault from all indicted FRUs...
Fault cleared.

4. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, perform the following Solaris command:

fmadm repair UUID

Example:

# fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
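The two clearing procedures (Clearing POST Detected Faults and this section) differ in how the fault is recognized and which commands clear it. The following Python sketch summarizes that decision for a single line of showfaults output. It is illustrative only: it assumes the line formats shown in the examples in this chapter, and the function name suggest_clear_commands is hypothetical. It only builds command strings and does not run anything.

```python
import re

UUID = re.compile(r"UUID:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")
FRU = re.compile(r"(/SYS/\S+)")

# Fault lines taken from the showfaults examples in this chapter.
POST_LINE = ("0 /SYS/MB/CMP0/BR2/CH0/D0 SP detected fault: "
             "/SYS/MB/CMP0/BR2/CH0/D0 Forced fail (POST)")
PSH_LINE = ("0  Jun 30 22:13:02   /SYS/MB/CMP0/BR2/CH0/D0  "
            "Host detected fault, MSGID: SUN4V-8000-DX  "
            "UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86")


def suggest_clear_commands(fault_line):
    """Return (kind, commands) for one showfaults fault line."""
    if "Host detected fault" in fault_line:
        match = UUID.search(fault_line)
        if not match:
            return ("PSH", [])  # run showfaults -v to obtain the UUID
        uuid = match.group(1)
        # clearfault runs at the sc> prompt, fmadm repair in Solaris.
        return ("PSH", ["clearfault " + uuid, "fmadm repair " + uuid])
    if "Forced fail" in fault_line:
        match = FRU.search(fault_line)
        if not match:
            return ("POST", [])
        return ("POST", ["enablecomponent " + match.group(1)])
    return ("other", [])


if __name__ == "__main__":
    print(suggest_clear_commands(POST_LINE))
    print(suggest_clear_commands(PSH_LINE))
```

For POST detected faults, a power cycle is still required after enablecomponent, as described in Clearing POST Detected Faults.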

1.7 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.

If POST, service processor, or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for notifications for faults. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system messages. To view the system messages log file, view the contents of the /var/adm/messages file.

1.7.1 Checking the Message Buffer

1. Log in as superuser.

2. Type the dmesg command:

# dmesg

The dmesg command displays the most recent messages generated by the system.

1.7.2 Viewing System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

1. Log in as superuser.

2. Type the following command:

# more /var/adm/messages

3. If you want to view all logged messages, type the following command:

# more /var/adm/messages*
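When reviewing all of the rotated files, it can help to read them in chronological order. The following Python sketch orders file names according to the rotation scheme described above (messages is newest, and higher numeric suffixes are older). It is illustrative only, and the function names rotation_index and oldest_first are hypothetical.

```python
def rotation_index(name):
    """Return the numeric rotation suffix, or 0 for the current file."""
    _, dot, suffix = name.rpartition(".")
    if dot and suffix.isdigit():
        return int(suffix)
    return 0


def oldest_first(names):
    """Sort message file names from oldest to newest."""
    return sorted(names, key=rotation_index, reverse=True)


if __name__ == "__main__":
    names = ["/var/adm/messages", "/var/adm/messages.2",
             "/var/adm/messages.1", "/var/adm/messages.3"]
    for name in oldest_first(names):
        print(name)
```

On a live system, the list of names could come from glob.glob("/var/adm/messages*") in the Python standard library.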

1.8 Managing Components With Automatic System Recovery Commands

The Automatic System Recovery (ASR) feature enables the server to automatically configure failed components out of operation until they can be replaced. In the server, the ASR feature manages the following components:

UltraSPARC T2 processor strands

Memory FB-DIMMs

I/O bus

The database that contains the list of disabled components is called the ASR blacklist (asr-db).

In most cases, POST automatically disables a faulty component. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist.

The ASR commands (TABLE 1-10) enable you to view, and manually add or remove components from the ASR blacklist. You run these commands from the ALOM CMT CLI sc> prompt.

TABLE 1-10 ASR Commands
Command	Description
`showcomponent`	Displays system components and their current state.
`enablecomponent` asrkey	Removes a component from the `asr-db` blacklist, where asrkey is the component to enable.
`disablecomponent` asrkey	Adds a component to the `asr-db` blacklist, where asrkey is the component to disable.
`clearasrdb`	Removes all entries from the `asr-db` blacklist.

Note - The components (asrkeys) vary from system to system, depending on how many cores and how much memory are present. Use the showcomponent command to see the asrkeys on a given system.

Note - A reset or power cycle is required after disabling or enabling a component. If the status of a component is changed, there is no effect on the system until the next reset or power cycle.

1.8.1 Displaying System Components

The showcomponent command displays the system components (asrkeys) and reports their status.

At the sc> prompt, enter the showcomponent command.

EXAMPLE 1-12 shows partial output with no disabled components.

EXAMPLE 1-12 Output of the showcomponent Command With No Disabled Components

sc> showcomponent
Keys:
 
    /SYS/MB/RISER0/XAUI0
    /SYS/MB/RISER0/PCIE0
    /SYS/MB/RISER0/PCIE3
    /SYS/MB/RISER1/XAUI1
    /SYS/MB/RISER1/PCIE1
    /SYS/MB/RISER1/PCIE4
    /SYS/MB/RISER2/PCIE2
    /SYS/MB/RISER2/PCIE5
    /SYS/MB/GBE0
    /SYS/MB/GBE1
    /SYS/MB/PCIE
    /SYS/MB/PCIE-IO/USB
    /SYS/MB/SASHBA
    /SYS/MB/CMP0/NIU0
    /SYS/MB/CMP0/NIU1
    /SYS/MB/CMP0/MCU0
    /SYS/MB/CMP0/MCU1
    /SYS/MB/CMP0/MCU2
    /SYS/MB/CMP0/MCU3
 
    /SYS/MB/CMP0/L2_BANK0
    /SYS/MB/CMP0/L2_BANK1
    /SYS/MB/CMP0/L2_BANK2
    /SYS/MB/CMP0/L2_BANK3
    /SYS/MB/CMP0/L2_BANK4
    /SYS/MB/CMP0/L2_BANK5
    /SYS/MB/CMP0/L2_BANK6
    /SYS/MB/CMP0/L2_BANK7
    ...
    /SYS/TTYA
State: Clean

EXAMPLE 1-13 shows showcomponent command output with a component disabled:

EXAMPLE 1-13 Output of the showcomponent Command Showing Disabled Components

sc> showcomponent
Keys:
 
    /SYS/MB/RISER0/XAUI0
    /SYS/MB/RISER0/PCIE0
    /SYS/MB/RISER0/PCIE3
    /SYS/MB/RISER1/XAUI1
    /SYS/MB/RISER1/PCIE1
    /SYS/MB/RISER1/PCIE4
    /SYS/MB/RISER2/PCIE2
    /SYS/MB/RISER2/PCIE5
    ...
    /SYS/TTYA
Disabled Devices
  /SYS/MB/CMP0/L2_BANK0	Disabled by user
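If showcomponent output has been captured to a text file or string (for example, from a logged console session), it can be summarized with a short script. The Python sketch below assumes the layout seen in EXAMPLE 1-12 and EXAMPLE 1-13: a Keys: header, one asrkey per line, and an optional Disabled Devices section whose lines hold an asrkey followed by a reason. The function name parse_showcomponent is hypothetical, and output on other firmware versions might be laid out differently.

```python
def parse_showcomponent(text):
    """Return (keys, disabled) parsed from captured showcomponent output.

    keys is a list of asrkeys; disabled maps an asrkey to its reason string.
    """
    keys = []
    disabled = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the elision marker, and the echoed prompt line.
        if not line or line == "..." or line.startswith("sc>"):
            continue
        if line == "Keys:":
            section = "keys"
        elif line.startswith("Disabled Devices"):
            section = "disabled"
        elif line.startswith("State:"):
            section = None
        elif section == "keys":
            keys.append(line)
        elif section == "disabled":
            # First field is the asrkey; the rest of the line is the reason.
            parts = line.split(None, 1)
            disabled[parts[0]] = parts[1] if len(parts) > 1 else ""
    return keys, disabled


sample = """sc> showcomponent
Keys:

    /SYS/MB/RISER0/XAUI0
    /SYS/MB/RISER0/PCIE0
    ...
    /SYS/TTYA
Disabled Devices
  /SYS/MB/CMP0/L2_BANK0\tDisabled by user
"""
keys, disabled = parse_showcomponent(sample)
```

An empty disabled dictionary corresponds to the State: Clean case shown in EXAMPLE 1-12.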

1.8.2 Disabling Components

The disablecomponent command disables a component by adding it to the ASR blacklist.

1. At the sc> prompt, enter the disablecomponent command.

sc> disablecomponent /SYS/MB/CMP0/BR1/CH0/D0
Chassis | major: /SYS/MB/CMP0/BR1/CH0/D0 has been disabled. Disabled by user

2. After receiving confirmation that the disablecomponent command is complete, reset the server so that the ASR command takes effect.

sc> reset

1.8.3 Enabling Disabled Components

The enablecomponent command enables a disabled component by removing it from the ASR blacklist.

1. At the sc> prompt, enter the enablecomponent command.

sc> enablecomponent /SYS/MB/CMP0/BR1/CH0/D0
Chassis | major: /SYS/MB/CMP0/BR1/CH0/D0 has been enabled.

2. After receiving confirmation that the enablecomponent command is complete, reset the server so that the ASR command takes effect.

sc> reset

1.9 Exercising the System With SunVTS Software

Sometimes a server exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it might be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.

This section describes the tasks necessary to use SunVTS software to exercise your server:

Checking Whether SunVTS Software Is Installed

Exercising the System Using SunVTS Software

1.9.1 Checking Whether SunVTS Software Is Installed

This procedure assumes that the Solaris OS is running on the server, and that you have access to the Solaris command line.

1. Check for the presence of SunVTS packages using the pkginfo command.

% pkginfo -l SUNWvts SUNWvtsr SUNWvtsts SUNWvtsmn

TABLE 1-11 lists SunVTS packages:

TABLE 1-11 SunVTS Packages
Package	Description
`SUNWvts`	SunVTS framework
`SUNWvtsr`	SunVTS framework (root)
`SUNWvtsts`	SunVTS for tests
`SUNWvtsmn`	SunVTS man pages

If SunVTS software is installed, information about the packages is displayed.

If SunVTS software is not installed, you see an error message for each missing package, as in EXAMPLE 1-14.

EXAMPLE 1-14 Missing Package Errors for SunVTS
ERROR: information for "SUNWvts" was not found
ERROR: information for "SUNWvtsr" was not found
...
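When checking several servers, the error messages can be scanned automatically. The following Python sketch assumes that the pkginfo messages have already been captured into a string and that they follow the wording shown in EXAMPLE 1-14. The function name missing_packages and the EXPECTED tuple are assumptions made for illustration; the package names are the four listed in TABLE 1-11.

```python
import re

# Package names taken from TABLE 1-11.
EXPECTED = ("SUNWvts", "SUNWvtsr", "SUNWvtsts", "SUNWvtsmn")

# Message wording as shown in EXAMPLE 1-14.
NOT_FOUND = re.compile(r'ERROR: information for "([^"]+)" was not found')


def missing_packages(pkginfo_output, expected=EXPECTED):
    """Return the expected packages that the captured output reports as missing."""
    reported = set(NOT_FOUND.findall(pkginfo_output))
    return [pkg for pkg in expected if pkg in reported]


captured = (
    'ERROR: information for "SUNWvts" was not found\n'
    'ERROR: information for "SUNWvtsr" was not found\n'
)
missing = missing_packages(captured)
```

An empty result means that none of the expected packages was reported as missing in the captured text; it does not by itself confirm the installed SunVTS version.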

The SunVTS 6.0 PS3 software, and future compatible versions, are supported on the server.

SunVTS installation instructions are described in the SunVTS User’s Guide.

1.9.2 Exercising the System Using SunVTS Software

Before you begin, the Solaris OS must be running. You also must ensure that SunVTS validation test software is installed on your system. See Checking Whether SunVTS Software Is Installed.

The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS. For details, refer to the SunVTS User’s Guide.

SunVTS software features both character-based and graphics-based interfaces. This procedure assumes that you are using the graphical user interface (GUI) on a system running the Common Desktop Environment (CDE). For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by tip or telnet commands, refer to the SunVTS User’s Guide.

SunVTS software can be run in several modes. This procedure assumes that you are using the default mode.

This procedure also assumes that the server is headless. That is, it is not equipped with a monitor capable of displaying bitmap graphics. In this case, you access the SunVTS GUI by logging in remotely from a machine that has a graphics display.

Finally, this procedure describes how to run SunVTS tests in general. Individual tests might presume the presence of specific hardware, or might require specific drivers, cables, or loopback connectors. For information about test options and prerequisites, refer to the following documentation:

SunVTS 6.3 Test Reference Manual for SPARC Platforms

SunVTS 6.3 User’s Guide

1.9.3 Exercising the System With SunVTS Software

1. Log in as superuser to a system with a graphics display.

The display system should be one with a frame buffer and monitor capable of displaying bitmap graphics such as those produced by the SunVTS GUI.

2. Enable the remote display.

On the display system, type:

# /usr/openwin/bin/xhost + test-system

where test-system is the name of the server you plan to test.

3. Remotely log in to the server as superuser.

Use a command such as rlogin or telnet.

4. Start SunVTS software.

If you have installed SunVTS software in a location other than the default /opt directory, alter the path, as in EXAMPLE 1-15.

EXAMPLE 1-15 Alternate Command for Starting SunVTS Software
# /opt/SUNWvts/bin/sunvts -display display-system:0

where display-system is the name of the machine through which you are remotely logged in to the server.

The SunVTS GUI is displayed (FIGURE 1-9).

FIGURE 1-9 SunVTS GUI

Figure showing the SunVTS GUI for the server and the various buttons and areas on the GUI screen.

5. Expand the test lists to see the individual tests.

The test selection area lists tests in categories, such as Network, as shown in FIGURE 1-10. To expand a category, left-click the expand category icon to the left of the category name.

FIGURE 1-10 SunVTS Test Selection Panel

Figure showing a small portion of the test selection area in the SunVTS graphical interface.

6. (Optional) Select the tests you want to run.

Certain tests are enabled by default, and you can choose to accept these.

Alternatively, you can enable and disable individual tests or blocks of tests by clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.

TABLE 1-12 lists tests that are especially useful to run on this server.

TABLE 1-12 Useful SunVTS Tests to Run on This Server
SunVTS Tests	FRUs Exercised by Tests
`cmttest`, `cputest`, `fputest`, `iutest`, `l1dcachetest`, `dtlbtest`, and `l2sramtest` - indirectly: `mptest` and `systest`	FB-DIMMs, CPU motherboard
`disktest`	Disks, cables, disk backplane
`cddvdtest`	CD/DVD device, cable, motherboard
`nettest`, `netlbtest`	Network interface, network cable, CPU motherboard
`pmemtest`, `vmemtest`, `ramtest`	FB-DIMMs, motherboard
`serialtest`	I/O (serial port interface)
`usbkbtest`, `disktest`	USB devices, cable, CPU motherboard (USB controller)
`hsclbtest`	Motherboard, service processor (Host to service processor interface)

7. (Optional) Customize individual tests.

You can customize individual tests by right-clicking on the name of the test. For example, in FIGURE 1-10, right-clicking on the text string ce0(nettest) brings up a menu that enables you to configure this Ethernet test.

8. Start testing.

Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button.

During testing, SunVTS software logs all status and error messages. To view these messages, click the Log button or select Log Files from the Reports menu. This action opens a log window from which you can choose to view the following logs:

Information - Detailed versions of all the status and error messages that appear in the test messages area.

Test Error - Detailed error messages from individual tests.

VTS Kernel Error - Error messages pertaining to SunVTS software itself. Look here if SunVTS software appears to be acting strangely, especially when it starts up.

Solaris OS Messages (/var/adm/messages) - A file containing messages generated by the operating system and various applications.

Log Files (/var/opt/SUNWvts/logs) - A directory containing the log files.
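The two command lines used in Steps 2 and 4 of this procedure can be assembled by a small helper, which is convenient when the same test is repeated on several servers. The Python sketch below only builds the command strings; it does not run them. The function names, the host names netra1 and admin-ws, and the /export/tools location are hypothetical, and the sketch assumes that a relocated installation keeps the SUNWvts/bin/sunvts layout under its base directory.

```python
import posixpath
import shlex


def xhost_command(test_system):
    """Command typed on the display system to allow the server to open windows."""
    return ["/usr/openwin/bin/xhost", "+", test_system]


def sunvts_command(display_system, install_root="/opt", display_number=0):
    """Command typed on the server to start the SunVTS GUI on a remote display.

    install_root is the base directory that holds SUNWvts (assumed layout).
    """
    binary = posixpath.join(install_root, "SUNWvts", "bin", "sunvts")
    return [binary, "-display", f"{display_system}:{display_number}"]


# Hypothetical host names and install location, for illustration only.
allow = shlex.join(xhost_command("netra1"))
default_start = shlex.join(sunvts_command("admin-ws"))
relocated_start = shlex.join(sunvts_command("admin-ws", install_root="/export/tools"))
```

With these sample values, default_start matches the form of the command in EXAMPLE 1-15, and relocated_start differs only in the leading directory.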

1.10 Obtaining the Chassis Serial Number

To obtain support for your system, you need your chassis serial number. The chassis serial number is printed on a sticker on the front of the server and on another sticker on the side of the server. You can also run the ALOM CMT CLI showplatform command to obtain the chassis serial number.

For example:

TABLE 1-13 Obtaining the Chassis Serial Number With the showplatform Command
sc> showplatform
SUNW,Sun-Netra-T5220
Chassis Serial Number: xxxxxxxxxxxx
Domain Status
------ ------
S0     OS Standby
sc>
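If the showplatform output is captured from a console log, the serial number can be extracted with a short script. The Python sketch below assumes the Chassis Serial Number: label shown in the example above; the function name chassis_serial is hypothetical, and the sample string reuses the placeholder value from the example rather than a real serial number.

```python
import re


def chassis_serial(showplatform_output):
    """Return the chassis serial number from captured output, or None if absent."""
    match = re.search(r"Chassis Serial Number:\s*(\S+)", showplatform_output)
    return match.group(1) if match else None


captured = """sc> showplatform
SUNW,Sun-Netra-T5220
Chassis Serial Number: xxxxxxxxxxxx
Domain Status
------ ------
S0     OS Standby
sc>
"""
serial = chassis_serial(captured)
```

The function returns None when the label is not present, so a caller can fall back to reading the sticker on the chassis.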

1.11 Additional Service Related Information

In addition to this service manual, the following resources are available to help you keep your server running optimally. These documents are available at:

http://www.oracle.com/technetwork/indexes/documentation/index.html

Server Product Notes - Contain late-breaking information about the system, including required software patches, updated hardware and compatibility information, and solutions to known issues.

Solaris Release Notes - Contain important information about the Solaris OS.
