C H A P T E R 2 - Sun Blade T6300 Server Module Diagnostics

C H A P T E R 2

Sun Blade T6300 Server Module Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the Sun Blade T6300 server module.

This chapter is intended for technicians, service personnel, and system administrators who service and repair computer systems.

The following topics are covered:

Section 2.1, Sun Blade T6300 Server Module Diagnostics Overview

Section 2.2, Interpreting System LEDs

Section 2.3, Using ALOM CMT for Diagnosis and Repair Verification

Section 2.4, Running POST

Section 2.5, Using the Solaris Predictive Self-Healing Feature

Section 2.6, Collecting Information From Solaris OS Files and Commands

Section 2.7, Managing Components With Automatic System Recovery Commands

Section 2.8, Exercising the System With SunVTS

2.1 Sun Blade T6300 Server Module Diagnostics Overview

There are a variety of diagnostic tools, commands, and indicators you can use to monitor and troubleshoot a Sun Blade T6300 server module:

LEDs - Provide a quick visual notification of the status of the server module and of some of the FRUs.

ALOM CMT firmware - This system firmware runs on the service processor. In addition to providing the interface between the hardware and OS, ALOM CMT also tracks and reports the health of key server module components. ALOM CMT works closely with POST and Solaris Predictive Self-Healing technology to keep the system up and running even when there is a faulty component.

Power-on self-test (POST) - POST performs diagnostics on system components upon system reset to ensure the integrity of those components. POST is configureable and works with ALOM CMT to take faulty components offline if needed.

Solaris OS Predictive Self-Healing (PSH) - This technology continuously monitors the health of the CPU and memory, and works with ALOM CMT to take a faulty component offline if needed. The Predictive Self-Healing technology enables Sun systems to accurately predict component failures and mitigate many serious problems before they occur.

Log files and console messages - Provide the standard Solaris OS log files and investigative commands that can be accessed and displayed on the device of your choice.

SunVTS - An application that exercises the system, provides hardware validation, and discloses possible faulty components with recommendations for repair.

The LEDs, ALOM, Solaris OS PSH, and many of the log files and console messages are integrated. For example, a fault detected by the Solaris software will display the fault, log it, pass information to ALOM CMT where it is logged, and depending on the fault, might illuminate one or more LEDs.

The diagnostic flowchart in FIGURE 2-1 and TABLE 2-1 describe an approach for using the server module diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.

Use this flowchart to understand what diagnostics are available to troubleshoot faulty hardware, and use TABLE 2-1 to find more information about each diagnostic in this chapter.

FIGURE 2-1 Diagnostic Flowchart

Figure shows the diagnostic flowchart.

TABLE 2-1 Diagnostic Flowchart Actions
Action No.	Diagnostic Action	Resulting Action	For more information, see these sections
1.	Check the OK LED.	The OK LED is located on the front of the chassis. If the LED is not lit, check that the blade is properly plugged in and the chassis has power.	Section 2.2, Interpreting System LEDs
2.	Run the ALOM CMT `showfaults` command to check for faults.	The `showfaults` command displays the following kinds of faults: Environmental faults Solaris Predictive Self-Healing (PSH) detected faults POST detected faults Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 1-3.	Section 2.3.2, Displaying System Faults
3.	Check the Solaris log files for fault information.	The Solaris message buffer and log files record system events and provide information about faults. If system messages indicate a faulty device, replace the FRU. To obtain more diagnostic information, go to Action 4.	Section 2.6, Collecting Information From Solaris OS Files and Commands
4.	Run SunVTS software.	SunVTS can exercise and diagnose FRUs. To run SunVTS, the server module must be running the Solaris OS. If SunVTS reports a faulty device replace the FRU. If SunVTS does not report a faulty device, go to Action 5.	Section 2.8, Exercising the System With SunVTS
5.	Run POST.	POST performs basic tests of the server module components and reports faulty FRUs. If POST indicates a faulty FRU, replace the FRU. If POST does not indicate a faulty FRU, go to Action 9.	Section 2.4, Running POST
6.	Determine if the fault is an environmental fault.	If the fault listed by the `showfaults` command displays a temperature or voltage fault, then the fault is an environmental fault. Environmental faults can be caused by faulty FRUs (power supply, fan, or blower) or by environmental conditions such as high ambient temperature, or blocked airflow.	Section 2.3.2, Displaying System Faults See the Sun Blade 6000 Modular System Service Manual, 820-0051.
7.	Determine if the fault was detected by PSH.	If the fault message displays the following text, the fault was detected by the Solaris Predictive Self-Healing software: `Host detected fault` If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU. After the FRU is replaced, perform the procedure to clear PSH detected faults.	Section 2.5, Using the Solaris Predictive Self-Healing Feature Section 4.2, Common Procedures for Parts Replacement Section 2.5.2, Clearing PSH Detected Faults Section 2.5.3, Clearing the PSH Fault From the ALOM CMT Logs
8.	Determine if the fault was detected by POST.	POST performs basic tests of the server module components and reports faulty FRUs. When POST detects a faulty FRU, it logs the fault and if possible, takes the FRU offline. POST detected FRUs display the following text in the fault message: FRU-name `deemed faulty and disabled` In this case, replace the FRU and run the procedure to clear POST detected faults.	Section 2.4, Running POST Section 4.2, Common Procedures for Parts Replacement Section 2.4.5, Clearing POST Detected Faults
9.	Contact Sun for support.	The majority of hardware faults are detected by the server module diagnostics. In rare cases it is possible that a problem requires additional troubleshooting. If you are unable to determine the cause of the problem, contact Sun for support.	Sun Support information: `http://www.sun.com/ support` Section 1.2, Finding the Serial Number

2.1.1 Memory Configuration and Fault Handling

This section describes how the memory is configured and how the server module deals with memory faults.

2.1.1.1 Memory Configuration

The Sun Blade T6300 server module has eight slots that hold DDR-2 memory DIMMs in the following DIMM sizes:

1 Gbyte (maximum of 8 Gbyte)

2 Gbyte (maximum of 16 Gbyte)

4 Gbyte (maximum of 32 Gbyte)

The Sun Blade T6300 server module performs best if all eight connectors are populated with eight DIMMs. This configuration also enables the system to continue operating even when a DIMM fails, or if an entire channel fails.

2.1.1.2 Capacity Restrictions

Due to interleaving rules for the CPU, the system will operate at the lowest capacity of all the DIMMs installed. Therefore, it is ideal to install eight identical DIMMs (not four DIMMs of one capacity and four DIMMs of another capacity).

2.1.1.3 DIMM Installation Rules

Caution - The following DIMM rules must be followed. The server module might not operate correctly if the DIMM rules are not followed. Always use DIMMs that have been qualified by Sun.

DIMMs are installed in groups of four, with four DIMMs of the same capacity (FIGURE 2-2).

All DIMMS must use DDR-2 four-data input output DRAMs

Each set of four DIMMS must have the exact same DRAM devices on the DIMM, for example, four DIMMs must have 256 Mbyte DRAMs, or four DIMMS have 512 Mbyte DRAMs.

All DIMMS must be 72-bit ECC.

If the DIMMs are not properly configured, the system issues a message and the system does not boot.

See Section 5.2, Installing DIMMS for DIMM installation instructions.

FIGURE 2-2 DIMM Installation Rules

Figure shows the location of the DIMMs the channel numbers, and connector numbers.

2.1.1.4 Memory Fault Handling

The Sun Blade T6300 server module uses advanced ECC technology, also called chipkill, that corrects up to 4-bits in error on nibble boundaries, as long as they are all in the same DRAM. If a DRAM fails, the DIMM continues to function.

The following server module features manage memory faults independently:

POST - Runs when the server module is powered on (based on ALOM CMT configuration variables) and thoroughly tests the memory subsystem.

If a memory fault is detected, POST displays the fault with the FRU name of the faulty DIMMS, logs the fault, and disables the faulty DIMMs by placing them in the ASR blacklist. For a given memory fault, POST disables half of the physical memory in the system. When this occurs, you must replace the faulty DIMMs based on the fault message and enable the disabled DIMMs with the ALOM CMT enablecomponent command.

Solaris Predictive Self-healing (PSH) technology - Afeature of the Solaris OS, uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID), and logged. PSH reports the fault and provides a recommended proactive replacement for the DIMMs associated with the fault.

2.1.1.5 Troubleshooting Memory Faults

If you suspect that the server module has a memory problem, follow the flowchart (see FIGURE 2-1). Run the ALOM CMT showfaults command. The showfaults command lists memory faults and lists the specific DIMMS that are associated with the fault. Once you've identified which DIMMs to replace, see Chapter 4 for DIMM removal and replacement instructions. You must perform the instructions in that chapter to clear the faults and enable the replaced DIMMs.

2.2 Interpreting System LEDs

The Sun Blade T6300 server module has LEDs on the front panel and the hard drives. The behavior of LEDs on your server module conform to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-2

2.2.1 Front Panel LEDs and Buttons

The front panel LEDs and buttons are located in the center of the server module (FIGURE 2-3, TABLE 2-2, and TABLE 2-3, and TABLE 2-4).

FIGURE 2-3 Front Panel and Hard Drive LEDs

Figure shows the front panel LEDs and buttons. From top to bottom they are: [ D ]

TABLE 2-2 LED Behavior and Meaning
LED Behavior	Meaning
Off	The condition represented by the color is not true.
Steady on	The condition represented by the color is true.
Standby blink	The system is functioning at a minimal level and ready to resume full function.
Slow blink	Transitory activity or new activity represented by the color is taking place.
Fast blink	Attention is required.
Feedback flash	Activity is taking place commensurate with the flash rate (such as disk drive activity).

The LEDs have assigned meanings, described in TABLE 2-3.

TABLE 2-3 LED Behaviors With Assigned Meanings
Color	Behavior	Definition	Description
White	Off	Steady state
	Fast blink	4 Hz repeating sequence, equal intervals On and Off.	This indicator helps you to locate a particular enclosure, board, or subsystem (for example, the Locator LED). The LED is activated using one of the following methods: Issuing the `setlocator` `on` or `off` command. Pressing the button to toggle the indicator on or off. This LED provides the following indications: Off- Normal operating state. Fast blink - The server module received a signal as a result of one of the preceding methods and is indicating that the server module is active.
Blue	Off	Steady state	Steady state - it is safe to remove the server module from the chassis.
	Steady on	Steady state	If blue is on, a service action can be performed on the applicable component with no adverse consequences (for example, the OK-to-Remove LED).
Yellow or Amber	Off	Steady state
	Steady on	Steady state	This indicator signals the existance of a fault condition. Service is required (for example, the Service Required LED). The ALOM CMT `showfaults` command provides details about any faults that cause this indicator to be lit.
Green	Off	Steady state	Off - The system is unavailable. Either it has no power or ALOM CMT is not running.
	Standby blink	Repeating sequence consisting of a brief (0.1 sec.) on flash followed by a long off period (2.9 sec.)	The system is running at a minimum level and is ready to be quickly revived to full function (for example, the System Activity LED).
	Steady on	Steady state	Status normal; system or component functioning with no service actions required.
	Slow blink		A transitory (temporary) event is taking place for which direct proportional feedback is not needed or not feasible. ALOM is enabled but the server module is not fully powered on. Indicates that the service processor is running while the system is running at a minimum level in standby mode and ready to be returned to its normal operating state.

TABLE 2-4 Front Panel Buttons
LED	Color	Description
Power button	gray	Turns the host system on and off. Use a paper clip or other small tipped object to completely press this button.
(reset)	gray	This button does not function on the Sun Blade T6300 server module.

2.2.2 Ethernet Port LEDs

For information about Ethernet LEDs see the Sun Blade 6000 Modular System Service Manual, 820-0051, at:

http://www.sun.com/documentation/

2.3 Using ALOM CMT for Diagnosis and Repair Verification

The Sun Advanced Lights Out Manager (ALOM) CMT is a service processor in the Sun Blade T6300 server module that enables you to remotely manage and administer your server module.

ALOM CMT enables you to run remote diagnostics such as power-on self-test (POST), that would otherwise require physical proximity to the server module serial port. You can also configure ALOM CMT to send email alerts of hardware failures, hardware warnings, and other events related to the server module or to ALOM.

The ALOM CMT circuitry runs independently of the server module, using the server module standby power. Therefore, ALOM CMT firmware and software continue to function when the server module operating system goes offline or when the server module is powered off.

Note - Refer to the Advanced Lights out Management (ALOM) CMT v1.3 Guide, 819-7981, for comprehensive ALOM CMT information.

Faults detected by ALOM CMT, POST, and the Solaris Predictive Self-healing (PSH) technology are forwarded to ALOM CMT for fault handling (FIGURE 2-4).

In the event of a system fault, ALOM CMT ensures that the Service Action Required LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed (faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A).

FIGURE 2-4 ALOM CMT Fault Management

Figure shows the fault source interfaces.

ALOM CMT sends alerts to all ALOM CMT users that are logged in, sending the alert through email to a configured email address, and writing the event to the ALOM CMT event log.

ALOM CMT can detect when a fault is no longer present and clears the fault in several ways:

Fault recovery - The system automatically detects that the fault condition is no longer present. ALOM CMT extinguishes the Service Action Required LED and updates the FRU PROM, indicating that the fault is no longer present.

Fault repair - The fault has been repaired by human intervention. In most cases, ALOM CMT detects the repair and extinguishes the Service Required LED. In the event that ALOM CMT does not perform these actions, you must perform these tasks manually with clearfault or enablecomponent commands.

ALOM CMT can detect the removal of a FRU, in many cases even if the FRU is removed while ALOM CMT is powered off. This enables ALOM CMT to know that a fault, diagnosed to a specific FRU, has been repaired. The ALOM CMT clearfault command enables you to manually clear certain types of faults without a FRU replacement or if ALOM CMT was unable to automatically detect the FRU replacement.

ALOM CMT does not automatically detect hard drive replacement.

Many environmental faults can automatically recover. For example, a temperature that is exceeding a threshold might return to normal limits. An unplugged power supply can be plugged in. The recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

fru at location is OK.

sensor at location is within normal range.

Environmental faults can be repaired through hot removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals is:

fru at location has been removed.

There is no ALOM CMT command to manually repair an environmental fault.

ALOM CMT does not handle hard drive faults. Use the Solaris message files to view hard drive faults. See Section 2.6, Collecting Information From Solaris OS Files and Commands.

2.3.1 Running ALOM CMT Service-Related Commands

This section describes the ALOM CMT commands that are commonly used for service-related activities.

2.3.1.1 Connecting to ALOM

Before you can run ALOM CMT commands, you must connect to the ALOM. There are several ways to connect to the service processor:

Connect an ASCII terminal directly to the serial management port.

Use the telnet command to connect to ALOM CMT through an Ethernet connection on the network management port.

Note - Refer to the Advanced Lights out Management (ALOM) CMT v1.3 Guide, 819-7981, for instructions on configuring and connecting to ALOM.

2.3.1.2 Switching Between the System Console and ALOM

To switch from the console output to the ALOM CMT sc> prompt, type: #. (Hash-Period).

To switch from the sc> prompt to the console, type: console.

2.3.1.3 Service-Related ALOM CMT Commands

TABLE 2-5 describes the typical ALOM CMT commands for servicing a Sun Blade T6300 server module. For descriptions of all ALOM CMT commands, issue the help command or refer to the Advanced Lights out Management (ALOM) CMT v1.3 Guide, 819-7981.

TABLE 2-5 Service-Related ALOM CMT Commands
ALOM CMT Command	Description
`help` [command]	Displays a list of all ALOM CMT commands with syntax and descriptions. Specifying a command name as an option displays help for that command.
`break` [`-y`][`-c`]	Takes the host server from the OS to either `kmdb` or OpenBoot PROM (equivalent to a Stop-A command), depending on the Solaris mode that was booted. The `-y` option skips the confirmation question. The `-c` option executes a `console` command after completion of the `break` command.
`clearfault` UUID	Manually clears host-detected faults. The UUID is the unique fault ID of the fault to be cleared.
console [-f]	Connects you to the host system. The `-f` option forces the console to have read and write capabilities.
consolehistory [-b lines\|-e lines\|-v] [-g lines] [boot\|run]	Displays the contents of the system's console buffer. The following options enable you to specify how the output is displayed: `-g` lines option specifies the number of lines to display before pausing. `-e` lines option displays n lines from the end of the buffer. `-b` lines option displays n lines from beginning of the buffer. `-v` option displays the entire buffer. `boot\|run` option specifies the log to display (`run` is the default log).
bootmode [normal\|reset_nvram\| bootscript=string]	Enables control of the firmware during system initialization with the following options: `normal` is the default boot mode. `reset_nvram` resets OpenBoot PROM parameters to their default values. `bootscript=`string enables the passing of a string to the `boot` command.
`powercycle` [`-f`]	Performs a `poweroff` followed by `poweron`. The `-f` option forces an immediate `poweroff`, otherwise the command attempts a graceful shutdown.
`poweroff` [`-y`] [`-f`]	Powers off the host server. The `-y` option enables you to skip the confirmation question. The `-f` option forces an immediate shutdown.
`poweron [-c]`	Powers on the host server. Using the `-c` option executes a `console` command after completion of the `poweron` command.
`removefru PS0\|PS1`	Indicates if it is OK to perform a hot-swap of a power supply. This command does not perform any action, but provides a warning if the power supply should not be removed because the other power supply is not enabled.
`removeblade`	Pauses the service processor tasks and illuminates the white locator LED indicating that it is safe to remove the blade.
`unremoveblade`	Turns off the locator LED and restores the service processor state.
`reset` [`-y`] [`-c`]	Generates a hardware reset on the host server. The `-y` option enables you to skip the confirmation question. The `-c` option executes a `console` command after completion of the `reset` command.
`resetsc` [`-y`]	Reboots the service processor. The `-y` option enables you to skip the confirmation question.
`setkeyswitch` [`-y`] `normal` \| `stby` \| `diag` \| `locked`	Sets the virtual keyswitch. The `-y` option enables you to skip the confirmation question when setting the keyswitch to `stby`.
`setlocator` [`on` \| `off`]	Turns the Locator LED on the server on or off.
`showenvironment`	Displays the environmental status of the host server. This information includes system temperatures, power supply, front panel LED, hard drive, fan, voltage, and current sensor status. See Section 2.3.3, Displaying the Environmental Status.
`showfaults` [`-v`]	Displays current system faults. See Section 2.3.2, Displaying System Faults.
`showfru` [`-g` lines] [`-s` \| `-d`] [FRU]	Displays information about the FRUs in the server. The `-g` lines option specifies the number of lines to display before pausing the output to the screen. The `-s` option displays static information about system FRUs (defaults to all FRUs, unless one is specified). The `-d` option displays dynamic information about system FRUs (defaults to all FRUs, unless one is specified). See Section 2.3.4, Displaying FRU Information.
`showkeyswitch`	Displays the status of the virtual keyswitch.
`showlocator`	Displays the current state of the Locator LED as either on or off.
`showlogs` [`-b` lines \| `-e` lines `\|-v]` [`-g` lines] [`-p logtype[r\|p]]`]	Displays the history of all events logged in the ALOM CMT event buffers (in RAM or the persistent buffers).
`showplatform` [`-v`]	Displays information about the host system's hardware configuration, the system serial number, and whether the hardware is providing service.

Note - See TABLE 2-8 for the ALOM CMT ASR commands.

2.3.2 Displaying System Faults

The ALOM CMT showfaults command displays the following kinds of faults:

Environmental faults - Temperature or voltage problems that might be caused by faulty FRUs (power supplies, fans, or blower), or by room temperature or blocked air flow.

POST detected faults - Faults on devices detected by the power-on self-test diagnostics.

PSH detected faults - Faults detected by the Solaris Predictive Self-healing (PSH) technology

Use the showfaults command for the following reasons:

To see if any faults have been passed to, or detected by ALOM CMT.

To obtain the fault message ID (SUNW-MSG-ID) for PSH detected faults.

To verify that the replacement of a FRU has cleared the fault and not generated any additional faults.

At the sc> prompt, type the showfaults command.

The following showfaults command examples show the different kinds of output from the showfaults command:

Example of the showfaults command when no faults are present:

sc> showfaults

Last POST run: THU MAR 09 16:52:44 2006

POST status: Passed all devices

No failures found in System

Example of the showfaults command displaying an environmental fault:

sc> showfaults -v

Last POST run: TUE FEB 07 18:51:02 2006

POST status: Passed all devices

 ID FRU               Fault

  0 IOBD              VOLTAGE_SENSOR at IOBD/V_+1V has exceeded low warning threshold.

Example showing a fault that was detected by POST. These kinds of faults are identified by the message deemed faulty and disabled and by a FRU name:

sc> showfaults -v

   ID Time              FRU               Fault

    1 OCT 13 12:47:27   MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

Example showing a fault that was detected by the PSH technology. These kinds of faults are identified by the text Host detected fault and by a Universal Unique Identifier (UUID):

sc> showfaults -v

ID Time              FRU               Fault

0 SEP 09 11:09:26   MB/CMP0/CH0/R0/D0 Host detected fault, MSGID:

SUN4U-8000-2S  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

2.3.3 Displaying the Environmental Status

The showenvironment command displays a snapshot of the server module environmental status. This command displays system temperatures, hard drive status, power supply and fan status, front panel LED status, voltage, and current sensors. The output uses a format similar to the Solaris OS command prtdiag (1m).

At the sc> prompt, type the showenvironment command.

The output differs according to your system's model and configuration.

Example:

sc> showenvironment

=============== Environmental Status ===============

------------------------------------------------------------------------------

System Indicator Status:

SYS/LOCATE               SYS/SERVICE        SYS/ACT      SYS/OK_TO_RM

OFF                      ON                 ON           OFF

---------------------------------------------------------------------------

System Disks:

---------------------------------------------------------------------------

Disk   Status            Service  OKtoRem

---------------------------------------------------------------------------

HDD0   OK                OFF      OFF

HDD1   OK                OFF      OFF

HDD2   OK                OFF      OFF

HDD3   OK                OFF      OFF

------------------------------------------------------------------------------

System Temperatures (Temperatures in Celsius):

------------------------------------------------------------------------------

Sensor           Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft

HighHard

------------------------------------------------------------------------------

MB/T_AMB         OK      32   -10     -5      0       45

50       55

MB/CMP0/T_TCORE  OK      45   -10     -5      0       80

80       85

MB/CMP0/T_BCORE  OK      46   -10     -5      0       80

80       85

MB/F0/T_CORE     OK      47   -10     -5      0       95

100      105

---------------------------------------------------------------------------

Fans Status (Speeds in Revolutions Per Minute) :

---------------------------------------------------------------------------

Sensor           Status           Speed

---------------------------------------------------------------------------

MP/FM0/FIN       OK                3970

MP/FM0/FOUT      OK                3970

MP/FM1/FIN       OK                3970

MP/FM1/FOUT      OK                4017

MP/FM2/FIN       OK                4066

MP/FM2/FOUT      OK                4066

MP/FM3/FIN       OK                3970

MP/FM3/FOUT      OK                4017

MP/FM4/FIN       OK                4017

MP/FM4/FOUT      OK                4017

MP/FM5/FIN       OK                4017

MP/FM5/FOUT      OK                4017

------------------------------------------------------------------------------

Voltage Sensors (in Volts):

------------------------------------------------------------------------------

Sensor          Status      Voltage LowSoft LowWarn HighWarn HighSoft

-----------------------------------------------------------------------------

MB/V_VCORE      OK            1.30    1.02    1.08    1.40     1.58

MB/V_VTTB       OK            0.87    0.76    0.81    0.99     1.03

MB/V_VTTT       OK            0.87    0.76    0.81    0.99     1.03

MB/V_VCCB       OK            1.78    1.53    1.62    1.98     2.07

MB/V_VCCT       OK            1.76    1.53    1.62    1.98     2.07

MB/V_+1V1       OK            1.10    0.85    0.90    1.15     1.18

MB/V_+1V2       OK            1.18    1.02    1.08    1.32     1.38

MB/V_+1V5       OK            1.46    1.28    1.35    1.65     1.72

MB/V_+1V8       OK            1.79    1.53    1.62    1.98     2.07

MB/V_+3V3       OK            3.33    2.80    3.00    3.63     3.80

MB/V_+3V3STBY   OK            3.34    2.80    2.97    3.63     3.80

MB/V_+5V        OK            4.91    4.25    4.50    5.50     5.75

MB/V_+12V       OK           12.18   10.20   10.80   13.20    13.80

SC/BAT/V_BAT    OK            3.03      --    2.25     --       --

MP/V_+12V       OK           12.40   10.20   10.80   13.20    13.80

---------------------------------------------------------------------------

System Load (in amperes):

---------------------------------------------------------------------------

Sensor           Status              Load     Warn     Shutdown

-----------------------------------------------------------

MB/I_CORE        OK                  10.760   80.000   88.000

MB/I_MEMB        OK                   2.040   60.000   66.000

MB/I_MEMT        OK                   1.740   60.000   66.000

MB/I_12V         OK                   9.000   40.000   45.000

---------------------------------------------------------------------------

Power Supply  Status

------------------------------------------------------------------------------

Supply          Present       On

MP/PS0          PRESENT       FAULTED

MP/PS1          PRESENT       OK

sc>

Note - Some environmental information might not be available when the server module is in standby mode.

2.3.4 Displaying FRU Information

The showfru command displays information about the FRUs in the server module. Use this command to see information about an individual FRU, or for all the FRUs.

Note - By default, the output of the showfru command for all FRUs is very long.

At the sc> prompt, enter the showfru command.

In the following example, the showfru command is used to get information about the motherboard (MB).

sc> showfru MB.SEEPROM

SEGMENT: SD

/ManR

/ManR/UNIX_Timestamp32:      WED FEB 14 18:24:28 2007

/ManR/Description:           ASSY,Sun-Fire-T6300,CPU Board

/ManR/Manufacture Location:  Sriracha,Chonburi,Thailand

/ManR/Sun Part No:           5016843

/ManR/Sun Serial No:         NC00OD

/ManR/Vendor:                Celestica

/ManR/Initial HW Dash Level: 06

/ManR/Initial HW Rev Level:  02

/ManR/Shortname:             T2000_MB

/SpecPartNo:                 885-0483-04

SEGMENT: FL

/Configured_LevelR

/Configured_LevelR/UNIX_Timestamp32:     WED FEB 14 18:24:28 2007

/Configured_LevelR/Sun_Part_No:          5410827

/Configured_LevelR/Configured_Serial_No: N4001A

/Configured_LevelR/HW_Dash_Level:        03

2.4 Running POST

Power-on self-test (POST) is a group of PROM-based tests that run when the server module is powered on or reset. POST checks the basic integrity of the critical hardware components in the server module (CPU, memory, and I/O buses).

If POST detects a faulty component, it is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core will be disabled, and the system will boot and run using the remaining cores.

Note - Devices can be manually enabled or disabled using ASR commands (see Section 2.7, Managing Components With Automatic System Recovery Commands).

2.4.1 Controlling How POST Runs

The server module can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using ALOM CMT variables.

TABLE 2-6 lists the ALOM CMT variables used to configure POST and FIGURE 2-5 shows how the variables work together.

TABLE 2-6 ALOM CMT Parameters Used For POST Configuration
Parameter	Values	Description
setkeyswitch ^[1]	`normal`	The system can power on and run POST (based on the other parameter settings). For details see FIGURE 2-5. This parameter overrides all other commands.
	`diag`	The system runs POST based on predetermined settings.
	`stby`	The system cannot power on.
	`locked`	The system can power on and run POST, but no flash updates can be made.
`diag_mode`	`off`	POST does not run.
	`normal`	Runs POST according to `diag_level` value.
	`service`	Runs POST with preset values for `diag_level` and `diag_verbosity`.
`diag_level`	`min`	If `diag_mode` = `normal`, runs minimum set of tests.
	`max`	If `diag_mode` = `normal`, runs all the minimum tests plus extensive CPU and memory tests.
`diag_trigger`	`none`	Does not run POST on reset.
	`user_reset`	Runs POST upon user initiated resets.
	`power_on_reset`	Only runs POST for the first poweron. This is the default.
	`error_reset`	Runs POST if fatal errors are detected.
	`all_reset`	Runs POST after any reset.
`diag_verbosity`	`none`	No POST output is displayed.
	`min`	POST output displays functional tests with a banner and pinwheel.
	`normal`	POST output displays all test and informational messages.
	`max`	POST displays all test, informational, and some debugging messages.

FIGURE 2-5 Flowchart of ALOM CMT Variables for POST Configuration

Figure shows POST flow chart.

TABLE 2-7 shows typical combinations of ALOM CMT variables and associated POST modes .

TABLE 2-7 ALOM CMT Parameters and POST Modes
Parameter	Normal Diagnostic Mode (default settings)	No POST Execution	Diagnostic Service Mode	Keyswitch Diagnostic Preset Values
`diag_mode`	`normal`	off	`service`	`normal`
`setkeyswitch^[2]`	`normal`	normal	`normal`	`diag`
`diag_level`	`min`	n/a	`max`	`max`
`diag_trigger`	`power-on-reset error-reset`	none	`all-resets`	`all-resets`
`diag_verbosity`	`normal`	n/a	`max`	`max`
Description of POST execution	This is the default POST configuration. This configuration tests the system thoroughly, and suppresses some of the detailed POST output.	POST does not run, resulting in quick system initialization, but this is not a suggested configuration.	POST runs the full spectrum of tests with the maximum output displayed.	POST runs the full spectrum of tests with the maximum output displayed.

2.4.2 Changing POST Parameters

1. Access the ALOM CMT sc> prompt:

At the console, issue the #. key sequence:

#.

2. At the ALOM CMT sc> prompt, use the setsc command to set the POST parameter:

Example:

sc> setsc diag_mode service

The setkeyswitch parameter is a command that sets the virtual keyswitch, so it does not use the setsc command. Example:

sc> setkeyswitch diag

2.4.3 Reasons to Run POST

You can use POST to test and verify server module hardware.

2.4.3.1 Verifying Hardware Functionality

POST tests critical hardware components to verify functionality before the system boots and accesses software. If POST detects an error, the faulty component is disabled automatically, preventing faulty hardware from potentially harming software.

Under normal operating conditions, the server module is usually configured to run POST in maximum mode for all power-on or error-generated resets.

2.4.3.2 Diagnosing the System Hardware

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in diagnostic service mode for maximum test coverage and verbose output.

2.4.4 Running POST

This procedure describes how to run POST when you want maximum testing, as in the case when you are troubleshooting a system.

1. Switch from the system console prompt to the ALOM CMT sc> prompt by issuing the #. escape sequence.

ok #.

sc>

2. Set the virtual keyswitch to diag so that POST will run in service mode.

sc> setkeyswitch diag

3. Reset the system so that POST runs.

There are several ways to initiate a reset. The following example uses the powercycle command. For other methods, refer to the Sun Blade T6300 Server Module Administration Guide, 820-0277.

sc> powercycleAre you sure you want to powercycle the system [y/n]? y

Powering host off at MON JAN 10 02:52:02 2000

Waiting for host to Power Off; hit any key to abort.

SC Alert: SC Request to Power Off Host.

SC Alert: Host system has shut down.

Powering host on at MON JAN 10 02:52:13 2000

SC Alert: SC Request to Power On Host.

4. Switch to the system console to view the post output:

sc> console

Example of POST output with some output omitted:

0:0>

0:0>@(#)Sun Blade T6300 Server Module POST 4.25.0 2007/01/16 11:57

0:0>Copyright @ 2007 Sun Microsystems, Inc. All rights reserved

  SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.

0:0>VBSC selecting POST MAX Testing.

0:0>POST enabling threads: f00fffff

0:0>VBSC setting verbosity level 3

0:0>Start Selftest.....

0:0>Begin: Init CPU

0:0>End  : Init CPU

0:0>Master CPU Tests Basic.....

0:0>CPU =: 0

0:0>Begin: DMMU Registers Access

0:0>End  : DMMU Registers Access

0:0>Begin: Common MMU regs

0:0>End  : Common MMU regs

0:0>Begin: Init mmu regs

0:0>End  : Init mmu regs

0:0>Begin: D-Cache RAM

0:0>End  : D-Cache RAM

0:0>Init MMU.....

0:0>Begin: DMMU TLB DATA RAM Access

0:0>End  : DMMU TLB DATA RAM Access

0:0>Begin: DMMU TLB TAGS Access

0:0>End  : DMMU TLB TAGS Access

0:0>Begin: DMMU CAM

0:0>End  : DMMU CAM

0:0>Begin: Setup DMMU Miss Handler

0:0>End  : Setup DMMU Miss Handler

0:0>    Niagara, Version 2.0

0:0>    Serial Number 00000098.00000820 = fffff238.2e4df502

0:0>Begin: Init JBUS Config Regs

0:0>End  : Init JBUS Config Regs

0:0>Begin: IO-Bridge unit 1 init test

0:0>End  : IO-Bridge unit 1 init test

0:0>sys 200 MHz, CPU 1000 MHz, mem 200 MHz.

0:0>Begin: Integrated POST Testing

0:0>End  : Integrated POST Testing

0:0>L2 Tests.....

0:0>Begin: Setup L2 Cache

0:0>L2 Cache Control = 00000000.00300000

0:0>End  : Setup L2 Cache

0:0>Begin: L2 Cache Tags Test

0:0>End  : L2 Cache Tags Test

0:0>Begin: Scrub and Setup L2 Cache

0:0>L2 Directory clear

0:0>L2 Scrub VD & UA

0:0>L2 Scrub Tags

0:0>End  : Scrub and Setup L2 Cache

0:0>Test Memory.....

0:0>Begin: Probe and Setup Memory

0:0>INFO: 4096MB at Memory Channel [1 2 ] Rank 0 Stack 0

0:0>INFO:No memory detected at Memory Channel [1 2 ] Rank  0 Stack 1

0:0>INFO:No memory detected at Memory Channel [1 2 ] Rank  1 Stack 0

0:0>INFO:No memory detected at Memory Channel [1 2 ] Rank  1 Stack 1

0:0>

0:0>End  : Probe and Setup Memory

0:0>Begin: Data Bitwalk

0:0>L2 Scrub Data

0:0>L2 Enable

0:0>    Testing Memory Channel 2 Rank 0 Stack 0

0:0>    Testing Memory Channel 1 Rank 0 Stack 0

0:0>L2 Directory clear

0:0>L2 Scrub VD & UA

0:0>L2 Scrub Tags

0:0>L2 Disable

0:0>End  : Data Bitwalk

0:0>Begin: Address Bitwalk

0:0>    Testing Memory Channel 2 Rank 0 Stack 0

0:0>    Testing Memory Channel 1 Rank 0 Stack 0

0:0>End  : Address Bitwalk

0:0>Test Slave Threads Basic.....

0:0>Begin: Test Mailbox region

0:0>End  : Test Mailbox region

0:0>Begin: Set Mailbox

0:0>End  : Set Mailbox

0:0>Begin: Setup Final DMMU Entries

0:0>End  : Setup Final DMMU Entries

0:0>Begin: Post Image Region Scrub

0:0>End  : Post Image Region Scrub

0:0>Begin: Run POST from Memory

0:0>Verifying checksum on copied image.

0:0>The Memory's CHECKSUM value is f242.

0:0>The Memory's Content Size value is 84f42.

0:0>Success...  Checksum on Memory Validated.

0:0>End  : Run POST from Memory

0:0>Begin: L2 Cache Ram Test

0:0>End  : L2 Cache Ram Test

0:0>Begin: Enable L2 Cache

0:0>L2 Scrub Data

0:0>L2 Enable

0:0>End  : Enable L2 Cache

0:0>CPU =: 0 4 8 12 16 28

1:0>Begin: DMMU Registers Access

1:0>End  : DMMU Registers Access

2:0>Begin: DMMU Registers Access

2:0>End  : DMMU Registers Access

3:0>Begin: DMMU Registers Access

3:0>End  : DMMU Registers Access

4:0>Begin: DMMU Registers Access

7:0>Begin: DMMU Registers Access

4:0>End  : DMMU Registers Access

7:0>End  : DMMU Registers Access

0:0>CPU =: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 28 29 30 31

0:0>Test slave strand registers...

0:0>Extended CPU Tests.....

2:0>Begin: I-Cache RAM Test

2:0>End  : I-Cache RAM Test

3:0>Begin: I-Cache RAM Test

3:0>End  : I-Cache RAM Test

4:0>Begin: I-Cache RAM Test

4:0>End  : I-Cache RAM Test

7:0>Begin: I-Cache RAM Test

7:0>End  : I-Cache RAM Test

0:0>Begin: I-Cache RAM Test

0:0>End  : I-Cache RAM Test

0:0>Scrub Memory.....

0:0>Begin: Scrub Memory

0:0>Scrub 00000000.00600000->00000001.00000000 on Memory Channel [1 2 ] Rank 0 Stack 0

0:0>End  : Scrub Memory

0:0>Extended Memory Tests.....

0:0>Begin: Print Mem Config

0:0>Caches : Icache is ON, Dcache is ON.

0:0>    Bank 0   4096MB : 00000000.00000000 -> 00000001.00000000.

0:0>End  : Print Mem Config

0:0>Begin: Block Mem Test

0:0>Test 4288675840 bytes at 00000000.00600000 Memory Channel [ 1 2 ] Rank 0 Stack 0

0:0>........

0:0>End  : Block Mem Test

0:0>IO-Bridge Tests.....

0:0>Begin: IO-Bridge Quick Read

0:0>

0:0>------------------------------------------------------------

0:0>--------- IO-Bridge Quick Read Only of CSR and ID ------------

0:0>------------------------------------------------------------

0:0>fire 1 JBUSID  00000080.0f000000 =

0:0>                                     fc000002.e03dda23

0:0>------------------------------------------------------------

0:0>fire 1 JBUSCSR 00000080.0f410000 =

0:0>                                     00000ff5.13cb7000

0:0>------------------------------------------------------------

0:0>End  : IO-Bridge Quick Read

0:0>Begin: IO-Bridge unit 1 jbus perf test

0:0>End  : IO-Bridge unit 1 jbus perf test

0:0>Begin: IO-Bridge unit 1 int init test

0:0>End  : IO-Bridge unit 1 int init test

0:0>Begin: IO-Bridge unit 1 link train port A

0:0>End  : IO-Bridge unit 1 link train port A

0:0>Begin: IO-Bridge unit 1 link train port B

0:0>End  : IO-Bridge unit 1 link train port B

0:0>Begin: IO-Bridge unit 1 interrupt test

0:0>End  : IO-Bridge unit 1 interrupt test

0:0>Begin: IO-Bridge unit 1 Config MB bridges

0:0>Config port A, bus 2 dev 0 func 0, tag MB/PCI-SWITCH0

0:0>Config port A, bus 3 dev 1 func 0, tag MB/PCI-SWITCH0

0:0>Config port B, bus 2 dev 0 func 0, tag MB/PCI-SWITCH1

0:0>Config port B, bus 3 dev 1 func 0, tag MB/PCI-SWITCH1

0:0>Config port B, bus 3 dev 2 func 0, tag MB/PCI-SWITCH1

0:0>Config port B, bus 4 dev 0 func 0, tag MB/PCIE-IO

0:0>End  : IO-Bridge unit 1 Config MB bridges

0:0>Begin: IO-Bridge unit 1 PCI id test

0:0>    INFO:100000 count read passed for MB/PCI-SWITCH0! Last read VID:10b5|DID:8532 LinkWidth:8

0:0>    INFO:100000 count read passed for MB/NET0! Last read VID:8086|DID:105e LinkWidth:4

0:0>    INFO:100000 count read passed for MB/PCI-SWITCH1! Last read VID:10b5|DID:8532 LinkWidth:8

0:0>End  : IO-Bridge unit 1 PCI id test

0:0>Begin: Quick JBI Loopback Block Mem Test

0:0>Quick jbus loopback Test 262144 bytes at 00000000.00600000

0:0>End  : Quick JBI Loopback Block Mem Test

0:0>INFO:

0:0>    POST Passed all devices.

0:0>POST:       Return to VBSC.

0:0>Master set ACK for vbsc runpost command and spin...

SC Alert: Host System has Reset

5. Perform further investigation if needed.

When POST is finished running, and if no faults were detected, the system will boot.

If POST detects a faulty device, the fault is displayed and the fault information is passed to ALOM CMT for fault handling. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.

a. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test abovec:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number, s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

The following example shows a POST error message.

7:2>

7:2>ERROR: TEST = Data Bitwalk

7:2>H/W under test = MB/CMP0/CH2/R0/D0/S0 (MB/CMP0/CH2/R0/D0)

7:2>Repair Instructions: Replace items in order listed by 'H/W

under test' above.

7:2>MSG = Pin 149 failed on MB/CMP0/CH2/R0/D0/S0 (J6901)

7:2>END_ERROR

7:2>Decode of Dram Error Log Reg Channel 2 bits

60000000.0000108c

7:2> 1 MEC 62 R/W1C Multiple corrected

errors, one or more CE not logged

7:2> 1 DAC 61 R/W1C Set to 1 if the error

was a DRAM access CE

7:2> 108c SYND 15:0 RW ECC syndrome.

7:2>

7:2> Dram Error AFAR channel 2 = 00000000.00000000

7:2> L2 AFAR channel 2 = 00000000.00000000

In this example, POST is reporting a memory error at DIMM location MB/CMP0/CH2/R0/D0. This error was detected by POST running on core 7, strand 2.

b. Run the showfaults command to obtain additional fault information.

The fault is captured by ALOM, where the fault is logged, the Service Action Required LED is lit, and the faulty component is disabled.

Example:

ok .#

sc> showfaults -v

   ID  Time              FRU         Fault

    1 APR 24 12:47:27   MB/CMP0/CH2/R0/D0  MB/CMP0/CH2/R0/D0 deemed faulty and disabled

In this example, MB/CMP0/CH2/R0/D0 is disabled. The system can boot using memory that was not disabled until the faulty component is replaced.

Note - You can use ASR commands to display and control disabled components. See Section 2.7, Managing Components With Automatic System Recovery Commands.

2.4.5 Clearing POST Detected Faults

In most cases, when POST detects a faulty component, POST logs the fault and automatically takes the failed component out of operation by placing the component in the ASR blacklist (see Section 2.7, Managing Components With Automatic System Recovery Commands).

After the faulty FRU is replaced, you must clear the fault by removing the component from the ASR blacklist.

1. At the ALOM CMT prompt, use the showfaults command to identify POST detected faults.

POST detected faults are distinguished from other kinds of faults by the text:
deemed faulty and disabled, and no UUID number is reported.

Example:

sc> showfaults -v

   ID  Time              FRU         Fault

    1 APR 24 12:47:27   MB/CMP0/CH2/R0/D0  MB/CMP0/CH2/R0/D0 deemed faulty and disabled

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

2. Use the enablecomponent command to clear the fault and remove the component from the ASR blacklist.

Use the FRU name that was reported in the fault in the previous step.

Example:

sc> enablecomponent MB/CMP0/CH0/R0/D0

The fault is cleared and should not show up when you run the showfaults command. Additionally, the Service Action Required LED is no longer on.

3. Reboot the server module.

You must reboot the server module for the enablecomponent command to take effect.

4. At the ALOM CMT prompt, use the showfaults command to verify that no faults are reported.

sc> showfaults

Last POST run: THU MAR 09 16:52:44 2006

POST status: Passed all devices

No failures found in System

2.5 Using the Solaris Predictive Self-Healing Feature

The Solaris Predictive Self-Healing (PSH) technology enables the Sun Blade T6300 server module to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use message ID to get additional information about the problem from Sun's knowledge article database.

The Predictive Self-Healing technology covers the following Sun Blade T6300 server module components:

UltraSPARC^® T1 multicore processor (CPU)

Memory

I/O bus

The PSH console message provides the following information:

Type

Severity

Description

Automated response

Impact

Suggested action for system administrator

If the Solaris PSH facility has detected a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.

Note - Additional Predictive Self-Healing information is available at: http://www.sun.com/msg

2.5.1 Identifying Faults With the `fmdump` Command

The fmdump command displays the list of faults detected by the Solaris PSH facility. Use this command for the following reasons:

To see if any faults have been detected by the Solaris PSH facility.

If you need to obtain the fault message ID (SUNW-MSG-ID) for detected faults.

To verify that the replacement of a FRU has cleared the fault and not generated any additional faults.

If you already have a fault message ID, go to Step 2 to obtain more information about the fault from the Sun Predictive Self-Healing Knowledge Article web site.

Note - Faults detected by the Solaris PSH facility are also reported through ALOM CMT alerts. In addition to the PSH fmdump command, the ALOM CMT showfaults command also provides information about faults and displays fault UUIDs. See Section 2.3.2, Displaying System Faults.

1. Check the event log using the fmdump command with -v for verbose output:

# fmdump -v

TIME				UUID										SUNW-MSG-ID

Apr 24 06:54:08.2005 lce22523-lc80-6062-e61d-f3b39290ae2c SUN4V-8000-6H

100% fault.cpu.ultraSPARCT1l2cachedata

	FRU:hc:///component=MB

	rsrc: cpu:///cpuid=0/serial=22D1D6604A

In this example, a fault is displayed, indicating the following details:

Date and time of the fault (Apr 24 06:54:08.2005)

Universal Unique Identifier (UUID) that is unique for every fault (lce22523-lc80-6062-e61d-f3b39290ae2c)

Sun message identifier (SUNW4V-8000-6H) that can be used to obtain additional fault information

Faulted FRU (FRU:hc:///component=MB), that in this example is identified as MB, indicating that the motherboard requires replacement.

2. Use the Sun message ID to obtain more information about this type of fault.

a. In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg

b. Enter the message ID in the SUNW-MSG-ID field, and press Lookup.

In this example, the message ID SUN4U-8000-6H returns the following information for corrective action:

CPU errors exceeded acceptable levels

Type

    Fault

Severity

    Major

Description

    The number of errors associated with this CPU has exceeded acceptable levels.

Automated Response

    The fault manager will attempt to remove the affected CPU from service.

Impact

    System performance may be affected.

Suggested Action for System Administrator

    Schedule a repair procedure to replace the affected CPU, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Details

    The Message ID:   SUN4U-8000-6H indicates diagnosis has determined that a CPU is faulty. The Solaris fault manager arranged an automated attempt to disable this CPU. The recommended action for the system administrator is to contact Sun support so a Sun service technician can replace the affected component.

c. Follow the suggested actions to repair the fault.

2.5.2 Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. After the fault condition is corrected, for example by replacing a faulty FRU, you must clear the fault.

Note - If you are dealing with faulty DIMMs, do not follow this procedure. Instead, perform the procedure in Section 4.3.3, Replacing a DIMM.

1. After replacing a faulty FRU, boot the system.

2. At the ALOM CMT prompt, use the showfaults command to identify PSH detected faults.

PSH detected faults are distinguished from other kinds of faults by the text:
Host detected fault.

Example:

sc> showfaults -v

ID Time              FRU               Fault

0 SEP 09 11:09:26   MB/CMP0/CH0/R0/D0 Host detected fault, MSGID:

SUN4U-8000-2S  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

If no fault is reported, you do not need to do anything else. Do not perform the subsequent step.

3. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, perform the following command:

fmadm repair UUID

Example:

sc> fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86

2.5.3 Clearing the PSH Fault From the ALOM CMT Logs

When the Solaris PSH facility detects faults, the faults are also logged by the ALOM CMT service processor. After the fault condition is corrected, for example by replacing a faulty FRU, you must clear the fault from the ALOM CMT logs.

Note - If you are dealing with faulty DIMMs, do not follow this procedure. Instead, perform the procedure in Section 4.3.3, Replacing a DIMM.

1. After replacing a faulty FRU, at the ALOM CMT prompt, use the showfaults command to identify PSH detected faults.

PSH detected faults are distinguished from other kinds of faults by the text:
Host detected fault.

Example:

showfaults

ID FRU           Fault

 0 MB             Host detected fault, MSGID: SUNW-TEST07 UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

2. Run the clearfault command with the UUID provided in the showfaults output:

sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86

Clearing fault from all indicted FRUs...

Fault cleared.

2.6 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the Sun Blade T6300 server module, you have all the Solaris OS files and commands available for collecting information and for troubleshooting.

In the event that POST, ALOM, or the Solaris PSH features did not indicate the source of a fault, check the message buffer and log files for notifications for faults. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system message. To view the system messages log file, view the contents of the /var/adm/messages file.

2.6.1 Checking the Message Buffer

1. Log in as superuser.

2. Issue the dmesg command:

# dmesg

The dmesg command displays the most recent messages generated by the system.

2.6.2 Viewing the System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

1. Log in as superuser.

2. Issue the following command:

# more /var/adm/messages

3. If you want to view all logged messages, issue the following command:

# more /var/adm/messages*

2.7 Managing Components With Automatic System Recovery Commands

The Automatic System Recovery (ASR) feature enables the server module to automatically unconfigure failed components to remove them from operation until they can be replaced. In the Sun Blade T6300 server module, the following components are managed by the ASR feature:

UltraSPARC T1 processor strands

Memory DIMMS

I/O bus

The database that contains the list of disabled components is called the ASR blacklist (asr-db).

In most cases, POST automatically disables a component when it is faulty. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist.

The ASR commands (TABLE 2-8) enable you to view, and manually add or remove components from the ASR blacklist. These commands are run from the ALOM CMT sc> prompt.

TABLE 2-8 ASR Commands
Command	Description
`showcomponent ^[3]`	Displays system components and their current state.
`enablecomponent` asrkey	Removes a component from the `asr-db` blacklist, where asrkey is the component to enable.
`disablecomponent` asrkey	Adds a component to the `asr-db` blacklist, where asrkey is the component to disable.
`clearasrdb`	Removes all entries from the `asr-db` blacklist.

Note - The components (asrkeys) vary from system to system, depending on how many cores and memory are present. Use the showcomponent command to see the asrkeys on a given system.

Note - A reset or powercycle is required after disabling or enabling a component. If the status of a component is changed with power on there is no effect to the system until the next reset or powercycle.

2.7.1 Displaying System Components With the `showcomponent` Command

The showcomponent command displays the system components (asrkeys) and reports their status.

1. At the sc> prompt, enter the showcomponent command.

Example with no disabled components:

sc> showcomponent

Keys:

MB/CMP0/P0			MB/CMP0/P1				MB/CMP0/P2				MB/CMP0/P3				MB/CMP0/P4 	MB/CMP0/P5 			MB/CMP0/P6 				MB/CMP0/P7 				MB/CMP0/P8				MB/CMP0/P9		MB/CMP0/P10			MB/CMP0/P11				MB/CMP0/P12				MB/CMP0/P13				MB/CMP0/P14      MB/CMP0/P15			MB/CMP0/P16				MB/CMP0/P17				MB/CMP0/P18				MB/CMP0/P19      MB/CMP0/P20			MB/CMP0/P21				MB/CMP0/P22				MB/CMP0/P23				MB/CMP0/P24 MB/CMP0/P25			MB/CMP0/P26				MB/CMP0/P27				MB/CMP0/P28				MB/CMP0/P29    MB/CMP0/P30			MB/CMP0/P31				MB/CMP0/CH0/R0/D0								MB/CMP0/CH0/R0/D1    MB/CMP0/CH1/R0/D0							MB/CMP0/CH1/R0/D1								MB/CMP0/CH2/R0/D0    MB/CMP0/CH2/R0/D1							MB/CMP0/CH3/R0/D0								MB/CMP0/CH3/R0/D1											MB/PCIEa			MB/PCIEb				MB/EM0								MB/EM1					MB/NEM0			MB/NEM1				MB/PCI-BRIDGE				MB/USB				TTYAMB/NET			MB/SAS-SATA-HBA

State: clean

Example showing a disabled component:

sc> showcomponent

ASR state:  Disabled Devices

   MB/CMP0/CH3/R1/D1 : dimm15 deemed faulty

2.7.2 Disabling Components With the `disablecomponent` Command

The disablecomponent command disables a component by adding it to the ASR blacklist.

1. At the sc> prompt, enter the disablecomponent command.

sc> disablecomponent MB/CMP0/CH3/R1/D1

SC Alert:MB/CMP0/CH3/R1/D1 disabled

2. After receiving confirmation that the disablecomponent command is complete, reset the server module so that the ASR command takes effect.

sc> reset

2.7.3 Enabling a Disabled Component With the `enablecomponent` Command

The enablecomponent command enables a disabled component by removing it from the ASR blacklist.

1. At the sc> prompt, enter the enablecomponent command.

sc> enablecomponent MB/CMP0/CH3/R1/D1

SC Alert:MB/CMP0/CH3/R1/D1 reenabled

2. After receiving confirmation that the enablecomponent command is complete, reset the server module for so that the ASR command takes effect.

sc> reset

2.8 Exercising the System With SunVTS

Sometimes a system exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it might be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.

2.8.1 Checking SunVTS Software Installation

This procedure assumes that the Solaris OS is running on the Sun Blade T6300 server module, and that you have access to the Solaris command line.

1. Check for the presence of SunVTS packages using the pkginfo command.

% pkginfo -l SUNWvts SUNWvtsr SUNWvtsts SUNWvtsmn

If SunVTS software is loaded, information about the packages is displayed.

If SunVTS software is not loaded, you see an error message for each missing package.

ERROR: information for "SUNWvts" was not found

ERROR: information for "SUNWvtsr" was not found

...

TABLE 2-9 lists some SunVTS packages:

TABLE 2-9 Sample of installed SunVTS Packages
Package	Description
`SUNWvts`	SunVTS framework
`SUNWvtsr`	SunVTS Framework (root)
SUNWvtsts	SunVTS for tests
`SUNWvtsmn`	SunVTS man pages

If SunVTS is not installed, you can obtain the installation packages from the following resources:

Solaris Operating System DVDs

Sun Download Center: http://www.sun.com/oem/products/vts

The SunVTS 6.3 software, and future compatible versions, are supported on the Sun Blade T6300 server module.

SunVTS installation instructions are described in the Sun VTS 6.3 User's Guide, 820-0080.

2.8.2 Exercising the System Using SunVTS Software

Before you begin, the Solaris OS must be running. You must verify that SunVTS validation test software is installed on your system. See Section 2.8.1, Checking SunVTS Software Installation.

The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS.

SunVTS software features both character-based and graphics-based interfaces.

For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by TIP or telnet commands, refer to the Sun VTS 6.3 User's Guide.

^{1 (TableFootnote) Set all of these parameters using the ALOM CMT setsc command, except for the setkeyswitch command.}

^{2 (TableFootnote) The setkeyswitch parameter, when set to diag, overrides all the other ALOM CMT POST variables.}

^{3 (TableFootnote) The showcomponent command might not report all blacklisted DIMMS.}

2.1 Sun Blade T6300 Server Module Diagnostics Overview

2.1.1 Memory Configuration and Fault Handling

2.1.1.1 Memory Configuration

2.1.1.2 Capacity Restrictions

2.1.1.3 DIMM Installation Rules

2.1.1.4 Memory Fault Handling

2.1.1.5 Troubleshooting Memory Faults

2.2 Interpreting System LEDs

2.2.1 Front Panel LEDs and Buttons

2.2.2 Ethernet Port LEDs

2.3 Using ALOM CMT for Diagnosis and Repair Verification

2.3.1 Running ALOM CMT Service-Related Commands

2.3.1.1 Connecting to ALOM

2.3.1.2 Switching Between the System Console and ALOM

2.3.1.3 Service-Related ALOM CMT Commands

2.3.2 Displaying System Faults

2.3.3 Displaying the Environmental Status

2.3.4 Displaying FRU Information

2.4 Running POST

2.4.1 Controlling How POST Runs

2.4.2 Changing POST Parameters

2.4.3 Reasons to Run POST

2.4.3.1 Verifying Hardware Functionality

2.4.3.2 Diagnosing the System Hardware

2.4.4 Running POST

2.4.5 Clearing POST Detected Faults

2.5 Using the Solaris Predictive Self-Healing Feature

2.5.1 Identifying Faults With the fmdump Command

2.5.2 Clearing PSH Detected Faults

2.5.3 Clearing the PSH Fault From the ALOM CMT Logs

2.6 Collecting Information From Solaris OS Files and Commands

2.6.1 Checking the Message Buffer

2.6.2 Viewing the System Message Log Files

2.7 Managing Components With Automatic System Recovery Commands

2.7.1 Displaying System Components With the showcomponent Command

2.7.2 Disabling Components With the disablecomponent Command

2.7.3 Enabling a Disabled Component With the enablecomponent Command

2.8 Exercising the System With SunVTS

2.8.1 Checking SunVTS Software Installation

2.8.2 Exercising the System Using SunVTS Software

2.5.1 Identifying Faults With the `fmdump` Command

2.7.1 Displaying System Components With the `showcomponent` Command

2.7.2 Disabling Components With the `disablecomponent` Command

2.7.3 Enabling a Disabled Component With the `enablecomponent` Command