CHAPTER 3

Server Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the server. This chapter does not provide detailed troubleshooting procedures, but instead describes the server diagnostics facilities and how to use them.

This chapter is intended for technicians, service personnel, and system administrators who service and repair computer systems.

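The following topics are covered:

  • Section 3.1, Overview of Server Diagnostics
  • Section 3.2, Using LEDs to Identify the State of Devices
  • Section 3.3, Using ALOM CMT for Diagnosis and Repair Verification
  • Section 3.4, Running POST
  • Section 3.5, Using the Solaris Predictive Self-Healing Feature
  • Section 3.6, Collecting Information From Solaris OS Files and Commands
  • Section 3.7, Managing Components With Automatic System Recovery Commands
  • Section 3.8, Exercising the System With SunVTS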


3.1 Overview of Server Diagnostics

There are a variety of diagnostic tools, commands, and indicators you can use to troubleshoot a server.

The LEDs, ALOM CMT, Solaris OS PSH, and many of the log files and console messages are integrated. For example, when the Solaris PSH software detects a fault, it displays the fault, logs it, and passes information to ALOM CMT, where the fault is also logged. Depending on the fault, one or more LEDs might illuminate.

The flow chart in FIGURE 3-1 and the actions in TABLE 3-1 describe an approach for using the server diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.

The flow chart assumes that you have already performed some troubleshooting such as verification of proper installation and visual inspection of cables and power, and possibly performed a reset of the server (refer to the Sun SPARC Enterprise T1000 Server Installation Guide and Sun SPARC Enterprise T1000 Server Administration Guide for details).

FIGURE 3-1 is a flow chart of the diagnostics available to troubleshoot faulty hardware. TABLE 3-1 has more information about each diagnostic in this chapter.

Note - POST is configured with ALOM CMT configuration variables (TABLE 3-6). If diag_level is set to max (diag_level=max), POST reports all detected FRUs including memory devices with errors correctable by Predictive Self-Healing (PSH). Thus, not all memory devices detected by POST need to be replaced. See Section 3.4.5, Correctable Errors Detected by POST.

FIGURE 3-1 Diagnostic Flow Chart

Figure showing the diagnostic flow chart for the server.



TABLE 3-1 Diagnostic Flow Chart Actions

Action No. 1
Diagnostic Action: Check Power OK and AC OK LEDs on the server.
Resulting Action: The Power OK LED is located on the front and rear of the chassis. The AC OK LED is located on the rear of the server on each power supply. If these LEDs are not on, check the power source and power connections to the server.
For more information, see: Section 3.2, Using LEDs to Identify the State of Devices

Action No. 2
Diagnostic Action: Run the ALOM CMT showfaults command to check for faults.
Resulting Action: The showfaults command displays the following kinds of faults:

  • Environmental faults
  • Solaris Predictive Self-Healing (PSH) detected faults
  • POST detected faults

Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.
For more information, see: Section 3.3.2, Running the showfaults Command

Action No. 3
Diagnostic Action: Check the Solaris log files for fault information.
Resulting Action: The Solaris message buffer and log files record system events and provide information about faults.

  • If system messages indicate a faulty device, replace the FRU.
  • To obtain more diagnostic information, go to Action No. 4.

For more information, see: Section 3.6, Collecting Information From Solaris OS Files and Commands; Chapter 5

Action No. 4
Diagnostic Action: Run SunVTS.
Resulting Action: SunVTS is an application you can run to exercise and diagnose FRUs. To run SunVTS, the server must be running the Solaris OS.

  • If SunVTS reports a faulty device, replace the FRU.
  • If SunVTS does not report a faulty device, go to Action No. 5.

For more information, see: Section 3.8, Exercising the System With SunVTS; Chapter 5

Action No. 5
Diagnostic Action: Run POST.
Resulting Action: POST performs basic tests of the server components and reports faulty FRUs.

Note - diag_level=min is the default ALOM CMT setting, which tests devices required to boot the server. Use diag_level=max for troubleshooting and hardware replacement.

  • If POST indicates a faulty FRU while diag_level=min, replace the FRU.
  • If POST indicates a faulty memory device while diag_level=max, the detected errors might be correctable by PSH after the server boots.
  • If POST does not indicate a faulty FRU, go to Action No. 9.

For more information, see: Section 3.4, Running POST; TABLE 3-5, TABLE 3-6; Chapter 5; Section 3.4.5, Correctable Errors Detected by POST

Action No. 6
Diagnostic Action: Determine if the fault is an environmental fault.
Resulting Action: If the fault listed by the showfaults command is a temperature or voltage fault, then the fault is an environmental fault. Environmental faults can be caused by faulty FRUs (power supply or fan tray) or by environmental conditions, such as a computer room ambient temperature that is too high or blocked server airflow. When the environmental condition is corrected, the fault automatically clears. You can also use the fault LEDs on the server to identify the faulty FRU (fan tray or power supply).
For more information, see: Section 3.3.2, Running the showfaults Command; Chapter 5, Replacing Field-Replaceable Units; Section 3.2, Using LEDs to Identify the State of Devices

Action No. 7
Diagnostic Action: Determine if the fault was detected by PSH.
Resulting Action: If the fault message displays the following text, the fault was detected by the Solaris Predictive Self-Healing software:

Host detected fault

If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU. After the FRU is replaced, perform the procedure to clear PSH detected faults.
For more information, see: Section 3.5, Using the Solaris Predictive Self-Healing Feature; Chapter 5, Replacing Field-Replaceable Units; Section 3.5.2, Clearing PSH Detected Faults

Action No. 8
Diagnostic Action: Determine if the fault was detected by POST.
Resulting Action: POST performs basic tests of the server components and reports faulty FRUs. When POST detects a faulty FRU, it logs the fault and, if possible, takes the FRU offline. Faults detected by POST display the following text in the fault message:

FRU_name deemed faulty and disabled

In this case, replace the FRU and run the procedure to clear POST detected faults.
For more information, see: Section 3.4, Running POST; Chapter 5, Replacing Field-Replaceable Units; Section 3.4.6, Clearing POST Detected Faults

Action No. 9
Diagnostic Action: Contact technical support.
Resulting Action: The majority of hardware faults are detected by the server's diagnostics. In rare cases, a problem might require additional troubleshooting. If you are unable to determine the cause of the problem, contact technical support.
For more information, see: Section 2.2, Obtaining the Chassis Serial Number
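Actions 6 through 8 distinguish fault sources by the text of the fault message. The following Python sketch illustrates that triage. The function name, the keyword matching for environmental faults, and the sample strings are assumptions made for illustration; only the two marker phrases come from TABLE 3-1.

```python
def classify_fault(message: str) -> str:
    """Return 'psh', 'post', 'environmental', or 'unknown' for one fault message."""
    text = message.lower()
    # Marker text for PSH detected faults (Action No. 7).
    if "host detected fault" in text:
        return "psh"
    # Marker text for POST detected faults (Action No. 8).
    if "deemed faulty and disabled" in text:
        return "post"
    # Assumption: environmental messages mention temperature or voltage (Action No. 6).
    if "temperature" in text or "voltage" in text:
        return "environmental"
    return "unknown"


# Follow-up sections named in TABLE 3-1 for each fault source.
NEXT_STEP = {
    "psh": "Replace the FRU, then see Section 3.5.2, Clearing PSH Detected Faults",
    "post": "Replace the FRU, then see Section 3.4.6, Clearing POST Detected Faults",
    "environmental": "Correct the condition or replace the fan tray or power supply",
    "unknown": "Continue with the remaining actions in TABLE 3-1",
}

# Hypothetical message strings, not captured from a real server.
for sample in ("Host detected fault, MSGID: EXAMPLE",
               "MB/CMP0/CH0/R1/D0 deemed faulty and disabled"):
    kind = classify_fault(sample)
    print(kind, "->", NEXT_STEP[kind])
```

The order of the checks matters only if a message contains more than one marker, in which case the PSH and POST phrases take precedence over the looser keyword match.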


3.1.1 Memory Configuration and Fault Handling

A variety of features play a role in how the memory subsystem is configured and how memory faults are handled. Understanding the underlying features helps you identify and repair memory problems. This section describes how the memory is configured and how the server deals with memory faults.

3.1.1.1 Memory Configuration

In the server memory, there are eight slots that hold DDR-2 memory DIMMs in the following DIMM sizes:

All DIMMs installed must be the same size, and DIMMs must be added four at a time. In addition, Rank 0 memory must be fully populated for the server to function.

See Section 5.6.2, Installing DIMMs, for instructions about adding memory to the server.
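The population rules above can be expressed as a short check. The following Python sketch is illustrative only: the function name is hypothetical, the slot names are taken from the showfru output in Section 3.3.4, and it assumes that the R0 and R1 components of a slot name denote Rank 0 and Rank 1.

```python
# Slot names as they appear in showfru output (assumption: R0 = Rank 0, R1 = Rank 1).
RANK0_SLOTS = frozenset({
    "MB/CMP0/CH0/R0/D0", "MB/CMP0/CH0/R0/D1",
    "MB/CMP0/CH3/R0/D0", "MB/CMP0/CH3/R0/D1",
})
RANK1_SLOTS = frozenset({
    "MB/CMP0/CH0/R1/D0", "MB/CMP0/CH0/R1/D1",
    "MB/CMP0/CH3/R1/D0", "MB/CMP0/CH3/R1/D1",
})
ALL_SLOTS = RANK0_SLOTS | RANK1_SLOTS


def check_dimm_config(installed: dict[str, int]) -> list[str]:
    """Check a {slot name: DIMM size in MB} mapping; return a list of problems."""
    problems = []
    unknown = sorted(set(installed) - ALL_SLOTS)
    if unknown:
        problems.append("unknown slot name(s): " + ", ".join(unknown))
    # All installed DIMMs must be the same size.
    if len(set(installed.values())) > 1:
        problems.append("installed DIMMs are not all the same size")
    # DIMMs are added four at a time, so the count is a nonzero multiple of four.
    if not installed or len(installed) % 4 != 0:
        problems.append("DIMMs must be installed four at a time")
    # Rank 0 must be fully populated for the server to function.
    missing = sorted(RANK0_SLOTS - set(installed))
    if missing:
        problems.append("Rank 0 is not fully populated: missing " + ", ".join(missing))
    return problems


print(check_dimm_config({slot: 2048 for slot in ALL_SLOTS}))    # fully populated
print(check_dimm_config({slot: 1024 for slot in RANK1_SLOTS}))  # Rank 0 empty
```

An empty list means that none of the three rules is violated. The sketch does not check whether a given DIMM size is supported.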

3.1.1.2 Memory Fault Handling

The server uses advanced ECC technology, also called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as the bits are all in the same DRAM. If a DRAM fails, the DIMM continues to function.
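The nibble-boundary condition can be illustrated with a small Python sketch. It assumes, for illustration only, that bit positions are numbered from 0 within an ECC word and that each aligned group of four bits (a nibble) is supplied by a single DRAM. The function name is hypothetical, and the actual bit-to-DRAM mapping is not described in this chapter.

```python
def within_one_nibble(error_bits: set[int]) -> bool:
    """Return True if every erroneous bit falls inside the same aligned 4-bit nibble."""
    # Integer division by 4 gives the index of the aligned nibble that holds each bit.
    nibbles = {bit // 4 for bit in error_bits}
    return len(nibbles) <= 1


print(within_one_nibble({8, 9, 11}))  # bits 8 through 11 form one nibble
print(within_one_nibble({3, 4}))      # bits 3 and 4 straddle a nibble boundary
```

Under these assumptions, an error pattern that stays within one nibble corresponds to the case the text describes as correctable, while a pattern that crosses a nibble boundary does not.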

The following server features independently manage memory faults:

When a memory fault is detected, POST displays the fault with the device name of the faulty DIMMs, logs the fault, and disables the faulty DIMMs by placing them in the ASR blacklist. For a given memory fault, POST disables half of the physical memory in the system. When this offlining process occurs in normal operation, you must replace the faulty DIMMs based on the fault message and enable the disabled DIMMs with the ALOM CMT enablecomponent command.

In other than normal operation, POST can be configured to run various levels of testing (see TABLE 3-5 and TABLE 3-6) and can thoroughly test the memory subsystem based on the purpose of the test. However, with thorough testing enabled (diag_level=max), POST finds faults and offlines memory devices with errors that could be correctable with PSH. Thus, not all memory devices detected and offlined by POST need to be replaced. See Section 3.4.5, Correctable Errors Detected by POST.

3.1.1.3 Troubleshooting Memory Faults

If you suspect that the server has a memory problem, follow the flow chart (see FIGURE 3-1 and TABLE 3-1). Run the ALOM CMT showfaults command. The showfaults command lists memory faults and the specific DIMMs that are associated with each fault. Once you identify which DIMMs to replace, see Chapter 5 for DIMM removal and replacement instructions. It is important that you perform the instructions in that chapter to clear the faults and enable the replaced DIMMs.


3.2 Using LEDs to Identify the State of Devices

The server provides the following groups of LEDs:

  • Front and rear panel LEDs (Section 3.2.1)
  • Power supply LEDs (Section 3.2.2)

These LEDs provide a quick visual check of the state of the system.


FIGURE 3-2 LEDs on the Server Front Panel

Figure showing the location of LEDs on the front panel of the server.



FIGURE 3-3 LEDs on the Server Rear Panel

Figure showing the location of LEDs and ports on the rear panel of the server.


3.2.1 Front and Rear Panel LEDs

Two LEDs and one LED/button are located in the upper left corner of the front panel (TABLE 3-2). The LEDs are also provided on the rear panel.


TABLE 3-2 Front and Rear Panel LEDs

LED: Locator LED/button
Location: Front and rear panels
Color: White
Description: Enables you to identify a particular server. Activate the LED using one of the following methods:

  • Issuing the setlocator on or off command.
  • Pressing the button to toggle the indicator on or off.

This LED provides the following indications:

  • Off - Normal operating state.
  • Fast blink - The server received a signal as a result of one of the preceding methods and is indicating "Here I am" to show that it is operational.

LED: Service Required LED
Location: Front and rear panels
Color: Yellow
Description: If on, indicates that service is required. The ALOM CMT showfaults command indicates any faults causing this indicator to light.

LED: Power OK LED
Location: Front and rear panels
Color: Green
Description: The LED provides the following indications:

  • Off - Indicates that the system is unavailable. Either it has no power or ALOM CMT is not running.
  • Steady on - Indicates that the system is powered on and is running in its normal operating state. No service actions are required.
  • Standby blink - Indicates that the system is running at a minimum level in standby and is ready to be quickly returned to full function. The service processor is running.
  • Slow blink - Indicates that a normal transitory activity is taking place. Server diagnostics could be running, or the system might be powering on.

LED: Power On/Off button
Location: Front panel
Color: N/A
Description: Turns the server on and off.

LED: Ethernet Link Activity LEDs
Location: Rear panel
Color: Green
Description: Indicate that there is activity on the associated nets.

LED: Ethernet Link LEDs
Location: Rear panel
Color: Yellow
Description: Indicate that the server is linked to the associated nets.

LED: SC Network Management Activity LED
Location: Rear panel
Color: Yellow
Description: Indicates that there is activity on the SC Network Management port.

LED: SC Network Management Link LED
Location: Rear panel
Color: Green
Description: Indicates that the server is linked to the SC Network Management port.


3.2.2 Power Supply LEDs

The power supply LEDs (TABLE 3-3) are located on the back of the power supply.


TABLE 3-3 Power Supply LEDs

Name: Fault
Color: Amber
Description:

  • On - Power supply has detected a failure.
  • Off - Normal operation.

Name: DC OK
Color: Green
Description:

  • On - Normal operation. DC output voltage is within normal limits.
  • Off - Power is off.

Name: AC OK
Color: Green
Description:

  • On - Normal operation. Input power is within normal limits.
  • Off - No input voltage, or input voltage is below limits.
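As a quick aid for reading the three power supply LEDs together, the following Python sketch maps observed LED states to the descriptions in TABLE 3-3. The function name and the boolean inputs are hypothetical conventions chosen for this illustration.

```python
def power_supply_status(fault_on: bool, dc_ok_on: bool, ac_ok_on: bool) -> list[str]:
    """Translate observed power supply LED states into TABLE 3-3 descriptions."""
    notes = []
    if fault_on:
        notes.append("Fault LED on: power supply has detected a failure")
    if not ac_ok_on:
        notes.append("AC OK LED off: no input voltage, or input voltage is below limits")
    if not dc_ok_on:
        notes.append("DC OK LED off: power is off")
    # No notes means Fault is off and both OK LEDs are on.
    return notes or ["Normal operation"]


print(power_supply_status(fault_on=False, dc_ok_on=True, ac_ok_on=True))
print(power_supply_status(fault_on=False, dc_ok_on=False, ac_ok_on=False))
```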


3.3 Using ALOM CMT for Diagnosis and Repair Verification

The Sun Advanced Lights Out Management (ALOM) CMT is a system controller in the server that enables you to remotely manage and administer your server.

ALOM CMT enables you to remotely run diagnostics, such as power-on self-test (POST), that would otherwise require physical proximity to the server's serial port. You can also configure ALOM CMT to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ALOM CMT.

The ALOM CMT circuitry runs independently of the server, using the server's standby power. Therefore, ALOM CMT firmware and software continue to function when the server operating system goes offline or when the server is powered off.

Note - Refer to the Advanced Lights Out Management (ALOM) CMT Guide for comprehensive ALOM CMT information.

Faults detected by ALOM CMT, POST, and the Solaris Predictive Self-Healing (PSH) technology are forwarded to ALOM CMT for fault handling (FIGURE 3-4).

In the event of a system fault, ALOM CMT ensures that the Service Required LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.


FIGURE 3-4 ALOM CMT Fault Management

Figure showing the fault source interfaces.


ALOM CMT sends alerts to all ALOM CMT users who are logged in, sends the alert through email to a configured email address, and writes the event to the ALOM CMT event log.

ALOM CMT can detect when a fault is no longer present and clears the fault in several ways:

ALOM CMT can detect the removal of a FRU, in many cases even if the FRU is removed while ALOM CMT is powered off. This enables ALOM CMT to know that a fault, diagnosed to a specific FRU, has been repaired. The ALOM CMT clearfault command enables you to manually clear certain types of faults without a FRU replacement or if ALOM CMT was unable to automatically detect the FRU replacement.

Note - ALOM CMT does not automatically detect hard drive replacement.

Many environmental faults can automatically recover. A temperature that is exceeding a threshold might return to normal limits. An unplugged power supply can be plugged in, and so on. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

Environmental faults can be repaired through the removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring, and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals, is:

fru at location has been removed.

There is no ALOM CMT command to manually repair an environmental fault.

The Solaris Predictive Self-Healing technology does not monitor the hard drive for faults. As a result, ALOM CMT does not recognize hard drive faults, and will not light the fault LEDs on either the chassis or the hard drive itself. Use the Solaris message files to view hard drive faults. See Section 3.6, Collecting Information From Solaris OS Files and Commands.

3.3.1 Running ALOM CMT Service-Related Commands

This section describes the ALOM CMT commands that are commonly used for service-related activities.

3.3.1.1 Connecting to ALOM

Before you can run ALOM CMT commands, you must connect to the ALOM. There are several ways to connect to the system controller:

Note - Refer to the Advanced Lights Out Management (ALOM) CMT Guide for instructions on configuring and connecting to ALOM.

3.3.1.2 Switching Between the System Console and ALOM

  • To switch from the system console to the ALOM CMT sc> prompt, type the #. key sequence.
  • To switch from the sc> prompt to the system console, type console.

3.3.1.3 Service-Related ALOM CMT Commands

TABLE 3-4 describes the typical ALOM CMT commands for servicing the server. For descriptions of all ALOM CMT commands, issue the help command or refer to the Advanced Lights Out Management (ALOM) CMT Guide.


TABLE 3-4 Service-Related ALOM CMT Commands

ALOM CMT Command

Description

help [command]

Displays a list of all ALOM CMT commands with syntax and descriptions. Specifying a command name as an option displays help for that command.

break [-y][-c][-D]

Takes the host server from the OS to either kmdb or OpenBoot PROM (equivalent to a Stop-A), depending on the mode in which the Solaris software was booted.

  • -y skips the confirmation question
  • -c executes a console command after the break command completes
  • -D forces a core dump of the Solaris OS

clearfault UUID

Manually clears host-detected faults. The UUID is the unique fault ID of the fault to be cleared.

console [-f]

Connects you to the host system. The -f option forces the console to have read and write capabilities.

consolehistory [-b lines|-e lines|-v] [-g lines] [boot|run]

Displays the contents of the system's console buffer. The following options enable you to specify how the output is displayed:

  • -g lines specifies the number of lines to display before pausing.
  • -e lines displays the specified number of lines from the end of the buffer.
  • -b lines displays the specified number of lines from the beginning of the buffer.
  • -v displays entire buffer.
  • boot|run specifies the log to display (run is the default log).

bootmode [normal|reset_nvram|bootscript=string]

Enables control of the firmware during system initialization with the following options:

  • normal is the default boot mode.
  • reset_nvram resets OpenBoot PROM parameters to their default values.
  • bootscript=string enables the passing of a string to the boot command.

powercycle [-f]

Performs a poweroff followed by poweron. The -f option forces an immediate poweroff, otherwise the command attempts a graceful shutdown.

poweroff [-y] [-f]

Powers off the host server. The -y option enables you to skip the confirmation question. The -f option forces an immediate shutdown.

poweron [-c]

Powers on the host server. Using the -c option executes a console command after completion of the poweron command.

reset [-y] [-c]

Generates a hardware reset on the host server. The -y option enables you to skip the confirmation question. The -c option executes a console command after completion of the reset command.

resetsc [-y]

Reboots the system controller. The -y option enables you to skip the confirmation question.

setkeyswitch [-y] normal | stby | diag | locked

Sets the virtual keyswitch. The -y option enables you to skip the confirmation question when setting the keyswitch to stby.

setlocator [on | off]

Turns the Locator LED on the server on or off.

showenvironment

Displays the environmental status of the host server. This information includes system temperatures, power supply, front panel LED, hard drive, fan, voltage, and current sensor status. See Section 3.3.3, Running the showenvironment Command.

showfaults [-v]

Displays current system faults. See Section 3.3.2, Running the showfaults Command.

showfru [-g lines] [-s | -d] [FRU]

Displays information about the FRUs in the server.

  • -g lines specifies the number of lines to display before pausing the output to the screen.
  • -s displays static information about system FRUs (defaults to all FRUs, unless one is specified).
  • -d displays dynamic information about system FRUs (defaults to all FRUs, unless one is specified).

See Section 3.3.4, Running the showfru Command.

showkeyswitch

Displays the status of the virtual keyswitch.

showlocator

Displays the current state of the Locator LED as either on or off.

showlogs [-b lines | -e lines | -v] [-g lines] [-p logtype[r|p]]

Displays the history of all events logged in the ALOM CMT event buffers (in RAM or the persistent buffers).

showplatform [-v]

Displays information about the host system's hardware configuration, the system serial number, and whether the hardware is providing service.


Note - See TABLE 3-7 for the ALOM CMT ASR commands.

3.3.2 Running the showfaults Command

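The ALOM CMT showfaults command displays the following kinds of faults:

  • Environmental faults
  • Solaris Predictive Self-Healing (PSH) detected faults
  • POST detected faults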

Use the showfaults command for the following reasons:

  • At the sc> prompt, type the showfaults command.

The following showfaults command examples show the different kinds of output from the showfaults command:

3.3.3 Running the showenvironment Command

The showenvironment command displays a snapshot of the server's environmental status. This command displays system temperatures, hard disk drive status, power supply and fan status, front panel LED status, and voltage and current sensor status. The output uses a format similar to the Solaris OS command prtdiag(1M).

  • At the sc> prompt, type the showenvironment command.

The output differs according to your system's model and configuration. Example:


sc> showenvironment
 
 
=============== Environmental Status ===============
 
 
--------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
--------------------------------------------------------------------------------
Sensor           Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
--------------------------------------------------------------------------------
MB/T_AMB         OK        28    -10      -5       0      45       50       55
MB/CMP0/T_TCORE  OK        50    -10      -5       0      85       90       95
MB/CMP0/T_BCORE  OK        51    -10      -5       0      85       90       95
MB/IOB/T_CORE    OK        49    -10      -5       0      95      100      105
 
--------------------------------------------------------
System Indicator Status:
--------------------------------------------------------
SYS/LOCATE           SYS/SERVICE          SYS/ACT             
OFF                  OFF                  ON                  
--------------------------------------------------------
 
----------------------------------------------------------
Fans (Speeds Revolution Per Minute):
----------------------------------------------------------
Sensor           Status           Speed   Warn    Low
----------------------------------------------------------
FT0/F0           OK                6762   2240   1920
FT0/F1           OK                6762   2240   1920
FT0/F2           OK                6762   2240   1920
FT0/F3           OK                6653   2240   1920
 
--------------------------------------------------------------------------------
Voltage sensors (in Volts):
--------------------------------------------------------------------------------
Sensor          Status      Voltage LowSoft LowWarn HighWarn HighSoft
--------------------------------------------------------------------------------
MB/V_VCORE      OK            1.30    1.20    1.24    1.36     1.39
MB/V_VMEM       OK            1.79    1.69    1.72    1.87     1.90
MB/V_VTT        OK            0.89    0.84    0.86    0.93     0.95
MB/V_+1V2       OK            1.18    1.09    1.11    1.28     1.30
MB/V_+1V5       OK            1.49    1.36    1.39    1.60     1.63
MB/V_+2V5       OK            2.51    2.27    2.32    2.67     2.72
MB/V_+3V3       OK            3.29    3.06    3.10    3.49     3.53
MB/V_+5V        OK            5.02    4.55    4.65    5.35     5.45
MB/V_+12V       OK           12.25   10.92   11.16   12.84    13.08
MB/V_+3V3STBY   OK            3.33    3.13    3.16    3.53     3.59
 
-----------------------------------------------------------
System Load (in amps):
-----------------------------------------------------------
Sensor           Status              Load     Warn Shutdown
-----------------------------------------------------------
MB/I_VCORE       OK                20.560   80.000   88.000
MB/I_VMEM        OK                 8.160   60.000   66.000
-----------------------------------------------------------
 
 
----------------------
Current sensors: 
----------------------
Sensor          Status
----------------------
MB/BAT/V_BAT     OK
 
 
------------------------------------------------------------------------------
Power Supplies:
------------------------------------------------------------------------------
Supply  Status          Underspeed  Overtemp  Overvolt  Undervolt  Overcurrent
------------------------------------------------------------------------------
PS0     OK              OFF         OFF       OFF       OFF        OFF
 
sc> 

Note - Some environmental information might not be available when the server is in Standby mode.
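If you capture showenvironment output to a file, the temperature table can be parsed to show how far each sensor is from its warning threshold. The following Python sketch assumes the column layout shown in the example above. The function and variable names are hypothetical, and the layout might differ on other models or firmware versions.

```python
# Excerpt of the example output shown in this section.
TEMPERATURE_SAMPLE = """\
System Temperatures (Temperatures in Celsius):
--------------------------------------------------------------------------------
Sensor           Status  Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
--------------------------------------------------------------------------------
MB/T_AMB         OK        28    -10      -5       0      45       50       55
MB/CMP0/T_TCORE  OK        50    -10      -5       0      85       90       95
MB/CMP0/T_BCORE  OK        51    -10      -5       0      85       90       95
MB/IOB/T_CORE    OK        49    -10      -5       0      95      100      105

--------------------------------------------------------
System Indicator Status:
"""

COLUMNS = ("temp", "low_hard", "low_soft", "low_warn",
           "high_warn", "high_soft", "high_hard")


def parse_temperatures(output: str) -> list[dict]:
    """Return one dict per sensor row of the System Temperatures table."""
    rows = []
    in_section = False
    for line in output.splitlines():
        if line.startswith("System Temperatures"):
            in_section = True
            continue
        if not in_section:
            continue
        if line.rstrip().endswith(":"):
            break  # heading of the next section
        fields = line.split()
        if len(fields) != 9:
            continue  # separator or blank line
        try:
            values = [int(field) for field in fields[2:]]
        except ValueError:
            continue  # column header row
        row = {"sensor": fields[0], "status": fields[1]}
        row.update(zip(COLUMNS, values))
        rows.append(row)
    return rows


for row in parse_temperatures(TEMPERATURE_SAMPLE):
    headroom = row["high_warn"] - row["temp"]
    print(row["sensor"], row["status"], "degrees below HighWarn:", headroom)
```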

3.3.4 Running the showfru Command

The showfru command displays information about the FRUs in the server. Use this command to see information about an individual FRU, or for all the FRUs.

Note - By default, the output of the showfru command for all FRUs is very long.

  • At the sc> prompt, enter the showfru command.


sc> showfru -s
FRU_PROM at MB/SEEPROM
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32:      TUE OCT 18 21:17:55 2005
/ManR/Description:          ASSY,SPARC-Enterprise-T1000,Motherboard
/ManR/Manufacture Location:  Sriracha,Chonburi,Thailand
/ManR/Sun Part No:           5017302
/ManR/Sun Serial No:         002989
/ManR/Vendor:                Celestica
/ManR/Initial HW Dash Level: 03
/ManR/Initial HW Rev Level:  01
/ManR/Shortname:             T1000_MB
/SpecPartNo:                 885-0505-04
 
 
FRU_PROM at PS0/SEEPROM
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32:      SUN JUL 31 19:45:13 2005 
/ManR/Description:           PSU,300W,AC_INPUT,A207
/ManR/Manufacture Location:  Matamoros, Tamps, Mexico
/ManR/Sun Part No:           3001799
/ManR/Sun Serial No:         G00001
/ManR/Vendor:                Tyco Electronics
/ManR/Initial HW Dash Level: 02
/ManR/Initial HW Rev Level:  01
/ManR/Shortname:             PS
/SpecPartNo:                 885-0407-02
 
 
FRU_PROM at MB/CMP0/CH0/R0/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03fe27
 
 
FRU_PROM at MB/CMP0/CH0/R0/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03f623
 
 
FRU_PROM at MB/CMP0/CH0/R1/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03fc26
 
 
FRU_PROM at MB/CMP0/CH0/R1/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03eb26
 
 
FRU_PROM at MB/CMP0/CH3/R0/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03e620
 
 
FRU_PROM at MB/CMP0/CH3/R0/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d040920
 
 
FRU_PROM at MB/CMP0/CH3/R1/D0/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d03ec27
 
 
FRU_PROM at MB/CMP0/CH3/R1/D1/SEEPROM
/SPD/Timestamp: MON OCT 03 12:00:00 2005
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:  
/SPD/Vendor: Infineon (formerly Siemens)
/SPD/Vendor Part No:   72T256220HR3.7A   
/SPD/Vendor Serial No: d040924
 
 
sc> 
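Because the showfru output is long, it can be convenient to load it into a lookup table keyed by FRU name. The following Python sketch assumes the "FRU_PROM at name" and "key: value" layout shown in the example above. The function and variable names are hypothetical.

```python
# Excerpt of the example output shown in this section.
SHOWFRU_SAMPLE = """\
sc> showfru -s
FRU_PROM at MB/SEEPROM
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32:      TUE OCT 18 21:17:55 2005
/ManR/Vendor:                Celestica
/ManR/Shortname:             T1000_MB

FRU_PROM at MB/CMP0/CH0/R0/D0/SEEPROM
/SPD/Description: DDR2 SDRAM, 2048 MB
/SPD/Manufacture Location:
/SPD/Vendor Serial No: d03fe27
sc>
"""


def parse_showfru(output: str) -> dict[str, dict[str, str]]:
    """Group the 'key: value' lines of showfru output under each FRU_PROM name."""
    prefix = "FRU_PROM at "
    frus: dict[str, dict[str, str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            current = frus.setdefault(line[len(prefix):], {})
        elif current is not None and ":" in line:
            # Split on the first colon only, because timestamps contain colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return frus


frus = parse_showfru(SHOWFRU_SAMPLE)
print(frus["MB/SEEPROM"]["/ManR/Vendor"])
print(frus["MB/CMP0/CH0/R0/D0/SEEPROM"]["/SPD/Description"])
```

Lines without a colon, such as /ManR and the sc> prompt, are ignored.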


3.4 Running POST

Power-on self-test (POST) is a group of PROM-based tests that run when the server is powered on or reset. POST checks the basic integrity of the critical hardware components in the server (CPU, memory, and I/O buses).

If POST detects a faulty component, the component is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, the core will be disabled, and the system will boot and run using the remaining cores.

In normal operation*, the default configuration of POST (diag_level=min) provides a sanity check to ensure that the server will boot. Normal operation applies to any power-on of the server that is not intended to test power-on errors, hardware upgrades, or repairs. Once the Solaris OS is running, PSH provides run-time diagnosis of faults.

*Note - Earlier versions of firmware have max as the default setting for the POST diag_level variable. To set the default to min, use the ALOM CMT command setsc diag_level min.

For validating hardware upgrades or repairs, configure POST to run in maximum mode (diag_level=max). Note that with maximum testing enabled, POST detects and offlines memory devices with errors that could be correctable by PSH. Thus, not all memory devices detected by POST need to be replaced. See Section 3.4.5, Correctable Errors Detected by POST.

Note - Devices can be manually enabled or disabled using ASR commands (see Section 3.7, Managing Components With Automatic System Recovery Commands).

3.4.1 Controlling How POST Runs

The server can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using ALOM CMT variables.

TABLE 3-5 lists the ALOM CMT variables used to configure POST and FIGURE 3-5 shows how the variables work together.

Note - Use the ALOM CMT setsc command to set all the parameters in TABLE 3-5 except setkeyswitch.

TABLE 3-5 ALOM CMT Parameters Used for POST Configuration

Parameter       Value           Description
setkeyswitch    normal          The system can power on and run POST (based on the other parameter settings). For details see TABLE 3-6. This parameter overrides all other commands.
                diag            The system runs POST based on predetermined settings.
                stby            The system cannot power on.
                locked          The system can power on and run POST, but no flash updates can be made.
diag_mode       off             POST does not run.
                normal          Runs POST according to diag_level value.
                service         Runs POST with preset values for diag_level and diag_verbosity.
diag_level      min             If diag_mode = normal, runs minimum set of tests.
                max             If diag_mode = normal, runs all the minimum tests plus extensive CPU and memory tests.
diag_trigger    none            Does not run POST on reset.
                user_reset      Runs POST upon user initiated resets.
                power_on_reset  Only runs POST for the first power on. This option is the default.
                error_reset     Runs POST if fatal errors are detected.
                all_reset       Runs POST after any reset.
diag_verbosity  none            No POST output is displayed.
                min             POST output displays functional tests with a banner and pinwheel.
                normal          POST output displays all test and informational messages.
                max             POST displays all test, informational, and some debugging messages.



FIGURE 3-5 Flow Chart of ALOM CMT Variables for POST Configuration

Figure showing POST flow chart.


TABLE 3-6 shows combinations of ALOM CMT variables and associated POST modes.


TABLE 3-6 ALOM CMT Parameters and POST Modes

Parameter        Normal Diagnostic Mode      No POST Execution  Diagnostic Service Mode  Keyswitch Diagnostic Preset Values
                 (Default Settings)
diag_mode        normal                      off                service                  normal
setkeyswitch[1]  normal                      normal             normal                   diag
diag_level[2]    min                         n/a                max                      max
diag_trigger     power-on-reset error-reset  none               all-resets               all-resets
diag_verbosity   normal                      n/a                max                      max

Description of POST execution:

Normal Diagnostic Mode (Default Settings): This is the default POST configuration. This configuration tests the system thoroughly, and suppresses some of the detailed POST output.

No POST Execution: POST does not run, resulting in quick system initialization, but this is not a suggested configuration.

Diagnostic Service Mode: POST runs the full spectrum of tests with the maximum output displayed.

Keyswitch Diagnostic Preset Values: POST runs the full spectrum of tests with the maximum output displayed.
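
The combinations in TABLE 3-6 can be read as a small decision procedure. The following Python sketch is one way to express it, not a description of how ALOM CMT is implemented. The function name and the dictionary layout are hypothetical, the stby and locked keyswitch positions are not modeled, and runs_post only indicates whether POST is eligible to run on a matching reset.

```python
# Sketch of how TABLE 3-6 resolves ALOM CMT variables into a POST mode.
# The function name and the returned dictionary layout are illustrative only.

# Values applied by both the keyswitch diag preset and diag_mode=service.
PRESET = {"diag_level": "max", "diag_verbosity": "max", "diag_trigger": "all-resets"}


def effective_post_settings(keyswitch, diag_mode, diag_level, diag_verbosity, diag_trigger):
    """Return the POST settings implied by a combination of ALOM CMT variables."""
    if keyswitch == "diag":
        # Per the table footnote, setkeyswitch diag overrides the other variables.
        return {"runs_post": True, "reason": "keyswitch diag preset", **PRESET}
    if diag_mode == "off":
        return {"runs_post": False, "reason": "diag_mode off"}
    if diag_mode == "service":
        return {"runs_post": True, "reason": "diag_mode service preset", **PRESET}
    # diag_mode normal: the individual variables apply as set.
    return {
        "runs_post": diag_trigger != "none",
        "reason": "diag_mode normal",
        "diag_level": diag_level,
        "diag_verbosity": diag_verbosity,
        "diag_trigger": diag_trigger,
    }
```

For example, calling the function with the default column of the table (keyswitch normal, diag_mode normal, diag_level min) returns the values unchanged, while any call with keyswitch diag returns the max presets.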


3.4.2 Changing POST Parameters

1. Access the ALOM CMT sc> prompt:

At the console, issue the #. key sequence:


#.

2. Use the ALOM CMT sc> prompt to change the POST parameters.

Refer to TABLE 3-5 for a list of ALOM CMT POST parameters and their values.

The setkeyswitch parameter sets the virtual keyswitch, so it does not use the setsc command. For example, to change the POST parameters using the setkeyswitch command, enter the following:


sc> setkeyswitch diag

To change the POST parameters using the setsc command, you must first set the setkeyswitch parameter to normal. You can then change the POST parameters using the setsc command:


sc> setkeyswitch normal
sc> setsc parameter value

Example:


sc> setkeyswitch normal
sc> setsc diag_mode service
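
When the same parameter changes are applied to several servers, it can help to generate the command lines from checked values. The following Python sketch builds the sc> command sequence described above. The function name is hypothetical, the allowed values are copied from TABLE 3-5, and diag_trigger is left out because TABLE 3-5 and TABLE 3-6 spell its values differently.

```python
# Allowed values as listed in TABLE 3-5. diag_trigger is omitted on purpose.
ALLOWED = {
    "diag_mode": {"off", "normal", "service"},
    "diag_level": {"min", "max"},
    "diag_verbosity": {"none", "min", "normal", "max"},
}


def setsc_commands(**params):
    """Build the sc> command lines for changing POST parameters with setsc."""
    for name, value in params.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown POST parameter: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"invalid value for {name}: {value}")
    # The procedure sets the virtual keyswitch to normal before using setsc.
    commands = ["setkeyswitch normal"]
    commands += [f"setsc {name} {value}" for name, value in params.items()]
    return commands
```

Calling setsc_commands(diag_mode="service") yields the two commands shown in the example above. The sketch only produces text; it does not connect to the system controller.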

3.4.3 Reasons to Run POST

You can use POST for basic hardware verification and diagnosis, and for troubleshooting as described in the following sections.

3.4.3.1 Verifying Hardware Functionality

POST tests critical hardware components to verify functionality before the system boots and accesses software. If POST detects an error, the faulty component is disabled automatically, preventing faulty hardware from potentially harming software.

In normal operation (diag_level=min), POST runs in minimum mode by default to test the devices required to power on the server. Replace any devices that POST detects as faulty in minimum mode.

Run POST in maximum mode (diag_level=max) for all power-on or error-generated resets, and to validate hardware upgrades or repairs. With maximum testing enabled, POST finds faults and offlines memory devices with errors that could be correctable by PSH. Check the POST-generated errors with the showfaults -v command to determine whether memory devices detected by POST can be corrected by PSH or must be replaced. See Section 3.4.5, Correctable Errors Detected by POST.

3.4.3.2 Diagnosing the System Hardware

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in maximum mode (diag_mode=service, setkeyswitch=diag, diag_level=max) for thorough test coverage and verbose output.

3.4.4 Running POST in Maximum Mode

This procedure describes how to run POST when you want maximum testing, as in the case when you are troubleshooting a server or verifying a hardware upgrade or repair.

1. Switch from the system console prompt to the sc> prompt by issuing the #. escape sequence.


ok #.
sc>

2. Set the virtual keyswitch to diag so that POST will run in Service mode.


sc> setkeyswitch diag

3. Reset the system so that POST runs.

There are several ways to initiate a reset. The following example uses the powercycle command. For other methods, refer to the Sun SPARC Enterprise T1000 Server Administration Guide.


sc> powercycle
Are you sure you want to powercycle the system [y/n]? y
Powering host off at MON JAN 10 02:52:02 2000
 
Waiting for host to Power Off; hit any key to abort.
 
SC Alert: SC Request to Power Off Host.
 
SC Alert: Host system has shut down.
Powering host on at MON JAN 10 02:52:13 2000
 
SC Alert: SC Request to Power On Host.

4. Switch to the system console to view the POST output:


sc> console

Example of POST output (some output is omitted):


SC: Alert: Host system has reset
0:0>
0:0>@(#) ERIE Integrated POST 4.x.0.build_17 2005/08/30 11:25 
       /export/common-source/firmware_re/ontario-fireball_fio/build_17/post/Niagara/erie/integrated  (firmware_re) 
0:0>Copyright © 2005 Sun Microsystems, Inc. All rights reserved
  SUN PROPRIETARY/CONFIDENTIAL.
  Use is subject to license terms.
0:0>VBSC selecting POST IO Testing.
0:0>VBSC enabling threads: 1
0:0>VBSC setting verbosity level 3
0:0>Start Selftest.....
0:0>Init CPU
0:0>Master CPU Tests Basic.....
0:0>CPU =: 0
0:0>DMMU Registers Access
0:0>IMMU Registers Access
0:0>Init mmu regs
0:0>D-Cache RAM
0:0>DMMU TLB DATA RAM Access
0:0>DMMU TLB TAGS Access
0:0>DMMU CAM
0:0>IMMU TLB DATA RAM Access
0:0>IMMU TLB TAGS Access
0:0>IMMU CAM
0:0>Setup and Enable DMMU
0:0>Setup DMMU Miss Handler
0:0>	Niagara, Version 2.0
0:0>	Serial Number 00000098.00000820 = fffff238.6b4c60e9
0:0>Init JBUS Config Regs
0:0>IO-Bridge unit 1 init test             
0:0>sys 200 MHz, CPU 1000 MHz, mem 200 MHz.
0:0>Integrated POST Testing
0:0>L2 Tests.....
0:0>Setup L2 Cache
0:0>L2 Cache Control = 00000000.00300000 
0:0>Scrub and Setup L2 Cache
0:0>L2 Directory clear
0:0>L2 Scrub VD & UA
0:0>L2 Scrub Tags
0:0>Test Memory Basic.....
0:0>Probe and Setup Memory
0:0>INFO:	4096MB at Memory Channel [0 3 ] Rank 0 Stack 0
0:0>INFO:	4096MB at Memory Channel [0 3 ] Rank 0 Stack 1
0:0>INFO:	No memory detected at Memory Channel [0 3 ] Rank 1 Stack 0
0:0>INFO:	No memory detected at Memory Channel [0 3 ] Rank 1 Stack 1
0:0>
0:0>Data Bitwalk
0:0>L2 Scrub Data
0:0>L2 Enable
0:0>	Testing Memory Channel 0 Rank 0 Stack 0 
0:0>	Testing Memory Channel 3 Rank 0 Stack 0 
0:0>	Testing Memory Channel 0 Rank 0 Stack 1 
0:0>	Testing Memory Channel 3 Rank 0 Stack 1 
0:0>L2 Directory clear
0:0>L2 Scrub VD & UA
0:0>L2 Scrub Tags
0:0>L2 Disable 
0:0>Address Bitwalk
0:0>	Testing Memory Channel 0 Rank 0 Stack 0 
0:0>	Testing Memory Channel 3 Rank 0 Stack 0 
0:0>	Testing Memory Channel 0 Rank 0 Stack 1 
0:0>	Testing Memory Channel 3 Rank 0 Stack 1 
0:0>Test Slave Threads Basic.....
0:0>Set Mailbox
0:0>Setup Final DMMU Entries
0:0>Post Image Region Scrub
0:0>Run POST from Memory
0:0>Verifying checksum on copied image.
0:0>The Memory's CHECKSUM value is cc1e.
0:0>The Memory's Content Size value is 7b192.
0:0>Success...  Checksum on Memory Validated.
0:0>L2 Cache Ram Test
0:0>Enable L2 Cache
0:0>L2 Scrub Data
0:0>L2 Enable
0:0>CPU =: 0
0:0>CPU =: 0
0:0>Test slave strand registers...
0:0>Extended CPU Tests.....
0:0>Scrub Icache
0:0>Scrub Dcache
0:0>D-Cache Tags
0:0>I-Cache RAM Test
0:0>I-Cache Tag RAM
0:0>FPU Registers and Data Path
0:0>FPU Move Registers
0:0>FSR Read/Write
0:0>FPU Branch Instructions
0:0>Enable Icache
0:0>Enable Dcache
0:0>Scrub Memory.....
0:0>Scrub Memory
0:0>Scrub 00000000.00600000->00000001.00000000 on Memory Channel [0 3 ] Rank 0 Stack 0
0:0>Scrub 00000001.00000000->00000002.00000000 on Memory Channel [0 3 ] Rank 0 Stack 1
0:0>IMMU Functional
0:0>DMMU Functional
0:0>Extended Memory Tests.....
0:0>Print Mem Config
0:0>Caches : Icache is ON, Dcache is ON.
0:0>	Bank 0 4096MB : 00000000.00000000 -> 00000001.00000000.
0:0>	Bank 1 4096MB : 00000001.00000000 -> 00000002.00000000.
0:0>Block Mem Test
0:0>Test 6291456 bytes at 00000000.00600000 Memory Channel [ 0 3 ] Rank 0 Stack 0
0:0>........
0:0>Test 6291456 bytes at 00000001.00000000 Memory Channel [ 0 3 ] Rank 0 Stack 1
0:0>........
0:0>IO-Bridge Tests.....
0:0>IO-Bridge Quick Read             
0:0>
0:0>--------------------------------------------------------------
0:0>--------- IO-Bridge Quick Read Only of CSR and ID ---------------
0:0>--------------------------------------------------------------
0:0>fire 1 JBUSID  00000080.0f000000 = 	   
0:0>                               	 fc000002.e03dda23
0:0>--------------------------------------------------------------
0:0>fire 1 JBUSCSR 00000080.0f410000 = 	   
0:0>                               	 00000ff5.13cb7000
0:0>--------------------------------------------------------------
0:0>IO-Bridge unit 1 jbus perf test        
0:0>IO-Bridge unit 1 int init test 
0:0>IO-Bridge unit 1 msi init test 
0:0>IO-Bridge unit 1 ilu init test 
0:0>IO-Bridge unit 1 tlu init test 
0:0>IO-Bridge unit 1 lpu init test 
0:0>IO-Bridge unit 1 link train port B   
0:0>IO-Bridge unit 1 interrupt test   
0:0>IO-Bridge unit 1 Config MB bridges
0:0>Config port B, bus 2 dev 0 func 0, tag 5714 BRIDGE
0:0>Config port B, bus 3 dev 8 func 0, tag PCIX BRIDGE
0:0>IO-Bridge unit 1 PCI id test 
0:0>	INFO:10 count read passed for MB/IOB_PCIEb/BRIDGE! Last read VID:1166|DID:103
0:0>	INFO:10 count read passed for MB/IOB_PCIEb/BRIDGE/GBE! Last read VID:14e4|DID:1648
0:0>	INFO:10 count read passed for MB/IOB_PCIEb/BRIDGE/HBA! Last read VID:1000|DID:50
0:0>Quick JBI Loopback Block Mem Test
0:0>Quick jbus loopback Test 262144 bytes at 00000000.00600000
0:0>INFO:
0:0>	POST Passed all devices.
0:0>POST:	Return to VBSC.
0:0>Master set ACK for vbsc runpost command and spin...

5. Perform further investigation if needed.

a. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number and s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

The following example shows a POST error message.


.
.
.
 
0:0>Data Bitwalk
0:0>L2 Scrub Data
0:0>L2 Enable
0:0>	Testing Memory Channel 0 Rank 0 Stack 0 
0:0>	Testing Memory Channel 3 Rank 0 Stack 0 
0:0>	Testing Memory Channel 0 Rank 1 Stack 0 
.
.
.
0:0>ERROR: TEST = Data Bitwalk
0:0>H/W under test = MB/CMP0/CH0/R1/D0/S0 (J0701) 
0:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
0:0>MSG = Pin 3 failed on MB/CMP0/CH0/R1/D0/S0 (J0701) 
0:0>END_ERROR
 
0:0>	Testing Memory Channel 3 Rank 1 Stack 0 

In this example, POST is reporting a memory error at DIMM location MB/CMP0/CH0/R1/D0 (J0701).

b. Run the showfaults command to obtain additional fault information.

ALOM CMT captures the fault, logs it, lights the Service Required LED, and disables the faulty component.

Example:


ok #.
sc> showfaults -v
   ID  Time              FRU         Fault
    1 APR 24 12:47:27   MB/CMP0/CH0/R1/D0  MB/CMP0/CH0/R1/D0 deemed faulty and disabled

In this example, MB/CMP0/CH0/R1/D0 is disabled. Until the faulty component is replaced, the system can boot using the memory that was not disabled.

Note - You can use ASR commands to display and control disabled components. See Section 3.7, Managing Components With Automatic System Recovery Commands.
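
When a captured console log is long, the POST error blocks can be pulled out with a short script. The following Python sketch follows the error message syntax given in Step 5a (a core:strand prefix, then ERROR: TEST, H/W under test, MSG, and END_ERROR lines). The function name and the dictionary keys are hypothetical, and the sketch assumes each error block is closed by END_ERROR as in the example.

```python
import re

# A POST console line starts with core:strand followed by ">".
LINE = re.compile(r"^(\d+):(\d+)\s*>\s*(.*)$")


def parse_post_errors(text):
    """Collect POST error blocks (ERROR: TEST ... END_ERROR) from console output."""
    errors = []
    current = None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip lines without a core:strand prefix
        body = match.group(3).strip()
        if body.startswith("ERROR: TEST ="):
            current = {
                "core": int(match.group(1)),
                "strand": int(match.group(2)),
                "test": body.split("=", 1)[1].strip(),
            }
        elif current is not None:
            if body.startswith("H/W under test ="):
                current["fru"] = body.split("=", 1)[1].strip()
            elif body.startswith("MSG ="):
                current["msg"] = body.split("=", 1)[1].strip()
            elif body.startswith("END_ERROR"):
                errors.append(current)
                current = None
    return errors


SAMPLE = """\
0:0>	Testing Memory Channel 0 Rank 1 Stack 0
0:0>ERROR: TEST = Data Bitwalk
0:0>H/W under test = MB/CMP0/CH0/R1/D0/S0 (J0701)
0:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
0:0>MSG = Pin 3 failed on MB/CMP0/CH0/R1/D0/S0 (J0701)
0:0>END_ERROR
0:0>	Testing Memory Channel 3 Rank 1 Stack 0
"""
```

Running parse_post_errors(SAMPLE) returns one record whose test is Data Bitwalk and whose fru is MB/CMP0/CH0/R1/D0/S0 (J0701). Informational and warning lines are ignored.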

3.4.5 Correctable Errors Detected by POST

In maximum mode, POST detects and offlines memory devices with errors that could be correctable by PSH. Use the examples in this section to determine whether the errors on the detected memory devices are correctable.

Note - For servers powered on in maximum mode without the intention of validating a hardware upgrade or repair, examine all faults detected by POST to verify if the errors can be corrected by Solaris PSH. See Section 3.5, Using the Solaris Predictive Self-Healing Feature.

When using maximum mode, if no faults are detected, return POST to minimum mode.


sc> setkeyswitch normal
sc> setsc diag_mode normal
sc> setsc diag_level min

3.4.5.1 Correctable Errors for Single DIMMs

If POST faults a single DIMM (CODE EXAMPLE 3-1) that was not part of a hardware upgrade or repair, it is likely that POST encountered a correctable error that can be handled by PSH.


CODE EXAMPLE 3-1 POST Fault for a Single DIMM
sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

In this case, reenable the DIMM and run POST in minimum mode as follows:

1. Reenable the DIMM.


sc> enablecomponent name-of-DIMM

2. Return POST to minimum mode.


sc> setkeyswitch normal
sc> setsc diag_mode normal
sc> setsc diag_level min

3. Reset the system so that POST runs.

There are several ways to initiate a reset. The following example uses the powercycle command. For other methods, refer to the Sun SPARC Enterprise T1000 Server Administration Guide.


sc> powercycle
Are you sure you want to powercycle the system [y/n]? y
Powering host off at MON JAN 10 02:52:02 2000
 
Waiting for host to Power Off; hit any key to abort.
 
SC Alert: SC Request to Power Off Host.
 
SC Alert: Host system has shut down.
Powering host on at MON JAN 10 02:52:13 2000
 
SC Alert: SC Request to Power On Host.

4. Replace the DIMM if POST continues to fault the device in minimum mode.

3.4.5.2 Determining When to Replace Detected Devices

Note - This section assumes faults are detected by POST in maximum mode.

If a detected device is part of a hardware upgrade or repair, or if POST detects multiple DIMMs (CODE EXAMPLE 3-2), replace the detected devices.


CODE EXAMPLE 3-2 POST Fault for Multiple DIMMs
sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
2 OCT 13 12:47:27 MB/CMP0/CH0/R0/D1 MB/CMP0/CH0/R0/D1 deemed faulty and disabled
 

Note - The previous example shows two DIMMs on the same channel/rank, which could be an uncorrectable error.

If the detected device is not a part of a hardware upgrade or repair, use the following list to examine and repair the fault:

1. If a detected device is not a DIMM, or if more than a single DIMM is detected, replace the detected devices.

2. If a detected device is a single DIMM and the same DIMM is also detected by PSH, replace the DIMM (CODE EXAMPLE 3-3).


CODE EXAMPLE 3-3 PSH and POST Faults on the Same DIMM
sc> showfaults -v
ID Time           FRU               Fault
0 SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault,
MSGID:SUN4V-8000-DX UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

Note - The detected DIMM in the previous example must also be replaced because it exceeds the PSH page retire threshold.

3. If a device detected by POST is a single DIMM and the same DIMM is not detected by PSH, follow the procedure in Section 3.4.5.1, Correctable Errors for Single DIMMs.

After the detected devices are repaired or replaced, return POST to the default minimum level.


sc> setkeyswitch normal
sc> setsc diag_mode normal
sc> setsc diag_level min
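
The replacement rules in this section can be summarized as a single decision function. The following Python sketch is one reading of those rules, not an official algorithm. The function name and return strings are hypothetical, and the pattern used to recognize a DIMM name is an assumption based on the FRU names shown in the examples.

```python
import re

# Assumption: DIMM FRU names follow the MB/CMPn/CHn/Rn/Dn pattern seen in the examples.
DIMM_NAME = re.compile(r"^MB/CMP\d+/CH\d+/R\d+/D\d+$")


def recommend_action(post_frus, psh_frus=(), part_of_upgrade_or_repair=False):
    """Return 'none', 'replace', or 'reenable-and-retest' for POST maximum-mode faults."""
    post = sorted(set(post_frus))
    if not post:
        return "none"
    # Devices from an upgrade or repair, or multiple detected devices, are replaced.
    if part_of_upgrade_or_repair or len(post) > 1:
        return "replace"
    fru = post[0]
    # A single detected device that is not a DIMM is replaced.
    if not DIMM_NAME.match(fru):
        return "replace"
    # A single DIMM that PSH has also faulted is replaced.
    if fru in set(psh_frus):
        return "replace"
    # Otherwise follow Section 3.4.5.1: reenable the DIMM and rerun POST in minimum mode.
    return "reenable-and-retest"
```

A result of reenable-and-retest corresponds to the procedure in Section 3.4.5.1, where the DIMM is replaced only if POST continues to fault it in minimum mode.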

3.4.6 Clearing POST Detected Faults

In most cases, when POST detects a faulty component, POST logs the fault and automatically takes the failed component out of operation by placing the component in the ASR blacklist (see Section 3.7, Managing Components With Automatic System Recovery Commands).

In most cases, after the faulty FRU is replaced, ALOM CMT detects the repair and extinguishes the Service Required LED. If ALOM CMT does not perform these actions, use the enablecomponent command to manually clear the fault and remove the component from the ASR blacklist. This procedure describes how to do this.

1. After replacing a faulty FRU, at the ALOM CMT prompt use the showfaults command to identify POST detected faults.

POST detected faults are distinguished from other kinds of faults by the text:
deemed faulty and disabled, and no UUID number is reported.

Example:


sc> showfaults -v
   ID  Time              FRU         Fault
    1 APR 24 12:47:27   MB/CMP0/CH0/R1/D0  MB/CMP0/CH0/R1/D0 deemed faulty and disabled

2. Use the enablecomponent command to clear the fault and remove the component from the ASR blacklist.

Use the FRU name that was reported in the fault in the previous step.

Example:


sc> enablecomponent MB/CMP0/CH0/R1/D0 

The fault is cleared and should not appear when you run the showfaults command. Additionally, if there are no other faults remaining, the Service Required LED should be extinguished.

3. Power cycle the server.

You must reboot the server for the enablecomponent command to take effect.

4. At the ALOM CMT prompt, use the showfaults command to verify that no faults are reported.


sc> showfaults
Last POST run: THU MAR 09 16:52:44 2006
POST status: Passed all devices
 
No failures found in System
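
If showfaults -v output is saved to a file, POST detected and PSH detected faults can be separated with a short script. The following Python sketch relies only on the marker text described in this chapter (deemed faulty and disabled with no UUID for POST, Host detected fault for PSH). The function name and record layout are hypothetical, and the column layout is assumed to match the examples shown.

```python
import re

# A fault record starts with an ID, a timestamp such as "SEP 09 11:09:26", and a FRU name.
RECORD = re.compile(r"^\s*(\d+)\s+([A-Z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")
UUID = re.compile(r"UUID:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")


def parse_showfaults(output):
    """Parse showfaults -v output into records tagged as POST, PSH, or other."""
    faults = []
    for line in output.splitlines():
        match = RECORD.match(line)
        if match:
            faults.append({
                "id": int(match.group(1)),
                "time": match.group(2),
                "fru": match.group(3),
                "text": match.group(4).strip(),
            })
        elif faults and line.strip():
            # Wrapped output: append the continuation to the previous record.
            faults[-1]["text"] += " " + line.strip()
    for fault in faults:
        found = UUID.search(fault["text"])
        fault["uuid"] = found.group(1) if found else None
        if "deemed faulty and disabled" in fault["text"] and fault["uuid"] is None:
            fault["source"] = "POST"
        elif "Host detected fault" in fault["text"]:
            fault["source"] = "PSH"
        else:
            fault["source"] = "other"
    return faults


SAMPLE = """\
sc> showfaults -v
ID Time           FRU               Fault
0 SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault,
MSGID:SUN4V-8000-DX UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
"""
```

With SAMPLE, the first record is tagged PSH and carries the UUID, and the second is tagged POST with no UUID. Output that reports no failures produces an empty list.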


3.5 Using the Solaris Predictive Self-Healing Feature

The Solaris Predictive Self-Healing (PSH) technology enables the server to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universally Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun's knowledge article database.

The Predictive Self-Healing technology covers the following server components:

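The PSH console message provides the following information: the type, severity, and description of the fault, the automated response, the impact, and the suggested action for a system administrator.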

If the Solaris PSH facility detects a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see Appendix A.

Note - Additional Predictive Self-Healing information is available at: http://www.sun.com/msg

3.5.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris console message similar to the following is displayed:


SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SPARC-Enterprise-T1000, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/SUN4V-8000-DX for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.

The following is an example of the ALOM CMT alert for the same PSH diagnosed fault:


SC Alert: Host detected fault, MSGID: SUN4V-8000-DX

Note - The Service Required LED also turns on for PSH diagnosed faults.

3.5.1.1 Using the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID). Do not use fmdump to verify a FRU replacement has cleared a fault because the output of fmdump is the same after the FRU has been replaced. Use the fmadm faulty command to verify the fault has cleared.

Note - Faults detected by the Solaris PSH facility are also reported through ALOM CMT alerts. In addition to the PSH fmdump command, the ALOM CMT showfaults command provides information about faults and displays fault UUIDs. See Section 3.3.2, Running the showfaults Command.

1. Check the event log using the fmdump command with -v for verbose output:


# fmdump -v
TIME                 UUID                                 SUNW-MSG-ID
Sep 14 10:09:46.2234 f92e9fbe-735e-c218-cf87-9e1720a28004 SUN4V-8000-DX
   95%  fault.memory.dimm
         FRU: mem:///component=MB/CMP0/CH0:R0/D0/J0601
        rsrc: mem:///component=MB/CMP0/CH0:R0/D0/J0601

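In this example, a fault is displayed with the following details: the date and time of the fault (Sep 14 10:09:46), the UUID that identifies the fault (f92e9fbe-735e-c218-cf87-9e1720a28004), the Sun message ID (SUN4V-8000-DX), and the faulted FRU (mem:///component=MB/CMP0/CH0:R0/D0/J0601), which here is the DIMM at MB/CMP0/CH0:R0/D0 (J0601).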

Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.

2. Use the Sun message ID to obtain more information about this type of fault.

a. In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg

b. Obtain the message ID from the console output or the ALOM CMT showfaults command.

c. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

In this example, the message ID SUN4V-8000-DX returns the following information for corrective action:


Article for Message ID:   SUN4V-8000-DX 
Correctable memory errors exceeded acceptable levels 
Type
	Fault 
Severity
	Major 
Description
	The number of correctable memory errors reported against a memory DIMM has 
	exceeded acceptable levels. 
Automated Response
	Pages of memory associated with this memory DIMM are being removed from 
	service as errors are reported. 
Impact
	Total system memory capacity will be reduced as pages are retired. 
Suggested Action for System Administrator
	Schedule a repair procedure to replace the affected memory DIMM, the identity 
	of which can be determined using the command fmdump -v -u EVENT_ID. 
Details
	The Message ID:   SUN4V-8000-DX indicates diagnosis has determined that a 
	memory DIMM is faulty as a result of exceeding the threshold for correctable 
	memory errors. Memory pages associated with the correctable errors have been 
	retired and no data has been lost. However, the system is at increased risk 
	of incurring an uncorrectable error, which will cause a service 
	interruption, until the memory DIMM module is replaced. 
	Use the command fmdump -v -u EVENT_ID with the EVENT_ID from the console 
	message to locate the faulty DIMM. For example: 
	fmdump -v -u f92e9fbe-735e-c218-cf87-9e1720a28004
	TIME                 UUID                                 SUNW-MSG-ID
	Sep 14 10:09:46.2234 f92e9fbe-735e-c218-cf87-9e1720a28004 SUN4V-8000-DX
 			95%  fault.memory.dimm
         FRU: mem:///component=MB/CMP0/CH0:R0/D0/J0601
        rsrc: mem:///component=MB/CMP0/CH0:R0/D0/J0601 
	In this example, the DIMM location is: 
	MB/CMP0/CH0:R0/D0/J0601
	Refer to the Service Manual or the Service Label attached to the server 
	chassis to find the physical location of the DIMM. Once the DIMM has been 
	replaced, use the Service Manual for instructions on clearing the fault 
	condition and validating the repair action. 
	NOTE - The server Product Notes may contain updated service procedures. The 
	latest version of the Service Manual and Product Notes are available at the 
	Sun Documentation Center. 

3. Follow the suggested actions to repair the fault.
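
The fmdump -v output shown in Step 1 has a regular shape, so the UUID, message ID, and FRU can be extracted with a short script when many events are logged. The following Python sketch assumes the layout of the example output (an event line, a percentage with a fault class, then FRU and rsrc lines). The function name and dictionary keys are hypothetical.

```python
import re

# Event line: timestamp, UUID, and SUNW-MSG-ID, as in the fmdump -v example.
EVENT = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+[\d:.]+)\s+"
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+(\S+)$"
)
# Suspect line: a percentage followed by the fault class.
SUSPECT = re.compile(r"^(\d+)%\s+(\S+)$")


def parse_fmdump(output):
    """Parse fmdump -v output into a list of events with their suspect FRUs."""
    events = []
    for line in output.splitlines():
        text = line.strip()
        match = EVENT.match(text)
        if match:
            events.append({
                "time": match.group(1),
                "uuid": match.group(2),
                "msgid": match.group(3),
                "suspects": [],
            })
            continue
        if not events:
            continue  # header or prompt lines before the first event
        match = SUSPECT.match(text)
        if match:
            events[-1]["suspects"].append(
                {"percent": int(match.group(1)), "class": match.group(2), "fru": None}
            )
        elif text.startswith("FRU:") and events[-1]["suspects"]:
            events[-1]["suspects"][-1]["fru"] = text[len("FRU:"):].strip()
    return events


SAMPLE = """\
# fmdump -v
TIME                 UUID                                 SUNW-MSG-ID
Sep 14 10:09:46.2234 f92e9fbe-735e-c218-cf87-9e1720a28004 SUN4V-8000-DX
   95%  fault.memory.dimm
         FRU: mem:///component=MB/CMP0/CH0:R0/D0/J0601
        rsrc: mem:///component=MB/CMP0/CH0:R0/D0/J0601
"""
```

Because fmdump entries remain in the log after a repair, a script like this is suited to identifying FRUs, not to confirming that a fault has cleared.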

3.5.2 Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. After the fault condition is corrected, for example by replacing a faulty FRU, you must clear the fault.

Note - If you are dealing with faulty DIMMs, do not follow this procedure. Instead, perform the procedure in Section 5.6.2, Installing DIMMs.

1. After replacing a faulty FRU, power on the server.

2. At the ALOM CMT prompt, use the showfaults command to identify PSH detected faults.

PSH detected faults are distinguished from other kinds of faults by the text:
Host detected fault.

Example:


sc> showfaults -v
ID Time              FRU               Fault
0 SEP 09 11:09:26   MB/CMP0/CH0/R1/D0 Host detected fault, MSGID: 
SUN4U-8000-2S  UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86

3. Run the clearfault command with the UUID provided in the showfaults output:


sc> clearfault 7ee0e46b-ea64-6565-e684-e996963f7b86
Clearing fault from all indicted FRUs...
Fault cleared.

4. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the following command from the Solaris OS:

fmadm repair UUID

Example:


# fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
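
Steps 3 and 4 use the same UUID at two different prompts, so a mistyped UUID is easy to miss. The following Python sketch checks the UUID format and returns the two command lines with the prompt each belongs to. The function name is hypothetical, and the sketch only builds text; it does not run anything on the system controller or the host.

```python
import re

# UUID format as it appears in the showfaults and fmdump examples.
UUID_FORMAT = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")


def psh_clear_commands(uuid):
    """Return (prompt, command) pairs for clearing a PSH detected fault."""
    if not UUID_FORMAT.match(uuid):
        raise ValueError(f"not a valid fault UUID: {uuid!r}")
    return [
        ("sc>", f"clearfault {uuid}"),   # Step 3, at the ALOM CMT prompt
        ("#", f"fmadm repair {uuid}"),   # Step 4, at the Solaris superuser prompt
    ]
```

For the UUID in the example above, the function returns the clearfault and fmadm repair lines exactly as shown in Steps 3 and 4.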


3.6 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.

If POST, ALOM, or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for fault notifications. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system messages. To view the system messages log file, view the contents of the /var/adm/messages file.

3.6.1 Checking the Message Buffer

1. Log in as superuser.

2. Issue the dmesg command:


# dmesg

The dmesg command displays the most recent messages generated by the system.

3.6.2 Viewing System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

1. Log in as superuser.

2. Issue the following command:


# more /var/adm/messages

3. If you want to view all logged messages, issue the following command:


# more /var/adm/messages*
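
When the rotated files are processed by a script, it can be useful to read them from newest to oldest by rotation number. The following Python sketch orders file names by the numeric suffix described above (messages, then messages.1, messages.2, and so on). The function name is hypothetical, and the sketch works on a list of names so it can be tried without access to /var/adm.

```python
import re

# Matches "messages" and rotated copies such as "messages.1".
ROTATED = re.compile(r"^messages(?:\.(\d+))?$")


def newest_first(filenames):
    """Return message file names ordered from the current file to the oldest rotation."""
    matched = []
    for name in filenames:
        match = ROTATED.match(name)
        if match:
            # The current file has no suffix and sorts first.
            matched.append((int(match.group(1) or 0), name))
    return [name for _, name in sorted(matched)]
```

On the server, the input list could come from os.listdir("/var/adm"). Names that do not match the messages pattern are dropped.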


3.7 Managing Components With Automatic System Recovery Commands

The Automatic System Recovery (ASR) feature enables the server to automatically configure failed components out of operation until they can be replaced. In the server, the following components are managed by the ASR feature:

The database that contains the list of disabled components is called the ASR blacklist (asr-db).

In most cases, POST automatically disables a faulty component. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist.

The ASR commands (TABLE 3-7) enable you to view, and manually add or remove components from the ASR blacklist. These commands are run from the ALOM CMT sc> prompt.


TABLE 3-7 ASR Commands

Command                  Description
showcomponent [3]        Displays system components and their current state.
enablecomponent asrkey   Removes a component from the asr-db blacklist, where asrkey is the component to enable.
disablecomponent asrkey  Adds a component to the asr-db blacklist, where asrkey is the component to disable.
clearasrdb               Removes all entries from the asr-db blacklist.


Note - The components (asrkeys) vary from system to system, depending on how many cores and how much memory are present. Use the showcomponent command to see the asrkeys on a given system.

Note - A reset or power cycle is required after disabling or enabling a component. If the status of a component is changed while the system is powered on, the change has no effect until the next reset or power cycle.

3.7.1 Displaying System Components

The showcomponent command displays the system components (asrkeys) and reports their status.

At the sc> prompt, enter the showcomponent command.

Example with no disabled components:


sc> showcomponent
Keys:
.
.
.
 
ASR state: clean

Example showing a disabled component:


sc> showcomponent
 
Keys:
 
.
.
.
 
ASR state:  Disabled Devices
   MB/CMP0/CH3/R1/D1 : dimm8 deemed faulty
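
Saved showcomponent output can be checked for disabled devices with a short script. The following Python sketch reads the ASR state section in the form shown in the examples above. The function name is hypothetical, the "asrkey : reason" layout is assumed from the example, and because showcomponent might not report all blacklisted DIMMs, an empty result is not proof that the blacklist is empty.

```python
def disabled_components(output):
    """Return (asrkey, reason) pairs listed after the ASR state line."""
    disabled = []
    in_asr_section = False
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("ASR state:"):
            in_asr_section = True
            continue
        # Disabled devices appear as "asrkey : reason" after the ASR state line.
        if in_asr_section and " : " in text:
            key, reason = text.split(" : ", 1)
            disabled.append((key.strip(), reason.strip()))
    return disabled


SAMPLE = """\
sc> showcomponent

Keys:

ASR state:  Disabled Devices
   MB/CMP0/CH3/R1/D1 : dimm8 deemed faulty
"""
```

With SAMPLE, the function returns a single pair for MB/CMP0/CH3/R1/D1. Output that ends with ASR state: clean returns an empty list.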

3.7.2 Disabling Components

The disablecomponent command disables a component by adding it to the ASR blacklist.

1. At the sc> prompt, enter the disablecomponent command.


sc> disablecomponent MB/CMP0/CH3/R1/D1 
 
SC Alert:MB/CMP0/CH3/R1/D1 disabled

2. After receiving confirmation that the disablecomponent command is complete, reset the server so that the ASR command takes effect.


sc> reset

3.7.3 Enabling Disabled Components

The enablecomponent command enables a disabled component by removing it from the ASR blacklist.

1. At the sc> prompt, enter the enablecomponent command.


sc> enablecomponent MB/CMP0/CH3/R1/D1
 
SC Alert:MB/CMP0/CH3/R1/D1 reenabled

2. After receiving confirmation that the enablecomponent command is complete, reset the server so that the ASR command takes effect.


sc> reset


3.8 Exercising the System With SunVTS

Sometimes a server exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it might be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.

This section describes the tasks necessary to use SunVTS software to exercise your server:

3.8.1 Checking Whether SunVTS Software Is Installed

This procedure assumes that the Solaris OS is running on the server, and that you have access to the Solaris command line.

1. Check for the presence of SunVTS packages using the pkginfo command.


% pkginfo -l SUNWvts SUNWvtsr SUNWvtsts SUNWvtsmn

The following table lists the SunVTS packages:


Package    Description
SUNWvts    SunVTS framework
SUNWvtsr   SunVTS framework (root)
SUNWvtsts  SunVTS for tests
SUNWvtsmn  SunVTS man pages


If SunVTS is not installed, you can obtain the installation packages from the Solaris Operating System DVDs.

The SunVTS 6.1 software, and future compatible versions, are supported on the server.

SunVTS installation instructions are described in the SunVTS User's Guide.

3.8.2 Exercising the System Using SunVTS Software

Before you begin, the Solaris OS must be running. You also need to ensure that SunVTS validation test software is installed on your system. See Section 3.8.1, Checking Whether SunVTS Software Is Installed.

The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS. For details, refer to the SunVTS User's Guide.

SunVTS software features both character-based and graphics-based interfaces. This procedure assumes that you are using the graphical user interface (GUI) on a system running the Common Desktop Environment (CDE). For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by tip or telnet commands, refer to the SunVTS User's Guide.

SunVTS software can be run in several modes. This procedure assumes that you are using the default mode.

This procedure also assumes that the server is headless, that is, it is not equipped with a monitor capable of displaying bitmap graphics. In this case, you access the SunVTS GUI by logging in remotely from a machine that has a graphics display.

Finally, this procedure describes how to run SunVTS tests in general. Individual tests may presume the presence of specific hardware, or might require specific drivers, cables, or loopback connectors. For information about test options and prerequisites, refer to the following documentation:

3.8.3 Using SunVTS Software

1. Log in as superuser to a system with a graphics display.

The display system should be one with a frame buffer and monitor capable of displaying bitmap graphics such as those produced by the SunVTS GUI.

2. Enable the remote display.

On the display system, type:


# /usr/openwin/bin/xhost +test-system

where test-system is the name of the server you plan to test.

3. Remotely log in to the server as superuser.

Use a command such as rlogin or telnet.

4. Start SunVTS software.

If you have installed SunVTS software in a location other than the default /opt directory, alter the path in the following command accordingly.


# /opt/SUNWvts/bin/sunvts -display display-system:0

where display-system is the name of the machine through which you are remotely logged in to the server.

The SunVTS GUI is displayed (FIGURE 3-6).


FIGURE 3-6 SunVTS GUI

Figure showing the SunVTS GUI for the server.


5. Expand the test lists to see the individual tests.


The test selection area lists tests in categories, such as Network, as shown in FIGURE 3-7. To expand a category, left-click the expand icon to the left of the category name.

FIGURE 3-7 SunVTS Test Selection Panel

Figure showing a small portion of the test selection area in the SunVTS GUI.


6. (Optional) Select the tests you want to run.

Certain tests are enabled by default, and you can choose to accept these.

Alternatively, you can enable and disable individual tests or blocks of tests by clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.

TABLE 3-8 lists tests that are especially useful to run on this server.


TABLE 3-8 Useful SunVTS Tests to Run on This Server

SunVTS Tests                              FRUs Exercised by Tests
----------------------------------------  ------------------------------------------------
cmttest, cputest, fputest, iutest,        DIMMs, motherboard
l1dcachetest, dtlbtest, and l2sramtest;
indirectly: mptest and systest

disktest                                  Disks, cables, disk backplane

nettest, netlbtest                        Network interface, network cable, CPU motherboard

pmemtest, vmemtest, ramtest               DIMMs, motherboard

serialtest                                I/O (serial port interface)

hsclbtest                                 Motherboard, system controller
                                          (host to system controller interface)
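When a test reports errors, the mapping in TABLE 3-8 can be used as a quick lookup from test name to the FRUs that the test exercises. The following sketch encodes only the rows of TABLE 3-8; the function name frus_for_test is hypothetical and is not a SunVTS command.

```shell
#!/bin/sh
# Sketch only: lookup of TABLE 3-8 (SunVTS test name -> FRUs exercised).
# frus_for_test is a hypothetical helper, not part of SunVTS.
frus_for_test() {
    case "$1" in
        cmttest|cputest|fputest|iutest|l1dcachetest|dtlbtest|l2sramtest|mptest|systest)
            echo "DIMMs, motherboard" ;;
        disktest)
            echo "Disks, cables, disk backplane" ;;
        nettest|netlbtest)
            echo "Network interface, network cable, CPU motherboard" ;;
        pmemtest|vmemtest|ramtest)
            echo "DIMMs, motherboard" ;;
        serialtest)
            echo "I/O (serial port interface)" ;;
        hsclbtest)
            echo "Motherboard, system controller (host to system controller interface)" ;;
        *)
            # Test name does not appear in TABLE 3-8.
            echo "unknown (not listed in TABLE 3-8)"
            return 1 ;;
    esac
}

frus_for_test nettest
```

A test exercising a FRU does not by itself prove that the FRU is faulty; treat the result as a starting point for the other diagnostics described in this chapter.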


7. (Optional) Customize individual tests.

You can customize individual tests by right-clicking on the name of the test. For example, in FIGURE 3-7, right-clicking on the text string ce0(nettest) brings up a menu that enables you to configure this Ethernet test.

8. Start testing.

Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button.

During testing, SunVTS software logs all status and error messages. To view these messages, click the Log button or select Log Files from the Reports menu. This action opens a log window from which you can choose to view the following logs:


1 The setkeyswitch parameter, when set to diag, overrides all the other ALOM CMT POST variables.
2 Earlier versions of firmware have max as the default setting for the POST diag_level variable. To set the default to min, use the ALOM CMT command setsc diag_level min.
3 The showcomponent command might not report all blacklisted DIMMs.