CHAPTER 2

Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the Sun Blade T6340 server module. This chapter is intended for technicians, service personnel, and system administrators who service and repair computer systems.

The following topics are covered:

Section 2.1, Sun Blade T6340 Server Module Diagnostics Overview

Section 2.2, Memory Configuration and Fault Handling

Section 2.3, Interpreting System LEDs

Section 2.4, Using ILOM for Diagnosis and Repair Verification

Section 2.5, Using the ILOM Web Interface For Diagnostics

Section 2.6, Running POST

Section 2.7, Using the Solaris Predictive Self-Healing Feature

Section 2.8, Collecting Information From Solaris OS Files and Commands

Section 2.9, Managing Components With Automatic System Recovery Commands

Section 2.10, Exercising the System With SunVTS

Section 2.11, Resetting the Password to the Factory Default

2.1 Sun Blade T6340 Server Module Diagnostics Overview

A variety of diagnostic tools, commands, and indicators are available to monitor and troubleshoot a Sun Blade T6340 server module:

LEDs - Provide a quick visual notification of the status of the server module and some of the FRUs.

ILOM firmware - This system firmware runs on the service processor. In addition to providing the interface between the hardware and the Solaris OS, ILOM tracks and reports the health of key server module components. ILOM works closely with POST and Solaris Predictive Self-Healing technology to keep the system up and running even when there is a faulty component. For more information about ILOM, refer to the ILOM documentation collections.

Power-on self-test (POST) - POST performs diagnostics on system components upon system reset to ensure the integrity of those components. POST is configurable and works with ILOM to take faulty components offline if needed.

Solaris OS Predictive Self-Healing (PSH) - This technology continuously monitors the health of the CPU, memory, and other components. PSH works with ILOM to take a faulty component offline if needed, and enables Sun systems to predict component failures and mitigate many serious problems before they occur.

Log files and console messages - Provide the standard Solaris OS log files and investigative commands that can be accessed and displayed on the device of your choice.

SunVTS - An application that exercises the system, provides hardware validation, identifies possible faulty components, and provides recommendations for repair.

The LEDs, ILOM, Solaris OS PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes information to ILOM, where the fault is also logged. Depending on the fault, one or more LEDs might be illuminated.

The diagnostic flowchart in FIGURE 2-1 and the actions in TABLE 2-1 describe an approach for using the server module diagnostics to identify a faulty field-replaceable unit (FRU). The diagnostics you use, and the order in which you use them, depend on the nature of the problem you are troubleshooting, so you might perform some actions and not others.

Use this flowchart to understand what diagnostics are available to troubleshoot faulty hardware, and use TABLE 2-1 to find more information about each diagnostic in this chapter.

FIGURE 2-1 Diagnostic Flowchart

Figure shows the diagnostic flowchart.

TABLE 2-1 Diagnostic Flowchart Actions
Action No.	Diagnostic Action	Resulting Action	For more information, see these sections
1.	Check the OK LED.	The OK LED is located on the front of the Sun Blade T6340 server module. If the LED is not lit, check that the blade is properly connected and the chassis has power.	Section 2.3, Interpreting System LEDs
2.	Type the ILOM `show faulty` command to check for faults.	The `show faulty` command displays the following types of faults: environmental faults, Solaris Predictive Self-Healing (PSH) detected faults, and POST detected faults. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 1-3.	Section 2.5.1, Displaying System Faults
3.	Check the Solaris log files for fault information.	The Solaris message buffer and log files record system events and provide information about faults. If system messages indicate a faulty device, replace the FRU. To obtain more diagnostic information, go to Action 4.	Section 2.8, Collecting Information From Solaris OS Files and Commands
4.	Run the SunVTS software.	SunVTS can exercise and diagnose FRUs. To run SunVTS, the server module must be running the Solaris OS. If SunVTS reports a faulty device, replace the FRU. If SunVTS does not report a faulty device, go to Action 5.	Section 2.10, Exercising the System With SunVTS
5.	Run POST.	POST performs basic tests of the server module components and reports faulty FRUs. If POST indicates a faulty FRU, replace the FRU. If POST does not indicate a faulty FRU, go to Action 9.	Section 2.6, Running POST
6.	Determine if the fault is an environmental fault.	If the fault listed by the `show faulty` command is a temperature or voltage fault, then the fault is an environmental fault. Environmental faults can be caused by faulty FRUs (chassis power supply, fan, or blower) or by environmental conditions such as high ambient temperature or blocked airflow.	Section 2.5.1, Displaying System Faults
7.	Determine if the fault was detected by PSH.	If the fault message displays the following text, the fault was detected by the Solaris Predictive Self-Healing software: `Host detected fault` If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU. After the FRU is replaced, perform the procedure to clear PSH detected faults.	Section 2.7, Using the Solaris Predictive Self-Healing Feature Section 4.2, Common Procedures for Parts Replacement Section 2.7.2, Clearing PSH Detected Faults Section 2.7.3, Clearing the PSH Fault From the ILOM Logs
8.	Determine if the fault was detected by POST.	POST performs basic tests of the server module components and reports faulty FRUs. When POST detects a faulty FRU, it logs the fault and if possible takes the FRU offline. POST detected FRUs display the following text in the fault message: FRU-name `deemed faulty and disabled` In this case, replace the FRU and run the procedure to clear POST detected faults.	Section 2.6, Running POST Section 4.2, Common Procedures for Parts Replacement Section 2.6.4, Clearing POST Detected Faults
9.	Contact Sun for support.	The majority of hardware faults are detected by the server module diagnostics. In rare cases, a problem might require additional troubleshooting. If you are unable to determine the cause of the problem, contact Sun for support.	Sun Support information: `http://www.sun.com/support` Section 1.3, Finding the Serial Number
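Actions 6 through 8 in TABLE 2-1 sort a fault by the text of its message. The following Python sketch illustrates that sorting step only. The two marker strings come from TABLE 2-1; the keyword list for environmental faults, the function name, and the sample messages are assumptions made for illustration and are not actual ILOM output.

```python
from enum import Enum


class FaultSource(Enum):
    PSH = "Solaris Predictive Self-Healing"
    POST = "power-on self-test"
    ENVIRONMENTAL = "environmental"
    UNKNOWN = "unknown"


# Marker strings quoted in TABLE 2-1 (Actions 7 and 8).
PSH_MARKER = "Host detected fault"
POST_MARKER = "deemed faulty and disabled"
# Assumption: keywords used here to spot temperature or voltage faults.
ENVIRONMENTAL_KEYWORDS = ("temperature", "voltage")


def classify_fault(message: str) -> FaultSource:
    """Guess which diagnostic reported a fault from its message text."""
    if PSH_MARKER in message:
        return FaultSource.PSH
    if POST_MARKER in message:
        return FaultSource.POST
    lowered = message.lower()
    if any(word in lowered for word in ENVIRONMENTAL_KEYWORDS):
        return FaultSource.ENVIRONMENTAL
    return FaultSource.UNKNOWN


if __name__ == "__main__":
    # Hypothetical message text, for illustration only.
    sample = "/SYS/MB/CMP0/BR0/CH0/D0 deemed faulty and disabled"
    print(classify_fault(sample).value)
```

A result of UNKNOWN corresponds to the cases where the flowchart sends you on to further diagnostics or to Sun support.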

2.2 Memory Configuration and Fault Handling

This section describes how the memory is configured and how the server module deals with memory faults.

2.2.1 Memory Configuration

The Sun Blade T6340 server module has 32 connectors (slots) that hold fully-buffered DIMMs (FB-DIMMs) in the following FB-DIMM capacities:

1 Gbyte (maximum of 32 Gbytes)

2 Gbyte (maximum of 64 Gbytes)

4 Gbyte (maximum of 128 Gbytes)

8 Gbyte (maximum of 256 Gbytes)

The Sun Blade T6340 server module performs best if all 32 connectors are populated with 32 identical DIMMs. This configuration also enables the system to continue operating even when a DIMM fails, or if an entire channel fails.

Note - All installed FB-DIMMs will be seen by the system as having the capacity of the smallest installed FB-DIMM.

For example, suppose that you have installed 32 8-Gbyte FB-DIMMs for a total of 256 Gbytes of memory. If you replace one of those 8-Gbyte FB-DIMMs with a functioning 1-Gbyte FB-DIMM, the system treats all installed FB-DIMMs as 1-Gbyte FB-DIMMs and sees only 32 Gbytes of installed memory.
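The arithmetic behind this rule can be sketched in a few lines of Python. This is an illustration of the rule as stated in the note, not a description of how the firmware computes memory size; the function name is hypothetical.

```python
def effective_memory_gbytes(dimm_sizes_gbytes):
    """Return the memory the system sees, per the smallest-DIMM rule.

    Every installed FB-DIMM is treated as having the capacity of the
    smallest installed FB-DIMM.
    """
    if not dimm_sizes_gbytes:
        return 0
    return min(dimm_sizes_gbytes) * len(dimm_sizes_gbytes)


if __name__ == "__main__":
    uniform = [8] * 32          # 32 identical 8-Gbyte FB-DIMMs
    mixed = [8] * 31 + [1]      # one 8-Gbyte FB-DIMM swapped for a 1-Gbyte part
    print(effective_memory_gbytes(uniform))  # 256
    print(effective_memory_gbytes(mixed))    # 32
```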

2.2.1.1 FB-DIMM Installation Rules

Caution - The following FB-DIMM rules must be followed. The server module might not operate correctly if the FB-DIMM rules are not followed. Always use FB-DIMMs that have been qualified by Sun.

Use these FB-DIMM configuration rules to help you plan the memory configuration of your server:

32 slots hold industry-standard FB-DIMM memory modules on the motherboard.

All FB-DIMMs must have the same Sun part number. The system treats all installed FB-DIMMs as the lowest-capacity installed FB-DIMM.

Install FB-DIMMs in this order (refer to FIGURE 2-2):

Fill FB-DIMMs of group 8 first.

Fill FB-DIMMs of group 16 next.

Fill FB-DIMMs of group 24 next.

Fill FB-DIMMs of group 32 last.

See Section 4.3.1, Removing the DIMMs, for DIMM installation instructions.

FIGURE 2-2 DIMM Installation Rules

Figure shows motherboard, the DIMM locate button, and the DIMM ejector levers.

You can also use TABLE 2-2 to identify the DIMMs you want to remove.

TABLE 2-2 FB-DIMM Configuration and Installation
CPU #	Branch Name	Channel Name	FRU Name in ILOM Messages	Motherboard FB-DIMM Connector	FB-DIMM Installation Order ^[1]
CMP 0	Branch 0	Channel 0	/SYS/MB/CMP0/BR0/CH0/D0	J0501	8
	Branch 0		/SYS/MB/CMP0/BR0/CH0/D1	J0601	16
			/SYS/MB/CMP0/BR0/CH0/D2	J0701	24
			/SYS/MB/CMP0/BR0/CH0/D3	J0801	24
		Channel 1	/SYS/MB/CMP0/BR0/CH1/D0	J0901	8
			/SYS/MB/CMP0/BR0/CH1/D1	J1001	16
			/SYS/MB/CMP0/BR0/CH1/D2	J1101	24
			/SYS/MB/CMP0/BR0/CH1/D3	J1201	24
	Branch 1	Channel 0	/SYS/MB/CMP0/BR1/CH0/D0	J1301	8
	Branch 1		/SYS/MB/CMP0/BR1/CH0/D1	J1401	16
			/SYS/MB/CMP0/BR1/CH0/D2	J1501	24
			/SYS/MB/CMP0/BR1/CH0/D3	J1601	24
		Channel 1	/SYS/MB/CMP0/BR1/CH1/D0	J1701	8
			/SYS/MB/CMP0/BR1/CH1/D1	J1801	16
			/SYS/MB/CMP0/BR1/CH1/D2	J1901	24
			/SYS/MB/CMP0/BR1/CH1/D3	J2001	24
CMP 1	Branch 0	Channel 0	/SYS/MB/CMP1/BR0/CH0/D0	J2401	8
	Branch 0		/SYS/MB/CMP1/BR0/CH0/D1	J2501	16
			/SYS/MB/CMP1/BR0/CH0/D2	J2601	32
			/SYS/MB/CMP1/BR0/CH0/D3	J2701	32
		Channel 1	/SYS/MB/CMP1/BR0/CH1/D0	J2801	8
			/SYS/MB/CMP1/BR0/CH1/D1	J2901	16
			/SYS/MB/CMP1/BR0/CH1/D2	J3001	32
			/SYS/MB/CMP1/BR0/CH1/D3	J3101	32
	Branch 1	Channel 0	/SYS/MB/CMP1/BR1/CH0/D0	J3201	8
	Branch 1		/SYS/MB/CMP1/BR1/CH0/D1	J3301	16
			/SYS/MB/CMP1/BR1/CH0/D2	J3401	32
			/SYS/MB/CMP1/BR1/CH0/D3	J3501	32
		Channel 1	/SYS/MB/CMP1/BR1/CH1/D0	J3601	8
			/SYS/MB/CMP1/BR1/CH1/D1	J3701	16
			/SYS/MB/CMP1/BR1/CH1/D2	J3801	32
			/SYS/MB/CMP1/BR1/CH1/D3	J3901	32
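The FRU names and installation groups in TABLE 2-2 follow a regular pattern, which the following Python sketch reproduces. It is an aid for reading the table, assuming the pattern visible in the table holds for every slot; the function names are hypothetical, and the motherboard connector numbers are deliberately left out.

```python
from itertools import product


def fru_name(cmp_id, branch, channel, dimm):
    """Build the FRU name that ILOM messages use for an FB-DIMM slot."""
    return f"/SYS/MB/CMP{cmp_id}/BR{branch}/CH{channel}/D{dimm}"


def install_group(cmp_id, dimm):
    """Return the installation group (8, 16, 24, or 32) from TABLE 2-2."""
    if dimm == 0:
        return 8
    if dimm == 1:
        return 16
    # D2 and D3 belong to group 24 on CMP 0 and group 32 on CMP 1.
    return 24 if cmp_id == 0 else 32


def install_order():
    """List (group, FRU name) pairs for all 32 slots, earliest group first."""
    slots = [
        (install_group(c, d), fru_name(c, b, ch, d))
        for c, b, ch, d in product(range(2), range(2), range(2), range(4))
    ]
    return sorted(slots)


if __name__ == "__main__":
    for group, name in install_order():
        print(group, name)
```

The first eight entries printed are the D0 slots of both CMPs, which matches the rule to fill group 8 first.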

2.2.1.2 Memory Fault Handling

The Sun Blade T6340 server module uses advanced ECC technology, also called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as the bits are all in the same DRAM. If a DRAM fails, the DIMM continues to function.

Note - The chipkill function is only supported on DIMMs that use “x4” DRAMs.

The following server module features manage memory faults independently.

POST - Runs when the server module is powered on (based on configuration variables) and thoroughly tests the memory subsystem.

If a memory fault is detected, POST displays the fault with the FRU name of the faulty DIMMs, logs the fault, and disables the faulty DIMMs by placing them in the Automatic System Recovery (ASR) blacklist. For a given memory fault, POST disables half of the physical memory in the system. When this occurs, you must replace the faulty DIMMs based on the fault message and enable the disabled DIMMs with the ILOM command set /SYS/component component_state=enabled.

Solaris Predictive Self-Healing (PSH) technology - A feature of the Solaris OS that uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID) and logged. PSH reports the fault and provides a recommended proactive replacement for the DIMMs associated with the fault.

2.2.1.3 Troubleshooting Memory Faults

If you suspect that the server module has a memory problem, follow the flowchart (FIGURE 2-1). Type the ILOM command show faulty. The output lists memory faults and the specific DIMMs that are associated with each fault. Once you have identified which DIMMs to replace, see Chapter 4 for DIMM removal and replacement instructions. You must perform the instructions in that chapter to clear the faults and enable the replaced DIMMs.

2.3 Interpreting System LEDs

The Sun Blade T6340 server module has LEDs on the front panel and the hard drives. The behavior of LEDs on your server module conforms to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-3.

2.3.1 Front Panel LEDs and Buttons

The front panel LEDs and buttons are located in the center of the server module. TABLE 2-3 describes the standard LED behaviors, and TABLE 2-4 and TABLE 2-5 describe the front panel LEDs and buttons.

TABLE 2-3 LED Behavior and Meaning
LED Behavior	Meaning
Off	The condition represented by the color is not true.
Steady on	The condition represented by the color is true.
Standby blink	The system is functioning at a minimal level and ready to resume full function.
Slow blink	Transitory activity or new activity represented by the color is taking place.
Fast blink	Attention is required.
Feedback flash	Activity is taking place commensurate with the flash rate (such as disk drive activity).

The front panel LEDs on the Sun Blade T6340 are shown in FIGURE 2-3:

FIGURE 2-3 Front Panel and Hard Drive LEDs

Illustration of Front Panel and Hard Drive LEDs

Figure Legend
1	White Locator LED	7	Universal Connector Port (UCP)
2	Blue Ready to Remove LED	8	Green Drive OK LED
3	Amber Service Action Required LED	9	Amber Drive Service Action Required LED
4	Green OK LED	10	Blue Drive Ready to Remove LED
5	Power Button	11	Chassis power connector
6	Reset Button (for service use only)	12	Chassis data connector

TABLE 2-4 LED Behaviors With Assigned Meanings
Color	Behavior	Definition	Description, Actions, and ILOM Commands
White	Off	Steady state
	Fast blink	4 Hz repeating sequence, equal intervals On and Off.	This indicator helps you to locate a particular enclosure, board, or subsystem (for example, the Locator LED). To toggle the indicator on or off, press the button or type one of the following ILOM commands: `set /SYS/LOCATE value=Fast_Blink` or `set /SYS/LOCATE value=Off` This LED provides the following indications: Off - Normal operating state. Fast blink - The server module received a signal as a result of one of the preceding methods and indicates that the server module is active.
Blue	Off	Steady state	If the blue LED is off, it is not safe to remove the server module from the chassis. You must use software to take the component offline or shut down the server. To turn off the blue LED, type: `set /SYS return_to_service_action=true`
	Steady on	Steady state	If the blue LED is on, a service action can be safely performed on the component. To remove a server module (and illuminate the blue LED), type: `set /SYS prepare_to_remove_action=true` To remove a hard drive, use the Solaris `cfgadm` command.
Amber	Off	Steady state
	Steady on	Steady state	This indicator signals the existence of a fault condition. Service is required (for example, the Service Required LED). The ILOM `show faulty` command provides details about any faults that cause this indicator to be lit. To turn off an amber LED, either fix the fault condition or mark the fault condition fixed.
Green	Off	Steady state	Off - The system is unavailable. Either it has no power or ILOM is not running.
	Standby blink	Repeating sequence consisting of a brief (0.1 sec.) on flash followed by a long off period (2.9 sec.)	The system is running at a minimum level and is ready to be quickly revived to full function (for example, the System Activity LED).
	Steady on	Steady state	Status normal; system or component functioning with no service actions required.
	Slow blink		A transitory (temporary) event is taking place for which direct proportional feedback is not needed or not feasible. ILOM is enabled but the server module is not fully powered on. Indicates that the service processor is running while the system is running at a minimum level in standby mode and ready to be returned to its normal operating state.
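The blink definitions in TABLE 2-4 translate into simple timing figures. The following Python sketch works through that arithmetic for the standby blink and the fast blink; the function names are hypothetical and the code does not drive any hardware.

```python
def blink_timing(on_seconds, off_seconds):
    """Return (period in seconds, duty cycle) for a repeating blink."""
    period = on_seconds + off_seconds
    return period, on_seconds / period


def equal_interval_seconds(frequency_hz):
    """Return the on (and off) time for a blink with equal intervals."""
    return 1.0 / (2.0 * frequency_hz)


if __name__ == "__main__":
    # Standby blink: 0.1 sec. on, 2.9 sec. off (TABLE 2-4).
    period, duty = blink_timing(0.1, 2.9)
    print(f"standby blink: period {period:.1f} s, on {duty:.1%} of the time")

    # Fast blink: 4 Hz, equal intervals on and off (TABLE 2-4).
    interval = equal_interval_seconds(4)
    print(f"fast blink: {interval} s on, {interval} s off")
```

The standby blink repeats every 3 seconds and is lit for only a small fraction of that time, while the fast blink completes four on/off cycles per second.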

2.3.2 Power and Reset Buttons

TABLE 2-5 Front Panel Buttons
Button	Color	Description
Power button	gray	Turns the host system on and off. Use a non-conductive stylus to completely press this button.
(reset)	gray	This button causes a reset of the Service Processor.

For information about Ethernet LEDs, see the service manual for your modular system chassis or Ethernet device at:
http://docs.sun.com/app/docs/prod/blade.6000mod

2.4 Using ILOM for Diagnosis and Repair Verification

The Oracle Integrated Lights Out Manager (ILOM) is firmware that resides on the service processor in the Sun Blade T6340 server module. ILOM enables you to remotely manage and administer your server module.

Note - ILOM also contains an ALOM-CMT compatibility shell. For more information about ALOM-CMT compatibility see the Sun Integrated Lights Out Manager 2.0 Supplement for Sun Blade T6340 Server Modules, 820-3904. Appendix G of this service manual also provides some information about the ALOM CMT CLI.

ILOM enables you to run remote diagnostics, such as power-on self-test (POST), that would otherwise require physical proximity to the server module serial port. You can also configure ILOM to send email alerts of hardware failures, hardware warnings, and other events related to the server module or to ILOM.

The ILOM circuitry runs independently of the server module, using the server module standby power. Therefore, ILOM firmware and software continue to function when the server module operating system goes offline or when the server module is powered off.

Faults detected by ILOM, POST, and the Solaris Predictive Self-healing (PSH) technology are forwarded to ILOM for fault handling (FIGURE 2-4).

In the event of a system fault, ILOM ensures that the Service Action Required LED is lit, FRU ID PROMs are updated, the fault is logged, and alerts are displayed. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 1-3.

FIGURE 2-4 ILOM Fault Management

Figure shows environmentals, POST, Solaris PSH routed through ILOM fault manager to produce results in LEDs, FRUID PROMs, Logs, and alerts.

In ILOM, you can view the ILOM logs to see alerts. FIGURE 2-5 is a sample of the ILOM web interface. Using the CLI, you can type the show /SP/logs/event/list/ command.

FIGURE 2-5 Sample Event Log in ILOM Web Interface

Figure shows a sample ILOM event log.

ILOM can detect when a fault is no longer present and clears the fault in several ways:

Fault recovery - The system automatically detects that the fault condition is no longer present. ILOM extinguishes the Service Action Required LED and updates the FRU PROM.

Many environmental faults can automatically recover. For example, a temperature that is exceeding a threshold might return to normal limits when you connect a fan. The recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

fru at location is OK.

sensor at location is within normal range.

There are three thresholds for an environmental fault:

Warning: ILOM issues a command to burst the fan speed.

Soft shutdown: ILOM initiates a graceful shutdown.

Hard shutdown: Immediate shutdown.

Environmental faults can be repaired through hot removal of the faulty FRU. The FRU removal is automatically detected by the environmental monitoring, and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals, is:

fru at location has been removed.

Fault repair - The fault has been repaired by human intervention. In most cases, ILOM detects the repair and extinguishes the Service Action Required LED. If ILOM does not perform these actions, you must perform these tasks manually with the following commands:

set /SYS/FRU clear_fault_action=true (the ALOM-CMT equivalent is clearfault) - Clears the PSH fault logs but does not enable the component. See Section 2.7.2, Clearing PSH Detected Faults.

set /SYS/component component_state=enabled (the ALOM-CMT equivalent is enablecomponent) - Clears POST generated faults and enables the component. See Section 2.6.4, Clearing POST Detected Faults.

2.5 Using the ILOM Web Interface For Diagnostics

These instructions use the ILOM web interface. To use the command-line interface (CLI), see Appendix G of this manual or the ILOM documentation collection.

1. Connect to the ILOM web interface by typing the IP address for the Sun Blade T6340 server module service processor in a web browser.

If you do not know the IP address for the server module, you can obtain the service processor IP address from the following:

ILOM CLI: -> show /SP/network

ALOM-CMT compatibility shell: sc> shownetwork

Chassis CMM ILOM: -> show /CH/BL<x>/SP/network (where <x> is the number of the blade server module in the chassis)

2. Type the username and password to access the diagnostics menus in the ILOM web interface. The default user name is root, and the default password is changeme.

FIGURE 2-6 ILOM Login Screen

ILOM Login Screen example

2.5.1 Displaying System Faults

ILOM displays the following faults with the web interface and CLI:

Environmental faults - Temperature or voltage problems that might be caused by faulty FRUs (power supplies, fans, or blower), or by room temperature or blocked air flow

POST detected faults - Detected by the power-on self-test diagnostics

PSH detected faults - Detected by the Solaris Predictive Self-healing (PSH) technology

Use the web interface or type the show faulty command for the following reasons:

To see if any faults have been passed to, or detected by the ILOM firmware

To obtain the fault message ID (SUNW-MSG-ID) for PSH detected faults

To verify that the replacement of a FRU has cleared the fault and not generated any additional faults

2.5.1.1 Viewing Fault Status Using the ILOM Web Interface

In the ILOM web interface, you can view the system components currently in a fault state using the Fault Management page.

FIGURE 2-7 Fault Management Page Example

Fault Management Page Example

The Fault Management page lists faulted components by ID, FRU, and TimeStamp. You can access additional information about the faulted component by clicking the faulted component ID. For example, if you clicked the faulted component ID, 0 SYS/MB/, a dialog window similar to the following one appears, displaying additional details about the faulted component.

FIGURE 2-8 Faulted Component ID Window

Figure shows the Fault Properties Dialog screen.

Alternatively, in the ILOM web interface, you can identify the fault status of a component on the Component Management page.

FIGURE 2-9 Component Management Page - Fault Status

Figure shows the Component Management Page and Fault Status screen.

2.5.1.2 Viewing Fault Status Using the ILOM CLI

In the ILOM CLI, you can view the fault status of component(s) by using the show command. For example:

-> show faulty

2.5.2 Displaying the Environmental Status with the ILOM CLI

The ILOM show command displays a snapshot of the server module environmental status. This command displays system temperatures, hard drive status, power supply and fan status, front panel LED status, voltage, and current sensors. The output uses a format similar to the Solaris OS command prtdiag (1M).

At the -> prompt, type the show command.

The output differs according to your system model and configuration.

-> show /SYS/MB/V_+12V
 
 /SYS/MB/V_+12V
    Targets:
 
    Properties:
        type = Voltage
        class = Threshold Sensor
        value = 12.411 Volts
        upper_nonrecov_threshold = 13.23 Volts
        upper_critical_threshold = 13.10 Volts
        upper_noncritical_threshold = 12.85 Volts
        lower_noncritical_threshold = 11.15 Volts
        lower_critical_threshold = 10.90 Volts
        lower_nonrecov_threshold = 10.77 Volts
 
    Commands:
        cd
        show

Note - Some environmental information might not be available when the server module is in standby mode.
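When many sensors need checking, the show output for a threshold sensor can be compared against its own thresholds by a script. The following Python sketch parses output in the format of the preceding example and reports which threshold band the reading falls in. It is a sketch under stated assumptions: the function names and band labels are hypothetical, a reading exactly at a threshold is treated as having crossed it, and the sample text is copied from the example above rather than captured from a live system.

```python
import re

SAMPLE_OUTPUT = """\
 /SYS/MB/V_+12V
    Properties:
        type = Voltage
        class = Threshold Sensor
        value = 12.411 Volts
        upper_nonrecov_threshold = 13.23 Volts
        upper_critical_threshold = 13.10 Volts
        upper_noncritical_threshold = 12.85 Volts
        lower_noncritical_threshold = 11.15 Volts
        lower_critical_threshold = 10.90 Volts
        lower_nonrecov_threshold = 10.77 Volts
"""

# Matches "name = value" property lines and ignores everything else.
PROPERTY_RE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_properties(output):
    """Collect the name = value pairs from show command output."""
    props = {}
    for line in output.splitlines():
        match = PROPERTY_RE.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def reading(text):
    """Extract the number from a value such as '12.411 Volts'."""
    return float(text.split()[0])


def classify_reading(props):
    """Return the most severe threshold band that the reading has crossed."""
    value = reading(props["value"])
    # Check the most severe band first. Labels are hypothetical.
    for key, label in (("nonrecov", "nonrecoverable"),
                       ("critical", "critical"),
                       ("noncritical", "noncritical")):
        upper = reading(props[f"upper_{key}_threshold"])
        lower = reading(props[f"lower_{key}_threshold"])
        if value >= upper or value <= lower:
            return label
    return "normal"


if __name__ == "__main__":
    print(classify_reading(parse_properties(SAMPLE_OUTPUT)))
```

For the sample reading of 12.411 Volts, which lies between the lower and upper noncritical thresholds, the sketch reports normal.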

2.5.3 Displaying the Environmental Status and Sensor Readings with the ILOM Web Interface

1. Open a web browser and type the IP address of the server module service processor in the browser.

2. Select the top System Monitoring tab and the lower Sensor Readings tab (FIGURE 2-10).

3. Click on the sensor reading that you want to check (FIGURE 2-10).

FIGURE 2-10 Obtaining Sensor Readings and Environmental Status With the ILOM Web Interface

Figure shows the System Monitoring and Sensor Readings tabs selected in the window.

FIGURE 2-11 Sensor Reading Window for an FB-DIMM in Channel 1

Figure shows the sensor reading window.

2.5.4 Displaying FRU Information

ILOM can display static FRU information such as the FRU manufacturer, serial number and some FRU status information (FIGURE 2-12).

Note - To view dynamic FRU information you must type the ALOM CMT showfru command. The dynamic FRU information provides more details about FRUs.

2.5.4.1 Using the ILOM Web Interface to Display FRU Information

1. Select the System Information and Components tabs.

2. Click on the component to view the FRU information (FIGURE 2-12).

FIGURE 2-12 Static FRU Information in the ILOM Web Interface

Figure shows the static FRU information displayed in an ILOM window.

2.5.4.2 Using the CLI to Display FRU Information

The show /SYS/MB command displays static information about the FRUs in the server module. Use this command to see information about an individual FRU.

At the -> prompt, type the show command.

In the following example, the show command displays information about the motherboard (MB).

-> show /SYS/MB
 
/SYS/MB
    Targets:
        SEEPROM
        SCC_NVRAM
        PCIE0
        PCIE1
        PCI-SWITCH0
        PCI-SWITCH1
        REM
        NET0
        CMP0
        CMP1
        V_VDDIO
        V_+12V
        V_+3V3
        V_+3V3_STBY
        V_+5V
 
    Properties:
        type = Motherboard
        chassis_name = SUN BLADE 6000 MODULAR SYSTEM
        chassis_part_number = "541-1983-0
        chassis_serial_number = "1005LCB-0804YM04XE
        chassis_manufacturer = SUN MICROSYSTEMS
        product_name = Sun Blade T6340 Server Module
        product_part_number = 541-3299-02
        product_serial_number = 1005LCB-08268N0008
        product_manufacturer = SUN MICROSYSTEMS
        fru_name = T6340_MB
        fru_description = 8C,1.2GHZ VF,T6340,DIRECT-A
        fru_manufacturer = NO JEDEC CODE FOR THIS VENDOR
        fru_version = 02_01
        fru_part_number = 5407762
        fru_serial_number = 8J0010
        fault_state = OK
        clear_fault_action = (none)
 
    Commands:
        cd
        show
->

This example shows a portion of the more detailed dynamic FRU information provided by the ALOM CMT showfru command.

sc> showfru
/SYS/SP (container)
   SEGMENT: ST
      /Status_CurrentR
      /Status_CurrentR/UNIX_Timestamp32: Thu Feb 17 07:25:57 2000
      /Status_CurrentR/status:           0x00 (OK)
   SEGMENT: TH ...
... ... ...
   SEGMENT: FD
      /Customer_DataR
      /Customer_DataR/UNIX_Timestamp32: Wed Feb 16 08:41:44 GMT 2000
      /Customer_DataR/Cust_Data: QT
      /InstallationR (1 iterations)
      /InstallationR[0]
      /InstallationR[0]/UNIX_Timestamp32: Thu Feb 17 07:26:09 GMT 2000
      /InstallationR[0]/Fru_Path: /SYS/MB/REM
      /InstallationR[0]/Parent_Part_Number: 5017821
      /InstallationR[0]/Parent_Serial_Number: 5C00FV
      /InstallationR[0]/Parent_Dash_Level: 04
      /InstallationR[0]/System_Id: 1005LCB-0709YM00FV
      /InstallationR[0]/System_Tz: 0
      /InstallationR[0]/Geo_North: 0
      /InstallationR[0]/Geo_East: 0
      /InstallationR[0]/Geo_Alt: 0
      /InstallationR[0]/Geo_Location: GMT
 ... ... ...
/SYS/MB/CMP0/BR0/CH0/D0 (container)
        /SPD/Timestamp: Mon Feb 12 12:00:00 2007
        /SPD/Description: DDR2 SDRAM FB-DIMM, 4 GByte
        /SPD/Manufacture Location: ff
        /SPD/AMB Vendor: IDT
        /SPD/Vendor: Micron Technology
        /SPD/Vendor Part No:   36HTF51272F667E1D4
        /SPD/Vendor Serial No: d2174043
        /SPD/Num_Banks: 8
        /SPD/Num_Ranks: 2
        /SPD/Num_Rows: 14
        /SPD/Num_Cols: 11
        /SPD/Sdram_Width: 4
        /SunSPD/Sun_Serial_Number:   002C010707D2174043
        /SunSPD/SPD_Format_Version:  20
        /SunSPD/Sun_Part_Dash_Rev:   000-0000-00 Rev 00
        /SunSPD/Certified_Platforms: 0x00000001 (OK)
        /SunSPD/Sun_Key_Code:        0x0000
        /SunSPD/Sun_Certification:   NO
        /SunSPD/timestamp:           Thu Feb 17 07:26:20 2000
        /SunSPD/MACADDR:             00:14:4F:98:84:7A
        /SunSPD/status               0x00 (OK)
        /SunSPD/Initiator            N/A
        /SunSPD/Message:             No message
        /SunSPD/powerupdate:         Thu Feb 17 07:01:16 2000
        /SunSPD/Poweron_minutes:     1487
/SYS/MB/CMP0/BR1/CH0/D0 (container)
 ... ... ...
sc>

2.6 Running POST

Use POST to test and verify server module hardware. Power-on self-test (POST) is a group of PROM-based tests that run when the server module is powered on or reset. POST checks the basic integrity of the critical hardware components in the server module (CPU, memory, and I/O buses).

If POST detects a faulty component, the component is disabled automatically, preventing faulty hardware from potentially harming any software. If the system is capable of running without the disabled component, the system will boot when POST is complete. For example, if one of the processor cores is deemed faulty by POST, that core will be disabled, and the system will boot and run using the remaining cores.

You can use POST as an initial diagnostic tool for the system hardware. In this case, configure POST to run in diagnostic service mode for maximum test coverage and verbose output.

Note - Devices can be manually enabled or disabled using ASR commands (see Section 2.9, Managing Components With Automatic System Recovery Commands).

2.6.1 Controlling How POST Runs

The server module can be configured for normal, extensive, or no POST execution. You can also control the level of tests that run, the amount of POST output that is displayed, and which reset events trigger POST by using diag variables.

TABLE 2-6 lists the diag variables used to configure POST, and FIGURE 2-13 shows how the variables work together.

TABLE 2-6 Parameters Used For POST Configuration
Parameter	Values	Description
/SYS keyswitch_state	`normal`	The system can power on and run POST (based on the other parameter settings). For details see FIGURE 2-13. This parameter overrides all other commands.
	`diag`	The system runs POST based on predetermined settings.
	`stby`	The system cannot power on.
	`locked`	The system can power on and run POST, but no flash updates can be made.
`diag_mode`	`off`	POST does not run.
	`normal`	Runs POST according to `diag_level` value.
	`service`	Runs POST with preset values for `diag_level` and `diag_verbosity`.
`diag_level`	`min`	If `diag_mode` = `normal`, runs minimum set of tests.
	`max`	If `diag_mode` = `normal`, runs all the minimum tests plus extensive CPU and memory tests.
`diag_trigger`	`none`	Does not run POST on reset or poweron.
	`user-reset`	Runs POST upon user-initiated resets.
	`power-on-reset`	Runs POST only for the first power on. The default setting is `power-on-reset error-reset`.
	`error-reset`	Runs POST if fatal errors are detected.
	`all-resets`	Runs POST after any reset.
`diag_verbosity`	`none`	No POST output is displayed.
	`min`	POST output displays functional tests with a banner and pinwheel.
	`normal`	POST output displays all test and informational messages.
	`max`	POST displays all test, informational, and some debugging messages.

FIGURE 2-13 Flowchart of ILOM Variables for POST Configuration

Figure shows POST flow chart.

TABLE 2-7 shows typical combinations of ILOM variables and the associated POST modes.

TABLE 2-7 POST Modes and Parameter Settings
Parameter	Normal Diagnostic Mode (default settings)	No POST Execution	Diagnostic Service Mode	Keyswitch Diagnostic Preset Values
`diag_mode`	`normal`	`off`	`service`	`normal`
`keyswitch_state` ^[2]	`normal`	`normal`	`normal`	`diag`
`diag_level`	`min`	n/a	`max`	`max`
`diag_trigger`	`power-on-reset error-reset`	`none`	`all-resets`	`all-resets`
`diag_verbosity`	`normal`	n/a	`max`	`max`
Description of POST execution	This is the default POST configuration. This configuration tests the system thoroughly, and suppresses some of the detailed POST output.	POST does not run, resulting in quick system initialization, but this is not a suggested configuration.	POST runs the full spectrum of tests with the maximum output displayed.	POST runs the full spectrum of tests with the maximum output displayed.
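
The way the parameters in TABLE 2-6 and TABLE 2-7 combine can be sketched in a few lines of Python. This is an illustrative model only, not ILOM code. The names `effective_post` and `EffectivePost` are hypothetical, and the precedence order is an assumption read from the two tables. FIGURE 2-13 remains the authoritative description.

```python
from typing import NamedTuple, Optional


class EffectivePost(NamedTuple):
    """What POST would do for a given combination of settings (model only)."""
    runs: bool
    level: Optional[str]
    verbosity: Optional[str]
    trigger: Optional[str]


def effective_post(keyswitch_state="normal", mode="normal", level="min",
                   verbosity="normal", trigger="power-on-reset error-reset"):
    """Resolve the settings using the rules listed in TABLE 2-6 and TABLE 2-7.

    Assumed precedence: keyswitch_state first, then diag_mode, then the
    individual diag_level / diag_verbosity / diag_trigger values.
    """
    if keyswitch_state == "stby":
        # TABLE 2-6: the system cannot power on, so POST cannot run.
        return EffectivePost(False, None, None, None)
    if keyswitch_state == "diag":
        # TABLE 2-7, keyswitch preset column: overrides the other variables.
        return EffectivePost(True, "max", "max", "all-resets")
    if mode == "off" or trigger == "none":
        # POST does not run at all.
        return EffectivePost(False, None, None, None)
    if mode == "service":
        # Service mode presets diag_level and diag_verbosity (max in TABLE 2-7).
        return EffectivePost(True, "max", "max", trigger)
    # Normal mode: the individual settings apply as configured.
    return EffectivePost(True, level, verbosity, trigger)
```

For example, `effective_post(mode="service", trigger="all-resets")` models the Diagnostic Service Mode column of TABLE 2-7.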

2.6.2 Changing POST Parameters

You can use the web interface or the CLI to change the POST parameters.

2.6.2.1 Using the Web Interface to Change POST Parameters

1. From the ILOM web interface, select the Remote Control tab (FIGURE 2-14).

2. Select the Diagnostics tab.

3. Select the POST settings that you require.

TABLE 2-7 describes how the POST settings will execute.

4. Click the Save button.

Note - If you do not have a console window open, open one. POST displays output only in a console window, not in the web interface.

FIGURE 2-14 Setting POST Parameters With the ILOM Web Interface

Figure shows the Remote Control and Diagnostics tabs selected in an ILOM window.

5. Select the Remote Power Control tab.

6. Select a power control setting and click Save (FIGURE 2-15).

FIGURE 2-15 Changing Power Settings With the ILOM Web Interface

Figure shows the Remote Control and Remote Server Control tabs selected in an ILOM window.

When you power cycle the server module, POST runs and displays output to the service processor console window:

{0} ok Chassis | critical: Host has been powered off
Chassis | major: Host has been powered on
2007-11-07 18:22:19.511 0:0:0>
2007-11-07 18:22:19.560 0:0:0>Sun Blade T6320 Server Module POST 4.27.4 2007/10/02 19:09 
       /export/delivery/delivery/4.27/4.27.4/post4.27.x/Niagara/glendale/integrated  (root)  
2007-11-07 18:22:19.836 0:0:0>Copyright 2007 Sun Microsystems, Inc. All rights reserved
2007-11-07 18:22:20.001 0:0:0>VBSC cmp 0 arg is: 00ffffff.ffff00ff
2007-11-07 18:22:20.108 0:0:0>POST enabling threads: 00ffffff.ffff00ff
2007-11-07 18:22:20.223 0:0:0>VBSC mode is: 00000000.00000001
2007-11-07 18:22:20.321 0:0:0>VBSC level is: 00000000.00000001
2007-11-07 18:22:20.421 0:0:0>VBSC selecting Normal mode, MAX Testing.
2007-11-07 18:22:20.533 0:0:0>VBSC setting verbosity level 3
2007-11-07 18:22:20.629 0:0:0>  Niagara2, Version 2.1
2007-11-07 18:22:20.714 0:0:0>  Serial Number: 0f880060.768660a8
2007-11-07 18:22:20.843 0:0:0>Basic Memory Tests.....

7. Read the POST output to determine if you need to perform service actions.

See Section 2.6.3, Interpreting POST Messages.

2.6.2.2 Using the CLI to Change POST Parameters

1. Verify the current POST parameters with the show command. Type:

-> show /HOST/diag
 
/HOST/diag
    Targets:
 
    Properties:
        level = min
        mode = normal
        trigger = power-on-reset error-reset
        verbosity = normal
 
    Commands:
        cd
        set
        show
->

2. Type the set command to change the POST parameters.

TABLE 2-7 describes how the POST settings will execute. This example shows how to set the verbosity to max.

-> set /HOST/diag verbosity=max
Set 'verbosity' to 'max'
->

3. Power cycle the server module to run POST.

There are several ways to initiate a reset. The following example uses the ILOM reset command.

-> reset /SYS
Are you sure you want to reset /SYS (y/n)? y
Performing hard reset on /SYS
->

4. Read the POST output to determine if you need to perform service actions. See Section 2.6.3, Interpreting POST Messages.
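
The show and set steps in this procedure can also be scripted. The following Python sketch parses the Properties section of captured `show /HOST/diag` output and builds the `set` commands needed to reach a desired configuration. The function names are hypothetical, and quoting multi-word values (such as the trigger list) is an assumption that should be checked against the ILOM documentation before use.

```python
def parse_properties(show_output):
    """Return the 'name = value' pairs listed under 'Properties:'."""
    props = {}
    in_props = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.endswith(":"):
            # Section headers: Targets:, Properties:, Commands:
            in_props = (line == "Properties:")
            continue
        if in_props and " = " in line:
            key, value = line.split(" = ", 1)
            props[key.strip()] = value.strip()
    return props


def set_commands(current, desired, target="/HOST/diag"):
    """Build one ILOM set command for each property that differs."""
    commands = []
    for key, value in desired.items():
        if current.get(key) == value:
            continue
        # Assumption: values containing spaces are wrapped in double quotes.
        rendered = f'"{value}"' if " " in value else value
        commands.append(f"set {target} {key}={rendered}")
    return commands
```

With the output shown in Step 1 as input, requesting `{"verbosity": "max"}` yields the single command used in Step 2.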

2.6.3 Interpreting POST Messages

When POST finishes running and detects no faults, the system boots.

If POST detects a faulty device, the fault is displayed and the fault information is passed to ILOM for fault handling. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 1-3.

1. Interpret the POST messages:

POST error messages use the following syntax:

c:s > ERROR: TEST = failing-test
c:s > H/W under test = FRU
c:s > Repair Instructions: Replace items in order listed by H/W under test above
c:s > MSG = test-error-message
c:s > END_ERROR

In this syntax, c = the core number, s = the strand number.

Warning and informational messages use the following syntax:

INFO or WARNING: message

The following example shows a POST error message report for a missing PCI device:

0:0:0>ERROR: TEST = PIU PCI id test
0:0:0>H/W under test = MB/PCI-SWITCH
0:0:0>Repair Instructions: Replace items in order listed by ‘H/W under test’ above.
0:0:0>MSG = PCI ID test device missing Cont. 
					DEVICE NAME: MB/PCI-SWITCH
0:0:0>END_ERROR

2. Type the show faulty command to obtain additional fault information.

ILOM captures and logs the fault. The Service Action Required LED is lit, and the faulty component is disabled.

For example:

ok #.
->
-> show faulty
 
Target              | Property               | Value
--------------------+------------------------+----------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB/CMP0/BR0/CH0/D0
/SP/faultmgmt/0     | timestamp              | Sep 12 05:02:52
/SP/faultmgmt/0/    | timestamp              | Sep 12 05:02:52
 faults/0           |                        |
/SP/faultmgmt/0/    | sp_detected_fault      | /SYS/MB/CMP0/BR0/CH0/D0
 faults/0           |                        | Disabled by user
 
    Commands:
        cd
        show

In this example, /SYS/MB/CMP0/BR0/CH0/D0 is disabled by a user. The system can boot using memory that was not disabled until the faulty component is replaced.

Note - You can use ASR commands to display and control disabled components. See Section 2.9, Managing Components With Automatic System Recovery Commands.
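
When POST output is captured to a file, the error syntax described in Step 1 can be extracted with a short script. The following Python sketch is one possible approach. The function name and the dictionary keys are hypothetical. It assumes that each field line starts with a colon-separated numeric prefix followed by `>`, and that a line without such a prefix continues the previous field.

```python
import re

# Numeric prefix such as "0:0:0" followed by ">" and the message body.
PREFIX = re.compile(r"^\s*(\d+(?::\d+)+)\s*>\s*(.*)$")

# Field markers taken from the POST error syntax in this section.
FIELDS = (
    ("H/W under test =", "fru"),
    ("Repair Instructions:", "repair"),
    ("MSG =", "msg"),
)


def parse_post_errors(console_text):
    """Return one dict per ERROR ... END_ERROR block."""
    errors, current, last_key = [], None, None
    for raw in console_text.splitlines():
        match = PREFIX.match(raw)
        if not match:
            # Unprefixed line: treat as continuation of the previous field.
            if current is not None and last_key and raw.strip():
                current[last_key] += " " + raw.strip()
            continue
        source, body = match.group(1), match.group(2).strip()
        if body.startswith("ERROR: TEST ="):
            current = {"source": source, "test": body.split("=", 1)[1].strip()}
            last_key = "test"
        elif current is None:
            continue  # ordinary POST output outside an error block
        elif body == "END_ERROR":
            errors.append(current)
            current, last_key = None, None
        else:
            for marker, key in FIELDS:
                if body.startswith(marker):
                    current[key] = body[len(marker):].strip()
                    last_key = key
                    break
    return errors
```

The `fru` value of each returned entry is the FRU name to look up in TABLE 1-3.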

2.6.4 Clearing POST Detected Faults

In most cases, when POST detects a faulty component, POST logs the fault and automatically takes the failed component out of operation by placing the component in the ASR blacklist.

See Section 2.9, Managing Components With Automatic System Recovery Commands.

After the faulty FRU is replaced, the fault is normally automatically cleared. In some cases it might be necessary to manually clear the fault by removing the component from the ASR blacklist.

2.6.4.1 Clearing Faults With the Web Interface

This procedure describes how to enable components after a POST fault has been generated. The POST fault log is not actually cleared.

1. Select the System Information and Components tabs (FIGURE 2-16).

2. Select the radio button for the component that you must clear.

3. In the Actions menu, select Enable Component.

FIGURE 2-16 Enabling Components With the ILOM Web Interface

Figure shows the component management window.

Note - The Clear Faults command in the Actions menu clears only PSH-generated faults and does not enable a component.

2.6.4.2 Clearing Faults With the ILOM CLI

1. At the ILOM prompt, type the show faulty command to identify POST detected faults.

POST detected faults are distinguished from other faults by the following text, and by the absence of a UUID number:

deemed faulty and disabled

For example:

-> show faulty

If no fault is reported, you do not need to do anything else. Do not perform the subsequent steps.

If a fault is detected, continue with Step 2.

2. Type the set component_state=enabled command to clear the fault and remove the component from the ASR blacklist.

Type the cd command with the FRU name that was reported in the fault in the previous step.

This example shows how to change directory to thread P32 on the CPU and enable it.

-> cd /SYS/MB/CMP0/P32
/SYS/MB/CMP0/P32
 
-> show
 
 /SYS/MB/CMP0/P32
    Targets:
 
    Properties:
        type = CPU thread
        component_state = Disabled
 
    Commands:
        cd
        show
 
-> set component_state=enabled
Set 'component_state' to 'enabled'

The fault is cleared and should not show up when you type the show faulty command. Additionally, the Service Action Required LED is no longer illuminated.

3. Reboot the server module.

You must reboot the server module for the set component_state=enabled command to take effect.

4. At the ILOM prompt, type the show faulty command to verify that no faults are reported.

-> show faulty
Last POST run: THU MAR 09 16:52:44 2006
POST status: Passed all devices
 
No failures found in System

2.6.4.3 Clearing Faults Manually with ILOM

The ILOM set /SYS/component clear_fault_action=true command allows you to manually clear certain types of faults without replacing a FRU. It also allows you to clear a fault if ILOM was unable to automatically detect the FRU replacement.

2.6.4.4 Clearing Hard Drive Faults

ILOM can detect hard drive replacement. However, to configure and unconfigure a hard drive, you must type the Solaris cfgadm command. See Section 3.1, Hot-Plugging a Hard Drive. ILOM does not handle hard drive faults. Use the Solaris message files to view hard drive faults. See Section 2.8, Collecting Information From Solaris OS Files and Commands.

2.7 Using the Solaris Predictive Self-Healing Feature

The Solaris Predictive Self-Healing (PSH) technology enables the Sun Blade T6340 server module to diagnose problems while the Solaris OS is running. Many problems can be resolved before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. Once diagnosed, the fault manager daemon assigns the problem a Universally Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the system and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.

The Predictive Self-Healing technology covers the following Sun Blade T6340 server module components:

UltraSPARC® T2 Plus multicore processor (CPU)

Memory

I/O bus

The PSH console message provides the following information:

Type

Severity

Description

Automated response

Impact

Suggested action for system administrator

If the Solaris PSH facility has detected a faulty component, type the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name. For a list of FRU names, see TABLE 1-3.

Note - Additional Predictive Self-Healing information is available at: http://www.sun.com/msg

2.7.1 Identifying Faults With the `fmadm faulty` and `fmdump` Commands

2.7.1.1 Using the `fmadm faulty` Command

1. Use the fmadm faulty command to identify a faulty component.

# fmadm faulty
STATE RESOURCE /UUID 
faulted cpu:///cpuid=8/serial=FAC006AE4515C47
	8856153f-6f9b-47c6-909a-b05180f53c07

The output shows the UUID of the related fault and provides information for clearing the fault.

2. Use the output of this command to clear the fault as shown in Section 2.7.2, Clearing PSH Detected Faults.

If fmadm faulty does not identify a faulty component or if you need more detailed information, type the fmdump command.
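
If you capture `fmadm faulty` output for later processing, the two-line layout shown in Step 1 is straightforward to parse. The following Python sketch assumes that layout: a state and resource on one line, followed by the UUID alone on the next line. The function name and dictionary keys are hypothetical.

```python
def parse_fmadm_faulty(text):
    """Return a list of {'state', 'resource', 'uuid'} dicts."""
    faults = []
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0] == "STATE":
            continue  # blank line or column header
        if len(tokens) == 1 and faults and faults[-1]["uuid"] is None:
            # A single token after a resource line is taken as its UUID.
            faults[-1]["uuid"] = tokens[0]
        elif len(tokens) >= 2:
            faults.append({"state": tokens[0], "resource": tokens[1], "uuid": None})
    return faults
```

An empty list means that no faulty components were reported.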

2.7.1.2 Using the `fmdump` Command

The fmdump command displays the list of faults detected by the Solaris PSH facility. Use this command for the following reasons:

To see if any faults have been detected by the Solaris PSH facility.

To obtain the fault message ID (SUNW-MSG-ID) for detected faults.

To verify that the replacement of a FRU has not generated any additional faults.

If you already have a fault message ID, go to Step 2 to obtain more information about the fault from the Sun Predictive Self-Healing Knowledge Article web site.

Note - Faults detected by the Solaris PSH facility are also reported through ILOM alerts. In addition to the PSH fmdump command, the ILOM show faulty command also provides information about faults and displays fault UUIDs. See Section 2.5.1, Displaying System Faults.

1. Check the event log by typing the fmdump command with -v for verbose output.

For example:

# fmdump -v
TIME				UUID					SUNW-MSG-ID
Apr 24 06:54:08.2005 1ce22523-1c80-6062-e61d-f3b39290ae2c SUN4V-8000-6H
100% fault.cpu.ultraSPARCT2l2cachedata
	FRU:hc:///component=MB
	rsrc: cpu:///cpuid=0/serial=22D1D6604A

In this example, a fault is displayed, indicating the following details:

Date and time of the fault (Apr 24 06:54:08.2005)

Universally Unique Identifier (UUID) that is unique for every fault (1ce22523-1c80-6062-e61d-f3b39290ae2c)

Sun message identifier (SUN4V-8000-6H) that can be used to obtain additional fault information

Faulted FRU (FRU:hc:///component=MB). In this example it is identified as MB, indicating that the motherboard requires replacement.

2. Use the Sun message ID to obtain more information about this type of fault.

a. In a browser, go to the Predictive Self-Healing Knowledge Article web site: http://www.sun.com/msg

b. Type the message ID in the SUNW-MSG-ID field, and press Lookup.

In this example, the message ID SUN4V-8000-6H returns the following information for corrective action:

CPU errors exceeded acceptable levels
 
Type
    Fault 
Severity
    Major 
Description
    The number of errors associated with this CPU has exceeded acceptable levels. 
Automated Response
    The fault manager will attempt to remove the affected CPU from service. 
Impact
    System performance may be affected. 
 
Suggested Action for System Administrator
    Schedule a repair procedure to replace the affected CPU,
the identity of which can be determined using 
fmdump -v -u <EVENT_ID>. 
 
Details
    The Message ID: SUN4V-8000-6H indicates diagnosis has
determined that a CPU is faulty. The Solaris fault manager arranged
an automated attempt to disable this CPU. The recommended action
for the system administrator is to contact Sun support so a Sun
service technician can replace the affected component.

c. Follow the suggested actions to repair the fault.
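
The fields described in Step 1 (time, UUID, message ID, fault class, FRU, and resource) can be pulled out of saved `fmdump -v` output with a short script. The following Python sketch assumes the layout of the example in this section. The function name, the dictionary keys, and the message ID pattern are illustrative assumptions, not part of the Solaris tools.

```python
import re

# Assumed shape of a message ID such as SUN4V-8000-6H.
MSGID_RE = re.compile(r"^[A-Z0-9]+-\d{4}-[0-9A-Z]{2}$")


def parse_fmdump_verbose(text):
    """Return one dict per event found in 'fmdump -v' output."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        tokens = line.split()
        if not tokens or tokens[0] == "TIME":
            continue  # blank line or column header
        if len(tokens) >= 3 and MSGID_RE.match(tokens[-1]):
            # Event line: timestamp words, then UUID, then message ID.
            events.append({
                "time": " ".join(tokens[:-2]),
                "uuid": tokens[-2],
                "msg_id": tokens[-1],
                "faults": [],
                "fru": None,
                "rsrc": None,
            })
        elif not events:
            continue
        elif tokens[0].endswith("%") and len(tokens) >= 2:
            # Certainty and fault class, for example "100% fault.cpu...".
            events[-1]["faults"].append((tokens[0], tokens[1]))
        elif line.lower().startswith("fru:"):
            events[-1]["fru"] = line[4:].strip()
        elif line.lower().startswith("rsrc:"):
            events[-1]["rsrc"] = line[5:].strip()
    return events
```

The `msg_id` value of each event is the string to enter in the SUNW-MSG-ID field in Step 2.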

2.7.2 Clearing PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. After the fault condition is corrected, for example by replacing a faulty FRU, you might have to clear the fault.

1. After replacing a faulty FRU, boot the system.

2. Type fmadm faulty:

# fmadm faulty
STATE RESOURCE /UUID 
faulted cpu:///cpuid=8/serial=FAC006AE4515C47
	8856153f-6f9b-47c6-909a-b05180f53c07

3. Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, type the following command, where the argument is either the fault UUID or the FMRI of the resource:

fmadm repair UUID | FMRI

For example:

# fmadm repair cpu:///cpuid=8/serial=FAC006AE4515C47
fmadm: recorded repair to cpu:///cpuid=8/serial=FAC006AE4515C47
# fmadm faulty
	STATE RESOURCE/UUID

Note - You can also use the FRU fault UUID instead of the Fault Management Resource Identifier (FMRI).

Typing fmadm faulty after the repair command verifies that there are no more faults.
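
The repair and verify steps above can be sketched in Python as follows. The helper names are hypothetical, and the check that distinguishes a UUID from an FMRI is a simple heuristic assumed for illustration. The sketch only builds the argument list and inspects captured output; it does not run `fmadm` itself.

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE
)


def repair_argv(identifier):
    """Return the argument list for 'fmadm repair' with a UUID or FMRI."""
    ident = identifier.strip()
    # Heuristic (assumption): an FMRI contains "://" and no spaces.
    looks_like_fmri = "://" in ident and " " not in ident
    if not (UUID_RE.match(ident) or looks_like_fmri):
        raise ValueError(f"not a UUID or FMRI: {identifier!r}")
    return ["fmadm", "repair", ident]


def no_faults_listed(fmadm_faulty_output):
    """True when 'fmadm faulty' output contains only the header (or nothing)."""
    for line in fmadm_faulty_output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] != "STATE":
            return False
    return True
```

A list returned by `repair_argv` can be passed to a process launcher such as `subprocess.run` on the Solaris host.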

2.7.3 Clearing the PSH Fault From the ILOM Logs

When the Solaris PSH facility detects faults, the faults are also logged by the ILOM software.

Note - If you clear the faults using Solaris PSH, you do not have to clear the faults in ILOM. If you clear the faults in ILOM, you do not have to clear them with Solaris PSH.

Note - If you are diagnosing or replacing faulty DIMMs, do not follow this procedure. Instead, perform the procedure in Section 4.3.2, Replacing the DIMMs.

1. After replacing a faulty FRU, type the show faulty command at the ILOM -> prompt to identify PSH detected faults.

PSH detected faults are distinguished from other faults by the text:
Host detected fault.

For example:

-> show faulty

If no fault is reported, you do not need to do anything else.

If the fault is still reported, continue with Step 2.

2. Use the ILOM clear_fault_action property to clear the fault on the component identified in the show faulty output:

-> set /SYS/component clear_fault_action=true
Clearing fault from component...
Fault cleared.

2.8 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the Sun Blade T6340 server module, you have all the Solaris OS files and commands available for collecting information and for troubleshooting.

In the event that POST, ILOM, or the Solaris PSH features did not indicate the source of a fault, check the message buffer and log files for fault notifications. Hard drive faults are usually captured by the Solaris message files.

Type the dmesg command to view the most recent system messages.

Use the /var/adm/messages file to view the system messages log file.

2.8.1 Checking the Message Buffer

1. Log in as superuser.

2. Type the dmesg command.

# dmesg

The dmesg command displays the most recent messages generated by the system.

2.8.2 Viewing the System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages can alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

1. Log in as superuser.

2. Type the following command.

# more /var/adm/messages

3. If you want to view all logged messages, type:

# more /var/adm/messages*
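
Because messages are spread across the rotated files described above, a search usually needs to read the files in age order. The following Python sketch lists the message files oldest first and yields lines that match a pattern. The function names are hypothetical, and the sketch assumes that rotated files use only the numeric suffixes described in this section (messages.1, messages.2, and so on).

```python
import re
from pathlib import Path

NAME_RE = re.compile(r"messages(?:\.(\d+))?")


def rotation_index(path):
    """Return -1 for 'messages' (newest) or N for 'messages.N'."""
    match = NAME_RE.fullmatch(Path(path).name)
    if match is None:
        raise ValueError(f"not a messages file: {path}")
    return -1 if match.group(1) is None else int(match.group(1))


def message_files(log_dir="/var/adm"):
    """Return the message files in chronological order, oldest first."""
    files = [p for p in Path(log_dir).glob("messages*")
             if NAME_RE.fullmatch(p.name)]
    # A higher suffix means an older file, so sort by index descending.
    return sorted(files, key=rotation_index, reverse=True)


def grep_messages(pattern, log_dir="/var/adm"):
    """Yield (file name, line) for each line that matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in message_files(log_dir):
        with path.open(errors="replace") as handle:
            for line in handle:
                if regex.search(line):
                    yield path.name, line.rstrip("\n")
```

For example, `list(grep_messages("warning"))` collects every warning line from all rotated files when run as a user who can read /var/adm.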

2.9 Managing Components With Automatic System Recovery Commands

The Automatic System Recovery (ASR) feature enables the server module to automatically unconfigure failed components to remove them from operation until they can be replaced. In the Sun Blade T6340 server module, the following components are managed by the ASR feature:

UltraSPARC T2 processor strands

Memory DIMMs

I/O bus

The database that contains the list of disabled components is called the ASR blacklist (asr-db).

In most cases, POST automatically disables a component when it is faulty. After the cause of the fault is repaired (FRU replacement, loose connector reseated, and so on), you must remove the component from the ASR blacklist.

The ASR commands (TABLE 2-8) enable you to view and manually add or remove components from the ASR blacklist. These commands are run from the ILOM -> prompt. For information about ALOM CMT commands, see the Sun Integrated Lights Out Manager 2.0 Supplement for Sun Blade T6340 Server Modules, 820-3904.

TABLE 2-8 ASR Commands
ILOM Web Interface	ILOM Command	ALOM Command	Description
Select the following tabs: System Information, Components, Actions, then select the action.	`show /SYS/`component `component_state`	`showcomponent` ^[3]	Displays system components and their current state.
	`set /SYS/`component `component_state=enabled`	`enablecomponent` asrkey	Removes a component from the `asr-db` blacklist, where asrkey is the component to enable.
	`set /SYS/`component `component_state=disabled`	`disablecomponent` asrkey	Adds a component to the `asr-db` blacklist, where asrkey is the component to disable.
No equivalent in ILOM		`clearasrdb`	Removes all entries from the `asr-db` blacklist.

Note - The components (asrkeys) vary from system to system, depending on how many cores and how much memory are present. Type the showcomponent command to see the asrkeys on a given system.

Note - A reset or power cycle is required after disabling or enabling a component. If you change the state of a component while the system is powered on, the change has no effect on the system until the next reset or power cycle.

2.9.1 Displaying System Components With the `show /SYS` Command

To see examples of ILOM web interface and CLI commands that show component status, see Section 2.5.2, Displaying the Environmental Status with the ILOM CLI.

The show command displays the system components (asrkeys) and reports their status.

1. At the -> prompt, type the show command.

The following example shows a system with no disabled components:

-> show -level all -o table component_state
 
Target              | Property               | Value
--------------------+------------------------+---------------------------------
/SYS/MB/PCIE0       | component_state        | Enabled
/SYS/MB/PCIE1       | component_state        | Enabled
/SYS/MB/PCI-        | component_state        | Enabled
 SWITCH0            |                        |
/SYS/MB/PCI-        | component_state        | Enabled
 SWITCH1            |                        |
/SYS/MB/REM         | component_state        | Enabled
/SYS/MB/NET0        | component_state        | (none)
/SYS/MB/NET1        | component_state        | (none)
/SYS/MB/PCIE-IO     | component_state        | Enabled
/SYS/MB/PCIE-IO/    | component_state        | Enabled
 USB                |                        |
/SYS/MB/PCIE-IO/    | component_state        | Enabled
 GRFX               |                        |
/SYS/MB/CMP0/MCU0   | component_state        | Enabled
/SYS/MB/CMP0/MCU1   | component_state        | Enabled
 
Commands:
        cd
        show
->
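
On a system with many components, it can be easier to filter this table with a script than to read it by eye. The following Python sketch parses captured table output, rejoins target names that ILOM wraps onto a second line, and lists the components whose state is Disabled. The function names are hypothetical, and the wrapping rule (a continuation row has an empty Property column) is an assumption based on the example above.

```python
def parse_ilom_table(text):
    """Parse 'Target | Property | Value' rows, rejoining wrapped rows."""
    rows = []
    for raw in text.splitlines():
        if raw.count("|") < 2:
            continue  # separator, blank line, prompt, or Commands section
        target, prop, value = (cell.strip() for cell in raw.split("|", 2))
        if target == "Target" and prop == "Property":
            continue  # column header
        if not prop and rows:
            # Continuation row: append the wrapped pieces to the previous row.
            rows[-1]["target"] += target
            if value:
                rows[-1]["value"] = (rows[-1]["value"] + " " + value).strip()
            continue
        rows.append({"target": target, "property": prop, "value": value})
    return rows


def disabled_components(rows):
    """Return the targets whose component_state is Disabled."""
    return [row["target"] for row in rows
            if row["property"] == "component_state"
            and row["value"].lower() == "disabled"]
```

For the output shown above, `disabled_components` returns an empty list, because every component is either Enabled or reports (none).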

2.10 Exercising the System With SunVTS

Sometimes a system exhibits a problem that cannot be isolated definitively to a particular hardware or software component. In such cases, it might be useful to run a diagnostic tool that stresses the system by continuously running a comprehensive battery of tests. Sun provides the SunVTS software for this purpose.

2.10.1 Checking SunVTS Software Installation

This procedure assumes that the Solaris OS is running on the Sun Blade T6340 server module, and that you have access to the Solaris command line.

1. Check for the presence of SunVTS packages using the pkginfo command.

# pkginfo | grep -i vts
system 				SUNWvts 				SunVTS Framework
system 				SUNWvtsmn 				SunVTS Man Pages
system 				SUNWvtsr 				SunVTS Framework (root)
system 				SUNWvtss 				SunVTS Server and BUI
system 				SUNWvtsts 				SunVTS Core Installation Tests
#

If SunVTS software is loaded, information about the packages is displayed.

If SunVTS software is not loaded, no information is displayed.

TABLE 2-9 lists some SunVTS packages.

TABLE 2-9 Sample of Installed SunVTS Packages
Package	Description
`SUNWvts`	SunVTS framework
`SUNWvtsr`	SunVTS framework (root)
`SUNWvtss`	SunVTS middle server and BUI components
`SUNWvtsts`	SunVTS core installation tests
`SUNWvtsmn`	SunVTS man pages
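
The package check can be automated by comparing captured `pkginfo` output against the packages listed in TABLE 2-9. The following Python sketch assumes the three-column pkginfo layout shown above (category, package name, description). The function names and the `EXPECTED` tuple are hypothetical.

```python
# Package names taken from TABLE 2-9.
EXPECTED = ("SUNWvts", "SUNWvtsmn", "SUNWvtsr", "SUNWvtss", "SUNWvtsts")


def installed_packages(pkginfo_output):
    """Map package name to description for each line of pkginfo output."""
    packages = {}
    for line in pkginfo_output.splitlines():
        parts = line.split(None, 2)  # category, package name, description
        if len(parts) >= 2:
            packages[parts[1]] = parts[2].strip() if len(parts) == 3 else ""
    return packages


def missing_sunvts(pkginfo_output, expected=EXPECTED):
    """Return the expected SunVTS packages that are not listed."""
    present = installed_packages(pkginfo_output)
    return [name for name in expected if name not in present]
```

An empty list from `missing_sunvts` indicates that all five packages are present.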

If SunVTS is not installed, you can obtain the installation packages from the following resources:

Solaris Operating System DVDs

Sun Download Center: http://www.sun.com/oem/products/vts

The SunVTS 7.0 software, and subsequent compatible versions, are supported on the Sun Blade T6340 server module.

SunVTS installation instructions are described in the SunVTS 7.0 User’s Guide, 820-0012.

2.10.2 Exercising the System Using SunVTS Software

Before you begin, the Solaris OS must be running. You should verify that SunVTS validation test software is installed on your system. See Section 2.10.1, Checking SunVTS Software Installation.

The SunVTS installation process requires that you specify one of two security schemes to use when running SunVTS. The security scheme you choose must be properly configured in the Solaris OS for you to run SunVTS.

SunVTS software features both character-based and graphics-based interfaces.

For more information about the character-based SunVTS TTY interface, and specifically for instructions on accessing it by TIP or telnet commands, refer to the SunVTS 7.0 User’s Guide.

Finally, this procedure describes how to run SunVTS tests in general. Individual tests might presume the presence of specific hardware, or might require specific drivers, cables, or loopback connectors. For information about test options and prerequisites, refer to the following documentation:

SunVTS 7.0 Test Reference Manual for SPARC Platforms

SunVTS 7.0 User’s Guide

1. Log in as superuser to a system with a graphics display.

The display system should be one with a frame buffer and monitor capable of displaying bitmap graphics such as those produced by the SunVTS BUI.

2. Enable the remote display.

On the display system, type:

# /usr/openwin/bin/xhost + test-system

where test-system is the name of the server you plan to test.

3. Remotely log in to the server as superuser.

Type a command such as rlogin or telnet.

4. Start SunVTS software.

# /usr/sunvts/bin/startsunvts

As SunVTS starts, it prompts you to choose the CLI, BUI, or TTY interface. FIGURE 2-17 shows a representative SunVTS BUI.

FIGURE 2-17 SunVTS BUI

This screen capture shows a small portion of the test selection area in the SunVTS graphical interface.

5. (Optional) Select the test category you want to run.

Certain tests are enabled by default, and you can choose to accept these.

Alternatively, you can enable or disable test categories by clicking the checkbox next to the test name or test category name. Tests are enabled when checked, and disabled when not checked.

TABLE 2-10 lists tests that are especially useful to run on this server.

TABLE 2-10 Useful SunVTS Tests to Run on This Server
Category	SunVTS Tests	FRUs Exercised by Tests
CPU	`mptest`	CPU and motherboard
Graphics	`pfbtest`, `graphicstest`; indirectly: `systest`	DIMMs, CPU motherboard
Processor	`cmttest`, `cputest`, `fputest`, `iutest`, `l1dcachetest`, `dtlbtest`, and `l2sramtest`; indirectly: `mptest` and `systest`	DIMMs, CPU motherboard
Disk	`disktest`	Disks, cables, disk backplane
Environment	`hsclbtest`, `cryptotest`	Crypto engine (CPU), SP <--> host communication channels (motherboard)
Network	`nettest`, `netlbtest`, `xnetlbtest`	Network interface, network cable, CPU motherboard
Memory	`pmemtest`, `vmemtest`, `ramtest`	DIMMs, motherboard
I/O ports	`usbtest`, `iobustest`	Motherboard, service processor (host to service processor interface)

6. (Optional) Customize individual tests.

You can customize an individual test by right-clicking the name of the test.

7. Start testing.

Click the Start button that is located at the top left of the SunVTS window. Status and error messages appear in the test messages area located across the bottom of the window. You can stop testing at any time by clicking the Stop button.

During testing, SunVTS software logs all status and error messages. To view these messages, click the Log button or select Log Files from the Reports menu. This action opens a log window from which you can choose to view the following logs:

Information - Detailed versions of all the status and error messages that appear in the test messages area.

Test Error - Detailed error messages from individual tests.

VTS Kernel Error - Error messages pertaining to SunVTS software itself. Look here if SunVTS software appears to be acting strangely, especially when it starts up.

Solaris OS Messages (/var/adm/messages) - A file containing messages generated by the operating system and various applications.

Log Files (/var/sunvts/logs) - A directory containing the log files.

2.11 Resetting the Password to the Factory Default

The procedure for resetting the ILOM root password to the factory default (changeme) requires installation of a jumper on the service processor. This procedure should be performed by a technician, a service professional, or a system administrator who services and repairs computer systems. This person should meet the criteria described in the preface of the Sun Blade T6340 Server Module Service Manual.

2.11.1 To Reset the Root Password to the Factory Default

1. Remove the server module from the modular system chassis.

Prepare for removal using ILOM or ALOM CMT commands and ensure that the blue OK to Remove LED is lit, indicating that it is safe to remove the blade.

2. Open the server module and install a standard jumper at location J0601, pins 11 and 12.

3. Close the server module, install it in the modular system chassis, and boot the server module.

Refer to the Sun Blade T6340 Server Module Installation and Administration Guide for instructions.

The ILOM root password is now reset to the factory default (changeme).

4. Change the root password.

Refer to the Sun Blade T6340 Server Module Installation and Administration Guide for instructions.

5. Remove the server module from the modular system chassis and remove the jumper.

As in Step 1, prepare for removal using ILOM or ALOM CMT commands and ensure that the blue OK to Remove LED is lit, indicating that it is safe to remove the blade.

6. Close the server module, install it in the modular system chassis, and boot the server module.

Refer to the Sun Blade T6340 Server Module Installation and Administration Guide for instructions.

^[1] Upgrade path: DIMMs should be added with each group populated in the order shown.

^[2] The keyswitch_state parameter, when set to diag, overrides all the other POST variables.

^[3] The showcomponent command might not report all blacklisted DIMMs.

