1.1 Fault on Initial Power Up

If you have installed the server and, upon initial power up, you see errors indicating faults with the Fully Buffered DIMMs (FB-DIMMs), PCI cards, or other components, the suspect component might have become loose during shipment.

Conduct a visual inspection of the server internals and its components. Remove the top cover and physically reseat the cable connections, the PCI cards, and the FB-DIMMs.

1.2 Server Diagnostics Overview

There are a variety of diagnostic tools, commands, and indicators you can use to monitor and troubleshoot a server:

- Status indicators – These indicators provide a quick visual notification of the status of the server and of some of the FRUs.
- Fault management architecture – FMA provides simplified fault diagnostics through use of the /var/adm/messages file, the fmdump command, and a Sun Microsystems web site.
- ILOM firmware – This system firmware runs on the service processor. In addition to providing the interface between the hardware and OS, ILOM also tracks and reports the health of key server components. ILOM works closely with POST and Solaris Predictive Self Healing technology to keep the system up and running even when there is a faulty component.
- Power-on self-test (POST) – POST performs diagnostics on system components upon system reset to ensure the integrity of those components. POST is configurable and works with ILOM to take faulty components offline if needed.
- Solaris Predictive Self Healing (PSH) – This technology continuously monitors the health of the CPU and memory, and works with ILOM to take a faulty component offline if needed. The Predictive Self Healing technology enables Sun systems to accurately predict component failures and mitigate many serious problems before they occur.
- Log files and console messages – These provide the standard Solaris Operating System (OS) log files and investigative commands that can be accessed and displayed on the device of your choice.

The LEDs, ILOM, Solaris PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes information to ILOM, where the fault is also logged. Depending on the fault, one or more LEDs might light.

**TABLE 1-1 Diagnostic Actions**
Action No.	Diagnostic Action	Resulting Action	Additional Information
1	Check Power OK and Input OK LEDs on the server.	The Power OK LED is located on the front and rear of the chassis. The Input OK LED is located on the rear of the server on each power supply. If these LEDs are not on, check the power source and power connections to the server.	Using the Status Indicators to Identify the State of Devices
2	Check the Solaris log files for fault information.	The Solaris message buffer and log files record system events and provide information about faults. If system messages indicate a faulty device, replace the FRU.	Collecting Information From Solaris OS Files and Commands
3	Determine if the fault is an environmental fault.	Environmental faults can be caused by faulty FRUs (power supply, fan, or blower), or by environmental conditions such as when computer room ambient temperature is too high, or the server airflow is blocked. When the environmental condition is corrected, the fault will automatically clear.	Using the Service Processor Firmware for Diagnosis and Repair Verification
		If the fault indicates that a fan, blower, or power supply is bad, you can perform a hot-swap of the FRU. You can also use the fault indicators on the server to identify the faulty FRU (fans, blower, and power supplies).	Using the Status Indicators to Identify the State of Devices
4	Determine if the fault was detected by PSH.	If the fault message displays the following text, the fault was detected by the Solaris Predictive Self Healing software: `Host detected fault`	Using the Solaris Predictive Self Healing Feature
		If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU.	Clear PSH Detected Faults
		After replacing the FRU, perform the procedure to clear PSH detected faults.
5	Contact Sun for Support.	The majority of hardware faults are detected by the server’s diagnostics. In rare cases, a problem might require additional troubleshooting. If you are unable to determine the cause of the problem, contact Sun for support.	Sun Support information: `http://www.sun.com/support`
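
Action 4 in TABLE 1-1 amounts to a text check on the fault message. The following Python sketch is illustrative only: the function name and the sample messages are hypothetical, and only the marker string `Host detected fault` is taken from the table.

```python
# Marker text that TABLE 1-1 associates with PSH-detected faults.
PSH_MARKER = "Host detected fault"


def detected_by_psh(fault_message: str) -> bool:
    """Return True if the fault message carries the PSH marker text."""
    return PSH_MARKER in fault_message


# Hypothetical messages, made up for illustration (not captured output).
sample_psh = "Host detected fault, MSGID: SUN4V-8000-DX"
sample_other = "fru at location has been removed."

print(detected_by_psh(sample_psh))    # True
print(detected_by_psh(sample_other))  # False
```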

1.2.1 Memory Fault Handling

The server uses an advanced ECC technology, called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as all of the bits are in the same DRAM. If a DRAM fails, the FB-DIMM continues to function.
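
The correction condition can be pictured with a small Python sketch. It is not the ECC algorithm itself. It assumes, purely for illustration, that bits are numbered from 0 and that each DRAM supplies one aligned 4-bit nibble of the data word. The function name is hypothetical.

```python
def same_nibble(error_bits):
    """Return True if every failing bit index lies in one aligned nibble.

    Assumption (not from the manual): nibble n covers bit indexes
    4*n through 4*n + 3, and one DRAM supplies exactly one nibble.
    An empty list has no errors to correct and returns False.
    """
    nibbles = {bit // 4 for bit in error_bits}
    return len(nibbles) == 1


print(same_nibble([8, 9, 10, 11]))  # True: all four bits of nibble 2
print(same_nibble([3, 4]))          # False: bits fall in nibbles 0 and 1
```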

The Predictive Self Healing (PSH) technology in the Solaris OS uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID), and logged. PSH reports the fault and provides a recommended proactive replacement for the FB-DIMMs associated with the fault.

If you suspect that the server has a memory problem, see Replacing FB-DIMMs for FB-DIMM replacement instructions. You must perform the instructions in that chapter to clear the faults and enable the replaced FB-DIMMs.

1.3 Using the Status Indicators to Identify the State of Devices

The server provides status indicators in the upper left corner of the front panel (FIGURE 1-1) and on the rear panel (FIGURE 1-2). These indicators provide a visual means of determining the state of the system or individual components.

FIGURE 1-1 Front Panel Status and Alarm Status Indicators

Figure showing the location of the server and alarm status indicators on the front bezel.

Figure Legend

1 White Locator Indicator and Button

2 Yellow Service Required Indicator

3 Green Running Indicator

4 On/Standby Button

5 Red Critical Alarm Indicator

6 Red Major Alarm Indicator

7 Amber Minor Alarm Indicator

8 Amber User Alarm Indicator

FIGURE 1-2 Rear Panel Status Indicators

Figure Legend

1 Alarm port

2 White Locator Indicator and button

3 Yellow Service Required Indicator

4 Green Running Indicator

5 Management network port indicators

6 Ethernet port indicators

7 Power supply indicators

8 Video port

TABLE 1-2 lists and describes the front and rear panel indicators.

**TABLE 1-2 Front and Rear Panel Indicators**
LED	Location	Color	Description
Locator Indicator and Button	Front upper left and rear left	White	Enables you to identify a particular server. The LED is activated using one of the following methods: Issuing the `setlocator` `on` or `off` command. Pressing the button to toggle the indicator on or off. This LED provides the following indications: Off – Normal operating state. Fast blink – The server received a signal as a result of one of the preceding methods.
Fault Indicator	Front upper center and rear center	Yellow	If on, indicates that service is required.
Activity Indicator	Front upper right and rear right	Green	On – Drives are receiving power. Solidly on if drive is idle. Flashing – Drives are processing a command. Off – Power is off.
Power Button	Front upper right		Turns the host system on and off. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button. Press this button once for a graceful shutdown. Press this button for 4 seconds for an emergency shutdown.
Power OK Indicator	Rear center	Green	Provides the following indications: Off – The system is unavailable. Either the system has no power or ILOM is not running. Steady on – Indicates that the system is powered on and is running in its normal operating state. Standby blink – Indicates that the service processor is running while the system is running at a minimum level in Standby mode, and is ready to be returned to its normal operating state. Slow blink – Indicates that a normal transitory activity is taking place. The system diagnostics might be running, or the system might be booting.
Critical Alarm Indicator	Front left	Red	Indicates a critical alarm.
Major Alarm Indicator	Front left	Red	Indicates a major alarm.
Minor Alarm Indicator	Front left	Amber	Indicates a minor alarm.
User Alarm Indicator	Front left	Amber	Indicates a user alarm.

1.3.1 Hard Drive Indicators

The hard drive indicators (FIGURE 1-3 and TABLE 1-3) are located on the front of each hard drive that is installed in the server chassis.

FIGURE 1-3 Hard Drive Indicators

Figure showing the hard drive indicators.

Figure Legend

1 OK to Remove

2 Fault

3 Activity

**TABLE 1-3 Hard Drive Indicators**
Indicator	Color	Description
OK to Remove	Blue	On – The drive is ready for hot-plug removal. Off – Normal operation.
Fault	Amber	On – The drive has a fault and requires attention. Off – Normal operation.
Activity	Green	On – The drive is receiving power. Solidly lit if drive is idle. Flashing – The drive is processing a command. Off – Power is off.

1.3.2 Power Supply Indicators

The power supply indicators (FIGURE 1-4 and TABLE 1-4) are located on the rear of each power supply.

FIGURE 1-4 Power Supply Indicators

Figure showing the power supply indicator.

Figure Legend

1 Power OK indicator

2 Fault indicator

3 Input OK indicator

**TABLE 1-4 Power Supply Indicators**
Indicator	Color	Description
Power OK	Green	On – Normal operation. DC output voltage is within normal limits. Off – Power is off.
Fault	Amber	On – Power supply has detected a failure. Off – Normal operation.
Input OK	Green	On – Normal operation. Input power is within normal limits. Off – No input voltage, or input voltage is below limits.
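
The three power supply indicators can be read together. The Python sketch below maps their on/off states to the descriptions in TABLE 1-4. The function name and the order in which the indicators are checked (Fault first, then Input OK, then Power OK) are assumptions made for illustration; the manual does not define a priority.

```python
def power_supply_status(power_ok: bool, fault: bool, input_ok: bool) -> str:
    """Translate indicator states (True = lit) into a TABLE 1-4 description."""
    # Assumed priority: a lit Fault indicator is reported first.
    if fault:
        return "Power supply has detected a failure"
    if not input_ok:
        return "No input voltage, or input voltage is below limits"
    if not power_ok:
        return "Power is off"
    return "Normal operation"


print(power_supply_status(power_ok=True, fault=False, input_ok=True))
print(power_supply_status(power_ok=False, fault=False, input_ok=False))
```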

1.3.3 Ethernet Port Indicators

The ILOM management Ethernet port and the four 10/100/1000 Mbps Ethernet ports each have two indicators, as shown in FIGURE 1-5 and described in TABLE 1-5.

FIGURE 1-5 Ethernet Port Indicators

Figure Legend

1 Speed indicator (same location for all Ethernet ports)

2 Link/Activity indicator (same location for all Ethernet ports)

**TABLE 1-5 Ethernet Port Indicators**
Indicator	Color	Description
Right indicator	Green	Link/Activity indicator: Steady On – A link is established. Blinking – There is activity on this port. Off – No link is established.
Left indicator	Amber or Green	Speed indicator: Amber On – The link is operating as a Gigabit connection (1000-Mbps). Green On – The link is operating as a 100-Mbps connection. Off – The link is operating as a 10-Mbps connection.

Note - The NET MGT port operates only at 100 Mbps or 10 Mbps, so the speed indicator can be green or off (never amber).

1.4 Using the Service Processor Firmware for Diagnosis and Repair Verification

The Sun Integrated Lights Out Manager (ILOM) firmware runs on the service processor in the server and enables you to remotely manage and administer your server.

ILOM enables you to remotely run diagnostics that would otherwise require physical proximity to the server’s serial port. You can also configure ILOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ILOM.

Faults detected by ILOM, POST, and the Solaris Predictive Self Healing (PSH) technology are forwarded to ILOM for fault handling. In the event of a system fault, ILOM ensures that the Fault Indicator is lit, the FRU ID PROMs are updated, the fault is logged, and alerts are displayed (faulty FRUs are identified in fault messages using the FRU name).

The service processor detects when a fault is no longer present and clears the fault in several ways:

- Fault recovery – The system automatically detects that the fault condition is no longer present. ILOM extinguishes the Service Required Indicator and updates the FRU’s PROM, indicating that the fault is no longer present.
- Fault repair – The fault has been repaired by human intervention. In most cases, the service processor detects the repair and extinguishes the Service Required Indicator. If the service processor does not perform these actions, you must perform these tasks manually.

The service processor also detects the removal of a FRU, in many cases even if the FRU is removed while the service processor is powered off (that is, if the system power cables are unplugged during service procedures). This situation enables ILOM to know that a fault, diagnosed to a specific FRU, has been repaired.

Note - ILOM does not automatically detect hard drive replacement.

Many environmental faults can automatically recover. A temperature that is exceeding a threshold might return to normal limits. An unplugged power supply can be plugged in, and so on. Recovery of environmental faults is automatically detected. Recovery events are reported using one of two forms:

fru at location is OK.
sensor at location is within normal range.

Environmental faults can be repaired through hot-removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring, and all faults associated with the removed FRU are cleared. The message for that case, which is also the alert sent for all FRU removals, is:

fru at location has been removed.

There is no ILOM command to manually repair an environmental fault.
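
If you collect ILOM alerts in a text log, the three message forms above can be recognized with regular expressions. The Python sketch below is a minimal illustration. The function name and the category labels are hypothetical, and the patterns assume that real messages follow the templates exactly, with the FRU or sensor name before ` at ` and the location after it.

```python
import re

# One pattern per message form described in this section.
EVENT_PATTERNS = [
    ("recovered", re.compile(r"^(?P<name>.+) at (?P<location>.+) is OK\.$")),
    ("recovered", re.compile(
        r"^(?P<name>.+) at (?P<location>.+) is within normal range\.$")),
    ("removed", re.compile(
        r"^(?P<name>.+) at (?P<location>.+) has been removed\.$")),
]


def classify_event(message):
    """Return (kind, name, location), or None if no template matches."""
    text = message.strip()
    for kind, pattern in EVENT_PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group("name"), match.group("location")
    return None


# The manual's own templates serve as test input here.
print(classify_event("fru at location is OK."))
print(classify_event("fru at location has been removed."))
```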

The Solaris Predictive Self Healing technology does not monitor the hard drive for faults. As a result, the service processor does not recognize hard drive faults, and will not light the fault indicators on either the chassis or the hard drive itself. Use the Solaris message files to view hard drive faults. See Collecting Information From Solaris OS Files and Commands.

1.5 Using the Solaris Predictive Self Healing Feature

The Solaris Predictive Self Healing (PSH) technology enables the server to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. After the problem is diagnosed, the fault manager daemon assigns the problem a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.

The Predictive Self Healing technology covers the following server components:

- Processor
- Memory
- I/O bus

The PSH console message provides the following information:

- Type
- Severity
- Description
- Automated response
- Impact
- Suggested action for system administrator

If the Solaris PSH facility detects a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name.

Note - Additional Predictive Self Healing information is available at: http://www.sun.com/msg

1.5.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris OS console message similar to EXAMPLE 1-1 is displayed.

**EXAMPLE 1-1 Console Message Showing Fault Detected by PSH**
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-X4450, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/SUN4V-8000-DX for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.
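
When such messages are captured in a log, the labeled fields can be pulled into a dictionary for scripting. The Python sketch below is a minimal illustration. The function name is hypothetical, and the parser assumes the layout shown in EXAMPLE 1-1: upper-case `KEY: value` pairs, with several pairs on one line separated by a comma and a space.

```python
import re

# Header lines copied from EXAMPLE 1-1.
SAMPLE = """\
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-X4450, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
"""


def parse_psh_message(text):
    """Return a dict of the KEY: value pairs in a PSH console message."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on ", " only where another upper-case KEY: follows, so the
        # comma inside "SUNW,Sun-Netra-X4450" is left alone.
        for part in re.split(r", (?=[A-Z][A-Z-]*: )", line.strip()):
            key, sep, value = part.partition(": ")
            if sep:
                fields[key] = value.strip()
    return fields


fields = parse_psh_message(SAMPLE)
print(fields["SUNW-MSG-ID"])  # SUN4V-8000-DX
print(fields["EVENT-ID"])     # f92e9fbe-735e-c218-cf87-9e1720a28004
```

The `EVENT-ID` value is the UUID that the `fmdump -v -u` command in the REC-ACTION line expects.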

Faults detected by the Solaris PSH facility are also reported through service processor alerts.

Note - The Service Required Indicator is also turned on for PSH diagnosed faults.

Use the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).

Do not use fmdump to verify a FRU replacement has cleared a fault because the output of fmdump is the same after the FRU has been replaced. Use the fmadm faulty command to verify the fault has cleared.

Check the event log using the fmdump command with -v for verbose output.

The output includes the following details:
- Date and time of the fault (Jul 31 12:47:42.2007)
- Universal Unique Identifier (UUID). This is unique for every fault (fd940ac2-d21e-c94a-f258-f8a9bb69d05b)
- Sun message identifier, which can be used to obtain additional fault information (SUN4V-8000-JA)
- Faulted FRU. The information provided in the example includes the part number of the FRU (part=541215101) and the serial number of the FRU (serial=101083). The Location field provides the name of the FRU. The FRU name is MB, meaning the motherboard.
  
  Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.
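
If you script the lookup, it helps to validate the UUID before passing it to fmdump. The Python sketch below only builds the argument list. The function name is hypothetical, and the `-v` and `-u` options are the ones named in the REC-ACTION line of the console message. The sketch does not run the command or parse its output.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout, as in the fault UUIDs above.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def fmdump_command(event_id):
    """Return the argv list for a verbose fmdump lookup of one fault."""
    if not UUID_RE.match(event_id):
        raise ValueError(f"not a fault UUID: {event_id!r}")
    return ["fmdump", "-v", "-u", event_id]


print(fmdump_command("fd940ac2-d21e-c94a-f258-f8a9bb69d05b"))
```

On the server itself, the returned list could be passed to `subprocess.run(cmd, capture_output=True, text=True)` while logged in as superuser.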

Use the Sun message ID to obtain more information about this type of fault.
1. In a browser, go to the Predictive Self Healing Knowledge Article web site: http://www.sun.com/msg
2. Obtain the message ID from the console output.
3. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

Follow the suggested actions to repair the fault.

Clear PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. In most cases, after the fault is repaired, the system detects the corrected state and clears the fault condition automatically. However, you must verify this and, if the fault condition is not cleared automatically, clear the fault manually.

After replacing a faulty FRU, power on the server.

Clear the fault from all persistent fault records.

In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the following Solaris OS command:

fmadm repair UUID

For example:
# fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
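
The repair-then-verify sequence can be scripted. The Python sketch below is a hedged illustration: the function names are hypothetical, and the check makes no assumption about the layout of `fmadm faulty` output beyond the UUID appearing in it while the fault is still active.

```python
def repair_commands(uuid):
    """Return the argv lists for clearing a fault and then verifying it."""
    return [["fmadm", "repair", uuid], ["fmadm", "faulty"]]


def fault_cleared(fmadm_faulty_output, uuid):
    """Return True if the UUID no longer appears in `fmadm faulty` output."""
    return uuid not in fmadm_faulty_output


uuid = "7ee0e46b-ea64-6565-e684-e996963f7b86"
for command in repair_commands(uuid):
    print(" ".join(command))

# Placeholder strings stand in for captured command output.
print(fault_cleared("", uuid))                 # True
print(fault_cleared(f"... {uuid} ...", uuid))  # False
```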

1.6 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.

If the service processor or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for notifications for faults. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system messages. To view the system messages log file, view the contents of the /var/adm/messages file.

Check the Message Buffer

Log in as superuser.

Type the dmesg command:
# dmesg
The dmesg command displays the most recent messages generated by the system.

View System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

Log in as superuser.

Type:
# more /var/adm/messages

If you want to view all logged messages, type:
# more /var/adm/messages*
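
When reviewing all of the rotated files, it can help to read them from newest to oldest. The Python sketch below orders the files by their rotation suffix. The function names are hypothetical, and the sketch assumes the naming scheme described above (messages, messages.1, messages.2, and so on).

```python
import glob
import os


def rotation_index(path):
    """Return 0 for 'messages' and N for 'messages.N'."""
    base = os.path.basename(path)
    _, _, suffix = base.partition(".")
    return int(suffix) if suffix.isdigit() else 0


def message_files(directory="/var/adm"):
    """Return the message files in the directory, newest first."""
    paths = glob.glob(os.path.join(directory, "messages*"))
    return sorted(paths, key=rotation_index)


# Ordering check on bare file names; no files are read here.
names = ["messages.2", "messages", "messages.1"]
print(sorted(names, key=rotation_index))
```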

1.7 Additional Service Related Information

In addition to this service manual, the following resources are available to help you keep your server running optimally:

- Server Product Notes – Contain late-breaking information about the system including required software patches, updated hardware and compatibility information, and solutions to known issues. The product notes are available online at: http://www.oracle.com/technetwork/indexes/documentation/index.html
- Solaris Release Notes – Contain important information about the Solaris OS. The release notes are available online at: http://www.oracle.com/technetwork/indexes/documentation/index.html
- SunSolve Online – Provides a collection of support resources. Depending on the level of your service contract, you have access to Sun patches, the Sun System Handbook, the SunSolve knowledge base, the Sun Support Forum, and additional documents, bulletins, and related links. Access this site at: http://sunsolve.sun.com
- Predictive Self Healing Knowledge Database – Provides access to the knowledge article corresponding to a self-healing message by taking the Sun Message Identifier (SUNW-MSG-ID) and entering it into the field on this page: http://www.sun.com/msg