CHAPTER 1

Server Diagnostics

This chapter describes the diagnostics that are available for monitoring and troubleshooting the server.

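The following topics are covered:

  • Fault on Initial Power Up

  • Server Diagnostics Overview

  • Using the Status Indicators to Identify the State of Devices

  • Using the Service Processor Firmware for Diagnosis and Repair Verification

  • Using the Solaris Predictive Self Healing Feature

  • Collecting Information From Solaris OS Files and Commands

  • Additional Service Related Information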


1.1 Fault on Initial Power Up

If you see errors on the initial power-up after installing the server that indicate faults with the Fully Buffered DIMMs (FB-DIMMs), PCI cards, or other components, the suspect component might have come loose during shipment.

Conduct a visual inspection of the server internals and its components. Remove the top cover and physically reseat the cable connections, the PCI cards, and the FB-DIMMs.


1.2 Server Diagnostics Overview

You can use a variety of diagnostic tools, commands, and indicators to monitor and troubleshoot a server:

  • LEDs

  • ILOM firmware

  • Solaris Predictive Self Healing (PSH)

  • Log files and console messages

The LEDs, ILOM, Solaris PSH, and many of the log files and console messages are integrated. For example, when the Solaris software detects a fault, it displays the fault, logs it, and passes the information to ILOM, where it is also logged. Depending on the fault, one or more LEDs might light.


TABLE 1-1   Diagnostic Actions
Action No. | Diagnostic Action | Resulting Action | Additional Information
1 | Check Power OK and Input OK LEDs on the server. | The Power OK LED is located on the front and rear of the chassis. The Input OK LED is located on the rear of the server on each power supply. If these LEDs are not on, check the power source and power connections to the server. | Using the Status Indicators to Identify the State of Devices
2 | Check the Solaris log files for fault information. | The Solaris message buffer and log files record system events and provide information about faults. If system messages indicate a faulty device, replace the FRU. | Collecting Information From Solaris OS Files and Commands
3 | Determine if the fault is an environmental fault. | Environmental faults can be caused by faulty FRUs (power supply, fan, or blower), or by environmental conditions such as when computer room ambient temperature is too high, or the server airflow is blocked. When the environmental condition is corrected, the fault will automatically clear. | Using the Service Processor Firmware for Diagnosis and Repair Verification
  |   | If the fault indicates that a fan, blower, or power supply is bad, you can perform a hot-swap of the FRU. You can also use the fault indicators on the server to identify the faulty FRU (fans, blower, and power supplies). | Using the Status Indicators to Identify the State of Devices
4 | Determine if the fault was detected by PSH. | If the fault message displays the following text, the fault was detected by the Solaris Predictive Self Healing software: Host detected fault | Using the Solaris Predictive Self Healing Feature
  |   | If the fault is a PSH detected fault, identify the faulty FRU from the fault message and replace the faulty FRU. After replacing the FRU, perform the procedure to clear PSH detected faults. | Clear PSH Detected Faults
5 | Contact Sun for Support. | The majority of hardware faults are detected by the server’s diagnostics. In rare cases, a problem might require additional troubleshooting. If you are unable to determine the cause of the problem, contact Sun for support. | Sun Support information: http://www.sun.com/support
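
Action 4 in TABLE 1-1 amounts to a text match on the fault message. The following is a minimal sketch of that check, not a Sun-supplied tool. The function name is_psh_fault is hypothetical, and the only string it relies on is the Host detected fault text quoted in the table. It assumes a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solaris or ILOM).
# is_psh_fault MESSAGE: succeed if MESSAGE contains the text
# "Host detected fault", which TABLE 1-1 gives as the sign that the
# fault was detected by Solaris Predictive Self Healing.
is_psh_fault() {
    case "$1" in
        *"Host detected fault"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use: pass the fault message text as a single argument.
#   if is_psh_fault "$message"; then
#       echo "PSH fault: identify and replace the FRU, then clear the fault"
#   else
#       echo "Not a PSH fault"
#   fi
```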

1.2.1 Memory Fault Handling

The server uses an advanced ECC technology, called chipkill, that corrects up to 4 bits in error on nibble boundaries, as long as all of the bits are in the same DRAM. If a DRAM fails, the FB-DIMM continues to function.
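
The nibble-boundary condition can be illustrated with a short check. This is only a sketch of the rule as stated above, under two assumptions that this chapter does not confirm: bit positions are numbered from 0 within one ECC word, and each aligned group of 4 bits (bits 0-3, 4-7, and so on) comes from one DRAM. The function name same_nibble is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of the "same nibble" rule (assumed bit numbering).
# same_nibble BIT...: succeed if every bit position given falls inside the
# same aligned 4-bit nibble, that is, if all positions share the same
# integer quotient when divided by 4.
same_nibble() {
    [ $# -gt 0 ] || return 1
    first=$(( $1 / 4 ))
    for bit in "$@"; do
        [ $(( bit / 4 )) -eq "$first" ] || return 1
    done
    return 0
}
```

For example, same_nibble 8 9 10 11 succeeds because all four bits fall in the same nibble, while same_nibble 7 8 fails because the two bits straddle a nibble boundary.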

The Predictive Self Healing (PSH) technology in the Solaris OS uses the fault manager daemon (fmd) to watch for various kinds of faults. When a fault occurs, the fault is assigned a unique fault ID (UUID), and logged. PSH reports the fault and recommends proactive replacement of the FB-DIMMs associated with the fault.

If you suspect that the server has a memory problem, see Replacing FB-DIMMs for FB-DIMM replacement instructions. You must follow the instructions in that chapter to clear the faults and enable the replaced FB-DIMMs.


1.3 Using the Status Indicators to Identify the State of Devices

The server provides status indicators in the upper left corner of the front panel (FIGURE 1-1) and on the rear panel (FIGURE 1-2). These indicators provide a visual means of determining the state of the system or individual components.

FIGURE 1-1   Front Panel Status and Alarm Status Indicators

Figure showing the location of the server and alarm status indicators on the front bezel.

Figure Legend

  1   White Locator Indicator and Button

  2   Yellow Service Required Indicator

  3   Green Running Indicator

  4   On/Standby Button

  5   Red Critical Alarm Indicator

  6   Red Major Alarm Indicator

  7   Amber Minor Alarm Indicator

  8   Amber User Alarm Indicator


FIGURE 1-2   Rear Panel Status Indicators

Figure showing the rear panel LEDs.

Figure Legend

  1   Alarm port

  2   White Locator Indicator and Button

  3   Yellow Service Required Indicator

  4   Green Running Indicator

  5   Management network port indicators

  6   Ethernet port indicators

  7   Power supply indicators

  8   Video port


TABLE 1-2 lists and describes the front and rear panel indicators.


TABLE 1-2   Front and Rear Panel Indicators
LED | Location | Color | Description
Locator Indicator and Button | Front upper left and rear left | White | Enables you to identify a particular server. The LED is activated using one of the following methods:
  • Issuing the setlocator on or off command.

  • Pressing the button to toggle the indicator on or off.

This LED provides the following indications:

  • Off – Normal operating state.

  • Fast blink – The server received a signal as a result of one of the preceding methods.

Fault Indicator | Front upper center and rear center | Yellow | If on, indicates that service is required.
Activity Indicator | Front upper right and rear right | Green |
  • On – Drives are receiving power. Solidly on if drive is idle.

  • Flashing – Drives are processing a command.

  • Off – Power is off.

Power Button | Front upper right |   | Turns the host system on and off. This button is recessed to prevent accidental server power-off. Use the tip of a pen to operate this button. Press this button once for a graceful shutdown. Press this button for 4 seconds for an emergency shutdown.
Power OK Indicator | Rear center | Green | Provides the following indications:
  • Off – The system is unavailable. Either the system has no power or ILOM is not running.

  • Steady on – Indicates that the system is powered on and is running in its normal operating state.

  • Standby blink – Indicates that the service processor is running while the system is running at a minimum level in Standby mode, and is ready to be returned to its normal operating state.

  • Slow blink – Indicates that a normal transitory activity is taking place. The system diagnostics might be running, or the system might be booting.

Critical Alarm Indicator | Front left | Red | Indicates a critical alarm.
Major Alarm Indicator | Front left | Red | Indicates a major alarm.
Minor Alarm Indicator | Front left | Amber | Indicates a minor alarm.
User Alarm Indicator | Front left | Amber | Indicates a user alarm.

1.3.1 Hard Drive Indicators

The hard drive indicators (FIGURE 1-3 and TABLE 1-3) are located on the front of each hard drive that is installed in the server chassis.

FIGURE 1-3   Hard Drive Indicators

Figure showing the hard drive indicators.

Figure Legend

  1   OK to Remove

  2   Fault

  3   Activity



TABLE 1-3   Hard Drive Indicators
Indicator | Color | Description
OK to Remove | Blue |
  • On – The drive is ready for hot-plug removal.

  • Off – Normal operation.

Fault | Amber |
  • On – The drive has a fault and requires attention.

  • Off – Normal operation.

Activity | Green |
  • On – The drive is receiving power. Solidly lit if drive is idle.

  • Flashing – The drive is processing a command.

  • Off – Power is off.


1.3.2 Power Supply Indicators

The power supply indicators (FIGURE 1-4 and TABLE 1-4) are located on the rear of each power supply.

FIGURE 1-4   Power Supply Indicators

Figure showing the power supply indicator.

Figure Legend

  1   Power OK indicator

  2   Fault indicator

  3   Input OK indicator



TABLE 1-4   Power Supply Indicators
Indicator | Color | Description
Power OK | Green |
  • On – Normal operation. DC output voltage is within normal limits.

  • Off – Power is off.

Fault | Amber |
  • On – Power supply has detected a failure.

  • Off – Normal operation.

Input OK | Green |
  • On – Normal operation. Input power is within normal limits.

  • Off – No input voltage, or input voltage is below limits.


1.3.3 Ethernet Port Indicators

The ILOM management Ethernet port and the four 10/100/1000 Mbps Ethernet ports each have two indicators, as shown in FIGURE 1-5 and described in TABLE 1-5.

FIGURE 1-5   Ethernet Port Indicators

Figure showing the Ethernet indicators.

Figure Legend

  1   Speed indicator (same location for all Ethernet ports)

  2   Link/Activity indicator (same location for all Ethernet ports)



TABLE 1-5   Ethernet Port Indicators
Indicator | Color | Description
Right indicator | Green | Link/Activity indicator:
  • Steady On – A link is established.

  • Blinking – There is activity on this port.

  • Off – No link is established.

Left indicator | Amber or Green | Speed indicator:
  • Amber On – The link is operating as a Gigabit connection (1000-Mbps).

  • Green On – The link is operating as a 100-Mbps connection.

  • Off – The link is operating as a 10-Mbps connection.




Note - The NET MGT port operates only at 100 Mbps or 10 Mbps, so the speed indicator can be green or off (never amber).




1.4 Using the Service Processor Firmware for Diagnosis and Repair Verification

The Sun Integrated Lights Out Manager (ILOM) firmware runs on the service processor in the server and enables you to remotely manage and administer your server.

ILOM enables you to remotely run diagnostics that would otherwise require physical proximity to the server’s serial port. You can also configure ILOM to send email alerts of hardware failures, hardware warnings, and other events related to the server or to ILOM.

Faults detected by ILOM, POST, and the Solaris Predictive Self Healing (PSH) technology are forwarded to ILOM for fault handling. In the event of a system fault, ILOM ensures that the Fault Indicator is lit, the FRU ID PROMs are updated, the fault is logged, and alerts are displayed (faulty FRUs are identified in fault messages using the FRU name).

The service processor detects when a fault is no longer present and clears the fault in several ways:

The service processor also detects the removal of a FRU, in many cases even if the FRU is removed while the service processor is powered off (that is, if the system power cables are unplugged during service procedures). This detection enables ILOM to determine that a fault diagnosed to a specific FRU has been repaired.



Note - ILOM does not automatically detect hard drive replacement.



Many environmental faults can recover automatically. For example, a temperature that exceeds a threshold might return to normal limits, or an unplugged power supply might be plugged back in. Recovery of environmental faults is detected automatically. Recovery events are reported using one of two forms:

Environmental faults can be repaired through hot-removal of the faulty FRU. FRU removal is automatically detected by the environmental monitoring, and all faults associated with the removed FRU are cleared. The message for that case, and the alert sent for all FRU removals, is:

fru at location has been removed.

There is no ILOM command to manually repair an environmental fault.

The Solaris Predictive Self Healing technology does not monitor the hard drive for faults. As a result, the service processor does not recognize hard drive faults, and will not light the fault indicators on either the chassis or the hard drive itself. Use the Solaris message files to view hard drive faults. See Collecting Information From Solaris OS Files and Commands.


1.5 Using the Solaris Predictive Self Healing Feature

The Solaris Predictive Self Healing (PSH) technology enables the server to diagnose problems while the Solaris OS is running, and mitigate many problems before they negatively affect operations.

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem. After diagnosing the problem, the fault manager daemon assigns it a Universal Unique Identifier (UUID) that distinguishes the problem across any set of systems. When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database.

The Predictive Self Healing technology covers the following server components:

The PSH console message provides the following information:

If the Solaris PSH facility detects a faulty component, use the fmdump command to identify the fault. Faulty FRUs are identified in fault messages using the FRU name.



Note - Additional Predictive Self Healing information is available at: http://www.sun.com/msg



1.5.1 Identifying PSH Detected Faults

When a PSH fault is detected, a Solaris OS console message similar to EXAMPLE 1-1 is displayed.


EXAMPLE 1-1   Console Message Showing Fault Detected by PSH
SUNW-MSG-ID: SUN4V-8000-DX, TYPE: Fault, VER: 1, SEVERITY: Minor
EVENT-TIME: Wed Sep 14 10:09:46 EDT 2005
PLATFORM: SUNW,Sun-Netra-X4450, CSN: -, HOSTNAME: wgs48-37
SOURCE: cpumem-diagnosis, REV: 1.5
EVENT-ID: f92e9fbe-735e-c218-cf87-9e1720a28004
DESC: The number of errors associated with this memory module has exceeded acceptable levels.  Refer to http://sun.com/msg/SUN4V-8000-DX for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module.  Use fmdump -v -u <EVENT_ID> to identify the module.

Faults detected by the Solaris PSH facility are also reported through service processor alerts.
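
The two values you need most often from a console message such as EXAMPLE 1-1 are the message ID and the event ID. The following sketch extracts them with sed from a message supplied on standard input. The function names psh_msgid and psh_event_id are hypothetical, and the patterns assume the field layout shown in EXAMPLE 1-1.

```shell
#!/bin/sh
# Hypothetical helpers; the patterns assume the layout of EXAMPLE 1-1.

# psh_msgid: print the value of the SUNW-MSG-ID field (the text between
# "SUNW-MSG-ID: " and the first comma) from a message on standard input.
psh_msgid() {
    sed -n 's/^SUNW-MSG-ID: *\([^,]*\),.*$/\1/p'
}

# psh_event_id: print the value of the EVENT-ID field (the UUID).
psh_event_id() {
    sed -n 's/^EVENT-ID: *\([0-9a-fA-F-]*\).*$/\1/p'
}

# Example use with a console message saved to a file (file name is an
# assumption):
#   psh_msgid    < psh_message.txt
#   psh_event_id < psh_message.txt
```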



Note - The Service Required Indicator is also turned on for PSH diagnosed faults.



Use the fmdump Command to Identify Faults

The fmdump command displays the list of faults detected by the Solaris PSH facility and identifies the faulty FRU for a particular EVENT_ID (UUID).

Do not use fmdump to verify that a FRU replacement has cleared a fault, because the output of fmdump is the same after the FRU has been replaced. Use the fmadm faulty command to verify that the fault has cleared.

  1. Check the event log using the fmdump command with -v for verbose output.

    The output includes the following details:

    • Date and time of the fault (Jul 31 12:47:42.2007)

    • Universal Unique Identifier (UUID). This is unique for every fault (fd940ac2-d21e-c94a-f258-f8a9bb69d05b)

    • Sun message identifier, which can be used to obtain additional fault information (SUN4V-8000-JA)

    • Faulted FRU. The information provided in the example includes the part number of the FRU (part=541215101) and the serial number of the FRU (serial=101083). The Location field provides the name of the FRU. The FRU name is MB, meaning the motherboard.



      Note - fmdump displays the PSH event log. Entries remain in the log after the fault has been repaired.



  2. Use the Sun message ID to obtain more information about this type of fault.

    1. In a browser, go to the Predictive Self Healing Knowledge Article web site: http://www.sun.com/msg

    2. Obtain the message ID from the console output.

    3. Enter the message ID in the SUNW-MSG-ID field, and click Lookup.

  3. Follow the suggested actions to repair the fault.
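
Steps 1 and 2 of the procedure above can be sketched as shell helpers. This is an illustration under stated assumptions, not a supported script. The function names is_uuid, show_fault, and msg_url are hypothetical. The fmdump -v -u invocation is the one named in the REC-ACTION line of EXAMPLE 1-1, and the address pattern follows the link in the DESC line of the same example.

```shell
#!/bin/sh
# Hypothetical helpers for the fmdump procedure.

# is_uuid STRING: succeed if STRING has the 8-4-4-4-12 hexadecimal layout
# of the UUID values shown in this chapter. The leading "x" keeps expr
# from misreading unusual input as an operator.
is_uuid() {
    expr "x$1" : 'x[0-9a-fA-F]\{8\}-[0-9a-fA-F]\{4\}-[0-9a-fA-F]\{4\}-[0-9a-fA-F]\{4\}-[0-9a-fA-F]\{12\}$' >/dev/null
}

# show_fault UUID: run the verbose fmdump query for one fault, but only
# when the argument is a well-formed UUID.
show_fault() {
    if ! is_uuid "$1"; then
        echo "not a UUID: $1" >&2
        return 2
    fi
    fmdump -v -u "$1"
}

# msg_url MSGID: print the knowledge article address for a message ID,
# following the pattern http://sun.com/msg/SUN4V-8000-DX in EXAMPLE 1-1.
msg_url() {
    printf 'http://sun.com/msg/%s\n' "$1"
}
```

Checking the UUID first avoids passing a mistyped value to fmdump. Run show_fault as superuser on the server.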

Clear PSH Detected Faults

When the Solaris PSH facility detects faults, the faults are logged and displayed on the console. In most cases, after the fault is repaired, the system detects the corrected state and clears the fault condition automatically. However, you must verify this and, if the fault condition is not cleared automatically, clear the fault manually.

  1. After replacing a faulty FRU, power on the server.

  2. Clear the fault from all persistent fault records.

    In some cases, even though the fault is cleared, some persistent fault information remains and results in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the following Solaris OS command:

    fmadm repair UUID

    For example:


    # fmadm repair 7ee0e46b-ea64-6565-e684-e996963f7b86
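
The repair and the follow-up check with fmadm faulty can be combined in one helper. This is a minimal sketch, assuming that the fmadm faulty listing includes the UUID of every fault that is still outstanding. The function names fault_listed and repair_and_verify are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; run as superuser on the server.

# fault_listed UUID: succeed if UUID appears anywhere in the text read
# from standard input.
fault_listed() {
    grep "$1" >/dev/null
}

# repair_and_verify UUID: clear the fault with fmadm repair, then check
# that the UUID is absent from the fmadm faulty listing.
repair_and_verify() {
    fmadm repair "$1" || return 1
    if fmadm faulty | fault_listed "$1"; then
        echo "fault $1 is still listed by fmadm faulty" >&2
        return 1
    fi
    echo "fault $1 is no longer listed"
}

# Example use:
#   repair_and_verify 7ee0e46b-ea64-6565-e684-e996963f7b86
```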
    


1.6 Collecting Information From Solaris OS Files and Commands

With the Solaris OS running on the server, you have the full complement of Solaris OS files and commands available for collecting information and for troubleshooting.

If the service processor or the Solaris PSH features do not indicate the source of a fault, check the message buffer and log files for fault notifications. Hard drive faults are usually captured by the Solaris message files.

Use the dmesg command to view the most recent system messages. To view the system messages log, view the contents of the /var/adm/messages file.

Check the Message Buffer

  1. Log in as superuser.

  2. Type the dmesg command:


    # dmesg
    

    The dmesg command displays the most recent messages generated by the system.
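
The dmesg output can be long, so it often helps to filter it. The following sketch keeps only lines that mention a warning, an error, or a fault. The function name fault_lines and the choice of keywords are assumptions, and a line that describes a problem without any of these words is not shown. It assumes a POSIX grep.

```shell
#!/bin/sh
# Hypothetical filter; the keyword list is an assumption.
# fault_lines: copy to standard output only the input lines that contain
# "warning", "error", or "fault" in any letter case.
fault_lines() {
    grep -i -e warning -e error -e fault
}

# Example use (as superuser):
#   dmesg | fault_lines
```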

View System Message Log Files

The error logging daemon, syslogd, automatically records various system warnings, errors, and faults in message files. These messages alert you to system problems such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file. After a period of time (usually every ten days), a new messages file is automatically created. The original contents of the messages file are rotated to a file named messages.1. Over a period of time, the messages are further rotated to messages.2 and messages.3, and then deleted.

  1. Log in as superuser.

  2. Type:


    # more /var/adm/messages
    

  3. If you want to view all logged messages, type:


    # more /var/adm/messages*
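
The shell expands messages* in name order, which puts the current messages file before the older rotated files. To read the log in chronological order instead, list the files from the highest suffix down. The following is a minimal sketch based on the rotation scheme described above (messages.3 is the oldest, messages is the newest). The function name messages_oldest_first is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper based on the rotation scheme described above.
# messages_oldest_first [DIR]: print the message files that exist in DIR
# (default /var/adm), one per line, from oldest to newest.
messages_oldest_first() {
    dir=${1:-/var/adm}
    for f in messages.3 messages.2 messages.1 messages; do
        if [ -f "$dir/$f" ]; then
            printf '%s\n' "$dir/$f"
        fi
    done
}

# Example use (as superuser): page through the full history in order.
#   cat $(messages_oldest_first) | more
```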
    


1.7 Additional Service Related Information

In addition to this service manual, the following resources are available to help you keep your server running optimally: