Initial Inspection of the Server

CHAPTER 1

This chapter includes the following topics:

Service Visit Troubleshooting Flowchart

Gathering Service Visit Information

Troubleshooting Power Problems

Externally Inspecting the Server

Internally Inspecting the Server

Troubleshooting DIMM Problems

Service Visit Troubleshooting Flowchart

Use the following flowchart as a guideline for using the subjects in this book to troubleshoot the server.

FIGURE 1-1 Troubleshooting Flowchart

Graphic showing suggested steps for troubleshooting problems during a service visit, using the sections in this book.

Gathering Service Visit Information

The first step in determining the cause of a problem with the server is to gather as much information as you can from the service-call paperwork or the onsite personnel. Use the following general guidelines when you begin troubleshooting.

To gather service visit information:

1. Collect information about the following items:

Events that occurred prior to the failure

Whether any hardware or software was modified or installed

Whether the server was recently installed or moved

How long the server exhibited symptoms

The duration or frequency of the problem

2. Document the server settings before you make any changes.

If possible, make one change at a time in order to isolate potential problems. In this way, you can maintain a controlled environment and reduce the scope of troubleshooting.

3. Note the results of any change that you make.

Include any errors or informational messages.

4. Check for potential device conflicts before you add a new device.

5. Check for version dependencies, especially with third-party software.

Troubleshooting Power Problems

If the server will not power on:

1. Check that AC power cords are attached firmly to the server’s power supplies and to the AC sources.

Using the cable clamps helps keep the AC power cords attached to the server’s power supplies.

2. Check that the component covers are firmly in place, including the hard disk drive access cover, system controller cover, and fan access cover.

An intrusion switch on the system controller shuts the server down when the hard disk drive access cover is removed.

3. Investigate the following conditions that can trigger an automatic shutdown sequence:

A power-off sequence is initiated either by a request from the board management controller (BMC) or a fault condition.

The conditions that trigger the BMC to issue a shutdown request are:

An over-temperature condition for more than 1 second.

Multiple fan failures.

The fault conditions that trigger a shutdown are:

All power supplies have failed or have been removed.

A power supply has been out of spec for more than 100 ms.

The hot-swap circuit has faulted.

An over-temperature condition has occurred.

Note - Any power supply that is out of spec causes a reset, but only power supplies that remain out of spec for more than 100 ms cause a shutdown.
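The shutdown conditions above can be summarized as a small decision function. The following Python sketch is illustrative only and is not part of the server firmware. The class name, field names, and the default of two power supplies are assumptions made for the example, and the over-temperature fault condition is modeled as a separate flag because the text does not give it a duration.

```python
from dataclasses import dataclass

OVER_TEMP_LIMIT_S = 1.0          # BMC acts after more than 1 second
PSU_OUT_OF_SPEC_LIMIT_MS = 100   # longer than this forces a shutdown


@dataclass
class PowerStatus:
    """Hypothetical snapshot of the conditions listed above."""
    over_temp_seconds: float = 0.0      # time the BMC has seen over-temperature
    failed_fans: int = 0
    psus_installed: int = 2             # assumed count, adjust for your system
    psus_failed: int = 0
    psu_out_of_spec_ms: float = 0.0     # longest out-of-spec interval observed
    hot_swap_fault: bool = False
    hardware_over_temp: bool = False    # over-temperature fault condition


def expected_action(s: PowerStatus) -> str:
    """Return 'shutdown', 'reset', or 'none' for a status snapshot."""
    # Shutdown requested by the BMC.
    bmc_request = (
        s.over_temp_seconds > OVER_TEMP_LIMIT_S
        or s.failed_fans >= 2           # "multiple" read as two or more
    )
    # Fault conditions that shut the server down directly.
    no_working_psu = s.psus_failed >= s.psus_installed
    fault = (
        no_working_psu
        or s.psu_out_of_spec_ms > PSU_OUT_OF_SPEC_LIMIT_MS
        or s.hot_swap_fault
        or s.hardware_over_temp
    )
    if bmc_request or fault:
        return "shutdown"
    # A shorter out-of-spec interval causes a reset only.
    if s.psu_out_of_spec_ms > 0:
        return "reset"
    return "none"
```

For example, expected_action(PowerStatus(psu_out_of_spec_ms=50)) returns 'reset', while the same call with 150 returns 'shutdown'.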

Externally Inspecting the Server

To perform a visual inspection of the external system:

1. Inspect the external status indicator LEDs, which can indicate component malfunction.

For the LED locations and descriptions of their behavior, see Front Panel Features.

2. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.

3. If the problem is not evident, continue with the next section, Internally Inspecting the Server.

Internally Inspecting the Server

To perform a visual inspection of the internal system:

1. Choose a method for shutting down the server from main power mode to standby power mode.

Graceful shutdown - Use a non-conducting ballpoint pen or stylus to press and release the Power button on the front panel. This causes Advanced Configuration and Power Interface (ACPI) enabled operating systems to perform an orderly shutdown of the operating system. Servers not running ACPI-enabled operating systems will shut down to standby power mode immediately.

Emergency shutdown - Use a ballpoint pen or stylus to press and hold the Power button for four seconds to force main power off and enter standby power mode.

When main power is off, the Power/OK LED on the front panel blinks once every three seconds, indicating that the server is in standby power mode. See FIGURE 1-2.

Caution - When you use the Power button to enter standby power mode, power is still directed to the graphics-redirect and service processor (GRASP) board and the power supply fans, as indicated by the blinking Power/OK LED. To completely power off the server, disconnect the AC power cords from the back panel of the server.

FIGURE 1-2 Sun Fire X4500 Server Front Panel

Graphic showing the Sun Fire X4500 server front panel with the power button and Power/OK LED shown on the upper-left.

Figure Legend
1	Locate button
2	Power/OK LED
3	USB ports (2)

2. Remove the component covers, including hard disk drive cover, system controller cover, and fan cover, as required.

For instructions on removing the component covers, refer to the Sun Fire X4500 Server Service Manual, 819-4359.

3. Inspect the internal status indicator LEDs, which can indicate component malfunction.

For the LED locations and descriptions of their behavior, see Internal Status Indicator LEDs.

Note - You can hold down the Locate button on the server back panel or front panel for 5 seconds to initiate a “push-to-test” mode that illuminates all other LEDs both inside and outside of the chassis for 15 seconds.

4. Verify that there are no loose or improperly seated components.

5. Verify that all cables inside the system are firmly and correctly attached to their appropriate connectors.

6. Verify that any after-factory components are qualified and supported.

For a list of supported PCI cards and DIMMs, refer to the Sun Fire X4500 Server Service Manual, 819-4359.

7. Check that the installed DIMMs comply with the supported DIMM population rules and configurations, as described in Troubleshooting DIMM Problems.

8. Replace the component covers.

9. To restore main power mode to the server (all components powered on), use a ballpoint pen or stylus to press and release the Power button on the server front panel. See FIGURE 1-2.

When main power is applied to the full server, the Power/OK LED next to the Power button lights and remains lit.

10. If the problem with the server is not evident, you can try viewing the power-on self test (POST) messages and BIOS event logs during system startup. Continue with Viewing Event Logs.

Troubleshooting DIMM Problems

Use this section to troubleshoot problems with memory modules, or DIMMs.

Note - For information on Sun’s DIMM replacement policy for x64 servers, contact your Sun Service representative.

How DIMM Errors Are Handled By the System

This section describes system behavior for the two types of DIMM errors: uncorrectable errors (UCEs) and correctable errors (CEs); it also describes BIOS DIMM error messages.

Uncorrectable DIMM Errors

UCE behavior is the same for all operating systems:

1. When a UCE occurs, the memory controller causes an immediate reboot of the system.

2. During reboot, the BIOS checks the NorthBridge memory controller’s Machine Check registers and determines that the previous reboot was due to a UCE, then reports this message in POST after the memtest stage:

A Hypertransport Sync Flood occurred on last boot

3. The event is reported in the service processor’s system event log (SEL), as shown in the sample IPMItool output below:

# ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list

f000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
f100 | OEM record e0 | 00000000040f0c0200200000a2
f200 | OEM record e0 | 01000000040000000000000000
f300 | 02/16/2006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
f400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1 DIMM 0
f500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard initialization
f600 | 02/16/2006 | 03:32:55 | System Firmware Progress | Video initialization
f700 | 02/16/2006 | 03:33:01 | System Firmware Progress | USB resource configuration
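When the SEL is long, it can help to pull out only the uncorrectable ECC entries. The following Python sketch parses text in the layout of the sample output above and returns the CPU and DIMM numbers it names. The function name is hypothetical, and the column layout (record ID, date, time, sensor, event, detail) is inferred from this one sample, so other firmware or IPMItool versions might format the list differently.

```python
import re

# Lines copied from the sample output above.
SAMPLE = """\
f000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
f100 | OEM record e0 | 00000000040f0c0200200000a2
f300 | 02/16/2006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
f400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1 DIMM 0
f500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard initialization
"""

DIMM_RE = re.compile(r"CPU\s+(\d+)\s+DIMM\s+(\d+)")


def uncorrectable_dimms(sel_text):
    """Return (cpu, dimm) pairs named in 'Uncorrectable ECC' memory events."""
    found = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Only timestamped records with a detail column are of interest:
        # id, date, time, sensor, event, detail
        if len(fields) < 6:
            continue
        sensor, event, detail = fields[3], fields[4], fields[5]
        if sensor != "Memory" or event != "Uncorrectable ECC":
            continue
        match = DIMM_RE.search(detail)
        if match:
            pair = (int(match.group(1)), int(match.group(2)))
            if pair not in found:       # report each DIMM once
                found.append(pair)
    return found


print(uncorrectable_dimms(SAMPLE))      # prints [(1, 0)]
```

The OEM records and the "Memory Device Disabled" entry are skipped, so only CPU 1, DIMM 0 is reported for the sample.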

Correctable DIMM Errors

At this time, CEs are not logged in the server’s system event logs.

Note - When running Solaris 10, the Fault Management Architecture (FMA) manages memory CEs by providing fault monitoring and diagnosis.

BIOS DIMM Error Messages

The BIOS displays and logs three types of DIMM error messages:

NODE-n Memory Configuration Mismatch

The following conditions cause this error message:

DIMM mode is not paired (running in 64-bit mode instead of 128-bit mode).

DIMM speeds are not the same.

DIMMs do not support ECC.

DIMMs are not registered.

MCT stopped due to errors in the DIMM.

DIMM module type (buffer) is mismatched.

DIMM generation (I or II) is mismatched.

DIMM CL/T is mismatched.

Banks on a two-sided DIMM are mismatched.

DIMM organization is mismatched (128-bit).

SPD is missing Trc or Trfc information.

NODE-n Paired DIMMs Mismatch

The following condition causes this error message:

DIMM pairs are not the same, or the checksum is mismatched.

NODE-n DIMMs Manufacturer Mismatch

The following condition causes this error message:

The DIMM manufacturer is not supported.

Only Samsung, Micron, Infineon, and SMART DIMMs are supported.
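If you have captured BIOS output as text, a short script can pick out the three DIMM messages described above. The following Python sketch is one way to do that. It assumes that the BIOS replaces the "n" in NODE-n with a decimal node number, and the function name and hint strings are made up for the example.

```python
import re

# Assumption: the BIOS replaces "n" in NODE-n with a decimal node number.
MESSAGE_RE = re.compile(
    r"NODE-(\d+) (Memory Configuration Mismatch"
    r"|Paired DIMMs Mismatch"
    r"|DIMMs Manufacturer Mismatch)"
)

# Short reminders of what to check for each message type.
HINTS = {
    "Memory Configuration Mismatch":
        "check pairing, speed, ECC, registration, and SPD data",
    "Paired DIMMs Mismatch":
        "check that both DIMMs of each pair are the same",
    "DIMMs Manufacturer Mismatch":
        "check that the DIMM manufacturer is supported",
}


def dimm_messages(log_text):
    """Return (node, message, hint) for each BIOS DIMM message found."""
    return [
        (int(m.group(1)), m.group(2), HINTS[m.group(2)])
        for m in MESSAGE_RE.finditer(log_text)
    ]
```

Text that contains none of the three messages yields an empty list.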

DIMM Fault LEDs

In the Sun Fire X4500 server, there are eight DIMM slots on the CPU board. The server has internal status LEDs for the CPU board. DIMM and CPU fault LEDs on the CPU board provide further indications of which component has a fault condition.

These CPU and DIMM fault LEDs can be lit for up to one minute by a capacitor on the CPU board, even after the CPU board is removed from the server. To light the fault LEDs from the capacitor, push the small button on the CPU board labeled, “Press to see fault.”

See FIGURE 1-3 for the LED and button locations.

The DIMM ejector levers contain LEDs that can indicate a faulty DIMM:

DIMM fault LED is off: The DIMM is operating properly.

DIMM fault LED is on (amber): The DIMM is faulty and should be replaced.

The CPU fault LED can indicate a faulty CPU (on CPU 0 or CPU 1):

CPU fault LED is off: The CPU is operating properly.

CPU fault LED is on (amber): The CPU is faulty and should be replaced.

Battery Fault LED is on (amber): The battery is faulty and should be replaced.

Note - The CPU fault and DIMM fault LEDs continue to indicate a failure until the system is powered up. The Battery LED continues to indicate a failure until the service processor is started. When a UCE is detected by the BIOS, the DIMM LEDs also illuminate.

For more information on CPU fault indicators and replacing CPUs, refer to the Sun Fire X4500 Server Service Manual (819-4359).

FIGURE 1-3 CPU Module LED and Button Locations

Diagram showing the locations and designations of the 8 memory slots on the CPU board.

Figure Legend
1	DIMM 0 2 1 3
2	CPU 1 (under heatsink)
3	CPU 0 (under heatsink)
4	DIMM 3 1 2 0
5	DIMM fault LEDs
6	CPU 1 fault LED
7	Battery
8	Battery fault LED
9	CPU 0 fault LED
10	Press to see fault
11	DIMM fault LED

DIMM Population Rules

The DIMM population rules for the Sun Fire X4500 server are as follows:

Each CPU can support a maximum of four DIMMs.

The DIMM slots are paired and the DIMMs must be installed in pairs (0 and 1, 2 and 3). See FIGURE 1-3.

CPUs with only a single pair of DIMMs must have those DIMMs installed in that CPU’s white DIMM slots (0 and 1). See FIGURE 1-3.

Only PC3200 ECC Registered DIMMs are supported.

Each pair of DIMMs must be identical (same manufacturer, size, and speed).
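The population rules above can be expressed as a simple check for one CPU. The following Python sketch is a minimal illustration, not a Sun tool. The Dimm type and the function name are hypothetical, and the slot pairing (0 and 1, 2 and 3) is taken from the rule list above, so adjust the PAIRS constant if the slot labeling on your board differs.

```python
from typing import Dict, List, NamedTuple, Optional


class Dimm(NamedTuple):
    manufacturer: str
    size_gb: int
    speed: str              # for example "PC3200"
    ecc_registered: bool


PAIRS = ((0, 1), (2, 3))    # slot pairs as stated in the rules above
WHITE_PAIR = (0, 1)         # slots for a CPU with a single pair


def check_cpu_dimms(slots: Dict[int, Optional[Dimm]]) -> List[str]:
    """Return a list of rule violations for one CPU (empty if none)."""
    problems = []
    populated = {n for n, d in slots.items() if d is not None}

    if not populated <= {0, 1, 2, 3}:
        problems.append("a CPU supports at most four DIMMs (slots 0-3)")

    for a, b in PAIRS:
        da, db = slots.get(a), slots.get(b)
        if (da is None) != (db is None):
            problems.append(f"slots {a} and {b} must be populated as a pair")
        elif da is not None and da != db:
            problems.append(f"DIMMs in slots {a} and {b} are not identical")

    # A CPU with DIMMs but nothing in the white slots breaks the single-pair rule.
    if populated and not populated & set(WHITE_PAIR):
        problems.append("a single pair must use slots 0 and 1")

    for n in sorted(populated):
        d = slots[n]
        if d.speed != "PC3200" or not d.ecc_registered:
            problems.append(f"slot {n}: only PC3200 ECC Registered is supported")
    return problems
```

A CPU with identical PC3200 ECC Registered DIMMs in slots 0 and 1 and nothing else installed produces an empty list.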

Supported DIMM Configurations

TABLE 1-1 lists the supported DIMM configurations for the Sun Fire X4500 server.

TABLE 1-1 Supported DIMM Configurations
Slot 3	Slot 2	Slot 1	Slot 0	Total Memory Per CPU
0	2 GB	0	2 GB	4 GB
2 GB	2 GB	2 GB	2 GB	8 GB

Isolating and Correcting DIMM ECC Errors

If your log files report an ECC error or a problem with a DIMM, complete the steps below until you can isolate the fault.

In this example, the log file reports an error with the DIMM in CPU0, slot 1. The fault LEDs on CPU0, slots 1 and 3 are lit.

To isolate and correct DIMM ECC errors:

1. If you have not already done so, shut down your server to standby power mode and remove the system controller cover.

Refer to the Sun Fire X4500 Server Service Manual, 819-4359.

2. Inspect the installed DIMMs to ensure that they comply with the DIMM Population Rules and the Supported DIMM Configurations.

3. Inspect the fault LEDs on the DIMM slot ejectors and the CPU fault LEDs on the CPU board. See FIGURE 1-3.

If any of these LEDs are lit, they can indicate the component with the fault.

4. Disconnect the AC power cords from the server.

Caution - Before handling components, attach an ESD wrist strap to a chassis ground (any unpainted metal surface). The system’s printed circuit boards and hard disk drives contain components that are extremely sensitive to static electricity.

5. Replace the CPU that has the problem.

Refer to the Sun Fire X4500 Server Service Manual, 819-4359.

6. Remove the DIMMs from the CPU board.

Refer to the Sun Fire X4500 Server Service Manual, 819-4359.

7. Visually inspect the DIMMs for physical damage, dust, or any other contamination on the connector or circuits.

8. Visually inspect the DIMM slot for physical damage. Look for cracked or broken plastic on the slot.

9. Dust off the DIMMs, clean the contacts, and reseat them.

10. If there is no obvious damage, exchange the individual DIMMs between the two slots of a given pair. Ensure that they are inserted correctly with ejector latches secured. Using the slot numbers from the example:

a. Remove the DIMMs from CPU0, slots 1 and 3.

b. Reinstall the DIMM from slot 1 into slot 3.

c. Reinstall the DIMM from slot 3 into slot 1.

11. Reconnect AC power cords to the server.

12. Power on the server and run the diagnostics test again.

13. Review the log file.

If the error now appears in CPU0, slot 3 (opposite to the original error in slot 1), the problem is related to the individual DIMM. In this case, return both DIMMs (the pair) to the Support Center for replacement.

If the error still appears in CPU0, slot 1 (as the original error did), the problem is not related to an individual DIMM. Instead, it might be caused by CPU0 or by the DIMM slot. Continue with the next step.

14. Shut down the server again and disconnect the AC power cords.

15. Remove both DIMMs of the pair and install them into paired slots on the second CPU, the one that did not indicate a DIMM problem.

Using the slot numbers in the example, install the two DIMMs from CPU0, slots 1 and 3 into CPU1, slots 1 and 3 or CPU1, slots 0 and 2.

16. Reconnect AC power cords to the server.

17. Power on the server and run the diagnostics test again.

18. Review the log file.

If the error now appears under the CPU that manages the DIMM slots in which you just installed the DIMMs, the problem is with the DIMMs. Return both DIMMs (the pair) to the Support Center for replacement.

If the error remains with the original CPU, there is a problem with that CPU.
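The two log reviews in this procedure (Step 13 and Step 18) each come down to checking whether the error followed the DIMMs or stayed where it was. The following Python sketch restates that logic. The function names and return strings are hypothetical, and the "inconclusive" result covers outcomes that the procedure does not describe.

```python
def after_pair_swap(original_slot, partner_slot, error_slot_now):
    """Interpret the log after the two DIMMs of a pair traded slots."""
    if error_slot_now == partner_slot:
        # The error followed the DIMM to its new slot.
        return "replace both DIMMs of the pair"
    if error_slot_now == original_slot:
        # The error stayed with the slot, so suspect the CPU or the slot.
        return "move the pair to the other CPU and retest"
    return "inconclusive"   # case not covered by the procedure


def after_move_to_other_cpu(original_cpu, other_cpu, error_cpu_now):
    """Interpret the log after the pair moved to the other CPU's slots."""
    if error_cpu_now == other_cpu:
        # The error followed the DIMMs to the other CPU.
        return "replace both DIMMs of the pair"
    if error_cpu_now == original_cpu:
        # The error stayed behind, so the DIMMs are not the cause.
        return "problem is with the original CPU"
    return "inconclusive"   # case not covered by the procedure
```

Using the example in this section, after_pair_swap(1, 3, 3) indicates that the DIMM pair should be replaced, and after_pair_swap(1, 3, 1) indicates that you should continue with Step 14.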