CHAPTER 1
Initial Inspection of the Server
Note - This chapter applies to all Sun Fire X4100/X4100 M2 and X4200/X4200 M2 servers, unless otherwise noted.
Use the following flowchart as a guideline for using the subjects in this book to troubleshoot the server.
The first step in determining the cause of a problem with the server is to gather whatever information you can from the service call paperwork or the on-site personnel. Use the following general guidelines when you begin troubleshooting.
1. Collect information about the following items:
2. Document the server settings before you make any changes.
If possible, make one change at a time, in order to isolate potential problems. In this way, you can maintain a controlled environment and reduce the scope of troubleshooting.
3. Take note of the results of any change you make. Include any errors or informational messages.
4. Check for potential device conflicts before you add a new device.
5. Check for version dependencies, especially with third-party software.
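The record-keeping described in steps 2 and 3 can be sketched as a small log structure. This is only an illustration, not part of any Sun tool: the class names `Change` and `TroubleshootingLog`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Change:
    """One change made to the server, with what was observed afterwards."""
    description: str
    result: str = ""                               # outcome noted after the change
    messages: list = field(default_factory=list)   # errors or informational messages
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class TroubleshootingLog:
    """Settings documented before any change, plus an ordered list of changes."""
    baseline: dict
    changes: list = field(default_factory=list)

    def record(self, description, result, messages=()):
        """Append exactly one change so that each result maps to one action."""
        change = Change(description, result, list(messages))
        self.changes.append(change)
        return change


# Hypothetical example values, for illustration only.
log = TroubleshootingLog(baseline={"bios_version": "example", "dimm_count": 4})
log.record(
    "Reseated DIMMs in CPU0 slots 0+1",
    "Error persists",
    ["example error message copied from the console"],
)
```

Recording one entry per change keeps the one-change-at-a-time discipline visible: if two actions share an entry, their effects can no longer be separated.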
The system serial number is located on a sticker that is attached to the front bezel (see FIGURE 1-2 or FIGURE 1-3 for the location).
If the bezel is missing, a second serial number label is affixed to the system:
Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components.
1. Check that AC power cords are attached firmly to the server's power supplies and to the AC source.
2. Check that both the main cover and rear cover are firmly in place.
An intrusion switch on the front I/O board automatically switches the server to standby power mode when the covers are removed.
To perform a visual inspection of the external system:
1. Inspect the external status indicator LEDs, which can indicate component malfunction.
For the LED locations and descriptions of their behavior, see External Status Indicator LEDs.
2. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
3. If the problem is not evident, continue with Internally Inspecting the Server.
Perform a visual inspection of the internal system by following these steps. Stop when you identify the problem.
1. Choose a method for shutting down the server from main power mode to standby power mode.
When main power is off, the Power/OK LED on the front panel will begin flashing, indicating that the server is in standby power mode.
2. Remove the server covers, as required.
For instructions on removing system covers, refer to the Sun Fire X4100/X4100 M2 and Sun Fire X4200/X4200 M2 Servers Service Manual, 819-1157.
3. Inspect the internal status indicator LEDs, which can indicate component malfunction.
For the LED locations and descriptions of their behavior, see Internal Status Indicator LEDs.
4. Verify that there are no loose or improperly seated components.
5. Verify that all cable connectors inside the system are firmly and correctly attached to their appropriate connectors.
6. Verify that any after-factory components are qualified and supported.
For a list of supported PCI cards and DIMMs, refer to the Sun Fire X4100/X4100 M2 and Sun Fire X4200/X4200 M2 Servers Service Manual, 819-1157.
7. Check that the installed DIMMs comply with the supported DIMM population rules and configurations, as described in Troubleshooting DIMM Problems.
9. To restore main power mode to the server (all components powered on), use a ballpoint pen or other pointed object to press and release the Power button on the server front panel. See FIGURE 1-2 or FIGURE 1-3.
When main power is applied to the full server, the Power/OK LED next to the Power button lights and remains lit.
10. If the problem with the server is not evident, you can try viewing the power-on self test (POST) messages and BIOS event logs during system startup. Continue with Viewing BIOS Event Logs.
Use this section to troubleshoot problems with memory modules, or DIMMs.
Note - For information on Sun's DIMM replacement policy for x64 servers, contact your Sun Service representative.
For all operating systems (OS), the behavior is the same:
# ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list
f000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
f100 | OEM record e0 | 00000000040f0c0200200000a2
f200 | OEM record e0 | 01000000040000000000000000
f300 | 02/16/2006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
f400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1
f500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard
f600 | 02/16/2006 | 03:32:55 | System Firmware Progress | Video
f700 | 02/16/2006 | 03:33:01 | System Firmware Progress | USB resource
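When the event log is long, it can help to filter the memory entries out of the `sel list` output. The following is a minimal sketch in Python that assumes the pipe-separated field layout seen in the sample output above (record ID, date, time, sensor type, event, location). The function names `parse_sel_line` and `memory_events` are hypothetical, and other firmware or ipmitool versions may format lines differently.

```python
SAMPLE = """\
f000 | 02/16/2006 | 03:32:38 | OEM #0x12 |
f100 | OEM record e0 | 00000000040f0c0200200000a2
f200 | OEM record e0 | 01000000040000000000000000
f300 | 02/16/2006 | 03:32:50 | Memory | Uncorrectable ECC | CPU 1 DIMM 0
f400 | 02/16/2006 | 03:32:50 | Memory | Memory Device Disabled | CPU 1
f500 | 02/16/2006 | 03:32:55 | System Firmware Progress | Motherboard
"""


def parse_sel_line(line):
    """Split one line on '|' and return the stripped, non-empty fields."""
    return [part.strip() for part in line.split("|") if part.strip()]


def memory_events(sel_text):
    """Return a dict for each line whose fourth field is 'Memory'."""
    events = []
    for line in sel_text.splitlines():
        fields = parse_sel_line(line)
        # Assumed layout: id, date, time, sensor type, event, location.
        if len(fields) >= 6 and fields[3] == "Memory":
            events.append({
                "id": fields[0],
                "date": fields[1],
                "time": fields[2],
                "event": fields[4],
                "location": fields[5],
            })
    return events


for event in memory_events(SAMPLE):
    print(event["id"], event["event"], event["location"])
```

Lines that do not match the assumed six-field layout, such as the OEM records, are skipped rather than guessed at.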
At this time, correctable errors are not logged in the server's system event logs. They are reported or handled in the supported operating systems as follows:
Start-->Administrative Tools-->Event Viewer
BIOS will display and log three types of error messages:
NODE-n Memory Configuration Mismatch
The following conditions will cause this error message:
The following conditions will cause this error message:
NODE-n DIMMs Manufacturer Mismatch
The following conditions will cause this error message:
This will be displayed when you add Hitachi DIMMs
The ejectors on the DIMM slots on the motherboard contain DIMM fault LEDs.
Note the following differences between the Sun Fire X4100/X4200 and the X4100 M2/X4200 M2 servers regarding the power requirements for viewing the DIMM fault LEDs:
Note - The DIMM fault LEDs always indicate a failed DIMM pair, with the LEDs lit on both slots of the pair that contains the failed DIMM. See Isolating and Correcting DIMM ECC Errors for a procedure to determine which DIMM of the pair is faulty.
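As a rough illustration of the pair behavior, the sketch below maps a reported slot number to the two slots whose fault LEDs would be lit. It assumes that pairs are formed from consecutive even and odd slot numbers, which matches the examples in this chapter (slots 0+1 and slots 2+3) but should be confirmed against the slot numbering figures for your server. The function name `dimm_pair` is hypothetical.

```python
def dimm_pair(slot):
    """Return the (even, odd) slot numbers of the pair that contains `slot`.

    Assumption: pairs are slots 0+1, 2+3, and so on, as in the chapter's examples.
    """
    if slot < 0:
        raise ValueError("slot number must be non-negative")
    even = slot - (slot % 2)
    return (even, even + 1)


# A fault reported in slot 1 corresponds to the pair made of slots 0 and 1.
print(dimm_pair(1))
```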
FIGURE 1-4 shows the numbering of the Sun Fire X4100/X4200 DIMM slots.
FIGURE 1-5 shows the numbering of the Sun Fire X4100 M2/X4200 M2 DIMM slots.
Note - The Sun Fire X4100/X4200 servers use only DDR1 DIMMs. The Sun Fire X4100 M2/X4200 M2 servers use only DDR2 DIMMs.
The DIMM population rules for the Sun Fire X4100/X4200 servers are listed here:
The DIMM population rules for the Sun Fire X4100 M2/X4200 M2 servers are listed here:
If your log files report an ECC error or a problem with a DIMM, complete the steps below until you can isolate the fault.
In this example, the log file reports an error with the DIMM in CPU0, slot 1. The fault LEDs on CPU0, slots 0+1 are lit.
1. If you have not already done so, shut down your server to standby power mode and remove the main cover.
Refer to the Sun Fire X4100/X4100 M2 and Sun Fire X4200/X4200 M2 Servers Service Manual, 819-1157.
2. Inspect the installed DIMMs to ensure that they comply with the DIMM Population Rules.
3. Inspect the fault LEDs on the DIMM slot ejectors and the CPU LEDs on the motherboard. See FIGURE 1-4.
If any of these LEDs are lit, they can indicate the component with the fault.
4. Disconnect the AC power cords from the server.
6. Visually inspect the DIMMs for physical damage, dust, or any other contamination on the connector or circuits.
7. Visually inspect the DIMM slot for physical damage. Look for cracked or broken plastic on the slot.
8. Dust off the DIMMs, clean the contacts, and reseat them.
9. If there is no obvious damage, exchange the individual DIMMs between the two slots of a given pair. Ensure that they are inserted correctly with ejector latches secured.
Using the example, remove the DIMMs from CPU0, slots 0+1. Then reinstall the DIMM from slot 1 into slot 0, and reinstall the DIMM from slot 0 into slot 1.
10. Reconnect AC power cords to the server.
11. Power on the server and run the diagnostics test again.
13. Shut down the server again and disconnect the AC power cords.
14. Remove both DIMMs of the pair and install them into paired slots on the opposite CPU.
Using the example, install the two DIMMs from CPU0, slots 0+1 into CPU1, slots 0+1 or CPU1, slots 2+3.
15. Reconnect AC power cords to the server.
16. Power on the server and run the diagnostics test again.
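Because the DIMMs change position twice during this procedure, it is easy to lose track of which physical module sits in which slot. The sketch below keeps that record for the chapter's example. The module labels `DIMM-A` and `DIMM-B`, the function names, and the rule that a destination slot must be empty are assumptions made for illustration; they are not taken from the service manual.

```python
def swap_within_pair(layout, cpu, slot_a, slot_b):
    """Exchange the two DIMMs of one pair on the same CPU (the swap in step 9)."""
    new = dict(layout)
    new[(cpu, slot_a)] = layout[(cpu, slot_b)]
    new[(cpu, slot_b)] = layout[(cpu, slot_a)]
    return new


def move_pair(layout, src_cpu, src_slots, dst_cpu, dst_slots):
    """Move both DIMMs of a pair to paired slots on another CPU (step 14)."""
    new = dict(layout)
    for src, dst in zip(src_slots, dst_slots):
        if (dst_cpu, dst) in new:
            # Assumption: the destination slots are empty before the move.
            raise ValueError(f"CPU{dst_cpu} slot {dst} is already populated")
        new[(dst_cpu, dst)] = new.pop((src_cpu, src))
    return new


# Keys are (cpu, slot); values are hypothetical labels written on each module.
start = {(0, 0): "DIMM-A", (0, 1): "DIMM-B"}
after_swap = swap_within_pair(start, cpu=0, slot_a=0, slot_b=1)
after_move = move_pair(after_swap, 0, (0, 1), 1, (0, 1))

for position, module in sorted(after_move.items()):
    print(position, module)
```

Looking up the location reported by each diagnostics run in the layout that was current at that time shows which physical module occupied the reported slot.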
Copyright © 2007, Sun Microsystems, Inc. All Rights Reserved.