| Sun Fire X4600/X4600 M2 Servers Diagnostics Guide |     | 
 
 
This appendix contains information about how the servers process and log errors.
| Note - The information in this appendix applies to the original Sun Fire X4600 server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text. 
 | 
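This appendix contains the following sections:
- Handling of Uncorrectable Errors
- Handling of Correctable Errors
- Handling of Parity Errors (PERR)
- Handling of System Errors (SERR)
- Handling Mismatching Processors
- Hardware Error Handling Summary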
Handling of Uncorrectable Errors 
This section lists facts and considerations about how the server handles uncorrectable errors.
| Note - The BIOS ChipKill feature must be disabled if you are testing for failures of multiple bits within a DRAM (ChipKill corrects for the failure of a four-bit wide DRAM).
 | 
- The BIOS logs the error to the SP system event log (SEL) through the board management controller (BMC).
- The SP's SEL is updated with the failing DIMM pair's particular bank address.
- The system reboots.
- The BIOS logs the error in DMI.
| Note - If the error is in the low 1 MB of memory, the BIOS freezes after rebooting, so no DMI log entry is recorded.
 | 
- An example of the error reported by the SEL through IPMI 2.0 is as follows:
- When the fault is in low memory, the BIOS freezes during the pre-boot low-memory test, because it cannot decompress itself into the faulty DRAM and execute. The SEL then contains entries such as the following:
ipmitool> sel list
100 | 08/26/2005 | 11:36:09 | OEM #0xfb |
200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0
- When the faulty DIMM lies beyond the low 1 MB that the BIOS uses as its extraction space, the system boots normally:
ipmitool> sel list
100 | 08/26/2005 | 05:04:04 | OEM #0xfb |
200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0
- Note the following considerations for this revision:
- Uncorrectable ECC Memory Error is not reported.
- Multi-bit ECC errors are reported as Memory Device Disabled.
- On the first reboot, the BIOS logs a HyperTransport Error in the DMI log.
- The BIOS disables the DIMM.
- The BIOS sends the SEL records to the BMC.
- The BIOS reboots again.
- The BIOS skips the faulty DIMM on the next POST memory test.
- The BIOS reports available memory, excluding the faulty DIMM pair.
FIGURE E-1 shows an example of a DMI log screen from the BIOS Setup Page.
FIGURE E-1 	 DMI Log Screen, Uncorrectable Error 
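The SEL listings shown earlier in this section can be checked by script when many servers have to be reviewed. The following is a minimal sketch, assuming the output of `sel list` keeps the pipe-delimited layout of the examples above. The function names `parse_sel_list` and `disabled_dimms` are illustrative and are not part of ipmitool.

```python
# Sketch only: parses text shaped like the `ipmitool sel list` examples in
# this appendix. Function and variable names are hypothetical helpers.

def parse_sel_list(text):
    """Turn pipe-delimited SEL lines into a list of dictionaries."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip the prompt line and any line without id, date, time, sensor.
        if len(fields) < 4 or not fields[0].isalnum():
            continue
        records.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            # Any remaining non-empty fields describe the event.
            "details": [f for f in fields[4:] if f],
        })
    return records


def disabled_dimms(records):
    """Return the location text of every 'Memory Device Disabled' entry."""
    return [
        r["details"][-1]
        for r in records
        if r["sensor"] == "Memory" and "Memory Device Disabled" in r["details"]
    ]


SEL_SAMPLE = """ipmitool> sel list
100 | 08/26/2005 | 11:36:09 | OEM #0xfb |
200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory
300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0
"""

if __name__ == "__main__":
    for location in disabled_dimms(parse_sel_list(SEL_SAMPLE)):
        print("Disabled:", location)
```

Because this BIOS revision reports multi-bit ECC errors as Memory Device Disabled, filtering on that text is the only way this sketch identifies an uncorrectable error.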
 
Handling of Correctable Errors 
This section lists facts and considerations about how the server handles correctable errors. 
- During BIOS POST:
- The BIOS polls the machine check (MCK) registers.
- The BIOS logs to DMI.
- The BIOS logs to the SP SEL through the BMC.
- The feature is turned off at OS boot time by default.
- The following Linux versions report correctable ECC syndrome and memory fill errors in /var/log if the mce kernel flag is specified at boot time, or if mce is enabled when the kernel is compiled or installed:
- RH3 Update5 single core
- RH4 Update1+
- SLES9 SP1+
- The Linux kernel (x86_64/kernel/mce.c) repeats a report every 30 seconds until another error is encountered and an 8131 flag is reset.
- Solaris support provides full self-healing and automated diagnosis for the CPU and Memory subsystems.
- FIGURE E-2 shows an example of a DMI log screen from the BIOS Setup Page:
FIGURE E-2 	 DMI Log Screen, Correctable Error 
 
- If, at any stage of memory testing, the BIOS cannot read from or write to a DIMM, it takes the following actions:
- The BIOS disables the DIMM as indicated by the Memory Decreased message in the example in FIGURE E-3.
- The BIOS logs an SEL record.
- The BIOS logs an event in DMI.
FIGURE E-3 	 DMI Log Screen, Correctable Error, Memory Decreased 
 
Handling of Parity Errors (PERR) 
This section lists facts and considerations about how the server handles parity errors (PERR). 
- Parity errors are handled through non-maskable interrupts (NMIs).
- During BIOS POST, the NMI is logged in the DMI and the SP SEL. See the following example command and output:
[root@d-mpk12-53-238 root]# ipmitool -H 129.146.53.95 -U root -P changeme -I lan sel list -v
SEL Record ID 	: 0100
Record Type 	: 00
Timestamp 	: 01/10/2002 20:16:16
Generator ID 	: 0001
EvM Revision 	: 04
Sensor Type 	: Critical Interrupt
Sensor Number 	: 00
Event Type 	: Sensor-specific Discrete
Event Direction 	: Assertion Event
Event Data 	: 04ff00
Description 	: PCI PERR
- FIGURE E-4 shows an example of a DMI log screen from the BIOS Setup Page, with a parity error.
FIGURE E-4 	 DMI Log Screen, PCI Parity Error 
 
- The BIOS displays the following messages and freezes (during POST or DOS):
- NMI EVENT!!
- System Halted due to Fatal NMI!
- The Linux NMI trap catches the interrupt and reports the following NMI "confusion report" sequence:
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
| Note - The Linux system reboots, but does not inform the BIOS of this incident.
 | 
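Because Linux does not inform the BIOS of these NMIs, the kernel log is the only record of them on the host. The following is a minimal sketch that counts the "unknown reason" reports per CPU, assuming the message wording matches the example above. The name `summarize_nmis` is illustrative, and the log text is passed in as a string because the log file location varies by distribution.

```python
# Sketch only: counts NMI reports in kernel log text that uses the wording
# shown in this section. summarize_nmis is a hypothetical helper name.
import re
from collections import Counter

NMI_RE = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)"
)


def summarize_nmis(log_text):
    """Return a Counter keyed by (cpu, reason byte) for each NMI report."""
    counts = Counter()
    for match in NMI_RE.finditer(log_text):
        reason = int(match.group(1), 16)  # reason byte is printed in hex
        cpu = int(match.group(2))
        counts[(cpu, reason)] += 1
    return counts


NMI_SAMPLE = """\
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 1.
Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
"""

if __name__ == "__main__":
    for (cpu, reason), n in sorted(summarize_nmis(NMI_SAMPLE).items()):
        print("CPU %d reason 0x%02x: %d report(s)" % (cpu, reason, n))
```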
Handling of System Errors (SERR) 
This section lists facts and considerations about how the server handles system errors (SERR). 
- System error handling works through the HyperTransport Synch Flood Error mechanism on the AMD 8111 I/O hub and the AMD 8131 PCI-X controllers.
- The following events happen during BIOS POST:
- POST reports any previous system errors at the bottom of the screen. See FIGURE E-5 for an example.
FIGURE E-5 	 POST Screen, Previous System Error Listed 
 
- SERR and HyperTransport Synch Flood Error are logged in DMI and the SP SEL. See the following sample output:
SEL Record ID 	: 0a00
Record Type 	: 00
Timestamp 	: 08/10/2005 06:05:32
Generator ID 	: 0001
EvM Revision 	: 04
Sensor Type 	: Critical Interrupt
Sensor Number 	: 00
Event Type 	: Sensor-specific Discrete
Event Direction 	: Assertion Event
Event Data 	: 05ffff
Description 	: PCI SERR
- FIGURE E-6 shows an example DMI log screen from the BIOS Setup Page with a system error.
FIGURE E-6 	 DMI Log Screen, System Error Listed 
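The verbose SEL records shown for PERR and SERR differ only in their Event Data and Description fields. The following is a minimal sketch that parses one verbose record and decodes the first Event Data byte. It assumes the IPMI convention that the low four bits of that byte hold the sensor-specific offset, and it lists only the two Critical Interrupt offsets that appear in this appendix (04 for PCI PERR, 05 for PCI SERR). The function names are illustrative.

```python
# Sketch only: parses text shaped like `ipmitool sel list -v` output in this
# appendix. The offset table is limited to the two values shown in the text.

CRITICAL_INTERRUPT_OFFSETS = {
    0x04: "PCI PERR",
    0x05: "PCI SERR",
}


def parse_verbose_record(text):
    """Split 'Key : value' lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        # Partition on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


def decode_critical_interrupt(record):
    """Return the event name for a Critical Interrupt record, else None."""
    if record.get("Sensor Type") != "Critical Interrupt":
        return None
    data1 = int(record["Event Data"][:2], 16)
    offset = data1 & 0x0F  # assumption: low nibble is the event offset
    return CRITICAL_INTERRUPT_OFFSETS.get(
        offset, "unknown offset 0x%02x" % offset
    )


SERR_SAMPLE = """\
SEL Record ID    : 0a00
Record Type      : 00
Timestamp        : 08/10/2005 06:05:32
Generator ID     : 0001
EvM Revision     : 04
Sensor Type      : Critical Interrupt
Sensor Number    : 00
Event Type       : Sensor-specific Discrete
Event Direction  : Assertion Event
Event Data       : 05ffff
Description      : PCI SERR
"""

if __name__ == "__main__":
    rec = parse_verbose_record(SERR_SAMPLE)
    print(rec["Timestamp"], decode_critical_interrupt(rec))
```

Decoding the Event Data independently of the Description field gives a cross-check when the two are compared.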
 
Handling Mismatching Processors 
This section lists facts and considerations about how the server handles mismatching processors. 
- The BIOS performs a complete POST.
- The BIOS displays a report of any mismatching CPUs, as shown in the following example:
AMIBIOS(C)2003 American Megatrends, Inc.
BIOS Date: 08/10/05 14:51:11 Ver: 08.00.10
CPU : AMD Opteron(tm) Processor 254,  Speed : 2.4 GHz
Count : 3, CPU Revision, CPU0 : E4, CPU1 : E6
Microcode Revision, CPU0 : 0, CPU1 : 0
DRAM Clocking CPU0 = 400 MHz, CPU1 Core0/1 = 400 MHz
Sun Fire X4600 Server, 1 AMD North Bridge, Rev E4
1 AMD North Bridge, Rev E6
1 AMD 8111 I/O Hub, Rev C2
2 AMD 8131 PCI-X Controllers, Rev B2
System Serial Number  : 0505AMF028
BMC Firmware Revision :  1.00
Checking NVRAM..
Initializing USB Controllers .. Done.
Press F2 to run Setup (CTRL+E on Remote Keyboard)
Press F12 to boot from the network (CTRL+N on Remote Keyboard)
Press F8 for BBS POPUP  (CTRL+P on Remote Keyboard)
- No SEL or DMI event is recorded.
- The system enters Halt mode and the following message is displayed:
******** Warning: Bad Mix of Processors *********
Multiple core processors cannot be installed with single core
processors.
Fatal Error... System Halted.
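The rule stated in the warning message can be expressed as a simple check. The following is a minimal sketch that models only that rule (single-core and multiple-core processors must not be mixed). It is not the BIOS's actual test, and the function name and the input list of core counts are hypothetical.

```python
# Sketch only: models the "Bad Mix of Processors" rule from the warning text.
# The input is a hypothetical list with one core count per installed CPU.

def is_bad_processor_mix(cores_per_cpu):
    """Return True when single-core and multiple-core CPUs are both present."""
    has_single = any(n == 1 for n in cores_per_cpu)
    has_multi = any(n > 1 for n in cores_per_cpu)
    return has_single and has_multi


if __name__ == "__main__":
    # One single-core CPU installed next to a dual-core CPU.
    print(is_bad_processor_mix([1, 2]))
```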
Hardware Error Handling Summary
TABLE E-1 summarizes the most common hardware errors that you might encounter with these servers.
 
  TABLE E-1 	   Hardware Error Handling Summary   
| Error | Description | Handling | Logged (DMI Log or SP SEL) | Fatal? |
| SP failure | The SP fails to boot upon application of system power. | The SP controls the system reset, so the system might power on but does not come out of reset. During power up, the SP's boot loader turns on the power LED. During SP boot, Linux startup, and SP sanity check, the power LED blinks. The LED is turned off when SP management code (the IPMI stack) is started. At exit of BIOS POST, the LED goes to STEADY ON state. | Not logged | Fatal |
| SP failure | The SP boots but fails POST. | The SP controls the system reset, so the system does not come out of reset. | Not logged | Fatal |
| BIOS POST failure | The server BIOS does not pass POST. | POST errors can be fatal or non-fatal. The BIOS detects some errors and announces them during POST as POST codes in the bottom right corner of the display, on both the serial console and the video display. Some POST codes are forwarded to the SP for logging. The POST codes do not appear in sequential order and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs. In the case of early POST failures (for example, the BSP fails to operate correctly), the BIOS halts without logging. For some other POST failures subsequent to memory and SP initialization, the BIOS logs a message to the SP's SEL. |  |  |
| Single-bit DRAM ECC error | With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface. | The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. Polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface. | SP SEL | Normal operation |
| Single four-bit DRAM error | With CHIP-KILL enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface. | The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. Polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface. | SP SEL | Normal operation |
| Uncorrectable DRAM ECC error | The CPU detects an uncorrectable multiple-bit DIMM error. | A "sync flood" is used to prevent the erroneous data from being propagated across the HyperTransport links. The system reboots, and the BIOS recovers the machine check register information, maps this information to the failing DIMM (when CHIPKILL is disabled) or DIMM pair (when CHIPKILL is enabled), and logs that information to the SP. The BIOS will halt the CPU. | SP SEL | Fatal |
| Unsupported DIMM configuration | Unsupported DIMMs are used, or supported DIMMs are loaded improperly. | The BIOS displays an error message, logs an error, and halts the system. | DMI Log, SP SEL | Fatal |
| HyperTransport link failure | CRC or link error on one of the HyperTransport links. | The HyperTransport links sync flood, the machine resets itself, and error information is retained through the reset. The BIOS reports: "A Hyper Transport sync flood error occurred on last boot, press F1 to continue." | DMI Log, SP SEL | Fatal |
| PCI SERR, PERR | System or parity error on a PCI bus. | The HyperTransport links sync flood, the machine resets itself, and error information is retained through the reset. The BIOS reports: "A Hyper Transport sync flood error occurred on last boot, press F1 to continue." | DMI Log, SP SEL | Fatal |
| BIOS POST Microcode Error | The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new CPU is installed in a motherboard with an outdated BIOS. In this case, the BIOS must be updated. | The BIOS displays an error message, logs the error to DMI, and boots. | DMI Log | Non-fatal |
| BIOS POST CMOS Checksum Bad | CMOS contents failed the Checksum check. | The BIOS displays an error message, logs the error to DMI, and boots. | DMI Log | Non-fatal |
| Unsupported CPU configuration | The BIOS supports mismatched frequency and steppings in CPU configuration, but some CPUs might not be supported. | The BIOS displays an error message, logs the error, and halts the system. | DMI Log | Fatal |
| Correctable error | The CPU detects a variety of correctable errors in the MCi_STATUS registers. | The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. Polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler. The SMI handler logs a message to the SP SEL if the SEL is available; otherwise, it logs a message to DMI. The BIOS's polling can be disabled through software SMI. | DMI Log, SP SEL | Normal operation |
| Single fan failure | Fan failure is detected by reading tach signals. | The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit. | SP SEL | Non-fatal |
| Multiple fan failure | Fan failure is detected by reading tach signals. | The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit. | SP SEL | Fatal |
| Single power supply failure | Any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals is deasserted. | The Service Action Required and Power Supply Fault LEDs are lit. | SP SEL | Non-fatal |
| DC/DC power converter failure | Any POWER_GOOD signal is deasserted from the DC/DC converters. | The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state. | SP SEL | Fatal |
| Voltage above/below Threshold | The SP monitors system voltages and detects voltage above or below a given threshold. | The Service Action Required LED and Power Supply Fault LED blink. | SP SEL | Fatal |
| High temperature | The SP monitors CPU and system temperatures, and detects temperatures above a given threshold. | The Service Action Required LED and System Overheat Fault LED blink. The motherboard is shut down above the specified critical level. | SP SEL | Fatal |
| Processor thermal trip | The CPU drives the THERMTRIP_L signal upon detecting an overtemp condition. | The CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink. | SP SEL | Fatal |
| Boot device failure | The BIOS is not able to boot from a device in the boot device list. | The BIOS goes to the next boot device in the list. If all devices in the list fail, an error message is displayed and the BIOS retries from the beginning of the list. The SP can control or change the boot order. | DMI Log | Non-fatal |
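The single-bit and four-bit DRAM error rows in TABLE E-1 describe a polling handler that logs each detected error until a per-error limit is reached. The following is a minimal sketch of that behavior, not the BIOS SMI handler itself. The limit value, the error identifiers, and the class name are hypothetical, and each call to `poll` stands in for one half-second SMI timer pass.

```python
# Sketch only: simulates "log until the limit for the same error is reached".
# The limit of 3 and the error identifier strings are made-up example values.
from collections import defaultdict


class CorrectableErrorLogger:
    def __init__(self, limit_per_error):
        self.limit = limit_per_error
        self.counts = defaultdict(int)  # detections seen per error
        self.sel = []                   # stand-in for the SP SEL

    def poll(self, detected_errors):
        """Handle one polling pass and return what was logged in it."""
        logged = []
        for err in detected_errors:
            self.counts[err] += 1
            # Log only while this error is still at or under its limit.
            if self.counts[err] <= self.limit:
                logged.append(err)
        self.sel.extend(logged)
        return logged


if __name__ == "__main__":
    logger = CorrectableErrorLogger(limit_per_error=3)
    for _ in range(5):  # five polling passes, same DIMM reported each time
        logger.poll(["CPU 0 DIMM 0 single-bit ECC"])
    print(len(logger.sel), "entries logged,",
          logger.counts["CPU 0 DIMM 0 single-bit ECC"], "detections seen")
```

Capping the log per error keeps a persistently failing DIMM from filling the SEL while still recording that the error occurred.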
 
| Sun Fire X4600/X4600 M2 Servers Diagnostics Guide | 819-4343-12 |     | 
 
Copyright © 2006, Sun Microsystems, Inc.   All Rights Reserved.