APPENDIX E

Error Handling

This appendix contains information about how the servers process and log errors.



Note - The information in this appendix applies to the original Sun Fire X4600 server, and to the Sun Fire X4600 M2 server, unless otherwise noted in the text.



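This appendix contains the following sections:

  • Handling of Uncorrectable Errors
  • Handling of Correctable Errors
  • Handling of Parity Errors (PERR)
  • Handling of System Errors (SERR)
  • Handling Mismatching Processors
  • Hardware Error Handling Summary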


Handling of Uncorrectable Errors

This section lists facts and considerations about how the server handles uncorrectable errors.



Note - The BIOS ChipKill feature must be disabled if you are testing for failures of multiple bits within a DRAM, because ChipKill corrects for the failure of a four-bit-wide DRAM.





Note - If the error is in the low 1 MB of memory, the BIOS freezes after rebooting, so no DMI log is recorded.



ipmitool> sel list

100 | 08/26/2005 | 11:36:09 | OEM #0xfb |

200 | 08/26/2005 | 11:36:12 | System Firmware Error | No usable system memory

300 | 08/26/2005 | 11:36:12 | Memory | Memory Device Disabled | CPU 0 DIMM 0

ipmitool> sel list

100 | 08/26/2005 | 05:04:04 | OEM #0xfb |

200 | 08/26/2005 | 05:04:09 | Memory | Memory Device Disabled | CPU 0 DIMM 0
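The sel list output shown above is pipe-delimited, so it can be split into fields when you want to filter events in a script. The following is a minimal Python sketch, assuming the column layout seen in these samples (record ID, date, time, sensor, then optional event details). The function names parse_sel_list and disabled_dimms and the dictionary keys are illustrative only and are not part of ipmitool.

```python
def parse_sel_list(text):
    """Parse pipe-delimited `sel list` lines into dictionaries."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip prompts, blank lines, and anything lacking the four leading columns.
        if len(fields) < 4 or not fields[0]:
            continue
        records.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            # Trailing empty fields (from a closing "|") are dropped.
            "details": [f for f in fields[4:] if f],
        })
    return records


def disabled_dimms(records):
    """Return the locations reported by 'Memory Device Disabled' events."""
    return [
        r["details"][1]
        for r in records
        if r["sensor"] == "Memory"
        and len(r["details"]) >= 2
        and r["details"][0] == "Memory Device Disabled"
    ]
```

Passing the captured output to parse_sel_list and then to disabled_dimms yields the DIMM locations that the BIOS disabled, for example CPU 0 DIMM 0 in both samples above.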

FIGURE E-1 shows an example of a DMI log screen from the BIOS Setup page.


FIGURE E-1 DMI Log Screen, Uncorrectable Error

Graphic showing a DMI log screen with a sample uncorrectable error message.



Handling of Correctable Errors

This section lists facts and considerations about how the server handles correctable errors.


FIGURE E-2 DMI Log Screen, Correctable Error

Graphic showing a DMI log screen with a sample correctable error message.



FIGURE E-3 DMI Log Screen, Correctable Error, Memory Decreased

Graphic showing a DMI log screen with a sample memory decreased error message.



Handling of Parity Errors (PERR)

This section lists facts and considerations about how the server handles parity errors (PERR).

[root@d-mpk12-53-238 root]# ipmitool -H 129.146.53.95 -U root -P changeme -I lan sel list -v

SEL Record ID : 0100
Record Type : 00
Timestamp : 01/10/2002 20:16:16
Generator ID : 0001
EvM Revision : 04
Sensor Type : Critical Interrupt
Sensor Number : 00
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 04ff00
Description : PCI PERR
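The verbose record above is a series of "Key : Value" lines, which makes it straightforward to load into a dictionary for further checks. The following Python sketch assumes that each record begins with a "SEL Record ID" line, as in the sample. The function name parse_sel_verbose is hypothetical.

```python
def parse_sel_verbose(text):
    """Split `sel list -v` output into one dictionary per SEL record."""
    records = []
    current = None
    for line in text.splitlines():
        # Split on the first colon only, so timestamps such as 20:16:16 stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key : Value" line
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records
```

A caller can then select records by field, for example those whose Description is PCI PERR.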


FIGURE E-4 DMI Log Screen, PCI Parity Error

Graphic showing a DMI log screen with a sample PCI parity error message displayed.


Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.

Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 1.

Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue

Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?

Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 1.

Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue

Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?

Aug 5 05:15:00 d-mpk12-53-159 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.

Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue

Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?

Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue

Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?



Note - The Linux system reboots, but does not inform the BIOS of this incident.
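Because the BIOS is not informed in this case, the operating system log is where the NMI messages shown above can be found. The following Python sketch counts those messages by reason code and CPU. The message pattern is taken from the sample log only and might differ on other kernel versions, and the name count_unknown_nmis is hypothetical.

```python
import re
from collections import Counter

# Pattern assumed from the sample log lines in this section.
NMI_PATTERN = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)"
)


def count_unknown_nmis(log_text):
    """Count 'unknown reason' NMI messages, keyed by (reason code, CPU number)."""
    counts = Counter()
    for match in NMI_PATTERN.finditer(log_text):
        reason = match.group(1).lower()
        cpu = int(match.group(2))
        counts[(reason, cpu)] += 1
    return counts
```

Applied to the sample log, the counter holds one entry for each combination of reason code (2d, 3d) and CPU (0, 1).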




Handling of System Errors (SERR)

This section lists facts and considerations about how the server handles system errors (SERR).


FIGURE E-5 POST Screen, Previous System Error Listed

Graphic showing a POST screen with a sample previous system error message displayed.


SEL Record ID : 0a00
Record Type : 00
Timestamp : 08/10/2005 06:05:32
Generator ID : 0001
EvM Revision : 04
Sensor Type : Critical Interrupt
Sensor Number : 00
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 05ffff
Description : PCI SERR
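In the two verbose SEL records shown in this appendix, the first byte of the Event Data field tracks the Description field: 04 for PCI PERR and 05 for PCI SERR. This matches the IPMI convention in which the low nibble of the first event data byte carries the sensor-specific offset. The following Python sketch decodes only these two offsets, because they are the only ones that appear in this appendix. The names describe_critical_interrupt and CRITICAL_INTERRUPT_OFFSETS are hypothetical.

```python
# Offsets observed in this appendix for the Critical Interrupt sensor type.
CRITICAL_INTERRUPT_OFFSETS = {
    0x04: "PCI PERR",
    0x05: "PCI SERR",
}


def describe_critical_interrupt(event_data_hex):
    """Map the Event Data string of a Critical Interrupt record to a description."""
    data = bytes.fromhex(event_data_hex)
    if not data:
        raise ValueError("event data is empty")
    # The low nibble of the first byte is the sensor-specific offset.
    offset = data[0] & 0x0F
    return CRITICAL_INTERRUPT_OFFSETS.get(offset, "unknown offset 0x%02x" % offset)
```

For the records in this appendix, the Event Data values 04ff00 and 05ffff decode to PCI PERR and PCI SERR respectively.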


FIGURE E-6 DMI Log Screen, System Error Listed

Graphic showing a DMI log screen with a sample system error message displayed.



Handling Mismatching Processors

This section lists facts and considerations about how the server handles mismatching processors.

AMIBIOS(C)2003 American Megatrends, Inc.
BIOS Date: 08/10/05 14:51:11 Ver: 08.00.10
CPU : AMD Opteron(tm) Processor 254, Speed : 2.4 GHz
Count : 3, CPU Revision, CPU0 : E4, CPU1 : E6
Microcode Revision, CPU0 : 0, CPU1 : 0
DRAM Clocking CPU0 = 400 MHz, CPU1 Core0/1 = 400 MHz

Sun Fire X4600 Server, 1 AMD North Bridge, Rev E4
1 AMD North Bridge, Rev E6
1 AMD 8111 I/O Hub, Rev C2
2 AMD 8131 PCI-X Controllers, Rev B2
System Serial Number : 0505AMF028
BMC Firmware Revision : 1.00
Checking NVRAM..
Initializing USB Controllers .. Done.
Press F2 to run Setup (CTRL+E on Remote Keyboard)
Press F12 to boot from the network (CTRL+N on Remote Keyboard)
Press F8 for BBS POPUP (CTRL+P on Remote Keyboard)

******** Warning: Bad Mix of Processors *********
Multiple core processors cannot be installed with single core
processors.
Fatal Error... System Halted.


Hardware Error Handling Summary

TABLE E-1 summarizes the most common hardware errors that you might encounter with these servers.

 


TABLE E-1 Hardware Error Handling Summary

Each entry below gives the error, its description, how the error is handled, where it is logged (DMI Log or SP SEL), and whether it is fatal.

Error: SP failure
Description: The SP fails to boot upon application of system power.
Handling: The SP controls the system reset, so the system might power on, but it does not come out of reset.
  • During power up, the SP's boot loader turns on the power LED.
  • During SP boot, Linux startup, and SP sanity check, the power LED blinks.
  • The LED is turned off when SP management code (the IPMI stack) is started.
  • At exit of BIOS POST, the LED goes to STEADY ON state.
Logged: Not logged
Fatal?: Fatal

Error: SP failure
Description: The SP boots but fails POST.
Handling: The SP controls the system reset, so the system does not come out of reset.
Logged: Not logged
Fatal?: Fatal

Error: BIOS POST failure
Description: The server BIOS does not pass POST.
Handling: There are fatal and non-fatal errors in POST. The BIOS detects some errors and announces them during POST as POST codes in the bottom right corner of the display, on both the serial console and the video display. Some POST codes are forwarded to the SP for logging.
  The POST codes do not appear in sequential order and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs.
  In the case of early POST failures (for example, the BSP fails to operate correctly), the BIOS halts without logging.
  For some other POST failures that occur after memory and SP initialization, the BIOS logs a message to the SP's SEL.
Logged: (no entry)
Fatal?: (no entry)

Error: Single-bit DRAM ECC error
Description: With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface.
Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The BIOS SMI handler polls for these errors, and the polling is triggered every half-second by SMI timer interrupts.
  The BIOS SMI handler logs each detected error until the limit for the same error is reached, and then stops logging it. The BIOS polling can be disabled through a software interface.
Logged: SP SEL
Fatal?: Normal operation

Error: Single four-bit DRAM error
Description: With ChipKill enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface.
Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The BIOS SMI handler polls for these errors, and the polling is triggered every half-second by SMI timer interrupts.
  The BIOS SMI handler logs each detected error until the limit for the same error is reached, and then stops logging it. The BIOS polling can be disabled through a software interface.
Logged: SP SEL
Fatal?: Normal operation

Error: Uncorrectable DRAM ECC error
Description: The CPU detects an uncorrectable multiple-bit DIMM error.
Handling: The "sync flood" method is used to prevent the erroneous data from propagating across the HyperTransport links. The system reboots, and the BIOS recovers the machine check register information, maps this information to the failing DIMM (when ChipKill is disabled) or DIMM pair (when ChipKill is enabled), and logs that information to the SP.
  The BIOS halts the CPU.
Logged: SP SEL
Fatal?: Fatal

Error: Unsupported DIMM configuration
Description: Unsupported DIMMs are used, or supported DIMMs are loaded improperly.
Handling: The BIOS displays an error message, logs an error, and halts the system.
Logged: DMI Log, SP SEL
Fatal?: Fatal

Error: HyperTransport link failure
Description: A CRC or link error occurs on one of the HyperTransport links.
Handling: A sync flood occurs on the HyperTransport links, the machine resets itself, and the error information is retained through the reset.
  The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
Logged: DMI Log, SP SEL
Fatal?: Fatal

Error: PCI SERR, PERR
Description: A system or parity error occurs on a PCI bus.
Handling: A sync flood occurs on the HyperTransport links, the machine resets itself, and the error information is retained through the reset.
  The BIOS reports: A Hyper Transport sync flood error occurred on last boot, press F1 to continue.
Logged: DMI Log, SP SEL
Fatal?: Fatal

Error: BIOS POST Microcode Error
Description: The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new CPU is installed in a motherboard with an outdated BIOS. In this case, the BIOS must be updated.
Handling: The BIOS displays an error message, logs the error to DMI, and boots.
Logged: DMI Log
Fatal?: Non-fatal

Error: BIOS POST CMOS Checksum Bad
Description: The CMOS contents failed the checksum check.
Handling: The BIOS displays an error message, logs the error to DMI, and boots.
Logged: DMI Log
Fatal?: Non-fatal

Error: Unsupported CPU configuration
Description: The BIOS supports mismatched frequencies and steppings in a CPU configuration, but some CPUs might not be supported.
Handling: The BIOS displays an error message, logs the error, and halts the system.
Logged: DMI Log
Fatal?: Fatal

Error: Correctable error
Description: The CPU detects a variety of correctable errors in the MCi_STATUS registers.
Handling: The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The BIOS SMI handler polls for these errors, and the polling is triggered every half-second by SMI timer interrupts.
  The SMI handler logs a message to the SP SEL if the SEL is available. Otherwise, the SMI handler logs a message to DMI. The BIOS polling can be disabled through a software SMI.
Logged: DMI Log, SP SEL
Fatal?: Normal operation

Error: Single fan failure
Description: Fan failure is detected by reading tach signals.
Handling: The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.
Logged: SP SEL
Fatal?: Non-fatal

Error: Multiple fan failure
Description: Fan failure is detected by reading tach signals.
Handling: The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.
Logged: SP SEL
Fatal?: Fatal

Error: Single power supply failure
Description: Any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals is deasserted.
Handling: The Service Action Required and Power Supply Fault LEDs are lit.
Logged: SP SEL
Fatal?: Non-fatal

Error: DC/DC power converter failure
Description: Any POWER_GOOD signal is deasserted from the DC/DC converters.
Handling: The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state.
Logged: SP SEL
Fatal?: Fatal

Error: Voltage above/below Threshold
Description: The SP monitors system voltages and detects voltage above or below a given threshold.
Handling: The Service Action Required LED and Power Supply Fault LED blink.
Logged: SP SEL
Fatal?: Fatal

Error: High temperature
Description: The SP monitors CPU and system temperatures, and detects temperatures above a given threshold.
Handling: The Service Action Required LED and System Overheat Fault LED blink. The motherboard is shut down when the temperature exceeds the specified critical level.
Logged: SP SEL
Fatal?: Fatal

Error: Processor thermal trip
Description: The CPU drives the THERMTRIP_L signal upon detecting an overtemp condition.
Handling: The CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink.
Logged: SP SEL
Fatal?: Fatal

Error: Boot device failure
Description: The BIOS is not able to boot from a device in the boot device list.
Handling: The BIOS goes to the next boot device in the list. If all devices in the list fail, an error message is displayed and the BIOS retries from the beginning of the list. The SP can control or change the boot order.
Logged: DMI Log
Fatal?: Non-fatal
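For triage scripts, the Logged and Fatal? columns of TABLE E-1 can be kept as a small lookup table that tells you which log to check first for a given error. The following Python sketch transcribes those two columns as they appear in the table. The BIOS POST failure row is omitted because the table gives no Logged or Fatal? entry for it, and the two SP failure rows are merged because their entries are identical. The names ERROR_SUMMARY, logs_for, and errors_with_severity are hypothetical.

```python
# (logs, severity) per error, transcribed from TABLE E-1.
ERROR_SUMMARY = {
    "SP failure": ((), "Fatal"),
    "Single-bit DRAM ECC error": (("SP SEL",), "Normal operation"),
    "Single four-bit DRAM error": (("SP SEL",), "Normal operation"),
    "Uncorrectable DRAM ECC error": (("SP SEL",), "Fatal"),
    "Unsupported DIMM configuration": (("DMI Log", "SP SEL"), "Fatal"),
    "HyperTransport link failure": (("DMI Log", "SP SEL"), "Fatal"),
    "PCI SERR, PERR": (("DMI Log", "SP SEL"), "Fatal"),
    "BIOS POST Microcode Error": (("DMI Log",), "Non-fatal"),
    "BIOS POST CMOS Checksum Bad": (("DMI Log",), "Non-fatal"),
    "Unsupported CPU configuration": (("DMI Log",), "Fatal"),
    "Correctable error": (("DMI Log", "SP SEL"), "Normal operation"),
    "Single fan failure": (("SP SEL",), "Non-fatal"),
    "Multiple fan failure": (("SP SEL",), "Fatal"),
    "Single power supply failure": (("SP SEL",), "Non-fatal"),
    "DC/DC power converter failure": (("SP SEL",), "Fatal"),
    "Voltage above/below Threshold": (("SP SEL",), "Fatal"),
    "High temperature": (("SP SEL",), "Fatal"),
    "Processor thermal trip": (("SP SEL",), "Fatal"),
    "Boot device failure": (("DMI Log",), "Non-fatal"),
}


def logs_for(error):
    """Return the logs in which the given error is recorded (empty if not logged)."""
    return ERROR_SUMMARY[error][0]


def errors_with_severity(severity):
    """Return the sorted names of all errors with the given Fatal? entry."""
    return sorted(
        name for name, (_, sev) in ERROR_SUMMARY.items() if sev == severity
    )
```

For example, logs_for("PCI SERR, PERR") returns both the DMI Log and the SP SEL, while logs_for("SP failure") returns an empty tuple because that failure is not logged.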