APPENDIX E

Error Handling

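This appendix describes how the servers process and log errors. See the following sections:

  • Handling of Uncorrectable Errors
  • Handling of Correctable Errors
  • Handling of Parity Errors (PERR)
  • Handling of System Errors (SERR)
  • Handling Mismatching Processors
  • Hardware Error Handling Summary
  • Enabling ILOM Diagnostics in BIOS Error Handling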


Handling of Uncorrectable Errors

This section lists facts and considerations about how the server handles uncorrectable errors.



Note - The BIOS ChipKill feature must be disabled if you are testing for failures of multiple bits within a DRAM, because ChipKill corrects for the failure of a four-bit-wide DRAM.




Note - If the error is in the low 1 MB of memory, the BIOS freezes after rebooting. Therefore, no DMI log entry is recorded.


FIGURE E-1 shows an example of a DMI log screen from the BIOS Setup page.

FIGURE E-1 DMI Log Screen, Uncorrectable Error


Graphic showing a DMI log screen with a sample uncorrectable error message.


Handling of Correctable Errors

This section lists facts and considerations about how the server handles correctable errors.

FIGURE E-2 Sample Windows 2003 Server Event Properties Screen

FIGURE E-3 DMI Log Screen, Correctable Error


Graphic showing a DMI log screen with a sample correctable error message.

EXAMPLE E-1 DMI Log Screen, Correctable Error, Memory Decreased

Graphic showing a DMI log screen with a sample memory decreased error message.
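
TABLE E-1 in this appendix notes that the BIOS SMI handler polls for correctable errors every half second, logs each detected error, and stops logging once the limit for the same error is reached. The following Python sketch illustrates that throttling idea only. It is not the BIOS implementation: the class name, the function name, and the limit of 3 are hypothetical, and the actual limit is not stated in this appendix.

```python
from collections import Counter
from dataclasses import dataclass, field

POLL_INTERVAL_S = 0.5    # from the text: polling is triggered every half second
MAX_LOGS_PER_ERROR = 3   # hypothetical limit; the real value is not documented here


@dataclass
class ThrottledErrorLog:
    """Keeps log entries, but stops logging an error once it hits the limit."""
    limit: int = MAX_LOGS_PER_ERROR
    entries: list = field(default_factory=list)
    _seen: Counter = field(default_factory=Counter)

    def record(self, source: str, kind: str) -> bool:
        """Log the error unless this (source, kind) pair already reached the limit."""
        key = (source, kind)
        if self._seen[key] >= self.limit:
            return False          # same error seen too often: suppress
        self._seen[key] += 1
        self.entries.append(key)
        return True


def poll_once(pending_errors, log):
    """One polling pass: try to log every error found since the last pass.

    Returns the number of entries actually written to the log.
    """
    return sum(log.record(source, kind) for source, kind in pending_errors)
```

In this sketch a repeating error from one DIMM is logged only up to the limit, while a first error from a different DIMM is still recorded.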


Handling of Parity Errors (PERR)

This section lists facts and considerations about how the server handles parity errors (PERR).

FIGURE E-4 DMI Log Screen, PCI Parity Error


Graphic showing a DMI log screen with a sample PCI parity error message displayed.



Note - The Linux system reboots, but does not inform the BIOS of this incident.


*Hardware Malfunction Call your hardware vendor for support*
*** The system has halted ***

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

"NMICrashDump"=dword:00000001 ( 1=Windows will check on NMI )

For details about the Windows NMI mechanism, refer to Dump Switch Support for Windows in your Windows documentation.
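
As an illustration, the NMICrashDump registry setting shown above could be written out as a registry file fragment. The Python sketch below only builds the text of such a fragment. The function name is hypothetical, and the `Windows Registry Editor Version 5.00` header is the standard first line of a .reg file rather than something taken from this appendix. Review the result against your Windows documentation before importing it.

```python
def build_nmi_crashdump_reg(enable: bool = True) -> str:
    """Return .reg file text that sets NMICrashDump under CrashControl.

    Hypothetical helper for illustration; it does not touch the registry.
    """
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"
    value = 1 if enable else 0   # 1 = Windows will check on NMI
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"NMICrashDump"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_nmi_crashdump_reg())
```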


Handling of System Errors (SERR)

This section lists facts and considerations about how the server handles system errors (SERR).

FIGURE E-5 POST Screen, Previous System Error Listed


Graphic showing a POST screen with a sample previous system error message displayed.

FIGURE E-6 DMI Log Screen with Error


Graphic showing a DMI log screen with a sample system error message displayed.


Handling Mismatching Processors

This section lists facts and considerations about how the server handles mismatching processors.


Hardware Error Handling Summary

TABLE E-1 summarizes the most common hardware errors that you might encounter with these servers.


TABLE E-1 Hardware Error Handling Summary

Error

Description

Handling

Logged (DMI Log or SP SEL)

Fatal?

SP failure

The SP fails to boot upon application of system power.

The SP controls the system reset, so the system may power on, but will not come out of reset.

  • During power up, the SP's boot loader turns on the power LED.
  • During SP boot, Linux startup, and SP sanity check, the power LED blinks.
  • The LED is turned off when SP management code (the IPMI stack) is started.
  • At exit of BIOS POST, the LED goes to STEADY ON state.

Not logged

Fatal

SP failure

SP boots but fails POST.

The SP controls the system reset, so the system will not come out of reset.

Not logged

Fatal

BIOS POST failure

Server BIOS does not pass POST.

POST errors can be fatal or non-fatal. The BIOS detects some errors and announces them during POST as POST codes in the bottom-right corner of the display, both on the serial console and on the video display. Some POST codes are forwarded to the SP for logging.

The POST codes do not come out in sequential order and some are repeated, because some POST codes are issued by code in add-in card BIOS expansion ROMs.

In the case of early POST failures (for example, the BSP fails to operate correctly), the BIOS halts without logging.

For some other POST failures subsequent to memory and SP initialization, the BIOS logs a message to the SP’s SEL.

 

 

Single-bit DRAM ECC error

With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface.

The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler.

The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.

SP SEL

Normal operation

Single four-bit DRAM error

With ChipKill enabled in the BIOS Setup, the CPU detects and corrects for the failure of a four-bit-wide DRAM on the DIMM interface.

The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half-second by SMI timer interrupts and is done by the BIOS SMI handler.

The BIOS SMI handler starts logging each detected error and stops logging when the limit for the same error is reached. The BIOS's polling can be disabled through a software interface.

SP SEL

Normal operation

Uncorrectable DRAM ECC error

The CPU detects an uncorrectable multiple-bit DIMM error.

The “sync flood” method is used to prevent the erroneous data from being propagated across the HyperTransport links. The system reboots, and the BIOS recovers the machine check register information, maps this information to the failing DIMM (when ChipKill is disabled) or DIMM pair (when ChipKill is enabled), and logs that information to the SP.

The BIOS will halt the CPU.

SP SEL

Fatal

Unsupported DIMM configuration

Unsupported DIMMs are used, or supported DIMMs are loaded improperly.

The BIOS displays an error message, logs an error, and halts the system.

DMI Log
SP SEL

Fatal

HyperTransport link failure

CRC or link error on one of the HyperTransport links.

A sync flood occurs on the HyperTransport links, the machine resets itself, and the error information is retained through the reset.

The BIOS reports: "A Hyper Transport sync flood error occurred on last boot, press F1 to continue."

DMI Log
SP SEL

Fatal

PCI SERR, PERR

System or parity error on a PCI bus.

A sync flood occurs on the HyperTransport links, the machine resets itself, and the error information is retained through the reset.

The BIOS reports: "A Hyper Transport sync flood error occurred on last boot, press F1 to continue."

DMI Log
SP SEL

Fatal

BIOS POST Microcode Error

The BIOS could not find or load the CPU Microcode Update to the CPU. The message most likely appears when a new CPU is installed in a motherboard with an outdated BIOS. In this case, the BIOS must be updated.

The BIOS displays an error message, logs the error to DMI, and boots.

DMI Log

Non-fatal

BIOS POST CMOS Checksum Bad

CMOS contents failed the Checksum check.

The BIOS displays an error message, logs the error to DMI, and boots.

DMI Log

Non-fatal

Unsupported CPU configuration

The BIOS supports mismatched frequency and steppings in CPU configuration, but some CPUs might not be supported.

The BIOS displays an error message, logs the error, and halts the system.

DMI Log

Fatal

Correctable error

The CPU detects a variety of correctable errors in the MCi_STATUS registers.

The CPU corrects the error in hardware. No interrupt or machine check is generated by the hardware. The polling is triggered every half second by SMI timer interrupts, and is done by the BIOS SMI handler.

The SMI handler logs a message to the SP SEL if the SEL is available. Otherwise, the SMI handler logs a message to DMI. The BIOS's polling can be disabled through a software SMI.

DMI Log
SP SEL

Normal operation

Single fan failure

Fan failure is detected by reading tach signals.

The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.

SP SEL

Non-fatal

Multiple fan failure

Fan failure is detected by reading tach signals.

The Front Fan Fault, Service Action Required, and individual fan module LEDs are lit.

SP SEL

Fatal

Single power supply failure

Any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals is deasserted.

The Service Action Required and Power Supply Fault LEDs are lit.

SP SEL

Non-fatal

DC/DC power converter failure

Any POWER_GOOD signal is deasserted from the DC/DC converters.

The Service Action Required LED is lit, the system is powered down to standby power mode, and the Power LED enters standby blink state.

SP SEL

Fatal

Voltage above/below threshold

The SP monitors system voltages and detects voltage above or below a given threshold.

The Service Action Required LED and Power Supply Fault LED blink.

SP SEL

Fatal

High temperature

The SP monitors CPU and system temperatures, and detects temperatures above a given threshold.

The Service Action Required LED and System Overheat Fault LED blink. The motherboard is shut down above the specified critical level.

SP SEL

Fatal

Processor thermal trip

The CPU drives the THERMTRIP_L signal upon detecting an overtemp condition.

The CPLD shuts down power to the CPU. The Service Action Required LED and System Overheat Fault LED blink.

SP SEL

Fatal

Boot device failure

The BIOS is not able to boot from a device in the boot device list.

The BIOS goes to the next boot device in the list. If all devices in the list fail, the BIOS displays an error message and retries from the beginning of the list. The SP can control or change the boot order.

DMI Log

Non-fatal
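
When triaging a fault, it can help to have the "Logged" and "Fatal?" columns of TABLE E-1 available as data. The following Python sketch transcribes a few rows of the table and shows two lookups. The type and function names are hypothetical, and only a subset of the rows is included.

```python
from typing import NamedTuple


class ErrorRow(NamedTuple):
    error: str
    logged_in: tuple   # subset of ("DMI Log", "SP SEL")
    outcome: str       # "Fatal", "Non-fatal", or "Normal operation"


# A subset of TABLE E-1, copied from the Error, Logged, and Fatal? columns.
ROWS = [
    ErrorRow("Single-bit DRAM ECC error", ("SP SEL",), "Normal operation"),
    ErrorRow("Uncorrectable DRAM ECC error", ("SP SEL",), "Fatal"),
    ErrorRow("HyperTransport link failure", ("DMI Log", "SP SEL"), "Fatal"),
    ErrorRow("BIOS POST CMOS Checksum Bad", ("DMI Log",), "Non-fatal"),
    ErrorRow("Single fan failure", ("SP SEL",), "Non-fatal"),
    ErrorRow("Multiple fan failure", ("SP SEL",), "Fatal"),
    ErrorRow("Boot device failure", ("DMI Log",), "Non-fatal"),
]


def logs_to_check(error_name: str) -> tuple:
    """Return the logs that record this error, or () if it is not in ROWS."""
    for row in ROWS:
        if row.error.lower() == error_name.lower():
            return row.logged_in
    return ()


def fatal_errors() -> list:
    """Return the names of the transcribed errors that the table marks Fatal."""
    return [row.error for row in ROWS if row.outcome == "Fatal"]
```

For example, `logs_to_check("HyperTransport link failure")` points to both the DMI log and the SP SEL, matching the table.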


 


Enabling ILOM Diagnostics in BIOS Error Handling

Enabling this feature allows service and support engineers to collect hardware dumps and diagnostic data about critical system states. The feature is not meant for generic hardware failures such as uncorrectable FRU errors. It is intended for unknown hardware failures that cause OS freezes or hardware resets and leave no diagnostic trace.

This feature works in conjunction with the system’s ILOM web interface and requires that the most recent ILOM firmware is installed and running. The ILOM version in SW 3.0.2 is recommended.


To Enable ILOM Diagnostics in the BIOS

1. Boot up the system.

2. During POST, press F2 to access the BIOS setup utility.

3. Select the Advanced Menu.

4. Select Error Handling.


5. Enable ILOM Diagnostics.

6. Press F10 to save and exit the BIOS Setup Utility.

When the ILOM Diagnostics feature is enabled, the following occurs.

Upon a system reset, the BIOS freezes execution at a very early bootblock stage, before any hardware component is fully initialized. This leaves the hardware status registers across the system’s PCI hierarchy and HyperTransport mesh untouched, which preserves the fault report information. The system’s video terminal or JavaRConsole screen stays blank.


To View the Diagnostic Information

1. Log in to the system’s ILOM web interface as root.

2. Navigate to Maintenance --> Snapshot.

3. In the Data Set field, select Full (may reset the host).


4. Click Run.

This operation might take a few minutes to complete. It downloads a full snapshot of the system status, including the diagnostic output of the hdtl sunservice utility, to the folder that you select.

5. Send this diagnostic snapshot bundle to Oracle Service for analysis and possible root-cause determination.