Oracle® Linux Fault Management Architecture Software User's Guide

Updated: October 2015
 
 

Notification of Faults and Defects

When the mcelog daemon encounters an error, it triggers a configurable response and logs information to the mcelog file. For example, assume that physical address 0x45a3b50c0 generates a correctable memory read error. The mcelog daemon then adds an entry such as the following to /var/log/mcelog:

CPU 8
BANK 3
TSC 0
RIP 00:0
MISC 0x85
ADDR 0x45a3b50c0    <------ address that had the correctable read error
STATUS 0x9c000000f00c009f
MCGSTATUS 0x7
PROCESSOR 0:0x306f1
TIME 1389814624
SOCKETID 0
APICID 18
MCGCAP 0x7000c16
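
The fields of such an entry can also be read programmatically. The following Python sketch is illustrative only and is not part of mcelog: the helper name parse_mcelog_record is hypothetical, and the code assumes that each line is a "KEY value" pair and that TIME holds seconds since the Unix epoch. Under that assumption the TIME value decodes to 2014-01-15 19:37:04 UTC.

```python
from datetime import datetime, timezone

# Sample record copied from the entry above (the annotation arrow is omitted).
RECORD = """\
CPU 8
BANK 3
TSC 0
RIP 00:0
MISC 0x85
ADDR 0x45a3b50c0
STATUS 0x9c000000f00c009f
MCGSTATUS 0x7
PROCESSOR 0:0x306f1
TIME 1389814624
SOCKETID 0
APICID 18
MCGCAP 0x7000c16
"""


def parse_mcelog_record(text):
    """Split each 'KEY value' line into a dictionary entry (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


fields = parse_mcelog_record(RECORD)
addr = int(fields["ADDR"], 16)  # physical address of the error
# Assumption: TIME is seconds since the Unix epoch.
when = datetime.fromtimestamp(int(fields["TIME"]), tz=timezone.utc)
print(f"CPU {fields['CPU']} bank {fields['BANK']} address {addr:#x} at {when.isoformat()}")
```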

A message is also sent to the system log (/var/log/messages) describing the problem (error count exceeded threshold) and what was done (offlining the page), such as:

1  Jan 15 14:37:04 testserver16 kernel: Machine check poll done on CPU 8
2  Jan 15 14:37:04 testserver16 mcelog: Family 6 Model 3f CPU: only decoding 
architectural errors
3  Jan 15 14:37:04 testserver16 mcelog: corrected Socket memory error count 
exceeded threshold: 1 in 24h
4  Jan 15 14:37:04 testserver16 mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
5  Jan 15 14:37:04 testserver16 mcelog: Corrected memory errors on page 45a3b5000 
exceed threshold 1 in 24h: 1 in 24h
6  Jan 15 14:37:04 testserver16 mcelog: Location SOCKET:0 CHANNEL:? DIMM:? []
7  Jan 15 14:37:04 testserver16 mcelog: Running trigger `page-error-trigger'
8  Jan 15 14:37:04 testserver16 mcelog: Offlining page 45a3b5000

The message on line 5 indicates that the correctable error threshold was set to 1 error in 24 hours. Because this threshold was exceeded, page 0x45a3b5000 was removed from service, as indicated by the "Offlining page" message (line 8) in the system log. The process that encountered the correctable error is either assigned a new page or killed, depending on the "memory-ce-action" value in the "page" section of the mcelog.conf file.
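
The relationship between the faulting address in the mcelog entry and the offlined page can be checked with a small calculation. The Python sketch below is a minimal illustration that assumes a 4 KiB page size, which is consistent with the addresses in the log but is not stated in it. The names page_base and LOG_LINE are hypothetical.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, consistent with the log excerpt


def page_base(addr, page_size=PAGE_SIZE):
    """Return the start address of the page that contains addr."""
    return addr & ~(page_size - 1)


# Message copied from line 8 of the system log excerpt (line number dropped).
LOG_LINE = "Jan 15 14:37:04 testserver16 mcelog: Offlining page 45a3b5000"

# Pull the hexadecimal page address out of the "Offlining page" message.
match = re.search(r"mcelog: Offlining page ([0-9a-f]+)", LOG_LINE)
offlined = int(match.group(1), 16)

faulting_addr = 0x45a3b50c0  # ADDR field from the mcelog entry
print(hex(page_base(faulting_addr)))         # 0x45a3b5000
print(page_base(faulting_addr) == offlined)  # True
```

Clearing the low 12 bits of the address gives the start of the page that contains it, which is why the log reports page 45a3b5000 for address 0x45a3b50c0.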

In addition to the page being offlined, if the DIMM corresponding to the failed address exceeds the factory-programmed DIMM threshold, the service processor (SP) generates a fault that is forwarded to the host and logged in the fault management database.

Often, the first interaction with the Fault Manager daemon is a system message indicating that a fault or defect has been diagnosed. Messages are sent to both the console and the /var/log/messages file. All messages from the Fault Manager daemon use the following format:

1    SUNW-MSG-ID: SPX86A-8002-30, TYPE: Fault, VER: 1, SEVERITY: Minor
2    EVENT-TIME: Wed Nov 27 10:36:30 PST 2013
3    PLATFORM: SUN SERVER X4-4, CSN: -, HOSTNAME: testserver16
4    SOURCE: fdd, REV: 1.0
5    EVENT-ID: eed2208e-2dcf-40c9-9bab-ab3a13e94182
6    DESC: A processor has detected multiple memory controller correctable
7    errors.
8    AUTO-RESPONSE: The affected processor will be disabled at the next system boot
9    and remain unavailable until repaired.  
10   The chassis wide and processor service-required LED's are illuminated.
11   IMPACT: The system will continue to operate in the presence of this
12   fault.
13   System performance may be impacted due to disabled processor.
14   REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this 
15   event. Please refer to the associated reference document at 
16   http://support.oracle.com/msg/SUN4V-8001-8H for the latest service procedures and 
17   policies regarding this diagnosis.
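
Because the first line of every message follows the same comma-separated "KEY: value" layout, it can be split into fields by a script that watches /var/log/messages. The following Python sketch is illustrative only: parse_header is a hypothetical helper, not an Oracle tool, and it assumes that the leading line number has already been removed.

```python
# First line of the sample message above, without its line number.
HEADER = "SUNW-MSG-ID: SPX86A-8002-30, TYPE: Fault, VER: 1, SEVERITY: Minor"


def parse_header(line):
    """Split a 'KEY: value, KEY: value' header into a dictionary (hypothetical helper)."""
    fields = {}
    for item in line.split(", "):
        key, _, value = item.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


header = parse_header(HEADER)
print(header["SUNW-MSG-ID"], header["TYPE"], header["SEVERITY"])
```

A script could, for example, use the SEVERITY field to decide which messages to escalate.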

When notified of a diagnosed problem, always consult the recommended Oracle Knowledge Article for additional details. See line 16 above for an example. The knowledge article might contain additional actions that you or a service provider should take beyond those listed on line 14.

Notification of events can also be configured in Oracle ILOM using the Simple Network Management Protocol (SNMP) or the Simple Mail Transfer Protocol (SMTP). See the Oracle ILOM documentation at: http://www.oracle.com/goto/ILOM/docs

In addition, Oracle Auto Service Request (ASR) can be configured to automatically request Oracle service when supported telemetry sources (such as Oracle ILOM) report specific hardware problems. See the Oracle Auto Service Request product page for information about this feature. The documentation link on that page provides links to the Oracle ASR Quick Installation Guide and the Oracle ASR Installation and Operations Guide.