The mcelog daemon might fail to start, or it might fail during normal operation. When this happens, CPU and memory errors from the host are no longer reported or diagnosed.
To check whether mcelog is running, query the service status. For example:
[root@testserver16 ~]# service mcelogd status
Checking for mcelog
mcelog (pid 32435) is running...
The status should be "running". Any other status means that the daemon is stopped or has failed.
If mcelog is stopped or has failed, the Oracle Linux FMA mce module also fails, because it depends on a working mcelog daemon.
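The status check can be scripted. The following is a minimal sketch, assuming the status text has the same format as the example above; the function name `mcelog_is_running` is hypothetical and not part of mcelog or FMA.

```shell
#!/bin/sh
# Hypothetical helper: reads service status text on stdin and succeeds
# only if it contains a line such as "mcelog (pid 32435) is running...".
# The pattern is an assumption based on the example output above.
mcelog_is_running() {
    grep -q 'mcelog (pid [0-9][0-9]*) is running'
}
```

On the host, it could be used as `service mcelogd status | mcelog_is_running || echo "mcelog is not running"`.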
To list the status of all fault manager modules:
[root@testserver16 ~]# fmadm config
MODULE                VERSION  STATUS  DESCRIPTION
ext-event-transport   0.2      active  External FM event transport
fmd-self-diagnosis    1.0      active  Fault Manager Self-Diagnosis
ip-transport          1.1      active  IP Transport Agent
mce                   1.0      failed  Machine Check Translator
sysevent-transport    1.0      active  SysEvent Transport Agent
syslog-msgs           1.1      active  Syslog Messaging Agent
In the above example, the mce module has a "failed" status. This means that CPU or memory machine check events are not being monitored by the host and, consequently, not being logged or diagnosed in the fault management database.
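To pick out failed modules without reading the whole table, the `fmadm config` output can be filtered. This is a sketch only, assuming the column layout shown above (STATUS as the third whitespace-separated field); the function name `failed_fma_modules` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `fmadm config` output on stdin and prints
# the name of every module whose STATUS column (third field) is "failed".
# Assumes module names and versions contain no spaces, as in the example.
failed_fma_modules() {
    awk '$3 == "failed" { print $1 }'
}
```

For example, `fmadm config | failed_fma_modules` would print `mce` for the listing above.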
For example:
[root@testserver16 ~]# fmdump -Ve
Jan 21 2014 09:56:05.930589483 ereport.fm.fmd.module
nvlist version: 0
        version = 0x0
        class = ereport.fm.fmd.module
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        system-mfg = unknown
                        system-name = unknown
                        system-part = unknown
                        system-serial = unknown
                        sys-comp-mfg = unknown
                        sys-comp-name = unknown
                        sys-comp-part = unknown
                        sys-comp-serial = unknown
                        server-name = testserver16
                        host-id = ffffffff990a7a4a
                (end authority)
                mod-name = mce
                mod-version = 1.0
        (end detector)
        ena = 0x3631d6cd9f6c0001
        msg = mcelog not running!: client requested that module execution abort
        errno = 1072
        errclass = ereport.fm.fmd.hdl_abort
        __ttl = 0x1
        __tod = 0x52de8a85 0x3777ab2b
In the above example, the "msg =" field reports that mcelog is not running, which is the cause of the mce module failure.
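Because the verbose event output is long, it can help to extract only the message field. The following is a minimal sketch, assuming each message appears on its own line in the form `msg = ...` as in the example above; the function name `fmd_event_messages` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `fmdump -Ve` output on stdin and prints only
# the text of each "msg = ..." line, with indentation and label removed.
fmd_event_messages() {
    sed -n 's/^[[:space:]]*msg = //p'
}
```

For example, `fmdump -Ve | fmd_event_messages` would reduce the event above to the single line that names mcelog as the cause.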
To recover, first start the mcelog daemon. For example:
[root@testserver16 ~]# service mcelogd start
Starting mcelog daemon
Verify that mcelog is now running. For example:
[root@testserver16 ~]# service mcelogd status
Checking for mcelog
mcelog (pid 32498) is running...
Unload the mce module:
[root@testserver16 ~]# fmadm unload mce
Doing this generates a fault event that you can identify in the fault management database.
For example:
[root@ban25ts12uut2 ~]# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 21 11:35:07 528fbbb9-92d4-cd7f-ef81-e2fddfd3c244 FMD-8000-2K    Minor

Problem Status    : solved
Diag Engine       : fmd-self-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown
    Host_ID       : ffffffff990a7a4a

----------------------------------------
Suspect 1 of 1 :
   Fault class : defect.sunos.fmd.module
   Certainty   : 100%
   Affects     : fmd:///module/mce
   Status      : faulted and taken out of service

Description : A Linux Fault Manager component has experienced an error that
              required the module to be disabled.

Response    : The module has been disabled. Events destined for the module
              will be saved for manual diagnosis.

Impact      : Automated diagnosis and response for subsequent events
              associated with this module will not occur.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/FMD-8000-2K for the latest service
              procedures and policies regarding this diagnosis.
Reload the mce module, then confirm that its status is "active". For example:
[root@testserver16 ~]# fmadm load /opt/fma/fm/lib/fmd/plugins/mce.so
fmadm: module '/opt/fma/fm/lib/fmd/plugins/mce.so' loaded into fault manager
[root@testserver16 ~]# fmadm config
MODULE                VERSION  STATUS  DESCRIPTION
ext-event-transport   0.2      active  External FM event transport
fmd-self-diagnosis    1.0      active  Fault Manager Self-Diagnosis
ip-transport          1.1      active  IP Transport Agent
mce                   1.0      active  Machine Check Translator
sysevent-transport    1.0      active  SysEvent Transport Agent
syslog-msgs           1.1      active  Syslog Messaging Agent
If the mce module does not unload or reload, restart the fault manager, as follows:
[root@testserver16 ~]# service fmd.init restart
Stopping fmd: [ OK ]
Starting fmd: [ OK ]
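The whole recovery sequence could be wrapped in a small script. The following is a sketch only, not an Oracle-provided tool. It uses only the commands shown in this section. It assumes that `service mcelogd status` exits with a non-zero status when the daemon is not running, and that `fmadm unload` and `fmadm load` exit with a non-zero status on failure. The function name `recover_mce_module` and the variable `MCE_PLUGIN` are hypothetical.

```shell
#!/bin/sh
# Plugin path taken from the example above; adjust it if your
# installation differs.
MCE_PLUGIN=/opt/fma/fm/lib/fmd/plugins/mce.so

# Hypothetical helper: starts mcelog, then unloads and reloads the mce
# module. If the unload or reload fails, it restarts the fault manager.
recover_mce_module() {
    # Start mcelog and stop early if the daemon does not come up.
    # Assumption: "status" exits non-zero when mcelog is not running.
    service mcelogd start || return 1
    service mcelogd status || return 1

    # Assumption: fmadm exits non-zero when unload or load fails.
    if fmadm unload mce && fmadm load "$MCE_PLUGIN"; then
        return 0
    fi

    # Fallback described above: restart the fault manager.
    service fmd.init restart
}
```

After running `recover_mce_module` as root, check `fmadm config` to confirm that the mce module status is "active".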