Oracle® Linux Fault Management Architecture Software User's Guide

Updated: October 2015

Restart fmd if mcelog Fails

The mcelog daemon might fail to start, or it might fail during normal operation. When this happens, CPU and memory errors from the host are no longer received or diagnosed.

  1. Determine if the mcelog daemon is running.

    For example:

    [root@testserver16 ~]# service mcelogd status
    Checking for mcelog
    mcelog (pid  32435) is running... 

    The status should be "running". If it is not, the daemon is either stopped or has failed.

    If mcelog is not running, the Oracle Linux FMA mce module fails, because the module depends on a working mcelog daemon.
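The check in step 1 can also be scripted. The following is a minimal sketch, not part of the product: it assumes that a healthy daemon prints a line of the form shown in the example above, and the function name `mcelog_is_running` is hypothetical. The demo classifies the sample line only, so no daemon is contacted.

```shell
#!/bin/sh
# Sketch: classify the text printed by `service mcelogd status`.
# Assumption: a healthy daemon prints "mcelog (pid N) is running...",
# as in the example in step 1. Any other text counts as "not running".
mcelog_is_running() {
    # Reads status text on stdin; succeeds only if a "running" line is present.
    grep -q 'mcelog (pid *[0-9][0-9]*) is running'
}

# Demo with the sample line from step 1.
sample='mcelog (pid  32435) is running...'
if printf '%s\n' "$sample" | mcelog_is_running; then
    echo "mcelog: running"
else
    echo "mcelog: stopped or failed"
fi
```

On a live host, the same function could be fed directly, for example `service mcelogd status | mcelog_is_running`.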

  2. If the mcelog daemon is running, check the status of the Oracle Linux FMA modules.

    To list the status of all fault manager modules:

    [root@testserver16 ~]# fmadm config
    MODULE                   VERSION STATUS  DESCRIPTION
    ext-event-transport      0.2     active  External FM event transport
    fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
    ip-transport             1.1     active  IP Transport Agent
    mce                      1.0     failed  Machine Check Translator
    sysevent-transport       1.0     active  SysEvent Transport Agent
    syslog-msgs              1.1     active  Syslog Messaging Agent

    In this example, the mce module has a "failed" status. CPU and memory machine check events are therefore not being monitored on the host and, consequently, are not being logged or diagnosed in the fault management database.
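To read the module status from a script rather than by eye, the STATUS column can be picked out of the `fmadm config` listing. This is a sketch that assumes the four-column layout shown in the example above. The helper name `module_status` is hypothetical, and the demo parses a copy of the sample listing instead of calling `fmadm`.

```shell
#!/bin/sh
# module_status NAME
# Reads `fmadm config` output on stdin and prints the STATUS column for the
# module called NAME. Returns non-zero if the module is not listed.
# Assumption: columns are MODULE, VERSION, STATUS, DESCRIPTION, as shown above.
module_status() {
    awk -v mod="$1" '$1 == mod { print $3; found = 1 } END { exit !found }'
}

# Demo with lines copied from the sample listing in step 2.
sample='MODULE                   VERSION STATUS  DESCRIPTION
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
mce                      1.0     failed  Machine Check Translator'

printf '%s\n' "$sample" | module_status mce    # prints: failed
```

On a live host, the equivalent call would be `fmadm config | module_status mce`.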

  3. If the Oracle Linux FMA mce module has failed, confirm the cause of the failure using fmdump.

    For example:

    [root@testserver16 ~]# fmdump -Ve
    
    
    Jan 21 2014 09:56:05.930589483 ereport.fm.fmd.module
    nvlist version: 0
    	version = 0x0
    	class = ereport.fm.fmd.module
    	detector = (embedded nvlist)
    	nvlist version: 0
    		version = 0x1
    		scheme = fmd
    		authority = (embedded nvlist)
    		nvlist version: 0
    			version = 0x0
    			system-mfg = unknown
    			system-name = unknown
    			system-part = unknown
    			system-serial = unknown
    			sys-comp-mfg = unknown
    			sys-comp-name = unknown
    			sys-comp-part = unknown
    			sys-comp-serial = unknown
    			server-name = testserver16
    			host-id = ffffffff990a7a4a
    		(end authority)
    
    		mod-name = mce
    		mod-version = 1.0
    	(end detector)
    
    	ena = 0x3631d6cd9f6c0001
    	msg = mcelog not running!: client requested that module execution abort
    	errno = 1072
    	errclass = ereport.fm.fmd.hdl_abort
    	__ttl = 0x1
    	__tod = 0x52de8a85 0x3777ab2b

    In this example, the "msg" field reports that mcelog is not running, which is the cause of the mce module failure.
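The fields of interest can be pulled out of the verbose `fmdump` listing with a small filter. The sketch below assumes the "name = value" layout shown in the example above. It also assumes that `__tod` holds the event time as two hexadecimal words, seconds and nanoseconds, which is consistent with the fractional seconds in the event header. The helper names `ereport_field` and `tod_decimal` are hypothetical.

```shell
#!/bin/sh
# ereport_field NAME
# Prints the value of the first "NAME = value" line read on stdin.
ereport_field() {
    sed -n "s/^[[:space:]]*$1 = //p" | head -n 1
}

# tod_decimal
# Reads a "__tod" value ("0xSEC 0xNSEC") on stdin and prints SEC.NSEC in decimal.
# Assumption: the first word is seconds since the epoch, the second nanoseconds.
tod_decimal() {
    read -r sec nsec
    printf '%d.%09d\n' "$sec" "$nsec"
}

# Demo with lines copied from the sample event in step 3.
sample='    mod-name = mce
    msg = mcelog not running!: client requested that module execution abort
    errno = 1072
    __tod = 0x52de8a85 0x3777ab2b'

printf '%s\n' "$sample" | ereport_field msg
# prints: mcelog not running!: client requested that module execution abort

printf '%s\n' "$sample" | ereport_field __tod | tod_decimal
# prints: 1390316165.930589483
```

The nanosecond part, 930589483, matches the fraction in the event header time 09:56:05.930589483.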

  4. Once you have determined that the mcelog daemon is the problem, restart it.

    For example:

    [root@testserver16 ~]# service mcelogd start
    Starting mcelog daemon

  5. Verify that mcelog is running.

    For example:

    [root@testserver16 ~]# service mcelogd status
    Checking for mcelog
    mcelog (pid  32498) is running... 

  6. Unload the Oracle Linux FMA mce module.

    For example:

    [root@testserver16 ~]# fmadm unload mce

    Doing this generates a fault event that you can identify in the fault management database.

  7. Confirm that the unloading of the mce module is captured in the fault management database.

    For example:

    [root@testserver16 ~]# fmadm faulty
    --------------- ------------------------------------  -------------- ---------
    TIME            EVENT-ID                              MSG-ID         SEVERITY
    --------------- ------------------------------------  -------------- ---------
    Jan 21 11:35:07 528fbbb9-92d4-cd7f-ef81-e2fddfd3c244  FMD-8000-2K    Minor    
    
    Problem Status    : solved
    Diag Engine       : fmd-self-diagnosis / 1.0
    System
        Manufacturer  : unknown
        Name          : unknown
        Part_Number   : unknown
        Serial_Number : unknown
        Host_ID       : ffffffff990a7a4a
    
    ----------------------------------------
    Suspect 1 of 1 :
       Fault class : defect.sunos.fmd.module
       Certainty   : 100%
       Affects     : fmd:///module/mce
       Status      : faulted and taken out of service
    
    Description : A Linux Fault Manager component has experienced an error that
                  required the module to be disabled.
    
    Response    : The module has been disabled.  Events destined for the module
                  will be saved for manual diagnosis.
    
    Impact      : Automated diagnosis and response for subsequent events associated
                  with this module will not occur.
    
    Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
                  Please refer to the associated reference document at
                  http://support.oracle.com/msg/FMD-8000-2K for the latest service
                  procedures and policies regarding this diagnosis.

  8. Reload the Oracle Linux FMA mce module and confirm that it is running.

    For example:

    [root@testserver16 ~]# fmadm load /opt/fma/fm/lib/fmd/plugins/mce.so
    fmadm: module '/opt/fma/fm/lib/fmd/plugins/mce.so' loaded into fault manager
    
    
    [root@testserver16 ~]# fmadm config
    MODULE                   VERSION STATUS  DESCRIPTION
    ext-event-transport      0.2     active  External FM event transport
    fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
    ip-transport             1.1     active  IP Transport Agent
    mce                      1.0     active  Machine Check Translator
    sysevent-transport       1.0     active  SysEvent Transport Agent
    syslog-msgs              1.1     active  Syslog Messaging Agent

    If the mce module does not unload or reload, restart the fault manager, as follows:

    [root@testserver16 ~]# service fmd.init restart
    Stopping fmd:                                              [  OK  ]
    Starting fmd:                                              [  OK  ]
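The recovery part of this procedure (steps 4, 6, and 8, with the fault manager restart as a fallback) can be sketched as a script. This is an illustration under stated assumptions, not a supported tool. It assumes that `service` and `fmadm` return a non-zero exit status on failure, which this guide does not state. The names `run`, `recover_mce`, `DRY_RUN`, and `MCE_PLUGIN` are hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root) to execute them.
DRY_RUN=${DRY_RUN:-1}
MCE_PLUGIN=/opt/fma/fm/lib/fmd/plugins/mce.so   # module path from step 8

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_mce: start mcelog, then unload and reload the mce module.
# If the unload or the reload fails, fall back to restarting the fault manager.
# Assumption: each command signals failure through its exit status.
recover_mce() {
    run service mcelogd start || return 1
    if run fmadm unload mce && run fmadm load "$MCE_PLUGIN"; then
        return 0
    fi
    run service fmd.init restart
}

recover_mce
```

After a real run, confirm the result with `service mcelogd status` and `fmadm config`, as in steps 5 and 8.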