Oracle® Linux Fault Management Architecture Software User's Guide

Updated: October 2015

Restart fmd if mcelog Fails

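The mcelog daemon can fail to start, or can fail during normal operation, for a variety of reasons. When this happens, CPU and memory errors from the host are no longer received or diagnosed.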

  1. Determine whether the mcelog daemon is running.

    For example:

    [root@testserver16 ~]# service mcelogd status
    Checking for mcelog
    mcelog (pid  32435) is running... 

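    The status should be "running". If it is not, mcelog might have stopped or failed.

    If mcelog is not running or has failed, the Oracle Linux FMA mce module also fails, because the module depends on a working mcelog daemon.

  2. If the mcelog daemon is running, check the status of the Oracle Linux FMA modules.

    List the status of all Fault Manager modules: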

    [root@testserver16 ~]# fmadm config
    MODULE                   VERSION STATUS  DESCRIPTION
    ext-event-transport      0.2     active  External FM event transport
    fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
    ip-transport             1.1     active  IP Transport Agent
    mce                      1.0     failed  Machine Check Translator
    sysevent-transport       1.0     active  SysEvent Transport Agent
    syslog-msgs              1.1     active  Syslog Messaging Agent

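    In the example above, the mce module has a status of "failed". This means that the host is not monitoring CPU or memory machine check events, so these events are neither logged in the fault management database nor diagnosed.

  3. If the Oracle Linux FMA mce module has failed, use fmdump to determine the cause of the failure.

    For example: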

    [root@testserver16 ~]# fmdump -Ve
    
    
    Jan 21 2014 09:56:05.930589483 ereport.fm.fmd.module
    nvlist version: 0
    	version = 0x0
    	class = ereport.fm.fmd.module
    	detector = (embedded nvlist)
    	nvlist version: 0
    		version = 0x1
    		scheme = fmd
    		authority = (embedded nvlist)
    		nvlist version: 0
    			version = 0x0
    			system-mfg = unknown
    			system-name = unknown
    			system-part = unknown
    			system-serial = unknown
    			sys-comp-mfg = unknown
    			sys-comp-name = unknown
    			sys-comp-part = unknown
    			sys-comp-serial = unknown
    			server-name = testserver16
    			host-id = ffffffff990a7a4a
    		(end authority)
    
    		mod-name = mce
    		mod-version = 1.0
    	(end detector)
    
    	ena = 0x3631d6cd9f6c0001
    	msg = mcelog not running!: client requested that module execution abort
    	errno = 1072
    	errclass = ereport.fm.fmd.hdl_abort
    	__ttl = 0x1
    	__tod = 0x52de8a85 0x3777ab2b

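    In the example above, the "msg =" field reports that mcelog is not running, which is why the mce module failed.

  4. Once you have determined that the mcelog daemon is the problem, restart the daemon.

    For example: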

    [root@testserver16 ~]# service mcelogd start
    Starting mcelog daemon

  5. Verify that mcelog is running.

    For example:

    [root@testserver16 ~]# service mcelogd status
    Checking for mcelog
    mcelog (pid  32498) is running... 
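
  6. Unload the Oracle Linux FMA mce module.

    [root@testserver16 ~]# fmadm unload mce

    Unloading the module generates a fault event that you can identify in the fault management database.

  7. Confirm that the mce module unload was captured in the fault management database.

    For example: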

    [root@testserver16 ~]# fmadm faulty
    --------------- ------------------------------------  -------------- ---------
    TIME            EVENT-ID                              MSG-ID         SEVERITY
    --------------- ------------------------------------  -------------- ---------
    Jan 21 11:35:07 528fbbb9-92d4-cd7f-ef81-e2fddfd3c244  FMD-8000-2K    Minor    
    
    Problem Status    : solved
    Diag Engine       : fmd-self-diagnosis / 1.0
    System
        Manufacturer  : unknown
        Name          : unknown
        Part_Number   : unknown
        Serial_Number : unknown
        Host_ID       : ffffffff990a7a4a
    
    ----------------------------------------
    Suspect 1 of 1 :
       Fault class : defect.sunos.fmd.module
       Certainty   : 100%
       Affects     : fmd:///module/mce
       Status      : faulted and taken out of service
    
    Description : A Linux Fault Manager component has experienced an error that
                  required the module to be disabled.
    
    Response    : The module has been disabled.  Events destined for the module
                  will be saved for manual diagnosis.
    
    Impact      : Automated diagnosis and response for subsequent events associated
                  with this module will not occur.
    
    Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
                  Please refer to the associated reference document at
                  http://support.oracle.com/msg/FMD-8000-2K for the latest service
                  procedures and policies regarding this diagnosis.

  8. Reload the Oracle Linux FMA mce module and confirm that it is running.

    For example:

    [root@testserver16 ~]# fmadm load /opt/fma/fm/lib/fmd/plugins/mce.so
    fmadm: module '/opt/fma/fm/lib/fmd/plugins/mce.so' loaded into fault manager
    
    
    [root@testserver16 ~]# fmadm config
    MODULE                   VERSION STATUS  DESCRIPTION
    ext-event-transport      0.2     active  External FM event transport
    fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
    ip-transport             1.1     active  IP Transport Agent
    mce                      1.0     active  Machine Check Translator
    sysevent-transport       1.0     active  SysEvent Transport Agent
    syslog-msgs              1.1     active  Syslog Messaging Agent

    If the mce module fails to unload or reload, restart the Fault Manager as follows:

    [root@testserver16 ~]# service fmd.init restart
    Stopping fmd:                                              [  OK  ]
    Starting fmd:                                              [  OK  ]
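
The procedure above can be condensed into a few shell helpers. The following is a minimal sketch, not an Oracle-supplied tool: the function names and the MCE_PLUGIN variable are hypothetical, the commands and the plugin path are the ones shown in the steps, and the script assumes that service and fmadm exit with a non-zero status on failure, which the procedure itself does not state.

```shell
#!/bin/sh
# Sketch of the recovery procedure above. Assumption: fmadm and service
# exit non-zero on failure; the procedure only shows their console output.

# Plugin path as shown in step 8.
MCE_PLUGIN=/opt/fma/fm/lib/fmd/plugins/mce.so

# Read `fmadm config` output on stdin; print modules whose STATUS is "failed".
failed_modules() {
    awk '$3 == "failed" { print $1 }'
}

# Read `fmdump -Ve` output on stdin; print the text of each "msg =" field.
ereport_msgs() {
    sed -n 's/^[[:space:]]*msg = //p'
}

# Succeed if the status output contains "is running" (steps 1 and 5).
mcelog_running() {
    service mcelogd status 2>/dev/null | grep -q 'is running'
}

# Steps 4 to 8: start mcelog if needed, then unload and reload mce.
recover_mce() {
    if ! mcelog_running; then
        service mcelogd start || return 1
    fi
    if ! mcelog_running; then
        echo "mcelog is still not running" >&2
        return 1
    fi
    if fmadm unload mce && fmadm load "$MCE_PLUGIN"; then
        fmadm config          # confirm by eye that mce is now "active"
    else
        service fmd.init restart
    fi
}
```

With these definitions loaded in a root shell, `fmadm config | failed_modules` lists any failed modules, `fmdump -Ve | ereport_msgs` shows the reason recorded for each error report, and `recover_mce` runs the restart sequence. The helpers parse text rather than rely on exit codes wherever the procedure shows the expected output, so they stay close to what the steps document.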