出于各种原因,mcelog 守护进程在常规运行过程中可能无法启动或者会失败。发生该情况时,应停止接收和诊断来自主机的 CPU 和内存错误。
例如:
[root@testserver16 ~]# service mcelogd status Checking for mcelog mcelog (pid 32435) is running...
该状态应为 "running"。如果不是 "running",则说明 mcelog 可能已停止或失败。
如果 mcelog 未运行或已失败,Oracle Linux FMA mce 模块会失败,因为只有 mcelog 守护进程正常工作,该模块才能正常运行。
列出所有 Fault Manager 模块的状态:
[root@testserver16 ~]# fmadm config MODULE VERSION STATUS DESCRIPTION ext-event-transport 0.2 active External FM event transport fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis ip-transport 1.1 active IP Transport Agent mce 1.0 failed Machine Check Translator sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.1 active Syslog Messaging Agent
在上面的示例中,mce 模块具有 "failed" 状态。这意味着主机未在监视 CPU 或内存计算机检查事件,因此不会在故障管理数据库中记录或诊断这些事件。
例如:
[root@testserver16 ~]# fmdump -Ve n 21 2014 09:56:05.930589483 ereport.fm.fmd.module nvlist version: 0 version = 0x0 class = ereport.fm.fmd.module detector = (embedded nvlist) nvlist version: 0 version = 0x1 scheme = fmd authority = (embedded nvlist) nvlist version: 0 version = 0x0 system-mfg = unknown system-name = unknown system-part = unknown system-serial = unknown sys-comp-mfg = unknown sys-comp-name = unknown sys-comp-part = unknown sys-comp-serial = unknown server-name = testserver16 host-id = ffffffff990a7a4a (end authority) mod-name = mce mod-version = 1.0 (end detector) ena = 0x3631d6cd9f6c0001 msg = mcelog not running!: client requested that module execution abort errno = 1072 errclass = ereport.fm.fmd.hdl_abort __ttl = 0x1 __tod = 0x52de8a85 0x3777ab2b
在上面的示例中,"msg =" 字段列出了 mcelog 未在运行,这是 mce 模块故障的原因。
例如:
[root@testserver16 ~]# service mcelogd start Starting mcelog daemon
例如:
[root@testserver16 ~]# service mcelogd status Checking for mcelog mcelog (pid 32498) is running...
[root@testserver16 ~]# fmadm unload mce
执行此操作会生成一个故障事件,您可以在故障管理数据库中识别该故障事件。
例如:
[root@ban25ts12uut2 ~]# fmadm faulty --------------- ------------------------------------ -------------- --------- TIME EVENT-ID MSG-ID SEVERITY --------------- ------------------------------------ -------------- --------- Jan 21 11:35:07 528fbbb9-92d4-cd7f-ef81-e2fddfd3c244 FMD-8000-2K Minor Problem Status : solved Diag Engine : fmd-self-diagnosis / 1.0 System Manufacturer : unknown Name : unknown Part_Number : unknown Serial_Number : unknown Host_ID : ffffffff990a7a4a ---------------------------------------- Suspect 1 of 1 : Fault class : defect.sunos.fmd.module Certainty : 100% Affects : fmd:///module/mce Status : faulted and taken out of service Description : A Linux Fault Manager component has experienced an error that required the module to be disabled. Response : The module has been disabled. Events destined for the module will be saved for manual diagnosis. Impact : Automated diagnosis and response for subsequent events associated with this module will not occur. Action : Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/FMD-8000-2K for the latest service procedures and policies regarding this diagnosis.
例如:
[root@testserver16 ~]# fmadm load /opt/fma/fm/lib/fmd/plugins/mce.so fmadm: module '/opt/fma/fm/lib/fmd/plugins/mce.so' loaded into fault manager [root@testserver16 ~]# fmadm config MODULE VERSION STATUS DESCRIPTION ext-event-transport 0.2 active External FM event transport fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis ip-transport 1.1 active IP Transport Agent mce 1.0 active Machine Check Translator sysevent-transport 1.0 active SysEvent Transport Agent syslog-msgs 1.1 active Syslog Messaging Agent
如果 mce 模块不卸载或重新装入,则重新启动 Fault Manager,如下所示:
[root@testserver16 ~]# service fmd.init restart Stopping fmd: [ OK ] Starting fmd: [ OK ]