CHAPTER 7
Hardware Error Report and Decode (HERD) 3.0 for Linux is a tool for monitoring, decoding, and reporting correctable hardware errors. This chapter has the following sections:
You can download HERD from the Tools and Drivers CD, if available, or from the Tools and Drivers CD image, downloadable from the product web page.
HERD is a tool for monitoring, decoding, and reporting correctable hardware errors. These correctable hardware errors are also known as Machine Check Exceptions (MCE).
Linux x86_64 kernels since version 2.6.4 do not print recoverable MCEs to the kernel log. Instead, they save them in a special kernel buffer that is accessible through /dev/mcelog. HERD monitors and collects data from /dev/mcelog and reports the corresponding errors to the system log and, if the resource is available, to the system Service Processor (SP) Event Log through the local IPMI interface.
During error decoding, HERD attempts to provide as much information as possible from the data supplied by the AMD CPU. In particular, physical addresses obtained from correctable ECC memory errors are matched to the corresponding CPU slot and DIMM number.
RPMs are provided for the following Linux distributions:
To install the RPM, run the following command:
rpm -Uhv herd-1.x-1.rh4.x86_64.rpm
Each RPM has a set of run-time dependencies that are enforced by RPM. These dependencies include the openssl libraries or the OpenIPMI scripts. If a dependency is missing, RPM reports an error and you must install the missing package manually.
With SLES, use the yast utility. For example, type:
With RHEL, use up2date or system-config-packages. For example, type:
HERD is designed to be backward compatible with the mcelog utility. It supports the same command-line options and uses the same format to report errors to the system log. As such, HERD acts as a replacement for mcelog, and the two cannot be used at the same time. This conflict is encoded in the HERD RPMs, so installing HERD automatically uninstalls mcelog if it is present on the system.
All of the provided RPMs include the appropriate SysV init scripts. After installation, the HERD daemon is automatically set up to run after system boot. However, the daemon is not started immediately.
To start HERD immediately after installation:
When the following message appears in the system log, then HERD is running successfully:
Once the HERD daemon is running, any correctable MCEs that occur on the system are reported to both the system log (/var/log/messages) and the service processor System Event Log (SEL). For correctable ECC memory errors, both reports should identify the CPU slot and DIMM number on which the memory error occurred.
Note - The Linux kernel harvests MCEs only every 5 minutes, so a delay might occur between an MCE occurrence and its report to the system log and SEL.
Usage: herd [options]
Options:
  -e, --decode <addr>         Decode the given 64-bit hex address and exit
  -D, --nodaemon              Don't detach and become a daemon
  -d, --debug                 Debug mode
  --ignorenodev               Silent exit if device missing
  --filter                    Filter out known bogus MCEs
  --dmi                       Lookup MCE address in BIOS tables
  --params                    Display herd parameters information
  --setparam <key>=<value>    Set or override parameter value
  -h, --help                  This message
Here is an example of the system log output generated by HERD:
Jan 14 18:57:32 host herd: HARDWARE ERROR. This is *NOT* a software problem!
Jan 14 18:57:32 host herd: Please contact your hardware vendor
Jan 14 18:57:32 host herd: CPU 0 4 northbridge
Jan 14 18:57:32 host herd: Northbridge Watchdog error
Jan 14 18:57:32 host herd: bit57 = processor context corrupt
Jan 14 18:57:32 host herd: bit61 = error uncorrected
Jan 14 18:57:32 host herd: bus error 'generic participation, request timed out generic error mem transaction generic access, level generic'
Jan 14 18:57:32 host herd: STATUS b200000000070f0f MCGSTATUS 0
Jan 14 18:57:32 host herd: Physical address maps to: Cpu Node 0, DIMM 1
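Log lines in this format lend themselves to simple post-processing. The following Python sketch scans system log text for the "Physical address maps to:" line and extracts the CPU node and DIMM number. The function name and the regular expression are illustrative assumptions based only on the sample output in this section; other HERD versions might format the line differently.

```python
import re

# Assumed pattern, derived only from the sample log line in this chapter.
_MAP_RE = re.compile(
    r"herd: Physical address maps to: Cpu Node (\d+), DIMM (\d+)"
)

def find_dimm_reports(log_text):
    """Return a list of (cpu_node, dimm) tuples found in system log text."""
    reports = []
    for line in log_text.splitlines():
        match = _MAP_RE.search(line)
        if match:
            reports.append((int(match.group(1)), int(match.group(2))))
    return reports

# Two lines copied from the sample log output.
sample = (
    "Jan 14 18:57:32 host herd: STATUS b200000000070f0f MCGSTATUS 0\n"
    "Jan 14 18:57:32 host herd: Physical address maps to: Cpu Node 0, DIMM 1\n"
)
print(find_dimm_reports(sample))
```

On a live system, the same function could be applied to the contents of /var/log/messages.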
HERD has a number of parameters that can be changed using the --setparam option. The list of available parameters and their descriptions is available by running herd --params.
Recent Linux kernel versions (2.6.16 and newer) ship with an MCE decoding stack called EDAC, which can conflict with HERD. For the HERD daemon to function correctly, first unload the EDAC-related kernel modules with the rmmod command. In version 1.8, the HERD startup script does this automatically.
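One way to check whether EDAC modules are loaded is to inspect /proc/modules, where each line begins with a module name. The following Python sketch is a minimal illustration, assuming that EDAC-related module names contain the substring "edac" (a naming assumption, not something this chapter states). The helper name and the sample module entries are hypothetical.

```python
def loaded_edac_modules(proc_modules_text):
    """Return names of loaded modules whose name contains 'edac'.

    Each line of /proc/modules starts with the module name.
    Matching on the substring 'edac' is an assumption.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0].lower():
            names.append(fields[0])
    return names

# Hypothetical sample text in /proc/modules format.
sample = (
    "ipmi_si 45000 0 - Live 0xffffffff88000000\n"
    "example_edac 12000 0 - Live 0xffffffff88100000\n"
    "edac_mc 25000 1 example_edac, Live 0xffffffff88200000\n"
)
for name in loaded_edac_modules(sample):
    print("candidate for rmmod:", name)
```

On a live system, the input would be the contents of /proc/modules, and each reported module could then be removed with rmmod, dependent modules first.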
On systems with a DRAM interface configured as 128-bit, HERD can identify only DIMM pairs rather than individual DIMMs. HERD reports the size of the DRAM interface when it runs in debug mode, for example, with the following command:
herd -d -e 0
If an MCE occurred before HERD was installed on a system, use the HERD tool to identify the CPU slot and DIMM number from the physical address reported by the MCE.
For example, use the herd command with the -e option to decode a physical address:
000018000000: Cpu Node 0, DIMM 0
The results identify the DIMM associated with physical address 0x18000000.
Note - HERD must be run on the system on which the MCE actually occurred to identify the CPU and DIMM numbers correctly.
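When several addresses need decoding, the herd -e call can be scripted. The following Python sketch assumes that herd writes a line of the form "000018000000: Cpu Node 0, DIMM 0" to standard output, as in the example in this section. Both function names are hypothetical, and decode_address works only on the affected system with herd installed.

```python
import re
import subprocess

# Assumed output format: "<hex address>: Cpu Node <n>, DIMM <m>"
_DECODE_RE = re.compile(
    r"^\s*([0-9a-fA-F]+): Cpu Node (\d+), DIMM (\d+)", re.MULTILINE
)

def parse_decode_output(text):
    """Return (address, cpu_node, dimm) from herd -e output, or None."""
    match = _DECODE_RE.search(text)
    if match is None:
        return None
    return (int(match.group(1), 16), int(match.group(2)), int(match.group(3)))

def decode_address(address):
    """Run 'herd -e <address>' and parse the result (requires herd)."""
    result = subprocess.run(
        ["herd", "-e", hex(address)],
        capture_output=True, text=True, check=True,
    )
    return parse_decode_output(result.stdout)

# Parse the sample output line without invoking herd.
print(parse_decode_output("000018000000: Cpu Node 0, DIMM 0\n"))
```

Keeping the parser separate from the subprocess call lets the parsing logic be checked against saved output on any machine.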
HERD supports a debug option (-d) that gives more system information, including the Opteron CPU identification data, for example:
# herd -d -e 0x000008000000
2 cores found, family 15, model 5, stepping 10 (revision C)
herd: dimm translation against system address 00080000
Node 0: DRAM base 00000000, DRAM limit 003fffff, HoleEn 0
Chip 0: CSBase 00000000. CSMask 03ffffff
000008000000: Cpu Node 0, DIMM 0
The Software Error Report and Decode (SERD) engine is a component of HERD that filters errors that meet certain criteria. The default threshold for errors on a DIMM (with a unique address) is 24 errors within a 24-hour period. The SERD filter does not report the first 24 errors in a 24-hour period. When the 25th error triggers the filter, HERD begins adding error messages to /var/log/messages.
The SERD engine keeps its accounting of the last 24 hours in RAM. When HERD is interrupted, either by a system reboot or by a restart of the HERD daemon, this accounting is lost and the filter starts counting again from zero. However, the log data in the SERD log remains intact.
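The thresholding behavior described above can be illustrated with a sliding-window counter. The following Python sketch is not HERD's actual implementation. The class name, the key format, and the use of timestamps in seconds are assumptions made for illustration. State is held only in memory, which mirrors why the accounting is lost on restart.

```python
from collections import defaultdict, deque

class SerdFilter:
    """In-memory sliding-window threshold filter (illustrative only)."""

    def __init__(self, threshold=24, window_seconds=24 * 3600):
        self.threshold = threshold
        self.window = window_seconds
        # Timestamps of recent errors per key. This lives only in RAM,
        # so it is lost whenever the process restarts.
        self._events = defaultdict(deque)

    def record(self, key, timestamp):
        """Record one error; return True if it should be reported."""
        events = self._events[key]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] >= self.window:
            events.popleft()
        return len(events) > self.threshold

serd = SerdFilter()
key = ("node0", "dimm1")  # hypothetical key identifying a DIMM
# Simulate 25 errors, one per minute.
results = [serd.record(key, minute * 60) for minute in range(25)]
print(results.count(True), results[-1])
```

In this simulation, the first 24 errors are suppressed and only the 25th is flagged for reporting. Creating a new SerdFilter object models a restart: the counts begin again from zero.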
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.