CHAPTER 7

Hardware Error Report and Decode Tool (HERD) 3.0 for Linux

Hardware Error Report and Decode (HERD) 3.0 for Linux is a tool for monitoring, decoding, and reporting correctable hardware errors.


Downloading HERD

You can download HERD from the Tools and Drivers CD, if available, or from the Tools and Drivers CD image, downloadable from the product web page.

The utility resides in the /tools/linux/herd directory.


About HERD

HERD is a tool for monitoring, decoding, and reporting correctable hardware errors. These correctable hardware errors are also known as Machine Check Exceptions (MCEs).

Versions of Linux x86_64 kernels since 2.6.4 do not print recoverable MCEs to the kernel log. Instead, they are saved in a special kernel buffer that is accessible through /dev/mcelog. HERD monitors and collects data from /dev/mcelog and reports the corresponding errors to the system log and, if the resource is available, to the system Service Processor (SP) Event Log through the local IPMI interface.

During error decoding, HERD attempts to provide as much information as possible from the data supplied by the AMD CPU. In particular, physical addresses obtained from correctable ECC memory errors are matched to the corresponding CPU slot and DIMM number.

HERD is supported on Sun servers with AMD processors.


Installing HERD

RPMs are provided for the following Linux distributions:


TABLE 7-1   RPM Linux Distributions
Release                   RPM Designation
Red Hat RHEL4 (64-bit)    herd-1.x-x.rh4.x86_64.rpm
Red Hat RHEL5 (64-bit)    herd-1.x-x.rh5.x86_64.rpm
Novell SLES9 (64-bit)     herd-1.x-x.sl9.x86_64.rpm
Novell SLES10 (64-bit)    herd-1.x-x.sl10.x86_64.rpm

To install the RPM, run the following command:

rpm -Uhv herd-1.x-1.rh4.x86_64.rpm

Each RPM has a set of run-time dependencies that RPM enforces, such as the openssl libraries or the OpenIPMI scripts. If a dependency is missing, RPM reports an error and you must install the missing package manually.

With SLES, use the yast utility. For example, type:

yast2 -i OpenIPMI

With RHEL, use up2date or system-config-packages. For example, type:

up2date -i openssl

HERD is designed to be backward compatible with the mcelog utility. It supports the same command-line options and uses the same format to report errors to the system log. As such, HERD acts as a replacement for mcelog, and the two cannot be used at the same time. This conflict is encoded in the HERD RPMs, so installing HERD automatically uninstalls mcelog if it is present on the system.


Starting the HERD Daemon

All of the provided RPMs include the appropriate SysV init scripts. After installation, the HERD daemon is automatically set up to run after system boot. The daemon is not, however, started right away.

To start HERD immediately after installation, run the SysV init script that the RPM installed, with the start argument.

When the following message appears in the system log, then HERD is running successfully:

/var/log/messages:

herd: IPMI connection fully operational
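
The check for that message can be scripted. The following is a minimal sketch, assuming the system log is /var/log/messages as shown above and that a plain substring match on the quoted message is sufficient. The function names herd_is_operational and check_system_log are hypothetical and are not part of HERD.

```python
# Hypothetical helpers (not part of HERD): look for the readiness message
# quoted in this section in a sequence of system log lines.
READY_MESSAGE = "herd: IPMI connection fully operational"


def herd_is_operational(log_lines):
    """Return True if any line contains the readiness message."""
    return any(READY_MESSAGE in line for line in log_lines)


def check_system_log(path="/var/log/messages"):
    """Scan the system log file; reading it usually requires root."""
    with open(path, errors="replace") as log:
        return herd_is_operational(log)
```

Calling check_system_log() after starting the daemon returns True once the message has been logged.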


Using HERD

Once the HERD daemon is running, any correctable MCEs that occur on the system are reported to both the system log (/var/log/messages) and the service processor System Event Log (SEL). In the case of correctable ECC memory errors, both reports should correctly identify the CPU slot and DIMM number on which the memory error occurred.



Note - The Linux kernel harvests MCEs only every 5 minutes, so a delay might occur between an MCE and its report in the system log and SEL.



HERD Syntax


Usage: herd [options]
Options:
-e, --decode <addr>           Decode the given 64-bit hex address and exit
-D, --nodaemon                Don’t detach and become a daemon
-d, --debug                   Debug mode
    --ignorenodev             Silent exit if device missing
    --filter                  Filter out known bogus MCEs
    --dmi                     Lookup MCE address in BIOS tables
    --params                  Display herd parameters information
    --setparam <key>=<value>  Set or override parameter value
-h, --help                    This message

Example of HERD Output

Here is an example of the system log output generated by HERD:


Jan 14 18:57:32 host herd: HARDWARE ERROR. This is *NOT* a software problem! 
Jan 14 18:57:32 host herd: Please contact your hardware vendor 
Jan 14 18:57:32 host herd: CPU 0 4 northbridge 
Jan 14 18:57:32 host herd:   Northbridge Watchdog error 
Jan 14 18:57:32 host herd:        bit57 = processor context corrupt 
Jan 14 18:57:32 host herd:        bit61 = error uncorrected 
Jan 14 18:57:32 host herd:   bus error ’generic participation, request timed out generic error mem transaction generic access, level generic’ 
Jan 14 18:57:32 host herd: STATUS b200000000070f0f MCGSTATUS 0
Jan 14 18:57:32 host herd: Physical address maps to: Cpu Node 0, DIMM 1
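
To collect the CPU node and DIMM numbers from reports like the one above, a script can match the final line of each report. This is a minimal sketch; the pattern is inferred from the single example shown here, so other HERD versions might format the line differently. The name dimm_locations is hypothetical.

```python
import re

# Pattern inferred from the example report above (an assumption, not a
# documented format).
DIMM_LINE = re.compile(
    r"herd: Physical address maps to: Cpu Node (\d+), DIMM (\d+)"
)


def dimm_locations(log_lines):
    """Return a (cpu_node, dimm) tuple for each matching line, in order."""
    locations = []
    for line in log_lines:
        match = DIMM_LINE.search(line)
        if match:
            locations.append((int(match.group(1)), int(match.group(2))))
    return locations
```

Passing the lines of /var/log/messages to dimm_locations yields one tuple per reported error, which can then be counted per DIMM.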


Additional Options

HERD has a number of parameters that can be changed using the --setparam option. The list of available parameters and their descriptions is available by running herd --params.


TABLE 7-2   HERD Options  
Option              Default Value            Description
check_timer_secs    10                       Delay in seconds between MCE log checks.
proc_pci_devices    /proc/bus/pci/devices    Path of the procfs file containing PCI device information. HERD uses this file to obtain the CPU DRAM bridge PCI devices on the system.
proc_pci_bus        /proc/bus/pci            Path of the procfs directory containing PCI device configuration data. HERD reads the PCI configuration data of the system DRAM controllers from the corresponding files in that directory.
force_cpu           (not set)                Sets the CPU version information. Must be formatted as "family,model,stepping" with decimal values. If not set, the CPU version is auto-detected.
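
The following minimal sketch shows how a wrapper script might assemble a herd command line from the parameters in the table. The helper names format_force_cpu and herd_argv are hypothetical, and repeating --setparam once per parameter is an assumption, since the syntax summary shows only a single <key>=<value> pair.

```python
def format_force_cpu(family, model, stepping):
    """Format a force_cpu value as decimal "family,model,stepping"."""
    return f"{int(family)},{int(model)},{int(stepping)}"


def herd_argv(params, extra_options=()):
    """Build an argument list with one --setparam <key>=<value> per entry.

    Assumption: herd accepts --setparam more than once.
    """
    argv = ["herd", *extra_options]
    for key, value in params.items():
        argv += ["--setparam", f"{key}={value}"]
    return argv
```

For example, herd_argv({"check_timer_secs": 30}) produces the list for herd --setparam check_timer_secs=30, which can be passed to subprocess.run.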


Known Problems and Limitations

Recent Linux kernel versions (2.6.16 and newer) ship with an MCE decoding stack called EDAC, which can conflict with HERD. For the HERD daemon to function correctly, first unload the EDAC-related kernel modules with the rmmod command. The HERD start script in version 1.8 does this automatically.
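
On versions where the modules must be unloaded by hand, it helps to see which EDAC modules are loaded first. The sketch below lists them from /proc/modules, where each line begins with the module name. Matching on the substring "edac" is an assumption about how the modules are named, and the function names are hypothetical.

```python
def loaded_edac_modules(proc_modules_text):
    """Return names of loaded modules whose name contains "edac".

    Each line of /proc/modules starts with the module name, followed by
    whitespace-separated fields such as size and use count.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0].lower():
            names.append(fields[0])
    return names


def read_loaded_edac_modules(path="/proc/modules"):
    """Read the live module list and filter it."""
    with open(path) as modules:
        return loaded_edac_modules(modules.read())
```

Each name returned by read_loaded_edac_modules() is a candidate for rmmod before the HERD daemon is started.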

On systems that have a DRAM interface configured as 128-bit, HERD can identify only DIMM pairs rather than individual DIMMs. HERD reports the size of the DRAM interface when it runs in debug mode. For example, run the following command:


herd -d -e 0


Identifying CPU and DIMMs With MCEs

If an MCE occurred before HERD was installed on a system, use the HERD tool to identify the CPU slot and DIMM number from the physical address reported by the MCE.

For example, use the herd command with the -e option to decode a physical address:

# herd -e 0x18000000
000018000000: Cpu Node 0, DIMM 0

The results identify the DIMM associated with physical address 0x18000000.
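
The address decode can also be driven from a script. This is a minimal sketch, assuming that herd is on the PATH, that it writes the result to standard output, and that the result line always has the form shown in this section. The names parse_decode_output and decode_address are hypothetical.

```python
import re
import subprocess

# Result format inferred from the example: "<hex address>: Cpu Node N, DIMM M".
DECODE_LINE = re.compile(r"([0-9a-fA-F]+): Cpu Node (\d+), DIMM (\d+)$")


def parse_decode_output(output):
    """Return (address, cpu_node, dimm) from herd -e output, or None."""
    for line in output.splitlines():
        match = DECODE_LINE.match(line.strip())
        if match:
            return (int(match.group(1), 16),
                    int(match.group(2)),
                    int(match.group(3)))
    return None


def decode_address(address):
    """Run herd -e for an integer address (assumes output on stdout)."""
    result = subprocess.run(
        ["herd", "-e", hex(address)],
        capture_output=True, text=True, check=True,
    )
    return parse_decode_output(result.stdout)
```

The parser skips lines that do not match, so it also tolerates the extra lines that the debug option prints before the result.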



Note - HERD must be run on the system on which the MCE actually occurred to identify the CPU and DIMM numbers correctly.



HERD supports a debug option (-d) that gives more system information, including the Opteron CPU identification data, for example:


# herd -d -e 0x000008000000
2 cores found, family 15, model 5, stepping 10 (revision C)
herd: dimm translation against system address 00080000
Node 0: DRAM base 00000000, DRAM lmit 003fffff, HoleEn 0
 Chip 0: CSBase 00000000. CSMask 03ffffff
000008000000: Cpu Node 0, DIMM 0


Software Error Report and Decode (SERD)

The Software Error Report and Decode (SERD) engine is a component of HERD that holds back error reports until the errors meet certain criteria. The default threshold for errors on a DIMM (with a unique address) is 24 errors within a 24-hour period. The SERD filter does not report the first 24 errors in a 24-hour period. The 25th error triggers the filter, and HERD begins adding error messages to /var/log/messages.

The SERD engine holds its accounting of the last 24 hours in RAM. When the program is interrupted, either by a reboot or by restarting HERD, that accounting is lost and the policy starts over. However, the log data in the SERD log remains intact.
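
The threshold behavior can be illustrated with a minimal in-memory sketch. This is not HERD's implementation. It assumes a sliding 24-hour window per DIMM, whereas the text does not say whether HERD uses a sliding window or fixed intervals, and the class name SerdFilter is hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour period from the text
THRESHOLD = 24                 # errors tolerated within the window


class SerdFilter:
    """Sliding-window error counter, one window per DIMM key."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        # State lives only in RAM, so a new instance starts from zero,
        # as happens when the daemon restarts.
        self.events = defaultdict(deque)

    def record(self, dimm, timestamp):
        """Record one error; return True if it should be reported."""
        times = self.events[dimm]
        # Discard errors that are older than the window.
        while times and timestamp - times[0] >= self.window:
            times.popleft()
        times.append(timestamp)
        return len(times) > self.threshold
```

With the defaults, record returns False for the first 24 errors on a DIMM inside the window and True from the 25th onward. Creating a new SerdFilter models a restart, after which the count begins again.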