CHAPTER 7
Diagnosis and Domain Restoration
The diagnosis and domain restoration features are enabled by default on Sun Fire midframe systems, starting with firmware version 5.15.0. This section provides an overview of how these capabilities work.
Depending on the type of hardware errors that occur and the diagnostic controls that are set, the system controller performs certain diagnosis and domain restoration steps, as FIGURE 7-1 shows. The firmware includes an auto-diagnosis (AD) engine, which detects and diagnoses hardware errors that affect the availability of a platform and its domains.
The following summary describes the process shown in FIGURE 7-1:
3. Auto-restoration. During the auto-restoration process, POST reviews the component health status of the FRUs, as updated by the AD engine. POST uses this information to try to isolate the fault by deconfiguring (disabling) any FRUs in the domain that have been determined to cause the hardware error. Even if POST cannot isolate the fault, the system controller automatically reboots the domain as part of domain restoration.
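The auto-restoration step can be pictured with a small model. The following Python sketch is illustrative only and is not system controller firmware: the `Domain` class, the `auto_restore` function, and the FRU labels are hypothetical names introduced here. It assumes only what the step above states: FRUs that the AD engine marked as faulty are deconfigured from the domain, and the domain is rebooted whether or not the fault was isolated.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class Domain:
    """Hypothetical stand-in for a domain and the FRUs configured in it."""
    frus: Set[str]
    deconfigured: Set[str] = field(default_factory=set)
    rebooted: bool = False


def auto_restore(domain: Domain, faulty_frus: Iterable[str]) -> Set[str]:
    """Model of the auto-restoration step.

    faulty_frus stands for the FRUs whose component health status the
    AD engine marked as faulty. Only FRUs that belong to this domain
    are deconfigured. The reboot happens even when nothing is isolated.
    """
    isolated = domain.frus & set(faulty_frus)
    domain.deconfigured |= isolated
    domain.rebooted = True  # reboot regardless of whether isolation succeeded
    return isolated


# Hypothetical FRU labels, used only for illustration.
domain_a = Domain(frus={"board-A", "board-B"})
isolated = auto_restore(domain_a, {"board-B"})
```

In this model, `isolated` holds the FRUs removed from the domain configuration, and `domain_a.rebooted` is set in every case, which mirrors the statement that the reboot occurs even if the fault cannot be isolated.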
When the hang policy parameter of the setupdomain command is set to reset, the system controller automatically performs an externally initiated reset (XIR) and reboots the hung domain. If the OBP.error-reset-recovery parameter of the setupdomain command is set to sync, a core file is also generated after the XIR and can be used to troubleshoot the domain hang. See the Domain Parameters section for details.
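The interaction between the two parameters can be summarized as a simple decision function. The Python sketch below is a hypothetical model, not an interface to the system controller: the function name and the action labels are invented for illustration. It covers only the settings described above; the behavior for other parameter values is not described in this passage, so the sketch returns no actions for them.

```python
from typing import Set


def hang_recovery_actions(hang_policy: str, error_reset_recovery: str) -> Set[str]:
    """Return the recovery actions implied by the two setupdomain settings.

    Only the combinations described in the text are modeled:
      hang policy = reset                -> XIR and domain reboot
      OBP.error-reset-recovery = sync    -> core file generated after the XIR
    A set is returned because the sketch makes no claim about the exact
    ordering of the core file generation and the reboot.
    """
    actions: Set[str] = set()
    if hang_policy == "reset":
        actions.update({"xir", "reboot"})
        if error_reset_recovery == "sync":
            actions.add("core_file")
    # Other hang policy values are intentionally not modeled here.
    return actions


recommended = hang_recovery_actions("reset", "sync")
```

With both parameters set as shown, `recommended` contains the XIR, the reboot, and the core file, which is the combination that yields data for troubleshooting the hang.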
CODE EXAMPLE 7-2 shows the domain console message displayed when the domain heartbeat stops.
CODE EXAMPLE 7-3 shows the domain console message displayed when the domain does not respond to interrupts.
Sun strongly recommends that you define platform and domain loghosts to which all system log (syslog) messages are forwarded and stored. Platform and domain messages, including the auto-diagnostic and domain-restoration event messages, cannot be stored locally. By specifying a loghost for platform and domain log messages, you can use the loghost to monitor and review critical events and messages as needed. Before you can assign platform and domain loghosts, you must set up a loghost server.
You assign the loghosts through the Loghost and the Log Facility parameters in the setupplatform and setupdomain commands. The facility level identifies where the log messages originate, either the platform or a domain. For details on these commands, refer to their command descriptions in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
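On the loghost, the facility is what lets you tell platform messages from domain messages. The Python sketch below shows one way this could be done, assuming the messages arrive in standard syslog format with a leading `<PRI>` value, where the facility is the priority value divided by 8 and the severity is the remainder. The mapping of facilities to origins is hypothetical; the real mapping is whatever you assign through the Log Facility parameters. The sample message text is a placeholder, not actual system controller output.

```python
from typing import Optional, Tuple

# Standard syslog numbering: local0..local7 are facility codes 16..23.
LOCAL_FACILITIES = {16 + n: f"local{n}" for n in range(8)}

# Hypothetical assignment, for illustration only. Use the values you
# configured with the Log Facility parameters.
ORIGIN_BY_FACILITY = {"local0": "platform", "local1": "domain A"}


def decode_pri(pri: int) -> Tuple[int, int]:
    """Split a syslog PRI value into (facility code, severity)."""
    return pri // 8, pri % 8


def origin_of(line: str) -> Optional[str]:
    """Return the configured origin of a raw syslog line, if known."""
    if not line.startswith("<") or ">" not in line:
        return None
    try:
        pri = int(line[1:line.index(">")])
    except ValueError:
        return None
    facility_code, _severity = decode_pri(pri)
    facility = LOCAL_FACILITIES.get(facility_code)
    return ORIGIN_BY_FACILITY.get(facility)


# 134 = 16 * 8 + 6, that is, facility local0 with severity 6 (informational).
sample_origin = origin_of("<134>placeholder message text")
```

Because each platform and domain can be given its own facility, a filter of this kind is enough to route messages into separate files or views on the loghost.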
TABLE 7-1 describes the domain parameter settings in the setupdomain command that control the diagnostic and domain recovery process. The default values for the diagnostic and domain restoration parameters are the recommended settings.
Note - If you do not use the default settings, the domain restoration features will not function as described in Diagnosis and Domain Restoration Overview.
Automatically reboots the domain after an XIR occurs and generates a core file that can be used to troubleshoot the domain hang. However, be aware that sufficient disk space must be allocated in the domain swap area to hold the core file.
The diagnostic information logged on the platform and on the domain is similar, but the domain log provides additional information on the domain hardware error. The auto-diagnosis event message includes the following information:
Note - Disabled components that have a POST status of chs cannot be enabled by using the setls command. Contact your service provider for assistance. In some cases, subcomponents belonging to a "parent" component associated with a hardware error will also reflect a disabled status, as does the parent. You cannot re-enable the subcomponents of a parent component associated with a hardware error. Review the auto-diagnosis event messages to determine which parent component is associated with the error.
The showerrorbuffer command shows the contents of the system error buffer and displays error messages that otherwise might be lost when your domains are rebooted as part of the domain recovery process. The information displayed can be used by your service provider for troubleshooting purposes.
CODE EXAMPLE 7-8 shows the output displayed for a domain hardware error.