CHAPTER 7
Diagnosis and Domain Restoration
This chapter describes the error diagnosis and domain restoration capabilities included with the firmware for Sun Fire 6800/4810/4800/3800 systems. This chapter explains the following:
The diagnosis and domain restoration features are enabled by default on Sun Fire midframe systems, starting with firmware version 5.15.0. This section provides an overview of how these capabilities work.
Depending on the type of hardware errors that occur and the diagnostic controls that are set, the system controller performs certain diagnosis and domain restoration steps, as FIGURE 7-1 shows. The firmware includes an auto-diagnosis (AD) engine, which detects and diagnoses hardware errors that affect the availability of a platform and its domains.
The following summary describes the process shown in FIGURE 7-1:
1. The system controller detects a domain hardware error and pauses the domain.
2. Auto-diagnosis. The AD engine analyzes the hardware error and determines which field-replaceable units (FRUs) are associated with the hardware error.
Note - Contact your service provider when you see these auto-diagnosis messages. Your service provider will review the auto-diagnosis information and initiate the appropriate service action.
3. Auto-restoration. During the auto-restoration process, POST reviews the component health status of the FRUs that were updated by the AD engine. POST uses this information to try to isolate the fault by deconfiguring (disabling) from the domain any FRUs that have been determined to cause the hardware error. Even if POST cannot isolate the fault, the system controller automatically reboots the domain as part of domain restoration.
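The three steps above can be sketched as a small Python model. This is only an illustration of the sequence described in this section, not the firmware's implementation: the `Domain` class, the function names, and the FRU names in the usage note are hypothetical, and diagnosis is reduced to a simple set intersection.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class Domain:
    """Hypothetical stand-in for a domain and its configured FRUs."""
    name: str
    frus: Set[str]                                  # FRUs configured into the domain
    disabled: Set[str] = field(default_factory=set)  # FRUs deconfigured so far
    paused: bool = False
    reboots: int = 0


def auto_diagnose(domain: Domain, error_frus: Iterable[str]) -> Set[str]:
    """Step 2: determine which of the domain's FRUs are associated with the error."""
    return set(error_frus) & domain.frus


def auto_restore(domain: Domain, suspects: Set[str]) -> None:
    """Step 3: deconfigure suspect FRUs, then reboot the domain.

    The reboot happens even when no FRU could be isolated.
    """
    domain.frus -= suspects
    domain.disabled |= suspects
    domain.paused = False
    domain.reboots += 1


def handle_hardware_error(domain: Domain, error_frus: Iterable[str]) -> Domain:
    domain.paused = True                            # Step 1: pause the domain
    suspects = auto_diagnose(domain, error_frus)    # Step 2: auto-diagnosis
    auto_restore(domain, suspects)                  # Step 3: auto-restoration
    return domain
```

For example, calling `handle_hardware_error(Domain("A", {"SB0", "SB2"}), ["SB0"])` leaves `SB2` configured, moves `SB0` to the disabled set, and records one reboot. Passing an empty error list still records a reboot, which mirrors the case where POST cannot isolate the fault.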
The system controller automatically monitors domains for hangs, which it detects when the domain heartbeat stops or when the domain does not respond to interrupts.
When the hang policy parameter of the setupdomain command is set to reset, the system controller automatically performs an externally initiated reset (XIR) and reboots the hung domain. If the OBP.error-reset-recovery parameter of the setupdomain command is set to sync, a core file is also generated after an XIR and can be used to troubleshoot the domain hang. See the Domain Parameters section for details.
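The interaction between the two parameters can be summarized in a short Python sketch. The function name and the action labels are hypothetical; only the values reset and sync come from the text above. The behavior for other hang policy values is not described in this section, so the sketch simply returns no actions for them.

```python
from typing import List


def hang_recovery_actions(hang_policy: str, error_reset_recovery: str) -> List[str]:
    """Return the recovery actions, in order, for a hung domain.

    hang_policy          -- value of the hang policy parameter
    error_reset_recovery -- value of the OBP.error-reset-recovery parameter
    """
    actions: List[str] = []
    if hang_policy == "reset":
        actions.append("xir")                # externally initiated reset
        if error_reset_recovery == "sync":
            actions.append("core_file")      # core file generated after the XIR
        actions.append("reboot")             # hung domain is rebooted
    # Other hang policy values are not covered here (assumption: no automatic action).
    return actions
```

With `hang_recovery_actions("reset", "sync")` the result lists the XIR, the core file, and the reboot; with any other `error_reset_recovery` value the core file step is omitted.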
CODE EXAMPLE 7-2 shows the domain console message displayed when the domain heartbeat stops.
CODE EXAMPLE 7-3 shows the domain console message displayed when the domain does not respond to interrupts.
This section explains the various controls and domain parameters that affect the domain restoration features.
Sun strongly recommends that you define platform and domain loghosts to which all system log (syslog) messages are forwarded and stored. Platform and domain messages, including the auto-diagnostic and domain-restoration event messages, cannot be stored locally. By specifying a loghost for platform and domain log messages, you can use the loghost to monitor and review critical events and messages as needed. Before you can assign platform and domain loghosts, however, you must set up a loghost server.
You assign the loghosts through the Loghost and the Log Facility parameters in the setupplatform and setupdomain commands. The facility level identifies where the log messages originate, either the platform or a domain. For details on these commands, refer to their command descriptions in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
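On the loghost itself, the facility can be used to separate platform messages from domain messages. The following Python sketch assumes the forwarded messages arrive in the conventional BSD syslog format, where the leading `<PRI>` value encodes facility * 8 + severity. Which facility you assign to the platform and to each domain is your own choice; the sample lines in the usage note are placeholders, not real system controller output.

```python
import re
from typing import Optional, Tuple

# Standard syslog facility codes for the local-use facilities.
LOCAL_FACILITIES = {16 + n: "local%d" % n for n in range(8)}

PRI_PATTERN = re.compile(r"<(\d{1,3})>")


def decode_pri(line: str) -> Optional[Tuple[str, int]]:
    """Return (facility name, severity) for a raw syslog line, or None if no PRI."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    return LOCAL_FACILITIES.get(facility, str(facility)), severity
```

For instance, `decode_pri("<134>placeholder text")` yields `("local0", 6)`, because 134 = 16 * 8 + 6. If you assigned local0 to the platform and local1 to a domain, a line beginning with `<142>` would belong to that domain.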
TABLE 7-1 describes the domain parameter settings in the setupdomain command that control the diagnostic and domain recovery process. The default values for the diagnostic and domain restoration parameters are the recommended settings.
Note - If you do not use the default settings, the domain restoration features will not function as described in Diagnosis and Domain Restoration Overview.
For a complete description of all the domain parameters and their values, refer to the setupdomain command description in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
This section describes various ways to monitor diagnostic errors and obtain additional information about components that are associated with hardware errors.
Auto-diagnosis event messages are displayed on the platform and domain console and also in the following:
The diagnostic information logged on the platform and on the domain is similar, but the domain log provides additional information on the domain hardware error. The auto-diagnosis event message includes the following information:
You can obtain additional information about components that have been deconfigured as part of the auto-diagnosis process or disabled for other reasons by reviewing the following items:
The showerrorbuffer command shows the contents of the system error buffer and displays error messages that otherwise might be lost when your domains are rebooted as part of the domain recovery process. The information displayed can be used by your service provider for troubleshooting purposes.
CODE EXAMPLE 7-8 shows the output displayed for a domain hardware error.
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.