CHAPTER 7 - Automatic Diagnosis and Recovery

This chapter describes the error diagnosis and domain recovery capabilities included with the firmware for Sun Fire entry-level midrange systems.

This chapter explains the following topics:

Automatic Diagnosis and Recovery Overview

Automatic Recovery of a Hung System

Diagnosis Events

Diagnostic and Recovery Controls

Obtaining Auto-Diagnosis and Recovery Information

Automatic Diagnosis and Recovery Overview

The diagnosis and recovery features are enabled by default on Sun Fire midrange systems. This section provides an overview of how these features work.

Depending on the type of hardware errors that occur and the diagnostic controls that are set, the system controller performs certain diagnosis and recovery steps, as FIGURE 7-1 shows. The firmware includes an auto-diagnosis (AD) engine, which detects and diagnoses hardware errors that affect the availability of a system.

Note - Although entry-level midrange systems do not support the multiple domains that other midrange systems support, by convention, diagnostic output provides system status as the status for Domain A.

FIGURE 7-1 Auto Diagnosis and Recovery Process

Diagram showing the main steps in the error diagnosis and domain restoration process.

The following summary describes the process shown in FIGURE 7-1:

1. The system controller (SC) detects a hardware error and pauses the operating system.

2. Auto-diagnosis. The AD engine analyzes the hardware error and determines which field-replaceable units (FRUs) are associated with the hardware error.

The AD engine provides one of the following diagnosis results, depending on the hardware error and the components involved:

Identifies a single FRU that is responsible for the error.

Identifies multiple FRUs that are responsible for the error. Be aware that not all components listed may be faulty. The hardware error could be related to a smaller subset of the components identified.

Indicates that the FRUs responsible for the error cannot be determined. This condition is considered to be "unresolved" and requires further analysis by your service provider.

The AD engine records the diagnosis information for the affected components and maintains this information as part of the component health status (CHS).

The AD engine reports diagnosis information through console event messages.

CODE EXAMPLE 7-1 shows an auto-diagnosis event message that appears on the console. In this example, two FRUs are implicated in the hardware error. See Reviewing Auto-Diagnosis Event Messages for details on the AD message contents.

CODE EXAMPLE 7-1 Example of Auto-Diagnosis Event Message Displayed on the Console

[AD] Event: E2900.ASIC.AR.ADR_PERR.10473006
     CSN:  DomainID: A ADInfo: 1.SCAPP.17.0
     Time: Fri Dec 12 09:30:20 PST 2003
     FRU-List-Count: 2; FRU-PN: 5405564; FRU-SN: A08712; FRU-LOC: /N0/IB6
                        FRU-PN: 5404974; FRU-SN: 000274; FRU-LOC: /N0/RP2
     Recommended-Action: Service action required

Note - Contact your service provider when you see these auto-diagnosis messages. Your service provider will review the auto-diagnosis information and initiate the appropriate service action.

The AD engine also makes diagnosis information available through the output of the showlogs, showboards, showcomponent, and showerrorbuffer commands (see Obtaining Auto-Diagnosis and Recovery Information for details on the diagnosis-related information displayed by these commands).

The output from these commands supplements the diagnosis information presented in the event messages and can be used for additional troubleshooting purposes.

3. Auto-restoration. During the auto-restoration process, POST reviews the component health status of FRUs that were updated by the AD engine. POST uses this information and tries to isolate the fault by deconfiguring (disabling) any FRUs from the domain that have been determined to cause the hardware error. Even if POST cannot isolate the fault, the system controller then automatically reboots the domain as part of domain restoration.

Note - To take advantage of the automatic recovery feature, make sure that the OpenBoot PROM variable hang-policy is set to reset.

Automatic Recovery of a Hung System

The system controller automatically monitors the system for hangs and initiates recovery when either of the following occurs:

The operating system heartbeat stops within a designated timeout period.

The default timeout value is three minutes, but you can override this value by setting the watchdog_timeout_seconds parameter in the domain /etc/system file. If you set the value to less than three minutes, the system controller uses three minutes (the default value) as the timeout period. For details on this system parameter, see the system(4) man page of your Solaris Operating System release.

The system does not respond to interrupts.
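The timeout rule in the first condition above can be summarized in a few lines. The following Python sketch is illustrative only: the function name effective_watchdog_timeout is hypothetical and is not part of the firmware or the Solaris Operating System. It only restates the rule that values below three minutes fall back to the default.

```python
# Three minutes, the default heartbeat timeout described above.
DEFAULT_TIMEOUT_SECONDS = 180


def effective_watchdog_timeout(configured_seconds=None):
    """Return the timeout period that would apply for a configured value.

    Hypothetical helper: if no value is configured, or the configured
    value is less than three minutes, the default of three minutes is used.
    """
    if configured_seconds is None:
        return DEFAULT_TIMEOUT_SECONDS
    return max(configured_seconds, DEFAULT_TIMEOUT_SECONDS)
```

For example, a configured value of 60 seconds results in a 180-second timeout, while a configured value of 600 seconds is used unchanged.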

When the host watchdog (as described in the setupsc command) is enabled, the system controller automatically performs an externally initiated reset (XIR) and reboots the hung operating system. If the OpenBoot PROM NVRAM variable error-reset-recovery is set to sync, a core file is also generated after an XIR and can be used to troubleshoot the operating system hang.

CODE EXAMPLE 7-2 shows the console message displayed when the operating system heartbeat stops.

CODE EXAMPLE 7-2 Example of Message Output for Automatic Domain Recovery After the Operating System Heartbeat Stops

Tue Dec 09 12:24:47 commando lom: Domain watchdog timer expired.
Tue Dec 09 12:24:48 commando lom: Using default hang-policy (RESET).
Tue Dec 09 12:24:48 commando lom: Resetting (XIR) domain.

CODE EXAMPLE 7-3 shows the console message displayed when the operating system does not respond to interrupts.

CODE EXAMPLE 7-3 Example of Console Output for Automatic Recovery After the Operating System Does Not Respond to Interrupts

Tue Dec 09 12:37:38 commando lom: Domain is not responding to interrupts.
Tue Dec 09 12:37:38 commando lom: Using default hang-policy (RESET).
Tue Dec 09 12:37:38 commando lom: Resetting (XIR) domain

Diagnosis Events

Certain nonfatal hardware errors are identified by the Solaris Operating System and reported to the system controller. The system controller does the following:

Records and maintains this information for the affected resources as part of the component health status.

Reports this information through event messages displayed on the console.

The next time that POST runs, it reviews the health status of the affected resources and, if possible, deconfigures the appropriate resources from the system.

CODE EXAMPLE 7-4 shows an event message for a nonfatal domain error. When you see such event messages, contact your service provider so that the appropriate service action can be initiated. The event message information provided is described in Reviewing Auto-Diagnosis Event Messages.

CODE EXAMPLE 7-4 Domain Diagnosis Event Message - Nonfatal Domain Hardware Error

[DOM] Event: SFV1280.L2SRAM.SERD.0.60.10040000000128.7fd78d140
      CSN:  DomainID: A ADInfo: 1.SF-SOLARIS-DE.5_8_Generic_116188-01
      Time: Wed Nov 26 12:06:14 PST 2003
      FRU-List-Count: 1; FRU-PN: 3704129; FRU-SN: 100ACD; FRU-LOC: /N0/SB0/P0/E0
      Recommended-Action: Service action required

You can obtain further information about components deconfigured by POST by using the showboards and showcomponent commands, as described in Reviewing Component Status.

Diagnostic and Recovery Controls

This section explains the various controls and parameters that affect the restoration features.

Diagnostic Parameters

TABLE 7-1 describes the parameter settings that control the diagnostic and operating system recovery process. The default values for the diagnostic and operating system recovery parameters are the recommended settings.

Note - If you do not use the default settings, the restoration features will not function as described in Automatic Diagnosis and Recovery Overview.

TABLE 7-1 Diagnostic and Operating System Recovery Parameters
Parameter	Set Using	Default Value	Description
`Host Watchdog`	setupsc command	`enabled`	Automatically reboots the domain when a hardware error is detected. Also boots the Solaris Operating System when the `OBP.auto-boot` parameter is set to `true`.
`Log Reset Data`	setupsc command	`true`	If enabled, causes the system controller to send data to the console about the current state of each CPU before resetting the system during a system hang (if Host Watchdog has been enabled). This allows system state data to be preserved if console data is being logged. The output format is the same as the format used by the `showresetstate` command when dumping the CPU state data for a hung system manually (that is, if Host Watchdog has been disabled).
`Verbose Reset Data`	setupsc command	`true`	Controls the amount of information that the system controller sends to the console. When enabled, this option produces the same result as using the `showresetstate` `-v` command.
`Tolerate correctable memory errors`	setupsc command	`false`	If set to true, allows the Solaris Operating System to boot with memory exhibiting correctable ECC errors. The Solaris 10 Operating System incorporates features that automatically isolate faulty parts of such memory modules, thus avoiding the need to completely disable these modules and increasing system availability. If set to false, memory modules exhibiting correctable ECC errors are disabled by POST and not allowed to participate in the Solaris domain.
`reboot-on-error`	OBP setenv	`true`	Automatically reboots the domain when a hardware error is detected. Also boots the Solaris Operating System when the `OBP.auto-boot` parameter is set to `true`.
`auto-boot`	OBP setenv	`true`	Boots the Solaris Operating System after POST runs.
`error-reset-recovery`	OBP setenv	`sync`	Automatically reboots the system after an XIR occurs and generates a core file that can be used to troubleshoot the system hang. However, be aware that sufficient disk space must be allocated in the swap area to hold the core file.
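Because the restoration features depend on the default settings, it can be useful to compare the values in use against TABLE 7-1. The following Python sketch shows one way to do that. It assumes you have already collected the current values by hand into a dictionary; the function name non_default_settings and the dictionary layout are hypothetical and are not part of any Sun tool.

```python
# Default (recommended) values, transcribed from TABLE 7-1.
RECOMMENDED = {
    "Host Watchdog": "enabled",
    "Log Reset Data": "true",
    "Verbose Reset Data": "true",
    "Tolerate correctable memory errors": "false",
    "reboot-on-error": "true",
    "auto-boot": "true",
    "error-reset-recovery": "sync",
}


def non_default_settings(current):
    """Return {parameter: (current value, recommended value)} for mismatches.

    Parameters missing from `current` are skipped, and the comparison
    ignores letter case and surrounding whitespace.
    """
    differences = {}
    for name, recommended in RECOMMENDED.items():
        value = current.get(name)
        if value is not None and value.strip().lower() != recommended:
            differences[name] = (value, recommended)
    return differences
```

An empty result means that every value supplied matches the recommended default.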

Obtaining Auto-Diagnosis and Recovery Information

This section describes various ways to monitor hardware errors and obtain additional information about components associated with hardware errors.

Reviewing Auto-Diagnosis Event Messages

Auto-diagnosis [AD] and domain [DOM] event messages are displayed on the console and also in the following:

The /var/adm/messages file, provided that you have set up the event reporting appropriately, as described in Chapter 4.

The showlogs command output, which displays the event messages logged on the console.

In systems with enhanced-memory system controllers (SC V2s), log messages are maintained in a persistent buffer. You can selectively view certain types of log messages according to message type, such as fault event messages, by using the showlogs -p -f filter command. For details, see the showlogs command description in the Sun Fire Entry-Level Midrange System Controller Command Reference Manual.

The [AD] or [DOM] event messages (see CODE EXAMPLE 7-1, CODE EXAMPLE 7-4, and CODE EXAMPLE 7-5) include the following information:

[AD] or [DOM] - Beginning of the message. AD indicates that the ScApp or POST automatic diagnosis engine generated the event message. DOM indicates that the Solaris Operating System on the affected domain generated the automatic diagnosis event message.

Event - An alphanumeric text string that identifies the platform and event-specific information used by your service provider.

CSN - Chassis serial number, which identifies your Sun Fire midrange system.

DomainID - The domain affected by the hardware error. On entry-level midrange systems, this is always Domain A.

ADInfo - The version of the auto-diagnosis message, the name of the diagnosis engine (SCAPP or SF-SOLARIS-DE), and the auto-diagnosis engine version. For domain diagnosis events, the diagnosis engine is the Solaris Operating System (SF-SOLARIS-DE) and the version of the diagnosis engine is the version of the Solaris Operating System in use.

Time - The day of the week, month, date, time (hours, minutes, and seconds), time zone, and year of the auto-diagnosis.

FRU-List-Count - The number of components (FRUs) involved with the error and the following FRU data:

If a single component is implicated, the FRU part number, serial number, and location are displayed, as CODE EXAMPLE 7-4 shows.

If multiple components are implicated, the FRU part number, serial number, and location for each component involved are reported, as CODE EXAMPLE 7-1 shows.

Be aware that not all the FRUs listed are necessarily faulty. The fault may reside in a subset of the components identified.

If the SCAPP diagnosis engine cannot implicate specific components, the term UNRESOLVED is displayed, as CODE EXAMPLE 7-5 shows.

Recommended-Action: Service action required - Instructs the administrator to contact their service provider for further service action. Also indicates the end of the auto-diagnosis message.
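If you capture console or showlogs output to a file, the fields listed above can be extracted with a short script. The following Python sketch is one possible approach, not a Sun-supplied tool. It assumes the field order shown in the code examples in this chapter; the function name parse_event and the dictionary keys it returns are hypothetical.

```python
import re

# Header fields, in the order they appear in the examples in this chapter.
HEADER = re.compile(
    r"\[(?P<source>AD|DOM)\] Event: (?P<event>\S+) "
    r"CSN: ?(?P<csn>.*?) ?DomainID: (?P<domain>\S+) "
    r"ADInfo: (?P<adinfo>\S+) "
    r"Time: (?P<time>.*?) "
    r"FRU-List-Count: (?P<count>\d+);"
)

# One FRU entry: part number, serial number, and location.
FRU = re.compile(
    r"FRU-PN: ?(?P<pn>[^;]*?) ?; FRU-SN: ?(?P<sn>[^;]*?) ?; FRU-LOC: (?P<loc>\S+)"
)


def parse_event(message):
    """Parse one [AD] or [DOM] event message into a dictionary."""
    # Collapse line breaks and indentation so the message is one line.
    flat = " ".join(message.split())
    header = HEADER.search(flat)
    if header is None:
        raise ValueError("not an [AD] or [DOM] event message")

    # An UNRESOLVED location means that no specific FRU was implicated.
    frus = [m.groupdict() for m in FRU.finditer(flat)
            if m.group("loc") != "UNRESOLVED"]

    if not frus:
        result = "unresolved"
    elif len(frus) == 1:
        result = "single FRU"
    else:
        result = "multiple FRUs"

    event = header.groupdict()
    event["count"] = int(header.group("count"))
    event["frus"] = frus
    event["result"] = result
    return event
```

The result key mirrors the three diagnosis outcomes described in the overview: a single FRU, multiple FRUs, or an unresolved diagnosis.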

CODE EXAMPLE 7-5 Example of Auto-Diagnostic Message

Tue Dec 02 14:35:56 commando lom: ErrorMonitor: Domain A has a SYSTEM ERROR
. . .
Tue Dec 02 14:35:59 commando lom: [AD] Event: E2900
     CSN:  DomainID: A ADInfo: 1.SCAPP.17.0
     Time: Tue Dec 02 14:35:57 PST 2003
     FRU-List-Count: 0; FRU-PN:  ; FRU-SN:  ; FRU-LOC: UNRESOLVED
     Recommended-Action: Service action required
Tue Dec 02 14:35:59 commando lom: A fatal condition is detected on Domain A.
Initiating automatic restoration for this domain

Reviewing Component Status

You can obtain additional information about components that have been deconfigured as part of the auto-diagnosis process or disabled for other reasons by reviewing the following items:

The showboards command output after an auto-diagnosis has occurred

CODE EXAMPLE 7-6 shows the location assignments and the status for all components in the system. The diagnostic-related information is provided in the Status column for a component. Components that have a Failed or Disabled status are deconfigured from the system. The Failed status indicates that the board failed testing and is not usable. The Disabled status indicates that the board has been deconfigured from the system, either because it was disabled using the setls command or because it failed POST. The Degraded status indicates that certain components on the board have failed or are disabled, but that there are still usable parts on the board. Components with Degraded status are configured into the system.

You can obtain additional information about Failed, Disabled, or Degraded components by reviewing the output from the showcomponent command.

CODE EXAMPLE 7-6 showboards Command Output - Disabled and Degraded Components

Slot     Pwr Component Type                 State      Status
----     --- --------------                 -----      ------
SSC1     On  System Controller V2           Main       Passed
/N0/SCC  -   System Config Card             Assigned   OK
/N0/BP   -   Baseplane                      Assigned   Passed
/N0/SIB  -   Indicator Board                Assigned   Passed
/N0/SPDB -   System Power Distribution Bd.  Assigned   Passed
/N0/PS0  On  A166 Power Supply              -          OK
/N0/PS1  On  A166 Power Supply              -          OK
/N0/PS2  On  A166 Power Supply              -          OK
/N0/PS3  On  A166 Power Supply              -          OK
/N0/FT0  On  Fan Tray                       Auto Speed Passed
/N0/RP0  On  Repeater Board                 Assigned   OK
/N0/RP2  On  Repeater Board                 Assigned   OK
/N0/SB0  On  CPU Board                      Active     Passed
/N0/SB2  On  CPU Board V3                   Assigned   Disabled
/N0/SB4  On  CPU Board                      Active     Degraded
/N0/IB6  On  PCI I/O Board                  Active     Passed
/N0/MB   -   Media Bay                      Assigned   Passed

The showcomponent command output after an auto-diagnosis has occurred

The Status column in CODE EXAMPLE 7-7 shows the status for components. The status is either enabled or disabled. The disabled components are deconfigured from the system. The POST status chs (abbreviation for component health status) flags the component for further analysis by your service provider.

Note - Disabled components that have a POST status of chs cannot be enabled by using the setls command. Contact your service provider for assistance. In some cases, subcomponents belonging to a "parent" component associated with a hardware error will also reflect a disabled status, as will the parent. You cannot re-enable the subcomponents of a parent component associated with a hardware error. Review the auto-diagnosis event messages to determine which parent component is associated with the error.

CODE EXAMPLE 7-7 showcomponent Command Output - Disabled Components

schostname: SC> showcomponent

Component          Status    Pending POST   Description
---------          ------    ------- ----   -----------
/N0/SB0/P0         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P1         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P2         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P3         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P0/B0/L0   disabled  -       chs    empty
/N0/SB0/P0/B0/L2   disabled  -       chs    empty
/N0/SB0/P0/B1/L1   disabled  -       chs    2048M DRAM
/N0/SB0/P0/B1/L3   disabled  -       chs    2048M DRAM
. . .
/N0/SB0/P3/B0/L0   disabled  -       chs    empty
/N0/SB0/P3/B0/L2   disabled  -       chs    empty
/N0/SB0/P3/B1/L1   disabled  -       chs    1024M DRAM
/N0/SB0/P3/B1/L3   disabled  -       chs    1024M DRAM
/N0/SB4/P0         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P1         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P2         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P3         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
. . .
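When the showboards and showcomponent output is long, a small script can pick out the entries that matter for diagnosis. The following Python sketch assumes the column layouts shown in CODE EXAMPLE 7-6 and CODE EXAMPLE 7-7 and works on output you have saved as text. The function names are hypothetical and are not system controller commands.

```python
ATTENTION_STATUSES = {"Failed", "Disabled", "Degraded"}


def boards_needing_attention(showboards_output):
    """Return (slot, status) pairs for Failed, Disabled, or Degraded boards.

    The slot is the first column and the status is the last column, so
    multi-word component types and states do not affect the result.
    """
    flagged = []
    for line in showboards_output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        slot, status = tokens[0], tokens[-1]
        if status in ATTENTION_STATUSES:
            flagged.append((slot, status))
    return flagged


def chs_disabled_components(showcomponent_output):
    """Return components that are disabled with a POST status of chs.

    Columns are Component, Status, Pending, POST, and Description.
    """
    flagged = []
    for line in showcomponent_output.splitlines():
        tokens = line.split(None, 4)
        if len(tokens) >= 4 and tokens[1] == "disabled" and tokens[3] == "chs":
            flagged.append(tokens[0])
    return flagged
```

Applied to CODE EXAMPLE 7-6, the first function reports /N0/SB2 as Disabled and /N0/SB4 as Degraded. The components returned by the second function are the ones that cannot be re-enabled with the setls command.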

Reviewing Additional Error Information

For systems configured with enhanced-memory SCs (SC V2s), the showerrorbuffer -p command shows the system error contents maintained in the persistent buffer.

However, for systems that do not have enhanced-memory SCs, the showerrorbuffer command shows the contents of the dynamic buffer and displays error messages that otherwise might be lost when your domains are rebooted as part of the domain recovery process.

In either case, the information displayed can be used by your service provider for troubleshooting purposes.

CODE EXAMPLE 7-8 shows the output displayed for a domain hardware error.

CODE EXAMPLE 7-8 showerrorbuffer Command Output - Hardware Error

EX07:
lom>showerrorbuffer
ErrorData[0]
  Date: Fri Jan 30 10:23:32 EST 2004
  Device: /SSC1/sbbc0/systemepld
  Register: FirstError[0x10] : 0x0200
            SB0 encountered the first error
ErrorData[1]
  Date: Fri Jan 30 10:23:32 EST 2004
  Device: /SB0/bbcGroup0/repeaterepld
  Register: FirstError[0x10]: 0x0002
            sdc0 encountered the first error
ErrorData[2]
  Date: Fri Jan 30 10:23:32 EST 2004
  Device: /SB0/sdc0
  ErrorID: 0x60171010
  Register: SafariPortError0[0x200] : 0x00000002
               ParSglErr [01:01] : 0x1 ParitySingle error
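If you need to pass this output to your service provider in a structured form, the records can be grouped with a short script. The following Python sketch assumes the layout shown in CODE EXAMPLE 7-8, where each record starts with an ErrorData[n] line. The function name parse_error_buffer and the Details key are hypothetical.

```python
import re

# Fields that are stored under their own key; all other lines of a
# record (register contents and descriptions) are kept as detail text.
SIMPLE_FIELDS = ("Date", "Device", "ErrorID")


def parse_error_buffer(output):
    """Group showerrorbuffer output into one dictionary per ErrorData record."""
    records = []
    for raw in output.splitlines():
        line = raw.strip()
        if re.fullmatch(r"ErrorData\[\d+\]", line):
            # Start of a new record.
            records.append({"Details": []})
        elif records and line:
            key, separator, value = line.partition(": ")
            if separator and key in SIMPLE_FIELDS:
                records[-1][key] = value
            else:
                records[-1]["Details"].append(line)
    return records
```

Lines that appear before the first ErrorData[n] line, such as the command prompt, are ignored.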