C H A P T E R 7 - Automatic Diagnosis and Recovery

This chapter describes the error diagnosis and domain recovery capabilities included with the firmware for Sun Fire midrange systems. This chapter explains the following:

Automatic Diagnosis and Recovery Overview

Automatic Recovery of Hung Domains

Domain Diagnosis Events

Domain Recovery Controls

Obtaining Auto-Diagnosis and Domain Recovery Information

Automatic Diagnosis and Recovery Overview

The diagnosis and recovery features are enabled by default on Sun Fire midrange systems. This section provides an overview of how these features work.

Depending on the type of hardware errors that occur and the diagnostic controls that are set, the system controller performs certain diagnosis and domain recovery steps, as FIGURE 7-1 shows. The firmware includes an auto-diagnosis (AD) engine, which detects and diagnoses hardware errors that affect the availability of a platform and its domains.

FIGURE 7-1 Auto Diagnosis and Domain Recovery Process

Diagram that shows the main steps in the error diagnosis and domain restoration process: Domain hardware error detection and domain pause, automatic diagnosis, and automatic domain restoration.

The following summary describes the process shown in FIGURE 7-1:

1. Error detection. The system controller detects a domain hardware error and pauses the domain.

2. Auto-diagnosis. The AD engine analyzes the hardware error and determines which field-replaceable units (FRUs) are associated with the hardware error.

The AD engine provides one of the following diagnosis results, depending on the hardware error and the components involved:

Identifies a single FRU that is responsible for the error.

Identifies multiple FRUs that are responsible for the error. Be aware that not all of the components listed are necessarily faulty. The hardware error could be related to a smaller subset of the components identified.

Indicates that the FRUs responsible for the error cannot be determined. This condition is considered to be "unresolved" and requires further analysis by your service provider.

The AD engine records the diagnosis information for the affected components and maintains this information as part of the component health status (CHS).

The AD engine reports diagnosis information through the following:

Platform and domain console event messages or the platform or domain loghost output, provided that the syslog loghost for the platform and domains has been configured (see The syslog Loghost for details).

CODE EXAMPLE 7-1 shows an auto-diagnosis event message that appears on the platform console. In this example, a single FRU is responsible for the hardware error. See Reviewing Auto-Diagnosis Event Messages for details on the AD message contents.

CODE EXAMPLE 7-1 Example of Auto-Diagnosis Event Message Displayed on the Platform Console
Jan 23 20:47:11 schostname Platform.SC: ErrorMonitor: Domain A has a SYSTEM ERROR
. . .
[AD] Event: SF3800.ASIC.SDC.PAR_SGL_ERR.60111010
     CSN: 124H58EE DomainID: A ADInfo: 1.SCAPP.15.0
     Time: Thu Jan 23 20:47:11 PST 2003
     FRU-List-Count: 1; FRU-PN: 5014362; FRU-SN: 011600; FRU-LOC: /N0/SB0
     Recommended-Action: Service action required
Jan 23 20:47:16 schostname Platform.SC: A fatal condition is detected on Domain A. Initiating automatic restoration for this domain.

Note - Contact your service provider when you see these auto-diagnosis messages. Your service provider will review the auto-diagnosis information and initiate the appropriate service action.

Output from the showlogs, showboards, showcomponent, and showerrorbuffer commands (see Obtaining Auto-Diagnosis and Domain Recovery Information for details on the diagnosis-related information displayed by these commands).

The output from these commands supplements the diagnosis information presented in the platform and domain event messages and can be used for additional troubleshooting purposes.

3. Auto-restoration. During the auto-restoration process, POST reviews the component health status of the FRUs that were updated by the AD engine. POST uses this information to try to isolate the fault by deconfiguring (disabling) from the domain any FRUs that have been determined to cause the hardware error. Even if POST cannot isolate the fault, the system controller automatically reboots the domain as part of domain restoration.

Note - To take advantage of the automatic recovery feature, make sure that the OpenBoot PROM variable hang-policy is set to reset.
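The three possible diagnosis results in step 2 can be sketched as a small classification. This is only an illustration, not the AD engine's implementation: the `Diagnosis` enum and the `classify_diagnosis` function are hypothetical names, and the input is assumed to be the list of FRU locations reported for an error, where an empty list or the marker `UNRESOLVED` means that no FRU could be implicated.

```python
from enum import Enum


class Diagnosis(Enum):
    """Hypothetical labels for the three diagnosis results described above."""
    SINGLE_FRU = "single FRU implicated"
    MULTIPLE_FRUS = "multiple FRUs implicated (only a subset may be faulty)"
    UNRESOLVED = "unresolved; needs analysis by the service provider"


def classify_diagnosis(fru_locations):
    """Map a list of implicated FRU locations to one of the three results."""
    # Assumption: an empty entry or the literal UNRESOLVED marker means
    # that no specific FRU was identified.
    frus = [loc for loc in fru_locations if loc and loc != "UNRESOLVED"]
    if not frus:
        return Diagnosis.UNRESOLVED
    if len(frus) == 1:
        return Diagnosis.SINGLE_FRU
    return Diagnosis.MULTIPLE_FRUS
```

For example, `classify_diagnosis(["RP0", "/N0/SB2"])` yields `Diagnosis.MULTIPLE_FRUS`, which is the case where not every listed component is necessarily faulty.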

Automatic Recovery of Hung Domains

The system controller automatically monitors domains for hangs when either of the following occurs:

A domain heartbeat stops within a designated timeout period.

The default timeout value is three minutes, but you can override this value by setting the watchdog_timeout_seconds parameter in the domain /etc/system file. If you set the value to less than three minutes, the system controller uses three minutes (the default value) as the timeout period. For details on this system parameter, refer to the system(4) man page of your Solaris operating environment release.

The domain does not respond to interrupts.

When the hang-policy parameter of the setupdomain command is set to reset, the system controller automatically performs an externally initiated reset (XIR) and reboots the hung domain. If the OBP.error-reset-recovery parameter of the setupdomain command is set to sync, a core file is also generated after an XIR and can be used to troubleshoot the domain hang. See Domain Parameters for details.
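The timeout rule and the hang response described above can be summarized in a short sketch. The function names are hypothetical and the code only models the documented behavior; it does not talk to a system controller. The source does not describe what happens when hang-policy has a value other than reset, so the sketch returns no actions in that case.

```python
DEFAULT_TIMEOUT_SECONDS = 180  # three minutes, the documented default


def effective_watchdog_timeout(configured_seconds=None):
    """Return the timeout that applies for a given watchdog_timeout_seconds.

    Values below three minutes are ignored in favor of the default.
    """
    if configured_seconds is None:
        return DEFAULT_TIMEOUT_SECONDS
    return max(configured_seconds, DEFAULT_TIMEOUT_SECONDS)


def hang_recovery_actions(hang_policy="reset", error_reset_recovery="sync"):
    """List the recovery actions taken for a hung domain (illustrative)."""
    actions = []
    if hang_policy == "reset":
        actions.append("XIR")
        if error_reset_recovery == "sync":
            # A core file is generated after the XIR.
            actions.append("core file")
        actions.append("reboot")
    # Other hang-policy values are not covered by this section.
    return actions
```

With the default settings, a 60-second watchdog value still results in a 180-second timeout, and a hang leads to an XIR, a core file, and a reboot.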

CODE EXAMPLE 7-2 shows the domain console message displayed when the domain heartbeat stops.

CODE EXAMPLE 7-2 Example of Domain Message Output for Automatic Domain Recovery After the Domain Heartbeat Stops
Jan 22 14:59:23 schostname Domain-A.SC: Domain watchdog timer expired.
Jan 22 14:59:23 schostname Domain-A.SC: Using default hang-policy (RESET).
Jan 22 14:59:23 schostname Domain-A.SC: Resetting (XIR) domain.

CODE EXAMPLE 7-3 shows the domain console message displayed when the domain does not respond to interrupts.

CODE EXAMPLE 7-3 Example of Domain Console Output for Automatic Domain Recovery After the Domain Does Not Respond to Interrupts
Jan 22 14:59:23 schostname Domain-A.SC: Domain is not responding to interrupts.
Jan 22 14:59:23 schostname Domain-A.SC: Using default hang-policy (RESET).
Jan 22 14:59:23 schostname Domain-A.SC: Resetting (XIR) domain.
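If domain messages are forwarded to a loghost, the two hang conditions can be picked out of the log by their message text. The following sketch assumes that the log lines look like the examples in this section. The `find_hang_events` function and the cause labels are hypothetical, and real message text may vary between firmware releases.

```python
import re

# Message fragments taken from the examples in this section.
HANG_MARKERS = {
    "Domain watchdog timer expired": "heartbeat stopped",
    "Domain is not responding to interrupts": "not responding to interrupts",
}

# Matches the message source tag, for example "Domain-A.SC:".
DOMAIN_TAG = re.compile(r"Domain-(\w+)\.SC:")


def find_hang_events(lines):
    """Return (domain, cause) pairs for hang messages found in log lines."""
    events = []
    for line in lines:
        tag = DOMAIN_TAG.search(line)
        if tag is None:
            continue
        for marker, cause in HANG_MARKERS.items():
            if marker in line:
                events.append((tag.group(1), cause))
    return events
```

Lines such as "Resetting (XIR) domain." are ignored, because they describe the recovery action rather than the cause of the hang.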

Domain Diagnosis Events

Starting with the 5.15.3 release, certain non-fatal domain hardware errors are identified by the Solaris operating environment and reported to the system controller. The system controller does the following:

Records and maintains this information for the affected domain resources as part of the component health status

Reports this information through domain diagnosis [DOM] event messages displayed on the domain console or domain loghost, provided that domain loghosts have been configured

The next time that POST is run, POST reviews the health status of affected resources and if possible, deconfigures the appropriate resources from the system.

CODE EXAMPLE 7-4 shows a domain diagnosis event message for a non-fatal domain error. When you see such event messages, contact your service provider so that the appropriate service action can be initiated. The event message information provided is described in Reviewing Auto-Diagnosis Event Messages.

CODE EXAMPLE 7-4 Domain Diagnosis Event Message - Non-fatal Domain Hardware Error
[DOM] Event: SF6800.L2SRAM.SERD.2.f.1b.10040000000091.f4470000
      CSN: 044M347B DomainID: A ADInfo: 1.SF-SOLARIS-DE.build:05/29/03
      Time: Mon Jun 02 23:34:59 PDT 2003
      FRU-List-Count: 1; FRU-PN: 3704125; FRU-SN: 090K01; FRU-LOC: /N0/SB3/P3/E0
      Recommended-Action: Service action required

You can obtain further information about components deconfigured by POST by using the showboards and showcomponent commands, as described in Reviewing Component Status.

Domain Recovery Controls

This section explains the various controls and domain parameters that affect the domain restoration features.

The `syslog` Loghost

Sun strongly recommends that you define platform and domain loghosts to which all system log (syslog) messages are forwarded and stored. Platform and domain messages, including the auto-diagnosis and domain-recovery event messages, cannot be stored locally. By specifying a loghost for platform and domain log messages, you can use the loghost to monitor and review critical events and messages as needed. However, you must set up a loghost server if you want to assign platform and domain loghosts.

You assign the loghosts through the Loghost and the Log Facility parameters in the setupplatform and setupdomain commands. The facility level identifies where the log messages originate, either the platform or a domain. For details on these commands, refer to their command descriptions in the Sun Fire Midrange System Controller Command Reference Manual.

Domain Parameters

TABLE 7-1 describes the domain parameter settings in the setupdomain command that control the diagnostic and domain recovery process. The default values for the diagnostic and domain recovery parameters are the recommended settings.

Note - If you do not use the default settings, the domain restoration features will not function as described in Automatic Diagnosis and Recovery Overview.

TABLE 7-1 Diagnostic and Domain Recovery Parameters in the `setupdomain` Command
`setupdomain` Parameter	Default Value	Description
`reboot-on-error`	`true`	Automatically reboots the domain when a hardware error is detected. Also boots the Solaris operating environment when the `OBP.auto-boot` parameter is set to `true`.
`hang-policy`	`reset`	Automatically resets a hung domain through an externally initiated reset (XIR).
`max-panic-diag-limit`	`mem2` (accepts the same list of values as `diag-level`)	Defines the maximum level of POST that runs automatically during repeated domain panics. The POST level is escalated upon repeated panics until POST runs at the level specified in `max-panic-diag-limit`. If the domain panics again, it is placed in standby.
`OBP.auto-boot`	`true`	Boots the Solaris operating environment after POST runs.
`OBP.error-reset-recovery`	`sync`	Automatically reboots the domain after an XIR occurs and generates a core file that can be used to troubleshoot the domain hang. However, be aware that sufficient disk space must be allocated in the domain swap area to hold the core file.

For a complete description of all the domain parameters and their values, refer to the setupdomain command description in the Sun Fire Midrange System Controller Command Reference Manual.
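The escalation behavior of `max-panic-diag-limit` can be pictured with a short sketch. Two things here are assumptions rather than statements from this chapter: the ordered list of POST levels in `DIAG_LEVELS`, and the idea that each repeated panic raises the level by exactly one step. Check the setupdomain command description for the actual `diag-level` values. The function name `next_post_level` is hypothetical.

```python
# Assumed ordering of POST diagnostic levels, lowest to highest.
# Verify against the setupdomain command description before relying on it.
DIAG_LEVELS = ["init", "quick", "default", "mid", "max", "mem1", "mem2"]


def next_post_level(current, limit="mem2"):
    """Return the POST level to run after another panic, or None for standby.

    Assumption: the level is raised one step per repeated panic until it
    reaches the limit. A further panic at the limit places the domain in
    standby, which is represented here by None.
    """
    current_index = DIAG_LEVELS.index(current)
    limit_index = DIAG_LEVELS.index(limit)
    if current_index >= limit_index:
        return None  # limit already reached: domain goes to standby
    return DIAG_LEVELS[current_index + 1]
```

Under these assumptions, a domain that last ran POST at `default` with a limit of `max` would next run `mid`, and a panic after running `max` would leave the domain in standby.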

Obtaining Auto-Diagnosis and Domain Recovery Information

This section describes various ways to monitor hardware errors and obtain additional information about components associated with hardware errors.

Reviewing Auto-Diagnosis Event Messages

Auto-diagnosis [AD] and domain [DOM] event messages are displayed on the platform and domain consoles and also in the following:

The platform or domain loghost, provided that you have defined the syslog host for the platform and domains.

Each line of the loghost output contains a timestamp, a syslog ID number, and the facility level that identifies the platform or a domain where the log message originated.

The showlogs command output, which displays the event messages logged on the platform or domain console.

In systems with SC V2s, certain log messages are maintained in persistent storage. You can selectively view certain types of persistent log messages according to message type, such as fault event messages, by using the showlogs -p -f filter command. For details, refer to the showlogs command description in the Sun Fire Midrange System Controller Command Reference Manual.

The diagnostic information logged on the platform and on the domain is similar, but the domain log provides additional information on the domain hardware error. The [AD] or [DOM] event messages (see CODE EXAMPLE 7-1, CODE EXAMPLE 7-4, CODE EXAMPLE 7-5, and CODE EXAMPLE 7-6) include the following information:

[AD] or [DOM] - Beginning of the message. AD indicates that the ScApp or POST automatic diagnosis engine generated the event message. DOM indicates that the Solaris operating environment on the affected domain generated the automatic diagnosis event message.

Event - An alphanumeric text string that identifies the platform and event-specific information used by your service provider.

CSN - Chassis serial number, which identifies your Sun Fire midrange system.

DomainID - The domain affected by the hardware error.

ADInfo - The version of the auto-diagnosis message, the name of the diagnosis engine (SCAPP or SF-SOLARIS-DE), and the auto-diagnosis engine version. For domain diagnosis events, the diagnosis engine is the Solaris operating environment (SF-SOLARIS-DE) and the version of the diagnosis engine is the version of the Solaris operating environment in use.

Time - The day of the week, month, date, time (hours, minutes, and seconds), time zone, and year of the auto-diagnosis.

FRU-List-Count - The number of components (FRUs) involved with the error and the following FRU data:

If a single component is implicated, the FRU part number, serial number, and location are displayed, as CODE EXAMPLE 7-1 shows.

If multiple components are implicated, the FRU part number, serial number, and location for each component involved are reported, as CODE EXAMPLE 7-5 shows.

Be aware that in some cases not all the FRUs listed are faulty. The fault may reside in a subset of the components identified.

If the SCAPP diagnosis engine cannot implicate specific components, the term UNRESOLVED is displayed, as CODE EXAMPLE 7-6 shows.

Recommended-Action: Service action required - Instructs the platform or domain administrator to contact their service provider for further service action. Also indicates the end of the auto-diagnosis message.
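When event messages are collected on a loghost, the fields listed above can be extracted with a few regular expressions. The sketch below is an illustration that assumes the layout shown in the code examples of this chapter. The function `parse_event_message` and the dictionary keys are hypothetical names, not part of any Sun tool.

```python
import re


def _find(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_event_message(message):
    """Extract the documented fields from one [AD] or [DOM] event message."""
    # Collapse line breaks and indentation so that fields can be matched
    # regardless of how the console wrapped the message.
    text = " ".join(message.split())
    adinfo = _find(r"ADInfo:\s*(\S+)", text) or ""
    # ADInfo is message version, engine name, engine version.
    adinfo_parts = adinfo.split(".", 2)
    frus = [
        {"part_number": pn, "serial_number": sn, "location": loc}
        for pn, sn, loc in re.findall(
            r"FRU-PN:\s*(\S*?);\s*FRU-SN:\s*(\S*?);\s*FRU-LOC:\s*(\S+)", text
        )
        if loc != "UNRESOLVED"  # no FRU could be implicated
    ]
    count = _find(r"FRU-List-Count:\s*(\d+)", text)
    return {
        "source": _find(r"\[(AD|DOM)\]", text),
        "event": _find(r"Event:\s*(\S+)", text),
        "csn": _find(r"CSN:\s*(\S+)", text),
        "domain": _find(r"DomainID:\s*(\S+)", text),
        "engine": adinfo_parts[1] if len(adinfo_parts) > 1 else None,
        "time": _find(r"Time:\s*(.+?)\s+FRU-List-Count:", text),
        "fru_count": int(count) if count is not None else None,
        "frus": frus,
    }
```

An unresolved diagnosis produces an empty `frus` list and a `fru_count` of 0, which is a convenient signal that the message needs review by the service provider.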

CODE EXAMPLE 7-5 Example of Domain Console Auto-Diagnostic Message Involving Multiple FRUs
Jan. 23 21:07:51 schostname Domain-A.SC: ErrorMonitor: Domain A has a SYSTEM ERROR
. . .
[AD] Event: SF3800.ASIC.SDC.PAR_L2_ERR_TT.60113022
     CSN: 124H58EE DomainID: A ADInfo: 1.SCAPP.15.0
     Time: Thu Jan 23 21:07:51 PST 2003
     FRU-List-Count: 2; FRU-PN: 5015876; FRU-SN: 000429; FRU-LOC: RP0
                        FRU-PN: 5014362; FRU-SN: 011570; FRU-LOC: /N0/SB2
     Recommended-Action: Service action required
Jan 23 21:08:01 schostname Domain-A.SC: A fatal condition is detected on Domain A. Initiating automatic restoration for this domain.

CODE EXAMPLE 7-6 Example of Domain Console Auto-Diagnostic Message Involving an Unresolved Diagnosis
Jan 23 21:47:28 schostname Domain-A.SC: ErrorMonitor: Domain A has a SYSTEM ERROR
. . .
[AD] Event: SF3800
     CSN: 124H58EE DomainID: A ADInfo: 1.SCAPP.15.0
     Time: Thu Jan 23 21:47:28 PST 2003
     FRU-List-Count: 0; FRU-PN: ; FRU-SN: ; FRU-LOC: UNRESOLVED
     Recommended-Action: Service action required
Jan 23 21:47:28 schostname Domain-A.SC: A fatal condition is detected on Domain A. Initiating automatic restoration for this domain.

Reviewing Component Status

You can obtain additional information about components that have been deconfigured as part of the auto-diagnosis process or disabled for other reasons by reviewing the following items:

The showboards command output after an auto-diagnosis has occurred

CODE EXAMPLE 7-7 shows the location assignments and the status for all components in the system. The diagnostic-related information is provided in the Status column for a component. Components that have a Failed or Disabled status are deconfigured from the system. The Failed status indicates that the board failed testing and is not usable. Disabled indicates that the board has been deconfigured from the system, because it was disabled using the setls command or because it failed POST. Degraded status indicates that certain components on the boards have failed or are disabled, but there are still usable parts on the board. Components with degraded status are configured into the system.

You can obtain additional information about Failed, Disabled, or Degraded components by reviewing the output from the showcomponent command.

CODE EXAMPLE 7-7 `showboards` Command Output - `Disabled` and `Degraded` Components
schostname: SC> showboards
Slot     Pwr Component Type                 State      Status     Domain
----     --- --------------                 -----      ------     ------
SSC0     On  System Controller              Main       Passed     -
SSC1     -   Empty Slot                     -          -          -
ID0      On  Sun Fire 4800 Centerplane      -          OK         -
PS0      -   Empty Slot                     -          -          -
PS1      On  A185 Power Supply              -          OK         -
PS2      On  A185 Power Supply              -          OK         -
FT0      On  Fan Tray                       High Speed OK         -
FT1      On  Fan Tray                       High Speed OK         -
FT2      On  Fan Tray                       High Speed OK         -
RP0      On  Repeater Board                 -          OK         -
/N0/SB0  On  CPU Board V3                   Assigned   Disabled   A
SB2      -   Empty Slot                     Assigned   -          A
/N0/SB4  On  CPU Board V3                   Active     Degraded   A
/N0/IB6  On  PCI I/O Board                  Active     Passed     A
IB8      Off PCI I/O Board                  Available  Not tested Isolated

The showcomponent command output after an auto-diagnosis has occurred

The Status column in CODE EXAMPLE 7-8 shows the status for components. The status is either enabled or disabled. The disabled components are deconfigured from the system. The POST status chs (abbreviation for component health status) flags the component for further analysis by your service provider.

Note - Disabled components that have a POST status of chs cannot be enabled by using the setls command. Contact your service provider for assistance. In some cases, subcomponents belonging to a "parent" component associated with a hardware error will also reflect a disabled status, as will the parent. You cannot re-enable the subcomponents of a parent component associated with a hardware error. Review the auto-diagnosis event messages to determine which parent component is associated with the error.

CODE EXAMPLE 7-8 `showcomponent` Command Output - Disabled Components
schostname: SC> showcomponent
Component          Status    Pending POST   Description
---------          ------    ------- ----   -----------
/N0/SB0/P0         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P1         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P2         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P3         disabled  -       chs    UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB0/P0/B0/L0   disabled  -       chs    empty
/N0/SB0/P0/B0/L2   disabled  -       chs    empty
/N0/SB0/P0/B1/L1   disabled  -       chs    2048M DRAM
/N0/SB0/P0/B1/L3   disabled  -       chs    2048M DRAM
. . .
/N0/SB0/P3/B0/L0   disabled  -       chs    empty
/N0/SB0/P3/B0/L2   disabled  -       chs    empty
/N0/SB0/P3/B1/L1   disabled  -       chs    2048M DRAM
/N0/SB0/P3/B1/L3   disabled  -       chs    2048M DRAM
/N0/SB4/P0         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P1         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P2         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
/N0/SB4/P3         enabled   -       pass   UltraSPARC-IV, 1050MHz, 16M ECache
. . .
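Captured showcomponent output can be filtered to list the components that are disabled with a POST status of chs, which are the ones that cannot be re-enabled with the setls command. The sketch below assumes the column layout shown in CODE EXAMPLE 7-8 and that the output was saved as text. The function names are hypothetical.

```python
def parse_showcomponent(output):
    """Turn showcomponent text output into a list of row dictionaries."""
    rows = []
    for line in output.splitlines():
        # Split into at most five fields so that the free-text description,
        # which contains spaces, stays in one piece.
        parts = line.split(None, 4)
        # Component rows start with a location path such as /N0/SB0/P0.
        # Headers, separators, prompts, and ". . ." lines are skipped.
        if len(parts) < 4 or not parts[0].startswith("/"):
            continue
        rows.append({
            "component": parts[0],
            "status": parts[1],
            "pending": parts[2],
            "post": parts[3],
            "description": parts[4] if len(parts) > 4 else "",
        })
    return rows


def disabled_by_chs(rows):
    """Return components that are disabled and flagged chs by POST."""
    return [
        row["component"]
        for row in rows
        if row["status"] == "disabled" and row["post"] == "chs"
    ]
```

The resulting list is a starting point for a service call. Review the auto-diagnosis event messages to identify which parent component is associated with the error.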

Reviewing Additional Error Information

For systems configured with SC V2s, the showerrorbuffer -p command displays the system error contents maintained in persistent storage.

However, for systems that do not have SC V2s, the showerrorbuffer command displays the contents of the system error buffer and displays error messages that otherwise might be lost when your domains are rebooted as part of the domain recovery process.

In either case, the information displayed can be used by your service provider for troubleshooting purposes.

CODE EXAMPLE 7-9 shows the output displayed for a domain hardware error, maintained in the system error buffer.

CODE EXAMPLE 7-9 `showerrorbuffer` Command Output - Hardware Error
schostname: SC> showerrorbuffer
ErrorData[0]
  Date: Tue Jan 21 14:30:20 PST 2003
  Device: /SSC0/sbbc0/systemepld
  Register: FirstError[0x10] : 0x0200
            SB0 encountered the first error
ErrorData[1]
  Date: Tue Jan 21 14:30:20 PST 2003
  Device: /partition0/domain0/SB4/bbcGroup0/repeaterepld
  Register: FirstError[0x10]: 0x00c0
            sbbc0 encountered the first error
            sbbc1 encountered the first error
ErrorData[2]
  Date: Tue Jan 21 14:30:20 PST 2003
  Device: /partition0/domain0/SB4/bbcGroup0/sbbc0
  ErrorID: 0x50121fff
  Register: ErrorStatus[0x80] : 0x00000300
            SafErr [09:08] : 0x3 Fireplane device asserted an error
. . .
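Saved showerrorbuffer output can be grouped into one record per ErrorData entry before it is passed to a service provider. The following sketch assumes the layout shown in CODE EXAMPLE 7-9. The function `parse_error_buffer` and the record keys are hypothetical, and register lines are kept as raw text rather than interpreted.

```python
import re

ENTRY_HEADER = re.compile(r"ErrorData\[(\d+)\]")


def parse_error_buffer(output):
    """Group showerrorbuffer text output into one record per ErrorData entry."""
    records = []
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        header = ENTRY_HEADER.fullmatch(line)
        if header:
            current = {
                "index": int(header.group(1)),
                "date": None,
                "device": None,
                "details": [],
            }
            records.append(current)
        elif current is None or not line or line == ". . .":
            # Skip the prompt, blank lines, and elision markers.
            continue
        elif line.startswith("Date:"):
            current["date"] = line[len("Date:"):].strip()
        elif line.startswith("Device:"):
            current["device"] = line[len("Device:"):].strip()
        else:
            # ErrorID, Register, and explanatory lines are kept verbatim.
            current["details"].append(line)
    return records
```

Each record keeps the device path that reported the error, so entries can be sorted or filtered by board before further analysis.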