CHAPTER 6

Automatic Diagnosis and Recovery

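This chapter describes the automatic error diagnosis and domain recovery features. This chapter contains the following sections:

Automatic Diagnosis and Recovery Overview

Enabling Email Event Notification

Testing Email Event Notification

Obtaining Diagnosis and Recovery Information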


Automatic Diagnosis and Recovery Overview

When certain hardware errors occur in a Sun Fire high-end system, the system controller performs specific diagnosis and domain recovery steps. The following automatic diagnosis engines (DEs) identify and diagnose hardware errors that affect the availability of the system and its domains:

The SMS DE diagnoses hardware errors associated with domain stops (dstops).

The Solaris OS DE (also referred to as the Solaris DE) identifies nonfatal domain hardware errors and reports them to the system controller.

The POST DE identifies any hardware test failures that occur when the power-on self-test is run.

The following sections describe the diagnosis and recovery steps that occur for the hardware errors identified by the different diagnosis engines.

Hardware Errors Associated With Domain Stops

FIGURE 6-1 shows the basic diagnosis and domain recovery steps performed when hardware errors associated with a dstop are identified by the SMS diagnosis engine.


FIGURE 6-1 Automatic Diagnosis and Recovery Process for Hardware Errors Associated With a Stopped Domain


The following summary describes the process shown in FIGURE 6-1.

A dump file is generated whenever a dstop occurs. This file (/var/opt/SUNWSMS/sms-version/adm/domain-id/dump/dsmd.dstop.yymmdd.hhmm.ss) captures the domain hardware errors associated with the dstop.

In situations where multiple FRUs are identified by the DE, further analysis by your service provider might be required to determine the faulty FRU.
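The dump file name described in this summary encodes the date and time of the dstop (yymmdd.hhmm.ss). As a minimal sketch, assuming that pattern holds for every dump file, the following Python function recovers the timestamp from a file name. The function name dstop_dump_time is hypothetical and is not part of SMS. Note that Python maps two-digit years 00 through 68 to 2000 through 2068.

```python
import re
from datetime import datetime

# File name pattern described above: dsmd.dstop.yymmdd.hhmm.ss
DUMP_NAME = re.compile(r"dsmd\.dstop\.(\d{6})\.(\d{4})\.(\d{2})$")


def dstop_dump_time(path):
    """Return the timestamp encoded in a dstop dump file name, or None."""
    match = DUMP_NAME.search(path)
    if match is None:
        return None
    stamp = "".join(match.groups())  # yymmddhhmmss
    return datetime.strptime(stamp, "%y%m%d%H%M%S")


# Path taken from the platform log excerpt in CODE EXAMPLE 6-1.
example = "/var/opt/SUNWSMS/SMS1.6/adm/D/dump/dsmd.dstop.030730.1423.27"
print(dstop_dump_time(example))
```

A helper like this can be useful for sorting dump files by event time before handing them to your service provider.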

CODE EXAMPLE 6-1 shows the information displayed for a domain stop and the auto-diagnosis message that describes a fault event on domain D. The event message begins with the [AD] indicator. See Reviewing Diagnosis Events for a description of the event message contents.


CODE EXAMPLE 6-1 Example of a Dstop and Auto-Diagnosis Event Message in the Platform Log File

Jul 30 14:23:26 2005 smshostname dsmd[14838]-D(): [2516 589424843782403 ERR 
EventHandler.cc 136] Domain stop has been detected in domain D
Jul 30 14:23:27 2005 smshostname dsmd[14838]-D(): [2525 589425136691417 NOTICE 
SysControl.cc 2360] Taking hardware configuration dump. Dump
file: -D/var/opt/SUNWSMS/SMS1.6/adm/D/dump/dsmd.dstop.030730.1423.27
Jul 30 14:24:37 2005 smshostname erd[14864]-D(): [11900 589495236849691 CRIT
MessageReportingService.cc 381] [AD]  Event: SF15000-8000-GK  CSN: 352A00005
DomainID: D  ADInfo: 1.SMS-DE.1.6 Time: Wed Jul 30 14:23:27 PDT 2005 
Recommended-Action: Service action required
 

For general information on SRS Net Connect, refer to

http://www.sun.com/srs

For SRS Net Connect product documentation, refer to

https://srsnetconnect3.sun.com

and

http://docs.sun.com

The showlogs event output supplements the diagnosis information presented in the platform and domain message logs or the event email. The showlogs event output can be used for additional troubleshooting purposes by your service provider. For details on the event information displayed, see Obtaining Diagnosis and Recovery Information.



Note - Contact your service provider when you see these event messages or when you are notified of these events. Your service provider will review the auto-diagnosis information and initiate the appropriate service action.



Nonfatal Domain Hardware Errors

FIGURE 6-2 shows the basic steps involved in the diagnosis of nonfatal domain hardware errors. These errors do not cause a domain to stop.


FIGURE 6-2 Automatic Diagnosis Process for Nonfatal Domain Hardware Errors


The steps shown in FIGURE 6-2 are similar to the steps discussed in the section Hardware Errors Associated With Domain Stops, except for the following differences:

CODE EXAMPLE 6-2 shows the diagnosis of a nonfatal hardware error and the event message information displayed. The event message begins with the [DOM] indicator. See Reviewing Diagnosis Events for a description of the event message contents.


CODE EXAMPLE 6-2 Example of a Nonfatal Domain Hardware Error Identified by Solaris and the Domain Event Message

Sep 12 14:47:24 2005 smshostname dsmd[7839]: [0 876197473671508 ERR
SoftErrorHandler.cc 577] E$ Slot 3 SubSlot 5
Sep 12 14:47:25 2005 smshostname dsmd[7839]: [2552 876198449525014 ERR
SoftErrorHandler.cc 592] Soft Error: Comp ID : 0x62 Error Code: 3 Error Type: 1
Error Bit/Pin: 104 
Sep 12 14:47:58 2005 smshostname erd[17227]: [11900 876231607099583 CRIT
MessageReportingService.cc 243] [DOM]  Event: SF15000-8000-FF  CSN: 352A00006
DomainID: D  ADInfo: 1.SF-SOLARIS-DE.5-9-cs3:4791004-on81:08/18/2005  Time: Fri
Sep 12 14:47:38 PDT 2005  Recommended-Action: Service action required
 



Note - Contact your service provider when you see these event messages or when you are notified of these events. Your service provider will review the auto-diagnosis information and initiate the appropriate service action.



POST-Detected Hardware Failures

Whenever POST is run to test and configure system board components, any components that fail the self-test are automatically unconfigured from the system. POST updates the component health status of the affected components accordingly.

CODE EXAMPLE 6-3 shows an auto-diagnosis event message reported by the POST DE for Domain B. See Reviewing Diagnosis Events for a description of the event message contents.


CODE EXAMPLE 6-3 Example of a POST Auto-Diagnosis Event Message

Sep  8 13:31:16 2005 smshostname erd[11987]: [11900 240509936296585 CRIT
MessageReportingService.cc 243] [AD]  Event: SF15000-8000-4L  CSN: 352A00005
DomainID: B  ADInfo: 1.POST-DE.1.4.1  Time: Mon Sep  8 13:30:47 PDT 2005
Recommended-Action: Service action required
 

When you see these messages or when you are notified of these events, contact your service provider to initiate the appropriate service action.


Enabling Email Event Notification

Email event notification is an optional feature that automatically generates an email notice informing designated recipients of domain fault events when they occur. You can receive immediate notice of critical fault events without manually monitoring the platform or domain message logs.

CODE EXAMPLE 6-4 shows an example email that reports a fault event in which two components are indicted (suspected of causing a fault). The following sections explain how to control email content and notification.


CODE EXAMPLE 6-4 Example Event Email

Date: Tue, 19 Aug 2005 10:45:28 -0600 (MDT)
Subject: FAULT: SF15000, csn: 352A00007, main fault class: list.suspects
From: smshostname@xyz.com
To: undisclosed-recipients:;
 
FAULT: platform: SF15000, csn: 352A00007, main fault class: list.suspects
EVENT CODE: SF15000-8000-GK
EMBEDDED FAULT(S): fault.board.sb.l1l2
fault.board.ex.l1l2
 
Fault event in domain(s) R at Fri Jun 27 00:08:05 PDT 2005.
Fault severity = SMIEVENT_SEV_FATAL <7>
Indictment Count: 2
Indictment list:
sb11
ex11
 

The following files work together to generate event email:

Email template: This template identifies the event information to be reported in the email. This information includes the email subject line and specific event items (tags) to be reported in the email.

Email control file: This file (/etc/opt/SUNWSMS/SMS/config/event_email.cf) uses certain event information, namely the event class and the domain affected by the event, to assign the specified email recipients and email templates that control the event information to be reported.



Note - The event email feature uses the standard sendmail utility to send email to designated email recipients.




To Enable Email Event Notification

1. In the email template file, identify the event tags to be reported in email.

Copy the sample email template (sample_email) provided with SMS and edit the copied file. For details on modifying the email template, see Configuring an Email Template.

2. In the email control file, set the parameters that determine who receives the email and the email templates to be used.

Edit the email control file (event_email.cf) included with SMS and assign the email notification parameters.

For details on modifying the control file, see Configuring the Email Control File.



Note - If you use the email notification feature, review the email destination addresses to ensure that the recipients receive notifications for events pertaining only to the domains that they have authorization to see. Implement and enforce a process for maintaining appropriate security separation whenever people change responsibilities, and gain or lose authorization.



Configuring an Email Template

A sample email template file called sample_email, located in /etc/opt/SUNWSMS/SMS/config/templates, is provided with SMS. CODE EXAMPLE 6-5 shows the default template. The text in angle brackets identifies the event information to be displayed in the body of the event email.


CODE EXAMPLE 6-5 Default Sample Email Template

# Sample Email Template File - This sample is intended to convey
# a terse fault event notification to a pager.
#
# The following is the subject line for the email with the event
# descriptor from the event and the platform model and serial
# number inserted.
#
FAULT: <PLATFORM_MODEL>, serial# <PLATFORM_SERIAL_NUMBER>, code <EVENT_CODE> 
#
# The following lines are the body of the email notification.
#
Fault event in domain(s) <EVENT_DOMAINS_AFFECTED> at <EVENT_TIMESTAMP>.
Fault severity = <EVENT_SEVERITY>
 
Indictment Count: <EVENT_INDICTMENT_COUNT>
Indictment list:
<EVENT_INDICTMENT_LIST>
 
Member fault list:
<EVENT_FAULT_MEMBERS>
# End of email template.
 

You can use the sample template file as is, or you can copy the sample template file to a new file, which can then be edited to identify additional or different event tags to be contained in the email. You must have superuser privileges to copy and rename the sample template file. The name of the file can be any text string that you choose.

When you edit the file, specify the event tags to be reported in the email subject line and email body. Specify these tags on new, uncommented lines in the file (lines that do not begin with a # sign). For a list of the tags that can be specified in the email template, see TABLE 6-1.


TABLE 6-1 Event Tags in the Email Template File

| Event Tag | Information Displayed |
|---|---|
| <EVENT_CLASS> | A dot-separated alphanumeric text string that describes the event category (error report, fault event, or a list of suspected faults). For example: list.suspects |
| <EVENT_CODE> | A dash-separated alphanumeric text string that uniquely identifies an event type, for example: SF15000-8000-GK. The event code summarizes the fault classes involved in the event and is used by your service provider to obtain further information about the event. |
| <EVENT_DE_NAME> | Name of the diagnosis engine (DE) used to determine the fault event: SMS-DE, SF-SOLARIS-DE, or POST-DE. |
| <EVENT_DE_VERSION> | Version of the diagnosis engine used to determine the event. |
| <EVENT_DOMAINS_AFFECTED> | A comma-separated list of domains affected by the event. |
| <EVENT_FAULT_MEMBERS> | List of fault event classes associated with the fault event. For example: fault.board.sb.l1l2 |
| <EVENT_INDICTMENT_COUNT> | Number of components indicted or suspected of causing the fault event. |
| <EVENT_INDICTMENT_LIST> | The indicted components. Each component is listed on a separate line. |
| <EVENT_SEVERITY> | The severity of the event, ranging from 0 to 7. For example, test event messages have a severity level 2 and fault events that cause a domain stop have a severity level 7 (SMIEVENT_SEV_FATAL). |
| <EVENT_TIMESTAMP> | The day and time of the event. |
| <PLATFORM_SERIAL_NUMBER> | The chassis serial number that identifies the Sun Fire high-end system. |
| <PLATFORM_MODEL> | The number of the product model (SF15000, SFE25000, SF12000, or SFE20000) affected by the event. |
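To illustrate how the event tags relate to the generated email, the following Python sketch fills a template with event values. It is only an approximation for illustration: SMS performs the real tag replacement, and its exact behavior is not described here. The function name render_template is hypothetical, and leaving unknown tags unchanged is an assumption. Treating the first uncommented line as the subject follows the comments in the sample template.

```python
import re

# An event tag is an upper-case name in angle brackets, such as <EVENT_CODE>.
TAG = re.compile(r"<([A-Z_]+)>")


def render_template(template_text, event):
    """Return (subject, body) with each <TAG> replaced by its event value."""
    # Lines that begin with a # sign are comments and are not reported.
    lines = [ln for ln in template_text.splitlines() if not ln.startswith("#")]
    if not lines:
        raise ValueError("template has no uncommented lines")

    def fill(line):
        # Assumption: tags without a value are left as they are.
        return TAG.sub(lambda m: str(event.get(m.group(1), m.group(0))), line)

    filled = [fill(ln) for ln in lines]
    return filled[0].strip(), "\n".join(filled[1:])


template = """# Subject line
FAULT: <PLATFORM_MODEL>, code <EVENT_CODE>
# Body
Fault event in domain(s) <EVENT_DOMAINS_AFFECTED> at <EVENT_TIMESTAMP>.
"""
event = {
    "PLATFORM_MODEL": "SF15000",
    "EVENT_CODE": "SF15000-8000-GK",
    "EVENT_DOMAINS_AFFECTED": "R",
    "EVENT_TIMESTAMP": "Fri Jun 27 00:08:05 PDT 2005",
}
subject, body = render_template(template, event)
print(subject)
print(body)
```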


FIGURE 6-3 shows the email template used to generate the email example shown in CODE EXAMPLE 6-4.


FIGURE 6-3 Example Email Template and Generated Email


Configuring the Email Control File

The email control file contains the email notification parameters that do the following:

You specify these notification parameters in the email control file supplied with SMS (/etc/opt/SUNWSMS/SMS/config/event_email.cf). This file, shown in CODE EXAMPLE 6-6, contains comment lines that begin with a pound (#) sign. These comment lines explain how to update the file.


CODE EXAMPLE 6-6 Email Control File (event_email.cf)
#
# Copyright (c) 2004 by Sun Microsystems, Inc.
# All rights reserved.
#
# Email Control File
#
# ident "@(#)event_email.cf 1.6     03/08/19 SMI"
#
# The following fields are required to receive email notification of fault 
# events
# Event_Class  Domains  Template  From  Include-event?  Recipients Script
# Event_Class and Domains are regular expressions filtering for specific event 
# types and affected domains. Domains are required to be upper case.
# The following example, uncommented, generates an email for any List Event
# containing a Fault Event, affecting any domain, and sends it to 
# two recipients.
# The Packed Event List is included as an attachment to the email.
#
# Event_Class  Domains  Template  From  Include-event?  Recipients  Script
#^fault[.] [A-R] sample_email FMA@xyz.com Y adm@xyz.com,adm2@xyz.com sendmail.sh
#
#
# The following example, uncommented, generates an email for any Event 
# that contains a Fault Event and affects domains A through C.  The Packed 
# Event List is not sent as an attachment. The user would be required to add his
# custom fault_email template to the directory 
# /etc/opt/SUNWSMS/config/templates, and for tag 
# replacement to work should refer to the documentation, or look at the 
# sample_email template in that directory.
#^fault[.] [A-C] fault_email FMA@xyz.com N admin.manager@xyz.com sendmail.sh
 

Use a text editor to edit the file and add the notification parameters in new, uncommented lines. You must have superuser privileges to edit the email control file and add the required email parameters. Separate each parameter with spaces or tabs. You can enter multiple notification lines that control how different event email messages are to be distributed, perhaps by domain, event class, or email template. The notification parameters that you configure are described in TABLE 6-2.

You can use regular expressions to specify ranges or specific matches for the Event_Class and Domains parameters. The email control file supports extended regular expressions as explained in the regexp(5) man page. Some examples of valid regular expressions include:

CODE EXAMPLE 6-7 shows an updated email control file in which notification parameters have been added to the bottom of the file. The sendmail.sh script will be used to send event email to the two specified recipients. An event email will be generated for all fault events that occurred in domains A through C and will be formatted based on the template file called sample_email. The event message structure will be sent as a binary file attachment that accompanies the email.


CODE EXAMPLE 6-7 Sample Email Control File
#
# Copyright (c) 2004 by Sun Microsystems, Inc.
# All rights reserved.
# Email Control File
#
# ident "@(#)event_email.cf 1.1     03/03/12 SMI"
#
# The following fields are required to receive email notification of fault
# events
# Event_Class  Domains  Template  From  Include-event?  Recipients-Script
# Event_Class and Domains are regular expressions filtering for specific event
# types and affected domains. Domains are required to be upper case.
# The following example, uncommented, generates an email for any List Event
# containing a Fault Event, affecting any domain, and sends it to
# two recipients. Recipients are email addresses separated by commas if there
# are more than 1. Embedded blanks are not permitted in the Recipients list.
# The Packed Event List is included as an attachment to the email.
#
# Event_Class  Domains  Template  From  Include-event?  Recipients Script
#^fault[.] [A-R] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
#
#
# The following example, uncommented, generates an email for any Event
# that contains a Fault Event and affects domains A through C.  The Packed
# Event List is not sent as an attachment. The user would be required to add his
# custom fault_email template to the directory
# /etc/opt/SUNWSMS/config/templates, and for tag
# replacement to work should refer to the documentation, or look at the
# sample_email template in that directory.
#
#^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
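The seven-field layout of a notification line can be checked before you rely on it. The following Python sketch parses uncommented lines of a control file and selects the lines whose Event_Class and Domains expressions match a given event. It is a sketch under stated assumptions: the names Rule, parse_control_file, and matching_rules are hypothetical, Python's re module stands in for the extended regular expressions of regexp(5) and might differ in edge cases, and the way SMS itself applies the expressions is not described here.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    event_class: str      # regular expression for the event class
    domains: str          # regular expression for the affected domain
    template: str
    sender: str
    include_event: bool   # True when the Include-event? field is Y
    recipients: list
    script: str


def parse_control_file(text):
    """Parse uncommented notification lines into Rule tuples."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # parameters are separated by spaces or tabs
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        cls, dom, tmpl, sender, include, rcpts, script = fields
        rules.append(
            Rule(cls, dom, tmpl, sender, include == "Y", rcpts.split(","), script)
        )
    return rules


def matching_rules(rules, event_class, domain):
    """Return the rules whose expressions match the event class and domain."""
    return [
        r for r in rules
        if re.search(r.event_class, event_class) and re.search(r.domains, domain)
    ]


SAMPLE = """# comment lines are ignored
^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
"""
rules = parse_control_file(SAMPLE)
print(matching_rules(rules, "fault.test.email", "A"))
print(matching_rules(rules, "fault.test.email", "D"))
```

Raising an error on a wrong field count catches a common mistake, such as a blank embedded in the recipients list.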
 


Testing Email Event Notification

Use the testemail(1M) command to verify email event notification. This command also enables you to track events and check any changes to the email control file.


To Test Email Event Notification

1. Set up the email event templates and the email control file as described in Enabling Email Event Notification.

2. In an SC window, log in as platform administrator or platform service and type:


sc0:sms-user:> /opt/SUNWSMS/SMS/lib/smsadmin/testemail -c event-class-list -d domain-id [-i resource-indictment-list]

where:

event-class-list is a list of one or more fault event classes to be tracked

domain-id specifies a single domain, A-R

resource-indictment-list is an optional list of one or more components that map to each event class specified. For a list of the valid component values, refer to the testemail(1M) man page.

For example, the following command generates an event of type fault.test.email that originates on domain A.


sc0:sms-user:> /opt/SUNWSMS/SMS/lib/smsadmin/testemail -c fault.test.email -d A

3. Verify that the test event was recorded in the platform or domain message logs.

For example, a message similar to the following is displayed in the platform message log:


Aug 19 10:45:28 2005 smshostname [6696:1]: [11917 682823530704603 ERR
testemailApp.cc 345] Test fault with code SF15000-8000-Y1 generated by user root
using testEmailReporting - please ignore

4. If the test event was successfully recorded in the message logs, verify that the designated recipients received the test email.

For example, the test email might resemble the following:


Date: Tue, 19 Aug 2005 10:45:28 -0600 (MDT)
Subject: FAULT: SF15000, serial# 352A0008, code SF15000-8000-Y1
From: smshostname@xyz.com
To: undisclosed-recipients:;
 
FAULT: SF15000, serial# 352A0008, code SF15000-8000-Y1
Fault event in domain(s) A at Tue Aug 19 10:45:18 MDT 2005.
Fault severity = SMIEVENT_SEV_INFO <2>
Indictment Count: 0
Indictment list:
 
Member fault list:
fault.test.email
 

If the test email was not generated, review the next section for troubleshooting suggestions.
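The log check in Step 3 can also be scripted. The following Python sketch searches log text for the test fault message shown in that step and returns the event code and user name. It assumes the message wording matches the example above. The function name find_test_faults is hypothetical.

```python
import re

# Wording taken from the example platform log message in Step 3.
TEST_FAULT = re.compile(r"Test fault with code (\S+) generated by user (\S+)")


def find_test_faults(log_text):
    """Return (event_code, user) pairs for test fault messages in the log."""
    # Log entries can wrap across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return TEST_FAULT.findall(flat)


LOG = """Aug 19 10:45:28 2005 smshostname [6696:1]: [11917 682823530704603 ERR
testemailApp.cc 345] Test fault with code SF15000-8000-Y1 generated by user root
using testEmailReporting - please ignore"""

print(find_test_faults(LOG))
```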

What To Do If Test Email Fails

If you did not receive test email notification, do the following:

1. Review your email event templates and the email control file to verify that the files have been set up correctly.

2. Check the domain and platform message logs to verify that the test events were recorded.

3. Verify that the sendmail daemon is running. For example:


sc0:sms-user:> ps -ef | grep sendmail
    root  256     1  0   Aug 06 ?        0:05 /usr/lib/sendmail -bd -q15m
sms-user  525 28546  0 21:23:15 pts/27   0:00 grep sendmail

If the sendmail daemon is not running, you might have a problem with your installation setup that requires correction. Proceed to Step 4.

4. Log in as superuser and start the sendmail daemon manually. A daemon started this way runs only until the next reboot:


sc0:# /usr/lib/sendmail -bd -q15m &

5. Check /var/log/syslog on the SC to see if email was sent by the Mail Transfer Agent (MTA), sendmail.

If sendmail is not configured or is configured incorrectly, error messages appear in this log file.

6. Verify that the domain and nameserver IP entries (to route the email messages outside of the system controller) exist in the /etc/resolv.conf file.

7. Restart the sendmail daemon:


sc0:# /etc/init.d/sendmail stop
sc0:# /etc/init.d/sendmail start
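The check in Step 6 can be sketched in Python as follows. The functions read the text of a resolv.conf file and report whether the domain and nameserver entries are present. The function names and the sample address are hypothetical, and the sketch checks only that the entries exist, not that the name server is reachable.

```python
def resolver_entries(resolv_conf_text):
    """Collect the domain and nameserver entries from resolv.conf text."""
    entries = {"domain": [], "nameserver": []}
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] in entries:
            entries[parts[0]].append(parts[1])
    return entries


def missing_entries(resolv_conf_text):
    """Return the names of required entries that are absent."""
    found = resolver_entries(resolv_conf_text)
    return [name for name, values in found.items() if not values]


# Sample content; 192.0.2.53 is a documentation-only address.
sample = "domain xyz.com\nnameserver 192.0.2.53\n"
print(missing_entries(sample))
```

On the SC, the text would typically come from open("/etc/resolv.conf").read().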


Obtaining Diagnosis and Recovery Information

This section describes the various ways to monitor diagnostic errors and obtain additional information about fault and error events.

Reviewing Diagnosis Events

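Automatic diagnosis [AD] and domain [DOM] event messages are displayed in the platform message logs and on the domain console or in the syslog host, if a loghost server was configured. The [AD] or [DOM] event messages (see CODE EXAMPLE 6-1, CODE EXAMPLE 6-2, and CODE EXAMPLE 6-3) include the following information, labeled as shown in those examples:

Event: the event code, for example SF15000-8000-GK

CSN: the chassis serial number

DomainID: the domain affected by the event

ADInfo: diagnosis engine information, which names the diagnosis engine that reported the event

Time: the day and time of the event

Recommended-Action: the action to take, for example Service action required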
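As an illustration of the message layout, the following Python sketch extracts the labeled fields from an [AD] or [DOM] event message. It assumes the field labels appear exactly as in CODE EXAMPLE 6-1 through CODE EXAMPLE 6-3. The function name parse_event_message is hypothetical and is not part of SMS.

```python
import re

# Field labels as they appear in the example event messages.
KEYS = ("Event", "CSN", "DomainID", "ADInfo", "Time", "Recommended-Action")
_KEY_ALT = "|".join(re.escape(k) for k in KEYS)

# A field runs from its label to the next label or the end of the message.
FIELD = re.compile(rf"\b({_KEY_ALT}):\s*(.*?)(?=\s+(?:{_KEY_ALT}):|$)")
INDICATOR = re.compile(r"\[(AD|DOM)\]")


def parse_event_message(text):
    """Return the fields of an [AD] or [DOM] message as a dict, or None."""
    flat = " ".join(text.split())  # undo line wrapping in the log
    found = INDICATOR.search(flat)
    if found is None:
        return None
    fields = {"Indicator": found.group(1)}
    fields.update(FIELD.findall(flat[found.end():]))
    return fields


message = """MessageReportingService.cc 381] [AD]  Event: SF15000-8000-GK  CSN: 352A00005
DomainID: D  ADInfo: 1.SMS-DE.1.6 Time: Wed Jul 30 14:23:27 PDT 2005
Recommended-Action: Service action required"""

for name, value in parse_event_message(message).items():
    print(f"{name}: {value}")
```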

Reviewing the Event Log

If you have platform administrator or platform service privileges, you can use the showlogs command to view the contents of the event log and obtain more detailed information about a particular type of event. The information displayed can also be used by your service provider for troubleshooting purposes.

You can obtain information on the following types (classes) of events recorded in the event log:

TABLE 6-3 describes some of the various ways to view event information through the showlogs command.


TABLE 6-3 showlogs(1M) Command Options for Displaying Error and Fault Event Information

| Command Options | Description |
|---|---|
| showlogs -E -p e | Displays the last event in the event log in a condensed format. |
| showlogs -E -p e number | Displays the event data for the last number of events in a condensed format. For example, showlogs -E -p e 3 displays condensed event information for the last three events in the event log. |
| showlogs -p e list | Displays the last list event in the event log. |
| showlogs -p e ereport | Displays the last ereport (error report) in the event log. An error report contains specific information about the hardware entity, such as an unexpected condition or behavior. |
| showlogs -d domain-ID -p e number | Displays the last number of events in the specified domain. |
| showlogs -E -p e event-code | Displays condensed event log information for the specified event code. |


For details on the showlogs command options and examples of event output, refer to the showlogs(1M) command description in the System Management Services (SMS) 1.6 Reference Manual.
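If you script these queries, a small helper can assemble the argument lists shown in TABLE 6-3. The following Python sketch only builds the arguments and does not run showlogs. The function name showlogs_args is hypothetical, and option combinations that do not appear in the table are not confirmed to be valid.

```python
def showlogs_args(selector=None, condensed=False, domain=None):
    """Build an argument list for the showlogs forms listed in TABLE 6-3."""
    args = ["showlogs"]
    if condensed:
        args.append("-E")          # condensed output format
    if domain is not None:
        args += ["-d", domain]     # restrict to one domain
    args += ["-p", "e"]            # select the event log
    if selector is not None:
        # A count, "list", "ereport", or an event code.
        args.append(str(selector))
    return args


print(" ".join(showlogs_args(3, condensed=True)))
print(" ".join(showlogs_args("ereport")))
print(" ".join(showlogs_args(5, domain="A")))
```

The resulting list can be passed to a process launcher such as subprocess.run on a system where showlogs is installed.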