CHAPTER 5

Automatic Diagnosis and Recovery

This chapter describes the automatic error diagnosis and domain recovery features included with SMS, starting with the SMS 1.4 release. This chapter covers the following topics:


Automatic Diagnosis and Recovery Overview

When certain hardware errors occur in a Sun Fire high-end system, the system controller performs specific diagnosis and domain recovery steps. The following automatic diagnosis engines (DEs) identify and diagnose hardware errors that affect the availability of the system and its domains:

The following sections describe the diagnosis and recovery steps that occur for the hardware errors identified by the different diagnosis engines.

Hardware Errors Associated with Domain Stops

FIGURE 5-1 shows the basic diagnosis and domain recovery steps performed when hardware errors associated with a dstop are identified by the SMS diagnosis engine.

FIGURE 5-1 Automatic Diagnosis and Recovery Process for Hardware Errors Associated with a Stopped Domain

Flow diagram that shows the diagnosis and recovery steps for errors that cause domain stops.

The following summary describes the process shown in FIGURE 5-1.

Non-Fatal Domain Hardware Errors

FIGURE 5-2 shows the basic steps involved in the diagnosis of non-fatal domain hardware errors. These errors do not cause a domain to stop.

FIGURE 5-2 Automatic Diagnosis Process for Non-Fatal Domain Hardware Errors

Flow diagram that shows the diagnosis process for non-fatal domain hardware errors.

The steps shown in FIGURE 5-2 are similar to the steps discussed in the section Hardware Errors Associated with Domain Stops, except for the following differences:

POST-Detected Hardware Failures

Whenever POST is run to test and configure system board components, any components that fail the self-test are automatically deconfigured from the system. POST updates the component health status of the affected components accordingly.

CODE EXAMPLE 5-3 shows an auto-diagnosis event message reported by the POST DE for domain B. See Reviewing Diagnosis Events for a description of the event message contents.

CODE EXAMPLE 5-3 Example of a POST Auto-Diagnosis Event Message

Sep  8 13:31:16 2003 smshostname erd[11987]: [11900 240509936296585 CRIT
MessageReportingService.cc 243] [AD]  Event: SF15000-8000-4L  CSN: 352A00005
DomainID: B  ADInfo: 1.POST-DE.1.4  Time: Mon Sep  8 13:30:47 PDT 2003
Recommended-Action: Service action required


When you see these messages or when you are notified of these events, contact your service provider to initiate the appropriate service action.
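If you collect these messages from a log host, it can help to split an [AD] message into its labeled fields before forwarding it to your service provider. The following Python sketch is not part of SMS. It assumes the field labels shown in CODE EXAMPLE 5-3 (Event, CSN, DomainID, ADInfo, Time, and Recommended-Action), and the function name parse_ad_event is hypothetical.

```python
import re

# Field labels taken from the [AD] message in CODE EXAMPLE 5-3.
# Other releases or diagnosis engines might emit different labels (assumption).
FIELD_LABELS = ["Event", "CSN", "DomainID", "ADInfo", "Time", "Recommended-Action"]


def parse_ad_event(message: str) -> dict:
    """Return the labeled fields of an [AD] event message as a dict."""
    # Log lines are wrapped in the message log, so collapse all whitespace first.
    text = " ".join(message.split())
    # Keep only the part that follows the [AD] marker.
    _, marker, payload = text.partition("[AD]")
    if not marker:
        raise ValueError("not an [AD] event message")
    labels = "|".join(re.escape(label) for label in FIELD_LABELS)
    # A value runs until the next known label or the end of the message.
    pattern = re.compile(rf"({labels}):\s*(.*?)(?=\s+(?:{labels}):|$)")
    return {label: value.strip() for label, value in pattern.findall(payload)}


sample = """Sep  8 13:31:16 2003 smshostname erd[11987]: [11900 240509936296585 CRIT
MessageReportingService.cc 243] [AD]  Event: SF15000-8000-4L  CSN: 352A00005
DomainID: B  ADInfo: 1.POST-DE.1.4  Time: Mon Sep  8 13:30:47 PDT 2003
Recommended-Action: Service action required"""

fields = parse_ad_event(sample)
print(fields["Event"], fields["DomainID"])  # prints: SF15000-8000-4L B
```

The parser keys on the label names rather than on column positions, because the message wraps at different points depending on the host name and process ID.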


Enabling Email Event Notification

Email event notification is an optional feature that automatically generates an email notice informing designated recipients of domain fault events when they occur. You can receive immediate notice of critical fault events, without manually monitoring the platform or domain message logs.

CODE EXAMPLE 5-4 shows an example email that reports a fault event in which two components are indicted (suspected of causing a fault). The following sections explain how to control email content and notification.

CODE EXAMPLE 5-4 Example Event Email

Date: Tue, 19 Aug 2003 10:45:28 -0600 (MDT)
Subject: FAULT: SF15000, csn: 352A00007, main fault class: list.suspects
From: smshostname@xyz.com
To: undisclosed-recipients:;

FAULT: platform: SF15000, csn: 352A00007, main fault class: list.suspects
EVENT CODE: SF15000-8000-GK
EMBEDDED FAULT(S): fault.board.sb.l1l2
fault.board.ex.l1l2

Fault event in domain(s) R at Fri Jun 27 00:08:05 PDT 2003.
Fault severity = SMIEVENT_SEV_FATAL <7>
Indictment Count: 2
Indictment list:
sb11
ex11


The following files work together to generate event email:



Note - The event email feature uses the standard sendmail utility to send email to designated email recipients.




To Enable Email Event Notification

1. In the email template file, identify the event tags to be reported in email.

Copy the sample email template (sample_email) provided with SMS and edit the copied file. For details on modifying the email template, see Configuring an Email Template.

2. In the email control file, set the parameters that determine who receives the email and the email templates to be used.

Edit the email control file (event_email.cf) included with SMS and assign the email notification parameters.

For details on modifying the control file, see Configuring the Email Control File.



Note - If you use the email notification feature, review the email destination addresses to ensure that the recipients receive notifications for events pertaining only to the domains that they have authorization to see. Sun recommends that you implement and enforce a process for maintaining appropriate security separation whenever people change responsibilities and gain or lose authorization.



Configuring an Email Template

A sample email template file called sample_email (/etc/opt/SUNWSMS/SMS/config/templates) is provided with SMS. CODE EXAMPLE 5-5 shows the default template. The text in angle brackets serves as tags that identify the event information to be displayed in the body of the event email.

CODE EXAMPLE 5-5 Default Sample Email Template

# Sample Email Template File - This sample is intended to convey
# a terse fault event notification to a pager.
#
# The following is the subject line for the email with the event
# descriptor from the event and the platform model and serial
# number inserted.
#
FAULT: <PLATFORM_MODEL>, serial# <PLATFORM_SERIAL_NUMBER>, code <EVENT_CODE> 
#
# The following lines are the body of the email notification.
#
Fault event in domain(s) <EVENT_DOMAINS_AFFECTED> at <EVENT_TIMESTAMP>.
Fault severity = <EVENT_SEVERITY>

Indictment Count: <EVENT_INDICTMENT_COUNT>
Indictment list:
<EVENT_INDICTMENT_LIST>

Member fault list:
<EVENT_FAULT_MEMBERS>
# End of email template.


You can use the sample template file as is, or you can copy it to a new file and edit the copy to report additional or different event tags in the email. You must have superuser privileges to copy and rename the sample template file. The name of the file can be any text string that you choose.

When you edit the file, specify the event tags to be reported in the email subject line and email body. Specify these tags on new, uncommented lines in the file (lines that do not begin with a # sign). For a list of the tags that can be specified in the email template, see TABLE 5-1.


TABLE 5-1 Event Tags in the Email Template File

Event Tag

Information Displayed

<EVENT_CLASS>

A dot-separated alphanumeric text string that describes the event category (error report, fault event, or a list of suspected faults). For example: list.suspects

<EVENT_CODE>

A dash-separated alphanumeric text string that uniquely identifies an event type, for example: SF15000-8000-GK. The event code summarizes the fault classes involved in the event and is used by your service provider to obtain further information about the event.

<EVENT_DE_NAME>

Name of the diagnosis engine (DE) used to determine the fault event: SMS-DE, SF-SOLARIS-DE, or POST-DE.

<EVENT_DE_VERSION>

Version of the diagnosis engine used to determine the event.

<EVENT_DOMAINS_AFFECTED>

A comma-separated list of domains affected by the event.

<EVENT_FAULT_MEMBERS>

List of fault event classes associated with the fault event. For example: fault.board.sb.l1l2

<EVENT_INDICTMENT_COUNT>

Number of components indicted or suspected of causing the fault event.

<EVENT_INDICTMENT_LIST>

The indicted components. Each component is listed on a separate line.

<EVENT_SEVERITY>

The severity of the event, ranging from 0 to 7. For example, test event messages have severity level 2 (SMIEVENT_SEV_INFO), and fault events that cause a domain stop have severity level 7 (SMIEVENT_SEV_FATAL).

<EVENT_TIMESTAMP>

The day and time of the event.

<PLATFORM_SERIAL_NUMBER>

The chassis serial number that identifies the Sun Fire high-end system.

<PLATFORM_MODEL>

The number of the product model (SF15000 or SF12000) affected by the event.
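To preview how a custom template might read before you put it into service, you can expand its tags offline. The following Python sketch is an illustration only and does not describe how SMS performs the substitution internally. It assumes that lines beginning with # are dropped and that each tag from TABLE 5-1 is replaced by the corresponding event value. The function name render_template and the sample values are hypothetical.

```python
import re

# A tag is an uppercase name in angle brackets, for example <EVENT_CODE>.
TAG_PATTERN = re.compile(r"<([A-Z_]+)>")


def render_template(template_text: str, event: dict) -> str:
    """Drop comment lines and replace each <TAG> with its event value."""
    rendered = []
    for line in template_text.splitlines():
        if line.startswith("#"):
            continue  # comment lines are not part of the email
        # Unknown tags are left in place so that misspelled tags stay visible.
        rendered.append(
            TAG_PATTERN.sub(lambda m: str(event.get(m.group(1), m.group(0))), line)
        )
    return "\n".join(rendered)


template = """# Subject line
FAULT: <PLATFORM_MODEL>, serial# <PLATFORM_SERIAL_NUMBER>, code <EVENT_CODE>
# Body
Fault event in domain(s) <EVENT_DOMAINS_AFFECTED> at <EVENT_TIMESTAMP>.
Indictment list:
<EVENT_INDICTMENT_LIST>"""

# Sample values only; a real event supplies its own data.
event = {
    "PLATFORM_MODEL": "SF15000",
    "PLATFORM_SERIAL_NUMBER": "352A00007",
    "EVENT_CODE": "SF15000-8000-GK",
    "EVENT_DOMAINS_AFFECTED": "R",
    "EVENT_TIMESTAMP": "Fri Jun 27 00:08:05 PDT 2003",
    "EVENT_INDICTMENT_LIST": "sb11\nex11",
}

print(render_template(template, event))
```

Leaving unknown tags untouched is a deliberate choice in this sketch: a misspelled tag shows up verbatim in the preview instead of silently disappearing.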


FIGURE 5-3 shows the email template used to generate the email example shown in CODE EXAMPLE 5-4.

FIGURE 5-3 Example Email Template and Generated Email

Example of custom email template and the resulting email generated.

Configuring the Email Control File

The email control file contains the email notification parameters that do the following:

You specify these notification parameters in the email control file supplied with SMS (/etc/opt/SUNWSMS/SMS/config/event_email.cf). This file, shown in CODE EXAMPLE 5-6, contains comment lines that begin with a pound (#) sign. These comment lines explain how to update the file.

CODE EXAMPLE 5-6 Email Control File (event_email.cf)

#
# Copyright (c) 2003 by Sun Microsystems, Inc.
# All rights reserved.
#
# Email Control File
#
# ident "@(#)event_email.cf 1.5     03/08/19 SMI"
#
# The following fields are required to receive email notification of fault 
# events
# Event_Class  Domains  Template  From  Include-event?  Recipients Script
# Event_Class and Domains are regular expressions filtering for specific event 
# types and affected domains. Domains are required to be upper case.
# The following example, uncommented, generates an email for any List Event
# containing a Fault Event, affecting any domain, and sends it to 
# two recipients.
# The Packed Event List is included as an attachment to the email.
#
# Event_Class  Domains  Template  From  Include-event?  Recipients  Script
#^fault[.] [A-R] sample_email FMA@xyz.com Y adm@xyz.com,adm2@xyz.com sendmail.sh
#
#
# The following example, uncommented, generates an email for any Event 
# that contains a Fault Event and affects domains A through C.  The Packed 
# Event List is not sent as an attachment. The user would be required to add his
# custom fault_email template to the directory 
# /etc/opt/SUNWSMS/config/templates, and for tag 
# replacement to work should refer to the documentation, or look at the 
# sample_email template in that directory.
#^fault[.] [A-C] fault_email FMA@xyz.com N admin.manager@xyz.com sendmail.sh


Use a text editor to edit the file and add the notification parameters in new, uncommented lines. You must have superuser privileges to edit the email control file and add the required email parameters. Separate each parameter with spaces or tabs. You can enter multiple notification lines that control how different event email messages are to be distributed, perhaps by domain, event class, or email template. The notification parameters that you configure are described in TABLE 5-2.

You can use regular expressions to specify ranges or specific matches for the Event_Class and Domains parameters. The email control file supports extended regular expressions (REs) as explained in the regexp(5) man page. Some examples of valid regular expressions include:


TABLE 5-2 Email Control File Parameters

Email control parameter

Description

Event_Class

The fault event class to be used as a filter.

Specify the event class as a regular expression, so that this parameter can apply to a wide range of event classes. For example, the default format fault.* causes all fault events that match the string fault to be reported in the event email.

Domains

The domains to be used as filters. The default format [A-R] causes the fault events from domains A through R to be identified in the email. The domains must be specified in uppercase letters.

Template

The name of the email template file to be used to generate the email contents.

From

The email alias from which the email is generated.

Include-event?

One of the following states:

  • Y - Yes, include the binary file of the event message structure as an email attachment. This file can be used by your service provider for troubleshooting purposes.

  • N - No, do not include the binary file of the event message structure as an email attachment.

Recipients

The email aliases of the individuals to receive the event email. Separate each alias with a comma.

Script

The shell script used to send the email to the designated recipients. The sendmail.sh script in /etc/opt/SUNWSMS/config/scripts is the standard script, but you can replace this with your own custom script in the same directory.
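The way the seven parameters combine can be sketched in a few lines of Python. This sketch is an illustration under stated assumptions, not the SMS implementation: it assumes that Event_Class and Domains are applied as unanchored searches against a fault class and a domain letter, and it uses the Python re module as an approximation of the extended regular expressions described in regexp(5). The names Rule, parse_rules, and rule_matches are hypothetical.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    """One uncommented line of the email control file."""
    event_class: str
    domains: str
    template: str
    sender: str
    include_event: str
    recipients: List[str]
    script: str


def parse_rules(control_text: str) -> List[Rule]:
    """Parse the uncommented lines of an email control file into rules."""
    rules = []
    for line in control_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()  # parameters are separated by spaces or tabs
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        event_class, domains, template, sender, include, recipients, script = fields
        rules.append(Rule(event_class, domains, template, sender, include,
                          recipients.split(","), script))
    return rules


def rule_matches(rule: Rule, fault_class: str, domain: str) -> bool:
    """Return True if both filter expressions match (search semantics assumed)."""
    return (re.search(rule.event_class, fault_class) is not None
            and re.search(rule.domains, domain) is not None)


control_text = """# Event_Class  Domains  Template  From  Include-event?  Recipients  Script
^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
"""
rule = parse_rules(control_text)[0]
print(rule_matches(rule, "fault.test.email", "A"))  # prints: True
print(rule_matches(rule, "fault.test.email", "R"))  # prints: False
```

Because the recipients field is split on commas after the line is split on whitespace, an embedded blank in the recipient list would change the field count and be rejected.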


CODE EXAMPLE 5-7 shows an updated email control file in which notification parameters have been added to the bottom of the file. The sendmail.sh script will be used to send event email to the two specified recipients. An event email will be generated for all fault events that occurred in domains A through C and will be formatted based on the template file called sample_email. The event message structure will be sent as a binary file attachment that accompanies the email.

CODE EXAMPLE 5-7 Sample Email Control File

#
# Copyright (c) 2003 by Sun Microsystems, Inc.
# All rights reserved.
# Email Control File
#
# ident "@(#)event_email.cf 1.1     03/03/12 SMI"
#
# The following fields are required to receive email notification of fault
# events
# Event_Class  Domains  Template  From  Include-event?  Recipients  Script
# Event_Class and Domains are regular expressions filtering for specific event
# types and affected domains. Domains are required to be upper case.
# The following example, uncommented, generates an email for any List Event
# containing a Fault Event, affecting any domain, and sends it to
# two recipients. Recipients are email addresses separated by commas if there
# are more than 1. Embedded blanks are not permitted in the Recipients list.
# The Packed Event List is included as an attachment to the email.
#
# Event_Class  Domains  Template  From  Include-event?  Recipients Script
#^fault[.] [A-R] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
#
#
# The following example, uncommented, generates an email for any Event
# that contains a Fault Event and affects domains A through C.  The Packed
# Event List is not sent as an attachment. The user would be required to add his
# custom fault_email template to the directory
# /etc/opt/SUNWSMS/config/templates, and for tag
# replacement to work should refer to the documentation, or look at the
# sample_email template in that directory.
#
#^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh
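Before you rely on a new notification line, you might want to check it for common mistakes. The following Python sketch is one possible check and is not supplied with SMS. It assumes the seven-field layout described in TABLE 5-2, and the function name lint_control_line is hypothetical. The recipient check is only a hint, because a local mail alias without a host part can be valid at your site.

```python
import re
from typing import List


def lint_control_line(line: str) -> List[str]:
    """Return a list of problems found in one uncommented control-file line."""
    fields = line.split()
    if len(fields) != 7:
        return [f"expected 7 fields, found {len(fields)}"]
    event_class, domains, _template, _sender, include, recipients, _script = fields
    problems = []
    # Both filters must at least compile as expressions. Python's re module is
    # used here as an approximation of regexp(5) (assumption).
    for name, pattern in (("Event_Class", event_class), ("Domains", domains)):
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{name} is not a valid expression: {exc}")
    if re.search(r"[a-z]", domains):
        problems.append("Domains must use uppercase letters")
    if include not in ("Y", "N"):
        problems.append("Include-event? must be Y or N")
    for address in recipients.split(","):
        if address.count("@") != 1:
            problems.append(f"recipient {address!r} does not look like user@host")
    return problems


line = "^fault[.] [A-C] sample_email FMA@xyz.com Y adm1@xyz.com,adm2@xyz.com sendmail.sh"
print(lint_control_line(line))  # prints: []
```

An empty list means that no problems were detected by these checks; it does not guarantee that the template file or the script exists on the system controller.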



Testing Email Event Notification

Use the testemail(1M) command to verify email event notification. This command also enables you to track events and check any changes to the email control file.


To Test Email Event Notification

1. Set up the email event templates and the email control file as described in Enabling Email Event Notification.

2. In an SC window, log in as platform administrator or platform service and type:

sc0:sms-user:> /opt/SUNWSMS/SMS1.4/lib/smsadmin/testemail -c event_class_list -d domain_id [-i resource_indictment_list]

where:

event_class_list is a list of one or more fault event classes to be tracked

domain_id specifies a single domain, A-R

resource_indictment_list is an optional list of one or more components that map to each event class specified. For a list of the valid component values, refer to the testemail(1M) man page.

For example, the following command

sc0:sms-user:> /opt/SUNWSMS/SMS1.4/lib/smsadmin/testemail -c fault.test.email -d A

generates an event of type fault.test.email originating on domain A.

3. Verify that the test event was recorded in the platform or domain message logs.

For example, a message similar to the following is displayed in the platform message log:

Aug 19 10:45:28 2003 smshostname [6696:1]: [11917 682823530704603 ERR
testemailApp.cc 345] Test fault with code SF15000-8000-Y1 generated by user root
using testEmailReporting - please ignore

4. If the test event was successfully recorded in the message logs, verify that the designated recipients received the test email.

For example, the test email might resemble the following:

Date: Tue, 19 Aug 2003 10:45:28 -0600 (MDT)
Subject: FAULT: SF15000, serial# 352A0008, code SF15000-8000-Y1
From: smshostname@xyz.com
To: undisclosed-recipients:;

FAULT: SF15000, serial# 352A0008, code SF15000-8000-Y1
Fault event in domain(s) A at Tue Aug 19 10:45:18 MDT 2003.
Fault severity = SMIEVENT_SEV_INFO <2>
Indictment Count: 0
Indictment list:

Member fault list:
fault.test.email


If the test email was not generated, review the next section for troubleshooting suggestions.

What To Do If Test Email Fails

If you did not receive test email notification, do the following:

  1. Review your email event templates and the email control file to verify that the files have been set up correctly.

  2. Check the domain and platform message logs to verify that the test events were recorded.

  3. Verify that the sendmail daemon is running. For example:

    sc0:sms-user:> ps -ef | grep sendmail
        root  256     1  0   Aug 06 ?        0:05 /usr/lib/sendmail -bd -q15m
    sms-user  525 28546  0 21:23:15 pts/27   0:00 grep sendmail
    

    If the sendmail daemon is not running, you might have a problem with your installation setup that requires correction. Proceed to Step 4.

  4. Log in as superuser and start the sendmail daemon manually. A daemon started this way runs until the next reboot:

    sc0:# /usr/lib/sendmail -bd -q15m &
    

  5. Check /var/log/syslog on the SC to see if email was sent by the Mail Transfer Agent (MTA), sendmail.

    If sendmail is not configured or was configured incorrectly, error messages appear in this log file.

  6. Verify that the domain and nameserver IP entries (to route the email messages outside of the system controller) exist in the /etc/resolv.conf file.

  7. Restart sendmail:

    sc0:# /etc/init.d/sendmail stop

    sc0:# /etc/init.d/sendmail start
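The resolver check in the steps above can be scripted. The following Python sketch shows one way to confirm that resolv.conf text contains both a domain entry and a nameserver entry. It is an illustration only: the function names are hypothetical, the sample addresses are placeholders, and the sketch checks only that the entries exist, not that the name server is reachable.

```python
from typing import Dict, List


def resolver_entries(resolv_conf_text: str) -> Dict[str, List[str]]:
    """Collect the domain and nameserver values from resolv.conf text."""
    entries: Dict[str, List[str]] = {"domain": [], "nameserver": []}
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        # Skip blank lines, keyword-only lines, and comments (# or ;).
        if len(parts) < 2 or parts[0][0] in "#;":
            continue
        if parts[0] in entries:
            entries[parts[0]].append(parts[1])
    return entries


def missing_resolver_entries(resolv_conf_text: str) -> List[str]:
    """Return the keywords (domain, nameserver) that have no entry."""
    found = resolver_entries(resolv_conf_text)
    return [keyword for keyword, values in found.items() if not values]


# Placeholder content; on the SC you would read /etc/resolv.conf instead.
sample = """domain xyz.com
# nameserver 192.0.2.1
"""
print(missing_resolver_entries(sample))  # prints: ['nameserver']
```

To check the live file, pass the contents of /etc/resolv.conf to missing_resolver_entries. An empty list means that both kinds of entry are present.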
    


Obtaining Diagnosis and Recovery Information

This section describes the various ways to monitor diagnostic errors and obtain additional information about fault and error events.

Reviewing Diagnosis Events

Automatic diagnosis [AD] and domain [DOM] event messages are displayed on the platform and domain console or in the syslog host, if a loghost server was configured. The [AD] or [DOM] event messages (see CODE EXAMPLE 5-1, CODE EXAMPLE 5-2, and CODE EXAMPLE 5-3) include the following information:

Reviewing the Event Log

If you have platform administrator or platform service privileges, you can use the showlogs command to view the contents of the event log and obtain more detailed information about a particular type of event. The information displayed can also be used by your service provider for troubleshooting purposes.

You can obtain information on the following types (classes) of events recorded in the event log:

TABLE 5-3 describes some of the various ways to view event information through the showlogs command.


TABLE 5-3 showlogs(1M) Command Options for Displaying Error and Fault Event Information

Command Options

Description

showlogs -E -p e

Displays the last event in the event log in a condensed format.

showlogs -E -p e number

Displays the event data for the last number of events in a condensed format. For example, showlogs -E -p e 3 displays condensed event information for the last three events in the event log.

showlogs -p e list

Displays the last list event in the event log.

showlogs -p e ereport

Displays the last ereport (error report) in the event log. An error report contains specific information about the hardware entity, such as an unexpected condition or behavior.

showlogs -d domain_ID -p e number

Displays the last number of events in the specified domain.

showlogs -E -p e event_code

Displays condensed event log information for the specified event code.
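If you call showlogs from a monitoring script, building the argument list in one place keeps the option order consistent with the forms in TABLE 5-3. The following Python sketch only assembles the argument list and does not run the command. The function name showlogs_args is hypothetical, and option combinations that do not appear in TABLE 5-3 are not confirmed by this chapter, so verify them against the showlogs(1M) man page before use.

```python
from typing import List, Optional


def showlogs_args(selector: Optional[str] = None,
                  condensed: bool = False,
                  domain: Optional[str] = None) -> List[str]:
    """Build a showlogs argument list using only the forms in TABLE 5-3.

    selector is a count, "list", "ereport", or an event code.
    """
    args = ["showlogs"]
    if condensed:
        args.append("-E")        # condensed format
    if domain is not None:
        args += ["-d", domain]   # restrict to one domain
    args += ["-p", "e"]          # select the event log
    if selector is not None:
        args.append(str(selector))
    return args


print(" ".join(showlogs_args("3", condensed=True)))   # prints: showlogs -E -p e 3
print(" ".join(showlogs_args("ereport")))             # prints: showlogs -p e ereport
```

On a system controller, the resulting list could be passed to a process-launching call such as subprocess.run by a user with platform administrator or platform service privileges.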


For details on the showlogs command options and examples of event output, refer to the showlogs(1M) command description in the System Management Services (SMS) 1.4 Reference Manual.