CHAPTER 5
Automatic Diagnosis and Recovery
This chapter describes the automatic error diagnosis and domain recovery features included with SMS, starting with the SMS 1.4 release. This chapter covers the following topics:
When certain hardware errors occur in a Sun Fire high-end system, the system controller performs specific diagnosis and domain recovery steps. The following automatic diagnosis engines (DEs) identify and diagnose hardware errors that affect the availability of the system and its domains:
SMS diagnosis engine
The SMS DE diagnoses hardware errors associated with domain stops (dstops).
Solaris operating environment
The Solaris operating environment (also referred to as the Solaris DE) identifies non-fatal domain hardware errors and reports them to the system controller.
POST diagnosis engine
The POST DE identifies any hardware test failures that occur when the power-on self-test is run.
The following sections describe the diagnosis and recovery steps that occur for the hardware errors identified by the different diagnosis engines.
FIGURE 5-1 shows the basic diagnosis and domain recovery steps performed when hardware errors associated with a dstop are identified by the SMS diagnosis engine.
The following summary describes the process shown in FIGURE 5-1.
Hardware error detection. The system controller provides information on hardware errors involving CPU boards, processors, I/O controllers, and memory banks.
A dump file is generated whenever a dstop occurs. This file (/var/opt/SUNWSMS/sms_version/adm/domain_id/dump/dsmd.dstop.yymmdd.hhmm.ss) captures the domain hardware errors associated with the dstop.
Automatic diagnosis. The SMS DE determines a failure based on the hardware errors captured in the dstop dump file. Depending on the hardware error, the DE identifies either a single faulty field-replaceable unit (FRU) or one or more suspect FRUs.
In situations where multiple FRUs are identified by the DE, further analysis by your service provider may be required to determine the faulty FRU.
Error and fault event reporting. The DE reports diagnosis information through the following:
Auto-diagnosis fault messages that appear in the domain and platform log files.
CODE EXAMPLE 5-1 shows the information displayed for a domain stop and the auto-diagnosis message that describes a fault event on domain D. The event message begins with the [AD] indicator. See Reviewing Diagnosis Events for a description of the event message contents.
Email notification of fault events. For details, see Enabling Email Event Notification.
Fault event notification if you are using Sun Management Center. For details, refer to the Sun Management Center 3.5 Version 2 Supplement for Sun Fire 15K/12K Systems.
Notification of fault events if you are using Sun Remote Services Net Connect and have configured Net Connect accordingly.
For general information on SRS Net Connect, refer to
For SRS Net Connect product documentation, refer to
Event log output from the showlogs(1M) command, if you have platform administrator privileges.
The showlogs event output supplements the diagnosis information presented in the platform and domain message logs or the event email. The showlogs event output can be used for additional troubleshooting purposes by your service provider. For details on the event information displayed, see Obtaining Diagnosis and Recovery Information.
Component health status updates. The SMS DE records the diagnosis information for each affected component and maintains this health history as part of the component health status (CHS).
Automatic restoration. As part of the domain restoration process, POST reviews the updated component health status of the affected components and uses the CHS information to determine which components to deconfigure from the system. The appropriate components are then deconfigured, and the domain is restarted.
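The dstop dump file name described in the summary above ends in a timestamp (dsmd.dstop.yymmdd.hhmm.ss). The following Python sketch shows one way to recover that timestamp when sorting or archiving dump files. The function name and the sample file name are illustrative assumptions and are not part of SMS.

```python
import re
from datetime import datetime

# Pattern taken from the file name format described above:
# dsmd.dstop.yymmdd.hhmm.ss
DSTOP_NAME = re.compile(r"^dsmd\.dstop\.(\d{6})\.(\d{4})\.(\d{2})$")


def dstop_dump_time(filename):
    """Return the datetime encoded in a dstop dump file name, or None."""
    match = DSTOP_NAME.match(filename)
    if match is None:
        return None
    date_part, time_part, seconds = match.groups()
    try:
        return datetime.strptime(date_part + time_part + seconds, "%y%m%d%H%M%S")
    except ValueError:
        # The digits were present but did not form a valid date or time.
        return None


# Hypothetical file name, used only for illustration.
print(dstop_dump_time("dsmd.dstop.030819.1045.28"))
```

A name that does not follow the pattern returns None, so unrelated files in the dump directory can be skipped safely.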
FIGURE 5-2 shows the basic steps involved in the diagnosis of non-fatal domain hardware errors. These errors do not cause a domain to stop.
The steps shown in FIGURE 5-2 are similar to the steps discussed in the section Hardware Errors Associated with Domain Stops, except for the following differences:
Hardware error detection. The Solaris operating environment determines when a non-fatal domain hardware error has occurred and reports the error to the system controller. The affected domain is not stopped.
Automatic diagnosis and resource deconfiguration. The Solaris operating environment identifies the failure and the resources that caused the failure. If appropriate, the Solaris operating environment may also deconfigure the affected resources. For example, a CPU module may be taken off-line because of non-fatal errors that occur within the module, or a virtual memory page may be retired due to errors contained in the page.
Error and fault event reporting. The Solaris operating environment provides diagnosis information through the same channels as the SMS DE: event messages in the domain and platform logs, fault event notification through Sun Management Center, email event notification within SMS or through SRS Net Connect (if you configured those features), and showlogs(1M) event output.
CODE EXAMPLE 5-2 shows the diagnosis of a non-fatal hardware error and the event message information displayed. The event message begins with the [DOM] indicator. See Reviewing Diagnosis Events for a description of the event message contents.
Component health status updates. SMS updates the component health status of the affected hardware resources, using the information supplied by the Solaris operating environment.
Deconfiguration of appropriate resources. In cases where the Solaris operating environment could not previously deconfigure faulty domain resources, those resources are deconfigured from the system at the next domain reboot.
Whenever POST is run to test and configure system board components, any components that fail the self-test are automatically deconfigured from the system. POST updates the component health status of the affected components accordingly.
CODE EXAMPLE 5-3 shows an auto-diagnosis event message reported by the POST DE for domain B. See Reviewing Diagnosis Events for a description of the event message contents.
When you see these messages or when you are notified of these events, contact your service provider to initiate the appropriate service action.
Email event notification is an optional feature that automatically generates an email notice informing designated recipients of domain fault events when they occur. You can receive immediate notice of critical fault events, without manually monitoring the platform or domain message logs.
CODE EXAMPLE 5-4 shows an example email that reports a fault event in which two components are indicted (suspected of causing a fault). The following sections explain how to control email content and notification.
The following files work together to generate event email:
This template identifies the event information to be reported in the email. This information includes the email subject line and specific event items (tags) to be reported in the email.
Email control file (event_email.cf)
This file uses certain event information, namely the event class and the domain affected by the event, to assign the specified email recipients and email templates that control the event information to be reported.
Note - The event email feature uses the standard sendmail utility to send email to designated email recipients.
To Enable Email Event Notification
1. In the email template file, identify the event tags to be reported in email.
Copy the sample email template (sample_email) provided with SMS and edit the copied file. For details on modifying the email template, see Configuring an Email Template.
2. In the email control file, set the parameters that determine who receives the email and the email templates to be used.
Edit the email control file (event_email.cf) included with SMS and assign the email notification parameters.
For details on modifying the control file, see Configuring the Email Control File.
A sample email template file called sample_email (/etc/opt/SUNWSMS/SMS/config/templates) is provided with SMS. CODE EXAMPLE 5-5 shows the default template. The text in angle brackets serves as tags that identify the event information to be displayed in the body of the event email.
You can use the sample template file as is, or you can copy the sample template file to a new file, which can be edited to identify additional or different event tags to be contained in the email. You must have superuser privileges to copy and rename the sample template file. The name of the file can be any text string that you choose.
When you edit the file, specify the event tags to be reported in the email subject line and email body. Specify these tags on new, uncommented lines in the file (lines that do not begin with a # sign). For a list of the tags that can be specified in the email template, see TABLE 5-1.
FIGURE 5-3 shows the email template used to generate the email example shown in CODE EXAMPLE 5-4.
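The way tags and comment lines behave in a template can be illustrated with a short Python sketch. It is only a model of the idea described above: the tag names (EVENT_CODE, DOMAIN_ID), the template text, and the render_template function are hypothetical, and the tags actually supported are the ones listed in TABLE 5-1.

```python
import re

# Hypothetical template text. The tag names are placeholders and are
# not the tags defined in TABLE 5-1.
TEMPLATE = """\
# Lines that begin with a pound sign are comments and are skipped.
Subject: Fault event <EVENT_CODE> on domain <DOMAIN_ID>
Event code: <EVENT_CODE>
Domain: <DOMAIN_ID>
"""


def render_template(template, event):
    """Fill <TAG> placeholders from the event dict, skipping comment lines."""
    rendered = []
    for line in template.splitlines():
        if line.startswith("#"):
            continue
        # Unknown tags are left unchanged so that a mistyped tag stays visible.
        rendered.append(
            re.sub(r"<([A-Z_]+)>", lambda m: event.get(m.group(1), m.group(0)), line)
        )
    return "\n".join(rendered)


event = {"EVENT_CODE": "SF15000-8000-Y1", "DOMAIN_ID": "A"}
print(render_template(TEMPLATE, event))
```

Leaving unknown tags in place is a deliberate choice in this sketch: a mistyped tag shows up in the output instead of silently disappearing.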
The email control file contains the email notification parameters that do the following:
Identify the email recipients based on the event class and the domain in which the event occurred
Identify the email templates to be used
Indicate whether the event message structure is to be sent as an attachment with the event email
You specify these notification parameters in the email control file supplied with SMS (/etc/opt/SUNWSMS/SMS/config/event_email.cf). This file, shown in CODE EXAMPLE 5-6, contains comment lines that begin with a pound (#) sign. These comment lines explain how to update the file.
Use a text editor to edit the file and add the notification parameters in new, uncommented lines. You must have superuser privileges to edit the email control file and add the required email parameters. Separate each parameter with spaces or tabs. You can enter multiple notification lines that control how different event email messages are to be distributed, perhaps by domain, event class, or email template. The notification parameters that you configure are described in TABLE 5-2.
You can use regular expressions to specify ranges or specific matches for the Event_Class and Domains parameters. The email control file supports extended regular expressions (REs) as explained in the regexp(5) man page. Some examples of valid regular expressions include:
. (period) - Matches any single character.
^ (circumflex) - Forces a match to start at the beginning of the string. For example, ^fault matches any string that starts with fault.
[BDG] - Matches any single character, B or D or G.
[B-F] - Matches any single character in the range B through F, that is, B, C, D, E, or F.
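You can try the expressions listed above before adding them to the control file. The following sketch uses the Python re module, which is not the regexp(5) implementation that SMS uses, but the constructs listed here behave the same way in both. The helper name and the choice of search (match anywhere) semantics are assumptions.

```python
import re


def matches(pattern, value):
    """Return True if the pattern matches anywhere in value."""
    return re.search(pattern, value) is not None


# ^fault anchors the match at the start of the event class string.
print(matches(r"^fault", "fault.test.email"))  # True
print(matches(r"^fault", "list.fault"))        # False

# [BDG] matches exactly one of the listed domain letters.
print(matches(r"[BDG]", "D"))                  # True
print(matches(r"[BDG]", "C"))                  # False

# [B-F] matches one letter in the range B through F.
print(matches(r"[B-F]", "E"))                  # True
print(matches(r"[B-F]", "A"))                  # False
```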
CODE EXAMPLE 5-7 shows an updated email control file in which notification parameters have been added to the bottom of the file. The sendmail.sh script will be used to send event email to the two specified recipients. An event email will be generated for all fault events that occurred in domains A through C and will be formatted based on the template file called sample_email. The event message structure will be sent as a binary file attachment that accompanies the email.
Use the testemail(1M) command to verify email event notification. This command also enables you to track events and check any changes to the email control file.
To Test Email Event Notification
1. Set up the email event templates and the email control file as described in Enabling Email Event Notification.
2. In an SC window, log in as platform administrator or platform service and type:
sc0:sms-user:> /opt/SUNWSMS/SMS1.4/lib/smsadmin/testemail -c event_class_list -d domain_id [-i resource_indictment_list]
event_class_list is a list of one or more fault event classes to be tracked
domain_id specifies a single domain, A-R
resource_indictment_list is an optional list of one or more components that map to each event class specified. For a list of the valid component values, refer to the testemail(1M) man page.
For example, the following command
generates an event of type fault.test.email originating on domain A.
3. Verify that the test event was recorded in the platform or domain message logs.
For example, a message similar to the following is displayed in the platform message log:
Aug 19 10:45:28 2003 smshostname [6696:1]: [11917 682823530704603 ERR testemailApp.cc 345] Test fault with code SF15000-8000-Y1 generated by user root using testEmailReporting - please ignore
4. If the test event was successfully recorded in the message logs, verify that the designated recipients received the test email.
For example, the test email might resemble the following:
If the test email was not generated, review the next section for troubleshooting suggestions.
If you did not receive test email notification, do the following:
Review your email event templates and the email control file to verify that the files have been set up correctly.
Check the domain and platform message logs to verify that the test events were recorded.
Verify that the sendmail daemon is running. For example:
sc0:sms-user:> ps -ef | grep sendmail
root 256 1 0 Aug 06 ? 0:05 /usr/lib/sendmail -bd -q15m
sms-user 525 28546 0 21:23:15 pts/27 0:00 grep sendmail
If the sendmail daemon is not running, you might have a problem with your installation setup that requires correction. Proceed to Step 4.
Manually start sendmail, which will run until the next reboot, by logging on as superuser and restarting the sendmail daemon:
Check /var/log/syslog on the SC to see if email was sent by the Mail Transfer Agent (MTA), sendmail.
If sendmail is not configured or was configured incorrectly, error messages would appear in this log file.
Verify that the domain and nameserver IP entries (to route the email messages outside of the system controller) exist in the /etc/resolv.conf file.
Restart sendmail.sh:
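Separately from the restart step, the sendmail daemon check earlier in this list can be automated. The following Python sketch inspects captured ps -ef output and ignores the line for the grep command itself. The function name is hypothetical, and the sample text is the ps output shown in the example above.

```python
import os


def sendmail_daemon_running(ps_output):
    """Return True if a sendmail process other than the grep itself is listed."""
    for line in ps_output.splitlines():
        tokens = line.split()
        if "grep" in tokens:
            continue  # This line is the grep command, not the daemon.
        if any(os.path.basename(token) == "sendmail" for token in tokens):
            return True
    return False


# Output lines from the ps -ef example earlier in this list.
SAMPLE = """\
root 256 1 0 Aug 06 ? 0:05 /usr/lib/sendmail -bd -q15m
sms-user 525 28546 0 21:23:15 pts/27 0:00 grep sendmail
"""

print(sendmail_daemon_running(SAMPLE))  # True
```

If the function returns False, only the grep line was found and the daemon is not running.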
This section describes the various ways to monitor diagnostic errors and obtain additional information about fault and error events.
Automatic diagnosis [AD] and domain [DOM] event messages are displayed on the platform and domain consoles, or on the syslog host if a loghost server was configured. The [AD] or [DOM] event messages (see CODE EXAMPLE 5-1, CODE EXAMPLE 5-2, and CODE EXAMPLE 5-3) include the following information:
[AD] or [DOM] - Beginning of the message. AD indicates that the SMS or POST automatic diagnosis engine generated the event message. DOM indicates that the Solaris operating environment on the affected domain generated the automatic diagnosis event message.
Event - The event code, a dash-separated alphanumeric text string that uniquely identifies an event type. This code is used by your service provider to obtain further information about the event and the platform involved.
CSN - Chassis serial number, which identifies your Sun Fire high-end system.
DomainID - The domain affected by the hardware error. Valid domains are A through R.
ADInfo - The version of the auto-diagnosis message, the name of the diagnosis engine (SMS-DE, SF-SOLARIS-DE, or POST-DE), and the diagnosis engine version (the SMS version or the version of Solaris operating environment in use).
Time - The day of the week, month, time (hours, minutes, and seconds), time zone, and year of the auto-diagnosis.
Recommended-Action: Service action required - Instructs the platform or domain administrator to contact their service provider for further service action. Also indicates the end of the auto-diagnosis message.
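The fields listed above lend themselves to simple scripted extraction, for example when forwarding events to a ticketing tool. The following Python sketch assumes a hypothetical layout with one "Key: value" field per line. The sample message, including its CSN, ADInfo, and Time values, is invented for illustration; compare it against the actual messages in your logs before relying on it.

```python
import re

# Hypothetical message layout: one "Key: value" field per line.
# Real messages may place several fields on one line.
SAMPLE = """\
[AD] Event: SF15000-8000-Y1
CSN: 000000000
DomainID: A
ADInfo: 1.SMS-DE.1.4
Time: Tue Aug 19 10:45:28 PDT 2003
Recommended-Action: Service action required
"""


def parse_event_message(text):
    """Return a dict of the fields in an [AD] or [DOM] event message."""
    match = re.match(r"\s*\[(AD|DOM)\]\s*", text)
    if match is None:
        raise ValueError("message does not start with [AD] or [DOM]")
    fields = {"Indicator": match.group(1)}
    for line in text[match.end():].splitlines():
        # Split on the first colon only, so the time of day stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


message = parse_event_message(SAMPLE)
print(message["Indicator"], message["Event"], message["DomainID"])
```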
If you have platform administrator or platform service privileges, you can use the showlogs command to view the contents of the event log, to obtain more detailed information about a particular type of event. The information displayed can also be used by your service provider for troubleshooting purposes.
You can obtain information on the following types (classes) of events recorded in the event log:
Ereports - Error reports provide data on unexpected component behavior or conditions.
List events - List events provide a list of fault events or suspected faults associated with a hardware error.
TABLE 5-3 describes some of the various ways to view event information through the showlogs command.
For details on the showlogs command options and examples of event output, refer to the showlogs(1M) command description in the System Management Services (SMS) 1.4 Reference Manual.
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.