CHAPTER 11

Domain Events
Event monitoring periodically checks the domain and hardware status to detect conditions that require an action. The action taken depends on the condition and can involve reporting the condition or initiating automated procedures to deal with it. This chapter describes the events that monitoring detects and the actions taken in response to them.
This chapter covers log file maintenance and the handling of domain reboot, domain panic, Solaris software hang, hardware configuration, environmental, hardware error, and SC failure events.
SMS logs all significant actions taken in response to an event, other than logging itself and updating user monitoring displays. Log messages for significant domain software events and their response actions are written to the message log file for the affected domain, located in /var/opt/SUNWSMS/adm/domain-id/messages. The log includes information to support subsequent servicing of the hardware or software.
SMS writes log messages for significant hardware events to the platform log file, located in /var/opt/SUNWSMS/adm/platform/messages. For significant hardware events that visibly affect one or more domains, SMS also writes log messages to /var/opt/SUNWSMS/adm/domain-id/messages for each affected domain.
The actions taken in response to events that crash domain software include automatic system recovery (ASR) reboots of all affected domains, provided that the domain hardware (or a bootable subset of it) meets the requirements for safe and correct operation.
SMS also logs domain console, syslog, event, post, and dump information and manages sms_core files.
SMS software maintains SC-resident copies of all server information that it logs. Use the showlogs(1M) command to access log information.
The platform message log file can be accessed only by administrators for the platform, using the following command:
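For example, assuming the -p (platform) option of showlogs(1M) (verify the exact syntax against the man page):

    sc0:sms-user:> showlogs -p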
SMS log information relevant to a configured domain can be accessed only by administrators for that domain. SMS maintains separate log files for each domain. To access the files, type the following command:
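A typical invocation (the -d option for selecting a domain is taken from the showlogs(1M) convention; confirm against the man page):

    sc0:sms-user:> showlogs -d domain-id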
where domain-id is the ID for a domain. Valid domain IDs are A through R and are not case sensitive.
SMS maintains copies of domain syslog files on the SC in /var/opt/SUNWSMS/adm/domain-id/syslog. The syslog information can be accessed only by administrators for that domain.
To access the information, type the following command:
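For example (the -S syslog option is an assumption about the showlogs(1M) syntax; confirm against the man page):

    sc0:sms-user:> showlogs -d domain-id -S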
Solaris console output logs provide valuable insight into what happened before a domain crashed. Console output for a crashed domain is available on the SC in /var/opt/SUNWSMS/adm/domain-id/console. Console information can be accessed only by administrators for that domain.
To access the information, type the following command:
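For example (the -c console option is an assumption; the showlogs(1M) man page gives the authoritative syntax):

    sc0:sms-user:> showlogs -d domain-id -c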
XIR state dumps, generated by the reset command, can be displayed using showxirstate. For more information, refer to the showxirstate man page.
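A typical invocation (the -d option is an assumption; refer to the showxirstate man page for the exact syntax):

    sc0:sms-user:> showxirstate -d domain-id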
Domain post logs are for service diagnostic purposes and are not displayed by showlogs or any SMS CLI.
The /var/tmp/sms_core.daemon files are binary files and are not directly viewable.
The availability of various log files on the SC supports analysis and correction of problems that prevent a domain or domains from booting. For more information, refer to the showlogs man page.
Note - Panic dumps for panicked domains are available in the /var/crash logs on the domain, not on the SC.
TABLE 11-1 lists the SMS log information types and their descriptions.
SMS manages the log files, as necessary, to keep the SC disk utilization within acceptable limits.
The message log daemon (mld) checks message log size, file count per directory, and file age every 10 minutes, and takes action when the first of these limits is reached. TABLE 11-2 lists the mld default settings.
Assuming 20 directories, the defaults represent approximately 4 Gbytes of stored logs.
Caution - The parameters shown in TABLE 11-2 are stored in the file /etc/opt/SUNWSMS/config/mld_tuning. For any changes to take effect, mld must be stopped and restarted. Only an administrator experienced with system disk utilization should edit this file. Improperly changing the parameters in this file could flood the disk and hang or crash the SC.
When a size limit is reached, mld rotates the files: starting with the oldest file x.X, it moves that file to x.X+1. If the oldest file is already at its count limit (messages.9, or sms_core.daemon.1 for core files), mld deletes that file and begins the rotation with x.X-1.
For example, messages becomes messages.0, messages.0 becomes messages.1, and so on up to messages.9. When messages reaches 2.5 Mbytes, messages.9 is deleted, each remaining file is bumped up by one, and a new empty messages file is created (a sketch of this scheme follows the notes below).
When messages or sms_core.daemon reaches its count limit, then the oldest message or core file is deleted.
When any message file reaches x days, it is deleted.
Note - By default, the age limit (*_log_keep_days) is set to zero and is not used.

Note - Post files are provided for service diagnostic purposes and are not intended for display.
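The rotation scheme described above can be illustrated with a minimal shell sketch. This is for illustration only; mld performs the rotation internally, and the domain directory and count limit shown are taken from the defaults described above:

    # Rotate the message log for domain A when it reaches its size limit.
    cd /var/opt/SUNWSMS/adm/A
    rm -f messages.9                  # oldest file at the count limit is dropped
    i=8
    while [ $i -ge 0 ]; do            # messages.8 -> messages.9, ..., messages.0 -> messages.1
        [ -f messages.$i ] && mv messages.$i messages.`expr $i + 1`
        i=`expr $i - 1`
    done
    mv messages messages.0            # current log becomes messages.0
    touch messages                    # start a new, empty messages file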
For more information, refer to the mld and showlogs man pages, and see Message Logging Daemon.
SMS monitors domain software status (see Software Status) to detect domain reboot events.
Since the domain software is incapable of rebooting itself, SMS software controls the initial sequence for all domain reboots. As a result, SMS is always aware of domain reboot initiation events.
SMS software logs the initiation of each reboot and the passage through each significant stage of booting a domain to the domain-specific log file.
SMS software detects all domain reboot failures.
Upon detecting a domain reboot failure, SMS logs the reboot failure event to the domain-specific message log.
SC resident per-domain log files are available for failure analysis. In addition to the reboot failure logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output, as described in Log File Maintenance.
Domain reboot failures are handled as follows:
A subsequent domain hardware failure is handled by the reboot procedure.
A subsequent domain software failure is handled by a quick reboot procedure, and the reboot or reset request is handled by the fast bringup procedure.
SMS tries all ASR methods at its disposal to boot a domain that has failed booting. All recovery attempts are logged in the domain-specific message log.
When a domain panics, it informs dsmd so that a recovery reboot can be initiated. The panic is reported as a domain software status change (see Software Status).
The dsmd daemon is informed when the Solaris software on a domain panics.
Upon detecting a domain panic, dsmd logs the panic event to the domain-specific message log.
SC resident per-domain log files are available to assist in domain panic analysis. In addition to the panic logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output, as described in Log File Maintenance.
In general, after an initial panic where there has been no prior indication of hardware errors, SMS requests that a fast reboot be tried to bring up the domain. For more information, see Domain Reboot.
The dsmd daemon handles a panic event as follows:
A subsequent domain hardware failure is handled by the reboot procedure.
A subsequent domain software failure is handled by a quick reboot procedure, and the reboot or reset request is handled by the fast bringup procedure.
This recovery action is logged in the domain-specific message log.
The Solaris panic dump logic has been redesigned to minimize the possibility of hangs at panic time. In a panic situation, Solaris software might operate differently, either because normal functions have been shut down or because the panic itself has disabled them. An ASR reboot of a panicked Solaris domain is eventually started, even if the panicked domain hangs before it can request a reboot.
Because normal heartbeat monitoring (see Solaris Software Hang Events) might not be appropriate or sufficient to detect a panicked Solaris domain that fails to request an ASR reboot, dsmd takes special measures as necessary to detect a domain panic hang event.
Upon detecting a panic hang event, dsmd logs each occurrence, including event information, to the domain-specific message log.
Upon detecting a domain panic hang, SMS aborts the domain (see Domain Abort or Reset) and initiates an ASR reboot of the domain. dsmd logs these recovery actions in the domain-specific message log.
SC-resident log files are available to assist in panic hang analysis. In addition to the panic hang event logs, the dsmd daemon maintains duplicates of important domain-resident logs and transcripts of domain console output on the SC, as described in Log File Maintenance.
If a second domain panic is detected shortly after recovering from a panic event, dsmd classifies the domain panic as a repeated domain panic event.
In addition to the standard logging actions that occur for any panic, further recovery actions are taken when attempting to reboot after a repeated domain panic event.
The dsmd daemon monitors the Solaris heartbeat described in Solaris Software Heartbeat in each domain while Solaris software is running (see Software Status). When the heartbeat indicator is not updated for a period of time, a Solaris software hang event occurs.
The dsmd daemon detects Solaris software hangs.
Upon detecting a Solaris hang, dsmd logs the event, including event information, to the domain-specific message log.
Upon detecting a Solaris hang, dsmd requests the domain software to panic so that it can obtain a core image for analysis of the Solaris hang (Domain Abort or Reset). SMS logs this recovery action in the domain-specific message log.
The dsmd daemon monitors whether the domain software satisfies the request to panic. Upon determining noncompliance with the panic request, dsmd aborts the domain (see Domain Abort or Reset) and initiates an ASR reboot. The dsmd daemon logs these recovery actions in the domain-specific message log.
Although the core image taken as a result of the panic is available for analysis only from the domain, SC-resident log files are available to assist in domain hang analysis. In addition to the Solaris hang event logs, the dsmd daemon can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC.
Changes to the hardware configuration status are considered hardware configuration events. The esmd daemon detects the following hardware configuration events on a Sun Fire high-end system.
The insertion of a hot-pluggable unit (HPU) is a hot-plug event, and the removal of an HPU is a hot-unplug event; esmd detects and responds to both.
POST can run against different server components at different times due to domain-related events such as reboots and dynamic reconfigurations. As described in Hardware Configuration, SMS includes status from POST, which identifies components that failed testing. Consequently, changes in the POST status of a component are considered hardware configuration events. SMS logs POST-initiated hardware configuration changes to the platform message log.
In general, environmental events are detected when hardware status measurements exceed normal operational limits. Acceptable operational limits depend upon the hardware and the server configuration.
The esmd daemon verifies that measurements returned by each sensor are within acceptable operational limits. The esmd daemon logs all sensor measurements outside of acceptable operational limits as environmental events to the platform log file.
The esmd daemon also logs to the platform log file any significant actions taken in response to an environmental event (that is, actions beyond logging information or updating user displays).
The esmd daemon logs significant environmental event response actions that affect one or more domains to the log files of the affected domains.
The esmd daemon handles environmental events by removing from operation the hardware that has experienced the event (and any other hardware dependent upon the disabled component). Hardware can be left in service, however, if continued operation of the hardware does not harm the hardware or cause hardware functional errors.
The options for handling environmental events are dependent upon the characteristics of the event. All events have a time frame during which the event must be handled. Some events kill the domain software; some do not. Event response actions are such that esmd responds within the event time frame.
There are a number of responses esmd can make to environmental events, such as increasing fan speeds. In response to a detected environmental event that requires powering hardware off, esmd undertakes one of the following corrective actions: powering the affected component off immediately, shutting down the affected domain software gracefully before powering off, or removing the component from the running configuration with DR before powering off.
If the software is still running and a viable domain configuration remains after the affected hardware is removed, dsmd attempts to recover the domain.
If either of the last two options takes longer than the allotted time for the given environmental condition, esmd immediately powers off the component regardless of the state of the domain software.
SMS illuminates the Fault indicator on any hot-pluggable unit that can be identified as the cause of an environmental event.
So long as the environmental event response actions do not include shutdown of the system controllers, all domains whose software operations were terminated by an environmental event or the ensuing response actions are subject to ASR reboot as soon as possible.
ASR reboot begins immediately if there is a bootable set of hardware that can be operated in accordance with constraints imposed by the Sun Fire high-end system to assure safe and correct operation.
The following sections provide more detail about each type of environmental event that can occur on a Sun Fire high-end system.
The esmd daemon monitors temperature measurements from Sun Fire high-end systems hardware for values that are too high. There is a critical temperature threshold that, if exceeded, is handled as quickly as possible by powering off the affected hardware. High, but not critical, temperatures are handled by attempting slower recovery actions, such as a graceful shutdown or DR for the MCPU boards.
There is very little opportunity to do anything when a full power failure occurs. The entire platform, domains as well as SCs, is shut off when the plug is pulled without the benefit of a graceful shutdown. The ultimate recovery action occurs when power is restored (see Power-On Self-Test (POST)).
Power voltages for Sun Fire high-end systems are monitored to detect out-of-range events. The handling of out-of-range voltages follows the general principles outlined at the beginning of Environmental Events.
SMS checks for adequate power before powering on any boards, as mentioned in Power Control; even so, the failure of a power supply can leave the server inadequately powered. The system is equipped with redundant power supplies in the event of a failure. The esmd daemon does not take any action (other than logging) in response to a bulk power supply hardware failure. The handling of under-power events follows the general principles outlined at the beginning of Environmental Events.
The esmd daemon monitors fans for continuing operation. Should a fan fail, a fan failure event occurs. The handling of fan failures follows the general principles outlined at the beginning of Environmental Events.
The esmd daemon monitors clocks for continuing operation. Should a clock fail, esmd logs a message every 10 minutes. It also turns on manual override so the clock selector on that board never automatically starts using that clock. If the clock returns to good status, esmd turns off manual override and logs a message.
When phase lock is lost, the esmd daemon turns on manual override on all the boards and logs one message. When phase lock returns, esmd turns off manual override on all the boards and logs a message.
As described in Hardware Error Status, the occurrence of Sun Fire high-end system hardware errors is recognized at the SC by more than one mechanism. Of the errors that are directly visible to the SC, some are reported directly by PCI interrupt to the UltraSPARC processor on the SC, and others are detected only through monitoring of the hardware registers on Sun Fire high-end systems.
Other hardware errors are detected by the processors running in a domain: the domain software detects the occurrence of those errors and reports them to the SC. As with the mechanism by which the SC becomes aware of a hardware error, the error state retained by the hardware after a hardware error depends upon the specific error.
The dsmd daemon detects and handles these hardware errors as described below.
If data collected in response to a hardware error is not suitable for inclusion in a log file, the data can be saved in uniquely named files in /var/opt/SUNWSMS/adm/domain-id/dump on the SC.
SMS illuminates the Fault LED on any hot-pluggable unit that can be identified as the cause of a hardware error.
The actions taken in response to hardware errors (other than collecting and logging information as described previously) are twofold. First, it might be possible to prevent further occurrences of certain types of hardware errors by eliminating from use the hardware identified as faulty. Second, all domains that either crashed as a result of a hardware error or were shut down as a consequence of the first type of action are subject to ASR reboot actions.
In response to each detected hardware error and each domain-software-reported hardware error, dsmd undertakes the appropriate corrective actions. In some cases automatic diagnosis and domain recovery occurs (see Chapter 6), while in other instances, an ASR reboot with full POST verification is initiated for each domain brought down by a hardware error.
Note - Problems with the ASR reboot of a domain after a hardware error are detected as domain boot failure events and subject to the recovery actions described in Domain Boot Failure.
The dsmd daemon logs to the platform log file all significant actions taken in response to a hardware error (that is, actions beyond logging information or updating user displays). When a hardware error affects one or more domains, dsmd logs the significant response actions in the message log files of the affected domains.
The following sections summarize the types of hardware errors expected to be detected and handled on a Sun Fire high-end system.
Domain stops are uncorrectable hardware errors that immediately terminate the affected domains. Hardware state dumps are taken before dsmd initiates an ASR reboot of the affected domains. These files are located in /var/opt/SUNWSMS/adm/domain-id/dump.
The dsmd daemon logs the event in the domain message log file and also the event log file.
A RED_state or Watchdog reset traps to low-level domain software (OpenBoot PROM or kadb), which reports the error and requests initiation of ASR reboot of the domain.
An XIR signal (reset -x) also traps to low-level domain software (OpenBoot PROM or kadb), which retains control of the software. The domain must be rebooted manually.
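For example, an XIR can be sent from the SC with the reset command (the option placement shown is an assumption; see the reset man page for the exact syntax):

    sc0:sms-user:> reset -d domain-id -x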
Correctable data transmission errors (for example, CE ECC errors) can stop the normal transaction history recording feature of ASICs in Sun Fire high-end systems. SMS reports a transmission error as a record stop. SMS dumps the transaction history buffers of these ASICs and re-enables transaction history recording when a record stop is handled. The dsmd daemon records record stops in the domain log file.
ASIC-detected hardware failures other than domain stop or record stop include console bus errors, which might or might not impact a domain. The hardware itself does not abort any domain, but the domain software might not survive the impact of the hardware failure and could panic or hang. The dsmd daemon logs the event in the domain log file.
SMS monitors the main SC hardware and running software status as well as the hardware and running software of the spare SC, if present. In a high-availability SC configuration, SMS handles failures of the hardware or software on the main SC or failures detected in the hardware control paths (for example, console bus, or internal network connections) to the main SC by an automatic SC failover process. This cedes main responsibilities to the spare SC and leaves the former main SC as a (possibly crippled) spare.
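Failover behavior can also be controlled manually with the setfailover(1M) command; for example, the following forces a failover to the spare SC (the force argument is an assumption; see the man page):

    sc0:sms-user:> setfailover force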
SMS monitors the hardware of the main and spare SCs for failures.
SMS logs the hardware failure and related information to the platform message log.
SMS illuminates the Fault LED on a system controller with an identified hardware failure.
For more information, see Chapter 12.
Copyright © 2006, Sun Microsystems, Inc. All Rights Reserved.