CHAPTER 5
Sun Fire X4500 Fault Management Architecture
The Sun Fire X4500 server features the latest fault management technologies. With the Solaris 10 Operating System (OS), the Sun Fire X4500 server introduces a new Fault Management Architecture (FMA) that diagnoses component faults and predicts failures before they occur. This technology is incorporated into both the hardware and software of the server.
At the heart of the Sun Fire X4500 server Fault Manager is the diagnosis engine. The disk diagnosis engine receives data relating to hardware and software errors and automatically diagnoses the underlying problems. It runs in the background, silently capturing telemetry, until a diagnosis can be completed or a fault can be predicted.
After processing sufficient telemetry to reach a conclusion, a diagnosis engine produces a fault event, which is broadcast to any agents deployed on the system that know how to respond. A software component known as the Solaris Fault Manager, fmd(1M), manages the diagnosis engines and agents, provides a simplified programming model for these clients as well as common facilities such as event logging, and manages the multiplexing of events between producers and consumers.
The Sun Fire X4500 Server has a Fault Management Application (FMA) that provides fault monitoring and hotplug processing. The FMA provides passive fault monitoring by analyzing each disk once per hour to determine if a disk fault is imminent. If a disk fault is imminent, an FMA fault is generated and the amber Fault LED for that disk is activated.
The Sun Fire X4500 server FMA obtains diagnostic information from the fault management utilities in Solaris. The fault management commands used are fmd(1M), fmadm(1M), fmdump(1M), and fmstat(1M). Refer to the man pages for more information about the individual fault management utilities.
The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem.
Each problem diagnosed by the fault manager is assigned a Universally Unique Identifier (UUID). The UUID uniquely identifies this particular problem across any set of systems. The fmdump(1M) utility can be used to view the list of problems diagnosed by the fault manager, along with their UUIDs and knowledge article message identifiers. The fmadm(1M) utility can be used to view the resources on the system believed to be faulty. The fmstat(1M) utility can be used to report statistics kept by the fault manager. The fault manager is started automatically when Solaris boots, so it is not necessary to use the fmd command directly.
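The three reporting utilities described above can be run in sequence for a quick health check. The following is a minimal sketch, assuming a POSIX-compatible shell (such as ksh or bash) and sufficient privileges to run the fault management commands. The function name fm_overview is hypothetical and is not part of Solaris.

```shell
# fm_overview: hypothetical helper, not part of Solaris.
# Runs the three reporting utilities in turn, using only their
# simplest invocations.
fm_overview() {
    echo "== Problems diagnosed (fmdump) =="
    fmdump

    echo "== Resources believed to be faulty (fmadm faulty) =="
    fmadm faulty

    echo "== Fault manager statistics (fmstat) =="
    fmstat
}
```

After the function is defined, typing fm_overview prints the three reports one after another. Each section corresponds to one of the utilities named in the paragraph above.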
When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslog daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to view additional information about the problem from Sun’s knowledge article database at:
For more information, refer to the fmd(1M) man page.
The fmdump command displays the list of faults detected by the FMA. You can use this command for the following reasons:
To use the fmdump command to identify faults:
Check the event log by typing the fmdump command with -v for verbose output. For example:
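A minimal sketch of this step follows, assuming a POSIX-compatible shell and sufficient privileges to read the fault log. The function name fm_show_faults is hypothetical. The -v option requests verbose output, and the -u option limits the output to the problem with the given UUID.

```shell
# fm_show_faults: hypothetical helper, not part of Solaris.
# With no argument, lists every diagnosed problem in verbose form.
# With one argument, limits the output to the problem with that UUID.
fm_show_faults() {
    if [ $# -eq 0 ]; then
        fmdump -v
    else
        fmdump -v -u "$1"
    fi
}
```

Type fm_show_faults to list all diagnosed problems, or fm_show_faults followed by a UUID taken from that listing to examine a single problem.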
The following is an example of displayed information. This example provides details about the date, time and unique identifier related to the fault:
To determine which disk failed, you can view the FMA fault error log, use the fmdump command, or open the system cover to look for illuminated LEDs. If you use the fmdump command to isolate a disk, you should also open the system cover and look for amber LEDs.
The following shows an example of the fmdump command you can use to display disk faults.
The following is an example of the information that can be displayed when a disk fault is detected and the fmdump command is used:
Based on the information displayed, you can determine which disk failed and the attachment point.
For more information, refer to the fmdump(1M) man page.
When the Solaris FMA facility detects faults, the faults are logged and displayed on the console. After the fault condition is corrected, for example by replacing a faulty disk, you must clear the fault.
The fmadm command can be used to view and modify system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault. The fmadm command can be used to:
In cases where the disk fault is cleared, some persistent fault information can remain and result in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the fmadm repair UUID command.
To use the fmadm command to clear faults:
Clear faults by typing the fmadm repair command. For example:
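A minimal sketch of this step follows, assuming a POSIX-compatible shell, superuser privileges, and a UUID obtained from earlier fmdump output. The function name fm_clear_fault is hypothetical. It marks the problem as repaired and then runs fmadm faulty so that you can confirm the resource is no longer listed.

```shell
# fm_clear_fault: hypothetical helper, not part of Solaris.
# $1 is the UUID of a diagnosed problem, as reported by fmdump.
fm_clear_fault() {
    if [ $# -ne 1 ] || [ -z "$1" ]; then
        echo "usage: fm_clear_fault UUID" >&2
        return 2
    fi
    # Mark the problem as repaired; stop here if fmadm reports an error.
    fmadm repair "$1" || return 1
    # List the resources the fault manager still believes to be faulty.
    fmadm faulty
}
```

Run the function only after the underlying condition has been corrected, for example after the faulty disk has been replaced.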
For more information, see the fmadm(1M) man page.
This section discusses statistics associated with the Fault Management Architecture.
The fmstat command reports statistics associated with the Solaris Fault Manager, including statistical information about faults handled by the FMA.
In the example below, an event was received. A case is opened for that event and a diagnosis is performed.
Check the event log by typing the fmstat command with -v for verbose output. For example:
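A minimal sketch of gathering these statistics follows, assuming a POSIX-compatible shell and sufficient privileges. The function name fm_stats is hypothetical. The sketch uses only the plain fmstat invocation and the -m option, which reports the statistics kept by a single fault manager module.

```shell
# fm_stats: hypothetical helper, not part of Solaris.
# With no argument, reports statistics for every loaded module.
# With one argument, reports the statistics kept by that module only.
fm_stats() {
    if [ $# -eq 0 ]; then
        fmstat
    else
        fmstat -m "$1"
    fi
}
```

Type fm_stats for the summary of all modules, or fm_stats followed by a module name taken from the first column of that summary.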
The following is an example of information that might be displayed:
For detailed instructions on the fmstat command, refer to the fmstat(1M) man page.
Copyright © 2009 Sun Microsystems, Inc. All rights reserved.