CHAPTER 12

Sun Fire X4540 Fault Management Architecture

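This chapter includes the following topics:

Fault Management Architecture Overview
Sun Fire X4540 Fault Management Utilities
Diagnosing Disk Faults
Clearing Disk Faults
Displaying Fault Statistics Using the fmstat Command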


Fault Management Architecture Overview

The Sun Fire X4540 server features the latest fault management technologies. With the Solaris 10 Operating System (OS), the Sun Fire X4540 Server introduces a new Fault Management Architecture (FMA) that diagnoses and predicts component failures before they actually occur. This technology is incorporated into both the hardware and software of the server.

At the heart of the Sun Fire X4540 server Fault Manager is the diagnosis engine. The disk diagnosis engine receives data relating to hardware and software errors and automatically diagnoses the underlying problems. It runs silently in the background, capturing telemetry until a diagnosis can be completed or a fault can be predicted.

After processing sufficient telemetry to reach a conclusion, a diagnosis engine produces another event called a fault event that is broadcast to any agents deployed on the system that know how to respond. A software component known as the Solaris Fault Manager, fmd(1M), manages the diagnosis engines and agents, provides a simplified programming model for these clients as well as common facilities such as event logging, and manages the multiplexing of events between producers and consumers.
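
The event flow described above can be pictured with a small toy model. The following Python sketch is only an illustration and is not the fmd(1M) programming interface: the ToyFaultManager class, the make_diagnosis_engine function, the event fields, and the threshold of three errors are all hypothetical names and values chosen for this example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class ToyFaultManager:
    """Toy stand-in for event multiplexing and logging (not the fmd API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.log: List[Event] = []

    def subscribe(self, event_class: str, handler: Handler) -> None:
        self._subscribers[event_class].append(handler)

    def publish(self, event: Event) -> None:
        self.log.append(event)  # common facility: every event is logged
        for handler in list(self._subscribers[event["class"]]):
            handler(event)


def make_diagnosis_engine(manager: ToyFaultManager, threshold: int = 3) -> Handler:
    """Return a handler that counts error events per disk.

    The threshold is a hypothetical value used only for illustration.
    """
    error_counts: DefaultDict[str, int] = defaultdict(int)

    def on_error(event: Event) -> None:
        disk = event["disk"]
        error_counts[disk] += 1
        if error_counts[disk] == threshold:
            # Enough telemetry: broadcast a fault event to the agents.
            manager.publish({"class": "fault", "disk": disk})

    return on_error


manager = ToyFaultManager()
faulted_disks: List[str] = []

# The diagnosis engine consumes error events; an agent consumes fault events.
manager.subscribe("error", make_diagnosis_engine(manager))
manager.subscribe("fault", lambda event: faulted_disks.append(event["disk"]))

for _ in range(3):
    manager.publish({"class": "error", "disk": "disk0"})

print(faulted_disks)  # ['disk0']
```

In this model the diagnosis engine is both a consumer (of error events) and a producer (of fault events), which is the role the text describes for it.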

The Sun Fire X4540 Server has a Fault Management Application (FMA) that provides fault monitoring and hotplug processing. The FMA provides passive fault monitoring by analyzing each disk once per hour to determine if a disk fault is imminent. If a disk fault is imminent, an FMA fault is generated and the amber Fault LED for that disk is activated.


Sun Fire X4540 Fault Management Utilities

The Sun Fire X4540 server FMA obtains diagnostic information from the fault management utilities in Solaris. This chapter describes the fmd, fmdump, fmadm, and fmstat utilities.

You can also refer to the man pages for fmd(1M), fmadm(1M), fmdump(1M), and fmstat(1M) for more information about the individual fault management utilities.

fmd Command

The Solaris OS uses the fault manager daemon, fmd(1M), which starts at boot time and runs in the background to monitor the system. If a component generates an error, the daemon handles the error by correlating the error with data from previous errors and other related information to diagnose the problem.

Each problem diagnosed by the fault manager is assigned a Universally Unique Identifier (UUID), which identifies that particular problem across any set of systems. The fmdump(1M) utility can be used to view the list of problems diagnosed by the fault manager, along with their UUIDs and knowledge article message identifiers. The fmadm(1M) utility can be used to view the resources on the system believed to be faulty. The fmstat(1M) utility can be used to report statistics kept by the fault manager. The fault manager is started automatically when Solaris boots, so it is not necessary to use the fmd command directly.

When possible, the fault manager daemon initiates steps to self-heal the failed component and take the component offline. The daemon also logs the fault to the syslog daemon and provides a fault notification with a message ID (MSGID). You can use the message ID to get additional information about the problem from Sun’s knowledge article database at:

http://www.sun.com/msg/
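
If you collect message IDs in a script, a small helper can turn each one into a lookup address. The following Python sketch assumes that the article address is formed by appending the message ID (for example, DISK-8000-0X) to the URL above, and that message IDs have the three-part shape seen in that example. Both are assumptions, and the article_url function is a hypothetical name.

```python
import re

KNOWLEDGE_BASE = "http://www.sun.com/msg/"  # base URL given in this chapter

# Assumed shape of a message ID, based on examples such as DISK-8000-0X.
MSGID_PATTERN = re.compile(r"^[A-Z0-9]+-[0-9]{4}-[A-Z0-9]{2}$")


def article_url(msgid: str) -> str:
    """Build a knowledge article URL by appending the message ID (assumed scheme)."""
    msgid = msgid.strip()
    if not MSGID_PATTERN.match(msgid):
        raise ValueError(f"unexpected message ID format: {msgid!r}")
    return KNOWLEDGE_BASE + msgid


print(article_url("DISK-8000-0X"))  # http://www.sun.com/msg/DISK-8000-0X
```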

For more information, see the fmd(1M) man page.

fmdump Command

The fmdump command displays the list of faults detected by the FMA. You can use this command for the following reasons:

For more information, see the fmdump(1M) man page.

Using the fmdump Command to Identify Faults

Check the event log by typing the fmdump command with -v for verbose output. For example:


# fmdump -v 

The following is an example of information that is displayed. In this example, a fault is displayed, providing details about the date, time, and unique identifier related to the fault:


CODE EXAMPLE 12-1 fmdump Command Verbose Output
TIME                 UUID                                 SUNW-MSG-ID
Jul 11 13:55:01.5548 e92f2cec-e393-cd04-89ff-c5e2081b9940 DISK-8000-0X
  100%  fault.io.disk.predictive-failure
Problem in: hc:///:serial=VDK41BT4C7MB7S:part=HITACHI-HDS7225SBSUN250G-527N7MB7S:revision=V44OA81A/motherboard=0/hostbridge=2/pcibus=9/pcidev=8/pcifn=0/pcibus=11/pcidev=1/pcifn=0/sata-port=1/disk=0
           Affects: hc:///:serial=VDK41BT4C7MB7S/component=sata5/1
               FRU: hc:///component=HD_ID_16 
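
When you need to process this output in a script, the records can be pulled apart with a few regular expressions. The following Python sketch parses text laid out like CODE EXAMPLE 12-1. The parse_fmdump function and the dictionary keys are hypothetical names, and the patterns are inferred only from the examples in this chapter, so other fmdump output layouts might need adjustments.

```python
import re
from typing import Dict, List

# Patterns inferred from the sample output in this chapter (assumptions).
HEADER_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)$"
)
FAULT_RE = re.compile(r"^(?P<certainty>\d+)%\s+(?P<fault_class>\S+)$")
FIELD_RE = re.compile(r"^(?P<key>Problem in|Affects|FRU):\s*(?P<value>\S+)$")


def parse_fmdump(text: str) -> List[Dict[str, str]]:
    """Return one dictionary per fault record found in fmdump -v style text."""
    records: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match:
            records.append(match.groupdict())  # start a new record
            continue
        if not records:
            continue  # skip the column header and anything before a record
        match = FAULT_RE.match(line)
        if match:
            records[-1].update(match.groupdict())
            continue
        match = FIELD_RE.match(line)
        if match:
            records[-1][match.group("key")] = match.group("value")
    return records


SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID
Jul 11 13:55:01.5548 e92f2cec-e393-cd04-89ff-c5e2081b9940 DISK-8000-0X
  100%  fault.io.disk.predictive-failure
Problem in: hc:///:serial=VDK41BT4C7MB7S:part=HITACHI-HDS7225SBSUN250G-527N7MB7S:revision=V44OA81A/motherboard=0/hostbridge=2/pcibus=9/pcidev=8/pcifn=0/pcibus=11/pcidev=1/pcifn=0/sata-port=1/disk=0
           Affects: hc:///:serial=VDK41BT4C7MB7S/component=sata5/1
               FRU: hc:///component=HD_ID_16
"""

for record in parse_fmdump(SAMPLE):
    print(record["uuid"], record["msgid"], record["fault_class"], record["FRU"])
```

The UUID extracted this way is the value you pass to the fmdump -u and fmadm repair commands.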
 


Diagnosing Disk Faults

To determine which disk failed, you can view the FMA fault error log, use the fmdump command, or open the system cover to look for illuminated LEDs. If you use the fmdump command to isolate a disk, you should also open the system cover and look for amber LEDs. The following is an example of the fmdump command you can use to display disk faults, where uuid is the identifier of the fault you want to examine.


# fmdump -v -u uuid

The following is an example of information that is displayed when a disk fault is detected and the fmdump command is used.


CODE EXAMPLE 12-2 fmdump Command Diagnose Disk Fault
TIME                 UUID                                 SUNW-MSG-ID
May 09 13:38:24.9404 9a2c5052-687b-e196-b12b-8035267c3031 DISK-8000-0X
   100%  fault.io.disk.predictive-failure
Problem in: hc:///:serial=VDK41BT4C7PJYS:part=HITACHI-HDS7225SBSUN250G-527N7PJYS:revision=V44OA81A/motherboard=0/hostbridge=2/pcibus=9/pcidev=8/pcifn=0/pcibus=11/pcidev=1/pcifn=0/sata-port=6/disk=0
            Affects: hc:///component=sata5/6
                FRU: hc:///component=HD_ID_29

Based on the information displayed, you can determine which disk failed and the attachment point.
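
The "Problem in" resource string carries the disk serial number and the SATA port. The following Python sketch splits such a string into its parts. The layout it assumes (a leading segment of colon-separated properties followed by name=value path segments) is inferred from the examples in this chapter and is not a complete description of the hc scheme. The parse_hc_fmri and disk_location functions are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def parse_hc_fmri(fmri: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split an hc:/// resource string into identity properties and path pairs."""
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError(f"not an hc:/// resource string: {fmri!r}")
    segments = [seg for seg in fmri[len(prefix):].split("/") if seg]

    # Optional first segment such as ":serial=...:part=...:revision=..."
    props: Dict[str, str] = {}
    if segments and segments[0].startswith(":"):
        for item in segments.pop(0).lstrip(":").split(":"):
            key, _, value = item.partition("=")
            props[key] = value

    # Remaining segments are name=value pairs; names such as pcibus can repeat,
    # so they are kept as an ordered list rather than a dictionary.
    path = [(name, value) for name, _, value in (seg.partition("=") for seg in segments)]
    return props, path


def disk_location(fmri: str) -> Dict[str, Optional[str]]:
    """Pick out the fields that identify the disk and its SATA port."""
    props, path = parse_hc_fmri(fmri)
    by_name = dict(path)  # sata-port and disk each appear once in the examples
    return {
        "serial": props.get("serial"),
        "sata_port": by_name.get("sata-port"),
        "disk": by_name.get("disk"),
    }


PROBLEM_IN = (
    "hc:///:serial=VDK41BT4C7PJYS:part=HITACHI-HDS7225SBSUN250G-527N7PJYS"
    ":revision=V44OA81A/motherboard=0/hostbridge=2/pcibus=9/pcidev=8/pcifn=0"
    "/pcibus=11/pcidev=1/pcifn=0/sata-port=6/disk=0"
)

print(disk_location(PROBLEM_IN))
# {'serial': 'VDK41BT4C7PJYS', 'sata_port': '6', 'disk': '0'}
```

This sketch is intended for the "Problem in" string only. The "Affects" string in the examples ends in a segment without a name=value form (sata5/6), which this simple parser does not interpret.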

For more information, see the fmdump(1M) man page.


Clearing Disk Faults

When the Solaris FMA facility detects faults, the faults are logged and displayed on the console. After the fault condition is corrected, for example by replacing a faulty disk, you must clear the fault.

fmadm Command

The fmadm command can be used to view and modify system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault. The fmadm command can be used to:

In some cases, after the disk fault is cleared, persistent fault information can remain and result in erroneous fault messages at boot time. To ensure that these messages are not displayed, run the fmadm repair UUID command.

Using the fmadm Command to Clear Faults

Clear faults by typing the fmadm repair command. For example:


# fmadm repair 9a2c5052-687b-e196-b12b-8035267c3031
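
If you clear faults from a script, it can help to validate the UUID before the command is run. The following Python sketch builds the fmadm repair command line and, by default, only returns it without executing anything. The repair_command and repair functions and the dry_run parameter are hypothetical names for this example, and the sketch deliberately accepts only a UUID argument.

```python
import re
import subprocess
from typing import List

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def repair_command(uuid: str) -> List[str]:
    """Return the argument list for fmadm repair after checking the UUID format."""
    uuid = uuid.strip().lower()
    if not UUID_RE.match(uuid):
        raise ValueError(f"not a valid fault UUID: {uuid!r}")
    return ["fmadm", "repair", uuid]


def repair(uuid: str, dry_run: bool = True) -> List[str]:
    """Run fmadm repair, or only return the command when dry_run is True."""
    command = repair_command(uuid)
    if not dry_run:
        # Must be run with sufficient privileges on the Solaris host.
        subprocess.run(command, check=True)
    return command


print(" ".join(repair("9a2c5052-687b-e196-b12b-8035267c3031")))
# fmadm repair 9a2c5052-687b-e196-b12b-8035267c3031
```

Keeping dry_run as the default means that a typing mistake in the calling script prints a command instead of changing fault state.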

For more information, see the fmadm(1M) man page.


Displaying Fault Statistics Using the fmstat Command

The fmstat command reports statistics kept by the Solaris Fault Manager about the faults handled by the Fault Management Architecture (FMA).

For more information, see the fmstat(1M) man page.

In the example below, an event is received. Then, a case is opened for that event and a diagnosis is performed.

To display statistical information:

Display fault manager statistics by typing the fmstat command. For example:


# fmstat -v

The following is an example of information that is displayed:


CODE EXAMPLE 12-3 fmstat Command Example
module             ev_recv ev_acpt wait  svc_t %w %b open solve memsz bufsz
cpumem-diagnosis         0       0  0.0    0.0  0  0    0     0  3.0K     0
cpumem-retire            0       0  0.0    0.0  0  0    0     0     0     0
eft                      1       1  0.0 1191.8  0  0    1     1  3.3M   11K
fmd-self-diagnosis       0       0  0.0    0.0  0  0    0     0     0     0
io-retire                1       0  0.0   32.4  0  0    0     0   37b     0
syslog-msgs              1       0  0.0    0.5  0  0    0     0   32b     0
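
To find the modules that have opened or solved cases, the table can be read column by column. The following Python sketch parses a few rows taken from CODE EXAMPLE 12-3 and lists the modules whose open or solve counters are greater than zero. The parse_fmstat and active_modules functions are hypothetical names, and the sketch assumes whitespace-separated columns as shown in the example.

```python
from typing import Dict, List


def parse_fmstat(text: str) -> List[Dict[str, str]]:
    """Turn fmstat style text into one dictionary per module row."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Keep only rows that have exactly one value per column heading.
    return [dict(zip(header, row)) for row in rows if len(row) == len(header)]


def active_modules(rows: List[Dict[str, str]]) -> List[str]:
    """Return the modules that have opened or solved at least one case."""
    return [
        row["module"]
        for row in rows
        if int(row["open"]) > 0 or int(row["solve"]) > 0
    ]


SAMPLE = """\
module             ev_recv ev_acpt wait  svc_t %w %b open solve memsz bufsz
cpumem-retire            0       0  0.0    0.0  0  0    0     0     0     0
eft                      1       1  0.0 1191.8  0  0    1     1  3.3M   11K
io-retire                1       0  0.0   32.4  0  0    0     0   37b     0
syslog-msgs              1       0  0.0    0.5  0  0    0     0   32b     0
"""

rows = parse_fmstat(SAMPLE)
print(active_modules(rows))  # ['eft']
```

In the sample, only the eft module has an open and solved case, which matches the event, case, and diagnosis sequence described for this example.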