Managing Faults, Defects, and Alerts in Oracle® Solaris 11.4

Updated: November 2020

Displaying Information About Faulted Hardware

Use the fmadm list-fault command to display active fault diagnoses and to determine which FRUs are involved. The fmdump command displays the contents of the log files associated with the Fault Manager daemon and is more useful as a historical record of errors, observations, and diagnoses on the system.


Tip - Base your administrative action on output from the fmadm list-fault command. The log files that the fmdump command displays are a historical record of events and do not necessarily represent active or open diagnoses. The log files that fmdump -e displays are a historical record of error telemetry that might not have been diagnosed into faults.

The fmadm list-fault command displays status information for resources that the Fault Manager identifies as faulty. The fmadm list-fault command has many options for displaying different information or displaying information in different formats. See the fmadm(8) man page for information about all the fmadm list-fault options.

Example 1  fmadm list-fault Output Showing a Faulty Disk

In the following example output, the section labeled FRU identifies the faulted component. The Location string shown in quotation marks, "/SUN-Storage-J4410.1051QCQ08A/HDD23", should match the chassis type and serial number of the chassis containing the faulty disk and the label of the disk bay in that chassis. For a location in the main system chassis, the location string would be something like "/SYS/HDD3". If no location is available, the Fault Management Resource Identifier (FMRI) of the FRU is shown. See Fault Management Glossary for definitions of chassis and FMRI.

The Status line in the FRU section of the output shows the state as faulty.

Above the FRU section, the lines labeled Affects identify components that are affected by the fault and their relative state. In this example, a single disk is affected. The disk is faulted but is still in service.

Perhaps the most useful piece of information in this output is the MSG-ID. Follow the instructions in the Action section at the end of the report to access more information about DISK-8000-0X. The Action section might include specific actions in addition to references to documents on the support site.

Every diagnosis maps to a specific MSG-ID. A diagnosis can have one or more suspects. If only one suspect is identified, the MSG-ID maps to a single fault class or diagnosis class. If more than one suspect is identified, the MSG-ID maps to more than one diagnosis class. See Fault Management Glossary for the definition of diagnosis class.

# fmadm list-fault
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3  DISK-8000-0X   Major

Problem Status    : open
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D

System Component
    Manufacturer  : Oracle
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D
    Host_ID       : 008167b1

----------------------------------------
Suspect 1 of 1 :
   Problem class : fault.io.disk.predictive-failure
   Certainty   : 100%
   Affects     : dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SUN-Storage-J4410.1051QCQ08A/HDD23"
     Manufacturer     : STEC
     Name             : ZeusIOPs
     Part_Number      : STEC-ZeusIOPs
     Revision         : 9007
     Serial_Number    : STM00011EDCA
     Chassis
        Manufacturer  : SUN
        Name          : SUN-Storage J4410
        Part_Number   : 3753659
        Serial_Number : 1051QCQ08A

Description : SMART health-monitoring firmware reported that a disk failure is
              imminent.

Response    : A hot-spare disk may have been activated.

Impact      : It is likely that the continued operation of this disk will
              result in data loss.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/DISK-8000-0X for the latest service
              procedures and policies regarding this diagnosis.
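
The summary row and the Action section above can be tied together programmatically. The following is a minimal sketch, not an Oracle-provided tool: it assumes the fmadm list-fault report has already been captured as text, and the function name parse_summary_rows and the regular expression are illustrative. The URL prefix is copied from the Action section of the sample output.

```python
import re

# Summary row of the report: TIME, EVENT-ID (a UUID), MSG-ID, SEVERITY.
# The pattern is an assumption based only on the sample output shown above.
SUMMARY_ROW = re.compile(
    r"^(?P<time>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<event_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msg_id>\S+)\s+"
    r"(?P<severity>\w+)\s*$"
)


def parse_summary_rows(report):
    """Return one dict per diagnosis summary row found in the report text."""
    rows = []
    for line in report.splitlines():
        match = SUMMARY_ROW.match(line)
        if match:
            row = match.groupdict()
            # URL prefix taken from the Action section of the sample output.
            row["article"] = "http://support.oracle.com/msg/" + row["msg_id"]
            rows.append(row)
    return rows


sample = """\
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3  DISK-8000-0X   Major
"""

for row in parse_summary_rows(sample):
    print(row["event_id"], row["msg_id"], row["severity"], row["article"])
```

The header and separator lines do not match the pattern, so only diagnosis rows are returned. The EVENT-ID captured here is the same value that the fmdump -u examples later in this section take as an argument.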

In the following sample output, a single CPU strand is affected. That CPU strand is faulted and has been taken out of service by the Fault Manager.

# fmadm list-fault
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 24 10:41:32 662ec53e-3aff-41d1-a836-ad7d1795705a  SUN4V-8002-6E  Major

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : ORCL,SPARC-T4-1
    Part_Number   : 602-4918-02
    Serial_Number : 1315BDY5D8
    Host_ID       : 862e0f5e

----------------------------------------
Suspect 1 of 1 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 100%
   Affects     : cpu:///cpuid=0/serial=15a02807e0b026b
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8

Description : The number of correctable errors associated with this strand has
              exceeded acceptable levels.

Response    : The fault manager will attempt to remove the affected strand from
              service.

Impact      : System performance may be affected.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/SUN4V-8002-6E for the latest
              service procedures and policies regarding this diagnosis.

Example 2  fmadm list-fault Output Showing Multiple Faults

In the following output, all three suspect PCI devices are described as "faulted but still in service". The unknown values indicate that no identity information is available for these devices.

# fmadm list-fault
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 23 02:48:15 a9445995-0eee-460b-82ba-d8ddb29cda71  PCIEX-8000-3S  Critical

Problem Status    : open
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D

System Component
    Manufacturer  : Oracle
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D
    Host_ID       : 008167b1

----------------------------------------
Suspect 1 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 50%
   Affects     : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
     Manufacturer     : unknown
     Name             : pciex8086,1522.108e.7b19.1
     Part_Number      : 7014747-Rev.01
     Revision         : G29837-009
     Serial_Number    : 159048B+1206A0369F048B54
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D
----------------------------------------
Suspect 2 of 3 :
   Problem class : fault.io.pciex.bus-linkerr
   Certainty   : 25%
   Affects     : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
     Manufacturer     : unknown
     Name             : pciex8086,1522.108e.7b19.1
     Part_Number      : 7014747-Rev.01
     Revision         : G29837-009
     Serial_Number    : 159048B+1206A0369F048B54
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D
----------------------------------------
Suspect 3 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 25%

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle
     Name             : unknown
     Part_Number      : 7016786
     Revision         : Rev-03
     Serial_Number    : 489089M+1208UU003X
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D
   Resource
     Location         : "/SYS/MB/PCIE1"
     Status           : faulted but still in service

Description : A problem has been detected on one of the specified devices or on
              one of the specified connecting buses.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              If a plug-in card is involved check for badly-seated cards or
              bent pins. Please refer to the associated reference document at
              http://support.oracle.com/msg/PCIEX-8000-3S for the latest
              service procedures and policies regarding this diagnosis.
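
When a diagnosis lists several suspects, it can help to see how the certainty values are distributed across FRU locations. The following sketch sums the Certainty percentages per FRU Location for a trimmed copy of the suspect sections above. Summing is an illustrative heuristic chosen for this example, not something that the Fault Manager documentation prescribes, and the function name certainty_by_fru is hypothetical.

```python
import re
from collections import defaultdict


def certainty_by_fru(report):
    """Sum suspect certainty percentages per FRU Location string."""
    totals = defaultdict(int)
    certainty = None
    in_fru = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.startswith("Suspect "):
            # A new suspect section starts: reset the per-suspect state.
            certainty, in_fru = None, False
        elif stripped.startswith("Certainty"):
            match = re.search(r"(\d+)%", stripped)
            certainty = int(match.group(1)) if match else None
        elif stripped == "FRU":
            in_fru = True
        elif stripped == "Resource":
            # The Resource block has its own Location, which is not the FRU.
            in_fru = False
        elif in_fru and stripped.startswith("Location") and certainty is not None:
            location = stripped.split(":", 1)[1].strip().strip('"')
            totals[location] += certainty
            in_fru = False
    return dict(totals)


# Trimmed from the three-suspect sample output above.
sample = """\
Suspect 1 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 50%
   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
Suspect 2 of 3 :
   Problem class : fault.io.pciex.bus-linkerr
   Certainty   : 25%
   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
Suspect 3 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 25%
   FRU
     Status           : faulty
     Location         : "/SYS/MB"
   Resource
     Location         : "/SYS/MB/PCIE1"
"""

for location, total in sorted(certainty_by_fru(sample).items()):
    print(location, str(total) + "%")
```

For this sample, the two suspects located at /SYS/MB/PCIE1 account for 75% of the certainty and the suspect at /SYS/MB accounts for the remaining 25%. This is consistent with the Action text, which suggests checking plug-in cards first.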

In the following example, two CPU strands are faulted and have been removed from service by the Fault Manager.

# fmadm list-fault
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 24 10:49:18 1479f457-d99a-4c55-9373-b33621d3aaee  SUN4V-8002-6E  Major

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : ORCL,SPARC-T4-1
    Part_Number   : 602-4918-02
    Serial_Number : 1315BDY5D8
    Host_ID       : 862e0f5e

----------------------------------------
Suspect 1 of 2 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 50%
   Affects     : cpu:///cpuid=0/serial=SERIAL1
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8
----------------------------------------
Suspect 2 of 2 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 50%
   Affects     : cpu:///cpuid=1/serial=SERIAL2
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8

Description : The number of correctable errors associated with this strand has
              exceeded acceptable levels.

Response    : The fault manager will attempt to remove the affected strand from
              service.

Impact      : System performance may be affected.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/SUN4V-8002-6E for the latest
              service procedures and policies regarding this diagnosis.

Example 3  fmdump Fault Reports

Some console messages and knowledge articles instruct you to use the fmdump command to display fault information, as shown in the following example. The information about the affected components is in the Affects line. The FRU Location value presents the human-readable FRU string. The FRU line and the Problem in line show the FMRIs. Note that the output lines in this example are artificially divided to improve readability.

$ fmdump -vu 91cfc113-eacc-44d0-8236-9e2ed3926fd3
TIME                 UUID                                 SUNW-MSG-ID  EVENT
Apr 08 08:36:50.1418 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Diagnosed
  100%  fault.io.disk.predictive-failure

        Problem in: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
                    :chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
                    :fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
                    :fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
                    0/bay=23/disk=0
           Affects: dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
               FRU: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
                    :chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
                    :fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
                    :fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
                    0/bay=23/disk=0
      FRU Location: /SUN-Storage-J4410.1051QCQ08A/HDD23
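
The hc-scheme FMRIs in the Problem in and FRU lines pack several fields into one string. The following sketch rejoins the artificially divided lines and splits the result into authority fields and hierarchy path pairs. It is a simplified illustration based only on the FMRIs shown in this section: it assumes that authority values contain no colon or slash, and the function names join_wrapped and parse_hc_fmri are hypothetical.

```python
def join_wrapped(lines):
    """Rejoin an FMRI that was divided across lines for readability."""
    return "".join(line.strip() for line in lines)


def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into authority fields and hierarchy pairs."""
    scheme, rest = fmri.split("://", 1)
    if scheme != "hc":
        raise ValueError("expected an hc:// FMRI")
    # Everything before the first slash is the colon-separated authority.
    authority, _, path = rest.partition("/")
    fields = dict(item.split("=", 1) for item in authority.split(":") if item)
    hierarchy = [tuple(part.split("=", 1)) for part in path.split("/") if part]
    return fields, hierarchy


# The "Problem in" value from the fmdump sample above, as divided there.
wrapped = [
    "hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410",
    ":chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC",
    ":fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs",
    ":fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=",
    "0/bay=23/disk=0",
]

fields, hierarchy = parse_hc_fmri(join_wrapped(wrapped))
print(fields["chassis-name"], fields["chassis-serial"], fields["fru-serial"])
print(hierarchy)
```

In this sample, the chassis name and serial number in the authority fields, together with the bay number in the hierarchy, correspond to the FRU Location string /SUN-Storage-J4410.1051QCQ08A/HDD23.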

To see the severity, descriptive text, and action in the fmdump output, use the -m option. The fmdump -m output is similar to the information you receive in FMA event notifications as described in Receiving Notification of Faults, Defects, and Alerts.

The following fmdump output shows two events, Diagnosed and Isolated, for the same faulted CPU strand:

$ fmdump -vu 662ec53e-3aff-41d1-a836-ad7d1795705a
TIME                 UUID                                 SUNW-MSG-ID   EVENT
Apr 24 10:41:32.7511 662ec53e-3aff-41d1-a836-ad7d1795705a SUN4V-8002-6E Diagnosed

  100%  fault.cpu.generic-sparc.strand

        Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                    :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
                    /motherboard=0/chip=0/core=0/strand=0
           Affects: cpu:///cpuid=0/serial=15a02807e0b026b
               FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                    :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
                    :fru-serial=465769T+1309BW0V8E:fru-part=7047134
                    :fru-revision=02/chassis=0/motherboard=0
      FRU Location: /SYS/MB

Apr 24 10:41:32.7732 662ec53e-3aff-41d1-a836-ad7d1795705a FMD-8000-9L   Isolated
  100%  fault.cpu.generic-sparc.strand

        Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                    :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
                    /motherboard=0/chip=0/core=0/strand=0
           Affects: cpu:///cpuid=0/serial=15a02807e0b026b
               FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                    :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
                    :fru-serial=465769T+1309BW0V8E:fru-part=7047134
                    :fru-revision=02/chassis=0/motherboard=0
      FRU Location: /SYS/MB

Example 4  Identifying Which CPUs Are Offline

Use the psrinfo command to display information about the CPUs:

$ psrinfo
0       faulted   since 04/24/2015 10:41:32
1       on-line   since 04/23/2015 14:52:03

The faulted state in this example indicates that the CPU has been taken offline by a Fault Manager response agent.
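
On a system with many processors, the same check can be scripted. The following sketch filters captured psrinfo output for processors whose state is anything other than on-line. It assumes the three-column layout shown above (processor ID, state, and a "since" timestamp), and the function name processors_not_online is hypothetical.

```python
def processors_not_online(psrinfo_output):
    """Return (processor_id, state) pairs for processors that are not on-line."""
    result = []
    for line in psrinfo_output.splitlines():
        parts = line.split(None, 2)
        # Skip blank lines and anything that does not start with a numeric ID.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        cpu_id, state = int(parts[0]), parts[1]
        if state != "on-line":
            result.append((cpu_id, state))
    return result


# Copied from the psrinfo sample output above.
sample = """\
0       faulted   since 04/24/2015 10:41:32
1       on-line   since 04/23/2015 14:52:03
"""

for cpu_id, state in processors_not_online(sample):
    print("processor", cpu_id, "is", state)
```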

Example 5  Identifying Bugs that Might Be the Cause of the Problem

If a fault or defect might be caused by a known bug, the bug number is shown in the Description section of the fmadm output or in the DESC section of the FMA event notification or fmdump -m output. Even if these bugs are not the cause of the problem, reviewing them might help you find the cause of the fault or defect.

The following partial fmadm list-fault output shows the Description section with bugs listed that might be the cause of the problem or might help you find the cause of the problem:

Description : The system has rebooted after a kernel panic. The following are
              potential bugs.
              stack[0]  - bug-number1 bug-number2 bug-number3

The following fmdump -m output shows the same information:

DESC: The system has rebooted after a kernel panic. The following are potential bugs.
stack[0]  - bug-number1 bug-number2 bug-number3
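
Both forms list the bug identifiers on a stack[N] line, so one parser can cover either output. The following sketch collects the identifiers per stack index. The line format is inferred only from the two samples above, the identifiers bug-number1 through bug-number3 are the placeholders used in this example, and the function name potential_bugs is hypothetical.

```python
import re

# Matches lines such as "stack[0]  - bug-number1 bug-number2", with or
# without leading indentation. The format is inferred from the samples above.
STACK_LINE = re.compile(r"^\s*stack\[(\d+)\]\s+-\s+(.+)$")


def potential_bugs(text):
    """Map each stack index to the list of bug identifiers on its line."""
    bugs = {}
    for line in text.splitlines():
        match = STACK_LINE.match(line)
        if match:
            bugs[int(match.group(1))] = match.group(2).split()
    return bugs


# The fmdump -m form from the sample above.
sample = """\
DESC: The system has rebooted after a kernel panic. The following are potential bugs.
stack[0]  - bug-number1 bug-number2 bug-number3
"""

print(potential_bugs(sample))
```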