Use the fmadm list-fault command to display active fault diagnoses and determine which FRUs are involved. The fmdump command displays the contents of the log files associated with the Fault Manager daemon, so it is more useful as a historical record of errors, observations, and diagnoses on the system.
The fmadm list-fault command displays status information for resources that the Fault Manager identifies as faulty. The command has many options for displaying different information or for displaying information in different formats. See the fmadm(8) man page for information about all of the fmadm list-fault options.
Example 1 fmadm list-fault Output Showing a Faulty Disk
In the following example output, the section labeled FRU identifies the faulted component. The Location string shown in quotation marks, "/SUN-Storage-J4410.1051QCQ08A/HDD23", should match the chassis type and serial number of the chassis containing the faulty disk and the label of the disk bay in that chassis. For a location in the main system chassis, the location string would be something like "/SYS/HDD3". If no location is available, the Fault Management Resource Identifier (FMRI) of the FRU is shown. See Fault Management Glossary for definitions of chassis and FMRI.
The Status line in the FRU section of the output shows the state as faulty.
Above the FRU section, the lines labeled Affects identify components that are affected by the fault and their relative state. In this example, a single disk is affected. The disk is faulted but is still in service.
Perhaps the most useful piece of information in this output is the MSG-ID. Follow the instructions in the Action section at the end of the report to access more information about DISK-8000-0X. The Action section might include specific actions in addition to references to documents on the support site.
Every diagnosis can be mapped to a specific MSG-ID. Diagnoses may have one or more suspects. If only one suspect is identified, then the MSG-ID can be mapped to a single fault class or diagnosis class. If more than one suspect is identified, then the MSG-ID maps to more than one diagnosis class. See Fault Management Glossary for the definition of diagnosis class.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Major
Problem Status : open
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
System Component
Manufacturer : Oracle
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
Host_ID : 008167b1
----------------------------------------
Suspect 1 of 1 :
Problem class : fault.io.disk.predictive-failure
Certainty : 100%
Affects : dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
Status : faulted but still in service
FRU
Status : faulty
Location : "/SUN-Storage-J4410.1051QCQ08A/HDD23"
Manufacturer : STEC
Name : ZeusIOPs
Part_Number : STEC-ZeusIOPs
Revision : 9007
Serial_Number : STM00011EDCA
Chassis
Manufacturer : SUN
Name : SUN-Storage J4410
Part_Number : 3753659
Serial_Number : 1051QCQ08A
Description : SMART health-monitoring firmware reported that a disk failure is
imminent.
Response : A hot-spare disk may have been activated.
Impact : It is likely that the continued operation of this disk will
result in data loss.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Please refer to the associated reference document at
http://support.oracle.com/msg/DISK-8000-0X for the latest service
procedures and policies regarding this diagnosis.
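If a script rather than a person needs the MSG-ID, the summary row of the report can be picked out by pattern. The following Python sketch is illustrative only. The function name summarize_faults is hypothetical, and the sketch assumes that the summary row has the four whitespace-separated fields shown in the sample output. The article URL follows the pattern printed in the Action section above.

```python
import re

# Matches a summary row such as:
# "Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Major"
SUMMARY_ROW = re.compile(
    r"^(?P<time>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<event_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msg_id>\S+)\s+"
    r"(?P<severity>\S+)\s*$"
)


def summarize_faults(report_text):
    """Return one dict per summary row found in fmadm list-fault output."""
    faults = []
    for line in report_text.splitlines():
        match = SUMMARY_ROW.match(line.strip())
        if match:
            fault = match.groupdict()
            # URL pattern copied from the Action section of the sample output.
            fault["article"] = "http://support.oracle.com/msg/" + fault["msg_id"]
            faults.append(fault)
    return faults


# Abbreviated copy of the sample output; header and detail lines are skipped.
sample = """\
TIME EVENT-ID MSG-ID SEVERITY
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Major
Problem Status : open
"""

for fault in summarize_faults(sample):
    print(fault["event_id"], fault["msg_id"], fault["severity"], fault["article"])
```

The EVENT-ID that this sketch extracts is the same value that is passed to fmdump -u in Example 3.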
In the following sample output, a single CPU strand is affected. That CPU strand is faulted and has been taken out of service by the Fault Manager.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 24 10:41:32 662ec53e-3aff-41d1-a836-ad7d1795705a SUN4V-8002-6E Major
Problem Status : isolated
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : ORCL,SPARC-T4-1
Part_Number : 602-4918-02
Serial_Number : 1315BDY5D8
Host_ID : 862e0f5e
----------------------------------------
Suspect 1 of 1 :
Problem class : fault.cpu.generic-sparc.strand
Certainty : 100%
Affects : cpu:///cpuid=0/serial=15a02807e0b026b
Status : faulted and taken out of service
FRU
Status : faulty
Location : "/SYS/MB"
Manufacturer : Oracle Corporation
Name : PCA,MB,SPARC_T4-1
Part_Number : 7047134
Revision : 02
Serial_Number : 465769T+1309BW0V8E
Chassis
Manufacturer : Oracle Corporation
Name : ORCL,SPARC-T4-1
Part_Number : 31538783+1+1
Serial_Number : 1315BDY5D8
Description : The number of correctable errors associated with this strand has
exceeded acceptable levels.
Response : The fault manager will attempt to remove the affected strand from
service.
Impact : System performance may be affected.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Please refer to the associated reference document at
http://support.oracle.com/msg/SUN4V-8002-6E for the latest
service procedures and policies regarding this diagnosis.
Example 2 fmadm list-fault Output Showing Multiple Faults
In the following output, all three suspect PCI devices are described as "faulted but still in service". The unknown values indicate that no identity information is available for these devices.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 23 02:48:15 a9445995-0eee-460b-82ba-d8ddb29cda71 PCIEX-8000-3S Critical
Problem Status : open
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
System Component
Manufacturer : Oracle
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
Host_ID : 008167b1
----------------------------------------
Suspect 1 of 3 :
Problem class : fault.io.pciex.device-interr
Certainty : 50%
Affects : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
Status : faulted but still in service
FRU
Status : faulty
Location : "/SYS/MB/PCIE1"
Manufacturer : unknown
Name : pciex8086,1522.108e.7b19.1
Part_Number : 7014747-Rev.01
Revision : G29837-009
Serial_Number : 159048B+1206A0369F048B54
Chassis
Manufacturer : Oracle
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
----------------------------------------
Suspect 2 of 3 :
Problem class : fault.io.pciex.bus-linkerr
Certainty : 25%
Affects : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
Status : faulted but still in service
FRU
Status : faulty
Location : "/SYS/MB/PCIE1"
Manufacturer : unknown
Name : pciex8086,1522.108e.7b19.1
Part_Number : 7014747-Rev.01
Revision : G29837-009
Serial_Number : 159048B+1206A0369F048B54
Chassis
Manufacturer : Oracle
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
----------------------------------------
Suspect 3 of 3 :
Problem class : fault.io.pciex.device-interr
Certainty : 25%
FRU
Status : faulty
Location : "/SYS/MB"
Manufacturer : Oracle
Name : unknown
Part_Number : 7016786
Revision : Rev-03
Serial_Number : 489089M+1208UU003X
Chassis
Manufacturer : Oracle
Name : Sun Netra X4270 M3
Part_Number : NILE-P1LRQT-8
Serial_Number : 1211FM200D
Resource
Location : "/SYS/MB/PCIE1"
Status : faulted but still in service
Description : A problem has been detected on one of the specified devices or on
one of the specified connecting buses.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
If a plug-in card is involved check for badly-seated cards or
bent pins. Please refer to the associated reference document at
http://support.oracle.com/msg/PCIEX-8000-3S for the latest
service procedures and policies regarding this diagnosis.
In the following example, two CPU strands are faulted and have been removed from service by the Fault Manager.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 24 10:49:18 1479f457-d99a-4c55-9373-b33621d3aaee SUN4V-8002-6E Major
Problem Status : isolated
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : ORCL,SPARC-T4-1
Part_Number : 602-4918-02
Serial_Number : 1315BDY5D8
Host_ID : 862e0f5e
----------------------------------------
Suspect 1 of 2 :
Problem class : fault.cpu.generic-sparc.strand
Certainty : 50%
Affects : cpu:///cpuid=0/serial=SERIAL1
Status : faulted and taken out of service
FRU
Status : faulty
Location : "/SYS/MB"
Manufacturer : Oracle Corporation
Name : PCA,MB,SPARC_T4-1
Part_Number : 7047134
Revision : 02
Serial_Number : 465769T+1309BW0V8E
Chassis
Manufacturer : Oracle Corporation
Name : ORCL,SPARC-T4-1
Part_Number : 31538783+1+1
Serial_Number : 1315BDY5D8
----------------------------------------
Suspect 2 of 2 :
Problem class : fault.cpu.generic-sparc.strand
Certainty : 50%
Affects : cpu:///cpuid=1/serial=SERIAL2
Status : faulted and taken out of service
FRU
Status : faulty
Location : "/SYS/MB"
Manufacturer : Oracle Corporation
Name : PCA,MB,SPARC_T4-1
Part_Number : 7047134
Revision : 02
Serial_Number : 465769T+1309BW0V8E
Chassis
Manufacturer : Oracle Corporation
Name : ORCL,SPARC-T4-1
Part_Number : 31538783+1+1
Serial_Number : 1315BDY5D8
Description : The number of correctable errors associated with this strand has
exceeded acceptable levels.
Response : The fault manager will attempt to remove the affected strand from
service.
Impact : System performance may be affected.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Please refer to the associated reference document at
http://support.oracle.com/msg/SUN4V-8002-6E for the latest
service procedures and policies regarding this diagnosis.
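When a diagnosis lists several suspects, it can help to see how the certainty is distributed across FRUs. The following Python sketch parses the suspect sections and adds up the Certainty values for each FRU Location. The function names list_suspects and certainty_by_fru are hypothetical, and summing certainty per FRU is an illustrative heuristic, not a documented Fault Manager procedure. The sketch assumes the "label : value" layout shown in the output above.

```python
import re


def list_suspects(report_text):
    """Collect problem class, certainty, and FRU location for each suspect."""
    suspects = []
    current = None
    in_fru = False  # True only while reading lines of a suspect's FRU section
    for raw in report_text.splitlines():
        line = raw.strip()
        if re.match(r"Suspect \d+ of \d+", line):
            current = {"problem_class": None, "certainty": None, "fru_location": None}
            suspects.append(current)
            in_fru = False
            continue
        if current is None:
            continue
        if line == "FRU":
            in_fru = True
            continue
        if line in ("Chassis", "Resource"):
            in_fru = False
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Problem class":
            current["problem_class"] = value
        elif key == "Certainty":
            current["certainty"] = int(value.rstrip("%"))
        elif key == "Location" and in_fru:
            current["fru_location"] = value.strip('"')
    return suspects


def certainty_by_fru(suspects):
    """Sum the certainty percentages of all suspects that share a FRU location."""
    totals = {}
    for suspect in suspects:
        location = suspect["fru_location"]
        totals[location] = totals.get(location, 0) + suspect["certainty"]
    return totals


# Abbreviated from the PCIEX-8000-3S report in this example.
sample = """\
Suspect 1 of 3 :
Problem class : fault.io.pciex.device-interr
Certainty : 50%
FRU
Status : faulty
Location : "/SYS/MB/PCIE1"
Suspect 2 of 3 :
Problem class : fault.io.pciex.bus-linkerr
Certainty : 25%
FRU
Status : faulty
Location : "/SYS/MB/PCIE1"
Suspect 3 of 3 :
Problem class : fault.io.pciex.device-interr
Certainty : 25%
FRU
Status : faulty
Location : "/SYS/MB"
Resource
Location : "/SYS/MB/PCIE1"
"""

print(certainty_by_fru(list_suspects(sample)))
```

For the PCI report in this example, the sketch attributes 75 percent of the certainty to "/SYS/MB/PCIE1" and 25 percent to "/SYS/MB". The Location line in the Resource section is deliberately ignored because it names the affected resource, not the FRU.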
Example 3 fmdump Fault Reports
Some console messages and knowledge articles instruct you to use the fmdump command to display fault information, as shown in the following example. The information about the affected components is in the Affects line. The FRU Location value presents the human-readable FRU string. The FRU line and the Problem in line show the FMRIs. Note that the output lines in this example are artificially divided to improve readability.
$ fmdump -vu 91cfc113-eacc-44d0-8236-9e2ed3926fd3
TIME UUID SUNW-MSG-ID EVENT
Apr 08 08:36:50.1418 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Diagnosed
100% fault.io.disk.predictive-failure
Problem in: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
:chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
:fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
:fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
0/bay=23/disk=0
Affects: dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
FRU: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
:chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
:fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
:fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
0/bay=23/disk=0
FRU Location: /SUN-Storage-J4410.1051QCQ08A/HDD23
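The FMRI on the FRU line carries the same identity information as the FRU Location string. The following Python sketch rejoins the artificially divided lines, splits the hc-scheme FMRI into its fields, and checks that the location begins with the chassis name and serial number. The function name parse_hc_fmri is hypothetical, and the sketch assumes that no field value contains a colon or a slash, which holds for the FMRIs shown in this example.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into authority fields and path components."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: " + fmri)
    # Everything before the first slash is the colon-separated authority.
    authority, _, path = fmri[len(prefix):].partition("/")
    fields = dict(item.split("=", 1) for item in authority.split(":") if item)
    components = [tuple(part.split("=", 1)) for part in path.split("/") if part]
    return fields, components


# The FRU value from the output above, divided across lines as printed.
divided = [
    "hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410",
    ":chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC",
    ":fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs",
    ":fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=",
    "0/bay=23/disk=0",
]
fields, components = parse_hc_fmri("".join(part.strip() for part in divided))

location = "/SUN-Storage-J4410.1051QCQ08A/HDD23"
chassis_prefix = "/" + fields["chassis-name"] + "." + fields["chassis-serial"] + "/"

print(fields["fru-serial"], components)
print(location.startswith(chassis_prefix))
```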
To see the severity, descriptive text, and action in the fmdump output, use the -m option. The fmdump -m output is similar to the information you receive in FMA event notifications as described in Receiving Notification of Faults, Defects, and Alerts.
The following fmdump output shows two events for a single faulted CPU strand. The strand is first diagnosed and then isolated:
$ fmdump -vu 662ec53e-3aff-41d1-a836-ad7d1795705a
TIME UUID SUNW-MSG-ID EVENT
Apr 24 10:41:32.7511 662ec53e-3aff-41d1-a836-ad7d1795705a SUN4V-8002-6E Diagnosed
100% fault.cpu.generic-sparc.strand
Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
:chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
/motherboard=0/chip=0/core=0/strand=0
Affects: cpu:///cpuid=0/serial=15a02807e0b026b
FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
:chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
:fru-serial=465769T+1309BW0V8E:fru-part=7047134
:fru-revision=02/chassis=0/motherboard=0
FRU Location: /SYS/MB
Apr 24 10:41:32.7732 662ec53e-3aff-41d1-a836-ad7d1795705a FMD-8000-9L Isolated
100% fault.cpu.generic-sparc.strand
Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
:chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
/motherboard=0/chip=0/core=0/strand=0
Affects: cpu:///cpuid=0/serial=15a02807e0b026b
FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
:chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
:fru-serial=465769T+1309BW0V8E:fru-part=7047134
:fru-revision=02/chassis=0/motherboard=0
FRU Location: /SYS/MB
Example 4 Identifying Which CPUs Are Offline
Use the psrinfo command to display information about the CPUs:
$ psrinfo
0 faulted since 04/24/2015 10:41:32
1 on-line since 04/23/2015 14:52:03
The faulted state in this example indicates that the CPU has been taken offline by a Fault Manager response agent.
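On a system with many processors, the faulted entries can be filtered out of the psrinfo output programmatically. The following Python sketch is one way to do so. The function name faulted_cpus is hypothetical, and the sketch assumes that each line begins with the processor ID followed by the state, as in the output shown in this example.

```python
def faulted_cpus(psrinfo_text):
    """Return the IDs of processors whose state column reads 'faulted'."""
    ids = []
    for line in psrinfo_text.splitlines():
        fields = line.split()
        # Expect "<id> <state> since <date> <time>"; skip anything else.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == "faulted":
            ids.append(int(fields[0]))
    return ids


# Output copied from this example.
sample = """\
0 faulted since 04/24/2015 10:41:32
1 on-line since 04/23/2015 14:52:03
"""

print(faulted_cpus(sample))
```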
Example 5 Identifying Bugs that Might Be the Cause of the Problem
If a fault or defect might be caused by a known bug, the bug number is shown in the Description section of the fmadm output or in the DESC section of the FMA event notification or fmdump -m output. Even if these bugs are not the cause of the problem, reviewing these bugs might help you find the cause of the fault or defect.
The following partial fmadm list-fault output shows the Description section with bugs listed that might be the cause of the problem or might help you find the cause of the problem:
Description : The system has rebooted after a kernel panic. The following are
potential bugs.
stack[0] - bug-number1 bug-number2 bug-number3
The following fmdump -m output shows the same information:
DESC: The system has rebooted after a kernel panic. The following are potential bugs. stack[0] - bug-number1 bug-number2 bug-number3
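To collect the listed bug identifiers for follow-up, the text after each stack label can be split into tokens. The following Python sketch is illustrative only. The function name candidate_bugs is hypothetical, the bug-number placeholders are kept exactly as they appear above, and the sketch assumes that nothing other than identifiers follows a stack label.

```python
import re


def candidate_bugs(desc_text):
    """Map each stack[N] label to the identifiers listed after its dash."""
    # re.split with a capture group yields:
    # [preamble, label1, ids1, label2, ids2, ...]
    parts = re.split(r"(stack\[\d+\])\s*-\s*", desc_text)
    return {label: ids.split() for label, ids in zip(parts[1::2], parts[2::2])}


# DESC text copied from the fmdump -m output above.
sample = (
    "DESC: The system has rebooted after a kernel panic. The following are "
    "potential bugs. stack[0] - bug-number1 bug-number2 bug-number3"
)

print(candidate_bugs(sample))
```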