Use the fmadm list-fault command to display fault information and determine which FRUs are involved. The fmadm list-fault command displays active fault diagnoses. By contrast, the fmdump command displays the contents of the log files associated with the Fault Manager daemon, so it is more useful as a historical record of errors, observations, and diagnoses on the system.
The fmadm list-fault command displays status information for resources that the Fault Manager identifies as faulty. The fmadm list-fault command has many options for displaying different information or displaying information in different formats. See the fmadm(1M) man page for information about all the fmadm list-fault options.
Example 1 fmadm list-fault Output Showing a Faulty Disk
In the following example output, the section labeled FRU identifies the faulted component. The Location string shown in quotation marks, "/SUN-Storage-J4410.1051QCQ08A/HDD23", should match the chassis type and serial number of the chassis containing the faulty disk and the label of the disk bay in that chassis. For a location in the main system chassis, the location string would be something like "/SYS/HDD3". If no location is available, the Fault Management Resource Identifier (FMRI) of the FRU is shown. See Fault Management Glossary for definitions of chassis and FMRI.
The Status line in the FRU section of the output shows the state as faulty.
Above the FRU section, the lines labeled Affects identify components that are affected by the fault and their relative state. In this example, a single disk is affected. The disk is faulted but is still in service.
Perhaps the most useful piece of information in this output is the MSG-ID. Follow the instructions in the Action section at the end of the report to access more information about DISK-8000-0X. The Action section might include specific actions in addition to references to documents on the support site.
Every diagnosis can be mapped to a specific MSG-ID. Diagnoses may have one or more suspects. If only one suspect is identified, then the MSG-ID can be mapped to a single fault class or diagnosis class. If more than one suspect is identified, then the MSG-ID maps to more than one diagnosis class. See Fault Management Glossary for the definition of diagnosis class.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X   Major

Problem Status    : open
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D

System Component
    Manufacturer  : Oracle
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D
    Host_ID       : 008167b1

----------------------------------------
Suspect 1 of 1 :
   Problem class : fault.io.disk.predictive-failure
   Certainty   : 100%
   Affects     : dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SUN-Storage-J4410.1051QCQ08A/HDD23"
     Manufacturer     : STEC
     Name             : ZeusIOPs
     Part_Number      : STEC-ZeusIOPs
     Revision         : 9007
     Serial_Number    : STM00011EDCA
     Chassis
        Manufacturer  : SUN
        Name          : SUN-Storage J4410
        Part_Number   : 3753659
        Serial_Number : 1051QCQ08A

Description : SMART health-monitoring firmware reported that a disk failure is
              imminent.

Response    : A hot-spare disk may have been activated.

Impact      : It is likely that the continued operation of this disk will
              result in data loss.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/DISK-8000-0X for the latest service
              procedures and policies regarding this diagnosis.
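As a rough illustration of how the MSG-ID and the FRU location can be picked out of a report like the one above, the following Python sketch scans saved report text with regular expressions. The function name summarize, the regular expressions, and the abridged SAMPLE text are assumptions made for this illustration and are not part of fmadm. The article URL pattern is copied from the Action text in the sample output.

```python
import re

# Abridged lines copied from the example output above. A real script would
# read the text that `fmadm list-fault` printed instead of this constant.
SAMPLE = """\
Apr 08 08:36:50 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Major
Problem class : fault.io.disk.predictive-failure
Location : "/SUN-Storage-J4410.1051QCQ08A/HDD23"
"""

# Summary row: timestamp, EVENT-ID (a UUID), MSG-ID, severity.
# These patterns are guesses based on the sample rows in this section.
SUMMARY_RE = re.compile(
    r"^(?P<time>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<event_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msg_id>[A-Z0-9]+-\d{4}-[0-9A-Z]{2})\s+"
    r"(?P<severity>\S+)\s*$",
    re.MULTILINE,
)
# First quoted Location value in the report (the FRU location).
LOCATION_RE = re.compile(r'^\s*Location\s*:\s*"(?P<loc>[^"]+)"', re.MULTILINE)


def summarize(report):
    """Return event ID, MSG-ID, severity, FRU location, and article URL."""
    row = SUMMARY_RE.search(report)
    if row is None:
        raise ValueError("no summary row found")
    info = row.groupdict()
    loc = LOCATION_RE.search(report)
    info["location"] = loc.group("loc") if loc else None
    # URL pattern taken from the Action section of the sample output.
    info["article"] = "http://support.oracle.com/msg/" + info["msg_id"]
    return info


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

Only the first summary row and the first Location line are read, so a report that lists several faults would need to be split into per-fault sections first.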
In the following sample output, a single CPU strand is affected. That CPU strand is faulted and has been taken out of service by the Fault Manager.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 24 10:41:32 662ec53e-3aff-41d1-a836-ad7d1795705a SUN4V-8002-6E  Major

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : ORCL,SPARC-T4-1
    Part_Number   : 602-4918-02
    Serial_Number : 1315BDY5D8
    Host_ID       : 862e0f5e

----------------------------------------
Suspect 1 of 1 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 100%
   Affects     : cpu:///cpuid=0/serial=15a02807e0b026b
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8

Description : The number of correctable errors associated with this strand has
              exceeded acceptable levels.

Response    : The fault manager will attempt to remove the affected strand
              from service.

Impact      : System performance may be affected.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/SUN4V-8002-6E for the latest service
              procedures and policies regarding this diagnosis.
Example 2 fmadm list-fault Output Showing Multiple Faults
In the following output, all three suspect PCI devices are described as "faulted but still in service". The unknown values indicate that no identity information is available for these devices.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 23 02:48:15 a9445995-0eee-460b-82ba-d8ddb29cda71 PCIEX-8000-3S  Critical

Problem Status    : open
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D

System Component
    Manufacturer  : Oracle
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D
    Host_ID       : 008167b1

----------------------------------------
Suspect 1 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 50%
   Affects     : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
     Manufacturer     : unknown
     Name             : pciex8086,1522.108e.7b19.1
     Part_Number      : 7014747-Rev.01
     Revision         : G29837-009
     Serial_Number    : 159048B+1206A0369F048B54
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D

----------------------------------------
Suspect 2 of 3 :
   Problem class : fault.io.pciex.bus-linkerr
   Certainty   : 25%
   Affects     : dev:////pci@0,0/pci8086,3c04@2/pci1000,3050@0
   Status      : faulted but still in service

   FRU
     Status           : faulty
     Location         : "/SYS/MB/PCIE1"
     Manufacturer     : unknown
     Name             : pciex8086,1522.108e.7b19.1
     Part_Number      : 7014747-Rev.01
     Revision         : G29837-009
     Serial_Number    : 159048B+1206A0369F048B54
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D

----------------------------------------
Suspect 3 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 25%

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle
     Name             : unknown
     Part_Number      : 7016786
     Revision         : Rev-03
     Serial_Number    : 489089M+1208UU003X
     Chassis
        Manufacturer  : Oracle
        Name          : Sun Netra X4270 M3
        Part_Number   : NILE-P1LRQT-8
        Serial_Number : 1211FM200D
   Resource
     Location         : "/SYS/MB/PCIE1"
     Status           : faulted but still in service

Description : A problem has been detected on one of the specified devices or on
              one of the specified connecting buses.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              If a plug-in card is involved check for badly-seated cards or
              bent pins. Please refer to the associated reference document at
              http://support.oracle.com/msg/PCIEX-8000-3S for the latest service
              procedures and policies regarding this diagnosis.
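When a diagnosis lists several suspects, it can help to see how the certainty is distributed across FRU locations. The following Python sketch, which works on an abridged copy of the suspect list above, collects the problem class, certainty, and first Location value for each suspect and then sums the certainty per location. The function names parse_suspects and certainty_by_location are hypothetical, and the per-location total is only a reading aid assumed for this illustration; fmadm does not report it.

```python
import re

# Abridged from the three-suspect example above (most lines omitted).
SAMPLE = """\
Suspect 1 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 50%
     Location         : "/SYS/MB/PCIE1"
Suspect 2 of 3 :
   Problem class : fault.io.pciex.bus-linkerr
   Certainty   : 25%
     Location         : "/SYS/MB/PCIE1"
Suspect 3 of 3 :
   Problem class : fault.io.pciex.device-interr
   Certainty   : 25%
     Location         : "/SYS/MB"
   Resource
     Location         : "/SYS/MB/PCIE1"
"""


def parse_suspects(report):
    """Return one dict per 'Suspect N of M' section of the report text."""
    suspects = []
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if re.fullmatch(r"Suspect \d+ of \d+", key):
            suspects.append({})          # start a new suspect section
        elif not suspects or not sep:
            continue                     # header text or a line without a label
        elif key == "Problem class":
            suspects[-1]["class"] = value
        elif key == "Certainty":
            suspects[-1]["certainty"] = int(value.rstrip("%"))
        elif key == "Location" and "location" not in suspects[-1]:
            # Keep only the first Location (the FRU), not the Resource one.
            suspects[-1]["location"] = value.strip('"')
    return suspects


def certainty_by_location(suspects):
    """Sum the certainty percentages of the suspects at each FRU location."""
    totals = {}
    for suspect in suspects:
        loc = suspect.get("location", "unknown")
        totals[loc] = totals.get(loc, 0) + suspect.get("certainty", 0)
    return totals


if __name__ == "__main__":
    print(certainty_by_location(parse_suspects(SAMPLE)))
```

For the sample, the totals point mostly at "/SYS/MB/PCIE1", which is consistent with the Action text suggesting a check for badly seated cards.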
In the following example, two CPU strands are faulted and have been removed from service by the Fault Manager.
# fmadm list-fault
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 24 10:49:18 1479f457-d99a-4c55-9373-b33621d3aaee SUN4V-8002-6E  Major

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
    Manufacturer  : Oracle Corporation
    Name          : ORCL,SPARC-T4-1
    Part_Number   : 602-4918-02
    Serial_Number : 1315BDY5D8
    Host_ID       : 862e0f5e

----------------------------------------
Suspect 1 of 2 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 50%
   Affects     : cpu:///cpuid=0/serial=SERIAL1
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8

----------------------------------------
Suspect 2 of 2 :
   Problem class : fault.cpu.generic-sparc.strand
   Certainty   : 50%
   Affects     : cpu:///cpuid=1/serial=SERIAL2
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     Location         : "/SYS/MB"
     Manufacturer     : Oracle Corporation
     Name             : PCA,MB,SPARC_T4-1
     Part_Number      : 7047134
     Revision         : 02
     Serial_Number    : 465769T+1309BW0V8E
     Chassis
        Manufacturer  : Oracle Corporation
        Name          : ORCL,SPARC-T4-1
        Part_Number   : 31538783+1+1
        Serial_Number : 1315BDY5D8

Description : The number of correctable errors associated with this strand has
              exceeded acceptable levels.

Response    : The fault manager will attempt to remove the affected strand
              from service.

Impact      : System performance may be affected.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/SUN4V-8002-6E for the latest service
              procedures and policies regarding this diagnosis.
Example 3 fmdump Fault Reports
Some console messages and knowledge articles might instruct you to use the fmdump command to display fault information, as shown in the following example. The information about the affected components is in the Affects line. The FRU Location value presents the human-readable FRU string. The FRU line and the Problem in line show the FMRIs. Note that the output lines in this example are artificially divided to improve readability.
$ fmdump -vu 91cfc113-eacc-44d0-8236-9e2ed3926fd3
TIME                 UUID                                 SUNW-MSG-ID EVENT
Apr 08 08:36:50.1418 91cfc113-eacc-44d0-8236-9e2ed3926fd3 DISK-8000-0X Diagnosed
  100%  fault.io.disk.predictive-failure

        Problem in: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
                      :chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
                      :fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
                      :fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
                      0/bay=23/disk=0
           Affects: dev:///:devid=id1,sd@n5000a7203002c0f2//scsi_vhci/disk@g5000a7203002c0f2
               FRU: hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
                      :chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
                      :fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
                      :fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
                      0/bay=23/disk=0
      FRU Location: /SUN-Storage-J4410.1051QCQ08A/HDD23
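The hc-scheme FMRIs on the Problem in and FRU lines pack the chassis and FRU identity into key=value pairs. The following Python sketch splits such an FMRI into its identity properties and its path nodes. The function name parse_hc_fmri is hypothetical, and the parsing rules (remove the artificial line breaks, treat everything before the first "/" after "hc://" as colon-separated properties) are assumptions inferred from the sample in this section rather than from a formal FMRI grammar.

```python
# FRU value copied from the fmdump example above, including the artificial
# line breaks that were added for readability.
SAMPLE_FRU = """\
hc://:chassis-mfg=SUN:chassis-name=SUN-Storage-J4410
  :chassis-part=3753659:chassis-serial=1051QCQ08A:fru-mfg=STEC
  :fru-name=ZeusIOPs:fru-serial=STM00011EDCA:fru-part=STEC-ZeusIOPs
  :fru-revision=9007:devid=id1,sd@n5000a7203002c0f2/ses-enclosure=
  0/bay=23/disk=0
"""


def parse_hc_fmri(text):
    """Split an hc-scheme FMRI into (properties dict, list of path nodes)."""
    # Assumption: the FMRI itself contains no whitespace, so all whitespace
    # comes from the artificial line division and can be dropped.
    fmri = "".join(text.split())
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc-scheme FMRI")
    authority, _, path = fmri[len("hc://"):].partition("/")
    # Identity properties look like ":key=value:key=value".
    props = dict(item.split("=", 1) for item in authority.split(":") if item)
    # Path nodes look like "name=instance/name=instance".
    nodes = [tuple(node.split("=", 1)) for node in path.split("/") if node]
    return props, nodes


if __name__ == "__main__":
    props, nodes = parse_hc_fmri(SAMPLE_FRU)
    print(props["chassis-name"], props["chassis-serial"], nodes)
```

For this sample, joining the chassis-name and chassis-serial properties with a period gives SUN-Storage-J4410.1051QCQ08A, the same chassis string that appears in the FRU Location value.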
To see the severity, descriptive text, and action in the fmdump output, use the -m option. See the fmdump(1M) man page for more information.
The following fmdump output shows two events for a single faulted CPU strand: the Diagnosed event and the Isolated event that followed it.
$ fmdump -vu 662ec53e-3aff-41d1-a836-ad7d1795705a
TIME                 UUID                                 SUNW-MSG-ID EVENT
Apr 24 10:41:32.7511 662ec53e-3aff-41d1-a836-ad7d1795705a SUN4V-8002-6E Diagnosed
  100%  fault.cpu.generic-sparc.strand

        Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                      :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
                      /motherboard=0/chip=0/core=0/strand=0
           Affects: cpu:///cpuid=0/serial=15a02807e0b026b
               FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                      :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
                      :fru-serial=465769T+1309BW0V8E:fru-part=7047134
                      :fru-revision=02/chassis=0/motherboard=0
      FRU Location: /SYS/MB

Apr 24 10:41:32.7732 662ec53e-3aff-41d1-a836-ad7d1795705a FMD-8000-9L Isolated
  100%  fault.cpu.generic-sparc.strand

        Problem in: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                      :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8/chassis=0
                      /motherboard=0/chip=0/core=0/strand=0
           Affects: cpu:///cpuid=0/serial=15a02807e0b026b
               FRU: hc://:chassis-mfg=Oracle-Corporation:chassis-name=ORCL,SPARC-T4-1
                      :chassis-part=31538783+1+1:chassis-serial=1315BDY5D8
                      :fru-serial=465769T+1309BW0V8E:fru-part=7047134
                      :fru-revision=02/chassis=0/motherboard=0
      FRU Location: /SYS/MB
Example 4 Identifying Which CPUs Are Offline
Use the psrinfo command to display information about the CPUs:
$ psrinfo
0       faulted   since 04/24/2015 10:41:32
1       on-line   since 04/23/2015 14:52:03
The faulted state in this example indicates that the CPU has been taken offline by a Fault Manager response agent.
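On a system with many processors, the faulted entries can be filtered out of saved psrinfo output with a few lines of script. The following Python sketch assumes the three-column layout shown in the example above (processor ID, state, then "since" and a timestamp); the function name cpus_in_state is hypothetical.

```python
# The two output lines from the psrinfo example above.
SAMPLE = """\
0       faulted   since 04/24/2015 10:41:32
1       on-line   since 04/23/2015 14:52:03
"""


def cpus_in_state(psrinfo_text, state="faulted"):
    """Return the processor IDs whose state column equals `state`."""
    ids = []
    for line in psrinfo_text.splitlines():
        fields = line.split()
        # Expect at least "<id> <state>"; skip anything else.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == state:
            ids.append(int(fields[0]))
    return ids


if __name__ == "__main__":
    print("faulted:", cpus_in_state(SAMPLE))
    print("on-line:", cpus_in_state(SAMPLE, "on-line"))
```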