Managing Faults in Oracle® Solaris 11.2

Updated: July 2014
 
 

Displaying Information About Faults or Defects

Use the fmadm faulty command to display information about faults or defects and to determine which field-replaceable units (FRUs) are involved. The fmadm faulty command displays active problems. The fmdump command, by contrast, displays the contents of the log files associated with the Fault Manager daemon, so it is more useful as a historical record of problems on the system.


Tip  -  Base your administrative action on output from the fmadm faulty command. Log files output by the fmdump command can contain error statements that are not faults or defects.

The fmadm faulty command displays status information for resources that the Fault Manager identifies as faulty. The command accepts many options that change which information is displayed or how it is formatted. See the fmadm(1M) man page for information about all the fmadm faulty options.

Example 2-1  fmadm faulty Output Showing One Faulty CPU
1    # fmadm faulty
2    --------------- ------------------------------------  -------------- ---------
3    TIME            EVENT-ID                              MSG-ID         SEVERITY
4    --------------- ------------------------------------  -------------- ---------
5    Aug 24 17:56:03 7b83c87c-78f6-6a8e-fa2b-d0cf16834049  SUN4V-8001-8H  Minor
6    
7    Host        : bur419-61
8    Platform    : SUNW,T5440        Chassis_id  : BEL07524BN
9    Product_sn  : BEL07524BN
10
11   Fault class : fault.cpu.ultraSPARC-T2plus.ireg
12   Affects     : cpu:///cpuid=0/serial=1F95806CD1421929
13                     faulted and taken out of service
14   FRU         : "MB/CPU0" (hc://:product-id=SUNW,T5440:server-id=bur419-61:\
15                 serial=3529:part=541255304/motherboard=0/cpuboard=0)
16                     faulty
17   Serial ID.  : 3529
18                 1F95806CD1421929
19   
20   Description : The number of integer register errors associated with this thread
21                 has exceeded acceptable levels.
22   
23   Response    : The fault manager will attempt to remove the affected thread from
24                 service.
25   
26   Impact      : System performance may be affected.
27   
28   Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
29                 Please refer to the associated reference document at
30                 http://support.oracle.com/msg/SUN4V-8001-8H for the latest service
31                 procedures and policies regarding this diagnosis.

Line 14 identifies the impacted FRU. The string shown in quotation marks, “MB/CPU0”, should match the label on the physical hardware. The string shown in parentheses, which continues onto line 15, is the Fault Management Resource Identifier (FMRI) for the FRU. The FMRI includes descriptive properties about the system that contains the fault, such as its host name and chassis serial number. On some platforms, the part number and serial number of the FRU are also included in the FMRI of the FRU.

The Affects lines (lines 12 and 13) indicate the components that are affected by the fault and their relative state. In this example, a single CPU strand is affected. That CPU strand is faulted and has been taken out of service by the Fault Manager.

Following the FRU description in the fmadm faulty command output, line 16 shows the state as faulty. The Action section might include specific actions in addition to references to documents on the support site.
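The relationship between the FRU label and its FMRI can be illustrated with a short script. The following Python sketch is not part of the Fault Manager tooling. It assumes the FRU lines look exactly like those in Example 2-1 (with the listing line numbers removed), and the function name parse_fru and the term "authority" for the colon-separated properties before the first path component are choices made here for illustration. Output on other platforms or releases might be formatted differently.

```python
import re

# FRU lines copied from Example 2-1, without the listing line numbers.
SAMPLE = r'''FRU         : "MB/CPU0" (hc://:product-id=SUNW,T5440:server-id=bur419-61:\
              serial=3529:part=541255304/motherboard=0/cpuboard=0)
                  faulty
'''

def parse_fru(text):
    """Return (label, fmri, props) for the first FRU entry, or None."""
    # Rejoin an FMRI that was wrapped with a trailing backslash.
    joined = re.sub(r"\\\n\s*", "", text)
    match = re.search(r'FRU\s*:\s*"([^"]+)"\s*\(([^)]+)\)', joined)
    if match is None:
        return None
    label, fmri = match.group(1), match.group(2)
    # Properties sit between "hc://" and the first "/" of the path.
    authority = fmri[len("hc://"):].split("/", 1)[0]
    props = dict(item.split("=", 1)
                 for item in authority.split(":") if "=" in item)
    return label, fmri, props

label, fmri, props = parse_fru(SAMPLE)
print(label)                                  # label printed on the hardware
print(props["server-id"], props["serial"], props["part"])
```

The label is the value to compare against the physical hardware, while the extracted properties correspond to the host name, serial number, and part number described above.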

Example 2-2  fmadm faulty Output Showing Multiple Faults
1    # fmadm faulty
2    --------------- ------------------------------------  -------------- -------
3    TIME            EVENT-ID                              MSG-ID         SEVERITY
4    --------------- ------------------------------------  -------------- -------
5    Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c  PCIEX-8000-5Y  Major
6    
7    Fault class  : fault.io.pci.device-invreq
8    Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
9                   dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
10                    ok and in service
11                  dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
12                  dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
13                    faulty and taken out of service
14   FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
15                    repair attempted
16                  "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
17                    acquitted
18                  "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
19                    not present
20                  "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
21                    faulty
22   
23   Description  : The transmitting device sent an invalid request.
24   
25   Response     : One or more device instances may be disabled
26   
27   Impact       : Possible loss of services provided by the device instances
28                  associated with this fault
29   
30   Action       : Use 'fmadm faulty' to provide a more detailed view of this event.
31                  Please refer to the associated reference document at
32                  http://support.oracle.com/msg/PCIEX-8000-5Y for the latest service
33                  procedures and policies regarding this diagnosis.

In this output, device 1 in slot 3 is described as “ok and in service” on line 10, and line 17 shows its state as “acquitted.” Device 3 in slot 5 is described as “faulty and taken out of service,” and its state is “faulty.” States shown for two other devices are “repair attempted” and “not present.”
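In the Affects and FRU sections, a state line applies to the resource lines listed directly above it. The following Python sketch shows one way to pair each resource with its state. It assumes the layout shown in Example 2-2 (listing line numbers removed). The function name parse_states is hypothetical, and the rule used to tell resource lines from state lines (a quoted label or a "://" scheme separator) is an assumption based only on this sample.

```python
import re

# Affects and FRU sections copied from Example 2-2, without line numbers.
SAMPLE = '''Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
               dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
                 ok and in service
               dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
               dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
                 faulty and taken out of service
FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                 repair attempted
               "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                 acquitted
               "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                 not present
               "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                 faulty
'''

def parse_states(text):
    """Map each affected resource and FRU label to its state string."""
    states = {"Affects": {}, "FRU": {}}
    field, pending = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"(Affects|FRU)\s*:\s*(.*)", line)
        if header:
            field, line, pending = header.group(1), header.group(2), []
        if field is None:
            continue
        if line.startswith('"'):
            pending.append(line.split('"')[1])   # FRU label in quotes
        elif "://" in line:
            pending.append(line)                 # affected resource FMRI
        else:
            # A state line closes every resource collected since the last one.
            for resource in pending:
                states[field][resource] = line
            pending = []
    return states

states = parse_states(SAMPLE)
for label, state in states["FRU"].items():
    print(label, "->", state)
```

With the sample input, the result pairs "SLOT 3" with acquitted and "SLOT 5" with faulty, which matches the reading of the output given above.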

Example 2-3  Showing Faults With the fmdump Command

Some console messages and knowledge articles might instruct you to use the fmdump -v -u UUID command to display fault information, as shown in the following example:

1    # fmdump -v -u 7b83c87c-78f6-6a8e-fa2b-d0cf16834049
2    TIME                 UUID                                 SUNW-MSG-ID EVENT
3    Aug 24 17:56:03.4596 7b83c87c-78f6-6a8e-fa2b-d0cf16834049 SUN4V-8001-8H Diagnosed
4      100%  fault.cpu.ultraSPARC-T2plus.ireg
5
6            Problem in: -
7               Affects: cpu:///cpuid=0/serial=1F95806CD1421929
8                   FRU: hc://:product-id=SUNW,T5440:server-id=bur419-61:\
9                   serial=9999:part=541255304/motherboard=0/cpuboard=0
10              Location: MB/CPU0

The information about the affected FRUs is on lines 8 through 10. Lines 8 and 9 show the FMRI of the FRU, which wraps across two lines. The Location string on line 10 presents the human-readable FRU string. To see the severity, descriptive text, and action in the fmdump output, use the -m option. See the fmdump(1M) man page for more information.
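The labeled fields in this output can be collected with a few lines of script. The following Python sketch assumes the fmdump -v layout shown in Example 2-3 (listing line numbers removed) and recognizes only the four field names that appear there. The function name parse_fmdump_record is hypothetical, and other releases might print additional or differently named fields.

```python
import re

# Output copied from Example 2-3, without the listing line numbers.
SAMPLE = r'''TIME                 UUID                                 SUNW-MSG-ID EVENT
Aug 24 17:56:03.4596 7b83c87c-78f6-6a8e-fa2b-d0cf16834049 SUN4V-8001-8H Diagnosed
  100%  fault.cpu.ultraSPARC-T2plus.ireg

        Problem in: -
           Affects: cpu:///cpuid=0/serial=1F95806CD1421929
               FRU: hc://:product-id=SUNW,T5440:server-id=bur419-61:\
               serial=9999:part=541255304/motherboard=0/cpuboard=0
          Location: MB/CPU0
'''

def parse_fmdump_record(text):
    """Return the labeled fields of one fmdump -v record as a dict."""
    # Rejoin the FMRI that was wrapped with a trailing backslash.
    joined = re.sub(r"\\\n\s*", "", text)
    fields = {}
    for line in joined.splitlines():
        match = re.match(r"\s*(Problem in|Affects|FRU|Location):\s*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

record = parse_fmdump_record(SAMPLE)
print(record["Location"])
print(record["FRU"])
```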

Example 2-4  Identifying Which CPUs Are Offline

Use the psrinfo command to display information about the CPUs:

$ psrinfo 
0       faulted   since 05/13/2013 12:55:26 
1       on-line   since 05/12/2013 11:47:26 

The faulted state in this example indicates that the CPU has been taken offline by a Fault Manager response agent.
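On a system with many CPUs, filtering the psrinfo output for the faulted state can be scripted. The following Python sketch assumes the default psrinfo output format shown in Example 2-4, where the first column is the processor ID and the second is the state. The function name faulted_cpus is hypothetical.

```python
# Output copied from Example 2-4.
SAMPLE = '''0       faulted   since 05/13/2013 12:55:26
1       on-line   since 05/12/2013 11:47:26
'''

def faulted_cpus(text):
    """Return the IDs of processors whose state column reads 'faulted'."""
    ids = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "faulted":
            ids.append(int(parts[0]))
    return ids

print(faulted_cpus(SAMPLE))  # prints [0]
```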