Managing Faults, Defects, and Alerts in Oracle® Solaris 11.4

Updated: November 2020
 
 

Displaying Information About Alerts

An alert is information of interest that is neither a fault nor a defect. An alert might report a problem or might be simply informational. A problem that is reported by an alert is a misconfiguration or other problem that the administrator can resolve without assistance from a response agent. An example of this type of problem is a DIMM plugged into the wrong slot. An example of an informational message reported by an alert is a message that a shadow migration has completed. The following list provides examples of alert messages:

  • Threshold alerts – Temperature is high, storage is at capacity, a zpool is at 80% or 90% capacity, a quota is exceeded, the path count to a chassis or disk has changed. These kinds of alerts can predict a performance impact.

  • Configuration checks – An FRU has been added or removed, SAS cabling is incorrect, a DIMM is plugged into the wrong slot, a datalink changed, a link went up or down, ILOM is misconfigured, the maximum transmission unit (MTU) is misconfigured.

  • Interesting events – A reboot occurred, file system events occurred, firmware has been upgraded, save core failed, ZFS deduplication failed, shadow migration completed.

    If an application that is signed by Oracle terminates abnormally, a diagnostic core is saved and an alert is generated. See COREDIAG Alerts.

Alerts can be in one of the following states:

  • active – The alert has not been cleared.

  • cleared – The alert has been cleared. The cleared state for alerts can be compared to the resolved state for faults and defects. See the following description of persistent and transient alerts for more information about clearing an alert.

Alerts can be persistent or transient.

  • A persistent alert is active until it is manually cleared as shown in fmadm clear Command.

  • A transient alert clears after a specified timeout period or is cleared by a service such as a network monitor.
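
The states and the persistent/transient distinction described above can be modeled as a small sketch. The `Alert` class, its fields, and the message IDs in the usage lines are hypothetical names chosen for illustration; they are not part of the Fault Manager or any Oracle Solaris API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlertState(Enum):
    ACTIVE = "active"
    CLEARED = "cleared"


@dataclass
class Alert:
    """Hypothetical in-memory model of an alert; not an fmd data structure."""
    msg_id: str
    raised_at: float                 # seconds on any consistent clock
    timeout: Optional[float] = None  # None models a persistent alert
    cleared: bool = False

    @property
    def persistent(self) -> bool:
        return self.timeout is None

    def clear(self) -> None:
        """Model an explicit clear (by an administrator or a service)."""
        self.cleared = True

    def state(self, now: float) -> AlertState:
        if self.cleared:
            return AlertState.CLEARED
        # A transient alert also clears once its timeout period has elapsed.
        if self.timeout is not None and now - self.raised_at >= self.timeout:
            return AlertState.CLEARED
        return AlertState.ACTIVE


# Usage with made-up message IDs.
persistent_alert = Alert("EXAMPLE-PERSISTENT", raised_at=0.0)
transient_alert = Alert("EXAMPLE-TRANSIENT", raised_at=0.0, timeout=60.0)
print(persistent_alert.state(now=3600.0), transient_alert.state(now=3600.0))
```

In this model a persistent alert stays active no matter how much time passes, while a transient alert becomes cleared either through `clear()` or when its timeout elapses.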


Tip - Base your administrative action on output from the fmadm list-alert command. Log files output by the fmdump command contain a historical record of events and do not necessarily present active or open diagnoses. Log files output by fmdump -i are a historical record of telemetry and might not have been diagnosed into alerts.

Example 8  fmadm list-alert Output

Use the fmadm list-alert command to list all alerts that have not been cleared. The following alert shows that a disk has been removed from the system. Problem Status can be open, isolated, repaired, or resolved; here it has the value open, which is an active state. The Problem class indicates that an FRU has been removed. The Impact indicates that the effect on the system depends on the type of FRU, so the severity depends on the importance of this device in your environment. Perhaps the most useful piece of information in this output is the MSG-ID, which is FMD-8000-CV in this example. The Action at the end of the alert gives the URL where you can find more information about that message ID.

# fmadm list-alert
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 23 02:15:12 a7921317-8ba2-4ab1-b1c3-b0fb8822c000  FMD-8000-CV    Minor

Problem Status    : open
Diag Engine       : software-diagnosis / 0.1
System
    Manufacturer  : Oracle Corporation
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D

System Component
    Manufacturer  : Oracle
    Name          : Sun Netra X4270 M3
    Part_Number   : NILE-P1LRQT-8
    Serial_Number : 1211FM200D
    Host_ID       : 008167b1

----------------------------------------
Suspect 1 of 1 :
   Problem class : alert.oracle.solaris.fmd.fru-monitor.fru-remove
   Certainty   : 100%

   FRU
     Status           : faulty/not present
     Location         : "/SUN-Storage-J4410.1051QCQ08A/HDD13"
     Manufacturer     : SEAGATE
     Name             : ST330057SSUN300G
     Part_Number      : SEAGATE-ST330057SSUN300G
     Revision         : 0B25
     Serial_Number    : 001117G1LC1S--------6SJ1LC1S
     Chassis
        Manufacturer  : SUN
        Name          : SUN-Storage-J4410
        Part_Number   : 3753659
        Serial_Number : 1051QCQ08A
   Resource
     Status           : faulty/not present

Description : FRU '/SUN-Storage-J4410.1051QCQ08A/HDD13' has been removed from
              the system.

Response    : FMD topology will be updated.

Impact      : System impact depends on the type of FRU.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/FMD-8000-CV for the latest service
              procedures and policies regarding this diagnosis.

COREDIAG Alerts

If an application that is signed by Oracle terminates abnormally, a diagnostic core is saved and an alert is generated. See Configuring Reporting of Diagnostic Core Dumps for options to change this default reporting behavior.

A diagnostic core is smaller than a global core because only the relevant information about the particular application is saved, such as the stack and environment variables. A diagnostic core has two parts: a core file (core.diag) and a core summary file (core.json). These two files are placed in /var/share/diag/uuid, where uuid is a universally unique identifier (UUID) generated for that core dump. The /var/share/diag directory is linked to from /var/diag.

The core files are purged periodically by coremond so that only the summary files remain. You can use options of the coreadm command or properties of the coreadm:default service to change the retention policy for these files, specify a different location for them, and adjust other settings.

The /var/share/diag/path-to-binary directory contains links to /var/share/diag/uuid directories for that binary, which makes it easier to associate core files with applications. For example, if /usr/bin/vim terminated abnormally three times, the directory /var/share/diag/usr/bin/vim would contain links to /var/share/diag/uuid-1, /var/share/diag/uuid-2, and /var/share/diag/uuid-3.
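
The directory layout described above can be walked with a short script. This sketch assumes that the entries under /var/share/diag/path-to-binary are symbolic links and that each target directory might contain a core.json file. The function names are hypothetical, and the contents of core.json are treated as opaque JSON because its schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, List


def diag_dirs_for_binary(binary: str, root: str = "/var/share/diag") -> List[Path]:
    """Return the uuid directories linked from root/path-to-binary."""
    # "/usr/bin/vim" maps to root/usr/bin/vim.
    link_dir = Path(root) / binary.lstrip("/")
    if not link_dir.is_dir():
        return []
    # Assumption: each entry is a symbolic link to a root/uuid directory.
    return sorted(p.resolve() for p in link_dir.iterdir() if p.is_symlink())


def load_summaries(binary: str, root: str = "/var/share/diag") -> List[Any]:
    """Load every core.json summary recorded for the given binary."""
    summaries = []
    for diag_dir in diag_dirs_for_binary(binary, root):
        summary = diag_dir / "core.json"
        if summary.is_file():
            summaries.append(json.loads(summary.read_text()))
    return summaries


if __name__ == "__main__":
    for directory in diag_dirs_for_binary("/usr/bin/vim"):
        print(directory)
```

Resolving the links rather than relying on their names keeps the sketch independent of how the links themselves are named.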

The following example is a core diagnostic alert for the /usr/lib/picl/picld program on a system running in VirtualBox:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 04 21:06:16 1c9c8afa-036d-4eb3-a97f-a17298b20fa9  COREDIAG-8000-1V Major

Problem Status            : open
Diag Engine               : software-diagnosis / 0.2
System
    Manufacturer          : unknown
    Name                  : unknown
    Part_Number           : unknown
    Serial_Number         : unknown

System Component
    Manufacturer          : innotek GmbH
    Name                  : VirtualBox
    Part_Number           : 
    Serial_Number         : 0
    Firmware_Manufacturer : innotek GmbH
    Firmware_Version      : (BIOS)VirtualBox
    Firmware_Release      : (BIOS)12.01.2006
    Host_ID               : 008953e5

----------------------------------------
Suspect 1 of 1 :
   Problem class : alert.oracle.solaris.utility.corediag.dump_available
   Certainty   : 100%

   Resource
     FMRI             : "sw:///:path=/usr/lib/picl/picld#:token=0fed5e879996dfc053f62f6736a01cb432f0b7d92f653beef1b587a5e0019483"
     Status           : Active

Description : A diagnostic core file was dumped in
              /var/diag/1de0f8bc-d4f6-416e-843c-efba9f9edb65 for RESOURCE
              /usr/lib/picl/picld whose ASRU is svc:/system/picl:default. The
              ASRU is the Service FMRI for the resource and will be NULL if the
              resource is not part of a service. The following are potential
              bugs.
              stack[1] - 15760557 22191243 22551744 

Response    : The diagnostic core file will be removed and a json format core
              data summary file will be generated in
              /var/diag/1de0f8bc-d4f6-416e-843c-efba9f9edb65.

Impact      : The program may not be working properly.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/COREDIAG-8000-1V for the latest
              service procedures and policies regarding this diagnosis.
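
The Description of a core diagnostic alert names both the directory that holds the diagnostic core and the program that failed. The following minimal sketch extracts those two values, assuming the Description wording shown in the example above. The function name `core_location` and the regular expression are illustrative only, and the pattern would need adjusting if the message wording differs.

```python
import re
from typing import Optional, Tuple

# Matches the wording used in the example Description above.
_CORE_DIR = re.compile(
    r"A diagnostic core file was dumped in\s+(?P<dir>\S+)"
    r"\s+for RESOURCE\s+(?P<resource>\S+)"
)


def core_location(alert_text: str) -> Optional[Tuple[str, str]]:
    """Return (core directory, resource path), or None if not found."""
    # The Description wraps across lines, so collapse all whitespace first.
    flat = " ".join(alert_text.split())
    match = _CORE_DIR.search(flat)
    if match is None:
        return None
    return match.group("dir"), match.group("resource")


# Sample text copied from the example alert above.
SAMPLE = """\
Description : A diagnostic core file was dumped in
              /var/diag/1de0f8bc-d4f6-416e-843c-efba9f9edb65 for RESOURCE
              /usr/lib/picl/picld whose ASRU is svc:/system/picl:default. The
              ASRU is the Service FMRI for the resource and will be NULL if the
              resource is not part of a service.
"""

print(core_location(SAMPLE))
```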