Managing Faults in Oracle® Solaris 11.2

Updated: July 2014
 
 

Repairing Faults or Defects

You can configure Oracle Auto Service Request (ASR) to automatically request Oracle service when specific hardware problems occur. See http://www.oracle.com/asr for more information about ASR.

When a component in your system has faulted, the Fault Manager can repair the component implicitly or you can repair the component explicitly.

Implicit repair

An implicit repair can occur when a faulty component is replaced or removed, provided that the component has serial number information that the Fault Manager daemon (fmd) can track. On many SPARC-based systems, serial number information is included in the Fault Management Resource Identifiers (FMRIs) so that fmd can determine when components have been removed from operation, either through replacement or through other means such as blacklisting. When fmd determines that a component has been removed from operation, the Fault Manager no longer displays that component in fmadm faulty output. The component remains in the Fault Manager internal resource cache until the fault event is 30 days old.

When fmd detects faulty CPU or memory resources, those resources are placed on a blacklist. A faulty resource that is on the blacklist cannot be reassigned until fmd marks the resource as being repaired.

Explicit repair

Sometimes no field-replaceable unit (FRU) serial number information is available even though the FMRI includes a chassis identifier. In this case, fmd cannot detect an FRU replacement, and you must perform an explicit repair by using the fmadm command with the replaced, repaired, or acquit subcommand as shown in the following sections. Perform explicit repairs only at the direction of a specific documented repair procedure.

These fmadm commands take the following operands:

  • The UUID, also shown as the EVENT-ID in Fault Manager output, identifies the fault event. The UUID can be used only with the fmadm acquit command. With fmadm acquit, you can specify that the entire event can be safely ignored, or you can specify that a particular resource is not a suspect in this event.

  • The FMRI and the label identify the suspect faulted resource. Examples of the FMRI and label of a resource are shown in Example 2–1. Typically, the label is easier to use than the FMRI.
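The operand rules above can be sketched as a small Python helper. This is only an illustration: the function name build_fmadm_args and the sample label and UUID in the usage note are hypothetical. The helper assembles an argument list and does not run fmadm.

```python
from typing import List, Optional

# Subcommands named in this section for explicit repair.
SUBCOMMANDS = ("replaced", "repaired", "acquit")


def build_fmadm_args(subcommand: str,
                     resource: Optional[str] = None,
                     uuid: Optional[str] = None) -> List[str]:
    """Return an argument list for an explicit-repair fmadm call.

    Hypothetical helper: 'resource' is an FMRI or a label, and 'uuid' is
    the EVENT-ID of the fault event. Nothing is executed here.
    """
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    if uuid is not None and subcommand != "acquit":
        # The UUID operand can be used only with fmadm acquit.
        raise ValueError("a UUID can be used only with 'acquit'")
    if resource is None and uuid is None:
        raise ValueError("specify an FMRI, a label, or (for acquit) a UUID")
    args = ["fmadm", subcommand]
    if resource is not None:
        args.append(resource)
    if uuid is not None:
        args.append(uuid)
    return args
```

For example, build_fmadm_args("acquit", uuid="00000000-0000-0000-0000-000000000000") describes acquitting an entire event, while passing both a label and a UUID describes acquitting one resource as a suspect in that event. Both values here are placeholders, not output from a real system.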

A case is considered repaired when the fault event UUID is acquitted or when all suspect resources have been repaired, replaced, removed, or acquitted. A case that is repaired moves into the FMD_CASE_REPAIRED state, and the Fault Manager generates a list.repaired event.
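The repair rule above can be expressed as a toy Python model. The Case class and its field names are hypothetical and do not reflect the fmd internal data structures. Treating a case with no suspects as not repaired is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict

# Suspect states that count as resolved under the rule above.
RESOLVED = {"repaired", "replaced", "removed", "acquitted"}


@dataclass
class Case:
    """Toy model of a fault case; not the fmd internal representation."""
    uuid: str
    # Maps a suspect's label or FMRI to its current state.
    suspects: Dict[str, str] = field(default_factory=dict)
    # True once the fault event UUID itself has been acquitted.
    uuid_acquitted: bool = False

    def is_repaired(self) -> bool:
        # Acquitting the event UUID repairs the whole case.
        if self.uuid_acquitted:
            return True
        # Otherwise every suspect must be resolved. An empty suspect
        # list is treated as not repaired (assumption for this sketch).
        return bool(self.suspects) and all(
            state in RESOLVED for state in self.suspects.values())
```

In this model, a case with two faulted suspects reports is_repaired() as False until both suspects reach one of the resolved states, or until uuid_acquitted is set.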