Managing Faults, Defects, and Alerts in Oracle® Solaris 11.3

Updated: March 2018
 
 

Repairing Faults or Defects

You can configure Oracle Auto Service Request (ASR) to automatically request Oracle service when specific hardware problems occur. See the Oracle Auto Service Request (ASR) support document for more information.

When a component in your system has faulted, the Fault Manager can repair the component implicitly or you can repair the component explicitly.

Implicit repair

An implicit repair can occur when you replace a faulty component that has serial number information that the Fault Manager daemon (fmd) can track. On many systems, serial number information is included in the FMRIs so that fmd can determine when components have been replaced. When fmd determines that a component has been replaced and the replacement has been successfully brought into service, the Fault Manager no longer displays that component in fmadm list output. The component remains in the Fault Manager internal resource cache until the fault event is 30 days old.

When fmd faults a piece of hardware, that hardware might be taken out of service so that it does not adversely affect the system. Hardware removal from service can occur whether Oracle Solaris or ILOM diagnosed the problem. Hardware removal from service is usually reported in the Response section of the diagnosis message.

Explicit repair

Sometimes no FRU serial number information is available even though the FMRI includes a chassis identifier. In this case, fmd cannot detect an FRU replacement, and you must perform an explicit repair by using the fmadm command with the replaced, repaired, or acquit subcommand as shown in the following sections. You should perform explicit repairs only at the direction of a specific documented repair procedure.

These fmadm commands take the following operands:

  • The UUID, also shown as the EVENT-ID in Fault Manager output, identifies the fault event. The UUID can be used only with the fmadm acquit command. Specify the UUID alone to indicate that the entire event can be safely ignored, or specify the UUID together with an FMRI or label to indicate that a particular resource is not a suspect in this event.

  • The FMRI and the label identify the suspect faulted resource. Examples of the FMRI and label of a resource are shown in Example 1, fmadm list-fault Output Showing a Faulty Disk. Typically, the label is easier to use than the FMRI.

A case is considered repaired when the fault event UUID is acquitted or when all suspect resources have been repaired, replaced, or acquitted. A case that is repaired moves into the repaired state, and the Fault Manager generates a list.repaired event.
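The rule above can be sketched as a small predicate. This is an illustrative model only, not an fmd interface: the function name, the state strings, and the resource labels in the usage lines are hypothetical, and the handling of a case with no suspects is an assumption that the text does not address.

```python
# Hypothetical model of the "case is repaired" rule; not an fmd API.
RESOLVED_STATES = {"repaired", "replaced", "acquitted"}


def case_is_repaired(uuid_acquitted, suspect_states):
    """Return True if the case would be considered repaired.

    uuid_acquitted: True if the whole fault event UUID was acquitted.
    suspect_states: dict mapping each suspect's label or FMRI to its state.
    """
    if uuid_acquitted:
        # Acquitting the UUID resolves the case regardless of suspect states.
        return True
    # Assumption: a case with no listed suspects is not treated as repaired.
    return bool(suspect_states) and all(
        state in RESOLVED_STATES for state in suspect_states.values()
    )


# Made-up labels, for illustration only.
print(case_is_repaired(False, {"disk-a": "replaced", "disk-b": "faulted"}))
print(case_is_repaired(False, {"disk-a": "replaced", "disk-b": "acquitted"}))
```

In this sketch, one suspect that is still faulted keeps the case open unless the event UUID itself is acquitted.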

fmadm replaced Command

Use the fmadm replaced command to indicate that the suspect FRU has been replaced. If multiple faults are currently reported against one FRU, the FRU shows as replaced in all cases.

fmadm replaced FMRI | label

When an FRU is replaced, the serial number of the FRU changes. The behavior of the Fault Manager depends on what fmd can detect about the serial number. If fmd automatically detects that the serial number has changed, the Fault Manager behaves as if you had entered the fmadm replaced command. If fmd cannot detect whether the serial number has changed, you must enter the fmadm replaced command after you replace the FRU. If fmd detects that the serial number has not changed, the fmadm replaced command exits with an error.

If you remove the FRU but do not replace the FRU, the Fault Manager displays the suspect as not present.
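The three serial number cases described in this section can be summarized as a lookup. This is a sketch for illustration, not Fault Manager code: the function name, the input strings, and the wording of the returned descriptions are all hypothetical.

```python
# Hypothetical summary of the serial number cases; not Fault Manager code.
OUTCOMES = {
    # fmd sees a new serial number: no manual command is needed.
    "changed": "handled automatically, as if fmadm replaced had been entered",
    # fmd cannot tell: the administrator must report the replacement.
    "unknown": "enter fmadm replaced after replacing the FRU",
    # fmd sees the same serial number: the command is rejected.
    "unchanged": "fmadm replaced exits with an error",
}


def replaced_outcome(serial_status):
    """Describe what happens for a given serial number detection status."""
    if serial_status not in OUTCOMES:
        raise ValueError("unknown serial status: %r" % (serial_status,))
    return OUTCOMES[serial_status]


for status in ("changed", "unknown", "unchanged"):
    print(status, "->", replaced_outcome(status))
```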

fmadm repaired Command

Use the fmadm repaired command when you have performed a physical repair other than replacement of the FRU to resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin. If multiple faults are currently reported against one FRU, the FRU shows as repaired in all cases.

fmadm repaired FMRI | label

fmadm acquit Command

Use the fmadm acquit command if you determine that the indicated resource is not the cause of the fault. Usually the Fault Manager automatically acquits some suspects in a multi-element suspect list. Acquittal can occur implicitly as the Fault Manager refines the diagnosis, for example if additional error events occur. Sometimes Support Services gives you instructions to perform a manual acquittal.

Replacement takes precedence over repair, and both replacement and repair take precedence over acquittal. Thus, you can acquit a component and then subsequently repair the component, but you cannot acquit a component that has already been repaired.
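The precedence ordering can be modeled with numeric ranks. This is a minimal sketch under stated assumptions: the names below are hypothetical and are not part of fmadm, and allowing the same action to be recorded twice is an assumption that the text does not address.

```python
# Hypothetical precedence model; the rank values are illustrative only.
RANK = {"acquitted": 1, "repaired": 2, "replaced": 3}


def takes_precedence(action_a, action_b):
    """Return True if action_a takes precedence over action_b."""
    return RANK[action_a] > RANK[action_b]


def can_follow(existing, new):
    """Return True if `new` may be recorded after `existing`.

    A new action is rejected when the existing one takes precedence over it.
    Assumption: repeating the same action is allowed.
    """
    return not takes_precedence(existing, new)


print(can_follow("acquitted", "repaired"))  # acquit, then repair
print(can_follow("repaired", "acquitted"))  # repair, then acquit
```

The two usage lines correspond to the two cases named in the text: repairing a component that was acquitted, and attempting to acquit a component that was already repaired.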

If you specify the UUID without an FMRI or label, the entire event is marked as safe to ignore. A case is considered repaired when the fault event UUID is acquitted.

fmadm acquit UUID

Acquit by FMRI or label with no UUID only if you determine that the resource is not a factor in any current cases in which that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all cases.

fmadm acquit FMRI
fmadm acquit label

To acquit a resource in one case and keep that resource as a suspect in other cases, specify the fault event UUID together with either the resource FMRI or the resource label, as shown in the following examples:

fmadm acquit FMRI UUID
fmadm acquit label UUID
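As an illustration of how the acquit forms combine, the following sketch assembles the argument list for each form without running anything. The helper name and the placeholder values are hypothetical. Only the operand order comes from the synopses in this section.

```python
# Hypothetical helper that builds an fmadm acquit argument list.
# It only constructs the list; it does not execute fmadm.
def fmadm_acquit_args(resource=None, uuid=None):
    """Build the argument list for one of the acquit forms.

    resource: FMRI or label of the suspect resource, or None.
    uuid: fault event UUID (EVENT-ID), or None.
    """
    if resource is None and uuid is None:
        raise ValueError("specify a UUID, an FMRI or label, or both")
    args = ["fmadm", "acquit"]
    if resource is not None:
        args.append(resource)  # the FMRI or label comes first
    if uuid is not None:
        args.append(uuid)  # the UUID follows, limiting the scope to one case
    return args


# Placeholder operands; substitute real values from fmadm list-fault output.
print(fmadm_acquit_args(uuid="EXAMPLE-UUID"))
print(fmadm_acquit_args(resource="example-label"))
print(fmadm_acquit_args(resource="example-label", uuid="EXAMPLE-UUID"))
```

The first form acquits the entire event, the second acquits the resource in every current case, and the third acquits the resource in the named case only.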