Oracle® x86 Servers Diagnostics and Troubleshooting Guide

Updated: January 2020
 
 

Repairing Faults or Defects

After Oracle ILOM Fault Management identifies a faulted component in your system, you must repair it. A repair can happen in one of two ways: implicitly or explicitly.

  • An implicit repair occurs when the faulty component is replaced or removed, provided the component has serial number information that the Fault Manager daemon can track. The daemon uses this serial number information to determine when a component has been removed from operation, either through replacement or by other means (for example, blacklisting). When it detects such a change, the Fault Manager daemon no longer displays the affected resource in fmadm faulty output.

  • An explicit repair is required if no FRU serial number is available. For example, CPUs have no serial numbers. In these cases, the Fault Manager daemon cannot detect a FRU replacement.

    Use the fmadm command to explicitly mark a fault as repaired. The options include:

    • fmadm replaced label

    • fmadm repaired label

    • fmadm acquit label

    • fmadm acquit uuid [label]

    Although these four commands can take UUIDs or labels as arguments, it is better to use the label. For example, the label /SYS/MB/P0 represents the CPU labeled "P0" on the motherboard.

    If a FRU has multiple faults against it and you want to replace the FRU only one time, use the fmadm replaced command against the FRU.
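The command forms listed above can be assembled as in the following Python sketch. It only builds the command text and does not contact Oracle ILOM. The helper name build_fmadm_command and its parameters are hypothetical; the subcommand forms and the /SYS/MB/P0 label are the ones given in this section.

```python
from typing import Optional


def build_fmadm_command(action: str, label: Optional[str] = None,
                        uuid: Optional[str] = None) -> str:
    """Assemble one of the fmadm forms listed above (hypothetical helper)."""
    if action in ("replaced", "repaired"):
        if label is None:
            raise ValueError(f"fmadm {action} needs a label")
        return f"fmadm {action} {label}"
    if action == "acquit":
        if uuid is not None:
            # Form: fmadm acquit uuid [label]; the label is optional here.
            return f"fmadm acquit {uuid}" + (f" {label}" if label else "")
        if label is None:
            raise ValueError("fmadm acquit needs a label or a UUID")
        return f"fmadm acquit {label}"
    raise ValueError(f"unsupported action: {action}")


# The CPU labeled "P0" on the motherboard has been swapped out.
print(build_fmadm_command("replaced", label="/SYS/MB/P0"))
# prints: fmadm replaced /SYS/MB/P0
```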

fmadm replaced Command

You can use the Oracle ILOM fmadm replaced command to indicate that the suspect FRU has been replaced or removed.

If the system automatically discovers that a FRU has been replaced (the serial number has changed), then this discovery is treated in the same way as if fmadm replaced had been typed on the command line. The fmadm replaced command is not allowed if fmadm can automatically confirm that the FRU has not been replaced (the serial number has not changed).

If the system automatically discovers that a FRU has been removed but not replaced, the suspect is displayed as not present. The suspect is not considered permanently removed until the fault event is 30 days old, at which point it is purged.
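The serial number rules in this section can be summarized in a short Python sketch. It is an illustrative model only, not Oracle ILOM code: the function name classify_suspect, its parameters, and the returned strings are hypothetical, and treating "30 days old" as an age of 30 days or more is an assumption. The 30-day threshold itself comes from the text.

```python
from typing import Optional

# From this section: a removed suspect is purged once the fault event is 30 days old.
PURGE_AGE_DAYS = 30


def classify_suspect(faulted_serial: str, current_serial: Optional[str],
                     fault_age_days: int) -> str:
    """Describe how a suspect FRU is treated (hypothetical result strings).

    current_serial is None when the FRU has been removed and not replaced.
    """
    if current_serial is None:
        # Removed but not replaced: shown as not present until the purge age.
        if fault_age_days >= PURGE_AGE_DAYS:
            return "purged"
        return "not present"
    if current_serial != faulted_serial:
        # Serial number changed: handled as if fmadm replaced had been typed.
        return "replaced"
    # Serial number unchanged: fmadm replaced is not allowed.
    return "fmadm replaced not allowed"
```

For example, classify_suspect("SN-A", "SN-B", 2) returns "replaced", while classify_suspect("SN-A", None, 2) returns "not present". The serial numbers SN-A and SN-B are placeholders.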

fmadm repaired Command

You can use the Oracle ILOM fmadm repaired command when some physical repair has been carried out to resolve the problem, other than replacing a FRU. Examples of such repairs include reseating a component or straightening a bent pin.

fmadm acquit Command

Use the Oracle ILOM fmadm acquit command when you determine that the resource was not the cause of the fault. Acquittal can also happen implicitly when additional error events occur and the diagnosis is refined.

Replacement takes precedence over repair, and both replacement and repair take precedence over acquittal. Thus, you can acquit a component and then subsequently repair it, but you cannot acquit a component that has already been repaired.
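The precedence rule can be expressed as a simple ranking, shown here as a minimal Python sketch. The numeric ranks and the function name outranks are hypothetical; only the ordering (replacement over repair, and both over acquittal) comes from the text.

```python
# Higher number wins. The ordering is from this section; the numbers are arbitrary.
PRECEDENCE = {"acquitted": 1, "repaired": 2, "replaced": 3}


def outranks(requested: str, current: str) -> bool:
    """Return True if the requested state takes precedence over the current one."""
    return PRECEDENCE[requested] > PRECEDENCE[current]


# A component that was acquitted can subsequently be repaired...
print(outranks("repaired", "acquitted"))   # prints: True
# ...but a component that was already repaired cannot be acquitted.
print(outranks("acquitted", "repaired"))   # prints: False
```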

A case is considered repaired (it moves into the FMD_CASE_REPAIRED state and a list.repaired event is generated) when either its UUID is acquitted or all of its suspects have been repaired, replaced, removed, or acquitted.
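The condition for a case to be considered repaired can be written as a single boolean test. The following Python sketch is a model of that rule under stated assumptions, not the Fault Manager implementation: the function name, the dictionary layout, and the "faulty" state string are hypothetical, and a case with no suspects is assumed not to count as repaired unless its UUID is acquitted.

```python
from typing import Dict

# Suspect states that count toward a repaired case, as listed in this section.
RESOLVED_STATES = {"repaired", "replaced", "removed", "acquitted"}


def case_is_repaired(uuid_acquitted: bool, suspects: Dict[str, str]) -> bool:
    """Return True if the case would be considered repaired.

    suspects maps a FRU label to its current state (hypothetical layout).
    """
    if uuid_acquitted:
        # Acquitting the case UUID is sufficient on its own.
        return True
    # Assumption: an empty suspect list does not count as repaired.
    return bool(suspects) and all(
        state in RESOLVED_STATES for state in suspects.values()
    )
```

With two suspects, the case stays open while either one is still faulty and is considered repaired once both are resolved, in any combination of the four states.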

Usually fmadm automatically acquits a suspect in a multi-element suspect list, or Oracle Support Services gives you instructions to perform a manual acquittal. Acquit by label only if you have determined that the resource is not guilty in every current case in which it is a suspect. To acquit a FRU manually in one case while it remains a suspect in all others, use the following form, which lets you specify both the UUID and the label:

fmadm acquit uuid [label]
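The difference in scope between acquitting by label and acquitting by UUID and label can be illustrated with a small Python model. This is a sketch under stated assumptions, not the behavior of fmadm itself: the function names, the nested dictionary layout, the "faulty" state string, and the case identifiers case-1 and case-2 are all hypothetical.

```python
from typing import Dict

# cases maps a case UUID to {FRU label: state} (hypothetical layout).
Cases = Dict[str, Dict[str, str]]


def acquit_by_label(cases: Cases, label: str) -> None:
    """Model of 'fmadm acquit label': acquit the FRU in every case that lists it."""
    for suspects in cases.values():
        if label in suspects:
            suspects[label] = "acquitted"


def acquit_in_case(cases: Cases, uuid: str, label: str) -> None:
    """Model of 'fmadm acquit uuid label': acquit the FRU in one case only."""
    if label not in cases[uuid]:
        raise KeyError(f"{label} is not a suspect in case {uuid}")
    cases[uuid][label] = "acquitted"


cases = {
    "case-1": {"/SYS/MB/P0": "faulty"},
    "case-2": {"/SYS/MB/P0": "faulty"},
}
acquit_in_case(cases, "case-1", "/SYS/MB/P0")
print(cases["case-1"]["/SYS/MB/P0"])  # prints: acquitted
print(cases["case-2"]["/SYS/MB/P0"])  # prints: faulty
```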