Go to main content

Managing Devices in Oracle® Solaris 11.4

Exit Print View

Updated: November 2020
 
 

Resolving Faulty Devices

A device retirement mechanism isolates a device that is flagged as faulty by the fault management framework (FMA). This feature enables faulty devices to be safely and automatically inactivated to avoid data loss, data corruption, or panics and system down time. The retirement process takes into account the stability of the system after the device has been retired.

Critical devices are never retired. If you need to manually replace a retired device, use the fmadm repair command after the device replacement to notify the system that the device is replaced.

For more information, see the fmadm(8) man page.

When a device is retired, a message similar to the following example is displayed on the console and recorded in the /var/adm/messages file.

Aug 9 18:14 starbug genunix: [ID 751201 kern.notice] \
     NOTICE: One or more I/O devices have been retired

You can use the prtconf command to identify specific retired devices. For example:

# prtconf
.
.
.
pci, instance #2
scsi, instance #0
disk (driver not attached)
tape (driver not attached)
sd, instance #3
sd, instance #0 (retired)
scsi, instance #1 (retired)
disk (retired)
tape (retired)
pci, instance #3
network, instance #2 (driver not attached)
network, instance #3 (driver not attached)
os-io (driver not attached)
iscsi, instance #0
pseudo, instance #0
.
.
.

How to Resolve a Faulty Device

This procedure describes how to resolve a faulty device or a device that has been retired.

  1. Identify the faulty device with the fmadm faulty command.

    For example:

    # fmadm faulty
    --------------- ------------------------------------  -------------- ---------
    TIME            EVENT-ID                              MSG-ID SEVERITY
    --------------- ------------------------------------  -------------- ---------
    Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1  ZFS-8000-LR Major
    
    Problem Status    : solved
    Diag Engine       : zfs-diagnosis / 1.0
    System
    Manufacturer  : unknown
    Name          : ORCL,SPARC-T3-4
    Part_Number   : unknown
    Serial_Number : 1120BDRCCD
    Host_ID       : 84a02d28
    
    ----------------------------------------
    Suspect 1 of 1 :
    Fault class : fault.fs.zfs.open_failed
    Certainty   : 100%
    Affects     : zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
    Status      : faulted and taken out of service
    
    FRU
    Name             : "zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a"
    Status        : faulty
    
    Description : ZFS device 'id1,sd@n5000c500335dc60f/a' in pool 'pond' failed to
    open.
    
    Response    : An attempt will be made to activate a hot spare if available.
    
    Impact      : Fault tolerance of the pool may be compromised.
    
    Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
    Run 'zpool status -lx' for more information. Please refer to the
    associated reference document at
    http://support.oracle.com/msg/ZFS-8000-LR for the latest service
    procedures and policies regarding this diagnosis.
  2. Replace the faulty or retired device or clear the device error.

    For example:

    # zpool clear pond c0t5000C500335DC60Fd0

    If an intermittent device error occurred but the device was not replaced, you can attempt to clear the previous error, which is the faulty device identified by fmadm utility.

  3. Clear the FMA fault.

    For example:

    # fmadm repaired zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/ \
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
    fmadm: recorded repair to of zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
  4. Confirm that the fault is cleared.
    # fmadm faulty

    If the error is cleared, the fmadm faulty command does not return any output.