Managing Devices in Oracle® Solaris 11.3

Updated: April 2018
 
 

Resolving Faulty Devices

A device retirement mechanism isolates a device that the Fault Management Architecture (FMA) has flagged as faulty. This feature allows faulty devices to be inactivated safely and automatically to avoid data loss, data corruption, panics, and system downtime. The retirement process takes into account the stability of the system after the device has been retired.

Critical devices are never retired. If you need to manually replace a retired device, use the fmadm repair command after the device replacement so that the system knows that the device has been replaced.

For more information, see fmadm(1M).

When a device is retired, a message similar to the following is displayed on the console and recorded in the /var/adm/messages file:

Aug 9 18:14 starbug genunix: [ID 751201 kern.notice] \
     NOTICE: One or more I/O devices have been retired
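To check after the fact whether any retirement has been logged, you can search the messages file for the fixed part of the notice. The following is a minimal sketch, not an Oracle-supplied tool: the function name retire_notices is illustrative, and it assumes the notice text matches the example above.

```shell
# Sketch: print any device-retirement notices found in a log file.
# The default path is the one named in the text; retire_notices is a
# hypothetical helper name, not a Solaris command.
retire_notices() {
    logfile=${1:-/var/adm/messages}
    # The pattern contains no regex metacharacters, so plain grep is enough.
    grep 'I/O devices have been retired' "$logfile"
}

# Usage:
#   retire_notices                   # search /var/adm/messages
#   retire_notices /path/to/saved/messages-copy
```

The function prints the matching lines and returns a nonzero status when no notice is found, so it can also be used in a conditional.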

You can use the prtconf command to identify specific retired devices. For example:

# prtconf
.
.
.
pci, instance #2
        scsi, instance #0
                disk (driver not attached)
                tape (driver not attached)
                sd, instance #3
                sd, instance #0 (retired)
        scsi, instance #1 (retired)
                disk (retired)
                tape (retired)
pci, instance #3
        network, instance #2 (driver not attached)
        network, instance #3 (driver not attached)
os-io (driver not attached)
iscsi, instance #0
pseudo, instance #0
.
.
.
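On a system with many device nodes, it can help to filter the prtconf output down to the retired entries. The following sketch assumes that retired nodes are marked with the string (retired), as in the example output above; the function name list_retired is illustrative.

```shell
# Sketch: keep only the device nodes that prtconf marks as retired.
# Reads prtconf output on standard input, so it also works on saved output.
# list_retired is a hypothetical helper name, not a Solaris command.
list_retired() {
    # In a basic regular expression, parentheses match literally.
    grep '(retired)'
}

# Usage:
#   prtconf | list_retired
```

Any indentation in the prtconf output is preserved, so the depth of each retired node in the device tree remains visible.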

How to Resolve a Faulty Device

Use the following steps to resolve a faulty device or a device that has been retired.

  1. Identify the faulted device with the fmadm faulty command. For example:
    # fmadm faulty
    --------------- ------------------------------------  -------------- ---------
    TIME            EVENT-ID                              MSG-ID         SEVERITY
    --------------- ------------------------------------  -------------- ---------
    Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1  ZFS-8000-LR    Major
    
    Problem Status    : solved
    Diag Engine       : zfs-diagnosis / 1.0
    System
    Manufacturer  : unknown
    Name          : ORCL,SPARC-T3-4
    Part_Number   : unknown
    Serial_Number : 1120BDRCCD
    Host_ID       : 84a02d28
    
    ----------------------------------------
    Suspect 1 of 1 :
    Fault class : fault.fs.zfs.open_failed
    Certainty   : 100%
    Affects     : zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
    Status      : faulted and taken out of service
    
    FRU
    Name             : "zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a"
    Status        : faulty
    
    Description : ZFS device 'id1,sd@n5000c500335dc60f/a' in pool 'pond' failed to
    open.
    
    Response    : An attempt will be made to activate a hot spare if available.
    
    Impact      : Fault tolerance of the pool may be compromised.
    
    Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
    Run 'zpool status -lx' for more information. Please refer to the
    associated reference document at
    http://support.oracle.com/msg/ZFS-8000-LR for the latest service
    procedures and policies regarding this diagnosis.
  2. Replace the faulty or retired device or clear the device error. For example:
    # zpool clear pond c0t5000C500335DC60Fd0

    If an intermittent device error occurred but the device was not replaced, you can attempt to clear the previous error.

  3. Clear the FMA fault. For example:
    # fmadm repaired zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/ \
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
    fmadm: recorded repair to of zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
    pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
  4. Confirm that the fault is cleared.
    # fmadm faulty

    If the fault is cleared, the fmadm faulty command produces no output.
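The final check in the procedure above can be scripted. The following is a minimal sketch that relies only on the behavior described in step 4, namely that fmadm faulty prints nothing when no faults remain. The function name fault_cleared is illustrative, and the sketch deliberately tests the output rather than the exit status of fmadm.

```shell
# Sketch: succeed when the `fmadm faulty` output supplied on standard
# input is empty (or contains only blank lines), which the procedure
# treats as "fault cleared".
# fault_cleared is a hypothetical helper name, not a Solaris command.
fault_cleared() {
    # `grep .` succeeds if at least one line contains a character.
    if grep . >/dev/null; then
        return 1    # some output remains, so a fault is still listed
    fi
    return 0
}

# Usage (run with the privileges that fmadm requires):
#   if fmadm faulty | fault_cleared; then
#       echo "No outstanding faults"
#   else
#       echo "Faults still present; review the fmadm faulty output"
#   fi
```

Reading from standard input keeps the check independent of how fmadm is invoked, so the same function can be applied to output captured earlier.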