Resolving Faulty Devices

A device retirement mechanism isolates a device that the Fault Management Architecture (FMA) flags as faulty. This feature enables faulty devices to be safely and automatically inactivated to avoid data loss, data corruption, panics, and system downtime. The retirement process takes into account the stability of the system after the device has been retired.

Critical devices are never retired. If you need to manually replace a retired device, use the fmadm repair command after you replace the device to notify the system that the device has been replaced.

For more information, see the fmadm(8) man page.

When a device is retired, a message similar to the following example is displayed on the console and recorded in the /var/adm/messages file.

Aug 9 18:14 starbug genunix: [ID 751201 kern.notice] \
     NOTICE: One or more I/O devices have been retired
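
To check later whether a retirement was logged, you can search the messages file for the notice text. The following is a minimal sketch, assuming the notice wording matches the example above. The function name retired_notices is hypothetical, and it reads log text on standard input so that it can be tried on any file.

```shell
# Hypothetical helper: print log lines that report a device retirement.
# Reads log text on standard input. The match string is copied from the
# example notice; the exact wording on a given system is an assumption.
retired_notices() {
    grep -F 'I/O devices have been retired'
}
```

For example, `retired_notices < /var/adm/messages` prints only the matching notices.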

You can use the prtconf command to identify specific retired devices. For example:

# prtconf
.
.
.
pci, instance #2
    scsi, instance #0
        disk (driver not attached)
        tape (driver not attached)
        sd, instance #3
        sd, instance #0 (retired)
    scsi, instance #1 (retired)
        disk (retired)
        tape (retired)
pci, instance #3
    network, instance #2 (driver not attached)
    network, instance #3 (driver not attached)
os-io (driver not attached)
iscsi, instance #0
pseudo, instance #0
.
.
.
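
On a system with many devices, the retired entries are easy to miss in the full listing. The following sketch filters prtconf output down to the nodes marked as retired. The function name retired_nodes is hypothetical, and the filter relies only on the "(retired)" marker shown in the example.

```shell
# Hypothetical helper: list device nodes marked "(retired)" in prtconf output.
# Reads prtconf text on standard input and prints one node per line, with
# leading indentation and the "(retired)" marker stripped.
retired_nodes() {
    grep -F '(retired)' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*(retired)[[:space:]]*$//'
}
```

For example, `prtconf | retired_nodes`. Note that the flat list drops the parent and child relationships, so refer to the full prtconf output to see where each node sits in the device tree.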

How to Resolve a Faulty Device

This procedure describes how to resolve a faulty device or a device that has been retired.

Note - For ZFS device problem or failure information, see Chapter 11, Oracle Solaris ZFS Troubleshooting and Pool Recovery in Managing ZFS File Systems in Oracle Solaris 11.4.

Identify the faulty device with the fmadm faulty command.

For example:

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1  ZFS-8000-LR    Major

Problem Status    : solved
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : ORCL,SPARC-T3-4
    Part_Number   : unknown
    Serial_Number : 1120BDRCCD
    Host_ID       : 84a02d28

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.fs.zfs.open_failed
   Certainty   : 100%
   Affects     : zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
   Status      : faulted and taken out of service

   FRU
     Name      : "zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a"
     Status    : faulty

Description : ZFS device 'id1,sd@n5000c500335dc60f/a' in pool 'pond' failed to
              open.

Response    : An attempt will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Run 'zpool status -lx' for more information. Please refer to the
              associated reference document at
              http://support.oracle.com/msg/ZFS-8000-LR for the latest service
              procedures and policies regarding this diagnosis.
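
When several faults are outstanding, it can help to reduce the report to one line per fault. The following sketch extracts the event ID, message ID, and severity from the summary row. The function name fault_summary is hypothetical, and the parsing assumes the row layout shown in the example: a three-field timestamp followed by a UUID, the message ID, and the severity.

```shell
# Hypothetical helper: print "EVENT-ID MSG-ID SEVERITY" for each fault in
# `fmadm faulty` output read on standard input.
# Assumption: a summary row has at least six fields and its fourth field is
# a UUID (five hexadecimal groups separated by hyphens), as in the example.
fault_summary() {
    awk 'NF >= 6 && $4 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ {
        print $4, $5, $6
    }'
}
```

For example, `fmadm faulty | fault_summary`. The message ID in the second column is the value to look up in the reference document named in the Action field.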

Replace the faulty or retired device or clear the device error.
For example:
```
# zpool clear pond c0t5000C500335DC60Fd0
```
If an intermittent device error occurred and you did not replace the device, you can attempt to clear the error on the device that the fmadm utility identified as faulty.
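
After clearing the error, you might want to confirm that the pool no longer reports a problem. The following sketch checks the text printed by zpool status -x. The function name pool_reports_healthy is hypothetical, and the two message strings it matches are an assumption about the command's wording, so adjust them if your output differs.

```shell
# Hypothetical helper: succeed when `zpool status -x` output, read on
# standard input, reports that the pool (or all pools) is healthy.
# Assumption: the healthy case is reported with one of the two strings below.
pool_reports_healthy() {
    grep -q -e 'all pools are healthy' -e "pool '.*' is healthy"
}
```

For example, `zpool status -x pond | pool_reports_healthy || echo "pond still reports errors"`.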

Clear the FMA fault.

For example:

# fmadm repaired zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/ \
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
fmadm: recorded repair to of zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
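
The FMRI passed to fmadm repaired is long and easy to mistype. The following sketch pulls it from the Affects field of the fmadm faulty output. The function name affected_fmri is hypothetical. In the example the FMRI is wrapped onto a second line, so the sketch treats a following single-word line without a " : " separator as a continuation; that rule is an assumption based on the example, not documented behavior.

```shell
# Hypothetical helper: print the FMRI from each "Affects" field of
# `fmadm faulty` output read on standard input.
# Assumption: if the FMRI is wrapped, the continuation is the next line,
# is a single word, and contains no " : " label separator.
affected_fmri() {
    awk '
        pending != "" {
            if (NF == 1 && $0 !~ / : /) {
                # Continuation line: join it to the first half of the FMRI.
                print pending $1
                pending = ""
                next
            }
            # Not a continuation: the FMRI was complete on one line.
            print pending
            pending = ""
        }
        /^[ \t]*Affects[ \t]*:/ {
            sub(/^[ \t]*Affects[ \t]*:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            pending = $0
        }
        END { if (pending != "") print pending }
    '
}
```

For example, `fmadm faulty | affected_fmri` prints the FMRI on one line. Review the value before passing it to fmadm repaired, because marking the wrong resource as repaired hides a real fault.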

Confirm that the fault is cleared.
```
# fmadm faulty
```
If the error is cleared, the fmadm faulty command does not return any output.
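
In a script, the empty-output check can be expressed as an exit status. The following sketch succeeds only when its input contains no non-whitespace text. The function name no_faults_reported is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) only when the text on
# standard input is empty or whitespace, which is what `fmadm faulty`
# prints once the fault is cleared, as described in this step.
no_faults_reported() {
    ! grep -q '[^[:space:]]'
}
```

For example, `fmadm faulty | no_faults_reported && echo "No outstanding faults"`.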