ZFS provides an integrated method of examining pool and device health. The health of a pool is determined from the state of all its devices. This state information is displayed by using the zpool status command. In addition, potential pool and device failures are reported by fmd, displayed on the system console, and logged in the /var/adm/messages file.
This section describes how to determine pool and device health. This chapter does not document how to repair or recover from unhealthy pools. For more information about troubleshooting and data recovery, see Chapter 10, Oracle Solaris ZFS Troubleshooting and Pool Recovery.
A pool's health status is described by one of four states:
A pool with one or more failed devices, but the data is still available due to a redundant configuration.
A pool that has all devices operating normally.
A pool that is waiting for device connectivity to be restored. A SUSPENDED pool remains in the wait state until the device issue is resolved.
A pool with corrupted metadata, or one or more unavailable devices, and insufficient replicas to continue functioning.
Each pool device can fall into one of the following states:
The virtual device has experienced a failure but can still function. This state is most common when a mirror or RAID-Z device has lost one or more constituent devices. The fault tolerance of the pool might be compromised, as a subsequent fault in another device might be unrecoverable.
The device has been explicitly taken offline by the administrator.
The device or virtual device is in normal working order. Although some transient errors might still occur, the device is otherwise in working order.
The device was physically removed while the system was running. Device removal detection is hardware-dependent and might not be supported on all platforms.
The device or virtual device cannot be opened. In some cases, pools with UNAVAIL devices appear in DEGRADED mode. If a top-level virtual device is UNAVAIL, then nothing in the pool can be accessed.
The health of a pool is determined from the health of all its top-level virtual devices. If all virtual devices are ONLINE, then the pool is also ONLINE. If any one of the virtual devices is DEGRADED or UNAVAIL, then the pool is also DEGRADED. If a top-level virtual device is UNAVAIL or OFFLINE, then the pool is also UNAVAIL or SUSPENDED. A pool in the UNAVAIL or SUSPENDED state is completely inaccessible. No data can be recovered until the necessary devices are attached or repaired. A pool in the DEGRADED state continues to run, but you might not achieve the same level of data redundancy or data throughput than if the pool were online.
The zpool status command also provides details about resilver and scrub operations.
Resilver in-progress report. For example:
scan: resilver in progress since Wed Jun 20 14:19:38 2012 7.43G scanned 7.43G resilvered at 26.8M/s, 10.35% done, 0h30m to go
Scrub in-progress report. For example:
scan: scrub in progress since Wed Jun 20 14:56:52 2012 529M scanned out of 71.8G at 48.1M/s, 0h25m to go 0 repaired, 0.72% done
Resilver completion message. For example:
scan: resilvered 71.8G in 0h14m with 0 errors on Wed Jun 20 14:33:42 2012
Scrub completion message. For example:
scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012
Ongoing scrub cancellation message. For example:
scan: scrub canceled on Wed Jun 20 16:04:40 2012
Scrub and resilver completion messages persist across system reboots
You can quickly review pool health status by using the zpool status command as follows:
# zpool status -x all pools are healthy
Specific pools can be examined by specifying a pool name in the command syntax. Any pool that is not in the ONLINE state should be investigated for potential problems, as described in the next section.
You can request a more detailed health summary status by using the –v option. For example:
# zpool status -v pond pool: pond state: DEGRADED status: One or more devices are unavailable in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or 'fmadm repaired', or replace the device with 'zpool replace'. scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 20 15:38:08 2012 config: NAME STATE READ WRITE CKSUM pond DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 c0t5000C500335F95E3d0 ONLINE 0 0 0 c0t5000C500335F907Fd0 UNAVAIL 0 0 0 mirror-1 ONLINE 0 0 0 c0t5000C500335BD117d0 ONLINE 0 0 0 c0t5000C500335DC60Fd0 ONLINE 0 0 0 device details: c0t5000C500335F907Fd0 UNAVAIL cannot open status: ZFS detected errors on this device. The device was missing. see: http://support.oracle.com/msg/ZFS-8000-LR for recovery errors: No known data errors
This output displays a complete description of why the pool is in its current state, including a readable description of the problem and a link to a knowledge article for more information. Each knowledge article provides up-to-date information about the best way to recover from your current problem. Using the detailed configuration information, you can determine which device is damaged and how to repair the pool.
In the preceding example, the UNAVAIL device should be replaced. After the device is replaced, use the zpool online command to bring the device online, if necessary. For example:
# zpool online pond c0t5000C500335F907Fd0 warning: device 'c0t5000C500335DC60Fd0' onlined, but remains in degraded state # zpool status -x all pools are healthy
The above output identifies that the device remains in a degraded state until any resilvering is complete.
If the autoreplace property is on, you might not have to online the replaced device.
If a pool has an offline device, the command output identifies the problem pool. For example:
# zpool status -x pool: pond state: DEGRADED status: One or more devices has been taken offline by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Online the device using 'zpool online' or replace the device with 'zpool replace'. config: NAME STATE READ WRITE CKSUM pond DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 c0t5000C500335F95E3d0 ONLINE 0 0 0 c0t5000C500335F907Fd0 OFFLINE 0 0 0 mirror-1 ONLINE 0 0 0 c0t5000C500335BD117d0 ONLINE 0 0 0 c0t5000C500335DC60Fd0 ONLINE 0 0 0 errors: No known data errors
The READ and WRITE columns provide a count of I/O errors that occurred on the device, while the CKSUM column provides a count of uncorrectable checksum errors that occurred on the device. Both error counts indicate a potential device failure, and some corrective action is needed. If non-zero errors are reported for a top-level virtual device, portions of your data might have become inaccessible.
The errors: field identifies any known data errors.
In the preceding example output, the offline device is not causing data errors.
For more information about diagnosing and repairing UNAVAIL pools and data, see Chapter 10, Oracle Solaris ZFS Troubleshooting and Pool Recovery.
You can use the zpool status interval and count options to gather statistics over a period of time. In addition, you can display a time stamp by using the –T option. For example:
# zpool status -T d 3 2 Wed Jun 20 16:10:09 MDT 2012 pool: pond state: ONLINE scan: resilvered 9.50K in 0h0m with 0 errors on Wed Jun 20 16:07:34 2012 config: NAME STATE READ WRITE CKSUM pond ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c0t5000C500335F95E3d0 ONLINE 0 0 0 c0t5000C500335F907Fd0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c0t5000C500335BD117d0 ONLINE 0 0 0 c0t5000C500335DC60Fd0 ONLINE 0 0 0 errors: No known data errors pool: rpool state: ONLINE scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c0t5000C500335BA8C3d0s0 ONLINE 0 0 0 c0t5000C500335FC3E7d0s0 ONLINE 0 0 0 errors: No known data errors Wed Jun 20 16:10:12 MDT 2012 pool: pond state: ONLINE scan: resilvered 9.50K in 0h0m with 0 errors on Wed Jun 20 16:07:34 2012 config: NAME STATE READ WRITE CKSUM pond ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c0t5000C500335F95E3d0 ONLINE 0 0 0 c0t5000C500335F907Fd0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c0t5000C500335BD117d0 ONLINE 0 0 0 c0t5000C500335DC60Fd0 ONLINE 0 0 0 errors: No known data errors pool: rpool state: ONLINE scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c0t5000C500335BA8C3d0s0 ONLINE 0 0 0 c0t5000C500335FC3E7d0s0 ONLINE 0 0 0 errors: No known data errors