Determining the Health Status of ZFS Storage Pools

Language:

ZFS provides an integrated method of examining pool and device health. The health of a pool is determined from the state of all its devices. This state information is displayed by using the zpool status command. In addition, potential pool and device failures are reported by fmd, displayed on the system console, and logged in the /var/adm/messages file.

This section describes how to determine pool and device health. This chapter does not document how to repair or recover from unhealthy pools. For more information about troubleshooting and data recovery, see Chapter 10, Oracle Solaris ZFS Troubleshooting and Pool Recovery.

A pool's health status is described by one of four states:

DEGRADED: A pool with one or more failed devices, but the data is still available due to a redundant configuration.
ONLINE: A pool that has all devices operating normally.
SUSPENDED: A pool that is waiting for device connectivity to be restored. A SUSPENDED pool remains in the wait state until the device issue is resolved.
UNAVAIL: A pool with corrupted metadata, or one or more unavailable devices, and insufficient replicas to continue functioning.

Each pool device can fall into one of the following states:

DEGRADED: The virtual device has experienced a failure but can still function. This state is most common when a mirror or RAID-Z device has lost one or more constituent devices. The fault tolerance of the pool might be compromised, as a subsequent fault in another device might be unrecoverable.
OFFLINE: The device has been explicitly taken offline by the administrator.
ONLINE: The device or virtual device is in normal working order. Although some transient errors might still occur, the device is otherwise in working order.
REMOVED: The device was physically removed while the system was running. Device removal detection is hardware-dependent and might not be supported on all platforms.
UNAVAIL: The device or virtual device cannot be opened. In some cases, pools with UNAVAIL devices appear in DEGRADED mode. If a top-level virtual device is UNAVAIL, then nothing in the pool can be accessed.

The health of a pool is determined from the health of all its top-level virtual devices. If all virtual devices are ONLINE, then the pool is also ONLINE. If any one of the virtual devices is DEGRADED or UNAVAIL, then the pool is also DEGRADED. If a top-level virtual device is UNAVAIL or OFFLINE, then the pool is also UNAVAIL or SUSPENDED. A pool in the UNAVAIL or SUSPENDED state is completely inaccessible. No data can be recovered until the necessary devices are attached or repaired. A pool in the DEGRADED state continues to run, but you might not achieve the same level of data redundancy or data throughput than if the pool were online.

The zpool status command also provides details about resilver and scrub operations.

Resilver in-progress report. For example:

scan: resilver in progress since Wed Jun 20 14:19:38 2012
7.43G scanned
7.43G resilvered at 26.8M/s, 10.35% done, 0h30m to go

Scrub in-progress report. For example:

scan: scrub in progress since Wed Jun 20 14:56:52 2012
529M scanned out of 71.8G at 48.1M/s, 0h25m to go
0 repaired, 0.72% done

Resilver completion message. For example:

scan: resilvered 71.8G in 0h14m with 0 errors on Wed Jun 20 14:33:42 2012

Scrub completion message. For example:

scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012

Ongoing scrub cancellation message. For example:

scan: scrub canceled on Wed Jun 20 16:04:40 2012

Scrub and resilver completion messages persist across system reboots

Basic Storage Pool Health Status

You can quickly review pool health status by using the zpool status command as follows:

# zpool status -x
all pools are healthy

Specific pools can be examined by specifying a pool name in the command syntax. Any pool that is not in the ONLINE state should be investigated for potential problems, as described in the next section.

Detailed Health Status

You can request a more detailed health summary status by using the –v option. For example:

# zpool status -v pond
pool: pond
state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or 'fmadm repaired', or replace the device
with 'zpool replace'.
scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 20 15:38:08 2012
config:

NAME                       STATE     READ WRITE CKSUM
pond                       DEGRADED     0     0     0
mirror-0                   DEGRADED     0     0     0
c0t5000C500335F95E3d0      ONLINE       0     0     0
c0t5000C500335F907Fd0      UNAVAIL      0     0     0
mirror-1                   ONLINE       0     0     0
c0t5000C500335BD117d0      ONLINE       0     0     0
c0t5000C500335DC60Fd0      ONLINE       0     0     0

device details:

c0t5000C500335F907Fd0    UNAVAIL          cannot open
status: ZFS detected errors on this device.
The device was missing.
see: http://support.oracle.com/msg/ZFS-8000-LR for recovery


errors: No known data errors

This output displays a complete description of why the pool is in its current state, including a readable description of the problem and a link to a knowledge article for more information. Each knowledge article provides up-to-date information about the best way to recover from your current problem. Using the detailed configuration information, you can determine which device is damaged and how to repair the pool.

In the preceding example, the UNAVAIL device should be replaced. After the device is replaced, use the zpool online command to bring the device online, if necessary. For example:

# zpool online pond c0t5000C500335F907Fd0
warning: device 'c0t5000C500335DC60Fd0' onlined, but remains in degraded state
# zpool status -x
all pools are healthy

The above output identifies that the device remains in a degraded state until any resilvering is complete.

If the autoreplace property is on, you might not have to online the replaced device.

If a pool has an offline device, the command output identifies the problem pool. For example:

# zpool status -x
pool: pond
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
config:

NAME                       STATE     READ WRITE CKSUM
pond                       DEGRADED     0     0     0
mirror-0                   DEGRADED     0     0     0
c0t5000C500335F95E3d0      ONLINE       0     0     0
c0t5000C500335F907Fd0      OFFLINE      0     0     0
mirror-1                   ONLINE       0     0     0
c0t5000C500335BD117d0      ONLINE       0     0     0
c0t5000C500335DC60Fd0      ONLINE       0     0     0

errors: No known data errors

The READ and WRITE columns provide a count of I/O errors that occurred on the device, while the CKSUM column provides a count of uncorrectable checksum errors that occurred on the device. Both error counts indicate a potential device failure, and some corrective action is needed. If non-zero errors are reported for a top-level virtual device, portions of your data might have become inaccessible.

The errors: field identifies any known data errors.

In the preceding example output, the offline device is not causing data errors.

For more information about diagnosing and repairing UNAVAIL pools and data, see Chapter 10, Oracle Solaris ZFS Troubleshooting and Pool Recovery.

Gathering ZFS Storage Pool Status Information

You can use the zpool status interval and count options to gather statistics over a period of time. In addition, you can display a time stamp by using the –T option. For example:

# zpool status -T d 3 2
Wed Jun 20 16:10:09 MDT 2012
pool: pond
state: ONLINE
scan: resilvered 9.50K in 0h0m with 0 errors on Wed Jun 20 16:07:34 2012
config:

NAME                          STATE     READ  WRITE  CKSUM
pond                          ONLINE       0      0      0
   mirror-0                   ONLINE       0      0      0
      c0t5000C500335F95E3d0   ONLINE       0      0      0
      c0t5000C500335F907Fd0   ONLINE       0      0      0
   mirror-1                   ONLINE       0      0      0
      c0t5000C500335BD117d0   ONLINE       0      0      0
      c0t5000C500335DC60Fd0   ONLINE       0      0      0

errors: No known data errors

pool: rpool
state: ONLINE
scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012
config:

NAME                         STATE     READ WRITE CKSUM
rpool                        ONLINE       0     0     0
mirror-0                     ONLINE       0     0     0
c0t5000C500335BA8C3d0s0      ONLINE       0     0     0
c0t5000C500335FC3E7d0s0      ONLINE       0     0     0

errors: No known data errors
Wed Jun 20 16:10:12 MDT 2012

pool: pond
state: ONLINE
scan: resilvered 9.50K in 0h0m with 0 errors on Wed Jun 20 16:07:34 2012
config:

NAME                       STATE     READ WRITE CKSUM
pond                       ONLINE       0     0     0
mirror-0                   ONLINE       0     0     0
c0t5000C500335F95E3d0      ONLINE       0     0     0
c0t5000C500335F907Fd0      ONLINE       0     0     0
mirror-1                   ONLINE       0     0     0
c0t5000C500335BD117d0      ONLINE       0     0     0
c0t5000C500335DC60Fd0      ONLINE       0     0     0

errors: No known data errors

pool: rpool
state: ONLINE
scan: scrub repaired 0 in 0h11m with 0 errors on Wed Jun 20 15:08:23 2012
config:

NAME                         STATE     READ WRITE CKSUM
rpool                        ONLINE       0     0     0
mirror-0                     ONLINE       0     0     0
c0t5000C500335BA8C3d0s0      ONLINE       0     0     0
c0t5000C500335FC3E7d0s0      ONLINE       0     0     0

errors: No known data errors