Resolving Data Problems in a ZFS Storage Pool

Language:

Examples of data problems include the following:

Pool or file system space is missing
Transient I/O errors due to a bad disk or controller
On-disk data corruption due to cosmic rays
Driver bugs resulting in data being transferred to or from the wrong location
A user overwriting portions of the physical device by accident

In some cases, these errors are transient, such as a random I/O error while the controller is having problems. In other cases, the damage is permanent, such as on-disk corruption. Even still, whether the damage is permanent does not necessarily indicate that the error is likely to occur again. For example, if you accidentally overwrite part of a disk, no type of hardware failure has occurred, and the device does not need to be replaced. Identifying the exact problem with a device is not an easy task and is covered in more detail in a later section.

Resolving ZFS Space Issues

Review the following sections if you are unsure how ZFS reports file system and pool space accounting.

ZFS File System Space Reporting

The zpool list and zfs list commands are better than the previous df and du commands for determining your available pool and file system space. With the legacy commands, you cannot easily discern between pool and file system space, nor do the legacy commands account for space that is consumed by descendant file systems or snapshots.

For example, the following root pool (rpool) has 5.46 GB allocated and 68.5 GB free.

# zpool list rpool
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool   74G  5.46G  68.5G   7%  1.00x  ONLINE  -

If you compare the pool space accounting with the file system space accounting by reviewing the USED column of your individual file systems, you can see that the pool space that is reported in ALLOC is accounted for in the file systems' USED total. For example:

# zfs list -r rpool
NAME                      USED  AVAIL  REFER  MOUNTPOINT
rpool                    5.41G  67.4G  74.5K  /rpool
rpool/ROOT               3.37G  67.4G    31K  legacy
rpool/ROOT/solaris       3.37G  67.4G  3.07G  /
rpool/ROOT/solaris/var    302M  67.4G   214M  /var
rpool/dump               1.01G  67.5G  1000M  -
rpool/export             97.5K  67.4G    32K  /rpool/export
rpool/export/home        65.5K  67.4G    32K  /rpool/export/home
rpool/export/home/admin  33.5K  67.4G  33.5K  /rpool/export/home/admin
rpool/swap               1.03G  67.5G  1.00G  -

ZFS Storage Pool Space Reporting

The SIZE value that is reported by the zpool list command is generally the amount of physical disk space in the pool, but varies depending on the pool's redundancy level. See the examples below. The zfs list command lists the usable space that is available to file systems, which is disk space minus ZFS pool redundancy metadata overhead, if any.

The following ZFS dataset configurations are tracked as allocated space by the zfs list command but they are not tracked as allocated space in the zpool list output:

ZFS file system quota
ZFS file system reservation
ZFS logical volume size

The following items describe how using different pool configurations, ZFS volumes and ZFS reservations can impact your consumed and available disk space. Depending upon your configuration, monitoring pool space should be tracked by using the steps listed below.

Non-redundant storage pool – When a pool is created with one 136-GB disk, the zpool list command reports SIZE and initial FREE values as 136 GB. The initial AVAIL space reported by the zfs list command is 134 GB, due to a small amount of pool metadata overhead. For example:

# zpool create system1 c0t6d0
# zpool list system1
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
system1   136G  95.5K   136G     0%  1.00x  ONLINE  -
# zfs list system1
NAME      USED  AVAIL  REFER  MOUNTPOINT
system1    72K   134G    21K  /system1

Mirrored storage pool – When a pool is created with two 136-GB disks, zpool list command reports SIZE as 136 GB and initial FREE value as 136 GB. This reporting is referred to as the deflated space value. The initial AVAIL space reported by the zfs list command is 134 GB, due to a small amount of pool metadata overhead. For example:
```
# zpool create system1 mirror c0t6d0 c0t7d0
# zpool list system1
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
system1   136G  95.5K   136G     0%  1.00x  ONLINE  -
# zfs list system1
NAME      USED  AVAIL  REFER  MOUNTPOINT
system1    72K   134G    21K  /system1
```
RAID-Z storage pool – When a raidz2 pool is created with three 136-GB disks, the zpool list commands reports SIZE as 408 GB and initial FREE value as 408 GB. This reporting is referred to as the inflated disk space value, which includes redundancy overhead, such as parity information. The initial AVAIL space reported by the zfs list command is 133 GB, due to the pool redundancy overhead. The space discrepancy between the zpool list and the zfs list output for a RAID-Z pool is because zpool list reports the inflated pool space.
```
# zpool create system1 raidz2 c0t6d0 c0t7d0 c0t8d0
# zpool list system1
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
system1   408G   286K   408G     0%  1.00x  ONLINE  -
# zfs list system1
NAME      USED  AVAIL  REFER  MOUNTPOINT
system1  73.2K   133G  20.9K  /system1
```
NFS mounted file system space – Neither the zpool list or the zfs list account for NFS mounted file system space. However, local data files can be hidden under a mounted NFS file system. If you are missing file system space, ensure that you do not have data files hidden under an NFS file system.
Using ZFS Volumes – When a ZFS file system is created and pool space is consumed, you can view the file system space consumption by using the zpool list command. For example:
```
# zpool create nova mirror c1t1d0 c2t1d0
# zfs create nova/fs1
# mkfile 10g /nova/fs1/file1_10g
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME       USED  AVAIL  REFER  MOUNTPOINT
nova      10.0G  56.9G    32K  /nova
nova/fs1  10.0G  56.9G  10.0G  /nova/fs1
```
If you create a 10-GB ZFS volume, the space is not accounted for in the zpool list command. The space is accounted for in the zfs list command. If you are using ZFS volumes in your storage pools, monitor ZFS volume space consumption by using the zfs list command. For example:
```
# zfs create -V 10g nova/vol1
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       20.3G  46.6G    32K  /nova
nova/fs1   10.0G  46.6G  10.0G  /nova/fs1
nova/vol1  10.3G  56.9G    16K  -
```
Note in the above output that ZFS volume space is not tracked in the zpool list output so use the zfs list or the zfs list -o space command to identify space that is consumed by ZFS volumes.

In addition, because ZFS volumes act like raw devices, some amount of space for metadata is automatically reserved through the refreservation property, which causes volumes to consume slightly more space then the amount specified when the volume was created. Do not remove the refreservation on ZFS volumes or you risk running out of volume space.

Using ZFS Reservations – If you create a file system with a reservation or add a reservation to an existing file system, reservations or refreservations are not tracked by the zpool list command.

Identify space that is consumed by file system reservations by using the zfs list -r command to identify the increased USED space. For example:

# zfs create -o reservation=10g nova/fs2
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       30.3G  36.6G    33K  /nova
nova/fs1   10.0G  36.6G  10.0G  /nova/fs1
nova/fs2     31K  46.6G    31K  /nova/fs2
nova/vol1  10.3G  46.9G    16K  -

If you create a file system with a refreservation, it can be identified by using the zfs list -r command. For example:

# zfs create -o refreservation=10g nova/fs3
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       40.3G  26.6G    35K  /nova
nova/fs1   10.0G  26.6G  10.0G  /nova/fs1
nova/fs2     31K  36.6G    31K  /nova/fs2
nova/fs3     10G  36.6G    31K  /nova/fs3
nova/vol1  10.3G  36.9G    16K  -

Use the following command to identify all existing reservations to account for total USED space.

# zfs get -r reserv,refreserv nova
NAME       PROPERTY        VALUE  SOURCE
nova       reservation     none   default
nova       refreservation  none   default
nova/fs1   reservation     none   default
nova/fs1   refreservation  none   default
nova/fs2   reservation     10G    local
nova/fs2   refreservation  none   default
nova/fs3   reservation     none   default
nova/fs3   refreservation  10G    local
nova/vol1  reservation     none   default
nova/vol1  refreservation  10.3G  local

Checking ZFS File System Integrity

No fsck utility equivalent exists for ZFS. This utility has traditionally served two purposes, those of file system repair and file system validation.

File System Repair

With traditional file systems, the way in which data is written is inherently vulnerable to unexpected failure causing file system inconsistencies. Because a traditional file system is not transactional, unreferenced blocks, bad link counts, or other inconsistent file system structures are possible. The addition of journaling does solve some of these problems, but can introduce additional problems when the log cannot be rolled back. The only way for inconsistent data to exist on disk in a ZFS configuration is through hardware failure (in which case the pool should have been redundant) or when a bug exists in the ZFS software.

The fsck utility repairs known problems specific to UFS file systems. Most ZFS storage pool problems are generally related to failing hardware or power failures. Many problems can be avoided by using redundant pools. If your pool is damaged due to failing hardware or a power outage, see Repairing ZFS Storage Pool-Wide Damage.

If your pool is not redundant, the risk that file system corruption can render some or all of your data inaccessible is always present.

File System Validation

In addition to performing file system repair, the fsck utility validates that the data on disk has no problems. Traditionally, this task requires unmounting the file system and running the fsck utility, possibly taking the system to single-user mode in the process. This scenario results in downtime that is proportional to the size of the file system being checked. Instead of requiring an explicit utility to perform the necessary checking, ZFS provides a mechanism to perform routine checking of all inconsistencies. This feature, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in a hardware or software failure.

Controlling ZFS Data Scrubbing

Whenever ZFS encounters an error, either through scrubbing or when accessing a file on demand, the error is logged internally so that you can obtain a quick overview of all known errors within the pool.

Explicit ZFS Data Scrubbing

The simplest way to check data integrity is to initiate an explicit scrubbing of all data within the pool. This operation traverses all the data in the pool once and verifies that all blocks can be read. Scrubbing proceeds as fast as the devices allow, though the priority of any I/O remains below that of normal operations. This operation might negatively impact performance, though the pool's data should remain usable and nearly as responsive while the scrubbing occurs. To initiate an explicit scrub, use the zpool scrub command. For example:

# zpool scrub system1

The status of the current scrubbing operation can be displayed by using the zpool status command. For example:

# zpool status -v system1
pool: system1
state: ONLINE
scan: scrub in progress since Mon Jun  7 12:07:52 2010
201M scanned out of 222M at 9.55M/s, 0h0m to go
0 repaired, 90.44% done
config:

NAME          STATE     READ WRITE CKSUM
system1       ONLINE       0     0     0
   mirror-0   ONLINE       0     0     0
      c1t0d0  ONLINE       0     0     0
      c1t1d0  ONLINE       0     0     0

errors: No known data errors

Only one active scrubbing operation per pool can occur at one time.

You can stop a scrubbing operation that is in progress by using the –s option. For example:

# zpool scrub -s system1

In most cases, a scrubbing operation to ensure data integrity should continue to completion. Stop a scrubbing operation at your own discretion if system performance is impacted by the operation.

Performing routine scrubbing guarantees continuous I/O to all disks on the system. Routine scrubbing has the side effect of preventing power management from placing idle disks in low-power mode. If the system is generally performing I/O all the time, or if power consumption is not a concern, then this issue can safely be ignored. If the system is largely idle, and you want to conserve power to the disks, you should consider using a cron scheduled explicit scrub rather than background scrubbing. This will still perform complete scrubs of data, though it will only generate a large amount of I/O until the scrubbing is finished, at which point the disks can be power managed as normal. The downside (besides increased I/O) is that there will be large periods of time when no scrubbing is being done at all, potentially increasing the risk of corruption during those periods.

For more information about interpreting zpool status output, see Querying ZFS Storage Pool Status.

ZFS Data Scrubbing and Resilvering

When a device is replaced, a resilvering operation is initiated to move data from the good copies to the new device. This action is a form of disk scrubbing. Therefore, only one such action can occur at a given time in the pool. If a scrubbing operation is in progress, a resilvering operation suspends the current scrubbing and restarts it after the resilvering is completed.

For more information about resilvering, see Viewing Resilvering Status.

Repairing Corrupted ZFS Data

Data corruption occurs when one or more device errors (indicating one or more missing or damaged devices) affects a top-level virtual device. For example, one half of a mirror can experience thousands of device errors without ever causing data corruption. If an error is encountered on the other side of the mirror in the exact same location, corrupted data is the result.

Data corruption is always permanent and requires special consideration during repair. Even if the underlying devices are repaired or replaced, the original data is lost forever. Most often, this scenario requires restoring data from backups. Data errors are recorded as they are encountered, and they can be controlled through routine pool scrubbing as explained in the following section. When a corrupted block is removed, the next scrubbing pass recognizes that the corruption is no longer present and removes any trace of the error from the system.

The following sections describe how to identify the type of data corruption and how to repair the data, if possible.

ZFS uses checksums, redundancy, and self-healing data to minimize the risk of data corruption. Nonetheless, data corruption can occur if a pool isn't redundant, if corruption occurred while a pool was degraded, or an unlikely series of events conspired to corrupt multiple copies of a piece of data. Regardless of the source, the result is the same: The data is corrupted and therefore no longer accessible. The action taken depends on the type of data being corrupted and its relative value. Two basic types of data can be corrupted:

Pool metadata – ZFS requires a certain amount of data to be parsed to open a pool and access datasets. If this data is corrupted, the entire pool or portions of the dataset hierarchy will become unavailable.
Object data – In this case, the corruption is within a specific file or directory. This problem might result in a portion of the file or directory being inaccessible, or this problem might cause the object to be broken altogether.

Data is verified during normal operations as well as through a scrubbing. For information about how to verify the integrity of pool data, see Checking ZFS File System Integrity.

Identifying the Type of Data Corruption

By default, the zpool status command shows only that corruption has occurred, but not where this corruption occurred. For example:

# zpool status system1
pool: system1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://support.oracle.com/msg/ZFS-8000-8A
config:

NAME                           STATE     READ WRITE CKSUM
   system1                     ONLINE       4     0     0
      c0t5000C500335E106Bd0    ONLINE       0     0     0
      c0t5000C500335FC3E7d0    ONLINE       4     0     0


errors: 2 data errors, use '-v' for a list

Each error indicates only that an error occurred at a given point in time. Each error is not necessarily still present on the system. Under normal circumstances, this is the case. Certain temporary outages might result in data corruption that is automatically repaired after the outage ends. A complete scrub of the pool is guaranteed to examine every active block in the pool, so the error log is reset whenever a scrub finishes. If you determine that the errors are no longer present, and you don't want to wait for a scrub to complete, reset all errors in the pool by using the zpool online command.

If the data corruption is in pool-wide metadata, the output is slightly different. For example:

# zpool status -v morpheus
pool: morpheus
id: 13289416187275223932
state: UNAVAIL
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
see: http://support.oracle.com/msg/ZFS-8000-72
config:

morpheus  FAULTED   corrupted data
c1t10d0   ONLINE

In the case of pool-wide corruption, the pool is placed into the FAULTED state because the pool cannot provide the required redundancy level.

Repairing a Corrupted File or Directory

If a file or directory is corrupted, the system might still function, depending on the type of corruption. Any damage is effectively unrecoverable if no good copies of the data exist on the system. If the data is valuable, you must restore the affected data from backup. Even so, you might be able to recover from this corruption without restoring the entire pool.

If the damage is within a file data block, then the file can be safely removed, thereby clearing the error from the system. Use the zpool status –v command to display a list of file names with persistent errors. For example:

# zpool status system1 -v
pool: system1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://support.oracle.com/msg/ZFS-8000-8A
config:

NAME                           STATE     READ WRITE CKSUM
   system1                     ONLINE       4     0     0
      c0t5000C500335E106Bd0    ONLINE       0     0     0
      c0t5000C500335FC3E7d0    ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:
/system1/file.1
/system1/file.2

The list of file names with persistent errors might be described as follows:

If the full path to the file is found and the dataset is mounted, the full path to the file is displayed. For example:
```
/path1/a.txt
```
If the full path to the file is found, but the dataset is not mounted, then the dataset name with no preceding slash (/), followed by the path within the dataset to the file, is displayed. For example:
```
path1/documents/e.txt
```
If the object number to a file path cannot be successfully translated, either due to an error or because the object doesn't have a real file path associated with it, as is the case for a dnode_t, then the dataset name followed by the object's number is displayed. For example:
```
path1/dnode:<0x0>
```
If an object in the metaobject set (MOS) is corrupted, then a special tag of <metadata>, followed by the object number, is displayed.

You can attempt to resolve more minor data corruption by using scrubbing the pool and clearing the pool errors in multiple iterations. If the first scrub and clear iteration does not resolve the corrupted files, run them again. For example:

# zpool scrub system1
# zpool clear system1

If the corruption is within a directory or a file's metadata, the only choice is to move the file elsewhere. You can safely move any file or directory to a less convenient location, allowing the original object to be restored in its place.

Repairing Corrupted Data With Multiple Block References

If a damaged file system has corrupted data with multiple block references, such as snapshots, the zpool status –v command cannot display all corrupted data paths. The current zpool status reporting of corrupted data is limited by the amount of metadata corruption and if any blocks have been reused after the zpool status command is executed. Deduplicated blocks makes reporting all corrupted data even more complicated.

If you have corrupted data and the zpool status –v command identifies that snapshot data is impacted, then considering running the following command to identify additional corrupted paths:

# find mount-point -inum $inode -print
# find mount-point/.zfs/snapshot -inum $inode -print

The first command searches for the inode number of the reported corrupted data in the specified file system and all its snapshots. The second command searches for snapshots with the same inode number.

Repairing ZFS Storage Pool-Wide Damage

If the damage is in pool metadata and that damage prevents the pool from being opened or imported, then the following options are available to you:

You can attempt to recover the pool by using the zpool clear –F command or the zpool import –F command. These commands attempt to roll back the last few pool transactions to an operational state. You can use the zpool status command to review a damaged pool and the recommended recovery steps. For example:

# zpool status
pool: storpool
state: UNAVAIL
status: The pool metadata is corrupted and the pool cannot be opened.
action: Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Fri Jun 29 17:22:49 2012
should correct the problem.  Approximately 5 seconds of data
must be discarded, irreversibly.  Recovery can be attempted
by executing 'zpool clear -F tpool'.  A scrub of the pool
is strongly recommended after recovery.
see: http://support.oracle.com/msg/ZFS-8000-72
scrub: none requested
config:

NAME         STATE     READ WRITE CKSUM
storpool     UNAVAIL      0     0     1  corrupted data
   c1t1d0    ONLINE       0     0     2
   c1t3d0    ONLINE       0     0     4

The recovery process as described in the preceding output is to use the following command:

# zpool clear -F storpool

If you attempt to import a damaged storage pool, you will see messages similar to the following:

# zpool import storpool
cannot import 'storpool': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Fri Jun 29 17:22:49 2012
should correct the problem.  Approximately 5 seconds of data
must be discarded, irreversibly.  Recovery can be attempted
by executing 'zpool import -F storpool'.  A scrub of the pool
is strongly recommended after recovery.

The recovery process as described in the preceding output is to use the following command:

# zpool import -F storpool
Pool storpool returned to its state as of Fri Jun 29 17:22:49 2012.
Discarded approximately 5 seconds of transactions

If the damaged pool is in the zpool.cache file, the problem is discovered when the system is booted, and the damaged pool is reported in the zpool status command. If the pool isn't in the zpool.cache file, it won't successfully import or open and you will see the damaged pool messages when you attempt to import the pool.

You can import a damaged pool in read-only mode. This method enables you to import the pool so that you can access the data. For example:
```
# zpool import -o readonly=on storpool
```
For more information about importing a pool read-only, see Importing a Pool in Read-Only Mode.
You can import a pool with a missing log device by using the zpool import –m command. For more information, see Importing a Pool With a Missing Log Device.
If the pool cannot be recovered by either pool recovery method, you must restore the pool and all its data from a backup copy. The mechanism you use varies widely depending on the pool configuration and backup strategy. First, save the configuration as displayed by the zpool status command so that you can re-create it after the pool is destroyed. Then, use the zpool destroy –f command to destroy the pool.

Also, keep a file describing the layout of the datasets and the various locally set properties somewhere safe, as this information will become inaccessible if the pool is ever rendered inaccessible. With the pool configuration and dataset layout, you can reconstruct your complete configuration after destroying the pool. The data can then be populated by using whatever backup or restoration strategy you use.