Examples of data problems include the following:
Pool or file system space is missing
Transient I/O errors due to a bad disk or controller
On-disk data corruption due to cosmic rays
Driver bugs resulting in data being transferred to or from the wrong location
A user overwriting portions of the physical device by accident
In some cases, these errors are transient, such as a random I/O error while the controller is having problems. In other cases, the damage is permanent, such as on-disk corruption. Even so, whether the damage is permanent does not necessarily indicate that the error is likely to occur again. For example, if you accidentally overwrite part of a disk, no type of hardware failure has occurred, and the device does not need to be replaced. Identifying the exact problem with a device is not an easy task and is covered in more detail in a later section.
Review the following sections if you are unsure how ZFS reports file system and pool space accounting.
The zpool list and zfs list commands are better than the legacy df and du commands for determining your available pool and file system space. With the legacy commands, you cannot easily distinguish between pool and file system space, and the legacy commands do not account for space that is consumed by descendant file systems or snapshots.
For example, the following root pool (rpool) has 5.46 GB allocated and 68.5 GB free.
# zpool list rpool
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool   74G  5.46G  68.5G   7%  1.00x  ONLINE  -
If you compare the pool space accounting with the file system space accounting by reviewing the USED column of your individual file systems, you can see that the pool space that is reported in ALLOC is accounted for in the file systems' USED total. For example:
# zfs list -r rpool
NAME                      USED  AVAIL  REFER  MOUNTPOINT
rpool                    5.41G  67.4G  74.5K  /rpool
rpool/ROOT               3.37G  67.4G    31K  legacy
rpool/ROOT/solaris       3.37G  67.4G  3.07G  /
rpool/ROOT/solaris/var    302M  67.4G   214M  /var
rpool/dump               1.01G  67.5G  1000M  -
rpool/export             97.5K  67.4G    32K  /rpool/export
rpool/export/home        65.5K  67.4G    32K  /rpool/export/home
rpool/export/home/admin  33.5K  67.4G  33.5K  /rpool/export/home/admin
rpool/swap               1.03G  67.5G  1.00G  -
The SIZE value that is reported by the zpool list command is generally the amount of physical disk space in the pool, but varies depending on the pool's redundancy level. See the examples below. The zfs list command lists the usable space that is available to file systems, which is disk space minus ZFS pool redundancy metadata overhead, if any.
The following ZFS dataset configurations are tracked as allocated space by the zfs list command but they are not tracked as allocated space in the zpool list output:
ZFS file system quota
ZFS file system reservation
ZFS logical volume size
The following items describe how different pool configurations, ZFS volumes, and ZFS reservations can affect your consumed and available disk space. Depending on your configuration, monitor pool space by using the steps listed below.
Non-redundant storage pool – When a pool is created with one 136-GB disk, the zpool list command reports SIZE and initial FREE values as 136 GB. The initial AVAIL space reported by the zfs list command is 134 GB, due to a small amount of pool metadata overhead. For example:
# zpool create system1 c0t6d0
# zpool list system1
NAME     SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
system1  136G  95.5K  136G   0%  1.00x  ONLINE  -
# zfs list system1
NAME     USED  AVAIL  REFER  MOUNTPOINT
system1   72K   134G    21K  /system1
Mirrored storage pool – When a pool is created with two 136-GB disks, the zpool list command reports SIZE as 136 GB and the initial FREE value as 136 GB. This reporting is referred to as the deflated space value. The initial AVAIL space reported by the zfs list command is 134 GB, due to a small amount of pool metadata overhead. For example:
# zpool create system1 mirror c0t6d0 c0t7d0
# zpool list system1
NAME     SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
system1  136G  95.5K  136G   0%  1.00x  ONLINE  -
# zfs list system1
NAME     USED  AVAIL  REFER  MOUNTPOINT
system1   72K   134G    21K  /system1
RAID-Z storage pool – When a raidz2 pool is created with three 136-GB disks, the zpool list command reports SIZE as 408 GB and the initial FREE value as 408 GB. This reporting is referred to as the inflated disk space value, which includes redundancy overhead, such as parity information. The initial AVAIL space reported by the zfs list command is 133 GB, due to the pool redundancy overhead. The space discrepancy between the zpool list and the zfs list output for a RAID-Z pool exists because zpool list reports the inflated pool space. For example:
# zpool create system1 raidz2 c0t6d0 c0t7d0 c0t8d0
# zpool list system1
NAME     SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
system1  408G   286K  408G   0%  1.00x  ONLINE  -
# zfs list system1
NAME      USED  AVAIL  REFER  MOUNTPOINT
system1  73.2K   133G  20.9K  /system1
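The three pool examples above follow simple arithmetic, which the following sketch reproduces. It is an illustration only: the pool_space function and its arguments are hypothetical names, it assumes a pool with a single top-level virtual device built from equal-size disks, and it ignores the metadata overhead that lowers the reported AVAIL value by a few gigabytes.

```shell
# Nominal figures, before metadata overhead, for a pool with one top-level vdev.
# Usage: pool_space KIND DISKS DISK_GB
#   KIND is one of: stripe, mirror, raidz1, raidz2, raidz3
# Prints "<zpool list SIZE> <upper bound for zfs list AVAIL>" in GB.
pool_space() {
    kind=$1
    disks=$2
    disk_gb=$3
    case $kind in
        stripe)
            size=$((disks * disk_gb))
            usable=$size
            ;;
        mirror)
            # Deflated value: only one copy of the data is counted.
            size=$disk_gb
            usable=$disk_gb
            ;;
        raidz1|raidz2|raidz3)
            parity=${kind#raidz}
            # Inflated value: parity space is included in SIZE.
            size=$((disks * disk_gb))
            usable=$(((disks - parity) * disk_gb))
            ;;
        *)
            echo "unknown pool kind: $kind" >&2
            return 1
            ;;
    esac
    echo "$size $usable"
}

pool_space stripe 1 136   # prints: 136 136
pool_space mirror 2 136   # prints: 136 136
pool_space raidz2 3 136   # prints: 408 136
```

For the raidz2 case, the 408 GB figure matches the SIZE column, and the 136 GB ceiling is consistent with the 133 GB that zfs list reports as AVAIL.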
NFS mounted file system space – Neither the zpool list nor the zfs list command accounts for NFS mounted file system space. However, local data files can be hidden under a mounted NFS file system. If you are missing file system space, ensure that you do not have data files hidden under an NFS file system.
Using ZFS Volumes – When a ZFS file system is created and pool space is consumed, you can view the file system space consumption by using the zpool list command. For example:
# zpool create nova mirror c1t1d0 c2t1d0
# zfs create nova/fs1
# mkfile 10g /nova/fs1/file1_10g
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME       USED  AVAIL  REFER  MOUNTPOINT
nova      10.0G  56.9G    32K  /nova
nova/fs1  10.0G  56.9G  10.0G  /nova/fs1
If you create a 10-GB ZFS volume, the space is not accounted for in the zpool list command. The space is accounted for in the zfs list command. If you are using ZFS volumes in your storage pools, monitor ZFS volume space consumption by using the zfs list command. For example:
# zfs create -V 10g nova/vol1
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       20.3G  46.6G    32K  /nova
nova/fs1   10.0G  46.6G  10.0G  /nova/fs1
nova/vol1  10.3G  56.9G    16K  -
Note in the preceding output that ZFS volume space is not tracked in the zpool list output. Use the zfs list or the zfs list -o space command to identify space that is consumed by ZFS volumes.
In addition, because ZFS volumes act like raw devices, some amount of space for metadata is automatically reserved through the refreservation property, which causes volumes to consume slightly more space than the amount specified when the volume was created. Do not remove the refreservation on ZFS volumes or you risk running out of volume space.
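One way to see how much space is held by volumes without being written yet is to subtract the pool's ALLOC value from the USED value of the top-level dataset. The following sketch does that arithmetic on values copied from the command output. The reserved_gap function is a hypothetical helper, and it assumes a POSIX awk (on Oracle Solaris, nawk or /usr/xpg4/bin/awk rather than the legacy awk).

```shell
# Usage: reserved_gap USED ALLOC
# Both arguments are human-readable sizes such as 20.3G or 1000M.
# Prints USED minus ALLOC in GB, rounded to one decimal place.
reserved_gap() {
    awk -v used="$1" -v alloc="$2" '
        # Convert a size with a K, M, G, or T suffix to GB.
        function gb(s,    n, u) {
            u = substr(s, length(s))
            n = substr(s, 1, length(s) - 1) + 0
            if (u == "K") return n / 1048576
            if (u == "M") return n / 1024
            if (u == "G") return n
            if (u == "T") return n * 1024
            return (s + 0) / 1073741824    # no suffix: treat as bytes
        }
        BEGIN { printf "%.1fG\n", gb(used) - gb(alloc) }'
}

# Values from the nova example: zfs list USED and zpool list ALLOC.
reserved_gap 20.3G 10.0G   # prints: 10.3G
```

In the nova example, the 10.3 GB difference corresponds to the refreservation of nova/vol1.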
Using ZFS Reservations – If you create a file system with a reservation or add a reservation to an existing file system, reservations or refreservations are not tracked by the zpool list command.
Identify space that is consumed by file system reservations by using the zfs list -r command to identify the increased USED space. For example:
# zfs create -o reservation=10g nova/fs2
# zpool list nova
NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
nova   68G  10.0G  58.0G  14%  1.00x  ONLINE  -
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       30.3G  36.6G    33K  /nova
nova/fs1   10.0G  36.6G  10.0G  /nova/fs1
nova/fs2     31K  46.6G    31K  /nova/fs2
nova/vol1  10.3G  46.9G    16K  -
If you create a file system with a refreservation, it can be identified by using the zfs list -r command. For example:
# zfs create -o refreservation=10g nova/fs3
# zfs list -r nova
NAME        USED  AVAIL  REFER  MOUNTPOINT
nova       40.3G  26.6G    35K  /nova
nova/fs1   10.0G  26.6G  10.0G  /nova/fs1
nova/fs2     31K  36.6G    31K  /nova/fs2
nova/fs3     10G  36.6G    31K  /nova/fs3
nova/vol1  10.3G  36.9G    16K  -
Use the following command to identify all existing reservations to account for total USED space.
# zfs get -r reserv,refreserv nova
NAME       PROPERTY        VALUE  SOURCE
nova       reservation     none   default
nova       refreservation  none   default
nova/fs1   reservation     none   default
nova/fs1   refreservation  none   default
nova/fs2   reservation     10G    local
nova/fs2   refreservation  none   default
nova/fs3   reservation     none   default
nova/fs3   refreservation  10G    local
nova/vol1  reservation     none   default
nova/vol1  refreservation  10.3G  local
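On a pool with many datasets, most lines of that output report a value of none. The following sketch filters the list down to the datasets that actually hold a reservation. The active_reservations function is a hypothetical helper. It assumes tab-separated input with no header, as produced by the -H and -o options of zfs get, and a POSIX awk.

```shell
# Reads "name<TAB>property<TAB>value" lines on standard input and prints
# only the entries whose value is something other than "none".
active_reservations() {
    awk -F '\t' '$3 != "none" { printf "%s %s %s\n", $1, $2, $3 }'
}

# Sample input that mimics two lines of the nova output above.
printf 'nova/fs1\treservation\tnone\nnova/fs2\treservation\t10G\n' |
    active_reservations
# prints: nova/fs2 reservation 10G
```

On a live system, the input would come from a command such as zfs get -r -H -o name,property,value reservation,refreservation nova.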
No fsck utility equivalent exists for ZFS. This utility has traditionally served two purposes, those of file system repair and file system validation.
With traditional file systems, the way in which data is written is inherently vulnerable to unexpected failure causing file system inconsistencies. Because a traditional file system is not transactional, unreferenced blocks, bad link counts, or other inconsistent file system structures are possible. The addition of journaling does solve some of these problems, but can introduce additional problems when the log cannot be rolled back. The only way for inconsistent data to exist on disk in a ZFS configuration is through hardware failure (in which case the pool should have been redundant) or when a bug exists in the ZFS software.
The fsck utility repairs known problems specific to UFS file systems. Most ZFS storage pool problems are generally related to failing hardware or power failures. Many problems can be avoided by using redundant pools. If your pool is damaged due to failing hardware or a power outage, see Repairing ZFS Storage Pool-Wide Damage.
If your pool is not redundant, the risk that file system corruption can render some or all of your data inaccessible is always present.
In addition to performing file system repair, the fsck utility validates that the data on disk has no problems. Traditionally, this task requires unmounting the file system and running the fsck utility, possibly taking the system to single-user mode in the process. This scenario results in downtime that is proportional to the size of the file system being checked. Instead of requiring an explicit utility to perform the necessary checking, ZFS provides a mechanism to perform routine checking of all inconsistencies. This feature, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in a hardware or software failure.
Whenever ZFS encounters an error, either through scrubbing or when accessing a file on demand, the error is logged internally so that you can obtain a quick overview of all known errors within the pool.
The simplest way to check data integrity is to initiate an explicit scrubbing of all data within the pool. This operation traverses all the data in the pool once and verifies that all blocks can be read. Scrubbing proceeds as fast as the devices allow, though the priority of any I/O remains below that of normal operations. This operation might negatively impact performance, though the pool's data should remain usable and nearly as responsive while the scrubbing occurs. To initiate an explicit scrub, use the zpool scrub command. For example:
# zpool scrub system1
The status of the current scrubbing operation can be displayed by using the zpool status command. For example:
# zpool status -v system1
  pool: system1
 state: ONLINE
  scan: scrub in progress since Mon Jun 7 12:07:52 2010
        201M scanned out of 222M at 9.55M/s, 0h0m to go
        0 repaired, 90.44% done
config:
        NAME        STATE     READ WRITE CKSUM
        system1     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
errors: No known data errors
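If a script needs to wait for a scrub to finish, it can test the scan line of this output. The following is a minimal sketch, assuming the phrase "scrub in progress" appears exactly as in the sample output above. The scrub_state function name is hypothetical, and the wording of the scan line might differ between ZFS releases.

```shell
# Reads zpool status output on standard input and prints one word:
#   running  if the output reports a scrub in progress
#   idle     otherwise
scrub_state() {
    if grep 'scrub in progress' >/dev/null; then
        echo running
    else
        echo idle
    fi
}

# Example polling loop, left commented out because it needs a real pool:
# while [ "$(zpool status system1 | scrub_state)" = running ]; do
#     sleep 60
# done
```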
Only one active scrubbing operation per pool can occur at one time.
You can stop a scrubbing operation that is in progress by using the -s option. For example:
# zpool scrub -s system1
In most cases, a scrubbing operation to ensure data integrity should continue to completion. Stop a scrubbing operation at your own discretion if system performance is impacted by the operation.
Performing routine scrubbing guarantees continuous I/O to all disks on the system. Routine scrubbing has the side effect of preventing power management from placing idle disks in low-power mode. If the system is generally performing I/O all the time, or if power consumption is not a concern, then this issue can safely be ignored. If the system is largely idle, and you want to conserve power to the disks, you should consider using a cron scheduled explicit scrub rather than background scrubbing. This will still perform complete scrubs of data, though it will only generate a large amount of I/O until the scrubbing is finished, at which point the disks can be power managed as normal. The downside (besides increased I/O) is that there will be large periods of time when no scrubbing is being done at all, potentially increasing the risk of corruption during those periods.
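A cron-scheduled scrub comes down to a single crontab line. The following sketch prints such a line for a given pool. The scrub_cron_entry function name, the weekly Sunday 02:00 schedule, and the /usr/sbin/zpool path are all assumptions to adapt to your system.

```shell
# Prints a crontab line that scrubs the named pool every Sunday at 02:00.
# Usage: scrub_cron_entry POOL
scrub_cron_entry() {
    printf '0 2 * * 0 /usr/sbin/zpool scrub %s\n' "$1"
}

scrub_cron_entry system1
# prints: 0 2 * * 0 /usr/sbin/zpool scrub system1
```

The printed line can be added to the root crontab with crontab -e.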
For more information about interpreting zpool status output, see Querying ZFS Storage Pool Status.
When a device is replaced, a resilvering operation is initiated to move data from the good copies to the new device. This action is a form of disk scrubbing. Therefore, only one such action can occur at a given time in the pool. If a scrubbing operation is in progress, a resilvering operation suspends the current scrubbing and restarts it after the resilvering is completed.
For more information about resilvering, see Viewing Resilvering Status.
Data corruption occurs when one or more device errors (indicating one or more missing or damaged devices) affect a top-level virtual device. For example, one half of a mirror can experience thousands of device errors without ever causing data corruption. If an error is encountered on the other side of the mirror in the exact same location, corrupted data is the result.
Data corruption is always permanent and requires special consideration during repair. Even if the underlying devices are repaired or replaced, the original data is lost forever. Most often, this scenario requires restoring data from backups. Data errors are recorded as they are encountered, and they can be controlled through routine pool scrubbing as explained in the following section. When a corrupted block is removed, the next scrubbing pass recognizes that the corruption is no longer present and removes any trace of the error from the system.
The following sections describe how to identify the type of data corruption and how to repair the data, if possible.
ZFS uses checksums, redundancy, and self-healing data to minimize the risk of data corruption. Nonetheless, data corruption can occur if a pool isn't redundant, if corruption occurred while a pool was degraded, or if an unlikely series of events conspired to corrupt multiple copies of a piece of data. Regardless of the source, the result is the same: The data is corrupted and therefore no longer accessible. The action taken depends on the type of data being corrupted and its relative value. Two basic types of data can be corrupted:
Pool metadata – ZFS requires a certain amount of data to be parsed to open a pool and access datasets. If this data is corrupted, the entire pool or portions of the dataset hierarchy will become unavailable.
Object data – In this case, the corruption is within a specific file or directory. This problem might result in a portion of the file or directory being inaccessible, or this problem might cause the object to be broken altogether.
Data is verified during normal operations as well as through a scrubbing. For information about how to verify the integrity of pool data, see Checking ZFS File System Integrity.
By default, the zpool status command shows only that corruption has occurred, but not where this corruption occurred. For example:
# zpool status system1
  pool: system1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://support.oracle.com/msg/ZFS-8000-8A
config:
        NAME                     STATE     READ WRITE CKSUM
        system1                  ONLINE       4     0     0
          c0t5000C500335E106Bd0  ONLINE       0     0     0
          c0t5000C500335FC3E7d0  ONLINE       4     0     0
errors: 2 data errors, use '-v' for a list
Each error indicates only that an error occurred at a given point in time. Each error is not necessarily still present on the system. Under normal circumstances, this is the case. Certain temporary outages might result in data corruption that is automatically repaired after the outage ends. A complete scrub of the pool is guaranteed to examine every active block in the pool, so the error log is reset whenever a scrub finishes. If you determine that the errors are no longer present, and you don't want to wait for a scrub to complete, reset all errors in the pool by using the zpool online command.
If the data corruption is in pool-wide metadata, the output is slightly different. For example:
# zpool status -v morpheus
  pool: morpheus
    id: 13289416187275223932
 state: UNAVAIL
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
   see: http://support.oracle.com/msg/ZFS-8000-72
config:
        morpheus   FAULTED   corrupted data
          c1t10d0  ONLINE
In the case of pool-wide corruption, the pool is placed into the FAULTED state because the pool cannot provide the required redundancy level.
If a file or directory is corrupted, the system might still function, depending on the type of corruption. Any damage is effectively unrecoverable if no good copies of the data exist on the system. If the data is valuable, you must restore the affected data from backup. Even so, you might be able to recover from this corruption without restoring the entire pool.
If the damage is within a file data block, then the file can be safely removed, thereby clearing the error from the system. Use the zpool status -v command to display a list of file names with persistent errors. For example:
# zpool status -v system1
  pool: system1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://support.oracle.com/msg/ZFS-8000-8A
config:
        NAME                     STATE     READ WRITE CKSUM
        system1                  ONLINE       4     0     0
          c0t5000C500335E106Bd0  ONLINE       0     0     0
          c0t5000C500335FC3E7d0  ONLINE       4     0     0
errors: Permanent errors have been detected in the following files:
        /system1/file.1
        /system1/file.2
The list of file names with persistent errors might be described as follows:
If the full path to the file is found and the dataset is mounted, the full path to the file is displayed. For example:
/path1/a.txt
If the full path to the file is found, but the dataset is not mounted, then the dataset name with no preceding slash (/), followed by the path within the dataset to the file, is displayed. For example:
path1/documents/e.txt
If the object number to a file path cannot be successfully translated, either due to an error or because the object doesn't have a real file path associated with it, as is the case for a dnode_t, then the dataset name followed by the object's number is displayed. For example:
path1/dnode:<0x0>
If an object in the metaobject set (MOS) is corrupted, then a special tag of <metadata>, followed by the object number, is displayed.
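The four entry formats described above can be told apart by simple pattern matching, as the following sketch shows. The classify_error_entry function and its one-word results are hypothetical. The sample `<metadata>:<0x1>` entry is an assumed rendering of the special tag followed by an object number.

```shell
# Classifies one entry from the "Permanent errors" list of zpool status -v.
# Usage: classify_error_entry ENTRY
classify_error_entry() {
    case $1 in
        '<metadata>'*) echo metadata ;;   # object in the metaobject set
        *':<0x'*'>')   echo object ;;     # dataset name plus object number
        /*)            echo mounted ;;    # full path, dataset is mounted
        *)             echo unmounted ;;  # dataset name plus path, not mounted
    esac
}

for entry in /path1/a.txt path1/documents/e.txt 'path1/dnode:<0x0>' '<metadata>:<0x1>'; do
    printf '%s -> %s\n' "$entry" "$(classify_error_entry "$entry")"
done
```

Only entries classified as mounted can be passed directly to commands such as rm or to a restore operation. The other forms need the dataset to be mounted or the object to be investigated first.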
You can attempt to resolve more minor data corruption by scrubbing the pool and clearing the pool errors in multiple iterations. If the first scrub and clear iteration does not resolve the corrupted files, run them again. For example:
# zpool scrub system1
# zpool clear system1
If the corruption is within a directory or a file's metadata, the only choice is to move the file elsewhere. You can safely move any file or directory to a less convenient location, allowing the original object to be restored in its place.
If a damaged file system has corrupted data with multiple block references, such as snapshots, the zpool status -v command cannot display all corrupted data paths. The current zpool status reporting of corrupted data is limited by the amount of metadata corruption and by whether any blocks have been reused after the zpool status command is executed. Deduplicated blocks make reporting all corrupted data even more complicated.
If you have corrupted data and the zpool status -v command identifies that snapshot data is impacted, then consider running the following commands to identify additional corrupted paths:
# find mount-point -inum $inode -print
# find mount-point/.zfs/snapshot -inum $inode -print
The first command searches for the inode number of the reported corrupted data in the specified file system and all its snapshots. The second command searches for snapshots with the same inode number.
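The $inode value in those commands has to be obtained first. The following sketch looks it up with ls -i and then runs the same kind of find search. The paths_for_inode function name is hypothetical, and the sketch assumes that the damaged file is still visible in the live file system.

```shell
# Prints every path under DIR that has the same inode number as FILE.
# Usage: paths_for_inode FILE DIR
paths_for_inode() {
    # ls -di prints "<inode number> <name>"; keep only the number.
    inode=$(ls -di "$1" | awk '{ print $1 }')
    [ -n "$inode" ] || return 1
    find "$2" -inum "$inode" -print
}

# Example calls, left commented out because they need a real ZFS mount point:
# paths_for_inode /system1/file.1 /system1
# paths_for_inode /system1/file.1 /system1/.zfs/snapshot
```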
If the damage is in pool metadata and that damage prevents the pool from being opened or imported, then the following options are available to you:
You can attempt to recover the pool by using the zpool clear -F command or the zpool import -F command. These commands attempt to roll back the last few pool transactions to an operational state. You can use the zpool status command to review a damaged pool and the recommended recovery steps. For example:
# zpool status
  pool: storpool
 state: UNAVAIL
status: The pool metadata is corrupted and the pool cannot be opened.
action: Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Fri Jun 29 17:22:49 2012
        should correct the problem. Approximately 5 seconds of data
        must be discarded, irreversibly. Recovery can be attempted
        by executing 'zpool clear -F storpool'. A scrub of the pool
        is strongly recommended after recovery.
   see: http://support.oracle.com/msg/ZFS-8000-72
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        storpool    UNAVAIL      0     0     1  corrupted data
          c1t1d0    ONLINE       0     0     2
          c1t3d0    ONLINE       0     0     4
The recovery process as described in the preceding output is to use the following command:
# zpool clear -F storpool
If you attempt to import a damaged storage pool, you will see messages similar to the following:
# zpool import storpool
cannot import 'storpool': I/O error
        Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Fri Jun 29 17:22:49 2012
        should correct the problem. Approximately 5 seconds of data
        must be discarded, irreversibly. Recovery can be attempted
        by executing 'zpool import -F storpool'. A scrub of the pool
        is strongly recommended after recovery.
The recovery process as described in the preceding output is to use the following command:
# zpool import -F storpool
Pool storpool returned to its state as of Fri Jun 29 17:22:49 2012.
Discarded approximately 5 seconds of transactions
If the damaged pool is in the zpool.cache file, the problem is discovered when the system is booted, and the damaged pool is reported in the zpool status command. If the pool isn't in the zpool.cache file, it won't successfully import or open and you will see the damaged pool messages when you attempt to import the pool.
You can import a damaged pool in read-only mode. This method enables you to import the pool so that you can access the data. For example:
# zpool import -o readonly=on storpool
For more information about importing a pool read-only, see Importing a Pool in Read-Only Mode.
You can import a pool with a missing log device by using the zpool import -m command. For more information, see Importing a Pool With a Missing Log Device.
If the pool cannot be recovered by either pool recovery method, you must restore the pool and all its data from a backup copy. The mechanism you use varies widely depending on the pool configuration and backup strategy. First, save the configuration as displayed by the zpool status command so that you can re-create it after the pool is destroyed. Then, use the zpool destroy -f command to destroy the pool.
Also, keep a file describing the layout of the datasets and the various locally set properties somewhere safe, as this information will become inaccessible if the pool is ever rendered inaccessible. With the pool configuration and dataset layout, you can reconstruct your complete configuration after destroying the pool. The data can then be populated by using whatever backup or restoration strategy you use.
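Capturing the pool configuration, dataset layout, and locally set properties could be scripted along the following lines. This is a sketch under assumptions: the save_pool_layout function and the output file names are hypothetical, and the files should be stored on a different pool or system than the one they describe.

```shell
# Saves the pool configuration, dataset hierarchy, and locally set
# properties of POOL as text files in OUTDIR.
# Usage: save_pool_layout POOL OUTDIR
save_pool_layout() {
    pool=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Device layout, needed to re-create the pool.
    zpool status "$pool" > "$outdir/$pool.status" || return 1
    # Dataset hierarchy and mount points.
    zfs list -r -o name,mountpoint "$pool" > "$outdir/$pool.datasets" || return 1
    # Properties whose source is local, not inherited or default.
    zfs get -r -s local all "$pool" > "$outdir/$pool.properties" || return 1
}

# Example call, left commented out because it needs a real pool:
# save_pool_layout storpool /var/tmp/storpool-layout
```

Running such a function from a scheduled job keeps the saved layout current as datasets and properties change.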