ZFS Storage Pool Maintenance and Monitoring Practices
-
Make sure that pool capacity is below 90% for best performance.
Pool performance can degrade when a pool is very full and its file systems are updated frequently, such as on a busy mail server. Full pools might incur a performance penalty but cause no other issues. If the primary workload is immutable files, you can keep the pool at up to 95-96% utilization. Even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.
-
Monitor pool and file system space to make sure that they are not full.
-
Consider using ZFS quotas and reservations to make sure that file system space does not exceed 90% of pool capacity.
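The 90% guideline above can be checked from a script. The following is a minimal sketch, not a definitive tool: the function name pools_over_capacity is hypothetical, and it assumes input lines of the form "pool, tab, capacity%", which is the shape that zpool list -H -o name,capacity is expected to produce.

```shell
#!/bin/sh
# Print every pool whose capacity exceeds a threshold (default 90).
# Reads "<pool><TAB><capacity>%" lines on stdin; this input shape is an
# assumption based on `zpool list -H -o name,capacity`.
# pools_over_capacity is a hypothetical helper name.
pools_over_capacity() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        {
            cap = $2
            sub(/%$/, "", cap)              # strip the trailing percent sign
            if (cap + 0 > limit + 0)
                printf "%s %d%%\n", $1, cap
        }'
}

# Typical use on a host with ZFS (not run here):
#   zpool list -H -o name,capacity | pools_over_capacity 90
```

Quotas and reservations themselves are set per file system with zfs set quota=size and zfs set reservation=size; the sizes to choose depend on your pool and are not suggested here.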
-
-
Monitor pool health
-
Monitor a redundant pool with zpool status and fmdump at least once per week.
-
Monitor a non-redundant pool with zpool status and fmdump at least twice per week.
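A periodic health check along these lines could be scripted as follows. This is a sketch under stated assumptions: the function name pools_healthy is hypothetical, and it relies on zpool status -x printing the single line "all pools are healthy" when nothing is wrong. Any other output, including the message for a system with no pools, is treated as a problem.

```shell
#!/bin/sh
# Return 0 when `zpool status -x` reports no problems, 1 otherwise.
# Assumption: a healthy system prints exactly "all pools are healthy".
# pools_healthy is a hypothetical helper name.
pools_healthy() {
    report=$(zpool status -x 2>&1)
    if [ "$report" = "all pools are healthy" ]; then
        return 0
    fi
    # Pass the full report through so it reaches the cron mail or log.
    printf '%s\n' "$report" >&2
    return 1
}

# Typical use (not run here): show the fault log only when a check fails.
#   pools_healthy || fmdump
```

Running such a check from cron once or twice a week would match the intervals recommended above.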
-
-
Run zpool scrub on a regular basis to identify data integrity problems. Scrub scheduling is enabled to run every 30 days by default. You can use the scrubinterval property to disable scrub scheduling or change the interval at which scrubs run. See the zpool(8) man page.
-
If you have consumer-quality drives, consider a weekly scrubbing schedule.
-
If you have datacenter-quality drives, consider a monthly scrubbing schedule.
-
You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational.
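Where a scrub is wanted outside the default schedule, for example before replacing a device, it can be started by hand. The sketch below is illustrative only: scrub_pools is a hypothetical helper, and it assumes pool names arrive one per line, as zpool list -H -o name is expected to print them.

```shell
#!/bin/sh
# Start a scrub on each pool named on stdin, one pool name per line.
# Assumption: names come from `zpool list -H -o name`.
# scrub_pools is a hypothetical helper name.
scrub_pools() {
    while IFS= read -r pool; do
        [ -n "$pool" ] || continue          # skip blank lines
        echo "starting scrub of $pool"
        # Redirect stdin so the command cannot consume the pool list.
        zpool scrub "$pool" </dev/null ||
            echo "scrub of $pool could not be started" >&2
    done
}

# Typical use (not run here):
#   zpool list -H -o name | scrub_pools
# Inspect the scheduling property mentioned above for one pool:
#   zpool get scrubinterval <pool>
```

The values accepted by scrubinterval are not shown here; check the zpool(8) man page on your release before changing it.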
-
-
Monitoring pool or device failures – Use zpool status as described below. Also use fmdump or fmdump -eV to see if any device faults or errors have occurred.
-
For redundant pools, monitor pool health with zpool status and fmdump on a weekly basis.
-
For non-redundant pools, monitor pool health with zpool status and fmdump on a semiweekly basis.
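When reviewing zpool status output by script, it can help to pull out only the rows that are not ONLINE. The following is a rough sketch: unhealthy_devices is a hypothetical helper, and it assumes the configuration rows have the usual "NAME STATE READ WRITE CKSUM" column layout. Pool and virtual device rows in a non-ONLINE state are reported along with the disks.

```shell
#!/bin/sh
# List rows that `zpool status` reports in a state other than ONLINE.
# Reads `zpool status` output on stdin; prints "<name> <state>" lines.
# Assumption: config rows look like "NAME STATE READ WRITE CKSUM [note]".
# unhealthy_devices is a hypothetical helper name.
unhealthy_devices() {
    awk '
        # NF >= 5 skips the " state: DEGRADED" summary line, which has
        # only two fields, and keeps the per-device rows.
        NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ {
            print $1, $2
        }'
}

# Typical use (not run here):
#   zpool status | unhealthy_devices
```

An empty result means no row matched one of the listed states; it does not replace reading the full zpool status and fmdump output.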
-
-
Pool device is UNAVAIL or OFFLINE – If a pool device is not available, check whether the device is listed in the format command output. If the device is not listed in the format output, then it will not be visible to ZFS.
If a pool device has UNAVAIL or OFFLINE status, this generally means that the device has failed, a cable has been disconnected, or some other hardware problem, such as a bad cable or bad controller, has made the device inaccessible.
-
Consider configuring the smtp-notify service to notify you when a hardware component is diagnosed as faulty. For more information, see the Notification Parameters section of smf(7) and smtp-notify(8).
By default, some notifications are set up automatically to be sent to the root user. If you add an alias that maps root to your user account in the /etc/aliases file, you will receive electronic mail notifications with information similar to the following:
SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Fri Jun 29 16:58:58 MDT 2012
...
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 76c2d1d1-4631-4220-dbbc-a3574b1ee807
DESC: A file or directory in pool 'pond' could not be read due to corrupt data.
AUTO-RESPONSE: No automated response will occur.
IMPACT: The file or directory is unavailable.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.
-
Monitor your storage pool space – Use the zpool list command and the zfs list command to identify how much disk space is consumed by file system data. ZFS snapshots also consume disk space, and if they are not listed by the zfs list command, they can do so silently. Use the zfs list -t snapshot command to identify disk space that is consumed by snapshots.
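To see which file systems carry the most snapshot space, the per-snapshot figures can be grouped by dataset. This is a sketch only: snapshot_usage_by_dataset is a hypothetical helper, and it assumes input lines of the form "dataset@snapshot, tab, used bytes". Getting exact byte counts from zfs list -t snapshot -o name,used depends on a parseable-output option (such as -p) being available in your zfs version, which is an assumption here.

```shell
#!/bin/sh
# Sum snapshot space per dataset.
# Reads "<dataset>@<snapshot><TAB><used-bytes>" lines on stdin and prints
# "<dataset> <total-bytes>", sorted by dataset name.
# Assumption: the used column holds plain byte counts, not values like 1.2G.
# snapshot_usage_by_dataset is a hypothetical helper name.
snapshot_usage_by_dataset() {
    awk '
        {
            split($1, parts, "@")           # parts[1] is the dataset name
            total[parts[1]] += $2
        }
        END {
            for (ds in total)
                printf "%s %.0f\n", ds, total[ds]
        }' | sort
}

# Typical use (not run here; -p is assumed to be supported):
#   zfs list -H -p -t snapshot -o name,used | snapshot_usage_by_dataset
```

Each snapshot's used value counts only the space unique to that snapshot, so these sums can understate the total. The usedbysnapshots property, shown by zfs list -o space, gives the overall figure per dataset.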