ZFS Storage Pool Maintenance and Monitoring Practices

Keep pool capacity below 90% for best performance.
Pool performance can degrade when a pool is very full and its file systems are updated frequently, such as on a busy mail server. A full pool might incur a performance penalty, but causes no other issues. If the primary workload is immutable files, you can keep the pool in the 95-96% utilization range. Even with mostly static content, however, write, read, and resilvering performance might suffer at 95-96% utilization.
- Monitor pool and file system space to make sure that they are not full.
- Consider using ZFS quotas and reservations to make sure that file system space does not exceed 90% of pool capacity.
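
The capacity check above can be automated. The following is a minimal Python sketch, not a definitive tool: it assumes that `zpool list -H -o name,capacity` prints one pool per line with a percentage value, and the function names and sample data are hypothetical.

```python
import subprocess

CAPACITY_LIMIT = 90  # percent; the threshold recommended above


def parse_capacity(text):
    """Parse 'zpool list -H -o name,capacity' output into {pool: percent}."""
    pools = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, cap = fields
        cap = cap.rstrip("%")
        if not cap.isdigit():
            continue  # capacity not reported, for example "-"
        pools[name] = int(cap)
    return pools


def pools_over_limit(text, limit=CAPACITY_LIMIT):
    """Return the sorted names of pools at or above the limit."""
    return sorted(n for n, pct in parse_capacity(text).items() if pct >= limit)


def read_zpool_capacity():
    """Run zpool list in scripted mode; needs a host with ZFS installed."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output (no header line because of -H).
SAMPLE = "rpool\t41%\npond\t93%\n"
```

On a live system, `pools_over_limit(read_zpool_capacity())` would return the pools that need attention. The parsing is kept separate from the command invocation so that it can be checked against saved output.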
Monitor pool health
- Monitor a redundant pool with zpool status and fmdump at least once per week
- Monitor a non-redundant pool with zpool status and fmdump at least twice per week
Run zpool scrub on a regular basis to identify data integrity problems.
- If you have consumer-quality drives, consider a weekly scrubbing schedule.
- If you have datacenter-quality drives, consider a monthly scrubbing schedule.
- You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational.
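
The scrubbing schedule above reduces to a simple date comparison. The sketch below assumes that "weekly" means 7 days and "monthly" means 30 days, and that you record the date of the last scrub yourself. The drive-class labels and function name are hypothetical.

```python
from datetime import date, timedelta

# Assumed interval lengths: weekly = 7 days, monthly = 30 days.
SCRUB_INTERVAL_DAYS = {"consumer": 7, "datacenter": 30}


def scrub_due(last_scrub, today, drive_class):
    """Return True if the interval for this drive class has elapsed."""
    interval = timedelta(days=SCRUB_INTERVAL_DAYS[drive_class])
    return today - last_scrub >= interval
```

For example, a pool of consumer-quality drives last scrubbed on June 1 is due again on June 8, while a pool of datacenter-quality drives scrubbed on the same day is not.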
Monitoring pool or device failures - Use zpool status as described below. Also use fmdump or fmdump -eV to see if any device faults or errors have occurred.
- For redundant pools, monitor pool health with zpool status and fmdump once per week.
- For non-redundant pools, monitor pool health with zpool status and fmdump twice per week.
Pool device is UNAVAIL or OFFLINE – If a pool device is not available, check whether the device is listed in the format command output. A device that is not listed in the format output is not visible to ZFS.
If a pool device is in the UNAVAIL or OFFLINE state, this generally means that the device has failed, that a cable has become disconnected, or that some other hardware problem, such as a bad cable or bad controller, has made the device inaccessible.
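
Device states can be picked out of saved zpool status output with a short script. This is a rough Python sketch that assumes the usual layout, in which each row of the config section starts with a device name followed by its state. The sample text, device names, and function name are hypothetical.

```python
# States that the text above treats as "device not accessible".
BAD_STATES = {"UNAVAIL", "OFFLINE"}


def unavailable_devices(status_text):
    """Return (name, state) pairs for config rows in a bad state."""
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0].endswith(":"):
            continue  # header lines such as "state: DEGRADED"
        if fields[1] in BAD_STATES:
            found.append((fields[0], fields[1]))
    return found


# Hypothetical zpool status excerpt for a two-way mirror.
SAMPLE = """\
  pool: pond
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        pond        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  UNAVAIL      0     0     0
"""
```

Any device that the function reports is a candidate for the format check described above.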

Consider configuring the smtp-notify service to notify you when a hardware component is diagnosed as faulty. For more information, see the Notification Parameters section of smf(5) and smtp-notify(1M).

By default, some notifications are set up automatically to be sent to the root user. If you add an alias for your user account as root in the /etc/aliases file, you will receive electronic mail notifications, similar to the following:

From noaccess@tardis.space.com Fri Jun 29 16:58:59 2012
Date: Fri, 29 Jun 2012 16:58:58 -0600 (MDT)
From: No Access User <noaccess@tardis.space.com>
Message-Id: <201206292258.q5TMwwFL002753@tardis.space.com>
Subject: Fault Management Event: tardis:ZFS-8000-8A
To: root@tardis.central.com
Content-Length: 771

SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Fri Jun 29 16:58:58 MDT 2012
PLATFORM: ORCL,SPARC-T3-4, CSN: 1120BDRCCD, HOSTNAME: tardis
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 76c2d1d1-4631-4220-dbbc-a3574b1ee807
DESC: A file or directory in pool 'pond' could not be read due to corrupt data.
AUTO-RESPONSE: No automated response will occur.
IMPACT: The file or directory is unavailable.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -xv' and examine the list of damaged files to determine what
has been affected. Please refer to the associated reference document at
http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures
and policies regarding this diagnosis.

Monitor your storage pool space – Use the zpool list and zfs list commands to identify how much disk space is consumed by file system data. ZFS snapshots also consume disk space, and if the zfs list command does not list them, they can consume that space silently. Use the zfs list -t snapshot command to identify disk space that is consumed by snapshots.
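
To see which snapshots hold the most space, the snapshot listing can be sorted by size. The Python sketch below assumes output from `zfs list -H -o name,used -t snapshot` with human-readable sizes that use binary suffixes (K, M, G, and so on). The sample data and function names are hypothetical. Note that a snapshot's USED value counts only the space unique to that snapshot, so the per-snapshot figures are a guide rather than an exact total.

```python
# Binary multipliers for the size suffixes that zfs prints.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3,
         "T": 1024**4, "P": 1024**5}


def parse_size(text):
    """Convert a size such as '19K', '1.5G', or '0' to bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(float(text[:-1]) * UNITS[suffix])
    return int(float(text))


def snapshots_by_size(listing):
    """Return (snapshot, bytes) pairs, largest first."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[1] == "-":
            continue  # skip blank lines and unreported sizes
        rows.append((fields[0], parse_size(fields[1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical sample output (no header line because of -H).
SAMPLE = (
    "pond/home@monday\t19K\n"
    "pond/home@tuesday\t1.5G\n"
    "pond/data@backup\t0\n"
)
```

Sorting largest first puts the snapshots most worth reviewing at the top of the list.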