Managing ZFS File Systems in Oracle® Solaris 11.3

Updated: May 2019

Recommended Storage Pool Practices

The following sections provide recommended practices for creating and monitoring ZFS storage pools. For information about troubleshooting storage pool problems, see Oracle Solaris ZFS Troubleshooting and Pool Recovery.

General System Practices

  • Keep system up-to-date with latest Oracle Solaris updates and releases

  • Confirm that your controller honors cache flush commands so that you know your data is safely written, which is important before changing the pool's devices or splitting a mirrored storage pool. This is generally not a problem on Oracle/Sun hardware, but it is good practice to confirm that your hardware's cache flushing setting is enabled.

  • Size memory requirements to actual system workload

    • With a known application memory footprint, such as for a database application, you might cap the ARC size so that the application will not need to reclaim its necessary memory from the ZFS cache.

    • Consider deduplication memory requirements

    • Identify ZFS memory usage with the following command:

      # mdb -k
      > ::memstat
      Page Summary                Pages                MB  %Tot
      ------------     ----------------  ----------------  ----
      Kernel                     388117              1516   19%
      ZFS File Data               81321               317    4%
      Anon                        29928               116    1%
      Exec and libs                1359                 5    0%
      Page cache                   4890                19    0%
      Free (cachelist)             6030                23    0%
      Free (freelist)           1581183              6176   76%
      
      Total                     2092828              8175
      Physical                  2092827              8175
      > $q
    • See Document 1663862.1, Memory Management Between ZFS and Applications in Oracle Solaris 11.x, in My Oracle Support (MOS) for tips on tuning the ZFS ARC cache. This document includes a script that you can use to modify the user_reserve_hint_pct memory management parameter.
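
      For example, before and after tuning you might observe the current ARC size and ARC target size with the kstat command (the arcstats statistics shown here are the commonly available counters; names can vary by release):

      # kstat -p zfs:0:arcstats:size zfs:0:arcstats:c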

    • Consider using ECC memory to protect against memory corruption. Silent memory corruption can potentially damage your data.

  • Perform regular backups – Although a pool that is created with ZFS redundancy can help reduce down time due to hardware failures, it is not immune to hardware failures, power failures, or disconnected cables. Make sure that you back up your important data on a regular basis. Ways to provide copies of your data include:

    • Regular or daily ZFS snapshots

    • Weekly backups of ZFS pool data. You can use the zpool split command to create an exact duplicate of a ZFS mirrored storage pool.

    • Monthly backups by using an enterprise-level backup product
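
    For example, assuming a mirrored pool named system1, you might take recursive daily snapshots and split off a weekly duplicate of the pool (the snapshot and pool names are illustrative):

    # zfs snapshot -r system1@daily
    # zpool split system1 system1backup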

  • Hardware RAID

    • Consider using JBOD-mode for storage arrays rather than hardware RAID so that ZFS can manage the storage and the redundancy.

    • Use hardware RAID or ZFS redundancy or both

    • Using ZFS redundancy has many benefits – For production environments, configure ZFS so that it can repair data inconsistencies. Use ZFS redundancy, such as RAID-Z, RAIDZ-2, RAIDZ-3, or mirroring, regardless of the RAID level implemented on the underlying storage device. With such redundancy, faults in the underlying storage device or its connections to the host can be discovered and repaired by ZFS.

    • If you are confident in the redundancy of your hardware RAID solution, then consider using ZFS without ZFS redundancy with your hardware RAID array. However, follow these recommendations to help ensure data integrity.

      • Assign the size of the LUNs and the ZFS storage pool according to your comfort level by considering that ZFS will not be able to resolve data inconsistencies if the hardware RAID array experiences a failure.

      • Create RAID5 LUNs with global hot spares.

      • Monitor both the ZFS storage pool by using zpool status and the underlying LUNs by using your hardware RAID monitoring tools.

      • Promptly replace any failed devices.

      • Scrub your ZFS storage pools routinely, such as monthly, if you are using datacenter-quality drives.

      • Always have good, recent backups of your important data.

    See also Pool Creation Practices on Local or Network Attached Storage Arrays.

  • Crash dumps consume disk space, generally in the range of 1/2 to 3/4 the size of physical memory.
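
    You can review the currently configured dump device and savecore directory with the dumpadm command and size the pool accordingly. For example:

    # dumpadm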

ZFS Storage Pool Creation Practices

The following sections provide general and more specific pool practices.

General Storage Pool Practices

  • Use whole disks to enable disk write cache and provide easier maintenance. Creating pools on slices adds complexity to disk management and recovery.

  • Use ZFS redundancy so that ZFS can repair data inconsistencies.

    • The following message is displayed when a non-redundant pool is created:

      # zpool create system1 c4t1d0 c4t3d0
      'system1' successfully created, but with no redundancy; failure
      of one device will cause loss of the pool
    • For mirrored pools, use mirrored disk pairs

    • For RAID-Z pools, group 3-9 disks per VDEV

    • Do not mix RAID-Z and mirrored components within the same pool. These pools are harder to manage and performance might suffer.

  • Use hot spares to reduce down time due to hardware failures
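
    For example, a hot spare can be added to an existing pool (the pool and device names are illustrative):

    # zpool add system1 spare c5t0d0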

  • Use similar size disks so that I/O is balanced across devices

    • Smaller LUNs can be expanded to larger LUNs

    • Do not expand LUNs across extremely varied sizes, such as from 128 MB to 2 TB, so that optimal metaslab sizes are maintained

  • Consider creating a small root pool and larger data pools to support faster system recovery

  • Recommended minimum pool size is 8 GB. Although the minimum pool size is 64 MB, anything less than 8 GB makes allocating and reclaiming free pool space more difficult.

  • Recommended maximum pool size should comfortably fit your workload or data size. Do not try to store more data than you can routinely back up. Otherwise, your data is at risk due to some unforeseen event.

See also Pool Creation Practices on Local or Network Attached Storage Arrays.

Root Pool Creation Practices

  • SPARC (SMI (VTOC)): Create root pools with slices by using the s* identifier. Do not use the p* identifier. In general, a system's ZFS root pool is created when the system is installed. If you are creating a second root pool or re-creating a root pool, use syntax similar to the following on a SPARC system:

    # zpool create rpool c0t1d0s0

    Or, create a mirrored root pool. For example:

    # zpool create rpool mirror c0t1d0s0 c0t2d0s0
  • Oracle Solaris 11.1 x86 (EFI (GPT)): Create root pools with whole disks by using the d* identifier. Do not use the p* identifier. In general, a system's ZFS root pool is created when the system is installed. If you are creating a second root pool or re-creating a root pool, use syntax similar to the following:

    # zpool create rpool c0t1d0

    Or, create a mirrored root pool. For example:

    # zpool create rpool mirror c0t1d0 c0t2d0
  • The root pool must be created as a mirrored configuration or as a single-disk configuration. Neither a RAID-Z nor a striped configuration is supported. You cannot add additional disks to create multiple mirrored top-level virtual devices by using the zpool add command, but you can expand a mirrored virtual device by using the zpool attach command.
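
    For example, a single-disk root pool can be converted to a mirror by attaching a second disk of the same size (device names are illustrative):

    # zpool attach rpool c0t1d0s0 c0t3d0s0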

  • The root pool cannot have a separate log device.

  • Pool properties can be set during an AI installation, but the gzip compression algorithm is not supported on root pools.

  • Do not rename the root pool after it is created by an initial installation. Renaming the root pool might cause an unbootable system.

  • Do not create a root pool on a USB stick on a production system because root pool disks are critical for continuous operation, particularly in an enterprise environment. Consider using a system's internal disks for the root pool, or at least, use the same quality disks that you would use for your non-root data. In addition, a USB stick might not be large enough to support a dump volume size that is equivalent to at least 1/2 the size of physical memory.

  • Rather than adding a hot spare to a root pool, consider creating a two- or a three-way mirror root pool. In addition, do not share a hot spare between a root pool and a data pool.

  • Do not use a VMware thinly-provisioned device for a root pool device.

Non-Root Pool Creation Practices

  • Create non-root pools with whole disks by using the d* identifier. Do not use the p* identifier.

    • ZFS works best without any additional volume management software.

    • For better performance, use individual disks or at least LUNs made up of just a few disks. With more visibility into the LUN setup, ZFS is able to make better I/O scheduling decisions.

  • Create redundant pool configurations across multiple controllers to reduce down time due to a controller failure.

    • Mirrored storage pools – Consume more disk space but generally perform better with small random reads.

      # zpool create system1 mirror c1d0 c2d0 mirror c3d0 c4d0
    • RAID-Z storage pools – Can be created with 3 parity strategies, where parity equals 1 (raidz), 2 (raidz2), or 3 (raidz3). A RAID-Z configuration maximizes disk space and generally performs well when data is written and read in large chunks (128K or more).

      • Consider a single-parity RAID-Z (raidz) configuration with 2 VDEVs of 3 disks (2+1) each.

        # zpool create rzpool raidz1 c1t0d0 c2t0d0 c3t0d0 raidz1 c1t1d0 c2t1d0 c3t1d0
      • A RAIDZ-2 configuration offers better data availability and performs similarly to RAID-Z. RAIDZ-2 has significantly better mean time to data loss (MTTDL) than either RAID-Z or 2-way mirrors. Create a double-parity RAID-Z (raidz2) configuration at 6 disks (4+2).

        # zpool create rzpool raidz2 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0
        raidz2 c0t2d0 c1t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0
      • A RAIDZ-3 configuration maximizes disk space and offers excellent availability because it can withstand 3 disk failures. Create a triple-parity RAID-Z (raidz3) configuration at 9 disks (6+3).

        # zpool create rzpool raidz3 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0
        c5t0d0 c6t0d0 c7t0d0 c8t0d0

Pool Creation Practices on Local or Network Attached Storage Arrays

Consider the following storage pool practices when creating a ZFS storage pool on a storage array that is connected locally or remotely.

  • If you create a pool on SAN devices and the network connection is slow, the pool's devices might be UNAVAIL for a period of time. You need to assess whether the network connection is appropriate for providing your data in a continuous fashion. Also, consider that if you are using SAN devices for your root pool, they might not be available as soon as the system is booted, and the root pool's devices might also be UNAVAIL.

  • Confirm with your array vendor that the disk array is not flushing its cache after a flush write cache request is issued by ZFS.

  • Use whole disks, not disk slices, as storage pool devices so that Oracle Solaris ZFS activates the local small disk caches, which get flushed at appropriate times.

  • For best performance, create one LUN for each physical disk in the array. Using only one large LUN can cause ZFS to queue up too few read I/O operations to actually drive the storage to optimal performance. Conversely, using many small LUNs could have the effect of swamping the storage with a large number of pending read I/O operations.

  • A storage array that uses dynamic (or thin) provisioning software to implement virtual space allocation is not recommended for Oracle Solaris ZFS. When Oracle Solaris ZFS writes the modified data to free space, it writes to the entire LUN. The Oracle Solaris ZFS write process allocates all the virtual space from the storage array's point of view, which negates the benefit of dynamic provisioning.

    Consider that dynamic provisioning software might be unnecessary when using ZFS:

    • You can expand a LUN in an existing ZFS storage pool and it will use the new space.

    • The same behavior applies when a smaller LUN is replaced with a larger LUN.

    • If you assess the storage needs for your pool and create the pool with smaller LUNs that equal the required storage needs, then you can always expand the LUNs to a larger size if you need more space.
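
    For example, after a LUN that backs an existing pool has been grown on the array, the pool can use the new space when automatic expansion is enabled or when the device is explicitly expanded (the pool and device names are illustrative):

    # zpool set autoexpand=on system1
    # zpool online -e system1 c0t1d0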

  • If the array can present individual devices (JBOD-mode), then consider creating redundant ZFS storage pools (mirror or RAID-Z) on this type of array so that ZFS can report and correct data inconsistencies.

Pool Creation Practices for an Oracle Database

Consider the following storage pool practices when creating an Oracle database.

  • Use a mirrored pool or hardware RAID for pools

  • RAID-Z pools are generally not recommended for random read workloads

  • Create a small separate pool with a separate log device for database redo logs

  • Create a small separate pool for the archive log
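
    For example, a small mirrored pool for redo logs with a separate log device, plus a separate pool for the archive log, might look like the following sketch (the pool and device names are illustrative):

    # zpool create redopool mirror c1t0d0 c2t0d0 log c3t0d0
    # zpool create archpool mirror c1t1d0 c2t1d0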

For more information about tuning ZFS for an Oracle database, see Tuning ZFS for Database Products in Oracle Solaris 11.3 Tunable Parameters Reference Manual.

Using ZFS Storage Pools in VirtualBox

  • By default, VirtualBox is configured to ignore cache flush commands from the underlying storage. This means that in the event of a system crash or a hardware failure, data could be lost.

  • Enable cache flushing on VirtualBox by issuing the following command:

    VBoxManage setextradata vm-name "VBoxInternal/Devices/type/0/LUN#n/Config/IgnoreFlush" 0
    • vm-name – the name of the virtual machine

    • type – the controller type, either piix3ide (if you are using the usual IDE virtual controller) or ahci (if you are using a SATA controller)

    • n – the disk number
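
    For example, for a hypothetical virtual machine named sol11vm that uses a SATA controller and has its disk attached at LUN 0, the command might look like the following:

    VBoxManage setextradata sol11vm "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0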

Storage Pool Practices for Performance

  • In general, keep pool capacity below 90% for best performance. The percentage where performance might be impacted depends greatly on workload:

    • If data is mostly added (write once, remove never), then it is very easy for ZFS to find new blocks. In this case, the percentage can be higher than normal, possibly up to 95%.

    • If data is made of large files or large blocks (such as 128K files or 1MB blocks) and the data is removed in bulk operations, the percentage can be higher than normal, possibly up to 95%.

    • If a large percentage (more than 50%) of the pool is made up of 8K chunks (database files, iSCSI LUNs, or many small files) that are constantly rewritten, then the 90% rule should be followed strictly.

    • If all of the data is small blocks that have constant rewrites, then you should monitor your pool closely once the capacity gets over 80%. The sign to watch for is increased disk IOPS to achieve the same level of client IOPS.

  • Mirrored pools are recommended over RAID-Z pools for random read/write workloads

  • Separate log devices

    • Recommended to improve synchronous write performance

    • With a high synchronous write load, a separate log device prevents the fragmentation that results from writing many log blocks in the main pool

  • Separate cache devices are recommended to improve read performance
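
    For example, a log device or a cache device can be added to an existing pool (the pool and device names are illustrative):

    # zpool add system1 log c2t5d0
    # zpool add system1 cache c2t6d0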

  • Scrub/resilver – A very large RAID-Z pool with many devices will have longer scrub and resilver times

  • Pool performance is slow – Use the zpool status command to rule out any hardware problems that are causing pool performance problems. If no problems show up in the zpool status command, use the fmdump command to display hardware faults or use the fmdump -eV command to review any hardware errors that have not yet resulted in a reported fault.

ZFS Storage Pool Maintenance and Monitoring Practices

  • Make sure that pool capacity is below 90% for best performance.

    Pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files, then you can keep the pool in the 95-96% utilization range. Even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.

    • Monitor pool and file system space to make sure that they are not full.

    • Consider using ZFS quotas and reservations to make sure file system space does not exceed 90% pool capacity.
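
      For example, assuming a pool named system1 with a file system named system1/data, a quota and a reservation can be set as follows (the names and sizes are illustrative):

      # zfs set quota=200g system1/data
      # zfs set reservation=20g system1/data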

  • Monitor pool health

    • Monitor a redundant pool with zpool status and fmdump at least once per week

    • Monitor a non-redundant pool with zpool status and fmdump at least twice per week

  • Run zpool scrub on a regular basis to identify data integrity problems.

    • If you have consumer-quality drives, consider a weekly scrubbing schedule.

    • If you have datacenter-quality drives, consider a monthly scrubbing schedule.

    • You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational.
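
    For example, to scrub a pool named system1 and then check its progress and results:

    # zpool scrub system1
    # zpool status system1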

  • Monitoring pool or device failures - Use zpool status as described below. Also use fmdump or fmdump -eV to see if any device faults or errors have occurred.

    • For redundant pools, monitor pool health with zpool status and fmdump on a weekly basis

    • For non-redundant pools, monitor pool health with zpool status and fmdump on a semiweekly basis

  • Pool device is UNAVAIL or OFFLINE – If a pool device is not available, then check to see if the device is listed in the format command output. If the device is not listed in the format output, then it will not be visible to ZFS.

    If a pool device is UNAVAIL or OFFLINE, this generally means that the device has failed, a cable has been disconnected, or some other hardware problem, such as a bad cable or bad controller, has caused the device to be inaccessible.

  • Consider configuring the smtp-notify service to notify you when a hardware component is diagnosed as faulty. For more information, see the Notification Parameters section of smf(5) and smtp-notify(1M).

    By default, some notifications are set up automatically to be sent to the root user. If you add an alias for your user account as root in the /etc/aliases file, you will receive electronic mail notifications, similar to the following:

    From noaccess@tardis.space.com Fri Jun 29 16:58:59 2012
    Date: Fri, 29 Jun 2012 16:58:58 -0600 (MDT)
    From: No Access User <noaccess@tardis.space.com>
    Message-Id: <201206292258.q5TMwwFL002753@tardis.space.com>
    Subject: Fault Management Event: tardis:ZFS-8000-8A
    To: root@tardis.central.com
    Content-Length: 771
    
    SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
    EVENT-TIME: Fri Jun 29 16:58:58 MDT 2012
    PLATFORM: ORCL,SPARC-T3-4, CSN: 1120BDRCCD, HOSTNAME: tardis
    SOURCE: zfs-diagnosis, REV: 1.0
    EVENT-ID: 76c2d1d1-4631-4220-dbbc-a3574b1ee807
    DESC: A file or directory in pool 'pond' could not be read due to corrupt data.
    AUTO-RESPONSE: No automated response will occur.
    IMPACT: The file or directory is unavailable.
    REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event.
    Run 'zpool status -xv' and examine the list of damaged files to determine what
    has been affected. Please refer to the associated reference document at
    http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures
    and policies regarding this diagnosis.
  • Monitor your storage pool space – Use the zpool list command and the zfs list command to identify how much disk space is consumed by file system data. ZFS snapshots can silently consume disk space because they are not shown in the default zfs list output. Use the zfs list -t snapshot command to identify the disk space that is consumed by snapshots.
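
    For example, the following commands show pool capacity, a breakdown of file system space usage, and the space consumed by snapshots for a pool named system1 (the pool name is illustrative):

    # zpool list system1
    # zfs list -r -o space system1
    # zfs list -r -t snapshot system1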