Oracle® Solaris 11.4 Tunable Parameters Reference Manual

Updated: January 2019
 
 

Tuning ZFS When Using Flash Storage

The following information applies to Flash SSDs, F20 PCIe Accelerator Card, F40 PCIe Accelerator Card, F5100 Flash Storage Array, and F80 PCIe Accelerator Card.

Review the following general comments when using ZFS with Flash storage:

  • Consider using LUNs or low-latency disks that are managed by a controller with persistent memory, if available, for the ZIL (ZFS intent log). This option can be considerably more cost-effective than using flash for low-latency commits. The log devices need only be large enough to hold 10 seconds of maximum write throughput. Examples include a storage-array-based LUN or a disk connected to an HBA with a battery-protected write cache.

    If no such device is available, segment a separate pool of flash devices for use as log devices in a ZFS storage pool.

  • The F40, F20, and F80 Flash Accelerator cards contain and export 4 independent flash modules to the OS. The F5100 contains up to 80 independent flash modules. Each flash module appears to the operating system as a single device. An SSD is also viewed as a single device by the OS. Flash devices may be used as ZFS log devices to reduce commit latency, particularly if used in an NFS server. For example, a single flash module of a flash device used as a ZFS log device can reduce the latency of single, lightly threaded operations by 10x. More flash devices can be striped together to achieve higher throughput for large amounts of synchronous operations.

  • Log devices should be mirrored for reliability. For maximum protection, the mirrors should be created on separate flash devices. In the case of F20, F40, and F80 PCIe accelerator cards, maximum protection is achieved by ensuring that mirrors reside on different physical PCIe cards. Maximum protection with the F5100 storage array is obtained by placing mirrors on separate F5100 devices.

  • Flash devices that are not used as log devices may be used as second-level cache devices. This both offloads IOPS from primary disk storage and improves read latency for commonly used data.
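The sizing rule for log devices given above (enough capacity for 10 seconds of maximum write throughput) can be sketched as simple arithmetic. The function name log_size_mb and the 200 MB/s figure are illustrative assumptions, not values from this manual.

```shell
#!/bin/sh
# Hypothetical helper: estimate the minimum log device size in MB.
# Rule from the text: hold 10 seconds of maximum write throughput.
log_size_mb() {
    # $1 = peak write throughput in MB/s (an assumed, measured input)
    echo $(( $1 * 10 ))
}

# Assumed example: a pool that peaks at 200 MB/s of writes.
log_size_mb 200
```

Under that assumption the sketch prints 2000, that is, roughly 2 GB of log capacity.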

Adding Flash Devices as ZFS Log or Cache Devices

Review the following recommendations when adding flash devices as ZFS log or cache devices.

  • A ZFS log or cache device can be added to an existing ZFS storage pool by using the zpool add command. Be very careful with zpool add commands. Mistakenly adding a log device as a normal pool device will require you to destroy the pool and restore it from scratch. Individual log devices themselves can be removed from a pool.

  • Familiarize yourself with the zpool add command before attempting this operation on active storage. You can use the zpool add -n option to preview the configuration without creating the configuration. For example, the following incorrect zpool add preview syntax attempts to add a log device but omits the log keyword:

    # zpool add -n tank c4t1d0
    vdev verification failed: use -f to override the following errors:
    mismatched replication level: pool uses mirror and new vdev is disk
    Unable to build pool from specified devices: invalid vdev configuration

    This is the correct zpool add preview syntax for adding a log device to an existing pool:

    # zpool add -n tank log c4t1d0
    would update 'tank' to the following configuration:
        tank
          mirror
            c4t0d0
            c5t0d0
        logs
          c4t1d0

    If multiple devices are specified, they are striped together. For more information, see the examples below or the zpool(8) man page.

A flash device, c4t1d0, can be added as a ZFS log device:

# zpool add pool log c4t1d0

If 2 flash devices are available, you can add mirrored log devices:

# zpool add pool log mirror c4t1d0 c4t2d0

Available flash devices can be added as a cache device for reads.

# zpool add pool cache c4t3d0

You cannot mirror cache devices. If multiple cache devices are specified, they are striped together.

# zpool add pool cache c4t3d0 c4t4d0

Ensuring Proper Cache Flush Behavior for Flash and NVRAM Storage Devices

ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this works as designed and without problems. For many NVRAM-based storage arrays, a performance problem might occur if the array acts on the cache flush request rather than ignoring it. Some storage arrays flush their large caches despite the fact that the NVRAM protection makes those caches as good as stable storage.

ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. Because these flushes are infrequent, their cost is fairly inconsequential and no tuning is warranted here. ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The application waits for this type of flush to complete, which greatly impacts performance. From a performance standpoint, this neutralizes the benefits of having NVRAM-based storage.

Cache flush tuning was recently shown to help flash device performance when used as log devices. When all LUNs exposed to ZFS come from an NVRAM-protected storage array and procedures ensure that no unprotected LUNs will be added in the future, ZFS can be tuned to not issue the flush requests by setting zfs_nocacheflush. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss, application-level corruption, or even pool corruption. In some NVRAM-protected storage arrays, the cache flush command is a no-op, so tuning in this situation makes no performance difference.

A recent OS change is that the flush request semantic has been qualified to instruct storage devices to ignore the requests if they have the proper protection. This change requires a fix to the disk drivers and requires the NVRAM device to support the updated semantics. If the NVRAM device does not recognize this improvement, use these instructions to tell the Oracle Solaris OS not to send any synchronize cache commands to the array. If you use these instructions, make sure all targeted LUNs are indeed protected by NVRAM.

Occasionally, flash and NVRAM devices do not properly advertise to the OS that they are non-volatile devices, and that caches do not need to be flushed. Cache flushing is an expensive operation. Unnecessary flushes can drastically impede performance in some cases.

Review the following zfs_nocacheflush syntax restrictions before applying the tuning entries below:

  • The tuning syntax below can be included in sd.conf but there must be only a single sd-config-list entry per vendor/product.

  • If multiple devices entries are desired, multiple pairs of vendor IDs and sd tuning strings can be specified on the same line by using the following syntax:

    #              "012345670123456789012345","tuning    ",
    sd-config-list="|-VID1-||-----PID1-----|","param1:val1, param2:val2",
                   "|-VIDN-||-----PIDN-----|","param1:val1, param3:val3";

    Make sure the vendor ID (VID) string is padded to 8 characters and the Product ID (PID) string is padded to 16 characters as described in the preceding example.
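The padding rule above can be sketched with the standard printf utility, which left-justifies a string and pads it with spaces to a fixed width. The function names sd_device_id and sd_config_entry are hypothetical helpers introduced for illustration; they are not part of Oracle Solaris. Note that printf pads but does not truncate, so an over-long VID or PID would still produce an invalid string.

```shell
#!/bin/sh
# Hypothetical helper: build the 24-character device ID string.
# The VID is left-justified in 8 columns, the PID in 16 columns.
sd_device_id() {
    printf '%-8s%-16s' "$1" "$2"
}

# Hypothetical helper: emit a complete single-device sd-config-list line.
# $1 = vendor ID, $2 = product ID, $3 = tuning string
sd_config_entry() {
    printf 'sd-config-list="%s","%s";\n' "$(sd_device_id "$1" "$2")" "$3"
}

# Example using the F20/F5100 values that appear later in this section.
sd_config_entry "ATA" "MARVELL SD88SA02" \
    "throttle-max:32, disksort:false, cache-nonvolatile:true"
```

The generated line can be reviewed and then pasted into sd.conf by hand; the sketch deliberately does not edit the file itself.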


Caution  -  All cache sync commands are ignored by the device. Use at your own risk.


  1. Use the format utility to run the inquiry subcommand on a LUN from the storage array. For example:

    # format
    .
    .
    .
    Specify disk (enter its number): x
    format> inquiry
    Vendor:   ATA
    Product:  Marvell
    Revision: XXXX
    format>

  2. Select one of the following based on your architecture:

    • For all devices, copy the file /kernel/drv/sd.conf to the /etc/driver/drv/sd.conf file.

    • For F40 flash devices, add the following entry to /kernel/drv/sd.conf. In the entry below, ensure that ATA is padded to 8 characters, and 3E128-TS2-550B01 contains 16 characters. Total string length is 24.

      sd-config-list="ATA     3E128-TS2-550B01","disksort:false, cache-nonvolatile:true, physical-block-size:4096";

    • For F80 flash devices, add the following entry to /kernel/drv/sd.conf. Ensure that ATA is padded to 8 characters, and 2E256-TU2-510B00 contains 16 characters. Total string length is 24.

      sd-config-list="ATA     2E256-TU2-510B00","disksort:false, cache-nonvolatile:true, physical-block-size:4096";
      
    • For F20 and F5100 flash devices, choose one of the following based on your architecture. In the entries below, ATA is padded to 8 characters, and MARVELL SD88SA02 contains 16 characters. The total string length is 24.

    • Add the following entry to /etc/driver/drv/sd.conf

      sd-config-list="ATA     MARVELL SD88SA02","throttle-max:32, disksort:false, cache-nonvolatile:true";

  3. Carefully add whitespace to make the vendor ID (VID) 8 characters long (here ATA) and the Product ID (PID) 16 characters long (here MARVELL SD88SA02) in the sd-config-list entry as illustrated.

  4. Reboot the system.

    You can tune zfs_nocacheflush back to its default value (0) with no adverse effect on performance.

  5. Confirm that the flush behavior is correct.

    Use the script provided in System Check Script for verification.
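Because the padding required in step 3 is easy to get wrong, a quick length check on the edited file before rebooting may help. The following is a minimal sketch, not an Oracle-provided tool: check_sd_ids is a hypothetical function name, and it inspects only the first quoted string on lines that begin with sd-config-list=, so additional VID/PID pairs on continuation lines are not checked.

```shell
#!/bin/sh
# Hypothetical check: read sd.conf text on stdin and report whether the
# first quoted string (VID + PID) on each sd-config-list line is 24 characters.
check_sd_ids() {
    grep '^sd-config-list=' | while IFS= read -r line; do
        id=${line#*\"}     # drop everything through the first double quote
        id=${id%%\"*}      # drop everything from the next double quote on
        if [ "${#id}" -eq 24 ]; then
            echo "ok: [$id]"
        else
            echo "bad (${#id} chars): [$id]"
        fi
    done
}
```

A possible invocation, assuming the entries were added to /etc/driver/drv/sd.conf, is check_sd_ids < /etc/driver/drv/sd.conf.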