Data deduplication
Controls whether duplicate copies of data are eliminated. Deduplication is synchronous,
pool-wide, block-based, and can be enabled on a per-project or per-share basis. Enable it by selecting
the Data Deduplication checkbox on the general properties screen for projects or shares. The
deduplication ratio will appear in the usage area of the Status Dashboard.
Data written with deduplication enabled is entered into the deduplication table indexed by the
data checksum. Deduplication forces the use of the cryptographically strong SHA-256 checksum.
Subsequent writes will identify duplicate data and retain only the existing copy on disk.
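
The write path described above can be sketched as a small in-memory model. This is only an
illustration, not the appliance's implementation: the DedupStore class, its methods, and the
reference-counting scheme are hypothetical names and assumptions introduced for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory model of a checksum-indexed deduplication table (hypothetical)."""

    def __init__(self):
        # Maps SHA-256 digest -> [block data, reference count].
        self.table = {}

    def write(self, block: bytes) -> bytes:
        """Store a block and return its checksum, reusing any existing copy."""
        key = hashlib.sha256(block).digest()
        entry = self.table.get(key)
        if entry is None:
            self.table[key] = [block, 1]  # first copy: keep the data
        else:
            entry[1] += 1                 # duplicate: only bump the reference count
        return key

    def free(self, key: bytes) -> None:
        """Drop one reference; discard the block when no references remain."""
        entry = self.table[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self.table[key]

    def stored_bytes(self) -> int:
        """Bytes actually held, counting each unique block once."""
        return sum(len(data) for data, _ in self.table.values())


store = DedupStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)  # duplicate of the first block
c = store.write(b"y" * 4096)
print(store.stored_bytes())   # 8192: three writes, two unique blocks
```

Note that both write and free touch the table, which is why table access cost matters for
every deduplicated write and delete.
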
Deduplication can only happen between blocks of the same size, that is, between data written with
the same record size. As always, for best results set the record size to match that of the
application using the data; for streaming workloads use a large record size.
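
To see why the record size matters, the following sketch splits the same data into blocks of two
different sizes and compares the resulting checksums. The block_checksums helper and the sample
data are hypothetical, and the sketch assumes each block is checksummed on its own as described
above.

```python
import hashlib


def block_checksums(data: bytes, record_size: int) -> set:
    """Split data into record_size blocks and return the set of SHA-256 digests."""
    return {
        hashlib.sha256(data[i:i + record_size]).digest()
        for i in range(0, len(data), record_size)
    }


# 64 KiB of non-repeating sample data built from 2048 distinct 32-byte chunks.
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))

# Same data, same record size: every block matches an existing one.
same = block_checksums(data, 8192) & block_checksums(data, 8192)

# Same data, different record sizes: the blocks differ, so nothing matches.
mixed = block_checksums(data, 8192) & block_checksums(data, 4096)

print(len(same), len(mixed))
```

In this model, a second copy of the data written with a different record size shares no blocks
with the first copy, so no space is saved.
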
If your data doesn't contain any duplicates, enabling Data Deduplication will add overhead (a
more CPU-intensive checksum and on-disk deduplication table entries) without providing any benefit.
If your data does contain duplicates, enabling Data Deduplication will save space by storing
only one copy of a given block regardless of how many times it occurs. Deduplication necessarily
impacts performance, because the checksum is more expensive to compute and the metadata of the
deduplication table must be accessed and maintained.
Note that deduplication has no effect on the calculated size of a share, but does affect the
amount of space used for the pool. For example, if two shares contain the same 1GB file, each will
appear to be 1GB in size, but the total for the pool will be just 1GB and the deduplication ratio
will be reported as 2x.
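
The arithmetic behind this example can be written out as follows. This is a simplified sketch
that assumes the ratio is the total logical size divided by the space used in the pool; the
dedup_ratio function and the share names are hypothetical.

```python
GB = 1024 ** 3


def dedup_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Ratio of data as seen by the shares to space consumed in the pool."""
    return logical_bytes / physical_bytes


# Two shares each hold the same 1GB file; each share still reports 1GB.
share_sizes = {"share_a": 1 * GB, "share_b": 1 * GB}

# The pool stores the duplicated blocks only once.
pool_used = 1 * GB

ratio = dedup_ratio(sum(share_sizes.values()), pool_used)
print(f"{ratio:.0f}x")  # 2x
```
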
Performance Warning: by its nature, deduplication requires modifying the deduplication table
when a block is written to or freed. If the deduplication table cannot fit in DRAM, writes and frees
may induce significant random read activity where there was previously none. As a result, the
performance impact of enabling deduplication can be severe. Moreover, for some cases -- in
particular, share or snapshot deletion -- the performance degradation from enabling deduplication
may be felt pool-wide. In general, it is not advised to enable deduplication unless it is known that
a share has a very high rate of duplicated data, and that that duplicated data plus the table to
reference it can comfortably reside in DRAM. To determine if performance has been adversely affected
by deduplication, enable advanced analytics (see Chapter 8, Setting ZFSSA Preferences) and then use
Analytics (see the Oracle ZFS Storage Appliance Analytics Guide) to measure "ZFS DMU operations
broken down by DMU object type". Check for a higher rate of sustained DDT (deduplication table)
operations as compared to ZFS operations. If this is happening, more I/O is going to serving the
deduplication table than to file I/O.
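
As a rough aid for judging whether the deduplication table can reside in DRAM, the sketch below
estimates the table size from the amount of unique data and the record size. The figure of 320
bytes per table entry is an assumption (a rule of thumb sometimes quoted for ZFS, not a value
taken from this guide), and the ddt_size_bytes function is hypothetical; treat the results as
order-of-magnitude estimates only.

```python
def ddt_size_bytes(unique_data_bytes: int, record_size: int,
                   bytes_per_entry: int = 320) -> int:
    """Estimate deduplication table size: one entry per unique block.

    bytes_per_entry is an ASSUMED rule-of-thumb value, not a documented figure.
    """
    entries = -(-unique_data_bytes // record_size)  # ceiling division
    return entries * bytes_per_entry


TiB = 2 ** 40
GiB = 2 ** 30

# 1 TiB of unique data with large and small record sizes.
large = ddt_size_bytes(1 * TiB, 128 * 1024)
small = ddt_size_bytes(1 * TiB, 8 * 1024)

print(f"{large / GiB:.1f} GiB")  # 2.5 GiB
print(f"{small / GiB:.1f} GiB")  # 40.0 GiB
```

Under these assumptions a small record size multiplies the number of table entries, so the same
amount of data needs far more DRAM to keep the table resident.
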