Disk: Disks

The Disks statistic displays a heat map of disks broken down by percent utilization. This is the best way to identify when pool disks are under heavy load. It can also identify problem disks that are beginning to perform poorly, before their behavior triggers a fault and automatic removal from the pool.

When to Check Disks

Any investigation into disk performance.

Disks Breakdowns

Table 5-29 Breakdowns of Disks

Breakdown | Description

percent utilization | A heat map with utilization on the Y-axis; each utilization level is colored by the number of disks at that utilization, from light (none) to dark (many).
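As a rough illustration of how one column of such a heat map could be derived, the sketch below counts disks per utilization bucket for a single time interval. The function name `heat_map_column`, the 10% bucket width, and the sample values are assumptions made for illustration; they do not describe how the appliance builds its heat map.

```python
from collections import Counter


def heat_map_column(utilizations, bucket_width=10):
    """Count disks per utilization bucket for one time interval.

    utilizations: per-disk percent utilization values (0 to 100).
    bucket_width: bucket size in percent; assumed to divide 100 evenly.
    Returns a dict mapping each bucket's lower bound to a disk count.
    """
    counts = Counter()
    top_bucket = 100 - bucket_width
    for pct in utilizations:
        # Clamp out-of-range readings, then find the bucket's lower bound.
        pct = max(0.0, min(100.0, pct))
        bucket = int(pct // bucket_width) * bucket_width
        # A reading of exactly 100% belongs in the top bucket.
        counts[min(bucket, top_bucket)] += 1
    return dict(counts)


# Hypothetical sample: four lightly loaded disks and one disk pinned at 100%.
sample = [12.0, 15.5, 18.0, 14.2, 100.0]
column = heat_map_column(sample)
print(column)
```

In this sample the 10% bucket would be drawn dark (four disks) and the top bucket faint (one disk), which corresponds to the faint line at the top of the heat map discussed under Interpretation.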

Interpretation

Utilization is a better measure of disk load than IOPS or throughput. Utilization is measured as the percentage of time during which the disk was busy performing requests (see Details at the end of this section). At 100% utilization, the disk might not be able to accept more requests, and additional I/O might wait in a queue. This I/O wait time causes latency to increase and overall performance to decrease.
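The measurement described above can be sketched as a simple ratio of busy time to elapsed time. The example assumes a hypothetical cumulative busy-time counter in nanoseconds, sampled at the start and end of an interval; the function and parameter names are illustrative and are not an appliance interface.

```python
def percent_busy(busy_ns_start, busy_ns_end, interval_ns):
    """Return the percentage of an interval during which a disk was busy.

    busy_ns_start, busy_ns_end: readings of a hypothetical cumulative
        busy-time counter (nanoseconds) at the interval boundaries.
    interval_ns: elapsed wall-clock time between the readings.
    """
    if interval_ns <= 0:
        raise ValueError("interval_ns must be positive")
    busy_ns = busy_ns_end - busy_ns_start
    # Cap at 100% to absorb small timing skew between the two readings.
    return min(100.0, 100.0 * busy_ns / interval_ns)


# A disk that was busy for 0.75 s of a 1 s interval is 75% utilized.
pct = percent_busy(5_000_000_000, 5_750_000_000, 1_000_000_000)
print(pct)
```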

In practice, a consistent utilization of 75% or higher indicates a heavy disk load.

The heat map allows a particular pathology to be easily identified: a single disk performing poorly and reaching 100% utilization (a bad disk). Disks can exhibit this symptom before they fail. After disks fail, they are automatically removed from the pool with a corresponding alert. The problem occurs during the time before they fail, when their I/O latency is increasing and slowing down overall Oracle ZFS Storage Appliance performance, but their status is still considered healthy because they have not yet reported any error state. This situation appears as a faint line at the top of the heat map, showing that a single disk has stayed at 100% utilization for some time.

Suggested interpretation summary:

Table 5-30 Interpretation Summary

Observed | Suggested Interpretation

Most disks consistently over 75% | Available disk resources are being exhausted.

Single disk at 100% for several seconds | This can indicate a bad disk that is about to fail.
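The two rules in the summary can be sketched as a small check over per-disk utilization samples. This is only an illustration: the data layout, the function names, and the thresholds for "most" (more than half of the disks), "consistently" (every sample at or above 75%), and "several seconds" (five consecutive one-second samples) are assumptions and are not defined by the appliance.

```python
def longest_run_at_100(series):
    """Return the longest run of consecutive samples at 100% utilization."""
    longest = current = 0
    for pct in series:
        current = current + 1 if pct >= 100.0 else 0
        longest = max(longest, current)
    return longest


def interpret(samples, heavy_pct=75.0, min_seconds=5):
    """Apply the two summary rules to per-second utilization samples.

    samples: dict mapping a disk name to a list of percent utilization
        values, one per second (hypothetical layout).
    heavy_pct, min_seconds: assumed thresholds, see the text above.
    """
    findings = []
    if not samples:
        return findings

    # Rule 1: most disks consistently at or above the heavy-load threshold.
    heavy = [d for d, s in samples.items() if s and min(s) >= heavy_pct]
    if len(heavy) > len(samples) / 2:
        findings.append("Available disk resources are being exhausted.")

    # Rule 2: exactly one disk pinned at 100% for several seconds.
    pinned = [d for d, s in samples.items()
              if longest_run_at_100(s) >= min_seconds]
    if len(pinned) == 1:
        findings.append(f"{pinned[0]} may be a bad disk that is about to fail.")
    return findings


# Hypothetical samples: two lightly loaded disks and one stuck at 100%.
samples = {
    "disk-0": [20, 25, 22, 24, 21, 23],
    "disk-1": [18, 19, 22, 20, 21, 17],
    "disk-2": [100, 100, 100, 100, 100, 100],
}
result = interpret(samples)
print(result)
```

A finding from either rule is a prompt for further investigation, not a diagnosis.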

Further Analysis

To understand the nature of the I/O, such as IOPS, throughput, I/O sizes, and offsets, see Disk: I/O Operations and Disk: I/O Bytes.

Details

This statistic is actually a measure of percent busy, which serves as a reasonable approximation of percent utilization because Oracle ZFS Storage Appliance manages the disks directly. Technically, this is not a direct measure of disk utilization: at 100% busy, a disk might still be able to accept more requests, which it serves concurrently by inserting them into and reordering its command queue, or serves from its on-disk cache.
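The difference between percent busy and true utilization can be illustrated with a small sketch. It assumes a hypothetical list of request (start, end) times for one disk and computes both the busy percentage (time with at least one request outstanding) and the average number of requests outstanding while busy. The names and data are illustrative only.

```python
def busy_and_concurrency(requests, interval):
    """Return (percent busy, average outstanding requests while busy).

    requests: list of (start, end) times in seconds, within [0, interval].
    interval: length of the observation window in seconds.
    """
    # Busy time is the union of all request intervals, so overlapping
    # requests are counted only once.
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(requests):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start

    # Summed service time counts overlapping requests separately.
    total_service = sum(end - start for start, end in requests)
    percent = 100.0 * busy / interval
    avg_outstanding = total_service / busy if busy else 0.0
    return percent, avg_outstanding


# Two back-to-back requests versus four fully concurrent requests.
serial = busy_and_concurrency([(0.0, 0.5), (0.5, 1.0)], 1.0)
queued = busy_and_concurrency([(0.0, 1.0)] * 4, 1.0)
print(serial, queued)
```

Both workloads report 100% busy, yet the second one is serving four requests at a time. This is why percent busy is an approximation of utilization and not an exact measure.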