3.3.4 Replacing a Hard Disk Due to Bad Performance
A single bad hard disk can degrade the performance of other good disks. It is better to remove the bad disk from the system than let it remain.
Starting with Oracle Exadata System Software release 11.2.3.2, an underperforming disk is automatically identified and removed from active configuration. Oracle Exadata Database Machine then runs a set of performance tests. When poor disk performance is detected by CELLSRV, the cell disk status changes to normal - confinedOnline, and the hard disk status changes to warning - confinedOnline.
The following conditions trigger disk confinement:
- Disk stopped responding. The cause code in the storage alert log is CD_PERF_HANG.
- Slow cell disk such as the following:
  - High service time threshold (cause code CD_PERF_SLOW_ABS)
  - High relative service time threshold (cause code CD_PERF_SLOW_RLTV)
- High read or write latency such as the following:
  - High latency on writes (cause code CD_PERF_SLOW_LAT_WT)
  - High latency on reads (cause code CD_PERF_SLOW_LAT_RD)
  - High latency on reads and writes (cause code CD_PERF_SLOW_LAT_RW)
  - Very high absolute latency on individual I/Os happening frequently (cause code CD_PERF_SLOW_LAT_ERR)
- Errors such as I/O errors (cause code CD_PERF_IOERR).
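When reviewing alert logs, it can help to keep the cause codes above in a small lookup table. The following Python sketch is illustrative only: the names CONFINEMENT_CAUSES and describe_cause are hypothetical helpers, not part of Oracle Exadata System Software. Only the cause codes and their descriptions are taken from the list above.

```python
# Cause codes and descriptions copied from the list of confinement triggers.
# CONFINEMENT_CAUSES and describe_cause are hypothetical helper names.
CONFINEMENT_CAUSES = {
    "CD_PERF_HANG": "Disk stopped responding",
    "CD_PERF_SLOW_ABS": "High service time threshold",
    "CD_PERF_SLOW_RLTV": "High relative service time threshold",
    "CD_PERF_SLOW_LAT_WT": "High latency on writes",
    "CD_PERF_SLOW_LAT_RD": "High latency on reads",
    "CD_PERF_SLOW_LAT_RW": "High latency on reads and writes",
    "CD_PERF_SLOW_LAT_ERR": (
        "Very high absolute latency on individual I/Os happening frequently"
    ),
    "CD_PERF_IOERR": "Errors such as I/O errors",
}


def describe_cause(code: str) -> str:
    """Return the documented description for a confinement cause code."""
    # Normalize whitespace and case before the lookup.
    return CONFINEMENT_CAUSES.get(code.strip().upper(), "Unknown cause code")
```

For example, describe_cause("CD_PERF_SLOW_ABS") returns the "High service time threshold" description, and any code not in the list above is reported as unknown rather than guessed.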
If the disk problem is temporary and passes the tests, then it is brought back into the configuration. If the disk does not pass the tests, then it is marked as poor performance, and Oracle Auto Service Request (ASR) submits a service request to replace the disk. If possible, Oracle ASM takes the grid disks offline for testing. If Oracle ASM cannot take the disks offline, then the cell disk status stays at normal - confinedOnline until the disks can be taken offline safely.
The disk status change is associated with the following entry in the cell alert history:
MESSAGE ID date_time info "Hard disk entered confinement status.
The LUN n_m changed status to warning - confinedOnline.
CellDisk changed status to normal - confinedOnline.
Status: WARNING - CONFINEDONLINE Manufacturer: name Model Number: model
Size: size Serial Number: serial_number Firmware: fw_release Slot Number: m
Cell Disk: cell_disk_name Grid Disk: grid disk 1, grid disk 2 ...
Reason for confinement: threshold for service time exceeded"
The following would be logged in the storage cell alert log:
CDHS: Mark cd health state change cell_disk_name with newState HEALTH_BAD_ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0 inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
...
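As a rough illustration of how the cause code might be pulled out of such a log entry, the following Python sketch searches a line for the "entering CONFINE ACTIVE state with cause" phrase shown in the sample above. It assumes the alert log text matches that sample and that the phrase and cause code appear on the same line; actual log wording may vary by release. The names CAUSE_PATTERN and extract_confinement_cause are hypothetical.

```python
import re
from typing import Optional

# Pattern based only on the sample log entry shown above; real log lines
# may differ between releases. Both names below are hypothetical.
CAUSE_PATTERN = re.compile(
    r"entering CONFINE ACTIVE state with cause (CD_PERF_[A-Z_]+)"
)


def extract_confinement_cause(line: str) -> Optional[str]:
    """Return the cause code from a confinement log line, or None."""
    match = CAUSE_PATTERN.search(line)
    return match.group(1) if match else None


# Sample line taken from the log excerpt above.
sample = (
    "Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS "
    "activeForced: 0"
)
print(extract_confinement_cause(sample))
```

Lines that do not contain the phrase yield None, so the function can be applied to every line of a log file and the non-matching lines filtered out.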
Note:
In releases earlier than Oracle Exadata System Software release 11.2.3.2, use the CALIBRATE command to identify a bad hard disk, and look for very low throughput and IOPS for each hard disk.
The following procedure describes how to remove a hard disk once the bad disk has been identified:
See Also:
- Oracle Automatic Storage Management Administrator's Guide for information about dropping a disk from a disk group