3.3.4 Replacing a Hard Disk Due to Bad Performance

A single bad hard disk can degrade the performance of other good disks. It is better to remove the bad disk from the system than let it remain.

Starting with Oracle Exadata System Software release 11.2.3.2, an underperforming disk is automatically identified and removed from active configuration. Oracle Exadata Database Machine then runs a set of performance tests. When poor disk performance is detected by CELLSRV, the cell disk status changes to normal - confinedOnline, and the hard disk status changes to warning - confinedOnline.

The following conditions trigger disk confinement:

  • Disk stopped responding. The cause code in the storage alert log is CD_PERF_HANG.

  • Slow cell disk such as the following:

    • High service time threshold (cause code CD_PERF_SLOW_ABS)

    • High relative service time threshold (cause code CD_PERF_SLOW_RLTV)

  • High read or write latency such as the following:

    • High latency on writes (cause code CD_PERF_SLOW_LAT_WT)

    • High latency on reads (cause code CD_PERF_SLOW_LAT_RD)

    • High latency on reads and writes (cause code CD_PERF_SLOW_LAT_RW)

    • Very high absolute latency on individual I/Os happening frequently (cause code CD_PERF_SLOW_LAT_ERR)

  • Errors such as I/O errors (cause code CD_PERF_IOERR).

If the disk problem is temporary and passes the tests, then it is brought back into the configuration. If the disk does not pass the tests, then it is marked as poor performance, and Oracle Auto Service Request (ASR) submits a service request to replace the disk. If possible, Oracle ASM takes the grid disks offline for testing. If Oracle ASM cannot take the disks offline, then the cell disk status stays at normal - confinedOnline until the disks can be taken offline safely.

The disk status change is associated with the following entry in the cell alert history:

MESSAGE ID date_time info "Hard disk entered confinement status. The LUN
 n_m changed status to warning - confinedOnline. CellDisk changed status to normal
 - confinedOnline. Status: WARNING - CONFINEDONLINE  Manufacturer: name  Model
 Number: model  Size: size  Serial Number: serial_number  Firmware: fw_release 
 Slot Number: m  Cell Disk: cell_disk_name  Grid Disk: grid disk 1, grid disk 2
 ... Reason for confinement: threshold for service time exceeded"

The following would be logged in the storage cell alert log:

CDHS: Mark cd health state change cell_disk_name  with newState HEALTH_BAD_
ONLINE pending HEALTH_BAD_ONLINE ongoing INVALID cur HEALTH_GOOD
Celldisk entering CONFINE ACTIVE state with cause CD_PERF_SLOW_ABS activeForced: 0
inactiveForced: 0 trigger HistoryFail: 0, forceTestOutcome: 0 testFail: 0
global conf related state: numHDsConf: 1 numFDsConf: 0 numHDsHung: 0 numFDsHung: 0
...

Note:

In releases earlier than Oracle Exadata System Software release 11.2.3.2, use the CALIBRATE command to identify a bad hard disk, and look for very low throughput and IOPS for each hard disk.

The following procedure describes how to remove a hard disk once the bad disk has been identified:

  1. Illuminate the hard drive service LED to identify the drive to be replaced using a command similar to the following, where disk_name is the name of the hard disk to be replaced, such as 20:2:
    cellcli -e 'alter physicaldisk disk_name serviceled on'
    
  2. Find all the grid disks on the bad disk.

    For example:

    [root@exa05celadm03 ~]# cellcli -e "list physicaldisk 20:11 attributes name, id"
            20:11 RD58EA 
    [root@exa05celadm03 ~]# cellcli -e "list celldisk where physicalDisk='RD58EA'"
            CD_11_exa05celadm03 normal 
    [root@exa05celadm03 ~]# cellcli -e "list griddisk where cellDisk='CD_11_exa05celadm03'"
            DATA_CD_11_exa05celadm03 active
            DBFS_CD_11_exa05celadm03 active
            RECO_CD_11_exa05celadm03 active
            TPCH_CD_11_exa05celadm03 active
  3. Direct Oracle ASM to stop using the bad disk immediately.
    SQL> ALTER DISKGROUP diskgroup_name DROP DISK asm_disk_name;
    
  4. Ensure the blue OK to Remove LED on the disk is lit before removing the disk.
  5. Ensure that the Oracle ASM disks associated with the grid disks on the bad disk have been successfully dropped by querying the V$ASM_DISK_STAT view.
  6. Remove the badly-performing disk. An alert is sent when the disk is removed.
  7. When a new disk is available, install the new disk in the system. The cell disks and grid disks are automatically created on the new hard disk.

    Note:

    When a hard disk is replaced, the disk must be acknowledged by the RAID controller before it can be used. The acknowledgement does not take long, but use the LIST PHYSICALDISK command to ensure the status is NORMAL.

See Also: