3.3.2 Replacing a Hard Disk Due to Disk Failure

A hard disk outage can cause a reduction in performance and data redundancy. Therefore, the disk should be replaced with a new disk as soon as possible. When the disk fails, the Oracle ASM disks associated with the grid disks on the hard disk are automatically dropped with the FORCE option, and an Oracle ASM rebalance follows to restore the data redundancy.

An Exadata alert is generated when a disk fails. The alert includes specific instructions for replacing the disk. If you have configured the system for alert notifications, then the alert is sent by e-mail to the designated address.

After the hard disk is replaced, the grid disks and cell disks that existed on the previous disk in that slot are re-created on the new hard disk. If those grid disks were part of an Oracle ASM group, then they are added back to the disk group, and the data is rebalanced on them, based on the disk group redundancy and ASM_POWER_LIMIT parameter.

Note:

For storage servers running Oracle Exadata System Software release 12.1.2.0 with Oracle Database release 12.1.0.2 with BP4, Oracle ASM sends an e-mail about the status of a rebalance operation. In earlier releases, the administrator had to check the status of the operation.

For earlier releases, check the rebalance operation status as described in Checking the Status of an ASM Rebalance Operation.

The following procedure describes how to replace a hard disk due to disk failure:

  1. Determine the failed disk using the following command:
    CellCLI> LIST PHYSICALDISK WHERE diskType=HardDisk AND status=failed DETAIL
    

    The following is an example of the output from the command. The slot number shows the location of the disk, and the status shows that the disk has failed.

    CellCLI> LIST PHYSICALDISK WHERE diskType=HardDisk AND status=failed DETAIL
    
             name:                   28:5
             deviceId:               21
             diskType:               HardDisk
             enclosureDeviceId:      28
             errMediaCount:          0
             errOtherCount:          0
             foreignState:           false
             luns:                   0_5
             makeModel:              "SEAGATE ST360057SSUN600G"
             physicalFirmware:       0705
             physicalInterface:      sas
             physicalSerial:         A01BC2
             physicalSize:           558.9109999993816G
             slotNumber:             5
             status:                 failed
    
  2. Ensure the blue OK to Remove LED on the disk is lit before removing the disk.
  3. Replace the hard disk on Oracle Exadata Storage Server and wait for three minutes. The hard disk is hot-pluggable, and can be replaced when the power is on.
  4. Confirm the disk is online.

    When you replace a hard disk, the disk must be acknowledged by the RAID controller before you can use it. This does not take long.

    Use the LIST PHYSICALDISK command similar to the following to ensure the status is NORMAL.

    CellCLI> LIST PHYSICALDISK WHERE name=28:5 ATTRIBUTES status
    
  5. Verify the firmware is correct using the ALTER CELL VALIDATE CONFIGURATION command.

In rare cases, the automatic firmware update may not work, and the LUN is not rebuilt. This can be confirmed by checking the ms-odl.trc file.

See Also: