2.6.1 Replacing a Failed Storage Device

Failure of a storage device can affect performance and data redundancy. Consequently, a failed storage device should be replaced as soon as possible.

A storage device is considered to have failed in the following circumstances:

- A hardware or firmware fault causes the device to stop functioning.
- The device enters a predictive failure state.

  In this case, the device is still usable, but there is an indication that the device may soon stop functioning. For example, a hard disk drive (HDD) could be short of spare sectors, or a flash device could be approaching wear limits.
- Exadata software confines the device, and the device fails the post-confinement checks.

  Exadata automatically confines a storage device after detecting a significant performance problem or functional anomaly. After confinement, Exadata attempts to resolve the issue and recheck the device. However, if the post-confinement checks fail, the device is considered to have failed.

If a storage device fails, Exadata automatically drops all of the grid disks contained on the storage device. If any grid disk is used as an Exascale pool disk, Exascale automatically removes it and rebalances the storage pool to restore data redundancy.

Exadata also generates an alert when a storage device fails. The alert message includes specific instructions for replacing the device. If alert notifications are configured on the storage server, then the alert notification is automatically sent using email and SNMP.

After the failed storage device is replaced, Exadata automatically creates the cell and grid disks on the new device. If any grid disk is configured as an Exascale pool disk, it is automatically added to the storage pool, and the storage pool is rebalanced.

The following steps outline the procedure for replacing a failed storage device:

Confirm the location of the failed device.
Use the following CellCLI LIST PHYSICALDISK command:
```
CellCLI> list physicaldisk where status!=normal detail
```
In the output, the slotNumber value describes the physical location of the device.

You can also classify the failure type by examining the status value:
- failed or failed - dropped for replacement: Indicates that the device stopped functioning due to a hardware or firmware failure.
- warning - predictive failure: Indicates that the device entered a predictive failure state.
- warning - poor performance: Indicates that Exadata software confined the device, and the device failed the post-confinement checks.
For example, the following output indicates that the hard disk drive (HDD) in slot 5 stopped functioning due to a hardware or firmware failure:
```
CellCLI> list physicaldisk where status!=normal detail
         name:                   0:5
         deviceName:             /dev/sdi
         diskType:               HardDisk
         enclosureDeviceId:      0
         luns:                   0_5
         makeModel:              "WDC W7222A520ORA022T"
         physicalFirmware:       A7B0
         physicalInsertTime:     2023-07-07T17:20:44-07:00
         physicalInterface:      sas
         physicalSerial:         70SP8E
         physicalSize:           20.009765625T
         slotNumber:             5
         status:                 failed
```
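If you capture the command output in a script, the status values listed above can be mapped to a failure type automatically. The following Python code is a minimal sketch, not part of Exadata: the function names `parse_detail` and `describe` are hypothetical, and the parser assumes that each device record starts with a `name` attribute, as in the sample output.

```python
import re

# Status values and their meanings, as listed in this procedure.
FAILURE_TYPES = {
    "failed": "hardware or firmware failure",
    "failed - dropped for replacement": "hardware or firmware failure",
    "warning - predictive failure": "predictive failure",
    "warning - poor performance": "failed post-confinement checks",
}

def parse_detail(text):
    """Parse 'list physicaldisk ... detail' output into a list of dicts.

    Assumption: every record begins with a 'name:' line.
    """
    disks = []
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(.*)$", line)
        if not match:
            continue  # skip the echoed command and blank lines
        key = match.group(1)
        value = match.group(2).strip().strip('"')
        if key == "name":
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks

def describe(disk):
    """Summarize the slot, status, and failure type of one parsed record."""
    status = disk.get("status", "")
    reason = FAILURE_TYPES.get(status.lower(), "unrecognized status")
    return f"slot {disk.get('slotNumber', '?')}: {status} ({reason})"

# Abbreviated copy of the sample output shown above.
SAMPLE = """CellCLI> list physicaldisk where status!=normal detail
         name:                   0:5
         diskType:               HardDisk
         physicalInsertTime:     2023-07-07T17:20:44-07:00
         slotNumber:             5
         status:                 failed
"""

for disk in parse_detail(SAMPLE):
    print(describe(disk))
```

A status that is not in the table is reported as unrecognized rather than guessed, so the script never hides a value it does not know.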
Ensure that the storage server Do Not Service LED is not lit.
Ensure that the failed device is ready for removal.
- If the failed device is an HDD or flash drive located in one of the hot-swappable drive bays in the front of the server, ensure that the blue OK to Remove LED on the device is lit before removing the device.
- If the failed device is a hot-swappable flash card contained inside the server, ensure that the power LED on the flash card is not lit before removing the device. Starting with Exadata Storage Server X7-2, all storage server models contain hot-swappable flash cards.
Remove the failed storage device and install the replacement.

See the associated server hardware guide for additional details about physical hardware replacement.
Wait for the server to recognize the replaced device.

When you physically replace a hot-swappable storage device, it may take a few minutes for the server to recognize the new device.

Confirm the status of the replacement device.

Use the CellCLI LIST PHYSICALDISK command to confirm that the status of the replacement device is normal.

For example:

```
CellCLI> list physicaldisk 0:5 detail
         name:                   0:5
         deviceName:             /dev/sdi
         diskType:               HardDisk
         enclosureDeviceId:      0
         luns:                   0_5
         makeModel:              "WDC W7222A520ORA022T"
         physicalFirmware:       A7B0
         physicalInsertTime:     2023-09-01T12:00:25-07:00
         physicalInterface:      sas
         physicalSerial:         75X8RD
         physicalSize:           20.009765625T
         slotNumber:             5
         status:                 normal
```
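
Comparing the attributes recorded before and after the replacement gives an additional sanity check. The following Python code is a sketch under stated assumptions: the function `replacement_problems` is hypothetical, and the serial number and insert time comparisons are illustrative checks rather than a documented Exadata procedure. The attribute values are copied from the two sample outputs in this topic.

```python
from datetime import datetime

def replacement_problems(before, after):
    """Return reasons why 'after' does not look like a healthy replacement."""
    problems = []
    if after.get("slotNumber") != before.get("slotNumber"):
        problems.append("slot number differs")
    if after.get("physicalSerial") == before.get("physicalSerial"):
        problems.append("serial number unchanged")
    if after.get("status") != "normal":
        problems.append(f"status is {after.get('status')!r}, expected 'normal'")
    old_time = before.get("physicalInsertTime")
    new_time = after.get("physicalInsertTime")
    if old_time and new_time:
        # Timestamps in the sample output are ISO 8601 with a UTC offset.
        if datetime.fromisoformat(new_time) <= datetime.fromisoformat(old_time):
            problems.append("insert time is not later than the failed device")
    return problems

# Values taken from the sample outputs for the failed and replacement devices.
BEFORE = {
    "slotNumber": "5",
    "physicalSerial": "70SP8E",
    "physicalInsertTime": "2023-07-07T17:20:44-07:00",
    "status": "failed",
}
AFTER = {
    "slotNumber": "5",
    "physicalSerial": "75X8RD",
    "physicalInsertTime": "2023-09-01T12:00:25-07:00",
    "status": "normal",
}

problems = replacement_problems(BEFORE, AFTER)
print("replacement looks healthy" if not problems else problems)
```

An empty list means that the slot matches, the serial number changed, the insert time is newer, and the status is normal.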

Monitor the storage pool rebalance operation.

As part of reintegrating any Exascale pool disk on the replacement storage device, the affected storage pool undergoes a rebalance operation.

Use the ESCLI lsstoragepooloperation command to monitor the storage pool rebalance operation.
Confirm that Exascale is using the replacement device.

Use the ESCLI lspooldisk command and examine the status attribute.

Initially, as the replacement device comes online, the pool disk status is briefly set to BEING ADDED. However, the status value transitions to ONLINE as Exascale reintegrates the replacement device.
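
The transition from BEING ADDED to ONLINE can be awaited in a script. The following Python code is a minimal sketch: `wait_for_online` and its `get_status` argument are hypothetical. This topic does not show the ESCLI lspooldisk output format, so the caller must supply a function that runs the command and returns the pool disk status as a string. The timeout and polling interval are arbitrary illustrative values.

```python
import time

def wait_for_online(get_status, timeout_s=600.0, interval_s=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'ONLINE'.

    Returns the distinct statuses observed, in order.
    Raises TimeoutError if ONLINE is not reached within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    seen = []
    while True:
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each status change once
        if status == "ONLINE":
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"pool disk still {status!r} after {timeout_s} s")
        sleep(interval_s)

# Usage with a stand-in for the real status lookup (hypothetical data).
_fake_statuses = iter(["BEING ADDED", "BEING ADDED", "ONLINE"])
history = wait_for_online(lambda: next(_fake_statuses), sleep=lambda s: None)
print(history)
```

Passing the status lookup, sleep, and clock as arguments keeps the polling logic independent of how ESCLI is invoked and makes it easy to exercise without a storage server.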
