3.4.3 Replacing a Flash Disk Due to Flash Disk Problems

Oracle Exadata Storage Server is equipped with four PCIe cards. Each card has four flash disks (FDOMs) for a total of 16 flash disks. The four PCIe cards are present on PCI slot numbers 1, 2, 4, and 5. Starting with Oracle Exadata Database Machine X7, you can replace the PCIe cards without powering down the storage server. See Performing a Hot Pluggable Replacement of a Flash Disk.

In Oracle Exadata Database Machine X6 and earlier systems, the PCIe cards are not hot-pluggable. The Oracle Exadata Storage Server must be powered down before replacing the flash disks or cards.

Starting with Oracle Exadata Database Machine X7, each flash card on both High Capacity and Extreme Flash storage servers is a field-replaceable unit (FRU). The flash cards are also hot-pluggable, so you do not have to shut down the storage server before removing the flash card.

On Oracle Exadata Database Machine X5 and X6 systems, each flash card on High Capacity storage servers and each flash drive on Extreme Flash storage servers is a FRU. This means that there is no peer failure for these systems.

On Oracle Exadata Database Machine X3 and X4 systems, the flash card itself is the FRU. If any FDOM fails, Oracle Exadata System Software automatically puts the remaining FDOMs on that card into peer failure status so that the data can be moved off them in preparation for the flash card replacement.

On Oracle Exadata Database Machine V2 and X2 systems, each FDOM is a FRU. There is no peer failure for flash for these systems.
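
The per-generation rules above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name `flash_fru`, the `FlashFru` type, and the `"HC"`/`"EF"` labels are hypothetical and are not part of any Oracle tool. It encodes only what this section states, and it leaves peer failure as unknown for X7 and later because the text does not address it.

```python
from typing import NamedTuple, Optional


class FlashFru(NamedTuple):
    """Hypothetical summary record; not an Oracle data structure."""
    fru: str                       # unit that is replaced in the field
    peer_failure: Optional[bool]   # None means "not stated in this section"


def flash_fru(generation: str, server_type: str = "HC") -> FlashFru:
    """Return the flash FRU for a generation such as "V2", "X4", or "X7".

    server_type is "HC" (High Capacity) or "EF" (Extreme Flash). Only plain
    names of the form V2 or X<number> are handled in this sketch.
    """
    gen = generation.upper()
    if gen in ("V2", "X2"):
        return FlashFru("FDOM", False)
    if gen in ("X3", "X4"):
        # The whole card is the FRU, so surviving FDOMs go to peer failure.
        return FlashFru("flash card", True)
    if gen in ("X5", "X6"):
        unit = "flash drive" if server_type.upper() == "EF" else "flash card"
        return FlashFru(unit, False)
    if gen.startswith("X") and gen[1:].isdigit() and int(gen[1:]) >= 7:
        return FlashFru("flash card", None)
    raise ValueError(f"generation not covered by this sketch: {generation}")
```

For example, `flash_fru("X6", "EF").fru` evaluates to `"flash drive"`, and `flash_fru("X4").peer_failure` is `True`.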

Determining when to proceed with disk replacement depends on the release, as described in the following:

  • For Oracle Exadata System Software releases earlier than 11.2.3.2:

    Before proceeding with the flash disk replacement, query the V$ASM_DISK_STAT view and wait until the Oracle ASM disks have been successfully dropped. If the normal drop does not complete before the flash disk fails, then the Oracle ASM disks are automatically dropped from the Oracle ASM disk group with the FORCE option. If the DROP command does not complete before the flash disk fails, then refer to Replacing a Flash Disk Due to Flash Disk Failure.

  • For Oracle Exadata System Software releases 11.2.3.2 and later:

    An alert is sent when the Oracle ASM disks have been dropped, and the flash disk can be safely replaced. If the flash disk is used for write-back flash cache, then wait until none of the grid disks are cached by the flash disk. Use the following command to check the cachedBy attribute of all the grid disks. The cell disk on the flash disk should not appear in any grid disk's cachedBy attribute.

    CellCLI> LIST GRIDDISK ATTRIBUTES name, cachedBy
    

    If the flash disk is used for both grid disks and flash cache, then wait until you receive the alert and the cell disk no longer appears in any grid disk's cachedBy attribute.
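
The cachedBy check described above can be automated once the output of the LIST GRIDDISK command has been captured as text. The following Python sketch is a minimal illustration, not an Oracle-provided utility. The function name `grid_disks_cached_by`, the sample disk names, and the assumed layout (grid disk name in the first column, followed by an optional, possibly quoted and comma-separated cachedBy value) are assumptions; verify them against the output on your own storage server.

```python
from typing import List


def grid_disks_cached_by(listing: str, cell_disk: str) -> List[str]:
    """Return the grid disks whose cachedBy value still names cell_disk.

    listing is the captured text of:
        LIST GRIDDISK ATTRIBUTES name, cachedBy
    """
    still_cached = []
    for line in listing.splitlines():
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # blank line, or a grid disk with an empty cachedBy
        name, cached_by = fields
        # Treat quotes and commas as separators between cell disk names.
        tokens = cached_by.replace('"', " ").replace(",", " ").split()
        if cell_disk in tokens:
            still_cached.append(name)
    return still_cached


# Hypothetical sample listing, for illustration only.
SAMPLE_CACHED_BY = """
    DATA_CD_00_cell01    FD_00_cell01
    DATA_CD_01_cell01    "FD_01_cell01, FD_02_cell01"
    RECO_CD_00_cell01
"""
```

With this sample, `grid_disks_cached_by(SAMPLE_CACHED_BY, "FD_01_cell01")` returns `["DATA_CD_01_cell01"]`. An empty list for the cell disk on the affected flash disk corresponds to the condition described above.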

The following procedure describes how to replace a flash disk on High Capacity storage servers for Oracle Exadata Database Machine X6 and earlier due to disk problems.

Note:

On Extreme Flash storage servers for Oracle Exadata Database Machine X6 and all storage servers for Oracle Exadata Database Machine X7 and later, you can remove the flash disk from the front panel and insert a new one. You do not need to shut down the storage server.

  1. Stop the cell services using the following command:
    CellCLI> ALTER CELL SHUTDOWN SERVICES ALL
    

    The preceding command checks whether any disks are offline, in predictive failure status, or need to be copied to their mirror. If Oracle ASM redundancy is intact, then the command takes the grid disks offline in Oracle ASM, and then stops the cell services. If the following error is displayed, then it may not be safe to stop the cell services because a disk group may be forced to dismount due to reduced redundancy.

    Stopping the RS, CELLSRV, and MS services...
    The SHUTDOWN of ALL services was not successful.
    CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be
    forced to dismount due to reduced redundancy.
    Getting the state of CELLSRV services... running
    Getting the state of MS services... running
    Getting the state of RS services... running
    

    If the error occurs, then restore Oracle ASM disk group redundancy and retry the command when disk status is back to normal for all the disks.

  2. Shut down the storage server.
  3. Replace the failed flash disk based on the PCI number and FDOM number. A white Locator LED is lit to help locate the affected storage server.
  4. Power up the storage server. The cell services are started automatically. As part of the storage server startup, all grid disks are automatically brought ONLINE in Oracle ASM.
  5. Verify that all grid disks have been successfully put online using the following command:
    CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
    

    Wait until asmmodestatus shows ONLINE or UNUSED for all grid disks.
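
The verification in step 5 can be scripted against the captured output of the LIST GRIDDISK command. The following Python sketch is a minimal illustration under stated assumptions. The function name `grid_disks_not_ready`, the sample disk names, and the assumed two-column layout (grid disk name, then asmmodestatus) are hypothetical. How the listing is captured is left to the reader.

```python
from typing import List, Tuple

# Statuses that step 5 treats as "done".
READY_STATUSES = {"ONLINE", "UNUSED"}


def grid_disks_not_ready(listing: str) -> List[Tuple[str, str]]:
    """Return (name, status) for grid disks that are not ONLINE or UNUSED.

    listing is the captured text of:
        LIST GRIDDISK ATTRIBUTES name, asmmodestatus
    """
    pending = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        # A missing status is reported as an empty string, not as ready.
        status = fields[1] if len(fields) > 1 else ""
        if status.upper() not in READY_STATUSES:
            pending.append((name, status))
    return pending


# Hypothetical sample listing, for illustration only.
SAMPLE_ASM_STATUS = """
    DATA_CD_00_cell01    ONLINE
    DATA_CD_01_cell01    SYNCING
    RECO_CD_00_cell01    UNUSED
"""
```

With this sample, `grid_disks_not_ready(SAMPLE_ASM_STATUS)` returns `[("DATA_CD_01_cell01", "SYNCING")]`. Repeat the LIST GRIDDISK command and the check until the returned list is empty.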

The new flash disk is automatically used by the system. If the flash disk is used for flash cache, then the effective cache size increases. If the flash disk is used for grid disks, then the grid disks are re-created on the new flash disk. If those grid disks were part of an Oracle ASM disk group, then they are added back to the disk group, and the data is rebalanced on them based on the disk group redundancy and the ASM_POWER_LIMIT parameter.

See Also: