1.9.2.1 Step 1: Prepare the Disk Controller BBU for Removal

On certain X3-2, X4-2, and X4-8 database nodes, and on X3-2, X4-2, X3-8, and X4-8 storage servers, the BBU is remote mounted and can be accessed without a system shutdown. However, you must still prepare it for removal from the RAID HBA to avoid the risk of data corruption on the disk volumes. Note that there is no remote mount BBU option for X3-8 database nodes.

For Systems with Remote Mount BBU

Perform the steps in this section if your system has a remote mount BBU. If your system does not have a remote mount BBU, perform the steps in "For Systems That Do Not Have a Remote Mount BBU".

  1. Log in as the root user.
  2. Get the version of the image that is running on the server in the rack that requires service.
    # cellcli -e LIST CELL ATTRIBUTES releaseVersion
    11.2.3.2.1
    
  3. Drop the disk controller BBU.

    If you are running version 11.2.3.3.0 or later:

    1. Drop the disk controller BBU for replacement. Run the following command as the celladmin or root user:
      # cellcli -e ALTER CELL BBU DROP FOR REPLACEMENT
      HDD disk controller battery has been dropped for replacement
      
    2. Verify that the BBU was dropped for replacement:
      # cellcli -e LIST CELL ATTRIBUTES bbustatus
      dropped for replacement.

    If you are running version 11.2.3.2.x:

    1. Locate the server in the rack being serviced, and turn on the indicator light.

      Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest Storage Server in the rack installed in RU2, counting up to the top of the rack.

      Exadata Database Nodes are identified by a number 1 through 8, where 1 is the lowermost database node in the rack, installed in RU16.

      To identify the server being serviced more easily, turn on its locate indicator light. If you have already identified the server at the rack, you can press the Locate Button on its front panel instead.

      To turn on the indicator light remotely, use any of the following methods:

      From a login to the CellCli on Exadata Storage Servers:

      CellCli> ALTER CELL LED ON
      

      From a login to the server's ILOM:

      -> set /SYS/LOCATE value=Fast_Blink
      

      From a login to the server's root account:

      # ipmitool chassis identify force
      Chassis identify interval: indefinite
      
    2. Check that the HBA can see the battery and report its current status.

      Note:

      If you are running on Solaris, use /opt/MegaRAID/MegaCli in place of /opt/MegaRAID/MegaCli/MegaCli64 in the commands below.

      # /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -a0
      

      The output should show that the battery is still visible, and may show low voltage or other issues depending on the fault. If the battery has hard failed and is no longer accessible to the HBA, the command may return an error when reading the BBU.

    3. Verify the current cache policy for all logical volumes.

      # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
      

      The default cache policy is WriteBack for all volumes. If the battery is functioning normally, each volume reports a current cache policy of WriteBack. However, if the battery has failed, the current cache policy may be reported as WriteThrough.

    4. Set the cache policy for all logical volumes to WriteThrough cache mode, which does not use the battery.

      # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0
      
    5. Verify that the current cache policy for all logical volumes is now WriteThrough.

      # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
      
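After setting WriteThrough mode in steps 4 and 5 above, it is worth confirming that no logical volume still reports WriteBack before the battery is touched. The helper below is a minimal sketch, not part of the MegaCli tooling: the function name all_writethrough is hypothetical, and it assumes the -ldpdinfo output contains lines of the form "Current Cache Policy: WriteThrough, ...".

```shell
#!/bin/sh
# Hypothetical helper (not part of MegaCli): reads MegaCli -ldpdinfo output on
# stdin and succeeds only if every "Current Cache Policy" line reports
# WriteThrough. Assumed line format:
#   Current Cache Policy: WriteThrough, ReadAheadNone, Direct, ...
all_writethrough() {
  policies=$(grep 'Current Cache Policy')   # keep only the cache policy lines
  # Fail if no policy lines were found, or if any of them is not WriteThrough.
  [ -n "$policies" ] && ! printf '%s\n' "$policies" | grep -qv 'WriteThrough'
}

# Example usage on a live system:
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | all_writethrough \
#     && echo "all volumes in WriteThrough" || echo "WriteBack still active"
```

The check is deliberately conservative: an empty result (no policy lines at all) is treated as a failure rather than a pass.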

For Systems That Do Not Have a Remote Mount BBU

Perform the steps in this section if your system does not have a remote mount BBU. If your system has a remote mount BBU, see "For Systems with Remote Mount BBU".

If the system does not have a remote mounted battery installed, you must shut down the node whose battery requires replacement.

Note:

If you are running Oracle Exadata System Software 19.0 or later, substitute /opt/MegaRAID/storcli/storcli64 for /opt/MegaRAID/MegaCli/MegaCli64 in the following commands:
  1. Revert all the RAID disk volumes to WriteThrough mode to ensure that all data in the RAID cache memory is flushed to disk and is not lost when the battery is replaced.
    1. Set all logical volumes cache policy to WriteThrough cache mode.
      # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0
      
    2. Verify the current cache policy for all logical volumes is now WriteThrough, which does not use the battery:
      # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
      
  2. Shut down the server operating system.

    Note the following when powering off Exadata Storage Servers:

    • Verify that there are no disk faults on other storage servers. Shutting down this storage server while a disk on another server is failing may cause database processes and Oracle ASM to crash, because both disks in a partner pair would be unavailable when this server's disks go offline.
    • Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.
    • All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner's Guide for details if this is necessary.

    Oracle ASM drops a disk shortly after it is taken offline. Powering off or restarting Exadata Storage Servers can impact database performance if a storage server stays offline longer than the ASM disk repair timer allows. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but can be increased if you need more time.

    1. Check the disk repair time by logging in to ASM and running the following query:
      SQL> SELECT dg.name,a.value FROM v$asm_attribute a, v$asm_diskgroup dg
       WHERE a.name = 'disk_repair_time' AND a.group_number = dg.group_number;
      

      As long as the value is large enough to comfortably cover the component replacement, there is no need to change it.

      If you need to change it, you can use this statement:

      SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time'='8.5H';
      
    2. Check whether Oracle ASM can tolerate the grid disks going offline. The following command should return Yes for every grid disk listed.
      # cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome
      ...sample ...
      DATA_CD_09_cel01 ONLINE Yes
      DATA_CD_10_cel01 ONLINE Yes
      DATA_CD_11_cel01 ONLINE Yes
      RECO_CD_00_cel01 ONLINE Yes
      RECO_CD_01_cel01 ONLINE Yes
      ...repeated for all griddisks....
      

      If one or more grid disks do not return asmdeactivationoutcome='Yes', check the respective disk group and restore the data redundancy for that disk group. Once redundancy is fully restored, re-run the command and verify that all grid disks return asmdeactivationoutcome='Yes' before proceeding to the next step.

      Note:

      Shutting down the cell services when one or more grid disks do not return asmdeactivationoutcome='Yes' causes Oracle ASM to dismount the affected disk group, which shuts down the databases abruptly.

    3. Inactivate all grid disks on the cell that needs to be powered down for maintenance. This operation can take 10 minutes or longer.

      # cellcli
      ...sample ...
      CellCLI> ALTER GRIDDISK ALL INACTIVE
      GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
      GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
      GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
      GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
      GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
      GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
      ...repeated for all griddisks...
      
    4. Verify that the grid disks are now offline. Once the disks are offline and inactive in Oracle ASM, the output shows asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all grid disks.

      CellCLI> LIST GRIDDISK ATTRIBUTES name,status,asmmodestatus,asmdeactivationoutcome
      DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
      DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
      DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
      RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
      RECO_CD_01_dmorlx8cel01 inactive OFFLINE Yes
      RECO_CD_02_dmorlx8cel01 inactive OFFLINE Yes
      ...repeated for all griddisks...
      
    5. Once all disks are offline and inactive, you can shut down the cell.
      # shutdown -hP now
      
      When powering off Exadata Storage Servers, all storage services are automatically stopped.
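The grid disk checks in steps 2.2 and 2.4 above can also be scripted. The sketch below is illustrative only: griddisks_offline is a hypothetical helper name, and it assumes the CellCLI output has asmmodestatus as the next-to-last field and asmdeactivationoutcome as the last field, as in the sample listings above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CellCLI): reads the output of
#   cellcli -e LIST GRIDDISK ATTRIBUTES name,status,asmmodestatus,asmdeactivationoutcome
# on stdin and exits 0 only when every grid disk is offline in ASM
# (asmmodestatus OFFLINE or UNUSED) with asmdeactivationoutcome Yes.
griddisks_offline() {
  awk '
    NF >= 3 {
      total++
      # asmmodestatus is the next-to-last field, asmdeactivationoutcome the last.
      if (($(NF-1) == "OFFLINE" || $(NF-1) == "UNUSED") && $NF == "Yes") ok++
    }
    END {
      # Succeed only if at least one grid disk was listed and all passed.
      if (total > 0 && ok == total) exit 0
      exit 1
    }
  '
}

# Example usage on a live cell:
#   cellcli -e "LIST GRIDDISK ATTRIBUTES name,status,asmmodestatus,asmdeactivationoutcome" \
#     | griddisks_offline && echo "safe to shut down" || echo "not safe yet"
```

Because it keys on the last two fields, the same helper works for both the three-attribute listing in step 2.2 and the four-attribute listing in step 2.4.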