1.9.2.1 Step 1: Prepare the Disk Controller BBU for Removal
On certain X3-2, X4-2, and X4-8 database nodes, and on X3-2, X4-2, X3-8, and X4-8 storage servers, the BBU is remote mounted and can be accessed without shutting down the system. However, you must still prepare it for removal from the RAID HBA to avoid the risk of data corruption on the disk volumes. Note that there is no remote mount BBU option for X3-8 database nodes.
For Systems with Remote Mount BBU
Perform the steps in this section if your system has a remote mount BBU. If your system does not have a remote mount BBU, perform the steps in "For Systems That Do Not Have a Remote Mount BBU".
- Log in as the root user.
- Get the version of the image that is running on the server in the rack that requires service.
# cellcli -e LIST CELL ATTRIBUTES releaseVersion
11.2.3.2.1
- Drop the disk controller BBU.
If you are running version 11.2.3.3.0 or later:
- Drop the disk controller BBU for replacement. Run the following command as the celladmin or root user:
# cellcli -e ALTER CELL BBU DROP FOR REPLACEMENT
HDD disk controller battery has been dropped for replacement
- Verify that the BBU was dropped for replacement:
# cellcli -e LIST CELL ATTRIBUTES bbustatus
dropped for replacement.
If you are running version 11.2.3.2.x:
- Locate the server in the rack being serviced, and turn on the indicator light.
Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest Storage Server in the rack installed in RU2, counting up to the top of the rack.
Exadata Database Nodes are identified by a number 1 through 8, where 1 is the lowest database node in the rack, installed in RU16.
Turn on the locate indicator light for easier identification of the server being serviced. If the server number has been identified, then the Locate Button on the front panel may be pressed.
To turn on the indicator light remotely, use any of the following methods:
From a login to CellCLI on Exadata Storage Servers:
CellCLI> ALTER CELL LED ON
From a login to the server's ILOM:
-> set /SYS/LOCATE value=Fast_Blink
From a login to the server's root account:
# ipmitool chassis identify force
Chassis identify interval: indefinite
- Check that the HBA can see the battery and its current status.
Note:
If you are running on Solaris, use /opt/MegaRAID/MegaCli in place of /opt/MegaRAID/MegaCli/MegaCli64 in the commands below.
# /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -a0
The default output should show that the battery is still visible and may show low voltage or other issues depending on the fault. It may return an error reading the BBU if it is hard failed and no longer accessible to the HBA.
- Verify the current cache policy for all logical volumes.
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
The default cache policy should be WriteBack for all volumes. If the battery is functioning normally, the current cache policy is reported as WriteBack. However, if the battery has failed, the current cache policy may be reported as WriteThrough.
- Set the cache policy for all logical volumes to WriteThrough cache mode, which does not use the battery.
# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0
- Verify the current cache policy for all logical volumes is now WriteThrough.
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
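The WriteThrough verification above can be scripted rather than checked by eye. The following is a minimal sketch, not part of the documented procedure: the function name all_writethrough is hypothetical, and it assumes the filtered MegaCli64 output contains one line per logical volume that begins with "Current Cache Policy:", which may vary between MegaCli versions.

```shell
#!/bin/sh
# Hypothetical helper, not an Oracle or MegaCli tool.
# Reads the output of
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
# on standard input.
# ASSUMPTION: each logical volume contributes a line that starts with
# "Current Cache Policy:" and names the active write policy.
# Succeeds (exit 0) only if at least one such line is present and every
# one of them reports WriteThrough.
all_writethrough() {
    awk '
        /^Current Cache Policy:/ {
            seen++
            if ($0 !~ /WriteThrough/) bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}
```

On a live server the function would be fed from the documented command, for example by appending `| all_writethrough` to the verification pipeline. A nonzero exit status means that at least one volume is not in WriteThrough mode, or that no policy lines were recognized.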
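Choosing between the two procedures in this section depends on comparing the reported releaseVersion with 11.2.3.3.0. A minimal sketch of that comparison follows. The function names version_ge and bbu_procedure, and the labels they print, are hypothetical, and the comparison assumes numeric, dot-separated version fields as in the example output 11.2.3.2.1.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# version_ge A B: succeeds when dotted version A is greater than or
# equal to B. Fields are compared numerically from left to right, and
# missing trailing fields are treated as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0
            yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# bbu_procedure VERSION: prints a label (made up for this sketch) for
# the procedure that applies to the given releaseVersion.
bbu_procedure() {
    if version_ge "$1" 11.2.3.3.0; then
        echo "cellcli-drop"          # ALTER CELL BBU DROP FOR REPLACEMENT
    elif version_ge "$1" 11.2.3.2.0; then
        echo "manual-writethrough"   # MegaCli64 -ldsetprop wt procedure
    else
        echo "unknown"               # not covered by this section
    fi
}
```

For example, `bbu_procedure 11.2.3.2.1` prints `manual-writethrough`, which corresponds to the 11.2.3.2.x steps.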
For Systems That Do Not Have a Remote Mount BBU
Perform the steps in this section if your system does not have a remote mount BBU. If your system has a remote mount BBU, see "For Systems with Remote Mount BBU".
If the system does not have the remote mounted battery installed, you need to shut down the node for which the battery requires replacement.
Note:
If you are running Oracle Exadata System Software 19.0 or later, substitute /opt/MegaRAID/storcli/storcli64 for /opt/MegaRAID/MegaCli/MegaCli64 in the following commands.
- Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost when replacement of the battery occurs.
- Set the cache policy of all logical volumes to WriteThrough cache mode.
# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0
- Verify the current cache policy for all logical volumes is now WriteThrough, which does not use the battery:
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
- Shut down the server operating system.
Note the following when powering off Exadata Storage Servers:
- Verify there are no other storage servers with disk faults. Shutting down a storage server while a disk on another server is failing may cause database processes and Oracle ASM to crash, because Oracle ASM loses both disks in the partner pair when this server's disks go offline.
- Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.
- All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner's Guide for details if this is necessary.
ASM drops a disk shortly after it is taken offline. Powering off or restarting Exadata Storage Servers can impact database performance if a storage server remains offline for longer than the ASM disk repair timer allows. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but you can increase it if you need more time.
- Check the disk repair time by logging into ASM and running the following query.
SQL> SELECT dg.name,a.value FROM v$asm_attribute a, v$asm_diskgroup dg WHERE a.name = 'disk_repair_time' AND a.group_number = dg.group_number;
As long as the value allows enough time to comfortably replace the components, there is no need to change it.
If you need to change it, you can use this statement:
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time'='8.5H';
- Check if ASM will be OK if the grid disks go offline. The following command should return Yes for the grid disks being listed.
# cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome
...sample ...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
...repeated for all griddisks....
If one or more disks do not return asmdeactivationoutcome='Yes', check the respective disk group and restore the data redundancy for that disk group. Once the disk group data redundancy is fully restored, re-run the command to verify that asmdeactivationoutcome='Yes' for all grid disks. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
Note:
Shutting down the cell services when one or more grid disks do not return asmdeactivationoutcome='Yes' will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.
- Inactivate all grid disks on the cell that needs to be powered down for maintenance. This can take 10 minutes or longer.
# cellcli
...sample ...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...repeated for all griddisks...
- Verify that the grid disks are now offline. The output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all grid disks once the disks are offline and inactive in ASM.
CellCLI> LIST GRIDDISK ATTRIBUTES name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_01_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_02_dmorlx8cel01 inactive OFFLINE Yes
...repeated for all griddisks...
- Once all disks are offline and inactive, you can shut down the cell.
When powering off Exadata Storage Servers, all storage services are automatically stopped.
# shutdown -hP now
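The asmdeactivationoutcome check in the procedure above is a natural candidate for a scripted gate before any grid disk is inactivated. The sketch below is one way to express it, under stated assumptions: the function name griddisks_safe_to_deactivate is hypothetical, and the input is assumed to be the raw output of the documented LIST GRIDDISK command, with the disk name in the first column and the asmdeactivationoutcome value at the end of each line.

```shell
#!/bin/sh
# Hypothetical helper, not part of CellCLI.
# Reads the output of
#   cellcli -e LIST GRIDDISK ATTRIBUTES name,asmmodestatus,asmdeactivationoutcome
# on standard input.
# ASSUMPTION: each grid disk line has at least three whitespace-separated
# fields, with the name first and the deactivation outcome last.
# Prints the name of every grid disk whose outcome is not exactly "Yes".
# Succeeds (exit 0) only if at least one grid disk was listed and all of
# them returned "Yes".
griddisks_safe_to_deactivate() {
    awk '
        NF >= 3 {
            seen++
            if ($NF != "Yes") {
                print $1
                bad++
            }
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}
```

A wrapper script could pipe the documented command into this function and continue to ALTER GRIDDISK ALL INACTIVE only on success. Any names it prints identify the grid disks whose disk group redundancy should be restored first.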