Checking the Health of the Storage Servers

Recovery Appliance X5 and later racks have three to 18 storage servers, and a Recovery Appliance X4 rack has three to 14 storage servers. Begin at the bottom of the rack and check each server.

To check a storage server:

  1. Power on all storage servers if they are not already on, and wait while the servers initialize the BIOS and load the Linux operating system.
  2. Use SSH to connect from your laptop to the first storage server, using the server's factory IP address.
  3. Log in as the root user with the welcome1 password.

    The terminal emulation settings are the same as for the compute servers. See "Checking the Health of the Compute Servers".

  4. Verify that the rack master and host serial numbers are set correctly. The first number must match the rack serial number, and the second number must match the SysSN label on the front panel of the server.
    # ipmitool sunoem cli "show /System" | grep serial
         serial_number = AK01234567
         component_serial_number = 1234NM5678
    
  5. Verify that the model and rack serial numbers are set correctly:
    # ipmitool sunoem cli "show /System" | grep model
         model = ZDLRA X5
    # ipmitool sunoem cli "show /System" | grep ident
         system_identifier = Oracle Zero Data Loss Recovery Appliance X5 AK01234567
    
  6. Verify that the management network is working:
    # ethtool eth0 | grep det
    Link detected: yes
    
  7. Verify that the ILOM management network is working:
    # ipmitool sunoem cli 'show /SP/network' | grep ipadd
    ipaddress = 192.168.1.101
    pendingipaddress = 192.168.1.101
    
  8. Verify that all memory is present. An X5 storage server has 96 GB of memory, and an X8 storage server has 384 GB:
    # grep MemTotal /proc/meminfo
    MemTotal: 98757064 kB

    If the value is smaller, then use the Oracle ILOM event logs to identify the faulty memory.

  9. Verify that the hardware profile is operating correctly:
    # /opt/oracle.SupportTools/CheckHWnFWProfile
    [SUCCESS] The hardware and firmware matches supported profile for
    server=ORACLE_SERVER_X5-2L_EXADATA_HIGHCAPACITY
    

    The preceding output indicates that the hardware and firmware match a supported profile. However, the following response indicates a problem that you must correct before continuing:

    [WARNING] The hardware and firmware are not supported. See details below
    [InfinibandHCAPCIeSlotWidth]
    Requires:
    x8
    Found:
    x4
    [WARNING] The hardware and firmware are not supported. See details above
    

    Use the --help argument to review the available options, such as obtaining more detailed output.

  10. Verify that 12 disks are visible, online, and numbered from slot 0 to slot 11:
    # cd /opt/MegaRAID/MegaCli
    # ./MegaCli64 -Pdlist -a0 | grep "Slot\|Firmware state" 
    Slot Number: 0
    Firmware state: Online, Spun Up
    Slot Number: 1
    Firmware state: Online, Spun Up
         .
         .
         .
    
  11. Verify that there are four NVMe logical devices:
    # ls -l /dev | grep nvme | grep brw
    brw-rw---- 1 root disk 259, 0 Nov 12 19:10 nvme0n1
    brw-rw---- 1 root disk 259, 1 Nov 12 19:10 nvme1n1
    brw-rw---- 1 root disk 259, 2 Nov 12 19:10 nvme2n1
    brw-rw---- 1 root disk 259, 3 Nov 12 19:10 nvme3n1
    
  12. Confirm the healthy status of the AIC card:
    # nvmecli --identify --all | grep -i indicator
    Health Indicator      : Healthy
    Health Indicator      : Healthy
    Health Indicator      : Healthy
    Health Indicator      : Healthy
    
  13. Verify that the boot order is USB (Oracle Unigen), RAID, and PXE:
    # ubiosconfig export all > /tmp/bios.xml
    # grep -m1 -A20 boot_order /tmp/bios.xml
    <boot_order>
      <boot_device>
        <description>USB:USBIN0:ORACLE SSM UNIGEN-UFD PMAP</description>
        <instance>1</instance>
      </boot_device>
      <boot_device>
        <description>RAID:PCIE6:(Bus 50 Dev 00)PCI RAID Adapter</description>
        <instance>1</instance>
      </boot_device>
      <boot_device>
        <description>PXE:NET0:IBA XE Slot 3A00 v2320</description>
        <instance>1</instance>
      </boot_device>
      <boot_device>
        <description>PXE:NET1:IBA XE Slot 4001 v2196</description>
        <instance>1</instance>
      </boot_device>
    
  14. If the boot order is wrong, then restart the server and fix the order in the BIOS setup:
    # ipmitool chassis bootdev bios
    # shutdown -r now
    
  15. Exit or log out of SSH.
  16. Repeat these steps for the next storage server until you have checked all of them.
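
The disk and flash checks in steps 10 and 12 involve counting lines by eye, which is easy to get wrong across many servers. The following is a minimal sketch, not part of the documented procedure, of shell helpers that summarize that output. The function names (count_online_disks, missing_slots, count_healthy_flash) are hypothetical, and the sketch assumes the output format shown in the steps above. Each helper reads command output on standard input, so it can be tried against saved output first.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the appliance). Each one reads
# command output on stdin and assumes the line formats shown in steps 10 and 12.

# Count physical disks whose firmware state starts with "Online".
# Expects the output of: ./MegaCli64 -Pdlist -a0
count_online_disks() {
    grep -c 'Firmware state: Online'
}

# Print, one per line, each slot number in the range 0..11 that does not
# appear in the MegaCli output. Prints nothing when all 12 slots are listed.
missing_slots() {
    awk -F': *' '
        /Slot Number:/ { seen[$2 + 0] = 1 }
        END { for (i = 0; i <= 11; i++) if (!(i in seen)) print i }
    '
}

# Count flash devices that report a healthy indicator.
# Expects the output of: nvmecli --identify --all
count_healthy_flash() {
    grep -c 'Health Indicator *: Healthy'
}
```

On a storage server, the helpers would be used by piping the commands from the steps above into them, for example ./MegaCli64 -Pdlist -a0 | count_online_disks, which should print 12 on a healthy server, and nvmecli --identify --all | count_healthy_flash, which should print 4.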