3 Backup and Recovery of Hardware Components

This chapter describes how to back up and recover the components of the Exalogic infrastructure.

Note:

Prior to recovering a component, ensure that the component is not in use.

It contains the following sections:

3.1 Exalogic Configuration Utility

The Exalogic Configuration Utility (ECU) is used to configure an Exalogic machine during initial deployment. It is strongly recommended that you back up the configuration and the runtime files generated by the ECU after the initial deployment is complete.

  1. Mount the NFS location defined in Chapter 2, "Backup and Recovery Locations" on the master compute node.

    The master compute node is the node in the Exalogic rack on which the ECU was run.

  2. Create tarballs of the following directories:

    • /opt/exalogic/ecu: Exalogic configuration directory

    • /var/tmp/exalogic/ecu: Exalogic runtime directory

    • (optional) /var/log/exalogic/ecu: Contains ECU log files

  3. To recover the ECU files, extract the tarball containing the ECU configuration files to the /opt/exalogic/ecu directory, and extract the tarball containing the runtime files to the /var/tmp/exalogic/ecu directory. There is no need to restore the log files.
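
    For example, the following is a minimal sketch of steps 2 and 3, assuming the backup NFS location is mounted at /u01/backup (the mount point and archive names are hypothetical):

      # Backup: create the tarballs on the master compute node
      cd /
      tar -czf /u01/backup/ecu_config.tar.gz opt/exalogic/ecu
      tar -czf /u01/backup/ecu_runtime.tar.gz var/tmp/exalogic/ecu

      # Recovery: extract the tarballs back to their original locations
      cd /
      tar -xzf /u01/backup/ecu_config.tar.gz
      tar -xzf /u01/backup/ecu_runtime.tar.gz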

3.2 Exalogic Compute Nodes

This section contains the following subsections:

3.2.1 Backing Up Exalogic Compute Nodes

Backing up the compute node consists of backing up the following:

  • The ILOM of the compute node

  • The operating system of the compute node

3.2.1.1 Backing Up the ILOM of a Compute Node

To back up the ILOM of a compute node, do the following:

Note:

You cannot back up the ILOM of the compute node you are using to back up and restore components. To back up the ILOM of that compute node, run the steps from a different compute node.

  1. Mount the NFS location (Chapter 2, "Backup and Recovery Locations") on one of the compute nodes.

  2. Log in to the compute node as the ilom-admin user.

  3. Encode the backup by running the following command:

    set /SP/config passphrase=phrase
    

    Example:

    set /SP/config passphrase=mypassword1
    Set 'passphrase' to 'mypassword1'
    

    mypassword1 is a passphrase chosen by the user. Note this passphrase; you must provide the same passphrase when restoring the backup.

  4. Back up the configuration of the ILOM by running the following command:

    set /SP/config dump_uri=URI
    

    URI specifies the transfer method, credentials, and destination of the backup file.

    Example:

    set /SP/config dump_uri=scp://root:rootpwd@hostIP/export/Exalogic_Backup/compute_nodes/computenode.backup
    

    hostIP is the IP address of the target host for the backup file.

    /export/Exalogic_Backup/compute_nodes/computenode.backup is the absolute path and the name of the backup file on the remote host.

3.2.1.2 Backing Up the Operating System of a Compute Node

The operating system of an Exalogic machine is installed on the local disk of each compute node.

If the official Exalogic base image was customized, it is recommended that you back up the root file system and the customizations by using standard operating system utilities such as tar or dump, excluding the /var, /tmp, /tree, /proc, /dev, and /poolfsmnt directories and any NFS-mounted file systems. If you are running the Exalogic virtual stack, you should also exclude the poolfs, ExalogicPool, and ExalogicRepo file systems. The ExalogicPool and ExalogicRepo file systems are mounted over NFS.

Save the backup to the NFS location you created for the compute nodes, as described in Chapter 2, "Backup and Recovery Locations" (for example, /export/Exalogic_Backup/compute_nodes).
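
For example, the following is a minimal sketch using GNU tar, assuming the backup share is mounted at /u01/backup and the node is cn01 (both hypothetical). The --one-file-system option keeps tar from descending into NFS and other mounted file systems:

    # Run as root on the compute node; archives the root file system only
    cd /
    tar --one-file-system -czf /u01/backup/compute_nodes/cn01_os_backup.tar.gz \
        --exclude=./var --exclude=./tmp --exclude=./tree \
        --exclude=./proc --exclude=./dev --exclude=./poolfsmnt .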

Note:

Run the following command to list the NFS file systems mounted on your compute node.

mount -t nfs | awk '{print $3}'

For more information, see the Exalogic Backup and Recovery Best Practices White Paper at http://www.oracle.com/technetwork/database/features/availability/maa-exalogic-br-1529241.pdf.

3.2.2 Reimaging and Bare Metal Restore

A compute node should be reimaged when it has been irretrievably damaged, or when multiple disk failures render the local disk unusable and no backup of the compute node exists. During the reimaging procedure, the other compute nodes in the Exalogic machine remain available. After reimaging, you should restore any scripting, cron jobs, maintenance actions, and other customizations performed on top of the Exalogic base image.

Bare-metal restore is the process of restoring a new compute node to the same state as one on which a backup was taken. To perform a bare-metal restore, the new compute node must be reimaged by performing the steps described in this section. After the node is reimaged with the Exalogic base image, it should be restored to its original state by using a previously taken backup.

When an Exalogic machine is deployed in either a physical or a virtual configuration, do the following to reimage a compute node. When the Exalogic machine is deployed in a virtual configuration, you must also perform the procedure in Section 3.2.3, "Recovering Exalogic Compute Nodes in a Virtual Environment."

  1. Open an Oracle support request with Oracle Support Services.

    The support engineer will identify the failed server and send a replacement. Provide the support engineer with the output of the imagehistory and imageinfo commands run from a surviving compute node. This output identifies the image and patch sets that were used to image and patch the original compute node, and it provides a means to restore the system to the same level.

  2. Restore the ILOM of the compute node.

    Note:

    You cannot restore the ILOM of the compute node you are using to back up and restore components. To restore the ILOM of that compute node, run the steps from a different compute node.

    To restore the ILOM of the compute node, do the following:

    1. Mount the NFS location (Chapter 2, "Backup and Recovery Locations") on one of the compute nodes.

    2. Log in to the repaired compute node as the ilom-admin user.

    3. Encode the backup by running the following command:

      set /SP/config passphrase=phrase
      

      Example:

      set /SP/config passphrase=mypassword1
      Set 'passphrase' to 'mypassword1'
      

      mypassword1 is an example passphrase. Provide the passphrase that was used when the backup was created.

    4. Restore the configuration of the ILOM by running the following command:

      set /SP/config load_uri=URI
      

      URI specifies the transfer method, credentials, and location of the backup file to load.

      Example:

      set /SP/config load_uri=scp://root:rootpwd@hostIP/export/Exalogic_Backup/compute_nodes/computenode.backup
      

      hostIP is the IP address of the host on which the backup file is stored.

      /export/Exalogic_Backup/compute_nodes/computenode.backup is the absolute path and the name of the backup file on the remote host.

  3. Download the Oracle Exalogic base image and patch-set updates (PSUs).

    Download the appropriate Oracle Exalogic base image from https://edelivery.oracle.com and the appropriate PSU from My Oracle Support https://support.oracle.com.

  4. Image the replacement compute node.

    The compute node being replaced can be imaged by using a PXE boot server or through the web-based ILOM of the compute node. This document does not cover the steps to configure a PXE boot server; however, it provides the steps to enable the compute node to use one.

    If a PXE boot server is being used to reimage the compute node, log in to the ILOM of the compute node through SSH, set boot_device to pxe, and reboot the compute node.
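
    The following is a minimal sketch of the PXE approach, assuming an x86 ILOM that supports the /HOST boot_device property and a hypothetical ILOM hostname cn01-ilom; verify the commands against your ILOM firmware:

      ssh root@cn01-ilom
      -> set /HOST boot_device=pxe
      -> reset /SYS

    The boot_device=pxe setting applies to the next boot only, and reset /SYS reboots the compute node so that it boots from the PXE server.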

    If the web-based ILOM is being used instead, ensure that the image downloaded earlier is on the local disk of the host from which the web-based ILOM interface is being launched and then do the following:

    1. Open a web browser and go to the ILOM of the compute node, at a URL such as http://host-ilom.mycompany.com/

    2. Log in to the ILOM as the root user.

    3. Navigate to Redirection under the Remote Control tab, and click the Launch Remote Console button. The remote console window is displayed.

      Note:

      Do not close this window until the entire imaging process is completed. You will need to return to this window to complete the network configuration at the end of the imaging process.

    4. In the remote console window, click on the Devices menu item and select:

      - Keyboard (selected by default)

      - Mouse (selected by default)

      - CD-ROM Image

      In the new dialog box that is displayed, select the Linux base image iso file that you downloaded.

    5. On the ILOM window, navigate to the Host Control tab under the Remote Control tab.

    6. Select CDROM from the drop-down list and then click Save.

    7. Navigate to the Remote Power Control tab in the Remote Control tab.

    8. Select Power Cycle from the drop-down list, and then click Save.

    9. Click OK to confirm that you want to power cycle the machine.

    This starts the imaging of the compute node. Once the imaging is complete, the first boot scripts prompt the user to provide the network configuration.

  5. Configure the replacement compute node.

    • If you have a valid backup, restore the /etc directory and customizations you made, if any, to the replacement compute node.

    • If you do not have a valid backup, configure the replacement compute node with the appropriate DNS, time zone, and NTP settings. These settings should be the same on all the compute nodes in the Exalogic machine.

    Note:

    If the compute node being replaced is the master node of the Exalogic machine, restore the ECU configuration that was backed up earlier, as described in Section 3.1, "Exalogic Configuration Utility." The master node in an Exalogic machine is the node on which the Exalogic Configuration Utility (ECU) is run.

  6. If the Exalogic machine was deployed in a physical configuration, you may have to update the VNIC configuration on the IB switches with the new IB port GUIDs.

    To validate the existing VNIC configuration on both the IB switches attached to the compute node, do the following:

    1. Get the port GUIDs of the replacement compute node by running the ibstat command on the compute node.

    2. Log in to the IB switches attached to the compute node and run the showvnics command to view the VNICs created on the switch.

    3. For the VNICs associated with the replacement compute node, verify that the port GUIDs are displayed in the output of the ibstat command that you ran in step 1.
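
    For example, a quick check (a sketch; the exact output format varies by firmware version):

      # On the replacement compute node: list the IB port GUIDs
      ibstat | grep "Port GUID"

      # On each gateway switch: list the VNICs and the GUIDs they are bound to
      showvnics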

3.2.3 Recovering Exalogic Compute Nodes in a Virtual Environment

Note:

In an Exalogic virtual configuration, do not attempt to replace a failed compute node with an entirely new one. Contact Oracle Support for the procedure to perform such a replacement. An improperly replaced component might not be discovered correctly by Exalogic Control. You can use the procedures described in this document to restore a failed component after repairing it.

When the Exalogic machine is deployed in a virtual configuration, do the following to replace the compute node:

  1. If the first compute node is down, you must migrate Oracle VM Manager and the control database vServer to a different compute node. To migrate them, do the following:

    1. Migrate the Database vServer by performing the steps in Section 4.1.1 on a running compute node.

    2. Migrate the Oracle VM Manager vServer by performing the steps in Section 4.1.2 on a running compute node.

    3. Stop the components of the Exalogic Control stack. For more information, see Section 5.2.1, "Stopping the Components of the Exalogic Control Stack."

    4. Start the components of the Exalogic Control stack. For more information, see Section 5.2.5, "Starting the Exalogic Control Stack."

  2. Migrate the virtual machines running on the compute node:

    1. Log in to the Oracle VM Manager BUI.

    2. Navigate to Home, and then to Server Pools.

    3. Select and expand the server pool to list the compute nodes in the pool.

    4. Select and expand the compute node to list the virtual machines running on the selected node.

    5. Migrate the virtual machines, one at a time, by doing the following:

      i. Select the virtual machine to be migrated.

      ii. Select Migrate under Actions to bring up the Migration Assistant.

      iii. Select the Unassigned Virtual Machine folder.

      iv. Select OK.

  3. Remove the compute node from the Oracle VM server pool:

    1. Log in to Exalogic Control as the root user.

    2. Expand Assets in the navigation pane on the left.

    3. Under Servers, expand the compute node that is being replaced.

    4. Select the Oracle VM Server asset.

    5. Click Remove from Server Pool in the Actions pane on the right.

      Note:

      The job to remove the compute node may fail. If it does fail, examine the job in the jobs pane. The job consists of the following tasks:

      • RemoveOvmServerFromPool

      • OvmRefreshDomainModelTask

      If the first task succeeds, the failure of the second task can be ignored.

    6. Verify that the node has been removed from the pool by logging in to Oracle VM Manager.

    7. Delete the now unassigned compute node from Oracle VM Manager.

  4. Remove the compute node from the assets:

    1. Log in to Exalogic Control as the root user.

    2. Expand Assets in the navigation pane on the left.

    3. Expand Servers to list all the compute nodes.

    4. Select and expand the compute node that is being replaced.

    5. Select the operating system and place it in maintenance mode, by clicking Place in Maintenance Mode in the Actions pane.

    6. Select the server and place it in maintenance mode, by clicking Place in Maintenance Mode in the Actions pane.

    7. Delete the operating system by clicking Delete Asset in the Actions pane.

    8. Delete the server by clicking Delete Asset in the Actions pane.

  5. Replace the failed compute node by following the standard replacement process.

  6. Perform the steps in Section 3.2.2, "Reimaging and Bare Metal Restore" of this document to re-image the compute node and restore the previous configuration from a backup.

  7. Because the replacement server has the same IP address but a different MAC address, you may need to flush the ARP cache of cn01 (the ECU master node).

    Ping the new node from the ECU master node.

    If the ping fails but connectivity is otherwise good, flush the ARP cache on cn01. You may have to wait some time for the cache on the Cisco switch to be cleared.

    • To look at the cache, run arp -n

    • To flush the cache, run ip -s neigh flush all

  8. After the node is reimaged, log in to the compute node as root, and set the ovs-agent password for the oracle user:

    ovs-agent-passwd oracle password
    

    Note:

    For information on the default password, contact Oracle Support.

  9. On the master compute node, go to the /opt/exalogic/ecu directory, and set the ECU_HOME environment variable, as follows:

    export ECU_HOME=/opt/exalogic/ecu
    

    Run cd $ECU_HOME to verify whether the ECU_HOME environment variable is set correctly.

  10. Set up password-less SSH over IP and IB by running the /opt/exalogic.tools/tools/setup-ssh.sh script as follows:

    ./setup-ssh.sh -H IP-of-xenbr0-on-replaced-node
    ./setup-ssh.sh -H IP-of-bond1-on-replaced-node
    
  11. Identify the GUIDs of the IB ports of the failed compute node.

    The GUIDs of the IB ports of the failed compute node are located in the ECU log files.

    1. Log in to the master compute node, and go to the /var/tmp/exalogic/ecu/cnodes directory.

      The IB port GUIDs of all the compute nodes in the machine are stored in files named ibstat.node.NodeIndex, where NodeIndex is the compute node number (1–30).

    2. Using a text editor, open the ibstat.node.NodeIndex file corresponding to the failed compute node.

      For example, if compute node 15 is the failed node, open ibstat.node.15, as shown in the following example:

      [root@exlcn15 cnodes]# cat ibstat.node.15
         CA 'mlx4_0'
                 CA type: MT26428
                 Number of ports: 2
                 Firmware version: 2.9.1000
                 Hardware version: b0
                 Node GUID: 0x0021280001a122a8
                 System image GUID: 0x0021280001a122ab
                 Port 1:
                         State: Active
                         Physical state: LinkUp
                         Rate: 40
                         Base lid: 158
                         LMC: 0
                         SM lid: 95
                         Capability mask: 0x02510868
                         Port GUID: 0x0021280001a122a9
                         Link layer: IB
                 Port 2:
                         State: Active
                         Physical state: LinkUp
                         Rate: 40
                         Base lid: 159
                         LMC: 0
                         SM lid: 95
                         Capability mask: 0x02510868
                         Port GUID: 0x0021280001a122aa
                          Link layer: IB
      
    3. Note the IB port GUIDs for the compute node.

      They are indicated for each port with the keyword Port GUID.

      In this example, the GUID for IB port 1 is 0x0021280001a122a9 and the GUID for port 2 is 0x0021280001a122aa.
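
      For example, you can extract just the GUID lines from the file (a sketch):

      grep "Port GUID" /var/tmp/exalogic/ecu/cnodes/ibstat.node.15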

  12. Configure the networks and IB partitions on the compute node by running the Exalogic Configuration Utility (ECU):

    Note:

    If this node is the master node in the Exalogic rack (that is, the node from which the ECU was run initially), then before performing this step, restore the ECU configuration files, runtime files, and log files by performing the steps in Section 3.1.

    1. Discover the switches.

      ./ecu.sh ib_switches discover
      
    2. Apply the configuration to the new compute node.

      Note:

      Before you run the following command, verify that the /var/tmp/ecu/cnodes_current.json file contains the current IP addresses on the eth-admin and IPoIB-default interfaces of the node being replaced.

      ./ecu.sh apply_cnode_config node_index
      

      node_index is the compute node number in the rack.

    3. Reboot the node.

      ./ecu.sh reboot_cnode current node_index
      
    4. Test that the IP addresses are as expected.

      ./ecu.sh test_cnode_network target cnode_number
      

      The following part of the output will indicate the interfaces and IP addresses required for the next steps:

      Network                       IP              Ping Status
      ------------------------------------------------------------
      ILOM                          10.196.17.152   OK
      eth-admin                     10.196.17.122   OK
      IPoIB-default                 192.168.17.122  OK
      IPoIB-admin                   192.168.30.2    OK
      IPoIB-storage                 192.168.31.2    OK
      IPoIB-virt-admin              172.36.0.2      OK
      IPoIB-ovm-mgmt                192.168.33.2    OK
      IPoIB-vserver-shared-storage  172.37.0.2      OK
      
      INFO:netutils:Ping to all IP addresses succeeded
      
  13. For the virtual-machine networks to be plumbed correctly, the customer EoIB and private vNet IB partitions must be updated with the IB port GUIDs of the replacement compute node.

    1. Log in to the replacement compute node.

    2. Identify the GUIDs of the IB ports of the replacement compute node, by running the ibstat command, as shown in the following example:

      [root@exlcn15 ~]# ibstat
        CA 'mlx4_0'
                CA type: MT26428
                Number of ports: 2
                Firmware version: 2.9.1000
                Hardware version: b0
                Node GUID: 0x0021280001eface6
                System image GUID: 0x0021280001eface9
                Port 1:
                        State: Active
                        Physical state: LinkUp
                        Rate: 40
                        Base lid: 98
                        LMC: 0
                        SM lid: 1
                        Capability mask: 0x02510868
                        Port GUID: 0x0021280001eface7
                        Link layer: IB
                Port 2:
                        State: Active
                        Physical state: LinkUp
                        Rate: 40
                        Base lid: 99
                        LMC: 0
                        SM lid: 1
                        Capability mask: 0x02510868
                        Port GUID: 0x0021280001eface8
                        Link layer: IB
      

      The GUID for each port is indicated by the Port GUID field.

      In this example, the GUID for IB port 1 is 0x0021280001eface7 and the GUID for port 2 is 0x0021280001eface8.

    3. Log in to the IB switch running the master subnet manager and run smpartition start.

      This command creates a temporary file partitions.conf.tmp in the /conf directory. This file can be updated using regular Linux commands.

    4. In the /conf/partitions.conf.tmp file, replace the failed compute node's IB port GUIDs, which you identified in step 11, with the GUIDs of the replacement node, as determined in step 13.b.

      You can do this by using a text editor, or by using the sed command to replace the GUID for each port, as shown in the following example:

      sed -i 's/0x0021280001a122a9/0x0021280001eface7/g' /conf/partitions.conf.tmp
      sed -i 's/0x0021280001a122aa/0x0021280001eface8/g' /conf/partitions.conf.tmp
      
    5. Propagate the configuration to all the IB switches in the fabric by running smpartition commit.

  14. Update the credentials for the ILOM and the compute node:

    1. Log in to the Exalogic Control BUI.

    2. Navigate to Credentials in the Plan Management section.

    3. Enter the host name of the compute node in the search box, and click Search.

      The IPMI and SSH credential entries for the ILOM and the compute node are displayed.

    4. To update all four credentials, do the following:

      i. Select the entry for the credentials and click Edit. The Update Credentials dialog box is displayed.

      ii. Update the Password and Confirm Password fields.

      iii. Click Update.

  15. Rediscover and add the asset:

    1. Log in to the Exalogic Control BUI.

    2. In the navigation pane on the left, expand Plan Management, and under Profiles and Policies, expand Discovery.

    3. Select the appropriate Server OS @ host discovery profile.

    4. In the Actions pane on the right, click Add Assets.

    5. On the resulting screen, verify whether the correct discovery profile is displayed.

    6. Click Add Now.

    7. Wait until the discovery process succeeds.

    8. Select the appropriate Server ILOM @ host discovery profile.

    9. In the Actions pane on the right, click Add Assets.

    10. On the resulting screen, verify whether the correct discovery profile is displayed.

    11. Click Add Now.

    12. Wait until the discovery process succeeds.

    13. In the left navigation pane, expand the Assets section to display all the assets.

    14. Verify whether the replaced server is displayed in the Assets section and positioned correctly in the photo-realistic view.

  16. Log in to Oracle VM Manager using the admin user credentials and discover the new compute node. Use the IP address of the IPoIB-ovm-mgmt partition.

    Note:

    The default partition key and network CIDR for the IPoIB-ovm-mgmt-partition are 0x8004 and 192.168.23.0/24 respectively.

    This information is also available in the /opt/exalogic/ecu/config/cnode_ipoib_networks.json file on the master compute node.
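
    For example, you can locate the entry with grep (a sketch; the layout of the JSON file may vary between releases):

      grep -A 2 "IPoIB-ovm-mgmt" /opt/exalogic/ecu/config/cnode_ipoib_networks.json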

  17. After the new compute node is discovered, ensure that it is added to the required pool.

    If the compute node is in the unassigned-server group, you should add it manually in Oracle VM Manager.

    1. In the Hardware tab of the left navigation pane, expand Resources, and right-click on the name of the server pool to which you want to add the compute node.

    2. From the resulting context menu, select Add/Remove Servers.

      The Add/Remove Servers from the Server Pool dialog box is displayed.

    3. Select the server that you want to add from the Available Servers list and move it to the Selected Servers list.

    4. Click OK.

  18. Refresh the repository in Oracle VM Manager by doing the following:

    1. Log in to the Oracle VM Manager console.

    2. Click Home under the View menu.

    3. Select Server Pools in the left pane.

    4. Select Repositories in the right pane.

    5. Click the Refresh Repositories icon. This is the icon with curved blue arrows.

  19. Present the repository to the compute nodes in Oracle VM Manager:

    1. Log in to the Oracle VM Manager console.

    2. Click on Home under the View menu.

    3. Select Server Pools in the left pane.

    4. Select Repositories in the right pane.

    5. Select the entry with a forward slash under the Repositories table.

    6. Click the Present-Unpresent Selected Repository icon. This is the icon with green up and down arrows.

    7. In the Present this Repository to Server(s) dialog box, select the compute nodes listed under the Servers column and move them to the Present to Server(s) column.

    8. Click OK.

    9. Verify whether the compute node has been added by monitoring the Oracle VM Manager job.

  20. Add the compute node as an admin server to the repository.

    1. Log in to the Oracle VM Manager console.

    2. Click on Hardware under the View menu.

    3. Select the Storage tab in the left pane.

    4. Expand File Servers.

    5. Expand Generic Network File System.

    6. Select the Generic Network File System, and then select Add/Remove Admin Servers from the menu.

    7. In the Add/Remove Admin Servers dialog box, select the compute node from the available servers list and move it to the selected servers list.

    8. Click OK.

    9. Verify whether the compute node has been added by monitoring the Oracle VM Manager job.

  21. Log in to the Exalogic Control BUI as the Cloud Admin user, and start the virtual machines that were migrated to the Unassigned Virtual Machine folder in the Oracle VM Manager BUI in step 2. The virtual machines are started on the replacement compute node.

    Note:

    • If you migrated the virtual machines to another compute node within the pool, use Oracle VM Manager to migrate them back to the replaced compute node. Follow the instructions in step 2.e, but instead of selecting the Unassigned Virtual Machine folder, select the replacement compute node.

    • Currently, migrating virtual machines between pools is not supported.

3.3 InfiniBand Switches

The InfiniBand switches are a core part of an Exalogic machine, and the configurations of all the InfiniBand switches must be backed up regularly. The configuration backups of the Service Processor can be created by using either the ILOM BUI or the ILOM CLI.

This section contains the following subsections:

3.3.1 Backing Up the InfiniBand switches

Save the IB switch backups to the NFS locations you created for the IB switches as described in Chapter 2, "Backup and Recovery Locations" (for example, /export/Exalogic_Backup/ib_gw_switches and /export/Exalogic_Backup/ib_spine_switches). Backups must be created for all the switches in the fabric. Create separate directories under the NFS share for each switch in the fabric.

To back up the Service Processor configuration of an IB switch by using the ILOM CLI, do the following:

  1. Mount the NFS location (Chapter 2, "Backup and Recovery Locations") on one of the compute nodes.

  2. Log in to the InfiniBand switch as the ilom-admin user.

  3. Encode the backup by running the following command:

    set /SP/config passphrase=phrase
    

    Example:

    set /SP/config passphrase=mypassword1
    Set 'passphrase' to 'mypassword1'
    

    mypassword1 is a passphrase chosen by the user. Note this passphrase; you must provide the same passphrase when restoring the backup.

  4. Back up the configuration of the InfiniBand switch by running the following command:

    set /SP/config dump_uri=URI
    

    URI specifies the transfer method, credentials, and destination of the backup file.

    Example:

    set /SP/config dump_uri=scp://root:rootpwd@hostIP/export/Exalogic_Backup/ib_type_switches/switch.backup
    

    hostIP is the IP address of the target host for the backup file.

    type is either gw or spine depending on the type of IB switch.

    /export/Exalogic_Backup/ib_type_switches/switch.backup is the absolute path and the name of the backup file on the remote host.

  5. To back up the user settings of the InfiniBand switch, manually back up the /etc/opensm/opensm.conf file to the same location to which you backed up the Service Processor configuration.

  6. To back up the partitions of the InfiniBand switch, manually back up the /conf/partitions.current file to the same location to which you backed up the Service Processor configuration.

    After the files are transferred to the NFS location, they can be backed up to more permanent storage as part of the operating system backup.
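
    For example, a sketch of steps 5 and 6 using scp from the switch, reusing the hostIP placeholder from step 4; switch1 is a hypothetical per-switch directory:

      scp /etc/opensm/opensm.conf root@hostIP:/export/Exalogic_Backup/ib_gw_switches/switch1/
      scp /conf/partitions.current root@hostIP:/export/Exalogic_Backup/ib_gw_switches/switch1/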

3.3.2 Recovering the InfiniBand Switches in a Physical Environment

Ensure that no configuration changes are being made while performing the restore. Configuration changes include vServer creation and vNet creation.

Note:

During the restore, there will be a temporary disruption of traffic.

Restore the configuration of an IB switch through the ILOM CLI, by doing the following:

  1. Mount the NFS location (Chapter 2, "Backup and Recovery Locations") on one of the compute nodes.

  2. Log in to the InfiniBand switch as the ilom-admin user.

  3. Encode the backup by running the following command:

    set /SP/config passphrase=phrase
    

    Example:

    set /SP/config passphrase=mypassword1
    Set 'passphrase' to 'mypassword1'
    

    mypassword1 is an example passphrase. Provide the passphrase that was used when the backup was created.

  4. Restore the configuration of the InfiniBand switch by doing the following:

    1. Run the following command:

      set /SP/config load_uri=URI
      

      URI specifies the transfer method, credentials, and location of the backup file to load.

      Example:

      set /SP/config load_uri=scp://root:rootpwd@hostIP/export/Exalogic_Backup/ib_type_switches/switch.backup
      

      hostIP is the IP address of the host on which the backup file is stored.

      type is either gw or spine depending on the type of IB switch.

      /export/Exalogic_Backup/ib_type_switches/switch.backup is the absolute path and the name of the backup file on the remote host.

    2. If the failed switch is replaced with a new one, as opposed to being repaired and reinstalled, add the GUIDs of the BridgeX ports to the EoIB partitions on the switch.

      Identify the GUIDs of the BridgeX ports by running showgwports on the switch:

      showgwports
      INTERNAL PORTS:
      ---------------
      Device    Port  Portname    PeerPort  PortGUID            LID     IBState  GWState
      ---------------------------------------------------------------------------------
      Bridge-0  1     Bridge-0-1  4         0x002128f4832ec001  0x0007  Active   Up
      Bridge-0  2     Bridge-0-2  3         0x002128f4832ec002  0x0006  Active   Up
      Bridge-1  1     Bridge-1-1  2         0x002128f4832ec041  0x000a  Active   Up
      Bridge-1  2     Bridge-1-2  1         0x002128f4832ec042  0x000e  Active   Up
      

      Log in to the switch running the master subnet manager and add the BridgeX ports as full members to each of the EoIB partitions by running the following command:

      smpartition add -pkey PKEY -port BridgeXGUID -m full
      

      Example:

      smpartition add -pkey 0x8006 -port 0x002128f4832ec002 -m full
      
    3. Restore the user settings that you backed up earlier as described in Section 3.3.1, "Backing Up the InfiniBand switches."

    4. Restore the partitions that you backed up earlier as described in Section 3.3.1, "Backing Up the InfiniBand switches." Before restoring the partitions, review the current partitions and the partitions.current file.

      Restore the partitions.current file to the master switch and propagate the subnet manager configuration, by running the following commands:

      Note:

      To ensure that the new switch joins the IB fabric, enable the subnet manager by running the enablesm command on both switches. However, only one of the gateway switches should be set as the master sm.

      smpartition start
      smpartition commit
      
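
      For example, a sketch for bringing the replacement switch into the fabric, using commands named elsewhere in this chapter:

      enablesm
      getmaster

      Run enablesm on the replacement switch to start the subnet manager, and run getmaster to verify that only the intended gateway switch is the master.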

3.3.3 Recovering the InfiniBand Switches in a Virtual Environment

Note:

In an Exalogic virtual configuration, do not attempt to replace a failed InfiniBand switch with an entirely new one. Contact Oracle Support for the procedure to perform such a replacement. An improperly replaced component might not be discovered correctly by Exalogic Control. You can use the procedures described in this document to restore a failed component after repairing it.

When Exalogic is deployed in a virtual configuration, do the following to replace a failed InfiniBand switch.

  1. Remove the InfiniBand switch from the assets:

    1. Log in to the Exalogic Control BUI as the root user.

    2. From the Assets accordion in the navigation pane on the left, expand Switches.

    3. Select the switch being replaced.

    4. Place the switch in maintenance mode, by clicking Place in Maintenance Mode in the Actions pane.

    5. Select the switch being replaced.

    6. In the Actions pane on the right, click Delete Assets.

  2. Replace the failed switch by following the standard replacement procedure.

    Note:

    Before connecting the switch to the IB fabric, disable the subnet manager by running the disablesm command on the switch.

  3. Restore the switch from the latest backup, by performing the procedure described in Section 3.3.2, "Recovering the InfiniBand Switches in a Physical Environment."

    Note:

    For the NM2-36P switches in a virtual configuration, do not restore the partitions.current file.

  4. Identify the BridgeX port GUIDs of the failed IB switch.

    The port GUIDs are created when running the ECU, and can be retrieved from the runtime ECU configuration files.

    1. Log in to the master compute node, and go to the /var/tmp/exalogic/ecu/switches directory.

      The BridgeX port GUIDs of all the switches in the machine are stored in files named switchHostname_showgwports.out, where switchHostname is the hostname or IP address of the IB switch.

    2. Using a text editor, open the file corresponding to the failed IB switch. For example, if elswib02 is the failed IB switch, open elswib02_showgwports.out, as shown in the following example:

      cat elswib02_showgwports.out
      showgwports
         
      INTERNAL PORTS:
      ---------------
         
      Device   Port Portname  PeerPort PortGUID           LID    IBState  GWState
      ---------------------------------------------------------------------------
      Bridge-0  1   Bridge-0-1    4    0x002128deb28ac001 0x004b Active   Up
      Bridge-0  2   Bridge-0-2    3    0x002128deb28ac002 0x004f Active   Up
      Bridge-1  1   Bridge-1-1    2    0x002128deb28ac041 0x0055 Active   Up
      Bridge-1  2   Bridge-1-2    1    0x002128deb28ac042 0x0059 Active   Up
         
      CONNECTOR 0A-ETH:
      -----------------
         
      Port      Bridge      Adminstate Link  State       Linkmode       Speed 
      ------------------------------------------------------------------------
      0A-ETH-1  Bridge-0-2  Enabled    Up    Up          XFI            10Gb/s
      0A-ETH-2  Bridge-0-2  Enabled    Up    Up          XFI            10Gb/s
      0A-ETH-3  Bridge-0-1  Enabled    Up    Up          XFI            10Gb/s
      0A-ETH-4  Bridge-0-1  Enabled    Up    Up          XFI            10Gb/s
         
      CONNECTOR 1A-ETH:
      -----------------
         
      Port      Bridge      Adminstate Link  State       Linkmode       Speed 
      ------------------------------------------------------------------------
      1A-ETH-1  Bridge-1-2  Enabled    Up    Up          XFI            10Gb/s
      1A-ETH-2  Bridge-1-2  Enabled    Up    Up          XFI            10Gb/s
      1A-ETH-3  Bridge-1-1  Enabled    Up    Up          XFI            10Gb/s
      1A-ETH-4  Bridge-1-1  Enabled    Up    Up          XFI            10Gb/s
      
    3. Note the four BridgeX port GUIDs, which are displayed in the PortGUID column of the INTERNAL PORTS section.

      In this example, the BridgeX ports are 0x002128deb28ac001, 0x002128deb28ac002, 0x002128deb28ac041 and 0x002128deb28ac042.

  5. Add the gateway port GUIDs of the switch to the existing EoIB partitions:

    1. Log in to the master compute node of the rack—that is, the compute node on which the Exalogic Configuration Utility (ECU) was run.

    2. Set the ECU_HOME variable in your shell, as shown in the following example:

      export ECU_HOME=/opt/exalogic/ecu
      
    3. Discover all the IB switches in the fabric, by running the following command:

      ./ecu.sh ib_switches discover
      
    4. Discover and add the gateway bridge GUIDs of all the IB switches to the system EoIB partitions of the switches, by running the following commands:

      • ./ecu.sh ib_switch_gw_ports discover: discovers the GUIDs

      • ./ecu.sh ib_switch_gw_ports add_ports: adds the GUIDs to the 0x8006 partition (the EoIB partition for the Exalogic Control stack)

      • ./ecu.sh ib_switch_gw_ports show: displays the port GUIDs

    5. Add the gateway port GUIDs of the switch being replaced to the custom EoIB partitions, as full members.

      i. Log in to the switch running the master subnet manager.

      ii. Run smpartition start to start editing the partitions. This command creates a temporary file partitions.conf.tmp in the /conf directory. This file can be updated using regular Linux commands.

      iii. In the /conf/partitions.conf.tmp file, replace the failed switch's BridgeX port GUIDs, which you identified in step 4, with the BridgeX port GUIDs of the replacement switch, as identified in step 5.d.

      You can do this by using a text editor, or by using the sed command, as shown in the following example:

      sed -i 's/0x002128deb28ac001/0x002128fe54f6c001/g' /conf/partitions.conf.tmp
      

      iv. Run smpartition commit to commit and propagate the configuration to all the switches in the fabric.

  6. Verify whether the output of the smnodes list command on all the InfiniBand switches is correct. The command must display the IP addresses of the switches in the fabric.

    Note:

    • If the output of smnodes list does not contain the IP addresses of all the IB switches intended to run the subnet manager, use the smnodes add command to update the SM nodes across all the switches in the fabric.

      smnodes add IP_address_of_IB_switch
      
    • To delete the IP address of a switch from the smnodes list, use the smnodes delete command.

      smnodes delete IP_address_of_IB_switch
      
  7. Propagate the current subnet manager configuration to the switch:

    Note:

    To ensure that the new switch joins the IB fabric, enable the subnet manager by running the enablesm command on both switches. However, only one of the gateway switches should be set as the master sm.

    1. Identify the switch running the master subnet manager, by logging in to any one of the switches and running getmaster.

    2. Log in to the switch running the master subnet manager.

    3. Run smpartition start to edit the subnet manager configuration.

    4. Run smpartition commit to save and propagate the subnet manager configuration.

    5. Log in to the switch being replaced, and verify whether the subnet manager configuration has been propagated by running smpartition list active.

  8. Update the credentials for the switch:

    1. Log in to the Exalogic Control BUI.

    2. Select Credentials in the Plan Management accordion.

    3. Enter the host name of the switch in the search box and click Search.

      The IPMI and SSH credential entries for the switch are displayed.

    4. To update all four credentials, do the following:

      i. Select the entry for the credentials and click Edit. The Update Credentials dialog box is displayed.

      ii. Update the Password and Confirm Password fields.

      iii. Click Update.

  9. Rediscover the asset:

    1. Log in to the Exalogic Control BUI.

    2. In the navigation pane on the left, expand Plan Management, and under Profiles and Policies, expand Discovery.

    3. Select the appropriate Infiniband @ host discovery profile.

    4. In the Actions pane on the right, click Add Assets.

    5. On the resulting screen, verify whether the correct discovery profile is displayed.

    6. Click Add Now.

    7. Wait until the discovery process succeeds.

    8. In the left navigation pane, expand Assets to display all the assets.

    9. Verify whether the replaced switch is displayed in the Assets section and positioned correctly in the photo-realistic view.

  10. Add the switch as an asset:

    1. Log in to the Exalogic Control BUI.

    2. Expand the Assets accordion.

    3. Select the appropriate rack, and then select Place/Remove Assets from the Actions accordion.

    4. In the Place/Remove Assets in the Oracle Exalogic Rack dialog box, select the switch, and then click Submit.

    After the job is complete, the switch will be visible in the Assets tab.

3.4 Cisco Management Switch

The Cisco Management switch provides connectivity on the management interface and must be backed up regularly.

This section contains the following subsections:

3.4.1 Backing Up the Management Switch

Save the Cisco switch backups to the NFS locations you created for the Cisco switch as described in Chapter 2, "Backup and Recovery Locations" (for example, /export/Exalogic_Backup/management_switches).

  1. Enable the FTP service on the local ZFS storage appliance:

    1. Log in to the storage BUI, https://storageIP:215/, as the root user.

    2. Click the Shares tab.

    3. Double-click the management_switches share.
      The properties page of the management_switches share is displayed.

    4. Click the Protocols tab.

    5. Under the FTP section, deselect Inherit from project.

    6. Set the Share mode as Read/write.

    7. Click Apply.

    8. Navigate to Services under Configuration, and select the FTP service.

    9. On the FTP page, in the General Settings section, set the Default Login root to the NFS share location created for the Cisco switch as described in Chapter 2, "Backup and Recovery Locations."

      Example: /export/Exalogic_Backup/management_switches

    10. Under Security Settings, permit root login.

    11. Click Apply.

  2. Back up the configuration of the Cisco switch:

    1. Log in to the Cisco switch, and at the Router> prompt, issue the enable command.

      Provide the required password when prompted. The prompt changes to Router#, which indicates that the router is now in privileged mode.

    2. Configure the FTP user name and password, where root-password is the password of the root user:

      Router#config terminal
      Router (config)#ip ftp username root
      Router (config)#ip ftp password root-password
      Router (config)#end
      Router#
      
    3. Copy the configuration to the FTP server.

      Router#copy running-config ftp:
      Address or name of remote host []? IP address of your storage
      Destination filename [Router-confg]? backup_cfg_for_router
      Writing backup_cfg_for_router !
      1030 bytes copied in 3.341 secs (308 bytes/sec)
      Router#
      
    4. Open the configuration file using a text editor. Search for and remove any line that starts with AAA.

      Note:

      This step is performed to remove any security commands that can lock you out of the router.
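
      For example, a minimal sketch using sed on a host that has the share mounted; the path and the lowercase "aaa" prefix are assumptions, so verify them against your configuration file:

      sed -i '/^aaa /d' /export/Exalogic_Backup/management_switches/backup_cfg_for_router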

3.4.2 Recovering the Management Switch in a Physical Environment

To recover the Cisco switch, do the following:

Note:

Ensure that no configuration changes are being made while performing the restore.

  1. Log in to the Cisco switch.

    At the Router> prompt, issue the enable command, and provide the required password when prompted. The prompt changes to Router#, which indicates that the router is now in privileged mode.

  2. Configure the FTP user name and password, where root-password is the password of the root user:

    Router#config terminal
    Router (config)#ip ftp username root
    Router (config)#ip ftp password root-password
    Router (config)#end
    Router #
    
  3. Copy the configuration file from the FTP server to the router, which must be in privileged (enable) mode and have a basic configuration.

  4. Run the following command:

    Router# copy ftp: running-config
    Address or name of remote host [IP address]? 
    Source filename [backup_cfg_for_router]? 
    Destination filename [running-config]? 
    Accessing ftp://storageIP/backup_cfg_for_router...
    Loading backup_cfg_for_router !
    [OK - 1030/4096 bytes]
    1030 bytes copied in 13.213 secs (78 bytes/sec)
    Router#
    

3.4.3 Recovering the Management Switch in a Virtual Environment

Note:

In an Exalogic virtual configuration, do not attempt to replace a failed Cisco switch with an entirely new one. Contact Oracle Support for the procedure to perform such a replacement. An improperly replaced component might not be discovered correctly by Exalogic Control. You can use the procedures described in this document to restore a failed component after repairing it.

When Exalogic is deployed in a virtual configuration, do the following to replace a failed Cisco switch.

  1. Remove the Cisco switch from the assets:

    1. Log in to the Exalogic Control BUI as the root user.

    2. Navigate to the Assets section on the left side of the page.

    3. Expand Switches to list all the switches associated with the vDC.

    4. Select the switch being replaced.

    5. Click Delete Assets in the Actions pane.

  2. Replace the failed Cisco switch by following the standard replacement process.

  3. After the switch has been replaced, perform the steps in Section 3.4.2, "Recovering the Management Switch in a Physical Environment" to restore the switch from the latest backup.

  4. Update the credentials for the switch:

    1. Log in to the Exalogic Control BUI.

    2. Select Credentials in the Plan Management accordion.

    3. Enter the host name of the switch in the search box and click Search.

      The IPMI and SSH credential entries for the switch are displayed.

    4. To update all four credentials, do the following:

      i. Select the entry for the credentials and click Edit. The Update Credentials dialog box is displayed.

      ii. Update the Password and Confirm Password fields.

      iii. Click Update.

  5. Rediscover the Cisco switch.

    1. Log in to the Exalogic Control BUI.

    2. In the navigation pane on the left, expand Plan Management, and under Profiles and Policies, expand Discovery.

    3. Select the Cisco Switch @ Cisco-switch discovery profile.

    4. In the Actions pane on the right, click Add Assets.

    5. On the resulting screen, verify whether the correct discovery profile is displayed.

    6. Click Add Now.

    7. Wait until the discovery process succeeds.

    8. In the left navigation pane, expand the Assets section to display all the assets.

    9. Verify whether the replaced switch is displayed in the Assets section and positioned correctly in the photo-realistic view.

  6. If the switch is not displayed in the Assets section, add the Cisco switch manually by doing the following:

    1. Log in to the Exalogic Control BUI.

    2. Expand the Assets section.

    3. Select the appropriate rack, and then select Place/Remove Assets in the Actions section on the left side of the page.

    4. In the Place/Remove Assets in the Oracle Exalogic Rack dialog box, select the switch, and then click Submit.

    After the job is complete, the switch is shown in the Assets tab.

3.5 ZFS Storage Heads

The ZFS storage appliance in an Exalogic machine has two heads that are deployed in a clustered configuration. At any given time, one head is active and the other is passive. If a storage head fails and has to be replaced, the configuration from the surviving active head is pushed to the new storage head. It is not required to back up the configuration of the ZFS storage appliance.

Note:

Ensure that the passive head is being restored and not the active head. While performing the restore, ensure that no configuration changes are being made.

3.5.1 Recovering the ZFS Storage Head in a Virtual Configuration

Note:

In an Exalogic virtual configuration, do not attempt to replace a failed ZFS storage head with an entirely new one. Contact Oracle Support for the procedure to perform such a replacement. An improperly replaced component might not be discovered correctly by Exalogic Control. You can use the procedures described in this document to restore a failed component after repairing it.

When the Exalogic machine is deployed in a virtual configuration, do the following to replace a failed ZFS storage head.

  1. Remove the failed storage head from the assets:

    1. Log in to the Exalogic Control BUI as the root user.

    2. Navigate to the Assets section on the left side of the page.

    3. Expand Storage to list all the storage heads associated with the vDC.

    4. Select the storage head being replaced.

    5. Click Delete Assets in the Actions pane.

  2. Replace the failed storage head by following the standard replacement process.

  3. After the storage head has been replaced, update the credentials for the storage head:

    1. Log in to the Exalogic Control BUI.

    2. Select Credentials under the Plan Management section.

    3. Enter the host name of the storage head in the search box and click Search.

      The IPMI and SSH credential entries for the ILOM and the storage head are displayed.

    4. To update all four credentials, do the following:

      i. Select the entry for the credentials and click Edit. The Update Credentials dialog box is displayed.

      ii. Update the Password and Confirm Password fields.

      iii. Click Update.

  4. Rediscover the storage appliance.

    1. Log in to the Exalogic Control BUI.

    2. In the navigation pane on the left, expand Plan Management, and under Profiles and Policies, expand Discovery.

    3. Select the appropriate Storage Appliance @ host discovery profile.

    4. In the Actions pane on the right, click Add Assets.

    5. On the resulting screen, verify whether the correct discovery profile is displayed.

    6. Click Add Now.

    7. Wait until the discovery process succeeds.

    8. In the left navigation pane, expand the Assets section to display all the assets.

    9. Verify whether the replaced storage head is displayed in the Assets section and positioned correctly in the photo-realistic view.

  5. Add the replaced storage as an asset:

    1. Log in to the Exalogic Control BUI.

    2. Expand the Assets section.

    3. Select the appropriate rack, and then select Place/Remove Assets in the Actions section on the right side of the page.

    4. In the Place/Remove Assets in the Oracle Exalogic Rack dialog box, select the storage head, and then click Submit.

    After the job is complete, the storage head is shown in the Assets tab.

  6. Add the IB port GUIDs of the replaced storage head to the IPoIB-admin, IPoIB-storage, and the IPoIB-vserver-shared-storage partitions. The default keys for these partitions are 0x8001, 0x8002, and 0x8005 respectively.

    1. Log in to the replaced storage head by using SSH.

      i. Identify the IB port GUID for the first port as follows:

      storagehead:> configuration net devices
      storagehead:configuration net devices> select ibp0
      storagehead:configuration net devices ibp0> show
      Properties:
                     speed = 32000 Mbit/s
                        up = true
                    active = false
                     media = Infiniband
               factory_mac = not available
                      port = 1
                      guid = 0x212800013f279b
      storagehead:configuration net devices ibp0>

      The IB port GUID for port 1 is shown by the guid entry. In this example, the GUID is 0x212800013f279b.

      ii. Identify the IB port GUID for the second port as follows:

      storagehead:> configuration net devices
      storagehead:configuration net devices> select ibp1
      storagehead:configuration net devices ibp1> show
      Properties:
                     speed = 32000 Mbit/s
                        up = true
                    active = false
                     media = Infiniband
               factory_mac = not available
                      port = 2
                      guid = 0x212800013f279c
      storagehead:configuration net devices ibp1>

      The IB port GUID for port 2 is shown by the guid entry. In this example, the GUID is 0x212800013f279c.

    2. Add the IB port GUIDs as full members of the IPoIB-admin partition with the default pkey of 0x8001:

      i. Log in to the switch running the master subnet manager.

      ii. Run smpartition start to edit the partitions.

      iii. Add the GUIDs to partition 0x8001 using the following commands:

      smpartition add -pkey 0x8001 -port GUID_for_port1 -m full
      smpartition add -pkey 0x8001 -port GUID_for_port2 -m full
      

      Example:

      smpartition add -pkey 0x8001 -port 0x212800013f279b -m full
      smpartition add -pkey 0x8001 -port 0x212800013f279c -m full
      

      iv. Run smpartition commit to update and propagate the configuration to all the switches in the fabric.

    3. Repeat step (b) to add the IB port GUIDs identified in step (a) to the IPoIB-storage partition, with a default pkey of 0x8002, and to the IPoIB-vserver-shared-storage partition, with a default pkey of 0x8005, as shown in the sketch that follows.
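
      For example, a sketch that reuses the example GUIDs from step (a); run it on the switch running the master subnet manager:

      smpartition start
      smpartition add -pkey 0x8002 -port 0x212800013f279b -m full
      smpartition add -pkey 0x8002 -port 0x212800013f279c -m full
      smpartition add -pkey 0x8005 -port 0x212800013f279b -m full
      smpartition add -pkey 0x8005 -port 0x212800013f279c -m full
      smpartition commit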