2.3.3.1 Cabling Several RoCE Network Fabric Racks Together

Use this procedure to add another rack to an existing multi-rack system with RoCE Network Fabric.

This procedure is for systems with RoCE Network Fabric (X8M or later) using Oracle Exadata System Software Release 20.1.0 or later.

WARNING:

Take time to read and understand this procedure before implementation. Pay careful attention to all the instructions, not just the command examples. A system outage may occur if the instructions are not applied correctly.

In this procedure, the existing racks are R1, R2, …, Rn, and the new rack is Rn+1.

Note:

Cabling three or more racks together requires no downtime for the existing racks R1, R2, …, Rn. Only the new rack, Rn+1, is powered down.

Use the cabling tables that apply to your system.

In the following steps, these example switch names are used for the new rack (Rn+1):

  • rack5sw-roces0: Rack 5 spine switch (R5SS)
  • rack5sw-rocea0: Rack 5 lower leaf switch (R5LL)
  • rack5sw-roceb0: Rack 5 upper leaf switch (R5UL)
  1. Ensure the new rack is near the existing racks (R1, R2, …, Rn).
    The RDMA Network Fabric cables must be able to reach the servers in each rack.
  2. Ensure you have a backup of the current switch configuration for each switch in the existing racks and the new rack.
    For each switch, complete the steps in the Oracle Exadata Database Machine Maintenance Guide, section Backing Up Settings on the RoCE Network Fabric Switch.
  3. Shut down all servers in the new rack (Rn+1).
    Refer to Powering Off Oracle Exadata Rack. The switches must remain online and available.
  4. Verify the configuration of the existing RoCE Network Fabric switches.

    Before you configure the RoCE Network Fabric switches in the new rack (Rn+1), check the configuration of the RoCE Network Fabric switches in the existing racks (R1, R2, …, Rn). You must do this to ensure that every switch uses a unique loopback octet. The loopback octet is the last octet of the switch loopback IP address.

    1. Connect to an existing RoCE Network Fabric leaf switch and determine the loopback octet for the switch.

      Use the command shown in the following example.

      rack1sw-rocea0# show interface loopback 1 | grep Address
      
      Internet Address is 192.128.10.101/32

      In the example, the loopback octet is 101.
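
As an illustration of the definition above (the loopback octet is the last octet of the switch loopback IP address), the following minimal Python sketch extracts the octet from the line shown in the example. The helper name loopback_octet is hypothetical and is not part of any Oracle or switch tooling; the sketch assumes the output contains a line in the same "Internet Address is" form as the example.

```python
import ipaddress
import re


def loopback_octet(show_output: str) -> int:
    """Return the last octet of the address on the 'Internet Address is' line."""
    match = re.search(r"Internet Address is (\S+)", show_output)
    if match is None:
        raise ValueError("no 'Internet Address is' line found")
    # ip_interface accepts the address/prefix form printed by the switch.
    interface = ipaddress.ip_interface(match.group(1))
    return interface.ip.packed[-1]


# Sample line copied from the example output in this step.
sample = "Internet Address is 192.128.10.101/32"
print(loopback_octet(sample))
```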

    2. Determine the loopback octet for every other leaf switch.

      Use the command shown in the following example.

      rack1sw-rocea0# show nve peers
      
      Interface Peer-IP                                State LearnType
      --------- -------------------------------------- ----- ---------
      nve1      192.128.10.102                         Up    CP
      nve1      192.128.10.103                         Up    CP
      nve1      192.128.10.104                         Up    CP
      nve1      192.128.10.105                         Up    CP
      nve1      192.128.10.106                         Up    CP
      nve1      192.128.10.107                         Up    CP
      nve1      192.128.10.108                         Up    CP

      In the example, the output shows seven other leaf switches having loopback octet values from 102 to 108. This output is consistent with an existing system containing four racks.

    3. Determine the loopback octet for every spine switch.

      Use the command shown in the following example.

      rack1sw-rocea0# show bgp l2vpn evpn summary | egrep -v "BGP|Idle|I|Neighbor|memory"
      
      192.128.10.201 4 65502 9161 581 75716 0 0 08:53:23 3687
      192.128.10.202 4 65502 9160 582 75716 0 0 08:34:20 3687
      192.128.10.203 4 65502 9162 582 75716 0 0 08:41:22 3687
      192.128.10.204 4 65502 9163 582 75716 0 0 08:50:27 3687

      In the example, the output shows four spine switches having loopback octet values from 201 to 204. This output is also consistent with an existing system containing four racks.
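
If you save the output of the two peer-listing commands to text, the loopback octets can be collected programmatically rather than read by eye. The following Python sketch is one possible way to do that. The function name last_octets and the sample constants are hypothetical, and the sketch assumes that each data row begins with an IPv4 address, optionally preceded by an interface name such as nve1, as in the examples above.

```python
import re

# Matches an IPv4 address at the start of a row, optionally preceded by an
# interface name such as "nve1" (as in the `show nve peers` table).
ROW_ADDRESS = re.compile(r"^\s*(?:nve\d+\s+)?(\d{1,3}(?:\.\d{1,3}){3})\b")


def last_octets(cli_output: str) -> list[int]:
    """Collect the last octet of the peer address on every data row."""
    octets = []
    for line in cli_output.splitlines():
        match = ROW_ADDRESS.match(line)
        if match:
            octets.append(int(match.group(1).rsplit(".", 1)[1]))
    return octets


# Rows copied from the example outputs in this step (shortened).
NVE_SAMPLE = """\
Interface Peer-IP                                State LearnType
--------- -------------------------------------- ----- ---------
nve1      192.128.10.102                         Up    CP
nve1      192.128.10.103                         Up    CP
"""
BGP_SAMPLE = """\
192.128.10.201 4 65502 9161 581 75716 0 0 08:53:23 3687
192.128.10.202 4 65502 9160 582 75716 0 0 08:34:20 3687
"""

leaf_peers = last_octets(NVE_SAMPLE)
spine_peers = last_octets(BGP_SAMPLE)
print(leaf_peers, spine_peers)
```

Header and separator rows are skipped because they do not start with an address.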

    4. Validate the configuration of the existing RoCE Network Fabric switches.

      Check the information gathered from the existing RoCE Network Fabric switches to ensure that every switch uses a unique loopback octet value and that all the values are as expected.

      Verify that the information gathered from the existing RoCE Network Fabric switches conforms to the following conventions:

      • On the leaf switches, the overall range of loopback octet values should start with 101 and increase incrementally (by 1) for each leaf switch.

        According to the best-practice convention, the loopback octet value for each leaf switch should be configured as follows:

        • 101 - Rack 1 lower leaf switch (R1LL)

        • 102 - Rack 1 upper leaf switch (R1UL)

        • 103 - Rack 2 lower leaf switch (R2LL)

        • 104 - Rack 2 upper leaf switch (R2UL)

        • 105 - Rack 3 lower leaf switch (R3LL)

        • 106 - Rack 3 upper leaf switch (R3UL), and so on.

      • On the spine switches, the range of loopback octet values should start with 201 and increase incrementally (by 1) for each spine switch.

        According to the best-practice convention, the loopback octet value for each spine switch should be configured as follows:

        • 201 - Rack 1 spine switch (R1SS)

        • 202 - Rack 2 spine switch (R2SS)

        • 203 - Rack 3 spine switch (R3SS)

        • 204 - Rack 4 spine switch (R4SS), and so on.

      Caution:

      If the switches in the existing racks (R1, R2, …, Rn) do not conform to the above conventions, then you must take special care to assign unique loopback octet values to the switches in the new rack (Rn+1) as part of applying their golden configuration settings (in the next step).

      If multiple switches use the same loopback octet, the RoCE Network Fabric cannot function correctly, resulting in a system outage.
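
The validation in this step can be sketched in Python as follows. This is an illustration only, not an Oracle-provided check: the function names are hypothetical, and the values computed for a new rack are an extrapolation of the "and so on" pattern in the best-practice convention listed above (leaf octets from 101, two per rack; spine octets from 201, one per rack).

```python
def check_loopback_octets(leaf_octets, spine_octets):
    """Return a list of problems; an empty list means the conventions hold."""
    leaf_octets = list(leaf_octets)
    spine_octets = list(spine_octets)
    problems = []

    # Every switch must use a unique loopback octet.
    all_octets = leaf_octets + spine_octets
    duplicates = sorted({o for o in all_octets if all_octets.count(o) > 1})
    if duplicates:
        problems.append(f"duplicate loopback octets: {duplicates}")

    # Leaf octets should start at 101 and increase by 1 per leaf switch.
    if sorted(leaf_octets) != list(range(101, 101 + len(leaf_octets))):
        problems.append(f"leaf octets do not follow 101, 102, ...: {sorted(leaf_octets)}")

    # Spine octets should start at 201 and increase by 1 per spine switch.
    if sorted(spine_octets) != list(range(201, 201 + len(spine_octets))):
        problems.append(f"spine octets do not follow 201, 202, ...: {sorted(spine_octets)}")

    return problems


def convention_octets_for_rack(rack_number):
    """Octets the convention implies for rack N (extrapolated from the list above)."""
    lower = 101 + 2 * (rack_number - 1)
    return {"lower_leaf": lower, "upper_leaf": lower + 1, "spine": 200 + rack_number}


# Four existing racks, as in the example outputs in this step.
print(check_loopback_octets(range(101, 109), range(201, 205)))
print(convention_octets_for_rack(5))
```

If check_loopback_octets reports any problem for the existing racks, do not rely on the extrapolated values; choose unused octets manually as described in the caution above.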

  5. Apply the golden configuration settings on the RoCE Network Fabric switches in the new rack (Rn+1).

    Use the information about the existing RoCE Network Fabric switches that you gathered in the previous step together with the procedure described in Applying Golden Configuration Settings on RoCE Network Fabric Switches (in Oracle Exadata Database Machine Maintenance Guide).

    Caution:

    Take care when performing this step, as misconfiguration of the RoCE Network Fabric will likely cause a system outage.

    For example, every switch in a multi-rack configuration must have a unique loopback octet. If multiple switches use the same loopback octet, the RoCE Network Fabric cannot function correctly, resulting in a system outage.

  6. Enable the leaf switch server ports on the RoCE Network Fabric leaf switches in the new rack (Rn+1).

    The leaf switch server ports may be disabled as a consequence of applying the multi-rack golden configuration settings in the previous step.

    To ensure that the leaf switch server ports are enabled, log in to each of the leaf switches in the new rack and run the following commands on each switch:

    rack5sw-rocea0# config term
    rack5sw-rocea0# int eth1/8-30
    rack5sw-rocea0# no shut
    rack5sw-rocea0# copy running-config startup-config
  7. Perform the physical cabling of the switches in the new rack (Rn+1).

    Caution:

    Cabling within a live network must be done carefully in order to avoid potentially serious disruptions.
    1. Remove the eight existing inter-switch connections (ports 4, 5, 6, 7 and 30, 31, 32, 33) between the leaf switches in the new rack (Rn+1).
    2. Cable the leaf switches in the new rack according to the applicable cabling table.

      For example, if you are adding a 5th rack to a system using Exadata X9M (or later model) racks, then use "Table 4-17 Leaf Switch Connections for the Fifth Rack in a Five-Rack System".

  8. Add the new rack to the switches in the existing racks (R1 to Rn).
    1. For an existing rack (Rx), cable the lower leaf switch RxLL according to the applicable cabling table.
    2. For the same rack, cable the upper leaf switch RxUL according to the applicable cabling table.
    3. Repeat these steps for each existing rack, R1 to Rn.
  9. Confirm each switch is available and connected.

    For each switch in racks R1, R2, …, Rn, Rn+1, confirm that the output of the show interface status command shows connected and 100G for the cabled ports.

    When run from a spine switch, the output should be similar to the following:

    rack1sw-roces0# show interface status
    --------------------------------------------------------------------------------
    Port          Name               Status    Vlan      Duplex  Speed   Type
    --------------------------------------------------------------------------------
    mgmt0         --                 connected routed    full    1000    -- 
    --------------------------------------------------------------------------------
    Port          Name               Status    Vlan      Duplex  Speed   Type
    --------------------------------------------------------------------------------
    ...
    Eth1/5        RouterPort5        connected routed    full    100G    QSFP-100G-CR4
    Eth1/6        RouterPort6        connected routed    full    100G    QSFP-100G-SR4
    Eth1/7        RouterPort7        connected routed    full    100G    QSFP-100G-CR4
    Eth1/8        RouterPort8        connected routed    full    100G    QSFP-100G-SR4
    Eth1/9        RouterPort9        connected routed    full    100G    QSFP-100G-CR4
    Eth1/10       RouterPort10       connected routed    full    100G    QSFP-100G-SR4
    Eth1/11       RouterPort11       connected routed    full    100G    QSFP-100G-CR4
    Eth1/12       RouterPort12       connected routed    full    100G    QSFP-100G-SR4
    Eth1/13       RouterPort13       connected routed    full    100G    QSFP-100G-CR4
    Eth1/14       RouterPort14       connected routed    full    100G    QSFP-100G-SR4
    Eth1/15       RouterPort15       connected routed    full    100G    QSFP-100G-CR4
    Eth1/16       RouterPort16       connected routed    full    100G    QSFP-100G-SR4
    Eth1/17       RouterPort17       connected routed    full    100G    QSFP-100G-CR4
    Eth1/18       RouterPort18       connected routed    full    100G    QSFP-100G-SR4
    Eth1/19       RouterPort19       connected routed    full    100G    QSFP-100G-CR4
    Eth1/20       RouterPort20       connected routed    full    100G    QSFP-100G-SR4
    Eth1/21       RouterPort21       xcvrAbsen routed    full    100G    --
    ...

    When run from a leaf switch, the output should be similar to the following:

    rack1sw-rocea0# show interface status
    --------------------------------------------------------------------------------
    Port          Name               Status    Vlan      Duplex  Speed   Type
    --------------------------------------------------------------------------------
    mgmt0         --                 connected routed    full    1000    -- 
    --------------------------------------------------------------------------------
    Port          Name               Status    Vlan      Duplex  Speed   Type
    --------------------------------------------------------------------------------
    ...
    Eth1/4        RouterPort1        connected routed    full    100G    QSFP-100G-CR4
    Eth1/5        RouterPort2        connected routed    full    100G    QSFP-100G-CR4
    Eth1/6        RouterPort3        connected routed    full    100G    QSFP-100G-CR4
    Eth1/7        RouterPort4        connected routed    full    100G    QSFP-100G-CR4
    Eth1/8        celadm14           connected 3888      full    100G    QSFP-100G-CR4
    ...
    Eth1/29       celadm01           connected 3888      full    100G    QSFP-100G-CR4
    Eth1/30       RouterPort5        connected routed    full    100G    QSFP-100G-SR4
    Eth1/31       RouterPort6        connected routed    full    100G    QSFP-100G-SR4
    Eth1/32       RouterPort7        connected routed    full    100G    QSFP-100G-SR4
    Eth1/33       RouterPort8        connected routed    full    100G    QSFP-100G-SR4
    ...
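
With many switches to check, the comparison in this step can be scripted. The following Python sketch flags ports that are not reported as connected at 100G. The function name ports_needing_attention is hypothetical, and the sketch assumes that no value in the Name column contains whitespace, so that each row splits into the columns shown in the examples above.

```python
def ports_needing_attention(status_output, expected_ports):
    """Return {port: (status, speed)} for expected ports not connected at 100G."""
    seen = {}
    for line in status_output.splitlines():
        fields = line.split()
        # Columns: Port, Name, Status, Vlan, Duplex, Speed, Type
        if len(fields) >= 6 and fields[0].startswith("Eth"):
            seen[fields[0]] = (fields[2], fields[5])

    problems = {}
    for port in expected_ports:
        status, speed = seen.get(port, ("missing", "missing"))
        if (status, speed) != ("connected", "100G"):
            problems[port] = (status, speed)
    return problems


# Rows copied from the spine switch example in this step.
STATUS_SAMPLE = """\
Port          Name               Status    Vlan      Duplex  Speed   Type
mgmt0         --                 connected routed    full    1000    --
Eth1/5        RouterPort5        connected routed    full    100G    QSFP-100G-CR4
Eth1/6        RouterPort6        connected routed    full    100G    QSFP-100G-SR4
Eth1/21       RouterPort21       xcvrAbsen routed    full    100G    --
"""

print(ports_needing_attention(STATUS_SAMPLE, ["Eth1/5", "Eth1/6", "Eth1/21"]))
```

Pass only the ports that the applicable cabling table says should be cabled; an unused port such as Eth1/21 in the example is expected to have no transceiver.
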
  10. Check the neighbor discovery for every switch in racks R1, R2, …, Rn, Rn+1.
    Log in to each switch and use the show lldp neighbors command. Make sure that all switches are visible, and check the switch port assignments (leaf switches: ports Eth1/4 - Eth1/7, Eth1/30 - Eth1/33; spine switches: ports Eth1/5 - Eth1/20) against the applicable cabling tables.

    Each spine switch should see all the leaf switches in each rack, but not the other spine switches. The output for a spine switch should be similar to the following:

    Note:

    The interfaces in the rightmost output column (for example, Ethernet1/5) are different for each switch based on the applicable cabling tables.
    rack1sw-roces0# show lldp neighbors | grep roce
    rack1sw-roceb0 Eth1/5 120 BR Ethernet1/5
    rack2sw-roceb0 Eth1/6 120 BR Ethernet1/5
    rack1sw-roceb0 Eth1/7 120 BR Ethernet1/7
    rack2sw-roceb0 Eth1/8 120 BR Ethernet1/7
    rack1sw-roceb0 Eth1/9 120 BR Ethernet1/4
    rack2sw-roceb0 Eth1/10 120 BR Ethernet1/4
    rack3sw-roceb0 Eth1/11 120 BR Ethernet1/5
    rack3sw-roceb0 Eth1/12 120 BR Ethernet1/7
    rack1sw-rocea0 Eth1/13 120 BR Ethernet1/5
    rack2sw-rocea0 Eth1/14 120 BR Ethernet1/5
    rack1sw-rocea0 Eth1/15 120 BR Ethernet1/7
    rack2sw-rocea0 Eth1/16 120 BR Ethernet1/7
    rack3sw-rocea0 Eth1/17 120 BR Ethernet1/5
    rack2sw-rocea0 Eth1/18 120 BR Ethernet1/4
    rack3sw-rocea0 Eth1/19 120 BR Ethernet1/7
    rack3sw-rocea0 Eth1/20 120 BR Ethernet1/4 
    ...

    Each leaf switch should see the spine switch in every rack, but not the other leaf switches. The output for a leaf switch should be similar to the following:

    Note:

    The interfaces in the rightmost output column (for example, Ethernet1/13) are different for each switch based on the applicable cabling tables.
    rack1sw-rocea0# show lldp neighbors | grep roce
    rack3sw-roces0 Eth1/4 120 BR Ethernet1/13
    rack1sw-roces0 Eth1/5 120 BR Ethernet1/13
    rack3sw-roces0 Eth1/6 120 BR Ethernet1/15
    rack1sw-roces0 Eth1/7 120 BR Ethernet1/15
    rack2sw-roces0 Eth1/30 120 BR Ethernet1/17
    rack2sw-roces0 Eth1/31 120 BR Ethernet1/13
    rack3sw-roces0 Eth1/32 120 BR Ethernet1/17
    rack2sw-roces0 Eth1/33 120 BR Ethernet1/15
    ...
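
The visibility rules in this step (a spine switch sees only leaf switches, and a leaf switch sees only spine switches) can be checked with a short script. The following Python sketch is an illustration under two assumptions: switch names follow the example naming used in this procedure (roces for spine switches, rocea and roceb for leaf switches), and the output has the five whitespace-separated columns shown above. The function names are hypothetical.

```python
import re


def switch_role(name):
    """Classify a switch by the example naming used in this procedure."""
    if re.search(r"sw-roces\d+$", name):
        return "spine"
    if re.search(r"sw-roce[ab]\d+$", name):
        return "leaf"
    return "unknown"


def check_lldp(local_switch, lldp_output, expected_neighbors):
    """Return a list of problems found in `show lldp neighbors | grep roce` output."""
    local_role = switch_role(local_switch)
    if local_role == "unknown":
        raise ValueError(f"cannot determine role of {local_switch}")
    wanted_role = "leaf" if local_role == "spine" else "spine"

    problems = []
    seen = set()
    for line in lldp_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skips blank lines and "..." placeholders
        neighbor, local_port = fields[0], fields[1]
        seen.add(neighbor)
        role = switch_role(neighbor)
        if role != wanted_role:
            problems.append(f"{local_port}: unexpected {role} neighbor {neighbor}")

    for name in sorted(set(expected_neighbors) - seen):
        problems.append(f"{name} is not visible")
    return problems


# Rows copied from the leaf switch example in this step.
LEAF_SAMPLE = """\
rack3sw-roces0 Eth1/4 120 BR Ethernet1/13
rack1sw-roces0 Eth1/5 120 BR Ethernet1/13
rack3sw-roces0 Eth1/6 120 BR Ethernet1/15
rack1sw-roces0 Eth1/7 120 BR Ethernet1/15
rack2sw-roces0 Eth1/30 120 BR Ethernet1/17
rack2sw-roces0 Eth1/31 120 BR Ethernet1/13
rack3sw-roces0 Eth1/32 120 BR Ethernet1/17
rack2sw-roces0 Eth1/33 120 BR Ethernet1/15
"""

spines = {"rack1sw-roces0", "rack2sw-roces0", "rack3sw-roces0"}
print(check_lldp("rack1sw-rocea0", LEAF_SAMPLE, spines))
```

The sketch does not replace the comparison against the cabling tables; it only catches missing or unexpected neighbors.
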
  11. Power on all the servers in the new rack (Rn+1).
  12. For each rack, confirm the multi-rack cabling by running the verify_roce_cables.py script.

    Refer to My Oracle Support Doc ID 2587717.1 for download and usage instructions.

    Check the output of the verify_roce_cables.py script against the applicable cabling tables. Also, check that every entry in the CABLE OK? columns shows the OK status.

    The script takes two input files, one for nodes and one for switches. Each file lists one server or switch per line. Use fully qualified domain names or IP addresses for each server and switch.

    The following output is a partial example of the command results:

    # ./verify_roce_cables.py -n nodes.rack1 -s switches.rack1
    SWITCH PORT (EXPECTED PEER)  LEAF-1 (rack1sw-rocea0)     : CABLE OK?  LEAF-2 (rack1sw-roceb0)    : CABLE OK?
    ----------- --------------   --------------------------- : --------   -----------------------    : ---------
    Eth1/4 (ISL peer switch)   : rack1sw-roces0 Ethernet1/17 : OK         rack1sw-roces0 Ethernet1/9 : OK
    Eth1/5 (ISL peer switch)   : rack1sw-roces0 Ethernet1/13 : OK         rack1sw-roces0 Ethernet1/5 : OK
    Eth1/6 (ISL peer switch)   : rack1sw-roces0 Ethernet1/19 : OK         rack1sw-roces0 Ethernet1/11: OK
    Eth1/7 (ISL peer switch)   : rack1sw-roces0 Ethernet1/15 : OK         rack1sw-roces0 Ethernet1/7 : OK
    Eth1/12 (celadm10)         : rack1celadm10 port-1        : OK         rack1celadm10 port-2       : OK
    Eth1/13 (celadm09)         : rack1celadm09 port-1        : OK         rack1celadm09 port-2       : OK
    Eth1/14 (celadm08)         : rack1celadm08 port-1        : OK         rack1celadm08 port-2       : OK
    ...
    Eth1/15 (adm08)            : rack1dbadm08 port-1         : OK         rack1dbadm08 port-2        : OK
    Eth1/16 (adm07)            : rack1dbadm07 port-1         : OK         rack1dbadm07 port-2        : OK
    Eth1/17 (adm06)            : rack1dbadm06 port-1         : OK         rack1dbadm06 port-2        : OK
    ...
    Eth1/30 (ISL peer switch)  : rack2sw-roces0 Ethernet1/17 : OK         rack2sw-roces0 Ethernet1/9 : OK
    Eth1/31 (ISL peer switch)  : rack2sw-roces0 Ethernet1/13 : OK         rack2sw-roces0 Ethernet1/5 : OK
    Eth1/32 (ISL peer switch)  : rack2sw-roces0 Ethernet1/19 : OK         rack2sw-roces0 Ethernet1/11: OK
    Eth1/33 (ISL peer switch)  : rack2sw-roces0 Ethernet1/15 : OK         rack2sw-roces0 Ethernet1/7 : OK
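
To scan a long report for entries that are not OK, the CABLE OK? columns can be extracted with a few lines of Python. The sketch below is not part of verify_roce_cables.py. The function name failed_cable_checks is hypothetical, and the parsing assumes the colon-separated row layout shown in the example and peer names that contain no colons. The FAIL value in the sample is invented to exercise the check; the script's real wording for a failed check may differ.

```python
def failed_cable_checks(script_output):
    """Return (switch_port, leaf_column, status) for every entry that is not OK."""
    failures = []
    for raw_line in script_output.splitlines():
        line = raw_line.strip()
        if not line.startswith("Eth"):
            continue  # header, separator, and "..." rows
        parts = [part.strip() for part in line.split(":")]
        if len(parts) != 4:
            failures.append((parts[0], "unparsed", line))
            continue
        # Layout: port : LEAF-1 peer : LEAF-1 status + LEAF-2 peer : LEAF-2 status
        port, _leaf1_peer, middle, leaf2_status = parts
        leaf1_status = middle.split()[0] if middle else ""
        for column, status in (("LEAF-1", leaf1_status), ("LEAF-2", leaf2_status)):
            if status != "OK":
                failures.append((port, column, status))
    return failures


# First row copied from the example; the FAIL entry in the second row is invented.
REPORT_SAMPLE = """\
Eth1/4 (ISL peer switch)   : rack1sw-roces0 Ethernet1/17 : OK         rack1sw-roces0 Ethernet1/9 : OK
Eth1/6 (ISL peer switch)   : rack1sw-roces0 Ethernet1/19 : OK         rack1sw-roces0 Ethernet1/11: FAIL
"""

print(failed_cable_checks(REPORT_SAMPLE))
```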
    
  13. Verify the RoCE Network Fabric operation by using the infinicheck command.

    Use the following recommended command sequence. In each command, hosts.lst contains a list of database server RoCE Network Fabric IP addresses (2 RoCE Network Fabric IP addresses for each database server), and cells.lst contains a list of RoCE Network Fabric IP addresses for the storage servers (2 RoCE Network Fabric IP addresses for each storage server).

    • Use infinicheck with the -z option to clear the files that were created during the last run of the infinicheck command. For example:

      # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hosts.lst -c cells.lst -z
    • Use infinicheck with the -s option to set up user equivalence for password-less SSH across the RoCE Network Fabric. For example:

      # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hosts.lst -c cells.lst -s
    • Finally, verify the RoCE Network Fabric operation by using infinicheck with the -b option, which is recommended on newly imaged machines where it is acceptable to suppress the cellip.ora and cellinit.ora configuration checks. For example:

      # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hosts.lst -c cells.lst -b
      
      INFINICHECK                    
              [Network Connectivity, Configuration and Performance]        
                     
                ####  FABRIC TYPE TESTS  #### 
      System type identified: RoCE
      Verifying User Equivalance of user=root from all DBs to all CELLs.
           ####  RoCE CONFIGURATION TESTS  ####       
           Checking for presence of RoCE devices on all DBs and CELLs 
      [SUCCESS].... RoCE devices on all DBs and CELLs look good
           Checking for RoCE Policy Routing settings on all DBs and CELLs 
      [SUCCESS].... RoCE Policy Routing settings look good
           Checking for RoCE DSCP ToS mapping on all DBs and CELLs 
      [SUCCESS].... RoCE DSCP ToS settings look good
           Checking for RoCE PFC settings and DSCP mapping on all DBs and CELLs
      [SUCCESS].... RoCE PFC and DSCP settings look good
           Checking for RoCE interface MTU settings. Expected value : 2300
      [SUCCESS].... RoCE interface MTU settings look good
           Verifying switch advertised DSCP on all DBs and CELLs ports ( )
      [SUCCESS].... Advertised DSCP settings from RoCE switch looks good  
          ####  CONNECTIVITY TESTS  ####
          [COMPUTE NODES -> STORAGE CELLS] 
            (60 seconds approx.)       
          (Will walk through QoS values: 0-6) [SUCCESS]..........Results OK
      [SUCCESS]....... All  can talk to all storage cells          
          [COMPUTE NODES -> COMPUTE NODES]               
      ...
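
Because the infinicheck report is long, it can help to tally the bracketed status tags after a run. The following Python sketch counts tags such as [SUCCESS] and lists every line that carries a different tag. It is an illustration only: the function name is hypothetical, and the [FAILURE] tag in the sample is invented to exercise the check, since the example output above shows only [SUCCESS] results.

```python
import re
from collections import Counter

# Matches single-word upper-case tags such as [SUCCESS]; section titles like
# [COMPUTE NODES -> STORAGE CELLS] contain spaces and are not matched.
STATUS_TAG = re.compile(r"\[([A-Z]+)\]")


def summarize_status_tags(output):
    """Return (tag counts, lines whose tag is not SUCCESS)."""
    counts = Counter()
    flagged = []
    for line in output.splitlines():
        for tag in STATUS_TAG.findall(line):
            counts[tag] += 1
            if tag != "SUCCESS":
                flagged.append(line.strip())
    return counts, flagged


# The first three lines are copied from the example; the last line is invented.
INFINICHECK_SAMPLE = """\
[SUCCESS].... RoCE devices on all DBs and CELLs look good
    [COMPUTE NODES -> STORAGE CELLS]
    (Will walk through QoS values: 0-6) [SUCCESS]..........Results OK
[FAILURE].... hypothetical failing line, not real infinicheck output
"""

counts, flagged = summarize_status_tags(INFINICHECK_SAMPLE)
print(dict(counts))
print(flagged)
```

Review any flagged line in the full report before you continue with the configuration of the new rack.
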
  14. After cabling the racks together, proceed to Configuring the New Hardware to finish the configuration of the new rack.