4.7.8 Verifying RoCE Network Fabric Operation

Verify the RoCE Network Fabric is operating properly after making modifications to the underlying hardware.

If hardware maintenance has been performed on any component of the RoCE Network Fabric, such as replacing an RDMA Network Fabric Adapter on a server, a switch, or a cable, or if you suspect that the RoCE Network Fabric is performing poorly, then verify that the fabric is operating properly. The following procedure describes how to verify network operation:

  1. Complete the steps in Verifying the RoCE Network Fabric Configuration.
  2. Prepare for infinicheck.

    You may need to run the following commands before you can use the infinicheck command to perform RoCE Network Fabric configuration, connectivity, and performance checks.

    • If required, use the -s option to set up user equivalence for password-less SSH across the RoCE Network Fabric. For example:

      # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hostips -c cellips -s
    • You can use the -z option to clear the files that were created during the last run of the infinicheck command. For example:

      # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hostips -c cellips -z

    In the previous commands, hostips is the name of an input file that contains a list of RoCE Network Fabric IP addresses for the database servers, and cellips is the name of an input file that contains a list of RoCE Network Fabric IP addresses for the storage servers.
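
    As a minimal sketch of preparing these input files, the following commands write two files with one address per line. The one-address-per-line layout and the 192.0.2.x addresses are assumptions for illustration only; substitute the RoCE Network Fabric IP addresses of your own servers.

    ```shell
    # Hypothetical addresses; replace with your RoCE Network Fabric IP addresses.
    # Assumption: each input file lists one address per line.
    printf '%s\n' 192.0.2.1 192.0.2.2 > hostips
    printf '%s\n' 192.0.2.11 192.0.2.12 192.0.2.13 > cellips

    # Confirm that each file contains the expected number of entries.
    wc -l hostips cellips
    ```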

  3. Run the infinicheck command to perform RoCE Network Fabric configuration, connectivity, and performance checks.

    On a properly configured system, you can run the infinicheck command on any database server with minimal arguments. For example:

    # /opt/oracle.SupportTools/ibdiagtools/infinicheck

    By default, the infinicheck command performs a group of configuration and connectivity checks on the RoCE Network Fabric. You can use the -p option to run the optional performance tests. Alternatively, use the -a option to perform all checks, including the performance tests. For example:

    # /opt/oracle.SupportTools/ibdiagtools/infinicheck -a

    Note:

    System performance may be impacted while the infinicheck command runs its performance stress tests. Consequently, run the infinicheck performance tests only when required, and preferably when there is no workload on the system.

    You can also specify the servers in your system explicitly by using the -g option to specify the database servers and the -c option to specify the storage servers. For example:

    # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hostips -c cellips

    In the previous example, hostips is the name of an input file that contains a list of RoCE Network Fabric IP addresses for the database servers, and cellips is the name of an input file that contains a list of RoCE Network Fabric IP addresses for the storage servers.

    Instead of listing the database servers and storage servers in input files, you can supply a comma-separated list of IP addresses on the command line.
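
    For instance, if the addresses are already held in files with one address per line (an assumed layout), the standard paste utility can join them into comma-separated lists. The sketch below uses hypothetical 192.0.2.x addresses and only prints the resulting command line for review; it does not run infinicheck.

    ```shell
    # Hypothetical sample files, one address per line (assumed layout).
    printf '%s\n' 192.0.2.1 192.0.2.2 > hostips
    printf '%s\n' 192.0.2.11 192.0.2.12 192.0.2.13 > cellips

    # Join the lines of each file with commas.
    db_list=$(paste -s -d, hostips)
    cell_list=$(paste -s -d, cellips)

    # Build and print the command for review instead of running it.
    infinicheck_cmd="/opt/oracle.SupportTools/ibdiagtools/infinicheck -g $db_list -c $cell_list"
    echo "$infinicheck_cmd"
    ```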

    The following example displays typical terminal output from the infinicheck command.

    # /opt/oracle.SupportTools/ibdiagtools/infinicheck -g hostips -c cellips
                            INFINICHECK
                    [Network Connectivity, Configuration and Performance]
    
                        #### FABRIC TYPE TESTS ####
    
    System type identified: RoCE
    Verifying User Equivalence of user=root from all DBs to all CELLs.
    
                    #### RoCE CONFIGURATION TESTS ####
            Checking for presence of RoCE devices on all DBs and CELLs
    [SUCCESS].... RoCE devices on all DBs and CELLs look good
            Checking for RoCE Policy Routing settings on all DBs and CELLs
    [SUCCESS].... RoCE Policy Routing settings look good
            Checking for RoCE DSCP ToS mapping on all DBs and CELLs
    [SUCCESS].... RoCE DSCP ToS settings look good
            Checking for RoCE PFC settings and DSCP mapping on all DBs and CELLs
    [SUCCESS].... RoCE PFC and DSCP settings look good
            Checking for RoCE interface MTU settings. Expected value : 2300
    [SUCCESS].... RoCE interface MTU settings look good
            Verifying switch advertised DSCP on all DBs and CELLs ports ( ~ 2 min )
    [SUCCESS].... Advertised DSCP settings from RoCE switch looks good
    
                        #### CONNECTIVITY TESTS ####
                        [COMPUTE NODES -> STORAGE CELLS]
                               (60 seconds approx.)
                       (Will walk through QoS values: 0-6)
    [SUCCESS]..............Results OK
    [SUCCESS]....... All can talk to all storage cells
                        [COMPUTE NODES -> COMPUTE NODES]
                               (60 seconds approx.)
                       (Will walk through QoS values: 0-6)
    [SUCCESS]..............Results OK
    [SUCCESS]....... All hosts can talk to all other nodes
            Verifying Subnet Masks on all nodes
    [SUCCESS] ......... Subnet Masks is same across the network
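
    If the output is captured to a file, a quick scan can flag any result that is not reported as a success. The sketch below is an illustration only: the log file name infinicheck.log and the non_success_lines function are hypothetical, and the scan assumes that every result line begins with a single-word bracketed status tag, as the [SUCCESS] lines above do. The tag printed for a failed check is not shown in this section, so the function reports any such tag other than [SUCCESS].

    ```shell
    # Print result lines whose leading bracketed tag is not [SUCCESS].
    # Assumption: result lines start with optional spaces and a one-word [TAG].
    # Section banners such as [COMPUTE NODES -> STORAGE CELLS] contain spaces
    # inside the brackets, so the first pattern does not match them.
    non_success_lines() {
      grep -E '^[[:space:]]*\[[A-Za-z]+\]' "$1" | grep -v -E '^[[:space:]]*\[SUCCESS\]'
    }

    # Small hypothetical log; the [FAILED] line is invented for this demonstration.
    printf '%s\n' '[SUCCESS].... RoCE devices on all DBs and CELLs look good' > infinicheck.log
    printf '%s\n' '[FAILED].... hypothetical failing check' >> infinicheck.log

    # The function returns success only when it printed at least one line.
    if non_success_lines infinicheck.log; then
      echo "Review the lines listed above"
    else
      echo "All reported checks succeeded"
    fi
    ```

    To scan real output, the infinicheck command could be piped through the standard tee utility to write the log file while still displaying the results on the terminal.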