4.8.4 Verifying InfiniBand Network Fabric Operation

Verify that the InfiniBand Network Fabric is operating properly after modifying the underlying hardware.

If hardware maintenance has been performed on any component of the InfiniBand Network Fabric, such as replacing an InfiniBand HCA on a server, an InfiniBand Network Fabric switch, or an InfiniBand Network Fabric cable, or if the InfiniBand Network Fabric appears to be performing poorly, then verify that the fabric is operating properly. The following procedure describes how to verify network operation:

Note:

The following procedure can be used any time the InfiniBand Network Fabric is performing below expectations.
  1. Complete the steps in Verifying the InfiniBand Network Fabric Configuration.
  2. Run the ibdiagnet command to verify the InfiniBand Network Fabric operation.
    # ibdiagnet -c 1000

    All errors reported by this command should be investigated. This command generates a small amount of network traffic, and may be run while normal workload is running.

  3. Run the ibqueryerrors.pl command to report on switch port error counters and port configuration information.
    # ibqueryerrors.pl -rR -s RcvSwRelayErrors,XmtDiscards,XmtWait,VL15Dropped

    The -s option suppresses the listed counters, so RcvSwRelayErrors, XmtDiscards, XmtWait, and VL15Dropped errors are ignored when using the preceding command.

    Note:

    • The InfiniBand Network Fabric counters are cumulative, so the reported errors may have occurred at any time in the past. If errors are reported, then Oracle recommends clearing the InfiniBand Network Fabric counters using the ibclearcounters command. After running the command, let the system run for a few minutes under load, and then run the ibqueryerrors.pl command again.

    • Some counters, such as SymbolErrors or RcvErrors, can increment when servers are rebooted. Small values for these counters that are less than the LinkDowned counter are generally not a problem. The LinkDowned counter indicates the number of times the port has gone down, usually for valid reasons such as a reboot, and is not usually an error indicator by itself.

    • Any links reporting high, persistent errors, especially SymbolErrors, LinkRecovers, RcvErrors, or LinkIntegrityErrors, may indicate a bad or loose cable or port.

    • If there are persistent, high InfiniBand Network Fabric error counters, then investigate and correct the problem.
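The notes above can be sketched as two small shell functions. This is an illustrative sketch, not an Oracle-supplied tool. The function names recheck_counters and triage_port are hypothetical, the 300-second default is an assumed value for "a few minutes", and triage_port expects counter values supplied by the caller instead of parsing ibqueryerrors.pl output. It encodes only the comparison against LinkDowned, and it does not define what counts as a "small" value.

```shell
#!/bin/bash

# Clear the fabric counters, wait while the system runs under load,
# then query the counters again.
# The wait time is a hypothetical parameter; 300 seconds is an assumption,
# not an Oracle recommendation.
recheck_counters() {
    local wait_seconds=${1:-300}
    ibclearcounters
    sleep "$wait_seconds"
    ibqueryerrors.pl -rR -s RcvSwRelayErrors,XmtDiscards,XmtWait,VL15Dropped
}

# Classify one port from counter values supplied by the caller.
# Usage: triage_port SYMBOL_ERRORS RCV_ERRORS LINK_DOWNED
#   clean          both error counters are zero
#   likely-benign  both error counters are below LinkDowned (reboot noise)
#   investigate    anything else
triage_port() {
    local symbol_errors=$1 rcv_errors=$2 link_downed=$3
    if [ "$symbol_errors" -eq 0 ] && [ "$rcv_errors" -eq 0 ]; then
        echo "clean"
    elif [ "$symbol_errors" -lt "$link_downed" ] && [ "$rcv_errors" -lt "$link_downed" ]; then
        echo "likely-benign"
    else
        echo "investigate"
    fi
}
```

For example, triage_port 2 1 3 reports a port with 2 SymbolErrors, 1 RcvErrors, and 3 LinkDowned events as likely-benign. A port that still shows growing counters after recheck_counters should be investigated regardless of this classification.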

  4. If no workload is running on any part of the InfiniBand Network Fabric (for example, no databases are running), then run the infinicheck command to perform a full InfiniBand Network Fabric configuration, connectivity, and performance evaluation.

    Note:

    This command measures the maximum throughput of the full network and should not be run while workload is running on any system on the InfiniBand Network Fabric.

    This command relies on a fully-configured system. The first command, which uses the -z option, clears the files that were created during the last run of the infinicheck command. The second command runs the evaluation.

    # /opt/oracle.SupportTools/ibdiagtools/infinicheck -z 
    
    # /opt/oracle.SupportTools/ibdiagtools/infinicheck

    The following is an example of the output from the command:

    Verifying User Equivalance of user=root to all hosts.
    (If it isn't setup correctly, an authentication prompt will appear to push keys
     to all the nodes)

    Verifying User Equivalance of user=root to all cells.
    (If it isn't setup correctly, an authentication prompt will appear to push keys
     to all the nodes)


                        ####  CONNECTIVITY TESTS  ####
                        [COMPUTE NODES -> STORAGE CELLS]
                               (30 seconds approx.)
    [SUCCESS]..............Connectivity verified
     
    [SUCCESS]....... All hosts can talk to all storage cells
     
            Verifying Subnet Masks on Hosts and Cells
    [SUCCESS] ......... Subnet Masks is same across the network
     
            Checking for bad links in the fabric
    [SUCCESS].......... No bad fabric links found
     
                        [COMPUTE NODES -> COMPUTE NODES]
                               (30 seconds approx.)
    [SUCCESS]..............Connectivity verified
     
    [SUCCESS]....... All hosts can talk to all other nodes
     
     
                        ####  PERFORMANCE TESTS  ####
     
                        [(1) Every COMPUTE NODE to its STORAGE CELL]
                              (15 seconds approx.)
    [SUCCESS]........ Network Bandwidth looks OK.
    .......... To view only performance results run ./infinicheck -d -p
     
                        [(2) Every COMPUTE NODE to another COMPUTE NODE]
                              (10 seconds approx.)
    [SUCCESS]........ Network Bandwidth looks OK.
    ...... To view only performance results run ./infinicheck -d -p
     
                        [(3) Every COMPUTE NODE to ALL STORAGE CELLS]
                      (45 seconds approx.) (looking for SymbolErrors)
     
    [SUCCESS]....... No port errors found
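
When the infinicheck output is captured or piped, the result lines can be screened automatically. The following is a minimal sketch under stated assumptions: the function name check_infinicheck_output is hypothetical, and it assumes that every result line has the form shown in the example above, a bracketed uppercase tag followed by a run of dots. The example output shows only [SUCCESS] tags, so the sketch treats any other tag as a result that needs attention without assuming what that tag is called.

```shell
#!/bin/bash

# Read infinicheck output on standard input.
# Print every result line whose tag is not SUCCESS and return 1 if any
# such line is found; return 0 otherwise.
# Assumption: result lines look like "[TAG]......" as in the sample output.
# Section headers such as "[COMPUTE NODES -> STORAGE CELLS]" contain spaces
# inside the brackets and are not followed by dots, so they are skipped.
check_infinicheck_output() {
    local failures
    failures=$(grep -E '^[[:space:]]*\[[A-Z]+\][[:space:]]*\.{2,}' \
        | grep -v '\[SUCCESS\]' || true)
    if [ -n "$failures" ]; then
        printf '%s\n' "$failures"
        return 1
    fi
    return 0
}
```

A possible use is to pipe the command output into the function, for example /opt/oracle.SupportTools/ibdiagtools/infinicheck | check_infinicheck_output, and then inspect the full output manually if the function returns a nonzero status.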