15.1 Monitor the InfiniBand Fabric

This section describes commands and procedures for monitoring the InfiniBand fabric. It contains the following topics:

15.1.1 Identify All Switches in the Fabric

You can use the ibswitches command to identify the Sun Network QDR InfiniBand Gateway Switches in the InfiniBand fabric of your Exalogic machine. This command displays the globally unique identifier (GUID), name, local identifier (LID), and LID mask control (LMC) for each switch, so the output serves as a mapping of GUID to LID for the switches in the fabric.

On any command-line interface (CLI), run the following command:

# ibswitches

The output is displayed, as in the following example:

Switch : 0x0021283a8389a0a0 ports 36 "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0

Note:

The actual output for your InfiniBand fabric will differ from that in the example.
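
When the GUID-to-LID mapping is needed in a script, the ibswitches output can be reduced to one line per switch. The following is a minimal sketch rather than part of the switch software: the function name switch_lid_map is hypothetical, and the parsing assumes the line format shown in the example above.

```shell
# switch_lid_map: read ibswitches output on stdin and print
# "LID GUID description" for each switch line.
# Hypothetical helper; assumes the line format shown in the example.
switch_lid_map() {
    awk -F'"' '/^Switch/ {
        split($1, head, " ")        # head[3] holds the GUID
        n = split($3, tail, " ")    # words that follow the quoted description
        for (i = 1; i < n; i++)
            if (tail[i] == "lid") printf "%s %s %s\n", tail[i + 1], head[3], $2
    }'
}
```

To use the sketch, pipe the command output into the function (ibswitches | switch_lid_map). For the example output above, the function prints LID 15 followed by the GUID and the switch description.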

15.1.2 Identify All HCAs in the Fabric

You can use the ibhosts command to display identity information about the host channel adapters (HCAs) in the InfiniBand subnet. This command displays the GUID, number of ports, and name for each HCA.

On the command-line interface (CLI), run the following command:

# ibhosts

The output is displayed, as in the following example:

Ca : 0x0003ba000100e388 ports 2 "nsn33-43 HCA-1"
Ca : 0x5080020000911310 ports 1 "nsn32-20 HCA-1"
Ca : 0x50800200008e532c ports 1 "ib-71 HCA-1"
Ca : 0x50800200008e5328 ports 1 "ib-70 HCA-1"
Ca : 0x50800200008296a4 ports 2 "ib-90 HCA-1"
.
.
.
#

Note:

The output in the example is just a portion of the full output and varies for each InfiniBand topology.

15.1.3 Display the InfiniBand Fabric Topology

To understand the routing within your InfiniBand fabric, use the ibnetdiscover command to display the node-to-node connectivity. The amount of output depends on the size of your fabric. You can also use this command to display the LIDs of HCAs.

On the command-line interface (CLI), enter the following command:

# ibnetdiscover

The output is displayed, as in the following example:

# Topology file: generated on Sat Apr 13 22:28:55 2002
#
# Max of 1 hops discovered
# Initiated from node 0021283a8389a0a0 port 0021283a8389a0a0
vendid=0x2c9
devid=0xbd36
sysimgguid=0x21283a8389a0a3
switchguid=0x21283a8389a0a0(21283a8389a0a0)
Switch   36 "S-0021283a8389a0a0" # "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
[23]    "H-0003ba000100e388"[2](3ba000100e38a) # "nsn33-43 HCA-1" lid 14 4xQDR
vendid=0x2c9
devid=0x673c
sysimgguid=0x3ba000100e38b
caguid=0x3ba000100e388
Ca   2 "H-0003ba000100e388" # "nsn33-43 HCA-1"
[2](3ba000100e38a)   "S-0021283a8389a0a0"[23] # lid 14 lmc 0 "Sun DCS 36 QDR switch localhost" lid 15 4xQDR

Note:

The actual output for your InfiniBand fabric will differ from that in the example.
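
Because the topology output grows with the fabric, it can help to filter it when only the HCA LIDs are of interest. The following sketch assumes the port-line format shown in the example above, in which each link to an HCA names the far end as "H-" followed by its GUID and ends with the HCA description and LID. The function name hca_lids is hypothetical.

```shell
# hca_lids: read ibnetdiscover output on stdin and print
# "LID node-ID description" for every link whose far end is an HCA.
# Hypothetical helper; assumes the port-line format shown in the example.
hca_lids() {
    awk -F'"' '/^\[/ && $2 ~ /^H-/ {
        n = split($NF, tail, " ")   # text after the last quote: lid 14 4xQDR
        if (n >= 2 && tail[1] == "lid") print tail[2], $2, $(NF - 1)
    }' | sort -u
}
```

To use the sketch, pipe the command output into the function (ibnetdiscover | hca_lids). The sort -u step removes duplicate lines, which can occur when the same link is reported more than once.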

15.1.4 Display a Route Through the Fabric

You sometimes need to know the route between two nodes in the InfiniBand fabric. The ibtracert command can provide that information by displaying the GUIDs, ports, and LIDs of the nodes.

On the command-line interface (CLI), run the following command:

# ibtracert slid dlid

where slid is the LID of the source node and dlid is the LID of the destination node in the fabric.

The output is displayed, as in the following example:

# ibtracert 15 14
#
From switch {0x0021283a8389a0a0} portnum 0 lid 15-15 "Sun DCS 36 QDR switch localhost"
[23] -> ca port {0x0003ba000100e38a}[2] lid 14-14 "nsn33-43 HCA-1"
To ca {0x0003ba000100e388} portnum 2 lid 14-14 "nsn33-43 HCA-1"
#

For this example:

The route starts at the switch with GUID 0x0021283a8389a0a0, using port 0. The switch is LID 15, and its description is Sun DCS 36 QDR switch localhost. The route leaves the switch through port 23 and enters the HCA at port 2, whose port GUID is 0x0003ba000100e38a. The HCA is LID 14, and its node GUID is 0x0003ba000100e388.

Note:

The actual output for your InfiniBand fabric will differ from that in the example.

15.1.5 Display the Link Status of a Node

If you want to know the link status of a node in the InfiniBand fabric, use the ibportstate command to display the state, width, and speed of a port on that node.

On the command-line interface (CLI), run the following command:

# ibportstate lid port

where lid is the LID of the node in the fabric, and port is the port of the node.

The output is displayed, as in the following example:

# ibportstate 15 23

PortInfo:
# Port info: Lid 15 port 23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 15 DR path slid 15; dlid 65535; 0,23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
#

Note:

The actual output for your InfiniBand fabric will differ from that in the example.
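
For a quick check, only the active state, width, and speed of the local port are usually of interest. The following sketch extracts those three fields and ignores the peer section. The function name link_summary is hypothetical, and the parsing assumes the field names and dot leaders shown in the example above.

```shell
# link_summary: read ibportstate output on stdin and print the active
# link state, width, and speed of the local port as name=value lines.
# Hypothetical helper; assumes the field layout shown in the example.
link_summary() {
    awk '
        /^Peer PortInfo/ { exit }                       # stop before the peer section
        /^(LinkState|LinkWidthActive|LinkSpeedActive):/ {
            name = $0;  sub(/:.*/, "", name)            # text before the colon
            value = $0; sub(/^[^:]*:\.*/, "", value)    # text after the dot leaders
            printf "%s=%s\n", name, value
        }'
}
```

To use the sketch, pipe the command output into the function (ibportstate 15 23 | link_summary). For the example above, it prints LinkState=Active, LinkWidthActive=4X, and LinkSpeedActive=10.0 Gbps, one per line.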

15.1.6 Display Counters for a Node

To help ascertain the health of a node in the fabric, use the perfquery command to display the performance, error, and data counters for that node.

On the command-line interface (CLI), enter the following command:

# perfquery lid port

where lid is the LID of the node in the fabric, and port is the port of the node.

Note:

If a port value of 255 is specified for a switch node, the counters are the total for all switch ports.

For example:

# perfquery 15 23
#
# Port counters: Lid 15 port 23
PortSelect:......................23
CounterSelect:...................0x1b01
SymbolErrors:....................0
.
.
.
VL15Dropped:.....................0
XmtData:.........................20232
RcvData:.........................20232
XmtPkts:.........................281
RcvPkts:.........................281

Note:

The output in the example is just a portion of the full output.
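
Most counters are zero on a healthy port, so it can be convenient to list only those that are not. The following sketch skips the selector fields and the four data counters that appear in the example, and prints every remaining counter with a nonzero value. Treating all remaining counters as error counters is an assumption of this sketch, and the function name nonzero_error_counters is hypothetical.

```shell
# nonzero_error_counters: read perfquery output on stdin and print
# "name value" for each counter that is not zero, ignoring the selector
# fields and the data counters shown in the example.
# Hypothetical helper; assumes the name:....value layout of the example.
nonzero_error_counters() {
    awk -F'[:.]+' '
        /^#/ { next }                                   # skip prompt and header lines
        $1 ~ /^(PortSelect|CounterSelect|XmtData|RcvData|XmtPkts|RcvPkts)$/ { next }
        NF >= 2 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}
```

To use the sketch, pipe the command output into the function (perfquery 15 23 | nonzero_error_counters). No output means that none of the remaining counters has incremented.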

15.1.7 Display Data Counters for a Node

To list the data counters for a node in the fabric, use the ibdatacounts command.

On the command-line interface (CLI), enter the following command:

# ibdatacounts lid port

where lid is the LID of the node in the fabric, and port is the port of the node.

For example:

# ibdatacounts 15 23
#
XmtData:.........................6048
RcvData:.........................6048
XmtPkts:.........................84
RcvPkts:.........................84

Note:

The actual output for your InfiniBand fabric will differ from that in the example.

15.1.8 Display Low-Level Detailed Information for a Node

If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a node in the fabric.

On the command-line interface (CLI), enter the following command:

# smpquery switchinfo lid

where lid is the LID of the node in the fabric.

For example:

# smpquery switchinfo 15
#
# Switch info: Lid 15
LinearFdbCap:....................49152
RandomFdbCap:....................0
McastFdbCap:.....................4096
LinearFdbTop:....................16
DefPort:.........................0
DefMcastPrimPort:................255
DefMcastNotPrimPort:.............255
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................1
#

Note:

The actual output for your InfiniBand fabric will differ from that in the example.

15.1.9 Display Low-Level Detailed Information for a Port

If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a port.

On the command-line interface (CLI), enter the following command:

# smpquery portinfo lid port

where lid is the LID of the node in the fabric, and port is the port of the node.

For example:

# smpquery portinfo 15 23
#
Mkey:............................0x0000000000000000
GidPrefix:.......................0x0000000000000000
Lid:.............................0x0000
SMLid:...........................0x0000
CapMask:.........................0x0
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................0
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
.
.
.
SubnetTimeout:...................0
RespTimeVal:.....................0
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................85
RoundTrip:.......................16777215
#

Note:

The actual output for your InfiniBand fabric will differ from that in the example, and it is just a portion of the full output.

15.1.10 Map LIDs to GUIDs

In the InfiniBand fabric in Exalogic machines, the Subnet Manager assigns subnet-specific LIDs to the nodes in the fabric. When you use the InfiniBand commands, you must often provide an LID to issue a command to a particular InfiniBand device.

Alternatively, the output of a command might identify InfiniBand devices by their LID. You can create a file that maps node LIDs to node GUIDs, which can help with administering your InfiniBand fabric.

Note:

Creation of the mapping file is not a requirement for InfiniBand administration.

The following procedure creates a file that lists the LID in hexadecimal, the GUID in hexadecimal, and the node description:

  1. Create an inventory file:

    # osmtest -f c -i inventory.txt

    The inventory.txt file can also be used for purposes other than this procedure.

  2. Create a mapping file:

    # grep -e '^lid' -e 'port_guid' -e 'desc' inventory.txt | sed 's/^lid/\nlid/' > mapping.txt

  3. Edit the latter half of the mapping.txt file to remove the nonessential information. The content of the mapping.txt file looks similar to the following:
    lid 0x14
    port_guid 0x0021283a8620b0a0
    # node_desc Sun DCS 72 QDR switch 1.2(LC)
    lid 0x15
    port_guid 0x0021283a8620b0b0
    # node_desc Sun DCS 72 QDR switch 1.2(LC)
    lid 0x16
    port_guid 0x0021283a8620b0c0
    # node_desc Sun DCS 72 QDR switch 1.2(LC)
    

Note:

The output in the example is just a portion of the entire file.
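
The mapping file stores each node on three lines with the LID in hexadecimal, whereas the command examples in this section pass LIDs in decimal. The following sketch joins each record into a single line and converts the LID to decimal. It assumes the three-line record layout shown in the example (lid, port_guid, then a "# node_desc" line), and the function name mapping_to_table is hypothetical.

```shell
# mapping_to_table: read mapping.txt on stdin and print one line per node:
# "decimal-LID GUID description".
# Hypothetical helper; assumes the record layout shown in the example.
mapping_to_table() (
    lid="" guid=""
    while read -r key value; do
        case "$key" in
            lid)       lid=$value ;;
            port_guid) guid=$value ;;
            '#')       # the "# node_desc ..." line closes one record
                printf '%d %s %s\n' "$lid" "$guid" "${value#node_desc }" ;;
        esac
    done
)
```

To use the sketch, redirect the file into the function (mapping_to_table < mapping.txt). For the first record in the example, it prints 20 0x0021283a8620b0a0 Sun DCS 72 QDR switch 1.2(LC).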

15.1.11 Perform Comprehensive Diagnostics for the Entire Fabric

If you require a full testing of your InfiniBand fabric, you can use the ibdiagnet command to perform many tests with verbose results. The command is a useful tool to determine the general overall health of the InfiniBand fabric.

On the command-line interface (CLI), run the following command:

# ibdiagnet -v -r

The ibdiagnet.log file contains the log of the testing.

15.1.12 Perform Comprehensive Diagnostics for a Route

You can use the ibdiagpath command to perform some of the same comprehensive tests for a particular route.

On the command-line interface (CLI), run the following command:

# ibdiagpath -v -l slid,dlid

where slid is the LID of the source node in the fabric, and dlid is the LID of the destination node.

The ibdiagpath.log file contains the log of the testing.

15.1.13 Determine Changes to the InfiniBand Topology

If your fabric has a number of nodes that are suspect, the osmtest command enables you to take a snapshot (inventory file) of your fabric and at a later time compare that file to the present conditions.

Note:

Although this procedure is most useful after initializing the Subnet Manager, it can be performed at any time.

Complete the following steps:

  1. Ensure that the Subnet Manager is running.
  2. On the command-line interface (CLI), run the following command to take a snapshot of the topology:

    # osmtest -f c

    For example:

    # osmtest -f c
    Command Line Arguments
    Done with args
    Flow = Create Inventory
    Aug 13 19:44:53 601222 [B7D466C0] 0x7f -> Setting log level to: 0x03
    Aug 13 19:44:53 601969 [B7D466C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
    using default guid 0x21283a8620b0f0
    Aug 13 19:44:53 612312 [B7D466C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
    Aug 13 19:44:53 636876 [B7D466C0] 0x02 -> osmtest_validate_sa_class_port_info:
    -----------------------------
    SA Class Port Info:
    base_ver:1
    class_ver:2
    cap_mask:0x2602
    cap_mask2:0x0
    resp_time_val:0x10
    -----------------------------
    OSMTEST: TEST "Create Inventory" PASS
    #
    
  3. After an event, compare the present topology to that saved in the inventory file, as in the following example:
    # osmtest -f v
    Command Line Arguments
    Done with args
    Flow = Validate Inventory
    Aug 13 19:45:02 342143 [B7EF96C0] 0x7f -> Setting log level to: 0x03
    Aug 13 19:45:02 342857 [B7EF96C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
    using default guid 0x21283a8620b0f0
    Aug 13 19:45:02 351555 [B7EF96C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
    Aug 13 19:45:02 375997 [B7EF96C0] 0x02 -> osmtest_validate_sa_class_port_info:
    -----------------------------
    SA Class Port Info:
    base_ver:1
    class_ver:2
    cap_mask:0x2602
    cap_mask2:0x0
    resp_time_val:0x10
    -----------------------------
    Aug 13 19:45:02 378991 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0a0, LID 0x14
    Aug 13 19:45:02 379172 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0b0, LID 0x15
    .
    .
    .
    Aug 13 19:45:02 480201 [B7EF96C0] 0x01 -> osmtest_validate_single_path_rec_guid_pair:
    Checking src 0x0021283a8620b0f0 to dest 0x0021283a8620b0f0
    Aug 13 19:45:02 480588 [B7EF96C0] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x19 to DLID 0x19
    Aug 13 19:45:02 480989 [B7EF96C0] 0x02 -> osmtest_run:
    ***************** ALL TESTS PASS *****************
    OSMTEST: TEST "Validate Inventory" PASS
    #
    

    Note:

    Depending on the size of your InfiniBand fabric, the output from the osmtest command could be tens of thousands of lines long.
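
Because the validation output can be very long, a script usually needs only the final result. The following sketch reports success when the output contains the closing "Validate Inventory" PASS line shown in the example. The function name inventory_validated is hypothetical, and the sketch makes no assumption about what a failed run prints, so any output without that line is treated as a failure.

```shell
# inventory_validated: read "osmtest -f v" output on stdin and return
# success (exit status 0) only if the closing PASS line is present.
# Hypothetical helper; matches the final line shown in the example.
inventory_validated() {
    grep -q '^ *OSMTEST: TEST "Validate Inventory" PASS'
}
```

To use the sketch, pipe the command output into the function and test the exit status (osmtest -f v | inventory_validated && echo "topology unchanged").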

15.1.14 Determine Which Links Are Experiencing Significant Errors

You can use the ibdiagnet command to determine which links are experiencing symbol errors and recovery errors by injecting packets.

On the command-line interface (CLI), run the following command:

# ibdiagnet -c 100 -P all=1

In this instance of the ibdiagnet command, 100 test packets are injected into each link and the -P all=1 option returns all counters that increment during the test.

In the output of the ibdiagnet command, search for the symbol_error_counter string. That line contains the symbol error count in hexadecimal. The preceding lines identify the node and port with the errors. Symbol errors are minor errors, and if there are relatively few during the diagnostic, they can be monitored.

Note:

For the bit error rate (BER) of 10^-12 defined by the InfiniBand specification, the maximum allowable symbol error rate is 120 errors per hour.

In addition, search the output of the ibdiagnet command for the link_error_recovery_counter string. That line contains the recovery error count in hexadecimal, and the preceding lines identify the node and port with the errors. Recovery errors are major errors, and the respective links must be investigated for the cause of the rapid symbol error propagation.

Additionally, the ibdiagnet.log file contains the log of the testing.
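
Because the symbol error count is reported in hexadecimal while the limit is stated per hour, a small conversion helps when judging a link. The following sketch takes a hexadecimal count and the length of the observation window in seconds, and compares the resulting hourly rate with the limit of 120 errors per hour. The function name symbol_error_rate is hypothetical, and the observation window is a value that you must measure yourself, because it is not part of the ibdiagnet output described here.

```shell
# symbol_error_rate COUNT SECONDS
# COUNT is the symbol error count (0x-prefixed hexadecimal is accepted);
# SECONDS is the length of the observation window, supplied by the caller.
# Hypothetical helper that compares the hourly rate with 120 errors/hour.
symbol_error_rate() {
    count=$(( $1 ))                     # shell arithmetic converts 0x... to decimal
    seconds=$2
    per_hour=$(( count * 3600 / seconds ))
    if [ "$per_hour" -gt 120 ]; then
        echo "$per_hour errors/hour: above the 120 errors/hour limit"
    else
        echo "$per_hour errors/hour: within the 120 errors/hour limit"
    fi
}
```

For example, a count of 0x1e (30 errors) observed over 600 seconds corresponds to 180 errors per hour, which is above the limit.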

15.1.15 Check All Ports

To perform a quick check of all ports of all nodes in your InfiniBand fabric, you can use the ibcheckstate command.

On the command-line interface (CLI), run the following command:

# ibcheckstate -v

The output is displayed, as in the following example:

# Checking Switch: nodeguid 0x0021283a8389a0a0
Node check lid 15: OK
Port check lid 15 port 23: OK
Port check lid 15 port 19: OK
.
.
.
# Checking Ca: nodeguid 0x0003ba000100e388
Node check lid 14: OK
Port check lid 14 port 2: OK
## Summary: 5 nodes checked, 0 bad nodes found
## 10 ports checked, 0 ports with bad state found
#

Note:

The ibcheckstate command requires time to complete, depending upon the size of your InfiniBand fabric. Without the -v option, the output contains only failed ports. The output in the example is only a small portion of the actual output.
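
In a monitoring script, the two summary lines are often all that is needed. The following sketch reads the verbose output and returns success only when both summary lines are present and report no bad nodes and no ports in a bad state. The function name fabric_state_ok is hypothetical, and the parsing assumes the summary wording shown in the example above.

```shell
# fabric_state_ok: read "ibcheckstate -v" output on stdin and return
# success (exit status 0) only if both summary lines report zero problems.
# Hypothetical helper; assumes the summary wording shown in the example.
fabric_state_ok() {
    awk '
        /^## Summary:/  { bad_nodes = $6; seen++ }   # ## Summary: 5 nodes checked, 0 bad nodes found
        /ports checked/ { bad_ports = $5; seen++ }   # ## 10 ports checked, 0 ports with bad state found
        END { exit ((seen == 2 && bad_nodes + bad_ports == 0) ? 0 : 1) }'
}
```

To use the sketch, pipe the command output into the function and test the exit status (ibcheckstate -v | fabric_state_ok || echo "fabric needs attention"). Missing summary lines are treated as a failure, so an interrupted run is not mistaken for a healthy fabric.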