Sun Datacenter InfiniBand Switch 72 Topic Set

InfiniBand Fabric Problems

The following table lists situations that might occur with the InfiniBand fabric and the corrective steps to resolve each one.

Situation
Corrective Steps
Performance of the InfiniBand fabric seems diminished.
  1. Determine if there are errors or problems with the InfiniBand fabric.

    See:

  2. Locate the affected nodes by the GUID provided in the output of the ibdiagnet command.

    See Locate a Switch Chip or Connector From the GUID.

  3. If the problem is at a cable connection, swap the suspect cable with a known good cable or reconnect the cable to a known good remote port and repeat Step 1.

    See Servicing the InfiniBand Cables.

  4. If the problem still remains at the cable connection, disable and re-enable the respective port and repeat Step 1.

    See Disable a Port and Enable a Port.

Temporary solution:

  • If the problem still remains, disable the affected port.

    See Disable a Port.

Permanent solution:

An InfiniBand Link LED is blinking.
  1. Disconnect and properly reconnect both ends of the respective InfiniBand cable.

    See Servicing the InfiniBand Cables.

  2. If the LED is still blinking, use the ibdiagnet command to determine how significant the errors are.

    See Determine Which Links Are Experiencing Significant Errors.

  3. Determine which connectors map to the affected link by deconstructing the node’s GUID and port.

    See Locate a Switch Chip or Connector From the GUID.

  4. If some of the links are running at 1x or SDR, follow the steps for the situation "Some InfiniBand links are running at 1x or SDR" in this table.

  5. Disable and re-enable the respective ports.

    See Disable a Port and Enable a Port.

  6. If the errors are still significant, swap the cable with a known good one or reconnect the cable to a known good remote port, and repeat from Step 2.

  7. Replace whichever component the previous steps identify as faulty.

    See Servicing the InfiniBand Cables.

    See the remote port’s documentation for replacement procedures.

Some InfiniBand links are running at 1x or SDR.
For a temporary solution:
  1. Identify the suspect links using the ibdiagnet command.

    See Find 1x or SDR or DDR Links in the Fabric. Look for text like the following:

    -W- link with SPD=2.5 found at direct path "1,19"

    From: a Switch PortGUID=0x00066a00d80001dd Port=19

    To: a Switch PortGUID=0x00066a00d80001dd Port=24

  2. Determine which connectors map to the affected link by deconstructing the node’s GUID and port.

    See Locate a Switch Chip or Connector From the GUID.

  3. Verify the cable connection at both ends.

    See Servicing the InfiniBand Cables.

  4. Disable and re-enable the respective ports.

    See Disable a Port and Enable a Port.

  5. If the previous steps do not rectify the problem, disable the port.

    See Disable a Port.

For a permanent solution:

  1. Perform Step 1 through Step 4 of the temporary solution.

  2. Swap the cable with a known good one or reconnect the cable to a known good remote port, and repeat from Step 1.

  3. Replace whichever component the previous steps identify as faulty: the cable, the remote port, or the switch.

    See Servicing the InfiniBand Cables.

    See the remote port’s documentation for replacement procedures.

    See Remove the Switch From the Rack and Installing the Switch.
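
When the ibdiagnet output is long, the slow-link warnings described in Step 1 of the temporary solution can be picked out with a small script. The following Python sketch is illustrative only. It assumes the warnings appear exactly in the three-line form of the sample above (a `-W- link with SPD=...` line followed by `From:` and `To:` lines). The function name `find_slow_links` and the dictionary layout are hypothetical, not part of any switch or OFED tool, and the sketch does not run ibdiagnet itself.

```python
import re

# Patterns modeled on the sample warning shown in this table (assumed format).
WARNING = re.compile(
    r'-W- link with SPD=(?P<speed>[\d.]+) found at direct path "(?P<path>[^"]*)"'
)
ENDPOINT = re.compile(
    r'(?P<side>From|To): a \w+ PortGUID=(?P<guid>0x[0-9a-fA-F]+) Port=(?P<port>\d+)'
)


def find_slow_links(text):
    """Return one dict per slow-link warning found in saved ibdiagnet output."""
    links = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = WARNING.search(line)
        if match:
            # A new warning starts; the From/To lines that follow belong to it.
            current = {"speed": match.group("speed"), "path": match.group("path")}
            links.append(current)
            continue
        match = ENDPOINT.search(line)
        if match and current is not None:
            key = match.group("side").lower()  # "from" or "to"
            current[key] = (match.group("guid"), int(match.group("port")))
    return links


# The sample warning from this table, used here as test input.
SAMPLE = """\
-W- link with SPD=2.5 found at direct path "1,19"
From: a Switch PortGUID=0x00066a00d80001dd Port=19
To: a Switch PortGUID=0x00066a00d80001dd Port=24
"""

for link in find_slow_links(SAMPLE):
    print(link["speed"], link["path"], link["from"], link["to"])
```

Each GUID and port pair returned this way can then be mapped to a connector as described in Locate a Switch Chip or Connector From the GUID.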

There are errors on some InfiniBand links.
  1. Clear the error counters.

    See Clear Error Counters.

  2. Start a fabric stress test.

  3. Identify the suspect links using the ibdiagnet command.

    See Determine Which Links Are Experiencing Significant Errors. Look for text like the following:

    -W- lid=0x0006 guid=0x0021283a8816c0a0 dev=48438 Port=34

    Performance Monitor counter : Value

    link_recovery_error_counter : 0x1

    symbol_error_counter : 0x25 (Increase by 3 during ibdiagnet)

  4. For links that are experiencing recovery errors or substantial symbol errors, refer to other parts of this table to help identify the cause and rectify the problem.
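
The counter warnings described in Step 3 can be collected the same way. The following Python sketch is illustrative only. It assumes the warning layout matches the sample above, and that the "Increase by" figure is a decimal number. The names `parse_counter_warnings` and `suspect_ports`, and the `min_symbol_increase` threshold, are hypothetical. This table does not define what counts as a "substantial" number of symbol errors, so choose a threshold that suits your fabric.

```python
import re

# Patterns modeled on the sample warning shown in this table (assumed format).
HEADER = re.compile(
    r'-W- lid=(?P<lid>0x[0-9a-fA-F]+) guid=(?P<guid>0x[0-9a-fA-F]+) '
    r'dev=(?P<dev>\d+) Port=(?P<port>\d+)'
)
COUNTER = re.compile(
    r'^(?P<name>\w+_counter)\s*:\s*(?P<value>0x[0-9a-fA-F]+)'
    r'(?:\s*\(Increase by (?P<delta>\d+) during ibdiagnet\))?'
)


def parse_counter_warnings(text):
    """Return one dict per port warning, with its counters, from saved output."""
    ports = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.search(line)
        if match:
            current = {
                "lid": match.group("lid"),
                "guid": match.group("guid"),
                "port": int(match.group("port")),
                "counters": {},
            }
            ports.append(current)
            continue
        match = COUNTER.match(line)
        if match and current is not None:
            delta = match.group("delta")
            current["counters"][match.group("name")] = {
                "value": int(match.group("value"), 16),  # counters are printed in hex
                "increase": int(delta) if delta else 0,  # assumed decimal
            }
    return ports


def suspect_ports(ports, min_symbol_increase=1):
    """Keep ports with any link recovery error or a growing symbol error count.

    min_symbol_increase is a hypothetical threshold, not a documented value.
    """
    flagged = []
    for port in ports:
        counters = port["counters"]
        recovery = counters.get("link_recovery_error_counter", {}).get("value", 0)
        growth = counters.get("symbol_error_counter", {}).get("increase", 0)
        if recovery > 0 or growth >= min_symbol_increase:
            flagged.append(port)
    return flagged


# The sample warning from this table, used here as test input.
SAMPLE = """\
-W- lid=0x0006 guid=0x0021283a8816c0a0 dev=48438 Port=34
Performance Monitor counter : Value
link_recovery_error_counter : 0x1
symbol_error_counter : 0x25 (Increase by 3 during ibdiagnet)
"""

for port in suspect_ports(parse_counter_warnings(SAMPLE)):
    print(port["guid"], port["port"], port["counters"])
```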

Output of InfiniBand commands provides only GUID and port, not switch chip or CXP connectors.
You can find the location of a node in the switch by deconstructing the node’s GUID and port. You can then cross-reference the node and port to a connector.

See Locate a Switch Chip or Connector From the GUID and Understanding Routing Through the Switch.
