Sun HPC 3.0 SCI Guide

Chapter 6 SCI Interface Troubleshooting

SCI Switch

General Hardware Inspection

Perform the following checks to determine the physical state of various SCI subsystem components. Verify that:

SCI Switch Status LED Locations

Clusters with three or four nodes can be connected through one or two SCI switches. The switch status LEDs provide information that can be used to troubleshoot SCI switch failures (Figure 6-1). Guidelines for interpreting these LEDs are provided in "Port Status LEDs" and "General Switch Status LED".

Figure 6-1 SCI Status LED Locations


Port Status LEDs

The four port status LEDs located on the switch front panel can be used to troubleshoot individual port failures (Table 6-1).


Note -

A switch port sync error can result from a cable being removed.


Table 6-1 SCI Switch Port Status LEDs

| Situation | Port LED Status |
|---|---|
| No power | All four LEDs not lit |
| Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem | All four LEDs red |
| Port errors: SCI cable out, sync error | Associated port LED is red |
| Port operative, no transactions | Associated port LED is green |
| Port operative, with transactions | Associated port LED is blinking green |
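
The interpretation rules in Table 6-1 can be sketched as a small lookup. This is only an illustration: the function name, the string encoding of the LED states, and the numbering of ports from 0 to 3 are assumptions made for the example, not part of any Sun tool.

```python
def diagnose_port_leds(leds):
    """Interpret the four front-panel port LEDs following Table 6-1.

    `leds` holds four strings, one per port, each being "off", "red",
    "green", or "blinking green" (a hypothetical encoding).
    """
    if len(leds) != 4:
        raise ValueError("expected exactly four port LED states")
    # Whole-switch conditions are signalled by all four LEDs together.
    if all(state == "off" for state in leds):
        return ["no power"]
    if all(state == "red" for state in leds):
        return ["fatal switch error (hardware, temperature, fan, or power supply)"]
    # Otherwise each LED describes only its own port.
    meanings = {
        "red": "port error (SCI cable out or sync error)",
        "green": "operative, no transactions",
        "blinking green": "operative, with transactions",
    }
    return [
        f"port {port}: {meanings.get(state, 'state not covered by Table 6-1')}"
        for port, state in enumerate(leds)
    ]


if __name__ == "__main__":
    for finding in diagnose_port_leds(["green", "blinking green", "green", "red"]):
        print(finding)
```

Checking the all-off and all-red cases first mirrors the table, where those two patterns describe the switch as a whole rather than any single port.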

General Switch Status LED

The switch status LED located on the rear panel indicates overall switch failures (Table 6-2).

Table 6-2 SCI Switch Rear Panel LED

| Situation | LED Status |
|---|---|
| Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem | Red |
| Switch operational | Green |

The get_ci_status Command

You can use the results of the get_ci_status command to troubleshoot clusters that have SCI switches. For example, for the configuration in Figure 6-2, if the get_ci_status command is used on interconn1, a typical output would be:


    # /opt/SUNWsma/bin/get_ci_status
    sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
    sma: Switch_id# 0
    sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
    sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
    sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
    #

In this example, the line


    sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational

indicates that the path between SCI switch 0, port 3 and interconn4 is inactive and not operational.
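
When the output has been captured to a file or pasted from a terminal, the per-port lines can be scanned for paths that are not active and operational. The following is a minimal sketch that assumes the output has exactly the layout shown in the example above. The function names are hypothetical, and the pattern may need adjusting if your version of get_ci_status prints a different format.

```python
import re

# Matches per-port lines such as:
# sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
PORT_LINE = re.compile(
    r"port_id#\s*(\d+):\s*host_name\s*=\s*([^;]+);\s*adapter_id\s*=\s*(\d+);\s*(.+)$"
)

SAMPLE = """\
sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
sma: Switch_id# 0
sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
"""


def parse_ports(output):
    """Return one dict per switch port line found in the captured output."""
    ports = []
    for line in output.splitlines():
        match = PORT_LINE.search(line.strip())
        if match is None:
            continue  # the adapter summary and Switch_id lines are skipped
        port, host, adapter, state = match.groups()
        # The state field looks like "active | operational".
        link, _, oper = (part.strip() for part in state.partition("|"))
        ports.append({
            "port": int(port),
            "host": host.strip(),
            "adapter_id": int(adapter),
            "link": link,
            "oper": oper,
        })
    return ports


def failed_ports(ports):
    """Keep only the ports not reported as active and operational."""
    return [p for p in ports if (p["link"], p["oper"]) != ("active", "operational")]


if __name__ == "__main__":
    for p in failed_ports(parse_ports(SAMPLE)):
        print(f"port {p['port']} ({p['host']}): {p['link']}, {p['oper']}")
```

Run against the sample output, the sketch reports only port 3 and interconn4, the same path singled out in the text.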

Figure 6-2 Typical Four-Node Configuration with an SCI Switch


In this instance, if the get_ci_status command is run on all four nodes and each reports the same path between SCI switch 0, port 3 and interconn4 as inactive and inoperative, the fault is most likely in SCI switch 0, port 3, in the cable, or in the interconn4 host adapter.

However, if the get_ci_status command reports that path as inactive and inoperative on one node only, for example interconn1, the fault is most likely in the interconn1 host adapter, its cable, or SCI switch 0, port 0.
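
The reasoning in the two paragraphs above can be written as a short decision function. This is a sketch under stated assumptions: `reports` is a hand-built dictionary mapping each node on which get_ci_status was run to the set of remote hosts it listed as inactive, and a report from every node other than the target is treated as "all nodes". Neither the data structure nor the function name comes from the SCI software.

```python
def suspect_side(reports, target):
    """Suggest where to look first when the path to `target` is reported down.

    reports: dict mapping node name -> set of hosts that node's
             get_ci_status output listed as inactive and inoperational.
    target:  host whose path is in question, for example "interconn4".
    """
    reporters = sorted(node for node, down in reports.items() if target in down)
    others = [node for node in reports if node != target]
    if not reporters:
        return "no node reports the path to " + target + " as down"
    # Every other node sees the same failure: suspect the target's side.
    if len(others) > 1 and set(reporters) >= set(others):
        return "check the switch port, cable, and host adapter of " + target
    # Only one node sees the failure: suspect that node's own side.
    if len(reporters) == 1:
        return "check the host adapter, cable, and switch port of " + reporters[0]
    return "inconclusive: inspect each reporting node"


if __name__ == "__main__":
    everyone = {
        "interconn1": {"interconn4"},
        "interconn2": {"interconn4"},
        "interconn3": {"interconn4"},
        "interconn4": set(),
    }
    print(suspect_side(everyone, "interconn4"))
```

The result is only a starting point for inspection; the LED checks described earlier in this chapter still apply.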

Note that some aspects of the get_ci_status command output, such as host names, will vary according to your configuration.

Client Net Failure

System console messages identify the specific port that has failed. For information on test commands and additional troubleshooting, refer to the documentation that came with your client network interface card.

Incorrect Software Configuration

Make sure that:

Incorrect Firmware

If an SCI adapter card is loaded with the wrong firmware, the SCI cards will not be detected upon system power-on or reboot/reset.

Improper loading of the firmware can happen two ways:

If proper firmware is loaded, each SCI card prints a banner (containing the word FCode) twice during power-on, reboot, or reset. A card loaded with improper firmware prints no banner at all.

The following are sample console messages (which are not saved in the message file):

  1. One SCI card is working in the node:


    rebooting...
    Resetting ... 
    
    DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017 
    FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $ 
    
    Executing SCI adapter selftest.    Adapter OK. 
    screen not found.
    Can't open input device.
    Keyboard not present.  Using ttya for input and output.
    
    DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017 
    FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $ 
    
    Executing SCI adapter selftest.    Adapter OK. 
    
    Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard

  2. No SCI cards are working in the node:


    rebooting...
    Resetting ... 
    
    screen not found.
    Can't open input device.
    Keyboard not present.  Using ttya for input and output.
    
    Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard


    Note -

    If SCI cards do not show up during boot time, check the physical installation of the cards. If reseating the cards does not correct the problem, the SCI cards may be damaged and should be returned.
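
Because these console messages are not saved in the message file, any automated check has to work on text you captured yourself, for example from a terminal session log. The sketch below counts banner lines per serial number in such a capture. The pattern is taken from the sample banner shown above and is an assumption about other cards and firmware revisions, and the function name is hypothetical.

```python
import re
from collections import Counter

# Taken from the sample banner line:
# DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
SERIAL = re.compile(r"Adapter - \d+, Serial #(\d+)")

SAMPLE_CONSOLE = """\
rebooting...
Resetting ...

DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $

Executing SCI adapter selftest.    Adapter OK.
screen not found.
Can't open input device.
Keyboard not present.  Using ttya for input and output.

DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $

Executing SCI adapter selftest.    Adapter OK.

Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard
"""


def banner_counts(console_text):
    """Map each card serial number to the number of banners it printed."""
    return dict(Counter(SERIAL.findall(console_text)))


if __name__ == "__main__":
    counts = banner_counts(SAMPLE_CONSOLE)
    if not counts:
        print("no SCI banners found")
    for serial, seen in counts.items():
        # A card with proper firmware is expected to print its banner twice.
        print(f"serial {serial}: {seen} banner(s)")
```

A serial number that appears twice matches the behavior described for a card with proper firmware; an installed card whose serial number never appears is a candidate for the investigation steps that follow.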


If you suspect that an SCI SBus interface card is loaded with the wrong firmware, perform the following steps to investigate:

  1. With the system powered off, note the serial numbers of the adapter cards that are physically installed.

  2. Turn the system power on.

  3. Run /opt/SUNWsci/bin/sciadm and enter the identify command.

    This command displays the firmware version, fcode version, and serial number of each adapter board found.

  4. Compare the number of cards found by sciadm against the number of adapters physically installed.

    Two cards should be displayed in the output. If not, there is at least one bad card in the system.

  5. Compare the adapter board serial numbers in the output of the identify command to the serial number on each adapter card physically installed.

    Note which serial number(s) are displayed. Cards that do not have their serial numbers displayed are bad and need replacement.
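
The comparison in steps 4 and 5 amounts to a set difference between the serial numbers you noted in step 1 and those reported by the identify command. A minimal sketch follows, assuming both lists are typed in by hand; it does not parse sciadm output, whose exact format is not shown here. The function name and the serial number 5018 are made up for illustration.

```python
def compare_serials(installed, detected):
    """Compare serial numbers noted from the cards with those reported.

    installed: serial numbers read off the physically installed cards.
    detected:  serial numbers copied from the identify command output.
    Returns (missing, unexpected) as sorted lists of strings.
    """
    installed_set = {str(s).strip() for s in installed}
    detected_set = {str(s).strip() for s in detected}
    # Installed but not reported: candidates for replacement.
    missing = sorted(installed_set - detected_set)
    # Reported but not in your notes: recheck the notes from step 1.
    unexpected = sorted(detected_set - installed_set)
    return missing, unexpected


if __name__ == "__main__":
    missing, unexpected = compare_serials(["5017", "5018"], ["5017"])
    print("not detected:", missing)
    print("not in notes:", unexpected)
```

Serial numbers are compared as strings so that leading zeros, if any, are not lost.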