Sun HPC 3.0 SCI Guide

Chapter 6 SCI Interface Troubleshooting

SCI Switch

General Hardware Inspection

Perform the following checks to determine the physical state of various SCI subsystem components. Verify that:

All SCI scrubber jumpers are properly set, depending on the cluster topology.
All SCI cables are properly seated.
All SCI switches have power applied.
No SCI status LEDs are red (see Table 6-1 and Table 6-2).

SCI Switch Status LED Locations

Clusters with three or four nodes can be connected through one or two SCI switches. The switch status LEDs provide information that can be used to troubleshoot SCI switch failures (Figure 6-1). Guidelines for interpreting these LEDs are provided in "Port Status LEDs" and "General Switch Status LED".

Figure 6-1 SCI Status LED Locations

Port Status LEDs

The four port status LEDs located on the switch front panel can be used to troubleshoot individual port failures (Table 6-1).

Note -

A switch port sync error can result from a cable being removed.

Table 6-1 SCI Switch Port Status LEDs


Situation	Port LED Status
No power	All four LEDs not lit
Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem	All four LEDs red
Port errors: SCI cable out, sync error	Associated port LED is red
Port operative, no transactions	Associated port LED is green
Port operative, with transactions	Associated port LED is blinking green
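
As an illustration only, the rules in Table 6-1 can be written as a small lookup. This is a minimal sketch: the function name, the string encoding of LED states, and the 0-3 port numbering are assumptions made for the example and are not part of any Sun tool.

```python
def diagnose_port_leds(leds):
    """Interpret the four front-panel port LEDs according to Table 6-1.

    `leds` is a list of four strings, one per port, each one of
    "off", "red", "green", or "blinking green" (hypothetical encoding).
    """
    if len(leds) != 4:
        raise ValueError("expected exactly four port LED states")
    # The two whole-switch conditions take priority over per-port states.
    if all(state == "off" for state in leds):
        return ["No power"]
    if all(state == "red" for state in leds):
        return ["Fatal switch error: hardware, temperature, fan, or power supply"]
    meanings = {
        "red": "port error (SCI cable out or sync error)",
        "green": "operative, no transactions",
        "blinking green": "operative, with transactions",
    }
    # Any state not listed in Table 6-1 is reported as such, not guessed at.
    return [
        f"port {i}: {meanings.get(state, 'state not covered by Table 6-1')}"
        for i, state in enumerate(leds)
    ]
```

For example, passing `["green", "red", "blinking green", "green"]` reports a port error for port 1 only, which points at that port's cable or sync state and not at the switch as a whole.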

General Switch Status LED

The switch status LED located on the rear panel indicates overall switch failures (Table 6-2).

Table 6-2 SCI Switch Rear Panel LED


Situation	LED Status
Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem	Red
Switch operational	Green

The `get_ci_status` Command

You can use the results of the get_ci_status command to troubleshoot clusters that have SCI switches. For example, for the configuration in Figure 6-2, if the get_ci_status command is used on interconn1, a typical output would be:

# /opt/SUNWsma/bin/get_ci_status
sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
sma: Switch_id# 0
sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
#

In this example, the line

sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational

indicates that the path between SCI switch 0, port 3 and interconn4 is inactive and not operational.
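
When several nodes have to be checked, the per-port lines can be scanned mechanically. The following is a minimal sketch that assumes the output keeps exactly the line format shown in the sample above; the function names and the regular expression are illustrative and would need adjusting if your release prints the lines differently.

```python
import re

# Sample output copied from the example above.
SAMPLE = """\
sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
sma: Switch_id# 0
sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
"""

# Matches only the per-port lines ("port_id# N: host_name = ..."), not the
# adapter summary line, where port_id# is followed by ";" instead of ":".
PORT_LINE = re.compile(
    r"port_id#\s*(?P<port>\d+):\s*host_name\s*=\s*(?P<host>[^;]+);\s*"
    r"adapter_id\s*=\s*(?P<adapter>\d+);\s*"
    r"(?P<active>\w+)\s*\|\s*(?P<oper>\w+)"
)

def parse_port_lines(output):
    """Return one dict per switch port line found in the command output."""
    ports = []
    for line in output.splitlines():
        match = PORT_LINE.search(line)
        if match:
            ports.append({
                "port": int(match.group("port")),
                "host": match.group("host").strip(),
                "adapter_id": int(match.group("adapter")),
                "active": match.group("active"),
                "operational": match.group("oper"),
            })
    return ports

def failed_ports(ports):
    """Return the ports that are not reported as active and operational."""
    return [
        p for p in ports
        if p["active"] != "active" or p["operational"] != "operational"
    ]
```

Applied to the sample, `failed_ports(parse_port_lines(SAMPLE))` yields a single entry, for port 3 and host interconn4.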

Figure 6-2 Typical Four-Node Configuration with an SCI Switch

In this instance, if the get_ci_status command is run on all four nodes and each one reports the same path between SCI switch 0, port 3 and interconn4 as inactive and inoperational, the fault is most likely in SCI switch 0, port 3, the cable, or the interconn4 host adapter.

However, if the get_ci_status command reports the path as inactive and inoperational on one node only, such as interconn1, then the fault is most likely in the interconn1 host adapter, the cable, or SCI switch 0, port 0.
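
The two cases above amount to a simple decision rule, sketched below. The data layout is an assumption made for the example: `reports` maps each node to the set of paths it reports as inactive, and `reporter_ports` maps each node to the switch and port its own adapter is attached to. Cases the guide does not describe (for example, two of four nodes reporting the path) are returned as inconclusive instead of being guessed at.

```python
def likely_faults(path, reports, reporter_ports):
    """Suggest components to inspect for a path reported as inactive.

    path           -- (switch_id, port_id, host) of the inactive path
    reports        -- dict: node name -> set of paths that node reports inactive
    reporter_ports -- dict: node name -> (switch_id, port_id) of that node's own link
    """
    switch_id, port_id, host = path
    reporters = sorted(node for node, bad in reports.items() if path in bad)

    # Every node sees the same dead path: suspect the far end of that path.
    if reporters and len(reporters) == len(reports):
        return [
            f"SCI switch {switch_id}, port {port_id}",
            "cable",
            f"{host} host adapter",
        ]

    # Only one node sees it: suspect that node's own connection to the switch.
    if len(reporters) == 1:
        node = reporters[0]
        own_switch, own_port = reporter_ports[node]
        return [
            f"{node} host adapter",
            "cable",
            f"SCI switch {own_switch}, port {own_port}",
        ]

    # Not covered by the guidance above.
    return []
```

With the configuration in Figure 6-2, a path `(0, 3, "interconn4")` reported by all four nodes returns the switch 0, port 3 side, while the same path reported by interconn1 alone returns the interconn1 adapter, its cable, and switch 0, port 0.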

Note that some aspects of the get_ci_status command output, such as host names, will vary according to your configuration.

Client Net Failure

System console messages will identify the specific port that has failed. Otherwise, for information on test commands as well as additional troubleshooting, refer to the documentation that came with your client network interface card.

Incorrect Software Configuration

Make sure that:

The working copy of the sm_config template file correctly matches the hardware configuration and cluster topology.

sm_config ran successfully on only one of the cluster nodes.

All nodes were rebooted after sm_config was executed.

Incorrect Firmware

If an SCI adapter card is loaded with the wrong firmware, the SCI cards will not be detected upon system power-on or reboot/reset.

Improper loading of the firmware can happen in two ways:

Old firmware programmed into new SBus2b cards
New firmware programmed into old SBus2 cards

If the proper firmware is loaded, a banner (containing the word FCode) is printed from each SCI card twice during power-on, reboot, or reset. No banner is printed at all for a card loaded with improper firmware.

The following are sample console messages (which are not saved in the message file):

One SCI card is working in the node:

rebooting...
Resetting ... 

DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017 
FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $ 

Executing SCI adapter selftest.    Adapter OK. 
screen not found.
Can't open input device.
Keyboard not present.  Using ttya for input and output.

DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017 
FCode 9029 $Revision: 2.3 $  - d9029_52 $Date: 1996/10/30 07:47:53 $ 

Executing SCI adapter selftest.    Adapter OK. 

Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard

No SCI cards are working in the node:

rebooting...
Resetting ... 
screen not found.
Can't open input device.
Keyboard not present.  Using ttya for input and output.

Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard

Note -

If SCI cards do not show up during boot time, check the physical installation of the cards. If reseating the cards does not correct the problem, the SCI cards may be damaged and should be returned.
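
Because these messages are not saved in the message file, checking them after the fact requires a console capture (for example, from a terminal server log). Assuming you have such a capture as text, and assuming the banner keeps the "Serial #" format shown in the sample above, the banners can be counted per card with a sketch like this; the function name is illustrative.

```python
import re
from collections import Counter

# Matches the serial number in banner lines such as
# "DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017".
SERIAL = re.compile(r"Serial #(\d+)")

def banner_counts(console_text):
    """Count how many times each card's banner appears in a console capture.

    A card with proper firmware is expected to appear twice per boot;
    a card with improper firmware prints no banner and is absent here.
    """
    return Counter(SERIAL.findall(console_text))
```

For the first sample above the result is one entry, serial 5017 with a count of 2. For the second sample the result is empty, which matches the case where no SCI cards are working.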

If you suspect that an SCI SBus interface card is loaded with the wrong firmware, perform the following steps to investigate:

With the system powered off, note the serial numbers of the adapter cards that are physically installed.

Turn the system power on.

Run /opt/SUNWsci/bin/sciadm and enter the identify command.

This command displays the firmware version, fcode version, and serial number of each adapter board found.

Compare the number of cards found by sciadm against the number of adapters physically installed.

Two cards should be displayed in the output. If not, there is at least one bad card in the system.

Compare the adapter board serial numbers from the output of the identify command to the serial number on each adapter card physically installed.

Note which serial number(s) are displayed. Cards that do not have their serial numbers displayed are bad and need replacement.
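
The comparison in the last two steps is a set difference. The sketch below assumes both lists of serial numbers are entered by hand, because the exact output format of the identify command is not shown here; the function name and the example serial numbers are hypothetical.

```python
def compare_serials(installed, detected):
    """Compare serial numbers noted from the cards with those sciadm reports.

    installed -- serial numbers read from the physically installed cards
    detected  -- serial numbers shown by the identify command
    Returns (missing, unexpected), each as a sorted list.
    """
    installed_set = set(installed)
    detected_set = set(detected)
    # Installed but not reported: candidates for replacement.
    missing = sorted(installed_set - detected_set)
    # Reported but not on your list: recheck the serial numbers you noted.
    unexpected = sorted(detected_set - installed_set)
    return missing, unexpected
```

For example, `compare_serials(["5017", "5020"], ["5017"])` returns `(["5020"], [])`, flagging the card with serial number 5020 as the one that was not detected.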