Perform the following checks to determine the physical state of various SCI subsystem components. Verify that:
All SCI scrubber jumpers are properly set, depending on the cluster topology.
All SCI cables are properly seated.
All SCI switches have power applied.
Clusters with three or four nodes can be connected through one or two SCI switches. The switch status LEDs provide information that can be used to troubleshoot SCI switch failures (Figure 6-1). Guidelines for interpreting these LEDs are provided in "Port Status LEDs" and "General Switch Status LED".
The four port status LEDs located on the switch front panel can be used to troubleshoot individual port failures (Table 6-1).
A switch port sync error can result from a cable being removed.
Situation | Port LED Status |
---|---|
No power | All four LEDs not lit |
Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem | All four LEDs red |
Port errors: SCI cable out, sync error | Associated port LED is red |
Port operative, no transactions | Associated port LED is green |
Port operative, with transactions | Associated port LED is blinking green |
The switch status LED located on the rear panel indicates overall switch failures (Table 6-2).
Table 6-2 SCI Switch Rear Panel LED
Situation | LED Status |
---|---|
Fatal switch errors: fatal hardware error, temperature too high, fan(s) not operative, power supply problem | Red |
Switch operational | Green |
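The two LED tables can be read as simple lookup rules. The following Python sketch encodes them for use in a checklist script. It is illustrative only: the state labels ("off", "red", "green", "blinking green") and the function names are assumptions made for this example, and the LED states must still be observed and entered by hand.

```python
# Meanings are transcribed from Tables 6-1 and 6-2. The state labels and
# function names are hypothetical, chosen only for this sketch.
FATAL = ("fatal switch error: hardware error, temperature too high, "
         "fan(s) not operative, or power supply problem")

PORT_LED_MEANINGS = {
    "red": "port error: SCI cable out or sync error",
    "green": "port operative, no transactions",
    "blinking green": "port operative, with transactions",
}

REAR_LED_MEANINGS = {
    "red": FATAL,
    "green": "switch operational",
}


def diagnose_port_leds(port_leds):
    """Interpret the four front-panel port LEDs (Table 6-1).

    port_leds is a list of four state labels, one per port (0 to 3).
    """
    if len(port_leds) != 4:
        raise ValueError("expected exactly four port LED states")
    # The whole-switch conditions are checked before per-port conditions.
    if all(state == "off" for state in port_leds):
        return ["no power"]
    if all(state == "red" for state in port_leds):
        return [FATAL]
    return [
        "port %d: %s" % (port, PORT_LED_MEANINGS.get(
            state, "state not listed in Table 6-1"))
        for port, state in enumerate(port_leds)
    ]


def diagnose_rear_led(state):
    """Interpret the rear-panel switch status LED (Table 6-2)."""
    return REAR_LED_MEANINGS.get(state, "state not listed in Table 6-2")


if __name__ == "__main__":
    for finding in diagnose_port_leds(
            ["green", "red", "blinking green", "green"]):
        print(finding)
    print(diagnose_rear_led("green"))
```

The whole-switch conditions are tested first because Table 6-1 assigns "all four LEDs red" to fatal switch errors rather than to four separate port errors.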
You can use the results of the get_ci_status command to troubleshoot clusters that have SCI switches. For example, for the configuration in Figure 6-2, if the get_ci_status command is used on interconn1, a typical output would be:
    # /opt/SUNWsma/bin/get_ci_status
    sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
    sma: Switch_id# 0
    sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
    sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
    sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
    #
In this example, the line
    sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
indicates that the path between SCI switch 0, port 3 and interconn4 is inactive and not operational.
In this instance, if the get_ci_status command is run on all four nodes and the same path between SCI switch 0, port 3 and interconn4 is reported as inactive and inoperative each time, the fault is most likely in SCI switch 0, port 3, the cable, or the interconn4 host adapter.
However, if the get_ci_status command indicates that the same path is inactive and inoperative for one node only, such as interconn1, the fault is most likely in the interconn1 host adapter, the cable, or SCI switch 0, port 0.
Note that some aspects of the get_ci_status command output, such as host names, will vary according to your configuration.
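When output from several nodes has to be compared, it can help to extract the port lines programmatically. The following Python sketch parses text in the layout of the sample above and lists the paths that are not reported as active and operational. The line format is inferred from that one sample only, so the regular expression is an assumption and may need adjusting for other configurations. The function names are hypothetical.

```python
import re
from collections import Counter

# Sample text copied from the get_ci_status example for interconn1.
SAMPLE = """\
sma: sci #0: sbus_slot# 1; adapter_id 8 (0x08); ip_address 1; switch_id# 0; port_id# 0; Adapter Status - UP; Link Status - UP
sma: Switch_id# 0
sma: port_id# 1: host_name = interconn2; adapter_id = 72; active | operational
sma: port_id# 2: host_name = interconn3; adapter_id = 136; active | operational
sma: port_id# 3: host_name = interconn4; adapter_id = 200;inactive|inoperational
"""

# Assumed layout of a switch port line, based on the sample only.
PORT_LINE = re.compile(
    r"port_id#\s*(\d+):\s*host_name\s*=\s*([^;\s]+)\s*;"
    r"\s*adapter_id\s*=\s*(\d+)\s*;\s*(\w+)\s*\|\s*(\w+)"
)


def parse_ports(output):
    """Return one dict per switch port line found in the output."""
    ports = []
    for line in output.splitlines():
        match = PORT_LINE.search(line)
        if match:
            port, host, adapter, link, state = match.groups()
            ports.append({
                "port": int(port),
                "host": host,
                "adapter_id": int(adapter),
                "link": link,
                "state": state,
            })
    return ports


def failed_paths(output):
    """Return the ports not reported as active and operational."""
    return [p for p in parse_ports(output)
            if (p["link"], p["state"]) != ("active", "operational")]


def failure_counts(outputs_by_node):
    """Count how many nodes report each (port, host) path as failed."""
    counts = Counter()
    for output in outputs_by_node.values():
        for p in failed_paths(output):
            counts[(p["port"], p["host"])] += 1
    return counts


if __name__ == "__main__":
    for p in failed_paths(SAMPLE):
        print("port %d to %s: %s | %s"
              % (p["port"], p["host"], p["link"], p["state"]))
```

A path that is counted by every node that lists it points toward the far end (switch port, cable, or remote host adapter). A path flagged by a single node points toward that node's own adapter, cable, or switch port, as described above.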
System console messages will identify the specific port that has failed. Otherwise, for information on test commands as well as additional troubleshooting, refer to the documentation that came with your client network interface card.
Make sure that:
The working copy of the sm_config template file correctly matches the hardware configuration and cluster topology.
sm_config ran successfully on only one of the cluster nodes.
All nodes were rebooted after sm_config was executed.
If an SCI adapter card is loaded with the wrong firmware, the card will not be detected upon system power-on, reboot, or reset.
Improper loading of the firmware can happen in two ways:
Old firmware programmed into new SBus2b cards
New firmware programmed into old SBus2 cards
If the proper firmware is loaded, each SCI card prints a banner (containing the word FCode) twice during power-on, reboot, or reset. A card loaded with improper firmware prints no banner at all.
The following are sample console messages (which are not saved in the message file):
One SCI card is working in the node:
    rebooting...
    Resetting ...
    DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
    FCode 9029 $Revision: 2.3 $ - d9029_52 $Date: 1996/10/30 07:47:53 $
    Executing SCI adapter selftest. Adapter OK.
    screen not found.
    Can't open input device.
    Keyboard not present. Using ttya for input and output.
    DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
    FCode 9029 $Revision: 2.3 $ - d9029_52 $Date: 1996/10/30 07:47:53 $
    Executing SCI adapter selftest. Adapter OK.
    Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard
No SCI cards are working in the node:
    rebooting...
    Resetting ...
    screen not found.
    Can't open input device.
    Keyboard not present. Using ttya for input and output.
    Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard
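Because these messages are not saved in the message file, they must be captured from the console itself, for example by logging the serial console session. Given such a capture, the banner rule can be checked mechanically. The following Python sketch counts FCode banner lines and divides by two, since each properly programmed card prints its banner twice. The capture method and the function name are assumptions made for this example.

```python
# Console text copied from the two sample boots shown above.
ONE_CARD = """\
rebooting...
Resetting ...
DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
FCode 9029 $Revision: 2.3 $ - d9029_52 $Date: 1996/10/30 07:47:53 $
Executing SCI adapter selftest. Adapter OK.
screen not found.
Can't open input device.
Keyboard not present. Using ttya for input and output.
DOLPHIN SBus-to-SCI (SBus2b) Adapter - 9029, Serial #5017
FCode 9029 $Revision: 2.3 $ - d9029_52 $Date: 1996/10/30 07:47:53 $
Executing SCI adapter selftest. Adapter OK.
Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard
"""

NO_CARDS = """\
rebooting...
Resetting ...
screen not found.
Can't open input device.
Keyboard not present. Using ttya for input and output.
Sun Ultra 1 SBus (UltraSPARC 167MHz), No Keyboard
"""


def count_sci_cards(console_text):
    """Estimate how many SCI cards printed a banner during one boot.

    Each card with proper firmware prints a line containing "FCode"
    twice, so the card count is half the number of such lines.
    """
    banners = sum(1 for line in console_text.splitlines()
                  if "FCode" in line)
    cards, leftover = divmod(banners, 2)
    if leftover:
        # An odd count suggests the capture is incomplete or covers
        # more than one boot, so no estimate is returned.
        raise ValueError("odd number of FCode banners: %d" % banners)
    return cards


if __name__ == "__main__":
    print(count_sci_cards(ONE_CARD))
    print(count_sci_cards(NO_CARDS))
```

If the result is lower than the number of cards physically installed, continue with the physical checks and firmware investigation described in this section.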
If SCI cards do not show up during boot time, check the physical installation of the cards. If reseating the cards does not correct the problem, the SCI cards may be damaged and should be returned.
If you suspect that an SCI SBus interface card is loaded with the wrong firmware, perform the following steps to investigate:
With the system powered off, note the serial numbers of the adapter cards that are physically installed.
Turn the system power on.
Run /opt/SUNWsci/bin/sciadm and enter the identify command.
This command displays the firmware version, FCode version, and serial number of each adapter board found.
Compare the number of cards found by sciadm against the number of adapters physically installed.
Two cards should be displayed in the output. If not, there is at least one bad card in the system.
Compare the adapter board serial numbers in the output of the identify command with the serial number on each physically installed adapter card.
Note which serial numbers are displayed. Cards whose serial numbers are not displayed are bad and need to be replaced.
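The comparison in the last two steps amounts to a set difference. The following Python sketch takes the serial numbers noted from the installed cards and the serial numbers read from the identify output, both entered by hand, and reports the cards that were not detected. It does not parse sciadm output, whose exact format is not shown here. The function names are hypothetical, and serial number 5021 in the usage example is invented for illustration.

```python
def undetected_cards(installed_serials, reported_serials):
    """Return serials noted on installed cards but not reported.

    Per the procedure above, these cards are bad and need replacement.
    """
    return sorted(set(installed_serials) - set(reported_serials))


def unexpected_serials(installed_serials, reported_serials):
    """Return reported serials that match no installed card.

    A non-empty result usually means a serial number was misread
    or mistyped and should be checked again.
    """
    return sorted(set(reported_serials) - set(installed_serials))


if __name__ == "__main__":
    installed = ["5017", "5021"]   # noted with the system powered off
    reported = ["5017"]            # read from the identify output
    print(undetected_cards(installed, reported))
    print(unexpected_serials(installed, reported))
```

Serial numbers are handled as strings so that leading zeros or letters, if present on a card label, are compared exactly as written.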