CHAPTER 10

Troubleshooting

This chapter provides troubleshooting information for a system administrator. The chapter describes the following topics:


Device Mapping

The physical address represents a physical characteristic that is unique to the device. Examples of physical addresses include the bus address and the slot number. The slot number indicates where the device is installed.

You reference a physical device by the node identifier--agent ID (AID). The AID ranges from 0 to 31 in decimal notation (0 to 1f in hexadecimal). In the device path beginning with ssm@0,0 the first number, 0, is the node ID.

CPU/Memory Mapping

CPU/Memory board and memory agent IDs (AIDs) range from 0 to 23 in decimal notation (0 to 17 in hexadecimal). The system can have up to three CPU/Memory boards.

Each CPU/Memory board has up to four CPUs, depending on your configuration. Each CPU/Memory board has up to four banks of memory. Each bank of memory is controlled by one memory management unit (MMU), which is the CPU. The following code example shows a device tree entry for a CPU and its associated memory:


/ssm@0,0/SUNW/UltraSPARC-III@b,0
/ssm@0,0/SUNW/memory-controller@b,400000

where:

in b,0

in b,400000

There are up to four CPUs on each CPU/Memory board (TABLE 10-1):

IB_SSC Assembly Mapping

TABLE 10-2 lists the types of I/O assembly, the number of slots each I/O assembly has, and the systems the I/O assembly types are supported on.


TABLE 10-2 I/O Assembly Type and Number of Slots

I/O Assembly Type   Number of Slots Per I/O Assembly
-----------------   --------------------------------
PCI                 6


TABLE 10-3 lists the number of I/O assemblies per system and the I/O assembly name.


TABLE 10-3 Number and Name of I/O Assemblies per System

Number of I/O Assemblies   I/O Assembly Name
------------------------   -----------------
1                          IB6


Each I/O assembly hosts two I/O controllers:

When mapping the I/O device tree entry to a physical component in the system, you must consider up to five nodes in the device tree:

TABLE 10-4 lists the AIDs for the two I/O controllers in each I/O assembly.


TABLE 10-4 I/O Controller Agent ID Assignments

Slot Number   I/O Assembly Name   Even I/O Controller AID   Odd I/O Controller AID
-----------   -----------------   -----------------------   ----------------------
6             IB6                 24 (18)                   25 (19)

The first number in the column is a decimal number. The number (or a number and letter combination) in parentheses is in hexadecimal notation.
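The AID ranges given so far can be expressed as a small lookup. The following Python sketch is illustrative only: the function name classify_aid is hypothetical, and it assumes that the first field after the @ sign in a device path node (for example, 19 in pci@19,700000) is the AID in hexadecimal, which is consistent with the values in TABLE 10-4.

```python
from typing import Tuple


def classify_aid(aid_hex: str) -> Tuple[int, str]:
    """Convert a hexadecimal AID to decimal and name the range it falls in.

    Ranges are taken from this chapter: 0-23 for CPU/Memory boards,
    24 and 25 for the two I/O controllers of IB6 (TABLE 10-4).
    """
    aid = int(aid_hex, 16)
    if 0 <= aid <= 23:
        return aid, "CPU/Memory"
    if aid == 24:
        return aid, "IB6 even I/O controller"
    if aid == 25:
        return aid, "IB6 odd I/O controller"
    if 26 <= aid <= 31:
        # Valid AID (0-31), but no assignment is listed in this chapter.
        return aid, "not assigned in this chapter"
    raise ValueError(f"AID out of range 0-1f: {aid_hex!r}")


print(classify_aid("b"))   # (11, 'CPU/Memory')
print(classify_aid("19"))  # (25, 'IB6 odd I/O controller')
```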


The I/O controller has two bus sides: A and B.

The board slots located in the I/O assembly are referenced by the device number.

This section describes the PCI I/O assembly slot assignments and provides an example of the device path.

The following code example gives a breakdown of a device tree entry for a SCSI disk:


/ssm@0,0/pci@19,700000/pci@3/SUNW,isptwo@4/sd@5,0



Note - The numbers in the device path are hexadecimal.



where:

in 19,700000

in pci@3

isptwo is the SCSI host adapter

in sd@5,0
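To make the structure of such an entry easier to inspect, the path can be split into its nodes. The following Python sketch is an illustration, not a Sun-supplied tool: the names Node and parse_device_path are hypothetical, and the only assumption is the one stated in the note above, that every number in the path is hexadecimal.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str                 # text before the @ sign, for example "pci"
    address: Tuple[int, ...]  # comma-separated fields after @, parsed as hex


def parse_device_path(path: str) -> List[Node]:
    """Split a device tree path into (name, address) nodes."""
    nodes = []
    for part in path.strip("/").split("/"):
        # Split at the first @ only; names such as "SUNW,isptwo" contain commas.
        name, _, unit = part.partition("@")
        fields = tuple(int(f, 16) for f in unit.split(",")) if unit else ()
        nodes.append(Node(name, fields))
    return nodes


for node in parse_device_path(
    "/ssm@0,0/pci@19,700000/pci@3/SUNW,isptwo@4/sd@5,0"
):
    print(node.name, [hex(f) for f in node.address])
```

The example path yields five nodes, which matches the maximum of five nodes mentioned earlier in this section.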


TABLE 10-5 lists, in hexadecimal notation, the slot number, I/O assembly name, device path of each I/O assembly, the I/O controller number, and the bus.


TABLE 10-5 IB_SSC Assembly PCI Device Mapping

I/O Assembly Name   Device Path                  Physical Slot Number   I/O Controller Number   Bus
-----------------   --------------------------   --------------------   ---------------------   ---
IB6                 /ssm@0,0/pci@18,700000/*@1   0                      0                       B
                    /ssm@0,0/pci@18,700000/*@2   1                      0                       B
                    /ssm@0,0/pci@18,700000/*@3   x                      0                       B
                    /ssm@0,0/pci@18,600000/*@1   5                      0                       A
                    /ssm@0,0/pci@18,600000/*@2   w                      0                       A
                    /ssm@0,0/pci@19,700000/*@1   2                      1                       B
                    /ssm@0,0/pci@19,700000/*@2   3                      1                       B
                    /ssm@0,0/pci@19,700000/*@3   4                      1                       B
                    /ssm@0,0/pci@19,600000/*@1   y                      1                       A
                    /ssm@0,0/pci@19,600000/*@2   z                      1                       A


where:

w = onboard LSI1010R SCSI controller

x = onboard CMD646U2 EIDE controller

y = onboard Gigaswift Ethernet controller 0

z = onboard Gigaswift Ethernet controller 1

and * is dependent upon the type of PCI card installed in the slot.
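The mapping in TABLE 10-5 can be captured as a lookup keyed on the I/O controller AID, the bus offset, and the device number. The following Python sketch is a hedged illustration: the names IB6_PCI_MAP and locate are hypothetical, the table contents are copied from TABLE 10-5, and the pattern assumes node 0 (paths beginning with /ssm@0,0).

```python
import re
from typing import Optional, Tuple

# (controller AID, bus offset, device number) -> (slot or onboard device,
# I/O controller number, bus). All keys are hexadecimal strings.
IB6_PCI_MAP = {
    ("18", "700000", "1"): ("0", 0, "B"),
    ("18", "700000", "2"): ("1", 0, "B"),
    ("18", "700000", "3"): ("onboard CMD646U2 EIDE controller", 0, "B"),
    ("18", "600000", "1"): ("5", 0, "A"),
    ("18", "600000", "2"): ("onboard LSI1010R SCSI controller", 0, "A"),
    ("19", "700000", "1"): ("2", 1, "B"),
    ("19", "700000", "2"): ("3", 1, "B"),
    ("19", "700000", "3"): ("4", 1, "B"),
    ("19", "600000", "1"): ("onboard Gigaswift Ethernet controller 0", 1, "A"),
    ("19", "600000", "2"): ("onboard Gigaswift Ethernet controller 1", 1, "A"),
}

# Second node gives AID and bus offset; third node gives the device number.
_PATH = re.compile(r"^/ssm@0,0/pci@([0-9a-f]+),([0-9a-f]+)/[^/@]+@([0-9a-f]+)")


def locate(path: str) -> Optional[Tuple[str, int, str]]:
    """Return (slot, controller number, bus) for a device path, or None."""
    match = _PATH.match(path)
    if match is None:
        return None
    return IB6_PCI_MAP.get(match.groups())


print(locate("/ssm@0,0/pci@19,700000/pci@3/SUNW,isptwo@4/sd@5,0"))
# ('4', 1, 'B')
```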

Note the following:


FIGURE 10-1 Sun Fire Entry-Level Midrange Systems IB_SSC PCI Physical Slot Designations for IB6

Graphic showing PCI physical slot designations for IB6.


where * is dependent upon the type of PCI card installed in the slot.

For instance:

These would generate device paths as follows:


/ssm@0,0/pci@19,700000/scsi@3,1
/ssm@0,0/pci@19,700000/scsi@3,1 (scsi-2)
/ssm@0,0/pci@19,700000/scsi@3,1/tape (byte)
/ssm@0,0/pci@19,700000/scsi@3,1/disk (block)
/ssm@0,0/pci@19,700000/scsi@3 (scsi-2)
/ssm@0,0/pci@19,700000/scsi@3/tape (byte)
/ssm@0,0/pci@19,700000/scsi@3/disk (block)
 
/ssm@0,0/pci@19,700000/SUNW,qlc@2 (scsi-fcp)
/ssm@0,0/pci@19,700000/SUNW,qlc@2/fp@0,0 (fp)
/ssm@0,0/pci@19,700000/SUNW,qlc@2/fp@0,0/disk (block)
 
/ssm@0,0/pci@19,700000/SUNW,qlc@1 (scsi-fcp)
/ssm@0,0/pci@19,700000/SUNW,qlc@1/fp@0,0 (fp)
/ssm@0,0/pci@19,700000/SUNW,qlc@1/fp@0,0/disk (block)


System Faults


A system fault is any condition that is considered to be unacceptable for normal system operation. When the system has a fault, the Fault LED turns on. The system indicators are shown in FIGURE 10-2.

FIGURE 10-2 System Indicators

Graphic of system indicator board showing locations of indicator LEDs and the On/Standby switch.


The indicator states are shown in TABLE 10-6. You must take immediate action to eliminate a system fault.


TABLE 10-6 System Fault Indicator States

FRU name                   Fault indicator lit      System Fault indicator   Top Access lit    Comments
                           when fault detected[1]   lit on FRU fault[1]      on FRU fault[2]
-------------------------  -----------------------  -----------------------  ----------------  --------
System Board               Yes                      Yes                      Yes               Includes processors, Ecache and DIMMs
Level 2 repeater           Yes                      Yes                      Yes
IB_SSC                     Yes                      Yes                      Yes
System Controller          No                       Yes                      Yes               IB_SSC fault LED lit
Fan                        Yes                      Yes                      Yes               IB Fan fault LED lit
Power Supply               Yes (by hardware)        Yes                      No                All power supply indicators are lit by the
                                                                                               power supply hardware. There is also a
                                                                                               predicted fault indicator. Power supply
                                                                                               EEPROM errors do not cause degraded state
                                                                                               as there is no indicator control.
Power distribution board   No                       Yes                      Yes               Can only be degraded.
Baseplane                  No                       Yes                      Yes               Can only be degraded.
System indicator board     No                       Yes                      Yes               Can only be degraded.
System configuration card  No                       Yes                      No
Fan tray                   Yes                      Yes                      No
Main fan                   Yes                      Yes                      No
Media bay                  No                       Yes                      Yes
Disk                       Yes                      Yes                      No


Customer Replaceable Units

The following topics describe the field replaceable units, by system.

Sun Fire E2900 System

You can deal with faults on the following FRUs:

If a fault is indicated on any other FRU, or if a blacklisted FRU listed above must be physically replaced, call Sun Service.

Sun Fire V1280 System

You can deal with faults on the following FRUs:

If a fault is indicated on any other FRU, or if a blacklisted FRU listed above must be physically replaced, call Sun Service.

Netra 1280 System

You can deal with faults on the following FRUs:



Note - Only suitably trained personnel or Sun Service are permitted to enter the Restricted Access Location to hot-swap PSUs or hard disk drives.



If a fault is indicated on any other FRU, or if a blacklisted FRU listed above must be physically replaced, call Sun Service.

Manual Blacklisting (While Waiting for Repair)

The SC supports the blacklisting feature, which allows you to disable components on a board (TABLE 10-7).

Blacklisting provides a list of system board components that will not be tested and will not be configured into the Solaris Operating System. The blacklist is stored in nonvolatile memory.


TABLE 10-7 Blacklisting Component Names

System Component      Component Subsystem                          Component Name
-------------------   ------------------------------------------   ------------------------------------
CPU system                                                         slot/port/physical-bank/logical-bank
                      CPU/Memory boards (slot)                     SB0, SB2, SB4
                      Ports on the CPU/Memory board                P0, P1, P2, P3
                      Physical memory banks on CPU/Memory boards   B0, B1
                      Logical banks on CPU/Memory boards           L0, L1, L2, L3
I/O assembly system                                                slot/port/bus or slot/card
                      I/O assembly                                 IB6
                      Ports on the I/O assembly                    P0, P1
                      Buses on the I/O assembly                    B0, B1
                      I/O cards in the I/O assemblies              C0, C1, C2, C3, C4, C5
Repeater system                                                    <slot>
                      Repeater board                               RP0, RP2
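The naming scheme in TABLE 10-7 is regular enough to check mechanically before a name is typed at the lom> prompt. The following Python sketch is illustrative: the function name is_valid_component is hypothetical, and it assumes that a name can stop at any level of the hierarchy (for example SB0 or SB0/P1) and that names are written in uppercase as shown in the table. The SC itself remains the authority on which names it accepts.

```python
import re

# One pattern per system component row group in TABLE 10-7.
_COMPONENT_PATTERNS = [
    # CPU system: slot/port/physical-bank/logical-bank
    re.compile(r"SB[024](/P[0-3](/B[01](/L[0-3])?)?)?"),
    # I/O assembly system: slot/port/bus or slot/card
    re.compile(r"IB6(/P[01](/B[01])?|/C[0-5])?"),
    # Repeater system: <slot>
    re.compile(r"RP[02]"),
]


def is_valid_component(name: str) -> bool:
    """Return True if name follows one of the TABLE 10-7 naming patterns."""
    return any(p.fullmatch(name) for p in _COMPONENT_PATTERNS)


for candidate in ("SB0/P1/B0/L2", "IB6/C3", "RP2", "SB1", "IB6/P2"):
    print(candidate, is_valid_component(candidate))
```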


Blacklist a component or device if you believe it might be failing intermittently or is failing. Troubleshoot a device you believe is having problems.

There are two system controller commands for blacklisting:



Note - The enablecomponent and disablecomponent commands have been replaced by the setls command. These commands were formerly used to manage component resources. While the enablecomponent and disablecomponent commands are still available, it is suggested that you use the setls command to control the configuration of components into or out of the system.



The setls command updates only the blacklist. It does not directly affect the state of the currently configured system boards.

The updated lists take effect when you do one of the following:

In order to use setls on the Repeater boards (RP0/RP2), the system first has to be shut down to Standby using the poweroff command.

When the setls command is issued for a Repeater board (RP0/RP2), the SC will be automatically reset to make use of the new settings.

If a replacement Repeater board is inserted, it is necessary to manually reset the SC using the resetsc command. See the Sun Fire Entry-Level Midrange System Controller Command Reference Manual for a description of this command.

Special Considerations for CPU/Memory Boards

In the unlikely event that a CPU/Memory board fails the interconnect test during POST, a message similar to the following appears in POST output:


Jul 15 15:58:12 noname lom: SB0/ar0 Bit in error P3_ADDR [2]  
Jul 15 15:58:12 noname lom: SB0/ar0 Bit in error P3_ADDR [1]  
Jul 15 15:58:12 noname lom: SB0/ar0 Bit in error P3_ADDR [0]  
Jul 15 15:58:12 noname lom: AR Interconnect test: System board SB0/ar0 address repeater connections to system board RP2/ar0 failed
Jul 15 15:58:13 noname lom: SB0/ar0 Bit in error P3_INCOMING [0]  
Jul 15 15:58:17 noname lom: SB0/ar0 Bit in error P3_PREREQ [0]  
Jul 15 15:58:17 noname lom: SB0/ar0 Bit in error P3_ADDR [18]  
Jul 15 15:58:17 noname lom: SB0/ar0 Bit in error P3_ADDR [17] 
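When a captured POST log is long, it can help to summarize which boards report bit errors. The following Python sketch is an illustration only: the pattern is inferred from the sample output above rather than from a documented log format, and the name boards_with_bit_errors is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as: "... lom: SB0/ar0 Bit in error P3_ADDR [2]"
BIT_ERROR = re.compile(
    r"lom: (?P<board>\w+)/(?P<unit>\w+) Bit in error "
    r"(?P<signal>\w+) \[(?P<bit>\d+)\]"
)


def boards_with_bit_errors(lines: Iterable[str]) -> Counter:
    """Count 'Bit in error' messages per board name."""
    counts = Counter()
    for line in lines:
        match = BIT_ERROR.search(line)
        if match:
            counts[match["board"]] += 1
    return counts


sample = [
    "Jul 15 15:58:12 noname lom: SB0/ar0 Bit in error P3_ADDR [2]  ",
    "Jul 15 15:58:12 noname lom: SB0/ar0 Bit in error P3_ADDR [1]  ",
    "Jul 15 15:58:12 noname lom: AR Interconnect test: System board SB0/ar0 "
    "address repeater connections to system board RP2/ar0 failed",
    "Jul 15 15:58:13 noname lom: SB0/ar0 Bit in error P3_INCOMING [0]  ",
]
print(boards_with_bit_errors(sample))  # Counter({'SB0': 3})
```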

A CPU/Memory board failing the interconnect test might prevent the poweron command from completely powering on the system. The system then drops back to the lom> prompt.

As a provisional measure, before service intervention is obtained, the faulty CPU/Memory board can be isolated from the system using the following sequence of commands at the SC lom> prompt:


lom>disablecomponent SBx
.
.
lom>poweroff
.
.
lom>resetsc -y

A subsequent poweron command should now be successful.


Recovering a Hung System

If you cannot log into the Solaris Operating System, and typing the break command from the LOM shell did not force control of the system back to the OpenBoot PROM ok prompt, then the system has stopped responding.

In some circumstances, the host watchdog detects that the Solaris Operating System has stopped responding and automatically resets the system, provided that the watchdog has not been disabled (using the setupsc command).

Also, you can issue the reset command (default option is -x which causes an XIR to be sent to the processors) from the lom> prompt. The reset command causes the Solaris Operating System to be terminated.




Caution - When the Solaris Operating System is terminated, data in memory might not be flushed to disk. This could cause a loss or corruption of the application file system data. Before the Solaris Operating System is terminated, this action requires confirmation from you.




To Recover a Hung System Manually

1. Complete the steps in Assisting Sun Service Personnel in Determining Causes of Failure.

2. Access the LOM shell.

See Chapter 3.

3. Type the reset command to force control of the system back to the OpenBoot PROM.

The reset command sends an externally initiated reset (XIR) to the system and collects data for debugging the hardware.


lom>reset



Note - An error is displayed if the setsecure command has been used to set the system into secure mode. You cannot use the reset or break commands while the system is in secure mode. See the Sun Fire Entry-Level Midrange System Controller Command Reference Manual for more details.



4. This step depends on the setting of the OpenBoot PROM error-reset-recovery configuration variable.

5. If the previous actions fail to reboot the system, use the poweroff and poweron commands to power cycle the system.

To power off the system, type:


lom>poweroff

To power on the system, type:


lom>poweron

Moving System Identity

You might decide that the simplest way to restore service is to use a complete replacement system. In order to facilitate the rapid transfer of system identity and critical settings from one system to its replacement, the System Configuration Card (SCC) can be physically removed from the SCC Reader (SCCR) of the faulty system and inserted into the SCCR of the replacement system.

The following information is stored on the System Configuration Card (SCC):


Temperature

One indication of problems might be overtemperature of one or more components. Use the showenvironment command to list current status.


TABLE 10-8 Checking Temperature Conditions Using the showenvironment Command

lom>showenvironment
 
Slot Device    Sensor    Value  Units     Age     Status
---- --------- --------- ------ --------- ------- ------
SSC1 SBBC 0    Temp. 0    34    Degrees C   1 sec OK
SSC1 CBH 0     Temp. 0    41    Degrees C   1 sec OK
SSC1 Board 0   Temp. 0    22    Degrees C   1 sec OK
SSC1 Board 0   Temp. 1    22    Degrees C   1 sec OK
SSC1 Board 0   Temp. 2    28    Degrees C   1 sec OK
SSC1 Board 0   1.5 VDC 0   1.49 Volts DC    1 sec OK
SSC1 Board 0   3.3 VDC 0   3.35 Volts DC    1 sec OK
SSC1 Board 0   5 VDC 0     4.98 Volts DC    1 sec OK
/N0/PS0 Input 0   Volt. 0        - -           1 sec OK
/N0/PS0 48 VDC 0  Volt. 0    48.00 Volts DC    1 sec OK
/N0/PS1 Input 0   Volt. 0        - -           5 sec OK
/N0/PS1 48 VDC 0  Volt. 0    48.00 Volts DC    5 sec OK
/N0/FT0 Fan 0     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 1     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 2     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 3     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 4     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 5     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 6     Cooling 0   Auto             5 sec OK
/N0/FT0 Fan 7     Cooling 0   Auto             5 sec OK
/N0/RP0 Board 0   1.5 VDC 0   1.49 Volts DC    5 sec OK
/N0/RP0 Board 0   3.3 VDC 0   3.37 Volts DC    5 sec OK
/N0/RP0 Board 0   Temp. 0    20    Degrees C   5 sec OK
/N0/RP0 Board 0   Temp. 1    19    Degrees C   5 sec OK
/N0/RP0 SDC 0     Temp. 0    55    Degrees C   5 sec OK
/N0/RP0 AR 0      Temp. 0    45    Degrees C   5 sec OK
/N0/RP0 DX 0      Temp. 0    57    Degrees C   5 sec OK
/N0/RP0 DX 1      Temp. 0    59    Degrees C   5 sec OK
/N0/RP2 Board 0   1.5 VDC 0   1.48 Volts DC    5 sec OK
/N0/RP2 Board 0   3.3 VDC 0   3.37 Volts DC    5 sec OK
/N0/RP2 Board 0   Temp. 0    22    Degrees C   5 sec OK
/N0/RP2 Board 0   Temp. 1    22    Degrees C   5 sec OK
/N0/RP2 SDC 0     Temp. 0    53    Degrees C   5 sec OK
/N0/RP2 AR 0      Temp. 0    43    Degrees C   5 sec OK
/N0/RP2 DX 0      Temp. 0    49    Degrees C   5 sec OK
/N0/RP2 DX 1      Temp. 0    52    Degrees C   5 sec OK
/N0/SB0 Board 0   1.5 VDC 0   1.51 Volts DC    5 sec OK
/N0/SB0 Board 0   3.3 VDC 0   3.29 Volts DC    5 sec OK
/N0/SB0 SDC 0     Temp. 0    46    Degrees C   5 sec OK
/N0/SB0 AR 0      Temp. 0    39    Degrees C   5 sec OK
/N0/SB0 DX 0      Temp. 0    45    Degrees C   5 sec OK
/N0/SB0 DX 1      Temp. 0    49    Degrees C   5 sec OK
/N0/SB0 DX 2      Temp. 0    53    Degrees C   5 sec OK
/N0/SB0 DX 3      Temp. 0    48    Degrees C   5 sec OK
/N0/SB0 SBBC 0    Temp. 0    49    Degrees C   5 sec OK
/N0/SB0 Board 1   Temp. 0    24    Degrees C   5 sec OK
/N0/SB0 Board 1   Temp. 1    24    Degrees C   6 sec OK
/N0/SB0 CPU 0     Temp. 0    47    Degrees C   6 sec OK
/N0/SB0 CPU 0     1.8 VDC 0   1.72 Volts DC    6 sec OK
/N0/SB0 CPU 1     Temp. 0    47    Degrees C   6 sec OK
/N0/SB0 CPU 1     1.8 VDC 1   1.72 Volts DC    6 sec OK
/N0/SB0 SBBC 1    Temp. 0    37    Degrees C   6 sec OK
/N0/SB0 Board 1   Temp. 2    24    Degrees C   6 sec OK
/N0/SB0 Board 1   Temp. 3    24    Degrees C   6 sec OK
/N0/SB0 CPU 2     Temp. 0    49    Degrees C   6 sec OK
/N0/SB0 CPU 2     1.8 VDC 0   1.71 Volts DC    6 sec OK
/N0/SB0 CPU 3     Temp. 0    46    Degrees C   6 sec OK
/N0/SB0 CPU 3     1.8 VDC 1   1.72 Volts DC    7 sec OK
/N0/SB2 Board 0   1.5 VDC 0   1.51 Volts DC    6 sec OK
/N0/SB2 Board 0   3.3 VDC 0   3.29 Volts DC    6 sec OK
/N0/SB2 SDC 0     Temp. 0    55    Degrees C   6 sec OK
/N0/SB2 AR 0      Temp. 0    37    Degrees C   6 sec OK
/N0/SB2 DX 0      Temp. 0    47    Degrees C   6 sec OK
/N0/SB2 DX 1      Temp. 0    50    Degrees C   6 sec OK
/N0/SB2 DX 2      Temp. 0    53    Degrees C   6 sec OK
/N0/SB2 DX 3      Temp. 0    47    Degrees C   6 sec OK
/N0/SB2 SBBC 0    Temp. 0    48    Degrees C   6 sec OK
/N0/SB2 Board 1   Temp. 0    23    Degrees C   7 sec OK
/N0/SB2 Board 1   Temp. 1    24    Degrees C   7 sec OK
/N0/SB2 CPU 0     Temp. 0    45    Degrees C   7 sec OK
/N0/SB2 CPU 0     1.8 VDC 0   1.72 Volts DC    7 sec OK
/N0/SB2 CPU 1     Temp. 0    46    Degrees C   7 sec OK
/N0/SB2 CPU 1     1.8 VDC 1   1.73 Volts DC    7 sec OK        
/N0/SB2 SBBC 1    Temp. 0    37    Degrees C   7 sec OK
/N0/SB2 Board 1   Temp. 2    24    Degrees C   7 sec OK
/N0/SB2 Board 1   Temp. 3    25    Degrees C   7 sec OK
/N0/SB2 CPU 2     Temp. 0    47    Degrees C   7 sec OK
/N0/SB2 CPU 2     1.8 VDC 0   1.71 Volts DC    7 sec OK
/N0/SB2 CPU 3     Temp. 0    45    Degrees C   7 sec OK
/N0/SB2 CPU 3     1.8 VDC 1   1.71 Volts DC    7 sec OK
/N0/IB6 Board 0   1.5 VDC 0   1.50 Volts DC    7 sec OK
/N0/IB6 Board 0   3.3 VDC 0   3.35 Volts DC    7 sec OK
/N0/IB6 Board 0   5 VDC 0     4.95 Volts DC    7 sec OK
/N0/IB6 Board 0   12 VDC 0   11.95 Volts DC    7 sec OK
/N0/IB6 Board 0   Temp. 0    29    Degrees C   7 sec OK
/N0/IB6 Board 0   Temp. 1    28    Degrees C   7 sec OK
/N0/IB6 Board 0   3.3 VDC 1   3.30 Volts DC    7 sec OK
/N0/IB6 Board 0   3.3 VDC 2   3.28 Volts DC    7 sec OK
/N0/IB6 Board 0   1.8 VDC 0   1.81 Volts DC    7 sec OK
/N0/IB6 Board 0   2.5 VDC 0   2.51 Volts DC    7 sec OK
/N0/IB6 Fan 0     Cooling 0   High             7 sec OK
/N0/IB6 Fan 1     Cooling 0   High             7 sec OK
/N0/IB6 SDC 0     Temp. 0    63    Degrees C   7 sec OK
/N0/IB6 AR 0      Temp. 0    77    Degrees C   7 sec OK
/N0/IB6 DX 0      Temp. 0    69    Degrees C   7 sec OK
/N0/IB6 DX 1      Temp. 0    73    Degrees C   8 sec OK
/N0/IB6 SBBC 0    Temp. 0    51    Degrees C   8 sec OK
/N0/IB6 IOASIC 0  Temp. 0    46    Degrees C   8 sec OK
/N0/IB6 IOASIC 1  Temp. 1    52    Degrees C   8 sec OK
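With this many sensors, a saved copy of the output is easier to review if the temperature rows are filtered. The following Python sketch is illustrative: the row pattern is inferred from the sample output above, the function name flag_temperatures is hypothetical, and the 75 degree limit is an arbitrary value chosen for the example, not a documented threshold for these systems.

```python
import re
from typing import List, Tuple

# Matches temperature rows such as:
# "/N0/IB6 AR 0      Temp. 0    77    Degrees C   7 sec OK"
TEMP_ROW = re.compile(
    r"(?P<slot>\S+)\s+(?P<device>\S+ \d+)\s+Temp\. \d+\s+"
    r"(?P<value>\d+)\s+Degrees C\s+\d+ sec\s+(?P<status>\S+)$"
)


def flag_temperatures(output: str, limit_c: int) -> List[Tuple[str, str, int, str]]:
    """Return temperature rows at or above limit_c, or whose status is not OK."""
    flagged = []
    for line in output.splitlines():
        match = TEMP_ROW.match(line.strip())
        if match is None:
            continue  # header, voltage, or fan row
        value = int(match["value"])
        if value >= limit_c or match["status"] != "OK":
            flagged.append((match["slot"], match["device"], value, match["status"]))
    return flagged


sample = """\
/N0/IB6 Board 0   2.5 VDC 0   2.51 Volts DC    7 sec OK
/N0/IB6 SDC 0     Temp. 0    63    Degrees C   7 sec OK
/N0/IB6 AR 0      Temp. 0    77    Degrees C   7 sec OK
/N0/IB6 DX 0      Temp. 0    69    Degrees C   7 sec OK
"""
# limit_c=75 is a hypothetical example value, not a Sun specification.
print(flag_temperatures(sample, limit_c=75))
# [('/N0/IB6', 'AR 0', 77, 'OK')]
```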


Power Supplies

Each power supply unit (PSU) has its own LEDs as follows:

In addition there are two system LEDs labelled SourceA and SourceB. These show the state of the power feeds to the system. There are four physical power feeds and they are split into A and B.

Feed A supplies PS0 and PS1, feed B supplies PS2 and PS3. If either PS0 or PS1 receives input power then the SourceA indicator is lit. If either PS2 or PS3 receives input power then the SourceB indicator is lit. If neither of the supplies receives input power, the indicator is turned off.

These indicators are set on the basis of periodic monitoring at least once every 10 seconds.
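The relationship between the power supplies and the two source indicators can be summarized in a few lines. The following Python sketch only restates the rule described above; the function name source_indicators is hypothetical and does not correspond to any SC command.

```python
from typing import Dict, Iterable

FEED_A = {"PS0", "PS1"}  # supplies on feed A
FEED_B = {"PS2", "PS3"}  # supplies on feed B


def source_indicators(powered: Iterable[str]) -> Dict[str, bool]:
    """Map the set of PSUs receiving input power to indicator states.

    An indicator is lit (True) if either supply on its feed has input power.
    """
    powered_set = set(powered)
    return {
        "SourceA": bool(FEED_A & powered_set),
        "SourceB": bool(FEED_B & powered_set),
    }


print(source_indicators({"PS1"}))
# {'SourceA': True, 'SourceB': False}
```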


Displaying Diagnostic Information

For information on displaying diagnostic information, see the Sun Hardware Platform Guide, which is available with your Solaris Operating System release.


Assisting Sun Service Personnel in Determining Causes of Failure

Provide the following information to Sun service personnel so that they can help you determine the causes of your failure:

 

Footnotes to TABLE 10-6:

[1] This includes faults where the FRU is only degraded.
[2] If lit, indicates the failing FRU is accessed from the top of the platform. It is important that you employ the anti-tip legs on the cabinet before extending the platform out on its rails.