CHAPTER 12

Troubleshooting

An internal fault is any condition that is considered unacceptable for normal system operation. When the system has a fault, the Fault LED turns on. When domains encounter hardware errors, the auto-diagnosis and auto-restoration features detect, diagnose, and attempt to deconfigure the components associated with those errors (see Automatic Diagnosis and Recovery Overview for details). However, further troubleshooting by the system administrator may be necessary when there are other system problems or error conditions that are not handled by the auto-diagnosis engine.

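This chapter provides general guidelines for troubleshooting system problems and covers the following topics:

  • Capturing and Collecting System Information
  • Domain Not Responding
  • Board and Component Failures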

Capturing and Collecting System Information

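To analyze a system failure or to assist your Sun service provider in determining the cause of a system failure, gather information from the following sources:

  • Platform, Domain, and System Messages
  • Platform and Domain Status Information From System Controller Commands
  • Diagnostic and System Configuration Information From Solaris Operating Environment Commands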

Platform, Domain, and System Messages

TABLE 12-1 identifies different ways to capture error messages and other system information displayed on the platform or console.


TABLE 12-1 Capturing Error Messages and Other System Information

Error Logging System

Definition

/var/adm/messages

File in the Solaris operating environment containing messages that are reported by the Solaris operating environment as determined by syslog.conf. This file does not contain any system controller or domain console messages.

Note: Messages diverted to external syslog hosts can be found in the /var/adm/messages file of the syslog host.

Platform console

Contains and displays system controller error and event messages.

Domain console

Contains and displays:

  • Messages written to the domain console by the Solaris operating environment
  • System controller error and event messages

Note: System controller messages that relate to a domain are reported to the domain console only and are not reported to the Solaris operating environment.

loghost

Used to collect system controller messages. You must set up a syslog loghost for the platform shell and for each domain shell, to capture platform and domain console output. To permanently save loghost error messages, you must set up a loghost server. For details on setting up the loghost for the platform and domains, see TABLE 3-1.

 

The system controller log files are necessary because they contain more information than the showlogs system controller command. Also, with the system controller log files, your service provider can obtain a persistent, stored history of the system, which can help during troubleshooting.

showlogs

System controller command that displays system controller messages for the platform and domain that are stored in a dynamic buffer. Once the buffer is filled, the old messages are overwritten.

 

The message buffer is cleared under these conditions:

  • When you reboot the system controller
  • When the system controller loses power

 

However, in systems with enhanced-memory SCs (SC V2s), certain log messages are maintained in persistent storage. These logs persist even after the system is rebooted or the system loses power. The showlogs -p command enables you to view specific persistent logs.

showerrorbuffer

System controller command that displays system error information stored in the system error buffer. The output provides details about the error, such as a fault condition. You and your service provider can review this information to analyze a failure or problem. The first error entry in the buffer is retained for diagnostic purposes. However, once the buffer becomes full, subsequent error messages cannot be stored and are discarded. The error buffer must be cleared by your service provider after the error condition is resolved.

 

In systems with enhanced-memory SCs (SC V2s), these system error messages are retained in persistent storage. These system error messages persist even after the SC is rebooted or the SC loses power.

showfru

System controller command that displays the field-replaceable units (FRUs) installed in a Sun Fire midrange system. Your service provider uses this information to track the FRUs in a system.
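
The showlogs and showerrorbuffer entries in TABLE 12-1 describe two different behaviors when a buffer fills: the dynamic log buffer overwrites its oldest messages, while the system error buffer keeps its earliest entries and discards later ones. The following Python sketch models only that difference. The class names and capacities are hypothetical, chosen for illustration, and do not reflect how the system controller is implemented.

```python
from collections import deque


class OverwritingLog:
    """Models a dynamic log buffer: once full, the oldest message is overwritten."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item on overflow.
        self._messages = deque(maxlen=capacity)

    def record(self, message):
        self._messages.append(message)

    def contents(self):
        return list(self._messages)


class FirstErrorBuffer:
    """Models an error buffer: once full, new entries are discarded until cleared."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = []

    def record(self, entry):
        # Returns False when the entry is discarded because the buffer is full.
        if len(self._entries) >= self._capacity:
            return False
        self._entries.append(entry)
        return True

    def contents(self):
        return list(self._entries)

    def clear(self):
        self._entries.clear()
```

In this model, the first error entry always survives in FirstErrorBuffer, which is why the earliest entry is useful for diagnosis and why the buffer must be cleared after the error condition is resolved before it can record new errors.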


Platform and Domain Status Information From System Controller Commands

TABLE 12-2 identifies system controller commands that provide platform and domain status information that can be used for troubleshooting purposes.


TABLE 12-2 System Controller Commands that Display Platform and Domain Status Information

| Command | Platform | Domain | Description |
| --- | --- | --- | --- |
| showboards -v | x | x | Displays the assignment information and status for all the components in the system. |
| showenvironment | x | x | Displays the current environmental status, temperatures, currents, voltages, and fan status for the platform or the domain. |
| showdomain -v | | x | Displays the domain configuration parameters. |
| showerrorbuffer | x | | Shows the system errors stored in the system error buffer. |
| showfru -r manr | x | | Displays the manufacturing records of FRUs installed in a Sun Fire midrange system. |
| showlogs -v or showlogs -v -d domainID | x | x | Displays the system controller-logged events stored in the dynamic buffer. |
| showlogs -p -f filter | x | x | For systems with SC V2s, displays the system controller-logged messages recorded in persistent storage. |
| showplatform -v or showplatform -d domainID | x | | Shows the configuration parameters for the platform and specific domain information. |
| showresetstate -v or showresetstate -v -f URL | | x | Prints a summary report of the contents of registers from every CPU in the domain that has a valid saved state. If you specify the -f URL option with the showresetstate command, the report summary is written to a URL, which can be reviewed by your service provider. |
| showsc -v | x | | Shows the system controller and clock failover status, ScApp and RTOS versions, and uptime. |


For additional information on these commands, refer to their command descriptions in the Sun Fire Midrange System Controller Command Reference Manual.

Diagnostic and System Configuration Information From Solaris Operating Environment Commands

You can obtain diagnostic and system configuration information through the Solaris operating environment, with the following commands:

  • prtconf
  • prtdiag
  • sysdef
  • format

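The prtconf command prints the system configuration information. The output includes the total amount of memory and the configuration of the system peripherals, formatted as a device tree.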

This command has many options. For command syntax, options, and examples, see the prtconf(1M) man page in your Solaris operating environment release.

The prtdiag command displays system configuration and diagnostic information for the domain of your Sun Fire midrange system.

For more information on this command, see the prtdiag(1M) man page in your Solaris operating environment release.

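The Solaris operating environment sysdef utility outputs the current system definition in tabular form. It lists all hardware devices, pseudo devices, system devices, and loadable modules, along with the values of selected kernel tunable parameters.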

This command generates the output by analyzing the named bootable operating system file (namelist) and extracting configuration information from it. The default system namelist is /dev/kmem.

For command syntax, options, and examples, see the sysdef(1M) man page in your Solaris operating environment release.

The Solaris operating environment utility format, which is used to format drives, can also be used to display both logical and physical device names. For command syntax, options, and examples, see the format(1M) man page in your Solaris operating environment release.
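
Because the output of these commands is often requested by a service provider, it can help to save it to files in one pass. The following Python sketch is one way to do that, assuming Python 3.7 or later is available on the domain and that the commands are on the PATH (on some systems prtdiag is installed under a platform-specific directory, so its path may need adjusting). The function names, the output file names, and the two-minute timeout are hypothetical choices for illustration. The format utility is omitted because it is interactive.

```python
import subprocess
from pathlib import Path

# Commands named in this section, run here without options.
COMMANDS = {
    "prtconf": ["prtconf"],
    "prtdiag": ["prtdiag"],
    "sysdef": ["sysdef"],
}


def run_command(argv):
    """Run one command and return its combined output, or a note if it fails."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run {' '.join(argv)}: {exc}\n"
    return result.stdout + result.stderr


def collect(output_dir, runner=run_command):
    """Write the output of each command to <output_dir>/<name>.txt."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in COMMANDS.items():
        path = out / f"{name}.txt"
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling collect("sysinfo") creates a sysinfo directory containing one text file per command. A command that is missing or times out produces a short note in its file instead of stopping the collection.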


Domain Not Responding

If a domain is not responding, the domain is most likely in one of the following states:

  • Paused because of a hardware error
  • Hung

If the system controller detects a hardware error, and the reboot-on-error parameter in the setupdomain command is set to true, the domain is automatically rebooted after the auto-diagnosis engine reports and deconfigures components associated with the hardware error.

However, if the reboot-on-error parameter is set to false, the domain is paused. If the domain is paused, reset the domain by turning the domain off with the setkeyswitch off command and then turning the domain on with the setkeyswitch on command.

A domain can also hang. When the system controller determines that a domain is hung, it automatically performs an externally initiated reset (XIR) and reboots the domain, provided that the hang-policy parameter of the setupdomain command is set to reset.

However, if the domain hangs and the hang-policy parameter of the setupdomain command is set to notify, the system controller reports that the domain is hung but does not automatically recover the domain. In this case, you must recover the hung domain as explained in the following procedure.

A domain is considered to be hard hung when the Solaris operating environment and OpenBoot PROM (OBP) are not responding at the domain console.
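
The interaction between the reboot-on-error and hang-policy parameters described above can be summarized as a small decision function. The following Python sketch is a model of this section's text only, not of the system controller software. The function name, the condition labels, and the returned strings are hypothetical.

```python
def sc_response(condition, *, reboot_on_error, hang_policy):
    """Return the behavior this section describes for a domain condition.

    condition        -- "hardware-error" or "hang" (labels are illustrative)
    reboot_on_error  -- value of the reboot-on-error parameter (True or False)
    hang_policy      -- value of the hang-policy parameter ("reset" or "notify")
    """
    if condition == "hardware-error":
        if reboot_on_error:
            return "auto-reboot after deconfiguring faulty components"
        return "domain paused: setkeyswitch off, then setkeyswitch on"
    if condition == "hang":
        if hang_policy == "reset":
            return "automatic XIR and reboot"
        if hang_policy == "notify":
            return "hang reported only: recover with the reset command"
        raise ValueError(f"unknown hang-policy: {hang_policy!r}")
    raise ValueError(f"unknown condition: {condition!r}")
```

For example, sc_response("hang", reboot_on_error=True, hang_policy="notify") returns the case in which the administrator must recover the domain manually.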


To Recover From a Hung Domain



Note - This procedure assumes that the system controller is functioning and that the hang-policy parameter of the setupdomain command is set to notify.



1. Determine the status for the domain as reported by the system controller.

Type one of the following system controller commands:

These commands provide the same type of information in the same format. If the output in the Domain Status field displays Not Responding, the system controller has determined that the domain is hung.

2. Reset the domain:



Note - A domain cannot be reset while the domain keyswitch is in the secure position.



a. Access the domain shell.

See System Controller Navigation.

b. Reset the domain by typing the reset command.

In order for the system controller to perform this operation, you must confirm it. For a complete definition of this command, refer to the reset command in the Sun Fire Midrange System Controller Command Reference Manual.

The manner in which the domain recovery occurs is determined by the OBP.error-reset-recovery parameter settings in the setupdomain command. For details on the domain parameters, refer to the setupdomain command in the Sun Fire Midrange System Controller Command Reference Manual.


Board and Component Failures

The auto-diagnosis engine can diagnose and identify certain types of components, such as CPU/Memory boards and I/O assemblies, associated with hardware failures. However, other components, such as the System Controller boards, Repeater boards, power supplies, and fan trays, are not handled by the auto-diagnosis engine.

Handling Component Failures

This section describes what to do when the following components fail:

For additional information about these components, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.


To Handle Failed Components

1. Capture and collect system information for troubleshooting purposes.

2. Contact your service provider for further assistance.

Your service provider will review the troubleshooting data that you gathered and will initiate the appropriate service action.

Recovering from a Repeater Board Failure

If a Repeater board failure occurs, you can use remaining domain resources until the failed board can be replaced. You must set the partition mode parameter (of the setupplatform command) to dual-partition mode and adjust the domain resources to use available domains, as indicated in TABLE 12-3.


TABLE 12-3 Adjusting Domain Resources When a Repeater Board Fails

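| Midrange Server | RP0 Failure | RP1 Failure | RP2 Failure | RP3 Failure | Use Available Domains |
| --- | --- | --- | --- | --- | --- |
| Sun Fire E6900 and 6800 | X | | | | C and D |
| | | X | | | C and D |
| | | | X | | A and B |
| | | | | X | A and B |
| Sun Fire E4900/4810/4800/3800 systems | X | Not applicable | | Not applicable | C |
| | | Not applicable | X | Not applicable | A |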
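
The mapping in TABLE 12-3 can also be expressed as a lookup, which may be convenient in a site runbook or monitoring script. The following Python sketch encodes only the rows of the table. The dictionary keys, the function name, and the error message are hypothetical.

```python
# Domains that remain usable after a Repeater board failure, per TABLE 12-3.
AVAILABLE_DOMAINS = {
    "E6900/6800": {
        "RP0": ("C", "D"),
        "RP1": ("C", "D"),
        "RP2": ("A", "B"),
        "RP3": ("A", "B"),
    },
    # RP1 and RP3 are marked "Not applicable" for these systems in the table.
    "E4900/4810/4800/3800": {
        "RP0": ("C",),
        "RP2": ("A",),
    },
}


def available_domains(server_family, failed_board):
    """Return the domains to use when failed_board fails on server_family."""
    try:
        return AVAILABLE_DOMAINS[server_family][failed_board]
    except KeyError:
        raise ValueError(
            f"{failed_board} is not applicable to {server_family} in TABLE 12-3"
        ) from None
```

For example, available_domains("E6900/6800", "RP2") returns ("A", "B"), matching the third row of the table.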


If you are running host-licensed software on a domain affected by a Repeater board failure, you can also swap the HostID/MAC address of the affected domain with that of an available domain. You can then use the hardware of the available domain to run the host-licensed software without encountering license restrictions. Use the HostID/MAC Address Swap parameter in the setupplatform command to swap the HostID/MAC address between a pair of domains. For details, see Swapping Domain HostID/MAC Addresses.