An internal fault is any condition that is considered to be unacceptable for normal system operation. When the system has a fault, the Fault LED turns on. When domains encounter hardware errors, the auto-diagnosis and auto-restoration features detect, diagnose, and attempt to deconfigure the components associated with those errors (see Auto-Diagnosis and Auto-Restoration for details). However, further troubleshooting by the system administrator may be necessary when there are other system problems or error conditions that are not handled by the auto-diagnosis engine.
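This chapter provides general guidelines for troubleshooting system problems. It covers capturing and collecting system information, recovering a domain that is not responding, and handling board and component failures.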
Capturing and Collecting System Information
To analyze a system failure or to assist your Sun service provider in determining the cause of a system failure, gather information from the following sources:
Platform, Domain, and System Messages
TABLE 9-1 identifies different ways to capture error messages and other system information displayed on the platform or console.
TABLE 9-1 Capturing Error Messages and Other System Information
Error Logging System | Definition
/var/adm/messages | File in the Solaris operating environment containing messages that are reported by the Solaris operating environment as determined by syslog.conf. This file does not contain any system controller or domain console messages. Note: Messages diverted to external syslog hosts can be found in the /var/adm/messages file of the syslog host.
Platform console | Contains and displays system controller error and event messages.
Domain console | Contains and displays messages written to the domain console by the Solaris operating environment, and system controller error and event messages. Note: System controller messages that relate to a domain are reported to the domain console only and are not reported to the Solaris operating environment.
loghost | Used to collect system controller messages. You must set up a syslog loghost for the platform shell and for each domain shell, to capture platform and domain console output. To permanently save loghost error messages, you must set up a loghost server. For details on setting up the loghost for the platform and domains, see TABLE 3-1. The system controller log files are necessary because they contain more information than the showlogs system controller command. Also, with the system controller log files, your service provider can obtain a persistent, stored history of the system, which can help during troubleshooting.
showlogs | System controller command that displays system controller messages for the platform and domain that are stored in the message buffer. Once the buffer is filled, the old messages are overwritten. The message buffer is cleared when you reboot the system controller and when the system controller loses power.
showerrorbuffer | System controller command that displays system error information stored in the system error buffer. The output provides details about the error, such as a fault condition. You and your service provider can review this information to analyze a failure or problem. The first error entry in the buffer is retained for diagnostic purposes. However, once the buffer becomes full, subsequent error messages cannot be stored and are discarded. The error buffer must be cleared by your service provider after the error condition is resolved.
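The two system controller buffers described in TABLE 9-1 behave differently once they fill up: the showlogs message buffer overwrites its old messages, while the error buffer keeps what it already holds and discards new entries. The following Python sketch only illustrates that difference. The class names and the capacity of 3 are hypothetical and do not reflect the actual system controller implementation or its buffer sizes.

```python
from collections import deque


class OverwritingBuffer:
    """Illustrative model: old messages are overwritten once the buffer is full."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item on overflow.
        self._entries = deque(maxlen=capacity)

    def add(self, message):
        self._entries.append(message)

    def contents(self):
        return list(self._entries)


class RetainingBuffer:
    """Illustrative model: early entries are kept, new ones are discarded when full."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = []

    def add(self, message):
        if len(self._entries) < self._capacity:
            self._entries.append(message)
            return True
        return False  # buffer full: the new message is discarded

    def clear(self):
        # Stands in for the buffer being cleared after the error is resolved.
        self._entries.clear()

    def contents(self):
        return list(self._entries)


if __name__ == "__main__":
    log_buf, err_buf = OverwritingBuffer(3), RetainingBuffer(3)
    for n in range(1, 6):
        log_buf.add(f"msg{n}")
        err_buf.add(f"msg{n}")
    print(log_buf.contents())  # ['msg3', 'msg4', 'msg5']
    print(err_buf.contents())  # ['msg1', 'msg2', 'msg3']
```

The practical consequence is the one the table states: a loghost is needed to keep a persistent history of system controller messages, and the error buffer must be cleared before it can record new errors once it is full.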
Platform and Domain Status Information From System Controller Commands
TABLE 9-2 identifies system controller commands that provide platform and domain status information that can be used for troubleshooting purposes.
TABLE 9-2 System Controller Commands that Display Platform and Domain Status Information
Command | Platform | Domain | Description
showboards -v | x | x | Displays the assignment information and status for all the components in the system.
showenvironment | x | x | Displays the current environmental status, temperatures, currents, voltages, and fan status for the platform or the domain.
showdomain -v |   | x | Displays the domain configuration parameters.
showerrorbuffer | x |   | Shows the contents of the system error buffer.
showlogs -v or showlogs -v -d domainID | x | x | Displays the system controller-logged events stored in the system controller message buffer.
showplatform -v or showplatform -d domainID | x |   | Shows the configuration parameters for the platform and specific domain information.
showresetstate -v or showresetstate -v -f URL |   | x | Prints a summary report of the contents of registers from every CPU in the domain that has a valid saved state. If you specify the -f URL option with the showresetstate command, the report summary is written to a URL, which can be reviewed by your service provider.
showsc -v | x |   | Shows the system controller and clock failover status, ScApp and RTOS versions, and uptime.
For additional information on these commands, refer to their command descriptions in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
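As a quick reference, the shell availability in TABLE 9-2 can be written down as a lookup. This is a minimal Python sketch: the dictionary and the function name commands_for are illustrative assumptions, and only the command names and the platform and domain columns come from the table.

```python
# Shell availability transcribed from TABLE 9-2. The data structure and the
# function below are illustrative, not part of any system controller software.
STATUS_COMMANDS = {
    "showboards -v": {"platform", "domain"},
    "showenvironment": {"platform", "domain"},
    "showdomain -v": {"domain"},
    "showerrorbuffer": {"platform"},
    "showlogs -v": {"platform", "domain"},
    "showplatform -v": {"platform"},
    "showresetstate -v": {"domain"},
    "showsc -v": {"platform"},
}


def commands_for(shell):
    """Return the status commands TABLE 9-2 lists for the given shell."""
    if shell not in ("platform", "domain"):
        raise ValueError(f"unknown shell: {shell!r}")
    return sorted(cmd for cmd, shells in STATUS_COMMANDS.items() if shell in shells)


if __name__ == "__main__":
    for shell in ("platform", "domain"):
        print(shell, "->", ", ".join(commands_for(shell)))
```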
Diagnostic and System Configuration Information From Solaris Operating Environment Commands
To obtain diagnostic and system configuration information through the Solaris operating environment, use the following commands:
The prtconf command prints the system configuration information. The output includes:
- Total amount of memory
- Configuration of the system peripherals formatted as a device tree
This command has many options. For command syntax, options, and examples, see the prtconf(1M) man page in your Solaris operating environment release.
The prtdiag command displays the following information about the domain of your Sun Fire 6800/4810/4800/3800 system:
- Configuration
- Diagnostic (any failed FRUs)
- Total amount of memory
For more information on this command, see the prtdiag(1M) man page in your Solaris operating environment release.
The Solaris operating environment sysdef utility outputs the current system definition in tabular form. It lists:
- All hardware devices
- Pseudo devices
- System devices
- Loadable modules
- Values of selected kernel tunable parameters
This command generates the output by analyzing the named bootable operating system file (namelist) and extracting configuration information from it. The default system namelist is /dev/kmem.
For command syntax, options, and examples, see the sysdef(1M) man page in your Solaris operating environment release.
The Solaris operating environment utility, format, which is used to format drives, can also be used to display both logical and physical device names. For command syntax, options, and examples, see the format(1M) man page in your Solaris operating environment release.
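When gathering data for a service provider, it can help to save the output of these commands to files. The following Python sketch shows one way to do that from within a domain, assuming a Python interpreter is available there. The function name collect, the output directory, and the file naming are hypothetical. Commands that cannot be found are skipped, because on some Solaris releases prtdiag may be installed outside the default PATH. The format utility is left out because it is interactive.

```python
import shutil
import subprocess
from pathlib import Path

# Non-interactive commands named in this section, run without options.
DEFAULT_COMMANDS = [["prtconf"], ["prtdiag"], ["sysdef"]]


def collect(commands, out_dir):
    """Run each command and save its output.

    Returns a dict mapping the command name to the file that holds its
    output, or to None when the command was not found on this host.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for argv in commands:
        name = Path(argv[0]).name
        if shutil.which(argv[0]) is None:
            results[name] = None  # command not available, skip it
            continue
        # check=False: keep whatever output there is, even on a non-zero exit
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        target = out_dir / f"{name}.txt"
        target.write_text(proc.stdout + proc.stderr)
        results[name] = target
    return results
```

Calling collect(DEFAULT_COMMANDS, "sysinfo") would write one text file per available command into a sysinfo directory. Pass a full path in place of a command name if a command is not on the PATH.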
Domain Not Responding
If a domain is not responding, the domain is most likely in one of the following states:
- Paused due to a hardware error
If the system controller detects a hardware error, and the reboot-on-error parameter in the setupdomain command is set to true, the domain is automatically rebooted after the auto-diagnosis engine reports and deconfigures components associated with the hardware error.
However, if the reboot-on-error parameter is set to false, the domain is paused. If the domain is paused, reset the domain by turning the domain off with the setkeyswitch off command and then turning the domain on with the setkeyswitch on command.
The domain can also be hung. A domain is hung for either of the following reasons:
- The domain heartbeat stops.
- The domain does not respond to interrupts.
In such cases, the system controller automatically performs an XIR and reboots the domain, provided that the hang-policy parameter of the setupdomain command is set to reset.
However, if the domain hangs and the hang-policy parameter of the setupdomain command is set to notify, the system controller reports that the domain is hung but does not automatically recover the domain. In this case you must recover the hung domain as explained in the following procedure.
A domain is considered to be hard hung when the Solaris operating environment and OBP are not responding at the domain console.
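The way the reboot-on-error and hang-policy parameters determine what happens to an unresponsive domain can be summarized as two small decision functions. This Python sketch only restates the behavior described above; the function names and the returned strings are illustrative and are not system controller output.

```python
def response_to_hardware_error(reboot_on_error):
    """Outcome after the system controller detects a hardware error."""
    if reboot_on_error:
        return "domain is rebooted automatically after auto-diagnosis"
    return "domain is paused: run setkeyswitch off, then setkeyswitch on"


def response_to_hang(hang_policy):
    """Outcome after the system controller detects a hung domain."""
    if hang_policy == "reset":
        return "system controller performs an XIR and reboots the domain"
    if hang_policy == "notify":
        return "system controller reports the hang: recover the domain manually"
    raise ValueError(f"unexpected hang-policy value: {hang_policy!r}")


if __name__ == "__main__":
    print(response_to_hardware_error(False))
    print(response_to_hang("notify"))
```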
To Recover From a Hung Domain
Note - This procedure assumes that the system controller is functioning and that the hang-policy parameter of the setupdomain command is set to notify.
1. Determine the status for the domain as reported by the system controller.
Type one of the following system controller commands:
- showplatform -p status (platform shell)
- showdomain -p status (domain shell)
These commands provide the same type of information in the same format. If the output in the Domain Status field displays Not Responding, the system controller has determined that the domain is hung.
2. Reset the domain:
Note - A domain cannot be reset while the domain keyswitch is in the secure position.
a. Access the domain shell.
See System Controller Navigation.
b. Reset the domain by typing the reset command.
In order for the system controller to perform this operation, you must confirm it. For a complete definition of this command, refer to the reset command in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
The manner in which the domain recovery occurs is determined by the OBP.error-reset-recovery parameter settings in the setupdomain command. For details on the domain parameters, refer to the setupdomain command in the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
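The checks in this procedure can be expressed as a short precheck before issuing the reset command. In this Python sketch, the function name reset_precheck and its return strings are hypothetical. It assumes you have already read the Domain Status value and the keyswitch position from the system controller by hand, and it does not parse any command output.

```python
def reset_precheck(domain_status, keyswitch_position):
    """Decide whether the manual reset in this procedure applies.

    domain_status: the value shown in the Domain Status field.
    keyswitch_position: the current domain keyswitch position.
    """
    if domain_status.strip().lower() != "not responding":
        return "no reset: the system controller does not report the domain as hung"
    if keyswitch_position.strip().lower() == "secure":
        return "blocked: the domain cannot be reset in the secure position"
    return "proceed: access the domain shell and type reset"


if __name__ == "__main__":
    print(reset_precheck("Not Responding", "on"))
    print(reset_precheck("Not Responding", "secure"))
```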
Board and Component Failures
The auto-diagnosis engine can diagnose and identify certain types of components, such as CPU/Memory boards and I/O assemblies, associated with hardware failures. However, other components, such as the System Controller boards, Repeater boards, power supplies, and fan trays, are not handled by the auto-diagnosis engine.
Handling Component Failures
This section describes what to do when the following components fail:
- CPU/Memory boards
- I/O assemblies
- Repeater boards
- System Controller boards
- Power supplies
- Fan trays
For additional information about these components, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual.
To Handle Failed Components
1. Capture and collect system information for troubleshooting purposes.
- CPU/Memory board failure - Collect auto-diagnosis event messages from the sources described in TABLE 9-1.
- I/O assembly failure - Collect auto-diagnosis event messages from the sources described in TABLE 9-1.
- Repeater board failure - Collect troubleshooting data as described in TABLE 9-1 and TABLE 9-2 and temporarily adjust available domain resources. See Recovering from a Repeater Board Failure.
- System controller board failure:
  - In a redundant SC configuration, wait for automatic SC failover to occur. After the failover, review the showlogs command output, the platform loghost, if configured, and platform messages for the working SC to obtain information on the failure condition.
  - If you have one SC and it fails, collect data from the platform and domain console or loghosts, and output from the showlogs and showerrorbuffer commands.
- Power supply failure - If you do not have a redundant power supply, collect troubleshooting data as described in TABLE 9-1 and TABLE 9-2.
- Fan tray failure - If you do not have a redundant fan tray, collect troubleshooting data as described in TABLE 9-1 and TABLE 9-2.
2. Contact your service provider for further assistance.
Your service provider will review the troubleshooting data that you gathered and will initiate the appropriate service action.
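Step 1 of this procedure amounts to a lookup from the failed component to the data to collect. The Python sketch below restates that mapping as a checklist helper. The function name data_to_collect and the component keys are illustrative, and the function returns an empty list for a redundant power supply or fan tray only because the procedure describes no collection step for that case.

```python
def data_to_collect(component, redundant=False):
    """Return the step 1 collection items for a failed component."""
    table_9_1 = "messages from the sources in TABLE 9-1"
    table_9_2 = "output of the commands in TABLE 9-2"
    if component in ("cpu/memory board", "i/o assembly"):
        return ["auto-diagnosis event " + table_9_1]
    if component == "repeater board":
        return [table_9_1, table_9_2]
    if component == "system controller board":
        if redundant:
            # Applies after automatic SC failover has occurred.
            return [
                "showlogs output",
                "platform loghost messages (if configured)",
                "platform messages for the working SC",
            ]
        return [
            "platform and domain console or loghost messages",
            "showlogs output",
            "showerrorbuffer output",
        ]
    if component in ("power supply", "fan tray"):
        # The procedure only covers the non-redundant case.
        return [] if redundant else [table_9_1, table_9_2]
    raise ValueError(f"unknown component: {component!r}")


if __name__ == "__main__":
    for item in data_to_collect("system controller board", redundant=False):
        print("-", item)
```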
Recovering from a Repeater Board Failure
If a Repeater board failure occurs, you can use remaining domain resources until the failed board can be replaced. You must set the partition mode parameter (of the setupplatform command) to dual partition mode and adjust the domain resources to use available domains, as indicated in TABLE 9-3.
TABLE 9-3 Adjusting Domain Resources When a Repeater Board Fails
Midframe Server | RP0 Failure | RP1 Failure | RP2 Failure | RP3 Failure | Use Available Domains
6800 | X |   |   |   | C and D
6800 |   | X |   |   | C and D
6800 |   |   | X |   | A and B
6800 |   |   |   | X | A and B
4810/4800/3800 | X | Not applicable |   | Not applicable | C
4810/4800/3800 |   | Not applicable | X | Not applicable | A
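TABLE 9-3 can also be read as a lookup from the server model and the failed Repeater board to the domains that remain usable. The Python sketch below transcribes the table; the dictionary layout and the function name available_domains are illustrative assumptions.

```python
# Transcribed from TABLE 9-3. Combinations marked "Not applicable" in the
# table are left out, so looking them up raises an error.
AVAILABLE_DOMAINS = {
    ("6800", "RP0"): ["C", "D"],
    ("6800", "RP1"): ["C", "D"],
    ("6800", "RP2"): ["A", "B"],
    ("6800", "RP3"): ["A", "B"],
    ("4810/4800/3800", "RP0"): ["C"],
    ("4810/4800/3800", "RP2"): ["A"],
}


def available_domains(server, failed_board):
    """Return the domains to use after the given Repeater board fails."""
    try:
        return AVAILABLE_DOMAINS[(server, failed_board)]
    except KeyError:
        raise ValueError(
            f"{failed_board} is not applicable to server {server}"
        ) from None


if __name__ == "__main__":
    print(available_domains("6800", "RP2"))            # ['A', 'B']
    print(available_domains("4810/4800/3800", "RP0"))  # ['C']
```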
If you are running host-licensed software on a domain affected by a Repeater board failure, you can also swap the HostID/MAC address of the affected domain with that of an available domain. You can then use the hardware of the available domain to run the host-licensed software without encountering license restrictions. Use the HostID/MAC Address Swap parameter in the setupplatform command to swap the HostID/MAC address between a pair of domains. For details, see Swapping Domain HostID/MAC Addresses.