CHAPTER 12
An internal fault is any condition that is considered unacceptable for normal system operation. When the system has a fault, the Fault LED turns on. When domains encounter hardware errors, the auto-diagnosis and auto-restoration features detect, diagnose, and attempt to deconfigure the components associated with those errors (see Automatic Diagnosis and Recovery Overview for details). However, further troubleshooting by the system administrator may be necessary when there are other system problems or error conditions that the auto-diagnosis engine does not handle.
This chapter provides general guidelines for troubleshooting system problems and covers the following topics:
To analyze a system failure or to assist your Sun service provider in determining the cause of a system failure, gather information from the following sources:
TABLE 12-1 identifies different ways to capture error messages and other system information displayed on the platform or console.
File in the Solaris operating environment containing messages that are reported by the Solaris operating environment as determined by syslog.conf. This file does not contain any system controller or domain console messages.
Used to collect system controller messages. To capture platform and domain console output, you must set up a syslog loghost for the platform shell and for each domain shell. To permanently save loghost error messages, you must set up a loghost server. For details on setting up the loghost for the platform and domains, see TABLE 3-1.
The system controller log files are necessary because they contain more information than the showlogs system controller command. Also, with the system controller log files, your service provider can obtain a persistent, stored history of the system, which can help during troubleshooting.
System controller command that displays system error information stored in the system error buffer. The output provides details about the error, such as a fault condition. You and your service provider can review this information to analyze a failure or problem. The first error entry in the buffer is retained for diagnostic purposes. However, once the buffer becomes full, subsequent error messages cannot be stored and are discarded. The error buffer must be cleared by your service provider after the error condition is resolved.
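The buffer behavior described above (entries are kept in arrival order, and new messages are discarded once the buffer is full) can be illustrated with a small model. This is only a sketch for reasoning about why the buffer must be cleared after an error is resolved. The class name, the capacity value, and the message strings are hypothetical and do not reflect the actual system controller implementation or buffer size.

```python
class ErrorBuffer:
    """Toy model of a fixed-size error buffer that never overwrites old entries."""

    def __init__(self, capacity):
        # Hypothetical capacity; the real buffer size is not stated in this chapter.
        self.capacity = capacity
        self.entries = []
        self.discarded = 0

    def record(self, message):
        """Store the message if there is room; otherwise drop it and count the loss."""
        if len(self.entries) < self.capacity:
            self.entries.append(message)
            return True
        self.discarded += 1
        return False

    def clear(self):
        """Empty the buffer so that new errors can be stored again."""
        self.entries.clear()
        self.discarded = 0


buf = ErrorBuffer(capacity=2)
for msg in ["first fault", "second fault", "third fault"]:
    buf.record(msg)
```

In this model the first entry always survives, which matches the statement that the first error entry is retained for diagnostic purposes, while the third message is lost until the buffer is cleared.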
TABLE 12-2 identifies system controller commands that provide platform and domain status information that can be used for troubleshooting purposes.
Prints a summary report of the contents of registers from every CPU in the domain that has a valid saved state. If you specify the -f URL option with the showresetstate command, the report summary is written to a URL, which can be reviewed by your service provider.
For additional information on these commands, refer to their command descriptions in the Sun Fire Midrange System Controller Command Reference Manual.
To obtain diagnostic and system configuration information through the Solaris operating environment, use the following commands:
The prtconf command prints the system configuration information. The output includes:
This command has many options. For command syntax, options, and examples, see the prtconf(1M) man page in your Solaris operating environment release.
The prtdiag command displays the following information for the domain of your Sun Fire midrange system:
For more information on this command, see the prtdiag(1M) man page in your Solaris operating environment release.
The Solaris operating environment sysdef utility outputs the current system definition in tabular form. It lists:
This command generates the output by analyzing the named bootable operating system file (namelist) and extracting configuration information from it. The default system namelist is /dev/kmem.
For command syntax, options, and examples, see the sysdef(1M) man page in your Solaris operating environment release.
The Solaris operating environment utility format, which is used to format drives, can also be used to display both logical and physical device names. For command syntax, options, and examples, see the format(1M) man page in your Solaris operating environment release.
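When gathering this information for a service provider, it can be convenient to capture the output of several commands in one pass. The following is a minimal sketch, assuming a Python 3 interpreter is available on the domain and that the commands are on the default PATH (on some releases prtdiag may reside elsewhere, so adjust as needed). The function names are hypothetical, the commands are run without options, and format is omitted because it is interactive.

```python
import subprocess

# Commands named in this section, run with no options.
# Assumption: each is reachable through PATH on the domain.
COMMANDS = ("prtconf", "prtdiag", "sysdef")


def run_command(name):
    """Run one command with no arguments and return its standard output as text."""
    result = subprocess.run([name], capture_output=True, text=True, check=False)
    return result.stdout


def collect_reports(commands=COMMANDS, runner=run_command):
    """Return a dict mapping each command name to its captured output."""
    reports = {}
    for name in commands:
        try:
            reports[name] = runner(name)
        except OSError as exc:
            # The command is missing or cannot be executed on this host.
            reports[name] = f"could not run {name}: {exc}"
    return reports
```

Calling collect_reports() with no arguments runs the three commands in order. The runner parameter exists only so that the collection logic can be exercised on a machine where these commands are not installed.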
If a domain is not responding, the domain is most likely in one of the following states:
If the system controller detects a hardware error, and the reboot-on-error parameter in the setupdomain command is set to true, the domain is automatically rebooted after the auto-diagnosis engine reports and deconfigures components associated with the hardware error.
However, if the reboot-on-error parameter is set to false, the domain is paused. If the domain is paused, reset the domain by turning the domain off with the setkeyswitch off command and then turning the domain on with the setkeyswitch on command.
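The effect of the reboot-on-error parameter can be summarized as a simple decision. The sketch below only restates the two cases described above in code form. The function name and the step descriptions are illustrative, not system controller output.

```python
def hardware_error_response(reboot_on_error):
    """Return the sequence of events after the system controller detects a hardware error."""
    if reboot_on_error:
        return [
            "auto-diagnosis engine reports and deconfigures the affected components",
            "domain is rebooted automatically",
        ]
    # With reboot-on-error set to false, the domain stays paused until reset by hand.
    return [
        "domain is paused",
        "administrator runs: setkeyswitch off",
        "administrator runs: setkeyswitch on",
    ]
```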
A domain can be hung because
In such cases, the system controller automatically performs an XIR and reboots the domain, provided that the hang-policy parameter of the setupdomain command is set to reset.
However, if the domain hangs and the hang-policy parameter of the setupdomain command is set to notify, the system controller reports that the domain is hung but does not automatically recover the domain. In this case, you must recover the hung domain as explained in the following procedure.
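The two hang-policy settings can be contrasted in the same way. This is a sketch of the behavior described above, with a hypothetical function name. It accepts only the two values named in this section and rejects anything else.

```python
def hang_response(hang_policy):
    """Return what happens when a domain hangs, for a given hang-policy value."""
    if hang_policy == "reset":
        return [
            "system controller performs an XIR",
            "system controller reboots the domain",
        ]
    if hang_policy == "notify":
        return [
            "system controller reports that the domain is hung",
            "administrator recovers the domain manually",
        ]
    raise ValueError(f"unexpected hang-policy value: {hang_policy!r}")
```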
A domain is considered to be hard hung when the Solaris operating environment and OpenBoot PROM (OBP) are not responding at the domain console.
1. Determine the status for the domain as reported by the system controller.
Type one of the following system controller commands:
These commands provide the same type of information in the same format. If the output in the Domain Status field displays Not Responding, the system controller has determined that the domain is hung.
2. Reset the domain:
a. Access the domain shell.
See System Controller Navigation.
b. Reset the domain by typing the reset command.
You must confirm the operation before the system controller performs the reset. For a complete definition of this command, refer to the reset command in the Sun Fire Midrange System Controller Command Reference Manual.
The manner in which the domain recovery occurs is determined by the OBP.error-reset-recovery parameter settings in the setupdomain command. For details on the domain parameters, refer to the setupdomain command in the Sun Fire Midrange System Controller Command Reference Manual.
The auto-diagnosis engine can diagnose and identify certain types of components, such as CPU/Memory boards and I/O assemblies, associated with hardware failures. However, other components, such as the System Controller boards, Repeater boards, power supplies, and fan trays, are not handled by the auto-diagnosis engine.
This section describes what to do when the following components fail:
For additional information about these components, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.
1. Capture and collect system information for troubleshooting purposes.
2. Contact your service provider for further assistance.
Your service provider will review the troubleshooting data that you gathered and will initiate the appropriate service action.
If a Repeater board failure occurs, you can use remaining domain resources until the failed board can be replaced. You must set the partition mode parameter (of the setupplatform command) to dual-partition mode and adjust the domain resources to use available domains, as indicated in TABLE 12-3.
If you are running host-licensed software on a domain affected by a Repeater board failure, you can also swap the HostID/MAC address of the affected domain with that of an available domain. You can then use the hardware of the available domain to run the host-licensed software without encountering license restrictions. Use the HostID/MAC Address Swap parameter in the setupplatform command to swap the HostID/MAC address between a pair of domains. For details, see Swapping Domain HostID/MAC Addresses.