CHAPTER 2
This chapter provides information about error indications and software commands to help you determine which component you need to replace. It contains the following sections:
Note - The procedures in this chapter assume that you are familiar with the OpenBoot firmware and that you know how to enter the OpenBoot environment. For more information about the OpenBoot firmware, see the OpenBoot 4.x Command Reference Manual. An online version of the manual is included with the OpenBoot Collection AnswerBook2 that ships with Solaris software.
The following chart shows which tools you can use to diagnose hardware and software problems.
The system provides the following features to help you identify and isolate hardware problems:
This section describes the error indications and software commands provided to help you troubleshoot your system. Diagnostic tools are covered in About Diagnostic Tools.
The system provides error indications via LEDs and error messages. Using the two in combination, you can isolate a problem to a particular field-replaceable unit (FRU) with a high degree of confidence.
The system provides status indicator LEDs in the following places:
Error messages are logged in the /var/adm/messages file and are also displayed on the system console by the diagnostic tools.
For additional information about LEDs, see the Sun Fire V890 Server Owner's Guide.
Front panel LEDs provide your first indication that there is a problem with your system. Usually, a front panel LED is not the only indication of a problem. Error messages and other LEDs within the enclosure can help to isolate the problem further. For additional information about the front panel LEDs, see the Sun Fire V890 Server Owner's Guide.
The front panel LEDs provide general system status, alert you to system problems, and help you determine the location of system faults:
Located on the rear of each power supply, the power supply LEDs indicate:
For additional information about the power supply LEDs, see the Sun Fire V890 Server Owner's Guide.
Fault LEDs within the enclosure help pinpoint the location of the faulty device. LEDs within the enclosure include:
For detailed information about these LEDs, see the Sun Fire V890 Server Owner's Guide.
Since all front panel and power supply LEDs are powered by the system's 5-volt standby power source, fault LEDs remain illuminated for any fault condition that results in a system shutdown.
During system startup, the front panel LEDs are individually toggled on and off to verify that each one is working correctly.
Error messages and other system messages are saved in the file /var/adm/messages. The two firmware-based diagnostic tools, POST and OpenBoot Diagnostics, also display error messages in a standard format on the local system console or on an RSC console (if configured). See Sample POST Error Messages and Sample OpenBoot Diagnostics Error Messages for more information.
The amount of information displayed in OpenBoot Diagnostics messages is determined by the keywords specified for the OpenBoot configuration variable test-args. See OpenBoot Configuration Variables for OpenBoot Diagnostics for additional details.
Several Solaris and OpenBoot firmware commands are available for diagnosing system problems. For more information about Solaris commands, see the appropriate man pages. For additional information about OpenBoot commands, see the OpenBoot 4.x Command Reference Manual. An online version of the manual is included with the OpenBoot Collection AnswerBook that ships with Solaris software.
The prtdiag command is a UNIX shell command used to display system configuration and diagnostic information. You can use the prtdiag command to display:
To run prtdiag, type:
An example of prtdiag output follows.
To isolate an intermittent failure, it may be helpful to maintain a prtdiag history log. Use prtdiag with the -l (log) option to send output to a log file in /var/adm.
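When you keep a history of prtdiag output, comparing two captures side by side is often the quickest way to see what changed between a good run and a bad one. The following is a minimal sketch, assuming you have already saved two captures as plain text yourself. The function name and labels are hypothetical and are not part of prtdiag or Solaris; the sketch makes no assumption about the prtdiag output format.

```python
import difflib


def diff_snapshots(old_text, new_text, old_label="previous", new_label="current"):
    """Return unified-diff lines describing what changed between two captures.

    Both arguments are complete captures of command output as strings.
    An empty list means the two captures are identical.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )
```

Lines in the result that begin with a single "-" appear only in the older capture, and lines that begin with a single "+" appear only in the newer one.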
To display environmental information, use prtdiag with the -v option. Type:
The prtdiag command with the -v option produces all of the output of the prtdiag command (shown in the preceding example) in addition to environmental information, current keyswitch position, LED indications, and other information.
The following is an example of the additional output produced by the -v option.
Solaris prtconf Command
The prtconf command displays system configuration information, including the total amount of memory and the device configuration as described by the system's device hierarchy.
To run prtconf, type:
The following is partial sample output.
The prtfru command displays specific information about the following FRUs:
The prtfru command also displays the contents of the FRU SEEPROMs:
The following is partial sample output from the prtfru command.
The prtpicl command displays the name and Platform Information and Control Library (PICL) class of all nodes in the PICL tree.
To display the high temperature and low temperature critical thresholds for each component, use the prtpicl -v option. See Environmental Failures for more information.
The following is partial sample output from the prtpicl command.
Solaris showrev Command
The showrev command displays revision information for the current hardware and software. When used with the -p option, this command displays installed patches.
The following is partial sample output from the showrev command with the -p option.
Solaris psrinfo Command
The psrinfo command displays the date and time each CPU came online.
The psrinfo command with the -v option displays additional information about the CPUs, including clock speed.
The following is sample output from the psrinfo command with the -v option.
If you are working from the ok prompt, you can use the OpenBoot show-devs command to list the devices in the system configuration. The following is sample show-devs output for a Sun Fire V890 server configured with a full complement of CPU/Memory boards, DIMMs, power supplies, and FC-AL disk backplanes. The system also includes a Sun StorEdge Dual Fibre Channel Host Adapter card to drive Loop B of the FC-AL mass storage subsystem. The show-devs output displays the device tree for the system. Helpful descriptions for most of the devices are provided to the right of the sample output.
Use the OpenBoot .env command to display the current environmental status information.
The following is sample output from the .env command.
Use the OpenBoot printenv command to display the OpenBoot configuration variables. The display includes the current values for these variables as well as the default values.
The following is sample output for the printenv command.
To diagnose problems with the SCSI or FC-AL devices, you can use the OpenBoot probe-scsi and probe-scsi-all commands. Both commands require that you get to the ok prompt after a reset.
Note - When it is not practical to halt the system, you can use SunVTS software as an alternative method of testing the SCSI and FC-AL interfaces. See About SunVTS Software for more information.
The probe-scsi command transmits an inquiry command to all SCSI and FC-AL devices connected to the on-board SCSI and FC-AL controllers. This includes any internal tape or DVD/CD-ROM drives connected to an optional SCSI controller. For any SCSI or FC-AL device that is connected and active, its target address, unit number, device type, and manufacturer name are displayed.
Note - You can also use the probe-scsi command to isolate failures on the FC-AL loop. See FC-AL Loop or Disk Drive Failure for more information.
The probe-scsi-all command transmits an inquiry command to all SCSI and FC-AL devices connected to the on-board SCSI and FC-AL controllers, and any host adapters installed in PCI slots. The first identifier listed in the display is the host adapter address in the system device tree, followed by the device identification data.
The following is sample output from the probe-scsi command.
This section describes how to diagnose the following problems:
The system is unable to communicate over the network.
Your system conforms to the Ethernet 10/100BASE-T standard, which states that the Ethernet 10BASE-T link integrity test function should always be enabled on both the host system and the Ethernet hub. If you have trouble establishing a connection between the Sun Fire V890 server and your Ethernet hub, verify that the Ethernet hub also has the link test function enabled.
This problem applies only to 10BASE-T network hubs, where the Ethernet link integrity test is optional. This is not a problem for 100BASE-T networks, where the test is enabled by default. Refer to the documentation provided with your Ethernet hub for more information about the link integrity test function.
Use the test command to test an individual network device. At the ok prompt, type test and the full path name of the device as shown in the following example:
If you connect the system to a network and the network does not respond, use the OpenBoot PROM command watch-net-all to display conditions for all network connections:
For most PCI Ethernet cards, the link integrity test function can be enabled or disabled with a hardware jumper on the PCI card, which you must set manually. (See the documentation supplied with the card.) For the standard TPE I/O board port, the link test is enabled or disabled through software, as described below.
Note - Some hub designs permanently enable or disable the link integrity test through a hardware jumper. In this case, refer to the hub installation or user manual for details of how the test is implemented.
To enable or disable the link integrity test for the standard Ethernet interface, or for a PCI-based Ethernet interface, you must first know the device name of the desired Ethernet interface. To list the device name, follow these steps:
1. Shut down the operating system and take the system to the ok prompt.
2. Determine the device name for the desired Ethernet interface:
b. In the show-devs listing, find the device name for the desired Ethernet interface.
The device name is /pci@9,700000/network@1,1 for the Fast Ethernet interface. For a PCI-based Ethernet interface, the device name may appear similar to the following: /pci@8,700000/pci@2/SUNW,hme@0,1
Use this method while the operating system is running:
1. Become superuser.
3. Reboot the system (when convenient) to make the changes effective.
Use this alternative method when the system is already at the OpenBoot prompt:
1. At the ok prompt, type:
2. Reboot the system to make the changes effective.
The system attempts to power on but does not boot or initialize the terminal or monitor.
1. Verify that the CPU/Memory boards are seated correctly.
2. Run POST diagnostics.
See Running POST Diagnostics.
3. Observe POST results.
Check the POST output using a locally attached terminal, tip connection, or RSC console. If you see no front panel LED activity, a power supply may be defective. See the Sun Fire V890 Server Owner's Guide for information about power supply LED indications.
If the front panel System Fault LED remains lit or the POST output contains an error message, POST has failed. The most probable cause for this type of failure is the motherboard.
4. Before you replace the motherboard, run the OpenBoot Diagnostics test-all command from the ok prompt or obdiag> prompt.
Note - To get to the ok prompt, you must set the OpenBoot PROM configuration variable auto-boot? to false and then reset the system. (The default setting for auto-boot? is true.) See Running OpenBoot Diagnostics for instructions.
5. If OpenBoot Diagnostics error messages show any defective components, remove or replace those components and run firmware diagnostics again.
Remove any failed components that are optional. Replace any failed components that are required for a minimum configuration. Be sure the required eight DIMMs are installed in groups A0 and B0 for each CPU/Memory board installed.
6. If POST still fails after you have removed or replaced all failed components, replace the motherboard.
No video at the system monitor.
1. Check that the power cord is connected to the monitor and to the wall outlet.
2. Verify with a volt-ohmmeter that the wall outlet is supplying AC power.
3. Verify that the video cable connection is secure between the monitor and the video output port.
Use a volt-ohmmeter to perform the continuity test on the video cable.
4. If the cables are connected securely, troubleshoot the monitor and the graphics card. Use the test command.
The system console has been redirected to an RSC console, but the RSC console is not working.
The most likely cause of this problem is a faulty system controller card. To recover from this problem and gain access to the system from a local system console, follow these steps:
1. Press the system Power button briefly to initiate a graceful software shutdown.
2. Make sure that the system is connected to a local console device.
Install a local console if necessary. See the Sun Fire V890 Server Owner's Guide for instructions.
3. Press and release the Power button and wait until the System Fault LED on the front panel begins to blink.
4. Immediately press the Power button twice (with a one-second delay between presses).
A screen similar to the following is displayed to indicate that you have successfully reset the OpenBoot NVRAM configuration variables to their default values.
By changing the NVRAM configuration variables to their default values, you temporarily redirect the system console to the local console device. Note that these NVRAM settings are reset to the defaults for this power cycle only. If you do nothing other than reset the system at this point, the values are not permanently changed. Only settings that you change manually at this point become permanent.
5. To permanently redirect the system console to the local console device, type the following commands at the system ok prompt:
6. To cause the changes to take effect, power cycle the system, or type:
The system permanently stores the parameter changes.
7. Run OpenBoot Diagnostics and/or SunVTS tests for the system controller card.
8. Replace the system controller card, if necessary.
A disk drive read, write, or parity error is reported by the operating system or a software application.
Replace the drive indicated by the failure message.
An internal FC-AL disk drive fails to boot or does not respond to commands, or an FC-AL loop fails to initialize.
Run OpenBoot Diagnostics tests for the mass storage subsystem.
1. At the ok prompt, type:
2. Power off the system.
3. Verify that all cables attached to the FC-AL disk backplanes are properly connected.
4. Power on the system and observe the POST status messages.
If POST reports a problem, replace the component indicated by the failure message and repeat POST diagnostics until the problem is resolved.
5. At the ok prompt, type:
The OpenBoot Diagnostics menu is displayed, followed by the obdiag> prompt.
6. Test segment 5 of the I2C bus (i2c@1,30) to verify that it is operating correctly.
Enter the test number corresponding to the i2c@1,30 test. For example:
I2C segment 5 must be working correctly in order to test the FC-AL subsystem. If this test fails, test the remaining segments of the I2C bus and replace the component or components indicated by the failure messages. Segment 5 test failures can also result from a faulty I2C cable.
7. Run the SSC-100 SES controller tests in the following order:
a. controller@0,16 - base backplane Loop A
b. controller@0,1c - expansion backplane Loop A (if installed)
c. controller@0,1a - base backplane Loop B
d. controller@0,1e - expansion backplane Loop B (if installed)
8. Run the ISP2200A FC-AL controller tests in the following order:
a. SUNW,qlc@2 - on-board FC-AL controller (Loop A)
b. SUNW,qlc@4 - PCI FC-AL controller (Loop B, if installed)
If a failure message identifies one or more specific disks, replace the disks with known good disks and repeat the testing. Disk failure messages identify a specific disk by its AL_PA address, according to the following table.
Other types of failures during the on-board controller test usually indicate a problem with the motherboard or the motherboard FC-AL cable. When testing the PCI controller, these types of failure messages point to the PCI card or the FC-AL cable between the card and the base backplane.
In a dual-backplane configuration, removing the FC-AL cables between backplanes and repeating the test can help to isolate the problem.
A DVD-ROM drive read error or parity error is reported by the operating system or a software application.
Replace the DVD-ROM drive.
The DVD-ROM drive fails to boot or does not respond to commands.
Test the drive response to the probe-ide command as follows.
Note - You must halt the system to execute the probe-ide command. If this is not practical, you can use the SunVTS software to test the DVD-ROM. See About SunVTS Software.
1. At the ok prompt, type:
2. Check the output message.
If a target address, unit number, device type, and manufacturer name are displayed for the device, the system IDE controller has successfully probed the device. This indicates that the motherboard is operating correctly.
3. Take one of the following actions, depending on what the probe-ide command reports:
a. Replace the DVD-ROM data cable.
b. If the problem is still evident after replacing the cable, replace the drive.
c. If the problem is still evident, replace the motherboard.
If there is a problem with a power supply, the environmental monitoring system lights the following LEDs:
In addition, the AC Status and DC Status LEDs at the rear of each power supply indicate any problem with the AC input and DC output, respectively. See the Sun Fire V890 Server Owner's Guide for more information about the LEDs.
After you identify the problem power supply, replace it according to the removal and installation instructions in the Sun Fire V890 Server Service Manual.
SunVTS and POST diagnostics can report memory errors encountered during program execution. Memory error messages typically indicate the location number ("J" number) of the failing DIMM.
1. Use the following diagram to identify the location of a failing DIMM from its J number.
2. After you identify the defective DIMM, replace it according to the removal and installation instructions in the Sun Fire V890 Server Service Manual.
The Sun Fire V890 server features an environmental monitoring subsystem designed to protect against:
Monitoring and control capabilities reside at the operating system level as well as in the system's flash PROM firmware. This ensures that monitoring capabilities remain operational even if the system has halted or is unable to boot.
The environmental monitoring subsystem uses an industry-standard I2C bus. The I2C bus is a simple two-wire serial bus, used throughout the system to allow the monitoring and control of temperature sensors, fans, power supplies, status LEDs, and the front panel keyswitch.
Temperature sensors are located throughout the system to monitor the ambient temperature of the system and the temperature of each CPU. The monitoring subsystem frequently polls each sensor and uses the sampled temperatures to report and respond to any overtemperature or undertemperature conditions.
The hardware and software together ensure that the temperatures within the enclosure do not stray outside predetermined "safe operation" ranges. If the temperature observed by a sensor falls below a low-temperature warning threshold or rises above a high-temperature warning threshold, the monitoring subsystem software generates a Warning message to the system console. If the temperature falls below a low-temperature critical threshold or rises above a high-temperature critical threshold, the software issues a Critical message and proceeds to gracefully shut down the system. In both cases, the System Fault and Thermal Fault LEDs on the front status panel are illuminated to indicate the nature of the problem.
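The threshold behavior described above can be sketched in a few lines of code. This is an illustration only, not the monitoring subsystem's implementation: the Thresholds type, the classify and response functions, and the treatment of a reading that sits exactly on a threshold (counted as in range here) are all assumptions. Actual threshold values are platform data and are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Four limits for one sensor (hypothetical structure)."""
    low_critical: float
    low_warning: float
    high_warning: float
    high_critical: float


def classify(temp, t):
    """Return "ok", "warning", or "critical" for one sensor reading."""
    # Critical limits are checked first because they lie outside the warning limits.
    if temp < t.low_critical or temp > t.high_critical:
        return "critical"
    if temp < t.low_warning or temp > t.high_warning:
        return "warning"
    return "ok"


def response(level):
    """Map a classification to the reactions described in the text."""
    if level == "ok":
        return {"console": None, "shutdown": False, "leds": []}
    return {
        "console": "Warning" if level == "warning" else "Critical",
        # Only a critical condition leads to a graceful shutdown.
        "shutdown": level == "critical",
        # Both conditions light the same two front panel LEDs.
        "leds": ["System Fault", "Thermal Fault"],
    }
```

To see the actual thresholds for a given component, use the prtpicl -v option described earlier in this chapter rather than relying on any example values.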
This thermal shutdown capability is also built into the hardware circuitry as a fail-safe measure. This feature provides backup thermal protection in the unlikely event that the environmental monitoring subsystem becomes disabled at both the software and firmware levels.
All error and warning messages are displayed on the system console (if one is attached) and are logged in the /var/adm/messages file. Front panel fault LEDs remain lit after an automatic system shutdown to aid in problem diagnosis.
The monitoring subsystem is also designed to detect fan failures. The basic system features three primary fan trays, which include a total of five individual fans. Systems equipped with the redundant cooling option include three additional (secondary) fan trays for a total of 10 individual fans. During normal operation, only the five primary fans are active.
If any primary fan fails, the monitoring subsystem detects the failure and performs the following:
The power subsystem is monitored in a similar fashion. The monitoring subsystem periodically polls the power supply status registers for a power supply OK status, indicating the status of each supply's 3.3V, 5.0V, 12V, and 48V DC outputs.
If a power supply problem is detected, an error message is displayed on the system console and logged in the /var/adm/messages file. The System Fault and Power Fault LEDs on the status and control panel are also lit. LEDs located on the back of each power supply indicate the source and nature of the fault.
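As a rough illustration of the polling logic, the sketch below reduces a set of per-output status readings to a list of failed outputs per supply and to the front panel LEDs that would be lit. The data layout, the supply names used as dictionary keys, and the choice to treat a missing reading as a failure are assumptions made for the example; the real status registers are not modeled.

```python
# DC outputs whose status the monitoring subsystem checks, as listed in the text.
OUTPUTS = ("3.3V", "5.0V", "12V", "48V")


def failed_outputs(status):
    """Return {supply: [failed outputs]} for supplies that report a problem.

    status maps a supply name (hypothetical) to a dict of output name -> bool,
    where True means the output reports OK.
    """
    faults = {}
    for supply, outputs in status.items():
        # Assumption: a reading that is absent is treated as not OK.
        bad = [name for name in OUTPUTS if not outputs.get(name, False)]
        if bad:
            faults[supply] = bad
    return faults


def panel_leds(faults):
    """Return the front panel LEDs lit for the given faults."""
    return ["System Fault", "Power Fault"] if faults else []
```

A result such as a single supply with a single failed output points you to the LEDs on the back of that supply for the source and nature of the fault.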
Note - The Sun Fire V890 server power supplies have their own built-in overtemperature protection circuits that will automatically shut down the supplies in response to certain overtemperature and power fault conditions. To recover from an automatic power supply shutdown, you must disconnect the AC power cord, wait approximately 10 seconds, and then reconnect the power cord.
The error messages generated by the monitoring subsystem in response to an environmental error condition are listed and described in the following table. The environmental error messages are displayed on the system console (if one is attached) and logged in the /var/adm/messages file.
Indicates that the temperature measured at Temperature-Sensor has exceeded the critical threshold. This message is displayed briefly and then followed by the shutdown message, "The system will be shutting down in one minute." After one minute, the system automatically shuts down.
Indicates that the temperature measured at Temperature-Sensor has fallen below the critical threshold. This message is displayed briefly and then followed by the shutdown message, "The system will be shutting down in one minute." After one minute, the system automatically shuts down.
Indicates that the temperature measured at Temperature-Sensor has exceeded the warning threshold. If the temperature continues to rise and exceeds the critical threshold, the system issues the "CRITICAL: HIGH TEMPERATURE..." warning and the shutdown message.
Indicates that the temperature measured at Temperature-Sensor has fallen below the warning threshold. If the temperature continues to fall and goes below the critical threshold, the system issues the "CRITICAL: LOW TEMPERATURE..." warning and the shutdown message.
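Because these messages are logged in /var/adm/messages, you can pull the critical temperature events out of a saved copy of the log with a short script. The following is a minimal sketch: it matches only the two message prefixes quoted above and makes no assumption about the rest of each message. The function name is hypothetical.

```python
# Prefixes quoted in the message descriptions; the remainder of each
# message is not assumed here.
CRITICAL_PREFIXES = (
    "CRITICAL: HIGH TEMPERATURE",
    "CRITICAL: LOW TEMPERATURE",
)


def critical_temperature_lines(lines):
    """Return the log lines that contain a critical temperature message.

    lines is any iterable of strings, such as an open text file.
    Trailing newlines are stripped from the returned lines.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if any(prefix in line for prefix in CRITICAL_PREFIXES)
    ]
```

For example, you might open a copy of /var/adm/messages in text mode and pass the file object directly to critical_temperature_lines.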