CHAPTER 7

Troubleshooting Hardware Problems

The term troubleshooting refers to the act of applying diagnostic tools--often heuristically and accompanied by common sense--to determine the causes of system problems.

Each system problem must be treated on its own merits. It is not possible to provide a cookbook of actions that resolve each problem. However, this chapter provides some approaches and procedures that, used in combination with experience and common sense, can resolve many problems that might arise.

Tasks covered in this chapter include:

Other information in this chapter includes:


Information to Gather During Troubleshooting

Familiarity with a wide variety of equipment and experience with a particular machine's common failure modes can be invaluable when troubleshooting system problems. Establishing a systematic approach to investigating and solving a particular system's problems helps you quickly identify and remedy most issues as they arise.

The Netra 440 server indicates and logs events and errors in a variety of ways. Depending on the system's configuration and software, certain types of errors are captured only temporarily. Therefore, you must observe and record all available information immediately, before you attempt any corrective action. POST, for instance, accumulates a list of failed components across resets. However, failed component information is cleared after a system reset. Similarly, the state of LEDs in a hung system is lost when the system reboots or resets.

If you encounter any system problems that are not familiar to you, gather as much information as you can before you attempt any remedial actions. The following task listing outlines a basic approach to information gathering.

Error Information From the ALOM System Controller

In most troubleshooting situations, you can use the ALOM system controller as the primary source of information about the system. On the Netra 440 server, the ALOM system controller provides you with access to a variety of system logs and other information about the system, even when the system is powered off. For more information about ALOM, see:

Error Information From the System

Depending on the state of the system, you should check as many of the following sources as possible for error indications and record the information found.

Recording Information About the System

As part of your standard operating procedures, it is important to have the following information about your system readily available:

Having all of this information available and verified makes it easier for you to recognize any problems already identified by others. This information is also required if you contact Sun support or your authorized support provider.

It is vital to know the version and patch revision levels of the system's operating system, patch revision levels of the firmware, and your specific hardware configuration before you attempt to fix any problems. Problems often occur after changes have been made to the system. Some errors are caused by hardware and software incompatibilities and interactions. If you have all system information available, you might be able to quickly fix a problem by simply updating the system's firmware. Knowing about recent upgrades or component replacements might help you avoid replacing components that are not faulty.


System Error States

When troubleshooting, it is important to understand what kind of error has occurred, to distinguish between real and apparent system hangs, and to respond appropriately to error conditions so as to preserve valuable information.

Responding to System Error States

Depending on the severity of a system error, a Netra 440 server might or might not respond to commands you issue to the system. Once you have gathered all available information, you can begin taking action. Your actions depend on the information you have already gathered and the state of the system.

Remember these guidelines:

Responding to System Hang States

Troubleshooting a hanging system can be a difficult process because the root cause of the hang might be masked by false error indications from another part of the system. Therefore, it is important that you carefully examine all the information sources available to you before you attempt any remedy. Also, it is helpful to understand the type of hang the system is experiencing. This hang state information is especially important to Sun support services engineers, should you contact them.

A system soft hang can be characterized by any of the following symptoms:

Some soft hangs might dissipate on their own, while others will require that the system be interrupted to gather information at the OpenBoot prompt level. A soft hang should respond to a break signal that is sent through the system console.

A system hard hang leaves the system unresponsive to a system break sequence. You will know that a system is in a hard hang state when you have attempted all the soft hang remedies with no success.

See Troubleshooting a System That Is Hanging.

Responding to Fatal Reset Errors and RED State Exceptions

Fatal Reset errors and RED State Exceptions are most often caused by hardware problems. Hardware Fatal Reset errors are the result of an "illegal" hardware state that is detected by the system. A hardware Fatal Reset error can either be a transient error or a hard error. A transient error causes intermittent failures. A hard error causes persistent failures that occur in the same way each time. CODE EXAMPLE 7-1 shows a sample Fatal Reset error alert from the system console.

CODE EXAMPLE 7-1 Fatal Reset Error Alert
Sun-SFV440-a console login: 
 
Fatal Error Reset 
CPU 0000.0000.0000.0002 AFSR 0210.9000.0200.0000  JETO PRIV OM TO
AFAR 0000.0280.0ec0.c180 
SC Alert: Host System has Reset
 
SC Alert: Host System has read and cleared bootmode.
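
When you record a Fatal Reset alert for a support call, it can help to store the CPU, AFSR, and AFAR values in a structured form. The following Python fragment is a minimal sketch of one way to do that. The helper names dotted_hex and parse_fatal_reset are hypothetical, and the field layout is inferred only from the sample alert in CODE EXAMPLE 7-1, so other firmware revisions might format the alert differently.

```python
import re

# Sample text copied from CODE EXAMPLE 7-1.
ALERT = """\
Fatal Error Reset
CPU 0000.0000.0000.0002 AFSR 0210.9000.0200.0000  JETO PRIV OM TO
AFAR 0000.0280.0ec0.c180
SC Alert: Host System has Reset
"""


def dotted_hex(value):
    """Convert a dotted hex string such as 0000.0000.0000.0002 to an int."""
    return int(value.replace(".", ""), 16)


def parse_fatal_reset(text):
    """Extract CPU, AFSR, AFSR flag names, and AFAR from a console capture.

    Hypothetical helper: the layout is assumed from the sample alert only.
    """
    record = {}
    cpu = re.search(
        r"^CPU\s+([0-9a-f.]+)\s+AFSR\s+([0-9a-f.]+)[ \t]*(.*)$", text, re.M
    )
    if cpu:
        record["cpu"] = dotted_hex(cpu.group(1))
        record["afsr"] = dotted_hex(cpu.group(2))
        # Flag names are kept as printed; this sketch does not interpret them.
        record["afsr_flags"] = cpu.group(3).split()
    afar = re.search(r"^AFAR\s+([0-9a-f.]+)", text, re.M)
    if afar:
        record["afar"] = dotted_hex(afar.group(1))
    return record


if __name__ == "__main__":
    print(parse_fatal_reset(ALERT))
```

The sketch only extracts the values as printed. Interpreting the AFSR flag names is left to the processor documentation or your support provider.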

A RED State Exception condition is most commonly a hardware fault that is detected by the system. There is no recoverable information that you can use to troubleshoot a RED State Exception. The Exception causes a loss of system integrity, which would jeopardize the system if Solaris software continued to operate. Because of this, Solaris software terminates ungracefully without logging any details of the RED State Exception error in the /var/adm/messages file. CODE EXAMPLE 7-2 shows a sample RED State Exception alert from the system console.

CODE EXAMPLE 7-2 RED State Exception Alert
Sun-SFV440-a console login: 
 
RED State Exception
Error enable reg: 0000.0001.00f0.001f 
ECCR: 0000.0000.02f0.4c00 
CPU: 0000.0000.0000.0002 
TL=0000.0000.0000.0005  TT=0000.0000.0000.0010 
   TPC=0000.0000.0100.4200  TnPC=0000.0000.0100.4204  TSTATE=0000.0044.8200.1507 
TL=0000.0000.0000.0004  TT=0000.0000.0000.0010 
   TPC=0000.0000.0100.4200  TnPC=0000.0000.0100.4204  TSTATE=0000.0044.8200.1507 
TL=0000.0000.0000.0003  TT=0000.0000.0000.0010 
   TPC=0000.0000.0100.4680  TnPC=0000.0000.0100.4684  TSTATE=0000.0044.8200.1507 
TL=0000.0000.0000.0002  TT=0000.0000.0000.0034 
   TPC=0000.0000.0100.7164  TnPC=0000.0000.0100.7168  TSTATE=0000.0044.8200.1507 
TL=0000.0000.0000.0001  TT=0000.0000.0000.004e 
   TPC=0000.0001.0001.fd24  TnPC=0000.0001.0001.fd28  TSTATE=0000.0000.8200.1207 
 
SC Alert: Host System has Reset
 
SC Alert: Host System has read and cleared bootmode.
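
Because Solaris software logs no details of a RED State Exception, the console capture is the only record of the trap stack. As a rough sketch, the following Python fragment groups the TL, TT, TPC, TnPC, and TSTATE fields into one record per trap level so that they can be attached to a service request. The function name parse_trap_stack is hypothetical, and the field layout is assumed from CODE EXAMPLE 7-2 only.

```python
import re

# Abbreviated sample based on CODE EXAMPLE 7-2.
SAMPLE = """\
RED State Exception
CPU: 0000.0000.0000.0002
TL=0000.0000.0000.0005  TT=0000.0000.0000.0010
   TPC=0000.0000.0100.4200  TnPC=0000.0000.0100.4204  TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0004  TT=0000.0000.0000.0010
   TPC=0000.0000.0100.4200  TnPC=0000.0000.0100.4204  TSTATE=0000.0044.8200.1507
TL=0000.0000.0000.0001  TT=0000.0000.0000.004e
   TPC=0000.0001.0001.fd24  TnPC=0000.0001.0001.fd28  TSTATE=0000.0000.8200.1207
"""

FIELD = re.compile(r"(TL|TT|TPC|TnPC|TSTATE)=([0-9a-f.]+)")


def parse_trap_stack(text):
    """Return one dict per trap level, in the order printed on the console."""
    frames = []
    for name, value in FIELD.findall(text):
        number = int(value.replace(".", ""), 16)
        if name == "TL":
            # Each TL= field starts a new trap-level record.
            frames.append({"TL": number})
        elif frames:
            frames[-1][name] = number
    return frames


if __name__ == "__main__":
    for frame in parse_trap_stack(SAMPLE):
        print({key: hex(val) for key, val in frame.items()})
```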

In some isolated cases, software can cause a Fatal Reset error or RED State Exception. Typically, these are device driver problems that can be identified easily. You can obtain this information through SunSolve Online (see Web Sites), or by contacting Sun or the third-party driver vendor.

The most important pieces of information to gather when diagnosing a Fatal Reset error or RED State Exception are:

Capturing system console indications and messages at the time of the error can help you isolate the true cause of the error. In some cases, the true cause of the original error might be masked by false error indications from another part of the system. For example, POST results (shown by the output from the prtdiag command) might indicate failed components, when, in fact, the "failed" components are not the actual cause of the Fatal Reset error. In most cases, a good component will actually report the Fatal Reset error.

By analyzing the system console output at the time of the error, you can avoid replacing components based on these false error indications. In addition, knowing the service history of a system experiencing transient errors can help you avoid repeatedly replacing "failed" components that do not fix the problem.


Unexpected Reboots

Sometimes, a system might reboot unexpectedly. In that case, ensure that the reboot was not caused by a panic. For example, L2-cache errors that occur in user space (not kernel space) might cause Solaris software to log the L2-cache failure data and reboot the system. The information logged might be sufficient to troubleshoot and correct the problem. If the reboot was not caused by a panic, it might be caused by a Fatal Reset error or a RED State Exception. See Troubleshooting Fatal Reset Errors and RED State Exceptions.

Also, system ASR and POST settings can determine the system response to certain error conditions. If the system message and system console files do not clearly indicate the source of the reboot, and either POST was not invoked during the reboot process or the system diagnostics level was not set to max, you might need to run system diagnostics at a higher level of coverage to determine the source of the reboot.


Troubleshooting a System With the Operating System Responding

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.


To Troubleshoot a System With the Operating System Running

1. Log in to the system controller and access the sc> prompt.

For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:

sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-3 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-3 showlogs Command Output
MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>
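
In a long event log, the last reported state of each indicator and the number of host resets are usually the first things to look for. The following Python fragment is a minimal sketch that summarizes a saved copy of showlogs output. The function name summarize is hypothetical, and the entry format is assumed from the sample event log above.

```python
import re

# Abbreviated sample based on the showlogs output above.
LOG = '''\
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
'''

# time stamp, host name, 8-digit event code, quoted message
ENTRY = re.compile(
    r'^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+): ([0-9a-f]{8}): "(.*)"\s*$'
)
INDICATOR = re.compile(r"Indicator (\S+) is now (\w+)")


def summarize(log_text):
    """Return (number of host resets, last reported state of each indicator)."""
    resets = 0
    indicators = {}
    for line in log_text.splitlines():
        entry = ENTRY.match(line)
        if not entry:
            continue  # skip prompts and lines that are not log entries
        message = entry.group(4)
        if message == "Host System has Reset":
            resets += 1
        state = INDICATOR.match(message)
        if state:
            # Later entries overwrite earlier ones, leaving the latest state.
            indicators[state.group(1)] = state.group(2)
    return resets, indicators


if __name__ == "__main__":
    print(summarize(LOG))
```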



Note - Time stamps for ALOM logs reflect Coordinated Universal Time (UTC), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.
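
To line up ALOM entries with Solaris messages, you can convert an ALOM time stamp to the server's local time. The following Python fragment is a sketch under two stated assumptions: ALOM log entries do not show a year, so the caller supplies one, and the server's offset from UTC is passed as a fixed number of hours (daylight saving time changes are not handled). The function name alom_to_local is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Month abbreviations as they appear in ALOM log entries.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def alom_to_local(stamp, year, utc_offset_hours):
    """Convert an ALOM time stamp such as 'MAY 09 16:54:27' (UTC) to local time.

    year and utc_offset_hours are supplied by the caller (assumptions).
    """
    month, day, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    utc_time = datetime(
        year, MONTHS[month.upper()], int(day), hour, minute, second,
        tzinfo=timezone.utc,
    )
    return utc_time.astimezone(timezone(timedelta(hours=utc_offset_hours)))


if __name__ == "__main__":
    # Example values only: year 2003 and an offset of UTC-4.
    print(alom_to_local("MAY 09 16:54:27", 2003, -4))
```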



3. Examine system environment status. Type:

sc> showenvironment

The showenvironment command reports useful data, including temperature readings; the state of system and component LEDs; motherboard voltages; and the status of system disks, fans, motherboard circuit breakers, and CPU module DC-to-DC converters. CODE EXAMPLE 7-4, an excerpt of output from the showenvironment command, indicates that the front panel Service Required LED is ON. When reviewing the complete output from the showenvironment command, check the state of all Service Required LEDs and verify that all components show a status of OK. See CODE EXAMPLE 4-1 for a sample of complete output from the showenvironment command.

CODE EXAMPLE 7-4 showenvironment Command Output
System Indicator Status:
---------------------------------------------------
SYS_FRONT.LOCATE     SYS_FRONT.SERVICE    SYS_FRONT.ACT       
--------------------------------------------------------
OFF                  ON                   ON                  
.
.
.
sc> 
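
If you save showenvironment output to a file, the indicator table can be checked programmatically. The following Python fragment is a minimal sketch that pairs each indicator name with its state. The function name parse_indicators is hypothetical, and the two-row table layout is assumed from the excerpt above.

```python
# Sample text based on the showenvironment excerpt above.
EXCERPT = """\
System Indicator Status:
---------------------------------------------------
SYS_FRONT.LOCATE     SYS_FRONT.SERVICE    SYS_FRONT.ACT
--------------------------------------------------------
OFF                  ON                   ON
"""


def parse_indicators(text):
    """Map each system indicator name to its reported state."""
    lines = [line.strip() for line in text.splitlines()]
    start = lines.index("System Indicator Status:")
    # Keep non-empty lines that are not made up solely of dashes.
    rows = [line for line in lines[start + 1:] if line and not set(line) <= {"-"}]
    names, states = rows[0].split(), rows[1].split()
    return dict(zip(names, states))


if __name__ == "__main__":
    status = parse_indicators(EXCERPT)
    if status.get("SYS_FRONT.SERVICE") == "ON":
        print("Service Required LED is ON")
```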

4. Examine the output of the prtdiag -v command. Type:

sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-5 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-5 prtdiag -v Command Output
System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB        
 
==================================== CPUs ====================================
                      E$          CPU     CPU       Temperature         Fan
       CPU  Freq      Size        Impl.   Mask     Die    Ambient   Speed   Unit
       ---  --------  ----------  ------  ----  --------  --------  -----   ----
        0  1062 MHz  1MB         US-IIIi   2.3     -     -  
        1  1062 MHz  1MB         US-IIIi   2.3     -     -  
 
================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   pci    66           MB  pci108e,abba (network)        SUNW,pci-ce        
 0   pci    33           MB  isa/su (serial)                                  
 0   pci    33           MB  isa/su (serial)                                  
.
.
.
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        C0/P0/B0/D0,C0/P0/B0/D1
0              1        C0/P0/B1/D0,C0/P0/B1/D1
 
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
1              0        C1/P0/B0/D0,C1/P0/B0/D1
1              1        C1/P0/B1/D0,C1/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26  
# 
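
The checks described for prtdiag -v output (all memory modules listed, PROM revision as expected) can be sketched in Python as follows. The function names and the EXPECTED_LABELS inventory are hypothetical. Substitute the labels for the memory modules actually installed in your system. The row layout is assumed from the excerpt above.

```python
import re

# Abbreviated sample based on the prtdiag -v excerpt above.
EXCERPT = """\
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        C0/P0/B0/D0,C0/P0/B0/D1
0              1        C0/P0/B1/D0,C0/P0/B1/D1

System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26
"""

# Hypothetical inventory for illustration only.
EXPECTED_LABELS = ["C0/P0/B0/D0", "C0/P0/B0/D1", "C0/P0/B1/D0", "C1/P0/B0/D0"]


def memory_labels(text):
    """Collect the labels from every Memory Module Groups row."""
    labels = []
    for line in text.splitlines():
        row = re.match(r"^(\d+)\s+(\d+)\s+(C\d+/\S+)\s*$", line)
        if row:
            labels.extend(row.group(3).split(","))
    return labels


def missing_modules(expected, text):
    """Return expected labels that do not appear in the prtdiag output."""
    return sorted(set(expected) - set(memory_labels(text)))


def prom_revisions(text):
    """Return the OBP and OBDIAG version strings."""
    return dict(re.findall(r"^(OBP|OBDIAG) (\S+)", text, re.M))


if __name__ == "__main__":
    print(missing_modules(EXPECTED_LABELS, EXCERPT))
    print(prom_revisions(EXCERPT))
```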

5. Check the system LEDs.

6. Check the /var/adm/messages file.

The following are clear indications of a failing part:

If there is no clear indication of a failing part, investigate the installed applications, the network, or the disk configuration.

If you have clear indications that a part has failed or is failing, replace that part as soon as possible.

If the problem is a confirmed environmental failure, replace the fan or power supply as soon as possible.

A system with a redundant configuration might still operate in a degraded state, but the stability and performance of the system will be affected. Because the system is still operational, attempt to isolate the fault using several methods and tools to ensure that the part you suspect is faulty really is causing the problems you are experiencing. See Isolating Faults in the System.

For information about installing and replacing field-replaceable parts, refer to the Netra 440 Server Service Manual (817-3883-xx).


Troubleshooting a System After an Unexpected Reboot

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.


To Troubleshoot a System After an Unexpected Reboot

1. Log in to the system controller and access the sc> prompt.

For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:

sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-6 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-6 showlogs Command Output
MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>



Note - Time stamps for ALOM logs reflect Coordinated Universal Time (UTC), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.



3. Examine the ALOM run log. Type:

sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris OS. When troubleshooting, examine the output for hardware or software errors logged by the operating environment on the system console. CODE EXAMPLE 7-7 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-7 consolehistory run -v Command Output
May  9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.
 
# 
# init 0
# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
Print services stopped.
May  9 14:49:18 Sun-SFV440-a last message repeated 1 time
 
May  9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
 
The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk
 
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
 
Initializing     1MB of memory at addr        123fecc000 -
                                                                      
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    14MB of memory at addr        123f002000 -
                                                                      
Initializing    16MB of memory at addr        123e002000 -
                                                                      
Initializing   992MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
                                                                      
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up.  Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.
 
Sun-SFV440-a console login: May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
 
May  9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
 
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
 
May  9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
 
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
 
May  9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
 
sc> 

4. Examine the ALOM boot log. Type:

sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests.

CODE EXAMPLE 7-8 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-8 consolehistory boot -v Command Output (Boot Messages From POST)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
Power-On Reset
Executing Power On SelfTest
0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08 
       /export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM 
0>I/O port set to TTYA.
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1>      Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1>      Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0>      Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0>      Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0>      POST Passed all devices.
0>POST: Return to OBP.
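
When reviewing a saved boot log, it is useful to note the diagnostic level at which POST ran, because a pass at a low level gives less coverage than a pass at max. The following Python fragment is a minimal sketch that extracts the diagnostic level, the CPUs reported as present, and the overall pass message. The function name post_summary is hypothetical, and the message wording is assumed from the POST output above. The sketch does not attempt to recognize POST error messages.

```python
import re

# Abbreviated sample based on the POST output above.
POST_LOG = """\
0>CPUs present in system: 0 1
0>Diag level set to MIN.
0>Start selftest...
1>Print Mem Config
0>INFO:
0>      POST Passed all devices.
0>POST: Return to OBP.
"""


def post_summary(text):
    """Summarize a POST transcript: diag level, CPUs present, pass message."""
    level = re.search(r"Diag level set to (\w+)\.", text)
    cpus = re.search(r"CPUs present in system:\s*([\d ]+)", text)
    return {
        "diag_level": level.group(1) if level else None,
        "cpus": [int(n) for n in cpus.group(1).split()] if cpus else [],
        "passed": "POST Passed all devices." in text,
    }


if __name__ == "__main__":
    summary = post_summary(POST_LOG)
    print(summary)
    if summary["passed"] and summary["diag_level"] != "MAX":
        print("POST passed, but not at the max diagnostic level")
```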

CODE EXAMPLE 7-9 shows the initialization of the OpenBoot PROM.

CODE EXAMPLE 7-9 consolehistory boot -v Command Output (OpenBoot PROM Initialization)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
POST Results: Cpu 0000.0000.0000.0000 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
POST Results: Cpu 0000.0000.0000.0001 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
Membase: 0000.0000.0000.0000 
MemSize: 0000.0000.0004.0000 
Init CPU arrays Done
Probing /pci@1d,700000 Device 1  Nothing there 
Probing /pci@1d,700000 Device 2  Nothing there 

The following sample output shows the system banner.

CODE EXAMPLE 7-10 consolehistory boot -v Command Output (System Banner Display)
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

The following sample output shows OpenBoot Diagnostics testing. See What OpenBoot Diagnostics Error Messages Tell You for a sample OpenBoot Diagnostics error message and more information about OpenBoot Diagnostics error messages.

CODE EXAMPLE 7-11 consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)
Running diagnostic script obdiag/normal
 
Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2

The following sample output shows memory initialization by the OpenBoot PROM.

CODE EXAMPLE 7-12 consolehistory boot -v Command Output (Memory Initialization)
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    12MB of memory at addr        123f000000 -
                                                                      
Initializing  1008MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
 
{1} ok boot disk

The following sample output shows the system booting and loading Solaris software.

CODE EXAMPLE 7-13 consolehistory boot -v Command Output (System Booting and Loading Solaris Software)
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54. 
FCode UFS Reader 1.11 97/07/10 16:19:15. 
Loading: /platform/SUNW,Netra-440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
sc> 

5. Check the /var/adm/messages file for indications of an error.

Look for the following information about the system's state:

6. If possible, check whether the system saved a core dump file.

Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide.

7. Check the system LEDs.

You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for information about system LEDs.

8. Examine the output of the prtdiag -v command. Type:

sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-14 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-14 prtdiag -v Command Output
System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB        
 
==================================== CPUs ====================================
                      E$          CPU     CPU       Temperature         Fan
       CPU  Freq      Size        Impl.   Mask     Die    Ambient   Speed   Unit
       ---  --------  ----------  ------  ----  --------  --------  -----   ----
        0  1062 MHz  1MB         US-IIIi   2.3     -     -  
        1  1062 MHz  1MB         US-IIIi   2.3     -     -  
 
================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   pci    66           MB  pci108e,abba (network)        SUNW,pci-ce        
 0   pci    33           MB  isa/su (serial)                                  
 0   pci    33           MB  isa/su (serial)                                  
 
.
.
.
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        C0/P0/B0/D0,C0/P0/B0/D1
0              1        C0/P0/B1/D0,C0/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26  
# 

9. Verify that all user and system processes are functional. Type:

# ps -ef

Output from the ps -ef command shows each process, the start time, the run time, and the full process command-line options. To identify a system problem, examine the output for missing entries in the CMD column. CODE EXAMPLE 7-15 shows the ps -ef command output of a "healthy" Netra 440 server.

CODE EXAMPLE 7-15 ps -ef Command Output
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0 14:51:32 ?        0:17 sched
    root     1     0  0 14:51:32 ?        0:00 /etc/init -
    root     2     0  0 14:51:32 ?        0:00 pageout
    root     3     0  0 14:51:32 ?        0:02 fsflush
    root   291     1  0 14:51:47 ?        0:00 /usr/lib/saf/sac -t 300
    root   205     1  0 14:51:44 ?        0:00 /usr/lib/lpsched
    root   312   148  0 14:54:33 ?        0:00 in.telnetd
    root   169     1  0 14:51:42 ?        0:00 /usr/lib/autofs/automountd
   user1   314   312  0 14:54:33 pts/1    0:00 -csh
    root    53     1  0 14:51:36 ?        0:00 /usr/lib/sysevent/syseventd
    root    59     1  0 14:51:37 ?        0:02 /usr/lib/picl/picld
    root   100     1  0 14:51:40 ?        0:00 /usr/sbin/in.rdisc -s
    root   131     1  0 14:51:40 ?        0:00 /usr/lib/netsvc/yp/ypbind -broadcast
    root   118     1  0 14:51:40 ?        0:00 /usr/sbin/rpcbind
    root   121     1  0 14:51:40 ?        0:00 /usr/sbin/keyserv
    root   148     1  0 14:51:42 ?        0:00 /usr/sbin/inetd -s
    root   218     1  0 14:51:44 ?        0:00 /usr/lib/power/powerd
    root   199     1  0 14:51:43 ?        0:00 /usr/sbin/nscd
    root   162     1  0 14:51:42 ?        0:00 /usr/lib/nfs/lockd
  daemon   166     1  0 14:51:42 ?        0:00 /usr/lib/nfs/statd
    root   181     1  0 14:51:43 ?        0:00 /usr/sbin/syslogd
    root   283     1  0 14:51:47 ?        0:00 /usr/lib/dmi/snmpXdmid -s Sun-SFV440-a
    root   184     1  0 14:51:43 ?        0:00 /usr/sbin/cron
    root   235   233  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
    root   233     1  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
    root   245     1  0 14:51:45 ?        0:00 /usr/sbin/vold
    root   247     1  0 14:51:45 ?        0:00 /usr/lib/sendmail -bd -q15m
    root   256     1  0 14:51:45 ?        0:00 /usr/lib/efcode/sparcv9/efdaemon
    root   294   291  0 14:51:47 ?        0:00 /usr/lib/saf/ttymon
    root   304   274  0 14:51:51 ?        0:00 mibiisa -r -p 32826
    root   274     1  0 14:51:46 ?        0:00 /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
    root   334   292  0 15:00:59 console  0:00 ps -ef
# 
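
The check for missing CMD entries can be partly scripted. The following Python sketch is an illustration only and is not part of the Solaris or Netra 440 tooling. It assumes that you have captured the ps -ef output as text and that you maintain your own list of commands that should be running. The names missing_commands, SAMPLE, and EXPECTED are hypothetical.

```python
def missing_commands(ps_output, expected):
    """Return the names in `expected` that are absent from the CMD column."""
    running = set()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        # Assumes STIME is a single token such as 14:51:32; processes whose
        # STIME prints as a date with a space would need extra handling.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # prompt lines or blank lines
        program = fields[7].split()[0]
        running.add(program.rsplit("/", 1)[-1])  # keep only the base name
    return [name for name in expected if name not in running]


# Excerpt of the sample ps -ef output.
SAMPLE = """\
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0 14:51:32 ?        0:17 sched
    root    59     1  0 14:51:37 ?        0:02 /usr/lib/picl/picld
    root   148     1  0 14:51:42 ?        0:00 /usr/sbin/inetd -s
    root   181     1  0 14:51:43 ?        0:00 /usr/sbin/syslogd
"""

# Hypothetical list of commands that this site expects to be running.
EXPECTED = ["picld", "inetd", "syslogd", "cron"]

print(missing_commands(SAMPLE, EXPECTED))
```

Because the excerpt omits the cron entry, the sketch reports cron as missing. A missing entry is only a prompt for further investigation, not proof of a hardware fault.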

10. Verify that all I/O devices and activities are still present and functioning. Type:

# iostat -xtc

This command shows all I/O devices and reports activity for each device. To identify a problem, examine the output for installed devices that are not listed. CODE EXAMPLE 7-16 shows the iostat -xtc command output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-16 iostat -xtc Command Output
                  extended device statistics                      tty         cpu
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
sd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0    0  183   0  2  2 96
sd1          6.5    1.2   49.5    7.9  0.0  0.2   24.6   0   3 
sd2          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd3          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd4          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
nfs2         0.0    0.0    0.1    0.0  0.0  0.0    9.6   0   0 
nfs3         0.1    0.0    0.6    0.0  0.0  0.0    1.4   0   0 
nfs4         0.0    0.0    0.1    0.0  0.0  0.0    5.1   0   0 
# 
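
One way to spot an installed device that is not listed is to compare the device column against an inventory. The following Python sketch is a minimal illustration under stated assumptions: the iostat -xtc output has been saved as text, and the list of installed devices is supplied by you. The function names and the INSTALLED list are hypothetical.

```python
def listed_devices(iostat_output):
    """Return the device names from the rows that follow the 'device' header."""
    devices = []
    seen_header = False
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "device":
            seen_header = True
            continue
        # A data row has the device name plus at least nine statistics columns.
        if seen_header and len(fields) >= 10:
            devices.append(fields[0])
    return devices


def missing_devices(iostat_output, installed):
    """Return the installed devices that iostat did not list."""
    present = set(listed_devices(iostat_output))
    return [name for name in installed if name not in present]


# Excerpt of the sample iostat -xtc output.
SAMPLE = """\
                  extended device statistics                      tty         cpu
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
sd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0    0  183   0  2  2 96
sd1          6.5    1.2   49.5    7.9  0.0  0.2   24.6   0   3
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
"""

# Hypothetical inventory of devices known to be installed.
INSTALLED = ["sd0", "sd1", "sd2"]

print(missing_devices(SAMPLE, INSTALLED))
```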

11. Examine errors pertaining to I/O devices. Type:

# iostat -E

This command reports errors for each I/O device. To identify a problem, examine the output for any error count greater than 0. For example, in CODE EXAMPLE 7-17, iostat -E reports Hard Errors: 2 for I/O device sd0.

CODE EXAMPLE 7-17 iostat -E Command Output
sd0      Soft Errors: 0 Hard Errors: 2 Transport Errors: 0 
Vendor: TOSHIBA  Product: DVD-ROM SD-C2612 Revision: 1011 Serial No: 04/17/02 
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 2 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd1      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BW6Y00002317 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd2      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BRQJ00007316 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd3      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BWL000002318 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd4      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0AGQS00002317 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
# 
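
Scanning a long iostat -E listing for nonzero counts is easy to script. The following Python sketch is an illustration only. It assumes the summary line format shown in the sample output (device name followed by Soft Errors, Hard Errors, and Transport Errors). The function name devices_with_errors is hypothetical.

```python
import re

# Matches the per-device summary line, for example:
# sd0      Soft Errors: 0 Hard Errors: 2 Transport Errors: 0
DEVICE_LINE = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)


def devices_with_errors(iostat_e_output):
    """Return {device: counts} for every device with a nonzero error count."""
    flagged = {}
    for line in iostat_e_output.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if not match:
            continue  # vendor, size, and detail lines are ignored here
        soft, hard, transport = (int(match.group(i)) for i in (2, 3, 4))
        if soft or hard or transport:
            flagged[match.group(1)] = {
                "soft": soft,
                "hard": hard,
                "transport": transport,
            }
    return flagged


# Excerpt of the sample iostat -E output.
SAMPLE = """\
sd0      Soft Errors: 0 Hard Errors: 2 Transport Errors: 0
Media Error: 0 Device Not Ready: 2 No Device: 0 Recoverable: 0
sd1      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
"""

print(devices_with_errors(SAMPLE))
```

In the sample, only sd0 is flagged. Its detail line attributes both hard errors to Device Not Ready, which is worth reading alongside the summary counts before drawing conclusions.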

12. Verify that any mirrored RAID devices are functioning. Type:

# raidctl

This command shows the status of RAID devices. To identify a problem, examine the output for any Disk Status value other than OK. For more information about configuring mirrored RAID devices, refer to "About Hardware Disk Mirroring" in the Netra 440 Server System Administration Guide (817-3884-xx).

CODE EXAMPLE 7-18 raidctl Command Output
# raidctl
RAID            RAID            RAID            Disk
Volume          Status          Disk            Status
------------------------------------------------------
c1t0d0          RESYNCING       c1t0d0          OK
                                c1t1d0          OK
# 
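
The Disk Status check can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: it assumes the four-column layout shown in the sample output, where a continuation row lists only the disk and its status. The function name disks_not_ok is hypothetical, and any status value other than OK and RESYNCING used with it is illustrative rather than taken from this guide.

```python
def disks_not_ok(raidctl_output):
    """Return (volume, disk, status) for each disk whose status is not OK."""
    problems = []
    volume = None
    in_table = False
    for line in raidctl_output.splitlines():
        if line.startswith("---"):
            in_table = True  # rows after the dashed rule are data rows
            continue
        if not in_table:
            continue
        fields = line.split()
        if len(fields) == 4:
            # First row of a volume: volume, volume status, disk, disk status.
            volume, _volume_status, disk, disk_status = fields
        elif len(fields) == 2:
            # Continuation row: another disk in the same volume.
            disk, disk_status = fields
        else:
            continue  # prompt lines or blank lines
        if disk_status != "OK":
            problems.append((volume, disk, disk_status))
    return problems


# Sample raidctl output.
HEALTHY = """\
# raidctl
RAID            RAID            RAID            Disk
Volume          Status          Disk            Status
------------------------------------------------------
c1t0d0          RESYNCING       c1t0d0          OK
                                c1t1d0          OK
"""

print(disks_not_ok(HEALTHY))
```

Note that the sketch inspects only the Disk Status column. In the sample, the RAID Status column reads RESYNCING while both disks report OK, so the volume status is worth reviewing separately.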

13. Run an exercising tool such as SunVTS software or Hardware Diagnostic Suite.

See Chapter 5 for information about exercising tools.

14. If this is the first occurrence of an unexpected reboot and the system did not run POST as part of the reboot process, run POST.

If ASR is not enabled, consider enabling it now. ASR runs POST and OpenBoot Diagnostics tests automatically at reboot. With ASR enabled, you can diagnose problems faster because POST and OpenBoot Diagnostics test results are already available after an unexpected reboot. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for more information about ASR and complete instructions for enabling ASR.

15. Once troubleshooting is complete, schedule maintenance as necessary for any service actions.


Troubleshooting Fatal Reset Errors and RED State Exceptions

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

For more information about Fatal Reset errors and RED State Exceptions, see Responding to Fatal Reset Errors and RED State Exceptions. For a sample Fatal Reset error message, see CODE EXAMPLE 7-1. For a sample RED State Exception message, see CODE EXAMPLE 7-2.

1. Log in to the system controller and access the sc> prompt.

For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:

sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. CODE EXAMPLE 7-19 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-19 showlogs Command Output
MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>
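
Because an indicator can change state several times in one log, the entry that matters is the last one for each indicator. The following Python sketch is an illustration only. It assumes the "Indicator NAME is now ON|OFF" message format shown in the sample event log. The function name last_indicator_states is hypothetical.

```python
import re

# Matches the quoted message in entries such as:
# MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
INDICATOR = re.compile(r'"Indicator (\S+) is now (ON|OFF)"')


def last_indicator_states(showlogs_output):
    """Return {indicator: state}, keeping the most recent state of each one."""
    states = {}
    for line in showlogs_output.splitlines():
        match = INDICATOR.search(line)
        if match:
            states[match.group(1)] = match.group(2)  # later entries overwrite
    return states


# Excerpt of the sample event log.
SAMPLE = """\
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
"""

states = last_indicator_states(SAMPLE)
print(states.get("SYS_FRONT.SERVICE"))
```

For the excerpt, the last recorded state of SYS_FRONT.SERVICE is ON, which matches the reading of the sample log given above.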



Note - Time stamps for ALOM logs reflect Coordinated Universal Time (UTC), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.
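
When correlating the two kinds of logs, it can help to convert an ALOM time stamp to the server's local time. The following Python sketch is an illustration under stated assumptions: ALOM time stamps carry no year, so the caller supplies one, and the server's offset from UTC is supplied as a whole number of hours. The function name alom_to_local and the example year and offset are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Month abbreviations as they appear in ALOM log entries.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def alom_to_local(stamp, year, utc_offset_hours):
    """Convert an ALOM stamp such as 'MAY 09 16:54:27' (UTC) to local time."""
    month, day, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    utc_time = datetime(
        year, MONTHS[month.upper()], int(day), hour, minute, second,
        tzinfo=timezone.utc,
    )
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


# Hypothetical example: a server that is four hours behind UTC, in 2003.
local = alom_to_local("MAY 09 16:54:27", 2003, -4)
print(local.strftime("%H:%M:%S"))
```

With the assumed offset of four hours behind UTC, the ALOM entry stamped 16:54:27 corresponds to 12:54:27 server time.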



3. Examine the ALOM run log. Type:

sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris software. When troubleshooting, examine the output for hardware or software errors logged by the operating system on the system console. CODE EXAMPLE 7-20 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-20 consolehistory run -v Command Output
May  9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.
 
# 
# init 0
# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
Print services stopped.
May  9 14:49:18 Sun-SFV440-a last message repeated 1 time
 
May  9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
 
The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk
 
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
 
Initializing     1MB of memory at addr        123fecc000 -
                                                                      
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    14MB of memory at addr        123f002000 -
                                                                      
Initializing    16MB of memory at addr        123e002000 -
                                                                      
Initializing   992MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
                                                                      
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up.  Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.
 
Sun-SFV440-a console login: May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
 
May  9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
 
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
 
May  9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
 
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
 
May  9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
 
sc> 

4. Examine the ALOM boot log. Type:

sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests.

CODE EXAMPLE 7-21 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-21 consolehistory boot -v Command Output (Boot Messages From POST)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
Power-On Reset
Executing Power On SelfTest
 
0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08 
       /export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM 
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1>      Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1>      Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0>      Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0>      Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0>      POST Passed all devices.
0>
0>POST: Return to OBP.
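
POST output lines are prefixed with the ID of the CPU that produced them (0> and 1> in the sample). The following Python sketch is an illustration only. It groups POST lines by that prefix and checks for the "POST Passed all devices." line shown in the sample. It does not recognize POST error messages, whose format is described elsewhere in this chapter, and the function name post_summary is hypothetical.

```python
import re

# A POST line starts with a CPU ID followed by '>', for example "0>INFO:".
POST_LINE = re.compile(r"^(\d+)>(.*)$")


def post_summary(boot_log):
    """Return (passed, lines_by_cpu) for the POST portion of a boot log."""
    lines_by_cpu = {}
    passed = False
    for line in boot_log.splitlines():
        match = POST_LINE.match(line)
        if not match:
            continue  # OpenBoot and Solaris messages are ignored here
        cpu, text = match.group(1), match.group(2).strip()
        lines_by_cpu.setdefault(cpu, []).append(text)
        if "POST Passed all devices" in text:
            passed = True
    return passed, lines_by_cpu


# Excerpt of the sample POST output.
SAMPLE = """\
Executing Power On SelfTest
0>CPUs present in system: 0 1
0>Start selftest...
1>Print Mem Config
1>Memory interleave set to 0
0>INFO:
0>      POST Passed all devices.
0>POST: Return to OBP.
"""

passed, lines_by_cpu = post_summary(SAMPLE)
print(passed, sorted(lines_by_cpu))
```

The absence of the pass line is only a hint that the log needs a closer manual reading. It is not a substitute for examining any error messages that POST printed.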

The following output shows the initialization of the OpenBoot PROM.

CODE EXAMPLE 7-22 consolehistory boot -v Command Output (OpenBoot PROM Initialization)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
POST Results: Cpu 0000.0000.0000.0000 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
POST Results: Cpu 0000.0000.0000.0001 
  %o0 = 0000.0000.0000.0000 %o1 = ffff.ffff.f00a.2b73 %o2 = ffff.ffff.ffff.ffff 
Membase: 0000.0000.0000.0000 
MemSize: 0000.0000.0004.0000 
Init CPU arrays Done
Probing /pci@1d,700000 Device 1  Nothing there 
Probing /pci@1d,700000 Device 2  Nothing there 

The following sample output shows the system banner.

CODE EXAMPLE 7-23 consolehistory boot -v Command Output (System Banner Display)
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.

The following sample output shows OpenBoot Diagnostics testing. See What OpenBoot Diagnostics Error Messages Tell You for a sample OpenBoot Diagnostics error message and more information about OpenBoot Diagnostics error messages.

CODE EXAMPLE 7-24 consolehistory boot -v Command Output (OpenBoot Diagnostics Testing)
Running diagnostic script obdiag/normal
 
Testing /pci@1f,700000/network@1
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={gpio@0.42,gpio@0.44,gpio@0.46,gpio@0.48}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={hardware-monitor@0.5c}
Testing /pci@1e,600000/isa@7/i2c@0,320:tests={temperature-sensor@0.9c}
Testing /pci@1c,600000/network@2
Testing /pci@1f,700000/scsi@2,1
Testing /pci@1f,700000/scsi@2

The following sample output shows memory initialization by the OpenBoot PROM.

CODE EXAMPLE 7-25 consolehistory boot -v Command Output (Memory Initialization)
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    12MB of memory at addr        123f000000 -
                                                                      
Initializing  1008MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
 
{1} ok boot disk

The following sample output shows the system booting and loading the Solaris software.

CODE EXAMPLE 7-26 consolehistory boot -v Command Output (System Booting and Loading Solaris Software)
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54. 
FCode UFS Reader 1.11 97/07/10 16:19:15. 
Loading: /platform/SUNW,Netra-440/ufsboot
Loading: /platform/sun4u/ufsboot
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
sc> 

5. Check the /var/adm/messages file for indications of an error.

Look for the following information about the system's state:

6. If possible, check whether the system saved a core dump file.

Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide.

7. Check the system LEDs.

You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for information about system LEDs.

8. Examine the output of the prtdiag -v command. Type:

sc> console
Enter #. to return to ALOM.
# /usr/platform/`uname -i`/sbin/prtdiag -v

The prtdiag -v command provides access to information stored by POST and OpenBoot Diagnostics tests. Any information from this command about the current state of the system is lost if the system is reset. When examining the output to identify problems, verify that all installed CPU modules, PCI cards, and memory modules are listed; check for any Service Required LEDs that are ON; and verify that the system PROM firmware is the latest version. CODE EXAMPLE 7-27 shows an excerpt of output from the prtdiag -v command. See CODE EXAMPLE 2-8 through CODE EXAMPLE 2-13 for the complete prtdiag -v output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-27 prtdiag -v Command Output
System Configuration: Sun Microsystems  sun4u Netra 440
System clock frequency: 177 MHZ
Memory size: 4GB        
 
==================================== CPUs ====================================
                      E$          CPU     CPU       Temperature         Fan
       CPU  Freq      Size        Impl.   Mask     Die    Ambient   Speed   Unit
       ---  --------  ----------  ------  ----  --------  --------  -----   ----
        0  1062 MHz  1MB         US-IIIi   2.3     -     -  
        1  1062 MHz  1MB         US-IIIi   2.3     -     -  
 
================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 0   pci    66           MB  pci108e,abba (network)        SUNW,pci-ce        
 0   pci    33           MB  isa/su (serial)                                  
 0   pci    33           MB  isa/su (serial)                                  
.
.
.
Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels
--------------------------------------------------
0              0        C0/P0/B0/D0,C0/P0/B0/D1
0              1        C0/P0/B1/D0,C0/P0/B1/D1
.
.
.
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26  
# 
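
To compare the installed firmware against the latest available version, the revision strings can be pulled from the "System PROM revisions" section. The following Python sketch is an illustration only. It assumes the section layout shown in the sample output, and the function name prom_revisions is hypothetical. Where the latest version number comes from is outside the scope of the sketch.

```python
def prom_revisions(prtdiag_output):
    """Return {'OBP': version, 'OBDIAG': version} from prtdiag -v output."""
    revisions = {}
    in_section = False
    for line in prtdiag_output.splitlines():
        if line.startswith("System PROM revisions"):
            in_section = True
            continue
        if not in_section:
            continue
        fields = line.split()
        # Expected rows look like: OBP 4.10.3 2003/05/02 20:25 Netra 440
        if len(fields) >= 2 and fields[0] in ("OBP", "OBDIAG"):
            revisions[fields[0]] = fields[1]
    return revisions


# Excerpt of the sample prtdiag -v output.
SAMPLE = """\
Memory size: 4GB
System PROM revisions:
----------------------
OBP 4.10.3 2003/05/02 20:25 Netra 440
OBDIAG 4.10.3 2003/05/02 20:26
"""

print(prom_revisions(SAMPLE))
```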

9. Verify that all user and system processes are functional. Type:

# ps -ef

Output from the ps -ef command shows each process, the start time, the run time, and the full process command-line options. To identify a system problem, examine the output for missing entries in the CMD column. CODE EXAMPLE 7-28 shows the ps -ef command output of a "healthy" Netra 440 server.

CODE EXAMPLE 7-28 ps -ef Command Output
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0 14:51:32 ?        0:17 sched
    root     1     0  0 14:51:32 ?        0:00 /etc/init -
    root     2     0  0 14:51:32 ?        0:00 pageout
    root     3     0  0 14:51:32 ?        0:02 fsflush
    root   291     1  0 14:51:47 ?        0:00 /usr/lib/saf/sac -t 300
    root   205     1  0 14:51:44 ?        0:00 /usr/lib/lpsched
    root   312   148  0 14:54:33 ?        0:00 in.telnetd
    root   169     1  0 14:51:42 ?        0:00 /usr/lib/autofs/automountd
   user1   314   312  0 14:54:33 pts/1    0:00 -csh
    root    53     1  0 14:51:36 ?        0:00 /usr/lib/sysevent/syseventd
    root    59     1  0 14:51:37 ?        0:02 /usr/lib/picl/picld
    root   100     1  0 14:51:40 ?        0:00 /usr/sbin/in.rdisc -s
    root   131     1  0 14:51:40 ?        0:00 /usr/lib/netsvc/yp/ypbind -broadcast
    root   118     1  0 14:51:40 ?        0:00 /usr/sbin/rpcbind
    root   121     1  0 14:51:40 ?        0:00 /usr/sbin/keyserv
    root   148     1  0 14:51:42 ?        0:00 /usr/sbin/inetd -s
    root   226     1  0 14:51:44 ?        0:00 /usr/lib/utmpd
    root   218     1  0 14:51:44 ?        0:00 /usr/lib/power/powerd
    root   199     1  0 14:51:43 ?        0:00 /usr/sbin/nscd
    root   162     1  0 14:51:42 ?        0:00 /usr/lib/nfs/lockd
  daemon   166     1  0 14:51:42 ?        0:00 /usr/lib/nfs/statd
    root   181     1  0 14:51:43 ?        0:00 /usr/sbin/syslogd
    root   283     1  0 14:51:47 ?        0:00 /usr/lib/dmi/snmpXdmid -s Sun-SFV440-a
    root   184     1  0 14:51:43 ?        0:00 /usr/sbin/cron
    root   235   233  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
    root   233     1  0 14:51:44 ?        0:00 /usr/sadm/lib/smc/bin/smcboot
    root   245     1  0 14:51:45 ?        0:00 /usr/sbin/vold
    root   247     1  0 14:51:45 ?        0:00 /usr/lib/sendmail -bd -q15m
    root   256     1  0 14:51:45 ?        0:00 /usr/lib/efcode/sparcv9/efdaemon
    root   294   291  0 14:51:47 ?        0:00 /usr/lib/saf/ttymon
    root   304   274  0 14:51:51 ?        0:00 mibiisa -r -p 32826
    root   274     1  0 14:51:46 ?        0:00 /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
    root   334   292  0 15:00:59 console  0:00 ps -ef
    root   281     1  0 14:51:47 ?        0:00 /usr/lib/dmi/dmispd
    root   282     1  0 14:51:47 ?        0:00 /usr/dt/bin/dtlogin -daemon
    root   292     1  0 14:51:47 console  0:00 -sh
    root   324   314  0 14:54:51 pts/1    0:00 -sh
# 

10. Verify that all I/O devices and activities are still present and functioning. Type:

# iostat -xtc

This command shows all I/O devices and reports activity for each device. To identify a problem, examine the output for installed devices that are not listed. CODE EXAMPLE 7-29 shows the iostat -xtc command output from a "healthy" Netra 440 server.

CODE EXAMPLE 7-29 iostat -xtc Command Output
                  extended device statistics                      tty         cpu
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
sd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0    0  183   0  2  2 96
sd1          6.5    1.2   49.5    7.9  0.0  0.2   24.6   0   3 
sd2          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd3          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd4          0.2    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
nfs2         0.0    0.0    0.1    0.0  0.0  0.0    9.6   0   0 
nfs3         0.1    0.0    0.6    0.0  0.0  0.0    1.4   0   0 
nfs4         0.0    0.0    0.1    0.0  0.0  0.0    5.1   0   0 
# 

11. Examine errors pertaining to I/O devices. Type:

# iostat -E

This command reports errors for each I/O device. To identify a problem, examine the output for any error count greater than 0. For example, in CODE EXAMPLE 7-30, iostat -E reports Hard Errors: 2 for I/O device sd0.

CODE EXAMPLE 7-30 iostat -E Command Output
sd0      Soft Errors: 0 Hard Errors: 2 Transport Errors: 0 
Vendor: TOSHIBA  Product: DVD-ROM SD-C2612 Revision: 1011 Serial No: 04/17/02 
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 2 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd1      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BW6Y00002317 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd2      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BRQJ00007316 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd3      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0BWL000002318 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd4      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0207 Serial No: 3JA0AGQS00002317 
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
# 

12. Check your system Product Notes and the SunSolve Online Web site for the latest information, driver updates, and Free Info Docs for the system.

13. Check the system's recent service history.

Monitor closely any system that has had several recent Fatal Reset errors and subsequent FRU replacements. The replaced parts might not have been faulty, in which case the actual faulty hardware remains undetected.


Troubleshooting a System That Does Not Boot

A system might be unable to boot due to hardware or software problems. If you suspect that the system is unable to boot for software reasons, refer to "Troubleshooting Miscellaneous Software Problems" in the Solaris System Administration Guide: Advanced Administration. If you suspect the system is unable to boot due to a hardware problem, use the following procedure to determine the possible causes.

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.

1. Log in to the system controller and access the sc> prompt.

For information, refer to the Netra 440 Server System Administration Guide.

2. Examine the ALOM event log. Type:

sc> showlogs

The ALOM event log shows system events such as reset events and LED indicator state changes that have occurred since the last system boot. To identify problems, examine the output for Service Required LEDs that are ON. CODE EXAMPLE 7-31 shows a sample event log, which indicates that the front panel Service Required LED is ON.

CODE EXAMPLE 7-31 showlogs Command Output
MAY 09 16:54:27 Sun-SFV440-a: 00060003: "SC System booted."
MAY 09 16:54:27 Sun-SFV440-a: 00040029: "Host system has shut down."
MAY 09 16:56:35 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:56:54 Sun-SFV440-a: 00060000: "SC Login: User admin Logged on."
MAY 09 16:58:11 Sun-SFV440-a: 00040001: "SC Request to Power On Host."
MAY 09 16:58:11 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 16:58:13 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS0.POK is now ON"
MAY 09 16:58:13 Sun-SFV440-a: 0004004f: "Indicator PS1.POK is now ON"
MAY 09 16:59:19 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:00:46 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:01:51 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:03:22 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:03:22 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now OFF"
MAY 09 17:03:24 Sun-SFV440-a: 0004000b: "Host System has read and cleared bootmode."
MAY 09 17:04:30 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:05:59 Sun-SFV440-a: 00040002: "Host System has Reset"
MAY 09 17:06:40 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.SERVICE is now ON"
MAY 09 17:07:44 Sun-SFV440-a: 0004004f: "Indicator SYS_FRONT.ACT is now ON"
sc>

3. Examine the ALOM run log. Type:

sc> consolehistory run -v

This command shows the log containing the most recent system console output of boot messages from the Solaris OS. When troubleshooting, examine the output for hardware or software errors logged by the operating system on the system console. CODE EXAMPLE 7-32 shows sample output from the consolehistory run -v command.

CODE EXAMPLE 7-32 consolehistory run -v Command Output
May  9 14:48:22 Sun-SFV440-a rmclomv: SC Login: User admin Logged on.
 
# 
# init 0
# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
Print services stopped.
May  9 14:49:18 Sun-SFV440-a last message repeated 1 time
 
May  9 14:49:38 Sun-SFV440-a syslogd: going down on signal 15
 
The system is down.
syncing file systems... done
Program terminated
{1} ok boot disk
 
Netra 440, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.10.3, 4096 MB memory installed, Serial #53005571.
Ethernet address 0:3:ba:28:cd:3, Host ID: 8328cd03.
 
Initializing     1MB of memory at addr        123fecc000 -
                                                                      
Initializing     1MB of memory at addr        123fe02000 -
                                                                      
Initializing    14MB of memory at addr        123f002000 -
                                                                      
Initializing    16MB of memory at addr        123e002000 -
                                                                      
Initializing   992MB of memory at addr        1200000000 -
                                                                      
Initializing  1024MB of memory at addr        1000000000 -
                                                                      
Initializing  1024MB of memory at addr         200000000 -
                                                                      
Initializing  1024MB of memory at addr                 0 -
                                                                      
Rebooting with command: boot disk
Boot device: /pci@1f,700000/scsi@2/disk@0,0  File and args: 
\
SunOS Release 5.8 Version Generic_114696-04 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Hardware watchdog enabled
Indicator SYS_FRONT.ACT is now ON
configuring IPv4 interfaces: ce0.
Hostname: Sun-SFV440-a
The system is coming up.  Please wait.
NIS domainname is Ecd.East.Sun.COM
Starting IPv4 router discovery.
starting rpc services: rpcbind keyserv ypbind done.
Setting netmask of lo0 to 255.0.0.0
Setting netmask of ce0 to 255.255.255.0
Setting default IPv4 interface for multicast: add net 224.0/4: gateway Sun-SFV440-a
syslog service starting.
Print services started.
volume management starting.
The system is ready.
 
Sun-SFV440-a console login: May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
 
May  9 14:52:57 Sun-SFV440-a rmclomv: Keyswitch Position has changed to Unknown state.
 
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
 
May  9 14:52:58 Sun-SFV440-a rmclomv: KeySwitch Position has changed to Locked State.
 
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL
 
May  9 14:53:01 Sun-SFV440-a rmclomv: KeySwitch Position has changed to On State.
 
sc> 



Note - Time stamps for ALOM logs reflect Coordinated Universal Time (UTC), while time stamps for the Solaris OS reflect local (server) time. Therefore, a single event might generate messages that appear to be logged at different times in different logs.
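When you correlate a Solaris OS message with an ALOM log entry, it can help to convert the Solaris time stamp to UTC first. The following is a minimal sketch, not part of any Sun tool. The function name, the year, and the fixed UTC offset of -4 hours are assumptions made for illustration, because the log line itself carries neither a year nor a time zone.

```python
from datetime import datetime, timedelta, timezone


def solaris_to_utc(stamp, year, utc_offset_hours):
    """Convert a syslog-style local time stamp (e.g. 'May  9 14:52:57') to UTC.

    The year and the server's UTC offset are not present in the log line,
    so the caller must supply them (both are assumptions in this sketch).
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    # split() collapses the double space used for single-digit days.
    month, day, clock = stamp.split()
    naive = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


# Hypothetical server running four hours behind UTC.
print(solaris_to_utc("May  9 14:52:57", 2003, -4).isoformat())
# prints 2003-05-09T18:52:57+00:00
```

A fixed offset keeps the sketch self-contained. A server that observes daylight saving time would need a real time zone database instead.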





Note - The ALOM system controller runs independently from the system and uses standby power from the server. Therefore, ALOM firmware and software continue to function when power to the machine is turned off.



4. Examine the ALOM boot log. Type:

sc> consolehistory boot -v

The ALOM boot log contains boot messages from POST, OpenBoot firmware, and the Solaris software from the server's most recent reset. When examining the output to identify a problem, check for error messages from POST and OpenBoot Diagnostics tests.

CODE EXAMPLE 7-33 shows the boot messages from POST. Note that POST returned no error messages. See What POST Error Messages Tell You for a sample POST error message and more information about POST error messages.

CODE EXAMPLE 7-33 consolehistory boot -v Command Output (Boot Messages From POST)
Keyswitch set to diagnostic position.
@(#)OBP 4.10.3 2003/05/02 20:25 Netra 440
Clearing TLBs 
Power-On Reset
Executing Power On SelfTest
 
0>@(#) Netra[TM] 440 POST 4.10.3 2003/05/04 22:08 
       /export/work/staff/firmware_re/post/post-build-4.10.3/Fiesta/system/integrated  (firmware_re)  
0>Hard Powerup RST thru SW
0>CPUs present in system: 0 1
0>OBP->POST Call with %o0=00000000.01012000.
0>Diag level set to MIN.
0>MFG scrpt mode set NORM 
0>I/O port set to TTYA.
0>
0>Start selftest...
1>Print Mem Config
1>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
1>Memory interleave set to 0
1>      Bank 0 1024MB : 00000010.00000000 -> 00000010.40000000.
1>      Bank 2 1024MB : 00000012.00000000 -> 00000012.40000000.
0>Print Mem Config
0>Caches : Icache is ON, Dcache is ON, Wcache is ON, Pcache is ON.
0>Memory interleave set to 0
0>      Bank 0 1024MB : 00000000.00000000 -> 00000000.40000000.
0>      Bank 2 1024MB : 00000002.00000000 -> 00000002.40000000.
0>INFO:
0>      POST Passed all devices.
0>
0>POST: Return to OBP.
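If you capture the boot log to a file, you can scan it quickly for the POST result. The following is a minimal sketch under stated assumptions. The success marker "POST Passed all devices" is taken from the sample output above. Treating any line that contains "ERROR" as a failure report is an assumption of this sketch, and the function name is hypothetical.

```python
def summarize_post(console_text):
    """Return (passed, error_lines) for captured POST output.

    'POST Passed all devices' is the success marker shown in the sample
    output. Flagging every line that contains 'ERROR' is an assumption.
    """
    lines = console_text.splitlines()
    passed = any("POST Passed all devices" in line for line in lines)
    errors = [line.strip() for line in lines if "ERROR" in line]
    # Report success only if the marker is present and no error was seen.
    return passed and not errors, errors


sample = """0>Start selftest...
0>INFO:
0>      POST Passed all devices.
0>
0>POST: Return to OBP."""

print(summarize_post(sample))
# prints (True, [])
```

A result of False does not identify the failing component. Read the flagged lines in the full boot log to find out which test and which hardware were involved.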

5. Turn the system control rotary switch to the Diagnostics position.

6. Power on the system.

If the system does not boot, the system might have a basic hardware problem. If you have not made any recent hardware changes to the system, contact your authorized service provider.

7. If the system gets to the ok prompt but does not load the operating system, you might need to change the boot-device setting in the system firmware.

You can use the OpenBoot probe commands to display information about active SCSI and IDE devices. See Using OpenBoot Information Commands for information about using these commands.

For information on changing the default boot device, refer to the Solaris System Administration Guide: Basic Administration.

a. Try to load the operating system for a single user from a CD.

Place a valid Solaris OS CD into the system DVD-ROM or CD-ROM drive and enter boot cdrom -s from the ok prompt.

b. If the system boots from the CD and loads the operating system, check the following:

c. If the system gets to the ok prompt but does not load the operating system from the CD, check the following:


Troubleshooting a System That Is Hanging

This procedure assumes that the system console is in its default configuration, so that you are able to switch between the system controller and the system console. Refer to the Netra 440 Server System Administration Guide.


To Troubleshoot a System That Is Hanging

1. Verify that the system is hanging.

a. Type the ping command to determine whether there is any network activity.

b. Type the ps -ef command to determine whether any other user sessions are active or responding.

If another user session is active, use it to review the contents of the /var/adm/messages file for any indications of the system problem.

c. Try to access the system console through the ALOM system controller.

If you can establish a working system console connection, the problem might not be a true hang but might instead be a network-related problem. For suspected network problems, use the ping, rlogin, or telnet commands to reach another system that is on the same sub-network, hub, or router. If NFS services are served by the affected system, determine whether NFS activity is present on other systems.

d. Change the system control rotary switch position while observing the system console.

For example, turn the rotary switch from the Normal position to the Diagnostics position, or from the Locked position to the Normal position. If the system console logs the change of rotary switch position, the system is not fully hung.
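The check in step 1d can be sketched as a scan of captured console text for the rotary switch messages. This is a minimal sketch that assumes the messages follow the rmclomv format shown in the sample console output earlier in this chapter. The function names are hypothetical.

```python
import re

# Matches the rmclomv NOTICE lines shown in the sample console output.
KEYSWITCH_EVENT = re.compile(r"keyswitch change event - state = (\w+)")


def keyswitch_states(console_text):
    """Return the sequence of switch states reported in console_text."""
    return KEYSWITCH_EVENT.findall(console_text)


def console_saw_switch_change(console_text):
    """True if at least one switch change was logged, which suggests
    that the system is not fully hung."""
    return bool(keyswitch_states(console_text))


sample = """May  9 14:52:57 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = UNKNOWN
May  9 14:52:58 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = LOCKED
May  9 14:53:00 Sun-SFV440-a rmclomv: NOTICE: keyswitch change event - state = NORMAL"""

print(keyswitch_states(sample))
# prints ['UNKNOWN', 'LOCKED', 'NORMAL']
```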

2. If there are no responding user sessions, record the state of the system LEDs.

The system LEDs might indicate a hardware failure in the system. You can use the ALOM system controller to check the state of the system LEDs. Refer to the Netra 440 Server System Administration Guide (817-3884-xx) for more information about system LEDs.

3. Attempt to bring the system to the ok prompt.

For instructions, refer to the Netra 440 Server System Administration Guide.

If the system can get to the ok prompt, then the system hang can be classified as a soft hang. Otherwise, the system hang can be classified as a hard hang. See Responding to System Hang States for more information.

4. If the preceding step failed to bring the system to the ok prompt, execute an externally initiated reset (XIR).

Executing an XIR resets the system and preserves the state of the system before it resets, so that indications and messages about transient errors might be saved.

An XIR is the equivalent of issuing a direct hardware reset. For further information about XIR, refer to the Netra 440 Server System Administration Guide.

5. If an XIR brings the system to the ok prompt, do the following:

a. Issue the printenv command.

This command displays the settings of the OpenBoot configuration variables.

b. Set the auto-boot? variable to true, the diag-switch? variable to true, the diag-level variable to max, and the post-trigger and obdiag-trigger variables to all-resets.

c. Issue the sync command to obtain a core dump file.

Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide, which is part of the Solaris System Administrator Collection.

The system reboots automatically, provided that the OpenBoot configuration variable auto-boot? is set to true (the default value).
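The variable settings in step 5b can be written out as the individual OpenBoot setenv commands that you would type at the ok prompt. The following is a minimal sketch that renders them from a table. The variable names and values come from step 5b, while the Python names are hypothetical.

```python
# OpenBoot configuration variables and values named in step 5b.
DIAG_SETTINGS = {
    "auto-boot?": "true",
    "diag-switch?": "true",
    "diag-level": "max",
    "post-trigger": "all-resets",
    "obdiag-trigger": "all-resets",
}


def setenv_commands(settings):
    """Render each variable as an OpenBoot 'setenv' command line."""
    return [f"setenv {name} {value}" for name, value in settings.items()]


for command in setenv_commands(DIAG_SETTINGS):
    # Prefix with the ok prompt to show where each command is typed.
    print("ok", command)
```

Record the original printenv output before you change any variable, so that you can restore the previous values after troubleshooting.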



Note - Steps 3, 4, and 5 occur automatically when the hardware watchdog mechanism is enabled.



6. If an XIR failed to bring the system to the ok prompt, follow these steps:

a. Turn the system control rotary switch to the Diagnostics position.

This forces the system to run POST and OpenBoot Diagnostics tests during system startup.

b. Press the system Power button for five seconds.

This causes an immediate hardware shutdown.

c. Wait at least 30 seconds; then power on the system by pressing the Power button.
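The branching in steps 3 through 6 can be summarized in a short sketch. The function names and the wording of the returned actions are hypothetical. They only restate the procedure above and do not drive any hardware.

```python
def classify_hang(reached_ok_prompt):
    """Step 3: a system that can reach the ok prompt is in a soft hang."""
    return "soft hang" if reached_ok_prompt else "hard hang"


def steps_after_xir(xir_reached_ok_prompt):
    """Steps 5 and 6: list the follow-up actions after an XIR attempt."""
    if xir_reached_ok_prompt:
        return [
            "issue printenv",
            "set the diagnostic configuration variables",
            "issue sync to obtain a core dump file",
        ]
    return [
        "turn the rotary switch to the Diagnostics position",
        "press the Power button for five seconds",
        "wait at least 30 seconds, then power on",
    ]


print(classify_hang(False))
for action in steps_after_xir(False):
    print("-", action)
```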



Note - You can also use the ALOM system controller to set the POST and OpenBoot Diagnostics levels, and to power off and reboot the system. Refer to the Advanced Lights Out Manager Software User's Guide for the Netra 440 Server (817-5481-xx).



7. Use the POST and OpenBoot Diagnostics tests to diagnose system problems.

When the system initiates the startup sequence, it runs POST and OpenBoot Diagnostics tests. See Isolating Faults Using POST Diagnostics and Isolating Faults Using Interactive OpenBoot Diagnostics Tests.

8. Review the contents of the /var/adm/messages file.

Look for the following information about the system's state:

9. If possible, check whether the system saved a core dump file.

Core dump files provide invaluable information to your support provider to aid in diagnosing any system problems. For further information about core dump files, see The Core Dump Process and "Managing System Crash Information" in the Solaris System Administration Guide, which is part of the Solaris System Administrator Collection.
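One way to check for a saved core dump is to look for matching unix.N and vmcore.N files in the crash directory. The following is a minimal sketch. It assumes that the system uses the customary Solaris location /var/crash/hostname, which might differ on a system where the dump configuration has been changed. The function name is hypothetical.

```python
import os
import re


def find_crash_dumps(crash_dir):
    """Return sorted (unix.N, vmcore.N) pairs found in crash_dir.

    A missing or unreadable directory yields an empty list.
    """
    try:
        names = set(os.listdir(crash_dir))
    except OSError:
        return []
    numbers = []
    for name in names:
        match = re.fullmatch(r"vmcore\.(\d+)", name)
        # Keep only dumps whose namelist file (unix.N) is also present.
        if match and f"unix.{match.group(1)}" in names:
            numbers.append(int(match.group(1)))
    return [(f"unix.{n}", f"vmcore.{n}") for n in sorted(numbers)]


# Assumed default crash directory for the sample host name.
print(find_crash_dumps("/var/crash/Sun-SFV440-a"))
```

An empty list means only that no complete pair was found in that directory. It does not prove that the system failed to write a dump to the dump device.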