Sun Ultra 45 and Ultra 25 Workstations Service and Diagnostics Manual
Solaris 10 Predictive Self-Healing and Solaris Diagnostics
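This chapter provides information about the following topics:
- Predictive Self-Healing Overview
- Predictive Self-Healing Tools
- Using the Predictive Self-Healing Commands
- Determining Which Diagnostics Tools to Use
- Traditional Solaris Troubleshooting Commands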
9.1 Predictive Self-Healing Overview
Always access the following web site first to interpret faults and obtain information on a fault:
http://www.sun.com/msg/
The web site prompts you for the message ID that your system displayed. The web site then provides knowledge articles about the fault and the corrective action to resolve it. The fault information and documentation at this web site are updated regularly.
You can find more detailed descriptions of Solaris 10 Predictive Self-Healing at the web site below:
http://www.sun.com/bigadmin/features/articles/selfheal.html
9.2 Predictive Self-Healing Tools
In Solaris 10, the fault manager runs in the background. The fault manager performs the following functions:
- Receives telemetry information about problems detected by the system software.
- Diagnoses the problems.
- Initiates proactive self-healing activities. For example, the fault manager can disable faulty components.
TABLE 9-1 shows a typical message generated when a fault occurs on your system. The message appears on your console and is recorded in the /var/adm/messages file.
Note - The message in TABLE 9-1 indicates that the fault has already been diagnosed. Any corrective action that the system can perform has already taken place. If your workstation is still running, it continues to run.
TABLE 9-1 System Generated Predictive Self-Healing Message
Output Displayed | Description
Nov 1 16:30:20 dt88-292 EVENT-TIME: Tue Nov 1 16:30:20 PST 2005 | EVENT-TIME: The timestamp of the diagnosis
Nov 1 16:30:20 dt88-292 PLATFORM: SUNW,A70, CSN: -, HOSTNAME: dt88-292 | PLATFORM: A description of the system encountering the problem
Nov 1 16:30:20 dt88-292 SOURCE: eft, REV: 1.13 | SOURCE: Information on the diagnosis engine used to determine the fault
Nov 1 16:30:20 dt88-292 EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4 | EVENT-ID: The universally unique identifier (UUID) for this fault
Nov 1 16:30:20 dt88-292 DESC: Nov 1 16:30:20 dt88-292 A problem was detected in the PCI-Express subsystem | DESC: A basic description of the failure
Nov 1 16:30:20 dt88-292 Refer to http://sun.com/msg/SUN4-8000-0Y for more information. | WEBSITE: Where to find specific information and actions for this fault
Nov 1 16:30:20 dt88-292 AUTO-RESPONSE: One or more device instances may be disabled | AUTO-RESPONSE: What, if anything, the system did to alleviate any follow-on issues
Nov 1 16:30:20 dt88-292 IMPACT: Loss of services provided by the device instances associated with this fault | IMPACT: A description of what that response may have done
Nov 1 16:30:20 dt88-292 REC-ACTION: Schedule a repair procedure to replace the affected device. Use Nov 1 16:30:20 dt88-292 fmdump -v -u EVENT_ID to identify the device or contact Sun for support. | REC-ACTION: A short description of what the system administrator should do
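The "Refer to" line of the console message carries the message ID that the web site asks for. As a minimal sketch, assuming a POSIX-style shell and awk and a log in the format shown in TABLE 9-1, the following helper prints each distinct knowledge article URL found in a messages log. The function name list_fault_urls is hypothetical and is not part of Solaris.

```shell
# list_fault_urls: hypothetical helper, not a Solaris command.
# Reads syslog text on standard input and prints each distinct
# http://sun.com/msg/... URL that a fault message refers to.
list_fault_urls() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^http:\/\/sun\.com\/msg\//)
                print $i
    }' | sort -u
}
```

A typical invocation would be list_fault_urls < /var/adm/messages. The last path component of each URL printed is the message ID.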
9.3 Using the Predictive Self-Healing Commands
For complete information about Predictive Self-Healing commands, refer to the Solaris 10 man pages. This section describes some details of the following commands:
- fmdump
- fmadm
- fmstat
9.3.1 Using the fmdump Command
After the message in TABLE 9-1 is displayed, you might want more information about the fault. Use the fmdump command to display the contents of any log files associated with the Solaris Fault Manager.
The fmdump command produces output similar to TABLE 9-1. This example assumes there is only one fault.
# fmdump
TIME UUID SUNW-MSG-ID
Nov 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
9.3.1.1 fmdump -V Command
You can obtain more detail by using the -V option.
# fmdump -V -u 0ee65618-2218-4997-c0dc-b5c410ed8ec2
TIME UUID SUNW-MSG-ID
Nov 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
100% fault.io.fire.asic
FRU: hc://product-id=SUNW,A70/motherboard=0
rsrc: hc:///motherboard=0/hostbridge=0/pciexrc=0
The -V option adds at least three new lines to the output.
- The first line is a summary of information you have seen before in the console message. It includes the timestamp, the UUID, and the message ID.
- The second line declares the certainty of the diagnosis. In this case, the diagnosis is 100% certain that the failure is in the ASIC described. If the diagnosis involves multiple components, you might see two lines here, each showing 50% (for example).
- The FRU line declares the part which needs to be replaced to return the system to a fully operational state.
- The rsrc line describes what component was taken out of service as a result of this fault.
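When several faults are logged, it can help to list only the parts that need replacing. The following is a minimal sketch, assuming a POSIX-style shell and awk and output in the format shown above, where each FRU appears on a line that begins with "FRU:". The function name list_frus is hypothetical and is not part of Solaris.

```shell
# list_frus: hypothetical helper, not a Solaris command.
# Reads fmdump -V style output on standard input and prints the
# distinct FRU identifiers named on lines that begin with "FRU:".
list_frus() {
    awk '$1 == "FRU:" { print $2 }' | sort -u
}
```

For example, fmdump -V -u 0ee65618-2218-4997-c0dc-b5c410ed8ec2 | list_frus would print only the FRU identifier from the output shown above.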
9.3.1.2 fmdump -e Command
To get information about the errors that caused this failure, use the -e option.
# fmdump -e
TIME CLASS
Nov 02 10:04:14.3008 ereport.io.fire.jbc.mb_per
9.3.2 Using the fmadm faulty Command
The fmadm command can be used by administrators and service personnel to view and modify system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault.
# fmadm faulty
STATE RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
0ee65618-2218-4997-c0dc-b5c410ed8ec2
The PCI device is degraded and is associated with the same UUID as seen above. You may also see "faulted" states.
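To pick out only the affected resources from this listing, the state column can be filtered. The following is a minimal sketch, assuming a POSIX-style shell and awk and the two-column layout shown above, with "degraded" and "faulted" as the states of interest. The function name list_faulty_resources is hypothetical and is not part of Solaris.

```shell
# list_faulty_resources: hypothetical helper, not a Solaris command.
# Reads fmadm faulty style output on standard input and prints the
# state and resource of every entry marked degraded or faulted.
list_faulty_resources() {
    awk '$1 == "degraded" || $1 == "faulted" { print $1, $2 }'
}
```

For example, fmadm faulty | list_faulty_resources would reduce the listing above to a single line naming the degraded PCI device.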
9.3.2.1 fmadm config Command
The fmadm config command output shows you the version numbers of the diagnosis engines in use by your system, as well as their current state. You can check these versions against information on the SunSolve web site to determine if you are running the latest diagnostic engines.
# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.5 active UltraSPARC-III/IV CPU/Memory Diagnosis
cpumem-retire 1.0 active CPU/Memory Retire Agent
eft 1.13 active eft diagnosis engine
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
syslog-msgs 1.0 active Syslog Messaging Agent
9.3.3 Using the fmstat Command
The fmstat command reports statistics associated with the Solaris Fault Manager, including information about diagnosis engine (DE) performance. In the example below, the eft DE (also seen in the console output) has received one event, which it accepted. A case is "opened" for that event and a diagnosis is performed to "solve" the cause of the failure.
# fmstat
module             ev_recv ev_acpt wait   svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis         0       0  0.0     0.0   0   0     0     0   3.0K      0
cpumem-retire            0       0  0.0     0.0   0   0     0     0      0      0
eft                      1       1  0.0  1191.8   0   0     1     1   3.3M    11K
fmd-self-diagnosis       0       0  0.0     0.0   0   0     0     0      0      0
io-retire                1       0  0.0    32.4   0   0     0     0    37b      0
syslog-msgs              1       0  0.0     0.5   0   0     0     0    32b      0
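The open column shows which modules have opened cases. As a minimal sketch, assuming a POSIX-style shell and awk and the column order shown in the fmstat header (open is the eighth column), the following helper lists only those modules. The function name modules_with_open_cases is hypothetical and is not part of Solaris.

```shell
# modules_with_open_cases: hypothetical helper, not a Solaris command.
# Reads fmstat style output on standard input, skips the header line,
# and prints the module name and open count where the count is nonzero.
modules_with_open_cases() {
    awk 'NR > 1 && $8 + 0 > 0 { print $1, $8 }'
}
```

For example, fmstat | modules_with_open_cases would report the eft module for the output shown above.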
9.4 Determining Which Diagnostics Tools to Use
When a failure occurs, a message is often displayed on the monitor. Use the flowcharts in FIGURE 7-1 and FIGURE 7-2 to find the correct methods for diagnosing system problems with Predictive Self-Healing tools, OpenBoot PROM, OpenBoot Diagnostics, or other Solaris commands.
9.5 Traditional Solaris Troubleshooting Commands
These superuser commands can help you determine if you have issues in your workstation, in the network, or within another system that you are networking with.
The following commands are described in this section:
- iostat
- prtdiag
- prtconf
- netstat
- ping
- ps
- prstat
Most of these commands are located in the /usr/bin or /usr/sbin directories.
9.5.1 iostat Command
The iostat command iteratively reports terminal, drive, and tape I/O activity, as well as CPU utilization.
9.5.1.1 Options
TABLE 9-2 describes options for the iostat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-2 Options for iostat
Option | Description | How It Can Help
No option | Reports status of local I/O devices. | A quick three-line output of device status.
-c | Reports the percentage of time the system has spent in user mode, in system mode, waiting for I/O, and idling. | Quick report of CPU status.
-e | Displays device error summary statistics. The total errors, hard errors, soft errors, and transport errors are displayed. | Provides a short table with accumulated errors. Identifies suspect I/O devices.
-E | Displays all device error statistics. | Provides information about devices: manufacturer, model number, serial number, size, and errors.
-n | Displays names in descriptive format. | Descriptive format helps identify devices.
-x | For each drive, reports extended drive statistics. The output is in tabular form. | Similar to the -e option, but provides rate information. This helps identify poor performance of internal devices and other I/O devices across the network.
The following example shows output for one iostat command.
# iostat -En
c0t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3120026A Revision: 8.01 Serial No: 3JT4H4C2
Size: 120.03GB <120031641600 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c0t2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: LITE-ON Product: COMBO SOHC-4832K Revision: O3K1 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
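On a system with many devices, the error counters are easier to review when devices with all-zero counts are filtered out. The following is a minimal sketch, assuming a POSIX-style shell and awk and the iostat -En layout shown above, where each device summary line reads "device Soft Errors: n Hard Errors: n Transport Errors: n". The function name devices_with_errors is hypothetical and is not part of Solaris.

```shell
# devices_with_errors: hypothetical helper, not a Solaris command.
# Reads iostat -En style output on standard input and prints every
# device whose soft, hard, or transport error count is nonzero.
devices_with_errors() {
    awk '$2 == "Soft" && $3 == "Errors:" {
        if ($4 + $7 + $10 > 0)
            print $1, "soft=" $4, "hard=" $7, "transport=" $10
    }'
}
```

For example, iostat -En | devices_with_errors prints nothing for the output shown above, because both devices report zero errors.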
9.5.2 prtdiag Command
The prtdiag command displays configuration and diagnostic information for a system. The diagnostic information identifies any failed component in the system.
The prtdiag command is located in the /usr/platform/platform-name/sbin/ directory.
Note - The prtdiag command might indicate a slot number different than that identified elsewhere in this document. This is normal.
9.5.2.1 Options
TABLE 9-3 describes options for the prtdiag command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-3 Options for prtdiag
Option | Description | How It Can Help
No option | Lists system components. | Identifies CPU timing and PCI cards installed.
-v | Verbose mode. Displays the time of the most recent AC power failure, the most recent hardware fatal error information, and (if applicable) environmental status. | Provides the same information as no option. Additionally lists fan status, temperatures, ASIC, and PROM revisions.
The following example shows output for the prtdiag command in verbose mode.
# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Ultra 45 or Ultra 25 workstation
System clock frequency: 160 MHZ
Memory size: 1GB
. . .
============================ Environmental Status ============================
Fan Speeds:
---------------------------------------------
Location Sensor Status Speed
---------------------------------------------
F2 CPU okay 3183rpm
F1 Intake okay 2280rpm
F0 Outtake okay 2280rpm
Temperature sensors:
-----------------------------------------------------------------------------
Location Sensor Temperature Lo LoWarn HiWarn Hi Status
-----------------------------------------------------------------------------
MB/0 Die 68C -10C 0C 95C 100C okay
MB Ambient 37C -10C 0C 70C 75C okay
MB Ambient 30C -11C 0C 60C 70C okay
================================ HW Revisions ================================
ASIC Revisions:
-------------------------------------------------------------------
Path Device Status Revision
-------------------------------------------------------------------
/pci@1e,600000 pci108e,a801 okay 4
/pci@1f,700000 pci108e,a801 okay 4
System PROM revisions:
----------------------
OBP 4.16.3 2004/11/05 18:29 Sun Ultra 45
OBDIAG 4.16.3 2004/11/05 18:31
9.5.3 prtconf Command
Similar to the show-devs command run at the ok prompt, the prtconf command displays the devices that are configured for the Sun Ultra 45 or Ultra 25 workstation.
The prtconf command identifies hardware that is recognized by the Solaris OS. If hardware is not suspected of being bad, yet software applications are having trouble with the hardware, the prtconf command can indicate if the Solaris software recognizes the hardware and if a driver for the hardware is loaded.
9.5.3.1 Options
TABLE 9-4 describes options for the prtconf command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-4 Options for prtconf
Option | Description | How It Can Help
No option | Displays the device tree of devices recognized by the OS. | If a hardware device is recognized, then it is probably functioning properly. If the message "(driver not attached)" is displayed for the device or for a sub-device, then no driver is currently attached to it. The driver for the device might be corrupt or missing, or the device might not be in use.
-D | Similar to the output with no option, but also lists the device driver. | Lists the driver needed or used by the OS to enable the device.
-p | Similar to the output with no option, but abbreviated. | Reports a brief list of the devices.
-V | Displays the version and date of the OpenBoot PROM firmware. | Provides a quick check of firmware version.
The following example shows output for the prtconf command.
# prtconf
System Configuration: Sun Microsystems sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
SUNW,Sun Ultra 45
packages (driver not attached)
SUNW,builtin-drivers (driver not attached)
deblocker (driver not attached)
disk-label (driver not attached)
terminal-emulator (driver not attached)
dropins (driver not attached)
kbd-translator (driver not attached)
obp-tftp (driver not attached)
SUNW,i2c-ram-device (driver not attached)
SUNW,fru-device (driver not attached)
SUNW,asr (driver not attached)
ufs-file-system (driver not attached)
chosen (driver not attached)
openprom (driver not attached)
client-services (driver not attached)
options, instance #0
aliases (driver not attached)
. . .
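To check quickly whether a particular device is among the nodes without an attached driver, the listing can be filtered. The following is a minimal sketch, assuming a POSIX-style shell and awk and the prtconf layout shown above. The function name unattached_nodes is hypothetical and is not part of Solaris.

```shell
# unattached_nodes: hypothetical helper, not a Solaris command.
# Reads prtconf style output on standard input and prints the name of
# every node that is flagged "(driver not attached)".
unattached_nodes() {
    awk '/\(driver not attached\)/ { print $1 }'
}
```

For example, prtconf | unattached_nodes prints one node name per line. Many firmware nodes, such as packages and chosen in the output above, normally appear in this list, so look only for the device you are troubleshooting.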
9.5.4 netstat Command
The netstat command displays the network status.
9.5.4.1 Options
TABLE 9-5 describes options for the netstat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-5 Options for netstat
Option | Description | How It Can Help
-i | Displays the interface state, including packets in/out, error in/out, collisions, and queue. | Provides a quick overview of the system's network status.
-i interval | Providing a trailing number with the -i option repeats the netstat command every interval seconds. | Identifies intermittent or long-duration network events. By redirecting netstat output to a file, overnight activity can be viewed all at once.
-p | Displays the media table. | Provides Media Access Controller (MAC) address for hosts on the subnet.
-r | Displays the routing table. | Provides routing information.
-n | Replaces host names with IP addresses. | Used when an address is more useful than a host name.
The following example shows output for the netstat -p command.
# netstat -p
Net to Media Table: IPv4
Device IP Address Mask Flags Phys Addr
------ -------------------- --------------- ----- ---------------
bge0 phatair-46 255.255.255.255 08:00:20:92:4a:47
bge0 ns-umpk27-02-46 255.255.255.255 08:00:20:93:fb:99
bge0 moreair-46 255.255.255.255 08:00:20:8a:e5:03
bge0 fermpk28a-46 255.255.255.255 00:00:0c:07:ac:2e
bge0 fermpk28as-46 255.255.255.255 00:50:e2:61:d8:00
bge0 kayakr 255.255.255.255 08:00:20:d1:83:c7
bge0 matlock 255.255.255.255 SP 00:03:ba:27:01:48
bge0 toronto2 255.255.255.255 08:00:20:b6:15:b5
bge0 tocknett 255.255.255.255 08:00:20:7c:f5:94
bge0 mpk28-lobby 255.255.255.255 08:00:20:a6:d5:c8
bge0 efyinisedeg 255.255.255.255 08:00:20:8d:6a:80
bge0 froggy 255.255.255.255 08:00:20:73:70:44
bge0 d-mpk28-46-245 255.255.255.255 00:10:60:24:0e:00
bge0 224.0.0.0 240.0.0.0 SM 01:00:5e:00:00:00
9.5.5 ping Command
The ping command sends ICMP ECHO_REQUEST packets to network hosts. Depending on how the ping command is configured, the output displayed can identify troublesome network links or nodes. The destination host is specified in the variable hostname.
9.5.5.1 Options
TABLE 9-6 describes options for the ping command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-6 Options for ping
Option | Description | How It Can Help
hostname | The probe packet is sent to hostname and returned. | Verifies that a host is active on the network.
-g hostname | Forces the probe packet to route through a specified gateway. | By identifying different routes to the target host, those individual routes can be tested for quality.
-i interface | Designates which interface to send and receive the probe packet through. | Enables a simple check of secondary network interfaces.
-n | Replaces host names with IP addresses. | Used when an address is more useful than a host name.
-s | Pings continuously in one-second intervals. Ctrl-C aborts. Upon abort, statistics are displayed. | Helps identify intermittent or long-duration network events. By redirecting ping output to a file, overnight activity can be viewed all at once.
-svR | Displays the route the probe packet followed in one-second intervals. | Indicates probe packet route and number of hops. Comparing multiple routes can identify bottlenecks.
The following example shows output for the ping -s command.
# ping -s teddybear
PING teddybear: 56 data bytes
64 bytes from teddybear (192.146.77.140): icmp_seq=0. time=1. ms
64 bytes from teddybear (192.146.77.140): icmp_seq=1. time=0. ms
64 bytes from teddybear (192.146.77.140): icmp_seq=2. time=0. ms
^C
----teddybear PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/1
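When ping output has been saved to a file, the packet loss figure can be pulled out of the statistics summary for a quick check. The following is a minimal sketch, assuming a POSIX-style shell and awk and a summary line in the format shown above. The function name ping_loss_percent is hypothetical and is not part of Solaris.

```shell
# ping_loss_percent: hypothetical helper, not a Solaris command.
# Reads ping -s style output on standard input and prints the packet
# loss percentage, as a bare number, from each statistics summary line.
ping_loss_percent() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/)
                print $i + 0
    }'
}
```

For the output shown above, ping_loss_percent prints 0. Any nonzero value suggests a link or node worth investigating.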
9.5.6 ps Command
The ps command lists the status of system processes. Using options and rearranging the command output can assist in determining the Sun Ultra 45 or Ultra 25 workstation resource allocation.
9.5.6.1 Options
TABLE 9-7 describes options for the ps command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-7 Options for ps
Option | Description | How It Can Help
-e | Displays information for every process. | Identifies the process ID and the executable.
-f | Generates a full listing. | Provides the following process information: user ID, parent process ID, system time when executed, and the path to the executable.
-o option | Enables configurable output. The pid, pcpu, pmem, and comm options display process ID, percent CPU consumption, percent memory consumption, and the responsible executable, respectively. | Provides only the most important information. Knowing the percentage of resource consumption helps identify processes that are affecting system performance and might be hung.
The following example shows output for one ps command.
# ps -eo pcpu,pid,comm|sort -rn
1.4 100317 /usr/openwin/bin/Xsun
0.9 100460 dtwm
0.1 100677 ps
0.1 100600 ksh
0.1 100591 /usr/dt/bin/dtterm
0.1 100462 /usr/dt/bin/sdtperfmeter
0.1 100333 mibiisa
%CPU PID COMMAND
0.0 100652 /bin/csh
. . .
Note - When you pipe the output through sort -rn, the column heading line is sorted as if the value in its first column were zero. The heading therefore appears among the rows with a value of 0.0 instead of at the top.
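If you prefer the heading to stay at the top, the first line can be passed through unsorted. The following is a minimal sketch, assuming a POSIX-style shell such as ksh (the read -r option is not available in the legacy Bourne shell). The function name sort_keep_header is hypothetical and is not part of Solaris.

```shell
# sort_keep_header: hypothetical helper, not a Solaris command.
# Copies the first line of standard input (the column headings)
# unchanged, then sorts the remaining lines in reverse numeric order.
sort_keep_header() {
    {
        IFS= read -r header
        printf '%s\n' "$header"
        sort -rn
    }
}
```

For example, ps -eo pcpu,pid,comm | sort_keep_header prints the %CPU PID COMMAND heading first, followed by the processes in descending order of CPU use.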
9.5.7 prstat Command
The prstat utility iteratively examines all active processes on the system and reports statistics based on the selected output mode and sort order. The prstat command provides output similar to the ps command.
9.5.7.1 Options
TABLE 9-8 describes options for the prstat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.
TABLE 9-8 Options for prstat
Option | Description | How It Can Help
No option | Displays a sorted list of the top processes that are consuming the most CPU resources. List is limited to the height of the terminal window and the total number of processes. Output is automatically updated every five seconds. Ctrl-C aborts. | Output identifies process ID, user ID, memory used, state, CPU consumption, and command name.
-n number | Limits output to number of lines. | Limits amount of data displayed and identifies primary resource consumers.
-s key | Permits sorting list by key parameter. | Useful keys are cpu (default), time, and size.
-v | Verbose mode. | Displays additional parameters.
The following example shows output for the prstat command.
# prstat -n 5 -s size
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
100524 mm39236 28M 21M sleep 48 0 0:00.26 0.3% maker6X.exe/1
100317 root 28M 69M sleep 59 0 0:00.26 0.7% Xsun/1
100460 mm39236 11M 8760K sleep 59 0 0:00.03 0.0% dtwm/8
100453 mm39236 8664K 4928K sleep 48 0 0:00.00 0.0% dtsession/4
100591 mm39236 7616K 5448K sleep 49 0 0:00.02 0.1% dtterm/1
Total: 65 processes, 159 lwps, load averages: 0.03, 0.02, 0.04
819-1892
Copyright © 2006, Sun Microsystems, Inc. All Rights Reserved.