CHAPTER 9

Solaris 10 Predictive Self-Healing and Solaris Diagnostics

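This chapter describes the Predictive Self-Healing overview, tools, and commands, explains how to determine which diagnostic tools to use, and covers the traditional Solaris troubleshooting commands.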


9.1 Predictive Self-Healing Overview

Always access the following web site first to interpret faults and obtain information on a fault:

http://www.sun.com/msg/

The web site prompts you for the message ID that your system displayed. The web site then provides knowledge articles about the fault and the corrective action to resolve it. The fault information and documentation at this web site are updated regularly.

You can find more detailed descriptions of Solaris 10 Predictive Self-Healing at the web site below:

http://www.sun.com/bigadmin/features/articles/selfheal.html


9.2 Predictive Self-Healing Tools

In Solaris 10, the fault manager runs in the background. The fault manager performs the following functions:

TABLE 9-1 shows a typical message generated when a fault occurs on your system. The message appears on your console and is recorded in the /var/adm/messages file.



Note - The message in TABLE 9-1 indicates that the fault has already been diagnosed. Any corrective action that the system can perform has already taken place. If your workstation is still running, it continues to run.




TABLE 9-1 System Generated Predictive Self-Healing Message

| Output Displayed | Description |
|---|---|
| Nov 1 16:30:20 dt88-292 EVENT-TIME: Tue Nov 1 16:30:20 PST 2005 | EVENT-TIME: The time stamp of the diagnosis. |
| Nov 1 16:30:20 dt88-292 PLATFORM: SUNW,A70, CSN: -, HOSTNAME: dt88-292 | PLATFORM: A description of the system encountering the problem. |
| Nov 1 16:30:20 dt88-292 SOURCE: eft, REV: 1.13 | SOURCE: Information on the Diagnosis Engine used to determine the fault. |
| Nov 1 16:30:20 dt88-292 EVENT-ID: afc7e660-d609-4b2f-86b8-ae7c6b8d50c4 | EVENT-ID: The Universally Unique event ID (UUID) for this fault. |
| Nov 1 16:30:20 dt88-292 DESC: | DESC: A basic description of the failure. |
| Nov 1 16:30:20 dt88-292 A problem was detected in the PCI-Express subsystem | |
| Nov 1 16:30:20 dt88-292  Refer to http://sun.com/msg/SUN4-8000-0Y for more information. | WEBSITE: Where to find specific information and actions for this fault. |
| Nov 1 16:30:20 dt88-292 AUTO-RESPONSE: One or more device instances may be disabled | AUTO-RESPONSE: What, if anything, the system did to alleviate any follow-on issues. |
| Nov 1 16:30:20 dt88-292 IMPACT: Loss of services provided by the device instances associated with this fault | IMPACT: A description of what that response may have done. |
| Nov 1 16:30:20 dt88-292 REC-ACTION: Schedule a repair procedure to replace the affected device. Use | REC-ACTION: A short description of what the system administrator should do. |
| Nov 1 16:30:20 dt88-292 fmdump -v -u EVENT_ID to identify the device or contact Sun for support. | |



9.3 Using the Predictive Self-Healing Commands

For complete information about Predictive Self-Healing commands, refer to the Solaris 10 man pages. This section describes some details of the fmdump, fmadm, and fmstat commands.

9.3.1 Using the fmdump Command

After the message in TABLE 9-1 is displayed, you might want more information about the fault. The fmdump command displays the contents of any log files associated with the Solaris Fault Manager.

With no options, the fmdump command produces output similar to the following example, which assumes there is only one fault.


# fmdump 
TIME UUID SUNW-MSG-ID
Nov 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
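
The message ID in the last column is the value that the knowledge-article web site asks for. As a minimal sketch, assuming fmdump output in the three-column layout shown above and the http://sun.com/msg/ URL pattern that appears in the console message, the hypothetical helper msg_urls below prints the UUID and an article URL for each fault. The sample text stands in for live fmdump output, and on Solaris 10 you might need nawk in place of awk.

```shell
#!/bin/sh
# msg_urls (hypothetical helper, not part of Solaris): read fmdump-style
# output on stdin and print "UUID URL" for each fault line.
# Assumes the header is the first line and the message ID is the last field.
msg_urls() {
    awk 'NR > 1 && NF >= 5 { print $(NF-1), "http://sun.com/msg/" $NF }'
}

# Sample input copied from the example above. On a live system the
# equivalent would be:  fmdump | msg_urls
msg_urls <<'EOF'
TIME UUID SUNW-MSG-ID
Nov 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2 SUN4-8000-0Y
EOF
```

The helper only reformats text that fmdump already printed, so it does not replace reading the article itself.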

9.3.1.1 fmdump -V Command

You can obtain more detail by using the -V option.


# fmdump -V -u  0ee65618-2218-4997-c0dc-b5c410ed8ec2
TIME                 UUID                                  SUNW-MSG-ID
Nov 02 10:04:15.4911 0ee65618-2218-4997-c0dc-b5c410ed8ec2  SUN4-8000-0Y
100% fault.io.fire.asic
FRU: hc://product-id=SUNW,A70/motherboard=0
rsrc: hc:///motherboard=0/hostbridge=0/pciexrc=0

The -V option adds at least three lines of output. In this example, the added lines show the fault class with its certainty, the FRU, and the affected resource (rsrc).

9.3.1.2 fmdump -e Command

To get information about the errors that caused this failure, use the -e option.


# fmdump -e
TIME                 CLASS
Nov 02 10:04:14.3008 ereport.io.fire.jbc.mb_per

9.3.2 Using the fmadm faulty Command

The fmadm command can be used by administrators and service personnel to view and modify system configuration parameters that are maintained by the Solaris Fault Manager. The fmadm faulty command is primarily used to determine the status of a component involved in a fault.


# fmadm faulty
STATE		RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
		0ee65618-2218-4997-c0dc-b5c410ed8ec2

The PCI device is degraded and is associated with the same UUID shown in the fmdump output above. You might also see resources in the "faulted" state.
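
When several resources are listed, it can help to put each state, resource, and UUID on one line. The following is a sketch only, assuming the two-line-per-entry layout shown in the example above (state and resource on one line, UUID alone on the next); faulty_resources is a hypothetical helper name, and the sample text stands in for live fmadm faulty output.

```shell
#!/bin/sh
# faulty_resources (hypothetical helper): join each "STATE RESOURCE" line
# with the UUID line that follows it. Assumes the layout shown above.
faulty_resources() {
    awk '
        $1 == "degraded" || $1 == "faulted" { state = $1; rsrc = $2; next }
        state != "" && NF == 1              { print state, rsrc, $1; state = "" }
    '
}

# Sample input copied from the example above. On a live system the
# equivalent would be:  fmadm faulty | faulty_resources
faulty_resources <<'EOF'
STATE    RESOURCE / UUID
-------- -------------------------------------------------------------
degraded dev:////pci@1e,600000
         0ee65618-2218-4997-c0dc-b5c410ed8ec2
EOF
```

The one-line form makes it easier to match a resource against the UUID reported by fmdump.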

9.3.2.1 fmadm config Command

The fmadm config command output shows the version numbers of the diagnosis engines in use by your system, as well as their current state. You can check these versions against information on the SunSolve web site to determine whether you are running the latest diagnosis engines.


# fmadm config
MODULE             VERSION  STATUS   DESCRIPTION
cpumem-diagnosis   1.5      active   UltraSPARC-III/IV CPU/Memory Diagnosis
cpumem-retire      1.0      active   CPU/Memory Retire Agent
eft                1.13     active   eft diagnosis engine
fmd-self-diagnosis 1.0      active   Fault Manager Self-Diagnosis
io-retire          1.0      active   I/O Retire Agent
syslog-msgs        1.0      active   Syslog Messaging Agent

9.3.3 Using the fmstat Command

The fmstat command reports statistics associated with the Solaris Fault Manager, including information about diagnosis engine (DE) performance. In the example below, the eft DE (also seen in the console output) has received one event, which it accepted. A case is "opened" for that event, and a diagnosis is performed to "solve" the cause of the failure.


# fmstat
module          ev_recv ev_acpt wait svc_t   %w  %b  open solve  memsz bufsz
cpumem-diagnosis   0       0    0.0  0.0     0   0   0    0      3.0K  0
cpumem-retire      0       0    0.0  0.0     0   0   0    0      0     0
eft                1       1    0.0  1191.8  0   0   1    1      3.3M  11K
fmd-self-diagnosis 0       0    0.0  0.0     0   0   0    0      0     0
io-retire          1       0    0.0  32.4    0   0   0    0      37b   0
syslog-msgs        1       0    0.0  0.5     0   0   0    0      32b   0
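
To pick out the modules that have opened cases, you can filter on the open column. This is a minimal sketch, assuming the eleven-column layout shown above in which open is the eighth column and solve is the ninth; open_cases is a hypothetical helper name, and the sample rows are copied from the example.

```shell
#!/bin/sh
# open_cases (hypothetical helper): print module name, open count, and
# solve count for every module whose "open" column is greater than zero.
# Assumes the column order shown in the fmstat example above.
open_cases() {
    awk 'NR > 1 && $8 + 0 > 0 { print $1, $8, $9 }'
}

# Sample rows copied from the example above. On a live system the
# equivalent would be:  fmstat | open_cases
open_cases <<'EOF'
module          ev_recv ev_acpt wait svc_t   %w  %b  open solve  memsz bufsz
eft                1       1    0.0  1191.8  0   0   1    1      3.3M  11K
io-retire          1       0    0.0  32.4    0   0   0    0      37b   0
EOF
```

For the sample rows, only the eft module is reported, which matches the single case described in the text.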


9.4 Determining Which Diagnostics Tools to Use

When a failure occurs, a message is often displayed on the monitor. Use the flowcharts in FIGURE 7-1 and FIGURE 7-2 to find the correct methods for diagnosing system problems with Predictive Self-Healing tools, OpenBoot PROM, OpenBoot Diagnostics, or other Solaris commands.


9.5 Traditional Solaris Troubleshooting Commands

These superuser commands can help you determine whether a problem lies in your workstation, in the network, or in another system on the network.

This section describes the iostat, prtdiag, prtconf, netstat, ping, ps, and prstat commands.

Most of these commands are located in the /usr/bin or /usr/sbin directories.

9.5.1 iostat Command

The iostat command iteratively reports terminal, drive, and tape I/O activity, as well as CPU utilization.

9.5.1.1 Options

TABLE 9-2 describes options for the iostat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-2 Options for iostat

| Option | Description | How It Can Help |
|---|---|---|
| No option | Reports status of local I/O devices. | A quick three-line output of device status. |
| -c | Reports the percentage of time the system has spent in user mode, in system mode, waiting for I/O, and idling. | Quick report of CPU status. |
| -e | Displays device error summary statistics. The total errors, hard errors, soft errors, and transport errors are displayed. | Provides a short table with accumulated errors. Identifies suspect I/O devices. |
| -E | Displays all device error statistics. | Provides information about devices: manufacturer, model number, serial number, size, and errors. |
| -n | Displays names in descriptive format. | Descriptive format helps identify devices. |
| -x | For each drive, reports extended drive statistics. The output is in tabular form. | Similar to the -e option, but provides rate information. This helps identify poor performance of internal devices and other I/O devices across the network. |


The following example shows output for one iostat command.


# iostat -En
c0t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3120026A    Revision: 8.01  Serial No: 3JT4H4C2
Size: 120.03GB <120031641600 bytes>
Media Error: 0 Device Not Ready: 0  No Device: 0 Recoverable: 0
Illegal Request: 0
c0t2d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: LITE-ON  Product: COMBO SOHC-4832K Revision: O3K1 Serial No: 
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0 
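
On a system with many drives, a per-device error total is quicker to scan than the full listing. The sketch below assumes the iostat -En layout shown above, in which each device block starts with a line containing the soft, hard, and transport error counts; error_totals is a hypothetical helper name, and the sample lines are copied from the example.

```shell
#!/bin/sh
# error_totals (hypothetical helper): for each device summary line, print
# the device name and the sum of soft, hard, and transport errors.
# Assumes the "Soft Errors: N Hard Errors: N Transport Errors: N" layout.
error_totals() {
    awk '/Soft Errors:/ { print $1, $4 + $7 + $10 }'
}

# Sample lines copied from the example above. On a live system the
# equivalent would be:  iostat -En | error_totals
error_totals <<'EOF'
c0t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3120026A    Revision: 8.01  Serial No: 3JT4H4C2
c0t2d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: LITE-ON  Product: COMBO SOHC-4832K Revision: O3K1 Serial No:
EOF
```

A device with a nonzero total is a candidate for a closer look at its detailed counters in the full output.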

9.5.2 prtdiag Command

The prtdiag command displays configuration and diagnostic information for a system. The diagnostic information identifies any failed component in the system.

The prtdiag command is located in the /usr/platform/platform-name/sbin/ directory.



Note - The prtdiag command might indicate a slot number different than that identified elsewhere in this document. This is normal.



9.5.2.1 Options

TABLE 9-3 describes options for the prtdiag command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-3 Options for prtdiag

| Option | Description | How It Can Help |
|---|---|---|
| No option | Lists system components. | Identifies CPU timing and PCI cards installed. |
| -v | Verbose mode. Displays the time of the most recent AC power failure, the most recent hardware fatal error information, and (if applicable) environmental status. | Provides the same information as no option. Additionally lists fan status, temperatures, ASIC, and PROM revisions. |


The following example shows output for the prtdiag command in verbose mode.


# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration: Sun Microsystems  sun4u Sun Ultra 45 or Ultra 25 workstation
System clock frequency: 160 MHZ
Memory size: 1GB 
 
. . .
 
============================ Environmental Status ============================
Fan Speeds:
---------------------------------------------
Location       Sensor          Status   Speed
---------------------------------------------
F2             CPU             okay     3183rpm          
F1             Intake          okay     2280rpm          
F0             Outtake         okay     2280rpm          
 
Temperature sensors:
-----------------------------------------------------------------------------
Location       Sensor         Temperature  Lo   LoWarn  HiWarn    Hi   Status
-----------------------------------------------------------------------------
MB/0           Die              68C       -10C    0C     95C    100C   okay
MB             Ambient          37C       -10C    0C     70C     75C   okay
MB             Ambient          30C       -11C    0C     60C     70C   okay
 
================================ HW Revisions ================================
ASIC Revisions:
-------------------------------------------------------------------
Path                   Device           Status             Revision
-------------------------------------------------------------------
/pci@1e,600000         pci108e,a801     okay               4   
/pci@1f,700000         pci108e,a801     okay               4   
 
System PROM revisions:
----------------------
OBP 4.16.3 2004/11/05 18:29 Sun Ultra 45
OBDIAG 4.16.3 2004/11/05 18:31 
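
The temperature table in the verbose output lists each reading next to its warning thresholds. As a sketch, assuming the eight-column sensor layout shown above (Location, Sensor, Temperature, Lo, LoWarn, HiWarn, Hi, Status), the hypothetical helper temp_margins prints how many degrees each sensor is below its HiWarn threshold. The sample lines are copied from the example.

```shell
#!/bin/sh
# temp_margins (hypothetical helper): print location, sensor, and the
# number of degrees C between the current reading and the HiWarn value.
# Assumes the sensor-table layout shown in the prtdiag -v example above.
temp_margins() {
    awk 'NF == 8 && $3 ~ /^-?[0-9]+C$/ && $6 ~ /^-?[0-9]+C$/ {
        # Adding 0 converts a value such as "68C" to the number 68.
        print $1, $2, ($6 + 0) - ($3 + 0)
    }'
}

# Sample lines copied from the example above. On a live system the
# equivalent would be:  /usr/platform/sun4u/sbin/prtdiag -v | temp_margins
temp_margins <<'EOF'
Location       Sensor         Temperature  Lo   LoWarn  HiWarn    Hi   Status
MB/0           Die              68C       -10C    0C     95C    100C   okay
MB             Ambient          37C       -10C    0C     70C     75C   okay
MB             Ambient          30C       -11C    0C     60C     70C   okay
EOF
```

A small or negative margin suggests checking fans and airflow before suspecting other components.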

9.5.3 prtconf Command

Similar to the show-devs command run at the ok prompt, the prtconf command displays the devices that are configured for the Sun Ultra 45 or Ultra 25 workstation.

The prtconf command identifies hardware that is recognized by the Solaris OS. If hardware is not suspected of being bad, yet software applications are having trouble with the hardware, the prtconf command can indicate if the Solaris software recognizes the hardware and if a driver for the hardware is loaded.

9.5.3.1 Options

TABLE 9-4 describes options for the prtconf command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-4 Options for prtconf

| Option | Description | How It Can Help |
|---|---|---|
| No option | Displays the device tree of devices recognized by the OS. | If a hardware device is recognized, then it is probably functioning properly. If the message "(driver not attached)" is displayed for the device or for a sub-device, then the driver for the device might be corrupt or missing, or the device might simply not be in use. |
| -D | Similar to the output with no option, but the device driver is also listed. | Lists the driver needed or used by the OS to enable the device. |
| -p | Similar to the output with no option, but abbreviated. | Reports a brief list of the devices. |
| -V | Displays the version and date of the OpenBoot PROM firmware. | Provides a quick check of firmware version. |


The following example shows output for the prtconf command.


# prtconf
System Configuration:  Sun Microsystems  sun4u
Memory size: 1024 Megabytes
System Peripherals (Software Nodes):
 
SUNW,Sun Ultra 45
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        deblocker (driver not attached)
        disk-label (driver not attached)
        terminal-emulator (driver not attached)
        dropins (driver not attached)
        kbd-translator (driver not attached)
        obp-tftp (driver not attached)
        SUNW,i2c-ram-device (driver not attached)
        SUNW,fru-device (driver not attached)
        SUNW,asr (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
. . .
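
To list only the nodes that report "(driver not attached)", you can filter the prtconf output. This is a minimal sketch, assuming the layout shown above in which the node name is the first word on the line; unattached_nodes is a hypothetical helper name. As the example shows, firmware nodes such as packages and chosen normally report this status, so treat the list as a starting point rather than a list of failures.

```shell
#!/bin/sh
# unattached_nodes (hypothetical helper): print the name of every node
# whose line ends in "(driver not attached)". Assumes the node name is
# the first whitespace-separated word on the line.
unattached_nodes() {
    awk '/\(driver not attached\)/ { print $1 }'
}

# Sample lines copied from the example above. On a live system the
# equivalent would be:  prtconf | unattached_nodes
unattached_nodes <<'EOF'
SUNW,Sun Ultra 45
    packages (driver not attached)
        SUNW,builtin-drivers (driver not attached)
    options, instance #0
    aliases (driver not attached)
EOF
```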

9.5.4 netstat Command

The netstat command displays the network status.

9.5.4.1 Options

TABLE 9-5 describes options for the netstat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-5 Options for netstat

| Option | Description | How It Can Help |
|---|---|---|
| -i | Displays the interface state, including packets in/out, error in/out, collisions, and queue. | Provides a quick overview of the system's network status. |
| -i interval | Providing a trailing number with the -i option repeats the netstat command every interval seconds. | Identifies intermittent or long-duration network events. By redirecting netstat output to a file, overnight activity can be viewed all at once. |
| -p | Displays the media table. | Provides Media Access Controller (MAC) address for hosts on the subnet. |
| -r | Displays the routing table. | Provides routing information. |
| -n | Replaces host names with IP addresses. | Used when an address is more useful than a host name. |


The following example shows output for the netstat -p command.


# netstat -p
 
Net to Media Table: IPv4
Device   IP Address               Mask      Flags   Phys Addr 
------ -------------------- --------------- ----- ---------------
bge0   phatair-46           255.255.255.255       08:00:20:92:4a:47
bge0   ns-umpk27-02-46      255.255.255.255       08:00:20:93:fb:99
bge0   moreair-46           255.255.255.255       08:00:20:8a:e5:03
bge0   fermpk28a-46         255.255.255.255       00:00:0c:07:ac:2e
bge0   fermpk28as-46        255.255.255.255       00:50:e2:61:d8:00
bge0   kayakr               255.255.255.255       08:00:20:d1:83:c7
bge0   matlock              255.255.255.255 SP    00:03:ba:27:01:48
bge0   toronto2             255.255.255.255       08:00:20:b6:15:b5
bge0   tocknett             255.255.255.255       08:00:20:7c:f5:94
bge0   mpk28-lobby          255.255.255.255       08:00:20:a6:d5:c8
bge0   efyinisedeg          255.255.255.255       08:00:20:8d:6a:80
bge0   froggy               255.255.255.255       08:00:20:73:70:44
bge0   d-mpk28-46-245       255.255.255.255       00:10:60:24:0e:00
bge0   224.0.0.0            240.0.0.0       SM    01:00:5e:00:00:00
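
When the media table is long, you can look up the MAC address of a single host. The sketch below assumes the netstat -p layout shown above, with the host name in the second column and the physical address in the last column; mac_for is a hypothetical helper name. It uses the awk -v option, which on Solaris 10 is available in nawk and /usr/xpg4/bin/awk.

```shell
#!/bin/sh
# mac_for (hypothetical helper): print the physical address recorded for
# the host named in the first argument. Assumes the host name is the
# second column and the MAC address is the last column.
mac_for() {
    awk -v host="$1" '$2 == host { print $NF }'
}

# Sample rows copied from the example above. On a live system the
# equivalent would be:  netstat -p | mac_for matlock
mac_for matlock <<'EOF'
Device   IP Address               Mask      Flags   Phys Addr
bge0   phatair-46           255.255.255.255       08:00:20:92:4a:47
bge0   matlock              255.255.255.255 SP    00:03:ba:27:01:48
EOF
```

Reading the last field rather than a fixed column number keeps the lookup correct whether or not the Flags column is filled in.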

9.5.5 ping Command

The ping command sends ICMP ECHO_REQUEST packets to network hosts. Depending on how the ping command is configured, the output displayed can identify troublesome network links or nodes. The destination host is specified in the variable hostname.

9.5.5.1 Options

TABLE 9-6 describes options for the ping command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-6 Options for ping

| Option | Description | How It Can Help |
|---|---|---|
| hostname | The probe packet is sent to hostname and returned. | Verifies that a host is active on the network. |
| -g hostname | Forces the probe packet to route through a specified gateway. | By identifying different routes to the target host, those individual routes can be tested for quality. |
| -i interface | Designates which interface to send and receive the probe packet through. | Enables a simple check of secondary network interfaces. |
| -n | Replaces host names with IP addresses. | Used when an address is more useful than a host name. |
| -s | Pings continuously in one-second intervals. Ctrl-C aborts. Upon abort, statistics are displayed. | Helps identify intermittent or long-duration network events. By redirecting ping output to a file, overnight activity can later be viewed all at once. |
| -svR | Displays the route the probe packet followed in one-second intervals. | Indicates probe packet route and number of hops. Comparing multiple routes can identify bottlenecks. |


The following example shows output for the ping -s command.


# ping -s teddybear
PING teddybear: 56 data bytes
64 bytes from teddybear (192.146.77.140): icmp_seq=0. time=1. ms
64 bytes from teddybear (192.146.77.140): icmp_seq=1. time=0. ms
64 bytes from teddybear (192.146.77.140): icmp_seq=2. time=0. ms
^C
----teddybear PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/1
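
If ping -s output has been saved to a file, the packet-loss figure can be pulled out of the statistics summary. This is a sketch only, assuming the summary line format shown above; packet_loss is a hypothetical helper name, and the sample text is copied from the example.

```shell
#!/bin/sh
# packet_loss (hypothetical helper): print the packet-loss percentage
# (without the % sign) from a ping statistics summary line.
# Assumes a line such as "N packets transmitted, N packets received, N% packet loss".
packet_loss() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); print $i }
    }'
}

# Sample summary copied from the example above. With saved output, the
# equivalent would be:  packet_loss < saved-ping-output
packet_loss <<'EOF'
----teddybear PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/1
EOF
```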

9.5.6 ps Command

The ps command lists the status of system processes. Using options and rearranging the command output can help you determine how the resources of the Sun Ultra 45 or Ultra 25 workstation are allocated.

9.5.6.1 Options

TABLE 9-7 describes options for the ps command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-7 Options for ps

| Option | Description | How It Can Help |
|---|---|---|
| -e | Displays information for every process. | Identifies the process ID and the executable. |
| -f | Generates a full listing. | Provides the following process information: user ID, parent process ID, system time when executed, and the path to the executable. |
| -o option | Enables configurable output. The pid, pcpu, pmem, and comm options display process ID, percent CPU consumption, percent memory consumption, and the responsible executable, respectively. | Provides only the most important information. Knowing the percentage of resource consumption helps identify processes that are affecting system performance and might be hung. |


The following example shows output for one ps command.


# ps -eo pcpu,pid,comm|sort -rn
 1.4 100317 /usr/openwin/bin/Xsun
 0.9 100460 dtwm
 0.1 100677 ps
 0.1 100600 ksh
 0.1 100591 /usr/dt/bin/dtterm
 0.1 100462 /usr/dt/bin/sdtperfmeter
 0.1 100333 mibiisa
%CPU    PID COMMAND
 0.0 100652 /bin/csh
. . .



Note - When you use sort with the -rn options, the column-heading line is sorted as if the value in its first column were zero, so the headings appear among the rows that show 0.0 in the first column.
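
One way to keep the column headings at the top is to pass the first line through unchanged and sort only the remaining lines. The sketch below assumes a POSIX shell (for example ksh or /usr/xpg4/bin/sh on Solaris 10) because it uses read -r; sort_keep_header is a hypothetical helper name, and the sample rows are taken from the example above.

```shell
#!/bin/sh
# sort_keep_header (hypothetical helper): print the first input line
# unchanged, then the remaining lines in descending numeric order.
sort_keep_header() {
    IFS= read -r header        # consume the heading line only
    printf '%s\n' "$header"
    sort -rn                   # sort the rest of stdin
}

# Sample rows taken from the example above. On a live system the
# equivalent would be:  ps -eo pcpu,pid,comm | sort_keep_header
sort_keep_header <<'EOF'
%CPU    PID COMMAND
 0.1 100600 ksh
 1.4 100317 /usr/openwin/bin/Xsun
 0.0 100652 /bin/csh
 0.9 100460 dtwm
EOF
```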



9.5.7 prstat Command

The prstat utility iteratively examines all active processes on the system and reports statistics based on the selected output mode and sort order. The prstat command provides output similar to the ps command.

9.5.7.1 Options

TABLE 9-8 describes options for the prstat command and how those options can help troubleshoot the Sun Ultra 45 or Ultra 25 workstation.


TABLE 9-8 Options for prstat

| Option | Description | How It Can Help |
|---|---|---|
| No option | Displays a sorted list of the top processes that are consuming the most CPU resources. List is limited to the height of the terminal window and the total number of processes. Output is automatically updated every five seconds. Ctrl-C aborts. | Output identifies process ID, user ID, memory used, state, CPU consumption, and command name. |
| -n number | Limits output to number of lines. | Limits amount of data displayed and identifies primary resource consumers. |
| -s key | Permits sorting list by key parameter. | Useful keys are cpu (default), time, and size. |
| -v | Verbose mode. | Displays additional parameters. |


The following example shows output for the prstat command.


# prstat -n 5 -s size
PID     USERNAME  SIZE   RSS STATE  PRI  NICE TIME    CPU   PROCESS/NLWP       
100524  mm39236    28M   21M sleep   48   0   0:00.26 0.3%  maker6X.exe/1
100317  root       28M   69M sleep   59   0   0:00.26 0.7%  Xsun/1
100460  mm39236    11M 8760K sleep   59   0   0:00.03 0.0%  dtwm/8
100453  mm39236  8664K 4928K sleep   48   0   0:00.00 0.0%  dtsession/4
100591  mm39236  7616K 5448K sleep   49   0   0:00.02 0.1%  dtterm/1
Total: 65 processes, 159 lwps, load averages: 0.03, 0.02, 0.04