System Administration Guide

Chapter 68 Troubleshooting Software Problems (Overview)

This chapter provides a general overview of troubleshooting software problems, including information on troubleshooting system crashes and viewing system messages.

Where to Find Software Troubleshooting Tasks

Use these references to find step-by-step instructions for troubleshooting software problems.

Troubleshooting a System Crash

If a system running the Solaris operating environment crashes, provide your service provider with as much information as possible, including core files.

What to Do If the System Crashes

The most important steps are:

  1. Write down the system console messages.

    If a system crashes, making it run again may seem like your most pressing concern. However, before you reboot the system, examine the console screen for messages. These messages may provide some insight about what caused the crash. Even if the system reboots automatically and the console messages have disappeared from the screen, you may be able to check these messages by viewing the system error log file that is generated automatically in /var/adm/messages (or /usr/adm/messages). See "How to View System Messages" for more information about viewing system error log files.

    If you have frequent crashes and can't determine their cause, gather all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative to examine. See "Troubleshooting a System Crash" for a complete list of troubleshooting information to gather for your service provider.

  2. Synchronize the disks and reboot.


    ok sync

    See Chapter 70, Troubleshooting Miscellaneous Software Problems, if the system fails to reboot successfully after a system crash.

  3. Attempt to save the crash information written onto the swap area by running the savecore command.


    # savecore
    

See Chapter 69, Generating and Saving System Crash Information, for information about saving crash dumps automatically.

Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use "Troubleshooting a System Crash Checklist" for gathering troubleshooting data for a crashed system.

Table 68-1 Identifying System Crash Data

Question
    Description

Can you reproduce the problem?
    This is important because a reproducible test case is often essential for debugging difficult problems. By reproducing the problem, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug.

Are you using any third-party drivers?
    Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs.

What was the system doing just before it crashed?
    If the system was doing anything unusual, such as running a new stress test or experiencing higher-than-usual load, that may have led to the crash.

Were there any unusual console messages right before the crash?
    Sometimes the system shows signs of distress before it actually crashes; this information is often useful.

Did you add any tuning parameters to the /etc/system file?
    Sometimes tuning parameters can cause the system to crash, for example, increasing shared memory segments so that the system tries to allocate more memory than it has.

Did the problem start recently?
    If so, did the onset of problems coincide with any changes to the system, for example, new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?
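To answer the /etc/system question quickly, it can help to list only the active lines of the file. The following is a minimal sketch, not part of the original procedure. The function name list_system_tunables is hypothetical, and the sketch assumes that comment lines in /etc/system begin with an asterisk (*) and that a POSIX-style grep is available.

```shell
#!/bin/sh
# list_system_tunables (hypothetical helper): print the active lines of an
# /etc/system-style file so that added tuning parameters are easy to spot.
# Assumption: comment lines begin with '*'.
list_system_tunables() {
    file=${1:-/etc/system}
    # Drop comment lines, then drop blank lines; print whatever remains.
    grep -v '^[*]' "$file" | grep -v '^[ 	]*$'
}
```

Running list_system_tunables with no argument reads /etc/system. Pass another path to inspect a saved copy of the file.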

Troubleshooting a System Crash Checklist

Use this checklist when gathering system data for a crashed system.

Item                                                      Your Data
--------------------------------------------------------  ---------
Is a core file available?
Identify the operating system release and appropriate
  software application release levels.
Identify system hardware. Include prtdiag output from
  sun4d systems.
Are patches installed? If so, include showrev -p output.
Is the problem reproducible?
Does the system have any third-party drivers?
What was the system doing before it crashed?
Were there any unusual console messages right before the
  system crashed?
Did you add any parameters to the /etc/system file?
Did the problem start recently?
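Some checklist items can be collected with a short script. The following is a sketch only. The function name gather_crash_data and the report layout are hypothetical, and the use of uname -a to identify the operating system release is an assumption (the checklist names only prtdiag and showrev -p). A command that is not found on the PATH is recorded as unavailable; prtdiag in particular may not be on the default PATH.

```shell
#!/bin/sh
# gather_crash_data (hypothetical helper): write the output of a few
# checklist commands into a single report file.
gather_crash_data() {
    report=$1
    : > "$report"                      # start with an empty report
    for cmd in "uname -a" "showrev -p" "prtdiag"; do
        set -- $cmd                    # split into command name and arguments
        echo "=== $cmd ===" >> "$report"
        if command -v "$1" >/dev/null 2>&1; then
            # Record the output; note a failure instead of stopping.
            "$@" >> "$report" 2>&1 || echo "(command failed)" >> "$report"
        else
            echo "(not available on this system)" >> "$report"
        fi
    done
}
```

For example, gather_crash_data /tmp/crash-report.txt produces one file that can be sent to the service provider together with the answers to the remaining checklist questions.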

 

Viewing System Messages

When a system crashes, it may display a message on the system console like this:


panic: error message

where error message is one of the panic error messages described in crash(1M).

Less frequently, this message may be displayed instead of the panic message:


Watchdog reset !
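If the console text was also captured in a message file, a search such as the following may help locate these two messages. This is a minimal sketch: the function name find_crash_messages is hypothetical, and it assumes a grep that accepts more than one -e option.

```shell
#!/bin/sh
# find_crash_messages (hypothetical helper): print, with line numbers,
# the lines of a message file that mention a panic or a watchdog reset.
find_crash_messages() {
    file=${1:-/var/adm/messages}
    grep -n -e 'panic' -e 'Watchdog reset' "$file"
}
```

The function returns a nonzero status when neither message is found, which makes it usable in a conditional.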

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in /var/adm (or /usr/adm). You can direct where these messages are stored by setting up system logging. See "How to Customize System Message Logging" for more information. These messages can alert you to system problems, such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in /var/adm/messages (and in messages.0), and the oldest are in messages.3. After a period of time (usually every ten days), a new messages file is created. At that point the current /var/adm/messages.3 is deleted, messages.2 is renamed messages.3, messages.1 is renamed messages.2, messages.0 is renamed messages.1, and the previous messages file becomes messages.0.
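The rotation can be pictured with the following sketch. It is an illustration only, not the script the system actually runs, and the function name rotate_messages is hypothetical. The step that turns the current messages file into messages.0 is an assumption inferred from the description. Try it only on a scratch directory, never on /var/adm itself.

```shell
#!/bin/sh
# rotate_messages (hypothetical helper): mimic the described rotation of
# messages, messages.0 ... messages.3 inside the directory given as $1.
rotate_messages() {
    dir=${1:?usage: rotate_messages DIR}
    # The oldest file is deleted first.
    rm -f "$dir/messages.3"
    # Shift each remaining numbered file up by one, oldest first.
    for n in 2 1 0; do
        if [ -f "$dir/messages.$n" ]; then
            mv "$dir/messages.$n" "$dir/messages.$((n + 1))"
        fi
    done
    # Assumption: the current file becomes messages.0 and a new, empty
    # messages file is started.
    if [ -f "$dir/messages" ]; then
        mv "$dir/messages" "$dir/messages.0"
    fi
    : > "$dir/messages"
}
```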

Because /var/adm stores large files containing messages, crash dumps, and other data, this directory can consume a large amount of disk space. To keep the /var/adm directory from growing too large, and to ensure that future crash dumps can be saved, you should remove unneeded files periodically. You can automate this task by using crontab. See "How to Delete Crash Dump Files" and Chapter 59, Scheduling System Events (Tasks), for more information on automating this task.
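Before removing anything, it is safer to review which files are old enough to be candidates. The following sketch only lists files and deletes nothing. The function name list_old_files and the age threshold are hypothetical choices, not values taken from this guide.

```shell
#!/bin/sh
# list_old_files (hypothetical helper): print regular files under DIR that
# were last modified more than DAYS days ago. Nothing is removed.
list_old_files() {
    dir=${1:?usage: list_old_files DIR DAYS}
    days=${2:?usage: list_old_files DIR DAYS}
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_old_files /var/adm 30 shows files untouched for more than 30 days. After the list has been reviewed, a similar find command could be scheduled with crontab to do the actual cleanup.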

How to View System Messages

Display recent messages generated by a system crash or reboot by using the dmesg command.


$ dmesg

Or use the more command to display one screen of messages at a time.


$ more /var/adm/messages

For more information, refer to dmesg(1M).

Example--Viewing System Messages

The following example shows output from the dmesg command.


$ dmesg
Nov 12 16:53
SunOS Release 5.6 Version A [UNIX(R) System V Release 4.0]
copyright (c) 1983-1997, Sun Microsystems, Inc.
DEBUG enabled
WARNING: cannot load psm xpcimach
mem = 32376K (0x1f9e000)
avail mem = 25247744
root nexus = i86pc
Unable to install/attach drive `isa'
eisa0 at root
NOTICE: eisa: DMA buffer-chaining not enabled
NOTICE: IN i8042_acquire
NOTICE: out i8042_acquire
NOTICE: IN i8042_release
NOTICE: about to enable keyboard
NOTICE: out i8042_release
          .
          .
          .
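In long output it can be useful to show only the warnings and notices. The following filter is a minimal sketch. The function name filter_alerts is hypothetical, and it assumes that the lines of interest begin with WARNING: or NOTICE:, as they do in the example above. Output that carries a timestamp or host name prefix would need a different pattern.

```shell
#!/bin/sh
# filter_alerts (hypothetical helper): read message text on standard input
# and print only the lines that start with WARNING: or NOTICE:.
filter_alerts() {
    grep -e '^WARNING:' -e '^NOTICE:'
}
```

A typical use is dmesg | filter_alerts.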

Customizing System Message Logging

You can capture additional error messages that are generated by various system processes by modifying the /etc/syslog.conf file. By default, /etc/syslog.conf directs many system process messages to the /var/adm message files. Crash and boot messages are stored here as well. To view /var/adm messages, see "How to View System Messages".

The /etc/syslog.conf file has two columns separated by tabs:

facility.level ...	action

facility.level

A facility is the system source of the message or condition. This field may be a comma-separated list of facilities. Facility values are listed in Table 68-2. A level indicates the severity or priority of the condition being logged. Priority levels are listed in Table 68-3.

action

The action field indicates where the messages are forwarded. 

The following example shows sample lines from a default /etc/syslog.conf file.


user.err					/dev/console
user.err					/var/adm/messages
user.alert					`root, operator'
user.emerg					*
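To see how each line splits into its two columns, the file can be read with awk using the tab character as the separator. This is a sketch under stated assumptions: the function name show_syslog_rules is hypothetical, lines that begin with # are treated as comments, and lines without a tab (such as m4 macro lines) are skipped.

```shell
#!/bin/sh
# show_syslog_rules (hypothetical helper): print the selector and action
# columns of each tab-separated rule in a syslog.conf-style file.
show_syslog_rules() {
    file=${1:-/etc/syslog.conf}
    awk -F'\t+' '
        /^#/ || /^[ \t]*$/ { next }       # skip comments and blank lines
        NF >= 2 {
            action = $NF
            sub(/^ +/, "", action)        # tolerate stray leading spaces
            printf "selector=%s action=%s\n", $1, action
        }
    ' "$file"
}
```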

The most common error condition sources are shown in Table 68-2. The most common priorities are shown in Table 68-3 in order of severity.

Table 68-2 Source Facilities for syslog.conf Messages

Source    Description
------    -----------
kern      The kernel
auth      Authentication
daemon    All daemons
mail      Mail system
lp        Spooling system
user      User processes


Note - Starting in the Solaris 2.6 release, the number of syslog facilities that can be activated in the /etc/syslog.conf file is unlimited. In previous releases, the number of facilities was limited to 20.


Table 68-3 Priority Levels for syslog.conf Messages

Priority  Description
--------  -----------
emerg     System emergencies
alert     Errors requiring immediate correction
crit      Critical errors
err       Other errors
info      Informational messages
debug     Output used for debugging
none      This setting doesn't log output
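A selector such as user.err normally matches messages at the named level and at every more severe level. That rule comes from general syslog behavior and is not stated in the table itself. The following sketch illustrates the comparison. The function names level_rank and selector_matches are hypothetical, and the ranks are simply positions in Table 68-3 (only the levels listed there are covered), not the numeric codes that syslogd uses internally.

```shell
#!/bin/sh
# level_rank (hypothetical helper): map a priority name from Table 68-3 to
# its position in the table; 0 is the most severe.
level_rank() {
    case $1 in
        emerg) echo 0 ;;
        alert) echo 1 ;;
        crit)  echo 2 ;;
        err)   echo 3 ;;
        info)  echo 4 ;;
        debug) echo 5 ;;
        *)     return 1 ;;                # unknown level name
    esac
}

# selector_matches MESSAGE_LEVEL SELECTOR_LEVEL (hypothetical helper):
# succeed when a message at MESSAGE_LEVEL is at least as severe as the
# level named in the selector. A selector level of none never matches.
selector_matches() {
    [ "$2" = none ] && return 1
    msg=$(level_rank "$1") || return 2
    sel=$(level_rank "$2") || return 2
    [ "$msg" -le "$sel" ]
}
```

For example, selector_matches alert err succeeds, while selector_matches info err fails.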

How to Customize System Message Logging

  1. Become superuser.

  2. Using the editor of your choice, edit the /etc/syslog.conf file, adding or changing message sources, priorities, and message locations according to the syntax described in syslog.conf(4).

  3. Exit the file, saving the changes.
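Because the two columns must be separated by tabs, a quick check after editing can catch a selector that is followed by a space instead. The following is a rough sketch and not a full syntax checker. The function name check_syslog_tabs is hypothetical, and the test it applies (a first word containing a period, followed directly by a space) is an assumption chosen to avoid flagging m4 macro lines.

```shell
#!/bin/sh
# check_syslog_tabs (hypothetical helper): report rule lines in which the
# selector is followed by a space rather than a tab. Exit status is
# nonzero when at least one such line is found.
check_syslog_tabs() {
    file=${1:-/etc/syslog.conf}
    awk '
        /^#/ || /^[ \t]*$/ { next }       # skip comments and blank lines
        /^[^ \t]+\.[^ \t]+ / {
            printf "line %d: space after selector: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$file"
}
```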

Example--Customizing System Message Logging

The following /etc/syslog.conf lines are provided by default during the Solaris installation process.


user.err					/dev/console
user.err					/var/adm/messages
user.alert					`root, operator'
user.emerg					*
 

This means the following user messages are automatically logged: