System Administration Guide, Volume 2

Chapter 38 Troubleshooting Software Problems (Overview)

This chapter provides a general overview of troubleshooting software problems, including information on troubleshooting system crashes and viewing system messages.

Where to Find Software Troubleshooting Tasks

Use these references to find step-by-step instructions for troubleshooting software problems.

What's New in System Troubleshooting?

This section describes new system troubleshooting features in the Solaris 8 release.

`apptrace`

A new application debugging tool, apptrace, enables application developers and system support personnel to debug application or system problems by providing call traces to Solaris shared libraries, which may show the series of events leading up to a point of failure.

The apptrace tool provides more reliable call-tracing than the previously available sotruss command. It also provides better display of function arguments, return values, and error cases for any Solaris library interface.

By default, apptrace traces calls directly from the executable object, specified on the command line, to every shared library the executable depends on.

See apptrace(1) for more information.

Improved Core File Management

The `coreadm` Command

This release introduces the coreadm command, which provides flexible core file naming conventions and better core file retention. For example, you can use the coreadm command to configure a system so that all process core files are placed in a single system directory. This means it is easier to track problems by examining the core files in a specific directory whenever a Solaris process or daemon terminates abnormally.

Two new configurable core file paths, per-process and global, can be enabled or disabled independently of each other. When a process terminates abnormally, it produces a core file in the current directory as in previous Solaris releases. But if a global core file path is enabled and set to /corefiles/core, for example, then each process that terminates abnormally would produce two core files: one in the current working directory and one in the /corefiles directory.

By default, the Solaris core paths and core file retention remain the same.
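As an illustration of the flexible naming described above, the following sketch previews the file name that a core file pattern would produce. It assumes the %f (executable name) and %p (process ID) pattern tokens documented in coreadm(1M); the helper name expand_core_pattern is hypothetical and is not part of Solaris.

```shell
# Hypothetical helper: expand the %f (executable name) and %p (process ID)
# tokens of a coreadm-style pattern. Only these two tokens are handled,
# and names containing '|' or '&' are not supported by this simple sketch.
expand_core_pattern() {
    pattern=$1
    exe=$2
    pid=$3
    printf '%s\n' "$pattern" | sed -e "s|%f|$exe|g" -e "s|%p|$pid|g"
}

# Preview where a.out (process ID 19305) would leave its global core file.
expand_core_pattern /corefiles/core.%f.%p a.out 19305

# On a real system, the superuser might set and enable such a global path with:
#   coreadm -g /corefiles/core.%f.%p -e global
```

Including the executable name and process ID in the pattern keeps core files from different processes from overwriting each other in the shared directory.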

See "Managing Core Files (coreadm)" and coreadm(1M) for more information.

Examining Core Files With Proc Tools

Some of the proc tools have been enhanced to examine process core files as well as live processes. The proc tools are utilities that can manipulate features of the /proc file system.

The /usr/proc/bin/pstack, pmap, pldd, pflags, and pcred tools can now be applied to core files by specifying the name of the core file on the command line, similar to the way you specify a process ID to these commands. For example:

$ ./a.out
Segmentation Fault(coredump)
$ /usr/proc/bin/pstack ./core
core './core' of 19305: ./a.out
 000108c4 main     (1, ffbef5cc, ffbef5d4, 20800, 0, 0) + 1c
 00010880 _start   (0, 0, 0, 0, 0, 0) + b8

For more information on using proc tools to examine core files, see proc(1).

New Remote Console Messaging Features

New remote console features improve your ability to troubleshoot remote systems.

See "Enabling Remote Console Messaging" and consadm(1M) for more information.

Troubleshooting a System Crash

If a system running the Solaris operating environment crashes, provide your service provider with as much information as possible, including crash dump files.

What to Do if the System Crashes

The most important things to remember are:

Write down the system console messages.

If a system crashes, making it run again might seem like your most pressing concern. However, before you reboot the system, examine the console screen for messages. These messages can provide some insight about what caused the crash. Even if the system reboots automatically and the console messages have disappeared from the screen, you might be able to check these messages by viewing the system error log file that is generated automatically in /var/adm/messages (or /usr/adm/messages). See "How to View System Messages" for more information about viewing system error log files.

If you have frequent crashes and can't determine their cause, gather all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative to examine. See "Troubleshooting a System Crash" for a complete list of troubleshooting information to gather for your service provider.

Synchronize the disks and reboot.
ok sync
See Chapter 40, Troubleshooting Miscellaneous Software Problems if the system fails to reboot successfully after a system crash.
Attempt to save the crash information written onto the swap area by running the savecore command.
# savecore

See Chapter 39, Managing System Crash Information for information about saving crash dumps automatically.

Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use "Troubleshooting a System Crash Checklist" for gathering troubleshooting data for a crashed system.

Table 38-1 Identifying System Crash Data


Question	Description
Can you reproduce the problem?	This is important because a reproducible test case is often essential for debugging difficult problems. By reproducing the problem, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug.
Are you using any third-party drivers?	Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs.
What was the system doing just before it crashed?	If the system was doing anything unusual like running a new stress test or experiencing higher-than-usual load, that may have led to the crash.
Were there any unusual console messages right before the crash?	Sometimes the system will show signs of distress before it actually crashes; this information is often useful.
Did you add any tuning parameters to the `/etc/system` file?	Sometimes tuning parameters, such as increasing shared memory segments so that the system tries to allocate more than it has, can cause the system to crash.
Did the problem start recently?	If so, did the onset of problems coincide with any changes to the system, for example, new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?

Troubleshooting a System Crash Checklist

Use this checklist when gathering system data for a crashed system.

Item	Your Data
Is a core file available?
Identify the operating system release and appropriate software application release levels.
Identify system hardware. Include `prtdiag` output from sun4d systems.
Are patches installed? If so, include `showrev -p` output.
Is the problem reproducible?
Does the system have any third-party drivers?
What was the system doing before it crashed?
Were there any unusual console messages right before the system crashed?
Did you add any parameters to the `/etc/system` file?
Did the problem start recently?
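Part of this checklist can be collected with a short script. The following is a minimal sketch, not a supported tool: the function name gather_crash_data and the output file are hypothetical, and it assumes that showrev and prtdiag can be found through PATH (on many systems prtdiag lives in a platform-specific directory, so you might need to adjust PATH first).

```shell
# Hypothetical helper: save the output of several diagnostic commands
# in one file to pass on to a service provider.
gather_crash_data() {
    out=$1
    : > "$out"                          # start with an empty report file
    for cmd in "uname -a" "showrev -p" "prtdiag"; do
        set -- $cmd                     # split the command into words
        printf '==== %s ====\n' "$cmd" >> "$out"
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >> "$out" 2>&1
        else
            printf '(%s not found on this system)\n' "$1" >> "$out"
        fi
    done
}

# Example use:
#   gather_crash_data /var/tmp/crashinfo.txt
```

The remaining checklist items, such as what the system was doing before the crash, still have to be answered by hand.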

Viewing System Messages

System messages display on the console device. The text of most system messages looks like this:

[ID msgid facility.priority]

For example:

[ID 672855 kern.notice] syncing file systems...

If the message originated in the kernel, the kernel module name is displayed. For example:

Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full
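Because the bracketed tag has a fixed layout, it is easy to pull apart with standard text tools. The following sketch extracts the message ID, facility, and priority from a message line; the function name parse_msg_tag is hypothetical.

```shell
# Hypothetical helper: read message lines on standard input and print
# "msgid facility priority" for each line that carries an [ID ...] tag.
parse_msg_tag() {
    sed -n 's/.*\[ID \([0-9][0-9]*\) \([a-z0-9]*\)\.\([a-z]*\)].*/\1 \2 \3/p'
}

# Try it on the sample message shown above.
echo 'Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full' |
    parse_msg_tag
```

For the sample line, this prints 845546 kern notice. Lines without a tag produce no output.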

When a system crashes, it may display a message on the system console like this:

panic: error message

where error message is one of the panic error messages described in crash(1M).

Less frequently, this message may be displayed instead of the panic message:

Watchdog reset !

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm directory. You can direct where these messages are stored by setting up system logging. See "How to Customize System Message Logging" for more information. These messages can alert you to system problems, such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in /var/adm/messages (and in messages.0), and the oldest are in messages.3. After a period of time (usually every ten days), the files are rotated: the current messages.3 is deleted, messages.2 is renamed messages.3, messages.1 is renamed messages.2, messages.0 is renamed messages.1, the current messages file becomes messages.0, and a new messages file is created.
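The rotation order can be modeled in a few lines of shell. This is a sketch for illustration only, not the script the system actually runs; the function name rotate_messages is hypothetical, and it assumes that the current messages file becomes messages.0 when the new file is created.

```shell
# Hypothetical model of the rotation: the oldest file is dropped, every
# other file moves up one suffix, and an empty messages file is created.
rotate_messages() {
    dir=$1
    rm -f "$dir/messages.3"
    for n in 2 1 0; do
        if [ -f "$dir/messages.$n" ]; then
            mv "$dir/messages.$n" "$dir/messages.$((n + 1))"
        fi
    done
    if [ -f "$dir/messages" ]; then
        mv "$dir/messages" "$dir/messages.0"
    fi
    : > "$dir/messages"
}

# Demonstrate on a scratch directory, never on the live /var/adm.
demo=$(mktemp -d)
echo "newest" > "$demo/messages"
echo "older" > "$demo/messages.0"
rotate_messages "$demo"
ls "$demo"
rm -rf "$demo"
```

Renaming from the highest suffix downward ensures that no file is overwritten before it has been moved.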

Because /var/adm stores large files containing messages, crash dumps, and other data, this directory can consume lots of disk space. To keep the /var/adm directory from growing too large, and to ensure that future crash dumps can be saved, you should remove unneeded files periodically. You can automate this task by using crontab. See "How to Delete Crash Dump Files" and Chapter 30, Scheduling System Events (Tasks) for more information on automating this task.
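Before you automate any removal, it can help to review which files are candidates. The following sketch only lists old files and deletes nothing; the function name list_old_files, the 30-day threshold, and the schedule in the commented crontab entry are assumptions chosen for illustration.

```shell
# Hypothetical helper: list regular files under a directory that were
# last modified more than the given number of days ago. Nothing is deleted.
list_old_files() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -print
}

# Example use:
#   list_old_files /var/adm 30
#
# A crontab entry (hypothetical schedule) that produces the same list
# every Sunday at 3:00 a.m.:
#   0 3 * * 0 find /var/adm -type f -mtime +30 -print
```

Once you are satisfied with the list, you can decide which of the files are safe to remove.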

How to View System Messages

Display recent messages generated by a system crash or reboot by using the dmesg command.

$ dmesg

Or use the more command to display one screen of messages at a time.

$ more /var/adm/messages

For more information, refer to dmesg(1M).

Example--Viewing System Messages

The following example shows output from the dmesg command.

$ dmesg
date starbug genunix: [ID 540533 kern.notice] SunOS Release 5.8 Version 64-bit
date starbug genunix: [ID 223299 kern.notice] Copyright (c) 1983-1999 by Sun Microsystems, Inc.
date starbug genunix: [ID 678236 kern.info] Ethernet address = xx:xx:xx:xx:xx:xx
date starbug unix: [ID 389951 kern.info] mem = 131072K (0x8000000)
date starbug unix: [ID 930857 kern.info] avail mem = 122134528
date starbug rootnex: [ID 466748 kern.info] root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
date starbug rootnex: [ID 349649 kern.info] pcipsy0 at root: UPA 0x1f 0x0
date starbug genunix: [ID 936769 kern.info] pcipsy0 is /pci@1f,0
date starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1,1, simba0
date starbug genunix: [ID 936769 kern.info] simba0 is /pci@1f,0/pci@1,1
date starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1, simba1
date starbug genunix: [ID 936769 kern.info] simba1 is /pci@1f,0/pci@1
date starbug simba: [ID 370704 kern.info] PCI-device: ide@3, uata0
date starbug genunix: [ID 936769 kern.info] uata0 is /pci@1f,0/pci@1,1/ide@3
.
.
.

Customizing System Message Logging

You can capture additional error messages that are generated by various system processes by modifying the /etc/syslog.conf file. By default, /etc/syslog.conf directs many system process messages to the /var/adm message files. Crash and boot messages are stored here as well. To view /var/adm messages, see "How to View System Messages".

The /etc/syslog.conf file has two columns separated by tabs:

facility.level ...	action

`facility.level`	A `facility` is the system source of the message or condition. It may be a comma-separated list of facilities. Facility values are listed in Table 38-2. A `level` indicates the severity or priority of the condition being logged. Priority levels are listed in Table 38-3.
`action`	The action field indicates where the messages are forwarded.

The following example shows sample lines from a default /etc/syslog.conf file.

user.err                                        /dev/sysmsg
user.err                                        /var/adm/messages
user.alert                                      `root, operator'
user.emerg                                      *

This means the following user messages are automatically logged:

User errors are printed to the console and also are logged to the /var/adm/messages file.
User messages requiring immediate action (alert) are sent to the root and operator users.
User emergency messages are sent to all users who are logged in (*).
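Because the two columns must be separated by tabs, an entry typed with spaces is an easy mistake to make. The following sketch flags lines that have no tab between the selector and the action. The function name check_syslog_conf is hypothetical, and the check is deliberately simple: it does not understand m4 macros, which a real syslog.conf file may contain.

```shell
# Hypothetical helper: report syslog.conf-style lines in which the selector
# and the action are not separated by at least one tab. Comment lines and
# blank lines are skipped. Returns nonzero if any line is flagged.
check_syslog_conf() {
    awk -F'\t+' '
        BEGIN { bad = 0 }
        /^[ \t]*(#|$)/ { next }
        NF < 2 || $2 == "" {
            printf "line %d: no tab between selector and action\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example use on a copy of the file:
#   cp /etc/syslog.conf /tmp/syslog.conf.new
#   check_syslog_conf /tmp/syslog.conf.new
```

Checking a copy before you install it avoids leaving the live configuration in a state where messages are silently dropped.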

The most common error condition sources are shown in the table below. The most common priorities are shown in Table 38-3 in order of severity.

Table 38-2 Source Facilities for syslog.conf Messages


Source	Description
`kern`	The kernel
`auth`	Authentication
`daemon`	All daemons
`mail`	Mail system
`lp`	Spooling system
`user`	User processes

Note -

Starting in the Solaris 2.6 release, the number of syslog facilities that can be activated in the /etc/syslog.conf file is unlimited. In previous releases, the number of facilities was limited to 20.

Table 38-3 Priority Levels for syslog.conf Messages


Priority	Description
`emerg`	System emergencies
`alert`	Errors requiring immediate correction
`crit`	Critical errors
`err`	Other errors
`info`	Informational messages
`debug`	Output used for debugging
`none`	This setting doesn't log output

How to Customize System Message Logging

Become superuser.

Using the editor of your choice, edit the /etc/syslog.conf file, adding or changing message sources, priorities, and message locations according to the syntax described in syslog.conf(4).

Exit the file, saving the changes.

Example--Customizing Message System Logging

This sample /etc/syslog.conf user.emerg facility sends user emergency messages to root and individual users.

user.emerg                                      `root, *'

Enabling Remote Console Messaging

The following new console features improve your ability to troubleshoot remote systems:

The consadm command enables you to select a serial device as an auxiliary (or remote) console. Using the consadm command, a system administrator can configure one or more serial ports to display redirected console messages and to host sulogin sessions when the system transitions between run levels. This feature enables you to dial in to a serial port with a modem to monitor console messages and participate in init state transitions. (See sulogin(1M) and the step-by-step procedures below for more information.)

While you can log in to a system using a port configured as an auxiliary console, it is primarily an output device displaying information that is also displayed on the default console. If boot scripts or other applications read and write to and from the default console, the write output displays on all the auxiliary consoles, but the input is only read from the default console. (See "Using the consadm Command During an Interactive Login Session" for more information.)
Console output now consists of kernel and syslog messages written to a new pseudo device, /dev/sysmsg. In addition, rc script startup messages are written to /dev/msglog. Previously, all of these messages were written to /dev/console.

Scripts that direct console output to /dev/console need to be changed to /dev/msglog if you want to see script messages displayed on the auxiliary consoles. Programs referencing /dev/console should be explicitly modified to use syslog() or strlog() if you want messages to be redirected to an auxiliary device.
The consadm command runs a daemon to monitor auxiliary console devices. Any display device designated as an auxiliary console that disconnects (hangs up or loses carrier) is removed from the auxiliary console device list and is no longer active. Enabling one or more auxiliary consoles does not disable message display on the default console; messages continue to display on /dev/console.
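For scripts that currently write to /dev/console, the change can be isolated in one small function. The following is a sketch under stated assumptions: the function name console_msg and the MSGLOG variable are hypothetical, and the fallback to standard error is a design choice for systems where /dev/msglog is not available.

```shell
# Device that receives script messages. MSGLOG is a hypothetical override
# that makes the function usable on systems without /dev/msglog.
MSGLOG=${MSGLOG:-/dev/msglog}

# Hypothetical helper: write a message to the message log device so that
# it also appears on auxiliary consoles; fall back to standard error.
console_msg() {
    if [ -w "$MSGLOG" ]; then
        printf '%s\n' "$*" >> "$MSGLOG"
    else
        printf '%s\n' "$*" >&2
    fi
}

# Example use in a startup script:
#   console_msg "starting myservice"
```

Routing every message through one function means that only a single line has to change if the target device changes again.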

Using Auxiliary Console Messaging During Run Level Transitions

Keep the following in mind when using auxiliary console messaging during run level transitions:

Input cannot come from an auxiliary console if user input is expected for an rc script that is run when a system is booting. The input must come from the default console.
The sulogin program, invoked by init to prompt for the superuser password when transitioning between run levels, has been modified to send the superuser password prompt to each auxiliary device in addition to the default console device.
When the system is in single-user mode and one or more auxiliary consoles are enabled using the consadm command, a console login session runs on the first device to supply the correct superuser password to the sulogin prompt. When the correct password is received from a console device, sulogin disables input from all other console devices.
A message is displayed on the default console and the other auxiliary consoles when one of the consoles assumes single-user privileges. This message indicates which device has become the console by accepting a correct superuser password. If there is a loss of carrier on the auxiliary console running the single-user shell, one of two actions may occur:
- If the auxiliary console represents a system at run level 1, the system proceeds to the default run level.
- If the auxiliary console represents a system at run level S, the system displays the ENTER RUN LEVEL (0-6, s or S): message on the device where the init s or shutdown command had been entered from the shell. If there isn't any carrier on that device either, you will have to reestablish carrier and enter the correct run level. The init or shutdown command will not redisplay the run-level prompt.
If you are logged in to a system using a serial port, and an init or shutdown command is issued to transition to another run level, the login session is lost whether this device is the auxiliary console or not. This situation is identical to Solaris releases without auxiliary console capabilities.
Once a device is selected as an auxiliary console using the consadm command, it remains the auxiliary console until the system is rebooted or the auxiliary console is unselected. However, the consadm command includes an option to set a device as the auxiliary console across system reboots. (See the procedure below for step-by-step instructions.)

Using the `consadm` Command During an Interactive Login Session

You can log in to a system from a terminal that is connected to a serial port and then use the consadm command to see the console messages on that terminal. If you do, note the following behavior.

If you use the terminal for an interactive login session while the auxiliary console is active, the console messages are sent to the /dev/sysmsg or /dev/msglog devices.
While you issue commands on the terminal, input goes to your interactive session and not to the default console (/dev/console).
If you run the init command to change run levels, the remote console software kills your interactive session and runs the sulogin program. At this point, input is accepted only from the terminal and is treated like it's coming from a console device. This allows you to enter your password to the sulogin program as described in "Using Auxiliary Console Messaging During Run Level Transitions".

Then, if you enter the correct password on the (auxiliary) terminal, the auxiliary console runs an interactive sulogin session and locks out the default console and any competing auxiliary console. This means the terminal essentially functions as the system console.
From here you can change to run level 3 or go to another run level. If you change run levels, sulogin runs again on all console devices. If you exit or specify that the system should come up to run level 3, then all auxiliary consoles lose their ability to provide input. They revert to being display devices for console messages.

As the system is coming up, you must provide information to rc scripts on the default console device. After the system comes back up, the login program runs on the serial ports and you can log back into another interactive session. If you've designated the device to be an auxiliary console, you will continue to get console messages on your terminal, but all input from the terminal goes to your interactive session.

How to Enable an Auxiliary (Remote) Console

The consadm daemon does not start monitoring the port until after you add the auxiliary console with the consadm command. As a security feature, console messages are only redirected until carrier drops, or the auxiliary console device is unselected. This means carrier must be established on the port before you can successfully use the consadm command.

See consadm(1M) for more information on enabling an auxiliary console.

Enable the auxiliary console.
# consadm -a devicename

Verify that the current connection is the auxiliary console.
# consadm

Example--Enabling an Auxiliary (Remote) Console

# consadm -a /dev/term/a
# consadm
 /dev/term/a

How to Display a List of Auxiliary Consoles

Select one of the following steps:
1. Display the list of auxiliary consoles.
  # consadm
  /dev/term/a
2. Display the list of persistent auxiliary consoles.
  # consadm -p
  /dev/term/b

How to Enable an Auxiliary (Remote) Console Across System Reboots

Enable the auxiliary console across system reboots.
# consadm -a -p devicename
This adds the device to the list of persistent auxiliary consoles.

Verify that the device has been added to the list of persistent auxiliary consoles.
# consadm

Example--Enabling an Auxiliary (Remote) Console Across System Reboots

# consadm -a -p /dev/term/a 
# consadm
/dev/term/a

How to Disable an Auxiliary (Remote) Console

Select one of the following steps:
1. Disable the auxiliary console.
  # consadm -d devicename
  or
2. Disable the auxiliary console and remove it from the list of persistent auxiliary consoles.
  # consadm -p -d devicename

Verify that the auxiliary console has been disabled.
# consadm

Example--Disabling an Auxiliary (Remote) Console

# consadm -d /dev/term/a
# consadm