System Administration Guide: Advanced Administration

Chapter 26 Troubleshooting Software Problems (Overview)

This chapter provides a general overview of troubleshooting software problems, including information on troubleshooting system crashes and viewing system messages.

What's New in Troubleshooting Software Problems?

This section describes features that are new in the Solaris 9 release.

New System Log Rotation

In the Solaris 9 release, system log files are rotated by the logadm command from an entry in the root crontab file. The /usr/lib/newsyslog script is no longer used.

The new system log rotation is defined in the /etc/logadm.conf file. This file includes log rotation entries for processes such as syslogd. For example, one entry in the /etc/logadm.conf file specifies that the /var/log/syslog file is rotated weekly unless the file is empty. The most recent syslog file becomes syslog.0, the next most recent becomes syslog.1, and so on. Eight previous syslog log files are kept.
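The naming scheme described above (the current log becomes .0, .0 becomes .1, and so on, with empty logs skipped) can be sketched with ordinary shell commands. This is only an illustration of the scheme, not how logadm is implemented; the rotate_log function name and its two arguments are assumptions made for this sketch.

```shell
# Illustrative sketch only: mimics the .0, .1, ... naming scheme, not logadm itself.
# Usage: rotate_log <logfile> <number-of-old-copies-to-keep>
rotate_log() {
    log=$1
    keep=$2
    # Skip rotation when the log is missing or empty.
    [ -s "$log" ] || return 0
    # Drop the oldest copy, then shift each remaining copy up by one.
    i=$((keep - 1))
    rm -f "$log.$i"
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        if [ -f "$log.$prev" ]; then
            mv "$log.$prev" "$log.$i"
        fi
        i=$prev
    done
    # The current log becomes .0 and a fresh empty log takes its place.
    mv "$log" "$log.0"
    : > "$log"
}
```

Called as rotate_log with a scratch copy of a log file and a count of 8, the function keeps copies numbered .0 through .7, which matches the eight previous files mentioned above.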

The /etc/logadm.conf file also contains time stamps of when the last log rotation occurred.

You can use the logadm command to customize system logging and to add entries to the /etc/logadm.conf file as needed.

For example, to rotate the Apache access and error logs, use the following commands:


# logadm -w /var/apache/logs/access_log -s 100m
# logadm -w /var/apache/logs/error_log -s 10m

In this example, the Apache access_log file is rotated when it reaches 100 Mbytes in size. Old copies are given a .0, .1, (and so on) suffix, and 10 copies of the old access_log file are kept. The error_log file is rotated when it reaches 10 Mbytes in size, with the same suffixes and number of copies as the access_log file.

The /etc/logadm.conf entries for the preceding Apache log rotation examples look similar to the following:


# cat /etc/logadm.conf
.
.
.
/var/apache/logs/error_log -s 10m
/var/apache/logs/access_log -s 100m

For more information, see logadm(1M).

You can use the logadm command as superuser or by assuming an equivalent role (with Log Management rights). With role-based access control (RBAC), you can grant non-root users the privilege of maintaining log files by providing access to the logadm command.

For example, add the following entry to the /etc/user_attr file to grant user andy the ability to use the logadm command:


andy::::profiles=Log Management

Or, you can set up a role for log management by using the Solaris Management Console. For more information about setting up a role, see “Role-Based Access Control (Overview)” in System Administration Guide: Security Services.
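To check whether an entry like the one shown above for andy is present, a short script can read a file in user_attr format. The following is a minimal sketch, assuming colon-separated fields with the attributes in the fifth field, attributes separated by semicolons, and profile names separated by commas. The has_profile function name is hypothetical, and the sketch reads only the file it is given, so profiles granted through other sources are not seen.

```shell
# Sketch: succeed if user $1 is assigned profile $2 in the user_attr-style file $3.
has_profile() {
    awk -F: -v u="$1" -v p="$2" '
        $1 == u {
            # The fifth field holds key=value attributes separated by semicolons.
            n = split($5, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^profiles=/) {
                    sub(/^profiles=/, "", attrs[i])
                    # Profile names are separated by commas.
                    m = split(attrs[i], names, ",")
                    for (j = 1; j <= m; j++)
                        if (names[j] == p) found = 1
                }
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$3"
}
```

For example, has_profile andy "Log Management" /etc/user_attr returns success once the entry has been added.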

New Fall Back Shell for root Account

If you changed root's shell to a non-existent shell in previous Solaris releases, you were forced to boot the system from a local CD or from the network and correct the root shell entry in the /etc/passwd file.

If you mistakenly provide a non-existent shell for root in the Solaris 9 release, root's shell will automatically fall back to /sbin/sh when one of the following occurs:

For more information, see su(1M).

Where to Find Software Troubleshooting Tasks

Troubleshooting Task                          For More Information
--------------------------------------------  -------------------------------------------------------
Manage system crash information               Chapter 28, Managing System Crash Information (Tasks)
Manage core files                             Chapter 27, Managing Core Files (Tasks)
Troubleshoot software problems such as        Chapter 29, Troubleshooting Miscellaneous Software
  reboot failures and backup problems           Problems (Tasks)
Troubleshoot file access problems             Chapter 30, Troubleshooting File Access Problems (Tasks)
Troubleshoot printing problems                Chapter 31, Troubleshooting Printing Problems (Tasks)
Resolve UFS file system inconsistencies       Chapter 32, Resolving UFS File System Inconsistencies
                                                (Tasks)
Troubleshoot software package problems        Chapter 33, Troubleshooting Software Package Problems
                                                (Tasks)

Troubleshooting a System Crash

If a system running the Solaris operating environment crashes, provide your service provider with as much information as possible, including crash dump files.

What to Do if the System Crashes

The most important things to remember are:

  1. Write down the system console messages.

    If a system crashes, making it run again might seem like your most pressing concern. However, before you reboot the system, examine the console screen for messages. These messages can provide some insight about what caused the crash. Even if the system reboots automatically and the console messages have disappeared from the screen, you might be able to check these messages by viewing the system error log, the /var/adm/messages file. For more information about viewing system error log files, see How to View System Messages.

    If you have frequent crashes and can't determine their cause, gather all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative to examine. For a complete list of troubleshooting information to gather for your service provider, see Troubleshooting a System Crash.

  2. Synchronize the disks and reboot.


    ok sync
    

    If the system fails to reboot successfully after a system crash, see Chapter 29, Troubleshooting Miscellaneous Software Problems (Tasks).

  3. Check to see if a system crash dump was generated after the system crash.

    System crash dumps are saved by default. For information about crash dumps, see Chapter 28, Managing System Crash Information (Tasks).

Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use the Troubleshooting a System Crash Checklist to gather troubleshooting data for a crashed system.

Table 26–1 Identifying System Crash Data

Question 

Description 

Can you reproduce the problem?

This is important because a reproducible test case is often essential for debugging really hard problems. By reproducing the problem, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug. 

Are you using any third-party drivers?

Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs. 

What was the system doing just before it crashed?

If the system was doing anything unusual like running a new stress test or experiencing higher-than-usual load, that might have led to the crash. 

Were there any unusual console messages right before the crash?

Sometimes the system will show signs of distress before it actually crashes; this information is often useful. 

Did you add any tuning parameters to the /etc/system file?

Sometimes tuning parameters, such as increasing shared memory segments so that the system tries to allocate more than it has, can cause the system to crash. 

Did the problem start recently?

If so, did the onset of problems coincide with any changes to the system, such as new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?

Troubleshooting a System Crash Checklist

Use this checklist when gathering system data for a crashed system.

Item                                                        Your Data
----------------------------------------------------------  ---------
Is a system crash dump available?
Identify the operating system release and appropriate
  software application release levels.
Identify system hardware. Include prtdiag output for
  sun4u systems. Include Explorer output for other systems.
Are patches installed? If so, include showrev -p output.
Is the problem reproducible?
Does the system have any third-party drivers?
What was the system doing before it crashed?
Were there any unusual console messages right before
  the system crashed?
Did you add any parameters to the /etc/system file?
Did the problem start recently?
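Some of the checklist items can be collected by a script. The following is a rough sketch, assuming that the prtdiag and showrev commands named in the checklist are found in PATH (on some systems prtdiag lives in a platform-specific directory, in which case it is reported as not found). The gather_crash_data function name is made up for this example, and uname -a is used here only as one way to record the operating system release.

```shell
# Sketch: print system data that is useful to a service provider.
gather_crash_data() {
    echo "== Operating system release =="
    uname -a
    # Run each optional tool only if it is found in PATH.
    for tool in prtdiag showrev; do
        echo "== $tool =="
        if command -v "$tool" >/dev/null 2>&1; then
            case $tool in
                showrev) showrev -p ;;   # list installed patches
                *) "$tool" ;;
            esac
        else
            echo "($tool not found in PATH)"
        fi
    done
}
```

Redirecting the output to a file, for example gather_crash_data > /var/tmp/crash-data.txt (a path chosen only for illustration), keeps the data ready to send with the crash dump.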

 

Viewing System Messages

System messages display on the console device. The text of most system messages looks like this:

[ID msgid facility.priority]

For example:


[ID 672855 kern.notice] syncing file systems...

If the message originated in the kernel, the kernel module name is displayed. For example:


Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full 
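The [ID msgid facility.priority] tag shown above is regular enough to pick apart with sed. The following is a minimal sketch; the parse_msg_id function name is an assumption for this example, and lines without an ID tag produce no output.

```shell
# Sketch: print "msgid facility priority" for a message line that carries an [ID ...] tag.
parse_msg_id() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[ID \([0-9][0-9]*\) \([a-z0-9]*\)\.\([a-z]*\)\].*/\1 \2 \3/p'
}

parse_msg_id 'Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full'
# prints: 845546 kern notice
```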

When a system crashes, it might display a message on the system console like this:


panic: error message

Less frequently, this message might be displayed instead of the panic message:


Watchdog reset !

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm directory. You can direct where these messages are stored by setting up system message logging. For more information, see How to Customize System Message Logging. These messages can alert you to system problems, such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file (and in messages.*), and the oldest are in the messages.3 file. After a period of time (usually every ten days), a new messages file is created. The current messages file is renamed messages.0, messages.0 is renamed messages.1, messages.1 is renamed messages.2, and messages.2 is renamed messages.3. The previous /var/adm/messages.3 file is deleted.

Because the /var/adm directory stores large files containing messages, crash dumps, and other data, this directory can consume lots of disk space. To keep the /var/adm directory from growing too large, and to ensure that future crash dumps can be saved, you should remove unneeded files periodically. You can automate this task by using the crontab file. For more information on automating this task, see How to Delete Crash Dump Files and Chapter 18, Scheduling System Tasks (Tasks).

How to View System Messages

Display recent messages generated by a system crash or reboot by using the dmesg command.


$ dmesg

Or, use the more command to display one screen of messages at a time.


$ more /var/adm/messages

For more information, see dmesg(1M).

Example—Viewing System Messages

The following example shows output from the dmesg command.


$ dmesg
Jan  3 08:44:41 starbug genunix: [ID 540533 kern.notice] SunOS Release 5.9 ...
Jan  3 08:44:41 starbug genunix: [ID 913631 kern.notice] Copyright 1983-2002 ...
Jan  3 08:44:41 starbug genunix: [ID 678236 kern.info] Ethernet address ...
Jan  3 08:44:41 starbug unix: [ID 389951 kern.info] mem = 131072K (0x8000000)
Jan  3 08:44:41 starbug unix: [ID 930857 kern.info] avail mem = 121888768
Jan  3 08:44:41 starbug rootnex: [ID 466748 kern.info] root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
Jan  3 08:44:41 starbug rootnex: [ID 349649 kern.info] pcipsy0 at root: UPA 0x1f0x0
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] pcipsy0 is /pci@1f,0
Jan  3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1,1, simba0
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] simba0 is /pci@1f,0/pci@1,1
Jan  3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1, simba1
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] simba1 is /pci@1f,0/pci@1
Jan  3 08:44:57 starbug simba: [ID 370704 kern.info] PCI-device: ide@3, uata0
Jan  3 08:44:57 starbug genunix: [ID 936769 kern.info] uata0 is /pci@1f,0/pci@1,1/ide@3
Jan  3 08:44:57 starbug uata: [ID 114370 kern.info] dad0 at pci1095,6460
.
.
.
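When the output is long, a per-priority summary can show where to look first. The following sketch counts lines by their facility.priority tag. It assumes a POSIX awk that provides match(), RSTART, and RLENGTH (on some systems this is nawk rather than awk), and the count_by_priority function name is made up for this example.

```shell
# Sketch: count messages in a file by facility.priority tag.
count_by_priority() {
    awk '
        match($0, /\[ID [0-9]+ [a-z0-9]+\.[a-z]+\]/) {
            tag = substr($0, RSTART, RLENGTH)   # for example "[ID 845546 kern.notice]"
            sub(/^\[ID [0-9]+ /, "", tag)       # strip the "[ID msgid " prefix
            sub(/\]$/, "", tag)                 # strip the closing bracket
            count[tag]++
        }
        END { for (t in count) print count[t], t }
    ' "$1" | sort -k2
}
```

For example, count_by_priority /var/adm/messages prints one line per tag, with the count first.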

Customizing System Message Logging

You can capture additional error messages that are generated by various system processes by modifying the /etc/syslog.conf file. By default, the /etc/syslog.conf file directs many system process messages to the /var/adm/messages files. Crash and boot messages are stored here as well. To view the messages in /var/adm/messages, see How to View System Messages.

The /etc/syslog.conf file has two columns separated by tabs:


facility.level ... action

facility.level

A facility, or system source of the message or condition. Can be a comma-separated list of facilities. Facility values are listed in Table 26–2. A level indicates the severity or priority of the condition being logged. Priority levels are listed in Table 26–3.

action

The action field indicates where the messages are forwarded. 
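The two-column layout described above can be listed with a short awk script. This is a sketch only: it assumes that the columns are separated by one or more tab characters, it skips comment and blank lines, and it does not interpret any macro lines that a real /etc/syslog.conf file might contain. The list_syslog_rules function name is made up for this example.

```shell
# Sketch: print "selector -> action" for each active line of a syslog.conf-style file.
list_syslog_rules() {
    # Skip comments and blank lines; fields are separated by one or more tabs.
    awk -F'\t+' '!/^#/ && NF >= 2 { printf "%s -> %s\n", $1, $NF }' "$1"
}
```

For example, list_syslog_rules /etc/syslog.conf prints one line per rule, which makes it easier to see where each facility.level pair is sent.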

The following example shows sample lines from a default /etc/syslog.conf file.


user.err                                        /dev/sysmsg
user.err                                        /var/adm/messages
user.alert                                      `root, operator'
user.emerg                                      *

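This means that user error messages are written to the console (/dev/sysmsg) and logged in the /var/adm/messages file, user alert messages are sent to the root and operator users, and user emergency messages are sent to all logged-in users.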

The most common error condition sources are shown in the following table. The most common priorities are shown in Table 26–3 in order of severity.

Table 26–2 Source Facilities for syslog.conf Messages

Source    Description
------    ---------------
kern      The kernel
auth      Authentication
daemon    All daemons
mail      Mail system
lp        Spooling system
user      User processes


Note –

The number of syslog facilities that can be activated in the /etc/syslog.conf file is unlimited.


Table 26–3 Priority Levels for syslog.conf Messages

Priority  Description
--------  -------------------------------------
emerg     System emergencies
alert     Errors requiring immediate correction
crit      Critical errors
err       Other errors
info      Informational messages
debug     Output used for debugging
none      This setting doesn't log output
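In syslog.conf, an entry written for a given level also selects messages at more severe levels. The following sketch shows that ordering in shell. The priority_rank and selects function names are assumptions for this example, and the warning and notice levels are included from the standard syslog level list even though they do not appear in the table above. The none setting is not a level and is not ranked.

```shell
# Sketch: map a priority name to a rank, where 0 is the most severe.
priority_rank() {
    case $1 in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       return 1 ;;   # unknown name
    esac
}

# Succeed when a message at priority $2 is selected by an entry written for level $1.
# Returns 2 if either name is not a known priority.
selects() {
    rule=$(priority_rank "$1") || return 2
    msg=$(priority_rank "$2") || return 2
    [ "$msg" -le "$rule" ]
}
```

For example, selects err crit succeeds because crit is more severe than err, while selects err info fails.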

How to Customize System Message Logging

  1. Become superuser.

  2. Edit the /etc/syslog.conf file, adding or changing message sources, priorities, and message locations according to the syntax described in syslog.conf(4).

  3. Exit the file, saving the changes.

Example—Customizing System Message Logging

This sample /etc/syslog.conf entry for user.emerg sends user emergency messages to root and to individual users.


user.emerg                                      `root, *'

Enabling Remote Console Messaging

The following new console features improve your ability to troubleshoot remote systems:

Using Auxiliary Console Messaging During Run Level Transitions

Keep the following in mind when using auxiliary console messaging during run level transitions:

Using the consadm Command During an Interactive Login Session

If you want to run an interactive login session by logging in to a system using a terminal that is connected to a serial port, and then using the consadm command to see the console messages from the terminal, note the following behavior.

How to Enable an Auxiliary (Remote) Console

The consadm daemon does not start monitoring the port until after you add the auxiliary console with the consadm command. As a security feature, console messages are only redirected until carrier drops, or the auxiliary console device is unselected. This means carrier must be established on the port before you can successfully use the consadm command.

For more information on enabling an auxiliary console, see consadm(1M).

  1. Log in to the system as superuser.

  2. Enable the auxiliary console.


    # consadm -a devicename
    
  3. Verify that the current connection is the auxiliary console.


    # consadm
    

Example—Enabling an Auxiliary (Remote) Console


# consadm -a /dev/term/a
# consadm
 /dev/term/a

How to Display a List of Auxiliary Consoles

  1. Log in to the system as superuser.

  2. Select one of the following steps:

    1. Display the list of auxiliary consoles.


      # consadm
      /dev/term/a
    2. Display the list of persistent auxiliary consoles.


      # consadm -p
      /dev/term/b

How to Enable an Auxiliary (Remote) Console Across System Reboots

  1. Log in to the system as superuser.

  2. Enable the auxiliary console across system reboots.


    # consadm -a -p devicename     
    

    This adds the device to the list of persistent auxiliary consoles.

  3. Verify that the device has been added to the list of persistent auxiliary consoles.


    # consadm -p
    

Example—Enabling an Auxiliary (Remote) Console Across System Reboots


# consadm -a -p /dev/term/a 
# consadm
/dev/term/a

How to Disable an Auxiliary (Remote) Console

  1. Log in to the system as superuser.

  2. Select one of the following steps:

    1. Disable the auxiliary console.


      # consadm -d devicename
      

      or

    2. Disable the auxiliary console and remove it from the list of persistent auxiliary consoles.


      # consadm -p -d devicename
      
  3. Verify that the auxiliary console has been disabled.


    # consadm
    

Example—Disabling an Auxiliary (Remote) Console


# consadm -d /dev/term/a
# consadm