System Administration Guide: Advanced Administration

Chapter 14 Troubleshooting Software Problems (Overview)

This chapter provides a general overview of troubleshooting software problems, including information on troubleshooting system crashes and viewing system messages.

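This chapter covers the following topics:

What's New in Troubleshooting?
Where to Find Software Troubleshooting Tasks
Troubleshooting a System Crash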

What's New in Troubleshooting?

This section describes new or changed troubleshooting information in this release.

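For information on new or changed troubleshooting features in the Oracle Solaris 10 release, see the following:

Common Agent Container Problems
x86: SMF Boot Archive Service Might Fail During System Reboot
Dynamic Tracing Facility
kmdb Replaces kadb as Standard Solaris Kernel Debugger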

For a complete listing of new features and a description of Oracle Solaris releases, see Oracle Solaris 10 9/10 What’s New.

Common Agent Container Problems

Solaris 10 6/06: The common agent container is a stand-alone Java program that is included in the Oracle Solaris OS. This program implements a container for Java management applications. The common agent container provides a management infrastructure that is designed for Java Management Extensions (JMX) and Java Dynamic Management Kit (Java DMK) based functionality. The software is installed by the SUNWcacaort package and resides in the /usr/lib/cacao directory.

Typically, the container is not visible. However, there are two instances when you might need to interact with the container daemon:

It is possible that another application might attempt to use a network port that is reserved for the common agent container.
In the event that a certificate store is compromised, you might have to regenerate the common agent container certificate keys.

For information about how to troubleshoot these problems, see Troubleshooting Common Agent Container Problems in the Oracle Solaris OS.

x86: SMF Boot Archive Service Might Fail During System Reboot

Solaris 10 1/06: If a system crash occurs in the GRUB based boot environment, it is possible that the SMF service svc:/system/boot-archive:default might fail when the system is rebooted. If this problem occurs, reboot the system and select the failsafe archive in the GRUB boot menu. Follow the prompts to rebuild the boot archive. After the archive is rebuilt, reboot the system. To continue the boot process, you can use the svcadm command to clear the svc:/system/boot-archive:default service. For more information on GRUB based booting, see Booting an x86 Based System by Using GRUB (Task Map) in System Administration Guide: Basic Administration.
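
The final step, clearing the service, might look like the following sketch. It assumes that the boot archive has already been rebuilt from the failsafe session, that you are running as root, and that the failed service is reported in the maintenance state, which is the state that svcadm clear acts on. The helper name clear_boot_archive is hypothetical. The sketch also assumes a POSIX-compatible shell such as bash, because the legacy Solaris 10 /bin/sh does not support the $(...) syntax.

```shell
# Sketch: clear the boot-archive service once the archive has been rebuilt.
# clear_boot_archive is a hypothetical helper name, not a Solaris command.
FMRI="svc:/system/boot-archive:default"

clear_boot_archive() {
    # Do nothing on hosts that lack the SMF command-line tools.
    if ! command -v svcs >/dev/null 2>&1; then
        echo "svcs not found; nothing to clear"
        return 0
    fi
    # -H suppresses the header line, -o state prints only the state column.
    state=$(svcs -H -o state "$FMRI")
    if [ "$state" = "maintenance" ]; then
        svcadm clear "$FMRI"
    else
        echo "$FMRI is in state: $state"
    fi
}
```

Calling clear_boot_archive with no arguments checks the service and clears it only when needed. On a system where the service is already online, the function only reports the current state.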

Dynamic Tracing Facility

The Oracle Solaris Dynamic Tracing (DTrace) facility is a comprehensive dynamic tracing facility that gives you a new level of observability into the Solaris kernel and user processes. DTrace helps you understand your system by permitting you to dynamically instrument the OS kernel and user processes to record data that you specify at locations of interest, called probes. Each probe can be associated with custom programs that are written in the new D programming language. All of DTrace's instrumentation is entirely dynamic and available for use on your production system. For more information, see the dtrace(1M) man page and the Solaris Dynamic Tracing Guide.
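
As a rough illustration of associating a small D program with probes, the following sketch counts system calls per process name for five seconds. The variable names D_PROGRAM and DTRACE_RUN and the helper count_syscalls are hypothetical. The sketch assumes a POSIX-compatible shell and root privileges, and it only runs the trace when DTRACE_RUN=1 is set, so that the command can be reviewed first.

```shell
# Sketch: count system calls per process name for five seconds with DTrace.
# D_PROGRAM, DTRACE_RUN, and count_syscalls are hypothetical names.
# syscall:::entry fires on every system call entry; tick-5s ends the trace.
D_PROGRAM='syscall:::entry { @calls[execname] = count(); } tick-5s { exit(0); }'

count_syscalls() {
    # Trace only when explicitly requested and when dtrace is installed;
    # otherwise print the command that would be run.
    if [ "${DTRACE_RUN:-0}" = "1" ] && command -v dtrace >/dev/null 2>&1; then
        dtrace -n "$D_PROGRAM"
    else
        echo "dtrace -n '$D_PROGRAM'"
    fi
}
```

When the trace runs, DTrace prints the aggregation (process name and call count) as the program exits.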

`kmdb` Replaces `kadb` as Standard Solaris Kernel Debugger

kmdb has replaced kadb as the standard “in situ” Solaris kernel debugger.

kmdb brings all the power and flexibility of mdb to live kernel debugging. kmdb supports the following:

Debugger commands (dcmds)
Debugger modules (dmods)
Access to kernel type data
Kernel execution control
Inspection
Modification

For more information, see the kmdb(1) man page. For step-by-step instructions on using kmdb to troubleshoot a system, see How to Boot the System With the Kernel Debugger (kmdb) in System Administration Guide: Basic Administration and How to Boot a System With the Kernel Debugger in the GRUB Boot Environment (kmdb) in System Administration Guide: Basic Administration.

Where to Find Software Troubleshooting Tasks

Troubleshooting Task	For More Information
Manage system crash information	Chapter 17, Managing System Crash Information (Tasks)
Manage core files	Chapter 16, Managing Core Files (Tasks)
Troubleshoot software problems such as reboot failures and backup problems	Chapter 18, Troubleshooting Miscellaneous Software Problems (Tasks)
Troubleshoot file access problems	Chapter 19, Troubleshooting File Access Problems (Tasks)
Troubleshoot printing problems	Chapter 13, Troubleshooting Printing Problems in the Oracle Solaris OS (Tasks), in System Administration Guide: Printing
Resolve UFS file system inconsistencies	Chapter 20, Resolving UFS File System Inconsistencies (Tasks)
Troubleshoot software package problems	Chapter 21, Troubleshooting Software Package Problems (Tasks)

Troubleshooting a System Crash

If a system running the Oracle Solaris OS crashes, provide your service provider with as much information as possible, including crash dump files.

What to Do If the System Crashes

The most important things to remember are as follows:

Write down the system console messages.

If a system crashes, making it run again might seem like your most pressing concern. However, before you reboot the system, examine the console screen for messages. These messages can provide some insight about what caused the crash. Even if the system reboots automatically and the console messages have disappeared from the screen, you might be able to check these messages by viewing the system error log, the /var/adm/messages file. For more information about viewing system error log files, see How to View System Messages.

If you have frequent crashes and can't determine their cause, gather all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative to examine. For a complete list of troubleshooting information to gather for your service provider, see Troubleshooting a System Crash.

Synchronize the disks and reboot.
ok sync
If the system fails to reboot successfully after a system crash, see Chapter 18, Troubleshooting Miscellaneous Software Problems (Tasks).

Check to see if a system crash dump was generated after the system crash. System crash dumps are saved by default. For information about crash dumps, see Chapter 17, Managing System Crash Information (Tasks).
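
The checks described above (reviewing the system error log and looking for a saved crash dump) can be sketched as two small helpers. The helper names scan_messages and list_crash_dumps are hypothetical, and the search patterns are only a starting point. The sketch assumes POSIX versions of the utilities; on Solaris 10, grep -E is provided by /usr/xpg4/bin/grep rather than /usr/bin/grep. The default savecore directory, /var/crash/hostname, is an assumption that holds only if the dump configuration has not been changed.

```shell
# Sketch: find likely crash-related log lines and list saved crash dumps.
# scan_messages and list_crash_dumps are hypothetical helper names.
# Assumes POSIX grep (on Solaris 10, put /usr/xpg4/bin ahead of /usr/bin).

scan_messages() {
    # $1: log file to search, normally /var/adm/messages
    grep -i -E 'panic|warning' "$1"
}

list_crash_dumps() {
    # $1: savecore directory, by default /var/crash/<hostname>
    # Crash dumps are saved as matching unix.N and vmcore.N files.
    ls "$1" 2>/dev/null | grep -E '^(unix|vmcore)\.[0-9]+$'
}

# Example calls (commented out so that loading the helpers has no side effects):
#   scan_messages /var/adm/messages
#   list_crash_dumps "/var/crash/$(hostname)"
```

If list_crash_dumps prints nothing, either no dump was saved or the dumps are being written to a different directory; running dumpadm with no arguments shows the current configuration.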

Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use the Troubleshooting a System Crash Checklist to gather troubleshooting data for a crashed system.

Table 14–1 Identifying System Crash Data


Question	Description
Can you reproduce the problem?	This is important because a reproducible test case is often essential for debugging really hard problems. By reproducing the problem, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug.
Are you using any third-party drivers?	Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs.
What was the system doing just before it crashed?	If the system was doing anything unusual like running a new stress test or experiencing higher-than-usual load, that might have led to the crash.
Were there any unusual console messages right before the crash?	Sometimes the system will show signs of distress before it actually crashes; this information is often useful.
Did you add any tuning parameters to the `/etc/system` file?	Sometimes tuning parameters, such as increasing shared memory segments so that the system tries to allocate more than it has, can cause the system to crash.
Did the problem start recently?	If so, did the onset of problems coincide with any changes to the system, for example, new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?
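
To answer the question about tuning parameters, it can help to list only the active entries in the /etc/system file. The following is a minimal sketch, assuming a POSIX awk; the helper name active_system_settings is hypothetical. It relies on the fact that comment lines in /etc/system begin with an asterisk.

```shell
# Sketch: print the active (non-comment, non-blank) lines of /etc/system.
# active_system_settings is a hypothetical helper name.
active_system_settings() {
    # $1: file to inspect, normally /etc/system
    # NF > 0 skips blank lines; lines that start with * are comments.
    awk 'NF > 0 && $0 !~ /^\*/' "$1"
}

# Example call (commented out so that loading the helper has no side effects):
#   active_system_settings /etc/system
```

An empty result suggests that no tuning parameters have been added to the file.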

Troubleshooting a System Crash Checklist

Use this checklist when gathering system data for a crashed system.

Item	Your Data
Is a system crash dump available?
Identify the operating system release and appropriate software application release levels.
Identify system hardware.
Include `prtdiag` output for sun4u systems. Include Explorer output for other systems.
Are patches installed? If so, include `showrev -p` output.
Is the problem reproducible?
Does the system have any third-party drivers?
What was the system doing before it crashed?
Were there any unusual console messages right before the system crashed?
Did you add any parameters to the `/etc/system` file?
Did the problem start recently?
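
Several checklist items can be collected with standard commands. The following sketch gathers them into one directory to pass to a service provider. The helper name collect_crash_data and the output file names are hypothetical, and the sketch assumes that prtdiag and showrev are on the PATH when they are installed; commands that are missing are skipped. It does not answer the items that need your own observations, such as what the system was doing before it crashed.

```shell
# Sketch: collect checklist data into one directory.
# collect_crash_data and the output file names are hypothetical.
collect_crash_data() {
    out_dir=$1
    mkdir -p "$out_dir" || return 1

    # Operating system release and kernel version.
    uname -a > "$out_dir/uname.txt"
    if [ -r /etc/release ]; then
        cp /etc/release "$out_dir/release.txt"
    fi

    # Hardware configuration (prtdiag) and installed patches (showrev -p).
    if command -v prtdiag >/dev/null 2>&1; then
        prtdiag > "$out_dir/prtdiag.txt" 2>&1
    fi
    if command -v showrev >/dev/null 2>&1; then
        showrev -p > "$out_dir/patches.txt" 2>&1
    fi

    # Tuning parameters that were added to /etc/system.
    if [ -r /etc/system ]; then
        cp /etc/system "$out_dir/etc_system.txt"
    fi
    return 0
}

# Example call (commented out so that loading the helper has no side effects):
#   collect_crash_data /var/tmp/crash-report
```

The output directory in the example call is only a suggestion; any location with enough free space works.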