System Administration Guide, Volume 2

Troubleshooting a System Crash

If a system running the Solaris operating environment crashes, give your service provider as much information as possible, including crash dump files.

What to Do if the System Crashes

The most important things to remember are:

  1. Write down the system console messages.

    If a system crashes, making it run again might seem like your most pressing concern. However, before you reboot the system, examine the console screen for messages. These messages can provide some insight into what caused the crash. Even if the system reboots automatically and the console messages have disappeared from the screen, you might still be able to read them in the system error log file, which is generated automatically in /var/adm/messages (or /usr/adm/messages). See "How to View System Messages" for more information about viewing system error log files.

    If you have frequent crashes and can't determine their cause, gather all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative to examine. See "Troubleshooting a System Crash" for a complete list of troubleshooting information to gather for your service provider.


  2. Synchronize the disks and reboot.


    ok sync
    

    See Chapter 40, Troubleshooting Miscellaneous Software Problems if the system fails to reboot successfully after a system crash.

  3. Attempt to save the crash information written onto the swap area by running the savecore command.


    # savecore
    

See Chapter 39, Managing System Crash Information for information about saving crash dumps automatically.
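
The log check described in step 1 can be sketched as a small shell function. This is only an illustration, not part of the documented procedure: the function name show_panic_messages is hypothetical, and the sketch assumes that the relevant log lines contain the word "panic", which might not be true for every crash.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention "panic" (case-insensitive).
# The log path defaults to /var/adm/messages, the file named in step 1.
show_panic_messages() {
    logfile="${1:-/var/adm/messages}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # grep exits non-zero when nothing matches; report that case explicitly.
    grep -i 'panic' "$logfile" || echo "no panic messages found in $logfile"
}
```

Call the function with no argument to search the default log, or pass another path to search a saved copy of the log instead.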

Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use the "Troubleshooting a System Crash Checklist" to record troubleshooting data for a crashed system.

Table 38-1 Identifying System Crash Data

Question
    Description

Can you reproduce the problem?
    A reproducible test case is often essential for debugging difficult problems. If the problem can be reproduced, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug.

Are you using any third-party drivers?
    Drivers run in the same address space as the kernel, with all the same privileges, so a bug in a driver can crash the system.

What was the system doing just before it crashed?
    If the system was doing anything unusual, such as running a new stress test or handling a higher-than-usual load, that activity might have led to the crash.

Were there any unusual console messages right before the crash?
    Sometimes the system shows signs of distress before it actually crashes. This information is often useful.

Did you add any tuning parameters to the /etc/system file?
    Tuning parameters can sometimes cause the system to crash, for example, when shared memory segments are increased so that the system tries to allocate more memory than it has.

Did the problem start recently?
    If so, did the onset of problems coincide with any changes to the system, such as new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?
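
Part of the answer to the /etc/system question in Table 38-1 can be read from the file itself. The following is a minimal sketch, not a documented tool: the function name list_system_tunables is hypothetical, and the sketch assumes that comment lines in /etc/system begin with an asterisk in the first column, so that any remaining non-blank line is an active setting.

```shell
#!/bin/sh
# Hypothetical helper: print the active (non-comment, non-blank) lines of
# /etc/system so that added tuning parameters are easy to report.
# Assumption: comment lines start with "*" in the first column.
list_system_tunables() {
    sysfile="${1:-/etc/system}"
    if [ ! -r "$sysfile" ]; then
        echo "cannot read $sysfile" >&2
        return 1
    fi
    # Delete comment lines, then empty or space-only lines.
    # sed is used rather than grep so that an empty result is not an error.
    sed -e '/^\*/d' -e '/^ *$/d' "$sysfile"
}
```

If the function prints nothing, the file contains no active settings under that assumption. Any lines it does print are worth including in the information you give your service provider.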