CHAPTER 9

Troubleshooting

This chapter describes the diagnostic tools available for the Sun Fire V445 server.

Topics in this chapter include:

- Troubleshooting Options
- Enabling the Core Dump Process
- Testing the Core Dump Setup


Troubleshooting Options

There are several troubleshooting options that you can implement when you set up and configure the Sun Fire V445 server. By setting up your system with troubleshooting in mind, you can save time and minimize disruptions if the system encounters any problems.

Tasks covered in this chapter include:

- Enabling the Core Dump Process
- Testing the Core Dump Setup

Other information in this chapter includes:

- About Updated Troubleshooting Information
- About Firmware and Software Patch Management
- About Sun Install Check Tool
- About Sun Explorer Data Collector
- About Sun Remote Services Net Connect
- About Configuring the System for Troubleshooting


About Updated Troubleshooting Information

You can obtain the most current server troubleshooting information in the Sun Fire V445 Server Product Notes and at Sun web sites. These resources can help you understand and diagnose problems that you might encounter.

Product Notes

The Sun Fire V445 Server Product Notes contain late-breaking information about the system, including the following:

- The latest recommended and required software patches
- Updated hardware and driver compatibility information
- Known issues and bugs, including available workarounds

The latest product notes are available at:

http://www.sun.com/documentation

Web Sites

The following Sun web sites provide troubleshooting and other useful information.

SunSolve Online

This site presents a collection of resources for Sun technical and support information. Access to some of the information on this site depends on the level of your service contract with Sun. This site includes resources such as patches, technical documentation, and a searchable knowledge base of support articles.

The SunSolve Online Web site is at:

http://sunsolve.sun.com

Big Admin

This site is a one-stop resource for Sun system administrators. The Big Admin web site is at:

http://www.sun.com/bigadmin


About Firmware and Software Patch Management

Sun makes every attempt to ensure that each system is shipped with the latest firmware and software. However, in complex systems, bugs and problems are discovered in the field after systems leave the factory. Often, these problems are fixed with patches to the system's firmware. Keeping your system's firmware and Solaris OS current with the latest recommended and required patches can help you avoid problems that others might have already discovered and solved.

Firmware and OS updates are often required to diagnose or fix a problem. Schedule regular updates of your system's firmware and software so that you will not have to update the firmware or software at an inconvenient time.

You can find the latest patches and updates for the Sun Fire V445 server at the Web sites listed in Web Sites.
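
As a quick way to review the patch level of a running system, the following Solaris commands can be used. This is a sketch only; smpatch(1M) requires that the system be registered for update services.

# showrev -p | tail -5      (list the last few installed patches)
# smpatch analyze           (list patches applicable to this system)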


About Sun Install Check Tool

When you install the Sun Install Check tool, you also install Sun Explorer Data Collector. The Sun Install Check tool uses Sun Explorer Data Collector to help you confirm that the Sun Fire V445 server installation has been completed optimally. Together, they can evaluate your system configuration, including the installed Solaris OS version, patch levels, and firmware revisions.

When the Sun Install Check tool and Sun Explorer Data Collector identify potential problems, a report is generated with specific instructions for remedying the issues.

The Sun Install Check tool is available at:

http://sunsolve.sun.com

At that site, click on the link to the Sun Install Check tool.

See also About Sun Explorer Data Collector.


About Sun Explorer Data Collector

The Sun Explorer Data Collector is a system data collection tool that Sun support services engineers sometimes use when troubleshooting Sun systems. In certain support situations, Sun support services engineers might ask you to install and run this tool. If you installed the Sun Install Check tool at initial installation, you also installed Sun Explorer Data Collector. If you did not install the Sun Install Check tool, you can install Sun Explorer Data Collector later without it. Installing this tool as part of your initial system setup avoids having to install it at a later, and often inconvenient, time.

Both the Sun Install Check tool (with bundled Sun Explorer Data Collector) and the Sun Explorer Data Collector (standalone) are available at:

http://sunsolve.sun.com

At that site, click on the appropriate link.
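
Once installed, Sun Explorer Data Collector is run as root. The following sketch assumes the default installation directory, which may differ on your system:

# /opt/SUNWexplo/bin/explorer      (collect system data into an archive for Sun support)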


About Sun Remote Services Net Connect

Sun Remote Services (SRS) Net Connect is a collection of system management services designed to help you better control your computing environment. These Web-delivered services enable you to monitor systems, to create performance and trend reports, and to receive automatic notification of system events. These services help you to act more quickly when a system event occurs and to manage potential issues before they become problems.

More information about SRS Net Connect is available at:

http://www.sun.com/service/support/srs/netconnect


About Configuring the System for Troubleshooting

System failures are characterized by certain symptoms. Each symptom can be traced to one or more problems or causes by using specific troubleshooting tools and techniques. This section describes troubleshooting tools and techniques that you can control through configuration variables.

Hardware Watchdog Mechanism

The hardware watchdog mechanism is a hardware timer that is continually reset as long as the OS is running. If the system hangs, the OS is no longer able to reset the timer. The timer then expires and causes an automatic externally initiated reset (XIR), displaying debug information on the system console. The hardware watchdog mechanism is enabled by default. If it has been disabled, you must configure the Solaris OS before you can reenable it.
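
On Sun Fire servers of this class, the Solaris-side configuration is an entry in the /etc/system file; the tunable name below is a hedged assumption, so confirm it against your platform documentation before use. The setting takes effect at the next reboot.

set watchdog_enable = 1      (add this line to /etc/system)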

The configuration variable error-reset-recovery allows you to control how the hardware watchdog mechanism behaves when the timer expires. The following are the error-reset-recovery settings:

- boot (default) - Reboots the system after the XIR.
- sync - Attempts to generate a core dump file, then reboots the system.
- none - Takes no recovery action, leaving the system at the ok prompt so that you can debug the failure.

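You can display or change this variable at the ok prompt with the printenv and setenv commands, or from a running Solaris OS with the eeprom(1M) command. A minimal sketch, run as root:

# eeprom error-reset-recovery          (display the current setting)
# eeprom error-reset-recovery=sync     (change the setting to sync)
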
For more information about the hardware watchdog mechanism and XIR, see Chapter 5.

Automatic System Restoration Settings

The Automatic System Restoration (ASR) features enable the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An auto-configuring capability designed into the OpenBoot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

How you configure ASR settings has an effect not only on how the system handles certain types of failures but also on how you go about troubleshooting certain problems.

For day-to-day operations, enable ASR by setting OpenBoot configuration variables as shown in TABLE 9-1.


TABLE 9-1   OpenBoot Configuration Variable Settings to Enable Automatic System Restoration

Variable                Setting
auto-boot?              true
auto-boot-on-error?     true
diag-level              max
diag-switch?            true
diag-trigger            all-resets
diag-device             (Set to the boot-device value)


Configuring your system this way ensures that diagnostic tests run automatically when most serious hardware and software errors occur. With this ASR configuration, you can save time diagnosing problems since POST and OpenBoot Diagnostics test results are already available after the system encounters an error.
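
If the Solaris OS is running, you can also set these variables from the shell with the eeprom(1M) command instead of at the ok prompt. The following is a minimal sketch, run as root; the diag-device value shown is a placeholder for your actual boot-device value, and the settings take effect at the next reset.

# eeprom "auto-boot?=true"
# eeprom "auto-boot-on-error?=true"
# eeprom diag-level=max
# eeprom "diag-switch?=true"
# eeprom diag-trigger=all-resets
# eeprom diag-device=disk0      (placeholder; use your boot-device value)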

For more information about how ASR works, and complete instructions for enabling ASR capability, see About Automatic System Restoration.

Remote Troubleshooting Capabilities

You can use the Sun Advanced Lights Out Manager (ALOM) system controller to troubleshoot and diagnose the system remotely. The ALOM system controller enables you to do the following:

- Power the system on and off remotely
- Monitor environmental conditions, such as temperatures, and fan and power supply status
- View the system event log
- Reset the system remotely

In addition, you can use the ALOM system controller to access the system console, provided it has not been redirected. System console access enables you to do the following:

- View boot messages and POST output
- Run OpenBoot Diagnostics tests and issue OpenBoot commands
- View and respond to Solaris OS console messages

For more information about the ALOM system controller, see Chapter 5 or the Sun Advanced Lights Out Manager (ALOM) Online Help.

For more information about the system console, see Chapter 2.

System Console Logging

Console logging is the ability to collect and log system console output. Console logging captures console messages so that system failure data, like Fatal Reset error details and POST output, can be recorded and analyzed.

Console logging is especially valuable when troubleshooting Fatal Reset errors and RED State Exceptions. In these conditions, the Solaris OS terminates abruptly, and although it sends messages to the system console, the OS software does not log any messages in traditional file system locations like the /var/adm/messages file.

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm/messages file.



Note - The Solaris 10 OS moves data about detected CPU and memory hardware faults from the /var/adm/messages file to the fault management components. This makes it easier to locate hardware events and facilitates predictive self-healing.


You can direct where system log messages are stored or have them sent to a remote system by setting up system message logging. For more information, see "How to Customize System Message Logging" in the System Administration Guide: Advanced Administration, which is part of the Solaris System Administrator Collection.
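
For example, a common configuration forwards important messages to a central log host. A minimal sketch, assuming a reachable host named loghost; fields in /etc/syslog.conf must be separated by tabs, not spaces. Add the following line to /etc/syslog.conf:

*.err;kern.notice;auth.notice           @loghost

Then restart the logging service (Solaris 10):

# svcadm restart system-log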

In some failure situations, a large stream of data is sent to the system console. Because ALOM system controller log messages are written into a circular buffer that holds 64 Kbytes of data, it is possible that the output identifying the original failing component can be overwritten. Therefore, you may want to explore further system console logging options, such as SRS Net Connect or third-party vendor solutions. For more information about SRS Net Connect, see About Sun Remote Services Net Connect.

More information about SRS Net Connect is available at:

http://www.sun.com/service/support/

Certain third-party vendors offer data logging terminal servers and centralized system console management solutions that monitor and log output from many systems. Depending on the number of systems you are administering, these might offer solutions for logging system console information.

For more information about the system console, see Chapter 2.

Predictive Self-Healing

The Solaris Fault Manager daemon, fmd(1M), runs in the background on every Solaris 10 or later system and receives telemetry information about problems detected by the system software. The fault manager then uses this information to diagnose detected problems and initiate proactive self-healing activities such as disabling faulty components.

fmdump(1M), fmadm(1M), and fmstat(1M) are the three core commands that administer the system-generated messages produced by the Solaris Fault Manager. See About Predictive Self-Healing for details. Also refer to the man pages for these commands.
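
For example, the following command lines show typical usage; the UUID argument is a placeholder for a fault event identifier taken from your own fmdump output.

# fmdump                 (list faults diagnosed by the Fault Manager)
# fmdump -v -u UUID      (show verbose detail for one fault event)
# fmadm faulty           (list resources currently believed to be faulty)
# fmstat                 (report diagnosis engine and agent statistics)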


Core Dump Process

In some failure situations, a Sun engineer might need to analyze a system core dump file to determine the root cause of a system failure. Although the core dump process is enabled by default, you should configure your system so that the core dump file is saved in a location with adequate space. You might also want to change the default core dump directory to another locally mounted location so that you can better manage any system core dumps. In certain testing and pre-production environments, this is recommended since core dump files can take up a large amount of file system space.

Swap space is used to save the dump of system memory. By default, Solaris software uses the first swap device that is defined. This first swap device is known as the dump device.

During a system core dump, the system saves the content of kernel core memory to the dump device. The dump content is compressed during the dump process at a 3:1 ratio; that is, if the system is using 6 Gbytes of kernel memory, the dump file will be about 2 Gbytes. For a typical system, the dump device should be at least one third the size of the total system memory.

See Enabling the Core Dump Process for instructions on how to calculate the amount of available swap space.


Enabling the Core Dump Process

This is normally a task that you would complete just prior to placing a system into the production environment.

Access the system console. For instructions, see Chapter 2.


To Enable the Core Dump Process

1. Check that the core dump process is enabled. As root, type the dumpadm command.


# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/machinename
  Savecore enabled: yes

By default, the core dump process is enabled in the Solaris OS.

2. Verify that there is sufficient swap space to dump memory. Type the swap -l command.


# swap -l
swapfile            dev      swaplo    blocks      free
/dev/dsk/c0t3d0s0   32,24        16   4097312   4062048
/dev/dsk/c0t1d0s0   32,8         16   4097312   4060576
/dev/dsk/c0t1d0s1   32,9         16   4097312   4065808

To determine how many bytes of swap space are available, multiply the number in the blocks column by 512. Taking the number of blocks from the first entry, c0t3d0s0, calculate as follows:

4097312 x 512 = 2097823744

The result is approximately 2 Gbytes.
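
To total the free swap space across all entries in one step, a short pipeline such as the following sketch can be used (swap -l reports 512-byte blocks in the blocks and free columns):

# swap -l | awk 'NR > 1 { sum += $5 * 512 } END { print sum " bytes free" }'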

3. Verify that there is sufficient file system space for the core dump files. Type the df -k command.


# df -k /var/crash/`uname -n`

By default, the location where savecore files are stored is:

/var/crash/`uname -n`

For instance, for the mysystem server, the default directory is:

/var/crash/mysystem

The file system specified must have space for the core dump files.

If you see messages from savecore indicating that there is not enough space in the /var/crash/ directory, you can use any other locally mounted (not NFS) file system. Following is a sample message from savecore.


System dump time: Wed Apr 23 17:03:48 2003
savecore: not enough space in /var/crash/sf440-a (216 MB avail, 246 MB needed)

Perform Steps 4 and 5 if there is not enough space.

4. Type the df -kl command to identify locations with more space on locally mounted file systems.


# df -kl
Filesystem          kbytes      used     avail  capacity  Mounted on
/dev/dsk/c1t0d0s0   832109    552314    221548    72%     /
/proc                    0         0         0     0%     /proc
fd                       0         0         0     0%     /dev/fd
mnttab                   0         0         0     0%     /etc/mnttab
swap               3626264        16   3626248     1%     /var/run
swap               3626656       408   3626248     1%     /tmp
/dev/dsk/c1t0d0s7 33912732         9  33573596     1%     /export/home

5. Type the dumpadm -s command to specify a location for the dump file.


# dumpadm -s /export/home/
      Dump content: kernel pages
       Dump device: /dev/dsk/c3t5d0s1 (swap)
Savecore directory: /export/home
  Savecore enabled: yes

The dumpadm -s command enables you to specify the savecore directory, the location where the dump file is saved. See the dumpadm(1M) man page for more information.
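
To confirm the change, run dumpadm again with no options; it redisplays the current configuration, which is stored in /etc/dumpadm.conf:

# dumpadm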


Testing the Core Dump Setup

Before placing the system into a production environment, it might be useful to test whether the core dump setup works. This procedure might take some time depending on the amount of installed memory.

Back up all your data and access the system console. For instructions, see Chapter 2.


To Test the Core Dump Setup

1. Gracefully shut down the system using the shutdown command. (A sample command sequence for this test appears at the end of this procedure.)

2. At the ok prompt, issue the sync command.

You should see "dumping" messages on the system console.

The system reboots. During this process, you can see the savecore messages.

3. Wait for the system to finish rebooting.

4. Look for system core dump files in your savecore directory.

The files are named unix.y and vmcore.y, where y is the integer dump number. There should also be a bounds file that contains the next crash number savecore will use.

If a core dump is not generated, perform the procedure described in Enabling the Core Dump Process.
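
A sample command sequence for the test follows. This is a sketch only; the savecore directory reflects the default location discussed earlier, and the dump number in the file names starts at 0.

# shutdown -y -g0 -i0           (shut down gracefully to the ok prompt)
ok sync                         (force a crash dump; the system then reboots)
# ls /var/crash/`uname -n`      (after the reboot, check the savecore directory)
bounds  unix.0  vmcore.0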