C H A P T E R  6

Troubleshooting Options

There are several troubleshooting options that you can implement when you set up and configure the Netra 440 server. By setting up your system with troubleshooting in mind, you can save time and minimize disruptions if the system encounters any problems.

Tasks covered in this chapter include:

Other information in this chapter includes:


Updated Troubleshooting Information

Sun will continue to gather and publish information about the Netra 440 server long after the initial system documentation is shipped. You can obtain the most current server troubleshooting information in the Product Notes and at Sun web sites. These resources can help you understand and diagnose problems that you might encounter.

Release Notes

Netra 440 Server Release Notes (817-3885-xx) contain late-breaking information about the system, including the following:

The latest Release Notes are available at:

http://www.sun.com/documentation

Web Sites

SunSolve Online

This site presents a collection of resources for Sun technical and support information. Access to some of the information on this site depends on the level of your service contract with Sun. This site includes the following:

The SunSolve Online Web site is at:

http://sunsolve.sun.com

Big Admin

This web site is a one-stop resource for Sun system administrators. The Big Admin web site is at:

http://www.sun.com/bigadmin


Firmware and Software Patch Management

Sun makes every attempt to ensure that each system is shipped with the latest firmware and software. However, in complex systems, bugs and problems are discovered in the field after systems leave the factory. Often, these problems are fixed with patches to the system's firmware. Keeping your system's firmware and Solaris OS current with the latest recommended and required patches can help you avoid problems that others might have already discovered and solved.

Firmware and operating system updates are often required to diagnose or fix a problem. Schedule regular updates of your system's firmware and software so that you will not have to update the firmware or software at an inconvenient time.
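As a simple illustration of a patch check, you can list the patches that are already applied with the showrev command and then apply a downloaded patch with patchadd. The patch ID shown below is hypothetical; substitute the IDs from the current recommended patch list for your Solaris release.

# showrev -p | grep 112233
# patchadd /var/tmp/112233-04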

You can find the latest patches and updates for the Netra 440 server at the Web sites listed in Web Sites.

Sun Install Check Tool

When you install the SunSM Install Check tool, you also install Sun Explorer Data Collector. The Sun Install Check tool uses Sun Explorer Data Collector to help you confirm that your Netra 440 server installation has been completed optimally. Together, they can evaluate your system for the following:

If potential issues are identified, the software generates a report that provides specific instructions for remedying them.

You can download the Sun Install Check tool software and documentation at:

http://www.sun.com/software/installcheck/


Sun Explorer Data Collector

The Sun Explorer Data Collector is a system data collection tool that Sun support services engineers sometimes use when troubleshooting Sun SPARC and x86 systems. In certain support situations, Sun support services engineers might ask you to install and run this tool. If you installed the Sun Install Check tool at initial installation, you also installed Sun Explorer Data Collector. If you did not install the Sun Install Check tool, you can install Sun Explorer Data Collector later without the Sun Install Check tool. By installing this tool as part of your initial system setup, you avoid having to install the tool at a later, and often inconvenient time.

Both the Sun Install Check tool (with bundled Sun Explorer Data Collector) and the Sun Explorer Data Collector (standalone) are available at:

http://sunsolve.sun.com

At that site, click on the appropriate link.


Sun Remote Services Net Connect

SunSM Remote Services (SRS) Net Connect is a collection of system management services designed to help you better control your computing environment. These Web-delivered services enable you to monitor systems, to create performance and trend reports, and to receive automatic notification of system events. These services help you to act more quickly when a system event occurs and to manage potential issues before they become problems.

More information about SRS Net Connect is available at:

http://www.sun.com/service/support/srs/netconnect


Configuring the System for Troubleshooting

System failures are characterized by certain symptoms. Each symptom can be traced to one or more problems or causes by using specific troubleshooting tools and techniques. This section describes troubleshooting tools and techniques that you can control through configuration variables.

Hardware Watchdog Mechanism

The hardware watchdog mechanism is a hardware timer that is continually reset as long as the operating system is running. If the system hangs, the operating system is no longer able to reset the timer. The timer then expires and causes an automatic externally initiated reset (XIR), displaying debug information on the system console. The hardware watchdog mechanism is enabled by default. If the mechanism has been disabled, the Solaris OS must be configured before it can be reenabled.
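The reenabling procedure is documented in the Netra 440 Server System Administration Guide; on servers of this class it typically amounts to adding an entry similar to the following to the /etc/system file and rebooting. The variable name shown is an assumption, so verify it against the guide before editing the file.

set watchdog_enable = 1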

The configuration variable error-reset-recovery allows you to control how the hardware watchdog mechanism behaves when the timer expires. The following are the error-reset-recovery settings:

For more information about the hardware watchdog mechanism and XIR, refer to the Netra 440 Server System Administration Guide (817-3884-xx).
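For example, to change the recovery behavior from the ok prompt, you might enter a command such as the following. The value sync, which attempts to produce a core dump file of the Solaris OS before rebooting, is one commonly documented setting; confirm the values supported by your firmware revision.

ok setenv error-reset-recovery sync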

For information about troubleshooting system hangs, see:

Automatic System Recovery Settings

The automatic system recovery (ASR) features enable the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An auto-configuring capability designed into the OpenBoot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

How you configure ASR settings affects not only how the system handles certain types of failures, but also how you go about troubleshooting certain problems.

For day-to-day operations, enable ASR by setting OpenBoot configuration variables as shown in TABLE 6-1.

TABLE 6-1 OpenBoot Configuration Variable Settings to Enable Automatic System Recovery

Variable                Setting
auto-boot?              true
auto-boot-on-error?     true
diag-level              max
diag-switch?            true
diag-trigger            all-resets
post-trigger            all-resets
diag-device             (Set to the boot-device value)


Configuring your system this way ensures that diagnostic tests run automatically when most serious hardware and software errors occur. With this ASR configuration, you can save time diagnosing problems since POST and OpenBoot Diagnostics test results are already available after the system encounters an error.
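One way to apply the settings in TABLE 6-1 is to enter them from the ok prompt, as sketched below. The diag-device value disk is only a placeholder; set diag-device to whatever your boot-device variable contains (printenv boot-device displays the current value).

ok printenv boot-device
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true
ok setenv diag-level max
ok setenv diag-switch? true
ok setenv diag-trigger all-resets
ok setenv post-trigger all-resets
ok setenv diag-device disk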

For more information about how ASR works, and complete instructions for enabling ASR capability, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Remote Troubleshooting Capabilities

You can use the Advanced Lights Out Manager (ALOM) system controller to troubleshoot and diagnose the system remotely. The ALOM system controller lets you do the following:

In addition, you can use the ALOM system controller to access the system console, provided the system console has not been redirected elsewhere. System console access enables you to do the following:

For more information about ALOM, see:

For more information about the system console, refer to the Netra 440 Server System Administration Guide.

System Console Logging

Console logging is the ability to collect and log system console output. Console logging captures console messages so that system failure data, like Fatal Reset error details and POST output, can be recorded and analyzed.

Console logging is especially valuable when troubleshooting Fatal Reset errors and RED State Exceptions. In these conditions, the Solaris OS terminates abruptly, and although it sends messages to the system console, the operating system software does not log any messages in traditional file system locations such as the /var/adm/messages file. The following is an excerpt from the /var/adm/messages file.

CODE EXAMPLE 6-1 /var/adm/messages File Information
May  9 08:42:17 Sun-SFV440-a SUNW,UltraSPARC-IIIi: [ID 904467 kern.info] NOTICE: [AFT0] Corrected memory (RCE) Event detected by CPU0 at TL=0, errID 0x0000005f.4f2b0814
May  9 08:42:17 Sun-SFV440-a     AFSR 0x00100000<PRIV>.82000000<RCE> AFAR 0x00000023.3f808960
May  9 08:42:17 Sun-SFV440-a     Fault_PC <unknown> J_REQ 2
May  9 08:42:17 Sun-SFV440-a     MB/P2/B0: J0601 J0602
May  9 08:42:17 Sun-SFV440-a unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module MB/P2/B0: J0601 J0602
May  9 08:42:19 Sun-SFV440-a SUNW,UltraSPARC-IIIi: [ID 263516 kern.info] NOTICE: [AFT0] Corrected memory (CE) Event detected by CPU2 at TL=0, errID 0x0000005f.c52f509c

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm/messages file. You can direct where these messages are stored or have them sent to a remote system by setting up system message logging. For more information, refer to "How to Customize System Message Logging" in the System Administration Guide: Advanced Administration, which is part of the Solaris System Administrator Collection.
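As a minimal sketch of such a customization, the following /etc/syslog.conf entry forwards error and kernel notice messages to a remote log host. The host name loghost is an assumption, and the selector field must be separated from the action field by tabs, not spaces.

*.err;kern.notice;auth.notice                  @loghost

After editing the file, restart or signal syslogd (for example, by stopping and starting /etc/init.d/syslog on Solaris 8 or 9) so that the change takes effect.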

In some failure situations, a large stream of data is sent to the system console. Because ALOM log messages are written into a "circular buffer" that holds 64 Kbyte of data, it is possible that the output identifying the original failing component can be overwritten. Therefore, you may want to explore further system console logging options, such as SRS Net Connect or third-party vendor solutions. For more information about SRS Net Connect, see Sun Remote Services Net Connect.

More information about SRS Net Connect is available at:

http://www.sun.com/service/support/

Certain third-party vendors offer data logging terminal servers and centralized system console management solutions that monitor and log output from many systems. Depending on the number of systems you are administering, these might offer solutions for logging system console information.
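Whichever logging approach you choose, you can also review the most recent console output that ALOM has captured in its buffer. Assuming the standard ALOM command set, the following command entered at the sc> prompt displays the buffered run-time console output:

sc> consolehistory run -v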

For more information about the system console, refer to the Netra 440 Server System Administration Guide.


The Core Dump Process

In some failure situations, a Sun engineer might need to analyze a system core dump file to determine the root cause of a system failure. Although the core dump process is enabled by default, you should configure your system so that the core dump file is saved in a location with adequate space. You might also want to change the default core dump directory to another locally mounted location so that you can better manage any system core dumps. In certain testing and preproduction environments, changing the default directory is recommended because core dump files can take up a large amount of file system space.

Swap space is used to save the dump of system memory. By default, Solaris software uses the first swap device that is defined. This first swap device is known as the dump device.

During a system core dump, the system saves the content of kernel core memory to the dump device. The dump content is compressed during the dump process at a 3:1 ratio; that is, if the system were using 6 Gbyte of kernel memory, the dump file would be about 2 Gbyte. For a typical system, the dump device should be at least one-third the size of the total system memory.

See To Enable the Core Dump Process for instructions on how to calculate the amount of available swap space. You would normally enable the core dump process just prior to placing a system into the production environment.


procedure icon  To Enable the Core Dump Process

1. Access the system console.

Refer to the Netra 440 Server System Administration Guide.

2. Check that the core dump process is enabled.

As superuser, type the dumpadm command.

# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/machinename
Savecore enabled: yes

By default, the core dump process is enabled in Solaris 8.

3. Verify that there is sufficient swap space to dump memory.

Type the swap -l command.

# swap -l
swapfile            dev       swaplo     blocks     free
/dev/dsk/c0t3d0s0   32,24     16         4097312    4062048
/dev/dsk/c0t1d0s0   32,8      16         4097312    4060576
/dev/dsk/c0t1d0s1   32,9      16         4097312    4065808

To determine how many bytes of swap space are available, multiply the number in the blocks column by 512. Taking the number of blocks from the first entry, c0t3d0s0, calculate as follows:

4097312 x 512 = 2097823744

The result is approximately 2 Gbyte.

4. Verify that there is sufficient file system space for the core dump files.

Type the df -k command.

# df -k /var/crash/`uname -n`

By default the location where savecore files are stored is:

/var/crash/`uname -n`

For instance, for the mysystem server, the default directory is:

/var/crash/mysystem

The file system specified must have space for the core dump files.

If you see messages from savecore indicating that there is not enough space in the /var/crash/ directory, any other locally mounted (not NFS) file system can be used. Following is a sample message from savecore.

System dump time: Wed Apr 23 17:03:48 2003
savecore: not enough space in /var/crash/sf440-a (216 MB avail, 246 MB needed)

Perform Step 5 and Step 6 if there is not enough space.

5. Type the df -kl command to identify locations with more space.

The -l option limits the display to locally mounted file systems.

# df -kl
Filesystem         kbytes    used    avail capacity Mounted on
/dev/dsk/c1t0d0s0   832109    552314  221548   72%  /
/proc                     0       0       0     0%  /proc
fd                        0       0       0     0%  /dev/fd
mnttab                    0       0       0     0%  /etc/mnttab
swap               3626264      16 3626248     1%  /var/run
swap               3626656     408 3626248     1%  /tmp
/dev/dsk/c1t0d0s7  33912732      9 33573596     1%  /export/home

6. Type the dumpadm -s command to specify a location for the dump file.

# dumpadm -s /export/home/
      Dump content: kernel pages
        Dump device: /dev/dsk/c3t5d0s1 (swap)
Savecore directory: /export/home
  Savecore enabled: yes

The dumpadm -s command enables you to specify the directory in which the system saves core dump files. See the dumpadm(1M) man page for more information.
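After changing the savecore directory, you can repeat the check from Step 4 against the new location to confirm that it has sufficient space:

# df -k /export/home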


Testing the Core Dump Setup

Before placing the system into a production environment, it might be useful to test whether the core dump setup works. This procedure might take some time depending on the amount of installed memory.


procedure icon  To Test the Core Dump Setup

1. Back up all your data and access the system console.

2. Gracefully shut down the system using the shutdown command.

3. At the ok prompt, issue the sync command.

You should see "dumping" messages on the system console.

The system reboots. During this process, you can see the savecore messages.

4. Wait for the system to finish rebooting.

5. Look for system core dump files in your savecore directory.

The files are named unix.y and vmcore.y, where y is the integer dump number. There should also be a bounds file that contains the next crash number savecore will use.
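For example, on the mysystem server used earlier, a successful test might leave a savecore directory that looks similar to the following; the dump number 0 is only an illustration.

# ls /var/crash/mysystem
bounds     unix.0     vmcore.0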

If a core dump is not generated, perform the procedure described in To Enable the Core Dump Process.