E Common Problems

The installation process is segmented to isolate most problems to a specific phase of the installation. The primary goal is to simplify the debugging of problems in the installation.

This appendix contains four sections:

Section E.1 lists the most common problems encountered in the installation process.

Section E.2 discusses problems encountered after building and installing the SNC kernel and booting it.

Section E.3 discusses problems encountered after installing the SNC hardware and booting the kernel.

Section E.4 discusses problems encountered after the hardware and software have been integrated and verified.

E.1 The Most Common Installation Problems

"Watchdog Reset" errors during boot.

Backplane jumpers are misconfigured, or switches or jumpers on the SNC are set incorrectly. Correct the jumper or switch setting and reboot.

"Device Not Found" errors during boot.

Backplane jumpers are misconfigured, switches or jumpers on the SNC are set incorrectly, an incompatible board is installed, or the addresses in the config file are incorrect. Correct the problem and reboot.

"Spurious Interrupt" error during boot.

Backplane jumpers or switches/jumpers on an SNC are misconfigured. Correct the problem and reboot.

"Bus Timeout" errors during boot.

Backplane jumpers are misconfigured. Correct the jumper problem and reboot.

"No Such Device" error when attempting to neload.

The devices for the SNC were not created. Make the devices for the ne interface using /dev/MAKEDEV.ne.

"Bad File" error during the boot process.

The boot program cannot find the kernel. This can happen if the kernel was not moved to the right location after it was built, or if the link to it was not created.
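A quick way to tell a missing kernel from a broken link is sketched below. This is only an illustration: the function name check_kernel is hypothetical, and the path /vmunix in the usage comment assumes the usual SunOS default kernel name, which this appendix does not state.

```shell
# Classify a kernel path as a readable file, a link whose target is
# missing, or absent altogether.
check_kernel() {
    if [ -f "$1" ]; then
        echo "found"            # regular file, or a link to one
    elif [ -h "$1" ]; then
        echo "dangling link"    # link exists but its target does not
    else
        echo "missing"
    fi
}

# On a live system (the path is an assumption):
#   check_kernel /vmunix
```

A "dangling link" result points to a link that was created before the kernel was moved into place, or to a kernel that was later renamed.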

Extremely slow NFS traffic, frequent "NFS Server Not Responding" messages, and low CPU utilization.

Check the Ethernet connector to the SNC. This can be caused by a loose or partial connection. (Remove any spacers or washers that might prevent a tight connection with the transceiver cable.)

Varying panics, including "Bad Trap," "Bus Error," and "Data Fault," while running in single-user or multi-user mode.

This problem can be caused by a CPU board that is not at the most recent revision.

The SNC does not off-load any NFS processing.

Use the nestat command to determine whether the specified SNC board is preprocessing NFS requests.

Check the bootparams and fstab files for the client. They must name the host that corresponds to the SNC. Alternatively, make sure the SNC is configured to recognize an alternate IP address that matches the host name of the server.

Check the onboard NFS parameters with the sncnet(8) command. If the value returned is zero (0), set it to one (1) with the sncnet command.

Check the neX configuration flags with the ifconfig(8c) command. If the PROMISC flag is set, the SNC is operating in promiscuous mode, and all received packets are forwarded to the host for processing. Terminate the application that has enabled promiscuous mode. Examples of applications that set promiscuous mode include etherfind, traffic, and snoop.
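The flag check can be scripted. The sketch below is not part of the SNC software: promisc_state is a hypothetical helper, and ne0 in the usage comment is an assumed interface name. It only looks for the string PROMISC in whatever ifconfig prints.

```shell
# Read ifconfig-style output on stdin and report whether PROMISC
# appears among the interface flags.
promisc_state() {
    if grep PROMISC >/dev/null; then
        echo "promiscuous"
    else
        echo "normal"
    fi
}

# On a live system (ne0 is an assumed interface name):
#   ifconfig ne0 | promisc_state
```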

Use the nestat neX command to verify that NFS requests are being processed by the SNC. If the NFS request count returned is 0, verify that NFS requests are coming from the client to the server on the network segment to which the SNC is connected. If the client is using a generic name (see Section F.5, "NFS Mounts"), verify that an alternate IP address has been associated with the SNC board.

The SNC is recognized at boot time, but does not receive the software download.

Check the device number assigned in MAKEDEV.ne against the device number in the conf.c file.
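One way to see the device number actually in use is to read the major number from a long listing of the device node and compare it with the entry in conf.c. The helper below is a sketch: major_of is a hypothetical name, /dev/ne0 is an assumed node name, and the listing is assumed to show the major and minor numbers in the usual "major, minor" form.

```shell
# Read "ls -l" output for a device node on stdin and print its major
# number, i.e. the numeric field that ends in a comma ("40," in "40, 0").
major_of() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+,$/) { sub(/,$/, "", $i); print $i; exit }
    }'
}

# On a live system (/dev/ne0 is an assumed node name):
#   ls -l /dev/ne0 | major_of
```

If the number printed differs from the one configured for the ne driver in conf.c, the device nodes were presumably made with a mismatched MAKEDEV.ne and need to be remade.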

panic: out of kernelmap for devices

This panic can occur if all loaded devices compete for more than the available amount of kernelmap address space. The following full SPARCserver 690 configuration exceeds the total amount of available kernelmap space:

128 Mbytes of RAM

8 SNCs

2 IPI disk controllers

SBus Presto

SBus cgsix

2 SBus FSBEs

If the system has a cgsix card installed, remove the card and use tty for the console; the system will boot.

Two or more SNCs were installed, but only one is recognized.

Check the kernel configuration file. Make sure the device neX lines for the additional SNCs are not commented out. If they are, remove the comment characters and rebuild the kernel.
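Commented-out device lines can be located with a short search. The sketch below assumes that comments in the kernel configuration file begin with "#" and that the relevant lines contain the word "device" followed by neX; the function name and the file name CONFIG_FILE are placeholders.

```shell
# Print, with line numbers, any commented-out "device neX" lines in
# the kernel configuration file given as $1.
commented_ne_devices() {
    grep -n '^[[:space:]]*#.*device[[:space:]][[:space:]]*ne[0-9]' "$1"
}

# Usage (CONFIG_FILE stands for your kernel configuration file):
#   commented_ne_devices CONFIG_FILE
```

No output means no device neX line is commented out, so the cause lies elsewhere.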

E.2 Problems Encountered During Software Installation

If the installation script should abort for any reason, it will display a message describing the cause of the abort. Correct the problem and restart the installation script.

If there is a problem with the kernel file, boot with your old kernel to get your system up and running. Most errors in this phase of the installation are related to misplacing the SNC kernel or the links to it, so check this first. Another common problem in this phase is that the server does not have the complete SunOS distribution installed. In that case, the kernel built in this step is faulty.

Varying panics, including "Bad Trap," "Bus Error," and "Data Fault," can be caused by a CPU board that is not at the most recent Sun revision.

If the boot program can access the SNC kernel, then the most likely cause of errors is typing errors in the kernel data and configuration files while installing manually. These mistakes usually show up as syntax errors or other errors during the config and make steps of building the kernel. If you encounter errors during the kernel build process, check each of the files edited earlier. Verify that the *.add.* files have been added in the correct locations and all of the comments and other changes have been made properly. See Appendix A for information concerning the *.add.* files.

These types of errors should not occur during an automatic installation (i.e., one where the installation script performs the file editing for you).

E.3 Problems Encountered During Hardware Installation

Most hardware problems show up when you try to reboot your system. One common error in rebooting a system is failure to allow the disk drives to come up to full speed before you power up the system. Power up the disk drives first and wait for the ready indication before you power up the server. If you cannot see the drive's ready indication, wait 60 seconds before powering up the server.

The most common errors encountered in the hardware installation process are related to backplane jumper configuration. These show up as bus timeout errors, spurious interrupts, watchdog resets, and device not found errors. Power down the system and double-check the backplane jumpers. See Section 2.5 for an illustration of the backplane jumper layout, and consult the appropriate service manual for the backplane jumper configuration for your system.

Verify that all boards and cables are properly seated. We have found that the Ethernet cable can cause intermittent faults when spacers are left on the connector. The boards and cables should be firmly in place and not prone to wiggling. Power down the system, and press each of the boards and connectors firmly into place. Make sure that each board is held in place with two screws, then try to reboot.

E.4 Problems Encountered After System Integration

Most of the problems encountered during system integration are the result of mistakes made during the software and hardware phases of the installation, but are not seen until you attempt to integrate the system. To recover from these problems it is usually necessary to go back to the software or hardware installation section.

Although you may have to redo some of the steps in the software or hardware installation sections, it is not necessary to de-install the hardware before you deal with software installation problems.

Spurious interrupts and system panics during multi-user operation can be caused by conflicts with interrupt vectors used by the hardware. See Appendix C for a discussion of how to correct this problem.

If your server fails to come up in multi-user mode but boots successfully in single-user mode, this is usually a sign of a problem with the Ethernet connection or with the rc.local scripts that configure the Ethernet connection. Some of the client-side errors associated with this problem are "Server not responding for domain" and "NFS Server Not Responding."

If, during the SNC software installation, you have configured out the Sun native Ethernet interface (ie0 or le0), and you attempt to reboot your system without any operational SNC boards installed, your system may hang during the reboot if it depends on NIS. You must then reboot in single-user mode and reconfigure your native interface (just create the required /etc/hostname.ie0 or .../hostname.le0 file).

Then connect the native interface to the Ethernet. Once you have verified that the SNC interfaces are properly installed and operating, you can safely disable the native interface if you so desire.
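Recreating the host name file for the native interface can be sketched as follows. The helper name and all three arguments are placeholders, and the sketch assumes the usual SunOS convention that the file holds the host name to be assigned to that interface.

```shell
# Write an interface's host name file.
#   $1 = directory (normally /etc)
#   $2 = interface name (ie0 or le0)
#   $3 = host name for that interface
make_hostname_file() {
    echo "$3" > "$1/hostname.$2"
}

# In single-user mode (myserver is a placeholder host name):
#   make_hostname_file /etc le0 myserver
```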

A "No such device" error is caused by a failure to invoke a MAKEDEV.ne command for the SNC. To correct the error you must make the necessary device and reboot. For example, if you've just added a second SNC, you would type:

If you observe no I/O activity on the SNC after installation, along with a "no power" indication on the transceiver unit, it is possible that the thin gray ribbon cable connecting the SNC board to the Ethernet socket is loose. Remove the SNC board and verify that the socket is snugly attached.

If you are using more than one Ethernet interface, and you encounter anomalous behavior, such as lost or duplicated Ethernet packets, reexamine the system configuration.

If you have two or more Ethernet interfaces on the same Ethernet segment, or two or more interfaces on Ethernet segments that are connected by a MAC-layer bridge, you must modify the /etc/rc.boot file for your server.

Note - By default, the Ethernet address of an SNC is initialized to the Ethernet address of the host system.

Each SNC has a unique, factory-programmed Ethernet address. The SNC software can use this address instead of the Sun system address.

To make this change in /etc/rc.boot, find the line in the file that reads

and change it to

and then reboot the system.

Note - If other systems communicate directly with the server (on the same IP network) through this interface, they must also be updated to match this change. Any outdated entry in these other machines' ARP tables must be deleted.
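On each of those other machines, the cached entry can be inspected before it is deleted. The sketch below assumes that arp -a prints lines of the form "name (ip-address) at ethernet-address"; arp_addr_of and the host name myserver are hypothetical.

```shell
# Read "arp -a" style output on stdin and print the Ethernet address
# currently cached for the host named in $1.
arp_addr_of() {
    awk -v host="$1" '$1 == host && $3 == "at" { print $4 }'
}

# On a machine that talks to the server (myserver is a placeholder):
#   arp -a | arp_addr_of myserver    # show the cached address
#   arp -d myserver                  # delete it if it is out of date
```

After the entry is deleted, the next packet sent to the server causes the address to be resolved again, which picks up the new Ethernet address.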