CHAPTER 1

Preventive Maintenance

Many problems can be avoided with careful system setup, comprehensive change management, and adherence to established, repeatable procedures.


Guidelines for Success

Below are some guidelines that can help you prevent problems and simplify troubleshooting.


Managing Change

Most server problems occur after something in the server changes. When you make changes to your server, follow these guidelines:


Visually Inspecting Your System

Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components. When you investigate a system problem, first check all the external switches, controls, and cable connections. If this does not resolve your problem, then visually inspect the system's interior hardware for problems such as a loose card, cable connector, or mounting screw.

See the Sun Fire V20z and Sun Fire V40z Servers User Guide for information about how to remove and replace hardware components.

External Visual Inspection

To perform a visual inspection of the external system:

1. Inspect the status indicators, which can reveal a component malfunction. See "Lights, LCD, LED" on page 64.

2. Verify that all power cables are properly connected to the system, the monitor, and the peripherals, and check their power sources.

3. Inspect connections to any attached devices (network cables, keyboard, monitor, mouse) and any devices that are attached to the serial port.

Internal Visual Inspection



Note - Before you continue, read the instructions in the document Important Safety Information about Sun Hardware Systems, which is shipped with your system. Also review the instructions for removal and replacement of components in your Sun Fire V20z and Sun Fire V40z Servers User Guide.



You can use the System Status screen in the SM Console to view the status of all system hardware components and sensors. This screen simplifies the search for components that have problems and for failed components that must be replaced. The component images displayed in the System Status screen represent the actual hardware components and their approximate locations and sizes. See the Systems Management Guide for more information.

To perform a visual inspection of the internal system:

1. Power off the system.

2. Disconnect all power cables from electrical outlets. (Some servers have two power supplies and two power cables. Ensure that both are disconnected from electrical outlets.)



Caution - When you unplug the AC power cords from the power supplies, system ground is also removed. To avoid electrostatic discharge damage to the machine, you must maintain an equal voltage potential to the machine. Ensure that you wear ESD protection, such as an ESD wrist strap, during all procedures in which you touch components in the system, and during removal and replacement procedures.



3. Remove the server cover (follow the procedures in the user guide for your server).



Caution - Some components can become hot during system operations. Allow components to cool before you touch them.



4. Remove components, if necessary, and verify that sockets are clean.

5. Replace components and verify that they are firmly seated in their sockets or connectors.

6. Check all cable connectors inside the system to verify that they are firmly and correctly attached to their appropriate connectors.

7. Replace the server cover.

8. Reconnect the system and any attached peripherals to their power sources.

9. Power on the server and the attached peripherals.


Troubleshooting Dump Utility



Note - The Troubleshooting Dump Utility, including its command syntax, arguments, and returns, is also discussed in the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide.



The Troubleshooting Dump Utility (TDU) captures important debug data from the platform OS and the service processor (SP). When you execute this command, the data is gathered and either stored in the specified NFS directory in tar format or sent to stdout, depending on the command option you choose. Along with the log file, TDU creates a summary log file that records whether TDU successfully gathered each requested piece of information. The summary log file is included in the tar file.

Key TDU definitions are:

The following data is captured by default:

Optionally, the TDU can capture the following data:



Note - Storing 1 KB of TMB in text mode takes about 4 KB on disk. Storing the default 32 KB of TMB takes 128 KB, and storing 128 MB of TMB takes about 1 GB of disk space.
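The arithmetic behind the note above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, binary units (1 KB = 1024 bytes) are an assumption, and the only data used are the three figures quoted in the note. Those figures imply an expansion factor of about 4 for the small sizes and about 8 for the 128 MB case, so the estimator below defaults to the larger factor as a conservative guess.

```python
# Sizes in bytes; binary units are an assumption, not stated in the note.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# (raw TMB size, approximate size on disk in text mode), as quoted in the note.
NOTE_FIGURES = [
    (1 * KB, 4 * KB),
    (32 * KB, 128 * KB),
    (128 * MB, 1 * GB),
]


def expansion_factors(figures):
    """Return the on-disk/raw ratio implied by each quoted figure."""
    return [on_disk / raw for raw, on_disk in figures]


def estimate_disk_bytes(raw_bytes, factor=None):
    """Estimate disk space for a TMB capture stored in text mode.

    If no factor is given, use the largest ratio implied by the note,
    which errs on the side of reserving too much space.
    """
    if factor is None:
        factor = max(expansion_factors(NOTE_FIGURES))
    return int(raw_bytes * factor)


if __name__ == "__main__":
    print(expansion_factors(NOTE_FIGURES))
    print(estimate_disk_bytes(64 * KB))
```

Running an estimate like this before a capture is one way to check that the target directory has enough free space.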



To run the Troubleshooting Dump Utility, use this command:

sp get tdulog

When you specify the -f option, the captured data is gathered and stored on the SP in a compressed tar file. The Troubleshooting Dump Utility can take up to 15 minutes to run. The system prompt returns when the utility finishes.
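Because the utility can run for up to 15 minutes, any script that launches it should allow a generous timeout. The Python sketch below shows one way to do that. It makes several assumptions that this chapter does not confirm: that the SP is reachable over SSH, that the host name and user are placeholders you supply, and that the SSH exit status reflects the command's return code. Only the command text `sp get tdulog` comes from this section; the helper names are hypothetical.

```python
import subprocess

# Command text as given in this section; no options are added here.
TDU_COMMAND = "sp get tdulog"

# The utility can take up to 15 minutes; the extra margin is an assumption.
TDU_MAX_RUNTIME_S = 15 * 60
SAFETY_MARGIN_S = 5 * 60


def build_remote_command(sp_host, user):
    """Build the argument list for running the TDU on the SP over SSH."""
    return ["ssh", f"{user}@{sp_host}", TDU_COMMAND]


def run_tdu(sp_host, user):
    """Run the TDU and return (exit status, captured stdout bytes).

    Raises subprocess.TimeoutExpired if the command runs longer than
    the documented maximum plus the safety margin.
    """
    completed = subprocess.run(
        build_remote_command(sp_host, user),
        capture_output=True,
        timeout=TDU_MAX_RUNTIME_S + SAFETY_MARGIN_S,
    )
    return completed.returncode, completed.stdout
```

A caller would invoke `run_tdu("sp.example.com", "admin")` with its own host and account, both of which are placeholders here.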

Every server management command returns a code when it completes. Below are two return codes, their IDs, and brief descriptions.


Return              ID   Definition
NWSE_Success        0    Command completed successfully.
NWSE_InvalidUsage   1    Invalid usage: bad parameter usage, conflicting options specified.




Note - Return code IDs are decimal numbers.
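A script that checks these return codes could map each decimal ID to its name and definition. The following Python sketch covers only the two codes listed above. The function name is hypothetical, and any other ID is reported as not listed in this section rather than guessed at.

```python
# The two return codes documented in this section, keyed by decimal ID.
RETURN_CODES = {
    0: ("NWSE_Success", "Command completed successfully."),
    1: (
        "NWSE_InvalidUsage",
        "Invalid usage: bad parameter usage, conflicting options specified.",
    ),
}


def describe_return_code(code):
    """Return a one-line description for a server management return code."""
    name, definition = RETURN_CODES.get(
        code,
        ("(not listed in this section)", "See the server management guide."),
    )
    return f"{code}: {name} - {definition}"


if __name__ == "__main__":
    for code in (0, 1, 7):
        print(describe_return_code(code))
```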