Sun Fire V20z and Sun Fire V40z Servers Troubleshooting Techniques and Diagnostics Guide
Many problems can be avoided with careful system setup, comprehensive change management, and adherence to established, repeatable procedures.
Guidelines for Success
Below are some guidelines that can help you prevent problems and simplify troubleshooting.
- Use uniform naming conventions for your servers, such as names that denote server locations.
- Use unique IDs or names for your devices to lessen the risk of competition for the same resource. Use the server setup utility to check for possible conflicts.
- Create a backup plan.
  - If data changes frequently, schedule backups to occur frequently.
  - Maintain a library of backups, based on your information restoration needs.
  - Periodically test your backups to ensure that your data is stored correctly.
- Use enterprise systems management tools to automate certain processes, or manually track this information:
  - Periodically check hard disk space. Ensure that each hard drive has a minimum of 15 percent free space.
  - Keep historical data. For example, a baseline record of initial CPU use levels ensures that you will be aware of significant increases. If problems occur, you can compare the baseline and current data. Other things you might track include user, bus, and power utilization rates.
  - Maintain a trend analysis to account for predictable changes. For example, if the CPU utilization rate always increases by 50 percent during late morning, you can assume that the increase is normal for that server.
- Create a problem resolution notebook. If a problem occurs, keep a log of the actions you took to resolve it. In the future, the information in the log can help you or someone else solve the same problem more quickly. This information can also help ensure accuracy when parts must be replaced.
- Keep an updated network topology map in an accessible location. This map can aid in troubleshooting network problems.
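The disk-space and baseline guidelines above can be sketched as a small script. This is a minimal illustration, not a Sun-supplied tool: the function names and the 25 percent tolerance are assumptions made for the example, and only the 15 percent free-space minimum comes from the guidelines.

```python
import shutil

MIN_FREE_FRACTION = 0.15  # the 15 percent minimum from the guideline above


def free_fraction(path):
    """Return the fraction of free space on the file system that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_enough_free_space(path, minimum=MIN_FREE_FRACTION):
    """Return True when at least `minimum` of the file system is free."""
    return free_fraction(path) >= minimum


def exceeds_baseline(current, baseline, tolerance=0.25):
    """Return True when `current` is more than `tolerance` above `baseline`.

    The 25 percent default tolerance is an illustrative assumption; choose a
    value that reflects the normal trend for your server.
    """
    return current > baseline * (1 + tolerance)


if __name__ == "__main__":
    print("free space OK:", has_enough_free_space("."))
    print("CPU above baseline:", exceeds_baseline(current=60.0, baseline=40.0))
```

A check like this could be scheduled to run periodically, with the baseline values taken from your own historical records.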
Managing Change
Most server problems occur after something in the server changes. When you make changes to your server, follow these guidelines:
- Document the system settings before the change.
- If possible, make one change at a time, in order to isolate potential problems. In this way, you can maintain a controlled environment and reduce the scope of troubleshooting.
- Take note of the results of each change. Include any errors or informational messages.
- Check for potential device conflicts before you add a new device.
- Check for version dependencies, especially with third-party software.
- To find and fix the cause of a server problem, collect information about:
  - Events that occurred prior to the failure.
  - Whether any hardware or software was modified or installed.
  - Whether the server recently was installed or moved.
  - How long the server exhibited symptoms.
  - The duration or frequency of the problem.
- After you assess the problem and note your current configuration and environment:
  - Visually inspect your system (see below).
  - Execute diagnostics tests (see "Diagnostics" on page 37).
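One way to follow the change-management guidelines above is to keep a structured record of each change. The sketch below is only an illustration: the `ChangeRecord` class, the `changed_settings` helper, and the setting names in the usage example are hypothetical and not part of any Sun tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeRecord:
    """One entry in a change log: what changed and what was observed."""
    description: str
    settings_before: dict
    messages: list = field(default_factory=list)  # errors or informational messages
    made_at: datetime = field(default_factory=datetime.now)


def changed_settings(before, after):
    """Return {name: (old, new)} for every setting whose value differs."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


if __name__ == "__main__":
    # Hypothetical setting names, used only for illustration.
    before = {"bios_version": "1.0", "boot_device": "disk0"}
    after = {"bios_version": "1.1", "boot_device": "disk0"}
    record = ChangeRecord("Update BIOS", settings_before=before)
    record.messages.append("Update completed without errors.")
    print(record.description, changed_settings(before, after))
```

Recording the settings before each change makes it possible to compare them with the current state if a problem appears later.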
Visually Inspecting Your System
Improperly set controls and loose or improperly connected cables are common causes of problems with hardware components. When you investigate a system problem, first check all the external switches, controls, and cable connections. If this does not resolve your problem, then visually inspect the system's interior hardware for problems such as a loose card, cable connector, or mounting screw.
See the Sun Fire V20z and Sun Fire V40z Servers User Guide for information about how to remove and replace hardware components.
External Visual Inspection
To perform a visual inspection of the external system:
1. Inspect the status indicators that can indicate component malfunction. See "Lights, LCD, LED" on page 64.
2. Verify that all power cables are properly connected to the system, the monitor, and the peripherals, and check their power sources.
3. Inspect connections to any attached devices (network cables, keyboard, monitor, mouse) and any devices that are attached to the serial port.
Internal Visual Inspection
Note - Before you continue, read the instructions in the document Important Safety Information about Sun Hardware Systems, which is shipped with your system. Also review the instructions for removal and replacement of components in your Sun Fire V20z and Sun Fire V40z Servers User Guide.
You can use the System Status screen in the SM Console to view status information for all system hardware components and sensors. This screen simplifies the search for components that have problems or for failed components that must be replaced. The component images displayed in the System Status screen represent the actual hardware components and their approximate locations and sizes. See the Systems Management Guide for more information.
To perform a visual inspection of the internal system:
1. Power off the system.
2. Disconnect all power cables from electrical outlets. (Some servers have two power supplies and two power cables. Ensure that both are disconnected from electrical outlets.)
Caution - When you unplug the AC power cords from the power supplies, system ground is also removed. To avoid electrostatic discharge damage to the machine, you must maintain an equal voltage potential to the machine. Ensure that you wear ESD protection, such as an ESD wrist strap, during all procedures in which you touch components in the system, and during removal and replacement procedures.
3. Remove the server cover (follow the procedures in the user guide for your server).
Caution - Some components can become hot during system operations. Allow components to cool before you touch them.
4. Remove components, if necessary, and verify that sockets are clean.
5. Replace components and verify that they are firmly seated in their sockets or connectors.
6. Check all cable connectors inside the system to verify that they are firmly and correctly attached to their appropriate connectors.
7. Replace the server cover.
8. Reconnect the system and any attached peripherals to their power sources.
9. Power on the server and the attached peripherals.
Troubleshooting Dump Utility
Note - The Troubleshooting Dump Utility also is discussed in the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide, including command syntax, arguments, and returns.
The Troubleshooting Dump Utility (TDU) captures important debug data from the platform OS and the service processor (SP). When you execute this command, the data is gathered and either stored in the specified NFS directory in tar format or sent to stdout, depending on the command option you choose. Along with the log file, TDU creates a summary log file that records whether TDU successfully gathered each requested piece of information. The summary log file is included in the tar file.
Key TDU definitions are:
- GPR - General Purpose Registers.
- MCR - Machine Check Registers.
- MSR - Machine Status Registers, including the MCRs.
- SPR - Special Purpose Registers.
- CSR - PCI Configuration Space Registers.
- TCB - Trace Buffers from K-8.
- TMB - Trace Buffers (TCB) from DRAM.
The following data is captured by default:
- SST data (5KB).
- Uncleared current events (120KB).
- Software Inventory (approximately 25KB).
- Hardware Inventory (approximately 25KB).
- pstore data:
  - Group file (approximately 0.5KB).
  - Event configuration file (evcfg, approximately 4KB).
  - Security configuration file (seccfg, approximately 5KB).
  - Ethernet configuration file (netifcfg2-eth0, approximately 0.2KB).
- Current processes on the Service Processor (10KB).
Optionally, the TDU can capture the following data:
- K-8 registers (-c|--cpuregs), including GPRs, SPRs, MSRs, MCRs and TCB (19KB).
- All PCI configuration registers (-p|--pciregs) (25KB).
- TCB from DRAM (--tmb, 128KB by default or user-defined size up to 1GB).
Note - Storage of 1KB of TMB in text mode takes about 4KB on disk. Storage of 32KB of default TMB takes 128KB, and storage of 128MB of TMB takes about 1GB of disk space.
To run the Troubleshooting Dump Utility, use this command:
sp get tdulog
When you specify the -f option, the captured data is gathered and stored on the SP in a compressed tar file. The Troubleshooting Dump Utility can take up to 15 minutes to run. The system prompt is displayed when the utility completes.
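The optional captures listed earlier are selected with flags on the same command. The sketch below composes the command line from those flags. The `build_tdu_command` helper is hypothetical, and the sketch assumes that the long-form flags can be combined and that --tmb is accepted without an explicit size; see the Server Management Guide for the authoritative syntax.

```python
def build_tdu_command(cpuregs=False, pciregs=False, tmb=False):
    """Compose the TDU command as a list of tokens.

    Only flags named in this section are used. Combining them, and passing
    --tmb without a size, are assumptions of this sketch.
    """
    command = ["sp", "get", "tdulog"]
    if cpuregs:
        command.append("--cpuregs")  # K-8 registers: GPRs, SPRs, MSRs, MCRs and TCB
    if pciregs:
        command.append("--pciregs")  # all PCI configuration registers
    if tmb:
        command.append("--tmb")      # TCB from DRAM, default size
    return command


if __name__ == "__main__":
    print(" ".join(build_tdu_command(cpuregs=True, pciregs=True)))
```

Because each optional capture adds to the size of the dump, request only the data you need.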
Every server management command returns a code when it completes. Below are two return codes, their IDs, and brief descriptions.
Return | ID | Definition
NWSE_Success | 0 | Command completed successfully.
NWSE_InvalidUsage | 1 | Invalid usage: bad parameter usage, conflicting options specified.
Note - Return code IDs are decimal numbers.
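A script that collects return codes can translate them into the names shown above. The following is a minimal sketch that covers only the two codes listed in this section; the `describe_return_code` helper is hypothetical, and any other code is reported as unlisted rather than guessed.

```python
# Return codes documented in this section (IDs are decimal numbers).
RETURN_CODES = {
    0: ("NWSE_Success", "Command completed successfully."),
    1: ("NWSE_InvalidUsage",
        "Invalid usage: bad parameter usage, conflicting options specified."),
}


def describe_return_code(code):
    """Return a readable description for a server management return code."""
    name, definition = RETURN_CODES.get(
        code, ("unlisted", "Return code not listed in this section.")
    )
    return f"{name} ({code}): {definition}"


if __name__ == "__main__":
    for code in (0, 1, 42):
        print(describe_return_code(code))
```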
817-7184-12
Copyright © 2005, Sun Microsystems, Inc. All Rights Reserved.