CHAPTER 3

Troubleshooting Topics

This chapter contains troubleshooting instructions and references for a variety of issues. Information is organized alphabetically, according to general topics, is cross-referenced as necessary, and is indexed in the last section of this document.


BIOS

This section describes possible causes and suggested troubleshooting steps for systems management events that are related to BIOS.



Note - See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide and the SM Console Online Help for information about how to update the BIOS. See "Update Failed" on page 83, to troubleshoot BIOS updates.



BIOS Error or Warning Events

The errors listed in the table below are returned by the sp get events command. The possible causes and suggested actions to take in order to resolve each problem (in order of probability, based on experience) are listed below.



Note - See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for more information about the sp get events command.



TABLE 3-1 BIOS Error Messages

Error

Solution or Reference

[CPU ID Error]

A possible cause of this error is mismatched CPU revisions (Revs). Determine the Rev of each CPU. If they are not the same, replace them with CPUs of a consistent Rev.

[Date and Time Setting Error]

This error usually indicates that the battery failed. To remedy this problem, replace the battery, run setup, set time and date, cycle power with a five-minute delay in off state, and check to see if the error recurs.

[Diag Failed Memtest]

To remedy this problem, replace the reported DIMM, then reboot. If other DIMMs fail, replace them and repeat the test. If the same DIMM fails, replace the entire set of DIMMs with known good DIMMs, and run the tests again. See DIMM Faults.

[Diagnostic Load Failure]

During loading of the diagnostics from the SP to the platform, the load operation failed. Retry.

[DMA Test Failed], [Software NMI Failed], [Fail-Safe Timer NMI Failed]

You are unlikely to see this message as the probability of its occurrence is extremely low. If you do receive this message, try rebooting the server.

[Fixed Disk Failure]

If all HDDs in a multiple-HDD system have failed, the power supply is the likely source of the problem. The power supply is also a possible source of the problem in a single-HDD system, but check the other possibilities listed below first.
The HDD data cable might be connected improperly, or the backplane connector mating might be skewed. Ensure that the connector is securely connected to the backplane.
A drive might not be inserted completely. Pull out the drive, inspect it, reinsert it, and verify that the mating is smooth and complete.
The drive electronics or interface might have failed. If possible, insert the drive into a different slot in the same system. If possible, also insert the drive into a different system. If the drive works in the other system, return the drive to the server that experienced the initial problem. If the drive fails in the other system also, try a different drive in the original system, if possible. If the second drive works in the second system, but not in the first system, return the first system. If the drive that worked in the second system does not work in the first system and the drive from the first system does not work in the second system, the drive electronics and backplane might be bad. Return the system.

[Flash Image Validation Error]

The BIOS Image that is used in the BIOS Update command is corrupted, or was not a BIOS image (wrong filename), or the transfer of the image to the platform failed. Retry the operation. If it still fails, check that the file is actually a valid BIOS image file.

[Flash Process Failure]

This error probably indicates that the flash chip is defective. To remedy the problem, replace the flash chip. If the problem persists, it might indicate a problem that is not serviceable by the user. Contact the Support Center.

[Incorrect BIOS image file]

The BIOS Image provided to the BIOS update command is a BIOS for a different platform. Obtain the correct BIOS image for your platform.

[IP Failure]

An internal communication error occurred between BIOS and the SP. Retry the operation.

[Memory Mismatched]

Pairs of DIMMs must match. Determine if DIMMs in each pair match, and reconfigure, if necessary. See DIMM Faults.

[Operating System not found]

Possible causes of this error are: Empty drive or media (contains no boot block). Intended boot device is not in the BIOS setup boot settings. Floppy disk was left in floppy drive. Media is damaged or corrupted. (Usually this is caught under fixed drive failure, if booting from the hard drive.)

[Parity Error (Memory)], [Extended Memory Truncation]

BIOS probably would report a bad DIMM mapped out. If either of these errors occurs intermittently, run memory tests. See Diagnostics, and Memory.

[Real-Time Clock Error]

This error can indicate a South bridge failure, a BIOS failure, a bad crystal, or a bad oscillator. Possible solutions are to re-flash the BIOS or to replace the battery.

[Shadow RAM Failed], [System RAM Failed], [Extended RAM Failed]

These errors indicate a general memory DIMM error. The first two indicate that the failure occurred below the first MB of RAM. See DIMM Faults, for details. If you are unable to boot the diagnostics kernel, replace all DIMMs with known good DIMMs. If this is successful, then use diagnostics to identify the bad DIMMs.

[System Timer Error]

This is a legacy error. It can indicate a South bridge failure or a BIOS failure. The most likely cause is a corrupted BIOS. To remedy this, re-flash the BIOS.

Received [early] fatal error from BIOS: [Unable to do anything]

BIOS can detect some hardware errors before enough of the system is working to report a more specific error code. If known good CPUs are installed, contact the Support Center for help.

 
TABLE 3-2 BIOS Warning Messages

Warning

Solution or Reference

[CMOS Checksum Failure], [CMOS Settings do not match hardware configuration], [CMOS Invalid]

To remedy these problems, re-run setup (see "BIOS Configuration" in the Software Installation and Configuration Guide), save, exit, and cycle power. If one of the errors occurs again, replace the battery, run setup, set time and date, and cycle power with a five-minute delay in off state. If the problem recurs, contact the Support Center.

[PCI-X Slot disabled for 8131 Errata 56]

During setup (see "BIOS Configuration" in the Software Installation and Configuration Guide), ensure that in the Advanced menu, you set the option to allow the card to be recognized. Do this only if you are certain that the card will not cause data corruption, or are willing to take the risk. The card was powered off to prevent data corruption. For more information, see the Sun Fire V20z and Sun Fire V40z Servers--Release Notes.

Received warning from BIOS: [CMOS Battery Failure]

This error probably indicates a battery failure. To remedy this problem, replace the battery, run setup, set time and date, and cycle power with a five-minute delay in off state. If the problem recurs, contact the Support Center.

 

BIOS POST Codes

If hardware or configuration errors occur, the BIOS displays warning or error messages on the video display, if one is attached. However, some errors can be so severe that the BIOS cannot initialize the video or halts immediately. In these cases, you can determine the last Power On Self Test (POST) task that the BIOS executed. It is indicated by the value written to port 80.

The most common POST codes that are reported on the Sun Fire V20z and Sun Fire V40z servers, and the suggested troubleshooting actions, are listed in the table below.

TABLE 3-3 Common POST Codes

POST Code

Reference or Solution

00

Indicates that the BIOS did not execute far enough to write any POST codes. This usually is caused by a failure to power on, a fatal CPU fault, or a fatal BIOS flash part issue.

C0

Indicates that an operating system was not detected.

28

Indicates that the SPDs on a DIMM did not read correctly. Probably indicates a bad DIMM. See DIMM Faults.

2C

An address or a data error caused by a bad DIMM, VRM, or CPU. See DIMM Faults.

49

PCI config space error. Remove PCI boards to find offending board, shuffle order, replace, or use other brand as needed.
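The table above can be kept as a small lookup when you record port 80 values during troubleshooting. The following is a minimal sketch only: the function name `describe_post_code` and the wording of the short descriptions are illustrative assumptions, and the codes themselves are the five listed in TABLE 3-3.

```python
# Common POST codes from TABLE 3-3, keyed by the hex value written to port 80.
# The descriptions are shortened paraphrases of the table entries.
COMMON_POST_CODES = {
    0x00: "BIOS did not execute far enough to write a POST code "
          "(power-on failure, fatal CPU fault, or BIOS flash part issue).",
    0xC0: "An operating system was not detected.",
    0x28: "SPD on a DIMM did not read correctly; probably a bad DIMM. See DIMM Faults.",
    0x2C: "Address or data error caused by a bad DIMM, VRM, or CPU. See DIMM Faults.",
    0x49: "PCI config space error. Remove PCI boards to find the offending board.",
}


def describe_post_code(code_text: str) -> str:
    """Return the TABLE 3-3 entry for a POST code given as hex text, such as '2C'."""
    code = int(code_text.strip(), 16)  # raises ValueError if the text is not hex
    return COMMON_POST_CODES.get(
        code, f"POST code {code:02X} is not in the common list."
    )


if __name__ == "__main__":
    print(describe_post_code("2C"))
```

A code that is not in the table is reported as such rather than guessed at, because the table lists only the most common codes.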

 


Boot Issues

For information about boot problems that are associated with the platform OS, see Platform OS Does not Boot. For boot problems that are associated with the SP, see Service Processor.


Clear CMOS Jumper

In some troubleshooting procedures, it is necessary to clear the CMOS jumper. Instructions for this procedure are below.

1. Power down the server.

2. Disconnect the AC power cord. If you have two power supplies, disconnect both AC power cords.

3. Remove the system cover as instructed in the Hardware Components and Service document.

4. Locate the appropriate jumper. Facing the server from the front panel:

5. Move the jumper to the parked position (away from the dot) so that the CMOS clears on the next boot.

6. Replace the system cover and reconnect the AC power.

7. Reboot the server and press F2 during the boot, to go into the BIOS setup.

8. Press F9 to set defaults.

9. Press F10 to save changes.

10. Power down the server, disconnect the AC power cord(s), and remove the system cover.

11. Move the jumper back to the active position (closer to the dot) so that the CMOS retains its settings on the next boot.

12. Replace the system cover, reconnect the AC power, and reboot the server.


DIMM Faults



Note - To enable DIMM fault reporting, you must install the NSV software on your system, as detailed in the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide. Although these drivers are available in the NSV, it is not necessary to mount the NSV to the SP in order to enable this functionality.



The system fault LED blinks and identifies uncorrectable DIMM faults or correctable faults that exceed the threshold. Faults also are reported in the event log, the SM Console, and diags memory tests. (See ECC Errors, for an example of diags output that reports a DIMM fault.) The system might continue to operate normally, depending on the type of failure, the location of the failure, and the robustness of the platform operating system.

IPMI System Event Log (SEL) records are generated for both correctable and uncorrectable DIMM ECC errors. To determine the type of error, examine the sensor-specific offset in Event Data 1. The CPU (memory bank) and DIMM numbers are located in the high and low nibbles of the Event Data 3 field, respectively.
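The nibble layout described above can be decoded as in the following sketch. The function name and the sample byte values are hypothetical. The mapping of sensor-specific offsets 00h and 01h to correctable and uncorrectable ECC is an assumption taken from the IPMI specification's Memory sensor type, as is the use of the low nibble of Event Data 1 as the offset; confirm both against the Server Management Guide before relying on them.

```python
# Assumed sensor-specific offsets for the IPMI Memory sensor type.
ECC_OFFSETS = {0x00: "correctable ECC", 0x01: "uncorrectable ECC"}


def decode_dimm_ecc_event(event_data1: int, event_data3: int) -> dict:
    """Decode a DIMM ECC SEL record from its Event Data 1 and Event Data 3 bytes."""
    for name, value in (("Event Data 1", event_data1), ("Event Data 3", event_data3)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must be a single byte")
    offset = event_data1 & 0x0F  # assumed: offset is the low nibble of Event Data 1
    return {
        "error_type": ECC_OFFSETS.get(offset, f"other (offset {offset:#04x})"),
        "cpu": (event_data3 >> 4) & 0x0F,  # high nibble: CPU (memory bank) number
        "dimm": event_data3 & 0x0F,        # low nibble: DIMM number
    }


if __name__ == "__main__":
    # Hypothetical record: offset 1, CPU 1, DIMM 3.
    print(decode_dimm_ecc_event(0xA1, 0x13))
```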



Note - See the Operator Panel's server menu options in the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide. These errors also appear in the system events log. See System Events.




ECC Errors

In both the Sun Fire V20z Server and the Sun Fire V40z Server, each CPU can support four DIMMs.

If your log files report an ECC error or a problem with a memory DIMM, complete the steps below.



Note - See Log Files, for a summary of log files that are available with your server.



In the example below, the log file reports an error with the DIMM in CPU0, bank 0, slot 1.

1. Power down your server and remove the cover.

2. Remove the DIMMs that were indicated in the log file and label them.

3. Visually inspect the DIMMs for physical damage, dust, or any other contamination on the connector.

4. Visually inspect the DIMM slots for physical damage. Look for cracked or broken plastic on the slot.

5. Dust off the DIMMs, clean the contacts, and reseat them. (You can leave the labels on the DIMMs.)

6. Reboot the system. If the problem persists, continue to Step 7.

7. Power down the server again and remove the cover.

8. Remove the DIMMs that were identified in the log file.

9. Exchange the individual DIMMs between the two slots of a given bank. Ensure that they are inserted correctly, with latches secured.

10. Power on the server and run the process that caused the DIMM error.

11. Review the log file. (See "ECC Failure" on page 96, for sample output.)

12. If the error now appears in CPU0, bank 0, slot 0 (opposite to the original error), the problem is related to the individual DIMM that is now in slot 0.

or

If the error still appears in CPU0, bank 0, slot 1 (as the original error did), the problem is not related to an individual DIMM. Instead, it might be caused by CPU0 or by the DDR VRM for CPU0.

13. If you have a Sun Fire V20z Server with a single CPU, you cannot independently troubleshoot the problem any further. A replacement part might be necessary.

or

If you have a server with at least two CPUs, continue with Step 14.

14. Label, then exchange the memory VRMs between the two CPUs.

15. Power on the server and run the process that caused the DIMM error.

16. Review the log file.

17. If the error now appears on CPU1 (a different CPU from the original error), the problem is related to the DDR VRM that originally was seated for CPU0. A replacement part might be necessary.

or

If the error still appears in CPU0, bank 0, slot 1 (as the original error did), the problem is not related to the memory VRM. It might be due to CPU0 or the motherboard. A replacement part might be necessary.
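The two swap tests in the procedure above (Steps 9 through 12 for the DIMMs, Steps 14 through 17 for the memory VRMs) follow the same reasoning: if the error moves with the part that was exchanged, that part is suspect. The sketch below restates that reasoning only. The function names and return strings are illustrative and are not part of any server software.

```python
def interpret_dimm_swap(original_slot: int, slot_after_swap: int) -> str:
    """Interpret the log after the DIMMs in the two slots of a bank were exchanged."""
    if slot_after_swap != original_slot:
        # The error followed the DIMM to the other slot.
        return "individual DIMM"
    # The error stayed in the same slot, so the DIMM itself is not the cause.
    return "CPU or DDR VRM"


def interpret_vrm_swap(original_cpu: int, cpu_after_swap: int) -> str:
    """Interpret the log after the memory VRMs of two CPUs were exchanged."""
    if cpu_after_swap != original_cpu:
        # The error followed the VRM to the other CPU.
        return "DDR VRM"
    # The error stayed with the original CPU.
    return "CPU or motherboard"


if __name__ == "__main__":
    # Example from the procedure: original error in CPU0, bank 0, slot 1.
    print(interpret_dimm_swap(original_slot=1, slot_after_swap=0))
    print(interpret_vrm_swap(original_cpu=0, cpu_after_swap=0))
```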


Inventory

Use the inventory get all, inventory get hardware, and inventory get software commands to view a list of field-replaceable hardware components or current software components and versions. See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for details about these commands.

If you have an NSV version 2.2 or earlier, and you add a newer NSV version to the same location, the inventory get software command, with the [{-a|--all}] argument, might time out. If this is the case, follow the instructions, below.

1. Move and unzip any newer NSV versions to a different location from the location of your 2.2 NSV.

or

Review the older NSV and remove folders for operating systems that you no longer need.

2. Try the command again.


Lights, LCD, LED

TABLE 3-4 Lights on the Front Panel

Problem

Solution or Reference

Locate light blinks

The Locate Light can be illuminated (or extinguished) by pressing the Locate Light button beside it. The system administrator turns the Locate Light on in order to simplify the task of locating a specific server. A blinking Locate Light does not indicate a problem.

System Fault LED is illuminated

The System Fault LED (Machine Check Error) light is illuminated when a variance occurs. For troubleshooting tips, see Machine Check Error and System Events.

Platform Power State Indicator light is not illuminated

Check power connection to AC. On the Sun Fire V20z Server, check the AC Power Switch and the AC Present Indicator on the back panel.

Operator Panel LCD is not illuminated

Check power connection to AC. On the Sun Fire V20z Server, check the AC Power Switch and the AC Present Indicator on the back panel. See also various SP boot issues and solutions in Service Processor.

LCD displays "SP booting," then hangs

Use the SP Reset button to reboot the SP. (The SP Reset button is on the back panel.)

 


Log Files

Depending on the functions and features you use, your server can produce these log files:


Machine Check Error

This section describes possible causes of events that are related to Machine Checks, and provides suggested troubleshooting steps.

If a Machine Check error occurs, the System Fault LED is illuminated. Machine Check errors indicate ECC errors (see ECC Errors) or VRM Crowbar events (see VRM Crowbar Assertions). These errors are reported in the system event log (see System Events).

TABLE 3-5 Machine Check Errors

Error

Solution or Reference

[Bus Unit]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Correctable ECC error.]

This error indicates memory ECC errors, with ECC on. See ECC Errors. See DIMM Faults.

[Detected on a scrub.]

Raw data: <data>. This error should occur with a CPU error or a memory error. See DIMM Faults.

Error detected in [Data Cache]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Error IP Valid.]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Error not corrected]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Error occurred at address <address>.]

See DIMM Faults.

[Error reporting disabled.]

The machine check feature has been turned off. For maximum system reliability, leave this option on.

[InstructionCache]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Invalid bank reached]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Load/Store unit]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

Machine Check error detected on cpu <CPU>

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Machine Check in Progress.]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Misc. register contains more info.]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[North Bridge]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Processor state may have been corrupted]

Any specific details that are included with this error message, such as addresses, might be inaccurate, and are unreliable for further troubleshooting.

[Restart IP Valid.]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Second error detected.]

This error indicates a bad CPU. To remedy the problem, replace the CPU.

[Un-correctable ECC error.]

This error indicates memory ECC errors. See ECC Errors. See DIMM Faults.

 


Network Connectivity



Note - Review the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide and the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for detailed information about network connectivity.




Network Share Volume



Note - For detailed information about how to install, upgrade, and manage the Network Share Volume (NSV), see the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide, the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide and the SM Console Online Help. See also Restore Default Settings.




Operating System

For information about installing and updating your server's operating system, see the Sun Fire V20z and Sun Fire V40z Servers--Linux Operating System Installation Guide, the Sun Fire V20z and Sun Fire V40z Servers--Guide for Pre-installed Solaris 10 Operating System, or other operating system vendor-supplied documentation.


Operator Panel



Note - For detailed information about the use of the Operator Panel buttons and other controls, see the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide.



This section provides troubleshooting ideas for problems with the Operator Panel LCD display.

Illuminated, Readable Text, Non-working Buttons

If the LCD is illuminated and readable text displays, but the buttons do not appear to work, there could be a problem with DHCP settings. It is possible that the SP cannot find a DHCP server.

1. Use the SM Console or SM commands to ensure that the SP network is set to DHCP.

2. Reboot the SP.



Note - For solutions to SP problems that cause this symptom, see the SP boot issues in Service Processor.



Illuminated, Unreadable Text

If the LCD is illuminated, but the text is unreadable, check and reseat the cables. If the problem persists, it might indicate that the motherboard is faulty. Replace the motherboard.

Illuminated, No Text

If the LCD is illuminated, but no text displays, one of the following might be the cause.

No Illumination

As noted in Lights, LCD, LED, if the panel is not illuminated, check the cable connections. If all cables are securely seated, other possible causes of this symptom include problems with the LCD, with the Operator Panel assembly, or with the motherboard.


PCI or PCI-X Hot-plugs

If a PCI or PCI-X card malfunctions, follow the guidelines below.

Drivers and OS support for PCI or PCI-X hot-plug functionality - If you have problems with PCI or PCI-X hot-plug functionality, ensure that you have the proper drivers and operating system support on your server, and that you follow the requirements that are described in your server-specific documentation.

Errors with cards in hot-plug slots - If errors occur with cards in hot-plug slots, be sure that you use the AMD HotPlug Control Utility to remove power from the slot before you add or remove any PCI hot-plug devices.

Downloads and installations - Download the latest firmware, Option ROM (OPROM, Option BIOS), and device drivers for your operating system from the card manufacturer's Web site. Install the card's firmware first, then its OPROM, then the drivers.

OPROM enabled - If you install a SCSI card that should display a prompt to press Ctrl-A (or Ctrl-C, or Ctrl-S, or Ctrl-any key) to run the OPROM-based configuration utility, but the prompt never appears during boot time, ensure that the OPROM is not disabled. This problem might be caused by a jumper setting on the board. Press F2 while you boot to run the BIOS Setup utility. From the Advanced menu, select PCI Configuration. Ensure that the OPROM scan is enabled for the card in question. You might receive an error such as:

Expansion ROM not initialized -PCI Mass Storage Controller in slot 3 
Bus:3, Device:02, Function:01

This message indicates that the OPROM is enabled, but the initial size of the OPROM image is too large to fit in the standard OPROM shadow area. This means you cannot boot from the card, and if the card has a boot-time setup utility, you cannot use that functionality. If you disable other OPROMs (in order to free more OPROM shadow space), you might be able to load it. To do this, select PCI Configuration on the Advanced menu of the BIOS Setup utility.



Note - See the BIOS configuration information in the Sun Fire V20z and Sun Fire V40z Servers--User Guide.



Each OPROM image has an initial size when it is first loaded, but is later reduced to a smaller residual size. It might be possible to fit additional OPROMs, if you first load cards with larger initial sizes. To determine initial sizes, see the manufacturer's documentation.

OPROMs are scanned in this order:

1. On-board devices (Video, NICs, SCSI)

2. Physical Slot 1

3. Physical Slot 2

4. Physical Slot 3

5. Physical Slot 6

6. Physical Slot 7

7. Physical Slot 4

8. Physical Slot 5



Note - You can change the boot order from the Boot menu of the BIOS Setup utility, but you cannot change the order of OPROM scans.
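The effect of initial and residual OPROM sizes can be illustrated with a small model. This is a sketch under stated assumptions: it assumes that each image must fit, at its initial size, into whatever shadow space remains, and that it then occupies only its residual size. The shadow area capacity and the card sizes in the example are hypothetical numbers chosen for illustration; only the scan order comes from the list above.

```python
# OPROM scan order from the list above (on-board devices first).
SCAN_ORDER = ["onboard", 1, 2, 3, 6, 7, 4, 5]


def load_oproms(cards: dict, shadow_kb: int):
    """Simulate OPROM loading.

    cards maps a slot (as named in SCAN_ORDER) to (initial_kb, residual_kb).
    Returns (loaded_slots, rejected_slots) in scan order.
    """
    free_kb = shadow_kb
    loaded, rejected = [], []
    for slot in SCAN_ORDER:
        if slot not in cards:
            continue
        initial_kb, residual_kb = cards[slot]
        if initial_kb <= free_kb:
            free_kb -= residual_kb  # image shrinks to its residual size
            loaded.append(slot)
        else:
            rejected.append(slot)   # "Expansion ROM not initialized"
    return loaded, rejected


if __name__ == "__main__":
    # Hypothetical sizes: a 64 KB shadow area and two cards.
    small_first = {1: (32, 24), 2: (56, 16)}
    large_first = {1: (56, 16), 2: (32, 24)}
    print(load_oproms(small_first, 64))  # card in slot 2 does not fit
    print(load_oproms(large_first, 64))  # both cards fit
```

Because the scan order itself cannot be changed, placing the card with the larger initial size in an earlier-scanned slot is the only way to influence the outcome in this model.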




Platform OS Does not Boot

This issue can result from poor cable connections or poorly seated hardware. If your platform OS does not boot, follow the steps below.

1. Verify that AC power is available and that the AC power cord is connected securely to the AC connector on the server's power supply. If you have a server with two power supplies, ensure that both are connected securely. If you have a Sun Fire V20z Server, ensure that the AC switch on the back of the server is in the "on" position.

2. If you have power to the SP but not the platform, power down your server, unplug the AC connector from the wall, and remove the system cover. See the Hardware Components and Service document for instructions about how to remove the system cover.

a. Ensure that the SCSI signal cable, SCSI power cable, and other internal cables are attached securely.

b. Ensure that all DIMMs, DDR VRMs, and CPU VRMs are seated firmly in their respective slots.

c. Remove all PCI option cards from the server.

3. Replace the system cover, reconnect the AC power, and reboot the server.

or

4. Power down the server, disconnect the AC power, and remove the system cover.

5. Re-install one of the PCI option cards.

6. Replace the system cover, reconnect the AC power, and reboot the server.

or

7. Clear the server's CMOS jumper. Follow the procedure outlined in Clear CMOS Jumper.

8. Reboot the server.

or



Note - In versions 2.3 and later, you can set an IPMI boot option parameter to clear the CMOS. This eliminates the need to remove the system cover and move the jumper from the active position to the parked position.




PPCBoot - Bad CRC Error

This error message does not indicate a critical error. The situation that triggers this message occurs only when you connect via the Serial Port, perform a flash update, and disconnect or reset the SP before the PPCBoot update completes.

Immediately after the "Bad CRC Error" message displays, the system retrieves the necessary environment variables and writes them into the appropriate partition. On the next reboot, the error message does not display, unless you once again reset the SP before the PPCBoot update is complete.


Restore Default Settings



Note - Related material is included in Failure to Retain User Accounts and Settings.



If you experience general problems with the SP (or simply wish to restore its original settings), you can use the sp reset to default-settings command to restore selected settings.



Note - You also can use the LCD buttons on the Operator Panel to restore default settings. See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for details.



The SP configuration files are stored in a persistent file system in the /pstore directory. When the SP boots, it checks these files for existing configuration information. By default, the SP will reboot 60 seconds after the sp reset to default-settings command executes, unless you specify the --nowait option, in which case the reboot occurs immediately. A message displays every 20 seconds to indicate that the reboot will occur.

sp reset to default-settings {-a|--all}
[{-c|--config}] [{-n|--network}] [{-s|--ssh}]
[{-u|--users}] [{-W|--nowait}]

For example:

sp reset to default-settings --all

The --all option resets all SP settings to their default configurations, including events and IPMI settings (files are deleted immediately).
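If you script SP maintenance, the command line can be assembled from the options in the syntax shown above. The helper below is a minimal sketch: the function name and its keyword parameters are hypothetical, it only builds the command text and does not run anything on the SP, and it does not check which combinations of options the SP accepts.

```python
def build_reset_command(*, all_settings=False, config=False, network=False,
                        ssh=False, users=False, nowait=False) -> str:
    """Build an 'sp reset to default-settings' command line from documented options."""
    flags = [
        ("--all", all_settings),
        ("--config", config),
        ("--network", network),
        ("--ssh", ssh),
        ("--users", users),
        ("--nowait", nowait),  # reboot immediately instead of after 60 seconds
    ]
    parts = ["sp", "reset", "to", "default-settings"]
    parts += [flag for flag, enabled in flags if enabled]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_reset_command(network=True, users=True))
```

The long option names are used throughout because they are easier to read in scripts than the single-letter forms.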



Note - To reset IPMI settings only, do not use the SP command. Instead, use the IPMI command: ipmi reset. See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for more information about IPMI and about all commands.




SCSI Configuration Utility



Note - For detailed information about how to use the SCSI Configuration Utility that is included with your server, see the Sun Fire V20z and Sun Fire V40z Servers--User Guide.



RAID Properties Menu Item Disabled

To resolve this problem, check these points:

IM Volume Uses Extra SCSI ID

In this situation, the IM volume with two mirrored disks uses an extra SCSI ID off the bus--none of the physical disks of the IM volume has the same SCSI ID as the IM volume--and the configuration utility will not allow the disk at the ID that is currently defined as the volume ID to be configured.

To change the IM volume configuration so that it does not use an extra SCSI ID, but also keeps the same volume ID:

1. Go to the RAID Properties screen. Determine which SCSI ID the primary disk is using and which SCSI ID the volume is using. Also determine the SCSI IDs of the remaining disks of the IM volume.

2. Set the IM volume disks to "No" and save the configuration--break the volume.

3. Return to the RAID Properties screen and reconfigure the IM volume in this manner:

4. To save the configuration, press Esc and follow the on-screen instructions. This creates the IM volume and triggers an automatic resync.

Configuration Utility Disables Selection of Disk

In this situation, the configuration utility does not allow a disk to be selected for an IM volume.

To determine why the disk cannot be selected, press F4 on the RAID Properties screen. The diagnostic codes for each disk display in the Size column. Code definitions are in the table below.

TABLE 3-6 Diagnostic Codes for Disks

Code

Definition

0

Status is good.

1

Could not get serial number from disk.

2

Cannot confirm that disk has SMART capability.

3

Maximum disks have been configured for volume already.

4

Inquiry data that was returned indicates that the disk does not support wide, qtags, or disconnects, or that the sector size is not 512 bytes.

5

User disabled qtags or disconnects for disk on device properties screen.

6

Partitions on disk exceed size that can be mirrored by an already selected secondary or hot spare disk.

7

Disk is not large enough to mirror partitions that are contained on selected primary disk.

8

Hot spare was detected while no IM volume exists. You must delete hot spare and save that configuration.

9

Disk partition uses some or all of the last 32 sectors of the disk (16 Kbytes). The last 32 sectors are required for IR (Integrated RAID) internal processing.

10

Disk has sector size other than 512 bytes.

11

Device is not a compatible device type; must be a non-removable disk.

12

Hot spare is too small to mirror volume.

13

Maximum disks already are configured for volume.
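Code 9 in the table above depends on simple arithmetic: 32 sectors of 512 bytes each is 16 Kbytes at the end of the disk. The sketch below shows how a partition layout could be checked against that reserved area before a disk is selected. The function name is hypothetical, and zero-based sector numbering is an assumption; the configuration utility performs its own check.

```python
SECTOR_BYTES = 512      # sector size required for IM volumes (see codes 4 and 10)
RESERVED_SECTORS = 32   # last 32 sectors are required for Integrated RAID

# 32 sectors * 512 bytes = 16384 bytes = 16 Kbytes
RESERVED_BYTES = RESERVED_SECTORS * SECTOR_BYTES


def partition_hits_reserved_area(disk_sectors: int, first_sector: int,
                                 sector_count: int) -> bool:
    """Return True if the partition uses any of the last 32 sectors of the disk.

    Sectors are assumed to be numbered from 0 to disk_sectors - 1.
    """
    last_sector = first_sector + sector_count - 1
    first_reserved = disk_sectors - RESERVED_SECTORS
    return last_sector >= first_reserved


if __name__ == "__main__":
    # Hypothetical 1000-sector disk: sectors 968 through 999 are reserved.
    print(partition_hits_reserved_area(1000, 0, 968))
    print(partition_hits_reserved_area(1000, 0, 969))
```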

 


Service Processor

This section contains information about problems that are associated with the SP.



Note - For detailed information about how to set up, update, and use the SP, see the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide and the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide.



`Booting SP . . . ' Displays on Operator Panel

If the SP image becomes corrupted, the SP fails to boot and the Operator Panel LCD continuously displays the message: `Booting SP.' If left for a number of minutes, the fault light starts to blink and the SP reset button and the front buttons become inoperable. As a result of this issue, you will be unable to access or configure the SP via the Operator Panel, and the SP cannot monitor or manage the system.

A recovery operation is required. This operation is performed through the Operator Panel after an AC power reset.

1. Set up the Java Update Server according to the procedures in the Server Management Guide. Record the IP address of the server and the port number.

2. Disconnect the system from AC power.

3. Reconnect the system to AC power. The SP will start to boot and the following will be displayed on the front panel:

SP Boot: <3..2..1> sec
Any Key for menu

4. Within three (3) seconds, press the Select (center) button on the Operator Panel to interrupt the SP boot process. After you do this, the Operator Panel LCD displays the following:

Menu:
Update SP?

5. Press the Select button to select the update operation. The following displays in the Operator Panel's LCD:

SP's IP addr:
0.0.0.0

6. Use the buttons on the Operator Panel to specify and enter the SP's IP address, Netmask, and Gateway address, according to the procedures outlined in the Server Management Guide. After you specify the SP's network information, the following is displayed:

Update from IP:
0.0.0.0

7. Specify the IP address and port number of the Java Update Server you set up in Step 1, using the front panel buttons as described above.

8. Confirm the update with the Select (center) button.

The SP update proceeds. You will be able to monitor the update process on the Update Server, as well as on the Operator Panel.



Note - If you do not see output from the Update Server or if the Operator Panel returns to the `Booting SP' state, the SP could not reach the Update Server. Check your network connections and settings and try again.



When the update is complete, the SP should be fully operational.

Continuous Boot of SP

A failure to initialize is usually caused by networking problems that are related to either DHCP addressing or to the NSV server.

Networking issues or general connectivity issues (if external access is enabled) usually cause loss of heartbeat. It can also be caused by intermittent problems on the SP, such as sensor lockup or application failure.



Note - For SP boot hangs, press the SP Reset button on the server's back panel. Also see `Booting SP . . . ' Displays on Operator Panel.



Failure to Boot

The boot mode has probably been altered. Reset the boot defaults by using one of the two procedures below: via the SP, or via a PC attached to the Serial Port.

Via the SP

1. Power down the server, disconnect the AC power cord(s), and remove the system cover.

2. Place a jumper on the TH84 pin set, which is located at the end of the 66 MHz PCI-X slot. (Use the CMOS jumper for this purpose--from J110 or J125--if necessary.)

3. Establish an SSH session to the SP. Create an initial manager account as required, according to the procedures in the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide.

4. To create a service-level account, enter:

access add user -g service -u s -p s3

5. To su (super user) to the service account, enter:

su s

6. To enable the root account, enter:

sp set root on

7. Specify the service account password and new root account password, in response to the prompt. At the $ type prompt, to su to the root account, enter:

su -

8. Specify the root account password you set in Step 7, in response to the next prompt. At the # type prompt, enter:

setenv uboot 0

9. Power down the server, disconnect the AC power, and remove the system cover.

10. Remove jumper TH84.

11. Replace the system cover, reconnect the AC power, and power on the server.

The SP boot should succeed and the LCD should display appropriate text.

Via a PC Attached to the Serial Port

1. Power down the server, disconnect the AC power cord(s), and remove the system cover.

2. Place a jumper on the TH84 pin set, which is located at the end of the 66 MHz PCI-X slot. (Use the CMOS jumper for this purpose--from J110 or J125--if necessary.)

3. Move the jumper at J19 to set the SP output to the Serial Port.

4. Attach a PC to the Serial Port.

5. Replace the system cover and reconnect the AC power cord(s).

6. Power on the server. The Serial Port displays:

Hit any Key to Stop Autoboot = 0.

7. Immediately press the space bar (within the first three seconds of the boot).

8. At the => prompt, type:

saveenv

9. Power down the server, disconnect the AC power cord(s), and remove the system cover.

10. Remove the jumper you placed on pin set TH84.

11. Replace the system cover, reconnect the AC power cord(s), and power on the server.

The SP boot should succeed and the LCD should display appropriate text.

Failure to Boot after Downgrade

If this problem occurs immediately after the SP starts booting, use the Operator Panel to update the flash. Details are in the Sun Fire V20z and Sun Fire V40z Servers--Installation Guide and the Sun Fire V20z and Sun Fire V40z Servers--User Guide.



Note - The sp update flash all command does not update pstore data.



Details about the sp update flash all command are in the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide.

Failure to Retain User Accounts and Settings

Corruption of the flash partition that retains SP state information can cause user accounts and settings to be lost across an SP reboot or an AC power reset. As a result, you must reset the desired settings after every SP reboot. This can occur even though the SP is operational and accessible.

To identify this issue, log on to the SP and enter the mount command. If the partition is corrupted, the output contains no entry for /pstore:

localhost $ mount
/dev/rd/0 on / type ext2 (rw)
none on /dev type devfs (rw)
proc on /proc type proc (rw)
localhost $
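The check above can be scripted. The following is a minimal sketch in POSIX shell that reads mount output on standard input and reports whether a /pstore entry is present. The function name pstore_status is a hypothetical helper, and the sketch assumes that mount lines follow the "device on mountpoint type fstype" layout shown above.

```shell
# pstore_status
# Reads `mount` output on stdin. Prints "mounted" if a line mounts
# something on /pstore, and "missing" otherwise.
pstore_status() {
    if grep -q ' on /pstore '; then
        echo "mounted"
    else
        echo "missing"
    fi
}
```

One way to use it is to save the output of the mount command from an SP session to a file and run `pstore_status < mount-output.txt` on a workstation (the file name is only an example). A result of "missing" corresponds to the symptom described in this section.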

If you experience this problem, perform the following recovery operation through an SSH session.

1. Establish an SSH session to the SP. Create an initial manager account as required, according to the procedures in the Server Management Guide.

2. To create a service-level account, enter:

access add user -g service -u s -p s3

3. To su to the service account, enter:

su s

4. To enable the root account, enter:

sp set root on

5. Specify the service account password and the new root account password, in response to the prompts.

6. To su to the root account, enter:

su -

7. Specify the root account password that you set in Step 5, in response to the prompt.

8. To erase the flash partition intended to contain SP state information, enter:

eraseall /dev/mtd/flashfs

9. To reboot the SP, enter:

sp reboot

After the reboot, the SP should be fully operational.

Mount to Network Share Volume

If you receive a permission error when you attempt to add the SP mount to the NSV, ensure that the remote mount has been granted read/write permission.

Persistent Storage Issues

If you monitor system events via any of the methods available with your server, you might receive error messages about persistent storage issues. It is unusual for the persistent storage area to become full during normal operations. If it becomes full, and if root access was used to place other files into this space, remove those files. Then remove configuration files, as appropriate. For example, use access delete trust, access delete public key, sensor set -R, sp delete event, and so on.

See System Events for a list of system events and troubleshooting suggestions.

See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for information about all the available event-monitoring methods.

SSH Script Hangs

When you use SSH in a script to execute a console command, the {-W|--nowait} option applies as a parameter to SSH, rather than to the command you want to execute. To ensure that SSH returns immediately after the command is executed, use the {-n|--no platform} and the {-f|--forced} SSH options with the {-W|--nowait} option.

For example:

ssh -n -f manager@10.10.20.30 "platform set os state update-bios -i 10.10.100.200 -p 5555 -r LATEST -W"
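The pattern in the example can be wrapped in a small function so that scripts apply it consistently. The following is a minimal sketch in POSIX shell. The function name sp_nowait and the SP_SSH variable are hypothetical; SP_SSH exists only so that the ssh client can be swapped for a stub during a dry run. In OpenSSH, -n redirects standard input from /dev/null and -f sends ssh to the background just before the command runs.

```shell
# sp_nowait USER HOST COMMAND...
# Runs COMMAND on the SP with -W appended, using ssh -n -f so that the
# calling script does not block.
sp_nowait() {
    sp_user=$1
    sp_host=$2
    shift 2
    # The whole SP command, including -W, is passed as ONE quoted argument,
    # so -W reaches the SP command instead of being parsed by ssh.
    ${SP_SSH:-ssh} -n -f "${sp_user}@${sp_host}" "$* -W"
}
```

With this helper, the example above becomes `sp_nowait manager 10.10.20.30 platform set os state update-bios -i 10.10.100.200 -p 5555 -r LATEST`.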

Update Failed

If you attempted to update the SP, but the update failed, verify that the update server has been loaded, and that you have specified the correct IP and the correct port number.

If you attempted to update the BIOS, but the update failed, ensure that you have the correct version of the BIOS image.



Note - See the Sun Fire V20z and Sun Fire V40z Servers--Server Management Guide for detailed information about how to use the update server.




System Events

System events can yield important information about problems or potential problems in the system. You can use any of these methods to monitor system events:

The system provides information that you can use to evaluate the problem. The format and types of information that the system returns vary slightly among the four monitoring methods that are listed above. This information can include:

View events - When a system event occurs, the system fault LED on the front panel blinks. To view the critical event that caused the alert, run the command sp get events.

Reset system fault LED - To reset the system fault LED, you must delete the critical events from the SP event log or clear the log completely.

Clear - To clear the entire event log, run the command sp delete event -a.

Delete specific events - To delete selected events from the log, run the command sp delete event event-id-number.
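When several events have to be removed, the delete commands above can be generated rather than typed one by one. The following is a minimal sketch in POSIX shell that only prints the commands, so that they can be reviewed before they are run on the SP. The function name event_delete_cmds is hypothetical, and the sketch assumes that event IDs are plain decimal numbers; check the IDs that sp get events reports on your system.

```shell
# event_delete_cmds ID...
# Prints one `sp delete event` command per argument. The argument -a
# produces the command that clears the entire log. Anything that is
# neither -a nor a decimal number is reported on stderr and skipped.
event_delete_cmds() {
    for id in "$@"; do
        case $id in
            -a) echo "sp delete event -a" ;;
            ''|*[!0-9]*) echo "skipping unexpected event ID: $id" >&2 ;;
            *) echo "sp delete event $id" ;;
        esac
    done
}
```

For example, `event_delete_cmds 3 7` prints the two commands needed to delete events 3 and 7 (example IDs only).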



Note - Appendix B, "System Events" provides additional event details and specific troubleshooting steps for all possible system events.




Thermal Trip Events

When your CPU experiences a thermal trip, an event is issued that indicates that the platform has been shut down. For example:

CPU 0 has thermally tripped and shut down. Powering off System.

When this condition occurs, the system fault LED on the front panel blinks. To fix this condition:

1. Correct the airflow problem that caused the thermal trip (fan failure, environment became too hot, cover was off too long, and so on).

2. After the system cools, remove all AC power to the system (remove the plug on both power supplies) for 30 seconds.

3. Plug the system in again.

4. Boot the system normally.
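If you keep a saved copy of the event log, the affected CPUs can be picked out of it before you start the steps above. The following is a minimal sketch in POSIX shell. The function name tripped_cpus is hypothetical, and the sketch assumes that the event text appears in the saved log exactly as in the example message above; other fields on the same line are ignored.

```shell
# tripped_cpus
# Reads event text on stdin and prints the number of each CPU that
# reported a thermal trip, one per line, sorted and without duplicates.
tripped_cpus() {
    sed -n 's/.*CPU \([0-9][0-9]*\) has thermally tripped.*/\1/p' | sort -un
}
```

For example, `tripped_cpus < events.txt` lists the CPU numbers found in a saved log (the file name is only an example).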


VRM Crowbar Assertions

VRM crowbar assertions occur when a CPU or a DDR VRM detects a voltage condition or a temperature condition that exceeds the threshold. When this occurs, either the SP or the PRS forcefully shuts down the system. (Usually, the PRS shuts the system down, since the crowbar signal usually causes the VRM to stop confirmation of the "power good" signal).

When the condition clears, the system is allowed to resume power. While the crowbar is asserted, the system fault LED blinks, and the front panel power button, the platform set power command, and the platform set os state command are disabled.



Note - See System Events for more information about power supply and power good signal events. See Machine Check Error for more information about all machine check errors.