CHAPTER 4
Hardware Issues
This chapter describes hardware issues related to the Sun Fire X4600 M2 server.
The following are current issues.
Most systems have either dual-core CPUs or quad-core CPUs. However, some systems support a combination of four dual-core CPUs and four quad-core CPUs.
These systems require a 950W power supply.
Before you upgrade a system with dual-core CPUs by adding quad-core CPUs, you must ensure that it has a 950W power supply.
To check the rating of the power supply, shut the system down properly, pull the power supply out of the system, and look at the wattage rating on its side.
For instructions, see the Sun Fire X4600 and Sun Fire X4600 M2 Service Manual.
If your system is configured with six or eight 120W CPUs, power supply redundancy is reduced to 3+1.
If you are not certain what kind of CPUs your system has, see your system handbook.
The diagram that shows the DIMM numbering sequence in the Sun Fire X4600 and Sun Fire X4600 M2 Server Diagnostics Guide is incorrect. It shows the sequence as 0-1-2-3-CPU.
The correct sequence is 3-2-1-0-CPU.
The sequence is shown correctly in the Sun Fire X4600 and Sun Fire X4600 M2 Server Service Manual.
The Sun Fire X4600 M2 server supports 2-socket and 6-socket configurations.
Starting with CPLD Version 8, power supply redundancy is increased to 2+2.
Systems with CPLD versions 7 and lower have 3+1 power supply redundancy.
Look at the POST messages to find the CPLD level.
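The CPLD rule above can be written as a small lookup. This is a minimal sketch, not a Sun-provided tool: the function name `power_supply_redundancy` is hypothetical, and the CPLD version is assumed to have been read manually from the POST messages, because their exact format is not given here. The sketch covers only the CPLD rule and does not model the separate note about systems with six or eight 120W CPUs.

```python
def power_supply_redundancy(cpld_version: int) -> str:
    """Return the redundancy scheme implied by the CPLD version.

    Hypothetical helper: encodes only the rule stated in the text
    (CPLD version 8 and later: 2+2; version 7 and lower: 3+1).
    """
    if cpld_version < 0:
        raise ValueError("CPLD version must be non-negative")
    # Version 8 is the first version with 2+2 redundancy.
    return "2+2" if cpld_version >= 8 else "3+1"
```

For example, a system whose POST messages report CPLD version 7 would be classified as 3+1.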
For optimum performance, all DIMMs controlled by a given CPU should have the same capacity and should be either all single-rank or all dual-rank. Mixed configurations are supported, but could result in lower memory performance. Note that all supported 4GB and 8GB DIMMs are dual-rank. For 1GB and 2GB DIMMs, you can identify the type by counting the DRAMs: single-rank DIMMs have 18 DRAMs, while dual-rank DIMMs have 36 DRAMs.
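The identification rule above can be sketched in a few lines, assuming you have recorded the capacity and DRAM count of each DIMM by hand. The `Dimm` class and the `rank_of` and `is_uniform` functions are hypothetical names used only for illustration; they are not part of any Sun utility.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Dimm:
    capacity_gb: int  # 1, 2, 4, or 8
    dram_count: int   # DRAM chips counted on the module


def rank_of(dimm: Dimm) -> str:
    """Classify a DIMM as 'single' or 'dual' rank using the rules in the text."""
    # All supported 4GB and 8GB DIMMs are dual-rank.
    if dimm.capacity_gb in (4, 8):
        return "dual"
    # For 1GB and 2GB DIMMs, 18 DRAMs means single-rank, 36 means dual-rank.
    if dimm.capacity_gb in (1, 2):
        if dimm.dram_count == 18:
            return "single"
        if dimm.dram_count == 36:
            return "dual"
    raise ValueError(f"cannot classify {dimm}")


def is_uniform(dimms: Sequence[Dimm]) -> bool:
    """True if all DIMMs for one CPU share the same capacity and the same rank."""
    capacities = {d.capacity_gb for d in dimms}
    ranks = {rank_of(d) for d in dimms}
    return len(capacities) <= 1 and len(ranks) <= 1
```

A result of `False` from `is_uniform` indicates a mixed configuration, which is supported but could result in lower memory performance.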
This problem occurs only on systems with the new enhanced quad-core CPU. The system may experience a sync flood error at boot time, just before the POST messages would have been displayed. This causes an immediate reboot, so it is easy to miss the fact that an unsuccessful boot occurred.
The most obvious symptom is a loss of memory capacity due to spurious DIMM errors. Unrecoverable errors such as the following appear in the system event log.
Performing a warm boot (rebooting the OS without powering down the system) does not clear the condition.
Power the system down and reboot. You may need to do this more than once.
The tested configurations for the Sun Fire X4600 M2 server do not include systems with both quad-core processors and 1 GB DIMMs. Such configurations may work, but are not supported.
Spontaneous reboots can occur with no memory errors reported. This problem is associated with an entry in the System Event Log (SEL) labeled “OEM #0x12” followed by a series of entries labeled “OEM record e0”. Here is an example:
6502 | 12/22/2007 | 07:41:21 | OEM #0x12 | | Asserted
6602 | OEM record e0 | 00000000040f0c0200100000f2
6702 | OEM record e0 | 01000000040000000000000000
The hexadecimal values in “OEM record e0” entries may differ from those in the example.
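If you have saved the SEL as text, a short script can look for this signature. The following is a minimal sketch that assumes each SEL entry is on its own line with fields separated by `|`, as in the example above. The function name `find_oem_reboot_signatures` is hypothetical, and the script only matches the labels; it does not interpret the hexadecimal payloads.

```python
# Sample entries copied from the example in the text, one entry per line.
SAMPLE_SEL = [
    "6502 | 12/22/2007 | 07:41:21 | OEM #0x12 | | Asserted",
    "6602 | OEM record e0 | 00000000040f0c0200100000f2",
    "6702 | OEM record e0 | 01000000040000000000000000",
]


def find_oem_reboot_signatures(sel_lines):
    """Return a list of (entry_id, payloads) tuples, one for each
    'OEM #0x12' entry that is directly followed by at least one
    'OEM record e0' entry."""
    matches = []
    current = None  # (entry_id, payloads) for the OEM #0x12 entry being tracked

    def flush():
        # Keep the tracked entry only if it collected at least one e0 record.
        if current is not None and current[1]:
            matches.append(current)

    for line in sel_lines:
        fields = [f.strip() for f in line.split("|")]
        if "OEM #0x12" in fields:
            flush()
            current = (fields[0], [])
        elif current is not None and len(fields) >= 3 and fields[1] == "OEM record e0":
            current[1].append(fields[2])
        else:
            # Any other entry ends the current run.
            flush()
            current = None
    flush()
    return matches
```

Running the function on `SAMPLE_SEL` returns one match for entry 6502 with its two record payloads. An empty result means the signature was not found in the lines supplied.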
If this problem occurs on your system, you should take basic steps to eliminate possible causes, as described below. If you take these steps, and the problem continues to occur, contact Sun Support for additional remediation.
Remedial Steps
The following steps may eliminate the problem. Retest the system after each step:
1. Verify that the system BIOS, other system firmware, and PCI card firmware are up to date. For information on verifying and updating system BIOS and firmware, consult the Software Release Notes.
2. Remove all PCI cards and CPU modules. Use canned compressed air to clean their contacts and slots, then reseat them.
3. If PCI cards have been added to the system recently, try rearranging the PCI cards. Note any change in symptoms after the cards are rearranged.
A system does not boot if the PCI slots contain 6 Sun Quad Gigabit Ethernet (QGE) cards. Test systems have run reliably with 5 QGE cards.
During a lengthy stress test, a USB mouse and keyboard both stopped responding. At the time, the floppy and DVD were both redirected to physical USB drives.
Disconnect and reconnect the mouse and keyboard.
USB ports might become disabled during operation. This appears to be a hardware problem related to the Nvidia USB controller.
When this happens, the device attached to the USB port becomes inactive. A message similar to the following is reported in the file:
[ID 691482 kern.warning] WARNING: /pci@0,0/pci108e,cb84@2 (ohci0): Connecting device on port 1 failed
Reboot the server to re-enable the USB ports.
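To check whether this condition has occurred, you can scan saved log text for the warning shown above. The following is a minimal sketch: the function name `find_failed_usb_ports` is hypothetical, and the regular expression is derived only from the single sample message, so other message variants might not match. The caller supplies the log lines, because the log file is not named here.

```python
import re

# Pattern based on the one sample message in the text (an assumption:
# real messages may vary in device path, driver name, and port number).
USB_FAILURE = re.compile(
    r"WARNING: (?P<device>\S+) \((?P<driver>\w+)\): "
    r"Connecting device on port (?P<port>\d+) failed"
)


def find_failed_usb_ports(log_lines):
    """Return (device_path, driver, port) for each matching warning line."""
    failures = []
    for line in log_lines:
        m = USB_FAILURE.search(line)
        if m:
            failures.append((m.group("device"), m.group("driver"), int(m.group("port"))))
    return failures
```

A non-empty result indicates that at least one USB port reported a connection failure, in which case rebooting the server re-enables the ports.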
A test system configured with 16 RAID storage arrays connected to 8 PCI HBAs panicked and rebooted during tests involving a heavy I/O workload. The problem could not be reproduced with fewer than 16 arrays, and it occurred only when the read size was 256K or more.
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.