Platform Notes: Sun Enterprise 6x00/5x00/4x00/3x00 Systems

Chapter 3 CPU Over-Temperature Safeguard

The CPU over-temperature safeguard (COS) is a Sun Enterprise xx00 platform feature for the Solaris 2.6 software environment and compatible versions. It is available on servers with the proper firmware support. COS ensures that the temperature on any CPU/memory board does not exceed the safe operating range.

COS Requirements

COS is not available if a Sun Enterprise xx00 server lacks enabling firmware. In this case, the system displays the following messages during the boot sequence:


WARNING: Firmware does not support CPU power off
WARNING: Automatic CPU shutdown on over-temperature disabled
WARNING: Firmware does not support CPU restart from power off
WARNING: The ability to restart individual CPUs is disabled

When equipped with the proper firmware, the system displays the following during the boot sequence. Later firmware revisions show similar output.


Board 0:   OBP   3.2.8 1997/02/27 14:00   POST 3.5.1 1997/03/05 09:34 

  1. To check the firmware revision level, use the prtdiag -v command.

    The correct firmware version for COS support is 3.2.8 or later.
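
The version check above can be sketched as a small shell helper. This is only an illustration: the function names are hypothetical, and it assumes that the prtdiag -v output contains a line in the same "OBP x.y.z" form as the boot banner shown earlier, and that a POSIX-style shell, sed, and an awk that accepts -v (nawk on Solaris) are available.

```shell
#!/bin/sh
# Sketch: decide whether an OBP revision meets the 3.2.8 minimum that
# the text gives for COS support. All function names are hypothetical.

# obp_version: read boot-banner or prtdiag-style text on stdin and print
# the first OBP revision found (for example "3.2.8").
obp_version() {
    sed -n 's/.*OBP[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# version_at_least HAVE WANT: succeed when dotted version HAVE >= WANT,
# comparing the three numeric fields left to right.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# cos_firmware_ok: read text on stdin and succeed when the first OBP
# revision in it is 3.2.8 or later.
cos_firmware_ok() {
    rev=`obp_version`
    [ -n "$rev" ] && version_at_least "$rev" 3.2.8
}
```

With these definitions loaded, a check along the lines of "prtdiag -v | cos_firmware_ok && echo COS firmware present" would report whether the first board listed meets the minimum. Only the first OBP line is examined, so a system with mixed board firmware would need each line checked separately.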

Factors in Overheating

Many external conditions can raise the CPU/memory board temperature and compound high-temperature problems.

Some Solaris software environment issues can also affect the CPU temperature, such as bound threads or having only one CPU/memory board in the system. These Solaris software environment issues can cause a fallback to the existing shutdown behavior.

The CPU over-temperature safeguard does not affect the Solaris software environment in any way. COS operates only when the temperature of a CPU/memory board exceeds the safe operating range.

COS Operation

COS functions by monitoring the temperatures of all system CPUs. If a CPU/memory board over-temperature condition occurs, warning messages are displayed on the system console. The following example indicates an over-temperature condition for CPU/memory board 0:


WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system cooling
NOTICE: Processor 0 powered off.
NOTICE: Processor 1 powered off.

Resolving an Over-Temperature Condition

When the COS feature detects a CPU over-temperature condition, it takes the CPU offline and powers it off.

The system continues to operate with the offending CPU powered off. The CPUs are the chief source of heat on a CPU/memory board; removing that heat source lowers the temperature into the normal operating range. This prevents sudden downtime on the production server.

To Resolve an Over-Temperature Condition

  1. Verify the new state with the psrinfo command.

    The psrinfo output reflects the new CPU state:


    0       powered-off since 03/11/97 09:48:31
    1       powered-off since 03/11/97 09:48:31

  2. Without shutting down the operating system, replace the defective power supply (containing cooling fans) with a working unit.


    Note -

    You can also halt the server using /etc/halt or init 0 at the root or superuser prompt before replacing the defective power supply.


  3. Bring the CPU back to normal operation using the psradm command:


    # psradm -n processor_id#

    With the CPU over-temperature safeguard feature, if the temperature sensor again reports an over-temperature condition (the temperature is still out of range), the attempt to bring the CPUs back into operation fails: the psradm command returns an exit status of -1 and an error message.

    If the CPU in question has returned to normal operating temperature, the console displays a message similar to the following.


    NOTICE: CPU/Memory board 0 has cooled down (temperature: 72C), system OK.
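
Steps 1 and 3 of the procedure above can be sketched as a short script. This is a minimal illustration rather than a supported tool: the function names and the PSRINFO and PSRADM override variables are hypothetical, and the script treats any non-zero psradm exit status as the failure case described in step 3.

```shell
#!/bin/sh
# Sketch: find processors that COS powered off and try to restart them.
# Function names and the PSRINFO/PSRADM variables are hypothetical.

# powered_off_ids: read psrinfo output on stdin and print the ID of each
# processor whose state column is "powered-off", one per line.
powered_off_ids() {
    awk '$2 == "powered-off" { print $1 }'
}

# restart_cpus: run "psradm -n" for each powered-off processor and
# report the result. Returns 1 if any processor could not be restarted.
# PSRINFO and PSRADM default to the real commands; they exist only so
# the logic can be exercised on a machine without those commands.
restart_cpus() {
    failed=0
    for id in `${PSRINFO:-psrinfo} | powered_off_ids`; do
        if ${PSRADM:-psradm} -n "$id"; then
            echo "processor $id: back online"
        else
            echo "processor $id: psradm failed, board may still be too hot"
            failed=1
        fi
    done
    return $failed
}
```

Run as root after the cooling fault has been repaired, restart_cpus would attempt each powered-off processor in turn. A failure most likely means the board is still out of range, so it is worth waiting for the "has cooled down" console message before retrying.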

Failure to Disengage CPUs

In some instances, the CPU power control cannot disengage the affected CPU(s) from the Solaris software environment. For example, if the high-temperature condition occurs when the system contains only one CPU/memory board with two processors, processor 1 does not go offline, because it is the last processor in the system.
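
The last-processor case can be checked in advance with a sketch like the following. The function names are hypothetical, and the sketch assumes that psrinfo reports running processors with the state "on-line", a state that does not appear in the output shown in this chapter. It does not detect the bound-thread case.

```shell
#!/bin/sh
# Sketch: would taking one processor offline still leave one running?
# Function names are hypothetical; "on-line" is an assumed state label.

# online_count: read psrinfo output on stdin and print how many
# processors are in the "on-line" state.
online_count() {
    awk '$2 == "on-line" { n++ } END { print n + 0 }'
}

# can_offline_one: read psrinfo output on stdin and succeed when more
# than one processor is on-line.
can_offline_one() {
    [ "`online_count`" -gt 1 ]
}
```

For example, "psrinfo | can_offline_one || echo last processor" would flag a system that is already down to a single running processor.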

Failure to Power Off CPUs

If the attempted de-coupling of the problem CPU from the Solaris software environment fails, the temperature may continue to increase. When the temperature reaches the hard upper operational temperature limit, the system shuts down. In this case, a message similar to the following is displayed:


WARNING: CPU/Memory board 0 is very hot (temperature: 83C)
WARNING: System shutdown scheduled in 20 seconds due to over-temperature condition on CPU/Memory board 0
WARNING: CPU/Memory board 0 still too hot (temperature: 83C). Overtemp shutdown started
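
The console messages in this chapter follow a regular enough pattern to be picked out of a saved console log. The sketch below is illustrative only: the function names and stage labels are hypothetical, and the patterns are taken solely from the example messages shown in this chapter, so other firmware or software revisions may word them differently.

```shell
#!/bin/sh
# Sketch: classify COS console messages and extract their readings.
# Function names and stage labels are hypothetical.

# overtemp_stage MESSAGE: print a short label for the stage that the
# given console message reports.
overtemp_stage() {
    case "$1" in
        *"Overtemp shutdown started"*) echo shutdown-started ;;
        *"System shutdown scheduled"*) echo shutdown-scheduled ;;
        *"is very hot"*)               echo very-hot ;;
        *"is warm"*)                   echo warm ;;
        *"has cooled down"*)           echo cooled ;;
        *)                             echo other ;;
    esac
}

# overtemp_reading: read console messages on stdin and print
# "BOARD TEMPERATURE" for each line that carries a
# "(temperature: NNC)" reading. Other lines produce no output.
overtemp_reading() {
    sed -n 's/.*CPU\/Memory board \([0-9][0-9]*\).*(temperature: \([0-9][0-9]*\)C).*/\1 \2/p'
}
```

The more specific shutdown patterns are tested first so that a line such as the final "still too hot ... Overtemp shutdown started" message is labeled by its most severe meaning.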