Platform Notes: Sun Enterprise 6x00/5x00/4x00/3x00 Systems

COS Operation

The CPU over-temperature safeguard (COS) feature works by monitoring the temperatures of all system CPUs. Warning messages are displayed on the system console if a CPU/memory board over-temperature condition occurs. The following example indicates an over-temperature condition for CPU/memory board 0:


WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system cooling
NOTICE: Processor 0 powered off.
NOTICE: Processor 1 powered off.
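
As an illustration only, warnings of this kind can be picked out of saved console output with a small shell filter. The function name cos_warm_boards is hypothetical, and the pattern is based solely on the example message above, so other releases may word the message differently.

```shell
# Hypothetical helper: read console or log text on stdin and print
# "<board> <temperature>" for every "is warm" warning found.
# The pattern mirrors the example message shown above (an assumption
# about the exact wording on any given release).
cos_warm_boards() {
    sed -n 's/^WARNING: CPU\/Memory board \([0-9][0-9]*\) is warm (temperature: \([0-9][0-9]*\)C).*$/\1 \2/p'
}
```

If console messages are also saved to a log file on your system (an assumption), the filter could be run as cos_warm_boards < logfile.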

Resolving an Over-Temperature Condition

When the COS feature detects a CPU over-temperature condition, it takes the CPU offline and powers it off.

The system continues to operate with the offending CPU powered off. The CPUs are the chief source of heat on a CPU/Memory board; removing that heat source lowers the temperature into the normal operating range. This prevents sudden downtime for the production server.

To Resolve an Over-Temperature Condition
  1. Verify the new state with the psrinfo command.

    The psrinfo output reflects the new CPU state:


    0       powered-off since 03/11/97 09:48:31
    1       powered-off since 03/11/97 09:48:31

  2. Without shutting down the operating system, replace the defective power supply (which contains the cooling fans) with a working unit.


    Note -

    Alternatively, you can halt the server by running /etc/halt or init 0 at the superuser (root) prompt before replacing the defective power supply.


  3. Bring the CPU back to normal operation using the psradm command:


    # psradm -n processor_id#

    If the temperature sensor still reports an over-temperature condition (the temperature is still out of range), the CPU over-temperature safeguard feature causes the attempt to bring the CPUs back into operation to fail: the psradm command returns an error message and an exit status of -1.

    If the CPU in question has returned to normal operating temperature, the console displays a message similar to the following:


    NOTICE: CPU/Memory board 0 has cooled down (temperature: 72C), system OK.
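
The verification and recovery steps above can be sketched as a pair of shell functions. This is a minimal sketch, not a supported tool: the function names and the PSRADM override variable are hypothetical, and the parsing assumes psrinfo output in the format shown in step 1.

```shell
# List the IDs of processors that psrinfo reports as powered-off.
# Reads psrinfo-style output on stdin (format as shown in step 1).
powered_off_ids() {
    awk '$2 == "powered-off" { print $1 }'
}

# Try to bring each given processor back into operation with "psradm -n".
# PSRADM is a hypothetical override so the command can be stubbed out
# when trying the sketch on a machine without psradm.
# Returns non-zero if any processor could not be brought back.
restore_cpus() {
    failed=0
    for id in "$@"; do
        if ${PSRADM:-psradm} -n "$id"; then
            echo "processor $id: brought back into operation"
        else
            echo "processor $id: psradm failed (board may still be too hot)" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example use as superuser, after the cooling fault has been repaired:
#   restore_cpus $(psrinfo | powered_off_ids)
```

Checking the exit status of psradm, rather than assuming success, matters here because the command is expected to fail while the board is still out of range.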

Failure to Disengage CPUs

In some instances, the CPU power control cannot disengage the affected CPU(s) from the Solaris software environment. For example, if the over-temperature condition occurs when the system contains only one CPU/memory board with two processors, processor one cannot go offline because it is the last processor in the system.
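
To illustrate the "last processor" case, the following sketch counts the processors that psrinfo reports as on-line. The function names are hypothetical, and treating only the on-line state as "running" is a simplifying assumption made for this example.

```shell
# Count processors reported as on-line in psrinfo-style output on stdin.
# Prints 0 when no on-line processor is listed.
online_count() {
    awk '$2 == "on-line" { n++ } END { print n + 0 }'
}

# Succeeds (exit status 0) only if more than one processor is on-line,
# that is, if one could be taken offline and another would remain.
can_spare_a_cpu() {
    [ "$(online_count)" -gt 1 ]
}

# Example use:
#   psrinfo | can_spare_a_cpu || echo "only one processor is on-line"
```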

Failure to Power Off CPUs

If the attempt to decouple the problem CPU from the Solaris software environment fails, the temperature may continue to rise. When the temperature reaches the hard upper operating limit, the system shuts down. In this case, a message similar to the following is displayed:


WARNING: CPU/Memory board 0 is very hot (temperature: 83C)
WARNING: System shutdown scheduled in 20 seconds due to over-temperature condition on CPU/Memory board 0
WARNING: CPU/Memory board 0 still too hot (temperature: 83C). Overtemp shutdown started
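
The console messages shown in this section fall into a few stages, which a simple filter can label when reviewing saved console output. This is a sketch only: the function name cos_classify and the labels are hypothetical, and the match strings are taken from the example messages in this section.

```shell
# Label each COS-related console line read from stdin.
# Lines that match none of the patterns are ignored.
# Labels (hypothetical): warning, critical, shutdown, recovered.
cos_classify() {
    while IFS= read -r line; do
        case $line in
            *"System shutdown scheduled"*|*"Overtemp shutdown started"*)
                echo "shutdown" ;;
            *"is very hot"*)
                echo "critical" ;;
            *"is warm"*)
                echo "warning" ;;
            *"has cooled down"*)
                echo "recovered" ;;
        esac
    done
}
```

The shutdown patterns are tested first so that a line such as "still too hot ... Overtemp shutdown started" is reported as a shutdown rather than as a temperature warning.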