Platform Notes: Sun Enterprise 6x00/5x00/4x00/3x00 Systems

Resolving an Over-Temperature Condition

When the CPU over-temperature safeguard (COS) feature detects a CPU over-temperature condition, it takes the CPU offline and powers it off.

The system continues to operate with the offending CPU powered off. The CPUs are the chief source of heat on a CPU/Memory board; removing that heat source lowers the temperature into the normal operating range. This prevents sudden downtime on the production server.

To Resolve an Over-Temperature Condition
  1. Verify the new state with the psrinfo command.

    The psrinfo output reflects the new CPU state:


    0       powered-off since 03/11/97 09:48:31
    1       powered-off since 03/11/97 09:48:31

  2. Without shutting down the operating system, replace the defective power supply (containing cooling fans) with a working unit.


    Note -

    Alternatively, you can halt the server using /etc/halt or init 0 at the superuser prompt before replacing the defective power supply.


  3. Bring the CPU back to normal operation using the psradm command:


    # psradm -n processor_id#

    If the temperature sensor still reports an over-temperature condition (the temperature is still out of range), the CPU over-temperature safeguard feature causes the attempt to bring the CPUs back into operation to fail: the psradm command returns an error message and an exit status of -1.

    If the CPU in question has returned to normal operating temperature, the console displays a message similar to the following:


    NOTICE: CPU/Memory board 0 has cooled down (temperature: 72C), system OK.
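
The procedure above can be sketched as a small shell helper. This is an illustrative sketch, not part of the Solaris software: the function names powered_off_ids and bring_online and the PSRADM override variable are hypothetical, and the parsing assumes psrinfo output in the form shown in step 1 (processor ID first, state second). Because psradm may fail while the board is still too hot, the sketch treats any nonzero exit status as "still offline" rather than testing for one specific value.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Solaris.

# Print the IDs of processors reported as powered-off.
# Reads psrinfo-style lines on stdin: "<id> <state> since <date> <time>".
powered_off_ids() {
    awk '$2 == "powered-off" { print $1 }'
}

# Try to bring each listed processor back online with "psradm -n".
# Set PSRADM to another command (for example PSRADM=true) for a dry run.
# Returns 0 if every processor came online, 1 if any attempt failed.
bring_online() {
    rc=0
    for id in "$@"; do
        if ${PSRADM:-psradm} -n "$id"; then
            echo "processor $id: online"
        else
            # A failure here may mean the temperature is still out of range.
            echo "processor $id: still offline" >&2
            rc=1
        fi
    done
    return $rc
}
```

After replacing the power supply, a superuser could run the helpers as bring_online $(psrinfo | powered_off_ids) and repeat the command later if any processor is reported as still offline.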