The CPU over-temperature safeguard (COS) is a Sun Enterprise xx00 platform feature for the Solaris 2.6 software environment and compatible versions. It is available on servers with the proper firmware support. COS ensures that the temperature on any CPU/memory board does not exceed the safe operating range.
COS is not available if a Sun Enterprise xx00 server lacks enabling firmware. In this case, the system displays the following messages during the boot sequence:
WARNING: Firmware does not support CPU power off
WARNING: Automatic CPU shutdown on over-temperature disabled
WARNING: Firmware does not support CPU restart from power off
WARNING: The ability to restart individual CPUs is disabled
When equipped with the proper firmware, the system displays the following during the boot sequence. Later firmware shows similar output.
Board 0: OBP 3.2.8 1997/02/27 14:00 POST 3.5.1 1997/03/05 09:34
To check the firmware revision level, use the prtdiag -v command.
The correct firmware version for COS support is 3.2.8 or later.
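The version check described above can be sketched in Python. This is an illustrative sketch only: the helper names `obp_version` and `supports_cos` are hypothetical, and the sketch assumes the OBP revision appears in a line of the form shown in the boot example.

```python
import re

# Minimum OBP revision for COS support, as stated in the text.
MIN_OBP = (3, 2, 8)

def obp_version(banner_line):
    """Return the OBP revision found in a line as a tuple of ints, or None."""
    match = re.search(r"OBP\s+(\d+(?:\.\d+)*)", banner_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

def supports_cos(banner_line):
    """True if the line reports an OBP revision of 3.2.8 or later."""
    version = obp_version(banner_line)
    return version is not None and version >= MIN_OBP

line = "Board 0: OBP 3.2.8 1997/02/27 14:00 POST 3.5.1 1997/03/05 09:34"
print(obp_version(line), supports_cos(line))
```

Comparing tuples of integers rather than strings keeps a revision such as 3.2.10 ordered after 3.2.8.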
Many external conditions can raise the CPU/memory board temperature and compound high temperature problems, including:
Room air-conditioning set incorrectly
Lateral cooling obstructed
Some Solaris software environment issues can also affect the CPU temperature, such as bound threads or having only one CPU/memory board in the system. These Solaris software environment issues can cause a fallback to the existing shutdown behavior.
The CPU over-temperature safeguard does not affect the Solaris software environment in any way. COS operates only when the temperature of a CPU/memory board exceeds the safe operating range.
COS functions by monitoring the temperatures of all system CPUs. Warning messages are displayed in the system console if a CPU/memory board over-temperature condition occurs. The following example indicates an over-temperature condition for CPU/memory board 0:
WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system cooling
NOTICE: Processor 0 powered off.
NOTICE: Processor 1 powered off.
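A log-watching script could pick these messages out of captured console output. The following is a minimal sketch, assuming the messages keep the wording shown in the example above. The function name `summarize` and both patterns are hypothetical and are not part of any Sun tool.

```python
import re

# Patterns modeled on the example console messages; the wording is assumed stable.
WARM_RE = re.compile(r"CPU/Memory board (\d+) is warm \(temperature: (\d+)C\)")
OFF_RE = re.compile(r"Processor (\d+) powered off")

def summarize(console_lines):
    """Return (warm boards as {board: temperature in C}, powered-off processor IDs)."""
    warm = {}
    powered_off = []
    for text in console_lines:
        m = WARM_RE.search(text)
        if m:
            warm[int(m.group(1))] = int(m.group(2))
        m = OFF_RE.search(text)
        if m:
            powered_off.append(int(m.group(1)))
    return warm, powered_off

sample = [
    "WARNING: CPU/Memory board 0 is warm (temperature: 73C). Please check system cooling",
    "NOTICE: Processor 0 powered off.",
    "NOTICE: Processor 1 powered off.",
]
warm_boards, off_cpus = summarize(sample)
```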
When the COS feature detects a CPU over-temperature condition, it takes the CPU offline and powers it off.
The system continues to operate with the offending CPU powered off. The CPUs are the chief source of heat on a CPU/memory board; removing that heat source lowers the temperature into the normal operating range. This prevents sudden downtime on the production server.
Verify the new state with the psrinfo command.
The psrinfo output reflects the new CPU state:
0 powered-off since 03/11/97 09:48:31
1 powered-off since 03/11/97 09:48:31
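To check processor states from a script, the psrinfo output can be split into an ID-to-state mapping. This is a minimal sketch that assumes each line begins with the processor ID followed by the state, as in the example above; `parse_psrinfo` is a hypothetical helper name.

```python
def parse_psrinfo(output):
    """Map processor ID to state for lines shaped like '<id> <state> since ...'."""
    states = {}
    for text in output.splitlines():
        fields = text.split()
        # Skip blank or unexpected lines rather than guessing at their meaning.
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states

sample_output = (
    "0 powered-off since 03/11/97 09:48:31\n"
    "1 powered-off since 03/11/97 09:48:31\n"
)
states = parse_psrinfo(sample_output)
powered_off = sorted(cpu for cpu, state in states.items() if state == "powered-off")
```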
Without shutting down the operating system, replace the defective power supply (containing cooling fans) with a working unit.
You can also halt the server using /etc/halt or init 0 at the root or superuser prompt before replacing the defective power supply.
Bring the CPU back to normal operation using the psradm command:
# psradm -n processor_id#
With the CPU over-temperature safeguard feature, if the temperature sensor again reports an over-temperature condition (the temperature is still out of range), the attempt to bring the CPUs back into operation with the psradm command fails. The command returns an exit status of -1 and an error message.
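A script that restarts processors should check whether psradm succeeded before assuming the CPU is back online. The sketch below is an assumption-laden illustration: `bring_online` and its `run` parameter are hypothetical names, and it treats any nonzero exit status as failure rather than testing for one specific value.

```python
import subprocess

def bring_online(processor_id, run=subprocess.run):
    """Run `psradm -n <id>` and return (succeeded, error_text).

    `run` is injectable so the logic can be exercised without a Solaris host.
    """
    result = run(
        ["psradm", "-n", str(processor_id)],
        capture_output=True,
        text=True,
    )
    # Any nonzero status is treated as failure (for example, board still too hot).
    return result.returncode == 0, result.stderr.strip()
```

On failure, the caller can wait for the board to cool and retry, rather than looping immediately.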
If the CPU in question has returned to normal operating temperature, the console displays a message similar to the following.
NOTICE: CPU/Memory board 0 has cooled down (temperature: 72C), system OK.
In some instances, the CPU power control cannot disengage the affected CPU(s) from the Solaris software environment. For example, if the high temperature condition occurs when only one CPU/memory board with two processors is in the system, processor one will not go offline because it is the last processor in the system.
If the attempted de-coupling of the problem CPU from the Solaris software environment fails, the temperature may continue to increase. When the temperature reaches the hard upper operational temperature limit, the system shuts down. In this case, a message similar to the following is displayed:
WARNING: CPU/Memory board 0 is very hot (temperature: 83C)
WARNING: System shutdown scheduled in 20 seconds due to over-temperature condition on CPU/Memory board 0
WARNING: CPU/Memory board 0 still too hot (temperature: 83C). Overtemp shutdown started