CHAPTER 2

Environmental Monitoring

The Netra CP2500 board uses an intelligent fault detection environmental monitoring system that increases uptime and manageability of the board. The system management controller (SMC) module on the Netra CP2500 supports the temperature and voltage environmental monitoring functions. This chapter describes the specific environmental monitoring functions of the Netra CP2500.



Environmental Monitoring Component Compatibility

TABLE 2-1 lists the compatible environmental monitoring hardware, OpenBoot PROM, and Solaris OS for the Netra CP2500.


TABLE 2-1 Compatible Environmental Monitoring Components

Component         Environmental Monitoring Compatibility
----------------  -------------------------------------------------------
Hardware          Board supports environmental monitoring
OpenBoot PROM     Environmental monitoring is supported by OpenBoot PROM.
Operating system  Solaris 9 9/05 OS or subsequent compatible versions



Typical Environmental Monitoring System Application

FIGURE 2-1 illustrates the Netra CP2500 environmental monitoring application block diagram. For locations of the temperature sensors, see FIGURE 2-2.


FIGURE 2-1 Typical Environmental Monitoring Application Block Diagram

Diagram shows the transition card linking the I2C external bus to the Netra CP2500 board (left); the power bus links other boards (middle), and power supply (right).


 

The Netra CP2500 monitors its CPU diode temperature and issues warnings at both the OpenBoot PROM and Solaris OS levels when these environmental readings are out of limits. At the Solaris OS level, the application program monitors and issues warnings for the board. At the OpenBoot PROM level, the CPU diode temperature is monitored.


Typical Cycle From Power Up to Shutdown

This section describes a typical environmental monitoring cycle from power up to shutdown.

Environmental Monitoring Protection at the OpenBoot PROM

The OpenBoot PROM monitors the CPU diode temperature at a fixed polling interval of 10 seconds and displays warning messages on the default output device whenever the measured temperature exceeds the preprogrammed warning temperature or the critical temperature. These thresholds have defaults set by the SMC and cannot be changed for OpenBoot PROM-level monitoring.

OpenBoot PROM-level protection is always enabled and cannot be disabled. If the board temperature exceeds the shutdown temperature, the SMC shuts down power to the Netra CP2500 CPU. The OpenBoot PROM sends a warning or critical temperature message to the user when the Netra CP2500 is overheating.

Environmental Monitoring Protection at the Operating System Level

Monitoring changes in the sensor temperatures can help identify problems with the room where the system is installed, functional problems with the system, or problems on the board. Baseline temperatures established early in deployment and operation can be used to trigger alarms if the sensor temperatures later increase or decrease dramatically. If all the sensors go to room ambient, power has probably been lost to the host system. If one or more sensors rise in temperature substantially, there might be a system fan malfunction, the system cooling might have been compromised, or room air conditioning might have failed.
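The baseline comparison described above can be sketched in C as follows. This is only an illustration and is not part of the Netra CP2500 software: the names sensor_set_t, diag_t, and diagnose_readings, the 3 degree "near ambient" margin, and the 10 degree "substantial rise" margin are all assumptions made for the example and would need tuning for a real installation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for one set of readings, in degrees C. */
typedef struct {
        int cpu;
        int inlet;
        int exhaust;
} sensor_set_t;

/* Hypothetical diagnosis codes used only by this sketch. */
typedef enum {
        DIAG_NORMAL,            /* readings are close to the baseline */
        DIAG_POWER_LOSS,        /* every sensor reads near room ambient */
        DIAG_COOLING_FAULT      /* at least one sensor rose substantially */
} diag_t;

#define AMBIENT_MARGIN  3       /* assumed: within 3 C counts as ambient */
#define RISE_MARGIN     10      /* assumed: a 10 C rise is substantial */

diag_t
diagnose_readings(const sensor_set_t *base, const sensor_set_t *now,
    int room_ambient)
{
        assert(base != NULL && now != NULL);

        /* All sensors at room ambient suggests the host lost power. */
        if (abs(now->cpu - room_ambient) <= AMBIENT_MARGIN &&
            abs(now->inlet - room_ambient) <= AMBIENT_MARGIN &&
            abs(now->exhaust - room_ambient) <= AMBIENT_MARGIN)
                return (DIAG_POWER_LOSS);

        /* A large rise on any sensor suggests a fan or cooling problem. */
        if (now->cpu - base->cpu >= RISE_MARGIN ||
            now->inlet - base->inlet >= RISE_MARGIN ||
            now->exhaust - base->exhaust >= RISE_MARGIN)
                return (DIAG_COOLING_FAULT);

        return (DIAG_NORMAL);
}
```

The baseline would be captured once the system has reached a steady state, and the current readings would be passed in on each polling cycle.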

Protection at the operating system level takes place when the PICL environmental monitoring program (envmond) is running. The environmental monitoring program is part of a UNIX daemon that runs automatically when the Solaris OS boots up.

In a typical environmental monitoring application program, the software reads the CPU, inlet, and exhaust temperature sensors once every polling cycle. The program then compares the measured CPU diode temperature with the warning temperature and displays a warning message on the default output device whenever the warning temperature is exceeded.

The program can also issue a shutdown message on the default output device whenever the measured CPU diode temperature exceeds the shutdown temperature. In addition, the envmond application program can be programmed to sync and shut down the Solaris OS when conditions warrant.
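The per-cycle comparison can be sketched as a small pure function. This is a minimal illustration rather than the envmond implementation: the names temp_limits_t, temp_level_t, and classify_temperature are hypothetical, and the sketch assumes that "exceeds" means strictly greater than the threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the three high thresholds, in degrees C. */
typedef struct {
        int high_warning;
        int high_shutdown;
        int high_poweroff;
} temp_limits_t;

/* Hypothetical severity levels, ordered from least to most severe. */
typedef enum {
        TEMP_OK,
        TEMP_WARNING,
        TEMP_SHUTDOWN,
        TEMP_POWEROFF
} temp_level_t;

/*
 * Compare one CPU diode reading with the limits.
 * The most severe limit is checked first so that a reading above
 * several thresholds reports only the highest level reached.
 */
temp_level_t
classify_temperature(const temp_limits_t *limits, int temp)
{
        assert(limits != NULL);

        if (temp > limits->high_poweroff)
                return (TEMP_POWEROFF);
        if (temp > limits->high_shutdown)
                return (TEMP_SHUTDOWN);
        if (temp > limits->high_warning)
                return (TEMP_WARNING);
        return (TEMP_OK);
}
```

A monitoring loop would call classify_temperature() once per polling cycle and print a warning or shutdown message, or start an orderly shutdown, according to the level returned.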

Refer to Sample Application Program for an example of how a simple envmond program can be implemented.

The power module is controlled by the SMC subsystem, except for automatic controls such as overcurrent shutdown or voltage regulation. The functions controlled are core voltage output level, and power sequencing and monitoring.

Post Shutdown Recovery

The on-board voltage controller is a hardware function that is not controlled by either firmware or software. At the OpenBoot PROM level, if the board temperature exceeds the shutdown temperature, the SMC will shut down power to the Netra CP2500 CPU.

There is no mechanism for the Solaris OS to either recover or restore power to the Netra CP2500 when an unusual condition occurs, for example, if the CPU diode temperature exceeds its maximum recommended level. In either case, the end user must intervene and manually recover the Netra CP2500 as well as the system through hardware control. Once a shutdown has occurred, you can recover the board using a cold-reset IPMI command to SMC or by extracting and reinserting the board.


Hardware Environmental Monitoring Functions

This section summarizes the hardware environmental monitoring features on the Netra CP2500 board. TABLE 2-2 lists the environmental monitoring functions on a Netra CP2500 board.


TABLE 2-2 Typical Netra CP2500 Board Hardware Environmental Monitoring Functions

Function                       Capability
-----------------------------  ----------------------------------------------
Board Exhaust Air Temperature  Senses the air temperature at the trailing edge of the board. Assumes air direction from the PMC slots toward the processor/heatsink.
CPU Diode Temperature          Senses a diode temperature in the processor junction.
Board Inlet Air Temperature    Senses the air temperature at the leading edge of the board under the solder-side cover. Assumes air direction from the PMC slots toward the processor/heatsink.


TABLE 2-3 shows the I2C components.


TABLE 2-3 I2C Components

Component  Function
---------  ----------------------------------------------
DS80CH11   SMC I2C controller - IPMB
PCF9545    4 channel I2C multiplexor
AT24C64    I2C EEPROM - motherboard FRUID
AT24C01    I2C EEPROM - RTM FRUID and external I2C header
ADM1026    System monitor and general purpose I/O
AT24C64    I2C EEPROM - NVRAM/Ethernet MAC ID
AT24Cxx    I2C EEPROM - DIMM 1 SPD (add-on dependent)
AT24Cxx    I2C EEPROM - DIMM 0 SPD (add-on dependent)
ALi1535D+  Southbridge - SMBUS and I2C controller


FIGURE 2-2 shows the location of the environmental monitoring hardware on the Netra CP2500.


FIGURE 2-2 Location of Environmental Monitoring Hardware on the Netra CP2500 Board - Top Side

The Netra CP2500 board shown from the top, with the exhaust sensor and the CPU thermal sensor near the top, and the inlet sensor near the bottom. The latches are at the left of the figure.


FIGURE 2-3 is a block diagram of the environmental monitoring functions.


FIGURE 2-3 Netra CP2500 Board Environmental Monitoring Functional Block Diagram

The diagram shows the environmental monitoring functions of the Netra CP2500 board.


Switching Power On and Off

The on-board voltage controller allows power to the CPU of the Netra CP2500 only when the following conditions are met:

The controller requires these conditions to be true for at least 100 milliseconds to help ensure the supply voltages are stable. If any of these conditions become untrue, the voltage monitoring circuit shuts down the CPU power of the board.
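The 100 millisecond stability requirement behaves like a hold-off timer. The on-board voltage controller is a hardware function, so the following C fragment is only a software model for illustration. The names power_gate_t and power_gate_step are hypothetical, and the model assumes it is stepped periodically with the time elapsed since the previous step.

```c
#include <assert.h>
#include <stddef.h>

#define STABLE_MS       100     /* conditions must hold this long */

/* Hypothetical model state for the power gate. */
typedef struct {
        unsigned int ok_ms;     /* time the conditions have held, in ms */
        int power_on;           /* nonzero while CPU power is allowed */
} power_gate_t;

/*
 * Advance the model by elapsed_ms milliseconds.
 * conditions_ok is nonzero when every power-on condition is true.
 * Returns nonzero while CPU power is allowed.
 */
int
power_gate_step(power_gate_t *g, int conditions_ok, unsigned int elapsed_ms)
{
        assert(g != NULL);

        if (!conditions_ok) {
                /* Any failed condition removes power and restarts the timer. */
                g->ok_ms = 0;
                g->power_on = 0;
        } else {
                if (g->ok_ms < STABLE_MS)
                        g->ok_ms += elapsed_ms;
                if (g->ok_ms >= STABLE_MS)
                        g->power_on = 1;
        }
        return (g->power_on);
}
```

In this model a single step with a failed condition removes power immediately, and power returns only after the conditions have again held for the full interval.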

Inlet, Exhaust, and CPU Temperature Monitoring

The CPU diode sensor reading may vary from slot to slot and from board to board in a system, and is dependent primarily on system cooling. As an example, a system might have sensor readings for the CPU diode from 35°C to 49°C with an ambient inlet of 21°C across many boards, with a variety of configurations and positions within a chassis. Care must be taken when setting the alarm and shutdown temperatures based on the CPU diode sensor value. This sensor typically is linear across the operating range of the board.

The exhaust sensor measures the local air temperature at the trailing edge of the board for systems with bottom to top airflow. This value depends on the character and volume of the airflow across the board. Typical values in a chassis may range from a delta over inlet ambient of 0°C to 12°C, depending on the power dissipation of the board configuration and the position in the chassis. The exhaust sensor is nonlinear with respect to ambient inlet temperature.

The inlet sensor measures the local air temperature at the leading edge of the board on the solder side under the solder-side cover. This value typically can range from a reading of 0°C to 13°C above inlet system ambient in a chassis. Care must be taken to understand the application and installation of the board to use this temperature sensor.

A sudden drop of all temperature sensors to near room ambient temperature can mean loss of power to one or more Netra CP2500s.

A gradual increase in the delta temperature from inlet to outlet can be due to dust clogging system filters. This feature can be used to set service levels for filter cleaning or changing.
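A service-level check based on the inlet-to-exhaust delta might look like the following sketch. The names air_temps_t, air_delta, and filter_service_due are hypothetical, and the 4 degree service margin is an assumed value for illustration only. A real margin would come from observing the particular chassis over time.

```c
#include <assert.h>
#include <stddef.h>

#define SERVICE_DELTA_INCREASE  4       /* assumed service margin, in C */

/* Hypothetical pair of air temperature readings, in degrees C. */
typedef struct {
        int inlet;
        int exhaust;
} air_temps_t;

/* Temperature gained by the air as it crosses the board. */
int
air_delta(const air_temps_t *t)
{
        assert(t != NULL);
        return (t->exhaust - t->inlet);
}

/*
 * Returns nonzero when the current delta has grown past the baseline
 * delta by at least the service margin, which may indicate that the
 * system filters are clogging.
 */
int
filter_service_due(const air_temps_t *baseline, const air_temps_t *now)
{
        return (air_delta(now) - air_delta(baseline) >=
            SERVICE_DELTA_INCREASE);
}
```

Comparing deltas rather than absolute readings keeps the check from firing on a change in room temperature alone, because both sensors shift together when the ambient changes.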

The CPU diode temperature can be used to prevent damage to the board by shutting the board down if this sensor exceeds predetermined limits.


Adjusting the Environmental Monitoring Warning, Critical, and Shutdown Parameter Settings on the Board

The Netra CP2500 uses the environmental monitoring detection system to monitor the temperature of the board. The environmental monitoring system will display messages if the board temperature exceeds the warning and critical settings. Because the on-board sensors may report different temperature readings for different system configurations and airflows, you might want to adjust the warning, critical, and shutdown temperature parameter settings.

The Netra CP2500 determines the board temperature by retrieving temperature data from sensors located on the board. A board sensor reads the temperature of the immediate area around the sensor. Although the software might appear to report the temperature of a specific hardware component, the software is actually reporting the temperature of the area near the sensor. For example, the CPU diode sensor reads the temperature at the location of the sensor and not on the actual CPU heat sink. The board's OpenBoot PROM collects the temperature readings from each board sensor at regular intervals. You can display these temperature readings using the show-sensors OpenBoot PROM command. See Using the show-sensors Command at the OpenBoot PROM.

The temperature read by the CPU sensor will trigger OpenBoot PROM warning and critical messages. When the CPU sensor reads a temperature greater than the warning parameter setting, the OpenBoot PROM will display a warning message. When the sensor reads a temperature greater than the shutdown setting, the SMC will shut down the board.

Many factors affect the temperature readings of the sensors, including the airflow through the system, the ambient temperature of the room, and the system configuration. These factors might contribute to the sensors reporting different temperature readings than expected.

The Netra CP2500 board CPU sensor default temperature threshold values are 110°C for the high warning temperature, 118°C for the high shutdown temperature, and 123°C for the high power-off temperature.



Note - If you have developed an application that uses the environmental monitoring software to monitor the temperature sensors, you may want to adjust your application's settings accordingly.




OpenBoot PROM Environmental Monitoring

This section describes the OpenBoot PROM environmental monitoring of the CPU.

Warning Temperature Response at OpenBoot PROM

When the CPU diode temperature reaches the warning temperature, a message similar to the following is displayed at the ok prompt at regular intervals:


Temperature sensor #2 has threshold event of
<<< WARNING!!! Upper Non-critical - going high >>>
The current threshold setting is : 110
The current temperature is : 111

Critical Temperature Response at OpenBoot PROM

When the CPU diode temperature reaches the critical temperature, a message similar to the following is displayed at the ok prompt at regular intervals:


Temperature sensor #2 has threshold event of
<<< ALERT!!! Upper Critical - going high >>>
The current threshold setting is : 118
The current temperature is : 119

Using the show-sensors Command at the OpenBoot PROM

The show-sensors command at OpenBoot PROM displays the readings of all the temperature sensors on the board. A sample output for typical sensor readings for a Netra CP2500 is as follows:


ok show-sensors
 
Sensor#    Sensor Name                           Sensor Reading
=======    ====================================  ===================
   1       EP 5v                     Sensor      (d1)  4.968 volts
   2       EP 3.3v                   Sensor      (8b)  3.336 volts
   3       BP +12v                   Sensor      (ce)  11.760 volts
   4       BP -12v                   Sensor      (63)  -12.010 volts
   5       IPMB Power                Sensor      (d2)  4.968 volts
   6       SMC Power                 Sensor      (69)  2.448 volts
   7       VDD 3.3v                  Sensor      (a8)  3.2592 volts
   8       VCCP                      Sensor      (64)  1.1800 volts
   9       +12v                      Sensor      (ba)  11.6250 volts
   a       -12v                      Sensor      (36)  -12.040 volts
   b       +5v                       Sensor      (be)  4.940 volts
   c       Standby 3.3v              Sensor      (be)  3.2680 volts
   d       Main 3.3v                 Sensor      (be)  3.2680 volts
   e       External I  temp (CPU)    Sensor      (3e)  62 degree C
   f       External II temp (Outlet) Sensor      (20)  32 degree C
  10       Internal    temp (Inlet)  Sensor      (1d)  29 degree C
 
ok 
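If the console output is captured for logging, each sensor row can be picked apart with standard C string functions. The sketch below assumes the row layout shown in the sample above (hexadecimal sensor number, name, raw value in parentheses, converted reading, units). The names sensor_row_t and parse_sensor_row are hypothetical. In the sample, the raw value of each temperature row is simply the reading in hexadecimal (for example, 3e is 62), but the sketch does not rely on that.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one show-sensors row. */
typedef struct {
        unsigned int    number;         /* sensor number (printed in hex) */
        unsigned int    raw;            /* raw value shown in parentheses */
        double          value;          /* converted reading */
        int             is_temp;        /* nonzero if units are degrees C */
} sensor_row_t;

/*
 * Parse one row of show-sensors output.
 * Returns 0 on success, or -1 if the line is not a sensor row
 * (for example, the header, the separator, or a blank line).
 */
int
parse_sensor_row(const char *line, sensor_row_t *row)
{
        const char      *paren;
        char            unit[16];

        assert(line != NULL && row != NULL);

        /* The row starts with the sensor number in hexadecimal. */
        if (sscanf(line, "%x", &row->number) != 1)
                return (-1);

        /* The last '(' begins the raw value; names may contain '(' too. */
        paren = strrchr(line, '(');
        if (paren == NULL)
                return (-1);
        if (sscanf(paren, "(%x) %lf %15s", &row->raw, &row->value,
            unit) != 3)
                return (-1);

        row->is_temp = (strcmp(unit, "degree") == 0);
        return (0);
}
```

Searching for the last opening parenthesis matters because the temperature sensor names, such as "External I  temp (CPU)", contain parentheses of their own.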


Environmental Monitoring Application Programming

The following sections describe how to use the environmental monitoring functions in an application program.

For the environmental monitoring application program (envmond) to monitor the hardware environment, the following conditions must be met:

The environmental monitoring parameter values in the application program apply when the system is running at the Solaris level and do not necessarily have to be the same as the default settings programmed by the SMC and used by the OpenBoot PROM. The OpenBoot PROM environmental monitoring only applies when the system is running at the OpenBoot PROM level.

Reading Temperature Sensor States Using the PICL API

Temperature sensor states may be read using the libpicl API. The following properties are supported in a PICL temperature sensor class node:


TABLE 2-4 PICL Temperature Sensor Class Node Properties

Property               Type  Description
---------------------  ----  ----------------------------
LowWarningThreshold    INT   Low threshold for warning
LowShutdownThreshold   INT   Low threshold for shutdown
LowPowerOffThreshold   INT   Low threshold for power off
HighWarningThreshold   INT   High threshold for warning
HighShutdownThreshold  INT   High threshold for shutdown
HighPowerOffThreshold  INT   High threshold for power off


The PICL plug-in receives these sensor events and updates the State property based on the information extracted from the IPMI message. It then posts a PICL event.

Threshold levels of the PICL node class temperature sensor are:

To obtain a reading of temperature sensor states, use the prtpicl -v command:


# prtpicl -c temperature-sensor -v

Sample PICL output of temperature sensors on a Netra CT system is as follows.


# prtpicl -c temperature-sensor -v
 CPU-sensor (temperature-sensor, 2600000041f)
            :Condition         ok 
            :HighPowerOffThreshold  123 
            :HighShutdownThreshold        118 
            :HighWarningThreshold         110 
            :LowPowerOffThreshold  -20 
            :LowShutdownThreshold  -10
            :LowWarningThreshold  -5
            :Temperature            74 
            :Label         Ambient 
            :GeoAddr       0xe  
            :_class        temperature-sensor 
            :name          CPU-sensor
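A script or helper program that post-processes this output can extract the integer properties with sscanf(). The following is a sketch under the assumption that each property line has the form shown above, a colon followed by the property name and its value. The names picl_prop_t, parse_picl_prop_line, and picl_prop_is are hypothetical and are not part of libpicl.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one ":Name value" line. */
typedef struct {
        char    name[64];
        int     value;
} picl_prop_t;

/*
 * Parse a prtpicl -v property line that carries a decimal integer.
 * Returns 0 on success, or -1 for node header lines and for
 * properties whose value is not a plain decimal integer
 * (such as :Condition ok or :GeoAddr 0xe).
 */
int
parse_picl_prop_line(const char *line, picl_prop_t *p)
{
        char    extra;

        assert(line != NULL && p != NULL);

        /* The trailing %c detects leftover text after the number. */
        if (sscanf(line, " :%63s %d %c", p->name, &p->value, &extra) != 2)
                return (-1);
        return (0);
}

/* Returns nonzero if the parsed property has the given name. */
int
picl_prop_is(const picl_prop_t *p, const char *name)
{
        assert(p != NULL && name != NULL);
        return (strcmp(p->name, name) == 0);
}
```

Within a program, reading the properties directly through libpicl is preferable to parsing command output. The parser is useful mainly for captured logs.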

Using a Configuration File for Sensor Information

On the Netra CP2500, you can enable or disable sensors, and configure sensor threshold actions, such as shutdown and reboot, by editing the /etc/picl/config/envmond.conf file.

Sample entries in the envmond.conf file are:


#entry format: name=value option
envmon-enable = true  /* Globally enables/disables PICL-based environmental monitoring */
sensor=CP2500-CPU-sensor threshold_shutdown_cmd="/usr/sbin/shutdown -i5 -y -g15&"  /* presence of this line shows that the corresponding sensor is enabled */

Solaris Driver Interface

The PICL envmond plug-in opens an SMC driver stream and requests sensor events. The SMC monitors the sensors and, when it detects a change at a particular sensor that meets one of the specified thresholds, generates an event to local Solaris software. This event is captured by the SMC driver (as an IPMI message) and is sent on an open STREAM that has requested sensor events. The sensor events are received by the PICL plug-in, which updates the State property based on the information it extracts from the IPMI message and posts a PICL event.

Sample Application Program

This section presents a sample environmental monitoring (envmond) application that monitors the CPU diode temperature.


CODE EXAMPLE 2-1 Sample envmond Application Program
/*
 * sensor_readwrite.c
 *
 * compile: cc sensor_readwrite.c -lthread -lpicl -o sensor_readwrite
 */
#include <stdio.h>
#include <string.h>     /* strcmp() */
#include <inttypes.h>   /* int8_t */
#include <picl.h>
 
#define HI_POWEROFF_THRESHOLD   "HighPowerOffThreshold"
#define HI_SHUTDOWN_THRESHOLD   "HighShutdownThreshold"
#define HI_WARNING_THRESHOLD    "HighWarningThreshold"
#define LO_POWEROFF_THRESHOLD   "LowPowerOffThreshold"
#define LO_SHUTDOWN_THRESHOLD   "LowShutdownThreshold"
#define LO_WARNING_THRESHOLD    "LowWarningThreshold"
#define CURRENT_TEMPERATURE     "Temperature"
 
static int
get_child_by_name(picl_nodehdl_t nodeh, char *name, picl_nodehdl_t *resulth)
{
        picl_nodehdl_t  childh;
        picl_nodehdl_t  nexth;
        char            propname[PICL_PROPNAMELEN_MAX];
        picl_errno_t    rc;
 
        /* look up first child node */
        rc = picl_get_propval_by_name(nodeh, PICL_PROP_CHILD, &childh,
                                        sizeof (picl_nodehdl_t));
        if (rc != PICL_SUCCESS) {
                return (rc);
        }
 
        /* step through child nodes looking for named node */
        while (rc == PICL_SUCCESS) {
                rc = picl_get_propval_by_name(childh, PICL_PROP_NAME,
                                                propname, sizeof (propname));
                if (rc != PICL_SUCCESS) {
                        return (rc);
                }
 
                if (name && strcmp(propname, name) == 0) {
                        /* yes - got it */
                        *resulth = childh;
                        return (PICL_SUCCESS);
                }
 
                if (get_child_by_name(childh, name, resulth) == PICL_SUCCESS) {
                        return (PICL_SUCCESS);
                }
 
                /* get next child node */
                rc = picl_get_propval_by_name(childh, PICL_PROP_PEER,
                                        &nexth, sizeof (picl_nodehdl_t));
                if (rc != PICL_SUCCESS) {
                        return (rc);
                }
                childh = nexth;
        }
        return (rc);
}
 
void
get_sensor_thresholds(picl_nodehdl_t nodeh)
{
        int8_t  threshold;
 
        if (picl_get_propval_by_name(nodeh, HI_POWEROFF_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read high power-off threshold.\n");
        } else
                fprintf(stdout, "High power-off threshold = %d\n", threshold);
 
        if (picl_get_propval_by_name(nodeh, HI_SHUTDOWN_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read high shutdown threshold.\n");
        } else
                fprintf(stdout, "High shutdown threshold = %d\n", threshold);
 
        if (picl_get_propval_by_name(nodeh, HI_WARNING_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read high warning threshold.\n");
        } else
                fprintf(stdout, "High warning threshold = %d\n", threshold);
 
        if (picl_get_propval_by_name(nodeh, LO_POWEROFF_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read low power-off threshold.\n");
        } else
                fprintf(stdout, "Low power-off threshold = %d\n", threshold);
 
        if (picl_get_propval_by_name(nodeh, LO_SHUTDOWN_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read low shutdown threshold.\n");
        } else
                fprintf(stdout, "Low shutdown threshold = %d\n", threshold);
 
        if (picl_get_propval_by_name(nodeh, LO_WARNING_THRESHOLD,
                &threshold, sizeof (threshold)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read low warning threshold.\n");
        } else
                fprintf(stdout, "Low warning threshold = %d\n", threshold);
}
 
void
set_sensor_thresholds(picl_nodehdl_t nodeh, char *threshold, int8_t value)
{
        int8_t  new_value = value;
 
        if (picl_set_propval_by_name(nodeh, threshold, &new_value,
                                sizeof (new_value)) != PICL_SUCCESS)
                fprintf(stderr, "Failed to set %s\n", threshold);
}
 
int
main(void)
{
        int8_t  temp;
        char    *sensor = "CPU-sensor";
 
        picl_nodehdl_t  rooth;
        picl_nodehdl_t  platformh;
        picl_nodehdl_t  childh;
 
        if (picl_initialize() != PICL_SUCCESS) {
                fprintf(stderr, "Failed to initialise picl\n");
                return (1);
        }
        if (picl_get_root(&rooth) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to get root node\n");
                picl_shutdown();
                return (1);
        }
        if (get_child_by_name(rooth, "platform", &platformh) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to get platform node\n");
                picl_shutdown();
                return (1);
        }
 
        if (get_child_by_name(platformh, sensor, &childh) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to get %s sensor.\n", sensor);
                picl_shutdown();
                return (1);
        }
 
        get_sensor_thresholds(childh);
 
        /* Read current sensor temperature; stop if it cannot be read */
        if (picl_get_propval_by_name(childh, CURRENT_TEMPERATURE,
                &temp, sizeof (temp)) != PICL_SUCCESS) {
                fprintf(stderr, "Failed to read current temperature\n");
                picl_shutdown();
                return (1);
        }
        fprintf(stdout, "Current temperature = %d\n", temp);
 
        /* Set the high warning threshold 5 degrees above the reading */
        set_sensor_thresholds(childh, HI_WARNING_THRESHOLD, temp + 5);
 
        picl_shutdown();
        return (0);
}
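The sample program sets HighWarningThreshold to the current temperature plus 5 without checking the result. An application might first constrain the new value so that it stays below the shutdown threshold and within the range of int8_t. The helper below is a sketch of that idea. The name safe_warning_threshold is hypothetical, and the policy of keeping the warning threshold one degree below the shutdown threshold is an assumption, not a documented requirement.

```c
#include <assert.h>
#include <inttypes.h>

/*
 * Compute a warning threshold "margin" degrees above the current
 * temperature, kept strictly below the shutdown threshold.
 * Arithmetic is done in int so the sum cannot overflow int8_t.
 */
int8_t
safe_warning_threshold(int8_t current_temp, int8_t margin,
    int8_t shutdown_limit)
{
        int     candidate;

        assert(margin >= 0);

        candidate = (int)current_temp + (int)margin;

        /* Assumed policy: warning stays at least 1 C below shutdown. */
        if (candidate >= (int)shutdown_limit)
                candidate = (int)shutdown_limit - 1;

        /* Guard the low end of the int8_t range as well. */
        if (candidate < INT8_MIN)
                candidate = INT8_MIN;

        return ((int8_t)candidate);
}
```

In the sample program, the result of safe_warning_threshold(temp, 5, shutdown), where shutdown is the HighShutdownThreshold value read earlier, could be passed to set_sensor_thresholds() in place of temp+5.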

 

Reading the CPU Temperature and Environmental Limits

You can access the CPU temperature sensor current readings and environmental monitoring settings from the Solaris prompt by typing the following commands. Sample output is listed after each command.

prtpicl command example:


# prtpicl -c temperature-sensor -v
 CPU-sensor (temperature-sensor, 2600000041f)
            :Condition         ok 
            :HighPowerOffThreshold  123 
            :HighShutdownThreshold        118 
            :HighWarningThreshold         110 
            :LowPowerOffThreshold  -20 
            :LowShutdownThreshold  -10
            :LowWarningThreshold  -5
            :Temperature            74 
            :Label         Ambient 
            :GeoAddr       0xe  
            :_class        temperature-sensor 
            :name          CPU-sensor

prtdiag command example:


# prtdiag -v
...
 
CPU Node Temperature Information
--------------------------------
 
Temperature Reading: 85
Critical Threshold Information
------------------------------
High Power-Off Threshold          123
High Shutdown Threshold           118
High Warning Threshold            110
Low Power Off Threshold          -20
Low Shutdown Threshold           -10
Low Warning Threshold            -5

TABLE 2-5 shows how the values displayed by the prtpicl and prtdiag commands correspond to the environmental monitoring response that occurs when the CPU temperature exceeds each limit.


TABLE 2-5 Description of Values Displayed by Solaris Commands

Environmental Monitoring Warning                    prtpicl                prtdiag
--------------------------------------------------  ---------------------  ------------------------
The first-level temperature warning is displayed.   HighWarningThreshold   High Warning Threshold
The second-level temperature warning is displayed.  HighShutdownThreshold  High Shutdown Threshold
The CPU is shut off.                                HighPowerOffThreshold  High Power-Off Threshold