CHAPTER 3
Advanced System Management
Advanced System Monitoring (ASM) is an intelligent fault detection system that increases uptime and manageability of the board. The System Management Controller (SMC) module on the Netra CP2000/CP2100 series supports the temperature monitoring functions of ASM. This chapter describes the specific ASM functions of the Netra CP2000/CP2100 series.
TABLE 3-1 lists the compatible ASM hardware, OpenBoot PROM, and Solaris operating environment for the Netra CP2000/CP2100 series.
FIGURE 3-1 illustrates the Netra CP2000/CP2100 series ASM application block diagram.
The Netra CP2000/CP2100 series functions as a system controller board or as a satellite board in a CompactPCI system rack. The Netra CP2000/CP2100 series board monitors its CPU-vicinity temperature and issues warnings at both the OpenBoot PROM and Solaris operating environment levels when these environmental readings are out of limits. At the Solaris operating environment level, the application program monitors and issues warnings for the system controller and the satellite board. In both the host and satellite modes of operation, the CPU-vicinity temperature is monitored at the OpenBoot PROM level if the NVRAM variable env-monitor is enabled.
The OpenBoot PROM monitors the CPU-vicinity temperature at a fixed polling interval of 10 seconds (the env-mon-interval parameter). It displays a warning message on the default output device whenever the measured temperature exceeds the warning temperature (the warning-temperature parameter) or the shutdown temperature (the shutdown-temperature parameter). Both thresholds are pre-programmed, configurable NVRAM variables. See OpenBoot PROM Environmental Parameters for information on changing these pre-programmed parameters.
The OpenBoot PROM cannot shut down power to the Netra CP2000/CP2100 series board. The shutdown temperature message is only a warning message to the user that the Netra CP2000/CP2100 series board is overheating and needs to be shut down immediately by external means.
OpenBoot PROM-level protection takes place only when the env-monitor parameter is enabled (it is not the default setting). Disabling env-monitor completely disables ASM protection at the OpenBoot PROM level but does not affect ASM protection at the Solaris operating environment level.
Monitoring changes in the ASM temperatures can help identify problems with the room where the system is installed, functional problems with the system, or problems on the board. Baseline temperatures established early in deployment and operation can be used to trigger alarms if the sensor temperatures later increase or decrease dramatically. If all the sensors go to room ambient, power has probably been lost to the host system. If one or more sensors rise in temperature substantially, there may be a system fan malfunction, the system cooling may have been compromised, or room air conditioning may have failed.
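The baseline comparison described above could be sketched in C as follows. This is a minimal illustration, not part of the ASM software: the function names, the tolerance parameter, and the use of whole degrees Celsius are assumptions, and the readings are expected to come from whatever sensor interface the application uses.

```c
#include <assert.h>
#include <stddef.h>

/* Direction of a sensor's deviation from its recorded baseline. */
enum drift { DRIFT_DOWN = -1, DRIFT_NONE = 0, DRIFT_UP = 1 };

/* Hypothetical helper: compare one reading (degrees C) with its baseline.
 * A change within +/- tolerance_c is treated as normal variation. */
enum drift check_drift(int reading_c, int baseline_c, int tolerance_c)
{
    assert(tolerance_c >= 0);
    if (reading_c - baseline_c > tolerance_c)
        return DRIFT_UP;
    if (baseline_c - reading_c > tolerance_c)
        return DRIFT_DOWN;
    return DRIFT_NONE;
}

/* Count how many of n sensors have drifted in the given direction. */
size_t count_drift(const int *readings_c, const int *baseline_c, size_t n,
                   int tolerance_c, enum drift direction)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (check_drift(readings_c[i], baseline_c[i], tolerance_c) == direction)
            count++;
    }
    return count;
}
```

If count_drift reports DRIFT_DOWN for every sensor, the readings may be falling toward room ambient, which suggests a loss of power. One or more DRIFT_UP results may point to a fan or cooling problem.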
To access the CPU-vicinity temperature measurements at the Solaris operating environment level, use the ioctl system call in an application program. To specify the ASM polling rate, use the sleep system call.
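The read-then-sleep pattern can be outlined as shown below. This sketch deliberately does not name any ASM ioctl request: the read_temp_fn type is a hypothetical wrapper that an application would implement around the actual ioctl call, and fake_read_temp is a stand-in so the loop can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical reader type: wraps whatever ioctl request the ASM driver
 * defines. Returns 0 and stores degrees C on success, -1 on failure. */
typedef int (*read_temp_fn)(int *temp_c);

/* Take `samples` readings, sleeping interval_s seconds between them,
 * and report the highest temperature seen. Returns 0, or -1 on a
 * failed read. */
int poll_max_temp(read_temp_fn read_temp, unsigned interval_s,
                  int samples, int *max_c)
{
    assert(read_temp != NULL && max_c != NULL && samples > 0);
    for (int i = 0; i < samples; i++) {
        int t;
        if (read_temp(&t) != 0)
            return -1;
        if (i == 0 || t > *max_c)
            *max_c = t;
        if (i + 1 < samples)
            sleep(interval_s);   /* the sleep interval sets the polling rate */
    }
    return 0;
}

/* Stand-in reader for trying the loop without hardware: 40, 41, 42, ... */
int fake_read_temp(int *temp_c)
{
    static int next = 40;
    *temp_c = next++;
    return 0;
}
```

Passing the reader as a function pointer keeps the polling logic separate from the device access, so the same loop can be exercised against simulated readings.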
Protection at the operating environment level takes place only when the ASM application program is running, which is initiated by the end user. Failure to run the ASM application program completely disables ASM protection at the Solaris level but does not affect ASM protection at the OpenBoot PROM level. Keep the ASM application program running at all times.
The program can also issue a shutdown message on the default output device whenever the measured CPU-vicinity temperature exceeds the shutdown temperature. In addition, the ASM application program can be programmed to sync and shut down the Solaris operating environment when conditions warrant.
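A warning and shutdown policy of this kind might be expressed as a small classification function. The names below are hypothetical, and the thresholds are passed in by the caller because the application program sets its own values. "Exceeds" is read here as strictly greater than.

```c
#include <assert.h>

/* Hypothetical severity levels for an application-defined policy. */
enum asm_level { ASM_OK, ASM_WARNING, ASM_SHUTDOWN };

/* Classify a CPU-vicinity reading against the application's own
 * warning and shutdown thresholds (all in degrees C). */
enum asm_level classify_temp(int temp_c, int warning_c, int shutdown_c)
{
    assert(warning_c <= shutdown_c);
    if (temp_c > shutdown_c)
        return ASM_SHUTDOWN;   /* issue shutdown message; optionally sync and halt */
    if (temp_c > warning_c)
        return ASM_WARNING;    /* issue warning message */
    return ASM_OK;
}
```

For example, with thresholds of 60 and 65, a reading of 61 would be classified as ASM_WARNING and a reading of 66 as ASM_SHUTDOWN.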
The use of system calls to access the ASM device driver at the Solaris level enables OEMs to implement their own monitoring, warning, and shutdown policies through a high-level programming language such as the C programming language. An OEM can log and analyze the environmental data for trends (such as drift rate or sudden changes in average readings). Or, an OEM can communicate the occurrence of an unusual condition to a specialized management network using the Netra CP2000/CP2100 series board Ethernet port.
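As an illustration of the trend analysis mentioned above, the following sketch computes a drift rate and a shift in average readings from a logged series. The function names and the definitions chosen (net change per polling interval, and the difference between the means of the newer and older halves of the log) are assumptions made for this example. An OEM may prefer other statistics.

```c
#include <assert.h>
#include <stddef.h>

/* Average of n logged readings (degrees C). */
double mean_temp(const double *samples_c, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples_c[i];
    return sum / (double)n;
}

/* Drift rate in degrees C per polling interval: the net change from
 * the first to the last sample, divided by the number of intervals. */
double drift_rate(const double *samples_c, size_t n)
{
    assert(n >= 2);
    return (samples_c[n - 1] - samples_c[0]) / (double)(n - 1);
}

/* Change in average reading: mean of the newer half of the log minus
 * the mean of the older half (the middle sample is skipped if n is odd). */
double mean_shift(const double *samples_c, size_t n)
{
    assert(n >= 2);
    size_t half = n / 2;
    return mean_temp(samples_c + (n - half), half) - mean_temp(samples_c, half);
}
```

For the series 40, 41, 42, 43, drift_rate returns 1.0 and mean_shift returns 2.0.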
Refer to Sample Application Program for an example of how a simple ASM monitoring program can be implemented.
The power module is controlled by the SMC subsystem (except for automatic controls such as overcurrent shutdown or voltage regulation). The functions controlled are core voltage output level and module on/off state.
The onboard voltage controller is a hardware function that is not controlled by either firmware or software. At the OpenBoot PROM level, there is no mechanism for the OpenBoot PROM to either remove or restore power to the Netra CP2000/CP2100 series board when the CPU-vicinity temperature exceeds its maximum recommended level.
There is no mechanism for the Solaris operating environment to either recover or restore power to the Netra CP2000/CP2100 series board when an unusual condition occurs (for example, if the CPU-vicinity temperature exceeds its maximum recommended level). In either case, the end user must intervene and manually recover the Netra CP2000/CP2100 series board as well as the CompactPCI system through hardware control.
This section summarizes the hardware ASM features on the Netra CP2000/CP2100 series board. TABLE 3-2 lists the ASM functions and shows the location of the ASM hardware on a typical Netra CP2060 board. TABLE 3-3 shows the same information for the Netra CP2160 board.
Note that in TABLE 3-2 and TABLE 3-3 the SDRAM module sensor readings are shown as currently unavailable, because the tables list information for a typical Netra board that does not support memory modules.
Sensor reading is currently unavailable
FIGURE 3-6 is a block diagram of the ASM functions.
The controller requires these conditions to be true for at least 100 milliseconds to help ensure that the supply voltages are stable. If any of these conditions becomes untrue, the voltage monitoring circuit shuts down power to the board.
The inlet board temperature sensor can be used to ensure that the maximum allowable short-term system-level air inlet temperature is not exceeded. The sensor can also be used to monitor potential issues with the system or installation, since the inlet temperature for the Netra CP2160 board should be kept low to meet the installation reliability requirements.
The two exhaust temperature sensors can be used to ensure that proper airflow across the board is being maintained. The difference between the inlet air temperature and the exhaust temperatures can be monitored to determine whether system filters need servicing, whether air movers have failed, or whether an electrical problem has occurred because components on the board are drawing too much power.
During normal operation of the Netra CP2160 board, any sudden, sustained, or substantial changes in the delta temperature across the board can be used to alert service personnel to a potential system or board service issue.
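The delta temperature check could be sketched as follows. The function names, the choice of the hotter exhaust sensor as the reference, and the tolerance parameter are assumptions for illustration. The sketch evaluates a single set of readings, so detecting a sustained change would require applying it over several polling cycles.

```c
#include <assert.h>

/* Delta across the board: the hotter of the two exhaust readings
 * minus the inlet reading (all in degrees C). */
int board_delta_c(int inlet_c, int exhaust1_c, int exhaust2_c)
{
    int exhaust_c = exhaust1_c > exhaust2_c ? exhaust1_c : exhaust2_c;
    return exhaust_c - inlet_c;
}

/* Returns 1 if the measured delta differs from the expected delta
 * (recorded during normal operation) by more than tolerance_c. */
int delta_needs_service(int delta_c, int expected_delta_c, int tolerance_c)
{
    assert(tolerance_c >= 0);
    int diff = delta_c - expected_delta_c;
    if (diff < 0)
        diff = -diff;
    return diff > tolerance_c;
}
```

With an inlet reading of 30 and exhaust readings of 42 and 45, board_delta_c returns 15. Whether that value warrants service depends on the expected delta recorded for the particular system.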
The Netra CP2000/CP2100 board uses the Advanced System Monitoring (ASM) detection system to monitor the temperature of the board. The ASM system displays messages if the board temperature exceeds the configured warning and shutdown temperatures. Because the on-board sensors may report different temperature readings for different system configurations and airflows, you may want to adjust the warning and shutdown temperature parameter settings.
The CP2000/CP2100 board determines the board temperature by retrieving temperature data from sensors located on the board. A board sensor reads the temperature of the immediate area around the sensor. Although the software may appear to report the temperature of a specific hardware component, the software is actually reporting the temperature of the area near the sensor. For example, the CPU heat sink sensor reads the temperature at the location of the sensor and not on the actual CPU heat sink. The board's OpenBoot PROM collects the temperature readings from each board sensor at regular intervals. You can display these temperature readings using the show-sensors OpenBoot PROM command. See show-sensors Command at OpenBoot PROM.
The temperature read by the CPU heat sink sensor will trigger OpenBoot PROM warning and shutdown messages. When the CPU heat sink sensor reads a temperature greater than the warning parameter setting, the OpenBoot PROM will display a warning message. Likewise, when the sensor reads a temperature greater than the shutdown setting, the OpenBoot PROM will display a shutdown message.
Many factors affect the temperature readings of the sensors, including the airflow through the system, the ambient temperature of the room, and the system configuration. These factors may contribute to the sensors reporting different temperature readings than expected.
TABLE 3-5 shows the sensor readings of a typical Netra CP2040 board operating in a Sun server in a room with an ambient temperature of 21°C. The temperature readings were reported using the show-sensors OpenBoot PROM command. Note that the reported temperatures are higher than the ambient room temperature.
TABLE 3-6 shows the sensor readings of a typical Netra CP2160 board, which has different sensor locations than those on the other Netra CP2000/CP2100 series boards.
Note that the inlet temperature sensor typically does not capture the true board inlet temperature because of heat from nearby components. For typical Sun systems, subtract 3°C to 6°C from this value to approximate the true board inlet temperature. For non-Sun systems, a different correction may be required.
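The correction above can be written as a short helper that returns the range of likely true inlet temperatures. The names are hypothetical, and the 3°C and 6°C offsets apply only to typical Sun systems as stated in the text. Other enclosures would need their own measured offsets.

```c
#include <assert.h>

/* Offsets quoted in the text for typical Sun systems (degrees C). */
#define INLET_OFFSET_MIN_C 3
#define INLET_OFFSET_MAX_C 6

/* Estimated range for the true board inlet temperature. */
struct temp_range {
    int low_c;
    int high_c;
};

/* Subtract the larger offset for the low estimate and the smaller
 * offset for the high estimate. */
struct temp_range estimate_true_inlet(int sensor_c)
{
    struct temp_range r;
    r.low_c = sensor_c - INLET_OFFSET_MAX_C;
    r.high_c = sensor_c - INLET_OFFSET_MIN_C;
    assert(r.low_c <= r.high_c);
    return r;
}
```

A sensor reading of 30°C therefore corresponds to an estimated true inlet temperature between 24°C and 27°C.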
Because the temperature reported by the CPU diode sensor might differ from the actual CPU die temperature, you may want to adjust the settings for both the warning-temperature and shutdown-temperature OpenBoot PROM parameters. The default values of these parameters have been conservatively set at 60°C for the warning temperature and 65°C for the shutdown temperature.
This section describes how to change the OpenBoot PROM environmental monitoring parameters. These global OpenBoot PROM parameters do not apply at the Solaris level. Instead, the ASM application program provides equivalent parameters that do not have to be set to the same values as their OpenBoot PROM counterparts. Refer to ASM Application Programming for information about using ASM at the Solaris level. The OpenBoot PROM polls at a fixed interval of 10 seconds.
The OpenBoot PROM programs the SMC for temperature monitoring using the sensor commands. TABLE 3-7 lists the default threshold temperature settings for the CP2000/CP2100 series boards.
For example, on a Netra CP2160 board there are three NVRAM variables that provide different temperature levels. The critical-temperature limit lies between the warning and shutdown thresholds. The default values of these temperature thresholds and the corresponding actions are shown in TABLE 3-8.
Note that there is a lower limit of 50°C on the shutdown-temperature value. If the temperature is set to a value lower than 50°C, the OpenBoot PROM resets it to 50°C in the SMC. However, the OpenBoot PROM does not reset the NVRAM variable shutdown-temperature to 50°C. Therefore, every time the user resets the system, the OpenBoot PROM displays a warning message similar to the message below:
This safeguards against a user setting the shutdown-temperature lower than the room temperature and thereby causing the CPU and the Netra CP2160 board to be powered off by the SMC on the next reset.
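The lower-limit behavior can be modeled as shown below. This is an illustrative model of the behavior described in the text, not the firmware's implementation, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SHUTDOWN_FLOOR_C 50   /* lower limit stated in the text */

/* Result of applying the lower limit at reset. */
struct shutdown_setting {
    int nvram_c;   /* NVRAM shutdown-temperature value; left unchanged */
    int smc_c;     /* value actually programmed into the SMC */
    bool warned;   /* true when the limit was applied and a warning is shown */
};

/* Model of the reset-time check: values below the floor are raised to
 * the floor in the SMC only, so the warning recurs on every reset
 * until the NVRAM variable itself is changed. */
struct shutdown_setting apply_shutdown_floor(int nvram_c)
{
    struct shutdown_setting s;
    s.nvram_c = nvram_c;
    s.warned = nvram_c < SHUTDOWN_FLOOR_C;
    s.smc_c = s.warned ? SHUTDOWN_FLOOR_C : nvram_c;
    assert(s.smc_c >= SHUTDOWN_FLOOR_C);
    return s;
}
```

For an NVRAM value of 40, the model programs 50 into the SMC, leaves the NVRAM value at 40, and flags a warning. A value of 50 or higher is passed through unchanged.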
The warning-temperature global OpenBoot PROM parameter determines the temperature at which a warning is displayed. The shutdown-temperature global OpenBoot PROM parameter determines the temperature at which the system is shut down. The temperature monitoring environment variables can be modified at the OpenBoot PROM command level as shown in the examples below:
The show-sensors command at the OpenBoot PROM displays the readings of all the temperature sensors on the board. TABLE 3-9 shows typical sensor readings for a Netra CP2060 board (which would be similar to those of the Netra CP2040/CP2080/CP2140 boards), and TABLE 3-10 shows typical sensor readings for a Netra CP2160 board.
This sensor reading is not available
The Intelligent Platform Management Interface (IPMI) commands can be used to enable sensor monitoring and subsequent event generation from satellite boards in the Netra CP2000/CP2100 series CompactPCI system.
The IPMI command examples provided in this section are based on the IPMI Specification Version 1.0. Refer to the IPMI Specification for additional information on how to implement these IPMI commands.
Note - To execute an IPMI command, at the OpenBoot PROM ok prompt, type the packets in reverse order followed by the relevant information, as shown in Examples of IPMI Command Packets. Change the bytes in the example packet to accommodate different IPMI addresses, different threshold values, or different sensor numbers. See also the IPMI Specification Version 1.0.
See Set Sensor Threshold. If no threshold is set, the default threshold operates:
2. Follow instructions in Check Whether the IPMI Commands Are Executed Properly to check proper execution of the command.
2. Follow instructions in Check Whether the IPMI Commands Are Executed Properly to check proper execution of the command.
Note - In byte number 9, if the bit for a corresponding threshold is set to 1, then that threshold is set. If the bit is 0, the System Management Controller ignores that threshold. But if an attempt is made to set a threshold that is not supported, an error is returned in the command response.
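The threshold selection bits can be handled with simple mask operations, as in the sketch below. It assumes that byte 9 of the packet corresponds to the threshold mask byte of the IPMI Set Sensor Thresholds request and uses the bit assignments defined for that command in the IPMI specification. Verify both assumptions against the specification and the packet layout in Examples of IPMI Command Packets before use. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit assignments assumed from the IPMI Set Sensor Thresholds request
 * (threshold mask byte); bits 6 and 7 are reserved. */
enum threshold_bit {
    THR_LOWER_NONCRITICAL    = 1 << 0,
    THR_LOWER_CRITICAL       = 1 << 1,
    THR_LOWER_NONRECOVERABLE = 1 << 2,
    THR_UPPER_NONCRITICAL    = 1 << 3,
    THR_UPPER_CRITICAL       = 1 << 4,
    THR_UPPER_NONRECOVERABLE = 1 << 5
};

#define THR_RESERVED_BITS 0xC0

/* Mark a threshold as "to be set" in the mask byte. */
uint8_t select_threshold(uint8_t mask, enum threshold_bit bit)
{
    assert(((unsigned)bit & THR_RESERVED_BITS) == 0);
    return (uint8_t)(mask | (unsigned)bit);
}

/* Returns 1 if the mask asks the controller to set this threshold,
 * 0 if the controller would ignore it. */
int threshold_selected(uint8_t mask, enum threshold_bit bit)
{
    return (mask & (unsigned)bit) != 0;
}
```

Selecting only the upper non-critical and upper critical thresholds yields a mask of 0x18. Thresholds whose bits remain 0 would be ignored by the System Management Controller.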
The ASM parameter values in the application program apply when the system is running at the Solaris level and do not have to be the same as the corresponding parameter settings in the OpenBoot PROM.
To change the ASM parameter setting at the OpenBoot PROM level, see OpenBoot PROM Environmental Parameters for the procedure. The OpenBoot PROM ASM parameter values only apply when the system is running at the OpenBoot PROM level.
The ASM application program monitors the CPU-vicinity temperature as follows (see Sample Application Program for C code):
The ASM driver is a STREAMS module that sits on top of the Solaris system controller driver. The Netra CP2000/CP2100 series ASM driver accepts STREAMS ioctl input, passes it on to the system controller driver as a command, and returns the sensor temperature as output to the user. Currently, this driver handles only the local I2C bus. On the Netra CP2000 series and the Netra CP2140 board, this driver enables the user to monitor the CPU-vicinity temperature, PMC temperature, memory module heat sink temperature, memory module temperature, SDRAM module 1 temperature, SDRAM module 2 temperature, and the power module temperature. On the Netra CP2160 board, the driver enables the user to monitor the CPU temperature and the Inlet 1, Exhaust 1, Exhaust 2, SDRAM module 1, and power module temperatures.
When monitoring succeeds, the call returns 0. On any error, it returns -1 and sets errno accordingly. An attempt to read a sensor that is not physically present sets errno to ENXIO. For hardware or firmware failures, errno is EINVAL. For memory allocation problems, errno is EAGAIN.
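An application might translate these errno values into messages along the following lines. The function names and message strings are hypothetical, and treating EAGAIN as worth retrying is an assumption of this sketch rather than documented driver behavior.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map the errno values described above to short explanations.
 * Any other value falls back to the standard strerror() text. */
const char *asm_error_text(int err)
{
    switch (err) {
    case ENXIO:
        return "sensor not physically present";
    case EINVAL:
        return "hardware or firmware failure";
    case EAGAIN:
        return "memory allocation problem";
    default:
        return strerror(err);
    }
}

/* Assumption: only a memory allocation problem is worth retrying. */
int asm_error_is_transient(int err)
{
    return err == EAGAIN;
}
```

A caller would typically invoke asm_error_text(errno) immediately after a call that returned -1, before any other library call can overwrite errno.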
This section presents a sample ASM application that monitors the CPU-vicinity temperature. Please refer to /usr/platform/sun4u/include/sys/stdasm.h if you want to add support for the other six sensors in the application.
This section describes the test configuration used to generate the data for the OpenBoot PROM temperature table used by the ASM temperature monitoring function. OEMs who need to revise the OpenBoot PROM temperature table because of changes to the enclosure, system, or fan configuration should use it as a guideline.
See the section on Thermocouple Locations above for further details.