Sun N1 System Manager 1.1 Administration Guide

Monitoring Threshold Values

The value of any given monitored attribute is compared to a threshold value. Low and high threshold values are defined and can be configured.

Attribute data is compared against thresholds at regular intervals. These polling intervals are configurable. For further information about polling intervals, see Setting Polling Intervals.

When a monitored attribute is polled and the value of the attribute is beyond the default or user-defined threshold safe range, an event is generated and a status is issued. If the value of the attribute is lower than the low threshold or higher than the high threshold, then depending on the severity of the threshold, an event is generated to show a status of nonrecoverable, critical, or warning. Otherwise, the status of the monitored attribute is OK, provided that a value can be obtained.

If no value can be obtained, an event is generated to show that the status of the monitored attribute is unknown. The health of an OS resource can be shown as unknown if the server is reachable but the monitoring agent cannot be contacted on SNMP port 161.
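The comparison described above can be sketched in Python. This is only an illustration of the rules as stated in this section, not the N1 System Manager implementation: the function name `classify` and the dictionary layout are hypothetical. The strict comparisons follow the wording "lower than the low threshold or higher than the high threshold".

```python
from typing import Mapping, Optional

# Severities named in this section, checked from most to least severe.
SEVERITIES = ("nonrecoverable", "critical", "warning")


def classify(value: Optional[float], thresholds: Mapping[str, float]) -> str:
    """Return the status of one polled attribute value.

    `thresholds` maps names such as "warninghigh" or "criticallow" to numbers.
    Thresholds that are not defined are simply absent from the mapping.
    """
    if value is None:
        # No value could be obtained for the attribute.
        return "unknown"
    for severity in SEVERITIES:
        high = thresholds.get(severity + "high")
        low = thresholds.get(severity + "low")
        if high is not None and value > high:
            return severity + "high"
        if low is not None and value < low:
            return severity + "low"
    return "OK"
```

For example, with `warninghigh` at 80 and `criticalhigh` at 90, a polled value of 85 is classified as `warninghigh`, a value of 95 as `criticalhigh`, and a value of 50 as `OK`.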

The values nonrecoverable, critical, and warning are discussed in show server in Sun N1 System Manager 1.1 Command Line Reference Manual.

What Happens When a Threshold is Broken

If the value of a monitored attribute rises above the warninghigh threshold, a status of warninghigh is issued. If the value continues to rise and passes the criticalhigh threshold, a status of criticalhigh is issued. If the value continues to rise above the nonrecoverablehigh threshold, a status of nonrecoverablehigh is issued.

If the value then falls back toward the safe range, no further events are generated until the value falls below the warninghigh threshold, at which point an event is generated to show a status of normal.

If the value of a monitored attribute falls below the warninglow threshold, a status of warninglow is issued. If the value continues to fall, and passes the criticallow threshold, a status of criticallow is issued. If the value continues to fall below the nonrecoverablelow threshold, a status of nonrecoverablelow is issued.

If the value then rises back toward the safe range, no further events are generated until the value rises above the warninglow threshold, at which point an event is generated to show a status of normal.
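The event behavior described in this section can be sketched as a small Python function. This is one reading of the text, not the product's code. The name `events_for`, the assumption that monitoring starts in the safe range, and the choice not to report again when a value climbs back to a severity it has already reported are assumptions of the sketch. A jump directly from a high status to a low status is not modeled.

```python
from typing import Iterable, List

# Severity rank of each status named in this section; higher is more severe.
RANK = {
    "normal": 0,
    "warninghigh": 1, "warninglow": 1,
    "criticalhigh": 2, "criticallow": 2,
    "nonrecoverablehigh": 3, "nonrecoverablelow": 3,
}


def events_for(statuses: Iterable[str]) -> List[str]:
    """Return the statuses that would be reported as events, in order.

    `statuses` holds the status computed at each poll. An event is reported
    when the severity rises above the last reported status, or when the value
    returns to the safe range after a threshold was broken.
    """
    reported = "normal"  # assumption: monitoring starts in the safe range
    events = []
    for status in statuses:
        rose = RANK[status] > RANK[reported]
        recovered = status == "normal" and reported != "normal"
        if rose or recovered:
            events.append(status)
            reported = status
    return events
```

A value that climbs through all three high thresholds and then falls back produces four events: `warninghigh`, `criticalhigh`, `nonrecoverablehigh`, and `normal`. The polls taken while the value is falling through the `criticalhigh` and `warninghigh` bands produce none.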

Threshold values for OS resource utilization attributes can be configured at the command line. This process is explained in Setting Threshold Values. For threshold values measuring percentages, the valid range is from 0% to 100%. If you try to set a threshold value outside of this range, an error is generated. For attributes that do not measure percentages, suitable threshold values depend on the number of processors in your system and on the usage characteristics of your installation.
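The range check for percentage thresholds can be illustrated with a short Python sketch. The function names are hypothetical, and so is the rule used to recognize a percentage attribute (a final name component that starts with `pct`, as in `cpustats.pctusage`). That rule is inferred from the attribute names in this chapter, not taken from the product. The sketch treats both 0 and 100 as valid.

```python
def is_percentage_attribute(attribute: str) -> bool:
    """Guess whether an attribute measures a percentage from its name."""
    # Assumption: percentage attributes end in a component such as
    # "pctusage" or "pctused"; other attributes (load averages, MB values) do not.
    return attribute.rsplit(".", 1)[-1].startswith("pct")


def validate_threshold(attribute: str, value: float) -> float:
    """Return `value` unchanged, or raise ValueError for an invalid percentage."""
    if is_percentage_attribute(attribute) and not 0 <= value <= 100:
        raise ValueError(
            f"{attribute}: percentage threshold must be between 0 and 100, got {value}"
        )
    # Non-percentage attributes have no fixed range in this sketch.
    return value
```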

Tuning Threshold Values for Your Installation

After a period of usage, you can develop an awareness of what levels to set for OS resource utilization attribute values. You can adjust thresholds once you determine more closely what value indicates a genuine justification for an event to be generated and for a notification to be sent to your pager or email address. For example, you might want to receive notifications every time a certain attribute reaches a warninghigh severity threshold level.

For important or crucial attributes at your installation, you can set the warninghigh threshold level to a low percentage value so that you are notified about a rising value as early as possible.

To Retrieve Threshold Values for a Server

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Type the show server command:
N1-ok> show server server
In this procedure, server is the name of the provisionable server for which you want to retrieve threshold values.

Detailed monitoring threshold values appear in the output, including threshold information for the server's hardware health, OS resource utilization, and network reachability. Default values are shown if no specific values have been set.

See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Managing Default Threshold Values

Factory-configured default threshold values are provided in the N1 System Manager software for some OS resource utilization thresholds. These values are stated as percentages. Table 5–1 lists default values for these OS resource utilization attributes.

Note –

Setting or modifying threshold values for hardware health attributes is not supported in this version of the Sun N1 System Manager.

Table 5–1 Factory-Configured Default Threshold Values for OS Resource Utilization Attributes


Attribute Name	Description	Default Warning Threshold	Default Critical Threshold
`cpustats.pctusage`	Percentage of overall CPU usage	`warninghigh` 80%	`criticalhigh` 90%
`cpustats.pctidle`	Percentage of CPU idle	`warninglow` 20%	`criticallow` 10%
`memusage.pctmemused`	Percentage of memory in use	`warninghigh` 80%	`criticalhigh` 90%
`memusage.pctmemfree`	Percentage of memory free	`warninglow` 20%	`criticallow` 10%
`memusage.pctswapused`	Percentage of swap space in use	`warninghigh` 80%	`criticalhigh` 90%
`fsusage.pctused`	Percentage of file system space in use	`warninghigh` 80%	`criticalhigh` 90%

Table 5–2 provides the complete list of OS resource utilization attributes and their default values. Where factory-configured default values exist for attributes, these are shown in parentheses.

Table 5–2 All OS Resource Utilization Attributes


Attribute Name	Description	Supported Warning Threshold (Default)	Supported Critical Threshold (Default)
`cpustats.loadavg1min`	System load expressed as average number of queued processes over 1 minute	`warninghigh`	`criticalhigh`
`cpustats.loadavg5min`	System load expressed as average number of queued processes over 5 minutes	`warninghigh`	`criticalhigh`
`cpustats.loadavg15min`	System load expressed as average number of queued processes over 15 minutes	`warninghigh`	`criticalhigh`
`cpustats.pctusage`	Percentage of overall CPU usage	`warninghigh` (80%)	`criticalhigh` (90%)
`cpustats.pctidle`	Percentage of CPU idle	`warninglow` (20%)	`criticallow` (10%)
`memusage.pctmemused`	Percentage of memory in use	`warninghigh` (80%)	`criticalhigh` (90%)
`memusage.pctmemfree`	Percentage of memory free	`warninglow` (20%)	`criticallow` (10%)
`memusage.mbmemused`	Memory in use in MB	`warninghigh`	`criticalhigh`
`memusage.mbmemfree`	Memory free in MB	`warninglow`	`criticallow`
`memusage.pctswapused`	Percentage of swap space in use	`warninghigh` (80%)	`criticalhigh` (90%)
`memusage.mbswapfree`	Free swap space in MB	`warninglow`	`criticallow`
`fsusage.pctused`	Percentage of file system space in use	`warninghigh` (80%)	`criticalhigh` (90%)

Changing Threshold Values With the Monitoring Configuration File

You can modify default values for thresholds by editing the monitoring.properties configuration file.

If the monitoring.properties configuration file is not present, create and save it in /etc/opt/sun/n1gc/. The monitoring.properties configuration file is not created by default at installation.

Any entries that you make in the monitoring.properties configuration file for the threshold values of the attributes listed in Table 5–1 overwrite the factory-configured defaults for the corresponding threshold values.

The monitoring.properties configuration file should be stored only on the management server and not on provisionable servers.

Modifying or adding new entries to the monitoring.properties configuration file affects all the provisionable servers managed by the N1 System Manager.

Specific threshold values can be set at the command line by following the procedures described in Setting Threshold Values.

Once a default value for a monitored item has been modified by manually adding it in the monitoring.properties configuration file, that modified default value applies to all provisionable servers except those servers for which specific values for the monitored attribute have been set at the command line.
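The precedence among the three sources of threshold values can be pictured as layered lookups. The following Python sketch is only a model of the rules stated above: the function name and the `(attribute, threshold)` key layout are hypothetical, and `FACTORY_DEFAULTS` holds just a subset of Table 5–1 for brevity.

```python
from collections import ChainMap

# A subset of the factory-configured defaults from Table 5-1,
# keyed by (attribute, threshold).
FACTORY_DEFAULTS = {
    ("cpustats.pctusage", "warninghigh"): 80,
    ("cpustats.pctusage", "criticalhigh"): 90,
    ("fsusage.pctused", "warninghigh"): 80,
    ("fsusage.pctused", "criticalhigh"): 90,
}


def effective_thresholds(server_overrides, properties_defaults):
    """Layer the three sources; a mapping listed earlier wins over a later one.

    server_overrides:    values set at the command line for one server
    properties_defaults: entries from the monitoring.properties file
    """
    return ChainMap(server_overrides, properties_defaults, FACTORY_DEFAULTS)
```

With a `monitoring.properties` entry that sets the `criticalhigh` threshold of `fsusage.pctused` to 75, a server with no command-line value resolves that threshold to 75, while a server on which 87 was set at the command line resolves it to 87. Thresholds with no entry in either layer keep their factory defaults.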

Note –

You do not need to reboot the management server or the monitored provisionable server for changes to the monitoring.properties file to take effect.

Monitored attributes for hardware health that are declared as percentages cannot be changed either at the command line or in the monitoring.properties file.

To Modify Default Threshold Values for a Server

To modify default threshold values, edit the /etc/opt/sun/n1gc/monitoring.properties file. Only those default threshold values that relate to OS resource utilization attributes can be modified. Hardware health attribute default threshold values cannot be modified for servers.

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Open the /etc/opt/sun/n1gc/monitoring.properties file.

If the file does not exist, create it.

Modify or add lines in the monitoring.properties file that describe default threshold values.

threshold.attribute.threshold=value

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring.

The threshold is one of criticallow, warninglow, warninghigh, or criticalhigh.

The value is a numeric figure and usually represents a percentage value.

Save the file.

You do not need to reboot the management server or the provisionable server for the changes to take effect. The modified default threshold values now apply to all servers managed by the N1 System Manager.

Example 5–1 Modifying the Default Threshold Value for File System Usage

This example shows how to modify the default criticalhigh threshold value for file system usage to 75 percent of maximum file system usage capacity. The following line is added to or amended in the /etc/opt/sun/n1gc/monitoring.properties file:

threshold.fsusage.pctused.criticalhigh=75

This value applies to all provisionable servers, unless you have set specific values for the threshold value at the command line, by using the set command as described in Setting Threshold Values.
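To make the entry format concrete, the following Python sketch parses one line of the kind shown in Example 5–1. It is an illustration only and is not part of the N1 System Manager. The function name is hypothetical, and the sketch assumes the `key=value` form used in the example and the usual convention that lines starting with `#` are comments.

```python
VALID_THRESHOLDS = {"criticallow", "warninglow", "warninghigh", "criticalhigh"}


def parse_threshold_line(line: str):
    """Parse one `threshold.<attribute>.<threshold>=<value>` entry.

    Returns (attribute, threshold, value), or None for blank lines and
    `#` comments. Raises ValueError for anything else.
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    key, sep, raw_value = text.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in entry: {line!r}")
    # The attribute itself contains a dot (for example fsusage.pctused),
    # so take the first component as the prefix and the last as the threshold.
    prefix, _, rest = key.strip().partition(".")
    attribute, _, threshold = rest.rpartition(".")
    if prefix != "threshold" or not attribute or threshold not in VALID_THRESHOLDS:
        raise ValueError(f"not a threshold entry: {line!r}")
    return attribute, threshold, float(raw_value)
```

Applied to the line from Example 5–1, the function returns the attribute `fsusage.pctused`, the threshold `criticalhigh`, and the value 75.0.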

Threshold values can be disabled. This process is shown in Example 5–4.

Hardware Sensor Attributes

For x86 servers, the management server software obtains the list of hardware sensor attributes to monitor through IPMI from the service processor of the server. For servers running the SPARC architecture, the ALOM interface is used. The list of hardware sensor attributes can vary from server to server, and between firmware versions. A sample listing for some servers and firmware versions is provided in this section. The list depends on the server type and on the number of CPUs that the server has.

Note –

Hardware disk failure and memory failure are not monitored in this version of the N1 System Manager.

The following list contains sensor names and descriptions for a Sun Fire V40z server with firmware version 2.1.0.16:

ambienttemp     Ambient air temp
bulk.v12-0-s0   Bulk 12V S0 voltage at CPU 0
bulk.v12-2-s0   Bulk 12V S0 voltage at CPU 2
bulk.v12-3-s0   Bulk 12V S0 voltage at CPU 3
bulk.v1_8-s0    Bulk 1.8V S0 voltage
bulk.v1_8-s5    Bulk 1.8V S5 voltage
bulk.v2_5-s0    Bulk 2.5V S0 voltage
bulk.v2_5-s0-dc Bulk 2.5V S0 voltage at DC
bulk.v2_5-s5    Bulk 2.5V S5 voltage
bulk.v3_3-s0    Bulk 3.3V S0 voltage
bulk.v3_3-s0-dc Bulk 3.3V S0 voltage at DC
bulk.v3_3-s3    Bulk 3.3V S3 voltage
bulk.v3_3-s5    Bulk 3.3V S5 voltage
bulk.v3_3-s5-dc Aux 3.3V S5 voltage at DC
bulk.v5-s0      Bulk 5V S0 voltage
bulk.v5-s0-dc   Bulk 5V S0 voltage at DC
bulk.v5-s5      Bulk 5V S5 voltage
bulk.v5-s5-dc   Bulk 5V S5 voltage at DC
cd.lp           CDROM Light path location LED
cpu0.dietemp    CPU 0 Die temperature
cpu0.heartbeat  CPU 0 Heartbeat
cpu0.inlettemp  CPU 0 Inlet temperature
cpu0.lp         CPU 0 Light path location LED
cpu0.mem0.lp    CPU 0 Dimm 0 Light path location LED
cpu0.mem1.lp    CPU 0 Dimm 1 Light path location LED
cpu0.mem2.lp    CPU 0 Dimm 2 Light path location LED
cpu0.mem3.lp    CPU 0 Dimm 3 Light path location LED
cpu0.memtemp    CPU 0 Memory temperature
cpu0.memvrm.lp  CPU 0 Memory VRM Light path location LED
cpu0.v2_5-s0    CPU 0 VDDA (2.5V) S0 voltage
cpu0.v2_5-s3    CPU 0 VDD (2.5V) S3 voltage
cpu0.vcore-s0   CPU 0 VCore S0 voltage
cpu0.vid        CPU 0 VID Selection
cpu0.vldt0      CPU 0 LDT0 voltage
cpu0.vrm.lp     CPU 0 VRM Light path location LED
cpu0.vtt-s3     CPU 0 DDR VTT S3 voltage
cpu1.dietemp    CPU 1 Die temperature
cpu1.heartbeat  CPU 1 Heartbeat
cpu1.inlettemp  CPU 1 Inlet temperature
cpu1.lp         CPU 1 Light path location LED
cpu1.mem0.lp    CPU 1 Dimm 0 Light path location LED
cpu1.mem1.lp    CPU 1 Dimm 1 Light path location LED
cpu1.mem2.lp    CPU 1 Dimm 2 Light path location LED
cpu1.mem3.lp    CPU 1 Dimm 3 Light path location LED
cpu1.memtemp    CPU 1 Memory temperature
cpu1.memvrm.lp  CPU 1 Memory VRM Light path location LED
cpu1.v2_5-s0    CPU 1 VDDA (2.5V) S0 voltage
cpu1.v2_5-s3    CPU 1 VDD (2.5V) S3 voltage
cpu1.vcore-s0   CPU 1 VCore S0 voltage
cpu1.vid        CPU 1 VID Selection
cpu1.vldt1      CPU 1 LDT1 voltage
cpu1.vldt2      CPU 1 LDT2 voltage
cpu1.vrm.lp     CPU 1 VRM Light path location LED
cpu1.vtt-s3     CPU 1 DDR VTT S3 voltage
cpu2.dietemp    CPU 2 Die temperature
cpu2.heartbeat  CPU 2 Heartbeat
cpu2.inlettemp  CPU 2 inlet temperature
cpu2.lp         CPU 2 Light path location LED
cpu2.mem0.lp    CPU 2 Dimm 0 Light path location LED
cpu2.mem1.lp    CPU 2 Dimm 1 Light path location LED
cpu2.mem2.lp    CPU 2 Dimm 2 Light path location LED
cpu2.mem3.lp    CPU 2 Dimm 3 Light path location LED
cpu2.memvrm.lp  CPU 2 Memory VRM Light path location LED
cpu2.temp       CPU 2 downwind temperature
cpu2.v2_5-s0    CPU 2 VDDA (2.5V) S0 voltage
cpu2.v2_5-s3    CPU 2 VDD (2.5V) S3 voltage
cpu2.vcore-s0   CPU 2 VCore S0 voltage
cpu2.vid        CPU-2 VID Selection
cpu2.vrm.lp     CPU 2 VRM Light path location LED
cpu2.vtt-s3     CPU 2 DDR VTT voltage
cpu3.dietemp    CPU 3 Die temperature
cpu3.heartbeat  CPU 3 Heartbeat
cpu3.inlettemp  CPU 3 inlet temperature
cpu3.lp         CPU 3 Light path location LED
cpu3.mem0.lp    CPU 3 Dimm 0 Light path location LED
cpu3.mem1.lp    CPU 3 Dimm 1 Light path location LED
cpu3.mem2.lp    CPU 3 Dimm 2 Light path location LED
cpu3.mem3.lp    CPU 3 Dimm 3 Light path location LED
cpu3.memvrm.lp  CPU 3 Memory VRM Light path location LED
cpu3.temp       CPU 3 downwind temperature
cpu3.v2_5-s0    CPU 3 VDDA (2.5V) S0 voltage
cpu3.v2_5-s3    CPU 3 VDD (2.5V) S3 voltage
cpu3.vcore-s0   CPU 3 VCore S0 voltage
cpu3.vid        CPU-3 VID Selection
cpu3.vrm.lp     CPU 3 VRM Light path location LED
cpu3.vtt-s3     CPU 3 DDR VTT voltage
cpuplanar.lp    Daughtercard Light path location LED
fan1.tach       Fan 1 measured speed
fan10.tach      Fan 10 measured speed
fan11.tach      Fan 11 measured speed
fan12.tach      Fan 12 measured speed
fan2.tach       Fan 2 measured speed
fan3.tach       Fan 3 measured speed
fan4.tach       Fan 4 measured speed
fan5.tach       Fan 5 measured speed
fan6.tach       Fan 6 measured speed
fan7.tach       Fan 7 measured speed
fan8.tach       Fan 8 measured speed
fan9.tach       Fan 9 measured speed
faultswitch     System Fault Indication
floppy.lp       Floppy Light path location LED
frontpanel.lp   LCD Light path location LED
g0.vldt1        AMD-8131 PCI-X Tunnel 0 LDT1 voltage
g1.vldt1        AMD-8131 PCI-X Tunnel 1 LDT1 voltage
gbeth.temp      Gigabit ethernet local temperature
golem-v1_8-s0   AMD-8131 PCI-X Tunnel 1.8V S0 voltage
identifyswitch  Identify switch
pci1.lp         PCI Slot 1 Light path location LED
pci2.lp         PCI Slot 2 Light path location LED
pci3.lp         PCI Slot 3 Light path location LED
pci4.lp         PCI Slot 4 Light path location LED
pci5.lp         PCI Slot 5 Light path location LED
pci6.lp         PCI Slot 6 Light path location LED
pci7.lp         PCI Slot 7 Light path location LED
pcifan.lp       Fan Board Light path location LED
planar.lp       Motherboard Light path location LED
scsibp.lp       SCSI Backplane Light path location LED
scsibp.temp     SCSI Disk backplane temperature
scsifault       SCSI Disk Fault Switch
sp.temp         SP local temperature
vldt-reg1-dc    LDT Regulator 1 Voltage
vldt-reg2-dc    LDT Regulator 2 Voltage

The following list contains sensor names and descriptions for a Sun Fire V20z server with firmware version 2.1.0.16:

ambienttemp    Ambient air temp
bulk.v12-0-s0  Bulk 12v supply voltage (cpu0)
bulk.v12-1-s0  Bulk 12v supply voltage (cpu1)
bulk.v1_8-s0   Bulk 1.8v S0 voltage
bulk.v1_8-s5   Bulk 1.8v S5 voltage
bulk.v2_5-s0   Bulk 2.5v S0 voltage
bulk.v2_5-s5   Bulk 2.5v S5 voltage
bulk.v3_3-s0   Bulk 3.3v supply
bulk.v3_3-s3   Bulk 3.3v S3 voltage
bulk.v3_3-s5   Bulk 3.3v S5 voltage
bulk.v5-s0     Bulk 5v supply voltage
bulk.v5-s5     Bulk 5v S5 voltage
cd.lp          CD-ROM Light path location led
cpu0.dietemp   CPU 0 die temp
cpu0.heartbeat CPU 0 heartbeat
cpu0.lp        CPU 0 Light path location led
cpu0.mem0.lp   CPU 0 Dimm 0 Light path location led
cpu0.mem1.lp   CPU 0 Dimm 1 Light path location led
cpu0.mem2.lp   CPU 0 Dimm 2 Light path location led
cpu0.mem3.lp   CPU 0 Dimm 3 Light path location led
cpu0.memtemp   CPU 0 memory temp
cpu0.memvrm.lp CPU 0 Memory VRM Light path location led
cpu0.temp      CPU 0 low side temp
cpu0.v2_5-s0   CPU VDDA voltage
cpu0.v2_5-s3   CPU 0 VDDIO voltage
cpu0.vcore-s0  CPU 0 core voltage
cpu0.vid       CPU-0 VID output
cpu0.vldt1     CPU0 HT 1 voltage
cpu0.vldt2     CPU 0 HT 2 voltage
cpu0.vrm.lp    CPU 0 VRM Light path location led
cpu0.vtt-s3    CPU 0 VTT voltage
cpu1.dietemp   CPU 1 die temp
cpu1.heartbeat CPU 1 heartbeat
cpu1.lp        CPU 1 Light path location led
cpu1.mem0.lp   CPU 1 Dimm 0 Light path location led
cpu1.mem1.lp   CPU 1 Dimm 1 Light path location led
cpu1.mem2.lp   CPU 1 Dimm 2 Light path location led
cpu1.mem3.lp   CPU 1 Dimm 3 Light path location led
cpu1.memtemp   CPU 1 memory temp
cpu1.memvrm.lp CPU 1 Memory VRM Light path location led
cpu1.temp      CPU 1 low side temp
cpu1.v2_5-s3   CPU 1 VDDIO voltage
cpu1.vcore-s0  CPU 1 core voltage
cpu1.vid       CPU-1 VID output
cpu1.vrm.lp    CPU 1 VRM Light path location led
cpu1.vtt-s3    CPU 1 VTT voltage
fan1.tach      Fan 1 measured speed
fan2.tach      Fan 2 measured speed
fan3.tach      Fan 3 measured speed
fan4.tach      Fan 4 measured speed
fan5.tach      Fan 5 measured speed
fan6.tach      Fan 6 measured speed
faultswitch    Fault switch (source for eval)
floppy.lp      Floppy Disk Drive Light path location led
frontpanel.lp  LCD Light path location led
g.vldt1        AMD-8131 PCI-X Tunnel HT 1 voltage
gbeth.temp     Gigabit ethernet temp
golem.temp     PCIX bridge temp
hdd1.lp        Hard Disk Drive 1 Light path location led
hdd2.lp        Hard Disk Drive 2 Light path location led
hddbp.lp       Hard Disk Drive Backplane Light path location led
hddbp.temp     Disk drive backplane temp
identifyswitch Identify switch
pci1.lp        PCI Slot 1 Light path location led
pci2.lp        PCI Slot 2 Light path location led
planar.lp      Motherboard Light path location led
ps.fanfail     Power Supply fan failure sensor
ps.lp          Powersupply Light path location led
ps.tempalert   Power Supply too hot sensor
sp.temp        SP temp
thor.temp      AMD-8111 I/O Hub temp

Monitoring data is retrieved by the N1 System Manager from many of these sensors. For Sun Fire x4100 and x4200 servers, only analog sensors, that is, sensors that describe fan speed, voltage, and temperature, are used to retrieve data. For descriptions of sensors in the Sun Fire x4100 and x4200 servers, refer to the IPMI reference information in the Sun Fire x4100 and x4200 server product documentation.

Setting Threshold Values

Threshold values for monitored objects can be set on specific servers. A threshold value set at the command line for an attribute of a monitored object overrides, for that object, any factory-configured threshold value for that attribute. Any entries for the attribute in the monitoring.properties configuration file are also overridden.

To Set Threshold Values for a Server

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Use the set server command with the threshold attribute.

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.

The threshold is one of criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure and usually represents a percentage.
- To set one threshold value, type the following:
  N1-ok> set server server threshold attribute threshold value
- To set multiple threshold values for the server, type the following:
  N1-ok> set server server threshold attribute threshold value threshold value
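If threshold commands are generated by a script, the syntax above can be assembled as in the following Python sketch. The helper name `build_set_threshold_command` is hypothetical. The function only builds the command text and does not run it. The `kind` parameter also accepts `group`, on the assumption that the `set group` form described later in this chapter has the same shape.

```python
from typing import Mapping

VALID_THRESHOLDS = ("criticallow", "warninglow", "warninghigh", "criticalhigh")


def build_set_threshold_command(kind: str, name: str, attribute: str,
                                thresholds: Mapping[str, float]) -> str:
    """Build the text of a `set server` or `set group` threshold command."""
    if kind not in ("server", "group"):
        raise ValueError(f"kind must be 'server' or 'group', got {kind!r}")
    if not thresholds:
        raise ValueError("at least one threshold value is required")
    parts = ["set", kind, name, "threshold", attribute]
    for threshold, value in thresholds.items():
        if threshold not in VALID_THRESHOLDS:
            raise ValueError(f"unsupported threshold: {threshold!r}")
        # "g" formatting prints 53 as "53" and 87.5 as "87.5".
        parts += [threshold, format(value, "g")]
    return " ".join(parts)
```

For a server named `serv1`, the attribute `cpustats.pctusage`, and the values 53 for `warninghigh` and 75 for `criticalhigh`, the function returns `set server serv1 threshold cpustats.pctusage warninghigh 53 criticalhigh 75`.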

Example 5–2 Setting Multiple Threshold Values for CPU Usage on a Server

This example shows how to set the CPU usage warninghigh severity threshold on a provisionable server named serv1 to 53 percent. This example also shows how to set the criticalhigh severity threshold value to 75 percent.

N1-ok> set server serv1 threshold cpustats.pctusage warninghigh 53 criticalhigh 75

For the server named serv1, these values override the default values stored in the monitoring.properties configuration file on the management server.

Example 5–3 Setting Multiple Threshold Values for File System Usage on a Server

This example sets the file system usage warninghigh threshold on a provisionable server named serv1 to 75 percent. This example also shows how to set the criticalhigh threshold value to 87 percent.

N1-ok> set server serv1 threshold fsusage.pctused warninghigh 75 criticalhigh 87

Example 5–4 Deleting a Threshold Value for File System Usage on a Server

This example shows how to delete a value that was set for the warninghigh threshold on a provisionable server named serv1.

N1-ok> set server serv1 threshold fsusage.pctused warninghigh none

In this case, any previously set value for this threshold at this severity is deleted. The threshold does not revert to the default value stored in the monitoring.properties configuration file, or to the factory-configured default, if one exists for the attribute. In effect, monitoring is disabled for the warninghigh threshold for file system usage for this server.
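The difference between a threshold that was never set on a server and one that was deleted with `none` can be modeled as follows. This Python sketch is illustrative only. The names `resolve` and `DISABLED`, and the idea of recording the deletion as a marker in the per-server values, are assumptions of the sketch and do not describe the product's internals.

```python
from typing import Mapping, Optional, Tuple, Union

Key = Tuple[str, str]  # (attribute, threshold)
DISABLED = "none"      # marker recorded when a value is deleted with `none`


def resolve(overrides: Mapping[Key, Union[float, str]],
            defaults: Mapping[Key, float],
            key: Key) -> Optional[float]:
    """Return the threshold in force for one server, or None if not monitored."""
    if key in overrides:
        value = overrides[key]
        # A deleted value does not fall back to any default.
        return None if value == DISABLED else float(value)
    # Nothing was set for this server, so the default (if any) applies.
    return defaults.get(key)
```

With a default of 80 for the `warninghigh` threshold of `fsusage.pctused`, a server with no per-server value resolves to 80, a server set to 75 resolves to 75.0, and a server on which the value was deleted resolves to `None`, which means the threshold is not monitored.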

To Set Threshold Values for a Server Group

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Use the set group command with the threshold attribute.

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.

The threshold is one of criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure and usually represents a percentage.
- To modify one threshold for the server group:
  N1-ok> set group group threshold attribute threshold value
- To modify multiple thresholds for the server group:
  N1-ok> set group group threshold attribute threshold value threshold value

Example 5–5 Setting Multiple Threshold Values for File System Usage on a Server Group

This example shows how to set the file system usage warninghigh threshold to 75 percent on a group of provisionable servers with a group name of grp3. This example also shows how to set the criticalhigh threshold severity value to 87 percent.

N1-ok> set group grp3 threshold fsusage.pctused warninghigh 75 criticalhigh 87