The value of any given monitored attribute is compared to a threshold value. Low and high threshold values are defined and can be configured.
Attribute data is compared against thresholds at regular intervals. These polling intervals are configurable. For further information about polling intervals, see Setting Polling Intervals.
When a monitored attribute is polled and its value is outside the safe range defined by the default or user-defined thresholds, an event is generated and a status is issued. If the value is lower than the low threshold or higher than the high threshold, an event is generated to show a status of nonrecoverable, critical, or warning, depending on the severity of the threshold. Otherwise, the status of the monitored attribute is OK, provided that a value can be obtained.
If no value can be obtained, an event is generated to show that the status of the monitored attribute is unknown. The health of an OS resource can be shown as unknown if the server is reachable but the monitoring agent cannot be contacted on SNMP port 161.
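The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the N1 System Manager implementation: the function name `classify` and the dictionary layout are hypothetical, and the sketch assumes that a value exactly equal to a threshold is still inside the safe range.

```python
from typing import Mapping, Optional

# Threshold names used in this chapter, ordered from most to least severe.
HIGH_LEVELS = ("nonrecoverablehigh", "criticalhigh", "warninghigh")
LOW_LEVELS = ("nonrecoverablelow", "criticallow", "warninglow")


def classify(value: Optional[float], thresholds: Mapping[str, float]) -> str:
    """Return the status for one polled value (hypothetical helper)."""
    if value is None:
        return "unknown"              # no value could be obtained
    for name in HIGH_LEVELS:          # check the most severe high threshold first
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            return name
    for name in LOW_LEVELS:           # then the most severe low threshold
        limit = thresholds.get(name)
        if limit is not None and value < limit:
            return name
    return "OK"


cpu = {"warninghigh": 80, "criticalhigh": 90}
print(classify(85, cpu))    # warninghigh
print(classify(95, cpu))    # criticalhigh
print(classify(42, cpu))    # OK
print(classify(None, cpu))  # unknown
```

Thresholds that are not configured are simply skipped, so an attribute with only high thresholds never reports a low status.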
The values nonrecoverable, critical, and warning are discussed in show server in Sun N1 System Manager 1.1 Command Line Reference Manual.
If the value of a monitored attribute rises above the warninghigh threshold, a status of warninghigh is issued. If the value continues to rise and passes the criticalhigh threshold, a status of criticalhigh is issued. If the value continues to rise above the nonrecoverablehigh threshold, a status of nonrecoverablehigh is issued.
If the value then falls back toward the safe range, no further events are generated until the value falls below the warninghigh threshold, at which point an event is generated to show a status of normal.
If the value of a monitored attribute falls below the warninglow threshold, a status of warninglow is issued. If the value continues to fall, and passes the criticallow threshold, a status of criticallow is issued. If the value continues to fall below the nonrecoverablelow threshold, a status of nonrecoverablelow is issued.
If the value then rises back toward the safe range, no further events are generated until the value rises above the warninglow threshold, at which point an event is generated to show a status of normal.
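One way to read the event behavior described above is: an event is generated each time a more severe threshold is crossed, nothing is generated while the value recedes through less severe thresholds, and a single normal event is generated once the value is back inside the warning thresholds. The following Python sketch models that reading only. The names `crossed` and `events` are hypothetical, and the sketch does not handle a value that swings directly from the high side to the low side.

```python
from typing import Iterable, List, Mapping, Tuple

SEVERITIES = ("warning", "critical", "nonrecoverable")  # least to most severe


def crossed(value: float, thresholds: Mapping[str, float]) -> Tuple[int, str]:
    """Return (rank, status) for the most severe threshold crossed.

    Rank 0 with status "normal" means the value is in the safe range.
    """
    result = (0, "normal")
    for rank, severity in enumerate(SEVERITIES, start=1):
        high = thresholds.get(severity + "high")
        low = thresholds.get(severity + "low")
        if high is not None and value > high:
            result = (rank, severity + "high")
        elif low is not None and value < low:
            result = (rank, severity + "low")
    return result


def events(samples: Iterable[float], thresholds: Mapping[str, float]) -> List[str]:
    """Return the statuses that would be reported for a series of polled values."""
    reported: List[str] = []
    worst = 0  # most severe rank reported since the last "normal" event
    for value in samples:
        rank, status = crossed(value, thresholds)
        if rank > worst:                 # escalation: report the new status
            reported.append(status)
            worst = rank
        elif rank == 0 and worst > 0:    # fully recovered: report normal once
            reported.append("normal")
            worst = 0
        # otherwise the value is receding or unchanged: no event
    return reported


limits = {"warninghigh": 80, "criticalhigh": 90, "nonrecoverablehigh": 95}
print(events([70, 85, 92, 97, 93, 86, 75, 60], limits))
# ['warninghigh', 'criticalhigh', 'nonrecoverablehigh', 'normal']
```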
Threshold values for OS resource utilization attributes can be configured at the command line. This process is explained in Setting Threshold Values. For threshold values measuring percentages, the valid range is from 0 to 100%. If you try to set a threshold value outside of this range, an error is generated. For attributes that do not measure percentages, these values depend on the number of processors in your system and on the usage characteristics of your installation.
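The range rule for percentage thresholds can be illustrated with a small check. This is a sketch under one assumption that the documentation does not state: that percentage attributes can be recognized because the last part of the attribute name starts with `pct`, as in `cpustats.pctusage` or `fsusage.pctused`. The function name is hypothetical.

```python
def validate_percent_threshold(attribute: str, value: float) -> float:
    """Reject percentage thresholds outside 0 to 100 (hypothetical check)."""
    # Assumption: a percentage attribute has a final name part that starts
    # with "pct", for example cpustats.pctusage or fsusage.pctused.
    is_percentage = attribute.rsplit(".", 1)[-1].startswith("pct")
    if is_percentage and not 0 <= value <= 100:
        raise ValueError(f"{attribute}: {value} is outside the range 0 to 100")
    return value


print(validate_percent_threshold("cpustats.pctusage", 80))     # 80
print(validate_percent_threshold("cpustats.loadavg1min", 250))  # 250
```

Attributes that do not measure percentages pass through unchanged, because their sensible limits depend on the installation.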
After a period of usage, you develop a sense of what levels to set for OS resource utilization attributes. You can adjust a threshold once you know more closely which value genuinely justifies generating an event and sending a notification to your pager or email address. For example, you might want to receive notifications every time a certain attribute reaches a warninghigh severity threshold level.
For important or crucial attributes at your installation, you can set the warninghigh threshold level to a low percentage value so that you are notified about a rising value as early as possible.
To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.
Log in to the N1 System Manager.
See To Access the N1 System Manager Command Line for details.
Type the show server command:
N1-ok> show server server
In this procedure, server is the name of the provisionable server for which you want to retrieve threshold values.
Detailed monitoring threshold values appear in the output, including threshold information for the server's hardware health, OS resource utilization, and network reachability. Default values are shown if no specific values have been set.
See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.
Factory-configured default threshold values are provided in the N1 System Manager software for some OS resource utilization thresholds. These values are stated as percentages. Table 5–1 lists default values for these OS resource utilization attributes.
Setting or modifying threshold values for hardware health attributes is not supported in this version of the Sun N1 System Manager.
Attribute Name | Description | Default Warning Threshold | Default Critical Threshold
---|---|---|---
cpustats.pctusage | Percentage of overall CPU usage | warninghigh 80% | criticalhigh 90%
cpustats.pctidle | Percentage of CPU idle | warninglow 20% | criticallow 10%
memusage.pctmemused | Percentage of memory in use | warninghigh 80% | criticalhigh 90%
memusage.pctmemfree | Percentage of memory free | warninglow 20% | criticallow 10%
memusage.pctswapused | Percentage of swap space in use | warninghigh 80% | criticalhigh 90%
fsusage.pctused | Percentage of file system space in use | warninghigh 80% | criticalhigh 90%
Table 5–2 provides the complete list of OS resource utilization attributes and their default values. Where factory-configured default values exist for attributes, these are shown in parentheses.
Table 5–2 All OS Resource Utilization Attributes
Attribute Name | Description | Supported Warning Threshold (Default) | Supported Critical Threshold (Default)
---|---|---|---
cpustats.loadavg1min | System load expressed as average number of queued processes over 1 minute | warninghigh | criticalhigh
cpustats.loadavg5min | System load expressed as average number of queued processes over 5 minutes | warninghigh | criticalhigh
cpustats.loadavg15min | System load expressed as average number of queued processes over 15 minutes | warninghigh | criticalhigh
cpustats.pctusage | Percentage of overall CPU usage | warninghigh (80%) | criticalhigh (90%)
cpustats.pctidle | Percentage of CPU idle | warninglow (20%) | criticallow (10%)
memusage.pctmemused | Percentage of memory in use | warninghigh (80%) | criticalhigh (90%)
memusage.pctmemfree | Percentage of memory free | warninglow (20%) | criticallow (10%)
memusage.mbmemused | Memory in use in MB | warninghigh | criticalhigh
memusage.mbmemfree | Memory free in MB | warninglow | criticallow
memusage.pctswapused | Percentage of swap space in use | warninghigh (80%) | criticalhigh (90%)
memusage.mbswapfree | Free swap space in MB | warninglow | criticallow
fsusage.pctused | Percentage of file system space in use | warninghigh (80%) | criticalhigh (90%)
You can modify default values for thresholds by editing the monitoring.properties configuration file.
If the monitoring.properties configuration file is not present, create and save it in /etc/opt/sun/n1gc/. The monitoring.properties configuration file is not created by default at installation.
Any entries that you make in the monitoring.properties configuration file for the threshold values of the attributes listed in Table 5–1 overwrite the factory-configured defaults for the corresponding threshold values.
The monitoring.properties configuration file should be stored only on the management server and not on provisionable servers.
Modifying or adding new entries to the monitoring.properties configuration file affects all the provisionable servers managed by the N1 System Manager.
Specific threshold values can be set at the command line by following the procedures described in Setting Threshold Values.
Once a default value for a monitored item has been modified by manually adding it in the monitoring.properties configuration file, that modified default value applies to all provisionable servers except those servers for which specific values for the monitored attribute have been set at the command line.
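The order in which the three sources of a threshold value apply can be sketched as a simple lookup. The data structures and the function name `effective_threshold` below are hypothetical and only illustrate the precedence described above: a value set at the command line for a specific server wins, then an entry in monitoring.properties, then the factory-configured default.

```python
from typing import Dict, Mapping, Optional, Tuple

# (attribute, threshold), for example ("fsusage.pctused", "criticalhigh")
Key = Tuple[str, str]

# A subset of the factory-configured defaults from Table 5–1.
FACTORY: Dict[Key, float] = {
    ("cpustats.pctusage", "warninghigh"): 80,
    ("cpustats.pctusage", "criticalhigh"): 90,
    ("fsusage.pctused", "warninghigh"): 80,
    ("fsusage.pctused", "criticalhigh"): 90,
}


def effective_threshold(
    key: Key,
    server: str,
    properties: Mapping[Key, float],
    per_server: Mapping[str, Mapping[Key, float]],
) -> Optional[float]:
    """Return the threshold value that applies to one server, or None."""
    server_values = per_server.get(server, {})
    if key in server_values:       # 1. set at the command line for this server
        return server_values[key]
    if key in properties:          # 2. default from monitoring.properties
        return properties[key]
    return FACTORY.get(key)        # 3. factory-configured default, if any


props = {("fsusage.pctused", "criticalhigh"): 75}
overrides = {"serv1": {("fsusage.pctused", "criticalhigh"): 87}}
key = ("fsusage.pctused", "criticalhigh")
print(effective_threshold(key, "serv1", props, overrides))  # 87
print(effective_threshold(key, "serv2", props, overrides))  # 75
```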
You do not need to reboot the management server or the monitored provisionable server for changes to the monitoring.properties file to take effect.
Monitored attributes for hardware health that are declared as percentages cannot be changed either at the command line or in the monitoring.properties file.
To modify default threshold values, edit the /etc/opt/sun/n1gc/monitoring.properties file. Only those default threshold values that relate to OS resource utilization attributes can be modified. Hardware health attribute default threshold values cannot be modified for servers.
To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.
Open the /etc/opt/sun/n1gc/monitoring.properties file.
If the file does not exist, create it.
Modify or add lines in the monitoring.properties file that describe default threshold values.
threshold.attribute.threshold value
The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring.
The threshold is either criticallow, warninglow, warninghigh, or criticalhigh.
The value is a numeric figure and usually represents a percentage value.
Save the file.
You do not need to reboot the management server or the provisionable server for the changes to take effect. The modified default threshold values now apply to all servers managed by the N1 System Manager.
This example shows how to modify the default criticalhigh threshold value for file system usage to 75 percent of maximum file system usage capacity. The following line is added to or amended in the /etc/opt/sun/n1gc/monitoring.properties file:
threshold.fsusage.pctused.criticalhigh=75
This value applies to all provisionable servers, unless you have set specific values for the threshold value at the command line, by using the set command as described in Setting Threshold Values.
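To make the structure of such an entry concrete, the following Python sketch splits lines of the form shown in the example into an attribute, a threshold name, and a value. It is an illustration only: the function name `parse_thresholds` is hypothetical, it handles just the `key=value` form used in the example, and it assumes that lines starting with `#` are comments.

```python
from typing import Dict, Iterable, Tuple

# Threshold names accepted in the monitoring.properties file.
THRESHOLDS = {"criticallow", "warninglow", "warninghigh", "criticalhigh"}


def parse_thresholds(lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Parse threshold.<attribute>.<threshold>=<value> lines (simplified)."""
    result: Dict[Tuple[str, str], float] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, assumed comments, and other text
        key, _, value = line.partition("=")
        parts = key.strip().split(".")
        # The attribute name itself contains a dot (fsusage.pctused), so the
        # threshold name is the last part and the attribute is everything
        # between the leading "threshold" keyword and that last part.
        if len(parts) < 3 or parts[0] != "threshold" or parts[-1] not in THRESHOLDS:
            continue
        result[(".".join(parts[1:-1]), parts[-1])] = float(value.strip())
    return result


sample = ["# defaults for all servers", "threshold.fsusage.pctused.criticalhigh=75"]
print(parse_thresholds(sample))
# {('fsusage.pctused', 'criticalhigh'): 75.0}
```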
Threshold values can be disabled. This process is shown in Example 5–4.
For x86 servers, the management server software obtains the list of hardware sensor attributes to monitor through IPMI from the service processor of the server. For servers running the SPARC architecture, the ALOM interface is used. The list of hardware sensor attributes can vary from server to server, and between firmware versions. A sample listing for some servers and firmware versions is provided in this section. It depends on the server type and on the number of CPUs that the server has.
Hardware disk failure and memory failure are not monitored in this version of the N1 System Manager.
The following list contains sensor names and descriptions for a Sun Fire V40z server with firmware version 2.1.0.16:
Sensor Name | Description
---|---
ambienttemp | Ambient air temp
bulk.v12-0-s0 | Bulk 12V S0 voltage at CPU 0
bulk.v12-2-s0 | Bulk 12V S0 voltage at CPU 2
bulk.v12-3-s0 | Bulk 12V S0 voltage at CPU 3
bulk.v1_8-s0 | Bulk 1.8V S0 voltage
bulk.v1_8-s5 | Bulk 1.8V S5 voltage
bulk.v2_5-s0 | Bulk 2.5V S0 voltage
bulk.v2_5-s0-dc | Bulk 2.5V S0 voltage at DC
bulk.v2_5-s5 | Bulk 2.5V S5 voltage
bulk.v3_3-s0 | Bulk 3.3V S0 voltage
bulk.v3_3-s0-dc | Bulk 3.3V S0 voltage at DC
bulk.v3_3-s3 | Bulk 3.3V S3 voltage
bulk.v3_3-s5 | Bulk 3.3V S5 voltage
bulk.v3_3-s5-dc | Aux 3.3V S5 voltage at DC
bulk.v5-s0 | Bulk 5V S0 voltage
bulk.v5-s0-dc | Bulk 5V S0 voltage at DC
bulk.v5-s5 | Bulk 5V S5 voltage
bulk.v5-s5-dc | Bulk 5V S5 voltage at DC
cd.lp | CDROM Light path location LED
cpu0.dietemp | CPU 0 Die temperature
cpu0.heartbeat | CPU 0 Heartbeat
cpu0.inlettemp | CPU 0 Inlet temperature
cpu0.lp | CPU 0 Light path location LED
cpu0.mem0.lp | CPU 0 Dimm 0 Light path location LED
cpu0.mem1.lp | CPU 0 Dimm 1 Light path location LED
cpu0.mem2.lp | CPU 0 Dimm 2 Light path location LED
cpu0.mem3.lp | CPU 0 Dimm 3 Light path location LED
cpu0.memtemp | CPU 0 Memory temperature
cpu0.memvrm.lp | CPU 0 Memory VRM Light path location LED
cpu0.v2_5-s0 | CPU 0 VDDA (2.5V) S0 voltage
cpu0.v2_5-s3 | CPU 0 VDD (2.5V) S3 voltage
cpu0.vcore-s0 | CPU 0 VCore S0 voltage
cpu0.vid | CPU 0 VID Selection
cpu0.vldt0 | CPU 0 LDT0 voltage
cpu0.vrm.lp | CPU 0 VRM Light path location LED
cpu0.vtt-s3 | CPU 0 DDR VTT S3 voltage
cpu1.dietemp | CPU 1 Die temperature
cpu1.heartbeat | CPU 1 Heartbeat
cpu1.inlettemp | CPU 1 Inlet temperature
cpu1.lp | CPU 1 Light path location LED
cpu1.mem0.lp | CPU 1 Dimm 0 Light path location LED
cpu1.mem1.lp | CPU 1 Dimm 1 Light path location LED
cpu1.mem2.lp | CPU 1 Dimm 2 Light path location LED
cpu1.mem3.lp | CPU 1 Dimm 3 Light path location LED
cpu1.memtemp | CPU 1 Memory temperature
cpu1.memvrm.lp | CPU 1 Memory VRM Light path location LED
cpu1.v2_5-s0 | CPU 1 VDDA (2.5V) S0 voltage
cpu1.v2_5-s3 | CPU 1 VDD (2.5V) S3 voltage
cpu1.vcore-s0 | CPU 1 VCore S0 voltage
cpu1.vid | CPU 1 VID Selection
cpu1.vldt1 | CPU 1 LDT1 voltage
cpu1.vldt2 | CPU 1 LDT2 voltage
cpu1.vrm.lp | CPU 1 VRM Light path location LED
cpu1.vtt-s3 | CPU 1 DDR VTT S3 voltage
cpu2.dietemp | CPU 2 Die temperature
cpu2.heartbeat | CPU 2 Heartbeat
cpu2.inlettemp | CPU 2 inlet temperature
cpu2.lp | CPU 2 Light path location LED
cpu2.mem0.lp | CPU 2 Dimm 0 Light path location LED
cpu2.mem1.lp | CPU 2 Dimm 1 Light path location LED
cpu2.mem2.lp | CPU 2 Dimm 2 Light path location LED
cpu2.mem3.lp | CPU 2 Dimm 3 Light path location LED
cpu2.memvrm.lp | CPU 2 Memory VRM Light path location LED
cpu2.temp | CPU 2 downwind temperature
cpu2.v2_5-s0 | CPU 2 VDDA (2.5V) S0 voltage
cpu2.v2_5-s3 | CPU 2 VDD (2.5V) S3 voltage
cpu2.vcore-s0 | CPU 2 VCore S0 voltage
cpu2.vid | CPU-2 VID Selection
cpu2.vrm.lp | CPU 2 VRM Light path location LED
cpu2.vtt-s3 | CPU 2 DDR VTT voltage
cpu3.dietemp | CPU 3 Die temperature
cpu3.heartbeat | CPU 3 Heartbeat
cpu3.inlettemp | CPU 3 inlet temperature
cpu3.lp | CPU 3 Light path location LED
cpu3.mem0.lp | CPU 3 Dimm 0 Light path location LED
cpu3.mem1.lp | CPU 3 Dimm 1 Light path location LED
cpu3.mem2.lp | CPU 3 Dimm 2 Light path location LED
cpu3.mem3.lp | CPU 3 Dimm 3 Light path location LED
cpu3.memvrm.lp | CPU 3 Memory VRM Light path location LED
cpu3.temp | CPU 3 downwind temperature
cpu3.v2_5-s0 | CPU 3 VDDA (2.5V) S0 voltage
cpu3.v2_5-s3 | CPU 3 VDD (2.5V) S3 voltage
cpu3.vcore-s0 | CPU 3 VCore S0 voltage
cpu3.vid | CPU-3 VID Selection
cpu3.vrm.lp | CPU 3 VRM Light path location LED
cpu3.vtt-s3 | CPU 3 DDR VTT voltage
cpuplanar.lp | Daughtercard Light path location LED
fan1.tach | Fan 1 measured speed
fan10.tach | Fan 10 measured speed
fan11.tach | Fan 11 measured speed
fan12.tach | Fan 12 measured speed
fan2.tach | Fan 2 measured speed
fan3.tach | Fan 3 measured speed
fan4.tach | Fan 4 measured speed
fan5.tach | Fan 5 measured speed
fan6.tach | Fan 6 measured speed
fan7.tach | Fan 7 measured speed
fan8.tach | Fan 8 measured speed
fan9.tach | Fan 9 measured speed
faultswitch | System Fault Indication
floppy.lp | Floppy Light path location LED
frontpanel.lp | LCD Light path location LED
g0.vldt1 | AMD-8131 PCI-X Tunnel 0 LDT1 voltage
g1.vldt1 | AMD-8131 PCI-X Tunnel 1 LDT1 voltage
gbeth.temp | Gigabit ethernet local temperature
golem-v1_8-s0 | AMD-8131 PCI-X Tunnel 1.8V S0 voltage
identifyswitch | Identify switch
pci1.lp | PCI Slot 1 Light path location LED
pci2.lp | PCI Slot 2 Light path location LED
pci3.lp | PCI Slot 3 Light path location LED
pci4.lp | PCI Slot 4 Light path location LED
pci5.lp | PCI Slot 5 Light path location LED
pci6.lp | PCI Slot 6 Light path location LED
pci7.lp | PCI Slot 7 Light path location LED
pcifan.lp | Fan Board Light path location LED
planar.lp | Motherboard Light path location LED
scsibp.lp | SCSI Backplane Light path location LED
scsibp.temp | SCSI Disk backplane temperature
scsifault | SCSI Disk Fault Switch
sp.temp | SP local temperature
vldt-reg1-dc | LDT Regulator 1 Voltage
vldt-reg2-dc | LDT Regulator 2 Voltage
The following list contains sensor names and descriptions for a Sun Fire V20z server with firmware version 2.1.0.16:
Sensor Name | Description
---|---
ambienttemp | Ambient air temp
bulk.v12-0-s0 | Bulk 12v supply voltage (cpu0)
bulk.v12-1-s0 | Bulk 12v supply voltage (cpu1)
bulk.v1_8-s0 | Bulk 1.8v S0 voltage
bulk.v1_8-s5 | Bulk 1.8v S5 voltage
bulk.v2_5-s0 | Bulk 2.5v S0 voltage
bulk.v2_5-s5 | Bulk 2.5v S5 voltage
bulk.v3_3-s0 | Bulk 3.3v supply
bulk.v3_3-s3 | Bulk 3.3v S3 voltage
bulk.v3_3-s5 | Bulk 3.3v S5 voltage
bulk.v5-s0 | Bulk 5v supply voltage
bulk.v5-s5 | Bulk 5v S5 voltage
cd.lp | CD-ROM Light path location led
cpu0.dietemp | CPU 0 die temp
cpu0.heartbeat | CPU 0 heartbeat
cpu0.lp | CPU 0 Light path location led
cpu0.mem0.lp | CPU 0 Dimm 0 Light path location led
cpu0.mem1.lp | CPU 0 Dimm 1 Light path location led
cpu0.mem2.lp | CPU 0 Dimm 2 Light path location led
cpu0.mem3.lp | CPU 0 Dimm 3 Light path location led
cpu0.memtemp | CPU 0 memory temp
cpu0.memvrm.lp | CPU 0 Memory VRM Light path location led
cpu0.temp | CPU 0 low side temp
cpu0.v2_5-s0 | CPU VDDA voltage
cpu0.v2_5-s3 | CPU 0 VDDIO voltage
cpu0.vcore-s0 | CPU 0 core voltage
cpu0.vid | CPU-0 VID output
cpu0.vldt1 | CPU0 HT 1 voltage
cpu0.vldt2 | CPU 0 HT 2 voltage
cpu0.vrm.lp | CPU 0 VRM Light path location led
cpu0.vtt-s3 | CPU 0 VTT voltage
cpu1.dietemp | CPU 1 die temp
cpu1.heartbeat | CPU 1 heartbeat
cpu1.lp | CPU 1 Light path location led
cpu1.mem0.lp | CPU 1 Dimm 0 Light path location led
cpu1.mem1.lp | CPU 1 Dimm 1 Light path location led
cpu1.mem2.lp | CPU 1 Dimm 2 Light path location led
cpu1.mem3.lp | CPU 1 Dimm 3 Light path location led
cpu1.memtemp | CPU 1 memory temp
cpu1.memvrm.lp | CPU 1 Memory VRM Light path location led
cpu1.temp | CPU 1 low side temp
cpu1.v2_5-s3 | CPU 1 VDDIO voltage
cpu1.vcore-s0 | CPU 1 core voltage
cpu1.vid | CPU-1 VID output
cpu1.vrm.lp | CPU 1 VRM Light path location led
cpu1.vtt-s3 | CPU 1 VTT voltage
fan1.tach | Fan 1 measured speed
fan2.tach | Fan 2 measured speed
fan3.tach | Fan 3 measured speed
fan4.tach | Fan 4 measured speed
fan5.tach | Fan 5 measured speed
fan6.tach | Fan 6 measured speed
faultswitch | Fault switch (source for eval)
floppy.lp | Floppy Disk Drive Light path location led
frontpanel.lp | LCD Light path location led
g.vldt1 | AMD-8131 PCI-X Tunnel HT 1 voltage
gbeth.temp | Gigabit ethernet temp
golem.temp | PCIX bridge temp
hdd1.lp | Hard Disk Drive 1 Light path location led
hdd2.lp | Hard Disk Drive 2 Light path location led
hddbp.lp | Hard Disk Drive Backplane Light path location led
hddbp.temp | Disk drive backplane temp
identifyswitch | Identify switch
pci1.lp | PCI Slot 1 Light path location led
pci2.lp | PCI Slot 2 Light path location led
planar.lp | Motherboard Light path location led
ps.fanfail | Power Supply fan failure sensor
ps.lp | Powersupply Light path location led
ps.tempalert | Power Supply too hot sensor
sp.temp | SP temp
thor.temp | AMD-8111 I/O Hub temp
Monitoring data is retrieved by the N1 System Manager from many of these sensors. For Sun Fire x4100 and x4200 servers, only analog sensors are used to retrieve data, that is, sensors that describe fan speed, voltage, and temperature. For descriptions of sensors in the Sun Fire x4100 and x4200 servers, refer to the IPMI reference information in the Sun Fire x4100 and x4200 server product documentation.
Threshold values for monitored objects can be set on specific servers. Setting specific threshold values at the command line for attributes of a monitored object overrides for that object any factory-configured threshold values concerning the attribute. Any entries in the monitoring.properties configuration file concerning the attribute are also overridden.
To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.
Log in to the N1 System Manager.
See To Access the N1 System Manager Command Line for details.
Use the set server command with the threshold attribute.
The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.
The threshold is either criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure and usually represents a percentage.
This example shows how to set the CPU usage warninghigh severity threshold on a provisionable server named serv1 to 53 percent. This example also shows how to set the criticalhigh severity threshold value to 75 percent.
N1-ok> set server serv1 threshold cpustats.pctusage warninghigh 53 criticalhigh 75
These values override the default values stored in the monitoring.properties configuration file on the management server for the server named serv1.
This example sets the file system usage warninghigh threshold on a provisionable server named serv1 to 75 percent. This example also shows how to set the criticalhigh threshold value to 87 percent.
N1-ok> set server serv1 threshold fsusage.pctused warninghigh 75 criticalhigh 87
This example shows how to delete a value that was set for the warninghigh threshold on a provisionable server named serv1.
N1-ok> set server serv1 threshold fsusage.pctused warninghigh none
In this case, any previously set value for this threshold at this severity is deleted. The threshold value does not revert to the default value stored in the monitoring.properties configuration file, or to the factory-configured default if one exists for the attribute. In effect, monitoring is disabled for the warninghigh threshold for file system usage for this server.
To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.
Log in to the N1 System Manager.
See To Access the N1 System Manager Command Line for details.
Use the set group command with the threshold attribute.
The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.
The threshold is either criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure, and usually represents a percentage.
This example shows how to set the file system usage warninghigh threshold to 75 percent on a group of provisionable servers with a group name of grp3. This example also shows how to set the criticalhigh threshold severity value to 87 percent.
N1-ok> set group grp3 threshold fsusage.pctused warninghigh 75 criticalhigh 87
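When the same thresholds have to be applied to several servers or groups, the command text can be generated rather than typed. The following Python sketch only builds strings in the form shown in the examples above. The function name `threshold_command` is hypothetical, and the sketch does not run the command or contact the N1 System Manager.

```python
from typing import Mapping, Union


def threshold_command(
    target_type: str,
    name: str,
    attribute: str,
    values: Mapping[str, Union[int, float, str]],
) -> str:
    """Build a 'set server' or 'set group' threshold command (hypothetical helper)."""
    if target_type not in ("server", "group"):
        raise ValueError("target_type must be 'server' or 'group'")
    # Each threshold name is followed by its value, in the order given.
    pairs = " ".join(f"{threshold} {value}" for threshold, value in values.items())
    return f"set {target_type} {name} threshold {attribute} {pairs}"


print(threshold_command("group", "grp3", "fsusage.pctused",
                        {"warninghigh": 75, "criticalhigh": 87}))
# set group grp3 threshold fsusage.pctused warninghigh 75 criticalhigh 87
```

Passing the string `none` as a value produces the form used to delete a previously set threshold.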