Sun N1 System Manager 1.1 Administration Guide

Chapter 5 Monitoring Your Servers

The first section of this chapter provides an explanation of what monitoring is, in the context of the N1 System Manager, and describes how to monitor servers that are part of the N1 System Manager. This chapter provides procedures for enabling and disabling monitoring, and for managing monitoring thresholds and polling intervals, using the command line.

This chapter also contains information about managing jobs, event log entries, and about setting up notifications.

Some procedures are also possible using the browser interface. These procedures are provided in the Sun N1 System Manager browser interface help.

Introduction to Monitoring

Monitoring in the Sun N1 System Manager software enables you to track changes to specific attributes in specific managed objects. Managed objects include server hardware elements, operating systems, file systems, and networks. Attributes are the monitored elements, about which data is obtained and delivered by the N1 System Manager software. Examples of attributes are the average number of queued processes and the percentage of used memory. A list of attributes is provided in Hardware Sensor Attributes and in Table 5–2.

Attributes are associated with one of three main areas:

Hardware health attributes. For information about hardware health monitoring, see Hardware Health Monitoring.
OS resource utilization attributes. For information about OS resource utilization monitoring, see OS Resource Utilization Monitoring.
Network connectivity, or reachability. For information about network reachability monitoring, see Network Reachability Monitoring.

For a server or a group of servers, the management server monitors hardware health, OS resource utilization, and network connectivity. All comparisons and verifications for monitoring are performed by the N1 System Manager. Provisionable servers are used only to access data.

An SNMP agent that is used for data retrieval is provided in the N1 System Manager software. If the management server is running the N1 System Manager on the Solaris OS, this agent is based on the Sun Management Center 3.5 software SNMP agent. If the management server is running the N1 System Manager on Linux, this agent is based on the Sun Management Center 3.6 Linux SNMP agent. The agent is deployed when operating systems are deployed on servers that are managed by the N1 System Manager software.

Note –

On Linux platforms, the N1 System Manager software only monitors ext3 file systems. Other types of file systems are not monitored for Linux platforms.

Monitoring is connected with the broadcasting of the events for each monitored server or group of servers. Events are generated when certain conditions related to attributes occur. For information about events and when they occur, see Managing Event Log Entries. There are no log files related to monitoring. Instead of log files, monitoring data is stored as events in the N1 System Manager database.

If monitoring is enabled for a server, each event causes a notification to be emitted from the N1 System Manager for that event. If monitoring is disabled for a server, monitoring events are not generated for that server. Lifecycle events continue to be generated, even with monitoring disabled. Lifecycle events include server discovery, server change or deletion, or server group creation. If you have requested notification of this type of event, you can still receive notifications even with monitoring disabled.

Hardware Health Monitoring

The hardware health of discovered servers is monitored. Sensors provided in the hardware are used to monitor temperature, voltage, and fan speed. For more information about associated hardware, see the Sun N1 System Manager Connection Information in Sun N1 System Manager 1.1 Site Preparation Guide.

Sensor data is retrieved from the service processor for SPARC devices through the Advanced Lights Out Manager (ALOM) interface. Sensor data is retrieved from IPMI for x64 servers.

General management interface data for Sun Fire V20z and Sun Fire V40z machines is obtained through the command line. General management interface data for Sun Fire X4100 and Sun Fire X4200 servers is obtained through IPMI. Data can be retrieved dynamically from the command line.

The following characteristics of server hardware can be monitored:

CPU temperature
Ambient temperature
Fan speed in revolutions per minute
Voltages
LEDs

A detailed list of these sensors is provided in Hardware Sensor Attributes.

You can view filtered hardware health monitoring information for all servers by using the show server command:

N1-ok> show server health health

See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details of possible values of the health filters.

OS Resource Utilization Monitoring

OS resource utilization is monitored by the N1 System Manager. When you run the add server feature command with the agentip keyword, you also use the agentssh keyword to supply the credentials for accessing the monitored server's operating system through ssh. See To Add the OS Monitoring Feature for additional details. This procedure is important for OS resource utilization monitoring but not for monitoring hardware health or network reachability.

Access to the operating system by this mechanism is required primarily for the Remote Command Execution feature. The management features also use this access to retrieve data for OS resource utilization monitoring: all attribute data is retrieved from the server's operating system by using ssh and SNMP. Statistics related to the central processing unit (CPU) are provided, as is data related to memory, swap usage, and file systems. For the purposes of monitoring, this data can be broken down as follows:

System usage, including system idle times
System load, expressed as the average number of queued processes over 1, 5, and 15 minutes
Memory usage and memory free statistics, in megabytes and as percentages
Physical load statistics
Swap space used and space available, in megabytes and as percentages
File system used and space available, as percentages

A list of these attributes is provided in Table 5–2.

You can filter OS resource utilization monitoring information for all servers by using the show server command:

N1-ok> show server utilization utilization

N1-ok> show server utilization unreachable

The health of an OS resource can be shown as unknown if the server is reachable but the monitoring agent cannot be contacted on SNMP port 161.

The health of an OS resource can be shown as unreachable if the server is unreachable due to, for example, being in standby mode.

See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

The monitoring of OS resource utilization attributes enables you to modify the default threshold values for all servers being managed by the N1 System Manager, through the creation and editing of a configuration file. See Changing Threshold Values With the Monitoring Configuration File for details.

The monitoring of OS resource utilization attributes also enables you to set specific thresholds for individual monitored servers, or for groups of monitored servers, at the command line by using the set command. See Setting Threshold Values for details.

If you are not interested in the values of some attributes, you can disable the threshold severity for monitoring of those attributes. This action prevents annoyance alarms. Example 5–4 shows you how to accomplish this disabling action.

Network Reachability Monitoring

All management interfaces of provisionable servers and all platform interfaces are monitored by default by the N1 System Manager. Platform interfaces include the service processor's management interface, such as eth0, and data network interfaces, such as eth1 or eth2.

Reachability is verified for Linux servers and servers running the Solaris OS by using an ICMP ping to the interface IP address. For further information, see Discovery of Servers in the Factory Default State in Sun N1 System Manager 1.1 Installation and Configuration Guide.

The reachability of all network interfaces is verified at regular intervals. These polling intervals are configurable. For information about configuring polling intervals, see Setting Polling Intervals. The monitoring of network reachability is based on the IP address. If any monitored IP address is unreachable, an event is generated.

You can filter information for all servers by using the show server command with the appropriate parameters to view monitoring information. See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

It is important to distinguish between the unreachable and unknown states for provisionable servers.

N1-ok> show server health unreachable

This command lists all provisionable servers that are unreachable. Any provisionable server returned in the output of this command is unreachable due to a network problem: the server cannot be contacted about its hardware health status. The ping command to the server is unsuccessful. This does not necessarily mean that the server is not transmitting hardware health status information. The server could be in standby mode.

N1-ok> show server health unknown

This command lists all provisionable servers that are not returning any information about hardware health status. The ping command may be successful but servers returned in the output of this command are not returning any hardware health information. The monitoring agent could not be contacted on port 161.

N1-ok> show server power unreachable

This command lists all provisionable servers that are unreachable. Any server returned in the output of this command is unreachable due to a network problem: the server cannot be contacted about its power status. The ping command to the server is unsuccessful. This does not necessarily mean that the server is not transmitting power status information. The server could be in standby mode.

N1-ok> show server power unknown

This command lists all provisionable servers that are not returning any information about power status. The ping command may be successful but servers returned in the output of this command are not returning any power status information. The monitoring agent could not be contacted on port 161.

N1-ok> show server utilization unreachable

This command lists all provisionable servers that are unreachable. Any server returned in the output of this command is unreachable due to a network problem: the server cannot be contacted about its OS resource utilization. The ping command to the server is unsuccessful. This does not necessarily mean that the server is not transmitting OS resource utilization information. The server could be in standby mode.

N1-ok> show server utilization unknown

This command lists all provisionable servers that are not returning any information about OS resource utilization. The ping command may be successful but servers returned in the output of this command are not returning any OS resource utilization information. The monitoring agent could not be contacted on port 161.
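The difference between the unreachable and unknown states can be summarized in a short Python sketch. The sketch only illustrates the decision described above and is not the N1 System Manager implementation. The function name classify_state and its two boolean parameters are hypothetical.

```python
def classify_state(ping_ok: bool, agent_ok: bool) -> str:
    """Map two probe results to the state described in this section.

    ping_ok  -- hypothetical flag: the ICMP ping to the server succeeded
    agent_ok -- hypothetical flag: the monitoring agent answered on port 161
    """
    if not ping_ok:
        # The server cannot be contacted at all (it might be in standby mode).
        return "unreachable"
    if not agent_ok:
        # The server answers ping, but no monitoring data is returned.
        return "unknown"
    # Data is available; the reported status then depends on threshold checks.
    return "data available"


print(classify_state(ping_ok=True, agent_ok=False))
```

The ping result is checked first because an unsuccessful ping makes the agent check meaningless.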

Enabling Monitoring

For all provisionable servers, that is, all physical servers that have been discovered by the Sun N1 System Manager software, the management features become available when the add server command is used to create monitorable objects. The management features periodically retrieve CPU statistics, file system data, and memory data for monitoring purposes.

Monitored file system data for a provisionable server is not available unless an operating system is deployed on the provisionable server, and the management features have been added by using the add server feature command with the agentip keyword:

N1-ok> add server server-name feature basemanagement agentip agentip agentssh username/password

N1-ok> add server server-name feature osmonitor agentip agentip agentssh username/password

The agentip is the IP address of the provisioning network interface of the provisionable server that you want to monitor. See add server in Sun N1 System Manager 1.1 Command Line Reference Manual for details. See also To Add the Base Management Feature and To Add the OS Monitoring Feature for additional details on the syntax used in these commands.

When you specify or change features, you must use the add server command. The set server command cannot be used to specify a feature.

The add server command is useful for enabling OS resource utilization monitoring and network reachability monitoring, but not for monitoring hardware health. Hardware health is already monitored by default as soon as the Sun N1 System Manager software discovers a physical server.

Note –

The polling of network reachability is not possible if OS resource utilization monitoring is not enabled.

For more information about the agentip subcommand, see To Add the OS Monitoring Feature.

The add server command needs to be issued only once for a server and not each time you want to enable or disable monitoring.

Note –

If the provisionable server's IP address changes, use the set server command again before enabling or disabling monitoring.

The default status of monitoring in the Sun N1 System Manager for discovered servers and initialized operating systems is as follows:

Default status of hardware monitoring

When a server or other hardware is discovered, monitoring of its hardware sensors is enabled by default. Before a server can be monitored, however, it must be discovered and correctly registered with the N1 System Manager. This process is described in Discovering Servers. If a server is deleted and then rediscovered, all monitoring state for that server is lost, regardless of whether monitoring was enabled or disabled when the server was deleted. When the server is rediscovered, the monitored attribute is set to true by default. For more information about discovering servers, see To Discover New Servers.

Default status of OS resource utilization monitoring

Disabled by default. OS resource utilization monitoring is enabled after an OS has been successfully provisioned on a provisionable server and the N1 System Manager management features have been added by using the add server feature command with the agentip keyword. The OS provisioning can be performed either through the N1 System Manager or by an external OS installation.

If you are not interested in the values of some OS resource utilization attributes, you can disable the threshold severity for the monitoring of those attributes, while continuing to monitor other OS resource utilization attributes. This action prevents annoyance alarms. Example 5–4 shows how to accomplish this task. For general information about threshold values, see Monitoring Threshold Values.

Default status of network reachability monitoring

When the management interface of the provisionable server is discovered, monitoring of the interface is enabled by default. When the management features are added, monitoring of other interfaces is enabled by default.

To Monitor a Server

The following procedure describes how to use the command line to enable the monitoring of hardware health, operating system utilization, and network reachability of a server.

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Set the monitored attribute to true by using the set server command.
N1-ok> set server server monitored true
In this procedure, server is the name of the provisionable server that you want to monitor.

View the server details.
N1-ok> show server server

To Monitor a Server Group

Before You Begin

To enable the management agent IP and security credentials on each server in the group, add the management features on the server as explained in Adding Base and OS Management Features. This procedure is important for OS resource utilization monitoring but not for monitoring hardware health.

Steps

Set the monitored attribute to true by using the set group command.
N1-ok> set group group monitored true
This command is executed for the group of servers that you have already named. See set group in Sun N1 System Manager 1.1 Command Line Reference Manual for details. In this procedure, group is the name of the group of provisionable servers that you want to monitor.

View the server group details to determine if monitoring is enabled for each server in the group.
N1-ok> show group group

View the specific monitoring details for individual servers in the group.
N1-ok> show server server
Detailed monitoring information appears in the output. Information is displayed about polling intervals and threshold values for the monitoring of hardware health, OS resource utilization and network reachability. Polling intervals are explained in Setting Polling Intervals. Monitoring threshold values are explained in Monitoring Threshold Values.

To Disable Monitoring for a Server

You might want to disable monitoring of a hardware component to perform maintenance tasks without generating events.

Steps

Set the monitored attribute to false by using the set server command.
N1-ok> set server server monitored false
In this example, server is the name of the provisionable server that you want to stop monitoring. Executing this command disables monitoring of the server. With monitoring of a server disabled, the violation of threshold values by attributes related to that server does not generate events.

View the server details.
N1-ok> show server server
The output shows that monitoring is disabled.

If you are not interested in the values of some OS resource utilization attributes, you can disable the threshold severity for the monitoring of those attributes, while continuing to monitor other OS resource utilization attributes. This action prevents annoyance alarms. Example 5–4 shows how to accomplish this task. For general information about threshold values, see Monitoring Threshold Values. You can also completely remove the OS resource utilization monitoring feature. See To Remove the OS Monitoring Feature.

To Disable Monitoring for a Server Group

This procedure describes how to disable monitoring for a server group. You might want to disable monitoring of hardware components to perform maintenance tasks without generating events.

Note –

When you disable monitoring for a server, hardware health monitoring, OS monitoring, and network reachability monitoring are all disabled for that server.

Steps

Set the monitored attribute to false by using the set group command.
N1-ok> set group group monitored false
This command is executed for the group of servers that you have already named. See set group in Sun N1 System Manager 1.1 Command Line Reference Manual for details. In this procedure, group is the name of the group of provisionable servers that you want to stop monitoring. Executing this command disables monitoring for all servers in the group. With monitoring of a server group disabled, the violation of threshold values by attributes related to servers in that group does not generate events.

View the server group details to determine if monitoring is disabled for all servers in the group.
N1-ok> show group group

Monitoring Threshold Values

The value of any given monitored attribute is compared to a threshold value. Low and high threshold values are defined and can be configured.

Attribute data is compared against thresholds at regular intervals. These polling intervals are configurable. For further information about polling intervals, see Setting Polling Intervals.

When a monitored attribute is polled and the value of the attribute is beyond the default or user-defined threshold safe range, an event is generated and a status is issued. If the value of the attribute is lower than the low threshold or higher than the high threshold, then depending on the severity of the threshold, an event is generated to show a status of nonrecoverable, critical, or warning. Otherwise, the status of the monitored attribute is OK, provided that a value can be obtained.

If no value can be obtained, an event is generated to show that the status of the monitored attribute is unknown. The health of an OS resource can be shown as unknown if the server is reachable but the monitoring agent cannot be contacted on SNMP port 161.

The values nonrecoverable, critical, and warning are discussed in show server in Sun N1 System Manager 1.1 Command Line Reference Manual.

What Happens When a Threshold is Broken

If the value of a monitored attribute rises above the warninghigh threshold, a status of warninghigh is issued. If the value continues to rise and passes the criticalhigh threshold, a status of criticalhigh is issued. If the value continues to rise above the nonrecoverablehigh threshold, a status of nonrecoverablehigh is issued.

If the value then falls back toward the safe range, no further events are generated until the value falls below the warninghigh threshold, at which point an event is generated to show a status of normal.

If the value of a monitored attribute falls below the warninglow threshold, a status of warninglow is issued. If the value continues to fall, and passes the criticallow threshold, a status of criticallow is issued. If the value continues to fall below the nonrecoverablelow threshold, a status of nonrecoverablelow is issued.

If the value then rises back toward the safe range, no further events are generated until the value rises above the warninglow threshold, at which point an event is generated to show a status of normal.
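The behavior described above can be sketched in Python. This is one reading of the rules in this section, not the N1 System Manager implementation: an event is generated when a more severe threshold is crossed, and again only when the value returns to the safe range. The function names classify, severity, and events are hypothetical, and the sample thresholds are the factory defaults for a percentage attribute such as cpustats.pctusage.

```python
from typing import Iterable, Iterator, Mapping, Tuple

# Threshold names used in this chapter, most severe first.
HIGH_LEVELS = ("nonrecoverablehigh", "criticalhigh", "warninghigh")
LOW_LEVELS = ("nonrecoverablelow", "criticallow", "warninglow")
SEVERITY = {"normal": 0, "warning": 1, "critical": 2, "nonrecoverable": 3}


def classify(value: float, thresholds: Mapping[str, float]) -> str:
    """Return the status name for one polled value."""
    for name in HIGH_LEVELS:
        if name in thresholds and value > thresholds[name]:
            return name
    for name in LOW_LEVELS:
        if name in thresholds and value < thresholds[name]:
            return name
    return "normal"


def severity(status: str) -> int:
    """Rank a status name, ignoring its high or low suffix."""
    for suffix in ("high", "low"):
        if status.endswith(suffix):
            status = status[: -len(suffix)]
    return SEVERITY[status]


def events(samples: Iterable[float],
           thresholds: Mapping[str, float]) -> Iterator[Tuple[float, str]]:
    """Yield (value, status) pairs for the polls that would generate an event."""
    reported = "normal"  # most severe status reported since the last normal
    for value in samples:
        status = classify(value, thresholds)
        if status == "normal":
            if reported != "normal":
                yield value, "normal"      # back inside the safe range
            reported = "normal"
        elif severity(status) > severity(reported):
            yield value, status            # a more severe threshold was crossed
            reported = status
        # Otherwise the value is easing off but still out of range: no event.


cpu = {"warninghigh": 80, "criticalhigh": 90}
print(list(events([50, 85, 95, 85, 75], cpu)))
```

In this run the second reading of 85 generates no event, because the value is still above the warninghigh threshold after a criticalhigh status was issued.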

Threshold values for OS resource utilization attributes can be configured at the command line. This process is explained in Setting Threshold Values. For threshold values measuring percentages, the valid range is from 0% to 100%. If you try to set a threshold value outside of this range, an error is generated. For attributes that do not measure percentages, suitable threshold values depend on the number of processors in your system and on the usage characteristics of your installation.

Tuning Threshold Values for Your Installation

After a period of usage, you can develop an awareness of what levels to set for OS resource utilization attribute values. You can adjust thresholds once you determine more closely what value indicates a genuine justification for an event to be generated and for a notification to be sent to your pager or email address. For example, you might want to receive notifications every time a certain attribute reaches a warninghigh severity threshold level.

For important or crucial attributes at your installation, you can set the warninghigh threshold level to a low percentage value so that you are notified about a rising value as early as possible.

To Retrieve Threshold Values for a Server

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Type the show server command:
N1-ok> show server server
In this procedure, server is the name of the provisionable server for which you want to retrieve threshold values.

Detailed monitoring threshold values appear in the output, including threshold information for the server's hardware health, OS resource utilization, and network reachability. Default values are shown if no specific values have been set.

See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Managing Default Threshold Values

Factory-configured default threshold values are provided in the N1 System Manager software for some OS resource utilization thresholds. These values are stated as percentages. Table 5–1 lists default values for these OS resource utilization attributes.

Note –

Setting or modifying threshold values for hardware health attributes is not supported in this version of the Sun N1 System Manager.

Table 5–1 Factory-Configured Default Threshold Values for OS Resource Utilization Attributes


Attribute Name	Description	Default Warning Threshold	Default Critical Threshold
`cpustats.pctusage`	Percentage of overall CPU usage	`warninghigh` 80%	`criticalhigh` 90%
`cpustats.pctidle`	Percentage of CPU idle	`warninglow` 20%	`criticallow` 10%
`memusage.pctmemused`	Percentage of memory in use	`warninghigh` 80%	`criticalhigh` 90%
`memusage.pctmemfree`	Percentage of memory free	`warninglow` 20%	`criticallow` 10%
`memusage.pctswapused`	Percentage of swap space in use	`warninghigh` 80%	`criticalhigh` 90%
`fsusage.pctused`	Percentage of file system space in use	`warninghigh` 80%	`criticalhigh` 90%

Table 5–2 provides the complete list of OS resource utilization attributes and their default values. Where factory-configured default values exist for attributes, these are shown in parentheses.

Table 5–2 All OS Resource Utilization Attributes


Attribute Name	Description	Supported Warning Threshold (Default)	Supported Critical Threshold (Default)
`cpustats.loadavg1min`	System load expressed as average number of queued processes over 1 minute	`warninghigh`	`criticalhigh`
`cpustats.loadavg5min`	System load expressed as average number of queued processes over 5 minutes	`warninghigh`	`criticalhigh`
`cpustats.loadavg15min`	System load expressed as average number of queued processes over 15 minutes	`warninghigh`	`criticalhigh`
`cpustats.pctusage`	Percentage of overall CPU usage	`warninghigh` (80%)	`criticalhigh` (90%)
`cpustats.pctidle`	Percentage of CPU idle	`warninglow` (20%)	`criticallow` (10%)
`memusage.pctmemused`	Percentage of memory in use	`warninghigh` (80%)	`criticalhigh` (90%)
`memusage.pctmemfree`	Percentage of memory free	`warninglow` (20%)	`criticallow` (10%)
`memusage.mbmemused`	Memory in use in MB	`warninghigh`	`criticalhigh`
`memusage.mbmemfree`	Memory free in MB	`warninglow`	`criticallow`
`memusage.pctswapused`	Percentage of swap space in use	`warninghigh` (80%)	`criticalhigh` (90%)
`memusage.mbswapfree`	Free swap space in MB	`warninglow`	`criticallow`
`fsusage.pctused`	Percentage of file system space in use	`warninghigh` (80%)	`criticalhigh` (90%)

Changing Threshold Values With the Monitoring Configuration File

You can modify default values for thresholds by editing the monitoring.properties configuration file.

If the monitoring.properties configuration file is not present, create and save it in /etc/opt/sun/n1gc/. The monitoring.properties configuration file is not created by default at installation.

Any entries that you make in the monitoring.properties configuration file for the threshold values of the attributes listed in Table 5–1 overwrite the factory-configured defaults for the corresponding threshold values.

The monitoring.properties configuration file should be stored only on the management server and not on provisionable servers.

Modifying or adding new entries to the monitoring.properties configuration file affects all the provisionable servers managed by the N1 System Manager.

Specific threshold values can be set at the command line by following the procedures described in Setting Threshold Values.

Once a default value for a monitored item has been modified by manually adding it in the monitoring.properties configuration file, that modified default value applies to all provisionable servers except those servers for which specific values for the monitored attribute have been set at the command line.
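The order of precedence described above can be sketched in Python as a lookup. This is a minimal illustration, not the N1 System Manager implementation: the function effective_threshold and the two dictionary parameters are hypothetical, and the FACTORY_DEFAULTS values are copied from Table 5–1.

```python
from typing import Mapping, Optional, Tuple

Key = Tuple[str, str]  # (attribute name, threshold name)

# Factory-configured defaults from Table 5-1, as percentages.
FACTORY_DEFAULTS: Mapping[Key, float] = {
    ("cpustats.pctusage", "warninghigh"): 80.0,
    ("cpustats.pctusage", "criticalhigh"): 90.0,
    ("cpustats.pctidle", "warninglow"): 20.0,
    ("cpustats.pctidle", "criticallow"): 10.0,
    ("memusage.pctmemused", "warninghigh"): 80.0,
    ("memusage.pctmemused", "criticalhigh"): 90.0,
    ("memusage.pctmemfree", "warninglow"): 20.0,
    ("memusage.pctmemfree", "criticallow"): 10.0,
    ("memusage.pctswapused", "warninghigh"): 80.0,
    ("memusage.pctswapused", "criticalhigh"): 90.0,
    ("fsusage.pctused", "warninghigh"): 80.0,
    ("fsusage.pctused", "criticalhigh"): 90.0,
}


def effective_threshold(attribute: str, threshold: str,
                        file_defaults: Mapping[Key, float],
                        server_values: Mapping[Key, float]) -> Optional[float]:
    """Resolve one threshold value for one server.

    file_defaults -- hypothetical dict of entries from monitoring.properties
    server_values -- hypothetical dict of values set for this server
                     at the command line
    """
    key = (attribute, threshold)
    if key in server_values:        # a command-line value for this server wins
        return server_values[key]
    if key in file_defaults:        # then the modified default from the file
        return file_defaults[key]
    return FACTORY_DEFAULTS.get(key)  # then the factory default, if one exists


from_file = {("fsusage.pctused", "criticalhigh"): 75.0}
print(effective_threshold("fsusage.pctused", "criticalhigh", from_file, {}))
```

Attributes without a factory default, such as cpustats.loadavg1min, resolve to None in this sketch unless a value has been supplied in the file or at the command line.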

Note –

You do not need to reboot the management server or the monitored provisionable server for changes to the monitoring.properties file to take effect.

Monitored attributes for hardware health that are declared as percentages cannot be changed either at the command line or in the monitoring.properties file.

To Modify Default Threshold Values for a Server

To modify default threshold values, edit the /etc/opt/sun/n1gc/monitoring.properties file. Only those default threshold values that relate to OS resource utilization attributes can be modified. Hardware health attribute default threshold values cannot be modified for servers.

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Open the /etc/opt/sun/n1gc/monitoring.properties file.

If the file does not exist, create it.

Modify or add lines in the monitoring.properties file that describe default threshold values.

threshold.attribute.threshold=value

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring.

The threshold is either criticallow, warninglow, warninghigh, or criticalhigh.

The value is a numeric figure and usually represents a percentage value.

Save the file.

You do not need to reboot the management server or the provisionable server for the changes to take effect. The modified default threshold values now apply to all servers managed by the N1 System Manager.

Example 5–1 Modifying the Default Threshold Value for File System Usage

This example shows how to modify the default criticalhigh threshold value for file system usage to 75 percent of maximum file system usage capacity. The following line is added to or amended in the /etc/opt/sun/n1gc/monitoring.properties file:

threshold.fsusage.pctused.criticalhigh=75

This value applies to all provisionable servers, unless you have set specific values for the threshold value at the command line, by using the set command as described in Setting Threshold Values.
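Before you save the file, it can help to check that each entry follows the documented form. The following Python sketch is a hypothetical helper and is not part of the N1 System Manager software. It assumes the key=value form used in this example, and it assumes that percentage attributes can be recognized by a final name component that starts with pct, which holds for the attributes listed in Table 5–2.

```python
from typing import Tuple

# Threshold names accepted in monitoring.properties, per this procedure.
VALID_THRESHOLDS = {"criticallow", "warninglow", "warninghigh", "criticalhigh"}


def parse_threshold_line(line: str) -> Tuple[str, str, float]:
    """Split one entry into (attribute, threshold, value) or raise ValueError."""
    key, sep, raw = line.strip().partition("=")
    if not sep:
        raise ValueError("expected key=value: %r" % line)
    prefix, _, rest = key.strip().partition(".")
    attribute, _, threshold = rest.rpartition(".")
    if prefix != "threshold" or not attribute:
        raise ValueError("key must look like threshold.<attribute>.<threshold>")
    if threshold not in VALID_THRESHOLDS:
        raise ValueError("unsupported threshold name: %r" % threshold)
    value = float(raw)  # raises ValueError if the value is not numeric
    # Assumption: names such as pctused or pctidle denote percentages.
    if attribute.split(".")[-1].startswith("pct") and not 0 <= value <= 100:
        raise ValueError("percentage thresholds must be between 0 and 100")
    return attribute, threshold, value


print(parse_threshold_line("threshold.fsusage.pctused.criticalhigh=75"))
```

The range check mirrors the rule that percentage thresholds must lie between 0 and 100. Non-percentage attributes such as cpustats.loadavg1min are accepted with any numeric value.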

Threshold values can be disabled. This process is shown in Example 5–4.

Hardware Sensor Attributes

For x64 servers, the management server software obtains the list of hardware sensor attributes to monitor through IPMI from the service processor of the server. For SPARC servers, the ALOM interface is used. The list of hardware sensor attributes depends on the server type, on the number of CPUs that the server has, and on the firmware version. A sample listing for some servers and firmware versions is provided in this section.

Note –

Hardware disk failure and memory failure are not monitored in this version of the N1 System Manager.

The following list contains sensor names and descriptions for a Sun Fire V40z server with firmware version 2.1.0.16:

ambienttemp     Ambient air temp
bulk.v12-0-s0   Bulk 12V S0 voltage at CPU 0
bulk.v12-2-s0   Bulk 12V S0 voltage at CPU 2
bulk.v12-3-s0   Bulk 12V S0 voltage at CPU 3
bulk.v1_8-s0    Bulk 1.8V S0 voltage
bulk.v1_8-s5    Bulk 1.8V S5 voltage
bulk.v2_5-s0    Bulk 2.5V S0 voltage
bulk.v2_5-s0-dc Bulk 2.5V S0 voltage at DC
bulk.v2_5-s5    Bulk 2.5V S5 voltage
bulk.v3_3-s0    Bulk 3.3V S0 voltage
bulk.v3_3-s0-dc Bulk 3.3V S0 voltage at DC
bulk.v3_3-s3    Bulk 3.3V S3 voltage
bulk.v3_3-s5    Bulk 3.3V S5 voltage
bulk.v3_3-s5-dc Aux 3.3V S5 voltage at DC
bulk.v5-s0      Bulk 5V S0 voltage
bulk.v5-s0-dc   Bulk 5V S0 voltage at DC
bulk.v5-s5      Bulk 5V S5 voltage
bulk.v5-s5-dc   Bulk 5V S5 voltage at DC
cd.lp           CDROM Light path location LED
cpu0.dietemp    CPU 0 Die temperature
cpu0.heartbeat  CPU 0 Heartbeat
cpu0.inlettemp  CPU 0 Inlet temperature
cpu0.lp         CPU 0 Light path location LED
cpu0.mem0.lp    CPU 0 Dimm 0 Light path location LED
cpu0.mem1.lp    CPU 0 Dimm 1 Light path location LED
cpu0.mem2.lp    CPU 0 Dimm 2 Light path location LED
cpu0.mem3.lp    CPU 0 Dimm 3 Light path location LED
cpu0.memtemp    CPU 0 Memory temperature
cpu0.memvrm.lp  CPU 0 Memory VRM Light path location LED
cpu0.v2_5-s0    CPU 0 VDDA (2.5V) S0 voltage
cpu0.v2_5-s3    CPU 0 VDD (2.5V) S3 voltage
cpu0.vcore-s0   CPU 0 VCore S0 voltage
cpu0.vid        CPU 0 VID Selection
cpu0.vldt0      CPU 0 LDT0 voltage
cpu0.vrm.lp     CPU 0 VRM Light path location LED
cpu0.vtt-s3     CPU 0 DDR VTT S3 voltage
cpu1.dietemp    CPU 1 Die temperature
cpu1.heartbeat  CPU 1 Heartbeat
cpu1.inlettemp  CPU 1 Inlet temperature
cpu1.lp         CPU 1 Light path location LED
cpu1.mem0.lp    CPU 1 Dimm 0 Light path location LED
cpu1.mem1.lp    CPU 1 Dimm 1 Light path location LED
cpu1.mem2.lp    CPU 1 Dimm 2 Light path location LED
cpu1.mem3.lp    CPU 1 Dimm 3 Light path location LED
cpu1.memtemp    CPU 1 Memory temperature
cpu1.memvrm.lp  CPU 1 Memory VRM Light path location LED
cpu1.v2_5-s0    CPU 1 VDDA (2.5V) S0 voltage
cpu1.v2_5-s3    CPU 1 VDD (2.5V) S3 voltage
cpu1.vcore-s0   CPU 1 VCore S0 voltage
cpu1.vid        CPU 1 VID Selection
cpu1.vldt1      CPU 1 LDT1 voltage
cpu1.vldt2      CPU 1 LDT2 voltage
cpu1.vrm.lp     CPU 1 VRM Light path location LED
cpu1.vtt-s3     CPU 1 DDR VTT S3 voltage
cpu2.dietemp    CPU 2 Die temperature
cpu2.heartbeat  CPU 2 Heartbeat
cpu2.inlettemp  CPU 2 inlet temperature
cpu2.lp         CPU 2 Light path location LED
cpu2.mem0.lp    CPU 2 Dimm 0 Light path location LED
cpu2.mem1.lp    CPU 2 Dimm 1 Light path location LED
cpu2.mem2.lp    CPU 2 Dimm 2 Light path location LED
cpu2.mem3.lp    CPU 2 Dimm 3 Light path location LED
cpu2.memvrm.lp  CPU 2 Memory VRM Light path location LED
cpu2.temp       CPU 2 downwind temperature
cpu2.v2_5-s0    CPU 2 VDDA (2.5V) S0 voltage
cpu2.v2_5-s3    CPU 2 VDD (2.5V) S3 voltage
cpu2.vcore-s0   CPU 2 VCore S0 voltage
cpu2.vid        CPU-2 VID Selection
cpu2.vrm.lp     CPU 2 VRM Light path location LED
cpu2.vtt-s3     CPU 2 DDR VTT voltage
cpu3.dietemp    CPU 3 Die temperature
cpu3.heartbeat  CPU 3 Heartbeat
cpu3.inlettemp  CPU 3 inlet temperature
cpu3.lp         CPU 3 Light path location LED
cpu3.mem0.lp    CPU 3 Dimm 0 Light path location LED
cpu3.mem1.lp    CPU 3 Dimm 1 Light path location LED
cpu3.mem2.lp    CPU 3 Dimm 2 Light path location LED
cpu3.mem3.lp    CPU 3 Dimm 3 Light path location LED
cpu3.memvrm.lp  CPU 3 Memory VRM Light path location LED
cpu3.temp       CPU 3 downwind temperature
cpu3.v2_5-s0    CPU 3 VDDA (2.5V) S0 voltage
cpu3.v2_5-s3    CPU 3 VDD (2.5V) S3 voltage
cpu3.vcore-s0   CPU 3 VCore S0 voltage
cpu3.vid        CPU-3 VID Selection
cpu3.vrm.lp     CPU 3 VRM Light path location LED
cpu3.vtt-s3     CPU 3 DDR VTT voltage
cpuplanar.lp    Daughtercard Light path location LED
fan1.tach       Fan 1 measured speed
fan10.tach      Fan 10 measured speed
fan11.tach      Fan 11 measured speed
fan12.tach      Fan 12 measured speed
fan2.tach       Fan 2 measured speed
fan3.tach       Fan 3 measured speed
fan4.tach       Fan 4 measured speed
fan5.tach       Fan 5 measured speed
fan6.tach       Fan 6 measured speed
fan7.tach       Fan 7 measured speed
fan8.tach       Fan 8 measured speed
fan9.tach       Fan 9 measured speed
faultswitch     System Fault Indication
floppy.lp       Floppy Light path location LED
frontpanel.lp   LCD Light path location LED
g0.vldt1        AMD-8131 PCI-X Tunnel 0 LDT1 voltage
g1.vldt1        AMD-8131 PCI-X Tunnel 1 LDT1 voltage
gbeth.temp      Gigabit ethernet local temperature
golem-v1_8-s0   AMD-8131 PCI-X Tunnel 1.8V S0 voltage
identifyswitch  Identify switch
pci1.lp         PCI Slot 1 Light path location LED
pci2.lp         PCI Slot 2 Light path location LED
pci3.lp         PCI Slot 3 Light path location LED
pci4.lp         PCI Slot 4 Light path location LED
pci5.lp         PCI Slot 5 Light path location LED
pci6.lp         PCI Slot 6 Light path location LED
pci7.lp         PCI Slot 7 Light path location LED
pcifan.lp       Fan Board Light path location LED
planar.lp       Motherboard Light path location LED
scsibp.lp       SCSI Backplane Light path location LED
scsibp.temp     SCSI Disk backplane temperature
scsifault       SCSI Disk Fault Switch
sp.temp         SP local temperature
vldt-reg1-dc    LDT Regulator 1 Voltage
vldt-reg2-dc    LDT Regulator 2 Voltage

The following list contains sensor names and descriptions for a Sun Fire V20z server with firmware version 2.1.0.16:

ambienttemp    Ambient air temp
bulk.v12-0-s0  Bulk 12v supply voltage (cpu0)
bulk.v12-1-s0  Bulk 12v supply voltage (cpu1)
bulk.v1_8-s0   Bulk 1.8v S0 voltage
bulk.v1_8-s5   Bulk 1.8v S5 voltage
bulk.v2_5-s0   Bulk 2.5v S0 voltage
bulk.v2_5-s5   Bulk 2.5v S5 voltage
bulk.v3_3-s0   Bulk 3.3v supply
bulk.v3_3-s3   Bulk 3.3v S3 voltage
bulk.v3_3-s5   Bulk 3.3v S5 voltage
bulk.v5-s0     Bulk 5v supply voltage
bulk.v5-s5     Bulk 5v S5 voltage
cd.lp          CD-ROM Light path location led
cpu0.dietemp   CPU 0 die temp
cpu0.heartbeat CPU 0 heartbeat
cpu0.lp        CPU 0 Light path location led
cpu0.mem0.lp   CPU 0 Dimm 0 Light path location led
cpu0.mem1.lp   CPU 0 Dimm 1 Light path location led
cpu0.mem2.lp   CPU 0 Dimm 2 Light path location led
cpu0.mem3.lp   CPU 0 Dimm 3 Light path location led
cpu0.memtemp   CPU 0 memory temp
cpu0.memvrm.lp CPU 0 Memory VRM Light path location led
cpu0.temp      CPU 0 low side temp
cpu0.v2_5-s0   CPU VDDA voltage
cpu0.v2_5-s3   CPU 0 VDDIO voltage
cpu0.vcore-s0  CPU 0 core voltage
cpu0.vid       CPU-0 VID output
cpu0.vldt1     CPU0 HT 1 voltage
cpu0.vldt2     CPU 0 HT 2 voltage
cpu0.vrm.lp    CPU 0 VRM Light path location led
cpu0.vtt-s3    CPU 0 VTT voltage
cpu1.dietemp   CPU 1 die temp
cpu1.heartbeat CPU 1 heartbeat
cpu1.lp        CPU 1 Light path location led
cpu1.mem0.lp   CPU 1 Dimm 0 Light path location led
cpu1.mem1.lp   CPU 1 Dimm 1 Light path location led
cpu1.mem2.lp   CPU 1 Dimm 2 Light path location led
cpu1.mem3.lp   CPU 1 Dimm 3 Light path location led
cpu1.memtemp   CPU 1 memory temp
cpu1.memvrm.lp CPU 1 Memory VRM Light path location led
cpu1.temp      CPU 1 low side temp
cpu1.v2_5-s3   CPU 1 VDDIO voltage
cpu1.vcore-s0  CPU 1 core voltage
cpu1.vid       CPU-1 VID output
cpu1.vrm.lp    CPU 1 VRM Light path location led
cpu1.vtt-s3    CPU 1 VTT voltage
fan1.tach      Fan 1 measured speed
fan2.tach      Fan 2 measured speed
fan3.tach      Fan 3 measured speed
fan4.tach      Fan 4 measured speed
fan5.tach      Fan 5 measured speed
fan6.tach      Fan 6 measured speed
faultswitch    Fault switch (source for eval)
floppy.lp      Floppy Disk Drive Light path location led
frontpanel.lp  LCD Light path location led
g.vldt1        AMD-8131 PCI-X Tunnel HT 1 voltage
gbeth.temp     Gigabit ethernet temp
golem.temp     PCIX bridge temp
hdd1.lp        Hard Disk Drive 1 Light path location led
hdd2.lp        Hard Disk Drive 2 Light path location led
hddbp.lp       Hard Disk Drive Backplane Light path location led
hddbp.temp     Disk drive backplane temp
identifyswitch Identify switch
pci1.lp        PCI Slot 1 Light path location led
pci2.lp        PCI Slot 2 Light path location led
planar.lp      Motherboard Light path location led
ps.fanfail     Power Supply fan failure sensor
ps.lp          Powersupply Light path location led
ps.tempalert   Power Supply too hot sensor
sp.temp        SP temp
thor.temp      AMD-8111 I/O Hub temp

The N1 System Manager retrieves monitoring data from many of these sensors. For Sun Fire x4100 and x4200 servers, only analog sensors, that is, sensors describing fan speed, voltage, and temperature, are used to retrieve data. For descriptions of sensors in the Sun Fire x4100 and x4200 servers, refer to the IPMI reference information in the Sun Fire x4100 and x4200 server product documentation.
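The sensor listings above have a simple two-column layout, so they can be split into names and descriptions and grouped by kind. The following Python sketch shows one way to do this. The function names are hypothetical, and the keyword test in sensor_kind is a rough heuristic chosen for this illustration. It is an assumption, not the logic that the N1 System Manager uses to select sensors.

```python
def parse_sensor_listing(text):
    """Split each 'name   description' line into a (name, description) pair."""
    sensors = []
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            sensors.append((parts[0], parts[1].strip()))
    return sensors

def sensor_kind(description):
    """Classify a sensor by keywords in its description (heuristic assumption)."""
    text = description.lower()
    if "temp" in text:
        return "temperature"
    if "voltage" in text:
        return "voltage"
    if "speed" in text:
        return "fan speed"
    return "other"

listing = """\
ambienttemp     Ambient air temp
cpu0.lp         CPU 0 Light path location LED
cpu0.vcore-s0   CPU 0 VCore S0 voltage
fan1.tach       Fan 1 measured speed
"""

for name, description in parse_sensor_listing(listing):
    print(name, "->", sensor_kind(description))
```

A description such as "Bulk 3.3v supply" contains none of the keywords and is reported as other, which shows the limits of a keyword heuristic.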

Setting Threshold Values

Threshold values for monitored objects can be set on specific servers. A threshold value that you set at the command line for an attribute of a monitored object overrides, for that object, any factory-configured threshold value for the attribute. Any entries in the monitoring.properties configuration file concerning the attribute are also overridden.

To Set Threshold Values for a Server

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Use the set server command with the threshold attribute.

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.

The threshold is either criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure and usually represents a percentage.
- To set one threshold value, type the following:
  N1-ok> set server server threshold attribute threshold value
- To set multiple threshold values for the server, type the following:
  N1-ok> set server server threshold attribute threshold value threshold value
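The command syntax above can be assembled programmatically, for example when preparing commands for several servers. The following Python sketch builds the command text only and does not run it. The function name threshold_command is hypothetical. The severity names are the four listed above.

```python
SEVERITIES = ("criticallow", "warninglow", "warninghigh", "criticalhigh")

def threshold_command(server, attribute, thresholds):
    """Build the text of a 'set server ... threshold' command.

    thresholds maps a severity name to its value, in the order
    in which the pairs should appear in the command.
    """
    if not thresholds:
        raise ValueError("at least one threshold is required")
    parts = ["set", "server", server, "threshold", attribute]
    for severity, value in thresholds.items():
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        parts += [severity, str(value)]
    return " ".join(parts)

print(threshold_command("serv1", "cpustats.pctusage",
                        {"warninghigh": 53, "criticalhigh": 75}))
```

The resulting string is what you would type at the N1-ok> prompt.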

Example 5–2 Setting Multiple Threshold Values for CPU Usage on a Server

This example shows how to set the CPU usage warninghigh severity threshold on a provisionable server named serv1 to 53 percent. This example also shows how to set the criticalhigh severity threshold value to 75 percent.

N1-ok> set server serv1 threshold cpustats.pctusage warninghigh 53 criticalhigh 75

These values override the default values stored in the monitoring.properties configuration file on the management server for the server named serv1.

Example 5–3 Setting Multiple Threshold Values for File System Usage On a Server

This example sets the file system usage warninghigh threshold on a provisionable server named serv1 to 75 percent. This example also shows how to set the criticalhigh threshold value to 87 percent.

N1-ok> set server serv1 threshold fsusage.pctused warninghigh 75 criticalhigh 87

Example 5–4 Deleting a Threshold Value for File System Usage on a Server

This example shows how to delete a value that was set for the warninghigh threshold on a provisionable server named serv1.

N1-ok> set server serv1 threshold fsusage warninghigh none

In this case, any previously set value for this threshold at this severity is deleted. The threshold severity value does not revert back to the default threshold value, which is stored in the monitoring.properties configuration file, or to the factory-configured default, if this default existed for the attribute. In effect, monitoring is disabled for the warninghigh threshold for file system usage for this server.
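The precedence described in this section can be summarized in a short Python sketch. The names DISABLED and effective_threshold are hypothetical. The sketch assumes that a value in the monitoring.properties file takes precedence over a factory-configured default, and that each threshold is resolved independently.

```python
DISABLED = object()  # marker for a threshold that was set to "none"

def effective_threshold(server_setting, properties_default, factory_default):
    """Resolve one threshold at one severity for one server.

    Each argument is a number or None (nothing set at that level).
    server_setting can also be DISABLED.
    """
    if server_setting is DISABLED:
        return None                # monitoring is off, with no fallback to defaults
    if server_setting is not None:
        return server_setting      # a value set at the command line wins
    if properties_default is not None:
        return properties_default  # default from monitoring.properties
    return factory_default         # factory-configured default, if any

print(effective_threshold(87, 75, None))        # command-line value
print(effective_threshold(None, 75, None))      # monitoring.properties value
print(effective_threshold(DISABLED, 75, None))  # disabled with "none"
```

The last call returns None, which reflects that deleting a value with none does not cause the threshold to revert to either default.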

To Set Threshold Values for a Server Group

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Use the set group command with the threshold attribute.

The syntax requires the threshold keyword to be followed by the attribute for which you are setting a threshold. The attribute is an OS resource utilization attribute. OS resource utilization attributes are described in OS Resource Utilization Monitoring and listed in Table 5–2.

The threshold is either criticallow, warninglow, warninghigh, or criticalhigh. The value is a numeric figure, and usually represents a percentage.
- To modify one threshold for the server group:
  N1-ok> set group group threshold attribute threshold value
- To modify multiple thresholds for the server group:
  N1-ok> set group group threshold attribute threshold value threshold value

Example 5–5 Setting Multiple Threshold Values for File System Usage on a Server Group

This example shows how to set the file system usage warninghigh threshold to 75 percent on a group of provisionable servers with a group name of grp3. This example also shows how to set the criticalhigh threshold severity value to 87 percent.

N1-ok> set group grp3 threshold fsusage.pctused warninghigh 75 criticalhigh 87

Setting Polling Intervals

The monitoring of an object consists of regular checks, or polls, of the monitored object. The frequency of these polls is controlled by setting the polling interval. The appropriate interval between polls depends on the object being monitored, its environment, and the performance conditions to which the object is subjected. Default polling intervals are provided for some monitored objects, including server hardware objects such as fans. Default polling intervals apply for those servers or groups of servers for which specific interval values have not been set by using the set command.

Changing Polling Intervals With the Monitoring Configuration File

You can modify default values for polling intervals for hardware health, OS resource utilization, and network reachability by editing the monitoring.properties configuration file.

Note –

The polling of network reachability is not possible if OS monitoring is not enabled.

If the monitoring.properties configuration file is not present, create it and save it as /etc/opt/sun/n1gc/monitoring.properties. The monitoring.properties file is not created by default at installation.

Factory-configured default polling intervals are provided in the N1 System Manager software. These values are stated in seconds. The factory-configured defaults are provided in Table 5–3.

Table 5–3 Factory-Configured Default Polling Intervals


Type of Monitoring	Default Polling Interval
Hardware health	120 seconds
OS resources	120 seconds
Network reachability	60 seconds

Any entries you make in the monitoring.properties configuration file overwrite these factory-configured defaults.

Note –

The minimum default polling interval that you can set is 60 seconds.

The monitoring.properties configuration file exists only on the management server and not on provisionable servers. Modifying the default polling intervals stored in the monitoring.properties configuration file affects all the provisionable servers managed by the N1 System Manager.

You do not need to reboot the management server or the monitored provisionable server for changes to the monitoring.properties file to take effect.

Default polling intervals stored in the monitoring.properties configuration file apply to all servers unless specific values have been set at the command line for a specific server or group of servers. Set specific polling interval values by using the set command, as described in Setting Polling Intervals.

Tuning Polling Intervals for Your Installation

After the N1 System Manager has been installed, deployed, and in use for some time, you can judge how frequently you need to poll hardware health attributes, OS resource utilization attributes, and network reachability. Your configuration of the N1 System Manager depends on your priorities in terms of crucial events. When setting polling intervals, or when changing default polling intervals, consider the number of servers you are managing with your N1 System Manager software. Consider also the current or expected application loads of your provisionable servers, and the capabilities of your network. Your expected responsiveness to events is also relevant. If you are able to react quickly to events as they occur, polling more frequently is appropriate.

For further information about tuning polling intervals for your installation, see To Increase the N1 System Manager Performance in Sun N1 System Manager 1.1 Installation and Configuration Guide.
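A rough calculation can help when weighing polling frequency against the number of managed servers. The following Python sketch estimates how many polls are issued per hour. It makes a simplifying assumption that each server is polled once per interval for each type of monitoring, which might not match the actual behavior of the software. The figure of 100 servers is hypothetical. The intervals are the factory-configured defaults from Table 5–3.

```python
FACTORY_INTERVALS = {"hardwarehealth": 120, "osresources": 120, "network": 60}

def polls_per_hour(server_count, interval_seconds):
    """Polls per hour for one type of monitoring across all servers."""
    return server_count * 3600 / interval_seconds

def hourly_load(server_count, intervals=FACTORY_INTERVALS):
    """Polls per hour for each type of monitoring."""
    return {monitor: polls_per_hour(server_count, seconds)
            for monitor, seconds in intervals.items()}

load = hourly_load(100)  # hypothetical installation with 100 servers
for monitor, polls in load.items():
    print(monitor, polls)
print("total", sum(load.values()))
```

Halving an interval doubles the number of polls for that type of monitoring, so the estimate scales linearly with both the server count and the polling frequency.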

To Retrieve Polling Interval Values for a Server

Steps

Type the show server command:
N1-ok> show server server
In this procedure, server is the name of the provisionable server for which you want to retrieve polling intervals.

Detailed monitoring polling intervals appear in the output, including polling interval information for the server's hardware health, OS resource utilization, and network reachability.

See show server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

To Modify the Default Polling Interval for a Server

Before You Begin

To enable the management agent IP and security credentials on a server named server, add the management features on the server as explained in Adding Base and OS Management Features.

Steps

Open the /etc/opt/sun/n1gc/monitoring.properties file.

If the file does not exist, create it.

Modify or add lines in the monitoring.properties file that describe default polling intervals.

pollinginterval.monitor=value

The syntax requires the pollinginterval keyword.

monitor is either hardwarehealth, osresources or network. The polling of network reachability is not possible unless OS resource monitoring has been enabled, as described in Enabling Monitoring.

The value is in seconds, and the minimum value is 60.

Save the file.

You do not need to reboot the management server or the provisionable server for the changes to take effect. The modified default polling interval values now apply to all servers managed by the N1 System Manager.
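The "modify or add lines" step can also be scripted. The following Python sketch amends an existing key=value entry in properties text, or appends the entry if it is not present. The function name upsert_property is hypothetical. The sketch works on a string and leaves reading and writing the actual file to you.

```python
def upsert_property(text, key, value):
    """Return properties text with key set to value, amending or appending."""
    new_line = f"{key}={value}"
    lines = text.splitlines()
    for index, line in enumerate(lines):
        # Compare only the part before the first "=" with the requested key.
        if line.split("=", 1)[0].strip() == key:
            lines[index] = new_line  # amend the existing entry
            break
    else:
        lines.append(new_line)       # no entry yet, so add one
    return "\n".join(lines) + "\n"

original = "pollinginterval.network=60\n"
print(upsert_property(original, "pollinginterval.network", 160), end="")
print(upsert_property(original, "pollinginterval.osresources", 175), end="")
```

Lines that begin with a comment character are left untouched because their text before the equals sign does not match the key exactly.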

Example 5–6 Modifying Default Values

This example shows how to set the hardware health monitoring polling interval to 180 seconds, the OS resource utilization monitoring polling interval to 175 seconds, and the network reachability monitoring polling interval to 160 seconds. The following entries are made in the monitoring.properties configuration file.

pollinginterval.hardwarehealth=180
pollinginterval.osresources=175
pollinginterval.network=160
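The relationship between the factory-configured defaults and the entries in the monitoring.properties file can be sketched in Python as follows. The function name default_polling_intervals is hypothetical. The sketch applies the factory defaults from Table 5–3, overrides them with any pollinginterval entries, and rejects values below the 60-second minimum. How the software itself reacts to an invalid entry is not stated in this guide, so raising an error here is an assumption.

```python
FACTORY_DEFAULTS = {"hardwarehealth": 120, "osresources": 120, "network": 60}
MINIMUM_INTERVAL = 60  # seconds

def default_polling_intervals(properties_text=""):
    """Return the default interval, in seconds, for each type of monitoring."""
    intervals = dict(FACTORY_DEFAULTS)
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line.startswith("pollinginterval.") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        monitor = key[len("pollinginterval."):]
        if monitor not in FACTORY_DEFAULTS:
            raise ValueError(f"unknown monitor: {monitor}")
        seconds = int(value)
        if seconds < MINIMUM_INTERVAL:
            raise ValueError(f"{monitor}: {seconds} is below the minimum of 60")
        intervals[monitor] = seconds  # the file entry replaces the factory default
    return intervals

example = """\
pollinginterval.hardwarehealth=180
pollinginterval.osresources=175
pollinginterval.network=160
"""
print(default_polling_intervals(example))
print(default_polling_intervals())  # no file entries, so factory defaults apply
```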

Setting Polling Intervals

This section contains procedures that describe how to set the polling intervals for a server or a server group.

To Set Polling Intervals for a Server

This procedure shows you how to set a polling interval for a server at the command line. Any value set this way overwrites the factory-configured default value or the value in the monitoring.properties configuration file, if the file exists.

Steps

Type the set server command with the monitor attribute.
N1-ok> set server server monitor monitor interval value
This command is executed for a server that you have already named. In this procedure, this name appears as server. See set server in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

The monitor is either hardwarehealth, osresources, or network.

The value is in seconds.

Note –
The minimum polling interval that you can set is 60 seconds.

Example 5–7 Setting the Polling Interval for Hardware Health Monitoring of a Server

This example shows how to set a polling interval of 280 seconds for hardware health monitoring of a provisionable server named serv1.

N1-ok> set server serv1 monitor hardwarehealth interval 280

To Set Polling Intervals for a Server Group

Any value set this way overwrites the factory-configured default value or the value in the monitoring.properties configuration file, if the file exists.

Steps

Type the set group command with the monitor attribute.
N1-ok> set group group monitor monitor interval value
This command is executed for a group of servers that you have already named. In this procedure, this name appears as group. See set group in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

The monitor is either hardwarehealth, osresources, or network.

The value is in seconds.

Note –
The minimum polling interval that you can set is 60 seconds.

Example 5–8 Setting the Polling Interval for Network Reachability Monitoring of a Server Group

This example shows how to set a polling interval of 250 seconds for network reachability monitoring of a group of provisionable servers named grp5.

N1-ok> set group grp5 monitor network interval 250

Monitoring MIBs

Two MIBs are provided with the N1 System Manager. These MIBs define the data structures that third-party monitoring tools can use to retrieve data from the N1 System Manager using SNMP and to parse the SNMP notifications generated by the N1 System Manager. The MIBs can be found at /opt/sun/n1gc/etc/. You can therefore use any SNMP client to query the N1 System Manager and to listen for events using SNMP. The following MIBs are provided:

SUN-N1SM-INFO-MIB: This MIB describes the information that you can retrieve from the N1 System Manager by querying it using an SNMP client.
SUN-N1SM-TRAP-MIB: This MIB describes all of the events related to the N1 System Manager about which you can receive SNMP traps.

These MIBs are read-only. Using them requires a detailed knowledge of SNMP, although detailed descriptions of each object are provided in the MIBs. How you configure your monitoring system to start receiving traps depends on the nature of your monitoring system.

The MIBs are hardware independent.

Example 5–9 Receiving SNMP Traps

This example shows you how to use the simple UNIX trap listener, the snmptrapd command, to start receiving N1 System Manager traps.

N1-ok> snmptrapd -m all -M /opt/sun/n1gc/etc:/usr/share/snmp/mibs -P 1010

This example uses the snmptrapd command to start monitoring port 1010 for SNMP traps. It also instructs the command to use the MIBs stored at /opt/sun/n1gc/etc and /usr/share/snmp/mibs to parse the contents of SNMP traps.
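SNMP traps are delivered as UDP datagrams, so a trap listener is, at the lowest level, a program that waits on a UDP port. The following Python sketch shows only that transport step, using the standard socket module. The function names are hypothetical. The sketch does not decode SNMP messages or use the MIBs, which requires an SNMP library or a tool such as snmptrapd. The payload in the demonstration is placeholder text, not a real trap.

```python
import socket

def open_listener(host, port):
    """Bind a UDP socket on the given address and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_datagram(sock, timeout=5.0, max_bytes=65535):
    """Wait for one datagram and return (payload, sender_address)."""
    sock.settimeout(timeout)
    return sock.recvfrom(max_bytes)

# Loopback demonstration on an OS-assigned port (port 0), so that no
# privileges are needed and no real trap source is required.
listener = open_listener("127.0.0.1", 0)
port = listener.getsockname()[1]
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"placeholder payload, not a real trap", ("127.0.0.1", port))
payload, address = receive_datagram(listener)
print(len(payload), "bytes received from", address[0])
sender.close()
listener.close()
```

To wait on port 1010 as in the example above, you would pass that port number to open_listener. On most UNIX systems, binding to a port below 1024 requires elevated privileges.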

Managing Jobs

This section describes jobs and how they are an integral part of server monitoring.

Each major action you take in the N1 System Manager starts a job. Use the job log to track the status of a currently running action or to verify that a job has finished. Monitoring jobs is particularly useful because some N1 System Manager actions can take a long time to finish. An example of such an action is installing an OS distribution on one or more provisionable servers.

You can track jobs through the Jobs tab in the browser interface or the show job command. The show job command provides information about most of the following characteristics:

Job ID

Generated unique identifier.

Date

Date on which the job was started.

Job Type

Type of job. See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details. When using the show job command with the type parameter, jobs can be any of the following types:

addbase – Add base management support.
addosmonitor – Add OS monitoring support.
createos – Create OS distribution from CD/DVD media or ISO files.
deletejob – Delete job.
discover – Server discovery.
loadfirmware – Load firmware update.
loados – Load OS.
loadupdate – Load OS update.
refresh – Server refresh.
removeosmonitor – Remove OS monitoring support.
setagentip – Modify OS monitoring support.
start – Server power on.
stop – Server power off.
unloadupdate – Unload OS update.

State

State of the current job step. Job steps indicate the progress of a job and update results. Each job step has a type, a start time and, when the job completes, a completion time. For the purposes of filtering, job progress is indicated with the following states:

notstarted: Jobs in a notstarted state cannot be stopped.
preflight: When you select a job by ID and view the details of that job, each step of that job appears twice – the preflight check and the execution of the step itself.
running: The job is currently running. Jobs that are currently running cannot be deleted using the delete job command. Jobs that are currently running must finish running or be stopped using the stop job command.

Job completion is indicated with the following results:

completed: Indicates that the job step completed successfully.
warning: Indicates a warning during the job execution. A warning reports an issue that might or might not be severe enough to terminate the job step, and the job, with errors.
abort: Indicates that the job step stopped before it completed.
abort_pending: Indicates that the job is still running but that the job step cannot complete successfully.
error: Indicates a general error in that job step.
timed_out: Indicates that the job timed out before all of the job steps could complete successfully, or that the next step of the job started before the current step completed successfully.

An overall job status of Complete - Warning appears in the output if the job successfully completed all of its steps but one or more WARNING states were issued for steps during the job execution, and these warnings were not severe enough to terminate the job with errors.

You can filter jobs depending on their state. See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Owner

The user who started the job. Also called the job creator.

Job Results

Provides details about the results of a completed job. You can review the standard output of remote command operations and completion statuses for all other job types.
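The job states and results described above can be modeled in a few lines of Python, which can be useful when scripting around job output. The function name describe_state is hypothetical. The sketch encodes the two rules stated in this section: jobs in a notstarted state cannot be stopped, and running jobs cannot be deleted. Treating every result state as not stoppable, and every state other than running as deletable, is an assumption made for this illustration.

```python
PROGRESS_STATES = {"notstarted", "preflight", "running"}
RESULT_STATES = {"completed", "warning", "abort", "abort_pending",
                 "error", "timed_out"}

def describe_state(state):
    """Classify a job state and report which operations it allows."""
    if state in PROGRESS_STATES:
        kind = "progress"
    elif state in RESULT_STATES:
        kind = "result"
    else:
        raise ValueError(f"unknown state: {state}")
    return {
        "kind": kind,
        # Stated rule: notstarted jobs cannot be stopped.
        # Assumption: result states are treated as not stoppable.
        "stoppable": kind == "progress" and state != "notstarted",
        # Stated rule: running jobs cannot be deleted.
        "deletable": state != "running",
    }

for state in ("notstarted", "running", "completed"):
    print(state, describe_state(state))
```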

To List Jobs

Steps

View the list of jobs.
N1-ok> show job all
A list of all jobs for the N1 System Manager is returned.

See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–10 Listing All Jobs

This example shows that using the show job command with the all option returns a list of jobs by Job ID, together with the date and time at which the job was started. The job type and status are also returned, along with the identity of the user who created the job.

N1-ok> show job all
Job ID          Date                       Type                  Status        Creator
7               2005-09-16T10:51:07-0700   Discovery             Completed      root
6               2005-09-14T14:42:52-0700   Server Reboot         Error          root
5               2005-09-14T14:38:25-0700   Server Power On       Completed      root
4               2005-09-14T14:29:20-0700   Server Power Off      Completed      root
3               2005-09-09T13:01:35-0700   Discovery             Completed      root
2               2005-09-09T12:38:16-0700   Discovery             Completed      root
1               2005-09-09T10:32:40-0700   Discovery             Completed      root
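If you capture the output of the show job all command in a file, the table can be converted into records for further processing. The following Python sketch is one way to do this. The function name parse_job_list is hypothetical. The sketch assumes that columns are separated by two or more spaces, as in the sample output above, and that the text passed in starts with the header row.

```python
import re

COLUMN_GAP = re.compile(r"\s{2,}")  # two or more whitespace characters

def parse_job_list(output):
    """Parse 'show job all' text into a list of dicts keyed by column name."""
    rows = [line.strip() for line in output.splitlines() if line.strip()]
    header = COLUMN_GAP.split(rows[0])
    return [dict(zip(header, COLUMN_GAP.split(row))) for row in rows[1:]]

captured = """\
Job ID          Date                       Type                  Status        Creator
7               2005-09-16T10:51:07-0700   Discovery             Completed      root
6               2005-09-14T14:42:52-0700   Server Reboot         Error          root
5               2005-09-14T14:38:25-0700   Server Power On       Completed      root
"""

for job in parse_job_list(captured):
    print(job["Job ID"], job["Type"], job["Status"])
```

Values such as Server Power On stay intact because their words are separated by single spaces only.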

To View a Specific Job

Steps

View a specific job.
N1-ok> show job job
Detailed information about the job appears in the output.

See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–11 Viewing Job Details

This example shows that using the show job command with the Job ID returns the date and time at which the job was started, the job type and status, and the identity of the user who created the job. Further details are provided for each step of that job, including the time at which the step started and completed and whether the step was successful.

N1-ok> show job 5
Job ID:      5
Date:        2005-02-14T14:38:25-0700
Type:        Server Power On
Status:      Completed
Creator:     root
Errors:      0
Warnings:    0
Step 1:      
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/serverPowerOn.sh :[SERVER_NAME] :[JOBID_KEY]
Start:       2005-02-14T14:38:25-0700
Completion:  2005-02-14T14:38:25-0700
Result:      Complete
Exception:   No Data Available
Step 2:      
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/serverPowerOn.sh :[SERVER_NAME] :[JOBID_KEY]
Start:       2005-02-14T14:38:28-0700
Completion:  2005-02-14T14:38:35-0700
Result:      Complete
Exception:   No Data Available
Step 3:      
Type:        135
Description: connect and lock hosts
Start:       2005-02-14T14:38:25-0700
Completion:  2005-02-14T14:38:25-0700
Result:      Complete
Exception:   No Data Available
Step 4:      
Type:        135
Description: connect and lock hosts
Start:       2005-02-14T14:38:27-0700
Completion:  2005-02-14T14:38:28-0700
Result:      Complete
Exception:   No Data Available
Result 1:    
Server:      192.168.200.3
Status:      0
Message:     The server operation was successful.
N1-ok>

Each step appears twice in the output. The first appearance of the step in the list is the preflight check, and the second appearance of the step in the list is the actual execution of the step.
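The pairing of preflight checks and executions can be made explicit by post-processing captured show job output. The following Python sketch collects the fields of each step, labels the first occurrence of a description as the preflight check and the second as the execution, and computes how long each step took. The function names are hypothetical, and matching steps by their Description text is an assumption based on the sample output.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

def parse_steps(output):
    """Collect the 'Key: value' fields that follow each 'Step N:' line."""
    steps, current = [], None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Step "):
            current = {}
            steps.append(current)
        elif line.startswith("Result "):
            current = None  # the overall result section follows the steps
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            if key.strip():
                current[key.strip()] = value.strip()
    return steps

def label_phases(steps):
    """Mark the first occurrence of a description as preflight, later ones as execution."""
    seen = Counter()
    for step in steps:
        seen[step["Description"]] += 1
        step["Phase"] = "preflight" if seen[step["Description"]] == 1 else "execution"
    return steps

def duration_seconds(step):
    """Seconds between the Start and Completion timestamps of a step."""
    start = datetime.strptime(step["Start"], TIME_FORMAT)
    end = datetime.strptime(step["Completion"], TIME_FORMAT)
    return (end - start).total_seconds()

captured = """\
Step 1:
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/serverPowerOn.sh :[SERVER_NAME] :[JOBID_KEY]
Start:       2005-02-14T14:38:25-0700
Completion:  2005-02-14T14:38:25-0700
Result:      Complete
Step 2:
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/serverPowerOn.sh :[SERVER_NAME] :[JOBID_KEY]
Start:       2005-02-14T14:38:28-0700
Completion:  2005-02-14T14:38:35-0700
Result:      Complete
Result 1:
Server:      192.168.200.3
"""

for step in label_phases(parse_steps(captured)):
    print(step["Phase"], step["Result"], duration_seconds(step))
```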

To Stop a Job

Steps

Stop a specific job.
N1-ok> stop job job
The job is stopped.

See stop job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

View the job details.
N1-ok> show job job
The Result section of the output shows that the job was stopped.

In principle, any job can be stopped. In practice, however, only a job that is not in its last step can be stopped. Some jobs have only one step and so can never be stopped. Jobs in a notstarted state cannot be stopped. Operations that are performed on large groups of servers can take longer and might include a large number of steps.

See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–12 Stopping a Remote Command Job

This example shows that using the stop job command with the Job ID returns a message confirming that the request has been received.

N1-ok> stop job 9

Stop Job "9" request received.

This example also shows that the show job command can be used with the Job ID of the stopped job to obtain more information about that job. The output confirms, in Status, that the job was stopped, and shows that the job was a remote command job. Further details are provided for each step of that job, including the time at which the step started and completed and whether the step was successful. The Result section shows that the job was canceled.

N1-ok> show job 9

Job ID:   9
Date:     2005-02-15T16:43:58-0700
Type:     Remote Command
Status:   Stopped
Owner:    root
Errors:   0
Warnings: 0

Step 1:     
Type:        135
Description: connect and lock hosts
Start:       2005-02-15T16:43:58-0700
Completion:  2005-02-15T16:43:58-0700
Result:      Complete
Exception:   No Data Available

Step 2:     
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/remotecmd.sh :[RCMD_KEY]
Start:       2005-02-15T16:43:58-0700
Completion:  2005-02-15T16:43:58-0700
Result:      Complete
Exception:   No Data Available

Step 3:     
Type:        135
Description: connect and lock hosts
Start:       2005-02-15T16:44:00-0700
Completion:  2005-02-15T16:44:00-0700
Result:      Complete
Exception:   No Data Available

Step 4:     
Type:        103
Description: native procedure /bin/sh /opt/sun/n1gc/bin/remotecmd.sh :[RCMD_KEY]
Start:       2005-02-15T16:44:00-0700
Completion:  2005-02-15T16:44:49-0700
Result:      Incomplete - Aborted
Exception:   No Data Available

Result :        
Server:      server1
Status:      -1
Message:     Command running on server1 was canceled. Command:
/root/sleep.sh 60
Standard Output: Sleeping for 60 seconds...

Each step appears twice in the output. The first appearance of the step in the list is the preflight check, and the second appearance of the step in the list is the actual execution of the step.

To Delete a Job

Steps

Determine the job you want to delete.
N1-ok> show job all
All jobs and job IDs appear in the output.

See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Delete the desired job.
N1-ok> delete job job
The job is deleted.

See delete job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Verify that the job was deleted.
N1-ok> show job all
The deleted job should not appear in the output.

See show job in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–13 Deleting a Job

This example shows how to delete a job.

First, the show job command is used with the all option, which lists all jobs in descending order.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
7               2005-02-16T10:51:07-0700   Discovery             Completed        root
6               2005-02-14T14:42:52-0700   Server Reboot         Error            root
5               2005-02-14T14:38:25-0700   Server Power On       Completed        root
4               2005-02-14T14:29:20-0700   Server Power Off      Completed        root
3               2005-02-09T13:01:35-0700   Discovery             Completed        root
2               2005-02-09T12:38:16-0700   Discovery             Completed        root
1               2005-02-09T10:32:40-0700   Discovery             Completed        root

Job ID 6 has an error and can be deleted. The delete job command is now used with the Job ID of the job to be deleted.

N1-ok> delete job 6

The show job command is used again with the all option, which lists all jobs in descending order. The deleted job no longer appears on the list.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
7               2005-02-16T10:51:07-0700   Discovery             Completed        root
5               2005-02-14T14:38:25-0700   Server Power On       Completed        root
4               2005-02-14T14:29:20-0700   Server Power Off      Completed        root
3               2005-02-09T13:01:35-0700   Discovery             Completed        root
2               2005-02-09T12:38:16-0700   Discovery             Completed        root
1               2005-02-09T10:32:40-0700   Discovery             Completed        root

Example 5–14 Deleting All Jobs

This example shows how to delete all jobs.

First, the show job command is used with the all option, which lists all jobs in descending order.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
7               2005-09-16T10:51:07-0700   Discovery             Completed        root
6               2005-09-14T14:42:52-0700   Server Reboot         Error            root
5               2005-09-14T14:38:25-0700   Server Power On       Completed        root
4               2005-09-14T14:29:20-0700   Server Power Off      Completed        root
3               2005-09-09T13:01:35-0700   Discovery             Running          root
2               2005-09-09T12:38:16-0700   Discovery             Completed        root
1               2005-09-09T10:32:40-0700   Discovery             Completed        root

The delete job command is now used with the all option, to delete all jobs.

N1-ok> delete job all

Unable to delete job "3"

The show job command is used with the all option, to confirm whether all jobs were successfully deleted.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
3               2005-09-09T13:01:35-0700   Discovery             Running          root

Job 3 remains in the list because it is still running. Jobs that are in a running state when the delete job command is issued must finish running, or must be stopped, before they can be deleted.

To stop the job and then delete it, first the stop job command is used with the ID of the job to be stopped.

N1-ok> stop job 3

Stop Job "3" request received.

The show job command is used to confirm that the job has been stopped.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
3               2005-09-09T13:02:35-0700   Discovery             Aborted          root

The job has been stopped while running and is in the aborted state. The delete job command is now used with the all option, to delete all jobs.

N1-ok> delete job all

The show job command is used to confirm that all jobs have now been deleted.

N1-ok> show job all
Job ID          Date                       Type                  Status           Creator
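
The check for running jobs in the preceding example can also be done before delete job all is issued. The following Python sketch assumes that the output of show job all has been captured as text and that columns are separated by two or more whitespace characters, as in the listings above. The helper names are hypothetical; only the stop job syntax comes from this guide.

```python
import re

# Rows copied from the job listing in the example above.
SAMPLE_JOBS = """\
Job ID          Date                       Type                  Status           Creator
4               2005-09-14T14:29:20-0700   Server Power Off      Completed        root
3               2005-09-09T13:01:35-0700   Discovery             Running          root
2               2005-09-09T12:38:16-0700   Discovery             Completed        root
"""


def parse_job_table(text):
    """Parse captured 'show job all' output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        # Columns are assumed to be separated by two or more whitespace characters.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 5 and fields[0].isdigit():
            job_id, date, job_type, status, creator = fields
            rows.append({"id": int(job_id), "date": date, "type": job_type,
                         "status": status, "creator": creator})
    return rows


def running_jobs(rows):
    """Return the IDs of jobs that are still running and cannot yet be deleted."""
    return [row["id"] for row in rows if row["status"] == "Running"]


def stop_commands(rows):
    """Build one 'stop job' command line per running job."""
    return [f"stop job {job_id}" for job_id in running_jobs(rows)]


for command in stop_commands(parse_job_table(SAMPLE_JOBS)):
    print(command)
```

The sketch only builds command strings; typing them at the N1-ok prompt is left to the administrator.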

Managing Event Log Entries

This section describes events and how they are integral to monitoring your servers.

Events are generated when certain conditions related to attributes occur. Each event has an associated topic. For example, when a server is discovered by the management server, an event is generated with the topic Action.Physical.Discovered. For a complete list of event topics, see create notification in Sun N1 System Manager 1.1 Command Line Reference Manual.

Monitoring is tied to the broadcasting of events for each monitored server or group of servers. When a monitored attribute is polled and its value falls outside the default or user-defined safe range set by its thresholds, an event is generated and a status is issued.

If monitoring is enabled for a server and a notification rule has been added for the event, the management server emits a notification for that event.
If monitoring is disabled for a server, monitoring events are not generated for that server. You might want to disable monitoring of a hardware component to perform maintenance tasks without generating events.

See Introduction to Monitoring for more information about monitoring.

See Setting Up Notifications for more information about notifications.

Lifecycle events continue to be generated, even with monitoring disabled. Lifecycle events include server discovery, server change or deletion, and server group creation. If you have requested notification of this type of event, you can still receive notifications even with monitoring disabled.

Logs are created when events occur. For example, if any monitored IP address is unreachable, an event is generated. This event creates a log record, which is visible from the browser interface.

Event Log Overview

During the installation and configuration of the N1 System Manager, you can configure which events to log and you can also interactively configure severity levels for event topics. See Configuring the N1 System Manager System in Sun N1 System Manager 1.1 Installation and Configuration Guide.

Even if an event is not saved to the log, the event can still generate a notification.

Use the show command with the log keyword to view the following information about events:

Date – The date and time of the event.
Subject – The server on which the event occurred.
Topic – The topic of the event, which can be useful for setting up notifications. Refer to Setting Up Notifications for information.
Severity – Relative severity of the event.
Level – Relative level of the event.
Source – The name of the component that generated the event. For events that are generated during the execution of a job, the source is the job number.
Role – Role or user name of the user who initiated the event.
Message – Complete text of the event log message.

The n1smconfig script, which is stored in /opt/sun/n1gc/bin, can be used to change the number of days for which logs are kept. Reducing this number reduces the average size of the log files, which helps ensure that log file size does not impair performance.

To configure logging, you must specify an event category and a resource category. The following event categories are defined:

Action
Ereport
Lifecycle
List
Problem
Statistic
all

Use the all event category to indicate that all events are to be logged. To understand how other event categories relate to actual events, see the notification topics at create notification in Sun N1 System Manager 1.1 Command Line Reference Manual.
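
As a rough illustration of how a topic might relate to an event category, the following Python sketch maps a topic to a category by comparing the first dot-separated component of the topic, ignoring case. That matching rule is an assumption made for this sketch, based only on topics that appear in this chapter, such as Action.Physical.Discovered. The authoritative mapping is the topic list in the Command Line Reference Manual. The function name category_of is hypothetical.

```python
# Event categories as listed in this section ("all" is a wildcard, not a category).
EVENT_CATEGORIES = ("Action", "Ereport", "Lifecycle", "List", "Problem", "Statistic")


def category_of(topic):
    """Guess the event category of a topic from its first component.

    ASSUMPTION: the category is the first dot-separated component of the
    topic, compared without regard to case. Returns None if nothing matches.
    """
    prefix = topic.split(".", 1)[0].lower()
    for category in EVENT_CATEGORIES:
        if category.lower() == prefix:
            return category
    return None


print(category_of("Action.Physical.Discovered"))
print(category_of("EReport.Physical.ThresholdExceeded"))
```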

To View the Event Log

Steps

Type the following command:
N1-ok> show log [count count]
The event log appears, with the most recent events listed first. The value for the count attribute is the number of events to show in the output. The default value for count is 500. See show log in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

To Filter the Event Log

Steps

Type the following command:
N1-ok> show log [severity severity] [before date] [after date]
The output shows only the events that match the specified criteria. The date variable values must be formatted appropriately, for example, 2005-07-20T11:53:04. The possible values for severity are critical, fatal, information, major, minor, other, unknown, and warning. See show log in Sun N1 System Manager 1.1 Command Line Reference Manual for details.
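
Because a mistyped severity or date is easy to miss, a filter can be checked before it is typed. The following Python sketch builds a show log command line from the severity values listed above and validates dates against the form of the example 2005-07-20T11:53:04. The function name show_log_command is hypothetical, and the sketch accepts only that one date form; whether the command accepts other forms is not covered here.

```python
from datetime import datetime

# Severity values as listed in this procedure.
SEVERITIES = {"critical", "fatal", "information", "major",
              "minor", "other", "unknown", "warning"}

# Matches the example date 2005-07-20T11:53:04.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def show_log_command(severity=None, before=None, after=None):
    """Build a 'show log' filter command, validating each argument."""
    parts = ["show log"]
    if severity is not None:
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        parts.append(f"severity {severity}")
    for keyword, value in (("before", before), ("after", after)):
        if value is not None:
            # strptime raises ValueError if the date is not in the expected form.
            datetime.strptime(value, DATE_FORMAT)
            parts.append(f"{keyword} {value}")
    return " ".join(parts)


print(show_log_command(severity="major", after="2005-07-20T11:53:04"))
```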

To View Event Details

Steps

Type the following command:
N1-ok> show log log
The details of the event appear in the output. The log variable is the log ID. See show log in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–15 Viewing Event Details

N1-ok> show log 72
ID:       72
Date:     2005-03-15T13:35:59-0700
Subject:  RemoteCmdPlan
Topic:    Action.Logical.JobStarted
Severity: Information
Level:    FINE
Source:   Job Service
Role:     root
Message:  RemoteCmdPlan job initiated by root: job ID = 15.

Setting Up Notifications

The N1 System Manager can send email or SNMP notifications when events occur, either within the N1 System Manager itself or on provisionable servers. You can set up customized notification rules for as many different scenarios as you need. Notifications can be set up only through the command line.

Use the create notification command to create notification rules for events that interest you, whether they have already occurred or might occur. Each notification rule is based on an event topic.

For setting up notifications using SNMP traps, use the SNMP MIB located at /opt/sun/n1gc/etc/SUN-N1SM-TRAP-MIB.mib. For more information about SNMP MIBs, see Monitoring MIBs.

A notification rule sends a notification of each event of a given type to a selected destination, using either email or SNMP as the communication medium. For example, you can create a notification rule so that each time a new provisionable server is discovered by the management server, you receive a message on your pager to indicate that the event has happened. The command syntax is as follows:

create notification notification destination destination topic topic 
type type [description description]

See create notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details of the terms used in this command syntax.

You can configure the SMTP server that is used for event notification during the installation and configuration of the N1 System Manager. See Configuring the N1 System Manager System in Sun N1 System Manager 1.1 Installation and Configuration Guide.

Viewing and Modifying Notifications

Use the show and set commands with the notification option to view and modify notification details. Type help show notification or help set notification at the N1-ok command line for syntax and parameter details.

To View Notifications

Steps

Type the following command:
N1-ok> show notification all
The notifications for which you have read privileges appear in the output. See show notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

To View Notification Details

Steps

Type the following command:
N1-ok> show notification notification
The specified notification details appear in the output. See show notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–16 Viewing Notification Details

N1-ok> show notification test2
Name:          test2
Event Topic:   EReport.Physical.ThresholdExceeded
Notifier Type: Email
Destination:   nobody@sun.com
State:         enabled

To Modify a Notification

This procedure describes how to change the name, description, or destination of a notification.

Steps

Type the following command:
N1-ok> set notification notification name name description description destination destination
The specified notification attributes are set to the new values specified. See set notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

Example 5–17 Modifying a Notification Name

This example shows how to use the set notification command with the name option to change a notification name from test2 to test3.

N1-ok> set notification test2 name test3

Creating, Testing, and Deleting Notifications

Use the create or delete command with the notification option to create and delete notifications.

Use the start command with the notification option and the test subcommand to test a notification.

Type help create notification or help delete notification at the N1-ok command line for syntax and parameter details.

To Create and Test a Notification

Steps

Type the following command:
N1-ok> create notification notification topic topic type type destination destination
The notification is created and enabled. See create notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details and valid topics.

Type the following command:
N1-ok> start notification notification test
A test notification message is sent. See start notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.
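
When many notification rules are needed, the two commands in this procedure can be generated rather than typed by hand. The following Python sketch assembles them in the word order used by the examples in this section. It assumes that email and snmp are the only notifier types, which are the two types this chapter describes, and it does not handle the optional description argument. The function names are hypothetical.

```python
# Notifier types described in this chapter.
NOTIFIER_TYPES = {"email", "snmp"}


def create_notification_command(name, topic, notifier_type, destination):
    """Build a 'create notification' command line from its four arguments."""
    if notifier_type not in NOTIFIER_TYPES:
        raise ValueError(f"unsupported notifier type: {notifier_type!r}")
    for label, value in (("name", name), ("topic", topic),
                         ("destination", destination)):
        # Reject empty values and embedded whitespace, which would split the argument.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid {label}: {value!r}")
    return (f"create notification {name} destination {destination} "
            f"topic {topic} type {notifier_type}")


def start_test_command(name):
    """Build the command that sends a test message for an existing notification."""
    return f"start notification {name} test"


print(create_notification_command(
    "test2", "EReport.Physical.ThresholdExceeded", "email", "nobody@sun.com"))
print(start_test_command("test2"))
```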

Example 5–18 Creating an Email Notification

This example shows how to create a notification to be sent by email if a physical threshold value is exceeded. The notification is called test2. The recipient's email address is nobody@sun.com.

N1-ok> create notification test2 destination nobody@sun.com
topic EReport.Physical.ThresholdExceeded type email

The show notification command can be used to verify that the notification has been created.

N1-ok> show notification
Name    Event Topic                         Destination       State
test2   EReport.Physical.ThresholdExceeded  nobody@sun.com    enabled

Example 5–19 Creating an SNMP Notification

This example shows how to create a notification to be sent by SNMP if a physical threshold value is exceeded. The notification is called test23. The recipient SNMP address is sun.com.

N1-ok> create notification test23 destination sun.com
topic EReport.Physical.ThresholdExceeded type snmp

The show notification command can be used to verify that the notification has been created.

N1-ok> show notification
Name    Event Topic                         Destination  State
test23  EReport.Physical.ThresholdExceeded  sun.com      enabled

To Delete a Notification

Steps

Type the following command:
N1-ok> delete notification notification
The notification is deleted.

Starting and Stopping Notifications

Notifications are enabled, or started, by default at creation. Use the start command with the notification option to enable a notification that has been disabled. Type help start notification at the N1-ok command line for syntax and parameter details.

To Start a Notification

Steps

Type the following command:
N1-ok> start notification notification
The notification is enabled. See start notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.

To Stop a Notification

Steps

Type the following command:
N1-ok> stop notification notification
The notification is disabled. See stop notification in Sun N1 System Manager 1.1 Command Line Reference Manual for details.
