1 Monitoring

Because of the size, complexity, and criticality of today's enterprise IT operations, the challenge for IT professionals is to maintain high levels of availability and performance for both applications and all components that make up the application's technology stack. Monitoring the performance of these components and quickly correcting problems before they can impact business operations is crucial. Enterprise Manager provides comprehensive, flexible, easy-to-use monitoring functionality that supports the timely detection and notification of impending IT problems across your enterprise.

This chapter covers the following topics:

Systems Monitoring: Breadth and Depth
Monitoring Basics
Monitoring Templates
User-Defined Metrics
Accessing Monitoring Information

Systems Monitoring: Breadth and Depth

Enterprise Manager system monitoring features provide increased out-of-box value, automation, and grid monitoring support to enable IT organizations to maximize operational efficiencies and provide high quality services. For applications that are built on Oracle, Enterprise Manager offers the most comprehensive monitoring of the Oracle Grid environment, from Oracle Database instances to Oracle Real Application Clusters to Oracle Application Server Farms and Clusters. To support the wide variety of applications built on Oracle, Enterprise Manager expands its monitoring scope to non-Oracle components, such as third-party application servers, hosts, firewalls, server load balancers, and storage.

Enterprise Manager provides the most comprehensive management features for all Oracle products. For example, Enterprise Manager's monitoring functionality is tightly integrated with Oracle Database 10g manageability features such as server-generated alerts. These alerts are generated by the database itself about problems it has self-detected. Server-generated alerts can be managed from the Enterprise Manager console and include recommendations on how problems can be resolved. Performance problems such as poorly performing SQL and corresponding recommendations that are generated by the database's self-diagnostic engine, called Automatic Database Diagnostic Monitor (ADDM), are also captured and exposed through the Enterprise Manager console. This allows Enterprise Manager administrators to implement ADDM recommendations with ease and convenience.

Enterprise Manager also makes it easy to expand the scope of system monitoring beyond individual components. Using Enterprise Manager's group management functionality, you can easily organize monitorable targets into groups, allowing you to monitor and manage many components as one.

Monitoring Basics

System monitoring functionality permits unattended monitoring of your IT environment. Enterprise Manager comes with a comprehensive set of performance and health metrics that allows monitoring of key components in your environment, such as applications, application servers, databases, as well as the back-end components on which they rely (hosts, operating systems, storage, and so on).

The Management Agent on each monitored host monitors the status, health, and performance of all managed components (also referred to as targets) on that host. If a target goes down, or if a performance metric crosses a warning or critical threshold, an alert is generated and sent to Enterprise Manager and to Enterprise Manager administrators who have registered interest in receiving such notifications. Systems monitoring functionality and the mechanisms that support this functionality are discussed in the following sections.

When it is not practical to have a Management Agent present to monitor specific components of your IT infrastructure, as might be the case with an IP traffic controller or remote Web application, Enterprise Manager provides Extended Network and Critical URL Monitoring functionality. This feature allows the Beacon functionality of the Agent to monitor remote network devices and URLs for availability and responsiveness without requiring an Agent to be physically present on that device. You simply select a specific Beacon, and add key network components and URLs to the Network and URL Watch Lists. More information about using this feature is available in the Enterprise Manager online help.

Out-of-Box Monitoring

Enterprise Manager's Management Agents automatically start monitoring their host's systems (including hardware and software configuration data on these hosts) as soon as they are deployed and started. Enterprise Manager provides auto-discovery scripts that enable these Agents to automatically discover all Oracle components and start monitoring them using a comprehensive set of metrics at Oracle-recommended thresholds. This monitoring functionality includes other components of the Oracle ecosystem such as NetApp Filer, BIG-IP load balancers, Checkpoint Firewall, and IBM WebSphere and BEA WebLogic application servers. Metrics from all monitored components are stored and aggregated in the Management Repository, providing administrators with a rich source of diagnostic information and trend analysis data. When critical alerts are detected, notifications are sent to administrators for rapid resolution.

Out-of-box, Enterprise Manager monitoring functionality provides:

In-depth monitoring with Oracle-recommended metrics and thresholds.
Access to real-time performance charts.
Collection, storage, and aggregation of metric data in the Management Repository. This allows you to perform strategic tasks such as trend analysis and reporting.
E-mail notification for detected critical alerts.

Enterprise Manager can monitor a wide variety of components (such as databases, hosts, and routers) within your IT infrastructure.

Some examples of monitored metrics are:

Archive Area Used (Database)
Component Memory Usage (Application Server)
Segments Approaching Maximum Extents Count (Database)
Network Interface Total I/O Rate (Host)

Some metrics have associated predefined limiting parameters called thresholds that cause alerts to be triggered when collected metric values exceed these limits. Enterprise Manager allows you to set metric threshold values for two levels of alert severity:

Warning - Attention is required in a particular area, but the area is still functional.
Critical - Immediate action is required in a particular area. The area is either not functional or indicative of imminent problems.

Hence, thresholds are boundary values against which monitored metric values are compared. For example, for each disk device associated with the Disk Utilization (%) metric, you might define a warning threshold at 80% disk space used and critical threshold at 95%.
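The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Enterprise Manager code: the function name classify and the severity strings are hypothetical, and the sketch assumes a metric where higher values are worse and where reaching a threshold counts as crossing it.

```python
def classify(value, warning=80.0, critical=95.0):
    """Return the alert severity for a metric where higher values are worse."""
    # Check the critical threshold first so it takes precedence over warning.
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "clear"


# Disk Utilization (%) readings for three hypothetical disk devices.
readings = {"/dev/sda1": 42.0, "/dev/sdb1": 86.5, "/dev/sdc1": 97.2}
for device, used in readings.items():
    print(device, classify(used))
```

With the 80% and 95% thresholds from the example, the three devices evaluate to clear, warning, and critical respectively.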

Metric Thresholds

As mentioned earlier, some metric thresholds come predefined out-of-box. While these values are acceptable for most monitoring conditions, your environment may require that you customize threshold values to more accurately reflect the operational norms of your environment. Setting accurate threshold values, however, may be more challenging for certain categories of metrics such as performance metrics.

For example, what are appropriate warning and critical thresholds for the Response Time Per Transaction database metric? For such metrics, it might make more sense to be alerted when the monitored values for the performance metric deviate from normal behavior. Enterprise Manager provides features to enable you to capture normal performance behavior for a target and determine thresholds that are deviations from that performance norm.

Note:

Enterprise Manager administrators must be granted OPERATOR or greater privilege on a target in order to perform any metric threshold changes.

Metric Snapshots

A metric snapshot is a named collection of a target's performance metrics that have been collected at a specific point in time. A metric snapshot can be used as an aid in calculating metric threshold values based on the target's past performance.

The key in defining a metric snapshot for a target is to select a date during which target performance was acceptable under typical workloads. Given this date, actual values of the performance metrics for the target are retrieved and these represent what is normal or expected performance behavior for the target. Using these values, you can then use Enterprise Manager to calculate warning and critical thresholds for the metrics that are a specified percentage 'worse' than the actual metric snapshot values. These represent values which, when crossed, could indicate performance problems. After thresholds are calculated, you can still edit the calculated values if needed.
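The threshold calculation described above amounts to scaling each snapshot value by a chosen percentage. The following Python sketch shows the arithmetic only. The function name, the metric names, and the percentages are hypothetical, and it assumes metrics where a higher value is the worse direction.

```python
def thresholds_from_snapshot(snapshot, warning_pct, critical_pct):
    """Derive thresholds that are a given percentage worse than snapshot values.

    Assumes higher metric values are worse.
    """
    return {
        metric: {
            "warning": value * (1 + warning_pct / 100),
            "critical": value * (1 + critical_pct / 100),
        }
        for metric, value in snapshot.items()
    }


# Hypothetical snapshot values taken on a day with acceptable performance.
snapshot = {"cpu_utilization_pct": 40.0, "response_time_ms": 200.0}
calculated = thresholds_from_snapshot(snapshot, warning_pct=25, critical_pct=50)
for metric, levels in calculated.items():
    print(metric, levels)
```

For a snapshot response time of 200 ms, a warning threshold 25% worse is 250 ms and a critical threshold 50% worse is 300 ms. As the text notes, calculated values like these remain editable afterward.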

You can define a metric snapshot for a target based on a date and (optionally) time. If you only specify a date, the metric snapshot is the set of average daily values of the target's performance metrics for that date. If you also specify an hour within the date, then the metric snapshot is the set of Low and High metric values for the preceding hour.

Metric snapshots apply to all monitored targets except 10.2 or higher databases, Services, and Web applications. For these targets, the Metric Baseline feature is supported.

Metric Baselines

Metric baselines are statistical characterizations of system performance over well-defined time periods. Metric baselines can be used to implement adaptive alert thresholds for certain performance metrics as well as provide normalized views of system performance. Adaptive alert thresholds are used to detect unusual performance events. Baseline normalized views of metric behavior help administrators explain and understand such events.

Metric baselines are well-defined time intervals (baseline periods) over which Enterprise Manager has captured system performance metrics. The underlying assumption of metric baselines is that systems with relatively stable performance should exhibit similar metric observations (that is, values) over times of comparable workload. Two types of baseline periods are supported: moving window baseline periods and static baseline periods.

Moving window baseline periods are defined as some number of days prior to the current date (example: Last 7 days). This allows comparison of current metric values with recently observed history. Moving window baselines are useful for operational systems with predictable workload cycles (example: OLTP days and batch nights).

Static baselines are periods of time that you define that are of particular interest to you (example: end of the fiscal year). These baselines can be used to characterize workload periods for comparison against future occurrences of that workload (example: compare end of the fiscal year from one calendar year to the next).

Adaptive Thresholds

Once metric baselines are defined, they can be used to establish alert thresholds that are statistically significant and adapt to expected variations across time. For example, you can have alert thresholds generated based on a significance level: at the HIGH significance level, thresholds are set at values that occur 5 in 100 times. Alternatively, you can generate thresholds based on a percentage of the maximum value observed within the baseline period. These can be used to generate alerts when performance metric values are observed to exceed normal peaks within that period.
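The two approaches can be illustrated with a small Python sketch. The statistical method Enterprise Manager actually uses is not described here, so the nearest-rank percentile below is an assumption, as are the function names and the sample data.

```python
import math


def significance_threshold(baseline_values, occurrences_per_100):
    """Nearest-rank value exceeded by about `occurrences_per_100` in 100 observations."""
    ordered = sorted(baseline_values)
    # Smallest rank with at least (100 - occurrences_per_100)% of values at or below it.
    rank = math.ceil(len(ordered) * (100 - occurrences_per_100) / 100)
    return ordered[max(rank, 1) - 1]


def percent_of_max_threshold(baseline_values, percent):
    """Threshold expressed as a percentage of the baseline maximum."""
    return max(baseline_values) * percent / 100


# Hypothetical baseline: 100 observations with values 1 through 100.
baseline = list(range(1, 101))
high = significance_threshold(baseline, occurrences_per_100=5)
peak = percent_of_max_threshold(baseline, percent=120)
print(high, peak)
```

In this sample, exactly 5 of the 100 baseline observations exceed the significance-based threshold, while the percentage-of-maximum threshold sits 20% above the highest value seen in the baseline period.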

Baseline Normalized Views

Enterprise Manager provides charts which graphically display the values of observed performance and workload metrics normalized against the baseline. Using these charts, statistically significant values are easily seen as 'blips' in the charts. These allow administrators to easily perform time-correlation of events. For example, performance events can be related to significantly increased demand or significantly unusual workload.

Metric baselines are supported for databases (version 10.2 or higher) and for Services and Web Application target types.

Alerts

When a metric threshold value is reached, an alert is generated. An alert indicates a potential problem; either a warning or critical threshold for a monitored metric has been crossed. An alert can also be generated for various target availability states, such as:

Target is down.
Oracle Management Agent monitoring the target is unreachable.

When an alert is generated, you can access details about the alert from the Enterprise Manager console. See "Accessing Monitoring Information" for more information on viewing alert information.

Enterprise Manager provides various options to respond to alerts. Administrators can be automatically notified when an alert triggers and/or corrective actions can be set up to automatically resolve an alert condition.

Notifications

When a target becomes unavailable or if thresholds for performance are crossed, alerts are generated in the Enterprise Manager console and notifications are sent to the appropriate administrators. Enterprise Manager supports notifications via e-mail (including e-mail-to-page systems), SNMP traps, and/or by running custom scripts.

Enterprise Manager supports these various notification mechanisms via notification methods. A notification method is used to specify the particulars associated with a specific notification mechanism, for example, which SMTP gateway(s) to use for e-mail, which OS script to run to log trouble-tickets, and so on. Super Administrators perform a one-time setup of the various types of notification methods available for use. Once defined, other administrators can create notification rules that specify the set of criteria that determines when a notification should be sent and how it should be sent. The criteria defined in notification rules include the targets, metrics and severity states (clear, warning or critical) and the notification method that should be used when an alert occurs that matches the criteria. For example, you can define a notification rule that specifies e-mail should be sent to you when CPU Utilization on any host target is at critical severity, or another notification rule that creates a trouble-ticket when any database is down. Once a notification rule is defined, it can be made public for sharing across administrators. For example, administrators can subscribe to the same rule if they are interested in receiving alerts for the same criteria defined in the rule. Alternatively, an Enterprise Manager Super Administrator can assign notification rules to other administrators such that they receive notifications for alerts as defined in the rule.
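The relationship between alerts, notification rules, and notification methods can be pictured with a small Python model. This is a conceptual sketch only: the class names, fields, and method names are hypothetical and do not correspond to any Enterprise Manager API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    target: str
    target_type: str
    metric: str
    severity: str  # "clear", "warning", or "critical"


@dataclass(frozen=True)
class NotificationRule:
    name: str
    target_type: str
    metric: str
    severities: frozenset
    method: str  # name of a previously defined notification method

    def matches(self, alert):
        """True when the alert meets every criterion in this rule."""
        return (
            alert.target_type == self.target_type
            and alert.metric == self.metric
            and alert.severity in self.severities
        )


def methods_for(alert, rules):
    """Return the notification methods of all rules that match the alert."""
    return [rule.method for rule in rules if rule.matches(alert)]


rules = [
    NotificationRule("cpu-critical", "host", "CPU Utilization (%)",
                     frozenset({"critical"}), "email"),
    NotificationRule("disk-any", "host", "Disk Utilization (%)",
                     frozenset({"warning", "critical"}), "trouble-ticket-script"),
]
alert = Alert("host01", "host", "CPU Utilization (%)", "critical")
print(methods_for(alert, rules))
```

Keeping the method (how to notify) separate from the rule (when to notify) mirrors the division described above: methods are set up once, and many rules can reuse them.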

Notifications are not limited to alerting administrators. Notification methods can be extended to execute any custom OS script or PL/SQL procedure, and thus can be used to automate any type of alert handling. For example, administrators can define notification methods that call into a trouble ticketing system, invoke third-party APIs to share alert information with other monitoring systems, or log a bug against a product.

Customizing Notifications

Notifications that are sent to Administrators can be customized based on message type and on-call schedule. Message customization is useful for administrators who rely on both e-mail and paging systems as a means for receiving notifications. The message formats for these systems typically vary—messages sent to e-mail can be lengthy and can contain URLs, and messages sent to a pager are brief and limited to a finite number of characters. To support these types of mechanisms, Enterprise Manager allows administrators to associate a long or short message format with each e-mail address. E-mail addresses that are used to send 'regular' e-mails can be associated with the 'long' format; e-mail addresses that are used to send pages can be associated with the 'short' format. The 'long' format contains full details about the alert; the 'short' format contains the most critical pieces of information.

Notifications can also be customized based on an administrator's on-call schedule. An administrator who is on-call might want to be contacted by both pager and work e-mail address during business hours and only by pager during off hours. Enterprise Manager offers a flexible notification schedule to support the wide variety of on-call schedules. Using this schedule, administrators define their on-call schedule by specifying the e-mail addresses by which they should be contacted when they are on-call. For periods where they are not on-call, or do not wish to receive notifications for alerts, they simply leave that part of the schedule blank. All alerts that are sent to an administrator automatically adhere to that administrator's specified schedule.
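A notification schedule of this kind can be thought of as a lookup from a time window to a list of addresses. The Python sketch below is illustrative only. The hour windows, the addresses, and the function name are hypothetical, and a real schedule would also vary by day of the week.

```python
from datetime import datetime

# Hypothetical schedule: (first_hour, last_hour_exclusive) -> addresses to use.
SCHEDULE = {
    (9, 17): ["admin@example.com", "pager-admin@example.com"],  # business hours
    (17, 24): ["pager-admin@example.com"],                      # off hours
    # 00:00 to 09:00 is left blank, so no notifications are sent then.
}


def addresses_at(when, schedule=SCHEDULE):
    """Return the addresses to notify at the given time, or an empty list."""
    for (start, end), addresses in schedule.items():
        if start <= when.hour < end:
            return addresses
    return []


print(addresses_at(datetime(2024, 1, 1, 10, 0)))
print(addresses_at(datetime(2024, 1, 1, 3, 0)))
```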

Corrective Actions

Corrective actions allow you to specify automated responses to alerts. Corrective actions ensure that routine responses to alerts are automatically executed, thereby saving administrator time and ensuring problems are dealt with before they noticeably impact users. For example, if Enterprise Manager detects that a component, such as the SQL*Net listener, is down, a corrective action can be specified to automatically start it back up. A corrective action is thus any task you specify that will be executed when a metric triggers a warning or critical alert severity. By default, the corrective action runs on the target on which the alert has triggered. Administrators can also receive notifications for the success or failure of corrective actions.

A corrective action can also consist of multiple tasks, with each task running on a different target. For example, if an Oracle Application Server's J2EE container (called an OC4J container) triggers a warning alert indicating it is approaching its limit on the number of requests it can handle, a corrective action can be defined to automatically start up another OC4J container on another host, thereby sharing application load among different containers. As shown by this example, corrective actions can be used to dynamically allocate resources as demand increases, thereby preventing performance bottlenecks before they impact overall application availability.

Corrective actions for a target can be defined by all Enterprise Manager administrators who have been granted OPERATOR or greater privilege on the target. For any metric, you can define different corrective actions when the metric triggers at warning severity or at critical severity.

Corrective actions must run using the credentials of a specific Enterprise Manager administrator. For this reason, whenever a corrective action is created or modified, the credentials that the modified action will run with must be specified.

Blackouts

Blackouts allow you to support planned outage periods to perform emergency or scheduled maintenance. When a target is put under blackout, monitoring is suspended, thus preventing unnecessary alerts from being sent when you bring down a target for scheduled maintenance operations such as database backup or hardware upgrade. Blackout periods are automatically excluded when calculating a target's overall availability.

A blackout period can be defined for individual targets, a group of targets or for all targets on a host. The blackout can be scheduled to run immediately or in the future, and to run indefinitely or stop after a specific duration. Blackouts can be created on an as-needed basis, or scheduled to run at regular intervals. If, during the maintenance period, you discover that you need more (or less) time to complete maintenance tasks, you can easily extend (or stop) the blackout that is currently in effect. Blackout functionality is available from both the Enterprise Manager console and the Enterprise Manager command-line interface (EMCLI). The EMCLI is often useful for administrators who would like to incorporate the blacking out of a target within their maintenance scripts. When a blackout ends, the Management Agent automatically re-evaluates all metrics for the target to provide current status of the target post-blackout.

If an administrator inadvertently performs scheduled maintenance on a target without first putting the target under blackout, these periods would be reflected as target downtime instead of planned blackout periods. This has an adverse impact on the target's availability records. In such cases, Enterprise Manager allows Super Administrators to go back and define the blackout period that should have happened at that time. The ability to create these retroactive blackouts provides Super Administrators with the flexibility to define a more accurate picture of target availability.
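The effect of a blackout on availability records can be shown with simple arithmetic. The exact formula Enterprise Manager uses is not given here, so the Python sketch below assumes availability is uptime divided by monitored (non-blackout) time. The function name and the figures are hypothetical.

```python
def availability_pct(total_hours, down_hours, blackout_hours=0.0):
    """Percentage of non-blackout time during which the target was up."""
    monitored = total_hours - blackout_hours
    if monitored <= 0:
        raise ValueError("no monitored time in the period")
    return 100.0 * (monitored - down_hours) / monitored


# A 30-day period (720 hours) with 4 hours of planned maintenance.
without_blackout = availability_pct(720, down_hours=4)
with_blackout = availability_pct(720, down_hours=0, blackout_hours=4)
print(round(without_blackout, 2), round(with_blackout, 2))
```

Recording the 4 maintenance hours as downtime lowers availability to about 99.44%, whereas covering them with a blackout, including a retroactive one, leaves it at 100%.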

Monitoring Templates

Monitoring templates simplify the task of standardizing monitoring settings across your enterprise by allowing you to specify the monitoring settings once and apply them to your monitored targets. This makes it easy for you to apply specific monitoring settings to specific classes of targets throughout your enterprise. For example, you can define one monitoring template for test databases and another monitoring template for production databases.

A monitoring template defines all Enterprise Manager parameters you would normally set to monitor a target, such as:

Target type to which the template applies.
Metrics (including user-defined metrics), thresholds, metric collection schedules, and corrective actions.

When a change is made to a template, you can reapply the template across affected targets in order to propagate the new changes. You can reapply the monitoring templates as often as needed. For any target, you can preserve custom monitoring settings by specifying metric settings that can never be overwritten by a template.
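Applying a template while preserving selected settings behaves much like a dictionary merge with a protected set of keys. The following Python sketch models that idea. The function name, metric names, and threshold tuples are hypothetical and do not reflect how Enterprise Manager stores monitoring settings.

```python
def apply_template(target_settings, template, protected=frozenset()):
    """Return new settings where template values win, except for protected metrics."""
    result = dict(target_settings)
    for metric, setting in template.items():
        if metric not in protected:
            result[metric] = setting
    return result


# (warning, critical) thresholds per metric.
template = {"CPU Utilization (%)": (80, 95), "Disk Utilization (%)": (80, 95)}
target = {"CPU Utilization (%)": (70, 90), "Disk Utilization (%)": (60, 75)}

# The disk thresholds on this target are custom and must never be overwritten.
updated = apply_template(target, template, protected={"Disk Utilization (%)"})
print(updated)
```

Because the function returns a new dictionary, reapplying the template after a later change is the same call again with the revised template.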

Comparing Differences Between Targets and Monitoring Templates

Deciding how and when to apply a template is simplified by using the Compare Monitoring Template feature. This feature allows you to see at a glance how metric and policy settings defined in a template differ from those defined on the destination target. Compare Monitoring Template is especially useful when working with aggregate targets such as groups and systems. For example, after you apply a Monitoring Template to a group, you want to verify that the group members now have the same monitoring settings as the template. The Compare Monitoring Template feature makes checking simple. You can also schedule this as a report, allowing you to check periodically if the group members still follow the template settings.
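Conceptually, the comparison reports every setting where the target differs from the template. A minimal Python sketch of such a check follows. The function name and the settings layout are hypothetical and are not the output format of the Compare Monitoring Template feature.

```python
def compare_to_template(target_settings, template):
    """Map each differing metric to a (template_value, target_value) pair."""
    return {
        metric: (expected, target_settings.get(metric))
        for metric, expected in template.items()
        if target_settings.get(metric) != expected
    }


template = {"CPU Utilization (%)": (80, 95), "Disk Utilization (%)": (80, 95)}
group_members = {
    "host01": {"CPU Utilization (%)": (80, 95), "Disk Utilization (%)": (80, 95)},
    "host02": {"CPU Utilization (%)": (70, 90)},
}

# Report, per group member, which settings no longer follow the template.
for member, settings in group_members.items():
    print(member, compare_to_template(settings, template))
```

An empty result for a member means its settings match the template. A metric missing on the target shows up with a target value of None.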

User-Defined Metrics

User-defined metrics allow you to extend the reach of Enterprise Manager's monitoring to conditions specific to particular environments via custom scripts or SQL queries and function calls. Once a user-defined metric is defined, it will be monitored, aggregated in the repository, and can trigger alerts like any other metric in Enterprise Manager. There are two types of user-defined metrics: Operating System and SQL.

Operating System (OS) User-Defined Metrics: Accessed from Host target home pages, these user-defined metrics allow you to implement custom monitoring functions via OS scripts.
SQL User-Defined Metrics: Accessed from the Database target home pages, these user-defined metrics allow you to implement custom database monitoring using SQL queries or function calls.

Creating a User-Defined Metric

To monitor a particular condition (example: check successful completion of monthly system maintenance routines), you can write a custom OS script to monitor that condition, then register it as a user-defined metric in Enterprise Manager. Each time the metric is evaluated by Enterprise Manager, it uses this script to evaluate the condition. SQL user-defined metrics do not use external scripts: you enter SQL directly into the Enterprise Manager console at the time of metric creation. Once a user-defined metric is defined, all other monitoring features, such as alerts, notifications, historical collections, and corrective actions are automatically available to it.

If you already have your own library of custom monitoring scripts, you can leverage Enterprise Manager's monitoring features by integrating these scripts with Enterprise Manager as OS user-defined metrics. Likewise, existing SQL queries or function calls currently used to monitor database conditions can be easily integrated into Enterprise Manager's monitoring framework as SQL user-defined metrics. For more information about user-defined metrics, see Oracle Enterprise Manager Advanced Configuration.
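As an illustration, an OS user-defined metric script for the maintenance check mentioned above might look like the following Python sketch. The marker file path is hypothetical. The em_result and em_message output tags are stated here as an assumption about the convention for returning a value and a message to Enterprise Manager, so verify them against Oracle Enterprise Manager Advanced Configuration before relying on them.

```python
import os

# Hypothetical marker file written by the monthly maintenance routine.
MARKER = "/var/tmp/monthly_maintenance.done"


def maintenance_status(marker_path):
    """Return 1 if the maintenance marker file exists, otherwise 0."""
    return 1 if os.path.exists(marker_path) else 0


def format_result(value, message=None):
    """Build the script output; the tag names are an assumed convention."""
    lines = [f"em_result={value}"]
    if message:
        lines.append(f"em_message={message}")
    return "\n".join(lines)


if __name__ == "__main__":
    status = maintenance_status(MARKER)
    message = None if status else "Monthly maintenance has not completed"
    print(format_result(status, message))
```

A threshold on the returned value (for example, critical when the value is 0) would then raise an alert like any other metric.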

Accessing Monitoring Information

All monitoring information is accessed via the Enterprise Manager console, providing quick views into the health of your monitored environment.

Enterprise Manager Console Home Page

The Enterprise Manager console home page shown in Figure 1-1 gives you an at-a-glance view of the overall status of your monitored environment. The home page summarizes key monitoring areas such as availability across all managed targets, open alerts, policy violations, and recent problems with job executions. Links on this page allow you to drill down to detailed performance information.

The Resource Center is your central access point to Enterprise Manager documentation as well as the comprehensive technical resources of the Oracle Technology Network (OTN).

Figure 1-1 Enterprise Manager Console

This is the Enterprise Manager Grid Control console.


From the home page, you can easily access alert information. For example, you can click on the Down link in the All Targets Status legend to determine which targets are currently down. Under All Target Alerts, you can click on the Warning alerts value to access a list of warning alerts for all monitored targets (Figure 1-2).

Figure 1-2 Warning Alerts Page

This is the Warning Alerts page with system alerts.

The most recent alerts are listed first. You can change the sort order by clicking on the appropriate column header. By clicking on a specific alert message, you can drill down to explicit details about the metric in alert (Figure 1-3).

Figure 1-3 Warning Alert: Metric Details


By default, metric values shown on this page reflect the last 24 hours of collected data. You can also select another time period or specify a custom time period with which to view metric data and easily assess if the problem occurred recently or across a long time period. Because Enterprise Manager collects and aggregates metric data in the Management Repository, you can click on the Compare Targets related link to display metric data for more than one target simultaneously, thus allowing you to compare performance across multiple targets (Figure 1-4).

If you do not wish to view metrics collected over time, you can choose one of several Real Time metric refresh periods:

Manual
30 Second
1 Minute
5 Minutes

Figure 1-4 Compare Targets

The Alert History table shows alerts generated over the selected time period. You can view explicit details about a specific alert in this table by clicking on the eyeglasses icon in the Details column. Figure 1-5 shows the Alert Details page.

Figure 1-5 Alert Details

The Alert Details page shows all notifications for an alert, any corrective actions that have been executed, and any custom notifications, for example, the opening of a case ticket for an alert. On this page, you also have the option of adding annotations or comments for other administrators to see.