3 Monitoring Using Oracle Grid Control

This chapter provides best practices for using Oracle Grid Control to monitor and maintain a highly available environment across all tiers of the application stack.

This chapter contains these topics:

Overview of Monitoring and Detection for High Availability
Using Oracle Grid Control for System Monitoring
Managing the High Availability Environment with Oracle Grid Control

3.1 Overview of Monitoring and Detection for High Availability

Continuous monitoring of the system, network, database operations, application, and other system components ensures early detection of problems. Early detection improves the user's system experience because problems can be avoided or resolved faster. In addition, monitoring captures system metrics that indicate trends in system performance, growth, and recurring problems. This information can facilitate prevention, enforce security policies, and manage job processing. For the database server, a sound monitoring system must measure availability, detect events that can cause the database server to become unavailable, and provide immediate notification about critical failures to responsible parties.

The monitoring system itself must be highly available and adhere to the same operational best practices and availability practices as the resources it monitors. Failure of the monitoring system leaves all monitored systems unable to capture diagnostic data or alert the administrator about problems.

Oracle Grid Control provides management and monitoring capabilities with many different notification options. Recommendations are available for methods of monitoring the environment's availability and performance, and for using the tools in response to changes in the environment.

3.2 Using Oracle Grid Control for System Monitoring

A major benefit of Oracle Grid Control is its ability to manage components across the entire application stack, from the host operating system to a user or packaged application. Oracle Grid Control treats each of the layers in the application as a target. Targets—such as databases, application servers, and hardware—can then be viewed along with other targets of the same type, or can be grouped by application type. You can also review all targets in a single view from the HA Console (described in more detail in Section 3.3.3, "Manage Database Availability with the High Availability Console"). Each target type has a default generated home page that displays a summary of relevant details for a specific target. You can group different types of targets by function; that is, as resources that support the same application.

Every target is monitored by an Oracle Management Agent. Every Management Agent runs on a system and is responsible for a set of targets. The targets can be on a system that is different from the one that the Management Agent is on. For example, a Management Agent can monitor a storage array that cannot host an agent natively. When a Management Agent is installed on a host, the host is automatically discovered along with other targets that are on the machine.

Moreover, to help you implement the Maximum Availability Architecture (MAA) best practices, Grid Control provides the MAA Advisor (described in detail in Section 3.3.4, "Configure High Availability Solutions with MAA Advisor"). The MAA Advisor page recommends Oracle solutions for most outage types and describes the benefits of each solution.

3.2.1 Oracle Grid Control Home Page

The Oracle Grid Control home page shown in Figure 3-1 provides a picture of the availability of all discovered targets.

Figure 3-1 Oracle Grid Control Home Page


The Oracle Grid Control home page shows the following major kinds of information:

A snapshot of the current availability of all targets. The pie chart associated with availability gives the administrator an immediate indication of any target that is available (Up), unavailable (Down), or has lost communication with the console (Unknown).
An overview of how many alerts (for events) and problems (for jobs) are known in the entire monitored system. You can display detailed information by clicking the links, or by navigating to Alerts from the upper right portion of any Oracle Grid Control page.
A view of the severity and total number of policy violations for all managed targets. Drill down to determine the source and type of violation.
An overview of what is actually discovered in the system. This list can be shown at the hardware level and the Oracle level.
All Targets Jobs lists the number of scheduled, running, suspended, and problem (stopped/failed) executions for all Enterprise Manager jobs. Click the number next to the status group to view a list of those jobs.

Alerts are generated by a combination of factors and are defined on specific metrics. A metric is a data point sampled by a Management Agent and sent to the Oracle Management Repository. It could be the availability of a component through a simple heartbeat test, or an evaluation of a specific performance measurement such as "disk busy" or percentage of processes waiting for a specific wait event.

There are four states that can be checked for any metric: error, warning, critical, and clear. The administrator must make policy decisions such as:

What objects should be monitored (databases, nodes, listeners, or other services)?
What instrumentation should be sampled (such as availability, CPU percent busy)?
How frequently should the event be sampled?
What should be done when the metric exceeds a predefined threshold?

All of these decisions are predicated on the business needs of the system. For example, all components might be monitored for availability, but some systems might be monitored only during business hours. Systems with specific performance problems can have additional performance tracing enabled to debug a problem.
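The policy decisions above can be pictured as a small record plus an evaluation step. The following Python sketch is illustrative only and is not Grid Control code: the `MetricPolicy` class, the `evaluate` function, the target name `proddb1`, and the sampling interval are hypothetical, and treating "error" as a sample that could not be collected is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricPolicy:
    """One monitoring decision: what to sample, how often, and when to alert."""
    target: str             # hypothetical target name (database, node, listener)
    metric: str             # instrumentation to sample
    interval_seconds: int   # how frequently the metric is sampled
    warning: float          # value at or above which a warning is raised
    critical: float         # value at or above which a critical alert is raised


def evaluate(policy: MetricPolicy, sample: Optional[float]) -> str:
    """Map one sample to one of the four states: error, critical, warning, clear."""
    if sample is None:
        # Assumption: a missing sample means the metric could not be evaluated.
        return "error"
    if sample >= policy.critical:
        return "critical"
    if sample >= policy.warning:
        return "warning"
    return "clear"


# Hypothetical policy: sample CPU percent busy every 5 minutes.
cpu_policy = MetricPolicy("proddb1", "CPU percent busy", 300, warning=80.0, critical=95.0)

for value in (42.0, 85.5, 97.0, None):
    print(value, "->", evaluate(cpu_policy, value))
```

Keeping the policy separate from the evaluation makes it easy to vary the thresholds or the sampling interval per system, for example to monitor some systems only during business hours.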

See Also:

Oracle Enterprise Manager Concepts for more information about monitoring and using metrics in Oracle Grid Control

3.2.2 Set Up Default Notification Rules for Each System

Notification Rules are defined sets of alerts on metrics that are automatically applied to a target when it is discovered by Oracle Grid Control. For example, an administrator can create a rule that monitors the availability of database targets and generates an e-mail message if a database fails. After that rule is generated, it is applied to all existing databases and any database created in the future. Access these rules by navigating to Preferences and then choosing Rules.

The rules monitor problems that require immediate attention, such as those that can affect service availability, and Oracle or application errors. Service availability can be affected by an outage in any layer of the application stack: node, database, listener, and critical application data. A service availability failure, such as the inability to connect to the database, or the inability to access data critical to the functionality of the application, must be identified, reported, and reacted to quickly. Potential service outages such as a full archive log directory also must be addressed correctly to avoid a system outage.

Oracle Grid Control provides a series of default rules that provide a strong framework for monitoring availability. A default rule is provided for each of the preinstalled target types that come with Oracle Grid Control. You can modify these rules to conform to the policies of each individual site, and you can create rules for site-specific targets or applications. You can also set the rules to notify users during specific time periods to create an automated coverage policy.

Use the following best practices:

Modify each rule for high-value components in the target architecture to suit your availability requirements by using the rules modification wizard. For the database rule, set the events in Table 3-1, Table 3-2, and Table 3-3 for each target. The frequency of the monitoring is determined by the service-level agreement (SLA) for each component.
Use Beacon functionality to track the performance of individual applications. A Beacon can be set to perform a user transaction representative of normal application work. Enterprise Manager can then break down the response time of that transaction into its component pieces for analysis. In addition, an alert can be triggered if the execution time of that transaction exceeds a predefined limit.
Add Notification Methods and use them in each Notification Rule. By default, the easiest method for alerting an administrator to a potential problem is to send e-mail. Supplement this notification method by adding a callout to an SNMP trap or operating system script that sends an alert by some method other than e-mail. This avoids the problem that might occur if a component of the e-mail system has failed. Set additional Notification Methods by using the Setup link at the top of any Oracle Grid Control page.
Modify Notification Rules to notify the administrator when there are errors in computing target availability. This might generate a false positive reading on the availability of the component, but it ensures the highest level of notification to system administrators.
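As an illustration of the operating system script callout mentioned above, the following Python sketch records an alert in a local file so that the notification does not depend on e-mail. It is a sketch under assumptions, not a supplied script: the variable names `TARGET_NAME`, `SEVERITY`, and `MESSAGE` are assumed here and should be confirmed against the Enterprise Manager documentation for your release, and the sample values and log file name are made up.

```python
import os
import tempfile
from datetime import datetime, timezone


def format_alert(env):
    """Build a one-line alert record from notification variables."""
    # Assumption: these variable names are passed to the script by the callout.
    target = env.get("TARGET_NAME", "unknown-target")
    severity = env.get("SEVERITY", "unknown-severity")
    message = env.get("MESSAGE", "")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp} {severity} {target}: {message}"


def record_alert(env, log_path):
    """Append the alert to a local file and return the line that was written."""
    line = format_alert(env)
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


# Demonstration with made-up values.
sample_env = {
    "TARGET_NAME": "proddb1",
    "SEVERITY": "Critical",
    "MESSAGE": "Database instance is down",
}
demo_log = os.path.join(tempfile.gettempdir(), "gc_alert_demo.log")
print(record_alert(sample_env, demo_log))
```

A real callout script would pass `os.environ` instead of `sample_env`, and could forward the line to a paging gateway or another channel instead of a file.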

See Also:

Oracle Enterprise Manager Concepts for conceptual information about Beacons
Oracle Enterprise Manager Advanced Configuration for information about configuring service tests and Beacons

Figure 3-2 shows the Edit Notification Rule property page for choosing availability states, with the Down option chosen.

Figure 3-2 Setting Notification Rules for Availability


In addition, modify the metrics monitored by the database rule to report the metrics shown in Table 3-1, Table 3-2, and Table 3-3. This ensures that these metrics are captured for all database targets and that trend data is available for future analysis. All events described in Table 3-1, Table 3-2, and Table 3-3 can be accessed from the Database Homepage by choosing Metrics and Policy Settings.

Use the events shown in Table 3-1 to monitor space management conditions that have the potential to cause a service outage.

Table 3-1 Recommendations for Monitoring Space

Metric	Recommendation
Tablespace Space Used (%)	Set this database-level metric to check the Available Space Used (%) for each tablespace. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. This metric enables the administrator to choose the threshold percentages that Oracle Grid Control tests against, and the number of samples that must occur in error before a message is generated and sent to the administrator. If the percentage of used space is greater than the values specified in the threshold arguments, then a warning or critical alert is generated. The recommended default settings are 85% for a warning and 97% for a critical space usage threshold, but you should adjust these values appropriately, depending on system usage. Also, you can customize this metric to monitor specific tablespaces. For example, set this metric to monitor critical tablespaces such as `SYSTEM`, `SYSAUX`, `UNDO`, `TEMP`, and critical tablespaces for application data. Start with 20% space remaining for Warning Threshold, 10% space remaining for Critical Threshold, and possibly 5 or 2% space remaining for immediate action on the critical tablespaces. Set this metric and similar events in the `Tablespace Full` metric group.
Archiver Hung Alert Log Error	Set this metric to monitor the alert log for `ORA-00257` errors, which indicate a full archived redo log directory. Set this metric in the Alert Log Error Status metric group.
Dump Area Used (%)	Set this metric to monitor the dump directory destinations. Dump space must be available so that the maximum amount of diagnostic information is saved the first time an error occurs. The recommended default settings are 70% for a warning and 90% for an error, but these should be adjusted depending on system usage. Set this metric in the Dump Area metric group.
Recovery Area Free Space (%)	This is a database-level metric that is evaluated by the server every 15 minutes or during a file creation, whichever occurs first. The metric is also printed in the alert log. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. The Critical Threshold is set for < 3% and the Warning Threshold is set for < 15%. You cannot customize these thresholds. An alert is returned the first time the alert occurs, and the alert is not cleared until the available space rises above 15%. See Also: Support note 467653.1 at `http://support.oracle.com/` for more information about setting the Recovery Area Free Space metric.
File System Available (%)	By default, this metric monitors the root file system per host. The default warning level is 20% and the default critical level is 5%.
Archive Area Used (%)	Set this metric to return the percentage of space used on the archive area destination. If the space used is more than the threshold value given in the threshold arguments, then a warning or critical alert is generated. If the database is not running in `ARCHIVELOG` mode or all archive destinations are standby databases for Oracle8i, this metric fails to register. The default warning threshold is 80%, but consider using 70% full to send a warning, 90% for the critical threshold, and 98% for immediate action required.
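The space metrics in Table 3-1 differ in direction: some alert when used space rises above a threshold, and others alert when free space falls below one. The following Python sketch shows one way to express that difference. It is illustrative only: the function name and dictionary are hypothetical, the 90% value for Dump Area is treated as the critical level, the Archive Area values are the suggested 70% and 90% rather than the default, and the File System Available direction is an assumption.

```python
# (direction, warning, critical) per metric, using the values from Table 3-1.
# "above": alert when the value is greater than the threshold (used space).
# "below": alert when the value is less than the threshold (free space).
SPACE_THRESHOLDS = {
    "Tablespace Space Used (%)": ("above", 85.0, 97.0),
    "Dump Area Used (%)": ("above", 70.0, 90.0),
    "Archive Area Used (%)": ("above", 70.0, 90.0),
    "Recovery Area Free Space (%)": ("below", 15.0, 3.0),
    "File System Available (%)": ("below", 20.0, 5.0),
}


def space_severity(metric, value):
    """Return 'critical', 'warning', or 'clear' for one sampled percentage."""
    direction, warning, critical = SPACE_THRESHOLDS[metric]
    if direction == "above":
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "clear"


# Made-up sample values for illustration.
samples = [
    ("Tablespace Space Used (%)", 90.0),
    ("Recovery Area Free Space (%)", 2.5),
    ("File System Available (%)", 35.0),
]
for metric, value in samples:
    print(f"{metric}: {value} -> {space_severity(metric, value)}")
```

Adjust the values to suit system usage, as the table recommends. Note that the Recovery Area Free Space thresholds cannot be customized in Grid Control itself.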

From the Alert Log Metric group, set Oracle Grid Control to monitor the alert log for errors as shown in Table 3-2.

Table 3-2 Recommendations for Monitoring the Alert Log

Metric	Recommendation
Alert	Set this metric to send an alert when an `ORA-6nn`, `ORA-1578` (database corruption), or `ORA-0060` (deadlock detected) error occurs. If any other error is recorded, then a warning message is generated.
Data Block Corruption	Set this metric to monitor the alert log for `ORA-01157` and `ORA-27048` errors. They signal a corruption in an Oracle Database datafile.
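To make the Alert metric rule in Table 3-2 concrete, the following Python sketch classifies alert log lines by error number. It is a sketch under assumptions and does not reproduce how Grid Control scans the alert log: `ORA-6nn` is interpreted as error numbers 600 through 699, the function name is hypothetical, and the log lines are illustrative rather than taken from a real system.

```python
import re

# Matches an Oracle error code such as ORA-00600 or ORA-1578.
ORA_PATTERN = re.compile(r"ORA-(\d{1,5})")

# Error numbers that raise an alert: deadlock detected, data block corruption.
ALERT_CODES = {60, 1578}


def classify_line(line):
    """Return (code, severity) pairs for every ORA- error found in one line."""
    findings = []
    for match in ORA_PATTERN.finditer(line):
        number = int(match.group(1))
        # Assumption: ORA-6nn covers error numbers 600 through 699.
        if 600 <= number <= 699 or number in ALERT_CODES:
            severity = "critical"
        else:
            severity = "warning"
        findings.append((f"ORA-{number:05d}", severity))
    return findings


# Illustrative log lines only.
log_lines = [
    "ORA-00600: internal error code, arguments: [...]",
    "ORA-1578: ORACLE data block corrupted",
    "ORA-00060: deadlock detected while waiting for resource",
    "ORA-01555: snapshot too old",
    "Thread 1 advanced to log sequence 42",
]
for line in log_lines:
    print(classify_line(line), "<-", line)
```

Normalizing each code to five digits means that `ORA-1578` and `ORA-01578` are treated as the same error.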

Monitor the system to ensure that the processing capacity is not exceeded. The warning and critical levels for these events should be modified based on the usage pattern of the system. Set the events from the Database Limits metric group using the recommendations in Table 3-3.

Table 3-3 Recommendations for Monitoring Processing Capacity

Metric	Recommendation
Process limit	Set thresholds for this metric to warn if the number of current processes approaches the value of the `PROCESSES` initialization parameter.
Session limit	Set thresholds for this metric to warn if the instance is approaching the maximum number of concurrent connections allowed by the database.
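The capacity checks in Table 3-3 compare current usage with a configured limit. The following Python sketch shows the arithmetic. The 80% and 90% thresholds, the limit of 300 processes, and the function names are hypothetical; in practice the current and limit values might come from a source such as the `V$RESOURCE_LIMIT` view, and the thresholds should reflect the usage pattern of the system.

```python
def limit_usage_pct(current, limit):
    """Return current usage as a percentage of the configured limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * current / limit


def limit_severity(current, limit, warning_pct=80.0, critical_pct=90.0):
    """Classify usage against hypothetical warning and critical percentages."""
    used = limit_usage_pct(current, limit)
    if used >= critical_pct:
        return "critical"
    if used >= warning_pct:
        return "warning"
    return "clear"


# Made-up example: PROCESSES is assumed to be set to 300.
processes_limit = 300
for current_processes in (120, 255, 285):
    pct = limit_usage_pct(current_processes, processes_limit)
    print(current_processes, f"{pct:.1f}%", limit_severity(current_processes, processes_limit))
```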

Figure 3-3 shows the Metric and Policy settings page for setting and editing metrics. The online help contains complete reference information for every metric. To access reference information for a specific metric, use the online help search feature.

Figure 3-3 Setting Notification Rules for Metrics


See Also:

Oracle Database 2 Day DBA for information about setting up notification rules and metric thresholds
Oracle Enterprise Manager Framework, Host, and Services Metric Reference Manual for information about available metrics

3.2.3 Use Database Target Views to Monitor Health, Availability, and Performance

The Database Targets page in Figure 3-4 shows the Database home page with system performance, space usage, and the configuration of important availability components such as archived redo log status, flashback log status, and estimated instance recovery time. Alerts are displayed immediately. You can configure each of the alert values using the links on this page.

Figure 3-4 Database Home Page


Many of the metrics from Oracle Grid Control pertain to performance. A system that is not meeting performance service-level agreements is not meeting HA system requirements. While performance problems seldom cause a major system outage, they can still cause an outage to a subset of customers. Outages of this type are commonly referred to as application service brownouts. The primary cause of brownouts is the intermittent or partial failure of one or more infrastructure components. IT managers must be aware of how the infrastructure components are performing (their response time, latency, and availability), and how they are affecting the quality of application service delivered to the end user.

A performance baseline, derived from normal operations that meet the service-level agreement, should determine what constitutes a performance metric alert. Baseline data should be collected from the first day that an application is in production and should include the following:

Application statistics (transaction volumes, response time, Web service times)
Database statistics (transaction rate, redo rate, hit ratios, top 5 wait events, top 5 SQL transactions)
Operating system statistics (CPU, memory, I/O, network)

You can use Oracle Grid Control to capture a snapshot of database performance as a baseline. Oracle Grid Control compares these values against system performance and displays the result on the database Target page. It can also send alerts if the values deviate too far from the established baseline.
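The idea of alerting when values deviate too far from a baseline can be sketched as a percentage comparison. The following Python example is illustrative only and does not describe how Grid Control computes deviations: the metric names, the sample values, and the 25% tolerance are all hypothetical.

```python
def deviations(baseline, current, tolerance_pct):
    """Return metrics whose current value differs from the baseline by more
    than tolerance_pct, mapped to the percentage change."""
    flagged = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            # Skip metrics that were not sampled or have no usable baseline.
            continue
        change = 100.0 * (current[name] - base) / base
        if abs(change) > tolerance_pct:
            flagged[name] = round(change, 1)
    return flagged


# Made-up baseline captured during normal operation, and a later sample.
baseline = {"transactions_per_sec": 200.0, "redo_kb_per_sec": 500.0, "response_ms": 40.0}
current = {"transactions_per_sec": 210.0, "redo_kb_per_sec": 900.0, "response_ms": 65.0}

print(deviations(baseline, current, tolerance_pct=25.0))
```

In this example the transaction rate is within tolerance, while the redo rate and response time are flagged for review.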

Set the database notification rule to capture the metrics listed in Table 3-4 for all database targets. You can then analyze these parameters using one tool. Historical data is also available.

Table 3-4 Recommended Notification Rules for Metrics

Metric	Recommendation
Disk I/O per Second	This is a database-level metric that monitors I/O operations done by the database. It sends an alert when the number of operations exceeds a user-defined threshold. Use this metric with operating system-level events that are also available with Oracle Grid Control. Set this metric based on the total I/O throughput available to the system, the number of I/O channels available, network bandwidth (in a SAN environment), the effects of the disk cache if you are using a storage array device, and the maximum I/O rate and number of spindles available to the database.
CPU Utilization (%)	For UNIX-based platforms, this metric represents the amount of CPU utilization as a percentage of total CPU processing power available. For Windows, this metric represents the percentage of time the CPU spends to execute a non-Idle thread. CPU Utilization (%) is the primary indicator of processor activity. This metric is set to automatically warn at 80 percent and to show a critical alert at 95 percent. The `Consecutive Number of Occurrences Preceding Notification` column indicates the consecutive number of times the comparison against thresholds should hold `TRUE` before an alert is generated. This usage might be normal at peak periods, but it might also be an indication of a runaway process or of a potential resource shortage.
% Wait Time	Excessive wait time indicates that a bottleneck for one or more resources is occurring. Set this metric based on the system wait time when the application is performing as expected.
Network Bytes per Second	This metric reports network traffic that Oracle generates. It can indicate a potential network bottleneck. Set this metric based on actual usage during peak periods.
Total Parses per Second	This metric measures SQL performance. It can indicate an application change or change in usage that has created a shortage of resources. Set it based on peak periods.
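The `Consecutive Number of Occurrences Preceding Notification` setting described for CPU Utilization (%) suppresses alerts for short spikes. The following Python sketch shows the general technique. It is not Grid Control's implementation: the class name is hypothetical, the value of 3 occurrences is an assumption, and the samples are made up. The 80 percent threshold is the warning level given in Table 3-4.

```python
class ConsecutiveThreshold:
    """Raise an alert only after a threshold is exceeded N samples in a row."""

    def __init__(self, threshold, occurrences):
        self.threshold = threshold
        self.occurrences = occurrences
        self._streak = 0

    def observe(self, value):
        """Record one sample; return True on the sample that triggers the alert."""
        if value > self.threshold:
            self._streak += 1
        else:
            # Any sample at or below the threshold resets the count.
            self._streak = 0
        return self._streak == self.occurrences


cpu_warning = ConsecutiveThreshold(threshold=80.0, occurrences=3)
cpu_samples = [82.0, 85.0, 70.0, 81.0, 88.0, 91.0, 93.0]
alerts = [cpu_warning.observe(sample) for sample in cpu_samples]
print(alerts)
```

The first two high samples do not trigger an alert because the third sample drops below the threshold. The alert is raised once, on the third consecutive high sample that follows.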

See Also:

Oracle Database Performance Tuning Guide for more information about performance monitoring
Oracle Database 2 Day DBA for more information about monitoring and tuning using Enterprise Manager

3.2.4 Use Event Notifications to React to Metric Changes

There are many operating system events that can be used to supplement a suggested metric. Such operating system events are not required for each host and instance. All metrics defined here can be set individually by instance or database using the Manage Metrics link at the bottom of the navigation bar on the object target page. The values that trigger a warning or critical alert can be changed here, and an operating system script can be activated to respond to a metric threshold, in addition to the standard alert being generated in Oracle Grid Control.

3.2.5 Use Events to Monitor Data Guard System Availability

Set Oracle Grid Control metrics to monitor the availability of logical and physical Data Guard configurations. Table 3-5 shows the events that are available for monitoring Data Guard databases.

Table 3-5 Recommendations for Setting Data Guard Events

Metric	Recommendation
Data Guard Status	Notifies you about system problems in a Data Guard configuration.
Apply Lag	Displays (in seconds) how far the standby is behind the primary database. This metric generates an alert on the standby database if it falls behind more than the user-specified threshold (if any).
Estimated Failover Time	Displays the approximate number of seconds required to fail over to this standby database.
Redo Apply Rate	Displays the Redo Apply rate in KB/second on this standby database.
Transport Lag	Displays the approximate number of seconds of redo that is not yet available on this standby database. The lag may be because the redo data has not yet been transported or there may be a gap. This metric generates an alert on the standby database if it falls behind more than the user-specified threshold (if any).

3.3 Managing the High Availability Environment with Oracle Grid Control

Use Oracle Grid Control as a proactive part of administering any system and for problem notification and analysis. This section includes the following recommendations:

Check Oracle Grid Control Policy Violations
Use Grid Control to Manage Oracle Patches and Maintain System Baselines
Manage Database Availability with the High Availability Console
Configure High Availability Solutions with MAA Advisor

3.3.1 Check Oracle Grid Control Policy Violations

Oracle Grid Control comes with a pre-installed set of policies and recommendations of best practices for all databases. These policies are checked by default, and the number of violations is displayed on the Targets page shown in Figure 3-4. To see a list of all violations, select Policy Violations from the Targets page.

See Also:

Oracle Enterprise Manager Policy Reference Manual for definitions of existing policies

3.3.2 Use Grid Control to Manage Oracle Patches and Maintain System Baselines

You can use Oracle Grid Control to download and manage patches from My Oracle Support (formerly Oracle MetaLink) at http://support.oracle.com/ for any monitored system in the application environment. A job can be set up to routinely check for patches that are relevant to the user environment. Those patches can be downloaded and stored directly in the Management Repository. Patches can be staged from the Management Repository to multiple systems and applied during maintenance windows.

You can examine patch levels for one system and compare them between systems in either a one-to-one or one-to-many relationship. In this case, a system can be identified as a baseline and used to demonstrate maintenance requirements in other systems. This can be done for operating system patches and database patches.
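The one-to-many comparison against a baseline system amounts to a set difference. The following Python sketch illustrates the idea. It is not how Grid Control stores or compares patch data: the function name, host names, and patch numbers are made up.

```python
def missing_patches(baseline, systems):
    """For each system, list the baseline patches that are not applied there."""
    return {name: sorted(baseline - applied) for name, applied in systems.items()}


# Made-up patch numbers for the baseline system and three other hosts.
baseline_patches = {"1000001", "1000002", "1000003"}
systems = {
    "hostA": {"1000001", "1000002", "1000003"},
    "hostB": {"1000001"},
    "hostC": {"1000002", "1000003", "1000009"},
}

for host, missing in missing_patches(baseline_patches, systems).items():
    print(host, "needs:", missing if missing else "nothing")
```

Patches present on a host but absent from the baseline, such as the extra patch on `hostC`, are ignored here. A fuller comparison might report them separately.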

3.3.3 Manage Database Availability with the High Availability Console

The High Availability (HA) Console is a one-stop, dashboard-style page for monitoring the availability of each database. You can use it on any database, and if a database is part of a Data Guard configuration, the HA Console allows you to switch your view from the primary database to any of the standby databases.

You can use the HA Console to:

Display high availability events including events from related targets such as standby databases
View the high availability summary that includes the status of the database
View the last backup status
View the Flash Recovery Area Usage, if configured
If Oracle Data Guard is configured: View the Data Guard summary, set up Data Guard standby databases for any database target, manage switchover and failover of database targets other than the database that contains the Management Repository, and monitor the health of a Data Guard configuration at a glance
If Oracle RAC is configured: View the Oracle RAC Services summary including Top Services

The HA Console requires Oracle Enterprise Manager Management Agent release 10.2.0.5 as well as Grid Control release 10.2.0.5.

See Also:

Oracle Enterprise Manager Grid Control Quick Start Guide and the Oracle Enterprise Manager Concepts for operational requirements to run Grid Control release 10.2.0.5 and for help establishing standard administrative settings

The following HA Console screenshot shows summary information, details, and historical statistics for the primary database. The example page shows the standby databases for the primary target, various Data Guard standby performance metrics and settings, and the data protection mode.

Figure 3-5 Monitoring a Primary Database in the HA Console


The Availability Summary shows that the primary database is up and its availability is currently 99.09%. Notice that the availability percentage is further broken down by Day, Week, and Month in a horizontal bar graph in the Availability History on the far right side of the page. Because this database is an Oracle RAC database (in the cluster dglnx-cl1), the Availability Summary also shows instance status. ASM status would also appear if ASM was configured for this database. The Availability Events section shows specific high availability events (alerts). You can click the error to obtain more details (or to suppress the event). To set up, manage, and configure a specific solution area for this database, click MAA Advisor Details to go to the Maximum Availability Architecture (MAA) Advisor page (described in more detail in Section 3.3.4, "Configure High Availability Solutions with MAA Advisor").

The Backup and Recovery Summary displays the Last Backup and Next Backup information. The Flash Recovery Area Usage chart indicates about 1.35% of the flash recovery area is currently used. The Used (Non-reclaimable) Flash Recovery Area (%) chart shows the usage over the last 2 hours. You can click on the chart to display the page with the metric details.

The Data Guard Summary shows the primary database is running in Maximum Performance mode and has Fast-Start Failover enabled. You can click the link next to Protection Mode to modify the data protection mode. In the Standby Databases table, the Apply/Transport Lag metrics indicate that the physical standby database (west) is caught up with the primary database, and the Used Flash Recovery Area (FRA) is 0.5%. The Primary Database Redo Rate chart shows the redo trend over the past 2 hours. Note that if Data Guard is not configured, the "Switch To" box in the upper right corner of the console is not displayed.

The Services Summary shows details for the customers, orders, and sales services.

Figure 3-6 Monitoring the Standby Database in the HA Console


The description for Figure 3-6 is the same as for Figure 3-5, "Monitoring a Primary Database in the HA Console", except for the Data Guard Summary section and the charts at the lower right side of the page. Figure 3-6 shows information for the standby database (west), which is a physical standby database running real-time query. In the Standby Databases table, the Apply/Transport Lag metrics indicate that the physical standby database is caught up with the primary database, and the Used Flash Recovery Area (FRA) is 0.5%. The Standby Database Apply Lag chart shows there has been zero lag over the past two hours. Note that if Data Guard is not configured, the "Switch To" box in the upper right corner of the console is not displayed.

3.3.4 Configure High Availability Solutions with MAA Advisor

The goal of the MAA Advisor is to help you implement Oracle's best practices to achieve the optimal high availability architecture.

From the Availability Summary section on the High Availability Console, you can link to the MAA Advisor to:

View recommended Oracle solutions for each outage type (site failures, computer failures, storage failures, human errors, and data corruptions)
View the configuration status and use the links in the Oracle Solution column to go to the Enterprise Manager page where the solution can be configured
Understand the benefits of each solution
Link to the MAA Web site for white papers, documentation, and other information

The MAA Advisor page contains a table that lists the outage type, Oracle solutions for each outage, configuration status, and benefits. The MAA Advisor allows you to view HA solutions in the following ways:

Primary Database Recommendations Only: This condensed view (the default) shows only the recommended solutions for the primary database.
All Primary Database Solutions: This expanded view of the table shows all configuration recommendations and status for primary databases.
All Database Solutions (including standbys): This expanded view of the table shows all configuration recommendations and status for all primary and standby databases in this configuration. It includes an extra column, "Target Name/Role", that provides the database name and shows the role (primary, physical, or logical) of the database.

Figure 3-7 shows an example of the All Primary Database Solutions view.

Figure 3-7 MAA Advisor Page in Oracle Grid Control


You can click the link in the Oracle Solution column to go to a page where you can set up, manage, and configure the specific solution area. Once a solution has been configured, click Refresh to update the configuration status on the page. Once the page is refreshed, click Advisor Details on the Console page to see the updated values.