Overview of Monitoring
Use the Oracle Cloud Infrastructure Monitoring service to actively and passively monitor cloud resources using the Metrics and Alarms features. Learn how Monitoring works.
How Monitoring Works
The Monitoring service uses metrics to monitor resources and alarms to notify you when these metrics meet alarm-specified triggers.
Metrics are emitted to the Monitoring service as raw data points, or timestamp-value pairs, along with dimensions and metadata. Metrics come from a variety of sources:
- Resource metrics automatically posted by Oracle Cloud Infrastructure resources. For example, the Compute service posts metrics for monitoring-enabled compute instances through the oci_computeagent namespace. One such metric is CpuUtilization. See Supported Services and Viewing Default Metric Charts.
- Custom metrics published using the Monitoring API.
- Data sent to new or existing metrics using Connector Hub (with Monitoring as the target service for a connector).
You can transfer metrics from the Monitoring service using Connector Hub. For more information, see Creating a Connector with a Monitoring Source.
Metric data posted to the Monitoring service is presented only to you, or consumed only by the Oracle Cloud Infrastructure features that you enable to use metric data.
When you query a metric, the Monitoring service returns aggregated data according to the specified parameters. You can specify a range (such as the last 24 hours), statistic, and interval. The Console displays one monitoring chart per metric for selected resources. The aggregated data in each chart reflects the selected statistic and interval. API requests can optionally filter by dimension and specify a resolution. API responses include the metric name along with its source compartment and metric namespace. You can feed the aggregated data into a visualization or graphing library.
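The aggregation described above can be illustrated with a minimal local sketch. It does not call the Monitoring service: the `aggregate` function, the sample values, and the choice to key each window by its start time are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

# Illustrative subset of statistics, mapped to plain Python functions.
STATISTICS = {"max": max, "min": min, "mean": mean, "sum": sum, "count": len}


def aggregate(points, interval, statistic):
    """Group (timestamp, value) pairs into fixed windows and apply a statistic.

    Each window covers [start, start + interval). Keying the result by the
    window start is an assumption of this sketch.
    """
    fn = STATISTICS[statistic]
    step = interval.total_seconds()
    buckets = {}
    for ts, value in points:
        # Align the timestamp down to the start of its window.
        start = datetime.fromtimestamp(ts.timestamp() // step * step, tz=timezone.utc)
        buckets.setdefault(start, []).append(value)
    return {start: fn(values) for start, values in sorted(buckets.items())}


# Six raw data points, one per minute (hypothetical CpuUtilization values).
t0 = datetime(2022, 5, 10, 22, 0, tzinfo=timezone.utc)
raw = [(t0 + timedelta(minutes=i), v)
       for i, v in enumerate([10.4, 12.0, 55.1, 30.2, 8.9, 41.7])]

result = aggregate(raw, timedelta(minutes=3), "max")
for start, value in result.items():
    print(start.isoformat(), value)
```

Each printed line is one aggregated data point: the statistic applied to every raw data point that falls inside one interval-sized window.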
Metric and alarm data is accessible from the Console, CLI, and API. For retention periods, see Storage Limits.
The Alarms feature of the Monitoring service publishes alarm messages to configured destinations, such as topics in Notifications and streams in Streaming.
Metrics Feature Overview
The Metrics feature relays metric data about the health, capacity, and performance of cloud resources.
A metric is a measurement related to health, capacity, or performance of a resource. Resources, services, and applications emit metrics to the Monitoring service. Common metrics reflect data related to:
- Availability and latency
- Application uptime and downtime
- Completed transactions
- Failed and successful operations
- Key performance indicators (KPIs), such as sales and engagement quantifiers
By querying Monitoring for this data, you can understand how well the systems and processes are working to achieve the service levels you commit to your customers. For example, you can monitor the CPU utilization and disk reads of compute instances. You can then use this data to determine when to provision more instances to handle increased load, troubleshoot issues with the instance, or better understand system behavior.
Example Metric: Failure Rate
For application health, one of the common KPIs is failure rate, for which a common definition is the number of failed transactions divided by total transactions. This KPI is usually delivered through application monitoring and management software.
As a developer, you can capture this KPI from your applications using custom metrics. Simply record observations every time an application transaction takes place and then post that data to the Monitoring service. In this case, set up metrics to capture failed transactions, successful transactions, and transaction latency (time spent per completed transaction).
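As a rough sketch of the recording step, the following class counts transactions and derives the failure rate locally. The class name, the metric names, and the shape of the returned data points are hypothetical; posting the data to the Monitoring service is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransactionRecorder:
    """Collects per-transaction observations before they are posted as custom metrics."""
    succeeded: int = 0
    failed: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, ok, latency_ms):
        """Record one transaction outcome and the time it took."""
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1
        self.latencies_ms.append(latency_ms)

    def failure_rate(self):
        """Failed transactions divided by total transactions (0.0 when empty)."""
        total = self.succeeded + self.failed
        return self.failed / total if total else 0.0

    def data_points(self, now):
        """Return one timestamp-value pair per (hypothetical) metric name."""
        ts = now.isoformat()
        count = len(self.latencies_ms)
        avg_latency = sum(self.latencies_ms) / count if count else 0.0
        return {
            "FailedTransactions": (ts, self.failed),
            "SuccessfulTransactions": (ts, self.succeeded),
            "TransactionLatencyMs": (ts, avg_latency),
        }


recorder = TransactionRecorder()
for ok, latency in [(True, 120), (True, 80), (False, 400), (True, 100)]:
    recorder.record(ok, latency)

print(recorder.failure_rate())
print(recorder.data_points(datetime(2022, 5, 10, 22, 19, tzinfo=timezone.utc)))
```

Keeping the raw counts as separate metrics, rather than posting only the ratio, lets a later query compute the failure rate over any interval.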
Alarms Feature Overview
Use alarms to monitor the health, capacity, and performance of cloud resources.
The Alarms feature of the Monitoring service works with the configured destination service to notify you when metrics meet alarm-specified triggers. The previous illustration depicts the flow, starting with resources emitting metric data points to Monitoring. When triggered, an alarm sends an alarm message to the configured destination. For Notifications, messages are sent to subscriptions in the configured topic. For Streaming, messages are sent to the configured stream. (This illustration doesn't cover raw and aggregated metric data. For these details, see the "Monitoring Overview" illustration at the top of this page.)
When configured, repeat notifications remind you of a continued firing state at the configured repeat interval. You are also notified when an alarm transitions back to the OK state, or when an alarm is reset.
Alarm Evaluations
Monitoring evaluates alarms once per minute to determine alarm status.
When the alarm splits notifications, Monitoring evaluates each tracked metric stream. If the evaluation of that metric stream indicates a new FIRING status or other qualifying event, then Monitoring sends an alarm message.
Monitoring tracks metric streams per alarm for qualifying events, but messages are subject to the destination service limits.
Time Needed to Reflect Alarm Updates
Updates to alarms take up to five minutes to be reflected everywhere.
For example, if you update an alarm to split notifications, then it might take up to five minutes for metric stream status to be populated in the Console.
Searching for Alarms
Search for alarms using supported attributes.
For more information about Search, see Overview of Search. For attribute descriptions, see Alarm Reference.
- id
- displayName
- compartmentId
- metricCompartmentId
- namespace
- query
- severity
- destinations
- suppression
- isEnabled
- lifecycleState
- timeCreated
- timeUpdated
- tags
The message type indicates the reason that the message was sent.
- OK_TO_FIRING: The alarm changed from OK status to FIRING status.
- FIRING_TO_OK: The alarm changed from FIRING status to OK status.
- REPEAT: The alarm is maintaining a FIRING status and repeat notifications are configured.
- RESET: The alarm isn't detecting the metric firing. The metric is no longer being emitted. The resource that was emitting the metric might have been moved or terminated. Monitoring stops tracking metric streams associated with RESET messages.
Important: When a RESET status change occurs, determine the health of the resource.
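The message types above can be read as a small state-transition table. The following sketch is one interpretation, not the service's implementation: the function name and parameters are hypothetical, RESET is assumed to apply only to a stream that was previously firing, and the repeat interval is reduced to a single `repeat_due` flag.

```python
def message_type(previous, current, emitting=True, repeat_due=False):
    """Return the message type for one metric stream, or None when no message applies.

    previous and current are "OK" or "FIRING". emitting is False when the
    metric is no longer being emitted. repeat_due is True when repeat
    notifications are configured and the repeat interval has elapsed.
    """
    if previous == "FIRING" and not emitting:
        return "RESET"
    if previous == "OK" and current == "FIRING":
        return "OK_TO_FIRING"
    if previous == "FIRING" and current == "OK":
        return "FIRING_TO_OK"
    if previous == "FIRING" and current == "FIRING" and repeat_due:
        return "REPEAT"
    return None


print(message_type("OK", "FIRING"))
print(message_type("FIRING", "FIRING", repeat_due=True))
print(message_type("FIRING", "FIRING", emitting=False))
```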
Message Format and Examples
Monitoring Concepts
The following concepts are essential to working with Monitoring.
- aggregated data
- The result of applying a statistic and interval to a selection of raw data points for a metric. For example, you can apply the statistic max and interval 1h (one hour) to the last 24 hours of raw data points for the metric CpuUtilization. Aggregated data is displayed in default metric charts in the Console. You can also build metric queries for specific sets of aggregated data. For instructions, see Viewing Default Metric Charts and Building Metric Queries.
- alarm
- The alarm query to evaluate and the notification destination to use when the alarm is in the firing state, along with other alarm properties.
- alarm query
- The Monitoring Query Language (MQL) expression to evaluate for the alarm. An alarm query must specify a metric, statistic, interval, and a trigger rule (threshold or absence). The Alarms feature of the Monitoring service interprets results for each returned time series as a Boolean value, where zero represents false and a nonzero value represents true. A true value means that the trigger rule condition has been met.
- data point
- A timestamp-value pair for the specified metric. Example:
2022-05-10T22:19:00Z, 10.4
- dimension
- A qualifier provided in a metric definition. Example: Resource identifier (resourceId), provided in the definitions of oci_computeagent metrics. Use dimensions to filter or group metric data. Example dimension name-value pair for filtering by availability domain: availabilityDomain = "VeBZ:PHX-AD-1"
- frequency
- The time period between each posted raw data point for a metric. (Raw data points are posted by the metric namespace to the Monitoring service.) While frequency varies by metric, default service metrics typically have a frequency of 60 seconds (that is, one data point posted per minute). See also resolution.
- interval
- The time window used to convert the set of raw data points.
- message
- The content that the Alarms feature of the Monitoring service publishes to topics in the alarm’s configured notification destinations. A message is sent when the alarm transitions to another state, such as from OK to FIRING.
- metadata
- A reference provided in a metric definition. Example: unit (bytes), provided in the definition of the oci_computeagent metric DiskBytesRead. Use metadata to determine additional information about a metric. For metric definitions, see Supported Services.
- metric
- A measurement related to health, capacity, or performance of a resource. Example: The oci_computeagent metric CpuUtilization, which measures usage of a compute instance. For metric definitions, see Supported Services.
- metric definition
- A set of references, qualifiers, and other information provided by a metric namespace for a metric. For example, the oci_computeagent metric DiskBytesRead is defined by dimensions (such as resource identifier) and metadata (specifying bytes for unit) as well as identification of its metric namespace (oci_computeagent). Each posted set of data points carries this information. Use the ListMetrics API operation to get metric definitions. For metric definitions, see Supported Services.
- metric namespace
- Indicator of the resource, service, or application that emits the metric. Provided in the metric definition. For example, the CpuUtilization metric definition emitted by the Oracle Cloud Agent software on compute instances lists the metric namespace oci_computeagent as the source of the CpuUtilization metric. For metric definitions, see Supported Services.
- metric stream
- An individual set of aggregated data for a metric and zero or more dimension values.
- notification destination
- Details for sending messages when the alarm transitions to another state, such as from OK to FIRING. The details and setup might vary by destination service. Available destination services include Notifications and Streaming.
- Oracle Cloud Agent software
- Software used by a compute instance to post raw data points to the Monitoring service. Automatically installed with the latest versions of supported images. See Enabling Monitoring for Compute Instances.
- query
- The Monitoring Query Language (MQL) expression and associated information (such as metric namespace) to evaluate for returning aggregated data. The query must specify a metric, statistic, and interval.
- resolution
- The period between time windows, or the regularity at which time windows shift. For example, use a resolution of 1m to retrieve aggregations every minute.
Note: For metric queries, the interval you select drives the default resolution of the request, which determines the maximum time range of data returned. For alarm queries, the specified interval has no effect on the resolution of the request. The only valid value of the resolution for an alarm query request is 1m. For more information about the resolution parameter as used in alarm queries, see Alarm.
As shown in the following illustration, resolution controls the start time of each aggregation window relative to the previous window while interval controls the length of the windows. Both requests apply the statistic max to the data within each five-minute window (from the interval), resulting in a single aggregated data point representing the highest CPU utilization counter for that window. Only the resolution value differs. This resolution changes the regularity at which the aggregation windows shift, or the start times of successive aggregation windows. Request A doesn't specify a resolution and thus uses the default value equal to the interval (5 minutes). This request's five-minute aggregation windows are thus taken from the sets of data points emitted from 0:n to 5:00, 5:n to 10:00, and so forth. Request B specifies a 1-minute resolution, so its five-minute aggregation windows are taken from the set of data points emitted every minute from 0:n to 5:00, 1:n to 6:00, and so forth.
To specify a nondefault resolution that differs from the interval, see Selecting a Nondefault Resolution for a Query and Creating an Alarm.
- resource group
- A custom string provided with a custom metric that can be used as a filter or to aggregate results. The resource group must exist in the definition of the posted metric. Only one resource group can be applied per metric.
- statistic
- The aggregation function applied to the set of raw data points.
- suppression
- A configuration to stop publishing messages during the specified time range. Useful for suspending alarm notifications during system maintenance.
- time range
- The bounds (timestamps) of the metric data that you want. For example, the past hour.
- trigger rule
- The condition that must be met for the alarm to be in the firing state. A trigger rule can be based on a threshold or absence of a metric.
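To make the interval and resolution entries above concrete, the following sketch lists the aggregation windows that two requests would use over a 15-minute range: one that leaves resolution at its default (equal to the interval) and one that sets a 1-minute resolution. The function name and the 15-minute range are assumptions for illustration; the sketch only computes window boundaries and does not query the service.

```python
from datetime import timedelta


def window_bounds(total, interval, resolution=None):
    """Return (start, end) offsets of each aggregation window within `total`.

    interval sets the length of every window. resolution sets how far the
    start of each successive window shifts, and defaults to the interval.
    """
    resolution = resolution or interval
    windows = []
    start = timedelta(0)
    while start + interval <= total:
        windows.append((start, start + interval))
        start += resolution
    return windows


total = timedelta(minutes=15)
five = timedelta(minutes=5)

# Request A: no resolution given, so windows do not overlap.
request_a = window_bounds(total, five)
# Request B: 1-minute resolution, so windows overlap and shift every minute.
request_b = window_bounds(total, five, timedelta(minutes=1))

print(len(request_a), [str(s) for s, _ in request_a])
print(len(request_b), [str(s) for s, _ in request_b][:3])
```

A finer resolution yields more aggregated data points for the same time range, which is why resolution also bounds how much time range a single request can return.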
Availability
The Monitoring service is available in all Oracle Cloud Infrastructure commercial regions. See About Regions and Availability Domains for the list of available regions, along with associated locations, region identifiers, region keys, and availability domains.
Supported Services
The following services have resources or components that can emit metrics to Monitoring:
- Analytics Cloud - see About Metrics for Oracle Analytics Cloud
- API Gateway - see API Gateway Metrics
- Application Performance Monitoring - see Application Performance Monitoring Metrics
- Autonomous Recovery Service - see Recovery Service Metrics
- Bastion - see Bastion Metrics
- Big Data Service - see View Cluster Metrics
- Block Volume - see Block Volume Metrics
- Blockchain Platform - see Monitor Metrics (Blockchain Platform)
- Compute - see these pages:
- Connector Hub - see Connector Hub Metrics
- Container Engine for Kubernetes - see Container Engine for Kubernetes Metrics
- Container Instances - see Container Instances Metrics
- Data Catalog - see Data Catalog Metrics
- Data Flow - see Data Flow Metrics
- Data Integration - see Data Integration Metrics
- Data Science - see About Notebook Session Metrics
- Data Transfer - see these pages:
  - Disk-Based Data Import: Viewing Data Transfer Disk-Based Metrics
  - Appliance-Based Data Import: Viewing Data Transfer Appliance Import-Based Metrics
  - Data Export: Viewing Data Transfer Appliance Export-Based Metrics
- Database - see these pages:
  - Monitor Performance with Autonomous Database Metrics (Autonomous Database on Shared Exadata Infrastructure)
  - Monitor Databases with Autonomous Database Metrics (Autonomous Database on Dedicated Exadata Infrastructure)
  - Metrics for Oracle Exadata Database Service on Dedicated Infrastructure in the Monitoring Service (from Reference Guides for Exadata Cloud Infrastructure)
  - Metrics for Base Database Service in the Database Management Service: Monitor a Database Using Database Management Metrics
  - Metrics for External Database
- Database Management - see Database Management Metrics
- Database Migration - see Database Migration Metrics
- DevOps - see DevOps Metrics
- Digital Assistant - see Digital Assistant Metrics
- DNS - see DNS Metrics
- Email Delivery - see Email Delivery Metrics
- Events - see Events Metrics
- File Storage - see File System Metrics
- Functions - see Function Metrics
- GoldenGate - see Oracle Cloud Infrastructure GoldenGate Metrics
- Health Checks - see Health Checks Metrics
- Integration - see these pages:
  - Integration Generation 2: View Message Metrics
  - Integration 3: View Message Metrics and Billable Messages
- Java Management - see Java Management Metrics
- Load Balancer - see Load Balancer Metrics
- Logging - see Logging Metrics
- Logging Analytics - see Monitor Logging Analytics Using Service Metrics
- Media Streams - see Media Streams Metrics
- Management Agent - see Management Agent Metrics
- MySQL Heatwave - see Metrics
- Networking - see these pages:
- NoSQL Database Cloud - see Service Metrics
- Notifications - see Notifications Metrics
- Network Firewall - see Network Firewall Metrics
- Object Storage - see Object Storage Metrics
- Operations Insights - see Operations Insights Metrics
- Oracle APEX Application Development - see Metrics (APEX)
- OS Management - see OS Management Metrics
- Process Automation - see Process Automation Metrics
- Queue - see Queue Metrics
- Service Mesh - see Service Mesh Metrics
- Stack Monitoring - see Stack Monitoring Metric Reference
- Streaming - see Streaming Metrics
- Vault - see Monitoring Vault Resources
- Vulnerability Scanning - see Scanning Metrics
- WAF - see Edge Policy Metrics
Resource Identifiers
Most types of Oracle Cloud Infrastructure resources have a unique, Oracle-assigned identifier called an Oracle Cloud ID (OCID). For information about the OCID format and other ways to identify your resources, see Resource Identifiers.
Metric resources don't have OCIDs.
Ways to Access Monitoring
You can access Oracle Cloud Infrastructure (OCI) by using the Console (a browser-based interface), REST API, or OCI CLI. Instructions for using the Console, API, and CLI are included in topics throughout this documentation. For a list of available SDKs, see Software Development Kits and Command Line Interface.
Console: To access Monitoring using the Console, you must use a supported browser. To go to the Console sign-in page, open the navigation menu at the top of this page and click Infrastructure Console. You are prompted to enter your cloud tenant, your user name, and your password. Open the navigation menu and click Observability & Management. Under Monitoring, click Service Metrics.
API: To access Monitoring through APIs, use Monitoring API for metrics and alarms and Notifications API for notifications (used with alarms).
CLI: See Command Line Reference for Monitoring and Command Line Reference for Notifications.
Authentication and Authorization
Each service in Oracle Cloud Infrastructure integrates with IAM for authentication and authorization, for all interfaces (the Console, SDK or CLI, and REST API).
An administrator in your organization needs to set up groups, compartments, and policies that control which users can access which services, which resources, and the type of access. For example, the policies control who can create new users, create and manage the cloud network, launch instances, create buckets, download objects, and so on. For more information, see Getting Started with Policies. For specific details about writing policies for each of the different services, see Policy Reference.
If you’re a regular user (not an administrator) who needs to use the Oracle Cloud Infrastructure resources that your company owns, contact your administrator to set up a user ID for you. The administrator can confirm which compartment or compartments you should be using.
For more information about user authorizations for monitoring, see IAM Policies (Monitoring).
Administrators: For common policies that give groups access to metrics, see Metric Access for Groups. For common alarm policies, see Alarm Access for Groups. To authorize resources, such as instances, to make API calls, add the resources to a dynamic group. Use the dynamic group's matching rules to add the resources, and then create a policy that allows that dynamic group access to metrics. See Metric Access for Resources.
Limits on Monitoring
See Monitoring Limits for a list of applicable limits and instructions for requesting a limit increase.
Other limits include the following.
Storage Limits
Item | Time range stored |
---|---|
Metric definitions | 90 days |
Alarm history entries | 90 days |
Returned Data Limits (Metrics)
When you query metrics and view metric charts, the returned data is subject to certain limits. Limits information for returned data includes the 100,000 data point maximum and time range maximums (determined by resolution, which relates to interval). See MetricData.
Alarm Message Limits
The maximum number of messages per alarm evaluation depends on the alarm destination. Limits are associated with the Oracle Cloud Infrastructure service used for the destination.
Monitoring tracks 200,000 metric streams per alarm for qualifying events. For more information about alarm evaluations, see Alarm Evaluations on this page.
Alarm destination | Delivery | Maximum alarm messages per evaluation |
---|---|---|
topic (Notifications) | At least once | 60 |
stream (Streaming) | At least once | 100,000 |
For example, consider the following evaluations of an alarm that splits notifications among 200 metric streams, using a topic as its destination.
Alarm evaluation (time) | Metric stream transition | Generated messages | Sent messages | Dropped messages |
---|---|---|---|---|
00:01:00 | 110 metric streams transition from OK to FIRING. | 110 | 60 | 50 |
00:02:00 | 90 metric streams transition from OK to FIRING. | 90 | 60 | 30 |
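
The arithmetic behind the table above can be sketched as follows. The per-evaluation limits are taken from the destination table on this page; the function name and the dictionary are hypothetical.

```python
# Maximum alarm messages per evaluation, by destination type (from the table above).
MAX_MESSAGES_PER_EVALUATION = {"topic": 60, "stream": 100_000}


def split_messages(generated, destination):
    """Return (sent, dropped) for one alarm evaluation."""
    limit = MAX_MESSAGES_PER_EVALUATION[destination]
    sent = min(generated, limit)
    return sent, generated - sent


# The two evaluations from the example, with a topic as the destination.
print(split_messages(110, "topic"))
print(split_messages(90, "topic"))
# The same first evaluation with a stream as the destination.
print(split_messages(110, "stream"))
```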
When a topic or stream is overused, it can result in delayed alarm notifications. Overuse can occur when multiple resources are using that topic or stream.
Best Practices to Work Within Limits
When you expect a high volume of alarm notifications, follow these best practices to help prevent exceeding alarm message limits and associated delays.
- Reserve a single topic or stream for use with a high-volume alarm. Don't use one topic or stream for multiple high-volume alarms.
- If you expect more than 60 messages per minute, specify Streaming as the alarm destination.
- Streams:
  - Create partitions based on expected load. See Limits on Streaming Resources.
  - If alarm messages exceed the stream space, then update the alarm to use a different stream that has more partitions. For example, if the original stream contains five partitions, create a stream with ten partitions and then update the alarm to use the new stream.
    Note: To avoid missing messages, continue consuming the original stream until no more messages are received.
- Increase limits for the tenancy:
  - Topics: See Limits for publishing messages (PublishMessage operation).
  - Streams: See Limits on Streaming Resources.
Troubleshooting Limits
To troubleshoot a query error for too many metric streams, see Error: Exceeded Maximum Metric Streams.
For troubleshooting information, see Troubleshooting Monitoring.
Security
This topic describes security for Monitoring.
For information about how to secure Monitoring, including security information and recommendations, see Securing Monitoring.