Monitoring Overview
The Oracle Cloud Infrastructure Monitoring service enables you to actively and passively monitor your cloud resources using the Metrics and Alarms features.
How Monitoring Works
The Monitoring service uses metrics to monitor resources and alarms to notify you when these metrics meet alarm-specified triggers.
Metrics are emitted to the Monitoring service as raw data points, or timestamp-value pairs, along with dimensions and metadata. Metrics come from a variety of sources:
- Resource metrics automatically posted by Oracle Cloud Infrastructure resources. For example, the Compute service posts metrics for monitoring-enabled Compute instances through the oci_computeagent namespace. One such metric is CpuUtilization. See Supported Services and Viewing Default Metric Charts.
- Custom metrics published using the Monitoring API.
- Data sent to new or existing metrics using Service Connector Hub.
Metric data posted to the Monitoring service is only presented to you or consumed by the Oracle Cloud Infrastructure features that you enable to use metric data.
When you query a metric, the Monitoring service returns aggregated data according to the specified parameters. You can specify a range (such as the last 24 hours), statistic, and interval. The Console displays one monitoring chart per metric for selected resources. The aggregated data in each chart reflects your selected statistic and interval. API requests can optionally filter by dimension and specify a resolution. API responses include the metric name along with its source compartment and metric namespace. You can feed the aggregated data into a visualization or graphing library.
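To make the relationship between raw data points, statistic, and interval concrete, the following is a minimal local sketch in Python. It does not call the Monitoring service; the `aggregate` helper and the `STATISTICS` table are hypothetical names introduced only for illustration, and the service's own windowing may differ in detail.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

# Hypothetical mapping of statistic names to functions (illustration only).
STATISTICS = {"mean": mean, "max": max, "min": min, "sum": sum, "count": len}


def aggregate(points, interval, statistic):
    """Group (timestamp, value) pairs into fixed windows and apply a statistic.

    Returns a list of (window_start, aggregated_value) pairs sorted by time.
    """
    fn = STATISTICS[statistic]
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    windows = {}
    for ts, value in points:
        # Align each timestamp to the start of the window that contains it.
        offset = (ts - epoch) // interval
        windows.setdefault(epoch + offset * interval, []).append(value)
    return [(start, fn(values)) for start, values in sorted(windows.items())]


# Two hours of raw data points posted once per minute (made-up values).
start = datetime(2018, 5, 10, 22, 0, tzinfo=timezone.utc)
raw = [(start + timedelta(minutes=i), float(i)) for i in range(120)]

hourly_max = aggregate(raw, timedelta(hours=1), "max")
hourly_mean = aggregate(raw, timedelta(hours=1), "mean")
```

With these made-up inputs, `hourly_max` contains one value per hour (59.0 and 119.0), which is the kind of series a chart or graphing library would plot.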
Metric and alarm data is accessible via the Console, CLI, and API. For retention periods, see Storage Limits.
The Alarms feature of the Monitoring service publishes alarm messages to configured destinations managed by the Notifications service. Each destination is a topic with a set of subscribers. For more information about the Notifications service, see Notifications Overview.
The message type indicates the reason that the message was sent.
- OK_TO_FIRING: The alarm changed from OK status to FIRING status.
- FIRING_TO_OK: The alarm changed from FIRING status to OK status.
- REPEAT: The alarm is maintaining a FIRING status and repeat notifications are configured.
- RESET: The alarm is not detecting the metric firing; the metric is no longer being emitted. The resource that was emitting the metric might have been moved or terminated.

Important: When a RESET status change occurs, determine the health of the resource.
Alarm message format:
Parameter | Description |
---|---|
dedupeKey (Required) | string. Unique identifier for all the alarm messages of the alarm. Use for de-duplication. |
title (Required) | string. The alarm's configured display name. |
body | string. The alarm's configured message body. |
type (Required) | string. The reason for sending the notification message. Valid values: See Message types. |
severity (Required) | string. The highest severity level of the listed alarms. Valid values: |
timestampEpochMillis (Required) | long. The time when the alarm was triggered, in milliseconds since epoch time. |
timestamp (Required) | string. The time when the alarm was triggered, in ISO-8601 format. Same information as in timestampEpochMillis. |
alarmMetadata (Required) | array of objects. List of alarms related to this notification message. |
version (Required) | int. The version of the alarm message format. |
alarmMetadata format:
Parameter | Description |
---|---|
id (Required) | string. The alarm OCID. |
status (Required) | string. The alarm state. Valid values: |
severity (Required) | string. The alarm severity level. Valid values: |
query (Required) | string. The alarm's configured query. |
totalMetricsFiring (Required) | int. The number of metric streams represented in this notification message. |
dimensions | array of objects. List of dimension key-value pairs that identify each metric stream. The list is limited to a hundred entries. Empty for an alarm with a status of |
Example messages, by subscription protocol, for an alarm titled "High CPU Utilization" that remains in the FIRING state. In this example, the message includes two metric streams: one for "myinstance1" and another for "myinstance2."
Email and Slack
{"dedupeKey":"exampleuniqueID","title":"High CPU Utilization","body":"Follow runbook at http://example.com/runbooks","type":"REPEAT","severity":"CRITICAL","timestampEpochMillis":1542406320000,"timestamp":"2018-11-16T22:12:00Z","alarmMetaData":[{"id":"ocid1.alarm.oc1.iad.exampleuniqueID","status":"FIRING","severity":"CRITICAL","query":"CpuUtilization[1m].mean()>0","totalMetricsFiring":2,"dimensions":[{"instancePoolId":"Default","resourceDisplayName":"myinstance1","faultDomain":"FAULT-DOMAIN-1","resourceId":"ocid1.instance.oc1.iad.exampleuniqueID","imageId":"ocid1.image.oc1.iad.exampleuniqueID","availabilityDomain":"szYB:US-ASHBURN-AD-1","shape":"VM.Standard2.1","region":"us-ashburn-1"},{"instancePoolId":"Default","resourceDisplayName":"myinstance2","faultDomain":"FAULT-DOMAIN-3","resourceId":"ocid1.instance.oc1.iad.exampleuniqueID","imageId":"ocid1.image.oc1.iad.exampleuniqueID","availabilityDomain":"szYB:US-ASHBURN-AD-1","shape":"VM.Standard2.1","region":"us-ashburn-1"}]}],"version":1.1}
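A subscriber that receives a message like the one above could turn it into a short summary. The following Python sketch is one way to do that, assuming the field names used in the example message (`type`, `severity`, `title`, `alarmMetaData`, `dimensions`, `resourceDisplayName`). The function name `summarize_alarm_message` and the trimmed `SAMPLE` payload are hypothetical and exist only for illustration.

```python
import json

# Trimmed copy of the example message above (fewer dimension keys).
SAMPLE = json.dumps({
    "dedupeKey": "exampleuniqueID",
    "title": "High CPU Utilization",
    "type": "REPEAT",
    "severity": "CRITICAL",
    "alarmMetaData": [{
        "id": "ocid1.alarm.oc1.iad.exampleuniqueID",
        "status": "FIRING",
        "totalMetricsFiring": 2,
        "dimensions": [
            {"resourceDisplayName": "myinstance1"},
            {"resourceDisplayName": "myinstance2"},
        ],
    }],
})


def summarize_alarm_message(raw_message):
    """Return human-readable summary lines for one alarm message."""
    message = json.loads(raw_message)
    lines = ["{} [{}] {}".format(
        message["type"], message["severity"], message["title"])]
    for alarm in message.get("alarmMetaData", []):
        # Each dimensions entry identifies one metric stream.
        names = [d.get("resourceDisplayName", "unknown")
                 for d in alarm.get("dimensions", [])]
        lines.append("{}: {}, {} stream(s): {}".format(
            alarm["id"], alarm["status"],
            alarm["totalMetricsFiring"], ", ".join(names)))
    return lines


summary = summarize_alarm_message(SAMPLE)
```

Because `dedupeKey` is shared by all messages of an alarm, a real consumer would probably also use it to collapse repeated notifications; that step is omitted here.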
Metrics Feature Overview
The Metrics feature relays metric data about the health, capacity, and performance of your cloud resources. A metric is a measurement related to health, capacity, or performance of a given resource. Resources, services, and applications emit metrics to the Monitoring service. Common metrics reflect data related to:
- Availability and latency
- Application uptime and downtime
- Completed transactions
- Failed and successful operations
- Key performance indicators (KPIs), such as sales and engagement quantifiers
By querying Monitoring for this data, you can understand how well the systems and processes are working to achieve the service levels you commit to your customers. For example, you can monitor the CPU utilization and disk reads of your Compute instances. You can then use this data to determine when to launch more instances to handle increased load, troubleshoot issues with your instance, or better understand system behavior.
Example Metric: Failure Rate
For application health, one of the common KPIs is failure rate, for which a common definition is the number of failed transactions divided by total transactions. This KPI is usually delivered through application monitoring and management software.
As a developer, you can capture this KPI from your applications using custom metrics. Simply record observations every time an application transaction takes place and then post that data to the Monitoring service. In this case, set up metrics to capture failed transactions, successful transactions, and transaction latency (time spent per completed transaction).
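The bookkeeping behind this KPI can be sketched in a few lines of Python. The `TransactionCounters` class below is a hypothetical in-application recorder, not part of any Oracle SDK; posting the resulting values as custom metrics through the Monitoring API is left out.

```python
from dataclasses import dataclass


@dataclass
class TransactionCounters:
    """Hypothetical recorder for the three observations described above."""
    failed: int = 0
    succeeded: int = 0
    total_latency_ms: float = 0.0

    def record(self, ok, latency_ms):
        """Record one transaction outcome and the time it took."""
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1
        self.total_latency_ms += latency_ms

    def failure_rate(self):
        """Failed transactions divided by total transactions."""
        total = self.failed + self.succeeded
        return self.failed / total if total else 0.0

    def mean_latency_ms(self):
        """Average time spent per recorded transaction."""
        total = self.failed + self.succeeded
        return self.total_latency_ms / total if total else 0.0


counters = TransactionCounters()
for ok, latency in [(True, 100.0), (True, 120.0), (True, 80.0), (False, 300.0)]:
    counters.record(ok, latency)
```

In this made-up run, one of four transactions failed, so `counters.failure_rate()` is 0.25. Posting the failed and successful counts as separate metrics, rather than the precomputed ratio, keeps the ratio recomputable over any interval at query time.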
Alarms Feature Overview

The Alarms feature of the Monitoring service works with the Notifications service to notify you when metrics meet alarm-specified triggers. The previous illustration depicts the flow, starting with resources emitting metric data points to Monitoring. When triggered, an alarm sends an alarm message to the configured topic (in Notifications), which then sends the message on to all of the topic's subscriptions. (This illustration does not cover raw and aggregated metric data. For these details, see the "Monitoring Overview" illustration at the top of this page.)
When configured, repeat notifications remind you of a continued firing state at the configured repeat interval. You are also notified when an alarm transitions back to the OK state, or when an alarm is reset.
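The notifications described here correspond to the message types listed under How Monitoring Works. As a simplified model only, the Python sketch below maps a status change to a message type. The `message_type` function is hypothetical, it treats RESET as a status as the message type list does, and it ignores the timing of the repeat interval.

```python
def message_type(previous, current, repeat_configured=False):
    """Return the message type for a status change, or None if no message is sent.

    Simplified model: timing of the repeat interval is not considered.
    """
    if current == "RESET":
        # The metric is no longer being emitted.
        return "RESET"
    if previous == "OK" and current == "FIRING":
        return "OK_TO_FIRING"
    if previous == "FIRING" and current == "OK":
        return "FIRING_TO_OK"
    if previous == "FIRING" and current == "FIRING":
        # A continued firing state only produces a message when repeat
        # notifications are configured.
        return "REPEAT" if repeat_configured else None
    return None


transitions = [
    message_type("OK", "FIRING"),
    message_type("FIRING", "FIRING", repeat_configured=True),
    message_type("FIRING", "OK"),
]
```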
You can search for alarms using Search-supported attributes. For more information about Search, see Overview of Search.
For attribute descriptions, see Alarm Reference.
- id
- displayName
- compartmentId
- metricCompartmentId
- namespace
- query
- severity
- destinations
- suppression
- isEnabled
- lifecycleState
- timeCreated
- timeUpdated
- tags
Monitoring Concepts
The following concepts are essential to working with Monitoring.
- aggregated data
- The result of applying a statistic and interval to a selection of raw data points for a given metric. For example, you can apply the statistic max and interval 1h (one hour) to the last 24 hours of raw data points for the metric CpuUtilization. Aggregated data is displayed in default metric charts in the Console. You can also build metric queries for specific sets of aggregated data. For instructions, see Viewing Default Metric Charts and Building Metric Queries.
- alarm
- The alarm query to evaluate and the notification destination to use when the alarm is in the firing state, along with other alarm properties. For instructions on managing alarms, see Managing Alarms.
- alarm query
- The Monitoring Query Language (MQL) expression to evaluate for the alarm. An alarm query must specify a metric, statistic, interval, and a trigger rule (threshold or absence). The Alarms feature of the Monitoring service interprets results for each returned time series as a Boolean value, where zero represents false and a non-zero value represents true. A true value means that the trigger rule condition has been met. For more information, see Building Metric Queries and the query attribute description in the Alarm API reference.
- data point
- A timestamp-value pair for the specified metric. Example: 2018-05-10T22:19:00Z, 10.4
- dimension
- A qualifier provided in a metric definition. Example: Resource identifier (resourceId), provided in the definitions of oci_computeagent metrics. Use dimensions to filter or group metric data. Example dimension name-value pair for filtering by availability domain: availabilityDomain = "VeBZ:PHX-AD-1"
- frequency
- The time period between each posted raw data point for a given metric. (Raw data points are posted by the metric namespace to the Monitoring service.) While frequency varies by metric, default service metrics typically have a frequency of 60 seconds (that is, one data point posted per minute). See also resolution.
- interval
- The time window used to convert the given set of raw data points.
- message
- The content that the Alarms feature of the Monitoring service publishes to topics in the alarm’s configured notification destinations. A message is sent when the alarm transitions to another state, such as from "OK" to "FIRING." For more information about messages, see How Monitoring Works.
- metadata
- A reference provided in a metric definition. Example: unit (bytes), provided in the definition of the oci_computeagent metric DiskBytesRead. Use metadata to determine additional information about a given metric. For metric definitions, see Supported Services.
- metric
- A measurement related to health, capacity, or performance of a given resource. Example: The oci_computeagent metric CpuUtilization, which measures usage of a Compute instance. For metric definitions, see Supported Services.
- metric definition
- A set of references, qualifiers, and other information provided by a metric namespace for a given metric. For example, the oci_computeagent metric DiskBytesRead is defined by dimensions (such as resource identifier) and metadata (specifying bytes for unit) as well as identification of its metric namespace (oci_computeagent). Each posted set of data points carries this information. Use the ListMetrics API operation to get metric definitions. For metric definitions, see Supported Services.
- metric namespace
- Indicator of the resource, service, or application that emits the metric. Provided in the metric definition. For example, the CpuUtilization metric definition emitted by the Oracle Cloud Agent software on Compute instances lists the metric namespace oci_computeagent as the source of the CpuUtilization metric. For metric definitions, see Supported Services.
- metric stream
- An individual set of aggregated data for a metric. A stream can be either specific to a single resource or aggregated across all resources in the compartment. Within a metric chart in the Console, each metric stream is represented as a line. By default, metric streams are resource-specific, so the chart displays a line for each resource. If you choose to aggregate all metric streams, then the chart displays one line for all resources.
- notification destination
- Protocol and other details for sending messages when the alarm transitions to another state, such as from "OK" to "FIRING." The details and setup may vary by destination service. For the Notifications service, each destination includes a topic and subscription protocol (such as PagerDuty). For more information about messages, topics, and subscriptions, see Notifications Overview.
- Oracle Cloud Agent software
- Software that allows a Compute instance to post raw data points to the Monitoring service. Automatically installed with the latest versions of supported images. See Enabling Monitoring for Compute Instances.
- query
- The Monitoring Query Language (MQL) expression to evaluate for returning aggregated data. The query must specify a metric, statistic, and interval. For more information, see Building Metric Queries.
- resolution
- The period between time windows, or the regularity at which time windows shift. For example, use a resolution of 1m to retrieve aggregations every minute.
- resource group
- A custom string provided with a custom metric that can be used as a filter or to aggregate results. The resource group must exist in the definition of the posted metric. Only one resource group can be applied per metric.
- statistic
- The aggregation function applied to the given set of raw data points. For supported statistics, see Monitoring Query Language (MQL) Reference.
- suppression
- A configuration to avoid publishing messages during the specified time range. Useful for suspending alarm notifications during system maintenance. Each suppression applies to a single alarm. In the Console, you can apply one definition of a suppression to multiple alarms. The result is an individual suppression for each alarm. For instructions on suppressing alarms, see To suppress alarms.
- trigger rule
- The condition that must be met for the alarm to be in the firing state. A trigger rule can be based on a threshold or absence of a metric.
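The alarm query and trigger rule concepts above can be illustrated with a small local model in Python. This is a sketch under simplifying assumptions, not the service's evaluation logic: only the latest aggregated value of each metric stream is checked, and the function names are hypothetical.

```python
def evaluate_threshold(streams, threshold):
    """Threshold trigger rule: 1 if a stream's latest value exceeds the threshold.

    `streams` maps a stream name to its aggregated values, oldest first.
    """
    results = {}
    for name, values in streams.items():
        latest = values[-1] if values else None
        results[name] = 1 if latest is not None and latest > threshold else 0
    return results


def evaluate_absence(streams):
    """Absence trigger rule: 1 if a stream returned no data at all."""
    return {name: 0 if values else 1 for name, values in streams.items()}


def firing_streams(results):
    """Interpret each result as a Boolean: zero is false, non-zero is true."""
    return sorted(name for name, value in results.items() if value != 0)


# Made-up aggregated CpuUtilization values per metric stream.
streams = {
    "myinstance1": [40.0, 95.0],
    "myinstance2": [30.0, 20.0],
    "myinstance3": [],
}

over_80 = firing_streams(evaluate_threshold(streams, 80))
absent = firing_streams(evaluate_absence(streams))
```

Here `over_80` lists only `myinstance1`, and `absent` lists only `myinstance3`, the stream with no data.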
Availability
The Monitoring service is available in all Oracle Cloud Infrastructure commercial regions. See About Regions and Availability Domains for the list of available regions, along with associated locations, region identifiers, region keys, and availability domains.
Supported Services
The following services have resources or components that can emit metrics to Monitoring:
- API Gateway - see API Gateway Metrics
- Big Data - see View Cluster Metrics
- Block Storage - see Block Volume Metrics
- Blockchain Platform - see Blockchain Platform Metrics
- Compute - see these topics:
- Container Engine for Kubernetes - see Container Engine for Kubernetes Metrics
- Data Catalog - see Data Catalog Metrics
- Data Integration - see Data Integration Metrics
- Database - see Database Metrics
- Digital Assistant - see Digital Assistant Metrics
- DNS - see DNS Metrics
- Events - see Events Metrics
- Email Delivery - see Email Delivery Metrics
- File Storage - see File System Metrics
- Functions - see Function Metrics
- Health Checks - see Health Checks Metrics
- Integration - see Viewing Message Metrics
- Load Balancing - see Load Balancing Metrics
- Management Agent - see Management Agent Metrics
- MySQL Database - see MySQL Database Metrics
- Networking - see these topics:
- NoSQL Database Cloud - see Service Metrics
- Notifications - see Notifications Metrics
- Object Storage - see Object Storage Metrics
- OS Management - see OS Management Metrics
- Streaming - see Streaming Metrics
- Vault - see Vault Metrics
- WAF - see WAF Metrics
Resource Identifiers
Most types of Oracle Cloud Infrastructure resources have a unique, Oracle-assigned identifier called an Oracle Cloud ID (OCID). For information about the OCID format and other ways to identify your resources, see Resource Identifiers.
Metric resources do not have OCIDs.
Ways to Access Monitoring
You can access the Monitoring service using the Console (a browser-based interface) or the REST API. Instructions for the Console and API are included in topics throughout this guide. For a list of available SDKs, see Software Development Kits and Command Line Interface.
Console: To access Monitoring using the Console, you must use a supported browser. You can use the Console link at the top of this page to go to the sign-in page. You will be prompted to enter your cloud tenant, your user name, and your password. Open the navigation menu. Under Solutions and Platform, go to Monitoring.
API: To access Monitoring through APIs, use Monitoring API for metrics and alarms and Notifications API for notifications (used with alarms).
Moving Alarms to a Different Compartment
You can move alarms from one compartment to another. When you move an alarm to a new compartment, its associated metrics remain where they are. After you move the alarm to the new compartment, inherent policies apply immediately and affect access to the alarm through the Console. For more information on moving resources to other compartments, see To move a resource to a different compartment.
To move resources between compartments, resource users must have sufficient access permissions on the compartment that the resource is being moved to, as well as the current compartment. For more information about permissions for Monitoring resources, see Details for Monitoring.
Authentication and Authorization
Each service in Oracle Cloud Infrastructure integrates with IAM for authentication and authorization, for all interfaces (the Console, SDK or CLI, and REST API).
An administrator in your organization needs to set up groups, compartments, and policies that control which users can access which services, which resources, and the type of access. For example, the policies control who can create new users, create and manage the cloud network, launch instances, create buckets, download objects, etc. For more information, see Getting Started with Policies. For specific details about writing policies for each of the different services, see Policy Reference.
If you’re a regular user (not an administrator) who needs to use the Oracle Cloud Infrastructure resources that your company owns, contact your administrator to set up a user ID for you. The administrator can confirm which compartment or compartments you should be using.
Administrators: For common policies that give groups access to metrics, see Let users view metric definitions in a compartment and Restrict user access to a specific metric namespace. For a common alarms policy, see Let users view alarms. To authorize resources, such as instances, to make API calls, add the resources to a dynamic group. Use the dynamic group's matching rules to add the resources, and then create a policy that allows that dynamic group access to metrics. See Let instances make API calls to access monitoring metrics in the tenancy.
Limits on Monitoring
See Monitoring Limits for a list of applicable limits and instructions for requesting a limit increase.
Other limits include the following.
Storage Limits
Item | Time range stored |
---|---|
Metric definitions | 14 days |
Alarm history entries | 90 days |
Returned Data Limits (Metrics)
When you query metrics and view metric charts, the returned data is subject to certain limits. Limits information for returned data includes the 100,000 data point maximum and time range maximums (determined by resolution, which relates to interval). See MetricData Reference.
Troubleshooting Limits
If you see an error that the query has exceeded the maximum number of metric streams, then update the query to evaluate a number of metric streams that is within the limit. For example, you can reduce the metric streams by specifying dimensions. You can continue to evaluate all metric streams that were in the original query by spreading the metric streams across multiple queries (or alarms).
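One way to spread metric streams across multiple queries is to generate one query per dimension value. The Python sketch below builds such query strings. The `split_query` helper is hypothetical, and the dimension filter syntax is an assumption extrapolated from the query and dimension fragments shown earlier on this page (`CpuUtilization[1m].mean()>0` and `availabilityDomain = "VeBZ:PHX-AD-1"`); check it against the Monitoring Query Language (MQL) Reference before use.

```python
def split_query(metric, interval, statistic, dimension, values, trigger=""):
    """Build one query string per dimension value.

    Assumed shape: metric[interval]{dimension = "value"}.statistic()trigger
    """
    return [
        f'{metric}[{interval}]{{{dimension} = "{value}"}}.{statistic}(){trigger}'
        for value in values
    ]


# Hypothetical availability domain names in the style used on this page.
queries = split_query(
    "CpuUtilization", "1m", "mean",
    "availabilityDomain", ["VeBZ:PHX-AD-1", "VeBZ:PHX-AD-2"],
    trigger=" > 80",
)
```

Each resulting query, or an alarm built on it, then evaluates only the metric streams that match its dimension value, so together the queries still cover the streams of the original query.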