Observability is the ability to gain insights into the internal state of a system by examining its external behavior without knowledge of its internal mechanisms. It is based on four fundamental pillars: metrics, logs, events, and traces.
- Metrics provide a numerical representation of data collected over specific time intervals. For example, CPU utilization can be measured as a metric. Metrics are particularly useful for triggering alerts or alarms when issues arise within the system.
- Logs and events consist of timestamped records capturing discrete occurrences over time. While events and logs share similarities, logs alone are not particularly effective for tracking code execution as they often lack crucial context. However, when combined, logs and events can provide valuable insights.
- Tracing is employed to understand the interconnectedness of various services and how requests flow through them. It allows for the visualization and analysis of the entire path taken by a request as it moves through different components.
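
The three kinds of signals listed above can be pictured as simple records. The following is a minimal sketch in plain Python, not tied to any vendor SDK; the class names, field names, and sample values are all hypothetical and chosen only for illustration. It also shows one common way a log gains the context it would otherwise lack: by carrying the identifier of the trace it belongs to.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MetricPoint:
    """A numeric measurement taken at a point in time (hypothetical shape)."""
    name: str
    value: float
    timestamp: datetime


@dataclass
class LogRecord:
    """A timestamped record of a discrete occurrence (hypothetical shape)."""
    timestamp: datetime
    message: str
    trace_id: Optional[str] = None  # optional link to a request trace


@dataclass
class Span:
    """One hop of a request through a service (hypothetical shape)."""
    trace_id: str
    service: str
    duration_ms: float


now = datetime.now(timezone.utc)

# A metric: CPU utilization sampled at one instant.
cpu = MetricPoint("cpu_utilization_percent", 72.5, now)

# A trace: the spans that share a trace_id describe one request's path.
spans = [Span("t-1", "gateway", 12.0), Span("t-1", "orders", 48.0)]

# Logs: only the first one carries trace context.
logs = [
    LogRecord(now, "order accepted", trace_id="t-1"),
    LogRecord(now, "cache warmed"),
]

# Reconstruct the request path and the logs that belong to it.
path = [s.service for s in spans if s.trace_id == "t-1"]
related = [entry.message for entry in logs if entry.trace_id == "t-1"]
```

Here `path` lists the services the request passed through, and `related` contains only the log messages that can be tied to that request.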
Observability serves several purposes, including understanding trends and identifying outliers. It helps detect potential bottlenecks and patterns within the system. Additionally, observability facilitates the measurement and evaluation of application performance.
- Monitoring Service: This service is designed to monitor and gain insights into the state of a system by using predefined sets of metrics or logs. Its main objectives are to actively collect system metrics and logs, track errors in real time, store these metrics for historical analysis, and respond to incidents through alerts. An example of its usage would be monitoring CPU utilization over time and triggering alerts when it exceeds a specified threshold. The two key components of the OCI Monitoring Service are metrics and alarms, which enable active and passive monitoring of cloud resources.
- Users can create alarms that are triggered when the conditions defined in the alarm are met. A triggered alarm sends alarm messages to its subscribers. For example, in the case of CPU utilization, the Monitoring Service collects raw data from customer applications, services, or resources, as well as from resources within OCI. This data is aggregated over time and can be viewed through the Oracle Cloud Infrastructure Console, the API (via the SDK or CLI), or customer monitoring tools. If the CPU utilization exceeds a threshold defined by the user, such as 80%, an alarm is triggered and the user is notified. User-created metrics are referred to as custom metrics, and alarm trigger rules define the conditions that put alarms into a firing state.
- Logging Service: This service provides a highly scalable, fully managed, centralized platform for capturing and storing all logs. A log in OCI is a resource that captures and stores log events within a given context. The generated logs are indexed, so users can search them. Logs are also actionable and can be transported to other systems or services. The Logging Service integrates with the Monitoring Service through the Service Connector Hub. The Logging Service supports three types of logs:
  - Service logs: These logs are emitted by OCI services such as Object Storage. They come in predefined logging categories that users can easily enable or disable.
  - Custom logs: These logs are emitted by custom applications running on OCI or on other cloud providers.
  - Audit logs: These logs record audit events emitted by the OCI Audit service. Users can access them from the same console used for the Logging Service.
- OCI Events Service: This service enables users to create automation based on state changes in their tenancy's resources by defining actions that run in response to specific event types. Events are structured, schematized messages that indicate a change in a resource. For example, when a user creates a bucket, an event of type "create bucket" is generated. Rules are configurations in which users define which events to match and which actions to trigger when those events occur. Actions are user-defined responses to events, such as triggering a function or writing to a stream. For instance, when a user in an organization provisions a database, the completion of the provisioning process is an event representing a state change. Users can define a rule that triggers an action, such as a serverless function that sends a notification or an email alert indicating that a particular database instance has been created.
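
The alarm and rule behavior described in the list above can be sketched in a few lines. This is a simplified model in plain Python, not the OCI SDK: the `Alarm`, `Rule`, and `dispatch` names, the sample window, and the event payload shape are assumptions made for illustration. The event type string is likewise given only as an example of what a "create bucket" event type might look like.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Alarm:
    """Hypothetical alarm: fires when the mean of recent samples exceeds a threshold."""
    name: str
    threshold: float  # for example, 80.0 for 80% CPU utilization
    window: int       # number of most recent samples to aggregate

    def is_firing(self, samples: list[float]) -> bool:
        recent = samples[-self.window:]
        # With no data there is nothing to evaluate, so the alarm does not fire.
        return bool(recent) and mean(recent) > self.threshold


@dataclass
class Rule:
    """Hypothetical rule: run an action for every event of a given type."""
    event_type: str
    action: Callable[[dict], None]


def dispatch(event: dict, rules: list[Rule]) -> int:
    """Run the action of each rule matching the event; return the match count."""
    matched = 0
    for rule in rules:
        if rule.event_type == event.get("eventType"):
            rule.action(event)
            matched += 1
    return matched


# Alarm example: the mean of the last three samples is about 83.3, above 80.
cpu_alarm = Alarm("high-cpu", threshold=80.0, window=3)
firing = cpu_alarm.is_firing([40.0, 75.0, 85.0, 90.0])

# Event example: a rule that records a notification when a bucket is created.
notifications: list[str] = []
BUCKET_CREATED = "com.oraclecloud.objectstorage.createbucket"  # assumed type string
rules = [
    Rule(
        BUCKET_CREATED,
        lambda e: notifications.append(f"bucket created: {e['data']['resourceName']}"),
    )
]
event = {"eventType": BUCKET_CREATED, "data": {"resourceName": "my-bucket"}}
matched = dispatch(event, rules)
```

In this sketch the action is an in-memory callback; in a real deployment the action would instead be something like a function invocation, a stream write, or a notification, as the Events Service bullet describes.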