12 Generating Alerts
You can monitor and route alerts in Oracle Communications Elastic Charging Engine (ECE) composable services by using Prometheus Alertmanager. When alerting is enabled, the ECE composable services continuously evaluate system health and generate alerts based on configurable conditions.
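Topics in this document:
-
About Generating Alerts
-
Setting Up Alerts for the ECE Composable Services
-
Configuring Alerting Rules and Thresholds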
About Generating Alerts
Alerts provide a real-time way to monitor the health, performance, and reliability of charging and CDR processing pipelines. They help you detect critical operational conditions, such as processing backlogs, failover activity, retransmissions, and message routing issues, before those issues affect downstream systems or billing accuracy.
In high-throughput, distributed environments, alerts help you quickly identify and respond to failures or delays that can impact financial and operational outcomes. Alerts also support high availability (HA) and multisite deployments by detecting failover conditions and data flow anomalies across sites.
The ECE composable services generate alerts through the following workflow:
-
The internal AlertMetricsService, embedded in the charging-gateway pod, exposes alert and health metrics through an internal /metrics endpoint. This endpoint is accessible only from within the charging-gateway pod.
-
Prometheus, managed by Prometheus Operator, continuously scrapes metrics from the charging-gateway pod and evaluates them against your configured alert rules. When a rule condition is met, Prometheus generates an alert and forwards it to Alertmanager.
-
Prometheus Alertmanager manages notifications by grouping related alerts, suppressing repeated notifications, and routing alerts to configured channels such as email, Slack, Microsoft Teams, or PagerDuty.
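As a rough illustration of the grouping, repeat suppression, and routing described above, an Alertmanager configuration might look like the following sketch. The receiver names, email addresses, SMTP host, webhook URL, and timing values are placeholder assumptions and are not supplied by ECE. How you load the configuration depends on how Alertmanager is deployed in your environment.

```yaml
# Illustrative Alertmanager configuration.
# All names, addresses, URLs, and timings below are assumed placeholders.
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  receiver: ops-email                   # default receiver for all alerts
  group_by: ["alertname", "severity"]   # group related alerts into one notification
  group_wait: 30s                       # wait before sending the first notification for a group
  group_interval: 5m                    # wait before notifying about new alerts in a group
  repeat_interval: 4h                   # suppress repeats of an unchanged alert for this long
  routes:
    - matchers:
        - severity="critical"
      receiver: ops-webhook             # send critical alerts to a second channel

receivers:
  - name: ops-email
    email_configs:
      - to: "ece-ops@example.com"
  - name: ops-webhook
    webhook_configs:
      - url: "https://alerts.example.com/hook"
```

Routing on the severity label is one possible design: it lets the severity value assigned to each alert type decide which channel is notified.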
Setting Up Alerts for the ECE Composable Services
To set up alerts for the ECE composable services:
-
Make sure the Prometheus Operator and Alertmanager are installed and running in your Kubernetes environment. For compatible versions, see BRM Compatibility Matrix.
For more information about installing Prometheus Operator and Alertmanager, see the Prometheus Operator documentation on the GitHub website: https://github.com/prometheus-operator/prometheus-operator/tree/main/Documentation/getting-started.
-
Enable alerts, define the alerting rules and thresholds for each alert type, and then deploy or upgrade the oc-ccs-version Helm chart. See "Configuring Alerting Rules and Thresholds".
-
Configure Prometheus Operator to scrape metrics from the charging-gateway pod endpoint.
For an example, see the servicemonitor.yaml file included in the oc-ccs-version Helm chart.
-
Configure Prometheus Alertmanager to send alerts to notification channels such as email, Slack, or PagerDuty.
-
View and manage alerts in the Prometheus UI, which shows both real-time and historical alert activity.
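For orientation, a ServiceMonitor resource that tells Prometheus Operator to scrape the /metrics endpoint could resemble the sketch below. It is not the servicemonitor.yaml file shipped in the Helm chart. The resource name, namespace, labels, port name, and scrape interval are assumptions that you would replace with the values used in your deployment.

```yaml
# Illustrative ServiceMonitor; not the file included in the Helm chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-charging-gateway        # assumed name
  namespace: example-ece-namespace      # assumed namespace
  labels:
    release: example-prometheus         # assumed label that your Prometheus instance selects on
spec:
  selector:
    matchLabels:
      app: charging-gateway             # assumed label on the charging-gateway Service
  endpoints:
    - port: metrics                     # assumed name of the Service port
      path: /metrics                    # metrics endpoint described in this chapter
      interval: 30s                     # assumed scrape interval
```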
Configuring Alerting Rules and Thresholds
The ECE composable services include several built-in alert types that help you monitor system behavior and data pipeline health. These alerts detect failover conditions, data reliability issues, and processing backlogs, giving operators visibility into system stability and data integrity.
-
Remote Consumption Alert: This alert indicates that a site is processing traffic in disaster recovery mode instead of its primary role. Use this alert to identify failover scenarios and verify that traffic shifts correctly during outages or site disruptions.
-
Retransmission Alert: This alert triggers when the ECE composable services retransmit messages. Retransmissions can indicate upstream instability, duplicate delivery, or network issues. Use this alert to identify conditions that may affect message ordering, duplicate handling, or overall processing performance.
-
Suspend Topic Alert: This alert triggers when the ECE composable services route messages to a suspend topic because they cannot process them successfully. Use this alert to identify potential data quality or processing issues that require manual investigation or intervention.
-
High CloudEvent Backlog Alert: This alert monitors the durable retry backlog and triggers when the number of pending events exceeds a configured threshold. Use this alert to detect bottlenecks in Kafka publishing or downstream systems that may delay CDR delivery.
-
CloudEvent Repository Minor Alert: This lower-severity alert triggers whenever a backlog exists in the retry repository. Use this alert to identify issues early before they grow into larger backlogs or affect the system.
By default, all alerts are disabled.
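To make the alert types above more concrete, the following sketch shows how a backlog alert and a retransmission alert could be written as Prometheus alerting rules. The rules that the Helm chart deploys may differ. The metric names (example_cloudevent_retry_backlog and example_retransmissions_total), the resource name, and the group name are hypothetical and appear only for illustration.

```yaml
# Illustrative PrometheusRule; metric names and resource names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-ece-alerts
spec:
  groups:
    - name: example-ece-alert-rules
      rules:
        # Backlog alert: fires when pending retry events stay above 100 for 5 minutes.
        - alert: HighCloudEventBacklog
          expr: example_cloudevent_retry_backlog{processor="cdrPublisher"} > 100
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Retry backlog above threshold for processor cdrPublisher"
        # Retransmission alert: fires as soon as any retransmission
        # is observed within the last 2 minutes.
        - alert: RetransmissionDetected
          expr: increase(example_retransmissions_total[2m]) > 0
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Messages were retransmitted in the last 2 minutes"
```

In this sketch, the range selector ([2m]) plays the role of an evaluation window, the for field sets how long the condition must hold, and the comparison value acts as the threshold.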
To enable and configure alerts and thresholds:
-
Open your override-values.yaml file for the oc-ccs-version Helm chart.
-
Set the prometheusRule.enabled key to true.
-
Enable the remote consumption alert by configuring these keys under prometheusRule:
-
remoteConsumptionAlert.enabled: Set this to true to enable the alert.
-
remoteConsumptionAlert.window: The time window of metric data that Prometheus evaluates to determine whether the alert condition is occurring. The default is 2m, which covers the last 2 minutes.
-
remoteConsumptionAlert.forDuration: The amount of time the alert condition must remain continuously true before the alert is triggered. The default is 0m, which triggers the alert immediately.
-
remoteConsumptionAlert.severity: The severity level assigned to alerts. The default is warning.
-
Enable the retransmission alert by configuring these keys under prometheusRule:
-
retransmissionAlert.enabled: Set this to true to enable the alert.
-
retransmissionAlert.window: The time window of metric data that Prometheus evaluates to determine whether the alert condition is occurring. The default is 2m, which covers the last 2 minutes.
-
retransmissionAlert.forDuration: The amount of time the alert condition must remain continuously true before the alert is triggered. The default is 0m, which triggers the alert immediately.
-
retransmissionAlert.severity: The severity level assigned to alerts. The default is warning.
-
Enable the suspend topic alert by configuring these keys under prometheusRule:
-
suspendTopicAlert.enabled: Set this to true to enable the alert.
-
suspendTopicAlert.window: The time window of metric data that Prometheus evaluates to determine whether the alert condition is occurring. The default is 2m, which covers the last 2 minutes.
-
suspendTopicAlert.forDuration: The amount of time the alert condition must remain continuously true before the alert is triggered. The default is 0m, which triggers the alert immediately.
-
suspendTopicAlert.severity: The severity level assigned to alerts. The default is warning.
-
Enable the high CloudEvent backlog alert by configuring these keys under prometheusRule:
-
highCloudEventBacklogAlert.enabled: Set this to true to enable the alert.
-
highCloudEventBacklogAlert.processor: The name of the CGF publisher processor that the alert should monitor for retry backlog growth. The default is cdrPublisher.
-
highCloudEventBacklogAlert.threshold: The retry event backlog size above which the alert is triggered. The default is 100.
-
highCloudEventBacklogAlert.forDuration: The amount of time the alert condition must continuously remain true before the alert is triggered. The default is 5m.
-
highCloudEventBacklogAlert.severity: The severity level assigned to alerts. The default is critical.
-
Enable the CloudEvent repository minor alert by configuring these keys under prometheusRule:
-
cloudEventRepositoryMinorAlert.enabled: Set this to true to enable the alert.
-
cloudEventRepositoryMinorAlert.processor: The name of the CGF publisher processor that the alert should monitor for retry backlog growth. The default is cdrPublisher.
-
cloudEventRepositoryMinorAlert.threshold: The retry event backlog size above which the alert is triggered. The default is 0, which means that any backlog triggers the alert.
-
cloudEventRepositoryMinorAlert.forDuration: The amount of time the alert condition must remain continuously true before the alert is triggered. The default is 0m, which triggers the alert immediately.
-
cloudEventRepositoryMinorAlert.severity: The severity level assigned to alerts. The default is warning.
-
Save and close your override-values.yaml file.
-
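Deploy the Helm release by running the helm install command. If the release is already installed, run helm upgrade with the same arguments instead, because helm install fails when the release name is already in use: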
helm install EceCompServicesReleaseName oc-ccs-version --values override-values.yaml -n EceCompServicesNameSpace
where:
-
EceCompServicesReleaseName is the release name for the Helm chart. Helm uses this name to track the installation instance.
-
EceCompServicesNameSpace is the namespace in which to create Kubernetes objects for the Helm chart.
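Taken together, the keys described in this procedure could appear in override-values.yaml as in the following sketch. It enables every alert type and uses the documented default values. The nesting is an assumption inferred from the dotted key names, so check it against the values.yaml file in your copy of the Helm chart.

```yaml
# Sketch of the alerting section of override-values.yaml.
# Nesting is assumed from the dotted key names; values are the documented defaults.
prometheusRule:
  enabled: true

  remoteConsumptionAlert:
    enabled: true
    window: 2m            # evaluate the last 2 minutes of metric data
    forDuration: 0m       # trigger immediately
    severity: warning

  retransmissionAlert:
    enabled: true
    window: 2m
    forDuration: 0m
    severity: warning

  suspendTopicAlert:
    enabled: true
    window: 2m
    forDuration: 0m
    severity: warning

  highCloudEventBacklogAlert:
    enabled: true
    processor: cdrPublisher   # CGF publisher processor to monitor
    threshold: 100            # retry backlog size that triggers the alert
    forDuration: 5m           # condition must hold for 5 minutes
    severity: critical

  cloudEventRepositoryMinorAlert:
    enabled: true
    processor: cdrPublisher
    threshold: 0              # any backlog triggers the alert
    forDuration: 0m
    severity: warning
```

To enable only some alert types, leave the enabled key of the others at its default, because all alerts are disabled by default.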