32 Monitoring ECE in a Cloud Native Environment
Learn how to monitor the system processes, such as memory and thread usage, in your Oracle Communications Elastic Charging Engine (ECE) components in a cloud native environment.
Topics in this document:
- About Monitoring ECE in a Cloud Native Environment
- Setting Up Alerts with the ECE Alert Configuration Template
- Enabling ECE Metric Endpoints
- Sample Prometheus Operator Configuration
- ECE Cloud Native Metrics
- ECE Cloud Native Alerts
About Monitoring ECE in a Cloud Native Environment
You can configure ECE to expose JVM, Coherence, and application metric data through a single HTTP endpoint in an OpenMetrics (Prometheus) exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE cloud native metrics and store them for analysis and monitoring.
Note:
- ECE only exposes the metrics on an HTTP endpoint. It does not provide the Prometheus service.
- Do not modify the oc-cn-ece-helm-chart/templates/ece-ecs-metricsservice.yaml file. It is used only during ECE startup and rolling upgrades. It is not used for monitoring.
A typical monitoring stack for ECE cloud native includes the following components:
- Prometheus: Scrapes ECE metrics from the /metrics endpoint and monitors for issues that match your defined alert rules. When a condition is met, Prometheus generates an alert and sends it to Prometheus Alertmanager.
- Prometheus Alertmanager: Streamlines alert management by consolidating related alerts, suppressing repeats, and directing alerts to your chosen channels, such as email, Slack, or PagerDuty. For details, see the Prometheus Alertmanager documentation at: https://prometheus.io/docs/alerting/0.26/overview/
- Grafana: Displays ECE metrics and alerts in graphical dashboards.
Metrics exposed by ECE cover areas including system health, JVM, Coherence, Kubernetes, and gateway performance (Diameter, HTTP, EM, and so on). Oracle provides an Alert Configuration Template as a reference to help you set up and maintain your alerting policy.
Note:
The template offers starting-point configurations and sample dashboards for alerting and visualization. You are responsible for deploying, customizing, and maintaining your monitoring solution.
ECE cloud native exposes metric data for the following components by default:
- ECE Server
- BRM Gateway
- Customer Updater
- Diameter Gateway
- EM Gateway
- HTTP Gateway
- CDR Formatter
- Pricing Updater
- RADIUS Gateway
- Rated Event Formatter
Setting up monitoring of these ECE cloud native components involves the following high-level tasks:
1. Ensuring that the ECE metric endpoints are enabled. See "Enabling ECE Metric Endpoints".
   ECE cloud native exposes metric data through the following endpoint: http://localhost:19612/metrics. A quick way to confirm that the endpoint is responding is shown after this list.
2. Setting up a centralized metrics service, such as Prometheus Operator, to scrape metrics from the endpoint.
   For an example of how to configure Prometheus Operator to scrape ECE metric data, see "Sample Prometheus Operator Configuration".
3. Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
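For example, assuming the default port of 19612, you could check the endpoint from inside the cluster. This is only a sketch: the namespace, pod name, and the availability of curl in the container image are assumptions, not part of the ECE deliverables.
  # Hypothetical check; replace the namespace and pod name with values from your deployment.
  kubectl exec -n ece ece-server-0 -- curl -s http://localhost:19612/metrics | head -n 20
If the endpoint is enabled, the command prints the first lines of the OpenMetrics output.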
Setting Up Alerts with the ECE Alert Configuration Template
To set up alerts with the ECE Alert Configuration Template:
1. Locate and extract the Alert Configuration Template from the ECE docker archive (oc-cn-ece-docker-files-version.tgz). The template is in the docker_files/samples/monitoring/prometheus_rules/ directory. Alternatively, you can find the file in the home/charging/temp/sample_data/metrics/prometheus_rules directory in the ECS pod or, for on-premises installations, in /scratch/ri-user-1/opt/OracleCommunications/ECE/ECE/oceceserver/sample_data/metrics/prometheus_rules.
2. Deploy Prometheus (standalone or using Prometheus Operator) in your Kubernetes cluster. Install Alertmanager for managing alert routing and notifications. Deploy Grafana for metrics visualization. See BRM Compatibility Matrix for version information.
3. Edit the ECE alert file, eceAlertRules.yaml, to set alert thresholds, durations, and logic. For more information on creating alert rules, see "Alerting Rules" in the Prometheus documentation at https://prometheus.io/docs/prometheus/3.0/configuration/alerting_rules/. You can also consult the README.md file located in the same directory as the eceAlertRules.yaml file for information specific to ECE. A sketch of the rule structure is shown after this procedure.
4. To use these rules in a standalone Prometheus environment:
   a. Add the file path to the rule_files section of your Prometheus configuration file (Prometheus_home/prometheus.yml) on your local machine:
      rule_files:
        - "rules/eceAlertRules.yaml"
   b. Restart or reload Prometheus to apply the new rules.
5. To deploy these rules in a Prometheus Operator for Kubernetes (Custom Resource Definition-based) configuration, edit the Kubernetes eceAlertRules.yaml file:
   a. Add the following lines at the beginning of the file. Ensure that the labels align with what is used in your environment. This is just a sample:
      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: ece-alerts
        labels:
          monitoring: shared
          release: prometheus
      spec:
   b. Indent the entire groups section by 2 spaces to move it inside the spec section.
   c. Apply the file to your Kubernetes cluster by running the following command:
      kubectl apply -f eceAlertRules.yaml
6. Configure your preferred notification endpoints using Alertmanager or Grafana. For information about configuring this in Alertmanager, see "Configuration" in the Alertmanager documentation at: https://prometheus.io/docs/alerting/latest/configuration/. For information about configuring this in Grafana, see "Configure notifications" in the Grafana documentation at: https://grafana.com/docs/grafana/latest/alerting/configure-notifications/.
7. Import the sample Grafana dashboards into Grafana from the home/charging/temp/sample_data/metrics/grafana_dashboards directory in the ECS pods or, in an on-premises environment, from /scratch/ri-user-1/opt/OracleCommunications/ECE/ECE/oceceserver/sample_data/metrics/grafana_dashboards. Connect Grafana to Prometheus as a data source. For more information, see "Prometheus data source" in the Grafana documentation at: https://grafana.com/docs/grafana/latest/datasources/prometheus/. These dashboards can help you visualize metrics and alert statuses.
8. Validate all changes in a non-production environment before deploying to production.
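As a reference for step 3, entries in eceAlertRules.yaml follow the standard Prometheus rule-group structure. The following is an illustrative sketch only; the alert name, expression, threshold, and labels are placeholders and not the values shipped in the template, and the exposition-format metric names should be verified against your /metrics output.
  groups:
    - name: ece-example-alerts
      rules:
        - alert: ExampleEcsHeapUsageHigh        # hypothetical alert name
          expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9
          for: 5m                               # duration the condition must hold before firing
          labels:
            severity: critical
          annotations:
            summary: "ECS JVM heap usage above 90%"
Adjust the expr, for, and severity values to match your own thresholds and escalation policy.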
Modifying Alert Rules and Thresholds
You can update alert rules, thresholds, and durations at any time by editing a copy of the eceAlertRules.yaml file, which is available in the ECS pods in the sample_data/metrics/prometheus_rules directory. In a Prometheus Operator configuration, reapply the updated file to your cluster:
kubectl apply -f eceAlertRules.yaml
Review the comments in the YAML files for descriptions and operational advice. Always document and test changes in a non-production environment before deploying to production.
Enabling ECE Metric Endpoints
The default ECE cloud native configuration exposes JVM, Coherence, and application metric data for all ECE components to a single REST endpoint. If you create additional instances of ECE components, you must configure them to expose metric data.
To ensure that the ECE metric endpoints are enabled:
1. Open your override-values.yaml file for oc-cn-ece-helm-chart.
2. Verify that the charging.metrics.port key is set to the port number where you want to expose the ECE metrics. The default is 19612.
3. Verify that each ECE component instance has metrics enabled.
   Each application role under the charging key can be configured to enable or disable metrics. In the jvmOpts key, setting the ece.metrics.http.service.enabled option enables (true) or disables (false) the metrics service for that role.
   For example, these override-values.yaml entries would enable the metrics service for ecs1:
   charging:
     labels: "ece"
     jmxport: "9999"
     …
     metrics:
       port: "19612"
     ecs1:
       jmxport: ""
       replicas: 1
       …
       jvmOpts: "-Dece.metrics.http.service.enabled=true"
       restartCount: "0"
4. Save and close your override-values.yaml file.
5. Run the helm upgrade command to update your ECE Helm release:
   helm upgrade EceReleaseName oc-cn-ece-helm-chart --namespace EceNameSpace --values OverrideValuesFile
   where:
   - EceReleaseName is the release name for oc-cn-ece-helm-chart.
   - EceNameSpace is the namespace in which to create ECE Kubernetes objects for the ECE Helm chart.
   - OverrideValuesFile is the name and location of your override-values.yaml file for oc-cn-ece-helm-chart.
If you later change the metrics endpoint or port number, edit the override-values.yaml file for oc-cn-ece-helm-chart and run the helm upgrade command again. After updating the file, also update Prometheus to point to the new metrics endpoint or port number.
This enables you to manage the scope and detail of data collected for alerting and visualization according to your operational requirements.
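To confirm that a component instance is exposing metrics after the upgrade, you can port-forward to the pod and request the endpoint from your workstation. This is a sketch under assumed names: the namespace and pod name below are placeholders for values from your deployment.
  # Hypothetical verification; adjust the namespace and pod name to match your environment.
  kubectl port-forward -n ece pod/ecs1-0 19612:19612 &
  curl -s http://localhost:19612/metrics | head -n 20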
Best Practices and Important Notes
- The Alert Configuration Template provides sample rules and dashboards. Always adapt these to your business requirements.
- ECE alert rules supplied by Oracle are general; customize them for your operational context.
- Test all monitoring and alerting configurations in a non-production environment first.
- Set up and validate notification channels as part of your deployment.
- Log-based alerting is not included and must be added separately if needed.
- Always validate rule syntax using Prometheus tools before deployment (see the example at the end of this section).
- Maintain version control for all Prometheus rule files to track configuration changes.
- Document all custom thresholds and rationale for future reference.
Note:
Alertmanager integrations, such as email, PagerDuty, and Slack, require your own configuration. See "Alertmanager" in the Prometheus documentation at https://prometheus.io/docs/alerting/0.26/overview/ for more information on integration specifics.
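For example, the promtool utility that ships with Prometheus can validate rule syntax before you load a rule file. The file path below is a placeholder, and the check applies to the standalone rule-file format rather than the PrometheusRule custom resource.
  # Validate rule syntax before loading the file into Prometheus.
  promtool check rules rules/eceAlertRules.yaml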
Sample Prometheus Operator Configuration
After installing Prometheus Operator, you configure it to scrape metrics from the ECE metric endpoint. The following shows sample entries you can use to create Prometheus Service and ServiceMonitor objects that scrape ECE metric data.
This sample creates a Service object that specifies to:
- Select all pods with the app label ece
- Scrape metrics from port 19612
apiVersion: v1
kind: Service
metadata:
  name: prom-ece-metrics
  labels:
    application: prom-ece-metrics
spec:
  ports:
  - name: metrics
    port: 19612
    protocol: TCP
    targetPort: 19612
  selector:
    app: ece
  sessionAffinity: None
  type: ClusterIP
  clusterIP: None
This sample creates a ServiceMonitor object that specifies to:
- Select all namespaces with ece in their name
- Select all Service objects with the application label prom-ece-metrics
- Scrape metrics from the HTTP path /metrics every 15 seconds
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-ece-metrics
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: metrics
    scheme: http
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - ece
  selector:
    matchLabels:
      application: prom-ece-metrics
For more information about configuring Prometheus Operator, see https://github.com/prometheus-operator/prometheus-operator/tree/main/Documentation.
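After the ServiceMonitor is applied and Prometheus has reconciled it, a quick way to confirm that the ECE endpoints are being scraped is to query the built-in up series. The job label value below is an assumption based on the Service name in this sample; check the actual label on the Prometheus targets page for your deployment.
  up{job="prom-ece-metrics"}
A value of 1 for each target indicates that the last scrape succeeded.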
ECE Cloud Native Metrics
ECE cloud native collects metrics in the following groups to produce data for monitoring your ECE components:
Note:
Additional labels in the metrics indicate the name of the executor.
BRS Metrics
The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS).
Table 32-1 lists the metrics in this group.
Table 32-1 BRS Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.brs.message.receive | Counter | Tracks how many messages have been received. |
| ece.brs.message.send | Counter | Tracks how many messages have been sent. |
| ece.brs.task.processed | Counter | Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component. You can use this to track the approximate processing rate over time, aggregated across all client applications, and so on. |
| ece.brs.task.pending.count | Gauge | Contains the number of requests that are pending for each ECE component. |
| ece.brs.current.latency.by.type | Gauge | Tracks the latency of a charging client for each service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit. |
| ece.brs.current.latency | Gauge | Tracks the current operation latency for a charging client in the current scrape interval. This metric contains the BRS statistics tracked using the charging.brsConfigurations MBean attributes. This configuration tracks the maximum and average latency for an operation type since the last query. The maximum window size for collecting this data is 30 seconds, so the query has to be run every 30 seconds. This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report. |
| ece.brs.retry.queue.phase.count | Counter | Tracks the count of operations performed on the retry queue. Additional label: phase |
| ece.brs.task.resubmit | Counter | Tracks the number of tasks that were scheduled for retry. Additional label: resubmitReason |
| ece.brs.task.retry.count | Counter | Tracks the distribution of the number of retries performed for a retried request. Additional labels: source, retries |
| ece.brs.task.retry.distribution | Distribution Summary | Tracks the distribution of the number of retries performed for a retried request. Additional label: source |
| ece.brs.queue.health.monitor.threshold.enabled | Gauge | (Requires Interim Patch 37951934) Tracks the status of the threshold monitor: 0 (stopped) or 1 (started). Available labels: applicationRole, configName, name, type |
| ece.brs.queue.health.monitor.threshold.level | Gauge | (Requires Interim Patch 37951934) Tracks the threshold-level index representing the monitor's severity stage as an integer: 0 (None), 1 (LEVEL_ONE), 2 (LEVEL_TWO), or 3 (LEVEL_THREE). Available labels: applicationRole, configName, name, type |
| ece.brs.queue.health.monitor.threshold.reading | Gauge | (Requires Interim Patch 37951934) Tracks the monitored value against which the threshold values are evaluated. Available labels: applicationRole, configName, name, type |
| ece.brs.queue.health.monitor.threshold.setpoint | Gauge | (Requires Interim Patch 37951934) Specifies the threshold setpoint above which the monitor enters a particular overload control level. Available labels: applicationRole, configName, level, name, thresholdType, type |
| ece.brs.queue.health.monitor.threshold.state.transitions.total | Counter | (Requires Interim Patch 37951934) Tracks the total number of transitions between the threshold states. Available labels: applicationRole, configName, name, type |
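For example, a common way to chart the approximate BRS processing rate from ece.brs.task.processed is a per-second rate over a short window. The query below is illustrative: it assumes the dotted metric name is exposed with underscores and a _total suffix in the OpenMetrics output, and that an applicationRole label is present; verify both against your /metrics output.
  sum by (applicationRole) (rate(ece_brs_task_processed_total[5m]))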
Reactor Netty ConnectionProvider Metrics
The Reactor Netty ConnectionProvider Metrics group contains standard metrics that provide insights into the pooled ConnectionProvider which supports built-in integration with Micrometer. Table 32-2 lists the metrics in this group.
For additional information about Reactor Netty ConnectionProvider Metrics, see the Reactor Netty Reference Guide in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.
Table 32-2 Reactor Netty ConnectionProvider Metrics
| Metric Name | Type | Description |
|---|---|---|
| reactor.netty.connection.provider.total.connections | Gauge | Tracks the number of active or idle connections. |
| reactor.netty.connection.provider.active.connections | Gauge | Tracks the number of connections that have been successfully acquired and are in active use. |
| reactor.netty.connection.provider.max.connections | Gauge | Tracks the maximum number of active connections that are allowed. |
| reactor.netty.connection.provider.idle.connections | Gauge | Tracks the number of idle connections. |
| reactor.netty.connection.provider.pending.connections | Gauge | Tracks the number of requests that are waiting for a connection. |
| reactor.netty.connection.provider.pending.connections.time | Timer | Tracks the time spent waiting to acquire a connection from the connection pool. |
| reactor.netty.connection.provider.max.pending.connections | Gauge | Tracks the maximum number of requests that are queued while waiting for a ready connection. |
Reactor Netty HTTP Client Metrics
The Reactor Netty HTTP Client Metrics group contains standard metrics that provide insights into the HTTP client which supports built-in integration with Micrometer. Table 32-3 lists the metrics in this group.
For additional information about Reactor Netty HTTP Client Metrics, see "Reactor Netty Reference Guide" in the Project Reactor documentation: https://projectreactor.io/docs/netty/1.1.17/reference/index.html.
Table 32-3 Reactor Netty HTTP Client Metrics
| Metric Name | Type | Description |
|---|---|---|
| reactor.netty.http.client.data.received | DistributionSummary | Tracks the amount of data received, in bytes. |
| reactor.netty.http.client.data.sent | DistributionSummary | Tracks the amount of data sent, in bytes. |
| reactor.netty.http.client.errors | Counter | Tracks the number of errors that occurred. |
| reactor.netty.http.client.tls.handshake.time | Timer | Tracks the amount of time spent for TLS handshakes. |
| reactor.netty.http.client.connect.time | Timer | Tracks the amount of time spent connecting to the remote address. |
| reactor.netty.http.client.address.resolver | Timer | Tracks the amount of time spent resolving the remote address. |
| reactor.netty.http.client.data.received.time | Timer | Tracks the amount of time spent consuming incoming data. |
| reactor.netty.http.client.data.sent.time | Timer | Tracks the amount of time spent sending outgoing data. |
| reactor.netty.http.client.response.time | Timer | Tracks the total time for the request or response. |
BRS Queue Metrics
The BRS Queue Metrics group contains the metrics for tracking the throughput and latency of the BRS queue. Table 32-4 lists the metrics in this group.
Table 32-4 BRS Queue Metrics
| Metric | Type | Description |
|---|---|---|
| ece.eviction.queue.size | Gauge | Tracks the number of items in the queue. |
| ece.eviction.queue.eviction.batch.size | Gauge | Tracks the number of queue items the eviction cycle processes in each iteration. |
| ece.eviction.queue.time | Timer | Tracks the amount of time items spend in the queue. |
| ece.eviction.queue.operation.duration | Timer | Tracks the time it takes to perform an operation on the queue. |
| ece.eviction.queue.scheduled.operation.duration | Timer | Tracks the time it takes to perform a scheduled operation on the queue. |
| ece.eviction.queue.operation.failed | Counter | Counts the number of failures for a queue operation. |
CDR Formatter Metrics
The CDR Formatter Metrics group contains the metrics for tracking Charging Function (CHF) records. Table 32-5 lists the metrics in this group.
Table 32-5 CDR Formatter Metrics
| Metric Name | Metric Type | Description |
|---|---|---|
| ece.chf.records.processed | Counter | Tracks the total number of CHF records the CDR formatter has processed. |
| ece.chf.records.purged | Counter | Tracks the total number of CHF records the CDR formatter purged. |
| ece.chf.records.loaded | Counter | Tracks the total number of CHF records the CDR formatter has loaded. |
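For example, the CdrFormatterProcessedRecordsIn15Mins alert described later in this chapter watches processed-record activity over a 15-minute window; a query of the same shape, shown here purely as an illustration, reports how many CHF records were processed in the last 15 minutes. The underscore metric name and _total suffix are assumptions about how the dotted name appears in the exposition format.
  increase(ece_chf_records_processed_total[15m])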
Coherence Metrics
All Coherence metrics that are available through the Coherence metrics endpoint are also accessible through the ECE metrics endpoint.
- For details of the Coherence metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.
- For information about querying Coherence metrics, see "Querying for Coherence Metrics" in Oracle Fusion Middleware Managing Oracle Coherence.
Coherence Federated Service Metrics
The Coherence Federated Service Metrics group contains metrics about conflicts that have occurred during the federation process in active-active disaster recovery systems. Table 32-6 lists the metrics in this group.
Table 32-6 Coherence Federated Service Metrics
| Metric Name | Type | Description |
|---|---|---|
| coherence_federation_destination_total_committing_local_events | Counter | Tracks the total number of Coherence COMMITTING_LOCAL change event types. |
| coherence_federation_destination_total_committing_local_local_only | Counter | Tracks the total number of cache entries that were set local only by a COMMITTING_LOCAL interceptor. Local only changes are not federated. |
| coherence_federation_destination_total_committing_local_modified | Counter | Tracks the total number of cache entries that were modified by a COMMITTING_LOCAL interceptor. |
| coherence_federation_destination_total_committing_local_rejected | Counter | Tracks the total number of cache entries that were rejected by a COMMITTING_LOCAL interceptor. |
| coherence_federation_destination_total_committing_local_unmodified | Counter | Tracks the total number of cache entries in COMMITTING_LOCAL events that were not modified by any COMMITTING_LOCAL event interceptors. |
| coherence_federation_destination_total_replicating_events | Counter | Tracks the total number of REPLICATING events. |
| coherence_federation_destination_total_replicating_modified | Counter | Tracks the total number of cache entries that were modified by a REPLICATING interceptor. |
| coherence_federation_destination_total_replicating_rejected | Counter | Tracks the total number of cache entries that were rejected by a REPLICATING interceptor. |
| coherence_federation_destination_total_replicating_unmodified | Counter | Tracks the total number of cache entries in REPLICATING events that were not modified by any REPLICATING event interceptors. |
| coherence_federation_origin_total_committing_remote_events | Counter | Tracks the total number of COMMITTING_REMOTE events. |
| coherence_federation_origin_total_committing_remote_modified | Counter | Tracks the total number of cache entries that were modified by a COMMITTING_REMOTE interceptor. |
| coherence_federation_origin_total_committing_remote_rejected | Counter | Tracks the total number of cache entries that were rejected by a COMMITTING_REMOTE interceptor. |
| coherence_federation_origin_total_committing_remote_unmodified | Counter | Tracks the total number of cache entries in COMMITTING_REMOTE events that were not modified by any COMMITTING_REMOTE event interceptors. |
Diameter Gateway Metrics
The Diameter Gateway group contains metrics on events processed by the Diameter Gateway. Table 32-7 lists the metrics in this group.
Table 32-7 Diameter Gateway Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.diameter.current.latency.by.type | Gauge | Tracks the latency of an Sy request for each operation type in the current query interval. The SLR_INITIAL_REQUEST, SLR_INTERMEDIATE_REQUEST, and STR operations are tracked. |
| ece.diameter.session.count | Gauge | Tracks the count of the currently cached Diameter sessions. Additional label: Identity |
| ece.diameter.session.cache.capacity | Gauge | Tracks the maximum number of Diameter session cache entries. Additional label: Identity |
| ece.diameter.session.sub.count | Gauge | Tracks the count of currently cached active ECE sessions. This is the count of sessions in the right side of the session map (Map<String, Map<String, DiameterSession>>). |
| ece.diameter.notification.requests.sent | Timer | Tracks the amount of time taken to send a Diameter notification. Additional labels: protocol, notificationType, result |
| ece.requests.by.result.code | Counter | Tracks the total number of requests processed for each result code. |
ECE Federated Service Metrics
The ECE Federation Service Metrics group contains metrics about conflicts that have occurred during the federation process in active-active disaster recovery systems. See "About Conflict Resolution During the Journal Federation Process" for more information.
Table 32-8 lists the metrics in this group.
Table 32-8 ECE Federated Service Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.federated.service.change.records | Counter | Tracks the number of change records and tags them by conflict classification type. |
ECE Notification Metrics
The ECE Notification Metrics group contains metrics for tracking the throughput, latency, and success or error rates for outgoing requests from the various ECE gateways grouped by application role. Table 32-9 lists the metrics in this group.
Table 32-9 ECE Notification Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece_notification_requests_sent_seconds | Timer | Tracks the latency of requests sent from a gateway to clients. In the metric, the applicationRole label indicates the gateway that sent the request. |
| ece_notification_requests_sent_total | Counter | Counts the number of successfully delivered requests to ECE clients as well as the number of records delivered to the Dead Letter queue for unsuccessful requests. It keeps a separate count for each record type. |
EM Gateway Metrics
The EM Gateway Metrics group contains standard metrics that provide insights into the current status of your EM Gateway activity and tasks. Table 32-10 lists the metrics in this group.
Table 32-10 EM Gateway Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.emgw.processing.latency | Timer | Tracks the overall time taken in the EM Gateway. Additional label: handler |
| ece.emgw.handler.processing.latency | Timer | Tracks the total processing time taken for each request processed by a handler. Additional label: handler |
| ece.emgw.handler.processing.latency.by.phase | Timer | Tracks the time it takes to send a request to the dispatcher or BRS. Additional labels: phase, handler |
| ece.emgw.handler.error.count | Counter | Tracks the number of failed requests. Additional labels: handler, failureReason |
| ece.emgw.opcode.formatter.error | Counter | Tracks the number of opcode formatter errors. Additional label: phase |
HTTP Gateway Metrics
The HTTP Gateway group contains metrics on events processed by the HTTP Gateway. Table 32-11 lists the metrics in this group.
Table 32-11 HTTP Gateway Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece_http_server_api_requests_failed | Counter | Tracks the number of failed requests. |
| ece_http_current_latency_seconds | Timer/Gauge | Tracks the current latency of a request for each operation type, in seconds. |
JVM Metrics
The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of JVMs which are members of the ECE grid. Table 32-12 lists the metrics in this group.
Table 32-12 JVM Metrics
| Metric Name | Type | Description |
|---|---|---|
| coherence.os.free.swap.space.size | Gauge | Contains system swap usage information (by default in megabytes) for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
| coherence.os.system.cpu.load | Gauge | Contains the CPU load percentage for each system in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
| jvm.buffer.count.buffers | Gauge | Contains the estimated number of mapped and direct buffers in the JVM memory pool. |
| jvm.buffer.total.capacity.bytes | Gauge | Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool. |
| jvm.memory.bytes.init | Gauge | Contains the initial size, in bytes, for the Java heap and non-heap memory. |
| jvm.memory.bytes.committed | Gauge | Contains the committed size, in bytes, for the Java heap and non-heap memory. |
| jvm.memory.bytes.used | Gauge | Contains the amount, in bytes, of Java heap and non-heap memory that is in use. |
| jvm.memory.bytes.max | Gauge | Contains the maximum size, in bytes, for the Java heap and non-heap memory. |
| jvm.memory.pool.bytes.init | Gauge | Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
| jvm.memory.pool.bytes.committed | Gauge | Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
| jvm.memory.pool.bytes.used | Gauge | Contains the amount, in bytes, of Java memory space in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space. |
| process.cpu.usage | Gauge | Contains the CPU percentage for each ECE component on the server. This data is collected by the JVMs from the corresponding MBean attributes. |
| process.files.open.files | Gauge | Contains the total number of file descriptors currently available for an ECE component and the descriptors in use for that ECE component. |
| system.load.average.1m | Gauge | Contains the system load average (the number of items waiting in the CPU run queue) for each machine in the cluster. These statistics are based on the average data collected from all the ECE grid members running on a server. |
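For example, heap utilization as a percentage can be derived by dividing used heap by maximum heap. The query below is a sketch: it assumes the exposition names jvm_memory_bytes_used and jvm_memory_bytes_max with an area="heap" label and a pod label added by your scrape configuration; confirm the exact names and labels against your /metrics output.
  100 * sum by (pod) (jvm_memory_bytes_used{area="heap"})
      / sum by (pod) (jvm_memory_bytes_max{area="heap"})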
Kafka JMX Metrics
The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 32-13 lists the metrics in this group.
Table 32-13 Kafka JMX Metrics
| Metric Name | Type | Description |
|---|---|---|
| kafka.app.info.start.time.ms | Gauge | Indicates the start time, in milliseconds. |
| kafka.producer.connection.close.rate | Gauge | Contains the number of connections closed per second. |
| kafka.producer.io.ratio | Gauge | Contains the fraction of time the I/O thread spent doing I/O. |
| kafka.producer.io.wait.time.ns.total | Counter | Contains the total time the I/O thread spent waiting. |
| kafka.producer.iotime.total | Counter | Contains the total time the I/O thread spent doing I/O. |
| kafka.producer.metadata.wait.time.ns.total | Counter | Contains the total time, in nanoseconds, the producer has spent waiting on topic metadata. |
| kafka.producer.node.request.latency.max | Gauge | Contains the maximum latency, in milliseconds, of producer node requests. |
| kafka.producer.record.error.total | Counter | Contains the total number of record sends that resulted in errors. |
| kafka.producer.txn.begin.time.ns.total | Counter | Contains the total time, in nanoseconds, the producer has spent in beginTransaction. |
| kafka.producer.txn.commit.time.ns.total | Counter | Contains the total time, in nanoseconds, the producer has spent in commitTransaction. |
Kafka Client Metrics
The Kafka Client Metrics group contains metrics for tracking the throughput, latency, and performance of Kafka producer and consumer clients.
Note:
All Kafka producer metrics apply to the ECS, HTTP Gateway, Diameter Gateway, and BRM Gateway. All Kafka consumer metrics apply to the BRM Gateway, RADIUS Gateway, HTTP Gateway, and Diameter Gateway.
For more information about the available metrics, refer to the following Apache Kafka documentation:
- Producer Metrics: https://kafka.apache.org/36/generated/producer_metrics.html
- Consumer Metrics: https://kafka.apache.org/36/generated/consumer_metrics.html
Micrometer Executor Metrics
The Micrometer Executor Metrics group contains standard metrics that provide insights into the activity of your thread pool and the status of tasks. These metrics are created by Micrometer, a third party software. Table 32-14 lists the metrics in this group.
Note:
The Micrometer API optionally allows a prefix to the name. In the table below, replace prefix with ece.brs for BRS metrics or ece.emgw for EM Gateway metrics.
Table 32-14 Micrometer Executor Metrics
| Metric Name | Type | Description |
|---|---|---|
| prefix.executor.completed.tasks | FunctionCounter | Tracks the approximate total number of tasks that have completed execution. Additional label: Identity |
| prefix.executor.active.threads | Gauge | Tracks the approximate number of threads that are actively executing tasks. Additional label: Identity |
| prefix.executor.queued.tasks | Gauge | Tracks the approximate number of tasks that are queued for execution. Additional label: Identity |
| prefix.executor.queue.remaining.tasks | Gauge | Tracks the number of additional elements that this queue can ideally accept without blocking. Additional label: Identity |
| prefix.executor.pool.size.threads | Gauge | Tracks the current number of threads in the pool. Additional label: Identity |
| prefix.executor.pool.core.threads | Gauge | Tracks the core number of threads in the pool. Additional label: Identity |
| prefix.executor.pool.max.threads | Gauge | Tracks the maximum allowed number of threads in the pool. Additional label: Identity |
RADIUS Gateway Metrics
Table 32-15 lists the metrics in the RADIUS Gateway Metrics group.
Table 32-15 RADIUS Gateway Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.radius.sent.disconnect.message.counter.total | Counter | Tracks the number of unique disconnect messages sent to the Network Access Server (NAS), excluding the retried ones. |
| ece.radius.retried.disconnect.message.counter.total | Counter | Tracks the number of retried disconnect messages, excluding the total number of retries. |
| ece.radius.successful.disconnect.message.counter.total | Counter | Tracks the number of successful disconnect messages. |
| ece.radius.failed.disconnect.message.counter.total | Counter | Tracks the number of failed disconnect messages. |
| ece.radius.auth.extension.user.data.latency | Timer | Tracks the following: |
Rated Event Formatter (REF) Metrics
The Rated Event Formatter (REF) Metrics group contains standard metrics that provide insights into the current status of your REF activity and tasks. Table 32-16 lists the metrics in this group.
Table 32-16 REF Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.rated.events.checkpoint.interval | Gauge | Tracks the time, in seconds, used by the REF instance to read a set of rated events at a specific time interval. |
| ece.rated.events.ripe.duration | Gauge | Tracks the duration, in seconds, that rated events have existed before they can be processed. |
| ece.rated.events.worker.count | Gauge | Contains the number of worker threads used to process rated events. |
| ece.rated.events.phase.latency | Timer | Tracks the amount of time taken to complete a rated event phase. This only measures successful phases. Additional labels: phase, siteName |
| ece.rated.events.phase.failed | Counter | Tracks the number of rated event phase operations that have failed. Additional labels: phase, siteName |
| ece.rated.events.checkpoint.age | Gauge | Tracks the difference in time between the retrieved data and the current time stamp. Additional labels: phase, siteName |
| ece.rated.events.batch.size | Gauge | Tracks the number of rated events retrieved on each iteration. Additional labels: phase, siteName |
Rated Events Metrics
The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 32-17 lists the metrics in this group.
Table 32-17 Rated Events Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.rated.events.formatted | Counter | Contains the number of successful or failed formatted rated events per RatedEventFormatter worker thread upon each formatting job operation. |
| ece.rated.events.cached | Counter | Contains the total number of rated events cached by each ECE node. |
| ece.rated.events.inserted | Counter | Contains the total number of rated events that were successfully inserted into the database. |
| ece.rated.events.insert.failed | Counter | Contains the total number of rated events that failed to be inserted into the database. |
| ece.rated.events.purged | Counter | Contains the total number of rated events that are purged. |
| ece.requests.by.result.code | Counter | Tracks the total number of requests processed for each result code. |
Session Metrics
The Session Metrics group contains metrics on ECE server sessions. Table 32-18 lists the metrics in this group.
Table 32-18 Session Metrics
| Metric Name | Type | Description |
|---|---|---|
| ece.session.metrics | Counter | Contains the total number of sessions opened or closed by rating group, node, or cluster. |
ECE Cloud Native Alerts
The default alerts described in this section are located in the sample_data/metrics/prometheus_rules/eceAlertRules.yaml file. See "Setting Up Alerts with the ECE Alert Configuration Template" for more information.
CDR Formatter Alerts
The CDR Formatter Alerts group contains the alerts expressions for Charging Function (CHF) records. Table 32-19 lists the alerts in this group.
Table 32-19 CDR Formatter Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| CdrFormatterProcessedRecordsIn15Mins | critical | An alert is triggered if the number of records processed by the CDR formatter in the last 15 minutes is zero. |
| CdrFormatterProcessedRecordsLow | warning | An alert is triggered if the number of records processed by the CDR formatter in the last 15 minutes is less than 5000. |
| CdrFormatterErrorMessageIn5Mins | warning | An alert is triggered if the rate of error messages increases by more than 10% in 5 minutes. |
CDR Gateway Alerts
Table 32-20 CDR Gateway Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| CdrGatewayErrorMessageIn5Mins | warning | An alert is triggered if the increase in error messages is greater than 10% in 5 minutes. |
Coherence Alerts
All Coherence alert expressions used in the eceAlertRules.yaml file are listed in Table 32-21.
Table 32-21 Coherence Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| CoherenceCacheServiceThreadUtilizationCritical | critical | An alert is triggered when the 5-minute average Coherence cache service thread utilization is greater than 80%. |
| CoherenceCacheServiceThreadUtilizationMajor | major | An alert is triggered when the 5-minute average Coherence cache service thread utilization is 60% to 80%. |
| CoherenceCacheServiceThreadUtilizationMinor | minor | An alert is triggered when the 5-minute average Coherence cache service thread utilization is 40% to 60%. |
| CoherenceCacheWriteBehindQueueTooLarge | major | An alert is triggered if the write-behind queue size for the cache (excluding AggregateObjectUsage) is more than 1,000 for more than 5 minutes. This alert indicates a persistence queue overload. |
| CoherenceAggregateObjectUsageCacheLargeAndGrowing | major | An alert is triggered if the AggregateObjectUsage cache is abnormally large (over 1,000,000 entries) and has still been increasing in the past hour. |
| EceCustomerCacheDown | warning | An alert is triggered if the customer cache size has dropped by more than 5% compared to the maximum amount in the last 4 hours. |
| EceServiceNotificationHigh | critical | An alert is triggered if the number of entries in the ServiceContext cache exceeds 200. This can cause the write-behind thread not to publish notifications, leading to service degradation. |
| CoherenceCustomerCacheSizeChangeCritical | critical | An alert is triggered if the size of the customer cache (in bytes) changes by more than 90%, either over a 24-hour period or for the total size. |
| CoherenceCustomerCacheSizeChangeMajor | major | An alert is triggered if the size of the customer cache (in bytes) changes by 80% to 90% over a 24-hour period. |
| CoherenceCustomerCacheSizeChangeMinor | minor | An alert is triggered if the size of the customer cache (in bytes) changes by 60% to 80% over a 24-hour period. |
| EceFederationCacheReplicationIncomplete | major | An alert is triggered if the federation cache replication value stays below 100% over a period of time during the replicate-all operation. |
| EceFederationCacheError | major | An alert is triggered if the rate of federation cache errors increases in the last 5 minutes. The alert indicates errors in federation operations. |
| EceFederationCacheBacklogCritical | critical | An alert is triggered if the total federation journal size exceeds 60% of the maximum flash journal size set in the Coherence journal configuration. |
| EceFederationCacheBacklogMajor | major | An alert is triggered if the total federation journal size is 40% to 60% of the maximum flash journal size set in the Coherence journal configuration. |
| EceFederationCacheBacklogMinor | minor | An alert is triggered if the total federation journal size is 20% to 40% of the maximum flash journal size set in the Coherence journal configuration. |
| CoherenceHAStatusEndangered | critical | An alert is triggered if the Coherence partition assignment HA status is ENDANGERED. This alert helps prevent potential data loss in high-availability systems. |
| CoherenceServicePartitionsUnbalanced | critical | An alert is triggered if more than 10 Coherence services have unbalanced partitions for more than 5 minutes. |
| CoherenceServiceRequestPendingHigh | major | An alert is triggered if Coherence service requests are pending. |
| CoherenceServiceRequestPendingTooLong | major | An alert is triggered if Coherence service requests are pending for longer than 10,000 milliseconds. |
| CoherenceServiceTaskBacklogHigh | major | An alert is triggered if there are more than 5 service tasks in a backlog state. |
| EceFederationCacheBacklogIncrease | major | An alert is triggered when the federation cache backlog is more than 12 in 5 minutes. |
| CoherenceCustomerCacheSizeCritical | critical | An alert is triggered when the Coherence customer cache size is greater than 15 GB. |
| CoherenceCustomerCacheSizeMajor | major | An alert is triggered when the Coherence customer cache size is between 10 and 15 GB. |
| CoherenceCustomerCacheSizeMinor | minor | An alert is triggered when the Coherence customer cache size is between 5 and 10 GB. |
Diameter Gateway Alerts
The Diameter Gateway group contains alert expressions on events processed by the Diameter Gateway. Table 32-22 lists the alerts in this group.
Table 32-22 Diameter Gateway Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| DiameterGatewayPendingTaskHigh | critical | An alert is triggered if the pods have 100 or more pending tasks for 5 minutes. |
| DiameterGatewaySyRequestFailed | warning | An alert is triggered when the failure rate for Sy requests is between 5% and 10%. |
| DiameterGatewaySyRequestFailedHigh | critical | An alert is triggered when the failure rate for Sy requests is beyond 10%. |
| DiameterGatewayGyRequestFailed | warning | An alert is triggered when the failure rate for Gy requests is between 5% and 10%. |
| DiameterGatewayGyRequestFailedHigh | critical | An alert is triggered when the failure rate for Gy requests is beyond 10%. |
| DiameterGatewayCurrentLatency | warning | An alert is triggered if the batch request service (BRS) latency is between 100 milliseconds and 500 milliseconds. |
| DiameterGatewayCurrentLatencyHigh | critical | An alert is triggered if the BRS latency is more than 300 milliseconds. |
| DiameterGatewayThroughput | warning | An alert is triggered if the Diameter Gateway throughput is less than 50 in 30 minutes. |
| EceCoherenceStateIncorrect | critical | An alarm is triggered if the pod is not in the usage processing state (10). |
EM Gateway Alerts
The EM Gateway Alerts group contains standard alert expressions for your EM Gateway activity and tasks. Table 32-23 lists the alerts in this group.
Table 32-23 EM Gateway Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| EceEmGwLatency | warning | An alert is triggered if the latency in the EM Gateway exceeds 0.1 seconds. |
| EmGwHandlerFailedRequestError | critical | An alert is triggered if the number of recorded errors for the EM Gateway handler is more than zero in the last 5 minutes. |
| EceEmgwHandlerProcessingLatencyHigh | warning | An alert is triggered if the EM Gateway handler latency for a phase exceeds 0.1 seconds. |
HTTP Gateway Alerts
Table 32-24 HTTP Gateway Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| HttpGatewayLatencyHigh | warning | An alert is triggered if the request latency at the 95th percentile is greater than 0.1 seconds. |
| HttpRequestFailureRateHigh | critical | An alert is triggered if the failure rate is greater than 10%. |
| HttpGatewayErrorMessageCountLast5Mins | warning | An alert is triggered if the error message rate increase is greater than 10% over a period of 5 minutes. |
JVM Alerts
The JVM Alerts group contains standard alert expressions about the central processing unit (CPU) and memory utilization of JVMs which are members of the ECE grid. Table 32-25 lists the alerts in this group.
Table 32-25 JVM Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| EcsJvmHeap90Percent | critical | An alert is triggered if the heap size used is more than 90% for ECS. |
| EcsJvmHeap80Percent | major | An alert is triggered if the heap size used is between 80% and 90% for ECS. |
| HttpGatewayJvmHeap90Percent | critical | An alert is triggered if the heap size used is more than 90% for HTTP Gateway. |
| HttpGatewayJvmHeap80Percent | major | An alert is triggered if the heap size used is between 80% and 90% for HTTP Gateway. |
| CdrGatewayJvmHeap90Percent | critical | An alert is triggered if the heap size used is more than 90% for CDR Gateway. |
| CdrGatewayJvmHeap80Percent | major | An alert is triggered if the heap size used is between 80% and 90% for CDR Gateway. |
| CdrFormatterJvmHeap90Percent | critical | An alert is triggered if the heap size used is more than 90% for CDR Formatter. |
| CdrFormatterJvmHeap80Percent | major | An alert is triggered if the heap size used is between 80% and 90% for CDR Formatter. |
| HighJvmGcPauseTotal | warning | An alert is triggered if the rate of garbage collection pauses is more than 0.1 seconds over a period of 5 minutes. |
Kubernetes Alerts
Table 32-26 Kubernetes Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| ECSMinimumPods | major | An alert is triggered if not all of the replica pods are available for ECS. |
| EceEmGwMinimumPods | warning | An alert is triggered if not all of the replica pods are available for EM Gateway. |
| DiameterGatewayMinimumPods | critical | An alert is triggered if not all of the replica pods are available for Diameter Gateway. |
| HttpGatewayMinimumPods | major | An alert is triggered if not all of the replica pods are available for HTTP Gateway. |
| CdrGatewayMinimumPods | major | An alert is triggered if not all of the replica pods are available for CDR Gateway. |
| CdrFormatterMinimumPods | warning | An alert is triggered if not all of the replica pods are available for CDR Formatter. |
Rated Event Formatter (REF) Alerts
The Rated Event Formatter (REF) Alerts group contains standard alert expressions that provide insights into the status of your REF activity and tasks. Table 32-27 lists the alerts in this group.
Table 32-27 REF Alerts
| Alert Name | Default Severity | Description |
|---|---|---|
| EceRatedEventThroughput | warning | An alert is triggered if the rated event throughput is less than 3000 over a period of one hour. |
| EceRatedEventsPurgedLow | warning | An alert is triggered if the number of rated events purged falls below 5000 in the previous hour when at least one event was added. |
| EceRatedEventsInsertionRateLow | critical | An alert is triggered if the rated event insertion rate falls below 90% of the rate of the cached events. |
| EceRatedEventsInsertFailed | critical | An alert is triggered if the rate of rated event insertion failures has increased over the last 5 minutes. |
| EceRatedEventFormatterCacheHigh | critical | An alert is triggered if there are more than 1,000,000 cached events. |