27 Monitoring ECE in a Cloud Native Environment

You can monitor system resources, such as memory and thread usage, for your Oracle Communications Elastic Charging Engine (ECE) components in a cloud native environment.

Topics in this document:

  • About Monitoring ECE in a Cloud Native Environment

  • Enabling ECE Metric Endpoints

  • Sample Prometheus Operator Configuration

  • ECE Cloud Native Metrics

About Monitoring ECE in a Cloud Native Environment

You can set up monitoring of your ECE components in a cloud native environment. When configured to do so, ECE exposes JVM, Coherence, and application metric data through a single HTTP endpoint in an OpenMetrics/Prometheus exposition format. You can then use an external centralized metrics service, such as Prometheus, to scrape the ECE cloud native metrics and store them for analysis and monitoring.

Note:

  • ECE only exposes the metrics on an HTTP endpoint. It does not provide the Prometheus service.

  • Do not modify the oc-cn-ece-helm-chart/templates/ece-ecs-metricsservice.yaml file. It is used only during ECE startup and rolling upgrades. It is not used for monitoring.

ECE cloud native exposes metric data for the following components by default:

  • ECE Server

  • BRM Gateway

  • Customer Updater

  • Diameter Gateway

  • EM Gateway

  • HTTP Gateway

  • CDR Formatter

  • Pricing Updater

  • Radius Gateway

  • Rated Event Formatter

Setting up monitoring of these ECE cloud native components involves the following high-level tasks:

  1. Ensuring that the ECE metric endpoints are enabled. See "Enabling ECE Metric Endpoints".

    ECE cloud native exposes metric data through the following endpoint: http://localhost:19612/metrics. For a quick way to check the endpoint from a workstation, see the sketch after this list.

  2. Setting up a centralized metrics service, such as Prometheus Operator, to scrape metrics from the endpoint.

    For an example of how to configure Prometheus Operator to scrape ECE metric data, see "Sample Prometheus Operator Configuration".

  3. Setting up a visualization tool, such as Grafana, to display your ECE metric data in a graphical format.
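To confirm that an ECE pod is exposing metrics before you set up scraping, you can query the endpoint directly from your workstation. The following is a minimal sketch; the pod name ece-server-0 and the namespace ece are placeholders for values from your own deployment.

# Forward the ECE metrics port from one ECE pod to your workstation.
# Replace ece-server-0 and the ece namespace with values from your deployment.
kubectl port-forward pod/ece-server-0 19612:19612 --namespace ece

# In a second shell, fetch the metrics in the OpenMetrics/Prometheus exposition format.
curl -s http://localhost:19612/metrics | head -20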

Enabling ECE Metric Endpoints

The default ECE cloud native configuration exposes JVM, Coherence, and application metric data for all ECE components through a single REST endpoint. If you create additional instances of ECE components, you must also configure them to expose metric data (a sketch of this is shown after the following procedure).

To ensure that the ECE metric endpoints are enabled:

  1. Open your override-values.yaml file for oc-cn-ece-helm-chart.

  2. Verify that the charging.metrics.port key is set to the port number where you want to expose the ECE metrics. The default is 19612.

  3. Verify that each ECE component instance has metrics enabled.

    Each application role under the charging key can be configured to enable or disable metrics. In the jvmOpts key, setting the ece.metrics.http.service.enabled option enables (true) or disables (false) the metrics service for that role.

    For example, these override-values.yaml entries would enable the metrics service for ecs1.

    charging:
       labels: "ece"
       jmxport: "9999"
       …
       metrics:
          port: "19612"
       ecs1:
          jmxport: ""
          replicas: 1
          …
          jvmOpts: "-Dece.metrics.http.service.enabled=true"
          restartCount: "0"
  4. Save and close your override-values.yaml file.

  5. Run the helm upgrade command to update your ECE Helm release:

    helm upgrade EceReleaseName oc-cn-ece-helm-chart --namespace EceNameSpace --values OverrideValuesFile

    where:

    • EceReleaseName is the release name for oc-cn-ece-helm-chart.

    • EceNameSpace is the namespace in which to create ECE Kubernetes objects for the ECE Helm chart.

    • OverrideValuesFile is the name and location of your override-values.yaml file for oc-cn-ece-helm-chart.
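If you have created additional ECE component instances, apply the same pattern to each of them. The following override-values.yaml fragment is a sketch for a hypothetical second server instance named ecs2; the key name and the other settings shown are illustrative and must match how the instance is actually defined in your chart.

charging:
   metrics:
      port: "19612"
   ecs2:
      replicas: 1
      # Hypothetical additional instance; enable its metrics service as for ecs1.
      jvmOpts: "-Dece.metrics.http.service.enabled=true"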

Sample Prometheus Operator Configuration

After installing Prometheus Operator, you configure it to scrape metrics from the ECE metric endpoint. The following shows sample entries you can use to create Prometheus Service and ServiceMonitor objects that scrape ECE metric data.

This sample creates a Service object that does the following:

  • Selects all pods with the app label ece

  • Exposes port 19612, named metrics, for scraping

apiVersion: v1
kind: Service
metadata:
  name: prom-ece-metrics
  labels:
    application: prom-ece-metrics
spec:
  ports:
    - name: metrics
      port: 19612
      protocol: TCP
      targetPort: 19612
  selector:
    app: ece
  sessionAffinity: None
  type: ClusterIP
  clusterIP: None

This sample creates a ServiceMonitor object that does the following:

  • Selects the namespace named ece

  • Selects all Service objects with the application label prom-ece-metrics

  • Scrapes metrics from the HTTP path /metrics every 15 seconds, with a 10-second timeout

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-ece-metrics
spec:
  endpoints:
    - interval: 15s
      path: /metrics
      port: metrics
      scheme: http
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - ece
  selector:
    matchLabels:
      application: prom-ece-metrics

For more information about configuring Prometheus Operator, see https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md.
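After you create the Service and ServiceMonitor objects, you can confirm that Prometheus has discovered the ECE targets and is collecting data by calling its HTTP API. The following sketch assumes that Prometheus is reachable at prometheus.example.com:9090; adjust the host and port for your deployment.

# List the active scrape targets and check that the prom-ece-metrics endpoints are reported as up.
curl -s 'http://prometheus.example.com:9090/api/v1/targets' | grep prom-ece-metrics

# Run an instant query against one of the ECE metrics to confirm that data is flowing.
curl -s 'http://prometheus.example.com:9090/api/v1/query?query=ece_brs_task_pending_count'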

ECE Cloud Native Metrics

ECE cloud native collects metrics in the following groups to produce data for monitoring your ECE components:

JVM Metrics

The JVM Metrics group contains standard metrics about the central processing unit (CPU) and memory utilization of the JVMs that are members of the ECE grid. Table 27-1 lists the metrics in this group.

Table 27-1 JVM Metrics

Metric Name Type Description

jvm_memory_bytes_init

Gauge

Contains the initial size, in bytes, for the Java heap and non-heap memory.

jvm_memory_bytes_committed

Gauge

Contains the committed size, in bytes, for the Java heap and non-heap memory.

jvm_memory_bytes_used

Gauge

Contains the amount of Java heap and non-heap memory, in bytes, that are in use.

jvm_memory_bytes_max

Gauge

Contains the maximum size, in bytes, for the Java heap and non-heap memory.

jvm_memory_pool_bytes_init

Gauge

Contains the initial size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm_memory_pool_bytes_committed

Gauge

Contains the committed size, in bytes, of the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm_memory_pool_bytes_used

Gauge

Contains the amount of Java memory space, in bytes, in use by the following JVM memory pools: G1 Eden Space, G1 Old Gen, and G1 Survivor Space.

jvm_buffer_count_buffers

Gauge

Contains the estimated number of mapped and direct buffers in the JVM memory pool.

jvm_buffer_total_capacity_bytes

Gauge

Contains the estimated total capacity, in bytes, of the mapped and direct buffers in the JVM memory pool.

process_cpu_usage

Gauge

Contains the CPU usage (as a percentage) for each ECE component on the server. This data is collected from the corresponding JVM MBean attributes.

process_files_open_files

Gauge

Contains the total number of file descriptors currently available to an ECE component and the number of descriptors in use by that component.

coherence_os_system_cpu_load

Gauge

Contains the CPU load information (in percentage) for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

system_load_average_1m

Gauge

Contains the system load average (the number of items waiting in the CPU run queue) information for each machine in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.

coherence_os_free_swap_space_size

Gauge

Contains the free swap space size (by default in megabytes) for each system in the cluster.

These statistics are based on the average data collected from all the ECE grid members running on a server.
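As an example of how you might use these metrics, the following PrometheusRule sketch raises an alert when heap usage stays above 90 percent of the maximum heap for five minutes. The object name, threshold, and severity label are illustrative, and the expression assumes that the memory metrics carry an area label that distinguishes heap from non-heap memory, which is common for this metric family.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ece-jvm-alerts
spec:
  groups:
    - name: ece-jvm
      rules:
        - alert: EceHeapNearLimit
          # Heap usage as a fraction of the maximum heap; assumes an area="heap" label.
          expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "ECE JVM heap usage has been above 90% for 5 minutes"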

BRS Metrics

The BRS Metrics group contains the metrics for tracking the throughput and latency of the charging clients that use batch request service (BRS). Table 27-2 lists the metrics in this group.

Table 27-2 ECE BRS Metrics

Metric Name Metric Type Description

ece_brs_task_processed

Counter

Tracks the total number of requests accepted, processed, timed out, or rejected by the ECE component.

You can use this metric to track the approximate processing rate over time, aggregated across all client applications.

ece_brs_task_pending_count

Gauge

Contains the number of requests that are pending in the ECE component.

ece_brs_current_latency_by_type

Gauge

Tracks the latency of a charging client per service type in the current query interval. These metrics are segregated and exposed from the BRS layer per service type and include event_type, product_type, and op_type tags.

This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, and Refund_Unit.

ece_brs_current_latency

Gauge

Tracks the current operation latency for a charging client in the current scrape interval.

This metric contains the BRS statistics tracked through the charging.brsConfigurations MBean attributes, which record the maximum and average latency for an operation type since the last query. The maximum window for collecting this data is 30 seconds, so the metric must be scraped at least every 30 seconds.

This metric provides the latency information for the following operation types: Initiate, Update, Terminate, Cancel, Price_Enquiry, Balance_Query, Debit_Amount, Debit_Unit, Refund_Amount, Refund_Unit, and Spending_Limit_Report.
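Because ece_brs_task_processed is a counter, its per-second rate approximates BRS throughput. The following recording-rule sketch precomputes that rate so that dashboards can chart it directly; the rule and object names are illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ece-brs-rules
spec:
  groups:
    - name: ece-brs
      rules:
        # Approximate BRS throughput (requests per second) over the last 5 minutes.
        - record: ece:brs_task_processed:rate5m
          expr: rate(ece_brs_task_processed[5m])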

Kafka JMX Metrics

The Kafka JMX Metrics group contains metrics for tracking the throughput and latency of the Kafka server and topics. Table 27-3 lists the metrics in this group.

Table 27-3 Kafka JMX Metrics

Metric Name Type Description

kafka_app_info_start_time_ms

Gauge

Indicates the Kafka client application start time, in milliseconds.

kafka_producer_metadata_wait_time_ns_total

Counter

Contains the total time the producer has spent waiting on topic metadata in nanoseconds.

kafka_producer_connection_close_rate

Gauge

Contains the number of connections closed per second.

kafka_producer_iotime_total

Counter

Contains the total time the I/O thread spent doing I/O.

kafka_producer_node_request_latency_max

Gauge

Contains the maximum latency of producer node requests in milliseconds.

kafka_producer_txn_commit_time_ns_total

Counter

Contains the total time the producer has spent in commitTransaction in nanoseconds.

kafka_producer_record_error_total

Counter

Contains the total number of record sends that resulted in errors.

kafka_producer_io_wait_time_ns_total

Counter

Contains the total time the I/O thread spent waiting.

kafka_producer_io_ratio

Gauge

Contains the fraction of time the I/O thread spent doing I/O.

kafka_producer_txn_begin_time_ns_total

Counter

Contains the total time the producer has spent in beginTransaction in nanoseconds.
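These counters also lend themselves to simple alerts. The following sketch fires when any producer record sends have failed during the last five minutes; the object name, window, and severity label are illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ece-kafka-alerts
spec:
  groups:
    - name: ece-kafka
      rules:
        - alert: EceKafkaProducerRecordErrors
          # Any failed record sends in the last 5 minutes.
          expr: increase(kafka_producer_record_error_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "The ECE Kafka producer is reporting failed record sends"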

Session Metrics

The Session Metrics group contains metrics on ECE server sessions. Table 27-4 lists the metrics in this group.

Table 27-4 Session Metrics

Metric Name Type Description

ece_session_metrics

Counter

Contains the total number of sessions opened or closed by rating group, node, or cluster.

Rated Events Metrics

The Rated Events Metrics group contains metrics on rated events processed by ECE server sessions. Table 27-5 lists the metrics in this group.

Table 27-5 Rated Events Metrics

Metric Name Type Description

ece_rated_events_formatted

Counter

Contains the number of rated events that each RatedEventFormatter worker thread successfully or unsuccessfully formatted during each formatting job run against the NoSQL database or the Oracle database.

ece_rated_events_cached

Counter

Contains the total number of rated events cached by each ECE node.

ece_rated_events_inserted

Counter

Contains the total number of rated events that were successfully inserted into the cache.

ece_rated_events_insert_failed

Counter

Contains the total number of rated events that failed to be inserted into the cache.

ece_rated_events_purged

Counter

Contains the total number of rated events that were purged.

ece_requests_by_result_code

Counter

Tracks the total number of requests processed, broken down by result code.

CDR Formatter Metrics

The CDR Formatter Metrics group contains the metrics for tracking Charging Function (CHF) records. Table 27-6 lists the metrics in this group.

Table 27-6 CDR Formatter Metrics

Metric Name Metric Type Description

ece_chf_records_processed

Counter

Tracks the total number of CHF records the CDR formatter has processed.

ece_chf_records_purged

Counter

Tracks the total number of CHF records the CDR formatter purged.

ece_chf_records_loaded

Counter

Tracks the total number of CHF records the CDR formatter has loaded.

Coherence Metrics

All Coherence metrics that are available through the Coherence metrics endpoint are also accessible through the ECE metrics endpoint. For more information about the Coherence metrics, see "Oracle Coherence MBeans Reference" in Oracle Fusion Middleware Managing Oracle Coherence.

For information about querying for Coherence metrics, see "Querying for Coherence Metrics" in Oracle Fusion Middleware Managing Oracle Coherence.
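If you want to see which Coherence metrics the ECE endpoint exposes, you can filter the scraped output by the coherence prefix used by metrics such as coherence_os_system_cpu_load. This sketch assumes the endpoint is reachable on localhost, for example through the port-forward shown earlier in this document.

# List only the Coherence metric names exposed by the ECE metrics endpoint.
curl -s http://localhost:19612/metrics | grep -o '^coherence_[a-zA-Z0-9_]*' | sort -u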