61 Monitoring ECE Using ECE Monitoring Agent

Learn how to monitor and manage Oracle Communications Billing and Revenue Management Elastic Charging Engine (ECE) using the ECE Monitoring Agent.


About Monitoring ECE Using Monitoring Agent

You can use the ECE Monitoring Agent to monitor:

  • ECE cluster. Monitors the state of each ECE node configured with JMX port in the ECE topology and generates alerts in case of node failures, shutdowns, or network failures. For ECE charging server nodes, the Monitoring Agent generates alerts when there are unbalanced partitions across ECE server nodes and when the number of running storage-enabled nodes falls below the threshold configured.

  • ECE caches. Monitors each ECE cache by comparing its estimated size with the configured threshold and generates alerts when a threshold breach occurs. You can define the threshold for generating alerts; for example, you can configure the threshold percentage to generate an alert when the cache size is approaching the threshold or when the threshold breach occurs.

    You can configure monitoring for the following caches:

    • Subscriber-based caches, such as Customer and Balance caches, based on the estimated size and the number of nodes available.

    • Session-based caches, such as ActiveSession and RatedEvent caches, based on the number of nodes configured and the size of each node.

  • WebLogic Server and Oracle NoSQL database connections. Monitors the connections to WebLogic Server and the Oracle NoSQL database storage nodes and reports connection failures encountered while publishing notifications or storing rated events into the database.

  • ECE server response time. Monitors the ECE server response time for the different types of ECE requests received during the monitoring cycle; for example, IUT requests. You can define the response time and configure the Monitoring Agent to generate alerts in case of threshold breaches. The information about the requests is included in the DiameterGatewayStatistics and EMGatewayStatistics sections in the summary report logs.

  • Rated event throughput. Monitors the number of rated events stored or retrieved from the cache or from the Oracle NoSQL database and generates alerts on threshold breaches.

  • Charging sessions. Monitors the status of all charging sessions and logs the total number of sessions that are opened or closed after the last monitoring check and the sessions that are currently active at the time of the current monitoring check. This summary includes the overall status of the charging sessions.

  • Network sessions. Monitors the status of all network sessions and logs the total number of sessions that are opened or closed after the last monitoring check and the sessions that are currently active at the time of the current monitoring check. This summary includes the overall status of the network sessions.

  • Diameter Peers. Monitors all the Diameter peers connected to the Diameter Gateway instances at regular intervals and logs the details of the peers (including the transport protocol). You can also view the details of the Diameter peers that are connected to specific Diameter Gateway instances by using ECE MBeans. See "Viewing Active Diameter Peers" in ECE Implementing Charging.

  • Diameter Responses. Monitors all the Gy and Sy responses and their result codes and logs the total number of success and failure responses. This summary includes the overall status of the responses and the details of the failed responses for each result code and product type. The product type is a combination of Service-Context-Id, Service-Identifier, and Rating-Group and it is shown in the summary only for the Initiate, Update, Terminate (IUT) requests. For other requests, such as TopUp, Debit, and Refund, the product type is not applicable and it is shown as Unknown.

In addition to generating periodic reports and alerts on threshold breaches, the Monitoring Agent reports any unusual scenarios that prevent ECE from continuing its processes. By default, the Monitoring Agent generates the following types of alerts: Warning, Critical, and Fatal. You can also define the conditions for generating the alerts; for example, you can configure the Monitoring Agent to generate at most five alerts in an hour.
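The alert limit described above (for example, at most five alerts in an hour) can be illustrated with a small sketch. This is not the Monitoring Agent's implementation; the AlertLimiter class and its fixed-window counting are assumptions made for illustration. Its two parameters play the role of a per-severity alert count (such as warningAlertCount) and of alertCountResetInterval from the monitoring configuration.

```python
from dataclasses import dataclass


@dataclass
class AlertLimiter:
    """Hypothetical helper: caps the alerts of one severity per counting window."""

    max_alerts: int            # role of fatalAlertCount, criticalAlertCount, or warningAlertCount
    reset_interval_s: float    # role of alertCountResetInterval (seconds)
    window_start_s: float = 0.0
    emitted: int = 0

    def allow(self, now_s: float) -> bool:
        """Return True if an alert raised at time now_s should be emitted."""
        # Assumption: the count resets once the reset interval has elapsed.
        if now_s - self.window_start_s >= self.reset_interval_s:
            self.window_start_s = now_s
            self.emitted = 0
        if self.emitted < self.max_alerts:
            self.emitted += 1
            return True
        return False  # suppressed until the next reset


# Five alerts per hour: the sixth alert in the same hour is suppressed.
limiter = AlertLimiter(max_alerts=5, reset_interval_s=3600)
decisions = [limiter.allow(t) for t in (0, 600, 1200, 1800, 2400, 3000, 3600)]
print(decisions)
```

With these inputs, the first five alerts are emitted, the alert at 3000 seconds is suppressed, and the alert at 3600 seconds is emitted again because a new counting window has started.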

A sample configuration file named monitor-configuration.xml is included in the ECE_home/config directory. You can use this file to configure the settings for monitoring ECE. See "Configuring Monitoring of ECE" for more information.

The Monitoring Agent runs periodically based on the configured intervals. It also runs when certain event-based scenarios occur; for example, node failures.

Types of Log Files

The Monitoring Agent generates the following types of log files: monitorAgentX.log, monitorAgentX_ECE_SUMMARY_REPORT.log, and monitorAgentX_ECE_ALERT_REPORT.log.

These log files are stored in the ECE_home/logs directory by default. Review these log files regularly to monitor your system and detect and diagnose system problems.

monitorAgentX.log

This log file contains general information about various activities of the Monitoring Agent. This log provides information about the issues encountered by the Monitoring Agent.

monitorAgentX_ECE_SUMMARY_REPORT.log

This log file contains the summary report generated at regular monitoring intervals in JSON format. The last section in the log file describes the details and the format of the information provided in the summary logs. By default, the Monitoring Agent creates 24 summary log files in a day (one summary log file for each hour of the day). After 24 hours, it starts deleting the summary log files one by one every hour as it adds the summary log files for the current day, so a maximum of 24 summary log files is available at any time.

monitorAgentX_ECE_ALERT_REPORT.log

This log file contains the alerts logged at regular monitoring intervals in JSON format. The Alert Format section in this file specifies the formats and the types of alerts logged. The Monitoring Agent publishes the alerts as notifications to a JMS queue. By default, the Monitoring Agent creates 24 alert log files in a day (one alert log file for each hour of the day). After 24 hours, it starts deleting the alert log files one by one every hour as it adds the alert log files for the current day, so a maximum of 24 alert log files is available at any time.

You can also view these notifications and subscribe to receive the notifications by accessing the ECE Monitoring.Notifier node. See "Subscribing Notifications" for more information.

Configuring Monitoring of ECE

To configure monitoring of ECE:

  1. Open the ECE_home/config/monitor-configuration.xml file.

  2. Specify or update the value for the entries listed in Table 61-1 as appropriate.

    Table 61-1 Entries in the Monitoring Configuration XML File

    Entry Description

    nodeHealthCheckInterval

    The interval (in seconds) at which the Monitoring Agent runs.

    Note: The maximum and the default interval is 30 seconds.

    alertCountResetInterval

    The interval (in seconds) for resetting the alert count.

    fatalAlertCount

    The number of fatal alerts generated after which the alerts are suppressed for the time defined by alertCountResetInterval.

    criticalAlertCount

    The number of critical alerts generated after which the alerts are suppressed for the time defined by alertCountResetInterval.

    warningAlertCount

    The number of warnings generated after which they are suppressed for the time defined by alertCountResetInterval.

    iutThresholdLatency

    The latency (in milliseconds) for the Initiate, Update, and Terminate requests from Diameter Gateway to the ECE server.

    Note: Ensure that you enter the appropriate latency for generating this report; for example, a 99.99% or 100% latency.

    When iutThresholdLatency is set, the Monitoring Agent tracks the percentage of requests that are at or below the iutThresholdLatency value and generates alerts based on alertType and alertValue.

    For example, if iutThresholdLatency is set to 10 and out of 100 calls made, 95 calls take less than or equal to 10 milliseconds, the latency percentile is considered as 95. If alertType is set to Warning and alertValue for Warning is set to 95, the Monitoring Agent checks if the latency percentile is less than alertValue, which is 95. In this case, the latency percentile is not less than alertValue for Warning, so the Monitoring Agent does not generate a Warning alert. If the latency percentile is 94, the Monitoring Agent generates a Warning alert.

    thresholdLatency

    The latency (in milliseconds) for all the traced requests from EM Gateway to the ECE server.

    When thresholdLatency is set, the Monitoring Agent tracks the percentage of requests that are at or below the thresholdLatency value and generates alerts based on alertType and alertValue.

    For example, if thresholdLatency is set to 10 and out of 100 requests received, 89 requests took less than or equal to 10 milliseconds, the latency percentile is considered as 89. If alertType is set to Critical and alertValue for Critical is set to 90, the Monitoring Agent checks if the latency percentile is less than alertValue, which is 90. In this case, the latency percentile is less than alertValue for Critical, so the Monitoring Agent generates a Critical alert. If the latency percentile is equal to or more than alertValue for Critical, the Monitoring Agent does not generate a Critical alert. However, if the latency percentile is less than alertValue for Warning, the Monitoring Agent generates a Warning alert.

    alertType

    The type of alert generated. The following are the valid alert types: Warning, Critical, and Fatal.

    alertValue

    The value at which the alert is generated. You specify the value for this entry as follows:

    • In the diameterGatewayLatency and emGatewayLatency sections, you specify the percentages for tracking requests from Diameter Gateway and EM Gateway. For example, you can configure 95% latency for Warnings, 90% latency for Critical alerts, and 80% latency for Fatal alerts.

    • In the ratedEventThroughput section, you specify the ratio for tracking rated events throughput. This is the ratio of the number of rated events stored or retrieved from the cache or from the Oracle NoSQL database. For example, you can configure 0.95 for Warnings, 0.90 for Critical alerts, and 0.80 for Fatal alerts.

    • In the partitionsUnbalanced section, you specify the number of occurrences for tracking unbalanced partitions across ECE server nodes. For example, you can configure 3 occurrences for Warnings, 10 occurrences for Critical alerts, and 20 occurrences for Fatal alerts.

    • In the runningServers section, you specify the number of server nodes that are currently running for tracking the availability of server nodes. For example, you can configure 2 nodes for Warnings, 1 node for Critical alerts, and 0 nodes for Fatal alerts.

    • In the noSQLCommitFailure and webLogicPublishFailure sections, you specify the number of connection failures at which the alerts are generated. For example, you can configure 1 failure for Warnings, 3 failures for Critical alerts, and 5 failures for Fatal alerts.

    • In the subscriberCacheUtilization section, you specify the percentage for tracking the total subscriber cache size (in bytes) across all the ECE server nodes; for example, reaching 60% of cache size for Warnings, reaching 80% of cache size for Critical alerts, and reaching 100% of cache size for Fatal alerts.

    • In the sessionCacheUtilization section, you specify the percentage for tracking the total session cache size (in bytes) for an ECE server node; for example, reaching 60% of cache size for Warnings, reaching 80% of cache size for Critical alerts, and reaching 100% of cache size for Fatal alerts.

    subscriberCacheCapacity name

    The name of the subscriber-related caches, such as Customer and Balance caches.

    projectedClusterCapacity

    The projected capacity of all ECE server nodes to hold a configured subscriber-based cache, such as Customer and Balance caches.

    sessionCacheCapacity name

    The name of the session-related caches, such as ActiveSession, RatedEvent, ServiceContext, and RecurringBundleIdHistory.

    projectedNodeCapacity

    The projected capacity of an ECE server node to hold a configured session-based cache, such as the ActiveSession cache.

  3. Save and close the file.

  4. On the machine where you have Elastic Charging Controller (ECC) installed, go to ECE_home/bin.

  5. Start ECC:

    ./ecc

  6. Run the following command, which deploys the ECE installation onto the server machines:

    sync

    The sync command copies the relevant files of the ECE installation onto the server machines in the ECE cluster.

  7. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  8. Expand the ECE Monitoring node.

  9. Expand a MonitorConfiguration node; for example, MonitorConfiguration.diameterGatewayLatency.

  10. Expand Attributes.

  11. Verify that the values that you specified in step 2 appear.

    Note:

    The attributes displayed here are read-only. You can update these attributes by editing the ECE_home/config/monitor-configuration.xml file.
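The latency examples given for iutThresholdLatency and thresholdLatency in Table 61-1 can be reproduced with a short sketch. It is illustrative only: the function names are hypothetical, and checking the alert types from most to least severe is an assumption, because the text states only that an alert is generated when the latency percentile is less than the alertValue for that alert type.

```python
from typing import Dict, Optional, Sequence

# Assumption of this sketch: the most severe matching alert type wins.
SEVERITY_ORDER = ("Fatal", "Critical", "Warning")


def latency_percentile(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of requests that took less than or equal to threshold_ms."""
    if not latencies_ms:
        return 100.0  # assumption: no requests means no breach
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return 100.0 * within / len(latencies_ms)


def pick_alert(percentile: float, alert_values: Dict[str, float]) -> Optional[str]:
    """Return the alert type to raise, or None if no alertValue is breached."""
    for alert_type in SEVERITY_ORDER:
        limit = alert_values.get(alert_type)
        if limit is not None and percentile < limit:
            return alert_type
    return None


# alertValue settings taken from the example percentages in the table.
alert_values = {"Warning": 95, "Critical": 90, "Fatal": 80}

# 95 of 100 calls within 10 ms: percentile 95 is not less than 95, so no alert.
calls = [5] * 95 + [20] * 5
print(pick_alert(latency_percentile(calls, 10), alert_values))

# 89 of 100 requests within 10 ms: percentile 89 is less than 90, so Critical.
requests = [5] * 89 + [20] * 11
print(pick_alert(latency_percentile(requests, 10), alert_values))
```

The same comparison pattern does not apply to every section: for entries such as noSQLCommitFailure or subscriberCacheUtilization, the examples in the table describe alerts raised when a value reaches the configured number or percentage rather than when it falls below it.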

Subscribing Notifications

To subscribe to monitoring notifications:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Monitoring node.

  3. Expand the Notifier node.

  4. Expand Attributes.

  5. Click Notifications.

    The monitoring notifications published to the JMX notification queue appear.

  6. Click Subscribe.

Reading Log Files

By default, ECE stores log files in the ECE_home/logs directory, but you can configure a new log location. See "Configuring Log Location".

Log file names use the format node_name.log. ECE error messages use this syntax:

date_and_time log_level - 
error_code - request_ID - customer_ID - message_1 message_2

For example:

2012-09-11 22:27:09.565 PDT DEBUG - 
324984520132531235 - 49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 -
 Successfully retrieved tariffPolicy from customer; candidate balance ids ::
 [Bal#6500000587]
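When scanning logs with a script, the message syntax above can be split into its fields. The following sketch assumes that fields are separated by " - ", that a wrapped entry can be rejoined by collapsing whitespace, and that the timestamp may be followed by a time zone abbreviation as in the example. The function name and the regular expression are illustrative and are not part of ECE.

```python
import re

# Pattern derived from: date_and_time log_level - error_code - request_ID -
# customer_ID - message. The optional time zone is inferred from the example.
LOG_ENTRY = re.compile(
    r"^(?P<date_and_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"(?: (?P<time_zone>[A-Z]{2,5}))?"
    r" (?P<log_level>[A-Z]+)"
    r" - (?P<error_code>\S+)"
    r" - (?P<request_id>\S+)"
    r" - (?P<customer_id>\S+)"
    r" - (?P<message>.*)$"
)


def parse_log_entry(raw: str):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    flattened = " ".join(raw.split())  # rejoin an entry wrapped over several lines
    match = LOG_ENTRY.match(flattened)
    return match.groupdict() if match else None


sample = "\n".join([
    "2012-09-11 22:27:09.565 PDT DEBUG - ",
    "324984520132531235 - 49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 -",
    " Successfully retrieved tariffPolicy from customer; candidate balance ids ::",
    " [Bal#6500000587]",
])

entry = parse_log_entry(sample)
print(entry["log_level"], entry["customer_id"])
```

Filtering on the request_id or customer_id field is one way to follow a single request or customer across several node log files.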

Configuring Log Location

To configure a log location:

  1. On the driver machine, open the ECE_home/config/ece.properties file.

  2. Search for and uncomment the following system property:

    #logDir = 

  3. Specify the path to the new log location:

    logDir = ECE_log_path

    where ECE_log_path is the absolute path to the directory that you want to use as your log location.

    For example:

    logDir = ECE_home/ece/logs

  4. Save and close the file.
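If you maintain ece.properties with a script, steps 2 and 3 amount to replacing the commented-out logDir line with an active one. The sketch below works on the file content as a string; the set_log_dir function and the /var/ece/logs path are hypothetical and used only for illustration.

```python
import re

# Matches a logDir entry whether or not it is commented out with '#'.
LOG_DIR_LINE = re.compile(r"^[ \t]*#?[ \t]*logDir[ \t]*=.*$", re.MULTILINE)


def set_log_dir(properties_text: str, log_path: str) -> str:
    """Return properties_text with the logDir entry uncommented and set to log_path."""
    if not LOG_DIR_LINE.search(properties_text):
        raise ValueError("no logDir entry found in the properties text")
    new_line = f"logDir = {log_path}"
    # A function replacement keeps backslashes in log_path from being interpreted.
    return LOG_DIR_LINE.sub(lambda _match: new_line, properties_text, count=1)


before = "someProperty = 1\n#logDir = \notherProperty = 2\n"
after = set_log_dir(before, "/var/ece/logs")  # hypothetical absolute path
print(after)
```

Only the first logDir entry is rewritten, and all other lines are left unchanged.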

Setting Log Levels

To set log levels:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand ECE Logging.

  4. Expand Operations.

  5. Set the log level of the appropriate ECE modules. You can set log levels for a module by name or for a set of modules by function (for example, modules used for rerating).

    • To set the log level by ECE module name:

      1. Click the getLoggerLevels operation button.

      2. From the list, identify the ECE module relevant to your debugging scenario.

      3. In the input argument of the setLogLevel operation, for each parameter (for example, p1), replace String with the name of the ECE module.

        For example, if you are debugging a problem with using the simulator testing tool, you might set a log level of DEBUG for these ECE modules:

        oracle.communication.brm.charging.appconfiguration
        oracle.communication.brm.charging.brs
        oracle.communication.brm.charging.tools.simulator

    • To set the log level by ECE functional domain:

      1. Click the getFunctionalDomains operation button.

      2. From the list, identify the ECE functional domain relevant to your debugging scenario.

      3. In the input argument of the setLogLevelForFunctionalDomain operation for the first parameter (p1), replace String with the name of the ECE functional domain by pasting the name you copied in the previous step.

        Enter the name of the functional domain exactly as it appears in the list of the getFunctionalDomains operation.

      4. In the input argument of the setLogLevelForFunctionalDomain operation for the second parameter (p2), replace String with the log level you want to set for the ECE modules associated with the ECE functional domain.

Configuring the Charging-Server Health Threshold

You can configure a charging-server health threshold so that you are alerted when charging server node failures threaten the ability of your system to handle usage requests. A charging-server health threshold is the minimum number of charging server nodes needed for your customer base. If the number of charging server nodes running on your system goes below the threshold, ECE stops processing usage requests and issues a SystemHealthException. ECE continues to process update, management, query, top-up, debit, and refund requests.

When setting a charging-server health threshold, note the following points:

  • If the threshold is N, you need to run at least N+1 nodes to have uninterrupted usage processing during a rolling upgrade.

  • Configure a minimum of two charging server nodes per machine.
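The threshold behavior and the N+1 guideline can be checked with a small model. This is a sketch of the rules as stated in this section, not ECE code: the request kind labels are hypothetical, SystemHealthError stands in for the SystemHealthException that ECE issues, and the rolling upgrade is assumed to take one node offline at a time.

```python
# Request kinds that this section says are still processed below the threshold.
STILL_PROCESSED = {"update", "management", "query", "top-up", "debit", "refund"}


class SystemHealthError(Exception):
    """Stand-in for the SystemHealthException mentioned in the text."""


def below_threshold(running_nodes: int, threshold: int) -> bool:
    """True when fewer charging server nodes are running than the threshold."""
    return running_nodes < threshold


def check_request(kind: str, running_nodes: int, threshold: int) -> bool:
    """Return True if the request is processed; raise for rejected usage requests."""
    if below_threshold(running_nodes, threshold) and kind not in STILL_PROCESSED:
        raise SystemHealthError(
            f"{running_nodes} nodes running, threshold is {threshold}"
        )
    return True


def survives_rolling_upgrade(total_nodes: int, threshold: int) -> bool:
    """Assumption: a rolling upgrade takes exactly one node down at a time."""
    return not below_threshold(total_nodes - 1, threshold)


# With a threshold of 3, four nodes (N+1) keep usage processing uninterrupted.
print(survives_rolling_upgrade(total_nodes=4, threshold=3))
print(survives_rolling_upgrade(total_nodes=3, threshold=3))
```

With the default threshold of 0, below_threshold never returns True, so usage requests are never rejected on this basis.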

To configure a charging-server health threshold:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand chargingServer.

  4. Expand Attributes.

  5. Set the degradedModeThreshold attribute to the minimum number of charging server nodes needed for your customer base (the number that can handle the normal expected throughput for your system). The default is 0.

  6. Save your changes.

Checking If Nodes are Started or Stopped

You can use Elastic Charging Controller (ECC) to find out which nodes are started and stopped.

  1. On the driver machine, go to the /oceceserver/bin directory.

  2. Start ECC:

    ./ecc

  3. Run the status command.

    status

Checking the Health Status of Charging Server Nodes

To check the ongoing health of ECE charging-server nodes:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand ChargingClient.

  4. Expand BatchRequestService.

  5. Expand Attributes.

  6. Check the value of the SystemHealth attribute:

    • HEALTHY: Charging server nodes are functioning.

    • DEGRADED: Charging server nodes are not available.

Configuring KeepAlive for EM Gateway

BRM Connection Manager (CM) and External Manager (EM Gateway) use a pool of connections to send and receive requests. When no requests are exchanged between CM and EM Gateway for a defined period of time, the firewall closes the idle connections. To prevent this, the KeepAlive option is enabled on the listening sockets, which allows EM Gateway to use the operating system's KeepAlive settings.

Note:

Before starting EM Gateway, ensure that the KeepAlive interval (tcp_keepalive_interval) configured for the operating system does not exceed the idle connection timeout configured in the firewall.

By default, the KeepAlive interval for the operating system is set to 7200 seconds. You must reduce this interval so that it does not exceed the idle connection timeout. See your operating system documentation for information on reducing the KeepAlive interval.

By default, EM Gateway uses the operating system's KeepAlive settings. To prevent EM Gateway from using them, set the socketKeepAlive entry to 0 in the emGatewayConfigurations.Instance_Name section of the ECE_home/config/management/charging-settings.xml file, where Instance_Name is the name of the EM Gateway instance; for example, emGateway1.
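For readers unfamiliar with socket-level KeepAlive, the following generic sketch shows what enabling the option on a TCP socket looks like and how the interval requirement from the note can be checked. It uses the standard Python socket module and does not show how EM Gateway itself is implemented; the function names and the 3600-second firewall timeout are assumptions chosen for illustration.

```python
import socket


def keepalive_interval_ok(keepalive_interval_s: int, firewall_idle_timeout_s: int) -> bool:
    """The OS KeepAlive interval must not exceed the firewall idle timeout."""
    return keepalive_interval_s <= firewall_idle_timeout_s


def new_tcp_socket(keep_alive: bool = True) -> socket.socket:
    """Create a TCP socket with SO_KEEPALIVE switched on or off.

    With the option on, the operating system's KeepAlive settings decide when
    probes are sent on idle connections. A server would then bind and listen.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1 if keep_alive else 0)
    return sock


# The default of 7200 seconds exceeds a hypothetical 3600-second firewall timeout.
print(keepalive_interval_ok(7200, 3600))
print(keepalive_interval_ok(600, 3600))

sock = new_tcp_socket(keep_alive=True)
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(enabled)
```

The first check fails because the default interval is longer than the assumed firewall timeout, which is the situation the note asks you to correct before starting EM Gateway.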