8 Troubleshooting ECE

This chapter describes ways to troubleshoot Oracle Communications Billing and Revenue Management (BRM) Elastic Charging Engine (ECE).

Overview of Troubleshooting ECE

The following are the primary components used for diagnosing ECE system problems:

  • JConsole

    When ECE is running, you can access JMX information provided by the JVM by using JConsole. You can edit ECE MBean attributes using JConsole. To do so, you must enable one charging server node for JMX management for each unique IP address in your physical topology. See "Enabling a Charging Server Node for JMX Management" for more information.

    For information about JConsole, see the discussion of using JConsole in the JDK product documentation.

  • Oracle Enterprise Manager Cloud Control

    If you have Oracle Application Management Pack for Oracle Communications, you can use Oracle Enterprise Manager Cloud Control to obtain information about problem conditions in ECE.

    See Oracle Application Management Pack for Oracle Communications System Administrator's Guide for more information.

  • Log files

    Each ECE node has a log file for logging errors during processing.

    See "Using Error Logs to Troubleshoot ECE" for more information.

    See "Configuring Logging" for information about setting log levels.

  • Elastic Charging Controller (ECC)

    ECC has a status command that returns information about nodes in the topology.

    See the ECC help command for information about the status command.

Troubleshooting JVM and Coherence

ECE is based on Java virtual machine (JVM) technologies and Coherence. When troubleshooting ECE, you may need to troubleshoot JVM and Coherence technologies as well. See "Troubleshooting JVM and Coherence" for more information.

For troubleshooting JVM issues, note the following:

For troubleshooting Coherence issues, note the following:

Troubleshooting Checklists

For information about troubleshooting checklists, refer to the following topics:

ECE Troubleshooting Checklist When Offline

The following is a troubleshooting checklist for when ECE is offline (that is, not connected to external applications in the charging system):

  • Verify the installation layout and the VERSION file.

  • Verify ECE configuration files (under the ECE_Home/oceceserver/config directory). See "Configuring the ECE System" for information about configuration options.

    In particular, check the topology, the tuning profiles (if any) and the ece.properties file.

  • Verify Coherence configuration files (under the ECE_Home/oceceserver/config directory).

  • Verify bi-directional password-less SSH between the driver machine and the server machines.

    Servers should be accessible from the driver machine by way of SSH without entering a password, and the driver machine should be accessible from each server machine by way of SSH without entering a password.

  • Verify the security configuration.

    Some examples of security errors are as follows:

    • If the server fails with the error password file read access must be restricted, the cause may be that the permissions on the jmxremote.password file are too permissive (for example, the file has write access or is readable by users other than the owner). The solution is to restrict access to this file (chmod 400 ECE_Home/oceceserver/config/jmxremote.password) or to disable the JMX security setting if it is not needed.

    • If the server fails with the error cannot find security credentials, the cause may be that the Coherence security settings are incorrect. The solution is to correct the Coherence security settings properties in these files:

      ECE_Home/oceceserver/config/defaultTuningProfile.properties

      ECE_Home/oceceserver/config/ece.properties

  • Verify ECE deployment in remote machines. Issues can arise from out-of-sync installations on various remote machines. Running the ECC sync command can ensure that all remote machines are in sync.
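
The permission check for the jmxremote.password file described above can be sketched in Python as follows. This is a minimal illustration, not part of ECE: the helper names and the ECE_HOME environment variable (standing in for ECE_Home) are assumptions made for the example.

```python
import os
import stat


def password_file_mode(path):
    """Return the permission bits of the file as an integer (for example, 0o400)."""
    return stat.S_IMODE(os.stat(path).st_mode)


def is_restricted(path):
    """Return True when the mode is exactly 0o400, which is what chmod 400 produces."""
    return password_file_mode(path) == 0o400


if __name__ == "__main__":
    # ECE_HOME is an assumed environment variable standing in for ECE_Home.
    ece_home = os.environ.get("ECE_HOME", ".")
    target = os.path.join(ece_home, "oceceserver", "config", "jmxremote.password")
    if not os.path.exists(target):
        print(f"{target}: not found")
    elif is_restricted(target):
        print(f"{target}: mode is 400")
    else:
        print(f"{target}: mode is {password_file_mode(target):o}; consider chmod 400")
```

Running the script on each machine in the topology gives a quick indication of whether the file still needs the chmod 400 fix.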

ECE Troubleshooting Checklist When Online

The following is a troubleshooting checklist for when ECE is online (that is, connected to external applications in the charging system):

  • Verify through JMX that nodes are running (for example, by using JConsole).

  • Verify the state machine state of nodes.

    Before it can process requests, ECE goes through several states, such as startup, configuration, update, and processing. The State Manager controls the state transitions. A node that is in a state other than USAGE_PROCESSING will not process usage requests.

  • Verify the charging-server health threshold.

    The number of running nodes may have gone below the threshold. See "Configuring Charging-Server Health Thresholds" for information.

  • Verify ECE management settings.

    All management configuration files are located under ECE_Home/oceceserver/config/management. See "About ECE Configuration" for more information.

ECE General Troubleshooting Checklist

When any problems occur, it is best to do some troubleshooting before you contact Oracle technical support. You know your installation better than Oracle technical support does. You know if anything in the system has been changed, so you are more likely to know where to look first. If you have a problem with your ECE system, ask yourself these questions first, because Oracle technical support will ask them of you:

  • What exactly is the problem? Can you isolate it? For example, if it is a problem with an application, does it occur on one instance of the application, or all instances?

    Oracle technical support needs a clear and concise description of the problem, including when it began to occur.

  • What do the log files say?

    Oracle technical support asks for this information first. Check the error log for the ECE component you are having problems with.

  • Read through the ECE troubleshooting checklist. Look through the list of common problems and their solutions.

    See "Troubleshooting Checklists".

  • Has anything changed in the system? Did you install any new hardware or new software? Did the network change in any way?

  • Have you read the Release Notes?

    The Release Notes include information about known problems and workarounds.

  • Does the problem resemble another one you had previously?

  • Has your system usage recently jumped significantly?

  • Is the system otherwise operating normally?

  • Has response time or the level of system resources changed?

  • Are users complaining about additional or different problems?

  • Can you run clients successfully?

  • Are any other processes on the system hardware functioning normally?

If you still cannot resolve the problem, contact Oracle technical support as described in "Getting Help for ECE Problems".

Using Error Logs to Troubleshoot ECE

ECE error log files provide detailed information about system problems. If you have a problem with an ECE process or node, look in its corresponding log file in the ECE_Home/oceceserver/logs directory. This directory is where log files are generated for each node on each machine of your topology.

Log files include errors that need to be managed, as well as errors that do not need immediate attention (for example, invalid login attempts). To manage log files, make a list of the important errors for your system so that you can distinguish them from errors that do not require immediate attention.

See "Configuring Logging" for information about setting log levels.
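
One way to separate important errors from those that can wait is sketched below in Python. It is only an illustration: the function name, the sample log lines, and the list of deferrable phrases are hypothetical, and the check for " ERROR - " assumes the default message layout described in "Understanding Error-Message Syntax".

```python
def triage_errors(lines, deferrable):
    """Split ERROR records into (urgent, deferred) lists.

    deferrable is a list of substrings that mark errors you have decided
    do not need immediate attention on your system.
    """
    urgent, deferred = [], []
    for line in lines:
        if " ERROR - " not in line:
            continue  # not an ERROR-level record
        if any(text in line for text in deferrable):
            deferred.append(line)
        else:
            urgent.append(line)
    return urgent, deferred


# Hypothetical log lines, for illustration only.
SAMPLE_LINES = [
    "2012-09-11 22:27:09.565 PDT ERROR - c1 - r1 - Cust#1 - Invalid login attempt",
    "2012-09-11 22:27:10.001 PDT ERROR - c2 - r2 - Cust#2 - Cache lookup failed",
    "2012-09-11 22:27:10.002 PDT DEBUG - c3 - r3 - Cust#3 - Request processed",
]

urgent, deferred = triage_errors(SAMPLE_LINES, deferrable=["Invalid login attempt"])
print(len(urgent), "urgent;", len(deferred), "deferred")
```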

Understanding Error-Message Syntax

ECE error messages use this syntax:

log4j.appender.ECE_LOG.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS zzz} %5p - %X{correlationId} - %X{requestId} - %X{customerId} - %m%n

For example:

2012-09-11 22:27:09.565 PDT DEBUG - 
324984520132531235 - 49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 -
 Successfully retrieved tariffPolicy from customer; candidate balance ids ::
 [Bal#6500000587]
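
When you need to filter or correlate messages, a log record in this layout can be split into its fields with a regular expression. The following Python sketch assumes that each record occupies a single line in the log file (the example above is wrapped for display) and that the default conversion pattern is in use. The LogRecord type and parse_line function are names invented for this illustration.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    timezone: str
    level: str
    correlation_id: str
    request_id: str
    customer_id: str
    message: str


# Mirrors: %d{yyyy-MM-dd HH:mm:ss.SSS zzz} %5p - %X{correlationId} - %X{requestId} - %X{customerId} - %m
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<timezone>\S+)\s+(?P<level>[A-Z]+) - "
    r"(?P<correlation_id>.*?) - (?P<request_id>.*?) - (?P<customer_id>.*?) - "
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[LogRecord]:
    """Return the parsed record, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    return LogRecord(**match.groupdict()) if match else None


# The example message from this section, joined onto one line.
SAMPLE = (
    "2012-09-11 22:27:09.565 PDT DEBUG - 324984520132531235 - "
    "49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 - "
    "Successfully retrieved tariffPolicy from customer; candidate balance ids :: "
    "[Bal#6500000587]"
)

record = parse_line(SAMPLE)
print(record.level, record.customer_id)
```

Lines that do not match the layout, such as stack-trace continuation lines, yield None and can be attached to the preceding record or skipped.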

Collecting Diagnostic Information

When collecting diagnostic information, turn on ECC feedback mode, which produces extra information when you run commands. For example:

ecc:000>set feedback true

The feedback mode setting is saved in your local profile so you do not need to set it every time you start ECC.

See "Configuring Logging" for information about how to control ECE logging and log levels.

Collecting Log Files for Sending to Oracle Technical Support

You can use the infoCollector ECC command to collect log files from your ECE system. The command copies files that are relevant to diagnosing problems from all server machines and the driver machine and puts them in a well-known location. The command also creates a compressed TAR file of all collected files, which you can then send to Oracle technical support.

The infoCollector command does not collect files from systems on which the Elastic Charging Client is running (for example, it does not collect files from the network mediation system).

To collect log files from your ECE system:

  1. (Optional) Create a directory in which to put the collected log files.

    If you do not specify a directory, the command puts the collected log files in the user home directory (the directory to which your platform $HOME environment variable points).

  2. Log on to the driver machine.

  3. Change directory to the ECE_Home/oceceserver/bin directory.

  4. Run the following command:

    ./ecc
    
  5. From the command prompt, run the infoCollector command.

    See "infoCollector Syntax" for information about syntax and arguments for running the command.

    If you run the infoCollector command with no arguments, the following occurs:

    ecc:000>infoCollector
    
    • From each server machine in your topology, the following file and directories are copied to the following location on the driver machine (where user_home is the user home directory of the driver machine and server_host is the IP address or host name of the server machine as it is defined in the ECE_Home/oceceserver/config/eceTopology.conf file):

      • user_home/server_host/oceceserver/VERSION

      • user_home/server_host/oceceserver/logs

      • user_home/server_host/oceceserver/config

      • user_home/server_host/oceceserver/brm_config

      • user_home/server_host/oceceserver/odi_transformation

    • From the driver machine, the following file and directories are copied to the following location on the driver machine (where user_home is the user home directory of the driver machine and driver_host is the IP address or host name of the driver machine as it is defined in the ECE_Home/oceceserver/config/eceTopology.conf file):

      • user_home/driver_host/oceceserver/VERSION

      • user_home/driver_host/oceceserver/config

      • user_home/driver_host/oceceserver/brm_config

    • A compressed TAR file, of all copied files, is created with the extension tar.gz in the user home directory of the driver machine (for example, user_home/info_collector.tar.gz).

    To display the syntax and arguments for running the infoCollector command, run the following command:

    ecc:000>help infoCollector
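
To check that a no-argument run copied everything listed in step 5, you can generate the expected paths for each host and compare them with what is on disk. The following Python sketch only mirrors the lists in this procedure. The function name and the sample home directory and host value are hypothetical, and the actual layout on your release may differ.

```python
import posixpath

# File and directories listed in step 5 for server machines and for the driver machine.
SERVER_ITEMS = ("VERSION", "logs", "config", "brm_config", "odi_transformation")
DRIVER_ITEMS = ("VERSION", "config", "brm_config")


def expected_paths(user_home, host, is_driver=False):
    """Return the paths expected under user_home for one host."""
    items = DRIVER_ITEMS if is_driver else SERVER_ITEMS
    return [posixpath.join(user_home, host, "oceceserver", item) for item in items]


# Hypothetical values, for illustration only.
for path in expected_paths("/home/eceuser", "192.0.2.10"):
    print(path)
```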
    

infoCollector Syntax

infoCollector [-v] [-nc] [-l] [-d Dir] [-s] [-e "FileFilter", "FileFilter2", "..."]

where:

  • -v turns on verbose mode that outputs information pertaining to the collected files; this option provides status of what is being copied as the command collects files.

  • -nc does not compress the resulting directory into a compressed TAR file.

  • -l includes all log files in the collection; if -l is not used, the command collects only those log files that match the node-name with the .log suffix that you specify.

  • -d Dir uses the directory you specify to hold all collected data, where Dir is the path of the directory.

  • -s includes all files from the ECE_Home/oceceserver/sample_data directory.

  • -e "FileFilter" is the path name or path name pattern of custom directories or files to collect and include in the compressed TAR file.

    Separate multiple filters with a comma.

    For example, assume in your ECE_Home directory you have the files notes.txt, comments_1.txt, comments_2.txt, and comments_3.txt, and you also have a directory named observations that contains the files observation1.txt, observation2.txt, and observation3.txt.

    The path name pattern can be either an explicit file name such as notes.txt, an entire directory tree such as observations, or a wildcard * such as shown for the files that begin with comments in the following example:

    infoCollector -e "/home/example/ECE_11.2.0.2/notes.txt", 
    "/home/example/ECE_11.2.0.2/observations", "/home/example/ECE_11.2.0.2/comments*"
    

    The preceding infoCollector command would put all custom files and directories into a directory named extra_files and the resulting collection of files would have the following directory structure:

    infoCollector_2014-05-07T11:02:10
      localhost-DRIVER
        extra_files
          observations
            observation3.txt
            observation2.txt
            observation1.txt
          notes.txt
          comments_3.txt
          comments_2.txt
          comments_1.txt
        config
        brm_config
        VERSION
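
If you assemble infoCollector invocations from a script (for example, to paste them into the ECC prompt), the option syntax above can be expressed as a small helper. This Python sketch is an illustration only: the function and its parameter names are invented here, and it produces only the command text without running ECC.

```python
def build_info_collector_command(verbose=False, no_compress=False, all_logs=False,
                                 directory=None, sample_data=False, filters=()):
    """Build an infoCollector command line from the documented options."""
    parts = ["infoCollector"]
    if verbose:
        parts.append("-v")
    if no_compress:
        parts.append("-nc")
    if all_logs:
        parts.append("-l")
    if directory:
        parts.extend(["-d", directory])
    if sample_data:
        parts.append("-s")
    if filters:
        # Each filter is quoted, and multiple filters are separated with a comma.
        parts.extend(["-e", ", ".join(f'"{item}"' for item in filters)])
    return " ".join(parts)


command = build_info_collector_command(
    filters=[
        "/home/example/ECE_11.2.0.2/notes.txt",
        "/home/example/ECE_11.2.0.2/observations",
        "/home/example/ECE_11.2.0.2/comments*",
    ]
)
print(command)
```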
    

Displaying Data in the Coherence Caches When Troubleshooting

You can use the query.sh script for querying Coherence caches and displaying the data that you want.

See the discussion of the query tool in BRM Elastic Charging Engine Implementation Guide for more information.

Troubleshooting Failed Diameter-Message Processing in Diameter Gateway

If you suspect a problem with how Diameter Gateway nodes are processing Diameter messages, look in the ECE_Home/oceceserver/logs/Instance_Name.log files for errors, where Instance_Name is the name of the Diameter Gateway-node instance (the name you defined in the ECE topology file) that you need to troubleshoot. For example, look in diameterGateway1.log. The log file contains information about any errors during Diameter-message processing.

To set log levels for Diameter Gateway nodes, obtain the Diameter Gateway module names in the ECE_Home/oceceserver/config/log4j.properties file, and then set the log levels by module as described in "Setting Log Levels Using JConsole".

Diameter Gateway returns all Diameter result codes (Result-Codes) as part of the Credit Control Answer (CCA) message. When an error occurs, the error ID and name are returned in the result code. For example, if the Credit Control Request (CCR) was missing an Event-Timestamp AVP, the error would be:

DiameterTalk Answers =[
Diameter Message: CCA
Version: 1
Msg Length: 144
Cmd Flags: PXY
Cmd Code: 272
App-Id: 4
Hop-By-Hop-Id: 1497412149
End-To-End-Id: 734750287
  Session-Id (263,M,l=11) = 111
  Result-Code (268,M,l=12) = DIAMETER_MISSING_AVP (5005)
  Origin-Host (264,M,l=24) = dgw1.example.com
  Origin-Realm (296,M,l=19) = example.com
  Auth-Application-Id (258,M,l=12) = 4
  CC-Request-Type (416,M,l=12) = INITIAL_REQUEST (1)
  CC-Request-Number (415,M,l=12) = 0
  Failed-AVP (279,M,l=20) = 
    Event-Timestamp (55,M,l=12) = 3627391363 (Fri Dec 12 08:42:43 PST 2014)

When such an error occurs, a message that indicates the nature of the error, together with its stack trace, is written to the Instance_Name.log file of the Diameter Gateway node (for example, diameterGateway1.log).

For information about Diameter Gateway result codes, see "Diameter Gateway Result Codes".
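
When scanning many answers, the Result-Code can be pulled out of a message dump such as the one above with a regular expression. The following Python sketch assumes the dump format shown in the example. The function names are invented for this illustration, and the classification of 5xxx codes as permanent failures follows the Diameter base protocol (RFC 6733).

```python
import re

# Matches a line such as: Result-Code (268,M,l=12) = DIAMETER_MISSING_AVP (5005)
RESULT_CODE_RE = re.compile(
    r"Result-Code \(268,[^)]*\) = (?P<name>\w+) \((?P<code>\d+)\)"
)


def extract_result_code(dump):
    """Return (name, code) from a CCA dump, or None if no Result-Code is found."""
    match = RESULT_CODE_RE.search(dump)
    if match is None:
        return None
    return match.group("name"), int(match.group("code"))


def is_permanent_failure(code):
    """Return True for 5xxx codes, the permanent-failure class in RFC 6733."""
    return 5000 <= code <= 5999


# Excerpt of the example answer from this section.
SAMPLE_DUMP = """Diameter Message: CCA
  Session-Id (263,M,l=11) = 111
  Result-Code (268,M,l=12) = DIAMETER_MISSING_AVP (5005)
  Origin-Host (264,M,l=24) = dgw1.example.com
"""

name, code = extract_result_code(SAMPLE_DUMP)
print(name, code, is_permanent_failure(code))
```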

Diameter Gateway nodes must be started only after the customer data is loaded into the ECE grid; otherwise, they will not be able to process Diameter requests. For information about starting Diameter Gateway, see "Starting and Stopping ECE".

Troubleshooting Failed Update Requests from BRM

Update requests pass event information from BRM to ECE; occasionally, update-request events fail. The following are examples of when update requests from BRM might fail:

  • A delay in receiving PDC pricing data

    This occurs when ECE has not yet received a charge offer or discount offer from PDC but has received the offer external ID (in an update request) from BRM. Because the offer is not yet loaded into ECE, the external ID update does not find the offer in the cache, and the offer update fails.

  • Configuration errors in customer data

  • Errors in management requests associated with the rerating process

    Management-type requests for the rerating process, such as PrepareToRerate and RerateCompleted, may fail.

    For information on troubleshooting rerating errors, see "Troubleshooting Problems with Rerating".

The preceding types of failed update requests are placed in a suspense queue. As part of the post-installation tasks in an integrated ECE installation, you create the suspense queue to collect the failed events and you create a queue table called ifw_sync_sus. See the discussion of creating the suspense queue in BRM Elastic Charging Engine Installation Guide.

About Customer Updater Error Log Files

After failed requests have been collected in the suspense queue, you can view them there and reprocess them manually. When failed events are moved to the suspense queue, error messages are written to the customerUpdater.log files in the ECE_Home/oceceserver/logs directory.

Viewing Failed Events in the Suspense Queue

To view the failed events in the suspense queue:

  1. Map the suspense queue to the ifw_sync_sus table.

  2. After connecting to the database, run the select user_data from ifw_sync_sus query to view the failed events.

For more information, see the discussion of installing account synchronization in BRM Installation Guide.

Propagating Events from Suspense Queue to BRM Account Synchronization DM Database Queue

Once you view the events in the suspense queue and you identify the cause of failure, you propagate the events from the suspense queue to the BRM Account Synchronization DM database queue with the events_propagate_utility script. The Customer Updater reprocesses the events. For more information, see "events_propagate_utility.pl".

Troubleshooting Problems with Rerating

Rerating errors can occur at different stages of the rerating process. If ECE cannot rerate events for a customer, rerating errors are handled as follows for each stage:

  • In the prepare to rerate stage

    Errors in this stage are logged in the CustomerUpdater.log file with appropriate reasons. No acknowledgement is sent, so the acknowledgement queue will be empty.

  • During rerating

    Errors are logged in the emGateway.log with appropriate reasons.

  • In the rerate complete stage

    Errors are logged in the CustomerUpdater.log file. The acknowledgement will be sent back to BRM. ECE sends a notification to BRM using the BRM Gateway to create a new rerate job. BRM uses the information in the notification for creating a new rerate job for that customer.

Troubleshooting Performance Issues Using Coherence JMX Metrics

ECE provides Coherence metrics that can help you troubleshoot performance problems, tune performance, and isolate hardware issues.

To troubleshoot performance issues using Coherence JMX metrics:

  1. Log on to the ECE driver machine.

  2. Access the Coherence MBeans by doing the following:

    1. Start a JMX editor that allows you to edit MBean attributes. If you use JConsole, enter:

      $ jconsole &
      
    2. Do one of the following:

      • If the driver machine and the server machine are the same machine (standalone system), select Local Connect.

      • For an integrated system (the ECE system is set up on the driver machine and the setup is synchronized onto remote server machines), select Remote Connect.

    3. Enter the IP address or host name and port number of the charging server node you have enabled for JMX-management in your topology.

      For information about enabling ECE nodes for JMX management, see the discussion about post-installation tasks in BRM Elastic Charging Engine Installation Guide.

    4. (JMX secure connections only) When prompted, enter your JMX security credentials.

    5. Click the MBeans tab.

    6. Expand the Coherence navigation tree.

    7. Expand Service.

  3. Refer to the Coherence JMX metric that applies to your troubleshooting scenario:

    • Expand InvocationService and select the node. The TaskCount metric indicates the number of request batches received by the node.

    • Expand InvocationService and select the node. The TaskAverageDuration metric indicates the average request batch latency for the node.

    • Expand BRMDistributedCache and select the node. The meaning of the RequestTotalCount metric depends on the workload:

      • For Initiate, Update, Terminate, and Cancel requests, this metric indicates the number of entry processor invocations.

      • For Auth Query requests, this metric indicates the number of get() operations.

    • Expand BRMDistributedCache and select the node. The meaning of the RequestAverageDuration metric depends on the workload:

      • For Initiate, Update, and Terminate requests, this metric indicates the entry processor latency.

      • For Auth Query requests, this metric indicates the get() latency.

You can reset the JMX metrics by using the resetStatistics() operation on the Operations tab.
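
Because TaskCount and RequestTotalCount are running totals, a rate is often easier to interpret than the raw value. The following Python sketch shows one way to derive a per-second rate from two readings taken manually in JConsole. The function name and the sample numbers are hypothetical and are not ECE output.

```python
def rate_per_second(first_count, second_count, elapsed_seconds):
    """Return the average rate between two readings of a running counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (second_count - first_count) / elapsed_seconds


# Hypothetical TaskCount readings taken 60 seconds apart on one node.
batches_per_second = rate_per_second(1000, 1600, 60)
print(batches_per_second)
```

Comparing this rate across nodes can help show whether one node is receiving noticeably fewer request batches than the others. Take both readings without resetting the statistics in between.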

Getting Help for ECE Problems

If you cannot resolve your ECE problem, contact Oracle technical support.

Before You Contact Oracle Technical Support

Problems can often be fixed simply by stopping and restarting a node. To stop and restart ECE nodes, see "Starting and Stopping ECE".

If that does not solve the problem, the first troubleshooting step is to look at the error log for the application or process that reported the problem. See "Using Error Logs to Troubleshoot ECE". Be sure to observe the checklist for resolving problems with ECE before reporting the problem to Oracle technical support. See "Troubleshooting Checklists".

Reporting Problems

If the checklist for resolving problems with ECE does not help you to resolve the problem, write down the pertinent information:

  • A clear and concise description of the problem, including when it began to occur.

  • Relevant portions of the relevant log files.

  • Relevant configuration files.

  • Recent changes in your system, even if you don't think they are relevant.

  • List of all ECE components and patches installed on your system.

When you are ready, report the problem to Oracle technical support.