10 Troubleshooting ECE

This chapter describes ways to troubleshoot Oracle Communications Billing and Revenue Management Elastic Charging Engine (ECE).

Overview of Troubleshooting ECE

The following are the primary components used for diagnosing ECE system problems:

JConsole

When ECE is running, you can access Java Management Extensions (JMX) information provided by the Java Virtual Machine (JVM) by using JConsole. You can edit ECE MBean attributes using JConsole. To do so, you must enable one charging server node for JMX management for each unique IP address in your physical topology. See "Enabling a Charging Server Node for JMX Management" for more information.

For information about JConsole, see the discussion of using JConsole in the JDK product documentation.
Oracle Enterprise Manager Cloud Control

If you have Oracle Application Management Pack for Oracle Communications, Oracle Enterprise Manager Cloud Control can be used for providing information about a problem condition in ECE.

See Oracle Application Management Pack for Oracle Communications System Administrator's Guide for more information.
Log files

Each ECE node has a log file for logging errors during processing.

See "Using Error Logs to Troubleshoot ECE" for more information.

See "Configuring Logging" for information about setting log levels.
Elastic Charging Controller (ECC)

ECC has a status command that returns information about nodes in the topology.

See the ECC help command for information about the status command.

For troubleshooting JVM issues, note the following:

You can refer to the Java Help Center Web site:

http://www.java.com/en/download/help/index.xml
You define Java VM arguments and properties for ECE processes in your own JVM tuning files.

For information about configuring your JVM tuning files in ECE, see "Configuring JVM Tuning Parameters".

Troubleshooting JVM and Coherence

ECE is based on JVM technologies and Coherence. When troubleshooting ECE, you may need to troubleshoot JVM and Coherence technologies as well. See "Troubleshooting JVM and Coherence" for more information.

For troubleshooting JVM issues, note the following:

You can refer to the Java Help Center Web site:

http://www.java.com/en/download/help/index.xml
JVM arguments and properties are defined in the ECE_home/config/ece.properties file, where ECE_home is the directory in which ECE is installed.
Java tuning profiles for nodes are defined in the ECE_home/oceceserver/config/defaultTuningProfile.properties file.
When ECE is running, you can access JMX information provided by the JVM by using JConsole. For more information, see the Web site at:

http://docs.oracle.com/javase/7/docs/technotes/guides/management/jconsole.html

For troubleshooting Coherence issues, note the following:

You can access runtime information about Coherence through JMX, specifically through the Coherence MBeans (which are used to manage and monitor different parts of Coherence).

See the Coherence MBean documentation for the meaning of Coherence JMX metrics as they relate to ECE.

See the Coherence MBean reference in Oracle Coherence Developer's Guide Release 3.6.1 documentation at:

http://docs.oracle.com/cd/E15357_01/coh.360/e15723/appendix_mbean.htm
You can access the Oracle Coherence Community Resources Web site at:

http://www.oracle.com/technetwork/middleware/coherence/community/index.html
See "Configuring Coherence" for information about the ECE configuration files related to configuring Coherence.

Troubleshooting Checklists

For information about troubleshooting checklists, refer to the following topics:

ECE Troubleshooting Checklist When Offline
ECE Troubleshooting Checklist When Online
ECE General Troubleshooting Checklist

ECE Troubleshooting Checklist When Offline

You can use the following troubleshooting checklist when ECE is not connected to external applications in the charging system:

Verify the installation layout and the VERSION file.
Verify ECE configuration files (under the ECE_home/oceceserver/config directory). See "Configuring the ECE System" for information about configuration options.

In particular, check the topology, the tuning profiles (if any), and the ece.properties file.
Verify Coherence configuration files (under the ECE_home/oceceserver/config directory).
Verify bi-directional password-less SSH between the driver machine and the server machines.

Servers should be accessible from the driver machine by way of SSH without entering a password, and the driver machine should be accessible from each server machine by way of SSH without a password.
Verify the security configuration.

Some examples of security errors are as follows:
- If the server fails with the error password file read access must be restricted, the cause may be that the jmxremote.password file has write access. The solution is to remove write access to this file (chmod 400 ECE_home/oceceserver/config/jmxremote.password) or disable the JMX security setting if it is not needed.
- If the server fails with the error cannot find security credentials, the cause may be that the Coherence security settings are incorrect. The solution is to correct the Coherence security settings properties in the following files:
  
  ECE_home/oceceserver/config/defaultTuningProfile.properties
  
  ECE_home/oceceserver/config/ece.properties
Verify ECE deployment in remote machines. Issues can arise from out-of-synchronization installations on various remote machines. Running the ECC sync command can ensure that all remote machines are in synchronization.
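
The permission check for the jmxremote.password file described in this checklist can be scripted. The following is a minimal sketch, not part of ECE: the function name and the example path are hypothetical, and the test it applies (no write access for the owner and no access at all for group or others) is an assumption modeled on the effect of chmod 400.

```python
import os
import stat


def password_file_is_restricted(path):
    """Return True if the owner cannot write the file and nobody else can access it.

    This mirrors the effect of the chmod 400 command shown in the checklist.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o277 covers the owner write bit plus every group and other bit.
    return (mode & 0o277) == 0


if __name__ == "__main__":
    # Hypothetical installation path; substitute your own ECE_home.
    candidate = "/home/example/ECE_home/oceceserver/config/jmxremote.password"
    if os.path.exists(candidate):
        print(candidate, "restricted:", password_file_is_restricted(candidate))
```

A result of False suggests running the chmod command from the checklist before restarting the node.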

ECE Troubleshooting Checklist When Online

You can use the following troubleshooting checklist when ECE is connected to external applications in the charging system:

Verify that nodes are running by connecting through JMX (for example, by using JConsole).
Verify the machine state of nodes.

Before it can process requests, ECE goes through several states, such as startup, configuration, update, and processing. State Manager controls the state transitions. A node that is in a state other than USAGE_PROCESSING will not process usage requests.
Verify the charging-server health threshold.

The number of running nodes may have gone below the threshold. See "Configuring Charging-Server Health Thresholds" for information.
Verify ECE management settings.

All management configuration files are located under ECE_home/oceceserver/config/management. See "About ECE Configuration" for more information.

ECE General Troubleshooting Checklist

When any problems occur, do some troubleshooting before you contact Oracle support. You know your installation better than Oracle support does. You know if anything in the system has been changed, so you are more likely to know where to look first. If you have a problem with your ECE system, ask yourself the following questions first, because Oracle support will ask them of you:

What exactly is the problem? Can you isolate it? For example, if it is a problem with an application, does it occur on one instance of the application, or all instances?

Oracle technical support needs a clear and concise description of the problem, including when it began to occur.
What do the log files say?

Oracle technical support asks for this information first. Check the error log for the ECE component you are having problems with.
Read through the ECE troubleshooting checklist. Look through the list of common problems and their solutions.
Has anything changed in the system? Did you install any new hardware or new software? Did the network change in any way?
Have you read the Release Notes?

The Release Notes include information about known problems and workarounds.
Does the problem resemble another one you had previously?
Has your system usage recently jumped significantly?
Is the system otherwise operating normally?
Has response time or the level of system resources changed?
Are users complaining about additional or different problems?
Can you run clients successfully?
Are any other processes on the system hardware functioning normally?

If you still cannot resolve the problem, contact Oracle technical support as described in "Getting Help for ECE Problems".

Using Error Logs to Troubleshoot ECE

ECE error log files provide detailed information about system problems. If you have a problem with an ECE process or node, look in its corresponding log file in the ECE_home/oceceserver/logs directory. This directory is where log files are generated for each node on each machine of your topology.

Log files include errors that need to be managed, as well as errors that do not need immediate attention (for example, invalid login attempts). To manage log files, make a list of the errors that are important for your system so that you can distinguish them from errors that do not require immediate attention.

See "Configuring Logging" for information about setting log levels.
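
One way to start a list of important errors is to count how many messages of each level every log file contains. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it looks only at files ending in .log, and it expects each line to begin with a date, a time, and a time zone followed by the level, as in the conversion pattern documented in this chapter. Lines with any other layout (for example, stack-trace lines) are skipped.

```python
import re
from collections import Counter
from pathlib import Path

# Date, time, zone, then the level, then " - " (assumed line layout).
LEVEL_RE = re.compile(r"^\S+ \S+ \S+\s+([A-Z]+) - ")


def count_levels(log_dir):
    """Map each *.log file name in log_dir to a Counter of log levels."""
    summary = {}
    for log_file in sorted(Path(log_dir).glob("*.log")):
        counts = Counter()
        with log_file.open(errors="replace") as handle:
            for line in handle:
                match = LEVEL_RE.match(line)
                if match:
                    counts[match.group(1)] += 1
        summary[log_file.name] = counts
    return summary
```

Calling count_levels on the ECE_home/oceceserver/logs directory of a machine returns one counter per node log file, which makes it easier to see which nodes are logging the most errors.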

Understanding Error-Message Syntax

ECE error messages use the following syntax:

log4j.appender.ECE_LOG.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS zzz} %5p - %X{correlationId} - %X{requestId} - %X{customerId} - %m%n

For example:

2012-09-11 22:27:09.565 PDT DEBUG - 
324984520132531235 - 49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 -
 Successfully retrieved tariffPolicy from customer; candidate balance ids ::
 [Bal#6500000587]
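
When you filter log files by customer or request, it can help to split each message into the fields of the conversion pattern. The following is a minimal sketch, assuming that each message occupies a single line in the log file and that the default pattern is in use; the function and field names are illustrative and are not part of ECE.

```python
import re

# Timestamp, zone, level, then the three context values and the message,
# each separated by " - " as in the conversion pattern.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<zone>\S+)\s+(?P<level>[A-Z]+) - "
    r"(?P<correlation_id>.*?) - (?P<request_id>.*?) - "
    r"(?P<customer_id>.*?) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2012-09-11 22:27:09.565 PDT DEBUG - 324984520132531235 - "
        "49de072b-33ae-4f6c-816d-35c25b9ade78 - Cust#6500000587 - "
        "Successfully retrieved tariffPolicy from customer"
    )
    print(parse_log_line(sample))
```

Lines that do not match, such as stack-trace continuation lines, return None and can be attached to the preceding message if needed.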

Collecting Diagnostic Information

To collect diagnostic information, turn on ECC feedback mode, which produces extra information when you run commands. For example:

ecc:000>set feedback true

The feedback mode setting is saved in your local profile so you do not need to set it every time you start ECC.

See "Configuring Logging" for information about how to control ECE logging and log levels.

Collecting Log Files for Sending to Oracle Support

You can use the infoCollector ECC command to collect log files from your ECE system. The command copies files that are relevant to diagnosing problems from all server machines and the driver machine and puts them in a well-known location. The command also creates a compressed TAR file of all collected files, which you can then send to Oracle support.

The infoCollector command does not collect files from systems on which the Elastic Charging Client is running (for example, it does not collect files from the network mediation system).

To collect log files from your ECE system:

(Optional) Create a directory in which to put the collected log files.

If you do not specify a directory, the command puts the collected log files in the user home directory (the directory to which your platform $HOME environment variable points).
Log on to the driver machine.
Change directory to the ECE_home/oceceserver/bin directory.
Start ECC:
```
./ecc
```
Run the following command:
```
infoCollector
```
See "infoCollector Syntax" for information about syntax and arguments for running the command.

The following occurs:
- From each server machine in your topology, the following file and directories are copied to the following location on the driver machine (where user_home is the user home directory of the driver machine and server_host is the IP address or host name of the server machine as it is defined in the ECE_home/oceceserver/config/eceTopology.conf file):
  - user_home/server_host/oceceserver/VERSION
  - user_home/server_host/oceceserver/logs
  - user_home/server_host/oceceserver/config
  - user_home/server_host/oceceserver/brm_config
  - user_home/server_host/oceceserver/odi_transformation
- From the driver machine, the following file and directories are copied to the following location on the driver machine (where user_home is the user home directory of the driver machine and driver_host is the IP address or host name of the driver machine as it is defined in the ECE_home/oceceserver/config/eceTopology.conf file):
  - user_home/driver_host/oceceserver/VERSION
  - user_home/driver_host/oceceserver/config
  - user_home/driver_host/oceceserver/brm_config
- A compressed TAR file, of all copied files, is created with the extension tar.gz in the user home directory of the driver machine (for example, user_home/info_collector.tar.gz).
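
Before you send the compressed TAR file to Oracle support, you may want to confirm what it contains. The following is a minimal sketch that uses only the Python standard library; the function name is hypothetical, and the archive path in the usage comment is the example name given above, not a guaranteed file name.

```python
import tarfile


def list_collected_files(archive_path):
    """Return the sorted names of the regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


# Example usage (path is an assumption based on the example above):
# for name in list_collected_files("/home/example/info_collector.tar.gz"):
#     print(name)
```

Listing the archive this way makes it easy to check that the VERSION file and the logs and config directories from each host were collected.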

infoCollector Syntax

infoCollector [-v] [-nc] [-l] [-d dir] [-s] [-e "FileFilter", "FileFilter2", "..."]

where:

-v turns on verbose mode that outputs information pertaining to the collected files; this option provides status of what is being copied as the command collects files.
-nc does not compress the resulting directory into a compressed TAR file.
-l includes all log files in the collection; if -l is omitted, the command collects only those log files that match the node-name with the .log suffix that you specify.
-d dir uses the directory you specify to hold all collected data where dir is the path of the directory.
-s includes all files from the ECE_home/oceceserver/sample_data directory.
-e "FileFilter" is the path name or path name pattern of custom directories or files to collect and include in the compressed TAR file.

Separate multiple filters with a comma.

For example, assume in your ECE_home directory you have the files notes.txt, comments_1.txt, comments_2.txt, and comments_3.txt, and you also have a directory named observations that contains the files observation1.txt, observation2.txt, and observation3.txt.

The path name pattern can be an explicit file name such as notes.txt, an entire directory tree such as observations, or a pattern that uses the * wildcard, as shown for the files that begin with comments in the following example:
```
infoCollector -e "/home/example/ECE_11.2.0.2/notes.txt", 
"/home/example/ECE_11.2.0.2/observations", "/home/example/ECE_11.2.0.2/comments*"
```
The preceding infoCollector command would put all custom files and directories into a directory named extra_files and the resulting collection of files would have the following directory structure:
```
infoCollector_2014-05-07T11:02:10
  localhost-DRIVER
    extra_files
      observations
        observation3.txt
        observation2.txt
        observation1.txt
      notes.txt
      comments_3.txt
      comments_2.txt
      comments_1.txt
    config
    brm_config
    VERSION
```

Displaying Data in the Coherence Caches When Troubleshooting

You can use the query.sh script for querying Coherence caches and displaying the data that you want.

See the discussion of the query tool in BRM Elastic Charging Engine Implementation Guide for more information.

Troubleshooting Failed Diameter-Message Processing in Diameter Gateway

If you suspect a problem with how Diameter Gateway nodes are processing Diameter messages, look in the ECE_home/oceceserver/logs/Instance_Name.log files for errors, where Instance_Name is the name of the Diameter Gateway-node instance (a name you defined in the ECE topology file) that you need to troubleshoot. For example, look in diameterGateway1.log. The log file contains information about any errors during Diameter-message processing.

To set log levels for Diameter Gateway nodes, obtain the Diameter Gateway module names in the ECE_home/oceceserver/config/log4j.properties file, and then set the log levels by module as described in "Setting Log Levels by Editing MBeans".

Diameter Gateway returns all Diameter result codes (Result-Codes) as part of the Credit Control Answer (CCA) message. When an error occurs, the error ID and name are returned in the result code. For example, if the CCR was missing an Event-Timestamp AVP, the error would be:

DiameterTalk Answers =[
Diameter Message: CCA
Version: 1
Msg Length: 144
Cmd Flags: PXY
Cmd Code: 272
App-Id: 4
Hop-By-Hop-Id: 1497412149
End-To-End-Id: 734750287
  Session-Id (263,M,l=11) = 111
  Result-Code (268,M,l=12) = DIAMETER_MISSING_AVP (5005)
  Origin-Host (264,M,l=24) = dgw1.example.com
  Origin-Realm (296,M,l=19) = example.com
  Auth-Application-Id (258,M,l=12) = 4
  CC-Request-Type (416,M,l=12) = INITIAL_REQUEST (1)
  CC-Request-Number (415,M,l=12) = 0
  Failed-AVP (279,M,l=20) = 
    Event-Timestamp (55,M,l=12) = 3627391363 (Fri Dec 12 08:42:43 PST 2014)

For the error, a message that describes the nature of the error and includes its stack trace is written to the diametergatewayInstance_Name.log file.
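
If you capture answer dumps like the one above, the result code can be extracted for reporting. The following is a minimal sketch, assuming the dump uses the text layout shown in this section; the function name is hypothetical and this is not a Diameter protocol parser.

```python
import re

# Matches a line such as:
#   Result-Code (268,M,l=12) = DIAMETER_MISSING_AVP (5005)
RESULT_CODE_RE = re.compile(r"Result-Code \(268,[^)]*\) = (\w+) \((\d+)\)")


def extract_result_code(dump):
    """Return (name, number) for the first Result-Code AVP, or None if absent."""
    match = RESULT_CODE_RE.search(dump)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

For the dump above, the function returns the name DIAMETER_MISSING_AVP together with the number 5005.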

For information about Diameter Gateway result codes, see "Diameter Gateway Result Codes".

Diameter Gateway nodes must be started after the customer data is loaded into the ECE grid; otherwise, they cannot process Diameter requests. For information about starting Diameter Gateway, see "Starting and Stopping ECE".

Troubleshooting Failed RADIUS-Message Processing in RADIUS Gateway

If you suspect a problem with how RADIUS Gateway nodes are processing RADIUS messages, see the following files for errors that you need to troubleshoot:

ECE_home/oceceserver/logs/Instance_Name.log files, where Instance_Name is the name of the RADIUS Gateway-node instance (a name you defined in the ECE topology file); for example, ECE_home/oceceserver/logs/radiusGateway1.log
Charging-server node log files; for example, ECE_home/oceceserver/logs/ecs1.log.

To set log levels for RADIUS Gateway nodes, obtain the RADIUS Gateway module names in the ECE_home/oceceserver/config/log4j.properties file, and then set the log levels by module as described in "Setting Log Levels by Editing MBeans".

RADIUS Gateway returns all the results as part of the reply-message attribute-value pair (AVP) in the RADIUS response. For example, if the user password in the authentication request is incorrect, the following error message is returned in the RADIUS response:

Session_Timeout AVP after Deletion : null
2016-03-07 23:37:58.896 PST DEBUG -  -  -  - ECE Radius server - Sending the response to client
 Code: Access-Reject(3)
 Identifier: 0
 Length: 20
 Authenticator: 0x00000000000000000000000000000000
 Reply-Message: RadiusGatewayMessagesBundle-31015: Incorrect password from User
 User-Name: 0049100033
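
To pull the response code and reply message out of a logged RADIUS response, the indented "name: value" lines can be read into a dictionary. The following is a minimal sketch, assuming the text layout shown above; the function name and the list of recognized field names are taken from this example only and are not an exhaustive list of RADIUS attributes.

```python
KNOWN_FIELDS = (
    "Code", "Identifier", "Length",
    "Authenticator", "Reply-Message", "User-Name",
)


def parse_radius_response(dump):
    """Return a dict of the recognized 'name: value' lines in a logged response."""
    fields = {}
    for line in dump.splitlines():
        # Split on the first ": " only, so values may contain colons.
        key, separator, value = line.strip().partition(": ")
        if separator and key in KNOWN_FIELDS:
            fields[key] = value
    return fields
```

The Reply-Message value keeps its message-bundle prefix, so it can be searched for directly in the RADIUS Gateway log file.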

Troubleshooting Failed Update Requests from BRM

Update requests pass event information from Oracle Communications Billing and Revenue Management (BRM) to ECE; occasionally, update-request events fail. The following are examples of when update requests from BRM might fail:

A delay in receiving Pricing Design Center (PDC) pricing data

ECE does not receive a charge or discount offer from PDC but does receive the offer's external ID in an update request from BRM. Because the offer is not loaded into ECE due to the PDC delay, the external ID update does not find the offer in the cache. Consequently, the offer update fails.
Configuration errors in customer data.
Errors in management requests associated with the rerating process.

Management-type requests for the rerating process, such as PrepareToRerate and RerateCompleted, may fail. For information on troubleshooting rerating errors, see "Troubleshooting Problems with Rerating".

The preceding types of failed update requests are placed in a suspense queue. During the post-installation phase of an integrated ECE system, you create the suspense queue, and you create a queue table called ifw_sync_sus. See the discussion about creating the suspense queue in BRM Elastic Charging Engine Installation Guide.

About Customer Updater Error Log Files

After failed requests are placed in the suspense queue, you can view and manually reprocess them. When failed events are added to the suspense queue, error messages are recorded in the ECE_home/oceceserver/logs/customerUpdater.log files.

Viewing Failed Events in the Suspense Queue

To view failed events in the suspense queue:

Map the suspense queue to the ifw_sync_sus table.
Connect to the database, and then run the following query to view the failed events: select user_data from ifw_sync_sus

For more information, see the discussion about installing account synchronization in BRM Installation Guide.
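
The query can also be run from a script through any Python DB-API connection. The following is a minimal sketch: the function name is hypothetical, and the in-memory sqlite3 database (with user_data as a plain text column) is a stand-in used only so that the example runs on its own. In a real deployment, the connection would come from a driver for your BRM database, and the user_data column of the queue table may be returned as a driver-specific object rather than a string.

```python
import sqlite3

QUERY = "select user_data from ifw_sync_sus"


def fetch_failed_events(connection):
    """Return the user_data value of every row in the ifw_sync_sus table."""
    cursor = connection.cursor()
    try:
        cursor.execute(QUERY)
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()


if __name__ == "__main__":
    # Stand-in database for illustration; not an ECE or BRM schema.
    demo = sqlite3.connect(":memory:")
    demo.execute("create table ifw_sync_sus (user_data text)")
    demo.execute("insert into ifw_sync_sus values ('example failed event')")
    print(fetch_failed_events(demo))
    demo.close()
```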

Removing Excess Failed Updates in the BRM Gateway Suspense Queue

If the failed updates in the BRM suspense queue are not processed, they stay in the queue and an error is logged in the file ECE_home/oceceserver/logs/Instance_Name.log.

You can check the log file as part of the daily health status monitoring, and in case of an error in the BRM Gateway suspense queue, you can manually move or delete failed updates from the queue and restart the WebLogic server.

To manually move or delete a failed update:

Log on to the WebLogic server on which the BRM Gateway suspense queue resides.
Log in to WebLogic Server Administration Console.
In the Services section, select JMS Modules.
In the JMS Modules, click ECE Module.
In Summary of Resources, click Suspense Queue.
Click Monitoring, and select ECE!Suspense Queue.
Click Show Messages.

The Summary of JMS Messages appears.
In the JMS Messages table, select a message.
Do one of the following:
- To move a message:
  1. Click Move.
  2. In the NotificationServer field, select JMS Server.
  3. In the DestinationServer field, select Suspense Queue.
  4. Click Finish.
    
    The failed update is sent back to the BRM Gateway suspense queue for reprocessing.
- To delete a message, click Delete.
  
  The failed update is deleted from the BRM Gateway suspense queue.

Propagating Events from the Suspense Queue to the Account Synchronization DM Database Queue

After you view the events in the suspense queue and identify the cause of failure, you propagate the events from the suspense queue to the Account Synchronization Data Manager (DM) database queue with the events_propagate_utility script. Customer Updater reprocesses the events. For more information, see "events_propagate_utility.pl".

Troubleshooting Problems with Coherence

When loading the pricing data from PDC into ECE, if you get the "No space left on device" error, configure the Coherence flash journal resources manager, which manages temporary journal-based files, by doing the following:

Important:

This configuration may affect the performance of your system.

Open the ECE Coherence override file your ECE system uses (for example, ECE_home/oceceserver/config/charging-coherence-override-prod.xml).

To confirm which ECE Coherence override file is used, refer to the tangosol.coherence.override parameter of the ECE_home/oceceserver/config/ece.properties file.

Add the following section:

<journaling-config>
    <flashjournal-manager>
        <directory>/logs/oracle/flashjournal</directory>
        <tmp-purge-delay>15m</tmp-purge-delay>
        <maximum-file-size>100MB</maximum-file-size>
        <high-journal-size system-property="coherence.flashjournal.highjournalsize">10GB</high-journal-size>
    </flashjournal-manager>
</journaling-config>

Save and close the file.
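
After you edit the override file, it can be useful to confirm that the XML is still well formed and that the flash journal settings read back as expected. The following is a minimal sketch that uses the Python standard library; the function names are hypothetical, and the sketch checks only XML syntax and element names, not whether Coherence accepts the values.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def read_flashjournal_settings(xml_text):
    """Return the child settings of the first flashjournal-manager element.

    Raises xml.etree.ElementTree.ParseError if the XML is not well formed.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "flashjournal-manager":
            return {
                local_name(child.tag): (child.text or "").strip()
                for child in element
            }
    return {}
```

Passing the contents of the override file to read_flashjournal_settings returns the directory, purge delay, and size limits as strings, and a mistyped closing tag surfaces as a ParseError instead of a startup failure.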

Troubleshooting Problems with Rerating

Rerating errors can occur at different stages of the rerating process. If ECE cannot rerate events for a customer, rerating errors are handled as follows for each stage:

In the prepare-to-rerate stage

Errors in this stage are logged in the CustomerUpdater.log file with appropriate reasons. No acknowledgement is sent so the acknowledgement queue will be empty.
During rerating

Errors are logged in the emGateway.log with appropriate reasons.
In the rerate-complete stage

Errors are logged in the CustomerUpdater.log file. The acknowledgement will be sent back to BRM. ECE sends a notification to BRM using BRM Gateway to create a new rerate job. BRM uses the information in the notification for creating a new rerate job for that customer.

Troubleshooting Performance Issues by Using Coherence JMX Metrics

ECE provides Coherence metrics that can help you troubleshoot performance problems, tune performance, and isolate hardware issues.

To troubleshoot performance issues by using Coherence JMX metrics:

Access the ECE MBeans:
1. Log on to the driver machine.
2. Start the ECE charging servers (if they are not started).
3. Start a JMX editor, such as JConsole, that enables you to edit MBean attributes.
4. Connect to the ECE charging server node set to start CohMgt = true in the ECE_home/oceceserver/config/eceTopology.conf file.
  
  The eceTopology.conf file also contains the host name and port number for the node.
5. In the editor's MBean hierarchy, expand the Coherence node.
Expand Service.
View the Coherence JMX metrics that apply to your troubleshooting scenario. For example:
- Expand InvocationService, select the appropriate node, and expand Attributes.
  - The TaskCount attribute specifies the number of request batches received by the node.
  - The TaskAverageDuration attribute specifies the average request batch latency for the node.
- Expand BRMDistributedCache, select the appropriate node, and expand Attributes.
  - The RequestTotalCount attribute specifies the following, depending on the request type:
    
    For Initiate, Update, Terminate, and Cancel requests, it specifies the number of entry processor invocations.
    
    For Auth Query requests, it specifies the number of get() operations.
  - The RequestAverageDuration attribute specifies the following, depending on the request type:
    
    For Initiate, Update, and Terminate requests, it specifies entry processor latency.
    
    For Auth Query requests, it specifies get latency.

Note:

To reset the attribute values for a Service subnode, expand the subnode's Operations node, select the resetStatistics operation, and click the resetStatistics button.
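
If you prefer not to reset the statistics, you can estimate the average latency over a recent interval from two readings of a count attribute and its matching average-duration attribute (for example, RequestTotalCount and RequestAverageDuration). The following is a minimal sketch of that arithmetic. It assumes that both attributes are cumulative since the last resetStatistics operation; the function name is hypothetical.

```python
def interval_average(count_start, avg_start, count_end, avg_end):
    """Estimate the average duration of requests made between two readings.

    Each reading is a (count, cumulative average) pair. Returns None if no
    new requests arrived, for example after a statistics reset.
    """
    new_requests = count_end - count_start
    if new_requests <= 0:
        return None
    # Cumulative total time = count * cumulative average.
    total_start = count_start * avg_start
    total_end = count_end * avg_end
    return (total_end - total_start) / new_requests
```

For example, if the count rises from 100 to 300 while the cumulative average rises from 2.0 to 4.0, the 200 new requests account for 1200 - 200 = 1000 units of time, which is an interval average of 5.0 in the unit that the attribute reports.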

Getting Help for ECE Problems

If you cannot resolve your ECE problem, contact Oracle support.

Before You Contact Oracle Support

Problems can often be fixed simply by stopping and restarting a node. To stop and restart ECE nodes, see "Starting and Stopping ECE".

If that does not solve the problem, look at the error log for the application or process that reported the problem. See "Using Error Logs to Troubleshoot ECE". Be sure to observe the checklist for resolving problems with ECE before reporting the problem to Oracle technical support. See "Troubleshooting Checklists".

Reporting Problems

If the checklist for resolving problems with ECE does not help you to resolve the problem, write down the pertinent information:

A clear and concise description of the problem, including when it began to occur.
Relevant portions of the relevant log files.
Relevant configuration files.
Recent changes in your system, even if you think they are not relevant.
List of all ECE components and patches installed on your system.

When you are ready, report the problem to Oracle support.