Troubleshooting OSM

This chapter provides guidelines to help you troubleshoot problems with your Oracle Communications Order and Service Management (OSM) system.

Information You Need for Troubleshooting

When you are diagnosing and resolving problems, you must be able to obtain the following information:

Database AWR report for a particular period of time.
Database ASH report for a particular period of time.
Oracle WebLogic Server administration server logs and output files.
WebLogic Server managed server logs and output files.
WebLogic Server Node Manager logs and output files.
JVM garbage collector logs.
JVM heap dumps.
JVM thread dumps (several in succession).
OSM model and a single order extracted from the database schema. For more information, see "Exporting and Importing the OSM Model and a Single Order."
Java Flight Recorder (JFR) recordings.
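
Taking several thread dumps in succession can be scripted. The following is a minimal sketch, assuming that the JDK jcmd tool is on the PATH of the host running the server and that you already know the process ID of the JVM. The function names, output file naming scheme, and default count and interval are illustrative and are not part of OSM.

```python
import subprocess
import time
from pathlib import Path


def thread_dump_command(pid):
    """Return the jcmd invocation that prints a thread dump for one JVM."""
    return ["jcmd", str(pid), "Thread.print"]


def collect_thread_dumps(pid, out_dir, count=3, interval_s=10.0,
                         run=subprocess.run, sleep=time.sleep):
    """Write `count` thread dumps for `pid` into `out_dir`, `interval_s` apart.

    `run` and `sleep` are injectable so the function can be exercised
    without a live JVM (an assumption made for testability only).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        result = run(thread_dump_command(pid), capture_output=True,
                     text=True, check=True)
        path = out_dir / f"threaddump_{pid}_{i + 1}.txt"
        path.write_text(result.stdout)
        written.append(path)
        if i < count - 1:
            sleep(interval_s)  # space the dumps so stuck threads stand out
    return written
```

Comparing the dumps side by side shows which threads remain in the same stack frame across the interval.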

General Checklist for Resolving Problems

If you have a problem with your OSM system, go through the following checklist before you contact Oracle Technical Support:

What exactly is the problem? Can you isolate it? For example, if an order causes a problem on one computer, does it give the same result on another computer?

Oracle Technical Support needs a clear and concise description of the problem, including when it began to occur.
What do the log files say?

This is the first thing that Oracle Technical Support asks for. Check the error log for the OSM component you are having problems with.
Have you read the documentation?

Look through the list of common problems and their solutions in "Diagnosing Some Common Problems with OSM".
Has anything changed in the system? Did you install any new hardware or new software? Did the network change in any way? Does the problem resemble another one you had previously? Has your system usage recently jumped significantly?
Is the system otherwise operating normally? Has response time or the level of system resources changed? Are users complaining about additional or different problems?

Diagnosing Some Common Problems with OSM

This section describes common problems and their solutions.

Cannot Log in or Access Certain Functionality

If you cannot log in or access certain functionality, check the following possible causes:

Are you a valid user in the WebLogic Server security realm?
Is the OSM web application deployed?
Are all OSM Enterprise Java Beans (EJB) deployed?
Are the OSM database resources deployed?
Do you belong to the correct groups in the WebLogic Server security realm?
Do you belong to any OSM workgroup?

System Appears Slow

If the functionality of OSM appears to be present, but performance is slow, check the following possible causes:

The amount of memory being used (check the memory configuration in the WebLogic server startup script on the workstation where you have deployed OSM).
The CPU and disk usage on the machine hosting the OSM database.
The database performance (for example, using AWR reports).
For slow worklist access, check the number of flexible headers on your worklist. The number of flexible headers has a direct negative effect on worklist performance.
WebLogic Server may be paging JMS message bodies to disk. To confirm this, log in to the WebLogic Administration Console and check the value of Messages Paged Out Total on the individual JMS servers. If messages are being paged, check whether JMS messages have been left to accumulate in error queues. Another option is to tune the message buffer size for the JMS server. By default, this is set to 512 megabytes for a production OSM system and can be increased (for example, to 1 gigabyte) if required.

Error: "java.lang.StackOverflowError" when Using Task Web Client

You may see the error "java.lang.StackOverflowError" in the log files. If this happens, you can address the problem by tuning the thread stack size parameter.

Note:

The example below sets the value to 2 MB. This is a suggested starting value; adjust it as necessary for your environment.

In your instance, project or shape specification file, add or append the following parameter and adjust the value as necessary:

shape: 
   user_mem_args: "-Xss2m"

Coherence Configuration Error: [STUCK] ExecuteThread

The following thread error can occur in the OSM WebLogic server console when running an order:

[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@5d7fc269 WAITING

This error indicates that the osm-invocation and osm-distributed thread-count values are set too low. See the discussion of configuring and monitoring Coherence threads in OSM Installation Guide for more information about increasing these settings.
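
To see how often this condition is reported, you can scan a server log for the [STUCK] marker. The following is a minimal sketch that assumes log lines follow the format of the message shown above; the function name and regular expression are illustrative.

```python
import re
from collections import Counter

# Matches the thread number and queue name in lines such as:
#   [STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'
STUCK_PATTERN = re.compile(
    r"\[STUCK\] ExecuteThread: '(\d+)' for queue: '([^']+)'"
)


def count_stuck_threads(lines):
    """Count [STUCK] reports per (thread number, queue name) pair."""
    counts = Counter()
    for line in lines:
        for thread_id, queue in STUCK_PATTERN.findall(line):
            counts[(thread_id, queue)] += 1
    return counts
```

Because the function accepts any iterable of lines, you can pass it an open log file directly.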

Unexpected Logout from Web Client

If your system is running on a WebLogic Server cluster, and the following conditions apply:

a user is viewing an order in the Order Management web client or Task web client
that order is hosted on a managed server that fails or is shut down

the user will be logged out of the web client and will have to log back in. See the discussion of order affinity in OSM Installation Guide for more information about orders being hosted on a particular managed server.

Error: "Login failed. Please try again."

If the error "Login failed. Please try again" is displayed when trying to log in through the web client and you have entered the correct user name and password, you probably do not belong to the correct groups in the WebLogic Server security realm.

To resolve this issue, log in to the WebLogic Administration Console using the administrator account. Make sure you have been added to the group OMS_client. Try to log in again.

Automation Plug-ins Are Not Getting Called

If the custom automation plug-ins are not getting called, check the following possible causes:

Is the Automation configuration deployed properly?
Are the JMS resources deployed?
Are the JMS destinations, queues, and topics configured properly?

Delayed JMS Messages

Clock synchronization issues may cause JMS request and response messages to remain in a queue in a delayed state.

When a message is sent, it is stamped with the time on the sender's machine. When the message arrives at the JMS destination, if the recipient's machine has a clock that is running several minutes slower, the timestamp is displayed as a future time and the WebLogic server decides to delay the message until the future time arrives.

To prevent this problem, use network time protocol (NTP) servers to synchronize clocks across all machines in a cluster.
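
The effect of clock skew can be illustrated with simple arithmetic. The following sketch models the behavior described above; it is not the WebLogic Server implementation, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def apparent_delay(sender_timestamp, receiver_now):
    """Return how far in the future a message appears to the receiver.

    A positive result means the receiver's clock is behind the sender's,
    so the message would be held for that long. Zero means no delay.
    """
    return max(sender_timestamp - receiver_now, timedelta(0))
```

For example, a message stamped 12:05:00 by the sender and received when the recipient's clock reads 12:00:00 appears to be five minutes in the future.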

Error Message For Events From a JMS Topic in a Cluster

The following message can occur when attempting to process an event from a JMS topic:

<Message-Driven EJB: YourCartridgeName_1.0.0.0.0_YourPluginName_orderCompleteEventMDB's transaction was rolled back. The transaction details are: ....

OSM does not support JMS topics within an OSM clustered environment. For more information about OSM queue configuration, see the discussion of OSM integration with external systems in OSM Installation Guide.

JMS Message Delivery Failure

If OSM drops and does not process JMS messages, ensure that the messages are not using uniform distributed queue (UDQ) format. OSM supports only weighted distributed queues. For more information about OSM queue configuration, see the discussion of OSM integration with external systems in OSM Installation Guide.

Unexpected Values for JMS Properties

There are some situations in which OSM may set the JMS properties on a message to values that you do not expect.

Sometimes messages in the queue that were received from external systems will have JMS properties that were not set by the external system.

This is because when OSM sends a request to an external system, any messages received in response must be correlated back to an appropriate automator. When a message is first received, before it is placed on the queue, OSM finds the handling task's context based on the correlation data in the message. OSM adds this context, and some additional data associated with the message-driven bean (MDB) that received the message, to the message using additional JMS message properties. You can see the context data by browsing the messages in the queue using the WebLogic Server Administration console.

When OSM runs in a cluster, this message is sometimes redirected to a message queue hosted on a different managed server from the one that sent the request.
Sometimes messages have values that seem wrong. For example, a message might have JMS properties with pluginJndiName=X and cartridgeNamespace=Y, but plug-in X is not in a cartridge with namespace Y.

These property values are implementation details and will not necessarily match what you expect. For example, because different plug-ins can share the same MDB, the pluginJndiName property may not contain the name of the plug-in that actually handles the message.

Too Many Open Files

If you have a large number of external clients connected to OSM and receive the error "java.net.SocketException: Too many open files", do the following:

From the WebLogic Administration Console, select Servers, then Server, then Protocols, and then HTTP. Reduce the value in Duration from the default 30 seconds to 15 or even 5 seconds. This will allow the WebLogic server to close idle HTTP connections and release more sockets.
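
Before changing the setting, it can help to confirm how close the server process is to its descriptor limit. The following is a minimal sketch that assumes a Linux host, where the /proc file system exposes a process's open descriptors and limits; the function names and the proc_root parameter are illustrative.

```python
import os
import re


def open_fd_count(pid, proc_root="/proc"):
    """Count the file descriptors currently open in process `pid` (Linux)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_open_file_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit for `pid`, or None if unlimited."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        for line in f:
            match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
            if match:
                soft = match.group(1)
                return None if soft == "unlimited" else int(soft)
    return None
```

Reading another user's process usually requires running the script as that user or as root.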

Problems When Running Multiple WebLogic Domains on One Host

If you are running multiple WebLogic domains on one host and you see errors such as web clients failing to load, JSP errors, or errors indicating that JSP pages can't be recompiled, you may not have set umask values properly to protect files in one WebLogic instance from other WebLogic instances.

Proxy Fails on a Clustered System

If a proxy fails on a clustered OSM system, all HTTP requests that would normally go through the proxy can no longer get to the OSM server. The problem could be with the physical host the server is running on, or it could be a problem with a standalone managing server that is not part of the cluster but is part of the domain.

To recover, restart the proxy.

Unable to Bring Up Managed Server After Database Failure

OSM fails to start if the Oracle database is unreachable. In addition, if the database is Oracle RAC, both database nodes specified by the data source URLs must be reachable at startup. After WebLogic Server has started, failure of a database node is handled gracefully (processing fails over to the other node).

Orders Are Not Being Created on a Clustered System

If messages are being successfully added to the JMS queue but the corresponding orders are not being created in OSM, first ensure that the servers are running. If they are running, check to see whether the address has been set for your cluster. The Cluster Address field is located in the General tab of the settings for your cluster. If the cluster address is not set, or does not contain the correct values for your managed servers, OSM will not pick up orders from the JMS queue. Generally, this value is set when the domain is created, but it can be changed or removed manually, which can cause this problem to occur.

For more information about the correct value for a cluster address, see the discussion about configuring the WebLogic Server Domain in the OSM Installation Guide chapter on installing OSM in a clustered environment.

JBoss Cache Timeouts

Long full garbage collections can cause JBoss timeout errors to appear in the log files.

OSM Fails to Process Orders Because of Metadata Errors

Metadata errors can occur in any cartridge with orchestration model entities and can cause order processing failures. Search for the string Metadata Errors in the Console view of the Cartridge Management editor in Design Studio. If you are not using Design Studio to deploy cartridges, look in the WebLogic Server logs for the same string.
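
If you are searching the WebLogic Server logs, the lookup can be scripted. The following is a minimal sketch; the function name is hypothetical, and the only detail taken from this guide is the search string itself.

```python
def find_metadata_errors(lines, marker="Metadata Errors"):
    """Return (line_number, text) pairs for every line containing `marker`."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if marker in line]
```

Pass an open log file, or any other iterable of lines, as the lines argument.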

Error: "No Backend Servers Available"

If the error "No Backend Servers Available" is displayed, you are likely disconnected from your servers. Ensure your servers are connected and functional before continuing with OSM operations.

DataDictionary Expansion Level

If you are having issues deploying cartridges, the cause may be related to the DataDictionary expansion level. In Oracle Communications Design Studio, under Windows preferences, increase the DataDictionary expansion level to 10. In some cases, you may need to increase the level to more than 10.

Quick Fix Button Active During Order Template Conflicts in Design Studio

Conflicts can occur when order templates are created in Design Studio. Presently, Quick Fix does not work for order template conflicts, even if the Quick Fix button is active. All order template conflicts must be resolved manually.

Cannot Create New Orders on a New Cartridge Version

Order creation can fail on a new version of an existing cartridge, even after you have updated all required entities, and built and deployed the cartridge.

When the createOrder request fails, you receive a response like the following example:

<env:Envelope xmlns:env="http://schemas/soap/envelope/"> 
   <env:Header/> 
   <env:Body> 
      <env:Fault 
xmlns:ord="http://URL/communications/ordermanagement"> 
         <faultcode>ord:fault</faultcode> 
         <faultstring>Failed to create and start the order due to 
java.lang.RuntimeException: OMSException: encountered error 
starting orchestration caused by:Cannot find task for notification 
id</faultstring> 
         <faultactor>unknown</faultactor> 
         <detail> 
            <InvalidOrderSpecificationFault 
xmlns="http://URL/communications/ordermanagement"> 
               <Description>Failed to create and start the order due to 
java.lang.RuntimeException: OMSException: encountered error 
starting orchestration caused by:Cannot find task for notification 
id</Description> 
            </InvalidOrderSpecificationFault> 
         </detail> 
      </env:Fault> 
   </env:Body> 
</env:Envelope>

To resolve this issue:

Open the solution cartridge.
Click the Dependency tab of the model project.
Remove all the dependencies that are displayed for the project.
Re-add all the dependencies.
Restart Design Studio.
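
A client that submits createOrder requests can detect this fault programmatically. The following is a minimal sketch, assuming the response has the shape of the example above, with an unqualified faultstring element; the function names and the marker constant are illustrative.

```python
import xml.etree.ElementTree as ET

# Substring of the fault text shown in the example response.
NOTIFICATION_TASK_MARKER = "Cannot find task for notification"


def fault_string(response_xml):
    """Return the text of the first <faultstring> element, or None."""
    root = ET.fromstring(response_xml)
    for element in root.iter("faultstring"):
        # Collapse the line breaks that wrap long messages.
        return " ".join((element.text or "").split())
    return None


def is_notification_task_fault(response_xml):
    """Return True if the response carries the fault described in this section."""
    text = fault_string(response_xml)
    return text is not None and NOTIFICATION_TASK_MARKER in text
```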

Error: "exact fetch returns more than requested number of rows"

You may see the error "exact fetch returns more than requested number of rows" in the log files when memory issues related to very large orders cause contention in orchestration XQuery calls while multiple orchestration plans are running at the same time. The default orchestration plan concurrency level is 3. You can reduce this value as described below.

To resolve this issue, decrease the orchestration plan concurrency level in your project specification file.

export JAVA_OPTIONS="${JAVA_OPTIONS} -Doracle.communications.ordermanagement.orchestration.generation.model.ConcurrencyLevel=2"

Error: "unique constraint violated"

You may see the "unique constraint violated" error in the log files if you retry a purge of order data after a previous purge attempt failed.

ORA-00001: unique constraint ...violated
ORA-06512: at "<database_schema>.OM_SQL_LOG_PKG", line 335
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 5012
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 5599
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 5778
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 6191
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 6360
ORA-06512: at "<database_schema>.OM_PART_MAINTAIN", line 6886
ORA-06512: at line 1

This error occurs because the failed purge operation left behind non-empty exchange tables.

To resolve this issue, you must purge the exchange tables manually before you retry purging.

Exceptions When Purging is in Progress

When purging is in progress, you may encounter a number of "order not found" exceptions. This happens because messages for some automated tasks are still in the JMS queue while the purge runs. As a result, exceptions such as "automation context not found" and "order not found" occur.

To resolve this issue:

Log in to the WebLogic Administration Console.
Navigate to the page that shows the JMS messages.
Delete all the messages related to the orders that have been purged.

Note:

Do not delete messages related to existing orders. To determine whether a message relates to an existing order, run the following query, where x is the order ID: select * from om_order_header where order_seq_id=x
Restart the OSM server.
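
The existence check from the note above can be scripted when there are many messages to review. The following is a minimal sketch, assuming a Python DB-API connection to the OSM schema whose driver accepts named bind variables (:name); the function name is hypothetical, and only the table and column names come from this guide.

```python
def order_exists(connection, order_seq_id):
    """Return True if om_order_header still has a row for `order_seq_id`.

    `connection` is any DB-API connection whose driver accepts named bind
    variables (":name"); that driver behavior is an assumption here.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(
            "select 1 from om_order_header where order_seq_id = :order_seq_id",
            {"order_seq_id": order_seq_id},
        )
        return cursor.fetchone() is not None
    finally:
        cursor.close()
```

If the function returns False for the order ID carried by a message, that message refers to a purged order.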

Error: "Ignoring partition"

You may encounter the following error:

"Ignoring partition <partition_number>: The number of OM_ORDER_HEADER subpartitions does not match the number of XCHG_OM_PRG_001$001$ partitions".

This error occurs when you try to purge partitions without setting up or resetting the exchange tables in an upgraded or new schema.

To resolve this issue, drop the existing exchange tables and create new exchange tables.

Getting Help with OSM Problems

If you cannot resolve your problems with OSM, contact Oracle Technical Support.

Before You Contact Support

The first troubleshooting step is to look at the error log for the application or process that reported the problem. Consult "General Checklist for Resolving Problems" before reporting the problem to Oracle.

Reporting Problems

If "General Checklist for Resolving Problems" does not help you to resolve the problem, write down the pertinent information:

A clear and concise description of the problem, including when it began to occur.
Relevant portions of the log files.
Recent changes in your system, even if you do not think they are relevant.
List of all the OSM components and patches installed on your system.
All specification files (project, instance, and shape) used to create the OSM instance.

When you are ready, report the problem to Oracle.