13 Troubleshooting OSM

This chapter provides guidelines to help you troubleshoot problems with your Oracle Communications Order and Service Management (OSM) system.

Information You Need for Troubleshooting

When you are diagnosing and resolving problems, you must be able to obtain the following information:

Database AWR report for a particular period of time.
Database ASH report for a particular period of time.
Oracle WebLogic Server administration server logs and output files.
WebLogic Server managed server logs and output files.
WebLogic Server Node Manager logs and output files (if configured).
JVM garbage collector logs (if collected).
JVM heap dumps (be aware that these files can be very large).
JVM thread dumps (several in succession).
OSM model and a single order extracted from the database schema. For more information, see "Exporting and Importing the OSM Model and a Single Order".
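
As a rough illustration of taking several thread dumps in succession, the following sketch wraps the JDK jstack utility in a hypothetical helper named collect_thread_dumps. The function name, the output file naming, and the default count and interval are assumptions made for this example and are not OSM conventions. Set the JSTACK variable if jstack is not on your PATH.

```sh
# Hypothetical helper: take COUNT thread dumps of a JVM, INTERVAL seconds apart.
# Assumes the JDK jstack utility is available (override its location with JSTACK).
collect_thread_dumps() {
    pid=$1                  # process ID of the WebLogic Server JVM
    outdir=$2               # directory that receives the dump files
    count=${3:-3}           # number of dumps to take (assumed default)
    interval=${4:-10}       # seconds to wait between dumps (assumed default)

    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        # -l asks jstack to include additional lock information
        "${JSTACK:-jstack}" -l "$pid" > "$outdir/threaddump_${pid}_${i}.txt" || return 1
        # Wait before the next dump, but not after the last one
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, `collect_thread_dumps 12345 /tmp/osm_dumps 5 10` would write five dumps, ten seconds apart, for the process with ID 12345 (both the process ID and the directory are placeholders).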

General Checklist for Resolving Problems

If you have a problem with your OSM system, go through the following checklist before you contact Oracle Technical Support:

What exactly is the problem? Can you isolate it? For example, if an order causes a problem on one computer, does it give the same result on another computer?

Oracle Technical Support needs a clear and concise description of the problem, including when it began to occur.
What do the log files say?

This is the first thing that Oracle Technical Support asks for. Check the error log for the OSM component you are having problems with.
Have you read the documentation?

Look through the list of common problems and their solutions in "Diagnosing Some Common Problems with OSM".
Has anything changed in the system? Did you install any new hardware or new software? Did the network change in any way? Does the problem resemble another one you had previously? Has your system usage recently jumped significantly?
Is the system otherwise operating normally? Has response time or the level of system resources changed? Are users complaining about additional or different problems?

Diagnosing Some Common Problems with OSM

This section describes common problems and their solutions.

Cannot Log in or Access Certain Functionality

If you cannot log in or access certain functionality, check the following possible causes:

Are you a valid user in the WebLogic Server security realm?
Is the OSM web application deployed?
Are all OSM Enterprise Java Beans (EJB) deployed?
Are the OSM database resources deployed?
Do you belong to the correct groups in the WebLogic Server security realm?
Do you belong to any OSM workgroup?

System Appears Slow

If the functionality of OSM appears to be present, but performance is slow, check the following possible causes:

The amount of memory being used (check the max memory configuration in the WebLogic server startup script on the workstation where you have deployed OSM)
The CPU and disk usage on the machine hosting the OSM database
The database connections
For slow worklist access, check the number of flexible headers on your worklist. The number of flexible headers has a direct negative effect on worklist performance.

Error: "java.lang.StackOverflowError" when Using Task Web Client

You may see the error "java.lang.StackOverflowError" in the log files when you use the plus ( + ) and minus ( - ) buttons to add or delete data elements in the OSM Task web client. If this happens, you can address the problem by tuning the thread stack size parameter in WebLogic Server as described below.

Note:

The procedures below set the value to 1MB. This is a suggested value to start with, but you should adjust the value if necessary, according to your needs.

To resolve this issue, increase the thread stack size setting for WebLogic servers on UNIX and Linux:

Back up the domain_home/bin/setDomainEnv.sh file by saving a copy with a different name.
Open the domain_home/bin/setDomainEnv.sh file in a text editor.
Search for the following:
```
USER_MEM_ARGS="
```
Do one of the following:
- If you find the search text, change the value of the variable so that the following option is set:
```
-Xss1m
```
- If you do not find the search text, do the following:
  1. Search for the following line:
```
# IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values
```
  2. Above the line that you searched for, add the USER_MEM_ARGS environment variable as follows:
```
USER_MEM_ARGS="-Xss1m"

# IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values
```
Save and close the file.
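
The backup and search steps above can be sketched as a small shell helper. This is only an illustration: the function name check_thread_stack_size, the .bak suffix, and the wording of the messages it prints are assumptions, and the helper only reports what it finds without editing the file.

```sh
# Hypothetical helper: back up a setDomainEnv.sh file and report whether
# USER_MEM_ARGS already sets a thread stack size (-Xss).
check_thread_stack_size() {
    env_file=$1     # for example, domain_home/bin/setDomainEnv.sh

    # Save a copy under a different name before editing (.bak is an assumption).
    cp -p "$env_file" "$env_file.bak" || return 1

    # Find the last uncommented USER_MEM_ARGS assignment, if any.
    line=$(grep -E '^[[:space:]]*(export[[:space:]]+)?USER_MEM_ARGS=' "$env_file" | tail -n 1)

    if [ -z "$line" ]; then
        echo "USER_MEM_ARGS not set"
    elif printf '%s\n' "$line" | grep -q -e '-Xss'; then
        echo "thread stack size set: $line"
    else
        echo "USER_MEM_ARGS set without -Xss: $line"
    fi
}
```

Running the helper against the file before and after you edit it gives a quick confirmation that the option is present.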

To increase the thread stack size setting for WebLogic servers on Windows:

Back up the domain_home\bin\setDomainEnv.cmd file by saving a copy with a different name.
Open the domain_home\bin\setDomainEnv.cmd file in a text editor.
Search for the line that begins with the following:
```
set USER_MEM_ARGS
```
Do one of the following:
- If you find the search text, change the value of the variable so that the following option is set:
```
-Xss1m
```
- If you do not find the search text, do the following:
  1. Search for the following:
```
@REM IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values
```
  2. Above the line that you searched for, add the USER_MEM_ARGS environment variable as follows:
```
set USER_MEM_ARGS=-Xss1m

@REM IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values
```
Save and close the file.

Coherence Configuration Error: [STUCK] ExecuteThread

The following thread error can occur in the OSM WebLogic server console when running an order:

[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@5d7fc269 WAITING

This error occurs when the osm-invocation and osm-distributed thread-count values are set too low. For more information about increasing these settings, see the discussion of configuring and monitoring Coherence threads in OSM Installation Guide.
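
To get a rough idea of how often this condition occurs, you could count the stuck-thread entries in a saved thread dump or server log file. The sketch below is a minimal example; the helper name count_stuck_threads is hypothetical, and it assumes the message has been written to a file that you pass as the argument.

```sh
# Hypothetical helper: count lines that report stuck execute threads.
count_stuck_threads() {
    # -F matches the marker literally; -c prints the number of matching lines.
    # grep exits with a non-zero status when the count is 0.
    grep -c -F '[STUCK] ExecuteThread' "$1"
}
```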

Unexpected Logout from Web Client

If your system is running on a WebLogic Server cluster, and the following conditions apply:

a user is viewing an order in the Order Management web client or Task web client
that order is hosted on a managed server that fails or is shut down

the user will be logged out of the web client and will have to log back in. See the discussion of order affinity in OSM Installation Guide for more information about orders being hosted on a particular managed server.

Error: "Login failed. Please try again."

If the error "Login failed. Please try again" is displayed when trying to log in through the web client and you have entered the correct user name and password, you probably do not belong to the correct groups in the WebLogic Server security realm.

To resolve this issue, log in to the WebLogic Administration Console using the administrator account. Make sure you have been added to the group OMS_client. Try to log in again.

Automation Plug-ins Are Not Getting Called

If the custom automation plug-ins are not getting called, check the following possible causes:

Is the Automation configuration deployed properly?
Are the JMS resources deployed?
Are the JMS destinations, queues, and topics configured properly?

Delayed JMS Messages

Clock synchronization issues may cause JMS request and response messages to remain in a queue in a delayed state.

When a message is sent, it is stamped with the time on the sender's machine. When the message arrives at the JMS destination, if the recipient's machine has a clock that is running several minutes slower, the timestamp is displayed as a future time and the WebLogic server decides to delay the message until the future time arrives.

To prevent this problem, use network time protocol (NTP) servers to synchronize clocks across all machines in a cluster.

The short-term workaround for this issue is to manually move the message to another JMS server on a machine with a clock that is synchronized.
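
The clock comparison described in this section can be sketched in shell as follows. The helper names clock_skew_seconds and skew_within are hypothetical, and the tolerance you pass is your own choice rather than a documented WebLogic or OSM threshold. Both timestamps are epoch seconds, such as the output of date +%s on each machine.

```sh
# Hypothetical helper: print how far the remote clock is ahead of the local
# clock, in seconds (negative if the remote clock is behind).
clock_skew_seconds() {
    local_epoch=$1
    remote_epoch=$2
    echo $((remote_epoch - local_epoch))
}

# Hypothetical helper: succeed only if the absolute skew is within a tolerance.
skew_within() {
    skew=$1
    max=$2
    # Convert a negative skew to its absolute value
    [ "$skew" -lt 0 ] && skew=$((0 - skew))
    [ "$skew" -le "$max" ]
}
```

For example, you might run `clock_skew_seconds "$(date +%s)" "$(ssh otherhost date +%s)"`, where otherhost is a placeholder for another machine in the cluster. Network latency makes the result approximate, so treat it as an indication only.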

To monitor JMS messages in the OSM WebLogic server console:

Log in to the OSM WebLogic server.

The WebLogic Administration Console is displayed.
Click Services.
Select Messaging.
Select JMS Modules.
Select the JMS module associated with the delayed message.
Select the queue associated with the delayed message.
Click the Monitoring tab.
Select a destination to find the delayed message.

Note:
If the message is not in the first destination, check the second destination.
Select Show messages.
Select the message in the delayed state.
Click Move.

Error Message For Events From a JMS Topic in a Cluster

The following message can occur when attempting to process an event from a JMS topic:

<Message-Driven EJB: YourCartridgeName_1.0.0.0.0_YourPluginName_orderCompleteEventMDB's transaction was rolled back. The transaction details are: ....

OSM does not support JMS topics within an OSM clustered environment. For more information about OSM queue configuration, see the discussion of OSM integration with external systems in OSM Installation Guide.

JMS Message Delivery Failure

If OSM drops and does not process JMS messages, ensure that the messages are not using weighted distributed queue (WDQ) format. OSM supports only uniform distributed queues (UDQ). For more information about OSM queue configuration, see the discussion of OSM integration with external systems in OSM Installation Guide.

Unexpected Values for JMS Properties

There are some situations in which OSM may set the JMS properties on a message to values that you do not expect.

Sometimes messages in the queue that were received from external systems will have JMS properties that were not set by the external system.

This is because when OSM sends a request to an external system, any messages received in response must be correlated back to an appropriate automator. When a message is first received, before it is placed on the queue, OSM finds the handling task's context based on the correlation data in the message. OSM adds this context, and some additional data associated with the message-driven bean (MDB) that received the message, to the message using additional JMS message properties. You can see the context data by browsing the messages in the queue using the WebLogic Server Administration console.

When OSM runs in a cluster, this message is sometimes redirected to a message queue hosted on a different managed server from the one that sent the request.
Sometimes messages have values that seem wrong. For example, a message might have JMS properties with pluginJndiName=X and cartridgeNamespace=Y, but plug-in X is not in a cartridge with namespace Y.

These property values are implementation details and will not necessarily match what you expect. For example, because different plug-ins can share the same MDB, the pluginJndiName property may not contain the name of the plug-in that actually handles the message.

Too Many Open Files

If you have a large number of external clients connected to OSM and receive the error "java.net.SocketException: Too many open files", do the following:

From the WebLogic Administration Console, select Servers, then Server, then Protocols, and then HTTP. Reduce the value in Duration from the default 30 seconds to 15 or even 5 seconds. This will allow the WebLogic server to close idle HTTP connections and release more sockets.
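
To see whether the change reduces the number of descriptors held by the server process, you could count them before and after adjusting the setting. The following sketch is for Linux only, where the descriptors of a process are listed under /proc. The helper name count_open_fds and its optional second argument are assumptions made for this example.

```sh
# Hypothetical helper (Linux only): count the file descriptors, including
# sockets, that a process has open by listing /proc/<pid>/fd.
count_open_fds() {
    pid=$1
    proc_root=${2:-/proc}   # optional override of the proc file system root
    # Prints 0 if the directory does not exist or cannot be read
    ls "$proc_root/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
}
```

Compare the result with the limit reported by `ulimit -n` for the user that runs the server. You normally need to run the helper as that user or as root to read the descriptor list.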

Problems When Running Multiple WebLogic Domains on One Host

If you are running multiple WebLogic domains on one host and you see errors such as web clients failing to load, JSP errors, or errors indicating that JSP pages can't be recompiled, you may not have set umask values properly to protect files in one WebLogic instance from other WebLogic instances.

Proxy Fails on a Clustered System

If a proxy fails on a clustered OSM system, all HTTP requests that would normally go through the proxy can no longer get to the OSM server. The problem could be with the physical host the server is running on, or it could be a problem with a standalone managed server that is not part of the cluster but is part of the domain.

To recover, restart the proxy.

Unable to Bring Up Managed Server After Database Failure

OSM JDBC data sources support global transactions through the logging last resource (LLR) WebLogic optimization. LLR transaction records must be available to resolve in-doubt transactions during recovery, which runs automatically at server startup. WebLogic does not start if the LLR table is unreachable. OSM also fails to start if the Oracle database is unreachable. In addition, if the database is Oracle RAC, both database nodes specified by the data source URLs must be reachable at startup. After WebLogic Server has started, failure of a database node is handled gracefully (processing fails over to the other node).

Orders Are Not Being Created on a Clustered System

If messages are being successfully added to the JMS queue but the corresponding orders are not being created in OSM, first ensure that the servers are running. If they are running, check to see whether the address has been set for your cluster. The Cluster Address field is located in the General tab of the settings for your cluster. If the cluster address is not set, or does not contain the correct values for your managed servers, OSM will not pick up orders from the JMS queue. Generally, this value is set when the domain is created, but it can be changed or removed manually, which can cause this problem to occur.

For more information about the correct value for a cluster address, see the discussion about configuring the WebLogic Server Domain in the OSM Installation Guide chapter on installing OSM in a clustered environment.

Problems Displaying Gantt Charts on Solaris or Linux Hosted Systems

To use an X server to display Gantt charts in the Task web client, you must configure the Java settings for Oracle WebLogic Server to avoid display problems, system instability, and performance problems.

See the discussion about enabling graphical displays in the post-installation section of the OSM Installation Guide.

JBoss Cache Timeouts

Long full garbage collections can cause JBoss timeout errors to appear in the log files.

OSM Fails to Process Orders Because of Metadata Errors

Metadata errors can occur in any cartridge with orchestration model entities and can cause order processing failures. Search for the string Metadata Errors in the Console view of the Cartridge Management editor in Design Studio. If you are not using Design Studio to deploy cartridges, look in the WebLogic Server logs for the same string.
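
If you are searching the WebLogic Server logs from the command line, a search along the following lines may help. This is a minimal sketch: the helper name find_metadata_errors is hypothetical, the log file paths are whatever your domain uses, and the -H option is assumed to be supported by your grep (it is in the GNU and BSD versions).

```sh
# Hypothetical helper: list log lines that contain the string "Metadata Errors",
# each prefixed with the file name and line number.
find_metadata_errors() {
    # -F: match the string literally, -n: show line numbers,
    # -H: always show the file name, even for a single file
    grep -F -n -H 'Metadata Errors' "$@"
}
```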

For more information, see the discussion of metadata errors in OSM Developer's Guide.

Error: "No Backend Servers Available"

If the error "No Backend Servers Available" is displayed, you are likely disconnected from your servers. Ensure your servers are connected and functional before continuing with OSM operations.

DataDictionary Expansion Level

If you are having issues deploying cartridges, the cause may be related to the DataDictionary expansion level. In Oracle Communications Design Studio, under Windows preferences, increase the DataDictionary expansion level to 10. In some cases, you may need to increase the level to more than 10.

Quick Fix Button Active During Order Template Conflicts in Design Studio

Conflicts can occur when order templates are created in Design Studio. Presently, Quick Fix does not work for order template conflicts, even if the Quick Fix button is active. All order template conflicts must be resolved manually.

Cannot Create New Orders on a New Cartridge Version

Order creation can fail on a new version of an existing cartridge, even after you have updated all required entities, and built and deployed the cartridge.

When the createOrder request fails, you receive a response like the following example:

<env:Envelope xmlns:env="http://schemas/soap/envelope/"> 
   <env:Header/> 
   <env:Body> 
      <env:Fault 
xmlns:ord="http://URL/communications/ordermanagement"> 
         <faultcode>ord:fault</faultcode> 
         <faultstring>Failed to create and start the order due to 
java.lang.RuntimeException: OMSException: encountered error 
starting orchestration caused by:Cannot find task for notification 
id</faultstring> 
         <faultactor>unknown</faultactor> 
         <detail> 
            <InvalidOrderSpecificationFault 
xmlns="http://URL/communications/ordermanagement"> 
               <Description>Failed to create and start the order due to 
java.lang.RuntimeException: OMSException: encountered error 
starting orchestration caused by:Cannot find task for notification 
id</Description> 
            </InvalidOrderSpecificationFault> 
         </detail> 
      </env:Fault> 
   </env:Body> 
</env:Envelope>

To resolve this issue:

Open the solution cartridge.
Click the Dependency tab of the model project.
Remove all the dependencies that are displayed for the project.
Re-add all the dependencies.
Restart Design Studio.

Error: "exact fetch returns more than requested number of rows"

You may see the error "exact fetch returns more than requested number of rows" in the log files if memory issues relating to very large orders cause contention in orchestration XQuery calls when multiple orchestration plans are running at the same time. The default orchestration plan concurrency level is 8. You can reduce this value as described below.

To resolve this issue, decrease the orchestration plan concurrency level on UNIX and Linux:

Back up the domain_home/bin/setDomainEnv.sh file by saving a copy with a different name.
Open the domain_home/bin/setDomainEnv.sh file in a text editor.
Search for the following line:

# IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values

Above the line that you searched for, add the following Java option:

export JAVA_OPTIONS="${JAVA_OPTIONS} -Doracle.communications.ordermanagement.orchestration.generation.model.ConcurrencyLevel=7"

# IF USER_MEM_ARGS the environment variable is set, use it to override ALL MEM_ARGS values

You can set the value lower if you continue to receive the error message.

Save and close the file.
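
To confirm which concurrency level a given set of Java options specifies, you could extract the value with a small helper such as the one sketched below. The function name get_concurrency_level is hypothetical. It prints the last occurrence of the option in the string you pass, or nothing if the option is absent.

```sh
# Hypothetical helper: print the orchestration plan concurrency level found in
# a JAVA_OPTIONS-style string, or nothing if the option is not present.
get_concurrency_level() {
    # Property name with dots escaped for use in a regular expression
    prop='oracle\.communications\.ordermanagement\.orchestration\.generation\.model\.ConcurrencyLevel'
    # Put each option on its own line, keep only the numeric value of the
    # matching -D option, and print the last occurrence.
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n "s/^-D${prop}=\([0-9][0-9]*\)\$/\1/p" | tail -n 1
}
```

For example, `get_concurrency_level "$JAVA_OPTIONS"` prints the configured value in a shell where JAVA_OPTIONS has been set as shown in the procedure above.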

Getting Help with OSM Problems

If you cannot resolve your problems with OSM, contact Oracle Technical Support.

Before You Contact Support

Problems can often be fixed by shutting down OSM and restarting the computer that OSM runs on.

If that does not resolve the issue, the first troubleshooting step is to look at the error log for the application or process that reported the problem. Consult "General Checklist for Resolving Problems" before reporting the problem to Oracle.

Reporting Problems

If "General Checklist for Resolving Problems" does not help you to resolve the problem, write down the pertinent information:

A clear and concise description of the problem, including when it began to occur.
Relevant portions of the log files.
Relevant configuration files, such as oms-config.xml.
Recent changes in your system, even if you do not think they are relevant.
List of all the OSM components and patches installed on your system.

When you are ready, report the problem to Oracle.