Oracle® Fusion Applications Performance and Tuning Guide 11g Release 1 (11.1.2) Part Number E16686-01
This chapter discusses how to find the information you need to examine so you can tune your system. It covers monitoring and tuning the database and Oracle Fusion Applications, as well as troubleshooting.
This chapter includes these sections:
Every system of hardware and installed applications is different. Even though Oracle Fusion Applications is written and installed using industry-standard best practices, you can custom-tailor your system to improve how it supports your environment.
To tune your system, however, you need to locate and examine data. This chapter explains what data you need to examine and what tools to use to gather that data.
In general, most of the default settings in Oracle Fusion Applications are already tuned.
These guidelines are provided to help ensure your Oracle Fusion Applications instance runs optimally. Note that all metrics listed are from Oracle Enterprise Manager Cloud Control.
Monitor the key host metrics, shown in Table 1-1, to ensure the underlying server hosts are healthy. Rather than constantly checking the metric values, you can set up alert thresholds in Cloud Control and receive notification when thresholds are exceeded.
Monitor the key component metrics, such as WebLogic server metrics, to ensure each component is healthy.
Monitor the number of incidents and logs to ensure the application is configured properly and not constantly wasting resources generating error messages. Review log levels to ensure they are not set too low. See "Troubleshooting Oracle Fusion Applications Using Incidents, Logs, QuickTrace, and Diagnostic Tests" in the Oracle Fusion Applications Administrator's Guide for more information.
Monitor the database to ensure it is operating optimally. Follow the guidelines in Chapter 3, "Tuning the Database," to make sure that statistics are being collected.
Table 1-1 Key Host Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Disk Activity | Disk Device Busy | >80% | >95% |
Filesystems | Filesystem Space Available | <20% | <5% |
Load | CPU in I/O wait | >60% | >80% |
Load | CPU Utilization | >80% | >95% |
Load | Run Queue (5 min average) | >2 | >4 | The run queue is normalized by the number of CPU cores.
Load | Swap Utilization | >75% | >90% |
Load | Total Processes | >15000 | >25000 |
Load | Logical Free Memory % | <20 | <10 |
Load | CPU in System Mode | >20% | >40% |
Network Interfaces Summary | All Network Interfaces Combined Utilization | >80% | >95% |
Switch/Swap Activity | Total System Swaps | >3 | >5 | Value is per second.
Paging Activity | Pages Paged-in (per second) | | | The combined value of Pages Paged-in and Pages Paged-out should be <=1000.
Paging Activity | Pages Paged-out (per second) | | |
The following suggestions describe further analysis administrators can undertake when a metric value exceeds its threshold. The commands provided are for the Linux operating system.
When logical free memory/swap activity or paging activity is beyond threshold
This usually happens when memory is not sufficient to handle demands from all the running processes.
Run cat /proc/meminfo and confirm that the total RAM is as expected.
Check if there are unallocated huge pages. If there are and the WebLogic Server/Oracle instances are not expected to use them, reduce the huge page pool size.
Run top and sort by resident memory (type O, then Q). Look for the processes using the most resident memory and investigate those processes.
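As a rough sketch of the arithmetic involved, the logical free memory percentage can be derived from /proc/meminfo. The values below are illustrative samples, not real measurements; on a live host, run the awk program directly against /proc/meminfo.

```shell
# Illustrative calculation of logical free memory % from /proc/meminfo.
# The sample values are assumptions; on a live host replace the echo
# with:  awk '...' /proc/meminfo
meminfo='MemTotal:       16384000 kB
MemFree:         2048000 kB
Cached:          4096000 kB'
echo "$meminfo" | awk '
/^MemTotal/ {total=$2}
/^MemFree/  {free=$2}
/^Cached/   {cached=$2}
END { printf "logical free: %.0f%%\n", 100*(free+cached)/total }'
# -> logical free: 38%
```

A result below the 20% warning threshold in Table 1-1 warrants the memory investigation steps above.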
When page activity is beyond threshold
Follow the steps in "When logical free memory/swap activity or paging activity is beyond threshold" to view and analyze memory usage.
When Network Interface Error Rate Is Beyond Threshold
The normal cause is a misconfiguration between the host and the network switch. A bad network card or cabling can also cause this error. Run /sbin/ifconfig to identify which interface is having packet errors. Contact the network administrator to ensure the host and the switch are using the same data rate and duplex mode.
Otherwise, check if cabling or the network card is faulty and replace as appropriate.
When Packet Loss Rate Is Beyond Threshold
The normal cause of this error is network saturation or bad network hardware.
Run lsof -Pni | grep ESTAB to determine which network paths are generating the problem.
Then run mtr <target host> or ping <target host> and look for packet loss on that segment. For example:
20 packets transmitted, 20 received, 0% packet loss, time 18997ms
rtt min/avg/max/mdev = 0.168/0.177/0.200/0.010 ms
The packet loss should be 0% and rtt should be less than .5 ms.
Ask the network monitoring staff to look for saturation or network packet loss from their side.
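For scripted monitoring, the loss percentage can be pulled out of the ping summary line. The sample line below is the one shown above; on a live host, pipe real ping output into the same sed expression.

```shell
# Extract the packet-loss percentage from a ping summary line.
# The sample line is an assumption taken from the example output above;
# on a live host use:  ping -c 20 <target host> | tail -2 | head -1
summary='20 packets transmitted, 20 received, 0% packet loss, time 18997ms'
loss=$(echo "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "loss=${loss}%"
# -> loss=0%
```

Any nonzero value here means the segment should be escalated to the network monitoring staff.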
When Network Utilization Is Beyond Threshold
The normal cause is very heavy application load.
Run top or lsof to determine which processes are moving a lot of data.
Use tcpdump to sample the network for usage patterns.
Use atop, iftop, ntop, or pkstat to see which processes are moving data.
When CPU Usage or Run Queue Length Is Beyond Threshold
The normal cause is runaway demand, a poorly performing application, or poor capacity planning.
Run top to identify which application/process is using CPU time.
If top processes are WebLogic Server JVM processes, conduct a basic WebLogic Server health check. That is, review logs to see if there are configuration errors causing excessive exceptions, and review metrics to see if the load has increased. Use JVMD for a more detailed analysis.
If top processes are Oracle processes, use Enterprise Manager to look for high load SQL.
When System CPU Usage Is Beyond Threshold
High system CPU use could be due to kernel processes looking for pages to swap out during a memory shortage. Follow the steps listed in "When logical free memory/swap activity or paging activity is beyond threshold" section to further diagnose the problem.
High system CPU use is also frequently related to various device failures. Run dmesg | less and look for repeated messages about errors on a particular device, and also have hardware support personnel check the hardware console to see if there are any errors reported.
When Filesystem Usage Is Beyond Threshold
The normal cause is an application that is logging excessively or leaving behind temporary files.
Run lsof -d 1-99999 | grep REG | sort -nrk 7 | less to see currently open files sorted by size from largest to smallest. Investigate the large files.
Run du -k /mount_point_running_out_of_space > /tmp/sizes to get the space used by directories under the mount point. This may take a long time. While it is running, run sort -nr /tmp/sizes and investigate the directories using the most space first.
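The du/sort workflow can be sketched end to end on a scratch directory tree. The tree below is created purely for demonstration; substitute the real mount point for $root when diagnosing an actual host.

```shell
# Demonstrate the du/sort workflow on a throwaway tree: the largest
# directories sort to the top. Substitute the real mount point for $root.
root=$(mktemp -d)
mkdir -p "$root/big" "$root/small"
dd if=/dev/zero of="$root/big/f"   bs=1024 count=200 2>/dev/null
dd if=/dev/zero of="$root/small/f" bs=1024 count=10  2>/dev/null
du -k "$root" | sort -nr | head -3   # largest directories listed first
rm -rf "$root"
```

On a real mount point, drop the head limit and work down the sorted list until the space consumers are found.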
When Total Processes Is Beyond Threshold
The normal cause is runaway code or a stuck NFS filesystem.
Run ps aux. If many processes are in status D, run df to check for stuck mounts.
If there are hundreds or thousands of processes of a particular program, determine why.
Run ps -eo pid,nlwp,cmd | sort -nrk 2 | head to look for processes with many threads.
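The sort step can be illustrated on canned ps output. The three rows below are assumed samples; in practice, pipe real `ps -eo pid,nlwp,cmd` output (minus its header) through the same sort.

```shell
# Rank processes by thread count (field 2 = NLWP) from sample ps output.
# The three rows are assumptions; feed real ps output in their place.
sample='  101  250 java -server
  202   12 httpd.worker
  303    1 sshd'
echo "$sample" | sort -nrk 2 | head -1   # the most thread-heavy process
```

Here the JVM with 250 threads sorts to the top, which is where the investigation should start.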
When Disk Device Busy Is Beyond Threshold
Check for disk drive failure. As root, check /var/log/messages* and /var/log/mcelog for error messages indicating disk failure. For a RAID array, the disk controller needs to be checked; the commands are specific to the controller manufacturer.
Look for processes that are using the disk. From a shell window, execute ps aux | grep ' D. ' several consecutive times to look for processes with stat D.
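Counting state-D (uninterruptible sleep) processes makes the repeated checks easier to compare. The sample rows below stand in for real `ps aux` output, whose fifth column here is the process state.

```shell
# Count state-D (uninterruptible sleep) processes from sample ps-style
# output; the rows and the column layout are assumptions for illustration.
# Run repeatedly against real ps output to spot tasks stuck on disk I/O.
sample='root  10  0.0  0.0  S  kworker
root  11  0.0  0.0  D  flush-8:0
oracle 12  0.1  0.2  D  ora_dbw0'
echo "$sample" | awk '$5 ~ /^D/ {n++} END {print n+0, "processes in state D"}'
# -> 2 processes in state D
```

A count that stays high across several samples points to a device or mount that is blocking I/O.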
Poor performance is a major indicator of network connectivity problems.
Check for cumulative dropped packets for each host:
netstat -s | grep 'TCP data loss'
4007 segments retransmited
3302 TCP data loss events
These counts should be 0 or growing very slowly over time.
Check for realtime dropped packets on specific network paths.
ping -c 20 other_host
20 packets transmitted, 20 received, 0% packet loss, time 18997ms
rtt min/avg/max/mdev = 0.168/0.177/0.200/0.010 ms
Packet loss should be 0%.
rtt should be less than .5 ms, except that it can be higher between the browser and load balancer.
Check for network interface errors.
/sbin/ifconfig eth0 | grep errors
RX packets:842803463 errors:0 dropped:0 overruns:0 frame:0
TX packets:667946307 errors:0 dropped:0 overruns:0 carrier:0
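For automated checks, the error counter can be extracted from an ifconfig line with sed. The sample line below uses an assumed nonzero error count for illustration.

```shell
# Pull the error counter out of an ifconfig RX/TX line. The sample line
# (with an assumed count of 5) is illustrative; on a live host use:
#   /sbin/ifconfig eth0 | grep errors
line='RX packets:842803463 errors:5 dropped:0 overruns:0 frame:0'
echo "$line" | sed -n 's/.*errors:\([0-9]*\).*/\1/p'
# -> 5
```

These counters should be 0 or growing very slowly; anything else points back at the duplex/cabling checks above.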
These metrics provide an indication of whether the WebLogic Server is in a healthy state. Performance may degrade if any metric exceeds its threshold.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide and Table 1-2.
Table 1-2 Key WebLogic Server Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Datasource Metrics | Connections in Use | >250 | >400 |
Datasource Metrics | Connection Requests that Waited (%) | >10% | >20% |
Datasource Metrics | Connection Creation Time (ms) | | |
JVM Garbage Collectors | Garbage Collector - Percent Time spent (elapsed) | >10% | >20% |
JVM Metrics | Heap Usage | >90% | >98% |
Response | Status | | =Down | This provides instance availability.
Server Servlet/JSP Metrics | Request Processing Time (ms) | >10s | >15s |
Server Work Manager Metrics | Work Manager Stuck Threads | >5 | >10 |
JVM Threads | Deadlocked Threads | >2 | >5 |
Module Metrics By Server | Active Sessions | | |
When CPU Usage On Host Is Beyond Threshold and WebLogic Server Process Is Identified as Top CPU Consumer
Examine the % Time spent in the GC metric to see if JVM is doing excessive GC (>60 percent). If so, follow the process for diagnosing WebLogic Server heap pressure.
Look for incident creation rate and error logs and see if something is triggering a massive amount of logging/errors.
In JVMD, select the CPU state filter and look at top methods. Look for threads that are consistently in a CPU state.
When There Is a Spike in Active Web Sessions
Check access logs to see if there is a spike in the number of users.
Check if there are stuck threads, which could cause users to log in again.
Check session distribution across WebLogic Server managed servers and see if there is a problem with the load balancer.
Check session timeout in web.xml, and see if it is too high or too low.
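The configured session timeout can be read out of web.xml with a one-line sed. The fragment below is a minimal assumed sample; point the same pipeline at the deployed application's web.xml.

```shell
# Extract the session timeout (in minutes) from web.xml content.
# The fragment and the 35-minute value are assumptions for illustration;
# on a live system, cat the deployed web.xml into this pipeline.
webxml='<session-config><session-timeout>35</session-timeout></session-config>'
echo "$webxml" | sed -n 's/.*<session-timeout>\([0-9]*\)<\/session-timeout>.*/\1/p'
# -> 35
```

Compare the extracted value against the expected site standard: too high holds sessions (and memory) open; too low forces users to log in repeatedly.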
When There Are Stuck Threads On the System
Get the ECID from the stuck thread error in the WebLogic Server log.
From the Request Monitor, search for the ECID and get details from JVMD.
Alternatively, use JVMD to search for stuck threads and see the timing breakdown.
A stuck thread will also result in an incident with a JFR recording. Use JRMC to analyze the recording.
When There Are Deadlocks Detected On the System
In JVMD, inspect the threads that are in a blocked state.
Deadlock threads normally also will be reported as a stuck thread in the WebLogic Server log. Use the Request Monitor to search for the ECID and expand down into JVMD to show the blocking thread.
When Request Processing Time Is Beyond Threshold
Examine the % Time spent in GC metric to see if the JVM is doing excessive garbage collection.
Look for the incident creation rate and error logs and see if something is triggering a massive amount of logging/errors.
In JVMD, look at the thread states and see where most processing time is going.
When Percent Time Spent in GC Is Beyond Threshold
Check the session count. If there is a sudden surge of sessions due to user load, the JVM could be short on heap. Increase heap if possible, or add additional managed server instances.
Look at the stuck thread count. Stuck threads could increase the number of active sessions, as users could be launching new sessions hoping for a faster response.
Look at the incident creation rate and error logs and see if something is triggering a massive amount of logging/errors. The incident creation/logging operations could be causing a high amount of object creation and garbage collection stress.
Generate a heap dump using JVMD and analyze the top retainer of memory.
Use JRMC to connect and extract a JFR recording. Examine the Memory panel and allocation details to see what is doing a lot of allocations.
When Percent Connection Requests Waiting Is Beyond Threshold
Examine the number of sessions and request rate, and see if there is a spike in the load that would account for an increased demand for connections.
In JVMD, see where time is spent. For example, requests could be running longer due to slow SQLs (and retain the connection longer). In that case, identify and tune slow SQLs.
Consider increasing the initial capacity setting of the corresponding data source.
These metrics provide an indication of whether the Oracle HTTP Server is in a healthy state. Performance may degrade if any metric exceeds its threshold.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide and Table 1-3.
Table 1-3 Oracle HTTP Server Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold
---|---|---|---
OHS Server Metrics | Busy Threads (%) | >85% | >95%
OHS Server Metrics | Request Throughput (requests per second) | TBD | Yes
OHS Response Code Metrics | HTTP 4xx errors | |
OHS Response Code Metrics | HTTP 5xx errors | |
OHS Virtual Host Metrics | Request Processing Time for a Virtual Host | >10s | >15s
When Busy Threads % Is Beyond Threshold
Check request throughput to see if load has increased. If the increased load is expected and CPU and memory resources on the OHS host have not exceeded their thresholds, consider increasing ServerLimit/MaxClients and ThreadsPerChild in httpd.conf.
Check request process time on both OHS and underlying WebLogic Server to see if requests are taking longer. If WebLogic Server response time is increasing, check the key metrics for the WebLogic Server.
If possible, ensure the client browser cache is enabled to reduce number of requests submitted.
Check OHS Response Code Metrics. If there is a sudden increase of HTTP 4xx errors or HTTP 5xx errors, check the health of the underlying WebLogic Servers.
Check and increase the minimum and maximum spare threads for Oracle HTTP Server.
In the httpd.conf file located at instance_home/config/ohs/<ohs_name>/httpd.conf:
Increase MaxSpareThreads to 800.
Increase MinSpareThreads to 200.
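The two directive changes can be scripted with sed. The block below demonstrates the substitutions on a sample fragment with assumed starting values; against the real file you would run the same sed with -i (and a backup suffix) on the httpd.conf path given above.

```shell
# Rewrite the spare-thread directives with sed, demonstrated on a sample
# fragment (the starting values 25/75 are assumptions). Against the real
# httpd.conf, run the same expressions with:  sed -i.bak -e ... <path>
conf='MinSpareThreads 25
MaxSpareThreads 75'
echo "$conf" | sed -e 's/^MinSpareThreads .*/MinSpareThreads 200/' \
                   -e 's/^MaxSpareThreads .*/MaxSpareThreads 800/'
# -> MinSpareThreads 200
# -> MaxSpareThreads 800
```

Restart OHS after the edit so the new spare-thread limits take effect.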
When Request Processing Time for a Virtual Host Exceeds Threshold
Check the key host metrics to ensure the OHS host is healthy.
For each URL requested, OHS will first check DocumentRoot before passing the request to WebLogic Server. Check the utilization and health of the disk to which the DocumentRoot is pointing. If it is an NFS mount, check the health of the NFS mount point.
Check the key metrics for the underlying WebLogic Server(s) and see if they are healthy.
OHS accesses /tmp for each POST request, so check the performance of the /tmp filesystem.
These metrics provide an indication of whether the Oracle Business Intelligence Server is in a healthy state.
To start monitoring:
Log in to Oracle Enterprise Manager Fusion Applications Control.
Open Business Intelligence > coreapplication > Business Intelligence Instance > Monitoring > Performance, as shown in Figure 1-1.
Use Fusion Applications Control to configure parameters related to Oracle Business Intelligence Suite Enterprise Edition.
Fusion Applications Control can monitor various BI components, including:
WebLogic Analytics Application
Oracle BI Presentation Services
Oracle BI Server
Oracle WebLogic Server (administration and managed servers)
Oracle Access Manager and Oracle Identity Manager are both WebLogic Server instances. See Section 1.2.3, "How to Analyze WebLogic Server Metrics" to monitor their health.
Use Cloud Control to monitor the Oracle Internet Directory and Oracle Identity Manager databases.
These metrics provide an indication of whether the Enterprise Scheduler instance is performing well.
Table 1-4 Key Enterprise Scheduler Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Completed Job Summary | Average Elapsed Time (ms) | | | You can define different thresholds for different job names.
Long Running Job | Elapsed Time (ms) | | |
WorkAssignment Metrics aggregated across Group Members | Average Wait Time for Requests in Ready State (seconds) | | |
When the Value of Average Elapsed Time for the Completed Jobs Is Higher Than Expected
Check the key host/WebLogic Server metrics and see if any component that could be involved in processing batch jobs is in an unhealthy state.
Follow the steps listed in the "Troubleshooting Slow Batch Job" section and analyze a few jobs to see if there are any common causes.
When the Value of Elapsed Time Under the Long Running Job Category Is Higher Than Expected
Open the Enterprise Scheduler home page in Oracle Enterprise Manager Fusion Applications Control and examine the Top 10 Long Running Jobs.
Identify the job of interest, and follow the steps in the "Troubleshooting Slow Batch Job" section.
When Average Wait Time For Requests in Ready State (seconds) Is Higher Than Expected
Follow the steps in "Troubleshooting Jobs that are in Wait/Ready/Blocked state for a long time" section.
Monitoring SOA involves monitoring the SOA infrastructure, SOA composites, and SOA servers.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide.
Follow the steps in this section to tune Oracle Identity Management specifically for Oracle Fusion Applications.
Optimize LDAP Search
Description: Optimize LDAP search by enabling search filters.
Solution:
Create an LDIF file named searchfilter_oid_tuning.ldif with this content:
dn: cn=dsaconfig, cn=configsets, cn=oracle internet directory
changetype: modify
add: orclinmemfiltprocess;dn
orclinmemfiltprocess;dn: cn=Roles,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Roles,cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Roles,cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permissions,cn=JAAS Policy,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
At the command prompt, run this command:
ldapmodify -p portNum -h hostname -D cn=orcladmin -f searchfilter_oid_tuning.ldif
Log Levels
Description: Oracle Identity Management stack WebLogic Server log levels are too fine-grained and need to be set to Severe.
Solution: In all WebLogic Servers in the Oracle Identity Management domain, change log levels to SEVERE. This is a two-part process.
Part 1: Manually edit the logging.xml files.
Edit the logging.xml file that is in each server directory of the Oracle Identity Management Domain domain, such as OAM_Server1, OIM_Server1, and SOA, and set level='SEVERE' for all log_handlers and loggers. The path to each logging.xml file will resemble:
$domain_home/config/fmwconfig/<servername>
Part 2: Edit the log levels in the Oracle WebLogic Server Administration Console:
Log in to the console (http://hostname:port/console).
Click the Servers link.
Click the desired server.
Click the Logging tab.
Scroll down and click the Advanced link.
In the Message destination(s) section, change the log levels as shown here:
Log file: Severity level: Warning
Standard out: Severity level: Error
Domain log broadcaster: Severity level: Error
Memory buffer: Severity level: Error
Save the changes.
Repeat this for all WebLogic Servers in the Oracle Identity Management stack, such as OAM_Server1, OIM_Server1, and SOA.
Click Activate Changes.
Restart the server.
Avoid Restarts of httpd-worker Processes
Description: These restarts affect the recreation of connections and threads in Oracle HTTP Server processes during varying load patterns.
Solution: Increase the minimum and maximum spare threads for Oracle HTTP Server.
In the httpd.conf file located at instance_home/config/ohs/<ohs_name>/httpd.conf:
Increase MaxSpareThreads to 800.
Increase MinSpareThreads to 200.
Tune Two OID Configuration Parameters
Description: Two OID configuration parameters, orclmaxcc and orclserverprocs, need to be appropriately tuned.
Solution: Change orclmaxcc to 10 and tune the number of OID processes:
Name the sample script config_oid_tuning.ldif. Set cn=oid1 to your component name; in a multi-component environment, this needs to be changed accordingly. Set orclserverprocs to the number of cores in the OID server being used.
dn: cn=oid1,cn=osdldapd,cn=subconfigsubentry
changetype: modify
replace: orclmaxcc
orclmaxcc: 10
orclserverprocs: <number of cores>
Apply the script by running this command at the command prompt:
ldapmodify -p portNum -h hostname -D cn=orcladmin -f config_oid_tuning.ldif
Enable Timing Logging
Description: Add parameters to enable timing logging for OID.
Solution:
Add this entry to the config.xml file in ./oid/user_projects/domains/oid_domain/config/ and the ./oim/user_projects/domains/oim_domain/config/ directories for each WebLogic Server in the Oracle Identity Management domain:
<web-server>
  <web-server-log>
    <file-name>logs/access.log.%yyyyMMdd%</file-name>
    <rotation-type>byTime</rotation-type>
    <number-of-files-limited>true</number-of-files-limited>
    <rotate-log-on-startup>true</rotate-log-on-startup>
    <buffer-size-kb>0</buffer-size-kb>
    <logging-enabled>true</logging-enabled>
    <elf-fields>date time time-taken bytes c-ip s-ip sc-status sc(X-ORACLE-DMS-ECID) cs-method cs-uri cs(User-Agent) cs(ECID-Context) cs(Proxy-Remote-User) cs(Proxy-Client-IP)</elf-fields>
    <log-file-format>extended</log-file-format>
    <log-time-in-gmt>false</log-time-in-gmt>
    <log-milli-seconds>true</log-milli-seconds>
  </web-server-log>
</web-server>
To set the access log format, add this string to the httpd.conf file in the /u01/ohsauth/ohsauth_inst/config/OHS/ohs1 path:
LogFormat "%h %l %u %t \"%r\" %>s %b %D %{X-ORACLE-DMS-ECID}o" common
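Once the format above is in place, the %D field (response time in microseconds) can be averaged across log lines. The two log lines below are simplified samples (the %t timestamp is shown as one token); in practice, feed real access-log lines into the same awk.

```shell
# Average the %D (microseconds) field from sample access-log lines written
# in the LogFormat configured above. The sample lines and timings are
# assumptions; %D is the second-to-last field, before the ECID.
log='10.0.0.1 - - [t] "GET /a HTTP/1.1" 200 120 4500 ecid1
10.0.0.2 - - [t] "GET /b HTTP/1.1" 200 80 5500 ecid2'
echo "$log" | awk '{ sum += $(NF-1) } END { printf "avg response us: %.0f\n", sum/NR }'
# -> avg response us: 5000
```

A rising average here, with healthy host metrics, points to the timing-log analysis this section enables.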
If you are using the Solaris SPARC or the IBM AIX operating system, Oracle recommends that you incorporate these settings.
Solaris SPARC
Incorporate this setting for best performance when using Hotspot JVM:
-XX:+UseParallelOldGC -XX:ParallelGCThreads=4
IBM AIX
Incorporate this setting for best performance when using the IBM J9 JVM:
-Xgcpolicy:gencon -Xcompressedrefs -XtlhPrefetch