Oracle® Fusion Applications Performance and Tuning Guide 11g Release 1 (11.1.2) Part Number E16686-01
This chapter discusses how to find the information you need to examine so you can tune your system. It covers monitoring and tuning the database and Oracle Fusion Applications, as well as troubleshooting.
This chapter includes these sections:
Every system of hardware and installed applications is different. Even though Oracle Fusion Applications is written and installed using industry-standard best practices, you can custom-tailor your system to improve how it supports your environment.
To tune your system, however, you need to locate and examine data. This chapter explains what data you need to examine and what tools to use to gather that data.
In general, most of the default settings in Oracle Fusion Applications are already tuned.
These guidelines are provided to help ensure your Oracle Fusion Applications instance runs optimally. Note that all metrics listed are from Oracle Enterprise Manager Cloud Control.
Monitor the key host metrics, shown in Table 1-1, to ensure the underlying server hosts are healthy. Rather than constantly checking the metric values, you can set up alert thresholds in Cloud Control and receive notification when thresholds are exceeded.
Monitor the key component metrics, such as WebLogic server metrics, to ensure each component is healthy.
Monitor the number of incidents and logs to ensure the application is configured properly and not constantly wasting resources generating error messages. Review log levels to ensure they are not set too low. See "Troubleshooting Oracle Fusion Applications Using Incidents, Logs, QuickTrace, and Diagnostic Tests" in the Oracle Fusion Applications Administrator's Guide for more information.
Monitor the database to ensure it is operating optimally. Follow the guidelines in Chapter 3, "Tuning the Database," to make sure that statistics are being collected.
Table 1-1 Key Host Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Disk Activity | Disk Device Busy | >80% | >95% |
Filesystems | Filesystem Space Available | <20% | <5% |
Load | CPU in I/O wait | >60% | >80% |
Load | CPU Utilization | >80% | >95% |
Load | Run Queue (5 min average) | >2 | >4 | The run queue is normalized by the number of CPU cores.
Load | Swap Utilization | >75% | >90% |
Load | Total Processes | >15000 | >25000 |
Load | Logical Free Memory % | <20 | <10 |
Load | CPU in System Mode | >20% | >40% |
Network Interfaces Summary | All Network Interfaces Combined Utilization | >80% | >95% |
Switch/Swap Activity | Total System Swaps | >3 | >5 | Value is per second.
Paging Activity | Pages Paged-in (per second) | | | The combined value of Pages Paged-in and Pages Paged-out should be <=1000.
Paging Activity | Pages Paged-out (per second) | | |
The following suggestions describe further analysis administrators can undertake when a metric value exceeds its threshold. The commands provided are for the Linux operating system.
When logical free memory/swap activity or paging activity is beyond threshold
This usually happens when memory is not sufficient to handle demands from all the running processes.
Run cat /proc/meminfo and confirm that the total RAM is as expected.
Check if there are unallocated huge pages. If there are and the WebLogic Server/Oracle instances are not expected to use them, reduce the huge page pool size.
Run top and sort by resident memory (type O, then Q). Look for the processes using the most resident memory and investigate those processes.
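As a rough sketch of the arithmetic involved, the logical free memory percentage can be derived from /proc/meminfo. The values below are illustrative samples, not real measurements; on a live host, run the awk program directly against /proc/meminfo.

```shell
# Illustrative calculation of logical free memory % from /proc/meminfo.
# The sample values are assumptions; on a live host replace the echo
# with:  awk '...' /proc/meminfo
meminfo='MemTotal:       16384000 kB
MemFree:         2048000 kB
Cached:          4096000 kB'
echo "$meminfo" | awk '
/^MemTotal/ {total=$2}
/^MemFree/  {free=$2}
/^Cached/   {cached=$2}
END { printf "logical free: %.0f%%\n", 100*(free+cached)/total }'
# -> logical free: 38%
```

A result below the 20% warning threshold in Table 1-1 warrants the memory investigation steps above.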
When page activity is beyond threshold
Follow the steps in "When logical free memory/swap activity or paging activity is beyond threshold" to view and analyze memory usage.
When Network Interface Error Rate Is Beyond Threshold
The normal cause is a misconfiguration between the host and the network switch. A bad network card or cabling can also cause this error. Run /sbin/ifconfig to identify which interface is having packet errors. Contact the network administrator to ensure the host and the switch are using the same data rate and duplex mode.
Otherwise, check if cabling or the network card is faulty and replace as appropriate.
When Packet Loss Rate Is Beyond Threshold
The normal cause of this error is network saturation or bad network hardware.
Run lsof -Pni | grep ESTAB to determine which network paths are generating the problem.
Then run mtr <target host> or ping <target host> and look for packet loss on that segment. For example:
20 packets transmitted, 20 received, 0% packet loss, time 18997ms
rtt min/avg/max/mdev = 0.168/0.177/0.200/0.010 ms
The packet loss should be 0% and rtt should be less than .5 ms.
Ask the network monitoring staff to look for saturation or network packet loss from their side.
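For scripted monitoring, the loss percentage can be pulled out of the ping summary line. The sample line below is the one shown above; on a live host, pipe real ping output into the same sed expression.

```shell
# Extract the packet-loss percentage from a ping summary line.
# The sample line is an assumption taken from the example output above;
# on a live host use:  ping -c 20 <target host> | tail -2 | head -1
summary='20 packets transmitted, 20 received, 0% packet loss, time 18997ms'
loss=$(echo "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "loss=${loss}%"
# -> loss=0%
```

Any nonzero value here means the segment should be escalated to the network monitoring staff.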
When Network Utilization Is Beyond Threshold
The normal cause is very heavy application load.
Run top or lsof to determine which processes are moving a lot of data.
Use tcpdump to sample the network for usage patterns.
Use atop, iftop, ntop, or pkstat to see which processes are moving data.
When CPU Usage or Run Queue Length Is Beyond Threshold
The normal cause is runaway demand, a poorly performing application, or poor capacity planning.
Run top to identify which application/process is using CPU time.
If top processes are WebLogic Server JVM processes, conduct a basic WebLogic Server health check. That is, review logs to see if there are configuration errors causing excessive exceptions, and review metrics to see if the load has increased. Use JVMD for a more detailed analysis.
If top processes are Oracle processes, use Enterprise Manager to look for high load SQL.
When System CPU Usage Is Beyond Threshold
High system CPU use could be due to kernel processes looking for pages to swap out during a memory shortage. Follow the steps listed in "When logical free memory/swap activity or paging activity is beyond threshold" section to further diagnose the problem.
High system CPU use is also frequently related to various device failures. Run dmesg | less and look for repeated messages about errors on a particular device, and also have hardware support personnel check the hardware console to see if there are any errors reported.
When Filesystem Usage Is Beyond Threshold
The normal cause is an application that is logging excessively or leaving behind temporary files.
Run lsof -d 1-99999 | grep REG | sort -nrk 7 | less to see currently open files sorted by size from largest to smallest. Investigate the large files.
Run du -k /mount_point_running_out_of_space > /tmp/sizes to get the space used by directories under the mount point. This may take a long time. While it is running, run sort -nr /tmp/sizes and investigate the directories using the most space first.
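The du/sort workflow can be sketched end to end on a scratch directory tree. The tree below is created purely for demonstration; substitute the real mount point for $root when diagnosing an actual host.

```shell
# Demonstrate the du/sort workflow on a throwaway tree: the largest
# directories sort to the top. Substitute the real mount point for $root.
root=$(mktemp -d)
mkdir -p "$root/big" "$root/small"
dd if=/dev/zero of="$root/big/f"   bs=1024 count=200 2>/dev/null
dd if=/dev/zero of="$root/small/f" bs=1024 count=10  2>/dev/null
du -k "$root" | sort -nr | head -3   # largest directories listed first
rm -rf "$root"
```

On a real mount point, drop the head limit and work down the sorted list until the space consumers are found.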
When Total Processes Is Beyond Threshold
The normal cause is runaway code or a stuck NFS filesystem.
Run ps aux. If many processes are in status D, run df to check for stuck mounts.
If there are hundreds or thousands of processes of a particular program, determine why.
Run ps -eo pid,nlwp,cmd | sort -nrk 2 | head to look for processes with many threads.
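The sort step can be illustrated on canned ps output. The three rows below are assumed samples; in practice, pipe real `ps -eo pid,nlwp,cmd` output (minus its header) through the same sort.

```shell
# Rank processes by thread count (field 2 = NLWP) from sample ps output.
# The three rows are assumptions; feed real ps output in their place.
sample='  101  250 java -server
  202   12 httpd.worker
  303    1 sshd'
echo "$sample" | sort -nrk 2 | head -1   # the most thread-heavy process
```

Here the JVM with 250 threads sorts to the top, which is where the investigation should start.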
When Disk Device Busy Is Beyond Threshold
Check for disk drive failure. As root, check /var/log/messages* and /var/log/mcelog for error messages indicating disk failure. For a RAID array, the disk controller needs to be checked; the commands are specific to the controller manufacturer.
Look for processes that are using the disk. From a shell window, execute ps aux | grep ' D. ' several consecutive times to look for processes with stat D.
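Counting state-D (uninterruptible sleep) processes makes the repeated checks easier to compare. The sample rows below stand in for real `ps aux` output, whose fifth column here is the process state.

```shell
# Count state-D (uninterruptible sleep) processes from sample ps-style
# output; the rows and the column layout are assumptions for illustration.
# Run repeatedly against real ps output to spot tasks stuck on disk I/O.
sample='root  10  0.0  0.0  S  kworker
root  11  0.0  0.0  D  flush-8:0
oracle 12  0.1  0.2  D  ora_dbw0'
echo "$sample" | awk '$5 ~ /^D/ {n++} END {print n+0, "processes in state D"}'
# -> 2 processes in state D
```

A count that stays high across several samples points to a device or mount that is blocking I/O.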
Poor performance is a major indicator of network connectivity problems.
Check for cumulative dropped packets for each host:
netstat -s | grep 'TCP data loss'
4007 segments retransmited
3302 TCP data loss events
These counts should be 0 or growing very slowly over time.
Check for realtime dropped packets on specific network paths.
ping -c 20 other_host
20 packets transmitted, 20 received, 0% packet loss, time 18997ms
rtt min/avg/max/mdev = 0.168/0.177/0.200/0.010 ms
Packet loss should be 0%.
rtt should be less than .5 ms, except that it can be higher between the browser and load balancer.
Check for network interface errors.
/sbin/ifconfig eth0 | grep errors
RX packets:842803463 errors:0 dropped:0 overruns:0 frame:0
TX packets:667946307 errors:0 dropped:0 overruns:0 carrier:0
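For automated checks, the error counter can be extracted from an ifconfig line with sed. The sample line below uses an assumed nonzero error count for illustration.

```shell
# Pull the error counter out of an ifconfig RX/TX line. The sample line
# (with an assumed count of 5) is illustrative; on a live host use:
#   /sbin/ifconfig eth0 | grep errors
line='RX packets:842803463 errors:5 dropped:0 overruns:0 frame:0'
echo "$line" | sed -n 's/.*errors:\([0-9]*\).*/\1/p'
# -> 5
```

These counters should be 0 or growing very slowly; anything else points back at the duplex/cabling checks above.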
These metrics provide an indication of whether the WebLogic Server is in a healthy state. Performance may degrade if any metric exceeds its threshold.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide and Table 1-2.
Table 1-2 Key WebLogic Server Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Datasource Metrics | Connections in Use | >250 | >400 |
Datasource Metrics | Connection Requests that Waited (%) | >10% | >20% |
Datasource Metrics | Connection Creation Time (ms) | | |
JVM Garbage Collectors | Garbage Collector - Percent Time spent (elapsed) | >10% | >20% |
JVM Metrics | Heap Usage | >90% | >98% |
Response | Status | | =Down | This provides instance availability.
Server Servlet/JSP Metrics | Request Processing Time (ms) | >10s | >15s |
Server Work Manager Metrics | Work Manager Stuck Threads | >5 | >10 |
JVM Threads | Deadlocked Threads | >2 | >5 |
Module Metrics By Server | Active Sessions | | |
When CPU Usage On Host Is Beyond Threshold and WebLogic Server Process Is Identified as Top CPU Consumer
Examine the % Time spent in the GC metric to see if JVM is doing excessive GC (>60 percent). If so, follow the process for diagnosing WebLogic Server heap pressure.
Look for incident creation rate and error logs and see if something is triggering a massive amount of logging/errors.
In JVMD, select the CPU state filter and look at top methods. Look for threads that are consistently in a CPU state.
When There Is a Spike in Active Web Sessions
Check access logs to see if there is a spike in the number of users.
Check if there are stuck threads, which could cause users to log in again.
Check session distribution across WebLogic Server managed servers and see if there is a problem with the load balancer.
Check session timeout in web.xml, and see if it is too high or too low.
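The configured session timeout can be read out of web.xml with a one-line sed. The fragment below is a minimal assumed sample; point the same pipeline at the deployed application's web.xml.

```shell
# Extract the session timeout (in minutes) from web.xml content.
# The fragment and the 35-minute value are assumptions for illustration;
# on a live system, cat the deployed web.xml into this pipeline.
webxml='<session-config><session-timeout>35</session-timeout></session-config>'
echo "$webxml" | sed -n 's/.*<session-timeout>\([0-9]*\)<\/session-timeout>.*/\1/p'
# -> 35
```

Compare the extracted value against the expected site standard: too high holds sessions (and memory) open; too low forces users to log in repeatedly.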
When There Are Stuck Threads On the System
Get the ECID from the stuck thread error in the WebLogic Server log.
From the Request Monitor, search for the ECID and get details from JVMD.
Alternatively, use JVMD to search for stuck threads and see the timing breakdown.
A stuck thread will also result in an incident with a JFR recording. Use JRMC to analyze the recording.
When There Are Deadlocks Detected On the System
In JVMD, inspect the threads that are in a blocked state.
Deadlock threads normally also will be reported as a stuck thread in the WebLogic Server log. Use the Request Monitor to search for the ECID and expand down into JVMD to show the blocking thread.
When Request Processing Time Is Beyond Threshold
Examine the % Time spent in GC metric to see if the JVM is doing excessive garbage collection.
Look for the incident creation rate and error logs and see if something is triggering a massive amount of logging/errors.
In JVMD, look at the thread states and see where most processing time is going.
When Percent Time Spent in GC Is Beyond Threshold
Check the session count. If there is a sudden surge of sessions due to user load, the JVM could be short on heap. Increase heap if possible, or add additional managed server instances.
Look at the stuck thread count. Stuck threads could increase the number of active sessions, as users could be launching new sessions hoping for a faster response.
Look at the incident creation rate and error logs and see if something is triggering a massive amount of logging/errors. The incident creation/logging operations could be causing a high amount of object creation and garbage collection stress.
Generate a heap dump using JVMD and analyze the top retainer of memory.
Use JRMC to connect and extract a JFR recording. Examine the Memory panel and allocation details to see what is doing a lot of allocations.
When Percent Connection Requests Waiting Is Beyond Threshold
Examine the number of sessions and request rate, and see if there is a spike in the load that would account for an increased demand for connections.
In JVMD, see where time is spent. For example, requests could be running longer due to slow SQLs (and retain the connection longer). In that case, identify and tune slow SQLs.
Consider increasing the initial capacity setting of the corresponding data source.
These metrics provide an indication of whether the Oracle HTTP Server is in a healthy state. Performance may degrade if any metric exceeds its threshold.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide and Table 1-3.
Table 1-3 Oracle HTTP Server Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold
---|---|---|---
OHS Server Metrics | Busy Threads (%) | >85% | >95%
OHS Server Metrics | Request Throughput (requests per second) | TBD | Yes
OHS Response Code Metrics | HTTP 4xx errors | |
OHS Response Code Metrics | HTTP 5xx errors | |
OHS Virtual Host Metrics | Request Processing Time for a Virtual Host | >10s | >15s
When Busy Threads % Is Beyond Threshold
Check request throughput to see if load has increased. If the increased load is expected and CPU and memory resources on the OHS host have not exceeded their thresholds, consider increasing ServerLimit/MaxClients and ThreadsPerChild in httpd.conf.
Check request process time on both OHS and underlying WebLogic Server to see if requests are taking longer. If WebLogic Server response time is increasing, check the key metrics for the WebLogic Server.
If possible, ensure the client browser cache is enabled to reduce number of requests submitted.
Check OHS Response Code Metrics. If there is a sudden increase of HTTP 4xx errors or HTTP 5xx errors, check the health of the underlying WebLogic Servers.
Check and increase the minimum and maximum spare threads for Oracle HTTP Server.
In the httpd.conf file located at instance_home/config/ohs/<ohs_name>/httpd.conf:
Increase MaxSpareThreads to 800.
Increase MinSpareThreads to 200.
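The two directive changes can be scripted with sed. The block below demonstrates the substitutions on a sample fragment with assumed starting values; against the real file you would run the same sed with -i (and a backup suffix) on the httpd.conf path given above.

```shell
# Rewrite the spare-thread directives with sed, demonstrated on a sample
# fragment (the starting values 25/75 are assumptions). Against the real
# httpd.conf, run the same expressions with:  sed -i.bak -e ... <path>
conf='MinSpareThreads 25
MaxSpareThreads 75'
echo "$conf" | sed -e 's/^MinSpareThreads .*/MinSpareThreads 200/' \
                   -e 's/^MaxSpareThreads .*/MaxSpareThreads 800/'
# -> MinSpareThreads 200
# -> MaxSpareThreads 800
```

Restart OHS after the edit so the new spare-thread limits take effect.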
When Request Processing Time for a Virtual Host Exceeds Threshold
Check the key host metrics to ensure the OHS host is healthy.
For each URL requested, OHS will first check DocumentRoot before passing the request to WebLogic Server. Check the utilization and health of the disk to which the DocumentRoot is pointing. If it is an NFS mount, check the health of the NFS mount point.
Check the key metrics for the underlying WebLogic Server(s) and see if they are healthy.
OHS accesses /tmp for each POST request, so check the performance of the /tmp filesystem.
These metrics provide an indication of whether the Oracle Business Intelligence Server is in a healthy state.
To start monitoring:
Log in to Oracle Enterprise Manager Fusion Applications Control.
Open Business Intelligence > coreapplication > Business Intelligence Instance > Monitoring > Performance, as shown in Figure 1-1.
Use Fusion Applications Control to configure parameters related to Oracle Business Intelligence Suite Enterprise Edition.
Fusion Applications Control can monitor various BI components, including:
WebLogic Analytics Application
Oracle BI Presentation Services
Oracle BI Server
Oracle WebLogic Server (administration and managed servers)
Oracle Access Manager and Oracle Identity Manager are both WebLogic Server instances. See Section 1.2.3, "How to Analyze WebLogic Server Metrics" to monitor their health.
Use Cloud Control to monitor the Oracle Internet Directory and Oracle Identity Manager databases.
These metrics provide an indication of whether the Enterprise Scheduler instance is performing well.
Table 1-4 Key Enterprise Scheduler Metrics
Metric Category | Metric Name | Warning Threshold | Critical Threshold | Comments
---|---|---|---|---
Completed Job Summary | Average Elapsed Time (ms) | | | You can define different thresholds for different job names.
Long Running Job | Elapsed Time (ms) | | |
WorkAssignment Metrics aggregated across Group Members | Average Wait Time for Requests in Ready State (seconds) | | |
When the Value of Average Elapsed Time for the Completed Jobs Is Higher Than Expected
Check the key host/WebLogic Server metrics and see if any component that could be involved in processing batch jobs is in an unhealthy state.
Follow the steps listed in the "Troubleshooting Slow Batch Job" section and analyze a few jobs to see if there are any common causes.
When the Value of Elapsed Time Under the Long Running Job Category Is Higher Than Expected
Open the Enterprise Scheduler home page in Oracle Enterprise Manager Fusion Applications Control and examine the Top 10 Long Running Jobs.
Identify the job of interest, and follow the steps in the "Troubleshooting Slow Batch Job" section.
When Average Wait Time For Requests in Ready State (seconds) Is Higher Than Expected
Follow the steps in "Troubleshooting Jobs that are in Wait/Ready/Blocked state for a long time" section.
Monitoring SOA involves monitoring the SOA infrastructure, SOA composites, and SOA servers.
See "Monitoring the Oracle Fusion Applications Middle Tier" in the Oracle Fusion Applications Administrator's Guide.
Follow the steps in this section to tune Oracle Identity Management specifically for Oracle Fusion Applications.
Optimize LDAP Search
Description: Optimize LDAP search by enabling search filters.
Solution:
Create an LDIF file named searchfilter_oid_tuning.ldif with this content:
dn: cn=dsaconfig, cn=configsets, cn=oracle internet directory
changetype: modify
add: orclinmemfiltprocess;dn
orclinmemfiltprocess;dn: cn=Roles,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Roles,cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Roles,cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permission Sets,cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=Permissions,cn=JAAS Policy,cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=hcm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=crm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
orclinmemfiltprocess;dn: cn=fscm,cn=FusionDomain,cn=JPSContext,cn=FusionAppsPolicies
At the command prompt, run this command:
ldapmodify -p portNum -h hostname -D cn=orcladmin -f searchfilter_oid_tuning.ldif
Log Levels
Description: Oracle Identity Management stack WebLogic Server log levels are too fine-grained and need to be set to Severe.
Solution: In all WebLogic Servers in the Oracle Identity Management domain, change log levels to SEVERE. This is a two-part process.
Part 1: Manually edit the logging.xml files.
Edit the logging.xml file that is in each server directory of the Oracle Identity Management Domain domain, such as OAM_Server1, OIM_Server1, and SOA, and set level='SEVERE' for all log_handlers and loggers. The path to each logging.xml file will resemble:
$domain_home/config/fmwconfig/<servername>
Part 2: Edit the log levels in the Oracle WebLogic Server Administration Console:
Log in to the console (http://hostname:port/console).
Click the Servers link.
Click the desired server.
Click the Logging tab.
Scroll down and click the Advanced link.
In the Message destination(s) section, change the log levels as shown here:
Log file: Severity level: Warning
Standard out: Severity level: Error
Domain log broadcaster: Severity level: Error
Memory buffer: Severity level: Error
Save the changes.
Repeat this for all WebLogic Servers in the Oracle Identity Management stack, such as OAM_Server1, OIM_Server1, and SOA.
Click Activate Changes.
Restart the server.
Avoid Restarts of httpd-worker Processes
Description: These restarts affect the recreation of connections and threads in Oracle HTTP Server processes during varying load patterns.
Solution: Increase the minimum and maximum spare threads for Oracle HTTP Server.
In the httpd.conf file located at instance_home/config/ohs/<ohs_name>/httpd.conf:
Increase MaxSpareThreads to 800.
Increase MinSpareThreads to 200.
Tune Two OID Configuration Parameters
Description: Two OID configuration parameters, orclmaxcc and orclserverprocs, need to be appropriately tuned.
Solution: Change orclmaxcc to 10 and tune the number of OID processes:
Name the sample script config_oid_tuning.ldif. Set cn=oid1 to your component name; in a multi-component environment, this needs to be changed accordingly. Set orclserverprocs to the number of cores in the OID server being used.
dn: cn=oid1,cn=osdldapd,cn=subconfigsubentry
changetype: modify
replace: orclmaxcc
orclmaxcc: 10
orclserverprocs: <number of cores>
Apply the script by running this command at the command prompt:
ldapmodify -p portNum -h hostname -D cn=orcladmin -f config_oid_tuning.ldif
Enable Timing Logging
Description: Add parameters to enable timing logging for OID.
Solution:
Add this entry to the config.xml file in ./oid/user_projects/domains/oid_domain/config/ and the ./oim/user_projects/domains/oim_domain/config/ directories for each WebLogic Server in the Oracle Identity Management domain:
<web-server>
  <web-server-log>
    <file-name>logs/access.log.%yyyyMMdd%</file-name>
    <rotation-type>byTime</rotation-type>
    <number-of-files-limited>true</number-of-files-limited>
    <rotate-log-on-startup>true</rotate-log-on-startup>
    <buffer-size-kb>0</buffer-size-kb>
    <logging-enabled>true</logging-enabled>
    <elf-fields>date time time-taken bytes c-ip s-ip sc-status sc(X-ORACLE-DMS-ECID) cs-method cs-uri cs(User-Agent) cs(ECID-Context) cs(Proxy-Remote-User) cs(Proxy-Client-IP)</elf-fields>
    <log-file-format>extended</log-file-format>
    <log-time-in-gmt>false</log-time-in-gmt>
    <log-milli-seconds>true</log-milli-seconds>
  </web-server-log>
</web-server>
To set the access log format, add this string to the httpd.conf file in the /u01/ohsauth/ohsauth_inst/config/OHS/ohs1 path:
LogFormat "%h %l %u %t \"%r\" %>s %b %D %{X-ORACLE-DMS-ECID}o" common
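Once the format above is in place, the %D field (response time in microseconds) can be averaged across log lines. The two log lines below are simplified samples (the %t timestamp is shown as one token); in practice, feed real access-log lines into the same awk.

```shell
# Average the %D (microseconds) field from sample access-log lines written
# in the LogFormat configured above. The sample lines and timings are
# assumptions; %D is the second-to-last field, before the ECID.
log='10.0.0.1 - - [t] "GET /a HTTP/1.1" 200 120 4500 ecid1
10.0.0.2 - - [t] "GET /b HTTP/1.1" 200 80 5500 ecid2'
echo "$log" | awk '{ sum += $(NF-1) } END { printf "avg response us: %.0f\n", sum/NR }'
# -> avg response us: 5000
```

A rising average here, with healthy host metrics, points to the timing-log analysis this section enables.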
If you are using the Solaris SPARC or the IBM AIX operating system, Oracle recommends that you incorporate these settings.
Solaris SPARC
Incorporate this setting for best performance when using Hotspot JVM:
-XX:+UseParallelOldGC -XX:ParallelGCThreads=4
IBM AIX
Incorporate this setting for best performance when using the IBM J9 JVM:
-Xgcpolicy:gencon -Xcompressedrefs -XtlhPrefetch