7.7 Database Performance Tuning - 5 Steps to Resolve Issues

For more information and videos, see https://blogs.oracle.com/database/post/database-performance-tuning.

Follow these step by step instructions to get started on your database performance tuning journey using Oracle Autonomous Health Framework.

Understanding Database Performance Tuning

Figure 7-32 Understanding Database Performance Tuning


This image illustrates understanding database performance tuning.

Common database performance issues fall into one of two categories. Either they are something affecting general database wide performance or they are query specific.

Database wide issues can be caused by:

  • A bottleneck with some form of resource, such as CPU, IO, Memory, Network or processes
  • A database misconfiguration of some type
  • A database hang of some form
  • Bugs

Query specific issues can be caused by:

  • Expensive SQL, where the structure of the query forces it to take a long time
  • Poor query optimization – This can be caused by things such as incorrect indexes, old statistics, and unexpected changes in execution plan, and so on.

Database Performance Tuning Steps

The basic Fix Flow steps for Database performance tuning are:

  • Collect database performance diagnostics
  • Use Autonomous Health Framework Insights
    • Compare configuration against best practices
    • Find and fix database anomalies and resource bottlenecks
  • Understand and protect against noisy neighbors
  • Use diagnostics to identify and tune poor performing database workloads
  • Log a new SR using the diagnostic collection

Use Autonomous Health Framework to collect database performance diagnostics

In the first step we’re going to use Autonomous Health Framework to generate a database performance diagnostic collection.

  1. Log into the machine where the database performance issue was seen and as the Oracle user run the command:
    tfactl diagcollect srdc –dbperf –database <database_name>

    Autonomous Health Framework will prompt if you have a performance issue now and then guide you through a series of questions and answers so it can collect all the necessary diagnostics.

    For example:
    $ tfactl diagcollect -srdc dbperf -database db23cdb1
    Do you have a performance issue now [Y|N] [Y]: Y
    Enter duration of the issue in hours [<RETURN>=1h]: 1h
    As you have indicated that the performance issue is currently happening,
    Performance Reports will be collected for the following periods:
    Start time when the performance was bad: 2024-05-01 15:13:32
    Stop time when the performance was bad: 2024-05-01 16:13:32
    For comparison, it is useful to gather data from another period with similar load where problems are not seen. Typically this is likely to be the same time period on a previous day. To compare to the same time period on a previous day enter the number of days ago you wish to use. [<RETURN> to provide other time range]: 1
    Start time when the performance was good 2024-04-30 15:13:32
    Stop time when the performance was good 2024-04-30 16:13:32
    Has any SQL been identified to contribute to the performance issue?[Y|N]: N
    Do you wish to take an AWR Dump as part of this collection? [Y|N]: N
    
    Ending AWR snapshot successfully created.
    Found 3 snapshot(s) for Bad Performance time range in  ORCL
    Found 3 snapshot(s) for baseline range in  ORCL
    "Automatic Workload Repository (AWR) is a licensed feature. Refer to My Oracle Support Document ID 1490798.1 for more information"
    
    Components included in this collection: DATABASE CHMOS CHA OS INSIGHT
    Preparing to execute support diagnostic scripts.
     Executing DB Script srdc_db_lfsdiag.sql on db23cdb1 with timeout of 120 seconds...
     Executing DB Script srdc_real_time_addm.sql on db23cdb1 with timeout of 120 seconds...
     Executing DB Script srdc_statsadvisor_report.sql on db23cdb1 with timeout of 300 seconds...
     Executing DB Script collect_logon_logoff_triggers.sql on db23cdb1 with timeout of 300 seconds...
     Executing OS Script get_perfhub_report with timeout of 600 seconds...
    Collecting data for all nodes
    TFA is using system timezone for collection, All times shown in UTC.
    Scanning files from 2024-05-01 15:13:32 UTC to 2024-05-01 16:13:32 UTC
    Collection Id : 20240501161522machine1
    Detailed Logging at : /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/diagcollect_20240501161522_machine1.log
    Waiting up to 120 seconds for collection to start
    2024/05/01 16:15:43 UTC : NOTE : Any file or directory name containing the string .com will be renamed to replace .com with dotcom
    2024/05/01 16:15:43 UTC : Collection Name : tfa_srdc_dbperf_Wed_May_01_16_15_26_UTC_2024.zip
    2024/05/01 16:15:43 UTC : Collecting diagnostics from hosts : [machine1, machine3, machine2, machine4]
    2024/05/01 16:15:45 UTC : Getting list of files satisfying time range [05/01/2024 15:13:32, 05/01/2024 16:13:32]
    2024/05/01 16:15:45 UTC : Collecting Additional Diagnostic Information...
    2024/05/01 16:16:48 UTC : Executing DB Script awr_reports on db23cdb1 with timeout of 3600 seconds...
    2024/05/01 16:16:57 UTC : Completed Collection of Additional Diagnostic Information for Insights...
    2024/05/01 16:18:56 UTC : Collecting ADR incident files...
    2024/05/01 16:20:32 UTC : Executing Applicable ORAchk Validations with timeout of 600 seconds...
    2024/05/01 16:24:16 UTC : Executing IPS Incident Package Collection(s)...
    2024/05/01 16:24:25 UTC : No ADR Incidents for db23cdb1 covering period "2024-05-01 15:13:32" to "2024-05-01 16:13:32" were generated, IPS Pack will not be collected.
    2024/05/01 16:24:25 UTC : Executing SQL Script db_feature_usage.sql on db23cdb1 with timeout of 600 seconds...
    2024/05/01 16:24:25 UTC : Executing Collection for OS with timeout of 1800 seconds...
    2024/05/01 16:24:39 UTC : Completed Collection of Additional Diagnostic Information...
    2024/05/01 16:24:47 UTC : Completed Local Collection
    2024/05/01 16:24:47 UTC : Not Redacting this Collection ...
    2024/05/01 16:24:47 UTC : Remote Collection in Progress...
    2024/05/01 16:24:48 UTC : Collection completed on host: machine2
    2024/05/01 16:24:48 UTC : Collection completed on host: machine3
    2024/05/01 16:24:48 UTC : Collection completed on host: machine4
    2024/05/01 16:24:52 UTC : Executing Creation of insights zip with timeout of 900 seconds...
    2024/05/01 16:28:49 UTC : Report is generated at : /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine1_insights_2024_05_01_16_24_55.zip
    2024/05/01 16:28:49 UTC : Finished creation of insights zip with status 0
    2024/05/01 16:28:50 UTC : Collection completed on host: machine1
    2024/05/01 16:28:49 UTC : Completed collection of zip files.
    .-----------------------------------------.
    |            Collection Summary           |
    +---------------+-----------+------+------+
    | Host          | Status    | Size | Time |
    +---------------+-----------+------+------+
    | machine2       | Completed | 13MB | 289s |
    | machine3       | Completed | 12MB | 332s |
    | machine4       | Completed | 13MB | 420s |
    | machine1       | Completed | 17MB | 544s |
    '---------------+-----------+------+------'
    Logs are being collected to: /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all
    /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine1.tfa_srdc_dbperf_Wed_May_01_16_15_26_UTC_2024.zip
    /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine2.tfa_srdc_dbperf_Wed_May_01_16_15_26_UTC_2024.zip
    /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine3.tfa_srdc_dbperf_Wed_May_01_16_15_26_UTC_2024.zip
    /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine4.tfa_srdc_dbperf_Wed_May_01_16_15_26_UTC_2024.zip
    /u01/app/giusr/oracle.ahf/data/repository/srdc_dbperf_collection_Wed_May_01_16_15_27_UTC_2024_node_all/machine1_insights_2024_05_01_16_24_55.zip

    Once it’s finished AHF will package everything for you in a zip file for each machine, as you progress you'll only need the one from the node where the problem occurred.

    Now we can move on to step number two. Use Autonomous Health Framework Insights to find recommendations.

Use Autonomous Health Framework Insights

Transfer the diagnostic zip from the machine where you initiated the collection to a machine with a web browser and unzip it.

Figure 7-33 AHF Insights


AHF Insights zip archive

Within here you’ll find another zip containing Autonomous Health Framework Insights. Extract that and open the index.html.

Figure 7-34 AHF Insights


AHF Insights index.html file

  • Compare configuration against best practices

    Autonomous Health Framework Insights provides you a bird's eye view of your entire system. You can use it to spot problems, drill into the root cause and understand how to resolve.

    First, we’ll look for any relevant configuration problems to see if you’ve drifted from Oracle best practices.

    Click on the Best Practices issues to drill down.

    Figure 7-35 AHF Insights


    AHF Insights Best Practice Issues

    The summary provides an overview of where your system has strayed from best practice and the relative severity.

    Figure 7-36 AHF Insights


    AHF Insights Best Practice Issues summary

    As you scroll down you’ll see specific configuration issues. Look down through the list of issues and see if you find anything that may be relevant. If you find something that looks like it could be related to performance, expand it to see the recommendation. Every best practice check explains the impact and risk of this configuration drift along with the repair steps.

    Figure 7-37 AHF Insights


    AHF Insights Best Practice issues impact and risks

  • Find and fix database anomalies and resource bottlenecks

    Next, we’ll look at the Database Anomalies Advisor.

    Autonomous Health Framework uses AI to detect unusual events and recommend solutions.

    Figure 7-38 AHF Insights


    AHF Insights OS Issues

    In this scenario Autonomous Health Framework has detected a number of unusual events suggesting the database and host IO is slow.

    Figure 7-39 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

    From here we can go back out to the home screen to explore the Operating System metrics.

    In this example we can see Autonomous Health Framework has detected the Oracle Processes were blocked in D State, which means they were stuck waiting for IO to return. We can also see some disks had a long wait time.

    Figure 7-40 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

    Expanding this gives us graphs showing disks with average wait time greater than 20 milliseconds and top IO consuming processes.

    Figure 7-41 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

    And scrolling further down we can see problematic snapshots. This one shows us which disks were slow.

    Figure 7-42 AHF Insights


    AHF Insights-database anomalies and resource bottlenecks

    High CPU Usage Example Scenario

    Here’s a different scenario looking at anomalies detected by Autonomous Health Framework, in this case we see the latency for Database global cache messages higher than expected. Autonomous Health Framework tells us the cause and recommended action.

    So now we know CPU is a bottleneck.

    Figure 7-43 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

    Going to the OS issues report page in this scenario shows us a finding for high CPU usage. Expanding that finding shows us details about when the CPU usage was high. As we scroll down we see CPU usage graphed over time.

    Figure 7-44 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

    Continuing down shows us the CPU usage for different processes. Here we can see the database background and foreground processes are consuming an average of over 80% CPU and a Max of over 100% CPU.

    Figure 7-45 AHF Insights


    AHF Insights database anomalies and resource bottlenecks

Use Autonomous Health Framework to understand and protect against noisy neighbors

This third step is only applicable if you’ve had repeated database performance problems on a host with multiple other databases and the host shows periods with more than 70% CPU usage.

In this step we’ll check for noisy neighbors. In generic terms, a noisy neighbor is when one service consumes a larger than expected share of system resources, which can impact the performance of other services. Autonomous Health Framework can detect noisy database neighbors based on CPU usage.

Autonomous Health Framework uses the Oracle Enterprise Manager repository as a source of historical data. So, configure it to connect to this repository by running:
ahf configuration set --type impact --user-name <EM user> --connect-string <EM repository connection string>

You’ll be prompted for the repository password and Autonomous Health Framework will setup the configuration.

For example:
ahf configuration set --type impact --user-name ahftest --connect-string "//<a href="http://mymachine.acme.com/" rel="nofollow">mymachine.acme.com</a>:1521/<a href="http://abcd.acme.com/" rel="nofollow">abcd.acme.com</a>"

Enter EM Repository password:

Re-enter EM Repository password:

Configuration files created in /opt/oracle.ahf/data/mymachine/balance/user_john
Now that's configured you can generate a cluster analysis report by running:
ahf analysis create --type impact --scope cluster --name cluster_name
For example:
ahf analysis create --type impact --scope cluster --name mycluster 

Starting analysis and collecting data for impact

Report is generated at : /opt/oracle.ahf/data/mymachine/diag/balance/user_john/cluster_160424_154432451_UTC.html

The first section in the report gives you a summary of your current noisy neighbor exposure and impact vs what it would be if you set the recommended CPU_COUNTs.

Figure 7-46 AHF Insights


AHF Insights noisy neighbor

The next graph down shows the current noisy neighbor status over the past month vs what it would have been with the recommended CPU_COUNT. Anywhere with a black line indicates a database was impacted by a noisy neighbor.

Figure 7-47 AHF Insights


AHF Insights noisy neighbor

As you hover over the graph these controls appear, which allow you to zoom in and pan around the data. Use this to understand if the black lines correlate with times you’ve had performance problems.

If the black lines correlate with the periods of slow database performance then use the rest of the report to drill down further by database or machine and understand the changes you need to make to CPU_COUNT across your various databases.

Use Autonomous Health Framework diagnostics to identify and tune poor performing database workloads

This fourth step uses more of the diagnostics you captured to identify and tune poor performing database workloads. Looking at the same collection that was used previously for the high CPU utilization example, you can find a perfhub html file.

Figure 7-48 Performance Hub


Performance Hub noisy neighbor

Opening that gives you the Database Performance Hub report with details about the database workload.

Figure 7-49 Performance Hub


Performance Hub noisy neighbor

The first thing to understand is what your active sessions are doing – an active session is one actively working on behalf of the client.

By default the Performance Hub Report opens on the ASH Analytics tab.

ASH stands for Active Session History, which is a database feature that samples active database sessions.

Expand the graph to get a better look.

Here we can see our highest wait class is CPU, which tells us the database is CPU bound.

Figure 7-50 Performance Hub


Performance Hub noisy neighbor

Looking at the SQL by wait class graph we can identify SQL that are candidates for tuning.

Let’s collapse the graph again and take a look at the ADDM tab

ADDM stands for Automatic Database Diagnostic Monitor.

Figure 7-51 Performance Hub


Performance Hub noisy neighbor

The ADDM tab analyzes the data in Automatic Workload Repository or AWR and provides options to resolve performance problems.

Click into one of the finding to get a recommendation. The recommendation will tell you what to change and give you an estimated benefit.

Figure 7-52 Performance Hub


Performance Hub noisy neighbor

You can use this table to identify any particular SQL statements that have long running durations.

Once identified you can focus on tuning their individual performance.

For more information, see .Oracle® Database SQL Tuning Guide

Log a new SR using the Autonomous Health Framework diagnostic collection

If you still need help then the final step is to log an SR using the Autonomous Health Framework diagnostic collection.

  1. Log into My Oracle Support.
  2. Choose the Service Requests tab.
  3. Then click the Create Technical SR button.

    Figure 7-53 My Oracle Support


    My Oracle Support Service Request

  4. Complete all the required Service Request fields.
  5. Make sure to choose product as "Oracle Database - Enterprise Edition"
  6. Then when you come to the Problem Type, select the option to Create an Express SR.

    This is a fast track SR creation route using an Autonomous Health Framework collection.

    Figure 7-54 My Oracle Support


    My Oracle Support Service Request

  7. Follow the rest of the steps including uploading the collection.

    Oracle Support will take over and help resolve it.