4.4 Proactively Detecting and Diagnosing Performance Issues for Oracle RAC

Oracle Cluster Health Advisor provides system and database administrators with early warning of pending performance issues, along with root causes and corrective actions, for Oracle RAC databases and cluster nodes. Use Oracle Cluster Health Advisor to improve availability and performance management.

Oracle Cluster Health Advisor estimates an expected value of an observed input based on the default model, which is a trained calibrated model based on a normal operational period of the target system. Oracle Cluster Health Advisor then performs anomaly detection for each input based on the difference between observed and expected values. If sufficient inputs associated with a specific problem are abnormal, then Oracle Cluster Health Advisor raises a warning and generates an immediate targeted diagnosis and corrective action.

Oracle Cluster Health Advisor also sends warning messages to Enterprise Manager Cloud Control using the Oracle Clusterware event notification protocol.

The ability of Oracle Cluster Health Advisor to detect performance and availability issues on Oracle Exadata systems has been improved in this release.

With the Oracle Cluster Health Advisor support for Oracle Solaris, you can now get early detection and prevention of performance and availability issues in your Oracle RAC database deployments.

For more information about installing the Grid Infrastructure Management Repository, see Oracle Grid Infrastructure Installation and Upgrade Guide 20c for Linux.

4.4.1 Oracle Cluster Health Advisor Architecture

Oracle Cluster Health Advisor runs as a highly available cluster resource, ochad, on each node in the cluster.

Each Oracle Cluster Health Advisor daemon (ochad) monitors the operating system on the cluster node and optionally, each Oracle Real Application Clusters (Oracle RAC) database instance on the node.

The ochad daemon receives operating system metric data from the Cluster Health Monitor and gets Oracle RAC database instance metrics from a memory-mapped file. The daemon does not require a connection to each database instance. The Health Prognostics Engine of Oracle Cluster Health Advisor uses this data, together with the selected model, to analyze the health of the node and of each monitored database instance multiple times a minute.
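
You can quickly confirm that the ochad resource is online on each node by using Oracle Clusterware commands. The following is a minimal check, assuming the resource is registered under its usual name, ora.chad; the srvctl status cha command described later in this section reports the same information.
$ crsctl stat res ora.chad -t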

4.4.2 Removing Grid Infrastructure Management Repository

GIMR is desupported in Oracle Database 23ai. If GIMR is configured in your existing Oracle Grid Infrastructure installation, then remove the GIMR.

  1. Confirm if Grid Infrastructure Management Repository (GIMR) is configured in the current release.
    srvctl config mgmtdb

    Note:

    If GIMR is not configured, then do not follow this procedure.
  2. Confirm if Oracle Fleet Patching and Provisioning (Oracle FPP) is configured in central server mode in the current release.
    srvctl config rhpserver

    Note:

    If Oracle FPP is configured on your cluster, then Oracle recommends that you use the Oracle FPP Self-Upgrade feature for a smooth migration of the metadata from GIMR to the new metadata repository. Refer to Oracle Fleet Patching and Provisioning Self Upgrade for more information about how to use the Oracle FPP Self-Upgrade feature.
  3. As the grid user, log in to any cluster node and create a new directory owned by grid to store the GIMR deletion script.
    mkdir -p $ORACLE_HOME/gimrdel
    chown grid:oinstall $ORACLE_HOME/gimrdel
  4. Download scriptgimr.zip from My Oracle Support Note 2972418.1 to the $ORACLE_HOME/gimrdel directory.
  5. Extract the reposScript.sh script from scriptgimr.zip and ensure that the grid user has read and execute permissions on the reposScript.sh script.
    unzip -q $ORACLE_HOME/gimrdel/scriptgimr.zip -d $ORACLE_HOME/gimrdel
  6. Optional: Query and export the CHA user models.
    Grid_home/bin/chactl query model
    Grid_home/bin/chactl export model -name model_name -file model_name.svm
  7. If Oracle FPP was configured in central mode, then export the Oracle FPP Metadata to re-configure Oracle FPP after upgrading to Oracle Grid Infrastructure 23ai.
    Grid_home/crs/install/reposScript.sh -export_dir=dir_to_export_Oracle_FPP_metadata
  8. Run the reposScript.sh script in delete mode from the $ORACLE_HOME/gimrdel directory. (A quick post-deletion check is sketched after this procedure.)
    $ORACLE_HOME/gimrdel/reposScript.sh -mode="Delete"

    Note:

    Oracle FPP stops working if you delete the GIMR, but do not upgrade to Oracle Grid Infrastructure 23ai and re-configure Oracle FPP.
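
After the script completes, you can verify the removal and confirm that Oracle Cluster Health Advisor is still monitoring with its default models. This is a quick check using commands shown elsewhere in this section; srvctl config mgmtdb should no longer report a management database.
srvctl config mgmtdb
chactl status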

4.4.3 Monitoring the Oracle Real Application Clusters (Oracle RAC) Environment with Oracle Cluster Health Advisor

Oracle Cluster Health Advisor is automatically provisioned on each node by default when Oracle Grid Infrastructure is installed for Oracle Real Application Clusters (Oracle RAC) or Oracle RAC One Node database.

Oracle Cluster Health Advisor does not require any additional configuration.

When Oracle Cluster Health Advisor detects that an Oracle Real Application Clusters (Oracle RAC) or Oracle RAC One Node database instance is running, Oracle Cluster Health Advisor autonomously starts monitoring the cluster nodes. Use CHACTL while logged in as the Grid user to turn on monitoring of the database.

To monitor the Oracle Real Application Clusters (Oracle RAC) environment:

  1. To monitor a database, run the following command:
    $ chactl monitor database -db db_unique_name

    Oracle Cluster Health Advisor monitors all instances of the Oracle Real Application Clusters (Oracle RAC) or Oracle RAC One Node database using the default model. Oracle Cluster Health Advisor cannot monitor single-instance Oracle databases, even if the single-instance Oracle databases share the same cluster as Oracle Real Application Clusters (Oracle RAC) databases.

    Each database instance is monitored independently both across Oracle Real Application Clusters (Oracle RAC) database nodes and when more than one database runs on a single node.

  2. To stop monitoring a database, run the following command:
    $ chactl unmonitor database -db db_unique_name

    Oracle Cluster Health Advisor stops monitoring all instances of the specified database. However, Oracle Cluster Health Advisor does not delete any data or problems until they age out beyond the retention period.

  3. To check monitoring status of all cluster nodes and databases, run the following command:
    $ chactl status

    Use the -verbose option to see more details, such as the models used for the nodes and each database.
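
For example, the following sequence uses the hypothetical database name oltpacdb (the same name used in the examples later in this section) to turn monitoring on, verify it, and turn it off again; all of the commands are described in the steps above.
$ chactl monitor database -db oltpacdb
$ chactl status -verbose
$ chactl unmonitor database -db oltpacdb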

4.4.4 Using Cluster Health Advisor for Health Diagnosis

Oracle Cluster Health Advisor raises and clears problems autonomously.

The Oracle Grid Infrastructure user can query the stored information using CHACTL.

To query the diagnostic data:

  1. To query currently open problems, run the following command:
    chactl query diagnosis -db db_unique_name -start time -end time

    In the syntax example, db_unique_name is the unique name of your database. You also specify the start time and the end time for which you want to retrieve data. Specify date and time in the YYYY-MM-DD HH24:MI:SS format.

  2. Use the -htmlfile file_name option to save the output in HTML format.

Example 4-5 Cluster Health Advisor Output Examples in Text and HTML Format

This example shows the default text output for the chactl query diagnosis command for a database named oltpacdb.
$ chactl query diagnosis -db oltpacdb -start "2016-02-01 02:52:50" -end "2016-02-01 03:19:15"
2016-02-01 01:47:10.0  Database oltpacdb  DB Control File IO Performance (oltpacdb_1) [detected]
2016-02-01 01:47:10.0  Database oltpacdb  DB Control File IO Performance (oltpacdb_2) [detected]
2016-02-01 02:52:15.0  Database oltpacdb  DB CPU Utilization (oltpacdb_2) [detected]
2016-02-01 02:52:50.0  Database oltpacdb  DB CPU Utilization (oltpacdb_1) [detected]
2016-02-01 02:59:35.0  Database oltpacdb  DB Log File Switch (oltpacdb_1) [detected]
2016-02-01 02:59:45.0  Database oltpacdb  DB Log File Switch (oltpacdb_2) [detected]

Problem: DB Control File IO Performance
Description: CHA has detected that reads or writes to the control files are slower than expected.
Cause: The Cluster Health Advisor (CHA) detected that reads or writes to the control files were slow
because of an increase in disk IO. 
The slow control file reads and writes may have an impact on checkpoint and Log Writer (LGWR) performance.
Action: Separate the control files from other database files and move them to faster disks or Solid State Devices.

Problem: DB CPU Utilization
Description: CHA detected larger than expected CPU utilization for this database.
Cause: The Cluster Health Advisor (CHA) detected an increase in database CPU utilization 
because of an increase in the database workload.
Action: Identify the CPU intensive queries by using the Automatic Database Diagnostic Monitor (ADDM) and
follow the recommendations given there. Limit the number of CPU intensive queries or
relocate sessions to less busy machines. Add CPUs if the CPU capacity is insufficient to support
the load without a performance degradation or effects on other databases.

Problem: DB Log File Switch
Description: CHA detected that database sessions are waiting longer than expected for log switch completions.
Cause: The Cluster Health Advisor (CHA) detected high contention during log switches 
because the redo log files were small and the redo logs switched frequently.
Action: Increase the size of the redo logs.

The timestamp displays date and time when the problem was detected on a specific host or database.

Note:

The same problem can occur on different hosts and at different times, yet the diagnosis shows complete details of the problem and its potential impact. Each problem also shows targeted corrective or preventive actions.
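
As an illustration of the corrective action for the DB Log File Switch problem, a database administrator can add larger redo log groups. This is only a sketch: the group numbers and the 2G size are assumptions, and the statements rely on Oracle Managed Files (or ASM with OMF) so that no file names need to be specified.
$ sqlplus / as sysdba <<'SQL'
-- Illustrative only: add one larger redo log group per redo thread.
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 5 SIZE 2G;
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 6 SIZE 2G;
EXIT;
SQL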

Here is an example of what the output looks like in the HTML format.
$ chactl query diagnosis -start "2016-07-03 20:50:00" -end "2016-07-04 03:50:00" -htmlfile ~/chaprob.html

Figure 4-5 Cluster Health Advisor Diagnosis HTML Output


4.4.5 Calibrating an Oracle Cluster Health Advisor Model for a Cluster Deployment

As shipped with default node and database models, Oracle Cluster Health Advisor is designed not to generate false warning notifications.

You can increase the sensitivity and accuracy of the Oracle Cluster Health Advisor models for a specific workload using the chactl calibrate command.

Oracle recommends that a minimum of 6 hours of data be available and that both the cluster and databases use the same time range for calibration.

The chactl calibrate command analyzes a user-specified time interval that includes all workload phases operating normally. This data is collected while Oracle Cluster Health Advisor is monitoring the cluster and all the databases that you want to calibrate.

  1. To check if sufficient data is available, run the chactl query calibration command.

    Note:

    The query calibration command is supported only with GIMR. GIMR is optionally supported in Oracle Database 19c, but it is desupported in Oracle Database 23ai.

    If 720 or more records are available, then Oracle Cluster Health Advisor successfully performs the calibration. The calibration function may not consider some data records to be normally occurring for the workload profile being used. In this case, filter the data by using the KPISET parameters in both the query calibration command and the calibrate command.

    For example:
    $ chactl query calibration -db oltpacdb -timeranges 
    'start=2016-07-26 01:00:00,end=2016-07-26 02:00:00,start=2016-07-26 03:00:00,end=2016-07-26 04:00:00' 
    -kpiset 'name=CPUPERCENT min=20 max=40, name=IOTHROUGHPUT min=500 max=9000' -interval 2
  2. Start the calibration and store the model under a user-specified name for the specified date and time range.
    For example:
    $ chactl calibrate cluster -model weekday -timeranges 'start=2016-07-03 20:50:00,end=2016-07-04 15:00:00'
  3. Use the new model to monitor the cluster.
    For example:
    $ chactl monitor cluster -model weekday
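
Because Oracle recommends calibrating the databases over the same time range as the cluster, a matching database calibration might look like the following. This is a sketch: weekday_db is a hypothetical model name, and you should confirm the exact options with chactl calibrate -help.
$ chactl calibrate database -db oltpacdb -model weekday_db -timeranges 'start=2016-07-03 20:50:00,end=2016-07-04 15:00:00'
$ chactl monitor database -db oltpacdb -model weekday_db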

Example 4-6 Output for the chactl query calibrate command

Database name : oltpacdb
Start time : 2016-07-26 01:03:10
End time : 2016-07-26 01:57:25
Total Samples : 120
Percentage of filtered data : 8.32%
The number of data samples may not be sufficient for calibration.

1) Disk read (ASM) (Mbyte/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
4.96      0.20      8.98      0.06      25.68   

<25       <50       <75       <100      >=100    
97.50%    2.50%     0.00%     0.00%     0.00%    

2) Disk write (ASM) (Mbyte/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
27.73     9.72      31.75     4.16      109.39  

<50       <100      <150      <200      >=200    
73.33%    22.50%    4.17%     0.00%     0.00%    

3) Disk throughput (ASM) (IO/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
2407.50   1500.00   1978.55   700.00    7800.00 

<5000     <10000    <15000    <20000    >=20000  
83.33%    16.67%    0.00%     0.00%     0.00%    

4) CPU utilization (total) (%)

MEAN      MEDIAN    STDDEV    MIN       MAX     
21.99     21.75     1.36      20.00     26.80   

<20       <40       <60       <80       >=80     
0.00%     100.00%   0.00%     0.00%     0.00%    

5) Database time per user call (usec/call)

MEAN      MEDIAN    STDDEV    MIN       MAX     
267.39    264.87    32.05     205.80    484.57  

<10000000  <20000000  <30000000  <40000000  <50000000  <60000000  <70000000  >=70000000
100.00%   0.00%     0.00%     0.00%     0.00%     0.00%     0.00%     0.00%

Database name : oltpacdb
Start time : 2016-07-26 03:00:00
End time : 2016-07-26 03:53:30
Total Samples : 342
Percentage of filtered data : 23.72%
The number of data samples may not be sufficient for calibration.

1) Disk read (ASM) (Mbyte/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
12.18     0.28      16.07     0.05      60.98   

<25       <50       <75       <100      >=100    
64.33%    34.50%    1.17%     0.00%     0.00%    

2) Disk write (ASM) (Mbyte/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
57.57     51.14     34.12     16.10     135.29  

<50       <100      <150      <200      >=200    
49.12%    38.30%    12.57%    0.00%     0.00%    

3) Disk throughput (ASM) (IO/sec)

MEAN      MEDIAN    STDDEV    MIN       MAX     
5048.83   4300.00   1730.17   2700.00   9000.00 

<5000     <10000    <15000    <20000    >=20000  
63.74%    36.26%    0.00%     0.00%     0.00%    

4) CPU utilization (total) (%)

MEAN      MEDIAN    STDDEV    MIN       MAX     
23.10     22.80     1.88      20.00     31.40   

<20       <40       <60       <80       >=80     
0.00%     100.00%   0.00%     0.00%     0.00%    

5) Database time per user call (usec/call)

MEAN      MEDIAN    STDDEV    MIN       MAX     
744.39    256.47    2892.71   211.45    45438.35

<10000000  <20000000  <30000000  <40000000  <50000000  <60000000  <70000000  >=70000000
100.00%   0.00%     0.00%     0.00%     0.00%     0.00%     0.00%     0.00%

4.4.6 Viewing the Details for an Oracle Cluster Health Advisor Model

Use the chactl query model command to view the model details.

  1. You can review the details of an Oracle Cluster Health Advisor model at any time using the chactl query model command.
    For example:
    $ chactl query model -name weekday
    Model: weekday
    Target Type: CLUSTERWARE
    Version: OS12.2_V14_0.9.8
    OS Calibrated on: Linux amd64
    Calibration Target Name: MYCLUSTER
    Calibration Date: 2016-07-05 01:13:49
    Calibration Time Ranges: start=2016-07-03 20:50:00,end=2016-07-04 15:00:00
    Calibration KPIs: not specified
    

    You can also rename, import, export, and delete the models.
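
For example, a calibrated model can be copied to another cluster by exporting it to a file and importing it on the target. The export syntax matches the earlier GIMR removal step; the import options shown here mirror it and are an assumption, so confirm them with chactl -help.
$ chactl export model -name weekday -file /tmp/weekday.svm
$ chactl import model -name weekday -file /tmp/weekday.svm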

4.4.7 Managing the Oracle Cluster Health Advisor Repository

The Oracle Cluster Health Advisor repository stores the historical records of cluster host problems, database problems, and associated metric evidence, along with the models.

Note:

Applicable only if GIMR is configured. GIMR is optionally supported in Oracle Database 19c, but it is desupported in Oracle Database 23ai.

The Oracle Cluster Health Advisor repository is used to diagnose and triage periodic problems. By default, the repository is sized to retain data for 16 targets (nodes and database instances) for 72 hours. If the number of targets increases, then the retention time is automatically decreased. Oracle Cluster Health Advisor generates warning messages when the retention time goes below 72 hours, and stops monitoring and generates a critical alert when the retention time goes below 24 hours.

Use CHACTL commands to manage the repository and set the maximum retention time.

  1. To retrieve the repository details, use the following command:

    $ chactl query repository
    For example, running this command shows the following output:
    specified max retention time(hrs) : 72
    available retention time(hrs)     : 212
    available number of entities      : 2
    allocated number of entities      : 0
    total repository size(gb)         : 2.00
    allocated repository size(gb)     : 0.07
  2. To set the maximum retention time in hours, based on the current number of targets being monitored, use the following command:

    $ chactl set maxretention -time number_of_hours
    For example:
    $ chactl set maxretention -time 80
    max retention successfully set to 80 hours
    

    Note:

    The maxretention setting limits the age of the oldest data retained in the repository, but this limit is not guaranteed to be maintained if the number of monitored targets increases. In this case, if the repository cannot hold the data for the combination of monitored targets and number of hours, then increase the size of the Oracle Cluster Health Advisor repository.

  3. To increase the size of the Oracle Cluster Health Advisor repository, use the chactl resize repository command.

    For example, to resize the repository to support 32 targets using the currently set maximum retention time, you would use the following command:

    $ chactl resize repository -entities 32
    repository successfully resized for 32 targets
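
Because the available retention shrinks as more targets are monitored, you may want to check it periodically. The following is a minimal sketch that parses the "available retention time(hrs)" field shown in the example output above; the parsing depends on that output format, so adjust it if your chactl version prints the fields differently.
#!/bin/bash
# Warn when the CHA repository retention drops below the 72-hour default.
retention=$(chactl query repository | awk -F: '/available retention time/ {gsub(/ /, "", $2); print $2}')
if [ -n "$retention" ] && [ "$retention" -lt 72 ]; then
  echo "CHA repository retention is ${retention} hours; consider chactl resize repository -entities <n>."
fi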
    

4.4.8 Viewing the Status of Cluster Health Advisor

SRVCTL commands give you full control over the life cycle of Oracle Cluster Health Advisor as a highly available service.

Use SRVCTL commands to check the status and configuration of the Oracle Cluster Health Advisor service on any active hub or leaf nodes of the Oracle RAC cluster.

Note:

A target is monitored only if it is running and the Oracle Cluster Health Advisor service is also running on the host node where the target exists.

  1. To check the status of Oracle Cluster Health Advisor service on all nodes in the Oracle RAC cluster:
    srvctl status cha [-help]
    For example:
    # srvctl status cha
    Cluster Health Advisor is running on nodes racNode1, racNode2.
    Cluster Health Advisor is not running on nodes racNode3, racNode4.
  2. To check if Oracle Cluster Health Advisor service is enabled or disabled on all nodes in the Oracle RAC cluster:
    srvctl config cha [-help]
    For example:
    # srvctl config cha
    Cluster Health Advisor is enabled on nodes racNode1, racNode2.
    Cluster Health Advisor is not enabled on nodes racNode3, racNode4.
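
SRVCTL also provides the life cycle verbs for the Oracle Cluster Health Advisor service. The following is a hedged sketch of stopping and restarting the service on one node and enabling it on another; confirm the exact verbs and options for your release with srvctl stop cha -help and srvctl start cha -help.
$ srvctl stop cha -node racNode3
$ srvctl start cha -node racNode3
$ srvctl enable cha -node racNode4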

4.4.9 Enhanced Cluster Health Advisor Support for Oracle Pluggable Databases

In Oracle Database 23ai, the Cluster Health Advisor (CHA) diagnostic capabilities have been extended to support up to 4,096 pluggable databases (PDBs), up from 256.

Going forward, this is crucial for Oracle Autonomous Database deployments. CHA's problem detection and root cause analysis will be improved by considering DB events such as reconfiguration. This improves detection, analysis, and targeted preventative actions for problems such as instance evictions.