10 Monitoring and Troubleshooting Your Cluster

You can use the Autonomous Health Framework to monitor and troubleshoot your cluster.

10.1 Autonomous Health Framework

Autonomous Health Framework is a collection of tools that analyze collected diagnostic data and proactively identify issues before they affect the health of your clusters or your Oracle RAC databases.

The majority of the tools that make up Autonomous Health Framework are already available in Oracle Database 12c release 1. In Oracle Database 12c release 2, the output of several tools is consolidated in the Grid Infrastructure Management Repository and analyzed in real-time to detect patterns which can indicate the development of issues on the production clusters.

10.1.1 Cluster Verification Utility (CVU)

The Cluster Verification Utility (CVU) can assist you in diagnosing a wide variety of configuration problems. As part of Autonomous Health Framework, CVU can perform these checks periodically by running autonomously on a scheduled basis.

Cluster Verification Utility (CVU) performs system checks in preparation for installation, patch updates, or other system changes. Using CVU ensures that you have completed the required system configuration and preinstallation steps so that your Oracle Grid Infrastructure or Oracle Real Application Clusters (Oracle RAC) installation, update, or patch operation completes successfully.

Oracle Universal Installer is fully integrated with CVU, automating many CVU prerequisite checks. Oracle Universal Installer runs all prerequisite checks and associated fixup scripts when you run the installer.

10.1.1.1 Verifying the Existence of Node Applications

You use the CVU comp nodeapp command to verify the existence of node applications, namely the virtual IP (VIP), Oracle Notification Service (ONS), and Global Services Daemon (GSD), on all the nodes.

To verify the existence of node applications:

  1. In a command window, log in to the operating system as the user who owns the Oracle Clusterware software installation.
  2. Use the comp nodeapp command of CVU, using the following syntax:
    cluvfy comp nodeapp [-n node_list] [-verbose]
    

    where node_list represents the nodes to check.

  3. If the cluvfy command returns the value of UNKNOWN for a particular node, then CVU cannot determine whether a check passed or failed.
    Determine if the failure was caused by one of the following reasons:
    • The node is down.

    • Executable files that CVU requires are missing in the CRS_home/bin directory or the Oracle_home/bin directory.

    • The user account that ran CVU does not have permissions to run common operating system executable files on the node.

    • The node is missing an operating system patch or required package.

    • The kernel parameters on that node were not configured correctly and CVU cannot obtain the operating system resources required to perform its checks.
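For example, the following command checks the node applications on two nodes and prints detailed results. The node names racnode1 and racnode2 are illustrative; substitute the names of your own cluster nodes:

    $ cluvfy comp nodeapp -n racnode1,racnode2 -verbose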

10.1.1.2 Verifying the Integrity of Oracle Clusterware Components

You use the CVU comp crs command to verify the existence of all the Oracle Clusterware components.

To verify the integrity of Oracle Clusterware components:

  1. In a command window, log in to the operating system as the user who owns the Oracle Clusterware software installation.
  2. Use the comp crs command of CVU, using the following syntax:
    cluvfy comp crs [-n node_list] [-verbose]
    

    where node_list represents the nodes to check.
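For example, the following command verifies the Oracle Clusterware stack on the nodes racnode1 and racnode2 (illustrative node names):

    $ cluvfy comp crs -n racnode1,racnode2 -verbose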

10.1.1.3 Verifying the Integrity of the Oracle Cluster Registry

You use the CVU comp ocr command to verify the integrity of the Oracle Clusterware registry.

To verify the integrity of the Oracle Clusterware registry:

  1. In a command window, log in to the operating system as the user who owns the Oracle Clusterware software installation.
  2. Use the comp ocr command of CVU, using the following syntax:
    cluvfy comp ocr [-n node_list] [-verbose]
    

    where node_list represents the nodes to check.
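For example, the following command verifies the integrity of the Oracle Cluster Registry on all cluster nodes and prints detailed results:

    $ cluvfy comp ocr -n all -verbose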

10.1.1.4 Verifying the Integrity of Your Entire Cluster

Use the CVU comp clu command to check that all nodes in the cluster have the same view of the cluster configuration.

To verify the integrity of your cluster:

  1. In a command window, log in to the operating system as the user who owns the Oracle Clusterware software installation.
  2. Use the comp clu command of CVU, using the following syntax:
    cluvfy comp clu [-verbose]
    

10.1.1.5 Checking the Settings for the Interconnect

Use the CVU comp nodereach or comp nodecon command to check the settings for the interconnect.

Cache Fusion enhances the performance of Oracle RAC by utilizing a high-speed interconnect to send data blocks to another instance's buffer cache. The high-speed interconnect should be a private network with the highest bandwidth to maximize performance.

For network connectivity verification, CVU discovers all the available network interfaces if you do not specify an interface on the CVU command line.

To check the settings for the interconnect:

  1. In a command window, log in to the operating system as the user who owns the Oracle Clusterware software installation.
  2. To verify the accessibility of the cluster nodes, specified by node_list, from the local node or from any other cluster node, specified by srcnode, use the component verification command nodereach as follows:
    cluvfy comp nodereach -n node_list [ -srcnode node ] [-verbose]
    
  3. To verify the connectivity among the nodes, specified by node_list, through the available network interfaces from the local node or from any other cluster node, use the comp nodecon command as shown in the following example:
    cluvfy comp nodecon -n node_list -verbose
    

    When you issue the nodecon command as shown in the previous example, it instructs CVU to perform the following tasks:

    • Discover all the network interfaces that are available on the specified cluster nodes.

    • Review the corresponding IP addresses and subnets for the interfaces.

    • Obtain the list of interfaces that are suitable for use as VIPs and the list of interfaces to private interconnects.

    • Verify the connectivity among all the nodes through those interfaces.

    When you run the nodecon command in verbose mode, it identifies the mappings between the interfaces, IP addresses, and subnets.

  4. To verify the connectivity among the nodes through specific network interfaces, use the comp nodecon command with the -i option and specify the interfaces to be checked with the interface_list argument:
    cluvfy comp nodecon -n node_list -i interface_list [-verbose]
    

    For example, you can verify the connectivity among the nodes racnode1, racnode2, and racnode3, through the specific network interface eth0 by running the following command:

    cluvfy comp nodecon -n racnode1,racnode2,racnode3 -i eth0 -verbose
    

10.1.1.6 Enabling Tracing

CVU does not generate trace files unless you enable tracing.

The CVU trace files are created in the ORACLE_BASE/crsdata/host_name/cvu directory by default. Oracle Database automatically rotates the log files, and the most recently created log file has the name cvutrace.log.0. You should remove unwanted log files, or archive them, to reclaim disk space if needed.

To enable tracing using CVU:

  1. In a command window, log in to the operating system as the root user.
  2. Set the environment variable SRVM_TRACE to true.
    # export SRVM_TRACE=true
    
  3. Run the command to trace.
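For example, assuming a Bourne-style shell and that the ORACLE_BASE environment variable is set, the following illustrative sequence enables tracing, runs a CVU check, and then lists the resulting trace files in the default trace directory:

    # export SRVM_TRACE=true
    # cluvfy comp ocr -n all -verbose
    # ls $ORACLE_BASE/crsdata/`hostname`/cvu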

See Also:

Oracle Clusterware Administration and Deployment Guide for more information about enabling tracing

10.1.2 ORAchk

As part of Autonomous Health Framework, ORAchk is configured to run in Daemon Mode. ORAchk proactively scans for known problems within a wide range of products and autonomously reports on these issues.

ORAchk replaces the popular RACcheck tool, extending the coverage based on prioritization of top issues reported by users. You can configure ORAchk to upload its findings to the Collection Manager Repository for centralized examination and analysis.

To get the most out of ORAchk, Oracle recommends that you run ORAchk as the root user, either by logging in as root or by using sudo.

10.1.2.1 Overview of Oracle ORAchk

Oracle ORAchk provides a lightweight and non-intrusive health check framework for the Oracle stack of software and hardware components.

Oracle ORAchk is provided as a value add-on to your existing support contract. There is no additional fee or license required to run Oracle ORAchk.

Features of ORAchk

  • Automates risk identification and proactive notification before business is impacted.

  • Runs health checks based on critical, recurring problems seen across the Oracle customer base.

  • Runs in your environment with no need to send anything to Oracle.

  • Enables you to schedule email health check reports.

  • Integrates the findings into other tools of your choice.

The scope of Oracle ORAchk increases with new releases. The following is the current scope:

Oracle Database

  • Single-instance Oracle Database

  • Oracle Grid Infrastructure and Oracle RAC

  • Maximum Availability Architecture (MAA) validation

  • Upgrade Readiness validation

  • Oracle GoldenGate

  • Application Continuity

Enterprise Manager Cloud Control (12c only)

  • Management Repository

  • Management Agents

  • Oracle Management Service (OMS), version 12.1.0.1 and later on Linux only

Oracle Hardware Systems

  • Oracle Solaris

  • Oracle Solaris Cluster

  • Oracle Systems configuration for Oracle Middleware and Oracle Applications

  • Oracle ZFS Storage Appliance

  • Oracle Virtual Networking

10.1.2.2 Running ORAchk

You can schedule ORAchk to run at periodic intervals or you can run it as needed.

Oracle recommends using the daemon process to schedule recurring health checks at regular intervals.

Configure the daemon to:

  • Schedule recurring health checks at regular intervals.

  • Send email notifications when the health check runs complete, clearly showing any differences since the last run.

  • Purge collection results after a pre-determined period.

  • Check and send email notification about stale passwords.

  • Store multiple profiles for automated health check runs.

  • Restart automatically if the server or node where it is running restarts.

Oracle recommends that you run on-demand health checks in the following scenarios:

  • Pre or post-upgrades

  • Machine moves from one subnet to another

  • Hardware failure or repair

  • Problem troubleshooting

  • In addition to go-live testing

  1. To run health checks on demand, run ORAchk from the same directory in which the daemon was started, and do not use a daemon process started by another user.
    $ ./orachk
  2. To schedule health checks, do the following:
    1. Set the daemon properties using the -set option.

      At a minimum, set AUTORUN_SCHEDULE and NOTIFICATION_EMAIL.

      For example, to set the tool to run at 3 AM every Sunday and email the results to some.body@example.com, run the following command:

      $ ./orachk -set "AUTORUN_SCHEDULE=3 * * 0;NOTIFICATION_EMAIL=some.body@example.com"
    2. Configure the health check daemon as described in Automated Daemon Mode Operation.
    3. Start the daemon as root (recommended) or as the Oracle Database or the Oracle Grid Infrastructure home owner.
      $ ./orachk -d start
    4. Answer the questions prompted during startup.
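After the daemon starts, you can confirm that it is running and see when the next automated run is scheduled. The following commands are illustrative; the daemon options available can vary by ORAchk version:

    $ ./orachk -d status
    $ ./orachk -d nextautorun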

Oracle ORAchk generates a detailed HTML report with findings and recommendations.

For the subsequent health check runs after the first email notification, the daemon emails the diff report of the current run and the immediate past run to all users specified in the NOTIFICATION_EMAIL list.

The email notification message contains:

  • System Health Score of this run compared to previous.

  • Summary of number of checks run and differences between runs.

  • Most recent report result as attachment.

  • Previous report result as attachment.

  • Diff report as attachment.

See Also:

Oracle Autonomous Health Framework User's Guide for more information about using and configuring ORAchk.
10.1.2.3 ORAchk HTML Report Output

Oracle ORAchk generates a detailed HTML report with findings and recommendations.

The Health Check HTML report contains the following:

  • High level health score.

  • Summary of the run.

  • Table of contents that provides easy access to findings.

  • Findings and recommendations to resolve the issues.

System Health Score and Summary

Oracle ORAchk calculates a high level System Health Score based on the number of checks that passed or failed. A summary of the run includes the following information:

  • The cluster name

  • The operating system and software versions

  • The EM Agent home directory

  • Cluster information — the number of nodes, the number of database servers, the number of storage servers, and the number of IB switches

  • The version of ORAchk

  • The name of the collection file

  • The duration of the ORAchk run

  • The user that started ORAchk

  • The date that ORAchk was run

HTML Report Table of Contents and Features

The Table of Contents provides links to each of the major sections within the HTML report.

The Report Features allow you to:

  • Filter checks based on their statuses.

  • Select the regions.

  • Expand or collapse all checks.

  • View check IDs.

  • Remove findings from the report.

  • Get a printable view.

HTML Report Findings

Report findings are grouped by Oracle Stack component. Findings include:

  • Status of check (FAIL, WARNING, INFO, or PASS).

  • Type of check.

  • Check message.

  • Location where the check was run.

  • Link to expand details for further findings and recommendations.

For any of the findings you can click View to see the findings and the recommendations, which can include:

  • Solution to solve the problem.

  • Applicable recommendations.

  • Where the problem does not apply.

  • Links to relevant documentation or My Oracle Support Notes.

  • Example of data the recommendation is based on.

Maximum Availability Architecture (MAA) Scorecard

The Maximum Availability Architecture (MAA) Scorecard is displayed after the Findings group.

The MAA Scorecard provides a set of best practices for maximum availability architecture. It also shows results related to maximum availability, such as the installed software versions checked for noncurrent software, and use of incompatible features.

Findings Needing Further Review

Issues for which the health checks have only a partial view, and which need user review to determine whether they are relevant, are displayed in the Findings needing further review section.

Platinum Certification

The Platinum Certification section shows a list of compliance status items for the Oracle Platinum service. For the existing Platinum customers it is a review. For customers not yet participating in Oracle Platinum, it is an indication of readiness to participate in Oracle Platinum.

Clusterwide Linux Operating System Health Check (VMPScan)

A summary of the VMPScan report is shown in the Clusterwide Linux Operating System Health Check (VMPScan) section of the report.

The full VMPScan report is also available within the collection/reports and collection/outfiles/vmpscan directory.

File Attribute Changes

The File Attribute Changes section is shown in the report only when you run Oracle ORAchk with the -fileattr option.

Skipped Checks

Any checks that were not able to be run and skipped for some reason are shown in the Skipped Checks section.

Component Elapsed Times

The Component Elapsed Times section gives a breakdown of the time required to check various components.

This can be useful when diagnosing performance problems.

Top 10 Time Consuming Checks

The Top 10 Time Consuming Checks section shows the slowest 10 checks that were run.

This can be useful when diagnosing performance problems.

10.1.3 Cluster Health Monitor

Cluster Health Monitor is an integrated part of Oracle Grid Infrastructure. Cluster Health Monitor collects operating system statistics directly through operating system APIs, which improves performance and reduces CPU usage overhead.

Cluster Health Monitor collects operating system statistics (system metrics) such as memory and swap space usage, processes, I/O usage, and network-related data. Cluster Health Monitor collects this information in real time, usually once a second. Cluster Health Monitor collects as much system metric data as is feasible, limited only by the acceptable level of resource consumption for the tool.

Cluster Health Monitor provides system metrics and data for troubleshooting many different types of problems, such as node reboots, instance evictions, server hangs, severe performance degradation, and any other problems that require system metrics and data. By monitoring the data constantly, you can use Cluster Health Monitor to detect potential problem areas, such as CPU load, memory constraints, and spinning processes, before the problem causes an unwanted outage.

The information collected by Cluster Health Monitor is stored in the Grid Infrastructure Management Repository (GIMR). GIMR is a centralized infrastructure database for diagnostic and performance data and resides in the Grid home. It is a single instance multitenant database with a single pluggable database (PDB). It includes partitioning (for data lifecycle management). GIMR clients include: Cluster Health Monitor, Rapid Home Provisioning, Enterprise Manager Cloud Control, and Trace File Analyzer. Cluster Health Monitor puts all of its operating system metric data in GIMR. Rapid Home Provisioning uses GIMR to persist metadata about each of the database homes that it services. Enterprise Manager Cloud Control and Trace File Analyzer retrieve data from GIMR, predominantly Cluster Health Monitor data.

You can review the information collected by Cluster Health Monitor using either diagcollection.pl or oclumon:

  • diagcollection.pl: As the grid user, run the command Grid_home/bin/diagcollection.pl --collect --chmos to produce output for all of the data collected in the repository. Because there might be a lot of data, this could take a long time, so it is better to limit the query to an interesting time interval, as shown in the following example:

    Grid_home/bin/diagcollection.pl --collect --crshome Grid_home 
      --chmos --incidenttime start_time --incidentduration 05:00

    The above command generates a report that covers 5 hours from the time specified by incidenttime. The incidenttime must be in the form MM/DD/YYYYHH:MN:SS, where MM is the month, DD is the day, YYYY is the year, HH is the hour in 24-hour format, MN is the minute, and SS is the second. For example, if you want the incident period to start at 10:15 PM on June 01, 2016, the incident time is 06/01/201622:15:00. The incidenttime and incidentduration values can be changed to capture more or less data.

  • oclumon: Alternatively, if diagcollection.pl fails for any reason, you can use the command oclumon dumpnodeview -allnodes -v -last "11:59:59" > output-filename. This command generates a report from the repository covering up to the last 12 hours. The -last value can be changed to retrieve more or less data.
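For example, to dump node views for all nodes over a specific time window rather than a relative period, you can use the -s and -e options of oclumon dumpnodeview. The timestamps below are illustrative and use the format YYYY-MM-DD HH24:MI:SS:

    $ oclumon dumpnodeview -allnodes -v -s "2016-06-01 22:15:00" -e "2016-06-01 23:15:00" > output.txt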

10.1.4 Cluster Health Advisor

The Cluster Health Advisor daemon collects data from Oracle Real Application Clusters (Oracle RAC), Oracle RAC One Node databases, and the host operating system and hardware resources, to analyze and produce a root cause and action after finding a database or host performance issue.

The user can initiate a calibration process to generate data from which a predictive analysis data set is created. This baseline operation model is then evaluated every five seconds to determine whether an abnormal condition has occurred and is progressing, along with its root cause, details, and corrective actions. This data evidence persists in the Grid Infrastructure Management Repository database for reporting to Oracle Enterprise Manager Cloud Control.
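For example, a calibration run for a database might be started with a command similar to the following. The database name (sales), model name (daytime), and time range are illustrative values, and the exact CHACTL options can vary by release:

    $ chactl calibrate database -db sales -model daytime -timeranges 'start=2016-06-01 09:00:00,end=2016-06-01 17:00:00'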

The Cluster Health Advisor daemon provides system and database administrators with early warning of pending performance issues, and root causes and corrective actions for Oracle RAC databases and cluster nodes. This advanced proactive diagnostic capability will increase availability and performance management.

To track the online health of an Oracle RAC system, Cluster Health Advisor analyzes the system’s measurements (taken as samples of the operational state of a running system) to detect whether the measured values are within an established normal base line, or if they are drifting anomalously beyond their acceptable perimeter. If Cluster Health Advisor detects irregularities in any of the measured values, it invokes an incident diagnostic engine to determine the identity of the underlying issue and generates incident details. The analysis cycle repeats periodically with a fresh set of sampled measurements to track the health status of the system over time.

To view the findings of Cluster Health Advisor, you can use either a graphical interface or a command-line interface:

  • Graphical interface

    • Enterprise Manager Cloud Control

    • A standalone specialized console dedicated to depicting Cluster Health Advisor analysis

  • Command line interface — CHACTL

Example 10-1 Using CHACTL to Review Cluster Health Advisor Findings

Here is an example of using CHACTL to review the findings of Cluster Health Advisor:

$ chactl query diagnosis -start start_time -end end_time
10:05:15    Database A    Disk Failure (instance 1) [detected]
                          Storage Bandwidth Saturation (instance 1) [detected]
10:05:30    Database A    Disk Failure (instance 1) [cleared]
                          Storage Bandwidth Saturation (instance 2) [detected]
                          Storage Bandwidth Saturation (instance 1) [cleared]
            Database B    Storage Bandwidth Saturation (instance 1)
            Host A        Disk Failure

10.1.5 Trace File Analyzer Collector

Trace File Analyzer centralizes and automates the collection of diagnostic information.

10.1.5.1 About the Trace File Analyzer Collector (tfactl)

Trace File Analyzer Collector (TFA) is a diagnostic collection utility to simplify diagnostic data collection on Oracle Grid Infrastructure, Oracle RAC and single-instance Oracle Database systems.

TFA is similar to the diagcollection utility packaged with Oracle Grid Infrastructure in that it collects and packages diagnostic data. However, TFA is much more powerful than diagcollection with its ability to centralize and automate the collection of diagnostic information.

TFA Collector Features include:

  • Simplifies diagnostic data collections

  • Performs cluster-wide diagnostic collection for all components with a single command

  • Trims diagnostic files to include only information from around the incident time

  • Consolidates collected diagnostics on a single node

TFA runs on each node of your cluster or on a single node where no Grid Infrastructure is installed. TFA consists of a daemon, and a command line interface (CLI). The TFA daemon is a Java Virtual Machine (JVM) that by default runs at all times on any configured node and can be identified as TFAMain. The CLI is made up of a Java command line processor that sends commands to the TFAMain secure socket and a perl script tfactl.

When TFA is installed with Oracle Grid Infrastructure, it is installed and configured through the installer, but can be patched through regular Patch Set Updates, or by downloading the latest TFA from My Oracle Support.

When installation completes, a TFAMain JVM runs on each node in the configuration and runs an inventory process to discover files in the trace directories it finds. The inventory process determines the first and last timestamps, as well as the file type, for all files in these directories. Any alert type files discovered (the CRS, Oracle ASM, and Oracle Database alert logs) are then monitored continuously for significant events. If any of those events occur, TFA can automatically gather the relevant diagnostics, if configured to do so; manual collections can be initiated at any time. Automatic diagnostic collections are disabled by default.

10.1.5.2 tfactl Command Summary
tfactl is a command-line interface for managing the Trace File Analyzer Collector (TFA) utility.

File Path

To manage TFA, run tfactl from the Oracle_base/bin directory as the root user, or as any other user authorized to run TFA. Non-root users are only able to run commands that perform diagnostic collections or that make directories they own available for collection. If a user is given sudo access to run tfactl, then that user has access to all commands.

Syntax

tfactl command [options]

tfactl Commands

Command       Description
------------  -----------------------------------------------------------------
diagcollect   Specify the type of information collected by TFA
collection    Manage TFA collections
analyze       Search for a pattern in the database or clusterware alert logs
              that occurred during a specified period of time in the past.
              The default time period is one hour.
ips           Create and manage the contents of a trace file package
run           Run the specified TFA support tool
start [tool]  Starts TFA. If a tool is specified, then it starts the specified
              tool.
stop [tool]   Stops TFA. If a tool is specified, then it stops the specified
              tool.
enable        Enable TFA autostart
disable       Disable TFA autostart
status        Check the run status of the TFA process
print         Print requested details
access        Add, remove, or list TFA users and groups
purge         Delete collections from the TFA repository
directory     Add, remove, or modify a directory in TFA
host          Add or remove a host in TFA
set           Turn on, turn off, or modify various TFA features
toolstatus    Print the status of TFA support tools
uninstall     Uninstall TFA from the local node
diagnosetfa   Collect TFA diagnostics
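For example, the analyze command can search recent database and Oracle Clusterware alert log entries across the cluster. The search string and time window in the following command are illustrative:

    $ tfactl analyze -search "ORA-" -last 8h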

See Also:

My Oracle Support note: TFA Collector - TFA with Database Support Tools Bundle (Doc ID 1513912.1) for a complete list of the options available for each command
10.1.5.3 Using tfactl to Collect Diagnostic Logs for a Specific Time Period

The diagcollection module of TFA has the capability to take a number of different parameters to allow the user to control how large or detailed the required evidence set is.

In this example, you use the -from and -to switches of the tfactl diagcollect command, which tell TFA to collect diagnostic logs of all types from midnight on April 2nd to 13:00 on April 2nd. The command launches the specified diagnostic collection on all cluster nodes and places the results in a zip file for each node in the $TFA_HOME/repository directory on the node where TFA was invoked.

[oracle@docrac1 ~]# ./tfactl diagcollect -from "Apr/02/2016" -to "Apr/02/2016 13:00:00"
Collecting data for all nodes
Scanning files from Apr/02/2016 00:00:00 to Apr/02/2016 13:00:00

Collection Id : 20160402160419racnode1

Repository Location in docrac1 : /u01/app/oracle/tfa/repository
2016/04/02 15:04:21 EDT : Collection Name : tfa_Thu_Apr_2_15_04_19_EDT_2016.zip
2016/04/02 15:04:21 EDT : Sending diagcollect request to host : cehaovmsp102
2016/04/02 15:04:22 EDT : Scanning of files for Collection in progress...
2016/04/02 15:04:22 EDT : Collecting extra files...
2016/04/02 15:04:32 EDT : Getting list of files satisfying time range [04/02/2014 00:00:00 EDT, 04/02/2014 13:00:00 EDT]
2016/04/02 15:04:32 EDT : Starting Thread to identify stored files to collect
2016/04/02 15:04:32 EDT : Getting List of Files to Collect
2016/04/02 15:04:32 EDT : Trimming file : docrac1/u01/app/11.2.0/grid/log/docrac1/alertdocrac1.log with original file size : 9.7MB
2016/04/02 15:04:32 EDT : Trimming file : docrac1/u01/app/11.2.0/grid/log/docrac1/client/oifcfg.log with original file size : 1.2MB
2016/04/02 15:04:32 EDT : Trimming file : docrac1/u01/app/11.2.0/grid/log/docrac1/crfmond/crfmond.log with original file size : 3.4MB
2016/04/02 15:04:32 EDT : Trimming file : docrac1/u01/app/11.2.0/grid/log/docrac1/evmd/evmd.log with original file size : 3.2MB
2016/04/02 15:04:32 EDT : Trimming file : docrac1/u01/app/11.2.0/grid/log/docrac1/gpnpd/gpnpd.log with original file size : 608kB
2016/04/02 15:04:35 EDT : Trimming file : docrac1/rdbms/orcl/ORCL1/trace/alert_ORCL1.log with original file size : 1.4MB
2016/04/02 15:04:35 EDT : Finshed Getting List of Files to Collect
2016/04/02 15:04:35 EDT : Trimming file : docrac1/tnslsnr/docrac1/listener_scan1/trace/listener_scan1.log with original file size : 83MB
2016/04/02 15:04:35 EDT : Trimming file : docrac1/asm/+asm/+ASM1/trace/alert_+ASM1.log with original file size : 623kB
2016/04/02 15:04:35 EDT : Collecting ADR incident files...
2016/04/02 15:04:35 EDT : Waiting for collection of extra files
2016/04/02 15:05:10 EDT : Completed collection of extra files...
2016/04/02 15:05:10 EDT : Completed Zipping of all files
2016/04/02 15:05:10 EDT : Cleaning up temporary files
2016/04/02 15:05:10 EDT : Finished Cleaning up temporary files
2016/04/02 15:05:10 EDT : Finalizing the Collection Zip File
2016/04/02 15:05:11 EDT : Finished Finalizing the Collection Zip File
2016/04/02 15:05:11 EDT : Total Number of Files checked : 17168
2016/04/02 15:05:11 EDT : Total Size of all Files Checked : 2.3GB
2016/04/02 15:05:11 EDT : Number of files containing required range : 29
2016/04/02 15:05:11 EDT : Total Size of Files containing required range : 104MB
2016/04/02 15:05:11 EDT : Number of files trimmed : 8
2016/04/02 15:05:11 EDT : Total Size of data prior to zip : 1.3MB
2016/04/02 15:05:11 EDT : Saved 103MB by trimming files
2016/04/02 15:05:11 EDT : Zip file size : 121kB
2016/04/02 15:05:11 EDT : Total time taken : 49s
2016/04/02 15:05:11 EDT : Remote Collection in Progress...
2016/04/02 15:05:22 EDT : cehaovmsp102:Completed Collection
2016/04/02 15:05:22 EDT : Completed collection of zip files.

Logs are being collected to: /u01/app/oracle/tfa/repository/collection_Thu_Apr_2_15_04_19_EDT_2016_node_all
/u01/app/oracle/tfa/repository/collection_Thu_Apr_2_15_04_19_EDT_2016_node_all/docrac1.tfa_Thu_Apr_2_15_04_19_EDT_2016.zip
/u01/app/oracle/tfa/repository/collection_Thu_Apr_2_15_04_19_EDT_2016_node_all/cehaovmsp102.tfa_Thu_Apr_2_15_04_19_EDT_2016.zip

10.1.6 Hang Manager

Hang Manager can reliably detect and resolve hung database sessions in a timely manner.

Hang Manager is only active when Oracle RAC is enabled. You can determine whether Oracle RAC is enabled for a database by querying the CLUSTER_DATABASE initialization parameter. If this parameter is set to TRUE, then Oracle RAC is enabled on that database.
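For example, you can check this parameter from a SQL*Plus session. The output shown is abbreviated and illustrative:

    SQL> SHOW PARAMETER cluster_database

    NAME                  TYPE        VALUE
    --------------------- ----------- -----
    cluster_database      boolean     TRUE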

Since Oracle Database 11g Release 2 (11.2.0.2), Hang Manager can resolve hangs it detects by terminating the root session or process.  By default, Hang Manager will not terminate an instance or evict a node. Hang Manager does not resolve all hangs that it detects. For example, hangs which involve a possible application issue are left to the user to determine the correct course of action.  Also, if Hang Manager determines that the instance on which the source of a hang resides is experiencing high CPU or IO load, then hang resolution is delayed.  This allows the source session or process time to progress and for the hang to resolve on its own. Hang Manager only resolves hangs on Oracle Database instances; Hang Manager does not resolve any hangs on Oracle ASM instances.

Starting with Oracle Database 12c Release 1, if Oracle Database Quality of Service (QoS) Management is active on the cluster, then Hang Manager uses additional information provided by Oracle Database QoS Management to determine if a hang should be ignored or resolved.

When Hang Manager resolves a hang, it signals an ORA-32701 incident which is visible in the alert log. The alert log message is similar to the following:

ORA-32701: Possible hangs up to hang ID=24 detected
Incident details in: /ee/oracle/oracle_base/diag/rdbms/orcl/orcl1/incident/incdir_1944098/orcl1_dia0_34930694_i1944098.trc
DIA0 terminating blocker (ospid: 28311778 sid: 3398 ser#: 1) of hang with ID = 24
     requested by master DIA0 process on instance 2
     Hang Resolution Reason: Automatic hang resolution was performed to free a
    critical database process.
     by terminating session sid:3398 with serial # 1 (ospid:28311778)

The above message describes which hang is about to be resolved, in this case the hang with ID 24. It also lists the root session ID, serial number, and operating system process ID, and the instance on which the root session resides. Finally, Hang Manager gives a short reason why the hang will be resolved. In this case, a regular user session is blocking a critical database process, so the user session is terminated to allow the critical database process to run.

Note:

An ORA-32701 incident in the alert log does not indicate that there is a problem with Hang Manager. Instead, it indicates that Hang Manager has found a hang which will be resolved by terminating the root session.

You can monitor Hang Manager activity by querying the following data dictionary views:

  • V$HANG_INFO: Contains the active hangs and deadlocks detected by Hang Manager. It does not include any hangs which have been resolved by Hang Manager or have self-resolved. Only the latest 32 active hangs are in this view.

  • V$HANG_SESSION_INFO: Contains the sessions in the main chain for all of the hangs included in the V$HANG_INFO view. Only the first 20 sessions of the main chain, including the root of the hang, for each hang are kept in this view.

  • GV$HANG_STATISTICS or V$HANG_STATISTICS: Contains various Hang Manager statistics regarding hangs or deadlocks which it has detected. These statistics include: the number of hangs detected, the number of deadlocks detected, the number of hangs resolved by Hang Manager, the number of hangs ignored broken down by the reason for ignoring the hang, the number of hangs which self-resolved, and so on.
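As a starting point, the following illustrative SQL*Plus queries show whether any hangs are currently being tracked and list the available hang statistics. Only the view names described above are assumed:

    SQL> SELECT COUNT(*) FROM V$HANG_INFO;
    SQL> SELECT COUNT(*) FROM V$HANG_SESSION_INFO;
    SQL> SELECT * FROM V$HANG_STATISTICS;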

10.1.7 Managing Memory Pressure for Database Servers

Memory Guard detects memory pressure on a server in real time and redirects new sessions to other servers to prevent using all available memory on the stressed server.

Enterprise database servers can use all available memory due to too many open sessions or runaway workloads. Running out of memory can result in failed transactions or, in extreme cases, a restart of the server and the loss of a valuable resource for your applications.

Rerouting new sessions to different servers protects the existing workloads on the memory-stressed server and enables the server to remain available. Memory Guard is a feature of Oracle RAC that manages the memory pressure for servers, adding a new resource protection capability in managing service levels for applications hosted on Oracle RAC databases.

When Oracle Database Quality of Service Management is enabled and managing an Oracle Clusterware server pool, Cluster Health Monitor sends a metrics stream that provides real-time information about memory resources for the cluster servers to Memory Guard. This information includes:

  • Amount of available memory

  • Amount of memory currently in use

If Memory Guard determines that a node is experiencing memory pressure, then the database services managed by Oracle Clusterware are stopped on that node, preventing new connections from being created. After the memory stress is relieved, the services on that node are restarted automatically, and the listener starts sending new connections to that server. The memory pressure can be relieved in several ways (for example, by closing existing sessions or by user intervention).

10.1.8 Oracle Database Quality of Service Management

Oracle Database Quality of Service (QoS) Management is an automated, policy-based product that monitors the workload requests for an entire system.

Oracle Database QoS Management manages the resources that are shared across applications, and adjusts the system configuration to keep the applications running at the performance levels needed by your business. Oracle Database QoS Management responds gracefully to changes in system configuration and demand, thus avoiding additional oscillations in the performance levels of your applications.

Oracle Database QoS Management monitors the performance of each work request on a target system. Oracle Database QoS Management starts to track a work request from the time a work request requests a connection to the database using a database service. The amount of time required to complete a work request, or the response time (also known as the end-to-end response time, or round-trip time), is the time from when the request for data was initiated and when the data request is completed. By accurately measuring the two components of response time, which are the time spent using resources and the time spent waiting to use resources, Oracle Database QoS Management can quickly detect bottlenecks in the system. Oracle Database QoS Management then makes suggestions to reallocate resources to relieve a bottleneck, thus preserving or restoring service levels.

Oracle Database QoS Management manages the resources on your system so that:

  • When sufficient resources are available to meet the demand, business-level performance requirements for your applications are met, even if the workload changes.

  • When sufficient resources are not available to meet the demand, Oracle Database QoS Management attempts to satisfy the more critical business performance requirements at the expense of less critical performance requirements.

Benefits of Using Oracle Database QoS Management

In a typical company, when the response times of your applications are not within acceptable levels, problem resolution can be very slow. Often, the first questions that administrators ask are: "Did we configure the system correctly? Is there a parameter change that fixes the problem? Do we need more hardware?" Unfortunately, these questions are very difficult to answer precisely. The result is often hours of unproductive and frustrating experimentation.

Oracle Database QoS Management provides the following benefits:

  • Reduces the time and expertise requirements for system administrators who manage Oracle Real Application Clusters (Oracle RAC) resources.

  • Helps reduce the number of performance outages.

  • Reduces the time needed to resolve problems that limit or decrease the performance of your applications.

  • Provides stability to the system as the workloads change.

  • Makes the addition or removal of servers transparent to applications.

  • Reduces the impact on the system caused by server failures.

  • Helps ensure that service-level agreements (SLAs) are met.

  • Enables more effective sharing of hardware resources.

Oracle Database QoS Management helps manage the resources that are shared by applications in a cluster. Oracle Database QoS Management can help identify and resolve performance bottlenecks. Oracle Database QoS Management does not diagnose or tune application or database performance issues. When tuning the performance of your applications, the goal is to achieve optimal performance. Oracle Database QoS Management does not seek to make your applications run faster, but instead works to remove obstacles that prevent your applications from running at their optimal performance levels.

10.2 Monitoring Oracle Clusterware Using Enterprise Manager

You can use Enterprise Manager to monitor Oracle Clusterware.

Other monitoring features available with Enterprise Manager include:

  • Viewing the status of Oracle Clusterware on each node of the cluster

  • Receiving notifications if there are any VIP relocations

  • Monitoring the overall throughput across the private interconnect

  • Receiving notifications if nodeapps go down or come up

  • Viewing alerts if a database instance is using the public interface instead of the VIP

  • Monitoring the Clusterware alert log for OCR or voting disk related issues, node evictions, and other clusterware errors

You can also monitor Oracle Clusterware resources and use cluster component health monitoring for individual components. To monitor resources, click the Administration link on the Cluster home page. To monitor the health of individual cluster components, click the All Metrics link in the Related Links section of the Cluster home page.

10.2.1 Accessing the Oracle Clusterware Information

From the Cluster Database Home page, there are several ways to access Oracle Clusterware information.

To access Oracle Clusterware information:

  1. Access the Cluster Database Home page of Oracle Enterprise Manager.

    See Oracle Database 2 Day DBA for details on logging in to Enterprise Manager.

  2. In the General section, click the link next to Cluster to view the Cluster Home page.
  3. On the Cluster Home page, in the Diagnostic Summary section, click the number next to Interconnect Alerts to view the Interconnects subpage for the cluster.

    Alternatively, from the Cluster menu, select Interconnects to display the Interconnects page.

  4. From the Cluster menu, select Configuration, and then select Topology. Click a node in the graphical display to activate the controls.
  5. Click the Interface component. Right-click the Interface component, then choose View Details from the menu to display the Interconnects page for the cluster.
  6. In the Cluster Databases section of the Cluster Home page, select the name of a cluster database to return to the Cluster Database Home page for that database.

10.2.2 Viewing the Cluster Topology Page

The Oracle Enterprise Manager Topology Viewer enables you to visually see the relationships between target types in your cluster.

You can zoom in or out, pan, and see selection details. The Topology Viewer uses individually distinct icons to represent system target types, and standardized visual indicators, such as frames around selections, are used across all target types.

The Topology Viewer populates icons based on your system configuration. If a listener is serving an instance, then a line connects the listener icon and the instance icon. If a cluster database is configured to use Oracle ASM, then the relationship between the cluster Oracle ASM and cluster database appears in the topology, as shown in the following figure:

Figure 10-1 Cluster Topology Page of Oracle Enterprise Manager

Description of Figure 10-1 follows
Description of "Figure 10-1 Cluster Topology Page of Oracle Enterprise Manager"

If the Show Configuration Details option is unchecked, then the topology shows the monitoring view of the environment, which includes general information such as alerts and overall status. If you select the Show Configuration Details option, then additional details are shown in the Selection Details page, which are valid for any topology view. For instance, the listener component would also show the computer name and port number.

You can click an icon and then the right mouse button to display a menu of available actions. Several actions go to pages related to the target type where you can perform monitoring or tuning tasks.

Refer to the Enterprise Manager Online Help for more information about the contents of this page.

10.2.3 Viewing Cluster Health Monitor Data from Enterprise Manager

You can use Oracle Enterprise Manager Cloud Control to view data from Cluster Health Monitor (CHM).

The metric data from Cluster Health Monitor is available in graphical display within Enterprise Manager Cloud Control. Complete cluster views of this data are accessible from the cluster target page. You can view CHM data for the last day in overview format for the entire cluster. The metric categories displayed in the graphs are CPU, Memory and Network.

Each category can be separately displayed in greater detail, showing additional metrics. For example, selecting CPU will result in cluster graphs detailing CPU System Usage, CPU User Usage, and CPU Queue length. From any cluster view you can select individual node views to more closely examine performance of a single server. As in the case of CPU, the performance of each core will be displayed. You can at any time move your cursor along the graph to see a tool-tip displaying the numerical values and time stamp of that point.

Besides examining the current day's performance, you can also review historical data. The amount of historical data is governed by the retention time configured for the CHM repository in the Grid Infrastructure Management Repository. The default is 72 hours. You can view the available historical data by using the View Mode drop-down menu and selecting Historical. A previous date can then be entered or selected from a pop-up calendar that displays the dates where data is available in bold font. Selecting Show Chart then displays the graphs for the selected metrics.

  1. Select the target cluster in Enterprise Manager.
  2. From the Cluster menu, select Cluster Health Monitoring.
    The Cluster Health Monitoring page appears.
  3. View the graphs. To view data from the past, change the View Mode from Real Time to Historical, and select the date to view.
  4. If you change the Chart Type to CPU, Memory, or Network, you can view additional details, such as CPU User Usage, Private Network IO, or Swap Free Memory Percentage.
  5. While viewing the Network or CPU metric graphs, you can click on a node name to view additional details just for that node, such as the Public Network IO traffic by network interface, or the CPU usage for each core.

10.3 Troubleshooting Configuration Problems in Oracle RAC Environments

The Cluster Ready Services Control (CRSCTL) command check enables you to determine the status of several Oracle Clusterware components simultaneously.

Problems can occur when attempting to complete the installation or database creation process manually instead of using the Oracle Database management tools. Other problems occur due to the database administrator or system administrator missing important operating system or cluster configuration steps before installation. Both Oracle Clusterware and Oracle Database components have subcomponents that you can troubleshoot.

10.3.1 About the Oracle Clusterware Alert Log

Oracle Clusterware posts alert messages when important events occur. For example, you might see alert messages from the Cluster Ready Services (CRS) daemon process when it starts, if it aborts, if the failover process fails, or if automatic restart of an Oracle Clusterware resource failed.

Enterprise Manager monitors the Clusterware log file and posts an alert on the Cluster Home page if an error is detected. For example, if a voting disk is not available, then a CRS-1604 error is raised, and a critical alert is posted on the Cluster Home page. You can customize the error detection and alert settings on the Metric and Policy Settings page.

The Oracle Clusterware alert log should be the first place to look for serious errors. It often contains references to other diagnostic logs that provide detailed information on a specific component. The location of the Oracle Clusterware log file is ORACLE_BASE/diag/crs/hostname/crs/trace/alert.log, where ORACLE_BASE is the Oracle base path you specified when you installed Oracle Grid Infrastructure and hostname is the name of the host.
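For example, assuming an Oracle base of /u01/app/oracle and a host named racnode1 (both illustrative values), you can follow new alert log entries as they are written:

    $ tail -f /u01/app/oracle/diag/crs/racnode1/crs/trace/alert.log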

10.3.2 About the Oracle Clusterware Component Log Files

Starting with Oracle Clusterware 12c release 1 (12.1.0.2), diagnostic data files written by Oracle Clusterware programs are known as trace files and have a .trc file extension, and appear together in the trace subdirectory of the Automatic Diagnostic Repository (ADR) home.

Besides trace files, the trace subdirectory in the Oracle Clusterware ADR home contains the simple text Oracle Clusterware alert log. It always has the name alert.log. The alert log is also written as an XML file in the alert subdirectory of the ADR home, but the text alert log is most easily read.

10.3.3 Using CRSCTL to Diagnose Cluster Issues

You can use CRSCTL commands as the root operating system user to diagnose problems with your Oracle Clusterware installation, or to enable dynamic debugging for Oracle Clusterware.

10.3.3.1 Checking the Status of the Oracle Clusterware Installation

Use the CRSCTL check command to display the status of an Oracle Clusterware component or daemon.

To determine the condition of your Oracle Clusterware installation:

  1. Log in to the operating system as the root user in a command window.
  2. Use CRSCTL to check the status of Oracle Clusterware using the following command:
    # crsctl check cluster -all
    
  3. Check the status of an individual Oracle Clusterware daemon using the following syntax, where daemon is crsd, cssd, or evmd:
    # crsctl check daemon
    
  4. To list the status of all Oracle Clusterware resources running on any node in the cluster, use the following command:
    # crsctl status resource -t
    

    This command lists the status of all registered Oracle Clusterware resources, which includes the VIPs, listeners, databases, services, and Oracle ASM instances and disk groups in a tabular form.
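For example, the crsctl check cluster -all command shown in step 2 returns output similar to the following; the node names are illustrative and the exact messages can vary by release:

    # crsctl check cluster -all
    **************************************************************
    racnode1:
    CRS-4537: Cluster Ready Services is online
    CRS-4529: Cluster Synchronization Services is online
    CRS-4533: Event Manager is online
    **************************************************************
    racnode2:
    CRS-4537: Cluster Ready Services is online
    CRS-4529: Cluster Synchronization Services is online
    CRS-4533: Event Manager is online
    **************************************************************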

10.3.3.2 Enabling Debugging of Oracle Clusterware Components

You can enable debugging for the Oracle Clusterware daemons, the Event Manager (EVM), and their modules by running CRSCTL commands.

To enable debugging of Oracle Clusterware components:

  1. In a command window, log in to the operating system as the root user.
  2. Use the following command to obtain the module names for a component, where component_name is crs, evm, css or the name of the component for which you want to enable debugging:
    # crsctl lsmodules component_name
    

    For example, viewing the modules of the css component might return the following results:

    # crsctl lsmodules css
    The following are the CSS modules :: 
    CSSD
    COMMCRS
    COMMNS
    
  3. Use CRSCTL as follows, where component_name is the name of the Oracle Clusterware component for which you want to enable debugging, module is the name of module, and debugging_level is a number from 1 to 5:
    # crsctl set log component module:debugging_level
    

    For example, to enable the lowest level of tracing for the CSSD module of the css component, you would use the following command:

    # crsctl set log css CSSD:1
10.3.3.3 Enabling Debugging for an Oracle Clusterware Resource

You can use CRSCTL commands to enable debugging for a resource managed by Oracle Clusterware.

To enable debugging of an Oracle Clusterware resource:

  1. In a command window, log in to the operating system as the root user.
  2. Obtain a list of the resources available for debugging by running the following command:
    # crsctl status resource
    
  3. Run the following command to enable debugging, where resource_name is the name of an Oracle Clusterware resource, such as ora.racnode1.vip, and debugging_level is a number from 1 to 5:
    # crsctl set log crs "resource_name:debugging_level"
    
10.3.3.4 Enabling Oracle Clusterware Daemons

When the Oracle Clusterware daemons are enabled, they start automatically when the node is started.

To enable automatic startup for all Oracle Clusterware daemons:

  1. In a command window, log in to the operating system as the root user.
  2. Run the following CRSCTL command:
    # crsctl enable crs
    
10.3.3.5 Disabling Oracle Clusterware Daemons

To prevent the daemons from starting automatically, you can disable them using crsctl commands.

To disable automatic startup for all Oracle Clusterware daemons:

  1. In a command window, log in to the operating system as the root user.
  2. Run the following CRSCTL command:
    # crsctl disable crs
    

10.3.4 Viewing Oracle RAC Database Alert Log Messages

An alert log is created for each instance in a cluster database.

To view the alert log for an Oracle RAC database instance:

  1. On the Cluster Database Home page, click the link for Instances.
  2. Click the name of the instance for which you want to view the alert log.

    The Cluster Database Instance Home page appears.

  3. From the Oracle Database menu, select Logs, then select Alert Log Errors.

    The Alert Log Errors page appears.

  4. Optional: Click XML Alert Log Contents in the Related Links section to view all the entries in the alert log.

    On the View Alert Log Contents page, click Go to view the most recent entries, or you can enter your own search criteria.

See Also:

Oracle Real Application Clusters Administration and Deployment Guide for more information about where to find files for analyzing errors in Oracle RAC