1.1 Oracle Autonomous Health Framework Problem and Solution Space

Oracle Autonomous Health Framework (AHF) maximizes availability and performance by enforcing best practices, capturing data at first failure, and monitoring the whole system (server, database, I/O, and network) to discover issues proactively and notify the user. It also speeds up bug resolution by automatically suggesting fixes after a failure.

System administrators can use most of the components in Oracle Autonomous Health Framework interactively during installation, patching, and upgrading. Database administrators can use Oracle Autonomous Health Framework to diagnose operational runtime issues and mitigate the impact of these issues.

1.1.1 Availability Issues

Availability issues are runtime issues that threaten the availability of the software stack.

Availability issues can result from either software issues (Oracle Database, Oracle Grid Infrastructure, operating system) or the underlying hardware resources (CPU, Memory, Network, Storage).

The components within Oracle Autonomous Health Framework address the following availability issues:

Examples of Server Availability Issues

Server availability issues can cause a server to be evicted from the cluster and shut down all the database instances that are running on the server.

Examples of such issues are:

  • Issue: Network congestion on the private interconnect can cause time-critical internode or storage I/O to have excessive latency or dropped packets. This type of failure typically builds up over time, so it can be detected early and then corrected or relieved.

    Solution: If a change in the server configuration causes this issue, then Cluster Verification Utility (CVU) detects it if the issue persists for more than an hour. However, Oracle Cluster Health Advisor detects the issue within minutes and presents corrective actions.

  • Issue: Network failures on the private interconnect caused by a pulled cable or failed network interface card (NIC) can immediately result in evicted nodes.

    Solution: Although these types of network failures cannot be detected early, the cause can be narrowed down by using Cluster Health Monitor and Oracle Trace File Analyzer to pinpoint the time of the failure and the network interfaces involved.

Examples of Database Availability Issues

Database availability issues can cause an Oracle database or one of the instances of the database to become unresponsive and thus unavailable to users.

Examples of such issues are:

  • Issue: Runaway queries or delays can deny critical database resources such as locks, latches, or CPU to other sessions. Denial of critical database resources results in a database or a database instance becoming unresponsive to applications.

    Solution: Blocker Resolver detects and automatically resolves these types of delays. Also, Oracle Cluster Health Advisor detects, identifies, and notifies the database administrator of such delays and provides an appropriate corrective action.

  • Issue: Denial-of-service (DoS) attacks, vulnerabilities, or simply software bugs can cause a database or a database instance to be unresponsive.

    Solution: Proactive recommendations of known issues and their resolutions provided by Oracle Orachk can prevent such occurrences. If these issues are not prevented, then automatic collection of logs by Oracle Trace File Analyzer, in addition to data collected by Cluster Health Monitor, can speed up the correction of these issues.

  • Issue: Configuration changes can cause database outages that are difficult to troubleshoot. For example, incorrect permissions on the oracle.bin file can prevent session processes from being created.

    Solution: Use Cluster Verification Utility and Oracle Orachk to speed up identification and correction of these types of issues. You can generate a diff report using Oracle Orachk to see a baseline comparison of two reports and a list of differences. You can also view configuration reports created by Cluster Verification Utility to verify whether your system meets the criteria for an Oracle installation.
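
    The idea behind a baseline comparison can be illustrated with a minimal Python sketch. It is not Oracle Orachk's implementation or report format: the check names, the PASS/WARN/FAIL statuses, and the diff_reports function are hypothetical and are used only to show what a list of differences between two runs conveys.

```python
def diff_reports(baseline, current):
    """Compare two {check_name: status} mappings (hypothetical report shape).

    Returns the checks whose status changed, the checks that appear only
    in the current run, and the checks that appear only in the baseline.
    """
    changed = {
        name: (baseline[name], current[name])
        for name in baseline.keys() & current.keys()
        if baseline[name] != current[name]
    }
    added = sorted(current.keys() - baseline.keys())
    removed = sorted(baseline.keys() - current.keys())
    return {"changed": changed, "added": added, "removed": removed}


# Hypothetical check results from two runs, before and after a configuration change.
baseline = {
    "oracle_binary_permissions": "PASS",
    "hugepages_configured": "PASS",
    "ntp_sync": "WARN",
}
current = {
    "oracle_binary_permissions": "FAIL",
    "hugepages_configured": "PASS",
    "swap_size": "PASS",
}

result = diff_reports(baseline, current)
for name, (before, after) in sorted(result["changed"].items()):
    print(f"{name}: {before} -> {after}")
```

    In this sketch, a check that moves from PASS to FAIL between two runs points directly at the configuration change that deserves attention first.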

1.1.2 Performance Issues

Performance issues are runtime issues that threaten the performance of the system.

Performance issues can result from either software issues (bugs, configuration problems, data contention, and so on) or client issues (demand, query types, connection management, and so on).

Server and database performance issues are intertwined and difficult to separate. It is easier to categorize them by their origin: database server or client.

Examples of Database Server Performance Issues

  • Issue: Deviations from best practices in configuration can cause database server performance issues.

    Solution: When run periodically, Oracle Orachk detects configuration issues and notifies the database administrator of the appropriate corrective settings.

  • Issue: A session can slow down other sessions that are waiting for it to release a resource or complete its work.

    Solution: Blocker Resolver detects these chains of sessions and automatically terminates the root holder session to relieve the bottleneck.

  • Issue: Unresolved known issues or unpatched bugs can cause database server performance issues.

    Solution: These issues can be detected through the automatic Oracle Orachk reports and flagged with associated patches or workarounds. Oracle Orachk is regularly enhanced to include new critical issues, either in existing products or in new product areas.
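
    The chains of blocked sessions mentioned in the second example above can be pictured as a wait-for graph in which each waiting session points to the session that blocks it. The following Python sketch walks each chain to its root holder. It is only an illustration, assuming a simple mapping of hypothetical session IDs; it does not reproduce how Blocker Resolver detects, verifies, or resolves blocking sessions.

```python
def find_root_holders(waits_for):
    """Group waiting sessions by the root holder at the head of their chain.

    waits_for maps a waiting session ID to the session ID that blocks it
    (a hypothetical input shape). A root holder is a session that blocks
    others but is not itself waiting. Sessions caught in a wait cycle have
    no root holder and are skipped.
    """
    roots = {}
    for waiter in waits_for:
        seen = {waiter}
        holder = waits_for[waiter]
        # Follow the chain until reaching a session that is not waiting,
        # or until a session repeats (which indicates a cycle).
        while holder in waits_for and holder not in seen:
            seen.add(holder)
            holder = waits_for[holder]
        if holder in seen:
            continue  # cycle detected: no root holder for this waiter
        roots.setdefault(holder, []).append(waiter)
    return {root: sorted(blocked) for root, blocked in roots.items()}


# Hypothetical sessions: 30 waits for 20, which waits for 10; 40 also waits for 10.
waits_for = {20: 10, 30: 20, 40: 10, 60: 50}
root_holders = find_root_holders(waits_for)
print(root_holders)
```

    Here session 10 is the root holder for sessions 20, 30, and 40, so relieving that one session unblocks the whole chain.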

Examples of Performance Issues Caused by Database Client

  • Issue: Misconfigured parameters such as SGA and PGA allocation, number of sessions or processes, CPU counts, and so on, can cause database performance degradation.

    Solution: Oracle Orachk detects the misconfigured settings and Oracle Cluster Health Advisor detects their consequences. Both notify you automatically with recommended corrective actions.