1
Overview of High Availability

This chapter contains the following sections:

Introduction to High Availability

Databases and the Internet have enabled worldwide collaboration and information sharing by extending the reach of database applications throughout organizations and communities. This reach emphasizes the importance of high availability (HA) in data management solutions. Both small businesses and global enterprises have users all over the world who require access to data 24 hours a day. Without this data access, operations can stop, and revenue is lost. Users, who have become more dependent upon their solutions, now demand service-level agreements from their Information Technology (IT) departments and solutions providers. Increasingly, availability is measured in dollars, euros, and yen, not just in time and convenience.

Enterprises have used their IT infrastructure to provide a competitive advantage, increase productivity, and empower users to make faster and more informed decisions. However, with these benefits has come an increasing dependence on that infrastructure. If a critical application becomes unavailable, then the entire business can be in jeopardy. Revenue and customers can be lost, penalties can be owed, and bad publicity can have a lasting effect on customers and a company's stock price. It is critical to examine the factors that determine how your data is protected and maximize the availability to your users.

What is Availability?

Availability is the degree to which an application or service is available when, and with the functionality, users expect. Availability is measured by the perception of an application's end user. End users experience frustration when their data is unavailable, and they do not understand or care to differentiate between the complex components of an overall solution. Performance failures due to higher than expected usage create the same havoc as the failure of critical components in the solution.

Reliability, recoverability, and continuous operations are three characteristics of a highly available solution:

Reliability: Reliable hardware is one component of an HA solution. Reliable software, including the database, web servers, and application, is as critical to implementing a highly available solution.
Recoverability: There may be many choices in recovering from a failure if one occurs. It is important to determine what types of failures may occur in your high availability environment and how to recover from those failure in the time that meets your business requirements. For example, if a critical table is accidentally deleted from the database, what action would you take to recover it? Does your architecture provide you the ability to recover in the time specified in a service level agreement (SLA)?
Error Detection: If a component in your architecture fails, then fast detection is another essential component in recovering from a possible unexpected failure. While you may be able to recover quickly from an outage, if it takes an additional 90 minutes to discover the problem, then you may not meet your SLA. Monitoring the health of your environment requires reliable software to view it quickly and the ability to notify the DBA of a problem.
Continuous operations: Continuous access to your data is essential when very little or no downtime is acceptable to perform maintenance activities. Such activities as moving a table to another location within the database or even adding additional CPUs to your hardware should be transparent to the end user in an HA architecture.

An HA architecture should have the following characteristics:

Be transparent to most failures
Provide built-in preventative measures
Provide proactive monitoring and fast detection of failures
Provide fast recoverability
Automate the recovery operation
Protect the data so that there is minimal or no data loss
Implement the operational best practices to manage your environment
Provide the HA solution to meet your SLA

Importance of Availability

The importance of high availability varies among applications. However, the need to deliver increasing levels of availability continues to accelerate as enterprises re-engineer their solutions to gain competitive advantage. Most often these new solutions rely on immediate access to critical business data. When data is not available, the operation can cease to function. Downtime can lead to lost productivity, lost revenue, damaged customer relationships, bad publicity, and lawsuits.

If a mission-critical application becomes unavailable, then the enterprise is placed in jeopardy. It is not always easy to place a direct cost on downtime. Angry customers, idle employees, and bad publicity are all costly, but not directly measured in currency. On the other hand, lost revenue and legal penalties incurred because SLA objectives are not met can easily be quantified. The cost of downtime can quickly grow in industries that are dependent upon their solutions to provide service.

Other factors to consider in the cost of downtime are the maximum tolerable length of a single unplanned outage, and the maximum frequency of allowable incidents. If the event lasts less than 30 seconds, then it may cause very little impact and may be barely perceptible to end users. As the length of the outage grows, the effect grows from annoyance at a minor problem into a big problem, and possibly into legal action. When designing a solution, it is important to take into account these issues and to determine the true cost of downtime. An organization should then spend reasonable amounts of money to build solutions that are highly available, yet whose cost is justified. High availability solutions are effective insurance policies.

Oracle makes high availability solutions available to every customer regardless of size. Small workgroups and global enterprises alike are able to extend the reach of their critical business applications. With Oracle and the Internet, applications and their data are now reliably accessible everywhere, at any time.

Causes of Downtime

One of the challenges in designing an HA solution is examining and addressing all the possible causes of downtime. It is important to consider causes of both unplanned and planned downtime. Unplanned downtime includes computer failures and data failures. Data failures can be caused by storage failure, human error, corruption, and site failure. Planned downtime includes system changes and data changes. Planned downtime can be just as disruptive to operations, especially in global enterprises that support users in multiple time zones.

See Also:

Chapter 9, "Recovering from Outages" for a more detailed discussion of unplanned and planned outages

What Does This Book Contain?

Choosing and implementing the architecture that best fits your availability requirements can be a daunting task. This architecture must encompass redundancy across all components, achieve fast client failover for all types of downtime, provide consistent high performance, and provide protection from user errors, corruptions, and site disasters, while being easy to deploy, manage, and scale.

This book describes several HA architectures and provides guidelines for choosing the one that is best for your requirements. It also describes practices that are essential to maintaining an HA environment regardless of the architecture that is chosen. It also describes types of outages and provides a blueprint for detecting and resolving the outages.

Who Should Read This Book?

Knowledge of the Oracle Database server, Real Application Clusters and Oracle Data Guard terminology is required to understand the configuration and implementation details.

See Also:

Oracle Database Concepts
The Oracle Data Guard documentation set
The Real Application Clusters documentation set

Chief technology officers and information technology architects can benefit from reading the following chapters:

Information technology architects can also find essential information in the following chapters:

Database administrators can find useful information in the following chapters:

Network administrators and application administrators can find useful information in the following chapters:

1 Overview of High Availability