16 High Availability Solutions

Highly available systems are critical to the success of virtually every business today. It is equally important that the management infrastructure monitoring these mission-critical systems is itself highly available. The Enterprise Manager Grid Control architecture is engineered to be scalable and available from the ground up. It is designed so that you can concentrate on managing the assets that support your business, while it takes care of meeting your business Service Level Agreements.

When you configure Grid Control for high availability, your aim is to protect each component of the system, as well as the flow of management data in case of performance or availability problems, such as a failure of a host or a Management Service.

Maximum Availability Architecture (MAA) provides a highly available Enterprise Manager implementation by guarding against failure at each component of Enterprise Manager.

The impacts of failure of the different Enterprise Manager components are:

Management Agent failure or failure in the communication between Management Agents and Management Services

Results in targets no longer being monitored by Enterprise Manager, although the Enterprise Manager console is still available and you can view historical data from the Management Repository.

Management Service failure

Results in the unavailability of the Enterprise Manager console, as well as the unavailability of almost all Enterprise Manager services.

Management Repository failure

Results in Enterprise Manager being unable to save the data uploaded by the Management Agents, as well as the unavailability of almost all Enterprise Manager services.

Overall, failure in any component of Enterprise Manager can result in substantial service disruption. Therefore it is essential that each component be hardened using a highly available architecture.

Latest High Availability Information

Because technology changes rapidly, and because high availability implementations extend beyond the realm of Oracle Enterprise Manager, check the following resources regularly for the latest information on third-party integration with Oracle's high availability solutions (F5 or third-party clusterware, for example).

Oracle Maximum Availability Architecture Website

http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
Support Note 330072.1: "How To Configure Grid Control Components for High Availability"

Defining High Availability

Oracle Enterprise Manager's flexible, distributed architecture permits a wide range of deployment configurations, allowing it to meet the monitoring and management needs of your business, as well as allowing for expansion as business needs dictate.

For this reason, high availability for Enterprise Manager cannot be narrowly defined as a single implementation. It is instead a range of protection levels based on your available resources, Oracle technology, and best practices that safeguard the investment in your IT infrastructure. Depending on your Enterprise Manager deployment and business needs, you can implement the level of high availability necessary to sustain your business. High availability for Enterprise Manager can be categorized into four levels. Each level builds on the previous one, increasing implementation cost and complexity but also incrementally increasing the level of availability.

Levels of High Availability

Each high availability solution level is driven by your business requirements and available IT resources. The following table summarizes the four high availability levels for Oracle Enterprise Manager installations.

Table 16-1 Enterprise Manager Availability Levels

High Availability Level	Business Need	Hardware Requirement
Level 1	Responsiveness to business application events.	Well-tuned, single instance (one host) with redundant storage. 11.1.0.7 RDBMS + Protected Storage
Level 2	Ability to ensure business application service quality.	Cold Failover Cluster configuration using Data Guard on a single site. Level 1 + Data Guard
Level 3	Operational overhead and heavy costs of manual processes.	n-Instance RAC with local site Data Guard (protection against limited site failure). Level 2 + Primary site on RAC
Level 4	Revenue impact on loss of key business services and applications.	n-Instance RAC with secondary site Data Guard (protection against site loss). Level 3 + Data Guard on a remote site.

Note:

Levels 3 and 4 are not covered in this manual. For more information, refer to the Real Application Clusters and Data Guard documentation.

Determining Your High Availability Needs

As previously mentioned, the availability level you choose depends on factors such as the hardware resources available and the business need of your organization. However, developing your high availability plan in a way that objectively encompasses all aspects of your high availability needs (hardware, business processes, effort, cost) can be problematic. The solution is to define high availability needs in terms of Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Recovery Time Objective - The period of time within which your business process or technological resources must be restored after failure. Key Question: How fast do your business processes/resources need to be running again before the bottom line is impacted?
Recovery Point Objective - The period of time between the last backup and the time of failure. Key Question: How much data are you willing to lose?

Defining your high availability needs in terms of RTO and RPO allows you to effectively meet the demands of users. Both values should be determined using the worst-case scenarios.
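As a rough illustration of the two definitions above, the following Python sketch computes the recovery point and recovery time actually achieved for a single incident and compares them with target values. The timestamps, targets, and function names are hypothetical and are not part of Enterprise Manager.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time between the last backup and the failure."""
    return failure - last_backup


def achieved_rto(failure: datetime, restored: datetime) -> timedelta:
    """Outage length: time between the failure and restoration of service."""
    return restored - failure


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 11, 0)

# Hypothetical targets agreed with the business.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=2)

rpo = achieved_rpo(last_backup, failure)
rto = achieved_rto(failure, restored)

print("Achieved RPO:", rpo, "- target met:", rpo <= rpo_target)
print("Achieved RTO:", rto, "- target met:", rto <= rto_target)
```

In this made-up timeline the recovery time is within its target, but the last backup is too old to satisfy the recovery point target, which would suggest more frequent backups or a higher availability level.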

RTO, RPO, and Availability Levels

Given the broad range of factors that must be taken into consideration when implementing a highly available Enterprise Manager environment, your ultimate decision will be based on the interrelationship between RTO, RPO and the cost involved with implementing one of the availability levels. The following table shows the interrelationship between these factors.

Table 16-2 Comparison of High Availability Levels

Level	RTO	RPO	Build Time	Cost
1	98.0%	Hours	Hours to Days	$
2	98.8%	Minutes	Hours to Days	$$
3	99.9%	Minutes to Seconds	Days	$$$
4	99.9%	Minutes to Seconds	Days	$$$$
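
The percentages in the RTO column can be easier to compare when converted into a yearly downtime budget. The following Python sketch does that arithmetic. It assumes the figures express uptime over a 365-day year; that interpretation and the names used are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Percentages copied from the RTO column of Table 16-2.
LEVEL_AVAILABILITY = {1: 98.0, 2: 98.8, 3: 99.9, 4: 99.9}

for level, percent in LEVEL_AVAILABILITY.items():
    hours = downtime_hours_per_year(percent)
    print(f"Level {level}: {percent}% -> about {hours:.1f} hours of downtime per year")
```

Under these assumptions, 98.0% corresponds to roughly a week of downtime per year, while 99.9% corresponds to less than nine hours.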

The table is not a prescriptive recommendation for choosing a high availability level. Instead, use it to aid your decision-making process based on your business needs. For example, if you have an uptime requirement of 95% and a desired mean time to recovery of seconds, then you should select Level 4.

The table does not reflect factors such as survivability and scalability. Hence, although the differences between Level 3 and Level 4 seem outwardly insignificant, they are real. If you need survivability in the event of a primary site loss, you need a Level 4 architecture. Level 4 is also essential if you need equalized performance in the event of site loss: a Level 3 architecture with an asymmetrically scaled Data Guard standby will suffer degraded performance when the standby is activated. To maintain performance levels, you need Level 4 with a symmetrically sized architecture on both sites. This is particularly true if you want to run through planned failover routines in which you actively run on the primary or secondary site for extended periods of time. Some financial institutions, for example, mandate this as part of their operating procedures.
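
One way to use the table together with the survivability consideration is as a first-pass filter that finds the lowest level meeting a set of stated requirements. The Python sketch below is only an illustration of that idea, not an Oracle tool or a prescriptive rule. The `Level` class, the RPO ranking, and the `survives_site_loss` flag are hypothetical encodings of Table 16-2 and of the site-loss protection described for Level 4.

```python
from dataclasses import dataclass
from typing import Optional

# Coarse ranking of the RPO column in Table 16-2 (higher is stricter).
RPO_HOURS, RPO_MINUTES, RPO_MINUTES_TO_SECONDS = 0, 1, 2


@dataclass(frozen=True)
class Level:
    number: int
    uptime_percent: float      # percentage listed in the RTO column
    rpo_rank: int              # one of the RPO_* constants above
    survives_site_loss: bool   # assumption: only Level 4 has a remote site


LEVELS = [
    Level(1, 98.0, RPO_HOURS, False),
    Level(2, 98.8, RPO_MINUTES, False),
    Level(3, 99.9, RPO_MINUTES_TO_SECONDS, False),
    Level(4, 99.9, RPO_MINUTES_TO_SECONDS, True),
]


def candidate_level(required_uptime: float,
                    required_rpo_rank: int,
                    need_site_survivability: bool) -> Optional[int]:
    """Return the lowest level meeting all requirements, or None."""
    for level in LEVELS:
        if (level.uptime_percent >= required_uptime
                and level.rpo_rank >= required_rpo_rank
                and (level.survives_site_loss or not need_site_survivability)):
            return level.number
    return None


print(candidate_level(99.5, RPO_MINUTES, need_site_survivability=False))
print(candidate_level(99.5, RPO_MINUTES, need_site_survivability=True))
```

The sketch deliberately ignores cost, build time, and the symmetric-sizing question, so treat its result as a starting point for discussion rather than a final choice.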