22 High Availability Solutions

Highly available systems are critical to the success of virtually every business today. It is equally important that the management infrastructure monitoring these mission-critical systems is itself highly available. The Enterprise Manager Cloud Control architecture is engineered to be scalable and available from the ground up. It is designed so that you can concentrate on managing the assets that support your business, while it takes care of meeting your business Service Level Agreements.

When you configure Cloud Control for high availability, your aim is to protect each component of the system, as well as the flow of management data in case of performance or availability problems, such as a failure of a host or a Management Service.

Maximum Availability Architecture (MAA) provides a highly available Enterprise Manager implementation by guarding against failure at each component of Enterprise Manager.

The impacts of failure of the different Enterprise Manager components are:

  • Management Agent failure or failure in the communication between Management Agents and Management Service

    Results in targets monitored by the agent no longer being monitored by Enterprise Manager.

  • Management Service failure

    Results in downtime for Enterprise Manager.

  • Management Repository failure

    Results in downtime for Enterprise Manager.

  • Software Library failure

    Results in a subset of Enterprise Manager operations being unavailable. These operations include self-update, provisioning, and patching, including Agent deployment.

Overall, failure in any component of Enterprise Manager can result in substantial service disruption. Therefore it is essential that each component be hardened using a highly available architecture.

Note:

For information about setting up a high availability solution for BI Publisher, see BI Publisher High Availability.

Latest High Availability Information

Because technology changes rapidly, and because high availability implementations extend beyond Oracle Enterprise Manager, check the following resources regularly for the latest information on third-party integration with Oracle's high availability solutions (F5 or third-party clusterware, for example).

Defining High Availability

Oracle Enterprise Manager's flexible, distributed architecture permits a wide range of deployment configurations, allowing it to meet the monitoring and management needs of your business, as well as allowing for expansion as business needs dictate.

For this reason, high availability for Enterprise Manager cannot be narrowly defined as a single implementation. Rather, it is a range of protection levels based on your available resources, Oracle technology, and best practices that safeguard the investment in your IT infrastructure. Depending on your Enterprise Manager deployment and business needs, you can implement the level of high availability necessary to sustain your business. High availability for Enterprise Manager can be categorized into four levels. Each level builds on the previous one, increasing implementation cost and complexity but also incrementally increasing the level of availability.

Levels of High Availability

Each high availability solution level is driven by your business requirements and available IT resources. Note, however, that the levels represent only a subset of possible deployments, chosen because they are useful for presenting the available options. Your IT organization will likely deploy its own configuration, which need not exactly match one of the levels.

The following table summarizes four example high availability levels for Oracle Enterprise Manager installations as well as general resource requirements.

Table 22-1 Enterprise Manager High Availability Levels

| Level | Description | Minimum Number of Nodes | Recommended Number of Nodes | Load Balancer Requirements |
|---|---|---|---|---|
| Level 1 | OMS and repository database. Each resides on its own host with no failover. | 1 | 2 | None |
| Level 2 | OMS installed on shared storage with VIP-based failover. The database uses Local Data Guard. | 2 | 4 | None |
| Level 3 | OMS in Active/Active configuration. The database uses RAC + Local Data Guard. | 3 | 5 | Local Load Balancer |
| Level 4 | OMS on the primary site in Active/Active configuration. Repository deployed using Oracle RAC. Duplicate hardware deployed at the standby site. DR for OMS and Software Library using storage replication between primary and standby sites. Database DR using Oracle Data Guard. Note: Level 4 is an MAA best practice, achieving highest availability in the most cost-effective, simple architecture. | 4 | 8 | Required: Local Load Balancer for each site. Optional: Global Load Balancer |

Comparing Availability Levels

The following tables compare the protection levels and recovery times for the various HA levels.

Table 22-2 High Availability Levels of Protection

| Level | OMS Host Failure | OMS Storage Failure | Database Host Failure | Database Storage Failure | Site Failure/Disaster Recovery |
|---|---|---|---|---|---|
| Level 1 | No | No | No | No | No |
| Level 2 | Yes | No | Yes | Yes | No |
| Level 3 | Yes | Yes | Yes | Yes | No |
| Level 4 | Yes | Yes | Yes | Yes | Yes |
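The protection matrix in Table 22-2 can be read as a lookup: given the failure types you must survive, find the lowest level that covers all of them. The following Python sketch illustrates that reading. The dictionary keys (`oms_host`, `db_storage`, and so on) and the function name `minimum_level` are hypothetical names chosen for this example; they are not part of any Enterprise Manager API or tool.

```python
# Protection matrix transcribed from Table 22-2.
# All identifiers here are illustrative assumptions, not Enterprise Manager names.
PROTECTION = {
    1: set(),
    2: {"oms_host", "db_host", "db_storage"},
    3: {"oms_host", "oms_storage", "db_host", "db_storage"},
    4: {"oms_host", "oms_storage", "db_host", "db_storage", "site"},
}


def minimum_level(required):
    """Return the lowest HA level that protects against every required failure type."""
    required = set(required)
    unknown = required - PROTECTION[4]
    if unknown:
        raise ValueError(f"unknown failure types: {sorted(unknown)}")
    for level in sorted(PROTECTION):
        if required <= PROTECTION[level]:
            return level
    # Not reachable: Level 4 covers every known failure type.
    raise AssertionError("no level satisfies the requirements")


print(minimum_level(["oms_host", "db_storage"]))  # 2
print(minimum_level(["oms_storage"]))             # 3
print(minimum_level(["site"]))                    # 4
```

Because each level builds on the previous one, scanning the levels in ascending order returns the least costly level that meets the stated requirements.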

Table 22-3 High Availability Level Recovery Times

| Level | Node Failure | Local Storage Failure | Site Failure | Cost |
|---|---|---|---|---|
| Level 1 | Hours-Days | Hours-Days | Hours-Days | $ |
| Level 2 | Minutes | Hours-Days | Hours-Days | $$ |
| Level 3 | No Outage | Minutes | Hours-Days | $$$ |
| Level 4 | No Outage | Minutes | Minutes | $$$$ |
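As a rough illustration of how Table 22-3 trades recovery time against cost, the following Python sketch returns the lowest-cost level whose recovery time is no worse than a stated limit for each failure type. The ordering of the recovery classes (No Outage, then Minutes, then Hours-Days) and all identifiers are assumptions made for this example, not Enterprise Manager terminology.

```python
# Recovery-time classes ordered from best to worst (assumed ordering).
RANK = {"No Outage": 0, "Minutes": 1, "Hours-Days": 2}

# Recovery times transcribed from Table 22-3; keys are illustrative.
RECOVERY = {
    1: {"node": "Hours-Days", "local_storage": "Hours-Days", "site": "Hours-Days"},
    2: {"node": "Minutes", "local_storage": "Hours-Days", "site": "Hours-Days"},
    3: {"node": "No Outage", "local_storage": "Minutes", "site": "Hours-Days"},
    4: {"node": "No Outage", "local_storage": "Minutes", "site": "Minutes"},
}

# Relative cost from Table 22-3; cost rises with the level number.
COST = {1: "$", 2: "$$", 3: "$$$", 4: "$$$$"}


def lowest_cost_level(worst_acceptable):
    """Return the lowest level meeting every recovery-time limit, or None.

    worst_acceptable maps a failure type to the slowest recovery class
    that is still tolerable, for example {"node": "Minutes"}.
    """
    for level in sorted(RECOVERY):
        times = RECOVERY[level]
        if all(RANK[times[failure]] <= RANK[limit]
               for failure, limit in worst_acceptable.items()):
            return level
    return None


level = lowest_cost_level({"node": "No Outage", "site": "Hours-Days"})
print(level, COST[level])  # 3 $$$
```

A result of `None` means that no level in the table meets the limits. For example, none of the four levels offers a site failure recovery class of No Outage.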

One measure that is not represented in the tables is scalability. Levels 3 and 4 provide the ability to scale the Enterprise Manager installation as business needs grow. The repository, running as a RAC database, can be scaled up by adding new nodes to the RAC cluster, and the Management Service tier can be scaled by adding more OMS servers.

If you need equivalent performance after failover to a standby deployment, whether that is a local standby database or a Level 4 standby site including a standby RAC database and standby OMS servers, it is essential to ensure that the deployments on both sites are symmetrically scaled. This is particularly true if you want to run through planned failover routines where you actively run on the primary or secondary site for extended periods of time. For example, some financial institutions mandate this as part of their operating procedures.

If you need to survive the loss of the primary site, you must deploy a Level 4 architecture.