18 High Availability: Multiple Resource Configurations

Multiple resource configurations add redundancy to your Enterprise Manager installation, thus creating a higher level of availability. Using multiple hosts, Enterprise Manager components can be replicated and configured for failover. This not only increases survivability but also greatly reduces the time required to restore Enterprise Manager after a failure has occurred. If you have a high RTO, installing Enterprise Manager using a multi-host configuration is essential.

This chapter covers the following topics:

Managing Multiple Hosts and Deploying a Remote Management Repository
Using Multiple Management Service Installations
High Availability Configurations
Installation Best Practices for Enterprise Manager High Availability
Configuration With Grid Control
Configuring Oracle Enterprise Manager for Active and Passive Environments
Using Virtual Host Names for Active and Passive High Availability Environments in Enterprise Manager Database Control
Configuring Grid Control Repository in Active/Passive High Availability Environments
How to Configure Grid Control OMS in Active/Passive Environment for High Availability Failover Using Virtual Host Names
Configuring Targets for Failover in Active/Passive Environments

Managing Multiple Hosts and Deploying a Remote Management Repository

Installing all the Grid Control components on a single host is an effective way to initially explore the capabilities and features available to you when you centrally manage your Oracle environment.

A logical progression from the single-host environment is to a more distributed approach, where the Management Repository database is on a separate host and does not compete for resources with the Management Service. The benefit in such a configuration is scalability; the workload for the Management Repository and Management Service can now be split. This configuration also provides the flexibility to adjust the resources allocated to each tier, depending on the system load. (Such a configuration is shown in Figure 18-2, "Grid Control Architecture with Multiple Management Service Installations".)

Figure 18-1 Grid Control Components Distributed on Multiple Hosts with One Management Service

Description of "Figure 18-1 Grid Control Components Distributed on Multiple Hosts with One Management Service"

In this more distributed configuration, data about your managed targets travels along the following paths so it can be gathered, stored, and made available to administrators by way of the Grid Control console:

Administrators use the Grid Control console to monitor and administer the targets just as they do in the single-host scenario described in "Deploying Grid Control Components on a Single Host".
Management Agents are installed on each host on the network, including the Management Repository host and the Management Service host. The Management Agents upload their data to the Management Service by way of the Management Service upload URL, which is defined in the emd.properties file in each Management Agent home directory. The upload URL uploads the data directly through the Oracle HTTP Server.
The Management Repository is installed on a separate machine that is dedicated to hosting the Management Repository database. The Management Service uses JDBC connections to load data into the Management Repository database and to retrieve information from the Management Repository so it can be displayed in the Grid Control console. This remote connection is defined in the Administration Server and can be accessed and changed through emctl commands.
The Management Service communicates directly with each remote Management Agent over HTTP by way of the Management Agent URL. The Management Agent URL is defined by the EMD_URL property in the emd.properties file of each Management Agent home directory. As described in Deploying Grid Control Components on a Single Host, the Management Agent includes a built-in HTTP listener so no Oracle HTTP Server is required on the Management Agent host.

Using Multiple Management Service Installations

In larger production environments, you may find it necessary to add additional Management Service installations to help reduce the load on the Management Service and improve the efficiency of the data flow.

Note:

When you add additional Management Service installations to your Grid Control configuration, be sure to adjust the parameters of your Management Repository database. For example, you will likely need to increase the number of processes allowed to connect to the database at one time. Although the number of required processes will vary depending on the overall environment and the specific load on the database, as a general guideline, you should increase the number of processes by 40 for each additional Management Service.

For more information, see the description of the PROCESSES initialization parameter in the Oracle Database Reference.

Understanding the Flow of Management Data When Using Multiple Management Services

Figure 18-2, "Grid Control Architecture with Multiple Management Service Installations" shows a typical environment where an additional Management Service has been added to improve the scalability of the Grid Control environment.

Figure 18-2 Grid Control Architecture with Multiple Management Service Installations

Description of "Figure 18-2 Grid Control Architecture with Multiple Management Service Installations"

In a multiple Management Service configuration, the management data moves along the following paths:

Administrators can use one of two URLs to access the Grid Control console. Each URL refers to a different Management Service installation, but displays the same set of targets, all of which are loaded in the common Management Repository. Depending upon the host name and port in the URL, the Grid Control console obtains data from the Management Service (by way of the Oracle HTTP Server) on one of the Management Service hosts.
Each Management Agent uploads its data to a specific Management Service, based on the upload URL in its emd.properties file. That data is uploaded directly to the Management Service by way of Oracle HTTP Server.

Whenever more than one Management Service is installed, it is a best practice to have the Management Service upload to a shared directory. This allows all Management Service processes to manage files that have been uploaded from any Management Agent. This protects from the loss of any one Management Server causing a disruption in upload data from Management Agents.

Configure this functionality from the command line of each Management Service process as follows:

emctl config oms loader -shared <yes|no> -dir <load directory>
Important:
By adding a load balancer, you can avoid the following problems:
- Should the Management Service fail, any Management Agent connected to it cannot upload data.
- Because user accounts only know about one Management Service, users lose connectivity should the Management Service go down even if the other Management Service is up.
See High Availability Configurations for information regarding load balancers.
Note:
If the software library is being used in this environment, it should be configured to use shared storage in the same way as the shared Management Service loader. To modify the location for the software library:
1. Click the Deployments tab on the Enterprise Manager Home page.
2. Click the Provisioning subtab.
3. On the Provisioning page, click the Administration subtab.
4. In the Software Library Configuration section, click Add to set the Software Library Directory Location to a shared storage that can be accessed by any host running the Management Service.
Each Management Service communicates by way of JDBC with a common Management Repository, which is installed in a database on a dedicated Management Repository host. Each Management Service uses the same database connection information, defined in the emgc.properties file, to load data from its Management Agents into the Management Repository. The Management Service uses the same connection information to pull data from the Management Repository as it is requested by the Grid Control console.
Any Management Service in the system can communicate directly with any of the remote Management Agents defined in the common Management Repository. The Management Services communicate with the Management Agents over HTTP by way of the unique Management Agent URL assigned to each Management Agent.

As described in Deploying Grid Control Components on a Single Host, the Management Agent URL is defined by the EMD_URL property in the emd.properties file of each Management Agent home directory. Each Management Agent includes a built-in HTTP listener so no Oracle HTTP Server is required on the Management Agent host.

High Availability Configurations

You can configure Enterprise Manager to run in either active-active or active-passive mode using a single instance database as the Management Repository. The following text summarizes the active-active mode.

Refer to the following sections for more information about common Grid Control configurations that take advantage of high availability hardware and software solutions. These configurations are part of the Maximum Availability Architecture (MAA).

Configuring the Management Repository
Configuring the Management Services
Configuring Additional Management Services
Configuring the Management Agent
Disaster Recovery

Configuring the Management Repository

Before installing Enterprise Manager, you should prepare the database, which will be used for setting up Management Repository. Install the database using Database Configuration Assistant (DBCA) to make sure that you inherit all Oracle install best practices.

Configure Database
- For both high availability and scalability, you should configure the Management Repository in the latest certified database version, with the RAC option enabled. Check for the latest version of database certified for Enterprise Manager from the Certify tab on the My Oracle Support website.
- Choose Automatic Storage Management (ASM) as the underlying storage technology.
- When the database installation is complete:
  
  Go to $ORACLE_HOME/rbdms/admin directory of the database home and execute the 'dbmspool.sql'
  
  This installs the DBMS_SHARED_POOL package, which will help in improving throughput of the Management Repository.
Install Enterprise Manager

While installing Enterprise Manager using Oracle Universal Installer (OUI), you will be presented with the option for configuring the Management Repository using an existing database.

Post Management Service - Install Management Repository Configuration

There are some parameters that should be configured during the Management Repository database install (as previously mentioned) and some parameters that should be set after the Management Service has been installed.

Start by installing Management Agents on each Management Repository RAC node. Once the Management Agents are installed and the Management Repository database is discovered as a target, the Enterprise Manager console can be used to configure these best practices in the Management Repository.

These best practices fall in the area of:

Configuring Storage
Configuring Oracle Database 11g with RAC for High Availability and Fast Recoverability
- Enable ARCHIVELOG Mode
- Enable Block Checksums
- Configure the Size of Redo Log Files and Groups Appropriately
- Use a Flash Recovery Area
- Enable Flashback Database
- Use Fast-Start Fault Recovery to Control Instance Recovery Time
- Enable Database Block Checking
- Set DISK_ASYNCH_IO
The details of these settings are available in Oracle Database High Availability Best Practices.

Use the MAA Advisor for additional high availability recommendations that should be applied to the Management Repository. To access the MAA Advisor:
1. On the Database Target Home page, locate the High Availability section.
2. Click Details next to the Console item.
3. In the Availability Summary section of the High Availability Console page, click Details located next to the MAA Advisor item.

Configuring the Management Services

Once you configure the Management Repository, the next step is to install and configure the Enterprise Manager Grid Control mid-tier, the Management Services, for greater availability. Before discussing steps that add mid-tier redundancy and scalability, note that the Management Service itself has a built-in restart mechanism based on the Oracle Weblogic Node Manager and the Oracle Process Management and Notification Service (OPMN). These services will attempt to restart a Management Service that is down. It is advisable to run OPMN and Node Manager as operating system services, so that they restart automatically if their host machine is restarted.

Management Service Install Location

If you are managing a large environment with multiple Management Services and Management Repository nodes, then consider installing the Management Services on hardware nodes that are different from Management Repository nodes (Figure 18-3). This allows you to scale out the Management Services in the future.

Figure 18-3 Management Service and Management Repository on Separate Hardware

Surrounding text describes Figure 18-3 .

Also consider the network latency between the Management Service and the Management Repository while determining the Management Service install location. The distance between the Management Service and the Management Repository may be one of the factors that affect network latency and hence determine Enterprise Manager performance.

If the network latency between the Management Service and Management Repository tiers is high or the hardware available for running Enterprise Manager is limited, then the Management Service can be installed on the same hardware as the Management Repository (Figure 18-4). This allows for Enterprise Manager high availability, as well as keep the costs down.

Figure 18-4 Management Service and Management Repository on Same Hardware

Surrounding text describes Figure 18-4 .

Configuring the First Management Service for High Availability

If you plan ahead, you can configure your Enterprise Manager deployment for high availability by choosing the correct options during the first Management Service install. You can also retrofit the MAA best practices into an existing Enterprise Manager deployment configured initially using the default install options. Both paths will be addressed in the following sections.

Configuring Management Service to Management Repository Communication

Management Service processes need to be configured to communicate with each node of the RAC Management Repository in a redundant fashion.

Note that Real Application Cluster (RAC) nodes are referred to by their virtual IP (vip) names. The service_name parameter is used instead of the system identifier (SID) in connect_data mode and failover is turned on. Refer to Oracle Database Net Services Administrator's Guide for details.

Configure the repository connect descriptor by running the emctl command from any Management Service:

emctl config oms -store_repos_details -repos_conndesc '"(DESCRIPTION=
(ADDRESS_LIST=(FAILOVER=ON)
(ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip.example.com)(PORT=1521))
(ADDRESS=(PROTOCOL=TCP)(HOST=node2-vip.example.com)(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=EMREP)))"' -repos_user sysman

After making the previous change, run the following command to make the same change to the monitoring configuration used for the Management Services and Repository target: emctl config emrep -conn_desc

Configuring Shared File System Loader

The Management Service for Grid Control has a high availability feature called the Shared Filesystem Loader. In the Shared Filesystem Loader, management data files received from Management Agents are stored temporarily on a common shared location called the shared receive directory. All Management Services are configured to use the same storage location for the shared receive directory. The Management Services coordinate internally and distribute among themselves the workload of uploading files into the Management Repository. Should a Management Service go down, its workload is taken up by surviving Management Services. You must choose a shared receive directory that is accessible by all the Management Services using redundant file storage.

During the first Management Service installation, the shared receive directory can be configured out of the box by passing SHARED_RECEIVE_DIRECTORY_LOCATION=<shared recv directory> option to runInstaller (setup.exe on Windows). Oracle recommends that this location be outside the Instance Home and Middleware Home locations.

If not configured during installation, the Shared Filesystem Loader can also be configured post-installation by running the following emctl command on every Management Service:

emctl config oms loader -shared yes -dir <shared recv directory>

Note:

If shared filesystem Loader is configured on the first Management Service, any additional management service that is installed later will inherit the shared filesystem loader configuration. Therefore, ensure that the shared recv directory is available on the additional Management Service prior to running install.

Consider the following while configuring Shared Filesystem Loader on Windows.

On Windows platforms, the Enterprise Manager install may configure the Management Service to run as a service using 'LocalSystem' account. This is a local account and will typically not have access to the network drive for the shared filesystem that requires domain authentication. To resolve this issue, configure the Management Service to run as a domain user as follows:
1. Go to the Control Panel and then open the Services panel.
2. Double click the appropriate service (Oracleoms11gProcessManager).
3. Change the 'Log on as' user from the 'Local System Account' to This Account.
4. Specify the domain user that has access to shared network drive.
5. Click OK.
Do not use local drive letters for mapped network shares while configuring the shared filesystem on Windows. Use UNC locations instead.
```
emctl config oms loader -shared yes -dir \\\\<host>\\<share-name>\\<recv-dir>
for example
emctl config oms loader -shared yes -dir \\\\hostA\\vol1\\recv
```
Note the use of double backslashes while specifying the directory location.

Note:

User equivalence should be set up properly across OMS so that files created by one OMS on the loader directory are modifiable by other OMS.

Configuring Software Library

Since software library location has to be accessed by all Management Services, considerations similar to shared fileystem loader directory apply here too. The configuration of software library is not covered during install. It needs to be configured post-install using the Enterprise Manager Console:

On the Enterprise Manager home page, click the Deployments tab.
Click the Provisioning subtab.
On the Provisioning page, click the Administration subtab.
In the Software Library Configuration section, click Add to set the Software Library Directory Location to a shared storage that can be accessed by any host running the Management Service.

Configuring a Load Balancer

This section describes the guidelines for setting up a Server Load Balancer (SLB) to balance the agent and browser traffic to multiple Management Services. This is a two step process:

Configure the SLB
Make needed changes on the Management Services

SLB Setup

Use the following table as reference for setting up the SLB with Grid Control Management Services.

Table 18-1 Management Service Ports

Grid Control Service	TCP Port	Monitor Name	Persistence	Pool Name	Load Balancing	Virtual Server Name	Virtual Server Port
Secure Upload	1159	mon_gcsu1159	None	pool_gcsu1159	Round Robin	vs_gcsu1159	1159
Agent Registration	4889	mon_gcar4889	Active Cookie Insert	pool_gcar4889	Round Robin	vs_gcar4889	4889
Secure Console	7799	mon_gcsc7799	Source IP	pool_gcsc7799	Round Robin	vs_gcsc443	443
Unsecure Console (optional)	7788	mon_gcuc7788	Source IP	pool_gcuc7788	Round Robin	vs_gcuc80	80

Use the administration tools that are packaged with your SLB. A sample configuration follows. This example assumes that you have two Management Services running on host A and host B using the default ports as listed in Table 18-0.

Create Pools

A pool is a set of servers grouped together to receive traffic on a specific TCP port using a load balancing method. Each pool can have its own unique characteristic for a persistence definition and the load-balancing algorithm used.

Table 18-2 Pools

Pool Name	Usage	Members	Persistence	Load Balancing
pool_gcsu1159	Secure upload	HostA:1159 HostB:1159	None	Round Robin
pool_gcar4889	Agent registration	HostA:4889 HostB:4889	Active cookie insert; expiration 60 minutes	Round Robin
pool_gcsc7799	Secured console access	HostA:7799 HostB:7799	Source IP; expiration 60 minutes	Round Robin
pool_gcuc7788 (optional)	Unsecured console access	HostA:7788 HostB:7788	Source IP; expiration 60 minutes	Round Robin

Create Virtual Servers

A virtual server, with its virtual IP Address and port number, is the client addressable hostname or IP address through which members of a load balancing pool are made available to a client. After a virtual server receives a request, it directs the request to a member of the pool based on a chosen load balancing method.

Table 18-3 Virtual Servers

Virtual Server Name	Usage	Virtual Server Port	Pool
vs_gcsu1159	Secure upload	1159	pool_gcsu1159
vs_gcar4889	Agent registration	4889	pool_gcar4889
vs_gcsc443	Secure console access	443	pool_gcsc7799
vs_gcuc80 (optional)	Unsecure console access	80	pool_gcuc7788

Create Monitors

Monitors are used to verify the operational state of pool members. Monitors verify connections and services on nodes that are members of load-balancing pools. A monitor is designed to check the status of a service on an ongoing basis, at a set interval. If the service being checked does not respond within a specified timeout period, the load balancer automatically takes it out of the pool and will choose the other members of the pool. When the node or service becomes available again, the monitor detects this and the member is automatically accessible to the pool and able to handle traffic.

Table 18-4 Monitors

Monitor Name	Configuration	Associate With
mon_gcsu1159	Type: https Interval: 60 Timeout: 181 Send String: GET /em/upload Receive String: Http Receiver Servlet active!	HostA:1159 HostB:1159
mon_gcar4889	Type: http Interval: 60 Timeout: 181 Send String: GET /em/genwallet Receive String: GenWallet Servlet activated	HostA:4889 HostB:4889
mon_gcsc7799	Type: https Interval: 5 Timeout: 16 Send String: GET /em/console/home HTTP/1.0\n Receive String: /em/console/logon/logon;jsessionid=	HostA:7799 HostB:7788
mon_gcuc7788 (optional)	Type: https Interval: 5 Timeout: 16 Send String: GET /em/console/home HTTP/1.0\n Receive String: /em/console/logon/logon;jsessionid=	HostA:7788 HostB:7788

Note: If you have SSO configured, use the following alternate definitions for the mon_gcsc7799 and mon_gcuc7788 monitors.

Table 18-5 Monitors for SSO Configuration

Monitor Name	Configuration	Associate With
mon_gcsc7799	Type: https Interval: 5 Timeout: 16 Send String: GET /em/genwallet Receive String: GenWallet Servlet activated	HostA:7799 HostB:7788
mon_gcuc7788 (optional)	Type: https Interval: 5 Timeout: 16 Send String: GET /em/genwallet Receive String: GenWallet Servlet activated	HostA:7788 HostB:7788

Note:

F5 SLB monitors expect the "Receive String" within the first 5120 characters of a response. For SSO environments, the "Receive String" may be returned at some point beyond the 5120 limit. The monitor will not function in this situation.

Enterprise Manager Side Setup

Perform the following steps:

Resecure the Management Service

By default, the service name on the Management Service-side certificate uses the name of the Management Service host. Management Agents do not accept this certificate when they communicate with the Management Service through a load balancer. You must run the following command to regenerate the certificate on the first Management Service:
```
emctl secure 
  -oms -sysman_pwd <sysman_pwd> 
  -reg_pwd <agent_reg_password> 
  -host slb.example.com 
  -secure_port 1159 
  -slb_port 1159 
  -slb_console_port 443  
  [-lock]  [-lock_console]
```
Resecure all Management Agents

Management Agents that were installed prior to SLB setup, including the Management Agent that comes with the Management Service install, would be uploading directly to the Management Service. These Management Agents will not be able to upload after SLB is setup. Resecure these Management Agents to upload to the SLB by running the following command on each Management Agent:
```
emctl secure agent –emdWalletSrcUrl https://slb.example.com:<upload port>/em
```

Configuring Additional Management Services

Once your first Management Service is setup for high availability, there are two paths to setting up your additional Management Services for high availability:

Installing a fresh additional Management Service as per MAA best practices
Retrofitting MAA best practices on existing Additional Management Service

In either of the two cases, the following considerations should be noted:

The additional Management Service should be hosted in close network proximity to the Management Repository database for network latency reasons.
Configure the same path on all Management Services for the directory used for the shared filesystem loader.
Additional Management Services should be installed using the same OS user and group as the first Management Service. Proper user equivalence should be setup so that files created by each Management Service on the shared loader directory can be accessed and modified by the other Management Service processes.
Adjust the parameters of your Management Repository database. For example, you will likely need to increase the number of processes allowed to connect to the database at one time. Although the number of required processes will vary depending on the overall environment and the specific load on the database, as a general guideline, you should increase the number of processes by 40 for each additional Management Service.

Installing a Fresh Additional Management Service According MAA Best Practices

Install the additional Management Service using the OUI installer. The additional Management Service will inherit most of the HA configuration from the first Management Service. Post installation, do the following step to complete the HA configuration:

Update the SLB configuration by adding the additional Management Service to the different pools on the SLB. Setup monitors for the new Management Service.

Retrofitting MAA Best Practices on Existing Additional Management Service

Once you have the additional Management Service installed, use the following steps to copy over the configuration from the first Management Service.

Export the configuration from the first Management Service using emctl exportconfig oms -dir <location for the export file>
Copy over the exported file to the additional Management Service
Shutdown the additional Management Service
Import the exported configuration on the additional Management Service using emctl importconfig oms -file <full path of the export file>
Restart the additional Management Service
Setup EMCLI using emcli setup -url=https://slb.example.com/em -username sysman -password <sysman password> -nocertvalidate
Resecure the Management Agent that is installed with the additional Management Service to upload to SLB using emctl secure agent -emdWalletSrcUrl https://slb.example.com:<upload port>/em
Update the SLB configuration by adding the additional Management Service to the different pools on the SLB. Setup monitors for the new Management Service. Modify the ssl.conf file to set the Port directive to the SLB virtual port used for UI access.

Configuring the Management Agent

The final piece of Enterprise Manager High Availability is the Management Agent configuration. It is worthwhile to note that the Management Agent has high availability built into it out of the box. A 'watchdog' process, created automatically on Management Agent startup, monitors each Management Agent process. In the event of a failure of the Management Agent process, the 'watchdog' will try to automatically re-start the Management Agent process.

Communication between the Management Agent and the Management Service tiers in a default Enterprise Manager Grid Control install is a point-to-point set up. Therefore, the default configuration does not protect from the scenario where the Management Service becomes unavailable. In that scenario, a Management Agent will not be able to upload monitoring information to the Management Service (and to the Management Repository), resulting in the targets becoming unmonitored until that Management Agent is manually configured to point to a second Management Service.

To avoid this situation, use hardware Server Load Balancer (SLB) between the Management Agents and the Management Services. The Load Balancer monitors the health and status of each Management Service and makes sure that the connections made through it are directed to surviving Management Service nodes in the event of any type of failure. As an additional benefit of using SLB, the load balancer can also be configured to manage user communications to Enterprise Manager. The Load Balancer handles this through the creation of 'pools' of available resources.

Configure the Management Agent to Communicate Through SLB

The load balancer provides a virtual IP address that all Management Agents can use. Once the load balancer is setup, the Management Agents need to be configured to route their traffic to the Management Service through the SLB. This can be achieved through a couple of property file changes on the Management Agents.

Resecure all Management Agents - Management Agents that were installed prior to SLB setup would be uploading directly to the Management Service. These Management Agents will not be able to upload after SLB is setup. Resecure these Management Agents to upload to the SLB by running the following command on each Management Agent.

emctl secure agent -emdWalletSrcUrl https://slb.example.com:<upload port>/em
Configure the Management Agent to Allow Retrofitting a SLB

Some installations may not have access to a SLB during their initial install, but may foresee the need to add one later. If that is the case, consider configuring the Virtual IP address that will be used for the SLB as a part of the initial installation and having that IP address point to an existing Management Service. Secure communications between Management Agents and Management Services are based on the host name. If a new host name is introduced later, each Management Agent will not have to be re-secured as it is configured to point to the new Virtual IP maintained by the SLB.

Load Balancing Connections Between the Management Agent and the Management Service

Before you implement a plan to protect the flow of management data from the Management Agents to the Management Service, you should be aware of some key concepts.

Specifically, Management Agents do not maintain a persistent connection to the Management Service. When a Management Agent needs to upload collected monitoring data or an urgent target state change, the Management Agent establishes a connection to the Management Service. If the connection is not possible, such as in the case of a network failure or a host failure, the Management Agent retains the data and reattempts to send the information later.

To protect against the situation where a Management Service is unavailable, you can use a load balancer between the Management Agents and the Management Services.

However, if you decide to implement such a configuration, be sure to understand the flow of data when load balancing the upload of management data.

Figure 18-5 shows a typical scenario where a set of Management Agents upload their data to a load balancer, which redirects the data to one of two Management Service installations.

Figure 18-5 Load Balancing Between the Management Agent and the Management Service

Description of "Figure 18-5 Load Balancing Between the Management Agent and the Management Service"

In this example, only the upload of Management Agent data is routed through the load balancer. The Grid Control console still connects directly to a single Management Service by way of the unique Management Service upload URL. This abstraction allows Grid Control to present a consistent URL to both Management Agents and Grid Control consoles, regardless of the loss of any one Management Service component.

When you load balance the upload of Management Agent data to multiple Management Service installations, the data is directed along the following paths:

Each Management Agent uploads its data to a common load balancer URL. This URL is defined in the emd.properties file for each Management Agent. In other words, the Management Agents connect to a virtual service exposed by the load balancer. The load balancer routes the request to any one of a number of available servers that provide the requested service.
Each Management Service, upon receipt of data, stores it temporarily in a local file and acknowledges receipt to the Management Agent. The Management Services then coordinate among themselves and one of them loads the data in a background thread in the correct chronological order.

Also, each Management Service communicates by way of JDBC with a common Management Repository, just as they do in the multiple Management Service configuration defined in Using Multiple Management Service Installations.
Each Management Service communicates directly with each Management Agent by way of HTTP, just as they do in the multiple Management Service configuration defined in Using Multiple Management Service Installations.

Disaster Recovery

While high availability typically protects against local outages such as application failures or system-level problems, disaster tolerance protects against larger outages such as catastrophic data-center failure due to natural disasters, fire, electrical failure, evacuation, or pervasive sabotage. For Maximum Availability, the loss of a site cannot be the cause for outage of the management tool that handles your enterprise.

Maximum Availability Architecture for Enterprise Manager mandates deploying a remote failover architecture that allows a secondary datacenter to take over the management infrastructure in the event that disaster strikes the primary management infrastructure.

Figure 18-6 Disaster Recovery Architecture

Surrounding text describes Figure 18-6 .

As can be seen in Figure 18-6, setting up disaster recovery for Enterprise Manager essentially consists of installing a standby RAC, a standby Management Service and a standby Server Load Balancer and configuring them to automatically startup when the primary components fail.

The following sections lists the best practices to configure the key Enterprise Manager components for disaster recovery:

Prerequisites
Setup Standby Database
Setup Standby Management Service
Switchover
Failover
Automatic Failover

Prerequisites

The following prerequisites must be satisfied.

The primary site must be configured as per Grid Control MAA guidelines described in previous sections. This includes Management Services fronted by an SLB and all Management Agents configured to upload to Management Services by the SLB.
The standby site must be similar to the primary site in terms of hardware and network resources to ensure there is no loss of performance when failover happens.
There must be sufficient network bandwidth between the primary and standby sites to handle peak redo data generation.
Configure shared storage used for shared filesystem loader and software library to be replicated at the primary and standby site. In the event of a site outage, the contents of this shared storage must be made available on the standby site using hardware vendor disk level replication technologies.
For complete redundancy in a disaster recovery environment, a second load balancer must be installed at the standby site. The secondary SLB must be configured in the same fashion as the primary. Some SLB vendors (such as F5 Networks) offer additional services that can be used to pass control of the Virtual IP presented by the SLB on the primary site to the SLB on the standby site in the event of a site outage. This can be used to facilitate automatic switching of Management Agent traffic from the primary site to the standby site.

Setup Standby Database

The starting point of this step is to have the primary site configured as per Grid Control MAA guidelines. The following steps lay down the procedure for setting up the standby Management Repository database.

Prepare Standby Management Repository hosts for Data Guard

Install a Management Agent on each of the standby Management Repository hosts. Configure the Management Agents to upload by the SLB on the primary site. Install CRS and Database software on the standby Management Repository hosts. The version used must be the same as that on the primary site.
Prepare Primary Management Repository database for Data Guard

If the primary Management Repository database is not already configured, enable archive log mode, setup flash recovery area and enable flashback database on the primary Management Repository database.
Create Physical Standby Database

In Enterprise Manager, the standby Management Repository database must be physical standbys. Use the Enterprise Manager Console to setup a physical standby database in the standby environment. Note that Enterprise Manager Console does not support creating a standby RAC database. If the standby database has to be RAC, configure the standby database using a single instance and then use the Convert to RAC option from Enterprise Manager Console to convert the single instance standby database to RAC. Also, note that during single instance standby creation, the database files should be created on shared storage to facilitate conversion to RAC later.

Note that the Convert to RAC option is available for Oracle Database releases 10.2.0.5, 11.1.0.7, and above. Oracle Database release 11.1.0.7 requires patch 8824966 for the Convert to RAC option to work.
Add Static Service to Listener

To enable Data Guard to restart instances during the course of broker operations, a service with a specific name must be statically registered with the local listener of each instance. The value for the GLOBAL_DBNAME attribute must be set to a concatenation of <db_unique_name>_DGMGRL.<db_domain>. For example, in the LISTENER.ORA file:
```
SID_LIST_LISTENER=(SID_LIST=(SID_DESC=(SID_NAME=sid_name)
     (GLOBAL_DBNAME=db_unique_name_DGMGRL.db_domain)
     (ORACLE_HOME=oracle_home)))
```
Enable Flashback Database on the Standby Database
Verify Physical Standby

Verify the Physical Standby database through the Enterprise Manager Console. Click the Log Switch button on the Data Guard page to switch log and verify that it is received and applied to the standby database.

Setup Standby Management Service

Consider the following before installing the standby Management Services.

Oracle recommends that this activity be done during a lean period or during a planned maintenance window. When new Management Services are installed on the standby site, they are initially configured to connect to the Management Repository database on the primary site. Some workload will be taken up by the new Management Service. This could result in temporary loss in performance if the standby site Management Services are located far away from the primary site Management Repository database. However there would be no data loss and the performance would recover once the standby Management Services are shutdown post configuration.
The shared storage used for the shared filesystem loader and software library must be made available on the standby site using the same paths as the primary site.

Installing the First Standby Management Service

Install the first standby Management Service using the following steps:

Copy the emkey to the Management Repository by running the following command on the first Management Service on the primary site:

emctl config emkey -copy_to_repos
Perform a software-only install by running the installer with the following arguments:

runInstaller -noconfig -validationaswarnings
Apply one-off patches

<OMS Oracle Home>/install/oneoffs/apply_NewOneoffs.pl <OMS Oracle Home> OC9321514,9207217
Configure the Management Service by running omsca in standby mode. Choose a different domain name for the standby. For example, if the primary WebLogic domain is GCDomain, choose GCDomainStby.

omsca standby DOMAIN_NAME GCDomainStby -nostart

When prompted for Management Repository details, provide the Primary database details.
Configure the virtualization add on by running the following command:

addonca -oui -omsonly -name vt -install gc
Configure the Management Agent that comes with the Management Service install by running:

<Agent Oracle Home>/bin/agentca -f
Export the configuration from the primary Management Service using:

emctl exportconfig oms -dir <location for the export file>
Copy over the exported file to the standby Management Service
Import the exported configuration on the standby Management Service using:

emctl importconfig oms -file <full path of the export file>

Installing Additional Standby Management Services

Install addional standby Management Services as follows:

Specify the primary database details and the standby administration server details on the installer screens. Post installation, do the following steps to complete the HA configuration.

Do a software only install by running the installer with following arguments:

runInstaller -noconfig -validationaswarnings
Apply one-off patches

<OMS Oracle Home>/install/oneoffs/apply_NewOneoffs.pl <OMS Oracle Home> OC9321514,9207217
Configure the Management Service by running omsca. When prompted for Management Repository details, provide the Primary database details. When prompted for Administration Server details, provide the standby administration server details.

omsca add –nostart
Configure the virtualization add on by running the following command

addonca -oui -omsonly -install gc
Configure the Management Agent that comes with the Management Service install by running:

<Agent Oracle Home>/bin/agentca -f
Export the configuration from the primary Management Service using:

emctl exportconfig oms -dir <location for the export file>
Copy over the exported file to the standby Management Service
Import the exported configuration on the standby Management Service using:

emctl importconfig oms -file <full path of the export file>

Validating Your Installation and Complete the Setup

Validate your installation and complete the setup as follows:

Update the standby SLB configuration by adding the standby Management Services to the different pools on the SLB. Setup monitors for the new Management Service.
Make the standby Management Services point to the standby Management Repository database by running the following command on the first standby Management Service:

emctl config oms -store_repos_details -repos_conndesc <connect descriptor of standby database> -repos_user sysman -no_check_db
Shut down all Management Services by running the following command on each Management Service:

emctl stop oms -all

Switchover

Switchover is a planned activity where operations are transferred from the Primary site to a Standby site. This is usually done for testing and validation of Disaster Recovery (DR) scenarios and for planned maintenance activities on the primary infrastructure. This section describes the steps to switchover to the standby site. The same procedure is applied to switchover in either direction.

Enterprise Manager Console cannot be used to perform switchover of the Management Repository database. Use the Data Guard Broker command line tool DGMGRL instead.

Prepare the Standby Database

Verify that recovery is up-to-date. Using the Enterprise Manager Console, you can view the value of the ApplyLag column for the standby database in the Standby Databases section of the Data Guard Overview Page.
Shutdown the Primary Enterprise Manager Application Tier

Shutdown all the Management Services in the primary site by running the following command on each Management Service:

emctl stop oms -all

Shutdown the Enterprise Manager jobs running in Management Repository database:

- alter system set job_queue_processes=0;
Verify Shared Loader Directory / Software Library Availability

Ensure all files from the primary site are available on the standby site.
Switchover to the Standby Database

Use DGMGRL to perform a switchover to the standby database. The command can be run on the primary site or the standby site. The switchover command verifies the states of the primary database and the standby database, affects switchover of roles, restarts the old primary database, and sets it up as the new standby database.

SWITCHOVER TO <standby database name>;

Verify the post switchover states. To monitor a standby database completely, the user monitoring the database must have SYSDBA privileges. This privilege is required because the standby database is in a mounted-only state. A best practice is to ensure that the users monitoring the primary and standby databases have SYSDBA privileges for both databases.
```
SHOW CONFIGURATION;
SHOW DATABASE <primary database name>;
SHOW DATABASE <standby database name>;
```
Startup the Enterprise Manager Application Tier

Startup all the Management Services on the standby site:

emctl start oms

Startup the Enterprise Manager jobs running in Management Repository database on the standby site (the new primary database):

- alter system set job_queue_processes=10;
Relocate Management Services and Management Repository target

The Management Services and Management Repository target is monitored by a Management Agent on one of the Management Services on the primary site. To ensure that the target is monitored after switchover/failover, relocate the target to a Management Agent on the standby site by running the following command on one of the Management Service standby sites.

emctl config emrep -agent <agent name> -conn_desc
Switchover to Standby SLB

Make appropriate network changes to failover your primary SLB to standby SLB that is, all requests should now be served by the standby SLB without requiring any changes on the clients (browser and Management Agents).

This completes the switchover operation. Access and test the application to ensure that the site is fully operational and functionally equivalent to the primary site. Repeat the same procedure to switchover in the other direction.

Failover

A standby database can be converted to a primary database when the original primary database fails and there is no possibility of recovering the primary database in a timely manner. This is known as a manual failover. There may or may not be data loss depending upon whether your primary and target standby databases were synchronized at the time of the primary database failure.

This section describes the steps to failover to a standby database, recover the Enterprise Manager application state by resynchronizing the Management Repository database with all Management Agents, and enabling the original primary database as a standby using flashback database.

The word manual is used here to contrast this type of failover with a fast-start failover described later in Automatic Failover.

Verify Shared Loader Directory and Software Library Availability

Ensure all files from the primary site are available on the standby site.
Failover to Standby Database

Shutdown the database on the primary site. Use DGMGRL to connect to the standby database and execute the FAILOVER command:

FAILOVER TO <standby database name>;

Verify the post failover states:
```
SHOW CONFIGURATION;
SHOW DATABASE <primary database name>;
SHOW DATABASE <standby database name>;
```
Note that after the failover completes, the original primary database cannot be used as a standby database of the new primary database unless it is re-enabled.
Resync the New Primary Database with Management Agents

Skip this step if you are running in Data Guard Maximum Protection or Maximum Availability level as there is no data loss on failover. However, if there is data loss, synchronize the new primary database with all Management Agents.

On any one Management Service on the standby site, run the following command:

emctl resync repos -full -name "<name for recovery action>"

This command submits a resync job that would be executed on each Management Agent when the Management Services on the standby site are brought up.

Repository resynchronization is a resource intensive operation. A well tuned Management Repository will help significantly to complete the operation as quickly as possible. Specifically if you are not routinely coalescing the IOTs/indexes associated with Advanced Queueing tables as described in My Oracle Support note 271855.1, running the procedure before resync will significantly help the resync operation to complete faster.
Startup the Enterprise Manager Application Tier

Startup all the Management Services on the standby site by running the following command on each Management Service.

emctl start oms
Relocate Management Services and Management Repository target

The Management Services and Management Repository target is monitored by a Management Agent on one of the Management Services on the primary site. To ensure that target is monitored after switchover/failover, relocate the target to a Management Agent on the standby site by running the following command on one of the standby site Management Services.

emctl config emrep -agent <agent name> -conn_desc
Switchover to Standby SLB

Make appropriate network changes to failover your primary SLB to the standby SLB, that is, all requests should now be served by the standby SLB without requiring any changes on the clients (browser and Management Agents).
Establish Original Primary Database as Standby Database Using Flashback

Once access to the failed site is restored and if you had flashback database enabled, you can reinstate the original primary database as a physical standby of the new primary database.
- Shutdown all the Management Services in original primary site:
  
  emctl stop oms -all
- Restart the original primary database in mount state:
  
  shutdown immediate;
  
  startup mount;
- Reinstate the Original Primary Database
  
  Use DGMGRL to connect to the old primary database and execute the REINSTATE command
  
  REINSTATE DATABASE <old primary database name>;
- The newly reinstated standby database will begin serving as standby database to the new primary database.
- Verify the post reinstate states:
```
SHOW CONFIGURATION;
SHOW DATABASE <primary database name>;
SHOW DATABASE <standby database name>;
```
Monitor and complete Repository Resynchronization

Navigate to the Management Services and Repository Overview page of Grid Control Console. Under Related Links, click Repository Synchronization. This page shows the progress of the resynchronization operation on a per Management Agent basis. Monitor the progress.

Operations that fail should be resubmitted manually from this page after fixing the error mentioned. Typically, communication related errors are caused by Management Agents being down and can be fixed by resubmitting the operation from this page after restarting the Management Agent.

For Management Agents that cannot be started due to some reason, for example, old decommissioned Management Agents, the operation should be stopped manually from this page. Resynchronization is deemed complete when all the jobs have a completed or stopped status.

This completes the failover operation. Access and test the application to ensure that the site is fully operational and functionally equivalent to the primary site. Do a switchover procedure if the site operations have to be moved back to the original primary site.

Automatic Failover

This section details the steps to achieve complete automation of failure detection and failover procedure by utilizing Fast-Start Failover and Observer process. At a high level the process works like this:

Fast-Start Failover (FSFO) determines that a failover is necessary and initiates a failover to the standby database automatically
When the database failover has completed the DB_ROLE_CHANGE database event is fired
The event causes a trigger to be fired which calls a script that configures and starts Enterprise Manager Application Tier

Perform the following steps:

Develop Enterprise Manager Application Tier Configuration and Startup Script

Develop a script that will automate the Enterprise Manager Application configuration and startup process. See the sample shipped with Grid Control in the OH/sysman/ha directory. A sample script for the standby site is included here and should be customized as needed. Make sure ssh equivalence is setup so that remote shell scripts can be executed without password prompts. Place the script in a location accessible from the standby database host. Place a similar script on the primary site.

#!/bin/sh
# Script: /scratch/EMSBY_start.sh
# Primary Site Hosts
# Repos: earth, OMS: jupiter1, jupiter2
# Standby Site Hosts
# Repos: mars, # OMS: saturn1, saturn2
LOGFILE="/net/mars/em/failover/em_failover.log"
OMS_ORACLE_HOME="/scratch/OracleHomes/em/oms11"
CENTRAL_AGENT="saturn1.example.com:3872"
 
#log message
echo "###############################" >> $LOGFILE
date >> $LOGFILE
echo $OMS_ORACLE_HOME >> $LOGFILE
id >>  $LOGFILE 2>&1
 
#startup all OMS
#Add additional lines, one each per OMS in a multiple OMS setup
ssh orausr@saturn1 "$OMS_ORACLE_HOME/bin/emctl start oms" >>  $LOGFILE 2>&1
ssh orausr@saturn2 "$OMS_ORACLE_HOME/bin/emctl start oms" >>  $LOGFILE 2>&1
 
#relocate Management Services and Repository target
#to be done only once in a multiple OMS setup
#allow time for OMS to be fully initialized
ssh orausr@saturn1 "$OMS_ORACLE_HOME/bin/emctl config emrep -agent $CENTRAL_AGENT -conn_desc -sysman_pwd <password>" >> $LOGFILE 2>&1
 
#always return 0 so that dbms scheduler job completes successfully
exit 0

Automate Execution of Script by Trigger

Create a database event "DB_ROLE_CHANGE" trigger, which fires after the database role changes from standby to primary. See the sample shipped with Grid Control in OH/sysman/ha directory.

--
--
-- Sample database role change trigger
--
--
CREATE OR REPLACE TRIGGER FAILOVER_EM
AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
    v_db_unique_name varchar2(30);
    v_db_role varchar2(30);
BEGIN
    select upper(VALUE) into v_db_unique_name
    from v$parameter where NAME='db_unique_name';
    select database_role into v_db_role
    from v$database;
 
    if v_db_role = 'PRIMARY' then
 
      -- Submit job to Resync agents with repository
      -- Needed if running in maximum performance mode
      -- and there are chances of data-loss on failover
      -- Uncomment block below if required
      -- begin
      --  SYSMAN.setemusercontext('SYSMAN', SYSMAN.MGMT_USER.OP_SET_IDENTIFIER);
      --  SYSMAN.emd_maintenance.full_repository_resync('AUTO-FAILOVER to '||v_db_unique_name||' - '||systimestamp, true);
      --  SYSMAN.setemusercontext('SYSMAN', SYSMAN.MGMT_USER.OP_CLEAR_IDENTIFIER);
      -- end;
 
      -- Start the EM mid-tier
      dbms_scheduler.create_job(
          job_name=>'START_EM',
          job_type=>'executable',
          job_action=> '<location>' || v_db_unique_name|| '_start_oms.sh',
          enabled=>TRUE
      );
    end if;
EXCEPTION
WHEN OTHERS
THEN
    SYSMAN.mgmt_log.log_error('LOGGING', SYSMAN.MGMT_GLOBAL.UNEXPECTED_ERR,
SYSMAN.MGMT_GLOBAL.UNEXPECTED_ERR_M || 'EM_FAILOVER: ' ||SQLERRM);
END;
/

Note:

Based on your deployment, you might require additional steps to synchronize and automate the failover of SLB and shared storage used for loader receive directory and software library. These steps are vendor specific and beyond the scope of this document. One possibility is to invoke these steps from the Enterprise Manager Application Tier startup and configuration script.

Configure Fast-Start Failover and Observer

Use the Fast-Start Failover configuration wizard in Enterprise Manager Console to enable FSFO and configure the Observer.

This completes the setup of automatic failover.

Installation Best Practices for Enterprise Manager High Availability

The following sections document best practices for installation and configuration of each Grid Control component.

Configuring the Management Agent to Automatically Start on Boot and Restart on Failure

The Management Agent is started manually. It is important that the Management Agent be automatically started when the host is booted to insure monitoring of critical resources on the administered host. To that end, use any and all operating system mechanisms to automatically start the Management Agent. For example, on UNIX systems this is done by placing an entry in the UNIX /etc/init.d that calls the Management Agent on boot or by setting the Windows service to start automatically.

Configuring Restart for the Management Agent

Once the Management Agent is started, the watchdog process monitors the Management Agent and attempts to restart it in the event of a failure. The behavior of the watchdog is controlled by environment variables set before the Management Agent process starts. The environment variables that control this behavior follow. All testing discussed here was done with the default settings.

EM_MAX_RETRIES – This is the maximum number of times the watchdog will attempt to restart the Management Agent within the EM_RETRY_WINDOW. The default is to attempt restart of the Management Agent 3 times.
EM_RETRY_WINDOW - This is the time interval in seconds that is used together with the EM_MAX_RETRIES environmental variable to determine whether the Management Agent is to be restarted. The default is 600 seconds.

The watchdog will not restart the Management Agent if the watchdog detects that the Management Agent has required restart more than EM_MAX_RETRIES within the EM_RETRY_WINDOW time period.

Installing the Management Agent Software on Redundant Storage

The Management Agent persists its intermediate state and collected information using local files in the $AGENT_HOME/$HOSTNAME/sysman/emd sub tree under the Management Agent home directory.

In the event that these files are lost or corrupted before being uploaded to the Management Repository, a loss of monitoring data and any pending alerts not yet uploaded to the Management Repository occurs.

At a minimum, configure these sub-directories on striped redundant or mirrored storage. Availability would be further enhanced by placing the entire $AGENT_HOME on redundant storage. The Management Agent home directory is shown by entering the command 'emctl getemhome' on the command line, or from the Management Services and Repository tab and Agents tab in the Grid Control console.

Install the Management Service Shared File Areas on Redundant Storage

The Management Service contains results of the intermediate collected data before it is loaded into the Management Repository. The loader receive directory contains these files and is typically empty when the Management Service is able to load data as quickly as it is received. Once the files are received by the Management Service, the Management Agent considers them committed and therefore removes its local copy. In the event that these files are lost before being uploaded to the Management Repository, data loss will occur. At a minimum, configure these sub-directories on striped redundant or mirrored storage. When Management Services are configured for the Shared Filesystem Loader, all services share the same loader receive directory. It is recommended that the shared loader receive directory be on a clustered file system like NetApps Filer.

Configuration With Grid Control

Grid Control comes preconfigured with a series of default rules to monitor many common targets. These rules can be extended to monitor the Grid Control infrastructure as well as the other targets on your network to meet specific monitoring needs.

Console Warnings, Alerts, and Notifications

The following list is a set of recommendations that extend the default monitoring performed by Enterprise Manager. Use the Notification Rules link on the Preferences page to adjust the default rules provided on the Configuration/Rules page:

Ensure the Agent Unreachable rule is set to alert on all Management Agents unreachable and Management Agents clear errors.
Ensure the Repository Operations Availability rule is set to notify on any unreachable problems with the Management Service or Management Repository nodes. Also modify this rule to alert on the Targets Not Providing Data condition and any database alerts that are detected against the database serving as the Management Repository.

Modify the Agent Upload Problems Rule to alert when the Management Service status has hit a warning or clear threshold.

Configure Additional Error Reporting Mechanisms

Enterprise Manager provides error reporting mechanisms through e-mail notifications, PL/SQL packages, and SNMP alerts. Configure these mechanisms based on the infrastructure of the production site. If using e-mail for notifications, configure the notification rule through the Grid Control console to notify administrators using multiple SMTP servers if they are available. This can be done by modifying the default e-mail server setting on the Notification Methods option under Setup.

Component Backup

Backup procedures for the database are well established standards. Configure backup for the Management Repository using the RMAN interface provided in the Grid Control console. Refer to the RMAN documentation or the Maximum Availability architecture document for detailed implementation instructions.

In addition to the Management Repository, the Management Service and Management Agent should also have regular backups. Backups should be performed after any configuration change.

Troubleshooting

In the event of a problem with Grid Control, the starting point for any diagnostic effort is the console itself. The Management System tab provides access to an overview of all Management Service operations and current alerts. Other pages summarize the health of Management Service processes and logged errors These pages are useful for determining the causes of any performance problems as the summary page shows at a historical view of the amount of files waiting to be loaded to the Management Repository and the amount of work waiting to be completed by Management Agents.

Upload Delay for Monitoring Data

When assessing the health and availability of targets through the Grid Control console, information is slow to appear in the UI, especially after a Management Service outage. The state of a target in the Grid Control console may be delayed after a state change on the monitored host. Use the Management System page to gauge backlog for pending files to be processed.

Notification Delay of Target State Change

The model used by the Management Agent to assess the state of health for any particular monitored target is poll based. Management Agents immediately post a notification to the Management Service as soon as a change in state is detected. This infers that there is some potential delay for the Management Agent to actually detect a change in state.

Configuring Oracle Enterprise Manager for Active and Passive Environments

Active and Passive environments, also known as Cold Failover Cluster (CFC) environments, refer to one type of high availability solution that allows an application to run on one node at a time. These environments generally use a combination of cluster software to provide a logical host name and IP address, along with interconnected host and storage systems to share information to provide a measure of high availability for applications.

Note:

The database for hosting the Management Repository and the WebLogic Server software must be installed before running Grid Control. For information on installing WebLogic Server, refer to Oracle Enterprise Manager Grid Control Basic Installation Guide.

This chapter contains the following sections:

Using Virtual Host Names for Active and Passive High Availability Environments in Enterprise Manager Database Control
Configuring Grid Control Repository in Active/Passive High Availability Environments
How to Configure Grid Control OMS in Active/Passive Environment for High Availability Failover Using Virtual Host Names
Configuring Targets for Failover in Active/Passive Environments

Using Virtual Host Names for Active and Passive High Availability Environments in Enterprise Manager Database Control

This section provides information to database administrators about configuring an Oracle Database release 11gR1 in Cold Failover Cluster environments using Enterprise Manager Database Control.

The following conditions must be met for Database Control to service a database instance after failing over to a different host in the cluster:

The installation of the database must be done using a Virtual IP address.
The installation must be conducted on a shared disk or volume which holds the binaries, configuration, and runtime data (including the database).
Configuration data and metadata must also failover to the surviving node.
Inventory location must failover to the surviving node.
Software owner and time zone parameters must be the same on all cluster member nodes that will host this database.

The following items are configuration and installation points you should consider before getting started.

To override the physical host name of the cluster member with a virtual host name, software must be installed using the parameter ORACLE_HOSTNAME.
For inventory pointer, software must be installed using the command parameter invPtrLoc to point to the shared inventory location file, which includes the path to the shared inventory location.
The database software, the configuration of the database, and Database Control are done on a shared volume.

Set Up the Alias for the Virtual Host Name and Virtual IP Address

You can set up the alias for the virtual host name and virtual IP address by either allowing the clusterware to set it up automatically or by setting it up manually before installation and startup of Oracle services. The virtual host name must be static and resolvable consistently on the network. All nodes participating in the setup must resolve the virtual IP address to the same host name. Standard TCP tools similar to nslookup and traceroute commands can be used to verify the set up.

Set Up Shared Storage

Shared storage can be managed by the clusterware that is in use or you can use any shared file system volume as long as it is supported. The most common shared file system is NFS. You can also use the Oracle Cluster File System software.

Set Up the Environment

Some operating system versions require specific operating system patches to be applied prior to installing release 11gR1 of the Oracle database. You must also have sufficient kernel resources available when you conduct the installation.

Before you launch the installer, specific environment variables must be verified. Each of the following variables must be identically set for the account you are using to install the software on all machines participating in the cluster.

Operating system variable TZ, time zone setting. You should unset this prior to the installation.
PERL variables. Variables like PERL5LIB should be unset to prevent the installation and Database Control from picking up the incorrect set of PERL libraries.
Paths used for dynamic libraries. Based on the operating system, the variables can be LD_LIBRARY_PATH, LIBPATH, SHLIB_PATH, or DYLD_LIBRARY_PATH. These variables should only point to directories that are visible and usable on each node of the cluster.

Ensure That the Oracle USERNAME, ID, and GROUP NAME Are Synchronized on All Cluster Members

The user and group of the software owner should be defined identically on all nodes of the cluster. You can verify this using the following command:

$ id -a
uid=1234(oracle) gid=5678(dba) groups=5678(dba)

Ensure That Inventory Files Are on the Shared Storage

To ensure that inventory files are on the shared storage, follow these steps:

Create you new ORACLE_HOME directory.
Create the Oracle Inventory directory under the new Oracle home
```
cd <shared oracle home>
mkdir oraInventory
```
Create the oraInst.loc file. This file contains the Inventory directory path information required by the Universal Installer:
1. vi oraInst.loc
2. Enter the path information to the Oracle Inventory directory and specify the group of the software owner as the dba user. For example:
```
inventory_loc=/app/oracle/product/11.1/oraInventory inst_group=dba
```
  Depending on the type of operating system, the default directory for the oraInst.loc file is either /etc (for example, on Linux) or /var/opt/oracle (for example, on Solaris and HP-UX).

Start the Installer

To start the installer, point to the inventory location file oraInst.loc, and specify the host name of the virtual group. The debug parameter in the example below is optional:

$ export ORACLE_HOSTNAME=lxdb.acme.com
$ runInstaller -invPtrloc /app/oracle/share1/oraInst.loc ORACLE_HOSTNAME=lxdb.acme.com -debug

Windows Specific Configuration Steps

On Windows environments, an additional step is required to copy over service and keys required by the Oracle software. Note that these steps are required if your clustering software does not provide a shared windows registry.

Using regedit on the first host, export each Oracle service from under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services.
Using regedit on the first host, export HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE.
Use regedit to import the files created in step 1 and 2 to the failover host.

Start Services

You must start the services in the following order:

Establish IP address on the active node
Start the TNS listener
Start the database
Start dbconsole
Test functionality

In the event that services do not start, do the following:

Establish IP on failover box
Start TNS listener
```
lsnrctl start
```
Start the database
```
dbstart
```
Start Database Control
```
emctl start dbconsole
```
Test functionality

To manually stop or shutdown a service, follow these steps:

Stop the application.
Stop Database Control
```
emctl stop dbconsole
```
Stop TNS listener
```
lsnrctl stop
```
Stop the database
```
dbshut
```
Stop IP

Configuring Grid Control Repository in Active/Passive High Availability Environments

In order for Grid Control repository to fail over to a different host, the following conditions must be met:

The installation must be conducted using a Virtual Hostname and an associated unique IP address
Installation must occur on a shared disk/volume which holds the binaries, the configuration, and the runtime data (including the repository database)
Configuration data and metadata must also failover to the surviving node
Inventory location must failover to the surviving node
Software owner and time zone parameters must be the same on all cluster member nodes that will host this OMS

Installation and Configuration

The following installation and configuration requirements should be noted:

To override the physical host name of the cluster member with a virtual host name, software must be installed using the parameter ORACLE_HOSTNAME.
For inventory pointer, software must be installed using the command line parameter -invPtrLoc to point to the shared inventory location file, which includes the path to the shared inventory location.
The database software, the configuration of the database, and data files are on a shared volume.

If you are using an NFS mounted volume for the installation, ensure that you specify rsize and wsize in your mount command to prevent I/O issues. See My Oracle Support note 279393.1 Linux.NetApp: RHEL/SUSE Setup Recommendations for NetApp Filer Storage.

Example:

grid-repo.acme.com:/u01/app/share1 /u01/app/share1 nfs rw,bg,rsize=32768,wsize=32768,hard,nointr,tcp,noac,vers=3,timeo=600 0 0

Note:

Any reference to shared could also be true for non-shared failover volumes, which can be mounted on active hosts after failover.

Set Up the Virtual Host Name/Virtual IP Address

You can set up the virtual host name and virtual IP address by either allowing the clusterware to set it up or manually setting it up before installation and startup of Oracle services. The virtual host name must be static and resolvable consistently on the network. All nodes participating in the setup must resolve the virtual IP address to the same host name. Standard TCP tools such as nslookup and traceroute can be used to verify the host name. Validate using the commands listed below:

nslookup <virtual hostname>

This command returns the virtual IP address and fully qualified host name.

nslookup <virtual IP>

This command returns the virtual IP address and fully qualified host name.

Be sure to try these commands on every node of the cluster to verify that the correct information is returned.

Set Up the Environment

Some operating system versions require specific patches to be applied prior to installing 11gR1. The user installing and using the 11gR1 software must also have sufficient kernel resources available. Refer to the operating system's installation guide for more details.

Before you launch the installer, certain environment variables must be verified. Each of these variables must be set up identically for the account installing the software on ALL machines participating in the cluster:

OS variable TZ (time zone setting)

You should unset this variable prior to installation.
PERL variables

Variables such as PERL5LIB should also be unset to prevent inadvertently picking up the wrong set of PERL libraries.
Same operating system, operating system patches, and version of the kernel. Therefore, RHEL 3 and RHEL 4 are not allowed for a CFC system.
System libraries

For example, LIBPATH, LD_LIBRARY_PATH, SHLIB_PATH, and so on. The same system libraries must be present.

Synchronize Operating System User IDs

The user and group of the software owner should be defined identically on all nodes of the cluster. This can be verified using the id command:

$ id -a

uid=550(oracle) gid=50(oinstall) groups=501(dba)

Set Up Inventory

You can set up the inventory by using the following steps:

Create your new ORACLE_HOME directory.
Create the Oracle Inventory directory under the new oracle home

cd <shared oracle home>

mkdir oraInventory
Create the oraInst.loc file. This file contains the Inventory directory path information needed by the Universal Installer.

vi oraInst.loc

Enter the path information to the Oracle Inventory directory, and specify the group of the software owner as the oinstall user:

Example:

inventory_loc=/app/oracle/product/11.1/oraInventory inst_group=oinstall

Install the Software

Follow these steps to install the software:

Create the shared disk location on both the nodes for the software binaries.
Point to the inventory location file oraInst.loc (under the ORACLE_BASE in this case), as well as specifying the host name of the virtual group. For example:
```
$ export ORACLE_HOSTNAME=grid-repo.acme.com
$ runInstaller -invPtrLoc /app/oracle/share1/oraInst.loc ORACLE_HOSTNAME=grid-repo.acme.com
```
Install the repository DB software only on the shared location. For example:

/oradbnas/app/oracle/product/oradb111 using Host1
Start DBCA and create all the data files be on the shared location. For example:

/oradbnas/oradata
Continue the rest of the installation normally.
Once completed, copy the files oraInst.loc and oratab to /etc. Also copy /opt/oracle to all cluster member hosts (Host2, Host3, and so on).

Windows Specific Configuration Steps

Using regedit on the first host, export each Oracle service from under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services.
Using regedit on the first host, export HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE.
Use regedit to import the files created in step 1 and 2 to the failover host.

Startup of Services

Be sure you start your services in the proper order:

Establish IP address on the active node
Start the TNS listener if it is part of the same failover group
Start the database if it is part of the same failover group

In case of failover, follow these steps:

Establish IP address on the failover box
Start TNS listener (lsnrctl start) if it is part of the same failover group
Start the database (dbstart) if it is part of the same failover group

Summary

The Grid Control Management Repository can now be deployed in a CFC environment that utilizes a floating host name.

To deploy the OMS midtier in a CFC environment, please see How to Configure Grid Control OMS in Active/Passive Environment for High Availability Failover Using Virtual Host Names.

How to Configure Grid Control OMS in Active/Passive Environment for High Availability Failover Using Virtual Host Names

This section provides a general reference for Grid Control administrators who want to configure Enterprise Manager 11gR1 Grid Control in Cold Failover Cluster (CFC) environments.

Overview and Requirements

The following conditions must be met for Grid Control to fail over to a different host:

The installation must be done using a Virtual Host Name and an associated unique IP address.
Install on a shared disk/volume which holds the binaries, the configuration and the runtime data (including the recv directory).
Configuration data and metadata must also failover to the surviving node.
Inventory location must failover to the surviving node.
Software owner and time zone parameters must be the same on all cluster member nodes that will host this Oracle Management Service (OMS).

Installation and Configuration

To override the physical host name of the cluster member with a virtual host name, software must be installed using the parameter ORACLE_HOSTNAME. For inventory pointer, the software must be installed using the command line parameter -invPtrLoc to point to the shared inventory location file, which includes the path to the shared inventory location.

If you are using an NFS mounted volume for the installation, please ensure that you specify rsize and wsize in your mount command to prevent running into I/O issues.

For example:

oms.acme.com:/u01/app/share1 /u01/app/share1 nfs rw,bg,rsize=32768,wsize=32768,hard,nointr,tcp,noac,vers=3,timeo=600 0 0

Note:

Any reference to shared failover volumes could also be true for non-shared failover volumes which can be mounted on active hosts after failover.

Setting Up the Virtual Host Name/Virtual IP Address

You can set up the virtual host name and virtual IP address by either allowing the clusterware to set it up, or manually setting it up yourself before installation and startup of Oracle services. The virtual host name must be static and resolvable consistently on the network. All nodes participating in the setup must resolve the virtual IP address to the same host name. Standard TCP tools such as nslookup and traceroute can be used to verify the host name. Validate using the following commands:

nslookup <virtual hostname>

This command returns the virtual IP address and full qualified host name.

nslookup <virtual IP>

This command returns the virtual IP address and fully qualified host name.

Be sure to try these commands on every node of the cluster and verify that the correct information is returned.

Setting Up Shared Storage

Storage can be managed by the clusterware that is in use or you can use any shared file system (FS) volume as long as it is not an unsupported type, such as OCFS V1. The most common shared file system is NFS.

Note:

If the OHS directory is on a shared storage, the LockFile directive in the httpd.conf file should be modified to point to a local disk, otherwise there is a potential for locking issues.

Setting Up the Environment

Some operating system versions require specific operating system patches be applied prior to installing 11gR1. The user installing and using the 11gR1 software must also have sufficient kernel resources available. Refer to the operating system's installation guide for more details.Before you launch the installer, certain environment variables need to be verified. Each of these variables must be identically set for the account installing the software on ALL machines participating in the cluster:

OS variable TZ

Time zone setting. You should unset this variable prior to installation
PERL variables

Variables such as PERL5LIB should also be unset to avoid association to the incorrect set of PERL libraries

Synchronizing Operating System IDs

The user and group of the software owner should be defined identically on all nodes of the cluster. This can be verified using the 'id' command:

$ id -a

uid=550(oracle) gid=50(oinstall) groups=501(dba)

Setting Up Shared Inventory

Use the following steps to set up shared inventory:

Create your new ORACLE_HOME directory.
Create the Oracle Inventory directory under the new oracle home:

$ cd <shared oracle home>

$ mkdir oraInventory
Create the oraInst.loc file. This file contains the Inventory directory path information needed by the Universal Installer.
1. vi oraInst.loc
2. Enter the path information to the Oracle Inventory directory and specify the group of the software owner as the oinstall user. For example:
  
  inventory_loc=/app/oracle/product/11.1/oraInventory
  
  inst_group=oinstall

Installing the Software

Refer to the following steps when installing the software:

Create the shared disk location on both the nodes for the software binaries
Install WebLogic Server. For information on installing WebLogic Server, refer to Oracle Enterprise Manager Grid Control Basic Installation Guide.
Point to the inventory location file oraInst.loc (under the ORACLE_BASE in this case), as well as specifying the host name of the virtual group. For example:
```
$ export ORACLE_HOSTNAME=lxdb.acme.com
$ runInstaller -invPtrloc /app/oracle/share1/oraInst.loc 
ORACLE_HOSTNAME=lxdb.acme.com -debug
```

Install Oracle Management Services on cluster member Host1 using the option, "EM install using the existing DB"
Continue the remainder of the installation normally.
Once completed, copy the files oraInst.loc and oratab to /etc. Also copy /opt/oracle to all cluster member hosts (Host2, Host3, and so on).

Windows Specific Configuration Steps

Using regedit on the first host, export each Oracle service from under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services.
Using regedit on the first host, export HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE.
Use regedit to import the files created in step 1 and 2 to the failover host.

Starting Up Services

Ensure that you start your services in the proper order. Use the order listed below:

Establish IP address on the active node
Start the TNS listener (if it is part of the same failover group)
Start the database (if it is part of the same failover group)
Start Grid Control using emctl start oms
Test functionality

In case of failover, refer to the following steps:

Establish IP on failover box
Start TNS listener using the command lsnrctl start if it is part of the same failover group
Start the database using the command dbstart if it is part of the same failover group
Start Grid Control using the command emctl start oms
Test the functionality

Summary

The OMS mid-tier component of Grid Control can now be deployed in a CFC environments that utilize a floating host name.

To deploy the repository database in a CFC environment, see Configuring Grid Control Repository in Active/Passive High Availability Environments.

Configuring Targets for Failover in Active/Passive Environments

This section provides a general reference for Grid Control administrators who want to relocate Cold Failover Cluster (CFC) targets from one existing Management Agent to another. Although the targets are capable of running on multiple nodes, these targets run only on the active node in a CFC environment.

CFC environments generally use a combination of cluster software to provide a virtual host name and IP address along with interconnected host and storage systems to share information and provide high availability for applications. Automating failover of the virtual host name and IP, in combination with relocating the Enterprise Manager targets and restarting the applications on the passive node, requires the use of Oracle Enterprise Manager command-line interface (EM CLI) and Oracle Clusterware (running Oracle Database release 10g or 11g) or third-party cluster software. Several Oracle partner vendors provide clusterware solutions in this area.

The Enterprise Manager Command Line Interface (EM CLI) allows you to access Enterprise Manager Grid Control functionality from text-based consoles (terminal sessions) for a variety of operating systems. Using EM CLI, you can perform Enterprise Manager Grid Control console-based operations, like monitoring and managing targets, jobs, groups, blackouts, notifications, and alerts. See the Oracle Enterprise Manager Command Line Interface manual for more information.

Target Relocation in Active/Passive Environments

Beginning with Oracle Enterprise Manager 10g release 10.2.0.5, a single Oracle Management Agent running on each node in the cluster can monitor targets configured for active / passive high availability. Only one Management Agent is required on each of the physical nodes of the CFC cluster because, in case of a failover to the passive node, Enterprise Manager can move the HA monitored targets from the Management Agent on the failed node to another Management Agent on the newly activated node using a series of EMCLI commands. See the Oracle® Enterprise Manager Command Line Interface manual for more information.

If your application is running in an active/passive environment, the clusterware brings up the applications on the passive node in the event that the active node fails. For Enterprise Manager to continue monitoring the targets in this type of configuration, the existing Management Agent needs additional configuration.

The following sections describe how to prepare the environment to automate and restart targets on the new active node. Failover and fallback procedures are also provided.

Installation and Configuration

The following sections describe how to configure Enterprise Manager to support a CFC configuration using the existing Management Agents communicating with the Oracle Management Service processes:

Prerequisites
Configuration Steps

Prerequisites

Prepare the Active/Passive environments as follows:

Ensure the operating system clock is synchronized across all nodes of the cluster. (Consider using Network Time Protocol (NTP) or another network synchronization method.)
Use the EM CLI RELOCATE_TARGETS command only with Enterprise Manager Release 10.2.0.5 (and higher) Management Agents.

Configuration Steps

The following steps show how to configure Enterprise Manager to support a CFC configuration using the existing Management Agents that are communicating with the OMS processes. The example that follows is based on a configuration with a two-node cluster that has one failover group. For additional information about targets running in CFC active/passive environments, see My Oracle Support note 406014.1.

Configure EM CLI

To set up and configure target relocation, use the Oracle Enterprise Manager command-line interface (EM CLI). See the Oracle Enterprise Manager Command Line Interface manual and the Oracle Enterprise Manager Extensibility manual for information about EM CLI and Management Plug-Ins.
Install Management Agents

Install the Management Agent on a local disk volume on each node in the cluster. Once installed, the Management Agents are visible in the Grid Control console.
Discover Targets

After the Active / Passive targets have been configured, use the Management Agent discovery screen in the Grid Control console to add the targets (such as database, listener, application server, and so on). Perform the discovery on the active node, which is the node that is currently hosting the new target.

Failover Procedure

To speed relocation of targets after a node failover, configure the following steps using a script that contains the commands necessary to automatically initiate a failover of a target. Typically, the clusterware software has a mechanism with which you can automatically execute the script to relocate the targets in Enterprise Manager. Also, see Script Examples for sample scripts.

Shut down the target services on the failed active node.

On the active node where the targets are running, shut down the target services running on the virtual IP.
If required, disconnect the storage for this target on the active node.

Shut down all the applications running on the virtual IP and shared storage.
Enable the target's IP address on the new active node.
If required, connect storage for the target on the currently active node.
Relocate the targets in Grid Control using EM CLI.

To relocate the targets to the Management Agent on the new active node, issue the EM CLI RELOCATE TARGET command for each target type (listener, application servers, and so on) that you must relocate after the failover operation. For example:
```
emcli relocate_targets
-src_agent=<node 1>:3872 
-dest_agent=<node 2>:3872
-target_name=<database_name>
-target_type=oracle_database
-copy_from_src 
-force=yes
```
In the example, port 3872 is the default port for the Management Agent. To find the appropriate port number for your configuration, use the value for the EMD_URL parameter in the emd.properties file for this Management Agent.

Note: In case of a failover event, the source agent will not be running. However, there is no need to have the source Management Agent running to accomplish the RELOCATE operation. EM CLI is an OMS client that performs its RELOCATE operations directly against the Management Repository.

Fallback Procedure

To return the HA targets to the original active node or to any other cluster member node:

Repeat the steps in Failover Procedure to return the HA targets to the active node.
Verify the target status in the Grid Control console.

EM CLI Parameter Reference

Issue the same command for each target type that will be failed over to (or be switched over) during relocation operations. For example, issue the same EM CLI command to relocate the listener, the application servers, and so on. Table 18-6 shows the EM CLI parameters you use to relocate targets:

Table 18-6 EM CLI Parameters

EM CLI Parameter	Description
-src_agent	Management Agent on which the target was running before the failover occurred.
-dest_agent	Management Agent that will be monitoring the target after the failover.
-target_name	Name of the target to be failed over.
-target_type	Type of target to be failed over (internal Enterprise Manager target type). For example, the Oracle database (for a standalone database or an Oracle RAC instance), the Oracle listener for a database listener, and so on.
-copy_from_src	Use the same type of properties from the source Management Agent to identify the target. This is a MANDATORY parameter! If you do not supply this parameter, you can corrupt your target definition!
-force	Force dependencies (if needed) to failover as well.

Script Examples

The following sections provide script examples:

Relocation Script
Start Listener Script
Stop Listener Script

Relocation Script

#! /bin/ksh

#get the status of the targets

emcli get_targets -
targets="db1:oracle_database;listener_db1:oracle_listener" -noheader

  if [[ $? != 0 ]]; then exit 1; fi

# blackout the targets to stop false errors.  This blackout is set to expire in 30 minutes

emcli create_blackout -name="relocating active passive test targets" -
add_targets="db1:oracle_database;listener_db1:oracle_listener" -
reason="testing failover" -
schedule="frequency:once;duration:0:30"
  if [[ $? != 0 ]]; then exit 1; fi

# stop the listener target.  Have to go out to a OS script to use the 'lsnrctl set
current_listener' function

emcli execute_hostcmd -cmd="/bin/ksh" -osscript="FILE" -
input_file="FILE:/scratch/oraha/cfc_test/listener_stop.ksh" -
credential_set_name="HostCredsNormal" -
targets="host1.us.oracle.com:host"
  if [[ $? != 0 ]]; then exit 1; fi

# now, stop the database

emcli execute_sql -sql="shutdown abort" -
targets="db1:oracle_database" -
credential_set_name="DBCredsSYSDBA"
  if [[ $? != 0 ]]; then exit 1; fi

# relocate the targets to the new host

emcli relocate_targets -
src_agent=host1.us.oracle.com:3872 -
dest_agent=host2.us.oracle.com:3872 -
target_name=db1 -target_type=oracle_database -
copy_from_src -force=yes  -
changed_param=MachineName:host1vip.us.oracle.com
  if [[ $? != 0 ]]; then exit 1; fi

emcli relocate_targets -
src_agent=host1.us.oracle.com:3872 -
dest_agent=host2.us.oracle.com:3872 -
target_name=listener_db1 -target_type=oracle_listener -
copy_from_src -force=yes  -
changed_param=MachineName:host1vip.us.oracle.com
  if [[ $? != 0 ]]; then exit 1; fi

# Now, restart database and listener on the new host

emcli execute_hostcmd -cmd="/bin/ksh" -osscript="FILE" -
input_file="FILE:/scratch/oraha/cfc_test/listener_start.ksh" -
credential_set_name="HostCredsNormal" -
targets="host2.us.oracle.com:host"
  if [[ $? != 0 ]]; then exit 1; fi

emcli execute_sql -sql="startup" -
targets="db1:oracle_database" -
credential_set_name="DBCredsSYSDBA"
  if [[ $? != 0 ]]; then exit 1; fi

# Time to end the blackout and let the targets become visible

emcli stop_blackout -name="relocating active passive test targets"
  if [[ $? != 0 ]]; then exit 1; fi

# and finally, recheck the status of the targets

emcli get_targets -
targets="db1:oracle_database;listener_db1:oracle_listener" -noheader
  if [[ $? != 0 ]]; then exit 1; fi

Start Listener Script

#!/bin/ksh

export 
ORACLE_HOME=/oradbshare/app/oracle/product/11.1.0/db
export PATH=$ORACLE_HOME/bin:$PATH

lsnrctl << EOF
set current_listener listener_db1
start
exit
EOF

Stop Listener Script

#!/bin/ksh
export 
ORACLE_HOME=/oradbshare/app/oracle/product/11.1.0/db
export PATH=$ORACLE_HOME/bin:$PATH

lsnrctl << EOF
set current_listener listener_db1
stop
exit
EOF