5 Understanding Oracle Service Bus High Availability

A clustered Oracle Service Bus domain provides high availability. A highly available deployment has recovery provisions in the event of hardware or network failures, and provides for the transfer of control to a backup component when a failure occurs.

The following sections describe clustering and high availability for an Oracle Service Bus deployment:

Section 5.1, "About Oracle Service Bus High Availability"
Section 5.2, "Oracle Service Bus Failure and Recovery"
Section 5.3, "High Availability for Poller Based Transports"

Note:

For detailed instructions on configuring Oracle Service Bus for high availability, see the "Architecting OSB for High Availability and Whole Server Migration" document, available on the Oracle Service Bus product page at http://www.oracle.com/technology/products/integration/service-bus/index.html.

5.1 About Oracle Service Bus High Availability

For a cluster to provide high availability, it must be able to recover from service failures. Oracle WebLogic Server supports failover for clustered objects and services pinned to servers in a clustered environment. For information about how Oracle WebLogic Server handles such failover scenarios, see "Communications in a Cluster" in Oracle Fusion Middleware Using Clusters for Oracle WebLogic Server.

5.1.1 Recommended Hardware and Software

The basic components of a highly available Oracle Service Bus environment include the following:

An administration server
A set of managed servers in a cluster
An HTTP load balancer (router)
Physically shared, highly-available disk subsystems for managed server data—Whole server migration requires that all data on a managed server be located on a multi-ported disk. A typical and recommended way to do this is by using a multi-ported disk subsystem or SAN, and allowing two or more servers to mount file systems within the disk subsystem. The file system does not need to be simultaneously shared; it is only necessary for one server to mount a file system at any one time.
A Microsoft SQL Server or Oracle database configured for failover with a cluster manager—You should take advantage of any high availability or failover solutions offered by your database vendor in addition to using a commercial cluster manager. (For database-specific information, see your database vendor's documentation.)

Note:
For information about availability and performance considerations associated with the various types of JDBC drivers, see "Configure Database Connectivity" in the Oracle Fusion Middleware Oracle WebLogic Server Administration Console Online Help.

A full discussion of how to plan the network topology of your clustered system is beyond the scope of this section. For information about how to fully utilize inbound load balancing and failover features for your Oracle Service Bus configuration by organizing one or more Oracle WebLogic Server clusters in relation to load balancers, firewalls, and Web servers, see "Cluster Architectures" in Oracle Fusion Middleware Using Clusters for Oracle WebLogic Server. For information on configuring outbound load balancing, see "Transport Configuration Page" in the Oracle Fusion Middleware Administrator's Guide for Oracle Service Bus.

For a simplified view of a cluster, showing the HTTP load balancer, highly available database and multi-ported file system, see the following figure.

Figure 5-1 Simplified View of a Cluster


5.1.1.1 Regarding JMS File Stores

The default Oracle Service Bus domain configuration uses a file store for JMS persistence to store collected metrics for monitoring purposes and alerts. The configuration shown relies on a highly available multi-ported disk that can be shared between managed servers to optimize performance. This will typically perform better than a JDBC store.

For information about configuring JMS file stores, see "Using the WebLogic Persistent Store" in Oracle Fusion Middleware Configuring Server Environments for Oracle WebLogic Server.

5.1.2 What Happens When a Server Fails

A server can fail due to software or hardware problems. The following sections describe the processes that occur automatically in each case and the manual steps that must be taken in these situations.

5.1.2.1 Software Faults

If a software fault occurs, the Node Manager (if configured to do so) will restart the Oracle WebLogic Server. For more information about Node Manager, see "Using Node Manager to Control Servers" in Oracle Fusion Middleware Node Manager Administrator's Guide for Oracle WebLogic Server. For information about how to prepare to recover a secure installation, see "Directory and File Back Ups for Failure Recovery" in Oracle Fusion Middleware Managing Server Startup and Shutdown for Oracle WebLogic Server.

5.1.2.2 Hardware Faults

If a hardware fault occurs, the physical machine may need to be repaired and could be out of operation for an extended period. In this case, the following events occur:

The HTTP load balancer detects the failed server and redirects requests to the other managed servers. (The actual algorithm for doing this depends on the vendor of the HTTP load balancer.)
All new internal requests will be redirected to other managed servers.
All in-flight transactions on the failed server are terminated.
JMS messages that are already enqueued are not automatically migrated, but must be manually migrated. For more information, see Section 5.1.2.3, "Server Migration."
If your Oracle Service Bus configuration includes one or more proxy services that use File, FTP or Email transports with transport pollers pinned to a managed server that failed, you must select a different managed server in the definition of each of those proxy services in order to resume normal operation.
If the managed server that hosts the aggregator application fails, collected metrics for monitoring and alerts are not automatically migrated. You must perform a manual migration. For more information, see Section 5.1.2.3, "Server Migration."

5.1.2.3 Server Migration

Oracle Service Bus leverages Oracle WebLogic Server's whole server migration functionality to enable transparent failover of managed servers from one system to another. For detailed information regarding Oracle WebLogic Server whole server migration, see the following topics in the Oracle WebLogic Server documentation set:

"Failover and Replication in a Cluster" in Oracle Fusion Middleware Using Clusters for Oracle WebLogic Server
"Avoiding and Recovering from Server Failure" in Oracle Fusion Middleware Managing Server Startup and Shutdown for Oracle WebLogic Server.

5.1.2.4 Message Reporting Purger

The Message Reporting Purger for the JMS Message Reporting Provider is deployed in a single managed server in a cluster (see Appendix B, "Oracle Service Bus Deployment Resources").

If the managed server that hosts the Message Reporting Purger application fails, you must select a different managed server for the Message Reporting Purger and its associated queue (wli.reporting.purge.queue) to resume normal operation.

Any pending purging requests in the managed server that failed are not automatically migrated. You must either perform a manual migration, or target the Message Reporting Purger application and its queue to a different managed server and send the purging request again.

5.2 Oracle Service Bus Failure and Recovery

In addition to the high availability features of Oracle WebLogic Server, Oracle Service Bus has failure and recovery characteristics that are based on the implementation and configuration of your Oracle Service Bus solution. The following sections discuss specific Oracle Service Bus failure and recovery topics:

Section 5.2.1, "Transparent Server Reconnection"
Section 5.2.2, "EIS Instance Failover"

5.2.1 Transparent Server Reconnection

Oracle Service Bus provides transparent reconnection to external servers and services when they fail and restart. If Oracle Service Bus sends a message to a destination while the connection is unavailable, you may see one or more runtime error messages in the server console.

Transparent reconnection is provided for the following types of servers and services:

SMTP
JMS
FTP
DBMS
Business services

Oracle Service Bus Console also provides monitoring features that enable you to view the status of services and to establish a system of SLAs and alerts to respond to service failures. For more information, see "Monitoring" in the Oracle Fusion Middleware Administrator's Guide for Oracle Service Bus.

5.2.2 EIS Instance Failover

Most business services in production environments will be configured to point to a number of EIS instances for load balancing purposes and high availability. If you expect that an EIS instance failure will have an extended duration or a business service points to a single, failed EIS instance, you can reconfigure the business service to point at an alternate, operational EIS instance. This change can be made dynamically.

For information about using the Oracle Service Bus Console to change an endpoint URI for a business service, see "Transport Configuration Page" in the Oracle Fusion Middleware Administrator's Guide for Oracle Service Bus.

5.3 High Availability for Poller Based Transports

An additional JMS-based framework has been created to allow the poller-based File, FTP, and Email transports to recover from failure. These transports use the JMS framework to ensure that each message is processed at least once. However, if the processing is done but the server crashes or is restarted before the transaction is complete, the same file may be processed again. The number of retries depends on the redelivery limit that is set for the poller transport for the domain.

New messages from the target (a directory in case of File and FTP transports and server account in case of Email transport) are copied to the download (stage) directory at the time of polling or pipeline processing.

Note:

For FTP transport, a file is renamed as <name>.stage in the remote directory. It is copied to the stage directory only at the time of pipeline processing.

For File and FTP transports, a JMS task is created corresponding to each new file in the target directory. For Email transport, an e-mail message is stored in a file in the download directory and a JMS task is created corresponding to each of these files.

These JMS task messages are enqueued to a JMS queue which is pre-configured for these transports when the Oracle Service Bus domain is created.
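The polling cycle described above can be illustrated with a minimal sketch. It is not the Oracle Service Bus implementation: the function name poll_once, the directory names, and the in-memory queue that stands in for the JMS task queue are all assumptions made for illustration. The sketch also moves each file rather than copying it, purely to keep the example short.

```python
import queue
import shutil
import tempfile
from pathlib import Path


def poll_once(target_dir: Path, stage_dir: Path, task_queue: queue.Queue) -> int:
    """Stage every file found in target_dir and enqueue one task per file.

    Hypothetical helper: the queue stands in for the JMS task queue.
    """
    staged = 0
    for path in sorted(target_dir.iterdir()):
        if not path.is_file():
            continue
        staged_path = stage_dir / path.name
        # Assumption: move instead of copy, to keep the sketch short.
        shutil.move(str(path), str(staged_path))
        # One task per new file, as for the File and FTP transports.
        task_queue.put(str(staged_path))
        staged += 1
    return staged


# Build a throwaway target and stage directory with two new files.
root = Path(tempfile.mkdtemp())
target, stage = root / "in", root / "stage"
target.mkdir()
stage.mkdir()
(target / "order1.xml").write_text("<order/>")
(target / "order2.xml").write_text("<order/>")

tasks = queue.Queue()
count = poll_once(target, stage, tasks)
```

After one polling cycle the target directory is empty, the stage directory holds both files, and the queue holds one task for each of them.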

5.3.1 JMS Queues

The following poller-transport-specific JMS queues are configured for Oracle Service Bus domains:

Table 5-1 JMS Queues Configured for Oracle Service Bus Domains

Transport Name	JMS Queue Name
FTP	wlsb.internal.transport.task.queue.ftp
File	wlsb.internal.transport.task.queue.file
Email	wlsb.internal.transport.task.queue.email

A domain-wide message-driven bean (MDB) receives the JMS task. Once the MDB receives the message, it invokes the pipeline in an XA transaction. If the message processing fails in the pipeline due to an exception in the pipeline or a server crash, the XA transaction also fails and the message is again enqueued to the JMS queue. This message is re-delivered to the MDB based on the redelivery limit parameter set on the queue. By default, the redelivery limit is 1 (the message is sent once and retried once). If the redelivery limit is exhausted without successfully delivering the message, the message is moved to the error directory. You can change this limit from Oracle WebLogic Server Console. For more information, see "JMS Topic: Configuration: Delivery Failure" in the Oracle Fusion Middleware Oracle WebLogic Server Administration Console Online Help.

Note:

For a single Oracle Service Bus domain transport, the redelivery limit value is global across the domain. For example, within a domain, it is not possible to have an FTP proxy with a redelivery limit of 2 and another FTP proxy with a redelivery limit of 5.
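The effect of the redelivery limit can be illustrated with a small simulation. This is a sketch under stated assumptions, not the MDB or JMS implementation: deliver, flaky_pipeline, and the file names are hypothetical, a plain Python call stands in for the pipeline invocation in an XA transaction, and a list stands in for the error directory.

```python
from collections import deque


def deliver(tasks, process, redelivery_limit=1):
    """Deliver each task; re-enqueue on failure until the limit is exhausted."""
    pending = deque((task, 0) for task in tasks)  # (task, redeliveries so far)
    done, errors = [], []
    while pending:
        task, redeliveries = pending.popleft()
        try:
            process(task)  # stands in for the pipeline call in a transaction
            done.append(task)
        except Exception:
            if redeliveries < redelivery_limit:
                # Failed transaction: the task goes back on the queue.
                pending.append((task, redeliveries + 1))
            else:
                # Limit exhausted: stands in for the move to the error directory.
                errors.append(task)
    return done, errors


attempts = {}


def flaky_pipeline(task):
    """Hypothetical pipeline: one task always fails, one fails only once."""
    attempts[task] = attempts.get(task, 0) + 1
    if task == "bad.xml":
        raise RuntimeError("pipeline error")
    if task == "flaky.xml" and attempts[task] == 1:
        raise RuntimeError("transient error")


done, errors = deliver(["ok.xml", "flaky.xml", "bad.xml"], flaky_pipeline)
```

With the default limit of 1, the task that fails once succeeds on its single retry, while the task that always fails is attempted twice and then set aside.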

5.3.2 High Availability in Clusters

For clusters, the JMS queue associated with each of these poller-based transports is a distributed queue (each Managed Server has a local JMS queue, which is a member of the distributed queue). The JMS queue for a transport is domain-wide. The task message is enqueued to the distributed queue, which passes it on to the underlying local queues on the Managed Servers. The MDB deployed on the Managed Server picks up the message and then invokes the pipeline in a transaction for actual message processing.

Since the JMS queues are distributed queues in cluster domains, high availability is achieved by utilizing the Oracle WebLogic Server distributed queue functionality. Instead of all Managed Servers polling for messages, only one of the Managed Servers in a cluster is assigned the job of polling the messages. At the time of proxy service configuration, one of the Managed Servers is configured to poll for new files or e-mail messages.

The poller server polls for new messages and puts them on the uniform distributed queue associated with the respective poller transport. From this queue, the message is passed on to the local queue on a Managed Server. The Managed Servers receive the messages from these local queues through MDBs deployed on all the servers.

Note:

There is a uniform distributed queue with a local queue on all the Managed Servers for each of these poller based transports.

If a Managed Server crashes after the distributed queue delivers the message to the local queue, you must perform a manual migration. For more information, see Section 5.1.2.3, "Server Migration."

When a cluster is created, the uniform distributed queue is created with local queue members on all the Managed Servers. However, when a new Managed Server is added to an existing cluster, these local queues are not automatically created. You have to manually create the local queues and make them a part of the uniform distributed queue.

To create a local queue:

Create a JMS Server and target it to the newly created Managed Server.
Create a local JMS queue, set the redelivery count, and target it to the new JMS server.
Add this local JMS queue as a member of the uniform distributed queue associated with the transport.

Note:
The JNDI name of the distributed queue is wlsb.internal.transport.task.queue.file (for File transport), wlsb.internal.transport.task.queue.ftp (for FTP transport) and wlsb.internal.transport.task.queue.email (for Email transport).

5.3.2.1 Load Balancing

Since we use distributed JMS queues, messages are distributed to the Managed Servers based on the load balancing algorithm associated with the distributed queue. By default, the JMS framework uses round-robin load balancing. You can change the algorithm using the JMS module in Oracle WebLogic Server Console. For more information, see "Load Balancing for JMS" in Oracle Fusion Middleware Using Clusters for Oracle WebLogic Server. If one of the Managed Servers fails, the remaining messages are processed by any of the remaining active Managed Servers.
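The following sketch illustrates round-robin distribution and what happens when one Managed Server is down. It is a simplified model, not the Oracle WebLogic Server load balancing algorithm: distribute, the server names ms1 to ms3, and the is_up callback are assumptions made for illustration.

```python
from itertools import cycle


def distribute(messages, servers, is_up):
    """Assign messages round-robin, skipping servers that are not running."""
    assignment = {name: [] for name in servers}
    ring = cycle(servers)
    for message in messages:
        # Try each server at most once per message.
        for _ in range(len(servers)):
            server = next(ring)
            if is_up(server):
                assignment[server].append(message)
                break
        else:
            raise RuntimeError("no active Managed Server available")
    return assignment


servers = ["ms1", "ms2", "ms3"]  # hypothetical Managed Server names
messages = ["task%d" % i for i in range(1, 7)]

all_up = distribute(messages, servers, lambda s: True)
ms2_down = distribute(messages, servers, lambda s: s != "ms2")
```

When all three servers are running, each receives two of the six tasks. When ms2 is down, its share is picked up by ms1 and ms3.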

Note:

The poller server should always be running. If the poller server fails, the message processing will also stop.