Sun Identity Manager Overview

# Chapter 3 Clustering and High Availability

This chapter provides guidance on how to implement a high availability / fault tolerant (HA/FT) Identity Manager environment.

Note –

Please consult your web server, application server, and database provider's documentation for best practices on ensuring a highly available deployment with each technology. This guide is not a substitute for the vendor-specific recommendations for web servers.

# Assessing the Need for Availability

This section describes how to assess the amount of availability that your specific deployment requires.

## Assessing the Cost of Downtime

Because Identity Manager is not in the transaction path between general users and the systems and applications that they already have access to, Identity Manager downtime is not the nightmare that you might imagine. If Identity Manager is unavailable, end users are still able to access resources through their provisioned accounts.

The main cost of Identity Manager downtime is lost productivity. If Identity Manager is down, end users cannot use Identity Manager to gain access to systems that they are either locked out of or not provisioned to.

To calculate the cost of downtime, the first number that is needed is the average cost of lost productivity due to end users being unable to access computing resources within the enterprise. In our assessment, this number is called productivity per person hour.

The other major number that needs to be determined is the percentage of end users within the user population who need to use Identity Manager at any given time. This population usually includes new hires who need to be provisioned, and end users who have forgotten their password if password management is a part of the deployment.

Consider the following hypothetical situation:

| Quantity | Value |
| --- | --- |
| Total number of employees | 20,000 |
| Number of password resets in a day | 130 |
| Number of new hires in a day | 30 |
| Number of hours in a work day | 8 |

For this particular situation you can calculate the following:

• The number of employees needing Identity Manager at any given hour = (130 + 30) / 8 = 20

• The percentage of employees needing Identity Manager at any given hour = 20 / 20,000 = 0.1%, or 1 in 1,000
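The two calculations above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `hourly_demand` and its parameter names are illustrative and are not part of Identity Manager.

```python
def hourly_demand(resets_per_day, hires_per_day, hours_per_day, employees):
    """Return (users needing Identity Manager per hour, fraction of all employees)."""
    users_per_hour = (resets_per_day + hires_per_day) / hours_per_day
    return users_per_hour, users_per_hour / employees


# Figures from the hypothetical situation above.
users, fraction = hourly_demand(
    resets_per_day=130, hires_per_day=30, hours_per_day=8, employees=20_000
)
print(f"{users:.0f} users per hour, {fraction:.1%} of employees")
```

Substituting your own daily reset and new-hire counts gives a first estimate of how many people an outage would affect in any given hour.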

Using these numbers you can then estimate the cost of an Identity Manager outage:

| Item | Value |
| --- | --- |
| Productivity per person hour | \$100 |
| Loss in productivity | .5 (50% decrease in productivity due to inability to access system) |
| Number of people affected | 20 |
| Subtotal | \$1,000 |
| Duration of outage | 2 hours |
| Total immediate loss | \$2,000 |
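The outage estimate above is a simple product of four numbers. The following sketch reproduces it, assuming the subtotal is a per-hour figure; the function and parameter names are illustrative only.

```python
def outage_cost(productivity_per_person_hour, productivity_loss,
                people_affected, outage_hours):
    """Return (loss per hour of outage, total immediate loss)."""
    loss_per_hour = productivity_per_person_hour * productivity_loss * people_affected
    return loss_per_hour, loss_per_hour * outage_hours


# Figures from the example above: $100 per person hour, 50% loss, 20 people, 2 hours.
subtotal, total = outage_cost(100, 0.5, 20, 2)
print(f"Subtotal per hour: ${subtotal:,.0f}; total immediate loss: ${total:,.0f}")
```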

This example shows that even though the number of users being managed by Identity Manager is high, the number of users needing Identity Manager to gain access to systems at any given time is usually low.

Another point to consider is that the time it takes to bring a system like Identity Manager back online is usually less than the time it takes to execute the manual provisioning processes that Identity Manager is automating. So while Identity Manager downtime exacts a cost, it is usually less than the cost of using manual processes to give users access to resources.

## Understanding the Causes of Downtime

When planning a highly available Identity Manager deployment, it is worthwhile to consider the causes of downtime.

These causes include the following:

• Operator error

• Hardware failure

• Software failure

• Planned downtime (upgrades to hardware and software)

• Poor performance (Perceived downtime)

## Calculating Return on Investment

Identity Manager automates processes and reduces lost productivity. The return on investing in a highly-available Identity Manager architecture is realized by minimizing downtime and averting lost productivity.

You can use the cost of downtime to determine the amount of availability that is ultimately needed for Identity Manager. In general, a moderate investment in making Identity Manager highly-available is worthwhile.

When calculating the cost of your investment, remember that purchasing HA/FT hardware and software is only one part of implementing an available solution. Having a knowledgeable staff to keep it up and running is another cost.
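One way to relate the cost of downtime to an availability target is to convert the target into expected hours of downtime per year. The sketch below does this and prices the result using the \$1,000 per hour subtotal from the earlier example. It assumes that every hour of downtime costs the same, which overstates the loss for outages outside working hours, and the availability levels shown are illustrative, not recommendations.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def annual_downtime_cost(availability, cost_per_outage_hour):
    """Return (expected downtime hours per year, expected yearly cost).

    availability is a fraction, for example 0.999 for "three nines".
    """
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    return downtime_hours, downtime_hours * cost_per_outage_hour


for availability in (0.99, 0.999, 0.9999):
    hours, cost = annual_downtime_cost(availability, cost_per_outage_hour=1000)
    print(f"{availability:.2%} available: {hours:.2f} hours down per year, about ${cost:,.0f}")
```

Comparing the yearly cost at two availability levels gives a rough ceiling on what it is worth spending, including staff time, to move from one level to the other.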

# Understanding the Identity Manager High-Availability Feature Set

Identity Manager is designed to leverage HA infrastructure if it is available. For example, Identity Manager does not require an application server cluster to achieve high availability, but it can utilize a cluster if it exists.

The following diagram shows the major Identity Manager components deployed in a non-redundant architecture. The sections that follow will describe how the Identity Manager repository, application server, and gateway can be made highly-available.

##### Figure 3–1 Identity Manager Standard System Architecture

Note –

See Understanding the System Separation and Physical Proximity Guidelines for information on which components should be physically sited near one another in order to minimize performance issues that could arise due to latency and network congestion.

## Making the Repository Highly-Available

Identity Manager stores all of its provisioning and state information in the Identity Manager repository.

The availability of the database instance storing the Identity Manager repository is the most critical piece to achieving a highly available Identity Manager deployment. The repository is the representation of the entire Identity Manager installation and the data within it must be protected as with other important database applications. At minimum, regular backups must be performed.

Note –

Do not host the Identity Manager repository on a virtual platform such as a VMware virtual machine because performance (transactions per second) will be hindered significantly.

There can only be one image of the repository. It is not possible to have two separate databases for Identity Manager and attempt to synchronize them nightly. Sun recommends using your database's clustering or mirroring capabilities to provide fault tolerance.

## Making the Application Server Highly Available

Identity Manager can run within an application server cluster and take advantage of the added availability and load balancing that a cluster provides. Identity Manager does not use any J2EE features that require clustering, however.

Identity Manager uses the HTTP Session object that is available through the Servlet API. This session object tracks a user's visit as the user logs in and performs actions. In a cluster, you can optionally have multiple nodes handle a user's requests during a given session. This is usually not recommended, however, and most installations are configured to send all of a user's requests for a given session to the same server.

It is possible to add additional availability and capacity to the application server running Identity Manager even if you do not set up a cluster. This is achieved by installing multiple application servers with Identity Manager, connecting them to the same repository, and putting a load balancer with session affinity in front of all the application servers.
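The behavior expected from a load balancer with session affinity can be illustrated with a small simulation. This is a conceptual sketch, not the configuration of any particular load balancer product; the class name, the round-robin choice for new sessions, and the server names are assumptions made for the example.

```python
class AffinityLoadBalancer:
    """Routes each session to one server and fails over only when that server is down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self.assignments = {}  # session id -> server currently handling it
        self._next = 0         # round-robin cursor for new or displaced sessions

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self, session_id):
        server = self.assignments.get(session_id)
        if server in self.healthy:
            return server  # session affinity: keep using the same server
        server = self._pick_healthy()  # new session, or its server has failed
        self.assignments[session_id] = server
        return server

    def _pick_healthy(self):
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next % len(self.servers)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy application servers")


balancer = AffinityLoadBalancer(["idm-app1", "idm-app2"])  # hypothetical host names
first = balancer.route("session-A")
balancer.mark_down(first)
print(first, "->", balancer.route("session-A"))
```

Without session persistence, the server that takes over has no record of the user's HTTP session, which is why the user is asked to log in again in several of the failure scenarios described later in this chapter.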

Note –

Identity Manager runs certain tasks in the background—for example, scheduled reconciliation tasks. These tasks are stored in the database and can be picked up by any Identity Manager server to run. Identity Manager uses the database to ensure that these tasks are always run to completion, even if it has to fail over to another node.

### Configuring Active Sync Clustering on Application Server Nodes

The sources.hosts setting in the Waveset.properties file controls which hosts in a multi-instance environment are used for executing Active Sync requests. This setting provides a list of hosts that source adapters can run on. Setting this to localhost or null allows source adapters to execute on any host in the web farm. (This is the default behavior.) By listing one or more hosts, you can restrict execution to that list. If you have inbound updates from another system that go to a particular host, use the sources.hosts setting to record the host names.

In addition, you can define a property named sources.resourceName.hosts, which controls where the resource's Active Sync task will run. Replace resourceName with the name of the resource object you wish to specify.
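The following sketch models how these two properties might be interpreted. It is not Identity Manager code. It assumes that the host list is comma-separated, that an unset, empty, or localhost value means "any host", and that a resource-specific property takes precedence over sources.hosts. The host names and the resource name HRFeed are hypothetical.

```python
SAMPLE_PROPERTIES = """
# Hypothetical excerpt from Waveset.properties
sources.hosts=idm-app1,idm-app2
sources.HRFeed.hosts=idm-app2
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def allowed_hosts(props, resource_name):
    """Return the list of permitted hosts, or None when any host may be used."""
    # Assumption: the resource-specific property overrides the general one.
    value = props.get(f"sources.{resource_name}.hosts")
    if value is None:
        value = props.get("sources.hosts")
    if value is None or value in ("", "localhost"):
        return None
    return [host.strip() for host in value.split(",") if host.strip()]


def may_run(props, resource_name, host):
    hosts = allowed_hosts(props, resource_name)
    return hosts is None or host in hosts


props = parse_properties(SAMPLE_PROPERTIES)
print(may_run(props, "HRFeed", "idm-app1"), may_run(props, "HRFeed", "idm-app2"))
```

Check the comments in your own Waveset.properties file for the exact list syntax before relying on the format assumed here.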

## Making the Gateway Highly Available

Identity Manager requires a lightweight gateway to manage resources that cannot be directly accessed from the server. These include systems that require client-side API calls that are platform specific. For example, if Identity Manager is running on a UNIX-based application server, the ability to make NTLM or ADSI calls to managed NT or Active Directory domains is not possible. Because Identity Manager requires a gateway to manage these resources, it is important to ensure that the Identity Manager Gateway is made highly available.

To prevent the Gateway from becoming a single point of failure, Sun recommends having multiple machines running a Gateway instance. A network routing device should be configured to provide failover if the main Gateway instance dies. The failover device should be set up for sticky sessions and use a simple round-robin scheme. Do not place the Gateways behind a device that load balances! This is not a supported configuration and will cause certain Identity Manager functions to fail.

All Windows domains managed by a Gateway must be part of the same forest. Managing domains across forest boundaries is unsupported. If you have multiple forests, install at least one Gateway in each forest.

Win32 monitoring tools can be configured to watch the gateway.exe process on the Win32 host. In the event that gateway.exe fails, the process can be automatically restarted.
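As an illustration of what such monitoring does, the sketch below checks for the gateway.exe process with the Windows tasklist command and calls a restart action when the process is missing. It is a minimal example, not a replacement for a proper monitoring tool. The restart action is deliberately left as a function that you supply, because the correct command depends on how the Gateway was installed on your host.

```python
import subprocess
import time


def gateway_is_running(image_name="gateway.exe"):
    """Return True if Windows tasklist reports a process with this image name."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/NH"],
        capture_output=True, text=True, check=False,
    )
    return image_name.lower() in result.stdout.lower()


def watch(is_running, restart, checks, interval_seconds=0.0):
    """Poll is_running() a fixed number of times and call restart() when it is False.

    Returns the number of restarts that were triggered.
    """
    restarts = 0
    for _ in range(checks):
        if not is_running():
            restart()
            restarts += 1
        time.sleep(interval_seconds)
    return restarts


# Simulated run with stand-in functions, so the example works on any platform.
state = {"up": False}
triggered = watch(lambda: state["up"], lambda: state.update(up=True), checks=3)
print("restarts triggered:", triggered)
```

On the Gateway host itself, you would pass `gateway_is_running` as the check and your own restart function as the action.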

# Understanding the Recommended HA Architecture

The following diagram shows the Identity Manager architecture Sun recommends if there is no existing web application infrastructure.

##### Figure 3–2 Identity Manager High-Availability Architecture

In an actual deployment, existing redundant application server infrastructure should be utilized to the extent possible. The value of this architecture is that it only uses load balancers for achieving redundancy at the application server. Load balancers with session affinity detect failed application server instances and failover to active instances. Load balancers are also used to provide horizontal scaling in the web environment by spreading the user requests across a cluster of servers.

Though this is a straightforward architecture, the uptime characteristics are comparable to more complex deployments. Because of its simplicity, there are fewer pieces of software to maintain and monitor, and fewer pieces that could fail. Because human error is the number one cause of downtime, a relatively simple solution may achieve better uptime characteristics than something more complex. There are no universal right answers. The point is to understand all of the causes of downtime and choose the architecture that will result in the best availability for the investment.

Note –

It would be impossible to describe all of the different HA architectures that are possible with a web application like Identity Manager.

Because Identity Manager can be deployed in a variety of possible combinations, it may be most economical to identify existing infrastructure and utilize as much of it as possible when deploying Identity Manager.

# Understanding the Recommended Service Provider HA Architecture

If Identity Manager Service Provider functionality is to be utilized, Sun recommends adding a web tier between the user tier and the application tier. The web tier consists of one or more web servers that reside in a demilitarized zone (DMZ) that is separated by a firewall from the application tier.

An LDAP repository is required if Service Provider functionality is to be utilized. If Identity Manager will only be supporting extranet clients, a standard Identity Manager repository is recommended, but not required. Otherwise, if Identity Manager will be supporting both intranet and extranet users, both an LDAP repository and a standard Identity Manager repository are required.

# Understanding Failure Scenarios

This section lists eight failure scenarios and compares two deployments, one with session persistence, and one without.

• The deployment with session persistence has session affinity across a load balancer. The deployment has multiple instances in a cluster which have some form of session persistence turned on such that session changes are written to a physically distinct repository node.

• The deployment without session persistence has session affinity across a load balancer, and has multiple instances that are not part of a cluster.

## Scenario 1: The No-Workflow Scenario

Scenario Description

The end user or the administrator is editing a form that is not a part of a workflow. The instance on which the user has an established session goes down.

Without Session Persistence

User experience: A nontransparent failover. Upon submitting the form, the user is returned to the login page.

Recovery steps: The user reenters his or her user name and password. Identity Manager then processes the form and presents the results as the next page immediately following the login.

With Session Persistence

User experience: The user's form is submitted and the results are returned without the user being logged off and required to log in again.

Recovery steps: No user action is needed.

Other Scenario Examples

• An end user has logged in and retrieved search results for users or other repository objects when the instance goes down.

• An administrator is about to submit a “reset password” or “edit user” request using the Administrator interface when the instance goes down.

## Scenario 2: The Workflow-in-Progress Scenario

Scenario Description

The end user or the administrator has submitted a form that triggered a workflow. The workflow generally executes on the same instance that holds the user's session, except for some scheduled tasks, where the two instances can differ. This instance goes down while the workflow is in progress.

Without Session Persistence

User experience: A nontransparent failover. The form submit returns the user to the login page. The workflow task instance being executed should be in the repository, but because the node of execution is down, the workflow status will be “terminated.”

Recovery steps: The workflow has to be submitted again. This has to be done by going back to the same form and reentering the same information that was used to trigger the workflow before the node failed.

The submission of the same request data may work in some cases, but not all. If the workflow provisions to more than one resource during its execution and some of these resources were provisioned before the failure, the workflow resubmission from the user would have to account for the “already provisioned to” resources. Note that the terminated workflow sticks around in the repository until the resultLimit expires on the TaskInstance object.

With Session Persistence

User experience: A nontransparent failover. The user does not get logged out because his session is persisted and reestablished in the new instance. The form submit, however, will probably result in an error because the workflow will be terminated. This failover is nontransparent because recovery actions are needed.

Recovery steps: Same as in the Without Session Persistence mode. The user has to resubmit the request that triggered the previous workflow, with the same or modified parameters.

Other Scenario Examples

• An end user has just submitted a self-registration request to create an Identity Manager account and the instance goes down.

• An administrator has just submitted a “password reset” request that is in progress and the instance goes down.

## Scenario 3: The Workflow-in-Abeyance-or-Sleep Scenario

Scenario Description

This scenario covers situations where the workflow has started, but is waiting for a manual action by an approver.

Without Session Persistence

User experience: Transparent failover with respect to the approver provided that the approver has not yet logged in. After the node failure, when the approver does log in, the approver will still see the approval request in his or her inbox, even though the request was triggered from a node that is no longer up.

Recovery steps: No user action is needed.

With Session Persistence

User experience: Same as in the Without Session Persistence mode.

Recovery steps: Same as in the Without Session Persistence mode.

Other Scenario Examples

• The workflow is in a sleep state, for example a manual action that sleeps until a sunrise or sunset date for an employee.

• An administrator submitted a user creation request that is waiting on an approver to log in and approve the request. The node from which the request was sent failed before the approver approved the request.

## Scenario 4: The Work-Item-Edit Scenario

Scenario Description

This scenario includes cases where a user is editing a work item and the node upon which the user has a session goes down before the work item can be submitted.

Without Session Persistence

User experience: A nontransparent failover. When the work item edit form is submitted, the user is logged off and returned to the login page.

Recovery steps: Upon resubmitting login credentials, the user's work item is marked completed and the workflow can resume from that point. The workflow should be picked up by the new node for execution from the point where the user's manual action is marked completed.

With Session Persistence

User experience: When the work item edit form is submitted, the user sees the effect of his submission—for example, either the next form in the custom workflow if there is one, or a success message.

Recovery steps: No user action needed.

Other Scenario Examples

• An end user is filling out a form associated with a manual action in a custom workflow, for example requesting access to specific resources. Before the user can submit the request, the node the user has a session on dies.

• An administrator has logged in to Identity Manager and has opened up an approval request for editing. Before the request can be submitted, the node the administrator has a session on fails.

## Scenario 5: The Scheduled-Tasks-in-Progress Scenario

Scenario Description

These scenarios cover cases where a node failure occurs while a reconciliation is in progress or while a report is executing.

Without Session Persistence

User experience: The scheduled task terminates in process.

Recovery steps: The scheduled task that was in progress has to be restarted. The task will start from the beginning. (The task will not restart from the point of failure.) This is akin to creating and starting a new task.

With Session Persistence

User experience: Same as in the Without Session Persistence mode.

Recovery steps: Same as in the Without Session Persistence mode.

Other Scenario Examples

• An Active Sync adapter is configured to run on the failed node.

## Scenario 6: The Scheduled-Task-in-Abeyance Scenario

Scenario Description

These scenarios cover cases where a user's custom workflow has scheduled a task for execution at a later date on a specific node. Before the scheduled date is reached, the node that the task was scheduled on fails.

Without Session Persistence

User experience: The failover is transparent with respect to the recovery actions required to ensure that this task executes at its scheduled time.

Recovery steps: The scheduled task is taken up by any live node when the scheduled execution time arrives.

With Session Persistence

User experience: Same as in the Without Session Persistence mode.

Recovery steps: Same as in the Without Session Persistence mode.

Other Scenario Examples

• In the process of a user's account creation, the Deferred Task Scanner is used to implement enabling an account on a sunrise date or to implement disabling the account on a sunset date. Before the sunrise or sunset date arrives, the node that the task was scheduled on fails.

• A report is scheduled to run at a future time, or a reconciliation is scheduled to run at a specific time and, before the time is reached, the node the task was scheduled on fails.

## Scenario 7: The Web-Services-Workflow-Request-Not-Yet-Received-by-Identity Manager Scenario

Scenario Description

These scenarios cover those cases where the Identity Manager GUI is not used to launch provisioning. Instead, the user interface is provided by an application that internally calls to Identity Manager using the SPML or other custom web service interface. Here the user session related to the user going through the UI is managed by way of the calling application. For Identity Manager, the requests are all launched as the “soapadmin” subject.

In such a use case, this failure scenario covers the case where the request by way of the Identity Manager endpoint has not been received yet and the targeted node fails.

Without Session Persistence

User experience: A transparent failover. The SOAP administrator's credentials are passed in for each SOAP request, either over the wire or within Identity Manager through a Waveset.properties setting. As long as the node which was to receive this SOAP request has not received the request before going down, the failover is transparent with or without session persistence.

Recovery steps: No action needed. The SOAP request is sent to a live node that executes it.

With Session Persistence

User experience: Same as in the Without Session Persistence mode.

Recovery steps: Same as in the Without Session Persistence mode.

## Scenario 8: The Web-Services-Workflow-Request-In-Progress-by-Identity Manager Scenario

Scenario Description

This scenario is similar to scenario seven. The only difference is that the workflow is in progress when the node fails, or the node has received the SOAP request when the node fails.

Without Session Persistence

User experience: This scenario is similar to scenario two (workflow in progress). The workflow is marked terminated and the user sees an error as a result of the SOAP request.

Recovery steps: The user has to resubmit the form with similar or modified parameters (based on where the failure occurs in the workflow) using the user interface in the third-party application.

With Session Persistence

User experience: Same as in the Without Session Persistence mode.

Recovery steps: Same as in the Without Session Persistence mode.

# Frequently Asked Questions Regarding Session Affinity and Session Persistence

Should you have session affinity enabled when scaling application servers horizontally?

Yes.

Should you have session persistence when scaling application servers horizontally?

Unless your business requirements place a very high emphasis on having transparent failover in the limited situations where session persistence can make a difference, Sun recommends against using session persistence. Session persistence has its own performance overhead, so leave it turned off unless transparent failover is absolutely mandated by your business requirements.

If you study the failure scenarios documented in Understanding Failure Scenarios, in six of the eight scenarios there is no difference in the end-user experience or required recovery actions, regardless of whether session persistence is enabled. Only in scenarios one and four is there any difference between the deployment with session persistence and the deployment without it.

In these two scenarios, session persistence can provide some failover transparency, but session persistence hurts performance. Based on the size of the session objects, the repository being used for session persistence, and the optimization of your specific application server's session management code, the performance overhead can range from 10 percent to 20 percent or even higher.
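The comparison from Understanding Failure Scenarios can be condensed into a small table and checked programmatically. In this sketch, each scenario is reduced to a single yes or no answer to the question "is the failover transparent to the user?", which is a simplification. In scenario two, for example, session persistence keeps the user logged in, but the user must still resubmit the request, so the failover is recorded as nontransparent in both deployments.

```python
# (scenario number, short name, transparent without persistence, transparent with persistence)
SCENARIOS = [
    (1, "No workflow", False, True),
    (2, "Workflow in progress", False, False),
    (3, "Workflow in abeyance or sleep", True, True),
    (4, "Work item edit", False, True),
    (5, "Scheduled tasks in progress", False, False),
    (6, "Scheduled task in abeyance", True, True),
    (7, "Web services request not yet received", True, True),
    (8, "Web services request in progress", False, False),
]


def scenarios_helped_by_persistence(scenarios):
    """Return the numbers of the scenarios where session persistence changes the outcome."""
    return [number for number, _name, without, with_ in scenarios if without != with_]


print(scenarios_helped_by_persistence(SCENARIOS))
```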

Should you have multiple application server instances in a cluster when scaling horizontally?

Placing multiple application server instances in a cluster is not absolutely needed unless you want session persistence. Failover without session persistence can be achieved even if the application server nodes are not in a cluster.