Once you have decided on your logical architecture, the next step is to decide what level of service availability is right for your site. The level of service availability you can expect depends on the hardware you choose, as well as on the software infrastructure and the maintenance practices you use. This chapter discusses several choices, their value, and their costs.
High availability solutions for Communications Suite vary from product to product.
Messaging Server and Calendar Server support different cluster topologies. Refer to the appropriate cluster product documentation for more information.
If you choose to use Application Server as a web container, you can take advantage of its high availability, load balancing, and cluster management capabilities.
Instant Messaging also provides a Sun Cluster agent, but it does not support Veritas Cluster Server (VCS). You can also create a “more available” deployment by deploying redundant Instant Messaging multiplexors. In such a deployment, if one multiplexor fails, Instant Messaging clients can communicate with the back-end server through another available multiplexor.
In addition, you can build in availability to your Communications Suite deployment by making infrastructure components, such as Directory Server, highly available.
The following sections in this chapter explain the options available for each component.
In addition to evaluating a purely high-availability (HA) solution, you should consider deploying hardware that is capable of automatic system recovery (ASR).
ASR is a process by which downtime related to hardware failure can be minimized. If a server is capable of ASR, individual component failures in the hardware might result in only minimal downtime. ASR enables the server to reboot itself and configure the failed components out of operation until they can be replaced. The downside is that a failed component taken out of service leaves a lower-performing system. For example, a CPU failure could result in a machine rebooting with fewer CPUs available. A system I/O board or chip failure could result in a system running with diminished or alternative I/O paths.
Different Sun SPARC systems support very different levels of ASR. Some systems support no ASR, while others support very high levels. As a general rule, the more ASR capabilities a server has, the more it costs. In the absence of high availability software, choose machines with a significant amount of hardware redundancy and ASR capability for your data stores, assuming that it is not cost prohibitive.
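When comparing servers for ASR support, it can help to inspect the hardware configuration after a reboot to see which components have been configured out. The following is a minimal illustration using standard Solaris commands; output and device names vary by platform:

    # Show platform diagnostic information, including failed or
    # deconfigured components, after an ASR reboot.
    /usr/platform/`uname -i`/sbin/prtdiag -v

    # List attachment points and their condition; a deconfigured CPU
    # or I/O board typically shows a failed or unconfigured status.
    cfgadm -al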
From the Communications Suite standpoint, the most important factor in planning your directory service is availability. As an infrastructure service, the directory must provide as near-continuous service as possible to the higher-level applications for authorization, access, email routing, and so forth.
A key feature of Directory Server that provides for high availability is replication. Replication is the mechanism that automatically copies directory data from one Directory Server to another. Replication enables you to provide a highly available directory service and to geographically distribute your data. In practical terms, replication provides fault tolerance (if one replica fails, requests can be served by another) and higher performance (client load can be spread across several servers, and data can be kept close to the users who read it).
The following table shows how you can design your directory for availability.

Table 6–1 Designing Directory Server for High Availability
Single-master replication. A server acting as a supplier copies a master replica directly to one or more consumer servers. In this configuration, all directory modifications are made to the master replica stored on the supplier, and the consumers contain read-only copies of the data.
Two-way, multi-master replication. In a multi-master environment between two suppliers that share responsibility for the same data, you create two replication agreements. Supplier A and Supplier B each hold a master replica of the same data, and two replication agreements govern the replication flow of this multi-master configuration.
Four-way, multi-master replication. Provides a pair of Directory Server masters in each of two data centers, replicating through four-way multi-master replication (MMR). Thanks to its four-way master failover configuration, this fully connected topology provides a highly available solution that guarantees data integrity. When hubs are used in the replication topology, load distribution is facilitated, and the four consumers in each data center allow this topology to scale for read (lookup) operations.
Sun Cluster. Using Sun Cluster software provides the highest level of availability for your directory implementation. In the case of failure of an active Directory Server node, Sun Cluster provides transparent failover of services to a backup node. However, the administrative (and hardware) costs of installing, configuring, and maintaining a cluster are typically higher than those of the Directory Server replication methods.
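Whichever replication topology you choose, you can confirm the role each instance currently plays by reading its replica configuration entry. The following is a hedged example, assuming the standard cn=config layout used by Directory Server; host, port, suffix, and the password placeholder are examples:

    # Read the replica configuration entry for a suffix; the
    # nsDS5ReplicaType and nsDS5Flags attribute values indicate
    # whether the instance is a master, hub, or consumer.
    ldapsearch -h ds1.example.com -p 389 -D "cn=Directory Manager" -w password \
      -b 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config' \
      -s base "(objectclass=*)" nsDS5ReplicaType nsDS5Flags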
See the Directory Server documentation for more information:
Application Server Enterprise Edition enhances the core application server platform with high availability, load balancing, and cluster management capabilities. Enterprise Edition extends the management capabilities of the Platform Edition to account for multi-instance and multi-machine deployments.
Application Server’s clustering support includes easy-to-configure groups of cloned application server instances to which client requests can be load balanced. This edition supports both external load balancers and load-balancing proxies based in the web tier. Application Server EE provides failover for HTTP sessions and stateful session beans by using the highly available database (HADB).
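As a rough sketch of how such a cluster is assembled (the cluster, node agent, and instance names are examples; verify the asadmin subcommands against your Application Server EE release):

    # Create a cluster, add two instances hosted by existing node
    # agents, and start the cluster.
    asadmin create-cluster cluster1
    asadmin create-instance --cluster cluster1 --nodeagent agent1 instance1
    asadmin create-instance --cluster cluster1 --nodeagent agent2 instance2
    asadmin start-cluster cluster1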
See the Application Server Enterprise Edition 8.2 documentation for more information:
You can configure Messaging Server and Calendar Server to be highly available by using clustering software. Messaging Server supports both Sun Cluster and Veritas Cluster Server software. Calendar Server supports Sun Cluster software.
In a tiered Communications Suite architecture, where front-end and back-end components are distributed onto separate machines, make the back-end components highly available through cluster technology, because the back ends are the “stores” that maintain persistent data. Cluster technology is not typically warranted on the Messaging Server or Calendar Server front ends, because they do not hold persistent data. Instead, make the Messaging Server MTA and MMP, Webmail Server, and Calendar Server front ends highly available through redundancy, that is, by deploying multiple front-end hosts. You can also add availability to the MTA by protecting its disk subsystems with RAID technology.
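For example, with Sun Cluster 3.x you can check resource group status and manually switch a back-end resource group between nodes; a brief sketch follows (the resource group and node names are examples):

    # Display the status of all cluster resource groups.
    scstat -g

    # Switch the Messaging Server back-end resource group to a
    # standby node, for example before planned maintenance.
    scswitch -z -g msg-rg -h backup-node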
See Chapter 2, Key Concepts for Hardware Service Providers, in Sun Cluster Concepts Guide for Solaris OS for more information on Sun Cluster topologies.
See Chapter 3, Configuring High Availability, in Sun Java System Messaging Server 6.3 Administration Guide for more information on configuring Messaging Server for high availability.
See Chapter 6, Configuring Calendar Server 6.3 Software for High Availability (Failover Service), in Sun Java System Calendar Server 6.3 Administration Guide for more information on configuring Calendar Server for high availability.
Instant Messaging provides a Sun Cluster agent, but it does not support Veritas Cluster Server. You can also create a “more available” environment by deploying redundant Instant Messaging multiplexors, and by taking advantage of the Instant Messaging watchdog process.
Configuring Instant Messaging for high availability (HA) by using the Sun Cluster agent provides for monitoring of and recovery from software and hardware failures. The high availability feature is implemented as a failover data service, not a scalable service, and is currently supported only on Solaris.
You can have multiple Instant Messaging nodes in an HA environment using the same SMTP server.
Before implementing an Instant Messaging HA environment using the Sun Cluster agent, decide which of the following HA deployments best meets your needs.
Mixed HA Environment. This deployment consists of a local configuration and binaries, and global runtime files. The advantage of this setup is that upgrading Instant Messaging requires minimal downtime, because you can upgrade on nodes where Instant Messaging is offline. The disadvantage is that you need to ensure that the same configuration and version of Instant Messaging exist on all nodes in the cluster. In addition, if you choose this option, you need to determine whether to use HAStoragePlus or the cluster file system for the global runtime files (see the sketch after this list).
Global HA Environment. This deployment consists of a global configuration, binaries, and runtime files. This setup is easier to administer, but you need to bring Instant Messaging down on all nodes in the cluster before upgrading.
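As a minimal sketch of the HAStoragePlus option for a mixed environment, assuming Sun Cluster 3.x command syntax (the resource group, resource, and mount point names are examples):

    # Register the HAStoragePlus resource type.
    scrgadm -a -t SUNW.HAStoragePlus

    # Create a resource group for Instant Messaging and add a
    # HAStoragePlus resource that manages the runtime file system.
    scrgadm -a -g im-rg
    scrgadm -a -j im-hasp-rs -g im-rg -t SUNW.HAStoragePlus \
      -x FilesystemMountPoints=/global/im

    # Bring the resource group online.
    scswitch -Z -g im-rg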
In an Instant Messaging deployment of multiple multiplexors, if one multiplexor fails, Instant Messaging clients are able to communicate to the back-end server through another available multiplexor. Currently, you can only configure multiple multiplexors to speak to a single instance of Instant Messaging server. You cannot configure multiple multiplexors to talk to multiple instances of Instant Messaging.
Instant Messaging includes a watchdog process, which monitors the Sun Cluster agent and restarts services that become unavailable for some reason (such as a server lockup or crash). If you configure the watchdog process and an Instant Messaging component stops functioning, the watchdog process shuts down and then restarts that component.
In addition to the high availability solutions discussed in the previous section, you can use enabling techniques and technologies to improve both availability and performance. These techniques and technologies include load balancers, Sun Java System Directory Proxy Server, and replica role promotion.
You can use load balancers to ensure the functional availability of each tier in your architecture, providing high availability of the entire end-to-end system. Load balancers can be either a dedicated hardware appliance or a strictly software solution.
Load balancing is the best way to keep a single application instance, server, or network from becoming a single point of failure, while at the same time improving the performance of the service. One of the primary goals of load balancing is to increase the horizontal capacity of a service. For example, with a directory service, load balancers increase the aggregate number of simultaneous LDAP connections and LDAP operations per second that the service can handle.
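A typical lightweight probe that a load balancer (or a monitoring script) can run against each directory instance is an anonymous base search of the root DSE, for example:

    # A fast, successful response indicates that the instance is
    # accepting LDAP connections (host and port are examples).
    ldapsearch -h ds1.example.com -p 389 -b "" -s base "(objectclass=*)" \
      vendorVersion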
Sun Java System Directory Proxy Server (formerly Sun ONE Directory Proxy Server) provides many proxy type features. One of these features is LDAP load balancing. Though Directory Proxy Server might not perform as well as dedicated load balancers, consider using it for failover, referral following, security, and mapping features.
See the Directory Proxy Server documentation for more information:
Directory Server includes a way of promoting and demoting the replica role of a directory instance. This feature enables you to promote a replica hub to a multi-master supplier or vice versa. You can also promote a consumer to the role of replica hub and vice versa. However, you cannot promote a consumer directly to a multi-master supplier or vice versa. In this case, the consumer must first become a replica hub and then it can be promoted from a hub to a multi-master replica. The same is true in the reverse direction.
Replica role promotion is useful in distributed deployments. Consider the case where you have six geographically dispersed sites. You would like a multi-master supplier at each site, but the replication topology supports at most four masters, so only four of the sites can host one. If you put at least one hub at each of the other two sites, you could promote those hubs to masters if one of the existing multi-master suppliers is taken offline or decommissioned.
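For illustration, and assuming the dsconf command provided with Directory Server Enterprise Edition 6 (earlier 5.x releases use a different procedure; verify the syntax against your release's documentation), promotion is performed one level per invocation:

    # The first invocation promotes the consumer to a hub; the second
    # promotes the hub to a master (host, port, and suffix are
    # examples).
    dsconf promote-repl -h ds5.example.com -p 389 dc=example,dc=com
    dsconf promote-repl -h ds5.example.com -p 389 dc=example,dc=com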
See the Directory Server documentation for more information:
For more information on high availability models, see the following product documentation:
Veritas Cluster Server: Veritas Cluster Server User’s Guide: http://seer.support.veritas.com/docs/275725.htm
Remote site failover is the ability to bring up a service at a site that is WAN connected to the primary site in the event of a catastrophic failure to the primary site. There are several forms of remote site failover and they come at different costs.
For all cases of remote site failover, you need additional servers and storage at the remote site, installed and configured to run all or part of the users’ load for the service. All or part, because some customers distinguish priority users from non-priority users. Such a situation exists for both ISPs and enterprises. ISPs might have premium subscribers who pay more for this feature. Enterprises might have divisions that provide email to all of their employees but deem this level of support too expensive for some portion of those users. For example, an enterprise might choose to have remote site failover for mail for users who are directly involved in customer support, but not provide it for people who work the manufacturing line. Thus, the remote hardware must be capable of handling the load of the users that are allowed to access the remote failover mail servers.
While restricting usage to only a portion of the user base reduces the amount of redundant server and storage hardware needed, it also complicates configuration and management of failback. Such a policy can also have other unexpected impacts on users in the long term. For instance, if a domain mail router is unavailable for 48 hours, the other MTA routers on the Internet will hold the mail destined for that domain, and the mail will be delivered when the server comes back online. However, if you do not configure all users at the failover remote site, the MTA will come up and return permanent failures (bounces) for the users who are not configured. Lastly, if you configure mail for all users to be accepted, then you have to fail back all users, or set up the MTA router to hold mail for the nonfunctional accounts while the failover is active and stream it back out once failback has occurred.
Potential remote site failover solutions include:
Simple, less expensive scenario. The remote site is not connected by large network bandwidth. Sufficient hardware is set up but not necessarily running; it might even be used for some other purpose in the meantime. Backups from the primary site are shipped regularly to the remote site, but not necessarily restored. The expectation is that there will be some significant data loss and possibly a significant delay in getting old data back online. In the event of a failure at the primary site, the network change is started manually. The file systems are then restored, services are brought up, and finally the imsrestore process restores message store data (see the recovery outline after this list).
More complicated, more expensive solution. Both Veritas and Sun sell software solutions that cause all writes to local (primary) volumes to also be written to the remote site. In normal production, the remote site is in lock step, or near lock step, with the primary site. Upon primary site failure, the secondary site can reset the network configurations and bring up services with little to no data loss. In this scenario, there is no reason to restore from tape. Any data that does not make the transition prior to the primary failure is lost, at least until failback or manual intervention occurs, in the case of the MTA queued data. Veritas Site HA software is often used to detect the primary failure and to reset the network and bring up services, but it is not required to achieve the higher level of data preservation. This solution requires a significant increase in the quantity of hardware at the primary site, because running the data copy has a substantial impact on server workload and latency.
Most available solution. This solution is essentially the same as the software-based, real-time data copy solution, except that the data copy does not happen on the Message Store server. If the Message Store servers are connected to storage arrays that support remote replication, the data copy to the remote site can be handled by the storage array controller itself. Storage arrays that offer a remote replication feature tend to be large, so the base cost of this solution is higher than that of lower-end storage products.
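Returning to the simple, less expensive scenario above, the following is a hedged outline of the manual recovery sequence (paths, devices, and install locations are examples; imsrestore options vary by Messaging Server release):

    # 1. Redirect the network (for example, DNS and MX records) to
    #    the remote site, a manual step performed out of band.

    # 2. Restore file systems from the most recent shipped backups.
    ufsrestore -rf /backup/store-level0.dump

    # 3. Bring up the messaging services.
    /opt/SUNWmsgsr/sbin/start-msg

    # 4. Restore message store data from the shipped backup.
    imsrestore -f /backup/msgstore.bak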
There are a variety of costs to these solutions, from hardware and software to administrative, power, cooling, and networking costs. These are all fairly straightforward to account for and calculate. Some costs, however, are difficult to account for: the cost of mistakes when carrying out a rarely practiced set of procedures, the inherent cost of downtime, the cost of data loss, and so forth. There are no fixed answers to these types of costs. For some customers, downtime and data loss are extremely expensive or totally unacceptable. For others, they are probably no more than an annoyance.
In doing remote site failover, you also need to ensure that the remote directory is at least as up to date as the messaging data you plan to recover. If you are using a restore method for the remote site, the directory restore must be completed before the message restore begins. Also, when users are removed from the system, it is imperative that they only be tagged as disabled in the directory. Do not remove users from the directory for at least as long as the messaging backup tapes that will be used might contain those users’ data.
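For example, you might flag a removed user as disabled by setting a status attribute rather than deleting the entry, as in this sketch using the Communications Suite inetUserStatus attribute (the DN, the password placeholder, and the chosen status value are examples; confirm the values your schema expects):

    # Mark the account disabled so that older message store backups
    # can still be restored for it later.
    ldapmodify -h ds1.example.com -p 389 -D "cn=Directory Manager" -w password <<EOF
    dn: uid=jsmith,ou=People,o=example.com,o=isp
    changetype: modify
    replace: inetUserStatus
    inetUserStatus: inactive
    EOF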
What level of responsiveness does your site need?
For some organizations, it is sufficient to use a scripted set of manual procedures in the event of a primary site failure. Others need the remote site to be active within a short period of time (minutes). For these organizations, the need for Veritas remote site failover software, or some equivalent, is overriding.
Also, do not allow the software to fail over automatically from the primary site to the backup site. The possibility of a false positive detection of primary site failure from the secondary site is too high. Instead, configure the software to monitor the primary site and alert you when it detects a failure. Then confirm that the failure has actually happened before beginning the automated process of failing over to the backup site.
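A minimal monitor-and-alert sketch, run periodically (for example, from cron) at the secondary site; the host name, recipient, and probe are examples:

    #!/bin/sh
    # Probe the primary site; notify an operator rather than
    # triggering an automatic failover.
    if ! ping primary-mail.example.com 5 >/dev/null 2>&1; then
      echo "Primary site probe failed; confirm before failing over." |
        mailx -s "ALERT: primary site not responding" oncall@example.com
    fi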
How much data must be preserved and how quickly must it be made available?
Although this seems like a simple question, the ramifications of the answer are large. Variations in scenarios, from the simple to the most complete, introduce quite a difference in terms of the costs for hardware, network data infrastructure, and maintenance.