3 Determining Your High Availability Requirements

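This chapter includes the following topics:

  • About Determining High Availability Requirements

  • Analysis Framework for Determining High Availability Requirements

  • High Availability Architecture Requirements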

3.1 About Determining High Availability Requirements

Any enterprise that is designing and implementing a high availability strategy must begin by performing a thorough analysis of the business drivers that require high availability. Implementing high availability may involve critical tasks such as:

  • Retiring legacy systems

  • Investing in more capable and robust systems and facilities

  • Redesigning the overall IT architecture and operations to adapt to this high availability model

  • Redesigning business processes

  • Hiring and training of personnel

An analysis of business requirements for high availability combined with an understanding of the level of investment required to implement different high availability solutions enables the development of a high availability architecture that will achieve both business and technical objectives. This chapter provides a simple framework that can be used effectively to evaluate the high availability requirements of a business.

3.2 Analysis Framework for Determining High Availability Requirements

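The elements of this analysis framework are:

  • Business Impact Analysis

  • Cost of Downtime

  • Recovery Time Objective (RTO)

  • Recovery Point Objective (RPO)

  • Manageability Goal

  • Total Cost of Ownership (TCO) and Return On Investment (ROI)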

3.2.1 Business Impact Analysis

A rigorous business impact analysis identifies the critical business processes in an organization, calculates the quantifiable loss risk for unplanned and planned IT outages affecting each of these business processes, and outlines the effects of these outages. It takes into consideration essential business functions, people and system resources, government regulations, and internal and external business dependencies. This analysis uses objective and subjective data gathered from interviews with knowledgeable and experienced personnel and from reviews of business practice histories, financial reports, IT systems logs, and so on.

The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. For example, consider a semiconductor manufacturer with chip fabrication plants located worldwide. Semiconductor manufacturing is an intensely competitive business requiring huge financial investment that is amortized over high production volumes. The human resource applications used by plant administration are unlikely to be considered as mission critical as the applications that control the manufacturing process in the plant. Failure of the applications supporting the fabrication process will affect production levels and have a direct impact on financial results of the company.

In a similar fashion, an internal knowledge management system is likely to be considered mission critical for a management consulting firm, because a client-focused business depends on its consultants and knowledge workers having ready access to internal research. The cost of downtime of such a system is extremely high for this business. This leads us to the next element in the high availability requirements framework: cost of downtime.

3.2.2 Cost of Downtime

A complete business impact analysis provides the insight needed to quantify the costs of unplanned and planned downtime. Understanding this cost is essential because this helps prioritize your high availability investment and has a direct influence on the high availability technologies chosen to minimize the downtime risk.

Various reports have been published, documenting the costs of downtime across industry verticals. Examples include costs that range from millions of dollars for each hour of brokerage operations and credit card sales, to tens of thousands of dollars for each hour of package shipping services.

These numbers are staggering and the reasons are obvious. The Internet can connect the business directly to millions of customers. Application downtime can disrupt this connection, cutting off a business from its customers. In addition to lost revenue, downtime can have an equally negative effect on other critical and interdependent business issues such as customer relationships, competitive advantages, legal obligations, industry reputation, and shareholder confidence.
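As a rough illustration of how an hourly downtime cost translates into an annual figure, the following Python sketch multiplies the expected hours of downtime implied by an availability target by a cost for each hour. The function names and the dollar amount in the usage note are hypothetical and are not taken from any published report; the calculation also assumes that downtime cost is constant for each hour, which a real business impact analysis may refine.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def expected_annual_downtime_hours(availability: float) -> float:
    """Hours of downtime in a year implied by an availability fraction (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0.0 and 1.0")
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Annual downtime cost, assuming a constant (hypothetical) cost for each hour."""
    if cost_per_hour < 0:
        raise ValueError("cost_per_hour must not be negative")
    return expected_annual_downtime_hours(availability) * cost_per_hour
```

For example, an availability of 0.999 implies about 8.76 hours of downtime each year; at an assumed cost of 100,000 dollars for each hour, that is about 876,000 dollars annually. Comparing this figure across candidate availability levels can help prioritize the investment.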

3.2.3 Recovery Time Objective (RTO)

The business impact analysis will determine your recovery time objective (RTO). RTO is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering unacceptable consequences (financial losses, impact to customer satisfaction, reputation, and so on). RTO indicates the downtime tolerance of a business process or an organization in general.

The RTO requirements are driven by the mission-critical nature of the business. Thus, for a system running a stock exchange, the RTO is zero or very near to zero.

An organization is likely to have varying RTO requirements across its various business processes. Thus, for a high volume e-commerce Web site, for which there is an expectation of rapid response times and for which customer switching costs are very low, the Web-based customer interaction system that drives e-commerce sales is likely to have an RTO close to zero. However, the RTO of the systems that support back-end operations, such as shipping and billing, can be higher. If these back-end systems are down, then the business may resort to manual operations temporarily without a significantly visible impact.

3.2.4 Recovery Point Objective (RPO)

The business impact analysis also determines your recovery point objective (RPO). RPO is the maximum amount of data an IT-based business process may lose before causing detrimental harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, 5 hours' or 2 days' worth of data loss.

A stock exchange where millions of dollars' worth of transactions occur every minute cannot afford to lose any data. Thus, its RPO must be zero. Referring to the e-commerce example, the Web-based sales system does not strictly require an RPO of zero, although a low RPO is essential for customer satisfaction. However, its back-end merchandising and inventory update system may have a higher RPO; lost data in this case can be reentered.

3.2.5 Manageability Goal

A manageability goal is more subjective than either the RPO or the RTO. It results from an objective evaluation of the skill sets and management resources available in an organization, and the degree to which the organization can successfully manage all elements of a high availability architecture. Just as RPO and RTO measure an organization's tolerance toward downtime or data loss, your manageability goal measures the organization's tolerance to complexity in the IT environment. To the extent that less complexity is a requirement, simpler methods of achieving high availability are preferred over methods that may be more complex to manage, even if the latter could attain more aggressive RTO and RPO objectives. Having a good understanding of manageability goals helps organizations differentiate between what is possible and what is practical to implement.

3.2.6 Total Cost of Ownership (TCO) and Return On Investment (ROI)

Understanding TCO and ROI is essential to selecting a high availability architecture that also achieves the business goals of your organization. TCO includes all costs such as acquisition, implementation, systems, networks, facilities, staff, training, and support, over the useful life of the solution chosen. Likewise, the ROI calculation captures all of the financial benefits that accrue to a given high availability architecture.

For example, consider a high availability architecture in which IT systems and storage at a remote standby site remain idle and serve no other business use. The only return on investment for the standby site is the cost of downtime avoided by its use in a failover scenario. Contrast this with a different high availability architecture that enables IT systems and storage at the standby site to be used productively while in the standby role (for example, for reports or for offloading the overhead of end-user queries from the primary system), which makes the standby system a production system in its own right. The return on investment of such an architecture includes both the cost of downtime avoided and the financial benefits that accrue to its use as a production system while in the standby role.
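The comparison between an idle standby site and a productively used one can be sketched numerically. The following Python example uses the common definition of ROI as net benefit divided by cost; the function name and all dollar figures in the usage note are hypothetical and serve only to show the shape of the calculation.

```python
def roi(total_benefit: float, tco: float) -> float:
    """Return on investment as a fraction: (benefit - cost) / cost."""
    if tco <= 0:
        raise ValueError("tco must be positive")
    return (total_benefit - tco) / tco


def standby_roi(downtime_cost_avoided: float,
                production_use_benefit: float,
                tco: float) -> float:
    """ROI of a standby site.

    production_use_benefit is 0 for an idle standby, and positive when the
    standby also serves reports or offloads queries (hypothetical inputs).
    """
    return roi(downtime_cost_avoided + production_use_benefit, tco)
```

With an assumed TCO of 1,000,000 and an assumed 1,500,000 in downtime cost avoided, `standby_roi(1_500_000, 0, 1_000_000)` gives 0.5 for the idle standby. If productive use of the same standby is assumed to be worth a further 400,000, `standby_roi(1_500_000, 400_000, 1_000_000)` gives 0.9. In practice the productively used architecture may also carry a different TCO, which should be reflected in the third argument.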

3.3 High Availability Architecture Requirements

Using the high availability analysis framework, a business can:

  1. Complete a business impact analysis

  2. Identify and categorize the critical business processes that have the high availability requirements

  3. Formulate the cost of downtime

  4. Establish utilization, RTO, and RPO goals for these various business processes

  5. Understand its goals for manageability, TCO, and ROI

This framework enables the business to define service level agreements (SLAs) in terms of high availability for critical aspects of its business. For example, it can categorize its business processes into several high availability tiers:

  • Tier 1 processes have maximum business impact. They have the most stringent high availability requirements, with RTO and RPO close to zero, and the systems supporting them need to be available on a continuous basis. For a business with a high-volume e-commerce presence, this may be the Web-based customer interaction system.

  • Tier 2 processes have slightly relaxed high availability and RTO/RPO requirements. The second tier of an e-commerce business may be its supply chain and merchandising systems. For example, these systems do not need to maintain extremely high degrees of availability and may have nonzero RTO/RPO values. Thus, the high availability systems and technologies chosen to support tier 2 processes are likely to be different from those chosen for tier 1 processes.

  • Tier 3 processes may be related to internal development and quality assurance processes. Systems supporting these processes need not have the rigorous high availability requirements of the other tiers.
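One way to make the tiering above concrete is to derive a tier from the RTO and RPO recorded for each business process. The following Python sketch is illustrative only: the class, the function, and especially the two threshold values are hypothetical, because the framework does not prescribe numeric boundaries between tiers. The sketch assigns the tier from whichever objective is stricter.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, measured as time


# Hypothetical boundaries; each organization would set its own.
TIER1_LIMIT = timedelta(minutes=5)
TIER2_LIMIT = timedelta(hours=24)


def assign_tier(process: BusinessProcess) -> int:
    """Return 1, 2, or 3 based on the stricter of the process's RTO and RPO."""
    strictest = min(process.rto, process.rpo)
    if strictest <= TIER1_LIMIT:
        return 1
    if strictest <= TIER2_LIMIT:
        return 2
    return 3
```

Under these assumed thresholds, a customer interaction system with an RTO of 30 seconds falls into tier 1, a supply chain system with an RTO of 4 hours and an RPO of 1 hour falls into tier 2, and a quality assurance system that tolerates several days of downtime falls into tier 3.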

The next step for the business is to evaluate the capabilities of the various high availability systems and technologies, and choose the ones that meet its SLA requirements, within the guidelines as dictated by business performance issues, budgetary constraints, and anticipated business growth.

Figure 3-1 illustrates this process.

Figure 3-1 Planning and Implementing a Highly Available Enterprise


The following sections provide further details about this methodology:

3.3.1 High Availability Systems Capabilities

A broad range of high availability and business continuity solutions exists today. As the sophistication and scope of these systems increase, they make more of the IT infrastructure, such as the data storage, server, network, applications, and facilities, highly available. They also reduce RTO and RPO from days to hours, or even to minutes and seconds. Increased availability often comes with an increased cost, and on some occasions, with an increased impact on systems performance. With Oracle Grid infrastructure, higher availability can equate to lower cost, greater scalability, and more complete utilization of system resources. The high availability approach to satisfying business requirements may differ for a legacy system.

Organizations need to carefully analyze the capabilities of these high availability systems and map their capabilities to the business requirements to ensure they have an optimal combination of high availability solutions to keep their business running. Consider the business with a significant e-commerce presence as an example.

For this business, the IT infrastructure supporting the system that customers encounter, the core e-commerce engine, must be highly available and disaster proof. The business may consider clustering for the Web servers, application servers and the database servers serving this e-commerce engine. With built-in redundancy, clustered solutions eliminate single points of failure. Also, modern clustering solutions are application transparent, provide scalability to accommodate future business growth, and provide load-balancing to handle heavy traffic. Thus, such clustering solutions are ideally suited for mission-critical high-transaction applications.

If unplanned or planned outages occur, the data that supports the high-volume e-commerce transactions must be protected adequately and be available with minimal downtime. This data should not only be backed up at regular intervals at the local data centers, but should also be replicated to databases at a remote data center connected over a high-speed, redundant network. This remote data center should be equipped with secondary servers and databases that are readily available and synchronized with the primary servers and databases. This gives the business the capability to switch to these servers at a moment's notice with minimal downtime if there is an outage, instead of waiting for hours or days to rebuild servers and recover data from backup tapes. Factors to consider when planning a remote data center include the network bandwidth and latency (distance) between sites, and usage considerations (such as whether the sites are fully or partially staffed). These factors should be used to determine whether remote data centers are feasible and where they should be located in relation to the primary data center.

Maintaining synchronized remote data centers is an example where redundancy is built along the entire system's infrastructure. This may be expensive; however, the mission-critical nature of the systems and the data they protect may warrant this expense. Considering another aspect of the business, the high availability requirements are less stringent for systems that gather clickstream data and perform data mining. The cost of downtime is low, and the RTO and RPO requirements for this system could be a few days, because even if this system is down and some data is lost, that does not have a detrimental effect on the business. While the business may need powerful computers to perform data mining, it does not need to mirror this data on a real-time basis. To obtain data protection, the business can perform regularly scheduled backups and archive the tapes for offsite storage.

For this e-commerce business, the back-end merchandising and inventory systems are expected to have more stringent high availability requirements than the data mining systems, and thus they may employ technologies such as local mirroring or local snapshots, in addition to scheduled backups and offsite archiving.

The business should employ a management infrastructure that performs overall systems management, administration and monitoring, and provides an executive dashboard. This management infrastructure should be highly available and fault tolerant.

Finally, the overall IT infrastructure for this e-commerce business should be extremely secure, to protect against malicious external and internal electronic attacks.

3.3.2 Business Performance, Budget, and Growth Plans

High availability solutions must also be based on business performance issues. For example, a business may use a zero-data-loss solution that synchronously mirrors every transaction on the primary database to a remote database. However, because of speed-of-light and other physical limitations of a network, transmission incurs round-trip delays. These delays increase with distance and vary based on network bandwidth, traffic congestion, router latencies, and so on. Thus, this synchronous mirroring, if performed over large WAN distances, may impact the primary site performance. Online buyers may notice these system latencies and be frustrated with long system response times; consequently, they may go somewhere else for their purchases. This is an example where the business must make a trade-off between having a zero-data-loss solution and maximizing system performance. Conversely, if the business drivers justify the investment to avoid making this trade-off, a multisite architecture can be implemented that places a synchronous zero-data-loss standby site in close proximity to the primary site and a second asynchronous standby site located up to thousands of miles away.
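To give a sense of scale for the distance effect described above, the following Python sketch estimates a lower bound on the network round-trip time between two sites. It assumes that signals travel through optical fiber at roughly 200,000 kilometers per second (about two-thirds of the speed of light in a vacuum) and that the fiber follows a straight path; the function name is hypothetical. Real round trips are longer because of routing, congestion, and equipment latency.

```python
# Approximate signal speed in optical fiber (an assumption, roughly 2/3 of c).
SPEED_IN_FIBER_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time in milliseconds over a fiber path of the given length."""
    if distance_km < 0:
        raise ValueError("distance_km must not be negative")
    one_way_seconds = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2.0 * one_way_seconds * 1000.0
```

Under these assumptions, a standby site 100 kilometers away adds at least 1 millisecond to each synchronously mirrored commit, whereas a site 3,000 kilometers away adds at least 30 milliseconds. This difference is one reason a nearby synchronous standby is often paired with a distant asynchronous one.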

High availability solutions must also be based on financial considerations and future growth estimates. It is tempting to build redundancies throughout the IT infrastructure and claim that the infrastructure is completely failure proof. Although higher availability does not always equate to higher cost, going to extremes with such solutions may lead to budget overruns or an unmanageable and unscalable combination of solutions that is extremely complex and expensive to integrate and maintain.

A high availability solution that has very impressive performance benchmark results may look good in theory. However, if an investment is made in such a solution without a careful analysis of how the technology capabilities match the business drivers, then a business may end up with a solution that:

  • Does not integrate well with the rest of the system infrastructure

  • Has annual integration and maintenance costs that easily exceed the up-front implementation costs

  • Forces a vendor lock-in

Cost-conscious and business-savvy decision makers must invest only in solutions that are well-integrated, standards-based, easy to implement, maintain and manage, and have a scalable architecture for accommodating future business growth.