Skip Headers

Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01
Go to Documentation Home
Go to Book List
Book List
Go to Table of Contents
Go to Index
Go to Master Index
Master Index
Go to Feedback page

Go to previous page
Go to next page
View PDF

Operational Policies for High Availability

This chapter describes the policies and procedures that essential for maintaining high availability.

Introduction to Operational Policies for High Availability

Operational policies together with service management are fundamental to avoiding and minimizing outages, as well as reducing the time to recover from an outage. Operational policies are the foundation for managing the information technology infrastructure. They focus on process, policy, and management.

Operational policies for high availability focus on setting and establishing processes, policies, and management. They are divided into the following categories:

Service Level Management for High Availability

Information Technology (IT) departments are required to deliver increasing levels of service and availability while reducing costs. Service level management is an accepted method to ensure that IT services are meeting the business requirements. Service level management requires a dialogue between IT managers and the company's lines of business. It starts with mapping business requirements to IT investments.

Service level management encompasses complete end-to-end management of the service infrastructure. The foundation of service level management is the service level agreement (SLA). The SLA is critical for building accountability into the provider-client relationship and for evaluating the provider's performance. SLAs are becoming more accepted and necessary as a monitoring and control instrument for the relationship between a customer and the IT supplier (external or internal). SLAs are developed for critical business processes and application systems, such as order processing. The business individuals who specify the functionality of systems should develop the SLA for those systems. The SLA represents a detailed, complete description of the services that the supplier is obligated to deliver, and the responsibilities of the users of that service. Developing an SLA challenges the client to rank the requirements and focus resources toward the most important requirements. An SLA should evolve with the business requirements.

There is no standardized SLA that meets the needs of all companies, but a typical SLA should contain the following sections:

Developing an SLA and service level measurements requires commitment and hard work by all parties. Service levels should be measured by business requirements, such as cost for each order processed. Any shared services or components must perform at the level of the most stringent SLA. Furthermore, SLAs should be developed between interdependent IT groups and with external suppliers. Many technologists advocate an integrated, comprehensive SLA rather than individual SLAs for infrastructure components.

The benefits of developing a SLA are:

Planning Capacity to Promote High Availability

Planning capacity and monitoring thresholds is essential to prevent downtime or unacceptably delayed transactions. Understanding average and maximum usage and the requirements to maintain that load over time helps ensure acceptable performance.

Capacity planning includes the ability to estimate the time remaining before a tablespace becomes completely full and planning ahead to add disk space. Capacity planning can also delay or prevent scheduled outages to increase the maximum number of sessions in the database.

Change Management for High Availability

Change management is a set of procedures or rules that ensure that changes to the hardware, software, application, and data on a system are authorized, scheduled, and tested. A stable system in which unexpected, untested, and unauthorized changes are not permitted is one that guarantees integrity to its users. The users can rely on the hardware, software, and data to perform as anticipated. Knowing exactly when and what changes have been applied to the system is vital to debugging a problem. Each customer handles change management of systems, databases, and application code differently, but there are general guidelines that can help prevent unnecessary system outages, thus protecting the system's integrity. With proper change management, application and hardware systems have greater stability, and new problems are easier to debug.

Figure 5-1 describes a typical change control flow. For emergency cases such as disasters, the change control process may need to be shortened.

Figure 5-1 Change Control

Text description of maxav030.gif follows

Text description of the illustration maxav030.gif

The following recommendations are the foundation of good change management:

Backup and Recovery Planning for High Availability

Proper backup and recovery plans are essential and must be constructed to meet your specific service levels. Both disk and tape database backups are recommended. Disk and tape backups are essential for disaster recovery and for cases when you need to restore from a very old backup.

A robust backup and recovery scheme requires an understanding of how it is strengthened or compromised by the physical location of files, the order of events during the backup process, and the handling of errors. A robust scheme is resilient in the face of media failures, programmatic failures, and, if possible, operator failures. Complete, successful, and tested processes are fundamental to the successful recovery of any failed environment.

Take the following steps to construct useful backup and recovery plans:

Disaster Recovery Planning

Disaster recovery (DR) planning is a process designed and developed specifically to deal with catastrophic, large-scale interruptions in service to allow timely resumption of operations. These interruptions can be caused by disasters like fire, flood, earthquakes, or malicious attacks. The basic assumption is that the building where the data center and computers reside may not be accessible, and that the operations need to resume elsewhere. It assumes the worst and tries to deal with the worst. As an organization increasingly relies on its electronic systems and data, access to these systems and data become a fundamental component of success. This underscores the importance of disaster recovery planning. Proper disaster planning reduces MTTR during a catastrophe and provides continual availability of critical applications, helping to preserve customers and revenue.

Take the following steps to plan recovery from disasters:

Planning Scheduled Outages

Scheduled outages can affect the application server tier, the database tier, or the entire site. These outages may include one or more of the following: node hardware maintenance, node software maintenance, Oracle software maintenance, redundant component maintenance, entire site maintenance. Proper scheduled outage planning reduces MTTR and reduces risk when changes do not go as planned.

Take the following steps to plan scheduled outages:

Staff Training for High Availability

Highly trained people can make better and more informed decisions and are less likely to make mistakes. A comprehensive plan for the continued technical education of the systems administration, database administration, development, and users groups can help ensure higher availability of your systems and databases. Additionally, just as redundancy of system components eliminates a single point of failure, knowledge management and cross training should eradicate the effects to operations of losing an employee.

Documentation as a Means of Maintaining High Availability

Clear and complete documentation should be part of every set of HA operational policies. Without documenting the steps for implementing or executing a process, you run the risk of losing that knowledge, increasing the risk for human error during the execution of that process, and omitting a step within a process. All of these risks affect availability.

Clearly defined operational procedures contribute to shorter learning curves when new employees join your organization. Properly documented operational procedures can greatly reduce the number of questions for support personnel, especially when the people who put the procedures in place are no longer with the group. Proper documentation can also eliminate confusion by clearly defining roles and responsibilities within the organization.

Clear documentation of applications is essential for new employees. When internally developed applications need to be maintained and enhanced, documentation helps developers refresh their knowledge of the internal details of the programs. If the original developers are no longer with the group, this documentation becomes especially valuable to new developers who would otherwise have to struggle through reading the code itself. Readily available application documentation also can greatly reduce the number of questions for your support organization.

Physical Security Policies and Procedures for High Availability

Security policies consider the physical security and operations of the hardware and the data center. Physical security includes protection from unauthorized access, as well as from physical damage such as from fire, heat, and electrical surges. Physical security is the most fundamental security precaution and is essential for the system to meet the customer's availability requirements. Physical security protects against external and internal security breach. The CSI/FBI Computer Crime and Security Survey documents a trend toward increasing external intrusions and maintains that internal security violations still pose a large threat. A detailed discovery process into the security of data center operations and organization is outside the scope of this book. However, a properly secured infrastructure reduces the risk of damage, downtime, and financial loss.

Take the following steps to maintain physical security of the hardware and data center: