5
Operational Policies for High Availability

This chapter describes the policies and procedures that essential for maintaining high availability.

Introduction to Operational Policies for High Availability

Operational policies together with service management are fundamental to avoiding and minimizing outages, as well as reducing the time to recover from an outage. Operational policies are the foundation for managing the information technology infrastructure. They focus on process, policy, and management.

Operational policies for high availability focus on setting and establishing processes, policies, and management. They are divided into the following categories:

Service Level Management for High Availability

Information Technology (IT) departments are required to deliver increasing levels of service and availability while reducing costs. Service level management is an accepted method to ensure that IT services are meeting the business requirements. Service level management requires a dialogue between IT managers and the company's lines of business. It starts with mapping business requirements to IT investments.

Service level management encompasses complete end-to-end management of the service infrastructure. The foundation of service level management is the service level agreement (SLA). The SLA is critical for building accountability into the provider-client relationship and for evaluating the provider's performance. SLAs are becoming more accepted and necessary as a monitoring and control instrument for the relationship between a customer and the IT supplier (external or internal). SLAs are developed for critical business processes and application systems, such as order processing. The business individuals who specify the functionality of systems should develop the SLA for those systems. The SLA represents a detailed, complete description of the services that the supplier is obligated to deliver, and the responsibilities of the users of that service. Developing an SLA challenges the client to rank the requirements and focus resources toward the most important requirements. An SLA should evolve with the business requirements.

There is no standardized SLA that meets the needs of all companies, but a typical SLA should contain the following sections:

Definition of the service provided, the parties involved and the effective dates of the agreement
Specification of the hours and days during which the service or application will be available, excluding time for scheduled testing, maintenance, and upgrades
Specifications of the numbers and locations of users and hardware for which the service or application will be offered.
Problem reporting and escalation procedures, including an expected response time for problems
Procedures for requesting changes, possibly including expected times for completing routine requests
Specification of quality targets and explanations of how they are measured and how frequently they are reported
Specifications of service costs and charges; the charges may be a flat rate or they may be tied to different levels of service quality
Specifications of user responsibilities, such as user training, maintaining proper desktop configuration, not introducing extraneous software or circumventing change management procedures
Description of procedures for resolving service-related disagreements

Developing an SLA and service level measurements requires commitment and hard work by all parties. Service levels should be measured by business requirements, such as cost for each order processed. Any shared services or components must perform at the level of the most stringent SLA. Furthermore, SLAs should be developed between interdependent IT groups and with external suppliers. Many technologists advocate an integrated, comprehensive SLA rather than individual SLAs for infrastructure components.

The benefits of developing a SLA are:

A professional relationship between the supplier and customer with documented accountability
A mutual goal of understanding and meeting the business requirements
A system of measurement for service delivery so that IT can quantify their capabilities and results and continuously improve upon them
IT can prevent or respond faster to events that decrease availability
A documented set of communication and escalation procedures

Planning Capacity to Promote High Availability

Planning capacity and monitoring thresholds is essential to prevent downtime or unacceptably delayed transactions. Understanding average and maximum usage and the requirements to maintain that load over time helps ensure acceptable performance.

Capacity planning includes the ability to estimate the time remaining before a tablespace becomes completely full and planning ahead to add disk space. Capacity planning can also delay or prevent scheduled outages to increase the maximum number of sessions in the database.

Change Management for High Availability

Change management is a set of procedures or rules that ensure that changes to the hardware, software, application, and data on a system are authorized, scheduled, and tested. A stable system in which unexpected, untested, and unauthorized changes are not permitted is one that guarantees integrity to its users. The users can rely on the hardware, software, and data to perform as anticipated. Knowing exactly when and what changes have been applied to the system is vital to debugging a problem. Each customer handles change management of systems, databases, and application code differently, but there are general guidelines that can help prevent unnecessary system outages, thus protecting the system's integrity. With proper change management, application and hardware systems have greater stability, and new problems are easier to debug.

Figure 5-1 describes a typical change control flow. For emergency cases such as disasters, the change control process may need to be shortened.

Figure 5-1 Change Control

Text description of maxav030.gif follows

Text description of the illustration maxav030.gif

The following recommendations are the foundation of good change management:

Develop a change control process

A change control process for both nonescalated and escalated cases should be created, documented, and implemented. Ad hoc and emergency changes to the hardware, database, and software in a system are inevitable, but the change control process must ensure that they are later incorporated into the change management system so their effects and ramifications are examined and recorded.
Form a change control group

Representatives from applications, databases, systems, and management should be members of the change control board. Both hardware and software representatives must be present.

Determine meeting frequency and minimum assessment time. Change control processes should allow essential changes to be implemented in a reasonable time. Change control meetings should be frequent enough to address and discuss the most important issues. There should be a minimum grace period from the time a change is submitted until the time it is scheduled for review to provide adequate assessment time. This assessment time should be bypassed only with upper management approval.
Evaluate proposed changes

Changes must provide short-term or long-term benefit. The change control team needs to assess whether the benefits of a change outweigh the risks and whether the change is consistent with the overall vision of the business, the application, and its rules. Proposed change must document the following:
- Purpose of the change
- Risk assessment
- Fall-back plans
- Test plans and results of the tests
- Estimated time to implement and back out change, including outage times
Gather statistics for base comparisons

Gather snapshots of system, hardware, database, and application configuration and performance statistics. These base numbers can be used for comparison when a change is implemented. After changing a database parameter, you can gather new statistics and compare them with the base statistics to determine the impact of a change.
Track database changes

Database structure changes are easy to make on demand and easy to slip through the change management process. Therefore, it may be necessary to have a special procedure in place for these changes. It is also beneficial to track these changes for trend analysis. In addition, when files are added to the database, the files must be incorporated into the backup and monitoring schemes; proper tracking of this type of change can act as a reminder.
Use a version control system for application code

Some version control system must exist for application code to help track changes and enable fallback to a previous code base. Internally developed applications are modified and enhanced frequently, and new versions are put in place. When a problem is found with the new version, testing the case in the old version provides valuable debugging information. Depending on the type of application and the likelihood of the users' need to revert to an earlier version, the company must decide how many previous versions to keep on hand. At least one is mandatory.
Develop quality assurance and testing procedures

Quality assurance should validate test specifications, test plans, and the results of tests to ensure that the test simulation mimics your applications or at least considers all critical points of the application being tested. Tests and test environments should be designed to test both essential functionality and scalability of the application.
Perform internal audits

Internal audits may be used to verify that your hardware, software, database and application are certified with vendors, performing within service levels, and achieving high availability.

Backup and Recovery Planning for High Availability

Proper backup and recovery plans are essential and must be constructed to meet your specific service levels. Both disk and tape database backups are recommended. Disk and tape backups are essential for disaster recovery and for cases when you need to restore from a very old backup.

A robust backup and recovery scheme requires an understanding of how it is strengthened or compromised by the physical location of files, the order of events during the backup process, and the handling of errors. A robust scheme is resilient in the face of media failures, programmatic failures, and, if possible, operator failures. Complete, successful, and tested processes are fundamental to the successful recovery of any failed environment.

Take the following steps to construct useful backup and recovery plans:

Create recovery plans

Create recovery plans for different types of outages. Start with the most common outage types and progress to the least probable. An outage matrix with recommended recovery actions and a validated MTTR estimate enables you to assess if you can meet your SLAs for different types of outages.

See Also:
Chapter 9, "Recovering from Outages"
Test backups on a regular basis

Monitor the backup tasks for errors and validate backups by testing your recovery procedures periodically.
Automate backup and recovery procedures
Choose an appropriate backup frequency

Having up-to-date backups reduces recovery time.
Maintain offsite tape backups

Offsite backups of the database are essential to protect from site failures.
Maintain updated documentation for backup and recovery plans

Documentation is, for obvious reasons, a safeguard against mistakes, loss of knowledgeable people, and misinterpretations. Maintaining accurate documentation on backup and recovery procedures is as important as having procedures in place.

Disaster Recovery Planning

Disaster recovery (DR) planning is a process designed and developed specifically to deal with catastrophic, large-scale interruptions in service to allow timely resumption of operations. These interruptions can be caused by disasters like fire, flood, earthquakes, or malicious attacks. The basic assumption is that the building where the data center and computers reside may not be accessible, and that the operations need to resume elsewhere. It assumes the worst and tries to deal with the worst. As an organization increasingly relies on its electronic systems and data, access to these systems and data become a fundamental component of success. This underscores the importance of disaster recovery planning. Proper disaster planning reduces MTTR during a catastrophe and provides continual availability of critical applications, helping to preserve customers and revenue.

Take the following steps to plan recovery from disasters:

Choose the right disaster recovery plans (DRPs)

DRPs must deliver the expected MTTR service levels. The implementation costs must also be justified by the service levels. One DRP may not accommodate all disasters or even the common disasters.
Determine what is covered under the disaster recovery plans

The first question to ask when trying to decide whether to include an application in the disaster recovery plans is whether that application supports a key business operation that must be brought back online within a few hours or days of a disaster. This may not be the same as the availability requirements of the application, although it is closely related. It has more to do with the cost to the company every hour or day that the system is not available. Disaster recovery planning requires securing off-site equipment, personnel, and supporting components such as phone lines and networks that can function at an acceptable level in an interim basis. This is costly, and care must be taken to consider only those applications that are key to the survival of the company.
Document DRPs, including diagrams of affected areas and systems

A DRP should clearly identify the outage it protects against and the steps to implement in case of that outage. A general diagram of the system is essential. It needs to be detailed enough to determine hardware fault tolerance. including controllers, mirrored disks, the disaster backup site, processors, communication lines, and power. It also helps identify the current resources and any additional resources that may be necessary. Understanding how and when data flows in and out of the system is essential in identifying parts of the system that merit special attention. Special attention can be in the form of additional monitoring requirements or the frequency and types of backups taken. Conversely, it may also show areas that only require minimal attention and fewer system resources to monitor and manage.
Set up disaster recovery processes

Ensure that critical applications, database instances, systems, or business processes are included in your disaster recovery plan. Use application, system and network diagrams to assess fault tolerance and alternative routes during a disaster.
Assess all of the important business components

Consider all the components that allow your business to run. Ensure that the DRP includes all system, hardware, application and people resources. Verify that network, telephone service and security measures are in place.
Assign a DR coordinator

A DR coordinator and a backup coordinator should be pre-assigned to ensure that all operations and communications are passed on.
Test and validate the DRP

The DRP must be rehearsed periodically, which implies that the facilities to test the DRP must be available.

Planning Scheduled Outages

Scheduled outages can affect the application server tier, the database tier, or the entire site. These outages may include one or more of the following: node hardware maintenance, node software maintenance, Oracle software maintenance, redundant component maintenance, entire site maintenance. Proper scheduled outage planning reduces MTTR and reduces risk when changes do not go as planned.

Take the following steps to plan scheduled outages:

Create a list of scheduled outages

Creating a list of possible scheduled outages, their projected frequency, and estimated duration enables advanced planning and a better assessment of availability requirements. A reliability assessment to understand the mean time between failures (MTBF) of critical components can be used to plan for certain scheduled outages in order to prevent an unscheduled outage. In many cases, only one large scheduled outage is allotted each year, so maintenance must be prioritized and justified.
For each possible scheduled outage, document the impact and assess the risk

Scheduled outages that do not require software or application changes can usually be done with minimum downtime if a subsequent system can take over the new transactions. With Real Application Clusters and Data Guard switchover, you can upgrade hardware and do some standard system maintenance with minimum downtime to your business. For most software upgrades such as Oracle upgrades, the actual downtime can be less than an hour if prepared correctly. For more complex application changes that require schema changes or database object reorganizations, customers must assess whether Oracle's online reorganization features suffice or use some of Oracle's rolling upgrade capabilities.

See Also:
"Recovery Steps for Scheduled Outages"
Justify the scheduled outage

Each change must be consistent with the overall vision of the application and business and adhere to compatibility and change control rules.
Create and automate change, testing, and fallback procedures

Each planned change, such as an Oracle upgrade, should be tested in a simulated real world environment to assess performance and availability impacts. Oracle recommends using complete stress tests and a load simulated to accurately assess performance and load impact. Fallback plans should be created and incorporated into the scheduled outage. An automated process should be in place to implement the change and properly fall back if required.

Staff Training for High Availability

Highly trained people can make better and more informed decisions and are less likely to make mistakes. A comprehensive plan for the continued technical education of the systems administration, database administration, development, and users groups can help ensure higher availability of your systems and databases. Additionally, just as redundancy of system components eliminates a single point of failure, knowledge management and cross training should eradicate the effects to operations of losing an employee.

Cross-train for business-critical positions

Any business-critical systems should have cross-training of technical staff to reduce the impact to operations if an employee becomes unavailable or leaves the company. For example, the system administration group should be familiar with Oracle RDBMS and tools. Maintain formal and regular forms of communication (such as weekly meetings) between different support groups.
Develop guidelines to ensure continued technical education

Ensure that there is a process in place to notify and train staff about new features or procedures associated with the hardware and software your company uses. Additionally, allow time for investigation into new technologies that can improve service levels or reduce costs.
Implement a knowledge management process

Effectively managing the intellectual assets of a company reduces the risk of losing those assets. Create a process to promote central access to information about "lessons learned" within the IT group. For example, group round tables, internal white papers, new features related to upgrades, repositories for problem analysis and resolutions are ways of making information accessible.
Update training materials when applications or system are changed

Training material should be kept up to date with application and system changes. Incorporate training materials into the change management and documentation procedures.

Documentation as a Means of Maintaining High Availability

Clear and complete documentation should be part of every set of HA operational policies. Without documenting the steps for implementing or executing a process, you run the risk of losing that knowledge, increasing the risk for human error during the execution of that process, and omitting a step within a process. All of these risks affect availability.

Clearly defined operational procedures contribute to shorter learning curves when new employees join your organization. Properly documented operational procedures can greatly reduce the number of questions for support personnel, especially when the people who put the procedures in place are no longer with the group. Proper documentation can also eliminate confusion by clearly defining roles and responsibilities within the organization.

Clear documentation of applications is essential for new employees. When internally developed applications need to be maintained and enhanced, documentation helps developers refresh their knowledge of the internal details of the programs. If the original developers are no longer with the group, this documentation becomes especially valuable to new developers who would otherwise have to struggle through reading the code itself. Readily available application documentation also can greatly reduce the number of questions for your support organization.

Ensure that documentation is kept up to date

Update operational procedures when an application or system changes. Keep users informed of documentation updates.
Approve documentation changes through the change management process

The change management team should review and approve changes to the documentation to ensure accuracy.
Document lessons learned and problem resolutions

Documenting problem resolutions and lessons learned can improve the recovery time for repeated problems. Ideally, this documentation can be part of a periodic review process to help set priorities for system enhancements.
Protect the documentation

Secure access to documentation and keep an offsite copy of your operational procedure documentation and any other critical documentation. All critical documentation should also be part of any remote site implementations. Whether the remote site is intended for restarting a system after a disaster or for disaster recovery, the site should also contain a copy of the documented procedures for failing over to that site.

Physical Security Policies and Procedures for High Availability

Security policies consider the physical security and operations of the hardware and the data center. Physical security includes protection from unauthorized access, as well as from physical damage such as from fire, heat, and electrical surges. Physical security is the most fundamental security precaution and is essential for the system to meet the customer's availability requirements. Physical security protects against external and internal security breach. The CSI/FBI Computer Crime and Security Survey documents a trend toward increasing external intrusions and maintains that internal security violations still pose a large threat. A detailed discovery process into the security of data center operations and organization is outside the scope of this book. However, a properly secured infrastructure reduces the risk of damage, downtime, and financial loss.

Take the following steps to maintain physical security of the hardware and data center:

Provide a suitable physical environment for computer equipment

Not every room or closet in an office environment can be used to house computer equipment. The data center should not only account for the appropriate temperature, humidity, and security of the systems, it should also attempt to prevent potential hazards such as electrical surge, fire, and flood.
Restrict access to the operations area to authorized personnel

All security-conscious operations centers need to have some sort of secure access, either in the form of biometric authentication devices or smart-card readers.
Use internal security monitoring

Devices such as cameras and closed-circuit television are essential to a secure operations center by preventing crime and damage caused by people who are inside the facility.
Conduct background checks on DBAs, system administrators, and operational staff

DBAs, system administrators, and operational staff are inherently privileged users and hold positions of trust. Organizations must perform adequate background checks to ensure that these privileged individuals are worthy of the trust placed in them. There is no technical solution that can completely protect against a determined, malicious, and poorly evaluated person holding a position of power.

See Also:
Chapter 6, "System and Network Configuration" for information about data security

5 Operational Policies for High Availability