Oracle® ZFS Storage Appliance Administration Guide, Release OS8.7.x

Updated: September 2017
 
 

Cluster Advantages and Disadvantages

It is important to understand the scope of the Oracle ZFS Storage Appliance clustering implementation. The term 'cluster' is used in the industry to refer to many different technologies with a variety of purposes. We use it here to mean a metasystem composed of two appliance controllers and shared storage, used to provide improved availability when one of the controllers succumbs to certain hardware or software failures. A cluster contains exactly two appliances or storage controllers, referred to for brevity throughout this document as controllers. Each controller may be assigned a collection of storage, networking, and other resources from the set available to the cluster, which allows the construction of either of two major topologies.

Many people use the term active-active to describe a cluster in which there are two (or more) storage pools, one of which is assigned to each controller along with the network resources used by clients to reach the data stored in that pool, and the term active-passive to describe a cluster in which a single storage pool is assigned to the controller designated as active, along with its associated network interfaces. Both topologies are supported by the appliance. The distinction between them is artificial; there is no software or hardware difference between them, and one can switch at will simply by adding or destroying a storage pool. In both cases, if a controller fails, the other (its peer) will take control of all known resources and provide the services associated with those resources.
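The relationship between the two topologies and takeover can be illustrated with a small conceptual model. This is only a sketch: the `Controller` class, the `topology` and `takeover` functions, and the pool and interface names are hypothetical and do not correspond to any appliance API or CLI. It shows that the topology label depends only on how pools are assigned, and that on failure the peer takes over every resource.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Controller:
    """Hypothetical stand-in for one of the two cluster controllers."""
    name: str
    pools: List[str] = field(default_factory=list)
    interfaces: List[str] = field(default_factory=list)


def topology(a: Controller, b: Controller) -> str:
    """Label the cluster by pool assignment alone; nothing else differs."""
    owners = sum(1 for c in (a, b) if c.pools)
    if owners == 2:
        return "active-active"
    if owners == 1:
        return "active-passive"
    return "unconfigured"


def takeover(failed: Controller, peer: Controller) -> None:
    """Move all resources of the failed controller to its peer."""
    peer.pools.extend(failed.pools)
    peer.interfaces.extend(failed.interfaces)
    failed.pools.clear()
    failed.interfaces.clear()


a = Controller("controller-a", pools=["pool-0"], interfaces=["net-0"])
b = Controller("controller-b")
print(topology(a, b))   # active-passive

# Adding a second pool, assigned to the other controller, changes the label.
b.pools.append("pool-1")
b.interfaces.append("net-1")
print(topology(a, b))   # active-active

# controller-a fails; its peer now serves both pools.
takeover(a, b)
print(b.pools)          # ['pool-1', 'pool-0']
```

Note that `topology` describes the configured assignment of pools, not the state of the cluster after a takeover.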

As an alternative to incurring hours or days of downtime while a failed controller is repaired, clustering allows a peer appliance to provide service while repair or replacement is performed. In addition, clusters support rolling software upgrades, which can reduce the business disruption associated with migrating to newer software. Some clustering technologies have certain additional capabilities beyond availability enhancement; the Oracle ZFS Storage Appliance clustering subsystem was not designed to provide these. In particular, it does not provide for load balancing among multiple controllers, improve availability in the face of storage failure, offer clients a unified filesystem namespace across multiple appliances, or divide service responsibility across a wide geographic area for disaster recovery purposes. These functions are likewise outside the scope of this document; however, the appliance and the data protocols it offers support numerous other features and strategies that can improve availability:

  • Replication of data, which can be used for disaster recovery at one or more geographically remote sites

  • Client-side mirroring of data, which can be done using redundant iSCSI LUNs provided by multiple arbitrarily located storage servers

  • Load balancing, which is built into the NFS protocol and can be provided for some other protocols by external hardware or software (applies to read-only data)

  • Redundant hardware components including power supplies, network devices, and storage controllers

  • Fault management software that can identify failed components, remove them from service, and guide technicians to repair or replace the correct hardware

  • Network fabric redundancy provided by LACP and IPMP functionality

  • Redundant storage devices (RAID)

Additional information about other availability features can be found in the appropriate sections of this document.

When deciding between a clustered and standalone Oracle ZFS Storage Appliance configuration, it is important to weigh the costs and benefits of clustered operation. It is common practice throughout the IT industry to view clustering as an automatic architectural decision, but this thinking reflects an idealized view of clustering risks and rewards promulgated by some vendors in this space. In addition to the obvious higher up-front and ongoing hardware and support costs associated with the second controller, clustering also imposes additional technical and operational risks. Some of these risks can be mitigated by ensuring that all personnel are thoroughly trained in cluster operations; others are intrinsic to the concept of clustered operation. Such risks include:

  • The potential for application intolerance of protocol-dependent behaviors during takeover,

  • The possibility that the cluster software itself will fail or induce a failure in another subsystem that would not have occurred in standalone operation,

  • Increased management complexity and a higher likelihood of operator error when performing management tasks,

  • The possibility that multiple failures or a severe operator error will induce data loss or corruption that would not have occurred in a standalone configuration, and

  • Increased difficulty of recovering from unanticipated software and/or hardware states.

These costs and risks are fundamental, apply in one form or another to all clustered or cluster-capable products on the market (including the Oracle ZFS Storage Appliance product), and cannot be entirely eliminated or mitigated. Storage architects must weigh them against the primary benefit of clustering: the opportunity to reduce periods of unavailability from hours or days to minutes or less in the rare event of catastrophic hardware or software failure. Whether that cost/benefit analysis will favor the use of clustering in an Oracle ZFS Storage Appliance deployment will depend on local factors such as SLA terms, available support personnel and their qualifications, budget constraints, the perceived likelihood of various possible failures, and the appropriateness of alternative strategies for enhancing availability. These factors are highly site-, application-, and business-dependent and must be assessed on a case-by-case basis. Understanding the material in the rest of this section will help you make appropriate choices during the design and implementation of your unified storage infrastructure.
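One way to structure the cost/benefit comparison described above is a rough expected-downtime estimate. The sketch below is illustrative only: every rate and duration is a hypothetical placeholder, not a measured or published figure for any product, and the function and variable names are invented for this example. Substitute values that reflect your own site, SLA terms, and failure history.

```python
def expected_downtime_hours(failures_per_year: float,
                            hours_per_failure: float) -> float:
    """Expected yearly downtime for one class of failure."""
    return failures_per_year * hours_per_failure


# Hypothetical inputs; replace with site-specific estimates.
controller_failures_per_year = 0.2       # catastrophic controller failures
repair_hours = 24.0                      # standalone outage until repair
takeover_hours = 2.0 / 60.0              # clustered outage during takeover
cluster_induced_failures_per_year = 0.05 # failures caused by clustering itself
cluster_induced_outage_hours = 8.0       # outage from such a failure

standalone = expected_downtime_hours(controller_failures_per_year,
                                     repair_hours)

# Clustering shortens controller outages but adds its own failure modes.
clustered = (
    expected_downtime_hours(controller_failures_per_year, takeover_hours)
    + expected_downtime_hours(cluster_induced_failures_per_year,
                              cluster_induced_outage_hours)
)

print(f"standalone: {standalone:.2f} hours/year")
print(f"clustered:  {clustered:.2f} hours/year")
```

A model this simple ignores hardware and support costs, operator error, and data-loss risk, so it can inform the analysis but not replace it.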
