Oracle® ZFS Storage Appliance Administration Guide


Estimating and Reducing Takeover Impact

During takeover and failback, there is an interval during which clients cannot access storage. The length of this interval varies by configuration, and the exact effects on clients depend on the protocol(s) they use to access data. Understanding and mitigating these effects can make the difference between a successful cluster deployment and a costly failure at the worst possible time.

NFS (all versions) clients typically hide outages from application software: I/O operations are delayed while the server is unavailable and resume once service is restored. NFSv2 and NFSv3 are stateless protocols that recover almost immediately upon service restoration. NFSv4 incorporates a client grace period at startup, during which I/O typically cannot be performed. The duration of this grace period can be tuned on the Oracle ZFS Storage Appliance (see Figure 10-10); reducing it reduces the apparent impact of takeover and failback. For planned outages, the Oracle ZFS Storage Appliance provides grace-less recovery for NFSv4 clients, which avoids the grace period delay. For more information about grace-less recovery, see the Grace period property in NFS Properties.

Figure 10-10  Cluster Grace Period
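
As a rough sketch of how this tuning might look from the appliance CLI (the property name grace_period, its units, and the values shown are assumptions; confirm them against the Grace period property in NFS Properties for the software version in use):

    hostname:> configuration services nfs
    hostname:configuration services nfs> show
    Properties:
                      <status> = online
                  grace_period = 90      (assumed default, in seconds)
                           ...
    hostname:configuration services nfs> set grace_period=30
                  grace_period = 30 (uncommitted)
    hostname:configuration services nfs> commit

Lowering the value shortens the window during which NFSv4 clients cannot perform I/O after takeover, at the cost of giving clients less time to reclaim locks and other state.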

iSCSI behavior during service interruptions is initiator-dependent, but initiators will typically recover if service is restored within a client-specific timeout period. Check your initiator's documentation for additional details. The iSCSI target will typically be able to provide service as soon as takeover is complete, with no additional delays.
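
Initiator timeouts are configured on the client rather than on the appliance. As one hedged example (the value shown is illustrative, not a recommendation), a Linux host using the open-iscsi initiator controls how long outstanding I/O is held while the target is unreachable with the replacement timeout in /etc/iscsi/iscsid.conf; choosing a value longer than the expected takeover window helps avoid I/O errors during a takeover:

    # /etc/iscsi/iscsid.conf (Linux open-iscsi initiator; illustrative value only)
    node.session.timeo.replacement_timeout = 180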

SMB, FTP, and HTTP/WebDAV are connection-oriented protocols. Because the session states associated with these services cannot be transferred along with the underlying storage and network connectivity, all clients using one of these protocols will be disconnected during a takeover or failback, and must reconnect after the operation completes.

While several factors affect takeover time (and its close relative, failback time), in most configurations these times are dominated by the time required to import the diskset resource(s). Typical import times are 15 to 20 seconds per diskset, and total import time scales linearly with the number of disksets. Recall that a diskset consists of one half of one JBOD, provided the disk bays in that half-JBOD have been populated and allocated to a storage pool; unallocated disks and empty disk bays have no effect on takeover time. The time taken to import diskset resources is unaffected by any parameters that can be tuned or altered by administrators, so administrators planning clustered deployments should either limit the amount of allocated storage so that diskset import completes within the acceptable service-interruption window, or plan for takeover and failback times proportional to the number of disksets in use.
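
As an illustrative calculation based on the per-diskset figures above (the configuration is hypothetical): eight fully populated JBODs allocated to storage pools comprise 16 disksets, so diskset import alone would be expected to take roughly 240 to 320 seconds (16 disksets at 15 to 20 seconds each), or about 4 to 5.5 minutes of the takeover interval.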

Note that while diskset import usually accounts for the bulk of takeover time, it is not the only factor. During the pool import process, any intent log records must be replayed, and each share and LUN must be shared via the appropriate service(s). The time required to perform these activities for a single share or LUN is very small, on the order of tens of milliseconds, but with very large share counts this work can contribute significantly to takeover times. Keeping the number of shares relatively small (a few thousand or fewer) can therefore reduce these times considerably.
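
As a similarly illustrative figure (assuming roughly 20 milliseconds per share or LUN, a value chosen from the "tens of milliseconds" range above rather than measured): 10,000 shares would add on the order of 200 seconds to takeover time, while 2,000 shares would add only about 40 seconds.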

Failback time is normally greater than takeover time for any given configuration. This is because failback is a two-step operation: first, the source appliance exports all resources of which it is not the assigned owner; then the target appliance performs the standard takeover procedure, but on its own assigned resources only. Therefore, it will always take longer to fail back from head A to head B than it will take for head A to take over from head B in the case of failure. This additional failback time is much less dependent on the number of disksets being exported than takeover time is, so keeping the number of shares and LUNs small can have a greater impact on failback than on takeover. It is also important to keep in mind that failback is always initiated by an administrator, so the longer service interruption it causes can be scheduled for a time when it will cause the least business disruption.
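
As a rough planning guide, and continuing the illustrative figures above, failback time can be thought of as the export phase on the current owner plus a full takeover of the returning resources on the other head, so the scheduled interruption should be expected to exceed the 4 to 5.5 minute takeover estimate for the same pool.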

Note: Estimated times cited in this section refer to software/firmware version 2009.04.10,1-0. Other versions may perform differently, and actual performance may vary. It is important to test takeover and its exact impact on client applications prior to deploying a clustered appliance in a production environment.