C H A P T E R  6

Reliable File Service

The Reliable File Service provides a method by which customer applications can access highly available data on the master node disks. This service is one of the highly available services that the Netra HA Suite Foundation Services provides. The Reliable File Service includes the following features:

IP replication and shared disks are two different methods for sharing data between a master node and a vice-master node and, consequently, for making data highly available. Only one of these two mechanisms can be installed on a given cluster at a time.

This chapter contains the following sections:


Reliable NFS

Reliable NFS is implemented by the nhcrfsd daemon, which runs on the master node and the vice-master node. This daemon implements failover and switchover from the master node to the vice-master node. If the master node fails, the vice-master node becomes the master node, and the Reliable NFS server on the new master node becomes active.

The nhcrfsd daemon responds to changes in the cluster state as it receives notifications from the Cluster Membership Manager. For more information about the Cluster Membership Manager, see Chapter 5. The Reliable NFS daemon is monitored by the Daemon Monitor, nhpmd. For more information about the Daemon Monitor, see Chapter 10.

If the impact on performance is acceptable, do not use data and attribute caches when writing to shared file systems. If it is necessary to use data caches to improve performance, ensure that your applications minimize the risk of using inconsistent data. For guidelines on how to use data and attribute caches when writing to shared file systems, see “Using Data Caches in Shared File Systems” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.
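As an illustration only, and assuming an NFS-mounted shared file system, attribute caching can be disabled with the noac mount option, and the Solaris-specific forcedirectio option additionally bypasses the client data cache. The server name and path below are hypothetical; the Cluster Administration Guide remains the authoritative reference for supported mount options.

    # Solaris OS example (illustrative; hypothetical server name and path)
    mount -F nfs -o noac,forcedirectio master-node:/export/data /mnt/shared

    # Linux example (illustrative): noac disables attribute caching
    mount -t nfs -o noac master-node:/export/data /mnt/shared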

For reference information about network tunable parameters and the Solaris kernel, see the Solaris Tunable Parameters Reference Manual for your version of the Solaris Operating System. Information about tunable parameters in the Linux kernel can be obtained from the provider of the actual Linux distribution.

Master Node IP Address Failover

For a failover to be transparent to customer applications running on peer or non-peer nodes, the following must be true:

For more information about the floating address of the master node, see Floating Address Triplet.


IP Replication

IP replication is one of two mechanisms provided by the Reliable File Service for sharing data between a master node and a vice-master node. It is also called a “shared nothing” mechanism.

Replication During Normal Operations

Replication is the act of copying data from the master node to the vice-master node. Through replication, the vice-master node has an up-to-date copy of the data on the master node. Replication enables the vice-master node to take over the master role at any time, transparently. After replication, the master node disk and vice-master node disk are synchronized, that is, the replicated partitions contain exactly the same data.

Replication occurs at the following times:

The following figure illustrates a client node (diskless or dataless) writing data to the master node, and that data being replicated to the vice-master node.

FIGURE 6-1   Data Replication

Example of a diskless node writing data to the master node, and that data being replicated to the vice-master node.


Replication During Failover and Switchover

During failover or switchover, the original master node goes out of service for a time before it is re-established as the vice-master node. During this time, changes that are made to the new master node disk cannot be replicated. Consequently, the cluster becomes unsynchronized.

While the vice-master node is out of service, data continues to be updated on the master node disk, and the modified data blocks are identified in a specific data area. FIGURE 6-2 illustrates Reliable File Service during failover or switchover.



Note - In the Solaris OS, the data area used to keep track of modifications to data blocks is referred to as a “scoreboard bitmap.” On the Linux OS, this data area is referred to as “Distributed Replicated Block Device (DRBD) metadata.”



FIGURE 6-2   Reliable File Service During Failover or Switchover

Diagram shows an example of Reliable NFS during failover or switchover; modified data blocks are identified in a scoreboard bitmap.


When the vice-master node is re-established, replication resumes. Any data written to the master node is replicated to the vice-master node. In addition, the data area used to keep track of modifications to data blocks is examined to determine which data blocks have been changed while the vice-master node was out of service. Any changed data blocks are also replicated to the vice-master node. In this way, the cluster becomes synchronized again. The following figure illustrates the restoration of the synchronized state.

FIGURE 6-3   Restoration of the Synchronized State

Diagram shows the restoration of the synchronized state after failover or switchover.


You can verify whether a cluster is synchronized, as described in “To Verify That the Master Node and Vice-Master Node Are Synchronized” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

You can collect replication statistics by using the Node Management Agent (NMA), as described in the Netra High Availability Suite 3.0 1/08 Foundation Services NMA Programming Guide.



Note - NMA is not available for use on Linux.



Data Area Used to Track Modifications to Data Blocks

On the Solaris OS, the data area that is used to keep track of modifications that are made to data blocks is called the scoreboard bitmap. On Linux, these modifications are tracked by DRBD metadata. The following sections describe how this mechanism works in each environment.

Scoreboard Bitmaps on the Solaris OS

When data is written to a replicated partition on the master node disk, the corresponding scoreboard bitmap is updated.

The scoreboard bitmap maps one bit to each block of data on a replicated partition. When a block of data is changed, the corresponding bit in the scoreboard bitmap is set to 1. When the data has been replicated to the vice-master node, the corresponding bit is set to 0.

The scoreboard bitmap can reside on a partition on the master node disk or in memory. There are advantages and disadvantages to storing the scoreboard bitmap on the master node disk or in memory:

  • Storing the scoreboard bitmap in memory gives better performance during normal operation because writing to memory is faster than writing to disk. This choice is recommended when data is updated continuously and frequently, or when data is expected to be written to a replicated partition during a switchover: writes to replicated partitions remain fast, and the time required to synchronize the partitions after a switchover is reduced.

  • However, if the master node and vice-master node fail simultaneously, a scoreboard bitmap that is stored in memory is lost, and a full resynchronization is required when the nodes are rebooted.

  • Storing the scoreboard bitmap on a disk partition is slower during normal operation because writing to disk is slower than writing to memory. However, if the master node and vice-master node fail simultaneously, the on-disk scoreboard bitmap can be used to resynchronize the nodes without a full resynchronization.

Each replicated partition on a disk must have a corresponding partition for a scoreboard bitmap, even if the scoreboard bitmap is stored in memory.

For information about how to configure the scoreboard bitmap in memory or on disk, see “Changing the Location of the Scoreboard Bitmap” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

DRBD Metadata on Linux

On Linux, DRBD is used for disk replication. DRBD metadata keeps track of writes to replicated partitions on the master node disk and works like the scoreboard bitmap on the Solaris OS, as described in the preceding section, except that DRBD metadata cannot be kept in memory: it must be stored on a disk partition.

Synchronization Options

When a cluster is unsynchronized, the data on the master node disk is not fully backed up. You must not schedule major tasks when a cluster is unsynchronized. If this constraint does not suit the way you want to use your cluster, you can choose to:

  • Delay the start of synchronization, as described in Delayed Synchronization.

  • Reduce the time that synchronization takes, as described in Reduced Duration of Disk Synchronization (Solaris OS Only).

If you need to reduce the disk, network, and CPU load on the cluster during synchronization, you can serialize slice synchronization, as described in Serialized Slice Synchronization.

Delayed Synchronization

This feature enables you to delay the start of synchronization between the master and vice-master disks. By default, this feature is disabled. Delay synchronization if you do not want the synchronization task to conflict with other CPU-intensive activities. If you enable this feature, you can trigger synchronization at a later time of your choosing using the nhenablesync command.

Until synchronization is triggered and completed, the vice-master node is not eligible to become the master node, and the cluster is exposed to a single point of failure. By delaying synchronization, you prolong this period of vulnerability.

Activate this feature by setting the RNFS.EnableSync parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, specify the synchronization method by setting SYNC_FLAG in the cluster_definition.conf file. For more information, see the nhfs.conf(4), cluster_definition.conf(4), and nhenablesync(1M) man pages.
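The following sketch shows how delayed synchronization might be configured, assuming the key=value form of these files; the values shown are placeholders or assumptions, so check the man pages listed above for the accepted values.

    # nhfs.conf on both server nodes (illustrative; see nhfs.conf(4) for accepted values)
    RNFS.EnableSync=FALSE          # assumed value meaning "do not synchronize automatically"

    # cluster_definition.conf when installing with nhinstall (see cluster_definition.conf(4))
    SYNC_FLAG=<value>              # placeholder for the value that delays synchronization

    # Later, when cluster activity is low, trigger synchronization manually:
    nhenablesync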

Reduced Duration of Disk Synchronization (Solaris OS Only)

To reduce the time it takes to synchronize the master and vice-master disks, choose how the disks are synchronized. You can synchronize either the entire master disk or only those blocks that contain data; synchronizing only the blocks that contain data is faster. This option is available only if you have a UNIX® File System (UFS). If you have a file system other than UFS, the entire master disk is synchronized.

Set this parameter to the same value on both server nodes. If you change the value of the RNFS.SyncType parameter, reboot the cluster as described in “Shutting Down and Restarting a Cluster” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

Choose the synchronization method by setting the RNFS.SyncType parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, choose the synchronization method by setting SLICE_SYNC_TYPE in the cluster_definition.conf file. For more information, see the nhfs.conf(4) and cluster_definition.conf(4) man pages.
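For illustration, and again assuming the key=value form of these files, the synchronization type might be set as follows; the values are placeholders because the supported value names are listed only in the man pages.

    # nhfs.conf -- must be identical on both server nodes (see nhfs.conf(4))
    RNFS.SyncType=<type>           # placeholder: one value synchronizes the whole disk,
                                   # another synchronizes only blocks that contain data (UFS)

    # cluster_definition.conf when installing with nhinstall (see cluster_definition.conf(4))
    SLICE_SYNC_TYPE=<type>         # placeholder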

Serialized Slice Synchronization

This feature enables you to synchronize the master and vice-master disks one slice at a time rather than all at once. By default, this feature is disabled. Enabling this feature reduces the disk and network load. However, it limits the availability of the cluster for a period of time, because the vice-master node cannot become the master node until all slices have been synchronized. During this period, the cluster is vulnerable to a single point of failure.

Activate this feature by setting the value of the RNFS.SerializeSync parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, specify the synchronization method by setting SERIALIZE_SYNC in the cluster_definition.conf file. For more information, see the nhfs.conf(4) and cluster_definition.conf(4) man pages.
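A hedged sketch of enabling serialized slice synchronization, with the same assumptions about syntax and values as in the previous examples:

    # nhfs.conf (illustrative; see nhfs.conf(4) for the accepted values)
    RNFS.SerializeSync=TRUE        # assumed value meaning "synchronize one slice at a time"

    # cluster_definition.conf when installing with nhinstall (see cluster_definition.conf(4))
    SERIALIZE_SYNC=<value>         # placeholder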

Sanity Check of Replicated Slices

This feature enables you to continuously scan the state of replicated slices. By default, this feature is disabled. It can be used only if the master and vice-master disks are synchronized. If you do not monitor the state of the replicated slices, the vice-master disk could become corrupted and partly or completely inaccessible without being detected.

Activate this feature by setting the RNFS.CheckReplicatedSlices parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, enable this feature by setting CHECK_REPLICATED_SLICES in the cluster_definition.conf file. For more information, see the nhfs.conf(4) and cluster_definition.conf(4) man pages.
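A hedged sketch of enabling the sanity check, with the same assumptions about syntax and values as in the previous examples:

    # nhfs.conf (illustrative; see nhfs.conf(4) for the accepted values)
    RNFS.CheckReplicatedSlices=TRUE    # assumed value meaning "scan the replicated slices"

    # cluster_definition.conf when installing with nhinstall (see cluster_definition.conf(4))
    CHECK_REPLICATED_SLICES=<value>    # placeholder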

Disk Partitioning

This section describes how the master node disk and vice-master node disk are partitioned.

The master node, vice-master node, and dataless nodes access their local disks. The vice-master node and dataless nodes also access some disk partitions of the master node. Diskless nodes do not have, or are not configured to use, local disks. Diskless nodes rely entirely on the master node to boot and access services and data.

You can partition your disks as described in Solaris OS Standard Disk Partitioning, or as described in Solaris OS Virtual Disk Partitioning.

Solaris OS Standard Disk Partitioning

To use standard disk partitioning, you must specify your disk partitions in the cluster configuration files. During installation, the nhinstall tool partitions the disks according to the specifications in the cluster configuration files. If you manually install the Netra HA Suite software, you must partition the system disk and create the required file systems manually.

The master node disk and vice-master node disk can be split identically into a maximum of eight partitions. For a cluster containing diskless nodes, you can arrange the partitions as follows:

  • Three partitions for the system configuration

  • Two partitions for data

  • Two partitions for scoreboard bitmaps

  • One free partition

Partitions that contain data are called data partitions. One data partition might typically contain the exported file system for diskless nodes. The other data partition might contain configuration and status files for the Foundation Services. Data partitions are replicated from the master node to the vice-master node.

To be replicated, a data partition must have a corresponding scoreboard bitmap partition; a data partition without one cannot be replicated. For information about the scoreboard bitmap, see Scoreboard Bitmaps on the Solaris OS.

TABLE 6-1 shows an example disk partition for a cluster containing server nodes and client (diskless) nodes. This example indicates which partitions are replicated.


TABLE 6-1   Example Disk Partition for a Cluster of Server Nodes and Client (Diskless) Nodes 
Partition   Use                                               Replicated
s0          Solaris boot                                      Not replicated
s1          Swap                                              Not replicated
s2          Whole disk                                        Not applicable
s3          Data partition for diskless Solaris images        Replicated read/write for the diskless nodes
s4          Data partition for middleware data and binaries   Replicated read/write for applications
s5          Scoreboard bitmap partition                       Used to replicate partition s3
s6          Scoreboard bitmap partition                       Used to replicate partition s4
s7          Free



Note - Server nodes in a cluster that does not contain diskless nodes do not require partitions s3 and s5.



Solaris OS Virtual Disk Partitioning

Virtual disk partitioning has been integrated into the Solaris OS since the Solaris 9 OS, as part of the Solaris Volume Manager software.

One of the partitions of a physical disk can be configured as a virtual disk using Solaris Volume Manager. A virtual disk can be partitioned into a maximum of 128 soft partitions. To an application, a virtual disk is functionally identical to a physical disk. The following figure shows one partition of a physical disk configured as a virtual disk with soft partitions.

FIGURE 6-4   One Partition of a Physical Disk Configured as a Virtual Disk

Diagram shows one partition of a physical disk configured as a virtual disk with soft partitions.


In Solaris Volume Manager, a virtual disk is called a volume.
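As an illustration of how soft partitions are created with Solaris Volume Manager, the following commands turn one slice into two volumes; the device names and sizes are hypothetical, and on a Netra HA Suite cluster this configuration is done manually, before the Netra HA Suite software is installed.

    # Create state database replicas (required before any SVM volume can be defined).
    metadb -a -f -c 2 c0t0d0s7

    # Create two soft partitions (volumes d1 and d2) on slice c0t0d0s4.
    metainit d1 -p c0t0d0s4 2g
    metainit d2 -p c0t0d0s4 4g

    # Each soft partition is then used like an ordinary disk partition.
    newfs /dev/md/rdsk/d1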

To use virtual disk partitioning, you must manually install and configure the Solaris Operating System and virtual disk partitioning on your cluster. You can then configure the nhinstall tool to install the Netra HA Suite software only, or you can install the Netra HA Suite software manually.

For more information about virtual disk partitioning, see the Solaris documentation.

Linux Standard Disk Partitioning

To use standard disk partitioning on Linux, you must specify your disk partitions in the cluster configuration files. During installation, the nhinstall tool partitions the disks according to the specifications in the cluster configuration files. If you manually install the Foundation Services, you must partition the system disk and create the required file systems manually.

For a cluster containing diskless nodes, you can arrange the partitions on master and vice-master nodes as follows:

  • Three partitions for the system configuration

  • One partition for data

  • One partition for DRBD metadata

  • One free partition

Partitions that contain data are called data partitions. The data partition might contain configuration and status files for the Foundation Services. Data partitions are replicated from the master node to the vice-master node. If needed, more data partitions can be added.

To be replicated, a data partition must have corresponding DRBD metadata. Metadata for all replicated data partitions can be kept in a single metadata partition (in separate 128-MB slots).
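As a purely illustrative example of indexed external DRBD metadata (this is generic DRBD configuration, not the configuration that the Foundation Services generate, and the resource, host, and device names are hypothetical), each index in the meta-disk declaration corresponds to one 128-MB slot of the metadata partition:

    resource data0 {
      protocol C;
      on master-node {
        device    /dev/drbd0;
        disk      /dev/sda5;          # replicated data partition
        address   10.0.0.1:7788;
        meta-disk /dev/sda7[0];       # index 0 = first 128-MB slot of the metadata partition
      }
      on vicemaster-node {
        device    /dev/drbd0;
        disk      /dev/sda5;
        address   10.0.0.2:7788;
        meta-disk /dev/sda7[0];
      }
    }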

TABLE 6-2 shows an example disk partition for a Netra HA Suite cluster. This example indicates which partitions are replicated.


TABLE 6-2   Example Disk Partition for a Linux NHAS Cluster  
Partition   Use                                               Replicated
sda1        Linux root partition, boot                        Not replicated
sda2        Swap                                              Not replicated
sda3        Extended partition                                Not replicated
sda5        Data partition for middleware data and binaries   Replicated read/write for applications
sda7        DRBD metadata partition                           Used to replicate sda5
sda8        Free

Linux Virtual Disk Partitioning

Virtual disk partitioning on Linux can be done by using the Logical Volume Manager (LVM). To do this, first declare one or more physical disks as LVM physical volumes. Next, create a volume group from a set of physical volumes. Finally, create logical volumes from the space that is available in the volume group.

To an application, a logical volume is functionally identical to a physical disk.
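A minimal sketch of these three steps, using hypothetical device and volume names, is shown below; see the LVM documentation referenced at the end of this section for the full set of options.

    # 1. Declare physical volumes.
    pvcreate /dev/sdb1 /dev/sdc1

    # 2. Group the physical volumes into a volume group.
    vgcreate datavg /dev/sdb1 /dev/sdc1

    # 3. Create a logical volume from the free space in the volume group.
    lvcreate -L 4G -n data datavg

    # The logical volume is then used like any other block device.
    mkfs.ext3 /dev/datavg/data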

FIGURE 6-5   Defining Logical Volumes on Linux Using LVM




To use virtual disk partitioning, you must manually install and configure the Linux operating system and virtual disk partitioning on your cluster. You can then configure the nhinstall tool to install only the Foundation Services, or you can install the Foundation Services manually. For more information about virtual disk partitioning, see the Linux and LVM documentation, for example: http://www.tldp.org/HOWTO/LVM-HOWTO/index.html

Logical Mirroring

On the Solaris OS, logical mirroring is provided by Solaris Volume Manager (Solaris VM) for the Solaris 9 and Solaris 10 OS. In the Solaris environment, logical mirroring can be used with IP replication to strengthen system reliability and availability.

Logical mirroring can be used on server nodes with two or more disks. The disks are mirrored locally on the server nodes. They always contain identical information. If a disk on the master node is replaced or crashes, the second local disk takes over without a failover.
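For illustration, the following commands mirror one slice with Solaris Volume Manager and build an equivalent RAID-1 device with Linux software RAID; the device names are hypothetical, and neither example is generated by the Foundation Services.

    # Solaris Volume Manager: mirror slice s4 of two local disks.
    metainit d21 1 1 c0t0d0s4
    metainit d22 1 1 c1t0d0s4
    metainit d20 -m d21            # create the mirror with the first submirror
    metattach d20 d22              # attach the second submirror

    # Linux software RAID: RAID-1 device built from two local partitions.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5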

For more information about logical mirroring on the Solaris OS, see the Solaris documentation. For more information about logical mirroring on Linux, see the Linux Software RAID documentation, for example: http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html


Shared Disks

The shared disk capability is one of two mechanisms provided by the Reliable File Service for sharing data between a master node and a vice-master node. It requires at least one disk bay that can be physically attached to both the master node and the vice-master node. Usually, the disk bay contains two disks that are mirrored using Solaris VM. This section describes how the shared disk is partitioned.



Note - The shared disk functionality is not supported for use with Carrier Grade Linux.



The master node, vice-master node, and dataless nodes access their local disks. The vice-master node and dataless nodes also access some disk partitions mounted on the master node. Diskless nodes do not have, or are not configured to use, local disks. Diskless nodes rely entirely on the master node to boot and to access services and data.

The following figures show how data is replicated on a shared disk, how a shared disk fails or switches over, and how the shared disk is returned to a synchronized state.

FIGURE 6-6   Data Replication Using Shared Disk

Example of a diskless node writing data to the master node, and that data being replicated to the vice-master node.


FIGURE 6-7   Reliable File Service During Failover or Switchover for Shared Disk

Diagram shows an example of Reliable NFS during failover or switchover; modified data blocks are identified in a scoreboard bitmap.


FIGURE 6-8   Restoration of the Synchronized State for Shared Disk

Diagram shows the restoration of the synchronized state after failover or switchover.


Standard Shared Disk Partitioning

To use standard disk partitioning, you must specify your disk partitions in the cluster configuration files. During installation, the nhinstall tool partitions the disks according to the specifications in the cluster configuration files. If you manually install the Netra HA Suite software, you must partition the system disk and create the required file systems manually.

The shared disk can be split into a maximum of eight partitions. For a cluster containing client nodes, the partitions can be arranged as follows:

Partitions that contain data are called data partitions. One data partition might typically contain the exported file system for client nodes. The other data partition might contain configuration and status files for the Foundation Services.

TABLE 6-3 shows an example disk partition for a cluster containing server nodes and client nodes. This example indicates which partitions are shared.


TABLE 6-3   Example Disk Partition for a Cluster of Server Nodes and Client Nodes 
Partition   Use
s0          Data partition for diskless Solaris images
s1          Data partition for middleware data and binaries
s2          Whole disk
s7          Solaris VM replica

Server nodes in a cluster that does not contain client (diskless) nodes do not require partition s0.