CHAPTER 3
Configuring the Remote Mirror Software |
The Sun StorageTek Availability Suite Remote Mirror software is a volume-level replication facility for the Solaris 10 (Update 1 and higher) Operating Systems. The Remote Mirror software replicates disk volume write operations between physically separate primary and secondary sites in real time. The Remote Mirror software can be used with any Sun network adapter and network link that supports TCP/IP.
Since the software is volume-based, it is storage-independent and supports raw volumes or any volume manager, for both Sun and third-party products. Additionally, the product supports any application or database that has a single host running the Solaris OS that writes data. Databases, applications, or file systems that are configured to allow multiple hosts running the Solaris OS to write data to a shared volume are not supported (for example: Oracle® 9iRAC, Oracle® Parallel Server).
As part of a disaster recovery and business continuance plan, the Remote Mirror software keeps up-to-date copies of critical data at remote sites. The Remote Mirror software enables you to rehearse and test business continuance plans. For a highly available solution, the Sun StorageTek Availability Suite software can be configured to fail over within Sun Cluster 3.x environments.
The Remote Mirror software is active while your applications are accessing the data volumes, continually replicating the data to the remote sites or scoreboarding changes, which allows for a fast resynchronization at a later time.
The Remote Mirror software enables you to initiate resynchronization manually from either the primary site to the secondary site (typically called forward synchronization), or from the secondary site to the primary site (typically called reverse synchronization).
Replication and configuration in the Remote Mirror software is done on a set basis. A Remote Mirror set consists of a primary volume, a secondary volume, a bitmap volume on both the primary and secondary sites (used to track and scoreboard changes for fast resynchronization), and an optional asynchronous queue volume for asynchronous replication mode. It is recommended that the primary and secondary volumes be the same size. You can use the dsbitmap tool to determine the required size of the bitmap volumes. For more information on configuring Remote Mirror sets or the dsbitmap tool, see the Sun StorageTek Availability Suite 4.0 Remote Mirror Software Administration Guide.
Replication can occur either synchronously or asynchronously. In synchronous mode, an application write operation is not acknowledged until the write operation is committed on both the primary and secondary hosts. In asynchronous mode, the application write operation is acknowledged when it is committed to storage locally and written to an asynchronous queue. This queue drives write operations to the secondary site asynchronously.
The data flow for synchronous operation is as follows:
1. Scoreboard bit is set in the bitmap volume.
2. Local write operation and network write operation are initiated in parallel.
3. When both write operations are complete, the scoreboard bit is cleared (lazy clear).
4. Write operation is acknowledged to application.
The advantage of synchronous replication is that both primary and secondary sites are always in sync. This type of replication is practical only if the latency of the link is low and the bandwidth requirements of the application can be met by the link. These constraints usually confine a synchronous solution to a campus or metropolitan location.
In this case, the average service time for a write operation is:
bitmap write + MAX (local data write, network round trip + remote data write)
In the campus and metropolitan location, the network round trip is negligible and the service time is approximately twice what is observed when the Remote Mirror software is not installed.
Assuming 5 milliseconds for a write, then:
5ms + MAX (5ms, 1ms + 5ms) = 11ms
Note - This value of 5 milliseconds is a reasonable assumption on a lightly loaded system. On a more realistically loaded system, queuing backlog increases the value.
However, if the network round trip is approximately 50 milliseconds (typical for long distance replication), the network latency renders the synchronous solution impractical as shown in the following example:
5ms + MAX (5ms, 50ms + 5ms) = 60ms
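The synchronous service-time arithmetic above can be modeled in a few lines. This is an illustrative sketch, not product code; the 5 ms write time and the round-trip values are the document's example figures.

```python
# Model of: bitmap write + MAX(local data write, network round trip + remote data write)
# All times are in milliseconds, matching the document's examples.

def sync_service_time(bitmap_write, local_write, round_trip, remote_write):
    """Estimated service time of one synchronous Remote Mirror write."""
    return bitmap_write + max(local_write, round_trip + remote_write)

# Campus/metropolitan link, ~1 ms round trip:
print(sync_service_time(5, 5, 1, 5))   # -> 11
# Long-distance link, ~50 ms round trip:
print(sync_service_time(5, 5, 50, 5))  # -> 60
```

The MAX term reflects that the local write and the network write are issued in parallel, so only the slower of the two gates the acknowledgment.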
Asynchronous replication separates the remote write operation from the application write operation. In this mode, the acknowledgment occurs when the network write operation is added to the asynchronous queue. This means that the secondary site can get out of synchronization with the primary site until all write operations are delivered to the secondary site. In this mode, data flows in this manner:
1. Scoreboard bit is set in the bitmap volume.
2. Local write operation and asynchronous queue write operation are done in parallel.
3. Write is acknowledged to application.
4. Flusher threads read asynchronous queue entry and perform network write.
5. Scoreboard bit is cleared (lazy clear).
The service time is the time required for the following:
bitmap write + MAX (local write, asynchronous queue entry data)
Using the value of 5 milliseconds service time for a write operation, the estimated service time for an asynchronous write operation is:
5ms + MAX (5ms, 5ms) = 10ms
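Under the same 5 ms assumption, the asynchronous case replaces the network round trip with the local asynchronous queue write. The following is an illustrative sketch, not product code:

```python
# Model of: bitmap write + MAX(local write, asynchronous queue write)
# All times are in milliseconds.

def async_service_time(bitmap_write, local_write, queue_write):
    """Estimated service time of one asynchronous Remote Mirror write."""
    return bitmap_write + max(local_write, queue_write)

# Assuming 5 ms for each write, as in the synchronous examples:
print(async_service_time(5, 5, 5))  # -> 10
```

Because the queue write is local, the service time is independent of network latency, which is the key advantage of asynchronous mode over long-distance links.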
If the network drain rate for the volume or consistency group is exceeded by the write rate for an extended period of time, the asynchronous queue fills up. Proper sizing is important so a method for estimating the appropriate volume size is discussed later in this document.
There are two modes that govern how the Remote Mirror software behaves in the event of the asynchronous disk queue filling:
In blocking mode, which is the default setting, the Remote Mirror software blocks, and waits for the asynchronous disk queue to drain to a certain point before adding the write to the asynchronous queue. This impacts application write operations, but maintains write ordering across the link.
In non-blocking mode (not available with memory-based queues), the Remote Mirror software does not block when the disk asynchronous queue fills, but drops into logging mode and scoreboards the write. On a subsequent update synchronization these are read from bit 0 forward, and there is no preservation of write ordering. If this mode is used and if the asynchronous disk queue fills and write ordering is lost, the associated volume or consistency group is inconsistent.
Note - It is strongly advised that a point-in-time copy be taken on the secondary site prior to starting the update synchronization, for example, using the autosync daemon.
In synchronous mode, write ordering for an application that spans many volumes is assured because the application waits for completion before issuing another I/O operation when ordering is required, and the Remote Mirror software does not signal completion until the write operation is on both primary and secondary sites.
In asynchronous mode, by default, the queue for each volume is drained by one or more independent threads. Because this operation is separated from the application, write ordering is not preserved across write operations to multiple volumes.
If write ordering is required for an application, the Remote Mirror software provides the consistency group feature. Each consistency group has a single network queue, and although multiple write operations are allowed in parallel, write ordering is preserved through the use of sequence numbers.
When you are planning for remote replication, consider your business needs, the application write loads, and your network's characteristics.
When you decide to replicate your business data, consider the maximum delay. How long out of date can you allow the data on the secondary site to become? This determines the replication mode and snapshot scheduling. Additionally, it is very important to know if the applications that you are replicating require the write operations to the secondary volume to be replicated in the correct order.
Understanding the average and peak write loads is critical to determining the type of network connection required between the primary and secondary sites. To make decisions about the configuration, collect the following information:
The average rate is the rate at which the application writes data while under typical load. Application read operations are not relevant to the provisioning and planning of your remote replication.
The peak rate is the largest amount of data written by the application over a measured duration.
The duration is how long the peak write rate lasts and the frequency is how often this condition occurs.
If these application characteristics are not known, you can measure them using tools such as iostat or sar to measure write traffic while the application is running.
When you know the application write load, determine the requirements of the network link. The most important network properties to consider are the network bandwidth and the network latency between the primary and secondary sites. If the network link already exists prior to installing the Sun StorageTek Availability Suite software, you can use tools such as ping to help determine the characteristics of the link between the sites.
To use synchronous replication, the network latency must be low enough that your application response time is not affected dramatically by the time of the network round trip of each write operation. Also, the bandwidth of the network must be sufficient to handle the amount of write traffic generated during the application's peak write period. If the network cannot handle the write traffic at any time, the application response time will be impacted.
To use asynchronous replication, the bandwidth of the network link must be able to handle the write traffic generated during the application's average write period. During the application peak write phase, the excess write operations are written to the local asynchronous queue and then written to the secondary site at a later time when the network traffic allows. The application response time can be minimized during bursts of write traffic above the network limit as long as the asynchronous queue is properly sized.
See the Configuring the Asynchronous Queue section of this document. The Remote Mirror asynchronous mode option selected (blocking or non-blocking) determines how the software reacts when the queue fills.
If you use asynchronous replication, plan for the configuration settings described in this section. These settings are set on a Remote Mirror set or consistency group basis.
In version 3.2 of the software, the Remote Mirror software added support for disk-based asynchronous queues. For ease of upgrading from previous versions, memory-based queues are still supported, but the new disk-based queues provide the ability to create significantly larger, more efficient queues. Larger queues allow for larger bursts of write activity without affecting application response time. Also, disk-based queues have less impact on system resources than the memory-based queues.
The asynchronous queue must be sufficient in size to handle the bursts of write traffic associated with the application peak write periods. A large queue can handle prolonged bursts of write activity, but also allows the possibility that the secondary site gets further out of sync with the primary. Using the peak write rate, peak write duration, write size, and network link characteristics, you can determine how the queue should be sized. See Setting the Correct Size for Disk-Based Asynchronous Queue.
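The sizing rule above can be sketched as a small calculator: while the application's write rate exceeds the network drain rate, the queue grows by the difference. The rates and duration below are hypothetical values chosen for illustration, not measurements from this document.

```python
# Backlog accumulated in the asynchronous queue during a write burst.

def queue_growth_bytes(peak_write_rate, drain_rate, duration_s):
    """Bytes added to the queue while writes outpace the network drain.
    Rates are in bytes/sec; duration is in seconds."""
    return max(0, peak_write_rate - drain_rate) * duration_s

MB = 1024 * 1024
# Hypothetical: 4 Mbyte/sec peak against a 2 Mbyte/sec drain, for one hour.
growth = queue_growth_bytes(4 * MB, 2 * MB, 3600)
print(growth / MB)  # -> 7200.0 (Mbytes of queue consumed by this burst)
```

A queue sized below this figure would fill during the burst, triggering the blocking or non-blocking behavior described later in this section.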
The queue option you select (blocking or non-blocking) determines how the software reacts to a filled disk queue. Use the dsstat tool to determine statistics for the asynchronous queue, including the high-water mark (hwm), which shows the largest amount of the queue that has been used. To add an asynchronous queue to a Remote Mirror set or consistency group, use the sndradm command with the -q option: sndradm -q a
Monitor the asynchronous queue using the dsstat(1SCM) command to check the high water mark (hwm). If the hwm frequently reaches 80 to 85 percent of the total size of the queue, the application is writing more data than the queue can drain in time, and you should increase the queue size. This principle applies to both disk-based and memory-based queues. However, the procedure to resize each queue type is different.
The effective size of the disk queue is the size of the disk queue volume. A disk queue can only be resized by replacing it with a volume of a different size. For example, for a queue size of 16384 blocks, check that the hwm does not exceed 13000 to 14000 blocks. If it exceeds this amount, resize the queue using the following procedure.
1. Place the volume into logging mode using the sndradm -l command.
2. Replace the disk queue volume with a larger volume.
3. Perform an update synchronization by using the sndradm -u command.
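The 80 to 85 percent guideline can be checked with a quick computation. This is an illustrative sketch; the 16384-block queue is the document's example size.

```python
# Alert threshold for the asynchronous queue high-water mark (hwm).

def hwm_threshold(queue_size_blocks, fraction=0.80):
    """Block count at which the hwm suggests the queue should be resized."""
    return int(queue_size_blocks * fraction)

print(hwm_threshold(16384, 0.80))  # -> 13107
print(hwm_threshold(16384, 0.85))  # -> 13926
```

Both results fall in the 13000-to-14000 block range the text cites for a 16384-block queue.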
1. Type the following to display the queue size:
The size of the queue in blocks is given by max q fbas (16384 blocks in this example). The maximum number of items allowed in the queue is given by max q writes (4096 in this example). In this example, this means that the average size of an item in the queue is 2 Kbytes.
The diskqueue volume is displayed (/dev/vx/rdsk/data_t3_dg/dq_single). The size of the queue can be determined by examining the size of the volume.
2. Type the following to show the current queue length and its hwm:
3. To show streaming summary and disk queue information, type:
4. To show more information, run dsstat(1SCM) with other display options.
Note - This example shows only a portion of the command output required for this section; the dsstat command actually displays more information.
The following dsstat(1SCM) kernel statistics output shows information about the asynchronous queue. In these examples, the queue is sized correctly and is not currently filled. This example shows the following settings and statistics:
# dsstat -m sndr -r n -d sq -s priv-2-230:/dev/vx/rdsk/data_t3_dg/vol67
Assuming the disk queue volume size is 1 Gbyte, or 2097152 disk blocks, the hwm of 1944 blocks is well below 80% full. The disk queue is sized correctly for the write load.
The following dsstat(1SCM) kernel statistics output shows information about the asynchronous queue, which is incorrectly sized:
This example shows the default queue settings, but the application is writing more data than the queue can handle. The qhwk value of 8184 Kbytes, compared to the max q fbas value of 16384 blocks (8192 Kbytes), indicates that the application is approaching the maximum allowed limit of 512-byte blocks. It is possible that the next few I/O operations will not be placed into the queue.
Increasing the queue size would be a solution in this case. However, consider improving the network link (such as using larger bandwidth interfaces) to achieve long-term benefits. Alternatively, consider taking point-in-time volume copies and replicating the shadow volumes. See the Sun StorageTek Availability Suite 4.0 Point-in-Time Copy Software Administration Guide.
Consider the following example, in which iostat was run at an hourly interval to profile the I/O load that will be replicated. Assume a DS3 (45 Mbit/sec) link. Also assume that this application uses a single consistency group; consequently, a single queue is involved.
After collecting statistics over a 24-hour period, and assuming that this is a typical day for the application in question, you can determine the average write rate, the proper sizing for the asynchronous queue, how far out of date the remote site may become over the course of the day, and whether the chosen network bandwidth is adequate for this application.
[TABLE 3-1, the hourly iostat write-rate profile (peak rate 4 Mbyte/sec), is not reproduced here.]
After filling in the table and calculating queue growth and size, it is evident that a 30Gbyte queue is sufficient. Although the queue grows large and, consequently, the secondary grows out of sync, a batch job run in the evenings ensures that the queue is empty by normal business hours and the two sites are in sync.
This exercise also validates that the network bandwidth is adequate for the write load the application produces.
The Sun StorageTek Availability Suite software provides the ability to set the number of threads flushing the asynchronous queue. Changing this number allows for multiple I/Os per volume or consistency group on the network at one time. The Remote Mirror software on the secondary node handles write ordering of the I/Os using sequence numbers.
Many variables must be considered when determining the number of queue flusher threads that is most efficient for your replication configuration. These variables include the number of sets or consistency groups, available system resources, network characteristics, and whether or not there is a file system. If you have a small number of sets or consistency groups, a larger number of flusher threads might be more efficient. It is recommended that you do some basic testing or prototyping with this variable at slightly different values to determine the most efficient setting for your configuration.
Knowledge of the configuration, network characteristics, and operation of the Remote Mirror software can provide guidelines for proper selection of the number of network threads. The Remote Mirror software utilizes Solaris RPCs, which are synchronous, as a transport mechanism. For each network thread, the maximum throughput the individual thread can achieve is I/O size / round trip time. Consider a workload that is predominately 2-Kbyte I/Os and a round trip time of 60 milliseconds. Each network thread would be capable of:
2Kbyte / 0.060 sec = 33Kbyte/sec
In the case where there is a single volume, or many volumes in a single consistency group, the default of 2 network threads would limit network replication to 66Kbyte/sec. Tuning this number up would be advisable. If the replication network were provisioned for 4Mbyte/sec, then theoretically, the optimal number of network threads for a 2Kbyte workload would be:
(4096Kbyte/sec) / (2Kbyte/0.060 IO/sec) = 123 threads
This assumes linear scalability. In practice it has been observed that adding more than 64 network threads yields no benefit. Consider the case where there is no consistency group, 30 volumes being replicated over a 4Mbyte/sec link, and 8Kbyte I/Os. The default of 2 network threads per volume would yield 60 network threads, and if the workload were spread evenly across these volumes, the theoretical bandwidth would be:
60 * (8Kbyte / 0.060 IO/sec) = 8Mbyte/sec
This is more than the network bandwidth. No tuning is required.
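The thread-count arithmetic above can be reproduced with a short sketch. The helper functions are hypothetical; the synchronous-RPC-per-thread model is taken from the text.

```python
# Each network thread issues synchronous RPCs, so its throughput is
# bounded by I/O size divided by round-trip time.

def thread_throughput_kb_s(io_size_kb, rtt_s):
    """Maximum throughput of a single network thread, in Kbytes/sec."""
    return io_size_kb / rtt_s

def threads_needed(link_kb_s, io_size_kb, rtt_s):
    """Thread count required to fill the link, assuming linear scaling."""
    return link_kb_s / thread_throughput_kb_s(io_size_kb, rtt_s)

# 2-Kbyte I/Os, 60 ms round trip, link provisioned for 4 Mbyte/sec:
print(round(threads_needed(4096, 2, 0.060)))          # -> 123
# 30 volumes x 2 default threads, 8-Kbyte I/Os, in Kbytes/sec:
print(round(60 * thread_throughput_kb_s(8, 0.060)))   # -> 8000
```

As the text notes, scaling is not linear in practice: adding more than 64 network threads has been observed to yield no benefit, so the 123-thread figure is a theoretical upper bound, not a recommended setting.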
The default setting for the number of asynchronous queue flusher threads is 2. To change this setting, use the sndradm command line interface with the -A option, which specifies the maximum number of threads that can be created to process the asynchronous queue when a set is replicating in asynchronous mode (default 2).
To determine the number of flusher threads that are currently configured to serve an asynchronous queue, you can use the sndradm -P command. For example, you can see that the set below has 2 asynchronous flusher threads configured.
An example of how to use the sndradm -A option to change the number of asynchronous queue flusher threads to 3 is:
The Remote Mirror software injects itself directly into the system's I/O path, monitoring all traffic to determine whether it is targeted to Remote Mirror volumes. The I/O commands that are targeted for Remote Mirror volumes are tracked, and replication of these write operations is managed. Because the Remote Mirror software is directly in the system's I/O path, some performance impact to the system is expected. Additional TCP/IP processing that is required for network replication also consumes host CPU resources. Perform the procedures in this section on the primary and secondary Remote Mirror hosts.
The TCP buffer size is the number of bytes that the Transmission Control Protocol (TCP) allows to be transferred before it waits for an acknowledgment. To get maximum throughput, it is critical to use optimal TCP send and receive socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never fully open. If the receiver buffers are too large, TCP flow control breaks and the sender can overrun the receiver, causing the TCP window to shut down. This event is likely to happen if the sending host is faster than the receiving host. Overly large windows on the sending side are not a problem as long as you have excess memory.
TABLE 3-2 shows the maximum possible throughput for a 100BASE-T network.
[TABLE 3-2 is not reproduced here; one entry shows 18.75 Mbyte/sec.]
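A common starting point for the buffer sizes discussed above is the bandwidth-delay product of the link: the amount of data that can be in flight before the first acknowledgment returns. The 20 ms round-trip time below is a hypothetical value for illustration, not a figure from this document.

```python
# Bandwidth-delay product: bytes in flight on the link during one round trip.

def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Suggested minimum TCP socket buffer size for the given link."""
    return bandwidth_bits_per_s // 8 * rtt_ms // 1000

# 100BASE-T (100 Mbit/sec) with a hypothetical 20 ms round trip:
print(bdp_bytes(100_000_000, 20))  # -> 250000
```

A result of roughly 250000 bytes suggests buffers in the 256-Kbyte region, which is consistent with the 256-Kbyte buffer example shown later in this section.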
You can view and tune your TCP buffer size by using /usr/bin/netstat(1M) and /usr/sbin/ndd(1M) commands. TCP parameters to consider tuning include:
When you change one of these parameters, restart the Remote Mirror software with the shutdown command, allowing the software to use the new buffer size. However, after you shut down and restart your server, the TCP buffers return to a default size. To keep your change, set the values in a startup script as described later in this section.
Following are procedures to view TCP buffers and values.
Type the following to view all TCP buffers:
Type the following to view settings by buffer name:
This command shows a value of 1073741824.
Use the /usr/bin/netstat(1M) command to view the buffer size for a particular network socket.
For example, view the size for port 121, the default Remote Mirror port:
The value 263536 shown in this example is the 256 Kbyte buffer size. It must be set identically in the primary and secondary hosts.
Note - Create this script on the primary and secondary hosts.
1. Create the script file in a text editor using the following values:
2. Save the file as /etc/rc2.d/S68ndd and exit the file.
3. Set the permissions and ownership to the /etc/rc2.d/S68ndd file.
4. Shut down and restart your server.
5. Verify the size as shown in To View Buffer Sizes for a Socket.
The Remote Mirror software on both the primary and secondary nodes listens on a well-known port specified in /etc/services, port 121. Remote Mirror write traffic flows from the primary to the secondary site over a socket with an arbitrarily assigned address on the primary site and the well-known address on the secondary site. The health monitoring heartbeat travels over a different connection, with an arbitrarily assigned address on the secondary site and the well-known address on the primary site. The Remote Mirror protocol utilizes Sun RPCs over these connections.
Port 121 is the default TCP port for use by the Remote Mirror sndrd daemon. To change the port number, edit the /etc/services file using a text editor.
If you change the port number, you must change it on all Remote Mirror hosts within this configuration set (that is, primary and secondary hosts, and all hosts in one-to-many, many-to-one, and multihop configurations). In addition, you must shut down and restart all affected hosts so that the port number change can take effect.
Because RPCs require an acknowledgment, the firewall must be opened to allow the well-known port address to be in either the source or destination fields of the packet. If the option is available, be sure to configure the firewall to allow RPC traffic as well.
In the case of write replication traffic, packets destined for the secondary site will have the well-known port number in the destination field; acknowledgments of these RPCs will contain the well-known address in the source field.
For health monitoring, the heartbeat will originate from the secondary with the well-known address in the destination field, and the acknowledgment will contain this address in the source field.
To help ensure the highest level of data integrity and system performance on both sites during normal operations, the Sun StorageTek Availability Suite 4.0 Point-in-Time Copy software is recommended for use in conjunction with Remote Mirror software.
A point-in-time copy can be replicated to a physically remote location, providing a consistent copy of the volume as part of an overall disaster recovery plan. This is commonly referred to as batch replication, and the process and advantages of this practice are described in the best practice guide: Sun StorageTek Availability Suite Software-Improving Data Replication over a Highly Latent Link.
The point-in-time copy of a Remote Mirror secondary volume can be established prior to starting synchronization of a secondary volume from the primary site (the site the primary volume is hosted from). Protection against double failure is provided by enabling the Point-in-Time Copy software to create a point-in-time copy of the replicated data at the secondary site before beginning resynchronization. If a subsequent failure occurs during resynchronization, the point-in-time copy can be used as a fallback position, and resynchronization can be resumed when the subsequent failure issues have been resolved. Once the secondary site is fully synchronized with the primary site, the Point-in-Time Copy software volume set can be disabled, or put to other uses, such as remote backup, remote data analysis, or other functions required at the secondary site.
The Point-in-Time Copy software I/O performed internally during an enable, copy, or update operation can alter the contents of the shadow volume without any new I/O coming down the I/O stack. When this happens, the I/O is not intercepted in the sv layer. If the shadow volume is also a Remote Mirror volume, the Remote Mirror software will not see these I/O operations either. In this situation, the data modified by the I/O will not be replicated to the target Remote Mirror volume.
To allow this replication to occur, the Point-in-Time Copy software can be configured to offer the Remote Mirror software the changed bitmap. If the Remote Mirror software is in logging mode, it accepts the bitmap and performs an OR comparison of the Point-in-Time Copy software bitmap with its own bitmap for that volume, adding the Point-in-Time Copy software changes to its own list of changes to be replicated to the remote node. If the Remote Mirror software is in replication mode for the volume, it rejects the bitmap from the Point-in-Time Copy software. This, in turn, will fail the enable, copy, or update operation. Once Remote Mirror logging has been re-enabled, the Point-in-Time Copy software operation can be reissued.
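The bitmap hand-off described above amounts to a bitwise OR of two scoreboards. The sketch below is a hypothetical illustration of that merge, not the Point-in-Time Copy software's actual implementation.

```python
# Merge Point-in-Time Copy changes into the Remote Mirror scoreboard.
# A set bit marks a block that must be replicated on the next update sync.

def merge_scoreboards(rm_bitmap: bytes, pit_bitmap: bytes) -> bytes:
    """OR the two scoreboards; both must cover the same volume."""
    if len(rm_bitmap) != len(pit_bitmap):
        raise ValueError("bitmaps must be the same length")
    return bytes(a | b for a, b in zip(rm_bitmap, pit_bitmap))

rm = bytes([0b00001100, 0b00000000])    # blocks already dirty in logging mode
pit = bytes([0b00000001, 0b10000000])   # blocks changed by the point-in-time copy
merged = merge_scoreboards(rm, pit)
print(bin(merged[0]), bin(merged[1]))   # -> 0b1101 0b10000000
```

Because the merge only ever sets additional bits, the subsequent update synchronization replicates a superset of the changed blocks, which is safe even if some blocks are copied unnecessarily.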
The Remote Mirror software enables you to create one-to-many, many-to-one, and multihop volume sets.
Any combination of the above configurations is also supported by the Remote Mirror software.