CHAPTER 3
Configuring the Remote Mirror Software |
The Sun StorageTek Availability Suite Remote Mirror software is a volume-level replication facility for the Solaris 10 (Update 1 and higher) Operating Systems. The Remote Mirror software replicates disk volume write operations between physically separate primary and secondary sites in real time. The Remote Mirror software can be used with any Sun network adapter and network link that supports TCP/IP.
Since the software is volume-based, it is storage-independent and supports raw volumes or any volume manager, for both Sun and third-party products. Additionally, the product supports any application or database that has a single host running the Solaris OS that writes data. Databases, applications, or file systems that are configured to allow multiple hosts running the Solaris OS to write data to a shared volume are not supported (for example: Oracle® 9iRAC, Oracle® Parallel Server).
As part of a disaster recovery and business continuance plan, the Remote Mirror software keeps up-to-date copies of critical data at remote sites. The Remote Mirror software enables you to rehearse and test business continuance plans. For a highly available solution, the Sun StorageTek Availability Suite software can be configured to fail over within Sun Cluster 3.x environments.
The Remote Mirror software is active while your applications are accessing the data volumes, continually replicating the data to the remote sites or scoreboarding changes, which allows for a fast resynchronization at a later time.
The Remote Mirror software enables you to initiate resynchronization manually from either the primary site to the secondary site (typically called forward synchronization), or from the secondary site to the primary site (typically called reverse synchronization).
Replication and configuration in the Remote Mirror software is done on a set basis. A Remote Mirror set consists of a primary volume, a secondary volume, a bitmap volume on both the primary and secondary sites (used to track and scoreboard changes for fast resynchronization), and an optional asynchronous queue volume for asynchronous replication mode. It is recommended that the primary and secondary volumes be the same size. You can use the dsbitmap tool to determine the required size of the bitmap volumes. For more information on configuring Remote Mirror sets or the dsbitmap tool, see the Sun StorageTek Availability Suite 4.0 Remote Mirror Software Administration Guide.
Replication can occur either synchronously or asynchronously. In synchronous mode, an application write operation is not acknowledged until the write operation is committed on both the primary and secondary hosts. In asynchronous mode, the application write operation is acknowledged when it is committed to storage locally and written to an asynchronous queue. This queue drives write operations to the secondary site asynchronously.
The data flow for synchronous operation is as follows:
1. Scoreboard bit is set in the bitmap volume.
2. Local write operation and network write operation are initiated in parallel.
3. When both write operations are complete, the scoreboard bit is cleared (lazy clear).
4. Write operation is acknowledged to application.
The advantage of synchronous replication is that both primary and secondary sites are always in sync. This type of replication is practical only if the latency of the link is low and the bandwidth requirements of the application can be met by the link. These constraints usually confine a synchronous solution to a campus or metropolitan location.
In this case, the average service time for a write operation is:
bitmap write + MAX (local data write, network round trip + remote data write)
In the campus and metropolitan location, the network round trip is negligible and the service time is approximately twice what is observed when the Remote Mirror software is not installed.
Assuming 5 milliseconds for a write, then:
5ms + MAX (5ms, 1ms + 5ms) = 11ms
Note - This value of 5 milliseconds is a reasonable assumption on a lightly loaded system. On a more realistically loaded system, queuing backlog increases the value.
However, if the network round trip is approximately 50 milliseconds (typical for long distance replication), the network latency renders the synchronous solution impractical as shown in the following example:
5ms + MAX (5ms, 50ms + 5ms) = 60ms
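The synchronous service-time arithmetic above can be modeled in a few lines. This is an illustrative sketch, not product code; the 5 ms write time and the round-trip values are the document's example figures.

```python
# Model of: bitmap write + MAX(local data write, network round trip + remote data write)
# All times are in milliseconds, matching the document's examples.

def sync_service_time(bitmap_write, local_write, round_trip, remote_write):
    """Estimated service time of one synchronous Remote Mirror write."""
    return bitmap_write + max(local_write, round_trip + remote_write)

# Campus/metropolitan link, ~1 ms round trip:
print(sync_service_time(5, 5, 1, 5))   # -> 11
# Long-distance link, ~50 ms round trip:
print(sync_service_time(5, 5, 50, 5))  # -> 60
```

The MAX term reflects that the local write and the network write are issued in parallel, so only the slower of the two gates the acknowledgment.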
Asynchronous replication separates the remote write operation from the application write operation. In this mode, the acknowledgment occurs when the network write operation is added to the asynchronous queue. This means that the secondary site can get out of synchronization with the primary site until all write operations are delivered to the secondary site. In this mode, data flows in this manner:
1. Scoreboard bit is set in the bitmap volume.
2. Local write operation and asynchronous queue write operation are done in parallel.
3. Write is acknowledged to application.
4. Flusher threads read asynchronous queue entry and perform network write.
5. Scoreboard bit is cleared (lazy clear).
The service time is the time required for the following:
bitmap write + MAX (local write, asynchronous queue entry data)
Using the value of 5 milliseconds service time for a write operation, the estimated service time for an asynchronous write operation is:
5ms + MAX (5ms, 5ms) = 10ms
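Under the same 5 ms assumption, the asynchronous case replaces the network round trip with the local asynchronous queue write. The following is an illustrative sketch, not product code:

```python
# Model of: bitmap write + MAX(local write, asynchronous queue write)
# All times are in milliseconds.

def async_service_time(bitmap_write, local_write, queue_write):
    """Estimated service time of one asynchronous Remote Mirror write."""
    return bitmap_write + max(local_write, queue_write)

# Assuming 5 ms for each write, as in the synchronous examples:
print(async_service_time(5, 5, 5))  # -> 10
```

Because the queue write is local, the service time is independent of network latency, which is the key advantage of asynchronous mode over long-distance links.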
If the network drain rate for the volume or consistency group is exceeded by the write rate for an extended period of time, the asynchronous queue fills up. Proper sizing is important so a method for estimating the appropriate volume size is discussed later in this document.
There are two modes that govern how the Remote Mirror software behaves in the event of the asynchronous disk queue filling:
In blocking mode, which is the default setting, the Remote Mirror software blocks, and waits for the asynchronous disk queue to drain to a certain point before adding the write to the asynchronous queue. This impacts application write operations, but maintains write ordering across the link.
In non-blocking mode (not available with memory-based queues), the Remote Mirror software does not block when the disk asynchronous queue fills, but drops into logging mode and scoreboards the write. On a subsequent update synchronization these are read from bit 0 forward, and there is no preservation of write ordering. If this mode is used and if the asynchronous disk queue fills and write ordering is lost, the associated volume or consistency group is inconsistent.
Note - It is strongly advised that a point-in-time copy be taken on the secondary site prior to starting the update synchronization, for example, using the autosync daemon.
In synchronous mode, write ordering for an application that spans many volumes is assured because the application waits for completion before issuing another I/O operation when ordering is required, and the Remote Mirror software does not signal completion until the write operation is on both primary and secondary sites.
In asynchronous mode, by default, the queue for each volume is drained by one or more independent threads. Because this operation is separated from the application, write ordering is not preserved across write operations to multiple volumes.
If write ordering is required for an application, the Remote Mirror software provides the consistency group feature. Each consistency group has a single network queue, and although multiple write operations are allowed in parallel, write ordering is preserved through the use of sequence numbers.
When you are planning for remote replication, consider your business needs, the application write loads, and your network's characteristics.
When you decide to replicate your business data, consider the maximum delay. How long out of date can you allow the data on the secondary site to become? This determines the replication mode and snapshot scheduling. Additionally, it is very important to know if the applications that you are replicating require the write operations to the secondary volume to be replicated in the correct order.
Understanding the average and peak write loads is critical to determining the type of network connection required between the primary and secondary sites. To make decisions about the configuration, collect the following information:
The average rate is the rate at which the application writes data while under typical load. Application read operations are not relevant to the provisioning and planning of your remote replication.
The peak rate is the largest amount of data written by the application over a measured duration.
The duration is how long the peak write rate lasts and the frequency is how often this condition occurs.
If these application characteristics are not known, you can measure them using tools such as iostat or sar to measure write traffic while the application is running.
When you know the application write load, determine the requirements of the network link. The most important network properties to consider are the network bandwidth and the network latency between the primary and secondary sites. If the network link already exists prior to installing the Sun StorageTek Availability Suite software, you can use tools such as ping to help determine the characteristics of the link between the sites.
To use synchronous replication, the network latency must be low enough that your application response time is not affected dramatically by the time of the network round trip of each write operation. Also, the bandwidth of the network must be sufficient to handle the amount of write traffic generated during the application's peak write period. If the network cannot handle the write traffic at any time, the application response time will be impacted.
To use asynchronous replication, the bandwidth of the network link must be able to handle the write traffic generated during the application's average write period. During the application peak write phase, the excess write operations are written to the local asynchronous queue and then written to the secondary site at a later time when the network traffic allows. The application response time can be minimized during bursts of write traffic above the network limit as long as the asynchronous queue is properly sized.
See the Configuring the Asynchronous Queue section of this document. The Remote Mirror asynchronous mode option selected (blocking or non-blocking) determines how the software reacts when the queue fills.
If you use asynchronous replication, plan for the configuration settings described in this section. These settings are set on a Remote Mirror set or consistency group basis.
In version 3.2 of the software, the Remote Mirror software added support for disk-based asynchronous queues. For ease of upgrading from previous versions, memory-based queues are still supported, but the new disk-based queues provide the ability to create significantly larger, more efficient queues. Larger queues allow for larger bursts of write activity without affecting application response time. Also, disk-based queues have less impact on system resources than the memory-based queues.
The asynchronous queue must be sufficient in size to handle the bursts of write traffic associated with the application peak write periods. A large queue can handle prolonged bursts of write activity, but also allows the possibility that the secondary site gets further out of sync with the primary. Using the peak write rate, peak write duration, write size, and network link characteristics, you can determine how the queue should be sized. See Setting the Correct Size for Disk-Based Asynchronous Queue.
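The sizing rule above can be sketched as a small calculator: while the application's write rate exceeds the network drain rate, the queue grows by the difference. The rates and duration below are hypothetical values chosen for illustration, not measurements from this document.

```python
# Backlog accumulated in the asynchronous queue during a write burst.

def queue_growth_bytes(peak_write_rate, drain_rate, duration_s):
    """Bytes added to the queue while writes outpace the network drain.
    Rates are in bytes/sec; duration is in seconds."""
    return max(0, peak_write_rate - drain_rate) * duration_s

MB = 1024 * 1024
# Hypothetical: 4 Mbyte/sec peak against a 2 Mbyte/sec drain, for one hour.
growth = queue_growth_bytes(4 * MB, 2 * MB, 3600)
print(growth / MB)  # -> 7200.0 (Mbytes of queue consumed by this burst)
```

A queue sized below this figure would fill during the burst, triggering the blocking or non-blocking behavior described later in this section.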
The queue option you select (blocking or non-blocking) determines how the software reacts to a filled disk queue. Use the dsstat tool to determine statistics for the asynchronous queue, including the high-water mark (hwm), which shows the largest amount of the queue that has been used. To add an asynchronous queue to a Remote Mirror set or consistency group, use the sndradm command with the -q option: sndradm -q a
Monitor the asynchronous queue using the dsstat(1SCM) command to check the high water mark (hwm). If the hwm frequently reaches 80 to 85 percent of the total size of the queue, the application is writing more data than the queue can drain in time, and you should increase the queue size. This principle applies to both disk-based and memory-based queues. However, the procedure to resize each queue type is different.
The effective size of the disk queue is the size of the disk queue volume. A disk queue can only be resized by replacing it with a volume of a different size. For example, for a queue size of 16384 blocks, check that the hwm does not exceed 13000 to 14000 blocks. If it exceeds this amount, resize the queue using the following procedure.
1. Place the volume into logging mode using the sndradm -l command.
2. Replace the disk queue volume with a larger volume.
3. Perform an update synchronization by using the sndradm -u command.
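The 80 to 85 percent guideline can be checked with a quick computation. This is an illustrative sketch; the 16384-block queue is the document's example size.

```python
# Alert threshold for the asynchronous queue high-water mark (hwm).

def hwm_threshold(queue_size_blocks, fraction=0.80):
    """Block count at which the hwm suggests the queue should be resized."""
    return int(queue_size_blocks * fraction)

print(hwm_threshold(16384, 0.80))  # -> 13107
print(hwm_threshold(16384, 0.85))  # -> 13926
```

Both results fall in the 13000-to-14000 block range the text cites for a 16384-block queue.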
1. Type the following to display the queue size:
The size of the queue in blocks is given by max q fbas (16384 blocks in this example). The maximum number of items allowed in the queue is given by max q writes (4096 in this example). In this example, this means that the average size of an item in the queue is 2 Kbytes.
The diskqueue volume is displayed (/dev/vx/rdsk/data_t3_dg/dq_single). The size of the queue can be determined by examining the size of the volume.
2. Type the following to show the current queue length and its hwm:
3. To show streaming summary and disk queue information, type:
4. To show more information, run dsstat(1SCM) with other display options.
Note - This example shows only a portion of the command output required for this section; the dsstat command actually displays more information.
The following dsstat(1SCM) kernel statistics output shows information about the asynchronous queue. In these examples, the queue is sized correctly and is not currently filled. This example shows the following settings and statistics:
# dsstat -m sndr -r n -d sq -s priv-2-230:/dev/vx/rdsk/data_t3_dg/vol67
Assuming the disk queue volume size is 1 Gbyte, or 2097152 disk blocks, the hwm of 1944 blocks is well below 80% full. The disk queue is sized correctly for the write load.
The following dsstat(1SCM) kernel statistics output shows information about the asynchronous queue, which is incorrectly sized:
This example shows the default queue settings, but the application is writing more data than the queue can handle. The qhwk value of 8184 Kbytes, compared to the max q fbas value of 16384 blocks (8192 Kbytes), indicates that the application is approaching the maximum allowed limit of 512-byte blocks. It is possible that the next few I/O operations will not be placed into the queue.
Increasing the queue size would be a solution in this case. However, consider improving the network link (such as using larger bandwidth interfaces) to achieve long-term benefits. Alternatively, consider taking point-in-time volume copies and replicating the shadow volumes. See the Sun StorageTek Availability Suite 4.0 Point-in-Time Copy Software Administration Guide.
Consider the following example, in which iostat was run at an hourly interval to profile the I/O load that will be replicated. Assume a DS3 (45 Mbit/sec) link. Also assume that this application uses a single consistency group; consequently, a single queue is involved.
After collecting statistics over a 24-hour period, and assuming that this is a typical day for the application in question, you can determine the average write rate, the proper sizing for the asynchronous queue, how far out of date the remote site may become over the course of the day, and whether the chosen network bandwidth is adequate for this application.
[TABLE 3-1, the hourly iostat write-rate profile (peak rate 4 Mbyte/sec), is not reproduced here.]
After filling in the table and calculating queue growth and size, it is evident that a 30Gbyte queue is sufficient. Although the queue grows large and, consequently, the secondary grows out of sync, a batch job run in the evenings ensures that the queue is empty by normal business hours and the two sites are in sync.
This exercise also validates that the network bandwidth is adequate for the write load the application produces.
The Sun StorageTek Availability Suite software provides the ability to set the number of threads flushing the asynchronous queue. Changing this number allows for multiple I/Os per volume or consistency group on the network at one time. The Remote Mirror software on the secondary node handles write ordering of the I/Os using sequence numbers.
Many variables must be considered when determining the number of queue flusher threads that is most efficient for your replication configuration. These variables include the number of sets or consistency groups, available system resources, network characteristics, and whether or not there is a file system. If you have a small number of sets or consistency groups, a larger number of flusher threads might be more efficient. It is recommended that you do some basic testing or prototyping with this variable at slightly different values to determine the most efficient setting for your configuration.
Knowledge of the configuration, network characteristics, and operation of the Remote Mirror software can provide guidelines for proper selection of the number of network threads. The Remote Mirror software utilizes Solaris RPCs, which are synchronous, as a transport mechanism. For each network thread, the maximum throughput the individual thread can achieve is I/O size / round trip time. Consider a workload that is predominately 2-Kbyte I/Os and a round trip time of 60 milliseconds. Each network thread would be capable of:
2Kbyte / 0.060 sec = 33Kbyte/sec
In the case where there is a single volume, or many volumes in a single consistency group, the default of 2 network threads would limit network replication to 66Kbyte/sec. Tuning this number up would be advisable. If the replication network were provisioned for 4Mbyte/sec, then theoretically, the optimal number of network threads for a 2Kbyte workload would be:
(4096Kbyte/sec) / (2Kbyte/0.060 IO/sec) = 123 threads
This assumes linear scalability. In practice it has been observed that adding more than 64 network threads yields no benefit. Consider the case where there is no consistency group, 30 volumes being replicated over a 4Mbyte/sec link, and 8Kbyte I/Os. The default of 2 network threads per volume would yield 60 network threads, and if the workload were spread evenly across these volumes, the theoretical bandwidth would be:
60 * (8Kbyte / 0.060 IO/sec) = 8Mbyte/sec
This is more than the network bandwidth. No tuning is required.
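The thread-count arithmetic above can be reproduced with a short sketch. The helper functions are hypothetical; the synchronous-RPC-per-thread model is taken from the text.

```python
# Each network thread issues synchronous RPCs, so its throughput is
# bounded by I/O size divided by round-trip time.

def thread_throughput_kb_s(io_size_kb, rtt_s):
    """Maximum throughput of a single network thread, in Kbytes/sec."""
    return io_size_kb / rtt_s

def threads_needed(link_kb_s, io_size_kb, rtt_s):
    """Thread count required to fill the link, assuming linear scaling."""
    return link_kb_s / thread_throughput_kb_s(io_size_kb, rtt_s)

# 2-Kbyte I/Os, 60 ms round trip, link provisioned for 4 Mbyte/sec:
print(round(threads_needed(4096, 2, 0.060)))          # -> 123
# 30 volumes x 2 default threads, 8-Kbyte I/Os, in Kbytes/sec:
print(round(60 * thread_throughput_kb_s(8, 0.060)))   # -> 8000
```

As the text notes, scaling is not linear in practice: adding more than 64 network threads has been observed to yield no benefit, so the 123-thread figure is a theoretical upper bound, not a recommended setting.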
The default setting for the number of asynchronous queue flusher threads is 2. To change this setting, use the sndradm command line interface with the -A option, which specifies the maximum number of threads that can be created to process the asynchronous queue when a set is replicating in asynchronous mode (default 2).
To determine the number of flusher threads that are currently configured to serve an asynchronous queue, you can use the sndradm -P command. For example, you can see that the set below has 2 asynchronous flusher threads configured.
An example of how to use the sndradm -A option to change the number of asynchronous queue flusher threads to 3 is:
The Remote Mirror software injects itself directly into the system's I/O path, monitoring all traffic to determine whether it is targeted to Remote Mirror volumes. The I/O commands that are targeted for Remote Mirror volumes are tracked, and replication of these write operations is managed. Because the Remote Mirror software is directly in the system's I/O path, some performance impact to the system is expected. Additional TCP/IP processing that is required for network replication also consumes host CPU resources. Perform the procedures in this section on the primary and secondary Remote Mirror hosts.
The TCP buffer size is the number of bytes that the Transmission Control Protocol (TCP) allows to be transferred before it waits for an acknowledgment. To get maximum throughput, it is critical to use optimal TCP send and receive socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never fully open. If the receiver buffers are too large, TCP flow control breaks and the sender can overrun the receiver, causing the TCP window to shut down. This event is likely to happen if the sending host is faster than the receiving host. Overly large windows on the sending side are not a problem as long as you have excess memory.
TABLE 3-2 shows the maximum possible throughput for a 100BASE-T network.
[TABLE 3-2 is not reproduced here; one entry shows 18.75 Mbyte/sec.]
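A common starting point for the buffer sizes discussed above is the bandwidth-delay product of the link: the amount of data that can be in flight before the first acknowledgment returns. The 20 ms round-trip time below is a hypothetical value for illustration, not a figure from this document.

```python
# Bandwidth-delay product: bytes in flight on the link during one round trip.

def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Suggested minimum TCP socket buffer size for the given link."""
    return bandwidth_bits_per_s // 8 * rtt_ms // 1000

# 100BASE-T (100 Mbit/sec) with a hypothetical 20 ms round trip:
print(bdp_bytes(100_000_000, 20))  # -> 250000
```

A result of roughly 250000 bytes suggests buffers in the 256-Kbyte region, which is consistent with the 256-Kbyte buffer example shown later in this section.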
You can view and tune your TCP buffer size by using /usr/bin/netstat(1M) and /usr/sbin/ndd(1M) commands. TCP parameters to consider tuning include:
When you change one of these parameters, restart the Remote Mirror software with the shutdown command, allowing the software to use the new buffer size. However, after you shut down and restart your server, the TCP buffers return to a default size. To keep your change, set the values in a startup script as described later in this section.
Following are procedures to view TCP buffers and values.
Type the following to view all TCP buffers:
Type the following to view settings by buffer name:
This command shows a value of 1073741824.
Use the /usr/bin/netstat(1M) command to view the buffer size for a particular network socket.
For example, view the size for port 121, the default Remote Mirror port:
The value 263536 shown in this example is the 256 Kbyte buffer size. It must be set identically in the primary and secondary hosts.
Note - Create this script on the primary and secondary hosts.
1. Create the script file in a text editor using the following values:
2. Save the file as /etc/rc2.d/S68ndd and exit the file.
3. Set the permissions and ownership to the /etc/rc2.d/S68ndd file.
4. Shut down and restart your server.
5. Verify the size as shown in To View Buffer Sizes for a Socket.
The Remote Mirror software on both the primary and secondary nodes listens on a well-known port specified in /etc/services, port 121. Remote Mirror write traffic flows from the primary to the secondary site over a socket with an arbitrarily assigned address on the primary site and the well-known address on the secondary site. The health monitoring heartbeat travels over a different connection, with an arbitrarily assigned address on the secondary site and the well-known address on the primary site. The Remote Mirror protocol utilizes Sun RPCs over these connections.
Port 121 is the default TCP port for use by the Remote Mirror sndrd daemon. To change the port number, edit the /etc/services file using a text editor.
If you change the port number, you must change it on all Remote Mirror hosts within this configuration set (that is, primary and secondary hosts, and all hosts in one-to-many, many-to-one, and multihop configurations). In addition, you must shut down and restart all affected hosts so that the port number change can take effect.
Because RPCs require an acknowledgment, the firewall must be opened to allow the well-known port address to be in either the source or destination fields of the packet. If the option is available, be sure to configure the firewall to allow RPC traffic as well.
In the case of write replication traffic, packets destined for the secondary site will have the well-known port number in the destination field; acknowledgments of these RPCs will contain the well-known address in the source field.
For health monitoring, the heartbeat will originate from the secondary with the well-known address in the destination field, and the acknowledgment will contain this address in the source field.
To help ensure the highest level of data integrity and system performance on both sites during normal operations, the Sun StorageTek Availability Suite 4.0 Point-in-Time Copy software is recommended for use in conjunction with Remote Mirror software.
A point-in-time copy can be replicated to a physically remote location, providing a consistent copy of the volume as part of an overall disaster recovery plan. This is commonly referred to as batch replication, and the process and advantages of this practice are described in the best practice guide: Sun StorageTek Availability Suite Software-Improving Data Replication over a Highly Latent Link.
The point-in-time copy of a Remote Mirror secondary volume can be established prior to starting synchronization of a secondary volume from the primary site (the site the primary volume is hosted from). Protection against double failure is provided by enabling the Point-in-Time Copy software to create a point-in-time copy of the replicated data at the secondary site before beginning resynchronization. If a subsequent failure occurs during resynchronization, the point-in-time copy can be used as a fallback position, and resynchronization can be resumed when the subsequent failure issues have been resolved. Once the secondary site is fully synchronized with the primary site, the Point-in-Time Copy software volume set can be disabled, or put to other uses, such as remote backup, remote data analysis, or other functions required at the secondary site.
The Point-in-Time Copy software I/O performed internally during an enable, copy, or update operation can alter the contents of the shadow volume without any new I/O coming down the I/O stack. When this happens, the I/O is not intercepted in the sv layer. If the shadow volume is also a Remote Mirror volume, the Remote Mirror software will not see these I/O operations either. In this situation, the data modified by the I/O will not be replicated to the target Remote Mirror volume.
To allow this replication to occur, the Point-in-Time Copy software can be configured to offer the Remote Mirror software the changed bitmap. If the Remote Mirror software is in logging mode, it accepts the bitmap and performs an OR comparison of the Point-in-Time Copy software bitmap with its own bitmap for that volume, adding the Point-in-Time Copy software changes to its own list of changes to be replicated to the remote node. If the Remote Mirror software is in replication mode for the volume, it rejects the bitmap from the Point-in-Time Copy software. This, in turn, will fail the enable, copy, or update operation. Once Remote Mirror logging has been re-enabled, the Point-in-Time Copy software operation can be reissued.
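The bitmap hand-off described above amounts to a bitwise OR of two scoreboards. The sketch below is a hypothetical illustration of that merge, not the Point-in-Time Copy software's actual implementation.

```python
# Merge Point-in-Time Copy changes into the Remote Mirror scoreboard.
# A set bit marks a block that must be replicated on the next update sync.

def merge_scoreboards(rm_bitmap: bytes, pit_bitmap: bytes) -> bytes:
    """OR the two scoreboards; both must cover the same volume."""
    if len(rm_bitmap) != len(pit_bitmap):
        raise ValueError("bitmaps must be the same length")
    return bytes(a | b for a, b in zip(rm_bitmap, pit_bitmap))

rm = bytes([0b00001100, 0b00000000])    # blocks already dirty in logging mode
pit = bytes([0b00000001, 0b10000000])   # blocks changed by the point-in-time copy
merged = merge_scoreboards(rm, pit)
print(bin(merged[0]), bin(merged[1]))   # -> 0b1101 0b10000000
```

Because the merge only ever sets additional bits, the subsequent update synchronization replicates a superset of the changed blocks, which is safe even if some blocks are copied unnecessarily.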
The Remote Mirror software enables you to create one-to-many, many-to-one, and multihop volume sets.
Any combination of the above configurations is also supported by the Remote Mirror software.