Sun Directory Server Enterprise Edition 7.0 Troubleshooting Guide

Chapter 3 Troubleshooting Replication

This chapter provides information to help you troubleshoot problems with replication. It also contains a procedure to help you reinitialize your entire topology. It includes the following sections:

Analyzing Replication Problems

This section guides you through the general process of analyzing replication problems. It provides information about how replication works and tools you can use to collect replication data.

Overview of Replication Data Collection

When a replication error occurs, you need to collect a minimum set of data from across your replication topology.

Setting the Replication Logging Level

You need to collect information from the access, errors, and, if available, audit logs. Before you collect the errors logs, adjust the log level so that replication information is recorded. To set the error log level to include replication, use the following command:


# dsconf set-log-prop ERROR level:err-replication

Using the insync Command

The insync command provides information about the state of synchronization between a supplier replica and one or more consumer replicas. This command compares the RUVs of replicas and displays the time difference or delay, in seconds, between the servers.

For example, the following command shows the state every 30 seconds:


$ insync -D cn=admin,cn=Administrators,cn=config -w mypword \
 -s portugal:1389 30

ReplicaDn         Consumer                Supplier        Delay
dc=example,dc=com france.example.com:2389 portugal:1389   0
dc=example,dc=com france.example.com:2389 portugal:1389   10
dc=example,dc=com france.example.com:2389 portugal:1389   0

You analyze the output for the points at which the replication delay stops being zero. In the above example, we see that there may be a replication problem between the consumer france.example.com and the supplier portugal, because the replication delay changes to 10, indicating that the consumer is 10 seconds behind the supplier. We should continue watching the evolution of this delay. If it stays more or less stable or decreases, we can conclude that there is no problem. However, a replication halt is probable when the delay increases over time.

For more information about the insync command, see insync(1).

Using the repldisc Command

The repldisc command displays the replication topology, building a graph of all known replicas using the RUVs. It then prints an adjacency matrix that describes the topology. Because the output of this command shows the machine names and their connections, you can use it to help you read the output of the insync tool. You run this command on 6.0 and later versions of Directory Server as follows:


# repldisc -D "cn=Directory Manager" -w password -b replica-root -s host:port

The following example shows the output of the repldisc command:


$ repldisc -D cn=admin,cn=Administrators,cn=config -w pwd \
  -b o=rtest -s portugal:1389

Topology for suffix: o=rtest

Legend:
^ : Host on row sends to host on column.
v : Host on row receives from host on column.
x : Host on row and host on column are in MM mode.
H1 : france.example.com:1389
H2 : spain:1389
H3 : portugal:389

   | H1 | H2 | H3 |
===+===============
H1 |    | ^  |    |
---+---------------
H2 | v  |    | ^  |
---+---------------
H3 |    | v  |    |
---+---------------

Example: Troubleshooting a Replication Problem Using RUVs and CSNs

In this example, two masters replicate to three hubs, which in turn replicate to five consumers:

[Figure: Multi-master replication topology with two masters, three hubs, and five consumers.]

Replication is not working, and fatal errors appear in the log on consumer 4.

However, because replication is a topology-wide feature, we look to see whether other consumers in the topology are also experiencing a problem, and we see that consumers 3 and 5 also have fatal errors in their errors logs. Using this information, we see that the potential participants in the problem are consumers 3, 4, and 5, hubs 2 and 3, and masters 1 and 2. We can safely assume that consumers 1 and 2 and hub 1 are not involved.

To debug this problem, we need to collect at least the following information from the following replication participants:

With this data we can now identify where the delays start. Looking at the output of the insync command, we see delays from hub 2 of 3500 seconds, so this is likely where the problem originates. Now, using the RUV in the nsds50ruv attribute, we can find the operation that is at the origin of the delay. We look at the RUVs across the topology to see the last CSN to appear on a consumer. In our example, master 1 has the following RUV:


replica 1: CSN05-1 CSN91-1
replica 2: CSN05-2 CSN50-2

Master 2 contains the following RUV:


replica 2: CSN05-2 CSN50-2
replica 1: CSN05-1 CSN91-1

They appear to be perfectly synchronized. Now we look at the RUV on consumer 4:


replica 1: CSN05-1 CSN35-1
replica 2: CSN05-2 CSN50-2

The problem appears to be related to the change immediately following the change associated with CSN 35 on master 1. The change associated with CSN 35 corresponds to the most recent CSN replicated to consumer 4. By using the grep command on the access logs of the replicas to search for CSN35-1, we can find the time around which the problem started. Troubleshooting should begin from this particular point in time.
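
For example, assuming replication logging is enabled and the access logs are in their default locations, searches such as the following can locate the time around which the change was processed on each replica. The instance paths are examples only, and the CSN shown uses the shorthand of this example; in real logs, the CSN is a hexadecimal string:


# grep csn=CSN35-1 /local/ds/master1/logs/access
# grep csn=CSN35-1 /local/ds/hub2/logs/access
# grep csn=CSN35-1 /local/ds/consumer4/logs/access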

As discussed in Defining the Scope of Your Problem, it can be helpful to have information from a system that is working to help identify where the trouble occurs. So we collect data from hub 1 and consumer 1, which are functioning as expected. Comparing the data from the servers that are functioning, focusing on the time when the trouble started, we can identify differences. For example, maybe the hub is being replicated from a different master or a different subnet, or maybe it contains a different change just before the change at which the replication problem occurred.

Possible Symptoms of a Replication Problem and How to Proceed

Depending on the symptoms of your problem, your troubleshooting actions will be different.

For example, if you see nothing in the access logs of the consumers, a network problem may be the cause of the replication failure. Reinitialization is not required.

If the error log shows that it cannot find a particular entry in the change log, the master's change log is not up-to-date. You may or may not need to reinitialize your topology, depending upon whether you can locate an up-to-date change log somewhere in your replication topology (for example, on a hub or other master).

If the consumer has problems, for example, if it experiences processing loops or lock aborts, look in the access log for a large number of retries for a particular CSN. Run the replcheck tool to locate the CSN at the root of the replication halt and to repair this entry in the change log.
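
For example, one quick way to spot retries is to count how often a given CSN appears in the consumer access log; a large and growing count suggests that the consumer is retrying the same update. The CSN and log path below are placeholders:


# grep -c csn=xxxxxxxx instance-path/logs/access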

Troubleshooting a Replication Halt or Replication Divergence

This section describes how to troubleshoot a replication halt and replication divergence. It includes the following topics:

Possible Causes of a Replication Halt

The replication halt could be caused by one of the following:

Possible Causes of a Replication Divergence

Replication divergence could be caused by one of the following:

Collecting Data About a Replication Halt or Replication Divergence

This section describes how to collect information to help you troubleshoot a replication halt or replication divergence.

Collecting Error and Change Logs

Collect the errors logs from the consumer that is not receiving the changes, as well as from the supplier of that consumer. By default, the errors logs are located in the following directory:

instance-path/logs/errors

If the errors log is not in the default location, find the path to the log using the dsconf command as follows:


# dsconf get-log-prop -h host -p port ERROR path

The errors log must have the replication logging level enabled. You can enable this logging level by using DSCC, or by using the command line as follows:


# dsconf set-log-prop -h host -p port ERROR level:err-replication

You should also provide a listing of the supplier's change log, which is located in the same directory as the database. To find the path to your database, use the dsconf command as follows:


# dsconf get-suffix-prop -h host -p port suffix-dn db-path
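
Then, assuming the change log files reside in the directory returned by the previous command, a simple listing captures their names and sizes. The path shown here is an example; substitute the db-path value returned for your suffix:


# ls -l /local/ds/instance/db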

Collecting Data Using the insync and repldisc Commands

Use the output of the insync and repldisc commands to help troubleshoot your replication divergence.

The insync command indicates the state of synchronization between a master replica and one or more consumer replicas and can help you identify bottlenecks. If you are troubleshooting a problem with replication divergence, this data must be periodic. For more information, see Using the insync Command.

If you identify a bottleneck using the insync command, for example a bottleneck that results from an increasing delay as reported by the tool, it is helpful to start collecting nsds50ruv and ds6ruv attribute data. This data can help you identify when and where the potential halt is taking place. For more information about Replica Update Vectors (RUVs), see Replica Update Vector in Sun Directory Server Enterprise Edition 7.0 Reference.
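
Because this data must be collected periodically, a small shell loop can help. The following is a minimal sketch; the host, port, credentials, and the suffix in the base DN are examples that you must adapt to your topology:


#!/bin/sh
# Poll the RUVs every 30 seconds and timestamp each sample.
while true; do
    date
    ldapsearch -h host1 -p 1389 -D "cn=Directory Manager" -w secret \
        -b 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config' \
        "(objectclass=*)" nsds50ruv ds6ruv
    sleep 30
done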

The repldisc command displays the replication topology, building a graph of all known replicas, then showing the results as a matrix. For more information, see Using the repldisc Command.

Collecting Information About the Network and Disk Usage

Try to determine if the replication halt is network related using the netstat command on both the consumer and supplier as follows:


# netstat -an | grep port

A replication halt may be the result of the network if a consumer is not receiving information despite the fact that access logs show that the supplier is sending updates. Running the ping and traceroute commands can also help you determine if network latency is responsible for the problem.
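
For example, the following checks can help rule network latency in or out; the consumer host name is a placeholder:


# ping consumer-host
# traceroute consumer-host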

Collect swap information to see if you are running out of memory. Memory may be your problem if the amount of free swap reported by the following commands is small.

Solaris: swap -l
HP-UX: swapinfo
Linux: free
Windows: already provided in C:\report.txt

Try to determine if the disk controllers are fully loaded and if input/output is the cause of your replication problems. To determine if your problem is disk related, use the iostat tool as follows:


# iostat -xnMCz -T d 10

The iostat command iteratively reports terminal, disk, and tape input/output activity and can be helpful in determining if a replication divergence event results from a saturated disk on the consumer side.

Analyzing Replication Halt Data

Use the data you collected to determine if the replication halt is the result of a problem on the supplier or the consumer.

Use the nsds50ruv attribute output that you collected to determine the last CSN that was replicated to a particular consumer. Then, use the consumer's access and errors logs, with the logs set to collect replication level output, to determine the last CSN that was replicated. From this CSN, you can determine the next CSN that the replication process is failing to provide. For example, replication may be failing because the supplier is not replicating the CSN, because the network is blocking the CSN, or because the consumer is refusing to accept the update.
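
For example, you can read a consumer's RUV with a search such as the following. The base DN shows where the replica entry for a suffix dc=example,dc=com typically resides, but the host, port, credentials, and suffix are examples to adapt:


# ldapsearch -h consumer-host -p 1389 -D "cn=Directory Manager" -w secret \
  -b 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config' \
  "(objectclass=*)" nsds50ruv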

The CSN may not be applied on the consumer. Try to grep for the CSN that the supplier cannot update on the consumer, as follows:


grep csn=xxxxxxxx consumer-access-log

If you do not find the CSN, try searching for the previous successful CSN committed to the supplier and consumer that are currently failing. Using CSNs, you can narrow your search for the error.

By using the grep command to search for CSNs in the access and errors logs, you can determine if an error is only transient. Always match the error messages in the errors log with the corresponding access log activity.

If analysis proves that replication is always looping on the same CSN with an etime=0 and an err=32 or err=16, the replication halt is likely the result of a critical error. If the replication halt arises from a problem on the consumer, you can run the replcheck tool to fix the problem by patching the contents of the looping entry in the physical database.

If instead analysis shows that the consumer logs contain no trace of the CSN, then the problem is likely the result of something on the supplier side or on the network. If the problem originates with the supplier, you can sometimes restart replication by forcing the replication agreement to send updates to the remote replica or by restarting the supplier. Otherwise, a reinitialization may be required.

To force updates to the remote replica from the local suffix, use the following command:


# dsconf update-repl-dest-now -h host -p port suffix-DN host:port


Resolving a Problem With the Schema

If the error log contains messages indicating a problem with the schema, then collect further schema-related information. Before changes are sent from a supplier to a consumer, the supplier verifies that the change adheres to the schema. When an entry does not comply with the schema and the supplier tries to update this entry, a loop can occur.

To remedy a problem that arises because of the schema, designate a single supplier to act as the master reference for the schema. Take the contents of its /install-path/resources/schema directory and tar the directory as follows:


# tar -cvf schema.tar schema

Use FTP to transfer this tar file to all of the other suppliers and consumers in your topology. On each server, remove the /install-path/resources/schema directory and replace it with the schema directory extracted from the tar file that you created on the master schema reference.
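
For example, the replacement on each server might look like the following sketch. The paths are illustrative, the server should be stopped first, and keeping a backup of the old directory is safer than removing it outright:


# cd /install-path/resources
# mv schema schema.bak
# tar -xvf schema.tar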

Analyzing Replication Divergence Data

Try to determine if the replication divergence is a result of low disk performance on the consumer using the output of the iostat tool. For more information about diagnosing disk performance problems, see Example: Troubleshooting a Replication Problem Using RUVs and CSNs.

Replication divergence is typically the result of one of the following:

Advanced Topic: Using the replcheck Tool to Diagnose and Repair Replication Halts

Advanced users can use the replcheck tool to check and repair replication on Directory Server. We strongly recommend that you use this tool with the guidance of Sun Support. The tool collects valuable information that Sun Support can use during problem diagnosis and can repair several types of replication halt directly. This tool is located in the install-path/bin/support_tools/ directory.

For more information about the replcheck command, see replcheck(1M).

Diagnosing Problems with replcheck

When run in diagnosis mode, the replcheck tool diagnoses the cause of the replication breakage and summarizes the proposed repair actions. It compares the RUVs for each of the servers in your replication topology to determine whether the masters are synchronized. If the search results show that all of the consumer replica in-memory RUVs are either evolving over time, or are not evolving but are equal to those on the supplier replicas, the tool concludes that a replication halt is not occurring.

To diagnose a replication problem, run the replcheck tool as follows:


replcheck diagnose topology-file

The topology-file specifies the path to a file that contains one record per line in the following format: hostname:port:suffix_dn[:label]. The optional label field provides a name that appears in any messages that are displayed or logged. If you do not specify a label, hostname:port is used instead.

For example, the following topology file describes a replication topology consisting of two hosts:


host1:389:dc=example,dc=com:Paris
host2:489:dc=example,dc=com:New York

Repairing Replication Failures With replcheck

If the replcheck diagnose command determines that a replication halt is occurring, you can run the replcheck fix subcommand to repair the halt. For example, the command determines that replication is blocked on the entry associated with CSN 24 if a supplier has a CSN of 40 while the consumer remains at CSN 23, a value that does not evolve over time.

To repair a replication halt, run the replcheck fix command as follows:


replcheck fix topology-file

Troubleshooting Replication Problems

Refer to the following sections to troubleshoot replication using nsds50ruv and ds6ruv attributes.

Using the nsds50ruv Attribute to Troubleshoot 5.2 Replication Problems

When a server stops, the nsds50ruv attribute is not stored in the cn=replica entry. Instead, at least every 30 seconds, it is stored in the database as an LDAP subentry whose DN is nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name. This information is stored in the suffix instead of in the configuration file because this is the only way to export it into a file. When you initialize a topology, the servers are offline: the data is exported into an LDIF file and then reimported. If this attribute were not stored in the exported file, the new replica would not have the correct information after an import.

Whenever you use the db2ldif -r command, the nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name entry is included.
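
For example, an export that preserves the replication information might look like the following; the backend name userRoot and the output path are assumptions for illustration:


# db2ldif -r -n userRoot -a /tmp/replica-export.ldif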

Using the nsds50ruv and ds6ruv Attributes to Troubleshoot Replication Problems

In 6.0 and later versions of Directory Server, you can also use the nsds50ruv attribute to see the internal state of the consumer, as described in the previous section. If you are using the replication priority feature, you can use the ds6ruv attribute, which contains information about the priority operations. When replication priority is configured, you create replication rules to specify that certain changes, such as updates to the user password, are replicated with high priority. For example, the RUV appears as follows:


# ldapsearch -h host1 -p 1389 -D "cn=Directory Manager" -w secret \
-b "cn=replica" "(objectclass=*)"

nsds50ruv: {replicageneration} 4405697d000000010000
nsds50ruv: {replica 2 ldap://server1:2389}
nsds50ruv: {replica 1 ldap://server1:1390} 440569aa000000010000 44056a23000200010000
ds6ruv: {PRIO 2 ldap://server1:2389}
ds6ruv: {PRIO 1 ldap://server1:1390} 440569b6000100010000 44056a30000800010000 

Reinitializing a Topology

This section describes how to analyze your topology to determine which systems need to be reinitialized. It also describes the methods you can use to reinitialize your replication topology.


Note –

When a replica has been reinitialized, all of its consumer replicas must also be reinitialized.


Determining What to Reinitialize

When you reinitialize your topology, you take a good copy of the data from a supplier and overwrite the bad data on the consumers in the topology. Before you reinitialize your topology, determine which systems are unsynchronized and need to be reinitialized. This critical step can prevent you from wasting time by overwriting data that is already synchronized.

For example, the following figure illustrates a topology where replication is broken on hub 1.

[Figure: Replication topology in which replication is broken on hub 1.]

Because hub 1 provided data to consumers A and B, you need to reinitialize hub 1, consumer A, and consumer B.

In the following example, consumers A and B also receive updates from hub 2.

[Figure: Replication deployment in which hubs 1 and 2 provide updates to both consumers A and B.]

Consumers A and B may be synchronized with the supplier of the reinitialized replica because they receive updates from both hubs. Their status depends on which replica you select to reinitialize your topology. If you use RUVs to ensure that you have the latest changes, then these replicas may be up-to-date and you may not need to reinitialize consumers A and B.
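
For example, you can compare the RUVs of the candidate supplier and of consumers A and B before deciding. The following is a minimal sketch; the host names, port, credentials, and suffix in the base DN are examples to adapt to your topology:


#!/bin/sh
# Print the RUV held by each server so that they can be compared side by side.
for h in supplier-host consumerA-host consumerB-host; do
    echo "== $h =="
    ldapsearch -h "$h" -p 1389 -D "cn=Directory Manager" -w secret \
        -b 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config' \
        "(objectclass=*)" nsds50ruv
done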

Doing a Clean Reinitialization

All of the reinitialization methods copy unnecessary data, for example, data that contains deleted values, state information, or other historical data. This unnecessary data makes entries larger on disk. Also, the entry state information may need to be purged. If the root cause of the replication problem is related to this state information, the data is still present in the database and can cause another replication error. To avoid importing this unnecessary and potentially problematic data, you can do a clean reinitialization of your topology.

When you do a clean reinitialization, you create a clean master copy of the data that contains smaller databases, indexes, and empty change logs. A clean reinitialization uses less disk space and takes less time because it does not make backup copies of the database files. It also reduces index fragmentation, which would otherwise degrade performance. However, it requires you to stop the server that is being cloned to ensure that the database files are in a coherent state.

To Create Clean Master Data in Directory Server

  1. Stop the master server.

  2. Export the database contents using the dsadm command.

    Specify the -Q option so that replication information is not included in the export.


    # dsadm export -Q instance-path suffix-DN /tmp/clean-export.ldif
  3. Reimport the exported data to the same master server using the dsadm command.


    # dsadm import instance-path /tmp/clean-export.ldif suffix-DN
    
  4. Restart the master server.

    The master server now contains clean data, meaning it contains smaller databases, indexes, and empty change logs.

  5. Import the clean master data to all of the other servers in your topology.

To Reinitialize a Suffix Using the DSCC

This method requires a replication agreement between the supplier and consumer suffixes. Use this method to reinitialize a single suffix or to reinitialize many small suffixes.


Note –

If you are using an earlier version of the Directory Server console, go to the Configuration panel and select the Replication node. Select the suffix that you want to initialize on the consumer, then select the replication agreement to that consumer. Right-click the agreement and select Initialize consumer now.


  1. On the supplier server, log in to DSCC.

  2. Click the Directory Servers tab, then click the Suffixes tab.

  3. In the Suffixes tab, select the suffix or suffixes that you need to reinitialize.

    Select Initialize Suffix from Data from the drop-down menu.

  4. In Step 1, select Initialize Using Existing Replication Agreements.

  5. In Step 2, specify the supplier suffix from which you want to copy the data.

  6. Verify that the import is complete by checking the errors log of the consumers.