Troubleshooting a Replication Halt or Replication Divergence

This section describes how to troubleshoot a replication halt and replication divergence. It includes the following topics:

Possible Causes of a Replication Halt
Possible Causes of a Replication Divergence
Collecting Data About a Replication Halt or Replication Divergence
Analyzing Replication Halt Data
Analyzing Replication Divergence Data
Advanced Topic: Using the replcheck Tool to Diagnose and Repair Replication Halts
Troubleshooting Replication Problems

Possible Causes of a Replication Halt

The replication halt could be caused by one of the following:

Replication agreement disabled
Supplier missing the change record in its change log
Supplier change log cache corrupted
Replication manager using invalid credentials
Schema conflicts
Unallowed operation on the consumer due to Update Resolution Protocol (URP) conflicts
Network disconnection
Consumer state locked down by an unavailable supplier
Consumer out of disk

Possible Causes of a Replication Divergence

Replication divergence could be caused by one of the following:

Consumer power lower than the supplier
Consumer disks getting to their upper read/write limits
Intermittent network and packet dropping issues
Change log in-memory cache not being used

Collecting Data About a Replication Halt or Replication Divergence

This section describes how to collect information to help you troubleshoot a replication halt or replication divergence.

Collecting Error and Change Logs

Collect the errors logs from both the consumer that is not receiving the changes and the supplier of this consumer. By default, the errors logs are located in the following directory:

instance-path/logs/errors

If the errors log is not in the default location, find the path to the log using the dsconf command as follows:

# dsconf get-log-prop -h host -p port ERROR path

The errors log must have the replication logging level enabled. You can use the DSCC to enable the replication logging level or enable it using the command line as follows:

# dsconf set-log-prop -h host -p port ERROR level:err-replication

You should also provide a listing of the supplier's change log, which is located in the same directory as the database. To find the path to your database, use the dsconf command as follows:

# dsconf get-suffix-prop -h host -p port suffix-dn db-path

Collecting Data Using the `insync` and `repldisc` Commands

Use the output of the insync and repldisc commands to help troubleshoot your replication divergence.

The insync command indicates the state of synchronization between a master replica and one or more consumer replicas and can help you identify bottlenecks. If you are troubleshooting a problem with replication divergence, this data must be periodic. For more information, see Using the insync Command.

If you identify a bottleneck using the insync command, for example a bottleneck that results from an increasing delay as reported by the tool, it is helpful to start collecting nsds50ruv and ds6ruv attribute data. This data can help you identify when and where the potential halt is taking place. For more information about Replica Update Vectors (RUVs), see Replica Update Vector in Oracle Directory Server Enterprise Edition Reference.
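As a rough illustration of what "an increasing delay" means for periodic samples, the following Python sketch fits a straight line to a series of delay values and reports whether the delay is trending upward. It assumes that you have already copied the delay values (in seconds) out of the insync output by hand and that the samples were taken at a regular interval. The function names `delay_trend` and `is_diverging` are hypothetical and are not part of Directory Server.

```python
from typing import Sequence


def delay_trend(delays: Sequence[float]) -> float:
    """Return the least-squares slope of delay versus sample index.

    The result is expressed in seconds of extra delay per sample.
    Assumes the samples were collected at a regular interval.
    """
    n = len(delays)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(delays) / n
    num = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(delays))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_diverging(delays: Sequence[float], min_slope: float = 0.0) -> bool:
    """Return True when the delay grows faster than min_slope per sample."""
    return delay_trend(delays) > min_slope
```

A steady non-zero delay gives a slope of zero and is not flagged; only a delay that keeps growing from sample to sample is. The `min_slope` threshold is an arbitrary knob for ignoring small fluctuations.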

The repldisc command displays the replication topology, building a graph of all known replicas, then showing the results as a matrix. For more information, see Using the repldisc Command.

Collecting Information About the Network and Disk Usage

Try to determine if the replication halt is network related using the netstat command on both the consumer and supplier as follows:

# netstat -an | grep port

A replication halt may be the result of the network if a consumer is not receiving information despite the fact that access logs show that the supplier is sending updates. Running the ping and traceroute commands can also help you determine if network latency is responsible for the problem.

Collect swap information to see if you are running out of memory. Memory may be the problem if the swap command reports little free swap space. Use the command for your platform:

Solaris	`swap -l`
HP-UX	`swapinfo`
Linux	`free`
Windows	Already provided in `C:\report.txt`

Try to determine if the disk controllers are fully loaded and if input/output is the cause of your replication problems. To determine if your problem is disk related, use the iostat tool as follows:

# iostat -xnMCz -T d 10

The iostat command iteratively reports terminal, disk, and tape input/output activity and can be helpful in determining if a replication divergence event results from a saturated disk on the consumer side.

Analyzing Replication Halt Data

Use the data you collected to determine if the replication halt is the result of a problem on the supplier or the consumer.

Use the nsds50ruv attribute output that you collected to determine the last CSN that was replicated to a particular consumer. Then, use the consumer's access and errors logs, with the logs set to collect replication level output, to determine the last CSN that was replicated. From this CSN, you can determine the next CSN that the replication process is failing to provide. For example, replication may be failing because the supplier is not replicating the CSN, because the network is blocking the CSN, or because the consumer is refusing to accept the update.

The CSN might not be applicable on the consumer. Search the consumer's access log for the CSN that the supplier cannot update, as follows:

grep csn=xxxxxxxx consumer-access-log

If you do not find the CSN, search for the last CSN that was successfully committed on both the supplier and the consumer that are currently failing. Using CSNs, you can narrow your search for the error.

By using the grep command to search for CSNs in the access and errors logs, you can determine if an error is only transient. Always match the error messages in the errors log with its corresponding access log activity.

If analysis proves that replication is always looping on the same CSN with an etime=0 and an err=32 or err=16, the replication halt is likely to be a critical error. If the replication halt arises from a problem on the consumer, you can run the replcheck tool to fix the problem by patching the contents of the looping entry in the physical database.
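The search for a looping CSN can be sketched in Python as follows. This is only an illustration: it assumes that each relevant access log line carries `csn=` and `err=` tokens on the same line, which you should confirm against your own logs. The function name `failing_csns` and the `min_repeats` threshold are hypothetical.

```python
import re
from collections import Counter

# Assumed token shapes; verify them against your own access log.
CSN_RE = re.compile(r"\bcsn=([0-9a-fA-F]+)")
ERR_RE = re.compile(r"\berr=(\d+)")


def failing_csns(lines, min_repeats=3):
    """Return {csn: count} for CSNs logged with a non-zero err code
    at least min_repeats times."""
    counts = Counter()
    for line in lines:
        csn = CSN_RE.search(line)
        err = ERR_RE.search(line)
        if csn and err and err.group(1) != "0":
            counts[csn.group(1)] += 1
    return {csn: n for csn, n in counts.items() if n >= min_repeats}
```

Because the function accepts any iterable of lines, you can pass it an open file object for the consumer's access log. A CSN that fails once and then succeeds is more likely a transient error; a CSN that keeps reappearing with the same error code is a candidate for the looping case described above.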

If instead analysis proves that replication is not providing any report of the CSN in the consumer logs, then the problem is likely the result of something on the supplier side or network. If the problem originates with the supplier, you can sometimes restart replication by forcing the replication agreement to send updates to the remote replica or by restarting the supplier. Otherwise, a reinitialization may be required.

To force updates to the remote replica from the local suffix, use the following command:

# dsconf update-repl-dest-now -h host -p port suffix-DN host:port

Resolving a Problem With the Schema

If the errors log contains messages indicating a problem with the schema, then collect further schema-related information. Before changes are sent from a supplier to a consumer, the supplier verifies that the change adheres to the schema. When an entry does not comply with the schema and the supplier tries to update this entry, a loop can occur.

To remedy a problem that arises because of the schema, get a single supplier that can act as the master reference for schema. Take the contents of its /install-path/resources/schema directory. Tar the directory as follows:

# tar -cvf schema.tar schema

Use FTP to export this tar file to all of the other suppliers and consumers in your topology. Remove the /install-path/resources/schema directory on each of the servers and replace it with the contents of the tar file you created on the master schema reference.
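After replacing the schema directory, you may want to confirm that a server's copy matches the reference. The following Python sketch is one possible way to do so: it computes a checksum for each file in a schema directory so that two directories can be compared. The function names are hypothetical, and the sketch assumes that you can read both directories from the machine where you run it (for example, after unpacking the reference tar file next to the local schema directory).

```python
import hashlib
from pathlib import Path


def schema_fingerprint(schema_dir):
    """Map each file name in schema_dir to the SHA-256 digest of its contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(schema_dir).iterdir())
        if p.is_file()
    }


def schema_differences(reference, other):
    """Return the sorted names of files that are missing from one
    fingerprint or whose contents differ between the two."""
    names = set(reference) | set(other)
    return sorted(n for n in names if reference.get(n) != other.get(n))
```

An empty result means that the two directories hold the same files with the same contents. Only the top level of each directory is examined.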

Analyzing Replication Divergence Data

Try to determine if the replication divergence is a result of low disk performance on the consumer using the output of the iostat tool. For more information about diagnosing disk performance problems, see Example: Troubleshooting a Replication Problem Using RUVs and CSNs.

Replication divergence is typically the result of one of the following:

The network's capacity is not large enough to guarantee transport speed at the rate that updates are generated. The network capacity may be the problem when operating over a very low bandwidth.
The consumer is not fast enough to apply the changes it receives. For example, consumer speed can be an issue when disk usage is saturated or when a problem occurs while replication is happening in parallel (unindexed searches, for example).

Advanced Topic: Using the `replcheck` Tool to Diagnose and Repair Replication Halts

Advanced users can use the replcheck tool to check and repair replication on Directory Server. We strongly recommend that you use this tool with the guidance of Sun Support. The tool collects valuable information that Sun Support can use during problem diagnosis and can repair several types of replication halt directly. This tool is located in the install-path/bin/support_tools/ directory.

For more information about the replcheck command, see replcheck(1M).

Diagnosing Problems with `replcheck`

When run in diagnosis mode, the replcheck tool diagnoses the cause of the replication breakage and summarizes the proposed repair actions. It compares the RUVs for each of the servers in your replication topology to determine if the masters are synchronized. If the search results show that all of the consumer replica in-memory RUVs are evolving on time or not evolving but equal to those on the supplier replicas, the tool will conclude that a replication halt is not occurring.
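The decision rule described above can be illustrated with a short Python sketch. This is not how replcheck is implemented; it is a simplified model that treats "evolving on time" as simply "evolving" and takes the maximum CSN seen on the consumer at successive points in time. The function name `halt_suspected` is hypothetical. CSN values can be passed as integers or as hexadecimal strings, assuming all strings have the same length and letter case so that they compare in chronological order.

```python
def halt_suspected(supplier_max_csn, consumer_samples):
    """Return True when the consumer is not evolving and is behind the supplier.

    consumer_samples holds the consumer's maximum CSN at successive
    points in time, oldest first (at least two samples).
    """
    if len(consumer_samples) < 2:
        raise ValueError("need at least two consumer samples")
    evolving = consumer_samples[-1] > consumer_samples[0]
    caught_up = consumer_samples[-1] >= supplier_max_csn
    return not evolving and not caught_up
```

For example, a supplier at CSN 40 with a consumer that stays at CSN 23 across all samples is reported as a suspected halt, whereas a consumer that stays at 40 is not, because it is equal to the supplier.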

To diagnose a replication problem, run the replcheck tool as follows:

replcheck diagnose topology-file

The topology-file specifies the path to a file that contains one record per line in the following format: hostname:port:suffix_dn[:label]. The optional label field provides a name that appears in any messages that are displayed or logged. If you do not specify a label, hostname:port is used instead.

For example, the following topology file describes a replication topology consisting of two hosts:

host1:389:dc=example,dc=com:Paris
host2:489:dc=example,dc=com:New York
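To check a topology file before passing it to replcheck, you could parse it with a sketch such as the following. It follows the record format given above and assumes that the suffix DN itself contains no colon and that the file contains no comment lines; neither assumption is confirmed by the replcheck documentation. The names `Replica` and `parse_topology` are hypothetical.

```python
from typing import List, NamedTuple


class Replica(NamedTuple):
    host: str
    port: int
    suffix_dn: str
    label: str


def parse_topology(text: str) -> List[Replica]:
    """Parse records of the form hostname:port:suffix_dn[:label]."""
    replicas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split at most three times so the label may contain colons.
        parts = line.split(":", 3)
        if len(parts) < 3:
            raise ValueError(f"malformed record: {raw!r}")
        host, port, suffix_dn = parts[0], int(parts[1]), parts[2]
        # Fall back to hostname:port when no label is given.
        label = parts[3] if len(parts) == 4 and parts[3] else f"{host}:{port}"
        replicas.append(Replica(host, port, suffix_dn, label))
    return replicas
```

Applied to the two-host example above, the sketch yields one record labeled Paris and one labeled New York. A record without a label is given the hostname:port pair as its label.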

Repairing Replication Failures With `replcheck`

If the replcheck diagnose command determines that a replication halt is occurring, then you can launch the replcheck fix subcommand to repair the replication halt. For example, the command determines that replication is blocked on the entry associated with CSN 24 if a supplier has a CSN of 40, while the consumer has a CSN of 23 that does not evolve at all over time.

To repair a replication halt, run the replcheck fix command as follows:

replcheck fix topology-file

Troubleshooting Replication Problems

Refer to the following sections to troubleshoot replication using nsds50ruv and ds6ruv attributes.

Using the `nsds50ruv` Attribute to Troubleshoot 5.2 Replication Problems

When a server stops, the nsds50ruv attribute is not stored in the cn=replica entry. At least every 30 seconds, it is stored in the database as an LDAP subentry whose DN is nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name. This information is stored in the suffix instead of the configuration file because this is the only way to export this information into a file. Topology initialization takes place while the servers are offline: the data is exported into an LDIF file and then reimported. If this attribute were not stored in the exported file, the new replica would not have the correct information after an import.

Whenever you use the db2ldif -r command, the nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name entry is included.

Using the `nsds50ruv` and `ds6ruv` Attributes to Troubleshoot Replication Problems

In 6.0 and later versions of Directory Server, you can also use the nsds50ruv attribute to see the internal state of the consumer, as described in the previous section. If you are using the replication priority feature, you can use the ds6ruv attribute, which contains information about the priority operations. When replication priority is configured, you create replication rules to specify that certain changes, such as updating the user password, are replicated with high priority. For example, the RUV appears as follows:

# ldapsearch -h host1 -p 1389 -D "cn=Directory Manager" -w secret \
-b "cn=replica"

nsds50ruv: {replicageneration} 4405697d000000010000
nsds50ruv: {replica 2 ldap://server1:2389}
nsds50ruv: {replica 1 ldap://server1:1390} 440569aa000000010000 44056a23000200010000
ds6ruv: {PRIO 2 ldap://server1:2389}
ds6ruv: {PRIO 1 ldap://server1:1390} 440569b6000100010000 44056a30000800010000
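To read such output more easily, the following Python sketch splits an RUV line into its parts and decodes a CSN. It assumes the commonly described CSN layout of 20 hexadecimal digits: an 8-digit timestamp in seconds, a 4-digit sequence number, a 4-digit replica ID, and a 4-digit subsequence number. It also assumes that the two CSNs on a replica line are the minimum and maximum CSN seen for that replica. The names `CSN`, `parse_csn`, and `parse_ruv_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the epoch (assumed layout)
    seq: int
    replica_id: int
    subseq: int


def parse_csn(text: str) -> CSN:
    """Decode a 20-hex-digit CSN under the layout assumed above."""
    if not re.fullmatch(r"[0-9a-fA-F]{20}", text):
        raise ValueError(f"not a 20-digit hexadecimal CSN: {text!r}")
    return CSN(int(text[0:8], 16), int(text[8:12], 16),
               int(text[12:16], 16), int(text[16:20], 16))


RUV_RE = re.compile(
    r"^(?:nsds50ruv|ds6ruv):\s*\{(?:replica|PRIO)\s+(\d+)\s+([^}\s]+)\}"
    r"(?:\s+(\S+))?(?:\s+(\S+))?\s*$"
)


def parse_ruv_line(line: str) -> Optional[Tuple[int, str, Optional[str], Optional[str]]]:
    """Return (replica_id, url, min_csn, max_csn), or None for lines
    that are not replica elements, such as the replicageneration line."""
    m = RUV_RE.match(line.strip())
    if not m:
        return None
    return int(m.group(1)), m.group(2), m.group(3), m.group(4)
```

In the example output above, the CSN 44056a23000200010000 decodes to sequence number 2 on replica ID 1, which matches the `{replica 1 ...}` element that carries it. The replica 2 element carries no CSNs, so both CSN fields are returned as None.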

	Oracle Directory Server Enterprise Edition Troubleshooting Guide 11g Release 1 (11.1.1.5.0)