Sun Java System Directory Server Enterprise Edition 6.1 Troubleshooting Guide

Troubleshooting a Replication Halt or Replication Divergence

This section describes how to troubleshoot a replication halt and replication divergence.

Possible Causes of a Replication Halt

The replication halt could be caused by one of the following:

Replication agreement disabled or nonexistent
Supplier missing the change record in its change log
Supplier change log cache corrupted
Replication manager using invalid credentials
Replication sending a thread in a halted state
Schema conflicts
Unallowed operation on the consumer due to Update Resolution Protocol (URP) conflicts
Network disconnection
Consumer state locked down by an unavailable supplier
Consumer out of disk space
RUVs contain meaningless information, such as ffff, cleanruv

Possible Causes of a Replication Divergence

Replication divergence could be caused by one of the following:

Consumer power lower than the supplier
Consumer disks getting to their upper read/write limits
Intermittent network and packet dropping issues
Change log in-memory cache not being used

Collecting Data About a Replication Halt or Replication Divergence

This section describes how to collect information to help you troubleshoot a replication halt or replication divergence.

Collecting 6.x Error and Change Logs

Collect the errors logs from the consumer that is not receiving the changes as well as from the supplier of this consumer. By default, the errors logs are located in the following directory:

install-path/logs/errors

If the errors log is not in the default location, find the path to the log using the dsconf command as follows:

# dsconf get-log-prop -h host -p port ERROR path

The errors log must have the replication logging level enabled. You can use the DSCC to enable the replication logging level or enable it using the command line as follows:

# dsconf set-log-prop -h host -p port ERROR level:err-replication

You should also provide a listing of the supplier's change log, which is located in the same directory as the database. To find the path to your database, use the dsconf command as follows:

# dsconf get-suffix-prop -h host -p port suffix-dn db-path

Use the following command to provide a listing of the supplier's change log directory:

# ls -la /db-path/c15*

Collecting 5.x Errors and Change Logs

The errors log in 5.x and earlier versions of Directory Server is located in the following directory:

install-path/slapd-serverID/logs/errors

If the errors log file is not in the default location, examine the Directory Server configuration file, install-path/slapd-serverID/config/dse.ldif, to find the path to the logs. The path is specified as the value of the nsslapd-errorlog attribute.

Provide a listing of the supplier's change log directory as follows:

# ls -la changelog-dir

If the change log file is not in the default location, examine the Directory Server configuration file, install-path/slapd-serverID/config/dse.ldif, to find the path to the logs. The path is specified as the value of the nsslapd-changelogdir attribute.

Collecting Data Using the `insync` and `repldisc` Commands

Use the output of the insync and repldisc commands to help troubleshoot your replication divergence.

The insync command indicates the state of synchronization between a master replica and one or more consumer replicas and can help you identify bottlenecks. If you are troubleshooting a problem with replication divergence, collect this data periodically so that you can see how the delay changes over time. For more information, see Using the insync Command.

If you identify a bottleneck using the insync command, for example a bottleneck that results from an increasing delay as reported by the tool, it is helpful to start collecting nsds50ruv and ds6ruv attribute data. This data can help you identify when and where the potential halt is taking place. For more information about the nsds50ruv and ds6ruv attributes, see About Replica Update Vectors (RUVs).
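
When comparing nsds50ruv values collected over time, decoding the CSNs they contain can make the lag easier to read. The following Python sketch assumes the commonly described 20-hex-digit CSN layout (8 digits of timestamp in seconds since the epoch, then 4 digits each of sequence number, replica ID, and sub-sequence number). Confirm this layout for your server version before relying on it. The function names and the sample CSN values are illustrative only.

```python
from typing import NamedTuple


class CSN(NamedTuple):
    """Fields of a change sequence number, under the assumed layout."""
    timestamp: int    # seconds since the epoch
    seqnum: int       # sequence number within that second
    replica_id: int   # ID of the replica that originated the change
    subseqnum: int    # sub-sequence number


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four assumed fields."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {text!r}")
    # int(..., 16) raises ValueError if a field is not valid hexadecimal.
    return CSN(int(text[0:8], 16), int(text[8:12], 16),
               int(text[12:16], 16), int(text[16:20], 16))


def lag_seconds(supplier_csn: str, consumer_csn: str) -> int:
    """Timestamp difference between a supplier CSN and a consumer CSN."""
    return parse_csn(supplier_csn).timestamp - parse_csn(consumer_csn).timestamp


# Hypothetical maximum CSNs read from a supplier RUV and a consumer RUV.
print(lag_seconds("45ed8751000000010000", "45ed8715000000010000"))  # prints 60
```

A lag that keeps growing across successive samples points to divergence, while a consumer CSN that never changes points to a halt.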

The repldisc command displays the replication topology, building a graph of all known replicas, then showing the results as a matrix. For more information, see Using the repldisc Command.

Collecting Information About the Network and Disk Usage

Try to determine if the replication halt is network related using the netstat command on both the consumer and supplier as follows:

# netstat -an | grep port

A replication halt may be the result of the network if a consumer is not receiving information despite the fact that access logs show that the supplier is sending updates. Running the ping and traceroute commands can also help you determine if network latency is responsible for the problem.
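
As a complement to netstat, ping, and traceroute, a plain TCP connection attempt from the supplier host to the consumer's LDAP port shows whether that port is reachable at all. This is a different check from the commands above, not a replacement for them. The function name is hypothetical, and the host and port in the usage note are placeholders.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, calling port_reachable("consumer.example.com", 389) on the supplier host returns False when a firewall or a stopped server prevents the connection.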

Collect swap information to see if you are running out of memory. Memory may be the problem if the command for your platform reports little free swap space.

Solaris	`swap -l`
HP-UX	`swapinfo`
Linux	`free`
Windows	Already provided in `C:\report.txt`

Try to determine if the disk controllers are fully loaded and if input/output is the cause of your replication problems. To determine if your problem is disk related, use the iostat tool as follows:

# iostat -xnMCz -T d 10

The iostat command iteratively reports terminal, disk, and tape input/output activity and can be helpful in determining if a replication divergence event results from a saturated disk on the consumer side.
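
When iostat output is captured to a file over a long period, a small script can pick out the devices that look saturated. The sketch below assumes the Solaris extended statistics layout, in which a header line names the columns and includes %b (percent busy) and device. The 90 percent threshold and the sample figures are arbitrary assumptions, not recommended values.

```python
from typing import Dict, Iterable


def busiest_devices(iostat_lines: Iterable[str],
                    threshold: float = 90.0) -> Dict[str, float]:
    """Return the peak %b per device for devices at or above threshold."""
    header = None
    peak: Dict[str, float] = {}
    for line in iostat_lines:
        tokens = line.split()
        if "%b" in tokens and "device" in tokens:
            header = tokens          # remember the column order
            continue
        if header is None or len(tokens) != len(header):
            continue                 # skip banners and timestamp lines
        try:
            busy = float(tokens[header.index("%b")])
        except ValueError:
            continue
        device = tokens[header.index("device")]
        peak[device] = max(busy, peak.get(device, 0.0))
    return {dev: busy for dev, busy in peak.items() if busy >= threshold}


# Hypothetical capture; the numbers are made up for illustration.
sample = """\
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.5  210.3    0.0   12.4  0.0  9.8    0.0   46.5   0  97 c1t0d0
    1.2    3.4    0.0    0.1  0.0  0.0    0.0    4.1   0   2 c1t1d0
""".splitlines()
print(busiest_devices(sample))
```

A consumer disk that stays near 100 percent busy while the supplier keeps sending updates is consistent with a disk-bound divergence.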

Analyzing Replication Halt Data

Use the data you collected to determine if the replication halt is the result of a problem on the supplier or the consumer.

Use the nsds50ruv attribute output that you collected to determine the last CSN that was replicated to a particular consumer. Then, use the consumer's access and errors logs, with the logs set to collect replication level output, to determine the last CSN that was replicated. From this CSN, you can determine the next CSN that the replication process is failing to provide. For example, replication may be failing because the supplier is not replicating the CSN, because the network is blocking the CSN, or because the consumer is refusing to accept the update.

The consumer may be unable to apply the update for that CSN. Use grep to search the consumer's access log for the CSN that the supplier cannot update, as follows:

grep csn=xxxxxxxx consumer-access-log

If you do not find the CSN, try searching for the previous successful CSN committed to the supplier and consumer that are currently failing. Using CSNs, you can narrow your search for the error.
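
The search-and-fall-back procedure above can be sketched in Python. The helper names are hypothetical, and the sketch assumes only that replicated operations appear in the access log with a csn=value token, as in the grep example. Pass the candidate CSNs ordered from the failing one back to earlier ones.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def lines_with_csn(log_lines: Iterable[str], csn: str) -> List[str]:
    """Return every log line that mentions the given CSN."""
    needle = f"csn={csn}"
    return [line for line in log_lines if needle in line]


def latest_csn_present(
    log_lines: Iterable[str], candidate_csns: Sequence[str]
) -> Tuple[Optional[str], List[str]]:
    """Return the first candidate CSN found in the log, with its lines.

    Returns (None, []) when none of the candidates appears in the log.
    """
    lines = list(log_lines)          # allow several passes over the log
    for csn in candidate_csns:
        hits = lines_with_csn(lines, csn)
        if hits:
            return csn, hits
    return None, []
```

If the failing CSN is absent but an earlier one is found, the lines returned for the earlier CSN mark the point in the log from which to read forward.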

By using the grep command to search for CSNs in the access and errors logs, you can determine if an error is only transient. Always match each error message in the errors log with its corresponding access log activity.

If analysis proves that replication is always looping on the same CSN with an etime=0 and an err=32 or err=16, the replication halt is likely to be a critical error. If the replication halt arises from a problem on the consumer, you can run the replcheck tool to fix the problem by patching the contents of the looping entry in the physical database.
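
A loop of this kind can be spotted by counting how often the same CSN fails in the consumer's access log. The sketch below is a rough filter, not a diagnostic tool: the function name, the minimum repeat count of 3, and the assumption that each result line carries err=, etime=, and csn= tokens are all illustrative.

```python
import re
from collections import Counter
from typing import Dict, Iterable

CSN_RE = re.compile(r"\bcsn=([0-9a-fA-F]{20})\b")
ERR_RE = re.compile(r"\berr=(\d+)\b")
ETIME_RE = re.compile(r"\betime=(\d+)\b")

# LDAP result codes: 16 is noSuchAttribute, 32 is noSuchObject.
LOOP_ERRORS = {16, 32}


def looping_csns(access_log_lines: Iterable[str],
                 min_repeats: int = 3) -> Dict[str, int]:
    """Return CSNs that failed with err=16 or err=32 and etime=0
    at least min_repeats times, mapped to their failure count."""
    failures: Counter = Counter()
    for line in access_log_lines:
        csn = CSN_RE.search(line)
        err = ERR_RE.search(line)
        etime = ETIME_RE.search(line)
        if not (csn and err and etime):
            continue                 # not a result line for a replicated op
        if int(err.group(1)) in LOOP_ERRORS and int(etime.group(1)) == 0:
            failures[csn.group(1).lower()] += 1
    return {c: n for c, n in failures.items() if n >= min_repeats}
```

A single CSN that dominates the result is the candidate for the looping entry. A CSN that fails once and then succeeds is more likely a transient error.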

If instead analysis proves that replication is not providing any report of the CSN in the consumer logs, then the problem is likely the result of something on the supplier side or network. If the problem originates with the supplier, you can sometimes restart replication by forcing the replication agreement to send updates to the remote replica or by restarting the supplier. Otherwise, a reinitialization may be required.

To force updates to the remote replica from the local suffix, use the following command:

# dsconf update-repl-dest-now -h host -p port suffix-DN host:port

Resolving a Problem With the Schema

If the errors log contains messages indicating a problem with the schema, then collect further schema-related information. Before changes are sent from a supplier to a consumer, the supplier verifies that the change adheres to the schema. When an entry does not comply with the schema and the supplier tries to update this entry, a loop can occur.

To remedy a problem that arises because of the schema, choose a single supplier that can act as the master reference for schema. Take the contents of its /install-path/config/schema directory. From the /install-path/config directory, archive the schema directory as follows:

# tar -cvf schema.tar schema

Use FTP to copy this tar file to all of the other suppliers and consumers in your topology. On each of those servers, remove the /install-path/config/schema directory and replace it with the contents of the tar file you created on the master schema reference.

Analyzing Replication Divergence Data

Try to determine if the replication divergence is a result of low disk performance on the consumer using the output of the iostat tool. For more information about diagnosing disk performance problems, see Example: Troubleshooting a Replication Problem Using RUVs and CSNs.

Replication divergence is typically the result of one of the following:

The supplier is not fast enough when sending the data to the consumer. For example, the supplier's change log has low in-memory cache settings. Confirm these settings by looking at the nsslapd-cachememsize and nsslapd-cachesize attribute values in the cn=changelog5,cn=config entry.

The nsslapd-cachememsize attribute specifies the change log, or database, cache size in terms of the available memory space. The nsslapd-cachesize attribute specifies the replication change log, or database, cache size in terms of the number of entries it can hold.
The network's capacity is not large enough to guarantee transport speed at the rate that updates are generated. The network capacity may be the problem when operating over a very low bandwidth.
On Directory Server 5.1, the network latency is too large to guarantee transport speed at the rate that updates are generated. Network latency can cause problems with Directory Server 5.1 because the replication transport protocol is synchronous.
The consumer is not fast enough to apply the changes it receives. For example, consumer speed can be an issue when its disks are saturated or when other costly operations (unindexed searches, for example) run in parallel with replication.

Analysis on Earlier Versions of Directory Server

If you are working on the 5.1 version of Directory Server and are experiencing replication divergence, it may be a result of protocol limits. Replication in 5.1 is synchronous, and therefore is not supported over a WAN. If you are replicating over a WAN, you must upgrade to 6.1.

If replicating over a LAN, verify the network latency between the supplier and consumers using the ping command. In the 5.1 version of Directory Server, a supplier can only send changes once it receives an acknowledgement from the consumer. This results in consumer downtime that may resemble a halt when in fact the exchange is only slow. For example, you may update a password, but the new password does not go into effect immediately, giving you the impression that you are experiencing a replication divergence. Analyze the access log of the supplier and see how many updates are received, second by second. For example, the supplier access log should show varied traffic for each second, such as:

13:07:04  14
13:07:05  10
13:07:06  15
13:07:07   5

Next, look in the access log of the consumer. It may show a constant rate of updates, suggesting a bottleneck:

13:07:04   8
13:07:05   8
13:07:06   8
13:07:07   8
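
Per-second counts like the ones above can be produced from an access log with a short script. The sketch assumes access log lines that begin with a bracketed timestamp such as [06/Mar/2007:13:07:04 +0100] and that name the update operation (ADD, MOD, DEL, or MODRDN) before dn=. Adjust the pattern if your logs differ. The flatness test is a crude heuristic added for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Mapping

# Captures HH:MM:SS from the timestamp of lines that record an update.
UPDATE_LINE = re.compile(
    r"^\[\d{2}/\w{3}/\d{4}:(\d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r".* (?:ADD|MODRDN|MOD|DEL) dn="
)


def updates_per_second(access_log_lines: Iterable[str]) -> Counter:
    """Count update operations for each second seen in the log."""
    counts: Counter = Counter()
    for line in access_log_lines:
        match = UPDATE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def is_flat(counts: Mapping[str, int]) -> bool:
    """True when every second shows the same number of updates."""
    values = list(counts.values())
    return len(values) > 1 and max(values) == min(values)


# The two tables above, expressed as counts.
supplier = {"13:07:04": 14, "13:07:05": 10, "13:07:06": 15, "13:07:07": 5}
consumer = {"13:07:04": 8, "13:07:05": 8, "13:07:06": 8, "13:07:07": 8}
print(is_flat(supplier), is_flat(consumer))  # prints False True
```

A flat consumer rate below the supplier's varied rate suggests that the consumer or the link is capping throughput.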

If you are experiencing a problem of this kind, it may be the result of your method of network access, limited bandwidth, or low-capacity links.

Advanced Topic: Using the `replcheck` Tool to Diagnose and Repair Replication Halts

Advanced users can use the replcheck tool to check and repair replication on Directory Server 6.x. We strongly recommend that you use this tool with the guidance of Sun Support. The tool collects valuable information that Sun Support can use during problem diagnosis and can repair several types of replication halt directly. This tool is located in the install-path/ds6/support_tools/bin/ directory.

Note –

An earlier version of the replcheck tool is available for Directory Server 5.x. Contact Sun Support for more information.

For more information about the replcheck command, see replcheck(1M).

Diagnosing Problems with `replcheck`

When run in diagnosis mode, the replcheck tool diagnoses the cause of the replication breakage and summarizes the proposed repair actions. It compares the RUVs for each of the servers in your replication topology to determine if the masters are synchronized. If the search results show that all of the consumer replica in-memory RUVs are evolving on time or not evolving but equal to those on the supplier replicas, the tool will conclude that a replication halt is not occurring.

To diagnose a replication problem, run the replcheck tool as follows:

replcheck diagnose topology-file

The topology-file specifies the path to a file that contains one record per line in the following format: hostname:port:suffix_dn[:label]. The optional label field provides a name that appears in any messages that are displayed or logged. If you do not specify a label, hostname:port is used instead.

For example, the following topology file describes a replication topology consisting of two hosts:

host1:389:dc=example,dc=com:Paris
host2:489:dc=example,dc=com:New York
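
Before handing a topology file to the tool, it can be useful to check that every record has the expected shape. The following Python sketch parses the hostname:port:suffix_dn[:label] format described above. It is not part of replcheck. The type and function names are hypothetical, and skipping blank lines and assuming that the suffix DN contains no colon are choices made for this sketch.

```python
from typing import List, NamedTuple


class TopologyRecord(NamedTuple):
    host: str
    port: int
    suffix_dn: str
    label: str


def parse_topology(text: str) -> List[TopologyRecord]:
    """Parse topology records, one per line, in the documented format."""
    records = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue                       # blank lines skipped (sketch choice)
        parts = line.split(":", 3)         # the label keeps any later colons
        if len(parts) < 3:
            raise ValueError(f"line {lineno}: expected host:port:suffix_dn")
        host, port, suffix_dn = parts[0], parts[1], parts[2]
        if not port.isdigit():
            raise ValueError(f"line {lineno}: port {port!r} is not a number")
        # Fall back to host:port when the optional label is missing.
        label = parts[3] if len(parts) == 4 and parts[3] else f"{host}:{port}"
        records.append(TopologyRecord(host, int(port), suffix_dn, label))
    return records


example = "host1:389:dc=example,dc=com:Paris\nhost2:489:dc=example,dc=com:New York\n"
for record in parse_topology(example):
    print(record.label, record.host, record.port, record.suffix_dn)
```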

Repairing Replication Failures With `replcheck`

If the replcheck diagnose command determines that a replication halt is occurring, then you can launch the replcheck fix subcommand to repair the replication halt. For example, the command determines that replication is blocked on the entry associated with CSN 24 if a supplier has a CSN of 40, while the consumer has a CSN of 23 that does not evolve at all over time.
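
The reasoning in this example can be written down as a small decision function. This is a simplified illustration of the rule described in this section, not the algorithm that replcheck implements. Small integers stand in for CSNs, as in the example above, and the function name and return strings are hypothetical.

```python
from typing import Sequence


def classify_consumer(supplier_csns: Sequence[int],
                      consumer_csns: Sequence[int]) -> str:
    """Classify a consumer from CSN samples taken at successive times.

    Each sequence holds the maximum CSN observed at each sampling time,
    oldest first. At least two consumer samples are needed to tell
    whether the consumer is evolving.
    """
    if not supplier_csns or len(consumer_csns) < 2:
        raise ValueError("need one supplier sample and two consumer samples")
    if consumer_csns[-1] >= supplier_csns[-1]:
        return "in sync"       # not evolving, but equal to the supplier
    if len(set(consumer_csns)) > 1:
        return "evolving"      # behind, but still applying changes
    return "halted"            # behind and never changing


# Supplier at CSN 40, consumer stuck at CSN 23: replication is blocked
# on the entry associated with the next CSN, 24.
print(classify_consumer([40, 40, 40], [23, 23, 23]))  # prints halted
```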

To repair a replication halt, run the replcheck fix command as follows:

replcheck fix topology-file