Sun Java System Directory Server Enterprise Edition 6.3 Troubleshooting Guide

Analyzing Replication Problems

This section guides you through the general process of analyzing replication problems. It provides information about how replication works and tools you can use to collect replication data.

About Replication

Replication is a topology-wide feature that always involves more than one participant. For this reason, when you troubleshoot replication problems, you need to determine exactly where in the topology replication stopped working and which replication agreements are broken.

Replication works as follows:

A master receives a change. After applying the change to the entry in the database, the master stores the change in its change log database.
The master updates the Replica Update Vector (RUV) in memory.
The master notifies the replication threads that a new change has been recorded in the change log.
These replication threads contact replication partners to propagate the information.

For example, master 1 receives a change, applies it to the entry, and updates its change log. When the replication threads on master 1 contact the consumer, the consumer presents its RUV to the master. The master compares this RUV with the RUV it holds in memory to see whether it has more recent changes than the consumer. If the consumer's RUV is already at least as recent, the master sends no changes. If the master has a more recent change, it sends another request to the consumer asking for a lock on replica ID 1 so that it can make updates. If the lock is unavailable, the update is made later. If the lock is available, the master proceeds to make the change.
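The comparison step in this exchange can be sketched as follows. This is a simplified model under stated assumptions, not the server's implementation: each RUV is reduced to a mapping from replica ID to the last known CSN, and the function name `replica_ids_to_update` is hypothetical.

```python
def replica_ids_to_update(supplier_ruv, consumer_ruv):
    """Return the replica IDs for which the supplier knows newer changes.

    Each RUV is modeled as {replica_id: last_csn}, where last_csn is the
    20-digit hexadecimal CSN string. Fixed-width hexadecimal strings in
    the same letter case sort in the same order as the values they encode.
    """
    behind = []
    for replica_id, supplier_csn in sorted(supplier_ruv.items()):
        # A replica ID the consumer has never seen counts as "older than anything".
        consumer_csn = consumer_ruv.get(replica_id, "")
        if supplier_csn.lower() > consumer_csn.lower():
            behind.append(replica_id)
    return behind


master = {1: "4600f252000000010000", 2: "45f03214000000020000"}
consumer = {1: "45ed8751000000010000", 2: "45f03214000000020000"}

# Only replica ID 1 is ahead on the master, so only those changes need sending.
print(replica_ids_to_update(master, consumer))
```

The locking step is left out of the sketch; it only models the decision about whether there is anything to send.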

About Change Sequence Numbers (CSNs)

Replication is sequential, meaning that changes are replicated in order. To make this ordering possible, every change generated by a master is labeled with a change sequence number (CSN) that is unique across the multi-master topology. The CSN is a hexadecimal string that appears in the logs as follows:

41e6ee93000e00640000

The first eight hexadecimal digits represent the time when the change was generated on the master. The time is expressed in seconds since January 1, 1970.

The next four digits are the sequence number, which gives the order of the change within the current second. For example, if multiple changes occur during second 41e6ee93, the sequence number tells us at which point in that second the change occurred.

The next four digits specify the replica ID of the master that received the change in the first place.

The last four digits are always 0000.
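As an illustration of this layout, the following sketch splits a CSN into its four fields. The function name `parse_csn` and the field names are hypothetical, and the timestamp is interpreted as UTC.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a 20-digit hexadecimal CSN into its four fields."""
    if len(csn) != 20:
        raise ValueError("expected 20 hexadecimal digits: %r" % csn)
    seconds = int(csn[0:8], 16)  # seconds since January 1, 1970
    return {
        "time": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sequence": int(csn[8:12], 16),     # order within that second
        "replica_id": int(csn[12:16], 16),  # master that originated the change
        "reserved": int(csn[16:20], 16),    # last four digits, always 0000
    }


fields = parse_csn("41e6ee93000e00640000")
print(fields["time"].isoformat(), fields["sequence"], fields["replica_id"])
```

For the CSN shown earlier, the sequence number field 000e decodes to 14 and the replica ID field 0064 decodes to 100.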

CSNs are generated only when local traffic introduces a new change to a replica. So only masters that receive updates generate CSNs. Consumers always refer to masters, because all the updates they receive are through replication.

During troubleshooting, try to find the CSN or CSNs that are at the origin of the delay. To find the CSN associated with the delay, you need to use replica update vectors (RUVs). RUVs are described in the next section.

About Replica Update Vectors (RUVs)

Every replica in a replication topology stores its current replication state in a replica update vector (RUV). The RUV is held in memory by the running server process and provides the exact knowledge this replica has of itself and of every other participant in the replication topology. The RUV entry on a given server contains a line for each master participating in the replication topology. Each line contains the identifier of one of the masters, the URL of the replica, and the CSNs of the first and last changes made on the server. These CSNs record only the first and last changes known by the server, not necessarily the most recent changes made by the master.

The state of the RUV entry is physically updated every 30 seconds in the following entry:

nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name

The RUV is also held in memory, where you can read it by running ldapsearch on the cn=replica,cn=suffix,cn=mapping tree,cn=config entry. For example, an ldapsearch for the ou=people suffix might yield the following results:

# ldapsearch -h host1 -p 1389 -D "cn=Directory Manager" -w secret \
-b "cn=replica,cn=ou=people,cn=mapping tree,cn=config" \
-s base "objectclass=*" nsds50ruv
nsds50ruv: {replicageneration} 45e8296c000000010000
nsds50ruv: {replica 1 ldap://server1:1389} 45ed8751000000010000 4600f252000000010000
nsds50ruv: {replica 2 ldap://server1:2389} 45eec0e1000000020000 45f03214000000020000
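When you collect RUVs from several servers, it can help to turn this output into a structure that is easy to compare. The following is a minimal sketch that assumes the line layout shown above; the names `RUV_LINE` and `parse_ruv` are hypothetical.

```python
import re

# {replica <id> <url>} optionally followed by the first and last CSN.
RUV_LINE = re.compile(
    r"^nsds50ruv:\s*\{replica\s+(\d+)\s+(\S+)\}\s*(\S+)?\s*(\S+)?\s*$"
)


def parse_ruv(lines):
    """Map replica ID -> (url, first_csn, last_csn) from nsds50ruv lines.

    The {replicageneration} line is skipped. A line that carries no CSNs
    yields None for both values.
    """
    ruv = {}
    for line in lines:
        match = RUV_LINE.match(line.strip())
        if match:
            replica_id, url, first_csn, last_csn = match.groups()
            ruv[int(replica_id)] = (url, first_csn, last_csn)
    return ruv


sample = """\
nsds50ruv: {replicageneration} 45e8296c000000010000
nsds50ruv: {replica 1 ldap://server1:1389} 45ed8751000000010000 4600f252000000010000
nsds50ruv: {replica 2 ldap://server1:2389} 45eec0e1000000020000 45f03214000000020000
"""
for replica_id, (url, first_csn, last_csn) in sorted(parse_ruv(sample.splitlines()).items()):
    print(replica_id, url, last_csn)
```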

For clarity, we will simplify the RUV syntax to CSNchange-number-replica-id, where change-number shows which change the CSN corresponds to in the succession of changes that occurred on the master. For example, 45ed8751000000010000 can be written as CSN05-1. Using this notation, master 1 contains the following RUV:

replica 1: CSN05-1 CSN43-1
replica 2: CSN05-2 CSN40-2

The first line provides information about the first change and the last change that this replica knows about from itself, master 1, as indicated by the replica ID 1. The second line provides information about the first change and the last change that it knows about from master 2. The information that is most interesting to us is the last change. In normal operations, master 1 should know more about the updates it received than master 2. We confirm this by looking at the RUV for master 2:

replica 2: CSN05-2 CSN50-2
replica 1: CSN01-1 CSN35-1

Looking at the last change, we see that master 2 knows more about the last change it received (CSN50-2) than master 1 (which shows the last change as having occurred at CSN40-2). By contrast, master 1 knows more about its last change (CSN43-1) than master 2 (CSN35-1).

When troubleshooting problems with replication, the CSNs can be useful in identifying the problem. Master 1 should always know at least as much about its own replica ID as any other participant in the replication topology because the change was first applied on master 1 and then replicated. So, CSN43-1 should be the highest value attributed to replica ID 1 in the topology.

A problem is identified if, for example, after 30 minutes the RUV on master 1 is still CSN40-2 but on master 2 the RUV has increased significantly to CSN67-2. This indicates that replication is not happening from master 2 to master 1.
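This check can be sketched in the simplified notation, with each RUV reduced to the last change number per replica ID. The function name `lag_by_origin` is hypothetical, and the gap is meaningful only in this notation: real CSNs begin with a timestamp, so subtracting them does not count changes.

```python
def lag_by_origin(observer_ruv, origin_last_change):
    """Return {replica_id: gap} for every origin the observer trails.

    observer_ruv: last change number the observing server holds per replica ID.
    origin_last_change: last change number each master reports for itself.
    """
    lag = {}
    for replica_id, origin_last in origin_last_change.items():
        gap = origin_last - observer_ruv.get(replica_id, 0)
        if gap > 0:
            lag[replica_id] = gap
    return lag


# Master 1 still holds CSN40-2 while master 2 has reached CSN67-2.
master1_ruv = {1: 43, 2: 40}
own_last_change = {1: 43, 2: 67}
print(lag_by_origin(master1_ruv, own_last_change))
```

A gap that keeps growing between two collections taken some time apart is the signal to look for; a single nonzero gap may only reflect changes still in transit.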

If a failure occurs and you need to reinitialize the topology while saving as much data as possible, you can use the RUV picture to determine which machine contains the most recent changes. For example, in the replication topology described previously you have a hub that contains the following RUV:

2: CSN05-2 CSN50-2
1: CSN05-1 CSN43-1

In this case, hub 1 seems like a good candidate for providing the most recent changes.
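The same reasoning can be applied to a whole set of collected RUVs. The sketch below uses the simplified notation and returns the servers that hold the highest known change for every replica ID; the function name `most_current_servers` and the server labels are illustrative.

```python
def most_current_servers(ruvs):
    """Return servers whose RUV matches the topology-wide maximum everywhere."""
    highest = {}
    for ruv in ruvs.values():
        for replica_id, last_change in ruv.items():
            highest[replica_id] = max(highest.get(replica_id, 0), last_change)
    return sorted(
        name
        for name, ruv in ruvs.items()
        if all(ruv.get(rid, 0) == top for rid, top in highest.items())
    )


ruvs = {
    "master 1": {1: 43, 2: 40},
    "master 2": {1: 35, 2: 50},
    "hub 1": {1: 43, 2: 50},
}
print(most_current_servers(ruvs))
```

If no single server is current for every replica ID, the function returns an empty list, and choosing a source for reinitialization needs closer inspection.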

Using the `nsds50ruv` Attribute to Troubleshoot 5.x Replication Problems

When a server stops, the nsds50ruv attribute is not stored in the cn=replica entry. At least every 30 seconds, it is stored as an LDAP subentry in the nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name entry, as described above. This information is stored in the suffix instead of the configuration file because this is the only way to export it into a file. Topology initialization occurs while the servers are offline: the data is exported into an LDIF file and then reimported. If this attribute were not stored in the exported file, the new replica would not have the correct information after the import.

Whenever you use the db2ldif -r command, the nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,suffix-name entry is included.

Using the `nsds50ruv` and `ds6ruv` Attributes to Troubleshoot 6.x Replication Problems

In 6.0 and later versions of Directory Server, you can also use the nsds50ruv attribute to see the internal state of the consumer, as described in the previous section. If you are using the replication priority feature, you can use the ds6ruv attribute, which contains information about the priority operations. When replication priority is configured, you create replication rules to specify that certain changes, such as updating the user password, are replicated with high priority. For example, the RUV appears as follows:

nsds50ruv: {replicageneration} 4405697d000000010000
nsds50ruv: {replica 2 ldap://server1:2389}
nsds50ruv: {replica 1 ldap://server1:1390} 440569aa000000010000 44056a23000200010000
ds6ruv: {PRIO 2 ldap://server1:2389}
ds6ruv: {PRIO 1 ldap://server1:1390} 440569b6000100010000 44056a30000800010000

To see the replication information, export the suffix to an LDIF file as follows:

# dsadm export instance-path suffix-dn [suffix-dn] ldif-file

Overview of Replication Data Collection

You need to collect a minimum of data from your replication topology when a replication error occurs.

Setting the Replication Logging Level

You need to collect information from the access, errors, and, if available, audit logs. Before you collect errors logs, adjust the log level to keep replication information. To set the error log level to include replication, use the following command:

# dsconf set-log-prop ERROR level:err-replication

Note –

In Directory Server 5.x, use the console to set the error log level.

Using the `insync` Command

The insync command provides information about the state of synchronization between a supplier replica and one or more consumer replicas. This command compares the RUVs of replicas and displays the time difference or delay, in seconds, between the servers.

For example, the following command shows the state every 30 seconds:

$ insync -D cn=admin,cn=Administrators,cn=config -w mypword \
 -s portugal:1389 30

ReplicaDn         Consumer                Supplier        Delay
dc=example,dc=com france.example.com:2389 portugal:1389   0
dc=example,dc=com france.example.com:2389 portugal:1389   10
dc=example,dc=com france.example.com:2389 portugal:1389   0

Analyze the output for the points at which the replication delay stops being zero. In the above example, there may be a replication problem between the consumer france.example.com and the supplier portugal, because the replication delay changes to 10, indicating that the consumer is 10 seconds behind the supplier. Continue watching how this delay evolves. If it stays more or less stable or decreases, there is probably no problem. However, a replication halt is probable when the delay increases over time.
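Watching the delay can be automated on captured insync output. The following sketch assumes the four-column layout shown in the example; the function names and the choice of three consecutive samples as the threshold are illustrative assumptions, not part of the tool.

```python
def read_delays(insync_output):
    """Extract the Delay column from captured insync output."""
    delays = []
    for line in insync_output.splitlines():
        columns = line.split()
        # Data rows have four columns with a numeric delay in the last one;
        # the header row and blank lines are skipped.
        if len(columns) == 4 and columns[-1].isdigit():
            delays.append(int(columns[-1]))
    return delays


def delay_is_growing(delays, samples=3):
    """True if the last `samples` delays are strictly increasing."""
    recent = delays[-samples:]
    if len(recent) < samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


captured = """\
ReplicaDn         Consumer                Supplier        Delay
dc=example,dc=com france.example.com:2389 portugal:1389   0
dc=example,dc=com france.example.com:2389 portugal:1389   10
dc=example,dc=com france.example.com:2389 portugal:1389   0
"""
delays = read_delays(captured)
print(delays, delay_is_growing(delays))
```

In the captured example the delay returns to zero, so the sketch does not report a growing delay.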

For more information about the insync command, see insync(1).

Using the `repldisc` Command

The repldisc command displays the replication topology, building a graph of all known replicas using the RUVs. It then prints an adjacency matrix that describes the topology. Because the output of this command shows the machine names and their connections, you can use it to help you read the output of the insync tool. You run this command on 6.0 and later versions of Directory Server as follows:

# /opt/SUNWdsee/ds6/bin/repldisc -D "cn=Directory Manager" -w password -b replica-root -s host:port

You run this command on 5.x versions of Directory Server as follows:

# install-root/shared/bin/repldisc -D "cn=Directory Manager" -w password -b replica-root -s host:port

The following example shows the output of the repldisc command:

$ repldisc -D cn=admin,cn=Administrators,cn=config -w pwd \
  -b o=rtest -s portugal:1389

Topology for suffix: o=rtest  

Legend:  
^ : Host on row sends to host on column.  
v : Host on row receives from host on column. 
x : Host on row and host on column are in MM mode.  
H1 : france.example.com:1389  
H2 : spain:1389  
H3 : portugal:389

   | H1 | H2 | H3 |
===+===============
H1 |    | ^  |    |
---+---------------
H2 | v  |    | ^  |
---+---------------
H3 |    | v  |    |
---+---------------
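To cross-reference this matrix with other data, it can be convenient to convert it into a list of supplier-to-consumer links. The sketch below assumes the output layout shown above; the function name `parse_repldisc` is hypothetical, and an `x` cell is treated as replication in both directions.

```python
def parse_repldisc(output):
    """Return (hosts, edges) from captured repldisc output.

    hosts maps labels such as "H1" to host:port; edges is a sorted list of
    (supplier_label, consumer_label) pairs.
    """
    hosts, edges, columns = {}, set(), []
    for raw in output.splitlines():
        line = raw.strip()
        if "|" not in line:
            # Legend lines such as "H1 : france.example.com:1389".
            label, sep, value = line.partition(" : ")
            if sep and label.startswith("H") and label[1:].isdigit():
                hosts[label] = value.strip()
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if cells[0] == "":
            columns = cells[1:-1]  # header row: | H1 | H2 | H3 |
            continue
        row = cells[0]
        for column, mark in zip(columns, cells[1:]):
            if mark in ("^", "x"):
                edges.add((row, column))  # row sends to column
            if mark in ("v", "x"):
                edges.add((column, row))  # row receives from column
    return hosts, sorted(edges)


captured = """\
H1 : france.example.com:1389
H2 : spain:1389
H3 : portugal:389

   | H1 | H2 | H3 |
===+===============
H1 |    | ^  |    |
---+---------------
H2 | v  |    | ^  |
---+---------------
H3 |    | v  |    |
---+---------------
"""
hosts, edges = parse_repldisc(captured)
for supplier, consumer in edges:
    print(hosts[supplier], "->", hosts[consumer])
```

Each link appears twice in the matrix, once as `^` and once as `v`, so the edges are collected in a set to remove the duplicates.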

Example: Troubleshooting a Replication Problem Using RUVs and CSNs

In this example, two masters replicate to three hubs, which in turn replicate to five consumers.

Replication is not working, and fatal errors appear in the log on consumer 4.

However, because replication is a topology-wide feature, we look to see if other consumers in the topology are also experiencing a problem, and we see that consumers 3 and 5 also have fatal errors in their error logs. Using this information, we see that the potential participants in the problem are consumers 3, 4, and 5, hubs 2 and 3, and masters 1 and 2. We can safely assume that consumers 1 and 2 and hub 1 are not involved.

To debug this problem, we need to collect at least the following information from the following replication participants:

Topology wide data, using the output of the insync and repldisc commands.
Information about the CSN or CSNs that are blocking, using the RUV of masters 1 and 2 and consumer 4.
Information for each potential participant in the problem, including dse.ldif, the output of nsslapd -V, and the access and errors logs (with replication logging enabled) covering the date when the blocking CSN was created.
Information about the replication participants that are functioning correctly and most likely not involved in the problem, including dse.ldif, the output of nsslapd -V, and the access and errors logs (with replication logging enabled).

With this data we can now identify where the delays start. Looking at the output of the insync command, we see delays from hub 2 of 3500 seconds, so this is likely where the problem originates. Now, using the RUV in the nsds50ruv attribute, we can find the operation that is at the origin of the delay. We look at the RUVs across the topology to see the last CSN to appear on a consumer. In our example, master 1 has the following RUV:

replica 1: CSN05-1 CSN91-1
replica 2: CSN05-2 CSN50-2

Master 2 contains the following RUV:

replica 2: CSN05-2 CSN50-2
replica 1: CSN05-1 CSN91-1

They appear to be perfectly synchronized. Now we look at the RUV on consumer 4:

replica 1: CSN05-1 CSN35-1
replica 2: CSN05-2 CSN50-2

The problem appears to be related to the change that immediately follows the change associated with CSN 35 on master 1. The change associated with CSN 35 is the last change from master 1 that was replicated to consumer 4. By using the grep command on the access logs of the replicas to search for CSN35-1, we can find the time around which the problem started. Troubleshooting should begin from this particular point in time.
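When the access logs of several replicas have been collected in one place, the same search can be scripted. The following is a minimal sketch equivalent to a case-insensitive grep; the function name `find_csn` is hypothetical, the sample lines are placeholders rather than real server output, and in practice you search for the full 20-digit CSN rather than the simplified notation.

```python
def find_csn(log_lines, csn):
    """Yield (line_number, line) for every log line that mentions the CSN."""
    needle = csn.lower()
    for number, line in enumerate(log_lines, start=1):
        if needle in line.lower():
            yield number, line.rstrip("\n")


# Placeholder lines only; real access log lines have their own layout.
sample_log = [
    "placeholder line for an unrelated operation\n",
    "placeholder line that mentions 41e6ee93000e00640000\n",
]
for number, line in find_csn(sample_log, "41E6EE93000E00640000"):
    print(number, line)
```

The first matching line on each replica gives the time at which that replica first handled the change.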

As discussed in Defining the Scope of Your Problem, it can be helpful to have information from a system that is working to help identify where the trouble occurs. So we collect data from hub 1 and consumer 1, which are functioning as expected. Comparing the data from the servers that are functioning, focusing on the time when the trouble started, we can identify differences. For example, maybe the hub is being replicated from a different master or a different subnet, or maybe it contains a different change just before the change at which the replication problem occurred.

Possible Symptoms of a Replication Problem and How to Proceed

Depending on the symptoms of your problem, your troubleshooting actions will be different.

For example, if you see nothing in the access logs of the consumers, a network problem may be the cause of the replication failure. Reinitialization is not required.

If the error log shows that it cannot find a particular entry in the change log, the master's change log is not up-to-date. You may or may not need to reinitialize your topology, depending upon whether you can locate an up-to-date change log somewhere in your replication topology (for example, on a hub or other master).

If the consumer has problems, for example, if it experiences processing loops or aborts locks, look in the access log for a large number of retries for a particular CSN. Run the replck tool to locate the CSN at the root of the replication halt and to repair this entry in the change log.