In this example, two masters replicate to three hubs, which in turn replicate to five consumers:
Replication is not working, and fatal errors appear in the log on consumer 4.
However, because replication is a topology wide feature, we look to see if other consumers in the topology are also experiencing a problem, and we see that consumers 3 and 5 also have fatal errors in their error logs. Using this information, we see that potential participants in the problem are consumers 3, 4, and 5, hubs 2 and 3, and masters 1 and 2. We can safely assume that consumers 1 and 2 and hub 2 are not involved.
To debug this problem, we need to collect at least the following information from the following replication participants:
Topology wide data, using the output of the insync and repldisc commands.
Information about the CSN or CSNs that are blocking, using the RUV of masters 1 and 2 and consumer 4.
Information for each potential participant in the problem, including dse.ldif, nsslapd -V, access and errors log (with replication enabled) related to the date when the blocking CSN was created.
Information about the replication participants that are functioning correctly and most likely not involved in the problem, including dse.ldif, nsslapd -V, and the access and errors log (with replication enabled).
With this data we can now identify where the delays start. Looking at the output of the insync command, we see delays from hub 2 of 3500 seconds, so this is likely where the problem originates. Now, using the RUV in the nsds50ruv attribute, we can find the operation that is at the origin of the delay. We look at the RUVs across the topology to see the last CSN to appear on a consumer. In our example, master 1 has the following RUV:
replica 1: CSN05-1 CSN91-1 replica 2: CSN05-2 CSN50-2
Master 2 contains the following RUV:
replica 2: CSN05-2 CSN50-2 replica 1: CSN05-1 CSN91-1
They appear to be perfectly synchronized. Now we look at the RUV on consumer 4:
replica 1: CSN05-1 CSN35-1 replica 2: CSN05-2 CSN50-2
The problem appears to be related to the change that is next to the change associated with CSN 35 on master 1. The change associated with CSN 35 corresponds to the oldest CSN ever replicated to consumer 4. By using the grep command on the access logs of the replicas on CSN35–01, we can find the time around which the problem started. Troubleshooting should begin from this particular point in time.
As discussed in Defining the Scope of Your Problem, it can be helpful to have information from a system that is working to help identify where the trouble occurs. So we collect data from hub 1 and consumer 1, which are functioning as expected. Comparing the data from the servers that are functioning, focusing on the time when the trouble started, we can identify differences. For example, maybe the hub is being replicated from a different master or a different subnet, or maybe it contains a different change just before the change at which the replication problem occurred.