Deduplicating Nodes in a Viewpoint

Deduplicating nodes enables you to evaluate similar nodes in a viewpoint and then merge them into a single node if they are duplicates of one another.

Deduplication Process Overview

Deduplication operations on a viewpoint follow this general process:

  1. A data manager creates a request for a view that contains the viewpoint to be deduplicated.

    Tip:

    A request is required to deduplicate a viewpoint because the deduplication process results in changes to the nodes in that viewpoint.
  2. The data manager creates and runs a match for a specific node type in a particular viewpoint in order to deduplicate that viewpoint. See Running a Deduplication Operation for a Viewpoint.
  3. The matching workbench displays the potential matches as determined by the matching rules that were configured for each data source. See Understanding Deduplication Results and Creating, Editing, and Deleting Matching Rules.

    Note:

    Only the match results with match scores that exceed the Auto Exclude Threshold on the matching rules are displayed.
  4. The data manager reviews the deduplication matches and accepts or rejects each match, and then applies the changes. See Reviewing Deduplication Results and Applying Changes.
  5. The accepted matches are applied as follows:
    • The matched (source) node is deleted from the viewpoint (because it is a duplicate).
    • The properties and relationships from the duplicate node are merged into the match candidate (target) node, which remains in the viewpoint. The survivorship rules determine how the values are merged. See Creating, Editing, and Deleting Survivorship Rules.
  6. The system uses the applied changes to create request items in the request. Delete actions are added for duplicate nodes, and property insert, update, and move actions are added based on the survivorship rules.
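The process above can be sketched in a few lines of Python. This is an illustrative model only, not the product's API: the Node and Match classes, the visible_matches and build_request_items functions, and the dictionary layout of a request item are all hypothetical names. The survivorship rule shown (keep the target's value, fill blank properties from the source) is an assumption made for the example. Actual behavior depends on the survivorship rules that you configure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node in a viewpoint."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Match:
    """Hypothetical match result: source is the matched node, target the match candidate."""
    source: Node
    target: Node
    score: float
    accepted: bool = False


def visible_matches(matches, auto_exclude_threshold):
    """Keep only results whose score exceeds the Auto Exclude Threshold (step 3)."""
    return [m for m in matches if m.score > auto_exclude_threshold]


def build_request_items(matches):
    """Turn accepted matches into request items (steps 5 and 6)."""
    items = []
    for m in matches:
        if not m.accepted:
            continue  # rejected or unreviewed matches produce no changes
        # Assumed survivorship rule: the target keeps its own values,
        # and blank target properties are filled from the source.
        for prop, value in m.source.properties.items():
            if m.target.properties.get(prop) in (None, ""):
                items.append({"action": "update", "node": m.target.name,
                              "property": prop, "value": value})
        # The duplicate (source) node is always deleted.
        items.append({"action": "delete", "node": m.source.name})
    return items


duplicate = Node("Acme Inc", {"Phone": "555-0100", "City": "Austin"})
survivor = Node("Acme Incorporated", {"Phone": "", "City": "Dallas"})
results = [Match(duplicate, survivor, score=0.9, accepted=True)]

for item in build_request_items(visible_matches(results, auto_exclude_threshold=0.5)):
    print(item)
```

In this sketch, the blank Phone value on the surviving node is filled from the duplicate, the City value on the surviving node is left unchanged, and a delete action is added for the duplicate.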

Understanding Matched Nodes and Match Candidates in Deduplication

Because the viewpoints that you are deduplicating contain both the matched nodes and the match candidates, it is important to understand the difference between the two:

  • Matched Nodes are the nodes from the data source that you are evaluating during the matching process. When nodes are merged, the matched nodes become the source nodes, which are deleted after the merge operation.
  • Match Candidates are the nodes that you are matching against during the matching process. When nodes are merged, the match candidates become the target nodes, which survive the merge. The property and relationship values from the source nodes are merged into them as determined by the survivorship rules.

Note:

When you run deduplication using a cluster key, the cluster key is applied to the matched nodes only. It is not used to limit the nodes that are being matched against.

For example, if you deduplicate a customer viewpoint using a cluster key of State and a clustering property value of Texas, only customers in Texas (matched nodes) are evaluated, but each of them could be matched with a customer in California (match candidate) that has the same name. When you merge the records, the node from Texas is deleted and its information is merged into the node from California.
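The cluster key behavior in this example can be modeled with a short Python sketch. The find_matches function, the dictionary layout of each customer, and the exact-name comparison are hypothetical simplifications chosen for illustration. Real matching is driven by the configured matching rules.

```python
def find_matches(nodes, cluster_property, cluster_value, match_property="Name"):
    """Return (matched_id, candidate_id) pairs.

    The cluster key limits only the matched nodes. Candidates are drawn
    from every node in the viewpoint, whatever their cluster value.
    """
    matched_nodes = [n for n in nodes if n.get(cluster_property) == cluster_value]
    pairs = []
    for matched in matched_nodes:
        for candidate in nodes:  # not filtered by the cluster key
            if candidate is matched:
                continue
            # Simplified matching rule (assumption): exact match on one property.
            if candidate.get(match_property) == matched.get(match_property):
                pairs.append((matched["id"], candidate["id"]))
    return pairs


customers = [
    {"id": "C1", "Name": "Acme", "State": "Texas"},
    {"id": "C2", "Name": "Acme", "State": "California"},
    {"id": "C3", "Name": "Globex", "State": "California"},
    {"id": "C4", "Name": "Globex", "State": "Nevada"},
]

print(find_matches(customers, "State", "Texas"))
```

Here the Texas customer C1 is paired with the California customer C2. The two Globex customers also share a name, but neither is in Texas, so neither is evaluated as a matched node and they are not paired.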