Understanding Deduplication
Deduplication enables you to use the matching process to identify nodes in a viewpoint that are duplicates of each other and combine them into a single node.
Deduplication is run on nodes that already exist in a viewpoint, unlike Matching and Merging Request Items, which works on incoming nodes being added in a request. This lets you find and merge duplicate existing nodes that may have been added to the viewpoint before matching was available, or outside of the request process (such as by an import or load).
Deduplication uses many of the same elements as matching and merging request items:
- Matching rules are used to identify potential duplicate nodes.
- Survivorship rules control how properties and relationships are merged after a match is confirmed.
- Matching stopwords can be configured so that common words such as "The" and "Company" are ignored during deduplication.
- The matching workbench is used to accept, reject, or skip match candidates.
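As a rough illustration of why stopwords matter when identifying duplicates, the sketch below compares two names after removing stopwords. It is not the product's matching engine: the `STOPWORDS` set, `normalize`, and `is_candidate` are hypothetical names, and exact equality after normalization is an assumed, simplified matching rule.

```python
# Illustrative only: a simplified stand-in for a matching rule with stopwords.
STOPWORDS = {"the", "company"}  # hypothetical stopword list


def normalize(name: str) -> str:
    """Lowercase the name and drop stopwords before comparison."""
    words = name.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)


def is_candidate(a: str, b: str) -> bool:
    """Treat two names as potential duplicates if they are equal after normalization."""
    return normalize(a) == normalize(b)
```

Under these assumptions, "The Acme Company" and "Acme" would be flagged as match candidates, because the only words that differ are stopwords.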
Deduplication Modes
You can deduplicate nodes in a viewpoint in two modes:
- Cluster key: Define a clustering property for the node types of the nodes to be matched, and then perform matching for each cluster. See Deduplicating Using a Cluster Key.
- Time-based: Deduplicate nodes in a viewpoint based on the date that they were created. See Time-Based Deduplication.
You can run only one mode of deduplication for a specific viewpoint and node type in a single request, but you can use both modes to deduplicate nodes in a viewpoint in different contexts. For example, you could initially deduplicate nodes in a viewpoint by cluster key and then any nodes created after that could be deduplicated incrementally using time-based deduplication.
Note:
You can deduplicate nodes of a particular node type in only one active request at a time, regardless of the mode.
The cluster key and the node creation date for time-based deduplication limit the scope of the deduplication operation. Matching and merging request items is automatically constrained by the maximum number of request items in a request, but a viewpoint could contain millions of nodes. Specifying either a node creation date or a clustering property lets you target the specific nodes that you want to deduplicate in a single operation.
Note:
Both cluster key and time-based deduplication require that the CoreStats.Created Date property is included in the node type being deduplicated in order for the system to be able to track the progress of which nodes have been evaluated and which have not.
Deduplicating Using a Cluster Key
In order to deduplicate nodes using a cluster key, you must define a clustering property for the node types of the nodes to be matched. This filters the list of nodes in the viewpoint to be matched to other nodes in the same viewpoint. When you run the deduplication process, you specify the value of the clustering property that you want to deduplicate nodes for.
Tip:
When you define a clustering property for a node type, the property that you select must have an Allowed Values list for that node type (see Configuring a Clustering Property for a Node Type). Then, when you run deduplication using a cluster key, you select the clustering value from that list of allowed values. For example, if you are deduplicating customers and the clustering property is State, you could select Texas as the clustering value to deduplicate customers in the state of Texas.
The cluster key is applied to the set of nodes that you are matching, not the nodes that are being matched against. So, in the above example where you are matching customers in the state of Texas, a match with the same name in California would be displayed.
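The one-sided scope of the cluster key in the Texas example can be sketched as follows. This is a minimal illustration, not the product's implementation: the `Node` class, its fields, `find_candidates`, and the case-insensitive name comparison are all hypothetical stand-ins for a node type, its properties, and a matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str           # node identifier (hypothetical)
    customer_name: str  # property compared by the assumed matching rule
    state: str          # assumed clustering property


def find_candidates(nodes, cluster_value):
    """Pair each node in the selected cluster with same-named nodes anywhere in the viewpoint."""
    # The cluster key filters only the nodes being matched (the sources)...
    sources = [n for n in nodes if n.state == cluster_value]
    pairs = []
    for src in sources:
        # ...while the nodes matched against are not filtered by the cluster key.
        for other in nodes:
            if other is not src and other.customer_name.lower() == src.customer_name.lower():
                pairs.append((src.name, other.name))
    return pairs


viewpoint = [
    Node("C1", "Acme", "Texas"),
    Node("C2", "Acme", "California"),
    Node("C3", "Apex", "California"),
    Node("C4", "Apex", "Nevada"),
]
candidates = find_candidates(viewpoint, "Texas")
```

With this sample data, the Texas node C1 is paired with its California namesake C2, while C3 and C4 are not surfaced as a pair because neither of them is in the Texas cluster.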
Time-Based Deduplication
Time-based deduplication enables you to deduplicate nodes that were created on or after a specified date. It does not require you to specify a clustering property. Instead, when you create a match to deduplicate, you specify a node creation start date and, optionally, a batch size.
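The selection described above can be pictured with a short sketch. It rests on assumptions the text does not confirm: nodes are modeled as plain dictionaries whose `created` key stands in for the CoreStats.Created Date property, eligible nodes are taken oldest first, and the batch size simply caps how many are returned. The function name `select_batch` is hypothetical.

```python
from datetime import date


def select_batch(nodes, start_date, batch_size=None):
    """Return nodes created on or after start_date, oldest first, capped at batch_size if given."""
    eligible = sorted(
        (n for n in nodes if n["created"] >= start_date),
        key=lambda n: n["created"],
    )
    # No batch size means every eligible node is selected (an assumption).
    return eligible if batch_size is None else eligible[:batch_size]


sample_nodes = [
    {"name": "A", "created": date(2024, 1, 5)},
    {"name": "B", "created": date(2024, 3, 1)},
    {"name": "C", "created": date(2024, 2, 10)},
    {"name": "D", "created": date(2023, 12, 31)},
]
first_batch = select_batch(sample_nodes, date(2024, 1, 5), batch_size=2)
```

In this sketch, node D is excluded because it was created before the start date, and a batch size of 2 selects nodes A and C, leaving B for a later run.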