Running a Deduplication Operation for a Viewpoint

Deduplicating nodes enables you to match similar existing nodes in a viewpoint and combine them into a single node.

Considerations

You must have Data Manager permission or greater on a viewpoint to deduplicate it.
You cannot deduplicate time-labeled or archived viewpoints.
You can deduplicate a viewpoint in one active request at a time. If a viewpoint is in the process of being deduplicated in another active request, you cannot select that viewpoint in a new deduplication operation.
Each request supports one deduplication mode only. You cannot run a cluster key deduplication and a time-based deduplication in the same request.
When deduplication in a viewpoint is run:
- Nodes in the viewpoint are matched against all nodes in the node type, even if some of those nodes are not in the existing viewpoint.
- If a viewpoint contains shared nodes, match rules are run only for one instance of the node.
- A maximum of 20 match results is displayed for each matched node.
Requests have a limit of 10,000 request items. Because each merge operation results in two request items (a delete of the source node and a property update of the target node), when the number of matched nodes reaches 5,000, the deduplication process for that request is stopped and you are prompted to create a new request to continue deduplicating nodes. The request maximum may be reached earlier if your request already contains other request items.
Because you are deduplicating a set of nodes in a viewpoint instead of incoming request items, two different nodes can often be match candidates for each other. For example, when deduplicating a viewpoint that contains the nodes "Oracle" and "Oracle Inc", each node can be a match candidate for the other. The node that you accept as the duplicate determines which node is deleted and which node survives. Remember, the matched nodes are the nodes that will be deleted, and the match candidate nodes are the surviving nodes. See Understanding Matched Nodes and Match Candidates in Deduplication.

Tip:
When you accept a match as a duplicate, that duplicate node is marked as Duplicate in the Deduplication Results screen (see Understanding Deduplication Results). The marked node is the one that will be deleted.
If three or more nodes are matched during deduplication, you cannot merge the first into the second and then merge the second into the third. You can, however, merge both the first and second into the third.
For example, suppose you have nodes "Oracle", "Oracle Inc", and "Oracle Incorporated", and you want to keep "Oracle Incorporated" and merge information from the other two nodes into it. You can't merge "Oracle" into "Oracle Inc" and then merge "Oracle Inc" into "Oracle Incorporated". Instead, locate the matched node "Oracle" and mark it as a duplicate of "Oracle Incorporated", and then locate "Oracle Inc" and mark it as a duplicate of "Oracle Incorporated".
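
The rule in this tip can be checked mechanically before you accept matches. The following is a minimal sketch and not part of the product: the function name `find_chained_merges` and the representation of merges as (duplicate, survivor) pairs are assumptions made for illustration. It flags any node that is marked as a duplicate while also acting as the surviving node of another merge.

```python
def find_chained_merges(merges):
    """Return nodes that are both deleted (duplicate) and kept (survivor).

    `merges` is a list of (duplicate, survivor) pairs. This representation
    is hypothetical and only models the rule described in the tip.
    """
    duplicates = {dup for dup, _ in merges}
    survivors = {surv for _, surv in merges}
    return sorted(duplicates & survivors)


# Not allowed: "Oracle Inc" would be both a survivor and a duplicate.
chained = [("Oracle", "Oracle Inc"), ("Oracle Inc", "Oracle Incorporated")]

# Allowed: both nodes merge directly into the surviving node.
direct = [("Oracle", "Oracle Incorporated"), ("Oracle Inc", "Oracle Incorporated")]

print(find_chained_merges(chained))  # ['Oracle Inc']
print(find_chained_merges(direct))   # []
```

An empty result means that every accepted duplicate merges directly into a node that itself survives.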

To deduplicate a viewpoint:

Create a request for the viewpoint that you want to deduplicate.
Click the Match and Deduplicate tab on the left side of the viewpoint window.
In the Match Pane, click New, and then select Deduplicate Viewpoint.
On the Deduplicate Viewpoint dialog box, perform these actions:
1. Select the Viewpoint that you want to deduplicate.
2. Select a Node Type in that viewpoint. The node type must be configured for deduplication (see Understanding Deduplication).
3. Select the deduplication Mode:
  - Cluster Key: Deduplicate the viewpoint using a clustering property. Select the clustering property value from the drop-down menu. The values in the drop-down menu are based on the allowed values for the property that you defined as the cluster key. See Deduplicating Using a Cluster Key.
    
    Note:
    If a deduplication operation has already been run for the clustering property, the node creation date of the last node processed is displayed.
  - Time Based: Deduplicate the viewpoint based on the date that the nodes were created. Enter the node creation date. See Time-Based Deduplication.
Optional: Enter a Batch Size to specify the number of nodes to be checked for duplicates.

Tip:
This can be helpful, for example, if you've made changes to a matching rule that you want to test. You can run a smaller batch and evaluate the results before deduplicating the entire viewpoint.
Click Run Deduplicate.

Deduplication is run on the viewpoint using the defined match rules for the node type and the registered data source for the viewpoint.

Deduplication Operations

Because viewpoints can contain thousands of nodes, you generally deduplicate them in batches. The batches can be defined in the following ways:

The cluster key (see Deduplicating Using a Cluster Key)
The node creation start date (see Time-Based Deduplication)
The batch size that you specify when creating the match
The request item limit in a request

Batches can also be defined by a combination of some of the above, such as a cluster key and a specified batch size.
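
The request item limit in particular can be estimated ahead of time. As a rough sketch, assuming the 10,000-item limit and the two request items per merge described in the considerations, the number of merges that still fit in a request could be computed as follows. The function and constant names are illustrative and are not product settings.

```python
REQUEST_ITEM_LIMIT = 10_000   # request items allowed per request
ITEMS_PER_MERGE = 2           # delete of the source node + update of the target node


def remaining_merges(existing_items=0):
    """Estimate how many merges still fit in a request (illustrative only)."""
    if existing_items < 0:
        raise ValueError("existing_items must not be negative")
    free_items = max(REQUEST_ITEM_LIMIT - existing_items, 0)
    return free_items // ITEMS_PER_MERGE


print(remaining_merges())        # 5000
print(remaining_merges(1_500))   # 4250
```

A request that already holds 1,500 other request items would therefore reach the limit after roughly 4,250 merges rather than 5,000.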

You have several options for how you process the nodes in these batches. The following terms can help you understand these options:

Table 11-1 Batch Processing Options

Option	Definition	How to Perform
Run	Perform the initial deduplication of the first batch of nodes for a specified cluster or node creation start date.	Click Run Deduplicate in the Deduplicate Viewpoint dialog box.
Continue	Perform a subsequent deduplication of the next batch of nodes for a specified cluster or node creation start date. The system tracks the nodes that have already been processed so that you can pick up where you left off.	Click Run Deduplicate in the Deduplicate Viewpoint dialog box after performing an initial Run operation.
Rerun	Reprocess an existing result set in a request. This may include one or more batches. Note: Rerun reprocesses unaccepted match results only.	In the Deduplicate Result Set panel, click Actions next to the result set that you want to rerun, and then select Rerun.
Restart	Reprocess a cluster that was already processed by starting from the beginning of it. Note: Restart is available for cluster key deduplication only. Tip: The difference between Rerun and Restart is that Rerun reprocesses one or more batches, while Restart reprocesses a cluster.	Click Restart next to the Cluster Key in the Deduplicate Viewpoint dialog box.
Discard	Delete an existing result set for a given request. The last node that was processed is retained so that you can Continue the next time you run deduplication. Note: Deleting the request will also discard the result set.	In the Deduplicate Result Set panel, click Actions next to the result set that you want to discard, and then select Discard.
Discard and Rerun	Delete an existing result set for a given request and reprocess the same nodes in the result set. This may include one or more batches.	In the Deduplicate Result Set panel, click Actions next to the result set that you want to discard and rerun, and then select Discard and Rerun.
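
The relationship between Run, Continue, Restart, and Discard can be pictured as a cursor that records the last node processed in a cluster. The toy model below is a conceptual sketch only: the class `ClusterProgress` and its methods are hypothetical and do not reflect how the product stores its state. Rerun is not modeled.

```python
class ClusterProgress:
    """Toy model of batch tracking for one cluster (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # nodes in processing order
        self.cursor = 0            # index of the next unprocessed node

    def run(self, batch_size):
        """Run or Continue: process the next batch and advance the cursor."""
        batch = self.nodes[self.cursor:self.cursor + batch_size]
        self.cursor += len(batch)
        return batch

    def restart(self):
        """Restart: go back to the beginning of the cluster."""
        self.cursor = 0

    def discard(self, result_set):
        """Discard: drop the results but keep the cursor for the next run."""
        result_set.clear()


progress = ClusterProgress(["A", "B", "C", "D", "E"])
first = progress.run(2)      # Run: ['A', 'B']
second = progress.run(2)     # Continue: ['C', 'D']
progress.discard(second)     # results removed, cursor stays after 'D'
third = progress.run(2)      # Continue: ['E']
progress.restart()
again = progress.run(2)      # after Restart: ['A', 'B']
```

In this model, Discard empties the result set without moving the cursor, which is why the next run picks up at the following node, while Restart is the only operation that moves the cursor back to the start.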