Deduplicating Data

This topic explains how the deduplication process works in Oracle Unity. Deduplication is the first step in the Identity Resolution Pipeline Job to create master entities.

Introduction

Oracle Unity deduplicates data records by "clustering" all records belonging to distinct persons, matching similar records, and linking records to one master identifier.

The diagram below illustrates the results of the Identity Resolution Pipeline Job. Through deduplication, augmentation, and promotion, the job matches customer data across multiple sources to create unified customer profiles.

Deduplication Rules

A "Deduplication Rule" is a set of deduplication criteria for a Master Entity. Each Master Entity can have one deduplication rule. For example, consider the Master Entity "Master Customer". This Master Entity contains a single deduplication rule, named "StandardMasterCustomerClustering".

Looking at the deduplication rule "StandardMasterCustomerClustering", you'll notice there are two "Clustering Rules":

  1. "Email_Exact_FirstLastName_Fuzzy"

  2. "ZipCode_Exact_AddressFirstLastName_Fuzzy"

By default, this deduplication rule comes with just these two Clustering Rules, but you can configure additional ones.

Clustering Rules

A Clustering Rule is a set of criteria to configure how Unity should deduplicate records.

How do Clustering Rules Cluster Records?

Clustering Rules cluster records in two steps:

  1. Clustering

  2. Matching

Both steps are reflected in the Unity interface: each configured Clustering Rule has "Clustering Criteria" and "Matching Criteria". The Clustering Criteria determine which records are grouped into the same cluster, and the Matching Criteria determine which records within a cluster are linked, using fuzzy matching where configured.

Step 1 - Clustering

The first step in the deduplication process is Clustering. In this step, Unity clusters similar records together.

Let's use one of the out-of-the-box Clustering Rules to demonstrate how Clustering works.

The "Email_Exact_FirstLastName_Fuzzy" Clustering Rule is designed to:

  • Cluster records whose Email values are not null and match exactly (100%).

Let's consider the following data records that Unity needs to cluster.

CustomerID | First Name | Last Name | Email                 | Address 1               | Zip Code
110        | Katie Jane | Brown     | KJBrown@example.com   | 1161 Main Street        | null
111        | Kathy J    | Brown     | KJBrown@example.com   | 1161 Main Street Unit 3 | 95123
112        | Katherine  | Brown     | KJBrown@example.com   | 1161 Main St            | 95123
113        | Karl       | White     | KJBrown@example.com   | 43 River Ave            | 96138
114        | Daniel     | Smith     | Dan.Smith@example.com | 1161 Main Street Unit 8 | 95123

Based on the exact email match, two clusters are created. Customers 110, 111, 112, and 113 are placed in one cluster because they share the email address KJBrown@example.com, while Customer 114 is placed in a separate cluster.
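The clustering step above can be sketched in a few lines of Python. This is only an illustration of grouping by exact, non-null email, not Unity's implementation. The function name `cluster_by_exact_email` and the dictionary field names are assumptions made for the example, and the comparison is a plain string equality check (whether Unity normalizes case or whitespace is not covered here).

```python
from collections import defaultdict

# Sample records from the walkthrough (only the fields this step needs).
records = [
    {"CustomerID": "110", "Email": "KJBrown@example.com"},
    {"CustomerID": "111", "Email": "KJBrown@example.com"},
    {"CustomerID": "112", "Email": "KJBrown@example.com"},
    {"CustomerID": "113", "Email": "KJBrown@example.com"},
    {"CustomerID": "114", "Email": "Dan.Smith@example.com"},
]


def cluster_by_exact_email(rows):
    """Group CustomerIDs by identical, non-null Email values (hypothetical helper)."""
    clusters = defaultdict(list)
    for row in rows:
        email = row.get("Email")
        if email is None:
            # Records with a null email are not clustered by this rule.
            continue
        clusters[email].append(row["CustomerID"])
    return dict(clusters)


clusters = cluster_by_exact_email(records)
for email, ids in clusters.items():
    print(email, ids)
```

Running the sketch groups Customers 110 through 113 under one email key and Customer 114 under another, which mirrors the two clusters described above.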

Step 2 - Matching

Next, this Clustering Rule will perform "Matching". The Matching Criteria of the "Email_Exact_FirstLastName_Fuzzy" Clustering Rule is designed to:

  • Match records using a fuzzy match on First Name and Last Name, with a match threshold of 85%.

Looking at our data records above, Customers 110, 111, and 112 are linked together because their first and last names are similar. Customer 113 remains separate even though it shares the same email address, because its name does not match, and Customer 114 is also separate.
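The matching step can be sketched as a pairwise comparison inside a single cluster. The fuzzy-matching algorithm Unity uses is not described in this topic, so the sketch below substitutes Python's `difflib.SequenceMatcher` as a stand-in. Its scores will not equal Unity's, and the function names, the `threshold` parameter, and the demo names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Stand-in similarity in [0, 1]; this is NOT Unity's fuzzy-matching algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_pairs(cluster, threshold=0.85, similarity=name_similarity):
    """Return the ID pairs in one cluster whose full names meet the threshold."""
    pairs = []
    for i in range(len(cluster)):
        for j in range(i + 1, len(cluster)):
            id_a, name_a = cluster[i]
            id_b, name_b = cluster[j]
            if similarity(name_a, name_b) >= threshold:
                pairs.append((id_a, id_b))
    return pairs


# Hypothetical cluster of (ID, "First Last") tuples, chosen so the stand-in
# similarity gives a clear result; these are not the records from the table.
demo_cluster = [("1", "Jon Smith"), ("2", "John Smith"), ("3", "Karl White")]
print(match_pairs(demo_cluster))
```

Because `similarity` is passed in as a parameter, a different scoring function can be swapped in to experiment with how the threshold affects which records are linked.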

After Clustering and Matching

Unity will combine the results from "Email_Exact_FirstLastName_Fuzzy" and "ZipCode_Exact_AddressFirstLastName_Fuzzy".

After deduplication, each group of linked Customer IDs is tied to a single master customer ID.

masterCustomerID | CustomerID
110              | 110, 111, 112
113              | 113
114              | 114
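One way to picture how links from several Clustering Rules could be combined is to treat every link as an edge and collect the connected groups. The sketch below does this with a small union-find structure. Treating links as transitive is an assumption made for illustration, since this topic does not describe how Unity merges rule results, and `connected_groups` is a hypothetical helper. The walkthrough does not list any links from the second rule, so that list is left empty here.

```python
def connected_groups(ids, linked_pairs):
    """Merge pairwise links into groups, treating links as transitive (assumption)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())


ids = ["110", "111", "112", "113", "114"]
# Links produced by the email rule in the walkthrough.
email_rule_pairs = [("110", "111"), ("111", "112")]
# No links from the second rule are listed in the walkthrough; assumed empty.
zip_rule_pairs = []

groups = connected_groups(ids, email_rule_pairs + zip_rule_pairs)
print(groups)
```

With these inputs the sketch yields three groups, matching the three rows of the result table above.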

Determining master record IDs

Unity uses a deterministic method to assign master record IDs: it always selects the lowest record ID. If a customer record is tied to multiple IDs, the deduplication process selects the lowest of those IDs as the master record ID. This keeps the assignment consistent from run to run, so master record IDs change only when the ingested data changes.

Example: In the example above, one group of linked records has the master customer ID 110. If we ingest a new customer record with the customer ID "099" that is linked to the same group, 099 becomes the new master customer ID.
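The lowest-ID rule can be illustrated with a short sketch. The helper name `master_id` is hypothetical, and the IDs are compared as equal-length, zero-padded strings, which is an assumption. How Unity orders IDs of other formats is not covered in this topic.

```python
def master_id(group):
    """Pick the lowest ID in a group of linked records (hypothetical helper).

    Assumes IDs are equal-length, zero-padded strings, so that string
    ordering agrees with numeric ordering.
    """
    return min(group)


group = ["110", "111", "112"]
before = master_id(group)

# A newly ingested record that is linked to the same group.
group.append("099")
after = master_id(group)

print(before, after)
```

Because the choice depends only on the IDs present in the group, rerunning the process on unchanged data selects the same master ID each time.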

Using the Clusters and Clustering Rules API

We recommend that you use the APIs to retrieve the out-of-the-box Clusters and Clustering Rules objects and read through them to understand how they are configured. Run the Clustering Rules without modification the first few times and examine the output for inaccuracies. Then adjust your rules to resolve those inaccuracies.

Next steps

Clusters API

Learn more

Master Entities

Creating Master Entities in Unity
