1.3.7.13 Record Duplication Profiler

The Record Duplication Profiler allows you to find records that are exact duplicates of one another, based on the selected attributes.

Use the Record Duplication Profiler to check whether any records in the data set have been duplicated in their entirety, for example because of an error in data migration.

Because you select the attributes to use in the duplicate check, you can also find records that are duplicates on only a subset of the total record, for example customer records that are duplicates by name, address, and postcode.

The following table describes the configuration options:

Configuration	Description
Inputs	Specify any attributes that you want to use in the duplicate check.
Options	Specify the following options: `Consider no data values as duplicates?`: determines whether records that have Null values in all input attributes are considered duplicates of one another. Values are `Yes` or `No`. Default value is `Yes`. This option affects only records that are Null in every input attribute; records that have Null values in some, but not all, attributes, and that exactly match other records, are always considered duplicates. `Ignore case?`: determines whether the duplication analysis ignores case. Values are `Yes` or `No`. Default value is `Yes`.
Outputs	Describes any data attribute or flag attribute outputs.
Data Attributes	None.
Flags	The following flags are output: `RecordDuplicate`: indicates whether the record is duplicated elsewhere, based on the input attributes. Possible values are `Y` or `N`.
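
The options above can be illustrated with a minimal Python sketch of the duplicate check. This is not the product's implementation: the function name `flag_duplicates`, its parameters, and the representation of records as dictionaries are all hypothetical, and the sketch assumes that only Python `None` counts as a Null value.

```python
from collections import Counter


def flag_duplicates(records, attributes, ignore_case=True, no_data_as_duplicates=True):
    """Return a 'Y'/'N' RecordDuplicate-style flag for each record, in input order.

    Hypothetical helper for illustration only; None stands in for Null.
    """
    def key(record):
        # Build the comparison key from the selected attributes only.
        values = []
        for name in attributes:
            value = record.get(name)
            if ignore_case and isinstance(value, str):
                value = value.casefold()
            values.append(value)
        return tuple(values)

    keys = [key(record) for record in records]
    counts = Counter(keys)

    flags = []
    for k in keys:
        all_null = all(value is None for value in k)
        if all_null and not no_data_as_duplicates:
            # Records that are Null in every attribute are excluded
            # only when the option is switched off.
            flags.append("N")
        else:
            # Partially Null records fall through here, so they are
            # always eligible to match other records exactly.
            flags.append("Y" if counts[k] > 1 else "N")
    return flags
```

In this sketch, two records that are Null in every selected attribute are flagged `Y` by default and `N` when `no_data_as_duplicates` is `False`, while two records that share a name and both have a Null postcode are flagged `Y` in either case.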

The Record Duplication Profiler assesses duplication across a batch of records. It must therefore run to completion before its results are available, and is not suitable for a process that requires a real-time response.

When executed against a batch of transactions from a real-time data source, it finishes processing when the commit point (transaction or time limit) configured on the Read Processor is reached. The statistics returned then indicate the number of duplicates in that batch of transactions only.

The following table describes the statistics produced by the profiler:

Statistic	Description
Duplicated	The number of records that are duplicated across the attributes analyzed.
Not duplicated	The number of records that are not duplicated across the attributes analyzed.
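
As a rough illustration of how these two statistics relate to the `RecordDuplicate` flag, the following Python sketch tallies a list of `Y`/`N` flags. The function name `duplication_statistics` is hypothetical, and the sketch assumes that every record in a group of duplicates is counted as Duplicated, not just the second and later copies.

```python
from collections import Counter


def duplication_statistics(flags):
    """Tally 'Y'/'N' flags into the two profiler statistics.

    Hypothetical helper for illustration only.
    """
    counts = Counter(flags)
    return {
        "Duplicated": counts["Y"],
        "Not duplicated": counts["N"],
    }


# Two pairs of duplicates plus one unique record: all four
# members of the two pairs count towards "Duplicated".
print(duplication_statistics(["Y", "Y", "Y", "Y", "N"]))
```

Under that assumption, the two statistics always sum to the number of records in the batch.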

Example

In this example, the Record Duplication Profiler finds duplicates in a Customers table using two attributes, ADDRESS1 and ADDRESS2.

Duplicated	Not duplicated
8	1993

You can drill down on the Duplicated statistic to view the records that were flagged as duplicates:

ADDRESS1	ADDRESS2	RecordDuplicate
Crescent Road,	Reading	Y
Grange Road,	North Berwick	Y
Grange Road,	North Berwick	Y
Crescent Road,	Reading	Y