1.3.7.13 Record Duplication Profiler

The Record Duplication Profiler allows you to find records that are exact duplicates of one another, based on the selected attributes.

Use the Record Duplication Profiler to check whether any records in the data set have been duplicated in their entirety, for example because of an error in data migration.

Because you select the attributes used in the duplicate check, you can also find records that are duplicates on a subset of their attributes, for example customer records that share the same name, address, and postcode.
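The effect of choosing different attributes can be illustrated with a minimal Python sketch. This is not how the profiler is implemented; the function name flag_duplicates, the field names, and the sample customer records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def flag_duplicates(records, attributes):
    """Return 'Y' or 'N' per record (hypothetical helper, not a product API).

    A record is flagged 'Y' when at least one other record has exactly
    the same values for every attribute listed in `attributes`.
    """
    # Build one comparison key per record from the selected attributes only.
    keys = [tuple(rec.get(a) for a in attributes) for rec in records]
    counts = Counter(keys)
    return ["Y" if counts[k] > 1 else "N" for k in keys]


# Made-up sample data: records 1 and 2 differ only by id.
customers = [
    {"id": 1, "name": "Ann Lee", "address": "1 High St", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Ann Lee", "address": "1 High St", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Bob Ray", "address": "9 Low Rd", "postcode": "ZZ9 9ZZ"},
]

# Duplicates on a subset of the record: name, address, and postcode.
flags_subset = flag_duplicates(customers, ["name", "address", "postcode"])
# Duplicates on the whole record, including id.
flags_all = flag_duplicates(customers, ["id", "name", "address", "postcode"])

print(flags_subset)  # ['Y', 'Y', 'N']
print(flags_all)     # ['N', 'N', 'N']
```

In this sketch the first two customers count as duplicates only when id is left out of the check, which mirrors the choice between checking the entire record and checking a subset of its attributes.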

The following table describes the configuration options:

Configuration Description

Inputs

Specify any attributes that you want to use in the duplicate check.

Options

Specify the following options:

  • Consider no data values as duplicates?: determines whether or not records that have Null values in all attributes will be considered as duplicates of one another. Values are Yes or No. Default value is Yes.

  • Ignore case?: determines whether or not the duplication analysis should ignore case. Values are Yes or No. Default value is Yes.

Records that have Null values in some, but not all, attributes, and which exactly match other records, will always be considered as duplicates.

Outputs

Describes any data attribute or flag attribute outputs.

Data Attributes

None.

Flags

The following flags are output:

  • RecordDuplicate: indicates whether the record is duplicated elsewhere in the data, based on the selected attributes. Possible values are Y or N.
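The two options and the note about partially Null records can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the profiler's actual logic: Null is modelled as Python None, case is ignored by lowercasing string values, and the names profile_key, record_duplicate_flags, no_data_as_duplicates, and ignore_case are hypothetical. The "Mill Lane" rows are made-up sample data.

```python
from collections import Counter


def profile_key(record, attributes, ignore_case=True):
    """Build the comparison key for one record (hypothetical helper)."""
    values = []
    for attr in attributes:
        value = record.get(attr)
        # Assumption: ignoring case is modelled by lowercasing strings.
        if ignore_case and isinstance(value, str):
            value = value.lower()
        values.append(value)
    return tuple(values)


def record_duplicate_flags(records, attributes,
                           no_data_as_duplicates=True, ignore_case=True):
    """Return a 'Y'/'N' RecordDuplicate-style flag for each record."""
    keys = [profile_key(r, attributes, ignore_case) for r in records]
    counts = Counter(keys)
    flags = []
    for key in keys:
        all_null = all(v is None for v in key)
        if all_null and not no_data_as_duplicates:
            # Records with no data in every attribute are excluded.
            flags.append("N")
        else:
            # Partially Null records that match exactly still count.
            flags.append("Y" if counts[key] > 1 else "N")
    return flags


rows = [
    {"ADDRESS1": "Grange Road,", "ADDRESS2": "North Berwick"},
    {"ADDRESS1": "GRANGE ROAD,", "ADDRESS2": "north berwick"},
    {"ADDRESS1": None, "ADDRESS2": None},
    {"ADDRESS1": None, "ADDRESS2": None},
    {"ADDRESS1": "Mill Lane", "ADDRESS2": None},
    {"ADDRESS1": "Mill Lane", "ADDRESS2": None},
]
attrs = ["ADDRESS1", "ADDRESS2"]

# Both options at their documented default of Yes.
default_flags = record_duplicate_flags(rows, attrs)
# Both options set to No.
strict_flags = record_duplicate_flags(
    rows, attrs, no_data_as_duplicates=False, ignore_case=False)

print(default_flags)  # ['Y', 'Y', 'Y', 'Y', 'Y', 'Y']
print(strict_flags)   # ['N', 'N', 'N', 'N', 'Y', 'Y']
```

With both options set to No, only the two "Mill Lane" rows remain flagged: they are Null in some but not all attributes and match each other exactly, so the Null option does not affect them.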

The Record Duplication Profiler assesses duplication across a batch of records. It must therefore run to completion before its results are available, and is not suitable for a process that requires a real-time response.

When executed against a batch of transactions from a real-time data source, it finishes processing when the commit point (transaction or time limit) configured on the Read Processor is reached. The statistics returned indicate the number of duplicates within that batch of transactions only.
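The consequence of batch-scoped results can be shown with a small Python sketch. It models only a transaction-count commit point (not a time limit), and the names batches, count_duplicated, and transaction_limit are hypothetical; it is an illustration of the behavior described above rather than the product's mechanism.

```python
from collections import Counter
from itertools import islice


def batches(stream, transaction_limit):
    """Yield consecutive batches of at most `transaction_limit` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, transaction_limit))
        if not batch:
            return
        yield batch


def count_duplicated(batch):
    """Count items whose value appears more than once within the batch."""
    counts = Counter(batch)
    return sum(1 for key in batch if counts[key] > 1)


# Each letter stands for the comparison key of one incoming transaction.
stream = ["A", "B", "A", "C", "B", "D"]

per_batch = [count_duplicated(b) for b in batches(stream, transaction_limit=3)]
whole_stream = count_duplicated(stream)

print(per_batch)     # [2, 0]
print(whole_stream)  # 4
```

The two "B" transactions fall into different batches, so neither batch reports them as duplicates, even though they would be counted if the whole stream were profiled at once.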

The following table describes the statistics produced by the profiler:

Statistic        Description
Duplicated       The number of records that are duplicated across the attributes analyzed.
Not duplicated   The number of records that are not duplicated across the attributes analyzed.
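As a rough illustration of how the two statistics relate to the RecordDuplicate flag, the following Python sketch tallies a list of per-record flags. It assumes that every record in a group of duplicates counts towards Duplicated; the function name duplication_statistics and the sample flags are hypothetical.

```python
from collections import Counter


def duplication_statistics(flags):
    """Summarize 'Y'/'N' flags into the two statistics (hypothetical helper)."""
    counts = Counter(flags)
    return {"Duplicated": counts["Y"], "Not duplicated": counts["N"]}


# Made-up flags for six records.
flags = ["Y", "Y", "N", "Y", "N", "Y"]
stats = duplication_statistics(flags)

print(stats)  # {'Duplicated': 4, 'Not duplicated': 2}
```

Under this assumption the two statistics always add up to the number of records analyzed.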

Example

In this example, the Record Duplication Profiler finds duplicates in a Customers table using two attributes: ADDRESS1 and ADDRESS2.

Duplicated   Not Duplicated
8            1993

You can drill down on records with Duplicated values:

ADDRESS1         ADDRESS2         RecordDuplicate
Crescent Road,   Reading          Y
Grange Road,     North Berwick    Y
Grange Road,     North Berwick    Y
Crescent Road,   Reading          Y