1.3.4.11.2 Identify

Identify is a sub-processor of all matching processors except Group and Merge. The purpose of the Identify step of match configuration is to map source attributes to identifiers (see below), which are then used to match records in or between data streams.

Identifiers

An identifier is a way of representing and identifying a real-world business entity that needs to be matched - for example, a person's name, an address, or an inventory item.

There are a number of different ways of identifying a business entity, and so a number of different kinds of identifier:

  • System identifiers – used within a system to identify the record or entity. In databases, this is often the Primary Key.

  • Real-world identifiers – attribute(s) of the entity which have meaning outside of the system and are intended to be used to establish identity.

  • Alternative identifiers – attribute(s) of the entity which have meaning outside of the system and can be used to establish identity, although not necessarily intended to do so.

For example, within a system storing information about books, a book could be identified by:

  • A Primary Key (System identifier)

  • Its ISBN (Real-world identifier)

  • A combination of Title, Author and Publication date. (Alternative identifier)

EDQ makes no distinction between these kinds of identifier. Any or all of these types of identifier may be used to identify an entity for matching, either separately or in combination.

In EDQ, one or more attributes of an entity are mapped to an identifier in order to identify that entity.

Identifier Types

Different types of identifier exist so that specialist comparisons can be used to match data of different types, for example, for date comparison or number matching.

Note that the default set of identifier types consists of the base types (Date, Date Array, String, String Array, Number, and Number Array). These only allow a single attribute from each source data stream to be mapped to them. However, it is possible to extend the set of identifier types to add more specific identifiers and comparisons: for example, an Address identifier type that allows addresses in different structures to be mapped, and which is accompanied by specialist address comparisons.

The String Array type allows a single string to be matched against a string array, or one string array to be matched against another. The same applies to the Number Array and Date Array types.
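As a rough illustration of the array behaviour described above, the following Python sketch treats a single value as a one-element array and reports a match when the two sides share at least one element. This is not EDQ code: the function names are hypothetical, and the "any shared element" rule is an assumption made for the example. The actual outcome in EDQ depends on the comparison that is configured for the identifier.

```python
from typing import List, Sequence, Union

Value = Union[str, Sequence[str]]


def as_array(value: Value) -> List[str]:
    """Treat a single string as a one-element array (illustrative assumption)."""
    if isinstance(value, str):
        return [value]
    return list(value)


def any_element_matches(left: Value, right: Value) -> bool:
    """Return True if the two sides share at least one exactly equal element."""
    return bool(set(as_array(left)) & set(as_array(right)))


# A single string against a string array
string_vs_array = any_element_matches("Smith", ["Smith", "Smyth"])

# A string array against another string array
array_vs_array = any_element_matches(["Jon", "John"], ["John", "Jack"])

# No element in common
no_overlap = any_element_matches("Jones", ["Smith", "Smyth"])
```

The same shape would apply to numbers or dates by changing the element type.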

Use

Use the Identify configuration step to map the attributes that you want to match to identifiers. Identifiers are then used in clustering and matching.

This allows you to resolve any differences in attribute names between data streams. For example, the attributes lname in one data stream, and SURNAME in another data stream could both be mapped to a surname identifier.
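The mapping idea can be sketched in Python as a lookup from identifier name to the attribute that supplies it in each data stream. This is a conceptual sketch only: in EDQ the mapping is made in the Identify configuration, and the data stream names (customers, prospects), the IDENTIFIER_MAP structure, and the identifier_values function are all hypothetical.

```python
# Hypothetical mapping: identifier name -> {data stream name: attribute name}
IDENTIFIER_MAP = {
    "surname": {"customers": "lname", "prospects": "SURNAME"},
}


def identifier_values(stream, record, identifier_map=IDENTIFIER_MAP):
    """Return {identifier: value} for one record from the named data stream."""
    return {
        identifier: record.get(sources[stream])
        for identifier, sources in identifier_map.items()
        if stream in sources
    }


# Records from two data streams that name the same attribute differently
customer = {"lname": "Jones", "city": "Leeds"}
prospect = {"SURNAME": "Jones", "TOWN": "Leeds"}

customer_ids = identifier_values("customers", customer)
prospect_ids = identifier_values("prospects", prospect)

# Both records now expose the value under the same identifier name
print(customer_ids, prospect_ids)
```

Once both records are expressed in terms of the same identifier, later steps can compare them without needing to know the original attribute names.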

Note that when matching more than one data stream (for example, when linking), you can match an attribute in one data stream against more than one attribute in another data stream by creating two identifiers. This allows the matching process itself to compensate for data that was entered in the wrong fields.
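A minimal Python sketch of the two-identifier approach follows. It assumes a working data stream with a single phone attribute and a reference data stream with home_phone and mobile_phone attributes; these names, the identifier names, and the exact-equality test are all hypothetical and stand in for whatever attributes and comparisons are configured in EDQ.

```python
# Two hypothetical identifiers that both use the same working attribute,
# each paired with a different reference attribute.
IDENTIFIERS = {
    "phone_home": {"working": "phone", "reference": "home_phone"},
    "phone_mobile": {"working": "phone", "reference": "mobile_phone"},
}


def agreeing_identifiers(working_rec, reference_rec, identifiers=IDENTIFIERS):
    """Return the names of the identifiers whose two values are equal."""
    agreed = []
    for name, sources in identifiers.items():
        left = working_rec.get(sources["working"])
        right = reference_rec.get(sources["reference"])
        if left is not None and left == right:
            agreed.append(name)
    return agreed


working = {"phone": "555-0100"}
# The number was entered in the mobile field rather than the home field
reference = {"home_phone": "555-0199", "mobile_phone": "555-0100"}

agreed = agreeing_identifiers(working, reference)
print(agreed)
```

Because the working attribute is mapped to two identifiers, the pair of records can still agree even though the value sits in a different field in the reference data.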

Identifiers may be added in two ways:

  • From the configuration view panel, with the Input sub-processor selected

  • From within the Identify sub-processor

When working with a single data stream, such as in a Deduplicate match processor, it is simplest to add the identifiers directly from the configuration panel when viewing the input attributes. When working with multiple data streams, such as in a Consolidate, Link or Enhance match processor, attributes from each data stream will need to be mapped to the identifiers in the Identify dialog. In this case, you might first create the required identifiers from the input attributes view, but you will need to open the Identify dialog to map them.

Auto-Mapping Identifiers

Auto-Map functionality is available both within the Identify sub-processor and from the configuration view panel when the Input sub-processor is selected.

Auto-Map is of most use when you want to create identifiers for all the attributes in the input data streams, and a consistent naming convention is in use. Auto-Map creates an identifier for each unique attribute name found in all the working and reference data input streams, and maps all input attributes with that name to the appropriate identifier.
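The effect of Auto-Map can be approximated with the short Python sketch below. It is an illustration of the idea rather than EDQ's implementation: the auto_map function and the stream and attribute names are hypothetical, and attribute names are assumed to be compared exactly (including case), which the text above does not specify.

```python
def auto_map(streams):
    """Build {identifier: {stream: attribute}} from {stream: [attribute names]}.

    One identifier is created per unique attribute name, and every attribute
    with that name is mapped to it (exact name match is an assumption).
    """
    mapping = {}
    for stream, attributes in streams.items():
        for attribute in attributes:
            mapping.setdefault(attribute, {})[stream] = attribute
    return mapping


# Hypothetical input data streams that share a naming convention
streams = {
    "working": ["surname", "postcode", "dob"],
    "reference": ["surname", "postcode", "email"],
}

result = auto_map(streams)
for identifier, sources in sorted(result.items()):
    print(identifier, sources)
```

Attributes that appear in only one data stream still receive an identifier, but it is mapped from that stream alone, which is why Auto-Map is most useful when the streams name their attributes consistently.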