1.3.4.11.5 Merge

Merge is a sub-processor of all matching processors except Link (where no record merging occurs).

Merging records is an optional part of matching, that allows you to create new 'best' output records from the matching process. These 'best' records are constructed from the multiple records in each match group, using a combination of automatic selection rules and manual decisions.

For example, using automatic rules, the match processor might output the Most Common Value for an attribute from the records found in a match group, the Latest Value using an input date (such as a 'last edited by' field), or the First Non-empty Value, preferring one source of data over another.

Any errors in the automatic output selection for an attribute mean that the attribute is marked with a status of Fail. Groups which contain any failed attributes are also marked with a Fail status. These errors can be manually resolved at the Review stage. For example when consolidating two duplicate records from two different systems, if automatic rules fail to select values (for example because there is no value that is 'Most Common' in the records in the match group, the user can choose one value (such as name) from one record, and another (such as email address) from a related record, depending on which is seen to be best. From the Review screens, it is possible to review the output for records that contained selection errors. The Merge Summary results view also shows the number of failed match groups, so that you are aware of how many errors need resolution.

The typical use of the Merge sub-processor varies according to the type of match processor you are using. When deduplicating a data stream, or consolidating a number of data streams, automatic merge rules are typically used to create a set of duplicate-free output records - for example as part of a data migration project, or when re-purposing data, for example, in the preparation of a customer list for a mailshot.

When enhancing a set of data from reference sources, the merge rules might be used to add in data from the matching reference records. In this case, the original working records may be updated with better, trusted information, or perhaps simply appended with the new information.

Configuration

For the Deduplicate and Consolidate match processors, the Merge configuration has a set of default selection rules, allowing you to create a simple form of de-duplicated output.

For the Enhance processor, the Merge configuration has a set of default selection rules to enhance your working data from your reference data.

All of these default configurations are simple, and designed to create quick output from the matching process. If you need more precisely constructed merged output records, you will need to edit the rules to suit your needs.

To keep the default format for the merged output, keep the Auto Attribute Selection option ticked at the bottom of the dialog. Note that the attributes in the output may still change, as attributes are included for each input attribute. Adding or removing input attributes from the match processor will change the attributes in the default output. To customize the output, untick this box, and add or remove attributes. Note that if you change the output to a custom format, for example, by adding attributes, the Auto Attribute Selection option is automatically de-selected. This means that changing input attributes will not automatically add attributes to the output, though you can still add them manually if required.

Changing the Merged Output Rules

The merge rules are set separately for each output attribute that you require.

By default, an output attribute is included for any attributes that have the same name in all of the data streams presented to matching. (If only one data stream is presented to matching, as in De-duplication, this means an output attribute for every input attribute is created.)

Note:

The output selector used in the default output format is Most Common Value. This means that from all the records in the match group, the most common value for each attribute will be selected. Where there is no most common value, for example, one record in the group has a FirstName of 'Jhon' and the only other record has 'John', the selector has an option (Use first non-empty if tied) simply to pick the first non-empty value from the records in the group. By default, this is set, but you may want to unset it and raise an error where there is no most common value.

Additional output attributes exist for MatchGroup and MatchGroupSize, so that you can cross-reference the merged output with the Match Groups output, in order to maintain a full audit trail of all output selection decisions. A number of other internally generated attributes are available.

There are a number of options available for changing the set of merged output records:

Option Description Default

Generate Merged Output

Determines whether or not to generate the merged output (at all) or not. For example, if you have fully developed the matching process, and you are not using the merged output, you can save on performance by not generating it.

Selected

Output related records

Determines whether or not to output merged output records for groups of related records.

Selected

Output unrelated records

Determines whether or not to output unrelated records.

Selected, for Deduplicate and Consolidate processors.

Not Selected, for Enhance and Advanced Match processors.

Adding Attributes

To add a new output attribute to the merged output, use the Add button at the bottom of the dialog.

A number of further internally generated output attributes are available:

Attribute Name Description Attribute Value

Match_Group_Status

Match Group Status

The status of the Match Group.

FAIL, if there were any errors during output selection for the Match Group.

SUCCESS, if there were no errors during output selection for the Match Group.

Reviewed_Flag

Reviewed Status

An indicator on the Match Group to show if the merged output for it has been manually reviewed or not.

Review_User

Name of Reviewer

The user name of the last user to review the merged output for the Match Group.

Review_Date

Date of Latest Review

The date of the last review of the merged output for the Match Group.

Comment

Latest Comment

The latest comment made on the merged output.

Comment_User

Name of Latest Comment Provider

The user name of the user that made the latest comment.

Comment_Date

Date of Latest Comment

The date of the latest comment.

Alternatively, to create a new merged output attribute, that is, to merge in data from the records in the match group:

  1. Name the output attribute.

  2. Select Merged Value from the bottom of the list.

  3. Select the required Output selector on the right-hand side.

  4. Select the attribute or attributes from which you want to select data, in the appropriate order.

    Note:

    Note that the number of available inputs above varies depending on the number of source data streams that are input to the match processor. In some cases, an output selector might require a specific additional input. For example, the Earliest Value and Latest Value selectors require a date attribute to be used to choose the earliest or latest record from which to select a value.

  5. Configure the options (if any) of the Output selector on the Options tab.

  6. Configure whether or not you want to allow the output attribute to contain a Null value. This option determines whether or not a Null value selection will be considered an error, when automatically selecting output for the attribute. If Nulls are allowed (by default) in the output attribute, an error will not be raised on the attribute if a Null value is selected according to the rule of the output selector.

If you want to select data for output using more complex rules, it is possible to add your own output selectors into the available set.

Output Selection Errors

An output selection error occurs whenever the output selector for a merged output attribute cannot select a single valid value from the input records in the match group. If a merged output record with no errors cannot be output from a match group (that is, the merged output record contains any output selection errors on any attributes), the match group is flagged as a 'failed' group. The failure is intended as a flag to indicate that the merging of data from the match group requires manual review, so that where a value could not be selected using automatic merging rules, the merged output record can be resolved manually.

Note that wherever values could be correctly selected, these are still output (even on failed groups), so it may still be possible to use merged output records with selection failures.

Examples of output selection errors:

Ambiguous Selection

If the output selector finds more than one value that could be considered as the output value, and has not been configured to select arbitrarily between them, an output selection error is raised, and no value will be selected for the merged output attribute. For example, an error will be raised if attempting to select the Most Common Value for a Date of Birth output attribute from the following records in a match group, if the Use first non-empty value if tied? option is not ticked:

Record Date of Birth

A

01/10/1975

B

01/10/1975

C

10/01/1975

D

10/01/1975

In the above case, selection is ambiguous because there are two values that occur twice; that is, there is no single most common value.

Disallowing Null values

You may want to apply a rule to disallow Null values in a merged output attribute, so that you can ensure the completeness of this attribute for downstream processing.

For example, you may want to flag all groups where a Postcode attribute value could not be selected, by unticking the Allow Nulls option on the output selector. If all records in the match group have Null values for the Postcode attribute, a selection error will be raised regardless of the output selector used, as there is no non-Null value to select.