1.3.4.11.3 Cluster

Cluster is a sub-processor of all matching processors except Group and Merge. The purpose of the Cluster stage of match configuration is to configure the clustering process, which stops matching from performing unnecessary comparisons between records. Without clustering, matching would be a very inefficient process, even on small data streams, as every record in each data stream would need to be compared with every other record.

Use clusters to divide up the input records into groups of records (cluster groups) with common cluster keys, within which record comparisons are performed.

The configuration of a cluster consists of one or more identifiers, and optionally a number of ordered transformations of those identifiers. A cluster key is then generated for each record based on that configuration, and the records are grouped by cluster key.

Where more than one identifier is used in a cluster (a Composite cluster), the identifier values (or transformed identifier values) are concatenated together to form the cluster key for each record.
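The following Python sketch illustrates how a composite cluster key could be formed and used to group records. It is not the product's implementation: the function names, the record layout (a dictionary per record) and the sample attributes are assumptions made for illustration, and all identifier values are assumed to be present.

```python
from collections import defaultdict

def cluster_key(record, identifiers):
    """Concatenate the (transformed) identifier values into one cluster key.

    `identifiers` is a list of (attribute_name, [transformations]) pairs;
    the transformations are applied in the order given.
    """
    parts = []
    for attribute, transformations in identifiers:
        value = record[attribute]
        for transform in transformations:
            value = transform(value)
        parts.append(str(value))
    return "".join(parts)

def group_by_cluster_key(records, identifiers):
    """Group records into cluster groups keyed by their cluster key."""
    groups = defaultdict(list)
    for record in records:
        groups[cluster_key(record, identifiers)].append(record)
    return dict(groups)

# Hypothetical composite cluster: first 3 characters of the upper-cased
# Surname, followed by the first 2 characters of the Postcode.
identifiers = [
    ("Surname", [str.upper, lambda s: s[:3]]),
    ("Postcode", [lambda s: s[:2]]),
]
records = [
    {"Surname": "Smith", "Postcode": "CB4 0WS"},
    {"Surname": "Smithson", "Postcode": "CB1 2AB"},
    {"Surname": "Jones", "Postcode": "CB4 0WS"},
]
groups = group_by_cluster_key(records, identifiers)
```

In this sketch the first two records share the key SMICB and so fall into the same cluster group, while the third record is placed in a group of its own.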

If a single array type identifier is used in a cluster, the cluster key will be generated for all the elements in the array.

If multiple array type identifiers are used in a cluster, a cluster key will be generated for every combination of array elements. For example, if two arrays, each containing two elements, are used in a cluster, then four cluster keys will be generated.
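The combination rule for array identifiers can be pictured as a Cartesian product. The sketch below is illustrative only; the function name and the sample values are assumptions, and the keys are formed by simple concatenation as for a composite cluster.

```python
from itertools import product

def array_cluster_keys(*arrays):
    """Return one cluster key per combination of elements, one from each array."""
    return ["".join(combination) for combination in product(*arrays)]

# Two arrays of two elements each give 2 x 2 = 4 cluster keys.
keys = array_cluster_keys(["SMITH", "SMYTHE"], ["1975", "1976"])
```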

Use the Add Identifier button to add an identifier to the cluster, and the Add Transformation button to add transformations to each identifier.

Note that the transformations that you can validly apply to an identifier depend on the data type (that is, String, Number or Date) of the identifier. You can change the data type of the identifier using one of the Convert transformations (such as Convert Date to String). If you configure any invalid transformations, these will appear in red.

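For example, if a Convert String to Date transformation is followed by a First N Characters transformation, First N Characters is invalid because it would be applied to a Date value. If the Convert String to Date transformation is deleted, the First N Characters transformation becomes valid.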
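The type rule can be sketched as a check that walks the ordered transformations and tracks the current data type. This is a simplified model, not the product's validation logic: the type table below is an assumption inferred from the transformation names, and the function name is hypothetical.

```python
# Assumed (input type, output type) for each transformation name.
TRANSFORMATION_TYPES = {
    "Convert String to Date": ("String", "Date"),
    "Convert Date to String": ("Date", "String"),
    "First N Characters": ("String", "String"),
}

def invalid_transformations(identifier_type, chain):
    """Return the transformations in `chain` that do not accept the current type."""
    current_type = identifier_type
    invalid = []
    for name in chain:
        expected_type, produced_type = TRANSFORMATION_TYPES[name]
        if expected_type != current_type:
            invalid.append(name)  # this is the kind of step that would be flagged
        else:
            current_type = produced_type
    return invalid

with_conversion = invalid_transformations(
    "String", ["Convert String to Date", "First N Characters"]
)
without_conversion = invalid_transformations("String", ["First N Characters"])
```

In this model, First N Characters is reported as invalid while it follows Convert String to Date, and the list of invalid transformations is empty once the conversion is removed.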

Additional options - overriding defaults

Three additional options are available when configuring a cluster. Normally, these options do not need to be changed from their default settings, but you may want to change them in specific cases. The options are:

  • Cluster Group Limit

  • Cluster Comparison Limit

  • Allow Nulls

Cluster Group Limit

The Cluster Group Limit is the maximum number of records that are allowed to be in a single cluster. By default, the Cluster Group Limit is 500 records.

If a cluster contains more records than this limit, that cluster will be ignored by matching, as it would require too many comparisons to be performed. For example, with a simple clustering configuration that uses the first 5 characters of a Surname, there may be more than 500 records with the cluster key 'SMITH'. Normally, in this case, you would change your clustering configuration to be more selective, so that it generates smaller groups. However, in some cases, you may simply want to raise the limit so that the larger clusters are not ignored.
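Before raising the limit, it can help to see which cluster keys would exceed it. The sketch below counts records per cluster key and reports the oversized ones; the function name and sample keys are assumptions, and the comparison is strictly "more than" the limit, following the description above.

```python
from collections import Counter

def oversized_clusters(cluster_keys, group_limit=500):
    """Return {cluster_key: record_count} for clusters larger than the limit."""
    counts = Counter(cluster_keys)
    return {key: count for key, count in counts.items() if count > group_limit}

# Hypothetical keys: 501 records share 'SMITH', 500 share 'JONES'.
keys = ["SMITH"] * 501 + ["JONES"] * 500
ignored = oversized_clusters(keys)
```

Here only the 'SMITH' cluster exceeds the default limit of 500; the 'JONES' cluster holds exactly 500 records and is still processed.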

Cluster Comparison Limit

The Cluster Comparison Limit is the maximum number of comparisons that the match comparison engine may perform on a single cluster; a cluster that would require more comparisons than this is discarded. By default, the Cluster Comparison Limit is set to null, meaning that there is no limit.

The number of comparisons that a cluster will produce can be calculated before the cluster processing begins. If the number of comparisons exceeds the cluster comparison limit, the cluster will be discarded before processing, and no relationships will be generated for that cluster.
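As a rough illustration of how the comparison count can be known in advance, the sketch below assumes the simplest case, in which every record in a cluster is compared once with every other record in the same cluster, giving n(n-1)/2 comparisons for n records. The actual count depends on the matching processor and on which data streams are compared with each other, so treat the formula and the function names as assumptions.

```python
def comparisons_in_cluster(record_count):
    """Number of distinct record pairs in a cluster of `record_count` records."""
    return record_count * (record_count - 1) // 2

def is_discarded(record_count, comparison_limit=None):
    """Return True if the cluster would exceed the comparison limit.

    A limit of None models the default (null) setting: no limit applies.
    """
    if comparison_limit is None:
        return False
    return comparisons_in_cluster(record_count) > comparison_limit

pairs = comparisons_in_cluster(500)            # 124750 pairs for 500 records
kept_by_default = not is_discarded(500)        # no limit configured
dropped = is_discarded(500, comparison_limit=100000)
```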

Allow Nulls

The Allow Nulls option controls whether or not a cluster is created for all the records where the configured cluster key is Null.

By default, Null cluster keys are allowed, and a group will be generated.

For example, if your cluster is simply the whole value of an Email attribute, do you want to compare all records with Null values in the Email attribute with each other? If you do not, you might set this option to False.

Note that if the option is left at its default setting of True, the cluster for the Null cluster key will be generated, but it will often contain more records than the Cluster Group Limit (above), and will therefore be ignored by matching in any case.
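The effect of the option can be sketched as follows, using Python's None to stand for a Null cluster key. The function name, the record layout and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def build_clusters(records, key_of, allow_nulls=True):
    """Group records by cluster key, optionally skipping Null (None) keys."""
    clusters = defaultdict(list)
    for record in records:
        key = key_of(record)
        if key is None and not allow_nulls:
            continue  # no cluster is generated for the Null cluster key
        clusters[key].append(record)
    return dict(clusters)

# Hypothetical cluster on the whole value of an Email attribute.
records = [
    {"Name": "A", "Email": "a@example.com"},
    {"Name": "B", "Email": None},
    {"Name": "C", "Email": None},
]
with_nulls = build_clusters(records, lambda r: r["Email"])
without_nulls = build_clusters(records, lambda r: r["Email"], allow_nulls=False)
```

With the option set to True, the two records without an Email value are grouped together and would be compared with each other; with it set to False, they are not placed in any cluster.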

Example

In this example, the first few characters of a Surname attribute (transformed to upper case) and the year part of a Date_of_Birth attribute are used to create clusters in a set of customer data. Because Date_of_Birth is a Date attribute, it is first converted to a String (using the format ddMMyyyy), so that the last 4 characters can be taken to represent the year.

The default Cluster Group Limit of 500 is used, and the cluster allows a Null cluster key to be generated.
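The key generation in this example can be approximated in Python as shown below. This is a sketch rather than the product's behavior: the function name is hypothetical, the number of Surname characters (4) is an assumed value for "the first few characters", and the strftime pattern %d%m%Y stands in for the ddMMyyyy format.

```python
from datetime import date

def example_cluster_key(surname, date_of_birth, surname_chars=4):
    """Upper-cased start of the surname plus the 4-digit year of birth."""
    surname_part = surname.upper()[:surname_chars]
    # Convert the date to a ddMMyyyy string, then take the last 4 characters.
    dob_string = date_of_birth.strftime("%d%m%Y")
    year_part = dob_string[-4:]
    return surname_part + year_part

key = example_cluster_key("Smith", date(1975, 3, 9))
```

Under these assumptions, a customer named Smith born on 9 March 1975 receives the cluster key SMIT1975, and is compared only with other records that share that key.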