1.3.4.7 Advanced Options for Match Processors

A few settings for a match processor are stored as Advanced Options. To access these options, click on the Advanced Options link after opening a match processor.

These settings do not normally need to be changed, but may be adjusted in some cases.

The following options are available on the Advanced tab:

Match groups share working records? [Match Review only]

This option drives whether or not working records should ever be placed together in the same match groups. For example, when enhancing or linking data, the objective is often to consider each working record on its own and match it only to one or more reference data sources. In this case, this option should be turned off, so that each match group contains only a single working data record. Otherwise, even if working records are not directly being compared with each other, they could be placed into the same match group if they both match the same reference data record.
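The difference between the two settings can be illustrated with a simplified model. The sketch below is not EDQ's grouping algorithm; the function names and the representation of matches as (working, reference) pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def groups_sharing(matches):
    """Option on: records connected through any match end up in one group."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for working, reference in matches:
        parent[find(working)] = find(reference)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


def groups_not_sharing(matches):
    """Option off: one group per working record, plus its reference matches."""
    groups = defaultdict(set)
    for working, reference in matches:
        groups[working].update({working, reference})
    return sorted(sorted(group) for group in groups.values())


# W1 and W2 are never compared with each other, but both match R1.
matches = [("W1", "R1"), ("W2", "R1"), ("W3", "R2")]
print(groups_sharing(matches))      # W1 and W2 share a group via R1
print(groups_not_sharing(matches))  # each group holds one working record
```

In this model, W1 and W2 land in the same group only when sharing is allowed, which mirrors the case described above where two working records match the same reference data record.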

Cluster Size Limit

The Cluster size limit is the default upper limit on the number of records in a cluster. The Match sub-processor does not perform comparisons between records in a cluster if this limit is exceeded. For each cluster where this occurs, a warning message is displayed on the processor panel for the match processor when it is run, and is written to the log file.

The default Cluster size limit is 500 records. This setting can be overridden for a specific cluster in Cluster configuration.

Note:

It may be desirable for some clusters to be skipped in this way when comparing records. For example, if you use multiple clusters, one cluster function might yield a very large cluster in which all records have a null cluster value or an extremely common cluster value (for example, a Surname of SMITH). Records that should be matched may still be compared with each other because they share another cluster.
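The effect of the size limit when several cluster functions are in use can be sketched as follows. This is a minimal illustration, not EDQ's implementation: the function name, the nested dictionary layout and the small limit of 3 used in the example are all assumptions.

```python
from itertools import combinations


def candidate_pairs(clusters_by_function, size_limit=500):
    """Collect record pairs to compare, skipping clusters over the limit."""
    pairs = set()
    skipped = []
    for function, clusters in clusters_by_function.items():
        for key, records in clusters.items():
            if len(records) > size_limit:
                # The cluster is skipped; EDQ would report a warning here.
                skipped.append((function, key))
                continue
            pairs.update(combinations(sorted(records), 2))
    return pairs, skipped


clusters = {
    "surname": {"SMITH": [1, 2, 3, 4]},         # very common value
    "postcode": {"CB1": [1, 2], "CB2": [3, 4]},
}
pairs, skipped = candidate_pairs(clusters, size_limit=3)
print(sorted(pairs))  # pairs found through the postcode clusters
print(skipped)        # the oversized SMITH cluster
```

Records 1 and 2 are still compared because they share a postcode cluster, even though the SMITH cluster that also contains them is skipped.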

Match and Review Group Size Limits

It is possible for a Match process to run out of memory while trying to load Match groups and Review groups for output processing, reviewing and case generation.

The Match Group Size Limit and Review Group Size Limit fields set an upper limit on the number of groups that can be generated. By default, both fields are set to 5000. Clearing the fields will result in no upper limit being set.

Cluster Comparison Limit

The Cluster comparison limit is the default upper limit on the number of comparisons that should be performed on a single cluster. Before a cluster is processed, the number of comparisons that would be performed on it is calculated; if that number is greater than the limit, the cluster is skipped.

This offers a more intelligent way of finding and excluding the clusters that are most expensive to process when working with multiple data sets, where not all of the records in the same cluster will be compared with each other. For example, a cluster may contain 1000 records, but if 999 of those records are from a single data set whose records are not compared with each other, and only 1 record is from the second data set, only 999 comparisons will be performed.

In cases where all records in a cluster are compared with each other, the number of comparisons will be much higher. For example, 249500 (500*499) comparisons will be performed for a 500 record cluster in a Deduplicate processor.

By default, the Cluster comparison limit is set to no value (that is, no limit is applied).

Note also that this setting can be overridden for a specific cluster in Cluster configuration.
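The two figures quoted above can be reproduced with a short calculation. The sketch below assumes a cluster made up of one working data set and one reference data set, and it uses the n*(n-1) count given in the text for records compared within the same data set. The function names are hypothetical, and EDQ's internal calculation may differ.

```python
def estimated_comparisons(working, reference, compare_working_with_self):
    """Estimate the comparisons needed for one cluster.

    working / reference: number of records from each data set in the cluster.
    """
    total = working * reference          # every working record vs every reference record
    if compare_working_with_self:
        total += working * (working - 1)  # count used in the text above
    return total


def should_skip(comparisons, limit):
    """A limit of None models the default of no value (no limit applied)."""
    return limit is not None and comparisons > limit


# 999 records from one data set, 1 from the other, no self-comparison.
print(estimated_comparisons(999, 1, False))   # 999
# 500 record cluster in a Deduplicate processor.
print(estimated_comparisons(500, 0, True))    # 249500
```

With a hypothetical limit of 100000, the first cluster would be processed and the second skipped, even though the first contains twice as many records.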

Cluster Split Threshold

Match processors with a single working data input (with the Compare against self field unchecked) and multiple reference data inputs can split large clusters into sub-clusters, in order to allow more efficient processing in a multithreaded environment. Clusters larger than the value set in this field will be automatically split at each threshold into smaller groups and assigned to multiple threads.

The default value for the threshold is 250. Setting it to 0 disables the feature, ensuring each cluster is processed by a single thread.

This option is available only for processors that meet the conditions described above.
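As a rough illustration of splitting at each threshold, the sketch below chunks a list of record identifiers. Which records EDQ actually partitions, and how the sub-clusters are assigned to threads, is not stated here, so the function and its behavior are assumptions.

```python
def split_cluster(records, threshold=250):
    """Split a cluster into sub-clusters of at most `threshold` records.

    A threshold of 0 disables splitting, so the cluster stays whole.
    """
    if threshold <= 0 or len(records) <= threshold:
        return [records]
    return [records[i:i + threshold] for i in range(0, len(records), threshold)]


sub_clusters = split_cluster(list(range(600)))
print([len(chunk) for chunk in sub_clusters])  # [250, 250, 100]
```

Each resulting chunk could then be handed to a separate worker thread, while a threshold of 0 leaves a single chunk for a single thread.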

Allow Null Clusters

This option determines whether or not clusters with Null values will be generated. For example, if you configure a cluster on a Postcode attribute, you may need to decide whether or not to compare all records with a Null Postcode with one another, for possible matches.

Note that the original attribute or attributes used in clustering may not be null, but the cluster value may still be null, as any value may be removed by transformations. For example, using the Trim Whitespace and Strip Words transformations to remove whitespace, and words such as 'Company' and 'Limited', from cluster values would mean that the value 'Company Limited' would be indexed as a Null value.
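The 'Company Limited' example can be sketched as below. The function is a simplified stand-in for the Trim Whitespace and Strip Words transformations, not their actual implementation, and its name and word list are assumptions.

```python
def cluster_value(raw, strip_words=("Company", "Limited")):
    """Return the cluster value after transformation, or None if nothing is left."""
    kept = [word for word in raw.split() if word not in strip_words]
    value = " ".join(kept).strip()
    return value if value else None


print(cluster_value("Acme Company Limited"))  # Acme
print(cluster_value("Company Limited"))       # None
```

The second input is not null, but every word is stripped, so the record would fall into the Null cluster.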

Note that by default, Null clusters are created, though they may be ignored when matching, as the group may contain a large number of records (over the Cluster size limit).

Note also that this setting can be overridden for a specific cluster in Cluster configuration.

Use review relationships in match groups [Match Review only]

By default, a match group consists only of records that are related to each other by means of a relationship with a decision of Match, that is, a relationship that has been decided, either using automatic rules or by a manual decision, to be a positive match.

However, during the development of a matching process, the final structure of the match groups (after all relationships have been reviewed) may not be known. In order to aid an external review process, and to allow the output of a match processor to provide a full picture of all relationships created by matching, you may choose to include relationships that are still under review when reporting on, or merging, match groups.

Ticking this option therefore changes the way match groups are formed, so that they include relationships that are awaiting review. The option may be changed at any time, but applies to all types of output produced from the match processor, including the final merged output. It should therefore only be changed during the development of a matching process.

Token Attribute Prefix

This option is usually only applicable when using a Deduplicate match processor for Real time duplicate prevention. It allows you to configure the prefix used on cluster key attributes, which may be output on the Clustered output filter from a Deduplicate processor (in order to issue an initial response to the calling system with the appropriate cluster key values for a new record). The given prefix will be used before the name of the cluster to form new attribute names. For example, for a 'Name_Meta' cluster, using the default prefix of 'Clustered_', the name of the output attribute will be 'Clustered_Name_Meta'.
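The naming rule described above amounts to a simple concatenation. The helper name below is hypothetical and is shown only to make the rule concrete.

```python
def clustered_attribute_name(cluster_name, prefix="Clustered_"):
    """Form the output attribute name from the prefix and the cluster name."""
    return prefix + cluster_name


print(clustered_attribute_name("Name_Meta"))  # Clustered_Name_Meta
```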

Sort and Filter

The Sort/Filter options allow you to improve matching performance if you know you do not require the ability to sort, filter and search the outputs of a matching process in Match Review. This will be the case if Case Management is in use, or if you do not need users to review match results at all.

There are three possible settings for each match processor:

  • Enable Sort/Filter (default)

  • Do Not Enable Sort/Filter

  • Use Intelligent Sort/Filtering

Enable Sort/Filter means that sorting and filtering in Match Review will be enabled on the outputs from the match processor, as long as the execution preferences for the process or job do not override the setting (see Process Execution Preferences). Use this setting whenever users need to use Match Review to review the results of the match process.

Do Not Enable Sort/Filter means that sorting and filtering in Match Review will never be enabled on the outputs from the match processor (regardless of the process or job level options). This will mean that the results of the match processor cannot be reviewed. Use this setting if you know that Match Review will not be used to process the results.

Use Intelligent Sort/Filtering means that the data size of the match outputs (using both rows and columns) will drive whether or not sorting and filtering in Match Review will be enabled. A configurable system property sets the size above which reviewing, sorting, filtering and searching will not be enabled. Use this setting if you are designing the match process on a sample data set (perhaps less than 100,000 rows) and need to review results during the design phase, but the results will not require user review in Match Review once the match process is deployed on the full data set (which may comprise several million rows).

Relationship Decision Trigger

This option allows you to choose a configured trigger action to fire when a relationship decision is made.

A trigger can be any kind of action - for example, a trigger might send a JMS message, call a Web Service, or send a notification email. Triggers may include the relationship and decision data.

Triggers must be set up on an EDQ server by an administrator. If you need to set up a trigger (for example to notify another application when a match decision is made in the Match Review application), please contact Support for more information.

Review System

The 'Review System' option is used to control whether or not manual review of the results from the match processor will be enabled, and which type of review UI is used. The three options are:

  • No Relationship Review - the match processor will not write any data for manual review in EDQ. Note that match results may still be written, and may be reviewed externally.

  • Match Review - the match processor will write the results from its latest run for users to review in the EDQ Match Review UI.

  • Case Management - the match processor will publish results to the EDQ Case Management UI each time it runs.

For more information on which review system to use, see the topic Reviewing match results.

Cache reference records for real-time processes

When running a Real-time reference matching service, enabling this option means that the Reference Data (being the data sets connected to the Reference Data inputs of the Match processor) in a real-time match process will be cached and interrogated in memory on the EDQ server rather than stored and interrogated from the Results database. This option should only be enabled if sufficient memory is available and allocated to EDQ.

Changing the Decision Key [Match Review only]

The Decision Key consists of a set of input attributes that are used in a hashing algorithm to re-apply (that is, 'remember') manual match decisions. This means that any manual match decisions made on a pair of records will be re-applied on subsequent runs of the matching process as long as the data values in the attributes that make up the decision key remain the same.

So, for example, if you are matching individuals using name and address details, and one of the manually matched records changes, you may want to reappraise the records rather than apply a manual decision that was made based on different data. However, if the value in another attribute changes, you may consider there to be no real change to the details of the record used in matching. For example, a Balance attribute containing a numerical amount might be input to a matching process because it is used in the output selection logic, but a change to the attribute value should not cause a reappraisal of the decision to match, or not match, the record against another.

By default, all attributes that have been mapped to identifiers are included in the Decision Key (unless the match processor has been upgraded from a previous version - see note below). However, you can change the Decision Key to use all the attributes input into a match processor, or customize the key by selecting exactly which attributes make up the key. For example, if you want always to re-apply match decisions as long as the records involved are the same, even if the data of those records changes, you can select only the primary key attributes of records in each source involved in matching.

Note:

As the ability to configure the decision key to use a subset of the input attributes is a new feature at version 7.0, any match processors configured using an older version of EDQ will have All attributes selected, though you can change this without losing any decisions that have already been made.
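The way a Decision Key 'remembers' a decision can be sketched with an ordinary hash. EDQ's actual hashing algorithm is not described here; the use of SHA-256, the function name and the record layout below are all assumptions made for illustration.

```python
import hashlib


def decision_key(record_a, record_b, key_attributes):
    """Hash the key attribute values of a record pair (order-independent)."""
    sides = sorted(
        repr(tuple(record.get(attr) for attr in key_attributes))
        for record in (record_a, record_b)
    )
    return hashlib.sha256("\n".join(sides).encode("utf-8")).hexdigest()


a = {"id": 1, "name": "John Smith", "address": "1 High St", "balance": 100}
b = {"id": 2, "name": "Jon Smith", "address": "1 High Street", "balance": 250}
key = ["name", "address"]

original = decision_key(a, b, key)

# Balance is not part of the key, so the stored decision still applies.
a_new_balance = dict(a, balance=999)
print(decision_key(a_new_balance, b, key) == original)  # True

# Address is part of the key, so the pair would be reappraised.
a_new_address = dict(a, address="2 Low Rd")
print(decision_key(a_new_address, b, key) == original)  # False
```

Selecting only the primary key attribute (here the hypothetical `id`) as the key would keep the hash stable whatever else changes, which corresponds to always re-applying decisions for the same records.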

What if I change the Decision Key after decisions have been made?

In general, you should decide how to configure the Decision Key before making a matching process operational and assigning its results for review. However, if decisions have already been made when the construction of the Decision Key changes, EDQ will make its best effort to retain those decisions within the following limitation:

If an attribute that was formerly used in a Decision Key is no longer input to the match processor, it will not be possible to reapply any decisions that were made using that key.

This means that adding attributes to a Decision Key can always be done without losing any previous decisions, providing each decision is unique based on the configured key columns.

Note that it is still possible to remove an attribute from a Decision Key and migrate previous decisions, by removing it from the Decision Key in this tab but keeping the input attribute in the match processor for at least one complete run with the same set of data as run previously. Once this has been done it will be safe to remove the attribute from the matching process.

Configuring Case Sources [Case Management only]

Case Sources are used to define the permissions, workflow and data to be used when Case Management is active. Case Sources are configured on the Case Source tab on this screen.

Configuring Workflow parameters [Case Management only]

Workflow parameters are used by Case Management to provide enhanced processing of Cases and Alerts. They are configured on the Workflow parameters tab on this screen.