This chapter describes how EDQ-CDS can be customized to take advantage of some of the more advanced features of the product.
EDQ-CDS has been designed to perform well with minimal customization. Ready-to-use, the application can perform clustering and matching of individual, entity and address data in connected supported applications with little or no configuration changes required.
EDQ-CDS is designed to process customer data from any external system or stand-alone source. By default, pre-configured batch jobs are provided that work with a set of staging tables. Reconfiguring the product to process data from other sources, such as a text file, is straightforward.
In order to reuse the batch data matching services provided, it is necessary to create new input and output mappings for the data interfaces. The following sections use examples that demonstrate how to do this and how to run matching using a modified copy of an existing job configuration.
You can create a new stand-alone individual batch matching job using the following example steps:
Ensure that no jobs are currently running.
In the EDQ-CDS project, create a new server-side data store named File In: Individuals that points to the structured text file containing the customer data to be processed. It is important that this is created as a server-side data store in order to be used within a job definition.
Create a new snapshot named Individuals using the File In: Individuals data store as a source.
Create the Input Data Interface mappings as follows:
Right-click the Individual Candidates data interface and select Mappings... to open the Data Interface Mappings dialog.
Click Add to open the New Data Interface Mappings dialog.
Select the Individuals snapshot as the source and click Next. The Staged data default type is used.
Map the Customer Data Attributes on the left of the dialog to the Data Interface Attributes on the right as follows:
Note:
In some instances, it may be necessary to construct a process that reads from the snapshot and reshapes the data to match the Data Interface. See Section 4.1.2, "Converting Data to the Interface Format."
Click Next.
Name the data interface mapping Individual Candidates and click Finish to save.
Click OK.
Create a new Staged Data named Individual Matches with the following columns:
Create the Output Data Interface mappings as follows:
Right-click the Matches data interface and select Mappings... to open the Data Interface Mappings dialog.
Click Add to open the New Data Interface Mappings dialog.
Select the Individual Matches staged data as the target and click Next.
Map the Matches data interface attributes on the left to the Result Staged Data attributes on the right as required.
Click Next.
Name the Data Interface mapping Individual Matches and give it a description, then click Finish.
Click OK to close the dialog.
Create a new server-side delimited text data store called File Out: Individual Matches to use as a target for the match results. Alternatively, the data can be written to a database if required.
Create a new export called Matches to File Out: Individual Matches that uses the Matches data interface as the source to export from, and the File Out: Individual Matches as the target for the export.
Create and configure a job to run matching as follows:
Create a copy of the Batch Individual Match job, rename it Batch Individual Match using Text File, and then open it.
Open the Individual Match job phase, change the source of the input data by double-clicking on the Individual Candidates data interface and selecting the Individual Candidates mapping.
Click OK to apply the changes. The job configuration is modified accordingly and the old snapshot and staged data items are disconnected.
Drag the Individuals snapshot from the Snapshot section of the Tool Palette into the open job phase and make sure it is connected to the Individual Candidates mapping.
Drag the Matches to File Out: Individual Matches export task from the Export section of the Tool Palette into the open job phase and connect it to Match Results - Output.
Delete the Batch Matches export task.
It may not always be possible to directly map the input source to the candidates interface if:
fields are of the wrong data type (for example, "Date of Birth" in a date field); or
fields need transforming to a compatible format/structure (for example, Individual names in a full name field).
If this is the case, then the input data should be run through a custom EDQ process to convert the data as appropriate as in the following example steps:
Ensure that no jobs are currently running.
Create a data store and snapshot for the input data as in steps 2 and 3 from Section 4.1, "Using Stand-Alone Batch Matching."
In the EDQ-CDS project, right-click the Processes node in the Project Browser and select New Process... to open the New Process wizard.
Select the snapshot created in step 2 as the data source.
Click Next.
On the last page of the wizard, rename the process Transform Individuals, then click Finish to create the process.
On the Process canvas, add the necessary processors to transform the data to the interface format. For example, use a Convert Date to String processor to convert a date of birth in date format to the required format for the Candidates interface (for example, either yyyyMMdd, MM/dd/yyyy, yyyy-MM-dd or dd-MMM-yy).
Add a Writer processor to the process canvas and connect it to the process data stream:
In the Writer Configuration dialog, select the Individual Candidates data interface and map the attributes accordingly.
Create and configure a new job as follows:
Make a copy of the Batch Individual Match job, renaming it Batch Transformed Individual Match.
Open the new job.
Double-click on the Individual Match job phase.
Drag the Individuals snapshot task from the Snapshot tool palette onto the Individual Match phase of the job.
Double-click the Individual Candidates interface and select the Individual Candidates mapping.
Click OK to apply the changes. The job configuration will be modified accordingly and the snapshot and staged data items will be disconnected. Delete both these items by deleting the Snapshot task. The start of the job phase should now appear as follows:
Follow steps 9.d through 10 of Section 4.1, "Using Stand-Alone Batch Matching," remembering to modify the job configuration to include the new transformation process and use the modified data interface mappings.
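The date conversion mentioned in the steps above can be illustrated outside EDQ. The following is a minimal Python sketch, not EDQ configuration: it renders a date of birth in each of the four string formats listed for the Candidates interface. The mapping from the Java-style patterns to Python `strftime` codes and the function name `format_date_of_birth` are assumptions made for illustration, and the abbreviated month name in `dd-MMM-yy` depends on the active locale.

```python
from datetime import date

# Java-style pattern (as listed above) -> Python strftime equivalent.
# This mapping is an illustrative assumption, not part of EDQ-CDS.
FORMATS = {
    "yyyyMMdd": "%Y%m%d",
    "MM/dd/yyyy": "%m/%d/%Y",
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd-MMM-yy": "%d-%b-%y",  # month abbreviation is locale dependent
}


def format_date_of_birth(dob: date, pattern: str = "yyyyMMdd") -> str:
    """Return the date of birth as a string in one of the listed formats."""
    return dob.strftime(FORMATS[pattern])


if __name__ == "__main__":
    dob = date(1980, 3, 7)
    for pattern in FORMATS:
        print(pattern, "->", format_date_of_birth(dob, pattern))
```

A sketch like this can be handy for preparing test files whose date strings are already in a format the interface accepts.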
The cleaning processes provided with EDQ-CDS are provided as templates only, with the exception of the Address Cleaning process which is fully functional and uses EDQ-AV for address verification and standardization. The Individual and Entity cleaning processes are intended to be customized to meet the data standardization requirements of the implementation.
The examples in the following sections demonstrate modifying the cleaning services provided with EDQ-CDS.
Modify the Individual Cleaning service to standardize job titles as in the following example steps.
Ensure that no jobs are currently running.
In the EDQ-CDS project, create a new Reference Data set with the columns as follows:
Click Next through the New Reference Data wizard and name the Reference Data Job Title Standardizations.
Click Finish to close the wizard. The Reference Data Editor dialog opens.
Add the required job title standardizations; for example:
Open the Clean - Individual process.
Add a new Replace processor to the Process Canvas and connect it to the output of the Upper Case the Name Attributes processor.
In the Processor Configuration dialog, set the jobtitle attribute as the Input field, and on the Options tab select the Job Title Standardizations Reference Data in the Replacements field.
Click OK to close the processor configuration dialog.
Connect the All output of the Replace processor to the Writer, then click OK without making any changes to the Writer configuration.
On the Process Canvas delete the direct link between the Upper Case processor and the Writer.
Close the process and save the changes.
Test the modified cleaning service.
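To make the effect of the replacement map concrete, the following Python sketch mimics a whole-value lookup against a standardization table. It is only an approximation of what a Replace processor might do: the rows in `JOB_TITLE_STANDARDIZATIONS` are hypothetical examples, and the case-insensitive lookup and pass-through of unmatched values are assumptions.

```python
# Hypothetical rows for the Job Title Standardizations Reference Data:
# lookup value -> standardized value.
JOB_TITLE_STANDARDIZATIONS = {
    "VP": "VICE PRESIDENT",
    "MGR": "MANAGER",
    "SR MGR": "SENIOR MANAGER",
}


def standardize_job_title(jobtitle):
    """Return the standardized title, or the input unchanged if no row matches."""
    if jobtitle is None:
        return None
    # Assumption: the lookup ignores case and surrounding whitespace.
    key = jobtitle.strip().upper()
    return JOB_TITLE_STANDARDIZATIONS.get(key, jobtitle)


if __name__ == "__main__":
    for title in ["vp", "Sr Mgr", "Engineer"]:
        print(title, "->", standardize_job_title(title))
```

When testing the modified cleaning service, a small table of expected input and output pairs like the one printed here gives a quick way to confirm the Reference Data is being applied.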
The default settings (Allowed Verification Results, Minimum Verification Level and Minimum Match Score) used in the Address Cleaning process that uses EDQ-AV can be overridden on a per-country basis by simply modifying reference data.
Modify the EDQ-AV settings to reduce how strictly German addresses will be validated as in the following example steps.
Ensure that no jobs are currently running.
In the EDQ-CDS project edit the Address Clean - Country verification level and results Reference Data.
Add the following row:
Country Code: DE
Allowed Verification Results: VPA
Minimum Verification Level: 3
Minimum Match Score: 90
Click OK to close the dialog.
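The per-country override can be pictured as a lookup that falls back to defaults when no row exists for a country. The Python sketch below is illustrative only: the `DEFAULT` values are hypothetical (the shipped defaults are not stated here), and the acceptance logic in `is_accepted` is an assumption about how the three settings might combine, not a description of EDQ-AV internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvSettings:
    allowed_results: str         # one letter per allowed verification result
    min_verification_level: int
    min_match_score: int


# Hypothetical defaults, for illustration only.
DEFAULT = AvSettings("V", 4, 95)

# The row added in the steps above.
OVERRIDES = {"DE": AvSettings("VPA", 3, 90)}


def settings_for(country_code: str) -> AvSettings:
    """Return the country override if one exists, otherwise the defaults."""
    return OVERRIDES.get(country_code.upper(), DEFAULT)


def is_accepted(country_code: str, result: str, level: int, score: int) -> bool:
    """Assumed rule: all three conditions must hold for an address to pass."""
    s = settings_for(country_code)
    return (
        result in set(s.allowed_results)
        and level >= s.min_verification_level
        and score >= s.min_match_score
    )


if __name__ == "__main__":
    print(is_accepted("DE", "P", 3, 92))  # uses the DE override
    print(is_accepted("FR", "P", 3, 92))  # falls back to the defaults
```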
This section explains how you can change the EDQ matching settings.
By default, the clusters that are used during matching depend on the value of the clusterlevel setting. All clusters for the specified level and all lower levels are applied. It is possible to customize the system to turn off particular clusters on an individual basis. However, this is only necessary if greater granularity than the three standard cluster levels is required.
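The "specified level and all lower levels" rule can be sketched as a simple filter. In the Python example below, the cluster names and their level assignments are hypothetical; only the selection rule and the idea of switching off individual clusters come from the text above.

```python
# Hypothetical cluster names mapped to one of the three standard levels.
CLUSTER_LEVELS = {
    "name_exact": 1,
    "name_phonetic": 2,
    "name_initials": 3,
}


def active_clusters(clusterlevel: int, disabled=frozenset()):
    """Clusters at or below the configured level, minus any turned off individually."""
    return sorted(
        name
        for name, level in CLUSTER_LEVELS.items()
        if level <= clusterlevel and name not in disabled
    )


if __name__ == "__main__":
    print(active_clusters(2))
    print(active_clusters(3, disabled={"name_phonetic"}))
```

The second call shows the finer granularity that individual customization provides: a level 3 configuration with one level 2 cluster removed, which no single clusterlevel value can express.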
The methods for controlling which Match Clusters are used differ for Batch and Real-Time processing. The following sections contain examples to show you how to modify the clusters used.
Modify the match process to turn off clusters during individual batch matching as in the following example steps.
Ensure that no jobs are currently running.
In the EDQ-CDS project, open the Match - Individual process.
Double-click on the Individuals - Match processor to open the processor tab.
Select the Cluster icon, and select or unselect the Cluster options as required.
Note:
You should always select the Real-time Cluster option; otherwise, real-time matching will no longer operate. The match processors are shared between real-time and batch jobs.
Close the tab, and click Yes to save the changes.
In Real-Time matching, each driving record is compared against every other record in the input set; clustering is performed as a separate, prior call. Therefore, in order to turn off a cluster it must be suppressed at the time of generation.
Modify the match process to turn off clusters during entity real-time matching as in the following example steps.
Note:
This will only affect new records, unless all cluster keys are re-created.
Ensure that no jobs are currently running.
In the EDQ-CDS project, open the Cluster Results – Realtime Output process.
Double-click on the Concatenate All Clusters processor to open the Processor Configuration dialog.
Select the Cluster Attributes in the Selected Attributes list as appropriate and click on the left-arrow button to remove them. For example, entclusterWS, the Website cluster, as in the following:
Click OK to close the dialog.
Close the process and save the configuration changes.
Match rule enablement is externalized in this release. You can override this behavior by adding the name...address conflict properties to your edq-cds.properties file, then editing the values as in the following example:
    # Disable all entity "name...address conflict" type rules.
    phase.*.process.Match\ -\ Entity.[E010V]\ Script\ full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
    phase.*.process.Match\ -\ Entity.[E020V]\ Full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
    phase.*.process.Match\ -\ Entity.[E030V]\ Standardized\ full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
    phase.*.process.Match\ -\ Entity.[E040V]\ Script\ full\ name\ without\ suffixes\ exact\;\ address\ conflict.entity_match_rules_enabled = false
    phase.*.process.Match\ -\ Entity.[E050V]\ Full\ name\ without\ suffixes\ exact\;\ address\ conflict.entity_match_rules_enabled = false
Capitalization must be respected and characters must be escaped as required. The asterisk (*) character denotes a wildcard, which specifies that the above rule applies to all phases and all processes.
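Because the escaping is easy to get wrong by hand, it can help to generate these lines. The Python sketch below builds a property line in the same shape as the example, escaping spaces and semicolons with a backslash. The helper names `escape_key_part` and `rule_property_line` are hypothetical, and the set of escaped characters is inferred from the example rather than taken from a specification.

```python
def escape_key_part(text: str) -> str:
    """Backslash-escape the characters that are escaped in the example above.

    Assumption: spaces, semicolons and literal backslashes are the only
    characters that need escaping in these process and rule names.
    """
    return "".join("\\" + ch if ch in " ;\\" else ch for ch in text)


def rule_property_line(process: str, rule: str, enabled: bool,
                       option: str = "entity_match_rules_enabled",
                       phase: str = "*") -> str:
    """Build one property line; phase '*' is the wildcard for all phases."""
    key = "phase.{}.process.{}.{}.{}".format(
        phase, escape_key_part(process), escape_key_part(rule), option
    )
    return "{} = {}".format(key, "true" if enabled else "false")


if __name__ == "__main__":
    print(rule_property_line(
        "Match - Entity",
        "[E020V] Full name exact; address conflict",
        enabled=False,
    ))
```

The process and rule names must be passed with their exact capitalization, since the helper does not alter case.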
The value of the matchthreshold setting is used to control the strength of matches that are returned from the Matching services by filtering out results that fall below the specified threshold. Match rules with a priority score below this value are effectively redundant.
Also, the match processes output a number of additional attributes which are not used in the default configuration and can be removed without loss of functionality. These attributes may be required for use in customizations of EDQ-CDS. For more information, see Section 4.3.3, "Turning off Unused Match Functionality."
The matchthreshold setting has been configured to have a value of 70, so all Match rules with a lower priority score will be disabled.
The following example steps show you how to disable Match rules for any Match process (for example, Match - Individual, Match - Entity or Match - Address):
Ensure that no jobs are currently running.
In the EDQ-CDS project, open the Match process.
Double-click the Match processor to open the Match Configuration tab.
Double-click the Match sub-processor icon to open the Match Configuration dialog.
Select the Match Rules tab and select the last Match group.
Clear the check box beside each Match rule with a Match Priority score lower than 70 to disable it.
Repeat for each Match group until all rules with a score less than 70 have been disabled.
Click OK to close the dialog.
Close the process and save the configuration changes.
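The reasoning behind these steps can be sketched in a few lines of Python: results below the threshold are filtered out, so any rule whose priority score is below the threshold can never contribute a returned match. The rule names and scores in `RULE_PRIORITIES` are hypothetical, and results are modeled as simple (record ID, rule name, score) tuples for illustration.

```python
MATCH_THRESHOLD = 70  # the matchthreshold value used in the example above

# Hypothetical rule names and priority scores.
RULE_PRIORITIES = {
    "Exact name and address": 100,
    "Standardized name, exact address": 85,
    "Name typo, address conflict": 65,
    "Initials only": 50,
}


def rules_below_threshold(priorities, threshold=MATCH_THRESHOLD):
    """Rules that are effectively redundant and are candidates for disabling."""
    return sorted(name for name, score in priorities.items() if score < threshold)


def filter_results(results, threshold=MATCH_THRESHOLD):
    """Keep only (record_id, rule_name, score) tuples at or above the threshold."""
    return [r for r in results if r[2] >= threshold]


if __name__ == "__main__":
    print(rules_below_threshold(RULE_PRIORITIES))
    print(filter_results([
        ("R1", "Exact name and address", 100),
        ("R2", "Initials only", 50),
    ]))
```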
The EDQ-CDS Matching services return only those records that matched with a score equal to or greater than the matchthreshold setting, and for those records they return only the record ID, rule name and score. It is useful to be able to view the full record details during rule tuning in order to analyze matches. The Match Review application is a helpful tool in this process.
You can enable match review for individual batch matching as in the following example steps.
Ensure that no jobs are currently running.
In the EDQ-CDS project, open the Match - Individual process.
Double-click on the Match Individuals processor to open the Match Configuration dialog.
Click Advanced Options.
From the Review System list, select Match Review, and then click OK. This makes the Assign Relationship Review option active.
Click Assign Relationship Review.
In the dialog displayed, select the appropriate user or user group in the Assigned To drop-down field.
Click OK to close the dialog.
Close the process and save the configuration changes.
Open the Batch Individual Match job.
Locate the Match phase, right-click on the Match Prepare task and select Configure. The Task Configuration dialog opens.
Select the Process tab, and check the Enable Sort/Filter in Match? option.
Click OK and close the job, saving changes when prompted.
Run the job from Director with the appropriate run profile and no run label to regenerate the data.
Note:
In order to generate Match Review data, you must run jobs without a run label.
Matches can be reviewed as follows:
On the Launchpad page, click the Match Review icon.
Note:
If this application is not visible, you will need to publish it via the launchpad server configuration pages.
Log in as a user with the appropriate security permissions (for example, a user that is a member of the group selected in step 5).
Select Match - Individual in the Reviews list in the left-hand panel to view the Match Review statistics.
Click the Launch Review Application link to start reviewing matches for the selected Review.
The Decision Key consists of a set of input attributes that are used in a hashing algorithm to re-apply (that is, 'remember') manual match decisions. This means that any manual match decisions made on a pair of records will be re-applied on subsequent runs of the matching process as long as the data values in the attributes that make up the decision key remain the same.
So, for example, if matching individuals using name and address details, and one of the manually matched records changes, you may want to reappraise the records rather than apply a manual decision that was made based on different data. However, if the value in another attribute changes, you may consider there to be no real change to the details of the record used in matching. For example, a Balance attribute containing a numerical amount might be input to a matching process as it may be used in the output selection logic, but a change to the attribute value should not cause a reappraisal of the decision to match, or not match the record against another.
By default, all attributes that have been mapped to identifiers are included in the Decision Key (unless the match processor has been upgraded from a previous version - see note below). However, you can change the Decision Key to use all the attributes input into a match processor, or customize the key by selecting exactly which attributes make up the key. For example, if you want always to re-apply match decisions as long as the records involved are the same, even if the data of those records changes, you can select only the primary key attributes of records in each source involved in matching.
In general, you should decide how to configure the Decision Key before making a matching process operational and assigning its results for review. However, if decisions have already been made when the construction of the Decision Key changes, EDQ will make its best effort to retain those decisions within the following limitation:
If an attribute that was formerly used in a Decision Key is no longer input to the match processor, it will not be possible to reapply any decisions that were made using that key.
This means that adding attributes to a Decision Key can always be done without losing any previous decisions, providing each decision is unique based on the configured key columns.
Note that it is still possible to remove an attribute from a Decision Key and migrate previous decisions, by removing it from the Decision Key in this tab but keeping the input attribute in the match processor for at least one complete run with the same set of data as run previously. Once this has been done it will be safe to remove the attribute from the matching process.
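The behavior of a Decision Key can be illustrated with a small Python sketch. This is not EDQ's actual hashing algorithm: the use of SHA-256, the field separator, and the function names are assumptions chosen only to show why a change to a key attribute invalidates a remembered decision while a change to a non-key attribute (such as Balance) does not.

```python
import hashlib


def decision_key(record: dict, key_attributes) -> str:
    """Hash only the attributes that make up the Decision Key."""
    parts = ["{}={}".format(attr, record.get(attr, "")) for attr in key_attributes]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def pair_key(record_a: dict, record_b: dict, key_attributes):
    """Order-independent key for a pair of records under manual review."""
    return tuple(sorted((
        decision_key(record_a, key_attributes),
        decision_key(record_b, key_attributes),
    )))


if __name__ == "__main__":
    key_attrs = ["name", "address"]
    original = {"id": "1", "name": "JOHN DOE", "address": "1 MAIN ST", "Balance": "100"}
    new_balance = dict(original, Balance="250")
    new_address = dict(original, address="2 HIGH ST")

    # Same key: a stored decision would still apply.
    print(decision_key(original, key_attrs) == decision_key(new_balance, key_attrs))
    # Different key: the pair would be reappraised.
    print(decision_key(original, key_attrs) == decision_key(new_address, key_attrs))
```

Choosing `key_attrs = ["id"]` in this sketch corresponds to the primary-key-only configuration described above, where decisions persist regardless of data changes.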
This section explains how you can modify your data to improve matching and provides examples to aid you.
It is possible to customize the system to strip certain words and phrases from names that are deemed to be noise and/or add little information, and therefore may lead to potential missed matches.
Name fields in customer data systems are often overfilled with additional (non-name) information, either because there are no other suitable fields available or due to errors made by Data Entry users. Common examples include "Fred SMITH (DO NOT CALL)" and "John DOE (DECEASED)". This extraneous information can be removed during name standardization when a "distilled" name is created for use in matching.
Use the following example steps to remove noise from individual names:
Ensure that no jobs are currently running.
In the EDQ-CDS - Initialize Reference Data project open the Strip List – Titles Latin Reference Data.
Add the following rows to the Reference Data set:
DO NOT CALL
DECEASED
Click OK to close the dialog.
Re-run the MAIN Initialize Reference Data job from the Server Console to re-prepare the Reference Data files that are used by the Matching services.
Note:
The Real-Time services will use the modified Reference Data sets the next time the full Real-time START ALL job (which re-snapshots the prepared Reference Data from files) is run.
To remove words and phrases from individual names in non-Latin scripts, use the Strip List – Individual Script Strip List Reference Data. This Reference Data set is used as a replacement map and should have a blank value in the second column.
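The effect of a strip list on the examples given earlier can be approximated in Python. The sketch below is an illustration of the idea of a "distilled" name, not the EDQ-CDS implementation: the upper-casing, the whole-phrase matching, and the removal of emptied parentheses are all assumptions.

```python
import re

# The two rows added to the strip list in the steps above.
STRIP_LIST = ["DO NOT CALL", "DECEASED"]


def distill_name(name: str, strip_list=STRIP_LIST) -> str:
    """Remove strip-list phrases and tidy up what is left."""
    result = name.upper()
    # Longest phrases first so that shorter entries cannot split a longer one.
    for phrase in sorted(strip_list, key=len, reverse=True):
        result = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", result)
    result = re.sub(r"\(\s*\)", " ", result)  # drop parentheses left empty
    return " ".join(result.split())           # collapse repeated whitespace


if __name__ == "__main__":
    for raw in ["Fred SMITH (DO NOT CALL)", "John DOE (DECEASED)"]:
        print(raw, "->", distill_name(raw))
```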
Noise words and phrases or common business words (including suffixes) in Entity names that add little value in matching can be removed during name standardization when a "distilled" name is created. An example of such a noise word is "International", which is often found in organization name fields.
Due to the high frequency of occurrence of this term it is often omitted or shortened when entering the name, which may lead to potential matches being missed. Therefore it may be more appropriate to remove the term and all known variants for the purposes of matching.
Use the following example steps to remove noise from entity names:
Ensure that no jobs are currently running.
In the EDQ-CDS - Initialize Reference Data project open the Strip List – Entity Latin Reference Data.
Add the following rows to the Reference Data set:
INTERNTL
INTL
INT
Click OK to close the dialog.
Re-run the MAIN Initialize Reference Data job from the Server Console to prepare the data.
To remove words and phrases from entity names in non-Latin scripts use the Strip List – Entity Script Suffixes Reference Data.
EDQ-CDS uses a name standardization technique in order to match name variants. It is supplied with a large collection of common name variants for various language domains. It is possible to customize these lists.
Note:
If a name standardization is changed or added, the subsequent results may be eliminated during Conflict Resolution. For further details, see Section 4.4.3, "Resolving Conflicts".
Ensure that no jobs are currently running.
In the EDQ-CDS - Initialize Reference Data project create a new Reference Data set with columns as in the following:
Click Next through the New Reference Data wizard and name it Custom Individual Name Standardizations.
Click Finish to close the dialog.
The Reference Data Editor dialog will open. Add the required name standardizations, where:
VARIANTLATINNAME is the name to be standardized.
MASTERLATINNAME is the standardized version of the variant name.
GENDER takes the value M for male, F for female, or U for unknown or ambiguous.
ISPHRASE takes the value N for single token names and Y for multi-token names containing whitespace.
ISHIGHFREQ is set to Y.
Note:
It is important to ensure that data is entered in upper case and that variant names only have a single master across all language domains.
Click OK to close the dialog.
Open the [D] Initialise Individual Latin to Latin Data process.
Add a Reader processor to the Process Canvas and configure it to use the Custom Individual Name Standardizations Reference Data as the source, selecting all attributes for input to the process.
Add a new Add String Attribute processor to the process canvas and connect the reader to the new processor. In the processor configuration dialog rename the new attribute DATASOURCE and set the attribute value to CUSTOM.
Connect the output of the Add String Attribute processor to the Merge Data Streams processor.
In the Custom Individual Name Standardizations tab of the Processor Configuration dialog associate the Available Attributes with the Output Attributes in the Merged Data Stream area:
Click OK to close the dialog.
Close the process and save the configuration changes.
Re-run the MAIN Initialize Reference Data job from the Server Console to prepare the data.
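Before loading custom standardizations, it may be worth checking the rows against the rules stated above. The following Python sketch is a hypothetical pre-load check, not part of EDQ-CDS: it verifies upper case, the allowed GENDER values, ISPHRASE against the presence of whitespace in the variant name (an assumed interpretation), ISHIGHFREQ, and that each variant has a single master.

```python
def validate_rows(rows):
    """Return a list of problems found in custom name standardization rows."""
    errors = []
    masters = {}
    for i, row in enumerate(rows, start=1):
        variant = row["VARIANTLATINNAME"]
        master = row["MASTERLATINNAME"]
        if variant != variant.upper() or master != master.upper():
            errors.append("row {}: names must be upper case".format(i))
        if row["GENDER"] not in {"M", "F", "U"}:
            errors.append("row {}: GENDER must be M, F or U".format(i))
        # Assumption: ISPHRASE describes the variant name.
        expected = "Y" if any(ch.isspace() for ch in variant) else "N"
        if row["ISPHRASE"] != expected:
            errors.append("row {}: ISPHRASE should be {}".format(i, expected))
        if row["ISHIGHFREQ"] != "Y":
            errors.append("row {}: ISHIGHFREQ should be Y".format(i))
        if masters.setdefault(variant, master) != master:
            errors.append("row {}: {} has more than one master".format(i, variant))
    return errors


if __name__ == "__main__":
    rows = [
        {"VARIANTLATINNAME": "JON", "MASTERLATINNAME": "JOHN",
         "GENDER": "M", "ISPHRASE": "N", "ISHIGHFREQ": "Y"},
        {"VARIANTLATINNAME": "JON", "MASTERLATINNAME": "JONATHAN",
         "GENDER": "M", "ISPHRASE": "N", "ISHIGHFREQ": "Y"},
    ]
    for problem in validate_rows(rows):
        print(problem)
```

Note that this check only covers the custom rows; a variant could still conflict with a master in the supplied lists, which is what Conflict Resolution handles.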
Conflict resolution is performed to resolve issues arising when name standardization rules try to standardize names to more than one Master name. For example, if there is a rule that maps "Jon" to a Master of "John" and another that maps "Jon" to "John-Boy", there is a conflict. This conflict is resolved by assessing the importance of each Master name in the given standardization data. The best candidate is then selected as the primary Master, and other standardization maps conflicting with it are removed and quarantined.
As part of conflict resolution, each removed record is assigned one or more Reason Codes explaining why it is in conflict. These codes are displayed in the REASON column in the Server Console Results window:
The Reason Codes are as follows:
PIV: The Primary record of a cluster of records (for example, the best Master identified for a set of equivalences) is also present as a variant to other Masters. All the instances where this Primary name is a variant are removed.
PVOM: The records that are variants of the current Primary are also variants of other Masters. All the records for these variants pointing to other Masters are removed.
PVIM: The records that are variants of the current Primary are also Masters to other variants. All the records where this variant is a Master are removed.
PIVCUTOFF: The other removals take place after Primary clusters have been identified. Once it is no longer efficient to continue identifying Primaries, all remaining records whose Master name also exists as a variant have the variant versions removed in a final cull of records that violate integrity.
Expanding on the simple example given at the beginning of this section, let us assume that there are the following name standardization rules:
Name | Master |
---|---|
J-MAN | JON |
JOHN | JONATHAN |
JOHNNY | JONNY |
JON | JOHN |
JON | JONATHAN |
JON | JOHN-BOY |
JONNY | JONATHAN |
JONATHAN | JONATHON |
JOHNNY | JONATHAN |
These rules contain a number of inherent conflicts. This is illustrated in the following diagram in which JONATHAN is identified as the Primary:
The arrows indicate the following:
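Arrow Type | Reason for Conflict |
---|---|
(arrow image not available) | N/A (No conflict exists) |
(arrow image not available) | PIV |
(arrow image not available) | PVIM |
(arrow image not available) | PVOM |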
The conflict resolution rules will discard the mappings that cause conflicts, as follows:
Resulting in the following mappings being created:
Name | Primary |
---|---|
JOHN | JONATHAN |
JON | JONATHAN |
JONNY | JONATHAN |
JOHNNY | JONATHAN |
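The worked example can be reproduced with a short Python sketch. This is a simplified illustration, not the EDQ-CDS algorithm: the Primary is supplied as an input rather than selected by importance, only a single Primary cluster is processed, and PIVCUTOFF is not modeled. Each rule is a (name, master) pair taken from the rules listed in this section, and the function name `resolve_conflicts` is hypothetical.

```python
# (name, master) pairs from the example rules above.
RULES = [
    ("J-MAN", "JON"),
    ("JOHN", "JONATHAN"),
    ("JOHNNY", "JONNY"),
    ("JON", "JOHN"),
    ("JON", "JONATHAN"),
    ("JON", "JOHN-BOY"),
    ("JONNY", "JONATHAN"),
    ("JONATHAN", "JONATHON"),
    ("JOHNNY", "JONATHAN"),
]


def resolve_conflicts(rules, primary):
    """Split rules into kept mappings and removed mappings with reason codes."""
    variants = {name for name, master in rules if master == primary}
    kept, removed = [], {}
    for name, master in rules:
        if master == primary:
            kept.append((name, master))
            continue
        codes = []
        if name == primary:
            codes.append("PIV")    # the Primary appears as a variant
        if name in variants:
            codes.append("PVOM")   # a Primary variant points to another Master
        if master in variants:
            codes.append("PVIM")   # a Primary variant acts as a Master
        if codes:
            removed[(name, master)] = codes
        else:
            kept.append((name, master))  # unrelated to this Primary
    return kept, removed


if __name__ == "__main__":
    kept, removed = resolve_conflicts(RULES, "JONATHAN")
    for mapping in kept:
        print("kept   ", mapping)
    for mapping, codes in removed.items():
        print("removed", mapping, codes)
```

Under these assumptions, the kept mappings are the four shown in the resulting table, and a removed record can carry more than one Reason Code, as described earlier in this section.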