Skip Headers
Oracle® Enterprise Data Quality Customer Data Services Pack Customization Guide
11g Release 1 (11.1.1.7)

E40734-02
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Feedback page
Contact Us

  PDF · Mobi · ePub

Oracle® Enterprise Data Quality

Customer Data Services Pack Customization Guide

11g Release 1 (11.1.1.7)

E40734-02

October 2013

Oracle Enterprise Data Quality Customer Data Services Pack (EDQ-CDS) has been designed to perform well with minimal customization. Ready-to-use, the application can perform clustering and matching of individual, entity and address data in connected supported applications with little or no configuration changes required.

This document describes how EDQ-CDS can be customized to take advantage of some of the more advanced features of the product. Therefore, it is assumed that you have a good working knowledge of EDQ and the EDQ-CDS components.

1 Stand-Alone Use of Batch Matching

EDQ-CDS is designed to process customer data from any external system or stand-alone source. By default, pre-configured batch jobs are provided that work with a set of staging tables. Reconfiguring the product to process data from other sources, such as a text file, is straightforward.

In order to reuse the batch data matching services provided, it is necessary to create new input and output mappings for the data interfaces. The following examples demonstrate how to do this and how to run matching using a modified copy of an existing job configuration.

1.1 Example: Stand-Alone Individual Batch Matching

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, create a new server-side data store named File In: Individuals that points to the structured text file containing the customer data to be processed. It is important that this is created as a server-side data store in order to be used within a job definition.

  3. Create a new snapshot named Individuals using the File In: Individuals data store as a source.

  4. Create the Input Data Interface mappings as follows:

    1. Right-click the Individual Candidates data interface and select Mappings... to open the Data Interface Mappings dialog.

      Surrounding text describes data_intface_config2.png.
    2. Click Add to open the New Data Interface Mappings dialog.

    3. Select the Individuals snapshot as the source and click Next. The Staged data default type is used.

    4. Map the Customer Data Attributes on the left of the dialog to the Data Interface Attributes on the right as follows:

      Surrounding text describes new_data_intface_map.png.

      Note :

      In some instances, it may be necessary to construct a process that reads from the snapshot and reshapes the data to match the Data Interface, see Section 1.2, "Example: Converting Data to the Interface Format."

    5. Click Next.

    6. Name the data interface mapping Individual Candidates and click Finish to save.

    7. Click OK.

  5. Create a new Staged Data named Individual Matches with the following columns:

    Surrounding text describes staged_data_ind_match.png.
  6. Create the Output Data Interface mappings as follows:

    1. Right-click the Matches data interface and select Mappings... to open the Data Interface Mappings dialog.

    2. Click Add to open the New Data Interface Mappings dialog.

    3. Select the Individual Matches staged data as the target and click Next.

    4. Map the Matches data interface attributes on the left to the Result Staged Data attributes on the right as required.

      Surrounding text describes new_data_intface_map2.png.
    5. Click Next.

    6. Name the Data Interface mapping Individual Matches and give it a description, then click Finish.

    7. Click OK to close the dialog.

  7. Create a new server-side delimited text data store called File Out: Individual Matches to use as a target for the match results. Alternatively, the data can be written to a database if required.

  8. Create a new export called Matches to File Out: Individual Matches that uses the Matches data interface as the source to export from, and the File Out: Individual Matches as the target for the export.

  9. Create and configure a job to run matching as follows:

    1. Create a copy of the Batch Individual Match job, rename it Batch Individual Match using Text File, and then open it.

    2. Open the Individual Match job phase, change the source of the input data by double-clicking on the Individual Candidates data interface and selecting the Individual Candidates mapping.

      Surrounding text describes data_intface_config1.png.
    3. Click OK to apply the changes. The job configuration is modified accordingly and the old snapshot and staged data items are disconnected.

    4. Delete the Individual Candidates snapshot task.

    5. Drag the Individuals snapshot from the Snapshot in the Tool Palette into the open job phase and make sure it is connected to the Individual Candidates mapping.

    6. Drag the Matches to File Out: Individual Matches export task from the Export in the Tool Palette into the open job phase and connect it to Match Results - Output.

    7. Delete the Batch Matches export task.

      Surrounding text describes individ_match_job1.png.
  10. Close the job and save the configuration changes.

1.2 Example: Converting Data to the Interface Format

It may not always be possible to directly map the input source to the candidates interface if:

  • fields are of the wrong data type (for example, "Date of Birth" in a date field); or

  • fields need transforming to a compatible format/structure (for example, Individual names in a full name field).

If this is the case, then the input data should be run through a custom EDQ process to convert the data as appropriate.

  1. Ensure that no jobs are currently running.

  2. Create a data store and snapshot for the input data as in steps 2 and 3 from Section 1, "Stand-Alone Use of Batch Matching."

  3. In the EDQ-CDS project, right-click the Processes node in the Project Browser and select New Process... to open the New Process wizard.

  4. Select the snapshot created in step 2 as the data source.

  5. Click Next.

  6. On the last page of the wizard, rename the process Transform Individuals, then click Finish button to create the process.

  7. On the Process canvas, add the necessary processors to transform the data to the interface format. For example, use a Convert Date to String processor to convert a date of birth in date format to the required format for the Candidates interface (for example, either yyyyMMdd, MM/dd/yyyy, yyyy-MM-dd or dd-MMM-yy).

  8. Add a Writer processor to the process canvas and connect it to the process data stream:

    Surrounding text describes convert_data_ex.png.
  9. In the Writer Configuration dialog, select the Individual Candidates data interface and map the attributes accordingly.

  10. Create and configure a new job as follows:

    1. Make a copy of Batch Individual Match job, renaming it Batch Transformed Individual Match.

    2. Open the new job.

    3. Double-click on the Individual Match job phase.

    4. Drag the Individuals snapshot task from the Snapshot tool palette onto the Individual Match phase of the job.

    5. Double-click the Individual Candidates interface and select the Individual Candidates mapping.

    6. Click OK to apply the changes. The job configuration will be modified accordingly and the snapshot and staged data items will be disconnected. Delete both these items by deleting the Snapshot task. The start of the job phase should now appear as follows:

      Surrounding text describes startjob_phase_ex.png.
    7. Use steps 9.d. - 10 of Section 1.1, "Example: Stand-Alone Individual Batch Matching" from step 9.d onwards, remembering to modify the job configuration to include the new transformation process and use the modified data interface mappings.

2 Cleaning Services

The cleaning processes provided with EDQ-CDS are provided as templates only, with the exception of the Address Cleaning process which is fully functional and uses EDQ-AV for address verification and standardization. The Individual and Entity cleaning processes are intended to be customized to meet the data standardization requirements of the implementation.

2.1 Customizing the Cleaning Services

The following examples demonstrate modifying the cleaning services provided with EDQ-CDS.

2.1.1 Example: Job Title Standardization

Modify the Individual Cleaning service to standardize job titles as follows.

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, create a new Reference Data set with the columns as follows:

    Surrounding text describes new_ref_data.png.
  3. Click Next through the New Reference Data wizard with the name Job Title Standardizations.

  4. Click Finish to close the wizard. The Reference Data Editor dialog opens.

  5. Add the required job title standardizations; for example:

    Surrounding text describes ref_data_edit_job.png.
  6. Open the Clean - Individual process.

  7. Add a new Replace processor to the Process Canvas and connect it to the output of the Upper Case the Name Attributes processor.

  8. In the Processor Configuration dialog, set the jobtitle attribute as the Input field, and on the Options tab select the Job Title Standardizations Reference Data in the Replacements field.

    Surrounding text describes replace_dialog.png.
  9. Click OK to close the processor configuration dialog.

  10. Connect the All output of the Replace processor to the Writer, then click OK without making any changes to the Writer configuration.

  11. On the Process Canvas delete the direct link between the Upper Case processor and the Writer.

    Surrounding text describes job_title_ex.png.
  12. Close the process and save the changes.

  13. Test the modified cleaning service.

2.1.2 Country-Specific Address Cleaning Settings

The default settings (Allowed Verification Results, Minimum Verification Level and Minimum Match Score) used in the Address Cleaning process that uses EDQ-AV can be overridden on a per-country basis by simply modifying reference data.

2.1.3 Example: Reducing the Strictness of German Address Validation

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project edit the Address Clean - Country verification level and results Reference Data.

  3. Add the following row:

    • Country Code: DE

    • Allowed Verification Results: VPA

    • Minimum Verification Level: 3

    • Minimum Match Score: 90

      Surrounding text describes ref_data_addr_clean.png.
  4. Click OK to close the dialog.

3 Adjusting Matching

This section explains how you can change the EDQ matching settings.

3.1 Changing the Match Clusters To Use

By default, the clusters that are used during matching depend on the value of the clusterlevel setting. All clusters for the specified level and all lower levels are applied. It is possible to customize the system to turn off particular clusters on an individual basis. However, this is only necessary if greater granularity than the three standard cluster levels is required.

The methods for controlling which Match Clusters are used differs for Batch and Real-Time processing. The following examples show you how to modify the clusters used.

3.1.1 Example: Turning Off Clusters in Individual Batch Matching

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, open the Match - Individual process.

  3. Double-click on the Individuals - Match processor to open the processor tab.

  4. Select the Cluster icon, and select or unselect the Cluster options as required.

    Note:

    You should always select the Real-time Cluster option otherwise real-time matching will no longer operate. The match processors are shared between real-time and batch jobs.

  5. Close the tab, and click Yes to save the changes.

3.1.2 Example: Turning Off Clusters in Entity Real-Time Matching

In Real-Time matching, each driving record is compared against every other record in the input set; clustering is performed as a separate, prior call. Therefore, in order to turn off a cluster it must be suppressed at the time of generation.

Note :

This will only affect new records, unless all cluster keys are re-created.

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, open the Cluster Results – Realtime Output process.

  3. Double-click on the Concatenate All Clusters processor to open the Processor Configuration dialog.

  4. Select the Cluster Attributes in the Selected Attributes list as appropriate and click on the left-arrow button to remove them. For example, entclusterWS, the Website cluster as in the following:

    Surrounding text describes selected_attr_list.png.
  5. Click OK to close the dialog.

  6. Close the process and save the configuration changes.

3.2 Changing Match Rule Enablement

Match rule enablement is externalized in this release. You can override this behavior by adding the name...address conflict properties to your edq-cds.properties file then editing the values as in the following example:

# Disable all entity "name...address conflict" type rules.
phase.*.process.Match\ -\ Entity.[E010V]\ Script\ full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
phase.*.process.Match\ -\ Entity.[E020V]\ Full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
phase.*.process.Match\ -\ Entity.[E030V]\ Standardized\ full\ name\ exact\;\ address\ conflict.entity_match_rules_enabled = false
phase.*.process.Match\ -\ Entity.[E040V]\ Script\ full\ name\ without\ suffixes\ exact\;\ address\ conflict.entity_match_rules_enabled = false
phase.*.process.Match\ -\ Entity.[E050V]\ Full\ name\ without\ suffixes\ exact\;\ address\ conflict.entity_match_rules_enabled = false

Capitalization must be respected and characters must be escaped as required. The asterisk (*) character denotes a wildcard, which specifies that the above rule applies to all phases and all processes.

3.3 Turning off Unused Match Functionality

The value of the matchthreshold setting is used to control the strength of matches that are returned from the Matching services by filtering out results that fall below the specified threshold. Match rules with a priority score below this value are effectively redundant.

Also, the match processes output a number of additional attributes which are not used in the default configuration and can be removed without loss of functionality. These attributes may be required for use in customizations of EDQ-CDS. For more information, see Section 3.3, "Turning off Unused Match Functionality."

3.3.1 Example: Disabling Rules with Lower Scores

The matchthreshold setting has been configured to have a value of 70, so all Match rules with a lower priority score will be disabled.

The following steps describe how to disable Match rules for any Match process (for example, Match - Individual, Match - Entity or Match - Address):

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, open the Match process.

  3. Double-click the Match processor to open the Match Configuration tab.

  4. Double click the Match sub-processor icon to open the Match Configuration dialog.

  5. Select the Match Rules tab and select the last Match group.

  6. Clear the check box beside each Match rule with a Match Priority score lower than 70 to disable it.

    Surrounding text describes match_rule_boxes.png.
  7. Repeat for each Match group until all rules with a score less than 70 have been disabled.

  8. Click OK to close the dialog.

  9. Close the process and save the configuration changes.

3.4 Reviewing Matches in EDQ

The EDQ-CDS Matching services return only those records that matched with a score equal to or greater than the matchthreshold setting, and for those records it only returns the record ID, rule name and score. It is useful to be able to view the full record details during rule tuning in order to analyze matches. The Match Review application is a helpful tool in this process.

3.4.1 Example: Enabling Match Review in Individual Batch Matching

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS project, open the Match - Individual process.

  3. Double-click on the Match Individuals processor to open the Match Configuration dialog.

  4. Click Advanced Options.

  5. From the Review System list, select Match Review, and then click OK. This makes the Assign Relationship Review option active.

  6. Click Assign Relationship Review.

    Surrounding text describes match_individ_proc.png.
  7. In the dialog displayed, select the appropriate user or user group in the Assigned To drop-down field.

  8. Click OK to close the dialog.

  9. Close the process and save the configuration changes.

  10. Open the Batch Individual Match job.

  11. Locate the Match phase, right-click on the Match Prepare task and select Configure. The Task Configuration dialog opens.

  12. Select the Process tab, and check the Enable Sort/Filter in Match? option.

  13. Click OK and close the job, saving changes when prompted.

  14. Run the job from Director with the appropriate run profile and no run label to regenerate the data.

    Note:

    In order to generate Match Review data, you must run jobs without a run label.

Matches can be reviewed as follows:

  1. On the Launchpad page, click Match Review icon.

    Note :

    If this application is not visible then you will need to publish it via the launchpad server configuration pages.

  2. Login as a user with the appropriate security permissions (for example, a user that is a member of the group selected in step 5).

  3. Select Match - Individual in the Reviews list in the left-hand panel to view the Match Review statistics.

  4. Click the Launch Review Application link to start reviewing matches for the selected Review.

4 Modifying Reference Data Used in Matching

This section explains how you can modify your data to improve matching and provides examples to aid you.

4.1 Stripping Words/Phrases from Name Fields

It is possible to customize the system to strip certain words and phrases from names that are deemed to be noise and/or add little information, and therefore may lead to potential missed matches.

4.1.1 Example: Removing Noise from Individual Names

Name fields in customer data systems are often overfilled with additional (non-name) information, either because there are no other suitable fields available or due to errors made by Data Entry users. Common examples include "Fred SMITH (DO NOT CALL)" and "John DOE (DECEASED)". This extraneous information can be removed during name standardization when a "distilled" name is created for use in matching.

Use the following procedure:

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS - Initialize Reference Data project open the Strip List – Titles Latin Reference Data.

  3. Add the following rows to the Reference Data set:

    • DO NOT CALL

    • DECEASED

  4. Click OK to close the dialog.

  5. Re-run the MAIN Initialize Reference Data job from the Server Console to re-prepare the Reference Data files that are used by the Matching services.

Note :

The Real-Time services will use the modified Reference Data sets the next time the full Real-time START ALL job (which re-snapshots the prepared Reference Data from files) is run.

To remove words and phrases from individual names in non-Latin scripts use the reference data Strip List – Individual Script Strip List Reference Data . This Reference Data set is used as a replacement map and should have a blank value in the second column.

4.1.2 Example: Removing Noise from Entity Names

Noise words/phrases or common business words (including suffixes) in Entity names that add little value in matching can be removed during name standardization when a "distilled" name is created. An example of such a noise word is "International", which is often found in organization name fields.

Due to the high frequency of occurrence of this term it is often omitted or shortened when entering the name, which may lead to potential matches being missed. Therefore it may be more appropriate to remove the term and all known variants for the purposes of matching.

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS - Initialize Reference Data project open the Strip List – Entity Latin Reference Data.

  3. Add the following rows to the Reference Data set:

    • INTERNTL

    • INTL

    • INT

  4. Click OK to close the dialog.

  5. Re-run the MAIN Initialize Reference Data job from the Server Console to prepare the data.

To remove words and phrases from entity names in non-Latin scripts use the Strip List – Entity Script Suffixes Reference Data.

4.2 Changing Name Standardization

EDQ-CDS uses a name standardization technique in order to match name variants. It is supplied with a large collection of common name variants for various language domains. It is possible to customize these lists.

Note :

If a name standardization is changed or added, the subsequent results may be eliminated during Conflict Resolution. For further details, see Section 4.3, "Conflict Resolution".

4.2.1 Example: Adding Individual Name Standardizations

  1. Ensure that no jobs are currently running.

  2. In the EDQ-CDS - Initialize Reference Data project create a new Reference Data set with columns as in the following:

    Surrounding text describes new_ref_data_individ.png.
  3. Click Next through the New Reference Data wizard and name it Custom Individual Name Standardizations.

  4. Click Finish to close the dialog.

  5. The Reference Data Editor dialog will open. Add the required name standardizations, where:

    • VARIANTLATINNAME is the name to be standardized.

    • MASTERLATINNAME is the standardized version of variant name.

    • GENDER takes the value M for male, F for Female, or U for unknown or ambiguous.

    • ISPHRASE takes the value N for single token names and Y for multi-token names containing whitespace.

    • ISHIGHFREQ is set to Y.

    Note :

    It is important to ensure that data is entered in upper case and that variant names only have a single master across all language domains.

    Surrounding text describes ref_data_names.png.
  6. Click OK to close the dialog.

  7. Open the [D] Initialise Individual Latin to Latin Data process.

  8. Add a Reader process to the Process Canvas and configure it to use the Custom Individual Name Standardizations Reference Data as the source, selecting all attributes for input to the process.

    Surrounding text describes reader_config.png.
  9. Add a new Add String Attribute processor to the process canvas and connect the reader to the new processor. In the processor configuration dialog rename the new attribute DATASOURCE and set the attribute value to CUSTOM.

  10. Connect the output of the Add String Attribute processor to the Merge Data Streams processor.

  11. In the Custom Individual Name Standardizations tab of the Processor Configuration dialog associate the Available Attributes with the Output Attributes in the Merged Data Stream area:

    Surrounding text describes merge_data_streams.png.
  12. Click OK to close the dialog.

  13. Close the process and save the configuration changes.

  14. Re-run the MAIN Initialize Reference Data job from the Server Console to prepare the data.

4.3 Conflict Resolution

Conflict resolution is performed to resolve issues arising when name standardization rules try to standardize names to more than one Master name. For example, if there is a rule that maps "Jon" to a Master of "John" and another that maps "Jon" to "John-Boy", there is a conflict. This conflict is resolved by assessing the importance of each Master name in the given standardization data. The best candidate is then selected as the primary Master, and other standardization maps conflicting with it are removed and quarantined.

As part of conflict resolution, each removed record is assigned one or more Reason Codes explaining why it is in conflict. These codes are displayed in the REASON column in the Server Console Results window:

Surrounding text describes reason_code.png.

The Reason Codes are as follows:

  • PIV: The Primary record of a cluster of records (for example, the best Master identified for a set of equivalences) is also present as a variant to other Masters. All the instances where this Primary name is a variant are removed.

  • PVOM: The records that are variants of the current Primary are also variants of other Masters. All the records for these variants pointing to other Masters are removed.

  • PVIM: The records that are variants of the current Primary are also Masters to other variants. All the records where this variant is a Master are removed.

  • PIVCUTOFF: Whereas the other removals take place after identification of Primary clusters, there comes a time where it is not efficient to continue to identify the Primaries, and the remaining records where the Master name also exists as a variant have all the variant versions removed in a final cull of records that violate integrity.

Expanding on the simple example given at the beginning of this section, let us assume that there are the following name standardization rules:

Master Primary

J-MAN

JON

JOHN

JONATHAN

JOHNNY

JONNY

JON

JOHN

JON

JONATHAN

JON

JOHN-BOY

JONNY

JONATHAN

JONATHAN

JONATHON

JOHNNY

JONATHAN


These rules contain a number of inherent conflicts. This is illustrated in the following diagram in which JONATHAN is identified as the Primary:

Surrounding text describes conflict_res_ex2.png.

The arrows indicate the following:

Arrow Type Reason for Conflict

Surrounding text describes noconflict_arr.png.

N/A (No conflict exists)

Surrounding text describes piv_arr.png.

PIV

Surrounding text describes pvim_arr.png.

PVIM

Surrounding text describes pvom_arr.png.

PVOM


The conflict resolution rules will discard the mappings that cause conflicts, as follows:

Surrounding text describes conflict_res_ex1.png.

Resulting in the following mappings being created:

Name Primary

JOHN

JONATHAN

JON

JONATHAN

JONNY

JONATHAN

JOHNNY

JONATHAN


5 Related Documents

For more information, see the following documents in the Oracle Enterprise Data Quality documentation set:

See the latest version of this and all documents in the Oracle Enterprise Data Quality Documentation website at

http://download.oracle.com/docs/cd/E48549_01/index.htm

6 Documentation Accessibility

For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.

Access to Oracle Support

Oracle customers have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.


Customer Data Services Pack Customization Guide, 11g Release 1 (11.1.1.7)

E40734-02

Copyright © 2006, 2013, Oracle and/or its affiliates. All rights reserved.

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate failsafe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.