3Data Quality Concepts
Data Quality Concepts
This chapter provides the conceptual information that you must use to configure data quality for Siebel CRM and Oracle Customer Hub. It includes the following topics:
Data Cleansing
The Universal Connector supports data cleansing on the Account, Business Address, Contact, and List Mgmt Prospective Contact business components. For Siebel Industry Applications, the CUT Address business component is used instead of the Business Address business component.
For each type of record, data cleansing is performed for the fields that are specified in the Third Party Administration view. The mapping between the Siebel application field names and the vendor field names is defined for each business component.
In real-time mode, data cleansing begins when a user saves a newly created or modified record. When the record is committed to the Siebel database:
A request for cleansing is automatically submitted to the Data Cleansing business service.
The Data Cleansing business service sends the request to the third-party data cleansing software, along with the applicable data.
The third-party software evaluates the data and modifies it in accordance with the vendor’s internal instructions.
The third-party software sends the modified data to the Siebel application, which updates the database with the cleansed information and displays the cleansed information to the user.
In batch mode you use batch jobs to perform data cleansing on all the records in a business component or on a specified subset of those records. For data cleansing batch jobs, the process is similar to that for real-time mode, but the batch job corrects the records without immediately displaying the changes to users. The process starts when an administrator runs the server task, and the process continues until all the specified records are cleansed.
If both data cleansing and data matching are enabled, data cleansing is done first. For information about running data cleansing batch jobs, see Cleansing Data Using Batch Jobs.
Data Matching
The Universal Connector and Matching Server supports data matching on the Account, Contact, and List Mgmt Prospective Contact business components. For each type of record, data matching is performed for the current record against all other records of the same type, and with the same match keys, in the application using the fields specified in the Third Party Administration view. The mapping between the Siebel application field names and the vendor field names is defined for each business component.
Data quality performs matching using fields, for example, addresses, that can have multi-value group (MVG) values associated with the type of record being matched. However, data quality is not currently able to match using MVGs. Therefore, when performing matching for a contact, data quality checks only the primary address for each contact record and does not consider other addresses.
In real-time data matching, whenever an account, contact, or prospect record is committed to the database, a request is automatically submitted to the Deduplication business service. The business service communicates with third-party data quality software, which checks for possible matches to the newly committed record and reports the results to the Siebel application.
In batch mode data matching, you first start a server task to generate or refresh the keys, and then start another server task to perform data matching. For information about performing batch mode data matching, see Matching Data Using Batch Jobs.
In both real-time and batch mode, whenever a primary address is updated for an account or contact record, match keys are regenerated and data matching is performed for that account or contact.
The following is the overall sequence of events in data matching:
Match keys are generated for database records for which data matching is enabled.
When a user enters or modifies a record in real-time mode, or the administrator submits a batch data matching job:
A request is automatically submitted to the Deduplication business service.
Using match keys, candidate matches are identified for each record. This is a means of filtering the potential matching records.
The Deduplication business service sends the candidate records to the third-party software.
The third-party software evaluates the candidate records and calculates a match score for each candidate record to identify the duplicate records.
The third-party software returns the duplicate records to the Siebel application.
The duplicate records are displayed either in a window for real-time mode, or in the Administration - Data Quality views, from which you can manually merge records into a single record.
Match Key Generation
When data matching is performed in real time or in batch mode, data quality searches in the database for records that potentially match the current record (the record entered by a real-time user or the active record in the batch job). These records are called candidate records. When comparing the current records with existing records in the database, data quality does not use raw data, but instead uses match key values.
Match keys are calculated by applying an algorithm to specified fields in customer records. Typically keys are generated from a combination of name, address, and other identifier fields. For example, a person’s name (first name, middle name, last name) for prospects and contacts, or the account name for accounts.
You generate match keys for records in the database by using batch jobs, as described in Generating or Refreshing Keys Using Batch Jobs.
Typically, an administrator generates and refreshes keys on a periodic basis by running batch jobs. In such batch jobs, keys can be generated for all account keys, all contact keys, all prospect keys, or subsets as defined by search specifications that include a WHERE clause.
Because key data can become out of sync with the base tables, you must refresh the key data periodically. Key generation re-generates the keys for all the records covered by the search specification. Key refresh however, only re-generates the keys for records that are new or have been modified since your last key generation, and which are covered by the search specification. Key refresh is therefore much faster than key generation.
For example, if there are records as follows:
Record 1. The record has a key and has not been updated.
Record 2. The record has been updated therefore the key is out of sync with the record.
Record 3. The record is a new record and no key is generated for it yet.
If you generate match keys with a search specification that covers record 1, 2, and 3, new keys are generated for record 1, 2, and 3. However, if you refresh match keys with a search specification to cover record 1, 2, and 3, new keys are generated for record 2 and 3 only.
The batch capability is useful in the following circumstances:
If you deploy data quality in a Siebel application implementation that already contains data
If you receive new data using an input method that does not involve object manager, such as EIM or batch methods such as the List Import Service Manager
To periodically review data to ensure the correctness of previous matching efforts.
For instructions about using batch jobs to generate or refresh keys, see Generating or Refreshing Keys Using Batch Jobs.
Additionally, if real-time data matching is enabled for users, keys are automatically generated (or refreshed) for a record whenever the user saves a new Account, Contact, or List Mgmt Prospective Contact record or modifies and commits an existing record to the database.
If no keys are generated for a certain record, that record is ignored as a potential candidate record when matching takes place.
Match Key Generation with the Oracle Data Quality Matching Server
When the Universal Connector is integrated with the Oracle Data Quality Matching Server for data matching, it supports data matching on account, contact, and prospect data in real-time and batch processing modes. Whenever a record is created or updated in real-time or batch mode, match keys are generated by and stored within the Oracle Data Quality Matching Server. As a result, the information in Match Key Generation does not apply.
Match Key Generation Using the Universal Connector with Third-Party Data Quality Vendors
When the Universal Connector is integrated with any other third-party data quality vendor software for data matching, match key generation is as described in this topic. That is, match keys are generated by and stored within Siebel CRM whenever a record is created or updated in real-time or in batch mode.
The Universal Connector uses one or multiple keys for each account, contact, or prospect record. The keys are calculated by reading data from specific fields in the record. The fields used depend on the business component configuration, but they can include account name, postal code, street address, or last name fields.
The value of the match keys depend on a business component-specific Dedup Token Expression parameter, as shown in the following table.
You can customize the Dedup Token Expression but it must be consistent with the internal matching logic of the vendor, which is different for each vendor. For optimal results therefore, change the values only after consulting the relevant vendor.
The generation of multiple match keys enhances the span of search for potential duplicate records, and improves match results. However, you must remember that there is a performance impact from using multiple keys.
Keys are stored in the DEDUP_TOKEN fields of the following tables:
S_DQ_ORG_KEY (for Accounts)
S_DQ_CON_KEY (for Contacts)
S_DQ_PRSP_KEY (for Prospects)
You must activate the Dedup Token field in each business component in order to generate the correct match keys. If the Dedup Token field is not defined, match key generation methods will not be called. You must add the user property for the Token Expression along with the Query Expression so that the correct match keys can be generated and stored in the DEDUP_TOKEN field.
Identification of Candidate Records
The way in which candidate records are identified differs for the Oracle Data Quality Matching Server and the Universal Connector as described in the following topics.
Identification of Candidate Records with the Oracle Data Quality Matching Server
When using the Oracle Data Quality Matching Server for data matching, identification of candidate records is irrelevant as match candidate acquisition takes place within the Oracle Data Quality Matching Server.
Identification of Candidate Records with the Universal Connector
Data quality queries the database for candidate records by using a Dedup Query Expression parameter specific to the current Business Component. A Dedup Query Expression is used rather than the related Dedup Token Expression, for the following reason: If a user does not specify a value for any of the fields that compose the Dedup Token Expression, then the token is constructed with an underscore (_) instead of a value in the part of the expression that corresponds to that field. If the token were to be used in a query, the effect would be for the query to seek records that had NULL values in corresponding fields. In contrast, the Dedup Query Expression replaces each underscore in the Dedup Token Expression with a ‘?’ wildcard character that matches any single character, leading to the desired query results.
You can customize both the Dedup Token Expression and the Dedup Query Expression parameters through the Third Party Administration view. The configuration of these expressions must be consistent with the internal matching logic of the vendor, which is different for each vendor. For optimal results therefore, change these values only after consulting the relevant vendor. If you change the expressions, you must regenerate match keys.
See the following table for examples about how the default expressions can differ for different business components.
Table Expressions Used for Keys and Queries (Example)
Business Component | Dedup Token Expression Parameter (Key) | Dedup Query Expression Parameter (for Queries) |
---|---|---|
Account |
"IfNull (Left ([Primary Account Postal Code], 5), '_____') + IfNull (Left ([Name], 1), '_') + IfNull (Mid ([Street Address], FindNoneOf ([Street Address], '1234567890 '), 1), '_')" |
"IfNull (Left ([Primary Account Postal Code], 5), '?????') + IfNull (Left ([Name], 1), '?') + IfNull (Mid ([Street Address], FindNoneOf ([Street Address], '1234567890 '), 1), '?')" |
Contact |
"IfNull (Left ([Postal Code], 5), '_____') + IfNull (Left ([Account], 1), '_') + IfNull (Left ([Last Name], 1), '_')" |
"IfNull (Left ([Postal Code], 5), '?????') + IfNull (Left ([Account], 1), '?') + IfNull (Left ([Last Name], 1), '?')" |
List Mgmt Prospective Contact |
"IfNull (Left ([Postal Code], 5), '_____') + IfNull (Left ([Account], 1), '_') + IfNull (Left ([Last Name], 1), '_')" |
"IfNull (Left ([Postal Code], 5), '?????') + IfNull (Left ([Account], 1), '?') + IfNull (Left ([Last Name], 1), '?')" |
The maximum number of candidate records that are sent to the third-party software at one time is determined by the value of the following vendor parameters in the Third Party Administration view:
Realtime Max Num of Records. Used in real time, the default value is 200, which is the highest value that you can set. Usually there will not be more than 200 records to send, but if there are more than 200 records, the first 200 records are sent.
Batch Max Num of Records. Used in batch mode, the default is 200, which is the highest value that you can set. If there are more than 200 records to send, the first 200 records are sent, then up to 200 records in the next iteration, and so on.
Calculation of Match Scores
After data quality identifies candidate records, they are sent to the third-party software. The software calculates a match score from 0 to 100 to indicate the degree of similarity between the candidate records and the current record.
The match score is calculated using a large number of rules that compensate for how frequently a given name or word appears in a language. The rules then weigh the similarity of each field on the record according to the real-world frequency of the name or word. For example, Smith is a common last name, so a match on a last name of Smith would carry less weight than a match on a last name that is rare.
The algorithms used to calculate match scores are complex. These algorithms are the intellectual property of third-party software vendors, Oracle cannot provide details about how these algorithms work.
Calculation of Match Scores Using the Universal Connector with Third-Party Data Quality Vendors
The third-party software examines the candidate records, computes a match score for each record that is identified as a duplicate, and returns the duplicate records to data quality. The match score is a number that represents the similarity of a record to the current active record. It is calculated taking into account a large number of rules along with a number of other factors and weightings.
Displaying Duplicates
After calculating match scores, the third-party software returns duplicate records to the Siebel application.
In real-time mode, the Siebel application displays the duplicate records in a window. These windows are:
DeDuplication Results (Account) List Applet
DeDuplication Results (Contact) List Applet
DeDuplication Results (Prospect) List Applet
You can however, configure the names of these windows as described in Configuring the Windows Displayed in Real-Time Data Matching.
The user can either choose a record for the current record to be merged with, or click Ignore to leave the possible duplicates unchanged. For more information, see Real-Time Data Cleansing and Data Matching.
In batch mode, duplicate records are displayed in the Duplicate Account Resolution, Duplicate Contact Resolution, and Duplicate Prospect Resolution views in the Administration - Data Quality screen and also in the following views:
Account Duplicates Detail View
Contact Duplicates Detail View
List Mgmt Prospective Contact Duplicates Detail View.
The user can then decide about which records to retain or merge with the retained records. For information about merging records, see Merging Duplicate Records.
If data cleansing is enabled for Siebel Universal Customer Master, you can use the following views of the Administration - Universal Customer Master screen to display duplicates:
UCM Account Duplicates Detail View
UCM Contact Duplicates Detail View
The default data quality views for accounts and contacts must be disabled. There is no separate UCM view for prospects.
Fuzzy Query
Fuzzy query is an advanced query feature that makes searching more intuitive and effective. It uses fuzzy logic to enhance your ability to locate information in the database.
Fuzzy query is useful in customer interaction situations for locating the correct customer information with imperfect information. For example, fuzzy query makes it possible to find matches even if the query entries are misspelled. As an example, in a query for a customer record for Stephen Night, you can enter Steven Knight and records for Stephen Night as well as similar entries like Steve Nite are returned.
Standard query methods can rule out rows due to lack of exact matches, whereas fuzzy query does not rule out rows that contain only some of the query specifications. The fuzzy query feature is most useful for queries on account, contact, and prospect names, street names, and so on.
Fuzzy query operates as follows:
A user enters a query from the Siebel application GUI.
Data quality inspects the query for wildcard characters, such as the * (asterisk). If any wildcards are present, data quality uses standard query functionality for that query, not fuzzy query functionality.
Data quality generates a Dedup Token from certain specified fields in the current query input, and uses the token to query the database for possible data matches. Data quality preserves query text in fields that the DeDuplication service does not evaluate for potential data matches. For more information about Dedup Tokens, see Identification of Candidate Records.
The remainder of the process depends on the number of records that are returned in the previous step:
If the preliminary query results contain more records than the value of the Fuzzy Query Max Results setting, then data quality calls the DeDuplication business service, which works with the third-party data matching engine to evaluate the possible matches. The query result returns the best available matches, up to the number of records specified by Fuzzy Query Max Results.
If the preliminary query results contain fewer records than the value of the Fuzzy Query Max Results setting, then data quality returns all of those records as the query result, sorted according to the default sort specification for the business component.
Fuzzy query is not enabled by default; to use fuzzy query you must enable it and ensure that other conditions are met as described in Enabling and Disabling Fuzzy Query.