Oracle® Warehouse Builder User's Guide 10g Release 2 (10.2.0.2) Part Number B28223-05 |
|
|
View PDF |
Because reporting on erroneous data wastes time and money, data quality is a key element of Business Intelligence. Include Warehouse Builder data quality operators in your mappings to load clean, accurate records to your targets.
This chapter discusses the Warehouse Builder mapping operators that help you achieve data quality. This chapter contains the following topics:
This section includes information and examples on how to use the Match-Merge operator in a mapping. The Match-Merge operator can be used with the Name and Address operator to support householding, which is the process of identifying unique households in name and address data.
Consider how you could utilize the Match-Merge operator to manage a customer mailing list. Use matching to find records that refer to the same person in a table of customer data containing 10,000 rows. For example, you can define a match rule that screens records that have similar first and last names. Through matching you may discover that 5 rows refer to the same person. You can merge those records into one new record. For example, you can create a merge rule to retain the values from the one of the five matched records with the longest address. The newly merged table now contains one record for each customer.
Table 21-1 shows records that refer to the same person prior to using the Match-Merge operator.
Table 21-1 Sample Records
Row | First Name | Last Name | SSN | Address | Unit | Zip |
---|---|---|---|---|---|---|
1 |
Jane |
Doe |
NULL |
123 Main Street |
NULL |
22222 |
2 |
Jane |
Doe |
111111111 |
NULL |
NULL |
22222 |
3 |
J. |
Doe |
NULL |
123 Main Street |
Apt 4 |
22222 |
4 |
NULL |
Smith |
111111111 |
123 Main Street |
Apt 4 |
22222 |
5 |
Jane |
Smith-Doe |
111111111 |
NULL |
NULL |
22222 |
Table 21-2 shows the single record for Jane Doe after using the Match-Merge operator. Notice that the new record retrieves data from different rows in the sample.
When you use Warehouse Builder to match records, you can define a single match rule or multiple match rules. If you create more than one match rule, Warehouse Builder determines two rows match if those rows satisfy any of the match rules. In other words, Warehouse Builder evaluates multiple match rules using OR logic.
The following example illustrates how Warehouse Builder evaluates multiple match rules using OR logic.
In the top portion of the Match Rules tab, create two match rules as described in Table 21-3:
Table 21-3
Name | Position | Rule Type | Usage | Description |
---|---|---|---|---|
Rule_1 |
1 |
Conditional |
Active |
Match SSN |
Rule _2 |
2 |
Conditional |
Active |
Match Last Name and PHN |
In the lower portion of the tab, assign the details to Rule_1 as described in Table 21-4:
Table 21-4
Attribute | Position | Algorithm | Similarity Score | Blank Matching |
---|---|---|---|---|
SSN |
1 |
Exact |
0 |
Do not match if either is blank |
For Rule_2, assign the details as described in Table 21-5:
Table 21-5
Attribute | Position | Algorithm | Similarity Score | Blank Matching |
---|---|---|---|---|
LastName |
1 |
Exact |
0 |
Do not match if either is blank |
PHN |
2 |
Exact |
0 |
Do not match if either is blank |
Assume you have the data listed in Table 21-6:
Table 21-6
Row | First Name | Last Name | PHN | SSN |
---|---|---|---|---|
A |
John |
Doe |
650-123-1111 |
NULL |
B |
Jonathan |
Doe |
650-123-1111 |
555-55-5555 |
C |
John |
Dough |
650-123-1111 |
555-55-5555 |
According to Rule_1, rows B and C match. According to Rule_2, rows A and B match. Therefore, since Warehouse Builder handles match rules using OR logic, all three records match.
The general rule is, if A matches B, and B matches C, then A matches C. Assign a conditional match rule based on similarity such as described in Table 21-7:
Table 21-7 Conditional Match Rule
Attribute | Position | Algorithm | Similarity Score | Blank Matching |
---|---|---|---|---|
LastName |
1 |
Similarity |
80 |
Do not match if either is blank |
Assume you have the data listed in Table 21-8:
Table 21-8 Sample Data
Row | First Name | Last Name | PHN | SSN |
---|---|---|---|---|
A |
John |
Jones |
650-123-1111 |
NULL |
B |
Jonathan |
James |
650-123-1111 |
555-55-5555 |
C |
John |
Jamos |
650-123-1111 |
555-55-5555 |
Jones matches James with a similarity of 80, and James matches Jamos with a similarity of 80. Jones does not match Jamos because the similarity is 60, which is less than the threshold of 80. However, because Jones matches James, and James matches Jamos, all three records match (Jones, James, and Jamos).
Figure 21-1 shows a mapping you can design using a Match-Merge operator. Notice that the Match-Merge operator is preceded by a Name and Address operator, NAMEADDR, and a staging table, CLN_CUSTOMERS. You can design your mapping with or without a Name and Address operator. Preceding the Match-Merge operator with a Name and Address operator provides clean and standardized data before launching time consuming match and merge operations.
Figure 21-1 Match-Merge Operator in a Mapping
Whether you include a Name and Address operator or not, be aware of the following considerations as you design your mapping:
Operating modes: Warehouse Builder operators may accept either set-based or row-based input and generate either set-based or row-based output. SQL is set-based, so a set of records is processed at one time. PL/SQL is row-based, so each row in processed separately. When the Match-Merge operator matches records, it compares each row with the subsequent row in the source and generates row-based code only. A mapping that contains a Match-Merge operator can only run in row-based mode. For more information about set-based and row-based operators, refer to Chapter 8, "Understanding Performance and Advanced ETL Concepts".
SQL based operators before Match-Merge: The Match-Merge operator accepts set-based SQL input, but generates only row-based PL/SQL output. Any operators that generate only SQL code must precede the Match-Merge operator. For example, the Joiner, Key Lookup, and Set operators generate set-based SQL output, so they must precede Match-Merge. If set-based operators appear after Match-Merge, then the mapping is invalid.
PL/SQL input: The Match-Merge operator requires SQL input except from another Match-Merge operator, as described in "Using Two Match-Merge Operators". If you want to precede a Match-Merge with an operator that generates only PL/SQL output such as the Name and Address operator, you must first load the data into a staging table.
Refining Data from Match-Merge operators: To achieve greater data refinement, map the XREF output from one Match-Merge operator into another Match-Merge operator. This scenario is the one exception to the SQL input rule for Match-Merge operators. With additional design elements, the second Match-Merge operator accepts PL/SQL. For more information, see "Using Two Match-Merge Operators".
Most match-merge operations can be performed by a single match-merge operator. However, if you are directing the output to two different targets, then you may need to use two match-merge operators in succession.
For example, when householding name and address data, you may need to merge the data first for addresses and then again for names. Assuming you map the MERGE output to a target table, you can map the XREF
group to another Match-Merge operator. Although you could map the XREF group to a staging table, this practice can lead to significant loss of performance.
Figure 21-2 shows a mapping that uses two match-merge operators. The XREF
group from MM
is mapped directly to MM_1
. For this mapping to be valid, you must assign the Match ID generated for the first XREF
group as the Match Bin rule on the second Match-Merge operator.
Figure 21-2 Householding Data: XREF Group Mapped to Second Match-Merge Operator
Use the Name page to specify a name and optional description for the operator. By default, the wizard names the Match-Merge operator MATCHMERGE.
Use the Groups page to enter customized names and descriptions of the input and output groups.
Group
Lists the predefined input and output groups. You can rename the groups, but you cannot add or delete groups in the Match-Merge operator.
Direction
Identifies whether the group is for input or output. The Match-Merge operator accepts one SQL input group and generates two PL/SQL output groups. The MERGE group has the merged data. The XREF group is an optional group for documenting the merge process.
Description
Enter an optional description.
Use the Input Connections page to select attributes for the input group (INGRP1).
To complete the Input connections page for an operator:
Select complete groups or individual attributes from the left panel.
To search for a specific attribute or group by name, type the text in Search for and click Go. To find the next match, click Go again.
Hold the Shift key down to select multiple groups or attributes. If you want to select attributes from different groups, you must first combine the groups with a Joiner or Set operator.
Use the left to right arrow button in the middle of the page to move your selections to the right side of the wizard page.
You can use the right to left arrow to move groups or attributes from the input connections list. Warehouse Builder removes the selection from the input group and removes the mapping line between the source operator and the current operator.
Use the Input Attributes page to assign input roles to each input attribute.
Attribute
Automatically lists the attributes that you selected on the Input Connections page. Click Add to add a new input attribute.
Input Role
Input roles indicate what kind of information resides in a line of data. For each attribute, select the input role that most closely matches the data contained in the source attribute.
Data Type, Length, Precision, Scale, Seconds Precision
Attributes are set automatically to NUMBER
.
Description
Enter an optional description of the input attributes.
Add
Adds a row so that you can define a new attribute.
Delete
Deletes the selected attribute.
Use this page to specify the attributes for the output MERGE group. The MERGE group produces a consolidated record from the attributes you selected.
Source Attributes
Lists all attributes defined for this Match-Merge operator. Use the shuttle buttons to move selected attributes to the Output Attributes field.
Output Attributes
Lists attributes selected for the output MERGE group.
Use the Cross-Reference Output page to optionally select attributes for the XREF group. Although the Match-Merge operator creates the XREF group by default, you have the option of adding attributes to the group or leaving it empty.
You can use the XREF group to document the merge process. Create a foreign key relationship between the original data set and the new merged data set. Then send the attributes from the XREF group to a table that records the corresponding source row for each merged row.
Alternatively, you can use the XREF group as input to a second Match-Merge operator. By using two operators, you can direct the merged output from a set of attributes to two different target. For an example, refer to "Using Two Match-Merge Operators".
Source Attributes
Lists the input attributes (INGRP1) and an XREF output attribute for each MATCH output attribute. The XREF attributes are distinguished by a prefix, which has a default value of MM_.
Output Attributes
The attributes that you want to cross-reference. To move selected attributes between the Source Attributes and Output Attributes lists, use the shuttle keys.
Merge Prefix
The prefix used to distinguish cross-reference output from data output.
Set Prefix
Changes the prefix on XREF attributes in both lists to the value displayed in the Merge Prefix field.
Use the Match Bins page to limit the number of rows to be compared. When Warehouse Builder matches the rows, it compares each row with the subsequent row for all rows within the same grouping. Limiting the number of rows can greatly enhance performance, because Warehouse Builder searches for matches only within a bin and not throughout the entire data set.
Ideally, keep the number of rows in each grouping under 2000. The number of comparisons Warehouse Builder performs is based on the following formula:
n=(b*(b-1))/2
where n is the number of comparisons, and b is the number of records in a bin.
For example, matching 5 records requires 10 comparisons, matching 50 records requires 1,225 comparisons, and matching 500 records requires 124,750 comparisons.
While you want to define Match Bins that separate rows into manageable groupings, you also want to avoid separating rows that should be matched. The attributes you select for grouping similar rows depends on your data. For example, if you have a table of customer addresses with a million rows, you may want to group the data by partial street name, city name, and zip code.
Available Attributes
Lists all input attributes, from which you can select the ones to use for binning.
Selected Attributes
One or more attributes that must match for a row to be included in a particular bin. To move attributes between the Source Attributes and Output Attributes lists, select one or more attributes and click the arrow keys located between the two lists. Use the arrow keys at the right to order the attributes from the more general at the top (such as Country) to the more specific at the bottom (such as Street).
Match New Records Only
After the first deployment, you can choose whether to match and merge all records or only new records. You do not want to match and merge the same data twice because of the impact on performance. Instead, you can just match and merge the new, uncleansed data. This option enables you to add uncleansed data into the data warehouse.
New Record Condition
Displays the conditional expression used to identify new records. Click the Ellipsis button to display the Match New Record Condition Editor (also known as the Expression Builder User Interface).
The Match Rules are used to identify duplicate records, even though some fields may have different values. You can define match rules for a single attribute or multiple attributes in the operator. On the Match Rules tab, create match rules at the top of the page. In the lower portion of Match Rules tab, specify the details for each match rule. Table 21-9 describes the match rules.
If you create more than one match rule, Warehouse Builder determines a match when those two rows satisfy any of the match rules. In other words, Warehouse Builder evaluates multiple match rules using OR logic. This is indicated by the OR icon to the left of each additional row. For more information, see "Understanding Matching Concepts".
Name
An arbitrary name for the rule. Warehouse Builder creates a default name such as MA_0 and MA_1 for each match rule. You can replace these names with meaningful ones. Meaningful names are particularly helpful when referencing the rules from a custom PL/SQL program.
Position
The order of execution. You can change the position of a rule by clicking on the row header and dragging the row to its new location. The row headers are the boxes to the left of the Name column.
Rule Type
Assign one of the rule types listed in Table 21-9. When you select a rule type, the lower portion of the Match Rules tab activates and you can enter details for the match rule.
Usage
You can designate a match rule as either active or passive.
Active: Warehouse Builder executes a match rule if you designate it as active. If more than one match rule is active, each is evaluated in turn until a match is found or until all rules are evaluated. If any rule identifies a match, those records are considered a match.
Passive: Warehouse Builder does not directly execute passive match rules. It only executes passive rules when they are called through an active custom match rule. All defined match rules appear in a list of available functions in the Custom Match Rule Editor.
Description
Displays an optional description, which you can enter.
Table 21-9 describes the types of match rules.
Table 21-9 Match Rule Descriptions
Match Rule | Description |
---|---|
All Match |
Matches all the rows within the match bin. |
None Match |
Turns off matching. No rows match within the match bin. |
Conditional |
Matches rows based on an algorithm you select. For details, see "Conditional Match Rule". |
Weight |
Matches rows based on scores that you assign to the attributes. For details, see "Weight Match Rule". |
Person |
Matches records based on people's names. For details, see "Person Match Rule". |
Firm |
Matches records based on business names. For details, see "Firm Match Rule". |
Address |
Matches records based on postal addresses. For details, see "Address Match Rule". |
Custom |
Create a custom comparison algorithm. Select Edit to launch the Custom Match Rule Editor. For more information, see "Custom Match Rule". |
Use the Conditional Match Rule to combine multiple attribute comparisons into one composite rule. When you assign multiple attributes for comparison, all the comparisons must be true for the records to be considered a match. Warehouse Builder displays the AND icon in the left-most column of subsequent conditions.
Attribute
Identifies the attribute that will be tested for a particular condition. You can select from any input attribute (INGRP1).
Position
The order of execution. You can change the position of a rule by clicking on the row header and dragging the row to its new location. The row headers are the boxes to the left of the Attribute column.
Algorithm
A list of methods that can be used to determine a match. Table 21-10 describes the algorithms.
Similarity Score
The minimum similarity value required for two strings to match, as calculated by the Edit Distance, Standardized Edit Distance, Jaro-Winkler, or Standardized Jaro-Winkler algorithms. Enter a value between 0 and 100. A value of 100 indicates an exact match, and a value of 0 indicates no similarity.
Blank Matching
Lists options for handling empty strings in a match.
To define a Conditional match rule, complete the following steps:
On the top portion of the Match Rules tab, select Conditional for the rule type.
The operator displays a Details section at the bottom of the tab.
Click Add to add a new row.
Select an attribute.
Select an algorithm. Refer to Table 21-10 for descriptions.
Specify a similarity score for the Edit Distance, Standardized Edit Distance, Jaro-Winkler, or Standardized Jaro-Winkler algorithms.
Select a method for handling blanks.
Table 21-10 describes the algorithms available for defining conditional match rules.
Table 21-10 Algorithms for Conditional Match Rules
Algorithm | Description |
---|---|
Exact |
Matches values only when they are exactly the same. For example, 'Dog' and 'dog!' do not match because the second string is not capitalized and contains an extra character. For numeric, date, and other non-character data types, this is the only type of comparison allowed. |
Standardized Exact |
Eliminates case, spaces, and non-alphanumeric characters before comparing for an exact match. For example, 'Dog' and 'dog!' do match. |
Soundex |
Matches phonetically similar stings. The operator converts the strings to phonetic codes. If the codes match, then the two strings match. This algorithm is often used for matching names. It is an older phonetic algorithm than Double Metaphone. The Soundex algorithm ignores case and spaces. It basically retains the first letter, eliminates vowels, and replaces consonants with numbers. Consonants with similar sounds have the same numeric value. |
Edit Distance |
Matches strings with a similarity value equal to or greater than the Similarity Score that you specify. This algorithm is often used for correcting typographical errors such as transposed characters. The Edit Distance algorithm calculates the number of deletions, insertions, or substitutions required to transform one string into another. A similarity value of 100 indicates that the two strings are identical. A value of zero indicates no similarity whatsoever. For example, if the string 'tootle' is compared with the string 'tootles', then the edit distance is 1. The length of the string 'tootles' is 7. The similarity value is therefore 6/7*100 or 85. |
Standardized Edit Distance |
Eliminates case, spaces, and non-alphanumeric characters before using the Edit Distance algorithm to determine a match. |
Partial Name |
Matches strings when one string is contained in the other, starting with the first word. The algorithm performs a Standardized Exact comparison on the entire string before attempting to match a partial name. For example, 'Midtown Power' matches 'Midtown Power and Light,' but would not match 'Northern Midtown Power'. |
Abbreviation |
Matches strings when one string contains words that are abbreviations of corresponding words in the other. The operator first performs a Standardized Exact comparison on the entire string. It then looks for abbreviations for each word. If the larger of the words contains all of the letters from the shorter word and the letters appear in the same order as the shorter word, then the words are considered a match. For example, 'Intl. Business Products' matches 'International Bus Prd'. |
Acronym |
Matches strings if one string is an acronym for the other. The operator first performs a Standardized Edit Distance comparison on the entire string. If no match is found, then each word of one string is compared to the corresponding word in the other string. If the entire word does not match, each character of the word in one string is compared to the first character of each remaining word in the other string. If the characters are the same, the names match. For example, 'Chase Manhattan Bank NA' matches 'CMB North America.' The comparison ignores case, non-alphanumeric characters, and noise words such as 'and' and 'the.' |
Jaro-Winkler |
Matches strings based on their similarity value using an improved comparison system over the Edit Distance algorithm. It accounts for the length of the strings and penalizes more for errors at the beginning. It also recognizes common typographical errors. The strings match when their similarity value is equal to or greater than the Similarity Score that you specify. A similarity value of 100 indicates that the two strings are identical. A value of zero indicates no similarity whatsoever. Note that the value actually calculated by the algorithm (0.0 to 1.0) is multiplied by 100 to correspond to the Edit Distance scores. |
Standardized Jaro-Winkler |
Eliminates case, spaces, and non-alphanumeric characters before using the Jaro-Winkler algorithm to determine a match. |
Double Metaphone |
Matches phonetically similar strings using an improved coding system over the Soundex algorithm. It generates two codes for strings that could be pronounced in multiple ways. If the primary codes match for the two strings, or if the secondary codes match, then the strings match. The Double Metaphone algorithm accounts for alternate pronunciations in Italian, Spanish, French, and Germanic and Slavic languages. Unlike the Soundex algorithm, Double Metaphone encodes the first letter, so that 'Kathy' and 'Cathy' evaluate to the same phonetic code. |
Use this rule to match rows based on a weight value. A weighted match rule is most useful when comparing a large number of attributes. This rule can prevent a single attribute from invalidating a match, which results from the AND logic of conditional rules.
Warehouse Builder compares each attribute using a similarity algorithm that returns a score between 0 and 100 to represent the similarity between the rows. For two rows to be considered a match, the total counts must be greater than the overall score you designate.
Similarity Algorithm
The method used to determine a match. Choose from these algorithms:
Edit Distance: Calculates the number of deletions, insertions, or substitutions required to transform one string into another.
Jaro-Winkler: Uses an improved comparison system over the Edit Distance algorithm. It accounts for the length of the strings and penalizes more for errors at the beginning. It also recognizes common typographical errors.
Attribute
Identifies the attribute that will be tested for a particular condition. You can select from any input attribute (INGRP1).
Maximum Score
The weight value for the attribute. This value should be greater than the value of Required Score to Match.
Score When Blank
The similarity value when one of the records is empty.
Required Score to Match
A value that represents the similarity required for a match. A value of 100 indicates that the two values are identical. A value of zero indicates there is no similarity.
To use the Weight match rule, complete the following steps:
On the Match Rules tab, select Weight as the Rule Type.
The Details tab is displayed at the bottom of the page.
Select Add at the bottom of the page to add a new row.
For each row, select an attribute to add to the rule.
In Maximum Score, assign a weight to each attribute. Warehouse Builder compares each attribute using a similarity algorithm that returns a score between 0 and 100 to represent the similarity between the rows.
In Required score to match, assign an overall score for the match.
For two rows to be considered a match, the total counts must be greater than the Required score.
Assume you want to apply the Weight match rule to the data in Table 21-11.
Table 21-11 Example Records for Matching
Record Number | Attr_1 | Attr_2 |
---|---|---|
Rec_1 |
CA |
|
Rec_2 |
CA |
|
Rec_3 |
CA |
QR |
For Maximum score, assign a value of 50 to both Att_1 and Att_2. Assign a value of 80 for the Required score to match. You can expect the following results:
Rec_1 is the new record. The operator reads it first.
In Rec_2, the value for Attr_1 is CA. That value has a similarity of 100 with the value in the new record, Rec_1. Since the weight value for Attr_1 is 50, its score is 50 (100% of 50).
In Rec_2, the value for Attr_2 is QQ and has a similarity of 100. The weight value for Attr_2 is also 50 and its score is therefore 50 (100% of 50). The total maximum score is 100 (50 + 50). This equals or exceeds the value of the Required score for match, so Rec_2 and Rec_1 match.
In Rec_3, Attr_1 is CA and has a similarity of 100 with Rec_1. Since the weight value for Attr_1 is 50, its weighted score is 50 (100% of 50).
In Rec_3, the value for Attr_2 is QR and that has a similarity of 50. The maximum value for Attr_2 is 50, so its score is 25 (50% of 50). The total weighted score is 75 (50+25). This is less than the value of the Required score to match. Therefore, Rec_3 and Rec_1 do not match.
Use the Person match rule to match records based on names. Matching by names is most effective when you first correct the address data using the Name and Address operator before the Match-Merge operator.
When you select the Person match rule, the Match Rules page displays the Person Attributes and Details tabs.
Person Attributes Tab
Eligible Attributes: Lists all input attributes.
Attributes: The attributes containing parts of names. Use the shuttle keys to move attributes from Eligible Attributes to the Attributes column of Name Roles.
Name Roles: Lists the roles for different parts of a name. Select the appropriate role for each attribute. Table 21-12 describes the roles.
Table 21-12 Name Roles for Person Match Rules
Role | Description |
---|---|
Prename |
The operator compares prenames only if First Name Standardized is blank for one of the records, the 'Mrs.' option is selected, and the Last Name and any Middle Name role match. Given that criteria, the operator matches the record 'Mrs. William Webster' with 'Mrs. Webster'. |
First Name Standardized |
First names match if both are blank. A blank first name will not match a non-blank first name unless the Prename role has been assigned and the 'Mrs. Match' option is set. |
Middle Name Standardized, Middle Name 2 Standardized, Middle Name 3 Standardized |
The operator compares and cross compares any of the middle names assigned. By default, the middle names must match exactly. Middle names match if either or both are blank. To assign any of the middle name roles, you must also assign the First Name Standardized role. |
Last Name |
The operator assigns last names as matching if both are blank and not matching if only one is blank. |
Maturity Post Name |
This is the same as post names such as 'Jr.' and 'Sr.'. The operator assigns these as matching if the values are exact or if either is blank. |
Person Details Tab
Use the Details tab to set options for determining a match. Table 21-13 lists the rule options you can select for each component of the name. For a description of the algorithms, refer to Table 21-10.
Table 21-13 Options for Person Match Rule
Option | Description |
---|---|
Detect switched name order |
Detects switched name orders such as matching 'Elmer Fudd' to 'Fudd Elmer'. You can select this option if you selected First Name and Last Name roles for attributes on the Person Attributes tab. |
Match on initials |
Matches initials to names such as 'R.' and 'Robert'. You can select this option for first name and middle name roles. |
Match on substrings |
Matches substrings to names such as 'Rob' to 'Robert'. You can select this option for first name and middle name roles. |
Similarity Score |
Uses a similarity score to determine a match, as calculated by the Edit Distance or Jaro-Winkler algorithms. Enter a value between 0 and 100 as the minimum similarity value required for a match. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever. |
Match on Phonetic Codes |
Determines a match using either the Soundex or the Double Metaphone algorithms. |
Detect compound name |
Matches compound names to names such as 'De Anne' to 'Deanne'. You can select this option for the first name role. |
"Mrs" Match |
Matches prenames to first and last names such as 'Mrs. Washington' to 'George Washington'. You can select this option for the prename role. |
Match hyphenated names |
The operator matches hyphenated names to unhyphenated names such as 'Reese-Jones' to 'Reese'. You can select this option for the last name role. |
Detect missing hyphen |
The operator detects missing hyphens, such as matching 'Hillary Rodham Clinton' to 'Hillary Rodham-Clinton'. You can select this option for the last name role. |
To define a Person match rule, complete the following steps:
On the Match Rules tab, select Person as the Rule Type.
The Person Attributes tab and Details tab are displayed at the bottom of the page.
In the left panel of the Person Attributes tab, select the attributes that describe a full name and use the arrow key to move them to Name Roles Attributes.
For each attribute, select the role it plays in a name. You must define either the Last Name or First Name Standardized for the match rule to be effective. See Table 21-12 for the types of roles you can assign.
Select the Details tab and select the applicable options as listed in Table 21-13.
Use the Firm match rule to match records by business name. This type of match is most effective when you first correct the address data using the Name and Address operator before the Match-Merge operator.
When you select the Firm match rule, the Match Rules page displays the Firm Attributes and Details tabs.
Firm Attributes tab
Use the Firm Attributes tab to identify attributes that contain business names.
Eligible Attributes: Lists all input attributes.
Attributes: The attributes containing the names of businesses. Move one or two attributes containing business names from Eligible Attributes to the Attributes column of Firm Roles.
Firm Roles: Lists two roles, Firm 1 and Firm 2. If you selected one attribute, then designate Firm 1 as its role. If you selected two attributes, then designate one of them as Firm 1 and the other as Firm 2.
Firm Details Page
Use the Details tab to set options for determining a match. By default, the operator compares the values in Firm 1 for exact matches.
Table 21-14 lists the rule options you can select for each component of the name. For a description of the algorithms, refer to Table 21-10.
Table 21-14 Options for Firm Rules
Option | Description |
---|---|
Strip noise words |
Removes words such as 'and' and 'the.' |
Cross-match firm 1 and firm 2 |
Attempts to find matches between the Firm 1 attribute and the Firm 2 attribute. |
Match on partial firm name |
Uses the Partial Name algorithm to determine a match. |
Match on abbreviations |
Uses the Abbreviation algorithm to determine a match. |
Match on acronyms |
Uses the Acronym algorithm to determine a match. |
Similarity score |
Uses a similarity score to determine a match, as calculated by the Edit Distance or Jaro-Winkler algorithms. Enter a value between 0 and 100 as the minimum similarity value required for a match. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever. |
To define a Firm match rule, complete the following steps:
On the Match Rules tab, select Firm as the Rule Type.
The Firm Attributes tab and Details tab are displayed at the bottom of the page.
In the left panel of the Firm Attributes tab, select one or two attributes that represent the firm name and click the right shuttle button.
The attributes move to the Firm Roles box.
For each attribute, click Role Required. From the list, select Firm 1 for the first attribute, and Firm 2 for the second attribute, if it exists.
On the Details tab, select the applicable options.
Use the Address match rule to match records based on postal addresses.
Matching by address is most effective when you first correct the address data using the Name and Address operator before the Match-Merge operator. The Name and Address operator identifies addresses as existing in a postal matching database and designates the records with the Is Found flag. The Match-Merge operator processes addresses with the Is Found role faster because the data is known to be syntactically correct, legal, and existing.
Address Attributes tab
Use the Address Attributes tab to identify attributes that contain addresses.
Eligible Attributes: Lists all input attributes.
Attributes: The attributes containing parts of addresses. Use the shuttle keys to move attributes containing address information from Eligible Attributes to the Attributes column of Address Roles.
Address Roles: Lists the various parts of an address. Select the one that most closely matches each attribute. Table 21-15 describes the address roles.
Table 21-15 Address Roles
Role | Description |
---|---|
Primary Address |
A street address such as 100 Main Street or a post office box such as PO Box 100. Assign this role to one attribute; otherwise the match rule is invalid. |
Unit Number |
Suite numbers, floor numbers, or apartment numbers for the Primary Address. For addresses with matching primary addresses, the operator compares unit numbers. If both unit numbers are blank, then they match. If only one unit number is blank, they match only when you selected the Match on blank secondary address option. |
PO Box |
A post office box number in the Primary Address. The operator compares the post office box number with the number portion of the primary address, when the primary address is a PO Box. When the primary address represents a street address, the PO Box number is blank. |
Dual Primary Address |
A second address, which may represent either an additional location or a former location. For addresses with matching primary addresses, the operator compares the dual primary addresses |
Dual Unit Number |
Suite numbers, floor numbers, or apartment numbers for the Dual Primary Address. The operator compares the Dual Unit Number in one record with the Unit Number and Dual Unit Number of another record. Unit numbers match when one or both are blank. |
Dual PO Box |
A post office box number in the Dual Primary Address. The operator compares the Dual PO Box in one record with both the PO Box and Dual PO Box of another record. |
City |
Assign this role only when also assigning the State role. For uncorrected address data, the operator compares each City. For addresses already corrected by the Name and Address operator, Match-Merge only compares City when the postal codes do not match. If both City and State match, then the operator compares the address roles. Cities match when both are blank but not when only one is blank. |
State |
Assign this role only when also assigning the City role. For uncorrected address data, the operator compares each State. For addresses already corrected by the Name and Address operator, Match-Merge only compares States when the postal codes do not match. If both City and State match, then the operator compares the address roles. States match when both are blank but not when only one is blank. |
Postal Code |
For uncorrected address data, the operator does not use Postal Code. For addresses already corrected by the Name and Address operator, Match-Merge only compares each Postal Code. If the codes match, then the operator compares the address roles. If the codes do not match, then the operator compares City and State to determine if it should compare address roles such as Primary Address. |
Is Found |
Assign this role to an address that you previously cleansed and standardized with the Name and Address operator. The Name and Address operator marks records with the Is Found flag when it identifies the address as part of a country postal matching database. |
Address Details Page
Use the Details tab to set options for determining a match. Table 21-16 describes the options you can assign to Address Roles.
Table 21-16 Options for Address Roles
Option | Description |
---|---|
Allow differing secondary address |
Matches addresses with different unit numbers. |
Match on blank secondary address |
Matches addresses with one blank unit number. |
Match on either street or post office box |
Matches records if either the street address or the post office box match. |
Address line similarity |
Uses a similarity score to determine a match in the address lines, as calculated by the Edit Distance or Jaro-Winkler algorithms. Enter a value between 0 and 100 as the minimum similarity value required for a match. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever. In an address, the street name is the Primary Address and the unit number is the Secondary Address. Address Line Similarity evaluates both the primary and secondary addresses. |
Last line similarity |
Uses a similarity score to determine a match in the City, State, and Postal Code attributes, as calculated by the Edit Distance or Jaro-Winkler algorithms. Enter a value between 0 and 100 as the minimum similarity value required for a match. A value of 100 requires an exact match, and a value of 0 requires no similarity whatsoever. |
To define an Address match rule, complete the following steps:
On the Match Rules tab, select Address as the Rule Type.
The Address Attributes tab and Details tab are displayed at the bottom of the page.
In the left panel of the Address Attributes tab, select the attribute that represents the primary address. Use the right shuttle key to move it to the Address Roles Attributes column.
Click Role Required and designate that attribute as the Primary Address.
You must perform this step. If you do not assign the Primary Address role, the match rule is invalid.
Add other attributes and designate their roles as necessary. See Table 21-15 for the types of roles you can assign.
Select the Details tab and select the applicable options as listed in Table 21-16.
Use the Custom match rule to execute your own PL/SQL program as a comparison algorithm.
Match Rules Detail
Displays the PL/SQL code composing your custom algorithm. You can edit code directly in this field or use the Custom Match Rule Editor.
Edit
Displays the Custom Match Rule Editor.
Use the Custom Match Rule Editor to develop custom PL/SQL programs to use as comparison algorithms.
The editor provides basic program development support. It consists of these components:
Menu Bar
The menu bar contains the following menus:
Code: Enables you to read from and write to files on your local computer network, save your editing changes, and so forth.
Edit: Provides basic cut, copy, and paste functionality.
Search: Provides basic search and replace functionality.
Test: Checks for syntax errors in the code entered in the Implementation field.
Help: Displays this topic.
Search For Field
All or part of a function name you wish to find in the navigation tree. Click Go to search for the first instance, click Go again to find the next instance, and so forth.
Navigation Tree
Lists input parameters and transformations that you can include in your program. The navigation tree contains these folders:
Match Functions: List both active and passive rules that you have defined. You can call these rules the same as any other PL/SQL function.
Parameters: Lists all input attributes under two subfolders named THIS_ and THAT_. Your program can compare two rows in the same attribute or different ones.
Transformation Library: Lists all Warehouse Builder transformations.
To insert a function, parameter, or transformation into your code at the cursor, double-click or drag-and-drop it into the Implementation field.
Implementation Field
Displays your program code.
Definition Tab
Displays the signature of the selected function.
Messages Tab
Displays information about the success or failure of validating the code.
To define a Custom match rule, complete the following steps:
On the Match Rules tab, select Custom as the Rule Type.
A Details field is displayed at the bottom of the page with the skeleton of a PL/SQL program.
Click Edit to open the Custom Match Rules Editor.
To enter PL/SQL code, do any combination of the following:
To read in a file, choose Open File from the Code menu.
To enter text, first position the cursor using the mouse or arrow keys, then begin typing. You can also use the commands on the Edit and Search menus.
To reference any function, parameter, or transformation in the navigation tree, first position the cursor, then double-click or drag-and-drop the object onto the Implementation field.
To validate your code, choose Validate from the Test menu.
The validation results appear on the Messages tab.
To save your code, choose Save from the Code menu.
To close the Custom Match Rules Editor, choose Close from the Code menu.
Use the Merge Rules page to select values for the attributes in the merged record.
Name
An arbitrary name for the rule. Warehouse Builder creates a default name such as ME_0 for each merge rule. You can replace these names with meaningful ones. Meaningful names are particularly helpful when calling rules from a PL/SQL custom rule.
Position
The order of execution. You can change the position of a rule by clicking on the row header and dragging the row to its new location. The row headers are the boxes to the left of the Name column.
Rule Type
Assign one of the rule types listed in Table 21-17. When you select a rule type, the lower portion of the Merge Rules tab activates and you can enter details for the merge rule.
Attribute
Identifies the attribute whose values are selected by the merge rule. Provides a list of all merged attributes.
When you select a Record rule, Warehouse Builder disables the list of attributes. You select multiple attributes on the Attributes tab, which appears on the lower portion of the page.
Description
An optional description of the rule, which you can enter.
When you define a merge rule, you can define one rule for all the attributes in the merged record or define a rule for each attribute. For instance, if the merged record is a customer record, it may have attributes such as ADDRESS1, ADDRESS2, CITY, STATE, and ZIP. You can write five rules that select the value of each attribute from up to five different records, or one Record rule that selects that values of all five attributes from one record. Use record rules when multiple attributes compose a logical unit, such as an address. For example, City, State, and Zip Code might be three different attributes, but the data for these attributes should all come from the same record.
Table 21-17 describes the types of merge rules.
Table 21-17 Merge Rule Types
Merge Rule | Description |
---|---|
Any |
Uses the first non-blank value. |
Match ID |
Merges records that have already been output from another Match-Merge operator. See Match ID Merge Rule for details. |
Rank |
Uses the ranked values in a second attribute to select the preferred value. See Rank and Rank Record Merge Rules for details. |
Sequence |
Uses the values in a sequence to generate unique keys as the data is loaded. See Sequence Merge Rule for details. |
Min Max |
Uses the first value based on the order of another attribute. See Min Max and Min Max Record Merge Rules for details. |
Copy |
Uses the values from another merged attribute. See Copy Merge Rule for details. |
Custom |
Uses the PL/SQL code that you provide as the criteria for merging records. See Custom and Custom Record Merge Rules for details. |
Any Record |
Identical to the Any rule, except that an Any Record rule applies to multiple attributes. |
Rank Record |
Identical to the Rank rule, except that a Rank Record rule applies to multiple attributes |
Min Max Record |
Identical to the Min Max rule, except that a Min Max Record rule applies to multiple attributes. |
Custom Record |
Identical to the Custom rule, except that a Custom Record rule applies to multiple attributes. |
Use the Match ID merge rule to merge records that have already been output in the XREF group from another Match-Merge operator. No other operator is valid for this type of input. For more information, refer to "Using Two Match-Merge Operators".
Next Value of the Sequence
Identifies the sequence that will be used by the rule.
sequences list
Lists all sequences defined in the current project.
Select Sequence
Sets the sequence for the rule to the sequence currently selected in the list. Move a sequence from the sequences list to Select Sequence.
Use the Rank and Rank Record rules when merging data from multiple sources. These rules enable you to identify your preference for certain sources. Your data must have a second input attribute on which the rule is based.
For example, the second attribute might identify the data source, and these data sources are ranked in order of reliability. The most reliable value would be used in the merged record. The merge rule might look like this:
INGRP1.SOURCE = 'Order Entry'
Name
An arbitrary name for the rule. Warehouse Builder creates a default name such as RULE_0 for each rank merge rule. You can replace these names with meaningful ones.
Position
The order of execution. You can change the position of a rule by clicking on the row header and dragging the row to its new location. The row headers are the boxes to the left of the Name column.
Expression Record Selection
The custom SQL expression used in the ranking. Click the Ellipsis button to display the Rank Rule Editor (also called the Expression Builder User Interface). Use this editor to develop the ranking expression.
The Sequence rule uses the next value in a sequence.
Next Value of the Sequence
Identifies the sequence that will be used by the rule.
sequences list
Lists all sequences defined in the current project.
Select Sequence
Sets the sequence for the rule to the sequence currently selected in the list.
The Min Max and Min Max Record rules select an attribute value based on the size of another attribute value in the record.
For example, you might select the First Name value from the record in each bin that contains the longest Last Name value.
Selecting Attribute
Lists all input attributes. Select the attribute whose values provide the order.
Attribute Relation
Select the characteristic for choosing a value in the selected attribute.
Minimum. Selects the smallest numeric value or the oldest date value.
Maximum. Selects the largest numeric value or the most recent date value.
Shortest. Selects the shortest character value.
Longest. Selects the longest character value.
The Copy rule uses the values from another merged attribute.
Merged Attribute
Lists the other merged attributes, which you selected on the Merge Attributes page.
The Custom and Custom Record rules use PL/SQL code that you provide to merge the records. The following is an example of a Custom merge rule, which returns the value of the TAXID attribute for record 1.
BEGIN RETURN M_MATCHES(1)."TAXID"; END;
Following is an example of a Custom Record merge rule, which returns a record for record 1:
BEGIN RETURN M_MATCHES(1); END;
Merge Rules Detail
Displays the PL/SQL code composing your custom algorithm. You can edit code directly in this field or use the Custom Merge Rule Editor.
Edit
Displays the Custom Merge Rule Editor.
Use the Custom Merge Rule Editor to develop custom PL/SQL programs to use as merging algorithms.
The editor provides basic program development support. It consists of these components:
Menu Bar
The menu bar contains the following menus:
Code: Enables you to read from and write to files on your local computer network, save your editing changes, and so forth.
Edit: Provides basic cut, copy, and paste functionality.
Search: Provides basic search and replace functionality.
Test: Checks for syntax errors in the code entered in the Implementation field.
Help: Displays this topic.
Search For Field
All or part of a function name you wish to find in the navigation tree. Click Go to search for the first instance, click Go again to find the next instance, and so forth.
Navigation Tree
Lists input parameters and transformations that you can include in your program. The navigation tree contains these folders:
Parameters: Lists all input attributes under the M_MATCHES folder and the merged attributes under the M_MERGE folder.
Transformation Library: Lists all Warehouse Builder transformations.
To insert a parameter or a transformation into your code at the cursor, double-click or drag-and-drop it into the Implementation field.
Implementation Field
Displays your program code.
Definition Tab
Displays the signature of the selected function.
Messages Tab
Displays information about the success or failure of validating the code.
The Name and Address operator accepts one PL/SQL input and generates one PL/SQL output.
If you experience time-out errors, you may need to increase the socket time-out setting of the Name and Address Server. The time-out setting is the number of seconds the server will wait for a parsing request from a mapping before the server drops a connection. The default setting is 600 seconds (10 minutes). After the server drops a connection because of inactivity, subsequent parsing requests fail with a NAS-00021 error.
For most mappings, long time lapses between parsing requests are rare. However, maps operating in row based mode with a filter operator may have long time lapses between record parsing requests, because of the inefficiency of filtering records in row based mode. For this type of mapping, you may need to increase the socket time-out value to prevent connections from being dropped.
To increase the socket time-out setting, refer to "Configuring the Name and Address Server".
Use the General page to specify a name and optional description for the operator. By default, the wizard names the Name and Address operator NAMEADDR
.
Characterize the nature of your input data by assigning general definitions to this Name and Address operator. In the Definitions page, select a Parsing Type, Primary Country, and Dual Address Assignment.
Select one of the following parsing types from the drop-down list:
Note:
You can only specify the parsing type when you first add the Name and Address operator to your mapping. You cannot modify the parsing type in the editor.Name Only: Select this option when the input data contains only name data. Names can include both personal and business names. Selecting this option instead of the more generic Name and Address option may improve performance and accuracy, depending on the adapter.
Address Only: Select this option when the input data contains only address data and no name data. Selecting this option instead of the more generic Name and Address option may improve performance and accuracy, depending on the adapter.
Name and Address: Select this option when the input data contains both name and address data.
Select the country that best represents the country distribution of your data. The primary country is used by some providers of name and address cleansing software as a hint for the appropriate parser or parsing rules to use on the initial parse of the record. For other name and address service providers, external configuration of their installation controls this behavior.
A dual address contains both a Post Office (PO) box and a street address for the same address record. For records that have dual addresses, your selection determines which address becomes the normal address and which address becomes the dual address. A sample dual address is:
PO Box 2589 4439 Mormon Coulee Rd La Crosse WI 54601-8231
Note that the choice for Dual Address Assignment affects which postal codes are assigned during postal code correction, because the street address and PO box address may correspond to different postal codes.
Street Assignment: The street address is the normal address and the PO Box address is the dual address. This means that the Address
component is assigned the street address. In the preceding example, the Address
is 4439 MORMON COULEE RD. This choice corrects the postal code to 54601-8220.
PO Box Assignment: The PO Box address is the normal address and the street address is the dual address. This means that the Address
component is assigned the Post Office (PO) box address. In the preceding example, the Address
is PO BOX 2589. This choice corrects the postal code to 54602-2589.
Closest to Last Line: Whichever address occurs closest to the last line is the normal address; the other is the dual address. This means that the Address
component is assigned the address line closest to the last line. In the preceding example, the Address
is 4439 MORMON COULEE RD. This choice corrects the postal code to 54601-8220.
This option has no effect for records having a single street or PO box address.
Note:
Dual Address Assignment may not be supported by all name and address cleansing software providers.The Groups page lists the input and output groups defined for an operator. By definition, the Name and Address operator has one input group and one output group. You cannot edit, add, or delete groups in the Name and Address operator. The input group is called INGRP1 and the output group is OUTGRP1. You can edit these names. If the input data requires multiple groups, create a separate Name and Address operator for each group.
Use the Input Connections page to select attributes from any operator in your mapping that you want to copy and map into the operator. The Available Attributes box lists the available attributes. The Mapped Attributes box lists the attributes that will be processed by the Name and Address operator. You can move an entire group of attributes, or individual attributes from a single group.
If you have not created one or more operators for the source data yet, the Available Attributes column will be empty.
To complete the Input Connections page for an operator:
Select complete groups or individual attributes from the Available Attributes panel. The Available Attributes panel enables you to select attributes from any operator in your mapping.
To search for a specific attribute or group by name, type the text in Search for and click Go. To find the next match, click Go again.
Hold the Shift key down to select multiple groups or attributes. If you want to select attributes from different groups, you must first combine the groups with a Joiner or Set operator.
Use the left to right arrow button between the two panels to move your selections to the Mapped Attributes panel.
Use the Input Attributes page to assign input roles to each input attribute.
Attribute
Automatically lists the attributes that you selected on the Input Connections page. Otherwise, the list may be empty. Click Add to add each input attribute.
Map all attributes through the Name and Address operator, including those that you do not wish to process. Bypassing the Name and Address operator was valid in earlier releases of Warehouse Builder, but may cause problems now. Instead, assign these attributes the Pass Through
input role in the Name and Address operator.
Input Role
Input roles indicate what kind of name or address information resides in a line of data. For each attribute, select the input role that most closely matches the data contained in the source attribute. Refer to "Input Role Descriptions" for a complete list of input roles and their descriptions.
Whenever possible, choose discrete roles (such as City
, State
, and Postal Code
) rather than non-discrete, line-oriented roles (such as Last Line
). Discrete roles give the Name and Address operator more information about the data content and result in better parsing.
Data Type, Length, Precision, Scale, Seconds Precision
Set the data type and related parameters for attributes given the Pass Through
input role. Attributes with other input roles are set automatically to VARCHAR2 and cannot be changed.
Description
Enter an optional description of the input attributes.
Table 21-18 describes the input roles for the Name and Address Operator.
Table 21-18 Name and Address Operator Input Roles
Input Role | Description |
---|---|
Pass Through |
Any attribute that requires no processing. |
First Name |
First name, nickname, or shortened version of the first name. |
Middle Name |
Middle name or initial. Use when there is only one middle name, or for the first of several middle names; for example, 'May' in Ethel May Roberta Louise Mertz. |
Middle Name 2 |
Second middle name; for example, 'Roberta' in Ethel May Roberta Louise Mertz. |
Middle Name 3 |
Third middle name; for example, 'Louise' in Ethel May Roberta Louise Mertz. |
Last Name |
Last name or surname. |
First Part Name |
First part of the Person name, including:
Use when these components are contained in one source column. |
Last Part Name |
Last part of Person Name, including:
Use when these components are all contained in one source column. |
Pre Name |
Information that precedes and qualifies the name; for example, Ms., Mr., or Dr. |
Post Name |
Generation or other information qualifying the name; for example, Jr. or Ph.D. |
Person |
Full person name, including:
Use when these components are all contained in one source column. |
Person 2 |
Designates a second person if the input includes multiple personal contacts. |
Person 3 |
Designates a third person if the input includes multiple personal contacts. |
Firm Name |
Name of the company or organization. |
Primary Address |
Box, route, or street address, including:
This does not include the Unit Designator or the Unit Number. |
Secondary Address |
The second part of the street address, including:
For example, in a secondary address of Suite 2100, the Unit Designator is STE (a standardization of 'Suite') and the Unit Number is 2100. |
Address |
Full address line, including:
Use when these components share one column. |
Address 2 |
Generic address line. |
Neighborhood |
Neighborhood or barrio, common in South and Latin American addresses. |
Locality Name |
The city (shi) or island (shima) in Japan. |
Locality 2 |
The ward (ku) in Japan. |
Locality 3 |
The district (machi) or village (mura) in Japan. |
Locality 4 |
The subdistrict (aza, bu, chiwari, or sen) in Japan. |
City |
Name of city. |
State |
Name of state or province. |
Postal Code |
Postal code, such as a ZIP code in the United States or a postal code in Canada. |
Country Name |
Full country name. |
Country Code |
The ISO 3166-1993 (E) two- or three-character country code. For example, US or USA for United States; CA or CAN for Canada. |
Last Line |
Last address line, including:
Use when these components are all contained in one source column. |
Last Line 2 |
For Japanese adaptors, specifies additional line information that appears at the end of an address. |
Line1... Line10 |
Use for free-form name, business, personal, and address text of any type. These roles do not provide the parser with any information about the data content. Whenever possible, use the discrete input roles provided instead. |
Use the Output Attributes page to define output attributes that determine how the Name and Address operator handles parsed data. Specifically, the output attribute properties characterize the data extracted from the parser output.
The Output Attributes page is empty initially. You can create and edit attributes.
Note:
The attributes for output components with the Pass Through role cannot be changed.Add button
To create a new output attribute, click Add.
Attribute column
A new attribute has a default name such as OUTPUT1. This default name changes when you select an output component to a default descriptive name, such as Primary_Address
. It does not change if you already replaced the name. Click the name to replace it with a new name.
Select an output component for every output attribute. Click the Ellipsis button to the right of the cell to open the Output Attribute Components dialog. See "Descriptions of Output Components" for a complete list of output components.
Be sure to add error handling flags, such as Is Parsed
, Is Good Name
, and Is Good Address
. These flags can be used with the Splitter operator to separate good records from records with errors, and load them into different targets.
Data Type column
Pass Through output components retain their input data type. All other output component types are VARCHAR2. This column is read-only.
Length column
Adjust the field length to match the length of the target attribute to which you intend to map the output attribute. This practice helps prevent data truncation warnings during code generation, or errors during execution.
Use the Output Attribute Components dialog to select a component for each output attribute. The output components indicate which component an attribute constitutes, such as the first name, last name, street name, city, or state.
Select an Output Component
Select a component from the navigation tree to apply to the output attribute. See "Descriptions of Output Components" for a description of these components.
Output Component
Identifies the component currently selected from the tree. If this field is empty, the current selection is a folder. Expand the folder and select a valid component. Note that some folders are valid components.
Address Type
Available only for dual addresses, and not supported by all name and address cleansing software providers. The Dual Address Assignment option you specified in the Definitions Page page determines whether the street address or the PO box address is used as the dual address. Select either NORMAL or DUAL. For more information on dual addresses, see "Dual Address Assignment".
Instance
Specify which instance of an output component to use when there are multiple occurrences of the same attribute in a single record. The instance control applies to all name components and several address components, such as Miscellaneous Address
and Complex
. This setting enables you to extract numerous attributes of the same nature. The number of instances allowed for various components depends on the third-party name and address cleansing software.
For example, an input record containing John and Jane Doe has two name occurrences: John Doe and Jane Doe. You can extract John Doe by assigning Instance 1 to the First Name and Last Name components. Similarly, you can extract Jane Doe by assigning Instance 2 to the First Name and Last Name components.
Output components are grouped in the following categories:
The Pass Through
output component is for any attribute that requires no processing. When you create a Pass Through
input role, the corresponding Pass Through
output component is created automatically. You cannot edit a Pass Through
output component, but you can edit the corresponding input role.
Table 21-19 describes the Name output components. Many components can be used multiple times to process a record, as noted in the table. For example, in records with two occurrences of Firm Name, you can extract both by adding two output attributes. Assign one as the First instance, and the other as the Second instance.
Table 21-19 Name Output Components
Subfolder | Output Component | Description |
---|---|---|
None |
Pre Name |
Title or salutation appearing before a name; for example, Ms. or Dr. Can be used multiple times. |
None |
First Name Standardized |
Standard version of first name; for example, Theodore for Ted or James for Jim. Can be used multiple times. |
None |
Middle Name Standardized |
Standardized version of the middle name; for example, Theodore for Ted or James for Jim. Use when there is only one middle name, or for the first of several middle names. Can be used multiple times. |
None |
Middle Name 2 Standardized |
Standardized version of the second middle name; for example, Theodore for Ted or James for Jim. Can be used multiple times. |
None |
Middle Name 3 Standardized |
Standardized version of the third middle name; for example, Theodore for Ted or James for Jim. Can be used multiple times. |
None |
Post Name |
Name suffix indicating generation; for example, Sr., Jr., or III. Can be used multiple times. |
None |
Other Post Name |
Name suffix indicating certification, academic degree, or affiliation; for example, Ph.D., M.D., or R.N. Can be used multiple times. |
None |
Title |
Personal title, for example, Manager. |
None |
Name Designator |
Personal name designation; for example, ATTN (to the attention of) or C/O (care of). Can be used multiple times. |
None |
Relationship |
Information related to another person; for example, Trustee For. Can be used multiple times. |
None |
SSN |
Social security number. |
None |
Email Address |
E-mail address. |
None |
Phone Number |
Telephone number. |
None |
Name/Firm Extra |
Extra information associated with the firm or personal name. |
None |
Person |
First name, middle name, and last name. Can be used multiple times. |
Person |
First Name |
The first name found in the input name. Can be used multiple times. |
Person |
Middle Name |
Middle name or initial. Use this for a single middle name, or for the first of several middle names; for example, 'May' in Ethel May Roberta Louise Mertz. Can be used multiple times. |
Person |
Middle Name 2 |
Second middle name; for example, 'Roberta' in Ethel May Roberta Louise Mertz. Can be used multiple times. |
Person |
Middle Name 3 |
Third middle name; for example, 'Louise' in Ethel May Roberta Louise Mertz. Can be used multiple times. |
Person |
Last Name |
Last name or surname. Can be used multiple times. |
Derived |
Gender |
Probable gender:
Can be used multiple times. |
Derived |
Person Count |
Number of persons the record references; for example, a record with a Person name of 'John and Jane Doe' has a Person Count of 2. |
Business |
Firm Name |
Name of the company or organization, including divisions. Can be used multiple times. |
Business |
Firm Count |
Number of firms referenced in the record. Can be used multiple times. |
Business |
Firm Location |
Location within a firm; for example, Accounts Payable |
Table 21-20 describes the Address output components. In records with dual addresses, you can specify which line is used as the Normal Address (and thus assigned to the Address component) and which is used as the Dual Address for many output components, as noted in the table.
Table 21-20 Address Output Components
Subfolder | Output Component | Description |
---|---|---|
None |
Address |
Full address line, including:
Can be used as the Normal Address or the Dual Address. |
None |
Primary Address |
Box, route, or street address, including:
Does not include the |
Primary Address |
Street Number |
Number that identifies the address, such as a house or building number, sometimes referred to as the primary range. For example, in 200 Oracle Parkway, the |
Primary Address |
Pre Directional |
Street directional indicator appearing before the street name; for example, in 100 N University Drive, the |
Primary Address |
Street Name |
Name of street. Can be used as the Normal Address or the Dual Address. |
Primary Address |
Primary Name 2 |
Second street name, often used for addresses at a street intersection. |
Primary Address |
Street Type |
Street identifier; for example, ST, AVE, RD, DR, or HWY. Can be used as the Normal Address or the Dual Address. |
Primary Address |
Post Directional |
Street directional indicator appearing after the street name; for example, in 100 15th Ave. S., the |
None |
Secondary Address |
The second part of the street address, including:
For example, in a secondary address of Suite 2100, |
Secondary Address |
Unit Designator |
Type of secondary address, such as APT or STE. For example, in a secondary address of Suite 2100, |
Secondary Address |
Unit Number |
A number that identifies the secondary address, such as the apartment or suite number. For example, in a secondary address of Suite 2100, |
Secondary Address |
Non-postal Secondary Address |
A secondary address that is not in official postal format. |
Secondary Address |
Non-postal Unit Designator |
A unit designator that is not in official postal format. |
Secondary Address |
Non-postal Unit Number |
A unit number that is not in official postal format. |
Address |
Last Line |
Final address line, including:
|
Last Line |
Neighborhood |
Neighborhood or barrio, common in South and Latin American addresses. |
Last Line |
City |
Name of city. The U.S. city names may be converted to United States Postal Service preferred names. |
Last Line |
City Abbreviated |
Abbreviated city name, composed of 13 characters for the United States. |
Last Line |
City Abbreviated 2 |
Alternative abbreviation for the city name. |
Last Line |
Alternate City |
An alternate name for a city that may be referenced by more than one name. In the United States, a city may be referenced by its actual name or the name of a larger urban area. For example, Brighton Massachusetts may have Boston as an alternate city name. |
Last Line |
Locality Code |
The last three digits of the International Mailsort Code, which represents a geographical region or locality within each country. Locality Codes are numeric in the range 000-999. |
Last Line |
Locality Name |
In the United Kingdom, the following address is assigned Locality Name KNAPHILL: Chobham Rd Knaphill Woking GU21 2TZ |
Last Line |
Locality 2 |
The ward (ku) in Japan. |
Last Line |
Locality 3 |
The district (machi) or village (mura) in Japan. |
Last Line |
Locality 4 |
The subdistrict (aza, bu, chiwari, or sen) in Japan. |
Last Line |
County Name |
The name of a county in the United Kingdom, United States, or other country. |
Last Line |
State |
Name of state or province. |
Last Line |
Postal Code |
Full postal code with spaces and other non-alphanumeric characters removed. |
Last Line |
Postal Code Formatted |
Formatted version of postal code that includes spaces and other non-alphanumeric characters, such as dashes. |
Last Line |
Delivery Point |
A designation used in the United States and Australia.
|
Last Line |
Country Code |
The ISO 3166-1993 (E) two-character country code, as defined by the International Organization for Standardization; for example, 'US' for United States or 'CA' for Canada. |
Last Line |
Country Code 3 |
The ISO 3166-1993 (E) three-character country code, as defined by the International Organization for Standardization; for example, 'USA' for United States, 'FRA' for France, or 'UKR' for Ukraine. |
Last Line |
Country Name |
The full country name. |
Address |
Address 2 |
A second address line, typically used for Hong Kong addresses that have both a street address and a building or floor address. |
Address |
Last Line 2 |
Additional information that appears at the end of an address in Japan. |
Other Address Line |
Box Name |
The name for a post office box address; for example, for 'PO Box 95', the Box Name is 'PO BOX'. Can be used as the Normal Address or the Dual Address. |
Other Address Line |
Box Number |
The number for a post office box address; for example, for 'PO Box 95', the Box Number is '95'. Can be used as the Normal Address or the Dual Address. |
Other Address Line |
Route Name |
Route name for a rural route address. For an address of 'Route 5 Box 10', the Route Name is 'RTE' (a standardization of 'Route'). Can be used as the Normal Address or the Dual Address. |
Other Address Line |
Route Number |
Route number for a rural route address. For an address of 'Route 5 Box 10', the Route Number is '5'. Can be used as the Normal Address or the Dual Address. |
Other Address Line |
Building Name |
Building name, such as 'Cannon Bridge House'. Building names are common in the United Kingdom. |
Other Address Line |
Complex |
Building, campus, or other complex. For example, USS John F. Kennedy Shadow Green Apartments Cedarvale Gardens Concordia College You can use an the Instance field in the Output Components dialog to specify which complex should be returned in cases where an address has more than one complex. |
Other Address Line |
Miscellaneous Address |
Miscellaneous address information. In records with multiple miscellaneous fields, you can extract them by specifying which instance to use in the Output Components page. |
Geography |
Latitude |
Latitude in degrees north of the equator: Positive for north of the equator; negative for south (always positive for North America). |
Geography |
Longitude |
Longitude in degrees east of the Greenwich Meridian: positive for east of GM; negative for west (always negative for North America). |
Geography |
Geo Match Precision |
Indicates how closely the location identified by the latitude and longitude matches the address. |
Twenty components are open for vendor-specified uses.
Table 21-21 describes the Error Status output components. Refer to "Handling Errors in Name and Address Data" for usages notes on the Error Status components.
Table 21-21 Error Status Output Components
Subfolders | Output Component | Description |
---|---|---|
Name and Address |
Is Good Group |
Indicates whether the name group, address group, or name and address group was processed successfully.
Using this flag in conjunction with another flag, such as the |
Name and Address |
Is Parsed |
Indicates whether the name or address was parsed:
Check the status of warning flags such as |
Name and Address |
Parse Status |
Postal matching software parse status code. |
Name and Address |
Parse Status Description |
Text description of the postal matching software parse status. |
Name Only |
Is Good Name |
Indicates whether the name was parsed successfully:
|
Name Only |
Name Warning |
Indicates whether the parser found unusual or possibly erroneous data in a name:
|
Address Only |
Is Good Address |
Indicates whether the address was processed successfully:
Use this component when you have a mix of records from both postal-matched and non-postal-matched countries. |
Address Only |
Is Found |
Indicates whether the address is listed in the postal matching database for the country indicated by the address:
This flag is true only if all of the other 'Found' flags are true. If postal matching is available, this flag is the best indicator of record quality. |
Address Only: Is Found |
City Found |
T = The postal matcher found the city; otherwise, F. |
Address Only: Is Found |
Street Name Found |
T = The postal matcher found the street name; otherwise, F. |
Address Only: Is Found |
Street Number Found |
T = The postal matcher found the street number within a valid range of numbers for the named street, otherwise, F. |
Address Only: Is Found |
Street Components Found |
T = The postal matcher found the street components, such as the Pre Directional or Post Directional; otherwise, F. |
Address Only: Is Found |
Non-ambiguous Match Found |
Indicates whether the postal matcher found a matching address in the postal database:
|
Address Only |
City Warning |
T = The parser found unusual or possibly erroneous data in a city; otherwise, F. |
Address Only |
Street Warning |
T = The parser found unusual or possibly erroneous data in a street address otherwise, F. |
Address Only |
Is Address Verifiable |
T = Postal matching is available for the country of the address; otherwise, F. F does not indicate whether or not a postal matching database is installed for the country in the address. It only indicates that matching is not available for a particular address. |
Address Only |
Address Corrected |
Indicates whether the address was corrected in any way during matching. Standardization is not considered correction in this case.
|
Address Only: Address Corrected |
Postal Code Corrected |
T = The postal code was corrected during matching, possibly by the addition of a postal extension; otherwise, F. |
Address Only: Address Corrected |
City Corrected |
T = The city name was corrected during matching; otherwise, F. Postal code input is used to determine the city name preferred by the postal service. |
Address Only: Address Corrected |
Street Corrected |
T = The street name was corrected during matching; otherwise, F. Some correct street names may be changed to an alternate name preferred by the postal service. |
Address Only: Address Corrected |
Street Components Corrected |
T = One or more street components, such as |
Address Only |
Address Type |
Type of address. The following are common examples; actual values vary with vendors of postal matching software:
|
Address Only |
Parsing Country |
Country parser that was used for the final parse of the record. |
Table 21-22 describes the output components that are specific to a particular country.
Table 21-22 Country-Specific Output Components
Subfolder | Output Component | Description |
---|---|---|
United States |
ZIP5 |
The five-digit United States postal code. |
United States |
ZIP4 |
The four-digit suffix that is added to the five-digit United States postal code to further specify location. |
United States |
Urbanization Name |
Urban unit name used in Puerto Rico. |
United States |
LACS Flag |
T = Address requires a LACS conversion and should be submitted to a LACS vendor; otherwise, F. The Locatable Address Conversion System (LACS) provides new addresses when a 911 emergency system has been implemented. 911 address conversions typically involve changing rural-style addresses to city-style street addresses, but they may involve renaming or renumbering existing city-style addresses. |
United States |
CART |
Four-character USPS Carrier route. |
United States |
DPBC Check Digit |
Check digit for forming a delivery point bar code. |
United States |
Automated Zone Indicator |
T = The mail in this zip code is sorted by bar code sorting equipment; otherwise, F. |
United States |
Urban Indicator |
T = An address is located within an urban area; otherwise, F. |
United States |
Line of Travel |
United States Postal Service (USPS) line of travel |
United States |
Line of Travel Order |
United States Postal Service (USPS) line of travel order |
United States: Census/Geography |
Metropolitan Statistical Area |
Metropolitan Statistical Area (MSA) number. For example, '0000' indicates that the address does not lie within any MSA, and typically indicates a rural area. |
United States: Census/Geography |
Minor Census District |
Minor Census District. |
United States: Census/Geography |
CBSA Code |
A 5-digit Core Based Statistical Area code that identifies metropolitan and micropolitan areas. |
United States: Census/Geography |
CBSA Descriptor |
Indicates whether the CBSA is metropolitan (population of 50,000 or more) or micropolitan (population of 10,000 to 49,999). |
United States: Census/Geography |
FIPS Code |
The complete (state plus county) code assigned to the county by the Federal Information Processing Standard (FIPS). Because FIPS county codes are unique within a state, a complete FIPS Code includes the two-digit state code followed by the three-digit county code. |
United States: Census/Geography |
FIPS County |
The three-digit county code as defined by the Federal Information Processing Standard (FIPS). |
United States: Census/Geography |
FIPS Place Code |
The five-digit place code as defined by the Federal Information Processing Standard (FIPS). |
United States: Geography |
Census ID |
United States Census tract and block-group number. The first six digits are the tract number; the final digit is the block-group number within the tract. These codes are used for matching to demographic-coding databases. |
Canada |
Installation Type |
A type of Canadian postal installation:
For example, for the address, 'PO Box 7010, Scarborough ON M1S 3C6,' the Installation Type is 'STN'. |
Canada |
Installation Name |
Name of a Canadian postal installation. For example, for the address, 'PO Box 7010, Scarborough ON M1S 3C6,' the Installation Name is 'AGINCOURT'. |
Hong Kong |
Delivery Office Code |
A mailing code used in Hong Kong. For example, the following address is assigned the Delivery Office Code 'WCH': Oracle 39/F The Lee Gardens 33 Hysan Ave Causeway Bay |
Hong Kong |
Delivery Beat Code |
A mailing code used in Hong Kong. For example, the following address is assigned the Delivery Beat Code 'S06': Oracle 39/F The Lee Gardens 33 Hysan Ave Causeway Bay |
Postal reporting applies only to countries that support address correction and postal matching. Country certification varies with different vendors of name and address cleansing software. The most common country certifications are United States, Canada, and Australia. The process provides mailers a common platform to measure the quality of address-matching software, focusing on the accuracy of postal codes (in the case of the United States, of five-digit ZIP Codes and ZIP+4 Codes), delivery point codes, and carrier route codes applied to all mail. Some vendors of name and address cleansing software may ignore these parameters and require external setup for generating postal reports. For more information, see "About Postal Reporting".
Postal Report
Select Yes for a postal report for the Primary Country you chose in the Definitions Page page. Only one postal report can be active.
Processor Name
The use of this field varies with vendors of name and address cleansing software. Typically, this value appears on the United States Coding Accuracy Support System (CASS) report.
List Name
An optional reference field that appears on the United States and United Kingdom reports under the List Name section, but is not included in other reports. The list name provides a reference for tracking multiple postal reports; for example, 'July 2005 Promotional Campaign'.
Processor Address Lines
These address lines may appear on various postal reports. Various name and address cleansing software vendors use these fields differently. They often contain the full address of your company.
All address lists used to produce mailings for discounted automation postal rates must be matched by postal report-certified software. Certifications depend on the third-party vendors of name and address software and data. The certifications may include the following:
United States Postal Service: Coding Accuracy Support System (CASS)
Canada Post: Software Evaluation and Recognition Program (SERP)
Australia Post: Address Matching Approval System (AMAS)
The Coding Accuracy Support System (CASS) was developed by the United States Postal Service (USPS) in cooperation with the mailing industry. The system provides mailers a common platform to measure the quality of address-matching software, focusing on the accuracy of five-digit ZIP Codes, ZIP+4 Codes, delivery point codes, and carrier route codes applied to all mail. All address lists used to produce mailings for automation rates must be matched by CASS-certified software.
To meet USPS requirements, the mailer must submit a CASS report in its original form to the USPS.
Canada Post developed a testing program called Software Evaluation and Recognition Program (SERP), which evaluates software packages for their ability to validate, or validate and correct, mailing lists to Canada Post requirements. Postal programs that meet SERP requirements are listed on the Canada Post Web site.
Canadian postal customers who use Incentive Lettermail, Addressed Admail, and Publications Mail must meet the Address Accuracy Program requirements. Customers can obtain a Statement of Accuracy by comparing their databases to Canada Post's address data.
The Address Matching Approval System (AMAS) was developed by Australia Post to improve the quality of addressing. It provides a standard by which to test and measure the ability of address-matching software to:
Correct and match addresses against the Postal Address File (PAF)
Append a unique Delivery Point Identifier (DPID) to each address record, which is a step toward barcoding mail.
AMAS allows companies to develop address matching software which:
Prepares addresses for barcode creation
Ensures quality addressing
Enables qualification for discounts on PreSort letter lodgements
PreSort Letter Service prices are conditional upon customers using AMAS Approved Software with Delivery Point Identifiers (DPIDs) being current against the latest version of the PAF.
A declaration that the mail was prepared appropriately must be made when using the Presort Lodgement Document, available from post offices.
An external Name and Address server provides an interface between Oracle Database and third-party name and address processing libraries. This section discusses methods of configuring, starting, and stopping the Name and Address Server.
The Name and Address operator generates PL/SQL code, which calls the UTL_NAME_ADDR
package installed in the Runtime Schema. A private synonym, NAME_ADDR
, is defined in the target schema to reference the UTL_NAME_ADDR
package. The UTL_NAME_ADDR
package calls Java packages, which send processing requests to an external Name and Address server, which then interfaces with third-party Name and Address processing libraries, such as Trillium.
You can use the server property file, NameAddr.properties
, to configure server options. This file is located in owb/bin/admin
under the Oracle home of your Oracle Warehouse Builder server-side installation. The following code illustrates several important properties with their default settings.
TraceLevel=0 SocketTimeout=180 ClientThreads=4 Port=4040
The TraceLevel
property is often changed to perform diagnostics on server communication and view output from the postal matching program parser. Other properties are rarely changed.
TraceLevel: Enables output of file NASvrTrace.log
in the owb/bin/admin
folder. This file shows all incoming and outgoing data, verifies that your mapping is communicating with the Name and Address Server, and that the Name and Address Server is receiving output from the service provider. The trace log shows all server input and output and is most useful for determining whether any parsing requests are being made by an executing mapping. Set TraceLevel=1
to enable logging. However, tracing degrades performance and creates a large log file. Set TraceLevel=0
to disable logging for production.
SocketTimeOut: Specifies the number of seconds the name/address server will wait for a parsing request before closing the connection. You can increase this time to 1800 (30 minutes) when running concurrent mappings to prevent timing out.
ClientThreads: Specifies the number of threads used to service client connections. One client connection is made for each database session or slave session if a map is parallelized. Most maps are parallelized, and the number of parallel processes is proportional to the number of processors. On a single processor computer, two parallel processes are spawned for large maps. On a four processor computer, up to eight processes may be spawned. Parallelism may also be controlled by database initialization settings such as Sessions.
For the best performance, set ClientThreads to the maximum number of clients that will be connected simultaneously. The actual number of connected clients is recorded in NASvr.log
after a map run. You should increase the value of ClientThreads when the number of client connections shown in the log is greater.
When the number of clients exceeds the number of threads, all clients are still serviced because the threads are shared among clients.
Port: Specifies the port on which the server listens and was initially assigned by the installer. This value may be changed if the default port conflicts with another process. If the port is changed, the port attribute must also be changed in the runtime_schema
.nas_connection
table to enable the utl_name_addr
package to establish a connection.
Whenever you edit the properties file or perform table maintenance, you must stop and restart the Name and Address Server for the changes to take effect.
To manually stop the Name and Addresss Server:
In Windows, run OWB_ORACLE_HOME
/owb/bin/win32/NAStop.bat
.
In UNIX, run OWB_ORACLE_HOME
/owb/bin/unix/NAStop.sh
.
You can automatically restart the Name and Address Server by invoking a mapping in Warehouse Builder. You can also restart the server manually.
To manually restart the Name and Address Server:
In Windows, run OWB_ORACLE_HOME
/owb/bin/win32/NAStart.bat
.
In UNIX, run OWB_ORACLE_HOME
/owb/bin/unix/NAStart.sh
.