Oracle® Healthcare Master Person Index Standardization Engine Reference Release 1.1 Part Number E18471-01
This chapter provides conceptual information and procedures for setting up patterns-based address configuration and patterns-based business name configuration.
By default, address standardization is performed using the patterns-based framework. Processing street addresses involves parsing, normalizing, data typing, and using advanced patterns rules to map the address fields with their corresponding types, prior to matching. The following sections describe the configuration files that define address processing logic and provide instructions for modifying mefa.xml for processing address fields.
Processing data using the Address data type includes both standardizing and matching on free-form address fields. The OHMPI Standardization Engine can create the parsed, normalized, and typed values for address data. These values are needed for accurate searching and matching on address data. You can implement street address standardization and matching on its own, or within an application designed to process person or business information. Standardizing address information allows you to include address fields as search criteria, even though matching might not be performed against these fields.
Several configuration files are designed specifically to handle address data and define processing logic for the standardization process. These include address clues files, a patterns file, and a constants file. The United States address standardization engine is based on work performed at the US Census Bureau. The clues files, in particular, are based on Census Bureau statistics.
Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for address data along with the standardization component they represent. You can only specify the predefined field tokens that are listed in this table for addresses unless you create a new data type or variant.
Table 4-1 Address Data Tokens
Token | Description |
---|---|
BoxDescript | Represents the P.O. box type from a standardized address field. By default, this is stored in the field_name_StName field in a master person index database. |
BoxIdentif | Represents the parsed P.O. box number from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database. |
NeighborhoodName | Represents the parsed structure street's block or neighborhood description from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed. |
NeighborhoodType | Represents the parsed structure street's block or neighborhood identifier from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed. |
HouseNumber | Represents the parsed house number from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database. |
HouseNumPrefix | Represents the parsed house number prefix from a standardized address field (such as the “A” in “A 1587 4th Street”). This address component is not included in the default master person index standardization structure, but you can add it if needed. |
HouseNumSuffix | Represents the parsed house number suffix from a standardized address field (such as the “B” in “5900 B Arnett Avenue”). This address component is not included in the default master person index standardization structure, but you can add it if needed. |
MatchPropertyName | Represents the parsed match property name from a standardized address field and is an alternative representation of the field used by the standardization engine for blocking and phonetic encoding. This address component is not included in the default master person index standardization structure, but you can add it if needed. |
MatchStreetName | Represents the parsed and standardized street name from a standardized address field and is an alternative representation of the field used by the standardization engine. If you want to store the standardized street name in the database (recommended), map this field to the street name field in the database. By default, this is stored in the field_name_StName field in a master person index database. |
OrigPropertyName | Represents the parsed original property name (such as the name of a complex or business park) from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed. |
PropDesPrefDirection | Represents the parsed property direction from a standardized address field. This field ID handles cases where the direction is a prefix to the property description. By default, this is stored in the field_name_StDir field in a master person index database. |
PropDesPrefType | Represents the parsed property type from a standardized address field. This field ID handles cases where the street type is a prefix to the property description. By default, this is stored in the field_name_StType field in a master person index database. |
PropertySufDirection | Represents the parsed property direction from a standardized address field. This field ID handles cases where the direction is a suffix to the property description. By default, this is stored in the field_name_StDir field in a master person index database. |
PropertySufType | Represents the parsed property type from a standardized address field. This field ID handles cases where the street type is a suffix to the property description. By default, this is stored in the field_name_StType field in a master person index database. |
RuralRouteDescript | Represents the parsed rural route description from a standardized address field. By default, this is stored in the field_name_StName field in a master person index database. |
RuralRouteIdentif | Represents the parsed rural route identifier from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database. |
SecondHouseNumber | Represents the parsed second house number from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed. |
SecondHouseNumberPrefix | Represents the parsed second house number prefix from a standardized address field (such as “25” in “25 319 10th Ave.”). This address component is not included in the default master person index standardization structure, but you can add it if needed. |
SecondStreetNameSufDirection | Represents the parsed second street direction from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
SecondStreetNameSufType | Represents the parsed second street type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
OrigSecondStreetName | Represents the parsed second street name from a standardized address field (for example, an address might include a cross-street or a thoroughfare and dependent thoroughfare). This address component is not included in the default master person index standardization structure, but you can add it if needed. |
OrigStreetName | Represents the parsed street name from an address field. If you want to store the original street name in the database, map this field to the street name field in the database. This address component is not included in the default standardization structure, but you can add it if needed. |
StreetNamePrefDirection | Represents the parsed street direction from a standardized address field. This field ID handles cases where the direction is a prefix to the street name. By default, this is stored in the field_name_StDir field in a master person index database. |
StreetNamePrefType | Represents the parsed street type from a standardized address field. This field ID handles cases where the street type is a prefix to the street name. By default, this is stored in the field_name_StType field in a master person index database. |
StreetNameSufDirection | Represents the parsed street direction from a standardized address field. This field ID handles cases where the direction is a suffix to the street name. By default, this is stored in the field_name_StDir field in a master person index database. |
StreetNameSufType | Represents the parsed street type from a standardized address field. This field ID handles cases where the street type is a suffix to the street name. By default, this is stored in the field_name_StType field in a master person index database. |
StreetNameExtensionIndex | Represents the parsed street name extension from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
WithinStructDescript | Represents the parsed internal descriptor (such as “Floor”) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
WithinStructIdentif | Represents the parsed internal identifier (such as a floor number) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
CityName | Represents a city name, within a state or a county, from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
CityDescriptor | Represents a city's description type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
PostalCode | Represents the location postal code type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
StateName | Represents a given country's state name from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
CountryName | Represents a given state's county name from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
CountryCode | Represents a 3-letter ISO country code from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. |
ExtraInfo | Represents any extra information that was not included in any of the other parsed components. This address component is not included in the default standardization structure, but you can add it if needed. |
Note:
CityName, CityDescriptor, PostalCode, StateName, CountryName, and CountryCode are new token types. They are implemented in the Mexico standardization locale for this release. They will be available in the United States, United Kingdom, Australia, and France standardization implementations in future releases.
Three configuration files define address processing logic for the OHMPI Standardization Engine. These files provide information about address patterns and tokens to help the standardization engine determine how to recognize address components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data.
The address configuration files are located in the resource folder under each variant name for the Address data type. The following topics provide information about each configuration file.
The address clues file (clues.dat) lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types. A relevance value specifies which of the component types the term is most likely to be. For example, the term “Junction” is standardized as “Jct” and is classified as a street type, building unit, and generic term (giving relevance in that order).
This file helps the OHMPI Standardization Engine recognize common terms in street addresses in order to parse and normalize the values correctly. The syntax of this file is:
common-term normalized-term ID-number/type-token
You can modify or add entries in this table as needed. The following table describes the columns in the address clues file.
Table 4-2 Address Clues File Columns
Column | Description |
---|---|
common-term | A term commonly found in street addresses. |
normalized-term | The normalized version of the common term. |
ID-number/type-token | An ID number and a token indicating the type of address component represented by the common term. The ID number corresponds to an ID number in the address master clues file, and the type token corresponds to the type specified for that ID number in the address master clues file. One term might have several ID number and token type pairs. Their order of appearance indicates their relevance value. |
Following is an excerpt from the US address clues file.
TRLR VLG Trpk 59BU
TRPK Trpk 59BU
TRPRK Trpk 59BU
VILLA Vlla 305TY 60BU
VLLA Vlla 305TY 60BU
VILLAS Vlla 60BU
VILL Vlg 317TY 61BU 364AU
VILLAG Vlg 317TY 61BU 364AU
VLG Vlg 317TY 61BU 364AU
VILLAGE Vlg 317TY 61BU 364AU
VILLG Vlg 317TY 61BU 364AU
VILLIAGE Vlg 317TY 61BU 364AU
VLGE Vlg 317TY 61BU 364AU
VIVI Vivi 62BU
VIVIENDA Vivi 62BU
COLLEGE Coll 64BU 0AU
CLG Coll 64BU
COTTAGE Cott 65BU 65BP 0AU
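The entry layout described above can be illustrated with a minimal Python sketch. It is not part of the OHMPI Standardization Engine; the names `parse_clue_line` and `ClueEntry` are hypothetical. It assumes whitespace-separated fields, a single-word normalized term, and ID-number/type-token pairs made of digits followed by a two-character token (the shipped file may use fixed-width columns instead).

```python
import re
from typing import NamedTuple

# Assumed shape of an ID-number/type-token pair, for example "317TY" or "0AU".
PAIR = re.compile(r"^(\d+)([A-Z0-9]{2})$")


class ClueEntry(NamedTuple):
    common_term: str
    normalized_term: str
    types: list  # (id_number, type_token) pairs, most relevant first


def parse_clue_line(line: str) -> ClueEntry:
    """Split one clues entry into its three logical columns."""
    fields = line.split()
    pairs = []
    # Peel ID/token pairs off the end of the line.
    while fields and PAIR.match(fields[-1]):
        match = PAIR.match(fields.pop())
        pairs.append((int(match.group(1)), match.group(2)))
    pairs.reverse()  # restore file order, which encodes relevance
    if len(fields) < 2 or not pairs:
        raise ValueError(f"not a clues entry: {line!r}")
    # Assumption: the normalized term is one word; the rest is the common term.
    return ClueEntry(" ".join(fields[:-1]), fields[-1], pairs)


entry = parse_clue_line("VILL Vlg 317TY 61BU 364AU")
print(entry.common_term, entry.normalized_term, entry.types)
```

Because the pairs keep their file order, the first pair of each entry is the component type the term is most likely to represent.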
The address master clues file (masterClues.dat) lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom's Royal Mail, the Australian Postal Corporation, France's La Poste (depending on the variant in use), or Mexico's Postal Service. For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.
The syntax of this file is:
ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag
You can modify or add entries in this table as needed. The following table describes the columns in the address master clues file.
Table 4-3 Address Master Clue File Columns
Column | Description |
---|---|
ID-number | A unique identification number for the address common term. This number corresponds to an ID number for the same term in the address clues file. |
common-term | A common address term, such as Park, Village, North, Route, Centre, and so on. |
normalized-term | The normalized version of the common term. |
short-abbrev | A short abbreviation of the common term. |
postal-abbrev | The standard postal abbreviation of the common term. This is less used in other locales. |
CFCCS | The census feature class code of the term (as defined in the Census Tiger® database). |
type-token | The type of address component represented by the common term. Types are specified by an address token (for more information, see Address Type Tokens). |
usage-flag | A flag indicating how the term is used (for more information, see "Pattern Classes"). |
postal-flag | The standard postal code for the term. This is less used in other countries or locales. |
Following is an excerpt from the US address master clues file.
11Alley Alley Al Aly A TY R U
12Alternate Route Alt Rte Alt Alt A TY R
15Arcade Arcade Arc Arc A TY R U
16Arroyo Arroyo Arryo ArryHA TY R
17Autopista Atpta Apta AptaA TY R
18Avenida Avenida Ava Ava A TY R
19Avenue Avenue Ave Ave A TY R U
26Boulevard Blvd Blvd BlvdA TY R U
32Bulevar Blvr Blv Blv A TY R
33Business Route Bus Rte BusRt BsRtA TY R
34Bypass Bypass Byp Byp A TY R U
36Calle Calle Calle ClleA TY R
37Calleja Calleja Cja Cja A TY R
38Callejon Callej Cjon CjonA TY R
39Camino Camino Cam Cam A TY R
47Carretera Carrt Carr CarrA TY R
48Causeway Cswy Cswy CswyAH TY R U
51Center Center Ctr Ctr DA TY R U
The address patterns file (patterns.dat) defines the expected input patterns of each individual street address field being standardized so the Master Person Index Standardization Engine can recognize and process these values. Tokens indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags. Below is an example.
AU A1 TY 01 Oak B Street
NA NA ST T* 75 TX
When an address is parsed, each line of the address is delineated by a pipe (|) and sent to the parser separately. The output tokens for each line are then concatenated and the output pattern is processed using the address patterns file to determine whether the output pattern is listed in the file. If the pattern is found, output patterns are modified as indicated in the patterns file to resolve any ambiguities that might arise when two lines of address information contain common elements. The relative importance determines which pattern to use when the format of the input field matches more than one pattern. This file should only be modified by personnel with a thorough understanding of address patterns and tokens.
The syntax of this file is:
input-pattern example output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag
You can modify or add entries in this table as needed. The following table describes the columns in the address patterns file.
Table 4-4 Address Patterns File
Column | Description |
---|---|
input-pattern | Tokens that represent a possible input pattern from an individual unparsed street address field. Each token represents one component. For more information about address tokens, see "Address Type Tokens". |
example | An example of a street address that fits the specified pattern. This file element is optional. |
output-pattern | Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Master Person Index Standardization Engine. For more information about address tokens, see "Address Type Tokens". |
pattern-class | An indicator of the type of address component represented by the pattern. Possible pattern types are listed in "Pattern Classes". |
pattern-modifier | An indicator of whether the priority of the pattern is averaged against other patterns that match the input. Pattern modifiers are listed in "Pattern Modifiers". |
priority | The priority weight to use for the pattern when the pattern is a sub-pattern of a larger input pattern. For more information, see "Priority Indicators". |
usage-flag | A flag indicating how the term is used (for more information, see "Pattern Classes"). This file element is optional. |
exclude-flag | This file element is optional. |
The following are excerpts from the address patterns files.
NU NU FC TY // 123 8 1/2 street
HN NA NA ST H* 90
NU AU FC TY // 123 8th 1/2 street
HN NA NA ST H* 90
NU DR SA TY // 123 South Michigan Street
HN PD NA ST H* 95
NU DR TY NU DR // 123 South Avenida 1 West
HN PD PT NA SD H* 70

TY NU ND NU // Calle 6 No 1810
PT NA P1 HN H* 75
TY SA NU // Avenida Durango 15
PT NA HN H* 85
TY SC NU // Avenida Tijuana 35
PT NA HN H* 85
TY NU DM NU // AV. 5 DE FEBRERO 2125
PT NA NA HN H* 85
TY AU NU DR // Paseo Alcalde 1810 Norte
PT NA HN SD H* 85
CT ZP SC SA CC // TLALPAN 14330 TLALPAN DISTRITO FEDERAL, Mexico
CT ZP CN SN CC S* 96
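To make the two-row layout concrete, the following Python sketch splits one pattern entry into its columns. It is illustrative only and is not the engine's parser; `parse_pattern_entry` and `PatternEntry` are hypothetical names. It assumes the first row separates the input pattern from the example with `//`, as in the excerpts above, and that the second row is whitespace-separated.

```python
import re
from typing import NamedTuple

# A pattern class letter followed by a pattern modifier, for example "H*" or "T+".
CLASS_AND_MODIFIER = re.compile(r"^([HBWTRPNS])([*+])$")


class PatternEntry(NamedTuple):
    input_pattern: list
    example: str
    output_pattern: list
    pattern_class: str
    modifier: str
    priority: int
    flags: list  # optional usage and exclude flags, if present


def parse_pattern_entry(row1: str, row2: str) -> PatternEntry:
    """Parse the input row and the output row of one pattern entry."""
    # Assumption: "//" separates the input tokens from the example text.
    tokens, _, example = row1.partition("//")
    fields = row2.split()
    # The class/modifier field marks the end of the output pattern.
    idx = next((i for i, f in enumerate(fields)
                if CLASS_AND_MODIFIER.match(f)), None)
    if idx is None or idx + 1 >= len(fields):
        raise ValueError(f"missing class, modifier, or priority: {row2!r}")
    match = CLASS_AND_MODIFIER.match(fields[idx])
    return PatternEntry(
        input_pattern=tokens.split(),
        example=example.strip(),
        output_pattern=fields[:idx],
        pattern_class=match.group(1),
        modifier=match.group(2),
        priority=int(fields[idx + 1]),
        flags=fields[idx + 2:],
    )


entry = parse_pattern_entry("NU DR SA TY // 123 South Michigan Street",
                            "HN PD NA ST H* 95")
print(entry.output_pattern, entry.pattern_class, entry.priority)
```

In each excerpt entry the output pattern has one token per input token, so a simple length comparison of `input_pattern` and `output_pattern` is a useful sanity check when editing the file.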
The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.
The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the OHMPI Standardization Engine does not recognize custom tokens.
The following table lists and describes each input token.
Table 4-5 Input Address Pattern Type Tokens
Token | Description |
---|---|
A1 | Alphabetic value, one character in length |
AM | Ampersand |
AU | Generic word |
BP | Building property |
BU | Building unit |
BX | Post office box |
CC | Country name abbreviation (3-letter ISO code) |
CD | City descriptor |
CT | City name |
DA | Dash (as a starting character) |
DR | Street direction |
EI | Extra information |
EX | Extension |
FC | Numeric fraction |
HR | Highway route |
MP | Mile posts |
NL | Common words, such as “of”, “the”, and so on |
NU | Numeric value |
OT | Ordinal type |
PT | Prefix type |
RR | Rural route |
SA | State name |
SC | County name |
TY | Street type |
WD | Descriptor within the structure |
WI | Identifier within the structure |
ZP | Postal code |
The following table lists and describes each output token.
Table 4-6 Output Address Pattern Tokens
Token | Description |
---|---|
1P | Building number prefix |
2P | Second building number prefix |
BD | Property or building directional suffix |
BI | Structure (building) identifier |
BN | Property or building name |
BS | Building number suffix |
BT | Property or building type suffix |
BX | Post office box descriptor |
BY | Structure (building) descriptor |
CC | Country name abbreviation (3-letter ISO code) |
CD | City descriptor |
CN | County name |
CT | City name |
DB | Property or building directional prefix |
EI | Extra information |
EX | Extension index |
H1 | First house number (the actual number) |
H2 | Second house number (house number suffix) |
HN | House number |
HS | House number suffix |
N2 | Second street name |
NA | Street name |
NB | Building number |
NL | Conjunctions that connect words or phrases in one component type (usually the street name) |
P1 | House number prefix |
P2 | Second house number prefix |
PD | Directional prefix to the street name |
PT | Street type prefix to the street name |
RR | Rural route descriptor |
RN | Rural route identifier |
S2 | Street type suffix to the second street name |
SD | Directional suffix to the street name |
SN | State name |
ST | Street type suffix to the street name |
TB | Property or building type prefix |
WI | Identifier within the structure |
WD | Descriptor within the structure |
XN | Post office box identifier |
ZP | Postal code |
Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.
H - the address pattern represents a house
B - the address pattern represents a building
W - the address pattern represents a unit within a structure, such as an apartment or suite number
T - the address pattern represents a street type or direction
R - the address pattern represents a rural route
P - the address pattern represents a Post Office box
N - the address pattern is mostly numeric
S - the address pattern represents country, state, or county class
These classes are also specified as usage flags in the patterns file and the master clues file.
Each pattern class must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns are found to be sub-patterns of a larger input pattern. In this case, the OHMPI Standardization Engine must know how to prioritize each defined pattern that is a part of the larger pattern. There are two pattern modifiers.
* - An asterisk indicates that the priority weight for the matching pattern is averaged down equally with the other matching sub-patterns.
+ - A plus sign indicates that the priority weight for the matching pattern is not averaged down equally with the other matching sub-patterns.
The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best when defined as a multiple of five between and including 35 and 95. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the standardization engine stops searching for additional matching patterns and uses the high-priority matching pattern.
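The early-stop behavior for high priorities can be sketched as follows. This is a simplified illustration, not the engine's algorithm: it ignores the averaging applied by the `*` modifier, and `choose_pattern`, `Candidate`, and the threshold constant are hypothetical names. Treating every priority of 90 or higher as a stop value is an assumption based on the statement about priorities 90 and 95.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    output_pattern: str
    priority: int


# Assumption: any priority at or above this value ends the search.
STOP_PRIORITY = 90


def choose_pattern(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the first high-priority match, else the highest priority seen."""
    best = None
    for candidate in candidates:
        if candidate.priority >= STOP_PRIORITY:
            return candidate  # stop searching for additional patterns
        if best is None or candidate.priority > best.priority:
            best = candidate
    return best


matches = [Candidate("PT NA HN", 85), Candidate("HN PD NA ST", 95),
           Candidate("HN NA NA ST", 90)]
print(choose_pattern(matches))
```

Here the search ends at the priority 95 candidate and never examines the third one, which is why very high priorities should be reserved for unambiguous patterns.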
Master person index applications rely on the OHMPI Standardization Engine to process address data. To ensure correct processing of address information, you need to customize the Matching Service for the master person index application according to the patterns defined for the standardization engine. This includes modifying mefa.xml to define parsing and phonetic encoding of the appropriate fields. You can use the Master Person Index Configuration Editor to modify mefa.xml.
Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for standardization and normalization, modify the standardization structure in mefa.xml. To configure phonetic encoding, modify the phonetic encoding structure. You can perform all of these tasks using the Master Person Index Configuration Editor.
Generally, the address data type processes data that requires standardization prior to processing. You should not need to configure fields to normalize for addresses. The following topics provide information about the fields used in processing address data and how to configure address data standardization for a master person index application. The information provided in these sections is based on the default configuration.
When standardizing address data, not all fields in a record need to be processed by the OHMPI Standardization Engine. The standardization engine only needs to process address fields that might be used in the matching process. For a master person index application, these fields are defined in mefa.xml and processing logic for each field is defined in the Standardization Engine node configuration files.
The OHMPI Standardization Engine expects that street address data will be provided in a free-form text field containing several components that must be standardized (parsed, normalized, and typed). By default, the standardized street name is configured to be phonetically encoded. You can specify additional fields for phonetic encoding.
If you specify the Address match type for any field in the wizard, a standardization structure for that field is defined in mefa.xml. The fields listed under "Address Data Processing Fields" are automatically defined as the target fields. Each of these fields has several entries in the standardization structure because different parsed components can be stored in the same field. For example, the house number, post office box number, and rural route identifier are all stored in the house number field. If you do not specify address fields for matching in the wizard but want to standardize the fields, you can create a standardization structure in mefa.xml using the Master Person Index Configuration Editor.
The address fields specified for standardization are parsed into several additional fields. If you specify the Address match type in the wizard, the following fields are automatically added to the object structure and database creation script.
field_name_HouseNo
field_name_StName
field_name_StDir
field_name_StType
field_name_StPhon
where field_name is the name of the field for which you specified address matching. For example, if you specify the Address match type for the AddressLine1 field, the following fields are automatically added to the structure: AddressLine1_HouseNo, AddressLine1_StName, AddressLine1_StDir, AddressLine1_StType, and AddressLine1_StPhon.
You can add these fields manually if you do not specify a match type in the wizard.
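The naming rule is simple enough to express in a few lines of Python. The helper below is a hypothetical illustration of the rule, not a tool shipped with the product; only the five suffixes come from the list above.

```python
# Suffixes of the fields added for an Address match type (from the list above).
ADDRESS_SUFFIXES = ("HouseNo", "StName", "StDir", "StType", "StPhon")


def derived_address_fields(field_name: str) -> list:
    """Return the parsed and phonetic field names for one source field."""
    return [f"{field_name}_{suffix}" for suffix in ADDRESS_SUFFIXES]


print(derived_address_fields("AddressLine1"))
```

A list like this can serve as a checklist when you add the fields to the object structure by hand.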
For free-form address fields, the source fields you define for standardization should include the associated components that are predefined for parsing, normalization, and data typing. For example, fields containing address information can include any of the field components listed in Address Data Standardization Components. The target fields can include any of these parsed fields. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01) to define fields for standardization. For the standardization-type element, enter Address. For a list of field IDs to use in the standardized-object-field-id element, see Address Data Standardization Components.
Note:
In the default configuration, the rules defined for the address data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.
A sample standardization structure for address data is shown below. This structure parses the first two lines of street address into the standard street address fields. Only the United States variant is defined in this structure.
<free-form-texts-to-standardize>
  <group standardization-type="ADDRESS" domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
    <unstandardized-source-fields>
      <unstandardized-source-field-name>Person.Address[*].Address1</unstandardized-source-field-name>
      <unstandardized-source-field-name>Person.Address[*].Address2</unstandardized-source-field-name>
    </unstandardized-source-fields>
    <standardization-targets>
      <target-mapping>
        <standardized-object-field-id>HouseNumber</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].HouseNumber</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>RuralRouteIdentif</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].HouseNumber</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>BoxIdentif</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].HouseNumber</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>MatchStreetName</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetName</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>RuralRouteDescript</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetName</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>BoxDescript</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetName</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>PropDesPrefDirection</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetDir</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>PropDesSufDirection</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetDir</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>StreetNameSufType</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetType</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>StreetNamePrefType</standardized-object-field-id>
        <standardized-target-field-name>Person.Address[*].StreetType</standardized-target-field-name>
      </target-mapping>
    </standardization-targets>
  </group>
</free-form-texts-to-standardize>
When you match or standardize on street address fields, the street name should be specified for phonetic conversion (this is done by default in a master person index application). Follow the instructions under "Defining Phonetic Encoding for the Master Person Index" in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01) to define fields for phonetic encoding.
A sample of the phoneticize-fields element is shown below. This sample only converts the address street name. You can define additional fields for phonetic encoding.
<phoneticize-fields>
  <phoneticize-field>
    <unphoneticized-source-field-name>Person.Address[*].StreetName</unphoneticized-source-field-name>
    <phoneticized-target-field-name>Person.Address[*].StreetName_Phon</phoneticized-target-field-name>
    <encoding-type>NYSIIS</encoding-type>
  </phoneticize-field>
</phoneticize-fields>
By default, business name standardization is performed using the patterns-based framework. Processing business name fields involves parsing, normalizing, and phonetically encoding certain fields prior to matching. The following sections describe the configuration files that define business name processing logic and provide instructions for modifying mefa.xml for processing business names.
Processing data using the BusinessName data type includes both standardizing and matching on free-form business name fields. The OHMPI Standardization Engine can create the parsed, normalized, and phonetic values for business names. These values are needed for accurate searching and matching on business information. You can implement business name standardization and matching on its own, or within an application designed to process person information. Standardizing business name fields allows you to include these fields as search criteria, even though matching might not be performed against these fields.
The OHMPI Standardization Engine can create standardized and phonetic values for business name field components. Several configuration files are designed specifically to handle business names and to define additional logic for standardization and phonetic encoding. These include reference files, a patterns file, and key type files. The business name standardization files are contained in one generic variant.
Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for business names along with the standardization component they represent. You can only specify the predefined field tokens that are listed in this table for business names unless you create a new data type or variant.
Table 4-7 Business Name Tokens
Token | Description |
---|---|
PrimaryName | Represents the name parsed from a free-form text business name field. |
OrgTypeKeyword | Represents the organization type parsed from a free-form text business name field. |
AssocTypeKeyword | Represents the association type parsed from a free-form text business name field. |
IndustrySectorList | Represents the industry sector parsed from a free-form text business name field. |
IndustryTypeKeyword | Represents the industry type parsed from a free-form text business name field (industry type is a subset of the sector). |
AliasList | Represents the alias parsed from a free-form text business name field. |
Url | Represents the URL parsed from a free-form text business name field. |
Several configuration files are used to define business name processing logic for the OHMPI Standardization Engine. These files provide information about business name patterns and tokens to help the standardization engine determine how to recognize business name components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for business names.
The following topics describe each file used for business name standardization.
The adjectives key type file (bizAdjectivesTypeKeys.dat) defines adjectives commonly found in business names so the OHMPI Standardization Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.
You can modify or add entries in this file as needed. Following is an excerpt from the adjectives key type file.
DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
ELECTROCHEMICAL
ENGINEERED
EVOLUTIONARY
EXTENDED
FACTUAL
FEDERAL
The alias key type file (bizAliasTypeKeys.dat) lists business name acronyms and abbreviations along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.
alias standardized-name
The following table describes the columns in the alias key type file.
Table 4-8 Alias Key Type File
Column | Description |
---|---|
alias | An abbreviation or acronym commonly used in place of a specific business name. |
standardized-name | The normalized version of the alias name. |
Following is an excerpt from the alias key type file.
BBH         BARTLE BOGLE HEGARTY
BBH         BROWN BROTHERS HARRIMAN
IBM         INTERNATIONAL BUSINESS MACHINE
IDS         INCOMES DATA SERVICES
IDS         INSURANCE DATA SERVICES
IDS         THE INTEGRATED DECISION SUPPORT GROUP
IDS         THE INTERNET DATABASE SERVICE
CAL-TECH    CALIFORNIA INSTITUTE OF TECHNOLOGY
The association key type file (bizAssociationTypeKeys.dat) lists business association types along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.
association-type standardized-type
The following table describes the columns in the association key type file.
Table 4-9 Association Key Type File
Column | Description |
---|---|
association-type | A common association type for businesses, such as Partners, Group, and so on. |
standardized-type | The standardized version of the association type. If this column contains a name instead of a zero, that name must also be listed in a different entry as an association type with a standardized form of “0”. |
Following is an excerpt from the bizAssociationTypeKeys.dat file.
ASSOCIATES        0
BANCORP           0
BANCORPORATION    BANCORP
COMPANIES         0
GP                GROUP
GROUP             0
PARTNERS          0
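The rule in Table 4-9, that any name in the standardized-type column must also appear as its own entry with a standardized form of "0", can be checked before you deploy an edited file. The following Python sketch is not part of OHMPI. The function name `find_unresolved_types` is hypothetical, and the function takes a list of already-split column pairs so that it makes no assumption about how the file delimits its columns.

```python
def find_unresolved_types(entries):
    """Return standardized names that lack their own '0' entry.

    entries: list of (association_type, standardized_type) pairs,
    already split into columns by the caller.
    """
    # Types declared as already standardized (second column is "0").
    canonical = {name for name, std in entries if std == "0"}
    # Every non-zero standardized value must appear in the canonical set.
    return sorted({std for _, std in entries if std != "0" and std not in canonical})


# Pairs taken from the bizAssociationTypeKeys.dat excerpt.
excerpt = [
    ("ASSOCIATES", "0"),
    ("BANCORP", "0"),
    ("BANCORPORATION", "BANCORP"),
    ("COMPANIES", "0"),
    ("GP", "GROUP"),
    ("GROUP", "0"),
    ("PARTNERS", "0"),
]

print(find_unresolved_types(excerpt))  # []
# A hypothetical new entry whose standardized form has no "0" entry of its own.
print(find_unresolved_types(excerpt + [("ASSN", "ASSOCIATION")]))  # ['ASSOCIATION']
```

The same check applies to the industry and organization key type files, which state the same rule for their standardized-form columns.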
The general terms reference file (bizBusinessGeneralTerms.dat) lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the OHMPI Standardization Engine can recognize and process the business name.
This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the general terms reference file.
BUILDING
CITY
CONSUMER
EAST
EYE
FACTORY
LATIN
NORTH
SOUTH
The city or state key type file (bizCityorStateTypeKeys.dat) lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the standardization engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.
city-or-state type country
The following table describes the columns in the file.
Table 4-10 City or State Key Type File
Column | Description |
---|---|
city-or-state | The name of a city or state used in business names. |
type | An indicator of whether the value is a city or state. “CT” indicates city and “ST” indicates state. |
country | The country code of the country in which the city or state is located. |
Following is an excerpt from the city or state key type file.
ADELAIDE     CT    AU
ALABAMA      ST    US
ALASKA       ST    US
ALGIERS      CT    DZ
AMSTERDAM    CT    NL
ARIZONA      ST    US
ARKANSAS     ST    US
ASUNCION     CT    PY
ATHENS       CT    GR
The business former name reference file (bizCompanyFormerNames.dat) provides a list of common company names along with names by which the companies were formerly known so the standardization engine can recognize a business when processing a record containing a previous business name. You can add entries to the business former name table using the following syntax.
former-name current-name
The following table describes each column in the business former name reference file.
Table 4-11 Business Former Name Reference File
Column | Description |
---|---|
former-name | One of the company's previous names. |
current-name | The company's current name. |
Below is an excerpt from the business former name reference file.
HELLENIC BOTTLING         COCA-COLA HBC
INTERNATIONAL PRODUCTS    THE TERLATO WINE
ORGANIC FOOD PRODUCTS     SPECTRUM ORGANIC PRODUCTS
SUTTER HOME WINERY        TRINCHERO FAMILY ESTATES
The merged business name category file (bizCompanyMergerNames.dat) provides a list of companies whose name changed because of a merger along with the name of the company after the merge. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to recognize the current company name and determine the sector of the business. You can add entries to the business merger name file using the following syntax.
former-name/merged-name sector-code
The following table describes each column in the merged business name category file.
Table 4-12 Merged Business Name Category File
Column | Description |
---|---|
former-name | The name of the company whose name was not kept after the merger. |
merged-name | The name of the company whose name was kept after the merger. |
sector-code | The industry sector code of the business. Sector codes are listed in the bizIndustryCategoriesCode.dat file. |
Below is an excerpt from the merged business name category file.
DUKE/FLUOR DANIEL                  20005
FAULTLESS STARCH/BON AMI           09004
FIND/SVP                           10013
FIRST WAVE/NEWPARK SHIPBUILDING    27005
GUNDLE/SLT                         19020
HMG/COURTLAND                      23004
J BROWN/LMC                        10014
KORN/FERRY                         10020
LINSCO/PRIVATE LEDGER              14005
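To make the entry syntax concrete, the following Python sketch splits one entry into the three columns described in Table 4-12. It is an illustration only, not the standardization engine's parser. The function name `parse_merger_entry` is hypothetical, and the sketch assumes that the sector code is the last whitespace-separated token and that the first slash separates the two company names.

```python
def parse_merger_entry(line):
    """Split a 'former-name/merged-name sector-code' entry into its parts."""
    # The sector code is taken to be the final whitespace-separated token.
    names, sector_code = line.rsplit(None, 1)
    # Assumption: the first slash separates the former name from the merged name.
    former_name, merged_name = names.split("/", 1)
    return former_name.strip(), merged_name.strip(), sector_code


print(parse_merger_entry("FAULTLESS STARCH/BON AMI 09004"))
# ('FAULTLESS STARCH', 'BON AMI', '09004')
```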
The primary business name reference file (bizCompanyPrimaryNames.dat) provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.
primary-name sector-code
The following table describes the columns in the primary business name reference file.
Table 4-13 Primary Business Name Reference File
Column | Description |
---|---|
primary-name | The primary name of the company. |
sector-code | The industry sector code of the business. Sector codes are listed in the bizIndustryCategoriesCode.dat file. |
Below is an excerpt from the primary business name reference file.
BROTHER INTERNATIONAL           12006
BRYSTOL-MYERS SQUIBB            11005
BURLINGTON COAT FACTORY         24003
BURLINGTON NORTHERN SANTA FE    27005
BV SOLUTIONS                    06012
CABLEVISION                     26001
CABOT                           04006
CADENCE                         06010
CAMPBELL                        22006
CAPITAL BLUE CROSS              17001
The connector tokens reference file (bizConnectorTokens.dat) defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the standardization engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.
This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the connector tokens reference file.
AN
DE
DES
DOS
LA
LAS
LE
OF
THE
The country key type file (bizCountryTypeKeys.dat) lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the standardization engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.
country abbreviation nationality
The following table describes the columns in the country key type file.
Table 4-14 Country Key Type File
Column | Description |
---|---|
country | The name of a country or continent. |
abbreviation | The common abbreviation for the specified country. The abbreviation for a continent is always “CON”. |
nationality | The nationality assigned to a person or business originating in the specified country. |
Following is an excerpt from the country key type file.
AMERICA        CON    AMERICAN
AFRICA         CON    AFRICAN
EUROPE         CON    EUROPEAN
ASIA           CON    ASIAN
AFGHANISTAN    AF     AFGHAN
ALBANIA        AL     ALBANIAN
ALGERIA        DZ     ALGERIAN
The industry sector reference file (bizIndustryCategoryCode.dat) lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the standardization engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.
sector-code industry-sector
The following table describes each column in the industry sector reference file.
Table 4-15 Industry Sector Reference File
Column | Description |
---|---|
sector-code | The identification code of the specified sector. The first two digits of each code identify the general industry sector; the last three digits identify a sub-sector. |
industry-sector | A description of the industry category. This is written in the format “sector - sub-sector”, where sector is a general category of industry types, and sub-sector is a specific industry within that category. |
Following is an excerpt from the industry sector reference file.
02006 Automotive & Transport Equipment - Recreational Vehicles
02007 Automotive & Transport Equipment - Shipbuilding & Related Services
02008 Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
03001 Banking - Banking
04001 Chemicals - Agricultural Chemicals
04002 Chemicals - Basic & Intermediate Chemicals & Petrochemicals
04003 Chemicals - Diversified Chemicals
04004 Chemicals - Paints, Coatings & Other Finishing Products
04005 Chemicals - Plastics & Fibers
04006 Chemicals - Specialty Chemicals
05001 Computer Hardware - Computer Peripherals
05002 Computer Hardware - Data Storage Devices
05003 Computer Hardware - Diversified Computer Products
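The structure of a sector code and its description can be illustrated with a short Python sketch. This is not OHMPI code. The function name `split_sector_entry` and the dictionary keys are hypothetical, and the sketch assumes that the code and the description are separated by whitespace and that the first " - " in the description separates the sector from the sub-sector.

```python
def split_sector_entry(line):
    """Break a 'sector-code industry-sector' entry into its components."""
    code, description = line.split(None, 1)
    # First two digits: general sector; last three digits: sub-sector.
    sector_id, sub_sector_id = code[:2], code[2:]
    # The description uses the format "sector - sub-sector".
    sector, _, sub_sector = description.partition(" - ")
    return {
        "code": code,
        "sector_id": sector_id,
        "sub_sector_id": sub_sector_id,
        "sector": sector.strip(),
        "sub_sector": sub_sector.strip(),
    }


entry = split_sector_entry("04005 Chemicals - Plastics & Fibers")
print(entry["sector_id"], entry["sub_sector_id"])  # 04 005
print(entry["sector"], "/", entry["sub_sector"])   # Chemicals / Plastics & Fibers
```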
The industry key type file (bizIndustryTypeKeys.dat) is used to standardize the value of the Industry field into common industries to which businesses belong so the standardization engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.
industry-type standardized-form sectors
The following table describes each column in the industry key type file.
Table 4-16 Industry Key Type File
Column | Description |
---|---|
industry-type | The original value of the industry type in the input record. |
standardized-form | The normalized version of the industry type. If this column contains a name instead of a zero, that name must also be listed in a different entry as an industry type with a standardized form of “0”. |
sectors | The industry categories of the specified industry type. These values correspond to the sector codes listed in the industry sector file (bizIndustryCategoryCode.dat). You can list as many categories as apply for each type, but they must be entered with a space between each and no line breaks, and they must correspond to an entry in the industry sector file. |
Below is an excerpt from the industry key type file.
TECH                 TECHNOLOGY           05001-05007
TECHNOLOGIES         TECHNOLOGY           05001-05007
TECHNOLOGY           0                    05001-05007
TECHSYSTEMS          0                    05001-05007
TELE PHONE           TELEPHONE            16005
TELE PHONES          TELEPHONES           16005
TELEVISION           TV                   11013 21014
TELECOM              0                    16005 26006 26009 26010
TELECOMM             TELECOMMUNICATION    16005 26006 26008
TELECOMMUNICATION    0                    16005 26006 26008
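The sectors column in the excerpt contains both individual codes and hyphenated values such as 05001-05007. The following Python sketch expands a sectors value into a list of individual codes. Treating a hyphenated value as an inclusive range of sector codes is an assumption inferred from the excerpt; the table above does not describe this notation. The function name `expand_sector_codes` is hypothetical.

```python
def expand_sector_codes(sectors):
    """Expand a space-separated sectors value into individual 5-digit codes."""
    codes = []
    for item in sectors.split():
        if "-" in item:
            # Assumption: "05001-05007" means every code from 05001 to 05007.
            start, end = item.split("-")
            codes.extend(f"{n:05d}" for n in range(int(start), int(end) + 1))
        else:
            codes.append(item)
    return codes


print(expand_sector_codes("16005 26006 26008"))
# ['16005', '26006', '26008']
print(expand_sector_codes("05001-05007"))
# ['05001', '05002', '05003', '05004', '05005', '05006', '05007']
```

Each resulting code can then be compared against the entries in the industry sector file.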
The organization key type file (bizOrganizationTypeKeys.dat) is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the standardization engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.
original-type standardized-form
The following table describes each column in the organization key type file.
Table 4-17 Organization Key Type File
Column | Description |
---|---|
original-type | The original value of the organization field in an input record. |
standardized-form | The normalized version of an organization type. A zero (0) in this field indicates that the value in the first column is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an original type with a standardized form of “0”. |
Below is an excerpt from the organization key type file.
INC                    INCORPORATED
INCORPORATED           0
KG                     0
KK                     0
LIMITED                0
LIMITED PARTNERSHIP    0
LLC                    0
LLP                    0
LP                     LIMITED PARTNERSHIP
LTD                    LIMITED
The business patterns file (bizpatterns.dat) defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.
length input-pattern
output-pattern
The following table lists and describes the components in the above syntax.
Table 4-18 Business Patterns File Components
Component | Description |
---|---|
length | The number of business name components in the input field. |
input-pattern | Tokens that represent a possible input pattern from the unparsed business name fields. Each token represents one component. For more information about business name tokens, see Business Name Tokens. |
output-pattern | Tokens that represent the output pattern for the specified input pattern. Each token represents one component. For more information about business name tokens, see Business Name Tokens. |
Below is an excerpt from the business patterns file.
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT
4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT
4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT
4 CST IDT NF ORT
PNT PNT PNT ORT
4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT
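The two-row layout of the patterns file can be read as in the following Python sketch, which pairs each input row with the output row that follows it and verifies that the leading number matches the count of input tokens. This is an illustration of the file layout only, not the engine's own loader. The function name `parse_pattern_pairs` is hypothetical, and the sketch assumes that tokens are separated by whitespace and that blank lines can be ignored.

```python
def parse_pattern_pairs(lines):
    """Return (length, input_tokens, output_tokens) for each two-row pair."""
    rows = [ln.split() for ln in lines if ln.strip()]
    patterns = []
    # Even-indexed rows hold input patterns; odd-indexed rows hold outputs.
    for first, second in zip(rows[0::2], rows[1::2]):
        length, input_tokens = int(first[0]), first[1:]
        if len(input_tokens) != length:
            raise ValueError(f"expected {length} input tokens, found {len(input_tokens)}")
        patterns.append((length, input_tokens, second))
    return patterns


sample = [
    "4 PNT AST SEP-GLC ORT",
    "PNT AST DEL ORT",
    "4 CST IDT NF ORT",
    "PNT PNT PNT ORT",
]
for length, inp, out in parse_pattern_pairs(sample):
    print(length, inp, "->", out)
```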
The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the standardization engine does not recognize custom tokens.
The following table lists and describes each input token.
Table 4-19 Business Name Input Pattern Tokens
Pattern Identifier | Description |
---|---|
CTT | A connector token |
PNT | A primary name of a business |
PN-PN | A hyphenated primary name of a business |
BCT | A common business term |
URL | The URL of a business web site |
ALT | A business alias type key (usually an acronym) |
CNT | A country name |
NAT | A nationality |
CST | A city or state type key |
IDT | An industry type key |
IDT-AJT | Both an industry and an adjective type key |
AJT | An adjective type key |
AST | An association type key |
ORT | An organization type key |
SEP | A separator key |
NFG | Generic term, not recognized as a specific business name component, with an internal hyphen |
NF | Generic term, not recognized as a specific business name component |
NFC | A single character, not recognized as a specific business name component |
SEP-GLC | A joining comma (a glue type separator) |
SEP-GLD | A joining hyphen (a glue type separator) |
AND | The text “and” |
GLU | A glue type key, such as a forward slash, connecting two parts of a business name component |
PN-NF | A business primary name followed by a hyphen and a generic term that is not recognized as a specific business name component |
NF-PN | A generic term that is not recognized as a specific business name component, followed by a hyphen and a recognized business primary name |
NF-NF | Two generic terms, not recognized as specific business name components and separated by a hyphen |
The following table lists and describes each output token.
Table 4-20 Business Name Output Pattern Tokens
Pattern Identifier | Description |
---|---|
PNT | The primary name of the business |
URL | The URL of the business |
ALT | The alias type key of the business (usually an acronym) |
IDT | The industry type key of the business |
AST | The association type key of the business |
ORT | The organization type key of the business |
NF | A generic term not recognized as a business name component |
Master person index applications rely on the OHMPI Standardization Engine to process business data. To ensure correct processing of business information, you need to customize the Matching Service for the master person index application according to the rules defined for the standardization engine. This includes modifying mefa.xml to define parsing and phonetic encoding of the appropriate fields. You can modify mefa.xml using the Master Person Index Configuration Editor.
Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for parsing and normalization, modify the standardization structure in mefa.xml. To configure phonetic encoding, modify the phonetic encoding structure.
Generally, the BusinessName data type processes data that requires parsing prior to processing. You should not need to configure fields to normalize for business names. The following topics provide information about the fields used in processing business names and how to configure standardization for a master person index application. The information provided in these sections is based on the default configuration.
When standardizing free-form business names, not all fields in a record need to be processed by the OHMPI Standardization Engine. The standardization engine only needs to process fields that must be parsed, normalized, or phonetically converted. For a master person index application, these fields are defined in mefa.xml, and processing logic for each field is defined in the Standardization Engine node configuration files.
The OHMPI Standardization Engine expects that business name data will be provided in a free-form text field containing several components that must be parsed. By default, the match engine is configured to parse these components, and to normalize and phonetically encode the business name. You can specify additional fields for phonetic encoding.
If you specify the BusinessName match type for any field in the wizard, a standardization structure for that field is defined in mefa.xml. The fields listed under "Business Name Object Structure" are automatically defined as the target fields. If you do not specify business name fields for matching in the wizard but want to standardize the fields, you can create a standardization structure in mefa.xml.
For the default configuration of the BusinessName data type, the name field specified for standardization is parsed into several additional fields, one of which is also normalized. If you specify the BusinessName match type in the wizard, the following fields are automatically added to the object structure and database creation script.
field_name_Name
field_name_NamePhon
field_name_OrgType
field_name_AssocType
field_name_Industry
field_name_Sector
field_name_Alias
field_name_Url
where field_name is the name of the field for which you specified business name matching. For example, if you specify the BusinessName match type for the Company field, the fields automatically added to the structure include Company_Name, Company_NamePhon, Company_OrgType, and so on.
You can add these fields manually if you do not specify a match type in the wizard.
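If you add the fields manually, the naming convention above can be generated rather than typed. The following Python sketch lists the field names for a given source field. The helper name `business_name_fields` is hypothetical; the suffixes are the ones listed above for the default configuration.

```python
# Suffixes added by the wizard in the default configuration.
SUFFIXES = (
    "Name", "NamePhon", "OrgType", "AssocType",
    "Industry", "Sector", "Alias", "Url",
)


def business_name_fields(field_name):
    """Return the standardized field names derived from field_name."""
    return [f"{field_name}_{suffix}" for suffix in SUFFIXES]


print(business_name_fields("Company")[:3])
# ['Company_Name', 'Company_NamePhon', 'Company_OrgType']
```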
For free-form business name fields, the source fields you define for parsing should include the standardization components that are predefined for parsing and normalization. For example, fields containing business information can include any of the field components listed in "Business Name Standardization Components". The target fields can include any of these parsed fields. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide to define fields for standardization. For the standardization-type element, enter BusinessName. For a list of field IDs to use in the standardized-object-field-id element, see "Business Name Standardization Components".
Note:
In the default configuration, the rules defined for the business name data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.
A sample standardization structure for business names is shown below. This structure parses a business name field into these standard business name fields: name, organization type, association type, sector, industry, and URL. Note that no domain selector is specified, so the United States domain would normally be used by default; because business names are not variant dependent, the domain has no effect here.
<free-form-texts-to-standardize>
  <group standardization-type="BusinessName">
    <unstandardized-source-fields>
      <unstandardized-source-field-name>Company.Name</unstandardized-source-field-name>
    </unstandardized-source-fields>
    <standardization-targets>
      <target-mapping>
        <standardized-object-field-id>PrimaryName</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_Name</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>OrgTypeKeyword</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_OrgType</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>AssocTypeKeyword</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_AssocType</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>IndustrySectorList</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_Sector</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>IndustryTypeKeyword</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_Industry</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>Url</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_URL</standardized-target-field-name>
      </target-mapping>
    </standardization-targets>
  </group>
</free-form-texts-to-standardize>
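Because only the predefined tokens in Table 4-7 can be used as field IDs, it can be useful to check a standardization structure before deploying it. The following Python sketch reads a structure with the standard library XML parser and reports any field ID that is not in the token table. It is not an OHMPI utility. The function name `unknown_field_ids` is hypothetical, the comparison assumes that field IDs are case-sensitive, and the trimmed sample contains a deliberately miscapitalized ID for illustration.

```python
import xml.etree.ElementTree as ET

# Tokens from Table 4-7 (Business Name Tokens).
BUSINESS_NAME_TOKENS = {
    "PrimaryName", "OrgTypeKeyword", "AssocTypeKeyword",
    "IndustrySectorList", "IndustryTypeKeyword", "AliasList", "Url",
}


def unknown_field_ids(xml_text):
    """Return field IDs in the structure that are not predefined tokens."""
    root = ET.fromstring(xml_text)
    ids = [(el.text or "").strip() for el in root.iter("standardized-object-field-id")]
    # Assumption: IDs must match the token names exactly, including case.
    return [field_id for field_id in ids if field_id not in BUSINESS_NAME_TOKENS]


sample = """
<free-form-texts-to-standardize>
  <group standardization-type="BusinessName">
    <standardization-targets>
      <target-mapping>
        <standardized-object-field-id>PrimaryName</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_Name</standardized-target-field-name>
      </target-mapping>
      <target-mapping>
        <standardized-object-field-id>OrgTypekeyword</standardized-object-field-id>
        <standardized-target-field-name>Company.Name_OrgType</standardized-target-field-name>
      </target-mapping>
    </standardization-targets>
  </group>
</free-form-texts-to-standardize>
"""

print(unknown_field_ids(sample))  # ['OrgTypekeyword']
```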
When you match or standardize on business name fields, the business name field should be specified for phonetic conversion (by default, the wizard defines this for you). Follow the instructions under “Defining Phonetic Encoding for the Master Person Index” in Oracle Healthcare Master Person Index Configuration Guide to define fields for phonetic encoding.
A sample of the phoneticize-fields element is shown below. This sample only converts the business name. You can define additional fields for phonetic encoding.
<phoneticize-fields>
  <phoneticize-field>
    <unphoneticized-source-field-name>Company.Name_Name</unphoneticized-source-field-name>
    <phoneticized-target-field-name>Company.Name_NamePhon</phoneticized-target-field-name>
    <encoding-type>NYSIIS</encoding-type>
  </phoneticize-field>
</phoneticize-fields>