Oracle® Healthcare Master Person Index Standardization Engine Reference
Release 1.1

Part Number E18471-01

4 Patterns-based Address Data Configuration

This chapter provides conceptual information and procedures for setting up patterns-based address configuration and patterns-based business name configuration.

This chapter includes the following sections:

Setting Patterns-based Address Data Configuration

By default, address standardization is performed using the patterns-based framework. Processing street addresses involves parsing, normalizing, data typing, and using advanced patterns rules to map the address fields with their corresponding types, prior to matching. The following sections describe the configuration files that define address processing logic and provide instructions for modifying mefa.xml for processing address fields.

Address Data Standardization Overview

Processing data using the Address data type includes both standardizing and matching on free-form address fields. The OHMPI Standardization Engine can create the parsed, normalized, and typed values for address data. These values are needed for accurate searching and matching on address data. You can implement street address standardization and matching on its own, or within an application designed to process person or business information. Standardizing address information allows you to include address fields as search criteria, even though matching might not be performed against these fields.

Several configuration files are designed specifically to handle address data and define processing logic for the standardization process. These include address clues files, a patterns file, and a constants file. The United States address standardization engine is based on work performed at the US Census Bureau. The clues files, in particular, are based on Census Bureau statistics.

Address Data Standardization Components

Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for address data along with the standardization component they represent. You can only specify the predefined field tokens that are listed in this table for addresses unless you create a new data type or variant.

Table 4-1 Address Data Tokens

Token Description
BoxDescript Represents the P.O. box type from a standardized address field. By default, this is stored in the field_name_StName field in a master person index database.
BoxIdentif Represents the parsed P.O. box number from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database.
NeighborhoodName Represents the parsed structure street's block or neighborhood description from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed.
NeighborhoodType Represents the parsed structure street's block or neighborhood identifier from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed.
HouseNumber Represents the parsed house number from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database.
HouseNumPrefix Represents the parsed house number prefix from a standardized address field (such as the “A” in “A 1587 4th Street”). This address component is not included in the default master person index standardization structure, but you can add it if needed.
HouseNumSuffix Represents the parsed house number suffix from a standardized address field (such as the “B” in “5900 B Arnett Avenue”). This address component is not included in the default master person index standardization structure, but you can add it if needed.
MatchPropertyName Represents the parsed match property name from a standardized address field and is an alternative representation of the field used by the standardization engine for blocking and phonetic encoding. This address component is not included in the default master person index standardization structure, but you can add it if needed.
MatchStreetName Represents the parsed and standardized street name from a standardized address field and is an alternative representation of the field used by the standardization engine. If you want to store the standardized street name in the database (recommended), map this field to the street name field in the database. By default, this is stored in the field_name_StName field in a master person index database.
OrigPropertyName Represents the parsed original property name (such as the name of a complex or business park) from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed.
PropDesPrefDirection Represents the parsed property direction from a standardized address field. This field ID handles cases where the direction is a prefix to the property description. By default, this is stored in the field_name_StDir field in a master person index database.
PropDesPrefType Represents the parsed property type from a standardized address field. This field ID handles cases where the street type is a prefix to the property description. By default, this is stored in the field_name_StType field in a master person index database.
PropertySufDirection Represents the parsed property direction from a standardized address field. This field ID handles cases where the direction is a suffix to the property description. By default, this is stored in the field_name_StDir field in a master person index database.
PropertySufType Represents the parsed property type from a standardized address field. This field ID handles cases where the street type is a suffix to the property description. By default, this is stored in the field_name_StType field in a master person index database.
RuralRouteDescript Represents the parsed rural route description from a standardized address field. By default, this is stored in the field_name_StName field in a master person index database.
RuralRouteIdentif Represents the parsed rural route identifier from a standardized address field. By default, this is stored in the field_name_HouseNo field in a master person index database.
SecondHouseNumber Represents the parsed second house number from a standardized address field. This address component is not included in the default master person index standardization structure, but you can add it if needed.
SecondHouseNumberPrefix Represents the parsed second house number prefix from a standardized address field (such as “25” in “25 319 10th Ave.”). This address component is not included in the default master person index standardization structure, but you can add it if needed.
SecondStreetNameSufDirection Represents the parsed second street direction from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
SecondStreetNameSufType Represents the parsed second street type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
OrigSecondStreetName Represents the parsed second street name from a standardized address field (for example, an address might include a cross-street or a thoroughfare and dependent thoroughfare). This address component is not included in the default master person index standardization structure, but you can add it if needed.
OrigStreetName Represents the parsed street name from an address field. If you want to store the original street name in the database, map this field to the street name field in the database. This address component is not included in the default standardization structure, but you can add it if needed.
StreetNamePrefDirection Represents the parsed street direction from a standardized address field. This field ID handles cases where the direction is a prefix to the street name. By default, this is stored in the field_name_StDir field in a master person index database.
StreetNamePrefType Represents the parsed street type from a standardized address field. This field ID handles cases where the street type is a prefix to the street name. By default, this is stored in the field_name_StType field in a master person index database.
StreetNameSufDirection Represents the parsed street direction from a standardized address field. This field ID handles cases where the direction is a suffix to the street name. By default, this is stored in the field_name_StDir field in a master person index database.
StreetNameSufType Represents the parsed street type from a standardized address field. This field ID handles cases where the street type is a suffix to the street name. By default, this is stored in the field_name_StType field in a master person index database.
StreetNameExtensionIndex Represents the parsed street name extension from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
WithinStructDescript Represents the parsed internal descriptor (such as “Floor”) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
WithinStructIdentif Represents the parsed internal identifier (such as a floor number) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
CityName Represents a city name, within a state or a county, from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
CityDescriptor Represents a city's description type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
PostalCode Represents the location postal code type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
StateName Represents a given country's state name from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
CountryName Represents a given state's county name from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
CountryCode Represents a 3-letter ISO country code from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.
ExtraInfo Represents any extra information that was not included in any of the other parsed components. This address component is not included in the default standardization structure, but you can add it if needed.

Note:

CityName, CityDescriptor, PostalCode, StateName, CountryName, and CountryCode are new token types. They are implemented in the Mexico standardization locale for this release. They will be available in the United States, United Kingdom, Australia and France standardization implementations in future releases.
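
The default storage targets listed in Table 4-1 can be summarized as a lookup. The following Python sketch is illustrative only: DEFAULT_TARGET_SUFFIX and default_target_field are hypothetical names and are not part of any OHMPI API. The mapping itself is transcribed from the table.

```python
# Default target-field suffixes for address tokens, transcribed from Table 4-1.
# Tokens that are absent here are not part of the default standardization
# structure. All names in this sketch are hypothetical helpers.
DEFAULT_TARGET_SUFFIX = {
    "HouseNumber": "HouseNo",
    "BoxIdentif": "HouseNo",
    "RuralRouteIdentif": "HouseNo",
    "MatchStreetName": "StName",
    "BoxDescript": "StName",
    "RuralRouteDescript": "StName",
    "StreetNamePrefDirection": "StDir",
    "StreetNameSufDirection": "StDir",
    "PropDesPrefDirection": "StDir",
    "PropertySufDirection": "StDir",
    "StreetNamePrefType": "StType",
    "StreetNameSufType": "StType",
    "PropDesPrefType": "StType",
    "PropertySufType": "StType",
}


def default_target_field(field_name, token):
    """Return the default database field for a token, or None if the
    token is not stored by default."""
    suffix = DEFAULT_TARGET_SUFFIX.get(token)
    return None if suffix is None else f"{field_name}_{suffix}"
```

For example, default_target_field("AddressLine1", "BoxIdentif") returns "AddressLine1_HouseNo", which shows why several parsed components can share one database field.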

Address Data Standardization Files

Three configuration files define address processing logic for the OHMPI Standardization Engine. These files provide information about address patterns and tokens to help the standardization engine determine how to recognize address components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data.

The address configuration files are located in the resource folder under each variant name for the Address data type. The following topics provide information about each configuration file.

Address Clues File

The address clues file (clues.dat) lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types. A relevance value specifies which of the component types the term is most likely to be. For example, the term “Junction” is standardized as “Jct” and is classified as a street type, building unit, and generic term, in that order of relevance.

This file helps the OHMPI Standardization Engine recognize common terms in street addresses in order to parse and normalize the values correctly. The syntax of this file is:

common-term normalized-term ID-number/type-token

You can modify or add entries in this table as needed. The following table describes the columns in the address clues file.

Table 4-2 Address Clues File Columns

Column Description
common-term A term commonly found in street addresses.
normalized-term The normalized version of the common term.
ID-number/type-token An ID number and a token indicating the type of address component represented by the common term. The ID number corresponds to an ID number in the address master clues file, and the type token corresponds to the type specified for that ID number in the address master clues file. One term might have several ID number and token type pairs. Their order of appearance indicates their relevance value.

Following is an excerpt from the US address clues file.

TRLR VLG          Trpk            59BU
TRPK              Trpk            59BU
TRPRK             Trpk            59BU
VILLA             Vlla            305TY          60BU
VLLA              Vlla            305TY          60BU
VILLAS            Vlla            60BU
VILL              Vlg             317TY          61BU        364AU
VILLAG            Vlg             317TY          61BU        364AU
VLG               Vlg             317TY          61BU        364AU
VILLAGE           Vlg             317TY          61BU        364AU
VILLG             Vlg             317TY          61BU        364AU
VILLIAGE          Vlg             317TY          61BU        364AU
VLGE              Vlg             317TY          61BU        364AU
VIVI              Vivi            62BU
VIVIENDA          Vivi            62BU
COLLEGE           Coll            64BU                       0AU
CLG               Coll            64BU
COTTAGE           Cott            65BU           65BP        0AU
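
To make the column layout concrete, the following Python sketch reads one line of the excerpt above. It is not the engine's parser. It assumes that columns are separated by two or more spaces (so that a multi-word term such as "TRLR VLG" stays intact) and that each ID-number/type-token pair is a run of digits followed by a two-character token. The names parse_clue_line and PAIR are hypothetical.

```python
import re

# One "ID-number/type-token" pair, such as "317TY" or "0AU".
PAIR = re.compile(r"^(\d+)([A-Z][A-Z0-9])$")


def parse_clue_line(line):
    """Split one clues-file line into (common_term, normalized_term, pairs).

    Assumes columns are separated by two or more spaces. The pairs are
    returned in file order, which the documentation describes as the
    order of relevance.
    """
    columns = re.split(r"\s{2,}", line.strip())
    if len(columns) < 3:
        raise ValueError(f"expected at least three columns: {line!r}")
    common_term, normalized_term, *raw_pairs = columns
    pairs = []
    for raw in raw_pairs:
        match = PAIR.match(raw)
        if match is None:
            raise ValueError(f"unrecognized ID/type pair: {raw!r}")
        pairs.append((int(match.group(1)), match.group(2)))
    return common_term, normalized_term, pairs
```

Applied to the VILL line of the excerpt, the function yields the term "VILL", the normalized value "Vlg", and the pairs (317, "TY"), (61, "BU"), and (364, "AU").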

Address Master Clues File

The address master clues file (masterClues.dat) lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom's Royal Mail, the Australian Postal Corporation, France's La Poste, or Mexico's Postal Service, depending on the variant in use. For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.

The syntax of this file is:

ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag

You can modify or add entries in this table as needed. The following table describes the columns in the address master clues file.

Table 4-3 Address Master Clue File Columns

Column Description
ID-number A unique identification number for the address common term. This number corresponds to an ID number for the same term in the address clues file.
common-term A common address term, such as Park, Village, North, Route, Centre, and so on.
normalized-term The normalized version of the common term.
short-abbrev A short abbreviation of the common term.
postal-abbrev The standard postal abbreviation of the common term. This is less used in other locales.
CFCCS The census feature class code of the term (as defined in the Census Tiger® database). The following values are used:
  • A - Road

  • B - Railroad

  • C - Miscellaneous

  • D - Landmark

  • E - Physical feature

  • F - Nonvisible feature

  • H - Hydrography

  • X - Unclassified

    These are not used in other locales.

type-token The type of address component represented by the common term. Types are specified by an address token (for more information, see Address Type Tokens).
usage-flag A flag indicating how the term is used (for more information, see "Pattern Classes").
postal-flag The standard postal code for the term. This is less used in other countries or locales.

Following is an excerpt from the US address master clues file.

11Alley                    Alley            Al         Aly A        TY R U
12Alternate Route          Alt Rte          Alt        Alt A        TY R
15Arcade                   Arcade           Arc        Arc A        TY R U
16Arroyo                   Arroyo           Arryo      ArryHA       TY R
17Autopista                Atpta            Apta       AptaA        TY R
18Avenida                  Avenida          Ava        Ava A        TY R
19Avenue                   Avenue           Ave        Ave A        TY R U
26Boulevard                Blvd             Blvd       BlvdA        TY R U
32Bulevar                  Blvr             Blv        Blv A        TY R
33Business Route           Bus Rte          BusRt      BsRtA        TY R
34Bypass                   Bypass           Byp        Byp A        TY R U
36Calle                    Calle            Calle      ClleA        TY R
37Calleja                  Calleja          Cja        Cja A        TY R
38Callejon                 Callej           Cjon       CjonA        TY R
39Camino                   Camino           Cam        Cam A        TY R
47Carretera                Carrt            Carr       CarrA        TY R
48Causeway                 Cswy             Cswy       CswyAH       TY R U
51Center                   Center           Ctr        Ctr DA       TY R U

Address Patterns File

The address patterns file (patterns.dat) defines the expected input patterns of each individual street address field being standardized so the Master Person Index Standardization Engine can recognize and process these values. Tokens indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags. Below is an example.

AU A1 TY                01 Oak B Street
NA NA ST                T* 75                TX

When an address is parsed, each line of the address is delineated by a pipe (|) and sent to the parser separately. The output tokens for each line are then concatenated and the output pattern is processed using the address patterns file to determine whether the output pattern is listed in the file. If the pattern is found, output patterns are modified as indicated in the patterns file to resolve any ambiguities that might arise when two lines of address information contain common elements. The relative importance determines which pattern to use when the format of the input field matches more than one pattern. This file should only be modified by personnel with a thorough understanding of address patterns and tokens.

The syntax of this file is:

input-pattern example output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag

You can modify or add entries in this table as needed. The following table describes the columns in the address patterns file.

Table 4-4 Address Patterns File

Column Description
input-pattern Tokens that represent a possible input pattern from an individual unparsed street address field. Each token represents one component. For more information about address tokens, see "Address Type Tokens".
example An example of a street address that fits the specified pattern. This file element is optional.
output-pattern Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Master Person Index Standardization Engine. For more information about address tokens, see "Address Type Tokens".
pattern-class An indicator of the type of address component represented by the pattern. Possible pattern classes are listed in "Pattern Classes".
pattern-modifier An indicator of whether the priority of the pattern is averaged against other patterns that match the input. Pattern modifiers are listed in "Pattern Modifiers".
priority The priority weight to use for the pattern when the pattern is a sub-pattern of a larger input pattern. For more information, see "Priority Indicators".
usage-flag A flag indicating how the term is used (for more information, see "Pattern Classes"). This file element is optional.
exclude-flag This file element is optional.

The following are excerpts from the address patterns files.

For United States Locale:

NU NU FC TY                         //   123 8 1/2 street
HN NA NA ST                         H* 90

NU AU FC TY                         //   123 8th 1/2 street
HN NA NA ST                         H* 90

NU DR SA TY                         //   123 South Michigan Street
HN PD NA ST                         H* 95

NU DR TY NU DR                      // 123 South Avenida 1 West
HN PD PT NA SD                      H* 70

For Mexico Locale:

TY NU ND NU                         // Calle 6 No 1810 
PT NA P1 HN                         H* 75

TY SA NU                            // Avenida Durango 15
PT NA HN                            H* 85

TY SC NU                            // Avenida Tijuana 35
PT NA HN                            H* 85

TY NU DM NU                         // AV. 5 DE FEBRERO 2125
PT NA NA HN                         H* 85 

TY AU NU DR                         // Paseo Alcalde 1810 Norte
PT NA HN SD                         H* 85

CT ZP SC SA CC                      // TLALPAN 14330 TLALPAN DISTRITO FEDERAL, Mexico
CT ZP CN SN CC                      S* 96
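
The two-row layout of a pattern entry can be illustrated with a short Python sketch. It is a reading aid under stated assumptions, not the engine's own parser: it assumes the example on the input row is introduced by "//" as in the excerpts above, and that the output row ends with a combined class and modifier field (such as "H*") followed by the priority. Optional trailing usage and exclude flags are not handled. PatternEntry and parse_pattern_entry are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PatternEntry:
    input_tokens: list
    example: str
    output_tokens: list
    pattern_class: str
    modifier: str
    priority: int


def parse_pattern_entry(input_row, output_row):
    """Parse the input row and output row of one pattern entry."""
    # Input row: tokens, then an optional example after "//".
    pattern_part, _, example = input_row.partition("//")
    input_tokens = pattern_part.split()

    # Output row: tokens, then class+modifier (for example "H*"), then priority.
    fields = output_row.split()
    if len(fields) < 3:
        raise ValueError(f"output row is too short: {output_row!r}")
    *output_tokens, class_and_modifier, priority = fields
    if len(class_and_modifier) != 2 or class_and_modifier[1] not in "*+":
        raise ValueError(f"unexpected class/modifier: {class_and_modifier!r}")

    return PatternEntry(
        input_tokens=input_tokens,
        example=example.strip(),
        output_tokens=output_tokens,
        pattern_class=class_and_modifier[0],
        modifier=class_and_modifier[1],
        priority=int(priority),
    )
```

For the "123 South Michigan Street" entry, this gives input tokens NU DR SA TY, output tokens HN PD NA ST, class H, modifier *, and priority 95.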

Address Pattern File Components

The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.

Address Type Tokens

The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the OHMPI Standardization Engine does not recognize custom tokens.

The following table lists and describes each input token.

Table 4-5 Input Address Pattern Type Tokens

Token Description
A1 Alphabetic value, one character in length
AM Ampersand
AU Generic word
BP Building property
BU Building unit
BX Post office box
CC Country name abbreviation (3-letter ISO code)
CD City descriptor
CT City name
DA Dash (as a starting character)
DR Street direction
EI Extra information
EX Extension
FC Numeric fraction
HR Highway route
MP Mile posts
NL Common words, such as “of”, “the”, and so on
NU Numeric value
OT Ordinal type
PT Prefix type
RR Rural route
SA State name
SC County name
TY Street type
WD Descriptor within the structure
WI Identifier within the structure
ZP Postal code

The following table lists and describes each output token.

Table 4-6 Output Address Pattern Tokens

Token Description
1P Building number prefix
2P Second building number prefix
BD Property or building directional suffix
BI Structure (building) identifier
BN Property or building name
BS Building number suffix
BT Property or building type suffix
BX Post office box descriptor
BY Structure (building) descriptor
CC Country name abbreviation (3-letter ISO code)
CD City descriptor
CN County name
CT City name
DB Property or building directional prefix
EI Extra information
EX Extension index
H1 First house number (the actual number)
H2 Second house number (house number suffix)
HN House number
HS House number suffix
N2 Second street name
NA Street name
NB Building number
NL Conjunctions that connect words or phrases in one component type (usually the street name)
P1 House number prefix
P2 Second house number prefix
PD Directional prefix to the street name
PT Street type prefix to the street name
RR Rural route descriptor
RN Rural route identifier
S2 Street type suffix to the second street name
SD Directional suffix to the street name
SN State name
ST Street type suffix to the street name
TB Property or building type prefix
WI Identifier within the structure
WD Descriptor within the structure
XN Post office box identifier
ZP Postal code

Pattern Classes

Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.

  • H - the address pattern represents a house

  • B - the address pattern represents a building

  • W - the address pattern represents a unit within a structure, such as an apartment or suite number

  • T - the address pattern represents a street type or direction

  • R - the address pattern represents a rural route

  • P - the address pattern represents a Post Office box

  • N - the address pattern is mostly numeric

  • S - the address pattern represents country, state, or county class

These classes are also specified as usage flags in the patterns file and the master clues file.

Pattern Modifiers

Each pattern class must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns are found to be sub-patterns of a larger input pattern. In these cases, the OHMPI Standardization Engine must know how to prioritize each defined pattern that is part of the larger pattern. There are two pattern modifiers.

  • * - An asterisk indicates that the priority weight for the matching pattern is averaged down equally with the other matching sub-patterns.

  • + - A plus sign indicates that the priority weight for the matching pattern is not averaged down equally with the other matching sub-patterns.

Priority Indicators

The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best when defined as a multiple of five from 35 through 95. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the standardization engine stops searching for additional matching patterns and uses the high-priority matching pattern.
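
The early-exit behavior of high-priority patterns can be illustrated roughly in Python. This is a simplified sketch and not the engine's actual algorithm: it ignores the averaging controlled by the pattern modifiers, assumes that candidates which already match the input are examined in order, and treats any priority of 90 or higher as high priority. The names choose_pattern and HIGH_PRIORITY are hypothetical.

```python
HIGH_PRIORITY = 90  # assumption: a priority of 90 or above ends the search


def choose_pattern(candidates):
    """Pick one entry from (output_pattern, priority) candidates.

    The candidates are assumed to match the input already. The first
    high-priority candidate is returned immediately; otherwise the
    candidate with the largest priority wins. Returns None when there
    are no candidates.
    """
    best = None
    for output_pattern, priority in candidates:
        if priority >= HIGH_PRIORITY:
            return output_pattern, priority
        if best is None or priority > best[1]:
            best = (output_pattern, priority)
    return best
```

With candidates of priority 75, 95, and 85, the search stops at the second candidate; with candidates of priority 70, 85, and 75, all three are examined and the one with priority 85 is selected.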

Address Standardization and Oracle Healthcare Master Person Index

Master person index applications rely on the OHMPI Standardization Engine to process address data. To ensure correct processing of address information, you need to customize the Matching Service for the master person index application according to the patterns defined for the standardization engine. This includes modifying mefa.xml to define parsing and phonetic encoding of the appropriate fields. You can use the Master Person Index Configuration Editor to modify mefa.xml.

Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for standardization and normalization, modify the standardization structure in mefa.xml. To configure phonetic encoding, modify the phonetic encoding structure. You can perform all of these tasks using the Master Person Index Configuration Editor.

Generally, the address data type processes data that requires standardization prior to processing. You should not need to configure fields to normalize for addresses. The following topics provide information about the fields used in processing address data and how to configure address data standardization for a master person index application. The information provided in these sections is based on the default configuration.

Address Data Processing Fields

When standardizing address data, not all fields in a record need to be processed by the OHMPI Standardization Engine. The standardization engine only needs to process address fields that might be used in the matching process. For a master person index application, these fields are defined in mefa.xml and processing logic for each field is defined in the Standardization Engine node configuration files.

Address Standardized Fields

The OHMPI Standardization Engine expects that street address data will be provided in a free-form text field containing several components that must be standardized (parsed, normalized, and typed). By default, the standardized street name is configured to be phonetically encoded. You can specify additional fields for phonetic encoding.

If you specify the Address match type for any field in the wizard, a standardization structure for that field is defined in mefa.xml. The fields listed under "Address Data Processing Fields" are automatically defined as the target fields. Each of these fields has several entries in the standardization structure because different parsed components can be stored in the same field. For example, the house number, post office box number, and rural route identifier are all stored in the house number field. If you do not specify address fields for matching in the wizard but want to standardize the fields, you can create a standardization structure in mefa.xml using the Master Person Index Configuration Editor.

Address Object Structure

The address fields specified for standardization are parsed into several additional fields. If you specify the Address match type in the wizard, the following fields are automatically added to the object structure and database creation script.

  • field_name_HouseNo

  • field_name_StName

  • field_name_StDir

  • field_name_StType

  • field_name_StPhon

    where field_name is the name of the field for which you specified address matching. For example, if you specify the Address match type for the AddressLine1 field, the following fields are automatically added to the structure: AddressLine1_HouseNo, AddressLine1_StName, AddressLine1_StDir, AddressLine1_StType, and AddressLine1_StPhon.

You can add these fields manually if you do not specify a match type in the wizard.
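
The naming rule for the generated fields can be expressed in a few lines of Python. The helper name derived_address_fields is hypothetical and serves only to illustrate the convention described above.

```python
# Suffixes of the fields added for an address-matched field,
# as listed in this section.
ADDRESS_FIELD_SUFFIXES = ("HouseNo", "StName", "StDir", "StType", "StPhon")


def derived_address_fields(field_name):
    """Return the field names generated for one address source field."""
    return [f"{field_name}_{suffix}" for suffix in ADDRESS_FIELD_SUFFIXES]
```

Calling derived_address_fields("AddressLine1") returns the five AddressLine1 fields named in the example above.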

Configuring a Standardization Structure for Address Data

For free-form address fields, the source fields you define for standardization should include the associated components that are predefined for parsing, normalization, and data typing. For example, fields containing address information can include any of the field components listed in Address Data Standardization Components. The target fields can include any of these parsed fields. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01) to define fields for standardization. For the standardization-type element, enter Address. For a list of field IDs to use in the standardized-object-field-id element, see Address Data Standardization Components.

Note:

In the default configuration, the rules defined for the address data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.

A sample standardization structure for address data is shown below. This structure parses the first two lines of street address into the standard street address fields. Only the United States variant is defined in this structure.

<free-form-texts-to-standardize>
   <group standardization-type="ADDRESS"
    domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Address[*].Address1    
         </unstandardized-source-field-name>
         <unstandardized-source-field-name>Person.Address[*].Address2
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>HouseNumber
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>RuralRouteIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>MatchStreetName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>RuralRouteDescript
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxDescript
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>PropDesPrefDirection
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetDir
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>PropDesSufDirection
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetDir
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>StreetNameSufType
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetType
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>StreetNamePrefType
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetType
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
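
To review which parsed components share a target field, you can group the target mappings by target field name. The following Python sketch uses only the standard library and an abbreviated copy of the sample structure. The function name components_by_target is hypothetical, and the script is a review aid, not part of OHMPI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated copy of the sample structure, for illustration only.
SAMPLE = """
<free-form-texts-to-standardize>
   <group standardization-type="ADDRESS"
    domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>HouseNumber
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>MatchStreetName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
"""


def components_by_target(xml_text):
    """Map each target field name to the field IDs stored in it."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for mapping in root.iter("target-mapping"):
        # Element text spans two lines in the sample, so strip whitespace.
        field_id = mapping.findtext("standardized-object-field-id").strip()
        target = mapping.findtext("standardized-target-field-name").strip()
        grouped[target].append(field_id)
    return dict(grouped)


if __name__ == "__main__":
    for target, field_ids in components_by_target(SAMPLE).items():
        print(target, "<-", ", ".join(field_ids))
```

In the abbreviated sample, both HouseNumber and BoxIdentif resolve to the same HouseNumber target field, which mirrors the many-to-one mapping in the full structure.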

Configuring Phonetic Encoding for Address Data

When you match or standardize on street address fields, the street name should be specified for phonetic conversion (this is done by default in a master person index application). Follow the instructions under "Defining Phonetic Encoding for the Master Person Index" in Oracle Healthcare Master Person Index Configuration Guide (Part Number E18473-01) to define fields for phonetic encoding.

A sample of the phoneticize-fields element is shown below. This sample only converts the address street name. You can define additional fields for phonetic encoding.

<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Address[*].StreetName
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Address[*].StreetName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>

Setting Patterns-based Business Name Configuration

By default, business name standardization is performed using the patterns-based framework. Processing business name fields involves parsing, normalizing, and phonetically encoding certain fields prior to matching. The following sections describe the configuration files that define business name processing logic and provide instructions for modifying mefa.xml for processing business names.

Business Name Standardization Overview

Processing data using the BusinessName data type includes both standardizing and matching on free-form business name fields. The OHMPI Standardization Engine can create the parsed, normalized, and phonetic values for business names. These values are needed for accurate searching and matching on business information. You can implement business name standardization and matching on its own, or within an application designed to process person information. Standardizing business name fields allows you to include these fields as search criteria, even though matching might not be performed against these fields.

The OHMPI Standardization Engine can create standardized and phonetic values for business name field components. Several configuration files are designed specifically to handle business names and define additional logic for the standardization and phonetic encoding process. These include reference files, a patterns file, and key type files. The business name standardization files are contained in one generic variant.

Business Name Standardization Components

Standardization engines use tokens to determine how each field is standardized into its individual field components and to determine how to normalize a field value. Tokens also identify the field components to external applications like a master person index application. The following table lists each token generated by the OHMPI Standardization Engine for business names along with the standardization component they represent. You can only specify the predefined field tokens that are listed in this table for business names unless you create a new data type or variant.

Table 4-7 Business Name Tokens

Token Description
PrimaryName Represents the name parsed from a free-form text business name field.
OrgTypeKeyword Represents the organization type parsed from a free-form text business name field.
AssocTypeKeyword Represents the association type parsed from a free-form text business name field.
IndustrySectorList Represents the industry sector parsed from a free-form text business name field.
IndustryTypeKeyword Represents the industry type parsed from a free-form text business name field (industry type is a subset of the sector).
AliasList Represents the alias parsed from a free-form text business name field.
Url Represents the URL parsed from a free-form text business name field.

Business Name Standardization Files

Several configuration files are used to define business name processing logic for the OHMPI Standardization Engine. These files provide information about business name patterns and tokens to help the standardization engine determine how to recognize business name components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for business names.

The following topics describe each file used for business name standardization:

Business Name Adjectives Key Type File

The adjectives key type file (bizAdjectivesTypeKeys.dat) defines adjectives commonly found in business names so the OHMPI Standardization Engine can recognize and process these values as a part of the business name. This file contains one column with a list of commonly used adjectives, such as General, Financial, Central, and so on.

You can modify or add entries in this file as needed. Following is an excerpt from the adjectives key type file.

DIGITAL
DIRECTED
DIVERSIFIED
EDUCATIONAL
ELECTROCHEMICAL
ENGINEERED
EVOLUTIONARY
EXTENDED
FACTUAL
FEDERAL

Business Alias Key Type File

The alias key type file (bizAliasTypeKeys.dat) lists business name acronyms and abbreviations along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the alias key type file using the following syntax.

alias standardized-name

The following table describes the columns in the alias key type file.

Table 4-8 Alias Key Type File

Column Description
alias An abbreviation or acronym commonly used in place of a specific business name.
standardized-name The normalized version of the alias name.

Following is an excerpt from the alias key type file.

BBH                 BARTLE BOGLE HEGARTY
BBH                 BROWN BROTHERS HARRIMAN
IBM                 INTERNATIONAL BUSINESS MACHINE
IDS                 INCOMES DATA SERVICES
IDS                 INSURANCE DATA SERVICES
IDS                 THE INTEGRATED DECISION SUPPORT GROUP
IDS                 THE INTERNET DATABASE SERVICE
CAL-TECH            CALIFORNIA INSTITUTE OF TECHNOLOGY

Business Association Key Type File

The association key type file (bizAssociationTypeKeys.dat) lists business association types along with their standardized names so the standardization engine can recognize and process these values correctly. You can add entries to the association key type file using the following syntax.

association-type standardized-type

The following table describes the columns in the association key type file.

Table 4-9 Association Key Type File

Column Description
association-type A common association type for businesses, such as Partners, Group, and so on.
standardized-type The standardized version of the association type. If this column contains a name instead of a zero, that name must also be listed in a different entry as an association type with a standardized form of “0”.

Following is an excerpt from the bizAssociationTypeKeys.dat file.

ASSOCIATES          0
BANCORP             0
BANCORPORATION      BANCORP
COMPANIES           0
GP                  GROUP
GROUP               0
PARTNERS            0

Business General Terms Reference File

The general terms reference file (bizBusinessGeneralTerms.dat) lists terms commonly used in business names. This file is used to identify terms that indicate a business, such as bank, supply, factory, and so on, so the OHMPI Standardization Engine can recognize and process the business name.

This file contains one column that lists common terms in the business names you process. You can add entries as needed. Below is an excerpt from the general terms reference file.

BUILDING
CITY
CONSUMER
EAST
EYE
FACTORY
LATIN
NORTH
SOUTH

Business City or State Key Type File

The city or state key type file (bizCityorStateTypeKeys.dat) lists various cities and states that might be used in business names. It also classifies each entry as a city (CT) or state (ST) and indicates the country in which the city or state is located. This enables the standardization engine to recognize and process these values correctly. You can add entries to the city or state key type file using the following syntax.

city-or-state type country

The following table describes the columns in the file.

Table 4-10 City or State Key Type File

Column Description
city-or-state The name of a city or state used in business names.
type An indicator of whether the value is a city or state. “CT” indicates city and “ST” indicates state.
country The country code of the country in which the city or state is located.

Following is an excerpt from the city or state key type file.

ADELAIDE                 CT   AU
ALABAMA                  ST   US
ALASKA                   ST   US
ALGIERS                  CT   DZ
AMSTERDAM                CT   NL
ARIZONA                  ST   US
ARKANSAS                 ST   US
ASUNCION                 CT   PY
ATHENS                   CT   GR

Business Former Name Reference File

The business former name reference file (bizCompanyFormerNames.dat) provides a list of common company names along with names by which the companies were formerly known so the standardization engine can recognize a business when processing a record containing a previous business name. You can add entries to the business former name table using the following syntax.

former-name current-name

The following table describes each column in the business former name reference file.

Table 4-11 Business Former Name Reference File

Column Description
former-name One of the company's previous names.
current-name The company's current name.

Below is an excerpt from the business former name reference file.

HELLENIC BOTTLING                       COCA-COLA HBC
INTERNATIONAL PRODUCTS                  THE TERLATO WINE
ORGANIC FOOD PRODUCTS                   SPECTRUM ORGANIC PRODUCTS
SUTTER HOME WINERY                      TRINCHERO FAMILY ESTATES

Merged Business Name Category File

The merged business name category file (bizCompanyMergerNames.dat) provides a list of companies whose name changed because of a merger along with the name of the company after the merger. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to recognize the current company name and determine the sector of the business. You can add entries to the business merger name file using the following syntax.

former-name/merged-name sector-code

The following table describes each column in the merged business name category file.

Table 4-12 Merged Business Name Category File

Column Description
former-name The name of the company whose name was not kept after the merger.
merged-name The name of the company whose name was kept after the merger.
sector-code The industry sector code of the business. Sector codes are listed in the bizIndustryCategoriesCode.dat file.

Below is an excerpt from the merged business name category file.

DUKE/FLUOR DANIEL                                 20005
FAULTLESS STARCH/BON AMI                          09004
FIND/SVP                                          10013
FIRST WAVE/NEWPARK SHIPBUILDING                   27005
GUNDLE/SLT                                        19020
HMG/COURTLAND                                     23004
J BROWN/LMC                                       10014
KORN/FERRY                                        10020
LINSCO/PRIVATE LEDGER                             14005
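
The entry syntax above can be illustrated with a short Python sketch that splits one line into its three columns. The function name parse_merger_line is hypothetical. The sketch assumes the sector code is the last whitespace-delimited token and that the first forward slash separates the two names, which holds for every line in the excerpt.

```python
def parse_merger_line(line):
    """Split one entry into (former_name, merged_name, sector_code).

    Assumes the sector code is the final token on the line and that
    the first "/" separates the former name from the merged name.
    """
    names, sector_code = line.rsplit(None, 1)
    former_name, merged_name = names.split("/", 1)
    return former_name.strip(), merged_name.strip(), sector_code


if __name__ == "__main__":
    print(parse_merger_line("DUKE/FLUOR DANIEL                                 20005"))
```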

Primary Business Name Reference File

The primary business name reference file (bizCompanyPrimaryNames.dat) provides a list of companies by their primary name. It also classifies the business names into industry sectors and sub-sectors. This enables the standardization engine to determine the correct value of the sector field when parsing the business name. You can add entries to the primary business name file using the following syntax.

primary-name sector-code

The following table describes the columns in the primary business name reference file.

Table 4-13 Primary Business Name Reference File

Column Description
primary-name The primary name of the company.
sector-code The industry sector code of the business. Sector codes are listed in the bizIndustryCategoriesCode.dat file.

Below is an excerpt from the primary business name reference file.

BROTHER INTERNATIONAL                             12006
BRYSTOL-MYERS SQUIBB                              11005
BURLINGTON COAT FACTORY                           24003
BURLINGTON NORTHERN SANTA FE                      27005
BV SOLUTIONS                                      06012
CABLEVISION                                       26001
CABOT                                             04006
CADENCE                                           06010
CAMPBELL                                          22006
CAPITAL BLUE CROSS                                17001

Business Connector Tokens Reference File

The connector tokens reference file (bizConnectorTokens.dat) defines common values (typically conjunctions) that connect words in business names. For example, in the business name “Nursery of Venice”, “of” is a connector token. This helps the standardization engine recognize and process the full name of a business by indicating that the token connects two parts of the full name.

This file contains one column that lists the connector tokens in the business names you process. You can add entries as needed. Below is an excerpt from the connector tokens reference file.

AN
DE
DES
DOS
LA
LAS
LE
OF
THE

Business Country Key Type File

The country key type file (bizCountryTypeKeys.dat) lists countries and continents, along with their abbreviations and assigned nationalities. For continents, the abbreviation is “CON” to separate them from countries. This enables the standardization engine to recognize and process these values as countries or continents. You can add entries to the country key type file using the following syntax.

country abbreviation nationality

The following table describes the columns in the country key type file.

Table 4-14 Country Key Type File

Column Description
country The name of a country or continent.
abbreviation The common abbreviation for the specified country. The abbreviation for a continent is always “CON”.
nationality The nationality assigned to a person or business originating in the specified country.

Following is an excerpt from the country key type file.

AMERICA                       CON  AMERICAN
AFRICA                        CON  AFRICAN
EUROPE                        CON  EUROPEAN
ASIA                          CON  ASIAN
AFGHANISTAN                   AF   AFGHAN
ALBANIA                       AL   ALBANIAN
ALGERIA                       DZ   ALGERIAN

Business Industry Sector Reference File

The industry sector reference file (bizIndustryCategoryCode.dat) lists and groups various industry sectors and sub-sectors, and includes an identification code for each type so the standardization engine can identify and process the industry sectors for different businesses. You can add entries to the industry sector reference file using the following syntax.

sector-code industry-sector

The following table describes each column in the industry sector reference file.

Table 4-15 Industry Sector Reference File

Column Description
sector-code The identification code of the specified sector. The first two digits of each code identify the general industry sector; the last three digits identify a sub-sector.
industry-sector A description of the industry category. This is written in the format “sector - sub-sector”, where sector is a general category of industry types, and sub-sector is a specific industry within that category.

Following is an excerpt from the industry sector reference file.

02006         Automotive & Transport Equipment - Recreational Vehicles
02007         Automotive & Transport Equipment - Shipbuilding & Related Services
02008         Automotive & Transport Equipment - Trucks, Buses & Other Vehicles
03001         Banking - Banking
04001         Chemicals - Agricultural Chemicals
04002         Chemicals - Basic & Intermediate Chemicals & Petrochemicals
04003         Chemicals - Diversified Chemicals
04004         Chemicals - Paints, Coatings & Other Finishing Products
04005         Chemicals - Plastics & Fibers
04006         Chemicals - Specialty Chemicals
05001         Computer Hardware - Computer Peripherals
05002         Computer Hardware - Data Storage Devices
05003         Computer Hardware - Diversified Computer Products
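
The structure of a sector code and its description can be sketched in Python as follows. The function names are hypothetical, and the parsing assumes that the code and description are separated by whitespace and that the first " - " in the description separates the sector from the sub-sector, as in the excerpt.

```python
def split_sector_code(code):
    """Return (sector_digits, sub_sector_digits) for a five-digit code."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit sector code, got {code!r}")
    return code[:2], code[2:]


def parse_sector_line(line):
    """Split one entry into its code, sector, and sub-sector."""
    code, description = line.split(None, 1)
    # The description has the form "sector - sub-sector".
    sector, _, sub_sector = description.strip().partition(" - ")
    return {"code": code, "sector": sector, "sub_sector": sub_sector}


if __name__ == "__main__":
    entry = parse_sector_line("04006         Chemicals - Specialty Chemicals")
    print(entry, split_sector_code(entry["code"]))
```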

Business Industry Key Type File

The industry key type file (bizIndustryTypeKeys.dat) is used to standardize the value of the Industry field into common industries to which businesses belong so the standardization engine can recognize and process the industry types for different businesses. You can add entries to the industry key type file using the following syntax.

industry-type standardized-form sectors

The following table describes each column in the industry key type file.

Table 4-16 Industry Key Type File

Column Description
industry-type The original value of the industry type in the input record.
standardized-form The normalized version of the industry type. If this column contains a name instead of a zero, that name must also be listed in a different entry as an industry type with a standardized form of “0”.
sectors The industry categories of the specified industry type. These values correspond to the sector codes listed in the industry sector file (bizIndustryCategoryCode.dat). You can list as many categories as apply for each type, but they must be entered with a space between each and no line breaks, and they must correspond to an entry in the industry sector file.

Below is an excerpt from the industry key type file.

TECH                TECHNOLOGY          05001-05007
TECHNOLOGIES        TECHNOLOGY          05001-05007
TECHNOLOGY          0                   05001-05007
TECHSYSTEMS         0                   05001-05007
TELE PHONE          TELEPHONE           16005
TELE PHONES         TELEPHONES          16005
TELEVISION          TV                  11013  21014
TELECOM             0                   16005  26006  26009  26010
TELECOMM            TELECOMMUNICATION   16005  26006  26008
TELECOMMUNICATION   0                   16005  26006  26008

Business Organization Key Type File

The organization key type file (bizOrganizationTypeKeys.dat) is used to standardize the value of the Organization field into common organizations to which businesses belong. This helps the standardization engine recognize and process the organization types for different businesses. You can add entries to the organization key type file using the following syntax.

original-type standardized-form

The following table describes each column in the organization key type file.

Table 4-17 Organization Key Type File

Column Description
original-type The original value of the organization field in an input record.
standardized-form The normalized version of an organization type. A zero (0) in this field indicates that the value in the first column is already in its standardized form. If this column contains a name instead of a zero, that name must also be listed in a different entry as an original type with a standardized form of “0”.

Below is an excerpt from the organization key type file.

INC                 INCORPORATED
INCORPORATED        0
KG                  0
KK                  0
LIMITED             0
LIMITED PARTNERSHIP 0
LLC                 0
LLP                 0
LP                  LIMITED PARTNERSHIP
LTD                 LIMITED
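
The rule that every non-zero standardized form must have its own entry marked "0" can be checked before you deploy a modified file. The following Python sketch is one possible check, not an OHMPI utility. It assumes that the first column is 20 characters wide (inferred from the spacing of the excerpt, not documented) and that lookups are done in uppercase. All function names are hypothetical.

```python
KEY_WIDTH = 20  # Assumed column width, inferred from the excerpt spacing.

ROWS = [
    ("INC", "INCORPORATED"),
    ("INCORPORATED", "0"),
    ("LIMITED", "0"),
    ("LIMITED PARTNERSHIP", "0"),
    ("LP", "LIMITED PARTNERSHIP"),
    ("LTD", "LIMITED"),
]
# Rebuild fixed-width lines so the example does not depend on hand-typed spacing.
EXCERPT = "\n".join(f"{orig:<{KEY_WIDTH}}{std}" for orig, std in ROWS)


def load_key_types(text):
    """Read fixed-width lines into {original_type: standardized_form}."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            table[line[:KEY_WIDTH].strip()] = line[KEY_WIDTH:].strip()
    return table


def standardize(table, value):
    """Return the standardized form of value, or None if it is not listed."""
    key = value.upper()
    form = table.get(key)
    if form is None:
        return None
    # "0" means the first column is already the standardized form.
    return key if form == "0" else form


def find_dangling(table):
    """List standardized forms that lack their own entry marked "0"."""
    return sorted({form for form in table.values()
                   if form != "0" and table.get(form) != "0"})


if __name__ == "__main__":
    table = load_key_types(EXCERPT)
    print(standardize(table, "Ltd"), find_dangling(table))
```

The same check applies to the association and industry key type files, which follow the same "0" convention in their standardized columns.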

Business Patterns File

The business patterns file (bizpatterns.dat) defines multiple formats expected from the business name input fields along with the standardized output of each format. The patterns and output appear in two-row pairs in this file, as shown below.

4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

The first line describes the input pattern and the second describes the output pattern using tokens to denote each component. The supported tokens are described in Business Name Tokens. A number at the beginning of the first line indicates the number of components in the given business name format. You can modify this file using the following syntax.

length input-pattern
output-pattern

The following table lists and describes the components in the above syntax.

Table 4-18 Business Patterns File Components

Component Description
length The number of business name components in the input field.
input-pattern Tokens that represent a possible input pattern from the unparsed business name fields. Each token represents one component. For more information about business name tokens, see Business Name Tokens.
output-pattern Tokens that represent the output pattern for the specified input pattern. Each token represents one component. For more information about business name tokens, see Business Name Tokens.

Below is an excerpt from the business patterns file.

4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

4 NFG AJT SEP-GLC ORT
PNT PNT DEL ORT

4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT

4 CST IDT NF ORT
PNT PNT PNT ORT

4 PNT AJT SEP-GLC ORT
PNT PNT DEL ORT

Business Name Tokens

The business patterns file uses tokens to denote different components in a business name, such as the primary name, alias type key, URL, and so on. The file uses one set of tokens for input fields and another set for output fields. The tokens indicate the type key files to use to determine the appropriate values for each output field. You can use only the predefined tokens to represent business name components; the standardization engine does not recognize custom tokens.

The following table lists and describes each input token.

Table 4-19 Business Name Input Pattern Tokens

Pattern Identifier Description
CTT A connector token
PNT A primary name of a business
PN-PN A hyphenated primary name of a business
BCT A common business term
URL The URL of a business web site
ALT A business alias type key (usually an acronym)
CNT A country name
NAT A nationality
CST A city or state type key
IDT An industry type key
IDT-AJT Both an industry and an adjective type key
AJT An adjective type key
AST An association type key
ORT An organization type key
SEP A separator key
NFG Generic term, not recognized as a specific business name component, with an internal hyphen
NF Generic term, not recognized as a specific business name component
NFC A single character, not recognized as a specific business name component
SEP-GLC A joining comma (a glue type separator)
SEP-GLD A joining hyphen (a glue type separator)
AND The text “and”
GLU A glue type key, such as a forward slash, connecting two parts of a business name component
PN-NF A business primary name followed by a hyphen and a generic term that is not recognized as a specific business name component
NF-PN A generic term that is not recognized as a specific business name component, followed by a hyphen and a recognized business primary name
NF-NF Two generic terms, not recognized as specific business name components and separated by a hyphen

The following table lists and describes each output token.

Table 4-20 Business Name Output Pattern Tokens

Pattern Identifier Description
PNT The primary name of the business
URL The URL of the business
ALT The alias type key of the business (usually an acronym)
IDT The industry type key of the business
AST The association type key of the business
ORT The organization type key of the business
NF A generic term not recognized as a business name component
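
To make the relationship between input and output patterns concrete, the following Python sketch reads two-row pattern pairs and applies a matching pattern to a list of already-typed components. It is a simplified illustration, not the engine's algorithm. The function names are hypothetical, the output tokens are assumed to align position by position with the input tokens (as they do in the excerpt), and DEL, which appears in the excerpt but not in the output token table, is assumed to mean that the component is dropped.

```python
EXCERPT = """\
4 PNT AST SEP-GLC ORT
PNT AST DEL ORT

4 NF AJT SEP-GLC ORT
PNT PNT DEL ORT
"""


def parse_patterns(text):
    """Return {input_token_tuple: output_token_tuple} from two-row pairs."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) % 2:
        raise ValueError("patterns must come in input/output pairs")
    patterns = {}
    for head, output in zip(rows[0::2], rows[1::2]):
        length, tokens = int(head[0]), head[1:]
        if len(tokens) != length:
            raise ValueError(f"length mismatch: {' '.join(head)}")
        patterns[tuple(tokens)] = tuple(output)
    return patterns


def apply_pattern(patterns, components):
    """Map (text, input_token) pairs to {output_token: joined text}.

    Returns None when no pattern matches the input token sequence.
    """
    output = patterns.get(tuple(token for _, token in components))
    if output is None:
        return None
    if len(output) != len(components):
        raise ValueError("output pattern is not aligned with the input")
    result = {}
    for (text, _), out_token in zip(components, output):
        if out_token == "DEL":  # Assumption: DEL drops the component.
            continue
        result.setdefault(out_token, []).append(text)
    return {token: " ".join(parts) for token, parts in result.items()}


if __name__ == "__main__":
    patterns = parse_patterns(EXCERPT)
    typed = [("ACME", "NF"), ("GENERAL", "AJT"), (",", "SEP-GLC"), ("INC", "ORT")]
    print(apply_pattern(patterns, typed))
```

In this sketch, an unrecognized term followed by an adjective is folded into the primary name, the joining comma is discarded, and the organization type is kept as its own component.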

Business Name Standardization and Oracle Healthcare Master Person Index

Master person index applications rely on the OHMPI Standardization Engine to process business data. To ensure correct processing of business information, you need to customize the Matching Service for the master person index application according to the rules defined for the standardization engine. This includes modifying mefa.xml to define parsing and phonetic encoding of the appropriate fields. You can modify mefa.xml using the Master Person Index Configuration Editor.

Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in “Match Field Configuration” in Oracle Healthcare Master Person Index Configuration Reference (Part Number: E18592-01). To configure the required fields for parsing and normalization, modify the standardization structure in mefa.xml. To configure phonetic encoding, modify the phonetic encoding structure.

Generally, the BusinessName data type processes data that requires parsing prior to processing. You should not need to configure fields to normalize for business names. The following topics provide information about the fields used in processing business names and how to configure standardization for a master person index application. The information provided in these sections is based on the default configuration.

Business Name Processing Fields

When standardizing free-form business names, not all fields in a record need to be processed by the OHMPI Standardization Engine. The standardization engine only needs to process fields that must be parsed, normalized, or phonetically converted. For a master person index application, these fields are defined in mefa.xml, and processing logic for each field is defined in the Standardization Engine node configuration files.

Business Name Standardized Fields

The OHMPI Standardization Engine expects that business name data will be provided in a free-form text field containing several components that must be parsed. By default, the match engine is configured to parse these components, and to normalize and phonetically encode the business name. You can specify additional fields for phonetic encoding.

If you specify the BusinessName match type for any field in the wizard, a standardization structure for that field is defined in mefa.xml. The fields listed under "Business Name Object Structure" are automatically defined as the target fields. If you do not specify business name fields for matching in the wizard but want to standardize the fields, you can create a standardization structure in mefa.xml.

Business Name Object Structure

For the default configuration of the BusinessName data type, the name field specified for standardization is parsed into several additional fields, one of which is also normalized. If you specify the BusinessName match type in the wizard, the following fields are automatically added to the object structure and database creation script.

  • field_name_Name

  • field_name_NamePhon

  • field_name_OrgType

  • field_name_AssocType

  • field_name_Industry

  • field_name_Sector

  • field_name_Alias

  • field_name_Url

    where field_name is the name of the field for which you specified business name matching. For example, if you specify the BusinessName match type for the Company field, the fields automatically added to the structure include Company_Name, Company_NamePhon, Company_OrgType, and so on.

You can add these fields manually if you do not specify a match type in the wizard.

Configuring a Standardization Structure for Business Names

For free-form business name fields, the source fields you define for parsing should include the standardization components that are predefined for parsing and normalization. For example, fields containing business information can include any of the field components listed in "Business Name Standardization Components". The target fields can include any of these parsed fields. Follow the instructions under “Defining OHMPI Standardization Rules” in Oracle Healthcare Master Person Index Configuration Guide to define fields for standardization. For the standardization-type element, enter BusinessName. For a list of field IDs to use in the standardized-object-field-id element, see "Business Name Standardization Components".

Note:

In the default configuration, the rules defined for the BusinessName data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.

A sample standardization structure for business names is shown below. This structure parses a business name field into these standard business name fields: name, organization type, association type, sector, industry, and URL. Note that there is no domain selector specified, which would normally default to the United States domain; however, since business names are not variant dependent, it is irrelevant here.

<free-form-texts-to-standardize>
   <group standardization-type="BusinessName">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Company.Name    
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>PrimaryName
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Name
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>OrgTypeKeyword
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_OrgType
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>AssocTypeKeyword
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_AssocType
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>IndustrySectorList
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Sector
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>IndustryTypeKeyword
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Industry
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>Url
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Url
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
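
The sample above does not map the alias component, although a field_name_Alias field is added to the object structure by default. As a rough sketch, an additional target-mapping entry along the following lines could be placed inside the standardization-targets element. The field ID AliasList is an assumption and is not confirmed here; check "Business Name Standardization Components" for the actual ID before using it.

         <!-- Sketch only: the field ID "AliasList" is an assumption. -->
         <!-- Place this entry inside <standardization-targets>. -->
         <target-mapping>
            <standardized-object-field-id>AliasList
            </standardized-object-field-id>
            <standardized-target-field-name>Company.Name_Alias
            </standardized-target-field-name>
         </target-mapping>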

Configuring Phonetic Encoding for Business Names

When you match or standardize on business name fields, the business name field should be specified for phonetic conversion (by default, the wizard defines this for you). Follow the instructions under "Defining Phonetic Encoding for the Master Person Index" in the Oracle Healthcare Master Person Index Configuration Guide to define fields for phonetic encoding.

A sample of the phoneticize-fields element is shown below. This sample converts only the business name. You can define additional fields for phonetic encoding.

<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Company.Name_Name
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Company.Name_NamePhon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>
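
As a minimal sketch of defining an additional field, the following variation adds a second phoneticize-field entry for the alias field. The target field Company.Name_AliasPhon is hypothetical. It is not among the fields the wizard adds automatically, so it would need to be added to the object structure and database creation script before this configuration could be used.

<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Company.Name_Name
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Company.Name_NamePhon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
   <!-- Hypothetical second entry: Company.Name_AliasPhon is an assumed field name. -->
   <phoneticize-field>
      <unphoneticized-source-field-name>Company.Name_Alias
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Company.Name_AliasPhon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>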