Understanding the Master Index Standardization Engine

Address Data Standardization Files

Three configuration files define address processing logic for the Master Index Standardization Engine. These files provide information about address patterns and tokens to help the standardization engine determine how to recognize address components and break them out into their respective tokens. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data.

The address configuration files are located in the resource folder under each variant name for the Address data type. The following topics provide information about each configuration file.

Address Clues File

The address clues file (clues.dat) lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types, in which case a relevance value specifies which component type the term is most likely to be. For example, the term “Junction” is standardized as “Jct” and is classified as a street type, building unit, and generic term (in that order of relevance).

This file helps the Master Index Standardization Engine recognize common terms in street addresses in order to parse and normalize the values correctly. The syntax of this file is:

common-term normalized-term ID-number/type-token

You can modify or add entries in this file as needed. The following table describes the columns in the address clues file.

Table 4 Address Clues File Columns


Column	Description
common-term	A term commonly found in street addresses.
normalized-term	The normalized version of the common term.
ID-number/type-token	An ID number and a token indicating the type of address component represented by the common term. The ID number corresponds to an ID number in the address master clues file, and the type token corresponds to the type specified for that ID number in the address master clues file. One term might have several ID number and type token pairs. Their order of appearance indicates their relevance, with the most likely component type listed first.

Following is an excerpt from the US address clues file.

TRLR VLG          Trpk            59BU
TRPK              Trpk            59BU
TRPRK             Trpk            59BU
VILLA             Vlla            305TY          60BU
VLLA              Vlla            305TY          60BU
VILLAS            Vlla            60BU
VILL              Vlg             317TY          61BU        364AU
VILLAG            Vlg             317TY          61BU        364AU
VLG               Vlg             317TY          61BU        364AU
VILLAGE           Vlg             317TY          61BU        364AU
VILLG             Vlg             317TY          61BU        364AU
VILLIAGE          Vlg             317TY          61BU        364AU
VLGE              Vlg             317TY          61BU        364AU
VIVI              Vivi            62BU
VIVIENDA          Vivi            62BU
COLLEGE           Coll            64BU                       0AU
CLG               Coll            64BU
COTTAGE           Cott            65BU           65BP        0AU
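
The column layout described above can be illustrated with a short parsing sketch. This is not the engine's own reader; it assumes that fields are separated by two or more spaces (which holds for the excerpt, but the real file might use fixed-width columns), and the function name parse_clue_line is hypothetical.

```python
import re

# Assumption: columns are separated by two or more spaces, as in the excerpt.
FIELD_SEP = re.compile(r"\s{2,}")
# An ID number immediately followed by a two-character type token, e.g. "317TY".
ID_TYPE = re.compile(r"(\d+)([A-Z0-9]{2})")


def parse_clue_line(line):
    """Split one clues line into its term, normalized term, and ID/type pairs."""
    fields = FIELD_SEP.split(line.strip())
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    common_term, normalized_term, *pair_fields = fields
    pairs = []
    for field in pair_fields:
        match = ID_TYPE.fullmatch(field)
        if match is None:
            raise ValueError(f"not an ID-number/type-token pair: {field!r}")
        pairs.append((int(match.group(1)), match.group(2)))
    # Pairs are kept in file order, which indicates their relevance.
    return {
        "common_term": common_term,
        "normalized_term": normalized_term,
        "pairs": pairs,
    }


entry = parse_clue_line(
    "VILL              Vlg             317TY          61BU        364AU"
)
print(entry["normalized_term"], entry["pairs"])
```

Because the pairs are returned in file order, the first pair is the most relevant component type for the term.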

Address Master Clues File

The address master clues file (masterClues.dat) lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom’s Royal Mail, the Australian Postal Corporation, or France’s La Poste (depending on the variant in use). For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.

The syntax of this file is:

ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag

You can modify or add entries in this file as needed. The following table describes the columns in the address master clues file.

Table 5 Address Master Clues File Columns


Column	Description
ID-number	A unique identification number for the address common term. This number corresponds to an ID number for the same term in the address clues file.
common-term	A common address term, such as Park, Village, North, Route, Centre, and so on.
normalized-term	The normalized version of the common term.
short-abbrev	A short abbreviation of the common term.
postal-abbrev	The standard postal abbreviation of the common term.
CFCCS	The census feature class code of the term (as defined in the Census TIGER® database). The following values are used: A (Road), B (Railroad), C (Miscellaneous), D (Landmark), E (Physical feature), F (Nonvisible feature), H (Hydrography), X (Unclassified).
type-token	The type of address component represented by the common term. Types are specified by an address token (for more information, see Address Type Tokens).
usage-flag	A flag indicating how the term is used (for more information, see Pattern Classes).
postal-flag	The standard postal code for the term.

Following is an excerpt from the US address master clues file.

11Alley                    Alley            Al         Aly A        TY R U
12Alternate Route          Alt Rte          Alt        Alt A        TY R
15Arcade                   Arcade           Arc        Arc A        TY R U
16Arroyo                   Arroyo           Arryo      ArryHA       TY R
17Autopista                Atpta            Apta       AptaA        TY R
18Avenida                  Avenida          Ava        Ava A        TY R
19Avenue                   Avenue           Ave        Ave A        TY R U
26Boulevard                Blvd             Blvd       BlvdA        TY R U
32Bulevar                  Blvr             Blv        Blv A        TY R
33Business Route           Bus Rte          BusRt      BsRtA        TY R
34Bypass                   Bypass           Byp        Byp A        TY R U
36Calle                    Calle            Calle      ClleA        TY R
37Calleja                  Calleja          Cja        Cja A        TY R
38Callejon                 Callej           Cjon       CjonA        TY R
39Camino                   Camino           Cam        Cam A        TY R
47Carretera                Carrt            Carr       CarrA        TY R
48Causeway                 Cswy             Cswy       CswyAH       TY R U
51Center                   Center           Ctr        Ctr DA       TY R U
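
The CFCCS values listed in Table 5 can be decoded with a simple lookup. The sketch below is illustrative only. The excerpt contains values with more than one letter (such as "HA" and "DA"); treating each letter as a separate class code is an assumption of this sketch, not something the file format is documented to guarantee. The name describe_cfccs is hypothetical.

```python
# Census feature class codes, as listed in the CFCCS column description.
CFCC_CLASSES = {
    "A": "Road",
    "B": "Railroad",
    "C": "Miscellaneous",
    "D": "Landmark",
    "E": "Physical feature",
    "F": "Nonvisible feature",
    "H": "Hydrography",
    "X": "Unclassified",
}


def describe_cfccs(codes):
    """Return the class name for each letter in a CFCCS value.

    Assumption: a multi-letter value such as "HA" lists several classes,
    one letter per class.
    """
    names = []
    for code in codes.strip().upper():
        if code not in CFCC_CLASSES:
            raise ValueError(f"unknown census feature class code: {code!r}")
        names.append(CFCC_CLASSES[code])
    return names


print(describe_cfccs("A"))
print(describe_cfccs("HA"))
```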

Address Patterns File

The address patterns file (patterns.dat) defines the expected input patterns of each individual street address field being standardized so the Master Index Standardization Engine can recognize and process these values. Tokens indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags. Below is an example.

AU A1 TY                01 Oak B Street
NA NA ST                T* 75                TX

When an address is parsed, the lines of the address are delimited by a pipe (|) and each line is sent to the parser separately. The output tokens for each line are then concatenated, and the resulting output pattern is looked up in the address patterns file. If the pattern is found, output patterns are modified as indicated in the patterns file to resolve any ambiguities that might arise when two lines of address information contain common elements. The relative importance determines which pattern to use when the format of the input field matches more than one pattern. This file should only be modified by personnel with a thorough understanding of address patterns and tokens.
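
The line-splitting and concatenation steps described above can be sketched as follows. This is a simplified illustration, not the engine's implementation. The parse_line argument is a hypothetical stand-in for the per-line parser, and the canned token lists are invented for the example.

```python
def split_address_lines(address):
    """Split a multi-line address on the pipe delimiter, dropping empty lines."""
    return [part.strip() for part in address.split("|") if part.strip()]


def concatenate_output_tokens(address, parse_line):
    """Parse each address line separately and join the resulting output tokens.

    parse_line is a hypothetical callable standing in for the engine's
    per-line parser; it takes one line and returns a list of output tokens.
    """
    tokens = []
    for line in split_address_lines(address):
        tokens.extend(parse_line(line))
    return " ".join(tokens)


# Stand-in parser with canned results, for illustration only.
canned = {"123 Oak Street": ["HN", "NA", "ST"], "Suite 4": ["WD", "WI"]}
combined = concatenate_output_tokens("123 Oak Street|Suite 4", canned.__getitem__)
print(combined)  # HN NA ST WD WI
```

The combined pattern is what would then be looked up in the address patterns file.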

The syntax of this file is:

input-pattern example output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag

You can modify or add entries in this file as needed, subject to the caution above. The following table describes the columns in the address patterns file.

Table 6 Address Patterns File Columns


Column	Description
input-pattern	Tokens that represent a possible input pattern from an individual unparsed street address field. Each token represents one component. For more information about address tokens, see Address Type Tokens.
example	An example of a street address that fits the specified pattern. This file element is optional.
output-pattern	Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Master Index Standardization Engine. For more information about address tokens, see Address Type Tokens.
pattern-class	An indicator of the type of address component represented by the pattern. Possible pattern classes are listed in Pattern Classes.
pattern-modifier	An indicator of whether the priority of the pattern is averaged against other patterns that match the input. Pattern modifiers are listed in Pattern Modifiers.
priority	The priority weight to use for the pattern when the pattern is a sub-pattern of a larger input pattern. For more information, see Priority Indicators.
usage-flag	A flag indicating how the term is used (for more information, see Pattern Classes). This file element is optional.
exclude-flag	This file element is optional.

Following is an excerpt from the address patterns file.

NU DR TY A1 AU                     01   123 South Avenida B Oak
HN PD PT NA NA                     H* 70

NU DR TY NU DR                     01   123 South Avenida 1 West
HN PD PT NA SD                     H* 70

NU A1 TY AU TY                     01   123 C circle hill drive
HN HS NA NA ST                     H* 70

NU A1 AM A1 TY                     01   123 M & M road
HN NA NA NA ST                     H* 65

NU TY AU A1                        01   123 Avenida Oak B
HN PT NA NA                        H* 60

NU TY NU A1                        01   123 Avenida 1 B
HN PT NA NA                        H* 60
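
The two-row layout of each entry can be illustrated with a small parsing sketch. It is based only on the excerpt and the column descriptions, so treat it as a rough guide. The function name parse_pattern_pair is hypothetical. The field that appears between the input pattern and the example ("01" in the excerpt) is not described in this section, so the sketch keeps it as an uninterpreted string.

```python
# Input tokens from Table 7.
INPUT_TOKENS = {
    "A1", "AM", "AU", "BP", "BU", "BX", "DA", "DR", "EI", "EX", "FC",
    "HR", "MP", "NL", "NU", "OT", "PT", "RR", "SA", "TY", "WD", "WI",
}
PATTERN_CLASSES = set("HBWTRPN")
PATTERN_MODIFIERS = set("*+")


def parse_pattern_pair(input_row, output_row):
    """Parse the two rows of one pattern entry into a dictionary."""
    in_fields = input_row.split()
    count = 0
    # Leading input tokens form the input pattern.
    while count < len(in_fields) and in_fields[count] in INPUT_TOKENS:
        count += 1
    # The next field ("01" in the excerpt) is kept as-is; its meaning is
    # not described in this section.
    marker = in_fields[count] if count < len(in_fields) else None

    out_fields = output_row.split()
    # The class and modifier appear together, for example "H*".
    idx = next(
        (i for i, f in enumerate(out_fields)
         if len(f) == 2 and f[0] in PATTERN_CLASSES and f[1] in PATTERN_MODIFIERS),
        None,
    )
    if idx is None or idx + 1 >= len(out_fields):
        raise ValueError(f"no class, modifier, and priority in: {output_row!r}")
    return {
        "input_pattern": in_fields[:count],
        "marker": marker,
        "example": " ".join(in_fields[count + 1:]),
        "output_pattern": out_fields[:idx],
        "pattern_class": out_fields[idx][0],
        "pattern_modifier": out_fields[idx][1],
        "priority": int(out_fields[idx + 1]),
        "flags": out_fields[idx + 2:],  # optional trailing flags
    }


pattern = parse_pattern_pair(
    "NU DR TY A1 AU                     01   123 South Avenida B Oak",
    "HN PD PT NA NA                     H* 70",
)
print(pattern["output_pattern"], pattern["pattern_class"], pattern["priority"])
```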

Address Pattern File Components

The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.

Address Type Tokens

The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the Master Index Standardization Engine does not recognize custom tokens.

The following table lists and describes each input token.

Table 7 Input Address Pattern Type Tokens


Token	Description
A1	Alphabetic value, one character in length
AM	Ampersand
AU	Generic word
BP	Building property
BU	Building unit
BX	Post office box
DA	Dash (as a starting character)
DR	Street direction
EI	Extra information
EX	Extension
FC	Numeric fraction
HR	Highway route
MP	Mile posts
NL	Common words, such as “of”, “the”, and so on
NU	Numeric value
OT	Ordinal type
PT	Prefix type
RR	Rural route
SA	State abbreviation
TY	Street type
WD	Descriptor within the structure
WI	Identifier within the structure
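
To show how input tokens relate to the words of an address, the following sketch assigns one input token to each word. It is a deliberately simplified illustration and not the engine's tokenizer. SAMPLE_CLUES is a hypothetical stand-in for the clues file, the rule order is an assumption, and only a few of the tokens in Table 7 are covered.

```python
# Hypothetical clue entries mapping a term to its most relevant input token.
SAMPLE_CLUES = {
    "AVENIDA": "TY",
    "CIRCLE": "TY",
    "DRIVE": "TY",
    "ROAD": "TY",
    "SOUTH": "DR",
    "WEST": "DR",
    "OF": "NL",
    "THE": "NL",
}


def classify_word(word, clues=SAMPLE_CLUES):
    """Return an input token for one word (assumed, simplified rules)."""
    if word == "&":
        return "AM"  # ampersand
    if word.isdigit():
        return "NU"  # numeric value
    if word.upper() in clues:
        return clues[word.upper()]  # term known from the clues table
    if len(word) == 1 and word.isalpha():
        return "A1"  # single alphabetic character
    return "AU"  # generic word


def input_pattern(address_line):
    """Return the input tokens for a whitespace-separated address line."""
    return [classify_word(word) for word in address_line.split()]


print(" ".join(input_pattern("123 M & M road")))  # NU A1 AM A1 TY
```

With these sample clues, the result agrees with the input pattern shown for the same example address in the address patterns file excerpt.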

The following table lists and describes each output token.

Table 8 Output Address Pattern Tokens


Token	Description
1P	Building number prefix
2P	Second building number prefix
BD	Property or building directional suffix
BI	Structure (building) identifier
BN	Property or building name
BS	Building number suffix
BT	Property or building type suffix
BX	Post office box descriptor
BY	Structure (building) descriptor
DB	Property or building directional prefix
EI	Extra information
EX	Extension index
H1	First house number (the actual number)
H2	Second house number (house number suffix)
HN	House number
HS	House number suffix
N2	Second street name
NA	Street name
NB	Building number
NL	Conjunctions that connect words or phrases in one component type (usually the street name)
P1	House number prefix
P2	Second house number prefix
PD	Directional prefix to the street name
PT	Street type prefix to the street name
RR	Rural route descriptor
RN	Rural route identifier
S2	Street type suffix to the second street name
SD	Directional suffix to the street name
ST	Street type suffix to the street name
TB	Property or building type prefix
WI	Identifier within the structure
WD	Descriptor within the structure
XN	Post office box identifier
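
The output tokens can be read as labels for the words of the input. The sketch below pairs each word of an example address with the output token in the same position, using an entry from the address patterns file excerpt. The one-to-one positional mapping and the merging of adjacent words that share a token are assumptions made for illustration, and label_components is a hypothetical name.

```python
def label_components(example, output_pattern):
    """Pair each word with its output token, merging adjacent equal tokens.

    Assumptions: words and tokens correspond one-to-one by position, and
    neighboring words with the same token belong to the same component.
    """
    words = example.split()
    tokens = output_pattern.split()
    if len(words) != len(tokens):
        raise ValueError("example and output pattern differ in length")
    components = []
    for word, token in zip(words, tokens):
        if components and components[-1][0] == token:
            # Same token as the previous word: extend that component.
            components[-1] = (token, components[-1][1] + " " + word)
        else:
            components.append((token, word))
    return components


labeled = label_components("123 South Avenida B Oak", "HN PD PT NA NA")
for token, value in labeled:
    print(token, value)
```

Under these assumptions, the example yields a house number (HN), a directional prefix (PD), a street type prefix (PT), and a two-word street name (NA).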

Pattern Classes

Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.

H - the address pattern represents a house
B - the address pattern represents a building
W - the address pattern represents a unit within a structure, such as an apartment or suite number
T - the address pattern represents a street type or direction
R - the address pattern represents a rural route
P - the address pattern represents a Post Office box
N - the address pattern is mostly numeric

These classes are also specified as usage flags in the patterns file and the master clues file.

Pattern Modifiers

Each pattern class must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns are found to be sub-patterns of a larger input pattern. In these cases, the Master Index Standardization Engine must know how to prioritize each defined pattern that is part of the larger pattern. There are two pattern modifiers.

* - An asterisk indicates that the priority weight for the matching pattern is averaged down equally with the other matching sub-patterns.
+ - A plus sign indicates that the priority weight for the matching pattern is not averaged down equally with the other matching sub-patterns.

Priority Indicators

The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best when defined as a multiple of five from 35 through 95, inclusive. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the standardization engine stops searching for additional matching patterns and uses the high-priority matching pattern.
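
The priority rules above can be sketched in a few lines. This is an illustration only. It ignores the averaging controlled by the pattern modifiers, and choosing the highest remaining priority when no pattern reaches 90 is an assumption of the sketch. Both function names are hypothetical.

```python
def is_recommended_priority(value):
    """Check that a priority is a multiple of five from 35 through 95."""
    return 35 <= value <= 95 and value % 5 == 0


def select_pattern(matches):
    """Pick a pattern from (name, priority) pairs given in search order.

    A priority of 90 or higher ends the search immediately. Otherwise the
    highest priority seen wins (an assumption; modifier-based averaging
    is not modeled here).
    """
    best = None
    for name, priority in matches:
        if priority >= 90:
            return (name, priority)  # stop searching for further matches
        if best is None or priority > best[1]:
            best = (name, priority)
    return best


print(is_recommended_priority(70))
print(select_pattern([("first", 60), ("second", 90), ("third", 95)]))
```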