Several configuration files are used to define address processing logic for the Sun Match Engine. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data. There are no address standardization files that are common to all domains; all address files are domain-specific. These files are located within the domain-specific folders of the Standardization Engine node (with two exceptions noted below).
Address standardization files are specific to each domain and include patterns and clues files, as well as files that define internal and external constants. The domain corresponding to each file is indicated at the end of the file name; for example, addressConstantsUK.cfg and addressConstantsFR.cfg. These domain abbreviations are indicated by an asterisk (*) in the file descriptions in the following topics.
The address constants file defines certain information about the standardization files used for processing address data, primarily the number of lines contained in each file. The number of lines specified here must be equal to or greater than the number of lines actually contained in each file. The constants file for United States data is in the Standardization node of the project and is named addressConstants.cfg; the constants file for the other domains is located under the domain name node.
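The line-count rule above can be checked before deployment. The following is a minimal sketch, not part of the product: the function name `check_line_limit` and the idea of passing the declared maximum as an argument are assumptions made for illustration, because the actual parameter names are defined in the constants file itself.

```python
from pathlib import Path


def check_line_limit(data_file: Path, declared_max: int) -> bool:
    """Return True if data_file has no more lines than declared_max.

    declared_max is the value you would read from the constants file
    for the corresponding standardization file (hypothetical helper).
    """
    with data_file.open(encoding="utf-8", errors="replace") as handle:
        actual_lines = sum(1 for _ in handle)
    # The declared value must be equal to or greater than the real count.
    return actual_lines <= declared_max
```

Running a check like this after adding entries to a clues or patterns file helps confirm that the matching value in the constants file is still large enough.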
Table 15 lists and describes each parameter in the constants file. The files referenced by these parameters are described on the following pages.
Table 15 Address Constants File Parameters
| Parameter | Description |
|---|---|
| | The maximum number of words in a given address field. |
| | The maximum number of lines in the address clues file (addressClueAbbrev*.dat). |
| | The maximum number of lines in the patterns file (addressPatterns*.dat). |
| | The maximum length (in characters) of any pattern in the address patterns file. |
| | The maximum length of an input address field. |
| | The maximum output length of a street or property name. |
| | The maximum output length of a house number or rural route number within the structure identifier or post office box fields. |
| | The maximum output length of a directional field (prefix or suffix). |
| | The maximum output length of a street type field (prefix or suffix). |
| | The maximum length of a number prefix field. |
| | The maximum length of a number suffix field. |
| | The maximum output length of any extension field. |
| | The maximum output length of any miscellaneous information that is not recognized as a known type. |
The address clues file lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types. The relevance value specifies which of the component types the term is most likely to be. For example, the term “Junction” is standardized as “Jct”, and is classified as a street type, building unit, and generic term (giving relevance in that order).
This file helps the Sun Match Engine recognize common terms in street addresses and parse and normalize the values correctly. The syntax of this file is:
common-term normalized-term ID-number/type-token
You can modify or add entries in this table as needed. Table 16 describes the columns in the addressClueAbbrev*.dat file.
Table 16 Address Clues File Columns
| Column | Description |
|---|---|
| common-term | A term commonly found in street addresses. |
| normalized-term | The normalized version of the common term. |
| ID-number/type-token | An ID number and a token indicating the type of address component represented by the common term. The ID number corresponds to an ID number in the address master clues file, and the type token corresponds to the type specified for that ID number in the address master clues file. One term might have several ID number and token type pairs. |
Following is an excerpt from the addressClueAbbrevUS.dat file.
TRLR VLG Trpk 59BU
TRPK Trpk 59BU
TRPRK Trpk 59BU
VILLA Vlla 305TY 60BU
VLLA Vlla 305TY 60BU
VILLAS Vlla 60BU
VILL Vlg 317TY 61BU 364AU
VILLAG Vlg 317TY 61BU 364AU
VLG Vlg 317TY 61BU 364AU
VILLAGE Vlg 317TY 61BU 364AU
VILLG Vlg 317TY 61BU 364AU
VILLIAGE Vlg 317TY 61BU 364AU
VLGE Vlg 317TY 61BU 364AU
VIVI Vivi 62BU
VIVIENDA Vivi 62BU
COLLEGE Coll 64BU 0AU
CLG Coll 64BU
COTTAGE Cott 65BU 65BP 0AU
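To make the entry layout concrete, the following sketch splits one clues entry into its three parts. It is an illustration only and not the engine's own parser. It assumes whitespace-separated fields, a single-word normalized term, and ID-number/type-token pairs written as digits followed by a two-character token; `parse_clue_line` is a hypothetical name.

```python
import re

# An ID-number/type-token pair such as "317TY": digits, then a
# two-character type token (assumed format, based on the excerpt).
PAIR = re.compile(r"^(\d+)([A-Z][A-Z0-9])$")


def parse_clue_line(line):
    """Split a clues entry into (common_term, normalized_term, pairs)."""
    fields = line.split()
    pairs = []
    # Collect pairs from the end of the line, working backwards.
    while fields and PAIR.match(fields[-1]):
        match = PAIR.match(fields.pop())
        pairs.append((int(match.group(1)), match.group(2)))
    pairs.reverse()
    if len(fields) < 2 or not pairs:
        raise ValueError("not a recognizable clues entry: %r" % line)
    normalized_term = fields.pop()      # assumed to be a single word
    common_term = " ".join(fields)      # may contain several words
    return common_term, normalized_term, pairs
```

For the entry `VILL Vlg 317TY 61BU 364AU`, this returns the term `VILL`, the normalized value `Vlg`, and three pairs, listed in the order given in the file.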
The address internal constants file defines and configures tokens and array sizes used by the address standardizer. This file is used internally by the standardization engine and most of the parameters should not be modified.
One parameter you might need to modify is spCh, which defines any special characters that should not be removed from addresses during standardization. By default, the standardization process keeps hyphens (-), pound signs (#), forward slashes (/), ampersands (&), and pipes (|). Any other special characters found in the address are removed unless they are defined for the spCh parameter. Separate the special characters in the list with spaces, as shown below.
spCh = & < >
Characters that are not included in the standard ISO 8859-1 (Latin-1) character set must be represented as Unicode escapes preceded by a backslash (\). For example, use the following to retain left and right single quotes (‘ ’) in addresses:
spCh = \u2018 \u2019
Periods (.) and commas (,) are always removed from addresses, even if they are added to the spCh list.
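The rules for special characters described above can be sketched as follows. This is a simplified model rather than the engine's implementation: it assumes that a "special character" is anything that is neither alphanumeric nor whitespace, and the names `parse_spch` and `strip_special` are hypothetical.

```python
# Characters kept by default, and characters removed regardless of spCh.
DEFAULT_KEPT = set("-#/&|")
ALWAYS_REMOVED = set(".,")


def parse_spch(value):
    """Turn an spCh value (space-separated characters) into a set."""
    chars = set()
    for item in value.split():
        if item.startswith("\\u") and len(item) == 6:
            # Unicode escape written as backslash, "u", four hex digits.
            chars.add(chr(int(item[2:], 16)))
        elif len(item) == 1:
            chars.add(item)
        else:
            raise ValueError("unexpected spCh entry: %r" % item)
    return chars


def strip_special(address, spch_value=""):
    """Remove special characters that are not explicitly retained."""
    kept = (DEFAULT_KEPT | parse_spch(spch_value)) - ALWAYS_REMOVED
    return "".join(
        ch for ch in address
        if ch.isalnum() or ch.isspace() or ch in kept
    )
```

With an empty spCh value, `strip_special("Apt #4 <rear>")` drops the angle brackets but keeps the pound sign. Passing `"& < >"` keeps the brackets as well. Adding a period or comma to the value has no effect in this model, which mirrors the rule stated above.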
The address master clues file lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom’s Royal Mail, the Australian Postal Corporation, or France’s La Poste (depending on the domain in use). For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.
The syntax of this file is:
ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag
You can modify or add entries in this table as needed. Table 17 describes the columns in the addressMasterClues*.dat file.
Table 17 Address Master Clue File Columns
| Column | Description |
|---|---|
| ID-number | A unique identification number for the address common term. This number corresponds to an ID number for the same term in the address clues file. |
| common-term | A common address term, such as Park, Village, North, Route, Centre, and so on. |
| normalized-term | The normalized version of the common term. |
| short-abbrev | A short abbreviation of the common term. |
| postal-abbrev | The standard postal abbreviation of the common term. |
| CFCCS | The census feature class code of the term (as defined in the Census Tiger® database). The following values are used: |
| type-token | The type of address component represented by the common term. Types are specified by an address token (for more information, see Address Type Tokens). |
| usage-flag | A flag indicating how the term is used (for more information, see Pattern Classes). |
| postal-flag | The standard postal code for the term. |
Following is an excerpt from the addressMasterCluesUS.dat file.
11Alley Alley Al Aly A TY R U
12Alternate Route Alt Rte Alt Alt A TY R
15Arcade Arcade Arc Arc A TY R U
16Arroyo Arroyo Arryo ArryHA TY R
17Autopista Atpta Apta AptaA TY R
18Avenida Avenida Ava Ava A TY R
19Avenue Avenue Ave Ave A TY R U
26Boulevard Blvd Blvd BlvdA TY R U
32Bulevar Blvr Blv Blv A TY R
33Business Route Bus Rte BusRt BsRtA TY R
34Bypass Bypass Byp Byp A TY R U
36Calle Calle Calle ClleA TY R
37Calleja Calleja Cja Cja A TY R
38Callejon Callej Cjon CjonA TY R
39Camino Camino Cam Cam A TY R
47Carretera Carrt Carr CarrA TY R
48Causeway Cswy Cswy CswyAH TY R U
51Center Center Ctr Ctr DA TY R U
The address patterns file defines the expected input patterns of each individual street address field being standardized so the Sun Match Engine can recognize and process these values. Tokens are used to indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags (as shown below).
AU A1 TY 01 Oak B Street
NA NA ST T* 75 TX
When an address is parsed, the lines of the address are delimited by pipes (|) and each line is sent to the parser separately. The output tokens for each line are then concatenated, and the resulting output pattern is looked up in the addressOutPatterns*.dat file. If the pattern is found, the output pattern is modified as indicated in the addressOutPatterns*.dat file to resolve any ambiguities that might arise when two lines of address information contain common elements. The relative importance determines which pattern to use when the format of the input field matches more than one pattern. This file should only be modified by personnel with a thorough understanding of address patterns and tokens.
The syntax of this file is:
input-pattern example
output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag
You can modify or add entries in this table as needed. Table 18 describes the columns in the addressPatterns*.dat file.
Table 18 Address Patterns File
| Column | Description |
|---|---|
| input-pattern | Tokens that represent a possible input pattern from an individual unparsed street address field. Each token represents one component. For more information about address tokens, see Address Type Tokens. |
| example | An example of a street address that fits the specified pattern. This file element is optional. |
| output-pattern | Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Sun Match Engine. For more information about address tokens, see Address Type Tokens. |
| pattern-class | An indicator of the type of address component represented by the pattern. Possible pattern types are listed in Pattern Classes. |
| pattern-modifier | An indicator of whether the priority of the pattern is averaged against other patterns that match the input. Pattern modifiers are listed in Pattern Modifiers. |
| priority | The priority weight to use for the pattern when the pattern is a sub-pattern of a larger input pattern. For more information, see Priority Indicators. |
| usage-flag | A flag indicating how the term is used (for more information, see Pattern Classes). This file element is optional. |
| exclude-flag | This file element is optional. |
Following is an excerpt from the addressPatternsUS.dat file.
NU DR TY A1 AU 01 123 South Avenida B Oak
HN PD PT NA NA H* 70
NU DR TY NU DR 01 123 South Avenida 1 West
HN PD PT NA SD H* 70
NU A1 TY AU TY 01 123 C circle hill drive
HN HS NA NA ST H* 70
NU A1 AM A1 TY 01 123 M & M road
HN NA NA NA ST H* 65
NU TY AU A1 01 123 Avenida Oak B
HN PT NA NA H* 60
NU TY NU A1 01 123 Avenida 1 B
HN PT NA NA H* 60
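As an illustration of how the second (output) row of a pattern entry is laid out, the sketch below splits such a row into output tokens, pattern class, pattern modifier, priority, and any trailing flags. It is not the engine's parser. It assumes whitespace-separated fields and a single-letter pattern class written together with its modifier (for example `H*`); `OutputRow` and `parse_output_row` are hypothetical names, and trailing flags are returned uninterpreted.

```python
import re
from typing import List, NamedTuple


class OutputRow(NamedTuple):
    tokens: List[str]       # output tokens, such as HN or ST
    pattern_class: str      # single letter, such as H or T
    modifier: str           # "*" or "+"
    priority: int           # priority weight
    flags: List[str]        # optional trailing flags, left uninterpreted


# Pattern class followed directly by its modifier, e.g. "H*" or "T+".
CLASS_AND_MODIFIER = re.compile(r"^([A-Z])([*+])$")


def parse_output_row(row):
    """Parse the output row of an address patterns entry."""
    fields = row.split()
    for index, field in enumerate(fields):
        match = CLASS_AND_MODIFIER.match(field)
        if match:
            break
    else:
        raise ValueError("no pattern class and modifier in %r" % row)
    if index + 1 >= len(fields) or not fields[index + 1].isdigit():
        raise ValueError("no priority after the modifier in %r" % row)
    return OutputRow(
        tokens=fields[:index],
        pattern_class=match.group(1),
        modifier=match.group(2),
        priority=int(fields[index + 1]),
        flags=fields[index + 2:],
    )
```

Applied to `HN PD PT NA NA H* 70`, this yields five output tokens, class `H`, modifier `*`, and priority 70, with no flags.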
The address output patterns file uses the field patterns output by the addressPatterns*.dat file to determine how to parse all standardized address fields. As with the addressPatterns*.dat file, tokens are used to indicate the type of address component in the input and output data. This file contains two rows for each pattern. The first row defines the input pattern received from addressPatterns*.dat and provides an example. The second row defines the output pattern (as shown below).
EI|BN BT|* // HILLVIEW|FULBOURN HOSPITAL
BN|BI BY
The syntax of this file is:
input-pattern example
output-pattern
You can modify or add entries in this table as needed. Table 19 describes the columns in the addressOutPatterns*.dat file.
Table 19 Address Output Patterns File
| Column | Description |
|---|---|
| input-pattern | Tokens that represent a possible input pattern from addressPatterns*.dat. Each token represents one component and the pattern for each address field in the address is separated by a pipe (\|). For more information about address tokens, see Address Type Tokens. Note that this file only uses output tokens. |
| example | An example of a street address that fits the specified pattern. This file element is optional. |
| output-pattern | Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Sun Match Engine. For more information about address tokens, see Address Type Tokens. |
Following is an excerpt from the addressOutPatternsUS.dat file. In the first example, addressPatternsUS.dat outputs three address fields containing these components: building name and type; street name and type; and street name and type. addressOutPatternsUS.dat changes the tokens for the second street name and type to indicate that they are not the primary street name and type. Therefore, “New Bridge” is populated into the parsed street name field in the database.
BN BT|NA ST|NA ST|* // PROTEA HOUSE|NEW BRIDGE|MARINE PARADE
BN BT|NA ST|N2 S2
HN NA ST|HN NA ST|* // 21 HEIGHWAY COURT|45 BROOKLAND ROAD
HN NA ST|H2 N2 S2
HN NA ST|NA ST|* // 21 HEIGHWAY COURT|BROOKLAND ROAD
HN NA ST|N2 S2
NA ST|HN NA ST|* // HEIGHWAY COURT|45 BROOKLAND ROAD
NA ST|H2 N2 S2
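The lookup described for the output patterns file can be sketched as a simple table lookup. This is a simplified model, not the engine's logic: the `OUT_PATTERNS` dictionary is a hypothetical in-memory form of two entries from the excerpt, the trailing `|*` of each input row is dropped because its meaning is not covered here, and `resolve` is a hypothetical name.

```python
# Hypothetical in-memory form of two addressOutPatterns entries.
# Keys are the per-line output patterns joined with pipes.
OUT_PATTERNS = {
    "BN BT|NA ST|NA ST": "BN BT|NA ST|N2 S2",
    "HN NA ST|HN NA ST": "HN NA ST|H2 N2 S2",
}


def resolve(line_patterns):
    """Join per-line output patterns and apply a matching entry, if any.

    line_patterns holds one output pattern per address line, as produced
    by the address patterns step. Unlisted patterns pass through as is.
    """
    key = "|".join(line_patterns)
    resolved = OUT_PATTERNS.get(key, key)
    return resolved.split("|")
```

For the three-line address in the first example, `resolve(["BN BT", "NA ST", "NA ST"])` leaves the first two lines unchanged and relabels the third as `N2 S2`, so only the second line keeps the primary street tokens.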
The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.
The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the Sun Match Engine does not recognize custom tokens.
Table 20 lists and describes each input token; Table 21 lists and describes each output token.
Table 20 Input Address Pattern Type Tokens
| Token | Description |
|---|---|
| | Alphabetic value, one character in length |
| | Ampersand |
| | Generic word |
| | Building property |
| | Building unit |
| | Post office box |
| | Dash (as a starting character) |
| | Street direction |
| | Extra information |
| | Extension |
| | Numeric fraction |
| | Highway route |
| | Mile posts |
| | Common words, such as “of”, “the”, and so on |
| | Numeric value |
| | Ordinal type |
| | Prefix type |
| | Rural route |
| | State abbreviation |
| | Street type |
| | Descriptor within the structure |
| | Identifier within the structure |
Table 21 lists and describes each output token.
Table 21 Output Address Pattern Tokens
| Token | Description |
|---|---|
| | Building number prefix |
| | Second building number prefix |
| | Property or building directional suffix |
| | Structure (building) identifier |
| | Property or building name |
| | Building number suffix |
| | Property or building type suffix |
| | Post office box descriptor |
| | Structure (building) descriptor |
| | Property or building directional prefix |
| | Extra information |
| | Extension index |
| | First house number (the actual number) |
| | Second house number (house number suffix) |
| | House number |
| | House number suffix |
| | Second street name |
| | Street name |
| | Building number |
| | Conjunctions that connect words or phrases in one component type (usually the street name) |
| | House number prefix |
| | Second house number prefix |
| | Directional prefix to the street name |
| | Street type prefix to the street name |
| | Rural route descriptor |
| | Rural route identifier |
| | Street type suffix to the second street name |
| | Directional suffix to the street name |
| | Street type suffix to the street name |
| | Property or building type prefix |
| | Identifier within the structure |
| | Descriptor within the structure |
| | Post office box identifier |
Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.
W - the address pattern represents a unit within a structure, such as an apartment or suite number
T - the address pattern represents a street type or direction
These classes are also specified as usage flags in the patterns file and the master clues file.
Each pattern class must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns are found to be sub-patterns of a larger input pattern. In this case, the Sun Match Engine must know how to prioritize each defined pattern that is a part of the larger pattern. There are two pattern modifiers.
* - An asterisk indicates that the priority weight for the matching pattern is averaged down equally with the other matching sub-patterns.
+ - A plus sign indicates that the priority weight for the matching pattern is not averaged down equally with the other matching sub-patterns.
The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best when defined as a multiple of five from 35 to 95, inclusive. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the match engine stops searching for additional matching patterns and uses the high-priority matching pattern.
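The priority rules can be sketched as follows. This is a rough model under stated assumptions, not the engine's algorithm: it ignores the averaging applied by pattern modifiers, it assumes that the highest priority wins among patterns below 90, and the names `is_recommended_priority` and `choose_pattern` are hypothetical.

```python
def is_recommended_priority(priority):
    """Check that a priority is a multiple of five from 35 to 95."""
    return 35 <= priority <= 95 and priority % 5 == 0


def choose_pattern(candidates):
    """Pick one pattern from (name, priority) pairs given in search order.

    A priority of 90 or higher ends the search immediately. Otherwise the
    highest priority seen is used (an assumption for this sketch).
    Returns None when there are no candidates.
    """
    best = None
    for name, priority in candidates:
        if priority >= 90:
            return name, priority  # stop searching for further matches
        if best is None or priority > best[1]:
            best = (name, priority)
    return best
```

In this model, a candidate list of priorities 60, 90, and 95 returns the pattern with priority 90, because the search stops at the first high-priority match and never reaches the last entry.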