Understanding the Sun Match Engine

Sun Match Engine Address Data Type Configuration

Processing street addresses involves parsing, normalizing, and phonetically encoding certain fields prior to matching. The following topics describe the configuration files that define address processing logic and provide instructions for modifying the Match Field file for processing address fields.

Sun Match Engine Address Matching Overview

Matching on the address data type includes both standardizing and matching on address information in the master index application. You can implement street address standardization and matching on its own, or within a master index application designed to process person or business information. For example, standardizing address information allows you to include address fields as search criteria, even though matching might not be performed against these fields.

The Sun Match Engine can create standardized and phonetic values for street address information. Several configuration files designed specifically for address data define additional logic for the standardization and phonetic encoding process. These include address clues files, a patterns file, and a constants file. The United States address standardization engine is based on work performed at the U.S. Census Bureau. The clues files, in particular, are based on Census Bureau statistics.

The Sun Match Engine can match on any field as long as the match type for the field is defined in the match configuration file (matchConfigFile.cfg).

For information about the fields involved in address standardization and matching, see Sun Match Engine Address Data Processing Fields.

Sun Match Engine Address Data Processing Fields

When matching on address data, not all fields in a record need to be processed by the Sun Match Engine. The match engine only needs to process address fields that must be parsed, normalized, or phonetically encoded, and the fields against which matching is performed. These fields are defined in the Match Field file and processing logic for each field is defined in the standardization and matching configuration files.

Address Data Match String Fields

The match string processed by the Sun Match Engine is defined by the match fields specified in the Match Field file. If you specify an “Address” match type for any field in the wizard, the default fields that store the parsed data are automatically added to the match string in the Match Field file. These fields include the house number, street direction, street type, and street name. You can remove any of these fields from the match string.

The match engine can process any combination of fields you specify for matching. By default, the match configuration file (matchConfigFile.cfg) includes rows specifically for matching on the fields that are parsed from the street address fields, such as the street number, street direction, and so on. The file also defines several generic match types. You can use any of the existing rows for matching or you can add rows for the fields you want to match.

Address Data Standardized Fields

The Sun Match Engine expects that street address data will be provided in a free-form text field containing several components that must be parsed. By default, the match engine is configured to parse these components and to normalize and phonetically encode the street name. You can specify additional fields for phonetic encoding.

If you specify an “Address” match type for any field in the wizard, a standardization structure for that field is defined in the Match Field file. The fields listed below under Address Data Object Structure are automatically defined as the target fields. Each of these fields has several entries in the standardization structure. This is because different parsed components can be stored in the same field. For example, the house number, post office box number, and rural route identifier are all stored in the house number field. If you do not specify address fields for matching in the wizard but want to standardize the fields, you can create a standardization structure in the Match Field file.

Address Data Object Structure

The address fields specified for standardization are parsed into several additional fields. If you specify the “Address” match type in the wizard, the following fields are automatically added to the object structure and database creation script.

You can add these fields manually if you do not specify a match type in the wizard.


Note –

The object structure for Sun Master Patient Index uses a slightly different naming convention. For the names of the fields defined for Sun Master Patient Index, refer to the Sun SeeBeyond eIndex Single Patient View User’s Guide.


Match Configuration for Address Data (Repository)

The default match configuration file, matchConfigFile.cfg, defines several match types for the kinds of address data typically included in the match string. You can customize the existing match types or create new match types for the data being processed. The following match types are typical for matching on address data.

  • StreetName

  • StreetDir

  • HouseNumber

  • StreetType

In addition, you can use any of these generic match types for matching on address data.

  • String

  • SSN

  • Date

  • Char

  • Numeric

  • pro

  • Integer

  • Exac

  • Real

 

The match configuration file appears under the Match Engine node of the master index project. For more information about the comparison functions used for each match type and how the weights are tuned, see Customizing the Match Configuration and Match Configuration Comparison Functions for Sun Match Engine (Repository).

Sun Match Engine Standardization Configuration for Address Data

Several configuration files are used to define address processing logic for the Sun Match Engine. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for address data. There are no address standardization files that are common to all domains; all address files are domain-specific. These files are located within the domain-specific folders of the Standardization Engine node (with two exceptions noted below).

Address standardization files are specific to each domain and include patterns and clues files, as well as files that define internal and external constants. The domain corresponding to each file is indicated at the end of the file name; for example, addressConstantsUK.cfg and addressConstantsFR.cfg. These domain abbreviations are indicated by an asterisk (*) in the file descriptions in the following topics.

The Address Constants File (addressConstants*.cfg)

The address constants file defines certain information about the standardization files used for processing address data, primarily the number of lines contained in each file. The number of lines specified here must be equal to or greater than the number of lines actually contained in each file. The constants file for United States data is in the Standardization node of the project and is named addressConstants.cfg; the constants file for the other domains is located under the domain name node.

Table 15 lists and describes each parameter in the constants file. The files referenced by these parameters are described on the following pages.

Table 15 Address Constants File Parameters

Parameter                 Description
maxWords                  The maximum number of words in a given address field.
clueArraySize             The maximum number of lines in the address clues file (addressClueAbbrev*.dat).
patternArraySize          The maximum number of lines in the patterns file (addressPatterns*.dat).
maxPattSize               The maximum length (in characters) of any pattern in the address patterns file.
imageSize                 The maximum length of an input address field.
nameOutputFieldSize       The maximum output length of a street or property name.
numberOutputFieldSize     The maximum output length of a house number or rural route number within the structure identifier or post office box fields.
directionOutputFieldSize  The maximum output length of a directional field (prefix or suffix).
typeOutputFieldSize       The maximum output length of a street type field (prefix or suffix).
prefixOutputFieldSize     The maximum length of a number prefix field.
suffixOutputFieldSize     The maximum length of a number suffix field.
extensionOutputFieldSize  The maximum output length of any extension field.
extrainfoOutputFieldSize  The maximum output length of any miscellaneous information that is not recognized as a known type.
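
The sizing rule described above (each declared size must be equal to or greater than the actual line count) can be checked before deployment. The following is a minimal sketch, not part of the product: it assumes the constants file uses "name = value" lines, which this section does not confirm, and the helper names and numeric values are hypothetical.

```python
def parse_constants(text):
    """Parse 'name = value' lines into a dict of ints (assumed file layout)."""
    values = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, raw = line.partition("=")
        raw = raw.strip()
        if raw.isdigit():
            values[name.strip()] = int(raw)
    return values


def undersized_parameters(constants, line_counts):
    """Return the parameters whose declared size is below the actual count.

    line_counts maps a parameter name (for example 'clueArraySize') to the
    number of lines counted in the file that the parameter describes.
    """
    return sorted(
        name
        for name, actual in line_counts.items()
        if constants.get(name, 0) < actual
    )


# Illustrative values only; they are not taken from any shipped file.
constants = parse_constants("clueArraySize = 4000\npatternArraySize = 1500\n")
problems = undersized_parameters(
    constants, {"clueArraySize": 3950, "patternArraySize": 1620}
)
print(problems)
```

Here patternArraySize is reported because the hypothetical patterns file has more lines than the declared maximum.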

The Address Clues File (addressClueAbbrev*.dat)

The address clues file lists common terms in street addresses, specifies a normalized value for each common term, and categorizes the terms into street address component types. A term can be categorized into multiple component types. The relevance value specifies which of the component types the term is most likely to be. For example, the term “Junction” is standardized as “Jct”, and is classified as a street type, building unit, and generic term (giving relevance in that order).

This file helps the Sun Match Engine recognize common terms in street addresses, and to parse and normalize the values correctly. The syntax of this file is:

common-term normalized-term ID-number/type-token

You can modify or add entries in this table as needed. Table 16 describes the columns in the addressClueAbbrev*.dat file.

Table 16 Address Clues File Columns

Column                Description
common-term           A term commonly found in street addresses.
normalized-term       The normalized version of the common term.
ID-number/type-token  An ID number and a token indicating the type of address component represented by the common term. The ID number corresponds to an ID number in the address master clues file, and the type token corresponds to the type specified for that ID number in the address master clues file. One term might have several ID number and token type pairs.

Following is an excerpt from the addressClueAbbrevUS.dat file.


TRLR VLG          Trpk            59BU
TRPK              Trpk            59BU
TRPRK             Trpk            59BU
VILLA             Vlla            305TY          60BU
VLLA              Vlla            305TY          60BU
VILLAS            Vlla            60BU
VILL              Vlg             317TY          61BU        364AU
VILLAG            Vlg             317TY          61BU        364AU
VLG               Vlg             317TY          61BU        364AU
VILLAGE           Vlg             317TY          61BU        364AU
VILLG             Vlg             317TY          61BU        364AU
VILLIAGE          Vlg             317TY          61BU        364AU
VLGE              Vlg             317TY          61BU        364AU
VIVI              Vivi            62BU
VIVIENDA          Vivi            62BU
COLLEGE           Coll            64BU                       0AU
CLG               Coll            64BU
COTTAGE           Cott            65BU           65BP        0AU
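
To make the three-column syntax concrete, the sketch below splits one clues line into its common term, normalized term, and ID-number/type-token pairs. It is an illustration under stated assumptions: columns are taken to be separated by two or more spaces (the real file may be fixed-width), and parse_clue_line is a hypothetical helper, not an engine API.

```python
import re

# An ID number followed by a two-character type token, for example "317TY".
PAIR = re.compile(r"(\d+)([A-Z][A-Z0-9])")


def parse_clue_line(line):
    """Split a clues line into (common term, normalized term, [(id, token)])."""
    # Assumption: columns are separated by runs of two or more spaces, so a
    # multi-word common term such as "TRLR VLG" stays in one piece.
    parts = re.split(r"\s{2,}", line.strip())
    common, normalized, pairs = parts[0], parts[1], parts[2:]
    types = []
    for pair in pairs:
        match = PAIR.fullmatch(pair)
        if match is None:
            raise ValueError("unexpected ID/type pair: " + pair)
        types.append((int(match.group(1)), match.group(2)))
    return common, normalized, types


village = parse_clue_line(
    "VILL              Vlg             317TY          61BU        364AU"
)
trailer = parse_clue_line("TRLR VLG          Trpk            59BU")
print(village)
print(trailer)
```

Each pair links the term to an entry in the address master clues file by ID number, so "VILL" can be read as a street type, a building unit, or a generic word.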

The Address Internal Constants File (addressInternalConstants*.cfg)

The address internal constants file defines and configures tokens and array sizes used by the address standardizer. This file is used internally by the standardization engine and most of the parameters should not be modified.

One parameter you might need to modify is spCh, which defines any special characters that should not be removed from addresses during standardization. By default, the standardization process keeps hyphens (-), pound signs (#), forward slashes (/), ampersands (&), and pipes (|). Any other special characters found in the address are removed unless they are defined for the spCh parameter. Delineate each special character in the list with a space, as shown below.

spCh = & < >

Characters that are not included in the standard ISO 8859-1 (Latin-1) character set must be preceded by a backslash (\) and represented in Unicode. For example, use the following to retain left and right single quotes (‘ ’) in addresses:


spCh = \u2018 \u2019

Note –

Periods (.) and commas (,) are always removed from addresses, even if they are added to the spCh list.
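
The retention rules above can be sketched as a small filter. This is an approximation for illustration only: it assumes a "special character" is anything that is neither alphanumeric nor whitespace, and that removed characters are deleted rather than replaced with a space. Neither assumption is confirmed here, and both function names are hypothetical.

```python
import re

DEFAULT_KEEP = set("-#/&|")   # kept by default, per the text above
ALWAYS_REMOVED = set(".,")    # removed even when listed in spCh


def parse_spch(value):
    """Turn an spCh value such as '& < >' or '\\u2018 \\u2019' into a set."""
    chars = set()
    for item in value.split():
        escape = re.fullmatch(r"\\u([0-9a-fA-F]{4})", item)
        chars.add(chr(int(escape.group(1), 16)) if escape else item)
    return chars


def clean_address(text, spch=""):
    """Drop special characters that are not retained by default or by spCh."""
    keep = (DEFAULT_KEEP | parse_spch(spch)) - ALWAYS_REMOVED
    return "".join(
        ch for ch in text if ch.isalnum() or ch.isspace() or ch in keep
    )


print(clean_address("12 O'Neil St., Apt #4"))
print(clean_address("12 O'Neil St., Apt #4", spch=". '"))
```

In the second call the apostrophe survives because it is listed in spCh, while the period is still removed.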


The Address Master Clues File (addressMasterClues*.dat)

The address master clues file lists common terms in street addresses as defined by the United States Postal Service (USPS), the United Kingdom’s Royal Mail, the Australian Postal Corporation, or France’s La Poste (depending on the domain in use). For each common term, this file specifies a normalized value, defines postal information, and categorizes the terms into street address component types. A term can be categorized into multiple component types.

The syntax of this file is:

ID-number common-term normalized-term short-abbrev postal-abbrev CFCCS type-token usage-flag postal-flag

You can modify or add entries in this table as needed. Table 17 describes the columns in the addressMasterClues*.dat file.

Table 17 Address Master Clue File Columns

Column           Description
ID-number        A unique identification number for the address common term. This number corresponds to an ID number for the same term in the address clues file.
common-term      A common address term, such as Park, Village, North, Route, Centre, and so on.
normalized-term  The normalized version of the common term.
short-abbrev     A short abbreviation of the common term.
postal-abbrev    The standard postal abbreviation of the common term.
CFCCS            The census feature class code of the term (as defined in the Census TIGER® database). The following values are used:

  • A – Road

  • B – Railroad

  • C – Miscellaneous

  • D – Landmark

  • E – Physical feature

  • F – Nonvisible feature

  • H – Hydrography

  • X – Unclassified

type-token       The type of address component represented by the common term. Types are specified by an address token (for more information, see Address Type Tokens).
usage-flag       A flag indicating how the term is used (for more information, see Pattern Classes).
postal-flag      The standard postal code for the term.

Following is an excerpt from the addressMasterCluesUS.dat file.


11Alley                    Alley            Al         Aly A        TY R U
12Alternate Route          Alt Rte          Alt        Alt A        TY R
15Arcade                   Arcade           Arc        Arc A        TY R U
16Arroyo                   Arroyo           Arryo      ArryHA       TY R
17Autopista                Atpta            Apta       AptaA        TY R
18Avenida                  Avenida          Ava        Ava A        TY R
19Avenue                   Avenue           Ave        Ave A        TY R U
26Boulevard                Blvd             Blvd       BlvdA        TY R U
32Bulevar                  Blvr             Blv        Blv A        TY R
33Business Route           Bus Rte          BusRt      BsRtA        TY R
34Bypass                   Bypass           Byp        Byp A        TY R U
36Calle                    Calle            Calle      ClleA        TY R
37Calleja                  Calleja          Cja        Cja A        TY R
38Callejon                 Callej           Cjon       CjonA        TY R
39Camino                   Camino           Cam        Cam A        TY R
47Carretera                Carrt            Carr       CarrA        TY R
48Causeway                 Cswy             Cswy       CswyAH       TY R U
51Center                   Center           Ctr        Ctr DA       TY R U

The Address Patterns File (addressPatterns*.dat)

The address patterns file defines the expected input patterns of each individual street address field being standardized so the Sun Match Engine can recognize and process these values. Tokens are used to indicate the type of address component in the input and output fields. This file contains two rows for each pattern. The first row defines the input pattern for each address field and provides an example. The second row defines the output pattern for each address field, the pattern type, the relative importance of the pattern compared to other patterns, and usage flags (as shown below).


AU A1 TY                01 Oak B Street
NA NA ST                T* 75                TX

When an address is parsed, each line of the address is delineated by a pipe (|) and sent to the parser separately. The output tokens for each line are then concatenated and the output pattern is processed using the addressOutPatterns*.dat file to determine whether the output pattern is listed in the file. If the pattern is found, output patterns are modified as indicated in the addressOutPatterns*.dat file to resolve any ambiguities that might arise when two lines of address information contain common elements. The relative importance determines which pattern to use when the format of the input field matches more than one pattern. This file should only be modified by personnel with a thorough understanding of address patterns and tokens.

The syntax of this file is:

input-pattern example output-pattern pattern-class pattern-modifier priority usage-flag exclude-flag

You can modify or add entries in this table as needed. Table 18 describes the columns in the addressPatterns*.dat file.

Table 18 Address Patterns File

Column            Description
input-pattern     Tokens that represent a possible input pattern from an individual unparsed street address field. Each token represents one component. For more information about address tokens, see Address Type Tokens.
example           An example of a street address that fits the specified pattern. This file element is optional.
output-pattern    Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Sun Match Engine. For more information about address tokens, see Address Type Tokens.
pattern-class     An indicator of the type of address component represented by the pattern. Possible pattern types are listed in Pattern Classes.
pattern-modifier  An indicator of whether the priority of the pattern is averaged against other patterns that match the input. Pattern modifiers are listed in Pattern Modifiers.
priority          The priority weight to use for the pattern when the pattern is a sub-pattern of a larger input pattern. For more information, see Priority Indicators.
usage-flag        A flag indicating how the term is used (for more information, see Pattern Classes). This file element is optional.
exclude-flag      This file element is optional.

Following is an excerpt from the addressPatternsUS.dat file.


NU DR TY A1 AU                     01   123 South Avenida B Oak
HN PD PT NA NA                     H* 70

NU DR TY NU DR                     01   123 South Avenida 1 West
HN PD PT NA SD                     H* 70

NU A1 TY AU TY                     01   123 C circle hill drive
HN HS NA NA ST                     H* 70

NU A1 AM A1 TY                     01   123 M & M road
HN NA NA NA ST                     H* 65

NU TY AU A1                        01   123 Avenida Oak B
HN PT NA NA                        H* 60

NU TY NU A1                        01   123 Avenida 1 B
HN PT NA NA                        H* 60
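
The two-row layout can be read programmatically. The sketch below parses one entry from the excerpt and then applies its output tokens to the example address. It rests on several assumptions that are not confirmed by this section: columns are separated by two or more spaces, "H* 70" is read as pattern class H, modifier *, and priority 70, and each word of the address corresponds to exactly one token. The second field on the input row (01 in the excerpt) is not described in the text, so the sketch skips it. The function names are hypothetical.

```python
import re


def parse_pattern_entry(input_row, output_row):
    """Parse one two-row pattern entry into a dict (assumed layout)."""
    in_parts = re.split(r"\s{2,}", input_row.strip())
    out_parts = re.split(r"\s{2,}", output_row.strip())
    class_and_modifier, priority = out_parts[1].split()
    return {
        "input": in_parts[0].split(),
        # in_parts[1] ("01" in the excerpt) is undocumented and ignored here.
        "example": in_parts[2] if len(in_parts) > 2 else "",
        "output": out_parts[0].split(),
        "pattern_class": class_and_modifier[0],
        "modifier": class_and_modifier[1:],
        "priority": int(priority),
    }


def label_words(entry, address):
    """Pair each word of an address with the entry's output token."""
    words = address.split()
    if len(words) != len(entry["output"]):
        raise ValueError("address does not fit this pattern")
    labeled = {}
    for token, word in zip(entry["output"], words):
        # Adjacent words that share a token (NA NA) form one component.
        labeled.setdefault(token, []).append(word)
    return {token: " ".join(parts) for token, parts in labeled.items()}


entry = parse_pattern_entry(
    "NU DR TY A1 AU                     01   123 South Avenida B Oak",
    "HN PD PT NA NA                     H* 70",
)
labeled = label_words(entry, entry["example"])
print(labeled)
```

Under these assumptions, "123 South Avenida B Oak" yields house number 123, directional prefix South, street type prefix Avenida, and street name "B Oak".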

The Address Output Patterns File (addressOutPatterns*.dat)

The address output patterns file uses the field patterns output by the addressPatterns*.dat file to determine how to parse all standardized address fields. As with the addressPatterns*.dat file, tokens are used to indicate the type of address component in the input and output data. This file contains two rows for each pattern. The first row defines the input pattern received from addressPatterns*.dat and provides an example. The second row defines the output pattern (as shown below).


EI|BN BT|*          // HILLVIEW|FULBOURN HOSPITAL
BN|BI BY

The syntax of this file is:


input-pattern             example
output-pattern

You can modify or add entries in this table as needed. Table 19 describes the columns in the addressOutPatterns*.dat file.

Table 19 Address Output Patterns File

Column          Description
input-pattern   Tokens that represent a possible input pattern from addressPatterns*.dat. Each token represents one component and the pattern for each address field in the address is separated by a pipe (|). For more information about address tokens, see Address Type Tokens. Note that this file only uses output tokens.
example         An example of a street address that fits the specified pattern. This file element is optional.
output-pattern  Tokens that represent the output pattern for the specified input pattern. Each token represents one component of the output of the Sun Match Engine. For more information about address tokens, see Address Type Tokens.

Following is an excerpt from the addressOutPatternsUS.dat file. In the first example, addressPatternsUS.dat outputs three address fields containing these components: building name and type; street name and type; and street name and type. addressOutPatternsUS.dat then changes the tokens for the second street name and type to indicate that they are not the primary street name and type. Therefore, “New Bridge” is populated into the parsed street name field in the database.


BN BT|NA ST|NA ST|*          // PROTEA HOUSE|NEW BRIDGE|MARINE PARADE
BN BT|NA ST|N2 S2

HN NA ST|HN NA ST|*          // 21 HEIGHWAY COURT|45 BROOKLAND ROAD
HN NA ST|H2 N2 S2

HN NA ST|NA ST|*             // 21 HEIGHWAY COURT|BROOKLAND ROAD
HN NA ST|N2 S2

NA ST|HN NA ST|*             // HEIGHWAY COURT|45 BROOKLAND ROAD
NA ST|H2 N2 S2
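
The lookup described above, in which the per-field output tokens are concatenated and then rewritten when a matching entry exists, might be sketched as follows. This is a simplified illustration: it treats the trailing "|*" as an opaque part of the lookup key because its exact meaning is not stated here, and both function names are hypothetical.

```python
EXCERPT = """\
BN BT|NA ST|NA ST|*          // PROTEA HOUSE|NEW BRIDGE|MARINE PARADE
BN BT|NA ST|N2 S2

HN NA ST|NA ST|*             // 21 HEIGHWAY COURT|BROOKLAND ROAD
HN NA ST|N2 S2
"""


def parse_out_patterns(text):
    """Map each input pattern (first row) to its output pattern (second row)."""
    rows = [line for line in text.splitlines() if line.strip()]
    rules = {}
    for first, second in zip(rows[0::2], rows[1::2]):
        pattern = first.split("//", 1)[0].strip()  # drop the example comment
        rules[pattern] = second.strip()
    return rules


def resolve(rules, field_patterns):
    """Rewrite the per-field token patterns if a matching rule exists."""
    # Assumption: the key is the pipe-joined field patterns plus "|*".
    key = "|".join(field_patterns) + "|*"
    return rules.get(key, "|".join(field_patterns)).split("|")


rules = parse_out_patterns(EXCERPT)
resolved = resolve(rules, ["BN BT", "NA ST", "NA ST"])
unchanged = resolve(rules, ["NA ST"])
print(resolved)
print(unchanged)
```

The second street name and type in the first call are rewritten to N2 and S2, while a pattern with no matching entry passes through unchanged.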

Address Pattern File Components

The address patterns files use pattern type tokens, pattern classes, pattern modifiers, and priority indicators to process and parse address data. Before modifying any of the patterns files, you must have a good understanding of these file components.

Address Type Tokens

The address pattern and clues files use tokens to denote different components in a street address, such as street type, house number, street names, and so on. These files use one set of tokens for input fields and another set for output fields. You can use only the predefined tokens to represent address components; the Sun Match Engine does not recognize custom tokens.

Table 20 lists and describes each input token; Table 21 lists and describes each output token.

Table 20 Input Address Pattern Type Tokens

Token   Description
A1      Alphabetic value, one character in length
AM      Ampersand
AU      Generic word
BP      Building property
BU      Building unit
BX      Post office box
DA      Dash (as a starting character)
DR      Street direction
EI      Extra information
EX      Extension
FC      Numeric fraction
HR      Highway route
MP      Mile posts
NL      Common words, such as “of”, “the”, and so on
NU      Numeric value
OT      Ordinal type
PT      Prefix type
RR      Rural route
SA      State abbreviation
TY      Street type
WD      Descriptor within the structure
WI      Identifier within the structure

Table 21 lists and describes each output token.

Table 21 Output Address Pattern Tokens

Token   Description
1P      Building number prefix
2P      Second building number prefix
BD      Property or building directional suffix
BI      Structure (building) identifier
BN      Property or building name
BS      Building number suffix
BT      Property or building type suffix
BX      Post office box descriptor
BY      Structure (building) descriptor
DB      Property or building directional prefix
EI      Extra information
EX      Extension index
H1      First house number (the actual number)
H2      Second house number (house number suffix)
HN      House number
HS      House number suffix
N2      Second street name
NA      Street name
NB      Building number
NL      Conjunctions that connect words or phrases in one component type (usually the street name)
P1      House number prefix
P2      Second house number prefix
PD      Directional prefix to the street name
PT      Street type prefix to the street name
RR      Rural route descriptor
RN      Rural route identifier
S2      Street type suffix to the second street name
SD      Directional suffix to the street name
ST      Street type suffix to the street name
TB      Property or building type prefix
WI      Identifier within the structure
WD      Descriptor within the structure
XN      Post office box identifier

Pattern Classes

Each pattern defined in the address patterns file must have an associated pattern class. The pattern class indicates a portion of the input pattern or the type of address data that is represented by the pattern. You can specify any of the following pattern classes.

These classes are also specified as usage flags in the patterns file and the master clues file.

Pattern Modifiers

Each pattern type must be followed by a pattern modifier that indicates how to handle cases where one or more defined patterns is found to be a sub-pattern of a larger input pattern. In this case, the Sun Match Engine must know how to prioritize each defined pattern that is a part of the larger pattern. There are two pattern modifiers.

Priority Indicators

The priority indicator is a numeric value following the pattern modifier that indicates the priority weight of the pattern. These values work best when defined as a multiple of five between and including 35 and 95. If a pattern is assigned a priority of 90 or 95 and the pattern matches, or is a sub-pattern of, the input pattern, the match engine stops searching for additional matching patterns and uses the high-priority matching pattern.
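
The stopping rule described above can be illustrated with a short sketch. It assumes that candidate patterns are examined in order and that, absent a priority of 90 or 95, the highest priority wins. The text does not spell out the selection rule in that detail, and the averaging behavior of pattern modifiers is ignored here. Both function names are hypothetical.

```python
def is_recommended_priority(priority):
    """True for a multiple of five between 35 and 95, inclusive."""
    return 35 <= priority <= 95 and priority % 5 == 0


def choose_pattern(candidates):
    """Pick a pattern from (name, priority) pairs given in search order."""
    best = None
    for name, priority in candidates:
        if priority >= 90:
            # A priority of 90 or 95 ends the search immediately.
            return name
        if best is None or priority > best[1]:
            best = (name, priority)
    return best[0] if best else None


early_stop = choose_pattern([("first", 60), ("second", 90), ("third", 95)])
highest = choose_pattern([("first", 60), ("second", 70)])
print(early_stop, highest)
```

In the first call the search ends at the first pattern with a priority of 90, even though a later candidate has a higher value.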

Modifying Sun Match Engine Address Data Configuration Files

To customize the Sun Match Engine configuration files for processing street address data, you can modify any of the address standardization files using the text editor provided in NetBeans. Before modifying the match configuration file, review the information provided in Sun Match Engine Matching Configuration and Match Configuration Comparison Functions for Sun Match Engine (Repository). Make sure a thorough data analysis has been performed to determine the best fields for matching and the best comparison functions to use for each field.

Updating most standardization files is a straightforward process. Make sure to follow the syntax guidelines provided in Sun Match Engine Standardization Configuration for Address Data. If you add rows to any of the standardization files, make sure to adjust the corresponding parameter in the address constants file (addressConstants*.cfg) so that it remains equal to or greater than the number of lines in the file.

Modifying the patterns file is a more complex task. Modify this file only after you fully understand pattern tokens, classes, modifiers, priority indicators, and usage flags.

Configuring the Matching Service for Address Data (Repository)

To ensure the master index application uses the Sun Match Engine to process address information, you must customize the Matching Service. This includes modifying the Match Field file to support the fields on which you want to match, to standardize the appropriate fields, and to specify the Sun Match Engine as the match and standardization engine (by default, the Sun Match Engine is already specified so this does not need to be changed). Perform the following tasks to configure the Matching Service.

When configuring the Matching Service, keep in mind the information presented in Configuring the Master Index Matching Service (Repository).

Configuring the Standardization Structure for Address Data (Repository)

The standardization structure is configured in the StandardizationConfig section of the Match Field file, which is described in detail in Match Field Configuration (Repository) in Understanding Sun Master Index Configuration Options (Repository). To configure the required fields for standardization and phonetic encoding, modify the standardization and phonetic encoding structures. The following sections provide additional guidelines and samples specific to standardizing address data.


Note –

In the default configuration, the rules defined for the address data type assume that all input fields must be parsed as well as normalized. Thus, there is no need to configure fields only for normalization.


Address Standardization Structures

For address fields, the source fields in the standardization structure must include the fields predefined for parsing and normalization. This includes any fields containing street address information, which are parsed into the street address fields listed in Address Data Object Structure (excluding the phonetic street name field). The target fields can include any of these parsed fields. Follow the instructions under Defining Master Index Standardization Rules (Repository) in Configuring Sun Master Indexes (Repository) to define fields for normalization. For the standardization-type element, enter Address (for more information, see Sun Match Engine Match and Standardization Types). For a list of field IDs to use in the standardized-object-field-id element, see Table 3.

A sample standardization structure for address data is shown below. This structure parses the first two lines of street address into the standard street address fields. Only the United States domain is defined in this structure.


<free-form-texts-to-standardize>
   <group standardization-type="ADDRESS"
    domain-selector="com.stc.eindex.matching.impl.SingleDomainSelectorUS">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Address[*].Address1    
         </unstandardized-source-field-name>
         <unstandardized-source-field-name>Person.Address[*].Address2
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>HouseNumber
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>RuralRouteIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>MatchStreetName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>RuralRouteDescript
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxDescript
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>PropDesPrefDirection
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetDir
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>PropDesSufDirection
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetDir
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>StreetNameSufType
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetType
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>StreetNamePrefType
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetType
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
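
Because several parsed components share one target field, it can help to list the mappings grouped by target when reviewing a standardization structure. The sketch below does this with the Python standard library on a shortened copy of the sample above. The function name is hypothetical and the script is a review aid, not part of the master index tooling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """
<free-form-texts-to-standardize>
   <group standardization-type="ADDRESS">
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>HouseNumber
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>BoxIdentif
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].HouseNumber
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>MatchStreetName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Address[*].StreetName
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
"""


def targets_by_field(xml_text):
    """Group standardized field IDs by the target field that stores them."""
    root = ET.fromstring(xml_text.strip())
    grouped = defaultdict(list)
    for mapping in root.iter("target-mapping"):
        # Element text spans two lines in the sample, so strip whitespace.
        field_id = mapping.findtext("standardized-object-field-id").strip()
        target = mapping.findtext("standardized-target-field-name").strip()
        grouped[target].append(field_id)
    return dict(grouped)


grouped = targets_by_field(SAMPLE)
for target, field_ids in grouped.items():
    print(target, "<-", ", ".join(field_ids))
```

The output shows at a glance that both the house number and the post office box identifier are written to the HouseNumber field.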

Address Phonetic Encoding

When you match or standardize on street address fields, the street name should be specified for phonetic conversion (this is done by default). Follow the instructions under Defining Phonetic Encoding for the Master Index (Repository) in Configuring Sun Master Indexes (Repository) to define fields for phonetic encoding.

A sample of the phoneticize-fields element is shown below. This sample only converts the address street name. You can define additional fields for phonetic encoding.


<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Address[*].StreetName
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Address[*].StreetName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>

Configuring the Match String for Address Data (Repository)

For matching on street address fields, make sure the match string you specify in the MatchingConfig section of the Match Field file contains all or a subset of the fields that contain the standardized data (the original text in street address fields is generally too inconsistent to use for matching). You can include additional fields for matching, such as the city name or postal code.

To configure the match string, follow the instructions under Defining the Master Index Match String (Repository) in Configuring Sun Master Indexes (Repository). For the Sun Match Engine, each component of a street address has a different match type (specified by the match-type element). The default match types for addresses are StreetName, HouseNumber, StreetDir, and StreetType. You can specify any of the other match types defined in the match configuration file, as well. For more information, see Sun Match Engine Match and Standardization Types.

A sample match string for address matching is shown below.


<match-system-object>
   <object-name>Person</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetName
         </column-name>
         <match-type>StreetName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.HouseNumber
         </column-name>
         <match-type>HouseNumber</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetDir
         </column-name>
         <match-type>StreetDir</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetType
         </column-name>
         <match-type>StreetType</match-type>
      </match-column>
   </match-columns>
</match-system-object>
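
A match string is only useful if every match-type value is defined in the match configuration file. The sketch below extracts the column and match type pairs from a shortened sample and flags any type that is not in a known set. The set is hard-coded from the match types named in this document purely for illustration (a real check would read matchConfigFile.cfg), the misspelled type in the sample is deliberate, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Match types named in this document; assumed complete only for this example.
KNOWN_MATCH_TYPES = {
    "StreetName", "HouseNumber", "StreetDir", "StreetType",
    "String", "SSN", "Date", "Char", "Numeric", "pro",
    "Integer", "Exac", "Real",
}

SAMPLE = """
<match-system-object>
   <object-name>Person</object-name>
   <match-columns>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.StreetName
         </column-name>
         <match-type>StreetName</match-type>
      </match-column>
      <match-column>
         <column-name>Enterprise.SystemSBR.Person.Address.HouseNumber
         </column-name>
         <match-type>HouseNumbr</match-type>
      </match-column>
   </match-columns>
</match-system-object>
"""


def match_columns(xml_text):
    """Return (column name, match type) pairs from a match string."""
    root = ET.fromstring(xml_text.strip())
    return [
        (
            column.findtext("column-name").strip(),
            column.findtext("match-type").strip(),
        )
        for column in root.iter("match-column")
    ]


def unknown_match_types(columns, known=KNOWN_MATCH_TYPES):
    """Return the pairs whose match type is not in the known set."""
    return [pair for pair in columns if pair[1] not in known]


columns = match_columns(SAMPLE)
unknown = unknown_match_types(columns)
print(unknown)
```

The deliberately misspelled HouseNumbr entry is reported, which is the kind of error that is easy to miss when editing the Match Field file by hand.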