Understanding the Sun Match Engine

Master Index Components and the Sun Match Engine

Master index applications use the Sun Match Engine specifically for standardization and probabilistic weighting, while the master index application itself determines survivorship. These processes rely on the logic specified in the configuration files of the master index project and of the Sun Match Engine.

The following topics provide information about how the Sun Match Engine works with master index applications to standardize data and formulate matching weights.

Searching and Matching in Sun Match Engine Applications (Repository)

When a new record is passed to the master index database, the master index application selects a subset of possible matches from the database. The master index application then uses the Sun Match Engine matching algorithm to assign a matching probability weight for each record in this subset (known as the candidate selection pool). To create the candidate selection pool, the master index application makes a series of query passes of the existing data, searching for matches on specific combinations of data. These combinations are defined by the blocking query, which is defined in the Candidate Select file and specified in the Threshold file.
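The query passes described above can be sketched in Python. This is a minimal illustration, not the master index implementation: the field names, the in-memory record list, and the rule that a pass is skipped when the incoming record lacks one of its criteria values are all assumptions made for the example.

```python
def candidate_pool(incoming, records, blocking_passes):
    """Return every record that agrees with `incoming` on all fields of at least one pass."""
    pool = {}
    for fields in blocking_passes:
        # Assumption: a pass is skipped when the incoming record has no value
        # for one of its criteria fields.
        if any(not incoming.get(f) for f in fields):
            continue
        for rec in records:
            if all(rec.get(f) == incoming[f] for f in fields):
                pool[rec["id"]] = rec  # keyed by id so a record is pooled only once
    return list(pool.values())


# Hypothetical data; phonetic codes and field names are placeholders.
records = [
    {"id": 1, "FirstName_Phon": "JN", "LastName_Phon": "SM0", "DOB": "1970-01-01", "SSN": "111"},
    {"id": 2, "FirstName_Phon": "JN", "LastName_Phon": "SM0", "DOB": "1980-05-05", "SSN": "222"},
    {"id": 3, "FirstName_Phon": "MR", "LastName_Phon": "JNS", "DOB": "1970-01-01", "SSN": "999"},
    {"id": 4, "FirstName_Phon": "PT", "LastName_Phon": "BRN", "DOB": "1990-09-09", "SSN": "333"},
]
incoming = {"FirstName_Phon": "JN", "LastName_Phon": "SM0", "DOB": "1970-01-01", "SSN": ""}
passes = [("FirstName_Phon", "LastName_Phon"), ("SSN",), ("DOB",)]

pool = candidate_pool(incoming, records, passes)
print([rec["id"] for rec in pool])
```

Because the pool is the union of all passes, a record that matches on any one combination is weighed later, even if it disagrees on the others.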

Matching is performed on the fields included in the match string defined in the Match Field file. Each field is assigned a matching weight. The weights for each field are summed to determine the matching probability weight for the entire record (known as the composite weight). Before matching on some fields, such as the first name, the index might standardize the field based on information in the standardization files. You can customize how each field is weighted by modifying the match configuration file.

Standardization and Matching Process in Master Index Applications (Repository)

The standardization and matching processes use logic that is defined by a combination of Sun Match Engine configuration files and master index configuration files. During these processes, the following steps occur.

  1. The Sun Match Engine receives an incoming record.

  2. The Sun Match Engine standardizes the fields specified for parsing, normalization, and phonetic encoding. These fields are defined in the StandardizationConfig section of the Match Field file and the rules for standardization are defined in the Sun Match Engine standardization configuration files.

  3. The master index application queries the database for a candidate selection pool (records that are possible matches) using the blocking query specified in the Threshold file. If the blocking query uses standardized or phonetic fields, the criteria values are obtained from the database.

  4. For each possible match, the master index application creates a match string (based on the match columns in the Match Field file) and sends the string to the Sun Match Engine.

  5. The Sun Match Engine checks the incoming record against each possible match, producing a matching weight for each. Matching is performed using the weighting rules defined in the match configuration file.
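The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the nickname table, the field names (FirstName_Std, LastName_Std, DOB), and the fixed agreement and disagreement weights are hypothetical. The actual engine takes its standardization rules and weighting rules from its configuration files rather than from constants like these.

```python
# Hypothetical normalization table standing in for the standardization files.
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT"}

# Hypothetical match columns standing in for the Match Field file.
MATCH_COLUMNS = ["FirstName_Std", "LastName_Std", "DOB"]


def standardize(record):
    """Step 2: add normalized name fields to a copy of the record."""
    out = dict(record)
    first = record["FirstName"].strip().upper()
    out["FirstName_Std"] = NICKNAMES.get(first, first)
    out["LastName_Std"] = record["LastName"].strip().upper()
    return out


def build_match_string(record):
    """Step 4: collect the match column values in a fixed order."""
    return [record.get(col, "") for col in MATCH_COLUMNS]


def weigh(incoming, candidate, agree=5.0, disagree=-3.0):
    """Step 5: sum a per-field weight (placeholder constants, not engine values)."""
    total = 0.0
    for a, b in zip(incoming, candidate):
        total += agree if a and a == b else disagree
    return total


def process(incoming, candidate_pool):
    """Steps 1-5 for one incoming record; the pool is assumed to come from step 3."""
    incoming_ms = build_match_string(standardize(incoming))
    results = [(c["id"], weigh(incoming_ms, build_match_string(c))) for c in candidate_pool]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


incoming = {"FirstName": "Bill", "LastName": "Smith", "DOB": "1970-01-01"}
pool = [
    {"id": 2, "FirstName_Std": "ROBERT", "LastName_Std": "SMITH", "DOB": "1971-01-01"},
    {"id": 1, "FirstName_Std": "WILLIAM", "LastName_Std": "SMITH", "DOB": "1970-01-01"},
]
print(process(incoming, pool))
```

In this sketch the candidate records already carry standardized values, mirroring step 3, where standardized criteria values are read from the database rather than recomputed.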

The Master Index Match String (Repository)

The data string that is passed to the Sun Match Engine for match processing is called the match string and is defined in the MatchingConfig section of the Match Field file. The Sun Match Engine configuration files, the blocking query, and the matching configuration are closely linked in the search and matching processes. The blocking query defines the select statements for creating the candidate selection pool during the matching process. The matching configuration defines the match string that is passed to the Sun Match Engine from the records in the candidate selection pool. Finally, the Sun Match Engine configuration files define how the match string is processed.

The Sun Match Engine configuration files depend on the match string. When you modify the match string, make sure that the match type you specify corresponds to the correct row in the match configuration file (matchConfigFile.cfg). For example, if you are using person matching and add “MaritalStatus” as a match field, you need to specify a match type for the MaritalStatus field that is listed in the first column of the match configuration file. You must also make sure that the matching logic in the corresponding row of the match configuration file is appropriate for matching on the MaritalStatus field.
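A quick consistency check along these lines can catch a match type that has no row in the match configuration file. The sketch below is hypothetical: it assumes only what the text states, namely that the match type is the first whitespace-separated token of a row. The sample rows and field paths are placeholders, and the remaining columns of each row are deliberately elided.

```python
def config_match_types(cfg_text):
    """Collect the first whitespace-separated token of every non-blank line."""
    types = set()
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens:
            types.add(tokens[0])
    return types


def missing_match_types(match_fields, cfg_text):
    """Return match types used in the match string that have no row in the file."""
    known = config_match_types(cfg_text)
    return sorted(t for t in set(match_fields.values()) if t not in known)


# Placeholder rows: only the first column matters for this check.
SAMPLE_CFG = """\
FirstName   (comparison settings elided)
LastName    (comparison settings elided)
DateDays    (comparison settings elided)
"""

# Hypothetical match string: field path -> match type.
match_fields = {
    "Person.FirstName": "FirstName",
    "Person.LastName": "LastName",
    "Person.MaritalStatus": "MaritalStatus",
}

print(missing_match_types(match_fields, SAMPLE_CFG))
```

The check only confirms that a row exists. Whether the logic in that row suits the field still has to be reviewed by hand.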

Sun Match Engine Field Identifiers

The Sun Match Engine breaks down fields into various components. For example, it breaks addresses into floor number, street number, street name, street direction, and so on. Some of these components are similar and are typically stored in the same field in the database. In the default configuration, for example, when the standardization engine finds a house number, rural route number, or PO box number, the value is stored in the HouseNumber database field. You can customize this as needed, as long as any field you specify to store a component is also included in the object structure defined for the master index application.

The Sun Match Engine uses field identifiers to determine how to process fields that are defined for normalization or parsing. The IDs are defined internally in the match engine and are referenced in the Match Field file. The field IDs you specify for each field in the Match Field file determine how that field is processed by the standardization engine. The field IDs for person names determine how each name is normalized. The field IDs for business names specify which business type key file to use for standardization. The field IDs for addresses determine which database fields store each field component and how each component is standardized.

Table 3 lists each field component generated by the Sun Match Engine along with its corresponding field ID. You can specify only the predefined field IDs that are listed in this table.

Table 3 Standardization Field Identifiers

Field ID 

Description 

Person Name Standardization Field Identifiers

FirstName

Specifies a first name field for normalization. 

LastName

Specifies a last name field for normalization. 

Address Standardization Field Identifiers

HouseNumber

Specifies the parsed house number from a standardized address field. By default, this is stored in the field_name_HouseNo field (or the HouseNumber field for Sun Master Patient Index).

RuralRouteIdentif

Specifies the parsed rural route identifier from a standardized address field. By default, this is stored in the field_name_HouseNo field (or the HouseNumber field for Sun Master Patient Index).

BoxIdentif

Specifies the parsed PO box number from a standardized address field. By default, this is stored in the field_name_HouseNo field (or the HouseNumber field for Sun Master Patient Index).

MatchStreetName

Specifies the parsed and standardized street name from a standardized address field and is used internally by the match engine. If you want to store the standardized street name in the database (recommended), map this field to the street name field in the database. By default, this is stored in the field_name_StName field (or the StreetName field for Sun Master Patient Index).

OrigStreetName

Specifies the parsed street name from an address field. If you want to store the original street name in the database, map this field to the street name field in the database. This address component is not included in the default standardization structure, but you can add it if needed. 

RuralRouteDescript

Specifies the parsed rural route description from a standardized address field. By default, this is stored in the field_name_StName field (or the StreetName field for Sun Master Patient Index).

BoxDescript

Specifies the PO box type from a standardized address field. By default, this is stored in the field_name_StName field (or the StreetName field for Sun Master Patient Index).

PropDesPrefDirection

Specifies the parsed property direction from a standardized address field. This field ID handles cases where the direction is a prefix to the property description. By default, this is stored in the field_name_StDir field (or the StreetDir field for Sun Master Patient Index).

PropDesSufDirection

Specifies the parsed property direction from a standardized address field. This field ID handles cases where the direction is a suffix to the property description. By default, this is stored in the field_name_StDir field (or the StreetDir field for Sun Master Patient Index).

StreetNamePrefDirection

Specifies the parsed street direction from a standardized address field. This field ID handles cases where the direction is a prefix to the street name. By default, this is stored in the field_name_StDir field (or the StreetDir field for Sun Master Patient Index).

StreetNameSufDirection

Specifies the parsed street direction from a standardized address field. This field ID handles cases where the direction is a suffix to the street name. By default, this is stored in the field_name_StDir field (or the StreetDir field for Sun Master Patient Index).

StreetNameSufType

Specifies the parsed street type from a standardized address field. This field ID handles cases where the street type is a suffix to the street name. By default, this is stored in the field_name_StType field (or the StreetType field for Sun Master Patient Index).

StreetNamePrefType

Specifies the parsed street type from a standardized address field. This field ID handles cases where the street type is a prefix to the street name. By default, this is stored in the field_name_StType field (or the StreetType field for Sun Master Patient Index).

PropDesSufType

Specifies the parsed property type from a standardized address field. This field ID handles cases where the property type is a suffix to the property description. By default, this is stored in the field_name_StType field (or the StreetType field for Sun Master Patient Index).

PropDesPrefType

Specifies the parsed property type from a standardized address field. This field ID handles cases where the property type is a prefix to the property description. By default, this is stored in the field_name_StType field (or the StreetType field for Sun Master Patient Index).

HouseNumPrefix

Specifies the parsed house number prefix from a standardized address field (such as the “A” in “A 1587 4th Street”). This address component is not included in the default standardization structure, but you can add it if needed. 

SecondHouseNumberPrefix

Specifies the parsed second house number prefix from a standardized address field (such as “25” in “25 319 10th Ave.”). This address component is not included in the default standardization structure, but you can add it if needed.

SecondHouseNumber

Specifies the parsed second house number from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.

HouseNumSuffix

Specifies the parsed house number suffix from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

OrigSecondStreetName

Specifies the parsed second street name from a standardized address field (for example, an address might include a cross-street or a thoroughfare and dependent thoroughfare). This address component is not included in the default standardization structure, but you can add it if needed.

SecondStreetNameSufDirection

Specifies the parsed second street direction from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.

SecondStreetNameSufType

Specifies the parsed second street type from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed.

StreetNameExtensionIndex

Specifies the parsed street name extension from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

WithinStructDescript

Specifies the parsed internal descriptor (such as “Floor”) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

WithinStructIdentif

Specifies the parsed internal identifier (such as a floor number) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

OrigPropertyName

Specifies the parsed original property name (such as the name of a complex or business park) from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

MatchPropertyName

Specifies the parsed match property name from a standardized address field and is used internally by the match engine for blocking and phonetic encoding. This address component is not included in the default standardization structure, but you can add it if needed. 

CenterDescript

Specifies the parsed structure description from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

CenterIdentif

Specifies the parsed structure identifier from a standardized address field. This address component is not included in the default standardization structure, but you can add it if needed. 

ExtraInfo

Specifies any extra information that was not included in any of the other parsed components. This address component is not included in the default standardization structure, but you can add it if needed. 

Business Name Standardization Field Identifiers

PrimaryName

Specifies the field containing the parsed name in a free-form text business name field. 

OrgTypeKeyword

Specifies the field containing the parsed organization type in a free-form text business name field. 

AssocTypeKeyword

Specifies the field containing the parsed association type in a free-form text business name field. 

IndustrySectorList

Specifies the field containing the parsed industry sector in a free-form text business name field. 

IndustryTypeKeyword

Specifies the field containing the parsed industry type in a free-form text business name field (industry type is a subset of the sector). 

AliasList

Specifies the field containing the parsed alias in a free-form text business name field. 

Url

Specifies the field containing the parsed URL in a free-form text business name field. 
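The default storage rules in Table 3 show that several address field IDs share one target field. The sketch below restates those defaults as a lookup. It assumes that field_name in the table stands for the name of the source address field; the helper name and the example field name AddressLine1 are hypothetical.

```python
# Default target suffix for each address field ID, as listed in Table 3.
DEFAULT_SUFFIX = {
    "HouseNumber": "HouseNo",
    "RuralRouteIdentif": "HouseNo",
    "BoxIdentif": "HouseNo",
    "MatchStreetName": "StName",
    "RuralRouteDescript": "StName",
    "BoxDescript": "StName",
    "PropDesPrefDirection": "StDir",
    "PropDesSufDirection": "StDir",
    "StreetNamePrefDirection": "StDir",
    "StreetNameSufDirection": "StDir",
    "StreetNameSufType": "StType",
    "StreetNamePrefType": "StType",
    "PropDesSufType": "StType",
    "PropDesPrefType": "StType",
}


def default_target(field_name, field_id):
    """Return the default storage field, or None if the component is not stored by default."""
    suffix = DEFAULT_SUFFIX.get(field_id)
    return f"{field_name}_{suffix}" if suffix else None


print(default_target("AddressLine1", "BoxIdentif"))
print(default_target("AddressLine1", "ExtraInfo"))
```

Field IDs that return None here are the components that Table 3 describes as not included in the default standardization structure.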

Sun Match Engine Match and Standardization Types

Indicators are used in the Match Field file to reference the type of matching and standardization to perform on each field. You must specify one of these indicators, called match types and standardization types, for the fields you define for standardization or matching. The match types correspond to the match types listed in the first column of the match configuration file (matchConfigFile.cfg). The standardization types are defined internally in the match engine. The Sun Match Engine uses these types to determine how to process each field.

Table 4 lists the default standardization types; Table 5 lists the default match types. You can modify the match type names but not the standardization type names. For more information about match and standardization types, see Master Index Match Types and Field Names (Repository) in Understanding Sun Master Index Processing (Repository). Note that the match types you can specify in the Match Field file (listed in Table 5) are not the same values you specify for the Match Type field drop-down list in the wizard when you create the master index application.

Table 4 Standardization Types

This indicator ...    processes this data type ...

Address               Free-form street address fields.
PersonName            Pre-parsed name fields (including any first, middle, last, or alias names).
BusinessName          Free-form business names.

The standardization types listed above correspond to the three categories of match types listed below. You can also specify miscellaneous match types, which do not correspond to any standardization types.

Table 5 Match Types

This indicator ...     processes this data type ...

Business Name Match Types

PrimaryName            The parsed name field of a business name.
OrgTypeKeyword         The parsed organization type field of a business name.
AssocTypeKeyword       The parsed association type field of a business name.
AliasList              The parsed alias type field of a business name.
IndustrySectorList     The parsed industry sector field of a business name.
IndustryTypeKeyword    The parsed industry type field of a business name.
Url                    The parsed URL field of a business name.

Address Match Types

StreetName             The parsed street name field of a street address.
HouseNumber            The parsed house number field of a street address.
StreetDir              The parsed street direction field of a street address.
StreetType             The parsed street type field of a street address.

Person Name Match Types

FirstName              A first name field, including middle name, alias first name, and alias middle name fields.
LastName               A last name field, including alias last name fields.

Date Match Types

DateDays               The day, month, and year of a date field.
DateMonths             The month and year of a date field.
DateHours              The hour, day, month, and year of a date field.
DateMinutes            The minute, hour, day, month, and year of a date field.
DateSeconds            The seconds, minute, hour, day, month, and year of a date field.

Miscellaneous Match Types

String                 A generic string field.
Numeric                A numeric field.
Integer                A field containing integers.
Real                   A field containing real numbers.
SSN                    A field containing a social security number.
Char                   A field containing a single character.
pro                    Any field on which you want the Sun Match Engine to use prorated weights.
Exac                   Any field you want the Sun Match Engine to match character for character.
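The date match types differ only in which date components they take into account. The following Python sketch illustrates that difference by truncating two timestamps to the components listed for each type. Strict equality on the truncated value is an assumption made for the example; the comparison the engine actually applies is governed by the match configuration file.

```python
from datetime import datetime

# Components considered by each date match type, expressed as strftime formats.
DATE_COMPONENTS = {
    "DateMonths": "%Y-%m",
    "DateDays": "%Y-%m-%d",
    "DateHours": "%Y-%m-%d %H",
    "DateMinutes": "%Y-%m-%d %H:%M",
    "DateSeconds": "%Y-%m-%d %H:%M:%S",
}


def same_at_granularity(a, b, match_type):
    """Compare two datetimes using only the components the match type covers."""
    fmt = DATE_COMPONENTS[match_type]
    return a.strftime(fmt) == b.strftime(fmt)


admitted = datetime(2020, 3, 15, 10, 30, 0)
recorded = datetime(2020, 3, 15, 18, 5, 0)

for match_type in DATE_COMPONENTS:
    print(match_type, same_at_granularity(admitted, recorded, match_type))
```

Two values that fall on the same day agree under DateMonths and DateDays but not under the finer-grained types, which is why the choice of match type should follow the precision of the stored data.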

Sun Match Engine Configuration File Modifications

The Sun Match Engine configuration files are designed to perform very specific functions in the standardization and match processes. These files should only be modified by personnel with an understanding of the Sun Match Engine and an understanding of the data integrity requirements of your organization. Modifications to both the master index configuration files and the Sun Match Engine configuration files should be made while the master index application is in the preproduction stages. Modifying the files after the master index application has moved into production might cause variances in matching weights and data processing.

The most common modifications to the Sun Match Engine configuration files are in the match configuration file, where you can fine-tune the weighting process. This file defines probabilities used by the algorithm to determine a matching probability weight for each match field. You can use the match comparison functions provided by the Sun Match Engine to fine-tune the matching logic in this file. Another common modification is inserting additional names or terms into category files, such as the first name category file (personFirstName*.dat).
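This section does not give the formula that turns the configured probabilities into weights. As a point of reference, probabilistic record linkage in the Fellegi-Sunter style commonly derives an agreement weight and a disagreement weight from two probabilities per field. The Python sketch below shows that general technique; the names m and u, the sample values, and the assumption that the Sun Match Engine follows this exact formula are not taken from this document.

```python
import math


def agreement_weight(m, u):
    """Weight when the field agrees: log2 of P(agree | match) over P(agree | non-match)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight when the field disagrees: log2 of the complementary probabilities."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical probabilities for one field.
m, u = 0.5, 0.25
print(agreement_weight(m, u))
print(disagreement_weight(m, u))
```

Under this scheme, raising m or lowering u for a field increases how strongly an agreement on that field pushes the composite weight upward, which is the kind of effect fine-tuning the probabilities is meant to achieve.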

Depending on your data requirements, you might need to modify additional standardization files. Some of the patterns files (most notably the address patterns files) are very complex and should only be modified by personnel who thoroughly understand the defined patterns and tokens. If you modify standardization files, make sure you modify them for each national domain specified in the Match Field file.