Understanding the Sun Match Engine

Sun Match Engine Overview

The Sun Match Engine is the standard match engine designed to work with the master index applications created by Sun Match Engine (Repository). It is highly configurable in the Sun Match Engine environment and can be used to match on various types of data.

The Sun Match Engine provides data parsing, data standardization, phonetic encoding, and record matching capabilities for master index applications created by Sun Match Engine. Before records can be compared to evaluate the possibility of a match, the data contained in those records must be standardized and in certain cases phonetically encoded or parsed. Once the data is conditioned, the match engine determines a match weight for each field defined for matching. The match weight is based on your configuration of the match engine and the fields on which matching is performed. The composite weight (the sum of weights generated for all match fields in the records) indicates how closely two records match.

The following topics provide information about the configurable components of the match engine and how the Sun Match Engine standardizes and matches data.

About the Sun Match Engine Matching Algorithm

The Sun Match Engine compares records containing similar data types by calculating how closely the records match. The resulting comparison weight is either a positive or negative numeric value that represents the degree to which the two sets of data are similar. The match engine relies on probabilistic algorithms to compare data of a given type using a comparison function specific to the type of data being compared. The comparison functions for each matching field are defined in a match configuration file that you can customize for the type of data you are indexing. The formula used to determine the matching weight is based on either matching and unmatching probabilities or on agreement and disagreement weight ranges.

The Sun Match Engine is also designed to standardize free-form text fields, such as street address fields or business names. This allows the match engine to generate a more accurate weight for free-form data.

Sun Match Engine Standardization and Matching Process

The Sun Match Engine matching algorithm uses a proven methodology to process and weight records in the master index database. By providing both standardization and matching capabilities, the match engine allows you to condition data prior to matching. You can also use these capabilities to review legacy data prior to loading it into the database. This review helps you determine data anomalies, invalid or default values, and missing fields.

Both matching and standardization occur when two records are analyzed for the probability of a match. Before matching, certain fields are normalized, parsed, or converted into their phonetic values if necessary. The match fields are then analyzed and weighted according to the rules defined in a match configuration file. The weights for each field are combined to determine the overall matching weight for the two records. After the match engine has performed these steps, survivorship is determined by the master index application, based on how the overall matching weight compares to the duplicate and match thresholds of the master index application. These thresholds are configured for the Manager Service in the Threshold file.

Sun Match Engine Data Types

You can standardize and match on different types of data with the Sun Match Engine. In its default implementation with a master index application, the match engine supports data standardization and matching on the three primary types of data listed below.

In addition, the Sun Match Engine provides comparison functions for matching on various types of fields contained within the primary data types, such as numbers, dates, Social Security Numbers, single characters, and so on.

When processing person information, the match engine assumes that each match field is stored in a separate field. For street address and business name processing, the match engine can parse free-form text fields for searching and matching. Each data type requires specific customization to the Match Field file in the master index project.

How the Sun Match Engine Works

The Sun Match Engine compares two records and returns a match weight indicating the likelihood of a match between the two records. The three primary components of the Sun Match Engine are the configuration files, the standardization engine, and the match engine.

Sun Match Engine Matching Weight Formulation

The Sun Match Engine determines the matching weight between two records by comparing the match string fields between the two records using the rules defined in the match configuration file and taking into account the matching logic specified for each field. The Sun Match Engine can use either matching (m) and unmatching (u) conditional probabilities or agreement and disagreement weight ranges to fine-tune the match process. It uses the underlying algorithm to arrive at a match weight for each match string field. The weight generated for each field in the match string indicates the level of match between each field. The weights assigned to each field are then summed together for a total, composite matching weight between the two records. Agreement and disagreement weight ranges or m-probabilities and u-probabilities are defined in the match configuration file.

The following topics describe probabilities and weights.

Matching and Unmatching Probabilities

When matching and unmatching conditional probabilities are used, the match engine uses a logarithmic formula to determine agreement and disagreement weights between fields. The m-probabilities and u-probabilities you specify determine the maximum agreement weight and minimum disagreement weight for each field, and so define the agreement and disagreement weight ranges for each field and for the entire record. These probabilities allow you to specify which fields provide the most reliable matching information and which provide the least. For example, in person matching, the gender field is not as reliable as the SSN field for determining a match since a person’s SSN is more specific. Therefore, the SSN field should have a higher m-probability than the gender field. The more reliable the field, the greater the m-probability for that field should be.

If a field matches between two records, an agreement weight, determined by the logarithmic formula using the m-probability and u-probability, is added to the composite match weight for the record. If the fields disagree, a disagreement weight is subtracted from the composite match weight. m-probabilities and u-probabilities are expressed as double values between one and zero (excluding one and zero) and can have up to 16 decimal points.

Agreement and Disagreement Weight Ranges

Defining agreement and disagreement weight ranges is a more direct way to implement m-probabilities and u-probabilities. Like probabilities, the maximum agreement and minimum disagreement weights you define for each field allow you to define the relative reliability of each field; however, the match weight has a more linear relationship with the numbers you specify. When you use agreement and disagreement weight ranges to determine the match weight, you define a maximum weight for each field when they are in complete agreement and a minimum weight for when they are in complete disagreement. The Sun Match Engine assigns a matching weight to each field that falls between the agreement and disagreement weights specified for the field. This provides a more convenient and intuitive representation of conditional probabilities.

Using the SSN and gender field example above, the SSN field would be assigned a higher maximum agreement weight and a lower minimum disagreement weight than the gender field because it is more reliable. If you assign a maximum agreement weight of “10” and two SSNs match, the match weight for that field is “10”. If you assign a minimum disagreement weight of “-10” and two SSNs are in complete disagreement, the match weight for that field is “-10”. Agreement and disagreement weights are expressed as double values and can have up to 16 decimal points.