Probabilities and Direct Weights (Understanding the Master Index Match Engine)

Understanding the Master Index Match Engine

Probabilities and Direct Weights

Optimum (or ceiling) matching weights can be assigned to field values using matching (m) and unmatching (u) probabilities or using agreement and disagreements weights in an equivalent way. Both types are based on a logarithmic function. Optimum agreement and disagreement weights are an equivalent logarithmic expression of the matching and unmatching probabilities, but for an end user, defining agreement and disagreement weight ranges is a more direct way to implement m-probabilities and u-probabilities.

Matching and Unmatching Probabilities

When matching and unmatching conditional probabilities are used, the match engine uses a logarithmic formula to determine agreement and disagreement weights between fields. The m-probabilities and u-probabilities you specify determine the maximum agreement weight and minimum disagreement weight for each field, and so define the agreement and disagreement weight ranges for each field and for the entire record. These probabilities allow you to specify which fields provide the most reliable matching information and which provide the least. For example, in person matching, the gender field is not as reliable as the SSN field for determining a match since a person’s SSN is more specific. Therefore, the SSN field should have a higher m-probability than the gender field. The more reliable the field, the greater the m-probability for that field should be.

If a field matches between two records, an agreement weight, determined by the logarithmic formula using the m-probability and u-probability, is added to the composite match weight for the record. If the fields disagree, the logarithmic formula using the m-probability and u-probability is negative, and a disagreement weight is subtracted from the composite match weight.

Agreement and Disagreement Weight Ranges

Like probabilities, the maximum agreement and minimum disagreement weights you define for each field allow you to specify the relative reliability of each field; however, the match weight has a more linear relationship with the numbers you specify. When you use agreement and disagreement weight ranges to determine the match weight, you define a maximum weight for each field when they are in complete agreement and a minimum weight for when they are in complete disagreement. The value assigned to a field is somewhere between the two numbers based on an underlying logarithmic formula. This provides a more convenient and intuitive representation of conditional probabilities.

Using the SSN and gender field example above, the SSN field is assigned a higher maximum agreement weight and a lower minimum disagreement weight than the gender field because it is more reliable. If you assign a maximum agreement weight of “10” and two SSNs match, the match weight for that field is “10”. If you assign a minimum disagreement weight of “-10” and two SSNs are in complete disagreement, the match weight for that field is “-10”.