Understanding the Sun Match Engine

Generic String Comparator (u)

This is the standard uncertainty comparison function, which processes string fields as described above. The agreement weight decreases nonlinearly as more differences are found between two fields: it can remain high through several differences and then drop sharply. This comparison function takes no parameters.

The uncertainty comparison function is based on the Jaro algorithm with McLaughlin adjustments for similarities. The Jaro algorithm is a string comparison function that accounts for insertions, deletions, and transpositions by performing the following steps.

  1. Compute the lengths of both strings to be matched.

  2. Determine the number of common characters between the two strings. For two matching characters to be considered common, their positions in the two strings must be no more than one-half the length of the shorter string apart.

  3. Determine the number of transpositions. A transposition occurs when a common character from the first string is out of order relative to the corresponding common character from the second string.
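
The three steps above can be sketched in Python as follows. This is only an illustration of the classic Jaro procedure, not the Sun Match Engine's actual code, and it omits the McLaughlin adjustments and any conversion to an agreement weight. The function names jaro_components and jaro_similarity are hypothetical. The search window follows the text (half the length of the shorter string); some published implementations derive the window from the longer string instead. The final score uses the commonly published Jaro formula, which is an assumption here because this section does not state it.

```python
def jaro_components(s1, s2):
    """Return (len1, len2, common, transpositions) for two strings."""
    # Step 1: compute the lengths of both strings.
    len1, len2 = len(s1), len(s2)

    # Step 2: count common characters. A character in s1 may pair with an
    # unused, equal character in s2 only if the positions differ by no
    # more than half the length of the shorter string (as the text states).
    window = min(len1, len2) // 2
    used = [False] * len2
    matched1 = []  # common characters, in s1 order
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    matched2 = [s2[j] for j in range(len2) if used[j]]  # in s2 order
    common = len(matched1)

    # Step 3: count transpositions. Each out-of-order pair is seen twice
    # when the two sequences are compared position by position, so the
    # classic definition halves the mismatch count.
    out_of_order = sum(a != b for a, b in zip(matched1, matched2))
    return len1, len2, common, out_of_order // 2


def jaro_similarity(s1, s2):
    """Classic Jaro similarity in [0, 1]; assumed formula, see lead-in."""
    len1, len2, c, t = jaro_components(s1, s2)
    if c == 0:
        return 0.0  # no common characters (also covers empty input)
    return (c / len1 + c / len2 + (c - t) / c) / 3.0


if __name__ == "__main__":
    print(jaro_components("MARTHA", "MARHTA"))
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
```

In this sketch, "MARTHA" and "MARHTA" share six common characters with one transposition (the swapped T and H), so the score stays high despite the difference.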