3.3 Individual Name

This is a bespoke rules-based algorithm that has been optimized for determining individual name matches. The final matching score for Individual Name is generated by combining the scores of the following features, which make up the feature vector used by the scoring method:
  • Abbreviation: Checks whether the first character of each source token matches the first character of the corresponding target token.

    Example:
    • String 1: "S Turner"
    • String 2: "Steve Turner"

    String 1 is an abbreviated form of String 2.
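
As a rough illustration of the check described above, the sketch below compares the first characters of tokens paired by position. The function name `is_abbreviation`, the whitespace tokenization, the case-insensitive comparison, and the requirement that both strings have the same number of tokens are assumptions made for this example, not confirmed product behavior.

```python
def is_abbreviation(source: str, target: str) -> bool:
    """Return True if every positional token pair shares its first character.

    Hypothetical helper: tokenization and case handling are assumptions.
    """
    src_tokens = source.lower().split()
    tgt_tokens = target.lower().split()
    if not src_tokens or len(src_tokens) != len(tgt_tokens):
        return False
    return all(s[0] == t[0] for s, t in zip(src_tokens, tgt_tokens))


print(is_abbreviation("S Turner", "Steve Turner"))  # True
print(is_abbreviation("S Turner", "Brian Turner"))  # False
```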

  • Character Edit Distance (CED): The Character Edit Distance comparison compares two String tokens and determines how closely they match each other by calculating the minimum number of character edits (deletions, insertions, and substitutions) needed to transform one value into the other.

    It compares each token in a string with each token of another string and finds the minimum number of edits needed to convert one token into the other. Across the matched token pairs, the smallest distance is reported as CED_MIN and the largest as CED_MAX.

    The stopwords for Individual Names are Mr., Mrs., and Ms. Stopwords are not considered when calculating the score.

    Example:
    • String1: “Mr Jerrod Benito Carrera”
    • String2: “JOSE BENITO CABRERA”
    • Stopword: Mr.

    Jerrod matches with JOSE → CED: 5 (CED_MAX)

    Benito matches with BENITO → CED: 0 (CED_MIN)

    Carrera matches with CABRERA → CED: 1
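
The distances in this example can be reproduced with a standard Levenshtein computation. The sketch below is a minimal illustration, not the product's implementation: the helper names, the lowercasing, the stripping of trailing periods from stopwords, and the pairing of tokens by position are all assumptions.

```python
STOPWORDS = {"mr", "mrs", "ms"}  # from the text; trailing periods are stripped


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokens(name: str) -> list:
    """Lowercase, split on whitespace, and drop stopwords (assumed preprocessing)."""
    words = [w.lower().rstrip(".") for w in name.split()]
    return [w for w in words if w not in STOPWORDS]


def ced_list(source: str, target: str) -> list:
    """CED for tokens paired by position (the pairing strategy is an assumption)."""
    return [levenshtein(s, t) for s, t in zip(tokens(source), tokens(target))]


distances = ced_list("Mr Jerrod Benito Carrera", "JOSE BENITO CABRERA")
print(distances, min(distances), max(distances))  # [5, 0, 1] 0 5
```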

  • Character Match Percentage (CMP): The Character Match Percentage comparison determines how closely two values match each other by calculating the Character Edit Distance between them and taking into account the length of the longer of the two values, by character count.

    CMP = (MCL - CED) * 100 / MCL

    • CMP = Character Match Percentage
    • MCL = Maximum Character Length (the character count of the longer of the two values)
    • CED = Character Edit Distance between the two values
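
As a worked example of the formula, the sketch below plugs in the "Jerrod Benito Carrera" / "JOSE BENITO CABRERA" pair from the CED example. Summing the per-token CEDs and counting characters across all tokens is an assumption; it is used here because it reproduces the "cmp" value shown in Figure 3-1. The function name `cmp_score` is hypothetical.

```python
def cmp_score(ced: int, len_a: int, len_b: int) -> float:
    """CMP = (MCL - CED) * 100 / MCL, where MCL is the longer character length."""
    mcl = max(len_a, len_b)
    if mcl == 0:
        return 100.0  # assumption: two empty values are treated as identical
    return (mcl - ced) * 100 / mcl


# "Jerrod Benito Carrera" has 19 letters, "JOSE BENITO CABRERA" has 17,
# and the per-token CEDs (5, 0, 1) add up to 6.
print(round(cmp_score(ced=6, len_a=19, len_b=17), 5))  # 68.42105
```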
  • Exact String Match: The Exact String Match comparison is a simple comparison that determines whether or not two String values match. It checks whether every token in one string exactly matches a token in the other. The tokens can appear in any order.

    Example:
    • String 1: Ram Lakshaman
    • String 2: Lakshaman Ram
    • String 3: Ram Lakshaman

    String 1, String 2, and String 3 are exact matches of each other.
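
One simple way to express an order-independent exact match is to compare the sorted token lists, as sketched below. The function name `exact_match` and the case-insensitive comparison are assumptions for illustration.

```python
def exact_match(source: str, target: str) -> bool:
    """True when both strings contain the same tokens, in any order."""
    return sorted(source.lower().split()) == sorted(target.lower().split())


print(exact_match("Ram Lakshaman", "Lakshaman Ram"))  # True
print(exact_match("Ram Lakshaman", "Ram Lakshman"))   # False
```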

  • Starts With: The Starts With comparison compares two values and determines whether either value starts with the whole of the other value. It therefore matches both exact matches and matches where one of the values starts the same as the other but contains extra information.

    It checks whether each token starts with the respective token in the target tokens.

    Note:

    For each pair of tokens, the shorter token (whether it comes from the source or the target) is checked against the start of the longer token. The tokens must be in the same order in both strings.
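
The note above can be sketched as follows. The names used in the calls are made-up examples, and the function name `starts_with`, the case-insensitive comparison, and the requirement that both strings have the same number of tokens are assumptions.

```python
def starts_with(source: str, target: str) -> bool:
    """True when, for each in-order token pair, the longer token starts with the shorter."""
    src, tgt = source.lower().split(), target.lower().split()
    if not src or len(src) != len(tgt):
        return False
    for s, t in zip(src, tgt):
        shorter, longer = sorted((s, t), key=len)
        if not longer.startswith(shorter):
            return False
    return True


print(starts_with("Rob Smith", "Robert Smithson"))  # True
print(starts_with("Smith Rob", "Robert Smith"))     # False, order matters
```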
  • Starts With First: Similar to Starts With, but only the first token of each string is compared.
  • Metaphone: Checks whether two strings sound alike, even if they are spelled differently. It encodes each string into a Double Metaphone value.
  • Tokenize Jaro: Measures the similarity between two strings. It tokenizes the source and target strings, uses the Jaro-Winkler algorithm to calculate a score between tokens, and then consolidates the scores into a single score by taking the average.
  • Word Edit Distance (WED): The Word Edit Distance comparison determines how well multi-word String values match each other by calculating the minimum number of word edits (word insertions, deletions, and substitutions) required to transform one value into the other.

    WED is therefore similar to CED, except that it counts word edits instead of character edits.

    WED has an additional parameter called "character tolerance". The character tolerance specifies how many character edits are allowed within a token for it to still match another token.

    In simple terms, WED is the number of words that did not match the target words (after allowing for the character tolerance).

    Example:
    • String 1: "Yohan Russel Smith"
    • String 2: "Smith Johann Rusel"

    Yohan matches with Johann - CED: 2

    Russel matches with Rusel - CED: 1

    Smith matches with Smith - CED: 0
    • With a character tolerance of 1, the WED is 1:

      • Russel matches Rusel (CED 1 is within the tolerance of 1).
      • Smith matches Smith (CED 0).
      • Yohan does not match Johann (CED 2 exceeds the tolerance of 1).

      One token did not match. WED = 1.

    • With a character tolerance of 2, the WED is 0:

      • Russel matches Rusel (CED 1 is within the tolerance of 2).
      • Smith matches Smith (CED 0).
      • Yohan matches Johann (CED 2 is within the tolerance of 2).

      All tokens matched. WED = 0.
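
The example above can be sketched in code as follows. This is an illustrative approximation only: the greedy strategy of pairing each source word with its closest unused target word, the lowercasing, and the function names are assumptions, and the product may pair words differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_edit_distance(source: str, target: str, tolerance: int) -> int:
    """Count source words with no unused target word within `tolerance` edits."""
    remaining = target.lower().split()
    unmatched = 0
    for word in source.lower().split():
        # Greedy choice (assumption): take the closest target word still unused.
        best = min(remaining, key=lambda t: levenshtein(word, t), default=None)
        if best is not None and levenshtein(word, best) <= tolerance:
            remaining.remove(best)
        else:
            unmatched += 1
    return unmatched


print(word_edit_distance("Yohan Russel Smith", "Smith Johann Rusel", 1))  # 1
print(word_edit_distance("Yohan Russel Smith", "Smith Johann Rusel", 2))  # 0
```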

  • Word Match Percentage (WMP): The Word Match Percentage comparison determines how closely two multi-word values match each other by calculating the Word Edit Distance between two Strings and taking into account the length of the longer or the shorter of the two values, by word count.

    In mathematical terms, the Word Match Percentage comparison calculates its result from the number of tokens matched, using the following formula:

    WMP = (WMC / WL) * 100

    • WMP: Word Match Percentage
    • WMC: Word Match Count (the number of words that matched)
    • MWL: Maximum Word Length (i.e., the maximum number of words in the two values being compared)
    • WED: Word Edit Distance between two String values
    • WL: Minimum Word Length (the number of words in the shorter of the two values), relating to the shorter input option
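
As a quick check of the formula, the sketch below uses the values reported in Figure 3-1, where two words match (wmc = 2) and each name has three words once the stopword is ignored. The function name `word_match_percentage` and the handling of empty input are assumptions.

```python
def word_match_percentage(wmc: int, words_a: int, words_b: int) -> float:
    """WMP = (WMC / WL) * 100, where WL is the word count of the shorter value."""
    wl = min(words_a, words_b)
    if wl == 0:
        return 0.0  # assumption: nothing to match against
    return wmc / wl * 100


print(round(word_match_percentage(wmc=2, words_a=3, words_b=3), 2))  # 66.67
```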
  • Word Match Count (WMC): The Word Match Count comparison determines how closely two multi-word values match each other by counting the words in one String that match words in the other.

    WMC, in simple terms, is the number of words that matched the target words (after allowing for the character tolerance).

    WMC complements WED: WMC gives the number of words that match between two strings, and WED gives the number of words that do not.

    Note:

    Matches are also filtered on year of birth: if the years of birth differ by more than 5 years (+/- 5), the match is eliminated. This applies to Individual PEP and EDD records.

    Example 1:
    • String 1: "Yohan Russel Smith"
    • String 2: "Smith Johann Rusel"

      Yohan matches with Johann - CED: 2

      Russel matches with Rusel - CED: 1

      Smith matches with Smith - CED: 0

    • With a character tolerance of 1, the WMC is 2:

      • Russel matches Rusel (CED 1 is within the tolerance of 1).
      • Smith matches Smith (CED 0).
      • Yohan does not match Johann (CED 2 exceeds the tolerance of 1).

      Two words matched and one did not. WMC = 2.

    • With a character tolerance of 2, the WMC is 3:

      • Russel matches Rusel (CED 1 is within the tolerance of 2).
      • Smith matches Smith (CED 0).
      • Yohan matches Johann (CED 2 is within the tolerance of 2).

      All tokens matched. WMC = 3.
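
The sketch below counts matched words for Example 1 and shows how WMC and WED add up to the word count when both names have the same number of words. As before, the greedy pairing of words, the lowercasing, and the function names are assumptions rather than documented behavior.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_match_count(source: str, target: str, tolerance: int) -> int:
    """Count source words that pair with an unused target word within `tolerance`."""
    remaining = target.lower().split()
    matched = 0
    for word in source.lower().split():
        candidates = [t for t in remaining if levenshtein(word, t) <= tolerance]
        if candidates:
            remaining.remove(min(candidates, key=lambda t: levenshtein(word, t)))
            matched += 1
    return matched


source, target = "Yohan Russel Smith", "Smith Johann Rusel"
for tolerance in (1, 2):
    wmc = word_match_count(source, target, tolerance)
    wed = len(source.split()) - wmc  # WED as the complement of WMC
    print(tolerance, wmc, wed)  # prints "1 2 1", then "2 3 0"
```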

    Example 2:
    • String 1: “Mr Jerrod Benito Carrera”
    • String 2: “JOSE BENITO CABRERA”
    • Stopword: Mr. (It is not considered when calculating the score.)

    The individual feature vector scores, with the final score for the match:

    {"ced_list": [5, 0, 1], "ced_min": 0, "ced_max": 5, "cmp": 68.42105, "wed_1": 1, "wed_2": 1, "wmc_1": 2, "wmc_2": 2, "wmp_1": 66.0, "wmp_2": 66.666664, "metaphone": 1, "starts_with": 0, "abbreviation": 1, "tokenize_jaro": 0.8301587, "exact": 0, "inorderMaxPos": 0, "score": 0.88}

Figure 3-1 Individual Name Score
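
The "tokenize_jaro" value in Figure 3-1 can be approximated with a plain Jaro-Winkler implementation. The sketch below is not the product's code: it assumes tokens are lowercased, stopwords are dropped, tokens are paired by position, and the Winkler prefix bonus is applied only when the Jaro score exceeds 0.7 (a common convention). Under those assumptions the averaged score agrees with the figure.

```python
STOPWORDS = {"mr", "mrs", "ms"}


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    used1, used2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1, threshold: float = 0.7) -> float:
    """Jaro score plus a bonus for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    if score <= threshold:  # assumption: no prefix bonus for weak matches
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1.0 - score)


def tokenize_jaro(source: str, target: str) -> float:
    """Average Jaro-Winkler score over tokens paired by position (assumption)."""
    def tokens(name):
        words = [w.lower().rstrip(".") for w in name.split()]
        return [w for w in words if w not in STOPWORDS]
    pairs = list(zip(tokens(source), tokens(target)))
    if not pairs:
        return 0.0
    return sum(jaro_winkler(s, t) for s, t in pairs) / len(pairs)


score = tokenize_jaro("Mr Jerrod Benito Carrera", "JOSE BENITO CABRERA")
print(round(score, 7))  # 0.8301587
```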