2.2 Clustering
Table 2-3 Cluster Methods
Cluster Method | SAN | PEP | EDD |
---|---|---|---|
Family Name | Y | N | N |
Full Name Metaphone | Y | N | N |
Given Names | Y | N | N |
Full Name Trim | Y | N | N |
Nationality Prohibition | Y | N/A | N/A |
Residency Prohibition | Y | N/A | N/A |
Name and Country | N | Y | Y |
Name and YoB | N | Y | Y |
First and Last Name | N | Y | Y |
Original Script Name | N | N | N |
First Initial Last Name | N | N | N |
Note:
This table shows the default configuration of both Batch and RealTime screening processes, but these may be customized independently of one another.
The data used to create the clusters is generated before matching by the preparation process. In all cases, the clusters use the prepared and normalized name attributes dnGivenNames, dnFamilyName, dnFullName, and dnOriginalScriptName. For further information see Name Normalization.
Family Name Cluster (dnClusterFamilyName)
Table 2-4 Example of Full Name Cluster (dnFullName)
dnFullName | Name tokens and trimmed values | Name tokens and trimmed values | Cluster Keys | dnClusterFullNameTrim |
---|---|---|---|---|
STEPHEN JEQE NKOMO | JEQE | JEQ | JEQNKO JEQSTE NKOSTE | JEQNKO|JEQSTE|NKOSTE |
- | NKOMO | NKO | - | - |
- | STEPHEN | STE | - | - |
S J NKOMO | S | S | NKO | NKO |
- | NKOMO | NKO | - | - |
- | J | J | - | - |
STEPHEN JEKE N KOMO | JEKE | JEK | JEKKOM JEKSTE KOMSTE | JEKKOM|JEKSTE|KOMSTE |
- | KOMO | KOM | - | - |
- | N | N | - | - |
- | STEPHEN | STE | - | - |
Clustering only on the family name circumvents this issue but results in large clusters and a concomitant increase in the processing required to cross-check all the records.
The Family Name cluster builder counters spacing and punctuation differences by generating Metaphone keys for each token of the family name and for the whole family name after all white space is removed. This ensures that family names such as those in the last two records of the example table below are clustered together despite the spacing differences.
- Trim all white space from the normalized family name
- Apply the Metaphone transformation to the result, outputting a key with a length of up to 4 characters
- Strip common name qualifiers from the normalized family name, e.g. Abd, Al.
- Split the family name into several name tokens, using a space delimiter.
Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
- Apply the Metaphone transformation to each name token, outputting a key with a length of up to 4 characters. If there were no tokens remaining after stripping common name qualifiers, then apply the Metaphone transformation to each name token of the original normalized family name.
- Concatenate all the generated Metaphone keys
- Deduplicate the list of keys
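The steps above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `family_name_cluster`, the `QUALIFIERS` set, and the `metaphone` parameter are all assumptions. No Metaphone encoder is bundled here, so the caller supplies one; the usage lines use a hypothetical lookup table filled with values copied from the example table for this cluster. Keys are emitted tokens first and whole name last, which is the order the example table shows.

```python
from typing import Callable, Iterable

# Hypothetical qualifier list; the text only names "Abd" and "Al" as examples.
QUALIFIERS = frozenset({"ABD", "AL"})


def family_name_cluster(
    family_name: str,
    metaphone: Callable[[str], str],
    qualifiers: Iterable[str] = QUALIFIERS,
    max_len: int = 4,
) -> str:
    """Build a pipe-separated key list from a normalized family name."""
    tokens = family_name.split()
    if not tokens:
        return ""
    quals = set(qualifiers)
    # Strip common qualifiers; fall back to every token if nothing is left.
    kept = [t for t in tokens if t not in quals] or tokens
    keys = [metaphone(t)[:max_len] for t in kept]
    # Add the key for the whole family name with all white space removed.
    keys.append(metaphone("".join(tokens))[:max_len])
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))


# Stand-in for a real Metaphone encoder: values copied from the example table.
SAMPLE_CODES = {
    "HAFIZ": "HFS", "ABDALHAFIZ": "APTL", "AL": "AL",
    "N": "N", "KOMO": "KM", "NKOMO": "NKM",
}
lookup = SAMPLE_CODES.__getitem__

print(family_name_cluster("ABD AL HAFIZ", lookup))  # HFS|APTL
print(family_name_cluster("N KOMO", lookup))        # N|KM|NKM
print(family_name_cluster("NKOMO", lookup))         # NKM
```

Because "N KOMO" and "NKOMO" both produce the key NKM, the two records land in the same cluster despite the spacing difference.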
Table 2-5 Example of Family Name Cluster (dnFamilyName)
dnFamilyName | Tokens derived from dnFamilyName | Metaphone transformations | dnClusterFamilyName |
---|---|---|---|
ZHONG | ZHONG | JNK | JNK |
XIAOJIAN | XIAOJIAN | SJN | SJN |
ABACHE | ABACHE | APX | APX |
ABANDA | ABANDA | APNT | APNT |
ABD AL HAFIZ | HAFIZ ABDALHAFIZ | HFS APTL | HFS|APTL |
AL BUTHE | BUTHE ALBUTHE | P0 ALP0 | P0|ALP0 |
AL | AL | AL | AL |
SOLEIMAN HAMAD | SOLEIMAN HAMAD SOLEIMANHAMAD | SLMN HMT SLMN | SLMN|HMT |
GOODRIDGE | GOODRIDGE | KTRJ | KTRJ |
GOODRICH SR | GOODRICH SR GOODRICHSR | KTRX SR KTRK | KTRX|SR|KTRK |
NKOMO | NKOMO | NKM | NKM |
N KOMO | N KOMO NKOMO | N KM NKM | N|KM|NKM |
Full Name Metaphone Pairs Cluster (dnClusterFullNameMeta)
- Split the normalized full name into several name tokens, using space as a delimiter.
Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
- Sort the name tokens alphabetically.
- Apply the Metaphone transformation (the standard double-metaphone algorithm) to each name token, outputting a key with a length of up to three characters.
- Concatenate the Metaphone values, generating a final key value for each distinct pair of tokens.
- Deduplicate the list of keys.
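A minimal Python sketch of the pair-generation steps above, under stated assumptions: the function name `pair_cluster_keys` and its parameters are hypothetical, the encoder is passed in by the caller rather than implemented here, and the handling of initials (single-character tokens are dropped, and a lone remaining token yields its own code as the key) is inferred from the example table rather than from a documented rule.

```python
from itertools import combinations
from typing import Callable


def pair_cluster_keys(
    full_name: str,
    encode: Callable[[str], str],
    max_len: int = 3,
    ignore_initials: bool = True,
) -> str:
    """Build pipe-separated keys from every distinct pair of name tokens."""
    tokens = sorted(full_name.split())
    if ignore_initials:
        # Assumption: an initial is any single-character token.
        tokens = [t for t in tokens if len(t) > 1]
    codes = [encode(t)[:max_len] for t in tokens]
    if len(codes) == 1:
        keys = codes  # only one usable token: its code is the key
    else:
        keys = [a + b for a, b in combinations(codes, 2)]
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))


# Stand-in for a real Metaphone encoder: values copied from the example table.
SAMPLE_CODES = {
    "JIAN": "JN", "XIAO": "S", "ZHONG": "JNK",
    "ABACHE": "APX", "MOHAMMED": "MHMT", "SANI": "SN", "NKOMO": "NKM",
}
lookup = SAMPLE_CODES.__getitem__

print(pair_cluster_keys("XIAO JIAN ZHONG", lookup))       # JNS|JNJNK|SJNK
print(pair_cluster_keys("MOHAMMED SANI ABACHE", lookup))  # APXMHM|APXSN|MHMSN
print(pair_cluster_keys("S J NKOMO", lookup))             # NKM
```

Passing an identity function as `encode` (for example `lambda t: t`) reduces each token to its first three characters instead, which reproduces the trimmed-token keys shown in the table whose last column is dnClusterFullNameTrim.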
Table 2-6 Full Name Metaphone Pairs Cluster
dnFullName | Name tokens and Metaphone values | Name tokens and Metaphone values | Distinct Cluster Keys | dnClusterFullNameMeta |
---|---|---|---|---|
XIAO JIAN ZHONG | JIAN | JN | JNS JNJNK SJNK | JNS|JNJNK|SJNK |
- | XIAO | S | - | - |
- | ZHONG | JNK | - | - |
ZHONG XIAOJIAN | XIAOJIAN | SJN | SJNJNK | SJNJNK |
- | ZHONG | JNK | - | - |
MOHAMMED SANI ABACHE | ABACHE | APX | APXMHM APXSN MHMSN | APXMHM|APXSN|MHMSN |
- | MOHAMMED | MHMT | - | - |
- | SANI | SN | - | - |
JOSEPH TSANGA ABANDA | ABANDA | APNT | APNJSF APNTSN JSFTSN | APNJSF|APNTSN|JSFTSN |
- | JOSEPH | JSF | - | - |
- | TSANGA | TSNK | - | - |
ABD AL WAHAB ABD AL HAFIZ | ABD | APT | APTAPT APTAL APTHFS APTAHP ALAL ALHFS ALAHP HFSAHP | APTAPT|APTAL|APTHFS|APTAHP|ALAL|ALHFS|ALAHP|HFSAHP |
- | ABD | APT | - | - |
- | AL | AL | - | - |
- | AL | AL | - | - |
- | HAFIZ | HFS | - | - |
- | WAHAB | AHP | - | - |
SULIMAN HAMD SULEIMAN AL BUTHE | AL | AL | ALP0 ALHMT ALSLM P0HMT P0SLM HMTSLM SLMSLM | ALP0|ALHMT|ALSLM|P0HMT|P0SLM|HMTSLM|SLMSLM |
- | BUTHE | P0 | - | - |
- | HAMD | HMT | - | - |
- | SULEIMAN | SLMN | - | - |
- | SULIMAN | SLMN | - | - |
AL BUTHE SOLEIMAN HAMAD | AL | AL | ALP0 ALHMT ALSLM P0HMT P0SLM HMTSLM | ALP0|ALHMT|ALSLM|P0HMT|P0SLM|HMTSLM |
- | BUTHE | P0 | - | - |
- | HAMAD | HMT | - | - |
- | SOLEIMAN | SLMN | - | - |
REGINALD B GOODRIDGE | B | P | KTRRJN (Note: Initials are ignored by default when generating cluster keys) | KTRRJN |
- | GOODRIDGE | KTRJ | - | - |
- | REGINALD | RJNLT | - | - |
REGINALD B SR GOODRICH | B | P | KTRRJN KTRSR RJNSR (Note: Initials are ignored by default when generating cluster keys) | KTRRJN|KTRSR|RJNSR |
- | GOODRICH | KTRX | - | - |
- | REGINALD | RJNLT | - | - |
- | SR | SR | - | - |
STEPHEN JEQE NKOMO | JEQE | JK | JKNKM JKSTF NKMSTF | JKNKM|JKSTF|NKMSTF |
- | NKOMO | NKM | - | - |
- | STEPHEN | STFN | - | - |
S J NKOMO | J | J | NKM (Note: Initials are ignored by default when generating cluster keys) | NKM |
- | NKOMO | NKM | - | - |
- | S | S | - | - |
STEPHEN JEKE N KOMO | JEKE | JK | JKKM JKSTF KMSTF | JKKM|JKSTF|KMSTF |
- | KOMO | KM | - | - |
- | N | N | - | - |
- | STEPHEN | STFN | - | - |
Given Names Cluster (dnClusterGivenNames)
Note:
Depending on the quality and culture of the name information, this cluster will often not be required. You can test the number of additional alerts identified by the cluster by running matching with this cluster disabled, and then running with it enabled. Comparing the new relationships against the old will highlight the relationships identified by using this cluster.
- Split the normalized full name into several name tokens, using space as a delimiter.
Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
- Standardize the normalized given names before clustering. This ensures that names such as 'William' and 'Bill' will be clustered together, although their raw Metaphone values are not the same. A space delimiter is used to split the name before standardizing.
- Apply the Metaphone transformation to the whole of the given names value after token standardization, outputting a key with a length of up to 4 characters.
Table 2-7 Given Names Cluster
dnFullName | Name tokens and trimmed values | Name tokens and trimmed values | Cluster Keys | dnClusterFullNameTrim |
---|---|---|---|---|
XIAO JIAN ZHONG | JIAN | JIA | JIAXIA JIAZHO XIAZHO | JIAXIA|JIAZHO|XIAZHO |
- | XIAO | XIA | - | - |
- | ZHONG | ZHO | - | - |
ZHONG XIAOJIAN | XIAOJIAN | XIA | XIAZHO | XIAZHO |
- | ZHONG | ZHO | - | - |
MOHAMMED SANI ABACHE | ABACHE | ABA | ABAMOH ABASAN MOHSAN | ABAMOH|ABASAN|MOHSAN |
- | MOHAMMED | MOH | - | - |
- | SANI | SAN | - | - |
JOSEPH TSANGA ABANDA | ABANDA | ABA | ABAJOS ABATSA JOSTSA | ABAJOS|ABATSA|JOSTSA |
- | JOSEPH | JOS | - | - |
- | TSANGA | TSA | - | - |
ABD AL WAHAB ABD AL HAFIZ | ABD | ABD | ABDABD ABDAL ABDHAF ABDWAH ALAL ALHAF ALWAH HAFWAH | ABDABD|ABDAL|ABDHAF|ABDWAH|ALAL|ALHAF|ALWAH|HAFWAH |
- | ABD | ABD | - | - |
- | AL | AL | - | - |
- | AL | AL | - | - |
- | HAFIZ | HAF | - | - |
- | WAHAB | WAH | - | - |
SULIMAN HAMD SULEIMAN AL BUTHE | AL | AL | ALBUT ALHAM ALSUL ALSUL BUTHAM BUTSUL HAMSUL SULSUL | ALBUT|ALHAM|ALSUL|BUTHAM|BUTSUL|HAMSUL|SULSUL |
- | BUTHE | BUT | - | - |
- | HAMD | HAM | - | - |
- | SULEIMAN | SUL | - | - |
- | SULIMAN | SUL | - | - |
AL BUTHE SOLEIMAN HAMAD | AL | AL | ALBUT ALHAM ALSOL BUTHAM BUTSOL HAMSOL | ALBUT|ALHAM|ALSOL|BUTHAM|BUTSOL|HAMSOL |
- | BUTHE | BUT | - | - |
- | HAMAD | HAM | - | - |
- | SOLEIMAN | SOL | - | - |
REGINALD B GOODRIDGE | B | B | GOOREG (Note: Initials are ignored by default when generating cluster keys) | GOOREG |
- | GOODRIDGE | GOO | - | - |
- | REGINALD | REG | - | - |
REGINALD B SR GOODRICH | B | B | GOOREG GOOSR REGSR | GOOREG|GOOSR|REGSR |
- | GOODRICH | GOO | - | - |
- | REGINALD | REG | - | - |
- | SR | SR | - | - |
STEPHEN JEQE NKOMO | JEQE | JEQ | JEQNKO JEQSTE NKOSTE | JEQNKO|JEQSTE|NKOSTE |
- | NKOMO | NKO | - | - |
- | STEPHEN | STE | - | - |
S J NKOMO | S | S | NKO (Note: Initials are ignored by default when generating cluster keys) | NKO |
- | NKOMO | NKO | - | - |
STEPHEN JEKE N KOMO | JEKE | JEK | JEKKOM JEKSTE KOMSTE (Note: Initials are ignored by default when generating cluster keys) | JEKKOM|JEKSTE|KOMSTE |
Nationality Prohibition (Nationality Code)
This cluster splits the space-delimited list of nationality country codes into an array of its component country codes, each of which is used as a cluster key.
Residency Prohibition (Residency Code)
This cluster splits the space-delimited list of residency country codes into an array of its component country codes, each of which is used as a cluster key.
Name and Country (dnClusterNameCountry)
The Name and Country cluster provides a backup using more detailed information about names, combined with country information. The cluster is used to compare very similar names that are associated with the same countries.
- Split the normalized Full Name into name tokens, using space as a delimiter.
Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
- Apply the Metaphone transformation to each name token, outputting a key with a length of up to twelve characters.
- Sort the Metaphone values alphabetically.
- For each country code associated with the record:
- Concatenate the country code with the full set of Metaphone values, using an underscore as a separator.
- If more than two Metaphone values are present, then iterate through all groups of Metaphone values which have exactly one value from the set missing, concatenating the country code onto the front of the Metaphone value set.
- If the overall length of the dnClusterNameCountry field has exceeded 1000 characters, discard the last key and stop key generation.
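The key generation described above can be sketched in Python as follows. This is an illustration under assumptions, not the product's code: `name_country_keys` and its parameters are hypothetical names, the Metaphone encoder is supplied by the caller, and the field length is measured on the pipe-joined key list. Keys are not deduplicated here, because the example table for this cluster lists repeated keys.

```python
from typing import Callable, List, Sequence


def name_country_keys(
    full_name: str,
    country_codes: Sequence[str],
    metaphone: Callable[[str], str],
    max_len: int = 12,
    max_field_len: int = 1000,
) -> str:
    """Build pipe-separated COUNTRY_CODE1_CODE2... keys for one record."""
    codes = sorted(metaphone(t)[:max_len] for t in full_name.split())
    if not codes:
        return ""
    keys: List[str] = []
    for country in country_codes:
        groups = [codes]  # the full set of Metaphone values first
        if len(codes) > 2:
            # Then every group with exactly one value left out.
            groups += [codes[:i] + codes[i + 1:] for i in range(len(codes))]
        for group in groups:
            keys.append("_".join([country, *group]))
            if len("|".join(keys)) > max_field_len:
                keys.pop()  # discard the key that overflowed and stop
                return "|".join(keys)
    return "|".join(keys)


# Stand-in for a real Metaphone encoder: values copied from the example table.
SAMPLE_CODES = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "SULIMAN": "SLMN", "HAMD": "HMT", "SULEIMAN": "SLMN",
}
lookup = SAMPLE_CODES.__getitem__

print(name_country_keys("MOHAMMED SANI", ["ES", "GB"], lookup))
# ES_MHMT_SN|GB_MHMT_SN
print(name_country_keys("SULIMAN HAMD SULEIMAN", ["ES"], lookup))
# ES_HMT_SLMN_SLMN|ES_SLMN_SLMN|ES_HMT_SLMN|ES_HMT_SLMN
```

The leave-one-out groups are what let two records share a key when one of them carries an extra name token, provided they also share a country code.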
Table 2-8 Name and Country
dnFullName | Country Codes | Name tokens and Metaphone values | Name tokens and Metaphone values | Cluster Keys | dnClusterNameCountry |
---|---|---|---|---|---|
MOHAMMED SANI | ES GB | MOHAMMED | MHMT | ES_MHMT_SN GB_MHMT_SN | ES_MHMT_SN|GB_MHMT_SN |
- | - | SANI | SN | - | - |
SULIMAN HAMD SULEIMAN | ES TH GB | HAMD | HMT | ES_HMT_SLMN_SLMN ES_SLMN_SLMN ES_HMT_SLMN ES_HMT_SLMN TH_HMT_SLMN_SLMN TH_SLMN_SLMN TH_HMT_SLMN TH_HMT_SLMN GB_HMT_SLMN_SLMN GB_SLMN_SLMN GB_HMT_SLMN GB_HMT_SLMN | ES_HMT_SLMN_SLMN|ES_SLMN_SLMN|ES_HMT_SLMN|ES_HMT_SLMN|TH_HMT_SLMN_SLMN|TH_SLMN_SLMN|TH_HMT_SLMN|TH_HMT_SLMN|GB_HMT_SLMN_SLMN|GB_SLMN_SLMN|GB_HMT_SLMN|GB_HMT_SLMN |
- | - | SULEIMAN | SLMN | - | - |
- | - | SULIMAN | SLMN | - | - |
Name and YOB (dnClusterNameYOB)
The Name and YOB cluster provides a backup using more detailed information about names and initials combining them with years of birth.
- Standardize dnGivenNames and dnFamilyName;
- Apply transliteration followed by the Metaphone transformation to the standardized given name, outputting a key with a length of up to four characters;
- Apply transliteration followed by the Metaphone transformation to the standardized family name, outputting a key with a length of up to four characters;
- Extract and uppercase the first letter of the standardized dnGivenNames;
- Extract and uppercase the first letter of the standardized dnFamilyName;
- Extract the first two years of birth from dnYOB to generate two values (referred to as 'First YOB' and 'Second YOB' in the remainder of this example);
- Create up to four cluster keys by concatenating the following combinations of elements, using the underscore character as a separator:
- First YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
- First YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).
- Second YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
- Second YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).
Note:
If any of the required data elements are missing, then the corresponding cluster key will not be generated.
- Deduplicate the list of keys.
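As a rough Python sketch of the key construction above: the function name `name_yob_keys` is hypothetical, the inputs are assumed to be already standardized and transliterated (those two steps are not implemented), and the Metaphone encoder is supplied by the caller. The stand-in lookup table uses values taken from the example table for this cluster, except the code for "SULIMAN HAMD", which is an assumed value chosen so that its first four characters match the table.

```python
from typing import Callable


def name_yob_keys(
    given_names: str,
    family_name: str,
    yob: str,
    metaphone: Callable[[str], str],
    max_len: int = 4,
) -> str:
    """Build up to four pipe-separated YOB_INITIAL_METAPHONE keys."""
    years = yob.split()[:2]  # only the first two years of birth are used
    given_code = metaphone(given_names)[:max_len] if given_names else ""
    family_code = metaphone(family_name)[:max_len] if family_name else ""
    given_initial = given_names[:1].upper()
    family_initial = family_name[:1].upper()
    keys = []
    for year in years:
        # A key is skipped when any of its elements is missing.
        if family_initial and given_code:
            keys.append(f"{year}_{family_initial}_{given_code}")
        if given_initial and family_code:
            keys.append(f"{year}_{given_initial}_{family_code}")
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))


# Stand-in for a real Metaphone encoder (see the assumptions above).
SAMPLE_CODES = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "SULIMAN HAMD": "SLMNHMT", "SULEIMAN": "SLMN",
}
lookup = SAMPLE_CODES.__getitem__

print(name_yob_keys("MOHAMMED", "SANI", "1969 1970 1971", lookup))
# 1969_S_MHMT|1969_M_SN|1970_S_MHMT|1970_M_SN
print(name_yob_keys("SULIMAN HAMD", "SULEIMAN", "1980 1981 1982", lookup))
# 1980_S_SLMN|1981_S_SLMN
```

In the second call both combinations for a given year produce the same key, so deduplication leaves one key per year.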
Table 2-9 Name and YOB
dnGivenNames, dnFamilyName | dnYOB | Name tokens and Metaphone values | Name tokens and Metaphone values | Cluster Keys | dnClusterNameYOB |
---|---|---|---|---|---|
MOHAMMED, SANI | 1969 1970 1971 | MOHAMMED | MHMT | 1969_S_MHMT 1969_M_SN 1970_S_MHMT 1970_M_SN | 1969_S_MHMT|1969_M_SN|1970_S_MHMT|1970_M_SN |
SULIMAN HAMD, SULEIMAN | 1980 1981 1982 | HAMD | HMT | 1980_S_SLMN 1981_S_SLMN | 1980_S_SLMN|1981_S_SLMN |
- | - | SULEIMAN | SLMN | - | - |
- | - | SULIMAN | SLMN | - | - |
First and Last Name (dnClusterFirstLast)
The First and Last Name cluster provides a tighter, name-only clustering method. It relies on the first given name and the last family name matching after standardization, and allows for variation in any of the name tokens in between.
- Strip initials from the normalized given name and family name.
- Strip all common name qualifiers from the normalized given names and family name, e.g. Al, Bin, Von.
- Extract the first token from the stripped given names. If all tokens were stripped in steps 1 and 2, then extract the first token from the original normalized given names.
- Extract the last token from the stripped family name. If all tokens were stripped in steps 1 and 2, then extract the last token from the original normalized family name.
- Trim the extracted values to a maximum length of 4 characters.
- Sort the trimmed values alphabetically and concatenate to generate the final key value.
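The following Python sketch walks through the steps above. It is illustrative only: `first_last_cluster`, the helper `_pick`, and the `QUALIFIERS` set are hypothetical names, and the qualifier list is an assumption (the text names Al, Bin, and Von; "ABU" is added here because the example table for this cluster treats it as a qualifier). An initial is assumed to be any single-character token.

```python
from typing import Iterable

# Hypothetical qualifier list; see the assumptions in the lead-in.
QUALIFIERS = frozenset({"AL", "BIN", "VON", "ABU"})


def _pick(name: str, qualifiers: set, last: bool) -> str:
    """Return the first (or last) token after stripping initials and qualifiers."""
    tokens = name.split()
    stripped = [t for t in tokens if len(t) > 1 and t not in qualifiers]
    pool = stripped or tokens  # fall back to the original tokens
    if not pool:
        return ""
    return pool[-1] if last else pool[0]


def first_last_cluster(
    given_names: str,
    family_name: str,
    qualifiers: Iterable[str] = QUALIFIERS,
    max_len: int = 4,
) -> str:
    """Build the cluster key from the first given name and last family name."""
    quals = set(qualifiers)
    first = _pick(given_names, quals, last=False)[:max_len]
    last = _pick(family_name, quals, last=True)[:max_len]
    # Sort the two trimmed values so transposed names share a key.
    return "".join(sorted(v for v in (first, last) if v))


print(first_last_cluster("OSVALDO ANTONIO", "CASTELL VALDEZ"))  # OSVAVALD
print(first_last_cluster("ABU MAHDI", "AL MUHANDIS"))           # MAHDMUHA
print(first_last_cluster("ABU", "NIDAL"))                       # ABUNIDA
print(first_last_cluster("V U", "SHEIMAN"))                     # SHEIV
```

Sorting the two trimmed values before concatenation means a record with given and family names swapped produces the same key.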
Table 2-10 First and Last Name
dnGivenNames | dnFamilyName | Extracted Values | Extracted Values | dnClusterFirstLast |
---|---|---|---|---|
OSVALDO ANTONIO | CASTELL VALDEZ | OSVALDO | VALDEZ | OSVAVALD |
ABU MAHDI | AL MUHANDIS | MAHDI | MUHANDIS | MAHDMUHA |
ABU | NIDAL | ABU | NIDAL | ABUNIDA |
V U | SHEIMAN | V | SHEIMAN | SHEIV |
Original Script Name (dnClusterOriginalScript)
Note:
A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, as it does not use a space character between words. As a result, all original script names in the Burmese script will be compared during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.
- Split the original script name into several name tokens, using a space character as the delimiter.
- Trim each name token to a maximum of 5 characters.
- Concatenate all of the trimmed token values with a pipe separator.
- Deduplicate the list of keys.
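A short Python sketch of these steps, including the "Myanmar" special case from the note. The function name `original_script_cluster` is hypothetical, and detecting Burmese script by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption; how the product recognizes the script is not stated. Trimming here counts Unicode code points, which may differ from user-perceived characters in scripts that use combining marks.

```python
def original_script_cluster(name: str, max_len: int = 5) -> str:
    """Build pipe-separated keys from an original script name."""
    # Assumption: any character in the Myanmar block marks a Burmese name.
    if any("\u1000" <= ch <= "\u109f" for ch in name):
        return "Myanmar"
    keys = [token[:max_len] for token in name.split()]
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))


print(original_script_cluster("Iван Антонавiч Шчурок"))  # Iван|Антон|Шчуро
print(original_script_cluster("\u1019\u103c\u1014\u103a\u1019\u102c"))  # Myanmar
```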
Table 2-11 Original Script Name
dnOriginalScriptName | dnClusterOriginalScript |
---|---|
Iван Антонавiч Шчурок | Iван|Антон|Шчуро |
(name written in Burmese script) | Myanmar |
First Initial Last Name (dnClusterInitials)
The First Initial Last Name cluster provides a clustering method to group together names that share the same first name initial and last name and allows some variation for transposed names.
- Split the normalized given names into several name tokens, using a space character as the delimiter.
- Split the normalized family name into several name tokens, using a space character as the delimiter.
- Generate the cluster key value as follows:
- If there are two or more characters in the last token of the family name, then concatenate the first character of the given name with the last token of the family name.
- If the last token of the family name is a single initial, then concatenate that character with the first token of the given name.
- Trim the cluster key to a maximum of 12 characters.
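The rules above can be sketched in Python as follows. The function name `first_initial_last_cluster` and its parameters are hypothetical, and returning an empty key when either name is missing is an assumption, since the text does not cover that case.

```python
def first_initial_last_cluster(
    given_names: str, family_name: str, max_len: int = 12
) -> str:
    """Build the first-initial-plus-last-name cluster key."""
    given = given_names.split()
    family = family_name.split()
    if not given or not family:
        return ""  # assumption: no key when either name is missing
    last = family[-1]
    if len(last) >= 2:
        # Usual case: first character of the given name + last family token.
        key = given[0][0] + last
    else:
        # Transposed case: the family name holds only an initial.
        key = last + given[0]
    return key[:max_len]


print(first_initial_last_cluster("MARTIN", "JONES"))         # MJONES
print(first_initial_last_cluster("MARTIN", "MORGAN JONES"))  # MJONES
print(first_initial_last_cluster("JONES", "M"))              # MJONES
```

The second branch is what allows a transposed record such as given name JONES with family name M to fall into the same cluster as MARTIN JONES.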
Table 2-12 First Initial Last Name
dnGivenNames | dnFamilyName | dnClusterInitials |
---|---|---|
MARTIN | JONES | MJONES |
MARTIN PETER | JONES | MJONES |
MARTIN | MORGAN JONES | MJONES |
JONES | M | MJONES |