2.2 Clustering

Oracle Financial Services Customer Screening provides eleven clusters for matching individuals to watch lists during Sanctions screening, and nine clusters for PEP and EDD screening:

Table 2-3 Cluster Methods

Cluster Method             SAN   PEP   EDD
Family Name                Y     N     N
Full Name Metaphone        Y     N     N
Given Names                Y     N     N
Full Name Trim             Y     N     N
Nationality Prohibition    Y     N/A   N/A
Residency Prohibition      Y     N/A   N/A
Name and Country           N     Y     Y
Name and YOB               N     Y     Y
First and Last Name        N     Y     Y
Original Script Name       N     N     N
First Initial Last Name    N     N     N

Note:

This table shows the default configuration of both Batch and RealTime screening processes, but these may be customized independently of one another.

The data used to build the clusters is produced by the preparation process before matching. In all cases, the clusters use the prepared and normalized name attributes dnGivenNames, dnFamilyName, dnFullName, and dnOriginalScriptName. For further information, see Name Normalization.

Family Name Cluster (dnClusterFamilyName)

The Family Name cluster provides a backup to the full name clusters. This is especially important where the given name data is incomplete, making it difficult to form a complete cluster key for two names. For example, the following three records do not share any Full Name cluster keys, because of the initials in the second record and the spacing and spelling variations seen throughout:

Table 2-4 Example of Full Name Cluster (dnFullName)

dnFullName             Name token   Trimmed value   Cluster Keys (dnClusterFullNameTrim)
STEPHEN JEQE NKOMO     JEQE         JEQ             JEQNKO|JEQSTE|NKOSTE
                       NKOMO        NKO
                       STEPHEN      STE
S J NKOMO              S            S               NKO
                       NKOMO        NKO
                       J            J
STEPHEN JEKE N KOMO    JEKE         JEK             JEKKOM|JEKSTE|KOMSTE
                       KOMO         KOM
                       N            N
                       STEPHEN      STE
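The lack of overlap between these three records can be checked with a short sketch. The key rule used here (sort the tokens, drop single-letter initials, trim each token to three characters, and join every pair) is inferred from Table 2-4 rather than taken from the product, and `trim_pair_keys` is a hypothetical name.

```python
from itertools import combinations


def trim_pair_keys(full_name, length=3):
    """Build pairwise keys from trimmed name tokens (rule inferred from Table 2-4)."""
    # Sort the tokens and ignore single-letter initials.
    tokens = sorted(t for t in full_name.split() if len(t) > 1)
    trimmed = [t[:length] for t in tokens]
    if len(trimmed) < 2:
        # A lone remaining token is used as the key on its own.
        return set(trimmed)
    return {a + b for a, b in combinations(trimmed, 2)}


records = ["STEPHEN JEQE NKOMO", "S J NKOMO", "STEPHEN JEKE N KOMO"]
keys = [trim_pair_keys(r) for r in records]
# True when no two records have a key in common.
no_shared_keys = all(a.isdisjoint(b) for a, b in combinations(keys, 2))
```

With these inputs, `no_shared_keys` is true, so none of the three records would be compared with another on the basis of this cluster alone.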

Clustering only on the family name circumvents this issue but results in large clusters and a concomitant increase in the processing required to cross-check all the records.

The Family Name cluster builder counters spacing and punctuation differences by generating Metaphone keys for every token of the family name and for the whole family name after all white space is trimmed. This ensures that family names such as those in the last two records of Table 2-5 are clustered together despite the spacing differences.

The default logic of the cluster builder is as follows:
  • Trim all white space from the normalized family name
  • Apply the Metaphone transformation to the result, outputting a key with a length of up to 4 characters
  • Strip common name qualifiers from the normalized family name, e.g. Abd, Al.
  • Split the family name into several name tokens, using a space delimiter.

    Note:

    Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
  • Apply the Metaphone transformation to each name token, outputting a key with a length of up to 4 characters. If there were no tokens remaining after stripping common name qualifiers, then apply the Metaphone transformation to each name token of the original normalized family name.
  • Concatenate all the generated Metaphone keys
  • Deduplicate the list of keys
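The steps above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the Metaphone step is passed in as a function, and the stand-in used here simply looks up codes copied from Table 2-5. The qualifier list and all function names are hypothetical, and the keys are emitted tokens first and whole name last so that the output order matches the table.

```python
# Stand-in for the Metaphone step: codes copied from Table 2-5.
TABLE_2_5_CODES = {
    "HAFIZ": "HFS", "ABDALHAFIZ": "APTL",
    "BUTHE": "P0", "ALBUTHE": "ALP0",
    "AL": "AL",
    "N": "N", "KOMO": "KM", "NKOMO": "NKM",
    "SOLEIMAN": "SLMN", "HAMAD": "HMT", "SOLEIMANHAMAD": "SLMN",
}
# Illustrative qualifier list only; the product's full list is not shown here.
QUALIFIERS = {"ABD", "AL"}


def table_encode(token):
    """Look up a code from Table 2-5 (a real Metaphone encoder would go here)."""
    return TABLE_2_5_CODES[token]


def family_name_keys(family_name, encode, max_len=4):
    tokens = family_name.split()
    # Key for the whole family name with all white space removed.
    whole_key = encode("".join(tokens))[:max_len]
    # Strip common qualifiers; fall back to the original tokens if none remain.
    kept = [t for t in tokens if t not in QUALIFIERS] or tokens
    keys = [encode(t)[:max_len] for t in kept] + [whole_key]
    # Deduplicate while keeping first-seen order, then join with pipes.
    return "|".join(dict.fromkeys(keys))
```

For example, `family_name_keys("N KOMO", table_encode)` gives `N|KM|NKM`, which shares the key `NKM` with the single-token family name NKOMO.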
Example

Table 2-5 Example of Family Name Cluster (dnFamilyName)

dnFamilyName      Tokens derived from dnFamilyName   Metaphone transformations   dnClusterFamilyName
ZHONG             ZHONG                              JNK                         JNK
XIAOJIAN          XIAOJIAN                           SJN                         SJN
ABACHE            ABACHE                             APX                         APX
ABANDA            ABANDA                             APNT                        APNT
ABD AL HAFIZ      HAFIZ ABDALHAFIZ                   HFS APTL                    HFS|APTL
AL BUTHE          BUTHE ALBUTHE                      P0 ALP0                     P0|ALP0
AL                AL                                 AL                          AL
SOLEIMAN HAMAD    SOLEIMAN HAMAD SOLEIMANHAMAD       SLMN HMT SLMN               SLMN|HMT
GOODRIDGE         GOODRIDGE                          KTRJ                        KTRJ
GOODRICH SR       GOODRICH SR GOODRICHSR             KTRX SR KTRK                KTRX|SR|KTRK
NKOMO             NKOMO                              NKM                         NKM
N KOMO            N KOMO NKOMO                       N KM NKM                    N|KM|NKM

Full Name Metaphone Pairs Cluster (dnClusterFullNameMeta)

The Full Name Metaphone Pairs cluster uses the normalized full name for the individual to generate a cluster key for every pair of names within the full name. The default logic of the cluster builder is as follows:
  • Split the normalized full name into several name tokens, using space as a delimiter.

    Note:

    Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
  • Sort the name tokens alphabetically.
  • Apply the Metaphone transformation (the standard double-metaphone algorithm) to each name token, outputting a key with a length of up to three characters.
  • Concatenate the Metaphone values, generating a final key value for each distinct pair of tokens.
  • Deduplicate the list of keys.
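A minimal sketch of these steps is shown below. The Metaphone encoder is passed in as a function, and the stand-in used here only looks up codes copied from Table 2-6. The function names are hypothetical. Dropping single-letter initials and falling back to a single code when only one token remains are behaviors inferred from the example table, not from the product.

```python
from itertools import combinations

# Stand-in for the Metaphone step: codes copied from Table 2-6.
TABLE_2_6_CODES = {
    "JIAN": "JN", "XIAO": "S", "ZHONG": "JNK",
    "AL": "AL", "BUTHE": "P0", "HAMD": "HMT",
    "SULEIMAN": "SLMN", "SULIMAN": "SLMN",
    "GOODRIDGE": "KTRJ", "REGINALD": "RJNLT", "NKOMO": "NKM",
}


def table_encode(token):
    """Look up a code from Table 2-6 (a real Metaphone encoder would go here)."""
    return TABLE_2_6_CODES[token]


def metaphone_pair_keys(full_name, encode, max_len=3):
    # Split on spaces, drop single-letter initials, and sort alphabetically.
    tokens = sorted(t for t in full_name.split() if len(t) > 1)
    # Encode each token and cap each code at max_len characters.
    codes = [encode(t)[:max_len] for t in tokens]
    # One key per pair of tokens, in sorted-token order.
    pairs = [a + b for a, b in combinations(codes, 2)]
    # With a single remaining token, its code is used on its own.
    keys = pairs or codes
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))
```

For example, `metaphone_pair_keys("XIAO JIAN ZHONG", table_encode)` gives `JNS|JNJNK|SJNK`.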
Example

Table 2-6 Full Name Metaphone Pairs Cluster

dnFullName                       Name token   Metaphone value   Distinct Cluster Keys (dnClusterFullNameMeta)
XIAO JIAN ZHONG                  JIAN         JN                JNS|JNJNK|SJNK
                                 XIAO         S
                                 ZHONG        JNK
ZHONG XIAOJIAN                   XIAOJIAN     SJN               SJNJNK
                                 ZHONG        JNK
MOHAMMED SANI ABACHE             ABACHE       APX               APXMHM|APXSN|MHMSN
                                 MOHAMMED     MHMT
                                 SANI         SN
JOSEPH TSANGA ABANDA             ABANDA       APNT              APNJSF|APNTSN|JSFTSN
                                 JOSEPH       JSF
                                 TSANGA       TSNK
ABD AL WAHAB ABD AL HAFIZ        ABD          APT               APTAPT|APTAL|APTHFS|APTAHP|ALAL|ALHFS|ALAHP|HFSAHP
                                 ABD          APT
                                 AL           AL
                                 AL           AL
                                 HAFIZ        HFS
                                 WAHAB        AHP
SULIMAN HAMD SULEIMAN AL BUTHE   AL           AL                ALP0|ALHMT|ALSLM|P0HMT|P0SLM|HMTSLM|SLMSLM
                                 BUTHE        P0
                                 HAMD         HMT
                                 SULEIMAN     SLMN
                                 SULIMAN      SLMN
AL BUTHE SOLEIMAN HAMAD          AL           AL                ALP0|ALHMT|ALSLM|P0HMT|P0SLM|HMTSLM
                                 BUTHE        P0
                                 HAMAD        HMT
                                 SOLEIMAN     SLMN
REGINALD B GOODRIDGE             B            P                 KTRRJN
                                 GOODRIDGE    KTRJ
                                 REGINALD     RJNLT
REGINALD B SR GOODRICH           B            P                 KTRRJN|KTRSR|RJNSR
                                 GOODRICH     KTRX
                                 REGINALD     RJNLT
                                 SR           SR
STEPHEN JEQE NKOMO               JEQE         JK                JKNKM|JKSTF|NKMSTF
                                 NKOMO        NKM
                                 STEPHEN      STFN
S J NKOMO                        J            J                 NKM
                                 NKOMO        NKM
                                 S            S
STEPHEN JEKE N KOMO              JEKE         JK                JKKM|JKSTF|KMSTF
                                 KOMO         KM
                                 N            N
                                 STEPHEN      STFN

Note:

Initials are ignored by default when generating cluster keys.

Given Names Cluster (dnClusterGivenNames)

The Given Names cluster provides a further backup to the remaining clusters, especially to deal with cases where names are not necessarily well structured into family and given names.

Note:

Depending on the quality and culture of the name information, this cluster will often not be required. You can test the number of additional alerts identified by the cluster by running matching with this cluster disabled, and then running with it enabled. Comparing the new relationships against the old will highlight the relationships identified by using this cluster.
The default logic of the cluster builder is as follows:
  • Split the normalized full name into several name tokens, using space as a delimiter.

    Note:

    Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
  • Standardize the normalized given names before clustering. This ensures that names such as 'William' and 'Bill' will be clustered together, although their raw Metaphone values are not the same. A space delimiter is used to split the name before standardizing.
  • Apply the Metaphone transformation to the whole of the given names value after token standardization, outputting a key with a length of up to 4 characters.
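The effect of standardizing before encoding can be sketched as follows. Everything here is illustrative: the `STANDARD_FORMS` map is a hypothetical stand-in for the product's name standardization data, and `fake_encode` is a placeholder that only drops vowels and spaces. It is not the Metaphone algorithm, and is used purely to show that two variants produce the same key once they have been standardized.

```python
# Hypothetical standardization map; the product's own name lists are not shown here.
STANDARD_FORMS = {"BILL": "WILLIAM", "WILL": "WILLIAM"}


def given_names_key(given_names, encode, max_len=4):
    # Split on spaces and replace each token with its standard form, if any.
    tokens = [STANDARD_FORMS.get(t, t) for t in given_names.split()]
    # Encode the whole standardized value and cap the key length.
    return encode(" ".join(tokens))[:max_len]


def fake_encode(value):
    """Placeholder for Metaphone: drops vowels and spaces. Not a phonetic algorithm."""
    return "".join(c for c in value if c not in "AEIOU ")


same_cluster = given_names_key("BILL", fake_encode) == given_names_key("WILLIAM", fake_encode)
```

Here `same_cluster` is true because both values are standardized to the same form before the encoder is applied.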
Example

Table 2-7 Given Names Cluster

dnFullName                       Name token   Trimmed value     Cluster Keys (dnClusterFullNameTrim)
XIAO JIAN ZHONG                  JIAN         JIA               JIAXIA|JIAZHO|XIAZHO
                                 XIAO         XIA
                                 ZHONG        ZHO
ZHONG XIAOJIAN                   XIAOJIAN     XIA               XIAZHO
                                 ZHONG        ZHO
MOHAMMED SANI ABACHE             ABACHE       ABA               ABAMOH|ABASAN|MOHSAN
                                 MOHAMMED     MOH
                                 SANI         SAN
JOSEPH TSANGA ABANDA             ABANDA       ABA               ABAJOS|ABATSA|JOSTSA
                                 JOSEPH       JOS
                                 TSANGA       TSA
ABD AL WAHAB ABD AL HAFIZ        ABD          ABD               ABDABD|ABDAL|ABDHAF|ABDWAH|ALAL|ALHAF|ALWAH|HAFWAH
                                 ABD          ABD
                                 AL           AL
                                 AL           AL
                                 HAFIZ        HAF
                                 WAHAB        WAH
SULIMAN HAMD SULEIMAN AL BUTHE   AL           AL                ALBUT|ALHAM|ALSUL|BUTHAM|BUTSUL|HAMSUL|SULSUL
                                 BUTHE        BUT
                                 HAMD         HAM
                                 SULEIMAN     SUL
                                 SULIMAN      SUL
AL BUTHE SOLEIMAN HAMAD          AL           AL                ALBUT|ALHAM|ALSOL|BUTHAM|BUTSOL|HAMSOL
                                 BUTHE        BUT
                                 HAMAD        HAM
                                 SOLEIMAN     SOL
REGINALD B GOODRIDGE             B            B                 GOOREG
                                 GOODRIDGE    GOO
                                 REGINALD     REG
REGINALD B SR GOODRICH           B            B                 GOOREG|GOOSR|REGSR
                                 GOODRICH     GOO
                                 REGINALD     REG
                                 SR           SR
STEPHEN JEQE NKOMO               JEQE         JEQ               JEQNKO|JEQSTE|NKOSTE
                                 NKOMO        NKO
                                 STEPHEN      STE
S J NKOMO                        J            J                 NKO
                                 NKOMO        NKO
                                 S            S
STEPHEN JEKE N KOMO              JEKE         JEK               JEKKOM|JEKSTE|KOMSTE
                                 KOMO         KOM
                                 N            N
                                 STEPHEN      STE

Note:

Initials are ignored by default when generating cluster keys.

Nationality Prohibition (Nationality Code)

This cluster generates its cluster keys by splitting the space-delimited list of nationality country codes into an array of the component country codes.

Residency Prohibition (Residency Code)

This cluster generates its cluster keys by splitting the space-delimited list of residency country codes into an array of the component country codes.
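For both prohibition clusters, the key generation amounts to a split on white space. A minimal sketch follows. The function names are hypothetical, and the `shares_cluster` check illustrates the general idea that records sharing a key fall into the same cluster. It is not a description of the product's matching engine.

```python
def country_code_keys(codes):
    """Split a space-delimited list of country codes into an array of keys."""
    # str.split() with no argument splits on runs of white space.
    return codes.split()


def shares_cluster(codes_a, codes_b):
    """True when two records have at least one country code key in common."""
    return bool(set(country_code_keys(codes_a)) & set(country_code_keys(codes_b)))
```

For example, `country_code_keys("ES GB")` returns `["ES", "GB"]`.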

Name and Country (dnClusterNameCountry)

The Name and Country cluster provides a backup that uses more detailed name information combined with country information. The cluster is used to compare very similar names that are associated with the same countries.

The default logic of the cluster builder is as follows:
  • Split the normalized Full Name into name tokens, using space as a delimiter.

    Note:

    Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
  • Apply the Metaphone transformation to each name token, outputting a key with a length of up to twelve characters.
  • Sort the Metaphone values alphabetically.
  • For each country code associated with the record:
    • Concatenate the country code with the full set of Metaphone values, using an underscore as a separator.
    • If more than two Metaphone values are present, then iterate through all groups of Metaphone values which have exactly one value from the set missing, concatenating the country code onto the front of the Metaphone value set.
    • If the overall length of the dnClusterNameCountry field has exceeded 1000 characters, discard the last key and stop key generation.
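The key generation above can be sketched as follows. This is an illustration under stated assumptions: the Metaphone step is passed in as a function (the stand-in looks up codes copied from Table 2-8), the country codes are assumed to arrive as a space-delimited string, and the function names are hypothetical. Keys are not deduplicated, which matches the repeated keys shown in Table 2-8.

```python
# Stand-in for the Metaphone step: codes copied from Table 2-8.
TABLE_2_8_CODES = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "HAMD": "HMT", "SULEIMAN": "SLMN", "SULIMAN": "SLMN",
}


def table_encode(token):
    """Look up a code from Table 2-8 (a real Metaphone encoder would go here)."""
    return TABLE_2_8_CODES[token]


def name_country_keys(full_name, country_codes, encode,
                      max_code_len=12, max_field_len=1000):
    # Encode every token, then sort the resulting values alphabetically.
    codes = sorted(encode(t)[:max_code_len] for t in full_name.split())
    keys = []
    for country in country_codes.split():
        # Candidate value sets: the full set, then each leave-one-out group.
        groups = [codes]
        if len(codes) > 2:
            groups += [codes[:i] + codes[i + 1:] for i in range(len(codes))]
        for group in groups:
            keys.append("_".join([country] + group))
            if len("|".join(keys)) > max_field_len:
                # The field is too long: discard the last key and stop.
                keys.pop()
                return "|".join(keys)
    return "|".join(keys)
```

For example, `name_country_keys("MOHAMMED SANI", "ES GB", table_encode)` gives `ES_MHMT_SN|GB_MHMT_SN`. No leave-one-out keys are produced because only two Metaphone values are present.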
Example

Table 2-8 Name and Country

dnFullName               Country Codes   Name token   Metaphone value   Cluster Keys (joined with | in dnClusterNameCountry)
MOHAMMED SANI            ES GB           MOHAMMED     MHMT              ES_MHMT_SN
                                         SANI         SN                GB_MHMT_SN
SULIMAN HAMD SULEIMAN    ES TH GB        HAMD         HMT               ES_HMT_SLMN_SLMN
                                         SULEIMAN     SLMN              ES_SLMN_SLMN
                                         SULIMAN      SLMN              ES_HMT_SLMN
                                                                        ES_HMT_SLMN
                                                                        TH_HMT_SLMN_SLMN
                                                                        TH_SLMN_SLMN
                                                                        TH_HMT_SLMN
                                                                        TH_HMT_SLMN
                                                                        GB_HMT_SLMN_SLMN
                                                                        GB_SLMN_SLMN
                                                                        GB_HMT_SLMN
                                                                        GB_HMT_SLMN

Name and YOB (dnClusterNameYOB)

The Name and YOB cluster provides a backup that uses more detailed information about names and initials, combined with years of birth.

The default logic of the cluster builder is as follows:
  • Standardize dnGivenNames and dnFamilyName;
  • Apply transliteration followed by the Metaphone transformation to the standardized given name, outputting a key with a length of up to four characters;
  • Apply transliteration followed by the Metaphone transformation to the standardized family name, outputting a key with a length of up to four characters;
  • Extract and uppercase the first letter of the standardized dnGivenNames;
  • Extract and uppercase the first letter of the standardized dnFamilyName;
  • Extract the first two years of birth from dnYOB to generate two values (referred to as 'First YOB' and 'Second YOB' in the remainder of this list);
  • Create up to four cluster keys by concatenating the following combinations of elements, using the underscore character as a separator:
    • First YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
    • First YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).
    • Second YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
    • Second YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).

      Note:

      If any of the required data elements are missing, then the corresponding cluster key will not be generated.
    • Deduplicate the list of keys.
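A minimal sketch of this key generation is shown below, under several assumptions. Standardization and transliteration are omitted, the Metaphone step is a lookup of codes taken from Table 2-9 (with the code for the two-token given name assumed to be its first four characters, SLMN), and dnYOB is assumed to be a space-delimited string of years. All function names are hypothetical.

```python
# Stand-in for transliteration plus Metaphone: codes taken from Table 2-9.
TABLE_2_9_CODES = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "SULIMAN HAMD": "SLMN",  # assumed 4-character code for the whole given names value
    "SULEIMAN": "SLMN",
}


def table_encode(value):
    """Look up a code from Table 2-9 (a real Metaphone encoder would go here)."""
    return TABLE_2_9_CODES[value]


def name_yob_keys(given_names, family_name, yobs, encode, max_len=4):
    # Standardization is omitted in this sketch; names are used as supplied.
    given_code = encode(given_names)[:max_len] if given_names else ""
    family_code = encode(family_name)[:max_len] if family_name else ""
    given_initial = given_names[:1].upper()
    family_initial = family_name[:1].upper()
    keys = []
    for yob in yobs.split()[:2]:  # only the first two years of birth are used
        for initial, code in ((family_initial, given_code),
                              (given_initial, family_code)):
            if initial and code:  # skip keys with missing elements
                keys.append("_".join([yob, initial, code]))
    # Deduplicate while keeping first-seen order.
    return "|".join(dict.fromkeys(keys))
```

For the second record in Table 2-9, both combinations for a given year produce the same key, so deduplication leaves `1980_S_SLMN|1981_S_SLMN`.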
Example

Table 2-9 Name and YOB

dnGivenNames, dnFamilyName   dnYOB            Name token   Metaphone value   Cluster Keys (dnClusterNameYOB)
MOHAMMED, SANI               1969 1970 1971   MOHAMMED     MHMT              1969_S_MHMT|1969_M_SN|1970_S_MHMT|1970_M_SN
                                              SANI         SN
SULIMAN HAMD, SULEIMAN       1980 1981 1982   HAMD         HMT               1980_S_SLMN|1981_S_SLMN
                                              SULEIMAN     SLMN
                                              SULIMAN      SLMN

First and Last Name (dnClusterFirstLast)

The First and Last Name cluster provides a tighter, name-only clustering method. It relies on the first given name and the last family name matching after standardization, and allows for variation in any of the name tokens in between.

The default logic of the cluster builder is as follows:
  • Strip initials from the normalized given name and family name.
  • Strip all common name qualifiers from the normalized given names and family name, e.g. Al, Bin, Von.
  • Extract the first token from the stripped given names. If all tokens were stripped in steps 1 and 2, then extract the first token from the original normalized given names.
  • Extract the last token from the stripped family name. If all tokens were stripped in steps 1 and 2, then extract the last token from the original normalized family name.
  • Trim the extracted values to a maximum length of 4 characters.
  • Sort the trimmed values alphabetically and concatenate to generate the final key value.
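These steps can be sketched as follows. The qualifier list is illustrative only: "AL", "BIN" and "VON" come from the text above, while "ABU" is inferred from the example table. The function names are hypothetical, and both inputs are assumed to be non-empty.

```python
# Illustrative qualifier list; "ABU" is inferred from Table 2-10.
QUALIFIERS = {"AL", "BIN", "VON", "ABU"}


def strip_tokens(tokens):
    """Drop single-letter initials and common name qualifiers."""
    return [t for t in tokens if len(t) > 1 and t not in QUALIFIERS]


def first_last_key(given_names, family_name, max_len=4):
    given = given_names.split()
    family = family_name.split()
    # Fall back to the original tokens when stripping removes everything.
    first = (strip_tokens(given) or given)[0]
    last = (strip_tokens(family) or family)[-1]
    # Trim both values, sort them alphabetically, and concatenate.
    parts = sorted([first[:max_len], last[:max_len]])
    return "".join(parts)
```

For example, `first_last_key("V U", "SHEIMAN")` falls back to the initial V for the given name and returns `SHEIV`, because the trimmed values are sorted before concatenation.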
Example

Table 2-10 First and Last Name

dnGivenNames       dnFamilyName      Extracted given name   Extracted family name   dnClusterFirstLast
OSVALDO ANTONIO    CASTELL VALDEZ    OSVALDO                VALDEZ                  OSVAVALD
ABU MAHDI          AL MUHANDIS       MAHDI                  MUHANDIS                MAHDMUHA
ABU                NIDAL             ABU                    NIDAL                   ABUNIDA
V U                SHEIMAN           V                      SHEIMAN                 SHEIV

Original Script Name (dnClusterOriginalScript)

The Original Script Name cluster provides a clustering method for matching names represented in non-Latin writing systems. The cluster builder generates a key for each token in the name.

Note:

A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, which does not use a space character between words. As a result, all original script names in the Burmese script will be compared during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.
The default logic of the cluster builder is as follows:
  • Split the original script name into several name tokens, using a space character as the delimiter.
  • Trim each name token to a maximum of 5 characters.
  • Concatenate all of the trimmed token values with a pipe separator.
  • Deduplicate the list of keys.
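A minimal sketch of this logic, including the special case for Burmese script, is shown below. Detecting Burmese text by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption made for this sketch. How the product identifies the writing system is not stated here, and the function name is hypothetical.

```python
def original_script_keys(name, max_len=5):
    # Assumption: Burmese text is detected via the Unicode Myanmar block.
    if any("\u1000" <= ch <= "\u109f" for ch in name):
        return "Myanmar"
    # Split on spaces and trim each token to at most max_len characters.
    trimmed = [token[:max_len] for token in name.split()]
    # Deduplicate while keeping first-seen order, then join with pipes.
    return "|".join(dict.fromkeys(trimmed))
```

For example, `original_script_keys("Iван Антонавiч Шчурок")` gives `Iван|Антон|Шчуро`.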
Example

Table 2-11 Original Script Name

dnOriginalScriptName                    dnClusterOriginalScript
Iван Антонавiч Шчурок                   Iван|Антон|Шчуро
[Japanese original script example]      [Japanese cluster value]
[Myanmar original script example]       Myanmar
[Arabic original script example]        [Arabic cluster value]

First Initial Last Name (dnClusterInitials)

The First Initial Last Name cluster provides a clustering method to group together names that share the same first name initial and last name and allows some variation for transposed names.

The default logic of the cluster builder is as follows:
  • Split the normalized given names into several name tokens, using a space character as the delimiter.
  • Split the normalized family name into several name tokens, using a space character as the delimiter.
  • Generate the cluster key value as follows:
    • If there are two or more characters in the last token of the family name, then concatenate the first character of the given name with the last token of the family name.
    • If the last token of the family name is a single initial, then concatenate that character with the first token of the given name.
  • Trim the cluster key to a maximum of 12 characters.
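The key generation can be sketched as follows. The function name is hypothetical, and both inputs are assumed to be non-empty, space-delimited, normalized values.

```python
def first_initial_last_key(given_names, family_name, max_len=12):
    given = given_names.split()
    family = family_name.split()
    last = family[-1]
    if len(last) >= 2:
        # Usual case: first letter of the given name + last family name token.
        key = given[0][0] + last
    else:
        # Transposed case: the family name field holds a single initial.
        key = last + given[0]
    # Trim the key to the maximum length.
    return key[:max_len]
```

For example, both `first_initial_last_key("MARTIN", "MORGAN JONES")` and the transposed `first_initial_last_key("JONES", "M")` return `MJONES`.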
Examples

Table 2-12 First Initial Last Name

dnGivenNames     dnFamilyName     dnClusterInitials
MARTIN           JONES            MJONES
MARTIN PETER     JONES            MJONES
MARTIN           MORGAN JONES     MJONES
JONES            M                MJONES