2.2 Clustering

Oracle Financial Services Customer Screening provides eleven clusters for matching individuals to watch lists during Sanctions screening, and nine clusters for PEP and EDD screening:

Table 2-3 Cluster Methods

Cluster Method	SAN	PEP	EDD
Family Name	Y	N	N
Full Name Metaphone	Y	N	N
Given Names	Y	N	N
Full Name Trim	Y	N	N
Nationality Prohibition	Y	N/A	N/A
Residency Prohibition	Y	N/A	N/A
Name and Country	N	Y	Y
Name and YOB	N	Y	Y
First and Last Name	N	Y	Y
Original Script Name	N	N	N
First Initial Last Name	N	N	N

Note:

This table shows the default configuration of both Batch and RealTime screening processes, but these may be customized independently of one another.

The data used to build the clusters is generated by the preparation process before matching. In all cases, the clusters use the prepared and normalized name attributes dnGivenNames, dnFamilyName, dnFullName, and dnOriginalScriptName. For further information, see Name Normalization.

Family Name Cluster (dnClusterFamilyName)

The Family Name cluster provides a backup to the full name clusters. This is especially important where the given name data is incomplete, making it difficult to form a complete cluster key for two names. For example, the following three records do not share any Full Name cluster keys, because of the initials in the second record and the spacing and spelling variations seen throughout:

Table 2-4 Example of Full Name Cluster (dnFullName)

dnFullName	Name token	Trimmed value	Cluster Keys	dnClusterFullNameTrim
STEPHEN JEQE NKOMO	JEQE	JEQ	JEQNKO JEQSTE NKOSTE	JEQNKO\|JEQSTE\|NKOSTE
	NKOMO	NKO
	STEPHEN	STE
S J NKOMO	J	J	NKO	NKO
	NKOMO	NKO
	S	S
STEPHEN JEKE N KOMO	JEKE	JEK	JEKKOM JEKSTE KOMSTE	JEKKOM\|JEKSTE\|KOMSTE
	KOMO	KOM
	N	N
	STEPHEN	STE

Clustering only on the family name circumvents this issue but results in large clusters and a concomitant increase in the processing required to cross-check all the records.

The Family Name cluster builder counters spacing and punctuation differences by generating Metaphone keys for every token of the family name and for the whole family name after all white space is trimmed. This ensures that family names such as those in the last two records of the example table below are clustered together despite the spacing differences.

The default logic of the cluster builder is as follows:

Trim all white space from the normalized family name.
Apply the Metaphone transformation to the result, outputting a key with a length of up to 4 characters.
Strip common name qualifiers from the normalized family name, e.g. Abd, Al.
Split the family name into several name tokens, using a space delimiter.

Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
Apply the Metaphone transformation to each name token, outputting a key with a length of up to 4 characters. If there were no tokens remaining after stripping common name qualifiers, then apply the Metaphone transformation to each name token of the original normalized family name.
Concatenate all the generated Metaphone keys.
Deduplicate the list of keys.

Example

Table 2-5 Example of Family Name Cluster (dnFamilyName)

dnFamilyName	Tokens derived from dnFamilyName	Metaphone transformations	dnClusterFamilyName
ZHONG	ZHONG	JNK	JNK
XIAOJIAN	XIAOJIAN	SJN	SJN
ABACHE	ABACHE	APX	APX
ABANDA	ABANDA	APNT	APNT
ABD AL HAFIZ	HAFIZ ABDALHAFIZ	HFS APTL	HFS\|APTL
AL BUTHE	BUTHE ALBUTHE	P0 ALP0	P0\|ALP0
AL	AL	AL	AL
SOLEIMAN HAMAD	SOLEIMAN HAMAD SOLEIMANHAMAD	SLMN HMT SLMN	SLMN\|HMT
GOODRIDGE	GOODRIDGE	KTRJ	KTRJ
GOODRICH SR	GOODRICH SR GOODRICHSR	KTRX SR KTRK	KTRX\|SR\|KTRK
NKOMO	NKOMO	NKM	NKM
N KOMO	N KOMO NKOMO	N KM NKM	N\|KM\|NKM
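
The default logic above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `family_name_cluster`, the `QUALIFIERS` set, and the `metaphone` callable are assumptions, and the Metaphone step is stubbed with values copied from Table 2-5 rather than computed. The key order (token keys first, then the key for the whole trimmed name) is inferred from the table.

```python
from typing import Callable, Dict

# Illustrative subset of common name qualifiers (assumption, not the product's list).
QUALIFIERS = {"ABD", "AL"}

# Stand-in for the Metaphone transformation: values copied from Table 2-5.
TABLE_2_5: Dict[str, str] = {
    "HAFIZ": "HFS", "ABDALHAFIZ": "APTL",
    "N": "N", "KOMO": "KM", "NKOMO": "NKM",
    "AL": "AL",
}


def family_name_cluster(family_name: str,
                        metaphone: Callable[[str], str],
                        max_len: int = 4) -> str:
    tokens = family_name.split()
    # Key for the whole family name with all white space trimmed.
    whole_key = metaphone("".join(tokens))[:max_len]
    # Strip qualifiers; fall back to the original tokens if none remain.
    stripped = [t for t in tokens if t not in QUALIFIERS] or tokens
    keys = [metaphone(t)[:max_len] for t in stripped] + [whole_key]
    # dict.fromkeys deduplicates while preserving first-seen order.
    return "|".join(dict.fromkeys(keys))


print(family_name_cluster("ABD AL HAFIZ", TABLE_2_5.__getitem__))  # HFS|APTL
print(family_name_cluster("N KOMO", TABLE_2_5.__getitem__))        # N|KM|NKM
```

Passing the Metaphone step in as a callable keeps the sketch independent of any particular phonetic library.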

Full Name Metaphone Pairs Cluster (dnClusterFullNameMeta)

The Full Name Metaphone Pairs cluster uses the normalized full name for the individual to generate a cluster key for every pair of names within the full name. The default logic of the cluster builder is as follows:

Split the normalized full name into several name tokens, using space as a delimiter.

Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
Sort the name tokens alphabetically.
Apply the Metaphone transformation (the standard double-metaphone algorithm) to each name token, outputting a key with a length of up to three characters.
Concatenate the Metaphone values, generating a final key value for each distinct pair of tokens.
Deduplicate the list of keys.

Example

Table 2-6 Full Name Metaphone Pairs Cluster

dnFullName	Name token	Metaphone value	Distinct Cluster Keys	dnClusterFullNameMeta
XIAO JIAN ZHONG	JIAN	JN	JNS JNJNK SJNK	JNS\|JNJNK\|SJNK
	XIAO	S
	ZHONG	JNK
ZHONG XIAOJIAN	XIAOJIAN	SJN	SJNJNK	SJNJNK
	ZHONG	JNK
MOHAMMED SANI ABACHE	ABACHE	APX	APXMHM APXSN MHMSN	APXMHM\|APXSN\|MHMSN
	MOHAMMED	MHMT
	SANI	SN
JOSEPH TSANGA ABANDA	ABANDA	APNT	APNJSF APNTSN JSFTSN	APNJSF\|APNTSN\|JSFTSN
	JOSEPH	JSF
	TSANGA	TSNK
ABD AL WAHAB ABD AL HAFIZ	ABD	APT	APTAPT APTAL APTHFS APTAHP ALAL ALHFS ALAHP HFSAHP	APTAPT\|APTAL\|APTHFS\|APTAHP\|ALAL\|ALHFS\|ALAHP\|HFSAHP
	ABD	APT
	AL	AL
	AL	AL
	HAFIZ	HFS
	WAHAB	AHP
SULIMAN HAMD SULEIMAN AL BUTHE	AL	AL	ALP0 ALHMT ALSLM P0HMT P0SLM HMTSLM SLMSLM	ALP0\|ALHMT\|ALSLM\|P0HMT\|P0SLM\|HMTSLM\|SLMSLM
	BUTHE	P0
	HAMD	HMT
	SULEIMAN	SLMN
	SULIMAN	SLMN
AL BUTHE SOLEIMAN HAMAD	AL	AL	ALP0 ALHMT ALSLM P0HMT P0SLM HMTSLM	ALP0\|ALHMT\|ALSLM\|P0HMT\|P0SLM\|HMTSLM
	BUTHE	P0
	HAMAD	HMT
	SOLEIMAN	SLMN
REGINALD B GOODRIDGE	B	P	KTRRJN (Note: Initials are ignored by default when generating cluster keys)	KTRRJN
	GOODRIDGE	KTRJ
	REGINALD	RJNLT
REGINALD B SR GOODRICH	B	P	KTRRJN KTRSR RJNSR (Note: Initials are ignored by default when generating cluster keys)	KTRRJN\|KTRSR\|RJNSR
	GOODRICH	KTRX
	REGINALD	RJNLT
	SR	SR
STEPHEN JEQE NKOMO	JEQE	JK	JKNKM JKSTF NKMSTF	JKNKM\|JKSTF\|NKMSTF
	NKOMO	NKM
	STEPHEN	STFN
S J NKOMO	J	J	NKM (Note: Initials are ignored by default when generating cluster keys)	NKM
	NKOMO	NKM
	S	S
STEPHEN JEKE N KOMO	JEKE	JK	JKKM JKSTF KMSTF	JKKM\|JKSTF\|KMSTF
	KOMO	KM
	N	N
	STEPHEN	STFN
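
The pairing logic can be sketched as follows. This is an illustrative outline under stated assumptions: `full_name_meta_pairs` and the `metaphone` callable are hypothetical names, the Metaphone values are stubbed from Table 2-6 rather than computed, and the handling of a name with only one remaining token (the token's own key is emitted) is inferred from the "S J NKOMO" row.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Stand-in for the Metaphone transformation: values copied from Table 2-6.
TABLE_2_6: Dict[str, str] = {
    "JIAN": "JN", "XIAO": "S", "ZHONG": "JNK",
    "NKOMO": "NKM",
    "ABD": "APT", "AL": "AL", "HAFIZ": "HFS", "WAHAB": "AHP",
}


def full_name_meta_pairs(full_name: str,
                         metaphone: Callable[[str], str],
                         max_len: int = 3) -> List[str]:
    # Initials (single-character tokens) are ignored by default.
    tokens = sorted(t for t in full_name.split() if len(t) > 1)
    values = [metaphone(t)[:max_len] for t in tokens]
    if len(values) == 1:
        # Assumption inferred from Table 2-6: a lone token is its own key.
        return values
    # One key per pair of tokens, in sorted-token order.
    keys = [a + b for a, b in combinations(values, 2)]
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(keys))


print(full_name_meta_pairs("XIAO JIAN ZHONG", TABLE_2_6.__getitem__))
# ['JNS', 'JNJNK', 'SJNK']
print(full_name_meta_pairs("S J NKOMO", TABLE_2_6.__getitem__))
# ['NKM']
```

Joining the returned list with a pipe character gives the value shown in the dnClusterFullNameMeta column.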

Given Names Cluster (dnClusterGivenNames)

The Given Names cluster provides a further backup to the remaining clusters, especially to deal with cases where names are not necessarily well structured into family and given names.

Note:

Depending on the quality and culture of the name information, this cluster will often not be required. You can test the number of additional alerts identified by the cluster by running matching with this cluster disabled, and then running with it enabled. Comparing the new relationships against the old will highlight the relationships identified by using this cluster.

The default logic of the cluster builder is as follows:

Split the normalized full name into several name tokens, using space as a delimiter.

Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
Standardize the normalized given names before clustering. This ensures that names such as 'William' and 'Bill' will be clustered together, although their raw Metaphone values are not the same. A space delimiter is used to split the name before standardizing.
Apply the Metaphone transformation to the whole of the given names value after token standardization, outputting a key with a length of up to 4 characters.
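
A minimal sketch of these steps is shown below. Everything named here is hypothetical: the `STANDARD_FORMS` map stands in for the product's name standardization data, and `STUB_METAPHONE` stands in for the Metaphone transformation with a single illustrative value. The point is only to show why standardizing before keying places 'William' and 'Bill' in the same cluster.

```python
from typing import Callable, Dict

# Hypothetical standardization map (assumption; not the product's data).
STANDARD_FORMS: Dict[str, str] = {"BILL": "WILLIAM"}

# Hypothetical stand-in for Metaphone; the value is illustrative only.
STUB_METAPHONE: Dict[str, str] = {"WILLIAM": "WLM"}


def given_names_cluster(given_names: str,
                        metaphone: Callable[[str], str],
                        max_len: int = 4) -> str:
    # Split on spaces and standardize each token.
    tokens = [STANDARD_FORMS.get(t, t) for t in given_names.split()]
    # Key the whole standardized given-names value, not the individual tokens.
    return metaphone(" ".join(tokens))[:max_len]


# Both spellings yield the same key because BILL is standardized first.
print(given_names_cluster("BILL", STUB_METAPHONE.__getitem__))
print(given_names_cluster("WILLIAM", STUB_METAPHONE.__getitem__))
```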

Example

Table 2-7 Given Names Cluster

dnFullName	Name token	Trimmed value	Cluster Keys	dnClusterFullNameTrim
XIAO JIAN ZHONG	JIAN	JIA	JIAXIA JIAZHO XIAZHO	JIAXIA\|JIAZHO\|XIAZHO
	XIAO	XIA
	ZHONG	ZHO
ZHONG XIAOJIAN	XIAOJIAN	XIA	XIAZHO	XIAZHO
	ZHONG	ZHO
MOHAMMED SANI ABACHE	ABACHE	ABA	ABAMOH ABASAN MOHSAN	ABAMOH\|ABASAN\|MOHSAN
	MOHAMMED	MOH
	SANI	SAN
JOSEPH TSANGA ABANDA	ABANDA	ABA	ABAJOS ABATSA JOSTSA	ABAJOS\|ABATSA\|JOSTSA
	JOSEPH	JOS
	TSANGA	TSA
ABD AL WAHAB ABD AL HAFIZ	ABD	ABD	ABDABD ABDAL ABDHAF ABDWAH ALAL ALHAF ALWAH HAFWAH	ABDABD\|ABDAL\|ABDHAF\|ABDWAH\|ALAL\|ALHAF\|ALWAH\|HAFWAH
	ABD	ABD
	AL	AL
	AL	AL
	HAFIZ	HAF
	WAHAB	WAH
SULIMAN HAMD SULEIMAN AL BUTHE	AL	AL	ALBUT ALHAM ALSUL BUTHAM BUTSUL HAMSUL SULSUL	ALBUT\|ALHAM\|ALSUL\|BUTHAM\|BUTSUL\|HAMSUL\|SULSUL
	BUTHE	BUT
	HAMD	HAM
	SULEIMAN	SUL
	SULIMAN	SUL
AL BUTHE SOLEIMAN HAMAD	AL	AL	ALBUT ALHAM ALSOL BUTHAM BUTSOL HAMSOL	ALBUT\|ALHAM\|ALSOL\|BUTHAM\|BUTSOL\|HAMSOL
	BUTHE	BUT
	HAMAD	HAM
	SOLEIMAN	SOL
REGINALD B GOODRIDGE	B	B	GOOREG (Note: Initials are ignored by default when generating cluster keys)	GOOREG
	GOODRIDGE	GOO
	REGINALD	REG
REGINALD B SR GOODRICH	B	B	GOOREG GOOSR REGSR	GOOREG\|GOOSR\|REGSR
	GOODRICH	GOO
	REGINALD	REG
	SR	SR
STEPHEN JEQE NKOMO	JEQE	JEQ	JEQNKO JEQSTE NKOSTE	JEQNKO\|JEQSTE\|NKOSTE
	NKOMO	NKO
	STEPHEN	STE
S J NKOMO	J	J	NKO (Note: Initials are ignored by default when generating cluster keys)	NKO
	NKOMO	NKO
	S	S
STEPHEN JEKE N KOMO	JEKE	JEK	JEKKOM JEKSTE KOMSTE (Note: Initials are ignored by default when generating cluster keys)	JEKKOM\|JEKSTE\|KOMSTE
	KOMO	KOM
	N	N
	STEPHEN	STE

Nationality Prohibition (Nationality Code)

This cluster splits the space-delimited list of nationality country codes into an array of its component country codes and uses each code as a cluster key.

Residency Prohibition (Residency Code)

This cluster splits the space-delimited list of residency country codes into an array of its component country codes and uses each code as a cluster key.
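
Both prohibition clusters amount to a simple split on spaces. The sketch below illustrates this; the function name `country_code_cluster` is hypothetical and the example codes are borrowed from Table 2-8.

```python
from typing import List


def country_code_cluster(codes: str) -> List[str]:
    # str.split() with no argument splits on runs of white space
    # and drops empty entries, so each country code becomes one key.
    return codes.split()


print(country_code_cluster("ES TH GB"))  # ['ES', 'TH', 'GB']
```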

Name and Country (dnClusterNameCountry)

The Name and Country cluster provides a backup that combines more detailed name information with country information. The cluster is used to compare very similar names that are associated with the same countries.

The default logic of the cluster builder is as follows:

Split the normalized Full Name into name tokens, using space as a delimiter.

Note:
Many other punctuation and noise characters are normalized to spaces before generating the cluster. For further information see Name Normalization.
Apply the Metaphone transformation to each name token, outputting a key with a length of up to twelve characters.
Sort the Metaphone values alphabetically.
For each country code associated with the record:
- Concatenate the country code with the full set of Metaphone values, using an underscore as a separator.
- If more than two Metaphone values are present, then iterate through all groups of Metaphone values which have exactly one value from the set missing, concatenating the country code onto the front of the Metaphone value set.
- If the overall length of the dnClusterNameCountry field has exceeded 1000 characters, discard the last key and stop key generation.

Example

Table 2-8 Name and Country

dnFullName	Country Codes	Name token	Metaphone value	Cluster Keys	dnClusterNameCountry
MOHAMMED SANI	ES GB	MOHAMMED	MHMT	ES_MHMT_SN GB_MHMT_SN	ES_MHMT_SN\|GB_MHMT_SN
		SANI	SN
SULIMAN HAMD SULEIMAN	ES TH GB	HAMD	HMT	ES_HMT_SLMN_SLMN ES_SLMN_SLMN ES_HMT_SLMN ES_HMT_SLMN TH_HMT_SLMN_SLMN TH_SLMN_SLMN TH_HMT_SLMN TH_HMT_SLMN GB_HMT_SLMN_SLMN GB_SLMN_SLMN GB_HMT_SLMN GB_HMT_SLMN	ES_HMT_SLMN_SLMN\|ES_SLMN_SLMN\|ES_HMT_SLMN\|ES_HMT_SLMN\|TH_HMT_SLMN_SLMN\|TH_SLMN_SLMN\|TH_HMT_SLMN\|TH_HMT_SLMN\|GB_HMT_SLMN_SLMN\|GB_SLMN_SLMN\|GB_HMT_SLMN\|GB_HMT_SLMN
		SULEIMAN	SLMN
		SULIMAN	SLMN
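
The key generation above, including the "one value missing" groups and the 1000-character cut-off, can be sketched as follows. The function name `name_country_cluster`, its parameters, and the `metaphone` callable are assumptions; Metaphone values are stubbed from Table 2-8, and the order of the leave-one-out groups (dropping each value in turn, first to last) is inferred from the table.

```python
from typing import Callable, Dict, List

# Stand-in for the Metaphone transformation: values copied from Table 2-8.
TABLE_2_8: Dict[str, str] = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "SULIMAN": "SLMN", "HAMD": "HMT", "SULEIMAN": "SLMN",
}


def name_country_cluster(full_name: str,
                         country_codes: str,
                         metaphone: Callable[[str], str],
                         max_len: int = 12,
                         max_field_len: int = 1000) -> str:
    # Metaphone value per token, sorted alphabetically.
    values = sorted(metaphone(t)[:max_len] for t in full_name.split())
    keys: List[str] = []
    for code in country_codes.split():
        groups = [values]
        if len(values) > 2:
            # Every group with exactly one value left out.
            groups += [values[:i] + values[i + 1:] for i in range(len(values))]
        for group in groups:
            keys.append("_".join([code] + group))
            if len("|".join(keys)) > max_field_len:
                # Field too long: discard the last key and stop.
                keys.pop()
                return "|".join(keys)
    return "|".join(keys)


print(name_country_cluster("MOHAMMED SANI", "ES GB", TABLE_2_8.__getitem__))
# ES_MHMT_SN|GB_MHMT_SN
```

Note that the keys are not deduplicated in this sketch, which matches the repeated values shown in the dnClusterNameCountry column of the table.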

Name and YOB (dnClusterNameYOB)

The Name and YOB cluster provides a backup that combines more detailed information about names and initials with years of birth.

The default logic of the cluster builder is as follows:

Standardize dnGivenNames and dnFamilyName.
Apply transliteration followed by the Metaphone transformation to the standardized given names, outputting a key with a length of up to four characters.
Apply transliteration followed by the Metaphone transformation to the standardized family name, outputting a key with a length of up to four characters.
Extract and uppercase the first letter of the standardized dnGivenNames.
Extract and uppercase the first letter of the standardized dnFamilyName.
Extract the first two years of birth from dnYOB to generate two values (referred to as 'First YOB' and 'Second YOB' in the remainder of these steps).
Create up to four cluster keys by concatenating the following combinations of elements, separated by the underscore character:
- First YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
- First YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).
- Second YOB + dnFamilyName (uppercased initial) + dnGivenNames (Metaphone).
- Second YOB + dnGivenNames (uppercased initial) + dnFamilyName (Metaphone).

Note:
If any of the required data elements are missing, then the corresponding cluster key will not be generated.
Deduplicate the list of keys.

Example

Table 2-9 Name and YOB

dnGivenNames, dnFamilyName	dnYOB	Name token	Metaphone value	Cluster Keys	dnClusterNameYOB
MOHAMMED, SANI	1969 1970 1971	MOHAMMED	MHMT	1969_S_MHMT 1969_M_SN 1970_S_MHMT 1970_M_SN	1969_S_MHMT\|1969_M_SN\|1970_S_MHMT\|1970_M_SN
		SANI	SN
SULIMAN HAMD, SULEIMAN	1980 1981 1982	HAMD	HMT	1980_S_SLMN 1981_S_SLMN	1980_S_SLMN\|1981_S_SLMN
		SULEIMAN	SLMN
		SULIMAN	SLMN
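
A minimal sketch of the key construction follows. It assumes the inputs have already been standardized and transliterated, and it stubs the Metaphone step with values taken from, or implied by, Table 2-9. The function name `name_yob_cluster` and the `metaphone` callable are hypothetical.

```python
from typing import Callable, Dict, List

# Stand-in for transliteration + Metaphone: values taken from, or implied by,
# the keys in Table 2-9 (the key is truncated to four characters).
TABLE_2_9: Dict[str, str] = {
    "MOHAMMED": "MHMT", "SANI": "SN",
    "SULIMAN HAMD": "SLMN", "SULEIMAN": "SLMN",
}


def name_yob_cluster(given_names: str,
                     family_name: str,
                     yobs: str,
                     metaphone: Callable[[str], str],
                     max_len: int = 4) -> str:
    # Every key needs both names, so nothing is generated if either is missing.
    if not (given_names and family_name):
        return ""
    given_key = metaphone(given_names)[:max_len]
    family_key = metaphone(family_name)[:max_len]
    given_initial = given_names[0].upper()
    family_initial = family_name[0].upper()
    keys: List[str] = []
    # Only the first two years of birth are used.
    for yob in yobs.split()[:2]:
        keys.append(f"{yob}_{family_initial}_{given_key}")
        keys.append(f"{yob}_{given_initial}_{family_key}")
    # Deduplicate while preserving order.
    return "|".join(dict.fromkeys(keys))


print(name_yob_cluster("MOHAMMED", "SANI", "1969 1970 1971", TABLE_2_9.__getitem__))
# 1969_S_MHMT|1969_M_SN|1970_S_MHMT|1970_M_SN
```

In the second row of the table, both key patterns produce the same value for each year, so deduplication leaves only two keys.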

First and Last Name (dnClusterFirstLast)

The First and Last Name cluster provides a tighter, name-only clustering method. It relies on the first given name and the last family name matching after standardization, and allows for variation in any of the name tokens in between.

The default logic of the cluster builder is as follows:

1. Strip initials from the normalized given names and family name.
2. Strip all common name qualifiers from the normalized given names and family name, e.g. Al, Bin, Von.
3. Extract the first token from the stripped given names. If all tokens were stripped in steps 1 and 2, then extract the first token from the original normalized given names.
4. Extract the last token from the stripped family name. If all tokens were stripped in steps 1 and 2, then extract the last token from the original normalized family name.
5. Trim the extracted values to a maximum length of 4 characters.
6. Sort the trimmed values alphabetically and concatenate to generate the final key value.

Example

Table 2-10 First and Last Name

dnGivenNames	dnFamilyName	Extracted given name	Extracted family name	dnClusterFirstLast
OSVALDO ANTONIO	CASTELL VALDEZ	OSVALDO	VALDEZ	OSVAVALD
ABU MAHDI	AL MUHANDIS	MAHDI	MUHANDIS	MAHDMUHA
ABU	NIDAL	ABU	NIDAL	ABUNIDA
V U	SHEIMAN	V	SHEIMAN	SHEIV
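
The steps above can be sketched as follows. The `QUALIFIERS` set is an illustrative assumption (the product's actual list is not reproduced here; ABU is included because the table implies it is stripped), and the function names are hypothetical. The sketch assumes both name values are non-empty.

```python
from typing import List

# Illustrative subset of common name qualifiers (assumption).
QUALIFIERS = {"ABU", "AL", "BIN", "VON"}


def _strip(tokens: List[str]) -> List[str]:
    # Remove initials (single characters) and common qualifiers.
    return [t for t in tokens if len(t) > 1 and t not in QUALIFIERS]


def first_last_cluster(given_names: str, family_name: str, max_len: int = 4) -> str:
    given = given_names.split()
    family = family_name.split()
    # Fall back to the original tokens if stripping removed everything.
    first = (_strip(given) or given)[0]
    last = (_strip(family) or family)[-1]
    # Trim, sort alphabetically, and concatenate.
    return "".join(sorted([first[:max_len], last[:max_len]]))


print(first_last_cluster("OSVALDO ANTONIO", "CASTELL VALDEZ"))  # OSVAVALD
print(first_last_cluster("ABU", "NIDAL"))                       # ABUNIDA
print(first_last_cluster("V U", "SHEIMAN"))                     # SHEIV
```

Sorting the two trimmed values before concatenation is what lets transposed given and family names fall into the same cluster.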

Original Script Name (dnClusterOriginalScript)

The Original Script Name cluster provides a clustering method for matching names represented in non-Latin writing systems. The cluster builder generates a key for each token in the name.

Note:

A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, which does not use a space character between words. As a result, all original script names in the Burmese script will be compared during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.

The default logic of the cluster builder is as follows:

Split the original script name into several name tokens, using a space character as the delimiter.
Trim each name token to a maximum of 5 characters.
Concatenate all of the trimmed token values with a pipe separator.
Deduplicate the list of keys.

Example

Table 2-11 Original Script Name

dnOriginalScriptName	dnClusterOriginalScript
Iван Антонавiч Шчурок	Iван\|Антон\|Шчуро

	Myanmar
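
A short sketch of this logic is shown below. The function names are hypothetical, and detecting Burmese text by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption made for illustration; the product may identify the script differently.

```python
def _is_burmese(text: str) -> bool:
    # Assumption: any character in the Unicode Myanmar block marks the name as Burmese.
    return any("\u1000" <= ch <= "\u109f" for ch in text)


def original_script_cluster(name: str, max_len: int = 5) -> str:
    if _is_burmese(name):
        # Burmese names cannot be split on spaces, so they share one cluster value.
        return "Myanmar"
    # One key per space-delimited token, trimmed to max_len characters.
    keys = [token[:max_len] for token in name.split()]
    # Deduplicate while preserving order, then join with a pipe separator.
    return "|".join(dict.fromkeys(keys))


print(original_script_cluster("Iван Антонавiч Шчурок"))  # Iван|Антон|Шчуро
```

Python string slicing works on code points, so each key holds at most five characters regardless of how many bytes the script needs per character.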

First Initial Last Name (dnClusterInitials)

The First Initial Last Name cluster provides a clustering method to group together names that share the same first name initial and last name and allows some variation for transposed names.

The default logic of the cluster builder is as follows:

Split the normalized given names into several name tokens, using a space character as the delimiter.
Split the normalized family name into several name tokens, using a space character as the delimiter.
Generate the cluster key value as follows:
- If there are two or more characters in the last token of the family name, then concatenate the first character of the given name with the last token of the family name.
- If the last token of the family name is a single initial, then concatenate that character with the first token of the given name.
Trim the cluster key to a maximum of 12 characters.

Examples

Table 2-12 First Initial Last Name

dnGivenNames	dnFamilyName	dnClusterInitials
MARTIN	JONES	MJONES
MARTIN PETER	JONES	MJONES
MARTIN	MORGAN JONES	MJONES
JONES	M	MJONES
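
The key generation can be sketched as follows. The function name `first_initial_last_cluster` is hypothetical, and the sketch assumes both name values contain at least one token.

```python
def first_initial_last_cluster(given_names: str, family_name: str, max_len: int = 12) -> str:
    given = given_names.split()
    family = family_name.split()
    last = family[-1]
    if len(last) >= 2:
        # Usual case: first initial of the given name + last family-name token.
        key = given[0][0] + last
    else:
        # Transposed names: the family name is a single initial,
        # so pair it with the first given-name token instead.
        key = last + given[0]
    return key[:max_len]


print(first_initial_last_cluster("MARTIN PETER", "JONES"))  # MJONES
print(first_initial_last_cluster("JONES", "M"))             # MJONES
```

The second call shows how a record with the names transposed (family name "M", given name "JONES") still lands in the same cluster as "MARTIN JONES".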