3.2 Clustering

Oracle Financial Services Customer Screening provides three different clustering strategies for matching entities: Entity Name Tokens, Name Metaphone, and Name Trimmed. Any of the clusters may be activated or deactivated, as required, and different cluster limits can be configured.

Entity Name Tokens (dnClusterNameTokens)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:

Remove initials.
Remove common name tokens, such as Limited or Corporation.
Normalize whitespace.
Convert space characters to pipe characters.

Examples

Table 3-2 Entity Name Tokens

dnEntityName	Name with initials and common name tokens stripped	dnClusterNameTokens
ANGLO CARIBBEAN CO LTD	ANGLO CARIBBEAN	ANGLO\|CARIBBEAN
GUAMATUR S A	GUAMATUR	GUAMATUR
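
The default logic above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name, the `COMMON_NAME_TOKENS` list, and the rule that an initial is any single-character token are all assumptions made for the example.

```python
# Hypothetical stop list; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}


def entity_name_tokens_key(name: str) -> str:
    """Build a pipe-delimited cluster key from a standardized entity name."""
    # split() with no arguments also normalizes runs of whitespace.
    tokens = name.upper().split()
    # Remove initials (assumed here to be single-character tokens).
    tokens = [t for t in tokens if len(t) > 1]
    # Remove common name tokens.
    tokens = [t for t in tokens if t not in COMMON_NAME_TOKENS]
    # Convert the separating spaces to pipe characters.
    return "|".join(tokens)


print(entity_name_tokens_key("ANGLO CARIBBEAN CO LTD"))
print(entity_name_tokens_key("GUAMATUR S A"))
```

With this stop list, the two calls print `ANGLO|CARIBBEAN` and `GUAMATUR`.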

Name Metaphone (dnClusterLongName)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:

Remove initials.
Remove common name tokens, such as Limited or Corporation.
Normalize whitespace.
Remove common business words, such as Company or Association.
Transliterate any non-Latin characters into Latin.
Apply the Metaphone transformation (the standard double-Metaphone algorithm) outputting a key with a length of up to eight characters.

Table 3-3 Name Metaphone

dnEntityName	Name with initials, common name tokens, and common business words stripped	dnClusterLongName
HAVANA INTERNATIONAL BANK LTD	HAVANA BANK	HFNPNK
CIMEX S A	CIMEX	SMKS
LA EMPRESA CUBANA DE FLETES	EMPRESA CUBANA FLETES	AMPRSKPN
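
The shape of this pipeline can be sketched as follows. The sketch deliberately does not implement double Metaphone: the phonetic encoder is passed in as the `encode` argument, so any available implementation could be plugged in. The function name, both word lists, and the single-character rule for initials are assumptions for illustration, and the transliteration step is omitted (the input is assumed to be Latin already).

```python
from typing import Callable

# Hypothetical word lists; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}
COMMON_BUSINESS_WORDS = {"COMPANY", "ASSOCIATION", "INTERNATIONAL"}


def name_metaphone_key(name: str, encode: Callable[[str], str], max_len: int = 8) -> str:
    """Strip noise tokens, then apply a phonetic encoder and cap the key length."""
    # Remove initials (assumed to be single-character tokens); split() normalizes whitespace.
    tokens = [t for t in name.upper().split() if len(t) > 1]
    # Remove common name tokens and common business words.
    noise = COMMON_NAME_TOKENS | COMMON_BUSINESS_WORDS
    stripped = " ".join(t for t in tokens if t not in noise)
    # `encode` stands in for the Metaphone transformation; keep at most max_len characters.
    return encode(stripped)[:max_len]


# Stand-in encoder that only knows one value from Table 3-3; a real encoder would replace it.
table_encoder = {"HAVANA BANK": "HFNPNK"}.__getitem__
print(name_metaphone_key("HAVANA INTERNATIONAL BANK LTD", table_encoder))
```

Passing the encoder as a parameter keeps the stripping and length-capping logic testable without committing to a particular Metaphone library.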

Name Trimmed (dnClusterShortName)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:

Remove all whitespace.
Truncate the value to its leftmost four characters.

Example

Table 3-4 Name Trimmed

dnEntityName	dnClusterShortName
HAVANA INTERNATIONAL BANK LTD	HAVA
CIMEX S A	CIME
LA EMPRESA CUBANA DE FLETES	LAEM
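
As a minimal sketch of the two steps above (the function name and the `length` parameter are illustrative, not product identifiers):

```python
def name_trimmed_key(name: str, length: int = 4) -> str:
    """Remove all whitespace, then keep the leftmost `length` characters."""
    compact = "".join(name.split())  # split() drops every run of whitespace
    return compact[:length]


for name in ("HAVANA INTERNATIONAL BANK LTD", "CIMEX S A", "LA EMPRESA CUBANA DE FLETES"):
    print(name, "->", name_trimmed_key(name))
```

This reproduces the keys in Table 3-4: `HAVA`, `CIME`, and `LAEM`.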

Registration Country Prohibition (Registration Country Code)

This cluster uses the space-delimited list of registration country codes to generate cluster keys by generating an array of the component country codes.

Operating Country Prohibition (Operating Country Code)

This cluster uses the space-delimited list of operating country codes to generate cluster keys by generating an array of the component country codes.
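
Both country code clusters amount to splitting one string into an array of keys. A minimal sketch, assuming a hypothetical function name and made-up sample codes:

```python
from typing import List


def country_code_keys(codes: str) -> List[str]:
    """Split a space-delimited list of country codes into one cluster key per code."""
    # split() with no arguments ignores leading, trailing, and repeated spaces.
    return codes.split()


print(country_code_keys("CU VE"))
```

A record with an empty country code list yields an empty array, so it produces no keys for this cluster.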

Start/End Name Tokens (dnClusterStartEndNameTokens)

This clustering method is designed as a looser version of the Entity Name Tokens cluster and allows for variation in entity names by creating clusters for the first five and last five characters of each name token.

The default logic is as follows:

Remove initials.
Remove common name tokens, such as Limited or Corporation.
Normalize whitespace.
For each token that is longer than five characters, replace with two new tokens that are:
- The first five characters of the token.
- The last five characters of the token.

Table 3-5 Start or End Name Tokens

dnEntityName	Name with initials and common name tokens stripped	dnClusterStartEndNameTokens
HAVANA INTERNATIONAL BANK LTD	HAVANA INTERNATIONAL BANK	HAVAN\|AVANA\|INTER\|IONAL\|BANK
CIMEX S A	CIMEX	CIMEX
LA EMPRESA CUBANA DE FLETES	LA EMPRESA CUBANA FLETES	LA\|EMPRE\|PRESA\|CUBAN\|UBANA\|FLETE\|LETES
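
The token-splitting rule can be sketched as below. The function name, the `width` parameter, the stop list, and the single-character rule for initials are assumptions for illustration only.

```python
# Hypothetical stop list; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}


def start_end_name_tokens_key(name: str, width: int = 5) -> str:
    """Replace each long token with its first and last `width` characters."""
    # Remove initials (assumed single-character tokens) and common name tokens.
    tokens = [
        t for t in name.upper().split()
        if len(t) > 1 and t not in COMMON_NAME_TOKENS
    ]
    keys = []
    for token in tokens:
        if len(token) > width:
            # Long token: emit its start and its end as two separate keys.
            keys.extend([token[:width], token[-width:]])
        else:
            # Short token: keep it unchanged.
            keys.append(token)
    return "|".join(keys)


print(start_end_name_tokens_key("HAVANA INTERNATIONAL BANK LTD"))
print(start_end_name_tokens_key("CIMEX S A"))
```

With this stop list, the calls print `HAVAN|AVANA|INTER|IONAL|BANK` and `CIMEX`. A token of exactly five characters, such as CIMEX, is kept whole because the rule applies only to tokens longer than five characters.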

Original Script Name (dnClusterOriginalScript)

The Original Script Name cluster provides a clustering method for matching names represented in non-Latin writing systems. The cluster builder generates a key for each token in the name.

Note:

A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, which does not use a space character between words. As a result, all original script names in Burmese are compared with each other during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.

The default logic of the cluster builder is as follows:

Split the original script name into several name tokens, using a space character as the delimiter.
Trim each name token to a maximum of 5 characters.
Concatenate all of the trimmed token values with a pipe delimiter.
Deduplicate the list of keys.

Table 3-6 Original Script Name

dnOriginalScriptName	dnClusterOriginalScript
Черен септември	Черен\|септе

	Myanmar
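
A minimal sketch of the cluster builder, including the Burmese special case from the note. The function name and `width` parameter are illustrative. Detecting Burmese text by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption about how the check might be done, not a description of the product's logic. The sketch also deduplicates before joining, which gives the same result as joining first.

```python
def original_script_key(name: str, width: int = 5) -> str:
    """Build a cluster key for a name written in a non-Latin script."""
    # Assumed detection rule: any character in the Unicode Myanmar block.
    if any("\u1000" <= ch <= "\u109f" for ch in name):
        return "Myanmar"
    # Split into tokens (split() treats any run of whitespace as a delimiter)
    # and trim each token to at most `width` characters.
    trimmed = [token[:width] for token in name.split()]
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return "|".join(dict.fromkeys(trimmed))


print(original_script_key("Черен септември"))
```

For the name in Table 3-6 this prints `Черен|септе`.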