3.2 Clustering

Oracle Financial Services Customer Screening provides three different clustering strategies for matching entities: Entity Name Tokens, Name Metaphone, and Name Trimmed. Any of the clusters may be activated or deactivated, as required, and different cluster limits can be configured.

Entity Name Tokens (dnClusterNameTokens)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:
  • Remove initials.
  • Remove common name tokens, such as Limited or Corporation.
  • Normalize whitespace.
  • Convert space characters to pipe characters.
Examples

Table 3-2 Entity Name Tokens

dnEntityName            Name with initials and common name tokens stripped  dnClusterNameTokens
ANGLO CARIBBEAN CO LTD  ANGLO CARIBBEAN                                     ANGLO|CARIBBEAN
GUAMATUR S A            GUAMATUR                                            GUAMATUR
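
The default Entity Name Tokens logic can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name entity_name_tokens_key and the COMMON_NAME_TOKENS stop list are hypothetical, and an "initial" is assumed to mean a single-character token.

```python
# Hypothetical stop list; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}


def entity_name_tokens_key(entity_name: str) -> str:
    """Build a pipe-delimited cluster key from a standardized entity name."""
    # split() with no arguments also collapses runs of whitespace.
    tokens = entity_name.upper().split()
    kept = [
        t for t in tokens
        if len(t) > 1                      # assumption: initials are 1-character tokens
        and t not in COMMON_NAME_TOKENS    # drop common name tokens
    ]
    # Joining with "|" is equivalent to converting the remaining spaces to pipes.
    return "|".join(kept)


print(entity_name_tokens_key("ANGLO CARIBBEAN CO LTD"))  # ANGLO|CARIBBEAN
print(entity_name_tokens_key("GUAMATUR S A"))            # GUAMATUR
```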

Name Metaphone (dnClusterLongName)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:
  • Remove initials.
  • Remove common name tokens, such as Limited or Corporation.
  • Normalize whitespace.
  • Remove common business words, such as Company or Association.
  • Transliterate any non-Latin characters into Latin.
  • Apply the Metaphone transformation (the standard double-Metaphone algorithm), outputting a key of up to eight characters.

Table 3-3 Name Metaphone

dnEntityName                   Name with initials, common name tokens, and common business words stripped  dnClusterLongName
HAVANA INTERNATIONAL BANK LTD  HAVANA BANK                                                                 HFNPNK
CIMEX S A                      CIMEX                                                                       SMKS
LA EMPRESA CUBANA DE FLETES    EMPRESA CUBANA FLETES                                                       AMPRSKPN
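
The shape of the Name Metaphone pipeline can be sketched as follows. This is a sketch under several assumptions: the function name long_name_key and both stop lists are hypothetical, the transliteration step is omitted, and the phonetic encoder is passed in by the caller because no double-Metaphone implementation is specified here. The stand-in encoder in the usage lines only demonstrates the eight-character limit and does not produce Metaphone keys.

```python
from typing import Callable

# Hypothetical stop lists; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}
COMMON_BUSINESS_WORDS = {"COMPANY", "ASSOCIATION", "INTERNATIONAL"}


def long_name_key(entity_name: str,
                  phonetic: Callable[[str], str],
                  max_len: int = 8) -> str:
    """Strip noise tokens, encode phonetically, and cap the key length."""
    tokens = entity_name.upper().split()   # split() also normalizes whitespace
    kept = [
        t for t in tokens
        if len(t) > 1                      # assumption: initials are 1-character tokens
        and t not in COMMON_NAME_TOKENS
        and t not in COMMON_BUSINESS_WORDS
    ]
    # Transliteration of non-Latin characters would happen here (omitted).
    return phonetic(" ".join(kept))[:max_len]


def stand_in_encoder(text: str) -> str:
    """Placeholder for a real double-Metaphone encoder: just removes spaces."""
    return text.replace(" ", "")


print(long_name_key("HAVANA INTERNATIONAL BANK LTD", stand_in_encoder))  # HAVANABA
```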

Name Trimmed (dnClusterShortName)

This cluster uses the standardized entity name to generate cluster keys. The default logic is as follows:
  • Remove all whitespace.
  • Truncate the value to its leftmost four characters.
Example

Table 3-4 Name Trimmed

dnEntityName                   dnClusterShortName
HAVANA INTERNATIONAL BANK LTD  HAVA
CIMEX S A                      CIME
LA EMPRESA CUBANA DE FLETES    LAEM
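
A minimal Python sketch of the Name Trimmed logic follows. The function name short_name_key is hypothetical, and the input is assumed to be the already standardized entity name.

```python
def short_name_key(entity_name: str, max_len: int = 4) -> str:
    """Remove all whitespace, then keep at most the first max_len characters."""
    compact = "".join(entity_name.split())
    return compact[:max_len]


print(short_name_key("HAVANA INTERNATIONAL BANK LTD"))  # HAVA
print(short_name_key("CIMEX S A"))                      # CIME
print(short_name_key("LA EMPRESA CUBANA DE FLETES"))    # LAEM
```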

Registration Country Prohibition (Registration Country Code)

This cluster uses the space-delimited list of registration country codes to generate cluster keys by generating an array of the component country codes.

Operating Country Prohibition (Operating Country Code)

This cluster uses the space-delimited list of operating country codes to generate cluster keys by generating an array of the component country codes.
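
Both country-code clusters above apply the same step, which can be sketched in Python as shown here. The function name country_code_keys and the sample codes are illustrative assumptions. Each element of the returned list acts as one cluster key.

```python
from typing import List


def country_code_keys(country_codes: str) -> List[str]:
    """Split a space-delimited list of country codes into an array of keys."""
    # split() with no arguments ignores leading, trailing, and repeated spaces.
    return country_codes.split()


print(country_code_keys("CU VE"))  # ['CU', 'VE']
```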

Start/End Name Tokens (dnClusterStartEndNameTokens)

This clustering method is designed as a looser version of the Entity Name Tokens cluster and allows for variation in entity names by creating clusters for the first five and last five characters of each name token.

The default logic is as follows:
  • Remove initials.
  • Remove common name tokens, such as Limited or Corporation.
  • Normalize whitespace.
  • For each token that is longer than five characters, replace with two new tokens that are:
    • The first five characters of the token.
    • The last five characters of the token.

Table 3-5 Start or End Name Tokens

dnEntityName                   Name with initials and common name tokens stripped  dnClusterStartEndNameTokens
HAVANA INTERNATIONAL BANK LTD  HAVANA INTERNATIONAL BANK                           HAVAN|AVANA|INTER|IONAL|BANK
CIMEX S A                      CIMEX                                               CIMEX
LA EMPRESA CUBANA DE FLETES    LA EMPRESA CUBANA FLETES                            LA|EMPRE|PRESA|CUBAN|UBANA|FLETE|LETES
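
The Start/End Name Tokens logic can be sketched in Python as follows. As before, this is an illustration rather than the product's implementation: the function name start_end_tokens_key and the COMMON_NAME_TOKENS stop list are hypothetical, and initials are assumed to be single-character tokens.

```python
# Hypothetical stop list; the product's actual reference data is not shown here.
COMMON_NAME_TOKENS = {"LIMITED", "LTD", "CORPORATION", "CORP", "CO"}


def start_end_tokens_key(entity_name: str, width: int = 5) -> str:
    """Replace each long token with its first and last `width` characters."""
    tokens = entity_name.upper().split()   # split() also normalizes whitespace
    kept = [t for t in tokens if len(t) > 1 and t not in COMMON_NAME_TOKENS]
    keys = []
    for token in kept:
        if len(token) > width:
            keys.extend([token[:width], token[-width:]])
        else:
            keys.append(token)             # short tokens are kept unchanged
    return "|".join(keys)


print(start_end_tokens_key("HAVANA INTERNATIONAL BANK LTD"))
# HAVAN|AVANA|INTER|IONAL|BANK
print(start_end_tokens_key("CIMEX S A"))
# CIMEX
```

Because the start and end fragments become separate keys, two names that differ only in the middle of a long token still share at least one cluster.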

Original Script Name (dnClusterOriginalScript)

The Original Script Name cluster provides a clustering method for matching names represented in non-Latin writing systems. The cluster builder generates a key for each token in the name.

Note:

A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, which does not use a space character between words. As a result, all original script names in Burmese are compared with each other during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.

The default logic of the cluster builder is as follows:
  • Split the original script name into several name tokens, using a space character as the delimiter.
  • Trim each name token to a maximum of 5 characters.
  • Concatenate all of the trimmed token values with a pipe delimiter.
  • Deduplicate the list of keys.

Table 3-6 Original Script Name

dnOriginalScriptName            dnClusterOriginalScript
Черен септември                 Черен|септе
[Chinese original script name]  [Chinese cluster original script]
[Myanmar original script name]  Myanmar
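
The Original Script Name cluster builder, including the Burmese special case, can be sketched in Python as follows. The function name original_script_key is hypothetical. Detecting Burmese text by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption made for this sketch, since the detection method is not described here. The sketch deduplicates the trimmed tokens before joining them, which gives the same result as deduplicating the list of keys afterward.

```python
def original_script_key(original_script_name: str, width: int = 5) -> str:
    """Build a pipe-delimited key from trimmed, deduplicated name tokens."""
    # Assumption: any character in the Unicode Myanmar block marks the name as Burmese.
    if any("\u1000" <= ch <= "\u109f" for ch in original_script_name):
        return "Myanmar"

    keys = []
    for token in original_script_name.split():
        trimmed = token[:width]            # keep at most `width` characters
        if trimmed not in keys:            # deduplicate, preserving order
            keys.append(trimmed)
    return "|".join(keys)


print(original_script_key("Черен септември"))  # Черен|септе
```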