3.2 Clustering
Oracle Financial Services Customer Screening provides three different clustering strategies for matching entities: Entity Name Tokens, Name Metaphone, and Name Trimmed. Any of the clusters may be activated or deactivated, as required, and different cluster limits can be configured.
Entity Name Tokens (dnClusterNameTokens)
- Remove initials.
- Remove common name tokens, such as Limited, or Corporation.
- Normalize whitespace.
- Convert space characters to pipe characters.
Table 3-2 Entity Name Tokens
dnEntityName | Name with initials and common name tokens stripped | dnClusterNameTokens |
---|---|---|
ANGLO CARIBBEAN CO LTD | ANGLO CARIBBEAN | ANGLO\|CARIBBEAN |
GUAMATUR S A | GUAMATUR | GUAMATUR |
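
The steps for the Entity Name Tokens cluster can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `entity_name_tokens`, the `COMMON_NAME_TOKENS` stop list, and the rule that an initial is any single-character token are all assumptions made for the example.

```python
# Hypothetical stop list, for illustration only. The product's actual
# reference data for common name tokens is not shown in this section.
COMMON_NAME_TOKENS = {"LTD", "LIMITED", "CO", "CORP", "CORPORATION"}


def entity_name_tokens(entity_name: str) -> str:
    """Build a pipe-delimited cluster key from an entity name."""
    # str.split() with no argument splits on runs of whitespace,
    # which also normalizes the whitespace.
    tokens = entity_name.upper().split()
    kept = [
        token
        for token in tokens
        if len(token) > 1                    # assumption: an initial is one character
        and token not in COMMON_NAME_TOKENS  # drop common name tokens
    ]
    # Joining with "|" is equivalent to converting the spaces to pipes.
    return "|".join(kept)


print(entity_name_tokens("ANGLO CARIBBEAN CO LTD"))
print(entity_name_tokens("GUAMATUR S A"))
```

With the stop list above, the two calls reproduce the cluster keys shown in Table 3-2.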
Name Metaphone (dnClusterLongName)
- Remove initials.
- Remove common name tokens, such as Limited, or Corporation.
- Normalize whitespace.
- Remove common business words, such as Company, or Association.
- Transliterate any non-Latin characters into Latin.
- Apply the Metaphone transformation (the standard double-Metaphone algorithm), which outputs a key of up to eight characters.
Table 3-3 Name Metaphone
dnEntityName | Name with initials, common name tokens, and common business words stripped | dnClusterLongName |
---|---|---|
HAVANA INTERNATIONAL BANK LTD | HAVANA BANK | HFNPNK |
CIMEX S A | CIMEX | SMKS |
LA EMPRESA CUBANA DE FLETES | EMPRESA CUBANA FLETES | AMPRSKPN |
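
The order of the Name Metaphone steps can be sketched as a small pipeline. The sketch below does not implement Metaphone itself: the phonetic encoder and the transliteration step are passed in as callables that you would supply. The function names, both word lists, and the treatment of single-character tokens as initials are assumptions chosen so that the stripped names match the middle column of Table 3-3.

```python
from typing import Callable

# Hypothetical word lists, for illustration only. They are inferred from
# the examples in Table 3-3 and are not the product's reference data.
COMMON_NAME_TOKENS = {"LTD", "LIMITED", "CORPORATION"}
COMMON_BUSINESS_WORDS = {"COMPANY", "ASSOCIATION", "INTERNATIONAL", "LA", "DE"}


def strip_for_metaphone(entity_name: str) -> str:
    """Remove initials, common name tokens, and common business words."""
    tokens = entity_name.upper().split()  # split() also normalizes whitespace
    kept = [
        token
        for token in tokens
        if len(token) > 1  # assumption: an initial is one character
        and token not in COMMON_NAME_TOKENS
        and token not in COMMON_BUSINESS_WORDS
    ]
    return " ".join(kept)


def name_metaphone_key(
    entity_name: str,
    phonetic_encoder: Callable[[str], str],
    transliterate: Callable[[str], str] = lambda text: text,
    max_len: int = 8,
) -> str:
    """Strip the name, transliterate it, encode it, and cap the key length."""
    stripped = strip_for_metaphone(entity_name)
    latin = transliterate(stripped)  # identity by default; supply a real one
    return phonetic_encoder(latin)[:max_len]


print(strip_for_metaphone("HAVANA INTERNATIONAL BANK LTD"))
```

Keeping the encoder as a parameter lets the preprocessing be checked against the table independently of whichever Metaphone implementation is used.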
Name Trimmed (dnClusterShortName)
- Remove all whitespace.
- Truncate the value to its leftmost 4 characters.
Table 3-4 Name Trimmed
dnEntityName | dnClusterShortName |
---|---|
HAVANA INTERNATIONAL BANK LTD | HAVA |
CIMEX S A | CIME |
LA EMPRESA CUBANA DE FLETES | LAEM |
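
A minimal Python sketch of the Name Trimmed cluster follows. The function name `name_trimmed` and the `max_len` parameter are hypothetical, introduced only for this example.

```python
def name_trimmed(entity_name: str, max_len: int = 4) -> str:
    """Remove all whitespace, then keep the leftmost max_len characters."""
    compact = "".join(entity_name.split())  # drops every whitespace character
    return compact[:max_len]


print(name_trimmed("HAVANA INTERNATIONAL BANK LTD"))
print(name_trimmed("LA EMPRESA CUBANA DE FLETES"))
```

Because whitespace is removed before truncation, a name that starts with a short word, such as LA EMPRESA, contributes characters from the following word to its key.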
Registration Country Prohibition (Registration Country Code)
This cluster splits the space-delimited list of registration country codes into an array of the component country codes, and uses each code as a cluster key.
Operating Country Prohibition (Operating Country Code)
This cluster splits the space-delimited list of operating country codes into an array of the component country codes, and uses each code as a cluster key.
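Both country code clusters apply the same transformation, which can be sketched as below. The function name `country_cluster_keys` and the sample codes are hypothetical and used only for illustration.

```python
from typing import List


def country_cluster_keys(country_codes: str) -> List[str]:
    """Split a space-delimited list of country codes into one key per code."""
    # split() with no argument ignores repeated, leading, and trailing spaces.
    return country_codes.split()


print(country_cluster_keys("CU IR"))
```

A record that lists several country codes therefore yields several cluster keys, one for each code.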
Start/End Name Tokens (dnClusterStartEndNameTokens)
This clustering method is designed as a looser version of the Entity Name Tokens cluster and allows for variation in entity names by creating clusters for the first five and last five characters of each name token.
- Remove initials.
- Remove common name tokens, such as Limited, or Corporation.
- Normalize whitespace.
- For each token that is longer than five characters, replace with two new tokens that are:
  - The first five characters of the token.
  - The last five characters of the token.
Table 3-5 Start or End Name Tokens
dnEntityName | Name with initials and common name tokens stripped | dnClusterStartEndNameTokens |
---|---|---|
HAVANA INTERNATIONAL BANK LTD | HAVANA INTERNATIONAL BANK | HAVAN\|AVANA\|INTER\|IONAL\|BANK |
CIMEX S A | CIMEX | CIMEX |
LA EMPRESA CUBANA DE FLETES | LA EMPRESA CUBANA FLETES | LA\|EMPRE\|PRESA\|CUBAN\|UBANA\|FLETE\|LETES |
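
The Start/End Name Tokens steps can be sketched in Python as follows. The function name `start_end_name_tokens`, the `width` parameter, the `COMMON_NAME_TOKENS` stop list, and the single-character rule for initials are assumptions made for this example. The stop list is inferred from the rows of Table 3-5 and is not the product's reference data.

```python
# Hypothetical stop list, for illustration only.
COMMON_NAME_TOKENS = {"LTD", "LIMITED", "CORPORATION", "DE"}


def start_end_name_tokens(entity_name: str, width: int = 5) -> str:
    """Build a pipe-delimited key from the start and end of each name token."""
    keys = []
    for token in entity_name.upper().split():  # split() normalizes whitespace
        if len(token) <= 1 or token in COMMON_NAME_TOKENS:
            continue  # skip initials (assumed single-character) and common tokens
        if len(token) > width:
            # Long token: replace it with its first and last `width` characters.
            keys.extend([token[:width], token[-width:]])
        else:
            keys.append(token)  # short tokens are kept unchanged
    return "|".join(keys)


print(start_end_name_tokens("HAVANA INTERNATIONAL BANK LTD"))
```

A token of exactly five characters, such as CIMEX, is not longer than five characters and so is kept whole.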
Original Script Name (dnClusterOriginalScript)
Note:
A single cluster value of "Myanmar" is generated for original script names written in the Burmese alphabet, irrespective of the name. This is needed because token splitting is not possible for the Myanmar writing system, which does not use a space character between words. As a result, all original script names in Burmese will be compared during matching. This should not cause performance issues during screening, provided that only a small number of customer records use this writing system.
- Split the original script name into several name tokens, using a space character as the delimiter.
- Trim each name token to a maximum of 5 characters.
- Concatenate all of the trimmed token values with a pipe delimiter.
- Deduplicate the list of keys.
Table 3-6 Original Script Name
dnOriginalScriptName | dnClusterOriginalScript |
---|---|
Черен септември | Черен\|септе |
(name in Burmese script; image not recovered) | Myanmar |
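
The Original Script Name steps, including the special case for Burmese, can be sketched as below. The function name `original_script_cluster` and the `width` parameter are hypothetical. Detecting Burmese by checking for characters in the Unicode Myanmar block (U+1000 to U+109F) is an assumption of this sketch. The source does not say how the product identifies the script.

```python
def original_script_cluster(name: str, width: int = 5) -> str:
    """Build the cluster key for a name written in its original script."""
    # Assumption: any character in the Unicode Myanmar block (U+1000-U+109F)
    # marks the name as Burmese, which maps to a single shared cluster value.
    if any("\u1000" <= ch <= "\u109f" for ch in name):
        return "Myanmar"
    # Split on spaces and trim each token to at most `width` characters.
    trimmed = [token[:width] for token in name.split()]
    # Deduplicate while preserving order. Doing this before joining gives
    # the same result as joining first and then deduplicating the keys.
    unique = list(dict.fromkeys(trimmed))
    return "|".join(unique)


print(original_script_cluster("Черен септември"))
```

String slicing in Python counts Unicode code points rather than bytes, so the five-character trim behaves as expected for Cyrillic and other non-Latin scripts.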