1.3.4.9.11 Match Transformation: Generate Initials

The Generate Initials transformation allows records to be clustered or matched using initialized values from an identifier, for example, to match "BMW" and "Bayerische Motoren Werke". It works in exactly the same way as the main Generate Initials processor.

Use the Generate Initials transformation when matching company names, or other names which are frequently initialized when forming an identifier. This is useful to find matches such as "International Business Machines" and "IBM", which are hard for a computer to match without first initializing each value. An option is included to ensure that short 'words' such as "IBM" are not initialized to "I".

The following table describes the configuration options:

Configuration Description

Options

Specify the following options:

  • Delimiters Reference Data: allows the use of a standard set of characters that are used to split up words before generating initials. Type: Reference Data. Default value: *Delimiters.

  • Delimiter characters: specifies an additional set of characters that are used to split up words before generating initials. Type: Free text. Default value: Space.

  • Ignore upper case single words of length: leaves unchanged any single word value (that is, a value where no word splits occurred) that is entirely upper case and no longer than the specified number of characters (for example, 'IBM'). See the note below. Type: Integer. Default value: 4.

Note:

Normally, the Generate Initials transformation simply ignores the case of the original value, and generates upper case initials for each separate word it finds, as separated by the specified delimiters. For example, the values "A j Smith", "ALAN JOHN SMITH" and "Alan john smith" are all initialized as "AJS". However, there may be some values which are already initialized, such as "PWC", "IBM", "BT", which should not be further initialized to "P", "I" and "B" respectively.

These can be distinguished by the fact that they are:

  1. single word values,

  2. already in upper case, and

  3. only a few characters in length.

The Ignore upper case single words of length option sets the maximum length (in characters) at or below which single upper case word values are left uninitialized.

For example, if set to 4, the values "PWC", "BT", "RSPB" and "IBM" would be ignored during the initialization process, as they are 4 characters or fewer in length, are single word values, and are already upper case. By contrast, "IAN JOHN SMITH" would still be initialized to "IJS": although the word "IAN" is fewer than 4 characters in length and is already upper case, it is not a single word value. Also, "RSPCA" would be initialized to "R", as it is over 4 characters in length.
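The rule described in this note can be sketched in Python as follows. This is an illustrative approximation only, not the EDQ implementation: the function names split_words and generate_initials, their parameters, and the decision to discard empty pieces between consecutive delimiters are all assumptions made for this example.

```python
def split_words(value, delimiters):
    """Split value on any character in delimiters, dropping empty pieces."""
    words, current = [], []
    for ch in value:
        if ch in delimiters:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def generate_initials(value, delimiters=" ", ignore_upper_single_word_len=4):
    """Return upper case initials, leaving short upper case single words alone."""
    words = split_words(value, delimiters)
    if not words:
        return ""
    # Assumption: "single word" means exactly one word remains after splitting.
    if (len(words) == 1
            and words[0].isupper()
            and len(words[0]) <= ignore_upper_single_word_len):
        return words[0]
    # Otherwise ignore the original case and take the first character of each word.
    return "".join(word[0].upper() for word in words)


for v in ["A j Smith", "PWC", "RSPB", "IAN JOHN SMITH", "RSPCA"]:
    print(v, "->", generate_initials(v))
```

With the default length of 4, the loop reproduces the values quoted in the note: "PWC" and "RSPB" pass through unchanged, while "RSPCA" is reduced to "R".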

Example configuration

In this example, the Generate Initials transformation is being used within an Exact String match comparison (see Comparison: Exact String Match) to match company names, where values are frequently initialized.

Delimiters Reference Data: None

Delimiter characters: <space> .

Ignore upper case single words of length: 5

Note that two transformations are used before the Generate Initials transformation, as follows:

1. Upper Case - to convert all values to upper case

2. Strip Words - to strip certain words from the values. The reference data used includes the word 'PLC'.

Example transformations

The following table shows examples of transformations using the above configuration:

Table 1-83 Example Transformations for Generate Initials

Value                           | Value after Upper Case and Strip Words transformations | Value after Generate Initials transformation
--------------------------------+--------------------------------------------------------+---------------------------------------------
IBM                             | IBM                                                    | IBM
I.B.M.                          | I.B.M.                                                 | IBM
International Business Machines | INTERNATIONAL BUSINESS MACHINES                        | IBM
PWC                             | PWC                                                    | PWC
Price waterhouse coopers        | PRICE WATERHOUSE COOPERS                               | PWC
Price Waterhouse Coopers        | PRICE WATERHOUSE COOPERS                               | PWC
PRICE WATERHOUSE COOPERS        | PRICE WATERHOUSE COOPERS                               | PWC
British Telecom Plc             | BRITISH TELECOM                                        | BT
BT plc                          | BT                                                     | BT
BARKERS plc                     | BARKERS                                                | B
BARKERS & LEWIS plc             | BARKERS & LEWIS                                        | B&L
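
The example configuration and the two preceding transformations can be approximated in Python to check the rows of the table. This is a hedged sketch rather than the product's behavior: strip_words is a hypothetical stand-in for the Strip Words transformation (its reference data is assumed here to contain only 'PLC'), and generate_initials and match_key are illustrative names. Delimiters are space and period, and the length option is 5, as in the configuration above.

```python
STRIP_WORDS = {"PLC"}  # assumed content of the Strip Words reference data


def strip_words(value, words_to_strip=STRIP_WORDS):
    """Hypothetical Strip Words step: drop listed whole words."""
    return " ".join(w for w in value.split() if w not in words_to_strip)


def generate_initials(value, delimiters=" .", max_len=5):
    """Initialize value, leaving upper case single words of up to max_len alone."""
    # Normalize every delimiter to a space; the example configuration
    # already includes space as a delimiter, so this loses nothing here.
    for d in delimiters:
        value = value.replace(d, " ")
    words = [w for w in value.split(" ") if w]
    if not words:
        return ""
    if len(words) == 1 and words[0].isupper() and len(words[0]) <= max_len:
        return words[0]
    return "".join(w[0].upper() for w in words)


def match_key(value):
    """Upper Case, then Strip Words, then Generate Initials."""
    return generate_initials(strip_words(value.upper()))


for v in ["I.B.M.", "British Telecom Plc", "BARKERS plc", "BARKERS & LEWIS plc"]:
    print(v, "->", match_key(v))
```

Because "BARKERS" is seven characters long, it exceeds the configured length of 5 and is reduced to "B", whereas "BT" is left unchanged.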