1.3.4.9.16 Match Transformation: Normalize Whitespace

The Normalize Whitespace transformation collapses each run of whitespace between words in a String into a single space character. It also removes leading and trailing whitespace.

Whitespace is defined in EDQ as:

  • Spaces

  • Non-printable characters, such as carriage returns, line feeds and tabs (and all other ASCII characters 0-31)
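The behavior described above can be approximated outside EDQ with the following minimal Python sketch. It is not EDQ's implementation. The function name normalize_whitespace is hypothetical, and the character class assumes that whitespace means exactly the space character plus ASCII characters 0-31, as listed above.

```python
import re

# Assumption: "whitespace" is the space character (0x20) plus ASCII 0-31,
# following the definition above. This is an illustration, not EDQ code.
_WHITESPACE_RUN = re.compile(r"[\x00-\x20]+")


def normalize_whitespace(value: str) -> str:
    """Collapse each whitespace run to one space, then trim both ends."""
    collapsed = _WHITESPACE_RUN.sub(" ", value)
    # After collapsing, any leading or trailing whitespace is a single space.
    return collapsed.strip(" ")


examples = [
    "John \t\rSimpson",   # space, tab, carriage return between the words
    "John  Simpson",      # two spaces
    " John Simpson",      # leading space
    "John Simpson \r",    # trailing space and carriage return
]
results = [normalize_whitespace(v) for v in examples]
```

In this sketch every entry in results is the same string, "John Simpson", because each whitespace run is replaced before the ends are trimmed.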

Use the Normalize Whitespace transformation when a dataset may contain keying errors such as multiple consecutive spaces. Normalizing whitespace can be useful in comparisons, for example, to ensure that the Character Edit Distance comparison (see Comparison: Character Edit Distance) does not register a difference between a single space and a run of several spaces. It can also be useful when clustering, before a Make Array from String transformation, so that forms of whitespace other than a space (such as carriage returns, tabs or other non-printing characters) all effectively act as delimiters. For example, the values "John[space]Simpson" and "John[tab][space]Simpson" are then tokenized identically, rather than the latter value yielding a "John[tab]" cluster value, which is different from "John".
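The clustering case can be illustrated with a short Python sketch. It only imitates the idea of splitting a value on a space delimiter. The helpers normalize_whitespace and tokenize are hypothetical stand-ins and do not represent the Normalize Whitespace or Make Array from String processors themselves. The whitespace definition (space plus ASCII 0-31) is assumed from the list above.

```python
import re


def normalize_whitespace(value: str) -> str:
    # Assumption: whitespace is the space character plus ASCII 0-31.
    return re.sub(r"[\x00-\x20]+", " ", value).strip(" ")


def tokenize(value: str, delimiter: str = " ") -> list[str]:
    # Hypothetical stand-in for splitting a string into an array of tokens.
    return value.split(delimiter)


clean = "John Simpson"     # John[space]Simpson
keyed = "John\t Simpson"   # John[tab][space]Simpson

# Without normalization the tab stays attached to the first token.
raw_tokens = tokenize(keyed)

# With normalization both values produce the same tokens.
normalized_tokens = tokenize(normalize_whitespace(keyed))
same_as_clean = normalized_tokens == tokenize(normalize_whitespace(clean))
```

Here raw_tokens keeps "John\t" as its first element, while normalized_tokens matches the tokens of the clean value, so both records would fall under the same "John" cluster value.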

Options

None.

Example transformations

The following table shows example transformations:

Table 1-88 Example Transformations for Normalize Whitespace

Value                                         Transformed Value

John[space][tab][carriage return]Simpson      John[space]Simpson

John[space][space]Simpson                     John[space]Simpson

[space]John[space]Simpson                     John[space]Simpson

John[space]Simpson[space][carriage return]    John[space]Simpson