You can define cleansing rules that transform the input data before tokenization, making each input record uniform and ensuring the data is correctly separated into its individual components. This standardization step is optional.
Common data transformations include the following:
Converting a string to all uppercase.
Trimming leading and trailing white space.
Converting multiple spaces in the middle of a string to one space.
Transliterating accented characters or diacritical marks.
Adding a space on either side of extra characters (to help the tokenizer recognize them).
Removing extraneous content.
Fixing common typographical errors.
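As an illustration only, the transformations listed above can be sketched in Python; in practice, cleansing is configured declaratively in the process definition file rather than coded by hand:

```python
import re
import unicodedata

def cleanse(value: str) -> str:
    """Illustrative sketch of the common cleansing transformations."""
    # Convert the string to all uppercase.
    value = value.upper()
    # Transliterate accented characters to their base form (e.g. "É" -> "E").
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Add a space on either side of extra characters so the tokenizer
    # recognizes them as separate tokens.
    value = re.sub(r"([-(),])", r" \1 ", value)
    # Trim leading/trailing white space and collapse runs of spaces to one.
    value = re.sub(r"\s+", " ", value).strip()
    return value

print(cleanse("  José  Pérez-García "))  # JOSE PEREZ - GARCIA
```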
The cleansing rules are defined within a cleanser element in the process definition file. You can use any of the rules defined in Standardization Processing Rules Reference to cleanse the data. Cleansing attributes use regular expressions to define values to find and replace.
The following excerpt from the PhoneNumber data type does the following to the input string prior to processing:
Converts all characters to uppercase.
Replaces the specified input patterns with new patterns.
Removes white space at the beginning and end of the string and collapses multiple consecutive spaces into one space.
<cleanser>
    <uppercase/>
    <replaceAll regex="([0-9]{3})([0-9]{3})([0-9]{4})" replacement="($1)$2-$3"/>
    <replaceAll regex="([-(),])" replacement=" $1 "/>
    <replaceAll regex="\+(\d+) -" replacement="+$1-"/>
    <replaceAll regex="E?X[A-Z]*[.#]?\s*([0-9]+)" replacement="X $1"/>
    <normalizeSpace/>
</cleanser>
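As a rough mental model, the excerpt above behaves like the following Python sketch. The patterns are copied from the XML rules; Python's re.sub uses \1 where the replacement attributes use $1. This is not the engine's actual implementation:

```python
import re

# Each (pattern, replacement) pair mirrors one <replaceAll> rule from the
# PhoneNumber cleanser, applied in the order they appear.
RULES = [
    (r"([0-9]{3})([0-9]{3})([0-9]{4})", r"(\1)\2-\3"),  # format bare 10-digit numbers
    (r"([-(),])", r" \1 "),                             # pad punctuation for the tokenizer
    (r"\+(\d+) -", r"+\1-"),                            # rejoin country code to the hyphen
    (r"E?X[A-Z]*[.#]?\s*([0-9]+)", r"X \1"),            # normalize extension markers
]

def cleanse_phone(value: str) -> str:
    value = value.upper()                      # <uppercase/>
    for pattern, replacement in RULES:
        value = re.sub(pattern, replacement, value)
    return re.sub(r"\s+", " ", value).strip()  # <normalizeSpace/>

print(cleanse_phone("8005551234 ext. 12"))  # ( 800 ) 555 - 1234 X 12
```

Tracing one input shows the pipeline at work: "8005551234 ext. 12" is uppercased, formatted to "(800)555-1234 EXT. 12", padded around the punctuation, its extension marker rewritten to "X 12", and finally space-normalized.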