Understanding the Master Index Standardization Engine

Data Cleansing Definitions

You can define cleansing rules that transform the input data prior to tokenization, making the input record uniform and ensuring that the data is correctly separated into its individual components. This standardization step is optional.

Common data transformations include converting the input to uppercase, finding and replacing substrings that match a regular expression, and normalizing white space.

The cleansing rules are defined within a cleanser element in the process definition file. You can use any of the rules defined in Standardization Processing Rules Reference to cleanse the data. Cleansing attributes use regular expressions to define values to find and replace.
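
For example, a minimal cleanser that converts periods to spaces and then normalizes the remaining white space could be sketched as follows. The specific pattern is illustrative only and is not taken from a shipped data type.

<cleanser>
   <replaceAll regex="\." replacement=" "/>
   <normalizeSpace/>
</cleanser>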

The following excerpt from the PhoneNumber data type transforms the input string prior to processing; the comment above each rule describes its effect:


<cleanser>
   <!-- Convert the input string to all uppercase -->
   <uppercase/>
   <!-- Reformat runs of 10 consecutive digits as (NNN)NNN-NNNN -->
   <replaceAll regex="([0-9]{3})([0-9]{3})([0-9]{4})" replacement="($1)$2-$3"/>
   <!-- Surround hyphens, parentheses, and commas with spaces -->
   <replaceAll regex="([-(),])" replacement=" $1 "/>
   <!-- Remove the space between an international prefix and a following hyphen -->
   <replaceAll regex="\+(\d+) -" replacement="+$1-"/>
   <!-- Standardize extension markers (such as EXT. or X#) to "X" plus the extension number -->
   <replaceAll regex="E?X[A-Z]*[.#]?\s*([0-9]+)" replacement="X $1"/>
   <!-- Collapse consecutive spaces and trim leading and trailing spaces -->
   <normalizeSpace/>
</cleanser>
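
Assuming the rules are applied in the order they appear, their combined effect can be reproduced outside the engine with ordinary Java regular-expression replacements. The following sketch is illustrative only; the class name, sample value, and helper method are hypothetical and are not part of the standardization engine API.

public class PhoneCleanserDemo {

    // Applies the same transformations as the cleanser excerpt above,
    // purely for illustration; the real engine reads the rules from the
    // process definition file.
    static String cleanse(String input) {
        String s = input.toUpperCase();                                   // <uppercase/>
        s = s.replaceAll("([0-9]{3})([0-9]{3})([0-9]{4})", "($1)$2-$3");  // 10 digits -> (NNN)NNN-NNNN
        s = s.replaceAll("([-(),])", " $1 ");                             // pad punctuation with spaces
        s = s.replaceAll("\\+(\\d+) -", "+$1-");                          // rejoin "+prefix" and hyphen
        s = s.replaceAll("E?X[A-Z]*[.#]?\\s*([0-9]+)", "X $1");           // extension markers -> "X <digits>"
        return s.trim().replaceAll("\\s+", " ");                          // <normalizeSpace/>
    }

    public static void main(String[] args) {
        // "3035551234 ext. 89" becomes "( 303 ) 555 - 1234 X 89"
        System.out.println(cleanse("3035551234 ext. 89"));
    }
}

Cleansing the string in this way leaves the tokenizer with cleanly separated components, such as the area code, exchange, line number, and extension, which supports the goal described at the start of this section.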