Standardization Concepts

Data standardization transforms input data into common representations of values to give you a single, consistent view of the data stored in and across organizations. Standardizing the data stored in disparate systems provides a common representation of the data so you can easily and accurately compare data between systems.

Data standardization applies three transformations against the data: parsing into individual components, normalization, and phonetic encoding. These actions help cleanse data to prepare it for matching and searching. Some fields might require all three steps, some just normalization and phonetic conversion, and other data might only need phonetic encoding. Typically data is first parsed, then normalized, and then phonetically encoded, though some cleansing might be needed prior to parsing.

Standardization can include any one or any combination of the following phases.

Data Parsing or Reformatting
Data Normalization
Phonetic Encoding

Data Parsing or Reformatting

If incoming records contain data that is not formatted properly, it must be reformatted before it can be normalized. This process identifies and separates each component of a free-form text field that contains multiple pieces of information. Reformatting can also include removing characters or strings from a field that are not relevant to the data. A good example is standardizing free-form text address fields. If you are comparing or searching on street addresses that are contained in one or more free-form text fields (that is, the street address is contained in one field, apartment number in another, and so on), those fields need to be parsed into their individual components, such as house number, street name, street type, and street direction. Then certain components of the address, such as the street name and type, can be normalized. Field components are also known as tokens, and the process of separating data into its tokens is known as tokenization.

Data Normalization

Normalizing data converts it into a standard or common form. A common use for normalization is to convert nicknames into their standard names, such as converting “Rich” to “Richard” or “Meg” to “Margaret”. Another example is normalizing street address components. For example, both “Dr.” or “Drv” in a street address might be normalized to “Drive”. Normalized values are obtained from lookup tables. Once a field value is normalized, that value can be more accurately compared against values in other records to determine whether they are a match.

Phonetic Encoding

Once data has gone through any necessary reformatting and normalization, it can be phonetically encoded. In a master index application, phonetic values are generally used in blocking queries in order to obtain all possible matches to an incoming record. They are also used to perform searches from the Master Index Data Manager (MIDM) that allow for misspellings and typographic errors. Typically, first names use Soundex encoding and last names and street names use NYSIIS encoding, but the Master Index Standardization Engine supports several additional phonetic encoders as well.

Skip Navigation Links
Exit Print View
	Oracle Java CAPS Master Index Standardization Engine Reference Java CAPS Documentation