Understanding the Master Index Standardization Engine

Person Name Standardization Files

Several configuration files are used to define standardization logic for processing person names. You can customize any of the configuration files described in this section to fit your processing and standardization requirements for person data. There are three types of standardization files for person data: process definition, lexicon, and normalization. Four default variants of the PersonName data type are provided, each specialized for standardizing data from France, Australia, the United Kingdom, or the United States. In a master index project, these files appear under PersonName in the Standardization Engine node. The files for each variant appear in their own sub-folder of PersonName, with one sub-folder for each national variant.

You can customize these files to add entries of other nationalities or languages, including those containing diacritical marks. You can also create new variants to process data of other nationalities. For more information, see Custom Data Types and Variants.

The following topics provide information about each type of person name standardization file:

Person Name Lexicon Files

Each PersonName variant contains a set of lexicon files. Each lexicon file contains a list of possible values for a field. The standardization engine matches input values against the values listed in these files to recognize input symbols and ensure correct tokenization. The Master Index Standardization Engine uses these files when processing input symbols as defined in the process definition file (standardizer.xml). They are primarily used during the token matching portion of parsing. You can modify these files as needed by adding, deleting, or modifying values in the list. You can also create additional lexicon files.

The PersonName data type includes the following lexicon files:

generation.txt
givenNames.txt
salutation.txt
surnames.txt
titles.txt

These files are located in the resource folder under each variant name.
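
As an illustration of how a lexicon lookup can recognize input tokens, the following Python sketch classifies each token by the lexicon that contains it. It is not the engine's implementation. It assumes one value per line in each lexicon file and case-insensitive matching; the sample entries and the names load_lexicon and classify are hypothetical.

```python
import io

def load_lexicon(lines):
    """Read one value per line, skipping blank lines, into an upper-cased set."""
    return {line.strip().upper() for line in lines if line.strip()}

# Hypothetical sample contents standing in for salutation.txt and givenNames.txt.
salutations = load_lexicon(io.StringIO("MR\nMRS\nDR\n"))
given_names = load_lexicon(io.StringIO("JOHN\nMARY\nELIZABETH\n"))

def classify(token, lexicons):
    """Return the name of the first lexicon that contains the token, else None."""
    key = token.strip().upper()
    for name, values in lexicons.items():
        if key in values:
            return name
    return None

lexicons = {"salutation": salutations, "givenName": given_names}

# "Smith" is in neither sample lexicon, so it is left unclassified (None).
labels = [classify(token, lexicons) for token in "Dr Mary Smith".split()]
```

Adding a value to a lexicon file corresponds here to adding a line to the sample text, after which the token is recognized on the next load.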

Person Name Normalization Files

Each PersonName variant contains a set of normalization files that are used to normalize input values. The Master Index Standardization Engine uses these files when processing input symbols as defined in the process definition file (standardizer.xml). Each normalization file contains a column of unnormalized values, such as nicknames or abbreviations, and a second column that contains the corresponding normalized values. The values in each column are separated by a pipe symbol (|). You can modify these files as needed by adding, deleting, or modifying values in the list. You can also create additional normalization files to reference from the process definition file.

The PersonName data type includes the following normalization files:

generationNormalization.txt
givenNameNormalization.txt
salutationNormalization.txt
surnameNormalization.txt
titleNormalization.txt

These files are located in the resource folder under each variant name.
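
The two-column, pipe-separated layout described above can be read as a simple lookup table. The following Python sketch shows the idea under stated assumptions: the sample rows are invented, lookups are case-insensitive, and values with no entry are returned unchanged. The function names are hypothetical and do not come from the product.

```python
def load_normalization(lines):
    """Build a lookup table from lines of the form 'unnormalized|normalized'."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        raw, normalized = line.split("|", 1)
        table[raw.strip().upper()] = normalized.strip()
    return table

def normalize(value, table):
    """Return the normalized form of a value, or the trimmed value if unlisted."""
    return table.get(value.strip().upper(), value.strip())

# Hypothetical rows in the style of givenNameNormalization.txt.
sample_rows = ["BILL|WILLIAM", "LIZ|ELIZABETH", "BETH|ELIZABETH"]
table = load_normalization(sample_rows)

normalized_nickname = normalize("Liz", table)   # listed nickname
unchanged_value = normalize("Mary", table)      # not listed, returned as-is
```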

Person Name Process Definition Files

Each variant has its own process definition file (standardizer.xml) that defines the state model for standardizing free-form person names. Each of these files also includes a section that defines just normalization without parsing for person names. The process definition file is located in the resource folder under each variant name. For information about the structure of this file, see Process Definition File.

Person name standardization has several states, each defining how to process tokens when they are found in certain orders. The default file defines states for salutations, first names, middle names, last names, titles, suffixes, and separators. It defines provisions for instances when the fields do not appear in order or when the input string does not contain complete data. For example, the current definition handles instances where the input string is “FirstName, MiddleName, LastName” as well as instances where the input string is “LastName, FirstName, MiddleName”.
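
To make the idea of order-dependent states concrete, here is a toy state model in Python that accepts both orders mentioned above, as well as input with a missing middle name. It is a simplified sketch, not the state model in standardizer.xml: the state names, the transition table, and the two small lexicons are all hypothetical.

```python
import re

GIVEN = {"JOHN", "MICHAEL", "MARY"}   # hypothetical givenNames.txt entries
SURNAMES = {"SMITH", "JONES"}         # hypothetical surnames.txt entries

# (current state, token type) -> (next state, output field)
TRANSITIONS = {
    ("start", "given"): ("first", "FirstName"),
    ("start", "surname"): ("last_first", "LastName"),
    ("first", "given"): ("middle", "MiddleName"),
    ("first", "surname"): ("done", "LastName"),
    ("middle", "surname"): ("done", "LastName"),
    ("last_first", "given"): ("lf_first", "FirstName"),
    ("lf_first", "given"): ("done", "MiddleName"),
}

def token_type(token):
    """Classify a token by the lexicon that contains it."""
    key = token.upper()
    if key in GIVEN:
        return "given"
    if key in SURNAMES:
        return "surname"
    return None

def parse_name(text):
    """Walk the state model; return the parsed fields, or None if no rule applies."""
    state = "start"
    fields = {}
    for token in re.findall(r"[^\s,]+", text):  # commas and spaces act as separators
        step = TRANSITIONS.get((state, token_type(token)))
        if step is None:
            return None
        state, field = step
        fields[field] = token
    return fields

given_first = parse_name("John Michael Smith")
surname_first = parse_name("Smith, John, Michael")
no_middle = parse_name("John Smith")
```

In this sketch the meaning of a given-name token depends on the current state: the same token type yields FirstName in one state and MiddleName in another.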

The process definition files for person names define several parsing rules for each field component. Each file defines a set of cleansing rules that prepare the input string before any other processing. The cleansed data is then passed to the start state of the finite state machine (FSM). Most fields are preprocessed and then matched against regular expressions or against a list of values in a lexicon file (described in Person Name Lexicon Files). Postprocessing includes replacing regular expressions or normalizing the field value based on a normalization file (described in Person Name Normalization Files). The process definition files also define a set of normalization rules, which are followed when the incoming data already contains name information in separate fields and does not need to be parsed.
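
The cleansing step and the normalization-only path can be sketched as follows. This Python example is illustrative only. The two cleansing rules, the upper-casing of input, the nickname table, and the function names are assumptions made for the example, not rules taken from the default files.

```python
import re

# Hypothetical cleansing rules, applied in order before any other processing.
CLEANSING_RULES = [
    (re.compile(r"[.]"), ""),    # drop periods, for example "Dr." becomes "Dr"
    (re.compile(r"\s+"), " "),   # collapse runs of whitespace into one space
]

def cleanse(text):
    """Upper-case the input and apply each cleansing rule in turn."""
    text = text.upper()
    for pattern, replacement in CLEANSING_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

# Hypothetical entries in the style of givenNameNormalization.txt.
NICKNAMES = {"BILL": "WILLIAM"}

def normalize_fields(record):
    """Normalization-only path: the fields are already separate, so nothing is parsed."""
    result = dict(record)
    first = cleanse(record.get("FirstName", ""))
    result["FirstName"] = NICKNAMES.get(first, first)
    return result

cleaned = cleanse("  Dr.   Bill  Smith ")
normalized = normalize_fields({"FirstName": "Bill", "LastName": "Smith"})
```

A free-form string would go through cleanse and then into the parser, while a record with separate name fields skips parsing and goes straight to normalize_fields.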