Understanding the Master Index Standardization Engine

Person Name Standardization and Sun Master Index

Master index applications rely on the Master Index Standardization Engine to process person name data. To ensure correct processing of person information, you need to customize the Matching Service for the master index application according to the rules defined for the standardization engine. This includes modifying mefa.xml to define which fields are normalized or parsed and which fields are phonetically encoded. You can modify mefa.xml with the Master Index Configuration Editor in the master index project.

Standardization is defined in the StandardizationConfig section of mefa.xml, which is described in detail in Match Field Configuration in Understanding Sun Master Index Configuration Options. To configure the required fields for normalization, modify the normalization structure in mefa.xml. To configure the required fields for parsing and normalization, modify the standardization structure. To configure phonetic encoding, modify the phonetic encoding structure. These tasks can all be performed using the Master Index Configuration Editor.

Generally, person data is already parsed into separate fields before it reaches the standardization engine, so you should not need to configure fields to parse unless your person data is stored in free-form text fields with all name information in one field. When processing person data, you might also want to search on address information. In that case, you need to configure the address fields to parse and normalize.

The following topics provide information about the fields used in processing person data and how to configure person data standardization for a master index application. The information provided in these topics is based on the default configuration.

Person Name Processing Fields

When standardizing person data, not all fields in a record need to be processed by the Master Index Standardization Engine. The standardization engine only needs to process fields that must be parsed, normalized, or phonetically encoded. For a master index application, these fields are defined in mefa.xml, and the processing logic for each field is defined in the standardization engine configuration files.

Person Name Standardized Fields

The Master Index Standardization Engine can process person data that is provided in separate fields within a single record, meaning that no parsing is required of the name fields prior to normalization. It can also process person data contained in one long free-form field and parse the field into its individual components, such as first name, last name, title, and so on. Typically, only first and last names are normalized and phonetically encoded when standardizing person data, but the standardization engine can normalize and phonetically encode any field you choose. By default, the standardization engine processes these fields: first name, middle name, last name, nickname, salutation, generational suffix, and title.

Person Name Object Structure

The fields you specify for person name matching in the Master Index wizard are automatically defined for standardization and phonetic encoding. If you specify the PersonFirstName or PersonLastName match type for a field in the wizard, two fields are automatically added to the object structure and database creation script: a standardized version of the field, named with the _Std suffix, and a phonetic version, named with the _Phon suffix.

For example, if you specify the PersonFirstName match type for the FirstName field, two fields, FirstName_Std and FirstName_Phon, are automatically added to the structure. You can also add these fields manually if you do not specify match types in the wizard. If you are parsing free-form person data, be sure all output fields from the standardization process are included in the master index object structure. If you store additional names in the database, such as alias names, maiden names, parent names, and so on, you can modify the phonetic structure to phonetically encode those names as well.

Configuring a Normalization Structure for Person Names

The fields defined for normalization for the PersonName data type can include any name fields. By default, normalization rules are defined in the process definition file for first, middle, and last name fields, and you can easily define additional fields. You only need to define a normalization structure for person data if you are processing individual fields that do not require parsing. Follow the instructions under Defining Master Index Normalization Rules in Configuring Sun Master Indexes to define fields for normalization. For the standardization-type element, enter PersonName. For a list of field IDs to use in the standardized-object-field-id element, see Person Name Standardization Components.

A sample normalization structure for person data is shown below. This sample specifies that the PersonName standardization type is used to normalize the first name, alias first name, last name, and alias last name fields. For all name fields, both United States and United Kingdom domains are defined for standardization.


<structures-to-normalize>
   <group standardization-type="PersonName"
    domain-selector="com.sun.mdm.index.matching.impl.MultiDomainSelector">
      <locale-field-name>Person.PobCountry</locale-field-name>
      <locale-maps>
         <locale-codes>
            <value>UNST</value>
            <locale>US</locale>
         </locale-codes>
         <locale-codes>
            <value>GB</value>
            <locale>UK</locale>
         </locale-codes>
      </locale-maps>
      <unnormalized-source-fields>
         <source-mapping>
            <unnormalized-source-field-name>Person.FirstName
            </unnormalized-source-field-name>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
         </source-mapping>
         <source-mapping>
            <unnormalized-source-field-name>Person.LastName
            </unnormalized-source-field-name>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
         </source-mapping>
      </unnormalized-source-fields>
      <normalization-targets>
         <target-mapping>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.FirstName_Std
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.LastName_Std
            </standardized-target-field-name>
         </target-mapping>
      </normalization-targets>
   </group>
   <group standardization-type="PersonName" domain-selector=
     "com.sun.mdm.index.matching.impl.MultiDomainSelector">
      <locale-field-name>Person.PobCountry</locale-field-name>
      <locale-maps>
         <locale-codes>
            <value>UNST</value>
            <locale>US</locale>
         </locale-codes>
         <locale-codes>
            <value>GB</value>
            <locale>UK</locale>
         </locale-codes>
      </locale-maps>
      <unnormalized-source-fields>
         <source-mapping>
            <unnormalized-source-field-name>Person.Alias[*].FirstName
            </unnormalized-source-field-name>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
         </source-mapping>
         <source-mapping>
            <unnormalized-source-field-name>Person.Alias[*].LastName
            </unnormalized-source-field-name>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
         </source-mapping>
      </unnormalized-source-fields>
      <normalization-targets>
         <target-mapping>
            <standardized-object-field-id>FirstName
            </standardized-object-field-id>
            <standardized-target-field-name>
            Person.Alias[*].FirstName_Std
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>LastName
            </standardized-object-field-id>
            <standardized-target-field-name>
            Person.Alias[*].LastName_Std
            </standardized-target-field-name>
         </target-mapping>
      </normalization-targets>
   </group>
</structures-to-normalize>
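
As a hedged illustration of defining an additional field for normalization, the fragments below sketch the two mappings that might be added to the first group in the sample to normalize a middle name as well. The field names Person.MiddleName and Person.MiddleName_Std and the field ID MiddleName are assumptions made by analogy with the first and last name mappings; check the actual field ID against Person Name Standardization Components before relying on it.

```xml
<!-- Sketch only. Add inside <unnormalized-source-fields> of the Person group.
     Person.MiddleName and the MiddleName field ID are assumed names. -->
<source-mapping>
   <unnormalized-source-field-name>Person.MiddleName
   </unnormalized-source-field-name>
   <standardized-object-field-id>MiddleName
   </standardized-object-field-id>
</source-mapping>

<!-- Sketch only. Add inside <normalization-targets> of the same group.
     Person.MiddleName_Std is an assumed target field. -->
<target-mapping>
   <standardized-object-field-id>MiddleName
   </standardized-object-field-id>
   <standardized-target-field-name>Person.MiddleName_Std
   </standardized-target-field-name>
</target-mapping>
```

The source mapping and the target mapping are tied together by the shared field ID, which is how the existing first and last name mappings in the sample are paired.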

Configuring a Standardization Structure for Person Names

For free-form name fields, the source fields that are defined for standardization should include the predefined standardization components. For example, fields containing person name information can include the first name, middle name, last name, suffix, title, and salutation. The target fields you define can include any of these parsed components. Follow the instructions under Defining Master Index Standardization Rules in Configuring Sun Master Indexes to define fields for standardization. For the standardization-type element, enter PersonName. For a list of field IDs to use in the standardized-object-field-id element, see Person Name Standardization Components.

A sample standardization structure for person name data is shown below. Only the United States variant is defined in this structure.


<free-form-texts-to-standardize>
   <group standardization-type="PERSONNAME"
    domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Name
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>salutation
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Prefix
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>firstName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.FirstName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>middleName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.MiddleName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>lastName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.LastName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>suffix
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Suffix
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>title
            </standardized-object-field-id>
            <standardized-target-field-name>Person.Title
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
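
Because the target fields can include any subset of the parsed components, a smaller structure is also possible. The following is a minimal sketch, assuming only the first and last names need to be kept from the free-form Person.Name field. The attribute values, field IDs, and field names are copied from the sample above rather than verified independently.

```xml
<!-- Sketch only: parses Person.Name but keeps just two of the parsed
     components. All values are copied from the sample above. -->
<free-form-texts-to-standardize>
   <group standardization-type="PERSONNAME"
    domain-selector="com.sun.mdm.index.matching.impl.SingleDomainSelectorUS">
      <unstandardized-source-fields>
         <unstandardized-source-field-name>Person.Name
         </unstandardized-source-field-name>
      </unstandardized-source-fields>
      <standardization-targets>
         <target-mapping>
            <standardized-object-field-id>firstName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.FirstName
            </standardized-target-field-name>
         </target-mapping>
         <target-mapping>
            <standardized-object-field-id>lastName
            </standardized-object-field-id>
            <standardized-target-field-name>Person.LastName
            </standardized-target-field-name>
         </target-mapping>
      </standardization-targets>
   </group>
</free-form-texts-to-standardize>
```

In this sketch the salutation, middle name, suffix, and title components have no target mapping, so they are not written to any field of the object structure.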

Configuring Phonetic Encoding for Person Names

When you specify a first, middle, or last name field for person name matching in the Master Index wizard, that field is automatically defined for phonetic encoding. You can define additional names, such as maiden names or alias names, for phonetic encoding as well. Follow the instructions under Defining Phonetic Encoding for the Master Index in Configuring Sun Master Indexes to define fields for phonetic encoding.

A sample of fields defined for phonetic encoding is shown below. This sample phonetically encodes the standardized name and alias name fields, as well as the parsed street name.


<phoneticize-fields>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.FirstName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.FirstName_Phon
      </phoneticized-target-field-name>
      <encoding-type>Soundex</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.LastName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.LastName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Alias[*].FirstName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Alias[*].FirstName_Phon
      </phoneticized-target-field-name>
      <encoding-type>Soundex</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Alias[*].LastName_Std
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Alias[*].LastName_Phon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
   <phoneticize-field>
      <unphoneticized-source-field-name>Person.Address[*].AddressLine1_StName
      </unphoneticized-source-field-name>
      <phoneticized-target-field-name>Person.Address[*].AddressLine1_StPhon
      </phoneticized-target-field-name>
      <encoding-type>NYSIIS</encoding-type>
   </phoneticize-field>
</phoneticize-fields>
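
To illustrate defining an additional name for phonetic encoding, the fragment below sketches an entry for a maiden name. The field names Person.MaidenName and Person.MaidenName_Phon are hypothetical and stand in for whatever fields your object structure actually defines. NYSIIS is chosen only because the sample above uses it for last names.

```xml
<!-- Sketch only. Add inside the existing <phoneticize-fields> element.
     Person.MaidenName and Person.MaidenName_Phon are hypothetical fields. -->
<phoneticize-field>
   <unphoneticized-source-field-name>Person.MaidenName
   </unphoneticized-source-field-name>
   <phoneticized-target-field-name>Person.MaidenName_Phon
   </phoneticized-target-field-name>
   <encoding-type>NYSIIS</encoding-type>
</phoneticize-field>
```

Unlike the fields generated by the wizard, a manually added target field such as this one would presumably also need to be added to the object structure and database before it can be populated.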