Understanding Sun Master Index Configuration Options (Repository)

Match Field Configuration (Repository)

The Matching Service, configured in the Match Field file, contains the matching and standardization engines used in the match process, as well as the phonetic encoders used for phonetically encoding data. You can configure the match and standardization engines for the master index application in the Match Field file, and also specify special standardization, matching, and weighting logic used by the engines. This file also defines the strategy for identifying unique records and finding the best matches in the master index database. For optimization, the Match Field components are configurable, allowing you to choose the strategy that best fits your requirements or to implement your own custom components.

The following topics describe the components of the Matching Service and the structure of the Match Field file:

Matching Service Components (Repository)

The Matching Service is configured by the Match Field file, which defines the configurable properties for standardizing data and matching records. These processes are highly configurable for the master index application, allowing you to design and develop the match strategy that best suits your processing requirements.

The following components make up the Matching Service:

Standardization Configuration

Standardization of incoming data applies three functions to the data processed by the master index application: reformatting (or parsing), normalization, and phonetic encoding. These functions help prepare data for matching and searching. Some fields might require all three steps, some just normalization and phonetic conversion, and other data might only need phonetic encoding. You can specify which fields require any of these steps in the standardization configuration section of the Match Field file. In addition, you can specify the nationality of the data being standardized by the Sun Match Engine.

The three stages of standardization include the following:

  • Reformatting (parsing)

  • Normalization

  • Phonetic encoding

Matching Configuration

The MatchingConfig section of the Match Field file allows you to define the data fields that are sent to the match engine (called the match string). Probabilistic weighting is performed only against the fields you specify as the match columns. You can specify any field in the object structure as a match column as long as the match engine is configured to use all fields specified. You must specify at least one match field.

The configuration of this section of the Match Field file is specific to the match engine you are using and the types of fields on which you are matching. For more information about how the matching should be configured for the Sun Match Engine, see Understanding the Sun Match Engine.
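
As a rough illustration of a match string, the following fragment sketches a MatchingConfig section with an explicit field order. The field names are placeholders, the match-type values must correspond to types defined for your match engine, and the placement of match-order as a child of match-column is an assumption based on the element descriptions in this topic.

```xml
<MatchingConfig module-name="Matching" parser-class=
 "com.stc.eindex.configurator.impl.matching.MatchingConfiguration">
   <match-system-object>
      <!-- Parent object: fields from Person and its child objects can be used. -->
      <object-name>Person</object-name>
      <match-columns>
         <match-column>
            <column-name>Enterprise.SystemSBR.Person.StdLastName</column-name>
            <match-type>LastName</match-type>
            <!-- Optional; assumed here to be a child of match-column. -->
            <match-order>1</match-order>
         </match-column>
         <match-column>
            <column-name>Enterprise.SystemSBR.Person.StdFirstName</column-name>
            <match-type>FirstName</match-type>
            <match-order>2</match-order>
         </match-column>
      </match-columns>
   </match-system-object>
</MatchingConfig>
```

If match-order is omitted, matching is performed in the order in which the match-column elements are listed.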

Match and Standardization Engines

The match and standardization engines control the processes of standardizing data and generating matching probability weights between records. Sun Master Index provides the ability to use the standardization and match engines that best suit your indexing requirements. You can configure the master index application to use the Sun Match Engine, or you can configure the index to use a customized engine of your choice.

These engines perform two functions:

  • Standardizing data

  • Generating matching probability weights between records

The engines are called during match processing, when the master index application retrieves the best matches during a weighted search from the EDM or when the master index application checks for duplicate records during an insert or update from the EDM or an external system.

Block Picker and Pass Controller

The block picker and pass controller define how the blocking query is executed during the match process. By default, the matching process is executed in multiple stages. Each configured block that defines query criteria is executed and evaluated separately (each query block execution and evaluation is referred to as a match pass). After a block is evaluated, the pass controller determines whether the results found are sufficient or whether matching should continue with another match pass.

The block picker chooses the block definition to use for each match pass. Block definitions define the criteria for each query that checks the database for a subset of the records to be used for matching. The block picker has access to the match results from previous match passes, as well as lists of applicable block definitions that have been executed and of those that have not been executed.
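
As a minimal sketch of how custom components might be plugged in, the fragment below replaces the block picker and pass controller class names in the MEFAConfig element. The two com.example class names are hypothetical and stand in for classes you would have to implement yourself; the element names and MEFAConfig attributes are taken from the sample Match Field file.

```xml
<MEFAConfig module-name="MEFA" parser-class=
 "com.stc.eindex.configurator.impl.MEFAConfiguration">
   <block-picker>
      <!-- Hypothetical custom class that chooses the block for each match pass. -->
      <class-name>com.example.matching.CustomBlockPicker</class-name>
   </block-picker>
   <pass-controller>
      <!-- Hypothetical custom class that decides whether another pass is needed. -->
      <class-name>com.example.matching.CustomPassController</class-name>
   </pass-controller>
   <!-- The standardizer-api, standardizer-config, matcher-api, and
        matcher-config elements would follow here, unchanged. -->
</MEFAConfig>
```

Change these class names only if you have created a corresponding custom component; otherwise keep the defaults.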

Phonetic Encoders

Sun Master Index provides extensible phonetic encoding capabilities, which are typically used to retrieve records with similar field values from the database for matching. By default, several phonetic encoders are defined to be used in the master index application. Typically, Soundex is used to encode first names (or SoundexFR for first names in the France national domain) and NYSIIS to encode last names. When using the Sun Match Engine, you can specify different types of phonetic encoders, such as Metaphone, Double Metaphone, and Refined Soundex. When you specify the fields in the standardization configuration to be phonetically encoded, you can select one of the encoders defined in the phonetic encoders section.
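
The relationship between an encoder definition and the field that uses it can be sketched as follows. This is an illustration only: the Metaphone class name is the default listed for the Sun Match Engine in this topic, while Person.MiddleName and Person.MiddleName_Phon are hypothetical fields that would need to exist in your object structure.

```xml
<!-- Inside PhoneticEncodersConfig: register the encoder under a name. -->
<encoder>
   <encoding-type>Metaphone</encoding-type>
   <encoder-implementation-class>
    com.stc.eindex.phonetic.impl.Metaphone
   </encoder-implementation-class>
</encoder>

<!-- Inside phoneticize-fields: select that encoder by its encoding-type. -->
<phoneticize-field>
   <!-- Hypothetical source and target fields. -->
   <unphoneticized-source-field-name>Person.MiddleName
   </unphoneticized-source-field-name>
   <phoneticized-target-field-name>Person.MiddleName_Phon
   </phoneticized-target-field-name>
   <encoding-type>Metaphone</encoding-type>
</phoneticize-field>
```

The encoding-type value in the phoneticize-field element must match the encoding-type of one of the defined encoders.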

Sample Standardization and Matching Sequence (Repository)

The following steps illustrate one possible processing sequence that occurs when data is received from an external system and processed by the master index application.

  1. A record is received from an external system.

  2. The local ID does not yet exist in the master index application; initiate the standardization and matching process.

  3. Standardize the record to a common format.

  4. Standardize free-form text.

  5. Normalize fields that need to be converted to a common format.

  6. Phonetically encode fields that are commonly misspelled or spelled in different ways.

  7. Match the record against entries in the database.

  8. Use the selected blocking query (specified in the Threshold file) to retrieve a block of records that might match the new record.

  9. Build and execute the query according to the input record.

  10. Calculate match scores comparing the incoming record against existing records (this is done by the match engine).

  11. Determine whether to repeat the matching process with another block of records, based on the MEFAConfig element in the Match Field file.

  12. Return match scores for further processing.

  13. Determine whether to add the system record to an existing EUID record or to insert the system record as a new EUID record (based on the parameters defined in the DecisionMaker element of the Threshold file).

The Match Field File (Repository)

The properties for the match and standardization process are defined in the Match Field file in XML format. Some of the information entered into the default configuration file is taken from the wizard, but the file might require additional customization in order to meet your data processing needs.

The following topics provide information about working with the Match Field file:

Modifying the Match Field File

You can modify the Match Field file at any time, but modifying the file is not recommended once you move to production because this file defines how records are processed and data integrity is maintained. You must regenerate the application and redeploy the project after making any changes to this file. Modifying this file once you are in production might cause weighting and standardization to be handled differently, causing unexpected match weight results.

Most of the components configured by this file can be modified using the Configuration Editor. The editor provides a graphical interface that simplifies defining normalization, standardization, matching, and phonetic encoding. It also maintains referential integrity between files in cases where standardization, normalization, or phonetic encoding requires additional fields to be added to the object structure. The possible modifications to this file are restricted by the schema definition, so be sure to validate the file after making any changes.

Match Field File Description

Table 11 lists each element in the Match Field file and provides a description of each element along with any requirements or constraints for each element.

Table 11 Match Field File Structure

Element/Attribute 

Description 

StandardizationConfig

The configuration information for fields to be standardized. It consists of several structures that define standardization rules for a set of fields. The StandardizationConfig attributes define the module name and Java class, and their default values should not be changed.

standardize-system-object

A standardization structure that defines configuration rules, including normalization, parsing, and phonetic encoding. Each standardization structure contains three primary elements: structures-to-normalize, free-form-texts-to-standardize, and phoneticize-fields. These elements are all required, but any of them can be empty.

system-object-name

The name of the object containing the fields defined for standardization. Specifying the parent object allows you to specify any field in any object for standardization. You can also create multiple standardization structures and specify a different object for each structure. 

structures-to-normalize

The configuration information for fields that require normalization (but not parsing or reformatting) before being processed by the standardization engine. 

group

The national domain, source fields, and target fields for one normalization unit. You can define multiple group elements.

group/standardization-type

The type of standardization to perform on the source fields. This is specific to the type of data being processed and the standardization engine being used. For more information about Sun Match Engine types, see Understanding the Sun Match Engine.

group/domain-selector

The Java class used by the Sun Match Engine to determine the nationality of the data being processed. If no selector is specified, the default is US. 

Possible values for the Sun Match Engine include the following:

  • com.stc.eindex.matching.impl.SingleDomainSelectorAU

  • com.stc.eindex.matching.impl.SingleDomainSelectorFR

  • com.stc.eindex.matching.impl.SingleDomainSelectorUK

  • com.stc.eindex.matching.impl.SingleDomainSelectorUS

  • com.stc.eindex.matching.impl.MultiDomainSelector

locale-field-name

The ePath to an identifying field in the object structure that indicates which of the defined locale-codes definitions to use. If no field is specified, the standardization engine defaults to the United States domain. This field must be contained in the object that contains the fields defined for normalization in this structure.

locale-maps

A list of locale codes that define how the standardization engine determines which national domain to use. 

locale-codes

A list of value and locale pairs that indicate the national domain to use based on the value of the identifying field in an incoming message (specified by the locale-field-name element).

value

A value that, when contained in the identifying field, indicates that the standardization engine will use the corresponding locale element to determine which national domain to use to standardize the data. To specify a default domain, enter “Default” in this element.

locale

A domain code indicating which national domain to use to standardize data when the identifying field value in a transaction matches the corresponding value element.

The supported locale codes for the Sun Match Engine include the following:

  • AU - for Australian data

  • FR - for French data

  • UK - for United Kingdom data

  • US - for United States data

unnormalized-source-fields

A list of source fields to be normalized. 

source-mapping

The configuration information for one field in the list of source fields to be normalized. 

unnormalized-source-field-name

The ePath of the source field to normalize in the system object (for example, Person.FirstName). 

standardized-object-field-id

An identification code that identifies the field to normalize to the standardization engine. This ID is specific to the standardization engine in use and must correspond to a field ID defined by that engine. For more information, see Understanding the Sun Match Engine.

normalization-targets

A list of destination fields to hold the normalized data. 

target-mapping

The configuration information for one field in the list of destination fields. 

standardized-object-field-id

An identification code that identifies the normalized field to the standardization engine. This is specific to the standardization engine in use and must correspond to a field ID defined by that engine. For more information, see Understanding the Sun Match Engine.

standardized-target-field-name

The ePath of the target field in which the normalized value is saved in the system object (for example, Person.Alias[*].StdLastName). 

free-form-texts-to-standardize

The configuration information for fields that require parsing or reformatting and, optionally, normalization, before being processed by the standardization engine. 

group

The configuration information for the national domain and the source and target fields for one standardization unit. You can define multiple group elements.

group/standardization-type

The type of standardization to perform on the source fields. This is specific to the standardization engine being used and the type of data being processed. For more information, see Understanding the Sun Match Engine.

group/domain-selector

The Java class used by the Sun Match Engine to determine the nationality of the data being processed. Possible values are listed below. If no selector is specified, the default is US. 

  • com.stc.eindex.matching.impl.SingleDomainSelectorAU

  • com.stc.eindex.matching.impl.SingleDomainSelectorFR

  • com.stc.eindex.matching.impl.SingleDomainSelectorUK

  • com.stc.eindex.matching.impl.SingleDomainSelectorUS

  • com.stc.eindex.matching.impl.MultiDomainSelector

locale-field-name

The ePath to an identifying field in the object structure that indicates which of the defined locale-codes definitions to use. If this element is not defined, the standardization engine defaults to the United States domain. This field must be contained in the object that contains the fields defined for standardization in this structure.

locale-maps

A list of locale codes that define how the standardization engine determines which national domain to use. 

locale-codes

A list of value and locale pairs that indicate the national domain to use based on the value of the identifying field in an incoming message (specified by the locale-field-name element).

value

A value that, when contained in the identifying field, indicates that the standardization engine will use the corresponding locale element to determine which national domain to use to standardize the data. To specify a default domain, enter “Default” in this element.

locale

A domain code indicating which national domain to use to standardize data when the identifying field value in a transaction matches the corresponding value element. Supported locale codes for the Sun Match Engine are listed below.

  • AU - for Australian data

  • FR - for French data

  • UK - for United Kingdom data

  • US - for United States data

unstandardized-source-fields

A list of fields to be standardized. 

unstandardized-source-field-name

A field to be standardized. If you define more than one source field in the same standardization unit, the fields are concatenated during standardization with a pipe (|) between lines (for the Sun Match Engine). 

standardization-targets

A list of fields in which the standardized data from the source fields is stored. 

target-mapping

The configuration information for one destination field in which standardized data from the source field will be stored. One source field will likely have several destination fields. 

standardized-object-field-id

An abbreviation that identifies the destination field to the standardization engine. This must correspond to a field ID defined by the standardization engine being used. For more information, see Understanding the Sun Match Engine.

standardized-target-field-name

The ePath of the destination field in the object where the standardized value will be saved (for example, Person.Address[*].StreetName). 

phoneticize-fields

A list of fields to be phonetically encoded. 

phoneticize-field

The configuration information for each field to be phonetically encoded, including the encoder to use. 

unphoneticized-source-field-name

The ePath of the source field in the system object from which the value to phonetically encode will be retrieved (for example, Person.Address[*].StreetName). 


Note –

This can refer to the original field or to a standardized or normalized field.


phoneticized-target-field-name

The ePath of the field in which the phonetically encoded value will be saved in the system object. 

phoneticized-object-field-id

A field ID to identify the field to the phonetic encoder. This is not currently used with the Sun Match Engine. 

encoding-type

The phonetic encoder to use for this field. This must correspond to the encoding-type configured for the desired encoder in the PhoneticEncodersConfig element.

MatchingConfig

The configuration information for the match string (that is, the fields that are included in the data string sent to the match engine and against which weighting is performed). The attributes of the MatchingConfig element define the module name and Java class, and their default values should not be changed.

match-system-object

The configuration and field definitions for the match string. 

object-name

The name of the object containing the fields in the match string. If you specify the parent object, you can specify fields from the parent and any child object in the match string. 

match-columns

A list of fields in the match string. This element contains multiple match-column elements.

match-column

The configuration information for one field in the match string. You will use multiple match-column elements.

column-name

The fully qualified field name that defines the location of each field on which to match (for example, Enterprise.SystemSBR.Person.Address.City). 

match-type

The type of matching performed on the specified field. This is an ID that is specific to the match engine and identifies the field to the match engine. This value must correspond to a match type defined for the match engine. 

match-order

An integer specifying the order in which the field appears in the match string. This element is optional. If no order is specified, matching is performed in the order in which the fields are listed. 

MEFAConfig

The configuration information for the components of the matching service. The MEFAConfig attributes (module-name and parser-class) define the module name and Java class, and their default values should not be changed. You should only change the names of the component classes in this section if you created a corresponding custom component.

block-picker

The configuration information for the Java class that chooses which block of criteria defined for the blocking query to use for each match pass. 

class-name

The name of the block picker Java class.  

pass-controller

The configuration information for the Java class that determines whether the blocking query should continue performing match passes after each match pass is complete. 

class-name

The name of the pass controller Java class.  

standardizer-api

The configuration information for the standardization engine to use. 

class-name

The name of the standardizer API Java class. 

standardizer-config

The configuration information for the Java class that provides configuration information to the standardization engine. 

class-name

The name of the standardizer configuration Java class. 

matcher-api

The configuration information for the match engine to use. 

class-name

The name of the match engine API Java class. 

matcher-config

The configuration information for the Java class that provides configuration information to the match engine. 

class-name

The name of the match engine configuration Java class. 

PhoneticEncodersConfig

The configuration information for the phonetic encoders used by the master index application. The attributes (module-name and parser-class) define the module name and Java class. The default values should not be changed.

encoder

A list of phonetic encoders used by the standardization engine. 

encoding-type

The name of the phonetic encoder, such as NYSIIS, Soundex, or Metaphone. 

encoder-implementation-class

The fully qualified name of the Java class that determines the behavior of the phonetic encoder. The following default classes are defined for the Sun Match Engine. 

  • com.stc.eindex.phonetic.impl.Nysiis

  • com.stc.eindex.phonetic.impl.Soundex

  • com.stc.eindex.phonetic.impl.Metaphone

  • com.stc.eindex.phonetic.impl.DoubleMetaphone

  • com.stc.eindex.phonetic.impl.RefinedSoundex

  • com.stc.eindex.phonetic.impl.SoundexFR
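
To make the domain selection elements in the table more concrete, the following fragment sketches one group that chooses a national domain from a country field. It is only an illustration: Person.Country and the values AUS, FRA, and GBR are hypothetical, and the element and class names (locale-field-name, locale-codes, MultiDomainSelector) follow the spelling used in the sample Match Field file in this topic.

```xml
<group standardization-type="Address" domain-selector=
 "com.stc.eindex.matching.impl.MultiDomainSelector">
   <!-- Identifying field; must be in the object that holds the source fields. -->
   <locale-field-name>Person.Country</locale-field-name>
   <locale-maps>
      <!-- Hypothetical field values mapped to supported locale codes. -->
      <locale-codes>
         <value>AUS</value>
         <locale>AU</locale>
      </locale-codes>
      <locale-codes>
         <value>FRA</value>
         <locale>FR</locale>
      </locale-codes>
      <locale-codes>
         <value>GBR</value>
         <locale>UK</locale>
      </locale-codes>
      <!-- Fallback used when no other value matches. -->
      <locale-codes>
         <value>Default</value>
         <locale>US</locale>
      </locale-codes>
   </locale-maps>
   <!-- unstandardized-source-fields and standardization-targets omitted. -->
</group>
```

With a mapping like this, a record whose country field contains FRA would be standardized using the French domain, and any unmapped value would fall back to the United States domain.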

Match Field File Example

Below is a short sample of the Match Field file based on a master index application processing person data. This sample covers the basic elements of the Match Field file, but a production environment would contain several more fields to standardize as well as several additional match string fields.


<StandardizationConfig module-name="Standardization" parser-class=
"com.stc.eindex.configurator.impl.standardization.StandardizationConfiguration">
   <standardize-system-object>
      <system-object-name>Person</system-object-name>
      <structures-to-normalize>
         <group standardization-type="PersonName" domain-selector=
          "com.stc.eindex.matching.impl.SingleDomainSelectorUS">
            <unnormalized-source-fields>
               <source-mapping>
                  <unnormalized-source-field-name>
                   Person.Alias[*].FirstName
                  </unnormalized-source-field-name>
                  <standardized-object-field-id>FirstName
                  </standardized-object-field-id>
               </source-mapping>
               <source-mapping>
                  <unnormalized-source-field-name>
                   Person.Alias[*].LastName
                  </unnormalized-source-field-name>
                  <standardized-object-field-id>LastName
                  </standardized-object-field-id>
               </source-mapping>
            </unnormalized-source-fields>
            <normalization-targets>
               <target-mapping>
                  <standardized-object-field-id>FirstName
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                     Person.Alias[*].StdFirstName
                  </standardized-target-field-name>
               </target-mapping>
               <target-mapping>
                  <standardized-object-field-id>LastName
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                     Person.Alias[*].StdLastName
                  </standardized-target-field-name>
               </target-mapping>
            </normalization-targets>
         </group>
         <group standardization-type="PersonName" domain-selector=
           "com.stc.eindex.matching.impl.SingleDomainSelectorUS">
            <unnormalized-source-fields>
               <source-mapping>
                  <unnormalized-source-field-name>Person.FirstName
                  </unnormalized-source-field-name>
                  <standardized-object-field-id>FirstName
                  </standardized-object-field-id>
               </source-mapping>
               <source-mapping>
                  <unnormalized-source-field-name>Person.LastName
                  </unnormalized-source-field-name>
                  <standardized-object-field-id>LastName
                  </standardized-object-field-id>
               </source-mapping>
            </unnormalized-source-fields>
            <normalization-targets>
               <target-mapping>
                  <standardized-object-field-id>FirstName
                  </standardized-object-field-id>
                  <standardized-target-field-name>Person.StdFirstName
                  </standardized-target-field-name>
               </target-mapping>
               <target-mapping>
                  <standardized-object-field-id>LastName
                  </standardized-object-field-id>
                  <standardized-target-field-name>Person.StdLastName
                  </standardized-target-field-name>
               </target-mapping>
            </normalization-targets>
         </group>
      </structures-to-normalize>
      <free-form-texts-to-standardize>
         <group standardization-type="Address" domain-selector=
          "com.stc.eindex.matching.impl.MultiDomainSelector">
            <locale-field-name>Person.Country</locale-field-name>
            <locale-maps>
               <locale-codes>
                  <value>Default</value>
                  <locale>US</locale>
               </locale-codes>
            </locale-maps>
            <unstandardized-source-fields>
               <unstandardized-source-field-name>
                Person.Address[*].AddressLine1
               </unstandardized-source-field-name>
               <unstandardized-source-field-name>
                Person.Address[*].AddressLine2
               </unstandardized-source-field-name>
            </unstandardized-source-fields>
            <standardization-targets>
               <target-mapping>
                  <standardized-object-field-id>HouseNumber
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                   Person.Address[*].HouseNumber
                  </standardized-target-field-name>
               </target-mapping>
               <target-mapping>
                  <standardized-object-field-id>MatchStreetName
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                   Person.Address[*].StreetName
                  </standardized-target-field-name>
               </target-mapping>
               <target-mapping>
                  <standardized-object-field-id>
                   StreetNamePrefDirection
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                   Person.Address[*].StreetDir
                  </standardized-target-field-name>
               </target-mapping>
               <target-mapping>
                  <standardized-object-field-id>StreetNameSufType
                  </standardized-object-field-id>
                  <standardized-target-field-name>
                   Person.Address[*].StreetType
                  </standardized-target-field-name>
               </target-mapping>
            </standardization-targets>
         </group>
      </free-form-texts-to-standardize>
      <phoneticize-fields>
         <phoneticize-field>
            <unphoneticized-source-field-name>Person.StdFirstName
            </unphoneticized-source-field-name>
            <phoneticized-target-field-name>Person.FirstName_Phon
            </phoneticized-target-field-name>
            <encoding-type>Soundex</encoding-type>
         </phoneticize-field>
         <phoneticize-field>
            <unphoneticized-source-field-name>Person.StdLastName
            </unphoneticized-source-field-name>
            <phoneticized-target-field-name>Person.LastName_Phon
            </phoneticized-target-field-name>
            <encoding-type>NYSIIS</encoding-type>
         </phoneticize-field>
         <phoneticize-field>
            <unphoneticized-source-field-name>
             Person.Address[*].StreetName
            </unphoneticized-source-field-name>
            <phoneticized-target-field-name>
             Person.Address[*].StreetNamePhoneticCode
            </phoneticized-target-field-name>
            <encoding-type>NYSIIS</encoding-type>
         </phoneticize-field>
      </phoneticize-fields>
   </standardize-system-object>
</StandardizationConfig>
<MatchingConfig module-name="Matching" parser-class=
 "com.stc.eindex.configurator.impl.matching.MatchingConfiguration">
   <match-system-object>
      <object-name>Person</object-name>
      <match-columns>
         <match-column>
           <column-name>Enterprise.SystemSBR.Person.StdFirstName
           </column-name>
           <match-type>FirstName</match-type>
         </match-column>
         <match-column>
           <column-name>Enterprise.SystemSBR.Person.StdLastName
           </column-name>
           <match-type>LastName</match-type>
         </match-column>
         <match-column>
           <column-name>Enterprise.SystemSBR.Person.DOB</column-name>
           <match-type>DOB</match-type>
         </match-column>
      </match-columns>
   </match-system-object>
</MatchingConfig>
<MEFAConfig module-name="MEFA" parser-class=
 "com.stc.eindex.configurator.impl.MEFAConfiguration">
   <block-picker>
      <class-name>com.stc.eindex.matching.impl.PickAllBlocksAtOnce
      </class-name>
   </block-picker>
   <pass-controller>
      <class-name>com.stc.eindex.matching.impl.PassAllBlocks
      </class-name>
   </pass-controller>
   <standardizer-api>
      <class-name>
       com.stc.eindex.matching.adapter.SbmeStandardizerAdapter
      </class-name>
   </standardizer-api>
   <standardizer-config>
      <class-name>
       com.stc.eindex.matching.adapter.SbmeStandardizerAdapterConfig
      </class-name>
   </standardizer-config>
   <matcher-api>
      <class-name>com.stc.eindex.matching.adapter.SbmeMatcherAdapter
      </class-name>
   </matcher-api>
   <matcher-config>
      <class-name>
       com.stc.eindex.matching.adapter.SbmeMatcherAdapterConfig
      </class-name>
   </matcher-config>
</MEFAConfig>
<PhoneticEncodersConfig module-name="PhoneticEncoders" parser-class=
 "com.stc.eindex.configurator.impl.PhoneticEncodersConfig">
   <encoder>
      <encoding-type>NYSIIS</encoding-type>
      <encoder-implementation-class>
       com.stc.eindex.phonetic.impl.Nysiis
      </encoder-implementation-class>
   </encoder>
   <encoder>
      <encoding-type>Soundex</encoding-type>
      <encoder-implementation-class>
       com.stc.eindex.phonetic.impl.Soundex
      </encoder-implementation-class>
   </encoder>
</PhoneticEncodersConfig>