Master Index Standardization Configuration (Repository) (Understanding the Sun Match Engine)

Understanding the Sun Match Engine

Master Index Standardization Configuration (Repository)

The StandardizationConfig section of the Match Field file determines which fields are normalized, parsed, or phonetically encoded and defines the nationality of the data being processed. The standardization section includes the following structures.

The StandardizationConfig section defines fields that will be normalized, fields that will be parsed and normalized, and fields that will be phonetically encoded. The standardization types you specify in this section correspond to the match configuration file; the field IDs you can specify are listed in Table 3.

Normalization Structures

The normalization structure defines fields that are already parsed, but need to be normalized. It also tells the Sun Match Engine where to place the normalized data in the object structure. Matching on any of these fields is determined by the match string and the logic is defined in the match configuration file.

Of the three data types processed by the Sun Match Engine, only the person name data type is expected to provide information in fields that are already parsed; that is, the first, last, and middle names appear in separate fields, as do the suffix, title, and so on. The person standardization files define logic for normalizing person name fields. By default, only the names you specify for matching in the wizard are defined for normalization. You can define normalization for additional name fields, such as maiden name, spouse’s name, and so on. For each normalization structure, you must specify the national domains for the data you are processing.

Defining New Fields for Normalization

The fields you define for normalization in the Match Field file can include any name fields. If you define normalization for fields that are not currently defined for normalization in the Match Field file, make the following additional changes.

In the Match Field file, define the normalization structure, using the appropriate standardization type (PersonName), domain selector, and field IDs (FirstName, MiddeName, or LastName).
Add the new fields that will store the normalized field value to the appropriate objects in the Object Definition file.
If any of the normalized fields are to be used for blocking, modify the Candidate Select file by adding the new fields to the blocking query.
Regenerate the master index application in NetBeans to include the new fields in the database creation script, the outbound Object Type Definition (OTD), and the method OTD.
To specify that the new normalized fields be used for matching, do the following:
1. Determine the match type or the match comparison function you want to use to match the normalized data, and modify the match configuration file (matchConfigFile.cfg) if needed.
2. Add the new normalized field to the match-columns element of the MatchingConfig section of the Match Field file, making sure to use the appropriate match type from the match configuration file.

Standardization Structures (Parsing and Normalization)

The fields that must be parsed, and possibly normalized, are defined in a standardization structure in the StandardizationConfig section of the Match Field file. The standardization structure tells the Sun Match Engine where to place the standardized information extracted from the parsed fields. The target fields you specify for standardization facilitate searching by the parsed values. Matching on any of these fields is determined by the match string and the logic is defined in the match configuration file.

The Sun Match Engine expects business names and street address information in free-form text fields that must be parsed and normalized prior to matching. The logic for parsing and normalizing street address information is contained in the address standardization files; the logic for parsing and normalizing business names is contained in the business standardization files. You can customize the standardization of these data types by modifying the appropriate patterns file. For each standardization structure, you must specify the national domains for the data being processed.

Defining New Fields for Standardization

The fields you define for standardization in the Match Field file can include any street address or business name field. Perform the following steps if you need to define one of these field types for standardization.

If necessary, modify the patterns file for the type of data you are standardizing.

You can define new input and output patterns or modify existing ones.
Define the standardization structure, using the appropriate standardization type (BusinessName or Address), domain selector, and field IDs (described in Table 3).
Add the new fields that will store the parsed or normalized data to the appropriate objects in the Object Definition file.
If any of the parsed or normalized fields are to be used for blocking, modify the Candidate Select file by adding the new fields to the blocking query.
Regenerate the master index application in NetBeans to include the new fields in the database creation script, the outbound Object Type Definition (OTD), and the method OTD.
To specify that the new standardized fields be used for matching, do the following:
1. Determine the match type or the match comparison function you want to use to match the parsed data, and modify the match configuration file (matchConfigFile.cfg) if needed.
2. Add the new standardized field to the match-columns element of the MatchingConfig section of the Match Field file, making sure to use the appropriate match type from the match configuration file.

Phonetic Encoding Structures

The fields to be phonetically encoded are defined in a phonetic encoding structure in the StandardizationConfig section ofthe Match Field file. The phonetic encoding structure tells the Sun Match Engine where to place the phonetic data created from the fields that are encoded. You can define any field in the object structure for phonetic encoding.

Defining New Fields for Phonetic Encoding

The fields you define for phonetic encoding in the Match Field file can include any field.

Determine the type of phonetic encoder to use to convert the field.

You can use any of the encoders described in Table 7.
Define the phonetic encoding structure, using the appropriate encoders.
Add the new fields that will store the phonetic values to the appropriate objects in the Object Definition file.
If any the phonetic fields are to be used for blocking, modify the Candidate Select file by adding the new fields to the blocking query.
Regenerate the master index application in NetBeans to include the new fields in the database creation script, the outbound OTD, and the method OTD.