Understanding the Sun Match Engine

Standardization Structures (Parsing and Normalization)

The fields that must be parsed, and possibly normalized, are defined in a standardization structure in the StandardizationConfig section of the Match Field file. The standardization structure tells the Sun Match Engine where to place the standardized information extracted from the parsed fields. The target fields you specify for standardization make it possible to search by the parsed values. Whether any of these fields is used for matching is determined by the match string; the matching logic itself is defined in the match configuration file.
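As a rough illustration of what a standardization structure expresses, the following Python sketch models one as plain data and shows parsed values being placed in their target fields. It is not the Match Field file syntax: the class name, the function name, the field IDs, the target field paths, and the "US" domain value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StandardizationStructure:
    """Hypothetical model of one standardization structure."""
    standardization_type: str   # "Address" or "BusinessName"
    domain_selector: str        # assumed: identifies the national domain
    source_fields: List[str]    # free-form fields to be parsed
    # Maps a field ID produced by parsing to the target field that stores it.
    targets: Dict[str, str] = field(default_factory=dict)


def place_standardized_values(structure, parsed, record):
    """Return a copy of record with parsed values stored in the target fields."""
    updated = dict(record)
    for field_id, target_field in structure.targets.items():
        if field_id in parsed:
            updated[target_field] = parsed[field_id]
    return updated


# Hypothetical structure for a free-form street address field.
address_structure = StandardizationStructure(
    standardization_type="Address",
    domain_selector="US",
    source_fields=["AddressLine1"],
    targets={
        "HouseNumber": "Address.HouseNumber",
        "StreetName": "Address.StreetName",
    },
)

record = {"AddressLine1": "123 Main St"}
parsed = {"HouseNumber": "123", "StreetName": "Main"}  # assumed parser output
result = place_standardized_values(address_structure, parsed, record)
```

In this sketch the original free-form field is left untouched and the parsed values are stored alongside it, which is what makes the target fields available for searching.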

The Sun Match Engine expects business names and street address information in free-form text fields that must be parsed and normalized prior to matching. The logic for parsing and normalizing street address information is contained in the address standardization files; the logic for parsing and normalizing business names is contained in the business standardization files. You can customize the standardization of these data types by modifying the appropriate patterns file. For each standardization structure, you must specify the national domains for the data being processed.
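To make the distinction between parsing and normalizing concrete, here is a minimal Python sketch that splits a free-form street address into components and then maps the street type to a single standard form. The regular expression and the lookup table are simplified assumptions and stand in for, rather than reproduce, the patterns and logic in the address standardization files.

```python
import re

# Assumed normalization table: street-type variants mapped to one standard form.
STREET_TYPES = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "road": "Rd", "rd": "Rd",
}

# Assumed input pattern: house number, street name, then street type.
ADDRESS_PATTERN = re.compile(r"^\s*(\d+)\s+(.+?)\s+([A-Za-z]+)\.?\s*$")


def standardize_address(text):
    """Parse a free-form address, then normalize its street type.

    Returns a dict of parsed fields, or None if the text does not fit
    the pattern or the street type is not in the table.
    """
    match = ADDRESS_PATTERN.match(text)
    if match is None:
        return None
    house_number, street_name, street_type = match.groups()
    normalized_type = STREET_TYPES.get(street_type.lower())
    if normalized_type is None:
        return None
    return {
        "HouseNumber": house_number,
        "StreetName": street_name.title(),
        "StreetType": normalized_type,
    }
```

With this sketch, "123 main street" and "123 Main St." both yield the same parsed fields, which is the property that makes standardized values suitable for matching.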

Defining New Fields for Standardization

The fields you define for standardization in the Match Field file can include any street address or business name field. Perform the following steps if you need to define one of these field types for standardization.

  1. If necessary, modify the patterns file for the type of data you are standardizing.

    You can define new input and output patterns or modify existing ones.

  2. Define the standardization structure, using the appropriate standardization type (BusinessName or Address), domain selector, and field IDs (described in Table 3).

  3. Add the new fields that will store the parsed or normalized data to the appropriate objects in the Object Definition file.

  4. If any of the parsed or normalized fields are to be used for blocking, modify the Candidate Select file by adding the new fields to the blocking query.

  5. Regenerate the master index application in NetBeans to include the new fields in the database creation script, the outbound Object Type Definition (OTD), and the method OTD.

  6. To specify that the new standardized fields be used for matching, do the following:

    1. Determine the match type or the match comparison function you want to use to match the parsed data, and modify the match configuration file (matchConfigFile.cfg) if needed.

    2. Add the new standardized field to the match-columns element of the MatchingConfig section of the Match Field file, making sure to use the appropriate match type from the match configuration file.
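The roles that steps 4 and 6 give to the standardized fields can be sketched in Python as follows: one parsed field narrows the candidates (blocking), and another is compared within each block (matching). This is only a conceptual sketch. The function names, the record layout, and the threshold are hypothetical, and `difflib.SequenceMatcher` stands in for a match comparison function; it is not one of the Sun Match Engine's comparators.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_by(records, field_name):
    """Group records that share the same value in a standardized field."""
    blocks = defaultdict(list)
    for record in records:
        key = record.get(field_name)
        if key:
            blocks[key].append(record)
    return dict(blocks)


def similarity(a, b):
    """Stand-in comparison function returning a score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_pairs(records, block_field, match_field, threshold=0.8):
    """Compare records only within a block and keep pairs above the threshold."""
    pairs = []
    for block in block_by(records, block_field).values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = similarity(block[i][match_field], block[j][match_field])
                if score >= threshold:
                    pairs.append((block[i]["id"], block[j]["id"]))
    return pairs


records = [
    {"id": 1, "StreetName": "Main", "HouseNumber": "123"},
    {"id": 2, "StreetName": "Main", "HouseNumber": "123"},
    {"id": 3, "StreetName": "Main", "HouseNumber": "987"},
    {"id": 4, "StreetName": "Oak", "HouseNumber": "123"},
]
pairs = candidate_pairs(records, "StreetName", "HouseNumber")
```

Here record 4 is never compared with the others because it falls in a different block, even though its house number is identical to that of records 1 and 2.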