Loading the Initial Data Set for a Master Index (Java CAPS Documentation)

Performing a Match Analysis

Before you perform the actual data matching, you can run match analyses on a subset of the data to be loaded to determine whether the components of the match process are configured correctly. The analysis can show whether the data blocks defined for the blocking query return too many or too few records and whether certain fields in the match string skew the composite match weight. You can also use the analysis to determine whether the duplicate and match thresholds are set correctly.

This is an iterative process, and you might need to run through the analysis several times before you are satisfied that the match and query configuration is optimized for your data set.

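Perform the following steps to analyze the data for matching:

  • Running the Bulk Matcher in Analysis Mode
  • Reviewing the Match Analysis Results
  • Reconfiguring the Matching Logic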

Running the Bulk Matcher in Analysis Mode

When you run the Bulk Matcher in analysis mode, use a representative sample of the actual data you are loading into the master index database. You do not need to run the entire set of input records through the analysis.


Caution - If you are rerunning the Bulk Matcher in analysis mode, make sure to truncate the cluster synchronizer database tables first. Otherwise, unique constraint errors occur and the run fails. To truncate the tables, run cluster-truncate.sql against the cluster synchronizer database.
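
As a rough illustration of the truncation step, the sketch below wraps the call in a small shell function. It assumes that the cluster synchronizer database is Oracle and that the SQL*Plus client (sqlplus) is on the path. The connect string, the script location, and the SQL_CLIENT override are placeholders introduced here for illustration; they are not names from the IBML Tool.

```shell
#!/bin/sh
# Sketch only: run cluster-truncate.sql before rerunning the analysis.
# Assumptions (not from the IBML Tool documentation): an Oracle cluster
# synchronizer database, SQL*Plus on the PATH, and a hypothetical
# SQL_CLIENT variable for substituting another command-line client.

truncate_cluster_tables() {
    conn=$1                             # placeholder, e.g. user/password@service
    script=${2:-cluster-truncate.sql}   # path to the truncate script

    if [ ! -f "$script" ]; then
        echo "Cannot find $script" >&2
        return 1
    fi

    # "echo exit" makes SQL*Plus quit after the script instead of waiting
    # at its prompt; -S suppresses the banner.
    echo exit | "${SQL_CLIENT:-sqlplus}" -S "$conn" "@$script"
}
```

A call might look like truncate_cluster_tables "$CLUSTER_DB_CONNECT" /path/to/cluster-truncate.sql, where both arguments are site-specific. Passing a password on the command line exposes it to other users on the host, so a wallet or prompt-based login may be preferable.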


To Run the Bulk Matcher in Analysis Mode

  1. Complete the steps under Configuring the Initial Bulk Match and Load Tool.
  2. For each IBML Tool, open loader-config.xml (located in the conf subdirectory of the IBML Tool home directory).
  3. Set the matchAnalyzerMode property to true, and verify the remaining property settings.
  4. Save and close the file.
  5. To configure and run the match analysis, do one of the following.
    • If the master loader is running on Windows:
      1. Navigate to the master IBML Tool home directory and open run-loader.bat for editing.
      2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc14.jar.
      3. Save and close the file.
      4. Double-click run-loader.bat or type run-loader from a command line.
    • If the master loader is running on UNIX:
      1. Navigate to the master IBML Tool home directory and open run-loader.sh for editing.
      2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc14.jar.
      3. Save and close the file.
      4. Type sh run-loader.sh at the command line.
  6. Examine the log files to be sure no errors occurred during the analysis.
  7. Continue to Reviewing the Match Analysis Results.
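
The UNIX variant of the procedure above can be sketched as a few shell helpers. This is an illustration under stated assumptions, not part of the IBML Tool: it assumes that loader-config.xml keeps the matchAnalyzerMode name and its value on the same line, that log lines worth reviewing contain one of the keywords shown, and that the function names and directory arguments are placeholders. Because JDBC_JAR_PATH is set in the first line of run-loader.sh, edit the script as described in step 5 rather than relying on an exported value.

```shell
#!/bin/sh
# Sketch only: helpers for the UNIX path of the analysis-mode procedure.
# All function names and the log keywords are assumptions.

# Succeeds if conf/loader-config.xml under the given IBML Tool home
# mentions matchAnalyzerMode together with "true" on the same line.
analysis_mode_enabled() {
    grep 'matchAnalyzerMode' "$1/conf/loader-config.xml" | grep -qi 'true'
}

# Prints suspicious log lines; succeeds only if none are found.
logs_are_clean() {
    hits=$(grep -r -i -n -E 'error|exception|severe' "$1" 2>/dev/null)
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 1
    fi
    return 0
}

# Runs the master loader after checking its configuration.
run_analysis() {
    ibml_home=$1
    analysis_mode_enabled "$ibml_home" || {
        echo "matchAnalyzerMode is not true in $ibml_home" >&2
        return 1
    }
    (cd "$ibml_home" && sh run-loader.sh)
}
```

For example, run_analysis /path/to/master-ibml followed by logs_are_clean /path/to/master-ibml/logs, where the log directory is a guess that depends on your logging configuration. The keyword scan is only a first pass and does not replace reading the logs.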

Reviewing the Match Analysis Results

The output of the Bulk Matcher when run in analysis mode is a PDF file with a list of records that were automatically matched to each other (assumed matches) from the data set you analyzed. The report displays the matching weight given to each field, so you can analyze the value and accuracy of each field for matching as well as the agreement and disagreement weights (or u-probabilities and m-probabilities) defined in the matching configuration file of the master index application.

The following figure shows two entries from the match analysis report. The name of each match field is listed in the left column, the values for those fields in the two assumed match records are listed in the next two columns, and the composite match weight and the weight for each field are listed in the final column.

Figure 3 Match Analysis Report Excerpt

image:Figure shows an excerpt from a sample match analysis report.
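
As background for reading the weights in the report, the following is the standard Fellegi-Sunter formulation that probabilistic matchers commonly use. Whether the master index match engine computes its weights in exactly this way is an assumption here, so treat it as a reading aid rather than a specification. For field i, m_i is the probability that the field agrees when two records truly match, and u_i is the probability that it agrees when they do not.

```latex
w_i^{\text{agree}} = \log_2 \frac{m_i}{u_i}, \qquad
w_i^{\text{disagree}} = \log_2 \frac{1 - m_i}{1 - u_i}, \qquad
W = \sum_{i=1}^{n} w_i
```

Under this formulation, a field with m_i = 0.9 and u_i = 0.1 contributes about +3.17 when the values agree and about -3.17 when they disagree. The composite weight W is the sum of the per-field weights, which is why a single field with an extreme weight can dominate the result.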

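After you perform the steps under Running the Bulk Matcher in Analysis Mode, complete the analysis by using the information in the match analysis report to evaluate the blocking query, the fields in the match string and their weights, and the duplicate and match thresholds.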

After you complete your analysis, you can reconfigure the matching logic as described in Reconfiguring the Matching Logic and then rerun the analysis. If your analysis shows that the matching configuration is correct and does not require any more changes, continue to Performing the Bulk Match. Before you do, make sure to update the master index application to match the final configuration.

Reconfiguring the Matching Logic

If the results of the match analysis show that you need to modify the query, thresholds, or match string, you can make the changes to the IBML Tool configuration file and run the Bulk Matcher again to analyze the new settings. Once you are satisfied with the new settings, you need to update the master index application configuration accordingly.

To Reconfigure the Matching Logic

  1. Complete the match analysis, as described under Reviewing the Match Analysis Results.
  2. In the directory where the IBML Tool is located, open conf/loader-config.xml.
  3. To modify the match and duplicate thresholds for match analysis, enter new values for the duplicateThreshold and matchThreshold elements.
  4. To modify the blocking query for match analysis, modify the query builder section (described in Initial Bulk Match and Load Tool Blocking Query Configuration).
  5. To modify the match string for match analysis, modify the MatchingConfig section (described in Initial Bulk Match and Load Tool Match String Configuration).
  6. Run the match analysis again, as described in Running the Bulk Matcher in Analysis Mode.
  7. After you run the analysis for the final time, continue to Performing the Bulk Match.

    Caution - When you complete the analysis and have made the final modifications to the blocking query, match string, and match thresholds, be sure to modify the master index application so that its processing is identical. The match string is defined in mefa.xml, the thresholds are defined in master.xml, and the blocking query is defined in query.xml. You can copy the configuration from loader-config.xml directly into these files.
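
Before copying the final thresholds into master.xml, it can help to read them back from loader-config.xml and sanity-check them. The sketch below assumes that each threshold is written on a single line in the form <matchThreshold>value</matchThreshold>; that layout and the function names are assumptions, so adjust the pattern if your file differs. It also relies on the usual arrangement in which the duplicate threshold does not exceed the match threshold.

```shell
#!/bin/sh
# Sketch only: read the thresholds from loader-config.xml so they can be
# compared with master.xml. Assumes one-line elements of the form
# <name>value</name>; function names are hypothetical.

# Prints the text content of the first element with the given name.
element_value() {
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" "$2" | head -n 1
}

# Succeeds if both thresholds are present and the duplicate threshold
# does not exceed the match threshold; returns 2 if either is missing.
thresholds_consistent() {
    dup=$(element_value duplicateThreshold "$1")
    match=$(element_value matchThreshold "$1")
    [ -n "$dup" ] && [ -n "$match" ] || return 2
    awk -v d="$dup" -v m="$match" 'BEGIN { exit !(d + 0 <= m + 0) }'
}
```

For example, element_value matchThreshold conf/loader-config.xml prints the value to carry over, and thresholds_consistent conf/loader-config.xml fails if the two values look reversed.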