Loading the Initial Data Set for a Sun Master Index

Performing a Match Analysis

Before you perform the actual data matching, you can perform match analyses on a subset of the data to be loaded to determine whether the various components of the match process are configured correctly. This analysis can show whether the data blocks defined for the blocking query are returning too many or too few records and whether certain fields in the match string are inaccurately skewing the composite match weight. You can also use this analysis to determine whether the duplicate and match thresholds are correct.

This is an iterative process, and you might need to run through the analysis several times before you are satisfied that the match and query configuration is optimized for your data set.

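To analyze the data for matching, perform the following steps, each described in its own section below: run the Bulk Matcher in analysis mode, review the match analysis results, and reconfigure the matching logic if needed.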

Running the Bulk Matcher in Analysis Mode


Note –

This procedure includes steps that were updated for Java CAPS Release 6 Update 1. The variable JDBC_JAR_PATH was previously ORACLE_JDBC_JAR, and wasn't present in all files.


When you run the Bulk Matcher in analysis mode, use a representative sample of the actual data you are loading into the master index database. You do not need to run the entire set of input records through the analysis.


Caution –

If you are rerunning the Bulk Matcher in analysis mode, make sure to truncate the cluster synchronizer database tables first. Otherwise, unique constraint errors occur and the run fails. To truncate the tables, run cluster-truncate.sql against the cluster synchronizer database.


Procedure: To Run the Bulk Matcher in Analysis Mode

  1. Complete the steps under Configuring the Initial Bulk Match and Load Tool.

  2. For each IBML Tool, open loader-config.xml (located in the IBML Tool home directory in the conf subdirectory).

  3. Set the matchAnalyzerMode property to true, and verify the remaining property settings.

  4. Save and close the file.

  5. To configure and run the match analysis, do one of the following.

    • If the master loader is running on Windows:

      1. Navigate to the master IBML Tool home directory and open run-loader.bat for editing.

      2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc14.jar.

      3. Save and close the file.

      4. Double-click run-loader.bat or type run-loader from a command line.

    • If the master loader is running on UNIX:

      1. Navigate to the master IBML Tool home directory and open run-loader.sh for editing.

      2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc14.jar.

      3. Save and close the file.

      4. Type sh run-loader.sh at the command line.

  6. Examine the log files to be sure no errors occurred during the analysis.

  7. Continue to Reviewing the Match Analysis Results.
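The UNIX steps above can be guarded with a few pre-flight checks before the loader is started. The following is a minimal sketch, not part of the IBML Tool: the function names are hypothetical, the check for matchAnalyzerMode assumes the property name and its value appear on the same line of loader-config.xml, and the error markers searched for in the log are assumptions about the log format.

```shell
#!/bin/sh
# Hypothetical helper functions; none of these ship with the IBML Tool.

# Succeeds only if the JDBC driver file named in $1 exists.
check_jdbc_jar() {
    if [ -f "$1" ]; then
        return 0
    fi
    echo "JDBC driver not found: $1" >&2
    return 1
}

# Succeeds if a line of the config file $1 mentions matchAnalyzerMode
# together with the value true.
# ASSUMPTION: the property name and its value share one line.
analysis_mode_enabled() {
    grep 'matchAnalyzerMode' "$1" | grep -q 'true'
}

# Prints any lines of log file $1 that look like errors.
# Returns 0 if none are found, 1 otherwise.
# ASSUMPTION: errors are marked with ERROR, SEVERE, or Exception.
log_is_clean() {
    if grep -E 'ERROR|SEVERE|Exception' "$1"; then
        return 1
    fi
    return 0
}

# Example use from the master IBML Tool home directory:
#   check_jdbc_jar "$JDBC_JAR_PATH" \
#       && analysis_mode_enabled conf/loader-config.xml \
#       && sh run-loader.sh
```

After the run, log_is_clean can be pointed at each log file the loader produced. It prints the suspicious lines, so a silent result with exit status 0 suggests that no errors were recorded under the assumed markers.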

Reviewing the Match Analysis Results

The output of the Bulk Matcher when run in analysis mode is a PDF file with a list of records that were automatically matched to each other (assumed matches) from the data set you analyzed. The report displays the matching weight given to each field, so you can analyze the value and accuracy of each field for matching as well as the agreement and disagreement weights (or u-probabilities and m-probabilities) defined in the matching configuration file of the master index application.
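To make the relationship between probabilities and weights concrete, the sketch below computes agreement and disagreement weights from an m-probability and a u-probability using the standard probabilistic record-linkage formulas: agreement weight = log2(m/u) and disagreement weight = log2((1-m)/(1-u)). It is an assumption that the match engine applies exactly these formulas to every field (a comparator may cap or adjust the result), and the function name and sample probabilities are hypothetical.

```shell
#!/bin/sh
# field_weights M U
# Prints the agreement and disagreement weights for one match field,
# given its m-probability ($1) and u-probability ($2).
# ASSUMPTION: standard log2 likelihood-ratio weights.
field_weights() {
    awk -v m="$1" -v u="$2" 'BEGIN {
        agree    = log(m / u) / log(2)              # log2(m/u)
        disagree = log((1 - m) / (1 - u)) / log(2)  # log2((1-m)/(1-u))
        printf "%.2f %.2f\n", agree, disagree
    }'
}

# Hypothetical field: m = 0.9, u = 0.1
field_weights 0.9 0.1   # prints: 3.17 -3.17
```

A field whose reported weights look too influential in the report would, under this model, have a large ratio between its m-probability and u-probability.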

The following figure shows two entries from the match analysis report. The name of each match field is listed in the left column, the values for those fields in the two assumed match records are listed in the next two columns, and the composite match weight and the weight for each field are listed in the final column.

Figure 3 Match Analysis Report Excerpt

Figure shows an excerpt from a sample match analysis report.

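After you perform the steps under Running the Bulk Matcher in Analysis Mode, complete the analysis by using the information in the match analysis report to check the points described under Performing a Match Analysis: whether the blocking query returns too many or too few records, whether any fields in the match string skew the composite match weight, and whether the duplicate and match thresholds are correct.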

After you complete your analysis, you can reconfigure the matching logic as described in Reconfiguring the Matching Logic and then rerun the analysis. If your analysis shows that the matching configuration is correct and does not require any more changes, continue to Performing the Bulk Match. Once the matching configuration is final, make sure to update the master index application to match it.

Reconfiguring the Matching Logic

If the results of the match analysis show that you need to modify the query, thresholds, or match string, you can make the changes to the IBML Tool configuration file and run the Bulk Matcher again to analyze the new settings. Once you are satisfied with the new settings, you need to update the master index application configuration accordingly.
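When you experiment with threshold values, it can help to see how a composite match weight would be classified under a candidate pair of thresholds. The sketch below is illustrative only: the function name and the sample values are hypothetical, and it assumes that a weight at or above the match threshold is an assumed match and that a weight at or above the duplicate threshold but below the match threshold is a potential duplicate.

```shell
#!/bin/sh
# classify_weight WEIGHT DUPLICATE_THRESHOLD MATCH_THRESHOLD
# Prints how a composite match weight falls relative to the two thresholds.
# ASSUMPTION: boundaries are inclusive (weight >= threshold).
classify_weight() {
    awk -v w="$1" -v d="$2" -v m="$3" 'BEGIN {
        if (w >= m)      print "assumed match"
        else if (w >= d) print "potential duplicate"
        else             print "no match"
    }'
}

# Hypothetical thresholds: duplicateThreshold = 7.25, matchThreshold = 29
classify_weight 31.5 7.25 29   # prints: assumed match
classify_weight 12   7.25 29   # prints: potential duplicate
classify_weight 3    7.25 29   # prints: no match
```

Running a handful of weights from the match analysis report through a check like this shows quickly whether a proposed threshold change would move records between categories.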

Procedure: To Reconfigure the Matching Logic

  1. Complete the match analysis, as described under Reviewing the Match Analysis Results.

  2. In the directory where the IBML Tool is located, open conf/loader-config.xml.

  3. To modify the match and duplicate thresholds for match analysis, enter new values for the duplicateThreshold and matchThreshold elements.

  4. To modify the blocking query for match analysis, modify the query builder section (described in Initial Bulk Match and Load Tool Blocking Query Configuration).

  5. To modify the match string for match analysis, modify the MatchingConfig section (described in Initial Bulk Match and Load Tool Match String Configuration).

  6. Run the match analysis again, as described in Running the Bulk Matcher in Analysis Mode.

  7. After you run the analysis for the final time, continue to Performing the Bulk Match.


    Caution –

    When you complete the analysis and have made the final modifications to the blocking query, match string, and match thresholds, be sure to modify the master index application so the processing is identical. The match string is defined in mefa.xml, the thresholds are defined in master.xml, and the blocking query is defined in query.xml. You can copy the configuration from loader-config.xml directly into these files.