Loading the Initial Data Set for a Master Index (Java CAPS Documentation)
Configuring the Initial Bulk Match and Load Tool
Before you can run the IBML Tool, you need to define certain runtime parameters, such as how to distribute the processing, FTP server properties, database properties, logging properties, and so on. You can also modify the configuration of the query used for matching, the configuration of the match string, and the weighting thresholds used by the Bulk Matcher.
Note - If you are using the command-line Bulk Loader, the properties you set here apply only to the Bulk Matcher.
The following topics provide instructions for configuring the IBML Tool.
The bulk match process is configured by loader-config.xml, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files. The process must be configured on each machine that is running a Bulk Matcher.
Perform the following steps on each machine processing bulk data.
Logging for the IBML Tool is configured in logger.properties, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files.
The configuration file for the IBML Tool, loader-config.xml, defines several aspects of the match process, including the blocking query, match string, EUID generator, FTP server, cluster synchronizer database, and SQL*Loader properties.
The configuration file is divided into the sections listed below. In addition to these sections, you can set the match and duplicate thresholds at the beginning of the file. Use these settings to help analyze the matching logic.
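Initial Bulk Match and Load Tool Field Validation Configuration
Initial Bulk Match and Load Tool Blocking Query Configuration
Initial Bulk Match and Load Tool Match String Configuration
Initial Bulk Match and Load Tool Processing Configuration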
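The match and duplicate thresholds set at the beginning of the file can be pictured as two cut-offs on the match weight of a record pair. The following is a minimal sketch, assuming that a weight at or above the match threshold is treated as a match and a weight between the two thresholds is flagged as a potential duplicate. The class name, method name, inclusive boundaries, and sample values are illustrative assumptions and are not part of the IBML Tool.

```java
/** Illustrative only: not part of the IBML Tool API. */
class ThresholdSketch {

    enum Outcome { MATCH, POTENTIAL_DUPLICATE, NO_MATCH }

    /**
     * Classifies a match weight against the two thresholds.
     * Assumption: both boundaries are inclusive.
     */
    static Outcome classify(double weight, double duplicateThreshold, double matchThreshold) {
        if (duplicateThreshold > matchThreshold) {
            throw new IllegalArgumentException(
                "duplicate threshold must not exceed match threshold");
        }
        if (weight >= matchThreshold) {
            return Outcome.MATCH;
        }
        if (weight >= duplicateThreshold) {
            return Outcome.POTENTIAL_DUPLICATE;
        }
        return Outcome.NO_MATCH;
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration, not product defaults.
        double duplicateThreshold = 7.25;
        double matchThreshold = 29.0;
        for (double weight : new double[] {31.5, 12.0, 3.0}) {
            System.out.println(weight + " -> "
                + classify(weight, duplicateThreshold, matchThreshold));
        }
    }
}
```

Raising or lowering either value in such a sketch shows how many record pairs move between categories, which is the kind of comparison the analysis phase is meant to support.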
The default field validation for the master index is parsed by the standard XSD and then copied into the IBML Tool configuration file. The default field validator checks the local ID and system fields to verify that the system code is valid, the local ID format is correct, the local ID is the correct length, and neither field is null. You should not need to modify this section unless you defined custom field validations.
When you generate the IBML Tool, the configuration for the blocking query defined for matching in the master index application is parsed by an IBML parser and then copied into the IBML Tool configuration file.
Caution - If you defined a custom parser configuration for the blocking query, the query configuration might not be copied correctly. To ensure the correct configuration if you defined a custom parser, copy the blocking query definition directly from query.xml to loader-config.xml. You also need to create a custom block generator using the com.sun.mdm.index.loader.blocker.BlockIdGenerator interface, and add the name of the custom generator to the field elements in the block ID (for example, <field>Enterprise.SystemSBR.Person.FirstName+CustomBlockIdGenerator</field>).
This section is included in the configuration file so that you can modify the blocking query during the analysis phase and fine-tune the query configuration for the best match results. You can quickly change the query configuration and analyze the results without updating the master index application and regenerating the IBML Tool each time. The query configuration is used only by the master IBML Tool, which uses the blocks defined for the query to divide the bulk data into block buckets. These buckets are then distributed to the other processors, which process the data concurrently.
The blocking query might include fields that are not in your original input data, such as phonetic and normalized fields. If the input to the Bulk Matcher is the file that was generated by the Data Cleanser, the file includes all required fields, including phonetic and standardized fields. If you are using a different source for the input, the IBML Tool can standardize the data for you. For consistent matching results between the initial data set and future transactions, the query configuration used for the match and load processes should be as close as possible to that in query.xml in the master index Project.
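The idea of dividing bulk data into block buckets can be sketched as follows. This is a simplified illustration, not the IBML Tool's actual algorithm: it assumes that a block ID is formed by joining the values of the blocking fields and that records missing a blocking value are left out of that block. The class, the method names, and the field names FirstName_Phon and LastName_Phon are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: groups records that share the same blocking-field values. */
class BlockBucketSketch {

    /** Joins the blocking-field values into one key; returns null if any value is missing. */
    static String blockId(Map<String, String> record, List<String> blockFields) {
        StringBuilder id = new StringBuilder();
        for (String field : blockFields) {
            String value = record.get(field);
            if (value == null || value.isEmpty()) {
                return null; // assumption: incomplete records do not join this block
            }
            if (id.length() > 0) {
                id.append('|');
            }
            id.append(value.toUpperCase(Locale.ROOT));
        }
        return id.toString();
    }

    /** Places each record into the bucket named by its block ID. */
    static Map<String, List<Map<String, String>>> toBuckets(
            List<Map<String, String>> records, List<String> blockFields) {
        Map<String, List<Map<String, String>>> buckets = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            String id = blockId(record, blockFields);
            if (id != null) {
                buckets.computeIfAbsent(id, key -> new ArrayList<>()).add(record);
            }
        }
        return buckets;
    }

    /** Small helper that builds a record from two hypothetical phonetic fields. */
    static Map<String, String> person(String firstPhonetic, String lastPhonetic) {
        Map<String, String> record = new HashMap<>();
        record.put("FirstName_Phon", firstPhonetic);
        record.put("LastName_Phon", lastPhonetic);
        return record;
    }

    public static void main(String[] args) {
        List<String> blockFields = Arrays.asList("FirstName_Phon", "LastName_Phon");
        List<Map<String, String>> records = Arrays.asList(
            person("JN", "SM0"), person("jn", "sm0"), person("MR", "JNS"));
        toBuckets(records, blockFields)
            .forEach((id, bucket) -> System.out.println(id + ": " + bucket.size() + " record(s)"));
    }
}
```

Only records that land in the same bucket are compared with each other, which is why the choice of blocking fields affects both match quality and how evenly the work can be distributed.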
When you generate the IBML Tool, the configuration of the match string is copied from the master index Project to the IBML Tool configuration file. The match string is used by all IBML Tools processing the data, so this section should be identical for all IBML Tools. This section is provided so that you can modify the match string during the analysis phase and fine-tune the match logic for the best match results. As with the query configuration, you can quickly change the match string configuration and analyze the results without modifying the master index application and regenerating the IBML Tool.
Ideally, the match string defined in loader-config.xml is as close as possible to the match string defined in mefa.xml in the master index Project. This ensures consistent matching results between the initial data set and future transactions.
The processing properties described in the following table configure how the IBML Tool processes data. In these properties, you define a name for each IBML Tool, the location of the working directories, polling properties, and so on. Some of these properties apply only to specific phases of the match and load process, and some apply only to the master processor or only to the slave processors.
Table 1 IBML Tool Processing Properties
The properties described in the following table configure the connection information for the FTP server. Define them only if the IBML Tools run on multiple processors, and then only for the slave processors.
Table 2 FTP Server Properties
The cluster synchronizer database is used to coordinate the activities of all IBML Tools processing data. The configuration of this section must be identical for all processors.
Table 3 Cluster Synchronizer Database Properties
The SQL*Loader properties are only used if you use SQL*Loader to load the bulk data into the master index database after it has gone through match processing and EUID assignment. If you use the command-line Bulk Loader, you do not need to modify these properties. SQL*Loader can only be used with an Oracle database.
Table 4 SQL*Loader Property
This section defines the data reader to use, the location of the input file, and any parameters. You can define a custom data reader to read data into the Bulk Matcher if you are not using the data file output by the Data Cleanser or if your data is formatted differently than the default data reader requires. For more information about the default data reader's requirements, see Required Format for Flat Data Files. You define the custom reader in Java, and then specify the class and configure the reader using the elements and attributes in the following table. You only need to configure the data reader for the master IBML Tool.
Table 5 Custom Data Reader Configuration Elements and Attributes
You can use additional properties that are not listed in the properties file to further configure bulk load processing. Add these properties to the loader-config.xml file within the properties element. The syntax is <property name="property_name" value="property_value" />. For example:
<system>
  <properties>
    <property name="report.size" value="1000" />
    ....
  </properties>
</system>
Note - These properties are defined but are not used or supported by the IBML tool and should not be added to the configuration properties file: matchFlushSize, distributionMode, loadMode, goodFile, and matchCacheSize.
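Before running the tool, it can be useful to check which optional properties a configuration file sets. The sketch below is one possible way to do this with the standard Java DOM parser. It assumes only the system, properties, and property elements shown in the example above; the rest of the loader-config.xml layout is not assumed. The class and method names are hypothetical, and the list of unsupported names is taken from the note above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: reads <property name="..." value="..." /> elements. */
class LoaderPropertiesSketch {

    /** Property names that the note above says are defined but not supported. */
    static final List<String> UNSUPPORTED = Arrays.asList(
        "matchFlushSize", "distributionMode", "loadMode", "goodFile", "matchCacheSize");

    /** Returns every property element in the XML text as a name-to-value map. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("property");
            Map<String, String> properties = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                properties.put(element.getAttribute("name"), element.getAttribute("value"));
            }
            return properties;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** Lists the configured property names that appear in the unsupported list. */
    static List<String> unsupported(Map<String, String> properties) {
        List<String> found = new ArrayList<>();
        for (String name : properties.keySet()) {
            if (UNSUPPORTED.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<system><properties>"
            + "<property name=\"report.size\" value=\"1000\" />"
            + "</properties></system>";
        Map<String, String> properties = readProperties(xml);
        System.out.println("Configured: " + properties);
        System.out.println("Unsupported: " + unsupported(properties));
    }
}
```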
Table 6 Optional Properties
The configuration file for the logging properties, logger.properties, defines how much information is logged while the IBML Tools are running. By default, logging is set to display INFO level messages and above both on the console and in a log file.
The following table lists and describes the default properties for the configuration file, but you can add new properties based on the log handlers you use. For more information about log handlers and the properties for each, see the Javadocs for the java.util.logging package.
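To see how java.util.logging properties of this kind take effect, the following sketch builds a small configuration in memory and applies it through the standard LogManager. It is an approximation under stated assumptions, not the contents of the IBML Tool's own file: the handler selection mirrors the console-plus-file default described above, while the class name, the logger name ibml.sketch, and the log file pattern %t/ibml-sketch%u.log are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

/** Illustrative only: applies java.util.logging settings from properties text. */
class LoggingConfigSketch {

    /** Builds properties text for a console handler and, optionally, a file handler. */
    static String config(String level, boolean withFile) {
        StringBuilder text = new StringBuilder();
        text.append("handlers=java.util.logging.ConsoleHandler");
        if (withFile) {
            text.append(", java.util.logging.FileHandler");
        }
        text.append('\n');
        text.append(".level=").append(level).append('\n');
        text.append("java.util.logging.ConsoleHandler.level=").append(level).append('\n');
        text.append("java.util.logging.ConsoleHandler.formatter=")
            .append("java.util.logging.SimpleFormatter\n");
        if (withFile) {
            // Hypothetical file name; %t is the system temporary directory.
            text.append("java.util.logging.FileHandler.pattern=%t/ibml-sketch%u.log\n");
            text.append("java.util.logging.FileHandler.level=").append(level).append('\n');
            text.append("java.util.logging.FileHandler.formatter=")
                .append("java.util.logging.SimpleFormatter\n");
        }
        return text.toString();
    }

    /** Replaces the current logging configuration with the given properties text. */
    static void apply(String propertiesText) {
        try {
            LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(propertiesText.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Writes a small log file to the temporary directory as well as the console.
        apply(config("INFO", true));
        Logger log = Logger.getLogger("ibml.sketch");
        log.info("Logged at INFO");
        log.fine("Not logged while the level is INFO");
    }
}
```

Changing the level argument to a finer value such as FINE is a quick way to see how much additional output a more verbose setting would produce.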
Table 7 IBML Tool Logging Properties