Loading the Initial Data Set for a Master Index - Java CAPS Documentation

Configuring the Initial Bulk Match and Load Tool

Before you can run the IBML Tool, you need to define certain runtime parameters, such as how to distribute the processing, FTP server properties, database properties, logging properties, and so on. You can also modify the configuration of the query used for matching, the configuration of the match string, and the weighting thresholds used by the Bulk Matcher.


Note - If you are using the command-line Bulk Loader, the properties you set here only apply to the Bulk Matcher.


Configuring the Initial Bulk Match and Load Tool Processing

The bulk match process is configured by loader-config.xml, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files. The process must be configured on each machine that is running a Bulk Matcher.

To Configure the IBML Tool

Perform the following steps on each machine processing bulk data.

  1. Complete the steps under Creating the Cluster Synchronizer Database.
  2. Navigate to the location where you extracted the IBML Tool.
  3. Open conf/loader-config.xml.
  4. Configure processing attributes by modifying the properties described in Initial Bulk Match and Load Tool Processing Configuration.
  5. Configure the cluster synchronizer database connection properties by modifying the properties described in Cluster Synchronizer Database Configuration.
  6. Configure the data reader and enter the name and location of the input file (see Data Reader Configuration for more information).
  7. If the IBML Tool is running on multiple processors, modify the properties described in FTP Server Configuration.
  8. If you are using SQL*Loader to load the master image into the master index database, modify the properties described in SQL*Loader Configuration.
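  9. When using the Bulk Matcher in match analysis mode, you can also adjust the match and duplicate thresholds at the beginning of the file, the blocking query, or the match string (see Initial Bulk Match and Load Tool Configuration Properties).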
  10. Save and close the file.
  11. Repeat the above steps for each load processor in the distributed environment.
  12. To configure logging properties, continue to Configuring Initial Bulk Match and Load Tool Logging; otherwise, skip to Performing a Match Analysis.

Configuring Initial Bulk Match and Load Tool Logging

Logging for the IBML Tool is configured in logger.properties, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files.

To Configure IBML Tool Logging

  1. Complete Configuring the Initial Bulk Match and Load Tool Processing.
  2. Navigate to the location where you extracted the IBML Tool.
  3. Open conf/logger.properties.
  4. Modify the properties defined in Initial Bulk Match and Load Tool Logging Properties.
  5. Save and close the file.
  6. Continue to Performing a Match Analysis.

Initial Bulk Match and Load Tool Configuration Properties

The configuration file for the IBML Tool, loader-config.xml, defines several aspects of the match process, including the blocking query, match string, EUID generator, FTP server, cluster synchronizer database, and SQL*Loader properties.

The configuration file is divided into the sections listed below. In addition to these sections, you can set the match and duplicate thresholds at the beginning of the file. Use these settings to help analyze the matching logic.

Initial Bulk Match and Load Tool Field Validation Configuration

The default field validation for the master index is parsed by the standard XSD and then copied into the IBML Tool configuration file. The default field validator checks the local ID and system fields to verify that the system code is valid, the local ID format is correct, the local ID is the correct length, and neither field is null. You should not need to modify this section unless you defined custom field validations.

Initial Bulk Match and Load Tool Blocking Query Configuration

When you generate the IBML Tool, the configuration for the blocking query defined for matching in the master index application is parsed by an IBML parser and then copied into the IBML Tool configuration file.


Caution - If you defined a custom parser configuration for the blocking query, the query configuration might not be copied correctly. To ensure the correct configuration if you defined a custom parser, copy the blocking query definition directly from query.xml to loader-config.xml. You also need to create a custom block generator using the com.sun.mdm.index.loader.blocker.BlockIdGenerator interface, and add the name of the custom generator to the field elements in the block ID (for example, <field>Enterprise.SystemSBR.Person.FirstName+CustomBlockIdGenerator</field>).


This section is included in the configuration file so you can modify the blocking query during the analysis phase and fine-tune the query configuration for the best match results. You can quickly change the query configuration and analyze the results of the changes without updating the master index application and regenerating the IBML Tool each time. The query configuration is only used by the master IBML Tool, which uses the blocks defined for the query to divide the bulk data into block buckets. These buckets are distributed to the other processors, which process the data concurrently.

The blocking query might include fields that are not in your original input data, such as phonetic and normalized fields. If the input to the Bulk Matcher is the file that was generated by the Data Cleanser, the file includes all required fields, including phonetic and standardized fields. If you are using a different source for the input, the IBML Tool can standardize the data for you. For consistent matching results between the initial data set and future transactions, the query configuration used for the match and load processes should be as close as possible to that in query.xml in the master index Project.

Initial Bulk Match and Load Tool Match String Configuration

When you generate the IBML Tool, the configuration of the match string is copied from the master index Project to the IBML Tool configuration file. The match string is used by all IBML Tools processing the data, so this section should be identical for all IBML Tools. This section is provided so you can modify the match string during the analysis phase and fine-tune the match logic for the best match results. As with the query configuration above, you can quickly change the match string configuration and analyze the results without modifying the master index application and regenerating the IBML Tool.

Ideally, the match string defined in loader-config.xml is as close as possible to the match string defined in mefa.xml in the master index Project. This ensures consistent matching results between the initial data set and future transactions.

Initial Bulk Match and Load Tool Processing Configuration

The processing properties described in the following table configure how the IBML Tool processes data. In these properties, you define a name for each IBML Tool, the location of the working directories, polling properties, and so on. Some of these properties only apply to specific phases of the match and load process, and some apply to either the master or slave processors.

Table 1 IBML Tool Processing Properties

| Property Name | Description |
| --- | --- |
| loaderName | A unique name for the IBML Tool residing on the current processor. This name should be unique to each IBML Tool in the distributed environment. It does not need to be modified if you are using a single processor. |
| isMasterLoader | An indicator of whether the IBML Tool being configured is the master IBML Tool. Specify true if it is the master or only IBML Tool; otherwise specify false. |
| matchAnalyzerMode | An indicator of whether to process the data in match analysis mode, which only generates analysis reports, or to perform the complete match process and generate the master index image files. Specify true to perform an analysis only; specify false to perform the actual blocking and matching process and generate the master index image files. |
| BulkLoad | An indicator of whether the current run will load the matched data into the database using SQL*Loader once the match process is complete. Specify true to load the data. To run a match analysis or just the matching process, specify false. If you just run the match process, you can verify the process and then load the output of the Bulk Matcher at a later time. |
| standardizationMode | An indicator of whether to standardize the input data. Leave the value of this property set to true. |
| deleteIntermediateDirs | An indicator of whether the working directories are deleted when each process is complete. Specify true to delete the directories; specify false to retain the directories. |
| optimizeDuplicates | An indicator of whether to automatically merge records in the input data if they have the same system and local ID. Specify true to automatically merge the duplicate records; otherwise specify false. The default is true. |
| rmiPort | This is not currently used. |
| workingDir | The absolute path to the directory in which the IBML Tools create the working files as they progress through the processing stages. The master IBML Tool also creates the master index image files here. If the path you specify does not exist, create it before running the IBML Tool. |
| ftp.workingDir | The absolute path to the directory on the master processor where files are placed for distribution to the remaining IBML Tools. You only need to define this property for the master IBML Tool and only if you are running multiple IBML Tools. All other tools ignore this property. |
| numBlockBuckets | The number of block buckets to create for the initial distribution of data blocks. Each IBML Tool works on one bucket at a time, so multiple buckets are processed at once. The number of block buckets you specify depends on the number of records to process and how specific the data blocks are in the blocking query. |
| numThreads | The number of threads to run in parallel during processing. |
| numEUIDBuckets | The number of buckets the EUID assigner should place the processed records into after they have been matched and assigned an EUID. |
| totalNoOfRecords | The total number of records being processed. This does not need to be an exact value, but needs to be greater than or equal to the exact number of records. |
| pollInterval | The number of milliseconds the IBML Tools should wait before polling the master IBML Tool for their next task. |
| maxWaitTime | The maximum time for an IBML Tool to wait for the next task before giving up. |
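
As a rough illustration, the processing properties for a master IBML Tool running in match analysis mode might look like the following sketch. It assumes these properties use the same property element syntax shown under Additional Properties. All values (the loader name, paths, bucket and thread counts, and record total) are hypothetical and need to be replaced with values suited to your environment.

```xml
<system>
    <properties>
        <!-- Hypothetical values for a master IBML Tool in analysis mode -->
        <property name="loaderName" value="loader1" />
        <property name="isMasterLoader" value="true" />
        <!-- true: only generate analysis reports, no master index image files -->
        <property name="matchAnalyzerMode" value="true" />
        <!-- false: do not invoke SQL*Loader after matching -->
        <property name="BulkLoad" value="false" />
        <property name="standardizationMode" value="true" />
        <property name="deleteIntermediateDirs" value="false" />
        <property name="optimizeDuplicates" value="true" />
        <!-- The working directory must exist before the tool is run -->
        <property name="workingDir" value="/ibml/work" />
        <property name="numBlockBuckets" value="10" />
        <property name="numThreads" value="4" />
        <property name="numEUIDBuckets" value="10" />
        <!-- Must be greater than or equal to the actual record count -->
        <property name="totalNoOfRecords" value="1000000" />
        <!-- Polling interval in milliseconds -->
        <property name="pollInterval" value="1000" />
    </properties>
</system>
```

For the actual match run, matchAnalyzerMode would be set to false; on any additional processors, isMasterLoader would be false and loaderName would differ.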

FTP Server Configuration

The processing properties described in the following table configure the connection information for the FTP server. They only need to be defined if the IBML Tools are run on multiple processors, and they only need to be defined for the slave processors.

Table 2 FTP Server Properties

| Property Name | Description |
| --- | --- |
| ftp.server | The name of the FTP server on the master processor. |
| ftp.username | The user ID to log in to the FTP server. |
| ftp.password | The password to log in to the FTP server. |
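
A minimal sketch of the FTP properties on a slave processor is shown below, assuming the same property element syntax as the other properties in loader-config.xml. The host name and credentials are placeholders, not defaults.

```xml
<system>
    <properties>
        <!-- Hypothetical FTP server on the master processor -->
        <property name="ftp.server" value="master-host.example.com" />
        <!-- Placeholder credentials; use an account that exists on that server -->
        <property name="ftp.username" value="ibmluser" />
        <property name="ftp.password" value="changeit" />
    </properties>
</system>
```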

Cluster Synchronizer Database Configuration

The cluster synchronizer database is used to coordinate the activities of all IBML Tools processing data. The configuration of this section must be identical for all processors.

Table 3 Cluster Synchronizer Database Properties

| Property Name | Description |
| --- | --- |
| cluster.database | The database platform on which the cluster synchronizer database is installed. Possible values are Oracle, MSSQL, MySQL, or derby. Caution: Make sure the value you enter here matches the value of the database property in the object.xml file. |
| cluster.database.url | The URL for the cluster synchronizer database. The format for the URL varies by database platform: jdbc:oracle:thin:@hostname:port:database_name for Oracle, jdbc:sqlserver://hostname:port;databaseName=database_name for SQL Server, jdbc:derby://hostname:port/database_name for Derby, and jdbc:mysql://hostname:port/database_name for MySQL. |
| cluster.database.user | The user ID to log in to the cluster synchronizer database. |
| cluster.database.password | The password to log in to the cluster synchronizer database. |
| cluster.database.jdbc.driver | The name of the database driver class. |
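
The following sketch shows how these properties might be set for an Oracle cluster synchronizer database, assuming the same property element syntax used elsewhere in loader-config.xml. The host, port, database name, and credentials are hypothetical; oracle.jdbc.OracleDriver is the standard Oracle thin driver class, but confirm the class name against the driver you have installed.

```xml
<system>
    <properties>
        <!-- Must match the database property in object.xml -->
        <property name="cluster.database" value="Oracle" />
        <!-- Hypothetical host, port, and database name -->
        <property name="cluster.database.url"
                  value="jdbc:oracle:thin:@dbhost.example.com:1521:ibmldb" />
        <!-- Placeholder credentials -->
        <property name="cluster.database.user" value="cluster_user" />
        <property name="cluster.database.password" value="changeit" />
        <property name="cluster.database.jdbc.driver" value="oracle.jdbc.OracleDriver" />
    </properties>
</system>
```

Because this section must be identical for all processors, the same values would be copied to every loader-config.xml in the distributed environment.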

SQL*Loader Configuration

The SQL*Loader properties are only used if you use SQL*Loader to load the bulk data into the master index database after it has gone through match processing and EUID assignment. If you use the command-line Bulk Loader, you do not need to modify these properties. SQL*Loader can only be used with an Oracle database.

Table 4 SQL*Loader Property

| Property Name | Description |
| --- | --- |
| sqlldr.userid | The connection descriptor for the master index database. For Oracle, this is in the format user/password@service_name; for example, midm/midm@MIDBS.sun.com. |
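
As a sketch, a run that loads the matched data with SQL*Loader might combine this property with BulkLoad set to true. The example reuses the connection descriptor given above and assumes the same property element syntax as the rest of the file.

```xml
<system>
    <properties>
        <!-- Load the matched data with SQL*Loader after the match process -->
        <property name="BulkLoad" value="true" />
        <!-- Format: user/password@service_name (example value from the table above) -->
        <property name="sqlldr.userid" value="midm/midm@MIDBS.sun.com" />
    </properties>
</system>
```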

Data Reader Configuration

This section defines the data reader to use, the location of the input file, and any parameters. You can define a custom data reader to read data into the Bulk Matcher if you are not using the data file output by the Data Cleanser or if your data is formatted differently than the default data reader requires. For more information about the default data reader's requirements, see Required Format for Flat Data Files. You define the custom reader in Java, and then specify the class and configure the reader using the elements and attributes in the following table. You only need to configure the data reader for the master IBML Tool.

Table 5 Custom Data Reader Configuration Elements and Attributes

| Element | Attribute | Description |
| --- | --- | --- |
| bean | | A definition for one data reader. This element includes the following elements and attributes. |
| | id | A unique name for the data reader. |
| | class | The Java class to implement for the data reader. The default data reader is com.sun.mdm.index.dataobject.DataObjectFileReader. This reader can access flat file data in the format described in Required Format for Flat Data Files. |
| | singleton | An indicator of whether to use the singleton pattern for the data reader class. Specify true to use a singleton pattern; otherwise specify false. |
| constructor-arg | | A parameter for the data reader. The default reader accepts two parameters. The first parameter has a type of java.lang.String and specifies the location of the input file. The second parameter has a type of boolean and indicates whether the input file contains delimiters. Specify true if the file contains delimiters. |
| | type | The data type of the parameter value. |
| | value | The value for the parameter. |
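
Putting the elements and attributes above together, a definition for the default data reader might look like the following sketch. The id value and the input file path are hypothetical, and the placement of the bean element within loader-config.xml is not shown here; use the location already present in your generated file.

```xml
<!-- Hypothetical id and input file path; class is the default data reader -->
<bean id="dataObjectReader"
      class="com.sun.mdm.index.dataobject.DataObjectFileReader"
      singleton="true">
    <!-- First parameter: location of the input file -->
    <constructor-arg type="java.lang.String" value="/ibml/input/cleansed-data.txt" />
    <!-- Second parameter: true if the input file contains delimiters -->
    <constructor-arg type="boolean" value="true" />
</bean>
```

A custom reader would substitute its own class name and whatever constructor parameters that class expects.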

Additional Properties

There are additional properties that are not listed in the properties file that you can use to further configure your bulk load processing. You can add these properties to the loader-config.xml file within the properties element. The syntax is <property name="property_name" value="property_value" />. For example:

<system>
    <properties>
        <property name="report.size" value="1000" />
        ....
    </properties>
</system>

Note - These properties are defined but are not used or supported by the IBML tool and should not be added to the configuration properties file: matchFlushSize, distributionMode, loadMode, goodFile, and matchCacheSize.


Table 6 Optional Properties

| Property Name | Description |
| --- | --- |
| TimeFormat | The time format for date/time fields in all input records. Add this property if any date field includes the time as well as the date. Note: The object.xml file defines the format of the date component in its dateformat property. When you define a TimeFormat property as well, the value of the TimeFormat property is appended to the dateformat value in object.xml to determine the date and time format of the fields. For example, if the dates in your data are similar to 10/05/1972 13:15:20, the dateformat should be mm/dd/yyyy and the TimeFormat should be hh:mm:ss. |
| record.delimiter | The delimiter used to separate records in the master image files. This property is required. |
| sqlldr.record.delimiter | This is an optional property and is redundant with the above record.delimiter property. If used, its value should match that of the record.delimiter property. |
| BucketCacheSize | The maximum size of the temporary bucket files that are created by the IBML tool. This property is optional. By default, the size is set to 60 MB, which should be sufficient. If any loader throws an out of memory error, set this value to lower than 60 MB. If the loader generates an error message that it is unable to create a bucket file smaller than 60 MB, change application settings or increase this property value. |
| blockPrintSize | The block size at which the load tool displays the number of matches performed for the block. The IBML tool displays this value for each block with a size that is greater than or equal to the blockPrintSize property. |
| report.size | The number of records to include in the Match Analysis Report (when matchAnalyzerMode is set to true). The report displays the records in descending order of match weights, so the lowest match weights are cut off if there are more matching records than the value of the report.size property. The default value is 3000. |

Initial Bulk Match and Load Tool Logging Properties

The configuration file for the logging properties, logger.properties, defines how much information is logged while the IBML Tools are running. By default, INFO level messages and above are written both to the console and to a log file.

The following table lists and describes the default properties for the configuration file, but you can add new properties based on the log handlers you use. For more information about log handlers and the properties for each, see the Javadocs for the java.util.logging package.

Table 7 IBML Tool Logging Properties

| Property Name | Description |
| --- | --- |
| handlers | A list of log handler classes to use, such as java.util.logging.FileHandler and java.util.logging.ConsoleHandler. Each handler you define needs to be configured according to the properties defined for the class. |
| level | The logging level to use. By default, this is set to INFO level logging. |
| java.util.logging.FileHandler.pattern | The name and location of the log files that are generated. By default, the files are located in the directory where you extracted the IBML Tools in the logs subdirectory. The files are named loader#.log, where # is an integer that is incremented each time a new log file is created. The log file with the lowest number is the most recent. |
| java.util.logging.FileHandler.limit | The maximum number of bytes to write to any one log file. |
| java.util.logging.FileHandler.count | The number of output files to cycle through. |
| java.util.logging.FileHandler.formatter | The name of the Java class used to format the log file. The IBML Tool provides a formatting class, com.sun.mdm.index.loader.log.LogFormatter, but you can define your own. |
| java.util.logging.ConsoleHandler.level | The level at which information is logged on the console. |
| java.util.logging.ConsoleHandler.formatter | The name of the Java class used to format the log information on the console. By default, the IBML Tool uses java.util.logging.SimpleFormatter, but you can define your own. |