Loading the Initial Data Set for a Sun Master Index

Configuring the Initial Bulk Match and Load Tool

Before you can run the IBML Tool, you need to define certain runtime parameters, such as how to distribute the processing, FTP server properties, database properties, logging properties, and so on. You can also modify the configuration of the query used for matching, the configuration of the match string, and the weighting thresholds used by the Bulk Matcher.

Note –

If you are using the Data Integrator Bulk Loader, the properties you set here only apply to the Bulk Matcher.

The following topics provide instructions for configuring the IBML Tool.
Configuring the Initial Bulk Match and Load Tool Processing
Configuring Initial Bulk Match and Load Tool Logging

Configuring the Initial Bulk Match and Load Tool Processing

The bulk match process is configured by loader-config.xml, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files. The process must be configured on each machine that is running a Bulk Matcher.

To Configure the IBML Tool

Perform the following steps on each machine processing bulk data.

Complete the steps under Creating the Cluster Synchronizer Database.

Navigate to the location where you extracted the IBML Tool.

Open conf/loader-config.xml.

Configure processing attributes by modifying the properties described in Initial Bulk Match and Load Tool Processing Configuration.

Configure the cluster synchronizer database connection properties by modifying the properties described in Cluster Synchronizer Database Configuration.

Configure the data reader and enter the name and location of the input file (see Data Reader Configuration for more information).

If the IBML Tool is running on multiple processors, modify the properties described in FTP Server Configuration.

If you are using SQL*Loader to load the master image into the master index database, modify the properties described in SQL*Loader Configuration.

When using the Bulk Matcher in match analysis mode, do any of the following:
- To modify the match and duplicate thresholds for match analysis, enter new values for the duplicateThreshold and matchThreshold elements.
- To modify the blocking query for match analysis, modify the query builder section (described in Initial Bulk Match and Load Tool Blocking Query Configuration).
- To modify the match string for match analysis, modify the MatchingConfig section (described in Initial Bulk Match and Load Tool Match String Configuration).

Save and close the file.

Repeat the above steps for each load processor in the distributed environment.

To configure logging properties, continue to Configuring Initial Bulk Match and Load Tool Logging; otherwise, skip to Performing a Match Analysis.

Configuring Initial Bulk Match and Load Tool Logging

Logging for the IBML Tool is configured in logger.properties, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files.

To Configure IBML Tool Logging

Complete Configuring the Initial Bulk Match and Load Tool Processing.

Navigate to the location where you extracted the IBML Tool.

Open conf/logger.properties.

Modify the properties defined in Initial Bulk Match and Load Tool Logging Properties.

Save and close the file.

Continue to Performing a Match Analysis.

Initial Bulk Match and Load Tool Configuration Properties

Note –

This topic describes the properties that are available in Java CAPS 6 Update 1.

The configuration file for the IBML Tool, loader-config.xml, defines several aspects of the match process, including the blocking query, match string, EUID generator, FTP server, cluster synchronizer database, and SQL*Loader properties.

The configuration file is divided into the sections listed below. In addition to these sections, you can set the match and duplicate thresholds at the beginning of the file. Use these settings to help analyze the matching logic.
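
As a rough illustration of how the two thresholds interact during analysis, the following Java sketch classifies a matching weight against them. It assumes the usual master index convention that a weight at or above matchThreshold is treated as a match and a weight between the two thresholds is flagged as a potential duplicate. The class name, method name, and threshold values are hypothetical and are not taken from loader-config.xml.

```java
/** Hypothetical helper for illustration only; not part of the IBML Tool API. */
final class ThresholdSketch {

    enum Outcome { MATCH, POTENTIAL_DUPLICATE, NO_MATCH }

    /**
     * Classifies a matching weight. Assumes duplicateThreshold <= matchThreshold,
     * which is the convention this sketch relies on.
     */
    static Outcome classify(double weight, double duplicateThreshold, double matchThreshold) {
        if (duplicateThreshold > matchThreshold) {
            throw new IllegalArgumentException("duplicateThreshold must not exceed matchThreshold");
        }
        if (weight >= matchThreshold) {
            return Outcome.MATCH;
        }
        if (weight >= duplicateThreshold) {
            return Outcome.POTENTIAL_DUPLICATE;
        }
        return Outcome.NO_MATCH;
    }

    public static void main(String[] args) {
        double duplicateThreshold = 7.25; // example value only
        double matchThreshold = 29.0;     // example value only
        for (double weight : new double[] {31.5, 12.0, 3.0}) {
            System.out.println(weight + " -> " + classify(weight, duplicateThreshold, matchThreshold));
        }
    }
}
```

Raising or lowering either value in this sketch shows why the analysis reports change when you edit the thresholds: the same weight can move from one category to another.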

Initial Bulk Match and Load Tool Field Validation Configuration

The default field validation for the master index is parsed by the standard XSD and then copied into the IBML Tool configuration file. The default field validator checks the local ID and system fields to verify that the system code is valid, the local ID format is correct, the local ID is the correct length, and neither field is null. You should not need to modify this section unless you defined custom field validations.

Initial Bulk Match and Load Tool Blocking Query Configuration

When you generate the IBML Tool, the configuration for the blocking query defined for matching in the master index application is parsed by an IBML parser and then copied into the IBML Tool configuration file.

Caution –

If you defined a custom parser configuration for the blocking query, the query configuration might not be copied correctly. To ensure the correct configuration if you defined a custom parser, copy the blocking query definition directly from query.xml to loader-config.xml. You also need to create a custom block generator using the com.sun.mdm.index.loader.blocker.BlockIdGenerator interface, and add the name of the custom generator to the field elements in the block ID (for example, <field>Enterprise.SystemSBR.Person.FirstName+CustomBlockIdGenerator</field>).

This section is included in the configuration file so you can modify the blocking query during the analysis phase to help you fine-tune the query configuration for the best match results. You can quickly change the query configuration and analyze the results of the changes without needing to update the master index application and regenerate the IBML Tool each time. The query configuration is only used by the master IBML Tool, which uses the blocks defined for the query to divide the bulk data into block buckets to be distributed to the other processors that will process the data concurrently.

The blocking query might include fields that are not in your original input data, such as phonetic and normalized fields. If the input to the Bulk Matcher is the file that was generated by the Data Cleanser, the file includes all required fields, including phonetic and standardized fields. If you are using a different source for the input, the IBML Tool can standardize the data for you. For consistent matching results between the initial data set and future transactions, the query configuration used for the match and load processes should be as close as possible to that in query.xml in the master index Project.

Initial Bulk Match and Load Tool Match String Configuration

When you generate the IBML Tool, the configuration of the match string is copied from the master index Project to the IBML Tool configuration file. The match string is used by all IBML Tools processing the data, so this section should be identical for all IBML Tools. This section is provided so you can modify the match string during the analysis phase to fine-tune the match logic and achieve the best match results. As with the query configuration above, you can quickly change the match string configuration and analyze the results without needing to modify the master index application and regenerate the IBML Tool.

Ideally, the match string defined in loader-config.xml is as close as possible to the match string defined in mefa.xml in the master index Project. This ensures consistent matching results between the initial data set and future transactions.

Initial Bulk Match and Load Tool Processing Configuration

The processing properties described in the following table configure how the IBML Tool processes data. In these properties, you define a name for each IBML Tool, the location of the working directories, polling properties, and so on. Some of these properties apply only to specific phases of the match and load process, and some apply only to the master processor or only to the slave processors.

Table 1 IBML Tool Processing Properties


Property Name	Description
loaderName	A unique name for the IBML Tool residing on the current processor. This name should be unique to each IBML Tool in the distributed environment. It does not need to be modified if you are using a single processor.
isMasterLoader	An indicator of whether the IBML Tool being configured is the master IBML Tool. Specify `true` if it is the master or only IBML Tool; otherwise specify `false`.
matchAnalyzerMode	An indicator of whether to process the data in match analysis mode, which only generates analysis reports, or to perform the complete match process and generate the master index image files. Specify `true` to perform an analysis only; specify `false` to perform the actual blocking and matching process and generate the master index image files.
BulkLoad	An indicator of whether the current run will load the matched data into the database using SQL*Loader once the match process is complete. Specify `true` to load the data. To run a match analysis or just the matching process, specify `false`. If you just run the match process, you can verify the process and then load the output of the Bulk Matcher at a later time.
standardizationMode	An indicator of whether to standardize the input data. Leave the value of this property set to `true`.
deleteIntermediateDirs	An indicator of whether the working directories are deleted when each process is complete. Specify `true` to delete the directories; specify `false` to retain the directories.
optimizeDuplicates	An indicator of whether to automatically merge records in the input data if they have the same system and local ID. Specify `true` to automatically merge the duplicate records; otherwise specify `false`. The default is `true`.
rmiPort	This is not currently used.
workingDir	The absolute path to the directory in which the IBML Tools create the working files as they progress through the processing stages. The master IBML Tool also creates the master index image files here. If the path you specify does not exist, create it before running the IBML Tool.
ftp.workingDir	The absolute path to the directory on the master processor where files are placed for distribution to the remaining IBML Tools. You only need to define this property for the master IBML Tool and only if you are running multiple IBML Tools. All other tools ignore this property.
numBlockBuckets	The number of block buckets to create for the initial distribution of data blocks. Each IBML Tool works on one bucket at a time so multiple buckets are processed at once. The number of block buckets you specify depends on the number of records to process and how specific the data blocks are in the blocking query.
numThreads	The number of threads to run in parallel during processing.
numEUIDBuckets	The number of buckets the EUID assigner should place the processed records into after they have been matched and assigned an EUID.
totalNoOfRecords	The total number of records being processed. This does not need to be an exact value, but it must be greater than or equal to the actual number of records.
pollInterval	The number of milliseconds the IBML Tools should wait before polling the master IBML Tool for their next task.
maxWaitTime	The maximum time for an IBML Tool to wait for the next task before giving up.
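
The relationship between pollInterval and maxWaitTime can be pictured as a simple polling loop. The following Java sketch is only an illustration of that idea and does not show the IBML Tool's actual implementation: the class, the method, and the task supplier are hypothetical, and the sketch assumes both values are expressed in milliseconds (the table above states the unit only for pollInterval).

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical polling loop for illustration only; not IBML Tool code. */
final class PollingSketch {

    /**
     * Asks the task source for work every pollIntervalMs milliseconds and
     * gives up once maxWaitMs milliseconds have passed without a task.
     */
    static <T> Optional<T> pollForTask(Supplier<Optional<T>> taskSource,
                                       long pollIntervalMs, long maxWaitMs) {
        if (pollIntervalMs <= 0) {
            throw new IllegalArgumentException("pollIntervalMs must be positive");
        }
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (true) {
            Optional<T> task = taskSource.get();
            if (task.isPresent()) {
                return task;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return Optional.empty(); // waited long enough; give up
            }
            try {
                Thread.sleep(Math.min(pollIntervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Stand-in for the master: hands out a task on the third poll.
        Supplier<Optional<String>> master = () -> polls.incrementAndGet() >= 3
                ? Optional.of("bucket-1")
                : Optional.<String>empty();
        System.out.println(pollForTask(master, 50, 2_000));
    }
}
```

A short poll interval makes idle processors pick up work sooner at the cost of more frequent requests; the maximum wait time bounds how long a processor stays idle before stopping.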

FTP Server Configuration

The properties described in the following table configure the connection information for the FTP server. They only need to be defined if the IBML Tools are run on multiple processors, and they only need to be defined for the slave processors.

Table 2 FTP Server Properties


Property Name	Description
ftp.server	The name of the FTP server on the master processor.
ftp.username	The user ID to log in to the FTP server.
ftp.password	The password to log in to the FTP server.

Cluster Synchronizer Database Configuration

The cluster synchronizer database is used to coordinate the activities of all IBML Tools processing data. The configuration of this section must be identical for all processors.

Table 3 Cluster Synchronizer Database Properties


Property Name	Description
cluster.database	The database platform on which the cluster synchronizer database is installed. Possible values are `Oracle`, `MSSQL`, `MySQL`, or `derby`.
cluster.database.url	The URL for the cluster synchronizer database. The format for the URL varies by database platform. For Oracle, the format is `jdbc:oracle:thin:@hostname:port:database_name`. For SQL Server, the format is `jdbc:sqlserver://hostname:port;databaseName=database_name`. For Derby, the format is `jdbc:derby://hostname:port/database_name`. For MySQL, the format is `jdbc:mysql://hostname:port/database_name`.
cluster.database.user	The user ID to log in to the cluster synchronizer database.
cluster.database.password	The password to log in to the cluster synchronizer database.
cluster.database.jdbc.driver	The name of the database driver class.
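
To make the URL formats concrete, the following Java sketch assembles a value for cluster.database.url from a platform name, host, port, and database name. The class and method are hypothetical helpers, and the host names, ports, and database name in the example are placeholders. The MySQL case uses the standard Connector/J form, in which the database name follows a slash.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of the IBML Tool. */
final class ClusterDbUrlSketch {

    /** Builds a JDBC URL for one of the platforms accepted by cluster.database. */
    static String jdbcUrl(String platform, String host, int port, String database) {
        switch (platform.toLowerCase(Locale.ROOT)) {
            case "oracle":
                return "jdbc:oracle:thin:@" + host + ":" + port + ":" + database;
            case "mssql":
                return "jdbc:sqlserver://" + host + ":" + port + ";databaseName=" + database;
            case "mysql":
                return "jdbc:mysql://" + host + ":" + port + "/" + database;
            case "derby":
                return "jdbc:derby://" + host + ":" + port + "/" + database;
            default:
                throw new IllegalArgumentException("Unsupported platform: " + platform);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and database names; ports are the common vendor defaults.
        System.out.println(jdbcUrl("Oracle", "dbhost", 1521, "clusterdb"));
        System.out.println(jdbcUrl("MSSQL", "dbhost", 1433, "clusterdb"));
        System.out.println(jdbcUrl("MySQL", "dbhost", 3306, "clusterdb"));
        System.out.println(jdbcUrl("derby", "dbhost", 1527, "clusterdb"));
    }
}
```

Because the cluster synchronizer section must be identical on every processor, use a host name that resolves from all machines rather than localhost.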

SQL*Loader Configuration

The SQL*Loader properties are only used if you use SQL*Loader to load the bulk data into the master index database after it has gone through match processing and EUID assignment. If you use the Data Integrator Bulk Loader, you do not need to modify these properties. SQL*Loader can only be used with an Oracle database.

Table 4 SQL*Loader Property


Property Name	Description
sqlldr.userid	The connection descriptor for the master index database. For Oracle, this is in the format `user/password@service_name`; for example, `midm/midm@MIDBS.sun.com`.

Data Reader Configuration

This section defines the data reader to use, the location of the input file, and any parameters. You can define a custom data reader to read data into the Bulk Matcher if you are not using the data file output by the Data Cleanser or if your data is formatted differently than the default data reader requires. For more information about the default data reader's requirements, see Required Format for Flat Data Files. You define the custom reader in Java, and then specify the class and configure the reader using the elements and attributes in the following table. You only need to configure the data reader for the master IBML Tool.

Table 5 Custom Data Reader Configuration Elements and Attributes


Element	Attribute	Description
bean		A definition for one data reader. This element includes the following elements and attributes.
	id	A unique name for the data reader.
	class	The Java class that implements the data reader. The default data reader is `com.sun.mdm.index.dataobject.DataObjectFileReader`. This reader can access flat file data in the format described in Required Format for Flat Data Files.
	singleton	An indicator of whether to use the singleton pattern for the data reader class. Specify `true` to use a singleton pattern; otherwise specify `false`.
constructor-arg		A parameter for the data reader. The default reader accepts two parameters. The first parameter is of type `java.lang.String` and specifies the location of the input file. The second parameter is of type `boolean` and indicates whether the input file contains delimiters. Specify `true` if the file contains delimiters.
	type	The data type of the parameter value.
	value	The value for the parameter.
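
The class attribute and the constructor-arg elements together describe a constructor call: the class names the reader, and each type and value pair supplies one argument in order. The following Java sketch shows that mapping with plain reflection. It is not how the IBML Tool necessarily loads readers, and SampleReader is a hypothetical stand-in used so the example runs without the IBML Tool libraries; a real custom reader would implement whatever interface the tool requires.

```java
import java.lang.reflect.Constructor;

/** Hypothetical illustration of bean-style construction; not IBML Tool code. */
final class ReaderBeanSketch {

    /** Stand-in reader with the same parameter shape as the default reader. */
    static final class SampleReader {
        final String inputFile;
        final boolean delimited;

        public SampleReader(String inputFile, boolean delimited) {
            this.inputFile = inputFile;
            this.delimited = delimited;
        }
    }

    /** Creates an instance from a class name plus parallel arrays of types and values. */
    static Object instantiate(String className, String[] types, String[] values) {
        if (types.length != values.length) {
            throw new IllegalArgumentException("Each constructor-arg needs a type and a value");
        }
        Class<?>[] paramTypes = new Class<?>[types.length];
        Object[] args = new Object[values.length];
        for (int i = 0; i < types.length; i++) {
            if ("java.lang.String".equals(types[i])) {
                paramTypes[i] = String.class;
                args[i] = values[i];
            } else if ("boolean".equals(types[i])) {
                paramTypes[i] = boolean.class;
                args[i] = Boolean.parseBoolean(values[i]);
            } else {
                throw new IllegalArgumentException("Type not handled in this sketch: " + types[i]);
            }
        }
        try {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor(paramTypes);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create reader " + className, e);
        }
    }

    public static void main(String[] args) {
        // Placeholder input path; "true" marks the file as delimited.
        Object reader = instantiate(SampleReader.class.getName(),
                new String[] {"java.lang.String", "boolean"},
                new String[] {"/data/input.txt", "true"});
        System.out.println(reader.getClass().getName());
    }
}
```

The order of the constructor-arg elements matters in this picture, because the arguments are passed to the constructor in the order they are listed.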

Initial Bulk Match and Load Tool Logging Properties

The configuration file for the logging properties, logger.properties, defines how much information is logged while the IBML Tools are running. By default, logging is set to display INFO level messages and above both on the console and in a log file.

The following table lists and describes the default properties for the configuration file, but you can add new properties based on the log handlers you use. For more information about log handlers and the properties for each, see the Javadocs for the java.util.logging package.

Table 6 IBML Tool Logging Properties


Property Name	Description
handlers	A list of log handler classes to use, such as `java.util.logging.FileHandler` and `java.util.logging.ConsoleHandler`. Each handler you define needs to be configured according to the properties defined for the class.
level	The logging level to use. By default, this is set to INFO level logging.
java.util.logging.FileHandler.pattern	The name and location of the log files that are generated. By default, the files are located in the logs subdirectory of the directory where you extracted the IBML Tool. The files are named `loader#.log`, where # is an integer that is incremented each time a new log file is created. The log file with the lowest number is the most recent.
java.util.logging.FileHandler.limit	The maximum number of bytes to write to any one log file.
java.util.logging.FileHandler.count	The number of output files to cycle through.
java.util.logging.FileHandler.formatter	The name of the Java class used to format the log file. The IBML Tool provides a formatting class, `com.sun.mdm.index.loader.log.LogFormatter`, but you can define your own.
java.util.logging.ConsoleHandler.level	The level at which information is logged on the console.
java.util.logging.ConsoleHandler.formatter	The name of the Java class used to format the log information on the console. By default, the IBML Tool uses `java.util.logging.SimpleFormatter`, but you can define your own.
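
Because these are standard java.util.logging properties, you can try a set of values outside the IBML Tool before editing the file. The following Java sketch loads an in-memory configuration that resembles the defaults described above and writes one message. The specific values are assumptions chosen for the example: the file pattern points at the system temporary directory, the limit and count are arbitrary, the root level uses the `.level` key that java.util.logging expects, and SimpleFormatter replaces the IBML Tool's own formatter class so the sketch runs without the tool's libraries.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

/** Hypothetical illustration of the logging properties; not IBML Tool code. */
final class LoggingConfigSketch {

    // Example values only; adjust the pattern, limit, and count for your environment.
    static final String CONFIG = String.join("\n",
            "handlers=java.util.logging.FileHandler, java.util.logging.ConsoleHandler",
            ".level=INFO",
            "java.util.logging.FileHandler.pattern=%t/loader%g.log",
            "java.util.logging.FileHandler.limit=1000000",
            "java.util.logging.FileHandler.count=10",
            "java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter",
            "java.util.logging.ConsoleHandler.level=INFO",
            "java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter");

    /** Replaces the current java.util.logging configuration with CONFIG. */
    static void apply() {
        byte[] bytes = CONFIG.getBytes(StandardCharsets.ISO_8859_1);
        try {
            LogManager.getLogManager().readConfiguration(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        apply();
        Logger.getLogger("loader.sketch").info("Logging configured from in-memory properties");
    }
}
```

In the pattern, `%g` is the generation number that distinguishes rotated files and `%t` is the system temporary directory; both are standard FileHandler pattern components.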