2 Initial Bulk Match and Load Tool

This chapter provides information and procedures on how to generate, configure, and run IBML Tool. It also provides the procedures necessary to load matched data into an Oracle Healthcare Master Person Index database.

This chapter includes the following sections:

Generating the Initial Bulk Match and Load Tool
Configuring the Environment
Creating the Cluster Synchronizer Database
Configuring the Initial Bulk Match and Load Tool
Performing a Match Analysis (Optional)
Running the Bulk Match and Bulk Load in One Step (SQL*Loader Only)
Running the Bulk Match and the Bulk Load in Two Steps
Running IBML in Phase Manner

Generating the Initial Bulk Match and Load Tool

In order to use the IBML Tool, you need to generate the tool from the master person index application for which it will be used. You generate the tool from the master person index project in NetBeans. The tool is generated based on the information you specified in the wizard and any changes you made to the configuration files before generating.

To Generate the Initial Bulk Match and Load Tool

Before you begin, make sure the master person index application and database are created and configured. The configuration for the object structure, blocking query, match string, and matching rules should be as close to final as possible. You can use the Bulk Matcher to perform the final tuning on the blocking query and matching rules.

In the NetBeans IDE project window, right-click the main project of the master person index application and then select Generate Loader Zip.

The file is generated and downloaded to NetBeans_Projects\Project_Name\loader-generated.
On your computer, navigate to NetBeans_Projects\Project_Name\loader-generated and extract the contents of loader.zip.
If you are distributing the match and load processes across multiple processors, extract the contents of loader.zip to each machine processing the data.
If the IBML Tools are running on UNIX, make sure the run-initial-bulk-loader.sh and generate-sql-loader.sh files have the execute permission. Use chmod to grant permission, if necessary.

Continue to Configuring the Environment.

Configuring the Environment

Before you start working with the IBML Tool, make sure the environment is configured to handle the requirements of the load process.

To Configure the Environment

Before beginning, complete the steps under Generating the Initial Bulk Match and Load Tool.

If you are using a distributed processing environment, set up an FTP server on the master processor using the FTP provider of your choice.
Install the database driver for the master person index database platform on each processor running the IBML Tools.
Install SQL*Plus and SQL*Loader on the machines performing the load.
Make sure the Java installation you are using matches the Java version used by the NetBeans installation that generated the Initial Bulk Match and Load Tool.

Tip:
To find the Java version for NetBeans, select Tools from the main menu and then select Java Platforms. To find the Java version you are using, type Java -version at a command prompt. If they do not match, either locate or install the correct version and change your PATH variable to point to the correct version.

Continue to Creating the Cluster Synchronizer Database.

Creating the Cluster Synchronizer Database

The cluster synchronizer database helps manage the various processes perform by the IBML Tool. The database is required whether you run the IBML Tool in a distributed environment or on a single processor. The database consists of only three tables, and must be installed in the master person index database.

Tip:

Before you begin each phase of the match and load process, make sure to truncate these tables with the script provided to avoid unique constraint errors in subsequent processes.

To Create the Cluster Synchronization Tables

Before beginning, complete the steps under Configuring the Environment.

Obtain information about the master person index database, such as the login ID and password of the user who created the master person index database, the SID name, the port number, and the server name.
Open a SQL editor, and log in to the master person index database using the above information.
Run NetBeans_Projects/Project_Name/loader-generated/loader/scripts/cluster-synchronizer.sql against the database instance.

Continue to Configuring the Initial Bulk Match and Load Tool.

Configuring the Initial Bulk Match and Load Tool

Before you can run the IBML Tool, you need to define certain runtime parameters, such as how to distribute the processing, FTP server properties, database properties, logging properties, and so on. You can also modify the configuration of the query used for matching, the configuration of the match string, and the weighting thresholds used by the Bulk Matcher.

The following sections provide instructions for:

Configuring the Initial Bulk Match and Load Tool Processing
Configuring the Initial Bulk Match and Load Tool Logging
Setting Up Initial Bulk Match and Load Tool Configuration Properties

Configuring the Initial Bulk Match and Load Tool Processing

The bulk match process is configured by initial-bulk-loader-config.xml, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files. The process must be configured on each machine that is running a Bulk Matcher.

To Configure the IBML Tool

Before beginning, complete the steps under Creating the Cluster Synchronizer Database.

Perform the following steps on each machine processing bulk data.

Navigate to the location where you extracted the IBML Tool.
Open conf/initial-bulk-loader-config.xml.
Configure processing attributes by modifying the properties described in Initial Bulk Match and Load Tool Processing Configuration.
Configure the cluster synchronizer database connection properties by modifying the properties described in Cluster Synchronizer Database Configuration.
Configure the data reader and enter the name and location of the input file (see Data Reader Configuration for more information).
If the IBML Tool is running on multiple processors, modify the properties described in FTP Server Configuration (Optional).
If you are using SQL*Loader to load the master image into the master person index database, modify the properties described in SQL*Loader Configuration.
When using the Bulk Matcher in match analysis mode, perform any of the following:
- To modify the match and duplicate thresholds for match analysis, enter new values for the duplicateThreshold and matchThreshold elements.
- To modify the blocking query for match analysis, modify the query builder section (described in Initial Bulk Match and Load Tool Blocking Query Configuration).
- To modify the match string for match analysis, modify the MatchingConfig section (described in Initial Bulk Match and Load Tool Match String Configuration).
Save and close the file.
Repeat the above steps for each load processor in the distributed environment.

To configure logging properties, continue to Configuring the Initial Bulk Match and Load Tool Logging. Else, skip to Performing a Match Analysis (Optional).

Configuring the Initial Bulk Match and Load Tool Logging

Logging for the IBML Tool is configured in logger.properties, which is located in the conf subdirectory in the directory where you extracted the IBML Tool files.

To Configure IBML Tool Logging

Before beginning, complete Configuring the Initial Bulk Match and Load Tool Processing.

Navigate to the location where you extracted the IBML Tool.
Open conf/logger.properties.
Modify the properties defined inInitial Bulk Match and Load Tool Logging Properties.
Save and close the file.

Continue to Performing a Match Analysis (Optional).

Setting Up Initial Bulk Match and Load Tool Configuration Properties

The configuration file for the IBML Tool, initial-bulk-loader-config.xml, defines several aspects of the match process, including the blocking query, match string, EUID generator, FTP server, cluster synchronizer database, SQL*Loader properties, record limit, record limit per report, and report format of the match analysis report files.

The configuration file is divided into the sections listed below. In addition to these sections, you can set the match and duplicate thresholds at the beginning of the file. Use these settings to help analyze the matching logic.

Initial Bulk Match and Load Tool Field Validation Configuration
Initial Bulk Match and Load Tool Blocking Query Configuration
Initial Bulk Match and Load Tool Match String Configuration
Initial Bulk Match and Load Tool Processing Configuration
FTP Server Configuration (Optional)
Cluster Synchronizer Database Configuration
SQL*Loader Configuration
Data Reader Configuration

Initial Bulk Match and Load Tool Field Validation Configuration

The default field validation for the master person index is parsed by the standard XSD and then copied into the IBML Tool configuration file. The default field validator checks the local ID and system fields to verify that the system code is valid, the local ID format is correct, the local ID is the correct length, and neither field is null. You should not need to modify this section unless you defined custom field validations.

Initial Bulk Match and Load Tool Blocking Query Configuration

When you generate the IBML Tool, the configuration for the blocking query defined for matching in the master person index application is parsed by an IBML parser and then copied into the IBML Tool configuration file.

Caution:

If you defined a custom parser configuration for the blocking query, the query configuration might not be copied correctly. To ensure the correct configuration if you defined a custom parser, copy the blocking query definition directly from query.xml to initial-bulk-loader-config.xml. You also need to create a custom block generator using the com.sun.mdm.index.loader.blocker.BlockIdGenerator interface, and add the name of the custom generator to the field elements in the block ID (for example, Enterprise.SystemSBR.Person.FirstName+CustomBlockIdGenerator).

This section is included in the configuration file so you can modify the blocking query during the analysis phase to help you fine-tune the query configuration for the best match results. You can quickly change the query configuration and analyze the results of the changes without needing to update the master person index application and regenerate the IBML Tool each time. The query configuration is only used by the master IBML Tool, which uses the blocks defined for the query to divide the bulk data into block buckets to be distributed to the other processors that will process the data concurrently.

The blocking query might include fields that are not in your original input data, such as phonetic and normalized fields. If the input to the Bulk Matcher is the file that was generated by the Data Cleanser, the file includes all required fields, including phonetic and standardized fields. If you are using a different source for the input, the IBML Tool can standardize the data for you. For consistent matching results between the initial data set and future transactions, the query configuration used for the match and load processes should be as close as possible to that in query.xml in the master person index project.

Initial Bulk Match and Load Tool Match String Configuration

When you generate the IBML Tool, the configuration of the match string is copied from the master person index project to the IBML Tool configuration file. The match string is used by all IBML Tools processing the data, so this section should be identical for all IBML Tools. This section is provided so you can modify the match string during the analysis phase in order to fine-tune the match logic to achieve the best match results. As with the query configuration above, you can quickly change the match string configuration and analyze the results without needing to modify the master person index application and regenerate the IBML Tool.

Ideally, the match string defined in initial-bulk-loader-config.xml is as close as possible to the match string defined in mefa.xml in the master person index project. This assures consistent matching results between the initial data set and future transactions.

Note:

The match columns in the MatchingConfig section are used by the IBML only during the analysis phase (when property matchAnalyzerMode=true). During the analysis phase, match columns within mefa.xml or within match-ext.xml are not used.

During the initial data loading phase (when property matchAnalyzerMode=false), the IBML uses match columns from the mefa.xml file, as well as from the match-ext.xml file, in the same manner as they used by the OHMPI runtime server and support of the same match enhancement features and the match columns within initial-bulk-loader-config.xml are not used.

Initial Bulk Match and Load Tool Processing Configuration

Table 2-1 lists the processing properties and describes how to configure the properties for the IBML Tool to process data. In these properties, you define a name for each IBML Tool, the location of the working directories, polling properties, and so on. Some of these properties only apply to specific phases of the match and load process, and some apply to either the master or slave processors.

Table 2-1 IBML Tool Processing Properties

Property Name	Description
loaderName (R)^Foot 1	A unique name for the IBML Tool residing on the current processor. This name should be unique to each IBML Tool in the distributed environment. It does not need to be modified if you are using a single processor.
isMasterLoader (R)	An indicator of whether the IBML Tool being configured is the master IBML Tool. Specify true if it is the master or only IBML Tool. Otherwise, specify false.
matchAnalyzerMode (O)^Foot 2	An indicator of whether to process the data in match analysis mode, which only generates analysis reports, or to perform the complete match process and generate the master person index image files. Specify true to perform an analysis only; specify false to perform the actual blocking and matching process and generate the master person index image files. The default is false. Analysis reports are output as html or csv file. Use the report.size property to set the number of records that an analysis report displays.
report.size (O)	By default, the report html or csv file are limited to displaying 3000 records. The top records are displayed in descending order of match weights and the rest are cut off. The value of this property is limited based on the available memory of the JVM at runtime. You can increase the heap size of the JVM for a large report size. To increase the JVM heap size, perform following steps: Open `run-initial-bulk-loader.bat/run-initial-bulk-loader.sh`. Edit JVM_OPTIONS to increase the heap size. By default, it is set to JVM_OPTIONS=-Xmx512M.
report.page.size (O)	By default, this value is equal to the report.size which is 3000 records. This value indicates how many number of records to be stored in one report file. For example, if report.page.size is set to 30 and there are 300 records to be displayed, 10 reports will be generated with 30 records in each report.
report.format (O)	Allows to choose the format in which the match analysis report can be generated. By default, this value is equal to all, so that reports are generated in both html and csv formats. Possible values are html (to generate reports in html format) and csv (to generate reports in csv format).
BulkLoad (O)	An indicator of whether the current run will load the matched data into the database using SQL*Loader once the match process is complete. Specify true to load the data (the default is false). To run a match analysis or just the matching process, specify false. (the default is false). If you just run the match process, you can verify the process and then load the output of the Bulk Matcher at a later time.
standardizationMode (O)	An indicator of whether to standardize the input data. Leave the value of the this property set to true. The default is true.
deleteIntermediateDirs (O)	An indicator of whether the working directories are deleted when each process is complete. Specify true to delete the directories and false to retain the directories. The default is false.
optimizeDuplicates (O)	An indicator of whether to automatically merge records in the input data if they have the same system and local ID. Specify true to automatically merge the duplicate records. Otherwise, specify false. The default is true.
rmiPort	This is not currently used.
workingDir (R)	The absolute path to the directory in which the IBML Tools create the working files as they progress through the processing stages. The master IBML Tool also creates the master person index image files here. If the path you specify does not exist, create it before running the IBML Tool.
ftp.workingDir (O)	The absolute path to the directory on the master processor where files are placed for distribution to the remaining IBML Tools. You only need to define this property for the master IBML Tool and only if you are running multiple IBML Tools. All other tools ignore this property.
numBlockBuckets (R)	The number of block buckets to create for the initial distribution of data blocks. Each IBML Tool works on one bucket at a time so multiple buckets are processed at once. The number of block buckets you specify depends on the number of records to process and how specific the data blocks are in the blocking query.
numThreads (O)	The number of threads to run in parallel during processing. The default is 1.
numEUIDBuckets (R)	The number of buckets the EUID assigner should place the processed records into after they have been matched and assigned an EUID.
totalNoOfRecords (R)	The total number of records being processed. This does not need to be an exact value, but needs to be greater than or equal to the exact number of records.
BucketCacheSize (O)	The maximum bucket file size (in bytes) before splitting. Once exceeding this maximum bucket file size, a single bucket file gets splitted into multiple ones. The default value is 60,000,000.
matchFlushSize (O)	The maximum number of match pairs allowed in a single match file. The default value is 100,000.
pollInterval (R)	The number of milliseconds the IBML Tools should wait before polling the master IBML Tool for their next task.
maxWaitTime (R)	The maximum time for an IBML Tool to wait for the next task before giving up.
TimeFormat (O)	The time format used to specify all input records if any date field also has a time component. Note: The date format component (without time component) for date fields are specified in the `object.xml` "dateformat" field. The date time format string is composed of concatenation of dateFormat + TimeFormat field. That is, if dateformat = "mm/dd/yyyy" and TimeFormat = "hh:mm:ss", then date type fields should be: "11/02/2001 10:10:20".
record.delimiter (R)	The delimiter used for delimiting records in MasterImage files. Example value is $$$.
sqlldr.record.delimiter (O)	This delimiter has been deprecated. Use record.delimiter instead.
BucketCacheSize (O)	The maximum size of a temporary Bucket File that will be created by IBML. The default value is 60000000. Normally any loader should be able to completely load any file bucket in memory. If there is an out of memory error from any loader, set this value to lower than 60M; and if IBML is unable to create a Bucket File smaller than 60M as reported by the IBML error message, either change the application settings or increase this property value.
blockPrintSize (O)	The property to set for IBML to print the number of matches done for each block whose size>= blockPrintSize. This is useful for Match statistics.
loadMode (O)	An indicator of whether to run the IBML Tool in a phase manner or start to end. By default, the value is not set or is set to ENDTOEND_MODE; the IBML Tool runs in start to end mode. To run the IBML Tool in a phase manner, set the value to PHASED_MODE. For more information on how to run IBML in a phase manner, see Running IBML in Phase Manner.
FileTransferProtocol (O)	In cluster mode, FileTransferProtocol is used for configuring one of the three IBML optional protocols to transfer the process files from the master to slaves. These protocols are FTP, COPY, and SHARING.
sharedFolderLocation (O)	A value that defines how slave node accesses the shared master working directory. You must configure this property if the value for FileTransferProtocol is set as COPY.
isOptEuidIndexAssigner (O)	An indicator of whether the default Euid assignment or optimized Euid assignment approach is used. By default, the value is not set or is set to false; the default Euid assignment approach is used for non-cluster mode. If the value is set to true, the optimized Euid assignment approach is used. This approach is used for a large volume of initial records with high match ratio and IBML running in cluster mode.

^Footnote 1(R) = Required

^Footnote 2(O) = Optional

FTP Server Configuration (Optional)

The processing properties described in the following table configure the connection information for the FTP server. They only need to be defined if the IBML Tools are run on multiple processors, and they only need to be defined for the slave processors.

Table 2-2 FTP Server Properties

Property Name	Description
ftp.server	The name of the FTP server on the master processor.
ftp.username	The user ID to log in to the FTP server.
ftp.password	This property has been deprecated. Note: This password is not stored in the FTP Server configuration file. However, you will need to enter this value when the IBML Tool prompts you for it.

Property Name

Description

ftp.server

The name of the FTP server on the master processor.

ftp.username

The user ID to log in to the FTP server.

ftp.password

This property has been deprecated.

Note: This password is not stored in the FTP Server configuration file. However, you will need to enter this value when the IBML Tool prompts you for it.

Cluster Synchronizer Database Configuration

The cluster synchronizer database is used to coordinate the activities of all IBML Tools processing data. The configuration of this section must be identical for all processors.

Table 2-3 Cluster Synchronizer Database Properties

Property Name	Description
cluster.database.url	The URL for the cluster synchronizer database. The format of the URL for the Oracle database is `jdbc:oracle:thin:@`hostname:port:database_name
cluster.database.user	The user ID to log in to the cluster synchronizer database.
cluster.database.password	This property has been deprecated. Note: This password is not stored in the Cluster Synchronizer Database configuration file. However, you will need to enter this value when the IBML Tool prompts you for it.
cluster.database.jdbc.driver	The name of the database driver class.

SQL*Loader Configuration

The SQL*Loader properties are only used if you use SQL*Loader to load the bulk data into the master person index database after it has gone through match processing and EUID assignment.

Table 2-4 SQL*Loader Property

Property Name	Description
sqlldr.userid	The connection descriptor for the master person index database. When specifying the value for sqlldr.userid, use the format: For Windows: property name="sqlldr.userid" value="<user>/%1@<dbname>" / For UNIX and Linux: property name="sqlldr.userid" value="<user>/$1@<dbname>" / Note: Here <user> refers to the database user name; <dbname> refers to the database SID. Enter %1 for Windows and $1 for UNIX and Linux. For example, property name="sqlldr.userid" value="person/%1@mpi" / Caution: Do not store the sqlldr.userid password in the SQL*Loader configuration file.

Property Name

Description

sqlldr.userid

The connection descriptor for the master person index database. When specifying the value for sqlldr.userid, use the format:

For Windows: property name="sqlldr.userid" value="<user>/%1@<dbname>" /

For UNIX and Linux: property name="sqlldr.userid" value="<user>/$1@<dbname>" /

Note: Here <user> refers to the database user name; <dbname> refers to the database SID. Enter %1 for Windows and $1 for UNIX and Linux. For example, property name="sqlldr.userid" value="person/%1@mpi" /

Caution: Do not store the sqlldr.userid password in the SQL*Loader configuration file.

Data Reader Configuration

This section defines the data reader to use, the location of the input file, and any parameters. You can define a custom data reader to read data into the Bulk Matcher if you are not using the data file output by the Data Cleanser or if your data is formatted differently than the default data reader requires. For more information about the default data reader's requirements, see Required Format for Flat Data Files. You define the custom reader in Java, and then specify the class and configure the reader using the elements and attributes in Table 2-5. You only need to configure the data reader for the master IBML Tool.

Table 2-5 Custom Data Reader Configuration Elements and Attributes

Element	Attribute	Parameter	Description
bean	-	-	A definition for one data reader. This element includes the following elements and attributes.
	id	-	A unique name for the data reader.
	class	-	The Java class to implement for the data reader. The default data reader is `com.sun.mdm.index.dataobject.DataObjectFileReader`. This reader can access flat file data in the format described in Required Format for Flat Data Files. To read multiple files from a directory, set the parameter as `com.sun.mdm.index.dataobject.DataObjectDirReader`. Any custom reader should implement interface `com.sun.mdm.index.dataobject.DataObjectReader`.
	singleton	-	An indicator of whether to use the singleton pattern for the data reader class.
	constructor-arg	-	A parameter for the data object reader. For a single file input, the reader accepts two parameters. The first parameter has a type of `java.lang.String` and specifies the location of the input file. The second parameter has a type of `boolean` and indicates whether the input file contains delimiters. Specify true if the file contains delimiters. To read multiple files from a designated directory, the reader accepts only one parameter. This parameter has a type of `java.io.File` and specifies the location of the designated directory. Note: Only one data object reader must be included in the configuration file.
		type	The data type of the parameter value.
		value	The value for the parameter.

Initial Bulk Match and Load Tool Logging Properties

The configuration file for the logging properties, logging.properties, defines how much information is logged while the IBML Tools are running. By default, logging is set to display INFO level messages and above on both the console and in a log file.

The following table lists and describes the default properties for the configuration file, but you can add new properties based on the log handlers you use. For more information about log handlers and the properties for each, see the Javadocs for the java.util.logging package.

Table 2-6 IBML Tool Logging Properties

Property Name	Description
handlers	A list of log handler classes to use, such as `java.util.logging.FileHandler` and `java.util.logging.ConsoleHandler`. Each handler you define needs to be configured according to the properties defined for the class.
level	The logging level to use. By default, this is set to INFO level logging.
java.util.logging.FileHandler.pattern	The name and location of the log files that are generated. By default, the files are located in the directory where you extracted the IBML Tools in the logs subdirectory. The files are named `loader#.log`, where `#` is an integer that is incremented each time a new log file is created. The log file with the lowest number is the most recent.
java.util.logging.FileHandler.limit	The maximum number of bytes to write to any one log file.
java.util.logging.FileHandler.count	The number of output files to cycle through.
java.util.logging.FileHandler.formatter	The name of the Java class used to format the log file. The IBML Tool provides a formatting class, `com.sun.mdm.index.loader.log.LogFormatter`, but you can define your own.
java.util.logging.ConsoleHandler.level	The level at which information is logged on the console.
java.util.logging.ConsoleHandler.formatter	The name of the Java class used to format the log information on the console. By default, the IBML Tool uses `java.util.logging.SimpleFormatter`, but you can define your own.

Performing a Match Analysis (Optional)

Before you perform the actual data matching, you can perform match analyses on a subset of the data to be loaded to determine whether the various components of the match process are configured correctly. This analysis can show whether the data blocks defined for the blocking query are returning too many or too few records and whether certain fields in the match string are inaccurately skewing the composite match weight. You can also use this analysis to determine whether the duplicate and match threshold are correct.

This is an iterative process, and you might need to run through the analysis several times before you are satisfied that the match and query configuration is optimized for your data set.

Perform the following steps to analyze the data for matching:

Running the Bulk Matcher in Analysis Mode
Reviewing the Match Analysis Results
Reconfiguring the Matching Logic

Running the Bulk Matcher in Analysis Mode

When you run the Bulk Matcher in analysis mode, use a representative sample of the actual data you are loading into the master person index database. You do not need to run the entire set of input records through the analysis.

Caution:

If you are rerunning the Bulk Matcher in analysis mode, make sure to truncate the cluster synchronizer database tables first. Otherwise, unique constraint errors occur and the run fails. To truncate the tables, run cluster-truncate.sql against the cluster synchronizer database.

To Run the Bulk Matcher in Analysis Mode

Before beginning, complete the steps under Configuring the Initial Bulk Match and Load Tool.

For each IBML Tool, open initial-bulk-loader-config.xml (located in the IBML Tool home directory in the conf subdirectory).
Set the matchAnalyzerMode property to true, and verify the remaining property settings.
Save and close the file.
To configure and run the match analysis, perform one of the following.

If the master loader is running on Windows:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.bat for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Double-click run-initial-bulk-loader.bat or type run-loader from a command line.
If the master loader is running on UNIX:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.sh for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc7.jar.
3. Close and save the file.
4. Type sh run-initial-bulk-loader.sh at the command line.
Examine the log files to be sure no errors occurred during the analysis.

Continue to Reviewing the Match Analysis Results

Reviewing the Match Analysis Results

The output of the Bulk Matcher when run in analysis mode is a html or csv file with a list of records that were automatically matched to each other (assumed matches) from the data set you analyzed. The report displays the matching weight given to each field, so you can analyze the value and accuracy of each field for matching as well as the agreement and disagreement weights (or u-probabilities and m-probabilities) defined in the matching configuration file of the master person index application.

The name of each match field is listed as column headers, the values for those fields in the two assumed match records are listed as two consequent rows, and the composite match weight and the weight for each field are listed in the third row. The Match Score column provides the composite match weight for the record set.

Figure 2-1 Match Analysis Report Excerpt - HTML

Description of ''Figure 2-1 Match Analysis Report Excerpt - HTML''

Figure 2-2 Match Analysis Report Excerpt - CSV

Description of ''Figure 2-2 Match Analysis Report Excerpt - CSV''

The csv format of the report does not provide a formatted view. To view the report in a better format (in Microsoft Excel), perform the following steps:

In the project folder, navigate to the <project_home>/loader-generated/loader/conf directory.
Double-click on the weight-analysis.xltm file.

If a Security Warning message appears, click Enable Content.
Click Select CSV File.
Browse and select the generated match analysis csv file from the <project_home>/loader-generated/loader/reports directory.
Click View Weight Analysis Report.

The formatted report appears in the weight-analysis worksheet.

Figure 2-3 Match Analysis Report Excerpt - Formatted CSV

Description of ''Figure 2-3 Match Analysis Report Excerpt - Formatted CSV''

After you perform the steps under Running the Bulk Matcher in Analysis Mode, complete the analysis by using the information in the match analysis report to perform the following steps:

Look for records that are assumed matches but should not be. This might indicate that the match threshold is set too low for the number of match fields or that one or more fields are given too much weighting relevance.
Verify that fields that uniquely identify records, such as a social security number, are given a higher weight when they match.
Verify that null fields are being handled correctly.
If the assumed match records with the lowest composite match weights are definite matches of one another, the match threshold might be set too high. You can set the match threshold lower and run another match analysis.

After you complete your analysis, you can reconfigure the matching logic as described in Reviewing the Match Analysis Results and then rerun the analysis. If your analysis shows that the matching configuration is correct and does not require any more changes, continue to Running the Bulk Match and the Bulk Load in Two Steps. If the matching configuration is correct, make sure to update the master person index application to match the new configuration.

Reconfiguring the Matching Logic

If the results of the match analysis show that you need to modify the query, thresholds, or match string, you can make the changes to the IBML Tool configuration file and run the Bulk Matcher again to analyze the new settings. Once you are satisfied with the new settings, you need to update the master person index application configuration accordingly.

To Reconfigure the Matching Logic

Complete the match analysis, as described under Reviewing the Match Analysis Results.
In the directory where the IBML Tool is located, open conf/initial-bulk-loader-config.xml.
To modify the match and duplicate thresholds for match analysis, enter new values for the duplicateThreshold and matchThreshold elements.
To modify the blocking query for match analysis, modify the query builder section (described in Initial Bulk Match and Load Tool Blocking Query Configuration).
To modify the match string for match analysis, modify the MatchingConfig section (described in Initial Bulk Match and Load Tool Match String Configuration).
Run the match analysis again, as described in Running the Bulk Matcher in Analysis Mode

After you run the analysis for the final time, continue to Running the Bulk Match and the Bulk Load in Two Steps.

Caution:

When you complete the analysis and have made the final modifications to the blocking query, matching string, and match thresholds, be sure to modify the master person index application so the processing is identical. The match string is defined in mefa.xml, the thresholds are defined in master.xml, and the blocking query is defined in query.xml. You can copy the configuration from initial-bulk-loader-config.xml directly into these files.

Running the Bulk Match and Bulk Load in One Step (SQL*Loader Only)

Note:

This procedure includes steps that were updated for Java CAPS Release 6 Update 1. The variable JDBC_JAR_PATH was previously ORACLE_JDBC_JAR, and wasn't present in all files.

After you perform the match analysis and are satisfied with the matching logic configuration, you can either run the match process alone, or you can run the match and load processes together if you are using SQL*Loader to load the data. SQL*Loader can only be used on an Oracle Healthcare Master Person Index database.

Note:

If desired, you can also perform this task using two steps. See Running the Bulk Match and the Bulk Load in Two Steps for the procedures.

To Run the Bulk Match and Bulk Load in One Step

Complete the steps under Performing a Match Analysis (Optional).
From the master IBML Tool home/scripts directory, run cluster-truncate.sql against the cluster synchronizer database.
For each IBML Tool, open initial-bulk-loader-config.xml (located in the IBML Tool home directory in the conf subdirectory).
Set the matchAnalyzerMode property to false and the BulkLoad property to true.
Define the SQL*Loader property described in SQL*Loader Configuration.
Verify that the rest of the properties are configured correctly, and then save and close the file.
To generate the loader scripts, perform one of the following.

If the master loader is running on Windows:
1. Navigate to the master IBML Tool home directory and open generate-sql-loader.bat for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Double-click generate-sql-loader.bat or type generate-sql-loader from a command line.
If the master loader is running on UNIX:
1. Navigate to the master IBML Tool home directory and open sh generate-sql-loader.sh for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Type sh generate-sql-loader.sh at the command line.
  
  A new directory named sqlldr is created in the working directory.
For UNIX only, modify the permissions for the shell scripts in the new sqlldr directory by running the following command:

chmod u+x *.sh
In the master IBML Tool home/scripts directory, run cluster-truncate.sql against the master index database to clear the cluster synchronizer tables.
In the project folder, navigate to project_home/src/DatabaseScript, and run drop-index.sql to the domain-relevant indexes.
To configure and run the load process, perform one of the following.

If the master loader is running on Windows:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.bat for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Double-click run-initial-bulk-loader.bat or type run-loader from a command line.
If the master loader is running on UNIX:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.sh for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc7.jar.
3. Close and save the file.
4. Type sh run-initial-bulk-loader.sh at the command line.
Examine the log files to be sure no errors occurred during the match process, and check the files located in the masterIndex folder of your working directory to be sure all tables were populated.

Note:
If for any reason you need to stop and restart this process, run truncate.sql in the sqlldr directory against the master index database before restarting the process.
In the project folder, navigate to project_home/src/DatabaseScript, and run create-index.sql to reinstate the dropped domain-relevant indexes.

Running the Bulk Match and the Bulk Load in Two Steps

The following procedures provide the steps to run the bulk match and the bulk load in two steps:

Running the Bulk Match and Preparing for Loading
Loading the Matched Data Into the Master Person Index Database

Note:
If desired, you can run the bulk match and bulk load in one step. See Running the Bulk Match and Bulk Load in One Step (SQL*Loader Only) for the procedure.

Running the Bulk Match and Preparing for Loading

After you perform the match analysis and are satisfied with the matching logic configuration, you are ready to match the full set of data to be loaded into the master person index database.

You can run the Bulk Matcher and the Bulk Loader all in the same step. For instructions, see Running the Bulk Match and Bulk Load in One Step (SQL*Loader Only).

To Perform the Bulk Match

Before beginning, complete the steps under Performing a Match Analysis (Optional).

From the master IBML Tool home/scripts directory, run cluster-truncate.sql against the cluster synchronizer database.
For each IBML Tool, open initial-bulk-loader-config.xml (located in the IBML Tool home directory in the conf subdirectory).
Set the matchAnalyzerMode property to false, and set the Bulk Load to false.
Verify that the rest of the properties are configured correctly, and then save and close the file.

For information about the configurable properties, see Setting Up Initial Bulk Match and Load Tool Configuration Properties.
To configure and run the match process, perform one of the following.

If the master loader is running on Windows:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.bat for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master person index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Double-click run-initial-bulk-loader.bat or type run-loader from a command line.
If the master loader is running on UNIX:
1. Navigate to the master IBML Tool home directory and open run-initial-bulk-loader.sh for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master person index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc7.jar.
3. Close and save the file.
4. Type sh run-initial-bulk-loader.sh at the command line.
Examine the log files in the logs directory of the IBML Tool home directory to verify that no errors occurred during the match process.
Examine the files located in the masterindex folder of your working directory to be sure all tables were populated.

Figure 2-4 Master Person Index Image Files

Description of ''Figure 2-4 Master Person Index Image Files''
If you configured the IBML Tool to keep the temporary folders in the working directory, you can examine the blocking, EUID, and match files for additional verification.

Figure 2-5 IBML Tool Working Directory

Description of ''Figure 2-5 IBML Tool Working Directory''
To load the data into the master person index database, see Loading the Matched Data Into the Master Person Index Database.

Loading the Matched Data Into the Master Person Index Database

The IBML Tool uses the SQL*Loader to load the master data images generated by the Bulk Matcher into the Oracle database.

To load the matched data into your master person index database, see Loading Matched Data Using SQL*Loader.

Loading Matched Data Using SQL*Loader

Note:

This procedure includes steps that were updated for Java CAPS Release 6 Update 1. The variable JDBC_JAR_PATH was previously ORACLE_JDBC_JAR, and wasn't present in all files.

Use SQL*Loader to load the matched data into the database.

To Load Matched Data Using SQL*Loader

Complete the steps under Running the Bulk Match and the Bulk Load in Two Steps.
From the master IBML Tool home/scripts directory, run cluster-truncate.sql against the cluster synchronizer database.
For each IBML Tool, open initial-bulk-loader-config.xml (located in the IBML Tool home directory in the conf subdirectory).
1. Define the SQL*Loader property as described in SQL*Loader Configuration.
2. Save and close the file.
To generate the loader script, perform one of the following.

If the master loader is running on Windows:
1. Navigate to the master IBML Tool home directory and open generate-sql-loader.bat for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master person index database platform; for example, set JDBC_JAR_PATH=C:\oracle\jdbc\lib\ojdbc7.jar.
3. Close and save the file.
4. Double-click generate-sql-loader.bat or type generate-sql-loader from a command line.
If the master loader is running on UNIX:
1. Navigate to the master IBML Tool home directory and open sh generate-sql-loader.sh for editing.
2. Change the value of the JDBC_JAR_PATH variable in the first line to the location and name of the database driver for the master person index database platform; for example, export JDBC_JAR_PATH=${oracle_home}/jdbc/lib/ojdbc7.jar.
3. Close and save the file.
4. Type sh generate-sql-loader.sh at the command line.
  
  A new directory named sqlldr is created in the working directory.
In the master IBML Tool home/scripts directory, run cluster-truncate.sql against the master person index database to clear the cluster synchronizer tables.
In the project folder, navigate to project_home/src/DatabaseScript, and run drop-index.sql to the domain-relevant indexes.
In the sqlldr directory, perform one of the following:
- On Windows, double-click bulk_loader.bat or type bulk_loader.bat from a command line.
- On UNIX, type sh bulk_loader.sh at the command line.
After the data is loaded, close any command prompts that were left open by the process and examine the SQL*Loader log files located in the sqlldr/log directory to ensure there were no errors during processing.

Note:
Any records that contained bad data and were not inserted into the master person index database are written to the sqlldr/bad directory. Any records that contained bad data and were discarded are written to the sqlldr/discard directory.
In the project folder, navigate to project_home/src/DatabaseScript, and run create-index.sql to reinstate the dropped domain-relevant indexes.

Running IBML in Phase Manner

By default, the IBML runs in the start to end mode (loadMode value is set to ENDTOEND_MODE). In this process, if any immediate sub-process fails, you need to clean up the working folders, reset the database tables, and start the entire process over again.

When you run IBML in a phase manner, the entire process is divided into the following five phases, and you run each of the phase manually in sequence:

BLOCK_DISTRIBUTION
RECORD_MATCHING
EUID_ASSIGNMENT
MASTER_INDEX_CREATION
POTENTIAL_DUPLICATE_CREATION

If execution of a phase fails, you can restart the phase without going back to the very begin to start over. Restarting a phase automatically cleans up the relevant working folders and data, and resets the database IBML synchronization tables.

For running IBML in a phase manner, execute the following command for each of the five phases in sequence from the IBML command line:

For Windows, run-loader.bat <phase_name>
For UNIX and Linux, run-loader.sh <phase_name>