The match configuration file, matchConfigFile.cfg, contains the matching logic for each field on which matching is performed. By default, this file defines the matching logic for the three primary data types (person names, business names, and addresses), and can also handle generic data types, such as dates, numbers, social security numbers, and characters.
The Master Index Match Engine provides several comparison functions that you can call in this file to fine-tune the match process. Comparison functions contain the logic to compare different types of data in very specific ways in order to arrive at a match weight for each field. These functions let you define how matching is performed for different data types, and you can use them with either matching and unmatching probabilities or agreement and disagreement weight ranges for each field. This file also defines how to handle missing fields.
The following topics describe the format of the configuration file and provide an overview of the predefined comparison functions, so that you can modify the files directly. You can also modify the match configuration file using the Master Index Configuration Editor, which provides an easy, graphical way to configure matching rules.
The match configuration file is divided into two sections. The first section consists of one line that indicates the matching probability type. The second section consists of the matching rules to use for each match field. In a master index application, this file can be modified from the Matching tab of the Master Index Configuration Editor. For more information, see Configuring the Comparison Functions for a Master Index Application in Sun Master Index Configuration Guide.
Following is an excerpt from the default match configuration file. This excerpt illustrates the components that are described in the following sections.

    ProbabilityType 1
    FirstName    15  0  uf  0.99  0.001  15  -5
    LastName     15  0  ul  0.99  0.001  15  -5
    String       25  0  ua  0.99  0.001  10  -5
    DateDays     20  0  dD  0.99  0.001  10  -10  y 15 30
    DateMonths   20  0  dM  0.99  0.001  10  -10  n
    DateHours    20  0  dH  0.99  0.001  10  -10  y 30 60
    DateMinutes  20  0  dm  0.99  0.001  10  -10  y 300 600
    DateSeconds  20  0  ds  0.99  0.001  10  -10  y 75 60
    Integer      15  0  nI  0.99  0.001  10  -10  n
    Real         15  0  nR  0.99  0.001  10  -10  n
    Char         1   0  c   0.99  0.001  5   -5
    pro          15  0  p   0.99  0.001  10  -10  20 5 5

The first line of the match configuration file defines the probability type to use for matching. Specify “0” (zero) to use m-probabilities and u-probabilities to determine a field’s match weight; specify “1” (one) to use agreement and disagreement weight ranges. If the probability type is set to use agreement and disagreement weight ranges, the m-prob and u-prob columns in the matching rules section are ignored. Likewise, if the probability type is set to use m-probabilities and u-probabilities, the agreement-weight and disagreement-weight columns in the matching rules section are ignored. The default is to use agreement and disagreement weight ranges because they are more intuitive.
For more information about probabilities and weights, see Probabilities and Direct Weights.
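As a rough illustration of how the two probability types relate, the following sketch derives agreement and disagreement weights from m-probabilities and u-probabilities. It assumes the standard Fellegi-Sunter log-ratio formula with base-2 logarithms; this document states only that a logarithmic formula is used, so the exact formula the Master Index Match Engine applies may differ. The function names `weights_from_probabilities` and `field_weights` are hypothetical.

```python
import math


def weights_from_probabilities(m_prob, u_prob):
    """Return (agreement, disagreement) weights for one field.

    Assumption: standard Fellegi-Sunter log-ratio weights, base-2 logs.
    """
    if not (0.0 < m_prob < 1.0 and 0.0 < u_prob < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    agreement = math.log2(m_prob / u_prob)
    disagreement = math.log2((1.0 - m_prob) / (1.0 - u_prob))
    return agreement, disagreement


def field_weights(probability_type, m_prob, u_prob, agreement, disagreement):
    """Pick the weight pair that applies for the given ProbabilityType.

    Type 0 uses the probability columns and ignores the weight columns;
    type 1 uses the weight columns and ignores the probability columns.
    """
    if probability_type == 0:
        return weights_from_probabilities(m_prob, u_prob)
    if probability_type == 1:
        return agreement, disagreement
    raise ValueError("ProbabilityType must be 0 or 1")


# Values taken from the FirstName row of the excerpt: 0.99 0.001 15 -5
print(field_weights(1, 0.99, 0.001, 15, -5))
print(field_weights(0, 0.99, 0.001, 15, -5))
```

With probability type 1 the configured range (15, -5) is returned unchanged. With probability type 0 the range is derived from the probabilities instead, which is why the two pairs of columns are never used together.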
The section after the first line of the match configuration file contains match field rows, with each row defining how a certain data type or field will be matched. These are the rules you specify in the match string you define for a master index application. The syntax for this section is:
match-field size null-field function m-prob u-prob agreement disagreement params data-sources
The following table describes each element in a match field row.
Table 1 Match Configuration File Columns
| Column Number | Column Name | Description |
|---|---|---|
| 1 | match-field | A value that indicates to the Master Index Match Engine how each field should be weighted. Each field included in the match string (the MatchingConfig section of mefa.xml) must have a match type corresponding to a value in this column. |
| 2 | size | The number of characters in the field on which matching is performed, beginning with the first character. For example, to match on only the first four characters in a 10-digit field, the value of this column should be “4”. |
| 3 | null-field | An index that specifies how to calculate the total weight for null fields or fields that contain only spaces. Note: the agreement and disagreement weights used in this calculation are either specified in the file or calculated using a logarithmic formula based on the m-probabilities and u-probabilities (depending on the probability type). |
| 4 | function | The type of comparison to perform when weighting the field. For information about the available comparison functions, see Master Index Match Engine Comparison Functions and Options. An overview of the comparison functions is provided in Table 2. |
| 5 | m-prob | The initial probability that the specified field in two records will match if the records match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
| 6 | u-prob | The initial probability that the specified field in two records will match if the records do not match. The probability is a double value between 0 and 1, and can have up to 16 decimal places. |
| 7 | agreement | The matching weight to be assigned to a field given that the fields match between two records. This number can be between 0 and 100 and can have up to 16 decimal places. It represents the maximum match weight for a field. |
| 8 | disagreement | The matching weight to be assigned to a field given that the fields do not match between two records. This number can be between 0 and -100 and can have up to 16 decimal places. It represents the minimum match weight for a field. |
| 9 | params | The parameters that correspond to the comparison function specified in column 4. Some comparison functions do not take any parameters and some take multiple parameters. For additional information about parameters, see Master Index Match Engine Comparison Functions and Options. |
| 10 | dataSources | The complete path to any data sources used by the comparison function specified in column 4. You can define as many data sources as there are data sources listed for the comparator in the comparators list file. The default comparators do not use data sources, but you can create a custom comparator that does. |
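To make the column layout concrete, here is a minimal sketch of a reader for this file format. It is not part of the Master Index Match Engine; the `MatchFieldRow` class and `parse_match_config` function are hypothetical names. It assumes columns are separated by whitespace, as in the excerpt. Because parameters and data sources cannot be told apart without the comparator definition, anything after column 8 is kept as raw tokens, and the null-field value is kept as a string because its allowed values are not listed here.

```python
from dataclasses import dataclass, field


@dataclass
class MatchFieldRow:
    match_field: str
    size: int
    null_field: str
    function: str
    m_prob: float
    u_prob: float
    agreement: float
    disagreement: float
    extras: list = field(default_factory=list)  # params and data sources, unparsed


def parse_match_config(text):
    """Split the file into its two sections: probability type and rule rows."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    keyword, value = lines[0].split()
    if keyword != "ProbabilityType":
        raise ValueError("first line must define ProbabilityType")
    rows = []
    for ln in lines[1:]:
        t = ln.split()
        if len(t) < 8:
            raise ValueError(f"row has fewer than 8 columns: {ln!r}")
        rows.append(MatchFieldRow(t[0], int(t[1]), t[2], t[3],
                                  float(t[4]), float(t[5]),
                                  float(t[6]), float(t[7]), t[8:]))
    return int(value), rows


# Three rows copied from the default file excerpt.
SAMPLE = """
ProbabilityType 1
FirstName 15 0 uf 0.99 0.001 15 -5
DateDays 20 0 dD 0.99 0.001 10 -10 y 15 30
pro 15 0 p 0.99 0.001 10 -10 20 5 5
"""

probability_type, rows = parse_match_config(SAMPLE)
for row in rows:
    print(row.match_field, row.function, row.extras)
```

Keeping the trailing tokens unparsed is a deliberate choice: how many parameters a row carries depends on the comparison function in column 4, so interpreting them requires the comparator's own definition.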
Match field comparison functions, or comparators, compare the values of a field in two records to determine whether the fields match. The fields are then assigned a matching weight based on the results of the comparison function. You can use several different types of comparison functions in the match configuration file to define how the Master Index Match Engine should match the fields in the match string. The Master Index Match Engine provides several options to use with each function. You can also define custom comparison functions. For more information, see Creating Custom Comparators for the Master Index Match Engine.
The following table summarizes each comparison function. A complete reference of the comparison functions and their parameters is included in Master Index Match Engine Comparison Functions and Options.
The names of these comparison functions are configurable. The following table lists their default names.
The comparator definition list defines each comparator that is included in a master index application. If a comparator is not included in this list, it cannot be used in the application. If you define a comparator in this list that is not provided with the Master Index Match Engine, you need to define the logic of the new comparator in Java classes. For more information, see Creating Custom Comparators for the Master Index Match Engine.
Below is an excerpt from the default comparators list file that defines two numeric comparators, Real Number Comparator and Integer Comparator. Both comparators take two parameters, and are dependent on a second comparator class named CondensedStringComparator.

    <comparator description="Numerics comparator">
      <className>NumericsComparator</className>
      <codes>
        <code description="Real Number Comparator" name="n[R, ]"/>
        <code description="Integer Comparator" name="nI" />
      </codes>
      <params>
        <param description="distance/string comparison option" name="switch" type="java.lang.String"/>
        <param description="Spectrum of comparison" name="range" type="java.lang.Integer|java.lang.Double"/>
      </params>
      <data-sources/>
      <dependency-classes>
        <dependency-class matchfield="CSC" name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
      </dependency-classes>
      <curve-adjust status="false"/>
    </comparator>

The comparators are defined in XML format. The following table lists and describes each element in the XML file.
Table 3 Comparator Definition List Elements
| Element | Attribute | Description |
|---|---|---|
| group | | An element that contains a list of comparators that all share the same Java package. |
| | description | A brief description of the comparator group. |
| | path | The Java package that contains the code that defines the comparators in the group. |
| comparator | | A definition for one subgroup of comparators that are all based on the same Java class, have the same Java class dependencies, accept the same parameters and data sources, and have the same curve adjustment setting. |
| | description | A brief description of the comparator subgroup. |
| className | | The name of the class that defines the logic for the comparators. The class must be contained in the package specified for the group element, as described above. |
| codes | | A container element for a list of the comparators in the subgroup, with descriptions and processing codes of each comparator. |
| code | | A description and processing code for one comparator. |
| | description | A description of the comparator. The value you specify here appears in the comparator drop-down list on the Master Index Configuration Editor. |
| | name | A unique identifying name for the comparator. These are the comparator names used in the rules definitions in the match configuration file (matchConfigFile.cfg). |
| params | | A container element for a list of static parameters for the subgroup of comparators. Parameters are optional. |
| param | | One parameter definition for the comparators. |
| | description | A brief description of the parameter. |
| | name | A short name for the parameter. |
| | type | The Java data type of the values that can be specified for the parameter. |
| data-sources | | A container element for a list of data files that contain additional information for the subgroup of comparators. For example, a comparator that generates weights based on the distance between postal codes might use lookup files containing information about the zip codes. Data sources are optional. |
| data-source | | A definition for one data source. Currently, only file data sources are supported. |
| | description | A brief description of the data source. |
| | name | The complete path and filename of the data source. |
| | type | The type of data source being used. Currently, the only value you can specify is “java.io.File”. |
| dependency-classes | | A container element that defines a list of Java classes on which the comparator class is dependent. The current comparator class inherits from the comparator classes you specify here as well as all the match fields (defined in matchConfigFile.cfg) that use that comparator. |
| dependency-class | | A definition for one comparator class, called a dependency comparator, on which the current comparator class is dependent. |
| | matchField | The name of the dependency comparator's match field. |
| | name | The name of the dependency comparator class. |
| curve-adjust | | An indicator of whether to apply special adjustments to the weighting curve. The curve adjustment is defined for each comparator individually in a Java class named comparator_nameCurveAdjustor. |
| | status | The status of the curve adjustor. Specify true to use the curve adjustor; specify false to disable the curve adjustor. |
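The element structure described above can be inspected with any XML parser. The following sketch uses Python's standard `xml.etree.ElementTree` module to summarize the numeric comparator definition from the default comparators list file. The `summarize_comparator` function is a hypothetical helper, not a Master Index API, and the code name `n[R, ]` is kept as a literal string because its bracket notation is not explained here.

```python
import xml.etree.ElementTree as ET

# The comparator definition copied from the default comparators list file.
COMPARATOR_XML = """
<comparator description="Numerics comparator">
  <className>NumericsComparator</className>
  <codes>
    <code description="Real Number Comparator" name="n[R, ]"/>
    <code description="Integer Comparator" name="nI"/>
  </codes>
  <params>
    <param description="distance/string comparison option" name="switch" type="java.lang.String"/>
    <param description="Spectrum of comparison" name="range" type="java.lang.Integer|java.lang.Double"/>
  </params>
  <data-sources/>
  <dependency-classes>
    <dependency-class matchfield="CSC" name="com.sun.mdm.matcher.comparators.base.CondensedStringComparator"/>
  </dependency-classes>
  <curve-adjust status="false"/>
</comparator>
"""


def summarize_comparator(xml_text):
    """Collect the main settings of one <comparator> element into a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        "class": root.findtext("className"),
        "codes": [c.get("name") for c in root.findall("codes/code")],
        "params": [(p.get("name"), p.get("type"))
                   for p in root.findall("params/param")],
        "data_sources": [d.get("name")
                         for d in root.findall("data-sources/data-source")],
        "dependencies": [d.get("name")
                         for d in root.findall("dependency-classes/dependency-class")],
        "curve_adjust": root.find("curve-adjust").get("status") == "true",
    }


summary = summarize_comparator(COMPARATOR_XML)
for key, value in summary.items():
    print(key, "=", value)
```

A summary like this makes it easy to see which function codes a comparator subgroup exposes to the match configuration file, and how many parameters a row that uses one of those codes is expected to carry.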