E.5.1.1 Configuring the Fuzzy Name Matcher Utility
The Fuzzy Name Matcher utility can be used through Ingestion Manager as a standalone Fuzzy Name Matcher, or through BD Datamaps.
To use the Fuzzy Matcher Utility through the Ingestion Manager as a standalone Fuzzy Name Matcher, refer to Executing the Fuzzy Name Matcher Utility. Configure the Fuzzy Name Matcher by modifying <ingestion_manager>/fuzzy_match/mantas_cfg/install.cfg.
To use the Fuzzy Matcher Utility through BD Datamaps, use the datamap files (NameMatchStaging.xml, RegOToBorrower.xml) in the folder <OFSAAI Installed Directory>/bdf/config/datamaps. For more information, refer to Chapter 4, “Managing Data.” Configure the Fuzzy Name Matcher by modifying <ingestion_manager>/fuzzy_match/mantas_cfg/install.cfg.
The following section provides a sample configuration appearing in <OFSAAI Installed Directory>/bdf/fuzzy_match/mantas_cfg/install.cfg.
Sample BDF.xml Configuration Parameters
###############################################################
#
# Fuzzy Name Matcher System Properties file (install.cfg)
#
###############################################################
#
# Log configuration items
#
# Specify which priorities are enabled in a hierarchical fashion, i.e., if
# DIAGNOSTIC priority is enabled, NOTICE, WARN, and FATAL are also enabled,
# but TRACE is not.
# Uncomment the desired log level to turn on appropriate level(s).
# Note, DIAGNOSTIC logging is used to log database statements and will slow
# down performance. Only turn on if you need to see the SQL statements being
# executed.
# TRACE logging is used for debugging during development. Also only turn on
# TRACE if needed.
#log.fatal=true
#log.warning=true
log.notice=true
#log.diagnostic=true
#log.trace=true
# Specify where a message should get logged -- the choices are mantaslog,
# syslog, console, or a filename (with its absolute path).
# Note that if this property is not specified, logging will go to the console.
log.default.location=mantaslog
# Specify the location (directory path) of the mantaslog, if the mantaslog
# was chosen as the log output location anywhere above.
# Logging will go to the console if mantaslog was selected and this property is
# not given a value.
log.mantaslog.location=mp
#
# Fuzzy Name Matcher configuration items
#
fuzzy_name.match_multi=true
fuzzy_name.file.delimiter=~
fuzzy_name.default.prefix=P
fuzzy_name.max.threads=1
fuzzy_name.max.names.per.thread=1000
fuzzy_name.max.names.per.process=250000
fuzzy_name.min.intersection.first.letter.count=2
fuzzy_name.temp_file.directory=/scratch/ofsaaapp/BD805/BD805/bdf/data/temp
fuzzy_name.B.stopword_file=/scratch/ofsaaapp/BD805/BD805/bdf/fuzzy_match/share/stopwords_b.dat
fuzzy_name.B.match_threshold=80
fuzzy_name.B.initial_match_score=75.0
fuzzy_name.B.initial_match_p1=2
fuzzy_name.B.initial_match_p2=1
fuzzy_name.B.extra_token_match_score=100.0
fuzzy_name.B.extra_token_min_match=2
fuzzy_name.B.extra_token_pct_decrease=50
fuzzy_name.B.first_first_match_score=1
fuzzy_name.P.stopword_file=/scratch/ofsaaapp/BD805/BD805/bdf/fuzzy_match/share/stopwords_p.dat
fuzzy_name.P.match_threshold=70
fuzzy_name.P.initial_match_score=75.0
fuzzy_name.P.initial_match_p1=2
fuzzy_name.P.initial_match_p2=1
fuzzy_name.P.extra_token_match_score=50.0
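The sample file uses a key=value layout in which lines starting with # are comments, and the B (corporate) and P (personal) settings share the same parameter names under different prefixes. As an illustration only, the following Python sketch reads such a file and groups the per-prefix settings. It is not the utility's own parser: the function names `parse_install_cfg` and `settings_for_prefix` are hypothetical, the embedded text is a shortened excerpt of the sample above, and the parsing rules are deliberately simplified (no escapes or line continuations).

```python
def parse_install_cfg(text):
    """Return a dict of key -> value, skipping blank lines and # comments.

    Simplified assumption: one key=value pair per line.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # commented-out entries such as #log.trace=true are ignored
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def settings_for_prefix(props, prefix):
    """Collect fuzzy_name.<prefix>.* entries, keyed by the bare parameter name."""
    head = f"fuzzy_name.{prefix}."
    return {k[len(head):]: v for k, v in props.items() if k.startswith(head)}


# Shortened excerpt of the sample configuration, for illustration.
SAMPLE = """\
#log.trace=true
log.notice=true
fuzzy_name.default.prefix=P
fuzzy_name.B.match_threshold=80
fuzzy_name.P.match_threshold=70
fuzzy_name.P.initial_match_score=75.0
"""

props = parse_install_cfg(SAMPLE)
business = settings_for_prefix(props, "B")
# Fall back to the default configuration set named in the file.
default_set = settings_for_prefix(props, props["fuzzy_name.default.prefix"])
print(business)
print(default_set)
```

Grouping by prefix makes it easy to compare the corporate and personal settings side by side before editing the file.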
The following table describes the utility’s configuration parameters as they appear in the BDF.xml file. Note that all scores have percentage values.
Table E-1 Fuzzy Name Matcher Parameters
Parameter | Description |
---|---|
fuzzy_name.<prefix>.stopword_file | Identifies the file that stores the stop word list. The stop word file is either corporate or personal. The <prefix> token identifies corporate as B and personal as P. Stop words are words such as Corp, Inc, Mr, Mrs, or the, which do not add value when comparing names. |
fuzzy_name.<prefix>.match_threshold | Indicates the score above which two names are considered to match each other. The utility uses this parameter only when the match_multi property has a value of true. The allowable range is from 0 to 100. |
fuzzy_name.<prefix>.initial_match_score | Specifies the score given for matching to an initial. The allowable range is 0 to 100; the recommended default is 75. |
fuzzy_name.<prefix>.initial_match_p1 | Specifies the number of token picks that must be made before awarding initial_match_score. The value is an integer >= 0. The default value is 2. |
fuzzy_name.<prefix>.initial_match_p2 | Specifies the number of token picks that must be made before awarding initial_match_score if only initials remain in one name. The value is an integer >= 0. The default value is 1. |
fuzzy_name.<prefix>.extra_token_match_score | Indicates the score given to extra tokens. The allowable range is 0 to 100; the recommended default is 50. |
fuzzy_name.<prefix>.extra_token_min_match | Specifies the minimum number of matches that must occur before awarding extra_token_match_score. The value is any integer >= 0. The recommended setting for corporations is 1; for personal names, 2. |
fuzzy_name.<prefix>.extra_token_pct_decrease | Determines how extra_token_match_score is reduced when multiple extra tokens are present. For each additional extra token, the utility multiplies the score by this percentage. For example, if extra_token_match_score is 50 and extra_token_pct_decrease is 50 (percent), the first extra token gets 50 percent, the second extra token gets 25 percent, the third gets 12.5 percent, the fourth 6.25 percent, the fifth 3.125 percent, and so on. The allowable range is 0 to 100. The recommended percentage for corporations is 100 (percent); for personal names, 50 (percent). |
fuzzy_name.<prefix>.first_first_match_score | Allows the final score to be more heavily influenced by how well the first token of name #1 matches the first token of name #2. The allowable value is any real number >= 0. The recommended value for corporate names is 1.0; for personal names, 0.0. |
fuzzy_name.match_multi | Determines how to handle multiple matches above the match_threshold value. If set to “true,” the utility returns multiple matches. If set to “false,” it returns only the match with the highest score. |
fuzzy_name.file.delimiter | Specifies the delimiter character used to separate the columns in the result file and the target name list file. |
fuzzy_name.min.intersection.first.letter.count | Specifies the number of words per name whose first letters must match. For example, if the value is 1, only the first letter of the first or last name has to match to qualify. If the value is 2, the first letters of both the first and last names have to match to qualify. Warning: By default, the value is set to 2. Oracle recommends using the default value. Do not change the value to 1, or system performance may slow down. |
fuzzy_name.default.prefix | Specifies the configuration set (prefix) that the utility defaults to for entries that are not specified as business or personal names. |
fuzzy_name.max.names.per.process | Determines whether the fuzzy matcher algorithm runs as a single process or as multiple sequential processes. If the total number of names in the candidate name list and the target name list combined does not exceed the value of this property, a single process runs. If the total exceeds this value, multiple processes run, based on how far the value is exceeded. For example, if the candidate name list contains 50 names, the target name list contains 50 names, and fuzzy_name.max.names.per.process is set to 200, one process runs (because the total number of names, 100, does not exceed 200). If the candidate name list contains 400 names, the target name list contains 200 names, and fuzzy_name.max.names.per.process is set to 300, four processes run (each with 100 candidate names and 200 target names, so that the number of names per process never exceeds 300). Breaking one large fuzzy matcher process into multiple processes through this property can help to overcome per-process memory limitations imposed by certain Behavior Detection architectures. |
fuzzy_name.max.threads | This parameter controls the number of threads to use when Fuzzy Name Matcher is being run. Oracle recommends that this value is not set to a number higher than the number of processing cores on the system. |
fuzzy_name.max.names.per.thread | This parameter keeps the processing threads balanced so that they perform work throughout the course of the fuzzy matcher job. That is, instead of splitting the number of names to process evenly across the threads, the value of this parameter can be set to a smaller batch-size of names so that threads that finish ahead of others can keep working. |
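
Two of the parameters in the table involve arithmetic that may be easier to follow when worked through: the per-token reduction controlled by extra_token_pct_decrease, and the process count implied by fuzzy_name.max.names.per.process. The Python sketch below reproduces only the worked examples given in the table. The function names are hypothetical, and the splitting rule (every process receives the full target name list while the candidate names are divided) is an assumption inferred from the 400/200/300 example, not a statement of how the utility is implemented.

```python
import math


def extra_token_scores(match_score, pct_decrease, n_tokens):
    """Score for the 1st, 2nd, ... extra token.

    Each additional extra token's score is the previous one multiplied
    by pct_decrease percent.
    """
    factor = pct_decrease / 100.0
    return [match_score * factor ** k for k in range(n_tokens)]


def process_count(n_candidates, n_targets, max_names_per_process):
    """Number of sequential processes under the assumed splitting rule."""
    total = n_candidates + n_targets
    if total <= max_names_per_process:
        return 1  # the total does not exceed the limit: single process
    # Assumption: each process carries the whole target list, so only the
    # remaining capacity is available for candidate names.
    candidates_per_process = max_names_per_process - n_targets
    if candidates_per_process <= 0:
        raise ValueError("target list alone reaches the per-process limit")
    return math.ceil(n_candidates / candidates_per_process)


print(extra_token_scores(50.0, 50, 5))   # [50.0, 25.0, 12.5, 6.25, 3.125]
print(process_count(50, 50, 200))        # 1
print(process_count(400, 200, 300))      # 4
```

With extra_token_pct_decrease set to 100, as recommended for corporations, the factor is 1.0 and every extra token receives the full extra_token_match_score.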