E.6.1 Using the Fuzzy Name Matcher Utility
The utility typically runs as part of automated processing that a job scheduling tool such as Maestro or Unicenter AutoSys manages. You can also execute the utility through a UNIX shell script, which the next section describes.
Configuring the Fuzzy Name Matcher Utility
You can use the Fuzzy Name Matcher in either of the following ways:
- Through Ingestion Manager as a standalone Fuzzy Name Matcher. For more information, refer to Executing the Fuzzy Name Matcher Utility. To configure Fuzzy Name Matcher, modify <ingestion_manager>/fuzzy_match/mantas_cfg/install.cfg.
- Through the Datamaps files (NameMatchStaging.xml, RegOToBorrower.xml) in the folder <OFSAAI Installed Directory>/bdf/config/datamaps. For more information, refer to Managing Data. To configure Fuzzy Name Matcher, modify <ingestion_manager>/fuzzy_match/mantas_cfg/install.cfg.
<OFSAAI Installed Directory>/bdf/fuzzy_match/mantas_cfg/install.cfg
###############################################################
#
# Fuzzy Name Matcher System Properties file (install.cfg)
#
###############################################################
#--------------------------------------------------------------
# Log configuration items
#--------------------------------------------------------------
# Specify which priorities are enabled in a hierarchical fashion, i.e., if
# DIAGNOSTIC priority is enabled, NOTICE, WARN, and FATAL are also enabled,
# but TRACE is not.
# Uncomment the desired log level to turn on appropriate level(s).
# Note, DIAGNOSTIC logging is used to log database statements and will slow
# down performance. Only turn on if you need to see the SQL statements being
# executed.
# TRACE logging is used for debugging during development. Also only turn on
# TRACE if needed.
#log.fatal=true
#log.warning=true
log.notice=true
#log.diagnostic=true
#log.trace=true
# Specify where a message should get logged -- the choices are mantaslog,
# syslog, console, or a filename (with its absolute path).
# Note that if this property is not specified, logging will go to the console.
log.default.location=mantaslog
# Specify the location (directory path) of the mantaslog, if the mantaslog
# was chosen as the log output location anywhere above.
# Logging will go to the console if mantaslog was selected and this property is
# not given a value.
log.mantaslog.location=mp
#--------------------------------------------------------------
# Fuzzy Name Matcher configuration items
#--------------------------------------------------------------
fuzzy_name.match_multi=true
fuzzy_name.file.delimiter=~
fuzzy_name.default.prefix=P
fuzzy_name.max.threads=1
fuzzy_name.max.names.per.thread=1000
fuzzy_name.max.names.per.process=250000
fuzzy_name.min.intersection.first.letter.count=2
fuzzy_name.temp_file.directory=/scratch/ofsaaapp/BD805/BD805/bdf/data/temp
fuzzy_name.B.stopword_file=/scratch/ofsaaapp/BD805/BD805/bdf/fuzzy_match/share/stopwords_b.dat
fuzzy_name.B.match_threshold=80
fuzzy_name.B.initial_match_score=75.0
fuzzy_name.B.initial_match_p1=2
fuzzy_name.B.initial_match_p2=1
fuzzy_name.B.extra_token_match_score=100.0
fuzzy_name.B.extra_token_min_match=2
fuzzy_name.B.extra_token_pct_decrease=50
fuzzy_name.B.first_first_match_score=1
fuzzy_name.P.stopword_file=/scratch/ofsaaapp/BD805/BD805/bdf/fuzzy_match/share/stopwords_p.dat
fuzzy_name.P.match_threshold=70
fuzzy_name.P.initial_match_score=75.0
fuzzy_name.P.initial_match_p1=2
fuzzy_name.P.initial_match_p2=1
fuzzy_name.P.extra_token_match_score=50.0
fuzzy_name.P.extra_token_min_match=2
fuzzy_name.P.extra_token_pct_decrease=50
fuzzy_name.P.first_first_match_score=0
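To illustrate how the listing above is organized, the following Python sketch reads key=value lines, skips comment lines, and groups the prefixed settings into a B set and a P set. It is a minimal sketch, not part of the utility: the function name load_fuzzy_config and the inline sample are hypothetical, and the sketch assumes that each property occupies a single line.

```python
def load_fuzzy_config(text):
    """Parse install.cfg-style text into (general, per_prefix) dictionaries.

    Hypothetical helper for illustration; not part of the Fuzzy Name Matcher.
    """
    general, per_prefix = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key, value = key.strip(), value.strip()
        parts = key.split(".")
        # Keys such as fuzzy_name.B.match_threshold carry a one-letter prefix.
        if len(parts) == 3 and parts[0] == "fuzzy_name" and parts[1] in ("B", "P"):
            per_prefix.setdefault(parts[1], {})[parts[2]] = value
        else:
            general[key] = value
    return general, per_prefix


# Short excerpt of the listing, used only as sample input.
sample = """\
log.notice=true
fuzzy_name.default.prefix=P
fuzzy_name.B.match_threshold=80
fuzzy_name.P.match_threshold=70
"""

general, per_prefix = load_fuzzy_config(sample)
# Look up the configuration set named by fuzzy_name.default.prefix.
default_set = per_prefix[general["fuzzy_name.default.prefix"]]
print(default_set["match_threshold"])  # prints 70
```

Grouping by prefix makes it easy to compare the B and P values side by side before changing either set.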
Table E-3 Fuzzy Name Matcher Utility Configuration Parameters
Parameter | Description |
---|---|
fuzzy_name.<prefix>.stopword_file | Identifies the file that stores the stop word list. The stop word file is either corporate or personal. The <prefix> token identifies corporate as B and personal as P. Certain words, such as Corp, Inc, Mr, Mrs, or the, do not add value when comparing names. |
fuzzy_name.<prefix>.match_threshold | Indicates the score above which two names are considered to match each other. The utility uses this parameter only when the match_multi property has a value of true. The allowable range is from 0 to 100. |
fuzzy_name.<prefix>.initial_match_score | Specifies the score given for matching to an initial. The allowable range is 0 to 100; the recommended default is 75. |
fuzzy_name.<prefix>.initial_match_p1 | Specifies the number of token picks that must be made before awarding initial_match_score. The value is an integer >= 0. The default value is 2. |
fuzzy_name.<prefix>.initial_match_p2 | Specifies the number of token picks that must be made before awarding initial_match_score if only initials remain in one name. The value is an integer >= 0. The default value is 1. |
fuzzy_name.<prefix>.extra_token_match_score | Indicates the score given to extra tokens. The allowable range is 0 to 100; the recommended default is 50. |
fuzzy_name.<prefix>.extra_token_min_match | Specifies the minimum number of matches that must occur before awarding extra_token_match_score. The value is any integer >= 0. The recommended setting for corporations is 1; for personal names, 2. |
fuzzy_name.<prefix>.extra_token_pct_decrease | Determines how the extra_token_match_score value is reduced when multiple extra tokens are present. For each additional extra token, the utility multiplies the score by this percentage. For example, if extra_token_match_score is 50 and extra_token_pct_decrease is 50 (percent), the first extra token gets 50 percent, the second extra token gets 25 percent, the third gets 12.5 percent, the fourth 6.25 percent, the fifth 3.125 percent, and so on. The allowable range is 0 to 100. The recommended percentage for corporations is 100 (percent); for personal names, 50 (percent). |
fuzzy_name.<prefix>.first_first_match_score | Allows the final score to be more heavily influenced by how well the first token of name #1 matches the first token of name #2. The allowable value is any real number >= 0. The recommended value for corporate names is 1.0; for personal names, 0.0. |
fuzzy_name.match_multi | Determines how to handle multiple matches above the match_threshold value. If set to “true,” the utility returns multiple matches. If set to “false,” it returns only the match with the highest score. |
fuzzy_name.file.delimiter | Specifies the delimiter character used to separate the columns in the result file and the target name list file. |
fuzzy_name.min.intersection.first.letter.count | Specifies the number of words per name whose first letters must match. For example, if the value is 1, only the first letter of the first or last name has to match to qualify. If the value is 2, the first letters of both the first and last names have to match to qualify. Warning: By default, the value is set to 2. Oracle recommends using the default value. Do not change the value to 1, or your system performance may slow down. |
fuzzy_name.default.prefix | Specifies the configuration set to use for entries that are not identified as business or personal names. |
fuzzy_name.max.names.per.process | This property variable determines whether or not the fuzzy matcher algorithm will be run as a single process or as multiple sequential processes. If the total number of names between both the candidate name list and the target name list is less than the value of this property, then a single process will be run. If the number of names exceeds this property’s value, then multiple processes will be run, based on how far the value is exceeded. For example, if the candidate name list contains 50 names, the target name list contains 50 names, and the fuzzy_name.max.names.per.process property is set to 200, then one process will be run (because the total number of names, 100, does not exceed 200). If the candidate list contains 400 names, the target name list contains 200 names, and the fuzzy_name.max.names.per.process property is set to 300, then four processes will be run (each with 100 candidate names and 200 target names so that the max number of names per process never exceeds 300). The ability to break apart one large fuzzy matcher process into multiple processes through this property can help to overcome per-process memory limitations imposed by certain Behavior Detection architectures. |
fuzzy_name.max.threads | Controls the number of threads to use when Fuzzy Name Matcher runs. Oracle recommends that you do not set this value higher than the number of processing cores on the system. |
fuzzy_name.max.names.per.thread | This parameter keeps the processing threads balanced so that they perform work throughout the course of the fuzzy matcher job. That is, instead of splitting the number of names to process evenly across the threads, the value of this parameter can be set to a smaller batch-size of names so that threads that finish ahead of others can keep working. |
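The two worked examples in Table E-3 (the extra-token reduction and the split into multiple processes) can be reproduced with a short Python sketch. The function names are hypothetical, and the splitting rule is an assumption inferred from the table's example, in which only the candidate name list is divided and every process receives the full target name list. The utility's actual algorithm may differ, including at the boundary where the total equals the limit.

```python
import math


def extra_token_scores(match_score, pct_decrease, count):
    """Value awarded to each successive extra token (first, second, ...)."""
    factor = pct_decrease / 100.0
    return [match_score * factor ** i for i in range(count)]


def process_count(candidates, targets, max_names_per_process):
    """Number of sequential processes implied by the table's examples.

    Assumption: only the candidate list is split, and every process
    receives the full target list.
    """
    if candidates + targets <= max_names_per_process:
        return 1
    per_process = max_names_per_process - targets
    if per_process <= 0:
        raise ValueError("target list alone fills the per-process limit")
    return math.ceil(candidates / per_process)


print(extra_token_scores(50, 50, 5))   # [50.0, 25.0, 12.5, 6.25, 3.125]
print(process_count(50, 50, 200))      # 1
print(process_count(400, 200, 300))    # 4
```

With extra_token_pct_decrease set to 100, every extra token keeps the full extra_token_match_score, which is the recommended behavior for corporate names.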
Executing the Fuzzy Name Matcher Utility
To execute the Fuzzy Name Matcher Utility manually, type the following at the UNIX command line:
fuzzy_match.sh -t <target_name_list> -c <candidate_name_list> -r <result_file>
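If a scheduler or wrapper program launches the utility, the same command line can be assembled programmatically. The following Python sketch shows one way to do this. The function names and file names are hypothetical, fuzzy_match.sh is assumed to be on the PATH, and the sketch assumes (this section does not state it) that the script returns a non-zero exit status on failure.

```python
import subprocess


def build_command(target_list, candidate_list, result_file,
                  script="fuzzy_match.sh"):
    """Return the argument list for one Fuzzy Name Matcher run."""
    return [script, "-t", target_list, "-c", candidate_list, "-r", result_file]


def run_matcher(target_list, candidate_list, result_file):
    """Run the utility; raises CalledProcessError on a non-zero exit status."""
    cmd = build_command(target_list, candidate_list, result_file)
    subprocess.run(cmd, check=True)


# Hypothetical file names, shown only to illustrate the argument order.
print(build_command("targets.dat", "candidates.dat", "results.dat"))
```

Passing the arguments as a list avoids shell quoting problems when a file path contains spaces.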