You can modify the default stemming dictionaries by running Dgidx with the --stemming-updates flag and specifying an XML file that contains the updates to the dictionary that you want to make. The update file can include both additions and deletions. Dgidx processes the file by adding and deleting entries in the static stemming dictionary file.
The default static stemming dictionary files are stored in Endeca\MDEX\version\conf\stemming (on Windows) and /usr/local/endeca/MDEX/version/conf/stemming (on UNIX).
For most supported languages, the stemming directory contains two types of stemming dictionaries per language. One dictionary (<RFC 3066 Language Code>_word_forms_collection.xml) contains stemming entries that support accented characters for the particular <RFC 3066 Language Code>.
The other dictionary (<RFC 3066 Language Code>-x-folded_word_forms_collection.xml) contains stemming entries in which all accented characters have been folded down (removed) for the particular <language_code>. If present, this is the static stemming dictionary that is used if you specify --diacritic-folding. For details about how to map accented characters to unaccented characters, refer to the Oracle Commerce Guided Search Internationalization Guide.
Each entry in a static stemming dictionary is defined by an <ADD_WORD_FORMS> element and its sub-element <WORD_FORMS_COLLECTION>. For example, the following entry adds apple and its plural form apples to the static stemming dictionary:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <ADD_WORD_FORMS>
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>apple</WORD_FORM>
        <WORD_FORM>apples</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
  </ADD_WORD_FORMS>
</WORD_FORMS_COLLECTION_UPDATES>
You specify stemming entries to delete in a <REMOVE_WORD_FORMS_KEYS> element. All word forms that correspond to that key are deleted. For example, the following XML deletes aalborg and all of its stemmed variants from the static stemming dictionary:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <REMOVE_WORD_FORMS_KEYS>
    <WORD_FORM>aalborg</WORD_FORM>
  </REMOVE_WORD_FORMS_KEYS>
</WORD_FORMS_COLLECTION_UPDATES>
You can also specify a combination of deletes and adds. Deletes are processed before adds. For example, the following XML removes aachen and then adds it and several stemmed variants of it:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <REMOVE_WORD_FORMS_KEYS>
    <WORD_FORM>aachen</WORD_FORM>
  </REMOVE_WORD_FORMS_KEYS>
  <ADD_WORD_FORMS>
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>aachen</WORD_FORM>
        <WORD_FORM>aachens</WORD_FORM>
        <WORD_FORM>aachenes</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
  </ADD_WORD_FORMS>
</WORD_FORMS_COLLECTION_UPDATES>
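If you maintain many stemming changes, it may be easier to generate the update file than to write it by hand. The following Python sketch is one possible way to do that using only the standard library. The function name build_stemming_updates and its parameters are hypothetical and are not part of Dgidx or any Oracle tooling. The sketch also assumes that several <WORD_FORMS> groups may share a single <WORD_FORMS_COLLECTION> element, which you should confirm against word_forms_collection_updates.dtd.

```python
import xml.etree.ElementTree as ET

DOCTYPE = ('<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM '
           '"word_forms_collection_updates.dtd">')


def build_stemming_updates(remove_keys, add_groups):
    """Return the text of a stemming update file (hypothetical helper).

    remove_keys: words whose stemming entries should be deleted.
    add_groups:  lists of word forms; each list becomes one WORD_FORMS group.
    """
    remove_keys = list(remove_keys)
    add_groups = [list(group) for group in add_groups]

    root = ET.Element("WORD_FORMS_COLLECTION_UPDATES")

    # Write removals first, mirroring the order used in the examples above.
    if remove_keys:
        removes = ET.SubElement(root, "REMOVE_WORD_FORMS_KEYS")
        for key in remove_keys:
            ET.SubElement(removes, "WORD_FORM").text = key

    if add_groups:
        adds = ET.SubElement(root, "ADD_WORD_FORMS")
        # Assumption: one collection may hold several WORD_FORMS groups.
        collection = ET.SubElement(adds, "WORD_FORMS_COLLECTION")
        for group in add_groups:
            forms = ET.SubElement(collection, "WORD_FORMS")
            for word in group:
                ET.SubElement(forms, "WORD_FORM").text = word

    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)

    return DOCTYPE + "\n" + ET.tostring(root, encoding="unicode") + "\n"


if __name__ == "__main__":
    print(build_stemming_updates(["aachen"],
                                 [["aachen", "aachens", "aachenes"]]))
```

Run as a script, this prints an update file equivalent to the aachen example, which you could save under a name that follows the file naming syntax and then pass to Dgidx.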
The syntax of the stemming update file name is as follows:
user_specified.<RFC 3066 Language Code>.xml
where
user_specified is a string of your choice, for example myAppStemmingChanges.
RFC 3066 Language Code is a two-character language code with an optional region sub tag, for example en or en-us. See ISO 639-1 for the full list of two-character codes and RFC 3066 for the two-character sub tag for region.
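The naming rule can be illustrated with a small Python helper. This is only a sketch: the function name stemming_update_filename is hypothetical, the pattern is a loose check for a two-letter language code with an optional two-letter region sub tag rather than a full RFC 3066 validator, and lowercasing the code is an assumption based on the en and en-us examples.

```python
import re

# Loose check (assumption): two letters, optionally followed by "-" and two letters.
_LANGUAGE_CODE = re.compile(r"^[a-z]{2}(-[a-z]{2})?$")


def stemming_update_filename(user_specified, language_code):
    """Build a name of the form user_specified.<language code>.xml."""
    if not user_specified:
        raise ValueError("user_specified must not be empty")
    code = language_code.lower()  # assumption: lowercase, as in the examples
    if not _LANGUAGE_CODE.match(code):
        raise ValueError("unexpected language code: %r" % language_code)
    return "%s.%s.xml" % (user_specified, code)


if __name__ == "__main__":
    print(stemming_update_filename("myAppStemmingChanges", "en"))
```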
To process the stemming update file, run Dgidx with the --stemming-updates flag and specify the XML file that contains the stemming updates. For example:
dgidx --stemming-updates myAppStemmingChanges.en.xml
When Dgidx merges the changes in an update file into the static stemming dictionary, there may be conflicts in cases where the variant for one root in the static stemming dictionary is the same as a variant for another root in the update file. Any duplicate variants of different root words constitute a conflict.
In this case, Dgidx throws a warning about conflicting variants and rejects the variant that was specified in the update file.
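To find likely conflicts before running Dgidx, you could compare the update file against the static dictionary yourself. The Python sketch below illustrates the rule described above; it is not the algorithm that Dgidx uses. It assumes that each dictionary has already been parsed into lists of word forms and that the first word form in each list is the root. Both are assumptions made for illustration, and the function name find_conflicting_variants is hypothetical.

```python
def find_conflicting_variants(static_groups, update_groups):
    """Return (variant, static_root, update_root) for each conflict.

    Each group is a list of word forms; the first form is treated as the
    root (an assumption made for this sketch).
    """
    # Map every word form in the static dictionary to the root that owns it.
    owner = {}
    for group in static_groups:
        if not group:
            continue
        root = group[0]
        for form in group:
            owner.setdefault(form, root)

    # A variant conflicts when it already belongs to a different root.
    conflicts = []
    for group in update_groups:
        if not group:
            continue
        root = group[0]
        for form in group:
            existing = owner.get(form)
            if existing is not None and existing != root:
                conflicts.append((form, existing, root))
    return conflicts


if __name__ == "__main__":
    static = [["apple", "apples"], ["app", "apps"]]
    update = [["apply", "apples", "applied"]]
    for variant, static_root, update_root in find_conflicting_variants(static, update):
        print("conflict: %s (static root %s, update root %s)"
              % (variant, static_root, update_root))
```

In this example, apples is reported because it is already a variant of apple in the static dictionary while the update file lists it under apply.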