If your application requires a stemming language that is not available in the Stemming editor of Developer Studio, you can create and add a custom stemming dictionary. A custom stemming dictionary is available in addition to any stemming selections you may have enabled in Developer Studio. For example, you can enable English and Dutch, and then add an additional custom stemming dictionary for Swahili.
Although you can create any number of custom stemming dictionaries, only one custom stemming dictionary can be loaded into the MDEX Engine. You indicate which custom stemming dictionary to load with the --lang flag to Dgidx.
To add a custom stemming dictionary:
Create a custom dictionary file with stemming entries. For sample XML, see the structure of any default stemming dictionary stored in <install path>\MDEX\<version>\conf\stemming. For example, this simplified file contains one term and one stemmed variant:
    <?xml version="1.0"?>
    <!DOCTYPE WORD_FORMS_COLLECTION SYSTEM "word_forms_collection.dtd">
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>swahiliterm</WORD_FORM>
        <WORD_FORM>swahiliterms</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
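For dictionaries with many entries, writing the XML by hand is error-prone. The following Python sketch generates a file with the same structure as the sample above. The function name build_stemming_dictionary and the input shape (a list of word-form groups) are illustrative assumptions, not part of the product, and the DOCTYPE line assumes the standard SYSTEM form of the declaration.

```python
import xml.etree.ElementTree as ET

# Assumed DOCTYPE line, following the sample dictionary in this section.
DOCTYPE = '<!DOCTYPE WORD_FORMS_COLLECTION SYSTEM "word_forms_collection.dtd">'


def build_stemming_dictionary(groups):
    """Return dictionary XML for a list of word-form groups.

    Each group is a list of strings that should stem to one another,
    for example ["swahiliterm", "swahiliterms"].
    """
    root = ET.Element("WORD_FORMS_COLLECTION")
    for forms in groups:
        entry = ET.SubElement(root, "WORD_FORMS")
        for form in forms:
            ET.SubElement(entry, "WORD_FORM").text = form
    # Pretty-print where the standard library supports it (Python 3.9+).
    if hasattr(ET, "indent"):
        ET.indent(root)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0"?>\n' + DOCTYPE + "\n" + body + "\n"


xml_text = build_stemming_dictionary([["swahiliterm", "swahiliterms"]])
print(xml_text)
```

ElementTree escapes characters such as & and < in the word forms, which is the main reason to prefer it over string concatenation here.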
When you have created the custom stemming dictionary, save the XML file with one of the following name formats:
If the dictionary contains unaccented characters and you use the Dgidx flag --diacritic-folding, save the file as <RFC 3066 Language Code>-x-folded_word_forms_collection.xml
If the dictionary contains accented characters and you are not using the Dgidx flag --diacritic-folding, save the file as <RFC 3066 Language Code>_word_forms_collection.xml
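The two name formats differ only in their suffix, so the choice can be expressed as a small helper. This is a minimal sketch: the function name dictionary_filename and its parameters are hypothetical, and only the two suffixes come from the rules above.

```python
def dictionary_filename(lang_code, diacritic_folding):
    """Return the file name for a custom stemming dictionary.

    lang_code is the RFC 3066 language code of the dictionary.
    diacritic_folding states whether Dgidx runs with --diacritic-folding.
    """
    if diacritic_folding:
        suffix = "-x-folded_word_forms_collection.xml"
    else:
        suffix = "_word_forms_collection.xml"
    return lang_code + suffix


print(dictionary_filename("sw", diacritic_folding=False))
print(dictionary_filename("sw", diacritic_folding=True))
```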
For example, the XML above would be saved as sw_word_forms_collection.xml, where sw is the ISO 639-1 language code for Swahili.
Place the XML file in <install path>\MDEX\<version>\conf\stemming\custom.
Specify the --lang flag to Dgidx with a <lang id> argument that matches the language code of the custom stemming dictionary file. In the example above that uses a Swahili (sw) dictionary, you would specify:
    dgidx --lang sw