The stemming.xml file contains information that enables dgidx to select a language analysis for a particular language.

The stemming.xml file can contain entries only for the languages originally supported by the MDEX Engine. For a list of the languages supported for use with the MDEX Engine and the default language analysis applied to each, see MDEX Engine Supported Languages.

If your source data includes one of the languages originally supported by MDEX and you need to enable stemming for that language, be sure that an entry for this language is included in the stemming.xml file. If stemming.xml does not contain an entry for a language, stemming is not enabled for that language and the default analysis is applied to that language.

In stemming.xml, the entry for each language is a separate subelement of the <STEMMING> element. The name of each subelement has the form STEM_language-code, where language-code identifies the language; for example, STEM_DE for German, STEM_EL for Greek, and STEM_HE for Hebrew. The attributes of a subelement (ENABLE, USE_COMPOUND_MATCHING, and USE_STATIC_WORDFORMS in the entry below) specify how stemming is performed for that language.

For example, the following entry, for American English, specifies that stemming is to be performed using a static wordforms file, and that compound matching is not to be performed:

<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <STEM_EN_US ENABLE="TRUE"
              USE_COMPOUND_MATCHING="FALSE"
              USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>
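
As a further illustration, a stemming.xml file with entries for two languages might look like the sketch below. The attribute values are assumptions chosen for illustration, not shipped defaults; only the element and attribute names come from this section.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <!-- American English: static wordforms, no compound matching -->
  <STEM_EN_US ENABLE="TRUE"
              USE_COMPOUND_MATCHING="FALSE"
              USE_STATIC_WORDFORMS="TRUE" />
  <!-- German: values are illustrative, not the shipped defaults -->
  <STEM_DE ENABLE="TRUE"
           USE_COMPOUND_MATCHING="FALSE"
           USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>
```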

The following sections describe the attributes set on the subelements of the <STEMMING> element. For more information about stemming.xml, refer to the Platform Services XML Reference.

USE_STATIC_WORDFORMS

If USE_STATIC_WORDFORMS is set to TRUE, the stemming feature uses the static wordform dictionary files shipped with the MDEX Engine package. A static wordform dictionary file defines sets of inflected forms that the Guided Search feature treats as matches for each other. For example, the grammatical case forms of the German word "Tisch" (table) are specified as follows in the German wordforms file:

<WORD_FORMS>
  <WORD_FORM>tisch</WORD_FORM>
  <WORD_FORM>tisches</WORD_FORM>
  <WORD_FORM>tische</WORD_FORM>
  <WORD_FORM>tischen</WORD_FORM>
</WORD_FORMS>

Default static wordform dictionary files are stored in Endeca\MDEX\version\conf\stemming (on Windows) and usr/local/endeca/MDEX/version/conf/stemming (on UNIX). You can update the default static wordform dictionary files, or create custom static wordform dictionary files to use in place of the default files. For information about how to do this, refer to the MDEX Developer's Guide.
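
As a sketch of what an added entry could look like, the fragment below groups the Dutch forms "tomaat" and "tomaten" in the same way as the German example above. It is an assumption based only on that example: any enclosing elements that a wordforms file requires are not shown here, and the MDEX Developer's Guide remains the reference for the actual procedure.

```xml
<!-- Hypothetical entry: inflected forms that should match each other -->
<WORD_FORMS>
  <WORD_FORM>tomaat</WORD_FORM>
  <WORD_FORM>tomaten</WORD_FORM>
</WORD_FORMS>
```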

If USE_STATIC_WORDFORMS is set to FALSE, dgidx generates the wordforms file from the source data dynamically.

Selecting a Language Analysis through USE_STATIC_WORDFORMS

Static wordform files are always used for stemming by Latin-1 analysis and are never used by OLT analysis. Thus, if USE_STATIC_WORDFORMS is set to TRUE, Latin-1 is selected; if set to FALSE, OLT is selected.

Setting USE_STATIC_WORDFORMS to TRUE forces Latin-1 analysis to be selected for a language even if Latin-1 is not the better analysis for that language. For example, if you set USE_STATIC_WORDFORMS to TRUE for traditional Chinese, Latin-1 is applied, with the result that the Chinese text is not properly tokenized.

USE_COMPOUND_MATCHING

If USE_COMPOUND_MATCHING is set to TRUE, the stemming feature matches compound words not only with themselves, but also with any of their elements taken individually.

Compound matching is supported only when OLT is used for language analysis. A word is recognized as a compound if it is known as such by OLT, or if you have added it to your auxiliary dictionary using the COMPOUND command. For information about the COMPOUND command, see Configuring decompounding in an auxiliary dictionary.

For example, when compound matching is enabled, the German word "Bananenstecker" (banana plug) can be matched either by "Banane" (banana) or by "Stecker" (plug). When compound matching is disabled, "Bananenstecker" is not matched by "Banane" or "Stecker", although it can be matched by inflected forms of "Bananenstecker" such as "Bananensteckers" (genitive singular).

Similarly, when compound matching is enabled for Dutch, "tomaten" (tomatoes) matches "tomaten", "tomaat", "tomatensaus", "tomatensauzen", "cherrytomaten", and "cherrytomaat". When compound matching is disabled, "tomaten" matches any of its inflected forms, such as "tomaten" and "tomaat", but does not match compounds containing "tomaat" or "tomaten", such as "tomatensaus", "tomatensauzen", "cherrytomaten", or "cherrytomaat".
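
An entry that enables compound matching for German could be sketched as follows. The values are illustrative assumptions: because compound matching is supported only with OLT, and OLT is selected when USE_STATIC_WORDFORMS is FALSE, the two attributes are set together.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <!-- Static wordforms off, so OLT is selected; compound matching on -->
  <STEM_DE ENABLE="TRUE"
           USE_COMPOUND_MATCHING="TRUE"
           USE_STATIC_WORDFORMS="FALSE" />
</STEMMING>
```

With an entry like this, "Bananenstecker" would be expected to match "Banane" or "Stecker", as described above.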

To change default values in stemming.xml, edit the file manually; that is, edit it directly in a text editor rather than through Developer Studio. Manual edits to stemming.xml can be affected when you save your Developer Studio project and when you upgrade Developer Studio to the current version.

