Oracle Commerce Guided Search - Specifying Language Analysis in stemming.xml

The stemming.xml file contains information that enables dgidx to select a language analysis for a particular language.

The stemming.xml file can contain entries only for the languages originally supported by the MDEX Engine. For a list of the languages supported for use with the MDEX Engine and the default language analysis applied to each, see MDEX Engine Supported Languages.

If your source data includes one of the languages originally supported by the MDEX Engine and you need to enable stemming for that language, make sure that stemming.xml includes an entry for it. If stemming.xml does not contain an entry for a language, stemming is not enabled for that language and the default analysis is applied to it.

In stemming.xml, the entry for a language is contained in a separate <STEMMING> element. Each subelement in the <STEMMING> element begins with STEM_language-code, where language-code identifies the language; for example, STEM_DE for German, STEM_EL for Greek, and STEM_HE for Hebrew. The subelements specify the following:

Whether stemming is to be performed on that language.
Whether a static wordforms file is to be used.
Whether compound matching is to be performed.

For example, the following entry, for American English, specifies that stemming is to be performed using a static wordforms file, and that compound matching is not to be performed:

<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <STEM_EN_US ENABLE="TRUE" 
              USE_COMPOUND_MATCHING="FALSE" 
              USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>

The following sections describe the subelements of the <STEMMING> element. For more information about stemming.xml, refer to the Platform Services XML Reference.

ENABLE

When ENABLE is set to TRUE, language analysis (in addition to tokenization) is enabled; the analysis includes not only stemming, but any other functions that the analysis can perform, such as the use of a thesaurus, of stop words, and so on.

When ENABLE is set to FALSE, the language analysis performs only tokenization. A warning is displayed if ENABLE is set to FALSE and OLT language analysis is selected. The warning informs you that the setting of ENABLE in this case will be ignored, because OLT performs stemming unconditionally.
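As a sketch of the tokenization-only case, the entry below turns language analysis off for German. STEM_DE is the element name this document gives for German. Pairing ENABLE="FALSE" with USE_STATIC_WORDFORMS="TRUE" is an assumption made for this illustration, so that Latin-1 rather than OLT is selected and the ENABLE setting is not ignored.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<!-- Illustrative entry: tokenization only for German.
     USE_STATIC_WORDFORMS="TRUE" is assumed here so that OLT is not selected;
     with OLT, ENABLE="FALSE" would be ignored and a warning displayed. -->
<STEMMING>
  <STEM_DE ENABLE="FALSE"
           USE_COMPOUND_MATCHING="FALSE"
           USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>
```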

Note that the only stemming that Latin-1 analysis performs on English is to treat singular and plural wordforms as matches for each other; for example, to make "house" and "houses" match each other. This follows necessarily from the largely uninflected nature of English.

USE_STATIC_WORDFORMS

If USE_STATIC_WORDFORMS is set to TRUE, the stemming feature uses the static wordform dictionary files shipped with the MDEX Engine package. A static wordform dictionary file defines sets of inflected forms that are treated as matches for each other by the Guided Search feature; for example, the German word that means table, "Tisch", in its different grammatical cases, is specified as follows in the German wordforms file:

<WORD_FORMS>
  <WORD_FORM>tisch</WORD_FORM>
  <WORD_FORM>tisches</WORD_FORM>
  <WORD_FORM>tische</WORD_FORM>
  <WORD_FORM>tischen</WORD_FORM>
</WORD_FORMS>

Default static wordform dictionary files are stored in Endeca\MDEX\version\conf\stemming (on Windows) and usr/local/endeca/MDEX/version/conf/stemming (on UNIX). You can update the default static wordform dictionary files, or create custom static wordform dictionary files to use in place of the default files. For information about how to do this, refer to the MDEX Developer's Guide.
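A custom entry would presumably follow the same shape as the German fragment above. The sketch below groups the English singular and plural forms "house" and "houses". It assumes the English wordforms file uses the same WORD_FORMS and WORD_FORM elements as the German one. The enclosing root element of the file is not shown in this document and is left out here, so check the default file before editing it.

```xml
<!-- Illustrative fragment only: assumes the same element structure as the
     German example; the file's enclosing root element is omitted. -->
<WORD_FORMS>
  <WORD_FORM>house</WORD_FORM>
  <WORD_FORM>houses</WORD_FORM>
</WORD_FORMS>
```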

If USE_STATIC_WORDFORMS is set to FALSE, dgidx generates the wordforms file from the source data dynamically.

Selecting a Language Analysis through USE_STATIC_WORDFORMS

Static wordform files are always used for stemming by Latin-1 analysis and are never used by OLT analysis. Thus, if USE_STATIC_WORDFORMS is set to TRUE, Latin-1 is selected; if set to FALSE, OLT is selected.
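The selection rule can be illustrated with two entries side by side. This is a minimal sketch, not a recommended configuration: the choice of analysis for each language is an assumption made only to contrast the two settings.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <!-- USE_STATIC_WORDFORMS="TRUE": Latin-1 analysis is selected. -->
  <STEM_EN_US ENABLE="TRUE"
              USE_COMPOUND_MATCHING="FALSE"
              USE_STATIC_WORDFORMS="TRUE" />
  <!-- USE_STATIC_WORDFORMS="FALSE": OLT analysis is selected. -->
  <STEM_DE ENABLE="TRUE"
           USE_COMPOUND_MATCHING="FALSE"
           USE_STATIC_WORDFORMS="FALSE" />
</STEMMING>
```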

Setting USE_STATIC_WORDFORMS to TRUE forces Latin-1 analysis to be selected for a language even if Latin-1 is not the better analysis for that language. For example, if you set USE_STATIC_WORDFORMS to TRUE for traditional Chinese, Latin-1 is applied, with the result that the Chinese text is not properly tokenized.

Note

Each of the languages that can be configured in stemming.xml has a default value (TRUE or FALSE) for USE_STATIC_WORDFORMS (specified in stemming.dtd). The default varies from language to language. Oracle recommends that you set the value of USE_STATIC_WORDFORMS in stemming.xml, rather than rely on the default.

USE_COMPOUND_MATCHING

If USE_COMPOUND_MATCHING is set to TRUE, the stemming feature matches compound words not only with themselves, but also with any of their elements taken individually.

Compound matching is supported only when OLT is used for language analysis. A word is recognized as a compound if it is known as such by OLT, or if you have added it to your auxiliary dictionary using the COMPOUND command. For information about the COMPOUND command, see Configuring decompounding in an auxiliary dictionary.

For example, when compound matching is enabled, the German word "Bananenstecker" (banana plug) can be matched either by "Banane" (banana) or by "Stecker" (plug). When compound matching is disabled, "Bananenstecker" is not matched by "Banane" or "Stecker", although it can be matched by inflected forms of "Bananenstecker" such as "Bananensteckers" (genitive singular).

Similarly, when compound matching is enabled for Dutch, "tomaten" (tomatoes) matches "tomaten", "tomaat", "tomatensaus", "tomatensauzen", "cherrytomaten", and "cherrytomaat". When compound matching is disabled, "tomaten" still matches its inflected forms, such as "tomaten" and "tomaat", but does not match compounds containing "tomaat" or "tomaten", such as "tomatensaus", "tomatensauzen", "cherrytomaten", or "cherrytomaat".
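An entry that enables compound matching for German might look like the following sketch. Because compound matching is supported only with OLT, USE_STATIC_WORDFORMS is set to FALSE here so that OLT is selected. A Dutch entry would presumably use an element named after the STEM_language-code pattern, but that element name is not given in this document and is not shown.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<!-- Illustrative entry: compound matching for German.
     USE_STATIC_WORDFORMS="FALSE" selects OLT, which compound matching requires. -->
<STEMMING>
  <STEM_DE ENABLE="TRUE"
           USE_COMPOUND_MATCHING="TRUE"
           USE_STATIC_WORDFORMS="FALSE" />
</STEMMING>
```

With an entry like this, a search for "Stecker" would be expected to match "Bananenstecker", as described above.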

Developer Studio and Custom stemming.xml Files

To change default values in stemming.xml, you can edit the file manually; that is, directly in a text editor rather than through Developer Studio. Manual edits to stemming.xml can be affected when you save your Developer Studio project and when you upgrade Developer Studio to the current version.

Effect on stemming.xml of Saving Your Developer Studio Project

Guided Search Internationalization Guide