The stemming.xml file contains information that enables dgidx to select a language analysis for a particular language.
The stemming.xml file can contain entries only for the languages originally supported by the MDEX Engine. For a list of the languages supported for use with the MDEX Engine and the default language analysis applied to each, see MDEX Engine Supported Languages.
If your source data includes one of the languages originally supported by the MDEX Engine and you need to enable stemming for that language, make sure that stemming.xml includes an entry for it. If stemming.xml does not contain an entry for a language, stemming is not enabled for that language and the default analysis is applied to it.
In stemming.xml, the entry for a language is contained in a separate <STEMMING> element. Each subelement in the <STEMMING> element begins with STEM_language-code, where language-code identifies the language; for example, STEM_DE for German, STEM_EL for Greek, and STEM_HE for Hebrew. The subelements specify the following:
ENABLE: whether language analysis beyond tokenization is enabled.
USE_STATIC_WORDFORMS: whether stemming uses static wordform dictionary files.
USE_COMPOUND_MATCHING: whether compound matching is performed.
For example, the following entry, for American English, specifies that stemming is to be performed using a static wordforms file, and that compound matching is not to be performed:
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <STEM_EN_US ENABLE="TRUE" USE_COMPOUND_MATCHING="FALSE" USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>
The following sections describe the subelements of the <STEMMING> element. For more information about stemming.xml, refer to the Platform Services XML Reference.
ENABLE
When ENABLE is set to TRUE, language analysis (in addition to tokenization) is enabled; the analysis includes not only stemming but also any other functions that the analysis can perform, such as the use of a thesaurus or of stop words.
When ENABLE is set to FALSE, the language analysis performs only tokenization. A warning is displayed if ENABLE is set to FALSE and OLT language analysis is selected. The warning informs you that the setting of ENABLE in this case will be ignored, because OLT performs stemming unconditionally.
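As an illustration of the ENABLE setting, the following is a minimal sketch of a tokenization-only entry. It assumes a German entry (STEM_DE) and pairs ENABLE="FALSE" with USE_STATIC_WORDFORMS="TRUE" so that OLT is not selected and the ENABLE setting is honored rather than ignored. The choice of language and attribute values is illustrative only, not a recommended configuration.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <!-- Illustrative only: tokenization without stemming for German. -->
  <!-- USE_STATIC_WORDFORMS="TRUE" selects Latin-1, so ENABLE="FALSE" is not ignored as it would be under OLT. -->
  <STEM_DE ENABLE="FALSE" USE_COMPOUND_MATCHING="FALSE" USE_STATIC_WORDFORMS="TRUE" />
</STEMMING>
```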
Note that the only stemming that Latin-1 analysis performs on English is to treat singular and plural wordforms as matches for each other; for example, to make "house" and "houses" match each other. This follows necessarily from the largely uninflected nature of English.
USE_STATIC_WORDFORMS
If USE_STATIC_WORDFORMS is set to TRUE, the stemming feature uses the static wordform dictionary files shipped with the MDEX Engine package. A static wordform dictionary file defines sets of inflected forms that are treated as matches for each other by the Guided Search feature; for example, the German word that means table, "Tisch", in its different grammatical cases, is specified as follows in the German wordforms file:
<WORD_FORMS>
  <WORD_FORM>tisch</WORD_FORM>
  <WORD_FORM>tisches</WORD_FORM>
  <WORD_FORM>tische</WORD_FORM>
  <WORD_FORM>tischen</WORD_FORM>
</WORD_FORMS>
Default static wordform dictionary files are stored in Endeca\MDEX\version\conf\stemming (on Windows) and usr/local/endeca/MDEX/version/conf/stemming (on UNIX).
You can update the default static wordform dictionary files, or create custom static wordform dictionary files to use in place of the default files. For information about how to do this, refer to the MDEX Developer's Guide.
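For orientation, a wordform set added to a dictionary file might look like the sketch below. It is a hypothetical entry that reuses the format of the German "Tisch" entry shown earlier and the English singular/plural pair mentioned above. The surrounding file structure is not shown here and should be taken from the shipped files and the MDEX Developer's Guide.

```xml
<!-- Hypothetical entry: treats "house" and "houses" as matches for each other. -->
<WORD_FORMS>
  <WORD_FORM>house</WORD_FORM>
  <WORD_FORM>houses</WORD_FORM>
</WORD_FORMS>
```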
If USE_STATIC_WORDFORMS is set to FALSE, dgidx generates the wordforms file from the source data dynamically.
Selecting a Language Analysis through USE_STATIC_WORDFORMS
Static wordform files are always used for stemming by Latin-1 analysis and are never used by OLT analysis. Thus, if USE_STATIC_WORDFORMS is set to TRUE, Latin-1 is selected; if set to FALSE, OLT is selected.
Setting USE_STATIC_WORDFORMS to TRUE forces Latin-1 analysis to be selected for a language even if Latin-1 is not the better analysis for that language. For example, if you set USE_STATIC_WORDFORMS to TRUE for traditional Chinese, Latin-1 is applied, with the result that the Chinese text is not properly tokenized.
Note
Each of the languages that can be configured in stemming.xml has a default value (TRUE or FALSE) for USE_STATIC_WORDFORMS (specified in stemming.dtd). The default varies from language to language. Oracle recommends that you set the value of USE_STATIC_WORDFORMS in stemming.xml, rather than rely on the default.
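The selection rule described in this section can be sketched with two entries that set USE_STATIC_WORDFORMS explicitly. The languages and values are assumptions chosen for illustration, and the entries are shown as fragments without their enclosing <STEMMING> element.

```xml
<!-- Fragments only; the enclosing <STEMMING> element is omitted. -->
<!-- Static wordforms in use: Latin-1 analysis is selected for American English. -->
<STEM_EN_US ENABLE="TRUE" USE_COMPOUND_MATCHING="FALSE" USE_STATIC_WORDFORMS="TRUE" />
<!-- Static wordforms not in use: OLT analysis is selected for German. -->
<STEM_DE ENABLE="TRUE" USE_COMPOUND_MATCHING="FALSE" USE_STATIC_WORDFORMS="FALSE" />
```

Stating the value in each entry makes the selected analysis visible in the file itself instead of depending on the per-language default in stemming.dtd.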
USE_COMPOUND_MATCHING
If USE_COMPOUND_MATCHING is set to TRUE, the stemming feature matches compound words not only with themselves, but also with any of their elements taken individually.
Compound matching is supported only when OLT is used for language analysis. A word is recognized as a compound if it is known as such by OLT, or if you have added it to your auxiliary dictionary using the COMPOUND command. For information about the COMPOUND command, see Configuring decompounding in an auxiliary dictionary.
For example, when compound matching is enabled, the German word "Bananenstecker" (banana plug) can be matched either by "Banane" (banana) or by "Stecker" (plug). When compound matching is disabled, "Bananenstecker" is not matched by "Banane" or "Stecker", although it can be matched by inflected forms of "Bananenstecker" such as "Bananensteckers" (genitive singular).
Similarly, when compound matching is enabled for Dutch, "tomaten" (tomatoes) matches "tomaten", "tomaat", "tomatensaus", "tomatensauzen", "cherrytomaten", and "cherrytomaat". When compound matching is disabled, "tomaten" matches only its inflected forms, such as "tomaten" and "tomaat"; it does not match compounds containing "tomaat" or "tomaten", such as "tomatensaus", "tomatensauzen", "cherrytomaten", or "cherrytomaat".
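A minimal sketch of an entry that enables compound matching for German follows. Because compound matching is supported only with OLT, the sketch sets USE_STATIC_WORDFORMS to FALSE so that OLT is selected. The attribute combination is an assumption for illustration and should be checked against stemming.dtd for your installation.

```xml
<!DOCTYPE STEMMING SYSTEM "stemming.dtd">
<STEMMING>
  <!-- Illustrative only: German with compound matching enabled. -->
  <!-- USE_STATIC_WORDFORMS="FALSE" selects OLT, which compound matching requires. -->
  <STEM_DE ENABLE="TRUE" USE_COMPOUND_MATCHING="TRUE" USE_STATIC_WORDFORMS="FALSE" />
</STEMMING>
```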
To change default values in stemming.xml, you can edit the file manually; that is, you can edit it directly in a text editor rather than through Developer Studio. Manual edits that you make to stemming.xml can be affected when you save your Developer Studio project and when you upgrade Developer Studio to the current version.
Note
The default values provided in stemming.xml by Developer Studio for the languages that it supports are suitable for almost all purposes.
Developer Studio enables you to select the languages for which you want to enable stemming; the stemming.xml file is then written with default values for the languages that you select. You can change the default values only by editing stemming.xml manually.
Before you edit stemming.xml manually, always save and close your Developer Studio project.
If you make manual edits to stemming.xml while your project is open in Developer Studio, those edits are overwritten when you save the project.
However, if you close your Developer Studio project before you edit stemming.xml manually, your edits are preserved, unless they conflict with default stemming values that are specified in the project. In this case, the edits are preserved as long as the stemming configuration in Developer Studio is not changed.
When you upgrade Developer Studio to a newer version, you are prompted to save any existing Developer Studio projects to a new location. Your existing project, however, is preserved in its existing location; as a result, you have two projects, one using the older version of Developer Studio, and one using the new version.
The stemming.xml file for the newer, upgraded project has the same values as the original stemming.xml file. You can edit the upgraded version of stemming.xml manually or using Developer Studio. Any edits that you make to the upgraded version of stemming.xml do not affect the original version.
Note
Oracle recommends that you save the existing Developer Studio project as prompted when you open the existing project with a new version of Developer Studio.