You can configure a language-specific stemming dictionary to define the components of compound words. This process is known as decompounding. If decompounding is enabled (USE_COMPOUND_MATCHING="TRUE" in stemming.xml), a search for a word matches both that word and any compounds that include the word as a component.
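
The matching behavior described above can be illustrated with a small Python sketch. It is a conceptual model only, not the engine's implementation: the COMPOUNDS table, build_component_index, and expand_query are hypothetical names introduced for illustration, and the use_compound_matching flag merely mimics the effect of the USE_COMPOUND_MATCHING setting.

```python
from collections import defaultdict

# Hypothetical toy data: compound word -> its component lemmas.
COMPOUNDS = {
    "appelsalade": ["appel", "salade"],
    "sperziebonensalade": ["sperzie", "boon", "salade"],
}


def build_component_index(compounds):
    """Map each component lemma to the set of compounds that contain it."""
    index = defaultdict(set)
    for word, components in compounds.items():
        for component in components:
            index[component.lower()].add(word)
    return index


def expand_query(term, index, use_compound_matching=True):
    """Return the words a search for `term` would match in this toy model."""
    matches = {term.lower()}
    if use_compound_matching:
        # With decompounding on, compounds containing the term also match.
        matches |= index.get(term.lower(), set())
    return sorted(matches)


INDEX = build_component_index(COMPOUNDS)
print(expand_query("salade", INDEX))
# ['appelsalade', 'salade', 'sperziebonensalade']
print(expand_query("salade", INDEX, use_compound_matching=False))
# ['salade']
```

The sketch looks up lemmas only; it does not model stemming of inflected query terms.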

Decompounding can be performed only when OLT is selected as the language analyzer. If the stemming.xml file is provided to configure the choice of analysis used for a language, OLT must be selected and decompounding must be enabled for that language. If only OLT is supported for use with a language, decompounding is enabled for that language by default.

Each of the individual components of a decompounded word must be a lemma -- that is, an uninflected word stem. If a lemma that you need to specify as a component of a compound word does not exist in your dictionary, you must add the lemma to your dictionary. You can add the lemma to your dictionary using the STEM command.

To customize decompounding:

  1. Open the dictionary file that you wish to modify.

    For example, %ENDECA_MDEX_ROOT%\olt\dictionary.de.dict.

  2. Locate or add the terms that you wish to configure for decompounding.

    For example, you may wish to customize the decompounding of the German word "Binnenschiffahrt", which refers to transport along inland rivers.

  3. Add the entry using a command of the form COMPOUND word stem1|stem2|… POS,POS

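    where word is the compound word, stem1|stem2|… lists its components separated by vertical bars, and POS,POS is a comma-separated list of parts of speech (N in the examples that follow).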

    In the case of the above example, you could add the word in either or both of two ways: one that adheres to the German orthography reform standards of 1996, and one that reflects the earlier spelling of the word:

    COMPOUND  Binnenschifffahrt   Binnen|Schiff|Fahrt N 
    COMPOUND  Binnenschiffahrt    Binnen|Schiff|Fahrt N

    If a compound word is composed only of lemmas, you can add the word to your dictionary using either a COMPOUND command or a STEM command of the following form:

    STEM word stem1|stem2|... POS,POS,...

    For example, you can add the compound Dutch word appelsalade, which is composed of two lemmas, using either a STEM or a COMPOUND command, as follows:

    COMPOUND appelsalade appel|salade N
    STEM appelsalade appel|salade N

    You cannot use STEM to add a Dutch word such as sperziebonensalade ("green bean salad"), however, because one of its components, "bonen" ("beans"), is an inflected (plural) form and thus is not a lemma. You can, however, add sperziebonensalade using the following COMPOUND command:

    COMPOUND sperziebonensalade sperzie|boon|salade N

    where "boon" is the lemma of the Dutch word for "bean".

    The above COMPOUND command enables sperziebonensalade to be found by searches for "boon" or "bonen" (or "salade", "saladen", or "sperzie") if you specify USE_COMPOUND_MATCHING="TRUE" in stemming.xml. If USE_COMPOUND_MATCHING="FALSE", defining sperziebonensalade as a compound has no effect on searches (the searches do not use the components), but it does not cause an error.

  4. Save and close the file.
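
Before saving, it can help to sanity-check new entries against the command forms shown in the steps above. The following Python sketch is one possible checker, not a tool shipped with the product. It assumes each entry consists of exactly four whitespace-separated fields (command, word, components, parts of speech), as in the examples; the real dictionary format may allow more than this. DictEntry and parse_entry are hypothetical names.

```python
from typing import List, NamedTuple


class DictEntry(NamedTuple):
    command: str           # "COMPOUND" or "STEM"
    word: str              # the compound word being defined
    components: List[str]  # components split on "|"
    pos: List[str]         # parts of speech split on ","


def parse_entry(line: str) -> DictEntry:
    """Parse one entry of the assumed form: COMMAND word stem1|stem2 POS,POS."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    command, word, components, pos = fields
    if command not in ("COMPOUND", "STEM"):
        raise ValueError(f"unknown command {command!r}")
    parts = components.split("|")
    if any(not part for part in parts):
        raise ValueError(f"empty component in {components!r}")
    return DictEntry(command, word, parts, pos.split(","))


entry = parse_entry("COMPOUND  Binnenschifffahrt   Binnen|Schiff|Fahrt N")
print(entry.word, entry.components, entry.pos)
# Binnenschifffahrt ['Binnen', 'Schiff', 'Fahrt'] ['N']
```

Under these assumptions, an entry that omits the word field, or that contains an empty component such as "appel||salade", is rejected with a ValueError. The checker does not verify that each component is a lemma present in the dictionary.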

