About the stemming feature

The stemming feature broadens search results to include word roots and word derivations.

Stemming is enabled by default in an Endeca data domain.

Stemming allows words that share a common root form (such as the singular and plural forms of a noun) to be treated as interchangeable in search operations. For example, a search for the word shirt also returns matches for the derivation shirts, and a search for shirts also returns matches for its word root, shirt.

Stemming equivalences are defined among single words. For example, stemming is used to produce an equivalence between the words automobile and automobiles (because the first word is the stem form of the second), but not to define an equivalence between the words vehicle and automobile (this type of concept-level mapping is done via the thesaurus feature).

Stemming equivalences are strictly two-way (that is, all-to-all). For example, if there is a stemming entry for the word truck, then searches for truck will always return matches for both the singular form (truck) and its plural form (trucks), and searches for trucks will also return matches for truck. In contrast, the thesaurus feature supports one-way mappings in addition to two-way mappings.
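The difference between two-way stemming equivalences and one-way thesaurus mappings can be sketched in a few lines of Python. This is only a conceptual model, not the Dgraph's implementation: the dictionaries STEM_OF and ONE_WAY_THESAURUS and the helper functions are hypothetical names invented for illustration.

```python
from collections import defaultdict

# Hypothetical stemming entries: each word form maps to its stem.
STEM_OF = {
    "truck": "truck",
    "trucks": "truck",
    "automobile": "automobile",
    "automobiles": "automobile",
}

# Hypothetical one-way thesaurus entry: a search for "vehicle" also
# matches "automobile", but not the other way around.
ONE_WAY_THESAURUS = {"vehicle": {"automobile"}}


def build_equivalence_classes(stem_of):
    """Group all word forms that share a stem into one set."""
    classes = defaultdict(set)
    for word, stem in stem_of.items():
        classes[stem].add(word)
    return classes


def stem_expand(term, stem_of, classes):
    """Two-way (all-to-all): any member expands to the whole class."""
    stem = stem_of.get(term)
    if stem is None:
        return {term}  # no stemming entry, so the term stands alone
    return set(classes[stem])


def thesaurus_expand(term, one_way):
    """One-way: only the left-hand term is expanded."""
    return {term} | one_way.get(term, set())


CLASSES = build_equivalence_classes(STEM_OF)
```

In this model, stem_expand returns the same set for truck and for trucks, whereas thesaurus_expand widens a search for vehicle but leaves a search for automobile unchanged. Stemming alone never relates vehicle to automobile.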

Note: The stemming implementation does not include decompounding. Decompounding is the ability to decompose a compound word (such as kindergarten) into its single-word components (kinder and garten) and then find occurrences based on those smaller words.
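To make clear what is excluded, the following minimal sketch shows what a naive decompounding step might look like. It is purely illustrative: the function decompound and the VOCAB set are hypothetical, and nothing like this is performed by the stemming feature.

```python
def decompound(word, vocabulary):
    """Return the first split of `word` into two vocabulary words, or None.

    Illustration only: the stemming feature does NOT do this.
    """
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return None


# Hypothetical vocabulary of known single words.
VOCAB = {"kinder", "garten"}
```

With this vocabulary, decompound("kindergarten", VOCAB) yields the components kinder and garten. Because stemming does not decompound, a search for garten is not expanded to match kindergarten.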

Supported languages for stemming

The languages supported for stemming are listed in the topic Supported languages.

Specify a language ID for each of your attributes by setting the mdex-property_Language property in the attribute's PDR (Property Description Record). At ingest time, the Dgraph creates a separate stemming dictionary for each configured language. The dictionaries are stored in the Endeca data domain and cannot be modified by users.
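The idea of one stemming dictionary per configured language can be modeled roughly as follows. This sketch makes several assumptions: the attribute names, the language IDs en and de, the stem entries, and both helper functions are hypothetical, and the real dictionaries are built and stored internally by the Dgraph.

```python
# Hypothetical attribute configuration: attribute name -> language ID.
ATTRIBUTE_LANGUAGE = {"ProductName": "en", "Beschreibung": "de"}

# Hypothetical stem entries available for each language.
STEMS = {
    "en": {"shirt": "shirt", "shirts": "shirt"},
    "de": {"hemd": "hemd", "hemden": "hemd"},
}


def dictionaries_for(attribute_language, stems):
    """Build one dictionary for each language that is actually configured."""
    configured = set(attribute_language.values())
    return {lang: dict(stems.get(lang, {})) for lang in configured}


def stem_for_attribute(attribute, word, attribute_language, dictionaries):
    """Look up a word in the dictionary for the attribute's language."""
    lang = attribute_language[attribute]
    key = word.lower()
    # Words without an entry are returned unchanged.
    return dictionaries[lang].get(key, key)


DICTIONARIES = dictionaries_for(ATTRIBUTE_LANGUAGE, STEMS)
```

In this model, a word is stemmed only against the dictionary of the language assigned to its attribute, so the English entry for shirts has no effect on an attribute configured as German.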