Stemming and thesaurus equivalences generally introduce little memory overhead (beyond the amount of memory required to store the raw string forms of the equivalences).
In terms of online processing, both features expand the set of results for typical user queries.
While this generally slows search performance (search operations require an amount of time that grows linearly with the number of results), typically these additional results are a required part of the application behavior and cannot be avoided.
The overhead involved in matching the user query to thesaurus and stemming forms is generally low, but could slow performance in cases where a large thesaurus (tens of thousands of entries) is asked to process long search queries (dozens of terms).
Because matching for stemming entries is performed on a single-word basis, the cost for stemming-oriented query expansion does not grow with the size of the stemming database or with the length of the query. However, the stemming performance of a specific language is affected by the degree to which the language is inflected. For example, German nouns are much more inflected than English nouns.
To avoid performance problems related to expensive and non-useful thesaurus search query expansions, consider the following thesaurus clean-up rules.
Use --thesaurus_cutoff <limit> to set a limit on the number of words in a user’s search query that are subject to thesaurus replacement. The default value of <limit> is 3. Up to 3 words in a user’s search query can be replaced with thesaurus entries. If there are more terms in the query that match thesaurus entries, these terms are not replaced by thesaurus expansion. This option serves as a performance guard against very expensive thesaurus queries. Lower values improve thesaurus engine performance.
Do not create a two-way thesaurus entry for a word with multiple meanings. For example, khaki can refer to a color as well as to a style of pants. If you create a two-way thesaurus entry for khaki = pants, then a user’s search for khaki towels could return irrelevant results for pants.
Do not create a two-way thesaurus entry between a general and several more-specific terms, such as top = shirt = sweater = vest. This increases the number of results the user has to go through while reducing the overall accuracy of the items returned.
In this instance, better results are attained by creating individual one-way thesaurus entries between the general term top and each of the more specific terms.
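The difference between the two-way group and the one-way entries can be sketched in a few lines of Python. This is only an illustrative model, not the engine's implementation: the functions build_expansions and expand are hypothetical names, and the sketch assumes that a two-way entry expands every form to every other form while a one-way entry expands only the source term.

```python
from collections import defaultdict

def build_expansions(two_way=(), one_way=()):
    """Map each term to the set of extra terms a query for it expands to.

    two_way: iterable of groups such as ("top", "shirt", "sweater", "vest")
    one_way: iterable of (source, target) pairs such as ("top", "shirt")
    """
    expansions = defaultdict(set)
    for group in two_way:
        for term in group:
            # Every member of a two-way group expands to all other members.
            expansions[term].update(t for t in group if t != term)
    for source, target in one_way:
        # A one-way entry expands the source only, never the target.
        expansions[source].add(target)
    return expansions

def expand(term, expansions):
    """Return the term plus everything it expands to."""
    return {term} | expansions.get(term, set())

two_way = build_expansions(two_way=[("top", "shirt", "sweater", "vest")])
one_way = build_expansions(
    one_way=[("top", "shirt"), ("top", "sweater"), ("top", "vest")]
)

print(sorted(expand("shirt", two_way)))  # shirt also pulls in top, sweater, vest
print(sorted(expand("shirt", one_way)))  # shirt stays a search for shirt only
print(sorted(expand("top", one_way)))    # top still reaches all specific terms
```

Under this model, a search for the general term still reaches every specific term, while a search for a specific term is no longer widened to unrelated items.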
Use care when creating thesaurus entries that include a term that is a substring of another term in the entry. Consider the following example: a two-way equivalency between the forms "Adam and Eve" and "Eve".
If users type Eve, they get results for Eve or (Adam and Eve) (that is, the same results they would have gotten for Eve without the thesaurus). If users type Adam and Eve, they get results for (Adam and Eve) or Eve, causing the Adam part of the query to be ignored.
There are times when this behavior might be desirable (such as in an equivalency between George Washington and Washington), but not always.
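The effect of the substring entry can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not the engine's matching logic: the record list is made up, and multi-word forms are treated as exact word-sequence phrases.

```python
# Hypothetical record titles used only for this illustration.
records = ["Adam and Eve", "Eve of destruction", "Adam Smith", "All about Eve"]

def matches(record, form):
    """True if the words of form appear consecutively in record."""
    rec = record.lower().split()
    words = form.lower().split()
    n = len(words)
    return any(rec[i:i + n] == words for i in range(len(rec) - n + 1))

def search(forms):
    """Return records matching any of the forms (an OR over the forms)."""
    return [r for r in records if any(matches(r, f) for f in forms)]

# Query "Adam and Eve" without the thesaurus entry.
without_thesaurus = search(["Adam and Eve"])

# Same query with the two-way entry applied: (Adam and Eve) or Eve.
with_thesaurus = search(["Adam and Eve", "Eve"])

# Query "Eve" alone, for comparison.
eve_only = search(["Eve"])

print(without_thesaurus)
print(with_thesaurus)
print(with_thesaurus == eve_only)
```

In this model the expanded query returns exactly what a search for Eve alone returns, because every record that matches the longer form also matches the shorter one.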
Do not use stop words such as "and" or "the" in single-word thesaurus forms.
For example, if "the" has been configured as a stop word, a thesaurus equivalency between "thee" and "the" is not useful.
You can use stop words in multi-word thesaurus forms, because multi-word thesaurus forms are handled as phrases. In phrases, a stop word is treated as a literal word and not a stop word.
Avoid multi-word thesaurus forms where single-word forms are appropriate.
In particular, avoid multi-word forms that are not phrases that users are likely to type, or to which phrase expansion is likely to provide relevant additional results. For example, the two-way thesaurus entry Aethelstan, King Of England (D. 939) = Athelstan, King Of England (D. 939) should be replaced with the single-word form Aethelstan = Athelstan.
Thesaurus forms should not use non-searchable characters. For example, the one-way thesaurus entry Pikes Peak > Pike’s Peak should only be used if apostrophe (’) is enabled as a search character.
Use --thesaurus_multiword_nostem to specify that words in a multiple-word thesaurus form should be treated like phrases and should not be stemmed. This may increase performance for some query loads. Single-word terms will be subject to stemming regardless of whether this flag is specified.
This flag prevents the Dgraph from expanding multi-word thesaurus forms by stemming. Thesaurus entries continue to match any stemmed form in the query, but multi-word expansions only include explicitly listed forms. To get the multi-word stemmed thesaurus expansions, the various forms must be listed explicitly in the thesaurus.
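Several of the clean-up rules above are mechanical enough to check with a script before a thesaurus is deployed. The following Python sketch is one possible approach, not a tool that ships with the product: lint_entry, contains, and the searchable_extra parameter are hypothetical names, entries are assumed to be plain lists of forms, and only alphanumeric characters are assumed to be searchable unless others are passed in explicitly.

```python
def contains(long_words, short_words):
    """True if short_words occurs as a consecutive run inside long_words."""
    n = len(short_words)
    return any(long_words[i:i + n] == short_words
               for i in range(len(long_words) - n + 1))

def lint_entry(forms, stop_words, searchable_extra=""):
    """Return a list of warnings for one thesaurus entry.

    forms: the forms in the entry, e.g. ["Adam and Eve", "Eve"]
    stop_words: set of lowercase words configured as stop words
    searchable_extra: extra characters enabled as search characters
    """
    warnings = []
    tokenized = [form.lower().split() for form in forms]

    for form, words in zip(forms, tokenized):
        # Rule: no stop words as single-word forms.
        if len(words) == 1 and words[0] in stop_words:
            warnings.append(f"stop word used as single-word form: {form!r}")
        # Rule: no non-searchable characters in a form.
        bad = {c for c in form
               if not (c.isalnum() or c.isspace() or c in searchable_extra)}
        if bad:
            warnings.append(
                f"non-searchable characters {sorted(bad)} in {form!r}")

    # Rule: flag forms that are contained in a longer form of the same entry.
    for i, short in enumerate(tokenized):
        for j, long_ in enumerate(tokenized):
            if i != j and len(short) < len(long_) and contains(long_, short):
                warnings.append(
                    f"{forms[i]!r} is contained in {forms[j]!r}")
    return warnings

stop_words = {"and", "the"}
for entry in (["Adam and Eve", "Eve"], ["thee", "the"],
              ["Pikes Peak", "Pike's Peak"]):
    print(entry, lint_entry(entry, stop_words))
```

Warnings from such a check are prompts for review rather than errors: a contained form can be intentional, as in the George Washington example, and an apostrophe is acceptable once it has been enabled as a search character.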