To avoid performance problems caused by expensive thesaurus query expansions
that add little value, consider the following thesaurus clean-up
rules.
- Use --thesaurus_cutoff <limit> to limit the number of words in a user’s
search query that are subject to thesaurus replacement. The default value of
<limit> is 3, so up to 3 words in a user’s search query can be replaced with
thesaurus entries. If more terms in the query match thesaurus entries, the
additional terms are not replaced by thesaurus expansion. This option serves
as a performance guard against very expensive thesaurus queries. Lower values
improve thesaurus engine performance.
- Do not create a two-way
thesaurus entry for a word with multiple meanings. For example,
khaki can refer to a color as well as to a style of pants. If
you create a two-way thesaurus entry for
khaki = pants, then a user’s search for
khaki towels could return irrelevant results for pants.
- Do not create a two-way
thesaurus entry between a general term and several more-specific terms, such as
top = shirt = sweater = vest. This increases the number of
results the user has to go through while reducing the overall accuracy of the
items returned.
In this instance, better results are attained by creating individual
one-way thesaurus entries between the general term top and each of the more
specific terms.
- Use care when creating
thesaurus entries that include a term that is a substring of another term in
the entry. Consider the following example with a two-way equivalency between
the phrase Adam and Eve and the single word Eve.
If users type
Eve, they get results for Eve or (Adam and Eve) (that
is, the same results they would have gotten for
Eve without the thesaurus). If users type
Adam and Eve, they get results for (Adam and Eve) or
Eve, causing the Adam part of the query to be ignored.
There are times when this behavior might be desirable (such as in an
equivalency between
George Washington and
Washington), but not always.
- Do not use stop words such
as "and" or "the" in single-word thesaurus forms.
For example, if "the" has been configured as a stop word, a thesaurus
equivalency between "thee" and "the" is not useful.
You can use stop words in multi-word thesaurus forms, because
multi-word thesaurus forms are handled as phrases. In phrases, a stop word is
treated as a literal word and not a stop word.
- Avoid multi-word thesaurus
forms where single-word forms are appropriate.
In particular, avoid multi-word forms that are not phrases that
users are likely to type, or to which phrase expansion is likely to provide
relevant additional results. For example, the two-way thesaurus entry
Aethelstan, King Of England (D. 939) = Athelstan, King Of England
(D. 939) should be replaced with the single-word form
Aethelstan = Athelstan.
- Thesaurus forms should not
use non-searchable characters. For example, the one-way thesaurus entry
Pikes Peak > Pike’s Peak should only be used if the apostrophe
(’) is enabled as a search character.
- Use
--thesaurus_multiword_nostem to specify that
words in a multi-word thesaurus form should be treated like phrases and
should not be stemmed. This may improve performance for some query loads.
Single-word terms will be subject to stemming regardless of whether this flag
is specified.
This flag prevents the Dgraph from expanding multi-word thesaurus
forms by stemming. Thesaurus entries continue to match any stemmed form in the
query, but multi-word expansions only include explicitly listed forms. To get
the multi-word stemmed thesaurus expansions, the various forms must be listed
explicitly in the thesaurus.
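
The effect of the thesaurus cutoff described above can be pictured with a
small Python sketch. It is an illustrative model only, not the engine's
implementation: the function name expand_query, the dictionary-based
thesaurus, and the choice to spend the cutoff on the first matching words in
query order are all assumptions made for the example.

```python
def expand_query(query, thesaurus, cutoff=3):
    """Return one list of alternatives per query word.

    At most `cutoff` words are expanded with thesaurus forms; any further
    matching words are kept exactly as typed. Choosing the first matches in
    query order is an assumption of this sketch.
    """
    expanded = []
    replaced = 0
    for word in query.lower().split():
        synonyms = thesaurus.get(word)
        if synonyms and replaced < cutoff:
            # The word is searched as itself OR any of its thesaurus forms.
            expanded.append([word, *synonyms])
            replaced += 1
        else:
            # No thesaurus entry, or the cutoff is already used up.
            expanded.append([word])
    return expanded


# Hypothetical thesaurus used only for illustration.
example_thesaurus = {
    "khaki": ["tan"],
    "top": ["shirt"],
    "pants": ["trousers"],
    "vest": ["waistcoat"],
}
```

With the default cutoff of 3, a four-word query in which every word has a
thesaurus entry would have only three of its words expanded in this model,
which is why lower cutoff values keep the expanded query smaller.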
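
Several of the clean-up rules above lend themselves to an automated check
before a thesaurus is deployed. The following Python sketch is one possible
way to do that, under stated assumptions: the function names, the
representation of an entry as a list of forms, the stop-word list, and the
set of searchable characters are all hypothetical and would need to match
your own configuration.

```python
import re

# Hypothetical configuration; replace with your real stop words and
# searchable characters.
STOP_WORDS = {"and", "the"}
SEARCHABLE = re.compile(r"[a-z0-9 ]*")  # assumes only letters, digits, spaces


def contains_words(longer, shorter):
    """True if the words of `shorter` appear consecutively in `longer`."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def lint_entry(forms):
    """Return warnings for one thesaurus entry, given as a list of forms."""
    warnings = []
    normalized = [form.lower() for form in forms]
    for form in normalized:
        # Rule: no stop words as single-word forms.
        if len(form.split()) == 1 and form in STOP_WORDS:
            warnings.append(f"stop word used as single-word form: {form}")
        # Rule: no non-searchable characters in forms.
        if not SEARCHABLE.fullmatch(form):
            warnings.append(f"non-searchable character in form: {form}")
    # Rule: take care when one form is contained in another form.
    for a in normalized:
        for b in normalized:
            if a != b and contains_words(b, a):
                warnings.append(f"form '{a}' is contained in form '{b}'")
    return warnings
```

For example, lint_entry(["Adam and Eve", "Eve"]) reports that one form is
contained in the other. Such a warning is a prompt for review and not an
error, because the behavior is sometimes desirable.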