About the thesaurus feature

The thesaurus feature allows you to configure rules for matching queries to text containing equivalent words or concepts.

The thesaurus is intended for specifying concept-level mappings between words and phrases. Even a modest number of well-thought-out thesaurus entries can greatly improve your users’ search experience.

Note: Only one global thesaurus is supported for an Endeca data domain. In other words, you cannot define language-specific thesauruses (for example, one thesaurus for English and a second for French).

The thesaurus feature operates at a higher level than the stemming feature: thesaurus matching and query expansion respect stemming equivalences, whereas the stemming module is unaware of thesaurus equivalences.

For example, if you define a thesaurus entry mapping the words automobile and car, and there is a stemming equivalence between car and cars, then a search for automobile will return matches for automobile, car, and cars. The same results will also be returned for the queries car and cars.
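The layering described above can be sketched in a few lines of Python. This is only an illustrative model, not the Oracle Endeca Server API or its configuration format: the names STEM_GROUPS, THESAURUS, stem_forms, and expand are hypothetical, and the thesaurus entry is assumed to be two-way.

```python
# Hypothetical toy data; this is not the Endeca Server configuration format.
STEM_GROUPS = [{"car", "cars"}]        # stemming equivalence
THESAURUS = [{"automobile", "car"}]    # thesaurus entry, assumed two-way


def stem_forms(word):
    """Return the word plus every form that stemming treats as equivalent."""
    forms = {word}
    for group in STEM_GROUPS:
        if word in group:
            forms |= group
    return forms


def expand(word):
    """Expand a one-word query: the thesaurus is applied to stemmed forms,
    and each synonym it produces is stemmed as well."""
    matched = set()
    for form in stem_forms(word):
        matched.add(form)
        for entry in THESAURUS:
            if form in entry:
                for synonym in entry:
                    matched |= stem_forms(synonym)
    return matched


for query in ("automobile", "car", "cars"):
    print(query, "->", sorted(expand(query)))
```

In this model, all three queries expand to the same set of terms (automobile, car, and cars), because the thesaurus lookup sees the stemming equivalences while stem_forms never consults the thesaurus.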

The thesaurus supports specifying multi-word equivalences. For example, an equivalence might specify that the phrase Mark Twain is interchangeable with the phrase Samuel Clemens. It is also possible to mix the number of words in the phrase-forms for a single equivalence. For example, you can specify that wine opener is equivalent to corkscrew.

Multi-word equivalences are matched on a phrase basis. For example, if a thesaurus equivalence between wine opener and corkscrew is defined, then a search for corkscrew will match the text stainless steel wine opener, but will not match the text an effective opener for wine casks.
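As a rough illustration of phrase-basis matching, the following Python sketch treats a multi-word form as matching only when its words appear next to each other and in order. The names EQUIVALENTS, contains_phrase, and matches are hypothetical, and whitespace tokenization is a simplifying assumption rather than the server's actual text processing.

```python
# Hypothetical equivalence table; not the Endeca Server configuration format.
EQUIVALENTS = {"corkscrew": ["wine opener"]}


def contains_phrase(text, phrase):
    """True if the words of `phrase` occur contiguously and in order in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


def matches(query, text):
    """True if the query, or any phrase equivalent to it, occurs in the text."""
    forms = [query] + EQUIVALENTS.get(query, [])
    return any(contains_phrase(text, form) for form in forms)


print(matches("corkscrew", "stainless steel wine opener"))
print(matches("corkscrew", "an effective opener for wine casks"))
```

The first call finds the contiguous phrase wine opener and reports a match. The second does not, because opener and wine appear in the text but not as the phrase wine opener.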

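Thesaurus equivalences can be either one-way or two-way. A one-way equivalence expands a term to its equivalent forms without expanding those forms back to the original term, whereas a two-way equivalence applies in both directions.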

Unlike the stemming module, the thesaurus feature lets you define multiple equivalences for a single word or phrase. These multiple equivalences are considered independent and non-transitive.

For example, you might define one equivalence between football and NFL, and another between football and soccer. With these two equivalences, a search for NFL will return hits for NFL and football, a search for soccer will return hits for soccer and football, and a search for football will return all of the hits for football, NFL, and soccer. However, a search for NFL will not return hits for soccer (and vice versa).

This non-transitive nature of the thesaurus is useful for defining equivalences containing ambiguous terms such as football. The word football is sometimes used interchangeably with soccer, but in other cases football refers to American football, which is played professionally in the NFL. In other words, the term football is ambiguous.

When you define equivalences for ambiguous terms, you do not want their specific meanings to overlap into one another. People searching for soccer do not want hits for NFL, but they may want at least some of the hits associated with the more general term football.
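The football example can be modeled with a short Python sketch in which each equivalence is applied exactly once and the resulting terms are never expanded again. The names EQUIVALENCES and expand are hypothetical, and both entries are assumed to be two-way.

```python
# Hypothetical two-way entries; each pair is an independent equivalence.
EQUIVALENCES = [("football", "NFL"), ("football", "soccer")]


def expand(term):
    """Apply each equivalence once. Terms produced by an equivalence are
    not expanded again, so the expansion is non-transitive."""
    forms = {term}
    for left, right in EQUIVALENCES:
        if term == left:
            forms.add(right)
        elif term == right:
            forms.add(left)
    return forms


for query in ("NFL", "soccer", "football"):
    print(query, "->", sorted(expand(query)))
```

Because the expansion stops after one step, NFL reaches football but never soccer, while football reaches both specific terms.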

Thesaurus entries are essentially used to produce alternate forms of the user query, which in turn are used to produce additional query results. Note that at most three terms in a single search query are subject to thesaurus replacement. If more than three words in the query match thesaurus entries, the additional words are not expanded by the thesaurus engine. This thesaurus-expansion limit cannot be changed.
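The expansion limit can be pictured with the following Python sketch. The text does not say which three terms are chosen when more than three match, so the left-to-right selection here is purely an assumption made for illustration; the names MAX_EXPANDED_TERMS, THESAURUS, and terms_to_expand are likewise hypothetical.

```python
MAX_EXPANDED_TERMS = 3  # the fixed limit described in the text

# Hypothetical single-word entries, used only to decide which words match.
THESAURUS = {
    "car": ["automobile"],
    "sofa": ["couch"],
    "tv": ["television"],
    "phone": ["telephone"],
}


def terms_to_expand(query):
    """Return the query words that would be expanded by the thesaurus.

    Assumption: when more than three words match, the first three in
    query order are used. The source does not specify the selection order.
    """
    matching = [word for word in query.split() if word in THESAURUS]
    return matching[:MAX_EXPANDED_TERMS]


print(terms_to_expand("car sofa tv phone"))
```

Under that assumption, the four-word query above has four matching words, but only car, sofa, and tv are expanded; phone is searched as typed.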

This behavior is particularly important in the presence of overlapping thesaurus forms. For example, suppose that you define an equivalence between red wine and vino rosso, and a second equivalence between wine opener and corkscrew. The query red wine opener might match the thesaurus entries in two different ways: red wine could be mapped to vino rosso based on the first entry; or wine opener could be mapped to corkscrew based on the second entry.

The maximal-expansion rule resolves this ambiguity by expanding to all possible queries. In other words, the Oracle Endeca Server returns hits for all of the queries: red wine opener, vino rosso opener, and red corkscrew.
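The following Python sketch shows one way to enumerate those alternate queries. It is a simplified model, not the server's implementation: the names ENTRIES and alternate_queries are hypothetical, both entries are assumed to be two-way, and each alternate query applies a single phrase replacement to the original query.

```python
# Hypothetical two-way entries taken from the example in the text.
ENTRIES = [("red wine", "vino rosso"), ("wine opener", "corkscrew")]


def alternate_queries(query):
    """Return the original query plus one rewritten query for every place
    where a thesaurus form occurs as a contiguous phrase."""
    words = query.split()
    results = [query]
    for left, right in ENTRIES:
        # Try the entry in both directions, since it is assumed two-way.
        for source, target in ((left, right), (right, left)):
            src = source.split()
            for i in range(len(words) - len(src) + 1):
                if words[i:i + len(src)] == src:
                    rewritten = " ".join(
                        words[:i] + target.split() + words[i + len(src):]
                    )
                    if rewritten not in results:
                        results.append(rewritten)
    return results


print(alternate_queries("red wine opener"))
```

For the query red wine opener, the sketch produces the three queries listed above. The two overlapping forms are never applied together, because each rewrite starts from the original query.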