As core features of the Oracle Endeca Server search subsystem, stemming and the thesaurus interact with other search features.
The following sections describe the types of interactions between the various search features.
The search character set configured for the application dictates the set of available characters for stemming and thesaurus entries. By default, only alphanumeric ASCII characters may be used in stemming and thesaurus entries. Additional punctuation and other special characters may be enabled for use in stemming and thesaurus entries by adding these characters to the search character set.
The Oracle Endeca Server matches user query terms to thesaurus forms using the following rule: all alphanumeric and search characters must match against the stemming and thesaurus forms exactly; other characters in the user search query are treated as word delimiters. For details on search characters, see Search Characters.
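The delimiter rule above can be illustrated with a minimal Python sketch. It is not the Oracle Endeca Server implementation; the function name tokenize and the search_chars parameter are hypothetical and exist only for this example.

```python
def tokenize(query, search_chars=""):
    """Split a query into terms.

    Alphanumeric ASCII characters and any configured search characters
    are kept as part of a term; every other character acts as a word
    delimiter. (Hypothetical helper, for illustration only.)
    """
    allowed = set(search_chars)
    terms, current = [], []
    for ch in query:
        if (ch.isascii() and ch.isalnum()) or ch in allowed:
            current.append(ch)
        elif current:
            # A non-search character ends the current term.
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


# With the default character set, "+" is a delimiter.
print(tokenize("C++ compiler"))
# After "+" is added to the search characters, it becomes part of the term.
print(tokenize("C++ compiler", search_chars="+"))
```

In this sketch, a thesaurus entry for C++ could only ever match once + has been added to the search character set, because otherwise the query term is reduced to C before any matching takes place.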
Spelling correction is closely related to stemming and thesaurus functionality, because spelling auto-correction essentially provides an additional mechanism for computing alternate versions of the user query. In the Oracle Endeca Server's Dgraph process, spelling is handled as a higher-level feature than stemming and thesaurus. That is, spelling correction considers only the raw form of the user query when producing alternate query forms.
Alternate spell-corrected queries are then subject to all of the normal stemming and thesaurus processing. For example, if the user enters the query telvision and this query is spell-corrected to television, the results will also include results for the alternate forms televisions, tv, and tvs.
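The ordering described above (spelling correction on the raw query first, then stemming and thesaurus expansion of every resulting form) might be sketched as follows. The dictionaries and the function alternate_queries are hypothetical stand-ins, not Dgraph data structures, and keeping the raw query among the forms is an assumption of this sketch.

```python
# Hypothetical lookup tables standing in for the spelling dictionary and
# the combined stemming/thesaurus equivalences.
SPELLING = {"telvision": "television"}
EQUIVALENTS = {"television": ["televisions", "tv", "tvs"]}


def alternate_queries(raw_query):
    """Return the query forms to search, in the order they are produced."""
    # Step 1: spelling correction looks only at the raw user query.
    candidates = [raw_query]
    corrected = SPELLING.get(raw_query)
    if corrected is not None:
        candidates.append(corrected)

    # Step 2: each candidate, including the spell-corrected one, then
    # goes through stemming and thesaurus expansion.
    forms = []
    for candidate in candidates:
        for form in [candidate] + EQUIVALENTS.get(candidate, []):
            if form not in forms:
                forms.append(form)
    return forms


print(alternate_queries("telvision"))
```

Because expansion runs after correction, the misspelled query still reaches tv and tvs even though no thesaurus entry mentions telvision.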
Note that in some cases the thesaurus is used as a replacement for, or in addition to, the system's standard spelling correction features. In general, this technique is discouraged. The spelling correction subsystem handles the vast majority of misspelled user queries correctly. In rare cases, however, it cannot correct a particular misspelled query of interest, and it is common to add a thesaurus entry to handle the correction. Avoid such entries if at all possible, because they can lead to undesirable feature interactions.
Stop words are words configured to be ignored by the Oracle Endeca Server search query engine. A stop word list typically includes words that occur too frequently in the data to be useful (for example, the word bottle in a wine data set), as well as words that are too general (such as clothing in an apparel-only data set).
For example, if the word the is marked as a stop word, a query for the computer matches text that contains the word computer, whether or not that text also contains the word the.
Stop words are not currently expanded by the stemming and thesaurus equivalence set. For example, suppose you mark item as a stop word and also include a thesaurus equivalence between the words item and items. This will not automatically mark the word items as a stop word; such expansions must be applied manually.
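Since the expansion has to be applied by hand, one way to prepare the fuller stop word list is to compute it offline and then enter the result in the configuration. The sketch below is only an illustration of that idea; the data and the function expand_stop_words are hypothetical and do not correspond to any Oracle Endeca Server API.

```python
def expand_stop_words(stop_words, equivalences):
    """Return the stop words plus their direct thesaurus equivalents.

    The server does not do this automatically; the result is meant to be
    reviewed and added to the stop word list manually.
    """
    expanded = set(stop_words)
    for word in stop_words:
        expanded |= equivalences.get(word, set())
    return expanded


# Hypothetical configuration: item is a stop word, and the thesaurus
# treats item and items as equivalent.
configured_stop_words = {"item"}
thesaurus_equivalences = {"item": {"items"}, "items": {"item"}}

# As configured, items is not a stop word.
print("items" in configured_stop_words)
# The manually expanded list contains both forms.
print(sorted(expand_stop_words(configured_stop_words, thesaurus_equivalences)))
```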
Stop words are respected when matching thesaurus entries to user queries. For example, suppose you define an equivalence between Muhammad Ali and Cassius Clay and also mark M as a stop word (it is not uncommon to mark all or most single letter words as stop words). In this case, a query for Cassius M. Clay would match the thesaurus entry and return results for Muhammad Ali as expected.
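A minimal sketch of this behavior, assuming that stop words are dropped from the query before thesaurus phrases are matched, is shown below. The tables, the lowercasing, and the function thesaurus_alternates are assumptions made for the example rather than a description of the server's internals.

```python
import re

# Hypothetical configuration for the example.
STOP_WORDS = {"m"}
THESAURUS = {
    ("cassius", "clay"): ("muhammad", "ali"),
    ("muhammad", "ali"): ("cassius", "clay"),
}


def thesaurus_alternates(query):
    """Return alternate queries produced by thesaurus phrase matches."""
    # Non-alphanumeric characters (such as the period) delimit words.
    terms = re.findall(r"[a-z0-9]+", query.lower())
    # Stop words are removed before matching against thesaurus entries.
    terms = [t for t in terms if t not in STOP_WORDS]

    alternates = []
    for phrase, equivalent in THESAURUS.items():
        n = len(phrase)
        for i in range(len(terms) - n + 1):
            if tuple(terms[i:i + n]) == phrase:
                rewritten = terms[:i] + list(equivalent) + terms[i + n:]
                alternates.append(" ".join(rewritten))
    return alternates


print(thesaurus_alternates("Cassius M. Clay"))
```

Without the stop word, the terms cassius and clay would not be adjacent, and the two-word thesaurus phrase in this sketch would not match.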
A phrase search is a search query that contains one or more multi-word phrases enclosed in quotation marks. The words inside phrase-query terms are interpreted strictly literally and are not subject to stemming or thesaurus processing. For example, if you define a thesaurus equivalence between Jennifer Lopez and JLo, normal (unquoted) searches for Jennifer Lopez will also return results for JLo, but a quoted phrase search for "Jennifer Lopez" will not return the additional JLo results.
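The difference between quoted and unquoted queries can be sketched as below. For brevity the example assumes the whole query is either one quoted phrase or entirely unquoted, which is narrower than the general case described above; the table and the function search_forms are hypothetical.

```python
import re

# Hypothetical two-way thesaurus equivalence.
THESAURUS = {
    "jennifer lopez": ["jlo"],
    "jlo": ["jennifer lopez"],
}


def search_forms(query):
    """Return the forms that would be searched for a query."""
    stripped = query.strip()
    quoted = re.fullmatch(r'"([^"]+)"', stripped)
    if quoted:
        # Phrase search: the words are taken literally, with no
        # stemming or thesaurus expansion.
        return [quoted.group(1).lower()]
    key = stripped.lower()
    # Normal search: the query plus any thesaurus equivalents.
    return [key] + THESAURUS.get(key, [])


print(search_forms('Jennifer Lopez'))
print(search_forms('"Jennifer Lopez"'))
```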
It is typically desirable to return results for the actual user query ahead of results for versions of the query that were transformed by stemming or the thesaurus. This type of result ordering is supported by the Relevance Ranking modules. In particular, the module that is affected by thesaurus expansion and stemming is Interp. The module that is not affected by thesaurus and stemming is Freq.
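The desired ordering can be pictured with a small sketch. This is not how the Interp module is implemented; it only shows the effect of placing matches on the original query ahead of matches on transformed forms, using made-up records and a hypothetical function name.

```python
def order_by_original_first(results):
    """Sort results so matches on the original query form come first.

    Each result is a (record_id, matched_form, is_original) tuple.
    Python's sort is stable, so the existing order is preserved within
    each group.
    """
    return sorted(results, key=lambda result: not result[2])


# Made-up results for the query television.
results = [
    ("rec-1", "tv", False),
    ("rec-2", "television", True),
    ("rec-3", "televisions", False),
    ("rec-4", "television", True),
]

for record_id, matched_form, _ in order_by_original_first(results):
    print(record_id, matched_form)
```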