Search query processing order

This section summarizes how the Dgraph process of the Oracle Endeca Server processes record search queries.

While this summary is not exhaustive, it covers the processing steps likely to occur in most application contexts. The process outlined here assumes that other features (such as spelling correction and thesaurus) are being used.

The Dgraph process uses the following high-level steps to process record search queries:
  1. Record filtering
  2. Tokenization
  3. Spelling correction
  4. Thesaurus expansion
  5. Stemming
  6. Primitive term and phrase lookup
  7. Did You Mean
  8. Navigation filtering
  9. EQL
  10. Relevance ranking
Note: For Boolean search queries, tokenization, spelling correction, and thesaurus expansion are replaced with a separate parsing phase.

Step 1: Record filtering

If a record filter is specified, whether for security or any other reason, Endeca Server applies it before any search processing. The result is that the search query is performed as if the data set only contained records allowed by the record filter.
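The ordering described above can be illustrated with a minimal Python sketch. It is not Endeca Server code: the record layout and the names `apply_record_filter` and `search` are hypothetical, chosen only to show that the search sees nothing outside the filtered set.

```python
def apply_record_filter(records, record_filter=None):
    """Return only the records the filter allows (all records if no filter)."""
    if record_filter is None:
        return list(records)
    return [r for r in records if record_filter(r)]


def search(records, term):
    """Toy search: keep records whose text contains the term as a whole word."""
    term = term.lower()
    return [r for r in records if term in r["text"].lower().split()]


# Hypothetical data set; the "region" attribute stands in for a security rule.
records = [
    {"id": 1, "region": "east", "text": "laptop sale"},
    {"id": 2, "region": "west", "text": "laptop repair"},
    {"id": 3, "region": "east", "text": "desktop sale"},
]

# The filter runs first, so the search behaves as if record 2 did not exist.
visible = apply_record_filter(records, lambda r: r["region"] == "east")
hits = search(visible, "laptop")
print([r["id"] for r in hits])  # [1]
```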

Step 2: Tokenization

Tokenization is the process by which the Dgraph analyzes the search query string, yielding a sequence of distinct query terms.
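As a rough illustration only, tokenization can be pictured as splitting the query string into word-like terms. The sketch below assumes a very simple rule (runs of letters, digits, and underscores, lowercased); the tokenization rules the Dgraph actually applies are not described in this section and are likely more involved. The function name `tokenize` is hypothetical.

```python
import re


def tokenize(query):
    """Split a query string into lowercase word-like terms (simplified rule)."""
    return re.findall(r"\w+", query.lower())


terms = tokenize("Employee  MORAL!")
print(terms)  # ['employee', 'moral']
```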

Step 3: Spelling correction

If spelling correction is enabled and triggered, the Dgraph performs it as part of record search processing. The Dgraph creates spelling suggestions by enumerating, for each query term, a set of alternatives, and then considering some of the combinations of term alternatives as whole-query alternatives. Each of these whole-query alternatives is subject to thesaurus expansion and stemming.

For example, if the tokenized query is employee moral, then employee may generate the set of alternatives {employer, employee, employed}, while moral may generate the set of alternatives {moral, morale}.

The two query alternatives generated as spelling suggestions might be employer moral and employee morale.
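The enumeration in this example can be sketched in Python. The sketch below lists every combination of per-term alternatives other than the original query. The Dgraph considers only some of these combinations, and the criteria it uses to choose among them are not described here, so no selection step is modeled. The names `alternatives` and `whole_query_alternatives` are hypothetical.

```python
from itertools import product

# Per-term alternatives taken from the example above.
alternatives = {
    "employee": ["employer", "employee", "employed"],
    "moral": ["moral", "morale"],
}


def whole_query_alternatives(terms, alternatives):
    """Combine per-term alternatives into candidate whole-query alternatives."""
    per_term = [alternatives.get(t, [t]) for t in terms]
    return [
        " ".join(combo)
        for combo in product(*per_term)
        if list(combo) != terms  # skip the query exactly as the user typed it
    ]


candidates = whole_query_alternatives(["employee", "moral"], alternatives)
print(len(candidates))  # 5
```

Both alternatives named in the example, employer moral and employee morale, appear among the five candidates.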

For details on the auto-correction feature, see Spelling Correction and Did You Mean.

Step 4: Thesaurus expansion

The tokenized query, as well as each query alternative generated by spelling suggestion, is expanded by the Dgraph based on thesaurus matches. Thesaurus expansion replaces each expanded query term with an OR of alternatives.

For example, if the thesaurus expands pentium to intel and laptop to notebook, then the query pentium laptop will be expanded to:
(pentium OR intel) AND (laptop OR notebook)

This assumes the match mode is All. The other match modes (with the exception of Boolean) behave analogously.
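A minimal Python sketch of single-word expansion under the All match mode follows. It only builds the expanded expression as a string; the dictionary-based thesaurus and the name `expand_terms` are assumptions made for illustration, not the Endeca Server representation.

```python
def expand_terms(terms, thesaurus):
    """Replace each term that has thesaurus entries with an OR of alternatives."""
    groups = []
    for term in terms:
        alts = [term] + thesaurus.get(term, [])
        if len(alts) > 1:
            groups.append("(" + " OR ".join(alts) + ")")
        else:
            groups.append(term)
    # Match mode All: every term (or one of its alternatives) must match.
    return " AND ".join(groups)


thesaurus = {"pentium": ["intel"], "laptop": ["notebook"]}
expanded = expand_terms(["pentium", "laptop"], thesaurus)
print(expanded)  # (pentium OR intel) AND (laptop OR notebook)
```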

If there is a multiple-word thesaurus match, then OR is used on the query itself to accommodate the various ways of partitioning the query terms.

For example, if high speed expands to performance, then the query high speed laptop will be expanded to:
(high AND speed AND (laptop OR notebook)) OR (performance AND (laptop OR notebook))

Multiple-word thesaurus matches only apply when the words appear in exact sequence in the query. The queries speed high laptop and high laptop speed do not activate the expansion to performance.
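The exact-sequence rule can be sketched as follows. This Python example assumes a hypothetical `phrase_thesaurus` that maps a tuple of words to a single replacement term, and it builds one AND clause per way of partitioning the query. It is an illustration of the behavior described above, not the Dgraph's algorithm.

```python
def expand_term(term, thesaurus):
    """Single-word expansion: an OR of the term and its alternatives."""
    alts = [term] + thesaurus.get(term, [])
    return "(" + " OR ".join(alts) + ")" if len(alts) > 1 else term


def find_phrase(terms, phrase):
    """Return the index where phrase occurs contiguously in terms, or -1."""
    n = len(phrase)
    for i in range(len(terms) - n + 1):
        if terms[i:i + n] == phrase:
            return i
    return -1


def expand_query(terms, thesaurus, phrase_thesaurus):
    """OR together the original query and each phrase-substituted variant."""
    variants = [terms]
    for phrase, replacement in phrase_thesaurus.items():
        i = find_phrase(terms, list(phrase))
        if i >= 0:  # only an exact, in-order sequence triggers the match
            variants.append(terms[:i] + [replacement] + terms[i + len(phrase):])
    clauses = [" AND ".join(expand_term(t, thesaurus) for t in v) for v in variants]
    if len(clauses) == 1:
        return clauses[0]
    return " OR ".join("(" + c + ")" for c in clauses)


thesaurus = {"laptop": ["notebook"]}
phrase_thesaurus = {("high", "speed"): "performance"}

in_order = expand_query(["high", "speed", "laptop"], thesaurus, phrase_thesaurus)
reordered = expand_query(["speed", "high", "laptop"], thesaurus, phrase_thesaurus)
print(in_order)
print(reordered)
```

The reordered query keeps only its single-word expansion, because the words high and speed do not appear in the sequence the thesaurus entry requires.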

For more details on thesaurus expansion, see About the thesaurus feature.

Step 5: Stemming

Query terms, unless they are delimited with quotation marks to be treated as exact phrases, are expanded by the Dgraph using stemming. The expansion for stemming applies even to terms that are the result of thesaurus expansion. A stemmed query term is an OR expression of its word forms.

For example, if the query pentium laptop was thesaurus-expanded to:
(pentium OR intel) AND (laptop OR notebook)
it will be stemmed to:
(pentium OR intel) AND (laptop OR laptops OR notebook OR notebooks)
assuming that only the common nouns (laptop and notebook) have plurals in the word form dictionary.
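The stemming step in this example can be sketched in Python by treating the thesaurus-expanded query as a list of OR groups and adding word forms to each term. The `word_forms` dictionary and the function names are hypothetical stand-ins for the word form dictionary mentioned above.

```python
# Hypothetical word form dictionary: only laptop and notebook have plurals.
word_forms = {"laptop": ["laptops"], "notebook": ["notebooks"]}


def stem_group(group, word_forms):
    """Expand each term in an OR group with its word forms, keeping order."""
    out = []
    for term in group:
        out.append(term)
        out.extend(word_forms.get(term, []))
    return out


def render(groups):
    """Render a list of OR groups as an AND of parenthesized ORs."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)


# Thesaurus-expanded form of the query pentium laptop.
expanded = [["pentium", "intel"], ["laptop", "notebook"]]
stemmed = [stem_group(g, word_forms) for g in expanded]
print(render(stemmed))
# (pentium OR intel) AND (laptop OR laptops OR notebook OR notebooks)
```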

For more details on stemming, see About the stemming feature.

Step 6: Primitive term and phrase lookup

Primitive term and phrase lookup is the lowest level of search processing. The Dgraph evaluates each search term as-is, and matches it to the set of documents containing that precise word or phrase (given the tokenization rules) in the data files being searched. Search is never case-sensitive, even for phrases.
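A common way to picture this kind of lookup is an inverted index that maps each word to the documents containing it. The Python sketch below is an assumption-laden illustration (whitespace tokenization, in-memory dictionaries, hypothetical function names), not a description of the Dgraph's internal index. It lowercases both the indexed text and the query to mirror the case-insensitive behavior.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup_term(index, term):
    """Documents containing exactly this word, ignoring case."""
    return set(index.get(term.lower(), set()))


def lookup_phrase(docs, phrase):
    """Documents containing the words of the phrase in exact sequence."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        n = len(words)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits


docs = {1: "High Speed laptop", 2: "speed of a high laptop", 3: "Notebook"}
index = build_index(docs)
term_hits = lookup_term(index, "LAPTOP")
phrase_hits = lookup_phrase(docs, "high speed")
print(sorted(term_hits), sorted(phrase_hits))  # [1, 2] [1]
```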

Step 7: Did You Mean

The Dgraph performs "Did You Mean" processing as part of record search processing. It is analogous to spelling correction processing, except that the results for the suggested queries are not included in the result set; only the spelling suggestions themselves are returned.

For details on the "Did You Mean?" feature, see Spelling Correction and Did You Mean.

Step 8: Navigation filtering

The Dgraph performs all filtering based on the navigation state after the search processing. This order is important, because it ensures that the spelling suggestions remain consistent as the navigation state changes.
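The effect of this ordering can be sketched in Python. In the sketch, suggestions are computed during the search phase and navigation filtering is applied afterward, so the same suggestions come back regardless of the navigation state. The function names, the record layout, and the toy `suggest` lookup are all hypothetical.

```python
def suggest(term):
    """Toy spelling suggestions; a stand-in for the real suggestion logic."""
    return {"moral": ["morale"]}.get(term.lower(), [])


def run_query(records, term, nav_state, suggest):
    """Search first, then narrow the hits by the navigation state."""
    suggestions = suggest(term)  # computed before navigation filtering
    hits = [r for r in records if term.lower() in r["text"].lower().split()]
    filtered = [
        r for r in hits
        if all(r.get(key) == value for key, value in nav_state.items())
    ]
    return filtered, suggestions


records = [
    {"id": 1, "dept": "hr", "text": "employee moral survey"},
    {"id": 2, "dept": "legal", "text": "moral rights"},
]

hr_hits, hr_suggestions = run_query(records, "moral", {"dept": "hr"}, suggest)
legal_hits, legal_suggestions = run_query(records, "moral", {"dept": "legal"}, suggest)
print(hr_suggestions == legal_suggestions)  # True
```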

Step 9: EQL

The Endeca Query Language (EQL) builds on the core capabilities of Endeca Server to enable applications that examine aggregate information such as trends, statistics, analytical visualizations, comparisons, and so on, all within the Guided Navigation interface. If EQL is used, it is applied near the end of processing.

For more information about EQL, see the Oracle Endeca Server EQL Guide.

Step 10: Relevance ranking

Relevance ranking is the last step in the processing for the record search. Each of the navigation-filtered search results is assigned a relevance score, and the results are sorted in descending order of relevance.
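The final sort can be sketched in Python as follows. The scoring function here (a count of query term occurrences) is purely a placeholder assumption; how Endeca Server actually computes relevance scores is covered in the Relevance Ranking section, not here. The name `rank` is hypothetical.

```python
def rank(results, query_terms):
    """Sort results in descending order of a placeholder relevance score."""
    def score(record):
        tokens = record["text"].lower().split()
        return sum(tokens.count(term) for term in query_terms)

    # sorted() is stable, so records with equal scores keep their input order.
    return sorted(results, key=score, reverse=True)


# Hypothetical navigation-filtered search results.
results = [
    {"id": 1, "text": "laptop"},
    {"id": 2, "text": "laptop laptop notebook"},
    {"id": 3, "text": "notebook case"},
]

ranked = rank(results, ["laptop", "notebook"])
print([r["id"] for r in ranked])  # [2, 1, 3]
```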

For details on this feature, see the section Relevance Ranking.