Designed primarily for use with unstructured data, the
First module ranks documents by how close the query terms
are to the beginning of the document.
The First module groups its results into variably sized strata. The strata
are not the same size because, while the first word is probably more relevant
than the tenth word, the 301st word is probably not much more relevant than
the 310th word. This module takes advantage of the fact that the closer
something is to the beginning of a document, the more likely it is to be
relevant.
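As a rough illustration of variably sized strata, the sketch below maps a
word position to a stratum using boundaries that double in width. The
boundary values and the names `STRATUM_LAST_POSITION` and `stratum_for` are
assumptions made for this example; the actual boundaries used by the First
module are not stated here.

```python
from bisect import bisect_left

# Hypothetical boundaries: the last 1-based word position that falls in each
# stratum. Early strata are narrow and later strata are wide.
STRATUM_LAST_POSITION = [1, 3, 7, 15, 31, 63, 127, 255, 511]


def stratum_for(position: int) -> int:
    """Return the 0-based stratum index for a 1-based word position.

    Positions beyond the last boundary fall into one final, lowest stratum.
    """
    return bisect_left(STRATUM_LAST_POSITION, position)


if __name__ == "__main__":
    for pos in (1, 10, 301, 310):
        print(pos, stratum_for(pos))
```

With these assumed boundaries, the first and tenth words land in different
strata, while the 301st and 310th words share one.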
The First module works as follows:
- When the query has a single term, First's behavior is straightforward: it
retrieves the first absolute position of the word in the document, then
calculates which stratum contains that position. The score for this document
is based on that stratum; earlier strata are better than later strata.
- When the query has multiple terms, First determines the first absolute
position of each query term and then calculates the median of these
positions. This median is treated as the position of the query in the
document and is stratified as described in the single-term case.
- With query expansion (using stemming, spelling correction, or the
thesaurus), the First module treats expanded terms as if they occurred in the
source query. For example, the phrase glucose intolerence would be corrected
to glucose intolerance (with intolerence spell-corrected to intolerance).
First then continues as it does in the non-expansion case: the first position
of each term is computed and the median of these is taken.
- In a partially matched query, where only some of the query terms cause a
document to match, First behaves as if the intersection of terms that occur
in the document and terms that occur in the original query were the entire
query. For example, if the query cat bird dog is partially matched to a
document on the terms cat and bird, then the document is scored as if the
query were cat bird. If no terms match, then the document is scored in the
lowest stratum.
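The cases above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the module's implementation: the
stratum boundaries, the function name `first_stratum`, and the whitespace
tokenization are all hypothetical, and the choice of the lower median for an
even number of terms is an assumption because the text does not say how ties
are resolved. Query expansion is assumed to have been applied to the query
terms before the function is called.

```python
from bisect import bisect_left
from statistics import median_low

# Hypothetical boundaries: the last 1-based word position in each stratum.
STRATUM_LAST_POSITION = [1, 3, 7, 15, 31, 63, 127, 255, 511]
# Index of the final stratum, used for positions past the last boundary
# and for documents in which no query term occurs.
LOWEST_STRATUM = len(STRATUM_LAST_POSITION)


def first_stratum(doc_words, query_terms):
    """Return the 0-based stratum for a document (lower is better)."""
    # Record the first 1-based position of every word in the document.
    first_pos = {}
    for pos, word in enumerate(doc_words, start=1):
        first_pos.setdefault(word, pos)

    # Partial match: keep only the query terms that occur in the document.
    matched = [first_pos[t] for t in query_terms if t in first_pos]
    if not matched:
        return LOWEST_STRATUM

    # A single term yields its own position; several terms yield the median.
    # median_low is an assumed tie-break for an even number of terms.
    query_pos = median_low(matched)
    return bisect_left(STRATUM_LAST_POSITION, query_pos)


if __name__ == "__main__":
    doc = "the cat sat near a bird while rain fell".split()
    print(first_stratum(doc, ["bird"]))
    print(first_stratum(doc, ["cat", "bird", "dog"]))
    print(first_stratum(doc, ["dog"]))
```

In the usage example, the query cat bird dog is scored as if it were cat
bird, because dog does not occur in the document.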
Note: The First module does not work with Boolean searches, cross-field
matching, or wildcard search. It assigns all such matches a score of zero.