This section summarizes the fundamental data structure that represents the user’s query, called a query term vector. ATG Search identifies the tokens, terms, compound terms, phrases, and normalizations that appear in the input. The end result is a sequence of query items. This section describes the items included, and the diagram that follows provides additional details.
The query sequence contains:
The primary index term of the surface form, which could be a simple term, a compound term, or a non-compositional phrase. For example, in the figure that follows the 6th item is a non-compositional phrase, logged in, with an index term of log_in, which signifies that it is treated as a single unit. The 12th item is a compound term, service pack, with an index term of service_pack, also signifying it is treated as a unit.
Morphological and alphabetic case information about the original query term, as well as the part of speech information. For example, the 8th item is administrator, which has a morphological ending of +or and a part-of-speech of noun.
Term expansions from the thesaurus, using the part of speech information. For instance, the 10th item in the example is installing, which has three term expansions (add, set_up, and installment). The expansion set_up is a non-compositional phrase as well. The width of the term expansion boxes in the figure indicates the strength of the link between the surface term and the expansions. In this case, add and set_up have a strongly-related link, whereas installment has a moderately-related link. Compositional phrases are recognized and used to add in additional term expansions, but in this example, none are found.
The term weight of the query item is computed based on the frequency of the surface index term plus any additional equivalent terms in the expansion. For example, the 10th item, installing, has a document frequency of 2300, which translates into a weight of 26 out of 100. In this example, the terms with very low weight have been explicitly weighted in the dictionary and are unaffected by their frequency. The 11th item, a, is the only true stop-word, with a weight of 0. It is completely ignored in this query, although it could be significant within a larger double-quoted string (see the Literal Constraint section in the User-Entered Query Operators chapter).
The query item can also hold information about query operators, discussed in the User-Entered Query Operators chapter. For example, the 8th item, administrator, was double-quoted, which means it is constrained to match literally and will not match administrate. Also, the 6th item, logged in, was preceded by the simple Boolean operator +, which means that results are required to have this term (or any of its expansions).
Once constructed, the query term vector contains the complete information necessary to execute the query and retrieve results.
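The items described above can be pictured as a small record type. The following Python sketch is illustrative only and is not ATG Search code: the names QueryItem, Expansion, LinkStrength, active_items, and acceptable_terms are hypothetical, the index terms shown for installing and administrator are assumed (the text does not state them), and the way a literal item drops its expansions is an assumption. Weights are filled in only where the text gives them, because the mapping from document frequency to weight is not specified here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class LinkStrength(Enum):
    """Strength of the link between a surface term and an expansion."""
    STRONG = "strongly-related"
    MODERATE = "moderately-related"


@dataclass
class Expansion:
    index_term: str
    strength: LinkStrength


@dataclass
class QueryItem:
    """Hypothetical model of one item in the query sequence."""
    surface: str                          # text as it appeared in the query
    index_term: str                       # primary index term, e.g. log_in
    part_of_speech: Optional[str] = None
    morph_ending: Optional[str] = None    # e.g. "+or"
    expansions: List[Expansion] = field(default_factory=list)
    weight: Optional[int] = None          # 0..100; None = not given in the text
    required: bool = False                # preceded by the + operator
    literal: bool = False                 # entered in double quotes


def active_items(vector: List[QueryItem]) -> List[QueryItem]:
    """Drop true stop-words (weight 0), which are ignored in the query."""
    return [item for item in vector if item.weight != 0]


def acceptable_terms(item: QueryItem) -> List[str]:
    """Terms that can satisfy the item: the index term or any expansion.

    Assumption: a literal (double-quoted) item matches only its surface form.
    """
    if item.literal:
        return [item.surface]
    return [item.index_term] + [e.index_term for e in item.expansions]


def example_vector() -> List[QueryItem]:
    """Four of the items from the example in the text."""
    return [
        QueryItem("logged in", "log_in", required=True),
        # Index term assumed; the text gives only the ending and part of speech.
        QueryItem("administrator", "administrator", part_of_speech="noun",
                  morph_ending="+or", literal=True),
        # Index term "install" is assumed; weight 26 comes from the text.
        QueryItem("installing", "install", weight=26, expansions=[
            Expansion("add", LinkStrength.STRONG),
            Expansion("set_up", LinkStrength.STRONG),
            Expansion("installment", LinkStrength.MODERATE),
        ]),
        QueryItem("a", "a", weight=0),    # the only true stop-word
    ]
```

Calling active_items(example_vector()) leaves out the stop-word a, and acceptable_terms on the installing item returns its index term followed by the three thesaurus expansions.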