Natural Language Processing Options

As described in the Query Concepts and Processes chapter, ATG Search uses its natural language components to process the query during search. The natural language components provide options that affect this processing. This section describes the major options, which are passed in as XML elements in the query as part of the <parserOptions> tag.

It is important to keep in mind that any query options you set must align with the selections you have made in Search Administration.

Selecting a Query Language

The query language determines which dictionary to use for processing. Only languages that were loaded when the content was indexed are valid. The format is:

<language>lang</language>

The lang value is the name of any valid language, and defaults to English.

Selecting a Result Language

ATG Search supports multiple languages within the same index, which allows separate language-specific searches on different document collections. It also supports cross-language searches, which involve searching in one language and retrieving results in one or more different languages.

A use case for this functionality is a shopping site whose customers might not speak or write the site language but can read prices and sizes. In this scenario, users could query in Spanish and still get useful results in English. To perform cross-language searches, both the query and target languages must be loaded into the index, along with special cross-language bridge data (see the ATG Search Administration Guide). At query time, the target language is specified with the following option:

<targetLanguage>lang</targetLanguage>

The lang value is the name of any valid language, or one of two special values:

All means that all languages with indexed content are target languages.
Same means that the target is the same language as the query. This is the default.

Multiple instances of this option are allowed; each instance adds a target language.
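
As an illustration, the Spanish-to-English scenario described above might be expressed as follows. This is a minimal sketch, not a definitive request: it assumes both languages and the cross-language bridge data are loaded into the index, the language names are written here as plain English names (the exact names accepted depend on your installation), and the rest of the query surrounding <parserOptions> is omitted.

```xml
<parserOptions>
  <!-- Language of the user's query text; selects the dictionary -->
  <language>Spanish</language>
  <!-- Language in which results are retrieved -->
  <targetLanguage>English</targetLanguage>
</parserOptions>
```

Adding a second <targetLanguage> element would add another target language, and a single <targetLanguage>All</targetLanguage> would target every language with indexed content.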

Modifying Spell Checking Options

ATG Search includes two spelling checkers, which are used with natural language processing. The first is an internal module, which uses the indexed content to analyze spelling errors. The second is a third-party library called Wintertree, which uses a dictionary of common terms to guide its analysis.

The internal module does not correct terms that exist in the content, including proper names and other special terms, and it only suggests corrections that exist in the content. Conversely, the third-party module does not correct terms that appear in its common term list, whether they appear in the content or not, and does suggest corrections from its common term list, even if they do not appear in the content.

By default, ATG Search uses both spelling modules to achieve spelling suggestions that reflect the content but are not hampered if the content is limited. The following option controls how these modules interact:

<spellChecker> value </spellChecker>

The value can be any one of the following:

internal—Use only the internal module.
wintertree—Use only the third-party spelling checker.
internal-wintertree—Use both modules (the default); prefer the internal module’s suggestions.
wintertree-internal—Use both modules; prefer the Wintertree module’s suggestions.
none—Perform no spelling correction.

Additional options allow you to control how spelling suggestions are returned:

spellMaxSuggestions—Controls how many suggestions are made for misspelled words.

spellSuggestionCutoff—Controls when to stop suggesting corrections.

spellSuggestionFactor—Controls spelling suggestions for query terms that appear in the index and therefore are not considered misspelled. Normally, no spelling suggestions are returned for such terms. If spellSuggestionFactor is set, suggestable terms are returned if their frequency is greater than the original query term’s frequency multiplied by the spellSuggestionFactor. Set to 0 to disable.
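
The following sketch shows how the spelling options might be combined. The <spellChecker> values come from the list above. This section does not show the syntax of the additional options, so their form as child elements of <parserOptions> is an assumption, and the numeric values are purely illustrative.

```xml
<parserOptions>
  <!-- Use both modules, preferring the Wintertree module's suggestions -->
  <spellChecker>wintertree-internal</spellChecker>
  <!-- Assumed element form; illustrative value -->
  <spellMaxSuggestions>3</spellMaxSuggestions>
  <!-- Assumed element form; illustrative value (0 disables) -->
  <spellSuggestionFactor>5</spellSuggestionFactor>
</parserOptions>
```

With a factor of 5 as in this sketch, a query term whose frequency in the index is 10 would draw a suggestion only from a term whose frequency is greater than 50 (10 multiplied by 5).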

Expanding Multiple Stems

As described in the Query Concepts and Processes chapter, ATG Search performs morphological analysis on content to derive the terms to index and query by. For most forms, a single index term is derived; however, for some forms, multiple index terms are possible. For example, the form spoke is both a noun root and a past tense form of the verb root speak.

During indexing, if multiple index terms are possible, ATG Search chooses the most common term (as defined in the dictionary). At query time, ATG Search uses all root terms for each query term. Part-of-speech tagging can help determine whether the terms should be limited, such as choosing the noun spoke for a phrase like the spoke, but it cannot always interpret queries correctly. This tag determines what sort of stem expansion is used:

<expandedStemming>val</expandedStemming>

If val is false, expansion is performed only on a single index term. If val is all, all index terms are used during expansion. If val is untagged, query terms that could not be part-of-speech tagged use all index terms for expansion.
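
For example, a query could request expansion over all index terms only when tagging fails. This is a sketch that omits the rest of the query around <parserOptions>:

```xml
<parserOptions>
  <!-- Terms that could not be part-of-speech tagged expand over all
       index terms; other terms expand over a single index term -->
  <expandedStemming>untagged</expandedStemming>
</parserOptions>
```

Under this setting, a bare query term such as spoke, if it cannot be tagged, would be expanded using both the noun spoke and the verb speak.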

Wild Card Maximum

ATG Search can handle wildcards and regular expressions in the user’s query, as described in the User-Entered Query Operators chapter. These patterns are expanded to a set of index terms, which are then used during retrieval. ATG Search limits the number of expansions so that a broad pattern such as s* does not produce an explosion of terms that could greatly degrade performance. The maximum number of expansions per pattern is specified by the following option:

<wildcardMax>max</wildcardMax>

The max value defaults to 20. Setting this option to 0 disables the interpretation of wildcards and regular expressions in user queries.
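
As a sketch, the limit could be raised for a query where broader patterns are expected. The value 50 is illustrative only, and the rest of the query around <parserOptions> is omitted:

```xml
<parserOptions>
  <!-- Allow up to 50 expansions per wildcard or regular expression pattern -->
  <wildcardMax>50</wildcardMax>
</parserOptions>
```

A higher limit lets broad patterns match more index terms at the cost of more work per query, so it is worth raising only as far as the content requires.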

Security Roles

For statement-level security, ATG Search requires a processing option to specify which roles are accessible for the current query:

<securityRole>role</securityRole>

The role is the name of a security region, as expressed in the XHTML format described in the Structured Content Queries chapter. Multiple elements of this type represent a logical OR of accessible regions. These role values are converted into statement features and act as a filter on the candidate statements. See also the Security Constraints section later in this chapter.
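
The following sketch grants a query access to two security regions. The role names employees and partners are hypothetical and stand in for whatever regions your content defines; the rest of the query around <parserOptions> is omitted.

```xml
<parserOptions>
  <!-- Statements in either region are accessible (logical OR) -->
  <securityRole>employees</securityRole>
  <securityRole>partners</securityRole>
</parserOptions>
```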

Adaptors

ATG Search contains a large general purpose dictionary which represents all of the knowledge about a language that it processes. The dictionaries for each language can be loaded separately, or in combinations. For each language, the dictionary contains index terms (also called stems), part-of-speech data, syntactic and semantic features, morphological rules, compound and phrase data, term normalization data, term weights, thesaurus entries, text patterns, and various other pieces of data.

The adaptor components are extensions to the general purpose dictionary. Adaptors typically reflect domains, such as financial, computer, and manufacturing. Each domain requires specialized information in the dictionary, which may or may not be applicable to other domains. Administrators determine which adaptors are loaded (see the “Term Dictionaries” chapter of the ATG Search Administration Guide). Adaptors consist of the following:

index terms
compound terms
term normalizations
additional thesaurus entries
modifications to thesaurus entries in the core dictionary

ATG Search offers the following adaptors for English:

aerospace
airlines
apparel
appliances
automotive
business
computer
cooking
crafts
ecommerce
financial
food
healthcare
hotels
housewares
HR
insurance
jewelry
legal
manufacturing
media
personal_care
pets
sports_outdoors
telecommunications
tools
toys
yard_garden

During indexing, the adaptors are loaded based on configuration options. Most of the data is unconditional, meaning that it merges into the core data and cannot be ignored. However, adaptors can define thesaurus entries that are conditioned on the adaptor being activated at query time. This functionality allows different thesaurus entries to be used per query, in cases where multiple domains are contained in the same content. The following option controls which adaptor expansions are active during query time:

<context>name</context>

The name value is the name of any valid adaptor. Multiple instances of this option are allowed.
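
For instance, a query against content that mixes several domains might activate only the conditional thesaurus entries for two of them. This sketch assumes the apparel and jewelry adaptors were loaded during indexing, and it omits the rest of the query around <parserOptions>:

```xml
<parserOptions>
  <!-- Activate conditional thesaurus entries from these adaptors only -->
  <context>apparel</context>
  <context>jewelry</context>
</parserOptions>
```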

Category Assignments

As described in the Categorize Query chapter, ATG Search can perform categorization on the query text for automatic category constraints or category feedback. The number of categories used by these processes is controlled by the following options:

<topicMaximum>max</topicMaximum>
<topicConfidence>conf</topicConfidence>
<topicRelevance>relev</topicRelevance>

The max value specifies the maximum number of categories to return per request. A value of 0 means there is no limit. The default value is 10.

The conf value is the confidence threshold for any assigned categories, ranging from 0 to 100. The default is 0.

The relev value is the relevance threshold for any assigned categories, ranging from 0 to 100. The default is 1.
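
As a sketch, the three options can be combined to return fewer, stronger category assignments. The values are illustrative only and the rest of the query around <parserOptions> is omitted:

```xml
<parserOptions>
  <!-- Return at most 5 categories per request -->
  <topicMaximum>5</topicMaximum>
  <!-- Confidence threshold, on the 0 to 100 scale -->
  <topicConfidence>50</topicConfidence>
  <!-- Relevance threshold, on the 0 to 100 scale -->
  <topicRelevance>10</topicRelevance>
</parserOptions>
```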

ATG Search Query Reference Guide