Refine user search by configuring thesaurus, stop words, search characters, stemming, and spelling dictionary features.
Developer Studio makes it possible for you to refine your users' search experience by making adjustments to the following features:
You can establish a one-way or a two-way equivalence between words to enrich your users' query results. You access the Thesaurus editor from the Project tab.
You can add stop words. Stop words are ignored by the Endeca search query engine. You access the Stop Words editor from the Project tab.
You can add search characters to the search list. This allows non-alphanumeric characters to be indexed along with alphanumeric characters, rather than be treated as whitespace. You access the Search Characters editor from the Edit > Search Characters menu.
You can enable or disable stemming for a variety of languages. Stemming defines sets of words (for example, "shirt" and "shirts") that should be considered strictly equivalent for all search operations. You access the Stemming editor using the Edit > Stemming command.
You can tune the size of your application's spelling dictionary by instructing Dgidx to exclude small words, large words, and infrequently used words. You access the Spelling Dictionary editor using the Edit > Spelling dictionary command.
The thesaurus allows the system to return matches for related concepts to words or phrases contained in user queries.
It is a powerful tool that allows you to improve control over the vocabulary used in your application.
You can add two kinds of entries to your Endeca thesaurus:
One-way thesaurus entries establish an equivalence relationship between words or phrases that applies in a single direction only. For example, you could define a one-way mapping so that all queries on Tools would also return matches containing Hammers, but queries on Hammers would not return results for the more general term Tools.
Two-way thesaurus entries establish a mutual equivalence relationship between words or phrases. For example, an equivalence might specify that the phrase Mark Twain is interchangeable with the phrase Samuel Clemens.
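The difference between the two kinds of entries can be sketched in a few lines of Python. This is only an illustration of the expansion semantics described above, not the MDEX Engine's implementation; the function names `build_thesaurus` and `expand` are hypothetical.

```python
from collections import defaultdict

def build_thesaurus(one_way, two_way):
    """Return a mapping from a query term to the extra terms it expands to."""
    expansions = defaultdict(set)
    # One-way entries: the From term also matches every To term, not the reverse.
    for source, targets in one_way:
        expansions[source.lower()].update(t.lower() for t in targets)
    # Two-way entries: every term in the group expands to all the others.
    for group in two_way:
        lowered = [term.lower() for term in group]
        for term in lowered:
            expansions[term].update(t for t in lowered if t != term)
    return expansions

def expand(query, expansions):
    """Return the query term plus its thesaurus expansions, sorted."""
    term = query.lower()
    return sorted({term} | expansions.get(term, set()))

thesaurus = build_thesaurus(
    one_way=[("tools", ["hammers"])],
    two_way=[["mark twain", "samuel clemens"]],
)

print(expand("Tools", thesaurus))           # ['hammers', 'tools']
print(expand("Hammers", thesaurus))         # ['hammers']
print(expand("Samuel Clemens", thesaurus))  # ['mark twain', 'samuel clemens']
```

A query on Tools picks up Hammers, while a query on Hammers is left unchanged; the two-way pair expands in both directions.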
Create one-way or two-way thesaurus entries in the Thesaurus view, found under Search Configuration in the Project Explorer.
To create a one-way equivalence relationship between words or phrases:
In the Project Explorer, expand Search Configuration and double-click Thesaurus.
The Thesaurus view appears.
In the Thesaurus view, click New, and then choose One Way.
The New One-Way Thesaurus Entry editor appears.
In the From box, type the word or phrase that, when searched on, will also return results for the To entries.
In the To box, type a word or phrase whose results should also be returned when the user searches on the From entry, and then click Add.
The new one-way thesaurus entry appears in the Thesaurus view, preceded by a red arrow.
Implementing search features requires additional work outside of Developer Studio. Please refer to the Endeca Advanced Development Guide for details.
Select an entry in the Thesaurus view to make changes.
To edit an existing thesaurus entry:
Remove thesaurus entries from the Thesaurus view.
To remove an entry from the thesaurus list:
Sort your entries in ascending or descending alphabetical order in the Thesaurus view.
The entries in Thesaurus view are sorted by name to make them easier to work with. You can choose whether the sort is ascending or descending.
To sort your thesaurus entries:
If your list of thesaurus entries becomes long, you can filter them by a letter or word.
To filter thesaurus entries:
Give single-word names to your entries, avoid non-searchable characters and stop words, and avoid creating entries which include substrings of other entries.
To avoid performance problems related to expensive or less than useful thesaurus search query expansions, follow these recommendations:
Such forms rarely lead to useful expansions. For example, consider the following two entries:
If users type "EVE", they get results for "EVE OR (ADAM AND EVE)", which are the same results they would have gotten for "EVE" without the thesaurus. If users type "ADAM AND EVE", they get results for "(ADAM AND EVE) OR EVE", which causes the "ADAM AND" part of the query to be effectively ignored.
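The reasoning in the "ADAM AND EVE" example can be checked with plain set arithmetic. The sketch below uses a hypothetical three-record index and assumes that a match on the multi-word form requires every one of its words; it is not a model of the actual query engine.

```python
# Hypothetical toy index: record id -> set of words in the record.
records = {
    1: {"adam", "and", "eve"},
    2: {"eve", "apple"},
    3: {"adam", "garden"},
}

def matching(term):
    """Return the ids of records containing the term."""
    return {rid for rid, words in records.items() if term in words}

eve = matching("eve")
adam_and_eve = matching("adam") & matching("eve")  # AND is an intersection

# Query "EVE" expanded to EVE OR (ADAM AND EVE): nothing is gained.
expanded_eve = eve | adam_and_eve

# Query "ADAM AND EVE" expanded to (ADAM AND EVE) OR EVE: the ADAM constraint is lost.
expanded_adam_and_eve = adam_and_eve | eve

print(sorted(eve), sorted(expanded_eve), sorted(expanded_adam_and_eve))
```

Because every record that matches the longer form also matches the shorter one, both expansions collapse to the result set for "EVE" alone.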
For example, the following entries:
should be replaced with either:
or
The stop word "AND" should be removed.
In particular, avoid multi-word forms that are not phrases that users are likely to type, or to which phrase expansion is likely to provide relevant additional results. For example, the following thesaurus entries:
should be replaced with the single-word form:
Specify a single word or phrase that, when searched on, returns results for other words or phrases.
Option | Description |
---|---|
From | The word or phrase that, when searched on, will also return hits on the words or phrases in the To list. |
To | The additional words or phrases whose results will also be returned when the user's query returns hits on the From entry. |
Add | Adds the word or phrase entered in the field above to the To list. |
Modify | Used to modify a word or phrase in the To list. |
Remove | Removes the selected word or phrase from the To list. |
Define a list of words or phrases that will return results for each other when searched.
The Two-way Thesaurus Entry editor contains the following fields:
Option | Description |
---|---|
To | The list of words or phrases that will have a mutual equivalence relationship. Searches on any of the words in the list will return hits for the other words in the list as well. |
Add | Adds the word or phrase entered in the field above to the To list. |
Modify | Used to modify a word or phrase in the To list. To modify an entry, select it in the To list, make your changes in the editing field, and click Modify. |
Remove | Removes the selected word or phrase from the To list. |
The stemming feature broadens search results to include word roots and word derivations.
Stemming is intended to allow words with a common root form (such as the singular and plural forms of nouns) to be considered interchangeable in search operations. For example, search results for the word "shirt" will include the derivation "shirts," while a search for "shirts" will also include its word root "shirt."
Stemming equivalences are strictly two-way (that is, all-to-all). For example, a search for the singular form of a noun (such as "child") will also return matches for the plural form "children." Likewise, a search for "children" will return matches for "child" and "children." Stemmed words, therefore, are considered equivalent and interchangeable for all search operations.
In contrast, the thesaurus feature supports one-way mappings in addition to two-way mappings.
Note
Stemming files are provided by Endeca for various languages. While you can enable the use of these files, you cannot modify their contents.
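The all-to-all nature of stemming equivalences can be pictured as a lookup into groups of word forms. The sketch below is a simplified illustration; the `WORD_FORMS` data is hypothetical and does not reflect the format of the stemming files provided by Endeca.

```python
# Hypothetical word-forms data: each group lists forms sharing one root.
WORD_FORMS = [
    {"shirt", "shirts"},
    {"child", "children"},
]

# Map every form to its whole group so lookups are symmetric (all-to-all).
STEM_GROUPS = {form: group for group in WORD_FORMS for form in group}

def stem_equivalents(term):
    """Return all forms treated as interchangeable with the term."""
    key = term.lower()
    return sorted(STEM_GROUPS.get(key, {key}))

print(stem_equivalents("child"))     # ['child', 'children']
print(stem_equivalents("children"))  # ['child', 'children']
print(stem_equivalents("hammer"))    # ['hammer']
```

Unlike a one-way thesaurus entry, there is no direction here: any member of a group retrieves the whole group.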
Open the Stemming editor from the Edit menu to enable stemming.
To enable stemming for one or more languages in your project:
Note
The Stemming editor allows you to turn the default version of Dutch and German stemming (with the word forms file) on and off. If you want to implement dynamic stemming for Dutch or German, you must edit the stemming.xml file directly, as described in the Endeca Advanced Development Guide. Subsequent use of the Stemming editor will not overwrite manual changes to the stemming.xml file.
To disable stemming, use the above procedure, but uncheck the languages for which you do not want stemming.
You can specify punctuation marks as searchable, in addition to digits and upper- and lower-case letters (automatically set as valid search characters).
Upper- and lower-case letters and the digits 0 to 9 are automatically included as valid search characters in your Endeca-enabled application. However, in the case of other characters, such as certain punctuation characters, you can specify whether the character should be indexed along with alphanumeric characters in a token or instead treated as whitespace.
Search characters are configured globally for all search operations.
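The effect of adding a search character can be illustrated with a small tokenizer. This is a rough sketch that handles only ASCII letters and digits; the `tokenize` function and its `search_chars` parameter are hypothetical and are not part of any Endeca API.

```python
import re

def tokenize(text, search_chars=""):
    """Split text into tokens; characters outside the valid set act as whitespace."""
    # Letters and digits are always valid; extra search characters are opt-in.
    valid = "A-Za-z0-9" + re.escape(search_chars)
    return re.findall(f"[{valid}]+", text)

print(tokenize("C++ and AT&T"))                     # ['C', 'and', 'AT', 'T']
print(tokenize("C++ and AT&T", search_chars="+&"))  # ['C++', 'and', 'AT&T']
```

Without the extra characters, "AT&T" breaks into two unrelated tokens; with them, it is indexed as a single token.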
Add search characters from the Search Characters editor, under the Edit menu.
To add search characters:
Implementing search features requires additional work outside of Developer Studio. Please refer to the Endeca Basic Development Guide for details.
Delete search characters from the Search Characters editor, under the Edit menu.
To remove an additional character from the list of searchable characters:
The Search Characters editor lists the standard special search characters, as well as any others you have specified.
Stop words are words that are set to be ignored by the Endeca MDEX Engine.
Typically, common words like "the" are included in the stop word list. In addition, you might want to add terms that are prevalent in your data set. For example, if your data consists of lists of books, you might want to add the word "book" itself to the stop word list, since a search on that word would return an impracticably large set of records.
Note
Words added to the stop word list are not expanded by other Endeca Developer Studio features like stemming and thesaurus. That means that if you set the word "item" as a stop word, its plural form "items" will not be marked automatically as a stop word. If you want both forms to be on the stop word list, you must add them individually.
Stop words are counted in any search mode (such as MatchPartial) that calculates results based on number of matching terms. However, the Endeca MDEX Engine reduces the minimum term match and maximum word omit requirement by the number of stop words contained in the query.
Stop words must be single words only, and cannot contain any non-searchable characters. If more than one word is entered as a stop word, neither the individual words nor the combined phrase will act as a stop word. Non-searchable characters within a stop word will also cause this behavior. Entering "full-bodied" as a stop word acts just as if you had entered "full bodied", and does not have any effect on searches.
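The rules above (single words only, no non-searchable characters, no automatic expansion to other word forms) can be sketched as follows. The helper names are hypothetical, and whitespace splitting stands in for the engine's real tokenization.

```python
def effective_stop_words(entries, searchable_extra=""):
    """Keep only entries that can act as stop words: single searchable words."""
    usable = set()
    for entry in entries:
        # Spaces and non-searchable punctuation disqualify the whole entry.
        ok = all(ch.isalnum() or ch in searchable_extra for ch in entry)
        if entry and ok:
            usable.add(entry.lower())
    return usable

def strip_stop_words(query, stop_words):
    """Drop stop words from a whitespace-separated query."""
    return [t for t in query.lower().split() if t not in stop_words]

stops = effective_stop_words(["the", "item", "full-bodied", "of the"])
print(sorted(stops))                                  # ['item', 'the']
print(strip_stop_words("the full-bodied items", stops))
```

Note that "items" survives in the second result: only the exact form "item" was listed, so the plural must be added separately.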
Set stop words from the Stop Words view, under Search Configuration in the Project Explorer.
To add a word to the stop list:
Modify stop words from the Stop Words view.
To edit a stop word:
Remove stop words from the Stop Words view.
To remove a word from the stop word list:
Sort stop words in ascending or descending alphabetical order in the Stop Words view.
The entries in Stop Words view are sorted by name to make them easier to work with. You can choose whether the sort is ascending or descending.
To sort the stop words in your list:
When an application user provides individual search terms in a query, the automatic phrasing feature groups those individual terms into a search phrase and returns query results for the phrase.
Automatic phrasing is similar to placing quotation marks around search terms before submitting them in a query. For example, 'my search terms' is the phrased version of the query my search terms. However, automatic phrasing removes the need for application users to place quotation marks around search phrases to get phrased results.
The result of automatic phrasing is that a Web application can process a more restricted query and therefore return fewer and more focused search results. This feature is available only for record search.
The automatic phrasing feature works by:
Comparing individual search terms in a query to a list of application-specific search phrases. The list of search phrases is stored in a project's phrase dictionary.
Returning query results that are either based on the automatically phrased query, or returning results based on the original unphrased query along with automatically phrased 'Did You Mean?' (DYM) alternatives.
The last point above suggests the two typical implementation scenarios to choose from when using automatic phrasing:
Process an automatically phrased form of the query and suggest the original unphrased query as a DYM alternative.
In this scenario, the automatic phrasing feature rewrites the original query's search terms into a phrased query before processing it. If you are also using DYM, you can display the unphrased alternative so the user can opt-out of automatic phrasing and select their original query, if desired.
For example, an application user searches a wine catalog for the terms low tannin. The MDEX Engine compares the search terms against the phrase dictionary, finds a phrase entry for low tannin, and processes the phrased query "low tannin". The MDEX Engine returns 3 records for the phrased query "low tannin" rather than 16 records for the user's original unphrased query low tannin. However, the Web application also presents a Did you mean low tannin? selection so the user may opt out of automatic phrasing, if desired.
Process the original query and suggest an automatically-phrased form of the query as a DYM alternative.
In this scenario, the automatic phrasing feature processes the unphrased query as entered and determines if a phrased form of the query exists. If a phrased form is available, the Web application displays an automatically-phrased alternative as a "Did you mean?" option. The user can opt-in to automatic phrasing, if desired.
For example, an application user searches a wine catalog for low tannin. The MDEX Engine returns 16 records for the user's unphrased query low tannin. The Web application also presents a Did you mean "low tannin"? option so the user may opt in to automatic phrasing, if desired.
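The two scenarios can be sketched as one function with a switch. This is an illustrative model only: the greedy longest-first matching, the `auto_phrase` and `plan_query` names, and the `rewrite` flag are all assumptions, not the MDEX Engine's behavior or API.

```python
def auto_phrase(terms, phrase_dictionary):
    """Group adjacent terms that exactly match a dictionary phrase, longest first."""
    phrases = {tuple(p.lower().split()) for p in phrase_dictionary}
    longest = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(terms):
        # Try multi-term windows only, from the longest possible down to two terms.
        for size in range(min(longest, len(terms) - i), 1, -1):
            candidate = tuple(t.lower() for t in terms[i:i + size])
            if candidate in phrases:
                out.append('"' + " ".join(terms[i:i + size]) + '"')
                i += size
                break
        else:
            out.append(terms[i])
            i += 1
    return out

def plan_query(query, phrase_dictionary, rewrite=True):
    """Return (query to process, DYM alternative or None) for the two scenarios."""
    original = query.split()
    phrased = auto_phrase(original, phrase_dictionary)
    if phrased == original:
        return query, None
    phrased_query = " ".join(phrased)
    return (phrased_query, query) if rewrite else (query, phrased_query)

dictionary = ["low tannin", "Napa Valley"]
print(plan_query("low tannin", dictionary, rewrite=True))   # phrased query, unphrased DYM
print(plan_query("low tannin", dictionary, rewrite=False))  # unphrased query, phrased DYM
print(plan_query("tannin low", dictionary))                 # no phrase: order differs
```

With `rewrite=True` the phrased form is processed and the original is offered as the alternative; with `rewrite=False` the roles are swapped.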
There are two tasks to implement automatic phrasing:
Note
Implementing search features requires additional work outside of Developer Studio. Refer to the Endeca Advanced Development Guide for details.
Grouping of terms as a phrase exempts the phrase from thesaurus expansion and stemming.
Once individual search terms in a query are grouped as a phrase, the phrase is not subject to thesaurus expansion or stemming by the MDEX Engine.
Describes the processing order of spelling correction and the DYM function with regard to automatic phrasing.
If you are using automatic phrasing, you should enable the MDEX Engine for both spelling correction and "Did you mean?" If you want spelling-corrected automatic phrases, spelling correction ensures search terms are corrected before the terms are automatically phrased. DYM provides users the choice to opt-in or opt-out of automatic phrasing.
The MDEX Engine applies spelling correction to a query before automatically phrasing the terms. This processing order means, for example, that if a user enters the misspelled query Napa Valle, the MDEX Engine first corrects the spelling to Napa Valley and then automatically phrases the terms as "Napa Valley." Without spelling correction enabled, automatic phrasing would typically not find a matching phrase in the phrase dictionary.
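The processing order can be illustrated with a toy pipeline. The sketch uses Python's `difflib` as a stand-in for spelling correction, which is an assumption made purely for illustration; the vocabulary, phrase set, and function names are hypothetical.

```python
import difflib

VOCABULARY = ["napa", "valley", "low", "tannin"]   # hypothetical spelling dictionary
PHRASES = {("napa", "valley"), ("low", "tannin")}  # hypothetical phrase dictionary

def correct(term):
    """Replace a term with its closest vocabulary word, if one is close enough."""
    matches = difflib.get_close_matches(term.lower(), VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term.lower()

def phrase_if_known(terms):
    """Quote the whole query when it exactly matches a dictionary phrase."""
    joined = " ".join(terms)
    return '"' + joined + '"' if tuple(terms) in PHRASES else joined

query = "Napa Valle".split()
without_correction = phrase_if_known([t.lower() for t in query])
with_correction = phrase_if_known([correct(t) for t in query])
print(without_correction)  # napa valle
print(with_correction)     # "napa valley"
```

Phrasing the raw terms finds no dictionary match, while correcting first produces terms that match the phrase entry.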
If you implement automatic phrasing to rewrite the query using an automatic phrase, then enabling DYM allows users a way to opt-out of automatic phrasing if they want to. On the other hand, if you implement automatic phrasing to process the original query and suggest automatically-phrased alternatives, then enabling DYM allows users to take advantage of automatically phrased alternatives as follow-up queries.
Note
For details about configuring spelling correction and DYM, see the Endeca Advanced Development Guide.
Import phrases from an XML file, or extract phrases from dimension names.
There are two ways to include phrases in your Developer Studio project:
After you add phrases and update your instance configuration, the MDEX Engine builds the phrase dictionary. You cannot view the phrases in Developer Studio. However, after adding phrases and saving your project, you can examine the phrases contained in a project's phrase dictionary by opening the project file named phrases.xml in a text editor. Directly modifying phrases.xml is not supported.
You import an XML file of phrases using the Import Phrases dialog box in Developer Studio. The Import Phrases dialog box can be accessed from either the File menu or from the Automatic Phrasing dialog box.
Before you import the XML file, it must conform to phrase_import.dtd, in the Endeca Navigation Platform conf/dtd directory.
Here is a simple example of a phrase file that conforms to phrase_import.dtd:
<?xml version='1.0' encoding='UTF-8' standalone='no' ?>
<!DOCTYPE PHRASE_IMPORT SYSTEM 'phrase_import.dtd'>
<PHRASE_IMPORT>
  <PHRASE>Napa Valley</PHRASE>
  <PHRASE>low tannin</PHRASE>
</PHRASE_IMPORT>
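If your phrases live in another system, a file in this layout can be generated with a short script. The sketch below mirrors only the structure of the example above; it has not been validated against phrase_import.dtd, and the `build_phrase_import` function is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_phrase_import(phrases):
    """Serialize phrases in the layout of the example phrase file."""
    root = ET.Element("PHRASE_IMPORT")
    for phrase in phrases:
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(root, "PHRASE").text = phrase
    body = ET.tostring(root, encoding="unicode")
    header = (
        "<?xml version='1.0' encoding='UTF-8' standalone='no' ?>\n"
        "<!DOCTYPE PHRASE_IMPORT SYSTEM 'phrase_import.dtd'>\n"
    )
    return header + body

document = build_phrase_import(["Napa Valley", "low tannin", "Anderson & Brothers"])
print(document)

# Round-trip check: skip the two header lines and parse the element tree.
parsed = ET.fromstring(document.split("\n", 2)[2])
round_trip = [p.text for p in parsed.findall("PHRASE")]
print(round_trip)
```

Generating the file through an XML library rather than string concatenation keeps phrases that contain characters such as & well-formed.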
To import phrases from an XML file:
Double-click Automatic Phrasing.
The Automatic Phrasing dialog box displays.
The Import Phrases dialog box displays.
Note
Alternatively, you can select Import Phrases from the File menu to invoke the Import Phrases dialog box.
Either type the path to your phrases file or click the Browse button to locate the file.
Click OK on the Automatic Phrasing dialog box.
The Messages pane displays the number of phrases read in from the XML file.
In addition to importing an XML file of phrases, you can add phrases to your project based on the dimension values of any dimension you choose.
The MDEX Engine adds each multi-term dimension value in a selected dimension to the phrase dictionary. Single-term dimension values are not included. For example, if you import a Winery dimension from a wine catalog, the MDEX Engine creates a phrase entry for multi-term names such as Agostina Pieri but not for single-term names such as Alessi.
In addition, the MDEX Engine adds to the phrase dictionary each multi-term synonym that has "Search" checked on the Synonyms dialog box. In this release, dimension value phrases that have been modified by a partial update pipeline are not reflected in the phrase dictionary.
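The multi-term rule can be sketched as a simple filter. Counting terms by splitting on whitespace is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
def phrases_from_dimension(values, searchable_synonyms=()):
    """Keep only multi-term names; single-term names are not phrases."""
    candidates = list(values) + list(searchable_synonyms)
    return [name for name in candidates if len(name.split()) > 1]

winery_values = ["Agostina Pieri", "Alessi"]
extracted = phrases_from_dimension(winery_values, searchable_synonyms=["A. Pieri"])
print(extracted)  # ['Agostina Pieri', 'A. Pieri']
```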
To extract phrases from dimension names:
Inclusion of original punctuation marks in search query phrases returns more relevant results.
To add search characters that support automatic phrasing:
If you have phrases that include punctuation, add those punctuation marks as search characters. Adding the punctuation marks ensures that the MDEX Engine includes the punctuation when tokenizing the query, and therefore the MDEX Engine can match search terms with punctuation to phrases with punctuation.
For example, suppose you add phrases based on a Winery dimension, and consequently the Winery name Anderson & Brothers exists in your phrase dictionary. You should create a search character for the ampersand (&).
Note
For details on search characters, see About search characters.
Depending on how a phrased query is processed, it may produce dead-end results. The causes include the significance of term order and the fact that the MDEX Engine does not extend user phrases to match those in the phrase dictionary.
The following table provides tips and troubleshooting guidance about using the automatic phrasing feature.
Tip | Description |
---|---|
Examining how a phrased query was processed | |
Single word phrases | You can include a single word in your phrases_import.xml file and treat the word as a phrase in your project. This may be useful if you do not want stemming or thesaurus expansion applied to single word query terms. You cannot include single word phrases by extracting them from dimension values using the Phrases dialog box. They have to be imported from your file. |
Extending user phrases | The MDEX Engine does not extend phrases a user provides to match a phrase in the phrase dictionary. For example, if a user provides the query A "BC" D and "BCD" is in the phrase dictionary, the MDEX Engine does not extend the user's original phrasing of "BC" to "BCD." |
Term order is significant in phrases | Phrases are matched only if search terms are provided in the same exact order and with the same exact terms as the phrase in the phrase dictionary. For example, if "weekend bag" is in the phrase dictionary, the MDEX Engine does not automatically phrase the search terms "weekend getaway bag" or "bag, weekend" to match "weekend bag." |
Possible dead ends | If an application automatically phrases search terms, a query may not produce results when it seemingly should have. Specifically, a dead-end query can occur when a search phrase is displayed as a DYM link with results and navigation state filtering excludes the results. For example, suppose a car sales application is set up to process a user's original query and display any automatic phrase alternatives as DYM options. Further suppose a user navigates to Cars > Less than $15,000 and then provides the search terms luxury package. The search terms match the phrase "luxury package" in the phrase dictionary. The user receives query results for Cars > Less than $15,000 that matched some occurrences of the terms luxury and package. However, if the user clicks the DYM link Did you mean "luxury package"? then no results are available because the navigation state Cars > Less than $15,000 excludes them. Note: See the Endeca Advanced Development Guide for details about how processing order affects queries. |
By default, Dgidx creates an application-specific spelling dictionary based on all words contained in searchable dimensions and properties. These words become possible spell correction recommendations.
To achieve the best possible spelling correction behavior and performance, it is typically necessary to specify constraints on the list of words that Dgidx can include for spelling correction.
Constraining spelling dictionary entries improves the performance of spelling corrected queries.
If you want to fine-tune the size of the spelling dictionary, and consequently tune the performance of spelling-corrected queries, you can specify constraints to control which words Dgidx adds to the spelling dictionary. You can configure dictionary entries separately for dimension search and record search.
To constrain spelling dictionary entries:
In the It Occurs at Least ... Times field, provide a number that indicates the minimum number of times the word must appear in your source data before the word should be included in the spelling dictionary.
In the And Is Between ... and ... Characters Long fields, provide values that represent the minimum and maximum length of a word that should be included in the spelling dictionary.
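The combined effect of the two constraints can be sketched as a frequency and length filter. This is not how Dgidx is implemented; the function name, parameter names, and default values below are hypothetical.

```python
from collections import Counter

def spelling_dictionary(words, min_occurrences=1, min_length=1, max_length=16):
    """Return the words that satisfy both the frequency and the length constraint."""
    counts = Counter(w.lower() for w in words)
    return sorted(
        word for word, n in counts.items()
        if n >= min_occurrences and min_length <= len(word) <= max_length
    )

corpus = "merlot merlot merlot a a a cabernet cabernet zinfandel".split()
kept = spelling_dictionary(corpus, min_occurrences=2, min_length=3, max_length=16)
print(kept)  # ['cabernet', 'merlot']
```

Here "a" is frequent but too short, and "zinfandel" is long enough but too rare, so neither becomes a spelling correction candidate.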