The set of text management features that Guided Search supports varies from language to language.
Stemming is the process of reducing words to their stem, base, or root form.
Note
This section applies only to Latin-1 analysis. OLT analysis uses a language-specific OLT dictionary for stemming.
Guided Search applications support stemming for the following languages:
All of these languages have predefined stemming files. You can add terms to and remove them from the predefined stemming files; for information about how to do this, refer to the MDEX Engine Developer's Guide.
Some words are formed by joining together words that can stand on their own. These compound words can be broken up into their component words, so that the compound word is included in the search results for any of its component words. The process of breaking up a compound word in this way is known as decompounding.
To decompound a word, language analysis requires that all the components of the word be in the dictionary. For example, language analysis decompounds the word "Bananenstecker" (banana plug) into "Bananen" and "Stecker" only if both of these words are in the dictionary.
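The dictionary requirement can be illustrated with a minimal Python sketch. It is not the OLT implementation; the `decompound` function and the tiny in-memory dictionary are hypothetical and only show why every component must be present before a split is accepted.

```python
def decompound(word, dictionary):
    """Return the dictionary components that spell `word`, or None."""
    word = word.lower()
    # Try every split point: the head must be a dictionary word, and the
    # tail must be a dictionary word or must itself decompound further.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in dictionary:
            if tail in dictionary:
                return [head, tail]
            rest = decompound(tail, dictionary)
            if rest is not None:
                return [head] + rest
    return None


# Both components are known, so the compound is split.
print(decompound("Bananenstecker", {"bananen", "stecker"}))  # ['bananen', 'stecker']
# "stecker" is missing, so the compound is left whole.
print(decompound("Bananenstecker", {"bananen"}))  # None
```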
Compound words are common in most Germanic languages (such as German, Norwegian, and Swedish), as well as in Japanese.
Note
Decompounding is supported only by OLT language analysis, and not by Latin-1. If the stemming.xml file is provided to configure the choice of analysis used for a language, OLT must be selected and decompounding must be enabled for the language. If the language is an OLT-only language, decompounding is enabled by default.
Each MDEX Engine has only one thesaurus. The thesaurus is used for all records processed by the MDEX Engine, whatever their languages.
If your MDEX Engine processes records in more than one language, avoid adding words to the thesaurus that have different meanings in these languages. English and French in particular each include a large number of words that are spelled the same, or almost the same, as in the other language, but that have an entirely different meaning. For example, in English "chair" means a piece of furniture; in French, "chair" means "flesh".
Wildcarding is the use of placeholder characters in a search term that can match any other characters. Wildcarding is supported only by Latin-1 language analysis, and not by OLT analysis.
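As a rough illustration of the idea, the following Python sketch matches terms against a pattern in which `*` stands for any run of characters. It uses Python's shell-style matching from the standard library, not the MDEX Engine's own wildcard handling, and the term list is made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms, for illustration only.
terms = ["television", "telephone", "tablet"]


def wildcard_matches(pattern, terms):
    """Return the terms that match a pattern where * matches any characters."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(wildcard_matches("tele*", terms))  # ['television', 'telephone']
```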
To prevent queries in one language from being spell-corrected according to the conventions of a different language, you must configure spelling correction for each particular language.
You configure spelling correction for particular languages when you tag your data with language IDs. Guided Search generates a language-specific dictionary for any data that has been tagged with a language ID. To find the correct spellings in the language specified by a particular language ID, the spelling correction feature consults the dictionary for that language.
Note
Spelling correction can be used only with languages that are written in alphabetic scripts. Thus, spelling correction is not supported for Chinese, and has limited application to Japanese and Korean, owing to the non-alphabetic nature of the scripts in which these languages are written.
If all properties within a Search interface are in the same language, spelling correction will correct words by suggesting other words in these properties. If you use record filters to return records only in a particular language, spelling correction will correct words only by suggesting words that occur in these records.
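The relationship between language IDs and dictionaries can be sketched as follows. This is a simplified model, not how Guided Search builds its dictionaries: the record layout, the language IDs `en` and `fr`, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_dictionaries(records):
    """Group the words of each record under that record's language ID."""
    dictionaries = defaultdict(set)
    for language_id, text in records:
        for word in text.lower().split():
            dictionaries[language_id].add(word)
    return dictionaries


# Hypothetical records, each tagged with a language ID.
records = [
    ("en", "red chair"),
    ("fr", "chaise rouge"),
]
dictionaries = build_dictionaries(records)


def is_known(word, language_id):
    """Check a word only against the dictionary for the given language."""
    return word.lower() in dictionaries.get(language_id, set())


print(is_known("chaise", "fr"))  # True
print(is_known("chaise", "en"))  # False
```

Because each lookup is scoped to one language ID, a French word is never offered as a correction for an English query in this model.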
In addition to requiring a language-specific dictionary to reference, spelling correction also requires that dgidx be configured to use the proper spelling mode. Select a spelling mode option for dgidx by specifying one of the parameters to the dgidx --spellmode option listed in the following table.
Use this Spelling Mode . . . | . . . for this language or type of language |
---|---|
aspell (the default) | English and similar languages for which sound-alike corrections can be made, using phonetic rules. aspell does not perform corrections to non-alphabetic/non-ASCII terms such as café, 1234, or A&M. |
espell | Non-English words or terms that are not words, such as part numbers; performs non-phonetic (edit-distance-based) corrections. |
aspell_OR_espell | Languages that include both ASCII and non-ASCII characters and phrases. Aspell corrects ASCII words and Espell corrects other words. |
aspell_AND_espell | Both modules suggest corrections and the user selects the best selection from the union of results. |
disable | Chinese or other languages that use non-alphabetic scripts to which the concept of spelling does not apply. |
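The division of labor described for aspell_OR_espell can be pictured with a short Python sketch. The exact rule the engine applies is not documented here; the sketch assumes that a term made up only of ASCII letters goes to aspell and that everything else goes to espell. The function name `module_for_term` is hypothetical.

```python
def module_for_term(term):
    """Pick the module assumed to handle `term` under aspell_OR_espell."""
    # Assumption: only terms made up entirely of ASCII letters are
    # candidates for aspell's phonetic correction.
    if term.isascii() and term.isalpha():
        return "aspell"
    return "espell"


for term in ["cafe", "café", "1234", "A&M"]:
    print(term, "->", module_for_term(term))
```

Under this assumption, only "cafe" is routed to aspell; "café", "1234", and "A&M" fall through to espell, which matches the examples of terms that aspell does not correct.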
For example, to select the espell mode, use the following command:
dgidx --spellmode espell
You can discover which spelling mode works best for an alphabetic language other than English by testing the following spelling modes with data in that language: espell, aspell, aspell_OR_espell, and aspell_AND_espell.
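When comparing modes, it can help to generate the command for each candidate from a single place. The Python sketch below only assembles the command lines. It does not run dgidx, and `extra_args` is a hypothetical placeholder for the other arguments that a real indexing run needs, which are not covered here.

```python
import shlex

# The four modes suggested for testing, plus "disable" for validation.
CANDIDATE_MODES = ["espell", "aspell", "aspell_OR_espell", "aspell_AND_espell"]
ALL_MODES = CANDIDATE_MODES + ["disable"]


def dgidx_command(mode, extra_args=()):
    """Build the argument list for a dgidx run with the given spelling mode."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown spelling mode: {mode!r}")
    # extra_args stands in for the remaining dgidx arguments (assumption).
    return ["dgidx", "--spellmode", mode, *extra_args]


for mode in CANDIDATE_MODES:
    print(shlex.join(dgidx_command(mode)))
```

Each printed line can then be run against the same test data so that the resulting corrections can be compared mode by mode.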
Note
In some cases, you may find it easier to create a separate Oracle Commerce Guided Search application for each language that you are targeting, rather than configuring a single application to manage all languages. For information about the advantages and disadvantages of each approach, see "How Many MDEX Engines Do I Need?"
Follow these steps to specify a correction mode in a configuration file. If such a configuration file exists, it overrides any mode specified with the dgidx --spellmode option.
1. Using any standard text editor, create a file that contains the following text:

       <?xml version="1.0" encoding="UTF-8"?>
       <!DOCTYPE SPELL_CONFIG SYSTEM "spell_config.dtd">
       <SPELL_CONFIG>
         <SPELL_ENGINE>
           <DICT_PER_LANGUAGE>
             <ESPELL/>
           </DICT_PER_LANGUAGE>
         </SPELL_ENGINE>
       </SPELL_CONFIG>

2. Save the file as <app name>_prefix.spell_config.xml. For more information about the structure of a spell_config.xml file, refer to the Platform Services XML Reference. See also the spell_config.dtd in the MDEX Engine conf/dtd directory.
3. Store the file in the directory where you store your project's other XML instance configuration files.
4. Run a baseline update and restart the MDEX Engine with the new configuration file.
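If you prefer to script the file creation, the following Python sketch writes the same XML and checks that it is well-formed first. It is only one possible approach. The `config_dir` and `file_prefix` parameters are hypothetical, and the DOCTYPE system identifier is written as spell_config.dtd to match the DTD file name mentioned in the steps. The baseline update and the MDEX Engine restart are not handled.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SPELL_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SPELL_CONFIG SYSTEM "spell_config.dtd">
<SPELL_CONFIG>
  <SPELL_ENGINE>
    <DICT_PER_LANGUAGE>
      <ESPELL/>
    </DICT_PER_LANGUAGE>
  </SPELL_ENGINE>
</SPELL_CONFIG>
"""


def write_spell_config(config_dir, file_prefix):
    """Write <file_prefix>.spell_config.xml into config_dir and return its path."""
    # Parse first so that a malformed edit fails before anything is written.
    # This checks well-formedness only; it does not validate against the DTD.
    ET.fromstring(SPELL_CONFIG.encode("utf-8"))
    path = Path(config_dir) / f"{file_prefix}.spell_config.xml"
    path.write_text(SPELL_CONFIG, encoding="utf-8")
    return path
```

For example, `write_spell_config("config", "myapp")` would create `config/myapp.spell_config.xml`, provided that the `config` directory already exists; both names are placeholders.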
A stop word is a commonly used word, such as "the", that a search engine has been programmed to ignore. Each MDEX Engine has only one stop word list. As a result, each stop word will be used for all records processed by the MDEX Engine, whatever their languages.
Thus, if you are using a single MDEX Engine for more than one language, provide a separate version of each stop word for each of the languages that your application supports.
Before you specify a stop word in one language, make sure that it does not appear with the same spelling but a different meaning in the other languages that your application supports. English and French in particular share many such "false cognates." For example, the French word for tea, "thé", can be mistaken for the English word "the", which is commonly designated as a stop word.
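A check of this kind can be automated before stop words are added. The Python sketch below compares candidate stop words with per-language word lists after lowercasing and removing accents, so that "the" and "thé" are reported as a possible collision. Whether the engine itself treats these two forms as the same term is an assumption. The vocabularies and function names are made up for the example.

```python
import unicodedata


def fold(word):
    """Lowercase a word and strip its accents (thé -> the)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def colliding_stop_words(stop_words, vocabularies):
    """Map each stop word to the words it could be confused with, by language."""
    collisions = {}
    for stop_word in stop_words:
        for language, vocabulary in vocabularies.items():
            hits = sorted(w for w in vocabulary if fold(w) == fold(stop_word))
            if hits:
                collisions.setdefault(stop_word, {})[language] = hits
    return collisions


# Hypothetical French vocabulary checked against English stop word candidates.
vocabularies = {"fr": {"thé", "chaise"}}
print(colliding_stop_words(["the", "and"], vocabularies))
# {'the': {'fr': ['thé']}}
```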
Keyword redirects send a user's search to a Web page (that is, to a URL).
Like dynamic business rules, keyword redirects use trigger and target values. The user's search is redirected if it contains a keyword (the trigger), and you have provided a rule that redirects any search containing that keyword to a particular URL (the target). These features are applied after navigation filtering.
If your application supports multiple languages and you intend to use a given keyword trigger in each language, you must create a separate rule for the keyword trigger in each language.
For example, if the word "pants" (English) triggers a rule, and the same rule should apply to queries in French and Spanish, then two other rules must be created: one triggered by "pantalones" (Spanish) and one triggered by "pantalon" (French).
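The one-rule-per-language requirement can be modeled with a small Python sketch. It is a conceptual illustration, not the MDEX Engine rule format; the language codes, the `redirect_for` function, and the example.com URL are all hypothetical.

```python
# One rule per language: each trigger keyword maps to a target URL.
REDIRECT_RULES = {
    "en": {"pants": "https://example.com/pants"},
    "es": {"pantalones": "https://example.com/pants"},
    "fr": {"pantalon": "https://example.com/pants"},
}


def redirect_for(query, language):
    """Return the target URL if the query contains a trigger for that language."""
    rules = REDIRECT_RULES.get(language, {})
    for keyword in query.lower().split():
        if keyword in rules:
            return rules[keyword]
    return None


print(redirect_for("pantalones azules", "es"))  # https://example.com/pants
print(redirect_for("pantalones azules", "en"))  # None
```

The second call returns nothing because the English rule set has no trigger for "pantalones", which is why each language needs its own rule.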
For detailed information about how to create keyword triggers, refer to the MDEX Engine Developer's Guide.