Lexer Types

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer types:

Type	Description
BASIC_LEXER	Lexer for extracting tokens from text in languages, such as English and most western European languages, that use whitespace-delimited words.
MULTI_LEXER	Lexer for indexing tables containing documents of different languages.
CHINESE_VGRAM_LEXER	Lexer for extracting tokens from Chinese text.
JAPANESE_VGRAM_LEXER	Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER	Lexer for extracting tokens from Japanese text.
KOREAN_LEXER	Lexer for extracting tokens from Korean text.
KOREAN_MORPH_LEXER	Lexer for extracting tokens from Korean text (recommended).

BASIC_LEXER

Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other supported whitespace delimited languages.

The BASIC_LEXER also enables base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for whitespace delimited languages that have extended character sets.

In English and French, you can use the BASIC_LEXER to enable theme indexing.

Note:

Any processing the lexer does to tokens before indexing (for example, removal of characters and base-letter conversion) is also performed on query terms at query time. This ensures that the query terms match the form of the tokens in the Text index.

BASIC_LEXER supports any database character set.

BASIC_LEXER has the following attributes:

Attribute	Attribute Values
continuation	characters
numgroup	characters
numjoin	characters
printjoins	characters
punctuations	characters
skipjoins	characters
startjoins	non-alphanumeric characters that occur at the beginning of a token (string)
endjoins	non-alphanumeric characters that occur at the end of a token (string)
whitespace	characters (string)
newline	NEWLINE (\n) CARRIAGE_RETURN (\r)
base_letter	NO (disabled)
	YES (enabled)
mixed_case	NO (disabled)
	YES (enabled)
composite	DEFAULT (no composite word indexing, default)
	GERMAN (German composite word indexing)
	DUTCH (Dutch composite word indexing)
index_themes	YES (enabled)
	NO (disabled, default)
index_text	YES (enabled, default)
	NO (disabled)
prove_themes	YES (enabled, default)
	NO (disabled)
theme_language	AUTO (default)
	(any NLS supported language)
alternate_spelling	GERMAN (German alternate spelling)
	DANISH (Danish alternate spelling)
	SWEDISH (Swedish alternate spelling)
	NONE (No alternate spelling, default)

continuation

Specify the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.

numgroup

Specify a single character that, when it appears in a string of digits, indicates that the digits are groupings within a larger single unit.

For example, comma ',' might be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.

numjoin

Specify the characters that, when they appear in a string of digits, cause Oracle to index the string of digits as a single unit or word.

For example, period '.' can be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.

Note:

The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.

In general, a value need not be specified for either numjoin or numgroup when creating a lexer preference for BASIC_LEXER.
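As a minimal sketch, the two attributes could be set explicitly as follows. The preference name num_lexer is hypothetical, and the characters simply mirror the comma and period examples above; in most cases the NLS defaults make these calls unnecessary.

```sql
begin
  -- 'num_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('num_lexer', 'BASIC_LEXER');
  -- treat ',' as a grouping (thousands) separator inside digit strings
  ctx_ddl.set_attribute('num_lexer', 'numgroup', ',');
  -- treat '.' as a joining (decimal point) character inside digit strings
  ctx_ddl.set_attribute('num_lexer', 'numjoin', '.');
end;
/
```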

printjoins

Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.

Note:

If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.

punctuations

Specify the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.

Characters that are defined as punctuations are removed from a token before text indexing. However, if a punctuations character is also defined as a printjoins character, the character is removed only when it is the last character in the token and it is immediately preceded by the same character.

For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during indexing and querying as well:

Token	Indexed Token
.doc	.doc
dog.doc	dog.doc
dog..doc	dog..doc
dog.	dog
dog...	dog..

In addition, BASIC_LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.
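The overlap shown in the table above, with the period acting as both a printjoins and a punctuations character, could be configured as in this sketch. The preference name dot_lexer is hypothetical; the punctuations value repeats the documented defaults so that '?' and '!' keep their usual role.

```sql
begin
  -- 'dot_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('dot_lexer', 'BASIC_LEXER');
  -- keep '.' inside tokens such as dog.doc
  ctx_ddl.set_attribute('dot_lexer', 'printjoins', '.');
  -- '.' also remains a sentence-ending character, alongside '?' and '!'
  ctx_ddl.set_attribute('dot_lexer', 'punctuations', '.?!');
end;
/
```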

skipjoins

Specify the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the Text index.

For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the Text index as pseudointellectual.

Note:

printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.
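A minimal sketch of the hyphen example follows. The preference name skip_lexer is hypothetical. Because the two attributes are mutually exclusive, the hyphen is deliberately not listed under printjoins.

```sql
begin
  -- 'skip_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('skip_lexer', 'BASIC_LEXER');
  -- index pseudo-intellectual as the single token pseudointellectual
  ctx_ddl.set_attribute('skip_lexer', 'skipjoins', '-');
  -- underscore is kept in tokens; it must not also appear in skipjoins
  ctx_ddl.set_attribute('skip_lexer', 'printjoins', '_');
end;
/
```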

startjoins/endjoins

For startjoins, specify the characters that when encountered as the first character in a token explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

For endjoins, specify the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the Text index entry for the token.

The following rules apply to both startjoins and endjoins:

The characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC_LEXER.
startjoins/endjoins characters can occur only at the beginning or end of tokens

whitespace

Specify the characters that are treated as blank spaces between tokens. BASIC_LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence and paragraph searching.

The predefined default values for whitespace are 'space' and 'tab'. These values cannot be changed. Specifying characters as whitespace characters adds to these defaults.

newline

Specify the characters that indicate the end of a line of text. BASIC_LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence and paragraph searching.

The only valid values for newline are NEWLINE and CARRIAGE_RETURN (for carriage returns). The default is NEWLINE.
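For text whose lines end in carriage returns, the newline attribute might be set as in this sketch. The preference name cr_lexer is hypothetical.

```sql
begin
  -- 'cr_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('cr_lexer', 'BASIC_LEXER');
  -- use carriage returns, rather than the default NEWLINE, as line ends
  ctx_ddl.set_attribute('cr_lexer', 'newline', 'CARRIAGE_RETURN');
end;
/
```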

base_letter

Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, and so on) are converted to their base form before being stored in the Text index. The default is NO (base-letter conversion disabled).

Note:

In German, if alternate spelling is set with base letter conversion, Oracle reduces word queries to their alternate spelling forms.

mixed_case

Specify whether the lexer leaves the tokens exactly as they appear in the text or converts the tokens to all uppercase. The default is NO (tokens are converted to all uppercase).

Note:

Oracle ensures that word queries match the case sensitivity of the index being queried. As a result, if you enable case sensitivity for your Text index, queries against the index are always case sensitive.
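The following sketch shows one way a case-sensitive index could be set up and queried. The preference name case_lexer, the table notes, its column body, and the index name notes_idx are all hypothetical.

```sql
begin
  -- 'case_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('case_lexer', 'BASIC_LEXER');
  -- store tokens exactly as they appear instead of converting to uppercase
  ctx_ddl.set_attribute('case_lexer', 'mixed_case', 'YES');
end;
/

-- hypothetical table and column
create index notes_idx on notes ( body )
  indextype is ctxsys.context
  parameters ( 'LEXER case_lexer' );

-- per the note above, this query is case sensitive against notes_idx
select id from notes where contains(body, 'Oracle') > 0;
```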

composite

Specify whether composite word indexing is disabled or enabled for either GERMAN or DUTCH text. The default is DEFAULT (composite word indexing disabled).

Note:

Base-letter indexing is always off when composite word indexing is set for GERMAN or DUTCH.

Case sensitivity is always on when composite is set for GERMAN.

Case sensitivity can be on or off when composite is set for DUTCH.

Composite User-Dictionary

You can create a user dictionary for your language to customize how words are decomposed.

You create the user dictionary in the $ORACLE_HOME/ctx/data/<language id> directory. The user dictionary must have the suffix .dct.

For example, the supplied user dictionary file for German is:

$ORACLE_HOME/ctx/data/del/drde.dct

The format for the user dictionary is as follows:

input term <tab> output term

The individual parts of the decomposed word must be separated by the # character. The following example entries are for the German word Hauptbahnhof:

Hauptbahnhof<tab>Haupt#Bahnhof
Hauptbahnhofes<tab>Haupt#Bahnhof
Hauptbahnhof<tab>Haupt#Bahnhof
Hauptbahnhoef<tab>Haupt#Bahnhof

index_themes

Specify YES to index theme information in English or French. This makes ABOUT queries more precise. The index_themes and index_text attributes cannot both be NO.

If you use the BASIC_LEXER and specify no value for index_themes, this attribute defaults to NO.

Note:

The index_themes attribute is supported only for the CONTEXT and CTXRULE index types.

If you create a CTXCAT index with index_themes set, the system ignores this parameter.

prove_themes

Specify YES to prove themes. Theme proving attempts to find related themes in a document. When no related themes are found, parent themes are eliminated from the document.

While theme proving is acceptable for large documents, short text descriptions with a few words rarely prove parent themes, resulting in poor recall performance with ABOUT queries.

Theme proving results in higher precision and lower recall (fewer rows returned) for ABOUT queries. For higher recall in ABOUT queries and possibly lower precision, you can disable theme proving. The default is YES.

The prove_themes attribute is supported for CONTEXT and CTXRULE indexes.
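For a column of short descriptions, theme proving could be turned off as in this sketch. The preference name theme_lexer, the table products, its columns, and the index name are hypothetical; the ABOUT query is only an illustration of where the recall difference would show up.

```sql
begin
  -- 'theme_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('theme_lexer', 'BASIC_LEXER');
  -- index themes so that ABOUT queries can use them
  ctx_ddl.set_attribute('theme_lexer', 'index_themes', 'YES');
  -- keep parent themes even when no related themes are found
  ctx_ddl.set_attribute('theme_lexer', 'prove_themes', 'NO');
end;
/

-- hypothetical table with short text descriptions
create index products_desc_idx on products ( description )
  indextype is ctxsys.context
  parameters ( 'LEXER theme_lexer' );

select id from products where contains(description, 'about(travel)') > 0;
```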

theme_language

Specify which knowledge base to use for theme generation when index_themes is set to YES. When index_themes is NO, setting this parameter has no effect.

You can specify any NLS supported language or AUTO. You must have a knowledge base for the language you specify. This release provides a knowledge base in only English and French. In other languages, you can create your own knowledge base.

The default is AUTO, which instructs the system to set this parameter according to the language of the environment.

index_text

Specify YES to index word information. The index_themes and index_text attributes cannot both be NO.

The default is YES.

alternate_spelling

Specify either GERMAN, DANISH, or SWEDISH to enable the alternate spelling in one of these languages. Enabling alternate spelling allows you to query a word in any of its alternate forms.

By default, alternate spelling is disabled (NONE). You can also specify NONE explicitly for no alternate spelling.

Note:

In German, if alternate spelling is set with base letter conversion, Oracle reduces word queries to their alternate spelling forms.

See Also:

For more information about the alternate spelling conventions Oracle uses, see Appendix E, "Alternate Spelling Conventions".

BASIC_LEXER Example

The following example sets printjoins characters and disables theme indexing with the BASIC_LEXER:

begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
-- keep underscore and hyphen inside tokens
ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
-- index words only, no themes
ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
end;

To create the index with no theme indexing and with printjoins characters set as above, issue the following statement:

create index myindex on mytable ( docs ) 
  indextype is ctxsys.context 
  parameters ( 'LEXER mylex' );

MULTI_LEXER

Use MULTI_LEXER to index text columns that contain documents of different languages. For example, you can use this lexer to index a text column that stores English, German, and Japanese documents.

This lexer has no attributes.

You must have a language column in your base table. To index multi-language tables, you specify the language column when you create the index.

You create a multi-lexer preference with the CTX_DDL.CREATE_PREFERENCE procedure. You add language-specific lexers to the multi-lexer preference with the CTX_DDL.ADD_SUB_LEXER procedure.

During indexing, the MULTI_LEXER examines each row's language column value and switches in the language-specific lexer to process the document.

Multi-language Stoplists

When you use the MULTI_LEXER, you can also use a multi-language stoplist for indexing.

See Also:

Multi-Language Stoplists in this chapter.

MULTI_LEXER Example

Create the multi-language table with a primary key, a text column, and a language column as follows:

create table globaldoc (
   doc_id number primary key,
   lang varchar2(3),
   text clob
);

Assume that the table holds mostly English documents, with the occasional German or Japanese document. To handle the three languages, you must create three sub-lexers, one for English, one for German, and one for Japanese:

ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','theme_language','english');

ctx_ddl.create_preference('german_lexer','basic_lexer');
ctx_ddl.set_attribute('german_lexer','composite','german');
ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');

Create the multi-lexer preference:

ctx_ddl.create_preference('global_lexer', 'multi_lexer');

Since the stored documents are mostly English, make the English lexer the default using CTX_DDL.ADD_SUB_LEXER:

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

Now add the German and Japanese lexers for their respective languages with the CTX_DDL.ADD_SUB_LEXER procedure. Also assume that the language column is expressed in the standard ISO 639-2 language codes, so add those as alternate values.

ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');

Now create the index globalx, specifying the multi-lexer preference and the language column in the parameter clause as follows:

create index globalx on globaldoc(text) indextype is ctxsys.context
parameters ('lexer global_lexer language column lang');

Querying Multi-Language Tables

At query time, the multi-lexer examines the language setting and uses the sub-lexer preference for that language to parse the query. If the language is not set, then the default lexer is used.

Otherwise, the query is parsed and run as usual. The index contains tokens from multiple languages, so such a query can return documents in several languages. To limit your query to a given language, use a structured clause on the language column.
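As a sketch of the structured clause described above, the following query restricts results from the globaldoc example to German documents. It assumes that rows are tagged with the ISO 639-2 code 'ger' in the lang column, and that the language setting examined at query time is the session's NLS_LANGUAGE; both are assumptions rather than statements from this section.

```sql
-- assumption: the session language selects the sub-lexer used to parse the query
alter session set nls_language = 'GERMAN';

-- the structured predicate on lang limits the result to German documents
select doc_id
  from globaldoc
 where contains(text, 'Bahnhof') > 0
   and lang = 'ger';
```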

CHINESE_VGRAM_LEXER

The CHINESE_VGRAM_LEXER type identifies tokens in Chinese text for creating Text indexes. It has no attributes.

You can use this lexer if your database character set is one of the following:

ZHS16CGB231280
ZHS16GBK
ZHT32EUC
ZHT16BIG5
ZHT32TRIS
AL24UTFFSS
UTF8

JAPANESE_VGRAM_LEXER

The JAPANESE_VGRAM_LEXER type identifies tokens in Japanese for creating Text indexes. It has no attributes.

You can use this lexer if your database character set is one of the following:

JA16SJIS
JA16EUC
UTF8

JAPANESE_LEXER

The JAPANESE_LEXER type identifies tokens in Japanese for creating Text indexes. It has no attributes.

This lexer offers the following benefits over the JAPANESE_VGRAM_LEXER:

generates a smaller index
better query response time
generates real word tokens resulting in better query precision

Because the JAPANESE_LEXER uses a new algorithm to generate tokens, indexing time is longer than with JAPANESE_VGRAM_LEXER.

The JAPANESE_LEXER supports the following character sets:

JA16SJIS
JA16EUC
UTF8

Japanese Lexer Example

When you specify JAPANESE_LEXER to create a Text index, the JAPANESE_LEXER resolves a sentence into words.

For example, the following compound word (natural language institute)

(illustration jap_befo.gif: the compound word in Japanese script)

is indexed as three tokens:

(illustration jap_afte.gif: the three resulting tokens)

In order to resolve a sentence into words, the internal dictionary is referenced. When a word cannot be found in the internal dictionary, Oracle uses the JAPANESE_VGRAM_LEXER to resolve it.

KOREAN_LEXER

The KOREAN_LEXER type identifies tokens in Korean text for creating Text indexes.

Note:

This lexer is supported for backward compatibility with older versions of Oracle Text that supported only this Korean lexer. If you are building a new application, Oracle recommends that you use the KOREAN_MORPH_LEXER.

You can use this lexer if your database character set is one of the following:

KO16KSC5601
UTF8

When you use the KOREAN_LEXER, you can specify the following boolean attributes:

Attribute	Attribute Values
verb	Specify TRUE or FALSE to index verbs. Default is TRUE.
adjective	Specify TRUE or FALSE to index adjectives. Default is TRUE.
adverb	Specify TRUE or FALSE to index adverbs. Default is TRUE.
onechar	Specify TRUE or FALSE to index one-character words. Default is TRUE.
number	Specify TRUE or FALSE to index numbers. Default is TRUE.
udic	Specify TRUE or FALSE to index user dictionary. Default is TRUE.
xdic	Specify TRUE or FALSE to index x-user dictionary. Default is TRUE.
composite	Specify TRUE or FALSE to index composite words.
morpheme	Specify TRUE or FALSE for morphological analysis. Default is TRUE.
toupper	Specify TRUE or FALSE to convert English to uppercase. Default is TRUE.
tohangeul	Specify TRUE or FALSE to convert hanja to hangeul. Default is TRUE.

Limitations

Sentence and paragraph sections are not supported with the Korean lexer.

KOREAN_MORPH_LEXER

The KOREAN_MORPH_LEXER type identifies tokens in Korean text for creating Oracle Text indexes. The KOREAN_MORPH_LEXER lexer offers the following benefits over KOREAN_LEXER:

better morphological analysis of Korean text
faster indexing
smaller indexes
more accurate query searching

Supplied Dictionaries

The KOREAN_MORPH_LEXER uses four dictionaries:

Dictionary	File
System	$ORACLE_HOME/ctx/data/kolx/drk2sdic.dat
Grammar	$ORACLE_HOME/ctx/data/kolx/drk2gram.dat
Stopword	$ORACLE_HOME/ctx/data/kolx/drk2xdic.dat
User-defined	$ORACLE_HOME/ctx/data/kolx/drk2udic.dat

The grammar, user-defined, and stopword dictionaries are text files in KSC 5601 format. You can modify these dictionaries using the defined rules. The system dictionary must not be modified.

You can add unregistered words to the user-defined dictionary file. The rules for specifying new words are in the file.

Supported Character Sets

You can use KOREAN_MORPH_LEXER if your database character set is one of the following:

KO16KSC5601
UTF8

Attributes

When you use the KOREAN_MORPH_LEXER, you can specify the following attributes:

Attribute	Attribute Values
verb_adjective	Specify TRUE or FALSE to index verbs and adjectives. Default is FALSE.
one_char_word	Specify TRUE or FALSE to index one syllable. Default is FALSE.
number	Specify TRUE or FALSE to index numbers. Default is FALSE.
user_dic	Specify TRUE or FALSE to index user dictionary. Default is TRUE.
stop_dic	Specify TRUE or FALSE to use stop-word dictionary. Default is TRUE. The stop-word dictionary belongs to KOREAN_MORPH_LEXER.
composite	Specify indexing style of composite noun. Specify COMPOSITE_ONLY to index only composite nouns. Specify NGRAM to index all noun components of a composite noun. Specify COMPONENT_WORD to index single noun components of composite nouns as well as the composite noun itself. Default is COMPONENT_WORD. The example below describes the difference between NGRAM and COMPONENT_WORD.
morpheme	Specify TRUE or FALSE for morphological analysis. If set to FALSE, tokens are created from the words that are divided by delimiters such as white space in the document. Default is TRUE.
to_upper	Specify TRUE or FALSE to convert English to uppercase. Default is TRUE.
hanja	Specify TRUE to index hanja characters. If set to FALSE, hanja characters are converted to hangul characters. Default is FALSE.
long_word	Specify TRUE to index long words that have more than 16 syllables in Korean. Default is FALSE.
japanese	Specify TRUE to index Japanese characters in KSC5601 code. Default is FALSE.
english	Specify TRUE to index alphanumeric strings. Default is TRUE.
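As a minimal sketch, several of these attributes could be changed from their defaults as follows. The preference name kmorph_lexer is hypothetical, and the particular combination of values is only illustrative.

```sql
begin
  -- 'kmorph_lexer' is a hypothetical preference name used only for this sketch
  ctx_ddl.create_preference('kmorph_lexer', 'KOREAN_MORPH_LEXER');
  -- also index verbs and adjectives (default is FALSE)
  ctx_ddl.set_attribute('kmorph_lexer', 'verb_adjective', 'TRUE');
  -- also index numbers (default is FALSE)
  ctx_ddl.set_attribute('kmorph_lexer', 'number', 'TRUE');
  -- index hanja characters as-is instead of converting them to hangul
  ctx_ddl.set_attribute('kmorph_lexer', 'hanja', 'TRUE');
end;
/
```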

Limitations

Sentence and paragraph sections are not supported with the Korean lexer.

KOREAN_MORPH_LEXER Example: Setting Composite Attribute

You can use the composite attribute to control how composite nouns are indexed.

NGRAM Example

When you specify NGRAM for the composite attribute, composite nouns are indexed with all possible component tokens. For example, the following composite noun (information processing institute):

(illustration 1.jpg: the composite noun in Korean script)

is indexed as six tokens:

(illustrations 2.jpg and 3.jpg: the six resulting tokens)

You can specify NGRAM indexing as follows:

begin
ctx_ddl.create_preference('korean_lexer','KOREAN_MORPH_LEXER');
ctx_ddl.set_attribute('korean_lexer','COMPOSITE','NGRAM');
end;

To create the index:

create index koreanx on korean(text) indextype is ctxsys.context
parameters ('lexer korean_lexer');

COMPONENT_WORD Example

When you specify COMPONENT_WORD for the composite attribute, composite nouns and their components are indexed. For example, the following composite noun (information processing institute):

(illustration 1.jpg: the composite noun in Korean script)

is indexed as four tokens:

(illustrations 1.jpg and comp.jpg: the four resulting tokens)

You can specify COMPONENT_WORD indexing as follows:


begin
ctx_ddl.create_preference('korean_lexer','KOREAN_MORPH_LEXER');
ctx_ddl.set_attribute('korean_lexer','COMPOSITE','COMPONENT_WORD');
end;

To create the index:

create index koreanx on korean(text) indextype is ctxsys.context
parameters ('lexer korean_lexer');