Lexer Objects

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer objects:

Object	Description
BASIC_LEXER	Lexer used for extracting tokens from text in languages, such as English and most western European languages, that use single-byte character sets.
MULTI_LEXER	Lexer used for indexing tables containing documents of different languages.
CHINESE_VGRAM_LEXER	Lexer used for extracting tokens from Chinese text.
JAPANESE_VGRAM_LEXER	Lexer used for extracting tokens from Japanese text.
KOREAN_LEXER	Lexer used for extracting tokens from Korean text.

BASIC_LEXER

Use the BASIC_LEXER object to identify tokens for creating Text indexes for English and all other supported single-byte languages.

The BASIC_LEXER is also used to enable base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for single-byte languages that have extended character sets.

In English, you can use the BASIC_LEXER to enable theme indexing.

Note:
Any changes made to tokens before Text indexing (for example, removing characters or base-letter conversion) are also performed on the query terms in a Text query. This ensures that the query terms match the form of the tokens in the Text index.

The BASIC_LEXER supports all single-byte character sets plus UTF8.

BASIC_LEXER has the following attributes:

Attribute	Attribute Values
continuation	characters (string)
numgroup	characters (string)
numjoin	characters (string)
printjoins	characters (string)
punctuations	characters (string)
skipjoins	characters (string)
startjoins	non-alphanumeric characters that occur at the beginning of a token (string)
endjoins	non-alphanumeric characters that occur at the end of a token (string)
whitespace	characters (string)
newline	NEWLINE (\n)
	CARRIAGE_RETURN (\r)
base_letter	NO (disabled)
	YES (enabled)
mixed_case	NO (disabled)
	YES (enabled)
composite	DEFAULT (no composite word indexing, default)
	GERMAN (German composite word indexing)
	DUTCH (Dutch composite word indexing)
index_themes	YES (enabled)
	NO (disabled)
index_text	YES (enabled)
	NO (disabled)
theme_language	AUTO (default)
	ENGLISH
alternate_spelling	GERMAN (German alternate spelling)
	DANISH (Danish alternate spelling)
	SWEDISH (Swedish alternate spelling)
	NONE (No alternate spelling)

Note:
The BASIC_LEXER object attributes that use character strings can contain multiple characters. Each character in the string serves as a distinct character for that type of attribute.
For example, if the string '*_.-' is specified for the printjoins attribute, each individual character ('*', '_', '.', and '-') in the string is treated as a joining character that is included in the index entry for a token in which the character occurs.

continuation

Specify the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.

numgroup

Specify a single character that, when it appears in a string of digits, indicates that the digits are groupings within a larger single unit.

For example, the comma ',' might be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.

numjoin

Specify the characters that, when they appear in a string of digits, cause Oracle to index the string of digits as a single unit or word.

For example, the period '.' can be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.

Note:
The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.
In general, a value need not be specified for either numjoin or numgroup when creating a Lexer preference for the BASIC_LEXER object.
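
If the NLS defaults do not fit the documents being indexed, both attributes can still be set explicitly. The following is a minimal sketch that assumes US-style number formatting; the preference name num_lex is hypothetical.

```sql
begin
  -- num_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('num_lex', 'BASIC_LEXER');
  -- assumption: ',' separates groups of thousands in the indexed text
  ctx_ddl.set_attribute('num_lex', 'numgroup', ',');
  -- assumption: '.' is the decimal point in the indexed text
  ctx_ddl.set_attribute('num_lex', 'numjoin', '.');
end;
```

With settings like these, a string such as 1,234.56 would be expected to be treated as one numeric unit rather than split at the comma or the period.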

printjoins

Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.

Note:
If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.

punctuations

Specify the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.

Characters that are defined as punctuations are removed from a token before text indexing; however, if a punctuations character is also defined as a printjoins character, the character is only removed if it is the last character in the token and it is immediately preceded by the same character.

For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during both indexing and querying:

Token	Indexed Token
.doc	.doc
dog.doc	dog.doc
dog..doc	dog..doc
dog.	dog
dog...	dog..

In addition, BASIC_LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.
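
A preference that defines the period as both a printjoins and a punctuations character, as in the table above, could be set up roughly as follows. This is a sketch only; the preference name dot_lex is hypothetical, and the punctuations string simply restates the documented defaults.

```sql
begin
  -- dot_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('dot_lex', 'BASIC_LEXER');
  -- keep '.' inside tokens such as dog.doc
  ctx_ddl.set_attribute('dot_lex', 'printjoins', '.');
  -- '.' also remains a sentence-ending character, alongside '?' and '!'
  ctx_ddl.set_attribute('dot_lex', 'punctuations', '.?!');
end;
```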

skipjoins

Specify the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the Text index.

For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the Text index as pseudointellectual.

Note:
printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.
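
The hyphen example above could be configured along the following lines. This is a minimal sketch; the preference name skip_lex is hypothetical.

```sql
begin
  -- skip_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('skip_lex', 'BASIC_LEXER');
  -- '-' joins the parts of a word but is not stored with the token
  ctx_ddl.set_attribute('skip_lex', 'skipjoins', '-');
  -- do not also list '-' in printjoins: the two attributes are mutually exclusive
end;
```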

startjoins/endjoins

For startjoins, specify the characters that, when encountered as the first character in a token, explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

For endjoins, specify the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the Text index entry for the token.

The following rules apply to both startjoins and endjoins:

The characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC_LEXER.
startjoins/endjoins characters can occur only at the beginning/end of tokens.

whitespace

Specify the characters that are treated as blank spaces between tokens. BASIC_LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence/paragraph searching.

The predefined, default values for whitespace are 'space' and 'tab'; these values cannot be changed. Specifying characters as whitespace characters adds to these defaults.

newline

Specify the characters that indicate the end of a line of text. BASIC_LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence/paragraph searching.

The only valid values for newline are NEWLINE and CARRIAGE_RETURN (for carriage returns). The default is NEWLINE.

base_letter

Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form before being stored in the Text index. The default is NO (base-letter conversion disabled).
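
Enabling base-letter conversion is a single attribute setting. The following is a minimal sketch; the preference name base_lex is hypothetical.

```sql
begin
  -- base_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('base_lex', 'BASIC_LEXER');
  -- convert characters with diacritical marks to their base form
  ctx_ddl.set_attribute('base_lex', 'base_letter', 'YES');
end;
```

Because the same conversion is applied to query terms, a query term typed without diacritical marks would be expected to match indexed words that carry them.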

mixed_case

Specify whether the lexer converts the tokens in Text index entries to all uppercase or stores the tokens exactly as they appear in the text. The default is NO (tokens converted to all uppercase).

Note:
Oracle ensures Text queries match the case-sensitivity of the index being queried. As a result, if you enable case-sensitivity for your Text index, queries against the index are always case-sensitive.
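
The effect of case-sensitive indexing can be sketched as follows. The table mytable and column docs follow the naming used in the BASIC_LEXER example in this section; the preference name case_lex, the index name case_idx, and the search term are hypothetical.

```sql
begin
  -- case_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('case_lex', 'BASIC_LEXER');
  -- store tokens exactly as they appear in the text
  ctx_ddl.set_attribute('case_lex', 'mixed_case', 'YES');
end;
/

create index case_idx on mytable ( docs )
  indextype is ctxsys.context
  parameters ( 'LEXER case_lex' );

-- with mixed_case enabled, this term is matched case-sensitively,
-- so documents containing only ORACLE or oracle are not counted
select count(*) from mytable where contains(docs, 'Oracle') > 0;
```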

composite

Specify whether composite word indexing is disabled or enabled for either GERMAN or DUTCH text. The default is DEFAULT (composite word indexing disabled).

Note:
Base-letter indexing is always off when composite word indexing is set for GERMAN or DUTCH.
Case-sensitivity is always on when composite is set for GERMAN.
Case-sensitivity can be on or off when composite is set for DUTCH.

index_themes

Specify YES to index theme information in English. This makes ABOUT queries more precise. The index_themes and index_text attributes cannot both be NO.

If you use the BASIC_LEXER and specify no value for index_themes, this attribute defaults to NO.
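
A theme-enabled index and an ABOUT query against it might look like the following sketch. The table mytable and column docs follow the naming used in the BASIC_LEXER example in this section; the preference name theme_lex, the index name theme_idx, and the query topic are hypothetical.

```sql
begin
  -- theme_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('theme_lex', 'BASIC_LEXER');
  -- index themes in addition to word information
  ctx_ddl.set_attribute('theme_lex', 'index_themes', 'YES');
  ctx_ddl.set_attribute('theme_lex', 'index_text', 'YES');
end;
/

create index theme_idx on mytable ( docs )
  indextype is ctxsys.context
  parameters ( 'LEXER theme_lex' );

-- the ABOUT query can use the indexed theme information
select count(*) from mytable where contains(docs, 'about(politics)') > 0;
```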

theme_language

Specify which knowledge base to use for theme generation when index_themes is set to YES. When index_themes is NO, setting this parameter has no effect. The default is AUTO, which instructs the system to set this parameter according to the language of the environment.

Note:
In release 8.1.6, the system supports generating themes in English (ENGLISH) only.

index_text

Specify YES to index word information. The index_themes and index_text attributes cannot both be NO.

The default is YES.

alternate_spelling

Specify either GERMAN, DANISH, or SWEDISH to enable alternate spelling in one of these languages. By default, alternate spelling is enabled in all three languages. You can specify NONE for no alternate spelling.

See Also:
For more information about the alternate spelling conventions Oracle uses, see Appendix F, "Alternate Spelling Conventions".

BASIC_LEXER Example

The following example sets printjoins characters and disables theme indexing with the BASIC_LEXER:

begin
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
  ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
  ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
end;

To create the index with no theme-indexing and with printjoins characters set as above, issue the following statement:

create index myindex on mytable ( docs ) 
  indextype is ctxsys.context 
  parameters ( 'LEXER mylex' );

MULTI_LEXER

Use this lexer to index text columns that contain documents of different languages. For example, you can use this lexer to index a text column that stores English, German, and Japanese documents.

This lexer has no attributes.

You create a multi-lexer preference with CTX_DDL.CREATE_PREFERENCE and then add language-specific lexers to the multi-lexer preference with CTX_DDL.ADD_SUB_LEXER. You must also have a language column in your base table to index multi-language tables. You specify the language column when you index with CREATE INDEX. See the example.

MULTI_LEXER Example

Create the multi-language table with a primary key, a text column, and a language column as follows:

create table globaldoc (
   doc_id number primary key,
   lang varchar2(3),
   text clob
);

Assume that the table holds mostly English documents, with the occasional German or Japanese document. To handle the three languages, you must create three sub-lexers, one for English, one for German, and one for Japanese:

ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','theme_language','english');

ctx_ddl.create_preference('german_lexer','basic_lexer');
ctx_ddl.set_attribute('german_lexer','composite','german');
ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');

Create the multi-lexer preference:

ctx_ddl.create_preference('global_lexer', 'multi_lexer');

Since the stored documents are mostly English, make the English lexer the default using CTX_DDL.ADD_SUB_LEXER:

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

Now add the German and Japanese lexers for their respective languages with CTX_DDL.ADD_SUB_LEXER. Also assume that the language column holds ISO 639-2 codes, so those codes must be added as alternate values:

ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');

Now create the index globalx, specifying the multi-lexer preference and the language column in the parameter string as follows:

create index globalx on globaldoc(text) indextype is ctxsys.context
parameters ('lexer global_lexer language column lang');

Querying Multi-Language Tables

At query time, the multi-lexer examines the language setting and uses the sub-lexer preference for that language to parse the query. If the language is not set, then the default lexer is used.

Otherwise, the query is parsed and run as usual. Since the index contains tokens from multiple languages, such a query can return documents in several languages. To limit your query to a given language, use a structured clause on the language column.
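
A query restricted to one language could look like the following sketch against the globaldoc table from the example above. The search term is hypothetical, and the lang value assumes the ISO 639-2 codes used when the sub-lexers were added.

```sql
-- return only German documents that match the text query
select doc_id
  from globaldoc
 where contains(text, 'computer') > 0
   and lang = 'ger';
```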

CHINESE_VGRAM_LEXER

The CHINESE_VGRAM_LEXER object identifies tokens in Chinese text for creating Text indexes. It has no attributes.

You can use this lexer if your database character set is one of the following:

ZHS16CGB231280
ZHS16GBK
ZHT32EUC
ZHT16BIG5
ZHT32TRIS
AL24UTFFSS
UTF8

JAPANESE_VGRAM_LEXER

The JAPANESE_VGRAM_LEXER object identifies tokens in Japanese for creating Text indexes. It has no attributes.

You can use this lexer if your database character set is one of the following:

JA16SJIS
JA16EUC
UTF8

KOREAN_LEXER

The KOREAN_LEXER object identifies tokens in Korean text for creating Text indexes.

You can use this lexer if your database character set is one of the following:

KO16KSC5601
UTF8

When you use the KOREAN_LEXER, you can specify the following boolean attributes:

Attribute	Attribute Values
verb	Specify TRUE or FALSE to index verbs. Default is TRUE.
adjective	Specify TRUE or FALSE to index adjectives. Default is TRUE.
adverb	Specify TRUE or FALSE to index adverbs. Default is TRUE.
onechar	Specify TRUE or FALSE to index one character. Default is TRUE.
number	Specify TRUE or FALSE to index numbers. Default is TRUE.
udic	Specify TRUE or FALSE to index the user dictionary. Default is TRUE.
xdic	Specify TRUE or FALSE to index the x-user dictionary. Default is TRUE.
composite	Specify TRUE or FALSE to index composites. Default is TRUE.
morpheme	Specify TRUE or FALSE for morphological analysis. Default is TRUE.
toupper	Specify TRUE or FALSE to convert English to uppercase. Default is TRUE.
tohangeul	Specify TRUE or FALSE to convert hanja to hangeul. Default is TRUE.
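
A Korean lexer preference that overrides some of these defaults might be created as in the following sketch. The preference name korean_lex is hypothetical, and the choice of attributes to switch off is for illustration only.

```sql
begin
  -- korean_lex is a hypothetical preference name used only for illustration
  ctx_ddl.create_preference('korean_lex', 'KOREAN_LEXER');
  -- assumption for this sketch: verbs and adverbs are not wanted in the index
  ctx_ddl.set_attribute('korean_lex', 'verb', 'FALSE');
  ctx_ddl.set_attribute('korean_lex', 'adverb', 'FALSE');
  -- keep the documented default of converting hanja to hangeul, stated explicitly
  ctx_ddl.set_attribute('korean_lex', 'tohangeul', 'TRUE');
end;
```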

Limitations

Sentence and paragraph sections are not supported with the Korean lexer.