Indexing, 3 of 6

Considerations For Indexing

You use the CREATE INDEX statement to create an Oracle Text index. When you create an index and specify no parameter string, an index is created with default parameters.

You can also override the defaults and customize your index to suit your query application. The parameters and preference types you use to customize your index with CREATE INDEX fall into the following general categories.

Type of Index

With Oracle Text, you can create one of three index types with CREATE INDEX. The following table describes each type, its purpose, and what features it supports:

Index Type	Description	Supported Preferences and Parameters	Query Operator	Notes
CONTEXT	Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML or plain text. With a context index, you can customize your index in a variety of ways.	All CREATE INDEX preferences and parameters supported except for INDEX SET. These supported parameters include the index partition clause, and the format, charset, and language columns.	CONTAINS	Supports all documents services and query services. Supports indexing of partitioned text tables.
CTXCAT	Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments. Other columns in the base table, such as item names, prices and descriptions can be included in the index to improve mixed query performance.	INDEX SET LEXER (theme indexing not supported) STOPLIST STORAGE WORDLIST (only prefix_index attribute supported for Japanese data) Format, charset, and language columns not supported. Table and index partitioning not supported.	CATSEARCH This operator has its own query language that supports logical operations, phrase queries, and wildcarding. The query language does not support ABOUT, fuzzy, and stem operators.	The size of a CTXCAT index is related to the total amount of text to be indexed, number of indexes in the index set, and number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set. The CTXCAT index does not support table and index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)
CTXRULE	Use CTXRULE index to build a document classification or routing application. The CTXRULE index is an index created on a table of queries, where the queries define the classification or routing criteria.	Only the BASIC_LEXER type supported for indexing your query set. Queries in your query set can include ABOUT, STEM, AND, NEAR, NOT, and OR operators. The following operators are not supported: ACCUM, EQUIV, WITHIN, WILDCARD, FUZZY, SOUNDEX, MINUS, WEIGHT, THRESHOLD. The CREATE INDEX storage clause supported for creating the index on the queries. Section group supported for when you use the MATCHES operator to classify documents. Wordlist supported for stemming operations on your query set. Filter, memory, datastore, and populate parameters are not applicable to index type CTXRULE.	MATCHES	Single documents (plain text, HTML, or XML) can be classified using the MATCHES operator, which turns a document into a set of queries and finds the matching rows in the CTXRULE index.

Index Type

Description

Supported Preferences and Parameters

Query Operator

Notes

CONTEXT

Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML or plain text.

With a context index, you can customize your index in a variety of ways.

All CREATE INDEX preferences and parameters supported except for INDEX SET.

These supported parameters include the index partition clause, and the format, charset, and language columns.

CONTAINS

Supports all documents services and query services.

Supports indexing of partitioned text tables.

CTXCAT

Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments. Other columns in the base table, such as item names, prices and descriptions can be included in the index to improve mixed query performance.

INDEX SET

LEXER (theme indexing not supported)

STOPLIST

STORAGE

WORDLIST (only prefix_index attribute supported for Japanese data)

Format, charset, and language columns not supported.

Table and index partitioning not supported.

CATSEARCH

This operator has its own query language that supports logical operations, phrase queries, and wildcarding.

The query language does not support ABOUT, fuzzy, and stem operators.

The size of a CTXCAT index is related to the total amount of text to be indexed, number of indexes in the index set, and number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set.

The CTXCAT index does not support table and index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)

CTXRULE

Use CTXRULE index to build a document classification or routing application. The CTXRULE index is an index created on a table of queries, where the queries define the classification or routing criteria.

Only the BASIC_LEXER type supported for indexing your query set.

Queries in your query set can include ABOUT, STEM, AND, NEAR, NOT, and OR operators.

The following operators are not supported: ACCUM, EQUIV, WITHIN, WILDCARD, FUZZY, SOUNDEX, MINUS, WEIGHT, THRESHOLD.

The CREATE INDEX storage clause supported for creating the index on the queries.

Section group supported for when you use the MATCHES operator to classify documents.

Wordlist supported for stemming operations on your query set.

Filter, memory, datastore, and populate parameters are not applicable to index type CTXRULE.

MATCHES

Single documents (plain text, HTML, or XML) can be classified using the MATCHES operator, which turns a document into a set of queries and finds the matching rows in the CTXRULE index.

See Also:

Index Creation in this chapter.

Location of Text

Your document text can reside in one of three places, the text table, the file system, or the world-wide web. When you index with CREATE INDEX, you specify the location using the datastore preference. Use the appropriate datastore according to your application.

The following table describes all the different ways you can store your text with the datastore preference type.

Datastore Type	Use When
DIRECT_DATASTORE	Data is stored internally in a text column. Each row is indexed as a single document. Your text column can be VARCHAR2, CLOB, BLOB, CHAR, or BFILE. XMLType columns are supported for the context index type.
MULTI_COLUMN_DATASTORE	Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one document per row.
DETAIL_DATASTORE	Data is stored internally in a text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.
FILE_DATASTORE	Data is stored externally in operating system files. Filenames are stored in the text column, one per row.
NESTED_DATASTORE	Data is stored in a nested table.
URL_DATASTORE	Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.
USER_DATASTORE	Documents are synthesized at index time by a user-defined stored procedure.

Indexing time and document retrieval time will be increased for indexing URLs since the system must retrieve the document from the network.

See Also:

Datastore Examples in this chapter.

Document Formats and Filtering

Formatted documents such as Microsoft Word and PDF must be filtered to text to be indexed. The type of filtering the system uses is determined by the FILTER preference type. By default the system uses the INSO_FILTER filter type which automatically detects the format of your documents and filters them to text.

Oracle can index most formats. Oracle can also index columns that contain documents with mixed formats.

No Filtering for HTML

If you are indexing HTML or plain text files, do not use the INSO_FILTER type. For best results, use the NULL_FILTER preference type.

Filtering Mixed Formatted Columns

If you have a mixed format column such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, you tag each row TEXT or BINARY. Rows that are tagged TEXT are not filtered.

For example, you can tag the HTML and plain text rows as TEXT and the Microsoft Word rows as BINARY. You specify the format column in the CREATE INDEX parameter clause.

Custom Filtering

You can create your own custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java stored procedure.

For external custom filtering, use the USER_FILTER filter preference type.

For internal filtering, use the PROCEDURE_FILTER filter type.

See Also:

PROCEDURE_FILTER Example in this chapter.

Bypassing Rows for Indexing

You can bypass rows in your text table that are not to be indexed, such as rows that contain image data. To do so, create a format column in your table and set it to IGNORE. You name the format column in the parameter clause of CREATE INDEX.

Document Character Set

The indexing engine expects filtered text to be in the database character set. When you use the INSO_FILTER filter type, formatted documents are converted to text in the database character set.

If your source is text and your document character set is not the database character set, you can use the INSO_FILTER or CHARSET_FILTER filter type to convert your text for indexing.

Mixed Character Set Columns

If your document set contains documents with different character sets, such as JA16EUC and JA16SJIS, you can index the documents provided you create a charset column. You populate this column with the name of the document character set on a per-row basis. You name the column in the parameter clause of the CREATE INDEX statement.

Document Language

Oracle can index most languages. By default, Oracle assumes the language of text to index is the language you specify in your database setup.

You use the BASIC_LEXER preference type to index whitespace-delimited languages such as English, French, German, and Spanish. For some of these languages you can enable alternate spelling, composite word indexing, and base letter conversion.

You can also index Japanese, Chinese, and Korean.

See Also:

Oracle Text Reference to learn more about indexing these languages.

Indexing Multi-language Columns

Oracle can index text columns that contain documents of different languages, such as a column that contains documents written in English, German, and Japanese. To index a multi-language column, you need a language column in your text table. Use the MULTI_LEXER preference type.

You can also incorporate a multi-language stoplist when you index multi-language columns.

Indexing Special Characters

When you use the BASIC_LEXER preference type, you can specify how non-alphanumeric characters such as hyphens and periods are indexed with respect to the tokens that contain them. For example, you can specify that Oracle include or exclude hyphen character (-) when indexing a word such as web-site.

These characters fall into BASIC_LEXER categories according to the behavior you require during indexing. The way the you set the lexer to behave for indexing is the way it behaves for query parsing.

Some of the special characters you can set are as follows:

Printjoins Character

Define a non-alphanumeric character as printjoin when you want this character to be included in the token during indexing.

For example, if you want your index to include hyphens and underscore characters, define them as printjoins. This means that words such as web-site are indexed as web-site. A query on website does not find web-site.

Skipjoins Character

Define a non-alphanumeric character as a skipjoin when you do not want this character to be indexed with the token that contains it.

For example, with the hyphen (-) character defined as a skipjoin, the word web-site is indexed as website. A query on web-site finds documents containing website and web-site.

Other Characters

Other characters can be specified to control other tokenization behavior such as token separation (startjoins, endjoins, whitespace), punctuation identification (punctuations), number tokenization (numjoins), and word continuation after line-breaks (continuation). These categories of characters have defaults, which you can modify.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

Case-Sensitive Indexing and Querying

By default, all text tokens are converted to uppercase and then indexed. This results in case-insensitive queries. For example, separate queries on each of the three words cat, CAT, and Cat all return the same documents.

You can change the default and have the index record tokens as they appear in the text. When you create a case-sensitive index, you must specify your queries with exact case to match documents. For example, if a document contains Cat, you must specify your query as Cat to match this document. Specifying cat or CAT does not return the document.

To enable or disable case-sensitive indexing, use the mixed_case attribute of the BASIC_LEXER preference.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

Language Specific Features

You can enable the following language specific features at index time:

Indexing Themes

For English and French, you can index document theme information. A document theme is a main document concept. Themes can be queried with the ABOUT operator.

You can index theme information in other languages provided you have loaded and compiled a knowledge base for the language.

By default themes are indexed in English and French. You can enable and disable theme indexing with the index_themes attribute of the BASIC_LEXER preference type.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

ABOUT Queries and Themes in Chapter 3, "Querying".

Base-Letter Conversion for Characters with Diacritical Marks

Some languages contain characters with diacritical marks such as tildes, umlauts, and accents. When your indexing operation converts words containing diacritical marks to their base letter form, queries need not contain diacritical marks to score matches. For example in Spanish with a base-letter index, a query of energía matches energía and energia in the index.

However, with base-letter indexing disabled, a query of energía matches only energía.

You can enable and disable base-letter indexing for your language with the base_letter attribute of the BASIC_LEXER preference type.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

Alternate Spelling

Languages such as German, Danish, and Swedish contain words that have more than one accepted spelling. For instance, in German, the ä character can be substituted for the ae character. The ae character is known as the base letter form.

By default, Oracle indexes words in their base-letter form for these languages. Query terms are also converted to their base-letter form. The result is that these words can be queried with either spelling.

You can enable and disable alternate spelling for your language using the alternate_spelling attribute in the BASIC_LEXER preference type.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

Composite Words

German and Dutch text contain composite words. By default, Oracle creates composite indexes for these languages. The result is that a query on a term returns words that contain the term as a sub-composite.

For example, in German, a query on the term Bahnhof (train station) returns documents that contain Bahnhof or any word containing Bahnhof as a sub-composite, such as Hauptbahnhof, Nordbahnhof, or Ostbahnhof.

You can enable and disable the creation of composite indexes with the composite attribute of the BASIC_LEXER preference.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXER.

Korean, Japanese, and Chinese Indexing

You index these languages with specific lexers:

Language	Lexer
Korean	KOREAN_MORPH_LEXER
Japanese	JAPANESE_LEXER
Chinese	CHINESE_VGRAM_LEXER

The KOREAN_MORPH_LEXER has its own set of attributes to control indexing. Features include composite word indexing.

See Also:

Oracle Text Reference to learn more about these lexers.

Fuzzy Matching and Stemming

Fuzzy matching enables you to match similarly spelled words in queries. Stemming enables you to match words with the same linguistic root.

Fuzzy matching and stemming are automatically enabled in your index if Oracle Text supports this feature for your language.

Fuzzy matching is enabled with default parameters for its similarity score lower limit and for its maximum number of expanded terms. At index time you can change these default parameters.

See Also:

Oracle Text Reference for more information about the BASIC_WORDLIST preference type.

Better Wildcard Query Performance

Wildcard queries enable you to issue left-truncated, right-truncated and doubly truncated queries, such as %ing, cos%, or %benz%. With normal indexing, these queries can sometimes expand into large word lists, degrading your query performance.

Wildcard queries have better response time when token prefixes and substrings are recorded in the index.

By default, token prefixes and substrings are not recorded in the Oracle Text index. If your query application makes heavy use of wildcard queries, consider indexing token prefixes and substrings. To do so, use the wordlist preference type. The trade-off is a bigger index for improved wildcard searching.

Document Section Searching

For documents that have internal structure such as HTML and XML, you can define and index document sections. Indexing document sections enables you to narrow the scope of your queries to within pre-defined sections. For example, you can specify a query to find all documents that contain the term dog within a section you define as Headings.

Sections must be defined prior to indexing and specified with the section group preference.

Oracle Text provides section groups with system-defined section definitions for HTML and XML. You can also specify that the system automatically create sections from XML documents during indexing.

See Also:

Chapter 6, "Document Section Searching"

Stopwords and Stopthemes

A stopword is a word that is not to be indexed. Usually stopwords are low information words in a given language such as this and that in English.

By default, Oracle provides a list of stopwords called a stoplist for indexing a given language. You can modify this list or create your own with the CTX_DDL package. You specify the stoplist in the parameter string of CREATE INDEX.

A stoptheme is a word that is prevented from being theme-indexed or prevented from contributing to a theme. You can add stopthemes with the CTX_DDL package.

You can search document themes with the ABOUT operator. You can retrieve document themes programatically with the CTX_DOC PL/SQL package.

Multi-Language Stoplists

You can also create multi-language stoplists to hold language-specific stopwords. A multi-language stoplist is useful when you use the MULTI_LEXER to index a table that contains documents in different languages, such as English, German, and Japanese.

At indexing time, the language column of each document is examined, and only the stopwords for that language are eliminated. At query time, the session language setting determines the active stopwords, like it determines the active lexer when using the multi-lexer.