
About Oracle Text Indexes

An Oracle Text index is an Oracle domain index. To build your query application, you can create an Oracle Text index of type CONTEXT and query it with the CONTAINS operator.

For better performance for mixed queries, you can create a CTXCAT index. Use this index type when your application relies heavily on mixed queries to search small documents or descriptive text fragments based on related criteria such as dates or prices. You query this index with the CATSEARCH operator.

To build a document classification application, you create an Oracle Text index of type CTXRULE. With such an index, you can classify plain text, HTML, or XML documents using the MATCHES operator.

You create an index from a populated text table. In a query application, the table must contain the text or pointers to where the text is stored. Text is usually a collection of documents, but can also be small text fragments. If you are building a document classification application, you store your defining query set in the text table.

You create a text index as a type of extensible index to Oracle using standard SQL. This means that an Oracle Text index operates like an Oracle index. It has a name by which it is referenced and can be manipulated with standard SQL statements.

The benefits of creating an Oracle Text index include fast response time for text queries with the CONTAINS, CATSEARCH, and MATCHES Oracle Text operators. These operators query the CONTEXT, CTXCAT, and CTXRULE index types respectively.

See Also:

For more information about creating a Text index, see "Index Creation" in this chapter.

Structure of the Oracle Text CONTEXT Index

Oracle Text indexes text by converting all words into tokens. The general structure of an Oracle Text CONTEXT index is an inverted index where each token is associated with the list of documents (rows) that contain that token.

For example, after a single initial indexing operation, the word DOG might have an entry as follows:

DOG DOC1 DOC3 DOC5

This means that the word DOG is contained in the rows that store documents one, three and five.
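
The inverted structure described above can be sketched in a few lines of Python. This is a conceptual model only, not how Oracle Text stores its index; the function name build_inverted_index and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each uppercase token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word.upper()].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical sample rows; the ids mirror the DOC1/DOC3/DOC5 example above.
docs = {
    "DOC1": "the dog barks",
    "DOC2": "a cat sleeps",
    "DOC3": "dog and cat",
    "DOC5": "walk the dog",
}
index = build_inverted_index(docs)
print(index["DOG"])  # ['DOC1', 'DOC3', 'DOC5']
```

A lookup for a word is then a single dictionary access rather than a scan of every document, which is the reason an inverted index gives fast text queries.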

For more information, see optimizing the index in this chapter.

Merged Word and Theme Index

By default in English and French, Oracle Text indexes theme information with word information. You can query theme information with the ABOUT operator. You can optionally enable and disable theme indexing.

See Also:

To learn more about indexing theme information, see "Creating Preferences" in this chapter.

The Oracle Text Indexing Process

[Figure 2-1: illustration ccapp011.gif, the indexing process]

You initiate the indexing process with the CREATE INDEX statement. The goal is to create an Oracle Text index of tokens according to the parameters and preferences you specify.

Figure 2-1 shows the indexing process. This process is a data stream that is acted upon by the different indexing objects. Each object corresponds to an indexing preference type or section group you can specify in the parameter string of CREATE INDEX or ALTER INDEX. The sections that follow describe these objects.

Datastore Object

The stream starts with the datastore reading in the documents as they are stored in the system according to your datastore preference. For example, if you have defined your datastore as FILE_DATASTORE, the stream starts by reading the files from the operating system. You can also store your documents on the Internet or in the Oracle database.

Filter Object

The stream then passes through the filter. What happens here is determined by your FILTER preference. The stream can be acted upon in one of the following ways:

No filtering takes place. This happens when you specify the NULL_FILTER preference type. Documents that are plain text, HTML, or XML need no filtering.
Formatted documents (binary) are filtered to marked-up text. This happens when you specify the INSO_FILTER preference type.
Text is converted from a non-database character set to the database character set. This happens when you specify the CHARSET_FILTER preference type.
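
As a rough illustration of the first and third cases, the following Python sketch models a pass-through filter and a character set conversion. The function names are hypothetical and only echo the preference type names; the choice of ISO-8859-1 as the document character set and UTF-8 as the database character set is an assumption for the example.

```python
def null_filter(data):
    """Pass plain text, HTML, or XML through unchanged."""
    return data

def charset_filter(data, source_charset, database_charset="utf-8"):
    """Re-encode bytes from the document character set to the database one."""
    return data.decode(source_charset).encode(database_charset)

# Hypothetical document stored in ISO-8859-1; the database is assumed to use UTF-8.
latin1_doc = "café".encode("iso-8859-1")
converted = charset_filter(latin1_doc, "iso-8859-1")
print(converted.decode("utf-8"))
```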

Sectioner Object

After being filtered, the marked-up text passes through the sectioner that separates the stream into text and section information. Section information includes where sections begin and end in the text stream. The type of sections extracted is determined by your section group type.

The section information is passed directly to the indexing engine which uses it later. The text is passed to the lexer.
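
The split into text and section information can be pictured with the following Python sketch. It is a simplified model that assumes well-formed, attribute-free tags; the function name section and the sample markup are assumptions, and real section groups are far more capable.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")

def section(marked_up):
    """Split marked-up text into plain text and (name, start, end) section offsets."""
    text_parts = []
    sections = []
    open_tags = []  # stack of (section name, start offset in the plain text)
    pos = 0         # read position in the marked-up input
    out_len = 0     # length of plain text emitted so far
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        text_parts.append(chunk)
        out_len += len(chunk)
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            open_tags.append((name, out_len))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            sections.append((name, start, out_len))
    text_parts.append(marked_up[pos:])
    return "".join(text_parts), sections

text, sections = section("<title>Dog care</title> <body>Feed the dog daily</body>")
for name, start, end in sections:
    print(name, repr(text[start:end]))
```

The offsets refer to positions in the tag-free text, so the text can go on to tokenization while the section boundaries are kept separately for later use.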

Lexer Object

The lexer breaks the text into tokens according to your language. These tokens are usually words. To extract tokens, the lexer uses the parameters as defined in your lexer preference. These parameters include the definitions for the characters that separate tokens such as whitespace, and whether to convert the text to all uppercase or to leave it in mixed case.

When theme indexing is enabled, the lexer analyses your text to create theme tokens for indexing.
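
A minimal Python sketch of token extraction follows. The parameter names mixed_case and separators are hypothetical stand-ins for lexer preference settings, and theme token creation is not modeled.

```python
import re

def lex(text, mixed_case=False, separators=r"[\s.,;:!?]+"):
    """Break text into tokens at separator characters, uppercasing by default."""
    tokens = [t for t in re.split(separators, text) if t]
    return tokens if mixed_case else [t.upper() for t in tokens]

print(lex("The dog, the Cat."))
print(lex("The dog, the Cat.", mixed_case=True))
```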

Indexing Engine

The indexing engine creates the inverted index that maps tokens to the documents that contain them. In this phase, Oracle uses the stoplist you specify to exclude stopwords or stopthemes from the index. Oracle also uses the parameters defined in your WORDLIST preference, which tell the system how to create a prefix index or substring index, if enabled.
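
The effect of a stoplist and of a prefix index can be sketched in Python as follows. This is an illustrative model under simple assumptions: the function build_index, its prefix_len parameter, and the sample tokens are hypothetical and do not correspond to actual preference attributes.

```python
from collections import defaultdict

def build_index(doc_tokens, stoplist=frozenset(), prefix_len=0):
    """Build token -> doc ids, skipping stopwords; optionally add prefix entries."""
    tokens = defaultdict(set)
    prefixes = defaultdict(set)
    for doc_id, toks in doc_tokens.items():
        for tok in toks:
            if tok in stoplist:
                continue  # stopwords never reach the index
            tokens[tok].add(doc_id)
            if prefix_len and len(tok) >= prefix_len:
                prefixes[tok[:prefix_len]].add(doc_id)
    return dict(tokens), dict(prefixes)

# Hypothetical lexer output for two documents.
doc_tokens = {1: ["THE", "DOG", "BARKS"], 2: ["THE", "DOCTOR"]}
tokens, prefixes = build_index(doc_tokens, stoplist={"THE"}, prefix_len=2)
print(sorted(tokens))
print(sorted(prefixes["DO"]))
```

Storing prefix entries trades extra index space for faster lookups of queries that match on the beginning of a word.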

Partitioned Tables and Indexes

You can create a partitioned CONTEXT index on a partitioned text table. The table must be partitioned by range. Hash, composite and list partitions are not supported.

You might create a partitioned text table to partition your data by date. For example, if your application maintains a large library of dated news articles, you can partition your information by month or year. Partitioning makes large databases easier to manage, since querying, DML, and backup and recovery can act on single partitions.

See Also:

Oracle9i Database Concepts for more information about partitioning.

Querying Partitioned Tables

To query a partitioned table, you use CONTAINS in the SELECT statement just as you would to query a regular table. You can query the entire table or a single partition. However, if you use the ORDER BY SCORE clause, Oracle recommends that you query a single partition, or include a range predicate that limits the query to a single partition.

Parallel Indexing

Oracle Text supports parallel indexing with CREATE INDEX on a partitioned text table.

The parallel indexing operation creates multiple threads where each thread works on a partition. Since indexing is an I/O intensive operation, parallel indexing is most effective in decreasing your indexing time when you have distributed disk access and multiple CPUs.
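
The one-worker-per-partition idea can be sketched in Python with a thread pool. The partition names, row data, and function names are hypothetical, and Python threads here model only the structure of the operation, not the speedup obtained from distributed disks and multiple CPUs.

```python
from concurrent.futures import ThreadPoolExecutor

def index_partition(rows):
    """Build a token -> row ids index for the rows of one partition."""
    index = {}
    for row_id, text in rows:
        for token in text.upper().split():
            index.setdefault(token, set()).add(row_id)
    return index

def index_all_partitions(partitions, workers=4):
    """Index every partition with its own worker; return one index per partition."""
    names = list(partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(index_partition, [partitions[n] for n in names]))
    return dict(zip(names, results))

# Hypothetical table of news articles partitioned by month.
partitions = {
    "JAN_2002": [(1, "dog show opens"), (2, "markets rally")],
    "FEB_2002": [(3, "dog wins prize")],
}
local_indexes = index_all_partitions(partitions)
print(sorted(local_indexes["JAN_2002"]["DOG"]))
```

Because each worker touches only its own partition, no coordination between workers is needed while the indexes are built.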

Since parallel indexing decreases the initial indexing time, it is useful for:

data staging, when your product includes an Oracle Text index
rapid initial startup of applications based on large data collections
application testing, when you need to test different index parameters and schemas while developing your application

Note:

Parallel indexing with a partitioned text table can only affect the performance of an initial index with CREATE INDEX. It does not affect DML performance with ALTER INDEX, and has minimal impact on query performance.

Limitations for Indexing

Columns with Multiple Indexes

A column can have no more than a single domain index attached to it, which is in keeping with Oracle standards. However, a single Text index can contain theme information in addition to word information.

Indexing Views

Oracle SQL standards do not support creating indexes on views. Therefore, if you need to index documents whose contents are in different tables, you can create a data storage preference using the USER_DATASTORE object. With this object, you can define a procedure that synthesizes documents from different tables at index time.
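
The idea of synthesizing one document from several tables can be illustrated with a short Python sketch. In Oracle Text the synthesizing procedure is a stored procedure, so this is only a conceptual model; the table contents and the function name synthesize_document are assumptions.

```python
def synthesize_document(doc_id, titles, authors, bodies):
    """Concatenate pieces stored in separate tables into one indexable document."""
    parts = [titles.get(doc_id, ""), authors.get(doc_id, ""), bodies.get(doc_id, "")]
    return "\n".join(p for p in parts if p)

# Hypothetical tables keyed by document id.
titles = {10: "Dog care"}
authors = {10: "J. Smith"}
bodies = {10: "Feed the dog daily."}
print(synthesize_document(10, titles, authors, bodies))
```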

See Also:

Oracle Text Reference to learn more about USER_DATASTORE.
