Learn how to create Oracle Text indexes.
This chapter contains the following topics:
4.1 Summary of the Procedure for Creating an Oracle Text Index
With Oracle Text, you can create indexes of type
Starting with Oracle Database 12c Release 2 (12.2), you can choose to keep old index entries to search on original content by using the
ASYNCHRONOUS_UPDATE parameter string option.
By default, the system expects your documents to be stored in a text column. After you satisfy this requirement, you can create an Oracle Text index by using the
INDEX SQL statement as an extensible index of type
CONTEXT, without explicitly specifying preferences. The system automatically detects your language, the data type of the text column, and the format of the documents. Next, the system sets indexing preferences.
To create an Oracle Text index:
(Optional) Determine your custom indexing preferences, section groups, or stoplists if you do not use the defaults. The following table describes these indexing classes:
How are your documents stored?
How can the documents be converted to plaintext?
What language is being indexed?
How should stem and fuzzy queries be expanded?
How should the index data be stored?
What words or themes are not to be indexed?
How are document sections defined?
(Optional) Create custom preferences, section groups, or stoplists.
Create the Oracle Text index with the
INDEXSQL statement. Name your index and, if necessary, specify preferences.
4.2 Creating Preferences
If you want, you can create custom index preferences to override the defaults. Use the preferences to specify index information, such as where your files are stored and how to filter your documents. You create the preferences and then set the attributes.
4.3 Section Searching Example: Creating HTML Sections
When documents have internal structure such as in HTML and XML, you can define document sections by using embedded tags before you index. This approach enables you to query within the sections by using the
WITHIN operator. You define sections as part of a section group.
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1'); end;
4.4 Using Stopwords and Stoplists
A stopword is a word that is not to be indexed, such as this or that in English.
The system supplies a stoplist for every language. By default during indexing, the system uses the Oracle Text default stoplist for your language.
You can edit the default
CTXSYS.DEFAULT_STOPLIST or create your own with the following PL/SQL procedures:
You specify your custom stoplists in the parameter clause of
You can also dynamically add stopwords after indexing with the
ALTER INDEX statement.
4.4.1 Multilanguage Stoplists
You can create multilanguage stoplists to hold language-specific stopwords. This stoplist is useful when you use
MULTI_LEXER to index a table that contains documents in different languages, such as English, German, and Japanese.
4.4.2 Stopthemes and Stopclasses
You can also specify that numbers are not indexed. A class of alphanumeric characters such a numbers that is not to be indexed is a stopclass.
4.4.3 PL/SQL Procedures for Managing Stoplists
Use the following procedures to manage stoplists, stopwords, stopthemes, and stopclasses:
Oracle Text Reference to learn more about using these procedures
4.5 Creating a CONTEXT Index
CONTEXT index type is well suited for indexing large, coherent documents in formats such as Microsoft Word, HTML, or plain text. With a
CONTEXT index, you can also customize your index in a variety of ways. The documents must be loaded in a text table.
This section contains these topics:
4.5.1 CONTEXT Index and DML
CONTEXT index is not transactional. When you delete a record, the index is changed immediately. That is, your session no longer finds the record from the moment you make the change, and other users cannot find the record after you commit. For inserts and updates, the new information is not visible to text searches until an index synchronization has occurred. Therefore, when you perform inserts or updates on the base table, you must explicitly synchronize the index with
4.5.2 Default CONTEXT Index Example
The following statement creates a default
CONTEXT index called
myindex on the
text column in the
CREATE INDEX myindex ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;
When you use the
CREATE INDEX statement without explicitly specifying parameters, the system completes the following actions by default for all languages:
Detects the column type and uses filtering for the binary column types of
BFILE.Most document formats are supported for filtering. If your column is plain text, the system does not use filtering.
You can always change the default indexing behavior by customizing your preferences and specifying those preferences in the parameter string of
Oracle Text Reference to learn more about configuring your environment to use the
4.5.3 Incrementally Creating an Index with ALTER INDEX and CREATE INDEX
For a global index, use
INDEXto support the
NOPOPULATEkeyword in the
REPLACEparameter of the
REBUILDclause. By doing so, you can create indexes incrementally. This keyword is valuable for creating Oracle Text indexes in large installations that cannot afford to have the indexing process running continuously.
For a local index partition, modify the
REPLACE...') parameter string to support the
For a partition on a local index,
NOPOPULATE')) is supported. The partition-level
NOPOPULATEkeywords override any
NOPOPULATEspecified at the index level.
Oracle Text Reference to learn more about the syntax for the
4.5.4 Incrementally Creating a CONTEXT Index with POPULATE_PENDING
For large installations that cannot afford to have the indexing process run continuously, use the
CTX_DDL.POPULATE_PENDING procedure. This procedure also provides finer control over creation of the indexes. The preferred method is to create an empty index, place all rowids into the pending queue, and build the index through
Oracle Text Reference for information about
4.5.5 Custom CONTEXT Index Example: Indexing HTML Documents
To index an HTML document set located by URLs, specify the system-defined preference for the
NULL_FILTER in the
CREATE INDEX statement.
You can also specify your
htmgroup section group that uses
NETWORK_PREF datastore that uses
begin ctx_ddl.create_preference('NETWORK_PREF','NETWORK_DATASTORE'); ctx_ddl.set_attribute('NETWORK_PREF','HTTP_PROXY','www-proxy.us.example.com'); ctx_ddl.set_attribute('NETWORK_PREF','NO_PROXY','us.example.com'); ctx_ddl.set_attribute('NETWORK_PREF','TIMEOUT','300'); end; begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1'); end;
You can then index your documents:
CREATE INDEX myindex on docs(htmlfile) indextype is ctxsys.context parameters( 'datastore NETWORK_PREF filter ctxsys.null_filter section group htmgroup' );
Starting with Oracle Database 19c, the Oracle Text type
URL_DATASTORE is deprecated. Use
4.5.6 CONTEXT Index Example: Query Processing with FILTER BY and ORDER BY
To enable more efficient query processing and better response time for mixed queries, use
BY clauses as shown in the following example:
CREATE INDEX myindex on docs(text) INDEXTYPE is CTXSYS.CONTEXT FILTER BY category, publisher, pub_date ORDER BY pub_date desc;
Because you specified the
pub_date clause at query time, Oracle Text also considers pushing a relational predicate on any of these columns into the Oracle Text index row source.
Also, when the query has matching
BY criteria, by specifying
pub_date desc, Oracle Text determines whether to push
SORT into the Oracle Text index row source for better response time.
4.6 Creating a CTXCAT Index
4.6.1 CTXCAT Index and DML Operations
CTXCAT index is transactional. When you perform inserts, updates, and deletes on the base table, Oracle Text automatically synchronizes the index. Unlike a
CONTEXT index, no
CTX_DDL.SYNC_INDEX is necessary.
Applications that insert without invoking triggers, such as SQL*Loader, do not result in automatic index synchronization as described in this section.
4.6.2 About CTXCAT Subindexes and Their Costs
CTXCAT index contains subindexes that you define as part of your index set. You create a subindex on one or more columns to improve mixed query performance. However, the time Oracle Text takes to create a
CTXCAT index depends on its total size, and the total size of a
CTXCAT index is directly related to the following factors:
Total text to be indexed
Number of subindexes in the index set
Number of columns in the base table that make up the subindexes
Many component indexes in your index set also degrade the performance of insert, update, and delete operations, because more indexes must be updated.
Because of the added index time and disk space costs for creating a
CTXCAT index, before adding it to your index set, carefully consider the query performance benefit that each component index gives your application.
You can use
BASIC_STORAGE to speed up creation of a
CTXCAT index. This clause is described in Oracle Text Reference.
4.6.3 Creating CTXCAT Subindexes
Figure 4-1 Auction Table Schema and CTXCAT Index
Description of "Figure 4-1 Auction Table Schema and CTXCAT Index"
Figure 4-1 shows a table called
AUCTION with the following schema:
create table auction( item_id number, title varchar2(100), category_id number, price number, bid_close date);
To create your subindexes, create an index set to contain them:
begin ctx_ddl.create_index_set('auction_iset'); end;
Next, determine the structured queries that you are likely to enter. The
CATSEARCH query operator takes a mandatory text clause and optional structured clause.
In the example, this means that all queries include a clause for the
title column, which is the text column.
Assume that the structured clauses fall into the following categories:
|Structured Clauses||Subindex Definition to Serve Query||Category|
'price < 200'
'price = 150'
'order by price'
'price = 100 order by bid_close'
'order by price, bid_close'
Structured Query Clause Category A
SELECT FROM auction WHERE CATSEARCH(title, 'camera', 'price < 200')> 0; SELECT FROM auction WHERE CATSEARCH(title, 'camera', 'price = 150')> 0; SELECT FROM auction WHERE CATSEARCH(title, 'camera', 'order by price')> 0;
These queries can be served by using subindex B. However, for efficiency, you can also create a subindex only on
price (subindex A):
begin ctx_ddl.add_index('auction_iset','price'); /* sub-index A */ end;
Structured Query Clause Category B
The structured query clause includes an equivalent expression for
price ordered by
bid_close, and an expression for ordering by
bid_close, in that order:
SELECT FROM auction WHERE CATSEARCH( title, 'camera','price = 100 ORDER BY bid_close')> 0; SELECT FROM auction WHERE CATSEARCH( title, 'camera','order by price, bid_close')> 0;
These queries can be served with a subindex defined as follows:
begin ctx_ddl.add_index('auction_iset','price, bid_close'); /* sub-index B */ end;
Like a combined b-tree index, the column order that you specify with
CTX_DDL.ADD_INDEX affects the efficiency and viability of the index scan which Oracle Text uses to serve specific queries. For example, if two structured columns
q have a b-tree index specified as
'p,q', Oracle Text cannot scan this index to sort
4.6.4 Creating CTXCAT Index
This example combines the previous examples and creates the index set preference with the two subindexes:
begin ctx_ddl.create_index_set('auction_iset'); ctx_ddl.add_index('auction_iset','price'); /* sub-index A */ ctx_ddl.add_index('auction_iset','price, bid_close'); /* sub-index B */ end;
Figure 4-1 shows how the subindexes A and B are created from the auction table. Each subindex is a b-tree index on the text column and the named structured columns. For example, subindex A is an index on the
title column and the
CREATE INDEX auction_titlex ON AUCTION(title) INDEXTYPE IS CTXSYS.CTXCAT PARAMETERS ('index set auction_iset') ;
Oracle Text Reference to learn more about creating a
CTXCAT index with
4.7 Creating a CTXRULE Index
To build a document classification application, use the
CTXRULE index on a table or queries. The stream of incoming documents is classified by content, and the queries define your categories. You can use the
MATCHES operator to classify single documents.
To create a
CTXRULE index and a simple document classification application:
Create a table of queries.
myqueriestable to hold the category name and query text, and then populate the table with the classifications and the queries that define each classification.
CREATE TABLE myqueries ( queryid NUMBER PRIMARY KEY, category VARCHAR2(30), query VARCHAR2(2000) );
For example, consider a classification for the US Politics, Music, and Soccer subjects:
INSERT INTO myqueries VALUES(1, 'US Politics', 'democrat or republican'); INSERT INTO myqueries VALUES(2, 'Music', 'ABOUT(music)'); INSERT INTO myqueries VALUES(3, 'Soccer', 'ABOUT(soccer)');
You can also generate a table of rules (or queries) with the
CTX_CLS.TRAINprocedure, which takes as input a document training set.
CREATE INDEXstatement to create the
CTXRULEindex and specify lexer, storage, section group, and wordlist parameters if needed.
CREATE INDEX myruleindex ON myqueries(query) INDEXTYPE IS CTXRULE PARAMETERS ('lexer lexer_pref storage storage_pref section group section_pref wordlist wordlist_pref');
Classify a document.
MATCHESoperator to classify a document.
Assume that incoming documents are stored in the table
CREATE TABLE news ( newsid NUMBER, author VARCHAR2(30), source VARCHAR2(30), article CLOB);
If you want, create a "before insert" trigger with
MATCHESto route each document to a
news_routetable based on its classification:
BEGIN -- find matching queries FOR c1 IN (select category from myqueries where MATCHES(query, :new.article)>0) LOOP INSERT INTO news_route(newsid, category) VALUES (:new.newsid, c1.category); END LOOP; END;
4.8 Creating a Search Index for JSON