Changes in This Release for Oracle Text Application Developer's Guide

This preface describes changes in Oracle Text for this release.

Changes in Oracle Text 12c Release 1 (12.1)

The changes in Oracle Text for Oracle Database 12c Release 1 (12.1) are described in the following topics.

New Features

This section describes the new features introduced in this release for Oracle Text. For a complete list of new features for Oracle Database 12c, see Oracle Database New Features Guide.

Performance Enhancements

BIG_IO Large TOKEN_INFO Option

A new storage preference attribute, BIG_IO, specifies that TOKEN_INFO should be stored, where possible, in a single large SecureFiles database field rather than in in-line BLOBs limited to 4,000 bytes.

This avoids the need to do many seeks when loading large TOKEN_INFO data items from disk. Sequential I/O is generally much faster than random I/O, thus improving performance.

See "Improved Response Time using BIG_IO Option of CONTEXT Index."

Separate Offsets

The DOCID list identifies the documents which contain indexed terms and OFFSET identifies the location of those terms within each document.

A new storage preference attribute, SEPARATE_OFFSETS, used in conjunction with BIG_IO, causes the DOCID and OFFSET information to be stored in separate locations within the index.

The DOCID list is much shorter than the previous combined TOKEN_INFO data. This reduces the I/O needed for single-term queries, AND queries, and other queries that do not need offset (that is, word position) information, so performance improves for such queries. Queries that do require offset information (for example, phrase or NEAR searches) may be slightly slower.
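The trade-off described above can be illustrated with a small Python sketch. It is a toy in-memory model, not Oracle's on-disk format, and every function name in it is hypothetical. It shows only that an AND query can be answered from the DOCID lists alone, while a phrase query must also read the offsets.

```python
# Toy model only: illustrates the idea of separate DOCID and OFFSET storage.
# It does not reflect how Oracle Text actually lays out TOKEN_INFO.
from collections import defaultdict


def build_postings(docs):
    """Map each token to {docid: [word positions]} (the combined form)."""
    postings = defaultdict(dict)
    for docid, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            postings[token].setdefault(docid, []).append(pos)
    return postings


def split_postings(postings):
    """Keep the DOCID lists and the OFFSET lists in separate structures."""
    docid_lists = {tok: sorted(entry) for tok, entry in postings.items()}
    offset_lists = {tok: dict(entry) for tok, entry in postings.items()}
    return docid_lists, offset_lists


def and_query(docid_lists, term_a, term_b):
    """AND query: reads only the short DOCID lists."""
    return sorted(set(docid_lists.get(term_a, [])) & set(docid_lists.get(term_b, [])))


def phrase_query(docid_lists, offset_lists, first, second):
    """Two-word phrase query: must also read the offsets."""
    hits = []
    for docid in and_query(docid_lists, first, second):
        second_positions = set(offset_lists[second][docid])
        if any(p + 1 in second_positions for p in offset_lists[first][docid]):
            hits.append(docid)
    return hits


docs = {1: "new york city", 2: "york is new to me", 3: "old york"}
docid_lists, offset_lists = split_postings(build_postings(docs))
print(and_query(docid_lists, "new", "york"))                   # [1, 2]
print(phrase_query(docid_lists, offset_lists, "new", "york"))  # [1]
```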

See "Improved Response Time using SEPARATE_OFFSETS Option of CONTEXT Index."

Snippet Support in Result Set Interface

The Result Set Interface in Oracle Database 11g was able to produce the various kinds of data needed for a page of search results all at once, improving performance by sharing overhead. The Result Set Interface could also return data views which were difficult to express in SQL.

In Oracle Database 11g, presenting snippet information along with search results required multiple steps: the application had to run a search query and then iterate through the results to retrieve the snippet for each row.

In Oracle Database 12c Release 1 (12.1), the Result Set Interface supports snippet information natively, which removes the need for this per-row retrieval. SNIPPET needs to be defined in the Result Set Descriptor only if snippets are required. When it is defined, the snippets are returned in the result set along with the other search results.

This support provides faster, more flexible applications based on the Result Set Interface.

See "CTX_DOC Highlighting Procedures."

Forward Index for Highlighting and Snippet Generation

The forward index feature stores a tokenized and compressed version of the document in the Oracle Text index. This means that features such as highlighting and snippet generation no longer need to access, filter, and tokenize the original document, which is often an expensive process.

See "Document Services Procedures Performance and Forward Index."

XQuery Full-Text

This feature extends Oracle's support for the W3C XQuery specification by adding support for the XQuery full-text extension. This enables customers to perform XML-aware full text searches on XML content stored in the database.

See Oracle XML DB Developer's Guide for more information about the XQuery full-text specification. Also see Oracle Text Reference for information about using CREATE_SECTION_GROUP with CTX_DDL.SET_SEC_GRP_ATTR to set xml_enable to create an Oracle XML Search Index.

Automatic Near Real-Time Indexing

Near real-time indexing allows for frequent synchronization of indexes with heavy DML by maintaining recently changed index information in a new staging index which is designed to remain in memory. Data can be periodically moved from the staging index to the main index by means of a new MERGE mode for index optimization. The new option is turned on using the STAGE_ITAB storage option.

The new staging index table will be relatively small, and easy to cache in memory. When resident in memory, there is virtually no cost to this part of the index being fragmented. By separating the fragmented recent index from the unfragmented main index, performance improves and users are allowed to synchronize their indexes frequently without slowing down query performance. When used with the TRANSACTIONAL and SYNC(ON COMMIT) index parameters, the index will be effectively synchronous.
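The division of labor between the staging index and the main index can be sketched in Python. This is a conceptual model under simplifying assumptions, not Oracle's implementation: the class and method names are hypothetical, and plain dictionaries stand in for the index tables.

```python
# Toy model only: a small staging index in front of a large main index.
class NearRealTimeIndex:
    def __init__(self):
        self.main = {}     # token -> sorted list of docids (unfragmented)
        self.staging = {}  # token -> docids appended by recent syncs

    def sync(self, docid, text):
        """Frequent, cheap synchronization: touches only the staging index."""
        for token in set(text.lower().split()):
            self.staging.setdefault(token, []).append(docid)

    def search(self, token):
        """Queries consult both parts, so newly synced rows are visible."""
        token = token.lower()
        found = set(self.main.get(token, [])) | set(self.staging.get(token, []))
        return sorted(found)

    def merge(self):
        """Periodic merge: move staged entries into the main index."""
        for token, docids in self.staging.items():
            merged = set(self.main.get(token, [])) | set(docids)
            self.main[token] = sorted(merged)
        self.staging.clear()


index = NearRealTimeIndex()
index.sync(1, "oracle text")
index.sync(2, "text index")
before_merge = index.search("text")
index.merge()
after_merge = index.search("text")
print(before_merge, after_merge, index.staging)
```

Query results are the same before and after the merge; only the location of the entries changes, which is why the merge can be deferred to a quiet period.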

See "Improved Response Time using STAGE_ITAB Option of CONTEXT Index."

In conjunction with near real-time indexes, automatic management allows for a background task that avoids the need for running optimize merge to move data from the small (normally in-memory) $G table to the larger (normally on-disk) $I table. The automatic management process runs in the background when the system is not in heavy use. Indexes must be registered with the management system if they are to be automatically optimized.

This feature simplifies management and improves performance for near real-time indexes and avoids the risk of a manual optimize merge slowing down the system.

See Oracle Text Reference for information about setting parameters for near real-time indexes.

Language Identification

A new procedure called POLICY_LANGUAGES has been added to the CTX_DOC package. The procedure enables the identification of the language of a section of text.

Applications can identify the language of a document in order to process it in an appropriate manner (for example, to set a LANGUAGE metadata column).

See Oracle Text Reference for more information about language identification.

Document-Level Lexer

The document-level lexer allows users to assign different lexer and stoplist preferences to different documents in an index. This is an extension of the MULTI_LEXER and MULTI_STOPLIST features, but the lexer choice can now be independent of language. This feature makes applications more flexible, because different types of documents, or documents from different sources, may have different lexer or stopword requirements.

See "Query Language."

BIGRAM Mode for Japanese VGRAM Lexer

With the Japanese VGRAM lexer, certain Japanese queries require wildcard expansion, which can be expensive. Oracle now provides a switch that makes the Japanese VGRAM lexer generate bigrams only, which eliminates the need for wildcard queries.

The benefit of this feature is faster query performance on text indexed with the Japanese VGRAM lexer.

See Oracle Text Reference for more information about the Japanese VGRAM Lexer.

Pattern Stopclass

The pattern stopclass enables you to specify regular expressions; any token that matches one of them is treated as a stopword. In other words, such tokens are not indexed and are not considered significant in queries. Unwanted strings, for example hexadecimal numbers or identifying codes, can be kept out of the index to save space and improve performance.
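The effect of a pattern stopclass can be illustrated with Python's `re` module. This sketch is not Oracle Text syntax: the two patterns and the function name are hypothetical, and Oracle's regular expression dialect may differ from Python's.

```python
import re

# Hypothetical patterns chosen for illustration: hexadecimal literals
# and identifying codes such as "ab-12345".
STOP_PATTERNS = [
    re.compile(r"0x[0-9a-f]+"),
    re.compile(r"[a-z]{2}-\d{4,}"),
]


def indexable_tokens(text):
    """Return the tokens that would still be indexed after pattern filtering."""
    tokens = text.lower().split()
    return [t for t in tokens if not any(p.fullmatch(t) for p in STOP_PATTERNS)]


kept = indexable_tokens("Error 0x1F3A in module AB-12345 during startup")
print(kept)  # ['error', 'in', 'module', 'during', 'startup']
```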

See Oracle Text Reference for information about using ADD_STOPCLASS to define a pattern stopclass.

Query Enhancements

This release introduces the following query enhancements:

NEAR Operator Enhancements and MNOT Operator

The NEAR operator has been improved to allow for nested NEAR operators, and to allow for OR constructs within the NEAR operator. These enhancements improve the flexibility for application creation.

Mild Not (MNOT) is a new operator designed to find words that are not part of a phrase. For example, if you want to find references to the city of York, you might want to avoid matching it as part of the phrase New York. Excluding all documents that contain New York does not solve this problem, because some documents might reference both York and New York. The new MNOT operator makes such semantics possible.
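The York example can be made concrete with a short Python sketch of the mild-not idea. This models the semantics described above on whitespace-separated tokens only; it is not the MNOT query syntax, and the function names are hypothetical.

```python
def mild_not_hits(text, term, phrase):
    """Return positions where `term` occurs outside any occurrence of `phrase`."""
    tokens = text.lower().split()
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)

    # Mark every token position that is covered by an occurrence of the phrase.
    covered = set()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            covered.update(range(i, i + n))

    return [i for i, tok in enumerate(tokens)
            if tok == term.lower() and i not in covered]


def matches_mild_not(text, term, phrase):
    """A document matches if the term occurs at least once outside the phrase."""
    return bool(mild_not_hits(text, term, phrase))


print(matches_mild_not("flights from New York", "york", "new york"))       # False
print(matches_mild_not("York and New York compared", "york", "new york"))  # True
```

A plain NOT on the phrase would reject the second document as well, which is the loss of recall that the mild-not semantics avoid.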

The MNOT operator improves precision and recall of searches by allowing searches for words, but excluding unwanted phrases containing those words.

See Oracle Text Reference for information about MNOT.

See "Proximity Queries with NEAR and NEAR_ACCUM Operators" for information about the NEAR operator.

Session-Duration SQEs

Stored Query Expressions (SQEs) are a way of saving frequently used query expressions. Session-duration SQEs are not permanently saved but exist only for the current session. Because they are stored in session memory, they are faster than permanent SQEs, and they avoid the clutter that can build up when SQEs are frequently created for short-term use within an application.

See Oracle Text Reference for information on managing SQEs.

Query Filter Cache

A common scenario in text searching is that a particular set of criteria are used in many queries. For example, you might want to apply a security filter that restricts the results to only those appropriate to a particular user. The query filter cache feature allows you to cache the results of a particular query, or part of a query, then use those results to filter future searches. Conceptually, this is similar to using a Stored Query Expression (SQE), but it provides better performance for queries that have components shared with other queries.
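The idea of reusing a cached filter result across queries can be sketched in Python. This is a conceptual illustration only, not the Oracle Text interface: the class, its methods, and the sample data are hypothetical, and sets of document IDs stand in for the cached query results.

```python
# Toy model only: cache the result of a shared filter and reuse it.
class FilterCache:
    def __init__(self, index):
        self.index = index           # token -> set of docids
        self.cache = {}              # filter token -> cached docid set
        self.filter_evaluations = 0  # how often a filter was actually evaluated

    def _filter_docids(self, filter_token):
        """Evaluate the filter once, then serve it from the cache."""
        if filter_token not in self.cache:
            self.filter_evaluations += 1
            self.cache[filter_token] = frozenset(self.index.get(filter_token, ()))
        return self.cache[filter_token]

    def filtered_search(self, token, filter_token):
        """Intersect the query result with the cached filter result."""
        return sorted(self.index.get(token, set()) & self._filter_docids(filter_token))


index = {"public": {1, 2, 4}, "oracle": {1, 3, 4}, "text": {2, 3, 4}}
searcher = FilterCache(index)
first = searcher.filtered_search("oracle", "public")
second = searcher.filtered_search("text", "public")
print(first, second, searcher.filter_evaluations)  # [1, 4] [2, 4] 1
```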

SDATA Sections

This release provides the following new features for SDATA sections:

Updatable SDATA Sections

SDATA sections can be updated using a new PL/SQL procedure, CTX_DDL.UPDATE_SDATA. This procedure updates the value of an SDATA item without requiring all the data in that row to be reindexed.

This feature provides better performance for rapidly mutating metadata. For example, if you want to include stocklevel in text queries, make it an SDATA section. The associated row might include a long data sheet of information that you do not want to re-index every time the stocklevel changes. With this new feature you can update only the SDATA part of the index.

See Oracle Text Reference for information on updatable SDATA sections with UPDATE_SDATA.

Adding SDATA Sections to an Existing Index

SDATA sections can be added to an existing index without needing to completely rebuild the index. The new SDATA sections will be indexed in all documents added or updated after this time. Previously indexed documents are not affected.

Application flexibility and uptime are improved, because indexes can be modified to reflect new business requirements without being rebuilt from scratch.

See Oracle Text Reference for details about SDATA sections.

Ordering by SDATA Sections

Query templates now support ordering by one or more SDATA sections. This allows for more flexible application development and faster queries compared to a standard database sort.

See "Ordering By SDATA Sections."

Unlimited Number of Field and MDATA Sections

Previously, you were allowed only 64 field sections. You can now create an almost unlimited number (10,000+) of field sections. Field sections are more efficient than zone sections. Previously, some applications had to use zone sections since there were not enough field sections available. This feature improves the performance of such applications.

The number of MDATA sections allowed is now effectively unlimited. The previous maximum was 100. This feature provides increased application flexibility, because there is no longer a need to combine multiple MDATA fields into a single one.

See "Field Section."

See Oracle Text Reference for information about MDATA sections.

Deprecated Features

Some features that are deprecated in this release for Oracle XML Database may affect Oracle Text XML applications. See Oracle Database Upgrade Guide for information on Oracle XML Database deprecations and changes.

Desupported Features

The following Oracle Text features are desupported in this release:

The CTXXPATH index type is desupported in Oracle Database 12c Release 1 (12.1). Use XMLIndex indexes instead. This desupport does not affect the CTXCAT index type. See Oracle XML DB Developer's Guide for information about XMLIndex.
ALTER INDEX OPTIMIZE for Text Indexes is desupported in Oracle Database 12c Release 1 (12.1). The ALTER INDEX OPTIMIZE [token index_token | fast | full [maxtime (time | unlimited)]] operation is not supported for Oracle Database 12c. To optimize your index, use CTX_DDL.OPTIMIZE_INDEX. See Oracle Text Reference for information about OPTIMIZE_INDEX.
SYNC [MEMORY memsize] for Text Indexes is desupported in Oracle Database 12c Release 1 (12.1). To synchronize your index, use CTX_DDL.SYNC_INDEX. See Oracle Text Reference for information about SYNC_INDEX.
The CHARSET_FILTER filter type is desupported in Oracle Database 12c Release 1 (12.1).

See Oracle Database Upgrade Guide for a complete list of desupported features for Oracle Database 12c.