Oracle® Text Application Developer's Guide
12c Release 1 (12.1)
This preface describes changes in Oracle Text for this release.
The changes in Oracle Text for Oracle Database 12c Release 1 (12.1) are described in the following topics.
This section describes the new features introduced in this release for Oracle Text. For a complete list of new features for Oracle Database 12c, see Oracle Database New Features Guide.
BIG_IO Large TOKEN_INFO Option
A new storage preference attribute, BIG_IO, specifies that TOKEN_INFO should be stored, where possible, in a single large SecureFiles database field rather than in inline BLOBs limited to 4,000 bytes.
This avoids the need for many seeks when loading large TOKEN_INFO data items from disk. Because sequential I/O is generally much faster than random I/O, performance improves.
The DOCID list identifies the documents which contain indexed terms and OFFSET identifies the location of those terms within each document.
A new storage preference attribute, SEPARATE_OFFSETS, used in conjunction with BIG_IO, causes the DOCID and OFFSET information to be stored in separate locations within the index.
The DOCID list is much shorter than the previous combined TOKEN_INFO data. This reduces the I/O needed for single-term queries, AND queries, and other queries that do not need offset (that is, word position) information, so performance improves for such queries. Queries that do require offset information (for example, phrase or NEAR searches) may be slightly slower.
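The trade-off described above can be illustrated with a small in-memory sketch. This is not Oracle's storage format. The `Posting` class and the function names are hypothetical, chosen only to show that an AND query can be answered from the DOCID lists alone, while a phrase query must also read the offsets.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Posting:
    """Postings for one token, with DOCIDs and offsets kept separately."""
    docids: list[int] = field(default_factory=list)               # ascending document ids
    offsets: dict[int, list[int]] = field(default_factory=dict)   # docid -> word positions


def build_index(docs: dict[int, str]) -> dict[str, Posting]:
    """Tokenize on whitespace and record docids and word positions per token."""
    index: dict[str, Posting] = {}
    for docid in sorted(docs):
        for pos, token in enumerate(docs[docid].lower().split()):
            posting = index.setdefault(token, Posting())
            if docid not in posting.offsets:
                posting.docids.append(docid)
                posting.offsets[docid] = []
            posting.offsets[docid].append(pos)
    return index


def and_query(index: dict[str, Posting], a: str, b: str) -> list[int]:
    """AND query: reads only the (short) DOCID lists."""
    if a not in index or b not in index:
        return []
    return sorted(set(index[a].docids) & set(index[b].docids))


def phrase_query(index: dict[str, Posting], a: str, b: str) -> list[int]:
    """Two-word phrase query "a b": must additionally read the offsets."""
    hits = []
    for docid in and_query(index, a, b):
        positions_b = set(index[b].offsets[docid])
        if any(p + 1 in positions_b for p in index[a].offsets[docid]):
            hits.append(docid)
    return hits


docs = {1: "new york city", 2: "york is new", 3: "old york"}
index = build_index(docs)
and_hits = and_query(index, "new", "york")        # documents containing both words
phrase_hits = phrase_query(index, "new", "york")  # documents containing the phrase
```

In this sketch `and_query` never touches `offsets`, which mirrors why queries that need no position information benefit from the separation.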
The Result Set Interface in Oracle Database 11g was able to produce the various kinds of data needed for a page of search results all at once, improving performance by sharing overhead. The Result Set Interface could also return data views which were difficult to express in SQL.
However, presenting snippet information along with search results required multiple steps in Oracle Database 11g: the application had to run a search query and then iterate through the results to retrieve snippet information for each row.
In Oracle Database 12c Release 1 (12.1), native support for snippet information in the Result Set Interface removes this extra iteration. SNIPPET needs to be defined in the Result Set Descriptor only if it is required. If it is defined, the snippet is returned in the result set along with the other search results.
This support provides faster, more flexible applications based on the Result Set Interface.
The forward index feature stores a tokenized and compressed version of the document in the Oracle Text index. This means that features such as highlighting and snippet generation no longer need to access, filter, and tokenize the original document, which is often an expensive process.
Support for the XQuery full-text extension extends Oracle's support for the W3C XQuery specification. This enables customers to perform XML-aware full-text searches on XML content stored in the database.
See Oracle XML DB Developer's Guide for more information about the XQuery full-text specification. Also see Oracle Text Reference for information about using CREATE_SECTION_GROUP with CTX_DDL.SET_SEC_GRP_ATTR to set xml_enable to create an Oracle XML Full-Text Index.
Near real-time indexing allows for frequent synchronization of indexes with heavy DML by maintaining recently changed index information in a new staging index which is designed to remain in memory. Data can be periodically moved from the staging index to the main index by means of a new MERGE mode for index optimization. The new option is turned on using the STAGE_ITAB storage option.
The new staging index table is relatively small and easy to cache in memory. When it is resident in memory, there is virtually no cost to this part of the index being fragmented. Separating the fragmented recent index from the unfragmented main index improves performance and lets users synchronize their indexes frequently without slowing down queries. When used with the TRANSACTIONAL and SYNC(ON COMMIT) index parameters, the index is effectively synchronous.
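The staging-and-merge idea can be sketched as a toy in-memory structure. This is only an illustration of the concept, not Oracle's implementation. The class `NearRealTimeIndex` and its methods are hypothetical names, with `staging` standing in for the small staging index and `main` for the large main index.

```python
class NearRealTimeIndex:
    """Toy two-tier inverted index: a small staging tier plus a main tier."""

    def __init__(self):
        self.main = {}     # token -> sorted list of docids (large, unfragmented)
        self.staging = {}  # token -> list of docids (small, recently changed)

    def sync(self, docid, text):
        """Make a new document searchable by writing only to the staging tier."""
        for token in set(text.lower().split()):
            self.staging.setdefault(token, []).append(docid)

    def merge(self):
        """Move everything from the staging tier into the main tier."""
        for token, docids in self.staging.items():
            merged = set(self.main.get(token, [])) | set(docids)
            self.main[token] = sorted(merged)
        self.staging.clear()

    def search(self, token):
        """Queries consult both tiers, so synced documents are visible at once."""
        token = token.lower()
        hits = set(self.main.get(token, [])) | set(self.staging.get(token, []))
        return sorted(hits)


idx = NearRealTimeIndex()
idx.sync(1, "Oracle Text index")
idx.sync(2, "near real-time text search")
hits_before_merge = idx.search("text")   # served from the staging tier
idx.merge()
hits_after_merge = idx.search("text")    # served from the main tier
```

Frequent `sync` calls touch only the small tier, and `merge` can run at a quieter time, which is the separation the paragraph above describes.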
In conjunction with near real-time indexes, automatic management allows for a background task that avoids the need for running optimize merge to move data from the small (normally in-memory) $G table to the larger (normally on-disk) $I table. The automatic management process runs in the background when the system is not in heavy use. Indexes must be registered with the management system if they are to be automatically optimized.
This feature simplifies management and improves performance for near real-time indexes and avoids the risk of a manual optimize merge slowing down the system.
See Oracle Text Reference for information about setting parameters for near real-time indexes.
A new procedure called POLICY_LANGUAGES has been added to the CTX_DOC package. The procedure enables the identification of the language of a section of text.
Applications can identify the language of a document in order to process it in an appropriate manner (for example, to set a LANGUAGE metadata column).
See Oracle Text Reference for more information about language identification.
The document-level lexer allows users to define different lexer and stoplist preferences for different documents in an index. This is an extension of the MULTI_LEXER and MULTI_STOPLIST features, but the lexer choice can now be independent of language. This makes applications more flexible, because different types of documents, or documents from different sources, may have different lexer or stopword requirements.
With the Japanese VGRAM lexer, certain Japanese queries require wildcard expansion, which can be expensive. Oracle now provides a switch that makes the Japanese VGRAM lexer generate tokens in BIGRAM mode only, which eliminates the need for wildcard queries.
The benefit of this feature is faster query performance on text indexed with the Japanese VGRAM lexer.
See Oracle Text Reference for more information about the Japanese VGRAM Lexer.
The pattern stopclass enables you to specify regular expressions. Any token that matches one of those regular expressions is treated as a stopword: it is not indexed and is not considered significant in queries. Unwanted strings, for example hexadecimal numbers or identifying codes, can be removed from the index to save space and improve performance.
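The effect of a pattern stopclass can be illustrated with a minimal sketch that drops tokens matching a set of regular expressions before they would be indexed. The two patterns and the function name `indexable_tokens` are assumptions made for illustration. This uses Python's `re` module and is not the Oracle Text ADD_STOPCLASS syntax.

```python
import re

# Hypothetical stop patterns: hexadecimal numbers and an invented ID-code format.
STOP_PATTERNS = [
    re.compile(r"0x[0-9a-fA-F]+"),   # e.g. 0x1F4A
    re.compile(r"[A-Z]{2}-\d{6}"),   # e.g. AB-123456
]


def indexable_tokens(text):
    """Return whitespace-separated tokens that match none of the stop patterns."""
    return [
        token
        for token in text.split()
        if not any(pattern.fullmatch(token) for pattern in STOP_PATTERNS)
    ]


tokens = indexable_tokens("error 0x1F4A in order AB-123456 today")
```

Only whole-token matches are dropped here (`fullmatch`), so an ordinary word that merely contains digits is still indexed.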
See Oracle Text Reference for information about using ADD_STOPCLASS with the pattern stopclass.
This release introduces the following query enhancements:
NEAR operator Enhancements and MNOT operator
The NEAR operator has been improved to allow for nested NEAR operators, and to allow for OR constructs within the NEAR operator. These enhancements improve the flexibility for application creation.
Mild Not (MNOT) is a new operator designed to find words that are not part of a phrase. For example, if you want to find references to the city of York, you might want to avoid finding it as part of the phrase New York. Excluding all documents containing New York does not solve this problem, since some documents might reference both York and New York. The new MNOT operator makes such semantics possible.
The MNOT operator improves the precision and recall of searches by allowing searches for words while excluding unwanted phrases that contain those words.
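The York example can be made concrete with a small sketch of the "mild not" semantics described above. This illustrates the idea only and is neither Oracle's implementation nor its query syntax. The function `mild_not` is a hypothetical name, and tokenization is a plain whitespace split that ignores punctuation.

```python
def mild_not(text, word, phrase):
    """Return True if `word` occurs at least once outside any occurrence of `phrase`."""
    tokens = text.lower().split()
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)

    # Mark every token position that belongs to an occurrence of the phrase.
    covered = set()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            covered.update(range(i, i + n))

    # The document matches if the word appears at a position not covered by the phrase.
    target = word.lower()
    return any(t == target and i not in covered for i, t in enumerate(tokens))


only_new_york = mild_not("flights to New York", "york", "new york")
both = mild_not("York and New York by train", "york", "new york")
only_york = mild_not("the walls of York", "york", "new york")
```

A plain NOT would reject the second document because it contains New York, whereas the mild form keeps it because York also appears on its own.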
See Oracle Text Reference for information about
See "Proximity Queries with NEAR and NEAR_ACCUM Operators" for information about the
Stored Query Expressions (SQEs) are a way of saving frequently used query expressions. Session-duration SQEs are not permanently saved but exist only for the current session. Because they are stored in session memory, they perform faster than permanent SQEs, and they avoid the clutter that can occur when SQEs are frequently created for short-term use within an application.
See Oracle Text Reference for information on managing SQEs.
Query Filter Cache
A common scenario in text searching is that a particular set of criteria is used in many queries. For example, you might want to apply a security filter that restricts the results to only those appropriate to a particular user. The query filter cache feature allows you to cache the results of a particular query, or part of a query, and then use those results to filter future searches. Conceptually, this is similar to using a Stored Query Expression (SQE), but it provides better performance for queries that have components shared with other queries.
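The security-filter scenario can be sketched as follows. This is a conceptual illustration under stated assumptions, not the Oracle Text query filter cache itself. The `FilterCache` class, the `DOCS` data, and the function names are all hypothetical.

```python
class FilterCache:
    """Caches the docid set produced by a filter so later searches can reuse it."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # function: filter key -> iterable of docids
        self._cache = {}
        self.evaluations = 0        # counts how often the filter was really run

    def docids(self, filter_key):
        if filter_key not in self._cache:
            self.evaluations += 1
            self._cache[filter_key] = frozenset(self._evaluate(filter_key))
        return self._cache[filter_key]


# Invented sample data: each document has text and an owning group.
DOCS = {
    1: {"text": "quarterly sales report", "group": "finance"},
    2: {"text": "engineering design report", "group": "engineering"},
    3: {"text": "annual audit report", "group": "finance"},
}


def security_filter(group):
    """The shared, potentially expensive part of the query."""
    return {docid for docid, doc in DOCS.items() if doc["group"] == group}


def search(term, group, cache):
    """Apply the cached filter first, then match the search term."""
    allowed = cache.docids(group)
    return sorted(d for d in allowed if term in DOCS[d]["text"].split())


cache = FilterCache(security_filter)
first = search("report", "finance", cache)
second = search("sales", "finance", cache)   # reuses the cached filter result
```

Both searches share the same filter, so it is evaluated once and the second search only pays for matching its own term.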
This release provides the following new features for SDATA sections:
Updateable SDATA sections
SDATA sections can be updated using a new PL/SQL procedure, CTX_DDL.UPDATE_SDATA. This procedure updates the value of an SDATA item without requiring all the data in that row to be reindexed.
This feature provides better performance for metadata that changes rapidly. For example, if you want to include stocklevel in text queries, make it an SDATA section. The associated row might include a long data sheet that you do not want to reindex every time the stocklevel changes. With this new feature, you can update only the SDATA part of the index.
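The stocklevel example can be sketched with a toy index that keeps structured values apart from the tokenized text. This is an illustration of the idea only. The `ToyIndex` class and its `update_sdata` method are hypothetical Python names and do not reflect the signature of CTX_DDL.UPDATE_SDATA.

```python
class ToyIndex:
    """Toy index with text postings and separately stored SDATA-like values."""

    def __init__(self):
        self.postings = {}        # token -> set of docids
        self.sdata = {}           # docid -> {section name: value}
        self.tokenize_calls = 0   # counts full (expensive) indexing passes

    def index_document(self, docid, text, sdata):
        """Full indexing: tokenizes the whole document text."""
        self.tokenize_calls += 1
        for token in text.lower().split():
            self.postings.setdefault(token, set()).add(docid)
        self.sdata[docid] = dict(sdata)

    def update_sdata(self, docid, section, value):
        """Cheap update: changes one structured value, leaves postings alone."""
        self.sdata[docid][section] = value

    def search(self, token, section, minimum):
        """Text match combined with a lower bound on a structured value."""
        return sorted(
            d for d in self.postings.get(token.lower(), set())
            if self.sdata[d].get(section, 0) >= minimum
        )


toy = ToyIndex()
toy.index_document(1, "long data sheet for the blue widget", {"stocklevel": 0})
out_of_stock = toy.search("widget", "stocklevel", 1)
toy.update_sdata(1, "stocklevel", 25)          # no re-tokenizing of the data sheet
in_stock = toy.search("widget", "stocklevel", 1)
```

The document text is tokenized once, and the later stock change is visible to queries without another indexing pass.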
See Oracle Text Reference for information on updatable SDATA sections with UPDATE_SDATA.
Adding SDATA sections to existing index
SDATA sections can be added to an existing index without needing to completely rebuild the index. The new SDATA sections will be indexed in all documents added or updated after this time. Previously indexed documents are not affected.
Application flexibility and uptime are improved, because indexes can be modified to reflect new business requirements without having to be rebuilt from scratch.
See Oracle Text Reference for details about SDATA sections.
Ordering by SDATA sections
Query templates now support ordering by one or more SDATA sections. This allows for more flexible application development and faster queries compared to a standard database sort.
Previously, you were allowed only 64 field sections. You can now create an almost unlimited number (10,000+) of field sections. Field sections are more efficient than zone sections. Previously, some applications had to use zone sections since there were not enough field sections available. This feature improves the performance of such applications.
The number of MDATA sections allowed is now effectively unlimited. The previous maximum was 100. This feature provides increased application flexibility, because there is no longer a need to combine multiple MDATA fields into a single one.
See "Field Section."
See Oracle Text Reference for information about MDATA sections.
Some features that are deprecated in this release for Oracle XML Database may affect Oracle Text XML applications. See Oracle Database Upgrade Guide for information on Oracle XML Database deprecations and changes.
The following Oracle Text features are desupported in this release:
CTXXPATH is desupported in Oracle Database 12c Release 1 (12.1). This does not affect the CTXCAT index type. Use XMLIndex indexes instead. See Oracle XML DB Developer's Guide for information about
OPTIMIZE for Text Indexes is desupported in Oracle Database 12c Release 1 (12.1). The [token index_token | fast | full [maxtime (time | unlimited)] operation is not supported for Oracle Database 12c. To optimize your index, use CTX_DDL.OPTIMIZE_INDEX. See Oracle Text Reference for information about
SYNC [MEMORY memsize] for Text Indexes is desupported in Oracle Database 12c Release 1 (12.1). To synchronize your index, use CTX_DDL.SYNC_INDEX. See Oracle Text Reference for information about
See Oracle Database Upgrade Guide for a complete list of desupported features for Oracle Database 12c.