1
Query Concepts

This chapter explains the fundamental concepts that underlie ConText text and theme processing.

Text Queries

In ConText, a text query is a search for a word or phrase in a document set. A document set is usually stored in a base table containing a text column, where every row in the column contains a document. Before you can issue a text query, the text column must be indexed.

See Also:

For more information about creating text indexes for columns, see Oracle8 ConText Cartridge Administrator's Guide.

You can issue text queries (and theme queries) using one of the following query methods:

one-step
two-step
in-memory

All three methods produce the same query results, returning a hitlist of the documents (rows in the base-table) that satisfy your query. Each returned row contains information such as the primary key and relevance score of a document that matches the query. You choose a method depending on the application.

In an application, you can present the hitlist to the user ordered by score and let the user select a document. You can then present the selected document with query terms highlighted, or summarize it thematically using linguistic analysis.

Case-Sensitivity

By default, ConText creates text indexes without being sensitive to the case of tokens in the documents. Because of this, text queries are case-insensitive. That is, a query on United returns documents that contain United and UNITED and united.

However, you can make text queries case-sensitive by using a case-sensitive lexer when you or your ConText administrator indexes the document set. When you create a case-sensitive index, a query on United is different from united, which is different from UNITED.
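
The difference between the two kinds of index can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the functions build_index and query are hypothetical, tokens are split on whitespace, and nothing here reflects how a ConText lexer is actually implemented.

```python
from collections import defaultdict

def build_index(docs, case_sensitive=False):
    """Map each token to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for token in text.split():
            # A case-insensitive index folds every token to a single case.
            index[token if case_sensitive else token.upper()].add(key)
    return index

def query(index, term, case_sensitive=False):
    """Return the sorted keys of the documents that contain the term."""
    return sorted(index.get(term if case_sensitive else term.upper(), set()))

docs = {1: "United Airlines", 2: "UNITED we stand", 3: "a united front"}

insensitive = build_index(docs)
sensitive = build_index(docs, case_sensitive=True)

print(query(insensitive, "United"))                     # [1, 2, 3]
print(query(sensitive, "United", case_sensitive=True))  # [1]
```

With the case-insensitive index, one query term matches all three spellings; with the case-sensitive index, each spelling is a separate token.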

See Also:

For more information about issuing case-sensitive text queries, see Case-Sensitive Queries in Chapter 3.

For more information about creating case-sensitive text indexes for columns, see Oracle8 ConText Cartridge Administrator's Guide.

Section Searching

In addition to searching for words within documents, ConText enables you to narrow text searches down to pre-defined sections within documents. To do section searching, you or your ConText administrator must define sections by specifying what tags delimit the section.

For example, in an HTML document set, you can define all appropriately tagged headings as a document section, and then use the WITHIN operator to query for a term within all headings across all documents.
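
The idea of restricting a search to tagged headings can be illustrated with a short Python sketch. It does not show the WITHIN operator syntax or how sections are defined in ConText; the function within_headings, the regular expression, and the sample documents are all assumptions made for illustration.

```python
import re

# Text between <h1>..<h6> tags is treated as the "heading" section.
HEADING = re.compile(r"<h[1-6][^>]*>(.*?)</h[1-6]>", re.IGNORECASE | re.DOTALL)

def within_headings(docs, term):
    """Return the keys of documents whose heading sections contain the term."""
    hits = []
    for key, html in docs.items():
        # Only text inside heading tags is searched; body text is ignored.
        sections = HEADING.findall(html)
        if any(term.lower() in section.lower().split() for section in sections):
            hits.append(key)
    return hits

docs = {
    1: "<h1>Oracle pricing</h1><p>Details about licences.</p>",
    2: "<h1>Release notes</h1><p>Oracle fixed several bugs.</p>",
}
print(within_headings(docs, "oracle"))  # [1]
```

Document 2 mentions the term only in its body, so it is not returned by the section search even though a plain text query would find it.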

See Also:

For more information about how to issue section searches, see "WITHIN Operator" in Chapter 3.

For more information about defining sections, see the Oracle8 ConText Cartridge Administrator's Guide.

Theme Queries

In addition to querying English-language documents by words or phrases (text query), you can query these documents by theme, or by their main concepts.

Theme queries work similarly to text queries in that you must create an index (in this case, a theme index) for the documents before you can query them. Theme queries differ from text queries in that you need not provide the word patterns for the search. ConText interprets your query conceptually according to its view of the world and returns an appropriate document hitlist based on theme, along with a measure of how relevant each document is to the query.

You can use the standard query methods to perform theme queries, namely one-step, two-step, and in-memory. In a theme query, you can use most of the operators you use in regular text queries.

See Also:

For more information about theme queries, see Chapter 4, "Theme Queries".

Query Methods

ConText supports three different methods for performing queries:

two-step
one-step
in-memory

In addition, ConText provides a method for counting query hits without performing an actual query.

Two-step Queries

Two-step queries use a PL/SQL procedure in the first step to store the results in a specified result table.

The second step uses a SELECT statement to select the results from the result table. In addition, the hitlist table can be joined with the original table to return more detailed document information. In the two-step method, the physical hitlist table is available to the application program.
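
The two steps can be modeled with a small Python sketch in which the standard sqlite3 module stands in for the database. Everything named here is an assumption for illustration: the tables docs and hitlist, their columns, and the function store_hits are hypothetical and do not reproduce the ConText PL/SQL procedure or its parameters, and the occurrence count is only a placeholder for a relevance score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (pk INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE hitlist (textkey INTEGER, score INTEGER);
    INSERT INTO docs VALUES (1, 'Fuel report', 'petroleum prices and petroleum supply');
    INSERT INTO docs VALUES (2, 'Weather', 'rain expected on Monday');
    INSERT INTO docs VALUES (3, 'Energy', 'petroleum exports');
""")

def store_hits(conn, term, result_table):
    """Step one: write the key and a stand-in score of each match to the result table."""
    for pk, body in conn.execute("SELECT pk, body FROM docs").fetchall():
        count = body.lower().split().count(term.lower())
        if count:
            # The occurrence count is a placeholder, not a ConText score.
            conn.execute(f"INSERT INTO {result_table} VALUES (?, ?)", (pk, count))

store_hits(conn, "petroleum", "hitlist")

# Step two: an ordinary SELECT joins the result table back to the base table.
rows = conn.execute("""
    SELECT h.score, d.title
    FROM hitlist h JOIN docs d ON d.pk = h.textkey
    ORDER BY h.score DESC
""").fetchall()
print(rows)  # [(2, 'Fuel report'), (1, 'Energy')]
```

Because the hitlist is a physical table, the application can query it again, join it with other tables, or keep it for later use.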

See Also:

For more information about using two-step queries, see "Using Two-Step Queries" in Chapter 2.

One-step Queries

In a one-step query, you create a single SQL statement to search for relevant documents. ConText returns directly to you the rows and columns of the text table that satisfy the query.

ConText creates the hitlist using internal result tables. As a result, you do not have to create result tables before running a one-step query; however, the internal result tables are not available to the application program.

See Also:

For more information about using one-step queries, see "Using One-Step Queries" in Chapter 2.

In-memory Queries

In-memory queries use a buffer and a CONTAINS cursor to the buffer to return query results, rather than the result tables used in two-step and one-step queries. As a result, in-memory queries are generally faster than two-step and one-step queries for shorter hitlists.

In an in-memory query, you open a cursor to the query buffer and run a query. ConText writes the results of the query to the buffer. You fetch the results, then close the cursor.

Results can be returned in order of their textkeys or sorted by score.
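
The open, fetch, and close sequence can be pictured with the following Python sketch. The class QueryBuffer and its methods are hypothetical stand-ins, not the ConText in-memory query procedures, and the occurrence count again serves only as a placeholder score.

```python
class QueryBuffer:
    """Hypothetical in-memory hit buffer with an open / fetch / close lifecycle."""

    def __init__(self, docs):
        self.docs = docs
        self.hits = None

    def open(self, term, score_sorted=False):
        # Run the query and write (textkey, score) pairs to the buffer.
        hits = []
        for key, text in self.docs.items():
            count = text.lower().split().count(term.lower())
            if count:
                hits.append((key, count))  # count is a stand-in for the score
        if score_sorted:
            hits.sort(key=lambda hit: hit[1], reverse=True)
        else:
            hits.sort(key=lambda hit: hit[0])
        self.hits = iter(hits)

    def fetch(self):
        # Return the next hit, or None when the buffer is exhausted.
        return next(self.hits, None)

    def close(self):
        self.hits = None

docs = {1: "cat and dog", 2: "dog dog dog", 3: "bird"}
buf = QueryBuffer(docs)
buf.open("dog", score_sorted=True)
results = []
while (hit := buf.fetch()) is not None:
    results.append(hit)
buf.close()
print(results)  # [(2, 3), (1, 1)]
```

No result table is created at any point; the hits exist only in the buffer until the cursor is closed.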

See Also:

For more information about using in-memory queries, see "Using In-Memory Queries" in Chapter 2.

Counting Query Hits

In addition to fully executing two-step, one-step, and in-memory queries, you can count the number of hits in a two-step or in-memory query before or after you issue the query. The documents can be stored in a local or remote database. Counting query hits helps to audit queries to ensure large and unmanageable hitlists are not returned.

See Also:

For more information about counting query hits, see "Counting Query Hits" in Chapter 2.

Query Expressions

Query expressions are made up of words and phrases (query terms) combined with operators and other special characters to produce search criteria. Operators specify the relative importance of the query terms, define relationships between those terms, control how the search is performed, and determine how much output is returned.

The most basic query expression is a single word or phrase, which returns documents with a score based on the number of occurrences of that word or phrase. More complex expressions allow the user to weight certain terms, search for words that sound like each other, and find all of the words based on a particular root.

See Also:

For more information about query expressions, see Chapter 3, "Understanding Query Expressions".

Stored Query Expressions

A stored query expression (SQE) is a named query expression that has been stored in database tables along with the results of the query.

You can combine queries by referencing an SQE within the query expression of another query. Using an SQE in a query results in faster execution of the query because the results are already stored in the database.
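
The following Python sketch illustrates why reusing stored results can be faster: the second query examines only the documents already saved under the SQE name. The names stored, store_sqe, and refine are hypothetical, and the sketch does not show how ConText stores or references an SQE.

```python
def matches(text, term):
    """True if the term occurs as a whole word in the text (case-insensitive)."""
    return term.lower() in text.lower().split()

stored = {}  # SQE name -> (expression, set of matching document keys)

def store_sqe(name, docs, term):
    # Save the expression together with its results.
    stored[name] = (term, {key for key, text in docs.items() if matches(text, term)})

def refine(docs, sqe_name, term):
    """Apply a further term to the stored results instead of the whole set."""
    _, cached_keys = stored[sqe_name]
    return sorted(key for key in cached_keys if matches(docs[key], term))

docs = {1: "oracle database tuning", 2: "oracle text search", 3: "text editors"}
store_sqe("about_oracle", docs, "oracle")
print(refine(docs, "about_oracle", "text"))  # [2]
```

The same pattern supports interactive refinement: each additional query narrows the stored result rather than searching the full document set again.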

Stored query expressions can also be used to perform interactive queries, in which an initial query is refined using one or more additional queries.

See Also:

For more information about using stored query expressions, see "Stored Query Expressions" in Chapter 3.

Query Expression Feedback

Query expression feedback is a feature that enables you to know how ConText parses a text or theme query expression before you execute the query. Knowing how ConText evaluates a text or theme query expression is useful for refining and debugging queries.

You can also design a query application so that it uses the feedback information to help users write better queries.

Scoring

When you issue either a text query or theme query, ConText returns the hitlist of documents that satisfy your query. Each document has a score that indicates how relevant the document is to the query you entered; the higher the score, the more relevant the document. You can use scores to order the hitlist to show the most relevant documents first.

In two-step queries, ConText calculates the score when you call the CTX_QUERY.CONTAINS procedure and stores the score in a result table called the hitlist table.

In one-step queries, ConText calculates scores when you use the CONTAINS function. You obtain scores using the SCORE function.

In in-memory queries, ConText returns the score for a hit as an out parameter with the CTX_QUERY.FETCH_HIT function.

Score Range

The score of a document in the result set is an integer within the range 1 to 100 inclusive. The highest score for a given query is not necessarily 100; it can be any integer in the range 1 <= n <= 100.

Scoring Algorithm for Text Queries

Note:

This section discusses how ConText calculates score for text queries, which is different from the way it calculates score for theme queries.

For more information about scoring for theme queries, see "Theme Querying" in Chapter 4.

To calculate a relevance score for a returned document in a text query, ConText uses an inverse frequency algorithm. Inverse frequency scoring assumes that frequently occurring terms in a document set are "noise" terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The following table illustrates ConText's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of times the query term must occur in the document for the document to score 100.

This table assumes that only one document in the set contains the query term.

Number of Documents in Document Set	Frequency of Term in Document
1	34
5	20
10	17
50	13
100	12
500	10
1,000	9
10,000	7
100,000	5
1,000,000	4

The table illustrates that if only one document contained the query term and there were five documents in the set, the term would have to occur 20 times in the document to score 100. If there were 1,000,000 documents in the set, the term would have to occur only 4 times in the document to score 100.
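
This guide does not state the scoring formula. As a rough illustration only, the Python sketch below assumes a formula of the form 3f(1 + log10(N/n)), where f is the number of occurrences of the term in the document, N is the number of documents in the set, and n is the number of documents that contain the term. The formula and the function names raw_score and occurrences_for_100 are assumptions, not part of ConText. Under this assumption the sketch reproduces the table rows up to 10,000 documents; for the two largest set sizes it yields a value one higher than the table, so treat it as an approximation.

```python
import math

def raw_score(freq, total_docs, docs_with_term):
    """Assumed inverse frequency formula 3f(1 + log10(N/n)); not taken from this guide."""
    return 3 * freq * (1 + math.log10(total_docs / docs_with_term))

def occurrences_for_100(total_docs, docs_with_term=1):
    """Smallest term frequency whose assumed raw score reaches 100."""
    freq = 1
    while raw_score(freq, total_docs, docs_with_term) < 100:
        freq += 1
    return freq

for n_docs in (1, 5, 10, 50, 100, 500, 1_000, 10_000):
    print(n_docs, occurrences_for_100(n_docs))
```

The loop shows the trend the table describes: the larger the document set, the fewer occurrences a rare term needs to reach the top score.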

Example

You have 5000 documents dealing with chemistry in which the term chemical occurs at least once in every document. The term chemical thus occurs frequently in the document set.

You have a document that contains 5 occurrences of chemical and 5 occurrences of the term hydrogen. No other document contains the term hydrogen.

Because chemical occurs so frequently in the document set, its score for the document is lower than the score for hydrogen, which is infrequent in the document set as a whole. This is so even though both terms occur 5 times in the document.

Note:

Even if the relatively infrequent term hydrogen occurred 4 times in the document, and chemical occurred 5 times in the document, the score for hydrogen might still be higher, because chemical occurs so frequently in the document set (at least 5000 times).

Inverse frequency scoring also means that adding documents that contain hydrogen lowers the score for that term in the document, and adding more documents that do not contain hydrogen raises the score.
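
The chemistry example can be worked through numerically. The sketch below assumes a scoring formula of the form 3f(1 + log10(N/n)), with f occurrences in the document, N documents in the set, and n documents containing the term. This formula is an assumption used for illustration and is not stated in this guide; only the direction of each comparison should be relied on, not the exact values.

```python
import math

def raw_score(freq, total_docs, docs_with_term):
    """Assumed formula 3f(1 + log10(N/n)); an illustration, not ConText's algorithm."""
    return 3 * freq * (1 + math.log10(total_docs / docs_with_term))

# 5000 documents; chemical is in all of them, hydrogen in only one.
chemical = raw_score(5, 5000, 5000)
hydrogen = raw_score(5, 5000, 1)
hydrogen_4 = raw_score(4, 5000, 1)        # hydrogen occurring only 4 times

# Add 9 documents that contain hydrogen: N = 5009, n = 10.
hydrogen_more_hits = raw_score(5, 5009, 10)
# Add 5000 documents that do not contain hydrogen: N = 10000, n = 1.
hydrogen_more_docs = raw_score(5, 10000, 1)

print(chemical < hydrogen_4 < hydrogen)   # True
print(hydrogen_more_hits < hydrogen)      # True
print(hydrogen_more_docs > hydrogen)      # True
```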

DML and Scoring

Because the scoring algorithm is based on the number of documents in the document set, inserting, updating, or deleting documents in the document set is likely to change the score for any given term.

If DML is heavy, you or your ConText administrator must optimize the index. Perfect relevance ranking is obtained by executing a query right after optimizing the index.

If DML is light, ConText still gives fairly accurate relevance ranking.

In either case, you or your ConText administrator must synchronize the index with CTX_DML.SYNC whenever DML is performed on the index.

See Also:

For more information about optimizing and synchronizing an index, see Oracle8 ConText Cartridge Administrator's Guide.

Result Tables

Result tables are storage areas used by ConText to store output from user queries. These tables are allocated by the application program or procedure and exist until they are released by the application.

Result tables store the following:

output of a two-step query
query expression feedback information
highlighting output for viewing query terms in documents
linguistic output

Result tables are also used in one-step queries; however, the tables used in one-step queries are internal tables that are allocated by ConText and cannot be accessed from an application program.

You can create result tables using the SQL command CREATE or using the CTX_QUERY.GETTAB function.

See Also:

For more information about creating and using result tables, see "Hitlist Result Tables" in Chapter 2.

For more information about the structure of result tables, see Appendix A, "Result Tables".

For more information about the feedback result table, see "Understanding the Feedback Table" in Chapter 5.

For more information about generating linguistic output, see "Generating Linguistic Output" in Chapter 8.

Document Presentation

When your application obtains the results of a query, it can let the user select a document from the hitlist and then present the following:

the document with or without query terms highlighted (text queries)
the document with or without paragraphs highlighted (theme queries)
linguistic output of the document (English only)

Presenting Highlighted Documents

ConText enables you to present documents to the user with query terms highlighted for text queries, or with relevant paragraphs highlighted for theme queries.

With PL/SQL, you create the viewable output by calling a highlighting procedure after you issue the query. This procedure outputs the highlighted information to a result table, which you use to present the document.
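
The general idea of recording highlight information separately and then using it to present the document can be sketched in Python. The functions highlight_offsets and mark_up, the (offset, length) layout, and the marker strings are assumptions for illustration; they do not describe the ConText highlighting procedure or the structure of its result table.

```python
import re

def highlight_offsets(text, term):
    """Return (offset, length) pairs for whole-word matches of the term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]

def mark_up(text, offsets, start="<<", end=">>"):
    """Wrap each recorded span in markers, working backwards so offsets stay valid."""
    for offset, length in sorted(offsets, reverse=True):
        text = (text[:offset] + start + text[offset:offset + length]
                + end + text[offset + length:])
    return text

doc = "Oracle ships ConText with Oracle8."
offsets = highlight_offsets(doc, "oracle")
print(offsets)                # [(0, 6)]
print(mark_up(doc, offsets))  # <<Oracle>> ships ConText with Oracle8.
```

Keeping the offsets apart from the text lets the application choose its own presentation, such as bold text or colored markup, when it displays the document.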

ConText also has an OCX control that you can embed programmatically in Windows client-side applications. This control allows users to query documents and then view them in their native formats, such as Microsoft Word, with query terms or paragraphs highlighted.

See Also:

For more information about presenting highlighted documents, see Chapter 6, "Document Presentation".

Presenting Linguistic Output (English Only)

For English-language documents, ConText linguistics enables you to create different views of the contents of documents that allow the user to quickly review the essential content of documents.

Because these services are separate and distinct from text and theme indexing, you can incorporate linguistic analysis and functionality in a text application, independent of the text/theme indexing process.

ConText linguistics can generate the following forms of linguistic output for documents:

Output Type	Description
Themes	The main concepts of a document.
Gist	Paragraph or paragraphs in a document that best represent what the document is about as a whole.
Theme Summary	Paragraph or paragraphs in a document that best represent a given theme in the document.
Sentence-Level Gist	Sentence or sentences in a document that best represent the themes in the document as a whole.
Sentence-Level Theme Summary	Sentence or sentences in a document that best match a single theme in the document.

You obtain linguistic output by submitting a linguistic request using the CTX_LING PL/SQL package.

See Also:

For more information about ConText linguistics, see Chapter 7, "Linguistic Concepts".
