Using XML Query Result Set Interface

11 Using XML Query Result Set Interface

This chapter descr ibes how to use the XML Query Result Set Interface.

This chapter contains these topics:

11.1 Overview of the XML Query Result Set Interface

The XML Query Result Set Interface (RSI) enables you to perform queries in XML and return results as XML, avoiding the SQL layer and requirement to work within SELECT semantics. The Result Set Interface uses a simple text query and an XML result set descriptor, where the hitlist is returned in XML according to the result set descriptor. The XML Query Result Set Interface uses SDATA sections for grouping and counting.

A page of search results in applications can consist of many disparate elements such as metadata of the first few documents, total hit counts, per-word hit counts, and so on. Each extra call takes time to reparse the query and look up index metadata. Additionally, some search operations such as iterative query refinement are difficult for SQL. If it is even possible to construct a SQL statement to produce the desired results, such SQL is usually suboptimal.

The XML Query Result Set Interface is able to produce the various kinds of data needed for a page of search results all at once, thus improving performance by sharing overhead. The Result Set Interface can also return data views that are difficult to express in SQL.

11.2 Using the XML Query Result Set Interface

The CTX_QUERY.RESULT_SET() and CTX_QUERY.RESULT_SET_CLOB_QUERY() APIs enable you to obtain query results with a single query, rather than running multiple CONTAINS() queries to achieve the same result. The two APIs are identical except that one uses a VARCHAR2 query parameter, and the other uses a CLOB query parameter to allow for longer queries.

For example, in order to display a search result page, the following information needs to be obtained first:

top 20 hit list sorted by date and relevancy
total number of hits for the given Text query
counts group by publication date
counts group by author

Assume the following table definition for storing documents to be searched:

create table docs (
  docid    number,
  author   varchar2(30),
  pubdate  date,
  title    varchar2(60),  doc      clob);

Assume the following Oracle Text Index definition:

create index docidx on docs(doc) indextype is ctxsys.context
filter by author, pubdate, title
order by pubdate;

With these definitions, you can obtain the four pieces of information for displaying the search result page by issuing four SQL statements:

-- Get top 20 hits sorted by date and relevancy
select * from
  (select /*+ first_rows */ rowid, title, author, pubdate
   from docs where contains(doc, 'oracle',1)>0
   order by pubdate desc, score(1) desc)
where rownum < 21;
 
-- Get total number of hits for the given Text query
select count(*) from docs where contains(doc, 'oracle',1)>0;
 
-- Get counts group by publication date
select pubdate, count(*) from docs where contains(doc, 'oracle',1)>0 
group by pubdate;
 
-- Get counts group by author
select author, count(*) from docs where contains(doc, 'oracle',1)>0 group by author;

As you can see, using separate SQL statements results in a resource-intensive query, as you run the same query four times. However, by using CTX_QUERY.RESULT_SET(), you can enter all this information in one single Oracle Text query:

declare
   rs clob;
begin
   dbms_lob.createtemporary(rs, true, dbms_lob.session);
   ctx_query.result_set('docidx', 'oracle text performance tuning', '
   <ctx_result_set_descriptor>  
    <count/>
    <hitlist start_hit_num="1" end_hit_num="20" order="pubDate desc, 
        score desc">
      <score/>
      <rowid/>
         <sdata name="title"/>
      <sdata name="author"/>
      <sdata name="pubDate"/>
    </hitlist>
    <group sdata="pubDate">
      <count/>
    </group>
   <group sdata="author">
     <count/>
   </group>
  </ctx_result_set_descriptor>
 ', rs);
 
-- Put in your code here to process the Output Result Set XML
   dbms_lob.freetemporary(rs);
exception
   when others then
    dbms_lob.freetemporary(rs);
    raise;
end;
/

The result set output will be XML containing all the necessary information required to construct the search result page:

<ctx_result_set>
  <hitlist>
    <hit>
      <score>90</score>
      <rowid>AAAPoEAABAAAMWsAAC</rowid>
      <sdata name="TITLE"> Article 8 </sdata>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>86</score>
      <rowid>AAAPoEAABAAAMWsAAG</rowid>
      <sdata name="TITLE"> Article 20 </sdata>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>78</score>
      <rowid>AAAPoEAABAAAMWsAAK</rowid>
      <sdata name="TITLE"> Article 17 </sdata>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>77</score>
      <rowid>AAAPoEAABAAAMWsAAO</rowid>
      <sdata name="TITLE"> Article 37 </sdata>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
...
    <hit>
      <score>72</score>
      <rowid>AAAPoEAABAAAMWsAAS</rowid>
      <sdata name="TITLE"> Article 56 </sdata>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
  </hitlist>
 
  <count>100</count>
 
  <groups sdata="PUBDATE">
    <group value="2001-01-01 00:00:00"><count>25</count></group>
    <group value="2001-01-02 00:00:00"><count>50</count></group>
    <group value="2001-01-03 00:00:00"><count>25</count></group>
  </groups>
 
  <groups sdata="AUTHOR">
    <group value="John"><count>50</count></group>
    <group value="Mike"><count>25</count></group>
    <group value="Steve"><count>25</count></group>
  </groups>
 
</ctx_result_set>

11.3 Creating XML-Only Applications with Oracle Text

Although it is common to create applications using SQL SELECT statements with the CONTAINS clause, this is not always the most efficient way of working. An alternative method is to use the XML-based Result Set Interface. The advantage of doing this is that it is easy to obtain summary information (such as the total number of hits) without fetching all the results of the query.

To use the Result Set Interface, you specify a Result Set Descriptor, or RSD. This declares the information which is to be returned. The information to be returned can consist of:

The total result count of the query
A hitlist
Summary information over SDATA fields

In turn, the hitlist consists of repeating elements, each of which may contain:

The rowid of the hit
SDATA fields from the hit

11.4 Example of a Result Set Descriptor

The example in this section shows how to use a Result Set Descriptor (or RSD). The following example requests a hitlist with the top 10 hits (ordered by score), plus the count of the total number of results.

<ctx_result_set_descriptor>
  <hitlist start_hit_num="1" end_hit_num="10" order="SCORE DESC">
    <rowid />
    <sdata name="title" />
    <sdata name="author" />
    <sdata name="articledate" />
    <snippet radius="20" max_length="160" starttag="&lt;b&gt;" endtag="&lt;/b&gt;" />
  </hitlist>
  <count />
</ctx_result_set_descriptor>

For each hit, you are requesting the rowid (which you could use to fetch further information about the row, if necessary), the contents of the SDATA fields or columns called title, author and articledate, and a snippet (which is a short summary with keywords highlighted, in this case by <b>...</b>).

11.5 Identifying Collocates Using Oracle Text

Collocates are a group of words that frequently co-occur in a document. They provide a quick summary of other keywords or concepts in a document that are related to a specified keyword. You can then use these other keywords in queries to fetch more relevant results.

You identify collocates based on a search query. For each document that is returned by the query, snippets of text around the search keyword are automatically extracted. Next, the words in these snippets are correlated to the query keyword using statistical measures and, depending on how frequently the extracted words occur in the overall document set, a score is assigned to each returned co-occurring word.

The Result Set Interface is used to identify collocates. You can specify the number of co-occurring words that must be returned by the query. You can also specify whether to identify collocates that are common nouns or collocates that emphasize uniqueness. Synonyms of the specified search keyword can also be returned.

Note:

Collocates are supported only for the basic lexer.

To identify collocates:

Create the table that will contain the document set for the query.
Create an Oracle Text index on the document set table.
Use the XML Query Result Set Interface (RSI) to define and input a query that identifies collocates. Include the collocates element with the required attributes.

Example 11-1 Identifying Collocates Within a Document Set

In this example, the keyword used to query documents in a data set is “Nobel”. Oracle Text searches for occurrences of this keyword in the document set. In addition to the result set, we use collocates to search for five common words that co-occur with “Nobel”. The number of collocates to be generated is identified by the max_words attribute. The use_tscore attribute is set to true to specify that common words must be identified for the collocates. The number of words to pick on either side of the keyword in order to identify collocates is 10.

The following is the input Result Set Interface descriptor that is used to determine collocates.

declare
rsd varchar2(32767);
 begin
  ctx_query.result_set('tdrbnbsan01idx', 'nobel',
  <ctx_result_set_descriptor>
  <collocates radius = "10" max_words="5" use_tscore="TRUE"/>
  </ctx_result_set_descriptor>',
  :rs);
  end;
/

The output result set for the above query is as follows:

<ctx_result_set>
<collocates>
    <collocation>
       <word>PRIZE</word>
       <score>82</score>
    </collocation>
    <collocation>
       <word>LAUREATE</word>
       <score>70</score>
    </collocation>
    <collocation>
       <word>NOBELPRIZE</word>
       <score>44</score>
    </collocation>
    <collocation>
       <word>AWARD</word>
       <score>42</score>
    </collocation>
    <collocation>
       <word>ORG</word>
       <score>41</score>
    </collocation
</collocates>
</ctx_result_set>

The top 5 common collocates, in order, for “Nobel” are: Prize, Laureate, Nobelprize, award, and org. Each word is assigned a score that indicates the frequency with which it occurred. Collocates are always returned after any hitlist elements are returned.

If you set use_tscore to FALSE in the same example, then less common (unique) words are identified. The output result set is as follows:

<ctx_result_set>
<collocates>   
    <collocation>
       <word>MOLA</word>
       <score>110</score>   
    </collocation>
    <collocation>
       <word>BISMARCK</word>
       <score>89</score>
    </collocation>
    <collocation>
       <word>COLONNA</word>
       <score>67</score>
    </collocation>
    <collocation>
       <word>LYNEN</word>
       <score>55</score>
    </collocation>
    <collocation>
       <word>TIMBERGEN</word>
       <score>25</score>
    </collocation>
    </collocates>
</ctx_result_set>