Oracle Text Application Developer's Guide
Release 9.2

Part Number A96517-01

6
Document Section Searching

This chapter describes how to use document sections in an Oracle Text query application.

The following topics are discussed in this chapter:

  • About Document Section Searching
  • HTML Section Searching
  • XML Section Searching

About Document Section Searching

Section searching enables you to narrow text queries down to blocks of text within documents. Section searching is useful when your documents have internal structure, such as HTML and XML documents.

You can also search for text at the sentence and paragraph level.

Enabling Section Searching

The steps for enabling section searching for your document collection are:

  1. Create a section group
  2. Define your sections
  3. Index your documents
  4. Section search with WITHIN, INPATH, or HASPATH operators

Create a Section Group

Section searching is enabled by defining section groups. You use one of the system-defined section groups to create an instance of a section group. Choose a section group appropriate for your document collection.

You use section groups to specify the type of document set you have and implicitly indicate the tag structure. For instance, to index HTML tagged documents, you use the HTML_SECTION_GROUP. Likewise, to index XML tagged documents, you can use the XML_SECTION_GROUP.

The following table lists the types of section groups you can use. Each section group preference is followed by its description:

NULL_SECTION_GROUP

This is the default. Use this group type when you define no sections or when you define only SENTENCE or PARAGRAPH sections.

BASIC_SECTION_GROUP

Use this group type for defining sections where the start and end tags are of the form <A> and </A>.

Note: This group type does not support input such as unbalanced parentheses, comment tags, and attributes. Use HTML_SECTION_GROUP for this type of input.

HTML_SECTION_GROUP

Use this group type for indexing HTML documents and for defining sections in HTML documents.

XML_SECTION_GROUP

Use this group type for indexing XML documents and for defining sections in XML documents.

AUTO_SECTION_GROUP

Use this group type to automatically create a zone section for each start-tag/end-tag pair in an XML document. The section names derived from XML tags are case-sensitive as in XML.

Attribute sections are created automatically for XML tags that have attributes. Attribute sections are named in the form tag@attribute.

Stop sections, empty tags, processing instructions, and comments are not indexed.

The following limitations apply to automatic section groups:

  • You cannot add zone, field or special sections to an automatic section group.
  • Automatic sectioning does not index XML document types (root elements). However, you can define stop sections with document type.
  • The length of the indexed tags including prefix and namespace cannot exceed 64 characters. Tags longer than this are not indexed.

PATH_SECTION_GROUP

Use this group type to index XML documents. It behaves like the AUTO_SECTION_GROUP.

The difference is that with this section group you can do path searching with the INPATH and HASPATH operators. Queries are also case-sensitive for tag and attribute names.

NEWS_SECTION_GROUP

Use this group for defining sections in newsgroup formatted documents according to RFC 1036.

You use the CTX_DDL package to create section groups and define sections as part of section groups. For example, to index HTML documents, create a section group with HTML_SECTION_GROUP:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
end;
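Several examples in this chapter create a section group named htmgroup. Creating a group under a name that is already in use raises an error, so before re-running an example you may want to drop the existing group first. A minimal sketch using CTX_DDL.DROP_SECTION_GROUP:

```sql
-- Drop the section group created above so that the name can be reused.
-- The call fails if no section group named htmgroup exists.
begin
  ctx_ddl.drop_section_group('htmgroup');
end;
/
```

Indexes that were already created with the group are a separate matter; this sketch only frees the preference name.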

Define Your Sections

You define sections as part of the section group. The following example defines a zone section called heading for all text within the HTML <H1> tag:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
end;

Note:

If you are using the AUTO_SECTION_GROUP or PATH_SECTION_GROUP to index an XML document collection, you need not explicitly define sections since the system does this for you during indexing.


See Also:

"Section Types" in this chapter for more information about sections.

"XML Section Searching" in this chapter for more information about section searching with XML.

Index Your Documents

When you index your documents, you specify your section group in the parameter clause of CREATE INDEX.

create index myindex on docs(htmlfile) indextype is ctxsys.context 
parameters('filter ctxsys.null_filter section group htmgroup');

Section Searching with WITHIN Operator

When your documents are indexed, you can query within sections using the WITHIN operator. For example, to find all the documents that contain the word Oracle within their headings, issue the following query:

'Oracle WITHIN heading'
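In SQL, the query expression is passed as the second argument of the CONTAINS operator. The following sketch assumes that the docs table also has an id column, which does not appear in the CREATE INDEX statement above:

```sql
-- The id column is an assumption made for illustration.
-- The label 1 ties SCORE(1) to this CONTAINS call.
SELECT SCORE(1), id
  FROM docs
 WHERE CONTAINS(htmlfile, 'Oracle WITHIN heading', 1) > 0
 ORDER BY SCORE(1) DESC;
```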

See Also:

Oracle Text Reference to learn more about using the WITHIN operator.

Path Searching with INPATH and HASPATH Operators

When you use the PATH_SECTION_GROUP, the system automatically creates XML sections for you. In addition to using the WITHIN operator to issue queries, you can issue path queries with the INPATH and HASPATH operators.

See Also:

"XML Section Searching" to learn more about using these operators.

Oracle Text Reference to learn more about using the INPATH operator.

Section Types

All section types are blocks of text in a document. However, sections can differ in the way they are delimited and the way they are recorded in the index. Sections can be one of the following: zone sections, field sections, attribute sections, or special sections.

Zone Section

A zone section is a body of text delimited by start and end tags in a document. The positions of the start and end tags are recorded in the index so that any words in between the tags are considered to be within the section. Any instance of a zone section must have a start and an end tag.

For example, the text between the <TITLE> and </TITLE> tags can be defined as a zone section as follows:

<TITLE>Tale of Two Cities</TITLE>
It was the best of times...

Zone sections can nest, overlap, and repeat within a document.

When querying zone sections, you use the WITHIN operator to search for a term across all sections. Oracle returns those documents that contain the term within the defined section.

Zone sections are well suited for defining sections in HTML and XML documents. To define a zone section, use CTX_DDL.ADD_ZONE_SECTION.

For example, assume you define the section booktitle as follows:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_zone_section('htmgroup', 'booktitle', 'TITLE');
end;

After you index, you can search for all the documents that contain the term Cities within the section booktitle as follows:

'Cities WITHIN booktitle'

With multiple query terms such as (dog and cat) WITHIN booktitle, Oracle returns those documents that contain cat and dog within the same instance of a booktitle section.

Repeated Zone Sections

Zone sections can repeat. Each occurrence is treated as a separate section. For example, if <H1> denotes a heading section, it can repeat in the same document as follows:

<H1> The Brown Fox </H1>
<H1> The Gray Wolf </H1>

Assuming that these zone sections are named Heading, the query Brown WITHIN Heading returns this document. However, a query of (Brown and Gray) WITHIN Heading does not, because the two terms occur in different instances of the section.
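The two queries can be written out in SQL as follows. This is a sketch that assumes the docs table and index from the earlier examples, a hypothetical id column, and a zone section named Heading defined on the H1 tag:

```sql
-- Returns the document: Brown occurs inside one Heading instance.
SELECT id FROM docs
 WHERE CONTAINS(htmlfile, 'Brown WITHIN Heading') > 0;

-- Does not return the document: Brown and Gray occur in different
-- Heading instances, and both terms must fall within the same one.
SELECT id FROM docs
 WHERE CONTAINS(htmlfile, '(Brown and Gray) WITHIN Heading') > 0;
```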

Overlapping Zone Sections

Zone sections can overlap each other. For example, if <B> and <I> denote two different zone sections, they can overlap in a document as follows:

plain <B> bold <I> bold and italic </B> only italic </I>  plain

Nested Zone Sections

Zone sections can nest, including within themselves, as follows:

<TD> <TABLE><TD>nested cell</TD></TABLE></TD>

Using the WITHIN operator, you can write queries to search for text in sections within sections. For example, assume the BOOK1, BOOK2, and AUTHOR zone sections occur as follows in documents doc1 and doc2:

doc1:

<book1> <author>Scott Tiger</author> This is a cool book to read.</book1>

doc2:

<book2> <author>Scott Tiger</author> This is a great book to read.</book2>

Consider the nested query:

'Scott within author within book1'

This query returns only doc1.
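One way to set up the sections for this example is sketched below. The group name bookgroup and the table bookdocs(id, text) are hypothetical, and the tag names are taken from the sample documents:

```sql
-- Define one zone section per tag used in doc1 and doc2.
begin
  ctx_ddl.create_section_group('bookgroup', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_zone_section('bookgroup', 'book1', 'book1');
  ctx_ddl.add_zone_section('bookgroup', 'book2', 'book2');
  ctx_ddl.add_zone_section('bookgroup', 'author', 'author');
end;
/

-- Index the hypothetical bookdocs table with the section group.
CREATE INDEX bookindex ON bookdocs(text) INDEXTYPE IS ctxsys.context
  PARAMETERS ('section group bookgroup');

-- Nested section query: Scott inside author, which is inside book1.
SELECT id FROM bookdocs
 WHERE CONTAINS(text, 'Scott WITHIN author WITHIN book1') > 0;
```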

Field Section

A field section is similar to a zone section in that it is a region of text delimited by start and end tags. A field section differs from a zone section in that the region is indexed separately from the rest of the document.

Because field sections are indexed differently, they can also give better query performance than zone sections when you have a large number of documents indexed.

Field sections are better suited to a section that occurs once in a document, such as a field in a news header. Field sections can also be made visible to the rest of the document.

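Unlike zone sections, field sections cannot nest, and repeated field sections are treated as a single section by WITHIN queries. These restrictions are described later in this section.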

Visible and Invisible Field Sections

By default, field sections are indexed as a sub-document separate from the rest of the document. As such, field sections are invisible to the surrounding text and can only be queried by explicitly naming the section in the WITHIN clause.

You can make field sections visible if you want the text within the field section to be indexed as part of the enclosing document. Text within a visible field section can be queried with or without the WITHIN operator.

The following example shows the difference between using invisible and visible field sections.

The following code defines a section group basicgroup of the BASIC_SECTION_GROUP type. It then creates a field section in basicgroup called Author for the <A> tag. It also sets the visible flag to FALSE to create an invisible section:

begin
ctx_ddl.create_section_group('basicgroup', 'BASIC_SECTION_GROUP');
ctx_ddl.add_field_section('basicgroup', 'Author', 'A', FALSE);
end;

Because the Author field section is not visible, to find text within the Author section, you must use the WITHIN operator as follows:

'(Martin Luther King) WITHIN Author'

A query of Martin Luther King without the WITHIN operator does not return instances of this term in field sections. If you want to query text within field sections without specifying WITHIN, you must set the visible flag to TRUE when you create the section as follows:

begin
ctx_ddl.add_field_section('basicgroup', 'Author', 'A', TRUE);
end;
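The effect of the visible flag shows up at query time. The following sketch assumes a hypothetical table newsdocs(id, text) that is indexed with basicgroup:

```sql
-- Works whether the Author field section is visible or invisible.
SELECT id FROM newsdocs
 WHERE CONTAINS(text, '(Martin Luther King) WITHIN Author') > 0;

-- Matches text inside the Author section only when the section
-- was added with the visible flag set to TRUE.
SELECT id FROM newsdocs
 WHERE CONTAINS(text, 'Martin Luther King') > 0;
```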

Nested Field Sections

Field sections cannot be nested. For example, if you define a field section to start with <TITLE> and define another field section to start with <FOO>, the two sections cannot be nested as follows:

<TITLE> dog <FOO> cat </FOO> </TITLE>

To work with nested sections, define them as zone sections.

Repeated Field Sections

Repeated field sections are allowed, but WITHIN queries treat them as a single section. The following is an example of a repeated field section in a document:

<TITLE> cat </TITLE>
<TITLE> dog </TITLE>

The query (dog and cat) within title returns the document, even though these words occur in different sections.

To have WITHIN queries distinguish repeated sections, define them as zone sections.

Attribute Section

You can define attribute sections to query on XML attribute text. You can also have the system automatically define and index XML attributes for you.

See Also:

"XML Section Searching" in this chapter.

Special Sections

Special sections are not recognized by tags. Currently the only special sections supported are sentence and paragraph. This enables you to search for combinations of words within sentences or paragraphs.

The sentence and paragraph boundaries are determined by the lexer. For example, the BASIC_LEXER recognizes sentence and paragraph section boundaries as follows:

Table 6-1

Special Section    Boundary
SENTENCE           WORD/PUNCT/WHITESPACE
                   WORD/PUNCT/NEWLINE
PARAGRAPH          WORD/PUNCT/NEWLINE/WHITESPACE
                   WORD/PUNCT/NEWLINE/NEWLINE

If the lexer cannot recognize the boundaries, no sentence or paragraph sections are indexed.

To add a special section, use the CTX_DDL.ADD_SPECIAL_SECTION procedure. For example, the following code enables searching within sentences within HTML documents:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_special_section('htmgroup', 'SENTENCE');
end;

You can also add zone sections to the group to enable zone searching in addition to sentence searching. The following example adds the zone section Headline to the section group htmgroup:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_special_section('htmgroup', 'SENTENCE');
ctx_ddl.add_zone_section('htmgroup', 'Headline', 'H1');
end;
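After the documents are indexed with this section group, sentence and zone searches might look like the following sketch. The id column and the search terms are assumptions made for illustration:

```sql
-- Both terms must occur within the same sentence.
SELECT id FROM docs
 WHERE CONTAINS(htmlfile, '(fox and dog) WITHIN SENTENCE') > 0;

-- Zone section search against the Headline section.
SELECT id FROM docs
 WHERE CONTAINS(htmlfile, 'fox WITHIN Headline') > 0;
```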

HTML Section Searching

HTML has internal structure in the form of tagged text, which you can use for section searching. For example, you can define a section called headings for the <H1> tag. This allows you to search for terms only within these tags across your document set.

To query, you use the WITHIN operator. Oracle returns all documents that contain your query term within the headings section. Thus, to find all documents that contain the word oracle within headings, issue the following query:

'oracle within headings'

Creating HTML Sections

The following code defines a section group called htmgroup of type HTML_SECTION_GROUP. It then creates a zone section in htmgroup called heading identified by the <H1> tag:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
end;

You can then index your documents as follows:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
parameters('filter ctxsys.null_filter section group htmgroup');

After indexing with section group htmgroup, you can query within the heading section by issuing a query as follows:

'Oracle WITHIN heading'

Searching HTML Meta Tags

With HTML documents you can also create sections for NAME/CONTENT pairs in <META> tags. When you do so you can limit your searches to text within CONTENT.

Example: Creating Sections for <META> Tags

Consider an HTML document that has a META tag as follows:

<META NAME="author" CONTENT="ken">

To create a zone section that indexes all CONTENT attributes for the META tag whose NAME value is author:

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
ctx_ddl.add_zone_section('htmgroup', 'author', 'meta@author');
end;

After indexing with section group htmgroup, you can query the document as follows:

'ken WITHIN author'

XML Section Searching

Like HTML documents, XML documents have tagged text which you can use to define blocks of text for section searching. The contents of a section can be searched on with the WITHIN or INPATH operators.

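For XML searching, you can do the following:

  • Automatic sectioning
  • Attribute searching
  • Creating document type sensitive sections
  • Path section searching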

Automatic Sectioning

You can set up your indexing operation to automatically create sections from XML documents using the section group AUTO_SECTION_GROUP. The system creates zone sections for XML tags. Attribute sections are created for the tags that have attributes, and these sections are named in the form tag@attribute.

For example, the following command creates the index myindex on a column containing the XML files using the AUTO_SECTION_GROUP:

CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS 
('datastore ctxsys.default_datastore filter ctxsys.null_filter section group 
ctxsys.auto_section_group');
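Instead of the system-defined ctxsys.auto_section_group, you can create your own automatic section group, for example to define stop sections. The following sketch uses the hypothetical names myautogroup and myautoindex, and a hypothetical comment tag:

```sql
begin
  ctx_ddl.create_section_group('myautogroup', 'AUTO_SECTION_GROUP');
  -- Stop section: no section is created for the <comment> tag.
  ctx_ddl.add_stop_section('myautogroup', 'comment');
end;
/

CREATE INDEX myautoindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS
('datastore ctxsys.default_datastore filter ctxsys.null_filter section group
myautogroup');
```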

Attribute Searching

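You can search XML attribute text in one of two ways:

  • Create attribute sections in an XML_SECTION_GROUP and query them with the WITHIN operator.
  • Index with the PATH_SECTION_GROUP and search attribute text with the INPATH operator.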

Creating Attribute Sections

Consider an XML file that defines the BOOK tag with a TITLE attribute as follows:

<BOOK TITLE="Tale of Two Cities"> 
  It was the best of times. 
</BOOK> 

To define the title attribute as an attribute section, create an XML_SECTION_GROUP and define the attribute section as follows:

begin
ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title');
end;

To index:

CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS 
('datastore ctxsys.default_datastore filter ctxsys.null_filter section group 
myxmlgroup');

You can query the XML attribute section booktitle as follows:

'Cities within booktitle'

Searching Attributes with the INPATH Operator

You can search attribute text with the INPATH operator. To do so, you must index your XML document set with the PATH_SECTION_GROUP.

See Also:

"Path Section Searching" in this chapter.

Creating Document Type Sensitive Sections

Suppose you have an XML document set that contains the <book> tag declared for different document types, and you want to create a distinct book section for each document type.

Assume that mydocname1 is declared as an XML document type (root element) as follows:

<!DOCTYPE mydocname1 ... [...

Within mydocname1, the element <book> is declared. For this tag, you can create a section named mybooksec1 that is sensitive to the tag's document type as follows:

begin
ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec1', 'mydocname1(book)');
end;

Assume that mydocname2 is declared as another XML document type (root element) as follows:

<!DOCTYPE mydocname2 ... [...

Within mydocname2, the element <book> is declared. For this tag, you can create a section named mybooksec2 that is sensitive to the tag's document type as follows:

begin
-- myxmlgroup already exists from the previous example, so only add the new section
ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec2', 'mydocname2(book)');
end;

To query within the section mybooksec1, use WITHIN as follows:

'oracle within mybooksec1'

Path Section Searching

XML documents can have parent-child tag structures such as the following:

<A> <B> <C> dog </C> </B> </A>

In this example, tag C is a child of tag B which is a child of tag A.

With Oracle Text, you can do path searching with PATH_SECTION_GROUP. This section group allows you to specify direct parentage in queries, such as to find all documents that contain the term dog in element C which is a child of element B and so on.

With PATH_SECTION_GROUP, you can also perform attribute value searching and attribute equality testing.

The operators associated with this feature are INPATH and HASPATH.

Creating Index with PATH_SECTION_GROUP

To enable path section searching, index your XML document set with PATH_SECTION_GROUP.

Create the preference:

begin
ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP');
end;

Create the index:

CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS 
('datastore ctxsys.default_datastore filter ctxsys.null_filter section group 
xmlpathgroup');

After you create the index, you can use the INPATH and HASPATH operators.

Top-Level Tag Searching

To find all documents that contain the term dog in the top-level tag <A>:

dog INPATH (/A)

or

dog INPATH(A)

Any-Level Tag Searching

To find all documents that contain the term dog in the <A> tag at any level:

dog INPATH(//A)

This query finds the following documents:

<A>dog</A>

and

<C><B><A>dog</A></B></C>

Direct Parentage Searching

To find all documents that contain the term dog in a B element that is a direct child of a top-level A element:

dog INPATH(A/B)

This query finds the following XML document:

<A><B>My dog is friendly.</B></A>

but does not find:

<C><B>My dog is friendly.</B></C>
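The path queries above can be issued through CONTAINS like any other query expression. This sketch assumes that xmldocs is indexed with xmlpathgroup and has a hypothetical id column:

```sql
-- dog in the top-level A element
SELECT id FROM xmldocs WHERE CONTAINS(xmlfile, 'dog INPATH(/A)') > 0;

-- dog in an A element at any level
SELECT id FROM xmldocs WHERE CONTAINS(xmlfile, 'dog INPATH(//A)') > 0;

-- dog in a B element that is a direct child of a top-level A element
SELECT id FROM xmldocs WHERE CONTAINS(xmlfile, 'dog INPATH(A/B)') > 0;
```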

Tag Value Testing

You can test the value of tags. For example, the query:

dog INPATH(A[B="dog"])

finds the following document:

<A><B>dog</B></A>

but does not find:

<A><B>My dog is friendly.</B></A>

Attribute Searching

You can search the content of attributes. For example, the query:

dog INPATH(//A/@B)

finds the document:

<C><A  B="snoop dog"> </A> </C>

Attribute Value Testing

You can test the value of attributes. For example, the query:

California INPATH (//A[@B = "home address"])

finds the document:

<A B="home address">San Francisco, California, USA</A>

but does not find:

<A B="work address">San Francisco, California, USA</A>

Path Testing

You can test if a path exists with the HASPATH operator. For example, the query:

HASPATH(A/B/C)

finds and returns a score of 100 for the document

<A><B><C>dog</C></B></A>

without the query having to reference dog at all.

Section Equality Testing with HASPATH

You can use the HASPATH operator to do section equality tests. For example, the following query:

dog INPATH(A)

finds

<A>dog</A>

but it also finds

<A>dog park</A>

To limit the query to the term dog and nothing else, you can use a section equality test with the HASPATH operator. For example,

HASPATH(A="dog")

finds and returns a score of 100 only for the first document, and not the second.
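In SQL, the double quotes used for value tests can appear as-is inside the single-quoted query string. The following sketch again assumes that xmldocs is indexed with a PATH_SECTION_GROUP and has a hypothetical id column:

```sql
-- Path test: the label 1 ties SCORE(1) to this CONTAINS call.
SELECT id, SCORE(1)
  FROM xmldocs
 WHERE CONTAINS(xmlfile, 'HASPATH(A/B/C)', 1) > 0;

-- Section equality test: the A element must contain dog and nothing else.
SELECT id FROM xmldocs
 WHERE CONTAINS(xmlfile, 'HASPATH(A="dog")') > 0;

-- Attribute value test with INPATH.
SELECT id FROM xmldocs
 WHERE CONTAINS(xmlfile, 'California INPATH(//A[@B = "home address"])') > 0;
```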

See Also:

Oracle Text Reference to learn more about using the INPATH and HASPATH operators.



Copyright © 2000, 2002 Oracle Corporation.

All Rights Reserved.