Oracle Text Application Developer's Guide Release 9.0.1 Part Number A90122-01
Document Section Searching, 2 of 4
Section searching enables you to narrow text queries down to blocks of text within documents. Section searching is useful when your documents have internal structure, such as HTML and XML documents.
You can also search for text at the sentence and paragraph level.
The steps for enabling section searching for your document collection are:
Section searching is enabled by defining section groups. You use one of the system-defined section groups to create an instance of a section group. Choose a section group appropriate for your document collection.
You use section groups to specify the type of document set you have and implicitly indicate the tag structure. For instance, to index HTML tagged documents, you use the HTML_SECTION_GROUP. Likewise, to index XML tagged documents, you can use the XML_SECTION_GROUP.
The following table lists the different types of section groups you can use:
You use the CTX_DDL package to create section groups and define sections as part of section groups. For example, to index HTML documents, create a section group with HTML_SECTION_GROUP:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); end;
You define sections as part of the section group. The following example defines a zone section called heading for all text within the HTML <H1> tag:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1'); end;
See Also:
"Section Types" in this chapter for more information about sections. "XML Section Searching" in this chapter for more information about section searching with XML.
When you index your documents, you specify your section group in the parameter clause of CREATE INDEX.
create index myindex on docs(htmlfile) indextype is ctxsys.context parameters('filter ctxsys.null_filter section group htmgroup');
When your documents are indexed, you can query within sections using the WITHIN operator. For example, to find all the documents that contain the word Oracle within their headings, issue the following query:
'Oracle WITHIN heading'
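In practice, a query string such as this one is passed to the CONTAINS operator in the WHERE clause of a SELECT statement. The following is a minimal sketch that assumes the docs table and the myindex index from the earlier examples. The id column is hypothetical and stands in for whatever key column your table has:

```sql
-- Sketch only: assumes docs has a key column named id (hypothetical) and that
-- myindex was created on docs(htmlfile) with the section group htmgroup.
SELECT SCORE(1) AS relevance, id
  FROM docs
 WHERE CONTAINS(htmlfile, 'Oracle WITHIN heading', 1) > 0
 ORDER BY SCORE(1) DESC;
```

The label 1 ties the SCORE call to the CONTAINS call, so rows come back ordered by relevance.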
When you use the PATH_SECTION_GROUP, the system automatically creates XML sections for you. In addition to using the WITHIN operator to issue queries, you can issue path queries with the INPATH and HASPATH operators.
See Also:
"XML Section Searching" to learn more about using these operators. Oracle Text Reference to learn more about using the INPATH operator.
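As an illustration of path queries, the following sketch creates a section group of the PATH_SECTION_GROUP type, indexes with it, and then issues one INPATH query and one HASPATH query. The names xmlpathgroup, xmldocs, xmlfile, id, and xmlindex, and the /BOOK/TITLE path, are hypothetical and chosen only for this example:

```sql
-- Sketch only: xmlpathgroup, xmldocs, xmlfile, id, and xmlindex are hypothetical names.
begin
  -- No sections are added by hand; the path section group records XML structure itself.
  ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP');
end;
/

create index xmlindex on xmldocs(xmlfile)
  indextype is ctxsys.context
  parameters ('section group xmlpathgroup');

-- Documents in which the word Oracle occurs in the element reached by /BOOK/TITLE.
SELECT id
  FROM xmldocs
 WHERE CONTAINS(xmlfile, 'Oracle INPATH (/BOOK/TITLE)') > 0;

-- Documents that contain the path /BOOK/TITLE, regardless of its text.
SELECT id
  FROM xmldocs
 WHERE CONTAINS(xmlfile, 'HASPATH (/BOOK/TITLE)') > 0;
```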
All section types are blocks of text in a document. However, sections can differ in the way they are delimited and the way they are recorded in the index. Sections can be one of the following:
A zone section is a body of text delimited by start and end tags in a document. The positions of the start and end tags are recorded in the index so that any words in between the tags are considered to be within the section. Any instance of a zone section must have a start and an end tag.
For example, the text between the <TITLE> and </TITLE> tags can be defined as a zone section as follows:
<TITLE>Tale of Two Cities</TITLE> It was the best of times...
Zone sections can nest, overlap, and repeat within a document.
When querying zone sections, you use the WITHIN operator to search for a term across all sections. Oracle returns those documents that contain the term within the defined section.
Zone sections are well suited for defining sections in HTML and XML documents. To define a zone section, use CTX_DDL.ADD_ZONE_SECTION.
For example, assume you define the section booktitle as follows:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_zone_section('htmgroup', 'booktitle', 'TITLE'); end;
After you index, you can search for all the documents that contain the term Cities within the section booktitle as follows:
'Cities WITHIN booktitle'
With multiple query terms such as (dog and cat) WITHIN booktitle, Oracle returns those documents that contain cat and dog within the same instance of a booktitle section.
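The two booktitle queries above might be issued as follows. This is a sketch that assumes a docs table with an htmlfile column indexed with the section group htmgroup, as in the earlier CREATE INDEX example. The id column is hypothetical:

```sql
-- Sketch only: the id column is a hypothetical key column on docs.

-- Documents with the term Cities inside a booktitle (TITLE) section.
SELECT id
  FROM docs
 WHERE CONTAINS(htmlfile, 'Cities WITHIN booktitle') > 0;

-- Documents with both dog and cat inside the same instance of a booktitle section.
SELECT id
  FROM docs
 WHERE CONTAINS(htmlfile, '(dog and cat) WITHIN booktitle') > 0;
```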
Zone sections can repeat. Each occurrence is treated as a separate section. For example, if <H1> denotes a heading section, it can repeat in the same document as follows:
<H1> The Brown Fox </H1> <H1> The Gray Wolf </H1>
Assuming that these zone sections are named Heading, the query Brown WITHIN Heading returns this document. However, a query of (Brown and Gray) WITHIN Heading does not, because the two terms occur in different instances of the section.
Zone sections can overlap each other. For example, if <B> and <I> denote two different zone sections, they can overlap in a document as follows:
plain <B> bold <I> bold and italic </B> only italic </I> plain
Zone sections can nest, including within themselves, as follows:
<TD> <TABLE><TD>nested cell</TD></TABLE></TD>
Using the WITHIN operator, you can write queries to search for text in sections within sections. For example, assume the BOOK1, BOOK2, and AUTHOR zone sections occur as follows in documents doc1 and doc2:
doc1:
<book1> <author>Scott Tiger</author> This is a cool book to read.</book1>
doc2:
<book2> <author>Scott Tiger</author> This is a great book to read.</book2>
Consider the nested query:
'Scott within author within book1'
This query returns only doc1.
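A setup that would support this nested query can be sketched as follows. It assumes an XML_SECTION_GROUP, and the names bookgroup, books, text, id, and bookindex are hypothetical. The source names only the BOOK1, BOOK2, and AUTHOR sections:

```sql
-- Sketch only: bookgroup, books, text, id, and bookindex are hypothetical names.
begin
  ctx_ddl.create_section_group('bookgroup', 'XML_SECTION_GROUP');
  -- One zone section per tag; the tag argument matches the tag as written in the documents.
  ctx_ddl.add_zone_section('bookgroup', 'book1', 'book1');
  ctx_ddl.add_zone_section('bookgroup', 'book2', 'book2');
  ctx_ddl.add_zone_section('bookgroup', 'author', 'author');
end;
/

create index bookindex on books(text)
  indextype is ctxsys.context
  parameters ('section group bookgroup');

-- Scott must occur in an author section that itself lies within a book1 section.
SELECT id
  FROM books
 WHERE CONTAINS(text, 'Scott within author within book1') > 0;
```

Both documents contain Scott within an author section, so it is the outer book1 condition that excludes doc2.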
A field section is similar to a zone section in that it is a region of text delimited by start and end tags. A field section is different from a zone section in that the region is indexed separately from the rest of the document.
Because field sections are indexed differently, they can give better query performance than zone sections when you have a large number of documents indexed.
Field sections are better suited to sections that occur only once in a document, such as a field in a news header. Field sections can also be made visible to the rest of the document.
Unlike zone sections, field sections have the following restrictions:
By default, field sections are indexed as a sub-document separate from the rest of the document. As such, field sections are invisible to the surrounding text and can only be queried by explicitly naming the section in the WITHIN clause.
You can make field sections visible if you want the text within the field section to be indexed as part of the enclosing document. Text within a visible field section can be queried with or without the WITHIN operator.
The following example shows the difference between using invisible and visible field sections.
The following code defines a section group basicgroup of the BASIC_SECTION_GROUP type. It then creates a field section in basicgroup called Author for the <A> tag. It also sets the visible flag to FALSE to create an invisible section:
begin ctx_ddl.create_section_group('basicgroup', 'BASIC_SECTION_GROUP'); ctx_ddl.add_field_section('basicgroup', 'Author', 'A', FALSE); end;
Because the Author field section is not visible, to find text within the Author section, you must use the WITHIN operator as follows:
'(Martin Luther King) WITHIN Author'
A query of Martin Luther King without the WITHIN operator does not return instances of this term in field sections. If you want to query text within field sections without specifying WITHIN, you must set the visible flag to TRUE when you create the section as follows:
begin ctx_ddl.add_field_section('basicgroup', 'Author', 'A', TRUE); end;
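To see the effect of the visible flag, you can index with basicgroup and compare the two forms of the query. The following is a sketch only. The table newsdocs, its columns id and text, and the index newsindex are hypothetical names:

```sql
-- Sketch only: newsdocs, id, text, and newsindex are hypothetical names.
create index newsindex on newsdocs(text)
  indextype is ctxsys.context
  parameters ('section group basicgroup');

-- Finds the phrase inside the Author section whether the section is visible or not.
SELECT id
  FROM newsdocs
 WHERE CONTAINS(text, '(Martin Luther King) WITHIN Author') > 0;

-- Finds the phrase inside <A>...</A> only if Author was added with the visible flag TRUE.
-- With the flag FALSE, only occurrences outside field sections match.
SELECT id
  FROM newsdocs
 WHERE CONTAINS(text, 'Martin Luther King') > 0;
```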
Field sections cannot be nested. For example, if you define a field section to start with <TITLE> and define another field section to start with <FOO>, the two sections cannot be nested as follows:
<TITLE> dog <FOO> cat </FOO> </TITLE>
To work with nested sections, define them as zone sections.
Repeated field sections are allowed, but WITHIN queries treat them as a single section. The following is an example of a repeated field section in a document:
<TITLE> cat </TITLE> <TITLE> dog </TITLE>
The query dog and cat within title returns the document, even though these words occur in different sections.
To have WITHIN queries distinguish repeated sections, define them as zone sections.
You can define attribute sections to query on XML attribute text. You can also have the system automatically define and index XML attributes for you.
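For example, an attribute section can be defined with CTX_DDL.ADD_ATTR_SECTION, which takes the tag and attribute in the form tag@attribute. The following sketch assumes an XML_SECTION_GROUP. The names myxmlgroup, booktitle, xmldocs, xmlfile, and id, and the BOOK element with its TITLE attribute, are hypothetical:

```sql
-- Sketch only: myxmlgroup, booktitle, xmldocs, xmlfile, and id are hypothetical names.
begin
  ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
  -- Third argument is tag@attribute: the TITLE attribute of the BOOK element.
  ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'BOOK@TITLE');
end;
/

-- After indexing with myxmlgroup, a document such as
--   <BOOK TITLE="Tale of Two Cities">It was the best of times.</BOOK>
-- can be found by searching the attribute text:
SELECT id
  FROM xmldocs
 WHERE CONTAINS(xmlfile, 'Cities WITHIN booktitle') > 0;
```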
Special sections are not recognized by tags. Currently, the only special sections supported are sentence and paragraph. These enable you to search for combinations of words within sentences or paragraphs.
To add a special section, use the CTX_DDL.ADD_SPECIAL_SECTION procedure. For example, the following code enables searching within sentences within HTML documents:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_special_section('htmgroup', 'SENTENCE'); end;
You can also add zone sections to the group to enable zone searching in addition to sentence searching. The following example adds the zone section Headline to the section group htmgroup:
begin ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP'); ctx_ddl.add_special_section('htmgroup', 'SENTENCE'); ctx_ddl.add_zone_section('htmgroup', 'Headline', 'H1'); end;
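After indexing with this section group, sentence and zone queries might look like the following sketch. It assumes the docs table and htmlfile column from the earlier CREATE INDEX example. The id column and the query terms are hypothetical:

```sql
-- Sketch only: the id column and the search terms are hypothetical.

-- Documents in which dog and cat occur in the same sentence.
SELECT id
  FROM docs
 WHERE CONTAINS(htmlfile, '(dog and cat) WITHIN SENTENCE') > 0;

-- Documents in which Oracle occurs inside a Headline (H1) zone section.
SELECT id
  FROM docs
 WHERE CONTAINS(htmlfile, 'Oracle WITHIN Headline') > 0;
```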
Copyright © 1996-2001, Oracle Corporation. All Rights Reserved.