Filter Objects

Use the filter objects to create preferences that determine how text is filtered for indexing. Filters allow word processor and formatted documents as well as plain text, HTML, and XML documents to be indexed.

For formatted documents, Oracle stores documents in their native format and uses filters to build temporary plain text or HTML versions of the documents. Oracle indexes the words derived from the plain text or HTML version of the formatted document.

To create a filter preference, you must use one of the following objects:

Filter Preference Object	Description
CHARSET_FILTER	Character set converting filter.
INSO_FILTER	Inso filter for filtering formatted documents.
NULL_FILTER	No filtering required. Use for indexing plain text, HTML, or XML documents.
USER_FILTER	User-defined filter to be used for custom filtering.
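
As a minimal sketch of the general pattern, a filter preference is created from one of these objects with CTX_DDL.CREATE_PREFERENCE and then named in the parameter string of CREATE INDEX. The preference name my_filter_pref, the table docs, the column text, and the index name docs_idx are hypothetical and chosen only for illustration:

```sql
-- Hypothetical names: my_filter_pref, docs, text, docs_idx.
begin
  -- Create a preference based on the NULL_FILTER object.
  ctx_ddl.create_preference('my_filter_pref', 'NULL_FILTER');
end;
/

-- Name the preference in the FILTER clause of the parameter string.
create index docs_idx on docs(text) indextype is ctxsys.context
  parameters ('filter my_filter_pref');
```

The same two steps apply to the other filter objects; only the object name and any attributes set on the preference change.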

CHARSET_FILTER

Use the CHARSET_FILTER to convert documents from a non-database character set to the database character set.

The CHARSET_FILTER has the following attribute:

Attribute	Attribute Value
charset	Specify the NLS name of the source character set. Specify JAAUTO for Japanese character set auto-detection. With JAAUTO, the filter automatically detects whether a document is in JA16EUC or JA16SJIS and converts it to the database character set. This setting is useful for Japanese when your data files have mixed character sets.

See Also:
For more information about the supported NLS character sets, see Oracle8i National Language Support Guide.
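
The following sketch shows one way the charset attribute might be set on a CHARSET_FILTER preference. The preference name cs_filter_pref is hypothetical, and JA16EUC is used only as an example source character set:

```sql
-- Hypothetical preference name: cs_filter_pref.
begin
  ctx_ddl.create_preference('cs_filter_pref', 'CHARSET_FILTER');
  -- Source character set of the documents; use 'JAAUTO' instead
  -- for Japanese character set auto-detection.
  ctx_ddl.set_attribute('cs_filter_pref', 'charset', 'JA16EUC');
end;
/
```

The preference can then be named after the FILTER keyword in the CREATE INDEX parameter string.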

Indexing Mixed-Character Set Columns

A mixed character-set column is one that stores documents of different character sets. For example, a Japanese text column might store documents in JA16EUC and JA16SJIS character sets.

To index a table of documents in different character-sets, you must create your base table with a character-set column. In this column, you specify the document character set on a per-row basis. To index the documents, Oracle converts the documents into the database character set.

Character-set conversion works with the CHARSET_FILTER. When the charset column is NULL or not recognized, the source character set is assumed to be the one specified in the charset attribute.

Note:
Character-set conversion also works with the INSO_FILTER when the document format column is set to TEXT.

Indexing Mixed-Character Set Example

For example, create the table with a charset column:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     cset varchar2(20),
     text varchar2(80)
);

Insert plain-text Japanese documents in EUC and Shift-JIS, and name the character set of each:

insert into hdocs values(1, 'text', 'JA16EUC', '/docs/tekusuto.euc');
insert into hdocs values(2, 'text', 'JA16SJIS', '/docs/tekusuto.sjs');

Create the index and name the charset column:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.charset_filter 
  format column fmt
  charset column cset');

INSO_FILTER

The Inso filter is a universal filter that filters most document formats. Use it for indexing single and mixed format columns. The INSO_FILTER has no attributes.

See Also:
For a list of the formats supported by INSO_FILTER and to learn more about how to set up your environment to use this filter, see Appendix C, "Supported Filter Formats".

Indexing Formatted Documents

To index a text column containing formatted documents such as Microsoft Word, use the INSO_FILTER. This filter automatically detects the document format. You can use the CTXSYS.INSO_FILTER system-defined preference in the parameter string as follows:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter');

Bypassing Plain Text or HTML in Mixed Format Columns

A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain-text, and HTML documents.

The INSO_FILTER can index mixed-format columns. However, you might want to have the INSO filter bypass the plain-text or HTML documents. Filtering plain-text or HTML with the INSO_FILTER is redundant.

The format column in the base table allows you to specify the type of document contained in the text column. The only two types you can specify are TEXT and BINARY. During indexing, the INSO_FILTER ignores any document typed TEXT (assuming the charset column is not specified).

To set up the INSO_FILTER bypass mechanism, you must create a format column in your base table.

For example:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     text varchar2(80)
);

Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. On the other hand, to have the INSO_FILTER ignore an HTML document, specify TEXT in the format column.

For example, the following statements add two documents to the table, assigning one format as BINARY and the other as TEXT:

insert into hdocs values(1, 'binary', '/docs/myword.doc');
insert into hdocs values(2, 'text', '/docs/index.html');

To create the index, use CREATE INDEX and specify the format column name in the parameter string:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter 
  format column fmt');

If you do not specify TEXT or BINARY for the format column, BINARY is used.

Note:
You need not specify the format column in CREATE INDEX when using the INSO_FILTER.

Character-Set Conversion With Inso

The INSO_FILTER converts documents to the database character-set when the document format column is set to TEXT. In this case, the INSO_FILTER looks at the charset column to determine the document character-set.

If the charset column value is not an Oracle character-set name, the document is passed through without any character-set conversion.
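
As a sketch of this setup, the statements below combine a format column and a charset column with the INSO_FILTER. The table name mixed_docs, the index name mixed_docsx, and the file paths are hypothetical; the row typed TEXT is converted from the character set named in cset, while the row typed BINARY is filtered as a formatted document:

```sql
-- Hypothetical table, index name, and file paths.
create table mixed_docs (
     id number primary key,
     fmt varchar2(10),
     cset varchar2(20),
     text varchar2(80)
);

-- A formatted document (filtered) and a plain-text document
-- (converted from the character set named in cset).
insert into mixed_docs values (1, 'binary', null, '/docs/myword.doc');
insert into mixed_docs values (2, 'text', 'WE8ISO8859P1', '/docs/readme.txt');

create index mixed_docsx on mixed_docs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore
  filter ctxsys.inso_filter
  format column fmt
  charset column cset');
```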

Note:
You need not specify the charset column when using the INSO_FILTER.

If you do specify the charset column and do not specify the format column, the INSO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character-set auto-detection.

See Also:
CHARSET_FILTER in this section.

Plain-Text Indexing and the INSO_FILTER

Oracle does not recommend using INSO_FILTER to index plain-text documents.

If your table contains text documents entirely, use the NULL_FILTER or the USER_FILTER.

If your table contains text documents mixed with formatted documents, Oracle recommends creating a format column and marking the text documents as TEXT to bypass INSO_FILTER. In such cases, Oracle also recommends creating a charset column to indicate the document character-set.

If, however, you use the INSO_FILTER to index non-binary (text) documents and you specify no format column and no charset column, the INSO_FILTER processes the document. Your indexing process is thus subject to the character-set limitations of Inso technology. Specifically, your application must ensure that one of the following conditions is true:

The document character set is the same as the database character set and the database character-set is one of the following:
- US7ASCII
- WE8ISO8859P1
- JA16SJIS
- KO16KSC5601
- ZHS16CGB231280
- ZHT16BIG5
Or the database character set is none of the ones listed above and the document is in the WE8ISO8859P1 character set.
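
To see which of these conditions applies, you can look up the database character set. The query below is a minimal sketch that reads it from the NLS_DATABASE_PARAMETERS data dictionary view; compare the returned value with the character sets listed above:

```sql
-- Returns the database character set, for example WE8ISO8859P1.
select value
  from nls_database_parameters
 where parameter = 'NLS_CHARACTERSET';
```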

NULL_FILTER

Use the NULL_FILTER object when plain text or HTML is to be indexed and no filtering needs to be performed. NULL_FILTER has no attributes.

Indexing HTML Documents

If your document set is entirely HTML, Oracle recommends that you do not use the INSO filter. Instead, use the NULL_FILTER in your filter preference.

For example, to index an HTML document set, you can specify the system-defined preferences for NULL_FILTER and HTML_SECTION_GROUP as follows:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
  parameters('filter ctxsys.null_filter
  section group ctxsys.html_section_group');

See Also:
For more information on indexing HTML documents, see "Section Group Types" in this chapter.

USER_FILTER

Use the USER_FILTER object to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:

Attribute	Attribute Values
command	filter executable

command

Specify the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats.

The executable you specify must go in the $ORACLE_HOME/ctx/bin directory. You must create your user-filter executable with two parameters: the first is the name of the input file to be read, and the second is the name of the output file to be written to.

If all the document formats are supported by the INSO_FILTER, use INSO_FILTER instead of USER_FILTER unless additional tasks besides filtering are required for the documents.

User Filter Example

The following example Perl script can be used as the user filter. This script converts the input text file named in the first argument to uppercase and writes the output to the file named in the second argument:

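#!/usr/local/bin/perl

# Open the input file (first argument) and the output file (second argument).
open(IN, $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
open(OUT, ">".$ARGV[1]) or die "Cannot write $ARGV[1]: $!";

# Convert each line to uppercase and write it to the output file.
while (<IN>)
{
  tr/a-z/A-Z/;
  print OUT;
}

close (IN);
close (OUT);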

Assuming that this file is named upcase.pl, create the filter preference as follows:

begin 
  ctx_ddl.create_preference 
    ( 
      preference_name => 'USER_FILTER_PREF', 
      object_name     => 'USER_FILTER' 
    ); 
  ctx_ddl.set_attribute
    ('USER_FILTER_PREF','COMMAND','upcase.pl');
end;

Create the index in SQL*Plus as follows:

create index user_filter_idx on user_filter ( docs ) 
  indextype is ctxsys.context 
  parameters ('FILTER USER_FILTER_PREF');