Indexing, 4 of 11

Filter Types

Use the filter types to create preferences that determine how text is filtered for indexing. Filters allow word processor and formatted documents as well as plain text, HTML, and XML documents to be indexed.

For formatted documents, Oracle stores documents in their native format and uses filters to build temporary plain text or HTML versions of the documents. Oracle indexes the words derived from the plain text or HTML version of the formatted document.

To create a filter preference, you must use one of the following types:

Filter Preference type	Description
CHARSET_FILTER	Character set converting filter
INSO_FILTER	Inso filter for filtering formatted documents
NULL_FILTER	No filtering required. Use for indexing plain text, HTML, or XML documents
USER_FILTER	User-defined external filter to be used for custom filtering
PROCEDURE_FILTER	User-defined stored procedure filter to be used for custom filtering.

CHARSET_FILTER

Use the CHARSET_FILTER to convert documents from a non-database character set to the database character set.

CHARSET_FILTER has the following attribute:

Attribute	Attribute Value
charset	Specify the NLS name of source character set. If you specify UTF16AUTO, this filter automatically detects the if the character set is UTF16 big- or little-endian. Specify JAAUTO for Japanese character set auto-detection. This filter automatically detects the custom character specification in JA16EUC or JA16SJIS and converts to the database character set. This filter is useful in Japanese when your data files have mixed character sets.

Attribute

Attribute Value

charset

Specify the NLS name of source character set.

If you specify UTF16AUTO, this filter automatically detects the if the character set is UTF16 big- or little-endian.

Specify JAAUTO for Japanese character set auto-detection. This filter automatically detects the custom character specification in JA16EUC or JA16SJIS and converts to the database character set. This filter is useful in Japanese when your data files have mixed character sets.

See Also:

Oracle9i Globalization and National Language Support Guide for more information about the supported NLS character sets.

UTF-16 Big- and Little-Endian Detection

If your character set is UTF-16, you can specify UTF16AUTO to automatically detect big- or little-endian data. Oracle does so by examining the first two bytes of the document row.

If the first two bytes are 0xFE, 0xFF, the document is recognized as little-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are 0xFF, 0xFE, the document is recognized as big-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are anything else, the document is assumed to be big-endian and the whole document including the first two bytes is passed on for indexing.

Indexing Mixed-Character Set Columns

A mixed character set column is one that stores documents of different character sets. For example, a text table might store some documents in WE8ISO8859P1 and others in UTF8.

To index a table of documents in different character sets, you must create your base table with a character set column. In this column, you specify the document character set on a per-row basis. To index the documents, Oracle converts the documents into the database character set.

Character set conversion works with the CHARSET_FILTER. When the charset column is NULL or not recognized, Oracle assumes the source character set is the one specified in the charset attribute.

Note:

Character set conversion also works with the INSO_FILTER when the document format column is set to TEXT.

Indexing Mixed-Character Set Example

For example, create the table with a charset column:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     cset varchar2(20),
     text varchar2(80)
);

Insert plain-text documents and name the character set:

insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt');
insert in hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt');
commit;

Create the index and name the charset column:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.charset_filter 
  format column fmt
  charset column cset');

INSO_FILTER

The Inso filter is a universal filter that filters most document formats. This filtering technology is licensed from Inso Chicago Corporation.

Use it for indexing single and mixed-format columns.

See Also:

For a list of the formats supported by INSO_FILTER and to learn more about how to set up your environment to use this filter, see Appendix B, "Supported Document Formats".

The INSO_FILTER has the following attribute:

Attribute	Attribute Values
timeout	Specify the INSO_FILTER timeout in seconds. Use a number between 0 and 42,949,672. Default is 120. Setting this value 0 disables the feature. This value is the time Oracle waits for the INSO_FILTER to start writing filtered output for a document row. If this time is reached and the INSO_FILTER has not started to write output, the indexing operation terminates for the document row and an error is recorded in the CTX_USER_INDEX_ERRORS view. Oracle moves to the next document row to be indexed. This feature is disabled for rows for which the corresponding charset and format column cause the INSO_FILTER to bypass the row, such as when format is marked TEXT. Use this feature to prevent the Oracle indexing operation from waiting indefinitely on a hanging Inso filter operation.

Attribute

Attribute Values

timeout

Specify the INSO_FILTER timeout in seconds. Use a number between 0 and 42,949,672. Default is 120. Setting this value 0 disables the feature.

This value is the time Oracle waits for the INSO_FILTER to start writing filtered output for a document row. If this time is reached and the INSO_FILTER has not started to write output, the indexing operation terminates for the document row and an error is recorded in the CTX_USER_INDEX_ERRORS view. Oracle moves to the next document row to be indexed.

This feature is disabled for rows for which the corresponding charset and format column cause the INSO_FILTER to bypass the row, such as when format is marked TEXT.

Use this feature to prevent the Oracle indexing operation from waiting indefinitely on a hanging Inso filter operation.

Indexing Formatted Documents

To index a text column containing formatted documents such as Microsoft Word, use the INSO_FILTER. This filter automatically detects the document format. You can use the CTXSYS.INSO_FILTER system-defined preference in the parameter clause as follows:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter');

Bypassing Plain Text or HTML in Mixed Format Columns

A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain text, and HTML documents.

The INSO_FILTER can index mixed-format columns. However, you might want to have the INSO filter bypass the plain text or HTML documents. Filtering plain text or HTML with the INSO_FILTER is redundant.

The format column in the base table allows you to specify the type of document contained in the text column. The only two types you can specify are TEXT and BINARY. During indexing, the INSO_FILTER ignores any document typed TEXT (assuming the charset column is not specified.)

To set up the INSO_FILTER bypass mechanism, you must create a format column in your base table.

For example:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     text varchar2(80)
);

Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. Alternatively, to have the INSO_FILTER ignore an HTML document, specify TEXT in the format column.

For example, the following statements add two documents to the text table, assigning one format as BINARY and the other TEXT:

insert into hdocs values(1, 'binary', '/docs/myword.doc');
insert in hdocs values (2, 'text', '/docs/index.html');
commit;

To create the index, use CREATE INDEX and specify the format column name in the parameter string:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.inso_filter 
  format column fmt');

If you do not specify TEXT or BINARY for the format column, BINARY is used.

Note:

You need not specify the format column in CREATE INDEX when using the INSO_FILTER.

Character Set Conversion With Inso

The INSO_FILTER converts documents to the database character set when the document format column is set to TEXT. In this case, the INSO_FILTER looks at the charset column to determine the document character set.

If the charset column value is not an Oracle character set name, the document is passed through without any character set conversion.

Note:

You need not specify the charset column when using the INSO_FILTER.

If you do specify the charset column and do not specify the format column, the INSO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character set auto-detection.

See Also:

"CHARSET_FILTER".

Plain Text Indexing and the INSO_FILTER

Oracle does not recommend using INSO_FILTER to index plain text documents.

If your table contains text documents exclusively, use the NULL_FILTER or the USER_FILTER.

If your table contains text documents mixed with formatted documents, Oracle recommends creating a format column and marking the text documents as TEXT to bypass INSO_FILTER. In such cases, Oracle also recommends creating a charset column to indicate the document character set.

However, if you use INSO_FILTER to index nonbinary (text) documents and you specify no format column and no charset column, the INSO_FILTER processes the document. Your indexing process is thus subject to the character set limitations of Inso technology. Specifically, your application must ensure that one of the following conditions is true:

The document character set is the same as the database character set and the database character set is one of the following:
- US7ASCII
- WE8ISO8859P1
- JA16SJIS
- KO16KSC5601
- ZHS16CGB231280
- ZHT16BIG5
The database character set is none of those listed above and the document is in the WE8ISO8859P1 character set.

NULL_FILTER

Use the NULL_FILTER type when plain text or HTML is to be indexed and no filtering needs to be performed. NULL_FILTER has no attributes.

Indexing HTML Documents

If your document set is entirely HTML, Oracle recommends that you use the NULL_FILTER in your filter preference.

For example, to index an HTML document set, you can specify the system-defined preferences for NULL_FILTER and HTML_SECTION_GROUP as follows:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
  parameters('filter ctxsys.null_filter
  section group ctxsys.html_section_group');

See Also:

For more information on section groups and indexing HTML documents, see "Section Group Types" in this chapter.

USER_FILTER

Use the USER_FILTER type to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:

Attribute	Attribute Values
command	Specify the name of the filter executable.

command

Specify the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats.

The executable you specify must exist in the $ORACLE_HOME/ctx/bin directory. You must create your user-filter executable with two parameters: the first is the name of the input file to be read, and the second is the name of the output file to be written to.

If all the document formats are supported by INSO_FILTER, use INSO_FILTER instead of USER_FILTER unless additional tasks besides filtering are required for the documents.

User Filter Example

The following example perl script to be used as the user filter. This script converts the input text file specified in the first argument to uppercase and writes the output to the location specified in the second argument:

#!/usr/local/bin/perl

open(IN, $ARGV[0]);
open(OUT, ">".$ARGV[1]);

while (<IN>)
{
  tr/a-z/A-Z/;
  print OUT;
}

close (IN);
close (OUT);

Assuming that this file is named upcase.pl, create the filter preference as follows:

begin 
  ctx_ddl.create_preference 
    ( 
      preference_name => 'USER_FILTER_PREF', 
      object_name     => 'USER_FILTER' 
    ); 
  ctx_ddl.set_attribute
    ('USER_FILTER_PREF','COMMAND','upcase.pl');
end;

Create the index in SQL*Plus as follows:

create index user_filter_idx on user_filter ( docs ) 
  indextype is ctxsys.context 
  parameters ('FILTER USER_FILTER_PREF');

PROCEDURE_FILTER

Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called each time a document needs to be filtered.

This type has the following attributes:

Attribute	Purpose	Allowable Values
procedure	Name of the filter stored procedure.	Any CTXSYS owned procedure. The procedure can be a PL/SQL or Java stored procedure.
input_type	Type of input argument for stored procedure.	VARCHAR2, BLOB, CLOB, FILE
output_type	Type of output argument for stored procedure.	VARCHAR2, CLOB, FILE
rowid_parameter	Include rowid parameter?	TRUE/FALSE
format_parameter	Include format parameter?	TRUE/FLASE
charset_parameter	Include charset parameter?	TRUE/FLASE

procedure

Specify the name of the stored procedure to use for filtering. The procedure can be a PL/SQL or Java stored procedure. The procedure can be a safe callout or call a safe callout.

The procedure must be owned by CTXSYS and have one of the following signatures:

PROCEDURE(IN BLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN CLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN VARCHAR, IN OUT NOCOPY CLOB)
PROCEDURE(IN BLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN CLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN VARCHAR2, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN BLOB, IN VARCHAR2)
PROCEDURE(IN CLOB, IN VARCHAR2)
PROCEDURE(IN VARCHAR2, IN VARCHAR2)

The first argument is the content of the unfiltered row as passed out by the datastore. The second argument is for the procedure to pass back the filtered document text.

The procedure attribute is mandatory and has no default.

input_type

Specify the type of the input argument of the filter procedure. You can specify one of the following:

Type	Description
BLOB	The input argument is of type BLOB. The unfiltered document is contained in the BLOB passed in.
CLOB	The input argument is of type CLOB. The unfiltered document is contained in the CLOB passed in. No pre-filtering or character set conversion is done. If the datastore outputs binary data, that binary data is written directly to the CLOB, with NLS doing implicit mapping to character data as best it can.
VARCHAR2	The input argument is of type VARCHAR2. The unfiltered document is contained in the VARCHAR2 passed in. The document can be a maximum of 32767 bytes of data. If the unfiltered document is greater than this length, an error is raised for the document and the filter procedure is not called.
FILE	The input argument is of type VARCHAR2. The unfiltered document content is contained in a temporary file in the file system whose filename is stored in the VARCHAR2 passed in. For example, the value of the passed-in VARCHAR2 might be 'tmp/mydoc.tmp' which means that the document content is stored in the file '/tmp/mydoc.tmp'. The file input type is useful only when your procedure is a safe callout, which can read the file.

The input_type attribute s not mandatory. If not specified, BLOB is the default.

output_type

Specify the type of output argument of the filter procedure. You can specify one of the following types:

Type	Description
CLOB	The output argument is IN OUT NOCOPY CLOB. Your procedure must write the filtered content to the CLOB passed in.
VARCHAR2	The output argument is IN OUT NOCOPY VARCHAR2. Your procedure must write the filtered content to the VARCHAR2 variable passed in.
FILE	The output argument must be IN VARCHAR2. On entering the filter procedure, the output argument is the name of a temporary file. The filter procedure must write the filtered contents to this named file. Using a FILE output type is useful only when the procedure is a safe callout, which can write to the file.

Type

Description

CLOB

The output argument is IN OUT NOCOPY CLOB. Your procedure must write the filtered content to the CLOB passed in.

VARCHAR2

The output argument is IN OUT NOCOPY VARCHAR2. Your procedure must write the filtered content to the VARCHAR2 variable passed in.

FILE

The output argument must be IN VARCHAR2. On entering the filter procedure, the output argument is the name of a temporary file. The filter procedure must write the filtered contents to this named file.

Using a FILE output type is useful only when the procedure is a safe callout, which can write to the file.

The output_type attribute is not mandatory. If not specified, CLOB is the default.

rowid_ parameter

When you specify TRUE, the rowid of the document to be filtered is passed as the first parameter, before the input and output parameters.

For example, with INPUT_TYPE BLOB, OUTPUT_TYPE CLOB, and ROWID_PARAMETER TRUE, the filter procedure must have the signature as follows:

procedure(in rowid, in blob, in out nocopy clob)

This attribute is useful for when your procedure requires data from other columns or tables. This attribute is not mandatory. The default is FALSE.

format_parameter

When you specify TRUE, the value of the format column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid parameter, if enabled.

You specify the name of the format column at index time in the parameters string, using the keyword 'format column <columnname>'. The parameter type must be IN VARCHAR2.

The format column value can be read via the rowid parameter, but this attribute allows a single filter to work on multiple table structures, because the format attribute is abstracted and does not require the knowledge of the name of the table or format column.

FORMAT_PARAMETER is not mandatory. The default is FALSE.

charset_parameter

When you specify TRUE, the value of the charset column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid and format parameter, if enabled.

You specify the name of the charset column at index time in the parameters string, using the keyword 'charset column <columnname>'. The parameter type must be IN VARCHAR2.

CHARSET_PARAMETER attribute is not mandatory. The default is FALSE.

Parameter Order

ROWID_PARAMETER, FORMAT_PARAMETER, and CHARSET_PARAMETER are all independent. The order is rowid, the format, then charset, but the filter procedure is passed only the minimum parameters required.

For example, assume that INPUT_TYPE is BLOB and OUTPUT_TYPE is CLOB. If your filter procedure requires all parameters, the procedure signature must be:

(id IN ROWID, format IN VARCHAR2, charset IN VARCHAR2, input IN BLOB, output IN 
OUT NOCOPY CLOB)

If your procedure requires only the ROWID, then the procedure signature must be:

(id IN ROWID,input IN BLOB, ouput IN OUT NOCOPY CLOB)

Create Index Requirements

In order to create an index using a PROCEDURE_FILTER preference, the index owner must have execute permission on the procedure.Oracle checks this at index time, which is similar to the security measures for USER_DATASTORE.

Error Handling

The filter procedure can raise any errors needed through the normal PL/SQL raise_application_error facility. These errors are propagated to the CTX_USER_INDEX_ERRORS view or reported to the user, depending on how the filter is invoked.

Procedure Filter Preference Example

Consider a filter procedure CTXSYS.NORMALIZE that you define with the following signature:

PROCEDURE NORMALIZE(id IN ROWID, charset IN VARCHAR2, input IN CLOB, 
output IN OUT NOCOPY VARCHAR2);

To use this procedure as your filter, set up your filter preference as follows:

begin
ctx_ddl.create_preference('myfilt', 'procedure_filter');
ctx_ddl.set_attribute('myfilt', 'procedure', 'normalize');
ctx_ddl.set_attribute('myfilt', 'input_type', 'clob');
ctx_ddl.set_attribute('myfilt', 'output_type', 'varchar2');
ctx_ddl.set_attribute('myfilt', 'rowid_parameter', 'TRUE');
ctx_ddl.set_attribute('myfilt', 'charset_parameter', 'TRUE');
end;