Oracle Text Reference Release 9.0.1 Part Number A90121-01 |
|
Indexing, 4 of 11
Use the filter types to create preferences that determine how text is filtered for indexing. Filters allow word processor and formatted documents as well as plain text, HTML, and XML documents to be indexed.
For formatted documents, Oracle stores documents in their native format and uses filters to build temporary plain text or HTML versions of the documents. Oracle indexes the words derived from the plain text or HTML version of the formatted document.
To create a filter preference, you must use one of the following types:
Use the CHARSET_FILTER to convert documents from a non-database character set to the database character set.
CHARSET_FILTER has the following attribute:
See Also:
Oracle9i Globalization and National Language Support Guide for more information about the supported NLS character sets. |
If your character set is UTF-16, you can specify UTF16AUTO to automatically detect big- or little-endian data. Oracle does so by examining the first two bytes of the document row.
If the first two bytes are 0xFE, 0xFF, the document is recognized as little-endian and the remainder of the document minus those two bytes is passed on for indexing.
If the first two bytes are 0xFF, 0xFE, the document is recognized as big-endian and the remainder of the document minus those two bytes is passed on for indexing.
If the first two bytes are anything else, the document is assumed to be big-endian and the whole document including the first two bytes is passed on for indexing.
A mixed character set column is one that stores documents of different character sets. For example, a text table might store some documents in WE8ISO8859P1 and others in UTF8.
To index a table of documents in different character sets, you must create your base table with a character set column. In this column, you specify the document character set on a per-row basis. To index the documents, Oracle converts the documents into the database character set.
Character set conversion works with the CHARSET_FILTER. When the charset column is NULL or not recognized, Oracle assumes the source character set is the one specified in the charset attribute.
For example, create the table with a charset column:
create table hdocs ( id number primary key, fmt varchar2(10), cset varchar2(20), text varchar2(80) );
Insert plain-text documents and name the character set:
insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt'); insert in hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt'); commit;
Create the index and name the charset column:
create index hdocsx on hdocs(text) indextype is ctxsys.context parameters ('datastore ctxsys.file_datastore filter ctxsys.charset_filter format column fmt charset column cset');
The Inso filter is a universal filter that filters most document formats. This filtering technology is licensed from Inso Chicago Corporation.
Use it for indexing single and mixed-format columns.
See Also:
For a list of the formats supported by INSO_FILTER and to learn more about how to set up your environment to use this filter, see Appendix B, "Supported Document Formats". |
The INSO_FILTER has the following attribute:
To index a text column containing formatted documents such as Microsoft Word, use the INSO_FILTER. This filter automatically detects the document format. You can use the CTXSYS.INSO_FILTER system-defined preference in the parameter clause as follows:
create index hdocsx on hdocs(text) indextype is ctxsys.context parameters ('datastore ctxsys.file_datastore filter ctxsys.inso_filter');
A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain text, and HTML documents.
The INSO_FILTER can index mixed-format columns. However, you might want to have the INSO filter bypass the plain text or HTML documents. Filtering plain text or HTML with the INSO_FILTER is redundant.
The format column in the base table allows you to specify the type of document contained in the text column. The only two types you can specify are TEXT and BINARY. During indexing, the INSO_FILTER ignores any document typed TEXT (assuming the charset column is not specified.)
To set up the INSO_FILTER bypass mechanism, you must create a format column in your base table.
For example:
create table hdocs ( id number primary key, fmt varchar2(10), text varchar2(80) );
Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. Alternatively, to have the INSO_FILTER ignore an HTML document, specify TEXT in the format column.
For example, the following statements add two documents to the text table, assigning one format as BINARY and the other TEXT:
insert into hdocs values(1, 'binary', '/docs/myword.doc'); insert in hdocs values (2, 'text', '/docs/index.html'); commit;
To create the index, use CREATE INDEX and specify the format column name in the parameter string:
create index hdocsx on hdocs(text) indextype is ctxsys.context parameters ('datastore ctxsys.file_datastore filter ctxsys.inso_filter format column fmt');
If you do not specify TEXT or BINARY for the format column, BINARY is used.
The INSO_FILTER converts documents to the database character set when the document format column is set to TEXT. In this case, the INSO_FILTER looks at the charset column to determine the document character set.
If the charset column value is not an Oracle character set name, the document is passed through without any character set conversion.
If you do specify the charset column and do not specify the format column, the INSO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character set auto-detection.
Oracle does not recommend using INSO_FILTER to index plain text documents.
If your table contains text documents exclusively, use the NULL_FILTER or the USER_FILTER.
If your table contains text documents mixed with formatted documents, Oracle recommends creating a format column and marking the text documents as TEXT to bypass INSO_FILTER. In such cases, Oracle also recommends creating a charset column to indicate the document character set.
However, if you use INSO_FILTER to index nonbinary (text) documents and you specify no format column and no charset column, the INSO_FILTER processes the document. Your indexing process is thus subject to the character set limitations of Inso technology. Specifically, your application must ensure that one of the following conditions is true:
Use the NULL_FILTER type when plain text or HTML is to be indexed and no filtering needs to be performed. NULL_FILTER has no attributes.
If your document set is entirely HTML, Oracle recommends that you use the NULL_FILTER in your filter preference.
For example, to index an HTML document set, you can specify the system-defined preferences for NULL_FILTER and HTML_SECTION_GROUP as follows:
create index myindex on docs(htmlfile) indextype is ctxsys.context parameters('filter ctxsys.null_filter section group ctxsys.html_section_group');
See Also:
For more information on section groups and indexing HTML documents, see "Section Group Types" in this chapter. |
Use the USER_FILTER type to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:
Attribute | Attribute Values |
---|---|
command |
Specify the name of the filter executable. |
Specify the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats.
The executable you specify must exist in the $ORACLE_HOME/ctx/bin
directory. You must create your user-filter executable with two parameters: the first is the name of the input file to be read, and the second is the name of the output file to be written to.
If all the document formats are supported by INSO_FILTER, use INSO_FILTER instead of USER_FILTER unless additional tasks besides filtering are required for the documents.
The following example perl script to be used as the user filter. This script converts the input text file specified in the first argument to uppercase and writes the output to the location specified in the second argument:
#!/usr/local/bin/perl open(IN, $ARGV[0]); open(OUT, ">".$ARGV[1]); while (<IN>) { tr/a-z/A-Z/; print OUT; } close (IN); close (OUT);
Assuming that this file is named upcase.pl
, create the filter preference as follows:
begin ctx_ddl.create_preference ( preference_name => 'USER_FILTER_PREF', object_name => 'USER_FILTER' ); ctx_ddl.set_attribute ('USER_FILTER_PREF','COMMAND','upcase.pl'); end;
Create the index in SQL*Plus as follows:
create index user_filter_idx on user_filter ( docs ) indextype is ctxsys.context parameters ('FILTER USER_FILTER_PREF');
Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called each time a document needs to be filtered.
This type has the following attributes:
Specify the name of the stored procedure to use for filtering. The procedure can be a PL/SQL or Java stored procedure. The procedure can be a safe callout or call a safe callout.
The procedure must be owned by CTXSYS and have one of the following signatures:
PROCEDURE(IN BLOB, IN OUT NOCOPY CLOB) PROCEDURE(IN CLOB, IN OUT NOCOPY CLOB) PROCEDURE(IN VARCHAR, IN OUT NOCOPY CLOB) PROCEDURE(IN BLOB, IN OUT NOCOPY VARCHAR2) PROCEDURE(IN CLOB, IN OUT NOCOPY VARCHAR2) PROCEDURE(IN VARCHAR2, IN OUT NOCOPY VARCHAR2) PROCEDURE(IN BLOB, IN VARCHAR2) PROCEDURE(IN CLOB, IN VARCHAR2) PROCEDURE(IN VARCHAR2, IN VARCHAR2)
The first argument is the content of the unfiltered row as passed out by the datastore. The second argument is for the procedure to pass back the filtered document text.
The procedure attribute is mandatory and has no default.
Specify the type of the input argument of the filter procedure. You can specify one of the following:
The input_type attribute s not mandatory. If not specified, BLOB is the default.
Specify the type of output argument of the filter procedure. You can specify one of the following types:
The output_type attribute is not mandatory. If not specified, CLOB is the default.
When you specify TRUE, the rowid of the document to be filtered is passed as the first parameter, before the input and output parameters.
For example, with INPUT_TYPE BLOB, OUTPUT_TYPE CLOB, and ROWID_PARAMETER TRUE, the filter procedure must have the signature as follows:
procedure(in rowid, in blob, in out nocopy clob)
This attribute is useful for when your procedure requires data from other columns or tables. This attribute is not mandatory. The default is FALSE.
When you specify TRUE, the value of the format column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid parameter, if enabled.
You specify the name of the format column at index time in the parameters string, using the keyword 'format column <columnname>'
. The parameter type must be IN VARCHAR2.
The format column value can be read via the rowid parameter, but this attribute allows a single filter to work on multiple table structures, because the format attribute is abstracted and does not require the knowledge of the name of the table or format column.
FORMAT_PARAMETER is not mandatory. The default is FALSE.
When you specify TRUE, the value of the charset column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid and format parameter, if enabled.
You specify the name of the charset column at index time in the parameters string, using the keyword 'charset column <columnname>'
. The parameter type must be IN VARCHAR2.
CHARSET_PARAMETER attribute is not mandatory. The default is FALSE.
ROWID_PARAMETER, FORMAT_PARAMETER, and CHARSET_PARAMETER are all independent. The order is rowid, the format, then charset, but the filter procedure is passed only the minimum parameters required.
For example, assume that INPUT_TYPE is BLOB and OUTPUT_TYPE is CLOB. If your filter procedure requires all parameters, the procedure signature must be:
(id IN ROWID, format IN VARCHAR2, charset IN VARCHAR2, input IN BLOB, output IN OUT NOCOPY CLOB)
If your procedure requires only the ROWID, then the procedure signature must be:
(id IN ROWID,input IN BLOB, ouput IN OUT NOCOPY CLOB)
In order to create an index using a PROCEDURE_FILTER preference, the index owner must have execute permission on the procedure.Oracle checks this at index time, which is similar to the security measures for USER_DATASTORE.
The filter procedure can raise any errors needed through the normal PL/SQL raise_application_error facility. These errors are propagated to the CTX_USER_INDEX_ERRORS view or reported to the user, depending on how the filter is invoked.
Consider a filter procedure CTXSYS.NORMALIZE that you define with the following signature:
PROCEDURE NORMALIZE(id IN ROWID, charset IN VARCHAR2, input IN CLOB, output IN OUT NOCOPY VARCHAR2);
To use this procedure as your filter, set up your filter preference as follows:
begin ctx_ddl.create_preference('myfilt', 'procedure_filter'); ctx_ddl.set_attribute('myfilt', 'procedure', 'normalize'); ctx_ddl.set_attribute('myfilt', 'input_type', 'clob'); ctx_ddl.set_attribute('myfilt', 'output_type', 'varchar2'); ctx_ddl.set_attribute('myfilt', 'rowid_parameter', 'TRUE'); ctx_ddl.set_attribute('myfilt', 'charset_parameter', 'TRUE'); end;
|
Copyright © 1996-2001, Oracle Corporation. All Rights Reserved. |
|