Formatted documents such as Microsoft Word and PDF must be filtered to text before they can be indexed. The type of filtering the system uses is determined by the FILTER preference type. By default, the system uses the AUTO_FILTER filter type, which automatically detects the format of your documents and filters them to text.
Oracle Text can index most formats. Oracle Text can also index columns that contain documents with mixed formats.
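As a minimal sketch of the default described above, the following statement names the system-defined CTXSYS.AUTO_FILTER preference explicitly. The table docs, its column doc, and the index name docs_idx are hypothetical names chosen for illustration.

```sql
-- Hypothetical table: one BLOB column holding documents in assorted formats.
CREATE TABLE docs (
  id  NUMBER PRIMARY KEY,
  doc BLOB
);

-- CTXSYS.AUTO_FILTER is the system-defined preference of type AUTO_FILTER.
-- Naming it here makes the filter choice explicit rather than relying on the default.
CREATE INDEX docs_idx ON docs (doc)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FILTER CTXSYS.AUTO_FILTER');
```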
See Oracle Text Reference for information about the document and graphics formats that AUTO_FILTER supports.
If you have a mixed-format column, such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, you tag each row TEXT or BINARY. Rows that are tagged TEXT are not filtered.
For example, you can tag the HTML and plain text rows as TEXT and the Microsoft Word rows as BINARY. You specify the format column in the PARAMETERS clause of CREATE INDEX.
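The format column setup can be sketched as follows. The table mixed_docs, the column names fmt and doc, and the index name are hypothetical; only the TEXT and BINARY tags and the FORMAT COLUMN parameter come from the description above.

```sql
-- Hypothetical mixed-format table; fmt is the format column.
CREATE TABLE mixed_docs (
  id  NUMBER PRIMARY KEY,
  fmt VARCHAR2(10),   -- 'TEXT' or 'BINARY'
  doc BLOB
);

-- HTML and plain-text rows are tagged TEXT, so they bypass filtering.
INSERT INTO mixed_docs (id, fmt, doc)
  VALUES (1, 'TEXT', UTL_RAW.CAST_TO_RAW('<html><body>Hello</body></html>'));
INSERT INTO mixed_docs (id, fmt, doc)
  VALUES (2, 'TEXT', UTL_RAW.CAST_TO_RAW('A plain text note.'));
-- Microsoft Word rows would be inserted with fmt = 'BINARY'
-- (loading the binary content is not shown here).

-- The format column is named in the PARAMETERS string.
CREATE INDEX mixed_docs_idx ON mixed_docs (doc)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FILTER CTXSYS.AUTO_FILTER FORMAT COLUMN fmt');
```

The format column has to be a column of the same table as the indexed column.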
A third format column value, IGNORE, is provided for when you do not want a document to be indexed at all. This is useful, for example, when a table includes plain-text documents in both Japanese and English but you want to process only the English documents, or when a mixed-format table includes both plain-text documents and images. Because IGNORE is implemented at the datastore level, it can be used with all filters.
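A minimal sketch of the Japanese and English case, assuming a hypothetical table lang_docs with a lang column that the application maintains. The lang column is only for illustration; Oracle Text reads the fmt column alone.

```sql
-- Hypothetical table of plain-text documents in two languages.
CREATE TABLE lang_docs (
  id   NUMBER PRIMARY KEY,
  lang VARCHAR2(2),    -- application-level tag: 'en' or 'ja'
  fmt  VARCHAR2(10),   -- format column: 'TEXT', 'BINARY', or 'IGNORE'
  doc  CLOB
);

-- English rows are tagged TEXT; Japanese rows are tagged IGNORE
-- so that they are skipped when the index is built.
INSERT INTO lang_docs (id, lang, fmt, doc)
  VALUES (1, 'en', 'TEXT', 'An English document.');
INSERT INTO lang_docs (id, lang, fmt, doc)
  VALUES (2, 'ja', 'IGNORE', '(Japanese document text)');

CREATE INDEX lang_docs_idx ON lang_docs (doc)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FORMAT COLUMN fmt');
```

In this sketch the tag is set when the row is inserted, before the index is created.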
You can create your own custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java stored procedure.
For external custom filtering, use the USER_FILTER filter preference type.
For internal filtering, use the PROCEDURE_FILTER filter type.
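Both kinds of custom filter are defined as preferences with CTX_DDL and then named in the FILTER entry of the index PARAMETERS string. The sketch below is illustrative only: the preference names, the script upcase.pl, and the procedure upcase_filter are hypothetical, and ownership and privilege requirements for the procedure may vary by release, so check Oracle Text Reference before relying on it.

```sql
-- External filter: USER_FILTER runs the executable named by the COMMAND attribute.
-- 'upcase.pl' is a hypothetical script; it is expected under $ORACLE_HOME/ctx/bin.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_user_filter', 'USER_FILTER');
  CTX_DDL.SET_ATTRIBUTE('my_user_filter', 'COMMAND', 'upcase.pl');
END;
/

-- Internal filter: a hypothetical stored procedure that upper-cases its input.
-- VARCHAR2 in and out keeps the sketch short but limits the document size.
CREATE OR REPLACE PROCEDURE upcase_filter (
  input  IN            VARCHAR2,
  output IN OUT NOCOPY VARCHAR2
) AS
BEGIN
  output := UPPER(input);
END;
/

-- PROCEDURE_FILTER preference pointing at the procedure above.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_proc_filter', 'PROCEDURE_FILTER');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'PROCEDURE', 'upcase_filter');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'INPUT_TYPE', 'VARCHAR2');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'OUTPUT_TYPE', 'VARCHAR2');
END;
/

-- Either preference is then named when the index is created, for example:
-- CREATE INDEX my_idx ON my_table (doc)
--   INDEXTYPE IS CTXSYS.CONTEXT
--   PARAMETERS ('FILTER my_proc_filter');
```

The INPUT_TYPE and OUTPUT_TYPE attributes determine the parameter types the procedure must declare, so a procedure meant for large documents would more likely use LOB types than VARCHAR2.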