Oracle® Outside In Search Export Release 8.3.5
Search Export allows developers to implement sophisticated text extraction from standard business documents. With the current version of Search Export, an application can access documents through a single C API. Search Export is suited to a wide spectrum of applications, from rapid search and retrieval to indexing. Search Export presents the text in one of three formats: XML, HTML, or plain text.
There may be references to other Outside In Technology SDKs within this manual. To obtain complete documentation for any other Outside In product, see http://download.oracle.com/docs/cd/E14154_01/index.htm.
The updated list of supported formats is posted at http://www.oracle.com/technology/products/content-management/oit/oit_all.html.
Search Export is now supported on the following platforms: IBM AIX 64, HP-UX RISC 64, and Solaris SPARC 64. See Chapter 3, "UNIX Implementation Details" for more information.
Outside In has added support for Lotus Notes mail, also known as NSF files. This support requires the presence of a Lotus Notes client or a Lotus Domino server and is only supported on Win32. Details can be found in the section NSF Support. There is also a new option associated with this support: SCCOPT_LOTUSNOTESDIRECTORY.
A new option, SCCOPT_IGNORE_PASSWORD (SOAP equivalent: ignorePassword), has been added to disable the password verification of files where the contents can be processed without validation of the password.
Access to password-protected files has been added for certain Microsoft Office documents, PDF, Zip, and NSF files. See DASetFileAccessCallback.
Search Export already had the ability to output raw XMP (Adobe's Extensible Metadata Platform). In this release, we have added the ability to parse that data and convert it into document properties. See SCCOPT_PARSEXMPMETADATA (SOAP equivalent: parseXMPMetaData).
Search Export has expanded the information available for comments including sub-document properties and a bookmark to locate the comment in the document. You must set the SCCEX_ANNOTATIONS flag in the SCCOPT_XML_SEARCHML_FLAGS option to activate this support.
FI_SEARCHML and FI_SEARCHML20 have been deprecated for several releases, but we have continued supporting them for existing customers. This will be the final release for that support. It is recommended that you use FI_SEARCHML_LATEST to ensure that you always get the most recent SearchML schema.
Search Export can normalize all of a document's content to the SearchML or PageML schemas, both provided in the form of a DTD and an XML schema, or it can output the content as simple text (the SearchText output format) or simple HTML (the SearchHTML output format). The output options available to you are determined by your license.
Note:
All Search Export output formats are UTF-8 encoded Unicode text.
The SearchML Schema is designed to serve as a foundation for information extraction, with output that is ideal for rapid search and retrieval applications. To facilitate this purpose, the XML tags used by the SearchML schema are designed to closely mirror the information in files created by popular business applications.
Note:
It is recommended that you use FI_SEARCHML_LATEST to ensure that you always get the most recent SearchML schema. However, if you must have a particular version of the schema, please see sccfi.h for the other FI_SEARCHML* definitions.
The PageML output format provides information about where text would appear in a printed version of the input document. Its output consists of an XML file specifying all of the text runs for each page in the document. The text run locations are given as starting and ending character counts, or "offsets," from the beginning of the input file's text stream. These offsets match the text offsets used by Search Export's SearchML format and other members of the Outside In Viewing Technology family, including Content Access and Text Access.
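Because the offsets are character counts and Search Export output is UTF-8, an application that slices its own copy of the extracted text by offset has to count characters rather than bytes. The following is a minimal sketch of that conversion, not SDK code. It assumes offsets are zero-based, count Unicode code points, and that the ending offset is exclusive; the source does not state these details, so check them against real PageML output. The function names `utf8_byte_index` and `copy_text_run` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte index of the code point at character offset `char_off`
 * in the NUL-terminated UTF-8 string `s`, or the string length in bytes
 * if the offset is at or past the end. Continuation bytes (10xxxxxx)
 * are skipped, so only the first byte of each code point is counted. */
size_t utf8_byte_index(const char *s, size_t char_off)
{
    size_t i = 0;
    size_t chars = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {  /* start of a code point */
            if (chars == char_off)
                return i;
            chars++;
        }
        i++;
    }
    return i;
}

/* Copy the text run [start, end), measured in characters, into `out`.
 * ASSUMPTION: `end` is exclusive. Returns the number of bytes copied
 * (excluding the terminator), or 0 if the run is empty or does not fit. */
size_t copy_text_run(const char *text, size_t start, size_t end,
                     char *out, size_t out_size)
{
    size_t b0, b1, n;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    if (end <= start)
        return 0;

    b0 = utf8_byte_index(text, start);
    b1 = utf8_byte_index(text, end);
    n = b1 - b0;
    if (n == 0 || n >= out_size)
        return 0;

    memcpy(out, text + b0, n);
    out[n] = '\0';
    return n;
}
```

For example, with the text `héllo` (where `é` occupies two bytes), the run from character 1 to character 3 covers `él` and copies three bytes.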
The PageML Schema supports most input formats supported by Search Export. Most format types will contain <page> elements that correspond to the page that the text appears on, but there are three exceptions.
Bitmap images have no searchable characters in the main document, so no text will appear in the output.
All of the text for archives will appear on a single page.
The text for spreadsheets will have each sheet appear as a separate page.
PageML is run in a manner much like other Search Export output filters, such as FI_SEARCHML_LATEST. When PageML formatted XML is desired, FI_PAGEML is passed as the output format (the dwOutputId parameter) to EXOpenExport(). PageML uses a new schema, also called PageML, when generating the XML output. There is a small set of options that may be used to modify its behavior:
SCCOPT_XML_PAGEML_FLAGS
SCCOPT_XML_PAGEML_PRINTERNAME
textOutOn
xmlDeclarationOff
The PageML Schema supports all word processing formats supported by Search Export, including but not limited to Microsoft Word 97 and newer, WordPerfect Version 7 and newer, HTML, ASCII, and RTF. There is also limited support for PDF.
The SearchHTML output format uses standard HTML tags, but its output is not intended to be viewable HTML. It is a form of HTML that is easily parsed and therefore ideal for search and retrieval or indexing applications.
Document properties will be stored in <meta> tags using the name attribute for the property type and the content attribute for the property's content. The title document property will be represented by a <title> tag.
Bold, italic, and underline character attributes will be reflected using the <b>, <i> and <u> tags respectively.
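As an illustration of how an indexing application might read document properties back out of SearchHTML output, here is a minimal sketch in C. It assumes each property appears as a lowercase `<meta name="..." content="...">` tag with double-quoted attribute values that contain no `>` or `"` characters; the source does not specify the exact serialization, so treat those details as assumptions. The function names `get_attr` and `next_meta_property` are hypothetical and are not part of the SDK.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the double-quoted value of attribute `attr` found inside the tag
 * spanning [tag, tag_end) into `out`. Returns 1 on success, 0 otherwise. */
static int get_attr(const char *tag, const char *tag_end, const char *attr,
                    char *out, size_t out_size)
{
    char key[32];
    const char *p, *q;
    size_t n;

    /* Build the search key, e.g. ` name="` (leading space avoids matching
     * the tail of a longer attribute name). */
    if (snprintf(key, sizeof key, " %s=\"", attr) >= (int)sizeof key)
        return 0;
    p = strstr(tag, key);
    if (p == NULL || p >= tag_end)
        return 0;
    p += strlen(key);
    q = strchr(p, '"');
    if (q == NULL || q > tag_end)
        return 0;
    n = (size_t)(q - p);
    if (n >= out_size)
        return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}

/* Find the next <meta> tag at or after *cursor that carries both a name
 * and a content attribute. On success, fill `name` and `content`, advance
 * *cursor past the tag, and return 1. Return 0 when no more are found. */
int next_meta_property(const char **cursor,
                       char *name, size_t name_size,
                       char *content, size_t content_size)
{
    const char *tag;

    assert(cursor != NULL && *cursor != NULL);
    tag = *cursor;
    while ((tag = strstr(tag, "<meta ")) != NULL) {
        const char *end = strchr(tag, '>');
        if (end == NULL)
            break;
        if (get_attr(tag, end, "name", name, name_size) &&
            get_attr(tag, end, "content", content, content_size)) {
            *cursor = end + 1;
            return 1;
        }
        tag = end + 1;  /* malformed or incomplete tag: keep scanning */
    }
    return 0;
}
```

A caller would point a cursor at the start of the output buffer and call `next_meta_property` in a loop until it returns 0. A production indexer would more likely use a real HTML parser; plain string scanning is used here only to keep the sketch short.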
SearchHTML is run in a manner much like other Search Export output filters, such as FI_SEARCHML_LATEST. When SearchHTML formatted output is desired, FI_SEARCHHTML is passed as the output format (the dwOutputId parameter) to EXOpenExport().
The output will obey the HTML 4.01 Transitional DTD, available at http://www.w3.org/TR/REC-html40/.
The basic architecture of Outside In technologies is the same across all supported platforms:
Filter/Module | Description |
---|---|
Input Filter | The input filters form the base of the architecture. Each one reads a specific file format or set of related formats and sends the data to OIT through a standard set of function calls. There are more than 150 of these filters that read more than 500 distinct file formats. Filters are loaded on demand by the data access module. |
Export Filter | Architecturally similar to input filters, export filters know how to write out a specific format based on information coming from the chunker module. The export filters generate XML, HTML, or text. |
Chunker | The Chunker module is responsible for caching a certain amount of data from the filter and returning this data to the export filter. |
Export | The Export module implements the export API and understands how to load and run individual export filters. |
Data Access | The Data Access module implements a generic API for access to files. It understands how to identify and load the correct filter for all the supported file formats. The module delivers to the developer a generic handle to the requested file, which can then be used to run more specialized processes, such as the Export process. |
Schema | Schemas provide a means for defining the structure, content and semantics of XML documents. Your Search Export license may include the SearchML schema. Schemas can be presented in the form of a DTD (Document Type Definition) or XML Schema (schema). The SearchML schema is provided in both forms. |
The following terms are used in this documentation.
Term | Definition |
---|---|
Developer | Someone integrating this technology into another technology or application. Most likely this is you, the reader. |
Source File | The file the developer wishes to export. |
Output File | The file being written: XML, HTML, or text. |
Data Access Module | The core of Outside In Data Access, in the SCCDA library. |
Data Access Submodule (also referred to as "Submodule") | This refers to any of the Outside In Data Access modules, including SCCEX (Export), but excluding SCCDA (Data Access). |
Document Handle (also referred to as "hDoc") | A Document Handle is created when a file is opened using Data Access (see Chapter 4, "Data Access Common Functions"). Each Document Handle may have any number of Subhandles. |
Subhandle (also referred to as "hItem") | Any of the handles created by a Submodule's Open function. Every Subhandle has a Document Handle associated with it. For example, the hExport returned by EXOpenExport is a Subhandle. The DASetOption and DAGetOption functions in the Data Access Module may be called with any Subhandle or Document Handle. The DARetrieveDocHandle function returns the Document Handle associated with any Subhandle. |
Each Outside In product has an sdk directory, under which there is a subdirectory for each platform on which the product ships (for example, sx/sdk/sx_win-x86-32_sdk). Under each of these directories are the following three subdirectories:
docs - Contains both a PDF and HTML version of the product manual.
redist - Contains only the files that the customer is allowed to redistribute. These include all the compiled modules, filter support files, .xsd and .dtd files, cmmap000.bin, and third-party libraries, like freetype.
sdk - Contains the other subdirectories that used to be at the root level of an sdk: common, lib (Windows only), resource, samplefiles, and samplecode (previously samples). In addition, one new subdirectory, demo, has been added. It holds all of the compiled sample apps and other files that are needed to demo the products. These are files that the customer should not redistribute (.cfg files, exportmaps, etc.).
In the root platform directory (for example, sx/sdk/sx_win-x86-32_sdk), there are two files:
README - Explains the contents of the sdk, and that makedemo must be run in order to use the sample applications.
makedemo (either .bat or .sh, depending on the platform) - This script either copies (on Windows) or symlinks (on UNIX) the contents of …/redist into …/sdk/demo, so that sample applications can then be run out of the demo directory.
Here's a step-by-step overview of how to export a source file.
Call DAThreadInit if the application runs multiple threads (optional). Threading is only supported on 32-bit versions of Linux x86, Solaris SPARC, and all Windows platforms. See "DAThreadInit" for more information.
Call DAInit to initialize the Data Access technology. This function needs to be called only once per application.
Set any options that require a NULL handle type (optional). Certain options need to be set before the desired source file is opened. These options are identified by requiring a NULL handle type. They include, but aren't limited to:
SCCOPT_FALLBACKFORMAT
SCCOPT_FIFLAGS
SCCOPT_TEMPDIR
SCCOPT_IO_BUFFERSIZE
Open the Source File. DAOpenDocument is called to create a document handle that uniquely identifies the source file. This handle may be used in subsequent calls to the EXOpenExport function or the open function of any other Data Access Submodule, and will be used to close the file when access is complete. This allows the file to be accessed from multiple Data Access Submodules without reopening.
Set the Options. If you require option values other than the default settings, call DASetOption to set options. Note that options listed in the Options Guide as having "Handle Types" that accept VTHEXPORT may be set any time before EXRunExport is called. See "DASetOption" for more information on options and how to set them.
Open a Handle to Search Export. Using the document handle, EXOpenExport is called to obtain an export handle that identifies the file to the specific export product. This handle will be used in all subsequent calls to the specific export functions. The dwOutputId parameter of this function is used to specify that the output file type should be set to one of the following:
FI_SEARCHML_LATEST
FI_PAGEML
FI_SEARCHHTML
FI_SEARCHTEXT
Export the File. EXRunExport is called to generate the output file(s) from the source file.
Close Handle to Search Export. EXCloseExport is called to terminate the export process for the file. After this function is called, the export handle will no longer be valid, but the document handle may still be used.
Close the Source File. DACloseDocument is called to close the source file. After calling this function, the document handle will no longer be valid.
Close Search Export. DADeInit is called to de-initialize the Data Access technology.
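The ordering rules in the steps above can be modelled as a small state machine. The sketch below does not call the SDK: every type and function name in it (`Step`, `Session`, `session_apply`, `sequence_is_valid`) is hypothetical, and the comments only name the SDK function each step stands for. It tracks a single document and a single export handle, omits DAThreadInit, and assumes the export handle must be closed before the document is closed, which follows the order of the steps but is not stated as a hard requirement in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step labels; each stands for one call in the overview. */
typedef enum {
    STEP_DA_INIT,        /* DAInit */
    STEP_SET_NULL_OPTS,  /* DASetOption with a NULL handle (optional) */
    STEP_OPEN_DOC,       /* DAOpenDocument */
    STEP_SET_OPTS,       /* DASetOption on a document or export handle */
    STEP_OPEN_EXPORT,    /* EXOpenExport */
    STEP_RUN_EXPORT,     /* EXRunExport */
    STEP_CLOSE_EXPORT,   /* EXCloseExport */
    STEP_CLOSE_DOC,      /* DACloseDocument */
    STEP_DA_DEINIT       /* DADeInit */
} Step;

typedef struct {
    int initialized;  /* DAInit done, DADeInit not yet */
    int doc_open;     /* document handle is valid */
    int export_open;  /* export handle is valid */
    int export_ran;   /* EXRunExport already called on this handle */
} Session;

/* Apply one step. Returns 1 and updates the session if the step is
 * allowed in the current state, otherwise returns 0. */
int session_apply(Session *s, Step step)
{
    assert(s != NULL);
    switch (step) {
    case STEP_DA_INIT:
        if (s->initialized) return 0;          /* only once */
        s->initialized = 1;
        return 1;
    case STEP_SET_NULL_OPTS:                   /* before the file is opened */
        return s->initialized && !s->doc_open;
    case STEP_OPEN_DOC:
        if (!s->initialized || s->doc_open) return 0;
        s->doc_open = 1;
        return 1;
    case STEP_SET_OPTS:                        /* before EXRunExport */
        return s->doc_open && !s->export_ran;
    case STEP_OPEN_EXPORT:
        if (!s->doc_open || s->export_open) return 0;
        s->export_open = 1;
        return 1;
    case STEP_RUN_EXPORT:
        if (!s->export_open) return 0;
        s->export_ran = 1;
        return 1;
    case STEP_CLOSE_EXPORT:                    /* document handle stays valid */
        if (!s->export_open) return 0;
        s->export_open = 0;
        s->export_ran = 0;
        return 1;
    case STEP_CLOSE_DOC:                       /* ASSUMPTION: export closed first */
        if (!s->doc_open || s->export_open) return 0;
        s->doc_open = 0;
        return 1;
    case STEP_DA_DEINIT:
        if (!s->initialized || s->doc_open) return 0;
        s->initialized = 0;
        return 1;
    }
    return 0;
}

/* Returns 1 if every step is allowed in order and nothing is left open. */
int sequence_is_valid(const Step *steps, size_t count)
{
    Session s = {0, 0, 0, 0};
    size_t i;

    for (i = 0; i < count; i++)
        if (!session_apply(&s, steps[i]))
            return 0;
    return !s.initialized && !s.doc_open && !s.export_open;
}
```

Passing the nine steps in the order listed above yields 1. A sequence that calls the export step before an export handle is open, or that sets a NULL-handle option after the document is open, yields 0.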
The following notice must be included in the documentation, help system, or About box of any software that uses any of Oracle's executable code:
Outside In Search Export © 1991, 2010 Oracle.
The following notice must be included in the documentation of any software that uses Oracle's TIF6 filter (this filter reads TIFF and JPEG formats):
The software is based in part on the work of the Independent JPEG Group.