Information Extractors for Semantically Indexing Documents

5.1 Information Extractors for Semantically Indexing Documents

Information extractors process unstructured documents and extract meaningful information from them, often using natural-language processing engines with the aid of ontologies.

The quality and the completeness of information extracted from a document vary from one extractor to another. Some extractors simply identify the entities (such as names of persons, organizations, and geographic locations from a document), while the others attempt to identify the relationships among the identified entities and additional description for those entities. You can search for a specific document from a large set when the information extracted from the documents is maintained as a semantic index.

You can use an information extractor to create a semantic index on the documents stored in a column of a relational table. An extensible framework allows any third-party information extractor that is accessible from the database to be plugged into the database. An object type created for an extractor encapsulates the extraction logic, and has methods to configure the extractor and receive information extracted from a given document in RDF/XML format.

An abstract type MDSYS.RDFCTX_EXTRACTOR defines the common interfaces to all information extractors. An implementation of this abstract type interacts with a specific information extractor to produce RDF/XML for a given document. An implementation for this type can access a third-party information extractor that either is available as a database application or is installed on the network (accessed using Web service callouts). Example 5-1 shows the definition of the RDFCTX_EXTRACTOR abstract type.

Example 5-1 RDFCTX_EXTRACTOR Abstract Type Definition

create or replace type rdfctx_extractor authid current_user as object (
  extr_type        VARCHAR2(32),
  member function  getDescription return VARCHAR2,
  member function  rdfReturnType return VARCHAR2,
  member function  getContext(attribute VARCHAR2) return VARCHAR2,
  member procedure startDriver,
  member function  extractRDF(document CLOB,
                              docId    VARCHAR2) return CLOB,
  member function  extractRdf(document CLOB,
                              docId    VARCHAR2,
                              params   VARCHAR2,
                              options  VARCHAR2 default NULL) return CLOB
  member function  batchExtractRdf(docCursor        SYS_REFCURSOR,
                              extracted_info_table  VARCHAR2,
                              params                VARCHAR2,
                              partition_name        VARCHAR2 default NULL,
                              docId                 VARCHAR2 default NULL,
                              preferences           SYS.XMLType default NULL,
                              options               VARCHAR2 default NULL)  
                              return CLOB,
  member procedure closeDriver
) not instantiable not final
/

A specific implementation of the RDFCTX_EXTRACTOR type sets an identifier for the extractor type in the extr_type attribute, and it returns a short description for the extractor type using getDescription method. All implementations of this abstract type return the extracted information as RDF triples. In the current release, the RDF triples are expected to be serialized using RDF/XML format, and therefore the rdfReturnType method should return 'RDF/XML'.

An extractor type implementation uses the extractRDF method to encapsulate the extraction logic, possibly by invoking external information extractor using proprietary interfaces, and returns the extracted information in RDF/XML format. When a third-party extractor uses some proprietary XML Schema to capture the extracted information, an XML style sheet can be used to generate an equivalent RDF/XML. The startDriver and closeDriver methods can perform any housekeeping operations pertaining to the information extractor. The optional params parameter allows the extractor to obtain additional information about the type of extraction needed (for example, the desired quality of extraction).

Optionally, an extractor type implementation may support a batch interface by providing an implementation of the batchExtractRdf member function. This function accepts a cursor through the input parameter docCursor and typically uses that cursor to retrieve each document, extract information from the document, and then insert the extracted information into (the specified partition identified by the partition_name partition of the extracted_info_table table. The preferences parameter is used to obtain the preferences value associated with the policy (as described in Indexing External Documents and in the SEM_RDFCTX.CREATE_POLICY reference section).

The getContext member function accepts an attribute name and returns the value for that attribute. Currently this function is used only for extractors supporting the batch interface. The attribute names and corresponding possible return values are the following:

For the BATCH_SUPPORT attribute, the return values are 'YES' or 'NO' depending on whether the extractor supports the batch interface.
For the DBUSER attribute, the return value is the name of a database user that will connect to the database to retrieve rows from the cursor (identified by the docCursor parameter) and that will write to the table extracted_info_table.

This information is used for granting appropriate privileges to the table being indexed and the table extracted_info_table.

The startDriver and closeDriver methods can perform any housekeeping operations pertaining to the information extractor.

An extractor type for the General Architecture for Text Engineering (GATE) engine is defined as a subtype of the RDFCTX_EXTRACTOR type. The implementation of this extractor type sends the documents to a GATE engine over a TCP connection, receives annotations extracted by the engine in XML format, and converts this proprietary XML document to an RDF/XML document. For more information on configuring a GATE engine to work with Oracle Database, see Working with General Architecture for Text Engineering (GATE). For an example of creating a new information extractor, see Creating a New Extractor Type.

Information extractors that are deployed as Web services can be invoked from the database by extending the RDFCTX_WS_EXTRACTOR type, which is a subtype of the RDFCTX_EXTRACTOR type. The RDFCTX_WS_EXTRACTOR type encapsulates the Web service callouts in the extractRDF method; specific implementations for network-based extractors can reuse this implementation by setting relevant attribute values in the type constructor.

Thomson Reuters Calais is an example of a network-based information extractor that can be accessed using web-service callouts. The CALAIS_EXTRACTOR type, which is a subtype of the RDFCTX_WS_EXTRACTOR type, encapsulates the Calais extraction logic, and it can be used to semantically index the documents. The CALAIS_EXTRACTOR type must be configured for the database instance before it can be used to create semantic indexes, as explained in Configuring the Calais Extractor type.

Parent topic: Semantic Indexing for Documents