Semantic Indexing for Documents

Information extractors locate and extract meaningful information from unstructured documents. The ability to search for documents based on this extracted information is a significant improvement over the keyword-based searches supported by full-text search engines.

Semantic indexing for documents introduces an index type that can use information extractors and annotators to semantically index documents stored in relational tables. Documents indexed semantically can be searched using the SEM_CONTAINS operator within a standard SQL query. The search criteria for these documents are expressed using SPARQL query patterns that operate on the information extracted from the documents, as in the following example.

SELECT docId
FROM   Newsfeed
WHERE  SEM_CONTAINS (article, 
     ' { ?org    rdf:type            typ:Organization  . 
         ?org    pred:hasCategory    cat:BusinessFinance } ', ..) = 1

The key components that facilitate semantic indexing for documents in Oracle Database include:

An extensible information extractor framework, which allows third-party information extractors to be plugged into the database
The SEM_CONTAINS operator, which identifies documents of interest, based on their extracted information, using standard SQL queries
The SEM_CONTAINS_SELECT ancillary operator, which returns relevant information about the documents identified using the SEM_CONTAINS operator
The SemContext index type, which interacts with the information extractor, manages the information extracted from a document set in an index structure, and facilitates semantically meaningful searches on the documents

The application programming interface (API) for managing extractor policies and semantic indexes created for documents is provided in the SEM_RDFCTX PL/SQL package. SEM_RDFCTX Package Subprograms provides reference information about the subprograms in the SEM_RDFCTX package.
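As a rough illustration of how these components fit together, the following sketch creates an extractor policy with the SEM_RDFCTX package and then builds a semantic index that uses it. It assumes that a semantic network already exists, that the GATE extractor (mdsys.gatenlp_extractor) is configured, and that the Newsfeed table has an article column as in the earlier query. The policy name SEM_EXTR and the index name ArticleIndex are hypothetical names chosen for this example.

```sql
-- Assumptions: a semantic network exists and the GATE extractor is
-- configured. SEM_EXTR and ArticleIndex are hypothetical names.

-- 1. Create an extractor policy backed by the GATE extractor.
BEGIN
  sem_rdfctx.create_policy(
    policy_name => 'SEM_EXTR',
    extractor   => mdsys.gatenlp_extractor());
END;
/

-- 2. Semantically index the article column of the Newsfeed table,
--    naming the extractor policy in the PARAMETERS clause.
CREATE INDEX ArticleIndex ON Newsfeed (article)
  INDEXTYPE IS mdsys.SemContext PARAMETERS ('SEM_EXTR');
```

Once such an index exists, a SEM_CONTAINS query like the earlier example can be evaluated against the information extracted from the article column.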

Information Extractors for Semantically Indexing Documents
Information extractors process unstructured documents and extract meaningful information from them, often using natural-language processing engines with the aid of ontologies.
Extractor Policies
An extractor policy is a named dictionary entity that determines the characteristics of a semantic index that is created using the policy.
Semantically Indexing Documents
Textual documents stored in a CLOB or VARCHAR2 column of a relational table can be indexed using the MDSYS.SEMCONTEXT index type, to facilitate semantically meaningful searches.
SEM_CONTAINS and Ancillary Operators
You can use the SEM_CONTAINS operator in a standard SQL statement to search for documents or document references that are stored in relational tables.
Searching for Documents Using SPARQL Query Patterns
Documents that are semantically indexed (that is, indexed using the mdsys.SemContext index type) can be searched using the SEM_CONTAINS operator within a standard SQL query.
Bindings for SPARQL Variables in Matching Subgraphs in a Document (SEM_CONTAINS_SELECT Ancillary Operator)
You can use the SEM_CONTAINS_SELECT ancillary operator to return additional information about each document matched using the SEM_CONTAINS operator.
Improving the Quality of Document Search Operations
The quality of a document search operation depends on the quality of the information produced by the extractor used to index the documents. If the information extracted is incomplete, you may want to add some annotations to a document.
Indexing External Documents
You can use semantic indexing on documents that are stored in a file system or on the network. In such cases, you store the references to external documents in a table column, and you create a semantic index on the column using an appropriate extractor policy.
Configuring the Calais Extractor Type
The CALAIS_EXTRACTOR type, which is a subtype of the RDFCTX_WS_EXTRACTOR type, enables you to access a Web service end point anywhere on the network, including the one that is publicly accessible (OpenCalais.com).
Working with General Architecture for Text Engineering (GATE)
General Architecture for Text Engineering (GATE) is an open source natural language processor and information extractor.
Creating a New Extractor Type
You can create a new extractor type by extending the RDFCTX_EXTRACTOR or RDFCTX_WS_EXTRACTOR extractor type.
Creating a Local Semantic Index on a Range-Partitioned Table
A local index can be created on a VARCHAR2 or CLOB column of a range-partitioned table.
Altering a Semantic Index
You can use the ALTER INDEX statement with a semantic index.
Passing Extractor-Specific Parameters in CREATE INDEX and ALTER INDEX
The CREATE INDEX and ALTER INDEX statements allow the passing of parameters needed by extractors.
Performing Document-Centric Inference
Document-centric inference refers to the ability to infer from each document individually.
Metadata Views for Semantic Indexing
This section describes views that contain metadata about semantic indexing.
Default Style Sheet for GATE Extractor Output
This section lists the default XML style sheet that the mdsys.gatenlp_extractor implementation uses to convert the annotation set (encoded in XML) into RDF/XML.
