5 Semantic Indexing for Documents

Information extractors locate and extract meaningful information from unstructured documents. The ability to search for documents based on this extracted information is a significant improvement over the keyword-based searches supported by the full-text search engines.

Semantic indexing for documents introduces an index type that can make use of information extractors and annotators to semantically index documents stored in relational tables. Documents indexed semantically can be searched using SEM_CONTAINS operator within a standard SQL query. The search criteria for these documents are expressed using SPARQL query patterns that operate on the information extracted from the documents, as in the following example.

SELECT docId
FROM   Newsfeed
WHERE  SEM_CONTAINS (article, 
     ' { ?org    rdf:type            typ:Organization  . 
         ?org    pred:hasCategory    cat:BusinessFinance } ', ..) = 1

The key components that facilitate Semantic Indexing for documents in an Oracle Database include:

  • Extensible information extractor framework, which allows third-party information extractors to be plugged into the database

  • SEM_CONTAINS operator to identify documents of interest, based on their extracted information, using standard SQL queries

  • SEM_CONTAINS_SELECT ancillary operator to return relevant information about the documents identified using SEM_CONTAINS operator

  • SemContext index type to interact with the information extractor and manage the information extracted from a document set in an index structure and to facilitate semantically meaningful searches on the documents

The application program interface (API) for managing extractor policies and semantic indexes created for documents is provided in the SEM_RDFCTX PL/SQL package. SEM_RDFCTX Package Subprograms provides the reference information about the subprograms in SEM_RDFCTX package.