5.3 Semantically Indexing Documents

Textual documents stored in a CLOB or VARCHAR2 column of a relational table can be indexed using the MDSYS.SEMCONTEXT index type, to facilitate semantically meaningful searches.

The extractor policy specified at index creation determines the information extractor used to semantically index the documents. The extracted information, captured as a set of RDF triples for each document, is managed in the RDF data store. Each instance of the semantic index is associated with a system-generated RDF graph, which maintains the RDF triples extracted from the corresponding documents.

The following example creates a semantic index named ArticleIndex on the textual documents in the ARTICLE column of the NEWSFEED table, using the extractor policy named SEM_EXTR:

CREATE INDEX ArticleIndex on Newsfeed (article)
   INDEXTYPE IS mdsys.SemContext PARAMETERS ('SEM_EXTR');

The RDF graph created for an index is managed internally and it is not associated with an application table. The triples stored in such an RDF graph are automatically maintained for any modifications (such as update, insert, or delete) made to the documents stored in the table column. Although a single RDF graph is used to index all documents stored in a table column, the triples stored in the RDF graph maintain references to the documents from which they are extracted; therefore, all the triples extracted from a specific document form an individual graph within the RDF graph. The documents that are semantically indexed can then be searched using a SPARQL query pattern that operates on the triples extracted from the documents.

When creating a semantic index for documents, you can use a basic extractor policy or a dependent policy, which may include one or more user-defined RDF graphs. When you create an index with a dependent extractor policy, the document search pattern specified using SPARQL could span the triples extracted from the documents as well as those defined in user-defined RDF graphs.

You can create an index using multiple extractor policies, in which case the triples extracted by the corresponding extractors are maintained separately in distinct RDF graphs. A document search query using one such index can select the specific policy to be used for answering the query. For example, an extractor policy named CITY_EXTR can be created to extract the names of the cities from a given document, and this extractor policy can be used in combination with the SEM_EXTR policy to create a semantic index, as in the following example:

CREATE INDEX ArticleIndex on Newsfeed (article)
   INDEXTYPE IS mdsys.SemContext PARAMETERS ('SEM_EXTR CITY_EXTR');

The first extractor policy in the PARAMETERS list is considered to be the default policy if a query does not refer to a specific policy; however, you can change the default extractor policy for a semantic index by using the SEM_RDFCTX.SET_DEFAULT_POLICY procedure, as in the following example:

begin
  sem_rdfctx.set_default_policy (index_name => 'ArticleIndex',
                                 policy_name => 'CITY_EXTR');
end;
/