5.5 Searching for Documents Using SPARQL Query Patterns

Documents that are semantically indexed (that is, indexed using the mdsys.SemContext index type) can be searched using the SEM_CONTAINS operator within a standard SQL query.

In the query, the SEM_CONTAINS operator must have at least two parameters, the first specifying the column in which the documents are stored and the second specifying the document search criteria expressed as a SPARQL query pattern, as in the following example:

SELECT docId FROM Newsfeed
WHERE  SEM_CONTAINS (article,
  '{ ?org  rdf:type  <http://www.example.com/classes/Organization> .
     ?org  <http://example.com/pred/hasCategory>
           <http://www.example.com/category/BusinessFinance> }'
  ) = 1;

The SPARQL query pattern specified with the SEM_CONTAINS operator is matched against the individual graph corresponding to each document, and a document is considered to match the search criteria if the triples from its graph satisfy the query pattern. In the preceding example, the SPARQL query pattern identifies the individual graphs (and thus the documents) that refer to an Organization that belongs to the BusinessFinance category. The SQL query returns the rows corresponding to the matching documents in its result set. The preceding example assumes that the URIs used in the query are generated by the underlying extractor, and that you (the user searching for documents) are aware of the properties and terms that the extractor in use generates.

When you create an index using a dependent extractor policy that includes one or more user-defined RDF graphs, the triples asserted in the user RDF graphs are considered to be common to all the documents. Document searches involving such policies test the search criteria against the triples in individual graphs corresponding to the documents, combined with the triples in the user RDF graphs. For example, the following query identifies all articles referring to organizations in the state of New Hampshire, using the geographical ontology (geo_ontology RDF graph from a preceding example) that maps cities to states:

SELECT docId FROM Newsfeed
WHERE  SEM_CONTAINS (article,
         '{ ?org   rdf:type          class:Organization .
            ?org   pred:hasLocation  ?city .
            ?city  geo:hasState      state:NewHampshire }',
         'SEM_EXTR_PLUS_GEOONT',
         sem_aliases(
           sem_alias('class', 'http://www.myorg.com/classes/'),
           sem_alias('pred',  'http://www.myorg.com/pred/'),
           sem_alias('geo',   'http://geoont.org/rel/'),
           sem_alias('state', 'http://geoont.org/state/'))) = 1;

The preceding query, with a reference to the extractor policy SEM_EXTR_PLUS_GEOONT (created in an example in Extractor Policies), combines the triples extracted from the indexed documents with the triples in the user RDF graph to find matching documents. In this example, the name of the extractor policy is optional if the corresponding index was created with just this policy or if this is the default extractor policy for the index. When the query pattern uses qualified names, an optional parameter to the SEM_CONTAINS operator can specify the namespaces to be used for expanding the qualified names.
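
As an illustration of omitting the policy name, the following sketch repeats the preceding query with NULL in the policy position while still supplying the namespace aliases. It assumes that SEM_EXTR_PLUS_GEOONT is the only (or the default) extractor policy for the index on the article column, and that the operator accepts NULL for an optional positional argument; verify both against your installation before relying on it.

```sql
-- Sketch only: assumes SEM_EXTR_PLUS_GEOONT is the default policy for the
-- index on Newsfeed.article, so the policy argument can be passed as NULL.
SELECT docId FROM Newsfeed
WHERE  SEM_CONTAINS (article,
         '{ ?org   rdf:type          class:Organization .
            ?org   pred:hasLocation  ?city .
            ?city  geo:hasState      state:NewHampshire }',
         NULL,  -- policy name omitted; the default policy is assumed to apply
         sem_aliases(
           sem_alias('class', 'http://www.myorg.com/classes/'),
           sem_alias('pred',  'http://www.myorg.com/pred/'),
           sem_alias('geo',   'http://geoont.org/rel/'),
           sem_alias('state', 'http://geoont.org/state/'))) = 1;
```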

SPARQL-based document searches can make use of the SPARQL syntax that is supported through SEM_MATCH queries.
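
For instance, a query pattern might include a FILTER clause, which is part of the SPARQL syntax available in SEM_MATCH. The following is a minimal sketch, not a tested query: the property http://example.com/pred/hasName is hypothetical and stands in for whatever name property your extractor actually generates.

```sql
-- Sketch only: <http://example.com/pred/hasName> is a hypothetical property;
-- substitute the property that your extractor produces.
SELECT docId FROM Newsfeed
WHERE  SEM_CONTAINS (article,
  '{ ?org  rdf:type  <http://www.example.com/classes/Organization> .
     ?org  <http://example.com/pred/hasName>  ?name .
     FILTER (regex(?name, "^Acme")) }'
  ) = 1;
```

Under these assumptions, the query would return only documents whose graphs mention an Organization with a name beginning with "Acme".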