7.7 Additions to the SPARQL Syntax to Support Other Features

RDF Graph support for Apache Jena allows you to pass in hints and additional query options. It implements these capabilities by overloading the SPARQL PREFIX syntax with Oracle-specific namespaces that contain query options.

The namespaces are in the form PREFIX ORACLE_SEM_xx_NS, where xx indicates the type of feature (such as HT for hint or AP for additional predicate).

7.7.1 SQL Hints

SQL hints can be passed to a SEM_MATCH query by including a line in the following form:

PREFIX ORACLE_SEM_HT_NS: <http://oracle.com/semtech#hint>

Where hint can be any hint supported by SEM_MATCH. For example:

PREFIX ORACLE_SEM_HT_NS: <http://oracle.com/semtech#leading(t0,t1)> 
SELECT ?book ?title ?isbn     
WHERE { ?book <http://title> ?title. ?book <http://ISBN> ?isbn }

In this example, t0 and t1 refer to the first and second triple patterns in the query.

Note the slight difference in specifying hints when compared to SEM_MATCH. Due to restrictions of namespace value syntax, a comma (,) must be used to separate t0 and t1 (or other hint components) instead of a space.

For more information about using SQL hints, see Using the SEM_MATCH Table Function to Query RDF Data, specifically the material about the HINT0 keyword in the options attribute.
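
The comma rule described above can be applied mechanically. The following is a minimal sketch, not part of the product API: the class and method names (HintPrefix, hintPrefix, withHint) are hypothetical, and it assumes that every run of whitespace inside the hint text can simply be replaced by a comma.

```java
final class HintPrefix {
    // Namespace used by the hint example above; the hint follows the '#'.
    static final String NS = "http://oracle.com/semtech#";

    // Converts a SEM_MATCH-style hint such as "leading(t0 t1)" into a PREFIX line.
    // Assumption: whitespace is the only thing that has to change.
    static String hintPrefix(String semMatchHint) {
        String hint = semMatchHint.trim()
                .replaceAll("\\(\\s+", "(")   // drop padding after '('
                .replaceAll("\\s+\\)", ")")   // drop padding before ')'
                .replaceAll("\\s+", ",");     // remaining spaces become commas
        return "PREFIX ORACLE_SEM_HT_NS: <" + NS + hint + ">";
    }

    // Prepends the hint pragma to an existing SPARQL query string.
    static String withHint(String semMatchHint, String sparql) {
        return hintPrefix(semMatchHint) + "\n" + sparql;
    }

    public static void main(String[] args) {
        String query = "SELECT ?book ?title ?isbn\n"
                + "WHERE { ?book <http://title> ?title. ?book <http://ISBN> ?isbn }";
        System.out.println(withHint("leading(t0 t1)", query));
    }
}
```

The resulting string can be handed to whatever query-creation call the application already uses.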

7.7.2 Using Bind Variables in SPARQL Queries

In Oracle Database, using bind variables can reduce query parsing time and increase query efficiency and concurrency. Bind variable support in SPARQL queries is provided through namespace pragma specifications similar to ORACLE_SEM_FS_NS.

Consider a case where an application runs two SPARQL queries, where the second (Query 2) depends on the partial or complete results of the first (Query 1). Some approaches that do not involve bind variables include:

  • Iterating through results of Query 1 and generating a set of queries. (However, this approach requires as many queries as the number of results of Query 1.)

  • Constructing a SPARQL filter expression based on results of Query 1.

  • Treating Query 1 as a subquery.

Another approach in this case is to use bind variables, as in the following sample scenario:

Query 1:
 
  SELECT ?x
    WHERE { ... <some complex query> ... };
 
 
Query 2:
 
  SELECT ?subject ?x
    WHERE {?subject <urn:related> ?x .};

The following example shows Query 2 with the syntax for using bind variables with the support for Apache Jena:

PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#no_fall_back,s2s>
PREFIX ORACLE_SEM_UEAP_NS: <http://oracle.com/semtech#x$RDFVID%20in(?,?,?)>
PREFIX ORACLE_SEM_UEPJ_NS: <http://oracle.com/semtech#x$RDFVID>
PREFIX ORACLE_SEM_UEBV_NS: <http://oracle.com/semtech#1,2,3>
SELECT ?subject ?x
WHERE {
  ?subject <urn:related>  ?x
};

This syntax includes using the following namespaces:

  • ORACLE_SEM_UEAP_NS is like ORACLE_SEM_AP_NS, but the value portion of ORACLE_SEM_UEAP_NS is URL encoded. The value portion is URL decoded before it is used, and it is then treated as an additional predicate to the SPARQL query.

    In this example, after URL decoding, the value portion (following the # character) of this ORACLE_SEM_UEAP_NS prefix becomes "x$RDFVID in(?,?,?)". The three question marks imply a binding to three values coming from Query 1.

  • ORACLE_SEM_UEPJ_NS specifies the additional projections involved. In this case, because ORACLE_SEM_UEAP_NS references the x$RDFVID column, which does not appear in the SELECT clause of the query, it must be specified. Multiple projections are separated by commas.

  • ORACLE_SEM_UEBV_NS specifies the list of bind values. Each value is URL encoded first, and the encoded values are then concatenated with commas as delimiters.

Conceptually, the preceding example query is equivalent to the following non-SPARQL syntax query, in which 1, 2, and 3 are treated as bind values:

SELECT ?subject ?x
  WHERE {
    ?subject <urn:related>  ?x
  }
  AND ?x$RDFVID in (1,2,3);

In the preceding SPARQL example of Query 2, the three integers 1, 2, and 3 come from Query 1. You can use the oext:build-uri-for-id function to generate such internal integer IDs for RDF resources. The following example gets the internal integer IDs from Query 1:

PREFIX oext: <http://oracle.com/semtech/jena-adaptor/ext/function#>
SELECT ?x  (oext:build-uri-for-id(?x) as ?xid)
WHERE { ... <some complex query> ... };

The values of ?xid have the form of <rdfvid:integer-value>. The application can strip out the angle brackets and the "rdfvid:" strings to get the integer values and pass them to Query 2.
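
The two steps above (extracting the integer IDs returned by Query 1 and feeding them to Query 2 as bind values) might be wired together as in the following sketch. It only builds strings. The class and method names are hypothetical, and how the ?xid values are read from the result set is left to the application. Only the space in the fixed predicate is escaped, which reproduces the documented example. The bind values are passed through java.net.URLEncoder for generality, even though plain integers come out unchanged.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BindVariableQuery {
    static final String NS = "http://oracle.com/semtech#";

    // Turns a value such as "<rdfvid:42>" into 42.
    static long parseRdfVid(String xid) {
        String s = xid.trim();
        if (s.startsWith("<") && s.endsWith(">")) {
            s = s.substring(1, s.length() - 1);
        }
        if (!s.startsWith("rdfvid:")) {
            throw new IllegalArgumentException("Not an rdfvid value: " + xid);
        }
        return Long.parseLong(s.substring("rdfvid:".length()));
    }

    // Builds Query 2 with one '?' placeholder per ID obtained from Query 1.
    static String buildQuery2(List<Long> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("At least one bind value is required");
        }
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        // Only the space needs escaping in this fixed predicate.
        String predicate = ("x$RDFVID in(" + placeholders + ")").replace(" ", "%20");
        List<String> encoded = new ArrayList<>();
        for (long id : ids) {
            encoded.add(URLEncoder.encode(Long.toString(id), StandardCharsets.UTF_8));
        }
        return String.join("\n",
                "PREFIX ORACLE_SEM_FS_NS: <" + NS + "no_fall_back,s2s>",
                "PREFIX ORACLE_SEM_UEAP_NS: <" + NS + predicate + ">",
                "PREFIX ORACLE_SEM_UEPJ_NS: <" + NS + "x$RDFVID>",
                "PREFIX ORACLE_SEM_UEBV_NS: <" + NS + String.join(",", encoded) + ">",
                "SELECT ?subject ?x",
                "WHERE {",
                "  ?subject <urn:related>  ?x",
                "}");
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (String xid : List.of("<rdfvid:1>", "<rdfvid:2>", "<rdfvid:3>")) {
            ids.add(parseRdfVid(xid));
        }
        System.out.println(buildQuery2(ids));
    }
}
```

Because the number of placeholders is derived from the number of IDs, the query text stays the same for any two result sets of equal size; only the ORACLE_SEM_UEBV_NS line changes.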

Consider another case, with a single query structure but potentially many different constants. For example, the following SPARQL query finds the hobby for each user who has a hobby and who logs in to an application. Because users of the application are represented by different URIs, each user supplies a different <uri> value to this SPARQL query.

SELECT ?hobby
  WHERE { <uri> <urn:hasHobby> ?hobby };

One approach, which would not use bind variables, is to generate a different SPARQL query for each different <uri> value. For example, user Jane Doe might trigger the execution of the following SPARQL query:

SELECT ?hobby WHERE {
<http://www.example.com/Jane_Doe> <urn:hasHobby> ?hobby };

However, another approach is to use bind variables, as in the following example specifying user Jane Doe:

PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#no_fall_back,s2s>
PREFIX ORACLE_SEM_UEAP_NS: <http://oracle.com/semtech#subject$RDFVID%20in(ORACLE_ORARDF_RES2VID(?))>
PREFIX ORACLE_SEM_UEPJ_NS: <http://oracle.com/semtech#subject$RDFVID>
PREFIX ORACLE_SEM_UEBV_NS: <http://oracle.com/semtech#http%3a%2f%2fwww.example.com%2fJane_Doe>
SELECT ?subject ?hobby
  WHERE {
    ?subject <urn:hasHobby>  ?hobby
  };

Conceptually, the preceding example query is equivalent to the following non-SPARQL syntax query, in which http://www.example.com/Jane_Doe is treated as a bind value:

SELECT ?subject ?hobby
WHERE {
  ?subject <urn:hasHobby>  ?hobby
}
AND ?subject$RDFVID in (ORACLE_ORARDF_RES2VID('http://www.example.com/Jane_Doe'));

In this example, ORACLE_ORARDF_RES2VID is a function that translates URIs and literals into their internal integer ID representation. This function is created automatically when the support for Apache Jena is used to connect to an Oracle database.
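
In the per-user case, only the bind value changes from one user to the next. The following minimal sketch (class and method names are hypothetical) generates the query text for any user URI. It uses java.net.URLEncoder, which emits uppercase hex digits (%3A rather than %3a); the sketch assumes that URL decoding treats both forms the same way.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HobbyQuery {
    static final String NS = "http://oracle.com/semtech#";

    // Returns the bind-variable form of the hobby query for one user URI.
    static String forUser(String userUri) {
        // URLEncoder writes spaces as '+'; rewrite them as %20 for use in a URI.
        String bindValue = URLEncoder.encode(userUri, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return String.join("\n",
                "PREFIX ORACLE_SEM_FS_NS: <" + NS + "no_fall_back,s2s>",
                "PREFIX ORACLE_SEM_UEAP_NS: <" + NS
                        + "subject$RDFVID%20in(ORACLE_ORARDF_RES2VID(?))>",
                "PREFIX ORACLE_SEM_UEPJ_NS: <" + NS + "subject$RDFVID>",
                "PREFIX ORACLE_SEM_UEBV_NS: <" + NS + bindValue + ">",
                "SELECT ?subject ?hobby",
                "WHERE {",
                "  ?subject <urn:hasHobby>  ?hobby",
                "}");
    }

    public static void main(String[] args) {
        System.out.println(forUser("http://www.example.com/Jane_Doe"));
    }
}
```

Two calls with different URIs produce query strings that differ only in the ORACLE_SEM_UEBV_NS line.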

7.7.3 Additional WHERE Clause Predicates

The SEM_MATCH filter attribute can specify additional selection criteria as a string in the form of a WHERE clause without the WHERE keyword. Additional WHERE clause predicates can be passed to a SEM_MATCH query by including a line in the following form:

PREFIX ORACLE_SEM_AP_NS: <http://oracle.com/semtech#pred>

Where pred reflects the WHERE clause content to be appended to the query. For example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ORACLE_SEM_AP_NS: <http://oracle.com/semtech#label$RDFLANG='fr'>
SELECT DISTINCT ?inst ?label
  WHERE { ?inst a <http://someCLass>. ?inst rdfs:label ?label . }
  ORDER BY (?label) LIMIT 20

In this example, a restriction is added to the query that the language type of the label variable must be 'fr'.
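
Because the predicate text is appended to the generated query, an application that builds it from user input would probably want to validate that input first. The following sketch is one hypothetical way to do so for the language restriction shown above. The class name, the method name, and the two validation patterns are assumptions, and the namespace follows the form given at the start of this section.

```java
import java.util.regex.Pattern;

final class LanguagePredicate {
    // Conservative patterns (assumptions): a simple variable name and a
    // language tag made of letters, digits, and hyphens.
    private static final Pattern VAR = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    private static final Pattern LANG = Pattern.compile("[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

    // Builds the ORACLE_SEM_AP_NS line that restricts ?variable to one language.
    static String prefixFor(String variable, String lang) {
        if (!VAR.matcher(variable).matches()) {
            throw new IllegalArgumentException("Invalid variable name: " + variable);
        }
        if (!LANG.matcher(lang).matches()) {
            throw new IllegalArgumentException("Invalid language tag: " + lang);
        }
        return "PREFIX ORACLE_SEM_AP_NS: <http://oracle.com/semtech#"
                + variable + "$RDFLANG='" + lang + "'>";
    }

    public static void main(String[] args) {
        System.out.println(prefixFor("label", "fr"));
    }
}
```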

7.7.4 Additional Query Options

Additional query options can be passed to a SEM_MATCH query by including a line in the following form:

PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#option>

Where option reflects a query option (or multiple query options delimited by commas) to be appended to the query. For example:

PREFIX ORACLE_SEM_FS_NS:   
<http://oracle.com/semtech#timeout=3,dop=4,INF_ONLY,ORDERED,ALLOW_DUP=T>
SELECT * WHERE {?subject ?property ?object }

The following query options are supported:

  • ALLOW_DUP=t chooses a faster way to query multiple RDF graphs, although duplicate results may occur.

  • BEST_EFFORT_QUERY=t, when used with the TIMEOUT=n option, returns all matches found in n seconds for the SPARQL query.

  • DEGREE=n specifies, at the statement level, the degree of parallelism (n) for the query. With multi-core or multi-CPU processors, experimenting with different DEGREE values (such as 4 or 8) may improve performance.

    Contrast DEGREE with DOP, which specifies parallelism at the session level. DEGREE is recommended over DOP for use with the support for Apache Jena, because DEGREE involves less processing overhead.

  • DOP=n specifies, at the session level, the degree of parallelism (n) for the query. With multi-core or multi-CPU processors, experimenting with different DOP values (such as 4 or 8) may improve performance.

  • FETCH_SIZE=n specifies the JDBC fetch size parameter (the number of rows to be read from the result set and put in memory on one trip to the database). This parameter can be used to improve performance. A higher value means fewer trips to the database to retrieve all results. The default value is 1000.

  • INF_ONLY causes only the inferred model to be queried.

  • JENA_EXECUTOR disables the compilation of SPARQL queries to SEM_MATCH (or native SQL); instead, the Jena native query executor will be used.

  • JOIN=n specifies how results from a SPARQL SERVICE call to a federated query can be joined with other parts of the query. For information about federated queries and the JOIN option, see JOIN Option and Federated Queries.

  • NO_FALL_BACK causes the underlying query execution engine not to fall back on the Jena execution mechanism if a SQL exception occurs.

  • ODS=n specifies, at the statement level, the level of dynamic sampling. (For an explanation of dynamic sampling, see the section about estimating statistics with dynamic sampling in Oracle Database SQL Tuning Guide.) Valid values for n are 1 through 10. For example, you could try ODS=3 for complex queries.

  • ORDERED is translated to a LEADING SQL hint for the query triple pattern joins, while performing the necessary RDF_VALUE$ joins last.

  • PLAIN_SQL_OPT=F disables the native compilation of queries directly to SQL.

  • QID=n specifies a query ID number; this feature can be used to cancel the query if it is not responding.

  • RESULT_CACHE uses the Oracle RESULT_CACHE directive for the query.

  • REWRITE=F disables ODCI_Table_Rewrite for the SEM_MATCH table function.

  • S2S (SPARQL to pure SQL) causes the underlying SEM_MATCH-based query or queries generated based on the SPARQL query to be further converted into SQL queries without using the SEM_MATCH table function. The resulting SQL queries are executed by the Oracle cost-based optimizer, and the results are processed by the support for Apache Jena before being passed on to the client. For more information about the S2S option, including benefits and usage information, see S2S Option Benefits and Usage Information.

    S2S is enabled by default for all SPARQL queries. If you want to disable S2S, set the following JVM system property:

    -Doracle.spatial.rdf.client.jena.defaultS2S=false
    
  • SKIP_CLOB=T causes CLOB values not to be returned for the query.

  • STRICT_DEFAULT=F allows the default graph to include triples in named graphs. (By default, STRICT_DEFAULT=T restricts the default graph to unnamed triples when no data set information is specified.)

  • TIMEOUT=n (query timeout) specifies the number of seconds (n) that the query will run until it is terminated. The underlying SQL generated from a SPARQL query can return many matches and can use features like subqueries and assignments, all of which can take considerable time. The TIMEOUT and BEST_EFFORT_QUERY=t options can be used to prevent what you consider excessive processing time for the query.
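
When several of these options are combined, assembling the comma-delimited list by hand is error-prone. The following is a minimal sketch of a helper that does it. QueryOptions and its methods are hypothetical names, not part of the support for Apache Jena, and the helper does not check whether an option name is one of those listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class QueryOptions {
    // Insertion order is kept so the pragma reads the way it was written.
    private final Map<String, String> options = new LinkedHashMap<>();

    // Adds an option that takes no value, such as INF_ONLY or ORDERED.
    QueryOptions flag(String name) {
        options.put(name, null);
        return this;
    }

    // Adds an option of the form NAME=value, such as TIMEOUT=3.
    QueryOptions set(String name, Object value) {
        options.put(name, String.valueOf(value));
        return this;
    }

    // Renders the ORACLE_SEM_FS_NS pragma line.
    String toPrefix() {
        if (options.isEmpty()) {
            throw new IllegalStateException("No options were specified");
        }
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joined.add(e.getValue() == null ? e.getKey() : e.getKey() + "=" + e.getValue());
        }
        return "PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#" + joined + ">";
    }

    // Prepends the pragma line to a SPARQL query string.
    String applyTo(String sparql) {
        return toPrefix() + "\n" + sparql;
    }

    public static void main(String[] args) {
        String query = new QueryOptions()
                .set("timeout", 3).set("dop", 4)
                .flag("INF_ONLY").flag("ORDERED").set("ALLOW_DUP", "T")
                .applyTo("SELECT * WHERE {?subject ?property ?object }");
        System.out.println(query);
    }
}
```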

7.7.4.1 JOIN Option and Federated Queries

A SPARQL federated query, as described in W3C documents, is a query "over distributed data" that entails "querying one source and using the acquired information to constrain queries of the next source." For more information, see SPARQL 1.1 Federation Extensions (http://www.w3.org/2009/sparql/docs/fed/service).

You can use the JOIN option (described in Additional Query Options) and the SERVICE keyword in a federated query that uses the support for Apache Jena. For example, assume the following query:

SELECT ?s ?s1 ?o
WHERE {
  ?s1 ?p1 ?s .
  {
    SERVICE <http://sparql.org/books> { ?s ?p ?o }
  }
}

If the local query portion (?s1 ?p1 ?s) is very selective, you can specify join=2, as shown in the following query:

PREFIX ORACLE_SEM_FS_NS:   <http://oracle.com/semtech#join=2>
SELECT ?s ?s1 ?o
WHERE {
  ?s1 ?p1 ?s .
  {
    SERVICE <http://sparql.org/books> { ?s ?p ?o }
  }
}

In this case, the local query portion (?s1 ?p1 ?s) is executed locally against the Oracle database. Each binding of ?s from the results is then pushed into the SERVICE part (remote query portion), and a call is made to the service endpoint specified. Conceptually, this approach is somewhat like a nested loop join.

If the remote query portion (?s ?p ?o) is very selective, you can specify join=3, as shown in the following query, so that the remote portion is executed first and its results are used to drive the execution of the local portion:

PREFIX ORACLE_SEM_FS_NS:   <http://oracle.com/semtech#join=3>
SELECT ?s ?s1 ?o
WHERE {
  ?s1 ?p1 ?s .
  {
    SERVICE <http://sparql.org/books> { ?s ?p ?o }
  }
}

In this case, a single call is made to the remote service endpoint and each binding of ?s triggers a local query. As with join=2, this approach is conceptually a nested loop join, but the order of the two sides is switched.

If neither the local query portion nor the remote query portion is very selective, you can specify join=1, as shown in the following query:

PREFIX ORACLE_SEM_FS_NS:   <http://oracle.com/semtech#join=1>
SELECT ?s ?s1 ?o
WHERE {
  ?s1 ?p1 ?s .
  {
    SERVICE <http://sparql.org/books> { ?s ?p ?o }
  }
}

In this case, the remote query portion and the local portion are executed independently, and the results are joined together by Jena. Conceptually, this approach is somewhat like a hash join.

For debugging or tracing federated queries, you can use the HTTP Analyzer in Oracle JDeveloper to see the underlying SERVICE calls.
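
The guidance above on choosing a JOIN value can be summarized as a small decision function. The following sketch is illustrative only. The names are hypothetical, the selectivity flags are assumed to come from the application's own knowledge of its data, and the case where both portions are selective is not covered by this section, so the sketch arbitrarily treats it like the local-selective case.

```java
final class FederatedJoin {
    // Returns the JOIN value suggested by the selectivity of each portion.
    static int chooseJoin(boolean localSelective, boolean remoteSelective) {
        if (localSelective) {
            return 2;  // local portion first; its bindings drive the SERVICE calls
                       // (also used when both are selective: an assumption)
        }
        if (remoteSelective) {
            return 3;  // remote portion first; its bindings drive local queries
        }
        return 1;      // run both independently and let Jena join the results
    }

    // Prepends the corresponding pragma line to a federated SPARQL query.
    static String withJoin(int join, String sparql) {
        return "PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#join=" + join + ">\n"
                + sparql;
    }

    public static void main(String[] args) {
        String query = "SELECT ?s ?s1 ?o\n"
                + "WHERE {\n"
                + "  ?s1 ?p1 ?s .\n"
                + "  { SERVICE <http://sparql.org/books> { ?s ?p ?o } }\n"
                + "}";
        System.out.println(withJoin(chooseJoin(true, false), query));
    }
}
```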

7.7.4.2 S2S Option Benefits and Usage Information

The S2S option, described in Additional Query Options, provides the following potential benefits:

  • It works well with the RESULT_CACHE option to improve query performance. Using the S2S and RESULT_CACHE options is especially helpful for queries that are executed frequently.

  • It reduces the parsing time of the SEM_MATCH table function, which can be helpful for applications that involve many dynamically generated SPARQL queries.

  • It eliminates the limit of 4000 bytes for the query body (the first parameter of the SEM_MATCH table function), which means that longer, more complex queries are supported.

The S2S option causes an internal in-memory cache to be used for translated SQL query statements. The default size of this internal cache is 1024 (that is, 1024 SQL queries); however, you can adjust the size by using the following Java VM property:

-Doracle.spatial.rdf.client.jena.queryCacheSize=<size>

7.7.5 Midtier Resource Caching

When RDF data is stored, all of the resource values are hashed into IDs, which are stored in the triples table. The mappings from value IDs to full resource values are stored in the RDF_VALUE$ table. At query time, for each selected variable, Oracle Database must perform a join with the RDF_VALUE$ table to retrieve the resource.

However, to reduce the number of joins, you can use the midtier cache option, which causes an in-memory cache on the middle tier to be used for storing mappings between value IDs and resource values. To use this feature, include the following PREFIX pragma in the SPARQL query:

PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#midtier_cache>

To control the maximum size (in bytes) of the in-memory cache, use the oracle.spatial.rdf.client.jena.cacheMaxSize system property. The default cache maximum size is 1GB.

Midtier resource caching is most effective for queries using ORDER BY or DISTINCT (or both) constructs, or queries with multiple projection variables. Midtier cache can be combined with the other options specified in Additional Query Options.

If you want to pre-populate the cache with all of the resources in an RDF graph, use the GraphOracleSem.populateCache or DatasetGraphOracleSem.populateCache method. Both methods take a parameter specifying the number of threads used to build the internal midtier cache. Running either method in parallel can significantly increase the cache building performance on a machine with multiple CPUs (cores).