Best Practices


Additional Query Options

The RDF Graph feature allows you to specify additional query options. It implements these capabilities by using the SPARQL namespace prefix syntax to refer to Oracle-specific namespaces that contain these query options. The namespaces are defined in the form PREFIX ORACLE_SEM_FS_NS.

Additional query options can be passed to a SPARQL query by including a line in the following form:

PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#option>

The option reflects a query setting (or multiple query options delimited by commas) to be applied to the SPARQL query execution. For example:

PREFIX ORACLE_SEM_FS_NS:
<http://oracle.com/semtech#TIMEOUT=3,DOP=4,ORDERED>
SELECT * WHERE {?subject ?property ?object }
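The prefix line can also be assembled programmatically before the query is sent. The following is a minimal sketch using only plain Java strings; the class and method names (QueryOptionPrefix, build) are hypothetical helpers for illustration and are not part of the RDF Graph API.

```java
final class QueryOptionPrefix {
    // Namespace stem copied from the example above.
    static final String NAMESPACE = "http://oracle.com/semtech#";

    /** Joins the given options with commas and wraps them in a PREFIX line. */
    static String build(String... options) {
        return "PREFIX ORACLE_SEM_FS_NS: <" + NAMESPACE
                + String.join(",", options) + ">";
    }

    public static void main(String[] args) {
        // Prepend the generated prefix line to an ordinary SPARQL query.
        String query = build("TIMEOUT=3", "DOP=4", "ORDERED") + "\n"
                + "SELECT * WHERE { ?subject ?property ?object }";
        System.out.println(query);
    }
}
```

The resulting string can then be handed to whatever query execution API is in use, exactly as a hand-written query would be.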

The following query options are supported:

ASSERTED_ONLY causes only the asserted triples/quads to be queried.
BATCH=n specifies the size of the batches (n) used to execute concurrent retrieval of bindings. Using a batch size larger than the default of 1,000 (such as 5,000 or 10,000) when retrieving RDF data from the Oracle NoSQL Database may improve performance.
BEST_EFFORT_QUERY=T, when used with the TIMEOUT=n option, returns all matches found in n seconds for the SPARQL query.
DOP=n specifies the degree of parallelism (n) for the query. The default value is 1. With multi-core or multi-CPU processors, experimenting with different DOP values (such as 4 or 8) may improve performance. A good starting point for DOP can be the number of CPU cores, assuming the level of query concurrency is low. To ensure that no single query dominates the CPU resources, DOP should be set at a lower value when the number of concurrent requests increases.
INCLUDE=RULEBASE_ID=n specifies the rulebase ID to use when answering a SPARQL query. This query option will override any rulebase configuration defined at the SPARQL Service endpoint.
INF_ONLY causes only the inferred triples/quads to be queried.
JENA_EXECUTOR disables the compilation of SPARQL queries to the RDF Graph feature; instead, the Apache Jena native query executor will be used.
JOIN_METHOD={nl, hash} specifies how query patterns in a SPARQL query are joined: with a nested loop join (nl) or with a hash join (hash). For more information, see JOIN_METHOD option.
ORDERED specifies that query patterns in a SPARQL query should be executed in the same order as they are specified.
TIMEOUT=n (query timeout) specifies the number of seconds (n) that the query will run before it is terminated. The underlying query execution generated from a SPARQL query can return many matches and can use features like sub-queries and assignments, all of which can take considerable time. The TIMEOUT and BEST_EFFORT_QUERY=T options can be used to prevent what you consider excessive processing time for the query.
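The DOP guidance above (start near the number of CPU cores, lower the value as concurrency grows) can be sketched as a small helper. The even-split rule used here is an assumption chosen for illustration, not a formula from the RDF Graph feature, and the class and method names are hypothetical.

```java
final class DopSuggestion {
    /**
     * Hypothetical heuristic: divide the available cores evenly across the
     * expected number of concurrent queries, never going below 1.
     */
    static int suggestDop(int cpuCores, int concurrentQueries) {
        if (cpuCores < 1 || concurrentQueries < 1) {
            throw new IllegalArgumentException("both arguments must be at least 1");
        }
        return Math.max(1, cpuCores / concurrentQueries);
    }

    public static void main(String[] args) {
        // Low concurrency: one query may use every core on this machine.
        int cores = Runtime.getRuntime().availableProcessors();
        int dop = suggestDop(cores, 1);
        System.out.println(
            "PREFIX ORACLE_SEM_FS_NS: <http://oracle.com/semtech#DOP=" + dop + ">");
    }
}
```

Treat the result only as a starting point; the text above recommends experimenting with different DOP values.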

JOIN_METHOD option

A SPARQL query consists of one or more query patterns, conjunctions, disjunctions, and optional triple patterns. The RDF Graph feature processes triple patterns in the SPARQL query and executes join operations over their partial results to retrieve query results. The RDF Graph feature automatically analyzes the received SPARQL query and determines an execution plan using an efficient join operation between two query row sources: an outer (left) row source and an inner (right) row source. A query row source consists of a query pattern or the intermediate results from another join operation.

However, you can use the JOIN_METHOD option to tell the RDF Graph feature which join operation to use in SPARQL query execution. For example, consider the following query:

PREFIX ORACLE_SEM_FS_NS:<http://oracle.com/semtech#JOIN_METHOD=NL>
SELECT ?subject ?object ?grandkid 
WHERE {
?subject <u:parentOf>  ?object   .
?object  <u:parentOf>  ?grandkid .
}

In this case, the join method is set to nested loop join. The first (outer) query pattern of this query (in this case ?subject <u:parentOf> ?object) is executed against the Oracle NoSQL Database. Each binding of ?object from the results is then pushed into the second (inner) query pattern (in this case ?object <u:parentOf> ?grandkid), which in turn is executed against the Oracle NoSQL Database. Note that nested loop join operations can be executed only if the inner row source is a query pattern.

If the join method to use is set to hash join, both the outer row source and inner row source of this query will be executed against the Oracle NoSQL Database. All results from the outer row source (also called the build table) will be stored in a hash table structure with respect to its binding of ?object, as it is a common variable between the outer and inner row sources. Then, each binding of ?object from the inner row source (also called the probe table) will be hashed and matched against the hash data structure.
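The difference between the two join methods can be illustrated in memory. The sketch below is a simplified model, not the RDF Graph implementation: it assumes both row sources are already available as lists of (subject, object) pairs for <u:parentOf>, and the inner loop of the nested loop join stands in for re-executing the inner query pattern with ?object bound. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinSketch {
    /** One (subject, object) binding of a <u:parentOf> query pattern. */
    static final class Pair {
        final String subject;
        final String object;
        Pair(String subject, String object) {
            this.subject = subject;
            this.object = object;
        }
    }

    /** Nested loop join: push each outer ?object binding into the inner pattern. */
    static List<String> nestedLoopJoin(List<Pair> outer, List<Pair> inner) {
        List<String> rows = new ArrayList<>();
        for (Pair o : outer) {
            for (Pair i : inner) { // models running the inner pattern for o.object
                if (i.subject.equals(o.object)) {
                    rows.add(o.subject + " " + o.object + " " + i.object);
                }
            }
        }
        return rows;
    }

    /** Hash join: build a hash table on ?object from the outer side, then probe it. */
    static List<String> hashJoin(List<Pair> outer, List<Pair> inner) {
        Map<String, List<Pair>> buildTable = new HashMap<>();
        for (Pair o : outer) {
            buildTable.computeIfAbsent(o.object, k -> new ArrayList<>()).add(o);
        }
        List<String> rows = new ArrayList<>();
        for (Pair i : inner) { // probe side: ?object appears as the subject here
            for (Pair o : buildTable.getOrDefault(i.subject, Collections.<Pair>emptyList())) {
                rows.add(o.subject + " " + o.object + " " + i.object);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Pair> parentOf = Arrays.asList(
            new Pair("ann", "bob"), new Pair("bob", "cat"), new Pair("bob", "dan"));
        System.out.println(nestedLoopJoin(parentOf, parentOf));
        System.out.println(hashJoin(parentOf, parentOf));
    }
}
```

Both methods produce the same (?subject, ?object, ?grandkid) rows; they differ in how much work is done per outer binding versus how much memory the build table needs.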

SPARQL 1.1 federated query SERVICE Clause

When writing a SPARQL 1.1 federated query, you can set a limit on returned rows in the sub-query inside the SERVICE clause. This can effectively constrain the amount of data to be transported between the local repository and the remote SPARQL endpoint.

For example, the following query specifies a limit of 100 in the subquery in the SERVICE clause:

PREFIX : <http://example.com/>
SELECT ?s ?o 
WHERE 
{ 
?s :name "CA" 
SERVICE <http://REMOTE_SPARQL_ENDPOINT_HERE>
{ 
select ?s ?o 
{?s :info ?o} 
limit 100 
} 
}

Data sampling

Having sufficient statistics for the query optimizer is critical for good query performance. In general, you should ensure that you have gathered basic statistics for the RDF Graph feature to use during query execution. In Oracle NoSQL Database, these statistics are generated by maintaining data sampling.

Data sampling is defined as a representative subset of triples from an RDF graph (or dataset) stored in an Oracle NoSQL Database, generated at a certain point in time. The size of this subset is determined by the size of the overall data and a sampling rate. Data sampling is automatically performed when an RDF data file is loaded into or removed from the Oracle NoSQL Database. By default, the data sampling rate is 0.003 (or 3 per 1000). The default sampling rate may not be adequate for all database sizes. For substantially larger data sets, reducing the sampling rate keeps the count of sampled data manageable and may improve performance. For instance, performance may be improved by setting the sampling rate to 0.0001 for billions of triples and 0.00001 for trillions of triples.
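To see why a lower rate is suggested for larger data sets, it helps to estimate how many triples a given rate would sample. The sketch below assumes the expected sample size is simply the triple count multiplied by the sampling rate; the actual sampling procedure may differ, and the helper name is hypothetical.

```java
final class SamplingEstimate {
    /** Expected number of sampled triples, assuming size = tripleCount * rate. */
    static long expectedSampleSize(long tripleCount, double samplingRate) {
        if (tripleCount < 0 || samplingRate < 0.0 || samplingRate > 1.0) {
            throw new IllegalArgumentException("invalid triple count or sampling rate");
        }
        return Math.round(tripleCount * samplingRate);
    }

    public static void main(String[] args) {
        long million = 1_000_000L;
        long billion = 1_000_000_000L;
        long trillion = 1_000_000_000_000L;

        // Default rate on a small graph versus default and reduced rates on large ones.
        System.out.println(expectedSampleSize(million, 0.003));
        System.out.println(expectedSampleSize(billion, 0.003));
        System.out.println(expectedSampleSize(billion, 0.0001));
        System.out.println(expectedSampleSize(trillion, 0.003));
        System.out.println(expectedSampleSize(trillion, 0.00001));
    }
}
```

Under this assumption, the default rate applied to a trillion triples would sample about three billion of them, whereas a rate of 0.00001 keeps the sample near ten million.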

The data sampling service is provided through the analyze method of the RDF Graph feature's OracleGraphNoSql and DatasetGraphNoSql classes. This method reads all the data from the graph (or dataset) and generates a representative subset to use as the data sampling. You can choose the size of the data sampling by specifying the samplingRate. Note that any existing data sampling is removed when this operation is executed. More information about using analyze can be found in the API reference information (Javadoc).

The following example analyzes the data from a graph and generates a sampling subset with a sampling rate of 0.005 (or 5/1000).

public static void main(String[] args) throws Exception
{
    String szStoreName  = args[0];
    String szHostName   = args[1];
    String szHostPort   = args[2];

    System.out.println("Create Oracle NoSQL connection");
    OracleNoSqlConnection conn
        = OracleNoSqlConnection.createInstance(szStoreName,
                                               szHostName,
                                               szHostPort);

    System.out.println("Create named graph");
    OracleGraphNoSql graph = new OracleGraphNoSql(conn);

    // Start from an empty repository
    System.out.println("Clear graph");
    graph.clearRepository();

    System.out.println("Load data from file into a NoSQL database");
    DatasetGraphNoSql.load("family.rdf", Lang.RDFXML, conn,
                           "http://example.com"); // base URI

    // Regenerate the data sampling; any existing sampling is removed
    System.out.println("Analyze data");
    long sizeSamp = graph.analyze(0.005); // 5 out of 1000

    System.out.println("sampling size is " + sizeSamp);

    graph.close();
    conn.dispose();
}

Query hints

The RDF Graph feature allows you to include query optimization hints in a SPARQL query. It implements these capabilities by using the SPARQL namespace prefix syntax to refer to Oracle-specific namespaces that contain these hints. The namespace is defined in the form PREFIX ORACLE_SEM_HT_NS.

Query hints can be passed to a SPARQL query by including a line in the following form:

PREFIX ORACLE_SEM_HT_NS: <http://oracle.com/semtech#hint>

Where hint reflects any hint supported by the RDF Graph feature.

A query hint represents a helper for the RDF Graph feature to generate an execution plan used to execute a SPARQL query. An execution plan determines the way query patterns will be handled by the RDF Graph feature. This involves the following conditions:

The order in which query patterns in a Basic Graph Pattern will be executed.
How query patterns will be joined together in order to complete a query execution.
The join method (nested loop join or hash join) to pick in order to merge results retrieved from two query patterns or pre-calculated results.

An execution plan is written using post-fix notation. In this notation, each join operation (expressed as HJ or NLJ) is preceded by its operands (a query pattern or the result of another join operation). The order in which the operands of a join operation are presented matters, because the number of operations executed in the join is closely related to the size of these operands. This, in turn, affects the performance of the query execution.

Query patterns in a plan are expressed as QP<ID>, where ID represents the position of the query pattern with respect to the specified SPARQL query. Additionally, every join operation and its respective operands should be wrapped using parentheses.
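Post-fix plans can be hard to read at a glance. The following sketch evaluates a plan with a stack and prints it as a nested join tree; it is a reading aid written for this section under the notation described above, not the parser used by the RDF Graph feature, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

final class PostfixPlan {
    /** Converts a post-fix plan such as "((qp2 qp3 NLJ) qp1 HJ)" to infix form. */
    static String toInfix(String plan) {
        Deque<String> stack = new ArrayDeque<>();
        // Parentheses only group operands, so they can be dropped for evaluation.
        String cleaned = plan.replace("(", " ").replace(")", " ").trim();
        for (String token : cleaned.split("\\s+")) {
            String t = token.toUpperCase(Locale.ROOT);
            if (t.equals("HJ") || t.equals("NLJ")) {
                if (stack.size() < 2) {
                    throw new IllegalArgumentException("join " + t + " needs two operands");
                }
                String right = stack.pop(); // inner row source
                String left = stack.pop();  // outer row source
                stack.push("(" + left + " " + t + " " + right + ")");
            } else if (t.matches("QP\\d+")) {
                stack.push(t);
            } else {
                throw new IllegalArgumentException("unexpected token: " + token);
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("plan does not reduce to a single join tree");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(toInfix("((qp2 qp3 NLJ) qp1 HJ)"));
    }
}
```

A plan that leaves more than one entry on the stack, or that applies a join to fewer than two operands, is rejected as malformed.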

For example, consider the following SPARQL query that retrieves all pairs of names of people who know each other.

PREFIX ORACLE_SEM_HT_NS: <http://oracle.com/semtech#plan=((qp2%20qp3%20NLJ)%20qp1%20HJ)>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name1 ?name2
WHERE {
graph <http://example.org/graph> {
?person1 foaf:knows ?person2 .    #QP1
?person1 foaf:name ?name1 .       #QP2
?person2 foaf:name ?name2 .       #QP3
}
}

Suppose that we want to specify an execution plan that first performs a nested loop join operation between ?person1 foaf:name ?name1 (QP2) and ?person2 foaf:name ?name2 (QP3), and then performs a hash join operation between those results and the remaining query pattern ?person1 foaf:knows ?person2 (QP1). This plan can be defined using post-fix notation as follows:

(
(
( ?person1 foaf:name ?name1 )
( ?person2 foaf:name ?name2 )
NLJ )

( ?person1 foaf:knows ?person2 )
HJ )

This execution plan can be passed to the RDF Graph feature using the query hint PLAN=encoded_plan, where encoded_plan is a URL-encoded representation of an execution plan that executes all the query patterns included in a SPARQL query using hash join or nested loop join operations. Query hints can only be applied to SPARQL queries with a single BGP.

Note that if a plan is not UTF-8 encoded, does not include all query patterns in a SPARQL query, or is syntactically incorrect, this hint will be ignored and the RDF Graph feature will continue with a default query optimization and execution. For information about queries and join operations, see JOIN_METHOD option.
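A plan written with ordinary spaces can be turned into the hint line with a small helper. The sketch below only replaces whitespace with %20, which matches the encoded plan shown in the example query; this is an assumption about what the feature accepts. Note that java.net.URLEncoder is deliberately not used here, because it encodes spaces as + and also escapes parentheses, which would not reproduce the form used in this section. The class and method names are hypothetical.

```java
final class PlanHint {
    /** Replaces each run of whitespace in the plan with a single %20. */
    static String encode(String plan) {
        return plan.trim().replaceAll("\\s+", "%20");
    }

    /** Builds the PREFIX line that carries the encoded plan as a query hint. */
    static String prefixLine(String plan) {
        return "PREFIX ORACLE_SEM_HT_NS: <http://oracle.com/semtech#plan="
                + encode(plan) + ">";
    }

    public static void main(String[] args) {
        // Post-fix plan: join QP2 and QP3 with NLJ, then join the result with QP1 using HJ.
        System.out.println(prefixLine("((qp2 qp3 NLJ) qp1 HJ)"));
    }
}
```

Because an invalid plan is silently ignored, it may be worth checking the plan's structure before encoding it.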
