Introducing Indices

Introducing Indices
Prev	Chapter 2. XQuery and Berkeley DB XML	Next

One major advantage of modern native XML databases is their ability to index the XML documents they contain. Proper use of indexes can significantly reduce the time required to execute a particular XQuery expression. The previous examples likely executed in a perceptible amount of time, because BDB XML was evaluating each and every document in the container against the query. Without indexes, BDB XML has no choice but to review each document in turn. With indexes, BDB XML can find a subset of matching documents with a single, or significantly reduced, set of lookups. By carefully applying BDB XML indexing strategies we can improve retrieval performance considerably.

Note

By default, BDB XML turns several useful indexes on so you do not have to worry about them. However, for the purposes of this document we turned them off at the beginning of this chapter. (Using the shell setAutoIndexing command). We do this here so we can see relative performance differences between containers with no indexes, and containers with the indexes that we set.

To examine the usefulness of our indexes, we will use the time command with each of our queries. This will report how long it takes for each operation to complete.

Note

The following query execution times are relative to the computer and operating system used by the author. Your query times will differ as they depend on many qualities of your system. However, the percentage in improvement in query execution time should be relatively similar.

Recall the first structural query:

time query '
collection("parts.dbxml")/part[parent-part]'
10000 objects returned for eager expression '
collection("parts.dbxml")/part[parent-part]'

Time in seconds for command 'query': 0.437096

Notice the query execution time. This query takes almost a half a second to execute because the query is examining each document in turn as it searches for the presence of a parent-part element. To improve our performance, we want to specify an index that allows BDB XML to identify the subset of documents containing the parent-part element without actually examining each document.

Indices are specified in four parts: path type, node type, key type, and uniqueness. This query requires an index of the node elements to determine if something is present or not. Because the pattern is not expected to be unique, we do not want to turn on uniqueness. Therefore, the BDB XML index type that we should use is node-element-presence-none.

dbxml> addIndex "" parent-part node-element-presence-none
Adding index type: node-element-presence-none to node: {}:parent-part

dbxml> time query '
collection("parts.dbxml")/part[parent-part]'
10000 objects returned for eager expression '
collection("parts.dbxml")/part[parent-part]'

Our query time improved from .4 seconds to .2 seconds. As containers grow in size or complexity, indexes increase performance even more dramatically.

The previous index will also improve the performance of the value query designed to search for the value of the parent-part element. But for better results, we should index the node as a double value. (You use double here instead of decimal because the XQuery specification indicates that implicit numerical casts should be cast to double).

To do this, use a node-element-equality-double index.

dbxml> time query '
collection("parts.dbxml")/part[parent-part = 1]'
3333 objects returned for eager expression '
collection("parts.dbxml")/part[parent-part = 1]'

Time in seconds for command 'query': 0.511752
                
dbxml> addIndex "" parent-part node-element-equality-double

Adding index type: node-element-equality-decimal to node: {}:parent-part
dbxml> time query '
collection("parts.dbxml")/part[parent-part = 1]'
3333 objects returned for eager expression '
collection("parts.dbxml")/part[parent-part = 1]'

Time in seconds for command 'query': 0.070674

Additional indexes will improve performance for the other value queries.

dbxml> time query '
collection("parts.dbxml")/part[@number > 100 and @number < 105]'
4 objects returned for eager expression
'collection("parts.dbxml")/part[@number > 100 and @number < 105]'

Time in seconds for command 'query': 5.06106

At over 5 seconds there is plenty of room for improvement. To improve our range query, we can provide an index for the number attribute:

dbxml>  addIndex "" number node-attribute-equality-double

Adding index type: node-attribute-equality-double to node: {}:number

dbxml> time query '
collection("parts.dbxml")/part[@number > 100 and @number < 105]'
4 objects returned for eager expression '
collection("parts.dbxml")/part[@number > 100 and @number < 105]'

Time in seconds for command 'query': 3.33212

As you can see, proper use of indexes can dramatically effect query performance.

Note

We mentioned at the beginning of this section that we had turned auto indexing off. If we had left it on, the container would have automatically had the following indexes:

        node-element-string-equality
        node-attribute-string-equality
        node-element-double-equality
        node-attribute-double-equality

These indexes would have been added for all attribute and leaf nodes. For this example, the indexes would have been added for the <description>, <category> and <parent-part> nodes. They would have also been added for the number attribute on the <part> node.

Prev	Up	Next
Value Queries	Home	Reshaping the Result