Chapter 1. Introduction to Berkeley DB XML

Table of Contents

Overview
Benefits
XML Features
Database Features
Languages and Platforms
Getting and Using BDB XML
Documentation and Support
Library Dependencies
Building and Running BDB XML Applications

Welcome to Berkeley DB XML (BDB XML). BDB XML is an embedded database specifically designed for the storage and retrieval of XML-formatted documents. Built on the award-winning Berkeley DB, BDB XML provides for efficient queries against millions of XML documents using XQuery. XQuery is a query language designed for the examination and retrieval of portions of XML documents.

This document introduces BDB XML. It is intended to provide a rapid introduction to the BDB XML API set and related concepts. The goal of this document is to provide you with an efficient mechanism with which you can evaluate BDB XML against your project's technical requirements. As such, this document is intended for Java developers and senior software architects who are looking for an in-process XML data management solution. No prior experience with BDB XML is expected or required.

Note that while this document uses Java for its examples, the concepts described here should apply equally to all language bindings in which the BDB XML API is available. Be aware that a version of this document also exists for the C++ language.

Overview

BDB XML is an embedded database that is tuned for managing and querying hundreds, thousands, or even millions of XML documents. You use BDB XML through a programming API that allows you to manage, query, and modify your documents via an in-process database engine. Because BDB XML is an embedded engine, you use it with your application in the same way as you would any third-party package.

In BDB XML, documents are stored in containers, which you create and manage using XmlManager objects. Each such object can open multiple containers at a time.

Each container can hold millions of documents. For each document placed in a container, the container holds all the document data, any metadata that you have created for the document, and any indices maintained for the documents in the container.

(Metadata is information that you associate with your document that might not readily fit into the document schema itself. For example, you might use metadata to track the last time a document was modified instead of maintaining that information from within the actual document.)

XML documents may be stored in BDB XML containers in one of two ways:

  • Whole documents.

    Documents are stored in their entirety. This method works best for smaller documents (that is, documents under a megabyte in size).

  • As document nodes.

    Documents stored as nodes are broken down into their individual document element nodes and each such node is then stored as an individual record in the container. Along with each such record, BDB XML also stores all node attributes, and the text nodes, if any.

    This type of storage is best for large XML documents (greater than 1 megabyte in size).

From an API-usage perspective, there are very few differences between whole document and node storage containers. For more information, see Container Types.
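
The size guideline above can be written down as a small rule of thumb. The following is only a sketch: StorageChooser, its Storage enum, and the choose method are hypothetical names invented for this illustration and are not part of the BDB XML API. It also assumes that you pick the storage type for a container based on the typical size of the documents you expect to put in it, and it arbitrarily treats a document of exactly one megabyte as large, because the guideline does not say which side that case falls on.

```java
/** Hypothetical helper (not part of BDB XML): picks a storage style by size. */
final class StorageChooser {

    /** The two storage styles described in the text; the names are illustrative. */
    enum Storage { WHOLE_DOCUMENT, NODE }

    /** Threshold taken from the guideline in the text: one megabyte. */
    static final long ONE_MEGABYTE = 1024L * 1024L;

    /**
     * Suggests a storage style for documents of the given typical size.
     * Sizes under one megabyte map to whole-document storage; anything
     * else maps to node storage (the exact-1-MB case is an assumption).
     */
    static Storage choose(long typicalDocumentSizeInBytes) {
        if (typicalDocumentSizeInBytes < 0) {
            throw new IllegalArgumentException("size must not be negative");
        }
        return typicalDocumentSizeInBytes < ONE_MEGABYTE
                ? Storage.WHOLE_DOCUMENT
                : Storage.NODE;
    }

    public static void main(String[] args) {
        System.out.println(choose(200_000L));           // prints WHOLE_DOCUMENT
        System.out.println(choose(5L * ONE_MEGABYTE));  // prints NODE
    }
}
```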

Once a document has been placed in a container, you can use XQuery to retrieve one or more documents. You can also use XQuery to retrieve one or more portions of one or more documents. Queries are performed using XmlManager objects. The queries themselves, however, limit the scope of the query to a specified list of containers or documents.

BDB XML supports the entire XQuery specification. You can read the specification here:

http://www.w3.org/TR/xquery/

Also, because XQuery is an extension to XPath 2.0, BDB XML provides full support for that query language as well.
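
To give a feel for what retrieving a portion of a document looks like, the sketch below evaluates a path expression against a small document. It deliberately does not use the BDB XML API: it relies only on the XPath 1.0 engine that ships with the JDK (javax.xml.xpath), so it is a stand-in for the idea rather than an example of BDB XML itself. The class name PathQuerySketch, the select method, and the inventory document are all made up for this illustration. The expression used is simple enough that it is also a valid XQuery path expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustration only: uses the JDK's XPath engine, not BDB XML. */
final class PathQuerySketch {

    /** A made-up sample document. */
    static final String INVENTORY =
            "<inventory>"
          + "<item type=\"fruit\"><name>apple</name></item>"
          + "<item type=\"vegetable\"><name>carrot</name></item>"
          + "<item type=\"fruit\"><name>pear</name></item>"
          + "</inventory>";

    /** Returns the text content of every node the expression selects. */
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "Could not evaluate expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select only the <name> elements of items whose type is "fruit".
        System.out.println(select(INVENTORY, "/inventory/item[@type='fruit']/name"));
        // prints [apple, pear]
    }
}
```

Note that the result is a portion of the document (two name values), not the document as a whole.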

Finally, BDB XML provides a robust document modification facility that allows you to easily add, delete, or modify selected portions of documents. This means you can avoid writing modification code that manipulates (for example) DOM trees — BDB XML can handle all those details for you.

Benefits

BDB XML provides a series of features that makes it more suitable for storing XML documents than other common XML storage mechanisms. BDB XML's ability to provide efficient indexed queries means that it is a far more efficient storage mechanism than simply storing XML data in the filesystem. And because BDB XML provides the same transaction protection as does Berkeley DB, it is a much safer choice than is the filesystem for applications that might have multiple simultaneous readers and writers of the XML data.

Moreover, because BDB XML stores XML data in its native format, BDB XML enjoys the same extensible schema that has attracted many developers to XML. It is this flexibility that makes BDB XML a better choice than relational database offerings that must translate XML data into internal tables and rows, thus locking the data into a relational database schema.

XML Features

BDB XML is implemented to conform to the W3C standards for XML, XML Namespaces, and the XQuery working draft. In addition, it offers the following features specifically designed to support XML data management and queries:

  • Containers. A container is a single file that contains one or more XML documents, and their metadata and indices. You use containers to add, delete, and modify documents, and to manage indices.

  • Indices. BDB XML indices greatly enhance the performance of queries against the corresponding XML data set. BDB XML indices are based on the structure of your XML documents, and as such you declare indices based on the nodes that appear in your documents as well as the data that appears on those nodes.

    Note that you can also declare indices against metadata.

  • Queries. BDB XML queries are performed using the XQuery 1.0 language. XQuery is a W3C draft specification (http://www.w3.org/XML/Query).

  • Query results. BDB XML retrieves documents that match a given XQuery query. BDB XML query results are always returned as a set. The set can contain either matching documents, or a set of values from those matching documents.

  • Storage. If you use node-level storage for your documents (see Container Types), then BDB XML automatically transcodes your documents to Unicode UTF-8. If you use whole document storage, then the document is stored in whatever encoding it uses. Note that in either case, your documents must use an encoding supported by Xerces before they can be stored in BDB XML containers.

    Beyond the encoding, documents are stored (and retrieved) in their native format with all whitespace preserved.

  • Metadata attribute support. Each document stored in BDB XML can have metadata attributes associated with it. This allows information to be associated with the document without actually storing that information in the document. For example, metadata attributes might identify the last accessed and last modified timestamps for the document.

  • Document modification. BDB XML provides a robust mechanism for modifying documents. Using this mechanism, you can add, replace, and delete nodes from your document. This mechanism allows you to modify both element and attribute nodes, as well as processing instructions and comments.
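
To illustrate why an index on node names and values speeds up queries, the following toy sketch maps each (node name, value) pair to the names of the documents that contain it, so that a lookup no longer has to scan every document. This is not how BDB XML implements its indices, and ToyValueIndex with its add and lookup methods is a hypothetical class written only for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration of a node-value index; not BDB XML's implementation. */
final class ToyValueIndex {

    // node name -> value -> names of the documents containing that pair
    private final Map<String, Map<String, Set<String>>> byNode = new HashMap<>();

    /** Records that the named document has a node with the given value. */
    void add(String documentName, String nodeName, String value) {
        byNode.computeIfAbsent(nodeName, k -> new HashMap<>())
              .computeIfAbsent(value, k -> new TreeSet<>())
              .add(documentName);
    }

    /** Returns the names of matching documents without scanning them all. */
    Set<String> lookup(String nodeName, String value) {
        Map<String, Set<String>> byValue = byNode.get(nodeName);
        if (byValue == null) {
            return Collections.emptySet();
        }
        Set<String> documents = byValue.get(value);
        if (documents == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(documents);
    }

    public static void main(String[] args) {
        ToyValueIndex index = new ToyValueIndex();
        index.add("doc1", "color", "red");
        index.add("doc2", "color", "blue");
        index.add("doc3", "color", "red");
        System.out.println(index.lookup("color", "red")); // prints [doc1, doc3]
    }
}
```

The trade-off shown here is the usual one for any index: each added document costs a little extra work and space so that later lookups can go straight to the matching documents.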

Database Features

Beyond XML-specific features, BDB XML inherits a great many features from Berkeley DB, which allows BDB XML to provide the same fast, reliable, and scalable database support as does Berkeley DB. The result is that BDB XML is an ideal candidate for mission-critical applications that must manage XML data.

Important features that BDB XML inherits from Berkeley DB are:

  • In-process data access. BDB XML is compiled and linked in the same way as any library. It runs in the same process space as your application. The result is database support in a small footprint without the IPC overhead required by traditional client/server-based database implementations.

  • Ability to manage databases up to 256 terabytes in size.

  • Database environment support. BDB XML environments support all of the same features as Berkeley DB environments, including multiple databases, transactions, deadlock detection, lock and page control, and encryption. In particular, this means that BDB XML databases can share an environment with Berkeley DB databases, thus allowing an application to gracefully use both.

  • Atomic operations. Complex sequences of read and write access can be grouped together into a single atomic operation using BDB XML's transaction support. Either all of the read and write operations within a transaction succeed, or none of them succeed.

  • Isolated operations. Operations performed inside a transaction see all XML documents as if no other transactions are currently operating on them.

  • Recoverability. BDB XML's transaction support ensures that all committed data is available no matter how the application or system might subsequently fail.

  • Concurrent access. Through the combined use of isolation mechanisms built into BDB XML, plus deadlock handling supplied by the application, multiple threads and processes can concurrently access the XML data set in a safe manner.
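
The all-or-nothing behavior described under atomic operations can be pictured with a small in-memory model. The sketch below is a toy, not BDB XML's transaction implementation: ToyAtomicStore and its inTransaction method are hypothetical names, the store is single-threaded, and it keeps nothing on disk, so it shows atomicity only and says nothing about isolation, recoverability, or concurrent access.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of atomic updates; not BDB XML's transaction mechanism. */
final class ToyAtomicStore {

    // document name -> document content, as last committed
    private Map<String, String> committed = new HashMap<>();

    /**
     * Runs the given changes against a private working copy.
     * If they complete, every change becomes visible at once (commit).
     * If they throw, the working copy is discarded and nothing changes (abort).
     */
    boolean inTransaction(Consumer<Map<String, String>> changes) {
        Map<String, String> workingCopy = new HashMap<>(committed);
        try {
            changes.accept(workingCopy);
        } catch (RuntimeException e) {
            return false; // abort: committed state is untouched
        }
        committed = workingCopy; // commit: publish all changes together
        return true;
    }

    /** Returns the committed content for a document, or null if absent. */
    String get(String documentName) {
        return committed.get(documentName);
    }

    public static void main(String[] args) {
        ToyAtomicStore store = new ToyAtomicStore();

        boolean first = store.inTransaction(docs -> {
            docs.put("a", "<a/>");
            docs.put("b", "<b/>");
        });
        System.out.println(first + " " + store.get("a")); // prints true <a/>

        boolean second = store.inTransaction(docs -> {
            docs.put("c", "<c/>");
            throw new IllegalStateException("simulated failure");
        });
        System.out.println(second + " " + store.get("c")); // prints false null
    }
}
```

In the second transaction the write of document c happened before the failure, yet it never becomes visible, which is the point of grouping operations into a single atomic unit.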

Languages and Platforms

The official BDB XML distribution provides the library in the C++, Java, Perl, Python, PHP, and Tcl languages. Because BDB XML is available under an open source license, a growing number of third parties provide BDB XML support in languages other than those that are officially supported.

BDB XML is supported on a very large number of platforms. Check with the BDB XML mailing lists for the latest news on supported platforms, as well as for information as to whether your preferred language provides BDB XML support.