Getting Reacquainted with dbXML 2.0
February 25, 2004
Introduction
dbXML is a native XML database written in Java. Native XML databases (NXDs) store XML in an internalized format for faster overall processing and representational flexibility. NXDs also support indexing XML for improved query performance.
The dbXML project has quite a bit of history behind it; some have likened it to a soap opera. Though the project has seen quite a bit of flux, its core focus has remained the same: to provide an easy-to-use native XML database implementation with both good performance and stability.
Because it utilizes Java's memory mapped I/O and overlapping socket I/O, dbXML requires Java 1.4.2 or higher.
The History of dbXML
The dbXML project was started in July of 1999 as an embedded XML database written in C++, but didn't get very far in that state. Several months later, a company named the dbXML Group was formed to commercially develop the product. When this happened, the project was converted to Java and refocused on operating as a client/server database. It was released to the world under the terms of the GNU Lesser General Public License (LGPL).
dbXML continued on in this state until 2001 and version 1.0, when the code was forked. One image of the source code was donated to the Apache Software Foundation and became Xindice. The other image of the source code was retained by the dbXML Group to develop commercially in a proprietary fashion. From that point, the two projects diverged considerably, both in focus and in feature sets.
Eventually, work on Xindice began to slow almost to a halt. It has hovered at version 1.1 for quite a long time, and the project suffers from a lack of anyone who really wants to get their hands dirty working on the core database. Features are being added, and bugs are being fixed, but the pace needs to be picked up if the project is going to survive.
Between versions 1.0 and 2.0, dbXML was almost completely rewritten and has been implemented in several large commercial systems. In the process, the feature set has been refined, the database engine has become faster and more robust, and the system's overall usability has been dramatically improved. On November 13, 2003, the dbXML source code was re-released under the terms of the GNU General Public License (GPL).
A Quick Architectural Overview
Collections
dbXML manages documents in collections. Many collections can be created and managed at one time. Collections can also be laid out hierarchically, much as an operating system's directory structure is. A single collection may be associated with multiple indexes, extensions, triggers, and child collections.
dbXML collections can store either XML documents or binary streams (records), but not both at the same time. XML documents can be stored as binary streams but won't benefit from tokenization, compression, and indexing. It is important to understand that dbXML is not a multimedia database; storing massive binary streams is not recommended. It is probably a good idea to limit binary streams to no more than 500 kilobytes.
Indexes
Collections may have multiple indexes associated with them. An index is a file structure that is used to allow optimized retrieval of documents in a collection based on the structure or values in those documents.
Query Resolvers
Collections of documents and indexes for values in those documents aren't much use if you don't have a way to query those documents or portions of them. dbXML provides several query resolving systems for you to do this. Query resolvers are registered with the entire database. Queries may be executed against specific collections or against documents within a collection.
Extensions
Extensions are a way of adding extra capabilities to the dbXML server. Extensions are Java classes that implement the Extension interface and whose public methods are exposed as web service endpoints. Triggers and other extensions can also reference extensions. It's important to remember that only public methods that take a specific subset of generic parameters can be exposed as web service endpoints.
dbXML also provides experimental support for scripted extensions in JavaScript and Python.
Triggers
A trigger is a Java class that implements the Trigger interface. This interface specifies several methods that are to be implemented to handle triggered callbacks from a collection. These callback events include insertion, updating, deletion and retrieval, and may be fired before or after the actions.
dbXML also provides experimental support for scripted triggers in JavaScript and Python.
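To make the before/after callback idea concrete, here is a minimal sketch of how a collection might notify a trigger around each action. Every name in it (TriggerSketch, SketchTrigger, SketchCollection, fired) is hypothetical and invented for illustration; it is not dbXML's actual Trigger interface, which declares its own set of methods. Consult the dbXML Programmer's Guide for the real signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from dbXML's actual Trigger API.
class TriggerSketch {
    enum Phase { BEFORE, AFTER }
    enum Action { INSERT, UPDATE, DELETE, RETRIEVE }

    /** Hypothetical single-method callback, used here only to keep the sketch short. */
    interface SketchTrigger {
        void fired(Phase phase, Action action, String key);
    }

    /** A toy in-memory collection that notifies its triggers around each action. */
    static class SketchCollection {
        private final Map<String, String> documents = new HashMap<>();
        private final List<SketchTrigger> triggers = new ArrayList<>();

        void addTrigger(SketchTrigger trigger) {
            triggers.add(trigger);
        }

        void insert(String key, String xml) {
            fire(Phase.BEFORE, Action.INSERT, key);
            documents.put(key, xml);
            fire(Phase.AFTER, Action.INSERT, key);
        }

        String retrieve(String key) {
            fire(Phase.BEFORE, Action.RETRIEVE, key);
            String xml = documents.get(key);
            fire(Phase.AFTER, Action.RETRIEVE, key);
            return xml;
        }

        private void fire(Phase phase, Action action, String key) {
            for (SketchTrigger trigger : triggers) {
                trigger.fired(phase, action, key);
            }
        }
    }

    public static void main(String[] args) {
        SketchCollection people = new SketchCollection();
        // A trigger that simply logs each callback it receives.
        people.addTrigger((phase, action, key) ->
                System.out.println(phase + " " + action + " " + key));
        people.insert("doc1", "<person/>");
        people.retrieve("doc1");
    }
}
```

A "before" callback is the natural place for validation or vetoing an action, while an "after" callback suits auditing or cache invalidation; whether dbXML's triggers can veto an action is not stated here.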
So What's New?
There are too many changes between versions 1.0 and 2.0 to mention, so it's probably better to just review the most important changes and allow developers to discover the rest.
Journaling transactions
dbXML now supports basic journaling transactions under the hood. At present, all transactions are implicit unless you're accessing dbXML using the database's lowest level APIs. Explicit transaction APIs will be exposed via the client/server APIs in a future release.
Security
The database now has a pluggable security model. There are currently three security managers to choose from.
- NoSecurityManager provides no security whatsoever and is used when authentication is not needed to access the database.
- SimpleSecurityManager provides simple security, where a single user name and password is used for the entire database. The user name and password are defined in the database's system.xml configuration file.
- DefaultSecurityManager is so named because it is the default security manager. It provides access control based on users and roles stored in the database's system collections.
Web Services Replace CORBA
dbXML 1.0 leveraged CORBA to provide client/server communications. While CORBA made dbXML accessible to many platforms and languages, it also came with its share of headaches. For version 2.0, it was decided that CORBA would no longer be used. dbXML 2.0 utilizes a web services hub called Project Labrador to provide client/server communications.
Currently, Labrador only supports REST and the XML-RPC protocol. As a result, dbXML only supports these modes of access. A future version of Labrador will support SOAP; when it does, dbXML will automatically inherit this capability.
Because Labrador is under the hood, you never really have to be concerned with which protocols it exposes. The dbXML Client APIs do the work of marshalling your calls using the proper protocol handlers. The only time the protocols matter is when you're trying to access dbXML from a language other than Java; in which case, you'll have to use an XML-RPC library to generate calls.
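For a sense of what a non-Java client would have to generate, here is a minimal sketch that builds an XML-RPC request body by hand. It is written in Java only to stay in this article's language. The method name example.getDocument and its parameters are hypothetical, not dbXML's actual remote interface; only the methodCall envelope itself follows the XML-RPC format. In practice an XML-RPC library would do this marshalling, and the HTTP POST, for you.

```java
// Sketch of XML-RPC request marshalling; the remote method name is an assumption.
class XmlRpcSketch {

    /** Escapes the characters that are significant in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Builds a methodCall document whose parameters are all XML-RPC strings. */
    static String methodCall(String methodName, String... stringParams) {
        StringBuilder body = new StringBuilder("<?xml version=\"1.0\"?>");
        body.append("<methodCall><methodName>")
            .append(escape(methodName))
            .append("</methodName><params>");
        for (String param : stringParams) {
            body.append("<param><value><string>")
                .append(escape(param))
                .append("</string></value></param>");
        }
        body.append("</params></methodCall>");
        return body.toString();
    }

    public static void main(String[] args) {
        // "example.getDocument" is a hypothetical method, used only for illustration.
        System.out.println(methodCall("example.getDocument", "/people", "doc1"));
    }
}
```

The resulting string would be sent as the body of an HTTP POST with a Content-Type of text/xml to whatever endpoint the server exposes.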
A future version of dbXML will also provide the ability to access the database using a Cocoa library for Mac OS X applications.
Command Line System
dbXML 1.0's command line tools were somewhat cryptic, difficult to work with, and required a JVM to fire up for each call, making them a little slow for batch processes. dbXML 2.0 introduces a new "shell-like" command line that allows you to interactively type commands in the context of a selected "collection" and access command help. The command line can also be fired up to run a string of prescripted commands.
GUI Administrator
dbXML 2.0 now includes an attractive GUI administrator tool that provides access to most of the functionality of the command line system, but with a little more hand-holding for users who don't want to remember the command line syntax.
New Database Access APIs
The preferred way to access the database in dbXML 1.0 was via the XML:DB APIs. These APIs were somewhat limited, but they provided a means of writing client applications that interoperate with multiple vendors' XML databases. dbXML 2.0 still provides an XML:DB API implementation, but now prefers a set of APIs that were designed specifically for accessing dbXML data stores. The dbXML Client API consists of four relatively straightforward interfaces:
- dbXMLClient is your initial entry point into the Database and allows you to retrieve references to CollectionClients and ContentClients.
- CollectionClient allows you to interface with dbXML Collections. It provides many methods grouped into several categories, including methods for querying the database and managing documents, subcollections, indexes, triggers, and extensions.
- ContentClient is an access API for a single document or binary stream in the database.
- ResultSetClient is an access API for the results of a query against a collection or document.
The dbXML Client API has been implemented for both XML-RPC client/server communications and local (in-process) access. dbXML's XML:DB implementation wraps the dbXMLClient API, which allows the XML:DB classes to operate both locally and remotely as well.
Querying Capabilities
Unfortunately, XML database technologies are still in their infancy, and it will take quite some time before a single, all-purpose query language is available for them. Some may argue that because of the nature of XML, a single, all-purpose query language may not even be desirable. Currently, the state of the art in XML databases still leverages the simple XPath language extensively, and for good reason: it addresses the vast majority of retrieval queries that a developer might want to perform.
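As an illustration of how much a short XPath expression can do, the sketch below filters by both location and a predicate. It uses the JDK's own javax.xml.xpath package (available from Java 5 onward) against an in-memory string rather than dbXML's query APIs, and the sample document and class name are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Plain JDK XPath evaluation; this does not use or imitate dbXML's client API.
class XPathSketch {
    // Hypothetical sample data, not taken from dbXML.
    static final String PEOPLE =
          "<people>"
        + "<person id=\"1\"><name>Ann</name><city>Phoenix</city></person>"
        + "<person id=\"2\"><name>Bob</name><city>Tucson</city></person>"
        + "<person id=\"3\"><name>Cy</name><city>Phoenix</city></person>"
        + "</people>";

    /** Returns the text content of every node selected by the expression. */
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Location path plus a predicate: names of people whose city is Phoenix.
        System.out.println(select(PEOPLE, "/people/person[city='Phoenix']/name"));
        // prints [Ann, Cy]
    }
}
```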
dbXML supports four query languages. Each provides functionality for different purposes.
- XPath is a terse path syntax that is similar in some ways to UNIX or DOS directory paths. It allows the returned results to be filtered based on location and predicate evaluation.
- XSLT is a transformation language that converts XML into other formats. These formats can include XML, text, HTML, or even PDF when XSL-FO is used. dbXML XSLT queries can be executed against a single document, an entire collection, or the results of an XPath query.
- XUpdate is also a transformation language with some of the same goals as XSLT, but its syntax is simpler, and its purpose is to modify the content of documents in place.
- FullText is a search engine style query with the ability to search on many words with ANDed and ORed set evaluation. The results of a full text query can also be filtered using an XPath expression.
You'll notice that the W3C's XQuery language is missing from this list. A future version of dbXML may include XQuery, but for now the specification is still in flux. When the specification stabilizes, the project will seriously consider implementing or integrating XQuery into the database.
The newcomers to dbXML 2.0 are XSLT and FullText, both welcome additions to the database. dbXML's XSLT processor caches parsed XSLT stylesheets and reuses those templates for subsequent queries. Full-text indexing and querying are supported on both wildcard and specific element and attribute patterns.
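The stylesheet caching described above can be sketched with the standard JAXP javax.xml.transform API, in which a compiled stylesheet is represented by a reusable Templates object. This is only an illustration of the general technique, under the assumption that dbXML does something similar; the StylesheetCache class and its methods are hypothetical and are not dbXML's internals.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical cache: compiles each distinct stylesheet once, then reuses it.
class StylesheetCache {
    // Sample stylesheet for the demo; emits plain text.
    static final String HELLO_XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">Hello, <xsl:value-of select=\"/greeting/@to\"/>!</xsl:template>"
        + "</xsl:stylesheet>";

    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Map<String, Templates> cache = new HashMap<>();
    private int compileCount = 0;

    /** Returns the compiled form of the stylesheet, compiling it on first use only. */
    private synchronized Templates templatesFor(String xslt) throws TransformerException {
        Templates templates = cache.get(xslt);
        if (templates == null) {
            templates = factory.newTemplates(new StreamSource(new StringReader(xslt)));
            cache.put(xslt, templates);
            compileCount++;
        }
        return templates;
    }

    /** Applies the (possibly cached) stylesheet to the given XML document. */
    String transform(String xslt, String xml) {
        try {
            StringWriter out = new StringWriter();
            // Templates is thread-safe; each call gets its own Transformer.
            templatesFor(xslt).newTransformer().transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    synchronized int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        StylesheetCache cache = new StylesheetCache();
        System.out.println(cache.transform(HELLO_XSLT, "<greeting to=\"world\"/>"));
        System.out.println(cache.transform(HELLO_XSLT, "<greeting to=\"dbXML\"/>"));
        System.out.println("compiled " + cache.compileCount() + " time(s)");
    }
}
```

Keying the cache on the stylesheet text keeps the sketch short; a real system would more likely key on a stylesheet name or document key and invalidate the entry when the stylesheet changes.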
Getting Started with dbXML
Step number one: Obtain the software
The only difficult thing about this step is deciding which packaging to use. dbXML is made available in three forms: binary distributions, source distributions, and CVS access.
Let's assume that you're not interested in the source code and that you're using some flavor of Windows as your operating system. In this case, look at the dbXML product page and find the link for Binary Releases. When the download page is displayed, select the Windows installer.
Note: dbXML requires the Java 2 Standard Edition SDK 1.4.2 (JDK) to operate. If you do not have a copy of the JDK, please visit Sun's Java site to download one. If you plan on building dbXML from source code, you will also need the Apache Software Foundation's Ant build tool. This can be obtained from the Ant project page.
Step number two: Install the software
Downloading the binary installers makes this pretty easy. After it's installed, there are still a couple of steps you have to take to get the system running as a Windows NT service and to make the command line tools available to you.
The first thing you need to do is set the DBXML_HOME environment variable in System Properties -> Advanced -> Environment Variables -> System Variables. The value for this variable should be the installation location for dbXML, which, by default, is "C:\Program Files\dbXML". While you're there, also add the dbXML "bin" directory to your path, by appending ";%DBXML_HOME%\bin" to the PATH environment variable; also check to make sure that the JAVA_HOME environment variable is properly set. These changes will make the settings persistent so you don't have to retype them from the command line constantly.
After you have the software installed, and the environment variables set, you'll want to start up the dbXML Server. The most convenient way to do this on Windows is to install it as a Windows NT service. To install the NT service, open a command line window, change directories to "%DBXML_HOME%\install" and then type "install-ntsvc install". If all goes well, the dbXML Server should be up and running.
Step number three: Explore the software
Let's explore by starting up the dbXML Administrator. Click the icon with the cute little doggy in a hard hat. That's my dog, Fonzie. He weighs a whole ten pounds, but don't get him angry or he'll bite your head off. Anyway, the dbXML Administrator will automatically log you into the database running on "localhost" using the default admin user settings. The admin user has godlike privileges over the database, so be careful not to delete anything: everything in the database out of the box is necessary for proper operation.
Step number four: Start coding
This step will require a discussion that would be far too detailed for this introductory article. If you're interested in developing applications for dbXML, please consult the dbXML Programmer's Guide.
Getting More Information
That concludes my brief reintroduction to dbXML. This project has evolved quite a bit since version 1.0 and is very likely to evolve considerably in the coming year. It is already a mature product, with some rather high profile users, and is in a very good position to become the dominant open source XML database, if not one of the more popular XML databases in general.
Future versions of dbXML promise further enhancements such as:
- Programmatic backup and restore of portions of the database.
- A SOAP-based client library.
- Constraining a Collection using W3C XML Schemas.
- Indexing implicitly coerced based on XML Schema types.
- A rewritten query engine, with vastly improved performance.
- W3C XQuery queries, including updates.
- Improved documentation.
For more information about dbXML, including product downloads, source code availability, documentation, and commercial licensing options if your project can't leverage GPLed software, please visit the dbXML site.