XQuery, libferris, and Virtual Filesystems

July 27, 2007

Introduction

By bringing together an XQuery engine and a virtual filesystem you can use a familiar query language to access relational databases, Berkeley db4 databases, kernel filesystems, and network files as well as XML. libferris, at its, core is a virtual filesystem allowing many different data sources to be exposed through a filesystem interface. These include the expected things like file://, http://, ftp:// as well as not so expected things like databases, XML files, and even applications like emacs, Evolution, XWindow, and Firefox.

For example, a slight change in an XPath expression to use a db4 file instead of an XML file for a customer lookup can substantially improve query performance. Using a db4 file to perform a lookup is faster because the db4 file is indexed and the right piece of information can be directly accessed in only a few disk seeks. Using an XML file for the lookup table would require reading the whole file and parsing it; then there is no index in RAM so searching a large DOM would also require time.

Some familiarity with XQuery will be of help to the reader in understanding this article. The example XQueries are kept simple in order to demonstrate the libferris-related features instead of XQuery itself.

At times, direct filesystem interactions are shown using tools like ferrisls from the libferris distribution. This is the analogy of looking at an input XML file to debug a particular XQuery or to work out an XPath. In early experimentation it can be of great help to use ferrisls to directly view a data source (filesystem) in order to see why an XQuery might not be returning the result you expect.

libferris is available for POSIX platforms. A binary package repository exists for easy installation on Fedora Core 6 and Fedora 7 will be available shortly. In contrast to the older Linux style of having to explicitly mount filesystems, most of the time, libferris handles the mounting of filesystems for you. For example, if you try to read an XML file as a directory, then libferris will automatically mount it for you and allow you to view inside the XML file.

XQuery on db4

Libferris can mount both XML files and db4 files as filesystems. Because the data model of db4 is fairly flexible, both db4 and XML can appear very similar with libferris. The main difference is the underlying data storage; for example, slight modifications to db4 are much faster than for XML (db4 can save only the changes, XML updates usually mean streaming the entire DOM to disk).

Assume we have a customer lookup table in an XML file as partially shown in the following example code.

<customers>

 <customer id="3"   givenname="Foo"   familyname="Bar"/>

 <customer id="15"  givenname="Bobby" familyname="McGee"/>

 <customer id="131" givenname="Ziggy" familyname="Stardust" />

</customers>

The following commands will create a db4 file containing the same information. Another method more suitable for keeping a db4 file in sync with a master XML file is shown at the end of the article.

Because we are dealing with mixing XML and db4 and using a virtual filesystem, there is some different terminology used to refer to very similar things. An XML attribute is referred to using the filesystem terminology of Extended Attribute (EA) in libferris. Thus the --show-ea command-line option to ferrisls is telling it what (extended) attributes we are interested in seeing. The --preserve-list option to ferriscp tells it explicitly which (extended) attributes we want to be copied from the source to the destination.

As can be seen from the two ferrisls commands the same information is available in both the XML and db4 files, the only difference is the extension of the customers file.

$ fcreate --create-type=db4 --rdn=customers.db .

$ ferriscp -av --preserve-list=id,givenname,familyname customers.xml/customers customers.db

$ ferrisls --show-ea=givenname,id customers.xml/customers

Foo     3

Bobby   15

Ziggy   131

$ ferrisls --show-ea=givenname,id customers.db/customers

Foo     3

Bobby   15

Ziggy   131

A simple XQuery on the customers.xml file is shown below. The document URL and the customer ID are declared as variables and the standard doc() function is used to open the customers.xml file to access a customer's information by their unique ID.

declare variable $docurl     := "file:///.../customers.xml";

declare variable $customerid := "131";

<resultdata>

 {

  for $c in doc( $docurl )/customers/*[@id=$customerid]

  return

    <person cid="{ $c/@id }" surname="{ $c/@familyname }" />

 }

</resultdata>

The above query can be executed with either the xqilla command from the XQilla distribution or the ferris-xqilla command from the libferris distribution. The difference is that the latter command makes some extension functions available to the XQuery.

$ ferris-xqilla customers.xq

<resultdata>

  <person cid="131" surname="Stardust"/>

</resultdata>

Very little has to change to use customers.db instead of customers.xml. The docurl will have to have the file name changed to customers.db and instead of using the standard doc() function we use the ferris-doc() extension function. These two functions behave the same way, the latter makes use of libferris to provide the data and query resolution. The query that operates directly on customers.xml could also use the ferris-doc() extension function, though there is no advantage.

declare variable $docurl     := "file:///.../customers.db";

declare variable $customerid := "131";

<resultdata>

 {

  for $c in ferris-doc( $docurl )/customers/*[@id=$customerid]

  return

    <person cid="{ $c/@id }" surname="{ $c/@familyname }" />

 }

</resultdata>

As the XQuery uses the ferris-doc() extension function we must use the ferris-xqilla command to evaluate it. The results will be identical.

$ ferris-xqilla customers-db4.xq

<resultdata>

  <person cid="131" surname="Stardust"/>

</resultdata>

XQuery on PostgreSQL

libferris can mount PostgreSQL as a filesystem. The result is available under both the postgresql:// and the shorthand pg:// URL scheme. At the top level is the name of servers, then databases, and then tables. For example, a table foo in database bar on the local machine can be accessed as pg://localhost/bar/foo.

A relational table is shown by libferris by making each tuple a file and using (extended) attributes to show each column in the table. So, for example, a table with two columns ( name varchar, id int ) with two tuples will make a directory with two files, each of which will have the name and ID (extended) attribute available.

Database functions are also available alongside the tables and can be called to create a virtual directory with the results. For example, if we have a database called xmldotcom2007 with a table customers and a function customerlookup( int, int ), which returns a range of customers by their customer ID, then both of these will be available directly under the database itself—pg://localhost/xmldotcom2007. The logic of the postgresql function is kept trivial for demonstration purposes.

Such a database might be created and populated as the following code shows.

bash$ psql

create database xmldotcom2007;

\c xmldotcom2007

create table customers ( id serial primary key, givenname varchar(100), familyname varchar(100) );

insert into customers values ( 3, 'Foo', 'Bar' );

insert into customers values ( 15, 'Bobby', 'McGee' );

insert into customers values ( 131, 'Ziggy', 'Stardust' );



CREATE TYPE customerlookup_result 

    AS (f1 int, fname varchar(100), lname varchar(100));

CREATE FUNCTION customerlookup( int, int ) 

   returns setof customerlookup_result 

AS

$BODY$

DECLARE

  rec customerlookup_result;

BEGIN

  for rec in 

     select id,givenname,familyname from customers 

     where id >= $1 and id <= $2

  LOOP

  return next rec;

END LOOP;

return;

END;

$BODY$

LANGUAGE 'plpgsql' ;

\q

bash$

Direct interaction with this database is possible from the command line. The -0 command-line option to ferrisls is similar to the -l option to ls(1) except the filesystem itself is asked which attributes it recommends to show to the user. In the case of a table from a relational database all the columns of the table are recommended by the filesystem as being interesting to the user. To call a postgresql function, the URL must be quoted so that bash will not try to interpret the parenthesis. Notice that the (extended) attributes are named f1, fname, and lname for the files returned by the postgresql functions. This is because the type returned by customerlookup() includes these names.

$ ferrisls    pg://localhost/xmldotcom2007

$ ferrisls -0 pg://localhost/xmldotcom2007/customers

131     Ziggy   Stardust        131     id

15      Bobby   McGee   15      id

3       Foo     Bar     3       id

$ ferrisls --xml  pg://localhost/xmldotcom2007/customers

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<ferrisls>



  <ferrisls familyname="" givenname="" id="" 

     name="customers" primary-key="id" 

     url="pg:///localhost/xmldotcom2007/customers">

    <context familyname="Stardust" givenname="Ziggy" 

       id="131" name="131" primary-key="id"/>

...



$ ferrisls --xml  'pg://localhost/xmldotcom2007/customerlookup(3,3)'

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<ferrisls>

  <ferrisls ...>

    <context f1="3" fname="Foo" lname="Bar" name="3-Foo-Bar"

        primary-key="f1-fname-lname"/>

  </ferrisls>

</ferrisls>

The following performs a call to the postgresql function from an XQuery and outputs the results. Note the use of the @f1, @lname as attribute names because the postgresql function gives the results these names with the PostgreSQL type customerlookup_result. The XPath expression just loops over all the returned tuples assuming that the core logic is implemented in the postgresql function itself.

declare variable $docurl := "pg://localhost/xmldotcom2007/customerlookup(";

declare variable $mincustomerid := "3";

declare variable $maxcustomerid := "15";

<resultdata>

 {

  for $c in ferris-doc( 

     concat( $docurl, $mincustomerid, ",", $maxcustomerid, ")" ))/*

  return

    <person cid="{ $c/@f1 }" 

       surname="{ $c/@lname }" fn="{ $c/@fname }" />

 }

</resultdata>

XQuery as Desktop and Network Search

The filesystem index and search, as described in this Linux Journal article, can also be used from XQuery. This makes it very easy to build a custom Intranet search solution combining information from the file server index, RDF, and other locations with XQueries. Another application is finding the documents you want to perform an XQuery on using a filesystem search as an outer loop on ferris-doc("fulltextquery://...") and an inner loop on the document itself in your XQuery.

The below XQuery will search for "alice" and "wonderland" as a Boolean full text search performed against your filesystem index and return the results as a very simple XML file. Since you can combine many calls to ferris-doc() in the one XQuery, you could quickly build a nice user interface to file server search using just libferris and XQuery.

declare variable $qtype    := "boolean";

declare variable $person   := "alice";

declare variable $location := "wonderland";

<data>

 {

  for $idx in ferris-doc( concat("fulltextquery://", $qtype, "/", 

           $person, " ", $location))

    for $res in $idx/*

       return

    <match 

            name="{ $res/@name }" url="{ $res/@url }" 

            modification-time="{ $res/@mtime-display }"

           >

    </match>

 }

</data> 



$ ferris-xqilla xquery-index.xq

<?xml version="1.0"?>

<data>

  <match modification-time="99 Jul 27 12:53" 

     name="file:///.../doc/CommandLine/command.txt ...>

  <match modification-time="00 Mar 11 06:58" 

     name="file:///.../doc/Gimp/Grokking-the-GIMP-v1.0/node8.html

     ...>

...</data>

The filesystem indexes can be combined with querying by location. For example, the below XQuery will search any files which are geotagged as being in Florence, Italy. For details on setting up geotagging, place name disambiguation, see this Linux.com article

declare variable $placename := "eiffel-tower";

<data>

 {

  for $idx in ferris-doc( concat("eaq://(emblem:has-", $placename, "==1)"))

    for $res in $idx/*

       return

    <match 

            name="{ $res/@name }" url="{ $res/@url }" 

            modification-time="{ $res/@mtime-display }"

           >

    </match>

 }

</data>

Keeping a db4 Cache Hot with rsync

Both XML and db4 can be seen as filesystems with libferris, so you can keep a db4 file up to date with an XML file using the standard rsync(1) tool. In order to do this you need to expose libferris as a Filesystem in Userspace (FUSE) filesystem both as the source and destination for rsync. The tool to expose libferris through FUSE is called ferrisfs.

As rsync supports extended attributes with the -X command-line option, XML attributes can be synced to those in the db4 file cache. Filesystems expect (extended) attributes that users can store information into to be prefixed with "user.". This creates a very simple namespacing of (extended) attributes with "system." attributes carrying restrictions on who can set them.

This filesystem implementation detail is taken care of with the --prepend-user-dot-prefix-to-ea-regex option to ferrisfs. As the XML file does not conform to the "user." namespace restriction the --prepend-user-dot-prefix-to-ea-regex option can be used to have ferrisfs do some name space marshaling. For example, an XML attribute such as id will be reported by ferrisfs as user.id at the input filesystem. At the destination end the "user." prefix will be automatically stripped again; rsync itself only sees the "user.x" extended attributes in both filesystems and everybody is happy.

The --show-ea-regex option is used to tell ferrisfs which (extended) attributes are reported to rsync as existing. This means that any attributes in the XML file not matching this regular expression are not synced to the db4 file.

fcreate --create-type=db4 --rdn=customers.db .

mkdir -p customers

mkdir -p input 

ferrisfs --prepend-user-dot-prefix-to-ea-regex='.*'  -u `pwd`/customers.db customers

ferrisfs --prepend-user-dot-prefix-to-ea-regex='.* \

 --show-ea-regex='(id|givenname|familyname)'  -u `pwd`/customers.xml/customers input

rsync -avzX --delete-after input/ customers/

db_dump -p customers.db 

...

fusermount -u input

fusermount -u customers

Summing Up

The combination of XQilla and libferris allows you to combine access to the many filesystems that are supported by libferris within a single XQuery. In the case of db4 and XML, you can select between the two formats to gain the performance you desire with very little change to the XQuery itself.

Other interesting data sources that libferris makes available include rdf DB (as created with the redland library) and direct queries of what is shown in Firefox. Unfortunately, examples of these will have to wait for another article.

Additional Resources

The libferris homepage
XQuery articles on XML.com: Getting Started with XQuery and Getting Started with XQuery, Part 2
The rsync homepage
XML-related Linux Journal articles: The World Is a libferris Filesystem and xsltfs:// described in Virtual Filesystems Are Virtual Office Documents
Geotagging files with libferris and Google Earth at Linux.com
The XQilla homepage