XQuery, libferris, and Virtual Filesystems
July 27, 2007
Introduction
By bringing together an XQuery engine and a virtual filesystem you can use a familiar
query
language to access relational databases, Berkeley db4 databases, kernel filesystems,
and
network files as well as XML. libferris, at its, core is a virtual filesystem allowing
many
different data sources to be exposed through a filesystem interface. These include
the
expected things like file://
, http://
, ftp://
as well
as not so expected things like databases, XML files, and even applications like emacs,
Evolution, XWindow, and Firefox.
For example, a slight change in an XPath expression to use a db4 file instead of an XML file for a customer lookup can substantially improve query performance. Using a db4 file to perform a lookup is faster because the db4 file is indexed and the right piece of information can be directly accessed in only a few disk seeks. Using an XML file for the lookup table would require reading the whole file and parsing it; then there is no index in RAM so searching a large DOM would also require time.
Some familiarity with XQuery will be of help to the reader in understanding this article. The example XQueries are kept simple in order to demonstrate the libferris-related features instead of XQuery itself.
At times, direct filesystem interactions are shown using tools like ferrisls from the libferris distribution. This is the analogy of looking at an input XML file to debug a particular XQuery or to work out an XPath. In early experimentation it can be of great help to use ferrisls to directly view a data source (filesystem) in order to see why an XQuery might not be returning the result you expect.
libferris is available for POSIX platforms. A binary package repository exists for easy installation on Fedora Core 6 and Fedora 7 will be available shortly. In contrast to the older Linux style of having to explicitly mount filesystems, most of the time, libferris handles the mounting of filesystems for you. For example, if you try to read an XML file as a directory, then libferris will automatically mount it for you and allow you to view inside the XML file.
XQuery on db4
Libferris can mount both XML files and db4 files as filesystems. Because the data model of db4 is fairly flexible, both db4 and XML can appear very similar with libferris. The main difference is the underlying data storage; for example, slight modifications to db4 are much faster than for XML (db4 can save only the changes, XML updates usually mean streaming the entire DOM to disk).
Assume we have a customer lookup table in an XML file as partially shown in the following example code.
<customers> <customer id="3" givenname="Foo" familyname="Bar"/> <customer id="15" givenname="Bobby" familyname="McGee"/> <customer id="131" givenname="Ziggy" familyname="Stardust" /> </customers>
The following commands will create a db4 file containing the same information. Another method more suitable for keeping a db4 file in sync with a master XML file is shown at the end of the article.
Because we are dealing with mixing XML and db4 and using a virtual filesystem, there
is
some different terminology used to refer to very similar things. An XML attribute
is
referred to using the filesystem terminology of Extended Attribute (EA) in libferris.
Thus
the --show-ea
command-line option to ferrisls is telling it what (extended)
attributes we are interested in seeing. The --preserve-list
option to ferriscp
tells it explicitly which (extended) attributes we want to be copied from the source
to the
destination.
As can be seen from the two ferrisls commands the same information is available in both the XML and db4 files, the only difference is the extension of the customers file.
$ fcreate --create-type=db4 --rdn=customers.db . $ ferriscp -av --preserve-list=id,givenname,familyname customers.xml/customers customers.db $ ferrisls --show-ea=givenname,id customers.xml/customers Foo 3 Bobby 15 Ziggy 131 $ ferrisls --show-ea=givenname,id customers.db/customers Foo 3 Bobby 15 Ziggy 131
A simple XQuery on the customers.xml file is shown below. The document URL and the
customer ID are declared as variables and the standard doc()
function is used
to open the customers.xml file to access a customer's information by their unique
ID.
declare variable $docurl := "file:///.../customers.xml"; declare variable $customerid := "131"; <resultdata> { for $c in doc( $docurl )/customers/*[@id=$customerid] return <person cid="{ $c/@id }" surname="{ $c/@familyname }" /> } </resultdata>
The above query can be executed with either the xqilla
command from the XQilla distribution or the
ferris-xqilla
command from the libferris distribution. The difference is that
the latter command makes some extension functions available to the XQuery.
$ ferris-xqilla customers.xq <resultdata> <person cid="131" surname="Stardust"/> </resultdata>
Very little has to change to use customers.db instead of customers.xml.
The docurl
will have to have the file name changed to customers.db and instead
of using the standard doc()
function we use the ferris-doc()
extension function. These two functions behave the same way, the latter makes use
of
libferris to provide the data and query resolution. The query that operates directly
on
customers.xml could also use the ferris-doc()
extension function,
though there is no advantage.
declare variable $docurl := "file:///.../customers.db"; declare variable $customerid := "131"; <resultdata> { for $c in ferris-doc( $docurl )/customers/*[@id=$customerid] return <person cid="{ $c/@id }" surname="{ $c/@familyname }" /> } </resultdata>
As the XQuery uses the ferris-doc()
extension function we must use the
ferris-xqilla
command to evaluate it. The results will be identical.
$ ferris-xqilla customers-db4.xq <resultdata> <person cid="131" surname="Stardust"/> </resultdata>
XQuery on PostgreSQL
libferris can mount PostgreSQL as a filesystem. The result is available under both
the
postgresql://
and the shorthand pg://
URL scheme. At the top
level is the name of servers, then databases, and then tables. For example, a table
foo
in database bar
on the local machine can be accessed as
pg://localhost/bar/foo
.
A relational table is shown by libferris by making each tuple a file and using (extended)
attributes to show each column in the table. So, for example, a table with two columns
( name varchar, id int )
with two tuples will make a directory with two
files, each of which will have the name and ID (extended) attribute available.
Database functions are also available alongside the tables and can be called to create
a
virtual directory with the results. For example, if we have a database called
xmldotcom2007
with a table customers and a function
customerlookup( int, int )
, which returns a range of customers by their
customer ID, then both of these will be available directly under the database
itself—pg://localhost/xmldotcom2007
. The logic of the
postgresql
function is kept trivial for demonstration purposes.
Such a database might be created and populated as the following code shows.
bash$ psql create database xmldotcom2007; \c xmldotcom2007 create table customers ( id serial primary key, givenname varchar(100), familyname varchar(100) ); insert into customers values ( 3, 'Foo', 'Bar' ); insert into customers values ( 15, 'Bobby', 'McGee' ); insert into customers values ( 131, 'Ziggy', 'Stardust' ); CREATE TYPE customerlookup_result AS (f1 int, fname varchar(100), lname varchar(100)); CREATE FUNCTION customerlookup( int, int ) returns setof customerlookup_result AS $BODY$ DECLARE rec customerlookup_result; BEGIN for rec in select id,givenname,familyname from customers where id >= $1 and id <= $2 LOOP return next rec; END LOOP; return; END; $BODY$ LANGUAGE 'plpgsql' ; \q bash$
Direct interaction with this database is possible from the command line. The
-0
command-line option to ferrisls is similar to the -l
option
to ls(1) except the filesystem itself is asked which attributes it recommends to show
to the
user. In the case of a table from a relational database all the columns of the table
are
recommended by the filesystem as being interesting to the user. To call a
postgresql
function, the URL must be quoted so that bash will not try to
interpret the parenthesis. Notice that the (extended) attributes are named f1, fname,
and
lname for the files returned by the postgresql
functions. This is because the
type returned by customerlookup()
includes these names.
$ ferrisls pg://localhost/xmldotcom2007 $ ferrisls -0 pg://localhost/xmldotcom2007/customers 131 Ziggy Stardust 131 id 15 Bobby McGee 15 id 3 Foo Bar 3 id $ ferrisls --xml pg://localhost/xmldotcom2007/customers <?xml version="1.0" encoding="UTF-8" standalone="no" ?> <ferrisls> <ferrisls familyname="" givenname="" id="" name="customers" primary-key="id" url="pg:///localhost/xmldotcom2007/customers"> <context familyname="Stardust" givenname="Ziggy" id="131" name="131" primary-key="id"/> ... $ ferrisls --xml 'pg://localhost/xmldotcom2007/customerlookup(3,3)' <?xml version="1.0" encoding="UTF-8" standalone="no" ?> <ferrisls> <ferrisls ...> <context f1="3" fname="Foo" lname="Bar" name="3-Foo-Bar" primary-key="f1-fname-lname"/> </ferrisls> </ferrisls>
The following performs a call to the postgresql
function from an XQuery and
outputs the results. Note the use of the @f1, @lname
as attribute names because
the postgresql
function gives the results these names with the PostgreSQL type
customerlookup_result
. The XPath expression just loops over all the returned
tuples assuming that the core logic is implemented in the postgresql
function
itself.
declare variable $docurl := "pg://localhost/xmldotcom2007/customerlookup("; declare variable $mincustomerid := "3"; declare variable $maxcustomerid := "15"; <resultdata> { for $c in ferris-doc( concat( $docurl, $mincustomerid, ",", $maxcustomerid, ")" ))/* return <person cid="{ $c/@f1 }" surname="{ $c/@lname }" fn="{ $c/@fname }" /> } </resultdata>
XQuery as Desktop and Network Search
The filesystem index and search, as described in this Linux Journal article, can
also be used from XQuery. This makes it very easy to build a custom Intranet search
solution
combining information from the file server index, RDF, and other locations with XQueries.
Another application is finding the documents you want to perform an XQuery on using
a
filesystem search as an outer loop on ferris-doc("fulltextquery://...")
and an
inner loop on the document itself in your XQuery.
The below XQuery will search for "alice" and "wonderland" as a Boolean full text search
performed against your filesystem index and return the results as a very simple XML
file.
Since you can combine many calls to ferris-doc()
in the one XQuery, you could
quickly build a nice user interface to file server search using just libferris and
XQuery.
declare variable $qtype := "boolean"; declare variable $person := "alice"; declare variable $location := "wonderland"; <data> { for $idx in ferris-doc( concat("fulltextquery://", $qtype, "/", $person, " ", $location)) for $res in $idx/* return <match name="{ $res/@name }" url="{ $res/@url }" modification-time="{ $res/@mtime-display }" > </match> } </data> $ ferris-xqilla xquery-index.xq <?xml version="1.0"?> <data> <match modification-time="99 Jul 27 12:53" name="file:///.../doc/CommandLine/command.txt ...> <match modification-time="00 Mar 11 06:58" name="file:///.../doc/Gimp/Grokking-the-GIMP-v1.0/node8.html ...> ...</data>
The filesystem indexes can be combined with querying by location. For example, the below XQuery will search any files which are geotagged as being in Florence, Italy. For details on setting up geotagging, place name disambiguation, see this Linux.com article
declare variable $placename := "eiffel-tower"; <data> { for $idx in ferris-doc( concat("eaq://(emblem:has-", $placename, "==1)")) for $res in $idx/* return <match name="{ $res/@name }" url="{ $res/@url }" modification-time="{ $res/@mtime-display }" > </match> } </data>
Keeping a db4 Cache Hot with rsync
Both XML and db4 can be seen as filesystems with libferris, so you can keep a db4 file up to date with an XML file using the standard rsync(1) tool. In order to do this you need to expose libferris as a Filesystem in Userspace (FUSE) filesystem both as the source and destination for rsync. The tool to expose libferris through FUSE is called ferrisfs.
As rsync supports extended attributes with the -X
command-line option, XML
attributes can be synced to those in the db4 file cache. Filesystems expect (extended)
attributes that users can store information into to be prefixed with "user.". This
creates a
very simple namespacing of (extended) attributes with "system." attributes carrying
restrictions on who can set them.
This filesystem implementation detail is taken care of with the
--prepend-user-dot-prefix-to-ea-regex
option to ferrisfs. As the XML file
does not conform to the "user." namespace restriction the
--prepend-user-dot-prefix-to-ea-regex
option can be used to have ferrisfs do
some name space marshaling. For example, an XML attribute such as id
will be
reported by ferrisfs as user.id
at the input filesystem. At the destination end
the "user." prefix will be automatically stripped again; rsync itself only sees the
"user.x"
extended attributes in both filesystems and everybody is happy.
The --show-ea-regex
option is used to tell ferrisfs which (extended)
attributes are reported to rsync as existing. This means that any attributes in the
XML file
not matching this regular expression are not synced to the db4 file.
fcreate --create-type=db4 --rdn=customers.db . mkdir -p customers mkdir -p input ferrisfs --prepend-user-dot-prefix-to-ea-regex='.*' -u `pwd`/customers.db customers ferrisfs --prepend-user-dot-prefix-to-ea-regex='.* \ --show-ea-regex='(id|givenname|familyname)' -u `pwd`/customers.xml/customers input rsync -avzX --delete-after input/ customers/ db_dump -p customers.db ... fusermount -u input fusermount -u customers
Summing Up
The combination of XQilla and libferris allows you to combine access to the many filesystems that are supported by libferris within a single XQuery. In the case of db4 and XML, you can select between the two formats to gain the performance you desire with very little change to the XQuery itself.
Other interesting data sources that libferris makes available include rdf DB (as created with the redland library) and direct queries of what is shown in Firefox. Unfortunately, examples of these will have to wait for another article.
Additional Resources
- The libferris homepage
- XQuery articles on XML.com: Getting Started with XQuery and Getting Started with XQuery, Part 2
- The rsync homepage
- XML-related Linux Journal articles: The World Is a libferris Filesystem and xsltfs:// described in Virtual Filesystems Are Virtual Office Documents
- Geotagging files with libferris and Google Earth at Linux.com
- The XQilla homepage