Using libferris with XML
March 31, 2004
This article presents the benefits of using libferris with your XML applications. libferris presents a uniform interface to hierarchical data. This data can be persisted using many providers including the filesystem, an RDBMS, or even XML. All the data providers in libferris are made available using a filesystem metaphor: for example, MySQL tables can be seen using ferrisls on a "mysql://host/database/table" URL.
The two core abstractions in libferris are the Context and the Extended Attribute (EA). You can think of the Context as a directory or file in a filesystem or as the combination of an element and its child text node(s) in XML. You can think of an EA as an XML attribute. The largest difference between an EA and an XML attribute is that the value of an EA can be either stored or generated at runtime.
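The difference between a stored and a generated EA can be sketched in a few lines of Python. This is a toy model for illustration only: the `Context` class, `set_attribute`, and `get_attribute` below are hypothetical names and are not the libferris API.

```python
from typing import Callable, Dict, Union

# Hypothetical model for illustration; none of these names come from libferris.
AttributeValue = Union[str, Callable[["Context"], str]]


class Context:
    """A node with a name, content, children, and extended attributes."""

    def __init__(self, name: str, content: bytes = b"") -> None:
        self.name = name
        self.content = content
        self.children: Dict[str, "Context"] = {}
        self._attributes: Dict[str, AttributeValue] = {}

    def set_attribute(self, key: str, value: AttributeValue) -> None:
        # A plain string is a stored EA; a callable is a generated EA.
        self._attributes[key] = value

    def get_attribute(self, key: str) -> str:
        value = self._attributes[key]
        # Generated attributes are computed at the moment they are read.
        return value(self) if callable(value) else value


doc = Context("notes.txt", b"hello world")
doc.set_attribute("author", "ben")                            # stored
doc.set_attribute("size", lambda ctx: str(len(ctx.content)))  # generated
print(doc.get_attribute("author"), doc.get_attribute("size"))  # ben 11
```

A caller reads both kinds of attribute through the same call, which is the property the EA abstraction relies on.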
There are several benefits of using libferris with your XML applications.
- Access to large amounts of metadata such as:
  - width, height, depth, gamma and pixel data of images
  - width, height of video files
  - camera and thumbnail information from jpeg/exif files
  - native on disk EA from XFS filesystems
  - ID3 Artist information from music files
  - RDF repositories
  - Text representations of HTML, man, pdf and djvu images. The text version is created lazily upon request and accessed through the metadata interface via the "as-text" EA.

  The metadata system is easily extensible.
- Location independent metadata storage. You can create new attributes for any Context. For example, you can add some metadata to a row in the result set of an SQL query or store data about a web site at its file. When metadata cannot be stored in the filesystem itself, it's stored locally as RDF and made transparently accessible again for the file that it's attached to. Such a system allows a single metadata interface to be used for all filesystem objects regardless of data location or updatability of the data. All metadata can be exported by libferris either as raw XML or as an RDF/XML file via the "as-xml" and "as-rdf" EAs respectively.
- Abstraction from specific data serializations. Many formats are supported by libferris:
  - Standard XML file
  - Sleepycat dbXML native XML database
  - Relational database
  - RDF/XML or RDF/bdb graph
  - ISAM files: db4, tdb, gdbm
  - network data using http, ftp, or IPC file
  - network accessed via ssh2
  - mbox file
  - On an LDAP server
  - Composite files such as inside tar.gz files
  - Part of a native kernel filesystem
  - Anything which you have an EBNF file for. A custom filesystem plugin can be created to extract a custom format into a filesystem representation. Such filesystems are already used internally to mount LDAP search specifications and full text index queries (see plugins/context/pccts in the ferris tarball).
- Access to the results of a query. Since queries in libferris return a virtual filesystem, the results can be exposed as a DOM too. Various indexing services are available in libferris, including fulltext indexing and attribute indexing, to speed retrieval. An SQL query may also be submitted to a database and the result will be given as a filesystem.
- Mutation of filesystems. Various filesystem decorators exist which are layered on top of another filesystem. For example, you can apply a nested stable sort to a filesystem and access the result through the same API.
- Command line and graphical data manipulation. Many of the POSIX fileutils have been rewritten to operate on a libferris filesystem. For example, you can view a dbXML file using:

  ferrisls ~/sleepycat.dbxml
- Ability to get a Xerces-C DOM from a libferris filesystem. This DOM is lazily evaluated: the DOM for /home/ben will only have nodes created for the parts that are actually accessed. Note that this particular functionality is not fully implemented yet, though it is possible to run XSLT on the DOM wrapper.
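The lazy evaluation mentioned in the last item can be illustrated with a small Python sketch over an ordinary directory tree. It is an analogy only: `LazyNode` is a hypothetical class and does not reflect how the libferris DOM wrapper is implemented.

```python
import os
import tempfile


class LazyNode:
    """Tree node whose children are built only on first access."""

    created = 0  # counts nodes built so far, to make the laziness observable

    def __init__(self, path: str) -> None:
        self.path = path
        self._children = None
        LazyNode.created += 1

    @property
    def children(self):
        if self._children is None:
            if os.path.isdir(self.path):
                names = sorted(os.listdir(self.path))
                self._children = [LazyNode(os.path.join(self.path, n)) for n in names]
            else:
                self._children = []
        return self._children


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "deep"))
    os.makedirs(os.path.join(root, "b"))
    tree = LazyNode(root)
    built_before = LazyNode.created   # only the root node exists so far
    top = [os.path.basename(c.path) for c in tree.children]
    built_after = LazyNode.created    # root, a and b; "deep" is never touched
    print(built_before, top, built_after)  # 1 ['a', 'b'] 3
```

Nothing below the first level is read until a caller asks for it, which keeps the cost proportional to the part of the tree that is actually visited.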
POSIX Command Line Replacements
The replacement directory listing command ferrisls supports an output mode, --xml, which is similar to ls -l except that the output is a well-formed XML document. This can be combined with the extended attribute handling in libferris to export interesting metadata. For example, a stylesheet might be interested in the width and height of an image. The following command will retrieve an image file and present the selected attributes as an XML document. The --show-ea parameter tells ferrisls which EAs it should list in the output.
$ ferrisls -ld --xml \
  --show-ea="name,size-human-readable,width,height" \
  http://witme.sf.net/libferris.web/images/project.png
The output of the above command when run from a machine with Internet access follows (formatted to fit XML.com).
<ferrisls>
  <ferrisls url="http:///witme.sf.net/libferris.web/images/project.png" name="project.png" >
    <context name="project.png" size-human-readable="20.0k" width="640" height="60" />
  </ferrisls>
</ferrisls>
There is a nested ferrisls element because ferrisls can list many locations during a single invocation, and so the top level ferrisls is always added to ensure a unique root node.
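An application consuming this output has to walk through both ferrisls levels to reach the context elements. The following Python sketch, using only the standard library, shows one way to do that. The sample document is the output shown above; the helper name `image_sizes` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The ferrisls --xml output shown above, reproduced as a string.
SAMPLE = """
<ferrisls>
  <ferrisls url="http:///witme.sf.net/libferris.web/images/project.png"
            name="project.png">
    <context name="project.png" size-human-readable="20.0k"
             width="640" height="60" />
  </ferrisls>
</ferrisls>
"""


def image_sizes(xml_text):
    """Return {name: (width, height)} for every listed context."""
    root = ET.fromstring(xml_text)            # the outer, always-present root
    sizes = {}
    for listing in root.findall("ferrisls"):  # one inner element per listed location
        for ctx in listing.findall("context"):
            sizes[ctx.get("name")] = (int(ctx.get("width")), int(ctx.get("height")))
    return sizes


print(image_sizes(SAMPLE))  # {'project.png': (640, 60)}
```

Because the outer element is always present, the same loop works whether one location or several were listed.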
We will now create a Sleepycat native XML database and populate it from the command line. New filesystem objects are created using either the console fcreate or the GTK+2 graphical gfcreate tools. These are distributed in the ferriscreate package. We will use fcreate to avoid the GUI in the creation process. We pass the minimum useful information to fcreate, telling it the type of object to make, its filename (the Relative Domain Name or rdn), and the path at which to create the new object.
$ rm -rf /tmp/xmlcom_ferris
$ mkdir /tmp/xmlcom_ferris
$ fcreate --create-type dbxml \
  --rdn mycollection.dbxml /tmp/xmlcom_ferris
$ ferriscp --dst-is-dir -v \
  /tmp/input.xml /tmp/xmlcom_ferris/mycollection.dbxml
We take the resulting XML from the ferrisls --xml command and put it into /tmp/input.xml to import into the dbXML database. The subtle trick in the ferriscp command is the --dst-is-dir option, which tells libferris to treat the dbXML file itself as a directory for this operation. Otherwise the normal semantics of copying the XML into the mycollection.dbxml file itself would apply: without --dst-is-dir, mycollection.dbxml would contain a byte copy of input.xml. With --dst-is-dir, mycollection.dbxml remains a dbXML file and contains a copy of input.xml as an object in its database.
Now we can access input.xml directly from the dbXML database using the fcat command, list the entire database using ferrisls, and generate the MD5 checksum for each XML file as we go.
$ fcat /tmp/xmlcom_ferris/mycollection.dbxml/input.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls dbxml:id="1" dbxml:name="input.xml" name="project.png" url="http:///witme.sf.net/libferris.web/images/project.png" xmlns:dbxml="http://www.sleepycat.com/2002/dbxml">
  <context height="60" name="project.png" size-human-readable="20.0k" width="640" />
</ferrisls>
$ ferrisls -lh --show-ea="name,md5" /tmp/xmlcom_ferris/mycollection.dbxml
input.xml 6976f06b, 77827e2e, 74a8ca80, 9420052d
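The document returned by fcat carries two extra attributes in the dbxml namespace. A consumer that wants them needs a namespace-aware lookup. The Python sketch below reads them with the standard library; the sample is the fcat output above with the XML declaration left off so that it can be embedded as a string.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:dbxml declaration in the fcat output.
DBXML_NS = "http://www.sleepycat.com/2002/dbxml"

STORED = """<ferrisls dbxml:id="1" dbxml:name="input.xml" name="project.png"
  url="http:///witme.sf.net/libferris.web/images/project.png"
  xmlns:dbxml="http://www.sleepycat.com/2002/dbxml">
  <context height="60" name="project.png" size-human-readable="20.0k" width="640" />
</ferrisls>"""

root = ET.fromstring(STORED)

# ElementTree addresses namespaced attributes as "{namespace-uri}localname".
doc_id = root.get(f"{{{DBXML_NS}}}id")
doc_name = root.get(f"{{{DBXML_NS}}}name")

# The un-prefixed "name" attribute is a different attribute and is unaffected.
original_name = root.get("name")

print(doc_id, doc_name, original_name)  # 1 input.xml project.png
```

The plain `name` attribute from the original listing and the `dbxml:name` attribute do not collide, because the latter lives in its own namespace.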
Support for resolving XPath 1.0 expressions has recently been added to libferris using the "pathan" library. A small directory tree is set up to illustrate:
$ cd /tmp
$ mkdir xmlcomxp
$ for i in `seq 1 3 10`; do touch xmlcomxp/foo$i.xml; done
$ touch xmlcomxp/plain.txt
The URI style of scheme:// is bent slightly for the xpath URI scheme in libferris in that everything after the colon forms part of the XPath expression. This is done to allow the leading // in XPath to still be used to explore the entire tree. The top level filesystem items in the xpath:/ filesystem are all the other filesystem types; for example, the file:// URI scheme is represented by the file top level directory.
$ ferrisls -l \
  'xpath:/file/tmp/xmlcomxp/*[@name-extension=".xml" and @size<200]'
-rw-rw---- ben ben 0 04 Jan 20 01:10 foo1.xml
-rw-rw---- ben ben 0 04 Jan 20 01:10 foo10.xml
-rw-rw---- ben ben 0 04 Jan 20 01:10 foo4.xml
-rw-rw---- ben ben 0 04 Jan 20 01:10 foo7.xml
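To make the meaning of the predicate concrete, here is a rough Python equivalent of the selection, written against the ordinary filesystem rather than libferris. It rebuilds the same small directory in a temporary location and keeps entries whose extension is ".xml" and whose size is below 200 bytes. The function name `select` is invented for this sketch.

```python
import os
import tempfile


def select(directory, extension, max_size):
    """Names of entries with the given extension and size below max_size, sorted."""
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            ext = os.path.splitext(entry.name)[1]
            if ext == extension and entry.stat().st_size < max_size:
                matches.append(entry.name)
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 11, 3):  # same values as `seq 1 3 10`: 1, 4, 7, 10
        open(os.path.join(tmp, f"foo{i}.xml"), "w").close()
    open(os.path.join(tmp, "plain.txt"), "w").close()
    selected = select(tmp, ".xml", 200)
    print(selected)  # ['foo1.xml', 'foo10.xml', 'foo4.xml', 'foo7.xml']
```

plain.txt is excluded by the extension test, and the plain string sort puts foo10.xml before foo4.xml, as in the listing above.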
The only relational database that is currently accessible with the open source version of libferris is MySQL. The user name and password to use for each server are set up using the ferris-capplet-auth graphical tool rather than by embedding authentication information into URLs directly. The capplet allows you to test each authentication setting to make sure it's acceptable.
Once the appropriate authentication is given, libferris can be used to explore and export relational data. Listing the top level mysql URL scheme will show you hosts which are currently known. Listing a host shows you the databases on that host.
$ ferrisls mysql://
localhost
$ ferrisls mysql://localhost
... exphpresso ...
$ ferrisls mysql://localhost/exphpresso
coffees comments definition types
If you've entered authentication information for remote databases then you can list them with ferrisls as though they existed in mysql://; libferris will connect to them and create the appropriate file. For example, with this command a connection to the server foo is created and the databases on it are listed:
$ ferrisls mysql://foo
... stocks ...
The EA interface in libferris also presents interesting metadata about the filesystem itself. One such attribute is recommended-ea. This is a comma separated list of attributes which a Context thinks are interesting for viewing its child Contexts. For relational databases, recommended-ea contains an entry for each column name in the table or query result set. One can tell ferrisls to present the recommended-ea by adding -0 to the command line. Using --xml implies that the recommended-ea will be shown.
$ ferrisls -0v mysql://localhost/exphpresso/coffees
1 Classics Americano One shot of ...
2 2lassicsid Classic Espresso One shot of ...
...
$ ./ferrisls --xml mysql://localhost/exphpresso/coffees
<ferrisls>
  <ferrisls url="mysql:///localhost/exphpresso/coffees" name="coffees" >
    <context id="1" coffee_type="Classics" coffee_name="Americano" coffee_details=" One shot of expresso brewed..." name="1" primary-key="id" />
    ...
  </ferrisls></ferrisls>
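Since each row arrives as a context element with one XML attribute per column, turning the export into in-memory records takes very little code. The Python sketch below assumes, based only on the single row shown above, that the primary-key attribute names the column holding the row's key. The sample contains just that one row, with its elided text kept as printed, and `rows_by_key` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# One row of the ferrisls --xml export shown above.
SAMPLE = """
<ferrisls>
  <ferrisls url="mysql:///localhost/exphpresso/coffees" name="coffees">
    <context id="1" coffee_type="Classics" coffee_name="Americano"
             coffee_details=" One shot of expresso brewed..."
             name="1" primary-key="id" />
  </ferrisls>
</ferrisls>
"""


def rows_by_key(xml_text):
    """Map each row's primary key value to a dict of its attributes."""
    rows = {}
    for ctx in ET.fromstring(xml_text).iter("context"):
        attrs = dict(ctx.attrib)
        # Assumption: "primary-key" names the attribute that holds the key.
        key_column = attrs.pop("primary-key")
        rows[attrs[key_column]] = attrs
    return rows


table = rows_by_key(SAMPLE)
print(table["1"]["coffee_name"])  # Americano
```

All values come back as strings, so a real consumer would still need to convert numeric columns itself.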
The next command uses XPath to select the rows of the coffees table whose coffee_name starts with a given prefix and then sorts the results by coffee_name. The sorting specification given to --ferris-sort allows arbitrary nesting of sorts, as well as reverse, floating, case insensitive and version sorting. The URL for the displayed context is selectionfactory://, which is a filesystem designed to hand around a collection of links to other filesystems. In this case it is a selection of rows from a table, but it can pass arbitrary data around.
$ ferrisls --xml --ferris-sort="coffee_name" \
  'xpath:/mysql/localhost/exphpresso/coffees/*[starts-with(@coffee_name,"Cafe")]'
<ferrisls>
  <ferrisls url="selectionfactory://" name="1" >
    <context id="28" coffee_name="Cafe Brulot" ... />
    <context id="31" coffee_name="Cafe Diablo" ... />
  </ferrisls></ferrisls>
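The select-then-sort behaviour can be mimicked in Python to show what a nested sort means. This is a sketch of the idea, not of the libferris sorting syntax: the rows are the four coffee names and ids that appear in the listings above, and `select_and_sort` is a hypothetical helper.

```python
# Sample rows built from the ids and names shown in the listings above.
rows = [
    {"id": 31, "coffee_name": "Cafe Diablo"},
    {"id": 1, "coffee_name": "Americano"},
    {"id": 28, "coffee_name": "Cafe Brulot"},
    {"id": 2, "coffee_name": "Classic Espresso"},
]


def select_and_sort(rows, prefix, keys):
    """Keep rows whose coffee_name starts with prefix, then apply a nested sort.

    keys is a list of (column, reverse) pairs, most significant first.
    """
    selected = [r for r in rows if r["coffee_name"].startswith(prefix)]
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a nested sort.
    for column, reverse in reversed(keys):
        selected.sort(key=lambda r: r[column], reverse=reverse)
    return selected


result = select_and_sort(rows, "Cafe", [("coffee_name", False)])
print([r["id"] for r in result])  # [28, 31]
```

Passing more than one pair in `keys` gives the nested behaviour, and setting the second element of a pair to `True` reverses that level.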
Note that the XPath query is not converted to SQL for execution, so it is more expensive to execute than embedding SQL with libferris.
Summary
I've tried to give an overview of what is possible with libferris and XML and highlight some of the areas where libferris can remove boundaries. If you've enjoyed reading about libferris please consider making a contribution to the project.