A Tour of 4Suite
October 16, 2002
Mike Olson and I began the 4Suite project in 1998 with the release of 4DOM, and it quickly picked up an XPath and XSLT implementation. It has grown to include Python implementations of many other XML technologies, and it now provides a large library of Python APIs for XML as well as an XML server and repository system. In this article and the next, I'll introduce just the basic Python library portion of 4Suite, which includes facilities for XML parsing (complementing PyXML), RELAX NG, XPath, XPatterns, XSLT, RDF, XUpdate and more. If you are unfamiliar with any of these technologies, see the resources section at the end where I provide relevant pointers. Finally, after reviewing 4Suite, I'll summarize events in the Python-XML world since the last article.
Getting and installing 4Suite
In the general case, the only prerequisite for 4Suite is Python 2.1 or more recent. PyXML is required if you wish to parse XML in DTD validation mode, or if your Python install does not have pyexpat built in (many Python distributions do). If you need to install PyXML for these reasons, see this column's previous article.
You can get 4Suite from the project download page or from SourceForge. Get the latest 0.12.0 release. I highly recommend it over the older 0.11.1, even though 0.12.0 is still in testing. There has been a full redesign and many important changes which, in effect, increase stability. Windows users can just download and run the Windows executables. On other platforms (or for Windows power users), building and installing 4Suite is a matter of the standard distutils magic. After unpacking, change to the generated directory and run python setup.py install.
One useful option to the setup command is --without-docs. By default, the 4Suite build generates a large amount of documentation, and this can take a long time on some machines. It may be convenient for you to download the provided documentation packages separately and to use python setup.py install --without-docs to speed things up. 4Suite power users who install from CVS versions will find the opposite: documentation is not built by default, and the --with-docs option is needed to build it.
Basic parsing
Parsing in 4Suite revolves around two protocols: readers and input sources. Input sources, usually based on the class Ft.Xml.InputSource.InputSource, are similar to input source objects in Python/SAX or DOM Level 3 Load and Save. They embody a stream of bytes that make up an XML document or the like, encapsulating the base URI associated with the data and some parsing preferences such as whether to process XIncludes. Reader objects actually provide methods for the XML parsing and are usually based on the classes Ft.Xml.Domlette.ValidatingReaderBase and Ft.Xml.Domlette.NonvalidatingReaderBase. Most users only need to worry about using singleton instances of these readers, which are provided for convenience. Parsing XML is as simple as the examples in listing 1, which parse XML obtained from a Web server, from the file system, and then from a simple string.
Listing 1: Several examples of XML parsing
#NonvalidatingReader is a global singleton
from Ft.Xml.Domlette import NonvalidatingReader
#Parse XML from the Web...
doc = NonvalidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")
#From the file system using an absolute path...
doc = NonvalidatingReader.parseUri("file:/tmp/spam.xml")
#From the file system, using a relative path...
doc = NonvalidatingReader.parseUri("file:spam.xml")
#from a string
doc = NonvalidatingReader.parseString(
    "<spam xmlns:x='http://spam.com'>eggs</spam>",
    "http://spam.com/base"
)
Notice the second parameter in the call to parseString. This is a base URI to use for the string. In 4Suite, the base URI of any source of XML is a very important property. It is used internally to manage the XML resources being processed, so you should provide a sensible and unique base URI for each XML source you parse, even those, such as strings and file-like objects, which might not have naturally associated URIs. Remember that URIs are a superset of URLs. For most common uses, plain URLs, including file URLs, are perfectly good enough. In the parseUri method call, the URI from which the XML is parsed is naturally assumed as the base URI of the resulting parsed XML. When using any other parsing method, you should provide the URI explicitly, as in the example above. If you wish to use DTD validation while parsing, replace the NonvalidatingReader references in the example with ValidatingReader.
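To see why the base URI matters, consider how a relative reference found inside a document (an XInclude href, say) is turned into an absolute address. The following is only a minimal sketch: it uses urljoin from the modern Python 3 standard library rather than 4Suite's own resolver, and the helper name resolve and the file name eggs.xml are invented for illustration.

```python
from urllib.parse import urljoin


def resolve(base_uri, reference):
    """Resolve a possibly relative URI reference against a base URI.

    This is a stand-in for what an XML processor does internally;
    it is not part of the 4Suite API.
    """
    return urljoin(base_uri, reference)


# The same relative reference resolves differently depending on the
# base URI that was supplied when the source was parsed.
print(resolve("http://spam.com/base", "eggs.xml"))
print(resolve("file:///tmp/spam.xml", "eggs.xml"))
```

Two strings parsed with the same base URI would resolve their relative references identically, which is one reason to give each source its own sensible base URI.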
There are many options, elaborations, and nuances to the parsing tools I've introduced here. You can configure almost all aspects of the parsing. The doc object obtained from the various parsing methods in listing 1 is a DOM node instance from either the cDomlette or FtMinidom implementations. cDomlette is a very fast and compact DOM written in C, and is the default on platforms that support it; FtMinidom is an enhanced version of Python's minidom. You can perform most DOM operations on either type of node.
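As a rough illustration of the kind of DOM operations meant here, the sketch below parses the same string used in listing 1 and inspects the result. It assumes Python 3 and the standard library's xml.dom.minidom as a stand-in, since the 4Suite node classes are not needed to show the idea; the function name describe is hypothetical.

```python
from xml.dom import minidom


def describe(xml_text):
    """Parse a string and report a few basic DOM properties of the root.

    Uses the standard library minidom, not cDomlette or FtMinidom.
    """
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    return {
        "name": root.nodeName,                # tag name of the root element
        "text": root.firstChild.data,         # data of the first child (a text node here)
        "child_count": len(root.childNodes),  # number of direct children
    }


info = describe("<spam xmlns:x='http://spam.com'>eggs</spam>")
print(info)
```

The same property names (documentElement, nodeName, childNodes, firstChild) come from the W3C DOM, so code written this way should carry over to other DOM implementations with few changes.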
RELAX NG
If DTDs don't suit your needs, 4Suite provides another option: RELAX NG. 4Suite incorporates Eric van der Vlist's XVIF implementation, which is basically RELAX NG with some very useful extensions. RELAX NG validation is not built into the default readers, but it is easy enough to do as a separate step, as shown in listing 2.
Listing 2: Using RELAX NG
#RELAX NG schema file
RNG = """<?xml version='1.0' encoding='UTF-8'?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="memo">
      <element name="title">
        <text/>
      </element>
      <element name="date">
        <attribute name="form">
          <text/>
        </attribute>
        <text/>
      </element>
      <element name="to">
        <text/>
      </element>
      <element name="body">
        <text/>
      </element>
    </element>
  </start>
</grammar>
"""
#Instance document
DOC = """<?xml version='1.0' encoding='UTF-8'?>
<memo>
  <title>With Usura Hath no Man a House of Good Stone</title>
  <date form="ISO-8601">1936-04-03</date>
  <to>The Art World</to>
  <body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
  </body>
</memo>"""
from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
factory = InputSource.DefaultFactory
rng_isrc = factory.fromString(RNG, "file:example2.rng")
xml_isrc = factory.fromString(DOC, "file:example2.xml")
validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(xml_isrc)
if result:
    print "Valid"
else:
    print "Invalid"
The RELAX NG APIs, like many in 4Suite, take input source objects, though they usually have convenience APIs to pass in strings, URIs, or even prepared DOM nodes. Rather than use a reader object directly to parse the XML strings, I create an input source for each. I do so using an input source factory, which has methods for generating input sources from strings, URIs, and so on. The Ft.Xml.Xvif.RelaxNgValidator class represents a RELAX NG schema, which is read from the input source given in the initializer. The validator can then be used to validate any number of XML instance documents, in this case using the isValid method. If you want more detail than a yes-or-no to validity, you can use the validate method, which returns a special object with some validation details.
A RELAX NG alternative
Andrew Kuchling also has a partial RELAX NG implementation for Python. It's in the PyXML project's CVS repository but is not distributed with the PyXML package yet. It supports less of the RELAX NG standard than XVIF, but it is still useful. If you want to try it, grab the sandbox module of PyXML using the following commands, or their equivalent in your CVS environment of choice:
cvs -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml login
cvs -z3 -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml co sandbox
Look in the directory sandbox/relaxng. It is not clear right now whether the two RELAX NG implementations will ever merge, or whether they will continue to develop separately as mutual alternatives.
XPath and XPatterns
XPath is everywhere. It has established itself as the workhorse of XML processing. The XPath engine is one of the parts of 4Suite that has had the most development and exercise. Much of it is implemented in C for performance's sake, and this is one of the key differences between the XPath library in current 4Suite and that in PyXML, which is based on an older release of 4XPath and is almost entirely in Python. The easiest way to use the XPath library is through the functions in Ft.Xml.XPath. Listing 3 defines a function for extracting the title from any given XHTML 1.0 file, using XPath.
Listing 3: A function for extracting HTML titles
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader
XHTML_NS = "http://www.w3.org/1999/xhtml"
#compile the XPath for retrieving XHTML titles
TITLE_EXPR = Compile("string(/h:html/h:head/h:title)")
def extract_xhtml_title(uri):
    """Extract the title from the XHTML document at the given URI"""
    doc = NonvalidatingReader.parseUri(uri)
    #set up the context with the XHTML document node
    #and namespace mapping from the "h" prefix to the XHTML URI
    context = Context(doc, processorNss={"h": XHTML_NS})
    #Compute the XPath against the context
    title = TITLE_EXPR.evaluate(context)
    return title
The Context class is a very important one. During XPath processing, it maintains a lot of state information, including the context items defined in the XPath spec. The most important item in the context is the context node, which I set to the document node of the XHTML file. In this case, I also use the context to hold the namespace mapping from the "h" prefix, which I use in the expression, to the XHTML namespace. At the global level, I compile the XPath expression, which is similar to compiling a regular expression using re.compile(). The result is a parsed XPath object which has an evaluate method taking a plain node object or a full context object. The return value is a Python equivalent of one of the four XPath data types. Strings are returned as Python Unicode objects, numbers as Python floats, booleans as instances of a special boolean class, and node sets as Python lists of node objects. The XPath expression above returns a string, which is directly returned to the caller as the requested title.
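For readers without 4Suite at hand, the prefix-to-namespace mapping idea can be tried with the Python 3 standard library. This is only a sketch under stated assumptions: xml.etree.ElementTree supports a small subset of XPath (no string() function, for instance), it takes the mapping as a plain dictionary rather than a Context object, and the sample document and the name SAMPLE are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Prefix-to-namespace mapping, playing the role of processorNss in listing 3
NSS = {"h": XHTML_NS}

# Hypothetical sample document, not taken from the article
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Spam and eggs</title></head><body/></html>"
)


def extract_xhtml_title(xhtml_text):
    """Return the title text of an XHTML document given as a string."""
    root = ET.fromstring(xhtml_text)
    # root is already the h:html element, so the path is relative to it.
    # findtext returns the default when no element matches.
    return root.findtext("h:head/h:title", default="", namespaces=NSS)


title = extract_xhtml_title(SAMPLE)
print(title)
```

As in listing 3, the "h" prefix is purely local to the query; the document itself uses a default namespace with no prefix at all.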
XSLT defines XPattern, a variation on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The XPattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XPatterns are useful when the task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.
- XPath task: extract the class attribute from all the child elements of the context node
- XPattern task: given a list of nodes, sort them into piles of those that have a class attribute and those that have a title child
The main API for XPattern processing in 4Suite is Ft.Xml.Xslt.PatternList. Listing 4 is a code snippet that takes a node and returns a list of patterns it matches.
Listing 4: Use XPatterns to quickly determine which patterns match which nodes
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader
#first pattern matches elements with a class attribute
#the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]
#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})
DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>
"""
doc = NonvalidatingReader.parseString(DOC, "file:example4.xml")
for node in doc.documentElement.childNodes:
    #Don't forget that the white space text nodes before and after
    #e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)
The PatternList initializer takes my list of strings, which it conveniently converts to a list of compiled XPattern objects. Such objects have a match method that returns a boolean value, but I don't use these methods directly in this example. The PatternList initializer also takes a dictionary that makes up the namespace mapping. In this example, I use no namespaces, so the dictionary is empty. The lookup method is applied to a selection of the children of the spam element (all the nodes whose name starts with "e", which happens to be all the element nodes). The output of listing 4 follows:
[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]
The output is a list of the representations of the pattern objects that matched each node. Notice how the axis abbreviations have been expanded in the pattern object representation.
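To make the matching idea concrete without 4Suite installed, here is a minimal sketch in which hand-written predicate functions stand in for compiled patterns. It assumes Python 3 and the standard library's minidom; the names has_class_attribute, has_title_child, lookup and match_children are all hypothetical and are not part of the PatternList API.

```python
from xml.dom import minidom

DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>
"""


def has_class_attribute(node):
    """Rough equivalent of the pattern *[@class]."""
    return node.hasAttribute("class")


def has_title_child(node):
    """Rough equivalent of the pattern *[title]."""
    return any(
        child.nodeType == child.ELEMENT_NODE and child.nodeName == "title"
        for child in node.childNodes
    )


# Hypothetical stand-ins for compiled patterns: (label, predicate) pairs
PATTERNS = [
    ("*[@class]", has_class_attribute),
    ("*[title]", has_title_child),
]


def lookup(node):
    """Return the labels of every pattern the node matches."""
    return [label for label, matches in PATTERNS if matches(node)]


def match_children(xml_text):
    """Map each child element of the root to the patterns it matches."""
    doc = minidom.parseString(xml_text)
    results = {}
    for node in doc.documentElement.childNodes:
        # Skip the white space text nodes between the elements
        if node.nodeType == node.ELEMENT_NODE:
            results[node.nodeName] = lookup(node)
    return results


matches = match_children(DOC)
print(matches)
```

The point of the sketch is the direction of the question: each node is tested against a fixed set of rules, rather than an expression being evaluated to compute a value from a context node.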
Sometimes the built-in facilities of XPath and XPattern aren't quite enough to meet your processing needs. Luckily it's easy to extend the function of these libraries using XPath user extension functions, which are written in Python. I don't cover extension functions in this article, but the resources section has pointers to useful information if you need this facility.
Python-XML Happenings
Here is a brief rundown of new happenings relevant to Python-XML development, including significant software releases.
David Mertz announced the 1.0.4 release of gnosis XML tools. This package provides tools for converting Python objects to XML documents and vice versa, DTD to SQL conversions, and more.
Brian Quinlan announced Pyana 0.6.0. Pyana is a Python extension module for interface to the Xalan XSLT engine.
Eric van der Vlist announced XVIF 0.2.0. XVIF includes a full RELAX NG validator for Python and adds in an XML processing framework system Eric developed as a straw man for ISO DSDL. The new release adds a data typing framework and a partial WXS data types library. It also features improved internals and API.
Henry Thompson announced a new release of XSV, a Python implementation of W3C XML Schema (WXS) which also runs the W3C's on-line WXS validator service. This release features a major restructuring of the code.
Frank Tobin announced a lightweight Python module to help write out well-formed XML. xmlprinter is inspired by Perl's XML::Writer module.
Daniel Veillard announced the 1.0.21 release of libxslt, with improved Python bindings, among other things.
4Suite 0.12.0a3, the version I introduce in this article, has been released. Among many other changes and improvements, it includes the latest XVIF.
Resources