A Tour of 4Suite

October 16, 2002

Mike Olson and I began the 4Suite project in 1998 with the release of 4DOM, and it quickly picked up an XPath and XSLT implementation. It has grown to include Python implementations of many other XML technologies, and it now provides a large library of Python APIs for XML as well as an XML server and repository system. In this article and the next, I'll introduce just the basic Python library portion of 4Suite, which includes facilities for XML parsing (complementing PyXML), RELAX NG, XPath, XPatterns, XSLT, RDF, XUpdate and more. If you are unfamiliar with any of these technologies, see the resources section at the end where I provide relevant pointers. Finally, after reviewing 4Suite, I'll summarize events in the Python-XML world since the last article.

Getting and installing 4Suite

In the general case, the only prerequisite for 4Suite is Python 2.1 or more recent. PyXML is required if you wish to parse XML in DTD validation mode, or if your Python install does not have pyexpat built in (many Python distributions do include it). If you need to install PyXML for these reasons, see this column's previous article.
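If you are not sure whether your Python install has pyexpat built in, a quick check along the following lines may help. This is only a minimal sketch using the standard library; the helper name has_pyexpat is made up for illustration and is not part of 4Suite or PyXML.

```python
def has_pyexpat():
    """Return True if this Python install can import the pyexpat module."""
    try:
        import pyexpat  # the C extension that wraps the Expat parser
    except ImportError:
        return False
    return True


if has_pyexpat():
    print("pyexpat is available")
else:
    print("pyexpat is missing; consider installing PyXML")
```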

You can get 4Suite from the project download page or from SourceForge. Get the latest 0.12.0 release. I highly recommend it over the older 0.11.1, even though 0.12.0 is still in testing. There has been a full redesign, along with many important changes that, in effect, increase stability. Windows users can just download and run the Windows executables. On other platforms (or for Windows power users), building and installing 4Suite is a matter of the standard distutils magic. After unpacking, change to the generated directory and run python setup.py install.

One useful option to the setup command is --without-docs. By default, the 4Suite build generates a large amount of documentation, and this can take a long time on some machines. It may be convenient for you to download the provided documentation packages separately and to use python setup.py install --without-docs to speed things up. 4Suite power users who install from CVS versions will find the opposite: documentation is not built by default, and the --with-docs option is needed to build it.

Basic parsing

Parsing in 4Suite revolves around two protocols: readers and input sources. Input sources, usually based on the class Ft.Xml.InputSource.InputSource, are similar to input source objects in Python/SAX or DOM Level 3 Load and Save. They embody a stream of bytes that make up an XML document or the like, encapsulating the base URI associated with the data and some parsing preferences such as whether to process XIncludes. Reader objects actually provide methods for the XML parsing and are usually based on the classes Ft.Xml.Domlette.ValidatingReaderBase and Ft.Xml.Domlette.NonvalidatingReaderBase. Most users only need to worry about using singleton instances of these readers, which are provided for convenience. Parsing XML is as simple as the examples in listing 1, which parse XML obtained from a file, from a Web server, and then from a simple string.

Listing 1: Several examples of XML parsing

#NonvalidatingReader is a global singleton
from Ft.Xml.Domlette import NonvalidatingReader

#Parse XML from the Web...
doc = NonvalidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")

#From the file system using an absolute path...
doc = NonvalidatingReader.parseUri("file:/tmp/spam.xml")

#From the file system, using a relative path...
doc = NonvalidatingReader.parseUri("file:spam.xml")

#from a string
doc = NonvalidatingReader.parseString(
        "<spam xmlns:x='http://spam.com'>eggs</spam>",
        "http://spam.com/base"
)

Notice the second parameter in the call to parseString. This is a base URI to use for the string. In 4Suite, the base URI of any source of XML is a very important property. Because it is used internally to manage the XML resources being processed, you should provide a sensible and unique base URI for each XML source you parse, even for sources such as strings and file-like objects, which might not have naturally associated URIs. Remember that URIs are a superset of URLs. For most common uses, plain URLs, including file URLs, are perfectly good enough. In the parseUri method call, the URI from which the XML is parsed is naturally assumed to be the base URI of the resulting parsed XML. When using any other parsing method, you should provide the URI explicitly, as in the example above. If you wish to use DTD validation while parsing, replace the NonvalidatingReader references in the example with ValidatingReader.
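To see why the base URI matters, consider how a relative reference inside a document (say, to a DTD or an included file) is turned into an absolute one. The sketch below uses urljoin from the Python standard library purely as an illustration of that resolution step; it is not 4Suite's own resolver, and the base URI and the resolve helper are made-up examples.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# Hypothetical base URI supplied when parsing from a string
BASE_URI = "http://spam.com/docs/memo.xml"


def resolve(relative_ref, base_uri=BASE_URI):
    """Resolve a relative reference against the document's base URI."""
    return urljoin(base_uri, relative_ref)


# A sibling resource resolves next to the document itself
print(resolve("memo.dtd"))              # http://spam.com/docs/memo.dtd
# A parent-relative reference climbs one directory
print(resolve("../shared/common.xml"))  # http://spam.com/shared/common.xml
```

Without a sensible base URI, references like these have nothing meaningful to resolve against, which is the reason for always passing one to parsing methods other than parseUri.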

There are many options, elaborations, and nuances to the parsing tools I've introduced here. You can configure almost all aspects of the parsing. The doc object obtained from the various parsing methods in listing 1 is a DOM node instance from either the cDomlette or FtMinidom implementations. cDomlette is a very fast and compact DOM written in C, and is the default on platforms that support it; FtMinidom is an enhanced version of Python's minidom. You can perform most DOM operations on either type of node.
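As a rough idea of the kind of DOM operations meant here, the following sketch performs a few of them on a document node. It assumes the standard library's minidom as a stand-in for the 4Suite node types, so details may differ on cDomlette or FtMinidom nodes; the ham element is invented for the example.

```python
from xml.dom import minidom, Node

# Same XML string as in the parseString example
doc = minidom.parseString("<spam xmlns:x='http://spam.com'>eggs</spam>")

# Navigate from the document node to the root element and its text
root = doc.documentElement
text = root.firstChild
print(root.nodeName)                     # spam
print(text.nodeType == Node.TEXT_NODE)   # True
print(text.data)                         # eggs

# Mutate the tree: append a new (hypothetical) child element
extra = doc.createElement("ham")
root.appendChild(extra)
print(len(root.childNodes))              # 2
```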

RELAX NG

If DTDs don't suit your needs, 4Suite provides another option: RELAX NG. 4Suite incorporates Eric van der Vlist's XVIF implementation, which is basically RELAX NG with some very useful extensions. RELAX NG validation is not built into the default readers, but it is easy enough to do as a separate step, as shown in listing 2.

Listing 2: Using RELAX NG

#RELAX NG schema file
RNG = """<?xml version='1.0' encoding='UTF-8'?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="memo">
      <element name="title">
        <text/>
      </element>
      <element name="date">
        <attribute name="form">
          <text/>
        </attribute>
        <text/>
      </element>
      <element name="to">
        <text/>
      </element>
      <element name="body">
        <text/>
      </element>
    </element>
  </start>
</grammar>
"""

#Instance document
DOC = """<?xml version='1.0' encoding='UTF-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>"""


from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
factory = InputSource.DefaultFactory
rng_isrc = factory.fromString(RNG, "file:example2.rng")
xml_isrc = factory.fromString(DOC, "file:example2.xml")

validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(xml_isrc)
if result:
    print "Valid"
else:
    print "Invalid"

The RELAX NG APIs, like many in 4Suite, take input source objects -- though they usually have convenience APIs to pass in strings, URIs, or even prepared DOM nodes. Rather than use a reader object directly to parse the XML strings, I create input sources based on each. I do so using an input source factory, which has methods for generating input sources from string, URI, and so on. The Ft.Xml.Xvif.RelaxNgValidator class represents a RELAX NG schema, which is read from the input source given in the initializer. The validator can then be used to validate any number of XML instance documents, in this case using the isValid method. If you want more detail than a yes-or-no to validity, you can use the validate method, which returns a special object with some validation details.

A RELAX NG alternative

Andrew Kuchling also has a partial RELAX NG implementation for Python. It's in the PyXML project's CVS repository but is not distributed with the PyXML package yet. It supports less of the RELAX NG standard than XVIF, but it is still useful. If you want to try it, grab the sandbox module of PyXML using the following commands, or their equivalent in your CVS environment of choice:

cvs -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml login

cvs -z3 -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml co sandbox

Look in the directory sandbox/relaxng. It is not clear right now whether the two RELAX NG implementations will ever merge, or whether they will continue to develop separately as mutual alternatives.

XPath and XPatterns

XPath is everywhere. It has established itself as the workhorse of XML processing. The XPath engine is one of the parts of 4Suite that has had the most development and exercise. Much of it is implemented in C for the sake of performance, and this is one of the key differences between the XPath library in current 4Suite and that in PyXML, which is based on an older release of 4XPath and is almost entirely in Python. The easiest way to use the XPath library is through the functions in Ft.Xml.XPath. Listing 3 defines a function for extracting the title from any given XHTML 1.0 file, using XPath.

Listing 3: A function for extracting HTML titles

from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader

XHTML_NS = "http://www.w3.org/1999/xhtml"

#compile the XPath for retrieving XHTML titles
TITLE_EXPR = Compile("string(/h:html/h:head/h:title)")

def extract_xhtml_title(uri):
    """Extract the title from the XHTML document at the given URI"""
    doc = NonvalidatingReader.parseUri(uri)
    #set up the context with the XHTML document node
    #and namespace mapping from the "h" prefix to the XHTML URI
    context = Context(doc, processorNss={"h": XHTML_NS})
    #Compute the XPath against the context
    title = TITLE_EXPR.evaluate(context)
    return title

The Context class is a very important one. During XPath processing, it maintains a lot of state information, including the context items defined in the XPath spec. The most important item in the context is the context node, which I set to the document node of the XHTML file. In this case, I also use the context to hold the namespace mapping from the "h" prefix which I use to the XHTML namespace. At the global level, I compile the XPath object, which is similar to compiling a regular expression using re.compile(). The result is a parsed XPath object which has an evaluate method taking a plain node object or a full context object. The return value is a Python equivalent of one of the four XPath data types. Strings are returned as Python Unicode objects, numbers as Python floats, booleans as instances of a special boolean class, and node sets as Python lists of node objects. The XPath expression above returns a string, which is directly returned to the caller as the requested title.
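The role of the prefix-to-namespace mapping is easy to try out even without 4Suite. The sketch below is an assumption-laden stand-in: it uses ElementTree from the modern Python standard library, which supports only a small subset of XPath and has no Context class, but it takes a similar mapping from the "h" prefix to the XHTML namespace. The sample document and the extract_title helper are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document; note the default namespace declaration
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Spam and eggs</title></head>"
    "<body/>"
    "</html>"
)


def extract_title(xhtml_text):
    """Return the text of the XHTML title element, or '' if there is none."""
    root = ET.fromstring(xhtml_text)
    # The path is relative to the html root element; the "h" prefix is
    # bound to the XHTML namespace through the namespaces mapping
    return root.findtext("h:head/h:title", default="",
                         namespaces={"h": XHTML_NS})


print(extract_title(SAMPLE))
```

Note that the prefix used in the expression need not match anything in the document, which declares the namespace as its default; only the namespace URI has to agree.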

XSLT defines XPattern, a variation on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The XPattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XPatterns are useful when the task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.

XPath task: extract the class attribute from all the child elements of the context node
XPattern task: given a list of nodes, sort them into piles of those that have a class attribute and those that have a title child

The main API for XPattern processing in 4Suite is Ft.Xml.Xslt.PatternList. Listing 4 is a code snippet that takes a node and returns a list of patterns it matches.

Listing 4: Use XPatterns to quickly determine which patterns match which nodes

from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

#first pattern matches elements with a class attribute
#the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]

#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})

DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>
"""
doc = NonvalidatingReader.parseString(DOC, "file:example4.xml")
for node in doc.documentElement.childNodes:
    #Don't forget that the white space text nodes before and after
    #e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)

The PatternList initializer takes my list of strings, which it conveniently converts to a list of compiled XPattern objects. Such objects have a match method that returns a boolean value, but I don't use these methods directly in this example. The PatternList initializer also takes a dictionary that makes up the namespace mapping. In this example, we use no namespaces, so the dictionary is empty. The lookup method is applied to a selection of the children of the spam element (all the nodes whose name starts with "e", which happens to be all the element nodes). The output of listing 4 follows:

[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]

Each line of output is a list of the representations of the pattern objects that matched the corresponding node. Notice how the axis abbreviations have been expanded in the pattern object representation.

Sometimes the built-in facilities of XPath and XPattern aren't quite enough to meet your processing needs. Luckily it's easy to extend the function of these libraries using XPath user extension functions, which are written in Python. I don't cover extension functions in this article, but the resources section has pointers to useful information if you need this facility.

Python-XML Happenings

Here is a brief rundown of significant new happenings relevant to Python-XML development, including significant software releases.

David Mertz announced the 1.0.4 release of gnosis XML tools. This package provides tools for converting Python objects to XML documents and vice versa, DTD to SQL conversions, and more.

Brian Quinlan announced Pyana 0.6.0. Pyana is a Python extension module for interface to the Xalan XSLT engine.

Eric van der Vlist announced XVIF 0.2.0. XVIF includes a full RELAX NG validator for Python and adds in an XML processing framework system Eric developed as a straw man for ISO DSDL. The new release adds a data typing framework and a partial WXS data types library. It also features improved internals and API.

Henry Thompson announced a new release of XSV, a Python implementation of W3C XML Schema (WXS) which also runs the W3C's on-line WXS validator service. This release features a major restructuring of the code.

Frank Tobin announced a lightweight Python module to help write out well-formed XML. xmlprinter is inspired by Perl's XML::Writer module.

Daniel Veillard announced the 1.0.21 release of libxslt, with improved Python bindings, among other things.

4Suite 0.12.0a3 has been released; this is the version I introduce in this article. Among many other changes and improvements, it includes the latest XVIF.

Resources

For more information, see the 4Suite home page. I and some other 4Suite developers hang out on the #4suite IRC channel on irc.freenode.net.
You can usually find details of various aspects of the 4Suite libraries at my Python/XML Akara and 4Suite Akara.
There is an official RELAX NG tutorial, and Eric van der Vlist makes available chapters of his in-progress book on RELAX NG. If you are interested in Eric's XVIF extensions to RELAX NG, which are also incorporated into 4Suite, see the XVIF home page.
I introduce XPath and XSLT, using 4Suite for the examples, in this tutorial, for which free registration is required. You can also try the zvon.org XPath Tutorial.
XPatterns are usually not covered separately, but you can learn more about XPatterns on any number of on-line XSLT tutorials and books. The W3C XSL page has many links to such resources.