Using libxml in Python

May 14, 2003

The GNOME project, an open source umbrella project like Apache and KDE, has spawned several useful subprojects. A few years ago the increase of interest in XML processing in GNOME led to the development of a base XML processing library and, subsequently, an XSLT library, both of which are written in C, the foundational language of GNOME. These libraries, libxml and libxslt, are popular not only with users of C but also with users of the many other languages for which wrappers have been written, as well as with language-agnostic users who want good command-line tools.

libxml and libxslt are popular because of their speed, active development, and coverage of many XML specifications with close attention to conformance. They are also available on many platforms. Daniel Veillard is the lead developer of these libraries as well as their Python bindings. He participates in the XML-SIG and has pledged perpetual support for the Python bindings; however, as the documentation says, "the Python interface [has] not yet reached the maturity of the C API."

In this article I'll introduce the Python libxml bindings, which I refer to as Python/libxml. In particular I introduce libxml2. I am using Red Hat 9.0 so installation was a simple matter of installing RPMs from the distribution disk or elsewhere. The two pertinent RPMs in my case are libxml2-2.5.4-1 and libxml2-python-2.5.4-1. The libxml web page offers installation instructions for users of other distributions or platforms, including Windows and Mac OS X.

Basic libxml

libxml exposes a Python interface similar to its C interface. It's unrelated to DOM or any of the other Python interfaces and is fairly complex. To get a flavor of it, see the demonstration in listing 1.

Listing 1: A simple example of the basic libxml2 API

import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print root
# Iterate over children of verse
child = root.children
while child is not None:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)
    child = child.next
doc.freeDoc()

The entire Python API wrapper is in the module libxml2, which largely delegates to a C/Python extension (the file libxml2mod.so on my machine), which in turn uses the core libxml implementation. parseDoc is one of a family of functions for parsing XML documents, DTDs, and more. There are also parseFile, which reads an instance from a file name or URL, and parseDTD, which reads an external DTD subset. Despite its name, parseURI parses a URI string into its components rather than reading a document.

In this listing I use the most literal of the several approaches for walking through nodes, the one closest to the core C API. The children attribute gets the first child node of the instance node in document order. This makes the name a bit misleading, but you can get to the rest of the children using what is in effect a doubly-linked list, where the next and prev attributes link the list together, and last can be used to shuttle to the end. parent.children in effect moves "up" and then back to the start of the list; it can be used in place of the nonexistent attribute first.
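The linking scheme just described can be modeled in a few lines of plain Python. This is only a sketch of the data structure, not the libxml2 node class: the Node class, its append() method, and the two helper functions are hypothetical names of my own, and only the attribute names (children, last, next, prev, parent) follow the text above. It is written to run on Python 3.

```python
class Node(object):
    """Hypothetical stand-in for a tree node; NOT the libxml2 class."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = None  # first child only, as described above
        self.last = None      # last child
        self.next = None      # following sibling
        self.prev = None      # preceding sibling

    def append(self, child):
        """Link child in as the new last child of this node."""
        child.parent = self
        child.prev = self.last
        if self.last is None:
            self.children = child
        else:
            self.last.next = child
        self.last = child
        return child


def siblings_forward(first):
    """Collect names by following next links until they run out."""
    names = []
    node = first
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


def siblings_backward(last):
    """Collect names by following prev links back to the first sibling."""
    names = []
    node = last
    while node is not None:
        names.append(node.name)
        node = node.prev
    return names


verse = Node("verse")
for name in ["attribution", "line-1", "line-2"]:
    verse.append(Node(name))

forward = siblings_forward(verse.children)
backward = siblings_backward(verse.last)
# From any child, parent.children returns to the start of the list.
first_again = verse.last.parent.children

print(forward)
print(backward)
print(first_again.name)
```

The point of the model is that children is a single node, not a sequence; everything else is reached by walking the sibling links.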

The next links eventually run off the end of the list, returning None, which terminates my while loop. Within each iteration I print each node, including some special information for elements. To determine which nodes are elements, I use the type attribute, which returns a string indicating the node type. lsCountNode() gives the count of child nodes and content gives a string consisting of the content of all descendant text nodes. Finally, in order to deallocate the low-level C constructs throughout the document, I call freeDoc(); freeNode() is also available for more fine-grained memory management, usually when using libxml to modify documents.

The following is the output from listing 1.

<xmlNode (verse) object at 0x8136dac>
<xmlNode (text) object at 0x8134c94>
<xmlNode (attribution) object at 0x8135f04>
        An element with  1 child(ren)
        And content 'Christopher Okibgo'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
        An element with  1 child(ren)
        And content 'For he was a shrub among the poplars,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
        An element with  1 child(ren)
        And content 'Needing more roots'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
        An element with  1 child(ren)
        And content 'More sap to grow to sunlight,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
        An element with  1 child(ren)
        And content 'Thirsting for sunlight'
<xmlNode (text) object at 0x8134c94>

Iterators. There is also an iterators interface for Python 2.2 users, which is a little more Pythonic. As an example, the following snippet is the functional equivalent of the loop in listing 1.

for child in root:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)

Beyond ASCII. As is the GNOME convention, libxml represents Unicode objects as simple strings encoded as UTF-8. This extends to Python/libxml, where rather than using Python Unicode objects, simple Python strings in UTF-8 encoding are returned. Listing 2 gives an example of the behavior of Python/libxml when processing non-ASCII characters.

Listing 2: Simple libxml example handling non-ASCII characters

import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<rule>In any triangle, each interior angle &lt; 90&#xB0;</rule>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print "Content:", repr(root.content)
print "As Unicode:", repr(unicode(root.content, "utf-8"))
doc.freeDoc()

I still strongly advocate using Python Unicode objects rather than encoded strings when processing XML. I suggest that Python/libxml users convert to and from Unicode at the boundary between the library and application code. But I admit that this might be awkward in some cases and might incur a small performance hit. I do think it would be best for Python/libxml to switch to Python Unicode objects as the basic string type.
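One way to keep that conversion in a single place is a pair of small helpers at the edge of the application. The following is a minimal sketch that does not call libxml at all: to_text() and to_utf8() are hypothetical names of my own, and the raw value is a hand-written stand-in for the UTF-8 encoded content of the rule element in listing 2. It is written for Python 3, where an encoded string is a bytes object.

```python
def to_text(value, encoding="utf-8"):
    """Decode an encoded byte string from the library into Unicode text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, pass through unchanged


def to_utf8(value, encoding="utf-8"):
    """Encode Unicode text before handing it back to the library."""
    if isinstance(value, bytes):
        return value  # already encoded, pass through unchanged
    return value.encode(encoding)


# Stand-in for the encoded content of the rule element in listing 2.
# The degree sign U+00B0 occupies two bytes in UTF-8.
raw = b"In any triangle, each interior angle < 90\xc2\xb0"

text = to_text(raw)
print(repr(text))
print(len(raw) - len(text))  # the degree sign accounts for the difference
print(to_utf8(text) == raw)
```

Application code then works only with the decoded text, and the byte-versus-character length mismatch stays confined to the two helpers.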

A word about documentation. Python-XML projects have been notorious for poor documentation, which is one of the considerations that inspires this column. It was especially difficult for me to get a handle on libxml because of its remarkable richness and thus complexity, combined with elusive documentation. I cobbled together enough understanding of the API to put together the listings above only after combing the on-line documentation on the libxml site (which mostly covers C), reading through all the Python API example and test scripts, reading the Python source for the libxml2 module and in a couple of cases the C source of the extension module. The mailing list is very helpful, as I found while skimming and searching the archives, but you may need some trial and error to understand the nuances of using this very rich API. Luckily, I think the path to understanding is a bit more clear using the most recent addition to the libxml API family.

A Loaner from Redmond

libxml comes from one of the firmest bastions of the open-source software movement, which is often held up as the only current, real competition to Microsoft. Yet, as ever, the OSS camp is happy to borrow useful ideas from Microsoft here and there. One good example is the XmlTextReader interface, inspired by the XmlTextReader and XmlReader classes of C# and .NET. These are basically a variation on pull DOM and thus a hybrid between SAX's approach -- stream through and process a particular window of markup -- and DOM's -- walk through the hierarchy and manipulate nodes in place. XmlTextReader is a new addition to the API and some developers find it simpler. Also, the tree-based API I introduced in the last section loads the entire document into memory, whereas XmlTextReader only loads nodes on demand and so needs less memory for large documents.

Listing 3 uses the XmlTextReader API to perform processing similar to that in listing 1.

Listing 3: An example of the XmlTextReader interface

import cStringIO
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

XMLREADER_START_ELEMENT_NODE_TYPE = 1

stream = cStringIO.StringIO(DOC)
input_source = libxml2.inputBuffer(stream)
reader = input_source.newTextReader("urn:bogus")

# Read() returns 1 while nodes remain, 0 at the end and -1 on error
while reader.Read() == 1:
    print "node name: ", reader.Name()
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        print "Start of an element"

I start by wrapping the source string DOC in a StringIO object, which can in turn be wrapped by libxml's inputBuffer class which, among other things, allows me to create an xmlTextReader object for the stream. If I were starting from an actual file or URI in the first place I could use the newTextReaderFilename shortcut function. Since in listing 3 I am not working from a URI, I have to supply the URI when I create the xmlTextReader -- probably for the same reasons that the 4Suite APIs insist on a URI for XML sources (see my earlier article on 4Suite for a discussion of this). Here I use a bogus URI as a placeholder.

The reader object iterates over the low-level XML structure in much the same way as SAX, generating events for start and end elements, attributes (deviating from SAX in which attributes are bundled with their elements), text, CDATASections, the document node itself, and the rest of the menagerie. But rather than invoking call-backs, the Read() method advances the reader to the next such event, and the reader object itself then exposes that event's details through methods such as Name() and NodeType(). Each event carries basic information that is available from the node itself, without having to consider its children or any other related events. In the simple example all the node names are printed. I left out the code to count child elements and display the content subtree for simplicity, because it would involve either considering the interaction of several events using a state machine of some sort or using the Expand() method to walk through enough subsequent events to extract a regular libxml subtree from the reader object.
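For contrast, here is what the call-back style looks like with the SAX support in Python's standard library. This is a sketch for comparison only, not part of the libxml API: NameCollector is a hypothetical handler class of my own, the document is a shortened copy of the verse, and the code targets Python 3. The parser drives the program and calls startElement() whenever it sees fit, whereas in listing 3 my own loop decides when to ask for the next event.

```python
import xml.sax

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>Needing more roots</line>
</verse>
"""


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the name of each start tag."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        # Invoked by the parser, not by application code.
        self.names.append(name)


handler = NameCollector()
# The parser takes control here and pushes events into the handler.
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(handler.names)
```

Any state that must survive between events has to live on the handler object, which is the bookkeeping a pull-style loop lets me keep in ordinary local variables.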

In order to branch to special processing for start element events, I use NodeType(), which returns a node identifier based on the constants defined in DOM. You'll notice that I don't have to do anything special to clean up after this program, unlike the plain tree interface. If you run listing 3, the only unusual thing you're likely to notice is that the text nodes are given the node name #text, which is the DOM convention. Nodes other than element and attribute nodes all have these special node names.
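The child counting and content display that I left out of listing 3 can be approximated with the pull DOM module in Python's standard library, xml.dom.pulldom, whose expandNode() method plays a role comparable to Expand(): it reads ahead far enough to build a regular subtree for the current element. This is a sketch using the standard library rather than the libxml reader; summarize() and text_of() are hypothetical helper names of my own, the document is a shortened copy of the verse, and the code targets Python 3.

```python
from xml.dom import pulldom

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
</verse>
"""


def text_of(node):
    """Concatenate the text of all descendant text nodes."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def summarize(xml_text, wanted=("attribution", "line")):
    """Pull events; expand each wanted element into a small subtree."""
    results = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName in wanted:
            # Consume events up to the matching end tag and attach
            # the resulting children to this node.
            events.expandNode(node)
            # Merge adjacent text nodes so the child count is stable.
            node.normalize()
            results.append(
                (node.tagName, len(node.childNodes), text_of(node))
            )
    return results


summary = summarize(DOC)
for name, count, content in summary:
    print(name, count, repr(content))
```

Only the elements named in wanted are ever materialized as subtrees; the rest of the document streams past as individual events, which is the trade-off that makes the pull style attractive for large inputs.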

Wrap up and current events

libxml also offers a SAX API, both through the low-level API and through the bundled drv_libxml2.py, a libxml driver for the SAX that comes with Python and PyXML. libxml supports W3C XML Schema, RELAX NG, OASIS catalogs, XInclude, XML Base, and more. There are also extensive features for manipulating XML documents. I hope to cover these other features of this rich library in subsequent articles.

Moving on to the usual coverage of interesting events and resources in the Python-XML community, Brian Quinlan announced the latest version (0.8.0) of Pyana, a Python interface to the Xalan C XSLT processor. New developments include support for node sets as XPath extension function arguments, Python wide Unicode and Mac OS X builds, and validation using external schemas.

Making a new appearance is Skyron, an interesting little Python module that transforms XML documents according to simple "recipes" which are expressed in XML. These recipes bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.