Using libxml in Python
May 14, 2003
The GNOME project, an open source umbrella project like Apache and KDE, has spawned several useful subprojects. A few years ago the increase of interest in XML processing in GNOME led to the development of a base XML processing library and, subsequently, an XSLT library, both of which are written in C, the foundational language of GNOME. These libraries, libxml and libxslt, are popular not only with C users but also with users of the many other languages for which wrappers have been written, as well as language-agnostic users who want good command-line tools.
libxml and libxslt are popular because of their speed, active development, and coverage of many XML specifications with close attention to conformance. They are also available on many platforms. Daniel Veillard is the lead developer of these libraries as well as their Python bindings. He participates on the XML-SIG and has pledged perpetual support for the Python bindings; however, as the documentation says, "the Python interface [has] not yet reached the maturity of the C API."
In this article I'll introduce the Python libxml bindings, which I refer to as Python/libxml. In particular I introduce libxml2. I am using Red Hat 9.0 so installation was a simple matter of installing RPMs from the distribution disk or elsewhere. The two pertinent RPMs in my case are libxml2-2.5.4-1 and libxml2-python-2.5.4-1. The libxml web page offers installation instructions for users of other distributions or platforms, including Windows and Mac OS X.
Basic libxml
libxml exposes a Python interface similar to its C interface. It's unrelated to DOM or any of the other Python interfaces and is fairly complex. To get a flavor of it, see the demonstration in listing 1.
Listing 1: A simple example of the basic libxml2 API
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print root
#iterate over children of verse
child = root.children
while child is not None:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)
    child = child.next
doc.freeDoc()
The entire Python API wrapper is in the module libxml2, which largely delegates to a C/Python extension in the file libxml2mod.so on my machine, which in turn uses the core libxml implementation. parseDoc is one of a family of functions for parsing XML documents, DTDs, and more. There are also parseURI and parseFile for reading instances and parseDTD to read an external DTD subset.
In this listing I use the most literal of the several approaches for walking through nodes, the one closest to the core C API. The children attribute gets the first child node of the instance node in document order. This makes the name a bit misleading, but you can get to the rest of the children using what is in effect a doubly-linked list, where the next and prev attributes link the list together, and last can be used to shuttle to the end. parent.children in effect moves "up" and then back to the start of the list; it can be used in place of the nonexistent attribute first.
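The linkage described above can be sketched in plain Python. This is only an illustration of the data structure: the Node class, its append method, and sibling_names are hypothetical names invented here (written for Python 3), not part of the libxml2 module.

```python
class Node:
    """Hypothetical stand-in for a tree node; not the libxml2 xmlNode class."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = None  # first child only, mirroring the naming discussed above
        self.last = None      # last child
        self.next = None      # following sibling
        self.prev = None      # preceding sibling

    def append(self, child):
        """Attach child at the end of this node's sibling list."""
        child.parent = self
        child.prev = self.last
        if self.last is None:
            self.children = child
        else:
            self.last.next = child
        self.last = child
        return child


def sibling_names(first):
    """Follow next links from a node until they run off the end (None)."""
    names = []
    node = first
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


verse = Node("verse")
for name in ("attribution", "line1", "line2"):
    verse.append(Node(name))

middle = verse.children.next
# From any child, parent.children returns to the head of the sibling list.
head = middle.parent.children
print(sibling_names(head))
print(verse.last.prev.name)
```

The point of the sketch is that a single "first child" pointer plus next, prev, and last links is enough to reach every child in either direction.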
The next links eventually run off the end of the list, returning None, which terminates my while loop. Within each iteration I print each node, including some special information for elements. To determine which nodes are elements, I use the type attribute, which returns a string indicating the node type. lsCountNode() gives the count of child nodes and content gives a string consisting of the content of all descendant text nodes. Finally, in order to deallocate the low-level C constructs throughout the document, I call freeDoc(); freeNode() is also available for more fine-grained memory management, usually when using libxml to modify documents.
The following is the output from listing 1.
<xmlNode (verse) object at 0x8136dac>
<xmlNode (text) object at 0x8134c94>
<xmlNode (attribution) object at 0x8135f04>
    An element with  1 child(ren)
    And content 'Christopher Okibgo'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with  1 child(ren)
    And content 'For he was a shrub among the poplars,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with  1 child(ren)
    And content 'Needing more roots'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with  1 child(ren)
    And content 'More sap to grow to sunlight,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with  1 child(ren)
    And content 'Thirsting for sunlight'
<xmlNode (text) object at 0x8134c94>
Iterators. There is also an iterators interface for Python 2.2 users, which is a little more Pythonic. As an example, the following snippet is the functional equivalent of the loop in listing 1.
for child in root:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)
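For readers curious how a for loop can sit on top of next-style links, here is a minimal Python 3 sketch of the general idea. The Item and Container classes are hypothetical and exist only for this illustration; the sketch makes no claim about how libxml2 implements its iterators or which nodes they visit.

```python
class Item:
    """Hypothetical node holding a name and a link to its following sibling."""

    def __init__(self, name, next=None):
        self.name = name
        self.next = next


class Container:
    """Hypothetical parent that exposes only its first child, like children."""

    def __init__(self, first_child):
        self.children = first_child

    def __iter__(self):
        # A generator turns the manual "while child is not None" walk
        # into something a plain for loop can consume.
        child = self.children
        while child is not None:
            yield child
            child = child.next


third = Item("line")
second = Item("text", third)
first = Item("attribution", second)
root = Container(first)

names = [child.name for child in root]
print(names)
```

Once __iter__ is defined, the bookkeeping of advancing through next and testing for None disappears from the calling code.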
Beyond ASCII. As is the GNOME convention, libxml represents Unicode objects as simple strings encoded as UTF-8. This extends to Python/libxml, where rather than using Python Unicode objects, simple Python strings in UTF-8 encoding are returned. Listing 2 gives an example of the behavior of Python/libxml when processing non-ASCII characters.
Listing 2: Simple libxml example handling non-ASCII characters
# -*- coding: utf-8 -*-
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<rule>In any triangle, each interior angle &lt; 90°</rule>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print "Content:", repr(root.content)
print "As Unicode:", repr(unicode(root.content, "utf-8"))
doc.freeDoc()
I still strongly advocate using Python Unicode objects rather than encoded strings when processing XML. I suggest that Python/libxml users convert to and from Unicode when interfacing from the library to application code. But I admit that this might be awkward in some cases and might incur a small performance hit. I do think it would be best for Python/libxml to switch to Python Unicode objects as the basic string type.
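The boundary conversion suggested above might look like the following sketch. It is written in Python 3 syntax, where bytes plays the role of the UTF-8 encoded strings discussed here and str plays the role of Unicode objects. The helper names from_library and to_library are hypothetical, and the byte string is typed in by hand rather than obtained from libxml2.

```python
def from_library(raw):
    """Decode a UTF-8 byte string (as a C-backed library might return) to text."""
    return raw.decode("utf-8")


def to_library(text):
    """Encode text back to UTF-8 bytes before handing it to the library."""
    return text.encode("utf-8")


# The degree sign U+00B0 occupies two bytes (0xC2 0xB0) in UTF-8.
raw_content = b"In any triangle, each interior angle < 90\xc2\xb0"
text = from_library(raw_content)

print(len(raw_content), len(text))
print(to_library(text) == raw_content)
```

Keeping the two conversions in one place means application code only ever sees text, and the difference between byte length and character length stays out of string handling elsewhere.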
A word about documentation. Python-XML projects have been notorious for poor documentation, which is one of the considerations that inspires this column. It was especially difficult for me to get a handle on libxml because of its remarkable richness and thus complexity, combined with elusive documentation. I cobbled together enough understanding of the API to put together the listings above only after combing the on-line documentation on the libxml site (which mostly covers C), reading through all the Python API example and test scripts, reading the Python source for the libxml2 module and in a couple of cases the C source of the extension module. The mailing list is very helpful, as I found while skimming and searching the archives, but you may need some trial and error to understand the nuances of using this very rich API. Luckily, I think the path to understanding is a bit clearer using the most recent addition to the libxml API family.
A Loaner from Redmond
libxml comes from one of the firmest bastions of the open-source software movement, which is often held up as the only current, real competition to Microsoft. Yet, as ever, the OSS camp is happy to borrow useful ideas from Microsoft here and there. One good example is the XmlTextReader interface, inspired by the XmlTextReader and XmlReader classes of C# and .NET. These are basically a variation on pull DOM and thus a hybrid between SAX's approach -- stream through and process a particular window of markup -- and DOM's -- walk through the hierarchy and manipulate nodes in place. XmlTextReader is a new addition to the API and some developers find it simpler. Also, the tree-based API I introduced in the last section loads the entire document into memory. XmlTextReader only loads nodes on demand and so needs less memory, which matters most for large documents.
Listing 3 uses the XmlTextReader API to perform processing similar to that in listing 1.
Listing 3: An example of the XmlTextReader interface
import cStringIO
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

XMLREADER_START_ELEMENT_NODE_TYPE = 1

stream = cStringIO.StringIO(DOC)
input_source = libxml2.inputBuffer(stream)
reader = input_source.newTextReader("urn:bogus")
# Read() returns 1 while there are more nodes, 0 at the end, and -1 on error
while reader.Read() == 1:
    print "node name: ", reader.Name()
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        print "Start of an element"
I start by wrapping the source string DOC in a StringIO object, which can in turn be wrapped by libxml's inputBuffer class which, among other things, allows me to create an xmlTextReader object for the stream. If I were starting from an actual file or URI in the first place I could use the newTextReaderFilename shortcut function. Since in listing 3 I am not working from a URI, I have to supply one when I create the xmlTextReader, probably for the same reasons that the 4Suite APIs insist on a URI for XML sources (see my earlier article on 4Suite for a discussion of this). Here I use a bogus URI as a placeholder.
The reader object iterates over the low-level XML structure in much the same way as SAX, generating events for start and end elements, attributes (deviating from SAX, in which attributes are bundled with their elements), text, CDATA sections, the document node itself, and the rest of the menagerie. But rather than invoking call-backs, the Read() method advances to the next such event and returns a status code; the details of the current event are then available through methods on the reader object itself. Each event carries basic information that is available from the node itself, without having to consider its children or any other related events. In the simple example all the node names are printed. I left out the code to count child elements and display the content subtree for simplicity, because it would involve either considering the interaction of several events using a state machine of some sort or using the Expand() method to walk through enough subsequent events to extract a regular libxml subtree from the reader object.
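To give a feel for the "expand" approach without relying on libxml2 calls I have not demonstrated, here is a sketch using xml.dom.pulldom from the Python 3 standard library, which follows the same pull model. Its expandNode method is an analogue of the idea behind Expand(), not the same API, and the function name summarize_lines is hypothetical.

```python
from xml.dom import pulldom

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
</verse>
"""


def summarize_lines(xml_text):
    """Pull events one at a time; build a subtree only for <line> elements."""
    results = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "line":
            # Consume the events up to the matching end tag and attach
            # the resulting children to this node.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            results.append((len(node.childNodes), text))
    return results


line_summaries = summarize_lines(DOC)
for child_count, text in line_summaries:
    print(child_count, repr(text))
```

The trade-off is the one described above: expanding is simpler than tracking state across events, but each expanded subtree is held in memory while it is processed.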
In order to branch to special processing for start element events, I use NodeType(), which returns a node identifier based on the constants defined in DOM. You'll notice that I don't have to do anything special to clean up after this program, unlike the plain tree interface. If you run listing 3, the only unusual thing you're likely to notice is that the text nodes are given the node name #text, which is the DOM convention. Nodes other than element and attribute nodes all have these special node names.
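The DOM conventions mentioned here can be checked with the standard library's own DOM implementation. The sketch below uses xml.dom.minidom from Python 3 rather than libxml2; it only shows the DOM side of the convention, namely that element nodes have type 1 and that non-element nodes carry special names.

```python
from xml.dom import Node, minidom

doc = minidom.parseString("<line>Needing more roots</line>")
element = doc.documentElement
text = element.firstChild

# Element nodes report their tag name; DOM assigns them node type 1.
print(element.nodeType == Node.ELEMENT_NODE, element.nodeName)
# Text and document nodes use fixed names that begin with '#'.
print(text.nodeType == Node.TEXT_NODE, text.nodeName)
print(doc.nodeName)
```

The value 1 for element nodes is the same number that listing 3 assigns to XMLREADER_START_ELEMENT_NODE_TYPE.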
Wrap up and current events
libxml also offers a SAX API, both through the low-level API and through the bundled drv_libxml2.py, a libxml driver for the SAX that comes with Python and PyXML. libxml supports W3C XML Schema, RELAX NG, OASIS catalogs, XInclude, XML Base, and more. There are also extensive features for manipulating XML documents. I hope to cover these other features of this rich library in subsequent articles.
Moving on to the usual coverage of interesting events and resources in the Python-XML community, Brian Quinlan announced the latest version (0.8.0) of Pyana, a Python interface to the Xalan C XSLT processor. New developments include support for node sets as XPath extension function arguments, Python wide Unicode and Mac OS X builds, and validation using external schemas.
Making a new appearance is Skyron, an interesting little Python module that transforms XML documents according to simple "recipes" which are expressed in XML. These recipes bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.