xmltramp and pxdom
December 17, 2003
In this article I cover two XML processing libraries with very different goals. xmltramp, developed by Aaron Swartz, is a tool for parsing XML documents into a data structure very friendly to Python. Recently many of the tools I've been covering with this primary goal of Python-friendliness have been data binding tools. xmltramp doesn't meet the definition of a data binding tool I've been using; that is, it isn't a system that represents elements and attributes from the XML document as custom objects that use the vocabulary from the XML document for naming and reference. xmltramp is more like ElementTree, which I covered earlier, defining a set of lightweight objects that make information in an XML document accessible through familiar Python idioms. The stated goal of xmltramp is simplicity rather than exhaustive coverage of XML features.
pxdom, on the other hand, has the goal of strict DOM Level 3 compliance. It is developed by Andrew Clover, who contributed to the XML-SIG the document "DOM Standards compliance", a very thorough matrix of feature and defect comparisons between Python DOM implementations. DOM has generally not been the favorite API of Python users -- or, for that matter, of Java users -- but it certainly has an important place because of its cross-language support.
xmltramp
I downloaded xmltramp 2.0, which is a single Python module. A required Python version is not given, but according to Python features I noticed in the implementation, at least 2.1 is required. I used Python 2.3.2, and installation was a simple matter of copying xmltramp.py to a directory in the Python path. To kick off my exercising of xmltramp I used the same sample file as I've been using in the data binding coverage (see listing 1).
Listing 1: Sample XML file for exercising xmltramp
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#133;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
The following snippet shows how simple it is to parse in a file using xmltramp:
>>> import xmltramp
>>> xml_file = open('labels.xml')
>>> doc = xmltramp.seed(xml_file)
xmltramp uses SAX behind the scenes for parsing, so it should generally be efficient in building up the in-memory structure. The seed function takes a file-like object, but you can use the parse function instead if you have a string object. Like ElementTree, xmltramp defines specialized objects (xmltramp.Element) representing each element in the XML document. The top-level object (assigned to doc) represents the top-level element, rather than the document itself. You can see its element children by peeking into an internal structure:
>>> doc._dir
[<label added="2003-06-20">...</label>, <label added="2003-06-10">...</label>]
In this list each entry is a representation of a child element object. The whitespace text nodes between elements are omitted, which might be conventional stripping of such nodes, but it did make me wonder about the way xmltramp handles mixed content, about which more later. Of course in normal use you would access the xmltramp structures using the public API, which in part adopts Python's list idioms:
>>> for label in doc: print repr(label)
...
<label added="2003-06-20">...</label>
<label added="2003-06-10">...</label>
>>> print repr(doc[0])
<label added="2003-06-20">...</label>
I use repr because the str function (used by print to coerce non-string parameters) applied to Element objects returns a concatenation of descendant text nodes, excluding pure whitespace text nodes:
>>> print doc[1]
Ezra Pound45 Usura PlaceHaileyID
You can also use the element node name to navigate the XML structure:
>>> print repr(doc.label)
<label added="2003-06-20">...</label>
There are, of course, multiple label children. The first one is returned. And, as if that weren't enough, you can also use a dictionary access (mapping) idiom:
>>> print repr(doc['label'])
<label added="2003-06-20">...</label>
You read attributes using the function invocation idiom:
>>> print doc.label('added')
2003-06-20
To navigate further into the tree, you can combine and cascade the access methods I described above:
>>> print repr(doc.label.name)
<name>...</name>
>>> print repr(doc['label']['name'])
<name>...</name>
>>> print repr(doc[0][1])
<name>...</name>
Unfortunately it seems that the only way to access any element except for the first child element with a certain name is to use list access methods.
>>> doc[1] #Second label element
<label added="2003-06-10">...</label>
You can't access this element using either the reference name "label" or using "label" as a key string for mapping access.
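The construction and access idioms described so far can be sketched in a few dozen lines. The following is a toy illustration in modern Python 3 using only the standard library; it is not xmltramp's implementation, and the names Element, Seeder, and parse are hypothetical stand-ins chosen for this sketch. It builds the tree from SAX events, drops whitespace-only text, and supports the sequence, attribute-name, mapping, and call idioms, including the limitation that name-based access returns only the first matching child.

```python
import xml.sax


class Element:
    """Toy element (hypothetical, not xmltramp's class)."""

    def __init__(self, name, attrs):
        self._name = name
        self._attrs = dict(attrs)
        self._dir = []  # child elements and text strings, in document order

    def __repr__(self):
        attrs = "".join(' %s="%s"' % item for item in self._attrs.items())
        return "<%s%s>...</%s>" % (self._name, attrs, self._name)

    def __iter__(self):
        return iter(self._dir)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._dir[key]  # sequence idiom: doc[0]
        for child in self._dir:  # mapping idiom: doc['label']
            if isinstance(child, Element) and child._name == key:
                return child  # only the first match is reachable by name
        raise KeyError(key)

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails: doc.label
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __call__(self, attr_name):
        return self._attrs[attr_name]  # call idiom: doc.label('added')


class Seeder(xml.sax.ContentHandler):
    """SAX handler that assembles Element objects as events arrive."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []
        self._text = []

    def _flush(self):
        text = "".join(self._text)
        self._text = []
        if self._stack and text.strip():  # skip whitespace-only text
            self._stack[-1]._dir.append(text)

    def startElement(self, name, attrs):
        self._flush()
        elem = Element(name, attrs.items())
        if self._stack:
            self._stack[-1]._dir.append(elem)
        else:
            self.root = elem
        self._stack.append(elem)

    def endElement(self, name):
        self._flush()
        self._stack.pop()

    def characters(self, content):
        self._text.append(content)  # SAX may deliver text in several chunks


def parse(text):
    handler = Seeder()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.root


doc = parse(
    '<labels><label added="2003-06-20"><name>Thomas Eliot</name></label>'
    '<label added="2003-06-10"><name>Ezra Pound</name></label></labels>'
)
print(repr(doc.label))         # first label child
print(doc["label"] is doc[0])  # mapping and sequence access agree
print(doc.label("added"))      # attribute value via the call idiom
print(doc[1].name[0])          # second label is reachable only by index
```

The single __getitem__ method serves both the sequence and mapping idioms by dispatching on the key type, which is one plausible way to get the behavior shown in the sessions above.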
Text nodes, whitespace and mixed content
You have to use the list idiom to access child text nodes:
>>> print repr(doc.label.name[0])
u'Thomas Eliot'
You can see that text nodes are maintained as Unicode objects, which is the right thing to do. I thought that coercing Element objects to Unicode would be another good way to access their child content, but I found an odd quirk:
>>> print repr(unicode(doc.label.name)) #so far so good
u'Thomas Eliot'
>>> print repr(unicode(doc.label.quote))
u'Midwinter Spring is its own season'
There should be a trailing ellipsis character (the &#133; character entity) in the quote element, but it has gone missing. I looked through the xmltramp code for an obvious cause of this defect, but it turned out to be rather subtle. If you look closely you will see that the whitespace after the ellipsis character is missing as well. xmltramp coerces to Unicode by taking all text nodes descending from the given object and, using the split and join string methods, collapsing runs of whitespace into single space characters. Python's Unicode methods treat &#133; as whitespace, which surprised me. I know that some other Unicode characters are treated as whitespace, including &#160;, popularly known in its HTML entity form, &nbsp;, but ellipsis seems a strange character to treat as whitespace. The explanation is that code point 133 (U+0085) is an ellipsis only in the Windows-1252 encoding; in Unicode it is the NEL (next line) control character, which is classified as whitespace, and the true horizontal ellipsis is U+2026. At any rate, this quick and dirty normalization by xmltramp means that coercion to Unicode does not reliably return the precise content of descendant text nodes, and I recommend sticking to list access. The following snippet gets all text content that is the immediate child of an element, excepting pure whitespace nodes, which xmltramp seems to strip:
>>> ''.join([t for t in doc.label.quote if isinstance(t, unicode)])
u' is its own season\x85\n '
Within these constraints, xmltramp maintains mixed content so that you can access it using the patterns I've described.
>>> print list(doc.label.quote)
[<emph>...</emph>, u' is its own season\x85\n ']
>>> print repr(doc.label.quote.emph)
<emph>...</emph>
>>> print repr(unicode(doc.label.quote.emph))
u'Midwinter Spring'
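The whitespace quirk discussed in this section is easy to reproduce outside xmltramp. The following is a minimal sketch in modern Python 3, where every str is Unicode; the normalize helper is a hypothetical name for the split-and-join technique described above, not a function from xmltramp. It shows the character at code point 133 disappearing during normalization, and checks which of three characters Python classifies as whitespace.

```python
def normalize(text):
    """Collapse runs of whitespace into single spaces using split and join."""
    return " ".join(text.split())


# Text content of the quote element, with code point 133 (U+0085) at the end
quote_text = "Midwinter Spring is its own season\x85\n  "

normalized = normalize(quote_text)
print(repr(normalized))

# U+0085 (NEL) and U+00A0 (no-break space) count as whitespace;
# U+2026 (horizontal ellipsis) does not.
flags = ["\x85".isspace(), "\xa0".isspace(), "\u2026".isspace()]
print(flags)
```

Because split with no arguments splits on every character Python considers whitespace, the trailing U+0085 is discarded along with the newline and spaces that follow it.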
Mutations and re-serialization
xmltramp allows for limited mutation. The easiest thing to do is add or modify an attribute:
>>> doc.label('added')
u'2003-06-20'
>>> doc.label(added=u'2003-11-20') #returns attrs as a dict
{u'added': u'2003-11-20'}
>>> doc.label('added')
u'2003-11-20'
>>> doc.label('added', u'2003-12-20')
>>> doc.label('added')
u'2003-12-20'
>>> doc.label(new_attr=u'1')
{u'added': u'2003-12-20', 'new_attr': u'1'}
To add an element with simple text content you can use the mapping update idiom:
>>> doc[1]['quote'] = u"Make it new"
This code adds a quote element as the last child of the second label element with the simple text content Make it new. In order to see the result of this operation I wanted to reserialize the element back to XML. xmltramp provides for additional parameters to the __repr__ magic method which can be used for such reserialization. The first is a boolean parameter which you just set to True to trigger full reserialization:
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote>Make it new</quote></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
Again you can see the effect of the stripped whitespace. The second parameter is also a boolean, and True turns on pretty-printing (using tabs for indentation). You cannot use the repr built-in function in this way on xmltramp elements because it only accepts one argument.
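The point about repr can be illustrated with a small stand-alone class. This is a sketch in modern Python 3 under stated assumptions: Node is a hypothetical class, not xmltramp's Element, and it omits attributes, escaping, and pretty-printing. It shows a __repr__ method with an extra optional parameter that is reachable only by calling the method directly.

```python
class Node:
    """Hypothetical element type with an extended __repr__ signature."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __repr__(self, recursive=False):
        if not recursive:
            # Abbreviated form, used by the repr built-in
            return "<%s>...</%s>" % (self.name, self.name)
        # Full form: serialize child nodes recursively, text as-is
        inner = "".join(
            child.__repr__(True) if isinstance(child, Node) else child
            for child in self.children
        )
        return "<%s>%s</%s>" % (self.name, inner, self.name)


label = Node("label", [Node("name", ["Ezra Pound"])])
print(repr(label))           # built-in passes no extra arguments
print(label.__repr__(True))  # direct call reaches the extra parameter

try:
    repr(label, True)
    builtin_accepts_flag = True
except TypeError:
    builtin_accepts_flag = False
print(builtin_accepts_flag)
```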
To delete an element, you must use the sequence idiom for deletion, in contrast to the use of mapping idiom for addition of elements:
>>> del doc[1][2] #Remove newly added quote element
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
You can add more complex elements by parsing well-formed XML documents and adding the results as new elements:
>>> new_elem = xmltramp.parse("<emph>Make it new</emph>")
>>> doc[1]['quote'] = new_elem
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote><emph>Make it new</emph></quote></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
But you cannot add mixed content so easily because you can't parse a document which isn't well-formed XML.
>>> new_elem = xmltramp.parse("Make it <emph>new</emph>")
[... Raises a SAX parse exception ...]
You would have to combine other operations to add such mixed content.
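The well-formedness constraint is general to XML parsers, and one common workaround is to wrap the fragment in a temporary element before parsing. The sketch below shows this in modern Python 3 with the standard library's ElementTree rather than xmltramp; how the parsed result would then be grafted into an xmltramp tree is not covered here, and the wrapper name quote is simply borrowed from the sample document.

```python
import xml.etree.ElementTree as ET

fragment = "Make it <emph>new</emph>"

# A bare mixed-content fragment has no single root element, so it is rejected
try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError:
    parsed_bare = False

# Wrapping the fragment in a container element makes it well-formed
quote = ET.fromstring("<quote>%s</quote>" % fragment)

print(parsed_bare)
print(repr(quote.text), quote[0].tag, repr(quote[0].text))
```

In ElementTree the leading text of the mixed content ends up in the wrapper's text attribute, while the emph element becomes its first child.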
pxdom
pxdom 0.6, like xmltramp, comes as a single Python module. Again I simply copied pxdom.py to a directory in my Python 2.3.2 library path (pxdom supports Python versions from 1.5.2 on). pxdom scrupulously implements DOM Level 3's Load/Save specification, which standardizes serialization and deserialization between XML text and DOM. To read XML from a file, use a pattern such as that in listing 2.
Listing 2: Basic loading of an XML file
import pxdom
dom = pxdom.getDOMImplementation('')
parser = dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc = parser.parseURI('file:labels.xml')
pxdom also provides some convenience functions, parseString and parse (which accepts a file-like object or an OS-specific pathname), which are not provided for in DOM but are added in minidom. Listing 3 demonstrates some DOM operations using pxdom.
Listing 3: DOM operations using pxdom
import pxdom

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

#Create a pxdom document node parsed from XML in a string
dom = pxdom.getDOMImplementation('')
parser = dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc_node = pxdom.parseString(DOC)
print doc_node

#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
print verse_element

#As with other Python DOMs you can use "Pythonic" shortcuts for
#things like Node lists and named node maps

#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]

#attribution_string becomes "Christopher Okibgo"
attribution_string = attribution_element.firstChild.data
print repr(attribution_string)
I was a bit concerned to see that the output from the last line of the listing is a plain text string rather than a Unicode object. I experimented a bit and found that if any text node has a non-ASCII character, pxdom appears to be representing it as a Unicode object rather than a plain string. This at least reassured me of pxdom's Unicode support, but I wonder whether such a mix of text and Unicode objects adds unnecessary complications.
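For comparison, the same navigation can be tried with the standard library's minidom, whose parseString convenience function pxdom mirrors. This is a sketch in modern Python 3 using minidom rather than pxdom, so it says nothing about pxdom's own behavior; in Python 3 the mixed string types described above do not arise, since every str is Unicode.

```python
from xml.dom import minidom

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
</verse>
"""

# Parse the string into a DOM document node
doc_node = minidom.parseString(DOC)
verse_element = doc_node.documentElement

# childNodes[0] is a whitespace text node; childNodes[1] is attribution
whitespace_node = verse_element.childNodes[0]
attribution_element = verse_element.childNodes[1]
attribution_string = attribution_element.firstChild.data

# Standard DOM query for all line elements under verse
lines = [
    elem.firstChild.data
    for elem in verse_element.getElementsByTagName("line")
]

print(repr(attribution_string))
print(lines)
```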
Listing 4 shows how to use pxdom to build a DOM tree from scratch, node by node, and then print the corresponding XML. Rather than the toxml method of minidom and the Print and PrettyPrint functions of Domlette and 4DOM respectively, pxdom implements the DOM standard saveXML method.
Listing 4: Building a DOM tree from scratch with pxdom
import pxdom
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE

impl = pxdom.getDOMImplementation('')

#Create a document type node using the doctype name "message"
#A blank system ID and blank public ID (i.e. no DTD information)
doctype = impl.createDocumentType(u"message", None, None)

#Create a document node, which also creates a document element node
#For the element, use a blank namespace URI and local name "message"
doc = impl.createDocument(EMPTY_NAMESPACE, u"message", doctype)

#Get the document element
msg_elem = doc.documentElement

#Create an xml:lang attribute on the new element
msg_elem.setAttributeNS(XML_NAMESPACE, u"xml:lang", u"en")

#Create a text node with some data in it
new_text = doc.createTextNode(u"You need Python")

#Add the new text node to the document element
msg_elem.appendChild(new_text)

#Print out the result
print doc.saveXML()
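Because the tree-building calls in the listing above are standard DOM, the same steps can be tried with minidom, substituting its toxml method for saveXML. The following is a sketch in modern Python 3 using minidom rather than pxdom; the exact serialized form (declaration and doctype layout) is minidom's and may differ from what pxdom produces.

```python
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE, minidom

impl = minidom.getDOMImplementation()

# Document type node named "message", with no public or system ID
doctype = impl.createDocumentType("message", None, None)

# Document node plus a document element in no namespace
doc = impl.createDocument(EMPTY_NAMESPACE, "message", doctype)
msg_elem = doc.documentElement

# xml:lang attribute in the reserved XML namespace
msg_elem.setAttributeNS(XML_NAMESPACE, "xml:lang", "en")

# Text content for the document element
msg_elem.appendChild(doc.createTextNode("You need Python"))

# minidom serializes with toxml rather than saveXML
xml_text = doc.toxml()
print(xml_text)
```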
There is much more to pxdom than I can cover here. After all, it is a complete DOM implementation. The pxdom project puts a premium on conformance, and the module does extremely well running the DOM Level 1/2 Test Suite.
Wrap up
The choices available to Python developers for processing XML continue to multiply, which is a blessing as well as a curse -- there is plenty of variety and choice, but there is also a lot to keep track of. xmltramp and pxdom demonstrate the variety especially well, providing contrasting styles for XML processing. If you need a quick and dirty excavation of an XML document to extract key data, xmltramp is a nice tool to have on hand. If you want to stick to the standard DOM idiom, or need to be able to control all the advanced aspects of XML documents, pxdom is a trusty companion. There are more choices that I have not been able to cover yet, notably PyRXP. I have also not provided much coverage of XML namespaces in articles on individual tools, but I shall be looking at namespace processing across libraries. Watch for such topics in future columns and don't hesitate to post your own ideas for useful coverage in the comments section of this article.