Generating DOM Magic
January 8, 2003
Python 2.2 introduced generators, a special type of function which has more flexible flow-of-control than ordinary, procedural functions. Standard procedural functions start at the top and execute until returning to the caller, maintaining all along a local state for the subroutine (which comprises local variables and passed-in parameters). The function can have multiple return points, but for each invocation, it runs until a single return, and its local state then becomes unavailable. Generators, by contrast, can send control back to the caller and yet remain in a sort of suspended animation, maintaining local state. Such a temporary return is called "yielding". The caller can then revive the generator at any time, at which point operation continues until the next yield or until a final return.
Generators, as they appeared in Python 2.2, were born in Christian Tismer's Stackless Python as a feature made practical by continuations, the core construct of Stackless. Generators are especially suited to efficient operation on tree structures, which are, of course, the natural structure for XML processing. In this article I present examples and techniques for using generators for XML processing. If you're not already familiar with Python generators, you should read " Iterators and simple generators", by David Mertz. Since this article builds on a previous one of mine, " Streamlining DOM XML processing with Python", I recommend that all readers review it before continuing. At the end of this article there are some resources which will help you acquire other relevant background knowledge.
Implementing a DOM classic
Listing 1 presents two functions for iterating over all the nodes in a DOM tree in document order, both of which were introduced in "Streamlining DOM XML processing with Python".
Listing 1: generators for filtered or unfiltered iteration over a DOM tree#Required in Python 2.2, and must be the first import from __future__ import generators from xml.dom import Node def doc_order_iter(node): """ Iterates over each node in document order, yielding each in turn, starting with the given node node - the starting point (subtree rooted at node will be iterated over document order) """ #Document order returns the current node, #then each of its children in turn yield node for child in node.childNodes: #Create a generator for each child, #Over which to iterate for cn in doc_order_iter(child): yield cn return def doc_order_iter_filter(node, filter_func): """ Iterates over each node in document order, applying the filter function to each in turn, starting with the given node, and yielding each node in cases where the filter function computes true node - the starting point (subtree rooted at node will be iterated over document order) filter_func - a callable object taking a node and returning true or false """ if filter_func(node): yield node for child in node.childNodes: for cn in doc_order_iter_filter(child, filter_func): yield cn return
With these generators in hand we can build several other useful functions simply and
easily -- functions which will perform well, too. Users of the 4Suite Domlettes sometimes
complain that they do not include the handy DOM utility methods
getElementsByTagName
and getElementsByTagNameNS
. In fact, I
warned about this omission in my last article A Python & XML Companion.
There are so many trivial ways to implement these as functions, but I think the best
is
using the generator-iterators, as in listing 2. The generator functions presented
here
will work with any Python DOM implementation which conforms to the published API of
xml.dom.Node in standard library reference.
def get_elements_by_tag_name(node, name): """ Returns an iterator (in document order) over all the element nodes that are descendants of the given one, and have the given tag name. y node will be returned if it has the right name name - the node name to check """ return doc_order_iter_filter(node, lambda n: n.nodeType == \ Node.ELEMENT_NODE and n.nodeName == name) def get_elements_by_tag_name_ns(node, ns, local): """ Returns an iterator (in document order) over all the element nodes that are descendants of the given one, and have the given namespace and local name. node - the starting point (subtree rooted at node will be searched) node will be returned if it has the right ns and local name ns - the namespace to check local - the local name to check """ return doc_order_iter_filter(node, lambda n: n.nodeType == \ Node.ELEMENT_NODE and n.namespaceURI == ns and n.localName == local)
get_elements_by_tag_name
uses the filtered iterator, filtering by the
criteria that the node is an element and that its node name matches the one given.
get_elements_by_tag_name_ns
is a namespace-aware variation, which takes the
namespace and local name for the comparison. The following snippet illustrates their
operation by printing all element nodes named "line" (with no namespace) in a document.
DOC = """<?xml version='1.0' encoding='iso-8859-1'?> <verse> <attribution>Louis MacNeice</attribution> <line>The Laird o' Phelps spent Hogmanay declaring he was sober,</line> <line>Counted his feet to prove the fact and found he had one foot over.</line> <line>Mrs. Carmichael had her fifth, looked at the job with repulsion,</line> <line>Said to the midwife "Take it away; I'm through with overproduction."</line> </verse> """ from Ft.Xml.Domlette import NonvalidatingReader doc = NonvalidatingReader.parseString(DOC, "http://dummy.ns") print "Get elements by tag name:" for node in get_elements_by_tag_name(doc, 'line'): print node print "Get elements by tag name ns:" for node in get_elements_by_tag_name_ns(doc, None, 'line'): print node
I use None
to represent "no namespace" when I invoke
get_elements_by_tag_name_ns
. Don't forget this important Python-XML
convention. When I posted this generator-based method of writing a
getElementsByTagName
implementation, my colleague Mike Olson followed up
with an XPath approach, presented in listing 3.
from Ft.Xml.XPath import Evaluate, Context from Ft.Xml.Domlette import GetAllNs def get_elements_by_tag_name_ns_xpath(node, name): #I need the context node and a set of namespace mappings #In case they use a QName for name e.g. ns:local #Use the dictionary of all namespace mappings on the given node context = Context.Context(node, GetAllNs(node)) #Now do an XPath evaluation of an expression that searches #The descendant-or-self axis for all nodes whose name #matches the given return Evaluate(".//" + name, context=context)
This approach has some advantages. For one thing, you can pass a qualified name (QName)
to this function if you want to check within a given namespace, rather than separately
passing in the namespace URI and local name. Therefore, if you are looking for all
elements with namespace URI http://spam.com
and local name eggs
,
and the prefix sp
is bound to that namespace URI in the source document, then
you can invoke get_elements_by_tag_name_ns_xpath(node, "sp:eggs")
rather than
the more verbose get_elements_by_tag_name_ns(node, "http://spam.com",
"eggs")
. Another advantage is that the function is a two-liner. One line is needed
to capture the namespace mappings from the given node, and it so happens that, since
4Suite has been updated in CVS, this line will no longer be necessary and the following
variation will work even given a QName.
def get_elements_by_tag_name_ns_xpath(node, name) #The context= is no longer needed because we're passing in the node #from which the context will be assumed return Evaluate(".//" + name, node)
With 4Suite 0.12.0a3 and earlier versions, this function will fail with an "unknown
prefix" exception if you pass in a QName. Anyway, the shorter version means that you
needn't even make it a function. You can just write Evaluate(node,
".//sp:eggs")
in-line wherever you need the function. A final advantage is that
the XPath approach works with Python 2.1, which does not support generators. You can
also
write the function much the same way as with generators but using raw recursion. Listing
5
is an example of this approach.
def get_elements_by_tag_name_ns_raw(node, ns, local): if node.nodeType == Node.ELEMENT_NODE and \ node.namespaceURI == ns and node.localName == local: result = [node] else: result = [] for child in node.childNodes: result.extend(get_elements_by_tag_name_ns_raw(child, ns, local)) return result
One disadvantage of the XPath approach is that it would be much slower because of the overall overhead and machinery of the XPath implementation. In theory, the raw recursion approach should also be less efficient than the generator approach, but by a much smaller margin. The expected slowness of raw recursion is due to two factors: first, Python's function call overhead -- which is less for resuming generators than for a proper recursive call -- and, second, because building the result sublists in memory takes up extra space that is saved in the generator approach. To confirm these speed hypotheses, I wrote a small performance test script:
doc = NonvalidatingReader.parseUri(BIG_FILE_NAME) import time start = time.time() for i in range(10): for node in get_elements_by_tag_name_ns(doc, None, NAME): pass end = time.time() print "Generator getElementsByTagNameNS: %.2f"%(end-start) start = time.time() for i in range(10): for node in get_elements_by_tag_name_ns_xpath(doc, NAME): pass end = time.time() print "XPath getElementsByTagNameNS: %.2f"%(end-start) start = time.time() for i in range(10): for node in get_elements_by_tag_name_ns_raw(doc, None, NAME): pass end = time.time() print "Raw recursion getElementsByTagNameNS: %.2f"%(end-start)
where BIG_FILE_NAME
is an XML file to process and NAME
is the
name of an element to find throughout the file. A null namespace is assumed. I ran
this
against Elliotte
Rusty Harold's 1998 baseball stats (stats.xml) and two other well-known large files,
Hamlet (hamlet.xml) and the Old Testament (ot.xml) from John Bosak's Revised XML
Document Collections. The results I found are tabulated below (all timings in
seconds):
XML source | raw file size | elem to find | number found | 10 gen runs | 10 XPath runs | 10 raw runs |
---|---|---|---|---|---|---|
hamlet.xml | 288KB | LINE | 4012 | 2.14 | 40.60 | 2.07 |
stats.xml | 644KB | GAMES | 1226 | 5.59 | 39.18 | 5.52 |
ot.xml | 3412KB | v | 23145 | 8.25 | 2275.06 | 7.83 |
You can see how badly the XPath version lags in this test, but I was quite surprised by how close the generator and raw recursion methods are. The test machine is a Dell Inspiron 8100 laptop, Pentium III 1.13GHz, 512MB RAM, Red Hat Linux 8.0, Python 2.2.1 (built myself), 4Suite CVS from 1 January 2003. Certainly none of the tests were big enough to swamp my memory with sublists in the raw recursion function, but I still expected a general performance hit.
More DOM tricks
There are many other utility functions you can create using the generator-iterator
foundations I've laid. I often find that I don't need a list of all the subelements
with a
particular name; instead, I need the first one from that list in document order. In
other
words, I often use an idiom like get_elements_by_tag_name(node, name)[0]
.
This is very efficient using generators: the generator only does as much work as I
ask it
to. If the list of elements retrieved is 23,145 elements long, the generator doesn't
have
to bother looking past the one it finds. This is undoubtedly an area where the generator
implementation beats raw recursion, as I was able to confirm with further experimentation.
Since this is such a common idiom, though, it's nice to give it a convenient handle,
and
so I use the function in listing 4.
def get_first_element_by_tag_name_ns(node, ns, local): """ Returns the first element in document order with the given namespace and local name node - the starting point (subtree rooted at node will be searched) node will be returned if it has the right ns and local name ns - the namespace to check local - the local name to check """ return get_elements_by_tag_name_ns(node, ns, local).next()
The next()
is the way the first item must be retrieved, rather than
[0]
, because generators don't allow random access and only support the
method to retrieve the next yielded value in order.
A common XML processing operation is to retrieve only the character data within a
given
node, which is basically how conversion from node to string works in XPath. You may
be
familiar with this operation from common uses of xsl:value-of
in XSLT.
Listing 5 offers this operation on DOM nodes using the generator-iterator.
def string_value(node): """ Return the XPath string value of the given node This basically consists of one string comprising all the character data in the node and its descendants node - the starting point """ text_nodes = doc_order_iter_filter( node, lambda n: n.nodeType == Node.TEXT_NODE ) return u''.join([ n.data for n in text_nodes ])
This function gets a generator-iterator as usual that filters out all but text nodes.
Then it uses list comprehensions on the generator-iterator to create a list of the
data
strings from each node. Finally, it stitches this list together into a single string
using
join
. It could be even more clever and use another generator to extract the
data strings from the text nodes rather than a list comprehension, but that doesn't
really
gain much and would be less clear. Don't hesitate to use generators in list
comprehensions, as this combination works in many very handy idioms.
Another common need is to search descendant elements for a particular substring among their text nodes; basically, a full-text search. Listing 6 defines a function for this purpose.
Listing 6: a full-text search functiondef text_search(node, sought): """ Return a list of descendant elements which contain a given substring in their text children node - the starting point (subtree rooted at node will be searched) sought - the substring to find """ return doc_order_iter_filter( node, lambda n: n.nodeType == Node.ELEMENT_NODE \ and [ 1 for cn in n.childNodes \ if cn.nodeType == Node.TEXT_NODE \ and cn.data.find(sought) != -1 ] )
In this function I use the generator-iterator-filter foundation. The filter passes
a
descendant node if it is an element and has a child node which is text and contains
the
substring. Again I use list comprehensions, but in a different way this time. It returns
a
list that is empty if no child nodes contain the substring (tested by using the
find
method on the data string). If one or more of the text nodes contains
the substring, then a list of one or more 1
values is returned. Since an
empty list is treated like a boolean false, and a list with any items like a boolean
true,
this has the effect of efficiently selecting those elements containing the wanted
substring.
Sometimes I only want to apply certain processing to leaf element nodes of a document; that is, to those element nodes which have no element children. Listing 7 gets an interator over such childless elements.
Listing 7: Function to return only leaf element nodesdef get_leaf_elements(node): """ Return a list of elements that have no element children node - the starting point (subtree rooted at node will be searched) """ return doc_order_iter_filter( node, lambda n: n.nodeType == Node.ELEMENT_NODE \ and not [ 1 for cn in n.childNodes \ if cn.nodeType == Node.ELEMENT_NODE ] )
This time the filter passes all element nodes that have no element children. The list
comprehension approach used for this test is very similar to that in
text_search
.
Conclusion
Many common DOM operations can be developed using the basic techniques presented in
this
article. The whole module, domtools.py
, with the routines I've presented is
available on my web
site. In a future article I'll present generator idioms for SAX and other
XML-processing tools. Please don't hesitate to comment with any generator-based tricks
you've used for XML processing.
Also in Python and XML |
|
Should Python and XML Coexist? |
|
Python has added a great deal of new goodies lately. I think the aggregate effect has been to make the language better, but some people have complained about the changes. Though I've seen defenders and detractors for almost every language change, generators seem to be one of those changes that are universally acclaimed. Guido van Rossum showed his extremely sharp judgment by selecting the most practical and useful aspect of Stackless Python and making it available in Python.
December was a slow month for Python-XML news. Members of the community pursued our variants of holiday, and those of us in the Northern hemisphere found hat in the short days it felt like, as John Donne said, "the worlds whole sap is sunke". One notable December discovery was PAX, the Pythonic API for XML, which allows developers to parse an XML file into a Pythonic representation (including use of Python 2.2 iterators). PAX also includes a transformation engine and is open source.
ResourcesBefore reading this article, I recommend becoming familiar with Python generators by reading "Iterators and simple generators", by David Mertz and my previous article "Streamlining DOM XML processing with Python". Christian Tismer's Stackless Python has been extensively covered on O'Reilly Network. You can gain some background on the birthplace of Python generators in "Introduction to Stackless Python" and "Programming Stackless Python", both by Cameron Laird. In "Stackless Reincarnate" Stephen Figgins discusses some recent developments in the Stackless world. George McMillan also has an excellent Stackless Python page. David Mertz's column on IBM developerWorks includes a series on Python generators: "Iterators and simple generators", "Generator-based state machines", and "Implementing 'weightless threads' with Python generators". If you are really serious about understanding generators, read PEP 255 (the formal specification for generators). |