Building Dictionaries With SAX
January 14, 2004
People have always liked to complain that Python is slow. People have always liked to complain that XML processing is slow. I've always thought both complaints are typically misguided. Neither Python nor XML emerged out of the need for ultrasonic speed at runtime. Python emerged as an expressive language for programming applications, and XML as an expressive system for building data representation into applications. In both cases the goal for the resulting applications is ease of maintenance. This ease primarily comes from the expressiveness that Python and XML allow. One should expect that such gain in productivity would be matched by a corresponding loss in some area, and sometimes the deficit comes from performance. Complaints about the downsides of this tradeoff often stem from premature optimization, when developers insist on basing key decisions on meaningless benchmarks rather than considering the entire profile under which an application operates.
All that having been said, there are certainly real world cases where one has chosen Python and XML tools with an open mind, and then decided that a performance boost is in order upon careful profiling. There are many ways to squeeze performance out of Python XML applications, some of which take advantage of Python or XML's expressive power. Each application will need its own mix of practical optimizations, but in this article I shall present a modest technique that might be handy to have in your war chest.
The dictionary express
Dictionaries are perhaps the core data structure most dear to the heart of Python. The basic object model of Python is largely constructed upon dictionaries. They have been optimized to the utmost by some of the brightest minds in the computer industry. Conventional wisdom suggests that if you need to optimize a bit of code in Python, you should take advantage of the excellent Python/C interface. A better bit of advice is that you might first want to check whether you are taking full advantage of dictionaries. One detail to keep in mind is that dictionaries are more suited to fast lookup operations than memory savings: again there is a tradeoff, space for speed in this case.
My pet description of XML's fundamental data model is "labeled strings in nested packages". The labeling and nesting are what differentiate XML from good old comma or tab-delimited value and tabular ("square") data models such as spreadsheets and classic SQL databases. This same labeling and nesting makes for a natural accommodation of data from XML in Python dictionaries. Sometimes writing tortured SAX code for complex processing just doesn't cut it as an alternative to loading documents into heavyweight DOM or the like for XPath access. You may be better off writing SAX code that is just sophisticated enough to shape the key bits from XML into a structure of nested dictionaries.
Take for example the address labels format I've been using as an example in some earlier installments, reproduced in listing 1.
Listing 1: Sample XML file (labels.xml) containing address labels<?xml version="1.0" encoding="iso-8859-1"?> <labels> <label added="2003-06-20"> <quote> <!-- Mixed content --> <emph>Midwinter Spring</emph> is its own season… </quote> <name>Thomas Eliot</name> <address> <street>3 Prufrock Lane</street> <city>Stamford</city> <state>CT</state> </address> </label> <label added="2003-06-10"> <name>Ezra Pound</name> <address> <street>45 Usura Place</street> <city>Hailey</city> <state>ID</state> </address> </label> </labels>
Well, it's not an exact reproduction. In earlier articles I used the character
…
, mistakenly thinking that it was the ellipsis in Unicode. In fact,
it is NEL, an end of line character used most often in IBM mainframe data. I erroneously
looked up a table based on Windows code pages rather than Unicode, which illustrates
all too
well the danger I cautioned against earlier of confusing character sets. Starting
from this
article I'm using the correct character …
. As an aside, there has
been a lot of controversy in the XML community because NEL, the character I happened
to
mistake for ellipsis, has been added as a standard white space character in a revision
that
will probably become XML 1.1. Some claim this breaks backward compatibility and is
not worth
the convenience of a (relatively) few mainframe users. The fact that …
is whitespace from the Unicode point of view explains what I thought was confusing
treatment
of the character by Python string processing APIs.
But back to the main topic of the article. Listing 2 contains some SAX code to create a handy dictionary of dictionaries from labels.xml. Python 2.x is the only requirement: I used Python 2.3.3.
Listing 2: a program (listing2.py) to create Python dictionaries from the labels XML formatimport sys from xml import sax from textnormalize import text_normalize_filter #Subclass from ContentHandler in order to gain default behaviors class label_dict_handler(sax.ContentHandler): #Define constants for important states CAPTURE_KEY = 1 CAPTURE_LABEL_ITEM = 2 CAPTURE_ADDRESS_ITEM = 3 def __init__(self): self.label_dict = {} #Track the item being constructed in the current dictionary self._item_to_create = None self._state = None return def startElement(self, name, attributes): if name == u"label": self._curr_label = {} if name == u"address": self._address = {} if name == u"name": self._state = self.CAPTURE_KEY if name == u"quote": self._item_to_create = name self._state = self.CAPTURE_LABEL_ITEM if name in [u"street", u"city", u"state"]: self._item_to_create = name self._state = self.CAPTURE_ADDRESS_ITEM return def endElement(self, name): if name == u"address": self._curr_label["address"] = self._address if name in [u"quote", u"name", u"street", u"city", u"state"]: self._state = None return def characters(self, text): if self._state == self.CAPTURE_KEY: self.label_dict[text] = self._curr_label curr_dict = None if self._state == self.CAPTURE_ADDRESS_ITEM: curr_dict = self._address if self._state == self.CAPTURE_LABEL_ITEM: curr_dict = self._curr_label print repr(text), curr_dict if curr_dict is not None: if curr_dict.has_key(self._item_to_create): curr_dict[self._item_to_create] += text else: curr_dict[self._item_to_create] = text return if __name__ == "__main__": parser = sax.make_parser() downstream_handler = label_dict_handler() #upstream: the parser; downstream: the next handler in the chain filter_handler = text_normalize_filter(parser, downstream_handler) #XMLFilterBase is designed so that the filter takes on much of the #interface of the parser itself, including the "parse" method filter_handler.parse(sys.argv[1]) label_dict = downstream_handler.label_dict
textnormalize
is a SAX filter that I contributed to The Python
Cookbook. A SAX parser can report contiguous text using multiple
characters
events. In other words, given the following XML document:
<spam>abc</spam>
The text "abc" could technically be reported as three characters
events: one
for the "a", one for the "b" and a third for the "c". Such an extreme case is unlikely
in
real life, but in any case textnormalize
ensures that all such broken-up events
are reported to downstream SAX handlers in the manner most developers would expect.
In the
above case, the filter would consolidate the three characters
events into a
single one for the entire text node "abc". This frees downstream handlers from implementing
code to deal with broken-up text nodes, which can be rather cumbersome when mixed
into
typical SAX-style state machine logic. I've reproduced the filter class as listing
3 for
sake of convenience.
import sys from xml import sax from textnormalize import text_normalize_filter #Subclass from ContentHandler in order to gain default behaviors class label_dict_handler(sax.ContentHandler): #Define constants for important states CAPTURE_KEY = 1 CAPTURE_LABEL_ITEM = 2 CAPTURE_ADDRESS_ITEM = 3 def __init__(self): self.label_dict = {} #Track the item being constructed in the current dictionary self._item_to_create = None self._state = None return def startElement(self, name, attributes): if name == u"label": self._curr_label = {} if name == u"address": self._address = {} if name == u"name": self._state = self.CAPTURE_KEY if name == u"quote": self._item_to_create = name self._state = self.CAPTURE_LABEL_ITEM if name in [u"street", u"city", u"state"]: self._item_to_create = name self._state = self.CAPTURE_ADDRESS_ITEM return def endElement(self, name): if name == u"address": self._curr_label["address"] = self._address if name in [u"quote", u"name", u"street", u"city", u"state"]: self._state = None return def characters(self, text): if self._state == self.CAPTURE_KEY: self.label_dict[text] = self._curr_label curr_dict = None if self._state == self.CAPTURE_ADDRESS_ITEM: curr_dict = self._address if self._state == self.CAPTURE_LABEL_ITEM: curr_dict = self._curr_label print repr(text), curr_dict if curr_dict is not None: if curr_dict.has_key(self._item_to_create): curr_dict[self._item_to_create] += text else: curr_dict[self._item_to_create] = text return if __name__ == "__main__": parser = sax.make_parser() downstream_handler = label_dict_handler() #upstream: the parser; downstream: the next handler in the chain filter_handler = text_normalize_filter(parser, downstream_handler) #XMLFilterBase is designed so that the filter takes on much of the #interface of the parser itself, including the "parse" method filter_handler.parse(sys.argv[1]) label_dict = downstream_handler.label_dict
Working with the resulting structure
In listing 2 label_dict_handler
is my special SAX handler for the labels
format. It's a somewhat rough case and does not go into the detail required to handle
some
edge cases, but it does accomplish the required task. It constructs a dictionary of
labels,
each of which is a dictionary of label details. The address details are given as yet
another
nested set of dictionaries. The main module code constructs the chain from the parser
through the textnormalize
filter to the label_dict_handler
handler.
I ran this code and prepared to experiment with the resulting dictionary as follows:
$ python -i listing2.py labels.xml
Python's standard pprint
module is a friendly way to inspect the contents of
dictionaries:
>>> import pprint >>> pprint.pprint(label_dict) {u'Ezra Pound': {'address': {u'city': u'Hailey', u'state': u'ID', u'street': u'45 Usura Place'}}, u'Thomas Eliot': {'address': {u'city': u'Stamford', u'state': u'CT', u'street': u'3 Prufrock Lane'}, u'quote': u'\n \n Midwinter Spring is its own season\u2026\n '}} >>>
Notice the proper maintenance of the Unicode from the XML document, including the tricky ellipsis character. In this form, the labels data is easy to query:
>>> #Get the Eliot quote ... >>> print repr(label_dict[u"Thomas Eliot"].get(u"quote")) u'\n \n Midwinter Spring is its own season\u2026\n ' >>> #How about a Pound quote? ... >>> print repr(label_dict[u"Ezra Pound"].get("quote")) None >>> #Tell me which labels do have quotes ... >>> print [ k for k, v in label_dict.iteritems() if v.has_key(u"quote") ] [u'Thomas Eliot']
The equivalent XPath to the last sample query is:
/labels/label[quote]/name
The Python dictionary query is certainly a bit more wordy, and it uses Python trickery
such as list comprehensions; but it's more efficient than most XPath implementations,
and if
you're using Python to a fair extent you're likely to be familiar with all the neat
idioms
anyway. Notice that I use the method iteritems
to spool out the contents of the
dictionary rather than the more common items
. iteritems
requires
Python 2.2 and saves memory by creating the key/value pairs for each item in the dictionary
on demand rather than all at once. On older Python versions just replace with
items
.
The dictionary data is just as easily manipulated although it would take some additional work to save any such updates back into the XML format.
Limitations and all that
Also in Python and XML |
|
Should Python and XML Coexist? |
|
This article's fare is not a technique for all seasons. For one thing, even a trimmed down SAX handler that does nothing but create dictionaries can become rather complicated and tricky to debug. The code to generate equivalent data bindings and Python structures in any of the many tools I've examined in this column is almost trivial compared to the SAX presented in this article. But using plain SAX to construct dictionaries is likely to offer more raw speed and less memory overhead. It also has the advantage of requiring no third party modules. One possible variation is to use SAX to construct specialized objects rather than dictionaries, which makes the query code a bit neater, but does reintroduce some overhead.
It has been a slow couple of months in the Python/XML community. One interesting development has been the announcement of Rx4Rdf. As its web page says:
Rx4RDF is a specification and reference implementation for querying, transforming and updating W3C's RDF by specifying a deterministic mapping of the RDF model to the XML data model defined by XPath. Rx4RDF shields developers from the complexity of RDF by enabling you to use familiar XML technologies like XPath, XSLT and XUpdate. We call their RDF equivalents RxPath, RxSLT, and RxUpdate respectively.
Browsing through the code, Rx4RDF appears to start with various modules from 4Suite, but combines and enhances these in very interesting ways for an RDF-centric approach to data processing. I especially found intriguing the description of Racoon, which sounds like a variation on the idea behind the popular Cocoon server, but using RDF and Python rather than XML/XSLT and Java.