Introducing the Amara XML Toolkit
January 19, 2005
As part of my roundup of Python data bindings, I introduced my own Anobind project. Over the column's history, I've also developed other code to meet some need emphasized in one of the previous articles. I recently collected all of these various little projects together into one open source package of XML processing add-ons, Amara XML Toolkit. Amara is meant to complement 4Suite in that 4Suite works towards fidelity to XML technical ideals, while Amara works towards fidelity to Python conventions, taking maximum advantage of Python's strengths. The main components of Amara XML Toolkit are the following:
- Bindery: data binding tool. The code that was formerly available standalone as "Anobind" but with extensive improvements and additions, including a move of the fundamental framework from DOM to SAX.
- Scimitar: an implementation of the ISO Schematron schema language for XML. It also used to be a standalone project, which I've announced here in the past. It converts Schematron files to standalone Python scripts.
- domtools: helper routines for working with Python DOMs, many of which first made their appearance in previous articles such as "Generating DOM Magic" and "Location, Location, Location."
- saxtools: helper frameworks and routines for easier use of Python's SAX implementation, many of which first made their appearance in previous articles such as "Decomposition, Process, Recomposition".
- Flextyper: implementation of Jeni Tennison's Data Type Library Language (DTLL), which is on track to become part 5 of ISO Document Schema Definition Languages (DSDL). You can use Flextyper to generate Python modules containing data type classes that can be used with 4Suite's RELAX NG library, although it won't come into its full usefulness until the next release of 4Suite.
In this article I introduce parts of Amara, focusing on several little, common tasks it's supposed to help with. Some of these are tasks you will recognize from earlier articles in this column. Amara requires Python 2.3 or later and 4Suite 1.0a4 or later. I used Python 2.3.4 to run all listings presented, working with Amara 0.9.2. With the prerequisites in place, installation is the usual matter of python setup.py install.
Best of SAX and DOM
The very first sample task needs very little preamble. See listing 1, a form of the address label example I so often use.
Listing 1: Sample XML file (labels.xml) containing Address Labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label id="tse" added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
    <quote>
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
  </label>
  <label id="ep" added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
    <quote>
      What thou lovest well remains, the rest is dross…
    </quote>
  </label>
  <!-- Throw in 10,000 more records just like this -->
  <label id="lh" added="2004-11-01">
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <state>NY</state>
    </address>
  </label>
</labels>
Listing 2 is code to print out all people and their street addresses.
Listing 2 (listing2.py): Amara Pushdom code to print out all people and their street addresses
from amara import domtools

for docfrag in domtools.pushdom('/labels/label', source='labels.xml'):
    label = docfrag.firstChild
    name = label.xpath('string(name)')
    city = label.xpath('string(address/city)')
    print name, 'of', city
The code is extremely simple, but it does print what a quick glance might lead you to expect:
$ python listing2.py
Thomas Eliot of Stamford
Ezra Pound of Hailey
Langston Hughes of Harlem
The trick is how it does this. domtools.pushdom is a generator which yields one DOM document fragment at a time, such that the entire document is broken down into a series of subtrees given by the pattern passed in: /labels/label. The full document is never in memory (in fact, the code never takes up much more memory than it takes to maintain a DOM node for a single label element). If, as the comment in listing 1 suggests, there were 10,000 more label elements, the memory usage wouldn't be much greater, although if your loop iterates faster than Python can reclaim each discarded node, you might want to add an explicit gc.collect() at the end of the loop. Each node yielded by the generator is a basic Domlette node, with all the usual properties and methods this makes available, including the useful xpath() method.
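If you don't have Amara at hand, the generator-of-subtrees idea can be approximated with the standard library. The following is a minimal sketch, assuming Python 3 and xml.dom.pulldom. It is not how domtools.pushdom is implemented, it matches on a bare tag name rather than a pattern such as /labels/label, and the names iter_subtrees, text_of and SAMPLE are made up for illustration.

```python
import io
from xml.dom import pulldom

# A trimmed-down stand-in for labels.xml (hypothetical sample data).
SAMPLE = b"""<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


def iter_subtrees(stream, tag):
    """Yield one expanded DOM element per start tag named `tag`."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Attach this element's descendants to it; nodes outside
            # the match are not kept around.
            events.expandNode(node)
            yield node


def text_of(element, tag):
    """Join the text children of the first descendant named `tag`."""
    target = element.getElementsByTagName(tag)[0]
    return ''.join(child.data for child in target.childNodes
                   if child.nodeType == child.TEXT_NODE)


lines = ['%s of %s' % (text_of(label, 'name'), text_of(label, 'city'))
         for label in iter_subtrees(io.BytesIO(SAMPLE), 'label')]
for line in lines:
    print(line)
```

As with pushdom, each yielded element can be dropped as soon as the loop body is done with it, so memory use tracks the size of one record rather than the whole file.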
Compare listing 2 above to listing 4 of "Decomposition, Process, Recomposition" and you'll get a sense of how wrapping up the ideas from that article simplifies things.
If DOM Is Too Lame for You
Pythonic APIs are meant to make life easier for the many users who find DOM too arcane and alien for use in Python. Almost all of the earlier article on Anobind is still valid in Amara. The biggest change is in the imports. I also added some concessions to people who really don't want to worry about URL and file details and the like; the eight lines of listing 1 from the earlier article can now be reduced to two lines (the top two of listing 3). Listing 3 is an example of how I could use Amara Bindery to display names and cities from listing 1, the functional equivalent of listing 2.
Listing 3: Amara Bindery code to print out all people and their street addresses
from amara import binderytools

container = binderytools.bind_file('labels.xml')
for l in container.labels.label:
    print l.name, 'of', l.address.city
binderytools.bind_file takes a file name, parses the file, and returns a data binding, rooted at the object container, which represents the XML root node. Each element is a specialized object that permits easy access to the data using Python idioms, with object property names based on the names of XML tags and attributes. In a typical expression of the prevalent attitude in the Python community, one blogger called it "turning XML into something useful."
The Natural Next Step: Push Binding
One possible problem with listing 3 is that the entire XML document is converted to Python objects, which could mean a lot of memory usage for large documents, for example, if labels.xml were expanded to have 10,000 entries in label elements. Amara Bindery does mitigate this a little bit by using SAX to create data bindings, but this may not be good enough. What would be great is some way to use the pushdom approach from listing 2 while still having the ease-of-use advantage of Amara Bindery. This option is available as the Push binding, illustrated in listing 4.
Listing 4: Amara Push binding code to print out all people and their street addresses
from amara import binderytools

for subtree in binderytools.pushbind('/labels/label', source='labels.xml'):
    print subtree.label.name, 'of', subtree.label.address.city
You use patterns just as in listing 2 to break up the document, and just as in listing 2, binderytools.pushbind is a generator that instantiates part of the document at a time, thus never using up the memory needed to represent the entire document. This time, however, the values yielded by the generator are subtrees of an Amara binding rather than DOM nodes, so you can use the more natural Python idioms to access the data, if you prefer.
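To make the combination of streaming and attribute-style access concrete, here is a toy sketch, assuming Python 3 and the standard library's ElementTree. It is an illustration of the general idea only, not Amara's implementation or API: the Bound class, the push_bind function and the SAMPLE data are all hypothetical, and the sketch assumes the matched elements are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET

# A trimmed-down stand-in for labels.xml (hypothetical sample data).
SAMPLE = b"""<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="lh"><name>Langston Hughes</name>
    <address><city>Harlem</city></address></label>
</labels>"""


class Bound:
    """Toy binding: child elements and XML attributes become attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        if name in self._element.attrib:
            return self._element.attrib[name]
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        return Bound(child)

    def __str__(self):
        # The string value is the element's text content.
        return ''.join(self._element.itertext()).strip()


def push_bind(source, tag):
    """Yield a Bound wrapper for each completed element named `tag`."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the root's start tag
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield Bound(elem)
            # Detach finished records so the tree does not keep growing.
            root.clear()


lines = ['%s of %s' % (label.name, label.address.city)
         for label in push_bind(io.BytesIO(SAMPLE), 'label')]
for line in lines:
    print(line)
```

The wrapper is deliberately thin: it resolves names lazily on each access, which keeps the sketch short but ignores real-world concerns such as namespaces, repeated children, and tag names that are not valid Python identifiers.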
Modification
Amara Bindery makes it pretty easy to modify XML objects in place and reserialize them back to XML. As an example, listing 5 makes some changes to one of the label elements and then prints the result back out.
Listing 5: Amara Bindery code to update an address label entry
from amara import binderytools

container = binderytools.bind_file('labels.xml')

#Add a quote to the Langston Hughes entry

#The quote text to be added
new_quote_text = \
    u'\u2026if dreams die, life is a broken winged bird that cannot fly.'
#The ID of Hughes's entry
id = 'lh'
#Cull to a list of entries with the desired ID
lh_label = [ label for label in container.labels.label
             if label.id == 'lh' ]
#We know there's only one, so get it
lh_label = lh_label[0]

#Now we have an element object.  Add a child element to the end
#xml_element is a factory method for elements.
#Specify no namespace, 'quote' local name
#Append the result to the label element
lh_label.xml_append(container.xml_element(None, u'quote'))

#Now set the child text on the new quote element
#Notice how easily the new quote element can be accessed
lh_label.quote.xml_children.append(new_quote_text)

#Change the added attribute
#Even easier than adding an element
lh_label.added = u'2005-01-10'

#Print the updated label element back out
print lh_label.xml()

#If you want to print the entire, updated document back out, use
#print container.xml()
Again, the code's comments should provide all the needed explanation.
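For comparison, roughly the same update can be sketched with the standard library's ElementTree, assuming Python 3. This is not Amara's API, and the trimmed SAMPLE document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down stand-in for labels.xml (hypothetical sample data).
SAMPLE = """<labels>
  <label id="ep" added="2003-06-10"><name>Ezra Pound</name></label>
  <label id="lh" added="2004-11-01"><name>Langston Hughes</name></label>
</labels>"""

root = ET.fromstring(SAMPLE)

# Select the entry by its id attribute (ElementTree's XPath subset).
lh_label = root.find("label[@id='lh']")

# Append a new child element and give it text content.
quote = ET.SubElement(lh_label, 'quote')
quote.text = '\u2026if dreams die, life is a broken winged bird that cannot fly.'

# Update an existing attribute.
lh_label.set('added', '2005-01-10')

# Serialize just the modified element back to XML text.
updated = ET.tostring(lh_label, encoding='unicode')
print(updated)
```

The steps map one-to-one onto listing 5: locate the element, append a child, set its text, change an attribute, and reserialize.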
Taming SAX
Sometimes, though perhaps rarely, you may need to process huge files that cannot easily be broken into simple patterns. You may need to write SAX code, but of course as discussed often in this column, SAX isn't always an easy tool to use. Amara provides several tools to help make SAX easier to use, including a module saxtools.xpattern_sax_state_machine which can write SAX state machines for you, given patterns. In fact, this module is used in domtools.pushdom and binderytools.pushbind. There is also a framework, Tenorsax, to help effectively linearize SAX logic. With Tenorsax, you register callback generators rather than callback functions, and, using the magic of Python generators, each callback actually receives multiple SAX events within its logic, so you can use local variables and manage state more easily than in most SAX code. Listing 6 is an example using Tenorsax to also go through the labels XML file and print names and addresses. Tenorsax is overkill for such a purpose, and you've already seen how to accomplish it much more easily with Amara, but it should illustrate the workings of Tenorsax.
Listing 6: Tenorsax code to print out all people and their street addresses
import sys
from xml import sax
from amara import saxtools

class label_handler:
    def __init__(self):
        self.event = None
        self.top_dispatcher = {
            (saxtools.START_ELEMENT, None, u'labels'):
            self.handle_labels,
        }
        return

    def handle_labels(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, u'label'):
            self.handle_label,
        }
        #First round through the generator corresponds to the
        #start element event
        yield None
        #delegate is a generator that handles all the events "within"
        #this element
        delegate = None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        #Element closed.  Wrap up
        return

    def handle_label(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, 'name'):
            self.handle_leaf,
            (saxtools.START_ELEMENT, None, 'city'):
            self.handle_leaf,
        }
        delegate = None
        yield None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        return

    def handle_leaf(self, end_condition):
        element_name = self.event[2]
        yield None
        name = u''
        while not self.event == end_condition:
            if self.event[0] == saxtools.CHARACTER_DATA:
                name += self.params
            yield None
        #Element closed.  Wrap up
        print name,
        if element_name == u'name':
            print 'of',
        else:
            print
        return

if __name__ == "__main__":
    parser = sax.make_parser()
    #The "consumer" is our own handler
    consumer = label_handler()
    #Initialize Tenorsax with handler
    handler = saxtools.tenorsax(consumer)
    #Resulting tenorsax instance is the SAX handler
    parser.setContentHandler(handler)
    parser.setFeature(sax.handler.feature_namespaces, 1)
    parser.parse('labels.xml')
Tenorsax allows you to define a hierarchy of generators which handle subtrees of the document. Each generator gets multiple SAX events. Tenorsax takes advantage of the fact that Python generators can be suspended and resumed. Each time a Tenorsax handler generator yields, it is suspended, and when the next SAX event comes along, it's resumed. The current event information is always available as self.event. Tenorsax allows you to define dispatcher dictionaries which map SAX event details to subsidiary generators. The current subsidiary generator is called delegate in listing 6, because the relationship between a generator and its subsidiaries basically forms a delegation pattern. Tenorsax automatically creates and runs the delegates within the main event loop, while not self.event == end_condition. The body of this loop is usually a call back to the Tenorsax framework, although you can also add specialized logic for the events that you want each generator to handle itself. end_condition is provided by Tenorsax so that generators know when to quit. For a start element, the end condition is set up to be the event that marks the corresponding end element. handle_leaf is an example of linear logic across SAX events.
It aggregates text from multiple character events into one string, either the contents of the name element or the city element. It builds this using a local variable, which is not possible with regular SAX. Usually, you'd have to use an instance variable that is governed by a state machine (so that it is not grabbing text from the wrong events). Listing 6 is certainly much more ponderous than all the other sample code so far. Again, you would not usually use the heavy artillery of Tenorsax unless you had logic that was very hard to force into one of the other facilities in Amara.
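The suspend-and-resume mechanism is easier to see stripped of the framework. The following is a minimal sketch, assuming Python 3 and only the standard library's xml.sax. It does not use or reproduce the Tenorsax API; GeneratorHandler, collect_labels and SAMPLE are hypothetical names, and the event tuples are a simplification invented for this example.

```python
from xml import sax

# A trimmed-down stand-in for labels.xml (hypothetical sample data).
SAMPLE = b"""<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street>
      <city>Stamford</city></address></label>
  <label id="lh"><name>Langston Hughes</name>
    <address><city>Harlem</city></address></label>
</labels>"""


class GeneratorHandler(sax.ContentHandler):
    """Forward every SAX event to a generator as a small tuple."""

    def __init__(self, consumer):
        super().__init__()
        self._consumer = consumer
        next(self._consumer)  # run the generator up to its first yield

    def startElement(self, name, attrs):
        self._consumer.send(('start', name))

    def endElement(self, name):
        self._consumer.send(('end', name))

    def characters(self, content):
        self._consumer.send(('text', content))


def collect_labels(results):
    """Consume events linearly, keeping all state in local variables."""
    while True:
        event = yield
        if event != ('start', 'label'):
            continue
        fields = {}
        # Stay inside this label until its end tag arrives.
        while event != ('end', 'label'):
            event = yield
            if event[0] == 'start' and event[1] in ('name', 'city'):
                leaf, text = event[1], ''
                # Gather character data across several events.
                while event != ('end', leaf):
                    event = yield
                    if event[0] == 'text':
                        text += event[1]
                fields[leaf] = text
        results.append('%s of %s' % (fields['name'], fields['city']))


results = []
sax.parseString(SAMPLE, GeneratorHandler(collect_labels(results)))
for line in results:
    print(line)
```

Each bare yield suspends the generator until the handler sends the next event, so the nested loops read like ordinary sequential code even though control returns to the parser between events.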
Wrapping Up
There is a lot more to Amara XML Toolkit than I can cover in this article. The aim of the project is versatility—giving the developer many flexible ways of processing XML using idioms and native advantages of Python. Because of the popularity of languages such as Java, many XML standards have evolved in directions that don't match up with Python's strengths. Amara looks to bridge that gap. If you're curious about the project name, see this posting.
As often happens in the holiday season, activity has been a bit slow. Holiday revels are also a good excuse for an announcement entitled "xsdb does XML, SQL is dead as disco." Seems Aaron Watters's xsdb project, "a framework for distributing querying and combining tabular data over the Internet," has been renamed "xsdbXML." The announcement is a bit sketchy on the role of XML, but looking at the use cases, it seems xsdbXML is based on pure XML expressions of relational tables, meaning it effectively short-circuits SQL (which is, after all, but one realization of the relational calculus, and one that many relational purists consider flawed). The queries are also expressed in XML. This is a very interesting project, and coming from the brains behind Gadfly, you can expect the highest technical standards. Perhaps less whimsical announcements will help it gain the notice it deserves.
Walter Dörwald announced XIST 2.8. "XIST is an extensible HTML/XML generator written in Python. XIST is also a DOM parser (built on top of SAX2) with a very simple and Pythonesque tree API." This release requires Python 2.4, and there have been some API changes. See the announcement.
Dave Kuhlman announced generateDS 8a. generateDS is a data binding tool that generates Python data structures from a W3C XML Schema. I covered generateDS in an earlier article. This release adds support for mixed content, structured type extensions (limited support), attribute groups, and substitution groups (limited support). See the announcement.