XML Data Bindings in Python
June 11, 2003
In a recent interview, "What's Wrong with XML APIs", Elliotte Rusty Harold offers a familiar classification of XML APIs:
- Push APIs (e.g. SAX)
- Pull APIs (e.g. Python's pulldom)
- Tree-based APIs (e.g. DOM)
- Data binding APIs (e.g. PyXML marshalling tools)
- Query APIs (e.g. using 4XPath directly from Python)
Of late there has been a lot of talk in the XML community that there are no really easy and efficient approaches to general XML programming. Push processing has the usual rap of being too difficult. It is easy to dismiss this as a problem for amateur programmers who have not properly learned how to code state machines; but let's face it, state machines are hard to code by hand, and the community has been slow to develop more declarative and friendly tools for generating SAX processing stubs, comparable to the way LEX and YACC generate parser state machines. As frequent Python-XML contributor Tom Passin puts it in a recent XML-DEV posting, with push processing, the more context one has to keep track of between callbacks, the harder the code is to write and maintain.
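To make the point about context concrete, here is a minimal sketch using the standard xml.sax module, written for Python 3. The CityCollector class and the sample document are illustrative assumptions, not code from the posting mentioned above. Even for the simple job of collecting the text of city elements that sit directly inside address, the handler has to maintain an element stack and a text buffer by hand between callbacks.

```python
import xml.sax

# Hypothetical sample document, modeled on a simple address-label format.
SAMPLE = b"""<labels>
  <label><name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city></address>
  </label>
  <label><name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city></address>
  </label>
</labels>"""


class CityCollector(xml.sax.ContentHandler):
    """Collect the text of every city element found directly inside address."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the currently open elements
        self.buffer = None   # text pieces of the city being read, or None
        self.cities = []

    def startElement(self, name, attrs):
        if name == "city" and self.stack and self.stack[-1] == "address":
            self.buffer = []
        self.stack.append(name)

    def characters(self, content):
        # Text may arrive in several chunks, so it has to be accumulated.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        self.stack.pop()
        if name == "city" and self.buffer is not None:
            self.cities.append("".join(self.buffer))
            self.buffer = None


handler = CityCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.cities)
```

Every additional piece of context (for example, remembering which label a city belongs to) means another attribute to set and reset across the three callbacks.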
Pull processing has strong adherents, but there are also many, including me, who don't see that it really buys all that much simplicity. Tree APIs are easier to code, but less efficient as documents become larger because they generally require the entire document to be in memory. Query APIs take a step toward bridging XML and programming languages, which is a step toward making life easier for developers. Data bindings are a further step toward this goal and the focus of this article and others to come.
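As a rough illustration of the difference between the tree and query styles, the following Python 3 sketch extracts the same values twice: once by walking a DOM built with xml.dom.minidom, and once with a path expression. The path support in xml.etree.ElementTree is only a limited subset of XPath and stands in here for a full engine such as 4XPath; the sample document is an assumption made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample document used only for this comparison.
SAMPLE = """<labels>
  <label><name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city></address>
  </label>
  <label><name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city></address>
  </label>
</labels>"""

# Tree style: the whole document is loaded, then navigated node by node.
doc = minidom.parseString(SAMPLE)
dom_cities = []
for address in doc.getElementsByTagName("address"):
    for child in address.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == "city":
            dom_cities.append(child.firstChild.data)

# Query style: a single path expression states what is wanted.
root = ET.fromstring(SAMPLE)
query_cities = [city.text for city in root.findall("./label/address/city")]

print(dom_cities)
print(query_cities)
```

Both approaches still hold the full document in memory; the query form simply moves the navigation logic out of the host language and into the expression.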
The State of Python Data Bindings
A data binding is any system for viewing XML documents as databases or programming-language data structures, and vice versa. There are several aspects, including:
- marshalling -- serializing program data constructs to XML
- unmarshalling -- creating program data constructs from XML
- schema-directed binding -- using XML schema languages (DTD, WXS, RELAX NG, etc.) to provide hints and intended data constructs to marshalling and unmarshalling systems
- query-directed binding -- using XML-specific query languages such as XPath to provide hints to marshalling and unmarshalling systems
- process bindings -- mapping program or DBMS actions designed to process particular data structure patterns covered by marshalling and unmarshalling
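The first two aspects in this list can be sketched in a few lines. The following is a minimal illustration, assuming Python 3 and the standard xml.etree.ElementTree module rather than any of the tools discussed in this article; the marshal and unmarshal function names are hypothetical, and the sketch only handles a flat dictionary of strings.

```python
import xml.etree.ElementTree as ET


def marshal(record, tag="record"):
    """Serialize a flat dict of strings to an XML string (marshalling)."""
    root = ET.Element(tag)
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def unmarshal(xml_text):
    """Rebuild the dict from XML produced by marshal (unmarshalling)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text or "" for child in root}


address = {"street": "3 Prufrock Lane", "city": "Stamford", "state": "CT"}
xml_text = marshal(address, tag="address")
round_trip = unmarshal(xml_text)
print(xml_text)
print(round_trip == address)
```

A schema-directed or query-directed binding would add hints about types, repetition and nesting that this generic round trip cannot infer on its own.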
All of these aspects are available to some extent in Python, but unfortunately, the coverage is spotty. In the following list, the numbers in parentheses indicate which aspects of data binding each tool offers, counting the preceding list in order from (1) marshalling to (5) process bindings.
- Generic and WDDX marshalling in PyXML (1)(2)
  - I covered these marshalling/unmarshalling tools in the earlier article Introducing PyXML.
- generateDS.py (1)(2)(3)
  - A tool for generating Python data structures from XML Schema.
- xml_pickle and xml_objectify.py from the Gnosis XML Utilities (1)(2)
  - Tools for generic and specialized marshalling and unmarshalling.
- XBind (1)(2)
  - An XML vocabulary for specifying language-independent data bindings; includes a prototype Python implementation.
- Skyron (1)(2)(5)
  - Uses recipes encoded in XML to bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.
generateDS.py
In future articles I'll survey all these packages, starting in this article with generateDS.py, which I downloaded (generateDS-1.2a.tar.gz), unpacked, and installed using "python setup.py install". The sample file for exercising the binding is in listing 1.
Listing 1:
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label>
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
This example demonstrates a few things: an XML character entity outside the ASCII range (to test proper character support), a bit of the data flavor of XML with repeated, structured records, and a bit of the document flavor with mixed content in the quote element. The document flavor can be reinforced a bit if one treats the order of labels as important; likewise, the data flavor is reinforced if the order is considered unimportant. See this excellent discussion by Python-XML stalwart Paul Prescod for a nice contrast between data and document nuances of XML usage. Namespaces are another area of consideration, but to save space I do not cover them in this discussion of data bindings.
generateDS.py operates on a WXS definition for the XML format. See listing 2 for the WXS description of the format used in listing 1.
Listing 2:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="emph"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="emph" type="xs:string"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="street"/>
        <xs:element ref="city"/>
        <xs:element ref="state"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="street" type="xs:string"/>
  <xs:element name="city" type="xs:string"/>
  <xs:element name="state" type="xs:string"/>
</xs:schema>
generateDS.py requires PyXML, and I used the most recent CVS version. It seems to require at least Python 2.2, as it uses static methods. I used Python 2.2.2 and ran it against the WXS as follows:
python generateDS.py -o labels.py listing2.xsd
generateDS.py generates Python files with the data binding derived from the schema. The -o option gives the location of the file containing data structures derived from the schema. This is the heart of the data binding. The output file labels.py is too large to paste in its entirety, but listing 3 is a snippet to give you a feel for the output:
Listing 3:
class label:
    subclass = None
    def __init__(self, quote=None, name=None, address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        if self.name:
            self.name.export(outfile, level)
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                obj = name.factory()
                obj.build(child)
                self.setName(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

# SNIP

class name:
    subclass = None
    def __init__(self):
        pass
    def factory(*args):
        if name.subclass:
            return apply(name.subclass, args)
        else:
            return apply(name, args)
    factory = staticmethod(factory)
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<name>\n')
        level += 1
        level -= 1
        showIndent(outfile, level)
        outfile.write('</name>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            pass
# end class name
The label class has, among other things, facilities for marshalling and unmarshalling. The build method allows instances of the class to be built from a DOM, and this appears to be the only supplied method of binding from instances. This is what one might expect, since it's the easiest and most convenient way to write a data binding. It does mean that memory footprint could become a problem as the DOM contents are duplicated in the resulting data structures. Given that the DOM might become unnecessary once the data structures are complete, there seems to be some room for optimization. The export method marshals the object back to XML.
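The build and export pattern described above, along with one simple form of the suggested optimization, can be sketched as follows. This is a hand-written Python 3 stand-in, not generateDS.py output: the Address class, its FIELDS tuple and the sample document are assumptions made for illustration. Once the object is populated, the sketch calls the unlink method of the minidom document so that the DOM can be reclaimed promptly instead of sitting in memory beside the bound object.

```python
import io
from xml.dom import minidom
from xml.dom.minidom import Node
from xml.sax.saxutils import escape


class Address:
    """Hand-written stand-in for a generated binding class (illustrative)."""

    FIELDS = ("street", "city", "state")

    def __init__(self, street="", city="", state=""):
        self.street = street
        self.city = city
        self.state = state

    def build(self, node):
        # Unmarshal: copy the text of each known child element into a field.
        for child in node.childNodes:
            if (child.nodeType == Node.ELEMENT_NODE
                    and child.nodeName in self.FIELDS):
                text = "".join(t.nodeValue for t in child.childNodes
                               if t.nodeType == Node.TEXT_NODE)
                setattr(self, child.nodeName, text)

    def export(self, outfile, level=0):
        # Marshal: write the fields back out as indented XML.
        indent = "    " * level
        outfile.write("%s<address>\n" % indent)
        for field in self.FIELDS:
            value = escape(getattr(self, field))
            outfile.write("%s    <%s>%s</%s>\n" % (indent, field, value, field))
        outfile.write("%s</address>\n" % indent)


doc = minidom.parseString(
    "<address><street>45 Usura Place</street>"
    "<city>Hailey</city><state>ID</state></address>")
address = Address()
address.build(doc.documentElement)
doc.unlink()  # the DOM is no longer needed once the object is populated

out = io.StringIO()
address.export(out)
print(out.getvalue())
```

Releasing the DOM afterwards only limits how long the duplication lasts; avoiding it altogether would mean building the objects directly from a SAX or pull event stream.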
Special Schema Needs
There is a class like label for each element defined in the schema. As you can see, this even extends to the name element, and therein lies a problem. name is a simple element with only string content. But in the generated binding it is given its own class, rather than being made a simple data member of label. Even worse than that, if you follow the build method carefully, you'll see that it throws away the text content of the element upon unmarshalling. It turns out generateDS.py is rather picky in its interpretation of WXS. The relevant snippet from listing 2 is:
<xs:element name="label">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" ref="quote"/>
      <xs:element ref="name"/>
      <xs:element ref="address"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="name" type="xs:string"/>
This is a common practice in WXS: using a separate xs:element declaration for each element, even if it is of simple type. But this usage throws off generateDS.py, and in order to have name treated as a simple data member of the binding class you have to rewrite the schema:
<xs:element name="label">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" ref="quote"/>
      <xs:element ref="name" type="xs:string"/>
      <xs:element ref="address"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
According to WXS rules, this is strictly equivalent to the original form. Listing 4 is a new version of the WXS to satisfy this preference of generateDS.py.
Listing 4: Adjusted WXS for data binding generation by generateDS.py
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="emph" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="street" type="xs:string"/>
        <xs:element ref="city" type="xs:string"/>
        <xs:element ref="state" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Listing 5 is a snippet from the new data binding. Notice the update to the handling of the name element.
Listing 5:
class label:
    subclass = None
    def __init__(self, quote=None, name='', address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        showIndent(outfile, level)
        outfile.write('<name>%s</name>\n' % quote_xml(self.getName()))
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                name = ''
                for text_ in child.childNodes:
                    name += text_.nodeValue
                self.name = name
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

class quote:
    subclass = None
    def __init__(self, emph=''):
        self.emph = emph
    def factory(*args):
        if quote.subclass:
            return apply(quote.subclass, args)
        else:
            return apply(quote, args)
    factory = staticmethod(factory)
    def getEmph(self): return self.emph
    def setEmph(self, emph): self.emph = emph
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<quote>\n')
        level += 1
        showIndent(outfile, level)
        outfile.write('<emph>%s</emph>\n' % quote_xml(self.getEmph()))
        level -= 1
        showIndent(outfile, level)
        outfile.write('</quote>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'emph':
                emph = ''
                for text_ in child.childNodes:
                    emph += text_.nodeValue
                self.emph = emph
# end class quote
Now this is a pretty straightforward data binding result that, for example, wouldn't surprise a Java developer. Each complex type in the schema becomes a class, and simple types become simple properties with get/set methods (like JavaBeans). This might feel a bit unpythonic until you reflect that these binding classes are designed to be subclassed (note the factory convenience functions), and the use of accessor functions allows classic method polymorphism. Of course, one could still argue that since the binding already uses Python 2.2, it could have taken advantage of the more Pythonic approaches to such polymorphism available with new-style classes in Python 2.2. (For more on new-style classes, see "Unifying types and classes in Python 2.2" by Guido van Rossum and "What's New in Python 2.2" by A.M. Kuchling.)
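One of the more Pythonic approaches alluded to here is the property built-in that arrived with new-style classes. The sketch below is an assumption about what such a binding class could look like, not something generateDS.py produces; the Label and ShoutingLabel names are hypothetical, and the code is written for Python 3, where every class is new-style. Callers use plain attribute syntax, yet a subclass can still change how values are stored.

```python
class Label:
    """Illustrative binding class exposing name as a property."""

    def __init__(self, name=""):
        self.name = name  # assignment goes through the property setter

    def _get_name(self):
        return self._name

    def _set_name(self, value):
        self._name = value

    # property() is the form that has been available since Python 2.2.
    name = property(_get_name, _set_name)


class ShoutingLabel(Label):
    """Subclass that customizes storage while keeping attribute syntax."""

    def _set_name(self, value):
        self._name = value.upper()

    # Rebind the property so that it picks up the overriding setter.
    name = property(Label._get_name, _set_name)


plain = Label("Ezra Pound")
loud = ShoutingLabel("Ezra Pound")
plain.name = "Ezra Loomis Pound"
print(plain.name)
print(loud.name)
```

The trade-off is that a property is bound to specific functions when the class body runs, so a subclass has to rebind it, whereas overriding a plain setName method needs no extra step.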
Look at the quote.build method. Again, careful examination will show that generateDS.py does not seem to handle mixed content. In particular it discards text that is not within the emph element: "is its own season...".
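To see why an element-only loop loses this text, it helps to look at how a DOM represents mixed content. The following Python 3 sketch uses xml.dom.minidom on a one-line version of the quote element; the ordered list of (kind, value) pairs is just one possible way a binding could preserve mixed content, and is an assumption rather than anything generateDS.py offers.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

doc = minidom.parseString(
    "<quote><emph>Midwinter Spring</emph> is its own season\u2026</quote>")
quote = doc.documentElement

# What a build loop that only tests for ELEMENT_NODE children gets to see.
element_children = [c.nodeName for c in quote.childNodes
                    if c.nodeType == Node.ELEMENT_NODE]

# The text nodes sitting beside emph, which such a loop silently skips.
loose_text = [c.nodeValue for c in quote.childNodes
              if c.nodeType == Node.TEXT_NODE]

# One way to keep everything: record children in document order.
content = []
for child in quote.childNodes:
    if child.nodeType == Node.TEXT_NODE:
        content.append(("text", child.nodeValue))
    elif child.nodeType == Node.ELEMENT_NODE:
        inner = "".join(t.nodeValue for t in child.childNodes
                        if t.nodeType == Node.TEXT_NODE)
        content.append((child.nodeName, inner))

print(element_children)
print(loose_text)
print(content)
```

An ordered representation like this is less convenient than named attributes, which is part of why document-flavored XML is harder to bind than data-flavored XML.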
Listing 6 demonstrates usage of the data binding, a pretty straightforward matter.
Listing 6:
import sys
import labels

rootObject = labels.parse('listing1.xml')
print dir(rootObject)
eliot = rootObject.label[0]
name = eliot.name
street = eliot.address.street
print street
emphasized = eliot.quote.emph
print emphasized
pound = rootObject.label[1]
#Modify the XML through the data binding
pound.name = 'Ezra Loomis Pound'
#Marshall back a portion of the XML, as modified
pound.export(sys.stdout, 0)
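The generated classes route instantiation through a factory function that consults a subclass attribute, which is the hook that lets application code substitute its own classes during unmarshalling. Here is a minimal Python 3 sketch of that mechanism, with a cut-down label class written by hand for illustration; the MailingLabel subclass and its salutation method are hypothetical.

```python
class label:
    """Cut-down imitation of a generated binding class (illustrative)."""

    subclass = None

    def __init__(self, name=""):
        self.name = name

    @staticmethod
    def factory(*args):
        # Instantiate the registered subclass when there is one.
        cls = label.subclass or label
        return cls(*args)


class MailingLabel(label):
    """Application-specific behavior layered on the binding class."""

    def salutation(self):
        return "Dear %s," % self.name


# Register the subclass; code that calls label.factory() now builds it.
label.subclass = MailingLabel
obj = label.factory("Ezra Pound")
print(type(obj).__name__)
print(obj.salutation())
```

Because the build methods in the generated code create child objects through factory, registering a subclass this way affects every object created while parsing, without editing the generated file.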
I also wanted to check the handling of non-ASCII characters, but the ellipsis character I'd placed in the quote element was discarded by the binding generation. I moved it into the emph element and this time when I tried parsing the instance I ended up with the infamous "UnicodeError: ASCII encoding error: ordinal not in range(128)". Examining the binding code, I think this might be more a problem with the marshalling and unmarshalling than with the binding implementation, so perhaps it would be easy to fix.
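This class of error comes from pushing non-ASCII text through an ASCII encoder. The Python 3 sketch below reproduces the analogous failure explicitly and shows two common ways around it when writing XML: encode the output as UTF-8, or stay within ASCII by emitting numeric character references. It illustrates the general technique only and is not a patch for generateDS.py; the variable names are made up for the example.

```python
text = "Midwinter Spring is its own season\u2026"

# Encoding straight to ASCII fails on the ellipsis character.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

element = "<emph>%s</emph>" % text

# Option 1: choose an encoding that can represent the character.
utf8_bytes = element.encode("utf-8")

# Option 2: keep the output ASCII by using character references.
ascii_bytes = element.encode("ascii", "xmlcharrefreplace")

print(ascii_failed)
print(utf8_bytes)
print(ascii_bytes)
```

Whichever option is used, the encoding named in the XML declaration of the output has to agree with the bytes actually written.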
Just the beginning
generateDS.py is a very nifty program and offers many of the hallmarks of a data binding. I did point out a few shortcomings, not to knock the project, but because I think that rich bindings may be an area where Python can leapfrog the field in XML processing because of its dynamic qualities. In this column I shall continue to explore the issue, covering the remaining data binding projects and offering discussion on future directions.
Meanwhile, here's the usual brief on activity in the Python-XML landscape.
Dave Kuhlman, the developer behind generateDS.py, announced code for Python support for the REST (XML-over-HTTP) mode of Amazon Web services. The package provides Python code for parsing and processing the Amazon Web Services XML documents. It also includes code for generating WXS from an XML instance document (not unlike the concept in Eric Van der Vlist's Examplotron). Kuhlman has been very busy working with XML, REST, and SWIG (a tool for binding Python and other languages to C code). Another nice resource is Kuhlman's unofficial SWIG-based Python binding of the libxml tree API (see my last article for a discussion of the official Python binding).
Fredrik Lundh has been busy working on ElementTree, which I covered recently. He announced 1.1 and 1.2 alpha 1. Changes include a new XML literal factory, a self-contained ElementTree module, use of ASCII as the default encoding, optimizations, and limited XPath support.
John Merrells pointed me to the Python API for Berkeley DB XML, part of Berkeley DB. In Merrells' words: "The Python API is basically the same as the C++ and Java APIs, in that they expose the functionality of the product."
See this post and thread for discussion of Tim Bray's comment: "The Python people also piped to say 'everything's just fine here' but then they always do, I really must learn that language". I suspect that Tim Bray might have been referring to comments by me, Paul Prescod and others on the XML-DEV mailing list. I think our point is that Python's dynamic nature makes the horrors of DOM and SAX easier to bear, and not that Python has anything radical to leapfrog them. I'm rather hoping this series on data bindings helps produce such a leap, though.