EaseXML: A Python Data-Binding Tool
July 27, 2005
EaseXML is an XML data-binding tool for Python, available under the Python Software Foundation License. The package used to be called "XMLObject," but that generic name led to the situation I mentioned in Location, Location, Location.
How many "XMLObjects" does it take to screw in a lightbulb? Turns out that even I, who make it my business to pay attention to such things, came up short in my count. Philippe Normand, author of one of the "XMLObjects," lamented the name clash after the latest entrant emerged. Srijit Kumar Bhadra, an innocent bystander (and author of the Python/.NET/XML code bake off I mentioned last month) also complained. The trigger for all this was Greg Luterman's announcement of XMLObject, "a Python module that simplifies the handling of XML streams by converting the data into objects." Of course, anyone who chooses as generic a name as "XMLObject" is just asking for name clashes.
Philippe Normand responded in a comment on that article that he would be changing the name of his project. In this article, I'll look at EaseXML 0.2, which I downloaded for installation on Python 2.4 (Python 2.2 is the minimum version). The installation is standard distutils, a simple matter of python setup.py install.
EaseXML at First Glance
In this column I have covered Python data bindings that need no more information than the source XML, such as Amara Bindery and Gnosis Objectify. I also introduced one example, generateDS.py, of a data binding that requires an XML schema file to drive the binding. EaseXML is similar to this latter approach, except that the schema format it uses is just a set of Python classes defined with a set of conventions, with each XML element generally corresponding to a distinct Python class. In this way it is very similar to XIST, although it's less comprehensive.
I'll start by showing the EaseXML binding schema used to process Listing 1, my usual address label example.
Listing 1. Sample XML file (labels.xml) containing address labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label id="tse" added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
    <quote>
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
  </label>
  <label id="ep" added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
    <quote>
      What thou lovest well remains, the rest is dross…
    </quote>
  </label>
  <!-- Throw in 10,000 more records just like this -->
  <label id="lh" added="2004-11-01">
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <state>NY</state>
    </address>
  </label>
</labels>
Listing 2 (labelsease.py) uses the EaseXML conventions to set up the data binding.
Listing 2 (labelsease.py). EaseXML class definitions for address labels
from EaseXML import *

class labels(XMLObject):
    labels = ListNode(u'label')

class label(XMLObject):
    id = StringAttribute()
    added = StringAttribute(u'added')
    _nodesOrder = [u'name', u'address', u'quote']
    name = TextNode()
    address = ItemNode(u'address')
    quote = ItemNode(u'quote', optional=True)

class address(XMLObject):
    _nodesOrder = [u'street', u'city', u'state']
    street = TextNode()
    city = TextNode()
    state = TextNode()

class quote(XMLObject):
    _name = u'quote'
    content = ChoiceNode(['#PCDATA', 'emph'], optional=True, main=True, noLimit=True)
    emph = TextNode(optional=True)
The most important class, XMLObject, still bears the name of the original package. You have to subclass it to create your own specialized classes representing elements. The top-level element labels is defined using a class of the same name. It declares that its contents are a list of child elements (EaseXML.ListNode) named label. Each of these has an id and an added attribute. Data binding tools have to deal with the situation where XML's naming conventions don't match those of the host language. In EaseXML, the names of XML identifiers are usually taken from the names of the matching Python object references, but the definition of the added attribute shows how you can override that by specifying the actual XML identifier as the first argument. This argument is sometimes optional, as in EaseXML.StringAttribute, but sometimes mandatory, as in EaseXML.ListNode and EaseXML.ItemNode. You specify the order of child nodes using the _nodesOrder list, which contains XML identifier names. EaseXML.TextNode defines a simple node with text content only. Such nodes do not require a separate Python class.

The definition for the quote element illustrates a few things. It uses the _name property to override the XML element identifier, which is derived from the class name by default (in this case, the override happens to be the same as the default). quote is simple text in one of its occurrences in the XML example, and mixed content in another. You define mixed content by using an EaseXML.ChoiceNode, with #PCDATA as one of the entries. As in XML DTDs, this is a special identifier for text content. optional=True is specified for the mixed content construct as a whole, indicating that the element can be empty, and for the emph element, indicating that text alone can occur without any elements mixed in.
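To make the naming convention concrete, here is a minimal sketch of the general idea, written for a current Python 3 interpreter. It is not EaseXML code and does not reflect how EaseXML is implemented: SketchTextNode, Bound, and Address are hypothetical names of my own, and the parsing is done by the standard library's xml.etree.ElementTree. The point is only to show how a declaration can default its XML identifier to the Python attribute name while still allowing an explicit override.

```python
import xml.etree.ElementTree as ET


class SketchTextNode:
    """Hypothetical stand-in: exposes a child element's text as an attribute."""

    def __init__(self, xml_name=None):
        # An explicit XML identifier, if the Python name should differ.
        self.xml_name = xml_name

    def __set_name__(self, owner, attr_name):
        # Default the XML identifier to the Python attribute name.
        if self.xml_name is None:
            self.xml_name = attr_name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        child = instance.element.find(self.xml_name)
        return None if child is None else child.text


class Bound:
    """Hypothetical base class that simply wraps a parsed element."""

    def __init__(self, element):
        self.element = element


class Address(Bound):
    street = SketchTextNode()          # XML name assumed from 'street'
    city = SketchTextNode()            # XML name assumed from 'city'
    region = SketchTextNode('state')   # Python name overridden by XML name


addr = Address(ET.fromstring(
    "<address><street>3 Prufrock Lane</street>"
    "<city>Stamford</city><state>CT</state></address>"))
print(addr.street, addr.city, addr.region)
```

The override matters most when an XML name is not a legal or convenient Python identifier, which is the mismatch described above.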
Putting the Binding to Work
After you define the binding classes, you can use them to parse XML. You can also use them to generate XML, but I don't cover that in this article. The following interactive session demonstrates reading XML with an EaseXML data binding.
$ python -i labelsease.py
>>> XML = open('labels.xml', 'r').read()
>>> doc = labels.fromXml(XML)
As you can see, I load Listing 2 upon starting the Python interpreter. doc is a data structure built from instances of those classes and populated with the data from the XML document.
>>> #Print the ids of all the labels
>>> for label in doc.labels:
...     print label.id
...
tse
ep
lh
>>> #Print the first quote element's contents
>>> doc.labels[0].quote.emph
u'Midwinter Spring'
>>> doc.labels[0].quote.content
[u'is its own season\u2026']
I ran into all sorts of quirks when poking introspectively at the resulting data binding. For example, I found a phantom processing instruction among the child nodes of the quote element you see in the last snippet. The Unicode support seems to be patchy, and I was unable to reserialize the quote element containing the ellipsis character, … (I checked the toxml method for encoding arguments but didn't find any). The API itself is a bit strange and hard to get your head around. I noticed that the forEach method is the recommended way of walking EaseXML objects. Keep in mind that it requires specialized callbacks to work.
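As a general illustration of why an encoding argument matters for a character like the ellipsis, the sketch below shows what any serializer has to do when the target encoding cannot represent a character. It uses only Python's built-in string encoding, written for a current Python 3 interpreter. It says nothing about what EaseXML does internally, and I am not claiming this is the cause of the problem I hit.

```python
# U+2026 HORIZONTAL ELLIPSIS has no representation in ISO-8859-1,
# the encoding declared in the sample document.
text = 'is its own season\u2026'

try:
    text.encode('iso-8859-1')
    strict_ok = True
except UnicodeEncodeError:
    # A strict encode fails outright.
    strict_ok = False

# The 'xmlcharrefreplace' error handler substitutes a numeric
# character reference, which an XML parser reads back as U+2026.
escaped = text.encode('iso-8859-1', 'xmlcharrefreplace')

print(strict_ok)
print(escaped)
```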
I decided to write about EaseXML before I realized to what extent it's a young project. It needs quite a bit of work. Besides the quirks I mentioned above, EaseXML lacks proper namespaces support, and I think the binding schema API could do with some close analysis. Fortunately, the version control logs seem to show a reasonable rate of development. I think it's worth keeping an eye on EaseXML because it does bring some innovative touches to XML processing in Python, but I would suggest waiting for another couple of releases before using it in production.
More on Unicode: Character Information
In the last two articles, Unicode Secrets and More Unicode Secrets, I discussed Python's Unicode facilities, from the point of view of XML processing. There is one more useful part of Python's Unicode libraries that I want to cover.
There are on the order of a hundred thousand characters in Unicode, and the number grows with each version. There is also a complex internal structure of characters; they are classified as alphabetic, digits, control codes, combining characters, and more, and they have varying collation (sorting), directionality, etc. It can be quite overwhelming, and you can imagine why when you realize that Unicode aims to provide computer representation for just about every writing system on the planet. Developers need all the tools they can get to deal with all this rich variety. A useful but not all that well-known resource is Python's built-in Unicode database, in the unicodedata module. It is a Python API for the character database provided by the Unicode Consortium, the definitive catalog of all the characters in Unicode, along with standard properties for each.
Every character has a name, and you can learn what it is with the name function.
>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'\u1000')
'MYANMAR LETTER KA'
>>> unicodedata.name(u'\u00B0')
'DEGREE SIGN'
>>>
Notice that the names are returned as strings, not Unicode objects. All Unicode character names use what you can informally call the ASCII subset. You can basically reverse this operation, getting a Unicode character by name, using the lookup function.
>>> unicodedata.lookup('DEGREE SIGN')
u'\xb0'
>>> unicodedata.lookup('LATIN SMALL LETTER A')
u'a'
>>>
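Both functions can fail, so here is a small sketch of how you might wrap them, written for a current Python 3 interpreter (where every string is Unicode and the u prefix is unnecessary). The helper names describe and from_name are my own, not part of unicodedata. name raises ValueError for characters with no name, such as control codes, unless you pass a default as its second argument, and lookup raises KeyError for an unknown name.

```python
import unicodedata


def describe(ch):
    """Return the character's Unicode name, or a placeholder if it has none."""
    return unicodedata.name(ch, '<unnamed>')


def from_name(name):
    """Return the character with this Unicode name, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


print(describe('\u00b0'))
print(describe('\t'))  # control codes have no name
print(from_name('DEGREE SIGN') == '\u00b0')
print(from_name('NO SUCH CHARACTER NAME'))
```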
You can really put this database to work, giving your programs super-duper powers of globalization, head and shoulders above the rest. For example, did you know that the characters "0" through "9" are not the only form of digits used in writing? Even though these European digit characters derive from historical Arabic number representations, modern Arabic scripts use a different set of characters sometimes called "Indic numerals." (Although these are distinct again from the digits used in modern-day scripts from India. Is your head spinning yet?) Unicode assigns these digits the appropriate decimal values, and you can effortlessly derive the decimal value of any digit, regardless of script, using the decimal function.
>>> unicodedata.decimal(u'0')
0
>>> unicodedata.decimal(u'\u0660')
0
>>> unicodedata.decimal(u'1')
1
>>> unicodedata.decimal(u'\u0661')
1
>>> #If you pass an invalid digit, it lets you know
>>> unicodedata.decimal(u'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>>
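Building on the decimal function, here is a minimal sketch of a script-independent number parser, written for a current Python 3 interpreter. The function name digits_to_int is my own. It accumulates the decimal value of each character, so it handles European, Arabic-script, and Devanagari digits alike, and it lets the ValueError from decimal propagate when it meets a character that is not a decimal digit.

```python
import unicodedata


def digits_to_int(text):
    """Convert a string of decimal digits from any script to an int.

    Raises ValueError (from unicodedata.decimal) on a non-digit character.
    An empty string yields 0 in this simple sketch.
    """
    value = 0
    for ch in text:
        value = value * 10 + unicodedata.decimal(ch)
    return value


print(digits_to_int('2005'))                      # European digits
print(digits_to_int('\u0662\u0660\u0660\u0665'))  # Arabic-script digits
print(digits_to_int('\u0968\u0966\u0966\u096b'))  # Devanagari digits
```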
The digit and numeric functions are similar, but there are some differences, and you should refer to the Unicode character database for details (one obvious difference from the Python point of view is that numeric returns floating point numbers).

Unicode organizes characters into categories, such as "Letter, Lowercase" (abbreviation "Ll"), "Symbol, Currency" (abbreviation "Sc"), "Punctuation, Connector" (abbreviation "Pc"), "Separator, Space" (abbreviation "Zs"), etc. (Classes such as "Right-to-Left Arabic," abbreviation "AL," belong to a separate bidirectional property, which the bidirectional function reports.) These categories are important for many character-processing cases. As an example, you might want to be specific about what you mean by "white space" when writing Unicode-aware applications. There are more than just the familiar space, newline, carriage return and tab from ASCII, or nonbreaking space from HTML. Interestingly, some of the characters we think of as spaces, such as tab, are categorized as control codes in Unicode, and XML's own treatment of characters often doesn't fall along neat lines of Unicode categories. You can find the category of any character using the category function.
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'\u00B0') #DEGREE SIGN
'So'
>>> unicodedata.category(u'\t')
'Cc'
>>> unicodedata.category(u'$')
'Sc'
>>>
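To show how the white space distinction plays out in practice, here is a short sketch for a current Python 3 interpreter. The helper name is_space_separator is my own. It accepts only characters in the "Separator, Space" category, so it treats the ideographic space and the nonbreaking space as spaces but rejects tab and newline, which are control codes. Python's own str.isspace casts a wider net and accepts both groups.

```python
import unicodedata


def is_space_separator(ch):
    """True only for characters in the 'Separator, Space' (Zs) category."""
    return unicodedata.category(ch) == 'Zs'


# space, nonbreaking space, ideographic space, tab, newline
samples = [' ', '\u00a0', '\u3000', '\t', '\n']
for ch in samples:
    print('U+%04X' % ord(ch), unicodedata.category(ch),
          is_space_separator(ch), ch.isspace())
```

Which definition you want depends on the application, which is exactly why it pays to be specific.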
There are other functions in unicodedata, but I'll leave them for the reader to explore.
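As a starting point for that exploration, here is a small sketch, for a current Python 3 interpreter, that compares decimal, digit, and numeric on the same character and also tries the bidirectional function. The helper name numeric_views is my own. Passing None as the second argument makes each function return None instead of raising ValueError when the property does not apply.

```python
import unicodedata


def numeric_views(ch):
    """Return (decimal, digit, numeric) values for ch, None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))


print(numeric_views('7'))        # an ordinary decimal digit
print(numeric_views('\u00b2'))   # SUPERSCRIPT TWO: a digit, but not a decimal
print(numeric_views('\u00bd'))   # VULGAR FRACTION ONE HALF: numeric only

# Bidirectional class: 'L' for left-to-right, 'AL' for Arabic letters
print(unicodedata.bidirectional('a'))
print(unicodedata.bidirectional('\u0627'))  # ARABIC LETTER ALEF
```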
From the Community
I mentioned the CJKV writing systems and encodings of the Pacific Rim in my last article. There are many non-Unicode character encodings in heavy use in these regions. There have been several third-party packages supporting these encodings, and Python 2.4 incorporates codecs based on a patch by Hye-Shik Chang. These support the following encodings:
- Chinese: gb2312, gbk, gb18030, big5hkscs, hz, big5, cp950
- Japanese: cp932, euc-jis-2004, euc-jp, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004, shift-jis, shift-jisx0213, shift-jis-2004
- Korean: cp949, euc-kr, johab, iso-2022-kr
Python 2.4 also adds a few other non-CJK encodings, and I recommend that everyone who is serious about internationalization upgrade to this version as soon as possible.
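Here is a quick sketch of these codecs in use, written for a current Python 3 interpreter, where they remain part of the standard library. The sample strings are my own choices. Each one is encoded with a legacy codec and decoded again, and the byte counts show that these particular encodings use two bytes for each of the characters involved.

```python
# Sample text per codec; the strings are illustrative choices only.
samples = {
    'euc-jp': '\u65e5\u672c\u8a9e',   # "Japanese language"
    'gb2312': '\u4e2d\u6587',         # "Chinese"
    'euc-kr': '\ud55c\uad6d\uc5b4',   # "Korean language"
}

results = {}
for codec, text in samples.items():
    raw = text.encode(codec)        # str -> legacy-encoded bytes
    round_trip = raw.decode(codec)  # bytes -> str again
    results[codec] = (len(text), len(raw), round_trip == text)
    print(codec, results[codec])
```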
Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. He does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
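To give a flavor of what declaration sniffing involves, here is a deliberately simplified sketch for a current Python 3 interpreter. It is not John Cowan's algorithm, nor either of the cookbook recipes, and it is not part of encutils. The function name sniff_xml_encoding and its default argument are my own. It checks for a UTF-8 or UTF-16 byte order mark, then looks for an encoding pseudo-attribute in an ASCII-compatible XML declaration, and otherwise falls back to UTF-8. It ignores UTF-32, EBCDIC, and UTF-16 documents without a byte order mark, all of which a full implementation has to handle.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of an ASCII-compatible byte string.
_DECL = re.compile(
    br'<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data, default='utf-8'):
    """Guess the encoding of XML bytes (simplified; see caveats above)."""
    # 1. A byte order mark settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # 2. Otherwise trust the declaration, if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode('ascii').lower()
    # 3. With neither, XML says to assume UTF-8.
    return default


doc = b'<?xml version="1.0" encoding="iso-8859-1"?><labels/>'
print(sniff_xml_encoding(doc))
print(sniff_xml_encoding(b'<labels/>'))
```

The returned names are Python codec names, so the result can be passed straight to bytes.decode.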
I discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer Python support. These tools include an XML parser based on PyExpat and an XPath implementation based on elementtree.