EaseXML: A Python Data-Binding Tool

July 27, 2005

EaseXML is an XML data-binding tool for Python, available under the Python Software Foundation License. The package used to be called "XMLObject," but that generic name led to the situation I mentioned in Location, Location, Location.

How many "XMLObjects" does it take to screw in a lightbulb? Turns out that even I, who make it my business to pay attention to such things, came up short in my count. Philippe Normand, author of one of the "XMLObjects," lamented the name clash after the latest entrant emerged. Srijit Kumar Bhadra, an innocent bystander (and author of the Python/.NET/XML code bake off I mentioned last month) also complained. The trigger for all this was Greg Luterman's announcement of XMLObject, "a Python module that simplifies the handling of XML streams by converting the data into objects." Of course, anyone who chooses as generic a name as "XMLObject" is just asking for name clashes.

Philippe Normand responded in a comment on that article that he would be changing the name of his project. In this article, I'll look at EaseXML 0.2, which I downloaded for installation on Python 2.4 (Python 2.2 is the minimum version). The installation is standard distutils, a simple matter of python setup.py install.

EaseXML at First Glance

In this column I have covered Python data bindings that need no more information than the source XML, such as Amara Bindery and Gnosis Objectify. I also introduced one example, generateDS.py, of a data binding that requires an XML schema file to drive the binding. EaseXML is similar to this latter approach, except that the schema format it uses is just a set of Python classes defined with a set of conventions, with each XML element generally corresponding to a distinct Python class. In this way it is very similar to XIST, although it's less comprehensive.

I'll start by showing the EaseXML binding schema used to process Listing 1, my usual address label example.

Listing 1. Sample XML file (labels.xml) containing address labels

<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label id="tse" added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
    <quote>
      <emph>Midwinter Spring</emph> is its own season&#8230;
    </quote>
  </label>
  <label id="ep" added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
    <quote>
      What thou lovest well remains, the rest is dross&#8230;
    </quote>
  </label>
  <!-- Throw in 10,000 more records just like this -->
  <label id="lh" added="2004-11-01">
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <state>NY</state>
    </address>
  </label>
</labels>

Listing 2 (labelsease.py) uses the EaseXML conventions to set up the data binding.

Listing 2 (labelsease.py). EaseXML class definitions for address labels

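from EaseXML import *

# Root element: a list of label child elements
class labels(XMLObject):
    labels = ListNode(u'label')

class label(XMLObject):
    # Attribute whose XML name is taken from the Python name
    id = StringAttribute()
    # Attribute whose XML name is given explicitly
    added = StringAttribute(u'added')
    # Order of the child elements, by XML name
    _nodesOrder = [u'name', u'address', u'quote']
    name = TextNode()
    address = ItemNode(u'address')
    quote = ItemNode(u'quote', optional=True)

class address(XMLObject):
    _nodesOrder = [u'street', u'city', u'state']
    street = TextNode()
    city = TextNode()
    state = TextNode()

class quote(XMLObject):
    # Explicit XML element name (same as the default here)
    _name = u'quote'
    # Mixed content: text (#PCDATA) and emph elements
    content = ChoiceNode(['#PCDATA', 'emph'], optional=True,
                         main=True, noLimit=True)
    emph = TextNode(optional=True)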

The most important class, XMLObject, still bears the name of the original package. You have to subclass it to create your own specialized classes representing elements. The top-level element labels is defined using a class of the same name. It expresses that its contents are a list of child elements (EaseXML.ListNode) named label. Each of these has an id and an added attribute.

Data binding tools have to deal with the situation where XML's naming conventions don't match those of the host language. In EaseXML, the names of XML identifiers are usually assumed from the names of the matching Python object references, but the definition of the added attribute shows how you can override that by specifying the actual XML identifier as the first argument. This argument is sometimes optional, as in EaseXML.StringAttribute, but sometimes it's mandatory, as in EaseXML.ListNode and EaseXML.ItemNode. You specify the order of child nodes using the _nodesOrder list, which holds XML identifier names. EaseXML.TextNode defines a simple node with text content only. Such nodes do not require a separate Python class.

The definition for the quote element illustrates a few things. It uses the _name property to override the XML element identifier, which is derived from the class name by default (in this case, the override happens to be the same as the default). quote is mixed content in one of its occurrences in the XML example, and simple text in another. You define mixed content by using an EaseXML.ChoiceNode, with #PCDATA as one of the entries. As in XML DTDs, this is a special identifier for text content. optional=True is specified for the mixed content construct as a whole, indicating that the element can be empty, and for the emph element, indicating that text alone can occur without any elements mixed in.
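
To make the naming convention concrete, here is a minimal sketch of the general idea (class attributes that declare how XML maps onto Python names) using only the standard library's ElementTree. It is not EaseXML's implementation or API: the Binding, Attr, and Text classes and the from_element method are hypothetical names invented for this illustration, and the code is written for a current Python 3 interpreter rather than the Python 2.4 used elsewhere in this article.

```python
import xml.etree.ElementTree as ET


class Attr(object):
    """Hypothetical field: the value of an XML attribute."""

    def __init__(self, xml_name=None):
        self.xml_name = xml_name  # None means "use the Python name"

    def extract(self, elem, python_name):
        return elem.get(self.xml_name or python_name)


class Text(object):
    """Hypothetical field: the text of a child element."""

    def __init__(self, xml_name=None):
        self.xml_name = xml_name

    def extract(self, elem, python_name):
        return elem.findtext(self.xml_name or python_name)


class Binding(object):
    """Hypothetical base class that fills an instance from an element."""

    @classmethod
    def from_element(cls, elem):
        obj = cls()
        for python_name, field in vars(cls).items():
            if isinstance(field, (Attr, Text)):
                setattr(obj, python_name, field.extract(elem, python_name))
        return obj


class Label(Binding):
    id = Attr()                  # XML name defaults to "id"
    date_added = Attr(u'added')  # XML name overridden explicitly
    name = Text()                # text of the <name> child


SAMPLE = (u'<label id="tse" added="2003-06-20">'
          u'<name>Thomas Eliot</name></label>')
label = Label.from_element(ET.fromstring(SAMPLE))
print(u'%s %s %s' % (label.id, label.date_added, label.name))
```

The date_added field shows the override case: the Python attribute name differs from the XML attribute name, so the XML identifier is passed as the first argument, much as Listing 2 does for added.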

Putting the Binding to Work

After you define the binding classes, you can use them to parse in XML. You can also use them to generate XML, but I don't cover that in this article. The following interactive session demonstrates reading XML with an EaseXML data binding.


$ python -i labelsease.py
>>> XML = open('labels.xml', 'r').read()
>>> doc = labels.fromXml(XML)

As you can see, I load Listing 2 upon starting the Python interpreter. doc is a data structure based on instances of those classes with the data from the XML document.


>>> #Print the ids of all the labels
>>> for label in doc.labels:
...     print label.id
...
tse
ep
lh
>>> #Print the first quote element's contents
>>> doc.labels[0].quote.emph
u'Midwinter Spring'
>>> doc.labels[0].quote.content
[u'is its own season\u2026']

I ran into all sorts of quirks when poking introspectively at the resulting data binding. For example, I found a phantom processing instruction among the child nodes of the quote element you see in the last snippet. The Unicode support seems to be patchy, and I was unable to reserialize the quote element containing the ellipsis character … (I checked the toxml method for encoding arguments but didn't find any.) The API itself is a bit strange and hard to get your head around. I noticed that the forEach method is the recommended way for walking EaseXML objects. Keep in mind that it requires specialized callbacks to work.
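
I did not establish why reserialization failed. One possibility, offered only as an assumption, is that the ellipsis cannot be expressed directly in the document's declared iso-8859-1 encoding. The following sketch uses plain Python strings, not EaseXML, to show that limitation and two standard ways around it.

```python
ELLIPSIS = u'\u2026'  # HORIZONTAL ELLIPSIS, written as &#8230; in Listing 1
text = u'is its own season' + ELLIPSIS

# ISO-8859-1 has no code for U+2026, so a strict encode fails.
try:
    text.encode('iso-8859-1')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The xmlcharrefreplace error handler substitutes a numeric character
# reference for anything the target encoding cannot represent.
escaped = text.encode('iso-8859-1', 'xmlcharrefreplace')

# UTF-8 can represent the character directly.
utf8 = text.encode('utf-8')

print(strict_ok)
print(escaped)
print(utf8)
```

A serializer that either escapes unrepresentable characters or lets the caller choose the output encoding would avoid this class of failure.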

I decided to write about EaseXML before I realized to what extent it's a young project. It needs quite a bit of work. Besides the quirks I mentioned above, EaseXML lacks proper namespace support, and I think the binding schema API could do with some close analysis. Fortunately, the version control logs seem to show a reasonable rate of development. I think it's worth keeping an eye on EaseXML because it does bring some innovative touches to XML processing in Python, but I would suggest waiting for another couple of releases before using it in production.

More on Unicode: Character Information

In the last two articles, Unicode Secrets and More Unicode Secrets, I discussed Python's Unicode facilities, from the point of view of XML processing. There is one more useful part of Python's Unicode libraries that I want to cover.

There are on the order of a hundred thousand characters in Unicode, and the number grows with each version. There is also a complex internal structure of characters; they are classified as alphabetic, digits, control codes, combining characters, and more, and they have varying collation (sorting), directionality, etc. It can be quite overwhelming, and you can imagine why when you realize that Unicode aims to provide computer representation for just about every writing system on the planet. Developers need all the tools they can get to deal with all this rich variety. A useful but not all that well-known resource is Python's built-in Unicode database, in the unicodedata module. It is a Python API for the character database provided by the Unicode Consortium, the definitive catalog of all the characters in Unicode, along with standard properties for each.

Every character has a name, and you can learn what it is with the name function.


>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'\u1000')
'MYANMAR LETTER KA'
>>> unicodedata.name(u'\u00B0')
'DEGREE SIGN'
>>>

Notice that the names are returned as strings, not Unicode objects. All Unicode character names use what you can informally call the ASCII subset. You can basically reverse this operation, getting a Unicode character by name, using the lookup function.


>>> unicodedata.lookup('DEGREE SIGN')
u'\xb0'
>>> unicodedata.lookup('LATIN SMALL LETTER A')
u'a'
>>>
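
The two functions do not cover every input, so it helps to know how they fail. The short script below, written to run on a current Python 3 interpreter, checks the round trip between name and lookup, and shows the errors raised for a character without a name and for a name that does not exist. The variable names are mine, chosen for illustration.

```python
import unicodedata

# Round trip: character -> name -> character
degree = u'\u00b0'
round_trip_ok = unicodedata.lookup(unicodedata.name(degree)) == degree

# Control codes such as tab have no name, so name() raises ValueError...
try:
    unicodedata.name(u'\t')
    tab_has_name = True
except ValueError:
    tab_has_name = False

# ...unless you pass a default value as the second argument.
tab_label = unicodedata.name(u'\t', 'UNNAMED')

# lookup() raises KeyError for a name that is not in the database.
try:
    unicodedata.lookup('NO SUCH CHARACTER NAME')
    bogus_found = True
except KeyError:
    bogus_found = False

print(round_trip_ok)
print(tab_has_name)
print(tab_label)
print(bogus_found)
```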

You can really put this database to work giving your programs super duper powers of globalization, head and shoulders above the rest. For example, did you know that the characters "0" through "9" are not the only form of digits used in writing? Even though these European digit characters derive from historical Arabic number representations, modern Arabic scripts use a different set of characters sometimes called "Indic numerals" (Unicode names them Arabic-Indic digits). (Although these are distinct again from the digits used in modern-day scripts from India. Is your head spinning yet?) Unicode assigns these digits the appropriate decimal values, and you can effortlessly derive the decimal value of any digit regardless of script using the decimal function.


>>> unicodedata.decimal(u'0')
0
>>> unicodedata.decimal(u'\u0660')
0
>>> unicodedata.decimal(u'1')
1
>>> unicodedata.decimal(u'\u0661')
1
>>> #If you pass an invalid digit, it lets you know
>>> unicodedata.decimal(u'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>>
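
As a small application of decimal, the following sketch converts a whole string of decimal digits from any script into an integer. The to_int helper is a hypothetical name used for illustration, and the code is written for a current Python 3 interpreter.

```python
import unicodedata


def to_int(digits):
    """Convert a string of decimal digits, in any script, to an integer.

    Raises ValueError (from unicodedata.decimal) if any character is not
    a decimal digit. An empty string yields 0.
    """
    value = 0
    for ch in digits:
        value = value * 10 + unicodedata.decimal(ch)
    return value


european = to_int(u'2005')                        # European digits
arabic_indic = to_int(u'\u0662\u0660\u0660\u0665')  # Arabic-Indic digits
devanagari = to_int(u'\u0968\u0966\u0966\u096b')    # Devanagari digits

print(european)
print(arabic_indic)
print(devanagari)
```

All three calls describe the same number, even though the strings share no characters.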

The digit and numeric functions are similar, but there are some differences, and you should refer to the Unicode character database for details (one obvious difference from the Python point of view is that numeric returns floating point numbers).

Unicode organizes characters into categories, such as "Letter, Lowercase" (abbreviation "Ll"), "Symbol, Currency" (abbreviation "Sc"), "Punctuation, Connector" (abbreviation "Pc"), "Separator, Space" (abbreviation "Zs"), etc. (Designations such as "Right-to-Left Arabic," abbreviation "AL," belong to a separate property, the bidirectional class, which the bidirectional function reports.) These categories are important for many character-processing cases. As an example, you might want to be specific about what you mean by "white space" when writing Unicode-aware applications. There are more than just the familiar space, newline, carriage return and tab from ASCII, or nonbreaking space from HTML. Interestingly, some of the characters we think of as spaces, such as tab, are categorized as control codes in Unicode, and XML's own treatment of characters often doesn't fall along neat lines of Unicode categories. You can find the category of any character using the category function.


>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'\u00B0') #DEGREE SIGN
'So'
>>> unicodedata.category(u'\t')
'Cc'
>>> unicodedata.category(u'$')
'Sc'
>>>
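
To illustrate the white space point, this sketch checks a handful of space-like characters against the "Zs" category. The is_space_separator helper and the list of candidates are my own choices for illustration, and the code is written for a current Python 3 interpreter.

```python
import unicodedata

candidates = [
    u' ',       # SPACE
    u'\u00a0',  # NO-BREAK SPACE
    u'\u2003',  # EM SPACE
    u'\u3000',  # IDEOGRAPHIC SPACE
    u'\t',      # tab
    u'\n',      # newline
    u'\r',      # carriage return
]


def is_space_separator(ch):
    """True if the character is in the "Separator, Space" category."""
    return unicodedata.category(ch) == 'Zs'


for ch in candidates:
    print('%r %s' % (ch, unicodedata.category(ch)))
```

The first four characters are reported as "Zs" and the last three as "Cc", so a test based on the "Zs" category alone would miss tab, newline, and carriage return, all of which XML treats as white space.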

There are other functions in unicodedata, but I'll leave them to the reader's attention.
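
As a starting point for that exploration, here is a brief sketch of a few of them: digit and numeric, the optional default argument they share with decimal, bidirectional, and normalize. The sample characters are my own choices, and the code is written for a current Python 3 interpreter.

```python
import unicodedata

superscript_two = u'\u00b2'  # SUPERSCRIPT TWO
one_half = u'\u00bd'         # VULGAR FRACTION ONE HALF

# SUPERSCRIPT TWO has a digit value but no decimal value.
sup_digit = unicodedata.digit(superscript_two)
sup_decimal = unicodedata.decimal(superscript_two, None)  # default instead of ValueError

# ONE HALF has a numeric value only, returned as a float.
half_numeric = unicodedata.numeric(one_half)
half_digit = unicodedata.digit(one_half, None)

# bidirectional() reports the bidirectional class as a short code.
latin_bidi = unicodedata.bidirectional(u'a')        # LATIN SMALL LETTER A
arabic_bidi = unicodedata.bidirectional(u'\u0627')  # ARABIC LETTER ALEF
hebrew_bidi = unicodedata.bidirectional(u'\u05d0')  # HEBREW LETTER ALEF

# normalize() converts between normalization forms; here a base letter
# plus combining acute accent is composed into a single character.
composed = unicodedata.normalize('NFC', u'e\u0301')

print(sup_digit)
print(sup_decimal)
print(half_numeric)
print(half_digit)
print(latin_bidi)
print(arabic_bidi)
print(hebrew_bidi)
print(composed == u'\u00e9')
```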

From the Community

I mentioned the CJKV writing systems and encodings of the Pacific Rim in my last article. There are many non-Unicode character encodings in heavy use in these regions. There have been several third-party packages supporting these encodings, and Python 2.4 incorporates codecs based on a patch by Hye-Shik Chang. These support the following encodings:

Chinese: gb2312, gbk, gb18030, big5hkscs, hz, big5, cp950
Japanese: cp932, euc-jis-2004, euc-jp, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004, shift-jis, shift-jisx0213, shift-jis-2004
Korean: cp949, euc-kr, johab, iso-2022-kr

Python 2.4 also adds a few other non-CJK encodings, and I recommend that everyone who is serious about internationalization upgrade to this version as soon as possible.
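
The new codecs are used like any other. The sketch below encodes a short sample string for each language with a few of the encodings listed above and confirms that decoding restores the original text. The sample strings and the round_trips helper are my own, chosen for illustration, and the code is written for a current Python 3 interpreter.

```python
# Sample text for each language, paired with some of the listed encodings.
samples = [
    (u'\u4e2d\u6587', ['gb2312', 'gbk', 'gb18030', 'big5']),          # Chinese
    (u'\u65e5\u672c\u8a9e', ['euc-jp', 'shift-jis', 'iso-2022-jp']),  # Japanese
    (u'\ud55c\uae00', ['euc-kr', 'cp949', 'johab']),                  # Korean
]


def round_trips(text, encoding):
    """True if encoding and then decoding returns the original text."""
    return text.encode(encoding).decode(encoding) == text


for text, encodings in samples:
    for encoding in encodings:
        data = text.encode(encoding)
        print('%-12s %d bytes, round trip ok: %s'
              % (encoding, len(data), round_trips(text, encoding)))
```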

Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, a library for dealing with the encodings of files obtained over HTTP, including XML files. It does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
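
For readers unfamiliar with the idea, here is a deliberately simplified sketch of sniffing an encoding from the start of an XML document. It is not the encutils API, nor John Cowan's full algorithm, nor either cookbook recipe: it checks only for a UTF-8 or UTF-16 byte order mark and then for an encoding declaration in an ASCII-compatible encoding, and it ignores cases such as UTF-32, UTF-16 without a byte order mark, and EBCDIC. The function name sniff_xml_encoding is hypothetical, and the code is written for a current Python 3 interpreter.

```python
import codecs
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute.
_DECLARATION = re.compile(
    br'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data):
    """Guess the encoding of an XML document given as bytes (simplified)."""
    # 1. A byte order mark takes precedence.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    # 2. Otherwise look for an encoding declaration.
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode('ascii').lower()
    # 3. With neither present, XML defaults to UTF-8.
    return 'utf-8'


declared = sniff_xml_encoding(
    b'<?xml version="1.0" encoding="iso-8859-1"?><labels/>')
with_bom = sniff_xml_encoding(codecs.BOM_UTF8 + b'<labels/>')
bare = sniff_xml_encoding(b'<labels/>')

print(declared)
print(with_bom)
print(bare)
```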

I discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer Python support. These tools include an XML parser based on PyExpat and an XPath implementation based on elementtree.