Writing and Reading XML with XIST
March 16, 2005
XIST is a very interesting project I've been meaning to dig into for some time. If you've been following the news section at the end of each of these columns, you'll have noticed the steady work that Walter Dörwald, the project leader, has put into this toolkit. It started out as a framework for generating HTML and incidentally XML, but the XML facilities have steadily grown and matured, until it is now a sophisticated system for not only generating, but also processing, XML. As the legend on the project page says: "XIST is also a DOM parser (built on top of SAX2) with a very simple and Python-esque tree API. Every XML element type corresponds to a Python class and these Python classes provide a conversion method to transform the XML tree (e.g. into HTML). XIST can be considered 'object-oriented XSL'". XIST isn't one of those projects you hear loudly advocated and debated when Python/XML processing options come up, but it probably should be.
Installation
I'm using my own build of Python 2.4 on Fedora Core 3. I grabbed the latest XIST download (version 2.8). Turns out it requires a host of other packages as well. I installed the apparent minimum requirements: PyXML 0.8.4, ll-url 0.15 and ll-ansistyle 0.6. In all these cases the usual python setup.py install worked, and so it was for the ll-xist package itself. I installed everything in this particular order, and yet I immediately noticed something amiss:
    $ python
    Python 2.4 (#1, Dec 6 2004, 09:55:00)
    [GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ll
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    ImportError: No module named ll
    >>>
The ll module is an umbrella over ll.url, ll.ansistyle and ll.xist. I confirmed that there was indeed an "ll" directory in my Python "site-packages", but I noticed there was no "__init__.py" in it, which explains the problems finding the package. Looking back over the output from installing the various ll module components, I found some suspicious warnings:
    [ll-url-0.15]$ python setup.py install
    [SNIP]
    running build_py
    package init file '__init__.py' not found (or not a regular file)
    creating build
    creating build/lib.linux-i686-2.4
    creating build/lib.linux-i686-2.4/ll
    copying url.py -> build/lib.linux-i686-2.4/ll
    package init file '__init__.py' not found (or not a regular file)
    running build_ext
    [SNIP]
    [ll-ansistyle-0.6]$ python2.4 setup.py install
    [SNIP]
    running build_py
    package init file '__init__.py' not found (or not a regular file)
    creating build
    creating build/lib.linux-i686-2.4
    creating build/lib.linux-i686-2.4/ll
    copying ansistyle.py -> build/lib.linux-i686-2.4/ll
    package init file '__init__.py' not found (or not a regular file)
    running build_ext
    [SNIP]
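The by-hand check for a missing "__init__.py" is easy to script. The following is a minimal sketch using only the standard library; it is not part of XIST or its installer, and the function name missing_init_files is my own invention. It walks a package directory and reports every directory that holds Python modules but no "__init__.py". The demonstration at the bottom builds a throwaway directory that mimics the broken "ll" layout rather than touching a real "site-packages".

```python
import os
import shutil
import tempfile


def missing_init_files(package_dir):
    """Return directories under package_dir that contain .py modules
    but no __init__.py (hypothetical helper, not part of XIST)."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(package_dir):
        has_modules = any(name.endswith(".py") for name in filenames)
        if has_modules and "__init__.py" not in filenames:
            missing.append(dirpath)
    return missing


# Demonstration on a temporary directory that mimics the broken install
site_packages = tempfile.mkdtemp()
ll_dir = os.path.join(site_packages, "ll")
os.mkdir(ll_dir)
open(os.path.join(ll_dir, "url.py"), "w").close()

before = missing_init_files(ll_dir)   # the "ll" directory is reported

open(os.path.join(ll_dir, "__init__.py"), "w").close()
after = missing_init_files(ll_dir)    # nothing is reported any more

shutil.rmtree(site_packages)
```

Pointing the function at the real "ll" directory in "site-packages" would give the same kind of report for an actual installation.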
I checked the INSTALL document again to see if I might have missed a step, but it didn't seem that way. It seemed like either an installer bug, or perhaps a missing package that needed to be installed in order to get the umbrella ll module properly set up. Things seemed to work fine after I hacked in a "__init__.py" by hand, but soon it became apparent that something was still missing. I browsed the project Web site, and guessed that perhaps I also needed the ll-core 0.2.1 package. This turned out to do the trick. I think the entire sequence of XIST prerequisites should be better documented in the README. In order to save other readers any confusion, here is the order of prerequisite installation I recommend, including minimum versions:
Building and Writing XML
XIST started out as an HTML or XML generator, so generating XML isn't a bad place to start with XIST. But it turns out that XIST's output mechanism isn't really stream-like; it's more DOM-like (though much richer than W3C DOM). It's a matter of building up the tree you have in mind, and then serializing the tree. For this reason it makes sense to first examine the XML tree building API.
XIST has an interesting approach to XML trees. It's sort of a hybrid between a DOM and a data binding (see "XML Data Bindings in Python" for more on this distinction). But it's a different sort of hybrid than ElementTree. XIST's tree API is what I'd call "vocabulary-based", where each information item for each vocabulary is represented as a distinct Python class. You assemble instances of these classes to get the desired tree. Vocabularies in XIST are organized according to XML namespaces, such that ll.xist.ns.docbook contains Python classes representing all the elements defined in DocBook. Yes, that's almost 600 classes. Some other common information items also have specialized classes, for example ll.xist.ns.html.DocTypeXHTML10transitional, which represents the XHTML 1.0 transitional document type declaration (like the DocumentType interface in standard DOM) and ll.xist.ns.xml.XML10, which represents the standard XML declaration.
To explore XIST's XML output support I'll write code to generate a simple XML Software Autoupdate (XSA) file. XSA is an XML format for listing and describing software packages. This is the example I normally use to illustrate XML output, as in the article "Three More For XML Output". In XIST, you first have to define classes for the elements you're creating. Then you assemble them into a tree. Finally, you serialize the tree. Listing 1 is code to generate an XSA file.
Listing 1: Using XIST to Generate XSA
    #Part One: Set up the classes for the elements
    from ll.xist import xsc
    #The XML "namespace" represents the basics of XML Infoset
    from ll.xist.ns import xml

    class xsa(xsc.Element): pass
    class vendor(xsc.Element): pass
    class name(xsc.Element): pass
    class email(xsc.Element): pass
    class product(xsc.Element): pass
    class version(xsc.Element): pass
    class last_release(xsc.Element):
        #The proper XML name is not a valid Python ID so you
        #have to explicitly map to the XML name from the Python
        #class name
        xmlname = "last-release"
    class changes(xsc.Element): pass

    #Nested classes are used to represent attributes
    class product(xsc.Element):
        class Attrs(xsc.Element.Attrs):
            class id(xsc.TextAttr): pass

    #Part Two: Create the document instance tree
    xsa_root = xsa(
        vendor(
            name(u"Centigrade systems"),
            email(u"info@centigrade.bogus"),
        ),
        product(
            name(u"100\u00B0 Server"),
            version(u"1.0"),
            last_release(u"20030401"),
            changes(),
            id = u"100\u00B0"
        )
    )

    #Part Three: Serialize the tree
    #utf-8 encoding is actually the default
    print xsa_root.asBytes(encoding="utf-8")
I broke the listing into three parts. In part one, I set up the element types and other information items for XSA. Each XML element corresponds to a Python class deriving from xsc.Element. The initializers of these classes allow for a simple and clever idiom for creating content and elements: positional arguments to the initializer become child nodes, and keyword arguments become attributes. By default, the class name matches the XML element name, but the naming rules are different between Python and XML. Listing 1 illustrates how to get around such mismatches.
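To make the initializer idiom concrete, here is a minimal sketch of how such a class could be written in plain Python. It is an illustration only and makes no claim about how XIST implements xsc.Element: the Element class below, along with its tagname and serialize methods, are hypothetical names of my own. Only the xmlname convention is borrowed from Listing 1.

```python
from xml.sax.saxutils import escape, quoteattr


class Element(object):
    """Hypothetical stand-in for an element base class (not XIST's code)."""

    #Override in a subclass when the XML name is not a valid Python ID
    xmlname = None

    def __init__(self, *content, **attrs):
        self.content = list(content)  #positional arguments become children
        self.attrs = dict(attrs)      #keyword arguments become attributes

    def tagname(self):
        #By default the class name doubles as the XML element name
        return self.xmlname or self.__class__.__name__

    def serialize(self):
        attrs = "".join(
            " %s=%s" % (key, quoteattr(value))
            for key, value in sorted(self.attrs.items())
        )
        body = "".join(
            child.serialize() if isinstance(child, Element) else escape(child)
            for child in self.content
        )
        return "<%s%s>%s</%s>" % (self.tagname(), attrs, body, self.tagname())


class product(Element): pass
class version(Element): pass
class last_release(Element):
    xmlname = "last-release"

tree = product(version(u"1.0"), last_release(u"20030401"), id=u"p1")
result = tree.serialize()
```

The point of the design is that the nesting of constructor calls mirrors the nesting of the XML, so the code that builds a tree reads much like the document it produces.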
The extra work in part one sets up a very natural convention for creating trees, demonstrated in part two. All I have to do to build the tree is create instances of the XSA element classes, all nested within the initializer calls. Part three is when I serialize the tree. The asBytes method returns a string serialization of the tree. It properly encodes characters as needed, and deals with the non-ASCII degree symbol without any problems. Listing 2 shows the resulting output. The actual output is all on one line, but I have inserted line feeds for formatting reasons.
Listing 2: Output from Listing 1
    <xsa><vendor><name>Centigrade systems</name>
    <email>info@centigrade.bogus</email></vendor>
    <product id="100°">
    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release><changes></changes>
    </product></xsa>
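The degree symbol is a useful test case because the bytes written for it depend on the output encoding. The following sketch uses only Python's built-in codecs, not XIST, so it says nothing about how asBytes is implemented. It simply shows the two byte-level forms a serializer can choose between for the same character: the encoded character itself when the target encoding can represent it, and a numeric character reference when it cannot.

```python
text = u"100\u00B0 Server"

#UTF-8 can represent the degree symbol directly, as two bytes
as_utf8 = text.encode("utf-8")

#ASCII cannot, so the codec's "xmlcharrefreplace" error handler
#substitutes an XML numeric character reference instead
as_ascii = text.encode("ascii", "xmlcharrefreplace")
```

Either form is read back by an XML parser as the same character, U+00B0.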
Completing the Document
If you look carefully at Listing 1, you'll notice that what I've created is really just the top-level XSA element, and not the entire XML document. There is no XML declaration, and no XSA document type declaration (which is required for it to be a valid XSA document). XIST does allow for all this added detail. To create a full XML document you use an ll.xist.xsc.Frag object, which can gather together all the needed nodes, including declarations. Listing 3 illustrates this. To save space I have not reproduced part one of Listing 1; paste it in at the top of Listing 3 to run it.
Listing 3: Using XIST to Generate a Proper XSA Document
    XSA_PUBLIC = "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
    XSA_SYSTEM = "http://www.garshol.priv.no/download/xsa/xsa.dtd"

    class xsa_doctype(xsc.DocType):
        """
        Document type for XSA
        """
        def __init__(self):
            xsc.DocType.__init__(
                self,
                'xsa PUBLIC "%s" "%s"'%(XSA_PUBLIC, XSA_SYSTEM)
            )

    doc = xsc.Frag(
        xml.XML10(),
        xsa_doctype(),
        xsa(
            vendor(
                name(u"Centigrade systems"),
                email(u"info@centigrade.bogus"),
            ),
            product(
                name(u"100\u00B0 Server"),
                version(u"1.0"),
                last_release(u"20030401"),
                changes(),
                id = u"100\u00B0"
            )
        )
    )

    print doc.asBytes(encoding="utf-8")
This time I create an explicit document type declaration class and bundle this into a document fragment along with an instance of ll.xist.ns.xml.XML10, which represents the XML declaration. Listing 4 shows the resulting output. Again the actual output is all on one line, but I have inserted line feeds for formatting reasons.
Listing 4: Output from the Variation in Listing 3
    <?xml version='1.0' encoding='utf-8'?>
    <!DOCTYPE xsa PUBLIC "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
    "http://www.garshol.priv.no/download/xsa/xsa.dtd">
    <xsa><vendor><name>Centigrade systems</name>
    <email>info@centigrade.bogus</email></vendor>
    <product id="100°">
    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release><changes></changes>
    </product></xsa>
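For comparison, here is roughly the same document built with the standard library's xml.dom.minidom, which follows the W3C DOM. This is not XIST code, and the add helper is a hypothetical convenience of my own. The sketch is meant only to show the contrast in style: in the DOM every node is created through factory methods and attached step by step, whereas XIST lets the nesting of constructor calls express the tree.

```python
from xml.dom.minidom import getDOMImplementation

XSA_PUBLIC = "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
XSA_SYSTEM = "http://www.garshol.priv.no/download/xsa/xsa.dtd"

impl = getDOMImplementation()
#The document type node is created first and handed to the new document
doctype = impl.createDocumentType("xsa", XSA_PUBLIC, XSA_SYSTEM)
doc = impl.createDocument(None, "xsa", doctype)


def add(parent, tag, text=None):
    """Append a child element, optionally with text, and return it
    (hypothetical helper for this sketch)."""
    child = doc.createElement(tag)
    if text is not None:
        child.appendChild(doc.createTextNode(text))
    parent.appendChild(child)
    return child


root = doc.documentElement
vendor = add(root, "vendor")
add(vendor, "name", u"Centigrade systems")
add(vendor, "email", u"info@centigrade.bogus")

product = add(root, "product")
product.setAttribute("id", u"100\u00B0")
add(product, "name", u"100\u00B0 Server")
add(product, "version", u"1.0")
add(product, "last-release", u"20030401")
add(product, "changes")

#Serialize with an XML declaration and the document type declaration
output = doc.toxml(encoding="utf-8")
```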
Reading XML
XIST provides parsers that you can use to read XML into the sorts of XIST data structures I described above. It's really quite simple, so I'll get right to it. Listing 5 is a simple example using XIST to parse a DocBook instance.
Listing 5: Using XIST to Parse an XML Document
    from ll.xist import xsc
    from ll.xist import parsers
    #You must import this XIST namespace module, otherwise you
    #get a validation error because the parser does not know the
    #vocabulary
    from ll.xist.ns import docbook

    DOC = """\
    <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
    <article>
      <articleinfo>
        <title>DocBook article example</title>
        <author>
          <firstname>Uche</firstname>
          <surname>Ogbuji</surname>
        </author>
      </articleinfo>
      <section label="main">
        <title>Quote from "I Try"</title>
        <blockquote>
          <attribution>Talib Kweli</attribution>
          <para>
            Life is a beautiful struggle
            People search through the rubble for a suitable hustle
            Some people using the noodle
            Some people using the muscle
            Some people put it all together, make it fit like a puzzle
          </para>
        </blockquote>
      </section>
    </article>
    """

    doc = parsers.parseString(DOC)
I'll work interactively from this listing to show some of the tree navigation facilities for XIST trees. First I'll show how to use XIST iterators to search for the blockquote element.
    $ python -i listing5.py
    >>> blockquotes = doc.walk(xsc.FindTypeAll(docbook.blockquote))
    >>> bq = blockquotes.next()
    >>> print bq
    Talib Kweli
    Life is a beautiful struggle
    People search through the rubble for a suitable hustle
    Some people using the noodle
    Some people using the muscle
    Some people put it all together, make it fit like a puzzle
    >>> print bq.asBytes()
    <blockquote>
      <attribution>Talib Kweli</attribution>
      <para>
        Life is a beautiful struggle
        People search through the rubble for a suitable hustle
        Some people using the noodle
        Some people using the muscle
        Some people put it all together, make it fit like a puzzle
      </para>
    </blockquote>
    >>>
The walk method creates an iterator over the nodes in document order. xsc.FindTypeAll creates a filter that restricts the iterator to instances of the given element types found anywhere within the subtree. There is also xsc.FindType, which searches only the immediate children of the node. So, to find the attribution of the quote:
    >>> attribs = bq.content.walk(xsc.FindTypeAll(docbook.attribution))
    >>> attrib = attribs.next()
    >>> print attrib
    Talib Kweli
    >>>
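The difference between searching a whole subtree and searching only immediate children is the same one ElementTree draws between Element.iter and Element.findall. The sketch below uses the standard library's xml.etree.ElementTree on a trimmed copy of the Listing 5 document. It is an analogy to the distinction between xsc.FindTypeAll and xsc.FindType, not XIST code, and the trimmed document is my own.

```python
import xml.etree.ElementTree as ET

DOC = """\
<article>
  <articleinfo><title>DocBook article example</title></articleinfo>
  <section label="main">
    <title>Quote from "I Try"</title>
    <blockquote><attribution>Talib Kweli</attribution></blockquote>
  </section>
</article>
"""

root = ET.fromstring(DOC)

#Whole subtree, in document order: both title elements are found
all_titles = [t.text for t in root.iter("title")]

#Immediate children only: article has no title child of its own
child_titles = root.findall("title")

#The section element, by contrast, does have a title child
section = root.find("section")
section_titles = [t.text for t in section.findall("title")]
```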
Once you find an element of interest, it's trivial to access one of its attributes. They are available as if they were items in a dictionary.
    >>> sections = doc.walk(xsc.FindTypeAll(docbook.section))
    >>> sect = sections.next()
    >>> print sect[u"label"]
    main
    >>>
XIST also takes advantage of Python's operator overloading to support a language in some ways like XPath, but given as Python expressions rather than strings (Unicode objects, to be precise). This language is called XFind. The examples in the documentation look very interesting, but I had some trouble getting the expected results from XFind expressions. I couldn't be sure whether it was something I was doing wrong or quirks in the library, so I'll leave exploring XFind more deeply for another time. I encourage you to experiment with XFind, though. Many people have called for such a pure Python take on XPath, and it looks as if XIST is well on its way down this road.
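To give a feel for the general technique of building path expressions out of Python operators, here is a toy sketch. It is emphatically not XFind and does not reproduce its syntax or semantics: the Path class and its find method are hypothetical, and it works on standard library ElementTree elements rather than XIST nodes. It shows only how overloading the division operator lets a path be composed as an ordinary Python expression instead of being parsed from a string.

```python
import xml.etree.ElementTree as ET


class Path(object):
    """Toy path expression built with the / operator (not XFind)."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def __truediv__(self, tag):
        #path / "tag" returns a new path with one more child step
        return Path(self.steps + (tag,))

    #Python 2 spelling of the same operator
    __div__ = __truediv__

    def find(self, node):
        """Return the elements reached by following each child step."""
        nodes = [node]
        for tag in self.steps:
            nodes = [child for n in nodes for child in n if child.tag == tag]
        return nodes


root = ET.fromstring(
    "<article><section label='main'><title>Quote</title></section></article>"
)

#Reads somewhat like the XPath expression section/title
titles = (Path() / "section" / "title").find(root)
```

Because the path is a regular object, it can be stored, passed around, and extended with further steps like any other Python value.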
Wrap Up
It's surprising that XIST is such a dark horse. It has been around for a long time. It has a lot of very original and interesting capabilities. It's pretty well documented, and has a mature feel about it. Yet I had never tried it before working on this article, and I don't think I know of anyone else who had. Based on my experimentation, it is definitely worth serious consideration when you're looking for a Python-esque XML processing toolkit. The extremely object-oriented framework can feel a bit heavy, but I can appreciate some of the resulting benefits, and it would certainly suit some users' tastes very well. I should also mention that there is a lot more to XIST than I was able to cover in this article. I didn't touch on its support for different HTML and XHTML vocabularies, XML namespaces, XML entities, validation and content models, tree modification, pretty printing, image manipulation, and more.
I could only find one new development to report on regarding XML in the Python space. It's the interesting news that Fred Drake, Pythonista extraordinaire, appears to have started chipping in on the ZSI project for Python Web services. He made the announcement of ZSI 1.7. For those who are still interested in mainstream Web services, this is surely great news.