More Gems From the Mines
November 12, 2003
In a recent article I started mining the riches of the XML-SIG mailing list, prospecting for some of its choicest bits of code. I found a couple of nice bits from 1998 and 1999. This time I cover 2000 and 2001, an exciting period where preparations for Python 2.0 meant that XML tools were finally gaining some long-desired capabilities in the core language. As in the last article, where necessary, I have updated code to use current APIs, style, and conventions in order to make it more immediately useful to readers. All code listings are tested using Python 2.2.2 and PyXML 0.8.3. See the last article for my note on the use of PEP 8: Style Guide for Python Code, which I use in updating listings.
Rich Salz's Simple DOM Node Builder
Yes, I have even more on generating XML output. This time the idiom is more DOM-like, building node objects rather than sending commands to a single output aggregator object, as with the output tools I have examined previously. Rich Salz posted a simplenode.py (June 2001), a very small DOM module ("femto-DOM") with an even simpler API than minidom, but geared mostly towards node-oriented generation of XML output. You use the PyXML canonicalization (c14n) module with it to serialize the resulting nodes to get actual XML output. The output is thus in XML canonical form, which is fine for most output needs, although it may look odd compared to most output from "pretty printers". Not having to take care of the actual output simplifies the module nicely, but it does mean that PyXML is required. There have been recent fixes and improvements to the c14n module since PyXML 0.8.3, so 0.8.4 and beyond should produce even better output once available. I touched refactored simplenode.py to use Unicode strings, for safer support of non-ASCII text. The result is in listing 1.
Listing 1: Rich Salz's Simple DOM node builder (save as simplenode.py)from xml.dom import Node from xml.ns import XMLNS def _splitname(name): '''Split a name, returning prefix, localname, qualifiedname. ''' i = name.find(u':') if i == -1: return u'', name, name return name[:i], name[i + 1:], name def _initattrs(n, type): '''Set the initial node attributes. ''' n.attributes = {} n.childNodes = [] n.nodeType = type n.parentNode = None n.namespaceURI = u'' class SimpleContentNode: '''A CDATA, TEXT, or COMMENT node. ''' def __init__(self, type, text): _initattrs(self, type) self.data = text class SimplePINode: '''A PI node. ''' def __init__(self, name, value): _initattrs(self, Node.PROCESSING_INSTRUCTION_NODE) self.name, self.value, self.nodeValue = name, value, value class SimpleAttributeNode: '''An element attribute node. ''' def __init__(self, name, value): _initattrs(self, Node.ATTRIBUTE_NODE) self.value, self.nodeValue = value, value self.prefix, self.localName, self.nodeName = _splitname(name) class SimpleElementNode: '''An element. Might have children, text, and attributes. ''' def __init__(self, name, nsdict = None, newline = 1): if nsdict: self.nsdict = nsdict.copy() else: self.nsdict = {} _initattrs(self, Node.ELEMENT_NODE) self.prefix, self.localName, self.nodeName = _splitname(name) self.namespaceURI = self.nsdict.get(self.prefix, None) for k,v in self.nsdict.items(): self.addNSAttr(k, v) if newline: self.addText(u'\n') def nslookup(self, key): n = self while n: if n.nsdict.has_key(key): return n.nsdict[key] n = n.parentNode raise KeyError, u'namespace prefix %s not found' % key def addAttr(self, name, value): n = SimpleAttributeNode(name, value) n.parentNode = self if not n.prefix: n.namespaceURI = None else: n.namespaceURI = self.nslookup(n.prefix) self.attributes[name] = n return self def addNSAttr(self, prefix, value): if prefix: n = SimpleAttributeNode(u'xmlns:' + prefix, value) else: n = SimpleAttributeNode(u'xmlns', value) #XMLNS.BASE is the special namespace W3C #circularly uses for namespace declaration attributes themselves n.parentNode, n.namespaceURI = self, XMLNS.BASE self.attributes[u'xmlns:'] = n self.nsdict[prefix] = value return self def addDefaultNSAttr(self, value): return self.addNSAttr(u'', value) def addChild(self, n): n.parentNode = self if n.namespaceURI == None: n.namespaceURI = self.namespaceURI self.childNodes.append(n) return self def addText(self, text): n = SimpleContentNode(Node.TEXT_NODE, text) return self.addChild(n) def addCDATA(self, text): n = SimpleContentNode(Node.CDATA_SECTION_NODE, text) return self.addChild(n) def addComment(self, text): n = SimpleContentNode(Node.COMMENT_NODE, text) return self.addChild(n) def addPI(self, name, value): n = SimplePINode(name, value) return self.addChild(n) return self def addElement(self, name, nsdict = None, newline = 1): n = SimpleElementNode(name, nsdict, newline) self.addChild(n) return n if __name__ == '__main__': e = SimpleElementNode('z', {'': 'uri:example.com', 'q':'q-namespace'}) n = e.addElement('zchild') n.addNSAttr('d', 'd-uri') n.addAttr('foo', 'foo-value') n.addText('some text for d\n') n.addElement('d:k2').addText('innermost txt\n') e.addElement('q:e-child-2').addComment('This is a comment') e.addText('\n') e.addElement('ll:foo', { 'll': 'example.org'}) e.addAttr('eattr', '''eat at joe's''') from xml.dom.ext import Canonicalize print Canonicalize(e, comments=1)
As you can see, the node implementations, SimpleContentNode
,
SimpleAttributeNode
, SimplePINode
and
SimpleElementNode
are dead simple. The last is the only one with any methods
besides the initializer, and these are pretty much all factory methods. The namespace
handling is rather suspect in this module, although I was able to generate a simple
document
with a namespace. For parity, however, I put it to work against the same XSA output
I have
been testing other XML output tools (see the previous article, for example). No namespaces
in this task. In the future I may revisit the output tools I have examined in this
column to
document their suitability for output using namespaces. Listing 2 is the code to generate
the XSA file. Save listing 1 as simplenode.py before running it.
import sys import codecs from simplenode import SimpleElementNode root = SimpleElementNode(u'xsa') vendor = root.addElement(u'vendor') vendor.addElement(u'name').addText(u'Centigrade systems') vendor.addElement(u'email').addText(u'info@centigrade.bogus') product = root.addElement(u'product').addAttr(u'id', u'100\u00B0') product.addElement(u'name').addText(u'100\u00B0 Server') product.addElement(u'version').addText(u'1.0') product.addElement(u'last-release').addText(u'20030401') product.addElement(u'changes') #The wrapper automatically handles output encoding for the stream wrapper = codecs.lookup('utf-8')[3] from xml.dom.ext import Canonicalize Canonicalize(root, output=wrapper(sys.stdout), comments=1)
The c14n module claims to emit UTF-8, but it doesn't really seem to be smart about
encodings because I got the dreaded UnicodeError: ASCII encoding error: ordinal not in
range(128)
when I tried to let the Canonicalizer take care of the output stream for
me. I had to pass the old codecs output stream wrapper to get the Canonicalizer to
output
the degree symbol. The resulting output is
$ python listing2.py <xsa> <vendor> <name> Centigrade systems</name><email> info@centigrade.bogus</email></vendor><product id="100°"> <name> 100° Server</name><version> 1.0</version><last-release> 20030401</last-release><changes> </changes></product></xsa>
Other Bits
I didn't find much other code that was interesting enough and ready for use. There are some near gems in the archives that may also be of interest to some users, but perhaps not really worthy of the full treatment of update and detailed commentary in this column.
In "The Zen of DOM" (April 2000) Laurent Szyster attaches "myXML-0.2.8.tgz" (not to be confused with the current PHP project of the same name). It is basically a Python data binding: a tool that allows you to register special Python objects against XML element names and parse the XML files into data structures consisting of the special objects. The package is not as versatile as current available data binding packages (see recent articles in this column), but it does include some interesting ideas. Brief inspection of the code indicates that it would probably work with minor modifications on more recent pyexpat builds.
Benjamin Saller posted an API and module for simple access to data in XML files (June 2000)). It allows you to parse XML and creates specialized data structures, similar to those of ElementTree. You can use strings with a special syntax not unlike XPath. You can also specify data type conversions from XML character data. See the thread after the original module was posted for some problems, tweaks, and suggestions.
Tom Newman announced a text-based browser/editor for XML (July 2000), tan_xml_browser. The package requires PyNcurses and an old version of 4DOM. It would need some work to update the DOM bindings.
Clark Evans posted a "Namespace Stripper Filter" (March 2001), a SAX filter that strips elements and attributes in a given namespace (it only handles one namespace at a time). It is a good example of namespace filters, except that it accesses internal objects of the SAX API in a bid for performance. This is not an advisable practice, as Lars Marius Garshol warns in a follow-up.
Sjoerd Mullender posted what he claimed to be a validating XML parser in Python (April 2001). The script can emit an XML output form suitable for testing and clean comparisons, similar in intent to the W3C standard Canonical XML format used in PyXML's c14n module. I've always argued that it almost never a good idea for anyone to roll their own XML parser. This code does, however, have several useful passages, especially the various regular expressions defined toward the top.
It's worth nothing one gem that is not a bundle of Python code, at least not directly. Martin von Löwis was working on a unified API for XPath libraries in Python in 2001. He decided to define the interface using an ISO Interface Definition Language (IDL) module for XPath libraries. It defines all the axes, operators, and node types in programmatic detail. Martin used this IDL as the basis for his Python XPath implementation.
And while I'm on variant gems, it is well-known wisdom in Python that it is much
more
efficient to create big strings by using one-time list concatenation or
cStringIO
rather than simple addition of strings. String processing is, of
course, very important in XML processing so Tom Passim, an XML developer, took the
initiative to do some clear analysis on the matter, compiling "New data on speed
of string appending" for Python 1.5.2. Conclusion: don't forget to use
cStringIO
, the winner of the shootout. And if you decide on the runner up,
recent Python releases provide string methods so that you would instead write:
NO_SEPARATOR = '' snippet_list = [] for snippet in strings_to_concat: snippet_list.append(snippet) created_str = NO_SEPARATOR.join(snippet_list)
I'll also point to a posting that I haven't been able to run or analyze successfully
enough
to determine whether it is a gem, but which does look promising. Clark Evans posted
a "Simple wxPython XSLT Testing Tool using MSXML" which uses a GUI interface for
running Microsoft's MSXML XSLT processor. He points to TransformStage
, a
function where you could substitute invocation of some other XSLT processor; it's
the only
place where the pythoncom
module, imported at the top, is used, and this module
is only available on Windows.
Wrapping Up
Also in Python and XML |
|
Should Python and XML Coexist? |
|
In the future I plan to return to this thread to look at XML-SIG postings in 2002 and 2003. There is more useful code available in the XML-SIG archives, and I will return to this topic in the future, presenting updates of other useful code from the archives. If you think may have missed any great postings since 1998, feel free to attach comments to this article or post them to the XML-SIG mailing list.
Meanwhile, it has been a slow month in new Python-XML development.
Pyana 0.8.1 was released. It has been updated for Xalan 1.6/Xerces 2.3. See Brian Quinlan's announcement.
Roman Kennke developed a module, domhelper.py, with functions to provide some common operations on DOM, including looking up namespace URIs and prefixes, non-recursively getting text or child elements of a given node. Be warned that this module does raise errors in strange situations, but you should be able to comment out the extraneous error checking easily enough. The module is actually part of the Python-SOAP project. See Kennke's announcement.