Using SAX for Proper XML Output
March 12, 2003
In an earlier Python and XML column I discussed ways to achieve proper XML output from Python programs. That discussion included basic considerations and techniques in generating XML output in Python code. I also introduced a couple of useful functions for helping with correct output: xml.sax.saxutils.escape from core Python 2.x and Ft.Xml.Lib.String.TranslateCdata from 4Suite. There are other tools for helping with XML generation. In this article I introduce an important one that comes with Python itself. Generating XML from Python is one of the most common XML-related tasks the average Python user will face; thus, having more than one way to complete such a common task is especially helpful.
Pushing SAX
Probably the most effective general approach to creating safe XML output is to use SAX more fully than just cherry-picking xml.sax.saxutils.escape. Most users think of SAX as an XML input system, which is generally correct; however, thanks to some goodies in Python's SAX implementation, you can also use it as an XML output tool. First of all, Python's SAX is implemented with objects which have methods representing each XML event. Any code that calls these methods on a SAX handler can therefore masquerade as an XML parser: your code pretends to be a parser sending events from serialized XML, while actually computing the events in whatever manner you require. On the other end of things, xml.sax.saxutils.XMLGenerator, documented in the official Python library reference, is a utility SAX handler that comes with Python. It takes a stream of SAX events and serializes them to an XML document, observing all the necessary rules in the process.
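To make the "masquerading as a parser" idea concrete, here is a minimal sketch, assuming Python 3 (where every string is Unicode). The names EventRecorder and emit_items are hypothetical and introduced only for illustration. The function computes SAX events from a plain Python list and fires them at whatever handler it is given; the handler cannot tell that no XML text was ever parsed.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class EventRecorder(ContentHandler):
    """Hypothetical handler that simply records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


def emit_items(handler, items):
    """Act like a parser: fire SAX events computed from a list of strings."""
    handler.startDocument()
    handler.startElement("items", AttributesImpl({}))
    for position, text in enumerate(items, 1):
        handler.startElement("item", AttributesImpl({"n": str(position)}))
        handler.characters(text)
        handler.endElement("item")
    handler.endElement("items")
    handler.endDocument()


recorder = EventRecorder()
emit_items(recorder, ["spam", "eggs"])
for event in recorder.events:
    print(event)
```

Passing an XMLGenerator instance instead of the recorder would serialize the same events as XML text, which is the pairing the rest of this article relies on.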
You might have gathered from this description how this tandem of facilities leads to an elegant method for emitting XML. If not, Listing 1 illustrates just how this technique may be used to implement the code pattern from the earlier XML output article (that is, creating an XML-encoded log file).
Listing 1 (listing1.py): Generating an XML log file using Python's SAX utilities

    import time
    from xml.sax.saxutils import XMLGenerator
    from xml.sax.xmlreader import AttributesNSImpl

    LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']


    class xml_logger:
        def __init__(self, output, encoding):
            """
            Set up a logger object, which takes SAX events and outputs
            an XML log file
            """
            logger = XMLGenerator(output, encoding)
            logger.startDocument()
            attrs = AttributesNSImpl({}, {})
            logger.startElementNS((None, u'log'), u'log', attrs)
            self._logger = logger
            self._output = output
            self._encoding = encoding
            return

        def write_entry(self, level, msg):
            """
            Write a log entry to the logger
            level - the level of the entry
            msg   - the text of the entry.  Must be a Unicode object
            """
            #Note: in a real application, I would use ISO 8601 for the date
            #asctime used here for simplicity
            now = time.asctime(time.localtime())
            attr_vals = {
                (None, u'date'): now,
                (None, u'level'): LOG_LEVELS[level],
                }
            attr_qnames = {
                (None, u'date'): u'date',
                (None, u'level'): u'level',
                }
            attrs = AttributesNSImpl(attr_vals, attr_qnames)
            self._logger.startElementNS((None, u'entry'), u'entry', attrs)
            self._logger.characters(msg)
            self._logger.endElementNS((None, u'entry'), u'entry')
            return

        def close(self):
            """
            Clean up the logger object
            """
            self._logger.endElementNS((None, u'log'), u'log')
            self._logger.endDocument()
            return


    if __name__ == "__main__":
        #Test it out
        import sys
        xl = xml_logger(sys.stdout, 'utf-8')
        xl.write_entry(2, u"Vanilla log entry")
        xl.close()
I've arranged the logic in a class that encapsulates the SAX machinery. xml_logger is initialized with an output file object and an encoding to use. First I set up an XMLGenerator instance which will accept SAX events and emit XML text. I immediately start using it by sending SAX events to initialize the document and create a wrapper element for the overall log. You should not forget to send startDocument. In opening the top-level element, log, I use the namespace-aware SAX API, even though the log XML documents do not use namespaces. This is just to make the example a bit richer, since the namespace-aware APIs are more complex than the plain ones.
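For comparison, the same kind of document can be produced with the plain, non-namespace API. The following is a minimal sketch, assuming Python 3; the function name write_plain_log is hypothetical. Element names are simple strings, and AttributesImpl takes a single dictionary that maps attribute names to values.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def write_plain_log(stream, encoding="utf-8"):
    """Emit a one-entry log using startElement/endElement (no namespaces)."""
    gen = XMLGenerator(stream, encoding)
    gen.startDocument()
    gen.startElement("log", AttributesImpl({}))
    gen.startElement("entry", AttributesImpl({"level": "ERROR"}))
    gen.characters("Vanilla log entry")
    gen.endElement("entry")
    gen.endElement("log")
    gen.endDocument()


buf = io.StringIO()
write_plain_log(buf)
plain_output = buf.getvalue()
print(plain_output)
```

The plain calls are shorter, but they cannot express namespaced names, so which API to use depends on whether the output vocabulary needs namespaces.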
You ordinarily don't have to worry about how the instances of attribute information are created, unless you're writing a driver, filter, or any other SAX event emitter such as this one. Unfortunately for such users, the creation APIs for the AttributesImpl and AttributesNSImpl classes are not as well documented as the read APIs. It's not even clear whether they are at all standardized. The system used in the listing does work with all recent Python/SAX and PyXML SAX versions. In the case of the namespace-aware attribute information class, you have to pass in two dictionaries. One maps a tuple of namespace and local name to values, and the other maps the same to the qnames used in the serialization. This may seem a rather elaborate protocol, but it is designed to closely correspond to the standard read API for these objects. In the initializer in the listing I create an empty AttributesNSImpl object by initializing it with two empty dictionaries. You can see how this works when there are actually attributes by looking in the write_entry method.
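The correspondence between the two dictionaries and the read API can be seen in a short sketch, assuming Python 3. The namespace URI http://example.com/ns and the ex prefix are made up for illustration and do not come from the listing.

```python
from xml.sax.xmlreader import AttributesNSImpl

EXAMPLE_NS = "http://example.com/ns"  # hypothetical namespace URI

# First dictionary: (namespace, local name) -> attribute value
attr_vals = {
    (None, "level"): "ERROR",
    (EXAMPLE_NS, "id"): "42",
}
# Second dictionary: (namespace, local name) -> qname used when serializing
attr_qnames = {
    (None, "level"): "level",
    (EXAMPLE_NS, "id"): "ex:id",
}
attrs = AttributesNSImpl(attr_vals, attr_qnames)

# The read API looks things up by either form of the name
level = attrs.getValue((None, "level"))
id_qname = attrs.getQNameByName((EXAMPLE_NS, "id"))
id_value = attrs.getValueByQName("ex:id")
id_name = attrs.getNameByQName("ex:id")
print(level, id_qname, id_value, id_name)
```

Each lookup is answered from one of the two dictionaries passed to the constructor, which is why both are required even when, as in the log example, no namespaces are in use.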
Once the AttributesNSImpl object is ready, creating an element is a simple matter of calling the startElementNS method on the SAX handler using the (namespace, local-name), qname convention and attribute info object. And don't forget to call the endElementNS method to close the element. In the initializer of xml_logger, closing the top-level element and document itself is left for later. The caller must call the close method to wrap things up and have well-formed output. The rest of the xml_logger class should be easy enough to follow.
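Because a forgotten close leaves the output without its end tag, one possible safeguard is to send the closing events in a finally clause. The sketch below assumes Python 3 and is not part of the listing; the open_log helper is hypothetical and uses contextlib.contextmanager to guarantee the end events even when the caller's code raises.

```python
import io
from contextlib import contextmanager
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


@contextmanager
def open_log(stream, encoding="utf-8"):
    """Open the <log> wrapper and always close it on exit."""
    gen = XMLGenerator(stream, encoding)
    gen.startDocument()
    gen.startElementNS((None, "log"), "log", AttributesNSImpl({}, {}))
    try:
        yield gen
    finally:
        # Runs on normal exit and on exceptions alike
        gen.endElementNS((None, "log"), "log")
        gen.endDocument()


buf = io.StringIO()
try:
    with open_log(buf) as gen:
        gen.startElementNS((None, "entry"), "entry", AttributesNSImpl({}, {}))
        gen.characters("about to fail")
        gen.endElementNS((None, "entry"), "entry")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

log_text = buf.getvalue()
print(log_text)
```

This only protects the wrapper element. If an exception interrupts the code between a start event and its matching end event for an inner element, the result is still not well-formed.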
The character of SAX
In the last article on XML output I walked through all the gotchas of proper character encoding. This SAX method largely frees you from the worry of all that. The most important thing to remember is to use Unicode objects rather than strings in your API calls. This follows the principle I recommended in the last article: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects.
There are in fact a few areas where simple, ASCII-only strings are safe: for example, output encodings passed to the initializer of XMLGenerator and similar cases. But these areas are unusual. Listing 2 demonstrates a use of the xml_logger class to output a more interesting log entry.
    from listing1 import xml_logger
    import cStringIO

    stream = cStringIO.StringIO()
    xl = xml_logger(stream, 'utf-8')
    xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
    xl.close()
    print repr(stream.getvalue())
I use cStringIO to capture the output as a string. I then display the Python representation of the output in order to be clear about what is produced. The resulting string is basically (rearranged to display nicely here):
    <?xml version="1.0" encoding="utf-8"?>
    <log><entry level="ERROR" date="Sat Mar 8 08:55:11 2003">
    In any triangle, each interior angle &lt; 90\xc2\xb0
    </entry></log>
You can see that the character passed in as "<" has been escaped to "&lt;" and that the character given using the Unicode character escape "\u00B0" (the degree symbol) is rendered as the UTF-8 sequence "\xc2\xb0". If I specify a different encoding for output, as in Listing 3, the library will again handle things.
Listing 3: Using xml_logger to emit non-ASCII and escaped characters with ISO-8859-1 encoding

    from listing1 import xml_logger
    import cStringIO

    stream = cStringIO.StringIO()
    xl = xml_logger(stream, 'iso-8859-1')
    xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
    xl.close()
    print repr(stream.getvalue())
Which results in
    <?xml version="1.0" encoding="iso-8859-1"?>
    <log><entry level="ERROR" date="Sat Mar 8 09:35:56 2003">
    In any triangle, each interior angle &lt; 90\xb0
    </entry></log>
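The listings in this article target Python 2. As a rough equivalent for readers on Python 3, the following sketch (the helper name serialize_entry is hypothetical) writes to an io.BytesIO so that the encoded bytes can be inspected directly. Every Python 3 str is already Unicode, so no u prefix is needed.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


def serialize_entry(text, encoding):
    """Serialize one <entry> element and return the raw bytes written."""
    stream = io.BytesIO()
    gen = XMLGenerator(stream, encoding)
    gen.startDocument()
    gen.startElementNS((None, "entry"), "entry", AttributesNSImpl({}, {}))
    gen.characters(text)
    gen.endElementNS((None, "entry"), "entry")
    gen.endDocument()
    return stream.getvalue()


message = "In any triangle, each interior angle < 90\u00B0"
utf8_bytes = serialize_entry(message, "utf-8")
latin1_bytes = serialize_entry(message, "iso-8859-1")
print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

In both cases the "<" is escaped by the generator, and the degree symbol comes out as two bytes under UTF-8 and one byte under ISO-8859-1, matching the outputs shown above.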
If you use encodings which aren't in the Unicode or ISO-8859 family, or which are not available in the "encodings" directory of the Python library, you may have to download third-party codecs in order to use them in your XML processing. This includes the popular JIS, Big-5, GB, KS, and EUC variants in Asia. You may already have these installed for general processing; if not, it requires a significant amount of sleuthing right now to find them. Eventually they may be available all together in the Python Codecs project. For now you can download particular codecs from projects such as Python Korean Codecs and Tamito Kajiyama's Japanese codecs (page in Japanese).
Other Developments
The built-in SAX library is but one of the available tools for dealing with all the complexities of XML output. It has the advantage of coming with Python, but in future columns I will cover other options available separately. Another useful but less common SAX usage pattern is chaining SAX filters. Soon after this article is published, I'll have an article out with more information on using SAX filters with Python's SAX. Watch my publications list to see when it appears.
The past month or so has been another busy period for Python-XML development. There has been a lot of discussion of the future direction of the PyXML project. Martijn Faassen made "a modest proposal" for changing the fact that PyXML overwrites the xml module in a Python installation. This led to the "Finding _xmlplus in Python 2.3a2" thread, in which I proposed that parts of PyXML, pysax, and the dom package (excepting 4DOM) should simply be moved into the Python core. Discussion of these matters is still proceeding, but if you are interested in the road map for PyXML, you might wish to join the discussion.
Francesco Garelli announced Satine, an interesting package which converts XML documents to Python lists of objects which have Python attributes mirroring the XML element attributes, a data structure he calls an "xlist". The package is designed for speed, with key parts coded in C. It also has a web services module which supports plain XML and SOAP over HTTP. Garelli would be grateful for contributors of binary packages on various platforms.
David Mertz announced the 1.0.6 release of gnosis XML tools. Most of the changes have to do with the gnosis.magic module, which isn't directly related to XML, but there are some XML bug fixes.
Mark Bucciarelli was having problems handling WSDL, which eventually led to his contributing a patch to wsdllib that makes it work with the most recent 4Suite. I'll release an updated version of wsdllib that incorporates this patch.