Proper XML Output in Python
November 13, 2002
I planned to conclude my exploration of 4Suite this time, but events since last month's
article led me to discuss some fundamental techniques for Python-XML processing first.
First, I consider ways of producing XML output in Python, which might make you wonder
what's
wrong with good old print
? Indeed programmers often use simple print statements
in order to generate XML. But this approach is not without hazards, and it's good
to be
aware of them. It's even better to learn about tools that can help you avoid the
hazards.
Minding the Law
The main problem with simple print
is that it knows nothing about the
syntactic restrictions in XML standards. As long as you can trust all sources of text
to be
rendered as proper XML, you can constrain the output as appropriate; but it's very
easy to
run into subtle problems which even experts may miss.
XML has been praised partly because, by setting down some important syntactic rules, it eases the path to interoperability across languages, tools, and platforms. When these rules are ignored, XML loses much of its advantage. Unfortunately, developers often produce XML carelessly, resulting in broken XML. The RSS community is a good example. RSS uses XML (and, in some variants, RDF) in order to standardize syntax, but many RSS feeds produce malformed XML. Since some of these feeds are popular, the result has been a spate of RSS processors that are so forgiving, they will even accept broken XML. Which is a pity.
Eric van der Vlist -- as reported in his article for XML.com, "Cataloging XML Vocabularies" -- found that a significant number of Web documents with XML namespaces are not well-formed, including XHTML documents. Even a tech-savvy outfit like Wired has had problems developing systems that reliably put out well-formed XML.
My point is that there's no reason why Python developers shouldn't be good citizens
in
producing well-formed XML output. We have a variety of tools and a language which
allows us
to express our intent very clearly. All we need is to take a suitable amount of care.
Consider the snippet in Listing 1. It defines a function, write_xml_log_entry
,
for writing log file entries as little XML documents, using the print
keyword
and formatted strings.
Listing 1: Simple script to write XML log entries
import time LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR'] def write_xml_log_entry(level, msg): #Note: in a real application, I would use ISO 8601 for the date #asctime used here for simplicity now = time.asctime(time.localtime()) params = {'level': LOG_LEVELS[level], 'date': now, 'msg': msg} print '<entry level="%(level)s" date="%(date)s"> \ \n%(msg)s\n</entry>' % params return write_xml_log_entry(0, "'Twas brillig and the slithy toves") #sleep one second #Space out the log messages just for fun time.sleep(1) write_xml_log_entry(1, "And as in uffish thought he stood,") time.sleep(1) write_xml_log_entry(0, "The vorpal blade went snicker snack")
Listing 1 also includes a few lines to exercise the write_xml_log_entry
function. All in all, this script is straightforward enough. The output looks like
$ python listing1.py <entry level="DEBUG" date="Mon Oct 21 22:11:01 2002"> 'Twas brillig and the slithy toves </entry> <entry level="WARNING" date="Mon Oct 21 22:11:03 2002"> And as in uffish thought he stood, </entry> <entry level="DEBUG" date="Mon Oct 21 22:11:07 2002"> The vorpal blade went snicker snack </entry>
But what if someone uses this function thus:
>>> write_xml_log_entry(2, "In any triangle, each interior angle < 90 degrees") <entry level="ERROR" date="Tue Oct 22 05:41:31 2002"> In any triangle, each interior angle < 90 degrees </entry>
Now the result isn't well-formed XML. The character "<" should, of course, have been escaped to "<". And there's a policy issue to consider. Are messages passed into the logging function merely strings of unescaped character data, or are they structured, markup-containing XML fragments? The latter policy might be prudent if you want to allow people to mark up log entries by, say, italicizing a portion of the message:
>>> write_xml_log_entry(2, "Came no church of cut stone signed: <i>Adamo me fecit.</i>") <entry level="ERROR" date="Tue Oct 22 05:41:31 2002"> Came no church of cut stone signed: <i>Adamo me fecit.</i> </entry>
I've reused the write_xml_log_entry
function because
msg
-as-markup is the policy implied by the function as currently written. There
are further policy issues to consider. In particular, to what XML vocabularies do
you
constrain output, if any? Allowing the user to pass markup often entails that they
have the
responsibility for passing in well-formed markup. The other approach, where the
msg
parameter is merely character data, usually entails that the
write_xml_log_entry
function will perform the escaping required to produce
well-formed XML in the end. For this purpose I can use the escape
utility
function in the xml.sax.saxutils
module. Listing 2 defines a function,
write_xml_cdata_log_entry
, which performs such escaping.
Listing 2: Simple script to write XML log entries, with the policy that messages passed in are considered character data
import time from xml.sax import saxutils LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR'] def write_xml_cdata_log_entry(level, msg): #Note: in a real application, I would use ISO 8601 for the date #asctime used here for simplicity now = time.asctime(time.localtime()) params = {'level': LOG_LEVELS[level], 'date': now, 'msg': saxutils.escape(msg)} print '<entry level="%(level)s" date="%(date)s"> \ \n%(msg)s\n</entry>' % params return
This function is now a bit safer to use for arbitrary text.
$ python -i listing2.py >>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260") <entry level="ERROR" date="Tue Oct 22 06:33:51 2002"> In any triangle, each interior angle < 90° </entry> >>>
And it enforces the policy of no embedded markup in messages.
>>> write_xml_cdata_log_entry(2, "Came no church of cut stone signed: <i>Adamo me fecit.</i>") <entry level="ERROR" date="Tue Oct 22 06:41:31 2002"> Came no church of cut stone signed: <i>Adamo me fecit.</i> </entry> >>>
Notice that escape
also escapes ">" characters, which is not required by
XML in character data but is often preferred by users for congruity.
Minding Your Character
Python's regular strings are byte arrays. Characters are represented as one or more bytes each, depending on the encoding, but the string does not indicate which encoding was used. If it surprises you to hear that characters might be represented using more than one byte, consider Asian writing systems, where there are far more characters available than could be squeezed into the 255 a byte can represent. For this reason, some character encodings, such as UTF-8, use more than one byte to encode a single character. Other encodings, such as UTF-16 and UTF-32, effectively organize the byte sequence into groups of two or four bytes, each of which is the basic unit of the character encoding.
Because most single-byte encodings, such as as ISO-8859-1, are identical in the ASCII
byte
range (0x00
to 0x7F
), it's generally safe to use Python's strings
if they contain only ASCII characters and you're using a single-byte encoding (other
than
the old IBM mainframe standard, EBCDIC, which is different from ASCII throughout its
range).
But in any case, especially if you put non-ASCII characters in one of these regular
strings,
both the sender and receiver of these bytes need to be in agreement about what encoding
was
used.
The problems associated with encoded strings are easy to demonstrate. Consider one of the earlier examples, with a slight change -- the word "degrees" has been replaced with the byte B0 (octal 260), which is the degree symbol in the popular ISO-8859-1 and Windows-1252 character encodings:
$ python -i listing2.py >>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260") <entry level="ERROR" date="Tue Oct 22 06:33:51 2002"> In any triangle, each interior angle < 90° </entry> >>>
The \260
is an octal escape for characters in Python. It represents a byte of
octal value 260 (B0 hex, 176 decimal).
As for the output produced by write_xml_cdata_log_entry
, the characters seem
properly escaped, but there may still be a problem. If this output is to stand alone
as an
XML document, it's not be well-formed. The problem is that there is no XML declaration,
so
the character encoding is assumed by XML processors to be UTF-8. But the degree symbol
at
the end of the string makes it illegal UTF-8; an XML parser would signal an error.
This is
one of the most common symptoms of bad XML I have seen: documents encoded in ISO-8859-1
or
some other encoding which are not marked as such in an XML declaration.
Just adding an XML declaration is not necessarily a solution. If I have the function add
"<?xml version="1.0" encoding="ISO-8859-1"?>"
then the previous function invocation produces problem-free XML. But nothing prevents
write_xml_cdata_log_entry
from being passed a message in an encoding other
than ISO-8859-1. Almost any sequence of bytes can be interpreted ISO-8859-1, so no
error
would be detected. But this is merely masking a deeper, more insidious problem: the
text
would be completely misinterpreted. To illustrate this specious fix, Listing 3 forces
an
ISO-8859-1 XML declaration.
Listing 3: a variation on write_xml_cdata_log_entry which always puts out an ISO-8859-1 XML declaration
import time from xml.sax import saxutils LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR'] def write_xml_cdata_log_entry(level, msg): #Note: in a real application, I would use ISO 8601 for the date #asctime used here for simplicity now = time.asctime(time.localtime()) params = {'level': LOG_LEVELS[level], 'date': now, 'msg': saxutils.escape(msg)} print '<?xml version="1.0" encoding="ISO-8859-1"?>' print '<entry level="%(level)s" date="%(date)s"> \ \n%(msg)s\n</entry>' % params return
To understand the nastiness that lurks within this seeming fix, take the case where a user passes in a string with a UTF-8 sequence with a Japanese message, which translates to "Welcome" in English.
$ python -i listing2.py >>> write_xml_cdata_log_entry(2, "\343\202\210\343\201\206\343\201\223\343\201\235") <?xml version="1.0" encoding="ISO-8859-1"?> <entry level="ERROR" date="Tue Oct 22 15:54:36 2002"> よãfl†ãfl“ãfl? </entry> >>>
An XML parser would accept this with no complaint. The problem is that any processing tools looking at this XML would read the individual sequences of the UTF-8 encoding as separate ISO-8859-1 characters. Which means they would see twelve characters, rather than the four which our imaginary Japanese user thought she had specified. Even worse, unless this text is displayed in a system localized for Japanese, it will come out as a mess of accented "a"s and other strange characters, rather than the dignified Japanese welcome intended by the user, illustrated in Figure 1.
Figure 1: A Japanese Welcome
Character encoding issues are a very tricky business, and you should always defer to the tools that your language and operating environment provide for such magic, if for no other reason than to pass the buck when something goes wrong. In Python's case, this means using the Unicode facilities available in Python 1.6 and 2.x (although I still highly recommend Python 2.2 or more recent for XML processing). In fact, I use and strongly encourage the following rule for XML processing in Python: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects.
In fact, I encourage that all use of strings in programs that process XML should be
in the
form of Unicode objects, but following the above rule alone will prevent a lot of
problems.
Listing 4 updates write_xml_cdata_log_entry
to follow this rule.
Listing 4: a variation on write_xml_cdata_log_entry which strictly accepts Python Unicode objects for message text.
import time, types from xml.sax import saxutils LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR'] def write_xml_cdata_log_entry(level, msg): if not isinstance(msg, types.UnicodeType): raise TypeError("XML character data must be passed in as a unicode object") now = time.asctime(time.localtime()) encoded_msg = saxutils.escape(msg).encode('UTF-8') params = {'level': LOG_LEVELS[level], 'date': now, 'msg': encoded_msg} print '<entry level="%(level)s" date="%(date)s"> \ \n%(msg)s\n</entry>' % params return
Pay particular attention to the line
encoded_msg = saxutils.escape(msg).encode('UTF-8')
Not only does this line escape characters that are illegal in XML character data,
but it
also encodes the Unicode object as a UTF-8 byte string. This is needed because most
output,
including printing to consoles and writing to files on most operating systems, requires
conversion to byte streams. This means using an 8-bit encoding for strings that were
originally in Unicode (because of my suggested rule). The
write_xml_cdata_log_entry
function always uses UTF-8 for this output
encoding, which means that it doesn't have to put out an XML declaration that specifies
an
encoding. I should point out that in general it's considered good practice to always
use an
XML declaration which specifies an encoding, but I wrote the function this way as
an
illustration.
This version of the write_xml_cdata_log_entry
function is safe as far as
character encodings are concerned. It doesn't care whether the character data came
from an
ISO-8859-1 string, a UTF-8 string, or any other form of string, as long as it is passed
in
as a Unicode object.
$ python -i listing4.py >>> write_xml_cdata_log_entry(2, "In any triangle, each interior angle < 90\260") Traceback (most recent call last): File "<stdin>", line 1, in ? File "listing4.py", line 8, in write_xml_cdata_log_entry raise TypeError("XML character data must be passed in as a unicode object") TypeError: XML character data must be passed in as a unicode object
This exception is as expected. We passed in a plain byte string rather than a Unicode object and the function is enforcing policy.
>>> write_xml_cdata_log_entry(2, u"In any triangle, each interior angle < 90\u00B0") <entry level="ERROR" date="Tue Oct 22 17:58:08 2002"> In any triangle, each interior angle < 90° </entry>
The log message unicode object includes a character, \u00B0
in the Python
notation for explicitly representing a Unicode code point. A code point is a number
that uniquely identifies one of the many characters Unicode defines. Here, of course,
the
code point represents the degree symbol. In this case, it would also be correct to
use the
regular octal escape character \260
, but I recommend using the "\u" form of
escape in Python Unicode objects. Be wary of using the position of the character you
want in
your local encoding as the Unicode code point. For example, on Macs predating OS X,
the
176th character is the infinity symbol ("\u221E"/
), rather than the degree
symbol.
The function outputs the single degree character as a two-byte UTF-8 sequence. Since my console thinks it is displaying ISO-8859-1, the bytes appear to be separate characters, but an XML processor would properly read the sequence as a single character.
>>> #The following two lines are equivalent >>> msg = unicode("\343\202\210\343\201\206\343\201\223\343\201\235", "UTF-8") >>> msg = "\343\202\210\343\201\206\343\201\223\343\201\235".decode("UTF-8") >>> write_xml_cdata_log_entry(2, msg) <entry level="ERROR" date="Tue Oct 22 18:10:57 2002"> よãfl†ãfl“ãfl? </entry>
First, I create a Unicode object from the UTF-8-encoded string, and then pass it to the function, which outputs it as UTF-8. This is no longer a problem because the parser will recognize the encoding as UTF-8, rather than confusing it as ISO-8859-1, as before.
Not Quite There Yet
But this function is still not failsafe. A remaining problem is that XML only allows
a
limited set of characters to be present in markup. For example, the form feed character
is
illegal. There is nothing in our function to prevent a user from inserting a form
feed
character, which would result in malformed XML. There are other subtleties to consider.
Users of 4Suite have handy functions that take care of most of the concerns surrounding
the
output of XML character data. The one of most interest in this discussion is
Ft.Xml.Lib.String.TranslateCdata
. Listing 5 is a version of
write_xml_cdata_log_entry
that uses TranslateCdata
to render
character data as well-formed XML.
Listing 5: a variation on write_xml_cdata_log_entry which uses Ft.Xml.Lib.String.TranslateCdata from 4Suite for safer XML outout.
import time, types from xml.sax import saxutils from Ft.Xml.Lib.String import TranslateCdata LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR'] def write_xml_cdata_log_entry(level, msg): if not isinstance(msg, types.UnicodeType): raise TypeError("XML character data must be passed in as a unicode object") #Note: in a real application, I would use ISO 8601 for the date #asctime used here for simplicity now = time.asctime(time.localtime()) encoded_msg = TranslateCdata(msg) params = {'level': LOG_LEVELS[level], 'date': now, 'msg': encoded_msg} print '<entry level="%(level)s" date="%(date)s"> \ \n%(msg)s\n</entry>'% params return
The key bit is now encoded_msg = TranslateCdata(msg)
.
Which uses the 4Suite function. This takes care of the escaping, the character encoding, trapping illegal XML characters, and more. 4Suite also provides functions that prepare character data to be output inside an XML attribute or for HTML output.
But just to put another twist on the matter, even now the 4Suite developers are refining these functions for better design, and the signatures may change in future releases. Since in many cases you have a special task to fulfill, and don't want to bear all the burden of XML correctness, this reinforces the importance of relying on third-party tools.
Conclusion
Also in Python and XML |
|
Should Python and XML Coexist? |
|
So much for the notion that XML output is nothing more than an exercise for the Python
print
keyword. I haven't even plumbed all the issues involved, and I'll
return to further concerns in future articles. The main point I want to get across
is that
generating XML is not as easy as it would at first seem, and that you should use established
tools as much as possible. I have pointed out utility functions in the standard Python
library and in 4Suite. Another approach is to create a DOM tree and then serialize
it. Just
remember to always generate XML with a great deal of care and to test all output thoroughly
with reliable XML parsers. The world could certainly do with more good XML citizens.
Thanks to Mike Brown,an expert on the intersection of XML and character set arcana. He reviewed this article for technical correctness and suggested important clarifications.
Python-XML Happenings
Here is a brief on significant new happenings relevant to Python-XML development, including significant software releases. Not much to report this month.
Walter Dörwald announced version 2.0 of XIST, an XML-based extensible HTML generator written in Python. The announcement also led to sime discussion of the use of namespaces in XIST, leading to this clarification.
Henry Thompson appears to have responded to my teasing about the lack of distutils in XSV with a new release.
Resources
|