More Unicode Secrets
June 15, 2005
In the last article I started a discussion of the Unicode facilities in Python, especially with XML processing in mind. In this article I continue the discussion. I do want to mention that I don't claim these articles to be an exhaustive catalogue of Unicode APIs; I focus on the Unicode APIs I tend to use most in my own XML processing. You should follow up these articles by looking at the further resources I mentioned in the first article.
I also want to mention another general principle to keep in mind: if possible, use a Python install compiled to use UCS4 character storage. When you configure Python before building it, you can choose whether it stores Unicode characters using (informally speaking) a two-byte or a four-byte encoding, UCS2 or UCS4. UCS2 is the default, but you can override this by passing the --enable-unicode=ucs4 flag to configure. UCS4 uses more space to store characters, but there are some problems for XML processing in UCS2, which the Python core team is reluctant to address because the only known fixes would be too much of a burden on performance. Luckily, most distributors have heeded this advice and ship UCS4 builds of Python.
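You can check which storage your own build uses without recompiling: sys.maxunicode is the highest code point the build can represent, so a quick test distinguishes the two. Here is a small sketch using only the standard sys module:

import sys

#UCS2 ("narrow") builds top out at 0xFFFF; UCS4 ("wide") builds reach 0x10FFFF
if sys.maxunicode == 0xFFFF:
    print "This is a UCS2 (narrow) build"
else:
    print "This is a UCS4 (wide) build"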
Wrapping Files
In the last article I showed how to manage conversions from strings to Unicode objects. In dealing with XML APIs you often deal with file-like objects (stream objects) as well. Most file systems and stream representations are byte-oriented rather than character-oriented, which means that Unicode must be encoded for file output and that file input must be decoded for interpretation as Unicode. Python provides facilities for wrapping stream objects so that such conversions are largely transparent. Consider the codecs.open function.
import codecs
f = codecs.open('utf8file.txt', 'w', 'utf-8')
f.write(u'abc\u2026')
f.close()
The first two arguments to codecs.open are just like the arguments to the built-in function open. The third argument is the encoding name. The return value is the open, wrapped file object. You then use the write method, passing in Unicode objects, which are encoded as specified and written to the file. I can't possibly reiterate the distinction between bytes and characters enough. Look closely at what is written to the file in the snippet above.
>>> len(u'abc\u2026')
4
>>>
There are four characters: three lowercase letters and the horizontal ellipsis symbol.
Examine the resulting file. I use hexdump on Linux. There are many similar utilities on all operating systems.
$ hexdump -c utf8file.txt
0000000   a   b   c 342 200 246
0000006
This means that there are six bytes in the file. The first three are as you would expect, and the second three are all used to encode a single Unicode character in UTF-8 form (the bytes are given in octal form above; in hex form they are e2 80 a6). If you were to read this file with a tool that was not aware that this is a UTF-8 encoded file, it might misinterpret the contents, which is a hard problem overall in dealing with encoded files. (See Rick Jelliffe's article, referenced in the sidebar, for more discussion of this issue.)
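To see what such misinterpretation can look like, here is a quick sketch: decode the same six bytes as if they were ISO-8859-1 (an encoding chosen purely for illustration) and the ellipsis turns into three nonsense characters.

>>> bytes = open('utf8file.txt', 'rb').read()
>>> bytes.decode('utf-8')
u'abc\u2026'
>>> bytes.decode('iso-8859-1')
u'abc\xe2\x80\xa6'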
Understanding BOMs
Some encodings have additional details you have to keep in mind. The following code creates a file with the same characters, but encoded in UTF-16.
import codecs
f = codecs.open('utf16file.txt', 'w', 'utf-16')
f.write(u'abc\u2026')
f.close()
Examine the contents of the resulting file. If you're using hexdump, this time it's actually more useful to use a different (hexadecimally-based) output formatting option.
$ hexdump -C utf16file.txt
00000000  ff fe 61 00 62 00 63 00  26 20                    |..a.b.c.& |
0000000a
There are 10 bytes in this case. In UTF-16 most characters are encoded in two bytes each. The four Unicode characters are encoded into eight bytes, which are the last eight in the file. That leaves the first two bytes unaccounted for. Unicode has a means of flagging an encoded stream in order to specify the order in which characters should be read from bytes. This flag takes the form of a specially encoded character code point called the byte order mark (BOM). It is necessary in part because different machines use different means of ordering "words" (pairs of consecutive bytes starting at even machine addresses) and "double words" (pairs of consecutive words starting at machine addresses divisible by four). The difference in word order is all that is relevant in the case of UTF-16.
If you were to place the latter eight bytes from the above example in a file and send it from a machine with one byte ordering to a machine with another type of ordering, programming tools (including Python code) would read the characters backwards, scrambling the contents. Unicode uses BOMs to mark byte order so that machines with different ordering will be able to figure out the right way to read characters. The BOM for UTF-16 comprises the bytes ff and fe, which completes the puzzle of the contents of the file generated in the above example. The relative position of the ff byte signals the least significant position, and fe signals the most significant. You can see how this works when looking at the next word, 61 00. By following the BOM you can tell that 61 is least significant and 00 is most significant. This happens to be what is called little-endian byte order (which is usual for Intel machines). Many other machines, including those based on Motorola microprocessors, use big-endian byte order, in which case the order would be reversed in the BOM, as well as in all the other characters. Unicode tools know how to look for and interpret the BOM in files, and the above file contents should be properly interpreted by any UTF-16-aware tool, even in a language other than Python.
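If you want to verify the BOM yourself from Python, the codecs module exposes the various BOM byte sequences as constants. Here is a small sketch, reusing the utf16file.txt written above, that checks which one the file starts with:

import codecs

raw = open('utf16file.txt', 'rb').read()
print repr(raw[:2])                         #'\xff\xfe' on a little-endian machine
print raw.startswith(codecs.BOM_UTF16_LE)   #True here; a big-endian machine writes BOM_UTF16_BE instead
print repr(codecs.BOM_UTF16_BE)             #'\xfe\xff'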
Deciding which encoding to choose is a very complex issue, although I recommend that you stick to UTF-8 and UTF-16 for uses associated with XML processing. One consideration that might help you choose between these two encodings is that UTF-8 tends to use fewer bytes when encoding text heavy in European and Middle-Eastern characters, and some Asian scripts. UTF-16 tends to use fewer bytes when encoding text heavy in Chinese, Japanese, Korean, Vietnamese (the "CJKV" languages) and the like.
You can use codecs.open again for reading the files created above:
import codecs
f = codecs.open('utf8file.txt', 'r', 'utf-8')
u1 = f.read()
f.close()
f = codecs.open('utf16file.txt', 'r', 'utf-16')
u2 = f.read()
f.close()
assert u1 == u2
Again Python takes care of all the BOM details transparently.
Wrapping File-like Objects
codecs.open does the trick for wrapping files, but not other types of stream objects (such as sockets or StringIO string buffers). You can handle these using wrappers you obtain from the codecs.lookup function. In the last article I showed how to use this function to get encoding and decoding routines (the first two items in the returned tuple).
import codecs
import cStringIO

enc, dec, reader, writer = codecs.lookup('utf-8')

buffer = cStringIO.StringIO()
#Wrap the buffer for automatic encoding
wbuffer = writer(buffer)
content = u'abc\u2026'
wbuffer.write(content)
bytes = buffer.getvalue()

#Create the buffer afresh, with the bytes written out
buffer = cStringIO.StringIO(bytes)
#Wrap the buffer for automatic decoding
rbuffer = reader(buffer)
content = rbuffer.read()
print repr(content)
In this example I've completed a round trip from a Unicode object to an encoded byte string, which was built using a StringIO object, and back to a Unicode object read in from the byte string.
If you need to use just one of these items from codecs.lookup, and don't want to bother with the other three, you can get it directly using the functions codecs.getencoder, codecs.getdecoder, codecs.getreader, and codecs.getwriter.
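For example, if all you want is the writer, codecs.getwriter saves you from unpacking the full lookup tuple. A brief sketch, again using a cStringIO buffer as in the round-trip example above:

import codecs
import cStringIO

buffer = cStringIO.StringIO()
#codecs.getwriter returns a StreamWriter factory for the named encoding
wbuffer = codecs.getwriter('utf-8')(buffer)
wbuffer.write(u'abc\u2026')
print repr(buffer.getvalue())   #'abc\xe2\x80\xa6'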
If you need to deal with a stream object you can read and write without having to close and reopen it (in a database storage scenario, for example), you'll want to look into the class codecs.StreamReaderWriter, which wraps separate codec reader and writer objects to provide a combination object.
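Here is a minimal sketch of how StreamReaderWriter might be used, once more over a cStringIO buffer; the reader and writer factories come from the same codecs.lookup call shown earlier:

import codecs
import cStringIO

enc, dec, reader, writer = codecs.lookup('utf-8')
buffer = cStringIO.StringIO()
#Combine a reader and a writer over the same underlying stream
rwbuffer = codecs.StreamReaderWriter(buffer, reader, writer)
rwbuffer.write(u'abc\u2026')
rwbuffer.seek(0)
print repr(rwbuffer.read())   #u'abc\u2026'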
Unicode Character Representation in Python and XML
XML and Python have different means of representing characters according to their Unicode code points. You have seen the horizontal ellipsis character above in Python Unicode form, \u2026, where the "2026" is the character ordinal in hexadecimal. This is a 16-bit Python Unicode character escape. You can also use a 32-bit escape, marked by a capital "U": \U00002026. In XML you either use a decimal character escape format, &#8230;, where "8230" is just hex "2026" in decimal, or you can use hex directly: &#x2026;. Notice the added "x".
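The two Python escape forms and the two XML escape forms all name the same code point, as a quick interactive check shows (a small sketch; only built-ins are used here):

>>> u'\u2026' == u'\U00002026'
True
>>> ord(u'\u2026')        #decimal ordinal, as used in &#8230;
8230
>>> hex(ord(u'\u2026'))   #hex ordinal, as used in &#x2026;
'0x2026'
>>> unichr(8230)
u'\u2026'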
In XML you would use these escapes when you are using an encoding that does not allow you to enter a character literally. As an example, XML allows you to include an ellipsis character even in a document that is encoded in plain ASCII, as illustrated in Listing 1. Since there is no way to express the character with code point 2026 (hex) in ASCII, I use a character escape. A conforming XML application must be able to handle this document, reporting the right Unicode for the escaped character (and this is another good test for conformance of your tools).
Listing 1: XML file in ASCII encoding that uses a high character

<?xml version='1.0' encoding='us-ascii'?>
<doc>abc&#x2026;</doc>
Python can take care of such escaping for you. If you want to write out XML text, and you're using an encoding (ASCII, ISO-8859-1, EUC-JP, cp1252, and so on) that may not include all valid XML characters, you can use a facility of Python codecs for specifying what to do on encoding errors.
>>> import codecs
>>> enc = codecs.getencoder('us-ascii')
>>> print enc(u'abc\u2026')[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)
You can avoid this error by specifying 'xmlcharrefreplace' as the error handler.
>>> print enc(u'abc\u2026', 'xmlcharrefreplace')[0]
abc&#8230;
There are other available error handlers, but they are not as interesting for XML processing.
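For completeness, the standard handlers include 'strict' (the default, which raises the exception you saw above), 'replace', 'ignore', and 'backslashreplace'. A quick sketch of the latter three with the same ASCII encoder:

>>> print enc(u'abc\u2026', 'replace')[0]
abc?
>>> print enc(u'abc\u2026', 'ignore')[0]
abc
>>> print enc(u'abc\u2026', 'backslashreplace')[0]
abc\u2026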
Conclusion
Let me reiterate that for each of the areas of interest I've covered in Python's Unicode support, there are additional nuances and possibilities that you might find useful. I've generally restricted the discussion to techniques that I have found useful when processing XML, and you should read and explore further in order to uncover even more Unicode secrets. Let me also say that even though some of the techniques I've gone over will enable you to generate correct XML, there is more to well-formedness than just getting the Unicode character model right. For example, there are some Unicode characters that are not allowed in XML documents, even in escaped form. I still recommend that you use one of the many tools I've discussed in this column for generating XML output.
It's quiet time again in the Python-XML community. I did present some code snippets for reading a directory subtree and generating an XML representation (see "XML recursive directory listing, part 2"), as well as some Python/Amara equivalents of XQuery and XSLT 2.0 code. There has also been a lot of buzz about Google Sitemaps (currently in beta). Web site owners can create an XML representation of their site, including indicators of updated content. The Google crawlers then use this information to improve coverage of the indexed Web sites. The relevance to this column is that Google has developed sitemap_gen.py, a Python script that "analyzes your web server and generates one or more Sitemap files. These files are XML listings of content you make available on your web server. The files can then be directly submitted to Google." The code uses plain byte string buffer write operations to generate XML. I don't recommend this practice in general, but it seems that the subset of data the Google script includes in the XML file (URLs and last-modified dates) is safely in the ASCII subset. (Although as IRIs become more prevalent, this assumption may prove fragile.) It also uses xml.sax and minidom to read XML (mostly for config files in the former case and examples for testing in the latter).