Introducing PyRXP
February 11, 2004
PyRXP is a DTD validating XML parser developed by ReportLab. It is a Python wrapper around RXP, a C parser developed by Richard Tobin and Henry Thompson of the Edinburgh Language Technology Group as the core of LT XML, "an integrated set of XML tools and a developers' tool-kit, including a C-based API". ReportLab is a vendor of database reporting software and is very well known and respected in the Python community. PyRXP is a core component of many of ReportLab's open source and commercial components. PyRXP focuses on performance above all things by using a fast C parser and by strictly building a bare-bones Python structure of tuples and string buffers from XML source. RXP and PyRXP are both distributed under the GNU General Public License.
I downloaded the full tar/gzip distribution of PyRXP 0.9 for running on Python 2.3.2. Note: the archive does not create its own directory when unpacked, so you'll want to do so by hand:
$ mkdir pyRXP-0-9
$ cd pyRXP-0-9
$ tar zxvf ../pyRXP-0-9.tgz
[SNIP]
$ python setup.py install
[SNIP]
Source XML for the documentation comes in the distribution, but I didn't see an obvious way to build it so I just downloaded the PDF documentation.
Character trouble in tag land
PyRXP builds a bare bones tuple-based Python structure from an XML instance. To get a flavor of this structure, I tried to parse the same document I've been using in recent explorations of Python-XML tools (Listing 1).
Listing 1: Sample XML file (labels.xml) containing address labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#8230;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
My attempt was the code in listing 2:
Listing 2: Simple parse of XML in a file
import pyRXP
parser = pyRXP.Parser()
fobj = open('labels.xml').read() #Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)
The result of this attempt was rather hair raising:
$ python listing2.py
Traceback (most recent call last):
  File "listing2.py", line 4, in ?
    doc = parser.parse(fobj)
pyRXP.Error: Error: 0x2026 is not a valid 8-bit XML character
 in unnamed entity at line 6 char 61 of [unknown]
error return=1
0x2026 is not a valid 8-bit XML character
Parse Failed!
The problem, besides the fact that the parser seemed to fail parsing a perfectly well-formed XML document, is that the error message is unhelpful. The phrase "valid 8-bit XML character" is meaningless. The XML character set is Unicode, with the restriction that some characters are not allowed. But there is no concept of "bits" in the idea of an XML character. Each character is merely an abstract code point. A character can be encoded into a storage format associated with a standard bit length such as UTF-8 (8 bit), but this really has nothing to do with the XML character model. To be fair, this and other concepts relating to Unicode can be rather arcane; but there are excellent resources to help clear things up, including Mike Brown's article "XML Tutorial--A reintroduction to XML with an emphasis on character encoding". For a very friendly discussion of Unicode focusing on the Python implementation there is "Unicode Support in Python (PDF)" by Marc-Andre Lemburg. I gather a lot of relevant notes on these matters in my Akara article "XML Character issues in Python".
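The distinction between an abstract character and its encoded bytes can be sketched in a few lines. This is only an illustration, and it assumes a current Python 3 interpreter rather than the Python 2.3 used elsewhere in this article; the variable names are made up for the example.

```python
# U+2026 HORIZONTAL ELLIPSIS, the character that trips up PyRXP 0.9.
ellipsis = "\u2026"

# As a character it is a single abstract code point, whatever the storage.
char_count = len(ellipsis)
code_point = ord(ellipsis)

# The number of bytes is a property of the chosen encoding, not of the
# character itself.
utf8_bytes = ellipsis.encode("utf-8")
utf16_bytes = ellipsis.encode("utf-16-be")

# ISO-8859-1 has no byte for U+2026, so a document in that encoding can
# only carry the character as a character reference.
try:
    ellipsis.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(char_count, hex(code_point), len(utf8_bytes), len(utf16_bytes), fits_in_latin1)
```

The same code point takes three bytes in UTF-8 and two in UTF-16, which is why talking about the "bit width" of an XML character says nothing useful about the character itself.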
At any rate, I pored over the PyRXP documentation expecting to find something I must have missed. I found a few properties that can be set on the parser, and the closest I found was ExpandCharacterEntities. In effect it returns a character entity such as &#8230;, the one in the sample document, as the literal sequence of seven separate characters, starting with the ampersand and ending with the semicolon. This is a serious violation of the basic principles of XML, in which &#8230; is strictly one character rather than seven; further, it doesn't help me parse the sample file properly. I then checked the ReportLab mailing lists and found others who had run into the same problem. The responses from the developers were, more or less, that PyRXP raises a fatal error when presented with XML characters with Unicode ordinal greater than 255 (U+00FF), regardless of how they are represented. The unfortunate upshot of this is that PyRXP 0.9 is not an XML parser.
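For comparison, here is a minimal sketch of what a conforming parser is expected to do with the numeric character reference for U+2026 (written &#8230; in the source). It uses the standard library's xml.etree.ElementTree purely as a stand-in, not PyRXP, and assumes a current Python 3 interpreter; the one-element document is a cut-down, hypothetical version of the sample file.

```python
import xml.etree.ElementTree as ET

# An ISO-8859-1 document that carries U+2026 as a character reference,
# just as labels.xml does.
source = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b"<quote>is its own season&#8230;</quote>"
)

quote = ET.fromstring(source)
text = quote.text

# The seven characters of the reference come back as ONE character.
last_char = text[-1]
print(repr(last_char), len("&#8230;"), len(text))
```

The reference is seven characters of markup in the serialized document but exactly one character in the parsed result, which is the behavior the XML specification requires.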
I only cover XML processing tools in this column; and, frankly, such a fundamental case of non-conformance would have been to my mind more than enough to disqualify PyRXP from discussion. Nevertheless, there was no way I was going to throw up my hands at this point. I have heard a lot of good things about PyRXP, and I'd like to be sure there is fair coverage of as broad a selection of Python-XML tools as possible. I pored through the docs again and found a bit that I'd overlooked the first time. Earlier on, in searching on whether users of the core C RXP parser also had this problem, I came across Norm Walsh's simple instruction to one such user: "I think you need to rebuild or reconfigure RXP with Unicode support. XML isn't 8-bit."
It turns out that the PyRXP developers have provided a start toward this. From the manual, "PyRXPU is the 16-bit Unicode aware version of pyRXP. It is currently only available the source distribution of pyRXP, since it is still 'alpha' quality. Please report any bugs you find with it."
It's still odd to tie the idea of bit width of a character encoding to the foundation of an XML parser (the phrase "16-bit Unicode" is almost as meaningless as "8-bit XML character") but PyRXPU seems well worth a try.
A Conformant Version of PyRXP?
It appears that, contrary to the note in the manual, PyRXPU is only available in CVS. I grabbed and built the CVS version like so:
$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab login
[SNIP]
$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab co rl_addons/pyRXP
[SNIP]
$ cd rl_addons/pyRXP
$ python setup.py install
[SNIP]
I just hit "Enter" at the "CVS password" prompt.
Listing 3: Simple parse of XML in a file, reprise
import pyRXPU
parser = pyRXPU.Parser()
fobj = open('labels.xml').read() #Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)
This time the parse is successful, and I was able to start digging into the resulting data structure as illustrated by jumping into the interpreter after running the script:
>>> import pprint
>>> pprint.pprint(doc)
(u'labels',
 None,
 [u'\n ',
  (u'label',
   {u'added': u'2003-06-20'},
   [u'\n ',
    (u'quote',
     None,
     [u'\n \n ',
      (u'emph', None, [u'Midwinter Spring'], None),
      u' is its own season\u2026\n '],
     None),
    u'\n ',
    (u'name', None, [u'Thomas Eliot'], None),
    u'\n ',
    (u'address',
     None,
     [u'\n ',
      (u'street', None, [u'3 Prufrock Lane'], None),
      u'\n ',
      (u'city', None, [u'Stamford'], None),
      u'\n ',
      (u'state', None, [u'CT'], None),
      u'\n '],
     None),
    u'\n '],
   None),
  u'\n ',
  (u'label',
   {u'added': u'2003-06-10'},
   [u'\n ',
    (u'name', None, [u'Ezra Pound'], None),
    u'\n ',
    (u'address',
     None,
     [u'\n ',
      (u'street', None, [u'45 Usura Place'], None),
      u'\n ',
      (u'city', None, [u'Hailey'], None),
      u'\n ',
      (u'state', None, [u'ID'], None),
      u'\n '],
     None),
    u'\n '],
   None),
  u'\n'],
 None)
I knew that the result would be a structure of Python primitives; thus, as in the last article, I used the pprint module to produce a representation I could follow easily. It's easy to see the basic pattern: elements become tuples with the node name as the first (Unicode) item, a dictionary of attributes or None as the second, and a list of contents or None as the third. The fourth is reserved for customized use. This data structure is quite simple, which is one of the attractions of PyRXPU; but it might be a bit cumbersome to navigate in order to extract patterns of data, especially in comparison to data binding tools.
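To give a feel for what navigating this structure involves, here is a minimal sketch of two helper functions over the (name, attributes, children, spare) tuples. The helpers iter_elements and text_of are hypothetical names invented for this example, not part of pyRXP or pyRXPU, and the tree is a hand-written, whitespace-free abbreviation of the parse result so that the code runs without the parser installed. It assumes a current Python 3 interpreter.

```python
def iter_elements(node, name):
    """Yield every element tuple called `name` at or below `node`."""
    tag, _attrs, children, _spare = node
    if tag == name:
        yield node
    for child in children or []:      # children may be None
        if isinstance(child, tuple):  # skip plain text nodes
            yield from iter_elements(child, name)


def text_of(node):
    """Concatenate all character data found below `node`."""
    parts = []
    for child in node[2] or []:
        parts.append(text_of(child) if isinstance(child, tuple) else child)
    return "".join(parts)


# Hand-built stand-in for the structure the parser returns.
doc = ("labels", None, [
    ("label", {"added": "2003-06-20"}, [
        ("name", None, ["Thomas Eliot"], None),
        ("address", None, [("city", None, ["Stamford"], None)], None),
    ], None),
    ("label", {"added": "2003-06-10"}, [
        ("name", None, ["Ezra Pound"], None),
        ("address", None, [("city", None, ["Hailey"], None)], None),
    ], None),
], None)

names = [text_of(n) for n in iter_elements(doc, "name")]
dates = [label[1]["added"] for label in iter_elements(doc, "label")]
print(names, dates)
```

Even this small example has to guard against None in the children slot and distinguish tuples from strings by type, which is the sort of bookkeeping a data binding tool would otherwise do for you.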
As you can see, all strings are Unicode objects, which is very good. From my understanding, using the production version of PyRXP you only get "classic" string objects, which I do not recommend mixing into XML processing. You can see the character that was giving the production version such fits, that \u2026. Here it is properly treated. Nevertheless, the strange bit about "16-bit Unicode" made me wonder whether there were also any such conformance problems in PyRXPU. Certainly XML allows numerous characters above code point 65535. The following is the relevant production from the XML 1.0 spec:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
The accompanying comment is "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." Note that this permissiveness will open up even more now that XML 1.1 has just become a full W3C recommendation. Some formerly forbidden characters including the range from #x1 through #x8 have been allowed, strictly in the form of character references.
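The Char production of XML 1.0 translates almost directly into a range test. The function name is_xml10_char is made up for this sketch, which covers only the XML 1.0 rule quoted above, not the relaxed XML 1.1 rules, and assumes a current Python 3 interpreter.

```python
def is_xml10_char(codepoint):
    """Return True if `codepoint` matches production [2] Char of XML 1.0."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


# A few probes: the ellipsis from labels.xml, a character beyond 65535,
# a surrogate, the two excluded noncharacters, and a control character.
for cp in (0x2026, 0x10000, 0xD800, 0xFFFE, 0xFFFF, 0x1):
    print(hex(cp), is_xml10_char(cp))
```

Nothing in the production singles out 255 or 65535 as a boundary, so a parser that stops at either limit rejects characters the specification explicitly allows.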
I tested the treatment of very high Unicode characters in PyRXPU, and it does seem to handle them well enough. If you're an archaeologist with an interest in the Mycenaean culture you might have an interest in Unicode character U+10000, "LINEAR B SYLLABLE B008 A", which is used in the XML document parsed in the following snippet:
>>> import pyRXPU
>>> p = pyRXPU.Parser()
>>> p.parse("<spam>Very high Unicode char: 𐀀</spam>")
(u'spam', None, [u'Very high Unicode char: \U00010000'], None)
As you can see the character value becomes \U00010000 in Python. Python gets most Unicode matters right and deals with such high characters with aplomb whether you compile Python to store Unicode in 16 bits or 32 bits (again the bit width is not relevant to the Unicode character whatsoever but is merely a property of the chosen storage or encoding). It's good to have this confidence that PyRXPU is a conforming XML parser.
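The point about storage width can be made concrete with U+10000 itself. The sketch below assumes a current Python 3 interpreter, where the narrow versus wide build distinction mentioned above no longer exists, and shows how the same code point is stored as a surrogate pair in UTF-16 and as a single unit in UTF-32.

```python
linear_b = "\U00010000"  # LINEAR B SYLLABLE B008 A

# UTF-16 needs two 16-bit code units (a surrogate pair) for this character.
utf16 = linear_b.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# Recombining the pair gives back the original code point.
recombined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# UTF-32 stores the same code point in one 32-bit unit.
utf32 = linear_b.encode("utf-32-be")

print(hex(high), hex(low), hex(recombined), utf32.hex())
```

In both cases the character is the same; only the storage differs, which is why "16-bit Unicode" describes an encoding choice rather than a limit on which characters can be represented.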
Benchmarks: A Lawyer's Best Friend
ReportLab bills PyRXP as "the fastest validating XML parser available for Python, and quite possibly anywhere..." David Mertz in an independent review also lauds PyRXP's speed but does not seem to have discovered its erroneous handling of characters. I think this is a good example of why benchmarking is a very slippery exercise. It's really inappropriate to even compare PyRXP to any other XML parser: it's not a conformant XML parser and thus not an XML parser at all. As many implementors tell you, it is often the odd corners of conformance that are behind the most significant performance losses. Standardization means we sacrifice some local optimization in order to gain flexibility and interoperability. By refusing to accept a very large class of quite valid XML instances, PyRXP rather does a disservice to the entire idea of XML. I have produced tools that do not fully conform to a target standard, but in such cases I follow the usual convention that such deviations are treated as bugs. I take a rather dim view of the situation in PyRXP given that
- the developers have publicly refused to remedy the non-conformance; and
- the developers trumpet the speed and low memory footprint of PyRXP, even though these advantages are only made possible by scorning conformance.
I found threads discussing the development of the PyRXPU variant, which actually does seem to be XML conformant. As I expected, it is some two times less efficient in speed and memory footprint than PyRXP. The only difference is in proper treatment of Unicode, and this demonstrates my point about the cost of conformance. I have a lot of respect for the developers of PyRXP, and I hate to be so sharp about this matter, but I think it's quite serious and merits very unambiguous statement.
I'd also like to ask that anyone working on benchmarks of XML processing, which are useful if well done, run the tests on a variety of hardware and operating systems, and examine a variety of XML files rather than focusing on a single one. Numerous characteristics of XML files can affect parsing and processing speed, including:
- The preponderance of elements versus attributes versus text (and even comments and processing instructions)
- Any repetition of element or attribute names, values and text content
- The distribution of white space
- The character encoding
- The use of character and general entities
- The input source (in-memory, string, file, URL, etc.)
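A benchmark that takes some of the characteristics listed above into account might start out like the following sketch. It is only a skeleton under stated assumptions: the document generator and its labels are invented for the example, the parser under test is the standard library's xml.etree.ElementTree as a neutral stand-in, and a serious benchmark would also vary encodings, input sources, hardware and operating systems. It assumes a current Python 3 interpreter.

```python
import timeit
import xml.etree.ElementTree as ET


def make_documents(n=200):
    """Build small synthetic documents that stress different features."""
    return {
        "element-heavy": "<r>" + "<e><f/></e>" * n + "</r>",
        "attribute-heavy": "<r>" + '<e a="1" b="2" c="3"/>' * n + "</r>",
        "text-heavy": "<r>" + "lorem ipsum " * (n * 4) + "</r>",
        "reference-heavy": "<r>" + "&#8230;&amp;" * n + "</r>",
    }


def benchmark(parse, documents, repeat=3, number=20):
    """Return the best observed seconds per parse for each document."""
    results = {}
    for label, source in documents.items():
        timings = timeit.repeat(
            lambda source=source: parse(source), repeat=repeat, number=number
        )
        results[label] = min(timings) / number
    return results


documents = make_documents()
results = benchmark(ET.fromstring, documents)
for label, seconds in sorted(results.items()):
    print("%-16s %.6f s per parse" % (label, seconds))
```

Reporting one figure per document shape, rather than a single aggregate number, makes it easier to see where a given parser is strong or weak.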
I do want to point out that I'm one of the developers of cDomlette, which one might consider a competing package. This might seem a temptation to take an especially hard line with competing tools, but then again in this column I have covered the likes of ElementTree, gnosis.xml.objectify, and libxml and have never before had such a fundamental problem with any package.
Conclusion
My recommendation is to consider PyRXPU, but to avoid plain PyRXP. I hope that the former version becomes the default so that this confusing situation can be resolved. PyRXPU produces a simple and highly Pythonic data structure, though one that might be a bit tricky to navigate correctly in code. It operates quickly and offers a low memory footprint.
Development activity seems to be picking up again in the Python-XML world. Peter Yared announced Python XML Marshaller 0.2, a new Python data binding for XML available under the PSF Python license. It includes some WXS support and can generate WXS from Python data structures for round-trip support. It also has some features for customizing the binding. See the announcement.
Walter Dorwald announced XIST 2.4. Billed as an "object oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. This release features some API improvements, bug fixes, and a new find function for searching attributes. See the announcement.
Magnus Lie Hetland announced Atox 0.1, which allows you to write custom scripts for converting plain text into XML. You define the text to XML binding using a simple XML language. It's meant to be used from the command line. See the full announcement.
Arnold deVos announced GraphPath, a little XPath-like language for analysing graph-structured data, especially RDF. The implementation is Python and works with rdflib or the Python binding of Redland. It includes a query evaluator and a goal-driven inference engine. I found this announcement interesting because GraphPath is reminiscent of our early proposals while developing the Versa RDF query language at Fourthought. I think this is an important approach to RDF query and superior to the many SQL-like query languages. It's good to see more than one development along these lines.