Introducing PyXML
September 25, 2002
As you probably noticed in the table of Python software in the previous column, many of the Python tools for XML processing are available in the PyXML package. There are also XML libraries built into Python, but these are pretty well covered in the official documentation.
Happenings...
One of the things I'm going to do in these columns is provide brief information on significant new happenings relevant to Python-XML development, including significant software releases.
PyXML 0.8.1 has been released. Major changes include updated DOM support and the disabling of the bundled XSLT library from the default install. More on PyXML below.
Fredrik "effbot" Lundh has kicked off a project to develop a graphical RSS newsreader in Python. Also, Mark Nottingham developed RSS.py, a Python module for processing RSS 0.9.x and 1.0 feeds. Mark is the author of an excellent RSS tutorial.
Eric Freese has released SemanText, a program for developing and managing semantic networks using XML Topic Maps. It can also build such topic maps contextually from general XML formats. It comes with a GUI for knowledge-base management.
There has been some talk on the XML-SIG about developing a general type library architecture in Python useful for plugging data types into schema, Web services and other projects.
Finally, in the previous column, I mentioned that the cDomlette XML parser supports XInclude and XML Base, but I neglected to mention that libxml's parser does as well. Both libxml and xmlproc also support XML Catalogs. Recent builds of 4Suite add XML and OASIS catalog support to cDomlette.
Setting up PyXML
PyXML is available as source, Windows installer, and RPM. Python itself is the only requirement. Any version later than 2.0 will do, but I recommend 2.2.1 or later. If your operating environment has a C compiler, it's easy to build from source. Download the archive, unpack it to a suitable location and run
python setup.py install
This is the usual way to set up Python programs. PyXML is automatically installed to the package directory of the python executable that is used to run the install command. You will see a lot of output as it compiles and copies files. If you'd rather suppress this, use "python setup.py --quiet install".
PyXML overlays the xml module that comes with Python. It replaces all the contents of the Python module with its own versions. This is usually OK because the two code bases are the same, except that PyXML has usually incorporated more features and bug fixes. Of course, it's also possible that PyXML has introduced new bugs. If you ever need to uninstall PyXML, just remove the directory "_xmlplus" in the "site-packages" directory of your Python installation. The original Python xml module will be left untouched.
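To check whether the overlay is active in a given interpreter, one can look at where the xml package was loaded from. The following is a minimal sketch. It assumes that an installed PyXML shows up with "_xmlplus" somewhere in the package path, as the uninstall note above suggests; the function name xml_package_origin is made up for illustration.

```python
import xml


def xml_package_origin():
    """Report whether the xml package appears to come from PyXML."""
    # Assumption: PyXML's overlay lives in a directory named "_xmlplus",
    # so that name appears in the path of the loaded xml package.
    path = getattr(xml, "__file__", None) or ""
    if "_xmlplus" in path:
        return "PyXML"
    return "standard library"


print(xml_package_origin(), "-", getattr(xml, "__file__", None))
```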
SAX
The SAX support in PyXML is basically the same code as the built-in SAX library in Python. The main added feature is a really clever module, xml.sax.writers, which provides facilities for re-serializing SAX events to XML and even SGML. This is especially useful as the end point of a SAX filter chain, and I'll cover it in more detail in a future article on Python SAX filters. PyXML also adds the xmlproc validating XML parser, which can be selected using the make_parser function. Base Python does not support validation.
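The idea of re-serializing SAX events can be tried without PyXML. The following minimal sketch uses XMLGenerator from the standard library's xml.sax.saxutils as a stand-in for xml.sax.writers (whose own API is not shown here), with the parser obtained from make_parser. The sample document is made up for illustration.

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLGenerator

SOURCE = b"<verse><line>Needing more roots</line></verse>"

# Collect the re-serialized XML in an in-memory text buffer
out = io.StringIO()

# make_parser() returns the default SAX parser (expat in base Python)
parser = make_parser()

# XMLGenerator is a content handler that writes each SAX event back out as XML
parser.setContentHandler(XMLGenerator(out, encoding="utf-8"))
parser.parse(io.BytesIO(SOURCE))

result = out.getvalue()
print(result)
```

Because the generator is just another content handler, it could equally sit at the end of a chain of filters rather than directly behind the parser.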
DOM
PyXML builds a good deal of machinery over the skeleton of the DOM Node class that comes with Python. First of all, there is 4DOM, a big DOM implementation which tries to be faithful to the DOM spec rather than naturally Pythonic. It supports DOM Level 2, both XML (core) and HTML modules, including events, ranges, and traversal. PyXML also provides more frequently updated versions of minidom and pulldom. Because there are many DOM implementations available for Python, PyXML also adds a system in xml.dom.domreg for choosing between DOM implementations according to desired features. Only minidom and 4DOM are currently registered, so it is not yet generally useful, but this should change. There is also a lot of material in the xml.dom.ext module, which is set aside, generally, for DOM extensions specific to PyXML. These extensions include mechanisms for reading and writing serialized XML that predate the DOM Level 3 facilities for this.
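As a rough illustration of choosing an implementation by name or by feature, here is a minimal sketch using getDOMImplementation, which current Python exposes from xml.dom (backed by xml.dom.domreg). It assumes only the standard library is present, so minidom is the implementation that ends up being selected; with PyXML installed the result could differ.

```python
from xml.dom import getDOMImplementation

# Ask for an implementation by its registered name
by_name = getDOMImplementation("minidom")

# Or ask for any implementation that claims the listed features
by_feature = getDOMImplementation(features="core 2.0")

# Either object can then be used to create documents
doc = by_feature.createDocument(None, "message", None)
print(doc.documentElement.tagName)
```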
Creating 4DOM nodes is a matter of using the modules in xml.dom.ext.reader. The general pattern is to create a reader object, which can then be used to parse from multiple sources. For parsing XML, one would usually use xml.dom.ext.reader.Sax2, and for HTML, xml.dom.ext.reader.HtmlLib. Listing 1 demonstrates reading XML.
from xml.dom.ext.reader import Sax2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okigbo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

# Create an XML reader object
reader = Sax2.Reader()

# Create a 4DOM document node parsed from XML in a string
doc_node = reader.fromString(DOC)

# You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement

# And you can even use "Pythonic" shortcuts for things like
# node lists and named node maps.
# The first child of the verse element is a white space text node.
# The second is the attribution element.
attribution_element = verse_element.childNodes[1]

# attribution_string becomes "Christopher Okigbo"
attribution_string = attribution_element.firstChild.data
Listing 2 demonstrates reading HTML. You need to be connected to the Internet to run it successfully.
Listing 2: Creating a 4DOM HTML node by reading from a URL

from xml.dom.ext.reader import HtmlLib

# Create an HTML reader object
reader = HtmlLib.Reader()

# Create a 4DOM document node parsed from HTML at a URL
doc_node = reader.fromUri("http://www.python.org")

# Get the title of the HTML document
title_elem = doc_node.documentElement.getElementsByTagName("TITLE")[0]

# title_string becomes "Python Language Website"
title_string = title_elem.firstChild.data
The HTML parser is pretty forgiving, but not as much so as most Web browsers. Some non-standard HTML will cause errors.
One can also output XML using the xml.dom.ext.Print and xml.dom.ext.PrettyPrint functions. One can output proper XHTML from appropriate XML and HTML DOM nodes using xml.dom.ext.XHtmlPrint and xml.dom.ext.XHtmlPrettyPrint. Pure white space nodes between elements can be removed using xml.dom.ext.StripXml and xml.dom.ext.StripHtml. There are other small utility functions in this module. Listing 3 is a continuation of Listing 1 which takes the document node that was read, strips it of white space, and then re-serializes the result to XML.
from xml.dom.ext import StripXml, Print

# Strip the white space nodes in place and return the same node
StripXml(doc_node)

# Print the node as serialized XML to stdout
Print(doc_node)

# Write the node as serialized XML to a file
f = open("tmp.xml", "w")
Print(doc_node, stream=f)
f.close()
Listing 4 illustrates creating a tiny XML document from scratch using standard DOM API, and printing the serialized result.
Listing 4: Creating a 4DOM document bit by bit and writing the result

from xml.dom import implementation
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE
from xml.dom.ext import Print

# Create a document type node using the doctype name "message",
# a blank system ID and a blank public ID (i.e. no DTD information)
doctype = implementation.createDocumentType("message", None, None)

# Create a document node, which also creates a document element node.
# For the element, use a blank namespace URI and local name "message"
doc = implementation.createDocument(EMPTY_NAMESPACE, "message", doctype)

# Get the document element
msg_elem = doc.documentElement

# Create an xml:lang attribute on the new element
msg_elem.setAttributeNS(XML_NAMESPACE, "xml:lang", "en")

# Create a text node with some data in it
new_text = doc.createTextNode("You need Python")

# Add the new text node to the document element
msg_elem.appendChild(new_text)

# Print out the result
Print(doc)
Notice the use of EMPTY_NAMESPACE or None for blank URIs and ID fields. This is a Python-XML convention. Rather than using the empty string, which is, after all, a valid URI reference, None is used (EMPTY_NAMESPACE is merely an alias for None). This has the added value of being a bit more readable. If you run this, you'll see that no declaration is given for the XML namespace. This is because of the special status of that namespace. If you created elements or attributes in other namespaces, those namespace declarations would be rendered in the output.
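The None convention is not specific to 4DOM. The following minimal sketch repeats the core of Listing 4 with the standard library's minidom instead (an assumption made so the example runs without PyXML) and checks that EMPTY_NAMESPACE really is None. How minidom serializes other namespaces may differ from 4DOM's Print, so only the xml:lang case is shown.

```python
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

# None (EMPTY_NAMESPACE) means "no namespace"; the final None means no doctype
doc = impl.createDocument(EMPTY_NAMESPACE, "message", None)
msg_elem = doc.documentElement
msg_elem.setAttributeNS(XML_NAMESPACE, "xml:lang", "en")

# Namespace-aware lookups take the namespace URI plus the local name
lang = msg_elem.getAttributeNS(XML_NAMESPACE, "lang")

serialized = doc.toxml()
print(EMPTY_NAMESPACE is None, msg_elem.namespaceURI, lang)
print(serialized)
```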
4DOM is a useful tool for learning about DOM in general, but it's not a practical DOM for most uses. The problem is that it is very slow and memory intensive. A lot of effort goes into handling DOM arcana; in particular, the event system is a significant drag on performance, though it has some very interesting uses. I am one of the original authors of 4DOM, and yet I rarely use it any more. There are other, much faster DOM implementations which I'll be covering in future articles. Luckily, the lessons you learn experimenting with 4DOM are mostly applicable to other Python DOMs.
Canonicalization
Two XML files can be different and yet have the same effective content to an XML processor. This is because the XML specification treats some differences as insignificant, such as the order of attributes, the choice of single or double quotes, and some differences in the use of character entities. Canonicalization, popularly abbreviated as c14n, is a mechanism for serializing XML which normalizes all these differences. As such, c14n can be used to make XML files easier to compare. One important application of this is security. You may want to use technology to ensure that an XML file has not been tampered with, but you might want this check to ignore insignificant lexical differences. The xml.dom.ext.c14n module provides canonicalization support for Python. It provides a Canonicalize function which takes a DOM node (most implementations should work), and writes out canonical XML.
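To make the comparison and tamper-check idea concrete, here is a minimal sketch. It does not use PyXML's Canonicalize; instead it assumes Python 3.8 or later and uses the canonicalize function from xml.etree.ElementTree, which implements C14N 2.0 and takes serialized XML rather than a DOM node. The sample documents and the digest helper are made up for illustration.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Same content, but different attribute order and quoting
doc_a = "<verse lang='en' id=\"v1\"><line>Needing more roots</line></verse>"
doc_b = '<verse id="v1" lang="en"><line>Needing more roots</line></verse>'
# A genuinely different document
doc_c = '<verse id="v1" lang="en"><line>Needing fewer roots</line></verse>'

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)
canon_c = canonicalize(doc_c)


def digest(text):
    """Hash canonical XML so that only significant changes alter the result."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print("raw text equal:      ", doc_a == doc_b)
print("canonical form equal:", canon_a == canon_b)
print("digest detects edit: ", digest(canon_a) != digest(canon_c))
```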
XPath
PyXML includes XPath and XSLT implementations written completely in Python, but as of this writing, the XSLT module is still being integrated into PyXML and does not really work. If you need XSLT processing, you will want to get Python-libxslt, 4Suite, Pyana, or one of the other packages I mentioned in the last installment. The XPath support, however, does work and operates on DOM nodes. In fact, it can be a very handy shortcut compared to navigating with DOM methods. The following snippet uses an XPath expression to retrieve a list of all elements in a DOM document which are in the XSLT namespace. If you want to run it, first prepend code that defines "doc" as a DOM document node object.
from xml.ns import XSLT
# The XSLT namespace http://www.w3.org/1999/XSL/Transform
NS = XSLT.BASE

from xml.xpath import Context, Evaluate

# Create an XPath context with the given DOM node,
# with no other nodes in the context list
# (list size 1, current position 1),
# and the given prefix/namespace mapping
con = Context.Context(doc, 1, 1, processorNss={"xsl": NS})

# Evaluate the XPath expression and return the resulting
# Python list of nodes
result = Evaluate("//xsl:*", context=con)
XPath expressions can also return other data types: strings, numbers, and booleans. The first two are returned as regular Python strings and floats. Booleans are returned as instances of a special class, xml.utils.boolean. This will probably change in Python 2.3, which has a built-in boolean type.
If you are making extensive use of XPath, you might still want to consider the implementation in 4Suite, which is a faster version of the one in PyXML, with more bug fixes as well.
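For readers without PyXML or 4Suite at hand, a similar namespace query can be approximated with the standard library. The sketch below is not the PyXML XPath API: it assumes Python 3.8 or later and uses ElementTree, which supports only a subset of XPath and returns lists of elements rather than strings, numbers or booleans. The sample stylesheet is made up for illustration.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

SHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html><body><xsl:value-of select="title"/></body></html>
  </xsl:template>
</xsl:stylesheet>"""

root = ET.fromstring(SHEET)

# Prefix-to-namespace mapping, playing the role of processorNss
namespaces = {"xsl": XSLT_NS}

# ".//xsl:*" selects descendants of the root in the XSLT namespace.
# The path is relative, so the root element itself is not included.
found = root.findall(".//xsl:*", namespaces)

# ElementTree spells qualified names as "{namespace-uri}local-name"
local_names = [elem.tag.split("}", 1)[1] for elem in found]
print(local_names)
```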
Et cetera
I mentioned some of the other modules in PyXML in the last installment, including the xmlproc validating parser, in module xml.parsers.xmlproc; Quick Parsing for XML, in module xml.utils.qp_xml; and the WDDX library, in module xml.marshal.wddx. Marshaling is the process of representing arbitrary Python objects as XML, and there is also a general marshaling facility in the xml.marshal.generic module. It has a similar interface to the standard pickle module. Listing 6 marshals a small Python dictionary to XML.
from xml.marshal.generic import Marshaller

marshal = Marshaller()
obj = {1: [2, 3], 'a': 'b'}

# Dump to a string
xml_form = marshal.dumps(obj)
The created XML looks as follows (line wrapping and indentation added for clarity):
<?xml version="1.0"?>
<marshal>
  <dictionary id="i2">
    <int>1</int>
    <list id="i3"><int>2</int><int>3</int></list>
    <string>a</string>
    <string>b</string>
  </dictionary>
</marshal>
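To give a feel for what a marshaler does internally, here is a toy sketch that produces a similar shape of XML using the standard library's ElementTree. It is not the xml.marshal.generic implementation: the helper names to_element and dumps are hypothetical, only ints, strings, lists and dictionaries are handled, and the id attributes seen above are not reproduced.

```python
import xml.etree.ElementTree as ET


def to_element(obj):
    """Hypothetical helper: map a Python object to an XML element."""
    if isinstance(obj, int):
        elem = ET.Element("int")
        elem.text = str(obj)
    elif isinstance(obj, str):
        elem = ET.Element("string")
        elem.text = obj
    elif isinstance(obj, list):
        elem = ET.Element("list")
        for item in obj:
            elem.append(to_element(item))
    elif isinstance(obj, dict):
        # Keys and values are written as alternating children
        elem = ET.Element("dictionary")
        for key, value in obj.items():
            elem.append(to_element(key))
            elem.append(to_element(value))
    else:
        raise TypeError("unsupported type: %r" % type(obj))
    return elem


def dumps(obj):
    """Wrap the converted object in a <marshal> root and serialize it."""
    root = ET.Element("marshal")
    root.append(to_element(obj))
    return ET.tostring(root, encoding="unicode")


xml_form = dumps({1: [2, 3], "a": "b"})
print(xml_form)
```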
Wrap up
You can find more code examples in the demos directory of the PyXML package. The place to discuss PyXML or to report problems is the Python XML SIG mailing list.
In the next article, I'll offer a tour of 4Suite, which provides enhanced versions of some of PyXML's facilities, as well as some other unique features.