XML Namespaces Support in Python Tools, Part Two
May 13, 2004
In the last article I discussed namespace handling in Python 2.3's SAX and minidom libraries. As I pointed out, there are a lot of pitfalls and oddities involved with processing namespaces, and I will continue to give the same treatment to the namespace support in third-party Python libraries. In this article I shall focus on the various libraries packaged in 4Suite. If you need background on 4Suite, see my earlier article "A Tour of 4Suite". I did briefly cover how to express namespaces for use in 4XPath in that article, but in this one I will explore different angles on the topic.
The Namespace Torture Sample, Revisited
Listing 1 is the same sample document I used in the last article. If you haven't read that article I recommend you at least review it for discussion of the aspects of namespaces I exercise in this rather contrived example.
Listing 1: Sample document that uses many XML namespace features and oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      <html:code>from __future__ import</html:code> features though the year 3000.
      Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
4Suite's XPath and Namespaces (Reading)
4Suite implements the natural namespace support in specifications such as XPath and XUpdate, which can be used respectively to exercise the namespace reading and mutation tasks I set up in the last article. Listing 2 uses XPath to display the local name, namespace and prefix of each element and attribute in a document.
Listing 2: 4Suite/XPath code to display namespace information for elements and attributes
import sys
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

#Compile needed XPath expressions
NS_NODES_EXPR = Compile('//*|//@*')
NSURI_EXPR = Compile('namespace-uri()')
LNAME_EXPR = Compile('local-name()')
PREFIX_EXPR = Compile('substring-before(name(), ":")')

#XPattern is syntactically a subset of XPath
IS_ATTR_PAT = '@*'
#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList([IS_ATTR_PAT], {})

#Read in the file
doc = NonvalidatingReader.parseUri(sys.argv[1])

#Set up the XPath context with the document read in
context = Context(doc)
#Extract all the element and attribute nodes in the doc
nodes = NS_NODES_EXPR.evaluate(context)
for node in nodes:
    context = Context(node)
    #Use XPattern to determine the current node type
    if plist.lookup(node):
        node_type_str = 'attribute'
    else:
        node_type_str = 'element'
    #Output the namespace details for the current node
    nsuri = NSURI_EXPR.evaluate(context)
    print node_type_str, ' namespace:', repr(nsuri)
    lname = LNAME_EXPR.evaluate(context)
    print node_type_str, ' local name:', repr(lname)
    prefix = PREFIX_EXPR.evaluate(context)
    print 'Prefix used for', node_type_str, repr(prefix)
This code is also a bit contrived in order to illustrate how to perform all the subtasks using XPath and XPattern, along the lines of the usual division of labor where the former is used for gathering nodes and processing the basic data model and the latter is used for checking to see whether nodes conform to certain rules. Using an XPath expression I gather up all elements and attributes, and they are naturally returned to Python in document order. I then iterate over the nodes, checking each against an XPattern to determine whether it is an attribute. XPath provides functions to get the namespace and local name for a given node, but not one for extracting the prefix. This is easily done, though, by applying the substring-before function to the result of name() and relying on the syntactic rule that a QName contains at most one colon, which separates the prefix from the local name.
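The prefix trick can be mirrored in plain Python for readers who want to see the logic in isolation. The following is only a minimal sketch: `split_qname` is a hypothetical helper, not part of 4Suite, and it assumes its input is a well-formed QName. Like `substring-before(name(), ":")`, it yields an empty prefix when there is no colon.

```python
def split_qname(qname):
    """Split a QName into (prefix, local_name).

    Mirrors substring-before(name(), ":") for the prefix part: when the
    name has no colon, the prefix is the empty string.
    """
    prefix, colon, local = qname.partition(":")
    if not colon:
        # No colon at all, so the whole QName is the local name
        return "", qname
    return prefix, local


for qname in ("html:code", "xml:lang", "product"):
    print(qname, "->", split_qname(qname))
```

Because a QName can hold at most one colon, splitting at the first colon is unambiguous.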
The output from this code run against our sample document is as follows:
$ python listing2.py products.xml
element namespace: u''
element local name: u'products'
Prefix used for element u''
element namespace: u'http://example.com/product-info'
element local name: u'product'
Prefix used for element u''
attribute namespace: u''
attribute local name: u'id'
Prefix used for attribute u''
element namespace: u'http://example.com/product-info'
element local name: u'name'
Prefix used for element u''
attribute namespace: u'http://www.w3.org/XML/1998/namespace'
attribute local name: u'lang'
Prefix used for attribute u'xml'
element namespace: u'http://example.com/product-info'
element local name: u'description'
Prefix used for element u''
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'code'
Prefix used for element u'html'
element namespace: u'http://example.com/product-info'
element local name: u'code'
Prefix used for element u''
element namespace: u'http://example.com/product-info'
element local name: u'product'
Prefix used for element u'p'
attribute namespace: u''
attribute local name: u'id'
Prefix used for attribute u''
element namespace: u'http://example.com/product-info'
element local name: u'name'
Prefix used for element u'p'
element namespace: u'http://example.com/product-info'
element local name: u'description'
Prefix used for element u'p'
element namespace: u'http://example.com/product-info'
element local name: u'code'
Prefix used for element u'p'
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'code'
Prefix used for element u'html'
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'div'
Prefix used for element u'html'
element namespace: u''
element local name: u'ref'
Prefix used for element u''
attribute namespace: u'http://www.w3.org/1999/xlink'
attribute local name: u'type'
Prefix used for attribute u'xl'
attribute namespace: u'http://www.w3.org/1999/xlink'
attribute local name: u'href'
Prefix used for attribute u'xl'
The output is all as expected, except that null namespaces and prefixes are represented using u'' rather than the Python convention None. This is natural enough given that XPath is not a Python specification, and it is usually not problematic because you almost always know when an XPath could return a u'' that would need to be fixed up to None for further processing in Python.
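The fix-up itself is a one-liner. Here is a small sketch of how it might look; `empty_to_none` is a hypothetical helper name of my own choosing, not a 4Suite function, and it assumes the value is an XPath string result.

```python
def empty_to_none(value):
    """Map the empty string that XPath uses for 'no namespace' or
    'no prefix' onto the Python convention None; pass anything else through."""
    return None if value == "" else value


# Sample values shaped like the XPath string results shown above
print(repr(empty_to_none("")))
print(repr(empty_to_none("http://example.com/product-info")))
```

Comparing against the empty string explicitly, rather than testing truthiness, keeps the helper from swallowing other falsy values if it is ever reused on non-string results.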
4Suite's XUpdate and Namespaces (Mutation)
XUpdate is a community specification for using an XML vocabulary to express modifications to XML documents. It is supported by many XML processing tools, especially in the open source category; and 4Suite provides an XUpdate library as well as a command line tool which applies XUpdate and can, for example, be used as a patching utility for XML. In order to show how to use XUpdate to make namespace-aware modifications, I shall perform the following tasks, which are the same as in the last article:
- Add a new element in the products namespace, but using no prefix.
- Add a new element with a prefix and in the products namespace.
- Add a new element that is not in any namespace.
- Add a new global attribute in the XHTML namespace.
- Add a new global attribute in the special XML namespace.
- Add a new attribute in no namespace.
- Remove only the code element in the XHTML namespace.
- Remove a global attribute.
- Remove an attribute that is not in any namespace.
I don't demonstrate modification in place because this can always be done equivalently with an addition and then a removal. Listing 3 shows how these tasks can be performed in XUpdate.
Listing 3: XUpdate script to make namespace-aware additions and removals of elements and attributes
<xup:modifications version="1.0"
  xmlns:xup="http://www.xmldb.org/xupdate"
  xmlns:p="http://example.com/product-info"
  xmlns:html="http://www.w3.org/1999/xhtml"
  xmlns:xl="http://www.w3.org/1999/xlink"
>
  <!-- Task 1 -->
  <xup:append select="/products/p:product[1]">
    <xup:element name="launch-date" namespace="http://example.com/product-info"/>
  </xup:append>
  <!-- Task 2 -->
  <xup:append select="/products/p:product[1]">
    <xup:element name="p:launch-date" namespace="http://example.com/product-info"/>
  </xup:append>
  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <p:launch-date/>
  </xup:append>
  -->
  <!-- Task 3 -->
  <xup:append select="/products/p:product[1]">
    <xup:element name="island"/>
  </xup:append>
  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <island/>
  </xup:append>
  -->
  <!-- Task 4 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="global" namespace="http://www.w3.org/1999/xhtml">spam</xup:attribute>
  </xup:append>
  <!-- Task 5 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="xml:lang">en</xup:attribute>
  </xup:append>
  <!-- Task 6 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="class">eggs</xup:attribute>
  </xup:append>
  <!-- Task 7 -->
  <xup:remove select="//html:code"/>
  <!-- Task 8 -->
  <xup:remove select="/products/p:product/p:description/html:div/ref/@xl:href"/>
  <!-- Task 9 -->
  <xup:remove select="/products/p:product[1]/@id"/>
</xup:modifications>
If you're familiar with XSLT, then you'll see XUpdate's resemblance to it at first glance. The envelope element for modifications expressed in XUpdate is xup:modifications, similar to xsl:transform or xsl:stylesheet. The namespace declarations on this element assign prefixes for use in the XUpdate script and have no connection to the prefixes used in the document being modified (the source document), even though they happen to be the same. If you want to access elements in a namespace declared as the default in the source document, then just as in XSLT you must declare and use a prefix for the namespace in the XUpdate script.
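The independence of query prefixes from source prefixes is easy to see in any namespace-aware XPath implementation. As a rough illustration only, the sketch below uses the limited XPath subset in the standard library's ElementTree, which is my substitution and not 4Suite's XPath engine. The prefix `prod` is deliberately one that never appears in the (abbreviated) source document.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the sample document: one product uses a default
# namespace declaration, the other uses the prefix "p"
SOURCE = """<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name>Python Perfect IDE</name>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
  </p:product>
</products>"""

root = ET.fromstring(SOURCE)

# The prefix mapping belongs to the query, not to the source document
QUERY_PREFIXES = {"prod": "http://example.com/product-info"}
matched = root.findall("prod:product", QUERY_PREFIXES)
ids = [elem.get("id") for elem in matched]
print(ids)

# An unprefixed name test only matches elements in no namespace
unprefixed = root.findall("product")
print(len(unprefixed))
```

Both product elements match the prefixed query, whichever way the source declared the namespace, while the unprefixed query matches neither.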
Each modification request is expressed as an XUpdate instruction. This example demonstrates xup:append and xup:remove. There are other instructions providing types of modification, such as xup:insert-before and xup:update, and there are also control constructs such as xup:if, which is similar to xsl:if. Instructions usually have a select attribute containing an XPath expression that specifies the node to be used as a reference for modification. In the case of xup:append, select specifies a node to which some new XML will be appended as a child. In the case of xup:remove, select identifies nodes to be removed. When an instruction needs to specify a chunk of XML to be used in the modification, it is expressed as the content of the instruction, in a similar fashion to XSLT templates. In the case of xup:append this template expresses the chunk of XML to be inserted into the document. In order to generate elements and attributes, XUpdate provides output instructions such as xup:element and xup:attribute, which are very similar to their XSLT equivalents. In another idea borrowed from XSLT, XUpdate allows you to create elements by placing literal result elements in the templates. If you'd like to get a closer look at XUpdate, the best way is by browsing the very clear examples in the XUpdate Use Cases compiled by Kimbro Staken. See Listing 4 for Python code that can be used to apply an XUpdate script. It's a simplified version of the code for the 4xupdate command line.
Listing 4: Python code to apply an XUpdate script
import sys
from Ft.Xml import XUpdate
from Ft.Xml import Domlette, InputSource
from Ft.Lib import Uri

#Set up reader objects for parsing the XML files
reader = Domlette.NonvalidatingReader
xureader = XUpdate.Reader()

#Parse the source file
source_uri = Uri.OsPathToUri(sys.argv[1], attemptAbsolute=1)
source = reader.parseUri(source_uri)

#Parse the XUpdate file
xupdate_uri = Uri.OsPathToUri(sys.argv[2], attemptAbsolute=1)
isrc = InputSource.DefaultFactory.fromUri(xupdate_uri)
xupdate = xureader.fromSrc(isrc)

#Set up the XUpdate processor and run against the source file
#The Domlette for the source is modified in place
processor = XUpdate.Processor()
processor.execute(source, xupdate)

#Print the updated DOM node to standard output
Domlette.Print(source)
Notice the use of Uri.OsPathToUri to convert file system paths to proper URIs for use in 4Suite. I strongly recommend this convention as one way to help minimize confusion between file specifications and URIs -- the basis of many frequently asked questions. The XUpdate.Processor class defines the environment for running XUpdate commands, and execute() is the method that actually kicks off the processing. It operates on a Domlette instance, modifying it in place (so be careful when using XUpdate in this way). I print the updated document object to standard output using Domlette.Print.
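The path-versus-URI distinction is worth seeing concretely. The sketch below is not 4Suite's Uri.OsPathToUri but a rough standard-library analogue built on pathlib; `os_path_to_uri` is a hypothetical name, and the assumption is that making the path absolute first plays a role similar to the attemptAbsolute flag.

```python
from pathlib import Path


def os_path_to_uri(path):
    """Turn a local file system path into an absolute file: URI.

    resolve() makes the path absolute relative to the current directory
    (the file need not exist); as_uri() applies the platform's rules for
    separators and percent-escaping.
    """
    return Path(path).resolve().as_uri()


uri = os_path_to_uri("products.xml")
print(uri)
```

The point is the same as in the listing: once a path has been converted, everything downstream deals only in URIs, and platform quirks such as backslashes or drive letters are handled in one place.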
This XUpdate script worked fine with the latest CVS version of 4Suite, but the attribute additions did not work with the last packaged release, 1.0a3. It turns out that Mike Brown restored the ability to append attributes just last month. If you need this capability you'll need to use the CVS version until the next packaged release. The following snippet illustrates how to run the test script, and shows the resulting output.
$ python listing4.py products.xml listing3.xup
<?xml version="1.0" encoding="UTF-8"?>
<products xmlns:p="http://example.com/product-info"
  xmlns:html="http://www.w3.org/1999/xhtml"
  xmlns:xl="http://www.w3.org/1999/xlink"
>
  <product xmlns="http://example.com/product-info">
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      features though the year 3000.
      Works well with <code>1166</code>.
    </description>
  <launch-date/><p:launch-date/><island/></product>
  <p:product id="1166">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description>
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div global="spam" class="eggs" xml:lang="en">
        <ref xl:type="simple">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
This output uncovers the same bug that I pointed out in minidom in the last article. I explicitly asked for the global attribute generated in task 4 to be in the XHTML namespace. Even though I did not specify it as a QName, the processor should still have used a prefix for the output because an attribute without a prefix is in no namespace, regardless of the namespace of its element. As I mentioned in the last article, this is an obscure and controversial corner of XML namespaces, so I'm not surprised the bug appears to be widespread.
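To make the expected behavior concrete, here is a small sketch using the standard library's ElementTree rather than 4Suite (my substitution, chosen only because it is widely available). It builds an element like the html:div above, puts one attribute in the XHTML namespace and one in no namespace, and then serializes and reparses the result.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Ask the serializer to use the prefix "html" for the XHTML namespace
ET.register_namespace("html", XHTML_NS)

div = ET.Element("{%s}div" % XHTML_NS)
# A global attribute in the XHTML namespace (compare task 4)
div.set("{%s}global" % XHTML_NS, "spam")
# An attribute in no namespace (compare task 6)
div.set("class", "eggs")

serialized = ET.tostring(div, encoding="unicode")
print(serialized)

# Round trip: the namespaced attribute survives only because it was
# written out with a prefix
reparsed = ET.fromstring(serialized)
print(sorted(reparsed.attrib))
```

The serializer emits the namespaced attribute with a prefix, so after reparsing it is still in the XHTML namespace, while class remains in no namespace. Had it been written without a prefix, as in the 4Suite output above, the namespace would have been lost on the next parse.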
Wrap Up
I didn't cover PyXML because the most interesting libraries in it using namespaces are very similar to Python's SAX and minidom, which I did cover. PyXML also includes older versions of the XPath and XPattern libraries from 4Suite. The main idea behind 4Suite is to open up in-depth Python APIs to standard XML technologies, and this extends to all the relevant namespace facilities in various XML specifications. In the next article I shall continue this examination of namespace capabilities in Python tools.
Picking up on what my colleagues have been up to lately, I find Dave Kuhlman's update to generateDS, which includes the ability to interchange Python and XML literal text. For a more in-depth explanation see the announcement. I covered generateDS in an earlier article.
Manfred Stienstra wrote a couple of articles on the use of libxml2's Python bindings: "The Problem with the Libxml Python Bindings" and "More Problems with the Libxml Python Bindings".
Andrew Dalke has long been working on Martel, a tool for converting the many flat-file, text-based formats used in bioinformatics into XML. Recently his paper on the topic, "Martel: Bioinformatics file parsing made easy", came to my attention again. Martel is a very clever idea and applicable beyond the world of bioinformatics. It can be used in general to treat "legacy" formats (including the likes of CSV and simple record-per-line files) as if they were already in XML. One warning is that the link to Martel in the paper is out of date. See the first sentence in this paragraph for the current link.