XML Namespaces Support in Python Tools, Part 1
March 10, 2004
I have covered a lot of tools for processing XML in Python. In general I have deferred discussion of each tool's handling of XML namespaces in order to stick to the basics in the individual treatments. In this article I start to examine the support for XML namespaces in these packages, with a look at SAX and DOM from the standard Python library.
But first, a warning. XML namespaces are largely a matter of shrugging acceptance among most XML users, but they are terribly controversial among XML experts. The controversy is for good reason. Namespaces solve a difficult problem and there are very many approaches to solving this problem, each of which has its pros and cons.
The W3C XML namespaces specification is a compromise and as with all compromises falls a bit short of addressing the needs of each faction. Namespaces have proven, even after all this time, very difficult to smoothly incorporate into the information architecture of XML processing, which translates into the fact that most namespace-processing APIs are clumsy and sprinkled with landmines for the unwary.
The lesson is not to use XML namespaces as a reflex. Think carefully about why and how you plan to use any namespaces you introduce. There are some useful design principles for namespaces that can help reduce problems. These are out of the scope of this article, but I shall be covering them in an upcoming IBM developerWorks article.
Sample Document
In order to exercise the various APIs I will use a rather contrived sample XML document. It exercises the following quirks and qualities:
- Use of multiple namespaces (with different prefixes).
- Local name clashes across namespaces.
- Use of the default namespace.
- Use of namespaces in mixed content.
- Elements in no namespace.
- The special namespace bound to the prefix "xml", which need not be declared.
- What are sometimes called global attributes (i.e., attributes with prefixes and thus explicitly in a namespace).
Listing 1: Sample Document with Many XML Namespace Features and Oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      <html:code>from __future__ import</html:code> features though
      the year 3000. Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
I'll be looking most importantly at how the various tools report the namespaces and, where document mutation is relevant, at how to express namespaces when creating and modifying elements and attributes. Namespace prefixes are strictly syntactic conveniences, but as a matter of interest I shall also have a look at how the tools handle prefixes.
SAX and Namespaces
The SAX library that comes with Python is based on SAX 2.0, and is fully namespace aware. Namespaces of elements and attributes are reported using a conventional data structure of the form (namespace, local-name), qname. One can extract the prefix from the qname value. The handling of namespaces in SAX is a little bit clumsy, in part from the awkwardness of namespaces themselves and in part from the awkwardness of the original SAX 2 interface in Java. Listing 2 is SAX code that displays the local name, namespace, and prefix of each element and attribute in a document.
Listing 2: SAX Code to Display Namespace Info for Elements and Attributes
import sys
from xml import sax

#Subclass from ContentHandler in order to gain default behaviors
class ns_test_handler(sax.ContentHandler):
    def startElementNS(self, name, qname, attributes):
        (namespace, localname) = name
        prefix = self._split_qname(qname)[0]
        print "Element namespace:", repr(namespace)
        print "Element local name:", repr(localname)
        print "Prefix used for element:", repr(prefix)
        for name, value in attributes.items():
            (namespace, localname) = name
            qname = attributes.getQNameByName(name)
            prefix = self._split_qname(qname)[0]
            print "Attribute namespace:", repr(namespace)
            print "Attribute local name:", repr(localname)
            print "Prefix used for attribute:", repr(prefix)
        return

    def _split_qname(self, qname):
        qname_split = qname.split(':')
        if len(qname_split) == 2:
            prefix, local = qname_split
        else:
            #No prefix: the whole QName is the local name
            prefix = None
            local = qname_split[0]
        return prefix, local

if __name__ == "__main__":
    parser = sax.make_parser()
    parser.setContentHandler(ns_test_handler())
    parser.setFeature(sax.handler.feature_namespaces, 1)
    parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
    parser.parse(sys.argv[1])
At the bottom of this listing I take care to enable a couple of SAX features relating to namespace processing. sax.handler.feature_namespaces instructs the parser to send namespace-aware events such as startElementNS, rather than plain events like startElement. sax.handler.feature_namespace_prefixes instructs the parser to preserve and report namespace prefixes. Without this feature a parser is free to report None for any QName, which means your SAX handler would not have access to the prefixes used in the document. In the handler method for the startElementNS event I show code for extracting all parts of the namespace-related information.
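As a rough sketch of the same idea for readers on a current interpreter, the following uses Python 3 syntax (an assumption on my part; the listings in this article are Python 2). It does not rely on the qname argument at all, since that may be None, and instead records prefix declarations through the standard startPrefixMapping event. The class name NamespaceRecorder, the function record_namespaces, and the cut-down sample document are hypothetical names introduced only for illustration.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, feature_namespaces


class NamespaceRecorder(ContentHandler):
    """Hypothetical handler that records namespace details as they arrive."""

    def __init__(self):
        super().__init__()
        self.elements = []   # (namespace, localname) tuples in document order
        self.prefixes = []   # (prefix, uri) pairs as they are declared

    def startPrefixMapping(self, prefix, uri):
        # prefix is None for a default namespace declaration
        self.prefixes.append((prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # qname may be None, so only the name tuple is used here
        self.elements.append(name)


def record_namespaces(xml_bytes):
    """Parse xml_bytes in namespace mode and return the populated handler."""
    parser = sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NamespaceRecorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler


# A cut-down stand-in for the sample document (illustrative only)
SAMPLE = (b'<products><product xmlns="http://example.com/product-info" '
          b'xmlns:html="http://www.w3.org/1999/xhtml">'
          b'<html:code>x</html:code></product></products>')

handler = record_namespaces(SAMPLE)
for namespace, localname in handler.elements:
    print(repr(namespace), repr(localname))
```

Collecting the prefix declarations separately keeps the handler independent of whether a given parser chooses to report QNames.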
The parameter attributes arrives as an instance of the class xml.sax.xmlreader.AttributesNS, which behaves like a dictionary where the keys are the (namespace, local-name) tuples and the values are the attribute values. There is also a set of special methods for this class that are documented in the Python Library Reference. I use one of these methods, getQNameByName, which takes one of the name tuples and returns the corresponding QName.
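To make the dictionary-like behavior concrete, here is a minimal sketch in Python 3 syntax (an assumption; the article's listings are Python 2). The class name AttributeDumper and the one-element input are hypothetical, and the sketch assumes the parser supplies QNames for attributes, which I believe the standard expat driver does.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, feature_namespaces

XLINK = "http://www.w3.org/1999/xlink"


class AttributeDumper(ContentHandler):
    """Hypothetical handler that copies attribute details into plain tuples."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElementNS(self, name, qname, attrs):
        # attrs maps (namespace, localname) tuples to attribute values
        for attr_name, value in attrs.items():
            namespace, localname = attr_name
            # Look up the QName that goes with this name tuple
            attr_qname = attrs.getQNameByName(attr_name)
            self.seen.append((namespace, localname, attr_qname, value))


parser = sax.make_parser()
parser.setFeature(feature_namespaces, True)
dumper = AttributeDumper()
parser.setContentHandler(dumper)
parser.parse(io.BytesIO(
    b'<ref xmlns:xl="http://www.w3.org/1999/xlink" id="r1" '
    b'xl:href="index.xml">A link</ref>'
))

# Sort by local name so the printout does not depend on attribute order
for entry in sorted(dumper.seen, key=lambda e: e[1]):
    print(entry)
```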
The output from this code run against our sample document is as follows:
$ python listing2.py products.xml
Element namespace: None
Element local name: u'products'
Prefix used for element: None
Element namespace: u'http://example.com/product-info'
Element local name: u'product'
Prefix used for element: None
Attribute namespace: None
Attribute local name: u'id'
Prefix used for attribute: None
Element namespace: u'http://example.com/product-info'
Element local name: u'name'
Prefix used for element: None
Attribute namespace: u'http://www.w3.org/XML/1998/namespace'
Attribute local name: u'lang'
Prefix used for attribute: u'xml'
Element namespace: u'http://example.com/product-info'
Element local name: u'description'
Prefix used for element: None
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'code'
Prefix used for element: u'html'
Element namespace: u'http://example.com/product-info'
Element local name: u'code'
Prefix used for element: None
Element namespace: u'http://example.com/product-info'
Element local name: u'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: u'id'
Prefix used for attribute: None
Element namespace: u'http://example.com/product-info'
Element local name: u'name'
Prefix used for element: u'p'
Element namespace: u'http://example.com/product-info'
Element local name: u'description'
Prefix used for element: u'p'
Element namespace: u'http://example.com/product-info'
Element local name: u'code'
Prefix used for element: u'p'
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'code'
Prefix used for element: u'html'
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: u'ref'
Prefix used for element: None
Attribute namespace: u'http://www.w3.org/1999/xlink'
Attribute local name: u'type'
Prefix used for attribute: u'xl'
Attribute namespace: u'http://www.w3.org/1999/xlink'
Attribute local name: u'href'
Prefix used for attribute: u'xl'
As you can see, all the namespace-related values are given as Unicode objects. This is the right thing to do for prefix and name values because these use the Unicode basis for XML names. Namespaces, however, are URIs, and therefore must be represented using ASCII. This means that it is probably OK to use plain strings for namespaces, but I can't argue with the consistency of Unicode across the board.
An important thing to notice is that None is given as the namespace value for elements and attributes that are not in a namespace. Similarly None is given as the prefix for elements and attributes that are not represented with a prefix. These are standard Python conventions and you should never use the empty string to represent such cases. I recommend in general using the constants defined in the Python DOM core interface, both of which are set to None:
from xml.dom import EMPTY_NAMESPACE
from xml.dom import EMPTY_PREFIX
Also notice that the attribute xml:lang is shown as bound to the namespace http://www.w3.org/XML/1998/namespace even though no such namespace is declared. This is because this is a special namespace that is implicitly declared as bound to the prefix xml; it must be handled as such by namespace-compliant tools. There is also a convenience constant in the Python DOM interface for this special namespace, xml.dom.XML_NAMESPACE.
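The constants mentioned above can be put to work in a small classifier. This is only a sketch in Python 3 syntax (an assumption; the article's listings are Python 2), and the helper name describe_namespace is hypothetical. It also uses xml.dom.XMLNS_NAMESPACE, the reserved namespace for namespace declarations.

```python
from xml.dom import (EMPTY_NAMESPACE, EMPTY_PREFIX,
                     XML_NAMESPACE, XMLNS_NAMESPACE)


def describe_namespace(namespace):
    """Hypothetical helper: classify a namespace value reported by SAX or DOM."""
    if namespace is EMPTY_NAMESPACE:      # None, never the empty string
        return "no namespace"
    if namespace == XML_NAMESPACE:        # implicitly bound to the xml prefix
        return "reserved xml namespace"
    if namespace == XMLNS_NAMESPACE:      # used for namespace declarations
        return "namespace declaration"
    return "ordinary namespace"


print(describe_namespace(None))
print(describe_namespace("http://www.w3.org/XML/1998/namespace"))
print(describe_namespace("http://example.com/product-info"))
print(EMPTY_PREFIX is None)
```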
Minidom and Namespaces
Minidom implements a lot of DOM Level 2, and accordingly supports namespaces. The API is in some ways even clumsier than SAX's, again through legacy from other languages, but it does make all the information relating to namespaces available for reading and editing. Listing 3 is similar to Listing 2 and displays all namespace information in the document.
Listing 3: Minidom Code to Display Namespace Info for Elements and Attributes
#Required in Python 2.2, and must be the first import
from __future__ import generators
import sys
from xml.dom import minidom
from xml.dom import Node

def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order, applying the filter
    function to each in turn, starting with the given node, and
    yielding each node in cases where the filter function computes
    true
    node - the starting point (subtree rooted at node will be
           iterated over document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return

def get_all_elements(node):
    """
    Returns an iterator (using document order) over all element nodes
    that are descendants of the given one
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE
    )

doc = minidom.parse(sys.argv[1])
for elem in get_all_elements(doc):
    print "Element namespace:", repr(elem.namespaceURI)
    print "Element local name:", repr(elem.localName)
    print "Prefix used for element:", repr(elem.prefix)
    for attr in elem.attributes.values():
        print "Attribute namespace:", repr(attr.namespaceURI)
        print "Attribute local name:", repr(attr.localName)
        print "Prefix used for attribute:", repr(attr.prefix)
The first two functions in the listing are examples of Python generator-driven DOM processing of the sort I introduced and advocated in Generating DOM Magic. The main section uses an iterator over all elements in document order and prints the same namespace information. The method call elem.attributes.values() gets a list of all the attribute node objects for each element. Each attribute node carries all its namespace information as data members.
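For comparison, here is a compact sketch of the same element walk in Python 3 syntax (an assumption; the listing itself targets Python 2.2). The function names iter_elements and namespace_info and the small inline document are hypothetical and only illustrate the three data members involved.

```python
from xml.dom import minidom, Node

PRODUCT_NS = "http://example.com/product-info"


def iter_elements(node):
    """Yield element nodes at or below node, in document order."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from iter_elements(child)


def namespace_info(xml_text):
    """Return (namespaceURI, localName, prefix) for each element."""
    doc = minidom.parseString(xml_text)
    return [(e.namespaceURI, e.localName, e.prefix)
            for e in iter_elements(doc)]


# A cut-down stand-in for the sample document (illustrative only)
SAMPLE = ('<products>'
          '<p:product xmlns:p="http://example.com/product-info">'
          '<p:name>XSLT Perfect IDE</p:name><ref/>'
          '</p:product></products>')

for triple in namespace_info(SAMPLE):
    print(triple)
```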
There are numerous alternative ways to write this loop because Minidom provides a variety of APIs for working with NamedNodeMap objects, which are the way attributes are stored. Some of these methods have special namespace-aware versions. The following snippet shows some examples:
>>> from xml.dom import minidom
>>> doc = minidom.parse('products.xml')
>>> products = doc.getElementsByTagNameNS(
...     u'http://example.com/product-info', u'product'
... )
>>> perfect_python_ide = products[0]
>>> from pprint import pprint
>>> pprint(perfect_python_ide.attributes.keys())
['xmlns', u'xmlns:html', u'id']
>>>
>>> pprint(perfect_python_ide.attributes.keysNS())
[('http://www.w3.org/2000/xmlns/', u'html'),
 ('http://www.w3.org/2000/xmlns/', 'xmlns'),
 (None, u'id')]
>>>
>>> pprint(perfect_python_ide.attributes.items())
[('xmlns', u'http://example.com/product-info'),
 (u'xmlns:html', u'http://www.w3.org/1999/xhtml'),
 (u'id', u'1144')]
>>>
>>> pprint(perfect_python_ide.attributes.itemsNS())
[(('http://www.w3.org/2000/xmlns/', 'xmlns'),
  u'http://example.com/product-info'),
 (('http://www.w3.org/2000/xmlns/', u'html'),
  u'http://www.w3.org/1999/xhtml'),
 ((None, u'id'), u'1144')]
>>>
See also methods such as setNamedItemNS, getNamedItemNS, and removeNamedItemNS (the latter two only in Python 2.3 or recent PyXML), which provide for namespace-aware retrieval, update and removal of actual attribute node objects.
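A short sketch of namespace-aware retrieval and removal, written in Python 3 syntax (an assumption; the article's sessions are Python 2). The one-element document is illustrative, and the element-level calls getAttributeNS and hasAttributeNS are shown alongside the NamedNodeMap methods for comparison.

```python
from xml.dom import minidom

XLINK = "http://www.w3.org/1999/xlink"

doc = minidom.parseString(
    '<ref xmlns:xl="http://www.w3.org/1999/xlink" '
    'xl:type="simple" xl:href="index.xml">A link</ref>'
)
ref = doc.documentElement

# Retrieve the attribute node itself by (namespace, local name)
href = ref.attributes.getNamedItemNS(XLINK, "href")
href_details = (href.name, href.prefix, href.value)
print(href_details)

# Element-level shortcut that returns just the value
link_type = ref.getAttributeNS(XLINK, "type")
print(link_type)

# Remove the attribute node, again by (namespace, local name)
ref.attributes.removeNamedItemNS(XLINK, "href")
still_there = ref.hasAttributeNS(XLINK, "href")
print(still_there)
```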
You have probably noticed that the namespace declarations themselves appear as attributes. Certainly they are attributes in the XML source, because that is how namespace syntax is defined, but you might be surprised to see that the namespace declarations are not removed from the list of attributes in each element. The surprise is that they carry redundant information, given that every node carries its own namespace details. SAX, for example, does not include namespace declarations in the attributes by default. This is one of the well-known surprises and sources of debate in DOM Level 2.
As an aside, I noticed that the special namespace declaration attribute local names xmlns are being returned as plain strings rather than Unicode objects, even in the most recent PyXML (and accordingly in all Python versions). This is a bug, though probably a harmless one.
Luckily there is one reliable way to tell namespace declarations from other attributes in DOM: they all use the special, reserved XML namespace http://www.w3.org/2000/xmlns/. This namespace is also available as a standard Python constant, xml.dom.XMLNS_NAMESPACE.
As an example, Listing 4 is a modification of Listing 3 that omits namespace declarations from the reported attributes. Its output is a true match to that of Listing 2.
Listing 4: Minidom Code to Display Namespace Info for Elements and Attributes, Excluding Namespace Declarations
#Required in Python 2.2, and must be the first import
from __future__ import generators
import sys
from xml.dom import minidom
from xml.dom import Node
from xml.dom import XMLNS_NAMESPACE

def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order, applying the filter
    function to each in turn, starting with the given node, and
    yielding each node in cases where the filter function computes
    true
    node - the starting point (subtree rooted at node will be
           iterated over document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return

def get_all_elements(node):
    """
    Returns an iterator (using document order) over all element nodes
    that are descendants of the given one
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE
    )

doc = minidom.parse(sys.argv[1])
for elem in get_all_elements(doc):
    print "Element namespace:", repr(elem.namespaceURI)
    print "Element local name:", repr(elem.localName)
    print "Prefix used for element:", repr(elem.prefix)
    for attr in elem.attributes.values():
        if attr.namespaceURI != XMLNS_NAMESPACE:
            print "Attribute namespace:", repr(attr.namespaceURI)
            print "Attribute local name:", repr(attr.localName)
            print "Prefix used for attribute:", repr(attr.prefix)
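The same XMLNS_NAMESPACE test can also be used to keep the declarations rather than discard them. The following is a sketch in Python 3 syntax (an assumption; the listing above is Python 2), and the helper name split_attributes is hypothetical. It separates an element's namespace declarations, as a prefix-to-URI mapping, from its ordinary attributes.

```python
from xml.dom import minidom, XMLNS_NAMESPACE


def split_attributes(elem):
    """Return (declarations, ordinary) for a minidom element.

    declarations maps prefix (None for the default namespace) to URI;
    ordinary maps (namespace, local name) tuples to attribute values.
    """
    declarations = {}
    ordinary = {}
    for attr in elem.attributes.values():
        if attr.namespaceURI == XMLNS_NAMESPACE:
            # A bare xmlns attribute declares the default namespace;
            # for xmlns:foo the local name is the declared prefix
            prefix = None if attr.nodeName == "xmlns" else attr.localName
            declarations[prefix] = attr.value
        else:
            ordinary[(attr.namespaceURI, attr.localName)] = attr.value
    return declarations, ordinary


doc = minidom.parseString(
    '<product id="1144" xmlns="http://example.com/product-info" '
    'xmlns:html="http://www.w3.org/1999/xhtml"/>'
)
declarations, ordinary = split_attributes(doc.documentElement)
print(declarations)
print(ordinary)
```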
Minidom Namespace Mutation
In order to show how to modify a DOM in a namespace-aware manner, I will perform the following tasks:
- Add a new element launch-date in the products namespace, but using no prefix.
- Add a new element launch-date with a prefix and in the products namespace.
- Add a new element that is not in any namespace.
- Add a new global attribute in the XHTML namespace.
- Add a new global attribute in the special XML namespace.
- Add a new attribute in no namespace.
- Remove only the code element in the XHTML namespace.
- Remove a global attribute.
- Remove an attribute that is not in any namespace.
I don't demonstrate modification in place because this can always be done equivalently with an addition and then a removal. Examples of these tasks are as follows:
>>> from xml.dom import minidom
>>> from xml.dom import XML_NAMESPACE
>>> from xml.dom import EMPTY_NAMESPACE
>>> from xml.dom import EMPTY_PREFIX
>>>
>>> #Set up
...
>>> doc = minidom.parse('products.xml')
>>> products = doc.getElementsByTagNameNS(
...     u'http://example.com/product-info', u'product'
... )
>>> divs = doc.getElementsByTagNameNS(
...     u'http://www.w3.org/1999/xhtml', u'div'
... )
>>>
>>> #Task 1
...
>>> new_elem = doc.createElementNS(
...     u'http://example.com/product-info', u'launch-date'
... )
>>> products[0].appendChild(new_elem)
<DOM Element: launch-date at 0x402ac08c>
>>>
>>> #Task 2
...
>>> new_elem = doc.createElementNS(
...     u'http://example.com/product-info', u'p:launch-date'
... )
>>> products[1].appendChild(new_elem)
<DOM Element: p:launch-date at 0x402cd9ac>
>>>
>>> #Task 3
...
>>> new_elem = doc.createElementNS(EMPTY_NAMESPACE, u'island')
>>> products[0].appendChild(new_elem)
<DOM Element: island at 0x4030988c>
>>>
>>> #Task 4
...
>>> divs[0].setAttributeNS(
...     u'http://www.w3.org/1999/xhtml', u'global', u'spam'
... )
>>>
>>> #Task 5
...
>>> divs[0].setAttributeNS(XML_NAMESPACE, u'xml:lang', u'en')
>>>
>>> #Task 6
...
>>> divs[0].setAttributeNS(EMPTY_NAMESPACE, u'class', u'eggs')
>>>
>>> #Task 7
...
>>> html_codes = products[0].getElementsByTagNameNS(
...     u'http://www.w3.org/1999/xhtml', u'code'
... )
>>> parent = html_codes[0].parentNode
>>> parent.removeChild(html_codes[0])
<DOM Element: html:code at 0x402d3f2c>
>>>
>>> #Task 8
...
>>> refs = doc.getElementsByTagNameNS(EMPTY_NAMESPACE, u'ref')
>>> refs[0].removeAttributeNS(u'http://www.w3.org/1999/xlink', u'href')
>>>
>>> #Task 9
...
>>> products[0].removeAttributeNS(EMPTY_NAMESPACE, u'id')
>>>
After all this manipulation I re-serialized the updated document as XML by calling doc.toprettyxml(). I don't display the output of this for reasons of space, but when I examined it I did find a bug. The result of Tasks 4-6 is:
<html:div class="eggs" global="spam" xml:lang="en">
I explicitly asked for the http://www.w3.org/1999/xhtml namespace for the global attribute. By rule this should appear with the html prefix, or equivalent, even though its parent is in that namespace. To be fair this is one of the more obscure and confusing corners of XML namespaces, but it's a bug nevertheless.
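One way to sidestep the missing prefix, assuming you already know a prefix that is in scope for the namespace, is to spell it out in the qualified name passed to setAttributeNS. The sketch below uses Python 3 syntax (an assumption; the session above is Python 2) and a one-element stand-in document. It is a workaround under those assumptions, not a fix for the serializer.

```python
from xml.dom import minidom

XHTML = "http://www.w3.org/1999/xhtml"

doc = minidom.parseString(
    '<html:div xmlns:html="http://www.w3.org/1999/xhtml"/>'
)
div = doc.documentElement

# Give the qualified name with the prefix, not just the local name
div.setAttributeNS(XHTML, "html:global", "spam")

serialized = div.toxml()
print(serialized)

# The attribute is still found by (namespace, local name)
print(div.getAttributeNS(XHTML, "global"))
```

The serializer writes attribute names as they were given, so the prefix survives only because it was part of the qualified name.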
More to Come on Namespaces
In this article I covered the basic XML libraries that come with recent versions of Python (and with PyXML). In upcoming articles I will look at the handling of namespaces in third-party tools.
Meanwhile, in the Python-XML world...
Valéry Febvre released PyXMLSec, a set of Python bindings under the GPL for standard XML Security facilities based on the libxml2 implementation in C. It covers XML Signature, XML Encryption, Canonical XML, and Exclusive Canonical XML. A warning from the web site:
"The Python interface has not yet reached the completeness of the C API (currently ~ 300 functions are implemented). Bindings are very young, API can't be considered as mature and may be changed at any time."
Ned Batchelder released handyxml 1.1, a Python module that wraps XML parsers and parsed DOM implementations into objects with added Python features. It includes XPath support. PyXML or 4Suite are required.
I just discovered Dave Kuhlman's Python XML FAQ and How-to, which is really just a few recipes for SAX and DOM (including Python generator) usage. I also found Paul Boddie's Python and XML: An Introduction, which is really an introduction to Minidom. It's a nice introduction, driven by examples, and on reading through it, everything it discusses should still work with current Minidom versions.
I found Sean B. Palmer's pyrple, another RDF API in Python, but based on earlier work by Palmer. It "parses RDF/XML, N3, and N-Triples. It has in-memory storage with API-level querying, experimental marshalling, many utilities, and is small and minimally interdependent." But Palmer admits that it's a bit more hackish than established Python RDF tools and appropriate "if you don't mind getting your hands dirty, and you want something that's small and handy."
Adam Souzis announced a new release (0.2.0) of Rx4RDF and Rhizome. Updates include performance improvements and support for Redland as well as 4Suite. See the announcement.