Decomposition, Process, Recomposition
July 28, 2004
One unfortunate side-effect of the success of XML is that some users tend to just throw all their familiar data-processing tasks into XML shapes without pausing much to consider the strengths and weaknesses of XML.
This lack of attentiveness often leads people to very dangerous practices, such as treating XML as if it were an ASCII or even byte-oriented format. I have warned sharply about the ugly effects of such confusion in a few previous articles in this column.
Another common effect I've seen is the tendency to create multi-megabyte or even gigabyte monolithic XML files. XML is so flexible for data representation because of its nature as an annotated hierarchy. But this very nature also makes efficient processing quite difficult, especially with regard to scaling with the number of nodes.
So my first line of advice has always been: don't go processing gigabyte files in XML formats. I have been working with XML for about 8 years. I have used XML in numerous ways with numerous tools for numerous purposes. I have never come across a natural need for an XML file larger than a few megabytes. There may be terabytes of raw data in a system, but if you're following XML (and classic data-design) best practices, that data should be structured into a number of files in some sensible way.
If you find yourself with a monster XML file -- perhaps you receive it that way from another source outside your control -- the best way to handle it is through the classic problem-solving technique, divide and conquer. Decompose the problem into smaller bits, solve each bit to produce a partial solution, then combine the partial solutions into the overall solution.
Certainly this requires the problem to have certain mathematical properties, but in almost every case of monster XML files I've seen, an elegant decomposition is available. In this article I present some tools and techniques for processing large XML files by decomposition.
DOM in Bite-size Bits
I'll pretend Listing 1 is the gigantic file I have to cut down to size in order to process.
Listing 1: Sample XML file (labels.xml) containing Address Labels

<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season...
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
To get into the spirit of the problem, imagine that rather than the two label elements, this file had a billion. For many tasks you might have in mind with such a huge file, you could simply use SAX. Say you wanted to print out a listing of all the names in the file. It's not hard to do this in SAX, since all you need to worry about at any point is whether you happen to be visiting character data within a name element.
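Here is a minimal sketch of that plain-SAX approach (my own illustration, not one of the article's numbered listings). It buffers character events while inside a name element, since SAX may report a single text node in several pieces, and prints the accumulated text when the element closes:

"""name_lister.py -- illustrative sketch, not a numbered listing"""
from xml import sax

class name_lister(sax.ContentHandler):
    def __init__(self):
        self._buffer = None
    def startElement(self, name, attrs):
        if name == u'name':
            #Start accumulating character data
            self._buffer = []
    def endElement(self, name):
        if name == u'name':
            print u''.join(self._buffer).encode('utf-8')
            self._buffer = None
    def characters(self, chars):
        #characters() may fire more than once per text node
        if self._buffer is not None:
            self._buffer.append(chars)

parser = sax.make_parser()
parser.setContentHandler(name_lister())
parser.parse('labels.xml')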
If you have an operation in mind that needs more awareness of context or state, then it's time to sharpen up your state machine chops -- and maintaining such state machines, even for SAX gurus, eventually becomes a painful task. Node trees such as DOM make managing state and context a cinch because they allow random access to nodes, but loading the entire billion-record file into memory to perform node operations is probably beyond the capabilities of all but the highest-end computers, even for an implementation such as cDomlette that uses a variety of tricks to lower its memory footprint.
You might be able to make some headway by using a technique such as the one I discussed in Building Dictionaries With SAX, but perhaps the best of both worlds would be a tool that uses SAX to build little DOMs, so that you can process each, discard each, and move on to the next record.
In my current example, it would be great to have a SAX tool that creates a DOM for each label element in turn. This way you have the full capability of DOM available for processing each label, but you never use much more than the memory needed for a DOM comprising a single label. If you had such a tool, then you could very efficiently solve problems that can be broken down into a subproblem for each label. Such a tool is available in libxml's xmlTextReader (see Using libxml in Python), but libxml's implementation can be very idiosyncratic. Besides, some people dislike installing large third-party packages.
I happened to write a SAX tool for such a purpose that works with the built-in Python SAX and minidom facilities and sticks to the standard library's SAX and DOM idioms. See Listing 2 for sax2dom_chunker.py, a module that uses SAX to create little chunks of DOM according to a set of element paths you give it.
""" sax2dom_chunker.py A SAX handler that takes a set of element paths and creates a series of DOM chunks matching the element paths for individual processing. Designed for Python 2.2. or greater Copyright 2004 Fourthought Inc, USA. This work is licensed under Creative Commons Attribution 1.0 For details: http://creativecommons.org/licenses/by/1.0/ """ from xml import sax from xml.dom import XML_NAMESPACE, XMLNS_NAMESPACE, EMPTY_NAMESPACE import xml.dom.minidom DUMMY_DOCELEM = u'dummy' START_STATE = 0 TOP = -1 class _state_machine: """ A simple state machine specialized for DOM chunking from SAX A state is "live" when it represents the successful completion of a path. This is generally a signal to the handler using this state machine to start creating the DOM fragment from the subset of SAX events until we transit to a non-live state """ def __init__(self, trim_to_paths): if not trim_to_paths: self.event = self.event_nop self.is_live = self.is_live_nop return self._state_table = {START_STATE: {}} self._live_states = [] #Use the given trim paths to generate a state table newest_state = START_STATE for path in trim_to_paths: last_state = START_STATE for segment in path: start_event = (1, segment[0], segment[1]) end_event = (0, segment[0], segment[1]) if self._state_table[last_state].has_key(start_event): top_state = \ self._state_table[last_state][start_event] else: newest_state += 1 top_state = newest_state self._state_table[top_state] = {} self._state_table[last_state][start_event] = \ top_state self._state_table[top_state][end_event] = \ last_state last_state = top_state self._live_states.append(top_state) self._state = START_STATE self.chunk_completed = 0 return def event(self, is_start, ns, local): """ Register an event and effect ant state transitions found in the state table """ #We only have a chunk ready for the handler in #the explicit case below self.chunk_completed = 0 lookup_from = self._state_table[self._state] if lookup_from.has_key((is_start, ns, local)): new_state = lookup_from[(is_start, ns, local)] #If we have completed a chunk, we set a flag for #The chunker if (self._state in self._live_states and new_state not in self._live_states): self.chunk_completed = 1 self._state = new_state return self._state def is_live(self): """ 1 if the curent state is considered live, else 0 """ return self._state in self._live_states def event_nop(self, is_start, ns, local): pass def is_live_nop(self): return 1 class sax2dom_chunker(sax.ContentHandler): """ Note: ignores nodes prior to the document element, such as PIs and text nodes This filter is only designed to work if you set features sax.handler.feature_namespaces and sax.handler.feature_namespace_prefixes to 1 on the parser you use. It will not work on drivers that do not support these features. The default drv_expat works fine in this case, but of course has but very limited DTD processing. It also collapses CDATA sections into plain text trim_to_paths - a list of lists of tuples. Each tuple is of the form (namespace, local-name), providing one segment in a path of contained elements [ [ (None, u'monty'), (None, u'python') ], [ (None, u'monty'), (None, u'spam'), ('urn:dummy', u'eggs') ] ] If None (the default, a DOM node will be created representing the entire tree. chunk_consumer - a callable object taking a DOM node. It will be invoked as each DOM chunk is prepared. domimpl - DOM implemention to build, e.g. mindom (the default) or cDomlette or pxdom (if you have the right third-party packages installed). 
owner_doc - for advanced uses, if you want to use an existing DOM document object as the owner of all created nodes. """ def __init__(self, trim_to_paths=None, chunk_consumer=None, domimpl=xml.dom.minidom.getDOMImplementation(), owner_doc=None ): self._impl = domimpl if owner_doc: self._owner_doc = owner_doc else: dt = self._impl.createDocumentType(DUMMY_DOCELEM, None, '') self._owner_doc = self._impl.createDocument( DUMMY_DOCELEM, DUMMY_DOCELEM, dt) #Create a docfrag to hold all the generated nodes. root_node = self._owner_doc.createDocumentFragment() self._nodeStack = [ root_node ] self.state_machine = _state_machine(trim_to_paths) self._chunk_consumer = chunk_consumer return def get_root_node(self): """ Only useful if the user does not register trim paths If so, then after SAX processing the user can call this method to retrieve resulting DOm representing the entire document """ return self._nodeStack[0] #Overridden DocumentHandler methods def startElementNS(self, name, qname, attribs): self.state_machine.event(1, name[0], name[1]) if not self.state_machine.is_live(): return (ns, local) = name new_element = self._owner_doc.createElementNS(ns, qname) for ((attr_ns, lname), value) in attribs.items(): if attr_ns is not None: attr_qname = attribs.getQNameByName((attr_ns, lname)) else: attr_qname = lname attr = self._owner_doc.createAttributeNS( attr_ns, attr_qname) attr_qname = attribs.getQNameByName((attr_ns, lname)) attr.value = value new_element.setAttributeNodeNS(attr) self._nodeStack.append(new_element) return def endElementNS(self, name, qname): self.state_machine.event(0, name[0], name[1]) if not self.state_machine.is_live(): if (self._chunk_consumer and self.state_machine.chunk_completed): #Complete the element being closed because it #Is the last bit of a DOM to be fed to the consumer new_element = self._nodeStack[TOP] del self._nodeStack[TOP] self._nodeStack[TOP].appendChild(new_element) #Feed the consumer self._chunk_consumer(self._nodeStack[0]) #Start all over with a new doc frag so the old #One's memory can be reclaimed root_node = self._owner_doc.createDocumentFragment() self._nodeStack = [ root_node ] return new_element = self._nodeStack[TOP] del self._nodeStack[TOP] self._nodeStack[TOP].appendChild(new_element) return def processingInstruction(self, target, data): if self.state_machine.is_live(): pi = self._owner_doc.createProcessingInstruction( target, data) self._nodeStack[TOP].appendChild(pi) return #Overridden LexicalHandler methods def comment(self, text): if self.state_machine.is_live(): new_comment = self._owner_doc.createComment(text) self._nodeStack[TOP].appendChild(new_comment) return def characters(self, chars): if self.state_machine.is_live(): new_text = self._owner_doc.createTextNode(chars) self._nodeStack[TOP].appendChild(new_text) return
The idea here is that you provide a source XML file and a list of element paths, each of which defines a subset of the document marking the boundaries for chunking. For example, to process labels you'd give the chunker the path [ (None, u'labels'), (None, u'label') ] in order to generate DOM chunks representing each label element within the top-level labels element. You would also give the chunker a callback function (or any callable object) that would be invoked with each DOM chunk as it is generated.
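Each segment's first slot is the element's namespace; None means no namespace, as in the sample file. Purely as a hypothetical illustration (the namespace URI below is invented, not part of this article's examples), a namespaced version of the same path would look like this:

#Hypothetical: labels.xml does not actually use a namespace
NS = u'http://example.com/labels'
LABEL_PATH = [ (NS, u'labels'), (NS, u'label') ]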
Class _state_machine is a specialized state-machine class for the chunker, used to encapsulate the complex state logic and keep the SAX handler simple. The sax2dom_chunker class is a SAX handler that operates on an XML document and can either feed the resulting DOM chunks to a consumer routine or build and hold on to a node representing the entire document. You can use any SAX filters you please with it as a way of increasing the sophistication of the processing.
For example, I have not included code to consolidate multiple contiguous text events into single ones, so the resulting DOM could be denormalized. The chunker also doesn't strip whitespace between elements, which you might decide you don't care about and don't want in your DOM. In either case you can use a filter upstream from the chunker; for the first case, I covered such a filter in Building Dictionaries With SAX.
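For the whitespace case, here is a minimal sketch of such an upstream filter (my own, not from the original article's listings), built on the standard library's xml.sax.saxutils.XMLFilterBase; note that it naively drops all whitespace-only text events, including any that might be significant inside mixed content, and show_chunk is just a stand-in consumer:

"""strip_ws_filter.py -- illustrative sketch, not a numbered listing"""
from xml import sax
from xml.sax.saxutils import XMLFilterBase
import sax2dom_chunker

class strip_whitespace(XMLFilterBase):
    def characters(self, chars):
        #Drop whitespace-only text events; beware that this also
        #discards significant whitespace in mixed content
        if chars.strip():
            XMLFilterBase.characters(self, chars)

def show_chunk(docfrag):
    #Stand-in consumer: dump each chunk as XML
    print docfrag.firstChild.toxml().encode('utf-8')

handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[[ (None, u'labels'), (None, u'label') ]],
    chunk_consumer=show_chunk
    )

#Interpose the filter between the parser and the chunker
reader = strip_whitespace(sax.make_parser())
reader.setContentHandler(handler)
reader.setFeature(sax.handler.feature_namespaces, 1)
reader.setFeature(sax.handler.feature_namespace_prefixes, 1)
reader.parse('labels.xml')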
Using the DOM Chunker
Listing 3 is an example of how to use the chunker to process each label as an isolated DOM chunk. The goal is to print out all the cities in which someone named "Eliot" lives.
Listing 3: Uses the Chunker in Listing 2 to Process the Labels File Label by Label

"""
Print out all the cities in which someone named "Eliot" lives
"""
from xml import sax
import sax2dom_chunker

SEARCH_FOR_STRING = 'eliot'

def process_label(docfrag):
    #Invoked for each label
    label = docfrag.firstChild
    name = label.getElementsByTagNameNS(None, 'name')[0]
    city = label.getElementsByTagNameNS(None, 'city')[0]
    #For simplicity, pretend the above nodes are normalized
    #and have no child elements
    name_text = name.firstChild.data
    city_text = city.firstChild.data
    if name_text.lower().find(SEARCH_FOR_STRING) != -1:
        print city_text.encode('utf-8')
    return

#The path for chunking by label element
#Equivalent to the XPattern /labels/label
LABEL_PATH = [ (None, u'labels'), (None, u'label') ]

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )

parser = sax.make_parser()
#The chunker is the SAX handler
parser.setContentHandler(handler)
#The chunker *requires* these features
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
parser.parse('labels.xml')
The result of running python listing3.py labels.xml is simply "Stamford" printed to the console. Again, the neat trick that comes from using the chunker is that even though we take advantage of the relatively friendly DOM facility (it would be friendlier still if I were using XPath, but that would once again require a third-party tool), the program does not take much more memory for a billion-label file than for a two-label file. Listing 4 is a similar script that shows how you can create chunks according to multiple paths.
""" Print out all people and their street address """ from xml import sax import sax2dom_chunker #Paths for chunking by label element #Equivalent to the XPattern # ( /labels/label/name|/labels/label/address/street ) PATHS = [ [ (None, u'labels'), (None, u'label'), (None, u'name') ], [ (None, u'labels'), (None, u'label'), (None, u'address'), (None, u'street') ] ] def process_chunk(docfrag): #Invoked for each name or address. We're getting the leaf element #of the path itself (in a doc frag wrapper) so just print its text #content text = docfrag.firstChild.firstChild.data print text.encode('utf-8') return #Create an instance of the chunker handler = sax2dom_chunker.sax2dom_chunker( trim_to_paths=PATHS, chunk_consumer=process_chunk ) parser = sax.make_parser() #The chunker is the SAX handler parser.setContentHandler(handler) #The chunker *requires* these features parser.setFeature(sax.handler.feature_namespaces, 1) parser.setFeature(sax.handler.feature_namespace_prefixes, 1) parser.parse('labels.xml')
The output is in Listing 5.
Listing 5: Output of "python listing4.py labels.xml"Thomas Eliot 3 Prufrock Lane Ezra Pound 45 Usura Place
If you have 4Suite installed, you can save even more memory (and gain some speed) by using the cDomlette implementation. You can also use Andrew Clover's pxdom (to gain W3C DOM conformance, but not performance). In fact, you can use any DOM that follows Python-library DOM conventions by changing the implementation. As an example, to use cDomlette you would replace the following snippet of code in Listing 3:
#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
with the following snippet:
from Ft.Xml.Domlette import implementation

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    domimpl=implementation,
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
Wrap Up
This technique is much more generally applicable than the SAX dictionary generator I presented earlier. It could be made more general still if the chunker state processing could be adapted to use the full power of XPattern, so that you could process, say, the first ten records in a billion-record file and not create DOM chunks for the rest (/labels/label[position() <= 10]).
I intend to write a more powerful chunker along these lines, but it would use the XPattern parsing facilities in 4Suite. The chunker I presented in this article works without third-party packages and its element path implementation probably covers the 80% use-case for such decomposition.
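Until then, you can crudely approximate that positional predicate with the chunker as it stands, by wrapping the consumer so that it counts chunks and aborts the parse once it has seen enough. The sketch below is my own workaround, not a feature of the module; print_label is a stand-in consumer, and the exception is simply a way to unwind out of parser.parse() early:

"""limit_chunks.py -- illustrative sketch, not a numbered listing"""
from xml import sax
import sax2dom_chunker

class _enough_chunks(Exception):
    """Raised to break out of parser.parse() early"""
    pass

def print_label(docfrag):
    #Stand-in consumer: dump each label chunk as XML
    print docfrag.firstChild.toxml().encode('utf-8')

def limit_consumer(consumer, limit):
    #Wrap consumer so that parsing stops after a given chunk count
    count = [0]  #mutable cell so the nested function can update it
    def limited(docfrag):
        consumer(docfrag)
        count[0] += 1
        if count[0] >= limit:
            raise _enough_chunks()
    return limited

handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[[ (None, u'labels'), (None, u'label') ]],
    chunk_consumer=limit_consumer(print_label, 10)
    )

parser = sax.make_parser()
parser.setContentHandler(handler)
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
try:
    parser.parse('labels.xml')
except _enough_chunks:
    pass  #we deliberately stopped after the first ten labels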
In this month's news from the field, Adam Souzis announced Rx4RDF and Rhizome 0.3.0, an update to the RDF-based web application framework and Wiki toolkit. Changes include documentation and security improvements and Wiki feature enhancements. See the full announcement.
Walter Dörwald announced XIST 2.5. Billed as an "object-oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. "Every XML element type corresponds to a Python class, and these Python classes provide a conversion method to transform the XML tree (e.g., into HTML)." This release features some API improvements, schema validation (apparently not based on any portable schema specification), support for Holger Krekel's XPython (see below), bug fixes, and more. See the announcement.
Holger Krekel presented at EuroPython 2004 a concept he calls XPython, a new templating syntax for XML and HTML generation that would use extensions to Python syntax to embed templates closely into code. He currently has an experimental implementation consisting of a "300 line patch to C Python."
In other EuroPython 2004 news, Martijn Faassen is working on lxml -- "a sane Python wrapper for libxml." Currently available only in CVS, lxml addresses the concerns that the official libxml bindings are not Pythonic (too close to the underlying C), are not Python-Unicode aware (UTF-8 only), are unsafe (can cause core dumps), require manual memory management, and are poorly documented. lxml is written using Pyrex and for now supports only the most basic element tree APIs.