Decomposition, Process, Recomposition
July 28, 2004
One unfortunate side-effect of the success of XML is that some users tend to just throw all their familiar data-processing tasks into XML shapes without pausing much to consider the strengths and weaknesses of XML.
This lack of attentiveness often leads people to very dangerous practices, such as treating XML as if it were an ASCII or even byte-oriented format. I have warned sharply about the ugly effects of such confusion in a few previous articles in this column.
Another common effect I've seen is the tendency to create multi-megabyte or even gigabyte monolithic XML files. XML is so flexible for data representation because of its nature as an annotated hierarchy. But this very nature also makes efficient processing quite difficult, especially with regard to scaling with the number of nodes.
So my first line of advice has always been: don't go processing gigabyte files in XML formats. I have been working with XML for about 8 years. I have used XML in numerous ways with numerous tools for numerous purposes. I have never come across a natural need for an XML file larger than a few megabytes. There may be terabytes of raw data in a system, but if you're following XML (and classic data-design) best practices, that data should be structured into a number of files in some sensible way.
If you find yourself with a monster XML file -- perhaps you receive it that way from another source outside your control -- the best way to handle it is through the classic problem-solving technique, divide and conquer. Decompose the problem into smaller bits, solve each bit to produce a partial solution, then combine the partial solutions into the overall solution.
Certainly this requires the problem to have certain mathematical properties, but in almost every case of monster XML files I've seen, an elegant decomposition is available. In this article I present some tools and techniques for processing large XML files by decomposition.
DOM in Bite-size Bits
I'll pretend Listing 1 is the gigantic file I have to cut down to size in order to process.
Listing 1: Sample XML file (labels.xml) containing Address Labels

<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season...
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
To get into the spirit of the problem, imagine that rather than the two label elements, this file had a billion. For many tasks you might have in mind with such a huge file, you could simply use SAX. Say you wanted to print out a listing of all the names in the file. It's not hard to do this in SAX, since all you need to worry about at any point is whether you happen to be visiting character data within a name element.
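Here is a minimal sketch of that plain-SAX approach (my own illustration, not one of the article's numbered listings). It buffers character events while inside a name element, since SAX may report a single text node in several pieces, and prints the accumulated text when the element closes:

"""name_lister.py -- illustrative sketch, not a numbered listing"""
from xml import sax

class name_lister(sax.ContentHandler):
    def __init__(self):
        self._buffer = None
    def startElement(self, name, attrs):
        if name == u'name':
            #Start accumulating character data
            self._buffer = []
    def endElement(self, name):
        if name == u'name':
            print u''.join(self._buffer).encode('utf-8')
            self._buffer = None
    def characters(self, chars):
        #characters() may fire more than once per text node
        if self._buffer is not None:
            self._buffer.append(chars)

parser = sax.make_parser()
parser.setContentHandler(name_lister())
parser.parse('labels.xml')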
If you have an operation in mind that needs more awareness of context or state, then it's time to sharpen up your state machine chops -- and maintaining such state machines, even for SAX gurus, eventually becomes a painful task. Node trees such as DOM make managing state and context a cinch because they allow random access to nodes, but loading the entire billion-record file into memory to perform node operations is probably beyond the capabilities of all but the highest-end computers, even for an implementation such as cDomlette that uses a variety of tricks to lower its memory footprint.
You might be able to make some headway by using a technique such as the one I discussed in Building Dictionaries With SAX, but perhaps the best of both worlds would be a tool that uses SAX to build little DOMs, so that you can process each, discard each, and move on to the next record.
In my current example, it would be great to have a SAX tool that creates a DOM for each label element in turn. This way you have the full capability of DOM available for processing each label, but you never use much more than the memory needed for a DOM comprising a single label. If you had such a tool, then you could very efficiently solve problems that can be broken down into a subproblem for each label. Such a tool is available in libxml's xmlTextReader (see Using libxml in Python), but libxml's implementation can be very idiosyncratic. Besides, some people dislike installing large third-party packages.
I happened to write a SAX tool for such a purpose that works with the built-in Python SAX and minidom facilities and sticks to the standard library's SAX and DOM idioms. See Listing 2 for sax2dom_chunker.py, a module that uses SAX to create little chunks of DOM according to a set of element paths you give it.
""" sax2dom_chunker.py A SAX handler that takes a set of element paths and creates a series of DOM chunks matching the element paths for individual processing. Designed for Python 2.2. or greater Copyright 2004 Fourthought Inc, USA. This work is licensed under Creative Commons Attribution 1.0 For details: http://creativecommons.org/licenses/by/1.0/ """ from xml import sax from xml.dom import XML_NAMESPACE, XMLNS_NAMESPACE, EMPTY_NAMESPACE import xml.dom.minidom DUMMY_DOCELEM = u'dummy' START_STATE = 0 TOP = -1 class _state_machine: """ A simple state machine specialized for DOM chunking from SAX A state is "live" when it represents the successful completion of a path. This is generally a signal to the handler using this state machine to start creating the DOM fragment from the subset of SAX events until we transit to a non-live state """ def __init__(self, trim_to_paths): if not trim_to_paths: self.event = self.event_nop self.is_live = self.is_live_nop return self._state_table = {START_STATE: {}} self._live_states = [] #Use the given trim paths to generate a state table newest_state = START_STATE for path in trim_to_paths: last_state = START_STATE for segment in path: start_event = (1, segment[0], segment[1]) end_event = (0, segment[0], segment[1]) if self._state_table[last_state].has_key(start_event): top_state = \ self._state_table[last_state][start_event] else: newest_state += 1 top_state = newest_state self._state_table[top_state] = {} self._state_table[last_state][start_event] = \ top_state self._state_table[top_state][end_event] = \ last_state last_state = top_state self._live_states.append(top_state) self._state = START_STATE self.chunk_completed = 0 return def event(self, is_start, ns, local): """ Register an event and effect ant state transitions found in the state table """ #We only have a chunk ready for the handler in #the explicit case below self.chunk_completed = 0 lookup_from = self._state_table[self._state] if lookup_from.has_key((is_start, ns, local)): new_state = lookup_from[(is_start, ns, local)] #If we have completed a chunk, we set a flag for #The chunker if (self._state in self._live_states and new_state not in self._live_states): self.chunk_completed = 1 self._state = new_state return self._state def is_live(self): """ 1 if the curent state is considered live, else 0 """ return self._state in self._live_states def event_nop(self, is_start, ns, local): pass def is_live_nop(self): return 1 class sax2dom_chunker(sax.ContentHandler): """ Note: ignores nodes prior to the document element, such as PIs and text nodes This filter is only designed to work if you set features sax.handler.feature_namespaces and sax.handler.feature_namespace_prefixes to 1 on the parser you use. It will not work on drivers that do not support these features. The default drv_expat works fine in this case, but of course has but very limited DTD processing. It also collapses CDATA sections into plain text trim_to_paths - a list of lists of tuples. Each tuple is of the form (namespace, local-name), providing one segment in a path of contained elements [ [ (None, u'monty'), (None, u'python') ], [ (None, u'monty'), (None, u'spam'), ('urn:dummy', u'eggs') ] ] If None (the default, a DOM node will be created representing the entire tree. chunk_consumer - a callable object taking a DOM node. It will be invoked as each DOM chunk is prepared. domimpl - DOM implemention to build, e.g. mindom (the default) or cDomlette or pxdom (if you have the right third-party packages installed). 
owner_doc - for advanced uses, if you want to use an existing DOM document object as the owner of all created nodes. """ def __init__(self, trim_to_paths=None, chunk_consumer=None, domimpl=xml.dom.minidom.getDOMImplementation(), owner_doc=None ): self._impl = domimpl if owner_doc: self._owner_doc = owner_doc else: dt = self._impl.createDocumentType(DUMMY_DOCELEM, None, '') self._owner_doc = self._impl.createDocument( DUMMY_DOCELEM, DUMMY_DOCELEM, dt) #Create a docfrag to hold all the generated nodes. root_node = self._owner_doc.createDocumentFragment() self._nodeStack = [ root_node ] self.state_machine = _state_machine(trim_to_paths) self._chunk_consumer = chunk_consumer return def get_root_node(self): """ Only useful if the user does not register trim paths If so, then after SAX processing the user can call this method to retrieve resulting DOm representing the entire document """ return self._nodeStack[0] #Overridden DocumentHandler methods def startElementNS(self, name, qname, attribs): self.state_machine.event(1, name[0], name[1]) if not self.state_machine.is_live(): return (ns, local) = name new_element = self._owner_doc.createElementNS(ns, qname) for ((attr_ns, lname), value) in attribs.items(): if attr_ns is not None: attr_qname = attribs.getQNameByName((attr_ns, lname)) else: attr_qname = lname attr = self._owner_doc.createAttributeNS( attr_ns, attr_qname) attr_qname = attribs.getQNameByName((attr_ns, lname)) attr.value = value new_element.setAttributeNodeNS(attr) self._nodeStack.append(new_element) return def endElementNS(self, name, qname): self.state_machine.event(0, name[0], name[1]) if not self.state_machine.is_live(): if (self._chunk_consumer and self.state_machine.chunk_completed): #Complete the element being closed because it #Is the last bit of a DOM to be fed to the consumer new_element = self._nodeStack[TOP] del self._nodeStack[TOP] self._nodeStack[TOP].appendChild(new_element) #Feed the consumer self._chunk_consumer(self._nodeStack[0]) #Start all over with a new doc frag so the old #One's memory can be reclaimed root_node = self._owner_doc.createDocumentFragment() self._nodeStack = [ root_node ] return new_element = self._nodeStack[TOP] del self._nodeStack[TOP] self._nodeStack[TOP].appendChild(new_element) return def processingInstruction(self, target, data): if self.state_machine.is_live(): pi = self._owner_doc.createProcessingInstruction( target, data) self._nodeStack[TOP].appendChild(pi) return #Overridden LexicalHandler methods def comment(self, text): if self.state_machine.is_live(): new_comment = self._owner_doc.createComment(text) self._nodeStack[TOP].appendChild(new_comment) return def characters(self, chars): if self.state_machine.is_live(): new_text = self._owner_doc.createTextNode(chars) self._nodeStack[TOP].appendChild(new_text) return
The idea here is that you provide a source XML file and a list of element paths, each of which defines a subset of the document marking the boundaries for chunking. For example, to process labels you'd give the chunker the path [ (None, u'labels'), (None, u'label') ] in order to generate DOM chunks representing each label element within the top-level labels element. You would also give the chunker a callback function (or any callable object) that would be invoked with each DOM chunk as it is generated.
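Each segment's first slot is the element's namespace; None means no namespace, as in the sample file. Purely as a hypothetical illustration (the namespace URI below is invented, not part of this article's examples), a namespaced version of the same path would look like this:

#Hypothetical: labels.xml does not actually use a namespace
NS = u'http://example.com/labels'
LABEL_PATH = [ (NS, u'labels'), (NS, u'label') ]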
Class _state_machine is a specialized state-machine class for the chunker, used to encapsulate the complex state logic and keep the SAX handler simple. The sax2dom_chunker class is a SAX handler that operates on an XML document and can either feed the resulting DOM chunks to a consumer routine or build and hold on to a node representing the entire document. You can use any SAX filters you please with it as a way of increasing the sophistication of the processing.
For example, I have not included code to consolidate multiple contiguous text events into single ones, so the resulting DOM could be denormalized. The chunker also doesn't strip whitespace between elements, which you might decide you don't care about and don't want in your DOM. In either case you can use a filter upstream from the chunker; for the first case, I covered such a filter in Building Dictionaries With SAX.
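For the whitespace case, here is a minimal sketch of such an upstream filter (my own, not from the original article's listings), built on the standard library's xml.sax.saxutils.XMLFilterBase; note that it naively drops all whitespace-only text events, including any that might be significant inside mixed content, and show_chunk is just a stand-in consumer:

"""strip_ws_filter.py -- illustrative sketch, not a numbered listing"""
from xml import sax
from xml.sax.saxutils import XMLFilterBase
import sax2dom_chunker

class strip_whitespace(XMLFilterBase):
    def characters(self, chars):
        #Drop whitespace-only text events; beware that this also
        #discards significant whitespace in mixed content
        if chars.strip():
            XMLFilterBase.characters(self, chars)

def show_chunk(docfrag):
    #Stand-in consumer: dump each chunk as XML
    print docfrag.firstChild.toxml().encode('utf-8')

handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[[ (None, u'labels'), (None, u'label') ]],
    chunk_consumer=show_chunk
    )

#Interpose the filter between the parser and the chunker
reader = strip_whitespace(sax.make_parser())
reader.setContentHandler(handler)
reader.setFeature(sax.handler.feature_namespaces, 1)
reader.setFeature(sax.handler.feature_namespace_prefixes, 1)
reader.parse('labels.xml')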
Using the DOM Chunker
Listing 3 is an example of how to use the chunker to process each label as an isolated DOM chunk. The goal is to print out all the cities in which someone named "Eliot" lives.
Listing 3: Uses the Chunker in Listing 2 to Process the Labels File Label by Label

"""
Print out all the cities in which someone named "Eliot" lives
"""
from xml import sax
import sax2dom_chunker

SEARCH_FOR_STRING = 'eliot'

def process_label(docfrag):
    #Invoked for each label
    label = docfrag.firstChild
    name = label.getElementsByTagNameNS(None, 'name')[0]
    city = label.getElementsByTagNameNS(None, 'city')[0]
    #For simplicity, pretend the above nodes are normalized
    #and have no child elements
    name_text = name.firstChild.data
    city_text = city.firstChild.data
    if name_text.lower().find(SEARCH_FOR_STRING) != -1:
        print city_text.encode('utf-8')
    return

#The path for chunking by label element
#Equivalent to the XPattern /labels/label
LABEL_PATH = [ (None, u'labels'), (None, u'label') ]

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )

parser = sax.make_parser()
#The chunker is the SAX handler
parser.setContentHandler(handler)
#The chunker *requires* these features
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
parser.parse('labels.xml')
The result of running python listing3.py labels.xml is simply "Stamford" printed to the console. Again, the neat trick that comes from using the chunker is that even though we take advantage of the relatively friendly DOM facility (it would be friendlier still if I were using XPath, but that would once again require a third-party tool), the program does not take much more memory for a billion-label file than for a two-label file. Listing 4 is a similar script that shows how you can create chunks according to multiple paths.
""" Print out all people and their street address """ from xml import sax import sax2dom_chunker #Paths for chunking by label element #Equivalent to the XPattern # ( /labels/label/name|/labels/label/address/street ) PATHS = [ [ (None, u'labels'), (None, u'label'), (None, u'name') ], [ (None, u'labels'), (None, u'label'), (None, u'address'), (None, u'street') ] ] def process_chunk(docfrag): #Invoked for each name or address. We're getting the leaf element #of the path itself (in a doc frag wrapper) so just print its text #content text = docfrag.firstChild.firstChild.data print text.encode('utf-8') return #Create an instance of the chunker handler = sax2dom_chunker.sax2dom_chunker( trim_to_paths=PATHS, chunk_consumer=process_chunk ) parser = sax.make_parser() #The chunker is the SAX handler parser.setContentHandler(handler) #The chunker *requires* these features parser.setFeature(sax.handler.feature_namespaces, 1) parser.setFeature(sax.handler.feature_namespace_prefixes, 1) parser.parse('labels.xml')
The output is in Listing 5.
Listing 5: Output of "python listing4.py labels.xml"Thomas Eliot 3 Prufrock Lane Ezra Pound 45 Usura Place
If you have 4Suite installed, you can save even more memory (and gain some speed) by using the cDomlette implementation. You can also use Andrew Clover's pxdom (to gain W3C DOM conformance, but not performance). In fact, you can use any DOM that follows Python-library DOM conventions by changing the implementation. As an example, to use cDomlette you would replace the following snippet of code in Listing 3:
#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
with the following snippet:
from Ft.Xml.Domlette import implementation

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    domimpl=implementation,
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
Wrap Up
This technique is much more generally applicable than the SAX dictionary generator I presented earlier. It could be made more general still if the chunker state processing could be adapted to use the full power of XPattern, so that you could process, say, the first ten records in a billion-record file and not create DOM chunks for the rest (/labels/label[position() <= 10]).
I intend to write a more powerful chunker along these lines, but it would use the XPattern parsing facilities in 4Suite. The chunker I presented in this article works without third-party packages and its element path implementation probably covers the 80% use-case for such decomposition.
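Until then, you can crudely approximate that positional predicate with the chunker as it stands, by wrapping the consumer so that it counts chunks and aborts the parse once it has seen enough. The sketch below is my own workaround, not a feature of the module; print_label is a stand-in consumer, and the exception is simply a way to unwind out of parser.parse() early:

"""limit_chunks.py -- illustrative sketch, not a numbered listing"""
from xml import sax
import sax2dom_chunker

class _enough_chunks(Exception):
    """Raised to break out of parser.parse() early"""
    pass

def print_label(docfrag):
    #Stand-in consumer: dump each label chunk as XML
    print docfrag.firstChild.toxml().encode('utf-8')

def limit_consumer(consumer, limit):
    #Wrap consumer so that parsing stops after a given chunk count
    count = [0]  #mutable cell so the nested function can update it
    def limited(docfrag):
        consumer(docfrag)
        count[0] += 1
        if count[0] >= limit:
            raise _enough_chunks()
    return limited

handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[[ (None, u'labels'), (None, u'label') ]],
    chunk_consumer=limit_consumer(print_label, 10)
    )

parser = sax.make_parser()
parser.setContentHandler(handler)
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
try:
    parser.parse('labels.xml')
except _enough_chunks:
    pass  #we deliberately stopped after the first ten labels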
In this month's news from the field, Adam Souzis announced Rx4RDF and Rhizome 0.3.0, an update to the RDF-based web application framework and Wiki toolkit. Changes include documentation and security improvements and Wiki feature enhancements. See the full announcement.
Walter Dörwald announced XIST 2.5. Billed as an "object-oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. "Every XML element type corresponds to a Python class, and these Python classes provide a conversion method to transform the XML tree (e.g., into HTML)." This release features some API improvements, schema validation (apparently not based on any portable schema specification), support for Holger Krekel's XPython (see below), bug fixes, and more. See the announcement.
Holger Krekel presented at EuroPython 2004 a concept he calls XPython, a new templating syntax for XML and HTML generation that would use extensions to Python syntax to embed templates closely into code. He currently has an experimental implementation consisting of a "300 line patch to C Python."
In other EuroPython 2004 news, Martijn Faassen is working on lxml -- "a sane Python wrapper for libxml." Currently available only in CVS, lxml addresses the concerns that the official libxml bindings are not Pythonic (too close to the underlying C), are not Python-Unicode aware (UTF-8 only), are unsafe (can cause core dumps), require manual memory management, and are poorly documented. lxml is written using Pyrex and for now supports only the most basic element tree APIs.