Processing Atom 1.0
September 14, 2005
In the fast-moving world of weblogs and Web-based marketing, the approval of the Atom Format 1.0 by the Internet Engineering Task Force (IETF) as a Proposed Standard is a significant and lasting development. Atom is a very carefully designed format for syndicating the contents of weblogs as they are updated, the usual territory of RSS, but its possible uses are far more general, as illustrated in the description on the home page:
Atom is the name of an XML-based Web content and metadata syndication format, and an application-level protocol for publishing and editing Web resources belonging to periodically updated websites.
All Atom feeds must be well-formed XML documents, and are identified with the application/atom+xml media type.
Atom is a very important development in the XML and Web world. Atom technology is already deployed in many areas (though not all up-to-date with Atom 1.0), and parsing and processing Atom is quickly becoming an important task for web developers. In this article, I will show several approaches to reading Atom 1.0 in Python. All the code is designed to work with Python 2.3, or more recent, and is tested with Python 2.4.1.
The example I'll be using of an Atom document is a modified version of the introduction to Atom on the home page, reproduced here in listing 1.
Listing 1 (atomexample.xml). Atom Format 1.0 Example
<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:xh="http://www.w3.org/1999/xhtml">
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <title>Example Feed</title>
  <updated>2005-09-02T18:30:02Z</updated>
  <link href="http://example.org/"/>
  <author>
    <name>John Doe</name>
  </author>
  <entry>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2005/09/02/robots"/>
    <updated>2005-09-02T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
  <entry>
    <id>urn:uuid:8eb00d01-d632-40d4-8861-f2ed613f2c30</id>
    <title type="xhtml">
      <xh:div>
        The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
      </xh:div>
    </title>
    <link href="http://example.org/2005/09/01/fox"/>
    <updated>2005-09-01T12:15:00Z</updated>
    <summary>jumps over the lazy dog</summary>
  </entry>
</feed>
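Since every Atom feed must be a well-formed XML document with a feed root element in the Atom namespace, a quick sanity check with nothing but the standard library might look like this (a minimal sketch; the embedded feed is an abbreviated version of listing 1):

```python
from xml.dom import minidom

ATOM_NS = 'http://www.w3.org/2005/Atom'

# An abbreviated version of the listing 1 feed, embedded for illustration
feed_xml = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <title>Example Feed</title>
  <updated>2005-09-02T18:30:02Z</updated>
</feed>"""

# parseString raises an exception if the document is not well-formed XML
doc = minidom.parseString(feed_xml)
root = doc.documentElement
# The document is an Atom feed if the root is a feed element in the Atom namespace
is_atom = (root.namespaceURI == ATOM_NS and root.localName == 'feed')
```

If the document is not well-formed, the parse call itself raises an exception, which is the correct strict behavior for Atom.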
Using MiniDOM
If you want to process Atom with no additional dependencies besides Python, you can do so using MiniDOM. MiniDOM isn't the most efficient way to parse XML, but Atom files tend to be small, and rarely get to the megabyte range that bogs down MiniDOM. If by some chance you are dealing with very large Atom files, you can use PullDOM, which works well with Atom because of the way the format can be processed in bite-sized chunks. MiniDOM isn't the most convenient API available, either, but it is the most convenient approach in the Python standard library. Listing 2 is MiniDOM code to produce an outline of an Atom feed, containing much of the information you would use if you were syndicating the feed.
Listing 2. MiniDOM Code to Print a Text Outline of an Atom Feed
from xml.dom import minidom
from xml.dom import EMPTY_NAMESPACE

ATOM_NS = 'http://www.w3.org/2005/Atom'

doc = minidom.parse('atomexample.xml')
#Ensure that all text nodes can be simply retrieved
doc.normalize()

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if element.getAttributeNS(EMPTY_NAMESPACE, u'type') == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ c.toxml('utf-8') for c in element.childNodes ]
        #And stitch it together
        content = ''.join(childtext).strip()
        return content
    else:
        return element.firstChild.data.encode('utf-8')

#Process the overall feed:
#First title element in doc order is the feed title
feedtitle = doc.getElementsByTagNameNS(ATOM_NS, u'title')[0]
#The feed title is an atom text construct: no markup
#So just print the text node content
print 'Feed title:', get_text_from_construct(feedtitle)
feedlink = doc.getElementsByTagNameNS(ATOM_NS, u'link')[0]
print 'Feed link:', feedlink.getAttributeNS(EMPTY_NAMESPACE, u'href')
print
print 'Entries:'
for entry in doc.getElementsByTagNameNS(ATOM_NS, u'entry'):
    #First title element in doc order within the entry is the title
    entrytitle = entry.getElementsByTagNameNS(ATOM_NS, u'title')[0]
    entrylink = entry.getElementsByTagNameNS(ATOM_NS, u'link')[0]
    etitletext = get_text_from_construct(entrytitle)
    elinktext = entrylink.getAttributeNS(EMPTY_NAMESPACE, u'href')
    print etitletext, '(', elinktext, ')'
The code to access XML is typical of DOM and, as such, it's rather clumsy when compared to much Python code. The normalization step near the beginning of the listing helps eliminate even more complexity when dealing with text content. Many Atom elements are defined using the atomTextConstruct pattern, which can be plain text, with no embedded markup. (HTML is allowed, if escaped, and if you flag this case in the type attribute.) Such elements can also contain well-formed XHTML fragments wrapped in a div. The get_text_from_construct function handles both cases transparently, and so it is generally a utility routine for extracting content from compliant Atom elements. In this listing, I use it to access the contents of the title element, which is in XHTML form in one of the entries in listing 1. Try running listing 2 and you should get the following output.
$ python listing2.py
Feed title: Example Feed
Feed link: http://example.org/

Entries:
Atom-Powered Robots Run Amok ( http://example.org/2005/09/02/robots )
<xh:div>
  The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
</xh:div> ( http://example.org/2005/09/01/fox )
Handling Dates
Handling Atom dates in Python is a topic that deserves closer attention. Atom dates are specified in the atomDateConstruct pattern, of which the specification says:
A Date construct is an element whose content MUST conform to the "date-time" production in [RFC3339]. In addition, an uppercase "T" character MUST be used to separate date and time, and an uppercase "Z" character MUST be present in the absence of a numeric time zone offset.
The examples given are:
- 2003-12-13T18:30:02Z
- 2003-12-13T18:30:02.25Z
- 2003-12-13T18:30:02+01:00
- 2003-12-13T18:30:02.25+01:00
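For the first and simplest of these forms, with the literal "Z" suffix, you can get by with strptime (a minimal sketch using the datetime module; it deliberately ignores fractional seconds and numeric offsets, which is exactly where the third-party help recommended below comes in):

```python
from datetime import datetime

def parse_utc(text):
    # Handles only the plain "Z" (UTC) form of RFC 3339 date-times;
    # fractional seconds and numeric time-zone offsets need more work
    return datetime.strptime(text, '%Y-%m-%dT%H:%M:%SZ')

dt = parse_utc('2003-12-13T18:30:02Z')
```

A malformed date raises ValueError, so even this crude approach fails loudly rather than silently.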
You may be surprised to find that Python is rather limited in the built-in means it provides for parsing such dates. There are good reasons for this: many aspects of date parsing are very hard and can depend a lot on application-specific needs. Python 2.3 introduced the handy datetime data type, which is the recommended way to store and exchange dates, but you have to do the parsing into datetime yourself, and handle the complex task of time-zone processing as well. Or you have to use a third-party routine that does this for you. I recommend that you complement Python's built-in facilities with Gustavo Niemeyer's DateUtil. (Unfortunately that link uses HTTPS with an expired certificate, so you may have to click through a bunch of warnings, but it's worth it.) In my case I downloaded the 1.0 tar.bz2 and installed using python setup.py install.
Using DateUtil, the following snippet reads the feed's updated element into a datetime object:
from dateutil.parser import parse

feedupdated = doc.getElementsByTagNameNS(ATOM_NS, u'updated')[0]
dt = parse(feedupdated.firstChild.data)
And as an example of how you can work with this date-time object, you can use the following code to report how long ago an Atom feed was updated:
from datetime import datetime
from dateutil.tz import tzlocal

#howlongago is a timedelta object from present time to target time
howlongago = dt - datetime.now(tzlocal())
print "Time since feed was updated:", abs(howlongago)
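If you want to report that age in friendlier units, the timedelta can be decomposed by hand (a small sketch; the describe helper is my own, not part of the datetime module):

```python
from datetime import timedelta

def describe(delta):
    # timedelta normalizes its fields so that 0 <= seconds < 86400,
    # so days, hours, and minutes can be read off directly
    delta = abs(delta)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return '%d days, %d hours, %d minutes' % (delta.days, hours, minutes)

text = describe(timedelta(days=2, hours=5, minutes=30))
```

The abs() call means the same helper works whether the feed date is in the past or (for a broken clock or a prescheduled entry) in the future.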
Using Amara Bindery
Because the DOM code above is so clumsy, I shall present similar code using a friendlier Python library, Amara Bindery, which I covered in an earlier article, Introducing the Amara XML Toolkit. Listing 3 does the same thing as listing 2.
Listing 3. Amara Bindery Code to Print a Text Outline of an Atom Feed
from amara import binderytools

doc = binderytools.bind_file('atomexample.xml')

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if hasattr(element, 'type') and element.type == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ (not isinstance(c, unicode)
                       and c.xml(encoding=u'utf-8') or c)
                      for c in element.xml_children ]
        #And stitch it together
        content = u''.join(childtext).strip().encode('utf-8')
        return content
    else:
        return unicode(element).encode('utf-8')

print 'Feed title:', get_text_from_construct(doc.feed.title)
print 'Feed link:', doc.feed.link
print
print 'Entries:'
for entry in doc.feed.entry:
    etitletext = get_text_from_construct(entry.title)
    print etitletext, '(', entry.link.href, ')'
Using Feedparser (Atom Processing for the Desperate Hacker)
A third approach to reading Atom is to let someone else handle the parsing and just deal with the resulting data structure. This might be especially convenient if you have to deal with broken feeds (and fixing the broken feeds is not an option). It does usually rob you of some flexibility of interpretation of the data, although a really good library would be flexible enough for most users. Probably the best option is Mark Pilgrim's Universal Feed Parser, which parses almost every flavor of RSS and Atom. In my case, I downloaded the 3.3 zip package and installed using python setup.py install. Listing 4 is code similar in function to that of listings 2 and 3.
Listing 4. Universal Feed Parser Code to Print a Text Outline of an Atom Feed
import feedparser

#A hack until Feed Parser supports Atom 1.0 out of the box
#(Feedparser 3.3 does not)
from feedparser import _FeedParserMixin
_FeedParserMixin.namespaces["http://www.w3.org/2005/Atom"] = ""

feed_data = feedparser.parse('atomexample.xml')
channel, entries = feed_data.feed, feed_data.entries

print 'Feed title:', channel['title']
print 'Feed link:', channel['link']
print
print 'Entries:'
for entry in entries:
    print entry['title'], '(', entry['link'], ')'
Overall, the code is shorter because we no longer have to worry about the different forms of Atom text construct; the library takes care of that for us. Of course, I'm pretty leery of how it does so, especially the fact that it strips namespaces from XHTML content. This is an example of the flexibility you lose when using a generic parser, especially one designed to be as liberal as Universal Feed Parser. That's the trade-off for the obvious gain in simplicity. Notice the hack near the top of listing 4. These two lines should be temporary, and no longer needed once Mark Pilgrim updates his package to support Atom 1.0.
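Since Universal Feed Parser hands back dictionary-like entries, one defensive habit is to supply fallbacks for fields that a broken feed may omit, rather than letting a missing key raise an exception. A sketch, using plain dicts to stand in for parsed entries:

```python
# Plain dicts standing in for feedparser entry objects
entries = [
    {'title': 'Atom-Powered Robots Run Amok',
     'link': 'http://example.org/2005/09/02/robots'},
    {'title': 'The quick brown fox...'},  # link element missing
]

lines = []
for entry in entries:
    # .get() supplies a placeholder instead of raising KeyError
    title = entry.get('title', '(untitled)')
    link = entry.get('link', '(no link)')
    lines.append('%s ( %s )' % (title, link))
```

This matters more with a liberal parser than with a strict one: the whole point of using it is that the input may be missing pieces a valid Atom feed would guarantee.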
Wrapping up, on a Grand Scale
Atom 1.0 is pretty easy to parse and process. I may have serious trouble with some of the design decisions for the format, but I do applaud its overall cleanliness. I've presented several approaches to processing Atom in this article. If I needed to reliably process feeds retrieved from arbitrary locations on the Web, I would definitely go for Universal Feed Parser. Mark Pilgrim has dunked himself into the rancid mess of broken Web feeds so you don't have to. In a project where I controlled the environment, and I could fix broken feeds, I would parse them myself, for the greater flexibility. One trick I've used in the past is to use Universal Feed Parser as a proxy tool to convert arbitrary feeds to a single, valid format (RSS 1.0 in my past experience), so that I could use XML (or in that case RDF) tools to parse the feeds directly.
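That proxy trick boils down to re-serializing whatever the liberal parser recovers into one valid format. A minimal sketch of the generation half using minidom, assuming the feed data has already been reduced to a channel dict and a list of entry dicts (the dict shapes here are my own simplification, not Universal Feed Parser's exact structures):

```python
from xml.dom import minidom

ATOM_NS = 'http://www.w3.org/2005/Atom'

def feed_to_atom(channel, entries):
    # Build a minimal Atom 1.0 document from already-parsed feed data
    doc = minidom.getDOMImplementation().createDocument(ATOM_NS, 'feed', None)
    feed = doc.documentElement
    # minidom does not emit namespace declarations on its own
    feed.setAttribute('xmlns', ATOM_NS)
    def add(parent, name, text):
        e = doc.createElementNS(ATOM_NS, name)
        e.appendChild(doc.createTextNode(text))
        parent.appendChild(e)
        return e
    add(feed, 'title', channel.get('title', ''))
    for entry in entries:
        elem = doc.createElementNS(ATOM_NS, 'entry')
        add(elem, 'title', entry.get('title', ''))
        feed.appendChild(elem)
    return doc.toxml('utf-8')

xml_bytes = feed_to_atom({'title': 'Example Feed'},
                         [{'title': 'Atom-Powered Robots Run Amok'}])
```

A real proxy would also carry over id, link, and updated, which Atom requires; the sketch only shows the shape of the conversion.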
And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the XML.com audience.
This brings me to the last hurrah of the monthly round up of Python-XML community news. Firstly, given the topic of this article, I wanted to mention Sylvain Hellegouarch's atomixlib, a module providing a simple API for generation of Atom 1.0, based on Amara Bindery. See his announcement. And relevant to recent articles in this column, Andrew Kuchling wrote up a Python Unicode HOWTO.
Julien Anguenot writes in XML Schema Support on Zope3:
I added a demo package to illustrate the zope3/xml schema integration. [Download the code here]
The goal of the demo is to get a new content object registered within Zope3, with "add" and "edit" forms driven by an XML Schema definition.
The article goes on to show a bunch of Python and XML code to work a sample W3C XML schema file into a Zope component.
Mark Nottingham announced sparta.py 0.8, a simple API for RDF.
Sparta is a Python API for RDF that is designed to help easily learn and navigate the Semantic Web programmatically. Unlike other RDF interfaces, which are generally triple-based, Sparta binds RDF nodes to Python objects and RDF arcs to attributes of those Python objects.
This makes using RDF very natural for people who understand (and sometimes think in terms of) objects. One way to think of it is as a databinding from RDF to Python objects.
See the announcement.
Guido Wesdorp announced Templess 0.1.
Templess is an XML templating library for Python, which is very compact and simple, fast, and has a strict separation of logic and design. It is different from other templating languages because instead of "asking" for data from the template, you "tell" the template what content there is to render, and the template just provides placeholders. Instead of calling into your code from the template, all data for the template is prepared in the code before it is handed over to the templating engine to render. This makes Templess very suitable for programmers, since everything is done from the Python code layer rather than using some domain-specific language from the XML.