Processing Atom 1.0
September 14, 2005
In the fast-moving world of weblogs and Web-based marketing, the approval of the Atom Format 1.0 by the Internet Engineering Task Force (IETF) as a Proposed Standard is a significant and lasting development. Atom is a very carefully designed format for syndicating the contents of weblogs as they are updated, the usual territory of RSS, but its possible uses are far more general, as illustrated in the description on the home page:
Atom is the name of an XML-based Web content and metadata syndication format, and an application-level protocol for publishing and editing Web resources belonging to periodically updated websites.
All Atom feeds must be well-formed XML documents, and are identified with the application/atom+xml media type.
Atom is a very important development in the XML and Web world. Atom technology is already deployed in many areas (though not all up-to-date with Atom 1.0), and parsing and processing Atom is quickly becoming an important task for web developers. In this article, I will show several approaches to reading Atom 1.0 in Python. All the code is designed to work with Python 2.3, or more recent, and is tested with Python 2.4.1.
The example I'll be using of an Atom document is a modified version of the introduction to Atom on the home page, reproduced here in listing 1.
Listing 1 (atomexample.xml). Atom Format 1.0 Example
<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:xh="http://www.w3.org/1999/xhtml">
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <title>Example Feed</title>
  <updated>2005-09-02T18:30:02Z</updated>
  <link href="http://example.org/"/>
  <author>
    <name>John Doe</name>
  </author>
  <entry>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2005/09/02/robots"/>
    <updated>2005-09-02T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
  <entry>
    <id>urn:uuid:8eb00d01-d632-40d4-8861-f2ed613f2c30</id>
    <title type="xhtml">
      <xh:div>
        The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
      </xh:div>
    </title>
    <link href="http://example.org/2005/09/01/fox"/>
    <updated>2005-09-01T12:15:00Z</updated>
    <summary>jumps over the lazy dog</summary>
  </entry>
</feed>
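Since every Atom feed must be a well-formed XML document with a feed root element in the Atom namespace, a quick sanity check with nothing but the standard library might look like this (a minimal sketch; the embedded feed is an abbreviated version of listing 1):

```python
from xml.dom import minidom

ATOM_NS = 'http://www.w3.org/2005/Atom'

# An abbreviated version of the listing 1 feed, embedded for illustration
feed_xml = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <title>Example Feed</title>
  <updated>2005-09-02T18:30:02Z</updated>
</feed>"""

# parseString raises an exception if the document is not well-formed XML
doc = minidom.parseString(feed_xml)
root = doc.documentElement
# The document is an Atom feed if the root is a feed element in the Atom namespace
is_atom = (root.namespaceURI == ATOM_NS and root.localName == 'feed')
```

If the document is not well-formed, the parse call itself raises an exception, which is the correct strict behavior for Atom.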
Using MiniDOM
If you want to process Atom with no additional dependencies besides Python, you can do so using MiniDOM. MiniDOM isn't the most efficient way to parse XML, but Atom files tend to be small, and rarely get to the megabyte range that bogs down MiniDOM. If by some chance you are dealing with very large Atom files, you can use PullDOM, which works well with Atom because of the way the format can be processed in bite-sized chunks. MiniDOM isn't the most convenient API available, either, but it is the most convenient approach in the Python standard library. Listing 2 is MiniDOM code to produce an outline of an Atom feed, containing much of the information you would use if you were syndicating the feed.
Listing 2. MiniDOM Code to Print a Text Outline of an Atom Feed
from xml.dom import minidom
from xml.dom import EMPTY_NAMESPACE

ATOM_NS = 'http://www.w3.org/2005/Atom'

doc = minidom.parse('atomexample.xml')
#Ensure that all text nodes can be simply retrieved
doc.normalize()

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if element.getAttributeNS(EMPTY_NAMESPACE, u'type') == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ c.toxml('utf-8') for c in element.childNodes ]
        #And stitch it together
        content = ''.join(childtext).strip()
        return content
    else:
        return element.firstChild.data.encode('utf-8')

#Process the overall feed:
#First title element in doc order is the feed title
feedtitle = doc.getElementsByTagNameNS(ATOM_NS, u'title')[0]
#The feed title is an atom text construct: no markup
#So just print the text node content
print 'Feed title:', get_text_from_construct(feedtitle)
feedlink = doc.getElementsByTagNameNS(ATOM_NS, u'link')[0]
print 'Feed link:', feedlink.getAttributeNS(EMPTY_NAMESPACE, u'href')
print
print 'Entries:'
for entry in doc.getElementsByTagNameNS(ATOM_NS, u'entry'):
    #First title element in doc order within the entry is the title
    entrytitle = entry.getElementsByTagNameNS(ATOM_NS, u'title')[0]
    entrylink = entry.getElementsByTagNameNS(ATOM_NS, u'link')[0]
    etitletext = get_text_from_construct(entrytitle)
    elinktext = entrylink.getAttributeNS(EMPTY_NAMESPACE, u'href')
    print etitletext, '(', elinktext, ')'
The code to access XML is typical of DOM and, as such, it's rather clumsy when compared to much Python code. The normalization step near the beginning of the listing helps eliminate even more complexity when dealing with text content. Many Atom elements are defined using the atomTextConstruct pattern, which can be plain text, with no embedded markup. (HTML is allowed, if escaped, and if you flag this case in the type attribute.) Such elements can also contain well-formed XHTML fragments wrapped in a div. The get_text_from_construct function handles both cases transparently, and so it is generally a utility routine for extracting content from compliant Atom elements. In this listing, I use it to access the contents of the title element, which is in XHTML form in one of the entries in listing 1. Try running listing 2 and you should get the following output.
$ python listing2.py
Feed title: Example Feed
Feed link: http://example.org/

Entries:
Atom-Powered Robots Run Amok ( http://example.org/2005/09/02/robots )
<xh:div>
  The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
</xh:div> ( http://example.org/2005/09/01/fox )
Handling Dates
Handling Atom dates in Python is a topic that deserves closer attention. Atom dates are specified in the atomDateConstruct pattern, of which the specification says:
A Date construct is an element whose content MUST conform to the "date-time" production in [RFC3339]. In addition, an uppercase "T" character MUST be used to separate date and time, and an uppercase "Z" character MUST be present in the absence of a numeric time zone offset.
The examples given are:
- 2003-12-13T18:30:02Z
- 2003-12-13T18:30:02.25Z
- 2003-12-13T18:30:02+01:00
- 2003-12-13T18:30:02.25+01:00
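For the first and simplest of these forms, with the literal "Z" suffix, you can get by with strptime (a minimal sketch using the datetime module; it deliberately ignores fractional seconds and numeric offsets, which is exactly where the third-party help recommended below comes in):

```python
from datetime import datetime

def parse_utc(text):
    # Handles only the plain "Z" (UTC) form of RFC 3339 date-times;
    # fractional seconds and numeric time-zone offsets need more work
    return datetime.strptime(text, '%Y-%m-%dT%H:%M:%SZ')

dt = parse_utc('2003-12-13T18:30:02Z')
```

A malformed date raises ValueError, so even this crude approach fails loudly rather than silently.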
You may be surprised to find that Python is rather limited in the built-in means it provides for parsing such dates. There are good reasons for this: many aspects of date parsing are very hard and can depend a lot on application-specific needs. Python 2.3 introduced the handy datetime data type, which is the recommended way to store and exchange dates, but you have to do the parsing into datetime yourself, and handle the complex task of time-zone processing as well. Or you have to use a third-party routine that does this for you. I recommend that you complement Python's built-in facilities with Gustavo Niemeyer's DateUtil. (Unfortunately that link uses HTTPS with an expired certificate, so you may have to click through a bunch of warnings, but it's worth it.) In my case I downloaded the 1.0 tar.bz2 and installed using python setup.py install.
Using DateUtil, the following snippet reads the feed's updated element into a datetime object:
from dateutil.parser import parse

feedupdated = doc.getElementsByTagNameNS(ATOM_NS, u'updated')[0]
dt = parse(feedupdated.firstChild.data)
And as an example of how you can work with this date-time object, you can use the following code to report how long ago an Atom feed was updated:
from datetime import datetime
from dateutil.tz import tzlocal

#howlongago is a timedelta object from present time to target time
howlongago = dt - datetime.now(tzlocal())
print "Time since feed was updated:", abs(howlongago)
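If you want to report that age in friendlier units, the timedelta can be decomposed by hand (a small sketch; the describe helper is my own, not part of the datetime module):

```python
from datetime import timedelta

def describe(delta):
    # timedelta normalizes its fields so that 0 <= seconds < 86400,
    # so days, hours, and minutes can be read off directly
    delta = abs(delta)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return '%d days, %d hours, %d minutes' % (delta.days, hours, minutes)

text = describe(timedelta(days=2, hours=5, minutes=30))
```

The abs() call means the same helper works whether the feed date is in the past or (for a broken clock or a prescheduled entry) in the future.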
Using Amara Bindery
Because the DOM code above is so clumsy, I shall present similar code using a friendlier Python library, Amara Bindery, which I covered in an earlier article, Introducing the Amara XML Toolkit. Listing 3 does the same thing as listing 2.
Listing 3. Amara Bindery Code to Print a Text Outline of an Atom Feed
from amara import binderytools

doc = binderytools.bind_file('atomexample.xml')

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if hasattr(element, 'type') and element.type == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ (not isinstance(c, unicode)
                       and c.xml(encoding=u'utf-8') or c)
                      for c in element.xml_children ]
        #And stitch it together
        content = u''.join(childtext).strip().encode('utf-8')
        return content
    else:
        return unicode(element).encode('utf-8')

print 'Feed title:', get_text_from_construct(doc.feed.title)
print 'Feed link:', doc.feed.link
print
print 'Entries:'
for entry in doc.feed.entry:
    etitletext = get_text_from_construct(entry.title)
    print etitletext, '(', entry.link.href, ')'
Using Feedparser (Atom Processing for the Desperate Hacker)
A third approach to reading Atom is to let someone else handle the parsing and just deal with the resulting data structure. This might be especially convenient if you have to deal with broken feeds (and fixing the broken feeds is not an option). It does usually rob you of some flexibility of interpretation of the data, although a really good library would be flexible enough for most users. Probably the best option is Mark Pilgrim's Universal Feed Parser, which parses almost every flavor of RSS and Atom. In my case, I downloaded the 3.3 zip package and installed using python setup.py install. Listing 4 is code similar in function to that of listings 2 and 3.
Listing 4. Universal Feed Parser Code to Print a Text Outline of an Atom Feed
import feedparser

#A hack until Feed Parser supports Atom 1.0 out of the box
#(Feedparser 3.3 does not)
from feedparser import _FeedParserMixin
_FeedParserMixin.namespaces["http://www.w3.org/2005/Atom"] = ""

feed_data = feedparser.parse('atomexample.xml')
channel, entries = feed_data.feed, feed_data.entries

print 'Feed title:', channel['title']
print 'Feed link:', channel['link']
print
print 'Entries:'
for entry in entries:
    print entry['title'], '(', entry['link'], ')'
Overall, the code is shorter because we no longer have to worry about the different forms of Atom text construct; the library takes care of that for us. Of course, I'm pretty leery of how it does so, especially the fact that it strips namespaces from XHTML content. This is an example of the flexibility you lose when using a generic parser, especially one designed to be as liberal as Universal Feed Parser. That's the trade-off for the obvious gain in simplicity. Notice the hack near the top of listing 4. These two lines should be temporary, and no longer needed once Mark Pilgrim updates his package to support Atom 1.0.
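Since Universal Feed Parser hands back dictionary-like entries, one defensive habit is to supply fallbacks for fields that a broken feed may omit, rather than letting a missing key raise an exception. A sketch, using plain dicts to stand in for parsed entries:

```python
# Plain dicts standing in for feedparser entry objects
entries = [
    {'title': 'Atom-Powered Robots Run Amok',
     'link': 'http://example.org/2005/09/02/robots'},
    {'title': 'The quick brown fox...'},  # link element missing
]

lines = []
for entry in entries:
    # .get() supplies a placeholder instead of raising KeyError
    title = entry.get('title', '(untitled)')
    link = entry.get('link', '(no link)')
    lines.append('%s ( %s )' % (title, link))
```

This matters more with a liberal parser than with a strict one: the whole point of using it is that the input may be missing pieces a valid Atom feed would guarantee.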
Wrapping up, on a Grand Scale
Atom 1.0 is pretty easy to parse and process. I may have serious trouble with some of the design decisions for the format, but I do applaud its overall cleanliness. I've presented several approaches to processing Atom in this article. If I needed to reliably process feeds retrieved from arbitrary locations on the Web, I would definitely go for Universal Feed Parser. Mark Pilgrim has dunked himself into the rancid mess of broken Web feeds so you don't have to. In a project where I controlled the environment, and I could fix broken feeds, I would parse them myself, for the greater flexibility. One trick I've used in the past is to use Universal Feed Parser as a proxy tool to convert arbitrary feeds to a single, valid format (RSS 1.0 in my past experience), so that I could use XML (or in that case RDF) tools to parse the feeds directly.
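That proxy trick boils down to re-serializing whatever the liberal parser recovers into one valid format. A minimal sketch of the generation half using minidom, assuming the feed data has already been reduced to a channel dict and a list of entry dicts (the dict shapes here are my own simplification, not Universal Feed Parser's exact structures):

```python
from xml.dom import minidom

ATOM_NS = 'http://www.w3.org/2005/Atom'

def feed_to_atom(channel, entries):
    # Build a minimal Atom 1.0 document from already-parsed feed data
    doc = minidom.getDOMImplementation().createDocument(ATOM_NS, 'feed', None)
    feed = doc.documentElement
    # minidom does not emit namespace declarations on its own
    feed.setAttribute('xmlns', ATOM_NS)
    def add(parent, name, text):
        e = doc.createElementNS(ATOM_NS, name)
        e.appendChild(doc.createTextNode(text))
        parent.appendChild(e)
        return e
    add(feed, 'title', channel.get('title', ''))
    for entry in entries:
        elem = doc.createElementNS(ATOM_NS, 'entry')
        add(elem, 'title', entry.get('title', ''))
        feed.appendChild(elem)
    return doc.toxml('utf-8')

xml_bytes = feed_to_atom({'title': 'Example Feed'},
                         [{'title': 'Atom-Powered Robots Run Amok'}])
```

A real proxy would also carry over id, link, and updated, which Atom requires; the sketch only shows the shape of the conversion.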
And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the XML.com audience.
This brings me to the last hurrah of the monthly round up of Python-XML community news. Firstly, given the topic of this article, I wanted to mention Sylvain Hellegouarch's atomixlib, a module providing a simple API for generation of Atom 1.0, based on Amara Bindery. See his announcement. And relevant to recent articles in this column, Andrew Kuchling wrote up a Python Unicode HOWTO.
Julien Anguenot writes in XML Schema Support on Zope3:
I added a demo package to illustrate the zope3/xml schema integration. [Download the code here]
The goal of the demo is to get a new content object registered within Zope3, with "add" and "edit" forms driven by an XML Schema definition.
The article goes on to show a bunch of Python and XML code to work a sample W3C XML schema file into a Zope component.
Mark Nottingham announced sparta.py 0.8, a simple API for RDF.
Sparta is a Python API for RDF that is designed to help easily learn and navigate the Semantic Web programmatically. Unlike other RDF interfaces, which are generally triple-based, Sparta binds RDF nodes to Python objects and RDF arcs to attributes of those Python objects.
This makes using RDF very natural for people who understand (and sometimes think in terms of) objects. One way to think of it is as a databinding from RDF to Python objects.
See the announcement.
Guido Wesdorp announced Templess 0.1.
Templess is an XML templating library for Python, which is very compact and simple, fast, and has a strict separation of logic and design. It is different from other templating languages because instead of "asking" for data from the template, you "tell" the template what content there is to render, and the template just provides placeholders. Instead of calling into your code from the template, all data for the template is prepared in the code before it is handed over to the templating engine to render. This makes Templess very suitable for programmers, since everything is done from the Python code layer rather than using some domain-specific language from the XML.