Parsing RSS At All Costs
January 22, 2003
The Problem
As I said in last month's article, RSS is an XML-based format for syndicating news and news-like sites. XML was chosen, among other reasons, to make the format easy to parse with off-the-shelf XML tools. Unfortunately, in the past few years, as RSS has gained popularity, the quality of RSS feeds has dropped. There are now dozens of versions of hundreds of tools producing RSS feeds. Many have bugs. Few build RSS feeds using XML libraries; most treat RSS as text, piecing the feed together with string concatenation, maybe (or maybe not) applying a few manually coded escaping rules, and hoping for the best.
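By contrast, escaping content properly takes only a few lines with Python's standard library. Here is a minimal sketch (mine, not from the article; the title string is hypothetical) of what a well-behaved publishing tool could do with xml.sax.saxutils.escape instead of raw string concatenation:

from xml.sax.saxutils import escape

# a hypothetical post title containing characters that break naive feeds
title = "Layoffs in BT & H <updated>"

# escape() converts &, <, and > into entities, so the output stays
# well-formed no matter what the title contains
print '<title>%s</title>' % escape(title)
# prints: <title>Layoffs in BT &amp; H &lt;updated&gt;</title>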
On average, at any given time, about 10% of all RSS feeds are not well-formed XML. Some errors are systemic, due to bugs in publishing software. It took Movable Type a year to properly escape ampersands and entities, and most users are still using old versions or new versions with old buggy templates. Other errors are transient, due to rough edges in authored content that the publishing tools are unable or unwilling to fix on the fly. As I write this, the Scripting News site's RSS has an illegal high-bit character, a curly apostrophe. Probably just a cut-and-paste error -- I've done the same thing myself many times -- but I don't know of any publishing tool that corrects it on the fly, and that one bad character is enough to trip up any XML parser.
I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML.
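If you want to run the same experiment on your own subscriptions, well-formedness is easy to check mechanically. Here is a minimal sketch (mine, not from the article; the filenames are hypothetical local copies of feeds) that counts how many feeds a conformant parser rejects:

import xml.sax

def isWellFormed(filename):
    # a conformant parser raises SAXParseException at the first error
    try:
        xml.sax.parse(filename, xml.sax.ContentHandler())
        return 1
    except xml.sax.SAXParseException:
        return 0

if __name__ == '__main__':
    feeds = ['feed1.xml', 'feed2.xml', 'feed3.xml']  # hypothetical local copies
    bad = [f for f in feeds if not isWellFormed(f)]
    print '%d of %d feeds are not well-formed' % (len(bad), len(feeds))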
Clearly, we need a backup plan.
The Heretical Solution
There is a social solution to this problem: register at Syndic8.com to be a "fixer", and volunteer your time contacting the authors of individual sites to get them to fix their feeds. There is also a technical solution to this problem: don't use an XML parser.
I know, I know, this is heresy. The point of XML is that content producers are supposed to put up with the pain of XML formatting rules so that content consumers can do cool things with off-the-shelf tools. Well, guess what? It's not happening. Judging by the sad state of affairs in the RSS world, content producers are either ignorant of the error of their ways, or too lazy to fix the errors, or too busy, or locked into inflexible tools whose vendors are too busy... Whatever the reasons, content consumers are rarely in a position to solve the problem. So we must work around it. We need a parse-at-all-costs RSS parser.
I know, I know, this is how HTML got to be "tag soup": browsers that never complained. Now the same thing is happening in the RSS world because the same social dynamics apply. End users who can't even spell "XML" certainly don't care about silly little formatting rules; they just want to follow their favorite sites in their news aggregator. When 10% of the world's RSS feeds are not well-formed -- including some high-profile feeds that thousands of people want to read -- the ability to parse ill-formed feeds becomes a competitive advantage. (And if you think the same thing won't happen when RDF and the Semantic Web go mainstream, you're deluding yourself. The same social dynamics apply. Boy, is that going to be messy.)
So most desktop news aggregators are now incorporating parse-at-all-costs RSS parsers, which they use when XML parsing fails. However, since no one likes tag soup, they are also implementing subtle visual cues, such as smiley and frowny icons, to indicate feed quality. Click on the frowny icon, and the end user can learn that this RSS feed is not well-formed XML. But the program still displays the content of the feed, as best it can, using a parse-at-all-costs parser. Those who care about quality and are motivated to do something about it can contact the publisher. But everyone else can keep following their favorite sites, even if the feeds are broken.
The Heretical Code
So how do you build a parse-at-all-costs RSS parser? With regular expressions, of course. Regular expressions are the messy solution to all of life's messy problems. Want to parse invalid HTML and XML? Regular expressions. Want to parse invalid RDF? Regular expressions. And may God have mercy on your soul.
Actually, Python has a secret weapon against poor markup: a little-known standard library called sgmllib. I've written extensively about sgmllib elsewhere for HTML processing, but it's also useful for processing invalid XML. sgmllib is based on regular expressions under the covers, but you don't need to deal with them directly. It works much like a SAX parser for XML documents. In fact, you can think of it as a SAX parser that doesn't care about details like unescaped ampersands or undefined entities. The sgmllib.SGMLParser class iterates through a document, and you can subclass it to provide element-specific processing. For example, here is an invalid XML document (invalid due to both the undefined entity "&mdash;" and the unescaped ampersand):
<rss>
<channel>
<title>My weblog &mdash; tech news & other stuff</title>
</channel>
</rss>
Here is how sgmllib.SGMLParser would handle it:
1. Call start_rss([]). The empty list indicates no attributes for this tag. If I wanted to do something special when I encountered the beginning rss tag, I would define the start_rss method in my sgmllib.SGMLParser descendant. (If start_rss hasn't been defined, SGMLParser will fall back to calling unknown_starttag('rss', []) instead. This also applies to all subsequent examples.)
2. start_channel([])
3. start_title([])
4. handle_data('My weblog ')
5. handle_entityref('mdash')
6. handle_data(' tech news ')
7. handle_data('&')
8. handle_data(' other stuff')
9. end_title()
10. end_channel()
11. end_rss()
Note that both steps 5 and 7 will choke any compliant XML parser, but sgmllib just says, "Unknown entity? Here, you deal with it. Unescaped ampersand? Must be plain text."
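If you'd like to watch this event stream yourself, here is a minimal sketch (mine, not from the article) that subclasses sgmllib.SGMLParser and simply prints every event it receives for the document above:

import sgmllib

class EventLogger(sgmllib.SGMLParser):
    # print every event instead of processing it
    def unknown_starttag(self, tag, attrs):
        print 'start_%s(%r)' % (tag, attrs)
    def unknown_endtag(self, tag):
        print 'end_%s()' % tag
    def handle_data(self, data):
        print 'handle_data(%r)' % data
    def handle_entityref(self, ref):
        print 'handle_entityref(%r)' % ref

p = EventLogger()
p.feed('<rss><channel><title>My weblog &mdash; '
       'tech news & other stuff</title></channel></rss>')
p.close()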
Given this new-found freedom, we can use sgmllib to build a parse-at-all-costs RSS parser. We'll start by subclassing sgmllib.SGMLParser and defining our own methods to keep track of RSS data as we find it. We'll need start_item and end_item methods in order to keep track of whether we're within an RSS item. We'll use a currentTag variable to keep track of the most recent start tag; a currentValue variable to buffer all the text data we find until we hit the end tag (as steps 4-8 of the example above show, the text data may be split across several method calls); and a list of dictionaries to hold all of our parsed data.
import sgmllib

class ParseAtAllCostsParser(sgmllib.SGMLParser):
    def reset(self):
        # SGMLParser.__init__ calls reset(), so this also runs
        # when the parser is first constructed
        self.items = []
        self.currentTag = None
        self.currentValue = ''
        self.initem = 0
        sgmllib.SGMLParser.reset(self)

    def start_item(self, attrs):
        # set a flag that we're within an RSS item now
        self.items.append({})
        self.initem = 1

    def end_item(self):
        # OK, we're out of the RSS item
        self.initem = 0
Now add in the unknown_starttag and unknown_endtag methods, which handle the start and end of each element within an item:
    def unknown_starttag(self, tag, attrs):
        self.currentTag = tag

    def unknown_endtag(self, tag):
        # if we're within an RSS item, save the data we've buffered
        if self.initem:
            # decode entities and strip whitespace
            self.currentValue = decodeEntities(self.currentValue.strip())
            self.items[-1][self.currentTag] = self.currentValue
        self.currentValue = ''
As you can see, once we find the end tag, we take all the buffered text data from within this element (self.currentValue), decode the XML entities manually (since sgmllib will not do this for us), strip whitespace, and stash the result in our self.items list. This requires two more pieces: a decodeEntities function, and the appropriate handler methods for buffering the text data in the first place.
Decoding XML entities is easy; there are only five of them:
def decodeEntities(data):
    # in case our document *was* encoded correctly, we'll
    # need to decode the XML entities manually; sgmllib
    # will not do it for us
    data = data.replace('&lt;', '<')
    data = data.replace('&gt;', '>')
    data = data.replace('&quot;', '"')
    data = data.replace('&apos;', "'")
    # &amp; must be decoded last, so that something like
    # '&amp;lt;' correctly becomes '&lt;' rather than '<'
    data = data.replace('&amp;', '&')
    return data
Handling the text data that sgmllib.SGMLParser throws at us (including any entities within the text) is equally easy:
    def handle_data(self, data):
        # buffer all text data
        self.currentValue += data

    def handle_entityref(self, data):
        # buffer all entities verbatim; sgmllib strips the '&' and ';',
        # so we put them back for decodeEntities to find later
        self.currentValue += '&' + data + ';'

    def handle_charref(self, data):
        # character references (like &#8212;) arrive as bare numbers,
        # so restore the '&#' and ';' as well
        self.currentValue += '&#' + data + ';'
The final result is that we can feed an invalid RSS document into this parser, and it will parse out any and all item-level elements, well-formed or not.
if __name__ == '__main__':
    p = ParseAtAllCostsParser()
    p.feed(file('invalid.xml').read())
    for rssitem in p.items:
        print 'title:', rssitem.get('title')
        print 'description:', rssitem.get('description')
        print 'link:', rssitem.get('link')
        print
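The sample file itself isn't reproduced in this article, but a feed along these lines (a reconstruction, not the author's actual invalid.xml) exercises both failure modes -- raw ampersands in text, and unescaped ampersands inside a URL:

<rss version="0.91">
<channel>
<title>Example News</title>
<item>
<title>Layoffs in BT & H</title>
<description>BT & H has laid off more people as the recession only gets worse. Note the ampersands in both title and description.</description>
<link>http://example.com/news/3</link>
</item>
<item>
<title>Mozilla Project Hurt by Apple's Decision to use KHTML</title>
<description>It's generally best to read Slashdot at a +3 comments threshold. Note undefined entities in the link (due to unescaped ampersands).</description>
<link>http://developers.slashdot.org/article.pl?sid=03/01/14/1514205&tid=154&threshold=3</link>
</item>
</channel>
</rss>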
Running this script on this non-well-formed RSS document will produce these results:

title: Layoffs in BT & H
description: BT & H has laid off more people as the recession only gets worse. Note the ampersands in both title and description.
link: http://example.com/news/3

title: Mozilla Project Hurt by Apple's Decision to use KHTML
description: It's generally best to read Slashdot at a +3 comments threshold. Note undefined entities in the link (due to unescaped ampersands).
link: http://developers.slashdot.org/article.pl?sid=03/01/14/1514205&tid;=154&threshold;=3
This simple script will not handle many of the advanced features of XML, including namespaces. That may not be a problem; after all, it's just a fallback, right? Hopefully we're trying a real XML parser first and only falling back on this messy regular-expression-based sgmllib parser when that fails. However, in flagrant abuse of all things pure and sacred, I have managed to extend this script into a full-fledged parse-at-all-costs RSS parser that supports all the advanced features of RSS, including namespaces. It even handles exotic variations of RSS 0.90 and 1.0, where everything is explicitly placed in a namespace (even the basic title, link, and description tags). I don't recommend it, but it works for me.
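To make the "real parser first, fallback second" strategy concrete, here is a minimal sketch (mine, not from the article) of such a wrapper, using Python's xml.sax for the strict pass. The wellFormed flag is what an aggregator could hang its smiley or frowny icon on. (A real aggregator would keep the strict parser's results when the feed is well-formed; this sketch reuses the loose parser for both cases to stay short.)

import xml.sax

def parseFeed(filename):
    data = file(filename).read()
    # first pass: see whether a conformant parser accepts the feed
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        wellFormed = 1
    except xml.sax.SAXParseException:
        wellFormed = 0
    # second pass: extract what we can with the loose parser,
    # whether or not the feed was well-formed
    p = ParseAtAllCostsParser()
    p.feed(data)
    p.close()
    return wellFormed, p.items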
In next month's column I'll examine some other RSS validity issues. Valid RSS is more than just well-formed XML. Just because there's no DTD or schema doesn't mean it can't be validated in other ways. We'll discuss the inner workings of one such RSS validator. And then we'll move on to something non-RSS-related. I promise.