Inside the RSS Validator
February 26, 2003
In previous columns, I have introduced RSS and explored options for consuming it. Now we turn to the production side. Last month I stirred up a small controversy by suggesting that RSS consumers should go out of their way to consume as many feeds as possible, even ones which are not well-formed. This month I hope it will be somewhat less controversial to say that RSS producers should go out of their way to produce feeds that conform to specifications as well as possible.
Rule Zero is that all RSS feeds must be well-formed XML. Not all RSS consumers use the advanced techniques we discussed last month. Many can only parse RSS feeds that are well-formed XML. There are many tools for producing XML; you should use one of them as opposed to, say, using string concatenation and a non-XML-aware templating system and hoping for the best.
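As a minimal sketch of what "use an XML tool" can look like, the following builds a tiny feed with Python's standard xml.etree.ElementTree module instead of string concatenation. The function name build_feed and the sample values are made up for illustration, and the code assumes Python 3; the point is only that the library escapes ampersands and angle brackets for you.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, description):
    """Build a minimal RSS 2.0 document and return it as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Text is assigned as plain strings; the serializer handles escaping
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    return ET.tostring(rss, encoding="unicode")

feed = build_feed("Fish & Chips", "http://example.org/?a=1&b=2", "News <daily>")

# Parsing the result back succeeds only if the output is well-formed
parsed = ET.fromstring(feed)
print(feed)
```

Had the same values been pasted into a string template, the bare ampersands would have produced a feed that strict parsers reject.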
Beyond well-formedness, there are a number of domain-specific rules and best practices for RSS feeds. These are fairly well encapsulated in the free online RSS validator. Point the validator at your RSS feed and follow its instructions if it finds any errors or warnings. It will catch common XML errors such as unescaped ampersands and high-bit characters; domain-specific errors such as missing required elements; and more subtle errors such as improper language codes in the <language> element.
Lather, rinse, repeat till the validator clears your feed for takeoff. Check back every now and then to make sure other obscure bugs haven't crept up and made your feed go invalid, which may indicate bugs in your production software.
How the validator works internally is actually fairly interesting -- much more interesting than the arcane rules of RSS validity -- and that's where I'd like to focus. The validator is written in Python, and it is available under a liberal open source license, so you can download the complete source code and follow along.
The RSS validator relies on Python's built-in SAX interface, xml.sax.handler. To use it, you subclass ContentHandler and provide methods for startElementNS (for start tags), endElementNS (for end tags), and characters (for everything in between). Of course, for anything but the most trivial applications, these will end up being dispatch methods to the real code stored elsewhere, which you've separated based on some criteria (namespace, element name, phase of the moon).
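To make the three callbacks concrete, here is a small stand-alone sketch (not part of the validator) that records every event the parser sends. The class name EventRecorder and the helper record are invented for this example, and the code assumes Python 3. Note that with namespace processing turned on, element and attribute names arrive as (namespace URI, local name) pairs.

```python
from io import BytesIO
from xml.sax import make_parser, handler
from xml.sax.xmlreader import InputSource

class EventRecorder(handler.ContentHandler):
    """Record SAX events as tuples so we can see what the parser sends."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair; so is each attribute name
        uri, local = name
        self.events.append(("start", uri, local, list(attrs.getNames())))

    def characters(self, content):
        # skip whitespace-only text to keep the log short
        if content.strip():
            self.events.append(("text", content))

    def endElementNS(self, name, qname):
        self.events.append(("end", name[1]))

def record(xml_bytes):
    parser = make_parser()
    # ask for the *NS callbacks instead of startElement/endElement
    parser.setFeature(handler.feature_namespaces, True)
    recorder = EventRecorder()
    parser.setContentHandler(recorder)
    source = InputSource()
    source.setByteStream(BytesIO(xml_bytes))
    parser.parse(source)
    return recorder.events

events = record(b'<rss version="2.0"><channel><title>Hi</title></channel></rss>')
for event in events:
    print(event)
```

The (None, 'version') pair recorded for the rss start tag is the same form of attribute name that the validator's rss handler checks for later in this article.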
As the SAX parser processes the input document, the RSS validator maintains a stack of handler objects. Each handler object knows just enough to validate a specific element, and it knows which other handler objects can validate the element's children. Each handler object is set up with contextual information about which element it's handling, what its parent element is, and what attributes, if any, were present in its start tag. The handler object introspects over its own methods to find one that can handle the current element and calls it. This method can perform the validation directly, or it can return one or more handler objects to perform additional validation. This will become clearer with some code.
Here is the first step, a subclass of xml.sax.handler.ContentHandler, which initializes the handler stack and then passes all startElementNS requests to the top handler in the stack.
    class SAXDispatcher(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            # prime the handler stack with the root handler object
            self.handler_stack = [[root(self)]]

        def startElementNS(self, name, qname, attrs):
            # with namespaces enabled, SAX passes name as a
            # (namespace URI, local name) pair; after this line,
            # "qname" holds the namespace URI
            qname, name = name
            for handler in self.handler_stack[-1]:
                # call all the handlers for the current element
                handler.startElementNS(name, qname, attrs)
The second step is a base class for all handler objects. It's really a second-level dispatch; it introspects over its own methods to find one named after the current element: do_ followed by the element's local name (do_channel for a channel element, for example). If found, it calls the method, which returns one or more handler objects. If not, it falls back to a default handler. Each of these handlers is set up with contextual information and pushed onto the stack.
    class validatorBase(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            self.value = ""
            self.attrs = None
            self.children = []

        def startElementNS(self, name, qname, attrs):
            from validators import eater
            if qname:
                # qname holds the namespace URI here (see SAXDispatcher);
                # namespaced elements go through unknown_starttag (not shown)
                handlers = self.unknown_starttag(name, qname, attrs)
            else:
                try:
                    # look for specific method for this element (by local-name)
                    handlers = getattr(self, "do_" + name)()
                except AttributeError:
                    # no specific method for this element, use default handler
                    handlers = [eater()]

            # small hack: if method returns 1 handler, make it a list of 1
            try:
                iter(handlers)
            except TypeError:
                handlers = [handlers]

            # set up contextual information for each handler object
            for aHandler in handlers:
                aHandler.parent = self
                aHandler.value = ""
                aHandler.name = name
                aHandler.attrs = attrs
                aHandler.prevalidate()

            self.children.append(name)

            # push handlers onto the stack
            self.push(handlers)
Two other methods are present in validatorBase: the characters method, which just buffers the raw text data within the current element, and the endElementNS method, which gets called when we get to the element's end tag and which calls a validate method (defined in the descendant handler objects).
    def characters(self, string):
        # buffer the text data for this element
        self.value = self.value + string

    def endElementNS(self, name, qname):
        # we've buffered all the text data for this element, so validate it
        self.validate()
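To see the whole mechanism run end to end, here is a deliberately reduced toy version. It is a sketch, not the validator's code: every class name in it (Base, Dispatcher, NonEmpty, and so on) is invented, it assumes Python 3, it keeps one handler per element rather than a list, and it logs plain strings. The excerpts above do not show where handlers leave the stack, so this sketch assumes they are popped at the end tag.

```python
from io import BytesIO
from xml.sax import make_parser, handler
from xml.sax.xmlreader import InputSource

class Base:
    """Toy handler: finds a do_<name> method for each child element."""
    def __init__(self):
        self.value = ""
        self.children = []

    def start_child(self, name):
        # unknown children get a plain Base, which validates nothing
        method = getattr(self, "do_" + name, None)
        child = method() if method else Base()
        child.name, child.parent, child.log = name, self, self.log
        self.children.append(name)
        return child

    def validate(self):
        pass

class NonEmpty(Base):
    def validate(self):
        if not self.value.strip():
            self.log("empty <%s>" % self.name)

class Channel(Base):
    def do_title(self):
        return NonEmpty()

    def do_link(self):
        return NonEmpty()

    def validate(self):
        for required in ("title", "link"):
            if required not in self.children:
                self.log("<channel> is missing <%s>" % required)

class Rss(Base):
    def do_channel(self):
        return Channel()

class Root(Base):
    def do_rss(self):
        return Rss()

class Dispatcher(handler.ContentHandler):
    def __init__(self, root):
        super().__init__()
        self.errors = []
        root.name, root.parent, root.log = "(document)", None, self.errors.append
        self.stack = [root]

    def startElementNS(self, name, qname, attrs):
        # the handler on top of the stack chooses the handler for its child
        self.stack.append(self.stack[-1].start_child(name[1]))

    def characters(self, content):
        self.stack[-1].value += content

    def endElementNS(self, name, qname):
        # assumption of this sketch: pop and validate at the end tag
        self.stack.pop().validate()

def check(xml_bytes):
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, True)
    dispatcher = Dispatcher(Root())
    parser.setContentHandler(dispatcher)
    source = InputSource()
    source.setByteStream(BytesIO(xml_bytes))
    parser.parse(source)
    return dispatcher.errors

errors = check(b"<rss><channel><title></title></channel></rss>")
print(errors)
```

For this input the toy reports an empty title and a missing link, each logged by the handler responsible for that element.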
Now we can start defining a hierarchy of handler objects to validate different parts of the RSS feed. Each handler needs a validate method to validate the element's data and a do_ method for each possible child element. For example, the root handler does no validation, but it knows about the rss element, which is the top-level element of most RSS feeds. (This code example is simplified; in reality we also need to handle an rdf element, which is the top-level element of RSS 0.9 and 1.0 feeds.)
    class root(validatorBase):
        def do_rss(self):
            from rss import rss
            return rss()
The rss handler knows that every rss element needs a channel child element and a version attribute. It also has a do_channel method which dispatches the validation for the child channel element.
    class rss(validatorBase):
        def validate(self):
            if "channel" not in self.children:
                self.log(MissingChannel({"element":self.name, "attr":"channel"}))
            if (None, 'version') not in self.attrs.getNames():
                self.log(MissingAttribute({"element":self.name, "attr":"version"}))

        def do_channel(self):
            from channel import channel
            return channel()
The channel handler knows that every channel needs a title, link, and description (and a few other rules), and it has do_ methods for each possible child element of channel: title, link, description, item, items, textInput and textinput (due to subtle differences in various RSS versions -- seven specs, no waiting), category, cloud, rating, ttl, docs, generator, pubDate, lastBuildDate, managingEditor, webMaster, language, copyright, skipHours, skipDays, and blink. (There is no blink tag in RSS, but there was some confusion about this, so the validator presents a specific error message for it.)
    class channel(validatorBase):
        def validate(self):
            if "title" not in self.children:
                self.log(MissingTitle({"parent":self.name, "element":"title"}))
            if "link" not in self.children:
                self.log(MissingLink({"parent":self.name, "element":"link"}))
            if "description" not in self.children:
                self.log(MissingDescription({"parent":self.name,"element":"description"}))
            # several rules omitted here
            ...

        def do_title(self):
            return nonhtml(), noduplicates()

        def do_link(self):
            return rfc2396(), noduplicates()

        def do_description(self):
            return nonhtml(), noduplicates()

        ...
        # lots of other do_ methods omitted
As you can see, several of the do_ methods return a list of individual handlers. A channel link must be an RFC-2396-compliant URI, and there can be only one link element per channel. Each of these rules is encoded in its own handler object:
    class rfc2396(validatorBase):
        rfc2396_re = re.compile("[a-zA-Z][0-9a-zA-Z+\\-\\.]*:(//)?" +
                                "[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%,#]+$")

        def validate(self, errorClass=InvalidLink):
            if (not self.value) or (not self.rfc2396_re.match(self.value)):
                self.log(errorClass({"element":self.name, "value":self.value}))

    class noduplicates(validatorBase):
        def prevalidate(self):
            if self.name in self.parent.children:
                self.log(DuplicateElement({"parent":self.parent.name, "element":self.name}))
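The regular expression is easier to judge when you try it on a few values. The snippet below copies the pattern from the rfc2396 handler (written as raw strings, which compile to the same expression) into a stand-alone helper; the helper name is_valid_uri and the sample URIs are made up for this illustration, and the code assumes Python 3.

```python
import re

# Same pattern as the rfc2396 handler, written as raw strings
RFC2396_RE = re.compile(
    r"[a-zA-Z][0-9a-zA-Z+\-\.]*:(//)?"
    r"[0-9a-zA-Z;/?:@&=+$\.\-_!~*'()%,#]+$"
)

def is_valid_uri(value):
    """Mirror the handler's test: non-empty and matching the pattern."""
    return bool(value) and RFC2396_RE.match(value) is not None

samples = [
    "http://example.org/feed?x=1&y=2",  # scheme, "//", allowed characters
    "mailto:editor@example.org",        # scheme without "//"
    "/relative/path",                   # no scheme
    "http://example.org/a b",           # contains a space
    "",                                 # empty value
]
results = [is_valid_uri(s) for s in samples]
for sample, ok in zip(samples, results):
    print(repr(sample), ok)
```

Because the pattern requires a scheme followed by a colon, relative references are rejected along with empty values and values containing spaces.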
When the channel.do_link method returns the list of rfc2396 and noduplicates handler objects, the secondary dispatcher in validatorBase.startElementNS pushes them onto the stack, where the main dispatcher in SAXDispatcher.startElementNS pops them off and calls each of them in turn. The rfc2396 instance and the noduplicates instance are each set up with contextual information for the current link element; each performs its own validation and logs its own errors.
This all may seem like a lot of indirection -- and it is -- but it has several advantages:
- It's easy to add new functionality. Adding support for a new element requires writing a new handler object that inherits from validatorBase, then adding a single do_ method in the parent element's handler.
- It's easy to debug. None of the individual handler objects interact with each other. There are no side effects.
- It encourages code reuse. Several different elements in various levels of an RSS document have similar validation logic. For instance, docs, link, and the elements within the optional blogChannel module all need to be RFC-2396-compliant URIs.
With this framework in place, and an entire hierarchy of handler objects each doing their own little piece of validation, the main function to parse an RSS feed is mostly boilerplate:
    def validate(aString):
        # boilerplate
        from xml.sax import make_parser, handler
        from xml.sax.xmlreader import InputSource
        from base import SAXDispatcher
        from exceptions import UnicodeError
        from cStringIO import StringIO

        source = InputSource()
        source.setByteStream(StringIO(aString))

        # create an instance of our top-level SAX dispatcher
        validator = SAXDispatcher()

        # boilerplate
        parser = make_parser()
        parser.setFeature(handler.feature_namespaces, 1)

        # set up our validator as the handler for all SAX events,
        # and start parsing
        parser.setContentHandler(validator)
        parser.setErrorHandler(validator)
        parser.setEntityResolver(validator)
        parser.parse(source)

        return validator
During the course of parsing, our SAXDispatcher instance accumulates errors and warnings through a centralized logging interface (not shown). Each error is stored as its own object, and we can access the list of errors and display them however we choose. The interactive web-based validator displays them in an HTML page; the (currently beta) SOAP interface uses the errors to construct a SOAP response. The downloadable command-line version just prints them to the screen.
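Since the logging interface is not shown, the following is only a guess at its general shape, written to illustrate why storing each error as an object is convenient. The base class LoggedEvent, the message templates, the Log class, and the two renderers are all hypothetical; only the error class names MissingAttribute and DuplicateElement and their parameter dictionaries come from the listings above. The code assumes Python 3.

```python
from html import escape

class LoggedEvent:
    """Hypothetical base class: an error that carries its own parameters."""
    template = "%(element)s: unknown problem"

    def __init__(self, params):
        self.params = params

    def message(self):
        return self.template % self.params

class MissingAttribute(LoggedEvent):
    template = "<%(element)s> is missing the %(attr)s attribute"

class DuplicateElement(LoggedEvent):
    template = "<%(parent)s> contains more than one <%(element)s>"

class Log:
    """Hypothetical central log: handlers append events, front ends read them."""
    def __init__(self):
        self.events = []

    def log(self, event):
        self.events.append(event)

def as_text(events):
    # command-line style: one line per event
    return "\n".join("%s: %s" % (type(e).__name__, e.message()) for e in events)

def as_html(events):
    # web style: same events, escaped and wrapped in a list
    items = "".join("<li>%s</li>" % escape(e.message()) for e in events)
    return "<ul>%s</ul>" % items

log = Log()
log.log(MissingAttribute({"element": "rss", "attr": "version"}))
log.log(DuplicateElement({"parent": "channel", "element": "link"}))

text = as_text(log.events)
page = as_html(log.events)
print(text)
print(page)
```

Keeping the parameters separate from the wording is what lets one list of events feed several front ends without re-running the validation.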
Next month: something other than RSS.