Never Mind the Namespaces: An XSLT RSS Client

January 2, 2003

RSS is an XML-based format for summarizing and providing links to news stories. If you collect RSS feed URIs from your favorite news sites, you can easily build dynamic, customized collections of news stories. In a recent XML.com article, Mark Pilgrim explained the history and formats used for RSS. He also showed a simple Python program that can read RSS files conforming to the three RSS formats still in popular use: 0.91, 1.0, and 2.0. While reading Mark's article I couldn't help but think that the same task would be really easy in XSLT.

Easy, that is, if you're familiar with the XPath local-name() function. In a past column I showed how this function retrieves the part of an element name that identifies it within its namespace. For example, an element with a qualified name of "blue:verse" has the local name "verse" (and not "blue", as I wrote in a typo in that column and only just now caught; "blue" is the namespace prefix).
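
To see the three parts of a name side by side, here is a minimal sketch (not from the earlier column) that prints each element's qualified name, local name, and namespace URI. The file name names.xsl and the sample namespace URI mentioned below are made up for illustration.

```xml
<!-- names.xsl (hypothetical): for every element in the source tree,
     print its qualified name, local name, and namespace URI. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="*">
    <xsl:value-of select="name()"/>
    <xsl:text> | </xsl:text>
    <xsl:value-of select="local-name()"/>
    <xsl:text> | </xsl:text>
    <xsl:value-of select="namespace-uri()"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Recurse into child elements only, skipping text nodes. -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```

Applied to a one-element document such as <blue:verse xmlns:blue="http://example.com/blue"/>, this should print the line "blue:verse | verse | http://example.com/blue".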

Typical XSLT stylesheets care a great deal about an element's namespace. If a channel element in an RSS 1.0 file comes from the http://purl.org/rss/1.0/ namespace and a channel element from an RSS 2.0 file comes from the http://backend.userland.com/rss2 namespace, then an XSLT processor considers these two element types to be as different as a title element from a book publishing namespace and a title element from a human resources namespace. However, by basing match conditions (and, as we'll see later, select expressions in xsl:apply-templates instructions) on the local name of source tree elements, we can explicitly tell the XSLT processor to ignore the namespace of certain elements. For example, we can have a template rule that applies to all elements with a local name of "channel," regardless of their namespace.

The following stylesheet mimics the behavior of the rss1.py Python program in Mark's article:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="*[local-name()='title']">
    <xsl:text>title: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="*[local-name()='link']">
    <xsl:text>link: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="*[local-name()='description']">
    <xsl:text>description: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="dc:creator">
    <xsl:text>author: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="dc:date">
    <xsl:text>date: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Suppress language elements, whatever their namespace. -->
  <xsl:template match="*[local-name()='language']"/>

</xsl:stylesheet>

There is one slight difference: it doesn't print the "date:" and "author:" headers for news items that have no dc:creator or dc:date children. RSS 0.91 doesn't use these two Dublin Core elements. The first template rule in this stylesheet has an asterisk and a predicate inside square brackets to specify that the XSLT engine should apply that rule to any element meeting the predicate condition: its local name is "title." The second and third template rules use the same pattern to handle the RSS link and description elements.

I won't show the input and output for this stylesheet: they're essentially the same as the input and output in Mark's article. Instead, I'd rather take the stylesheet a few steps further to create a standalone news aggregator that requires no special software other than a web browser and an XSLT processor.

Three basic XSLT techniques make this possible:

Reading remote documents with XSLT's document() function, which most XSLT processors support; our stylesheet uses it to retrieve the news feeds from their servers.
Converting the RSS elements and attributes to HTML for display by the browser.
Using the local-name() function to create template rules that don't care about the namespace of RSS elements such as channel, item, and link.

There are plenty of RSS-based news aggregating clients around: Amphetadesk, NewzCrawler, and NetNewsWire, among many others. The advantage of using one written in XSLT is that you don't have to install new software on your machine or log in to a server-based aggregator that needs to look up a list of your favorite feeds. You can also more easily integrate the XSLT-based one into other applications -- for example, to add customized news feeds to your company's intranet site without relying on any software more expensive or exotic than an XSLT processor.

Our stylesheet will transform the following XML document, which links to summaries of several news feeds and blogs:

<?xml-stylesheet href="getRSS.xsl" type="text/xsl"?>
<RSSChannels>

  <!-- RSS 0.91 feeds -->
  
  <RSSChannel src="http://xml.coverpages.org/covernews.xml"/>
  <RSSChannel src="http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/world/rss091.xml"/>

  <!-- RSS 1.0 feeds -->
  <RSSChannel src="http://www.ilrt.bristol.ac.uk/discovery/rdf/resources/rss.rdf"/>
  <RSSChannel src="http://www.smartmobs.com/index.rdf"/>
  <RSSChannel src="http://www.infoworld.com/rss/news.rdf"/>

  <!-- RSS 2.0 feeds -->
  <RSSChannel src="http://www.panix.com/~jbm/snappy/index.xml"/>
  <RSSChannel src="http://www.antipixel.com/blog/index.xml"/>
  <RSSChannel src="http://revjim.net/index.xml"/>

</RSSChannels>

As the document's comments tell us, it includes feeds from the three currently popular RSS formats. For now, most feeds using RSS 2.0 come from webloggers interested in playing with the latest technology, but I'm sure we'll see more commercial sites take advantage of the richer metadata possibilities offered by the post-0.91 releases.

The processing instruction in the document's first line identifies the stylesheet to use for dynamic rendering in a web browser. Before looking at how the stylesheet works, first watch it in action: unzip this file onto your hard disk and use a recent release of Internet Explorer to open RSSChannels.xml. There are a few caveats to remember:

This doesn't work with Mozilla, which, as of release 1.2.1, still has some kinks in its implementation of the document() function.
I'd hoped to put the XML file and its stylesheet on a public server so that you could just link to it from this article to see it in action, but I got an "Access denied" message when the stylesheet tried to use the document() function to retrieve a document from a different server. This could be a security precaution in IE's XSLT implementation.

Using IE to open up local copies of RSSChannels.xml and its accompanying getRSS.xsl stylesheet should work fine. A batch file or shell script can also use Xalan or Saxon and these two files to create an HTML file that any web browser can read. So, these caveats won't stand in the way of anyone developing their own XSLT RSS client -- they just get in the way of the flashy demo that I had originally planned.

Let's look at the getRSS.xsl stylesheet.

<!-- getRSS.xsl: retrieve RSS feed(s) and convert to HTML. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:dc="http://purl.org/dc/elements/1.1/" version="1.0">

  <xsl:output method="html"/>

  <xsl:template match="RSSChannels">
    <html><head><title>Today's Headlines</title>
    <style><xsl:comment>

p         { font-size: 8pt;
            font-family: arial,helvetica; }

h1        { font-size: 12pt;
            font-family: arial,helvetica;
            font-weight: bold; }

a:link    { color:blue;
            font-weight: bold;
            text-decoration: none; }

a:visited { font-weight: bold;
            color: darkblue;
            text-decoration: none; }

   </xsl:comment></style></head>
   <body>
     <xsl:apply-templates/>
   </body></html>
  </xsl:template>

  <xsl:template match="RSSChannel">
    <xsl:apply-templates select="document(@src)"/>
  </xsl:template>

  <!-- Named template outputs HTML a element with href link and RSS
       description as title to show up in mouseOver message. -->
  <xsl:template name="a-element">
    <xsl:element name="a">
      <xsl:attribute name="href">
        <xsl:apply-templates select="*[local-name()='link']"/>
      </xsl:attribute>
      <xsl:attribute name="title">
        <xsl:apply-templates select="*[local-name()='description']"/>
      </xsl:attribute>
      <xsl:value-of select="*[local-name()='title']"/>
    </xsl:element>
  </xsl:template>

  <!-- Output RSS channel name as HTML a link inside of h1 element. -->
  <xsl:template match="*[local-name()='channel']">
    <xsl:element name="h1">
      <xsl:call-template name="a-element"/>
    </xsl:element> 
    <!-- Following line for RSS 0.91 and 2.0, where items are children of channel -->
    <xsl:apply-templates select="*[local-name()='item']"/>
  </xsl:template>

  <!-- Output RSS item as HTML a link inside of p element. -->
  <xsl:template match="*[local-name()='item']">
    <xsl:element name="p">
      <xsl:call-template name="a-element"/>
      <xsl:text> </xsl:text>
      <xsl:if test="dc:date"> <!-- Show date if available -->
        <xsl:text>( </xsl:text>
        <xsl:value-of select="dc:date"/>
        <xsl:text>) </xsl:text>
      </xsl:if>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

Even with whitespace and comments, the whole thing is fewer than 80 lines. It has five template rules:

The first is for the root RSSChannels element of the main document that holds the RSS feed URIs. It does the basic setup of the result HTML document, including the addition of a CSS stylesheet.
The short second template rule acts on an RSSChannel element, using the XSLT document() function to read in the document named by the element's src attribute. The stylesheet assumes that the document being read is an RSS document, and the stylesheet uses the remaining three template rules to transform the elements of the RSS document read in by the document() function into HTML.
The third template rule's xsl:template element has a name attribute instead of a match attribute, making it a named template that must be explicitly called from another template rule. Because the fourth and fifth template rules both wrap their result contents in an HTML a element of the same structure, the common code is stored in this named template. Note how its xsl:apply-templates instructions use the local-name() function to select which elements supply the attribute values in the result.
The fourth template rule outputs the name of an RSS channel -- typically, the title of the news channel such as "XML.com" or "InfoWorld: Top News" -- as an HTML h1 element. The h1 element wraps an a element that links back to the main page of the site using the URI named in the channel element's link child element. The a element includes the description of the channel in a title attribute so that when the resulting HTML is displayed using recent releases of Internet Explorer, Mozilla, or Opera, a mouseOver event displays that description in a pop-up box. The actual a element is output with a call to the "a-element" named template.
The last template rule outputs an HTML p element containing a link to a particular news item. It uses the RSS item element's link and description child elements the same way that the preceding template rule does, which is why the creation of the a element with these attributes was moved to a separate named template that these two both call. This final template rule adds one more bit of information: if a dc:date element is supplied with the news item, the template rule adds its value to the result tree as plain text in parentheses.
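
As an aside, the named template could also be written more compactly with a literal result element and attribute value templates (the expressions in curly braces). The following is a sketch of that alternative, not the code used in getRSS.xsl. One difference to keep in mind: each attribute value template takes the string value of only the first matching child, while the xsl:apply-templates instructions in the original process every matching child. For feeds with a single link and description per channel or item, the two should behave alike.

```xml
<!-- Sketch: the a-element named template rewritten with a literal
     result element and attribute value templates. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:output method="html"/>

  <xsl:template name="a-element">
    <!-- Each {...} is replaced by the string value of the first
         matching child of the current node. -->
    <a href="{*[local-name()='link']}"
       title="{*[local-name()='description']}">
      <xsl:value-of select="*[local-name()='title']"/>
    </a>
  </xsl:template>

  <!-- Caller modeled on the item rule in getRSS.xsl, without the
       date handling. -->
  <xsl:template match="*[local-name()='item']">
    <p><xsl:call-template name="a-element"/></p>
  </xsl:template>

</xsl:stylesheet>
```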

Ill-formed RSS?

One word of caution: as Mark mentioned in his article, not all RSS feeds are well-formed XML, and anything that you load into a source tree for XSLT processing must be well-formed XML. To process ill-formed RSS, you'll have to go beyond XSLT, and Mark will explain some strategies for that in a follow-up piece. In my research, I found very little ill-formed RSS, so this hasn't been a problem for me.

On December 31st I used Saxon to apply this stylesheet to the RSSChannels document shown above and created an HTML result version that you can see here. (Don't forget to try the mouseOvers...) If I applied the same stylesheet to the same XML document at a later date, the result would be different, with more up-to-date news. That's the beauty of RSS.

The actual HTML and CSS that I used create a pretty stark layout. Some simple additions to the stylesheet could add some glitz to the resulting appearance, but despite its visual simplicity, this stylesheet still does a great deal: it retrieves a customized set of news feeds listed in a simple, easily customizable file, and then displays a menu of the news items where you can see their titles, read their descriptions, and then follow the links to the actual stories. You could modify the layout to make it fancier, or you could modify it to make it simpler -- slight modifications will let you convert the RSS to WML, plain text delivery, or some new markup language being developed for new output devices. XSLT helps you grab these RSS feeds; what you do with them is up to you.
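
As one illustration of such a modification, here is a rough sketch of a plain text variant. The file name getRSStext.xsl and the "==" and "*" text conventions are my own assumptions for the example; the retrieval logic and the local-name() tests follow getRSS.xsl.

```xml
<!-- getRSStext.xsl (hypothetical name): same retrieval logic as
     getRSS.xsl, but plain text output instead of HTML. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:output method="text"/>

  <!-- Fetch each feed listed in the RSSChannels document. -->
  <xsl:template match="RSSChannel">
    <xsl:apply-templates select="document(@src)"/>
  </xsl:template>

  <!-- Channel title as a heading line, then any child items
       (RSS 0.91 and 2.0 nest items inside channel). -->
  <xsl:template match="*[local-name()='channel']">
    <xsl:text>&#10;== </xsl:text>
    <xsl:value-of select="normalize-space(*[local-name()='title'])"/>
    <xsl:text> ==&#10;</xsl:text>
    <xsl:apply-templates select="*[local-name()='item']"/>
  </xsl:template>

  <!-- One entry per item: title on one line, link indented below. -->
  <xsl:template match="*[local-name()='item']">
    <xsl:text>* </xsl:text>
    <xsl:value-of select="normalize-space(*[local-name()='title'])"/>
    <xsl:text>&#10;  </xsl:text>
    <xsl:value-of select="normalize-space(*[local-name()='link'])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress stray text that the built-in rules would otherwise copy. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

The empty text() rule matters here: in RSS 1.0 the item elements are siblings of channel rather than children, so the processor reaches them through the built-in rules, and without that rule the same built-in rules would also copy the text of elements the stylesheet never mentions.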

Modify the stylesheet to your heart's content and change the URIs in the RSSChannels document as well. You can find a wide selection of feeds at WebReference.com, Alternative News on the Web, Yahoo's RSS News Aggregators category, and the massive news4sites list. Happy aggregating!