Extending RSS
July 23, 2003
Introduction
The boom of weblogs has boosted interest in techniques for syndicating news-like material. In response, a family of applications known as aggregators or newsreaders has been developed. These applications consume and display metadata feeds derived from the content. Currently there are two major formats for these data feeds: RSS 1.0 and RSS 2.0. Mark Pilgrim covers these two flavors of RSS in his XML.com article "What is RSS?"
The names are misleading -- the specifications differ not only in version number but also in philosophy and implementation. If you want to syndicate simple news items there is little difference between the formats in terms of capability or implementation requirement. However, if you want to extend into distributing more sophisticated or diverse forms of material, then the differences become more apparent.
The decision over which RSS version to favor really boils down to a single trade-off: syntactic complexity versus descriptive power. RSS 2.0 is extremely easy for humans to read and generate manually. RSS 1.0 isn't quite so easy, as it uses RDF. It is, however, interoperable with other RDF languages and is eminently readable and processible by machines.
This article shows how the RDF foundation of RSS 1.0 helps when you want to extend RSS 1.0 for uses outside of strict news item syndication, and how existing RDF vocabularies can be incorporated into RSS 1.0. It concludes by providing a way to reuse these developments in RSS 2.0 feeds while keeping the formal definitions made with RDF.
RSS 1.0 Terms Have a Formal Definition
RSS 1.0 documents conform to the RDF/XML Syntax Specification. This means that they are expressed in the language described in RDF Concepts and Abstract Syntax, which has the precise formal semantics defined in RDF Semantics. Unless you're a logician or have masochistic tendencies, you probably won't want to follow the path all the way to the formal base. For most developers the RDF Primer contains plenty to get started. The take-home message is that, unlike with plain XML, which is just syntax, there is well-known meaning that programs can derive from an RDF/XML document.
There is another part of the RDF specification that we need to consider when talking about RSS 1.0: RDF Schema. In the jargon, the RDF Schema specification defines an ontology language. An ontology gives names to concepts and relationships between those concepts. An ontology is really just a tightly controlled vocabulary; to some extent in this context the words "ontology", "vocabulary", and "schema" are interchangeable (in the RSS world, module is often used to refer to essentially the same thing).
RSS 1.0 may be a format defined in human language in the main specification document, but it is also an ontology that is specified in formal language in the RSS 1.0 RDF schema. Consider the RSS 1.0 snippet below.
...
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://example.com/2003/09/29#9">
<title>The Joy of Blogs</title>
...
The example uses the <item> and <title> terms, and they can be found in the schema defined like this:
...
<rdfs:Class rdf:about="http://purl.org/rss/1.0/item" rdfs:label="Item" rdfs:comment="An RSS item.">
<rdfs:isDefinedBy rdf:resource="http://purl.org/rss/1.0/" />
</rdfs:Class>
<rdf:Property rdf:about="http://purl.org/rss/1.0/title" rdfs:label="Title" rdfs:comment="A descriptive title for the channel.">
<rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/title" />
<rdfs:isDefinedBy rdf:resource="http://purl.org/rss/1.0/" />
</rdf:Property>
...
The main things being said here are that item in RSS 1.0 is an RDF class and that title is a property. RDF classes more or less correspond to concepts, and properties are used to describe the relationships between those concepts. So returning to our example, it can be demonstrated that there's more being said in the example than is immediately obvious:
...
<item rdf:about="http://example.com/2003/09/29#9">
<title>The Joy of Blogs</title>
...
This says first of all that the resource identified as http://example.com/2003/09/29#9 is an instance of the class item. The RDF/XML syntax provides a specific interpretation of the nesting of the XML, which allows us to determine that the resource has a property title, and the value of the property is the literal string "The Joy of Blogs". This still doesn't seem to offer much advantage over plain XML. But what we have isn't just given in terms of human-readable documentation, it's defined with unambiguous definitions throughout, traceable back to the logical formalism of RDF. These semantics allow us to not only make statements about the item but to reason programmatically with those statements.
In the RDF Schema snippet above, it also says that the title property is a subproperty of the resource http://purl.org/dc/elements/1.1/title, an element defined by the Dublin Core Metadata Initiative. We can then infer from these statements that the literal "The Joy of Blogs" is also related to the item as a Dublin Core title. If, for example, a browser-like application were reading the data, but didn't know how to render rss:title, it could reasonably substitute the renderer for dc:title.
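The inference just described can be sketched in a few lines of Python. This is only an illustration of the rdfs:subPropertyOf rule over plain tuples, not how any particular RDF toolkit implements it; the function name infer_superproperties and the single-parent dictionary are assumptions made for brevity.

```python
# Minimal sketch of rdfs:subPropertyOf inference over (subject, predicate, object) tuples.
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"

# The one schema statement we rely on: rss:title is a subproperty of dc:title.
# Assumption: each property has at most one parent, so a dict is enough here.
SUBPROPERTY_OF = {RSS + "title": DC + "title"}

def infer_superproperties(triples, subproperty_of):
    """Return the stated triples plus those entailed by rdfs:subPropertyOf."""
    inferred = set(triples)
    pending = list(triples)
    while pending:
        subject, predicate, obj = pending.pop()
        parent = subproperty_of.get(predicate)
        if parent is not None:
            new_triple = (subject, parent, obj)
            if new_triple not in inferred:
                inferred.add(new_triple)
                # Re-queue so that chains of subproperties are followed too.
                pending.append(new_triple)
    return inferred

item = "http://example.com/2003/09/29#9"
stated = {(item, RSS + "title", "The Joy of Blogs")}
entailed = infer_superproperties(stated, SUBPROPERTY_OF)
```

After running this, entailed holds both the original rss:title statement and the derived dc:title statement, which is what lets an application that only knows Dublin Core still display the title.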
What do we gain from all this formal grounding? If RSS processing alone is our universe, maybe not a lot. But as soon as we want to start integrating our RSS with other RDF data, or merge other data into our RSS, we start to reap rewards.
Extending RSS: Software Releases
As an example of extending RSS, we'll take a software company's product announcement RSS feed. Periodically they release updates to their product, and they would like the announcement of the update to be an automated part of the release process. So when a new release build is made, an item will be inserted into their news feed that contains the product name and the release version.
We create an RSS module by defining the properties we need, explaining their usage and associating them with a unique namespace. On the face of it, this is a trivial exercise -- for the update module we can just define a couple of simple elements:
- product - the name of the product. A character string.
- version - the version of the release expressed as a string in the format x.y where x is the major version number and y the minor version number.
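A consumer of the module will probably want to check the x.y format rather than trust it. The following is a minimal sketch, assuming both parts are non-negative decimal integers (the module description doesn't say more than that); parse_version is a hypothetical helper name.

```python
import re

def parse_version(text):
    """Split an "x.y" version string into a (major, minor) tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("expected a version in x.y form, got %r" % text)
    return int(match.group(1)), int(match.group(2))

# Comparing parsed tuples orders releases numerically; comparing the raw
# strings would wrongly place "2.10" before "2.3".
newer = parse_version("2.10") > parse_version("2.3")
```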
For a namespace we just need a URI, ideally one under our control. So if we have registered the domain name supersemantics.com then we could use that as a base. It's a good idea to recommend a prefix to use for the namespace within XML documents, and here we shall use rel.
Here's what this might look like in our RSS 1.0 feed.
xmlns:rel="http://supersemantics.com/ns/release/"
...
<item rdf:about="http://supersemantics.com/release/2003/06/19#9">
<title>New Release</title>
<dc:date>2003-06-19T14:02:33+01:00</dc:date>
<rel:product>IronBoard</rel:product>
<rel:version>2.3</rel:version>
...
The date in RSS 1.0 is expressed using the W3C Date and Time Formats (W3CDTF), a profile of the ISO 8601 standard.
By using RDF/XML syntax here, we actually say more than we would with plain XML. The product and version elements are actually RDF properties, relating the item resource to literal strings. There are two statements being made here, which can be expressed as subject (what's being described), predicate (the property), and object (value of that property):
http://supersemantics.com/release/2003/06/19#9 rel:product "IronBoard"
http://supersemantics.com/release/2003/06/19#9 rel:version "2.3"
The (subject, predicate, object) statement is an important concept in the RDF world and is usually referred to as a triple. The subject of one triple may be the object of another and vice versa. This means the triples can also be thought of as a joined-up structure, and that structure is the RDF graph.
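To make the triples concrete, here is a rough sketch that pulls them out of the item using only Python's standard library. It is not a real RDF/XML parser: it assumes every child of the item is a property with a literal value, and it leaves out the statement that the resource is an instance of item. The function name item_triples and the complete feed wrapper around the item are assumptions for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"

FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:rel="http://supersemantics.com/ns/release/"
  xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://supersemantics.com/release/2003/06/19#9">
    <title>New Release</title>
    <dc:date>2003-06-19T14:02:33+01:00</dc:date>
    <rel:product>IronBoard</rel:product>
    <rel:version>2.3</rel:version>
  </item>
</rdf:RDF>"""

def item_triples(rdf_xml):
    """List (subject, predicate, object) tuples for literal-valued item properties."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for item in root.iter("{%s}item" % RSS_NS):
        subject = item.get("{%s}about" % RDF_NS)
        for child in item:
            # ElementTree tags look like "{namespace}local"; removing the
            # braces yields the property URI (namespace + local name).
            predicate = child.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, (child.text or "").strip()))
    return triples

triples = item_triples(FEED)
```

The two rel: statements listed above come out alongside the title and date, all sharing the same subject, which is what joins them into a single graph.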
So what's the big deal? The relationship between the item and the product name and version number is already defined. We can load our RSS file into any RDF aware toolkit (and there are plenty, see Dave Beckett's Resource Guide) and have it immediately know that an item has properties product and version. We don't need any more programmer logic to extend the data model.
If we wish to offer our new module for reuse by others we can, in the same way that the item and title terms are defined in the RSS 1.0 RDF Schema, provide a schema with formal definitions for our terms.
Working with Existing Vocabularies
We noted earlier that the RSS 1.0 title property was actually a subproperty of Dublin Core's title. Some parts of the RSS 1.0 vocabulary such as dc:date and dc:creator are used directly from Dublin Core.
Generally speaking it's good practice to use existing vocabularies directly wherever possible, as it's the best route to interoperability. A common scenario is that a general purpose vocabulary contains a term close to what we're looking for, but our requirement is more specific. The solution here is to define our own term as a subclass or subproperty of the existing term (depending whether the term applies to an entity or a relationship between entities). Thus the child class (or property) takes on the same characteristics as its parent, in addition to anything specific to the child.
As it happens, there is at least one existing vocabulary designed to describe software releases. In fact, the release schema at eikster.com contains terms that directly correspond to our product and version, called name and version. We can inherit their descriptions by making our properties subproperties of them.
There is one significant difference between eikster.com's properties and ours -- their schema provides a Release class, to which the properties apply. Looking back at the RSS 1.0 example, we have our product and version applied to an RSS item -- the resource on the left-hand side of the triples is an item, and on the right-hand side we have a string literal. We can use RDF Schema to say we want the domain (left-hand side) of our properties to be instances of item and the range (right-hand side) to be literals. Note that the domain and range are primarily descriptive; they don't in themselves offer any real constraint of the kind found in W3C XML Schema (WXS). It's up to applications to interpret this as they wish (true constraints can be added using the Web Ontology Language, OWL).
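The descriptive nature of rdfs:domain is easier to see in code. An RDFS-aware application doesn't reject a statement whose subject isn't known to be an item; it concludes that the subject is one. A minimal sketch of that rule follows, with the domain declarations written out by hand as a dictionary (an assumption for the example, as is the name infer_types).

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RSS_ITEM = "http://purl.org/rss/1.0/item"
REL = "http://supersemantics.com/ns/release/"

# Domain declarations for our two properties, as they would appear in the schema.
DOMAINS = {REL + "product": RSS_ITEM, REL + "version": RSS_ITEM}

def infer_types(triples, domains):
    """Apply the RDFS domain rule: using property p implies the subject is in p's domain."""
    return {(s, RDF_TYPE, domains[p]) for s, p, _ in triples if p in domains}

release = "http://supersemantics.com/release/2003/06/19#9"
stated = [
    (release, REL + "product", "IronBoard"),
    (release, REL + "version", "2.3"),
]
types = infer_types(stated, DOMAINS)
```

Nothing here validates the data. The only effect of the domain declaration is one extra statement saying the release resource is an RSS item.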
A few more things that are easy to add to the schema and are likely to be useful are human-readable labels and comments for each property and references to their definition. Including a reference to the definition might seem a little redundant in part of the definition itself, but the statements in an RDF Schema may be used outside of their original context.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Property rdf:about="http://supersemantics.com/ns/release/product">
<rdfs:label>Product Name</rdfs:label>
<rdfs:comment>The official name of a software package</rdfs:comment>
<rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#name" />
<rdfs:domain rdf:resource="http://purl.org/rss/1.0/item"/>
<rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
<rdfs:isDefinedBy rdf:resource="http://supersemantics.com/ns/release"/>
</rdf:Property>
<rdf:Property rdf:about="http://supersemantics.com/ns/release/version">
<rdfs:label>Release Version</rdfs:label>
<rdfs:comment>The release version of a software package, given in major.minor format, e.g. 2.3</rdfs:comment>
<rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#version" />
<rdfs:domain rdf:resource="http://purl.org/rss/1.0/item"/>
<rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
<rdfs:isDefinedBy rdf:resource="http://supersemantics.com/ns/release"/>
</rdf:Property>
</rdf:RDF>
Together with this schema, RDF Schema-aware software that understands eikster.com's Release classes will also be able to understand our RSS items, as we have defined how they relate.
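Because the schema is itself just RDF/XML, software can read the relationships out of it instead of having them hard-coded. The sketch below uses Python's standard library on a cut-down copy of the schema (labels, comments, domain and range omitted); it only follows one level of rdfs:subPropertyOf, and the helper names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

SCHEMA = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Property rdf:about="http://supersemantics.com/ns/release/product">
    <rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#name"/>
  </rdf:Property>
  <rdf:Property rdf:about="http://supersemantics.com/ns/release/version">
    <rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#version"/>
  </rdf:Property>
</rdf:RDF>"""

def subproperty_map(schema_xml):
    """Map each property URI in the schema to its declared parent property."""
    mapping = {}
    for prop in ET.fromstring(schema_xml).findall(RDF + "Property"):
        parent = prop.find(RDFS + "subPropertyOf")
        if parent is not None:
            mapping[prop.get(RDF + "about")] = parent.get(RDF + "resource")
    return mapping

def with_parent_properties(triples, mapping):
    """Return the triples plus a copy of each restated with its parent property."""
    extra = [(s, mapping[p], o) for s, p, o in triples if p in mapping]
    return list(triples) + extra

release = "http://supersemantics.com/release/2003/06/19#9"
ours = [(release, "http://supersemantics.com/ns/release/product", "IronBoard")]
understood = with_parent_properties(ours, subproperty_map(SCHEMA))
```

An application written against eikster.com's vocabulary can now find the product name under the property it already knows.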
Bringing RSS 2.0 to the Party
There are various reasons, substantially matters of personal preference, why some may prefer an RSS 2.0 format. If we can map RSS 2.0 with our extension module unambiguously to the equivalent RSS 1.0 version, then what we have done is effectively turn the XML syntax into a task-specific serialization of RDF. We can get all the semantic goodness of RDF in the simple XML packaging of RSS 2.0. This is the approach taken by my project, Simple Semantic Resolution (SSR), which is actually defined as an RSS 2.0 module. A step-by-step description, SSR-Enabling an RSS 2.0 Module, is available, but we have already looked at most of these steps here. What we haven't done yet is define the mapping. In SSR this is done by supplying an XSLT stylesheet that can transform documents using our module in combination with RSS 2.0 into their RSS/RDF counterpart.
A stylesheet is available (thanks to Sjoerd Visscher) that can convert core RSS 2.0 into RSS 1.0, so all we have to do is supply whatever extra is needed to convert our XML elements and contents into RDF properties and objects via a syntactic transformation. For our software release module, that is absolutely nothing. Sjoerd's XSLT passes through unchanged any XML that isn't recognised as RSS, and that's exactly what we want for our syntax.
So all we have to do to give instances of our extended RSS 2.0 the RDF semantics is to use SSR to identify the transform that defines the mapping. All this takes is the insertion of an extra element into the RSS just below the root level, so our enriched RSS 2.0 will look like this:
<rss version="2.0"
xmlns:rel="http://supersemantics.com/ns/release/"
xmlns:ssr="http://purl.org/stuff/ssr">
<ssr:rdf transform="http://ideagraph.net/xmlns/ssr/source/rss2rdf.xsl" />
...
<item>
<title>New Release</title>
<pubDate>Thu, 19 Jun 2003 14:02:33 +0100</pubDate>
<link>http://supersemantics.com/release/2003/06/19#9</link>
<rel:product>IronBoard</rel:product>
<rel:version>2.3</rel:version>
...
</rss>
A regular RSS 2.0 client can understand this, as there is no change to the core format.
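To give a feel for what the mapping amounts to, here is a Python sketch of the same idea. It is not Sjoerd's stylesheet and not part of SSR; it assumes that an item's link becomes the subject URI, that un-namespaced core elements map to RSS 1.0 terms of the same name, that pubDate (an RFC 822 date with a numeric offset) maps to dc:date, and that namespaced extension elements pass through unchanged. The channel wrapper and the function name are also assumptions for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

RSS1 = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"

FEED = """<rss version="2.0" xmlns:rel="http://supersemantics.com/ns/release/">
  <channel>
    <item>
      <title>New Release</title>
      <pubDate>Thu, 19 Jun 2003 14:02:33 +0100</pubDate>
      <link>http://supersemantics.com/release/2003/06/19#9</link>
      <rel:product>IronBoard</rel:product>
      <rel:version>2.3</rel:version>
    </item>
  </channel>
</rss>"""

def rss2_item_triples(rss_xml):
    """Turn RSS 2.0 items into (subject, predicate, object) tuples."""
    triples = []
    for item in ET.fromstring(rss_xml).iter("item"):
        subject = item.findtext("link")  # assumption: the link identifies the item
        for child in item:
            text = (child.text or "").strip()
            if child.tag.startswith("{"):
                # Namespaced extension element: keep it as-is.
                predicate = child.tag[1:].replace("}", "", 1)
            elif child.tag == "pubDate":
                # RFC 822 date in RSS 2.0, ISO 8601 (W3CDTF-style) in RSS 1.0.
                predicate = DC + "date"
                text = parsedate_to_datetime(text).isoformat()
            else:
                predicate = RSS1 + child.tag
            triples.append((subject, predicate, text))
    return triples

triples = rss2_item_triples(FEED)
```

The rel:product and rel:version statements that result are identical to the ones we got from the RSS 1.0 feed, which is the point: the extension module needs no mapping work of its own.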
Conclusion
RSS 1.0's strong point is its use of the RDF model, which enables information to be represented in a consistent fashion. This model is backed by a formal specification which provides well-defined semantics. From this point of view, RSS 1.0 becomes just another vocabulary that uses the framework. In contrast, outside of the relationships between the handful of syndication-specific terms defined in its specification, RSS 2.0 simply doesn't have a model. There's no consistent means of interpreting material from other namespaces that may appear in an RSS 2.0 document. It's a semantic void. But it doesn't have to be that way since it's relatively straightforward to map to the RDF framework and use that model.
The scope of applications is often extended, and depending on how you look at it, it's either enhancement or feature creep. Either way, it usually means diminishing returns -- the greater the distance from the core domain, the more additional work is required for every new piece of functionality. But if you look at the web as one big application, then we can get a lot more functionality with only a little more effort.