Introducing RDFa

February 14, 2007

For a long time now, RDF has shown great promise as a flexible format for storing, aggregating, and using metadata. Maybe for too long—its most well-known syntax, RDF/XML, is messy enough to have scared many people away from RDF. The W3C is developing a new, simpler syntax called RDFa (originally called "RDF/a") that is easy enough to create and to use in applications that it may win back a lot of the people who were first scared off by the verbosity, striping, container complications, and other complexity issues that made RDF/XML look so ugly.

RDF/XML doesn't have to be ugly, but even simple RDF/XML doesn't fit well into XHTML, because browsers and other applications designed around HTML choke on it. So, while the general plan for RDFa is to make it something that can be embedded into any XML dialect, the main effort has gone into making it easy to embed in XHTML. This gives it an important potential role in the grand plan for the Semantic Web, in which web page data is readable not only by human eyes but by automated processes that can aggregate data and associated metadata and then perform tasks that are much more sophisticated than those that typical screen scraping applications can do now. In fact, the relationship between RDFa metadata and existing content in web pages has been central to most of the use cases driving RDFa's progress.

Plenty of software is already available to pull RDFa triples from XHTML documents and use them, which means that even though the specification isn't quite done, there's plenty to play with.

The "a" in "RDFa"

RDF often uses a subject, predicate, object combination called a triple to specify an attribute name/value pair about a particular resource. (That's "attribute" in the object-oriented sense, not the XML sense; for example, a triple could specify that the resource with ID http://example.com/artwork#fountain has an author value of "Richard Mutt.") To allow you to add metadata to a web page without affecting a browser's display of that page, RDFa uses some existing XHTML 1 attributes and a few new XHTML 2 attributes to store the subjects, predicates, and objects of these RDF triples. (The objects may also be existing PCDATA in your web pages, with subject and predicate attributes letting this text play a dual role of human-readable displayed content and machine-readable metadata.)

RDFa uses the existing XHTML 1 attributes href, content, rel, rev, and datatype, and it uses the new about, role and property attributes from XHTML 2's Metainformation Attributes module. While the following chart of their use covers only a subset of ways to store RDFa metadata in an XHTML file, it's enough to get you pretty far. The RDFa Primer and RDF/A Syntax W3C documents (and Part 2 of this article) describe more sophisticated ways to add RDFa metadata to your XHTML documents.

There are two basic cases: triples that have a literal string as their object and triples that have a URI as their object. (When possible, it's better to have a URI as an object, because it lets the same value serve as the object of some triples and the subject of others. This makes it easier to connect triples and find new information through inferencing.)

	subject	predicate	object
literal string as object	`about`	`property`	`content` attribute or PCDATA
URI as object	`about`	`rel`	`href`

The RDFa syntax document tells us that "it should be possible to represent a [triple] using only one XML element." Let's look at three examples:

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      property="dc:title" content="Generating a Single Globally Unique ID"/>

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      property="dc:title">Generating a Single Globally Unique ID</span>

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      rel="dc:subject" href="http://www.snee.com/bobdc.blog/neat_tricks/"/>

These triples make the following statements (assuming that the dc prefix is assigned to the standard Dublin Core URI):

The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."
The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."
The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core subject value of http://www.snee.com/bobdc.blog/neat_tricks/.

If the first two triples say the same thing, why would you prefer one over the other? Assuming that your document has its title in the document's content before you begin adding RDFa markup, adding the second span element above means adding a little less text to your document; you just wrap the existing title in the span start- and end-tags shown, a technique that fits in well with the Semantic Web vision of turning existing web content into machine-readable content. If your title was already part of your document, the content attribute value of the first triple would add redundant information to your document, and if your document's title changes, you would need to change it in two places. On the other hand, when adding information that is not already part of the content of your document (for example, workflow information or attribution rights about components of the document) the first span element above provides a good model.

The third span element above uses slightly different attributes to specify a triple that has a URI as an object value.
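
The mapping in the chart above can be sketched in a few lines of Python. This is only an illustration, not how any particular RDFa tool works: the function names `expand` and `triple_from_element` are made up for this example, the prefix table is supplied by hand (assuming that dc is bound to the Dublin Core elements namespace), and the parser is the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# Assumed prefix table; the article only says that "dc" is bound to Dublin Core.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as dc:title into a full URI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


def triple_from_element(xml_text):
    """Return (subject, predicate, object, kind) for a single element.

    kind is "literal" for property/content (or element text) and
    "uri" for rel/href. Returns None if the element holds no triple.
    """
    el = ET.fromstring(xml_text)
    subject = el.get("about", "")
    if el.get("property") is not None:
        # Literal object: the content attribute wins, else the element's text.
        obj = el.get("content")
        if obj is None:
            obj = "".join(el.itertext())
        return (subject, expand(el.get("property")), obj, "literal")
    if el.get("rel") is not None:
        # URI object: taken from href.
        return (subject, expand(el.get("rel")), el.get("href"), "uri")
    return None


PAGE = "http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"

first = triple_from_element(
    '<span about="%s" property="dc:title" '
    'content="Generating a Single Globally Unique ID"/>' % PAGE)
second = triple_from_element(
    '<span about="%s" property="dc:title">'
    'Generating a Single Globally Unique ID</span>' % PAGE)
third = triple_from_element(
    '<span about="%s" rel="dc:subject" '
    'href="http://www.snee.com/bobdc.blog/neat_tricks/"/>' % PAGE)
```

Here `first` and `second` come out as the same tuple, which matches the point that the first two span elements state the same triple, while `third` carries a URI rather than a literal as its object.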

RDFa Elements

All three of the elements above are span elements. While these are popular for RDFa because you can insert them anywhere in the body of an HTML document, you can add the same RDFa attributes to any elements you like. link and meta elements are popular for inserting triples into the head of an HTML document. This is part of the beauty of RDFa—these elements have been used to add metadata to the head element for years (for example, to indicate the URL of a web page's CSS stylesheet), and now RDFa-aware software can pull useful metadata from them with only minor modifications to these elements. (Modifications are necessary because the triple pulled from an XHTML 1 link element that points to a CSS stylesheet would not be completely legal RDF, because a rel value of "stylesheet" is not a URI and therefore not a proper RDF predicate.)

The a linking element is also popular for storing RDFa metadata, because it always expresses a relationship between one resource (the document where it's stored) and another (the resource it links to). The a element's rel attribute—which has actually been around as long as HTML itself, despite its lack of use before Google's nofollow trick came along—adds information about the relationship, and this information serves as the predicate of a triple stored in an a element.

More Triples, Fewer Subjects

If a document has 100 triples of metadata, the triples probably won't have 100 different subjects. The subject of many will probably be the document itself, as they specify its title, author, and perhaps workflow data about how the document got into its current state. Another group of triples might describe an image's photographer, date taken, and rights re-use information.

Building on existing XHTML syntax, RDFa lets you build multiple triples from the same subject without cluttering up your document too much. An RDFa processor that finds no about attribute on an element uses the about value of the nearest ancestor element that has one as the subject. (As we'll see in Part 2 of this article, the presence of an id attribute can provide an alternative to this behavior.) For example, the following stores three metadata statements about the resource at http://www.snee.com/img/myfile.jpg, because although the three span elements have no about attribute, their parent img does:

<img src="http://www.snee.com/img/myfile.jpg"
     about="http://www.snee.com/img/myfile.jpg">
  <span property="dc:subject" content="Niagara Falls"/>
  <span property="dc:creator" content="Richard Mutt"/>
  <span property="dc:format" content="image/jpeg"/>
</img>

If the RDFa processor searches through all the ancestors of the element with a metadata statement's predicate and object, and doesn't find an about attribute, then the subject is an empty string. According to the RDF/A Syntax specification, this "effectively indicates the current document."

This is handy, because plenty of a document's metadata is typically about the document itself. For example, your document's main title could have this span element to indicate that its content is the Dublin Core title of the work (assuming that no ancestor of the sample h1 element has an about attribute):

  <h1><span property="dc:title">My Story</span></h1>
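
The subject lookup described above can be sketched as a recursive walk that hands the current subject down to child elements. This is a rough illustration only: `literal_triples` is a hypothetical helper written for this example, it handles just the property/content case, and it leaves prefixed names such as dc:title unexpanded.

```python
import xml.etree.ElementTree as ET


def literal_triples(el, inherited_subject=""):
    """Yield (subject, property, literal) tuples for el and its descendants.

    An element's own about attribute wins; otherwise the subject is inherited
    from the nearest ancestor that has one, and defaults to the empty string
    (the current document) when no ancestor has an about attribute.
    """
    subject = el.get("about", inherited_subject)
    prop = el.get("property")
    if prop is not None:
        value = el.get("content")
        if value is None:
            value = "".join(el.itertext())
        yield (subject, prop, value)
    for child in el:
        yield from literal_triples(child, subject)


IMG = """<img src="http://www.snee.com/img/myfile.jpg"
     about="http://www.snee.com/img/myfile.jpg">
  <span property="dc:subject" content="Niagara Falls"/>
  <span property="dc:creator" content="Richard Mutt"/>
  <span property="dc:format" content="image/jpeg"/>
</img>"""

image_triples = list(literal_triples(ET.fromstring(IMG)))
heading_triples = list(literal_triples(
    ET.fromstring('<h1><span property="dc:title">My Story</span></h1>')))
```

All three tuples in `image_triples` share the img element's about value as their subject, and the single tuple in `heading_triples` has an empty string as its subject, which stands for the document itself.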

Metadata about the document with no displayable content can be stored in the head element of the document:

<html>
  <head>
    <meta property="dc:date" content="2007-03-15T10:35:42"/>

Now that we've seen a scattershot tour of what RDFa can do and how it does it, it would be easier to appreciate its potential uses if we step back and look at three categories of use cases:

Inline metadata about document components
Metadata about the containing document
Out-of-line metadata

Inline Metadata About Components

This category of RDFa use fulfills the original dream that led to RDFa's creation: how to take human-readable web page content and make it machine-readable. For example, the following sentence from the RDF/A Syntax document states that Mark Birbeck took a particular picture:

<p>This photo was taken by <span class="author" about="photo1.jpg" property="dc:creator">Mark Birbeck</span>.</p>

The span element and its attribute values let an RDFa-aware tool get the following triple out of this document, shown here in RDF/XML:

<rdf:Description rdf:about="file://C|/dat/xml/rdf/rdfa/photo1.jpg">
  <dc:creator>Mark Birbeck</dc:creator>
</rdf:Description>

Note the full path added to the photo1.jpg resource name by the RDFa extraction tool that I used. If I had declared an xml:base value, the tool would have used that to create the full URL.
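
Turning a relative about value into a full URL is ordinary relative-URI resolution. A minimal sketch with Python's standard `urllib.parse.urljoin` follows; the `resolve_subject` name and the example.com base URLs are invented for illustration, and a real extractor may apply additional rules.

```python
from urllib.parse import urljoin


def resolve_subject(about, base):
    """Resolve a possibly relative about value against a base URL.

    base stands in for an xml:base value or the document's own URL.
    """
    return urljoin(base, about)


# Hypothetical base URLs, used only for this example.
photo = resolve_subject("photo1.jpg", "http://www.example.com/photos/index.html")
fragment = resolve_subject("#recipe13941", "http://www.example.com/recipes/test5.html")
document = resolve_subject("", "http://www.example.com/photos/index.html")
```

With these inputs, `photo` is `http://www.example.com/photos/photo1.jpg`, `fragment` is `http://www.example.com/recipes/test5.html#recipe13941`, and the empty about value resolves to the base URL itself.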

In that example, the PCDATA string "Mark Birbeck" provided the object of the triple. Sometimes you might want to provide an alternative version of the displayed data, such as a normalized version of a date. In this case, a value in a content attribute overrides the element's displayed text:

<p>Last revision of document: <span about="http://www.snee.com/docs/mydoc1.html"
property="dc:date" content="20070315T15:32:00">March 15, 2007, at 3:32 PM</span></p>

The resulting triple uses the content value of the date:

<rdf:Description rdf:about="http://www.snee.com/docs/mydoc1.html">
  <dc:date>20070315T15:32:00</dc:date>
</rdf:Description>

Now, when searching aggregated metadata for a document last updated between 20070312 and 20070318, it will be easy to find the pointer to the document that says it was updated on "March 15, 2007, at 3:32 PM."
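
A search like that can be as simple as a string comparison on the normalized values. The sketch below assumes that the triples have already been extracted into plain (subject, predicate, object) tuples and that every dc:date value begins with an eight-digit YYYYMMDD date; `updated_between` and the second document URL are hypothetical.

```python
def updated_between(triples, start, end):
    """Return subjects whose dc:date falls between start and end, inclusive.

    start and end are YYYYMMDD strings. Comparing the first eight characters
    works because the normalized form sorts in date order.
    """
    return [s for s, p, o in triples
            if p == "dc:date" and start <= o[:8] <= end]


aggregated = [
    ("http://www.snee.com/docs/mydoc1.html", "dc:date", "20070315T15:32:00"),
    # Hypothetical second document, outside the date range.
    ("http://www.snee.com/docs/mydoc2.html", "dc:date", "20070401T09:00:00"),
]

matches = updated_between(aggregated, "20070312", "20070318")
```

Only mydoc1.html ends up in `matches`. The same comparison against the displayed text "March 15, 2007, at 3:32 PM" would not work, which is the reason for putting a normalized value in the content attribute.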

Metadata About the Containing Document

Inline metadata about document components was the original use case for RDFa, but its elegant design makes it simple to use for other kinds of metadata, such as metadata about the containing document. While some metadata, such as a document's title and author, is often redundant with existing data in the document and can be marked up inline as with the examples above, document metadata such as production workflow information can be easily stored in the document header. When no subject is specified, an RDFa processor assumes an empty string as the subject, which represents the document itself:

<html xmlns:fm="http://www.foomagazine.com/ns/prod/">
  <head>
    <title>Is Black the New Black?</title>
    <meta property="fm:newsstandDate" content="2006-04-03"/>
    <meta property="fm:copyEditor" content="RSelavy"/>
    <meta property="fm:copyEdited" content="2006-03-28T10:33:00"/>
  </head>
  <body>
<!-- body of page... -->

An RDFa extractor gets the following RDF/XML out of this:

<rdf:Description rdf:about="">
  <fm:newsstandDate>2006-04-03</fm:newsstandDate>
  <fm:copyEditor>RSelavy</fm:copyEditor>
  <fm:copyEdited>2006-03-28T10:33:00</fm:copyEdited>
</rdf:Description>

Out-of-Line Metadata About Components

Nesting of meta elements, which is a new feature of XHTML 2, lets you specify a single subject for multiple triples. When you do this in a web page's head element, you can name individual components of the document as the subject, making it possible to create metadata in your header for specific portions of your web page. For example, a document with multiple recipes in it can include production metadata in the head element about a specific recipe (note that the following sample also uses XHTML 2's new section and h elements):

<html xmlns:fm="http://www.foomagazine.com/ns/prod/">
<head>
 <meta about="#recipe13941">
   <meta property="fm:ComponentID">XZ3214</meta>
   <meta property="fm:ComponentType">Recipe</meta>
   <meta property="fm:RecipeID">r003423</meta>
 </meta>
</head>
<body>
  <h>Add Some Tex Mex Sizzle to Your Kid's Lunch</h>
  <section id="recipe22143">
    <h>Amigo Corn Dogs</h>
    <!-- li, p, etc. -->
  </section>
  <section id="recipe13941">
    <h>EZ Bean Tacos</h>
    <!-- li, p, etc. -->
  </section>
  <!-- more content -->
 </body>
</html>

The extracted triples show that this metadata refers only to the element in the document with an ID value of recipe13941:

<rdf:Description rdf:about="file://C|/dat/xml/rdf/rdfa/test5.html#recipe13941">
  <fm:ComponentType>Recipe</fm:ComponentType>
  <fm:RecipeID>r003423</fm:RecipeID>
  <fm:ComponentID>XZ3214</fm:ComponentID>
</rdf:Description>

Because RDFa lets you store a complete triple in an HTML document, you can even store metadata in one HTML document about resources (or portions of resources, like the recipe above, as long as they have an identifier) outside of that document.

Getting Those Triples

Several free tools are already available to extract RDF/XML from a document with RDFa so that you can then feed your triples to semantic web tools or to RDF-aware metadata management tools. Fabien Gandon of INRIA has written an XSLT stylesheet to do this, and Elias Torres has written a web service that only needs the URL of the document with the RDFa triples. Elias implemented this by adding RDFa support to RDFLib (of which I'm a long-time fan). RDFLib can extract embedded RDFa triples and load them into a triplestore in memory or on disk, and then you're off and running for developing an application around your extracted data. Among commercial tools, TopQuadrant's TopBraid Composer includes RDFa support.

The fact that reading RDFa is so easy to implement—you only need a program that can scan a document for certain combinations of a few elements and attributes—means that if no existing RDFa readers can do what you want, you can implement it yourself in any language that provides a reasonable XML parser.

Getting More Out of RDFa

We've seen that RDFa lets you add triples of useful metadata to your XHTML with simple, straightforward markup. It also offers features that let you do even more interesting things with it; in Part 2 of this article, we'll look at how to assign data types to your RDFa values, reification (how to add metadata about your metadata), specifying RDFa metadata about elements with an id attribute, compact URIs, and platforms that make it easier to automate the creation of RDFa metadata. Meanwhile, try adding some RDFa to some documents, play with the RDFa processors mentioned here to extract the metadata, and let me know what you think.