Introducing RDFa
February 14, 2007
For a long time now, RDF has shown great promise as a flexible format for storing, aggregating, and using metadata. Maybe for too long: its best-known syntax, RDF/XML, is messy enough to have scared many people away from RDF. The W3C is developing a new, simpler syntax called RDFa (originally called "RDF/a") that is easy enough to create and to use in applications that it may win back many of the people who were scared off by the verbosity, striping, container complications, and other complexity issues that made RDF/XML look so ugly.
RDF/XML doesn't have to be ugly, but even simple RDF/XML doesn't fit well into XHTML, because browsers and other applications designed around HTML choke on it. So, while the general plan for RDFa is to make it something that can be embedded into any XML dialect, the main effort has gone into making it easy to embed into XHTML. This gives it an important potential role in the grand plan for the Semantic Web, in which web page data is readable not only by human eyes but by automated processes that can aggregate data and associated metadata and then perform tasks that are much more sophisticated than those that typical screen scraping applications can do now. In fact, the relationship between RDFa metadata and existing content in web pages has been an important driver in most of the use cases behind RDFa's progress.
Plenty of software is already available to pull RDFa triples from XHTML documents and use them, which means that even though the specification isn't quite done, there's plenty to play with.
The "a" in "RDFa"
RDF often uses a subject, predicate, object combination called a triple to specify an attribute name/value pair about a particular resource. (That's "attribute" in the object-oriented sense, not the XML sense; for example, a triple could specify that the resource with ID http://example.com/artwork#fountain has an author value of "Richard Mutt.") To allow you to add metadata to a web page without affecting a browser's display of that page, RDFa uses some existing XHTML 1 attributes and a few new XHTML 2 attributes to store the subjects, predicates, and objects of these RDF triples. (The objects may also be existing PCDATA in your web pages, with subject and predicate attributes letting this text play a dual role of human-readable displayed content and machine-readable metadata.)
RDFa uses the existing XHTML 1 attributes href, content, rel, rev, and datatype, and it uses the new about, role, and property attributes from XHTML 2's Metainformation Attributes module. While the following chart of their use covers only a subset of ways to store RDFa metadata in an XHTML file, it's enough to get you pretty far. The RDFa Primer and RDF/A Syntax W3C documents (and Part 2 of this article) describe more sophisticated ways to add RDFa metadata to your XHTML documents.
There are two basic cases: triples that have a literal string as their object and triples that have a URI as their object. (When possible, it's better to have a URI as an object, because it lets the same value serve as the object of some triples and the subject of others. This makes it easier to connect triples and find new information through inferencing.)
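To illustrate why a URI object makes triples easier to connect, here is a minimal Python sketch. The triples, the people#rmutt URI, and the ex: prefixed names are all made up for illustration; they are not from any real vocabulary or from the RDFa documents.

```python
# Triples as (subject, predicate, object) tuples. Every URI and every
# prefixed name below is a hypothetical value chosen for this sketch.
TRIPLES = [
    ("http://example.com/artwork#fountain", "dc:creator",
     "http://example.com/people#rmutt"),
    ("http://example.com/people#rmutt", "ex:name", "Richard Mutt"),
    ("http://example.com/people#rmutt", "ex:homepage",
     "http://example.com/rmutt/"),
]


def describe(subject, triples):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def follow(subject, predicate, triples):
    """Follow one predicate from a subject, then describe each object,
    which only works when that object is also used as a subject."""
    found = {}
    for s, p, o in triples:
        if s == subject and p == predicate:
            found[o] = describe(o, triples)
    return found


creator_info = follow("http://example.com/artwork#fountain",
                      "dc:creator", TRIPLES)
print(creator_info)
```

Had the creator been stored as the literal string "Richard Mutt", describe() would find nothing to follow, because that string is never the subject of another triple.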
| | subject | predicate | object |
|---|---|---|---|
| literal string as object | about | property | content attribute or PCDATA |
| URI as object | about | rel | href |
The RDFa syntax document tells us that "it should be possible to represent a [triple] using only one XML element." Let's look at three examples:
<span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html" property="dc:title" content="Generating a Single Globally Unique ID"/>
<span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html" property="dc:title">Generating a Single Globally Unique ID</span>
<span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html" rel="dc:subject" href="http://www.snee.com/bobdc.blog/neat_tricks/"/>
These triples make the following statements (assuming that the dc prefix is assigned to the standard Dublin Core URI):
- The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."
- The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."
- The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core subject value of http://www.snee.com/bobdc.blog/neat_tricks/.
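The two rows of the chart can be sketched in a few lines of Python using only the standard library. This is a rough illustration of the mapping, not a conforming RDFa processor: the function name triple_from_element and the wrapping div are my own inventions, and the code handles only elements that carry their own about attribute.

```python
import xml.etree.ElementTree as ET

POST = ("http://www.snee.com/bobdc.blog/2006/12/"
        "generating_a_single_globally_u.html")

# The three one-element triples, wrapped in a div so they parse as one
# well-formed document (the div is an assumption of this sketch).
SAMPLE = f"""<div>
  <span about="{POST}" property="dc:title"
        content="Generating a Single Globally Unique ID"/>
  <span about="{POST}"
        property="dc:title">Generating a Single Globally Unique ID</span>
  <span about="{POST}" rel="dc:subject"
        href="http://www.snee.com/bobdc.blog/neat_tricks/"/>
</div>"""


def triple_from_element(elem):
    """Apply the two basic cases from the chart to a single element.

    Returns (subject, predicate, object, kind), or None when the
    element does not hold a complete triple on its own.
    """
    subject = elem.get("about")
    if subject is None:
        return None
    if elem.get("property") is not None:
        # Literal object: use the content attribute if present,
        # otherwise the element's text (PCDATA).
        obj = elem.get("content")
        if obj is None:
            obj = "".join(elem.itertext())
        return (subject, elem.get("property"), obj, "literal")
    if elem.get("rel") is not None and elem.get("href") is not None:
        # URI object: rel names the predicate, href holds the object.
        return (subject, elem.get("rel"), elem.get("href"), "uri")
    return None


triples = [t for t in map(triple_from_element, ET.fromstring(SAMPLE)) if t]
for t in triples:
    print(t)
```

The first two elements produce identical tuples, which matches the point that they state the same thing in two different ways.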
If the first two triples say the same thing, why would you prefer one over the other? Assuming that your document has its title in the document's content before you begin adding RDFa markup, adding the second span element above means adding a little less text to your document; you just wrap the existing title in the span start- and end-tags shown, a technique that fits in well with the Semantic Web vision of turning existing web content into machine-readable content. If your title was already part of your document, the content attribute value of the first triple would add redundant information to your document, and if your document's title changes, you would need to change it in two places. On the other hand, when adding information that is not already part of the content of your document (for example, workflow information or attribution rights about components of the document) the first span element above provides a good model.
The third span element above uses slightly different attributes to specify a triple that has a URI as an object value.
RDFa Elements
All three of the elements above are span elements. While these are popular for RDFa because you can insert them anywhere in the body of an HTML document, you can add the same RDFa attributes to any elements you like. link and meta elements are popular for inserting triples into the head of an HTML document. This is part of the beauty of RDFa: these elements have been used to add metadata to the head element for years (for example, to indicate the URL of a web page's CSS stylesheet), and now RDFa-aware software can pull useful metadata from them with only minor modifications to these elements. (Modifications are necessary because the triple pulled from an XHTML 1 link element that points to a CSS stylesheet would not be completely legal RDF, because a rel value of "stylesheet" is not a URI and therefore not a proper RDF predicate.)
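The stylesheet problem can be made concrete with a small Python sketch. It assumes predicates are written as prefix:name values and that the prefix-to-URI mapping is already known; expand_predicate is a hypothetical helper name, and the only real-world value is the Dublin Core elements namespace URI.

```python
# The Dublin Core elements namespace; the prefix table itself is an
# assumption of this sketch (a real page declares prefixes with xmlns).
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand_predicate(value, prefixes):
    """Expand a prefix:name value into a full predicate URI.

    Returns None when the value has no declared prefix, which is the
    situation with rel="stylesheet".
    """
    prefix, sep, local = value.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + local


print(expand_predicate("dc:subject", PREFIXES))
print(expand_predicate("stylesheet", PREFIXES))
```

The first call yields a full URI that can serve as an RDF predicate; the second yields None, so a cautious extractor would skip that link element instead of emitting an illegal triple.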
The a linking element is also popular for storing RDFa metadata, because it always expresses a relationship between one resource (the document where it's stored) and another (the resource it links to). The a element's rel attribute, which has actually been around as long as HTML itself despite its lack of use before Google's nofollow trick came along, adds information about the relationship, and this information serves as the predicate of a triple stored in an a element.
More Triples, Fewer Subjects
If a document has 100 triples of metadata, the triples probably won't have 100 different subjects. The subject of many will probably be the document itself, as they specify its title, author, and perhaps workflow data about how the document got into its current state. Another group of triples might describe an image's photographer, date taken, and rights re-use information.
Building on existing XHTML syntax, RDFa lets you build multiple triples from the same subject without cluttering up your document too much. An RDFa processor that finds no about attribute on an element takes its subject from the about attribute of the nearest ancestor element that has one. (As we'll see in Part 2 of this article, the presence of an id attribute can provide an alternative to this behavior.) For example, the following stores three metadata statements about the resource at http://www.snee.com/img/myfile.jpg, because although the three span elements have no about attribute, their parent img does:
<img src="http://www.snee.com/img/myfile.jpg" about="http://www.snee.com/img/myfile.jpg">
  <span property="dc:subject" content="Niagara Falls"/>
  <span property="dc:creator" content="Richard Mutt"/>
  <span property="dc:format" content="image/jpeg"/>
</img>
If the RDFa processor searches through all the ancestors of the element with a metadata statement's predicate and object, and doesn't find an about attribute, then the subject is an empty string. According to the RDF/A Syntax specification, this "effectively indicates the current document."
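The ancestor search described above might be sketched like this in Python. It is a simplified illustration under my own assumptions: it covers only property triples, ignores the id-based alternative, and the names subject_for and literal_triples are invented for the example. ElementTree has no parent pointers, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

DOC = """<body>
  <h1><span property="dc:title">My Story</span></h1>
  <img src="http://www.snee.com/img/myfile.jpg"
       about="http://www.snee.com/img/myfile.jpg">
    <span property="dc:creator" content="Richard Mutt"/>
  </img>
</body>"""


def subject_for(elem, parents):
    """Walk from elem up through its ancestors; the first about
    attribute found is the subject, otherwise the empty string
    (which stands for the current document)."""
    node = elem
    while node is not None:
        about = node.get("about")
        if about is not None:
            return about
        node = parents.get(node)
    return ""


def literal_triples(root):
    """Collect (subject, predicate, object) for every property attribute."""
    parents = {child: parent for parent in root.iter() for child in parent}
    triples = []
    for elem in root.iter():
        prop = elem.get("property")
        if prop is not None:
            obj = elem.get("content")
            if obj is None:
                obj = "".join(elem.itertext())
            triples.append((subject_for(elem, parents), prop, obj))
    return triples


result = literal_triples(ET.fromstring(DOC))
for triple in result:
    print(triple)
```

The title triple ends up with an empty-string subject because no ancestor of its span has an about attribute, while the creator triple inherits the image URL from its parent img.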
This is handy, because plenty of a document's metadata is typically about the document itself. For example, your document's main title could have this span element to indicate that its content is the Dublin Core title of the work (assuming that no ancestor of the sample h1 element has an about attribute):
<h1><span property="dc:title">My Story</span></h1>
Metadata about the document with no displayable content can be stored in the head element of the document:
<html>
  <head>
    <meta property="dc:date" content="2007-03-15T10:35:42"/>
Now that we've taken a scattershot tour of what RDFa can do and how it does it, its potential uses will be easier to appreciate if we step back and look at three categories of use cases:
- Inline metadata about document components
- Metadata about the containing document
- Out-of-line metadata
Inline Metadata About Components
This category of RDFa use fulfills the original dream that led to RDFa's creation: taking human-readable web page content and making it machine-readable. For example, the following sentence from the RDF/A Syntax document states that Mark Birbeck took a particular picture.
<p>This photo was taken by <span class="author" about="photo1.jpg" property="dc:creator">Mark Birbeck</span>.</p>
The span element and its attribute values let an RDFa-aware tool get the following triple out of this document, shown here in RDF/XML:
<rdf:Description rdf:about="file://C|/dat/xml/rdf/rdfa/photo1.jpg">
  <dc:creator>Mark Birbeck</dc:creator>
</rdf:Description>
Note the full path added to the photo1.jpg resource name by the RDFa extraction tool that I used. If I had an xml:base value declared, it would have used that to create the full URL.
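How a relative about value turns into a full URL can be sketched with Python's standard urljoin function. The document URL, the xml:base value, and the resolve_subject helper are assumptions made up for this example; the extraction tool I used may resolve names differently.

```python
from urllib.parse import urljoin

# Hypothetical location of the page being processed.
DOCUMENT_URI = "http://www.example.com/docs/page.html"


def resolve_subject(about, document_uri, xml_base=None):
    """Resolve an about value against xml:base when one is declared,
    otherwise against the location of the document itself."""
    base = urljoin(document_uri, xml_base) if xml_base else document_uri
    return urljoin(base, about)


print(resolve_subject("photo1.jpg", DOCUMENT_URI))
print(resolve_subject("photo1.jpg", DOCUMENT_URI,
                      xml_base="http://www.example.com/img/"))
print(resolve_subject("#recipe13941", DOCUMENT_URI))
```

With no xml:base, the image name resolves next to the document; with one, it resolves under the declared base. A fragment-only value such as #recipe13941 resolves to a location inside the document itself.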
In that example, the PCDATA string "Mark Birbeck" provided the object of the triple. Sometimes you might want to provide an alternative version of the displayed data, such as a normalized version of a date. In this case, a value in a content attribute will override it:
<p>Last revision of document: <span about="http://www.snee.com/docs/mydoc1.html" property="dc:date" content="20070315T15:32:00">March 15, 2007, at 3:32 PM</span></p>
The resulting triple uses the content value of the date:
<rdf:Description rdf:about="http://www.snee.com/docs/mydoc1.html">
  <dc:date>20070315T15:32:00</dc:date>
</rdf:Description>
Now, when searching aggregated metadata for a document last updated between 20070312 and 20070318, it will be easy to find the pointer to the document that says it was updated on "March 15, 2007, at 3:32 PM."
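A date-range search over aggregated triples could look something like the following Python sketch. It assumes every dc:date value begins with a compact YYYYMMDD date, as in the example above; the second document and its date are hypothetical, as is the updated_between function name.

```python
# Aggregated (subject, predicate, object) triples. The mydoc2.html
# entry is a made-up value added so the filter has something to exclude.
TRIPLES = [
    ("http://www.snee.com/docs/mydoc1.html", "dc:date", "20070315T15:32:00"),
    ("http://www.snee.com/docs/mydoc2.html", "dc:date", "20070402T09:00:00"),
]


def updated_between(triples, start, end):
    """Return subjects whose dc:date falls between start and end,
    inclusive. Dates are compared as YYYYMMDD strings, which sort in
    calendar order; this assumes the compact format used above."""
    return [s for s, p, o in triples
            if p == "dc:date" and start <= o[:8] <= end]


matches = updated_between(TRIPLES, "20070312", "20070318")
print(matches)
```

The same comparison against the displayed text "March 15, 2007, at 3:32 PM" would be meaningless, which is why the normalized value in the content attribute matters.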
Metadata About the Containing Document
Inline metadata about document components was the original use case for RDFa, but its elegant design makes it simple to use for other kinds of metadata, such as metadata about the containing document. While some metadata, such as a document's title and author, is often redundant with existing data in the document, and can be marked up inline as with the examples above, document metadata such as production workflow information can be easily stored in the document header. When no subject is specified, an RDFa processor assumes an empty string as the subject, which represents the document itself:
<html xmlns:fm="http://www.foomagazine.com/ns/prod/">
  <head>
    <title>Is Black the New Black?</title>
    <meta property="fm:newsstandDate" content="2006-04-03"/>
    <meta property="fm:copyEditor" content="RSelavy"/>
    <meta property="fm:copyEdited" content="2006-03-28T10:33:00"/>
  </head>
  <body>
    <!-- body of page... -->
An RDFa extractor gets the following RDF/XML out of this:
<rdf:Description rdf:about="">
  <fm:newsstandDate>2006-04-03</fm:newsstandDate>
  <fm:copyEditor>RSelavy</fm:copyEditor>
  <fm:copyEdited>2006-03-28T10:33:00</fm:copyEdited>
</rdf:Description>
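As a rough idea of what such an extractor does with the header, the Python sketch below collects the namespace declarations while parsing and uses them to expand each meta element's property value, with an empty string as the subject. It is only an approximation under my own assumptions: head_metadata is a hypothetical name, the sample page is closed off so that it parses, and a property with an undeclared prefix would raise a KeyError.

```python
import io
import xml.etree.ElementTree as ET

PAGE = b"""<html xmlns:fm="http://www.foomagazine.com/ns/prod/">
  <head>
    <title>Is Black the New Black?</title>
    <meta property="fm:newsstandDate" content="2006-04-03"/>
    <meta property="fm:copyEditor" content="RSelavy"/>
    <meta property="fm:copyEdited" content="2006-03-28T10:33:00"/>
  </head>
  <body/>
</html>"""


def head_metadata(xhtml_bytes):
    """Return (subject, predicate URI, object) for each meta element,
    using "" as the subject to stand for the document itself."""
    prefixes = {}
    triples = []
    events = ET.iterparse(io.BytesIO(xhtml_bytes),
                          events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            # payload is a (prefix, namespace URI) pair.
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == "meta" and payload.get("property"):
            prefix, _, local = payload.get("property").partition(":")
            triples.append(("", prefixes[prefix] + local,
                            payload.get("content")))
    return triples


metadata = head_metadata(PAGE)
for triple in metadata:
    print(triple)
```

Each predicate comes out as a full URI in the fm namespace, which is the information the RDF/XML above carries through its fm: element names.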
Out-of-Line Metadata About Components
Nesting of meta elements, which is a new feature of XHTML 2, lets you specify a single subject for multiple triples. When you do this in a web page's head element, you can specify specific components of the document as the subject, making it possible to create metadata in your header for individual portions of your web page. For example, a document with multiple recipes in it can include production metadata in the head element about a specific recipe (note that the following sample also uses XHTML 2's new section and h elements):
<html xmlns:fm="http://www.foomagazine.com/ns/prod/">
  <head>
    <meta about="#recipe13941">
      <meta property="fm:ComponentID">XZ3214</meta>
      <meta property="fm:ComponentType">Recipe</meta>
      <meta property="fm:RecipeID">r003423</meta>
    </meta>
  </head>
  <body>
    <h>Add Some Tex Mex Sizzle to Your Kid's Lunch</h>
    <section id="recipe22143">
      <h>Amigo Corn Dogs</h>
      <!-- li, p, etc. -->
    </section>
    <section id="recipe13941">
      <h>EZ Bean Tacos</h>
      <!-- li, p, etc. -->
    </section>
    <!-- more content -->
  </body>
</html>
The extracted triples know that this metadata only refers to the element in the document with an ID value of recipe13941:
<rdf:Description rdf:about="file://C|/dat/xml/rdf/rdfa/test5.html#recipe13941">
  <fm:ComponentType>Recipe</fm:ComponentType>
  <fm:RecipeID>r003423</fm:RecipeID>
  <fm:ComponentID>XZ3214</fm:ComponentID>
</rdf:Description>
Because RDFa lets you store a complete triple in an HTML document, you can even store metadata in one HTML document about resources (or portions of resources, like the recipe above, as long as they have an identifier) outside of that document.
Getting Those Triples
Several free tools are already available to extract RDF/XML from a document with RDFa so that you can then feed your triples to semantic web tools or to RDF-aware metadata management tools. Fabien Gandon of INRIA has written an XSLT stylesheet to do this, and Elias Torres has written a web service that only needs the URL of the document with the RDFa triples. Elias implemented this by adding RDFa support to RDFLib (of which I'm a long-time fan). RDFLib can extract embedded RDFa triples and load them into a triplestore in memory or on disk, and then you're off and running for developing an application around your extracted data. Among commercial tools, TopQuadrant's TopBraid Composer includes RDFa support.
The fact that reading RDFa is so easy to implement—you only need a program that can scan a document for certain combinations of a few elements and attributes—means that if no existing RDFa readers can do what you want, you can implement it yourself in any language that provides a reasonable XML parser.
Getting More Out of RDFa
We've seen that RDFa lets you add triples of useful metadata to your XHTML with simple, straightforward markup. It also offers features that let you do even more interesting things with it; in Part 2 of this article, we'll look at how to assign data types to your RDFa values, reification (how to add metadata about your metadata), specifying RDFa metadata about elements with an id attribute, compact URIs, and platforms that make it easier to automate the creation of RDFa metadata. Meanwhile, try adding some RDFa to some documents, play with the RDFa processors mentioned here to extract the metadata, and let me know what you think.