Make Your XML RDF-Friendly
October 30, 2002
Suppose you're designing an XML application or maybe just writing a DTD or schema. You've followed various best practices about element and attribute names, when to use elements versus attributes, and other design issues, because you want your XML to be useful in the widest variety of situations.
As interest in RDF and RDF application development grow, there's an increasing payoff in keeping RDF concerns in mind along with the other best practices as you design document types. Your documents store information, and small tweaks to their structure can allow an RDF processor to see that information as subject-predicate-object triples, which it can make good use of. (For an introduction to RDF, see Tim Bray's article What is RDF?) Making your documents more "RDF-friendly" -- that is, more easily digestible by RDF applications -- broadens the range of applications that can use your documents, thereby increasing their value.
A lot of RDF/XML documents look like they were designed purely for RDF applications, but that's not always the case. The frequent verbosity of RDF/XML, which often intimidates RDF beginners, is a by-product of the flexibility that makes RDF easy to incorporate into your existing XML. By observing eight guidelines when designing a DTD or schema, you can use this flexibility to help your documents work with RDF applications as well as non-RDF applications. Some of the guidelines are easy, while some involve making choices based on trade-offs. But knowing what the issues are gives you a better perspective on the best ways to model your data.
1. Make sure that every element comes from a specific namespace.
This doesn't mean that all your elements need a namespace prefix. For convenience, many documents declare the most frequently used namespace as the default one so that elements from that namespace need no prefix. For example, the article, body, title, and para elements in the following belong to the http://www.snee.com/ns/dummy namespace because the article element's first xmlns attribute declares that as the default namespace. None of those elements need a namespace prefix, and an RDF processor will have no problem with them. (The RDF namespace, http://www.w3.org/1999/02/22-rdf-syntax-ns#, must obviously be declared if an RDF parser is going to find the RDF elements and know what each is for.)
<article xmlns:dc="http://purl.org/dc/elements/1.1/"
         rdf:ID="a1003"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.snee.com/ns/dummy">
  <rdf:RDF>
    <rdf:Description rdf:about="#a1003">
      <dc:creator>Herman Melville</dc:creator>
      <dc:date>1851</dc:date>
    </rdf:Description>
  </rdf:RDF>
  <body>
    <title>Moby Dick</title>
    <para>Call me Ishmael.</para>
    <para>Just <emph>don't</emph> call me late for supper.</para>
  </body>
</article>
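One quick way to check a document against this guideline is to list any elements that ended up outside of every namespace. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def elements_without_namespace(xml_text):
    """Return the names of elements that belong to no namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree expands a namespaced name to "{namespace-uri}local-name",
    # so a tag that does not start with "{" has no namespace at all.
    return [el.tag for el in root.iter()
            if isinstance(el.tag, str) and not el.tag.startswith("{")]

# Hypothetical sample documents: one with a default namespace, one without.
GOOD = ('<article xmlns="http://www.snee.com/ns/dummy">'
        '<body><title>Moby Dick</title></body></article>')
BAD = '<article><body><title>Moby Dick</title></body></article>'

print(elements_without_namespace(GOOD))
print(elements_without_namespace(BAD))
```

An empty list means that every element comes from some namespace, whether through a prefix or a default namespace declaration.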
2. Use rdf:ID attributes instead of ID attributes.
When you want an RDF processor to know a property of something in a document -- for example, that the article element in the example above has a dc:creator value of "Herman Melville" -- you need a way to identify the subject that has the property. XML DTDs let you declare that a particular attribute is used as an ID value, but RDF doesn't care about DTDs. The only way to be sure that an RDF processor can find the thing you're referring to is to give it a unique value in an rdf:ID attribute.
You're certainly not limited to using the rdf:ID value in RDF applications. A unique ID value is a unique ID value, and useful in all kinds of applications. In fact, if you declare this attribute in a DTD as having a type of ID, you'll get the benefit of both RDF applications and XML 1.0 applications treating rdf:ID as an ID value that is unique within each document.
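If no DTD is around to enforce that uniqueness, a few lines of script can check it. This is only a sketch, again assuming Python's standard library; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def duplicate_rdf_ids(xml_text):
    """Return rdf:ID values that appear more than once in the document."""
    root = ET.fromstring(xml_text)
    # Attributes in a namespace are stored as "{namespace-uri}local-name".
    ids = [el.get("{%s}ID" % RDF_NS) for el in root.iter()]
    counts = Counter(i for i in ids if i is not None)
    return sorted(i for i, n in counts.items() if n > 1)

# Hypothetical sample: the value "c1" is used twice, "c2" only once.
SAMPLE = ('<doc xmlns="http://www.snee.com/ns/dummy" '
          'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
          '<p rdf:ID="c1"/><p rdf:ID="c2"/><p rdf:ID="c1"/></doc>')

print(duplicate_rdf_ids(SAMPLE))
```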
3. When describing a resource that has an existing URI, put the URI in an rdf:about attribute.
While rdf:ID identifies a resource in your document, which you can then describe with an RDF statement, rdf:about lets you create an RDF statement about anything that can be referenced with a URI, whether it's in your document or not. The name of the element with the rdf:about attribute identifies the type of the subject. For example, the following tells us this fact "about" Bridget Fonda: that her father is Peter Fonda. The rdf:about attribute's presence in an Entertainer element tells us that Bridget Fonda is a resource of the type "Entertainer."
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:imdb="http://us.imdb.com/Name?" xmlns="http://www.cyc.com/2002/04/08/cyc.daml#" xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#"> <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget"> <gc:father> <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/> </gc:father> </Entertainer> </rdf:RDF>
4. When referencing something by its URI, put the URI in an rdf:resource attribute in an empty element.
In our first example, the creator of the article Moby Dick -- or, more correctly, the creator of the work identified as "#a1003" -- is named with the string "Herman Melville." If, instead of a string, it identified the author using a URI in an rdf:resource attribute, the RDF assertion about who created resource a1003 would have more value, because it could then link to other RDF statements that use the same URI.
For example, no RDF statement that tells you that Herman Melville was born in New York City would refer to the author using the string "Herman Melville," because an RDF statement's subject must be a URI. Instead, it might say that the subject http://www.online-literature.com/melville has the property bornIn with a value of "New York City." An inference engine could look at that assertion and the following revision of the first RDF statement from the first example above, put the two together, and tell you that the creator of a1003 was born in New York City. (The two statements connect only if they use exactly the same URI, character for character.)
<rdf:Description rdf:about="#a1003">
  <dc:creator rdf:resource="http://www.online-literature.com/melville"/>
</rdf:Description>
While this element with the rdf:resource attribute isn't absolutely required to be empty, any content that it has must follow certain rules, so it's simplest to make it an empty element whose rdf:resource attribute names a URI as the value of the property named by the element name -- in this case, dc:creator.
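To make the "put the two together" step concrete, here is a rough sketch of the kind of join an inference engine might perform, with triples held as plain Python tuples. The bornIn property URI is a placeholder invented for this sketch, and real inference engines do far more than this.

```python
MELVILLE = "http://www.online-literature.com/melville"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
BORN_IN = "http://example.org/terms/bornIn"  # hypothetical property URI

def creator_birthplaces(triples):
    """Join creator and bornIn statements on the shared author URI."""
    born = {s: o for s, p, o in triples if p == BORN_IN}
    return {s: born[o] for s, p, o in triples
            if p == DC_CREATOR and o in born}

# The creator is named with a URI, so the two statements connect.
with_uri = [
    ("#a1003", DC_CREATOR, MELVILLE),
    (MELVILLE, BORN_IN, "New York City"),
]

# The creator is named with a string, so there is nothing to join on.
with_string = [
    ("#a1003", DC_CREATOR, "Herman Melville"),
    (MELVILLE, BORN_IN, "New York City"),
]

print(creator_birthplaces(with_uri))
print(creator_birthplaces(with_string))
```

The first call connects a1003 to New York City; the second returns an empty result, because the string "Herman Melville" never appears as the subject of another statement.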
5. If existing ontologies cover any of your element names, use those instead of making up your own URIs.
Most of the power of RDF comes from the network effect of combining RDF triples that reference the same resources. If one set of triples says something about a particular resource and another set says more about the same resource, they can be combined, making it a more valuable collection. For example, guideline 4 above described two RDF statements that could be linked this way; one used the URI http://www.online-literature.com/melville to represent Herman Melville as the creator of article a1003, and the other used the same URI to show where the author was born.
To be honest, http://www.online-literature.com/melville was just the result of some brief web searching. The odds that two different people creating RDF about Melville will both use this URI are pretty small. It's not really an ontology name, but just a URL for a brief biography of Melville at a literary dot-com.
But what is an ontology? In software development, as distinct from its meaning in philosophy, it generally means a set of terms with defined relationships. There are plenty of real ontologies out there, but in a pinch, you can use a recognized URL for a well-known web page that identifies your resource -- as we saw above, any URI is better than a simple string.
The better known an ontology is, the more likely others are to use it, and the more useful your RDF statements will be when combined with those others. For example, the Dublin Core ontology used for the dc:creator and dc:date elements in the "Moby Dick" example is one of the most popular, widely used ontologies.
The DAML Ontology Library is a good place to start looking for ontologies. It's where I found the GEDCOM and CYC ontologies used in the example about the Fondas. The people who created the Internet Movie Database never considered their work to be an ontology, but because it lets you refer to specific actors with URIs, it passes the first test for use in RDF statements.
6. Be careful about the use of container elements.
The good news is that a given resource can be both the object of one or more RDF statements and the subject of others. For example, the following shows that Bridget Fonda's father is Peter Fonda and that Peter Fonda's father is Henry Fonda. Peter is the object of the statement made by the outer triple and the subject of the inner one.
<Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
  <gc:father>
    <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter">
      <gc:father>
        <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Henry"/>
      </gc:father>
    </Entertainer>
  </gc:father>
</Entertainer>
There's no limit to the level of nesting, as long as the elements in the line of descendants alternate between resources and predicates: counting the outermost resource as level 0, even-numbered levels are resources and odd-numbered levels are predicates. This alternating relationship is known in RDF circles as striping.
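The alternation is easier to see in a toy walker that turns striped XML into triples. This is a deliberately incomplete sketch, not a real RDF parser: it assumes that every resource is identified with rdf:about and that every element is in a namespace, and it ignores rdf:ID, rdf:parseType, anonymous resources, and relative URI resolution. The function names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def name_to_uri(tag):
    """Turn ElementTree's '{namespace}local' tag into namespace + local."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local

def walk_resource(node, triples):
    """Handle one resource stripe, collect its triples, return its URI."""
    subject = node.get("{%s}about" % RDF)
    if node.tag != "{%s}Description" % RDF:
        # A typed element such as Entertainer also states the rdf:type.
        triples.append((subject, RDF + "type", name_to_uri(node.tag)))
    for prop in node:                      # next stripe: predicates
        predicate = name_to_uri(prop.tag)
        if len(prop) == 1:                 # next stripe: a nested resource
            obj = walk_resource(prop[0], triples)
        elif len(prop) == 0:               # URI reference or literal text
            obj = prop.get("{%s}resource" % RDF) or (prop.text or "")
        else:
            raise ValueError("striping broken inside " + prop.tag)
        triples.append((subject, predicate, obj))
    return subject

DOC = """<Entertainer xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#"
    rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
  <gc:father>
    <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
  </gc:father>
</Entertainer>"""

triples = []
walk_resource(ET.fromstring(DOC), triples)
for triple in triples:
    print(triple)
```

Each call to walk_resource handles a resource level, and each pass through its loop handles a predicate level; the recursion is the striping.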
The bad news is that many common uses of container elements throw this striping pattern off. The following example, which omits the document element and namespace declarations, is otherwise perfectly good RDF until the attachments element.
<email rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments>
    <attachment>data\sample1.txt</attachment><!-- RDF parser chokes here -->
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
Up to that point, an RDF parser knows that the resource with the ID "msg001" has a from value of "bram@snee.com", a to value of "bela@snee.com", and so on, but what is the attachments value? If its contents were an XML element, it would have to be just one element, with an identifier that named it as a specific resource. Having more than one element -- which is the whole point of the wrapper, because a given e-mail message may have more than one attachment -- is something that RDF can't handle when represented this way. It thinks that the attachments property of the email resource has two properties of its own (the two attachment elements). Properties can't have properties, but resources can.
There are two obvious options for giving this email element the resource-predicate-resource-predicate descendant structure that RDF expects: either remove a layer of containment or add one. Removing the attachments container would make each attachment element a sibling of from, to, and the email element's other children, and email wouldn't have any grandchildren:
<email rdf:about="msg002">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachment>data\sample1.txt</attachment>
  <attachment>data\sample2.txt</attachment>
  <cc>frank@snee.com</cc>
</email>
The problem with this is that you may have a good reason to use that container. For example, when processing your XML e-mail messages using an event-based model such as the SAX API, maybe there's something specific you want to do when you reach the end of the attachment list. How do you know you've reached the end of that list when processing this version of the email element? When you reach the cc element? What if cc is optional? Nothing says "end of attachment list" like an </attachments>.
If you must have a container around your attachment elements, and want to make it proper RDF, one solution is to use one of RDF's specialized container elements. In this case, you can wrap an rdf:Bag element around the attachment elements in the original e-mail example, inside of the attachments element. (In keeping with guideline 2, the attachments element has been given an rdf:ID attribute to make it easier for a parser to refer to it.) The rdf:Bag element describes a container whose contents aren't ordered in any meaningful way. The example's rdf:Bag element has an rdf:ID value of "i2", telling an RDF parser that in addition to having a from property with a value of "bram@snee.com", as well as the other properties we saw, the resource identified as "msg003" also has an attachments property with resource #i2 as its value. This i2 resource has a type of rdf:Bag, which RDF parsers understand to be a container of unordered content. The i2 resource has one attachment with a value of "data\sample1.txt" and another with a value of "data\sample2.txt". And, unlike the first e-mail example above, this one causes no error message in the RDF parser.
<email rdf:about="msg003">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:ID="i1">
    <rdf:Bag rdf:ID="i2">
      <attachment>data\sample1.txt</attachment>
      <attachment>data\sample2.txt</attachment>
    </rdf:Bag>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
In addition to the rdf:Bag container for unordered content, RDF also offers the rdf:Seq element for ordered (or "sequenced") content and the less popular rdf:Alt container to show available alternatives to a specified value.
There is actually a third, even simpler option for converting this email element's structure into something that won't confuse the RDF parser: we can explicitly tell the parser that the attachments property of the email element is itself a resource by using the rdf:parseType attribute:
<email rdf:about="msg004">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:parseType="Resource">
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
Think about the original problem: the attachments property of the email element couldn't have its own properties, which is why the RDF parser choked at the first attachment element -- it thought that the document was trying to name a property of a property, which is illegal. Now that the attachments element is explicitly named as a resource, it can have properties, so the RDF parser will have no problem with the two attachment children of this element.
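A simple script can flag this kind of container problem before a real RDF parser ever sees the document. The sketch below checks only one rule: a property element with more than one child element must be marked with rdf:parseType="Resource". It is an approximation for illustration rather than a full validity check, the function names are invented, and the sample messages are trimmed-down versions of the e-mail examples with namespace declarations added.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PARSE_TYPE = "{%s}parseType" % RDF

def local_name(el):
    """Strip ElementTree's '{namespace}' prefix from an element's tag."""
    return el.tag.split("}")[-1]

def striping_breaks(resource, found=None):
    """List property elements holding several children without parseType."""
    found = [] if found is None else found
    for prop in resource:
        if prop.get(PARSE_TYPE) == "Resource":
            # Declared as a resource: its children are its own properties.
            striping_breaks(prop, found)
        elif len(prop) > 1:
            found.append(local_name(prop))
        elif len(prop) == 1:
            # Exactly one child: treat it as the next resource stripe.
            striping_breaks(prop[0], found)
    return found

HEAD = ('<email xmlns="http://www.snee.com/ns/dummy#" '
        'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'rdf:about="msg001"><from>bram@snee.com</from>')
BODY = ('<attachment>sample1.txt</attachment>'
        '<attachment>sample2.txt</attachment></attachments></email>')

BROKEN = HEAD + '<attachments>' + BODY
FIXED = HEAD + '<attachments rdf:parseType="Resource">' + BODY

print(striping_breaks(ET.fromstring(BROKEN)))
print(striping_breaks(ET.fromstring(FIXED)))
```

The first message reports the attachments element; the second reports nothing, because the parseType attribute restores the striping.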
7. Eschew mixed content.
Mixed content presents a more advanced version of the problem caused by containers that throw off the striping pattern. Once you see that the resources described in RDF statements must either be siblings of each other or skip an odd number of generations when descendants of each other, and that predicates must be descendants found at the levels between those, it's clear how the typically irregular patterns of mixed content can throw off RDF striping. Mixed content can also put strings of PCDATA in odd places -- or at least in places that seem odd if you're looking for regular recurring patterns.
This doesn't mean that you can't have RDF in a document with mixed content. The "Moby Dick" example at the beginning of this article has mixed content, and the rdf:RDF element showing publishing metadata such as the work's creator and availability date is kept separately in an RDF header section.
RDF statements in a mixed content document can even use elements within the mixed content as resources. The following example has an rdf:RDF header element that contains a made-up imgLink element linking the character in-line element to an image on a remote server.
<article xmlns="http://www.snee.com/ns/dummy#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <imgLink rdf:about="#c1">
      <image rdf:resource="http://www.keele.ac.uk/depts/as/Literature/Moby-Dick/images/Moby.gif"/>
    </imgLink>
  </rdf:RDF>
  <body>
    <title>Moby Dick</title>
    <para>Call me <character rdf:ID="c1">Ishmael</character>.</para>
    <para>Just don't call me late for supper.</para>
  </body>
</article>
An RDF parser will find the statement linking the character element to the Moby.gif picture and will have no problem with the mixed content along the way.
8. Find an RDF parser to check that your RDF statements are okay.
When learning any new language, you want to be sure that what you think you're saying is really what you're saying. Most RDF parsers make this easy by outputting a subject-predicate-object triple for each RDF statement they find. For example, the W3C's RDF Validation Service turns this document
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:imdb="http://us.imdb.com/Name?"
         xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
         xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#">
  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
    <gc:father>
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
    </gc:father>
  </Entertainer>
</rdf:RDF>
into this (carriage returns added):
<http://us.imdb.com/Name?Fonda,%20Peter> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
<http://us.imdb.com/Name?Fonda,%20Bridget> <http://www.daml.org/2001/01/gedcom/gedcom#father> <http://us.imdb.com/Name?Fonda,%20Peter> .
<http://us.imdb.com/Name?Fonda,%20Bridget> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
Or, in English, using only the last part of each URI:
- Peter Fonda has a type value of Entertainer.
- Bridget Fonda has a father value of Peter Fonda.
- Bridget Fonda has a type value of Entertainer.
In general, using a utility to convert RDF to triples helps you to understand exactly what is being said if you read the subject-predicate-object triple "X, Y, Z" as "X has a Y value of Z." All the natural language descriptions of RDF statements in this article were checked this way.
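That reading can be automated for quick spot checks. The following sketch assumes triple lines in the form shown above, where subject, predicate, and object are all URIs in angle brackets; it does not handle literal values, and the function names are invented for this sketch.

```python
import re
from urllib.parse import unquote

def parse_triple(line):
    """Pull the three <...> URIs out of one line of triple output."""
    subject, predicate, obj = re.findall(r"<([^>]*)>", line)
    return subject, predicate, obj

def short_name(uri):
    """Keep the text after the last '#', '?' or '/' and decode %20 etc."""
    return unquote(re.split(r"[#?/]", uri)[-1])

def in_english(line):
    """Read the triple 'X, Y, Z' as 'X has a Y value of Z.'"""
    s, p, o = (short_name(u) for u in parse_triple(line))
    return "%s has a %s value of %s." % (s, p, o)

LINE = ("<http://us.imdb.com/Name?Fonda,%20Bridget> "
        "<http://www.daml.org/2001/01/gedcom/gedcom#father> "
        "<http://us.imdb.com/Name?Fonda,%20Peter> .")

print(in_english(LINE))
```

For the line above, this prints "Fonda, Bridget has a father value of Fonda, Peter."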
As RDF tools become more widely available and easier to use, you'll have more resources available for improved metadata management of your own data. Even if you're not ready to build serious RDF applications just yet, making more of your own data RDF-friendly will do more than widen the range of applications that can use it. For many people, the kinds of things that RDF is good at become clearer when they see it used with data that is important to their business or to them personally, such as an address or appointment file. Using RDF tools to play with your own data will help you better understand the strong points of RDF and, perhaps, even the strong points of your own data.