RPV: Triples Made Plain
November 20, 2002
For as long as RDF has existed, people have been trying to fix it. My predecessor in this spot, Leigh Dodds, wrote a column in the summer of 2000 ("Instant RDF") in which he discussed efforts to respond to complaints about RDF's complexity. At that relatively early point, the two dominant approaches to relating XML and RDF, as Dodds explained, were that RDF should be embedded in XML documents or that RDF should be extracted from, but not embedded in, XML documents.
In last week's column, I claimed, following conversations in the XML development community, that RDF was good for representing "mundane metadata", to use Bob DuCharme's phrase, and as an alternative to RDBMS storage; that is, as a kind of unstructured or semistructured data storage model. My goal was to route around complaints about RDF's XML serialization by suggesting ways in which it didn't matter (not much, anyway) what that serialization looked like, since the goal was to avoid writing it by hand or reading it, as it were, by eye.
I suggested using a programmatic triple or RDF store from a host programming language, many of which have interesting RDF triplestore implementations (for example, Redland works with several languages). By means of a triplestore API one makes 3-tuple assertions and combines them into graphs, using ontologies (of various degrees of formality and publicity) of terms. Each assertion has a resource and a predicate, both of which are named by URIs, and a value, which may be named by a URI or asserted literally.
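To make the idea of asserting 3-tuples through an API concrete, here is a minimal sketch in Python. It is a toy in-memory store written for illustration only: the names `TripleStore`, `Literal`, `add`, and `match` are my own assumptions and are not the API of Redland or of any other real triplestore.

```python
from collections import namedtuple

# Hypothetical types for illustration; not taken from any real RDF library.
# A literal value is wrapped so it can be told apart from a URI-named value.
Literal = namedtuple("Literal", "text")
Triple = namedtuple("Triple", "subject predicate value")


class TripleStore:
    """A toy in-memory triple store (a set of 3-tuples)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, value):
        """Assert one triple; repeated assertions collapse into one."""
        self._triples.add(Triple(subject, predicate, value))

    def match(self, subject=None, predicate=None, value=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        return [
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (value is None or t.value == value)
        ]


store = TripleStore()
# A value named by a URI.
store.add("http://xml.com/", "http://foo.com/#siteType", "http://foo.com/#xml")
# A value asserted literally.
store.add("http://monkeyfist.com/", "http://foo.com/#Title",
          Literal("Our Monkey, Your Fist"))
```

Nothing in this sketch ever touches an XML serialization, which is the point: the graph is built and queried entirely from the host language.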
In this scenario some of the constraints, but also most of the maturity, performance, and wider tool support, of SQL and RDBMSes are avoided in return for a considerable grant of flexibility and extensibility. And if the XML serialization of these graphs of triples, which might be used for exchanging graphs or simply for on-disk storage, was terribly ugly or hard for most people to write and read, who cares? No one is being asked to do so. Except for the people who develop the triplestore implementations, but they're RDF theoretic model wireheads anyway. If you're troubled by the idea that some things are simply to be ignored by some people, think of an RDBMS like MySQL, which is widely and successfully used by thousands of developers, most of whom haven't the slightest idea about the technical details of, say, ISAM table storage. They don't know; don't want, care, or need to know. Perhaps RDF's XML serialization is like that?
In other words, if you don't like or understand or prefer RDF's XML serialization, find a way to avoid dealing with it directly. Using an RDF triplestore from a high-level language is one such way, and it retains some, perhaps all, of the benefits of RDF's data model. So, my argument is a more focused variant of the suggestion Shelley Powers has been making repeatedly on XML-DEV lately: if you don't like or understand or prefer RDF, just don't use it. This seems fair enough.
Most recent discussion of RDF, which has bubbled over the bounds of XML-DEV and moved out into the broader confines of the Web development community, has been by turns absurd and sublime. From foundational debates about whether RDF is complex, or fights over how to characterize its complexity, to awfully redundant discussions about whether its XML serialization is all that user-unfriendly, to meta-debates in which various sides jockey for position to see which side can be described as unfair or "politically correct" (whatever that could possibly mean in this context) or dismissive or narrow-minded or high-handed -- and on and on.
Yet the debate has also been productive at times; one result is Tim Bray's RPV proposal.
Resources, Properties, and Values
Bray says his RPV proposal "is an XML-based language for expressing RDF assertions ... designed to be entirely unambiguous and highly human-readable." That two-part design goal is worth spending some time with insofar as it's emblematic of a good deal of the underlying debate over RDF. To say that an XML language is or should be "entirely unambiguous" and "highly human-readable" is to say that it should be as easily digestible by machines as by humans. It's that tension which runs all the way from XML to RDF.
Further, Bray suggests that RDF has failed to gain traction because of this tension: his RPV proposal "is motivated by a belief that RDF's problems are rooted at least in part in its syntax." He elaborates on this point by saying, first, that RDF's XML serialization is "scrambled and arcane," preventing people from easily reading or writing it; second, that the XML serialization uses qualified names in a way that's not user-friendly and is in some conflict with the TAG's idea that significant resources be identified by URI; third, that metadata folks don't seem to have a general problem thinking of things in terms of RDF's 3-tuples; fourth, that some alternatives to RDF-XML, like n3, suffer because, as non-XML, they can't get the network effect of ubiquitous XML support; and, fifth, that the idea of embedding RDF in XML languages, which seemed in the summer of 2000, both to Leigh Dodds and to much of the rest of the XML development community, like a viable approach, "has failed resoundingly in the marketplace."
To put it more plainly: RDF needs a new XML serialization as the existing one is overly complex, and it should be possible to do better. Bray's RPV proposal has at least one immediate virtue: simplicity. It contains only two elements, R and PV -- for resources and property-value pairs, respectively. Which means a simple triple in RPV can be as straightforward as
<R r="http://xml.com/">
  <PV p="http://foo.com/#siteType" v="http://foo.com/#xml" />
</R>
The resource identified by the R element has the property identified by the URI in PV's p attribute, which has the value identified by the URI in its v attribute. Since there can be any number of PVs within an R, one can easily add other properties to the resource by adding other PV elements. As the object of a property can also be a literal, RPV says that when the v attribute is missing from a PV, the value of the property being predicated of the resource is the content of the PV element:
<R r="http://monkeyfist.com/">
  <PV p="http://foo.com/#Title">Our Monkey, Your Fist</PV>
</R>
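The rule for values (a v attribute for a value named by a URI, element content for a literal) is simple enough to sketch as a small serializer. The following Python is a rough illustration using the standard library's ElementTree; the function name `to_rpv` and the `(predicate, value, is_literal)` tuple layout are assumptions of mine, not part of Bray's proposal, and namespaces are left out just as they are in the examples here.

```python
import xml.etree.ElementTree as ET


def to_rpv(resource, properties):
    """Serialize one resource and its properties as an RPV R element.

    `properties` is a list of (predicate_uri, value, is_literal) tuples;
    this layout is a hypothetical convenience for the sketch.
    """
    r = ET.Element("R", {"r": resource})
    for predicate, value, is_literal in properties:
        pv = ET.SubElement(r, "PV", {"p": predicate})
        if is_literal:
            pv.text = value      # literal: becomes the content of PV, no v
        else:
            pv.set("v", value)   # value named by a URI: goes in the v attribute
    return ET.tostring(r, encoding="unicode")


# Illustrative input combining property URIs used in this column's examples.
xml_text = to_rpv("http://monkeyfist.com/", [
    ("http://foo.com/#Title", "Our Monkey, Your Fist", True),
    ("http://foo.com/#Type", "http://foo.com/#Resource", False),
])
print(xml_text)
```

Adding another property to the resource is just another tuple in the list, which mirrors how one adds another PV inside an R.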
An attributeless R means that the element itself is (or represents) the resource being described:
<R>
  <PV p="http://foo.com/#Type" v="http://foo.com/#Resource" />
</R>
A resource element with an id attribute, the value of which must be unique within the XML document, can be referred to at other points in the document:
<R r="http://monkeyfist.com/" id="r1">
  <PV p="http://foo.com/#Publisher">Monkeyfist Collective</PV>
</R>
<R r="#r1">
  <PV p="http://foo.com/#Subject">politics</PV>
</R>
That's about all there is to RPV (save for namespaces, which I've omitted above, and some bits about relative URIs and reification). RDF-RPV is clear and simple, easy to write and read; more importantly, it makes the triples plainly visible. The murkiness of the triples is one complaint people often make about RDF-XML.
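One way to see how visible the triples are is to note how little code it takes to pull them back out. The Python below is only a sketch, built on several assumptions of mine: the wrapper element `doc` is hypothetical (the fragments need a single root to parse), namespaces are ignored, an attributeless R is given a made-up placeholder name, and a reference such as r="#r1" is taken to stand for the resource described by the R carrying that id. Relative URIs and reification are not handled.

```python
import xml.etree.ElementTree as ET

# The examples from this column, wrapped in a hypothetical <doc> root.
RPV_DOC = """
<doc>
  <R r="http://monkeyfist.com/" id="r1">
    <PV p="http://foo.com/#Publisher">Monkeyfist Collective</PV>
  </R>
  <R r="#r1">
    <PV p="http://foo.com/#Subject">politics</PV>
  </R>
  <R r="http://xml.com/">
    <PV p="http://foo.com/#siteType" v="http://foo.com/#xml" />
  </R>
</doc>
"""


def rpv_triples(xml_text):
    """Return (subject, predicate, value, kind) tuples from RPV markup."""
    root = ET.fromstring(xml_text)
    resources = list(root.iter("R"))
    # Name every R: its r attribute, or a placeholder if it has none.
    names = [r.get("r") or "_:r%d" % n for n, r in enumerate(resources)]
    # Remember which resource each id stands for (an assumption, see above).
    by_id = {r.get("id"): name
             for r, name in zip(resources, names) if r.get("id")}
    triples = []
    for r, name in zip(resources, names):
        if name.startswith("#"):
            name = by_id.get(name[1:], name)   # resolve in-document reference
        for pv in r.findall("PV"):
            if pv.get("v") is not None:
                triples.append((name, pv.get("p"), pv.get("v"), "uri"))
            else:
                literal = (pv.text or "").strip()
                triples.append((name, pv.get("p"), literal, "literal"))
    return triples


triples = rpv_triples(RPV_DOC)
for t in triples:
    print(t)
```

Each PV maps to exactly one tuple, with its subject taken from the enclosing R; there is no further structure to untangle.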
Whether the RDF Working Group will consider alternative syntaxes or whether something like RPV could possibly be adopted remain open questions. The value of Bray's RPV proposal is its demonstration that an XML serialization of the RDF model does not have to be complex or hard for humans to read.
One of the parts of RDF which people seem to like is the clarity of tuples of subjects (resources), predicates (properties), and objects (values). The 3-tuple isn't ideal for every situation and, yes, some people aren't interested in thinking of things in terms of graphs of triples. For those who are, however, having an XML serialization of RDF which makes the triples obvious and plain seems to be an unambiguously good thing.