Something Useful This Way Comes

June 9, 2004

Already, Never, or Somewhere in Between

The Semantic Web is a complex human undertaking, which means, at the very least, that we should expect it to require a significant investment of time, effort, and funding. The European Union and the US Pentagon System, as well as many member companies of the W3C, have invested heavily in technologies directly related to the Semantic Web, including XML, RDF, OWL, and various rule and agent systems and languages.

Further, since the Semantic Web is a decentralized, distributed, Web-scale knowledge representation system built on the Web -- a decentralized, distributed hypermedia system -- it's arguable that we should also tally all of the investment that went into making the Web itself, and thus the Internet before it, as well as into the constituent systems and technologies of the knowledge representation part of AI.

In other words, development of the Semantic Web requires a lot of work, but a lot of work has already been done. This raises an obvious question: when will all that work pay off?

There are only three ways to answer that question -- already, never, or somewhere in between. In other words, one might say that the Semantic Web is already here; we already have a Web in which machines can retrieve, exchange, and manipulate knowledge in order to satisfy human needs and desires, in order to aid human projects and plans. There's an awful lot of RDF on the Web, in the form of RSS 1.0 feeds, FOAF files, photo annotation formats, geographic and calendaring ontologies, and so on. We might call this the weak form of the Semantic Web; yes, it's weaker than the more robust forms, but it has the advantage of actually existing, today.

There are, of course, more robust forms of the Semantic Web, ones in which OWL -- the Web Ontology Language -- and rules play a central role. This more robust form is not discontinuous with the weaker form; but it's different in that, say, we humans get more and more novel forms of aid and assistance from our machines. For example, rather than being able merely to publish my calendar in a form other people's machines can understand, in a more robust Semantic Web I will be able to publish my calendar, as well as some rules about me, my lifestyle, and my calendaring constraints, and software programs -- which are usually called "agents" in this context -- will take over much of the burden of maintaining my calendar, including making new appointments and events, from me. That will be possible, in large part, because many of the people and institutions with whom and with which I want to interact are using similar agents and technologies.

This more robust form, which we might call the Scientific American or Berners-Lee/Hendler/Lassila form, is -- as I said last August in an XML.com article ("The Semantic Web is Closer Than You Think") -- pretty much a done deal. Okay -- that's an exaggeration, but please note that it's an exaggeration in only one sense. By which I mean that the technology is a done deal. That is, we know how to implement such a system; and, in fact, at last year's WWW conference developers from the University of Maryland's MIND Lab demonstrated a system that implements the Scientific American scenario.

The problem, as with any technology, is as much social as it is technical. The rest of the world -- and here I mean big corporations, legislatures, and all the other mechanisms that create new markets and warp old ones -- is playing catch-up to the technology. That's a good thing. I suspect we could see real world systems like the one described in the SciAm article in the next five to ten years. Such systems await a ubiquitous, broadband WiFi network infrastructure -- as well as, realistically, the right kind of legislation and policy changes -- as much as they await any specific web service, Semantic Web, or knowledge representation results.

So much for "already" and "sometime between already and never". What about "never"? Recent debates in the XML developer community suggest that there is a contingent of developers and software professionals that believes that the Semantic Web is all hype. Their answer to the question, "when will all that investment in the Semantic Web finally pay off?", seems to be a resounding "never".

An Ongoing Debate

In what remains of this article I will review some of this debate, trying to figure out what, if anything, is really at stake in it. Then I will say a few brief and informal words about a new W3C standardization effort that will, if it succeeds, make the Semantic Web more likely to be realized in the real world.

As reported on XML.com two weeks ago by Paul Ford, the debate flared up recently when Elliotte Rusty Harold -- one of the XML figures I admire most, frankly -- suggested in his WWW2004 reportage that the Semantic Web was nothing but a big ol' hype balloon.

The reaction to Harold's claims is interesting in a few ways. First, it demonstrates that the W3C's influence is limited. Second, it suggests that there continues to be a lot of confusion about some of the advantages of, say, RDF over XML. Finally, it suggests that those of us who think the Semantic Web is a valuable project have failed miserably in communicating that to others.

Mike Champion offered an optimistic note, suggesting that Semantic Web technology may first flourish behind the enterprise firewall, in a way reminiscent of the earliest days of Netscape's corporate success:

The other previously missing ingredient is that real organizations have at least something approximating an implicit ontology in their database schema, standard operating procedures, official vocabularies, etc. It is at least arguable that the technologies that have emerged from the Semantic Web efforts allow all this diverse stuff to be pulled together in a useful way -- ontology editors, inferencing engines, semantic metadata repositories, etc. I'm seeing real success stories in my day job, and a coherent story is starting to be told by a number of vendors, analysts, etc.

Champion here makes a similar point to the one I argued in an article last fall ("Commercializing the Semantic Web"), namely, that there exist today several startups and fledgling ventures that are selling Semantic Web technologies to corporate clients, including Network Inference, Tucana Technologies, and others.

In response to Champion's post, Harold seemed to modify his mostly-negative appraisal of the Semantic Web. He conceded:

Part of what bothers me about the semantic web is syntax. It's too ugly to be practical. And syntax does matter. XML succeeded where SGML failed not because XML can do anything SGML can't (except maybe internationalization) but because the XML syntax story is cleaner and more approachable. The RDF syntax is just too ugly to be plausible.

The basic idea of RDF that seems useful is naming things with standard URIs. However, I simply don't see how the RDF syntax improves on XML+namespaces for that, and XML+namespaces is so much nicer a syntax than RDF.

I agree that syntax matters. A lot. But there is no "RDF syntax", unless he means by that the RDF data model. There is an admittedly rather ugly canonical serialization of RDF in XML; but there are also, at least by my latest count, five or six tractable alternatives to RDF-XML. (See my "The Courtship of Atom" for details and links to these alternatives.)

I disagree, however, that the "basic idea of RDF that seems useful is naming things with standard URIs" -- the basic idea of RDF is the formal data model, which offers the possibility of semantic interoperability which we simply do not have with XML. That the data model also offers a way to do inferencing is like icing on the cake. Some people like icing, some people like cake, and others, like me, like both, at the same time. In other words, lots of people do useful work with RDF and never use the inferencing it allows, while others are attracted to RDF precisely because of the inferencing. Where Semantic Web evangelists -- among which number I sometimes count myself -- have failed miserably is in turning that diversity into a widely perceived strength.

Joshua Allen took a similar line: "The value of RDF is the data model; not the serialization syntax". Allen, a Microsoft employee, also claimed that Microsoft's WinFS project is similar to RDF: "OSAF Chandler is based on 'triples', as is Longhorn's WinFS. Both are essentially 'personal semantic web stores'. Triples+URIs is how you bootstrap the 'personal semantic web store' and make it universal". Honestly, I don't know whether to laugh, because with WinFS Microsoft seems to be buying into the Semantic Web idea, or cry, because with WinFS Microsoft seems to be embracing-and-extending the Semantic Web idea. Oh well -- outside of the realm of unenforced US antitrust legislation, Microsoft is like gravity. Eventually, you just learn to work around it.

This conversation is ongoing as I write this column, and it spans several threads and a few hundred messages. It also covers a very wide range of ground, a good deal of which has more to do with XML than the Semantic Web per se. If you care about this stuff, you might want to review the conversation in detail.

Query, Inference, and RDF

One of the themes running through the current debate is whether RDF is more expressive than XML and namespaces. I think that it is, both because of the formal RDF model and because of the inferencing that model provides. For every XML vocabulary I encounter, I have to figure out, on my own, what implicit data model is at work, and those models may well be very different from one vocabulary to the next. I do this by reading a schema or other documentation. Sometimes I have to ask the people who are producing it. I might even have to guess. For example, consider the simple containment relation in XML:

<foo><bar/></foo>

What does it mean that there is "bar" contained in a "foo"? Is this "bar" a kind of "foo"? Does it mean that "bar" has a "foo"? Is "bar" subordinate to or dependent on "foo" in some way, or vice versa? I can find out answers to all these questions and more, of course, by consulting a schema or documentation or asking a developer. As its evangelists have said repeatedly, XML is useful because it gives us syntactic rather than semantic interoperability. Yes, that's true.

Because there is a formal model behind RDF, however, when given a piece of RDF, I just need to figure out what the predicates mean, and I need to figure out what the URIs identify. It's not as if RDF is perfectly self-describing; nobody of any competence claims that. But what I don't have to figure out is the relations between the different parts of the graph. I don't have to figure out which of the XML elements and attributes are the subjects, predicates, and objects. I get that for free. Those relations are described for me, formally and backed by real logical and mathematical formalisms, in the RDF data model. It's like the old science fiction dictum: you may not need RDF often, but when you do, you'll need it bad.
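
To make the contrast concrete, here is a minimal Python sketch. It parses the XML fragment shown earlier and then writes one possible reading of it as an RDF-style triple. The example.org URIs and the "hasPart" predicate are invented for illustration; nothing in the XML itself says which relation the containment is meant to express, and that is the point.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the text: a parser can report containment and nothing more.
root = ET.fromstring("<foo><bar/></foo>")
containment = [(parent.tag, child.tag) for parent in root.iter() for child in parent]

# One possible reading of that fragment as an RDF-style triple. The URIs and
# the "hasPart" predicate are hypothetical; a schema or a developer would have
# to confirm that this is what the XML author meant.
EX = "http://example.org/"
triples = [
    (EX + "foo1", EX + "hasPart", EX + "bar1"),
]

def describe(triple):
    """Every triple has the same shape, so the roles never need guessing."""
    subject, predicate, obj = triple
    return {"subject": subject, "predicate": predicate, "object": obj}

roles = [describe(t) for t in triples]
```

What remains to be learned from documentation is what "hasPart" means and what the two URIs identify. Which term is the subject, which the predicate, and which the object is fixed by the data model.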

One of the things RDF and Semantic Web developers have been doing with data that complies with the RDF data model is querying it. Until very recently different communities and clusters of developers have created RDF query languages on their own, which has led to some inevitable problems, including a serious deficit of interoperability. The W3C's Semantic Web Activity has, accordingly, chartered a new Working Group, called Data Access, the members of which are working on standardizing a query language and data access protocol for RDF.

This Working Group has recently released the first working draft of its first document, RDF Data Access Use Cases and Requirements (UCR). I know this because I'm a member of this WG and the editor of this document. I point this out because, well, I want you to think I'm a really smart guy -- &wink; -- but also because I want to make it clear that in this column I am speaking for myself only and not for the Working Group.

The UCR document describes, as you might guess, use cases for an RDF query language and data access protocol. It also describes a set of mandatory requirements and optional design objectives, most of which are motivated by the use cases. In order to give you some idea as to what a standard RDF query language might look like, here are the requirements and design objectives which the WG has already accepted formally:

The query language must include the capability to restrict matches on a queried graph by providing a graph pattern, which consists of one or more RDF triple patterns, to be satisfied in a query.
It must be possible for queries to return zero or more bindings of variables. Each set of bindings is one way that the query can be satisfied by the queried graph.
The query language must make it possible -- whether through function calls, namespaces, or in some other way -- to calculate and test values extensibly.
The query language must be suitable for use in accessing local RDF data -- that is, from the same machine or same system process.
The query language must include support for a subset of XSD datatypes and operations on those datatypes.
The access protocol design shall address bandwidth utilization issues; that is, it shall allow for at least one result format that does not make excessive use of network bandwidth for a given collection of results.
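
The first two requirements, graph patterns and variable bindings, can be sketched in a few lines of Python. This is only an illustration of the idea, not the Working Group's design: the leading "?" for variables, the function names, and the FOAF-style sample data are all assumptions made for the example.

```python
def is_variable(term):
    """Variables are written with a leading '?', a convention assumed here."""
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended

def query(graph, patterns):
    """Return every set of variable bindings that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = match_triple(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

# Hypothetical data in the style of a FOAF file.
graph = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", "Bob"),
]

# Whom does Alice know, and what is that person's name?
results = query(graph, [
    ("ex:alice", "foaf:knows", "?friend"),
    ("?friend", "foaf:name", "?name"),
])

# A pattern that nothing in the graph satisfies yields zero bindings.
no_results = query(graph, [("ex:carol", "foaf:knows", "?anyone")])
```

Each dictionary in the result is one way the graph satisfies the whole pattern, which is how a query can return "zero or more bindings of variables".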

There are some other candidate requirements or design objectives that the WG's members are presently debating:

It must be possible for query results to be returned as a subgraph of the original queried graph.
It must be possible to express a query that does not fail when some specified part of the query fails to match. Any such triples matched by this optional part, or variable bindings caused by this optional part, can be returned in the results, if requested.
It must be possible to specify an upper bound on the number of query results returned.
It must be possible to handle large result sets of any size by iterating over the result set and fetching it in chunks.
It should be possible for query results to include source or provenance information.
It should be possible to query for the non-existence of one or more triples or triple patterns.
It should be possible to specify two or more RDF graphs against which a query shall be executed; that is, the result of an aggregate query is the merge of the results of executing the query on each of two or more graphs.
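
Three of these candidates (optional matches, an upper bound on results, and querying several graphs) can be sketched together in Python. This is a toy under stated assumptions, not anything the WG has specified: the "?" variable convention, the function names, and the sample data are hypothetical, and the multi-graph case is modeled by simply concatenating per-graph results.

```python
from itertools import islice

def match(pattern, triple, binding):
    """Return `binding` extended so `pattern` fits `triple`, or None."""
    out = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if out.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return out

def solve(graph, required, optional=()):
    """Lazily yield bindings; optional patterns add bindings but never cause failure."""
    def extend(bindings, pattern, keep_unmatched):
        for b in bindings:
            hits = [m for m in (match(pattern, t, b) for t in graph) if m is not None]
            if hits:
                yield from hits
            elif keep_unmatched:
                yield b  # the optional part did not match; keep the solution anyway

    solutions = iter([{}])
    for pattern in required:
        solutions = extend(solutions, pattern, keep_unmatched=False)
    for pattern in optional:
        solutions = extend(solutions, pattern, keep_unmatched=True)
    return solutions

def query_many(graphs, required, optional=(), limit=None):
    """Run the query on each graph, concatenate the results, and cap the count."""
    combined = (b for g in graphs for b in solve(g, required, optional))
    return list(islice(combined, limit))

# Hypothetical sample graphs.
work = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
]
home = [("ex:bob", "foaf:name", "Bob")]

name_pattern = [("?p", "foaf:name", "?name")]
mbox_pattern = [("?p", "foaf:mbox", "?mbox")]

everyone = query_many([work, home], name_pattern, mbox_pattern)
first_only = query_many([work, home], name_pattern, limit=1)
```

Because `solve` is a generator, a caller could also pull results a few at a time instead of all at once, which is one way to approach the chunked-fetching candidate.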

The Data Access WG is eager to hear from the XML.com audience and invites feedback to its comments mailing list. Note: the latest unreleased version of the UCR draft is publicly available, so if you want to see the latest evolution of the WG's present work, including things the WG is considering but hasn't yet formalized, that's a good place to start.