Identity Crisis
September 11, 2002
Recapitulation
Members of the W3C's Technical Architecture Group (TAG) are preparing the "Architectural Principles of the World Wide Web" (APW), a document intended to serve as a definitive statement of what the TAG has discovered and defined about what makes the Web work. As I described last week, the APW contains four substantive sections: an introduction, a section on identifiers and resources, a section on formats, and a section on protocols. The structure of the document reflects the structure of the Web's architecture, which the APW says consists of identifiers, formats, and protocols.
In last week's column I discussed the APW's introduction and some general issues of terminology, especially the confusion, as I see it, of principle with practice. In this week's column, I examine APW Section 2, Identifiers and Resources.
The Identity of Resources
In the APW's view, the Web is a "universe of resources". So far, so good. But what is a resource? The APW adopts the definition of resource from RFC 2396, a definition which has always made me uneasy, though probably because I'm still more inclined to think of these things like a philosopher than like a programmer or software system architect.
RFC 2396 says that a resource "can be anything that has identity", without any further explanation of, or even as much as a pointer to, what meaning of identity is being invoked. RFC 2396's definition is indistinguishable from saying that a resource can be anything at all. Every individual thing is what it is and not something else. Every thing has whatever identity results from it being identical to itself. RFC 2396 fails spectacularly as a definition of "resource". It's wrong to commend a definition which doesn't provide any traction (a property, or condition, or state) by which one might distinguish between its definiendum, the thing being defined, in this case, resource, and anything else at all.
If the intent is to say that anything can be a resource, under suitable conditions, that's a fairly coherent idea, and it can be expressed without recourse to that thing "having identity". In the absence of further detail about what having identity means (in the sense of RFC 2396 and the APW), I am still waiting for an example of a thing which lacks it. Some parts of the universe are not things and (probably) don't have identity (in the sense of RFC 2396 and the APW); for example, fists, wrinkles, and knots, which are merely modifications of hands, carpets, and strings. But all individual things have at least the identity which comes from being self-identical.
At least two things follow from this criticism. First, it doesn't seem to make much practical difference that RFC 2396 and the APW rely on a weird definition of "resource". Second, in the problem domain of Web specifications and standards, the conceptual boundaries between "resource", "identity", "identifier", and "representation" are gerrymandered, constantly shifting, and provisional at best. The APW offers the following list of resource examples: "documents, files, menu items, machines, and services, as well as people, organizations, and concepts". All of these disparate things are resources because they "have identity", though for that to mean anything more than that each thing is identical to itself, there must be some criterion of identity, some principle of individuation, for the kind of thing in question. I think we have good individuation principles for files, documents, menu items, machines and services. Okay, I'm pretty sure we have them for "file", "menu item", and "machine". I'm less sure about "service" and "document".
But do we really have them for persons, organizations and concepts? And, perhaps more importantly, what are they? The questions raised by personal identity theory are among the thorniest kinds of question humans know how to ask, and the same or roughly the same is true for institutions and organizations. Consider the following URIs:
mailto:kendall@monkeyfist.com
http://clark.dallas.tx.us/kendall
http://monkeyfist.com/KendallClark
Some of the murkiness about resources, identity, and identifiers is responsible in part for the perpetual conversations over which, if any, of the aforementioned URIs identify a natural person, a natural person resource, a natural person's "home page", or some computer resource over which some natural person has (some measure of) control, and so on. I'm increasingly unhappy with the way in which Web specifications address these fundamental issues. It would be better for them to go unaddressed rather than addressed superficially. At the very least, talk about resources having identity should either be dropped or clarified, because as it stands, it's merely another source of confusion.
The Importance of Linking and Being Linked
Use absolute URI references: All important resources SHOULD be identified by an absolute URI reference (APW 2.1).
"The value of the Web increases with the number of resources addressable by absolute URI reference," the APW claims. "In turn, resources are more valuable when they are addressable in the Web." I couldn't agree more; this is one of the foundational ideas of the REST theory of Web architecture, which suggests that in designing a Web application, every significant resource, including transitional and intermediate resources, should have its own identifier, i.e., a URI. Having a URI is a necessary and sufficient condition of a thing being a member of the set of Web resources. No URI, no Web. Extending this point a bit further, one can say that the more links to other resources are contained in (the retrievable representation of the state of) a resource, the more valuable that resource is. Linking and being linked to is what the Web is all about.
Doing Things With URIs
Absolute URI references are unambiguous: Each absolute URI reference unambiguously identifies one resource (APW 2.2.2).
Support persistence: Those who create and manage resources and their identifiers SHOULD design the identifiers in such a way as to ensure their persistence (APW 2.3).
The APW rightly points out that the two chief things one can do with an "absolute URI reference" are to compare it for equality with another absolute URI reference -- an operation which is addressing-scheme contingent -- and to interact with the resource or, more pointedly, interact with a retrieved representation of the state of the resource, the result of the ubiquitous GET.
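To make the scheme-contingent nature of comparison concrete, here is a minimal sketch in Python. It rests on assumptions that are mine, not the APW's: that scheme and host compare case-insensitively, that an explicit default port (80 for http, 443 for https) is equivalent to no port at all, and that everything else compares character for character. The names comparison_key and same_uri are hypothetical, and the uppercase variant of the URI is invented for illustration.

```python
from urllib.parse import urlsplit

# Assumed defaults; other addressing schemes define their own equivalence rules.
DEFAULT_PORTS = {"http": 80, "https": 443}


def comparison_key(uri):
    """Reduce a URI to a tuple that can be compared with ==."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Treat an explicitly written default port as if it were absent.
    if port == DEFAULT_PORTS.get(scheme):
        port = None
    # Path, query, and fragment are left alone: they are case-sensitive here.
    return (scheme, parts.username, parts.password, host, port,
            parts.path, parts.query, parts.fragment)


def same_uri(a, b):
    """True if the two URIs are equal under the assumptions above."""
    return comparison_key(a) == comparison_key(b)


print(same_uri("HTTP://Clark.Dallas.TX.US:80/kendall",
               "http://clark.dallas.tx.us/kendall"))
print(same_uri("http://clark.dallas.tx.us/kendall",
               "http://clark.dallas.tx.us/Kendall"))
```

Under these assumptions the first pair compares equal and the second does not. Note what the comparison cannot tell us: it says whether two strings are the same identifier, not which resource that identifier picks out.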
It should be clear now that, at the level of conceptual elegance, the murkiness of the identity of resources causes some conceptual problems with the unambiguity of URIs. URIs may well identify one resource each, but which one? Or, rather, if this is the case, why do developers tend to confuse or conflate resources? A URI like
http://clark.dallas.tx.us/kendall
cannot, if we take the APW seriously, identify both the resource we might call "Kendall Clark's home page" and the resource we might call "the natural person Kendall Clark". And yet there are perpetual conversations in the development community about, say, which resource one's home page identifies, about overloading the URI of one's home page to identify both oneself and one's home page, and so on.
There are kinds of ambiguity, and it would be helpful if the APW would specify which type or types it intends here, even if the types are analogical. What sorts of ambiguity might exist between an identifier and a resource? For example, one might say that the relation of URIs to resources is vulnerable to (a kind of) act-object ambiguity, which names the confusion between the result of some action and the action itself. The word "observation" may name either the result of an act of observing or the act itself -- "My field observations were surprising today" could mean that the result of today's fieldwork cast doubt on a pet theory, or that while performing my fieldwork, I was surprised to discover that I was sitting on a mound of ants.
As I understand the APW and the REST architecture, URIs do not identify the result of retrieving a (representation of the state of a) resource, but, rather, they identify the resource itself. The same issue seems to come up in APW 2.5 Fragment Identifiers, which says that the fragment identifier
is interpreted only after the retrieval of a representation. Section 4.1 of [RFC2396] states that "the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result," that is, the representation.
It seems that something like the act-object ambiguity is part of the canonical understanding of the Web. A URI identifies a resource, but the optional fragment identifier of an "absolute URI reference" identifies some part of the representation of the resource which the URI identifies, a part which is contingent upon the type of representation it is.
According to the APW (more accurately, to my reading of an early draft of the APW), a URI identifies one and only one resource, unambiguously, and the (optional) fragment identifier part of an absolute URI reference identifies some part of the representation of that resource. In effect, URIs have two namespaces: one which points to entities within the shared information space of the Web, another which points inside the representational space of the state of a Web resource. If URIs identify one and only one resource, is there a parallel expectation that fragment identifiers identify one and only one part (for lack of a more general word) of a representation? Should we extend the URI persistence practice to fragment identifiers, too? Do "cool" fragment identifiers change?
I take the cash value of the URI-Resource unambiguity principle above to mean that the representations of the state of a resource retrieved by, say, successive HTTP GETs of the same URI are representations of the state of the same resource, no matter how different that state may be. This implies several related good practices about the persistence of URIs, i.e., that the URIs of valuable or important resources shouldn't change willy-nilly (ideally, shouldn't change, ever), and, ideally, that the value of an identifier is in some sense consumed by its first use. In other words, if I put a resource identified by the following URI
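The two namespaces can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the fragment "#latest" and the sample markup are invented, resolve_fragment and AnchorFinder are hypothetical names, and the HTML rule used here (match an element's id, or the name of an anchor) is a simplification of what a real user agent does.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorFinder(HTMLParser):
    """Record the first element a fragment identifier would select."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        by_id = attrs.get("id") == self.fragment
        by_name = tag == "a" and attrs.get("name") == self.fragment
        if self.found is None and (by_id or by_name):
            self.found = tag


def resolve_fragment(media_type, body, fragment):
    """Interpret a fragment against an already retrieved representation."""
    if media_type == "text/html":
        finder = AnchorFinder(fragment)
        finder.feed(body)
        return finder.found
    # Other media types define their own rules; this sketch knows none of them.
    return None


# First namespace: the URI proper, which is what a GET would be sent to.
uri, fragment = urldefrag("http://monkeyfist.com/WeeklyReview#latest")

# Second namespace: the fragment, meaningful only inside a representation.
html_body = '<h1>Weekly</h1><p id="latest">This week...</p>'
print(uri, fragment)
print(resolve_fragment("text/html", html_body, fragment))
print(resolve_fragment("text/plain", "This week...", fragment))
```

The same fragment selects a paragraph in the HTML representation and nothing in the plain-text one, which is the sense in which the fragment's meaning is contingent on the type of representation retrieved.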
http://monkeyfist.com/WeeklyReview
into the Web, the value of that identifier has been consumed or used up by having been associated with the resource ("The Weekly Review") it identifies. It has value as an identifier as long as it continues to identify just that resource, which it will continue to do as long as it returns a representation of the state of that resource in response to a GET request. The value of that identifier approaches nil, however, if I change the resource which that URI identifies to, say, "Harper's Weekly Review". In short, there are at least two types of URI persistence: first, that the identifier of a resource persist through time; second, that URIs are always the identifier of the same resource. URIs are cheap; there are, after all, a lot of them.
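The idea that an identifier is consumed by its first use can be sketched as a tiny registry that refuses to rebind a URI. This is a thought experiment in Python, not a description of how any Web server works: UriRegistry and its bind method are hypothetical, and resources are stood in for by plain strings.

```python
class UriRegistry:
    """Hypothetical registry in which a URI's first binding is permanent."""

    def __init__(self):
        self._bindings = {}

    def bind(self, uri, resource):
        # setdefault stores the resource only if the URI is still unused.
        current = self._bindings.setdefault(uri, resource)
        if current != resource:
            raise ValueError(f"{uri} already identifies {current!r}")
        return current


registry = UriRegistry()
registry.bind("http://monkeyfist.com/WeeklyReview", "The Weekly Review")

try:
    # Repointing the identifier at a different resource is rejected.
    registry.bind("http://monkeyfist.com/WeeklyReview", "Harper's Weekly Review")
except ValueError as error:
    print(error)
```

Nothing on the real Web enforces this, of course; persistence is a practice that depends on the discipline of whoever manages the identifiers.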
Conclusion
Most of what I've said about the APW should be taken with a grain of salt: my questions and concerns are largely about the conceptual elegance or tidiness of Web specifications and less about the practical operation of the Web itself. As we push out toward something we can honestly call the Semantic Web, however, some of these conceptual issues will become more pressing and more practical. Or, at the very least, they will become more tricky and more fun to write and speculate idly about.