Identity Crisis

November 7, 2001

This week's XML-Deviant article looks at the problem of unique identifiers in XML documents, a loophole in XPointer, and proposals on XML-DEV to resolve the issue.

Understanding the ID

The high level of activity on XML-DEV over the last few weeks has been sustained by a recent flurry of emails on the topic of unique identifiers within XML documents. The issue surfaced following a seemingly straightforward problem statement from Fabio Dianda: how do you identify ID attributes without a DTD or when using a non-validating parser? The problem seems straightforward because the answer is that you can't. The only way to discover whether an attribute is an ID is to consult a schema, or to hard-code the attribute's name into your application. Game over.

However, for several people, including Tim Bray and James Clark, this isn't an acceptable situation. In a posting to a related thread on an IETF mailing list, Bray described it as

...maybe the #1 gaping architectural hole as regards XML & the Web. The problem is that at the moment, given some arbitrary XML, there is *no* good way to determine what's an ID without recourse to some external resource like a DTD or schema, and that, to use a technical term, sucks.

In a followup post to XML-DEV Bray elaborated on the primary use case for ID attribute discovery.

It's really important for the workings of the web that an address such as
http://example.com/foo#Chapter12
have well-defined semantics. If foo turns out to be XML, this is hopelessly underdefined.

Fragment identifiers, like #Chapter12, are defined in RFC 2396, which explains that the precise meaning of a fragment identifier depends on the media type of the data retrieved via the URI. For a Web page this has been defined such that a fragment identifier refers to an HTML named anchor, i.e. <a name="Chapter12"/>. For an XML document the meaning of a fragment identifier is described in the XPointer Candidate Recommendation, using what is known as the Bare Name syntax. Under these processing rules, our fragment identifier would be interpreted as the following XPath expression: id('Chapter12').
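As a rough illustration of that chain of specifications, the sketch below splits the example URI into resource and fragment and reports how the fragment would be read for each media type. The helper interpret_fragment and its media-type checks are hypothetical names invented for this illustration; only urldefrag comes from the Python standard library. Note that the "#" delimiter is not part of the fragment itself.

```python
from urllib.parse import urldefrag


def interpret_fragment(uri, media_type):
    """Describe how a fragment identifier would be read for a media type.

    Hypothetical helper: the returned strings only label the
    interpretation described in the text; nothing is dereferenced.
    """
    _, fragment = urldefrag(uri)
    if not fragment:
        return None
    if media_type == "text/html":
        # HTML: the fragment names an anchor.
        return '<a name="%s">' % fragment
    if media_type in ("text/xml", "application/xml"):
        # XML: a bare name is shorthand for an XPath id() lookup.
        return "id('%s')" % fragment
    return None


uri = "http://example.com/foo#Chapter12"
print(interpret_fragment(uri, "text/html"))
print(interpret_fragment(uri, "application/xml"))
```

The same URI reference thus yields two quite different lookups, and the XML one only succeeds if something has told the processor which attributes are IDs.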

Following the trail of specifications a step further, we discover that the XPath id function selects elements by their unique ID attribute, as declared in a DTD. Slightly confusingly, though, XPath makes no assertions about whether the document is actually valid, so in practice that identifier might not be unique: XPath treats any subsequent elements with the same identifier as not having an identifier at all.

To summarize, then, one can only link to an element within an arbitrary XML document using a bare XPointer if the element has an ID attribute and if the document has a DTD. Thus, XPointer doesn't play well with simple well-formed documents, arguably one of XML's greatest gifts.
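The situation can be reproduced with Python's standard xml.dom.minidom, used here purely as a convenient DOM implementation. The sketch parses a well-formed document that has no DTD, shows that an ID lookup finds nothing, and then applies the "hard-coded name" workaround through the DOM Level 3 setIdAttribute method. The sample document and its chapter element are assumptions made for the illustration.

```python
from xml.dom import minidom

# A well-formed document with no DTD: nothing says that "id" is an ID.
doc = minidom.parseString('<book><chapter id="Chapter12"/></book>')

# No ID-typed attributes are known, so the lookup finds nothing.
before = doc.getElementById("Chapter12")

# The application hard-codes its knowledge that "id" is the identifier.
for chapter in doc.getElementsByTagName("chapter"):
    chapter.setIdAttribute("id")

after = doc.getElementById("Chapter12")
print(before, after.tagName)
```

The second lookup succeeds only because the application supplied knowledge that the document itself does not carry.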

Of course one could use an XPointer that explicitly identifies the required element, e.g. using a child sequence. But this is a more fragile solution particularly in the face of document edits. There are also benefits to having a uniform syntax for fragment identifiers; for, as David Carlisle observed, if

#foo has a meaning for text/html it would be nice if you could arrange that the same uri-reference located a more or less equivalent thing if something of type application/xml was returned. If the fragment ID syntax for application/xml is different...then you can't have a single uri reference that works regardless of the mime type returned.

Judging by the lengthy discussion that ensued, there is agreement that this issue needs to be resolved. But the exact extent of the problem, and the best way to fix it, are still under discussion.

Gaining an Identity

While the primary use case presented so far has been the need to link to portions of XML documents on the Web, the scope of the problem actually seems wider. CSS allows styling to be applied to individual elements, using id selectors. Acknowledging that this feature is problematic without a DTD (or without a guarantee that a user agent will read the DTD), the CSS authors ended up recommending a work-around to avoid the problem altogether. If a generic means of attaching an identifier to an element without using a DTD were available, then this uncertainty could be resolved. The same applies to other specifications, as Michael Champion noted.

It's not just one spec, id-ness is exposed to users of XPath, XPointer, XSLT, XLink, and DOM. More importantly, it's a widely used feature in real applications that use these specs, especially when an XML app is working with a database. I use "getElementById()" whenever I can (e.g., I control the XHTML and can put in the necessary attributes).

I agree with Tim Bray: this is a "gaping architectural hole" because these other specs don't require DTDs or schemas in the general case, and have to mumble to describe what is supposed to happen if there isn't a DTD/Schema to define id-ness. None care (at their core) about the other features that DTD/Schema brings to the table, they just need a way to define id-ness.

Most of the proposals suggested so far have targeted a general solution, which I briefly summarize here.

Use the Internal Subset

This proposal sidesteps the issue by requiring that ID attributes be declared using the DTD internal subset. This hasn't been generally popular, mainly because of the overhead it imposes on document authors, particularly for languages such as MathML that have many attributes with identifiers.
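A minimal sketch of how this could work with a non-validating parser follows, using Python's standard expat bindings. It assumes a hypothetical document whose internal subset declares a label attribute of type ID; the function name index_declared_ids is likewise invented for the illustration.

```python
from xml.parsers import expat

DOCUMENT = """<!DOCTYPE book [
  <!ATTLIST chapter label ID #IMPLIED>
]>
<book><chapter label="Chapter12"/><chapter label="Chapter13"/></book>"""


def index_declared_ids(text):
    """Map ID values to element names using internal-subset declarations."""
    id_attributes = set()   # (element name, attribute name) pairs typed ID
    index = {}

    def on_attlist(elname, attname, att_type, default, required):
        if att_type == "ID":
            id_attributes.add((elname, attname))

    def on_start(name, attrs):
        for attname, value in attrs.items():
            # Keep the first element seen for a given ID value.
            if (name, attname) in id_attributes and value not in index:
                index[value] = name

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return index


declared = index_declared_ids(DOCUMENT)
print(declared)
```

The cost the critics point to is visible in the document itself: every element type that carries an identifier needs its own ATTLIST line in every instance.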

IDISID

One of the most recent proposals has been from Don Park. In this scenario, any attribute called "id" should be an ID. This idea is simple, and there is some indication that it was a common practice in SGML circles. But it wouldn't handle vocabularies that had identifiers with alternative names.
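In code, the rule amounts to very little, which is much of its appeal. The sketch below uses Python's ElementTree; the function name and the sample document are assumptions for the illustration, and the choice to keep the first element when a value is repeated is ours, not part of the proposal.

```python
import xml.etree.ElementTree as ET


def index_by_id_name(root):
    """Treat every attribute literally named "id" as an identifier."""
    index = {}
    for element in root.iter():
        value = element.get("id")
        # First occurrence wins; later duplicates are ignored.
        if value is not None and value not in index:
            index[value] = element
    return index


root = ET.fromstring(
    '<doc><sec id="intro"/><fig label="f1"/><sec id="end"/></doc>'
)
ids = index_by_id_name(root)
print(sorted(ids))
```

The fig element, whose identifier lives in an attribute called label, is missed entirely, which is exactly the limitation noted above.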

Use a Processing Instruction

This proposal suggests using a Processing Instruction (PI) that lists the attributes that should be interpreted as identifiers. While this seems to have low impact -- it has little effect on validity -- many oppose PIs in principle, despite persistent claims by Rob Lugt, Marcus Carr and others that this is the most sensible option.

Example usage:

<?xml-typeinfo idnames="x"?>
<foo>
  <bar x="abc"/>
  <baz x="hij"/>
</foo>
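A processor supporting this proposal might behave roughly as sketched below, using Python's minidom because it keeps document-level processing instructions. The xml-typeinfo target and idnames pseudo-attribute come from the example above; treating the value as a whitespace-separated list of names is an assumption, as are the function names.

```python
import re
from xml.dom import minidom

DOCUMENT = """<?xml-typeinfo idnames="x"?>
<foo>
  <bar x="abc"/>
  <baz x="hij"/>
</foo>"""


def id_names_from_pi(doc):
    """Collect attribute names listed in xml-typeinfo processing instructions."""
    names = set()
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-typeinfo"):
            match = re.search(r'idnames\s*=\s*"([^"]*)"', node.data)
            if match:
                # Assumption: several names are separated by whitespace.
                names.update(match.group(1).split())
    return names


def index_ids(doc):
    """Index elements by the value of any attribute the PI names as an ID."""
    names = id_names_from_pi(doc)
    index = {}
    for element in doc.getElementsByTagName("*"):
        for name in names:
            if element.hasAttribute(name):
                index.setdefault(element.getAttribute(name), element)
    return index


pi_index = index_ids(minidom.parseString(DOCUMENT))
print(sorted(pi_index))
```

Because the PI sits outside the element tree, the instance's elements and attributes are untouched, which is why the approach has little effect on validity.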

xml:id

Tim Bray proposed that a reserved attribute, in the XML namespace, be used to explicitly associate identifiers with elements. One of the benefits of using the reserved XML namespace is that it need not be declared, making it simple to add identifiers without additional changes to the instance. The obvious downside is that many DTDs will have to be updated to make these attributes valid for use on particular elements.

Example usage:

<foo>
  <bar xml:id="label1"/>
  <baz xml:id="label2"/>
</foo>
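Looking up such identifiers needs no schema knowledge at all, as this ElementTree sketch suggests. The xml prefix is bound by definition to the namespace shown, so the sample document declares nothing; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is bound to this namespace by definition.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(
    '<foo><bar xml:id="label1"/><baz xml:id="label2"/></foo>'
)

xml_ids = {}
for element in root.iter():
    value = element.get(XML_ID)
    if value is not None:
        # Keep the first element seen for each identifier value.
        xml_ids.setdefault(value, element)

print(sorted(xml_ids))
```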

xmlid:xx

Offered as a variant of Bray's proposal, this option uses an explicit "identifier namespace". Any attribute associated with this namespace would be taken as an identifier. This adds the possibility of multiple identifiers per element. There hasn't been much backing for this proposal as the benefits of the additional flexibility aren't widely acknowledged.

Example usage:

<foo xmlns:xmlid="http://w3.org/xmlid">
  <bar xmlid:x="abc"/>
  <baz xmlid:z="hij"/>
</foo>
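The extra flexibility is easiest to see in code. In this ElementTree sketch any attribute in the identifier namespace counts as an ID, so one element can carry several. The namespace URI is the one used in the example above; the second identifier on bar and the plain attribute are additions made for the illustration.

```python
import xml.etree.ElementTree as ET

ID_NAMESPACE = "http://w3.org/xmlid"  # URI taken from the example above

root = ET.fromstring(
    '<foo xmlns:xmlid="http://w3.org/xmlid">'
    '<bar xmlid:x="abc" xmlid:y="def" plain="no"/>'
    '<baz xmlid:z="hij"/>'
    '</foo>'
)

namespace_ids = {}
prefix = "{%s}" % ID_NAMESPACE
for element in root.iter():
    for name, value in element.attrib.items():
        # ElementTree spells qualified names as "{namespace}local".
        if name.startswith(prefix):
            namespace_ids.setdefault(value, element)

print(sorted(namespace_ids))
```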

xml:idatt

This proposal from James Clark has received the most vocal support so far.

An alternative would be to have an attribute that declares the name of the attribute that is an ID attribute, say xml:idatt. To make this usable, xml:idatt would be inherited. In the typical case where all elements use the same attribute name for an ID, this means that a user has only to add something like xml:idatt="id" or xml:idatt="rdf:ID" to their root element and everything works. You would also need to allow xml:idatt="" to disable inheritance.

Example usage:

<foo xml:idatt="x">
  <bar x="abc"/>
  <baz x="hij"/>
</foo>
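The inheritance rule Clark describes can be sketched as a recursive walk. This is only one reading of the proposal: it handles unprefixed attribute names such as "x" or "id", and leaves out the prefix resolution that a value like "rdf:ID" would need. The function name and the raw and qux elements are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

IDATT = "{http://www.w3.org/XML/1998/namespace}idatt"


def index_with_idatt(element, inherited=None, index=None):
    """Walk the tree, inheriting the ID attribute name from ancestors."""
    if index is None:
        index = {}
    # A local xml:idatt overrides the inherited name; "" switches it off.
    current = element.get(IDATT, inherited)
    if current:
        value = element.get(current)
        if value is not None:
            index.setdefault(value, element)
    for child in element:
        index_with_idatt(child, current, index)
    return index


root = ET.fromstring(
    '<foo xml:idatt="x">'
    '<bar x="abc"/>'
    '<raw xml:idatt=""><baz x="hij"/></raw>'
    '<qux x="klm"/>'
    '</foo>'
)
idatt_index = index_with_idatt(root)
print(sorted(idatt_index))
```

Inside raw, the empty xml:idatt disables inheritance, so the x attribute on baz is ordinary data rather than an identifier.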

Minimal Victories

Not everyone has been convinced that a generic solution is required, and some of them have been looking instead for a minimal victory. David Carlisle proposed revising how XPointer interprets a fragment identifier, so that as well as using the id function, it checks for attributes called "id". This results in a rather scary XPath expression:

id('Chapter12')|/*[not(id('Chapter12'))]/descendant-or-self::*[@*[local-name()='id' and .='Chapter12']][1]
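Read in procedural terms, the expression is less frightening: use the DTD-declared ID if there is one, and otherwise fall back to the first element in document order that has an attribute whose local name is "id" and whose value matches. The Python sketch below is an approximation of that logic, not an XPath engine. ElementTree has no notion of DTD-declared IDs, so the declared_ids argument is a stand-in for that knowledge; it and the function names are assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(qname):
    """Strip ElementTree's "{namespace}" prefix from an attribute name."""
    return qname.rsplit("}", 1)[-1]


def resolve_bare_name(root, name, declared_ids):
    """Return the element a bare-name fragment would select, or None.

    declared_ids stands in for DTD knowledge: a dict mapping ID values
    to elements, empty when no DTD was read.
    """
    if name in declared_ids:
        return declared_ids[name]          # the id('...') branch
    for element in root.iter():            # document order
        for attr, value in element.attrib.items():
            if local_name(attr) == "id" and value == name:
                return element             # the [1] predicate: first match
    return None


root = ET.fromstring(
    '<book><chapter id="Chapter12"/><chapter id="Chapter12"/></book>'
)
found = resolve_bare_name(root, "Chapter12", {})
print(found is root[0])
```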

Elliotte Harold was also of the opinion that the architectural hole could be plugged by a change limited to XPointer alone.

...I issue a new request for a standard xml:target attribute. This would provide a unique name for XPointers to link to. It would have no necessary type. It would have no affect on validity. The documents in which it appears may or may not have DTDs, may or may not be valid, and may or may not declare this attribute with any particular type. Whether such a document was valid would be determined exactly according to the rules of XML 1.0.

A good deal of discussion is likely to be needed before consensus forms. There are some issues related to all proposals presented so far (the Deviant is attempting to maintain a list of these separately), and there are several options for how things might proceed.

Moving Forward

If consensus settles on an XPointer-only fix, then there is likely to be pressure applied by the community to have the specification returned to Working Draft status to resolve this issue. It's disappointing that a specification can get this close to becoming a Recommendation with such a big loophole. Something has failed somewhere. In fact it may turn out that this problem has already been considered and rejected. Without disclosure of Working Group discussion it's difficult to surmise anything.

If the consensus is that a more general solution is required, then the appropriate way forward seems to be for a Note to be submitted to the W3C documenting the proposal with the aim of it becoming a separate Recommendation. Tim Bray has suggested that this would be simpler than re-opening the XML specification.

Michael Champion observed that such a specification might actually be created as a community activity, perhaps with it entering a standardization process after implementation experience.

Something like RDDL is *exactly* what I would like to see here. It's just a nice suggestion for a way forward that tools developers can implement rather than re-inventing the id-ness wheel or leaving it to the application writers. Concrete experience with it in the field may motivate the W3C (or ISO, or whoever comes along and applies the lessons of history to XML 1.0) to put it in a "real" standard someday.

While this discussion may seem slightly esoteric -- it certainly doesn't significantly derail the web service or XML-EDI approaches -- it demonstrates that gaps in the web architecture are still appearing, gaps which clearly point to a need for a group like the TAG, the elections for which end on 28 November.

Those with a keen eye will have also realized that separating out the responsibility for the declaration of identifiers could further attenuate the role of DTDs, taking us another step closer to a more layered architecture.