Document Associations

January 30, 2002

This week the XML-Deviant attempts to disentangle the threads of a number of tightly woven discussions that have taken place on XML-DEV recently. The general theme of these discussions is how one associates processing with an XML document.

On the surface this may seem like a simple problem, but there are a number of issues that expose some weak points in the XML architecture. Actually, in circumstances where you are exchanging XML between tightly coupled systems, there are very few issues, beyond the usual systems integration gotchas. The difficulties begin to arise in circumstances that make the most of the advantages that markup provides. In these loosely coupled environments data may be subjected to processing and constraints unforeseen by the original information provider. Data tends to take on a life of its own, separate from the producing and consuming systems.

Processing in this context may involve dispatching to a component capable of processing the data or converting it into a known vocabulary so that it can be manipulated further. This may require some degree of resource discovery to find the correct components, schemas, and transformations required to carry out the task in hand. This is true for both generic processing architectures and specific applications, but in the latter case the resources may be pre-packaged and immediately available rather than dynamically acquired.

James Clark describes validation as a very specific example of associating processing with a document. The schema used by a document author may not be the same as that used by the consumer of the document, assuming they use one at all. The author and consumer may require different constraints and may use different, or a combination of, schema languages to apply them. Clark argues that a general mechanism for describing processing is required. This might be achieved by in-document indicators or by an external association defined by the processor.

Anchor points within a document to which resources and processing can be attached include the MIME type, namespaces, and the document type. The interplay between the first two of these mechanisms was the subject of last week's column. This week's column will focus on the latter two and will illustrate some of the complexities highlighted in the recent debate.
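As a rough illustration of using a namespace as an anchor point, the sketch below picks a processing step from the namespace of a document's root element. The handler names and the lookup table are hypothetical and exist only for this example; a real consumer would consult whatever registry it maintains.

```python
import xml.etree.ElementTree as ET


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if unqualified."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


# Hypothetical consumer-side table: namespace URI -> name of a processing step.
HANDLERS = {
    "http://www.w3.org/1999/xhtml": "render-as-html",
    "http://www.w3.org/2000/svg": "render-as-graphic",
}


def choose_processing(xml_text, default="treat-as-generic-xml"):
    """Pick a processing step from the root namespace, falling back to a default."""
    return HANDLERS.get(root_namespace(xml_text), default)


doc = '<svg xmlns="http://www.w3.org/2000/svg"><title>demo</title></svg>'
print(choose_processing(doc))
```

Keying on the root element alone is a deliberate simplification: it says nothing about documents that mix elements from several namespaces, which is exactly where the debate becomes difficult.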

RDDL and Namespaces

A heated exchange raged across XML-DEV recently concerning RDDL, which provides a means to associate a directory of resources with an XML namespace. It took a lot of flames to boil down the issues to their core, at which point it became clear that most of the disagreement was about the utility of associating resources with a namespace rather than a document. In other words, while RDDL was defined as a means to answer the question, "what is at the end of a namespace?", many wanted a mechanism to associate resources at the document level. Ronald Bourret summed up this difference in granularity with his TV metaphor. Bourret also accurately diagnosed the original source of confusion:

I think one of the things that was confusing me was that when people defined the purpose of RDDL, they said that it was to provide information / resources about the elements in a particular namespace. But when they gave examples, it was always with respect to an instance document that might have elements from multiple namespaces.
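To make the namespace-level view concrete, here is a minimal sketch of how a consumer might pick resources out of a RDDL directory retrieved from a namespace URI. It assumes the RDDL convention of rddl:resource elements carrying xlink:role (the nature of the resource), xlink:arcrole (its purpose), and xlink:href. The sample directory, its example.org URLs, and the "to-html" purpose are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDDL_NS = "http://www.rddl.org/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Invented sample directory; every example.org URL here is hypothetical.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rddl="http://www.rddl.org/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <rddl:resource xlink:role="http://www.w3.org/2001/XMLSchema"
                   xlink:arcrole="http://www.rddl.org/purposes#schema-validation"
                   xlink:href="http://example.org/ns/orders.xsd"/>
    <rddl:resource xlink:role="http://www.w3.org/1999/XSL/Transform"
                   xlink:arcrole="http://example.org/purposes#to-html"
                   xlink:href="http://example.org/ns/orders.xsl"/>
  </body>
</html>"""


def find_resources(rddl_text, nature=None, purpose=None):
    """Return hrefs of resources matching the given nature and/or purpose."""
    root = ET.fromstring(rddl_text)
    hits = []
    for res in root.iter("{%s}resource" % RDDL_NS):
        if nature is not None and res.get("{%s}role" % XLINK_NS) != nature:
            continue
        if purpose is not None and res.get("{%s}arcrole" % XLINK_NS) != purpose:
            continue
        hits.append(res.get("{%s}href" % XLINK_NS))
    return hits


print(find_resources(SAMPLE, nature="http://www.w3.org/2001/XMLSchema"))
```

Note that the lookup starts from a namespace, not from an instance document, which is the granularity mismatch Bourret describes.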

This confusion led some to conclude that RDDL was somehow broken and that an alternate mechanism was needed. However, RDDL is generic enough to be applied to both tasks, ultimately leading Jonathan Borden to suggest a means to associate a resource directory with an XML document using either a namespaced attribute or a Processing Instruction. A quick tally of opinion suggests that the latter was preferable to many, Tim Bray being the notable exception. Bray also warned against attempting to define too much too early:

I'm probably -1 on the whole thing, because I don't think we have enough experience yet to know what information is going to be useful in picking apart and using namespace-compound documents. TimBL is arguing very cogently that the namespace of the root element is the largest single factor in determining what the doc is all about and how it's going to be processed...

There's probably a good idea lurking in here somewhere, but I don't think we're really ready to write the rules down yet.
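For readers wondering what the Processing Instruction option might look like from the consumer's side, the following is a sketch only. The PI target "resource-directory" and its href pseudo-attribute are hypothetical names chosen for this example, not the syntax proposed on XML-DEV, and the example.org URL is invented.

```python
import re
from xml.dom import minidom

# Hypothetical PI target; the list discussion did not settle on a name.
PI_TARGET = "resource-directory"

# Matches pseudo-attributes of the form name="value" inside PI data.
PSEUDO_ATTR = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')

DOC = (
    '<?xml version="1.0"?>\n'
    '<?resource-directory href="http://example.org/resources/invoice.html"?>\n'
    "<invoice/>"
)


def resource_directories(xml_text):
    """Return href values from document-level PIs with the assumed target."""
    dom = minidom.parseString(xml_text)
    hrefs = []
    # PIs in the prolog are direct children of the document node.
    for node in dom.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == PI_TARGET):
            attrs = dict(PSEUDO_ATTR.findall(node.data))
            if "href" in attrs:
                hrefs.append(attrs["href"])
    return hrefs


print(resource_directories(DOC))
```

A PI of this kind sits outside the element tree, so it associates resources with the document as a whole without touching any vocabulary's content model. A namespaced attribute on the root element would carry the same information inside the tree.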

So while there may be a general consensus that such a mechanism is both desirable and necessary, there's still no agreement on the best approach. For example, Michael Brennan had previously suggested a mechanism that could generalize things further by using extended XLinks. Rick Jelliffe believed that an approach based on packaging XML applications was a richer solution.

We need to move beyond document types to distributable, extensible!, identifiable (and, sure, web-locatable), system-integrator-friendly "XML Applications".

Typing and Architectural Forms

Moving beyond resource associations, the discussion also touched on the general issue of document typing, specifically the relationships between document types, namespaces, and schemas. Rick Jelliffe explained that there isn't a 1:1 relationship between namespaces and schemas:

...a name in a namespace does not always have a 1:1 association with a particular schema definition. Similarly, the elements in a whole namespace may be used in different ways by different schemas which use elements from the namespace.

But often there will be one general or typical schema for a namespace. Yet variants can be expected over time due to maintenance, etc.

So a namespace may be a set, but that does not mean an element in a particular namespace will always have the same content model etc.

Jonathan Borden also demonstrated, using a "schema algebra", that a document can have many types, and also that it's wrong to equate namespaces and document types. Borden said that the main issue is that a replacement for DOCTYPE is required which is agnostic to the particular schema language used.

All of this boils down to many-to-many associations between namespaces, schemas, and document types. A particular instance may itself take on different types according to how it's interpreted by the user. This seems to be the central message and is a way to understand the theoretical arguments. Semantics are entirely local and are defined by the particular processing context into which the data is fed. The tightly coupled XML exchange mentioned previously becomes a special case. In this circumstance the producer and consumer agree precisely on how a document should be interpreted. It's important for a producer to be able to assert that data is suitable for processing in a certain way, but the consumer is free to disregard this. This echoes Clark's argument that a general mechanism for associating processing is required and lends weight to Gavin Nicol's assertion that this should be defined separately from the instance. If it's defined separately, then the consumer of a document can override it.
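The precedence described above, where the producer may assert an association but the consumer is free to disregard it, can be sketched in a few lines. The function name, the rule table, and the example.org schema locations are all hypothetical; the point is only the order in which the two sources are consulted.

```python
def select_schemas(root_ns, producer_hints=None, consumer_rules=None):
    """Return the schemas a consumer will apply to a document.

    producer_hints: schemas the producer asserted for this document.
    consumer_rules: consumer-side table, namespace URI -> list of schemas.
    A consumer rule, when present, overrides whatever the producer asserted.
    """
    rules = consumer_rules or {}
    if root_ns in rules:
        return list(rules[root_ns])
    return list(producer_hints or [])


# Hypothetical data: one namespace, different constraints on each side.
ORDERS_NS = "http://example.org/ns/orders"
hints = ["http://example.org/ns/orders.xsd"]
rules = {ORDERS_NS: ["local/orders-strict.rng", "local/orders-rules.sch"]}

print(select_schemas(ORDERS_NS, producer_hints=hints))
print(select_schemas(ORDERS_NS, producer_hints=hints, consumer_rules=rules))
```

Because the result is a list, the same document can be checked against several sets of constraints at once, reflecting the many-to-many relationship between instances and types.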

Steven Newcomb argued that Architectural Forms is a natural fit in this kind of environment, allowing a document to assert that it conforms to a variety of constraints.

Someday we'll wake up and realize that, from an information management-and-interchange perspective, it's very, very useful for an element to declare that it's an instance of multiple element types, and to be able to invoke full syntactic validation of such instances against all their classes, in syntactic space, including both context and content. Anything less is suboptimal as a basis for flexible, mix-and-match information interchange via XML, among people who want to cooperate with each other, but who have endlessly specialized local requirements. Architectural forms, anyone?

Whether Architectural Forms will be successfully dug out from the HyTime infrastructure remains to be seen. John Cowan certainly seems interested in exploring the possibilities. Unfortunately there are no easy answers at the end of this discussion. For the most part it appears to be scene setting for a large amount of work still to be undertaken. This is a recurrent New Year theme on XML-DEV, according to Len Bullard. To hazard some predictions, it seems likely that the pipeline meme that's been circulating recently will continue to do so, and that the ISO DSDL work will provide some key solutions in this area.