Second Coming

May 31, 2000

This week's XML-Deviant reports on a forthcoming revision of the XML 1.0 specification and on the progress of the W3C Schema Working Group.

XML 1.0 Second Edition

As well as contending with the Namespace debate (on which we reported last week), the XML Core Working Group is considering a set of revisions to the XML specification. John Cowan posted a description of the activity to XML-DEV:

There is a list of XML errata at http://www.w3.org/XML/xml-19980210-errata, which is slowly growing as the XML Core WG finds wording problems, infelicities, pointless incompatibilities with SGML, and so on. This material is officially part of the XML Recommendation, but it's hard to use.

XML 1.0, Second Edition, will incorporate the published errata into the text of the XML Recommendation. It is not XML 1.1. It is not considered a substantial change in XML.

When questioned about the naming of the revised specification, Cowan stressed that existing XML processors will not be affected:

A change to the XML version number would make existing 1.0 XML parsers unable to cope. Since we are not making incompatible changes, a new version would be inappropriate.

The Core Working Group has also asked for input from the developer community (specifically parser authors) on two open issues: the handling of illegal characters and the use of fragment identifiers:

Currently the XML Recommendation is silent about the handling of documents that contain "impossible" bytes. For example, the byte 0xFF cannot appear in any UTF-8 encoded document. We are considering making such violations of the encoding a fatal error...

... System identifiers may or may not contain fragment identifiers ... We are considering changing this language to say that "it is an error" to use a fragment identifier.
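A parser author weighing the second issue might start by checking how often fragment identifiers actually turn up in system identifiers. The following is only a minimal sketch of such a check: the helper name `has_fragment` and the example URLs are hypothetical, and it is not taken from any existing parser.

```python
from urllib.parse import urldefrag


def has_fragment(system_id: str) -> bool:
    """Return True if the system identifier carries a non-empty '#fragment'.

    Hypothetical helper for illustration. Note that urldefrag() reports an
    empty fragment for a bare trailing '#', so "doc.dtd#" is not flagged.
    """
    _, fragment = urldefrag(system_id)
    return bool(fragment)


if __name__ == "__main__":
    for sid in ("http://example.org/dtd/doc.dtd",
                "http://example.org/dtd/doc.dtd#intro"):
        print(sid, "->", "fragment" if has_fragment(sid) else "no fragment")
```

Whether a processor should reject, warn about, or silently ignore such identifiers is exactly the question the Working Group is asking; the sketch only detects them.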

Feedback from developers on the proposed changes has so far been positive, particularly on the stricter character handling rules. Kevin Burton pointed out that this would help improve interoperability in the RSS community (RSS is a lightweight syndication format):

The RSS space has XML from many remote sources (currently about 1700 URLs). I have noticed that a significant percentage of these contain illegal characters which just basically break things. I am having to filter out this info with a text processor and then pass it to my XML parser... A good portion of the RSS producers don't know that their XML is flawed even though it might be considered well-formed by most parsers.
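The kind of pre-flight check described here can be sketched in a few lines. The example below is an illustration only, not Burton's actual filter: it assumes the document is meant to be UTF-8 (real feeds may declare other encodings), the function name `check_document` is hypothetical, and the character class is written from the Char production in XML 1.0. It reports "impossible" bytes such as 0xFF first, then any decoded characters that XML 1.0 does not allow.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_document(raw: bytes) -> list:
    """Return a list of problem descriptions (empty if none were found).

    Assumes the input is supposed to be UTF-8; hypothetical helper.
    """
    try:
        # Strict decoding rejects byte sequences that cannot occur in UTF-8.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"invalid UTF-8 byte 0x{raw[exc.start]:02X} at offset {exc.start}"]
    # The bytes decoded cleanly; now look for characters XML 1.0 forbids.
    return [
        f"illegal character U+{ord(m.group()):04X} at index {m.start()}"
        for m in _ILLEGAL_XML_CHARS.finditer(text)
    ]


if __name__ == "__main__":
    print(check_document(b"<a>\xff</a>"))   # an impossible UTF-8 byte
    print(check_document(b"<a>\x0c</a>"))   # form feed, not an XML character
    print(check_document(b"<a>ok</a>"))     # clean
```

A check like this only reports problems. Whether to drop the offending characters or reject the whole feed remains a policy decision for the consumer.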

Rick Jelliffe proposed a more radical alteration to the character handling rules, introducing the concept of "character validity" (a new level of correctness alongside well-formedness and DTD/schema-validity). John Cowan ruled this out, however, on the grounds that it went beyond the goals of XML 1.0 Second Edition.

This revision does raise one interesting question: if XML 1.1 were released, how would backwards compatibility be maintained? Not all documents carry an XML declaration, as it is not required by the specification. Yet it is the declaration that defines the version number:

... this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary.

It is therefore wise to add the XML declaration to all your documents -- including messages exchanged between internal systems. The lack of standard DOM functionality to serialize the tree as XML has led many developers to write their own serialization functions, or to rely on serialization provided as a custom feature of some DOM implementations. It would be worth revisiting homegrown code to ensure that the declaration is being added. A small thing, but one that may bring a little extra future-proofing.
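As an illustration of what such a review might look for, here is a small sketch using Python's `xml.dom.minidom`, whose `toxml()` method writes the declaration itself. The helper names `serialize` and `declared_version` are hypothetical, and the declaration check assumes an ASCII-compatible encoding such as UTF-8; it is a rough audit aid, not a substitute for a parser.

```python
import re
from xml.dom import minidom

# Matches an XML declaration at the very start of the document and captures
# the version value. Assumes an ASCII-compatible encoding (e.g. UTF-8).
_XML_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["']([A-Za-z0-9_.:\-]+)["']"""
)


def declared_version(data: bytes):
    """Return the declared XML version, or None if there is no declaration."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def serialize(doc: minidom.Document) -> bytes:
    """Serialize a DOM tree; minidom's toxml() emits the declaration."""
    return doc.toxml(encoding="utf-8")


if __name__ == "__main__":
    doc = minidom.parseString("<msg>hello</msg>")
    out = serialize(doc)
    print(out)
    print(declared_version(out))        # version taken from the declaration
    print(declared_version(b"<msg/>"))  # None: no declaration present
```

Running `declared_version` over the output of a homegrown serializer is a quick way to find messages that would give a future processor nothing to recognize the version by.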

Schema Progress

The XML Schema Working Drafts are in Last Call status (which will end on the 12th of May), after which they may advance to the Candidate Recommendation stage. Rick Jelliffe pointed out that Last Call is when most serious review occurs:

... I think everyone recognizes that it is easier to respond to a 'final draft' than to a work-in-progress.

Comments gathered so far in response to the Last Call Working Draft have been varied. Matthew Fuchs remarked that simplicity remains the most important issue:

It's already pretty clear that the number 1 complaint we will hear is "why is it so damn complicated". The unstated part is "when it doesn't need to be".

For most developers, the real testing of W3C XML Schemas will begin when schema validation tools become available. This week Henry Thompson announced that a pre-release version of the XML Schema Validator (XSV) is now available:

A pre-release version of XSV with full Unicode support and (consequently) XML-formatted output is now available for friendly testing at: http://cgi.w3.org/cgi-bin/xmlschema-check-new

The Candidate Recommendation stage highlights the need for trial implementations. It is excellent to see an open source validator in the pipeline. Given the input of Henry Thompson, co-editor of the XML Schema Structures specification, XSV is sure to act as an important reference implementation. With development well under way, we can hope to see support for XML Schemas grow sooner rather than later.