Not My Type: Sizing Up W3C XML Schema Primitives
July 31, 2002
Continuing our occasional series of opinion pieces from members of the XML community, Amy Lewis takes a hard look at W3C XML Schema datatypes.
Since the application of XML to data representation first gained public visibility, there has been a movement to enhance its type system beyond that originally provided by DTD. Several attempts were made (SOX, XML Data and XML Data Reduced, Datatypes for DTDs, and others) before the W3C handed the problem to the XML Schema Working Group.
What is the goal of data type definitions for XML? For one thing, such definitions establish "strong typing" in XML in a fashion that corresponds with strong typing in programming languages. Various commercial interests have been vocal supporters of strong typing in XML because they see typed generic data representation as their best hope for interoperability and increased automation. With typing in schemas extended into the textual content of simple types, and not just the structural content of complex types, businesses can enforce contracts for data exchange. In other words, strong typing enables electronic commerce.
To phrase it a little differently, the data types defined in DTDs were considered inadequate to support the requirements of electronic commerce or, more generally, of commercially reliable electronic information exchange.
The publication of W3C XML Schema (or WXS), in which one half of the specification was devoted to the definition of a type library (part two), seemed to resolve the problem. Certainly, with forty-four built-in data types, nineteen of them primitive, it seemed at first glance to cover the field. The increasing visibility of WXS and the efforts to layer additional specifications on top of it -- XML Query, the PSVI, data types in XPath 2.0, typing in web services -- have begun to raise serious questions about WXS part two, even among proponents of strong types, including the author of this article.
There are two fundamental problems with WXS datatyping. The first is its design: it's not a type system -- there is no system -- and not even a type collection. Rather, it's a collection of collections of types with no coherent or consistent set of interrelations. The second problem is a single sentence in the specification: "Primitive datatypes can only be added by revisions to this specification". This sentence exists because of the design problem; lacking a concept for what a primitive data type is, the only way to define new types is by appeal to authority. The data type library is wholly inextensible, internally inconsistent, and both bloated and incomplete for most application domains.
Not a type system
The data type library defined in WXS part two is not a type system. It's not possible to examine the built-in types and determine the guiding principles which dictated which types were to be defined and which were to be defined as primitives.
Consider a contrasting example. The type system used by C and related languages is clearly based on bit patterns and register sizes. The bit pattern 10011001 fits into registers of a certain size, but has different meaning based on its type: character, unsigned or signed byte. The type assigned to a bit pattern determines certain behaviors. If the above pattern is X, and Y is the bit pattern 00010001, then X > Y if both are unsigned bytes, and X < Y if both are signed bytes. The same bit patterns may represent character (or strings of characters), integers of various sizes, and floating point numbers (again with various constraints), but the fundamental limitation is the number of bits that can be stuffed into a register. By interpreting the identical bits in different fashions, the languages achieve different effects.
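The comparison above can be checked with a small sketch. Python is used here purely for illustration (the article's point concerns C-style types), and the helper names `as_unsigned` and `as_signed` are made up for this example; the standard `struct` module reinterprets the same eight bits under each type.

```python
import struct

# The two bit patterns from the text.
X = 0b10011001
Y = 0b00010001

def as_unsigned(bits: int) -> int:
    """Read one byte as an unsigned 8-bit integer (C's unsigned char)."""
    return struct.unpack("B", bytes([bits]))[0]

def as_signed(bits: int) -> int:
    """Read the same byte as a two's-complement signed 8-bit integer."""
    return struct.unpack("b", bytes([bits]))[0]

print(as_unsigned(X), as_unsigned(Y))  # 153 17, so X > Y
print(as_signed(X), as_signed(Y))      # -103 17, so X < Y
```

The bits never change; only the type, and therefore the ordering, does.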
One mandate for WXS was that it should reproduce the limited type system of the DTD plus the namespace extensions. It stands to reason that, given the definition of QName and NCName in the namespaces specification and Name in the original XML 1.0 specification, these types would be found in some rational relationship to one another. In the WXS definition, NCName is a subtype of Name, which is a subtype of token, which is a subtype of normalizedString, which is a subtype of string, which is a primitive type. However, QName is also a primitive type, implying that it is not a string, not a normalizedString, not a token, and not a name, even though it is composed lexically of NCName + : + NCName.
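To make the lexical relationship concrete, here is a rough sketch of QName expressed as a pattern over NCName. It assumes a deliberately simplified, ASCII-only approximation of the NCName production (the real one admits a wide range of Unicode name characters), and it follows the namespaces specification in treating the prefix as optional. The names `NCNAME`, `QNAME_RE`, and `is_qname` are illustrative only.

```python
import re

# Simplified, ASCII-only stand-in for the NCName production (an assumption;
# the real production allows many non-ASCII name characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"

# QName: an optional NCName prefix and a colon, followed by an NCName.
QNAME_RE = re.compile(rf"(?:{NCNAME}:)?{NCNAME}")

def is_qname(text: str) -> bool:
    """Return True if text is lexically a QName under the simplified pattern."""
    return QNAME_RE.fullmatch(text) is not None

print(is_qname("xs:string"))  # True
print(is_qname("a:b:c"))      # False
```

If the lexical form can be stated this way, as a string constrained by a pattern built from NCName, it is hard to see what makes the type primitive.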
WXS also represents numbers of various sorts. Given the requirement to support decimal, integer, float, and double, which should be considered primitive types, and which derived? What criterion should be used for derivation? Your answer should allow for the further derivation of various bounded-range integers, but needn't worry about number systems solely of interest to fusty ivory-tower academics. Data typing isn't particularly useful in science, of course.
Nine times too many
Why is anyURI a primitive type? Why are there nine separate and unrelated primitive types all concerned with measurement of time? Even though early drafts of WXS included three time instant measuring types (dateTime, date, and time, which are not, despite lexical and conceptual overlap, related to one another by derivation in WXS), in the last stages of specification drafting one or more interested constituencies raised such a fuss that five more time instant measuring types were added. Despite lexical and conceptual overlap, all five were made primitive types, unrelated to one another by derivation. Clearly, the committee was too exhausted to fight about it any more, so gHorribleKludge (gYearMonth, gYear, gMonthDay, gMonth, gDay...the "g" stands for "Gregorian," not "good") made it into the specification.
At least three constituencies are easily identifiable with type subcollections in WXS: the original XML/DTD collection (rooted at string, and one of two derivation trees, plus unrelated primitives); the strongly-typed programming language collection (rooted at decimal, and the other derivation tree, plus unrelated primitives); and the database collection (mostly available in the strongly typed tree, plus the time instant primitives, and assorted others). Why are the chosen primitives primitive? Why aren't base64Binary and hexBinary related? Why aren't float and double related to each other or to the rest of the numbers? Certainly if derivation in the integer tree can proceed based on register size (which it does), then one ought to be able to derive float from double. Isn't anyURI a token? No? normalizedString? No? Not even a string?
No? Really, all these date and time thingies don't have any relation to one another at all? No. There's no method to this madness. There is no way to guess whether a particular built-in type will be declared primitive or derived from another type. Nor is there any apparent value to derivation of built-in types, since validity according to the least-derived type does not guarantee validity according to the most-derived type.
Given that there are so many types defined by WXS part two, everyone ought to be happy. Right? Well, everyone except scientists or anyone else who might want things like complex numbers, rational numbers, even imaginary numbers, or particular precision. But we've already agreed that academics don't need data types. Real applications are all handled. I won't have any trouble representing an ISBN or a credit card with its embedded check digits. Lisp programmers can use rational numbers. The boolean type can express true and false in any language. Certainly I can specify that a node has type XPath. Yes?
No.
Well, perhaps this is because there is a strong, conscious attempt to keep the number of primitive types to a minimum. That's why there are only eight time instant datatypes, and... Let's not continue down that path. It leads nowhere; there was no attempt to keep the number of types to a minimum. Therefore, a scientific computing application must handle the possibility of the declaration of a NOTATION type, or NMTOKENS, or language. It does not matter that the problem domain does not need these data types. They are defined, so the application had better be prepared to cope with them.
Semi-structured types
A further problem lies in the inclusion of semi-structured types. Almost all of the time instant types have this portmanteau characteristic; a type with a name of the pattern floorwaxDessertTopping should alert the reader to an imminent experience, live and from New York. Even lists (the simplest of non-simple data types) are potentially problematic. If XML preserved the markup minimization feature of SGML, lists would be utterly superfluous. As it stands, the locus of validation is each component of the list, not the list as a whole. Instead of tags supplying context for simple content validation, position does so. And it does the same for other semi-structured types, most notably dates.
If a structural schema definition language happened to include support for co-occurrence constraints, it's quite likely that no one would need to demand portmanteau gDayYear-style types. "Thirty days hath September ..." is a children's rhyme, and a mnemonic, but it is also an algorithm. The second month may only have twenty-eight days, unless the year is evenly divisible by four, except when the year is evenly divisible by one hundred and is not evenly divisible by four hundred. A language supporting co-occurrence constraints could say "gBye" to semi-structured types.
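The rhyme really is an algorithm, and a short one. The following sketch assumes the year, month, and day arrive as three separately typed integers (as they might from three child elements) and checks the co-occurrence constraint among them; the function names are invented for illustration.

```python
def is_leap_year(year: int) -> bool:
    """Divisible by four, except century years not divisible by four hundred."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def days_in_month(year: int, month: int) -> int:
    """Thirty days hath September, April, June, and November..."""
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31

def is_valid_date(year: int, month: int, day: int) -> bool:
    """The co-occurrence constraint: the legal range of day depends on month and year."""
    return 1 <= month <= 12 and 1 <= day <= days_in_month(year, month)

print(is_valid_date(2000, 2, 29))  # True
print(is_valid_date(1900, 2, 29))  # False
```

Each component stays a plain integer; only the relationship between them needs stating.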
Nothing can be done to fix WXS until the single sentence -- "Primitive datatypes can only be added by revisions to this specification" -- is fixed. On the other hand, it says nothing about removing types, so perhaps we can clean it up after all.
Getting it right
"If you can't say something nice, don't say anything at all". Good advice from my mother to me, and all of the foregoing has been not only completely destructive criticism, but has been, in places, offensively phrased as well, and I personally know some of the current and former members of the XML Schema Working Group, so I may well suffer for it. Taking that advice, I will say something nice. Admittedly, I'm going to say it about Relax NG, but you can't have everything.
The Relax NG specification does not resolve data typing problems. However, it took the separation of focus in the XML Schema specification and made it more robust and more flexible. In Relax NG, any data type library may be used. Definition of a data type library is not supplied, except by reference to XML Schema part two, and by definition of a minimal type library (string and token).
However, an effort outside the OASIS technical committee has established data type library interfaces suitable for use with Relax NG validators. These interface definitions (available from SourceForge, for several target languages) are extremely valuable in refining the concept of a data type in XML.
The goal, stated previously, was loosely to enable computer-to-computer interactions with strong typing in XML. What does it mean to define a data type in XML? Clearly, from the point of view taken here, in WXS part two, and in the interface specification for type validation in RNG, we are discussing the typing of "simple" types. That is, types of nodes that contain textual content, rather than or in addition to element content. Attributes may have types, and the ephemeral "text node" children of elements may have types, and these "simple types" are what we are concerned with.
What is a data type in XML? Four answers: a string; something that can be expressed as a string, following certain constraints; something that can be validated by a specified algorithm; or something that corresponds to a simple concept in my problem domain.
In XML, everything is a string. Since XML contains text, everything is, by definition, expressible as a string. If we take this as fundamental, then every type in XML is simply a string with certain patterning constraints. There are some problems with this concept, but it's useful to keep in mind. If every data type in XML must be derived from string, then there is no need for anySimpleType, either. Any simple type is a string. Relax NG enhances this notion to add token, relying upon the potentially special treatment that an XML parser can give to whitespace. But a token is just a kind of string, one whose whitespace has been normalized: no leading or trailing whitespace, and no internal runs of it.
Almost directly opposed to that interpretation, however, is the last interpretation expressed above. From the perspective of the programmers working on the system or of the salesmen generating the information stored in it, the total amount of a sale entered into a receipts-payable XML representation is not a string, it's a number. It's a big number, and it means money and success. And the date is not a string, it's an anchor in time and payroll had better cough up commission within thirty days or else. Using the "problem domain" definition, we can outline a rough set of commonly used basic (another word for "primitive") types: string, number, control code (like credit card or zip code or SKU; a semi-structured type), time instant (and duration?), truth, encoded data (uuencode, base64, binhex; opaque types without meaning except for a decoding algorithm), and reference (which preserves something that should be preserved from DTD: the notion that one can make references within a document to other places in the document or possibly outside it, roughly equivalent to the programming language notion of a pointer). That's more than the simplest possible definition (string), but significantly fewer than WXS's nineteen: seven types.
The second definition above leads us into an interesting question: what role does derivation play in a type system? Is it significant that WXS defines a Name as a token, which is a normalizedString, which is a string? Certainly derivation patterns can help to conceptualize a bag of types, but is it useful otherwise? It is to the folks creating contracts based on schema, probably. If the data type library says that a float is not a decimal number, then it isn't, and it isn't valid to treat it so. If a number is a number and a decimal number is just the full set of fixed-point and floating-point numbers, then a float is a decimal, and the derivation described in the library says it can be treated so. More to the point, if the library specifies that a decimal is not a string, then it isn't. The question is largely one of whether the derivation pattern should reflect the conceptual universe of the domain or the conceptual universe of something else -- people who work with databases, for instance, are likely to want to have a set of derivations that reflect the hierarchy and available constraints defined for SQL (or some dialect of SQL), not the patterns commonly adopted by languages that depend upon register size.
The prize behind door number three is really terrific, though. It fits very nicely with WXS part one and with Relax NG. A data type is an expression of an algorithm for validating a "text" or "simple-type" node. Better still, this definition accords with the requirements of the users, with the problem domain definition. It is not difficult to describe an algorithm for validating credit cards (this is validation, not verification, mind), or IP addresses, or numbers, or packed-format semi-structured dates. Moreover, by specifying that a data type definition is associated with a validation algorithm, we fulfill what appears to be the ultimate goal of introducing data types to XML: we make it possible to specify a contract, and to determine whether a particular instance conforms to the contract (with the reservation that xsi:type is a contract-breaking scab).
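As a minimal sketch of "a data type is a validation algorithm", consider a credit card number checked with the well-known Luhn check-digit formula. The type name `creditCard`, the `VALIDATORS` table, and the `validate` function are all hypothetical, invented here for illustration; nothing like them appears in WXS or Relax NG.

```python
def luhn_valid(text: str) -> bool:
    """Validate (not verify) a card number using the Luhn check-digit formula."""
    digits = text.replace(" ", "").replace("-", "")
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# A hypothetical type library: each type name maps to its validation algorithm.
VALIDATORS = {"creditCard": luhn_valid}

def validate(type_name: str, node_text: str) -> bool:
    """Check the text of a simple-type node against the named data type."""
    return VALIDATORS[type_name](node_text)

print(validate("creditCard", "4111 1111 1111 1111"))  # True
print(validate("creditCard", "4111 1111 1111 1112"))  # False
```

Adding a type to such a library means adding an entry to a table, not revising a specification.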
Recommendations
Now we've achieved something. A data type, in XML, is associated with an algorithm to validate the content of a simple-type node (a simple-type element, a component of a mixed-type element in Relax NG, or an attribute). If that's the case, then we can make some recommendations for coping with data types in XML.
First, specifically for the XML Schema Working Group, dump "the sentence". Replace it with a mechanism for specifying type libraries in a schema definition.
Second, in the Schema WG or somewhere else, create a language for describing data type libraries. The Relax NG effort is admirable but is tied to the programming languages for which its interfaces have been implemented. WXS does not provide enough richness to describe data type libraries; its "facets" are a limited universe of available algorithms. The universe of available algorithms ought to be arbitrarily large; ideally the data type definition language would be Turing complete within the algorithm description portion.
Third, either in the WG or elsewhere, establish a registry for data type libraries. An ideal registry would require the data type library definition (in the language created above), any additional (but necessarily non-normative) documentation, at least one working implementation of the library, and a test suite to determine compliance with the library for additional implementations.