Elements Revisited

November 28, 2001

This month, we'll start by revisiting a DTD question that came up in September.

Q: How do I enforce a range of occurrences of one element inside another?

Example: a fruit_basket element must contain between 9 and 11 banana elements. I think the following works:

<!ELEMENT fruit_basket ( (banana, banana, banana, banana, banana, banana, banana, banana, banana) | (banana, banana, banana, banana, banana, banana, banana, banana, banana, banana) | (banana, banana, banana, banana, banana, banana, banana, banana, banana, banana, banana) )>

Is there a better way to do this?

A: Replying to this question a couple of months ago, I began with a flippant, "Nope. Frustrating, isn't it?" The balance of the answer focused on the clumsiness of DTDs at handling "range of specific numbers of occurrences" kinds of problems. "Constructing a content model," I wrote, "is even worse for, say, a hypothetical month element type in even a simple calendar application: some months may legally contain 31 days, some 30, and one either 28 or 29, depending on the year."

Within days of that question-and-answer column, I received feedback from two readers (including the always reliable Chris Maden) who provided a concise solution to the nine-to-eleven bananas problem. It looks like this:

<!ELEMENT fruit_basket (banana, banana, banana, banana, banana, banana, banana, banana, banana, (banana, banana?)?)>

See how this works? The list of nine non-optional banana elements is followed by an optional grouped pair of banana elements, which may occur no more than once; within that grouped pair, the second banana is itself optional (and likewise occurring up to once). This is a slick demonstration of how to use optionality and grouping together to define a range of allowable occurrences in a content model, and I should have at least mentioned that this approach is workable for this case. In the case of my other example -- the variable days-in-a-month example -- the solution would look like the following:

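<!ELEMENT month (day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, day, (day, (day, day?)?)?)>

(That is, 28 required day elements, followed by an optional group of up to three days in which only the first is required; each further day sits in its own nested optional group. Writing the tail as (day, day?, day?)? would accept the same documents, but it isn't deterministic: a parser that has just seen the 29th day can't tell which of the two optional days the next one matches, and XML 1.0 treats a content model like that as an error.)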
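
The nesting pattern behind these declarations is mechanical enough to generate. Here is a minimal Python sketch, assuming simple element names made of letters and underscores. The helper names ranged_content_model and accepts are hypothetical, invented for this illustration, and accepts is only a rough regular-expression translation of the content model, not a DTD validator.

```python
import re


def ranged_content_model(name: str, minimum: int, maximum: int) -> str:
    """Build a DTD content model allowing minimum..maximum occurrences of name."""
    if minimum < 0 or maximum < minimum or maximum == 0:
        raise ValueError("need 0 <= minimum <= maximum and maximum >= 1")
    # Build the optional tail from the inside out, so that each extra
    # occurrence is nested inside the previous one: x?, (x, x?)?, ...
    tail = ""
    for _ in range(maximum - minimum):
        tail = f"({name}, {tail})?" if tail else f"{name}?"
    parts = [name] * minimum
    if tail:
        parts.append(tail)
    return "(" + ", ".join(parts) + ")"


def accepts(model: str, name: str, count: int) -> bool:
    """Rough check: does the model allow exactly `count` children called name?"""
    # Drop the sequence commas and turn each element name into a regex token.
    pattern = model.replace(", ", "").replace(name, f"(?:{name} )")
    return re.fullmatch(pattern, f"{name} " * count) is not None


if __name__ == "__main__":
    model = ranged_content_model("banana", 9, 11)
    print(f"<!ELEMENT fruit_basket {model}>")
    print([n for n in range(15) if accepts(model, "banana", n)])
```

Called with ("banana", 9, 11), the generator produces the nine-to-eleven content model shown earlier; called with ("day", 28, 31), it applies the same nesting to the month case.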

It's arguable whether this kind of solution is practically useful as the number of occurrences increases beyond a few dozen. But it's undeniably useful for small content models such as these, and I apologize to readers (and the original questioner) for having effectively stolen the specific question in order to make a different, more general point.

Q: Why just those name-start characters?

Why can't element names start with a digit? I know the standard says they can't but why not? Seems that the only forbidden characters should be '>', '/' and white space (which would cause problems parsing out attributes).

A: What an interesting, deceptively simple question. (There are other obvious forbidden characters as well: < and &.) I've consulted eight reference books, from elementary to advanced, as well as Tim Bray's Annotated XML Specification and the XML FAQ, and I have yet to find an answer. So let me go out on a limb here and speculate.

The glib answer is that the XML standard says so because the SGML standard said so; and XML, being more or less a subset of SGML, rarely adds new features, only subtracts them. (There is a relevant exception: SGML element names may start with only a letter, while XML adds the underscore and the colon as valid name start characters.) Of course this merely evades the larger question: why does SGML forbid not only markup-meaningful characters, but also anything else not a letter?

I think the problem with the question lies in falsely equating human simplicity with machine simplicity. Humans just love "anything goes!" rules. But the software which drives machines isn't so lucky. In this case, the software in question is an SGML/XML parser. And the more restrictive the rules placed on the language to be parsed, the easier it is to construct a robust, efficient, bug-free parser. (That's just one of several ways in which XHTML, the XMLized version of HTML, is notably superior to its predecessor: browser vendors have less margin for error -- or for "creativity," if you happen to be a browser vendor -- when the rules are more restrictive.) As an absurd example, a parser which expected all element names to be the same ("element," say) would be simpler to build than an XML parser. A given element name's "correctness" would be a simple binary proposition.

XML and SGML are meant to balance the tensions between the natural human tendency to want to do anything and the machine capacity to do best when doing no more than one thing. The standard achieves this balance, for the most part, by assigning characters to specific classes, which may be legitimately used for specific purposes.

The best example in the XML Recommendation of this classification scheme is in Section 2.3, "Common Syntactic Constructs." Scroll down in that section to productions [4] through [8] (headed "Names and Tokens"), which define such terms as NameChar, Name, and Nmtoken (at least, as those terms are used in this document). Each definition builds upon the definitions of other terms. For instance, to understand the definition of Name you must understand the definition of NameChar, and so on. Here's what production [5] says constitutes a legitimate XML name (element, attribute, and so on):

[5] Name ::= (Letter | '_' | ':') (NameChar)*

That is, an XML name is composed of (a) a letter or an underscore or a colon, followed by (b) any number of the characters defined by production [4] -- the NameChar characters. (Astute readers of production [5] will note an apparent flaw here, by the way: it seems to permit an element to be named simply "_" or ":". It's hard to imagine what a document author or XML vocabulary developer would intend by such an element.) When you run down the definition of Letter, you're led to production [84] and thence to [85] (something called a BaseChar) and [86] (Ideographic). And these definitions in turn provide you with dozens of Unicode ranges representing hundreds of "letters," everything from Western to Chinese and Japanese characters.
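
To make production [5] concrete, here is a rough Python sketch of a name check. It is an approximation under stated assumptions, not the Recommendation's definition: Letter is approximated by Python's str.isalpha(), Digit by str.isdecimal(), and the CombiningChar and Extender classes by Unicode mark categories, so edge cases will differ from the exact ranges listed in productions [84] through [89]. The function names are hypothetical.

```python
import unicodedata


def is_name_start(ch: str) -> bool:
    """Approximates (Letter | '_' | ':') from production [5]."""
    return ch in "_:" or ch.isalpha()


def is_name_char(ch: str) -> bool:
    """Approximates NameChar from production [4]."""
    return (
        is_name_start(ch)
        or ch.isdecimal()  # stand-in for Digit
        or ch in ".-"
        # Stand-in for CombiningChar; Extender is only partly covered,
        # through the modifier letters that isalpha() already accepts.
        or unicodedata.category(ch) in ("Mn", "Mc", "Me")
    )


def is_xml_name(name: str) -> bool:
    """True if name looks like a Name: one start character, then NameChars."""
    return (
        bool(name)
        and is_name_start(name[0])
        and all(is_name_char(ch) for ch in name[1:])
    )


if __name__ == "__main__":
    for candidate in ["fruit_basket", "_", ":", "30DaySpan", "Span30Day"]:
        print(candidate, is_xml_name(candidate))
```

Under these assumptions, a name such as 30DaySpan is rejected because of its leading digit, Span30Day is accepted, and the odd single-character names "_" and ":" pass just as the production allows.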

I guess everyone will have their own preference. Maybe it'd be nice to name an element something like "30DaySpan" or some such. But for my own taste (and, I'd guess, the taste of most others), being able to start an XML name with any one of "only" hundreds of characters hardly seems like much of a restriction.