Elements Revisited
November 28, 2001
This month, we'll start by revisiting a DTD question that came up in September.
Q: How do I enforce a range of occurrences of one element inside another?
Example: a fruit_basket element must contain between 9 and 11 banana elements. I think the following works:
<!ELEMENT fruit_basket (
(banana, banana, banana, banana, banana, banana, banana, banana, banana) |
(banana, banana, banana, banana, banana, banana, banana, banana, banana, banana) |
(banana, banana, banana, banana, banana, banana, banana, banana, banana, banana, banana)
)>
Is there a better way to do this?
A: Replying to this question a couple of months ago, I began with a flippant, "Nope. Frustrating, isn't it?" The balance of the answer focused on the clumsiness of DTDs at handling "range of specific numbers of occurrences" kinds of problems. "Constructing a content model," I wrote, "is even worse for, say, a hypothetical month element type in even a simple calendar application: some months may legally contain 31 days, some 30, and one either 28 or 29, depending on the year."
Within days of that question-and-answer column, I received feedback from two readers (including the always reliable Chris Maden) who provided a concise solution to the nine-to-eleven bananas problem. It looks like this:
<!ELEMENT fruit_basket
(banana, banana, banana, banana,
banana, banana, banana, banana, banana,
(banana,
banana?)?)>
See how this works? The list of nine non-optional banana elements is followed by an optional grouped pair of banana elements, which may occur no more than once; within that grouped pair, the second banana is itself optional (and likewise occurring up to once). This is a slick demonstration of how to use optionality and grouping together to define a range of allowable occurrences in a content model, and I should have at least mentioned that this approach is workable for this case. In the case of my other example -- the variable days-in-a-month example -- the solution would look like the following:
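<!ELEMENT month
(day, day, day, day, day, day, day,
day, day, day, day, day, day, day,
day, day, day, day, day, day, day,
day, day, day, day, day, day, day,
(day, (day, day?)?)?)>
(That is, 28 required day elements, followed by an optional group of up to three more day elements. As in the banana example, each further optional day is nested inside the group begun by the one before it. A flat (day, day?, day?)? would accept the same documents, but a validating parser may reject it as a non-deterministic content model, because a second day in the group could match either of the two optional positions.)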
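Since both declarations follow the same pattern, the content model can be generated rather than typed by hand. What follows is only a rough sketch in Python, not anything from the XML Recommendation: the function names range_content_model and accepts are hypothetical, and the check works by treating the generated content model as a regular expression over a string of placeholder characters, not by running a validating parser.

```python
import re


def range_content_model(name: str, minimum: int, maximum: int) -> str:
    """Build a DTD content model allowing minimum..maximum occurrences of name.

    Required occurrences are listed outright; each optional occurrence is
    nested inside the previous one, as in (banana, banana?)?.
    """
    if not 0 <= minimum <= maximum or maximum == 0:
        raise ValueError("need 0 <= minimum <= maximum and maximum >= 1")
    optional = ""
    for _ in range(maximum - minimum):
        # Wrap what we have so far: x? -> (x, x?)? -> (x, (x, x?)?)?
        optional = f"({name}, {optional})?" if optional else f"{name}?"
    parts = [name] * minimum + ([optional] if optional else [])
    return "(" + ", ".join(parts) + ")"


def accepts(model: str, name: str, count: int) -> bool:
    """Check whether `count` repetitions of `name` satisfy the content model.

    The model is turned into a regular expression by replacing each element
    name with "x" and dropping commas and whitespace.
    """
    pattern = re.sub(rf"\b{re.escape(name)}\b", "x", model)
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.fullmatch(pattern, "x" * count) is not None


model = range_content_model("banana", 9, 11)
print(f"<!ELEMENT fruit_basket {model}>")
print([n for n in range(8, 13) if accepts(model, "banana", n)])  # [9, 10, 11]
```

Calling range_content_model("day", 28, 31) produces the month declaration's content model in the same nested style.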
It's arguable whether this kind of solution is practically useful as the number of occurrences increases beyond a few dozen. But it's undeniably useful for small content models such as these, and I apologize to readers (and the original questioner) for having effectively stolen the specific question in order to make a different, more general point.
Q: Why just those name-start characters?
Why can't element names start with a digit? I know the standard says they can't but why not? Seems that the only forbidden characters should be '>', '/' and white space (which would cause problems parsing out attributes).
A: What an interesting, deceptively simple question. (There are other obvious forbidden characters as well: < and &.) I've consulted eight reference books, from elementary to advanced, as well as Tim Bray's Annotated XML Specification and the XML FAQ, and I have yet to find an answer. So let me go out on a limb here and speculate.
The glib answer is that the XML standard says so because the SGML standards said so; and XML, being more or less a subset of SGML, rarely adds new features, only subtracts them. (There is a relevant exception: SGML element names may start with only a letter, while XML adds the underscore and the colon as valid name start characters.) Of course this merely evades the larger question: why does SGML forbid not only markup-meaningful characters, but also anything else not a letter?
I think the problem with the question lies in falsely equating human simplicity with machine simplicity. Humans just love "anything goes!" rules. But the software which drives machines isn't so lucky. In this case, the software in question is an SGML/XML parser. And the more restrictive the rules placed on the language to be parsed, the easier it is to construct a robust, efficient, bug-free parser. (That's just one of several ways in which XHTML, the XMLized version of HTML, is notably superior to its predecessor: browser vendors have less margin for error -- or for "creativity," if you happen to be a browser vendor -- when the rules are more restrictive.) As an absurd example, a parser which expected all element names to be the same ("element," say) would be simpler to build than an XML parser. A given element name's "correctness" would be a simple binary proposition.
XML and SGML are meant to balance the tensions between the natural human tendency to want to do anything and the machine capacity to do best when doing no more than one thing. The standard achieves this balance, for the most part, by assigning characters to specific classes, which may be legitimately used for specific purposes.
The best example in the XML Recommendation of this classification scheme is in Section 2.3, "Common Syntactic Constructs." Scroll down in that section to productions [4] through [8] (headed "Names and Tokens"), which define such terms as NameChar, Name, and Nmtoken (at least, as those terms are used in this document). Each definition builds upon the definitions of other terms. For instance, to understand the definition of Name you must understand the definition of NameChar, and so on. Here's what production [5] says constitutes a legitimate XML name (element, attribute, and so on):
[5] Name ::= (Letter | '_' | ':') (NameChar)*
That is, an XML name is composed of (a) a letter or an underscore or a colon, followed by (b) any number of the characters defined by production [4] -- the NameChar characters. (Astute readers of production [5] will note an apparent flaw here, by the way: it seems to permit an element to be named simply "_" or ":". It's hard to imagine what a document author or XML vocabulary developer would intend by such an element.) When you run down the definition of Letter, you're led to production [84] and thence to [85] (something called a BaseChar) and [86] (Ideographic). And these definitions in turn provide you with dozens of Unicode ranges representing hundreds of "letters," everything from Western to Chinese and Japanese characters.
I guess everyone will have his preference. Maybe it'd be nice to name an element something like "30DaySpan" or some such. But for my own taste (and, I'd guess, the taste of most others), being able to start an XML name with any one of "only" hundreds of characters will seem a narrow kind of restriction indeed.
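For readers who'd like to try production [5] against a few candidate names, here is a rough Python sketch. It is an approximation only: the real Letter and NameChar classes cover many Unicode ranges, while this version keeps just their ASCII members, and the name is_ascii_xml_name is hypothetical.

```python
import re

# ASCII-only approximation of production [5]:
#   Name ::= (Letter | '_' | ':') (NameChar)*
# Assumption for brevity: only the ASCII members of Letter and NameChar
# are kept (letters, digits, '.', '-', '_', ':'). The Unicode ranges from
# productions [84] through [89] are left out.
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z0-9._:\-]"
ASCII_NAME = re.compile(NAME_START + NAME_CHAR + "*")


def is_ascii_xml_name(candidate: str) -> bool:
    """Return True if candidate matches the ASCII subset of production [5]."""
    return ASCII_NAME.fullmatch(candidate) is not None


for candidate in ["fruit_basket", "DaySpan30", "30DaySpan", "_", ":", "-day"]:
    print(candidate, is_ascii_xml_name(candidate))
```

The first two names pass, as do the odd-looking "_" and ":"; "30DaySpan" and "-day" fail because a digit and a hyphen are name characters but not name start characters.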