The Naming of Parts
July 25, 2001
Q: What are the rules for a valid XML element name?
I'm thinking, for example, of a rule like "an element name must begin with a letter (alphabet) and can be followed by alphanumeric characters." Are any special characters (like -, _, #, @, etc.) allowed in the name? Where can I find the specification that defines these rules?
A: You're actually pretty close to the real rules.
To begin to understand these official rules, you'll want to check the W3C Web site for the XML Recommendation itself. (The current version is the "second edition" of XML 1.0, basically unchanged from the version originally published in 1998, except for clearing up some ambiguities.)
A clear, useful description of what's allowed in an XML element (or other) name can be found in section 2.3, "Common Syntactic Constructs."
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.
This definition (like most of the XML Recommendation's definitions) is formally expressed in Extended Backus-Naur Form (EBNF) notation just a couple of lines further down. The EBNF of Name is
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
Reading EBNF
Here's a short lesson in deciphering Name's EBNF.
First, note the numbers enclosed in square brackets -- [4] and [5]. These numbers identify productions. It's not uncommon to find references, on XML-related mailing lists and newsgroups, to such things as "production 12" and "production 5." What these terms refer to, then, are EBNF definitions in the spec. (If someone mentions "production 12," and you don't know what it means, just open the spec in your browser and do a text search on the string "[12]".)
Second, you need to understand that all these EBNF blocks -- these productions -- do is define some term or other. Production 4 defines the term "NameChar"; whenever that term is used elsewhere in the spec, production 4 provides the, well, the definitive definition of that term. The double-colon-equals symbol, ::=, can be read as "is defined as" or "comprises" and so on.
Finally, what's on the right of the ::= is similar in syntax to a content model in a DTD and uses many of the same regular-expression notations. In this syntax, you might encounter a special character such as the vertical bar -- also called the "pipe" -- character, |. This character represents logical "or". So production 4 might be rewritten in English thus:
The term "NameChar" refers to a letter OR a digit OR a period (".") OR a hyphen ("-") OR an underscore ("_") OR a colon (":") OR a CombiningChar OR an Extender.
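The same reading can be sketched in Python, where each | in the production becomes an "or". This is only an illustration, not the spec's full definition: the names LETTER_RANGES, DIGIT_RANGES, in_ranges, and is_name_char are hypothetical, the Letter and Digit stand-ins cover only ASCII and Latin-1 characters, and CombiningChar and Extender are left out entirely.

```python
# Partial stand-ins for the spec's character classes.
# Assumption: only ASCII/Latin-1 letters and ASCII digits are listed here;
# the real productions in Appendix B contain many more ranges.
LETTER_RANGES = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]
DIGIT_RANGES = [(0x30, 0x39)]


def in_ranges(ch, ranges):
    """Return True if the single character ch falls in any (low, high) range."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_name_char(ch):
    """Approximate production 4: each '|' in the EBNF becomes an 'or'."""
    return (
        in_ranges(ch, LETTER_RANGES)   # Letter (partial)
        or in_ranges(ch, DIGIT_RANGES)  # Digit (partial)
        or ch == "."
        or ch == "-"
        or ch == "_"
        or ch == ":"
        # CombiningChar and Extender are omitted in this sketch.
    )


for ch in ["a", "7", "-", "#", "@"]:
    print(repr(ch), is_name_char(ch))
```

Under these assumptions, the letter, the digit, and the hyphen pass, while # and @ do not.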
Unicode character classes
Note that to the right of the ::= is a mixture of punctuation and terms defined elsewhere, and that in the spec the terms on the right are presented as hyperlinks to their definitions. Thus, production 4 actually looks like
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
The four hyperlinks lead to productions 84, 88, 87, and 89, respectively. These four productions are among those in Appendix B ("Character Classes"), and their definitions -- what's on the right of each of their ::= symbols -- boil down to simple lists of Unicode value ranges, represented in hexadecimal form.
Of course "simple" is a relative term. You might imagine, for example, that the term "Digit" in production 4 equates to Unicode values #x0030 through #x0039 -- the hex representations of the characters 0 through 9. That's only a small fraction of the "digits" actually available for use as an XML name character though, as you can see from production 88:
[88] Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] |
               [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] |
               [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] |
               [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] |
               [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
All the other ranges represent legitimate "digits." They're just not digits as you may be accustomed to the term in Western "Arabic" numbering systems. (Also remember that these are not actual numeric values. An XML document may contain text representations of numeric values but not the numeric values themselves. In terms familiar to programmers, the character "9" is not the same as the number 9.)
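To make production 88 concrete, here is a minimal Python sketch that transcribes its ranges into a lookup. The names XML_DIGIT_RANGES and is_xml_digit are hypothetical, chosen only for this illustration; the ranges themselves are copied from the production above.

```python
# Ranges transcribed from production 88 (Digit), as (low, high) code points.
XML_DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9),
    (0x0966, 0x096F), (0x09E6, 0x09EF), (0x0A66, 0x0A6F),
    (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F), (0x0BE7, 0x0BEF),
    (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]


def is_xml_digit(ch):
    """Return True if the single character ch matches production 88."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in XML_DIGIT_RANGES)


print(is_xml_digit("7"))        # a Western digit, in [#x0030-#x0039]
print(is_xml_digit("\u0667"))   # a digit from the [#x0660-#x0669] range
print(is_xml_digit("x"))        # a letter, not a digit

# The character "9" is text; it only becomes the number 9 after conversion.
print("9" == 9, int("9") == 9)
```

The last line echoes the point about text versus numeric values: the comparison of the character with the number is false until the character is explicitly converted.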
If you're curious about all these hexadecimal Unicode values and the actual characters they represent, the Unicode code charts are the authoritative source, available either as PDF or as GIFs. Note the URLs addressed by the hyperlinks on these pages. The page of Arabic characters, for example, is designated as "U0600". The U stands for Unicode, and the four-digit value that follows it is the first hexadecimal value in the range covered by that PDF or Web page.
The bottom line
Back to your question. What characters may an XML name (element, etc.) contain and in what order? This is where you need to refer to production 5 listed above. To repeat (with hyperlinks):
[5] Name ::= (Letter | '_' | ':') (NameChar)*
Note that here the EBNF expression actually consists of a couple of "sub-expressions," grouped with parentheses. This production might be rewritten in English thus:
The term "Name" refers to (a letter OR an underscore OR a colon) FOLLOWED BY (any number of the characters defined by the term "NameChar").
The asterisk in the EBNF means "0 or more".
Thus, putting together productions 4 and 5, legitimate element names include the following:
axiom
_axiom_26
:axiom_veintiséis
ora:open.source
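All of these names begin with a letter (defined elsewhere in the spec as certain ranges of Unicode values), an underscore, or a colon, followed by any combination of letters, digits, periods, hyphens, underscores, and colons (plus the combining characters and extenders that production 4 also allows).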
The following are not legitimate XML element names:
#axiom
@axiom
26th_of_month
axiom#26
The first three begin with something other than a letter, underscore, or colon; the last starts out all right, but falls apart because the # is not a legitimate name character.
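If you'd like to test names of your own, productions 4 and 5 translate fairly directly into a regular expression. The Python sketch below is an approximation under stated assumptions, not a conforming implementation: LETTER and DIGIT cover only ASCII and Latin-1 characters rather than the full lists in Appendix B, CombiningChar and Extender are omitted, and the names NAME_RE and is_xml_name are hypothetical.

```python
import re

# Assumption: partial character classes (ASCII plus Latin-1 letters, ASCII digits).
# The spec's Letter, Digit, CombiningChar, and Extender productions are far longer.
LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
DIGIT = "0-9"

# Production 5: (Letter | '_' | ':') (NameChar)*
NAME_START = f"[{LETTER}_:]"
# Production 4, minus CombiningChar and Extender; the trailing '-' is a literal hyphen.
NAME_CHAR = f"[{LETTER}{DIGIT}._:-]"
NAME_RE = re.compile(f"{NAME_START}{NAME_CHAR}*")


def is_xml_name(candidate):
    """Return True if the whole string matches the approximated Name production."""
    return NAME_RE.fullmatch(candidate) is not None


# "\u00e9" is the precomposed e-with-acute used in the third example name.
legitimate = ["axiom", "_axiom_26", ":axiom_veintis\u00e9is", "ora:open.source"]
not_legitimate = ["#axiom", "@axiom", "26th_of_month", "axiom#26"]

for name in legitimate + not_legitimate:
    print(f"{name!r}: {is_xml_name(name)}")
```

With these partial classes, the four legitimate names above match and the four others do not. Names using letters or digits outside the listed ranges would be wrongly rejected, which is why a real parser relies on the complete productions.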
After you've put together some possible combinations of element names based on the above, I think you'll agree that the rules are really quite simple. What makes them seem complex is that they must be stated precisely and unambiguously, and that they must allow for "name characters" not just in Western language systems but in virtually any language representable as Unicode.