Databases and Element Names
January 29, 2003
XML-illegal characters in database-legal field names?
Q: I have a database that contains "unconventional" field names (i.e.,
Book_&_Page
, Grantee's_ID_#
). These field names do not meet
the requirements for element names, so I am forced to run them through a "sanitizing"
function before naming the element nodes. The function replaces or removes the characters
offensive to XML. So the Book_&_Page
field might become the
Book__Page
element, and Grantee's_ID_
might become
Grantees_ID_no
because I've chosen to sanitize simply by removing the
offending characters.
Unfortunately, this sanitizing process introduces the possibility that I could end
up with
elements with the same name, although in the database they are named differently.
For
example, if there were original fields named Book_&_Page
and
Book_:_Page
, sanitizing by the rule of removing the character would result in
Book__Page
for both of these fields. (Changing the field names in the
database is NOT an option.)
The initial dilemma is deciding what rules to apply to what characters in the sanitizing function. Coming up with a generic set of rules that could apply to multiple datasets over multiple systems seems risky. I am hoping that someone has come up with a way to "wrap" the element name so that these "illegal" characters can be included in the name. Maybe something along the lines of a CDATA section for the content of an element?
In lieu of this so-called "wrapper," there are workarounds -- such as the sanitizing function or using aliases for the field names. Do you have any ideas for how to deal with this?
A: As you've evidently been learning, moving data by hand between XML and relational databases can be frustrating, intellectually stimulating, maybe even rewarding. Mostly, though, it can leave you simply hungry for help.
For almost anything involving getting XML data into or out of databases (relational or otherwise), the commonly accepted starting point is Ron Bourret's XML and Databases site. (You might also want to take a look at his piece on Mapping DTDs to Databases, published in mid-2001 here on XML.com, for specific advice about relating the structure of your database to the structure of the XML documents it needs to emit or consume. Despite the feature's emphasis on -- and presumed familiarity with -- DTDs, it's a wonderfully concise yet complete summary of the issues you'll need to consider.)
As for your specific question, you already know, or at least sense, that there's no getting around the XML Recommendation's basic requirements for element names: the ampersands and pound signs have got to go.
Unfortunately, wrapping a document's content -- including its markup -- in one big CDATA section doesn't solve your problem. Consider a document like the following:
<Book_Info><![CDATA[ <Grantee's_ID_#>12345</Grantee's_ID_#> <Book_&_Page>1,29</Book_&_Page>]]> </Book_Info>
This may fool a casual human reader into "seeing" a Book_Info
root element
with Grantee's_ID_#
and Book_&_Page
children -- into "seeing,"
that is, a hierarchical tree of element nodes. (You might consider such a document
an
example of the trompe l'oeil school of database-to-XML mapping.)
XML is "supposed" to be human-readable, you might argue, but it's also supposed to
be
readable by software -- XML parsers -- which can construct a hierarchy of nodes from
the
document. In the case of the above document, the tree will have a single node, the
Book_Info
root element, containing a string of text (PCDATA) which just
happens to contain not only apostrophes, pound signs, and ampersands, but also unescaped
greater- and less-than symbols. The parser won't "see" the structure implied by those
markup-significance characters, because wrapping them in a CDATA section explicitly
instructs the parser to ignore their markup significance.
You might be able to sidestep your difficulty by forgetting about a literal mapping of database field or column names to element names. That is, you could push the database field or column names into attribute values, assigning corresponding element names either arbitrarily or according to some more or less intelligible scheme. For instance, the above document might be recast (via a method like your "sanitizing" script) as:
<table tablename="Book_Info"> <!-- Above assumes table names may present same problem as field names. --> <field1 fieldname="Grantee's_ID_#">12345</field1> <field2 fieldname="Book_&_Page">1,29</field2> </table>
It would then be pretty straightforward to transform this document via XSLT into either a comma-separated values text file or even an SQL statement such as:
INSERT INTO Book_Info (Grantee's_ID_#, Book_&_Page) VALUES ("12345", "1,29")
A portion of a stylesheet to accomplish the latter might look like this (comments in boldface):
<!-- Tell the XSLT processor not to be concerned with outputting well-formed XML. --> <xsl:output method="text"/> <!-- ...and suppress the built-in rules for processing text nodes in the source tree - we want text in the output only when we explicitly place it there. --> <xsl:template match="text()"/> <xsl:template match="table"> <!-- This template rule really works only if there's a single root element in the source tree, which maps onto a single table in the resulting database. Other structures would almost certainly require you to build your SQL differently. --> INSERT INTO <xsl:value-of select="@tablename"/> ( <xsl:for-each select="*"> <!-- Note test below for the name of the source tree element name at this point. It's not necessary, strictly speaking; this is just to ensure that we're not getting any other data mixed in with the data destined for the database. --> <xsl:if test="starts-with(name(), 'field')"> <xsl:value-of select="@fieldname"/> <!-- If this is any field except the last, append a comma to the field name. --> <xsl:if test="position() != last()">, </xsl:if> </xsl:if> </xsl:for-each>) <!-- Building the VALUES clause structured identically to building the field name list, above. --> VALUES ( <xsl:for-each select="*"> <xsl:if test="starts-with(name(), 'field')"> "<xsl:value-of select="."/>"<xsl:if test="position() != last()">, </xsl:if> </xsl:if> </xsl:for-each>) </xsl:template>
(Note: I've added indenting and line breaks for legibility of the XSLT code; if you use the code exactly as above, the resulting SQL will be syntactically valid but look a little strange.)
Numeric datatypes in XML Schema?
Q: I've got a question about the numerical datatypes long, unsignedLong, int, unsignedInt, short, unsignedShort, byte and unsignedByte. Which of these datatypes are subsets of the datatypes float and double?
I know they are all subsets of the datatype decimal. But what's with float and double? Is every long value, unsignedLong value, and so on a valid float value or a valid double value? I think it's obvious that byte and short (signed and unsigned) should be valid float and double values. But I don't know if this is true with int and long (signed and unsigned).
A: For the definitive answer to this question, I turned to Eric van der Vlist, regular XML.com contributor and author of O'Reilly's XML Schema. He summarized it this way:
Against all that we have learned in our math classes, WXS [W3C XML Schema] considers that:
- long, unsignedLong, int, unsignedInt, short, unsignedShort, byte, unsignedByte and friends are derived from the "decimal" primitive datatype.
- float and double are two other primitive datatypes.
This is because the first category allows comparisons without any risk of error: there is an exact matching between the lexical space (i.e., what you see in the XML document) and the value space (i.e., what the schema processor understands depending on the datatype). But the second category is approximate. (When I write 1.23E4, I have very little chance that it can be expressed as powers of two, which is the value space for float and double.)
Also in XML Q&A |
|
Another way of looking at this is that all decimal values can be exactly represented, yes, as floats and doubles. But the reverse is not true; there are quite a few (actually, an infinite number of) floats and doubles which cannot be exactly represented as decimals. The former's approximate quality clashes with the latter's exactitude.
Van der Vlist also added a bonus clarification:
You have the same problem with QName, which is not derived from string (as opposed to NCName, which is derived from string), because its value space is different.
In short, although WXS is the product of logical, rational minds, don't assume -- especially when considering datatypes, primitive and otherwise -- that all its principles will necessarily follow the logic of everyday common sense.