From Excel to XML
January 9, 2002
Q: How do I convert Microsoft Excel data to XML?
A: The easiest way is to use either a commercial product (such as Infoteria's iMaker for Excel -- $150 for a single developer license) or any of a number of publicly-available Visual Basic/Visual Basic for Applications (VB/VBA) modules (such as Scott Woods's contribution). If you're after not just a straight answer, but a deeper understanding, read on.
Converting a spreadsheet's data to XML is a specific form of the general question, "How do I convert tabular data to XML?" Depending on the spreadsheet in question -- and on the character of the desired XML output -- answering it can be extremely simple or mind-bendingly complex. Let's look at a simple example first. Consider the following table (i.e., "spreadsheet"):
A | B | C | D | |
1 |
1000.23 |
1206.79 |
1135.00 |
1394.61 |
2 | 113.19 |
12.36 |
159.12 |
34.78 |
3 |
234.56 |
678.90 |
259.00 |
654.01 |
This table seems to be rich in content but devoid of anything like meaning. There's no way for a human reader, let alone a dumb machine, to make sense of these numbers. Any XML generated from such a table is likely to be similarly stupid, along these lines:
<table>
<row>
<column>1000.23</column>
<column>1206.79</column>
<column>1135.00</column>
<column>1394.61</column>
</row>
<row>
<column>113.19</column>
<column>12.36</column>
<column>159.12</column>
<column>34.78</column>
</row>
<row>
<column>234.56</column>
<column>678.90</column>
<column>259.00</column>
<column>654.01</column>
</row>
</table>
A | B | C | D | E | |
1 |
1Q2002 | 2Q2002 | 3Q2002 | 4Q2002 | |
2 |
West |
1000.23 |
1206.79 |
1135.00 |
1394.61 |
3 |
East | 113.19 |
12.36 |
159.12 |
34.78 |
4 | Central |
234.56 |
678.90 |
259.00 |
654.01 |
We still don't know what the numbers represent, exactly -- sales figures? average birth rates? average test scores? But what each row or column's data represents is becoming clearer. The rows evidently represent geographical subdivisions of a larger whole; the columns, a year's quarterly data. Given this understanding, we might imagine XML output such as
<table>
<division name="West">
<quarter
name="1Q2002">1000.23</quarter>
<quarter name="2Q2002">1206.79</quarter>
<quarter
name="3Q2002">1135.00</quarter>
<quarter name="4Q2002">1394.61</quarter>
</division>
[etc.]
</table>
Finally, of course, the table's (or spreadsheet's) author may have helped us further by providing some kind of summary label, which resolves for certain what is represented by the numbers:
A | B | C | D | E | |
1 | Average Daily Ticket Sales by Region | ||||
2 | 1Q2002 | 2Q2002 | 3Q2002 | 4Q2002 | |
3 | West |
1000.23 |
1206.79 |
1135.00 |
1394.61 |
4 | East | 113.19 |
12.36 |
159.12 |
34.78 |
5 | Central |
234.56 |
678.90 |
259.00 |
654.01 |
Now our XML output can take a more helpful form, shorn of formatting and display notions like tables, rows, and columns:
<avg_ticket_sales>
<region name="West">
<quarter name="1Q2002">1000.23</quarter>
<quarter name="2Q2002">1206.79</quarter>
<quarter name="3Q2002">1135.00</quarter>
<quarter name="4Q2002">1394.61</quarter>
</region>
[etc.]
</avg_ticket_sales>
I think most readers will intuitively grasp how this conversion might work, given the
presence of a human to control the process. Not so obvious, perhaps, is how a
machine might make the leap from (say) a root element named table
to
one named avg_ticket_sales
.
Most automated conversion programs provide some mapping mechanism which instructs the software which cells are to be translated to element or attribute names and which to element content or attribute values. For example, you might have to supply a mapping-configuration file whose contents look something like
root == element(avg_ticket_sales)
row(1) == ignore
row(2) ==
element("quarter")attribute("name")
col(A) ==
element("region")attribute("name")
(This isn't real; I'm just demonstrating how, in general, the mapping might work.)
If the target XML vocabulary comes with a DTD or Schema, the task becomes even easier (well, as such things go). These formal descriptions of a conforming document's allowable structure are relatively easy to automate in a drag-and-drop user interface, like a hierarchical tree of nodes. Such a tool would enable the user to define the table-to-DTD mapping simply by, as it were, "drawing a line" from some portion of a table or spreadsheet to its XML counterpart.
If you're interested in learning more about mapping table-type structures to XML structures, I encourage you to refer to any of Ron Bourret's relevant work. (XML.com published Bourret's thorough, multi-part treatment in May 2001: "Mapping Databases to DTDs".) Bourret's work on this and other XML-related subjects is worth knowing, as you can see from his web site.
XML naming constraints revisited
In last month's column, I discussed why element and attribute names cannot begin with a digit or other (mostly non-alphabetic) character. I talked some in that column about XML's having inherited the constraint from SGML, but it was largely speculative on my part (since I was barely in the neighborhood when XML was being born, let alone SGML).
Also in XML Q&A |
|
After that column appeared, Rick Jelliffe sent me a note listing some real reasons
for the
limitation -- that is, more real than "because SGML says so, too". One of these reasons
should be so obvious to an application developer that I'm embarrassed not to have
thought of
it. Once you open the door to an element or attribute named, say, 30DaySpan
,
then you must also allow an element or attribute named simply 30
. And then
performing mathematical or Boolean operations would become an excruciating guessing
game for
a human, let alone an XML parser: Does the expression 30 * 5
mean the literal
number 30 times 5, or the value of the 30
element (whatever that value might be) times 5 (or, for that matter, the value of the
30
element times the value of the 5
element, or the literal
value 30 times the value of the 5
element)? Does 200 > 30
evaluate to Boolean true
in all cases, or does it depend on the value of the
30
element (or the 200
element, etc.)?
Jelliffe offered another reason: allowing unpronounceable names (like, say,
%^*@
, whether followed by something more pronounceable or not) would play
havoc with speech synthesizers used by blind readers of the XML document in question.
Finally, Jelliffe said, there's a slippery-slope objection to allowing leading digits in particular. Once you allow digits, then someone will want dollar signs, and once you allow dollar signs then someone will come up with a reason for "needing" leading whitespace as part of an element or attribute name.
Besides, as Jelliffe concluded, in order to accomplish most of what people who want non-alphabetic leading characters really want, all you need to do is remove the requirement that ID-type attribute values must be legitimate XML names. Both XML Schema and Jelliffe's Schematron provide this flexibility.
Many thanks to Rick Jelliffe for filling in the gaps.