From Excel to XML

January 9, 2002

Q: How do I convert Microsoft Excel data to XML?

A: The easiest way is to use either a commercial product (such as Infoteria's iMaker for Excel -- $150 for a single developer license) or any of a number of publicly-available Visual Basic/Visual Basic for Applications (VB/VBA) modules (such as Scott Woods's contribution). If you're after not just a straight answer, but a deeper understanding, read on.

Converting a spreadsheet's data to XML is a specific form of the general question, "How do I convert tabular data to XML?" Depending on the spreadsheet in question -- and on the character of the desired XML output -- answering it can be extremely simple or mind-bendingly complex. Let's look at a simple example first. Consider the following table (i.e., "spreadsheet"):

	A	B	C	D
1	1000.23	1206.79	1135.00	1394.61
2	113.19	12.36	159.12	34.78
3	234.56	678.90	259.00	654.01

This table seems to be rich in content but devoid of anything like meaning. There's no way for a human reader, let alone a dumb machine, to make sense of these numbers. Any XML generated from such a table is likely to be similarly stupid, along these lines:

	A	B	C	D	E
`1`		1Q2002	2Q2002	3Q2002	4Q2002
`2`	West	1000.23	1206.79	1135.00	1394.61
`3`	East	113.19	12.36	159.12	34.78
4	Central	234.56	678.90	259.00	654.01

We still don't know what the numbers represent, exactly -- sales figures? average birth rates? average test scores? But what each row or column's data represents is becoming clearer. The rows evidently represent geographical subdivisions of a larger whole; the columns, a year's quarterly data. Given this understanding, we might imagine XML output such as

<table> <division name="West"> <quarter name="1Q2002">1000.23</quarter> <quarter name="2Q2002">1206.79</quarter> <quarter name="3Q2002">1135.00</quarter> <quarter name="4Q2002">1394.61</quarter> </division> [etc.] </table>

Finally, of course, the table's (or spreadsheet's) author may have helped us further by providing some kind of summary label, which resolves for certain what is represented by the numbers:

	A	B	C	D	E
1	Average Daily Ticket Sales by Region
2		1Q2002	2Q2002	3Q2002	4Q2002
3	West	1000.23	1206.79	1135.00	1394.61
4	East	113.19	12.36	159.12	34.78
5	Central	234.56	678.90	259.00	654.01

Now our XML output can take a more helpful form, shorn of formatting and display notions like tables, rows, and columns:

<avg_ticket_sales> <region name="West"> <quarter name="1Q2002">1000.23</quarter> <quarter name="2Q2002">1206.79</quarter> <quarter name="3Q2002">1135.00</quarter> <quarter name="4Q2002">1394.61</quarter> </region> [etc.] </avg_ticket_sales>

I think most readers will intuitively grasp how this conversion might work, given the presence of a human to control the process. Not so obvious, perhaps, is how a machine might make the leap from (say) a root element named table to one named avg_ticket_sales.

Most automated conversion programs provide some mapping mechanism which instructs the software which cells are to be translated to element or attribute names and which to element content or attribute values. For example, you might have to supply a mapping-configuration file whose contents look something like

root == element(avg_ticket_sales) row(1) == ignore row(2) == element("quarter")attribute("name") col(A) == element("region")attribute("name")

(This isn't real; I'm just demonstrating how, in general, the mapping might work.)

If the target XML vocabulary comes with a DTD or Schema, the task becomes even easier (well, as such things go). These formal descriptions of a conforming document's allowable structure are relatively easy to automate in a drag-and-drop user interface, like a hierarchical tree of nodes. Such a tool would enable the user to define the table-to-DTD mapping simply by, as it were, "drawing a line" from some portion of a table or spreadsheet to its XML counterpart.

If you're interested in learning more about mapping table-type structures to XML structures, I encourage you to refer to any of Ron Bourret's relevant work. (XML.com published Bourret's thorough, multi-part treatment in May 2001: "Mapping Databases to DTDs".) Bourret's work on this and other XML-related subjects is worth knowing, as you can see from his web site.

XML naming constraints revisited

In last month's column, I discussed why element and attribute names cannot begin with a digit or other (mostly non-alphabetic) character. I talked some in that column about XML's having inherited the constraint from SGML, but it was largely speculative on my part (since I was barely in the neighborhood when XML was being born, let alone SGML).

Also in XML Q&A

From English to Dutch?

Trickledown Namespaces?

From XML to SMIL

From One String to Many

Getting in Touch with XML Contacts

After that column appeared, Rick Jelliffe sent me a note listing some real reasons for the limitation -- that is, more real than "because SGML says so, too". One of these reasons should be so obvious to an application developer that I'm embarrassed not to have thought of it. Once you open the door to an element or attribute named, say, 30DaySpan, then you must also allow an element or attribute named simply 30. And then performing mathematical or Boolean operations would become an excruciating guessing game for a human, let alone an XML parser: Does the expression 30 * 5 mean the literal number 30 times 5, or the value of the 30 element (whatever that value might be) times 5 (or, for that matter, the value of the 30 element times the value of the 5 element, or the literal value 30 times the value of the 5 element)? Does 200 > 30 evaluate to Boolean true in all cases, or does it depend on the value of the 30 element (or the 200 element, etc.)?

Jelliffe offered another reason: allowing unpronounceable names (like, say, %^*@, whether followed by something more pronounceable or not) would play havoc with speech synthesizers used by blind readers of the XML document in question.

Finally, Jelliffe said, there's a slippery-slope objection to allowing leading digits in particular. Once you allow digits, then someone will want dollar signs, and once you allow dollar signs then someone will come up with a reason for "needing" leading whitespace as part of an element or attribute name.

Besides, as Jelliffe concluded, in order to accomplish most of what people who want non-alphabetic leading characters really want, all you need to do is remove the requirement that ID-type attribute values must be legitimate XML names. Both XML Schema and Jelliffe's Schematron provide this flexibility.

Many thanks to Rick Jelliffe for filling in the gaps.