Word to XML and Back Again
December 8, 2004
A recent article on the O'Reilly Network showed how to edit XML using Word 2003, as long as your target XML format was not too far removed from the built-in structural limitations of a word processor, and last year there was a survey of solutions on XML.com. But since Word 2000, it has been possible to "export as XML" if you are up for a little bit of post-processing.
In fact, Rick Jelliffe blogged about this year's Open Publish conference: "If I were to pick a theme or meme, it was that the decision on whether and how to support Word was by far the most critical decision for most large XML deployments." The "whether" is a big question, but here's something about how to support Word in an XML project.
In this article, I will show you how to take the frighteningly messy result of Word's "Save as Web Page" and turn it into well-formed XML, using a few lines of Python and a touch of XSLT. Grab the sample Python application, and if you have libxml2 installed, you can type:
python wordconverter.py mydoc.htm > mydoc.xml
python wordconverter.py mydoc.xml > mynewdoc.htm
(Ignore the complaints from the libxml2 parser.)
Even if you do use Word 2003 (and many of us don't), you may find that this is a more useful approach than WordprocessingML (the Word 2003 XML format), particularly if you are producing web pages. One major advantage of the hack I will show here is that it gives you pre-rendered, web-ready versions of your images, equations, graphs, and so on, nicely linked in img elements. You just have to remove the non-HTML parts.
I have been using this technique for more than four years, both with a commercial-but-free-to-use processor that I helped specify, and with a .NET version that I worked on for a former employer. These techniques are well tested on thousands of documents.
To be really useful, you will need to create templates for your authors so that you have predictable outputs to turn into XML. Unfortunately, good template design, and the benefits of basing your custom document types on HTML, are topics beyond the scope of a single article. There is more about template design, particularly for HTML output, at my site, in my Word Processor Interoperability project.
If you "Save as Web Page" in Word 2000 or beyond and open the result in a text editor, you will see something that is nearly XML, but with some craftily designed hacks to make the document accessible from a web browser (well, Internet Explorer, anyway) while containing enough embedded code to reconstitute a Word document in almost all of its glory. This widely reviled format is classic Microsoft. When first introduced, it used to crash or confuse competing browsers.
The goal for this little project is twofold: first, to figure out how to get from Microsoft's format to well-formed XML (we will not be validating this format with a schema or DTD), as from there it is straightforward to use XSLT or the language of your choice to transform the document, probably for rendering. For rendering, the trick is to simply discard most of the proprietary, undocumented Word features and transform the basic HTML paragraphs and tables into something useful, maybe even valid XHTML.
The second part of the goal is to be able to reverse the process, and turn the XML back into a Word document. You could use this to make minor changes to an existing document, such as changing metadata or incorporating comments from a web site. Or you could create entirely new documents, based on a shell, and use Word to render or edit them.
Now to work. The HTML head element starts off with some pretty standard stuff. All we need to do here is quote the attribute values and close the unclosed elements meta and link. For this, I have settled on libxml2's HTML document parser, as discussed recently on XML.com, over the more obvious alternative of Python's own standard sgmllib. The main problem with sgmllib is that it turns all characters in element names into lowercase. So to round-trip the document back into Microsoft's format, we would need to use a big lookup table, or use a hacked version of the library.
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 11">
<meta name=Originator content="Microsoft Word 11">
<link rel=File-List href="word2xml_files/filelist.xml">
<link rel=Edit-Time-Data href="word2xml_files/editdata.mso">
Parsing an HTML document in libxml2 is one-line simple (assuming that doc contains your document as a string), and it deals with both attributes and empty elements with aplomb:
import libxml2
htmldoc = libxml2.htmlParseDoc(doc, None)
This gives you an XML document, htmldoc, that you can process like any other XML document. But it's not that simple. There are some limitations to libxml2's considerable powers, starting with the fact that it does not seem to understand the XML-style namespace declarations that Word puts in at the top of the document, even though they are fine in XML. You may also run into issues with encodings, particularly when using fonts such as Wingdings and the like. I have not attempted to deal with this issue in detail, but there is a commented hack in the sample file that should get you started.
<html xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:w="urn:schemas-microsoft-com:office:word"
      xmlns="http://www.w3.org/TR/REC-html40">
So when libxml2 encounters bits of a document like this in the office namespace:
<o:p> </o:p>
it complains, and turns them into this:
<p/>
We can't have elements moving between namespaces, from the office namespace to the default HTML namespace, but there is a simple solution, which involves replacing the : character with a _ character, and then putting it back later. The latter bit we'll do using XSLT.
import re

# Hide namespaces from libxml2's HTML parser
# (\w+ also covers prefixes longer than one character)
qualifiedname = r'<(/?)(\w+):(\w)'
hackedname = r'<\1\2_\3'
doc = re.sub(qualifiedname, hackedname, doc)
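To see what the substitution does, here is a small self-contained sketch run on a made-up fragment. The function name hide_namespaces and the sample strings are illustrative assumptions, not part of the sample application, and the pattern is the variant that accepts prefixes of any length.

```python
import re

# Assumed variant of the pattern: \w+ accepts prefixes such as "o" or "st1"
QUALIFIED_NAME = re.compile(r'<(/?)(\w+):(\w)')

def hide_namespaces(markup):
    """Swap the colon in prefixed tag names for an underscore."""
    return QUALIFIED_NAME.sub(r'<\1\2_\3', markup)

sample = '<p class=MsoNormal><o:p>&nbsp;</o:p></p>'
hidden = hide_namespaces(sample)
print(hidden)  # <p class=MsoNormal><o_p>&nbsp;</o_p></p>
```

Only tag names are touched: the pattern is anchored on the opening angle bracket, so colons in attribute values and text are left alone.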
The namespace issue is a bit ugly, but nothing compared to the horror of what I like to call the Mutant Markup Declaration (MMD), which is the dirty trick used to hide proprietary Word data in an HTML file. There are two variants of the MMD.
The first kind of MMD, which starts with <!--[if some-condition ]> and ends with <![endif]-->, is a species of comment, used to hide things from "normal" software:
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  …
  <o:Author>Peter Sefton</o:Author>
  …
 </o:DocumentProperties>
</xml><![endif]-->
Ironically, inside of the comment is pure, well-formed XML, thoughtfully wrapped in <xml> tags to emphasize the point. This is just a couple of regular expressions away from being XML. But how to do it? The most obvious way would be to turn the MMDs into processing instructions (PIs), as that is really their function. Unfortunately, though, libxml2 ignores PIs when parsing HTML, so I settled on the ugly-but-safe approach of using empty elements, and made-up ones at that.
Two substitutions will fix the MMDs:
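# The opening "<!--[condition]>" becomes an empty marker element plus a wrapper div
startComment = r"<!--\[(.*?)\]>"
startCommentReplace = r"<mmd value='\1' comment='start' /><div language='mso-conditional'>"
doc = re.sub(startComment, startCommentReplace, doc)

# The closing "<![endif]-->" closes the wrapper div and adds a second marker element
endComment = r"<!\[(.*?)\]-->"
endCommentReplace = r"</div><mmd value='\1' comment='end' />"
doc = re.sub(endComment, endCommentReplace, doc)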
Here's an example that illustrates a few more challenges. If you have a style that you use for lists in Word, called L1* (for list, first level, with a bullet), it might look something like this:
- Bullet point
- Bullet point
- Bullet point
In Word's format, each paragraph looks like this (don't look at this if you're squeamish; it's not pretty):
<p class=L10>
<![if !supportLists]>
<span lang=EN-AU style='font-family: Symbol; mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol'><span style='mso-list:Ignore'>•<span style='font:7.0pt "Times New Roman"'>...some spaces... </span></span></span>
<![endif]>
<span lang=EN-AU>Bullet point</span></p>
There is a Mutant Markup Declaration in here marking the beginning and end of some rendering information that uses non-breaking spaces for rendering the list. This works (sort of) in conjunction with a CSS stylesheet embedded in the document's head. The MMDs are easily dealt with:
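# Bare declarations such as <![if !supportLists]> become empty elements.
# The replacement must be a raw string so that \1 is read as a group reference.
startMMD = r'<!\[(.*?)\]>'
startMMDReplace = r"<mso-declaration value='\1' />"
doc = re.sub(startMMD, startMMDReplace, doc)

# startMMD also matches <![endif]>, so this second pass is only a safety net
endMMD = r'<!\[endif\]>'
endMMDReplace = "<mso-declaration value='endif' />"
doc = re.sub(endMMD, endMMDReplace, doc)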
Now we can put it all together:
def parsehtmlfile(self, htmfilename):
    self.htmfilename = htmfilename
    self.doc = open(htmfilename).read()
    # Remove mutant markup using regular expressions
    self.doc = EscapeMMD(self.doc)
    # Create a libxml2 XML document
    self.htmldoc = libxml2.htmlParseDoc(self.doc, None)
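The EscapeMMD function is not listed in this article, so here is a rough sketch of what such a function might look like, built only from the substitutions described above. The name escape_mmd, the compiled-pattern constants, and the sample strings are assumptions for illustration. The patterns use [^\]]* instead of .*? so that a match can never run past the closing bracket of the declaration it started in; that is a deliberate tightening, not the article's original expression.

```python
import re

# Hypothetical stand-in for the sample application's EscapeMMD function
START_COMMENT = re.compile(r'<!--\[([^\]]*)\]>')   # <!--[if gte mso 9]>
END_COMMENT = re.compile(r'<!\[([^\]]*)\]-->')     # <![endif]-->
DECLARATION = re.compile(r'<!\[([^\]]*)\]>')       # <![if !supportLists]>, <![endif]>

def escape_mmd(doc):
    """Replace both MMD variants with made-up elements an HTML parser accepts."""
    # Conditional comments: a marker element plus a wrapper div around the content
    doc = START_COMMENT.sub(
        r"<mmd value='\1' comment='start' /><div language='mso-conditional'>", doc)
    doc = END_COMMENT.sub(r"</div><mmd value='\1' comment='end' />", doc)
    # Bare declarations: a single empty element each
    doc = DECLARATION.sub(r"<mso-declaration value='\1' />", doc)
    return doc

comment = '<!--[if gte mso 9]><xml><o:Author>Peter Sefton</o:Author></xml><![endif]-->'
bullet = '<![if !supportLists]><span>x</span><![endif]>Bullet point'
print(escape_mmd(comment))
print(escape_mmd(bullet))
```

In a full pipeline the namespace-hiding substitution would run as well; it is left out here so that the effect on the MMDs is easier to see.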
There is one more complication to deal with. The list items in the original Word document had the style L1*, but the paragraph here is marked as class="L10". We need to look in the CSS stylesheet, in the head, to resolve this indirection. There you will find a CSS rule that contains the property we are looking for: mso-style-name. The trick is to extract the stylesheet and build a lookup table of class names, so you can say getstylename('L10') and get the answer L1*.
p.L10, li.L10, div.L10 {mso-style-name:L1*; mso-style-parent:B1; margin-top:6.0pt; ... mso-ansi-language:EN-AU;}
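One way to build that lookup table is with a few regular expressions over the stylesheet text. The sketch below is an assumption about how it could be done, not the sample application's code: the function name extract_styles and the patterns are made up for illustration, and it relies on Word's rules being flat (no nested braces).

```python
import re

# One CSS rule: a selector list followed by a declaration block in braces
RULE = re.compile(r'([^{}]+)\{([^{}]*)\}')
STYLE_NAME = re.compile(r'mso-style-name\s*:\s*([^;]+)')
CLASS_SELECTOR = re.compile(r'\.([\w-]+)')

def extract_styles(css):
    """Map each CSS class name to the Word style name declared for it."""
    lookup = {}
    for selectors, body in RULE.findall(css):
        found = STYLE_NAME.search(body)
        if not found:
            continue  # rule carries no Word style name
        # Drop surrounding whitespace and any quotes around the name
        name = found.group(1).strip().strip('"\'')
        for class_name in CLASS_SELECTOR.findall(selectors):
            lookup[class_name] = name
    return lookup

css = """
p.L10, li.L10, div.L10
    {mso-style-name:L1*;
    mso-style-parent:B1;
    margin-top:6.0pt;}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0cm;}
"""
styles = extract_styles(css)
print(styles.get('L10'))  # L1*
```

Classes whose rules have no mso-style-name are simply absent from the table, so a lookup with .get() returns None for them.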
So we grab the contents of all of the styleNodes:
styleNodes = self.htmldoc.xpathEval("//*[local-name() = 'style']")
styles = ''
for styleNode in styleNodes:
    styles += styleNode.serialize()
And call something to extract the style names and store them in a dictionary:
self.extractstyles(styles)
Then it's a matter of using the magic of XPath to visit every node in the document that has a class attribute and, if the class maps to a Word style name, add another made-up attribute: mso-style-name.
classNodes = self.htmldoc.xpathEval('//*[@class]')
for classy in classNodes:
    className = classy.prop('class')
    # Look up the Word style name recorded for this CSS class
    msoStyle = self.getstylename(className)
    if msoStyle:
        classy.newProp('mso-style-name', msoStyle)
Now we have a libxml2 document object ready to serialize. The weird Microsoft markup has been escaped into mmd elements, and namespaces have been escaped. The final step is to use a little bit of XSLT to serialize the document. The only interesting part of this is the part that puts the namespaces back, by matching elements that have an underscore in their names and doing some string manipulation to reinstate the namespaces.
<xsl:template match="*[contains(local-name(),'_')]">
  <xsl:variable name="new-name"
      select="concat(substring-before(local-name(), '_'), ':', substring-after(local-name(), '_'))"/>
  <xsl:element name="{$new-name}">
    <xsl:apply-templates select="@*|node()" />
  </xsl:element>
</xsl:template>
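For readers more at home in Python than XSLT, the string manipulation in that template can be mirrored in a few lines. This is only an illustration of the name mapping (the function restore_name is hypothetical); the actual conversion is done by the stylesheet.

```python
def restore_name(name):
    """Mirror the template: split at the first underscore and rejoin with a colon."""
    prefix, underscore, local = name.partition('_')
    if not underscore:
        # No underscore: the template would not match, so the name is unchanged
        return name
    return prefix + ':' + local

print(restore_name('o_p'))             # o:p
print(restore_name('w_WordDocument'))  # w:WordDocument
```

Like substring-before and substring-after, partition splits at the first underscore only, so any later underscores in the local name survive.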
Finally, as promised, the return trip. There are only a few lines of Python, because I did the real work in XSLT. This makes it portable across programming languages.
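import libxml2
import libxslt

class xmltoword:
    xmldoc = ''
    # wordxml2html is a string holding the XSLT stylesheet source,
    # defined elsewhere in the sample application
    styledoc = libxml2.parseDoc(wordxml2html)
    style = libxslt.parseStylesheetDoc(styledoc)

    def __init__(self):
        pass

    def parsexmlfile(self, fileName):
        self.xmldoc = libxml2.parseFile(fileName)

    def output(self):
        # Apply the stylesheet and return the result as a string
        return self.style.applyStylesheet(self.xmldoc, None).serialize()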
This kind of hackish transformation is but one way to approach the issue of getting Word documents into XML. One limitation is that while the round trip gives you a document that Word will accept, there are changes to whitespace and character encodings that mean you cannot automate testing across a large set of documents. To build a higher-fidelity version would require a custom parser. There are also ways of extracting XML from .doc files, and macro-based approaches, even using the OpenOffice.org word processor, which can read Word files (give or take a bit) and natively saves documents in XML. But as far as I know, my approach gives you the best crack at round-tripping documents, rather than siphoning off XML, provided you leave the proprietary stuff intact.
Before using this technique or the sample code on too many documents, try it out with a representative sample of real material from your users. And do be careful: my doctor friends tell me that they are still seeing a lot of injuries from the sharp edges on the inside of Microsoft Office files.