
From Word to XML
by John E. Simpson
December 30, 2003
But, especially given its roots in SGML and HTML, XML functions equally well as an open, structured-document medium. And that's where this month's question comes from.
Note: I don't pretend that my answer here is definitive or encyclopedic. It covers only one solution among a host of alternatives. If the response to past columns of this sort is any indication, within a week or two you'll be able to find numerous reader-supplied comments at the end of the article, giving you pointers to other options.
Q: How can I convert a Microsoft Word document into XML?
A: Recent versions of Word claim "save as XML" features of one kind or another. Maybe that "claim" is too harsh; they do create well-formed XML documents, after all. But it's XML of a spectacularly hideous form, even for simple documents -- nearly as gnarly and impenetrable to the human eye as XSL-FO.
(For a good idea of what to expect, see A. Russell Jones's recent article on devx.com, "Export Customized XML from Microsoft Word with VB.NET." Don't worry if you don't know or care anything about VB.NET; just check out that article's Figure 1 -- which shows how the document appears in Word -- and its Listing 1 as well. The latter is the output of the document coming from Word 2003's "save as XML" feature.)
Whether you like or don't like Word, or use it in your everyday working life, you may be called upon to convert a Word document to XML at some point. And if you don't even have Word in the first place, the quality of the word processor's "save as XML" output is moot anyway. What do you do then?
A good place to start searching when you're pretty sure software for processing XML must exist, but you don't know where to find it, is xmlsoftware.com. In this case, use the site menu to locate the "Conversion Tools" page.
As you can see, most XML-to/from-Word packages don't process "true" Word documents in the classic .doc form. Instead, they rely on Word's long-standing support for Rich Text Format (RTF). (RTF documents are "structured", after a fashion. But the language is intended primarily to support the display of textual matter -- not unlike Adobe's PDF. If you'd like to learn more about RTF, check the Microsoft site. Another good source is the interglacial.com site, put together by Sean M. Burke, author of The RTF Pocket Guide, published in 2003 by O'Reilly and Associates.)
upCast: Word to RTF to XML
At least one of the XML conversion tools on the xmlsoftware.com site does support native Word .doc conversion: upCast, from infinity-loop GmbH. In this column I'll take a look at how upCast (currently at version 4) does its work.
First, let's get the questions of platforms and licenses out of the way. upCast is Java-based and thus available cross-platform, with installers for Windows, Unix, and Macs. The licensing comes in a variety of flavors, including (among others) a commercial product, a free evaluation, and a "private" (single user, non-commercial) version.
After installing upCast and browsing through its documentation (and the infinity-loop site), you find that its .doc file support is limited in one sense: the .doc file(s) in question must have been created using Word 97 (or later), on a PC running Windows 95, 98, NT, or 2000. For other, earlier versions of Word and/or Windows, the document first must be saved as RTF; the RTF file then is fed into the upCast conversion process. This limitation shouldn't be a problem for most Windows users, but it is something to bear in mind.
The .doc support relies on one other requirement: it uses an add-in, provided with upCast, called WordLink; this add-in saves the binary .doc as a temporary RTF file, using a copy of Word which is installed on the user's machine. So WordLink isn't available for Mac- and Unix-based upCast users. Hence, upCast users on these platforms are limited to processing RTF files only.
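Because the route a file takes depends on whether it is a binary .doc or an RTF file, it can help to check the container format before deciding how to convert it. The following is a minimal sketch and not part of upCast: the function names are hypothetical, and it relies only on the well-known leading bytes of the two formats (RTF files begin with `{\rtf`; binary Word files are OLE2 compound files).

```python
# Leading bytes of the two container formats discussed above.
RTF_MAGIC = b"{\\rtf"  # every RTF file starts with {\rtf
# OLE2 compound-file signature; binary .doc files use it, but so do
# other Office formats such as .xls, so this is a hint, not proof.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data: bytes) -> str:
    """Classify a document by its first bytes: 'rtf', 'doc', or 'unknown'."""
    if data.startswith(RTF_MAGIC):
        return "rtf"
    if data.startswith(OLE2_MAGIC):
        return "doc"
    return "unknown"


def needs_rtf_export(path: str) -> bool:
    """Hypothetical helper: True when the file looks like binary .doc,
    meaning it would have to be saved as RTF before an RTF-only tool
    could read it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8)) == "doc"
```

On a Mac or Unix machine, a check like this would tell you up front which files still need a trip through Word's "save as RTF" before conversion.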
Running upCast is fairly simple. The main dialog box consists of two sections:
- The upper section ("Import Settings") is for specifying input parameters, chief of which is the name of the source file to be converted:

Figure 1: upCast import settings

- The lower section ("Export Settings") lets you identify the name and properties of the output:

Figure 2: upCast export settings
In the second screen shot, I've pulled down the selection list to show what you can do with upCast. By default, the program outputs an XML document using upCast's own built-in DTD. Here's a fragment of a resulting document in this vocabulary:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE document PUBLIC "-//infinity-loop//DTD upCast 4.0//EN"
  "http://www.infinity-loop.de/DTD/upcast/4.0/upcast.dtd">
<?xml-stylesheet type="text/css" href="helloworld.css"?>
<document
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:html="http://www.w3.org/HTML/1998/html4"
  xml:lang="en"
  style="widows: 0; orphans: 0; word-break-inside: normal;
    -ilx-block-border-mode: merge;">
  <documentinfo>
    <property name="title" value="Hello" type="text" />
    <property name="author" value="John Simpson" type="text" />
    <property name="numberOfPages" value="1" type="integer" />
  </documentinfo>
  <part style="page: pageStyle1;">
    <par class="Normal">Hello world!</par>
  </part>
</document>
This has a number of interesting features.
First, note the xml-stylesheet PI. In order to
capture not only the contents of the document (which appear
later, as text strings within par elements), but
also its look-and-feel, upCast extracts style information from
the RTF document being processed and writes it to a Cascading
Style Sheet. A small fragment of this style sheet looks like
this:
*[class=Normal] {
display: block;
/* Paragraph Properties: */
text-align: left;
margin-left: 0.0mm;
/* Character Properties: */
vertical-align: baseline;
font-family: "Times New Roman", serif;
color: #000000;
font-size: 12.0pt;
}
With this style sheet and the PI, a viewer (such as a browser capable of displaying XML via CSS) can render the document's contents in something like the way they appear in the source document. This rendering isn't 100% exact, of course -- CSS doesn't do everything a word processor does, in exactly the same way, and browsers are notoriously inconsistent in the extent to which they support CSS.
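Since the content and the styling are kept apart, the XML side is also easy to read programmatically. As a rough sketch (not anything supplied with upCast), Python's standard `xml.etree.ElementTree` module can pull the document properties and paragraph text out of a fragment like the one shown earlier. The sample below is a trimmed copy of that fragment, with the DOCTYPE and style attribute left out; the helper function names are my own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment shown earlier (DOCTYPE and style omitted).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="helloworld.css"?>
<document xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en">
  <documentinfo>
    <property name="title" value="Hello" type="text"/>
    <property name="author" value="John Simpson" type="text"/>
    <property name="numberOfPages" value="1" type="integer"/>
  </documentinfo>
  <part style="page: pageStyle1;">
    <par class="Normal">Hello world!</par>
  </part>
</document>"""


def read_properties(root):
    """Return documentinfo properties as a dict, converting integer-typed values."""
    props = {}
    for prop in root.findall("./documentinfo/property"):
        value = prop.get("value")
        if prop.get("type") == "integer":
            value = int(value)
        props[prop.get("name")] = value
    return props


def read_paragraphs(root):
    """Return (class, text) pairs for every par element, in document order."""
    return [(par.get("class"), "".join(par.itertext())) for par in root.iter("par")]


root = ET.fromstring(SAMPLE)
properties = read_properties(root)
paragraphs = read_paragraphs(root)
print(properties)
print(paragraphs)
```

The class value on each par is the same name the style sheet selects on (`*[class=Normal]`), so it is the natural key if you want to map Word styles to something else downstream.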
The second thing to notice about the output document is the
two namespace declarations. One declares that the
html: namespace prefix is associated with the HTML
4.0 namespace.
The other (more interesting) one identifies an
xlink: namespace prefix. How does upCast use XLink?
In several ways, including these:
- Each hyperlink (including e-mail addresses) in the original Word document is converted to a link element with numerous XLink-specific attributes, such as:
<par class="Normal"[other attributes]>e-mail:
<link xlink:type="simple"
xlink:show="replace"
xlink:actuate="onRequest"
xlink:href="mailto:simpson@polaris.net">
...
</link>
</par>
- Each Word "bookmark" is translated into a reference element, which (like link) takes a variety of XLink attributes. The xlink:href attribute uses a fragment identifier to locate a specific portion of the document:
<reference xlink:type="simple" xlink:show="other"
xlink:actuate="onLoad"
xlink:href="#theThirdItem" ...>3</reference>
(Note also, by the way, the use of alternative values for the xlink:show and xlink:actuate attributes.)
- Each image embedded in the Word document is referenced with an empty XLinking image element.
<image xlink:type="simple" xlink:href="myImage01.jpg"
xlink:show="embed"
xlink:actuate="onLoad"/>
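The three kinds of XLink markup listed above can be collected with a few lines of code, which is handy if you want to check a converted document's links. This is only a sketch using Python's standard `xml.etree.ElementTree`; the sample document is assembled from the fragments above, the link text "write to me" and the words "See item" are placeholders I made up (the original elides them), and `collect_xlinks` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Assembled from the fragments above; element text is placeholder content.
SAMPLE = """<document xmlns:xlink="http://www.w3.org/1999/xlink">
  <par class="Normal">e-mail:
    <link xlink:type="simple" xlink:show="replace"
          xlink:actuate="onRequest"
          xlink:href="mailto:simpson@polaris.net">write to me</link>
  </par>
  <par class="Normal">See item
    <reference xlink:type="simple" xlink:show="other"
               xlink:actuate="onLoad"
               xlink:href="#theThirdItem">3</reference>
  </par>
  <image xlink:type="simple" xlink:href="myImage01.jpg"
         xlink:show="embed" xlink:actuate="onLoad"/>
</document>"""


def collect_xlinks(root):
    """Return (element name, href, show) for every element carrying xlink:href."""
    links = []
    for element in root.iter():
        # ElementTree spells namespaced attributes as {namespace-uri}localname.
        href = element.get(f"{{{XLINK}}}href")
        if href is not None:
            links.append((element.tag, href, element.get(f"{{{XLINK}}}show")))
    return links


links = collect_xlinks(ET.fromstring(SAMPLE))
for name, href, show in links:
    print(name, href, show)
```

Note that the lookup goes by the namespace URI rather than the `xlink:` prefix, so it keeps working even if a document binds that namespace to a different prefix.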
As I said, actually being able to use such XLinking markup presumes the availability of XLink-smart software. The Mozilla browser can handle simple XLinks in XML documents; for example, the email hyperlink in the first of the above three bullets displays correctly as:
Figure 3: Mozilla view of upCast link element
Again, though, you needn't use upCast simply to generate documents in upCast's own XML dialect. As you can see from the second screen shot above, other output options include XHTML 1.0 (Strict) and DocBook 4.2. (DocBook support is only beta-level, although I found no problems with it. And one thing it allows you to do is to migrate a document from Word to PDF, using software which generates PDF output from DocBook input, without using Adobe Acrobat itself.) As with the output to the native upCast vocabulary, selecting either the XHTML or the DocBook output format causes a corresponding CSS style sheet to be generated.
I did encounter some surprises in the resulting XHTML display, but only for Word features with no precise or consistently-renderable CSS counterparts. On the whole, though, the display was remarkably close to the original. For instance, here's a portion of a screen capture from a Word document, as displayed in Word:

Figure 4: Original document opened in Word
And here's the corresponding output of the upCast-generated XHTML document, viewed in Mozilla:

Figure 5: upCast-output version of above document, viewed in Mozilla
Not perfect, but very good. A particularly neat touch is the translation of the Word document's bookmarks into true hypertext equivalents, using fragment identifiers which scroll the browser directly to the correct portion of the document.
I haven't covered in this column the use of upCast's other output filter options. Like the upCast XML, XHTML, and DocBook outputs, these other options seem to work smoothly and with few surprises. (My favorite of these is the "XSLT Processor" feature, which first generates an XML document and then transforms it to some other form, by way of a user-supplied style sheet and the Apache Xalan XSLT processor.) Nor have I covered the use of infinity-loop's parallel XML-to-Word product, unsurprisingly called downCast. If you're interested in straightforward translation back and forth between Word and various XML formats, though, I encourage you to investigate these other tools on your own. And of course, by all means take a look at the other software on xmlsoftware.com's "Conversion Tools" page.
Comment on this Article
- Word 2000 Technical Articles
2006-01-11 18:32:52 yongmanlian [Reply]
Export a Word Document to XML, Kevin McDowell.
- Manifest CMS - Word->XML
2005-11-06 16:45:23 jasone [Reply]
Manifest CMS - http://www.hardlight.com.au
converts Word documents to XML and has done so since 1999. It uses the MS Word API to do the conversion, not RTF. You can customise the XML output, and it's designed for reprocessing with XSLT. Capable of batch conversions, including subfolders.
- Offisor: *.doc to *.xml in Java
2004-02-07 03:51:29 Pasi Nummisalo [Reply]
Davisor Offisor provides a pure Java implementation for going directly from Word doc to XML (no "save as RTF" needed). Offisor can be used from the command line or through an API. There are also several XSL-T examples for XSL-FO, DocBook and XHTML. Learn more from the Davisor Offisor pages.
- Antiword
2004-01-22 12:16:59 ROb Schmersel [Reply]
Antiword is another (open-source) tool which converts Word documents to a number of other formats, XML (DocBook) being just one of them.
- rtf2xml is an open source solution
2004-01-12 23:21:49 Paul Tremblay [Reply]
My Python script rtf2xml is the only open source utility that converts RTF to XML. Give it a try at
http://rtf2xml.sourceforge.net/
I know that wvware is also an open-source project, but wvware converts Word documents to formats such as LaTeX, not to XML itself.
rtf2xml converts an RTF document into an XML document with a good amount of structure. It forms lists and can convert headings into sections.
With an XSLT stylesheet, you can turn the document that results from an rtf2xml conversion into simplified DocBook, TEI, LyX, or XHTML. I have written an XSLT stylesheet that works in conjunction with the rtf2xml script to output a simplified DocBook document.
rtf2xml has no graphical interface. You can use it to batch convert many documents at once.
- WorX Studio by HyperVision
2004-01-12 16:30:25 Chango Valtchev [Reply]
Pleased to introduce our new product, WorX Studio… Its purpose is, exactly, automated structuring of Microsoft Word documents (as well as any textual content that can be imported into Word). Conversion can target any custom XML Schema (XSD) that models the logical structure of the document. Hence, meaningful/"semantic" markup can be derived, not just the formatting/typographical kind. Word 2003's native XML markup is supported directly. Older versions are supported just as well based on our other Word add-in, WorX for Word, which augments Word 2000/2002 to become a full-fledged XML authoring tool. [Yeah, we did this three years ahead of Microsoft…] WorX Studio is expressly designed for and fully integrated into the workspace of Microsoft Word. It provides a GUI environment for the development and execution of document-type-specific conversion definitions. Nearly all formatting features supported by Word can be used to define XML element recognition patterns. In addition, literal text-, wildcard-, and regular expression patterns are supported, as well as arbitrarily complex logical (boolean) combinations of all primitive pattern types. Another novel feature of WorX Studio is the conversion model it utilizes. All the "intelligence" encoded in the supplied XML Schema is extracted and used to guide the document conversion process. (No ad-hoc style-to-element mapping like what is seen in some simplistic conversion approaches.) Identifying and defining appropriate recognition patterns only for what is called baseline elements in the document (usually leaf-level or near-leaf-level elements) enables the conversion engine to create the markup for these elements as well as the markup for all higher-level elements, automatically, "for free", by abiding all nesting and repetition rules from the schema. Thus, deep/granular markup can be easily obtained. 
Another advantage offered by the schema-guided approach is that the individual baseline element patterns can be relatively simple and loose. Patterns are tested only in the context where valid matches are expected/likely to occur, thus avoiding many spurious matches and speeding up the whole conversion process. The result of conversion is the given Word document, with all its original text and formatting intact, but with the XML element tags embedded in it. (This is something that a command-line or "streaming"-approach conversion tool cannot offer.) Pure, standard XML compliant with the custom schema can be exported at any time. Completed conversion definitions can be run even by non-XML users within Microsoft Word. A batch processor is also provided, as well as an API to the conversion engine, which can be used to add automated conversion/structuring capabilities to custom authoring solutions based on Microsoft Word (especially in the paradigm of "smart documents" introduced with Word 2003). For more general information and a detailed feature list, please visit www.hvltd.com. (Fact Sheet: http://www.hvltd.com/misc/WorXStudioFactsheet.pdf.)
- Exegenix: Anything-to-XML conversion tool
2004-01-08 07:52:34 Ryan Germann [Reply]
...anything you can print to a PostScript or PDF file, that is... I am biased, as an Exegenix employee, but am also proud of our technology (the Exegenix Conversion System, or ECS) and the way it analyses the page geometry, and goes farther than most "to XML" conversion utilities. Output of ECS is not 'just' an XML version of captured formatting information (even WordML is 'just' that); the output is richly structured, employing a DTD that we call a "superset of DocBook"... that is, if an object on the page looks like a "Section Title", we tag the object as a <title>, inside a <section>. If you're going to be post-processing your documents via XSLT, you don't have to rely on exact formatting codes... if it looks like a title, it will be tagged as a <title>, without pre-configuring it to recognise particular formatting as a title... no matter which particular typeface or point size is used in that particular document. Don't worry though, if you're inclined to write scripts that act on formatting information, all that formatting information IS part of ECS XML output, so it's all there for you to use... I could go on :-) but suggest you check our website at www.exegenix.com if this type of structured output is of interest to you.
- Exegenix: Anything-to-XML conversion tool
2007-08-24 15:25:48 Anilvarma [Reply]
I worked on this tool for quite a while, transforming print documents to XML format. This tool is really amazing!
Anil
- creating XML from Word
2004-01-06 06:58:55 Katriel s [Reply]
At Live Linx we developed technology in use already for a couple of projects that creates XML from Word. It creates a "base-line" XML similar to DocBook and based on the styles and other info in the Word document, but then produces XML based on your own DTD. It allows mapping information in the Word document (style info, order, pattern-matching of text, etc.) to elements and attributes in your DTD. The technology understands the DTD (ordering, element containers, etc.) and creates an XML file that is certain to be valid to the DTD. I think it is actually pretty cool and better than anything I've seen yet.
So far we are using it to produce SGML for the AMM DTD (Aircraft Maintenance Manual DTD), for legal DTDs (for legislation, policies, and legal decisions), for variations on DocBook and for some other applications. Details available from Live Linx (http://www.livelinx.com)
- creating XML from Word
2004-08-29 05:49:29 O'Reillyprogramming [Reply]
How can we convert XML files to MS Word (.doc)?
Please help me with the .xml and .xslt or .xsl files.
Thanks
- creating XML from Word
2005-11-03 02:39:58 fgff [Reply]
How do I convert Word into XML using JavaScript?
- wvware?
2004-01-05 07:02:59 Brian Ewins [Reply]
I don't know if it's related to OpenOffice these days as well, but the importer used by Abiword was called wvware, and does the word->xml job quite well across a wide variety of platforms (it's in C, but doesn't depend on win32 apis). I'm slightly biased in their favour as I contributed code some years back, however it does seem to have a couple of advantages compared to some of the tools discussed:
- it's command line, not point and click. That makes it easy to build it into a processing pipeline where human intervention isn't possible.
- the xml export format preserves (as I recall) all the useful info from the OLE envelope and Word doc, letting you control post-processing into html, mif, tex, etc (e.g. fitting documents into templates, splitting across pages, using a corporate stylesheet instead of an autogenerated one preserving the formatting, etc). You can't get this by converting to RTF first, for example.
It's worth mentioning that the only reason I was involved was that the large corporate I worked for at the time internally surveyed all the tools available at the time (Y2K?) and rated wvware as the best available. You can find it here:
http://sourceforge.net/projects/wvware
- Another method...
2004-01-03 20:35:40 J David Eisenberg [Reply]
Open the Word document in OpenOffice.org. Save in OpenOffice.org form, and unzip the resulting file. Voila - XML.
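The unzip step can also be scripted. Here is a minimal sketch with Python's standard `zipfile` module, assuming the saved file is a zip archive whose document body lives in an entry named content.xml. The demo archive built below is a made-up stand-in so the code runs on its own; it does not use OpenOffice.org's real vocabulary, and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(package: bytes) -> ET.Element:
    """Parse content.xml out of a zipped document held in memory."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return ET.fromstring(archive.read("content.xml"))


def make_demo_package() -> bytes:
    """Build a stand-in archive so the sketch runs without a real saved file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<doc><p>Hello world!</p></doc>")
    return buffer.getvalue()


root = read_content_xml(make_demo_package())
text = "".join(root.itertext())
print(root.tag, text)
```

For a real file you would read its bytes from disk and pass them to `read_content_xml` instead of the demo package.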
- Another method...
2006-11-15 10:38:07 Paeon [Reply]
I'm not a programmer. I work for a publishing house, but I can see the beauty of creating a form, such that the field names correspond to paragraph styles/xml tags and fields would be content only. Then you could give a writer a template and they wouldn't be able to screw up the styling. I'd be able to output well-formed xml. I'd use the xml to speed up layout in InDesign, then get someone to write xslt's to output web and talking book versions.
Can I get there without being a "hacker?" My programming knowledge is limited to writing scripts. I'd like to look at open office. Do you need to be a programmer to use it cross platform?
- well-formed XML output from Word?
2004-01-02 10:36:21 Stephen Mattin [Reply]
I would take issue with the statement that
MSWord outputs well-formed XML. The
"Save as HTML" in Word 2000 creates garbage
that is not even well-formed.
- Word Schema
2003-12-31 14:12:29 Robert Leif [Reply]
The great advantage of the Microsoft approach is the use of schema. Unfortunately, when one tries to extend the Word schema, wordml, with XMLSPY, some of the supporting schema appear to not be available. Has anyone been able to create a schema that extends Word using Wordml specific data-types and validate their new schema?
- W2XML is better
2003-12-31 07:25:10 Kurt Martin [Reply]
At least we think so. We tried UpCast, xDoc, YAWC and other Word/XML conversion software. By far the best (in terms of price vs. features) is W2XML by Docsoft (http://www.docsoft.com/w2xmlv2.htm). If you know XSL pretty well, then you can do just about anything you need with this software. UpCast is okay, and is fairly inexpensive, but we found W2XML to be a better overall fit for our needs.
- W2XML is better
2004-12-16 06:18:33 davemurphy [Reply]
Depends on your needs, but we were quoted 15k by docsoft v 1500 (or even 500) for upCast. upCast did what we wanted, which was a relatively straightforward batch conversion of Word docs to XML.
