Intuition and Binary XML
April 18, 2001
XML-DEV has been revisiting a well-loved debate this week, namely, binary encoded alternatives to XML, last encountered in January.
The Need For Evidence
Lurk long enough on any mailing list, and you'll always find a few ideas that refuse to go away. XML-DEV has more than a few of its own; the concept of a binary XML is one which ranks up there with "Namespaces: Good or Bad?" and "Why the W3C/ISO/OASIS/IETF (delete as appropriate) Process is Just Plain Wrong." All of them are good subjects if you're feeling lonely and fancy reading some email.
It's easy to be dismissive of these debates, but they're often a sign that there's some fundamental problem or common misunderstanding that needs to be addressed. Alaric Snell characterized the reaction of many developers when faced with XML for the first time.
To many programmers, XML *looks* inefficient and awkward. That was my first thought when presented with the idea of using it for data interchange; luckily I was enamored enough of the good work being done on various interesting schemas that suggested this data format (although technically lacking in many respects) may actually achieve "ubiquitous" status.
This reaction is likely to be a kind of gut feeling. After all, XML is a plain text format containing lots of whitespace, so it must be inefficient, right? Unfortunately, gut reactions rarely lead to good results when it comes to optimization. As Tim Bray noted, empirical evidence and real-world test profiles are what's needed.
...an argument that unpacking a binary format (particularly on a machine whose binaries are different and you have to bit-swizzle) is significantly faster than XML parsing a la expat or MSXML, needs to be supported by actual empirical data rather than by assertion. And suppose, as a thought experiment, that this were true; if you were to speed up the XML parsing/generating part of an XML-using application, how much would that speed up the whole application? You'd need to know what proportion of its time it spends parsing/generating XML. In some apps, this proportion is going to be very small.
Bray recounted painful attempts to optimize without accurate profiling information, while in the full flush of the enthusiasm one encounters when presented with an optimization problem (something that Sean McGrath later termed the "rush of code to the hand").
In my experience, assertions about what will make software run faster, when not backed up by empirical profiling data, are not worth wasting time on. I have seen untold amounts of time wasted by overeager junior programmers who just knew, "without needing empirical evidence", that putting a hash-table in, or some such, would make their app go faster, when some profiling work would have shown that their performance was dominated by I/O buffer management.
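Bray's thought experiment can be made concrete with a few lines of arithmetic. The sketch below is not from the thread: it applies the standard Amdahl's law formula to the "proportion of time spent parsing" he mentions, and adds a rough timing helper. The function names and the `other_work` callback are hypothetical, chosen only for illustration.

```python
import time
import xml.etree.ElementTree as ET


def overall_speedup(parse_fraction: float, parse_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only the parsing
    share of the run time (parse_fraction) gets parse_speedup times faster."""
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be between 0 and 1")
    if parse_speedup <= 0.0:
        raise ValueError("parse_speedup must be positive")
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)


def measure_parse_fraction(xml_text: str, other_work) -> float:
    """Rough estimate of the share of wall-clock time spent parsing.

    other_work is a hypothetical callable standing in for everything
    else the application does with the parsed document.
    """
    start = time.perf_counter()
    root = ET.fromstring(xml_text)
    parsed = time.perf_counter()
    other_work(root)
    done = time.perf_counter()
    total = done - start
    return (parsed - start) / total if total > 0 else 0.0
```

For example, if an application spends 5% of its time parsing and a binary format made that step ten times faster, `overall_speedup(0.05, 10.0)` comes out at roughly 1.05, a gain of about 5% for the application as a whole. A single timing run like `measure_parse_fraction` is only a starting point; a real profiler gives a far more trustworthy picture.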
Several members of XML-DEV were forthcoming with anecdotal evidence and experience with different XML encodings. Oleg Paraschenko reported that his Pyx parser project (Pyx is a line-oriented notation for XML) was actually slower than a full parser. Henry Thompson has more recently learned the hard way that binary is not necessarily faster.
I just wasted a weekend getting my schema validator to dump the internal form of the 'compiled' schema-for-schemas, on the _assumption_ that reloading that would be faster than parsing/compiling the schema-document-for-schemas every time I needed it. Wrong. Takes more than twice as long to reload the binary image than to parse/compile the XML.
There are _lots_ of people out there working hard to make parsing/writing XML blindingly fast. With respect, you're unlikely to beat them.
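Thompson's experience is easy to test for oneself rather than take on faith. The following is a minimal sketch of such a measurement, not a reconstruction of his validator: it assumes a toy "compiled" form (nested tuples), uses Python's `pickle` as a stand-in binary image, and all function names are hypothetical. It deliberately makes no claim about which side wins, since that depends on the data, the parser, and the serializer.

```python
import pickle
import timeit
import xml.etree.ElementTree as ET


def compile_tree(elem):
    """Toy 'compiled' form: (tag, attributes, text, children)."""
    return (elem.tag, dict(elem.attrib), elem.text or "",
            [compile_tree(child) for child in elem])


def make_document(n_items=2000):
    """Generate a synthetic XML document to measure against."""
    items = "".join(
        f'<item id="{i}"><name>item {i}</name></item>' for i in range(n_items)
    )
    return f"<catalog>{items}</catalog>"


def compare(xml_text, repeat=5):
    """Time parse-and-compile from XML against reloading a binary image."""
    compiled = compile_tree(ET.fromstring(xml_text))
    image = pickle.dumps(compiled, protocol=pickle.HIGHEST_PROTOCOL)
    # Both routes must produce the same in-memory structure.
    assert pickle.loads(image) == compiled
    parse_seconds = min(timeit.repeat(
        lambda: compile_tree(ET.fromstring(xml_text)), number=1, repeat=repeat))
    reload_seconds = min(timeit.repeat(
        lambda: pickle.loads(image), number=1, repeat=repeat))
    return {
        "parse_seconds": parse_seconds,
        "reload_seconds": reload_seconds,
        "xml_bytes": len(xml_text.encode("utf-8")),
        "image_bytes": len(image),
    }
```

Calling `compare(make_document())` returns both timings and both sizes, which is the kind of evidence the thread kept asking for. The numbers are only meaningful for the workload measured.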
Yet because there are few empirical results, the debate cannot be put to rest and the hand-waving continues. Even those big projects that have adopted a binary encoded XML format have not produced a convincing case. One commonly cited example is WBXML, used in WAP devices that are deemed to have little processing power and limited bandwidth, yet even this case is arguable, as Sean McGrath has pointed out.
I do a lot of work with WAP and experience with it has turned me off binary XML encodings fairly comprehensively. I don't think WAP demonstrates the advantage of a binary encoding. I think it demonstrates quite the opposite.
My tests repeatedly show that the difference between response times of the *same* system serving compact HTML (iMode) to an iMode client browser versus WML to a WML browser is negligible.
For my money, iMode got it right. A stripped down HTML with plain text -- pure as the driven snow -- flowing from client to server.
This most recent discussion also highlighted another example which has the potential to become extremely widely used, MPEG-7. The MPEG-7 effort is "daring to describe" multimedia data using XML and provides a binary alternative for encoding this XML data. But, as Claude Seyrat notes, even here a degree of choice is being allowed.
Since the beginning, MPEG-7 has been XML driven. The MPEG-7 community is very reluctant to follow another development path. However in MPEG-7 everybody recognizes the need for a binary format.
When designing MPEG-7, the following policies have been adopted:
- to stay as close as possible to the XML spirit by the adoption of a textual version designed with XML Schema,
- to define a binary format that uses XML Schema definition to generate an efficient encoding scheme,
- to allow one to decide whether he wants to use binary or textual format.
Binary encodings may be suitable for applications where the format and data are known in advance and suitable optimizations can be made. However, deriving a generally useful binary encoding is much harder, as Ramin Firoozye pointed out.
Binarizing of the form in WML does actually make the content smaller -- but that's because they've already pre-defined the element tokens, well-known attributes, and common substrings. Binarizing streaming XML of an unknown variety actually slows down the application because of the overhead for building an on-the-fly dictionary (and in worst-case scenarios -- requiring multiple passes over the source). Binarizing through object-streaming actually makes the file size larger due to overhead for storing internal tree information.
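The difference Firoozye describes between a predefined token table and an on-the-fly dictionary can be illustrated with a toy encoder. This is a sketch under loud assumptions: the byte layout below is invented for illustration and is not the actual WBXML wire format, it encodes only a flat sequence of tag names, and both function names are hypothetical.

```python
def encode_static(tags, table):
    """Encode tag names with a token table both sides agreed on in advance.

    Every tag costs exactly one byte because nothing about the
    vocabulary has to travel with the data.
    """
    return bytes(table[tag] for tag in tags)


def encode_dynamic(tags):
    """Encode tag names of an unknown vocabulary, building the table as we go.

    First sighting of a tag: marker byte 0, a length byte, then the name
    itself. Later sightings: the one-byte code assigned at first sighting.
    """
    table = {}
    out = bytearray()
    for tag in tags:
        code = table.get(tag)
        if code is None:
            raw = tag.encode("utf-8")
            if len(table) >= 255 or len(raw) > 255:
                raise ValueError("toy format limit exceeded")
            table[tag] = len(table) + 1
            out += bytes([0, len(raw)]) + raw
        else:
            out.append(code)
    return bytes(out)


tags = ["p", "b", "p", "p"]
print(len(encode_static(tags, {"p": 1, "b": 2})))  # prints 4
print(len(encode_dynamic(tags)))                   # prints 8
```

Even in this tiny case the dynamic encoder spends extra bytes and extra bookkeeping on learning the vocabulary, which is the overhead a fixed application language such as WML avoids by defining its tokens up front.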
Len Bullard succinctly summed up the challenge that proponents of alternative binary XML formats should meet (with hard evidence) for the debate to move forward.
The question is not is a binary useful for any given XML application language, but is a standard XML binary useful for all of them. WML has one because it needs one and it is good for WML. Generalizing that leads to false conclusions because the form and fit is not the same for the function.
Why Binary Isn't Enough
Other members of XML-DEV sidestepped the binary versus text processing speed issue entirely, homing in on other aspects of XML that are significant advantages in their own right and would be lost with a binary format.
David Brownell highlighted XML's openness.
Binary formats are bad because they tend towards being proprietary, and that's the last thing that should happen to the world's next "intellectual commons".
Auditability was a significant advantage in Clark Evans's book.
XML is going to succeed where other file formats have failed because it is auditable -- I, a mere human, can pull up the code and read it with my own eyes and without an intermediate reader which could be at fault.
...Binary XML is dead on arrival. Getting away from binary formats is the _entire_ reason for XML. Being able to audit your inputs and outputs.
The issue of XML as an easily readable format may be too easily dismissed; after all, who wants to sift through tangles of angle-brackets? Yet the point is not that XML should be readable to the everyday user, but that it is readable to a developer and can therefore be deciphered, reverse-engineered, tested, and audited much more easily than a binary alternative.
Eric Bohlman believed that the discussion was pitched at too low a level; saving CPU cycles is not the issue. In environments where data is being exchanged between multiple organizations, other factors become important. Not least among them are maintenance and documentation, as well as the social implications of agreeing on a format in the first place.
And let's not forget the *social* aspects (the ultimate non-geeky stuff) of data interchange. When several unrelated organizations, or even departments within an organization, need to exchange data, there's an enormous advantage to using a data format that was created by a third party rather than by one of the players, namely that there's no rivalry over *which* player gets to create the format. Again, if one party could simply impose a format by fiat, everything would be cool, but in real life, if you don't get full "buy in" from all the players, you're going to see a lot of friction (usually in the form of "creative incompetence" where everybody's implementations differ in slight but important details) that will dissipate a lot of energy as heat. Yes, this falls into the realm of what hardcore geeks would call "touchy-feely" stuff, but the fact is that psychological/verbal/non-quantitative/stereotypically-female/"touchy-feely" considerations play important roles in any real-life human endeavor involving more than one person, and the fact that one might be more comfortable with bits and chips than with human interactions doesn't change that reality.
Characteristically philosophical, Walter Perry cut to the heart of the issue: once on the Internet you no longer know how, or by whom (or even what), your data will be processed, so you cannot make any assumptions about how it will be used. Perry has long argued that facilitating this kind of usage is the key advantage of XML.
The savings to be realized through the use of a binary format are premised upon parsing the XML text only once and thereafter passing around or storing the binary encoded output. Such a mechanism demands that every user of that data expect, or accept, the identical output of that parse -- effectively, a canonical rendering. It is only such unanimity which would permit every user to accept the product of a parse performed by any of them. In the rapidly growing internetworked universe, it is precisely that unanimity which we cannot reasonably expect...I argue that the reasonable understanding of XML acknowledges that every use of an XML document begins with a fresh parse of that document in the context of that use. That parse is not required to instantiate XML as XML -- the document itself is already that instance -- but to instantiate the particular objects which that specific use of the XML document expects and requires...You may choose to drive that instantiation off of something other than XML syntax, but it is not then XML processing, and what you lose, most significantly, in doing that is the ability for the same text to be understood and usefully processed at the same time as something very different, but simultaneously the valid basis for a transaction between, utterly dissimilar users.
This is an important point, as it grounds much of the effort behind XML. XML is about freeing data so that it can reach its full potential by packaging it up in an appropriate way; it's fundamentally not about standardizing complicated software architectures. This is not to discount any benefits that may come from looking at innovative ways of processing XML data. As Rick Jelliffe observed, innovation can be applied without recourse to a binary format.
It is completely possible to make inefficient binary formats...or ones with performance penalties. It is completely possible to provide indexes in XML documents...It is possible to provide multipart documents with an XML document and a binary index for searching. It is possible to provide non-XML text formats that have nice performance characteristics...my STAX short-tagging compression which can give well over 50% reduction in file size (in suitable cases) for just a paragraph of extra lines of non-processor-taxing code inside an XML parser. And there are more efficient parsers possible (especially for trusted data) if they assume WF documents...
...And there is also the other cat in the bag: sparse, lazy DOMs (i.e. DOMs constructed lazily as required from a fragment server) may require far less processing than retrieving full documents whether those documents are sent as XML or non-XML.
...the use-case is not merely readability, however excellent that constantly shows itself to be. A lot of the supposed benefits of a binary format may be nothing to do with the binary-nature itself, and just as doable in vanilla XML or in a text format.
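Jelliffe's point that building less of the tree can matter more than the encoding is something plain XML tooling already allows. As a rough illustration only, the sketch below uses `xml.dom.pulldom` from the Python standard library as a stand-in for a lazily built DOM: it is not a fragment server, and the function name and sample document are hypothetical. Only the elements that are asked for get expanded into DOM nodes.

```python
from xml.dom import pulldom


def text_of_elements(xml_text, wanted_tag):
    """Return the text of each wanted_tag element, expanding only those
    elements into DOM subtrees and streaming past everything else."""
    results = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == wanted_tag:
            # Build the DOM subtree for this one element only.
            events.expandNode(node)
            results.append("".join(
                child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            ))
    return results


sample = ("<lib><book><title>A</title><body>long text</body></book>"
          "<book><title>B</title></book></lib>")
print(text_of_elements(sample, "title"))  # prints ['A', 'B']
```

The `body` elements are still tokenized by the parser, but no tree is built for them, so the work done scales with what the application actually uses.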
In short, the consensus is that a binary XML will at best equal the advantages of XML as it is today. Greater rewards will be found from pursuing the application, and not the re-engineering, of XML.