Good Things Come In Small Packages
March 22, 2000
Last week in XML-Deviant we explored the design of SVG and discovered that concern over file size was behind several contentious design decisions. This week we focus on a discussion that sought a more generic solution to XML verbosity.
Compression Techniques
Like any textual markup language, XML is verbose. An XML document carries a lot of "redundant" data, including white space and repeated element and attribute names. XML documents are therefore prime candidates for compression. Simon St. Laurent asked if anyone was working on a standard compression format:
I'm starting to get concerned about the volume of complaints I'm getting from readers and folks in Web development forums who are starting to argue that XML's verbosity is a problem, especially for things like transmitting vector graphics information. There are a lot of wasted bits in XML documents -- and of course in HTML and other text documents as well.
Judging from the response, others have had similar thoughts and encountered similar complaints. Two potentially useful products were mentioned: XMill and XMLZip, both of which are compression tools designed for XML. Mark Baker suggested that, because HTTP supports compression through its Accept-Encoding and Content-Encoding headers, there's no need to wait for a standard:
I could see a generic XML-specific compression mechanism being developed; one that understands what "<" and ">" mean. But you don't have to wait for that to compress your XML today.
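Baker's point can be sketched in a few lines. The example below is a simplified illustration, not a real server: the function names `encode_response` and `decode_response` and the sample document are assumptions made for this sketch, and the Accept-Encoding check ignores q-values that a complete implementation would need to honour.

```python
import gzip

def encode_response(xml_text, accept_encoding):
    """Build (headers, body) for an XML response, gzip-compressed if the client accepts it."""
    body = xml_text.encode("utf-8")
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    # Simplified negotiation: real code must also honour q-values such as "gzip;q=0".
    offered = [token.split(";")[0].strip().lower() for token in accept_encoding.split(",")]
    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body

def decode_response(headers, body):
    """Undo the content encoding named in the response headers, then decode the text."""
    if headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")

# Hypothetical vector-graphics-like document with heavily repeated markup.
SAMPLE = "<drawing>" + "".join(
    '<point x="%d" y="%d"/>' % (i, i * 2) for i in range(200)
) + "</drawing>"

headers, body = encode_response(SAMPLE, "gzip, deflate")
print(len(SAMPLE.encode("utf-8")), "bytes as text,", len(body), "bytes on the wire")
```

Because the encoding is announced in a header, a client that never sends `Accept-Encoding: gzip` simply receives the plain text, which is what makes this approach usable without any XML-specific standard.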
A binary encoding for XML is another means by which file sizes, and ultimately bandwidth, could be reduced. Ingo Macherius pointed out the work of the WAP Forum:
The WAP community has developed an architecture for binary XML encoding, which includes efficient compression.
Obviously, an efficient format is essential within the limited bandwidth available to mobile devices (although this restriction will no doubt be alleviated at some point in the future).
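To give a feel for why a binary encoding can be so much smaller, here is a toy sketch in which element and attribute names are replaced by one-byte tokens from a table both sides share in advance. This is not the WAP Forum's format; the `NAMES` table, the byte layout, and the function names are all assumptions invented for illustration. The sketch also ignores tail text and assumes fewer than 256 names, attributes, and children per element.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared vocabulary: sender and receiver must agree on it beforehand.
NAMES = ["drawing", "point", "x", "y"]
TOKEN = {name: index for index, name in enumerate(NAMES)}

def _put_text(out, text):
    """Append a string as a 2-byte big-endian length followed by its UTF-8 bytes."""
    data = text.encode("utf-8")
    out += len(data).to_bytes(2, "big") + data

def encode(elem, out):
    """Serialize an element as: tag token, attributes, text, then child elements."""
    out.append(TOKEN[elem.tag])
    out.append(len(elem.attrib))
    for name, value in elem.attrib.items():
        out.append(TOKEN[name])
        _put_text(out, value)
    _put_text(out, elem.text or "")
    out.append(len(elem))
    for child in elem:
        encode(child, out)
    return out

def decode(data, pos=0):
    """Rebuild an element from the bytes at pos; returns (element, next position)."""
    def get_text(p):
        n = int.from_bytes(data[p:p + 2], "big")
        return data[p + 2:p + 2 + n].decode("utf-8"), p + 2 + n

    elem = ET.Element(NAMES[data[pos]])
    attr_count = data[pos + 1]
    pos += 2
    for _ in range(attr_count):
        name = NAMES[data[pos]]
        value, pos = get_text(pos + 1)
        elem.set(name, value)
    text, pos = get_text(pos)
    elem.text = text or None
    child_count = data[pos]
    pos += 1
    for _ in range(child_count):
        child, pos = decode(data, pos)
        elem.append(child)
    return elem, pos

SAMPLE = '<drawing><point x="1" y="2"/><point x="3" y="4"/></drawing>'
root = ET.fromstring(SAMPLE)
binary = bytes(encode(root, bytearray()))
restored, _ = decode(binary)
print(len(SAMPLE), "bytes as text,", len(binary), "bytes tokenized")
```

The saving comes entirely from the shared table, which is also the catch: a receiver without the same table cannot read the document, which is part of why such a scheme is easier where one body controls the whole transmission cycle.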
The topic of a binary encoding for XML has cropped up before on XML-DEV. Discussion last year explored the requirements and issues behind the idea. The interested reader may wish to look at two threads in particular: "Is there anyone working on a binary version of XML?", and "Binary-encoding of XML for communication."
XML's Future Could Depend on Efficiency
In response to these suggestions, St. Laurent clarified that his aim was the seamless integration of compression with current transmission and processing mechanisms, rather than any specific technology:
While these various tools for compressing XML are interesting, and use a wide variety of promising strategies, none of them are currently set up to be built into a compress-before-transmission/decompress-on-receipt framework that's invisible to the user.
The WAP approach is probably the closest to what I'm thinking about, but the WAP forum has control over the entire transmission cycle. Building support for this binary encoding into WAP devices is easy.
Making compression/decompression work across existing Internet frameworks is a lot harder...
In a thought-provoking response, John C. Schneider outlined a body of work carried out by MITRE, a not-for-profit organization that works for the US government. The work, based around a format called Message Text Format (MTF), parallels many of the W3C efforts to date. Projects included the development of validating and non-validating parsers, schemas, validation tools, and an object model. Schneider indicated that one product of this work was a compression mechanism that could be tweaked for XML.
One of the concepts we devised fits the description you give below and, with sufficient tweaking, could form the basis of an efficient XML encoding scheme. The algorithm does not rely on character redundancy and, as such, works equally well for small information objects that tend to get larger using algorithms like zip. In addition, its design permits it to be read/written directly from an appropriately modified DOM implementation instead of incurring the cost of a separate compression/decompression step.
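Schneider's remark that small information objects "tend to get larger using algorithms like zip" is easy to check. The sketch below uses gzip from the Python standard library as a stand-in for "algorithms like zip"; the sample message and the helper name `sizes` are assumptions made for this illustration, and it says nothing about how the MITRE algorithm itself works.

```python
import gzip

def sizes(xml_text):
    """Return (uncompressed size, gzip-compressed size) in bytes for a document."""
    raw = xml_text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# A single short message: little repetition for the compressor to exploit,
# and the gzip header and trailer alone add 18 bytes.
small = '<position lat="51.5" lon="-0.1"/>'

# The same message repeated many times: plenty of character redundancy.
large = "<track>" + small * 500 + "</track>"

for label, doc in (("small", small), ("large", large)):
    before, after = sizes(doc)
    print(label, before, "->", after, "bytes")
```

The small message grows because the fixed framing overhead outweighs any saving, while the large document shrinks dramatically. A scheme that does not depend on character redundancy avoids that penalty for short messages.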
Schneider saw a more efficient binary encoding of XML as being "inevitable," and hoped that it would become ubiquitous. Ideally parsers would be capable of reading both text and binary encodings. The exact encoding would be transparent to the user. Citing previous experience, Schneider stressed the importance of an efficient encoding:
For XML's long term viability, I believe it is strategically important to design a more efficient encoding. I'd hate to see XML unseated by a more efficient format a few years down the road, reducing the importance of the great XML work that's been done and introducing new interoperability barriers. While this scenario might seem far fetched today, it occurred within my customer's community several years ago (even though their original format was about 10 times smaller than XML).
Commenting on the WAP initiative, Schneider suggested that the activity might not result in a generally applicable standard:
...their current path appears less likely to result in a general purpose XML encoding for all XML users than if the work was done in an environment like the W3C or IETF... If my projections about the eventual development of a general purpose, efficient XML encoding are true, this change in focus may be strategically important to the long term viability of WAP.
Wrapping Up - What's Inside?
The incorporation of compression into a general "XML infrastructure" is related to a much wider problem: packaging of XML documents. For a given document, there is a range of information useful to an XML processor that is not directly related to the data it contains. Identifying a compression mechanism is one example; style sheets and schemas are others. Ideally this additional information would be available from a separate packaging mechanism.
Don Park believed that packaging should be the primary focus:
I doubt we will be able to agree on a standard compression format. Rather, I would like to work on [making the] XML packaging standard proceed faster to encompass arbitrary encoding of XML documents and fragments. XML's relationship with MIME should also be strengthened.
With a generic packaging framework, it should be possible to support multiple compression standards. In this sort of environment, alternate standards could compete and flourish. Thus we avoid the need to dictate a specific solution at this early stage. Simon St. Laurent observed that packaging is an area neglected by the W3C:
I definitely agree on the need for XML packaging. I've been disappointed with the slow progress (is there any?) on packaging at the W3C, and look forward to seeing more activity.
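The idea of a package that names its own encoding, leaving room for several compression schemes to coexist, can be sketched as follows. Everything here is hypothetical: the `CODECS` registry, the dictionary-based package, and the encoding labels are assumptions for illustration and do not correspond to any proposed packaging standard.

```python
import bz2
import gzip
import zlib

# Hypothetical registry mapping an encoding label to (compress, decompress) functions.
CODECS = {
    "identity": (lambda data: data, lambda data: data),
    "gzip": (gzip.compress, gzip.decompress),
    "deflate": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

def pack(xml_text, encoding):
    """Wrap a document in a package that records which encoding was applied."""
    compress, _ = CODECS[encoding]
    return {"encoding": encoding, "payload": compress(xml_text.encode("utf-8"))}

def unpack(package):
    """Recover the document using the encoding named in the package itself."""
    _, decompress = CODECS[package["encoding"]]
    return decompress(package["payload"]).decode("utf-8")

DOC = "<order>" + "<item sku='A1' qty='2'/>" * 50 + "</order>"
for name in CODECS:
    package = pack(DOC, name)
    print(name, len(package["payload"]), "bytes")
```

Since the receiver only consults the label, a new codec can be added to the registry without changing the package format, which is the kind of competition between alternatives that a generic framework would allow.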
A look back over the XML-DEV archives shows that packaging, too, is a recurring topic. Last year saw several threads relating to the issue: "Packaging and Hub Documents," and "Packaging and Related-Resource Discovery." It would appear that little real progress has been made on this front, although Simon St. Laurent's XML Processing Description Language (see also "Profiling and Packaging XML") has been a step in the right direction.
St. Laurent, who is currently involved in efforts to improve MIME support for XML content, commented that more fundamental changes may be required:
I don't think anyone expected that XML might require a rethinking of the infrastructures we use to carry it, but I'm headed more and more that direction. It may still be too early in the game, though -- after all, XML is still a tiny portion of the overall traffic on the Internet.
However tiny XML-based traffic is today, if the current rate of adoption continues, XML transmission will be ubiquitous before long.
Now may be the best time to consider some wider architectural problems: perhaps it's time to take a break from producing the unceasing flow of new standards. Considering how these standards fit together will reinforce our efforts toward the holy grail of Interoperability. Experiences from organizations like MITRE, as well as feedback from developers "on the factory floor," will be vital.