XML Conformance Update
May 10, 2000
Overview
This article is an update to my earlier articles from 1999 that tested the XML 1.0 conformance of parsers. Since then, there has been notable development of the parsers, and updates to the XML 1.0 specification itself. With respect to the dozen Java parsers reviewed last year, there have been a number of interesting changes, both in their conformance and in their open-source status. Microsoft's MSXML parser has also received a significant update, with the release of "technology previews."
Conformance at OASIS and at W3C
Unfortunately the OASIS XML Conformance committee, responsible for the original XML 1.0 test suite, has been rather quiet since it released its first draft last summer. The tests have not been updated in any way, although several issues have arisen since that first draft was released. In fact, there was a recent (April 4) e-mail to the OASIS xmlconf list saying that the conformance committee needed to be "re-initialized," so it's not clear when the necessary updates to its categorizations may be provided.
On another front, the W3C has at last begun to address some of the specification problems that were found by parser writers and by conformance and interoperability testing. In January the W3C updated the XML 1.0 specification by publishing over forty errata that had queued up over the last year, since the original seventeen errata published in early 1999. In subsequent months additional errata have been published. I'm surely not the only one who wants to see just one revised specification, with change bars, instead of such a huge list of errata! The W3C has promised to deliver a "corrected version of XML 1.0," according to their public activity page. It is important for that to be delivered soon.
There appears to be a pause in the OASIS work on XML infrastructure conformance. The ball is temporarily in the W3C's court to address bugs in the XML 1.0 specification that have been reported by its implementors and users. I'm not sure there's enough dialogue in that process: my personal preference would be for the W3C to adopt the IETF approach, and insist on interoperability and conformance testing processes with positive results (and multiple full implementations) before making recommendations.
Updating the Test Database
The next logical step is to ensure that the OASIS test suite captures all those changes to the XML specification resulting from the W3C XML 1.0 errata. Since OASIS hadn't published any such work, and since a year's worth of W3C changes was published in January, I prepared a private update and circulated it in early February (including to the OASIS chair). This updated version of the OASIS 1.0 test suite is the one used in this article. I'm not aware that any of the newer errata (mostly editorial) affect these tests.
My February update (download it here, 4,028,333 bytes) addressed most of the issues I knew about in the original OASIS suite (relative URIs for test documents broken, and a handful of miscategorized test cases). It also addressed many issues coming from the errata W3C published in January -- at least the ones that didn't involve new test cases. See the README in the update for the open issues known to me at that time; only perhaps one or two percent of the tests seem likely to be problematic.
I'll repeat my plea from the earlier articles: If you have issues with these test cases, address them to OASIS and/or W3C. Ideally, do it so that the issue and resolution are matters of public record, without closed-door restrictions on who knows about such issues. (Note that discussions on XML-DEV don't always seem to be enough to resolve such issues in the W3C context.) Open standards need fully transparent processes, and responsive hosting organizations.
XML Parser Testing
This section presents the results of testing with the revised test suite. Some of the parsers examined last year have been omitted from these tests: the field has been slightly narrowed by these removals, with the least conformant processors from last year not getting a second look. Omissions include:
- Lark, roughly as conformant as Oracle's non-validating parser, is no longer being maintained.
- Datachannel's parser has been desupported.
- Microsoft's XML Parser for Java (in their Java virtual machine) has not been bugfixed either.
- Although the Silfide SXP parser did get an update, it was a minor one (version 0.88 to 0.89) that didn't particularly improve its test results.
The software used to do the testing was the latest version of the harness used in 1999. For SAX/SAX2 Java parsers, only a few bug fixes were needed to the original driver, and the driver was converted to use the new SAX2beta API. For Microsoft's MSXML3 parser, the ECMAScript test harness from the original article was used.
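To make the driver's job concrete, here is a minimal sketch of the core check such a harness performs: parse one document, note the most severe condition the parser reported, and compare that with what the test expects. This is not the harness used for this article. The class name, the Outcome values, and the helper methods are hypothetical, and the sketch assumes the parser bundled with a current JDK, obtained through javax.xml.parsers.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical mini-driver; an illustration only, not the article's harness. */
final class ConformanceCheck {
    /** The most severe condition the parser reported for a document. */
    enum Outcome { ACCEPTED, VALIDITY_ERROR, FATAL_ERROR }

    /** Parses one document and classifies what the parser reported. */
    static Outcome run(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final boolean[] sawError = { false };
        reader.setErrorHandler(new DefaultHandler() {
            // Recoverable errors: this is where validity errors are reported.
            @Override public void error(SAXParseException e) {
                sawError[0] = true;
            }
            // Well-formedness violations: rethrow so that parsing stops.
            @Override public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });

        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return Outcome.FATAL_ERROR;
        }
        return sawError[0] ? Outcome.VALIDITY_ERROR : Outcome.ACCEPTED;
    }

    /** A test case passes when the observed outcome matches the expected one. */
    static boolean passes(Outcome expected, String xml, boolean validating) throws Exception {
        return run(xml, validating) == expected;
    }

    public static void main(String[] args) throws Exception {
        String invalid = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a>text</a>";
        System.out.println("validating:     " + run(invalid, true));
        System.out.println("non-validating: " + run(invalid, false));
        System.out.println("not well-formed: " + run("<a><b></a>", false));
    }
}
```

A conformant parser should report the invalid document as a validity error only when validating, accept it otherwise, and always report a fatal error for the document that is not well formed.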
Before we proceed to the test results, I'll sneak a word in about some additional testing I've learned about. Richard Tobin has done some testing of the RXP processor, making some results available. I'd like to see more parsers test and document their conformance, and applaud Richard's work with RXP!
Summaries for each parser tested are presented below. Click on the test scores to find the complete output from the test program. I've also included some notes on the changes undergone by each parser since the previous testing.
Ælfred 2
This new version of Ælfred is not from its original vendors. Microstar, the originators of Ælfred, have been acquired, and neglected to maintain the parser. Ælfred 2 is now part of the SAX2 XML Utilities, which I created. The package also includes a layer for performing validation on top of SAX2. DOM Level 2 support is unbundled.
Why have I resurrected Ælfred? Last year I needed an open source Java XML parser that I could redistribute and realistically bugfix. I ended up taking a parser that seemed orphaned: Ælfred. It became the first publicly available parser to fully support all the new functionality in SAX2alpha and SAX2beta. It's had conformance improvements (as you see here) and yet I think it has preserved the original "small, simple, fast" approach, including trading off some conformance in favor of simplicity. (That tradeoff didn't need to be large.)
The most technically interesting feature of this parser comes from the fact that, since SAX2 exposes additional elements of the "XML Infoset" through new callbacks, most validity constraints can be tested by a cleanly layered module. The "validating Ælfred" here uses a validation layer with the non-validating parser. That layer works just as well with other SAX2 parsers. It can also be used with other software that produces SAX2 events, even doing validation of content as it's dynamically generated!
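The layering idea can be illustrated with a small sketch. It is not the validation layer shipped with Ælfred 2: the class name and the check method are hypothetical, and it tests only one part of one constraint, namely that every element used in the document has been declared. It assumes a SAX2 parser that supports the standard declaration-handler property, here the one bundled with a current JDK, running in non-validating mode.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical layer: flags elements that the DTD never declared. */
final class DeclaredElementCheck extends DefaultHandler2 {
    private final Set<String> declared = new HashSet<>();
    final List<String> problems = new ArrayList<>();

    // DeclHandler callback: called once per <!ELEMENT ...> declaration.
    @Override public void elementDecl(String name, String model) {
        declared.add(name);
    }

    // ContentHandler callback: the DTD has been read by the time this fires.
    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        if (!declared.contains(qName)) {
            problems.add("undeclared element: " + qName);
        }
    }

    /** Runs the layer over a non-validating parse and returns its findings. */
    static List<String> check(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // the parser itself does no validation
        XMLReader reader = factory.newSAXParser().getXMLReader();

        DeclaredElementCheck layer = new DeclaredElementCheck();
        reader.setContentHandler(layer);
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", layer);
        reader.parse(new InputSource(new StringReader(xml)));
        return layer.problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<!DOCTYPE a [<!ELEMENT a ANY>]><a><b/></a>"));
    }
}
```

Because the layer sees only SAX2 events, the same handler could be attached to any event source that reports declarations, not just a parser. A real layer would also check content models, attribute declarations, and the other validity constraints.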
Curiously, I noticed that many of the (few) validity constraints that can't be checked using such a SAX2 layer were also ones that XML parsers often got wrong, or consciously chose not to address. These relate to lexical constraints, both for parameter entities and for some of the standalone document rules. This is perhaps another argument to remove those constraints, or have an XML erratum turn them into well-formedness constraints to bless the way XP and MSXML handle them.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 1062/1072 | 99 percent | This was 865/1065 in the original version. See the parser documentation for the known nonconformance cases. |
Validating | 1039/1072 | 96 percent | This is a validation filter on top of the SAX2 event stream from the nonvalidating parser, and some of these errors come from that lower layer. |
Microsoft MSXML3 (March 2000 Technology Preview)
Microsoft has released a new MSXML3 SDK as a "Technology Preview." The focus is on support for XSLT and XPath, and a handful of parser bugfixes seem to have been included as well. DOM Level 2 support was not mentioned.
This is the only parser addressed in this article that isn't a standard SAX or SAX2 Java parser.
Microsoft has posted an interesting bug list, showing some open XML (and XSLT, XPath, ...) bugs. I applaud this move, which not all companies would do. Since consumers need ways to keep the companies honest, I hope that this sort of transparency becomes much more common. (If markets are conversations, as the Cluetrain folk tell us, the need for bugfixes is an inevitable topic.)
However, reading the bug list, you wouldn't think that any parser conformance bugs had ever been reported! There were no bugs listed (open or closed) for the legal documents this parser rejects (the "legal unicode chars not allowed" is marked as fixed, but it demonstrably wasn't ... even given the weak test coverage of such characters found in the OASIS suite) or illegal documents it accepts ("permits whitespace before XML declaration," "accepts illegal unicode chars," or various validation bugs). It's clear that this bug list has a few problems!
The direct end-user consequence of such bugs is that "Microsoft XML" is demonstrably different from official XML. I've had the unpleasant experience of looking at the XHTML parts of a site where the webmaster had validated using MSXML, and then found that every MSXML-validated page showed a fatal error when used with a conformant XML parser. The interoperability lessons are clear: bugs like those hurt only people using non-Microsoft, standards-conformant, software.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 941/1072 | 87 percent | All but three of the newly passed cases came from changes in the test suite. MSXML2 had previously passed 931/1067 cases. |
Default | 913/1072 | 85 percent | MSXML2 had previously passed 895/1067 cases. Several of the "new" test cases for invalid xml:lang values were correctly rejected. |
Oracle XML Parser for Java v2.0.2.7
Although Oracle's distribution includes an XSLT processor, it does not claim conformance with the final recommendation. Neither SAX2 nor DOM Level 2 support is provided, and this is not an open source package.
The test results show that this release hasn't addressed conformance very much at all.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 928/1072 | 86 percent | This was previously 904/1065; some of those improvements clearly didn't come from passing recategorized tests. |
Validating | 875/1072 | 82 percent | This was previously 871/1065; no real change. |
Sun: Java API for XML Parsing (formerly TR2)
Sun has created an API known as JAXP, which supports vendor-independent bootstrapping of SAX1 or DOM Level 1 implementations. It shipped that with what looks like Sun's second Technology Release (TR2) XML parser. That bundled parser is what is evaluated here. JAXP doesn't address SAX2, or DOM Level 2.
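As an illustration of what vendor-independent bootstrapping means in practice, the sketch below asks the JAXP factories for whatever SAX and DOM implementations are configured, without naming any vendor class. It uses the javax.xml.parsers API as found in a current JDK, which has grown since the release reviewed here. The class and method names (JaxpBootstrap, rootViaSax, rootViaDom) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: application code never names a parser vendor. */
final class JaxpBootstrap {
    /** Finds the root element name using whichever SAX parser is configured. */
    static String rootViaSax(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final String[] root = { null };
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // the first start tag is the root element
                }
            }
        });
        return root[0];
    }

    /** Finds the root element name using whichever DOM builder is configured. */
    static String rootViaDom(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><item/></doc>";
        System.out.println(rootViaSax(xml));
        System.out.println(rootViaDom(xml));
        // Shows which implementation the factory lookup actually selected.
        System.out.println(SAXParserFactory.newInstance().getClass().getName());
    }
}
```

Swapping in a different parser is then a deployment decision (for example, through the javax.xml.parsers.SAXParserFactory system property) rather than a code change.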
Not much has changed in the conformance status of Sun's parser. Many of the differences from the previous release relate to the new validity constraint for xml:lang attributes. The XML 1.0 specification had provided many rules for what those values must be, but hadn't specified what sort of error would result from violating them. This parser always reports a non-fatal error, which is correct when validating, but not otherwise.
Sun's underlying parser has now been contributed to the Apache XML project, so it's now open source.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 1066/1072 | 99 percent | Some of the XML errata made some behaviors of this parser become nonconformant; previously 1065/1065. |
Validating | 1071/1072 | 99 percent | Most of the new cases from the XML errata were handled already. The error case came from the errata. Previously this scored 1065/1065. |
Xerces/Java 1.0.3 (formerly IBM XML4J)
One of the most visible changes in the Java XML toolkit world in the last few months has been the formation of the Apache XML project. The project incorporates software from many sources: IBM contributed a bugfixed version of their Java XML parser. The developers contributing to the Apache XML project (including some from IBM) appear to be supporting APIs like SAX2 and DOM Level 2 with a reasonable degree of alacrity, neither the first nor the last to "market" with such features.
The xml:lang erratum caused most of this parser's failures. It treats "invalid" values as fatal errors, which is too severe.
This parser is part of a fairly large package of XML tools, not all of which are in Java, that are being integrated together. If you're after such a package, rather than smaller modules, this may be what you're looking for. But if you want a small modular parser, or a minimalist aesthetic, this isn't it.
When looking at Xerces-J, I see the genesis of this parser in IBM's work. I'm particularly pleased to have seen IBM fix the conformance problems this parser had at first. I'd like to see other vendors make the same commitment to removing interoperability problems.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 1066/1072 | 99 percent | Bigtime improvements. Previously 902/1065, with two seriously visible bugs. |
Validating | 1065/1072 | 99 percent | Bigtime improvements. Previously 902/1065, with two seriously visible bugs. |
XP 0.5 (retest only)
Although XP has not been updated, it is being re-evaluated here since it did so well against the original conformance tests, and test recategorization (mostly caused by errata to the XML specification) might have changed its ranking. XP has no SAX2 or DOM (or DOM Level 2) support.
The XML 1.0 errata didn't affect the score for this parser much. The seven "new" passes were previously listed as "optional errors," but were reclassified as validity errors by one of the new XML validity constraints.
Mode | Raw Results | Pass Rate | Notes |
---|---|---|---|
Non-Validating | 1057/1072 | 98 percent | This was 1050/1065 before, so nothing much has changed for this parser. Still a good, fast option. |
Summary
The following tables summarize the test scores of each of the parsers, in order of raw score and then alphabetically. Validating and non-validating parsers are separated, to simplify comparisons.
Validating Parsers
Parser | Pass Rate | Notes |
---|---|---|
Sun JAXP 1.0 | 99 percent | That's one test failure, from the errata. |
Xerces/Java 1.0.3 | 99 percent | Improved, and joined the Apache XML project. |
Layered Validator (distributed with Ælfred 2) | 96 percent | Layers over any SAX2 parser with declaration and lexical handler callbacks. |
MSXML3 Technology Preview | 85 percent | No real changes from MSXML2; "default" mode won't report validity errors associated with a missing DTD. |
Oracle V2.0.2.7 | 82 percent | No real changes. |
With validating parsers, there are two clear groupings. All the open source validators are now at the top of the scale, and the others start about eleven percent below that in test score. This represents a change from the first rankings, when no open source validating Java parser existed. Now there are three good ones to choose from, which can compete to keep each other honest!
Non-Validating Parsers
Parser | Pass Rate | Notes |
---|---|---|
Sun JAXP 1.0 | 99 percent | Errata made this fall in rating. |
Xerces/Java 1.0.3 | 99 percent | Improved, and joined the Apache XML project. |
Ælfred 2 | 99 percent | Conformant despite trading off conformance in favor of small size and simplicity. |
XP 0.5 | 98 percent | No change. |
MSXML3 Technology Preview | 87 percent | Minor fixes to MSXML2. |
Oracle V2.0.2.7 | 86 percent | Minor fixes. |
As with validating parsers, there are again two groupings, but there's less variation in the conformance level of the open source parsers. The eleven-point gap between the groupings is again striking.
Conclusion
Some vendors have shown a strong commitment to fully conforming to the core web standards, by fixing their known XML conformance violations. Others clearly haven't, and there's a curious gap of over ten percent between test results for those two groups. It is important to stop problems from getting big, and fix such infrastructure conformance bugs sooner, not later.