Microsoft XML Parser Conformance
November 17, 1999
Last September, David Brownell conducted a review of XML parsers for XML.com, testing them for conformance to the XML 1.0 specification. In this follow-up article, he tests Microsoft's MSXML.DLL parser, as found in Internet Explorer 5. Unlike the previously tested parsers, the Microsoft parser does not provide the SAX interface used in that testing procedure. In collaboration with Microsoft, the author therefore constructed a JavaScript DOM-based test harness. The tests gave the Microsoft parser a "pretty good" rating, in the top 25% for conformance. They did, however, reveal a serious flaw in DTD handling and validation, for which Brownell presents a workaround.
In my earlier conformance review for Java XML Processors, I evaluated a dozen XML processors written in Java and using the SAX API. Feedback I got from that article was generally positive, and several readers suggested I provide a corresponding evaluation of another widely available XML processor: the Microsoft XML Parser (MSXML.DLL), which is the one bundled with the Internet Explorer 5 web browser. This article provides such an evaluation.
Some readers were also confused about Microsoft's Java XML processor, called "MSXML" in that earlier review. Briefly, Microsoft has had several implementations of XML processor technology. While today one tends to only hear about the latest version of such technologies, they have all been called "MSXML," or "MS XML," in common usage, by numerous people, including some Microsoft staff. Since the Java processor hasn't been updated in well over a year, some confusion seems inevitable. The Java processor was formally called the Microsoft XML Parser for Java. I hope that helps to clarify the distinctions between the various packages; the details of the two reviews should also help.
The version of the Microsoft XML (MSXML) processor reviewed here is the one that has been bundled with Microsoft's Internet Explorer 5.0 web browser. It can be accessed as "MSXML.DLL," and can be redistributed with other software, as part of Win32 applications. Since it provides a COM API, it can be used from JavaScript, C/C++, Visual Basic, and other COM-aware programming languages. It can even be used from Java, but for most Java developers, that support is not particularly useful since it requires using Microsoft's JVM, and does not support the standard SAX or W3C DOM APIs (org.w3c.dom.*).
Another Test Harness for JavaScript, DOM, and MSXML.DLL
I encourage you to read my earlier article for more background on testing XML conformance. Briefly, there are several kinds of tests, which are supported by test cases—not yet in final form—collected and organized by a joint OASIS/NIST working group. These tests need to be run through a test harness using some particular API to access the XML processor under test. The earlier review used SAX as that API, but that would not work for the MSXML.DLL processor, so a new harness was needed. The harness produces some sort of testing report. This article includes the raw test reports, which are in an HTML format that should be easy to use.
I was pleased to receive queries from Chris Lovett, a Program Manager in the XML Group at Microsoft, about those test cases. After some email back and forth, I had a basic JScript test harness in my mailbox, which was good, since I usually stick to Java, and it's always a lot easier to improve something that already works! That version has been substantially enhanced, and you can see the reports it now generates in the review below, or run the tests yourself and see what turns up on your own system.
As before, that test harness is provided here as an Open Source tool for general use. In this case, I've put it under the GNU General Public License. I hope the various DOM portability issues will get resolved so that the same code can be used with the XML processors in Mozilla (in some beta version soon) and in Internet Explorer.
Also as before, I'd like to emphasize that these reports are in no way official. They don't represent anyone's opinion but my own.
You may recall comments in the earlier review about problems using DOM as a standard XML processor API. Those still hold true. This harness had to use Microsoft-proprietary APIs to acquire a DOM Document object, to populate it with the contents of an XML file, and to detect and report parsing errors. I still remain hopeful that those issues, shared by all bindings of DOM, can be fixed in some upcoming version of the DOM API so that applications using DOM can use any vendor's implementation, in the same way that SAX currently provides an OS-independent API.
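To make that dependence on proprietary calls concrete, here is a minimal sketch of the core of such a harness. It is not the harness distributed with this article. The `async` and `validateOnParse` properties, the `load` method, and the `parseError` object are MSXML extensions rather than W3C DOM. The `runCase` name and the `createDocument` factory parameter are invented for illustration, so that the logic can be exercised without the COM component.

```javascript
// Hypothetical helper: run one test case and decide whether it passed.
// createDocument is an assumed factory; under Internet Explorer 5 it could be
//   function () { return new ActiveXObject("Microsoft.XMLDOM"); }
function runCase(createDocument, uri, expectAccepted, validate) {
    var doc = createDocument();
    doc.async = false;                 // MSXML extension: load synchronously
    doc.validateOnParse = !!validate;  // MSXML extension: choose processor mode
    doc.load(uri);                     // MSXML extension: parse the file into the DOM

    // MSXML extension: errorCode is zero when no error was reported.
    var accepted = (doc.parseError.errorCode == 0);
    return {
        uri: uri,
        accepted: accepted,
        passed: accepted === expectAccepted,
        diagnostic: accepted ? "" : String(doc.parseError.reason)
    };
}
```

A positive test would be run with `expectAccepted` set to `true`, and a negative (malformed document) test with `false`. None of these steps can be written using only the standard DOM Level 1 interfaces, which is the portability problem described above.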
Conformance of Today's MSXML.DLL
In order to ensure that these results can be accurately compared against those in the earlier review, I did two things:
- As noted above, the testing report format is the same; it uses almost the same template, though some updates were needed. Since that template was (X)HTML, there was basically no problem here.
- The same patched version of the OASIS/NIST XML test database was used. This was done even though issues have turned up with some of the individual tests. (Eight in total, these cases did not particularly affect the testing results.)
Note that the source code distributed with the earlier review describes how the July version of that test database needed to be patched.
This table provides a quick reference to the results of the testing:
Processor Name and Version | Passed Tests | Rating (Out of 5) | Summary |
MSXML.DLL (non-validating) 5.00.2314.1000 | 931 | | Overall this processor is above average, though some of its problems have a broad impact. In addition to a variety of problems which should be readily fixed, it (wrongly) tests validity constraints in many cases. |
MSXML.DLL (default mode) 5.00.2314.1000 | 895 | | Since it accepts documents as "valid" that don't even have a DTD, all applications need to apply a workaround. |
More detailed analysis of each processor mode can be found in the following sections, with links to the complete testing reports.
MSXML.DLL (non-validating)
Processor Name: | MSXML.DLL (non-validating) |
Version: | 5.00.2314.1000 |
Type: | Non-Validating |
DOM Bundled: | Yes |
Size: | 490 KB |
Download From: | http://www.microsoft.com/xml/ |
This is the processor which is bundled with the Internet Explorer 5 Web browser. As a COM component, it may be used from JavaScript, C/C++, Visual Basic, and other programming languages. The processor is only accessible through an extended DOM API; JavaScript programmers have access to most of the W3C DOM Level 1 functionality.
Rating: | |
Full Test Results: | msxml-nv.html |
Raw Results: | Passed 931 (of 1067) |
Adjusted Results: | Passed 931 |
Most of the time I found the diagnostics to be quite comprehensible; this is valuable to anyone trying to use them. I probably looked at about half the negative test results, and while I found a misleading diagnostic, I didn't notice any indications of significant problems there. I'll be optimistic and assume that the other half of those diagnostics check out as well, so that the raw score is accurate.
Problems Encountered Processing Legal Documents
There are cases where this processor is rejecting documents which it should clearly be accepting. The processor:
- Doesn't accept some XML 1.0 names using non-ASCII characters.
- Treats many validity errors as if they were well-formedness errors, by reporting them as fatal errors. Since the processor was told not to report such errors, it should ignore them rather than report them.
  - The Notation Declared, No Duplicate Types, Unique Element Type, One ID Per Element, ID Attribute Default, Notation Attributes (at least one subclause), and Attribute Default Legal Validity Constraints (VCs) are treated like Well Formedness Constraints (WFCs).
  - The Entity Declared VC is similarly treated. See my earlier review; this area of the specification is problematic, and as an implementer I have a hard time blaming anything except the spec for this problem. It would be better to have only the WFC, both for users and for implementers.
  - The Proper Declaration/PE Nesting VC is another entry in this category. Again, see my earlier review, for much the same conclusion: it'd be better if this were a WFC, or if there were no constraint here at all.
- Treats conditional sections as if they were individual markup declarations for the purposes of testing parameter entity nesting. This is clearly contrary to the specification, and would be so even if it were appropriate to report violations of validity constraints when validation was not requested (see previous issue).
- Doesn't know how to map character references to surrogate pairs, when that's needed.
- Expands PEs incorrectly inside internal entity declaration literals. (In that case they should not be padded with spaces.)
- Rejects documents conforming to the XML 1.0 specification that use colons in ways the XML namespaces specification does not permit. This is not optional; there is no "XML 1.0 mode."
- Rejects redefinition of the built-in entity lt (&lt;) using the exact declaration given as an example in the XML spec.
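For the surrogate pair item in the list above, the mapping a processor needs is small. The sketch below shows the standard UTF-16 arithmetic for turning the code point named by a character reference such as `&#x10000;` into the two 16-bit code units that a DOM string would hold. The function name is illustrative; it is not part of any MSXML API.

```javascript
// Convert a Unicode code point (as named by a character reference) into
// UTF-16 code units. Code points above U+FFFF need a surrogate pair.
function codePointToUtf16(cp) {
    if (cp < 0 || cp > 0x10FFFF) {
        throw new Error("not a Unicode code point: " + cp);
    }
    if (cp < 0x10000) {
        return String.fromCharCode(cp);      // fits in one code unit
    }
    var offset = cp - 0x10000;               // 20 significant bits remain
    var high = 0xD800 + (offset >> 10);      // top 10 bits: high surrogate
    var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits: low surrogate
    return String.fromCharCode(high, low);
}
```

This only performs the conversion; whether the code point is a legal XML character is a separate check.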
When the MSXML.DLL processor accepts documents, it isn't always reporting the correct information to applications. Such problems can in some cases be quite significant:
- Attribute values were not normalized according to the XML specification.
- Whitespace was not handled correctly, even when the processor was configured to preserve whitespace. By default this DOM acts as if it were an application applying xml:space='default' handling, rather than as if it were an XML processor.
- PUBLIC identifiers are not normalized according to the XML specification.
- With multiple declarations of an attribute, only the first one is supposed to matter; but the others showed up in the output.
- External entities with just a single character cause some tests to fail with normalization errors.
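As a reminder of what the normalization items above refer to, here is a simplified sketch of the rules in the XML 1.0 specification for attribute values (section 3.3.3) and PUBLIC identifiers (section 4.2.2). It assumes the raw value contains literal text only: character and entity references, which the real rules treat differently from literal whitespace, are not expanded here. The function names are illustrative.

```javascript
// Simplified sketch of XML 1.0 attribute-value normalization (section 3.3.3).
// Assumption: rawValue holds literal text only; references are not expanded.
function normalizeAttributeValue(rawValue, isCdata) {
    // Line ends are normalized first (section 2.11), then every literal
    // whitespace character is replaced by a single space.
    var value = rawValue.replace(/\r\n?/g, "\n").replace(/[\t\n\r]/g, " ");
    if (!isCdata) {
        // Non-CDATA types (ID, NMTOKENS, ...) also drop leading and trailing
        // spaces and collapse runs of spaces.
        value = value.replace(/^ +| +$/g, "").replace(/ +/g, " ");
    }
    return value;
}

// PUBLIC identifiers: collapse whitespace runs to one space, trim both ends.
function normalizePublicId(rawId) {
    return rawId.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}
```

A test harness can compare values like these against what the DOM reports for `Attr` values and for the `publicId` of notations and entities.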
Although they were not reported by this test suite, and do not show up in the statistics above, I will mention two other known problems with this processor, since they prevented this processor from working with XML documents I happen to have found "in the wild," on the Web.
- The processor adds a constraint that is found neither in the XML 1.0 specification nor in the XML Namespaces specification: Namespace declarations placed in a DTD are required to be declared as #FIXED.
- Most recently I happened across a document which used a reference to a Unicode character that was inappropriately rejected: U+FFFD. (U+FFFC was also rejected when I tried that one, suggesting that it wasn't just a ">" vs ">=" coding error.) In this case, it was easy enough to fix since it was defined in a DTD that I could change. However this will not always be the case.
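For reference, the XML 1.0 `Char` production that decides which characters (and so which character references) are legal can be written as a one-line range test. Both U+FFFD and U+FFFC fall inside the `[#xE000-#xFFFD]` range, so a conforming processor accepts them. The function name is illustrative.

```javascript
// XML 1.0 production [2] Char:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXmlChar(cp) {
    return cp === 0x9 || cp === 0xA || cp === 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

The same test excludes most of the control characters below 0x20, the surrogate block, and U+FFFE and U+FFFF.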
In summary, most of the problems of the non-validating mode parser are revealed in these positive tests, and involve either reporting the wrong data (usually whitespace issues) or inappropriately performing certain validity checks. However, that evaluation is "by volume, not weight," and some of the other issues may need some attention in your system designs.
Problems Encountered Processing Malformed Documents
There weren't many obvious failures here:
- Accepts various illegal characters, such as control characters in the 0x00 to 0x1F range and escapes. They are accepted both as literals and as character references, though in some cases literals may be rejected (as they should always be).
- PUBLIC ids with some illegal characters are accepted.
- Whitespace before an XML declaration is permitted.
- Permits illegal text declarations that are missing the mandatory encoding="..." declaration.
- Unpaired Unicode surrogate characters are accepted, both as literals and as character references.
Accepting illegal characters is likely to cause the most interoperability problems of those failures.
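An application that cannot rely on the processor to reject unpaired surrogates could scan the text it receives itself. The following is a minimal sketch of such a scan over a JavaScript string, which holds UTF-16 code units; the function name is illustrative and the approach is an assumption about how one might work around the problem, not something the test harness does.

```javascript
// Return the index of the first unpaired surrogate code unit in text,
// or -1 if every high surrogate is followed by a low surrogate.
function findUnpairedSurrogate(text) {
    for (var i = 0; i < text.length; i++) {
        var unit = text.charCodeAt(i);
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            var next = (i + 1 < text.length) ? text.charCodeAt(i + 1) : 0;
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++;            // well-formed pair, skip its second half
                continue;
            }
            return i;
        }
        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return i;           // low surrogate with no preceding high one
        }
    }
    return -1;
}
```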
Problematic Test Cases
There are cases where the MSXML.DLL processor raises issues that the OASIS/NIST tests should address, in some cases by changing the tests:
- MSXML.DLL is unique among all XML processors I've seen in that it demands that general entities that are never used be well formed. One way to look at this is that it is reporting potential well-formedness errors, not actual errors. On the other hand, the XML specification does not distinguish between entities that are used and those that are not, so it is easily argued that the tests that expect these not to be reported are themselves in error. I confess to feeling this is a case where the XML specification needs clarification, particularly since I've seen no other processor that takes this interpretation.
- Uses the model of names, and name tokens, found in the XML Namespaces specification, rather than the XML 1.0 model. Conformance for the namespace specification is not defined in a way that a processor can be tested for conformance, but such tests are desirable.
It is interesting that the first issue above, regarding the constraint on unused general entities to be well-formed, may be coupled to the use of DOM as the processor API in this case. DOM permits, but thankfully does not require, much information to be exposed. Many DOM implementations use that flexibility to avoid exposing the contents of entities, among other facilities. Only DOM implementations, or similar APIs, that expose such contents appear to get any benefit from having such a well-formedness constraint.
DOM Conformance
Some of the DOM operations used to turn the MSXML.DLL processor's DOM output into something that could be examined for correctness had an unanticipated side effect. They identified problems in the DOM implementation that had been hooked up to the underlying processor. These need to be worked around, otherwise exceptions, reflecting internal errors of some kind, are thrown by some DOM operations:
- The DocumentType node has children, which it should not. These children must be explicitly ignored for many operations. This may be the reason that Document.getElementsByTagName returned some elements more than once when they came from external entities.
- In some cases the Element.normalize method throws an exception. This seems coupled to external entities with just one character, marking a line end.
- Text declarations (<?xml encoding='...'?>) at the beginning of external entities are exposed as if they were processing instructions. (Regardless of partial syntactic similarities, the XML spec is quite explicit that processing instructions do not use the name 'xml.') These need to be explicitly removed or ignored in certain cases.
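Two of the workarounds listed above can be sketched as a tree walk that never descends into a DocumentType node and that skips any "processing instruction" whose target is `xml`. This is an illustration written against the standard DOM Level 1 `nodeType`, `nodeName`, and `childNodes` members, not the code used by the harness; the function name is invented.

```javascript
// DOM Level 1 node type codes used below.
var ELEMENT_NODE = 1, PROCESSING_INSTRUCTION_NODE = 7, DOCUMENT_TYPE_NODE = 10;

// Collect element names in document order while applying the workarounds.
function collectElementNames(node, names) {
    names = names || [];
    // For a processing instruction, nodeName is its target; a target of
    // 'xml' is really a text declaration and is ignored.
    if (node.nodeType == PROCESSING_INSTRUCTION_NODE && node.nodeName == "xml") {
        return names;
    }
    if (node.nodeType == ELEMENT_NODE) {
        names.push(node.nodeName);
    }
    // A DocumentType node should have no children, so never visit them.
    if (node.nodeType != DOCUMENT_TYPE_NODE && node.childNodes) {
        for (var i = 0; i < node.childNodes.length; i++) {
            collectElementNames(node.childNodes.item(i), names);
        }
    }
    return names;
}
```

Walking the tree by hand like this also sidesteps the duplicate results seen from `Document.getElementsByTagName`, on the assumption that the DocumentType children are indeed their cause.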
In addition, I noticed that in this DOM, the SYSTEM identifiers found in Entity nodes are not resolved. Several other DOM implementations provide such IDs in fully resolved form, making less work for applications that need to use such URIs. The DOM specification should probably make both available because neither approach can address all problems.
The online MSDN documentation for this DOM was incorrect when I looked at it, though I understand that will be fixed. The reason is worrisome: when looking at this documentation with Netscape Communicator, I was served pages which didn't list a number of important standard methods for the NamedNodeList objects (such as the item method). I'm told that if it's read using Internet Explorer 5, and with use of ActiveX controls enabled, the content is correct. Since I disable use of ActiveX controls because of their security problems, accurate system documentation was unavailable to me.
As noted earlier, DOM still needs some work before it can truly be an implementation-independent API. This includes having ways to hook a DOM up to an XML processor (parsing document text into a DOM tree), and setting options for validation, whitespace handling, and use of various types of nodes in the resulting tree.
MSXML.DLL (default mode)
Processor Name: | MSXML.DLL (default mode) |
Version: | 5.00.2314.1000 |
Type: | Validating |
DOM Bundled: | Yes |
Size: | 490 KB |
Download From: | http://www.microsoft.com/xml/ |
This is the validating mode of the parser which is bundled with the Internet Explorer 5 web browser. See the coverage of the non-validating mode for basic information.
Rating: | |
Full Test Results: | msxml-val.html |
Raw Results: | Passed 895 (of 1067) |
Adjusted Results: | Passed 895 |
Unlike the situation with some other "dual mode" parsers, the MSXML.DLL processor does not do a complete personality switch, so this description builds heavily on the coverage of the non-validating mode, focusing only on what changes when validation is enabled.
Problems Encountered Processing Legal Documents
This worked basically like the non-validating mode, with the only new problem being that the parser complained when given certain entity expansions: it didn't use the elements found in those entities when checking whether the content model for the parent element was satisfied.
The parser called into question one additional pair of test cases. Specifically, it rejected a CDATA usage which has recently been deemed illegal. Presumably, after this erratum to the XML specification is published, these test cases will be recategorized.
Parser output was like that for the invalid documents; notably, it doesn't report whitespace or normalize attributes correctly.
Problems Encountered Processing Malformed Documents
One basic issue to note here is that because of its API, this parser is structurally prevented from continuing after reporting a validity error. The API only allows reporting fatal errors. This may not affect conformance (the "at user option" requirement in the XML specification does not seem to require that the option should affect only one error at a time), but it does constrain the use of this API for detecting and correcting multiple validity errors.
The following validity errors were not detected:
- Many documents without a <!DOCTYPE ... > declaration were accepted as valid. Because only such a doctype declaration can provide the declarations against which a document is validated, this is a substantial flaw although one that applications can work around. (See below.)
- This parser does not attempt to report validity errors relating to the standalone='yes' declaration. This is not the most popular feature in the XML specification, even though it is not hard to report these errors.
To make XML validation work correctly, your code to load an XML document should always look something like this (intended to work correctly even if you're not validating):
    doc.load(uri);
    if (doc.validateOnParse && doc.parseError.errorCode == 0 && doc.doctype == null) {
        // it's a set of unreported validity errors
    } else if (doc.parseError.errorCode != 0) {
        // error reported in parseError object
    }
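Wrapped as a function, the same check might look like the sketch below. The `checkLoaded` name and the returned strings are illustrative only. `validateOnParse`, `parseError.errorCode`, and `doctype` are the properties used in the snippet above, and `parseError.reason` is the MSXML property that holds the diagnostic text.

```javascript
// Classify the outcome of doc.load(uri) for an MSXML-style document object.
function checkLoaded(doc) {
    if (doc.parseError.errorCode != 0) {
        // The parser reported a (fatal) error itself.
        return "error: " + doc.parseError.reason;
    }
    if (doc.validateOnParse && doc.doctype == null) {
        // Validation was requested but there was no DOCTYPE to validate
        // against: treat this as a set of unreported validity errors.
        return "invalid: no DOCTYPE declaration";
    }
    return "ok";
}
```

Calling `checkLoaded(doc)` right after every `doc.load(uri)` keeps the workaround in one place, and it behaves sensibly whether or not validation is turned on.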
I would expect validation to work exactly as defined in the XML 1.0 specification. Validation using any of the various schema systems now available (or being developed) is a separate issue, and merits separate APIs.
Summary of Findings
The non-validating mode of the MSXML.DLL processor, with whitespace handling set appropriately, is relatively conformant, although not without its problems. Certain familiar errors are also seen in this processor:
- Various legal XML 1.0 names are not accepted, seemingly due to the non-ASCII characters in them.
- Surrogate pairs are not handled correctly, although the problems seem to be fewer than with some other processors.
- Checking of the characters permitted in PUBLIC identifiers is not fully correct.
- Nesting of parameter entities is again treated as if it were a well-formedness constraint.
- Whitespace isn't always reported or normalized correctly.
Both processor modes are in the top quartile of the ones tested in the earlier review, but are not the top rated ones. That gets this processor a "pretty good" rating in my book. Although I'm bothered by the validating mode needing an application level workaround, if you apply it, you'll find that nearly another fifty test cases will behave.
As more XML processors approach meaningful levels of conformance, it will be increasingly important to understand exactly which conformance errors show up in a given parser. The raw "passed tests" statistic, used to assign stars in this evaluation and the previous one, will always miss some important information. That's why I've tried, in both this review and the earlier one, to give a lot of analysis for the failure modes of the processors that have the best "passed tests" statistic. Since developers have many choices for their XML processors, it's important that those choices be well informed ones.