Pull Parsing in C# and Java
May 22, 2002
In my first article in this series, I wrote about porting a SAX application called RSSReader to the new Microsoft .NET Framework XmlReader. After publication, I received a message from Chris Lovett of Microsoft suggesting I revisit the subject. As he said, while the code I presented works, my approach was not optimal for the .NET Framework; I was still thinking in terms of SAX event-driven state machinery. A much easier way to approach this problem is to take advantage of the fact that XmlReader does not make you think this way, and thus to write a recursive descent RSS transformer as outlined below.
Based on Chris' suggestions, I've also made some other changes, including changing the output mechanism to use the XmlTextWriter, which will take care of generating well-formed XHTML on the output side.
And following all that, in a reversal of our usual process, I'll port this code back to Java.
Here then, without further ado, is the new RSSReader, optimized for C#. I've given the entire listing here, followed by an explanation.
using System;
using System.IO;
using System.Xml;
using System.Net;
public class RSSReader {
    public static void Main(string [] args) {
        // create an instance of RSSReader
        RSSReader rssreader = new RSSReader();
        try {
            string url = args[0];
            XmlTextWriter writer = new XmlTextWriter(Console.Out);
            writer.Formatting = Formatting.Indented;
            HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
            WebResponse resp = wr.GetResponse();
            Stream stream = resp.GetResponseStream();
            XmlTextReader reader = new XmlTextReader(stream);
            reader.XmlResolver = null; // ignore the DTD
            reader.WhitespaceHandling = WhitespaceHandling.None;
            rssreader.RSSToHtml(reader, writer);
        } catch (XmlException e) {
            Console.WriteLine(e.Message);
        }
    }
    public void RSSToHtml(XmlReader reader, XmlWriter writer) {
        reader.MoveToContent();
        if (reader.Name == "rss") {
            writer.WriteStartElement("html");
            while (reader.Read() &&
                   reader.NodeType != XmlNodeType.EndElement) {
                switch (reader.LocalName) {
                    case "channel":
                        ChannelToHtml(reader, writer);
                        break;
                    case "item":
                        ItemToHtml(reader, writer);
                        break;
                    default: // ignore image and textinput.
                        break;
                }
            }
            writer.WriteEndElement();
        } else {
            // not an RSS document!
        }
    }
    void ChannelToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("head");
        // scan header elements and pick out the title.
        reader.Read();
        while (reader.Name != "item" &&
               reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "title") {
                writer.WriteNode(reader, true); // copy node to output.
            } else {
                reader.Skip();
            }
        }
        writer.WriteEndElement();
        writer.WriteStartElement("body");
        // transform the items.
        while (reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "item") {
                ItemToHtml(reader, writer);
            }
            if (!reader.Read())
                break;
        }
        writer.WriteEndElement();
    }
    void ItemToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("p");
        string title = null, link = null, description = null;
        while (reader.Read() &&
               reader.NodeType != XmlNodeType.EndElement) {
            switch (reader.Name) {
                case "title":
                    title = reader.ReadString();
                    break;
                case "link":
                    link = reader.ReadString();
                    break;
                case "description":
                    description = reader.ReadString();
                    break;
            }
        }
        writer.WriteStartElement("a");
        writer.WriteAttributeString("href", link);
        writer.WriteString(title);
        writer.WriteEndElement();
        writer.WriteStartElement("br");
        writer.WriteEndElement();
        writer.WriteString(description);
        writer.WriteEndElement(); // end the "p" element
    }
}
Explaining the Code
The Main entry point to the new RSSReader uses the System.Net classes directly to set up a WebRequest. You also see the XmlTextWriter being constructed, with indenting turned on so that we get nicely readable output. Then the XmlReader and XmlWriter become arguments to a recursive descent RSS parser; the top-level method is called RSSToHtml().
The top-level RSSToHtml() method first checks that we really have an RSS file by checking the root element name. MoveToContent() is a convenient way of skipping the XML prolog and going right to the top-level element in the document. If the XML document used namespaces, then we'd also want to match on the NamespaceURI property; however, this particular XML document doesn't use namespaces. If we find an <rss> element, then we read the contents, calling ChannelToHtml() when we find a <channel> element and calling ItemToHtml() when we find an <item> element. Any other element is skipped. This is all wrapped in the XmlWriter calls that write the root-level <html> output element.
The ChannelToHtml() method does two things: it writes out the HTML head element containing a <title> element, then it writes out the HTML body. Notice that here we can simply use the XmlWriter.WriteNode() method, which copies the <title> element from the input reader to the output, since an HTML <title> is exactly the same as an RSS one. The HTML head element terminates when we reach the first child <item> element or the </channel> EndElement token. In the HTML body we look for <item> elements and call ItemToHtml().
The ItemToHtml() method writes out an HTML <p> tag, then reads the <title>, <link>, and <description> elements out of the input. These input tags could arrive in any order, which is why we have to read them all before we can write the output. Once we have them we can write the <a> tag, with its href attribute equal to the <link> element and its content equal to the <title>, followed by an empty <br> element and the description.
All in all, it seems like a much simpler way to deal with converting RSS to HTML. .NET's built-in XML parser is pretty neat.
Java Pull Parsers
But pull parsers are not unique to the .NET world. The Java Community Process is currently working on a standard called StAX, the Streaming API for XML. This nascent API is, in turn, based upon several vendors' pull parser implementations, notably Apache's Xerces XNI, BEA's XML Stream API, XML Pull Parser 2, PullDOM (for Python), and, yes, Microsoft's XmlReader.
So how would we implement this same program in yet another pull parser, the Common API for XML Pull Parsing, or XmlPull? Let's take a look.
package com.xml;
import java.io.*;
import java.net.*;
import java.util.*;
import com.alexandriasc.xml.XMLWriter;
import org.xmlpull.v1.*;
public class RSSReader {
    public static void main(String [] args) {
        // create an instance of RSSReader
        RSSReader rssreader = new RSSReader();
        XMLWriter writer = null;
        try {
            String url = args[0];
            writer = new XMLWriter(new OutputStreamWriter(System.out),false);
            XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
            XmlPullParser parser = factory.newPullParser();
            InputStreamReader stream = new InputStreamReader(
                new URL(url).openStream());
            parser.setInput(stream);
            parser.setFeature(XmlPullParser.FEATURE_PROCESS_DOCDECL,false);
            rssreader.RSSToHtml(parser, writer);
        } catch (Exception e) {
            e.printStackTrace(System.err);
        } finally {
            try {
                // writer is still null if args[0] was missing
                if (writer != null) {
                    writer.flush();
                }
            } catch (IOException io) {
                io.printStackTrace(System.err);
            }
        }
    }
    public void RSSToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        // equivalent to XmlReader.MoveToContent(): advance to the first start tag
        while (parser.next() != XmlPullParser.START_TAG
               && parser.getEventType() != XmlPullParser.END_DOCUMENT) {
        }
        if (parser.getEventType() == XmlPullParser.START_TAG
                && parser.getName().equals("rss")) {
            writer.beginElement("html");
            do {
                parser.next();
                if (parser.getEventType() == XmlPullParser.START_TAG
                        && parser.getName().equals("channel")) {
                    ChannelToHtml(parser, writer);
                } else if (parser.getEventType() == XmlPullParser.START_TAG
                        && parser.getName().equals("item")) {
                    ItemToHtml(parser, writer);
                }
            } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
            writer.endElement();
        } else {
            // not an RSS document!
        }
    }
    void ChannelToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        writer.beginElement("head");
        // scan header elements and pick out the title.
        while (!(parser.next() == XmlPullParser.END_TAG
                 && parser.getName().equals("channel"))) {
            if (parser.getEventType() == XmlPullParser.START_TAG) {
                do {
                    if (parser.getEventType() == XmlPullParser.START_TAG
                            && parser.getName().equals("title")) {
                        while (parser.next() != XmlPullParser.END_TAG) {
                            if (parser.getEventType() == XmlPullParser.TEXT) {
                                writer.writeElement("title",null,parser.getText());
                                break;
                            }
                        }
                        break;
                    }
                } while (parser.next() != XmlPullParser.END_TAG);
                break;
            }
        }
        writer.endElement();
        writer.beginElement("body");
        // transform the items.
        do {
            if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("item")) {
                ItemToHtml(parser, writer);
            }
            parser.next();
        } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
        writer.endElement();
    }
    void ItemToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        writer.beginElement("p");
        String title = null, link = null, description = null;
        while (parser.next() != XmlPullParser.END_DOCUMENT
               && parser.getEventType() != XmlPullParser.END_TAG) {
            if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("title")) {
                if (parser.next() == XmlPullParser.TEXT)
                    title = parser.readText();
            } else if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("link")) {
                if (parser.next() == XmlPullParser.TEXT)
                    link = parser.readText();
            } else if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("description")) {
                if (parser.next() == XmlPullParser.TEXT)
                    description = parser.readText();
            }
        }
        HashMap attributes = new HashMap(1);
        attributes.put("href", link);
        writer.beginElement("a",attributes);
        writer.write(title);
        writer.endElement();
        writer.writeEmptyElement("br");
        writer.write(description);
        writer.endElement(); // end the "p" element
    }
}
Most of our port was the reverse of our previous ports; for example, changing Console.Out to System.out, making method names start with lowercase letters, and adding explicit throws clauses. The real meat of this port is in two areas.
The Parser
First, we're using XmlPullParser as a rough equivalent of XmlTextReader. One difference is that while we are able to instantiate an XmlTextReader directly in C# (remember, Microsoft is a one-stop shop), we have to use the Java XmlPullParserFactory to get a concrete implementation of the XmlPullParser interface. This should be a familiar exercise for anyone who's used JAXP or, for that matter, JDBC.
Once we have the parser, most of the method name equivalencies are obvious. Remember that in C# the == operator works just fine for strings, but in Java you must use the .equals() method; otherwise you'll be comparing object references rather than their values, which is not at all what we want to do. Also, you can't use a String as the expression in a switch...case statement in Java, so we've turned those into an if...else structure.
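To make the difference between reference and value comparison concrete, here is a minimal sketch. The class and method names (ElementNames, isItem, isSameObject, handlerFor) are hypothetical and exist only for this illustration; they are not part of RSSReader or of any parser API. Note also that Java 7 and later do accept a String in a switch statement, which was not the case when this port was written, so the last method would not have compiled at the time.

```java
// Hypothetical helper class, for illustration only.
final class ElementNames {

    // Value comparison: true whenever the characters match.
    // Putting the literal first also makes the call safe for a null name.
    static boolean isItem(String name) {
        return "item".equals(name);
    }

    // Reference comparison: true only when both operands are the same object.
    static boolean isSameObject(String a, String b) {
        return a == b;
    }

    // String switch, accepted since Java 7 (assumption: a modern compiler).
    static String handlerFor(String name) {
        switch (name) {
            case "channel": return "ChannelToHtml";
            case "item":    return "ItemToHtml";
            default:        return "skip";
        }
    }

    public static void main(String[] args) {
        // A parser hands back its own String objects, simulated here with new String().
        String fromParser = new String("item");
        System.out.println(isItem(fromParser));               // value comparison
        System.out.println(isSameObject(fromParser, "item")); // reference comparison
        System.out.println(handlerFor(fromParser));
    }
}
```

The reference comparison fails here even though both strings read "item", which is exactly the trap the if...else chains in the listing avoid by calling .equals().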
Another difference between the .NET XmlReader and the Java XmlPullParser has to do with the way in which events are pulled out of the XML document. In the former, the ReadString() method will return all the text for the current element, while in the latter, next() must explicitly be called to position the parser at the text node before calling getText() or readText() to read the text.
This may be a minor difference, but it tends to make our port a little more difficult. To better handle this requirement, I've changed several while loops into do...while loops. This, unfortunately, makes it less than a simple port; the logic has changed, but not considerably.
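The same two styles of reading element text can be sketched with StAX, the API mentioned earlier, in the form that later shipped with the JDK as javax.xml.stream. This is an assumption-laden illustration rather than a port of the listing: it does not use the XmlPull API, and the class and method names (ElementTextDemo, readTextManually, readTextInOneCall) are hypothetical. The first method steps to the text event explicitly, as the XmlPull code has to; the second uses getElementText(), which plays roughly the role of ReadString() in .NET.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, for illustration only.
final class ElementTextDemo {

    // Explicit stepping: move from the start tag to each text event and collect it.
    // Assumes the element contains only text, no child elements.
    static String readTextManually(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            r.nextTag(); // now positioned on the start tag
            StringBuilder text = new StringBuilder();
            while (r.next() != XMLStreamConstants.END_ELEMENT) {
                if (r.isCharacters()) {
                    text.append(r.getText());
                }
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Single call: getElementText() consumes everything up to the matching end tag.
    static String readTextInOneCall(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            r.nextTag();
            return r.getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<title>XML.com</title>";
        System.out.println(readTextManually(xml));
        System.out.println(readTextInOneCall(xml));
    }
}
```

Both methods return the same string for a text-only element; the difference is only in how much of the cursor movement the caller has to manage.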
The Writer
Second, there is no XmlTextWriter in Java, so we're using Alexandria Software Consulting's XmlHelper package, which contains a class called XMLWriter. Besides the naming of methods, XMLWriter operates almost identically to .NET's XmlWriter, except for two details.
First, XMLWriter has the notion of a collection of attributes, whereas XmlWriter requires you to write each attribute individually. In Java, we call beginElement(), passing the name and the Map of attributes, whereas in C#, we called WriteStartElement() followed by WriteAttributeString().
Second, XMLWriter has a writeEmptyElement() method, whereas XmlWriter requires you to call WriteStartElement() followed by WriteEndElement(). However, .NET automatically collapses an empty element into a short end element (in this case, <br />). .NET's way gives you the flexibility of determining whether the element is empty at runtime. If, however, you need to force an end tag, you can call WriteFullEndElement() instead of WriteEndElement().
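For readers without the XmlHelper package, the same output steps can be sketched with the JDK's own javax.xml.stream.XMLStreamWriter, which combines both styles: attributes are written one at a time, as in .NET, and there is a dedicated writeEmptyElement() call, as in XMLWriter. This is a substitute API chosen for illustration, not the writer used in the listing, and the class and method names (ItemWriterDemo, itemToHtml) are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, for illustration only.
final class ItemWriterDemo {

    // Writes one RSS item as an HTML paragraph and returns the markup.
    static String itemToHtml(String title, String link, String description) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("p");
            w.writeStartElement("a");
            w.writeAttribute("href", link);  // one call per attribute
            w.writeCharacters(title);        // special characters are escaped
            w.writeEndElement();             // closes <a>
            w.writeEmptyElement("br");       // explicit empty element
            w.writeCharacters(description);
            w.writeEndElement();             // closes <p>
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(itemToHtml("XML.com", "http://www.xml.com/", "News & views"));
    }
}
```

Because the writer tracks the open elements itself, each writeEndElement() call closes whichever element is innermost, so the caller never names the element being closed.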
Conclusion
A pull parser makes it much easier to process XML, especially when you are processing XML with a well-defined grammar like RSS. This code is much easier to understand and maintain since there's no complex state machine to build or maintain. In fact, this code is completely stateless; the pull parser keeps track of all the state for us. So in that sense a pull parser is a higher level way of processing XML than SAX.
Although my original code quite intentionally didn't do any error handling, error handling in a push model state machine adds even more complexity to an already complex model. The new RSSReader has clear placeholders for error handling code in the cases when the input doesn't comply with the expected RSS DTD.
Performance can be an important consideration in an XML parser. Notice the call to Skip() (in the C# version) when we find elements we're not interested in. In this case the XML parser can skip over entire subtrees of XML without having to call us back on every element, even ones we know we're not interested in. Here, we skip over the <image> elements and all their children. In C# we could also optimize out all the element name string comparisons and use atomized pointer comparisons instead if we used the XmlReader's NameTable to pre-atomize those strings.
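Where a pull parser offers no built-in Skip(), the caller can get the same effect by counting nesting depth until the matching end tag. The sketch below does this with the JDK's javax.xml.stream API; the class and method names (SubtreeSkipDemo, skipElement, visitedNames) are hypothetical. Unlike a skip implemented inside the parser, this version still pulls every event, so it simplifies the calling code rather than saving parsing work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, for illustration only.
final class SubtreeSkipDemo {

    // Skips the element the reader is positioned on, including all descendants.
    // On return the reader sits on the matching end tag.
    static void skipElement(XMLStreamReader r) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the names of all elements visited, comma-separated,
    // skipping every subtree whose root is named skipName.
    static String visitedNames(String xml, String skipName) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    if (r.getLocalName().equals(skipName)) {
                        skipElement(r);
                    } else {
                        names.add(r.getLocalName());
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        String xml = "<channel><title>t</title>"
                + "<image><title>logo</title><url>u</url></image>"
                + "<item><title>i</title></item></channel>";
        System.out.println(visitedNames(xml, "image"));
    }
}
```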
Finally, using an XML writer makes our output generation more robust. For example, it will correctly convert special characters -- <, &, etc. -- into their respective entity references. Because it maintains its own state internally, it never forgets which element to close after a convoluted series of while loops. And it will always produce XML output in the consistent and readable format of your choice.
And now for the inevitable comparison between .NET's XmlReader/XmlWriter and the equivalent functionality in Java. As usual, I'll say that in .NET, Microsoft has provided it all for you and, thus, it is undeniably simpler to learn and use. The C# version of our RSSReader is about 20% shorter than the Java version, which is great unless you work in one of those shops which still measures productivity in KLOCs. And the readability of the code itself is much greater in C#, although that probably can be chalked up at least in part to my own lack of skill in that conversion from while to do...while.
But the real bottom line remains that doing it the .NET way means that Microsoft provides all the standards-compliant tools that 90% of developers are likely to need, while the Java way still means putting together a solution from various pieces that you can scrounge from various sources. Some of those pieces come from the Java Community Process and thus represent peer-reviewed, formally approved APIs, but some come from a quick search of the Web, and in the end only you are qualified to judge their worthiness.