Pull Parsing in C# and Java

May 22, 2002

In my first article in this series, I wrote about porting a SAX application called RSSReader to the new Microsoft .NET Framework XmlReader. After publication, I received a message from Chris Lovett of Microsoft suggesting I revisit the subject. As he said, while the code I presented works, my approach was not optimal for the .NET Framework; I was still thinking in terms of SAX event-driven state machinery. A much easier approach takes advantage of the fact that XmlReader does not force you to think this way: write a recursive descent RSS transformer, as outlined below.

Based on Chris' suggestions, I've also made some other changes, including changing the output mechanism to use the XmlTextWriter, which will take care of generating well-formed XHTML on the output side.

And following all that, in a reversal of our usual process, I'll port this code back to Java.

Here then, without further ado, is the new RSSReader, optimized for C#. I've given the entire listing here, followed by an explanation.


using System;
using System.IO;
using System.Xml;
using System.Net;

public class RSSReader {
  public static void Main(string [] args) {
    // create an instance of RSSReader
    RSSReader rssreader = new RSSReader();

    try {
      string url = args[0];
      XmlTextWriter writer = new XmlTextWriter(Console.Out);
      writer.Formatting = Formatting.Indented;
      HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
      WebResponse resp = wr.GetResponse();
      Stream stream = resp.GetResponseStream();
      XmlTextReader reader = new XmlTextReader(stream);
      reader.XmlResolver = null; // ignore the DTD
      reader.WhitespaceHandling = WhitespaceHandling.None;
      rssreader.RSSToHtml(reader, writer);
    } catch (XmlException e) {
      Console.WriteLine(e.Message);
    }
  }

  public void RSSToHtml(XmlReader reader, XmlWriter writer) {
    reader.MoveToContent();
    if (reader.Name == "rss") {
      writer.WriteStartElement("html");
      while (reader.Read() &&
        reader.NodeType != XmlNodeType.EndElement) {
        switch (reader.LocalName) {
        case "channel":
          ChannelToHtml(reader, writer);
          break;
        case "item":
          ItemToHtml(reader, writer);
          break;
        default: // ignore image and textinput.
          break;
        }
      }
      writer.WriteEndElement();
    } else {
      // not an RSS document!
    }
  }

  void ChannelToHtml(XmlReader reader, XmlWriter writer) {
    writer.WriteStartElement("head");
    // scan header elements and pick out the title.
    reader.Read();
    while (reader.Name != "item" &&
      reader.NodeType != XmlNodeType.EndElement) {
      if (reader.Name == "title") {
        writer.WriteNode(reader, true); // copy node to output.
      } else {
        reader.Skip();
      }
    }
    writer.WriteEndElement();

    writer.WriteStartElement("body");
    // transform the items.
    while (reader.NodeType != XmlNodeType.EndElement) {
      if (reader.Name == "item") {
        ItemToHtml(reader, writer);
      }
      if (!reader.Read())
        break;
    }
    writer.WriteEndElement();
  }

  void ItemToHtml(XmlReader reader, XmlWriter writer) {
    writer.WriteStartElement("p");

    string title = null, link = null, description = null;
    while (reader.Read() &&
      reader.NodeType != XmlNodeType.EndElement) {
      switch (reader.Name) {
      case "title":
        title = reader.ReadString();
        break;
      case "link":
        link = reader.ReadString();
        break;
      case "description":
        description = reader.ReadString();
        break;
      }
    }
    writer.WriteStartElement("a");
    writer.WriteAttributeString("href", link);
    writer.WriteString(title);
    writer.WriteEndElement();

    writer.WriteStartElement("br");
    writer.WriteEndElement();

    writer.WriteString(description);

    writer.WriteEndElement(); // end the "p" element
  }
}

Explaining the Code

The Main entry point to the new RSSReader uses the System.Net classes directly to set up a WebRequest. It also constructs the XmlTextWriter with indenting turned on, so we get nicely readable output. The XmlReader and XmlWriter are then passed to a recursive descent RSS parser, whose top-level method is RSSToHtml().

The top-level RSSToHtml() method first checks that we really have an RSS file by checking the root element name. MoveToContent() is a convenient way of skipping the XML prolog and going right to the top-level element in the document. If the XML document used namespaces, then we'd also want to match on the NamespaceURI property; however, this particular XML document doesn't use namespaces. If we find an <rss> element, then we read the contents, calling ChannelToHtml() when we find a <channel> element and calling ItemToHtml() when we find an <item> element. Any other element is ignored. This is all wrapped in the XmlWriter calls that write the root-level <html> output element.

The ChannelToHtml() method does two things: it writes out the HTML head element containing a <title> element, then it writes out the HTML body. Notice here we can simply use the XmlWriter.WriteNode() method which copies the <title> element from the input reader to the output, since an HTML <title> is exactly the same as an RSS one. The HTML head element terminates when we reach the first child <item> element or the </channel> EndElement token. In the HTML body we look for <item> elements and call ItemToHtml().

The ItemToHtml() method writes out an HTML <p> tag, then reads the <title>, <link> and <description> elements out of the input. These input tags could arrive in any order, which is why we have to read them all before we can write the output. Once we have them, we can write the <a> tag, with its href attribute equal to the content of the <link> element and its content equal to the <title>, followed by an empty <br> element and the description.

All in all, it seems like a much simpler way to deal with converting RSS to HTML. .NET's built-in XML parser is pretty neat.

Java Pull Parsers

But pull parsers are not unique to the .NET world. The Java Community Process is currently working on a standard called StAX, the Streaming API for XML. This nascent API is, in turn, based upon several vendors' pull parser implementations, notably Apache's Xerces XNI, BEA's XML Stream API, XML Pull Parser 2, PullDOM (for Python), and, yes, Microsoft's XmlReader.

So how would we implement this same program in yet another pull parser API, the Common API for XML Pull Parsing, or XmlPull? Let's take a look.


package com.xml;

import java.io.*;
import java.net.*;
import java.util.*;

import com.alexandriasc.xml.XMLWriter;
import org.xmlpull.v1.*;

public class RSSReader {

  public static void main(String [] args) {
    // create an instance of RSSReader
    RSSReader rssreader = new RSSReader();

    XMLWriter writer = null;
    try {
      String url = args[0];
      writer = new XMLWriter(new OutputStreamWriter(System.out),false);
      XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
      XmlPullParser parser = factory.newPullParser();
      InputStreamReader stream = new InputStreamReader(
        new URL(url).openStream());
      parser.setInput(stream);
      parser.setFeature(XmlPullParser.FEATURE_PROCESS_DOCDECL,false);
      rssreader.RSSToHtml(parser, writer);
    } catch (Exception e) {
      e.printStackTrace(System.err);
    } finally {
      try {
        if (writer != null) // writer is still null if we failed before creating it
          writer.flush();
      } catch (IOException io) {
        io.printStackTrace(System.err);
      }
    }
  }

  public void RSSToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
    // equivalent to XmlReader.MoveToContent()
    while (parser.next() != XmlPullParser.START_TAG
      && !parser.getName().equals("rss")) {
    }
    if (parser.getName().equals("rss")) {
      writer.beginElement("html");
      do {
        parser.next();
        if (parser.getEventType() == XmlPullParser.START_TAG
          && parser.getName().equals("channel")) {
          ChannelToHtml(parser, writer);
        } else if (parser.getEventType() == XmlPullParser.START_TAG
          && parser.getName().equals("item")) {
          ItemToHtml(parser, writer);
        }
      } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
      writer.endElement();
    } else {
      // not an RSS document!
    }
  }

  void ChannelToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
    writer.beginElement("head");
    // scan header elements and pick out the title.
    while (!(parser.next() == XmlPullParser.END_TAG
      && parser.getName().equals("channel"))) {
      if (parser.getEventType() == XmlPullParser.START_TAG) {
        do {
          if (parser.getEventType() == XmlPullParser.START_TAG
            && parser.getName().equals("title")) {
            while (parser.next() != XmlPullParser.END_TAG) {
              if (parser.getEventType() == XmlPullParser.TEXT) {
                writer.writeElement("title",null,parser.getText());
                break;
              }
            }
            break;
          }
        } while (parser.next() != XmlPullParser.END_TAG);
        break;
      }
    }
    writer.endElement();

    writer.beginElement("body");
    // transform the items.
    do {
      if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("item")) {
        ItemToHtml(parser, writer);
      }
      parser.next();
    } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
    writer.endElement();
  }

  void ItemToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
    writer.beginElement("p");

    String title = null, link = null, description = null;
    while (parser.next() != XmlPullParser.END_DOCUMENT
      && parser.getEventType() != XmlPullParser.END_TAG) {
      if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("title")) {
        if (parser.next() == XmlPullParser.TEXT)
          title = parser.readText();
      } else if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("link")) {
        if (parser.next() == XmlPullParser.TEXT)
          link = parser.readText();
      } else if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("description")) {
        if (parser.next() == XmlPullParser.TEXT)
          description = parser.readText();
      }
    }
    HashMap attributes = new HashMap(1);
    attributes.put("href", link);
    writer.beginElement("a",attributes);
    writer.write(title);
    writer.endElement();

    writer.writeEmptyElement("br");

    writer.write(description);

    writer.endElement(); // end the "p" element
  }
}

Most of our port was the reverse of our previous ports; for example, changing Console.Out to System.out, making method names start with lowercase letters, adding explicit throws clauses. The real meat of this port is in two areas.

The Parser

First, we're using XmlPullParser as a rough equivalent of XmlTextReader. One difference is that while we are able to instantiate an XmlTextReader directly in C# (remember, Microsoft is a one-stop shop), we have to use the Java XmlPullParserFactory to get a concrete implementation of the XmlPullParser interface. This should be a familiar exercise for anyone who's used JAXP or, for that matter, JDBC.

Once we have the parser, most of the method name equivalencies are obvious. Remember that in C# the == operator works just fine for strings, but in Java you must use the .equals() method; otherwise you'll be comparing object references rather than their values, not at all what we want to do. Also, you can't use a String as the expression in a switch...case statement in Java, so we've turned those into an if...else structure.
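The pitfall is easy to reproduce. What follows is a minimal sketch, not part of RSSReader: the class NameDispatch and its classify() method are hypothetical names, used only to show the reference-versus-value comparison and the if...else dispatch that stands in for the C# switch. Putting the literal first, as in "item".equals(name), also stays safe when the name is null.

```java
public class NameDispatch {
    // Hypothetical helper: names the handler the article's code would call
    // for a given element name.
    static String classify(String name) {
        if ("channel".equals(name)) {
            return "ChannelToHtml";
        } else if ("item".equals(name)) {
            return "ItemToHtml";
        } else {
            return "ignored"; // image, textinput, null, and anything else
        }
    }

    public static void main(String[] args) {
        // Built at run time, so it is a different object from the literal "item".
        String name = new String("item");
        System.out.println(name == "item");      // false: compares references
        System.out.println(name.equals("item")); // true: compares values
        System.out.println(classify(name));      // ItemToHtml
    }
}
```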

Another difference between the .NET XmlReader and the Java XmlPullParser has to do with the way in which events are pulled out of the XML document. In the former, the ReadString() method will return all the text for the current element, while in the latter, next() must explicitly be called to position the parser at the text node before calling getText() or readText() to read the text.
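The two reading styles can be put side by side in one small program. As an assumption made for the sake of a self-contained example, this sketch uses the javax.xml.stream (StAX) classes bundled with later JDKs rather than the XmlPull API from the listing; the class ElementTextDemo and its method names are hypothetical. The first method steps onto the text event by hand, the way the XmlPullParser port does, and the second asks for the element's text in a single call, in the spirit of ReadString().

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ElementTextDemo {
    // Positions a new reader on the first <title> start tag (assumes one exists).
    static XMLStreamReader atFirstTitle(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        while (!(r.next() == XMLStreamConstants.START_ELEMENT
                && "title".equals(r.getLocalName()))) {
            // keep pulling events until we reach <title>
        }
        return r;
    }

    // Style 1: move onto the text event explicitly, then read it.
    static String titleByStepping(String xml) {
        try {
            XMLStreamReader r = atFirstTitle(xml);
            StringBuilder text = new StringBuilder();
            while (r.next() == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Style 2: a single call returns the text of the current element.
    static String titleByElementText(String xml) {
        try {
            return atFirstTitle(xml).getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><title>XML.com</title></channel></rss>";
        System.out.println(titleByStepping(xml));    // XML.com
        System.out.println(titleByElementText(xml)); // XML.com
    }
}
```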

This may be a minor difference, but it tends to make our port a little more difficult. To better handle this requirement, I've changed several while loops into do...while loops. This, unfortunately, makes it less than a simple port; the logic has changed, but not considerably.

The Writer

Second, there is no XmlTextWriter in Java, so we're using Alexandria Software Consulting's XmlHelper package, which contains a class called XMLWriter. Besides the naming of methods, XMLWriter operates almost identically to .NET's XmlWriter, except for two details.

First, XMLWriter has the notion of a collection of attributes, whereas XmlWriter requires you to write each attribute individually. In Java, we call beginElement(), passing the name and the Map of attributes, whereas in C#, we called WriteStartElement() followed by WriteAttributeString().

Second, XMLWriter has a writeEmptyElement() method, where XmlWriter requires you to call WriteStartElement() followed by WriteEndElement(). However, .NET automatically collapses an empty element into a short end element (in this case, <br />). .NET's way gives you the flexibility of determining whether the element is empty at runtime. If, however, you need to force an end tag, you can call WriteFullEndElement() instead of WriteEndElement().
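Both conventions can be seen together in a short sketch. It assumes the javax.xml.stream.XMLStreamWriter that ships with later JDKs, which is neither of the two writers discussed above: it takes attributes one call at a time, as XmlWriter does, yet offers a dedicated empty-element call, as XMLWriter does. The class ItemWriterDemo, its itemToHtml() method, and the sample URL are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class ItemWriterDemo {
    // Writes one RSS item as an HTML paragraph and returns the markup.
    static String itemToHtml(String title, String link, String description) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("p");
            w.writeStartElement("a");
            w.writeAttribute("href", link); // one attribute per call
            w.writeCharacters(title);
            w.writeEndElement();            // closes <a>
            w.writeEmptyElement("br");      // dedicated empty-element call
            w.writeCharacters(description);
            w.writeEndElement();            // closes <p>
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(itemToHtml(
            "Pull Parsing", "http://example.com/pull", "A short description."));
    }
}
```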

Conclusion

A pull parser makes it much easier to process XML, especially when you are processing XML with a well-defined grammar like RSS. This code is much easier to understand and maintain, since there's no complex state machine to build. In fact, this code is completely stateless; the pull parser keeps track of all the state for us. So in that sense a pull parser is a higher-level way of processing XML than SAX.

Although my original code quite intentionally didn't do any error handling, error handling in a push model state machine adds even more complexity to an already complex model. The new RSSReader has clear placeholders for error handling code in the cases when the input doesn't comply with the expected RSS DTD.

Performance can be an important consideration in an XML parser, and the pull model helps in two ways. First, notice the call to Skip() (in the C# version) when we find elements we're not interested in. Here the XML parser can skip over entire subtrees of XML without having to call us back on every element; in our case, it skips the <image> elements and all their children. Second, in C# we could replace all the element name string comparisons with atomized pointer comparisons if we used the XmlReader's NameTable to pre-atomize those strings.
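On the Java side, the effect of Skip() can be approximated with a small depth-counting helper. This is a sketch under stated assumptions: it uses the JDK's javax.xml.stream reader rather than the XmlPull parser from the listing, it assumes no built-in equivalent of Skip() is available, and the names SkipDemo, skipElement(), and visitedElements() are hypothetical. Unlike a skip implemented inside the parser, the helper still pulls every event; it only spares the calling code from looking at them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SkipDemo {
    // Consumes everything up to and including the end tag that matches
    // the start tag the reader is currently positioned on.
    static void skipElement(XMLStreamReader r) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the names of the start tags the caller actually handles
    // when every <image> subtree is skipped.
    static List<String> visitedElements(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            List<String> seen = new ArrayList<>();
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    if ("image".equals(r.getLocalName())) {
                        skipElement(r);
                    } else {
                        seen.add(r.getLocalName());
                    }
                }
            }
            return seen;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<channel><title>t</title>"
            + "<image><url>u</url><title>logo</title></image>"
            + "<item><title>i</title></item></channel>";
        System.out.println(visitedElements(xml)); // [channel, title, item, title]
    }
}
```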

Finally, using an XML writer makes our output generation more robust. For example, it will correctly convert special characters -- <, &, etc. -- into their respective entity references. Because it maintains its own state internally, it never forgets which element to close after a convoluted series of while loops. And it will always produce XML output in the consistent and readable format of your choice.
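The escaping behavior is simple to check. The sketch below again assumes the JDK's javax.xml.stream.XMLStreamWriter as a stand-in for the writers used in the listings; EscapeDemo and paragraph() are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class EscapeDemo {
    // Wraps arbitrary text in a <p> element and returns the markup.
    static String paragraph(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("p");
            w.writeCharacters(text); // the writer escapes markup characters for us
            w.writeEndElement();
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraph("Q&A: is 1 < 2?")); // <p>Q&amp;A: is 1 &lt; 2?</p>
    }
}
```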

And now for the inevitable comparison between .NET's XmlReader/XmlWriter and the equivalent functionality in Java. As usual, I'll say that in .NET, Microsoft has provided it all for you and, thus, it is undeniably simpler to learn and use. The C# version of our RSSReader is about 20% shorter than the Java version, which is great unless you work in one of those shops which still measures productivity in KLOCs. And the readability of the code itself is much greater in C#, although that probably can be chalked up at least in part to my own lack of skill in that conversion from while to do...while.

But the real bottom line remains that doing it the .NET way means that Microsoft provides all the standards-compliant tools that 90% of developers are likely to need, while the Java way still means putting together a solution from various pieces that you can scrounge from various sources. Some of those pieces come from the Java Community Process and thus represent peer-reviewed, formally approved APIs, but some come from a quick search of the Web, and in the end only you are qualified to judge their worthiness.