Learning C# XML
March 6, 2002
In his opening keynote at the IDEAlliance XML 2001 Conference in Orlando, Florida, in December, James Clark said: "Just because it comes from Microsoft, it's not necessarily bad". With that in mind, I decided to explore what C# has to offer to the Java-XML community.
I've been watching the continuing Microsoft story with a vague combination of intrigue and apprehension. You almost certainly know by now that, due to an awkward combination of hubris and court orders, Microsoft has stopped shipping any Java implementation with Windows, choosing instead to hitch its wagon to a star of its own making, C#.
As a consumer, I'm not sure whether I like Microsoft's business practices. As a software developer, however, I'm interested in learning new languages and technologies. I've read enough to see that C# is enough like Java to make an interesting porting project. Even if I never write another line of C# code, there is certainly a lot to be learned from how Microsoft has integrated XML into its .NET platform.
In this series I'll be porting a few small XML applications, which I've hypothetically written in Java, to C# in order to see if I can improve my Java programming.
The Exercises
The first Java application to port to C#, which I call RSSReader, does something
that most XML programmers have done at some point: read in an RSS stream using SAX
and
convert it to HTML. For our purposes, I'll expect to be reading an RSS 1.0 stream
using JAXP
and outputing out an HTML stream using java.io
classes. We'll see that this
example ports nicely to the C# XmlReader
class.
Future examples will convert JDOM to the C# XmlDocument
and
XmlNode
classes, as well as experimenting with ports from an XML databinding
framework to ADO.NET. There's a lot to ADO.NET, and I'll discuss some of that as well.
Here's our first Java program, RSSReader, which was adapted from Sun's JAXP Tutorial. I've stripped out some of the error handling and such for the sake of simplicity.
package com.xml; import java.io.*; import java.util.Stack; import javax.xml.parsers.ParserConfigurationException; import javax.xml.parsers.SAXParser; import javax.xml.parsers.SAXParserFactory; import org.xml.sax.*; import org.xml.sax.helpers.DefaultHandler; public class RSSReader extends DefaultHandler { static private Writer out; static String lineEnd = System.getProperty("line.separator"); Stack stack = new Stack(); StringBuffer value = null; String title = null; String link = null; String desc = null; public static void main(String args []) { // create an instance of RSSReader DefaultHandler handler = new RSSReader(); try { // Set up output stream out = new OutputStreamWriter(System.out, "UTF8"); // get a SAX parser from the factory SAXParserFactory factory = SAXParserFactory.newInstance(); SAXParser saxParser = factory.newSAXParser(); // parse the document from the parameter saxParser.parse(args[0], handler); } catch (Exception t) { System.err.println(t.getClass().getName()); t.printStackTrace(System.err); } } public void startDocument() throws SAXException { emit("<html>" + lineEnd); } public void endDocument() throws SAXException { emit("</html>" + lineEnd); } public void startElement(String namespaceURI,String sName, String qName,Attributes attrs) throws SAXException { String eName = sName; // element name if ("".equals(eName)) eName = qName; // namespaceAware = false stack.push(eName); value = new StringBuffer(); } public void endElement(String namespaceURI,String sName,String qName) throws SAXException { String eName = (String)stack.pop(); if (eName.equals("title") && stack.peek().equals("channel")) { emit(" <head>" + lineEnd); emit(" <title>" + value + "</title>" + lineEnd); emit(" </head>" + lineEnd); emit(" <body>" + lineEnd); } else if (eName.equals("title") && stack.peek().equals("item")) { title = null == value ? "" : value.toString(); } else if (eName.equals("link") && stack.peek().equals("item")) { link = null == value ? "" : value.toString(); } else if (eName.equals("description") && stack.peek().equals("item")) { desc = null == value ? "" : value.toString(); } else if (eName.equals("item")) { emit(" <p><a href=\"" + link + "\">" + title + "</a><br>" + lineEnd); emit(" " + desc + "</p>" + lineEnd); } else if (eName.equals("channel")) { emit(" </body>" + lineEnd); } value = null; } public void characters(char buf [], int offset, int len) throws SAXException { String s = new String(buf, offset, len); value.append(s); } private static void emit(String s) throws SAXException { try { out.write(s); out.flush(); } catch (IOException e) { throw new SAXException("I/O error", e); } } }
Compiling and running this program, we get the following results when we try to read XMLhack.com's RSS 1.0 feed (some lines have been wrapped for legibility):
C:\> java com.xml.RSSReader http://xmlhack.com/rss.php <html> <head> <title>xmlhack</title> </head> <body> <p><a href="http://www.xmlhack.com/read.php?item=1511">Activity around the Dublin Core</a><br> The Dublin Core Metadata Initiative (DCMI) has seen a recent spate of activity, Recent publications include The Namespace Policy for the Dublin Core Metadata Initiative,
Expressing Simple Dublin Core in RDF/XML,
and Expressing Qualified Dublin Core in RDF/XML.</p> ...
An Intense and Bracing Introduction to C#
The most important thing to remember about C# is that although it's definitely not Java, it is a lot like Java. So much so that you can probably port a lot of your application logic with a simple search-and-replace operation. (I'll highlight those areas when we go over the relevant code examples.) There are a few syntactic differences between the languages; I'll mention a couple of the most interesting ones, but rather than go over them in detail, you should read about the language on your own. I recommend any or all of the following articles: Conversational C# for Java Programmers by Raffi Krikorian; C#: A language alternative or just J--? by Mark Johnson; and A Comparison of Microsoft's C# Programming Language to Sun Microsystems' Java Programming Language by Dare Obasanjo. There are also several books on C#, including Programming C# by Jesse Liberty and the forthcoming C# In A Nutshell by Peter Drayton & Ben Albahari. And of course, there is Microsoft's own invaluable .NET Framework Class Library.
Let's begin our introduction to C# by diving right into the code.
The first change really is a simple global search-and-replace. The C# equivalent of
the
System.out
class is Console
, and the println()
method's equivalent is WriteLine()
.
What we call packages in Java are called assemblies in C# -- an
oversimplification, but it will do for purposes -- and they are brought into scope
not with
the import
statement but with the using
statement. Also, rather
than declaring what package your class belongs to with a package
statement, the
assembly is declared through nested braces.
You don't have to be a rocket scientist to know that there is no
javax.xml.parsers
assembly in C#. Microsoft has provided an assembly called
System.Xml
which contains all the XML classes you're likely to need.
The last of our simple changes is one that may take some getting used to. Every Java
developer knows that method names begin with lowercase letters, right? Well, Microsoft
has
taken a different route: building on its MFC naming tradition, all method names in
C# begin
with capital letters, including Main()
.
Now let's see what those changes have done to our sample code so far.
using System; public class com { public class xml { public class RSSReader { static private Writer out; static String lineEnd = System.getProperty("line.separator"); Stack stack = new Stack(); StringBuffer value = null; String title = null; String link = null; String desc = null; public static void Main(String args[]) { // don't worry about the rest of this code yet // ... } } } }
Compiling Your Code
Since we've made our trivial changes, we might as well let the compiler tell us what else we have to do. Yes, I know there's going to be a lot of error messages, but we have to start somewhere.
Of course you know the command line to compile and run the Java version of RSSReader (I'm running this on Windows, since that's where my C# project lives):
javac -g -classpath %CLASSPATH% com\xml\RSSReader.java
Here is the equivalent C# compilation command line:
csc /debug /r:System.Xml.dll /t:exe RSSReader.cs
Just as javac
is the Java Compiler, csc
is the
Cee-Sharp Compiler (get it?). I've listed the parameters in the same
respective order in each compile line, so that you can see that the C# equivalent
of
-g
is /debug
and the Java -classpath
parameter
(which was not strictly necessary in this example, of course) is something like the
/r
switch. There's a final parameter on the C# compile line,
/t:exe
. In C#, you can run your code in the .NET runtime or you can compile
an executable directly. In this case I'm compiling an executable, saving the wonderful
world
of .NET for future articles.
If you run the compile line, you'll get a list of errors like the following.
Microsoft (R) Visual C# .NET Compiler version 7.00.9372.1 for Microsoft (R) .NET Framework version 1.0.3328 Copyright (C) Microsoft Corporation 2001. All rights reserved. RSSReader1.cs(7,29): error CS1519: Invalid token 'out' in class, struct, or interface member declaration RSSReader1.cs(16,42): error CS1552: Array type specifier, [], must appear before parameter name ...
Which should give you the impression that, although the languages are very similar, they're not identical. Maybe it's best not to try to compile it just yet. We're going to need to cover some minor syntax points first.
C# Minutiae
We need to make three very simple changes. First, out
is a reserved word in
C#, and, in fact, we should just delete that line and change all instances of
out.write()
to Console.Write()
.
Second, in Java the brackets of an array declaration may come either after the type
or
after the instance name. In our case, the Java code is written as String
args[]
. In C#, the brackets must come after the type, thusly: String []
args
. Another simple fix in two places.
Finally, several of our Java methods have throws
clauses. In C#, every
exception is a runtime exception; there is no throws
concept. We'll just delete
all the throws
clauses.
Compile again, and you'll see these errors:
Microsoft (R) Visual C# .NET Compiler version 7.00.9372.1 for Microsoft (R) .NET Framework version 1.0.3328 Copyright (C) Microsoft Corporation 2001. All rights reserved. RSSReader1.cs(9,7): error CS0246: The type or namespace name 'Stack' could not be found (are you missing a using directive or an assembly reference?) RSSReader1.cs(10,7): error CS0246: The type or namespace name 'StringBuffer' could not be found (are you missing a using directive or an assembly reference?) RSSReader1.cs(48,22): error CS0246: The type or namespace name 'Attributes' could not be found (are you missing a using directive or an assembly reference?)
Which raises two more issues. Some of the Java classes that you've come to know and
love
don't exist in the C# world, though similar assemblies are available. And some do
exist, but
you've got to bring those assemblies into scope with the using
statement. So
our solution to these errors is twofold.
First, add the following lines at the top of your C# file.
using System; using System.Collections; using System.IO; using System.Text;
Second, in addition to C# method names starting with a capital letter, there's another
perversity: string
starts with a lower case letter because it's actually
a C# primitive. There are also some minor differences in the names of utility classes.
For
example, instead of StringBuffer
, C# has a StringBuilder
class.
Try to suppress your gag reflex, and do some global search-and-replaces.
These changes will fix more compilation errors. And now we can learn about
XmlReader
.
XmlReader
System.Xml
contains many very useful classes. When you're talking about
stream-based XML parsing, however, XmlReader
is what you want..
XmlReader
is most analogous to SAX, although it does not require implementing
an interface as most SAX implementations do (as JAXP's does). Instead, you simply
instantiate a concrete XmlReader
of your choice -- there are several to choose
from: XmlTextReader
and XmlStreamReader
are the most useful --
then call its Read()
method and pick nodes off as they are returned to you.
In algorithmic terms, we could say that SAX uses callbacks, while XmlReader
uses an event loop. An event loop is an infinite loop, during which certain events
are
received and dispatched to you. Internally, SAX may well have the same sort of event
loop,
but the callback methods hide that particular detail from you. A callback method is
simply
invoked when SAX's parser comes across an event.
While that means that we don't need all those callback methods to satisfy an interface,
we
can still use them to dig through the nodes that we've read. In fact, we're just
retrofitting a SAX-like API on top of XmlReader
. Here's the new version of the
code; be sure and compare these to the Java version above.
public static void Main(string [] args) { // create an instance of RSSReader RSSReader reader = new RSSReader(); // parse the document from the parameter reader.Parse(args[0]); } // we have to write this method ourselves, since it's // not provided by the API public void Parse(string url) { try { XmlTextReader reader = new XmlTextReader(url); while (reader.Read()) { switch (reader.NodeType) { case XmlNodeType.Document: StartDocument(); break; case XmlNodeType.Element: string namespaceURI = reader.NamespaceURI; string name = reader.Name; Hashtable attributes = new Hashtable(); if (reader.HasAttributes) { for (int i = 0; i < reader.AttributeCount; i++) { reader.MoveToAttribute(i); attributes.Add(reader.Name,reader.Value); } } StartElement(namespaceURI,name,name,attributes); break; case XmlNodeType.EndElement: EndElement(reader.NamespaceURI, reader.Name, reader.Name); break; case XmlNodeType.Text: Characters(reader.Value,0,reader.Value.Length); break; // There are many other types of nodes, but // we are not interested in them } } } catch (XmlException e) { Console.WriteLine(e.Message); } } ... public void StartElement(string namespaceURI,string sName, string qName,Hashtable attrs) { string eName = sName; // element name if ("".Equals(eName)) eName = qName; // namespaceAware = false stack.Push(eName); value = new StringBuilder(); }
You'll notice that XmlReader
has several interesting members, including
NodeType
, NamespaceURI
, Name
, and
Value
. NodeType
can have several values depending on the node
read from the XML document; we're only interested in Element
,
EndElement
, and Text
, though there are several other node types
that you can use: CDATA
, ProcessingInstruction
,
Comment
, XmlDeclaration
, Document
,
DocumentType
, and EntityReference
.
As we receive one of the NodeType
s we're interested in, we read the
Name
or Value
and hand it off to our callback methods.
One other point to note is that SAX has an Attributes
class, for which the
closest equivalent in C# is XmlAttributeCollection
. However, creating one of
those is a more complex task than I really want to deal with right now, so instead,
I've
changed StartElement
's last parameter to a Hashtable
. I've done
some fancy processing to populate it, but we're not interested in attributes for this
example so I won't go into any more detail about it.
Now that we've got our event loop and our "callback" methods in place, we can port the RSS-to-HTML conversion logic from Java to C#. This isn't exactly rocket science either, as the syntax of C# is basically the same as Java. Don't forget, though: all method names start with capital letters.
Here is the complete source listing for RSSReader.cs
:
using System; using System.Collections; using System.IO; using System.Text; using System.Xml; public class com { public class xml { public class RSSReader { static String LineEnd = "\n"; Stack stack = new Stack(); StringBuilder value = null; string title = null; string link = null; string desc = null; public static void Main(string [] args) { // create an instance of RSSReader RSSReader reader = new RSSReader(); // parse the document from the parameter reader.Parse(args[0]); } // we have to write this method ourselves, since it's // not provided by the API public void Parse(string url) { try { XmlTextReader reader = new XmlTextReader(url); while (reader.Read()) { switch (reader.NodeType) { case XmlNodeType.Document: StartDocument(); break; case XmlNodeType.Element: string namespaceURI = reader.NamespaceURI; string name = reader.Name; Hashtable attributes = new Hashtable(); if (reader.HasAttributes) { for (int i = 0; i < reader.AttributeCount; i++) { reader.MoveToAttribute(i); attributes.Add(reader.Name,reader.Value); } } StartElement(namespaceURI,name,name,attributes); break; case XmlNodeType.EndElement: EndElement(reader.NamespaceURI, reader.Name, reader.Name); break; case XmlNodeType.Text: Characters(reader.Value,0,reader.Value.Length); break; // There are many other types of nodes, but // we are not interested in them } } } catch (XmlException e) { Console.WriteLine(e.Message); } } public void StartDocument() { Emit("<html>" + LineEnd); } public void EndDocument() { Emit("</html>" + LineEnd); } public void StartElement(string namespaceURI,string sName, string qName,Hashtable attrs) { string eName = sName; // element name if ("".Equals(eName)) eName = qName; // namespaceAware = false stack.Push(eName); value = new StringBuilder(); } public void EndElement(string namespaceURI,string sName, string qName) { string eName = (string)stack.Pop(); if (eName.Equals("title") && stack.Peek().Equals("channel")) { Emit(" <head>" + LineEnd); Emit(" <title>" + value + "</title>" + LineEnd); Emit(" </head>" + LineEnd); Emit(" <body>" + LineEnd); } else if (eName.Equals("title") && stack.Peek().Equals("item")) { title = null == value ? "" : value.ToString(); } else if (eName.Equals("link") && stack.Peek().Equals("item")) { link = null == value ? "" : value.ToString(); } else if (eName.Equals("description") && stack.Peek().Equals("item")) { desc = null == value ? "" : value.ToString(); } else if (eName.Equals("item")) { Emit(" <p><a href=\"" + link + "\">" + title + "</a><br>" + LineEnd); Emit(" " + desc + "</p>" + LineEnd); } else if (eName.Equals("channel")) { Emit(" </body>" + LineEnd); } value = null; } public void Characters(string buf, int offset, int len) { value.Append(buf); } private static void Emit(string s) { Console.Write(s); } } } }
Try it, and you'll see that it compiles without a problem. Run it like so (output wrapped for legibility):
C:\> RSSReader http://xmlhack.com/rss.php <head> <title>xmlhack</title> </head> <body> <p><a href="http://www.xmlhack.com/read.php?item=1511">Activity around the Dublin Core</a><br> The Dublin Core Metadata Initiative (DCMI) has seen a recent spate of activity, Recent publications include The Namespace Policy for the Dublin Core Metadata Initiative, Expressing Simple Dublin Core in RDF/XML, and Expressing Qualified Dublin Core in RDF/XML.</p> ...
This should look very familiar, as it's exactly the same output that our Java program produced. You might have seen a completely different result, however, like this:
C:\> RSSReader http://xmlhack.com/rss.php Unhandled Exception: System.Security.SecurityException: Request for the permission of type System.Net.WebPermission, System, Version=1.0.3300.0, Culture=neutral, PublicKeyToken=b77a5c561934e089 failed. at System.Security.CodeAccessSecurityEngine.CheckHelper( PermissionSet grantedSet, PermissionSet deniedSet, CodeAccess Permission demand, PermissionToken permToken) at System.Security.CodeAccessSecurityEngine.Check( PermissionToken permToken, CodeAccessPermission demand, StackCrawlMark& stackMark, Int32 checkFrames, Int32 unrestrictedOverride) at System.Security.CodeAccessSecurityEngine.Check( CodeAccessPermission cap, StackCrawlMark& stackMark) at System.Security.CodeAccessPermission.Demand() at System.Net.HttpRequestCreator.Create(Uri Uri) at System.Net.WebRequest.Create(Uri requestUri, Boolean useUriBase) at System.Net.WebRequest.Create(Uri requestUri) at System.Xml.XmlDownloadManager.GetNonFileStream(Uri uri, ICredentials credentials) at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials) at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn) at System.Xml.XmlTextReader.CreateScanner() at System.Xml.XmlTextReader.Init() at System.Xml.XmlTextReader.Read() at RSSReader.Parse(String url) in U:\thing\RSSReader.cs:line 31 at RSSReader.Main(String[] args) in U:\thing\RSSReader.cs:line 23 The state of the failed permission was: <IPermission class="System.Net.WebPermission, System, Version=1.0.3300.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" version="1"> <ConnectAccess> <URI uri="http://xmlhack\.com/rss\.php"/> </ConnectAccess> </IPermission>
This happens because C# will not load an assembly on a network drive. Just move your
RSSReader.exe
executable onto a local driveand try again.
Conclusions
What can we learn from XmlReader
? First of all, unlike Java's XML libraries,
all of System.Xml
is provided by Microsoft. This means that, among other
things, there is a consistent interface and a consistent set of tools for all your
XML
needs. No need to shop around for parsers and SAX implementations.
That can also be considered a drawback. Since Java has multiple implementations, you're free to use the one that fits best. And with the advent of JAXP, you can drop in the different implementations without changing your code at all. Doing that in C# is, well, impossible; you're stuck with the one, true Microsoft way.
As far as the XmlReader
event model, it doesn't seem that SAX has much to
learn at all. You'll remember that we actually had to write some additional code when
we
ported the program to XmlReader
because SAX provided the event loop for us;
with SAX, all we have to do is write the callbacks.
On the other hand, System.Xml
does provide some nifty classes like XmlAttribute
(which we didn't
really discuss here) and XmlNodeType
, which give you very convenient, standard
ways to look at attributes and nodes, instead of having to deal with strings and Hashtables
and such. And all these classes are used throughout C#'s XML facilities.
If you don't want to write either an event loop or callbacks, the read-only, forward-only,
stream-based model might not be for you; you might prefer a whole-document model (like,
say,
DOM). In that case, XmlReader
will not appeal to you any more than SAX does.
There is another set of tools in C#, starting with XmlDocument
, which we'll
discuss in the next article, which gives you all the power of a document stored in
memory,
plus the added convenience of building on what you've already learned.