Top Ten SAX2 Tips

December 5, 2001

If you write XML processing code in Java, or indeed most popular programming languages, you will be familiar with SAX, the Simple API for XML. SAX is the best API available for streaming XML processing, for processing documents as they're being read. SAX is the most flexible API in terms of data structures: since it doesn't force you to use a particular in-memory representation, you can choose the one you like best. SAX has great support for the XML Infoset (particularly in SAX2, the second version of SAX) and better DTD support than other widely available APIs. SAX2 is now part of JDK 1.4 and will soon be available to even more Java developers.

In this article, I'll highlight some points that can make your SAX programming in Java more portable, robust, and expressive. Some of these points are just advice, some address common programming problems, and some address SAX techniques that offer unique power to developers. A new book from O'Reilly, SAX2, addresses these topics and more. It details all the SAX APIs and explains each feature in more detail than this short article provides.

1. Keep it Simple

Despite being called the Simple API for XML, things are often more complicated than they first appear. SAX has grown to accommodate a lot of the flexibility needed by the tools and applications that process XML, but when you start out with SAX you should first focus on its underlying simplicity.

Think of SAX2 (including its standardized extensions) as basically including: one parser API, two handler interfaces for content, two handler interfaces for DTD declarations (the best support of any current Java parser API), and a bunch of other classes and interfaces. Many applications can ignore most of that and start with just a few classes and interfaces:

XMLReader is the basic parser interface, and you get a parser object using XMLReaderFactory.
DefaultHandler has no-op implementations of the most popular handler methods, which you can just override.
Attributes wraps up the attributes of the elements reported to you.

You can write useful tools with just those APIs, overriding only three methods in the DefaultHandler class: startElement() when the parser reports the beginning of an element and its attributes, characters() to handle character data inside such elements, and endElement() to report the end of the element.
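
As a minimal sketch of that three-method starting point, the following handler just counts what the parser reports. The class name ElementCounter and its count() helper are made up for this illustration. Note that recent JDKs mark XMLReaderFactory as deprecated in favor of javax.xml.parsers.SAXParserFactory, although it still works.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of SAX.
class ElementCounter extends DefaultHandler {
    int elements, attributes, chars, ends;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
        attributes += atts.getLength();
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        chars += length;   // text may arrive in several chunks; only the total matters here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        ends++;
    }

    // Parses the given XML text and returns the populated counter.
    static ElementCounter count(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            ElementCounter counter = new ElementCounter();
            parser.setContentHandler(counter);
            parser.parse(new InputSource(new StringReader(xml)));
            return counter;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling ElementCounter.count() on a small document is enough to see how start, character, and end callbacks pair up.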

You shouldn't use any other functionality until your application requires it. Some of the tips which follow describe common reasons to use more features. Good error handling is right at the top of the list of such reasons, and if you ever process documents with DTDs, smarter handling of external entity resolution won't be far behind.

2. Buffer `characters()` calls

Just because a bunch of text looks to you like it's one long set of characters doesn't mean that's how a SAX parser will report it. You need to explicitly group characters that your application thinks belong together. For example, consider this XML fragment:


<asana>Vrichikasana &mdash; Scorpion</asana>

Certainly you'll see callbacks for the element boundaries, and for the various characters. But how many callbacks will you see for those characters? It'd be legal (but annoying) for the parser to report one character per callback. More typically, parsers would report the characters before and after the mdash entity reference using one callback each, and also report the entity reference (plus its contents, whatever they are). Some parsers that don't report the entity reference would make only one characters() callback for the whole thing, assuming the entity is the standard ISO entity for Unicode character U+2014. There are even more legal ways to report that simple block of text. Your event handler needs to work with all of them.

The solution is to buffer up all the characters you receive in the characters() callback. You could just append to a String, but that's not particularly efficient. It's easier to use a StringBuffer, since one StringBuffer.append() signature is an exact match for the parameters in this callback, and it's easy to turn those into a String later:


class MyHandler implements ContentHandler {
	// accumulates text across however many characters() calls the parser makes
	private StringBuffer chars = new StringBuffer ();

	public void characters (char buf [], int offset, int length)
	throws SAXException
		{ chars.append (buf, offset, length); }

	// returns the buffered text and empties the buffer for the next use
	private String getCharacters ()
	{
		String retval = chars.toString ();
		chars.setLength (0);
		return retval;
	}

	// ... lots more in this class!
}

And now the interesting question is: when to collect that set of buffered characters to do something interesting with it? The answer depends on what your application is doing, but it'll usually be in endElement() or startElement(). Sometimes you'll collect the characters when there's a processingInstruction(), or, more rarely, when a comment() is reported. As a rule, avoid treating CDATA sections or entity expansions as if characters inside them were somehow special. Such boundaries are primarily for authoring convenience, and they shouldn't matter except to editor applications.

One scenario that's easy to handle is what's sometimes called "data elements" -- which contain text only and no other elements. (Their DTD content model might be (#PCDATA).) When you know that's what you're working with, collect the element's data in endElement(). That transparently ignores things like comments and PIs that might have been inside the element, as well as any entity or CDATA section boundaries found there. It's harder to give general rules for other kinds of content model, which is in part why many people like to specify the data style of element rather than allowing "mixed content" or using unrestricted content models like ANY. When a startElement() call needs to indicate the end of some text, your code can get complicated.
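
The "data element" case can be sketched as follows. This is an illustration under assumptions, not a general solution: the class name DataElementHandler and the collect() helper are hypothetical, it uses StringBuilder (the unsynchronized sibling of StringBuffer), and it simply throws away any text that precedes a child element's start tag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: records the text of each text-only element.
class DataElementHandler extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars.setLength(0);   // assumption: text before a child element is not wanted
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        chars.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = chars.toString();
        chars.setLength(0);
        if (!text.trim().isEmpty()) {
            // fall back to qName when the parser supplies no local name
            values.put(localName.isEmpty() ? qName : localName, text);
        }
    }

    static Map<String, String> collect(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            DataElementHandler handler = new DataElementHandler();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.values;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Because the text is only collected in endElement(), a comment or CDATA section boundary inside the element makes no difference to the value that is stored.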

Remember that if you're using DTDs, you'll likely get some calls to ignorableWhitespace() to report characters in "element content" models. I usually like to just discard all such characters, since they're known to be semantically meaningless. But sometimes that's not an option, and the solution is instead to call characters() with the ignorable whitespace characters. The parameters are the same; you don't even need to reorder them.


public void ignorableWhitespace (char buf [], int offset, int length)
throws SAXException
	{ characters (buf, offset, length); }

If you used only element content models and text-only content models, it'd be easy to get all the useful text from a valid XML document. It would be the content of "data elements" that you'd get when endElement() is called or in attribute values from startElement(). The rest would be ignorable whitespace, which you'd ignore.

3. Use `XMLReaderFactory` for Bootstrapping

Don't hardwire your code to use a particular SAX2 parser or to rely on features of a particular parser. Good SAX-based systems build almost everything as layers over the parser rather than using nonstandard features. In fact, the best way to bootstrap a SAX2 parser hides which parser you're using. It's a simple call to a helper class:


XMLReader parser = XMLReaderFactory.createXMLReader ();

That gives you the "system default" parser. Which parser is that? You can control that. The most reliable way is to specify the parser name on the command line, using the org.xml.sax.driver system property and the name of your parser to establish a particular JVM-wide default. You can do it like this:


java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader MyMainClass arg ...

Some current SAX2 distributions (SAX2 r2pre3 at this writing but not JDK 1.4) include easier ways to control the SAX2 default. One way is through a system resource that's accessed through your class loader: the META-INF/services/org.xml.sax.driver resource. That's sensitive to your class loader configuration; in some cases that may be a feature. Such recent distributions also expect redistributions (from parser suppliers) to include a compiled-in "last gasp" default, which handles the case where none of the other configuration mechanisms have been set up.
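
When several configuration mechanisms are in play, it can help to check which parser the factory actually chose. The following is a small diagnostic sketch; the class name ParserProbe and its method are hypothetical, and the output format is arbitrary.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical diagnostic helper, not part of SAX.
class ParserProbe {
    // Reports the configured driver property (if any) next to the class
    // that XMLReaderFactory really instantiated.
    static String describeDefaultParser() {
        String configured = System.getProperty("org.xml.sax.driver");
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            return "configured=" + configured + ", actual=" + parser.getClass().getName();
        } catch (SAXException e) {
            return "configured=" + configured + ", no parser available: " + e.getMessage();
        }
    }
}
```

Printing the result of ParserProbe.describeDefaultParser() at startup is one cheap way to notice that a class loader or property setting is not doing what you expected.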

The following table gives the names of some widely used SAX2 parsers. You should avoid hardwiring such names into your source code; instead use the parser configuration mechanisms to keep your code free of parser dependencies. All of these are optionally validating, except the one labeled non-validating, and most do quite well on most XML conformance tests.

Parser	Class Name
Ælfred2	`gnu.xml.aelfred2.SAXDriver` (non-validating), or else `gnu.xml.aelfred2.XmlReader`
Crimson	`org.apache.crimson.parser.XMLReaderImpl`
Oracle	`oracle.xml.parser.v2.SAXParser`
Xerces Java	`org.apache.xerces.parsers.SAXParser`

If you're still using a SAX1 parser, and setting the org.xml.sax.parser system property to point to that parser, the XMLReaderFactory will fall back to that class if it can't find a native SAX2 parser implementation. You should probably upgrade to a more current implementation, but meanwhile you can continue to use your old one. It will be automagically wrapped in a ParserAdapter by the SAX2 factory.

4. Check for empty Namespace URI Strings

Namespaces have caused a lot of grief for XML developers. At first the use of namespace URIs as purely abstract identifiers caused the confusion, since they looked like URLs that would be used to fetch something (but nobody knew what). But it didn't stop there. Even today reasonable people (along with the applications and tools that they build) have very different perspectives on what it means to be in a particular namespace. It seems to be a rare month in which significant misunderstandings don't crop up in some area of namespace handling.

There's only one basic thing that programmers can do with any namespace URI: compare it to another one as a string. But not every name in an XML document has a namespace URI, and names in namespaces need to be handled differently from names that aren't in a namespace. (You can rely on either the qName or the localName to have a value, but not both. Either name will be an empty string in some cases.) You might be tempted to write code that assumes every XML element or attribute name is in a namespace, but that just doesn't match real world data. One day you'll get a document that's not quite as clean as you expect, and your code will break.

Which means that when you're writing SAX2 code to look at element or attribute names, you have to figure out whether there's even a namespace name. When there isn't, the namespace URI is always passed as an empty string. Once you know which kind of name you're working with, you can figure out how to handle the element or attribute in question. Inline code to do name-based dispatching should look something like the following (for elements); notice that it doesn't even know there's such a thing as a namespace prefix:


public void startElement (
	String uri, String localName,
	String qName, Attributes atts
) throws SAXException
{
	// Handle elements not in any namespace
	if ("".equals (uri)) {
		// these only have "qName"
		if ("dolce".equals (qName)) {
			// ... handle "dolce"
		} else if ("vita".equals (qName)) {
			// ... handle "vita"

		// ... and all other supported "no namespace" elements
		} else
			error ("unrecognized element name: " + qName);

	// Then handle each supported namespace separately
	} else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
		// these have a "localName" with no prefix
		if ("free".equals (localName)) {
			// ... handle "free"
		} else if ("open".equals (localName)) {
			// ... handle "open"

		// ... and all other supported NS1 elements
		} else
			error ("unrecognized NS1 element name: " + localName);

	// ... and similarly for all other supported element namespaces
	} else
		error ("unrecognized element namespace: " + uri);
}

Attributes might not need that kind of handling. Applications often "know about" particular attributes, access them by name, and just ignore any unrecognized attributes. If you're accessing attribute values in that way, just make sure you use the right naming convention, either Attributes.getValue(uri,local) or Attributes.getValue(qName), and you should have no problems.
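
Here is a small sketch of the lookup-by-name style, assuming namespace processing is enabled (the SAX2 default). The class AttributeLookup, its fields, and the attribute names are hypothetical. An attribute that isn't present comes back as null.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: looks up known attributes on the root element.
class AttributeLookup extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";
    String free, size, color;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        free = atts.getValue(NS1, "free");    // attribute in the NS1 namespace
        size = atts.getValue("", "size");     // attribute in no namespace
        color = atts.getValue("", "color");   // absent attributes yield null
    }

    static AttributeLookup parse(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            AttributeLookup handler = new AttributeLookup();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The lookup never mentions a prefix, so the document is free to bind NS1 to any prefix it likes.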

Otherwise you'll be scanning all of an element's attributes. You'll need to check whether each attribute is in a namespace, just like you checked whether its element was in a namespace. If it's not in a namespace, you probably know a bit more about the attribute than in the case of an element that's not in a namespace. It's either going to be associated with that element's type or, if you've enabled reporting of namespace prefixes, it'll be a namespace declaration. (That's required by the Namespaces in XML specification, but DOM and the XML Infoset have chosen instead to put such declarations into a namespace.) Your code might look something like this:


Attributes atts = ...;
int length = atts.getLength ();

for (int i = 0; i < length; i++) {
	String uri = atts.getURI (i);

	if ("".equals (uri)) {
		String qName = atts.getQName (i);

		// ... then dispatch based on qName
		// including error based on unrecognized name
		// "xmlns" and "xmlns:*" declarations would appear here

	} else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
		String localName = atts.getLocalName (i);

		// ... then dispatch based on "localName"
		// including error based on unrecognized name

	// ... and similarly for all other supported attribute namespaces
	} else
		error ("unrecognized attribute namespace: " + uri);
}

If your code uses idioms like those shown above, it'll be handling namespaces correctly. Otherwise, you're likely to run into a document or parser that confuses your code. Don't try to ignore namespaces completely. If your code wants a simpler "pre-namespaces" view of the world, at least make sure the namespace URI is always empty and report errors for all elements and attributes where that's not true.

5. Provide an ErrorHandler, especially when you're validating

If you've never set up a parser to validate, like


parser.setFeature (
	"http://xml.org/sax/features/validation",
	true);

and then been surprised when you didn't get any error reports, congratulations! You're in the minority. Most developers have forgotten (and usually more than once) that by default, validity errors are ignored by SAX parsers. You need to call XMLReader.setErrorHandler() with some useful error handler to make validity errors have any effect at all. That handler needs to do something interesting with the validity errors reported to it using error() calls.

It's worth having a good utility class that you can reuse and reconfigure to handle this particular situation. It'll be handy even when you're not validating. Such a class might look like this:


class MyErrorHandler implements ErrorHandler
{
	private void print (String label, SAXParseException e)
	{
		System.err.println ("** " + label + ": " + e.getMessage ());
		System.err.println ("   URI  = " + e.getSystemId ());
		System.err.println ("   line = " + e.getLineNumber ());
	}

	// policy switches: whether to report faults, and whether they stop parsing
	boolean reportErrors = true;
	boolean abortOnErrors = false;

	// for recoverable errors, like validity problems
	public void error (SAXParseException e)
	throws SAXException
	{
		if (reportErrors)
			print ("error", e);
		if (abortOnErrors)
			throw e;
	}

	// ... plus similar for fatalError(), warning()
	// ... and maybe more interesting configuration support
}

A SAX ErrorHandler should know two policies for each of its three fault classes: whether to report such faults, and whether such faults should terminate parsing. Various mechanisms can be used to report the fault, such as logging, adding text to a Swing message window, or just printing. At this time, SAX doesn't support portable mechanisms to identify particular failure modes, so you can't really consider "why did it fail?" in the handler.
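
To see validity errors actually arrive, here is a sketch that turns validation on and collects whatever the parser reports. It assumes the default parser is a validating one; the class name ValidationDemo and the choice to collect messages in a list are purely illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of SAX.
class ValidationDemo {
    // Returns the faults reported while validating the document;
    // an empty list means nothing was reported.
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setFeature("http://xml.org/sax/features/validation", true);
            parser.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    problems.add("warning: " + e.getMessage());
                }
                public void error(SAXParseException e) {
                    problems.add("error: " + e.getMessage());   // report, but keep parsing
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;                                    // can't continue
                }
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }
}
```

A document whose DTD demands a child element that isn't there should come back with at least one "error:" entry, while the corrected document should come back with none.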

6. Share your `ErrorHandler` between the XMLReader and your own Handlers

When your application uses the same ErrorHandler in its own handlers and for the parser, it creates an integrated stream of fault information. That's useful in its own right, but the best part is that all the errors (and warnings) can then be handled according to the same policy and mechanism. You can easily change how faults are handled by switching or reconfiguring that ErrorHandler object. In most cases, the SAX fault classifications are fine, since having more than fatalError(), error(), and warning() will rarely be helpful. Here's how you might set this up for a simple handler:


public class MyHandler implements ContentHandler
{
	// doesn't matter if this stays as null, since
	// SAXParseException constructors don't care
	private Locator locator;

	public void setDocumentLocator (Locator l)
		{ locator = l; }

	// application and SAX errors should use the same handler
	private ErrorHandler eh = new DefaultHandler ();

	public void setErrorHandler (ErrorHandler e)
		{ eh = (e == null) ? new DefaultHandler () : e; }

	// simpler is usually better ...
	public final void error (String message) throws SAXException
		{ eh.error (new SAXParseException (message, locator)); }
	public final void warning (String message) throws SAXException
		{ eh.warning (new SAXParseException (message, locator)); }
	public final void fatalError (String message) throws SAXException
	{
		SAXParseException e = new SAXParseException (message, locator);
		eh.fatalError (e);
		// in case eh tries to continue:  we can't, and won't
		throw e;
	}

	// the real application code would use error(String) and friends
	// to report errors, something like this:

	public void endElement (String uri, String localName, String qName)
	throws SAXException
	{
		// ... branch to figure out which element's processing to do ...

		if (!processData (getCharacters ())) {
			error ("bad '" + localName + "' content");
			// recover from it (clean up state) and
			return;
		}
		// ... now repackage and save all the object's state
	}

	// ... lots more code
}

Then you should initialize both the XMLReader and your content handlers (including any that process DTD content) to use the same ErrorHandler. The SAX ErrorHandler interface is flexible enough to use as a general error handling policy interface in much of your XML code. In fact, you may have noticed that the javax.xml.parsers.DocumentBuilder class uses one to simplify error reporting when building a DOM Document.

If you want, your application can subclass SAXParseException to provide some application-specific exception information, which might be understood by that error handler. It might use information about what happened to make more enlightened decisions about how to handle the problem.
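
One way this could look is sketched below. Everything here is hypothetical: the RangeException subclass, its field member, and the policy of tolerating range problems while aborting on every other error.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical application-specific exception carrying extra detail.
class RangeException extends SAXParseException {
    final String field;

    RangeException(String message, Locator locator, String field) {
        super(message, locator);   // a null locator is acceptable
        this.field = field;
    }
}

// Hypothetical handler that understands RangeException and tolerates it.
class PolicyHandler implements ErrorHandler {
    final List<String> log = new ArrayList<>();

    public void warning(SAXParseException e) {
        log.add("warning: " + e.getMessage());
    }

    public void error(SAXParseException e) throws SAXException {
        if (e instanceof RangeException) {
            log.add("range problem in " + ((RangeException) e).field);
            return;                // known, recoverable: keep going
        }
        throw e;                   // anything else stops processing
    }

    public void fatalError(SAXParseException e) throws SAXException {
        throw e;
    }
}

// Small driver showing the effect of the policy.
class PolicyDemo {
    static String report(ErrorHandler eh, SAXParseException e) {
        try {
            eh.error(e);
            return "continued";
        } catch (SAXException stop) {
            return "aborted";
        }
    }
}
```

The content handler would throw or report a RangeException from its own error(String)-style helpers, and the shared handler decides what that means for the parse.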

7. Track Context with a Stack

Once developers get past the initial milestone of learning how SAX parser callbacks map to the input text, the next step is to figure out how to turn such a stream of callbacks into application data. Certainly SAX is low overhead, and no other API is likely to get less in the way. At the same time, SAX is not exactly going out of its way to package things neatly. It's the very fact that SAX doesn't pick data structures for you that makes it so powerful. That can take getting used to, particularly if you're used to thinking in terms of structures that someone else designed.

A good place to start is to make a ContentHandler implementation that keeps important information in a stack. For example, you could define a class that records an element name (with its namespace, if any) and uses the AttributesImpl class to snapshot its associated attributes. If you create those entries in startElement() and stack them, any callback could use that information before endElement() popped the stack. Certain attributes, including xml:base, xml:lang, and xml:space, are in a sense "inherited", and you might need to walk up that stack to find such a value while processing other event callbacks.

Such stack entries are also convenient places to collect application-specific information about an element's children. For example, you might be unmarshaling a series of data elements, converting them from strings into more specialized data types as you parse. You'd store those converted values in members of that special stack entry, reporting application level errors when they're detected. Periodically you could transform such entries (or subtrees of entries) into custom data structures that might no longer reflect the way XML text happened to encode that data.
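
A minimal sketch of such a stack follows, using it to find the inherited xml:lang value for each element. The names LangTracker and Entry are hypothetical, and a real application would hang much more state off each entry.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: tracks element context in a stack.
class LangTracker extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    private static final class Entry {
        final String name;
        final Attributes atts;
        Entry(String name, Attributes atts) {
            this.name = name;
            this.atts = new AttributesImpl(atts);   // snapshot: parsers may reuse the original
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        stack.push(new Entry(name, atts));
        seen.add(name + "=" + inheritedLang());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    // Walks from the innermost element outward until xml:lang is found.
    private String inheritedLang() {
        for (Entry e : stack) {
            String lang = e.atts.getValue(XML_NS, "lang");
            if (lang != null) return lang;
        }
        return null;
    }

    static List<String> track(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            LangTracker handler = new LangTracker();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.seen;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The same walk works for xml:base and xml:space, and the Entry class is where converted child values would accumulate.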

Of course if you track every data item that comes in through SAX, you're starting down a well trodden path. There are plenty of APIs that do that, optimized for one model or another but likely not for your particular application. Still, it can be good fun and useful to build up SAX infrastructure for your application that way.

8. Use an `InputSource` to wrap in-memory data

New SAX programmers often end up with some data in memory, perhaps in a string or other data buffer, that needs to be parsed as XML. (Maybe it came from a database or was built by some other program component.) It's easy to use SAX to parse these, since the java.io package provides classes that let you create character streams from character data. You can use CharArrayReader to read from arrays of characters, or StringReader as shown here when the data starts as a string:


Reader reader;
InputSource in;
XMLReader parser;

reader = new StringReader ("<bank name='Gringott&apos;s' box='713'/>");
in = new InputSource (reader);
parser = XMLReaderFactory.createXMLReader ();

parser.parse (in);

You can do similar things with byte arrays, using the ByteArrayInputStream class to create a byte stream, but in that case you've got to be careful about character encoding issues. It's best if those bytes are UTF-8 encoded XML data.
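
The byte array variant might be sketched like this, assuming the bytes really are UTF-8. The class ByteParseDemo and its bankName() method are hypothetical. Calling InputSource.setEncoding() states the encoding explicitly rather than leaving the parser to detect it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of SAX.
class ByteParseDemo {
    // Parses UTF-8 encoded XML held in memory and returns the
    // "name" attribute of the last element reported.
    static String bankName(byte[] utf8Xml) {
        final String[] name = new String[1];
        try {
            InputSource in = new InputSource(new ByteArrayInputStream(utf8Xml));
            in.setEncoding("UTF-8");   // assumption: the caller guarantees UTF-8
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    name[0] = atts.getValue("", "name");
                }
            });
            parser.parse(in);
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return name[0];
    }
}
```

If the bytes came from String.getBytes(), pass an explicit charset there too, so both ends agree on the encoding.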

Such input sources can be used as direct parser inputs (as shown here) or, if you're using DTDs and entities defined in them, through an EntityResolver.

9. Manage External Entity Lookups with an `EntityResolver`

XML uses external entities to support document modularity; they are available if you're using DTDs. When a document references an entity, parsers normally fetch it and parse the result. That's exactly what you need in most cases, but it causes problems when the server hosting that URL goes offline for a while (or maybe it was your client that wanted to be disconnected?), and when the network is unreliable. Your whole application could become unavailable, just because it's trying to get a resource that can't be gotten.

How can you avoid entity access problems? SAX2 gives you two basic controls over entity processing.

First, two SAX2 feature flags control whether external entities are ever fetched. One affects parameter entities (like %module;) which are used inside the DTD. The other affects general entities (like &data;) in the body of the document. Most SAX parsers don't let you turn off this fetching, but if you're using one which does, this may be a fine solution. (The current Ælfred2 release supports this, but I don't know another SAX2 parser that does.) So you may not be able to use this facility.
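
Since support varies by parser, code that wants to use these flags should be prepared for a refusal. The sketch below uses the two standard SAX2 feature URIs; the class EntityFeatures and its helper methods are hypothetical.

```java
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper for switching off external entity fetching.
class EntityFeatures {
    static final String GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";

    // Tries to turn the feature off; returns true only if the parser
    // accepted the request and now reports the feature as disabled.
    static boolean disable(XMLReader parser, String feature) {
        try {
            parser.setFeature(feature, false);
            return !parser.getFeature(feature);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;   // this parser can't do it; fall back to an EntityResolver
        }
    }

    static XMLReader newReader() {
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new IllegalStateException("no SAX2 parser available", e);
        }
    }
}
```

A false return value is the cue to rely on the second control, described next.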

Second, you can use an EntityResolver to control how entities are resolved. Whenever a SAX parser needs to access an external entity, it will ask the resolveEntity() method on your resolver how to handle that entity. That method sees the entity's fully resolved URI and, if it had one, its public ID. (A new SAX extension is in the works to provide more information, but it's not widely supported yet.) Some interesting things for that method to do include:

Map public IDs to local file names. That's what public IDs were designed for, and hashtables were designed for such mappings. Strongly encouraged! You can do the same thing for system IDs. (There are also "catalog" systems to help manage such mappings. You may want to use a resolver that knows how to use one.)
Fetch or compute the data, maybe using a database. If you're using a private URI scheme that your JVM doesn't understand, maybe blob:database-name:database-key, you'll probably want to store those in the public IDs and do the URI resolution yourself.
Construct an empty input source and return that. This is safe to do for general entities, after the first startElement(), and a bit dangerous for parameter entities, but you may be better off trying to skip some remote entities than trying to access them. (The issue with handling parameter entities this way is that the parser won't know it didn't see their declarations, and so it won't behave correctly.)

A simple entity resolver might look like this for an application that's really paranoid about preventing access to all entities it doesn't control. If you were using it, you'd probably preload the hashtable with entries for all of your application's entities. And you'd probably apply intelligence about what requests are really unsafe or your customers would get unhappy. For example, maybe string prefix matches would be used to grant access to certain files inside the firewall (or its DMZ), and only the ones outside that security boundary would be airbrushed out of the picture.


import java.io.IOException;
import java.io.StringReader;
import java.util.Hashtable;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MyResolver implements EntityResolver
{
	// map public IDs and system IDs to replacement system IDs
	private Hashtable publics, systems;

	MyResolver (Hashtable pub, Hashtable sys)
		{ publics = pub; systems = sys; }

	public InputSource resolveEntity (String publicId, String systemId)
	throws IOException, SAXException
	{
		InputSource retval = null;

		if (publicId != null) {
			String value = (String) publics.get (publicId);

			if (value != null) {
				// use new system ID and original public ID
				retval = new InputSource (value);
				retval.setPublicId (publicId);
			}
		}
		if (retval == null) {
			String value = (String) systems.get (systemId);

			if (value != null) {
				// use new system ID and original public ID
				retval = new InputSource (value);
				retval.setPublicId (publicId);
			}
		}
		if (retval == null) {
			// we're sooo paranoid here!!
			System.err.println ("RESOLVER: punt " + systemId + " "
				+ (publicId == null ? "" : publicId));
			retval = new InputSource (new StringReader (""));
			retval.setSystemId (systemId);
			retval.setPublicId (publicId);
		}
		// if we returned null, the systemId would
		// be dereferenced using standard URL handling.
		return retval;
	}
}

A good rule of thumb is always to use a resolver for any application that reuses a known set of DTDs. Do it, if for no other reason than to avoid accessing the network when you don't need to. Only mission critical servers would likely want to be as paranoid as shown above.
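
Wiring a resolver into a parser is a single call to setEntityResolver(). The sketch below is deliberately at the paranoid extreme: it answers every request with an empty input source and records what was asked for. As noted above, that is risky for DTDs and parameter entities. The class OfflineParse and the example.invalid URL are made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of SAX.
class OfflineParse {
    // Parses the document without fetching any external entity, and
    // returns the system IDs the parser asked the resolver about.
    static List<String> parseOffline(String xml) {
        List<String> requested = new ArrayList<>();
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setEntityResolver((publicId, systemId) -> {
                requested.add(systemId);
                InputSource empty = new InputSource(new StringReader(""));
                empty.setSystemId(systemId);
                empty.setPublicId(publicId);
                return empty;   // returning null would fall back to normal URL fetching
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return requested;
    }
}
```

In a real application the lambda would be replaced by a resolver that maps the known DTDs to local copies.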

10. Use a Pipelined Processing Model

SAX is made for streaming processing, and the best way to stream your processing is to connect a series of processing components into an event pipeline. One component produces events, the next consumes them and produces new (or maybe filtered) events for yet another component to consume. Often, both your CPU and I/O subsystems can be working on different parts of the pipeline at the same time, minimizing elapsed time.

SAX parsers produce events, but they're not the only way to produce a stream of SAX events. One common practice is to have programs call the SAX event methods directly, perhaps while walking over a data structure as part of converting it to XML. SAX2 defines a way to make a SAX parser that walks a DOM tree, rather than XML text, emitting a stream of SAX events. And toolsets like DOM4J and JDOM haven't neglected such data-to-SAX converters, either. Think of that SAX event stream as an efficient in-memory version of the generic transfer syntax which XML provides between different processes.
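
Calling the event methods directly can be sketched as follows, walking a Map and emitting one element per entry. The class MapToSax, the "record" element name, and the toy consumer are all hypothetical. The consumer here only concatenates markup for demonstration and does no escaping.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: turns a Map into a stream of SAX events.
class MapToSax {
    static void emit(Map<String, String> data, ContentHandler out) throws SAXException {
        Attributes none = new AttributesImpl();   // no attributes on any element
        out.startDocument();
        out.startElement("", "record", "record", none);
        for (Map.Entry<String, String> e : data.entrySet()) {
            char[] text = e.getValue().toCharArray();
            out.startElement("", e.getKey(), e.getKey(), none);
            out.characters(text, 0, text.length);
            out.endElement("", e.getKey(), e.getKey());
        }
        out.endElement("", "record", "record");
        out.endDocument();
    }

    // Toy consumer that renders the events as text (no escaping).
    static String render(Map<String, String> data) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler sink = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                sb.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] buf, int offset, int length) {
                sb.append(buf, offset, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                sb.append("</").append(qName).append('>');
            }
        };
        try {
            emit(data, sink);
        } catch (SAXException e) {
            throw new IllegalStateException("event consumer failed", e);
        }
        return sb.toString();
    }
}
```

Any ContentHandler can stand in for the sink, which is what makes such producers interchangeable with a parser at the head of a pipeline.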

Your "ultimate consumer" in a SAX event pipeline could write XML text out (use one of the various XMLWriter classes) or turn the events into an application-optimized data structure. It's easy to build a DOM (or DOM4J, or JDOM) model from a modified SAX event stream, too. And since you have control over what happens, you don't have to build the entire generic tree structure before you begin processing it; if you do it that way, you can garbage collect each chunk of data as soon as you're done processing it, rather than waiting for the whole document to materialize in memory.

If you're using XSLT in Java, you may well be familiar with the javax.xml.transform.sax (TRAX) package. XSLT engines such as SAXON or Xalan support it. You may not know that it's easy to feed SAX events as inputs to an XSLT engine as a SAX pipeline stage, using a TransformerHandler, or to collect XSLT engine output as SAX events using a SAXResult. SAX events in, transformation according to XSLT, and then SAX events out again: those TRAX APIs are essentially wrappers around SAX pipeline stages! It can be very worthwhile to unwrap them and use XSLT for some heavier weight transformations in your SAX pipelines.
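
As a rough sketch of a TransformerHandler used as a pipeline stage, the code below feeds parser events into an identity transformation and serializes the result. The class TraxStage is hypothetical. A real stage would pass a stylesheet Source or Templates object to newTransformerHandler(), and could use a SAXResult instead of a StreamResult to keep the output as SAX events.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class, not part of SAX or TRAX.
class TraxStage {
    // Pushes SAX events through an identity transformer and returns the text.
    static String identity(String xml) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
                throw new IllegalStateException("this XSLT engine has no SAX support");
            }
            TransformerHandler stage = ((SAXTransformerFactory) tf).newTransformerHandler();
            StringWriter out = new StringWriter();
            stage.setResult(new StreamResult(out));

            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler(stage);   // the handler is just another ContentHandler
            parser.parse(new InputSource(new StringReader(xml)));
            return out.toString();
        } catch (TransformerConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("pipeline failed", e);
        }
    }
}
```

Because TransformerHandler extends ContentHandler, it drops into the same slot as any handwritten handler.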

I could go on about pipelines, but I'll just mention that SAX2 includes an XMLFilterImpl class, handy for writing some kinds of intermediate pipeline stages, and stop. Pipelines are covered in more detail in that new book that I mentioned. The main thing to remember is that event pipelines are the natural model for components in SAX. You should plan to use them if you're doing anything very substantial.
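
For a feel of what an XMLFilterImpl stage looks like, here is a sketch of a filter that drops one named element and everything inside it before the events reach the next consumer. The class DropFilter and the helper that lists surviving element names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical pipeline stage: suppresses one element and its subtree.
class DropFilter extends XMLFilterImpl {
    private final String dropped;
    private int skipDepth = 0;   // > 0 while inside a dropped subtree

    DropFilter(XMLReader parent, String dropped) {
        super(parent);
        this.dropped = dropped;
    }

    private static String name(String localName, String qName) {
        return localName.isEmpty() ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || dropped.equals(name(localName, qName))) {
            skipDepth++;
            return;
        }
        super.startElement(uri, localName, qName, atts);   // pass the event downstream
    }

    @Override
    public void characters(char[] buf, int offset, int length) throws SAXException {
        if (skipDepth == 0) super.characters(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    // Runs parser -> filter -> consumer and lists the element names that survive.
    static String namesAfterFiltering(String xml, String dropped) {
        StringBuilder names = new StringBuilder();
        try {
            XMLReader filter = new DropFilter(XMLReaderFactory.createXMLReader(), dropped);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.append(name(localName, qName)).append(' ');
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("pipeline failed", e);
        }
        return names.toString().trim();
    }
}
```

Since a filter is itself an XMLReader, stages like this can be stacked by passing one filter as the parent of the next.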

If you've read this far, you deserve a special bonus tip. SAX has its own site, http://www.saxproject.org. Visit the site for the latest information and updated documentation about SAX.

David Brownell, author of SAX2, is a software engineer. He recently worked for three years at JavaSoft, where he provided Sun's XML and DOM software, SSL and public key technologies, the original version of the JavaServer Pages technology, and worked on the Java Servlet API for Web servers.

O'Reilly & Associates will soon release (January 2002) SAX2.
