Top Ten SAX2 Tips
December 5, 2001
If you write XML processing code in Java, or indeed most popular programming languages, you will be familiar with SAX, the Simple API for XML. SAX is the best API available for streaming XML processing, for processing documents as they're being read. SAX is the most flexible API in terms of data structures: since it doesn't force you to use a particular in-memory representation, you can choose the one you like best. SAX has great support for the XML Infoset (particularly in SAX2, the second version of SAX) and better DTD support than other widely available APIs. SAX2 is now part of JDK 1.4 and will soon be available to even more Java developers.
In this article, I'll highlight some points that can make your SAX programming in Java more portable, robust, and expressive. Some of these points are just advice, some address common programming problems, and some address SAX techniques that offer unique power to developers. A new book from O'Reilly, SAX2, addresses these topics and more. It details all the SAX APIs and explains each feature in more detail than this short article provides.
1. Keep it Simple
Although SAX is called the Simple API for XML, things are often more complicated than they first appear. SAX has grown to accommodate a lot of the flexibility needed by the tools and applications that process XML, but when you start out with SAX you should first focus on its underlying simplicity.
Think of SAX2 (including its standardized extensions) as basically including: one parser API, two handler interfaces for content, two handler interfaces for DTD declarations (the best support of any current Java parser API), and a bunch of other classes and interfaces. Many applications can ignore most of that and start with just a few classes and interfaces:
- XMLReader is the basic parser interface, and you get a parser object using XMLReaderFactory.
- DefaultHandler has no-op implementations of the most popular handler methods, which you can just override.
- Attributes wraps up the attributes of the elements reported to you.
You can write useful tools with just those APIs, overriding only three methods in the DefaultHandler class: startElement() when the parser reports the beginning of an element and its attributes, characters() to handle character data inside such elements, and endElement() to report the end of the element.
You shouldn't use any other functionality until your application requires it. Some of the tips which follow describe common reasons to use more features. Good error handling is right at the top of the list of such reasons, and if you ever process documents with DTDs, smarter handling of external entity resolution won't be far behind.
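As a rough illustration of that starting point, here is a minimal sketch that uses only XMLReader, DefaultHandler, and Attributes. The EventLogger class name, its log() helper, and the sample document are made up for this example. On newer JDKs XMLReaderFactory is marked deprecated, which is why the warning is suppressed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: records one line per SAX event it sees.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName + " (" + atts.getLength() + " attributes)");
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // A parser may make several of these calls for one run of text.
        events.add("chars " + length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    // Parses the given XML text and returns the recorded events.
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> log(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        EventLogger handler = new EventLogger();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : log("<pose level='2'>Scorpion</pose>")) {
            System.out.println(event);
        }
    }
}
```

Everything else in SAX2 can wait until a handler like this one stops being enough.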
2. Buffer characters() calls
Just because a bunch of text looks to you like it's one long set of characters doesn't mean that's how a SAX parser will report it. You need to explicitly group characters that your application thinks belong together. For example, consider this XML fragment:
<asana>Vrichikasana &mdash; Scorpion</asana>
Certainly you'll see callbacks for the element boundaries, and for the various characters. But how many callbacks will you see for those characters? It'd be legal (but annoying) for the parser to report one character per callback. More typically, parsers would report the characters before and after the mdash entity reference using one callback each, and also report the entity reference (plus its contents, whatever they are). Some parsers that don't report the entity reference would make only one characters() callback for the whole thing, assuming the entity is the standard ISO entity for Unicode character U+2014. There are even more legal ways to report that simple block of text. Your event handler needs to work with all of them.
The solution is to buffer up all the characters you receive in the characters() callback. You could just append to a String, but that's not particularly efficient. It's easier to use a StringBuffer, since one StringBuffer.append() signature is an exact match for the parameters in this callback, and it's easy to turn the buffered characters into a String later:
    class MyHandler implements ContentHandler {
        private StringBuffer chars = new StringBuffer ();

        public void characters (char buf [], int offset, int length)
        throws SAXException
        {
            chars.append (buf, offset, length);
        }

        // returns the buffered text and empties the buffer
        private String getCharacters ()
        {
            String retval = chars.toString ();
            chars.setLength (0);
            return retval;
        }

        // ... lots more in this class!
    }
And now the interesting question is: when to collect that set of buffered characters to do something interesting with it? The answer depends on what your application is doing, but it'll usually be in endElement() or startElement(). Sometimes you'll collect the characters when there's a processingInstruction(), or, more rarely, when a comment() is reported. As a rule, avoid treating CDATA sections or entity expansions as if characters inside them were somehow special. Such boundaries are primarily for authoring convenience, and they shouldn't matter except to editor applications.
One scenario that's easy to handle is what's sometimes called "data elements" -- which contain text only and no other elements. (Their DTD content model might be (#PCDATA).) When you know that's what you're working with, collect the element's data in endElement(). That transparently ignores things like comments and PIs that might have been inside the element, as well as any entity or CDATA section boundaries found there. It's harder to give general rules for other kinds of content model, which is in part why many people like to specify the data style of element rather than allowing "mixed content" or using unrestricted content models like ANY. When a startElement() call needs to indicate the end of some text, your code can get complicated.
Remember that if you're using DTDs, you'll likely get some calls to ignorableWhitespace() to report characters in "element content" models. I usually like to just discard all such characters, since they're known to be semantically meaningless. But sometimes that's not an option, and the solution is instead to call characters() with the ignorable whitespace characters. The parameters are the same; you don't even need to reorder them.
    public void ignorableWhitespace (char buf [], int offset, int length)
    throws SAXException
    {
        characters (buf, offset, length);
    }
If you used only element content models and text-only content models, it'd be easy to get all the useful text from a valid XML document. It would be the content of "data elements" that you'd get when endElement() is called or in attribute values from startElement(). The rest would be ignorable whitespace, which you'd ignore.
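To make the "data element" case concrete, the following sketch buffers text and collects it in endElement(). The DataElementCollector class, its collect() helper, and the sample document are assumptions made for illustration. It uses StringBuilder, the unsynchronized sibling of StringBuffer, whose append() has the same matching signature, and it assumes the parser reports a qName for each element, as the JDK's bundled parser does.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: maps each text-only element name to its text.
class DataElementCollector extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Text seen before a child element is not part of a data element.
        chars.setLength(0);
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // However the parser splits the text, it all lands in one buffer.
        chars.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = chars.toString();
        chars.setLength(0);
        // Skip elements whose remaining text is only whitespace.
        if (!text.trim().isEmpty()) {
            values.put(qName, text);
        }
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static Map<String, String> collect(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        DataElementCollector handler = new DataElementCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        // A character reference and a CDATA section split the text in the source.
        String xml = "<poses>\n  <asana>Vrichikasana &#x2014; <![CDATA[Scorpion]]></asana>\n</poses>";
        System.out.println(collect(xml));
    }
}
```

The character reference and the CDATA section boundary disappear from the collected value, which is the point: the handler sees one string no matter how many callbacks delivered it.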
3. Use XMLReaderFactory for Bootstrapping
Don't hardwire your code to use a particular SAX2 parser or to rely on features of a particular parser. Good SAX-based systems build almost everything as layers over the parser rather than using nonstandard features. In fact, the best way to bootstrap a SAX2 parser hides what parser you're using: it's a simple call to a helper class:
XMLReader parser = XMLReaderFactory.createXMLReader ();
That gives you the "system default" parser. Which parser is that? You can control that. The most reliable way is to specify the parser name on the command line, using the org.xml.sax.driver system property and the name of your parser to establish a particular JVM-wide default. You can do it like this:
java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader MyMainClass arg ...
Some current SAX2 distributions (SAX2 r2pre3 at this writing but not JDK 1.4) include easier ways to control the SAX2 default. One way is through a system resource that's accessed through your class loader: the META-INF/services/org.xml.sax.driver resource. That's sensitive to your class loader configuration; in some cases that may be a feature. Such recent distributions also expect redistributions (from parser suppliers) to include a compiled-in "last gasp" default, which handles the case where none of the other configuration mechanisms have been set up.
The following table gives the names of some widely used SAX2 parsers. You should avoid hardwiring such names into your source code; instead use the parser configuration mechanisms to keep your code free of parser dependencies. All of these are optionally validating, except the one labeled non-validating, and most do quite well on most XML conformance tests.
Parser | Class Name |
---|---|
Ælfred2 | gnu.xml.aelfred2.SAXDriver (non-validating), or else gnu.xml.aelfred2.XmlReader |
Crimson | org.apache.crimson.parser.XmlReaderImpl |
Oracle | oracle.xml.parser.v2.SAXParser |
Xerces Java | org.apache.xerces.parsers.SAXParser |
If you're still using a SAX1 parser, and setting the org.xml.sax.parser system property to point to that parser, the XMLReaderFactory will fall back to that class if it can't find a native SAX2 parser implementation. You should probably upgrade to a more current implementation, but meanwhile you can continue to use your old one. It will be automagically wrapped in a ParserAdapter by the SAX2 factory.
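If you're unsure which parser the factory picked in a given environment, a small check like the one below can help. The WhichParser class name is hypothetical, and the class name it prints depends entirely on your JVM and its configuration, so no particular output should be expected.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: reports which SAX2 parser the factory selects.
class WhichParser {
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static XMLReader systemDefault() throws SAXException {
        // No parser class name appears anywhere in this code.
        return XMLReaderFactory.createXMLReader();
    }

    public static void main(String[] args) throws SAXException {
        // The property is null unless it was set, for example with -D on the command line.
        System.out.println("org.xml.sax.driver = " + System.getProperty("org.xml.sax.driver"));
        System.out.println("parser class       = " + systemDefault().getClass().getName());
    }
}
```

Running it once with and once without the -Dorg.xml.sax.driver option is a quick way to confirm that your configuration is being picked up.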
4. Check for empty Namespace URI Strings
Namespaces have caused a lot of grief for XML developers. At first the use of namespace URIs as purely abstract identifiers caused the confusion, since they looked like URLs that would be used to fetch something (but nobody knew what). But it didn't stop there. Even today reasonable people (along with the applications and tools that they build) have very different perspectives on what it means to be in a particular namespace. It seems to be a rare month in which significant misunderstandings don't crop up in some area of namespace handling.
There's only one basic thing that programmers can do with any namespace URI: compare it to another one as a string. But not every name in an XML document has a namespace URI, and names in namespaces need to be handled differently from names that aren't in a namespace. (You can rely on either the qName or the localName to have a value, but not both. Either name will be an empty string in some cases.) You might be tempted to write code that assumes every XML element or attribute name is in a namespace, but that just doesn't match real world data. One day you'll get a document that's not quite as clean as you expect, and your code will break.
Which means that when you're writing SAX2 code to look at element or attribute names, you have to figure out whether there's even a namespace name. When there isn't, the namespace URI is always passed as an empty string. Once you know which kind of name you're working with, you can figure out how to handle the element or attribute in question. Inline code to do name-based dispatching should look something like the following (for elements); notice that it doesn't even know there's such a thing as a namespace prefix:
    public void startElement (
        String uri, String localName, String qName, Attributes atts
    ) throws SAXException
    {
        // Handle elements not in any namespace
        if ("".equals (uri)) {
            // these only have "qName"
            if ("dolce".equals (qName)) {
                // ... handle "dolce"
            } else if ("vita".equals (qName)) {
                // ... handle "vita"
            // ... and all other supported "no namespace" elements
            } else
                error ("unrecognized element name: " + qName);

        // Then handle each supported namespace separately
        } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
            // these have a "localName" with no prefix
            if ("free".equals (localName)) {
                // ... handle "free"
            } else if ("open".equals (localName)) {
                // ... handle "open"
            // ... and all other supported NS1 elements
            } else
                error ("unrecognized NS1 element name: " + localName);

        // ... and similarly for all other supported element namespaces
        } else
            error ("unrecognized element namespace: " + uri);
    }
Attributes might not need that kind of handling. Applications often "know about" particular attributes, access them by name, and just ignore any unrecognized attributes. If you're accessing attribute values in that way, just make sure you use the right naming convention, either Attributes.getValue(uri,local) or Attributes.getValue(qName), and you should have no problems.
Otherwise you'll be scanning all of an element's attributes. You'll need to check whether each attribute is in a namespace, just like you checked whether its element was in a namespace. If it's not in a namespace, you probably know a bit more about the attribute than in the case of an element that's not in a namespace. It's either going to be associated with that element's type or, if you've enabled reporting of namespace prefixes, it'll be a namespace declaration. (That's required by the Namespaces in XML specification, but DOM and the XML Infoset have chosen instead to put such declarations into a namespace.) Your code might look something like this:
    Attributes atts = ...;
    int length = atts.getLength ();

    for (int i = 0; i < length; i++) {
        String uri = atts.getURI (i);

        if ("".equals (uri)) {
            String qName = atts.getQName (i);
            // ... then dispatch based on qName
            // including error based on unrecognized name
            // "xmlns" and "xmlns:*" declarations would appear here
        } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
            String localName = atts.getLocalName (i);
            // ... then dispatch based on "localName"
            // including error based on unrecognized name

        // ... and similarly for all other supported attribute namespaces
        } else
            error ("unrecognized attribute namespace: " + uri);
    }
If your code uses idioms like those shown above, it'll be handling namespaces correctly. Otherwise, you're likely to run into a document or parser that confuses your code. Don't try to ignore namespaces completely. If your code wants a simpler "pre-namespaces" view of the world, at least make sure the namespace URI is always empty and report errors for all elements and attributes where that's not true.
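The element idiom can be tried out with a small runnable sketch like this one. The NamespaceDispatcher class, its classify() helper, and the sample document are illustrative assumptions. The namespaces feature is switched on explicitly so the result doesn't depend on a parser's default, and the qName branch assumes the parser reports qNames, as the JDK's bundled parser does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: dispatches on namespace URI first, never on prefix.
class NamespaceDispatcher extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("".equals(uri)) {
            // Not in any namespace: only the qName is meaningful.
            seen.add("no-namespace:" + qName);
        } else if (NS1.equals(uri)) {
            // In a known namespace: use the localName, whatever the prefix was.
            seen.add("ns1:" + localName);
        } else {
            throw new SAXException("unrecognized element namespace: " + uri);
        }
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> classify(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        NamespaceDispatcher handler = new NamespaceDispatcher();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        // Two different prefixes are bound to the same namespace URI.
        String xml = "<dolce xmlns:a='" + NS1 + "'><a:free/>"
                   + "<b:open xmlns:b='" + NS1 + "'/></dolce>";
        System.out.println(classify(xml));
    }
}
```

Both prefixed elements are classified by their URI and local name alone, so a document author's choice of prefix can't confuse the handler.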
5. Provide an ErrorHandler, especially when you're validating
If you've never set up a parser to validate, like
parser.setFeature ( "http://xml.org/sax/features/validation", true);
and then been surprised when you didn't get any error reports, congratulations! You're in the minority. Most developers have forgotten (and usually more than once) that by default, validity errors are ignored by SAX parsers. You need to call XMLReader.setErrorHandler() with some useful error handler to make validity errors have any effect at all. That handler needs to do something interesting with the validity errors reported to it using error() calls.
It's worth having a good utility class that you can reuse and reconfigure to handle this particular situation. It'll be handy even when you're not validating. Such a class might look like this:
    class MyErrorHandler implements ErrorHandler {
        private void print (String label, SAXParseException e)
        {
            System.err.println ("** " + label + ": " + e.getMessage ());
            System.err.println ("   URI  = " + e.getSystemId ());
            System.err.println ("   line = " + e.getLineNumber ());
        }

        boolean reportErrors = true;
        boolean abortOnError = false;

        // for recoverable errors, like validity problems
        public void error (SAXParseException e)
        throws SAXException
        {
            if (reportErrors)
                print ("error", e);
            if (abortOnError)
                throw e;
        }

        // ... plus similar for fatalError(), warning()
        // ... and maybe more interesting configuration support
    }
A SAX ErrorHandler should know two policies for each of its three fault classes: whether to report such faults, and whether such faults should terminate parsing. Various mechanisms can be used to report the fault, such as logging, adding text to a Swing message window, or just printing. At this time, SAX doesn't support portable mechanisms to identify particular failure modes, so you can't really consider "why did it fail?" in the handler.
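To see validity errors actually reach a handler, consider this small sketch. The CountingErrorHandler class and the sample documents are assumptions for illustration, and the example presumes the default parser is a validating one, as the parser bundled with current JDKs is. It counts faults rather than printing them, which is just one possible reporting policy.

```java
import java.io.StringReader;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: counts faults and can abort on recoverable errors.
class CountingErrorHandler implements ErrorHandler {
    int warnings;
    int errors;
    boolean abortOnError = false;

    @Override
    public void warning(SAXParseException e) {
        warnings++;
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        // Validity problems arrive here; parsing continues unless we throw.
        errors++;
        if (abortOnError) {
            throw e;
        }
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness errors: no point trying to continue.
        throw e;
    }

    // Parses with validation enabled and returns the number of error() calls.
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static int countValidityErrors(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true);
        CountingErrorHandler handler = new CountingErrorHandler();
        parser.setErrorHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.errors;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println("valid document:   " + countValidityErrors(dtd + "<a/>"));
        System.out.println("invalid document: " + countValidityErrors(dtd + "<a>text</a>"));
    }
}
```

Without the setErrorHandler() call, the second document would parse without the application ever learning that it was invalid.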
6. Share your ErrorHandler between the XMLReader and your own Handlers
When your application uses the same ErrorHandler in its own handlers and for the parser, it creates an integrated stream of fault information. That's useful in its own right, but the best part is that all the errors (and warnings) can then be handled according to the same policy and mechanism. You can easily change how faults are handled by switching or reconfiguring that ErrorHandler object. In most cases, the SAX fault classifications are fine, since having more than fatalError(), error(), and warning() will rarely be helpful. Here's how you might set this up for a simple handler:
    public class MyHandler implements ContentHandler {
        // doesn't matter if this stays as null, since
        // SAXParseException constructors don't care
        private Locator locator;

        public void setDocumentLocator (Locator l)
            { locator = l; }

        // application and SAX errors should use the same handler,
        private ErrorHandler eh = new DefaultHandler ();

        public void setErrorHandler (ErrorHandler e)
            { eh = (e == null) ? new DefaultHandler () : e; }

        // simpler is usually better ...
        public final void error (String message)
        throws SAXException
            { eh.error (new SAXParseException (message, locator)); }

        public final void warning (String message)
        throws SAXException
            { eh.warning (new SAXParseException (message, locator)); }

        public final void fatalError (String message)
        throws SAXException
        {
            SAXParseException e = new SAXParseException (message, locator);
            eh.fatalError (e);
            // in case eh tries to continue: we can't, and won't
            throw e;
        }

        // the real application code would use error(String) and friends
        // to report errors, something like this:
        public void endElement (String uri, String localName, String qName)
        throws SAXException
        {
            // ... branch to figure out which element's processing to do
            if (!processData (getCharacters ())) {
                error ("bad '" + localName + "' content");
                // recover from it (clean up state) and
                return;
            }
            // ... now repackage and save all the object's state
        }

        // ... lots more code
    }
Then you should initialize both the XMLReader and your content handlers (including any that process DTD content) to use the same ErrorHandler. The SAX ErrorHandler interface is flexible enough to use as a general error handling policy interface in much of your XML code. In fact, you may have noticed that the javax.xml.parsers.DocumentBuilder class uses one to simplify error reporting when building a DOM Document.
If you want, your application can subclass SAXParseException to provide some application-specific exception information, which might be understood by that error handler. It might use information about what happened to make more enlightened decisions about how to handle the problem.
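The wiring itself is short. In the sketch below, one collector receives both the parser's validity errors and an application-level error. The SharedErrors, Collector, and VaultHandler names, the "number" attribute rule, and the sample document are all invented for illustration, and the example presumes a validating default parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example: parser faults and application faults share one handler.
class SharedErrors {
    // Collects every fault, whoever reported it.
    static class Collector implements ErrorHandler {
        final List<String> messages = new ArrayList<>();
        public void warning(SAXParseException e) { messages.add("warning: " + e.getMessage()); }
        public void error(SAXParseException e) { messages.add("error: " + e.getMessage()); }
        public void fatalError(SAXParseException e) throws SAXException {
            messages.add("fatal: " + e.getMessage());
            throw e;
        }
    }

    // Application handler: insists that a "number" attribute be all digits.
    static class VaultHandler extends DefaultHandler {
        private final ErrorHandler eh;
        private Locator locator;

        VaultHandler(ErrorHandler eh) { this.eh = eh; }

        @Override
        public void setDocumentLocator(Locator l) { locator = l; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            String number = atts.getValue("number");
            if (number != null && !number.matches("\\d+")) {
                // Reported through the same channel the parser uses.
                eh.error(new SAXParseException("bad vault number: " + number, locator));
            }
        }
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> check(String xml) throws Exception {
        Collector shared = new Collector();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true);
        parser.setErrorHandler(shared);                     // parser faults
        parser.setContentHandler(new VaultHandler(shared)); // application faults
        parser.parse(new InputSource(new StringReader(xml)));
        return shared.messages;
    }

    public static void main(String[] args) throws Exception {
        // "number" is undeclared (a validity error) and not numeric (an application error).
        String xml = "<!DOCTYPE vault [<!ELEMENT vault EMPTY>]><vault number='abc'/>";
        for (String message : check(xml)) {
            System.out.println(message);
        }
    }
}
```

Swapping Collector for a handler that logs or aborts changes the policy for both kinds of fault at once, with no change to VaultHandler.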
7. Track Context with a Stack
Once developers get past the initial milestone of learning how SAX parser callbacks map to the input text, the next step is to figure out how to turn such a stream of callbacks into application data. Certainly SAX is low overhead, and no other API is likely to get less in the way. At the same time, SAX is not exactly going out of its way to package things neatly. It's the very fact that SAX doesn't pick data structures for you that makes it so powerful. That can take getting used to, particularly if you're used to thinking in terms of structures that someone else designed.
A good place to start is to make a ContentHandler implementation that keeps important information in a stack. For example, you could define a class that records an element name (with its namespace, if any) and uses the AttributesImpl class to snapshot its associated attributes. If you create those entries in startElement() and stack them, any callback could use that information before endElement() popped the stack. Certain attributes, including xml:base, xml:lang, and xml:space, are in a sense "inherited", and you might need to walk up that stack to find such a value while processing other event callbacks.
Such stack entries are also convenient places to collect application-specific information about an element's children. For example, you might be unmarshaling a series of data elements, converting them from strings into more specialized data types as you parse. You'd store those converted values in members of that special stack entry, reporting application level errors when they're detected. Periodically you could transform such entries (or subtrees of entries) into custom data structures that might no longer reflect the way XML text happened to encode that data.
Of course if you track every data item that comes in through SAX, you're starting down a well trodden path. There are plenty of APIs that do that, optimized for one model or another but likely not for your particular application. Still, it can be good fun and useful to build up SAX infrastructure for your application that way.
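A bare-bones version of such a stack might look like the following sketch, which resolves the "inherited" xml:lang value by walking from the innermost element outward. The LangTracker and Entry names and the sample document are hypothetical, and a real entry class would usually carry the namespace URI and application data as well.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: tracks element context in a stack.
class LangTracker extends DefaultHandler {
    // One stack entry per open element.
    static class Entry {
        final String qName;
        final Attributes atts;
        Entry(String qName, Attributes atts) {
            this.qName = qName;
            // Snapshot: the parser may reuse its Attributes object after the callback.
            this.atts = new AttributesImpl(atts);
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    final List<String> report = new ArrayList<>();

    // Walks from the innermost open element outward; null if never specified.
    String inherited(String attributeQName) {
        for (Entry entry : stack) {
            String value = entry.atts.getValue(attributeQName);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new Entry(qName, atts));
        report.add(qName + "=" + inherited("xml:lang"));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static List<String> track(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        LangTracker handler = new LangTracker();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.report;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(track(
            "<doc xml:lang='en'><p>hi</p><p xml:lang='sa'><b>x</b></p></doc>"));
    }
}
```

The same inherited() lookup works from characters() or any other callback, since the stack always describes the elements that are currently open.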
8. Use an InputSource to wrap in-memory data
New SAX programmers often end up with some data in memory, perhaps in a string or other data buffer, that needs to be parsed as XML. (Maybe it came from a database or was built by some other program component.) It's easy to use SAX to parse these, since the java.io package provides classes that let you create character streams from character data. You can use CharArrayReader to read from arrays of characters, or StringReader as shown here when the data starts as a string:
    Reader reader;
    InputSource in;
    XMLReader parser;

    // the name contains an apostrophe, so it needs double quotes
    reader = new StringReader ("<bank name=\"Gringott's\" box='713'/>");
    in = new InputSource (reader);
    parser = XMLReaderFactory.createXMLReader ();
    parser.parse (in);
You can do similar things with byte arrays, using the ByteArrayInputStream class to create a byte stream, but in that case you've got to be careful about character encoding issues. It's best if those bytes are UTF-8 encoded XML data.
Such input sources can be used as direct parser inputs (as shown here) or, if you're using DTDs and entities defined in them, through an EntityResolver.
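For the byte array case, a sketch along these lines shows one way to be explicit about the encoding. The InMemoryBytes class, its readName() helper, and the sample element are made up for illustration. Calling setEncoding() is only a hint for byte streams that don't declare their own encoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: parses XML held in a byte array.
class InMemoryBytes {
    // Returns the "name" attribute of the first element that has one.
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String readName(byte[] utf8Xml) throws Exception {
        InputSource in = new InputSource(new ByteArrayInputStream(utf8Xml));
        // Say how the bytes are encoded instead of leaving it to detection.
        in.setEncoding("UTF-8");

        final String[] name = new String[1];
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if (name[0] == null) {
                    name[0] = atts.getValue("name");
                }
            }
        });
        parser.parse(in);
        return name[0];
    }

    public static void main(String[] args) throws Exception {
        // The non-ASCII character only survives if the encodings agree.
        byte[] bytes = "<shop name='caf\u00e9'/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readName(bytes));
    }
}
```

If the bytes had been produced with a different encoding than the one the parser assumes, the accented character would come back garbled or the parse would fail, which is why UTF-8 end to end is the safest choice.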
9. Manage External Entity Lookups with an EntityResolver
XML uses external entities to support document modularity; they are available if you're using DTDs. When a document references an entity, parsers normally fetch it and parse the result. That's exactly what you need in most cases, but it causes problems when the server hosting that URL goes offline for a while (or maybe it was your client that wanted to be disconnected?), and when the network is unreliable. Your whole application could become unavailable, just because it's trying to get a resource that can't be gotten.
How can you avoid entity access problems? SAX2 gives you two basic controls over entity processing.
First, two SAX2 feature flags control whether external entities are ever fetched. One affects parameter entities (like %module;) which are used inside the DTD. The other affects general entities (like &data;) in the body of the document. Most SAX parsers don't let you turn off this fetching, but if you're using one which does, this may be a fine solution. (The current Ælfred2 release supports this, but I don't know another SAX2 parser that does.) So you may not be able to use this facility.
Second, you can use an EntityResolver to control how entities are resolved. Whenever a SAX parser needs to access an external entity, it will ask the resolveEntity() method on your resolver how to handle that entity. That method sees the entity's fully resolved URI and, if it had one, its public ID. (A new SAX extension is in the works to provide more information, but it's not widely supported yet.)
Some interesting things for that method to do include:
- Map public IDs to local file names. That's what public IDs were designed for, and hashtables were designed for such mappings. Strongly encouraged! You can do the same thing for system IDs. (There are also "catalog" systems to help manage such mappings. You may want to use a resolver that knows how to use one.)
- Fetch or compute the data, maybe using a database. If you're using a private URI scheme that your JVM doesn't understand, maybe blob:database-name:database-key, you'll probably want to store those in the public IDs and do the URI resolution yourself.
- Construct an empty input source and return that. This is safe to do for general entities, after the first startElement(), and a bit dangerous for parameter entities, but you may be better off trying to skip some remote entities than trying to access them. (The issue with handling parameter entities this way is that the parser won't know it didn't see their declarations, and so it won't behave correctly.)
A simple entity resolver might look like this for an application that's really paranoid about preventing access to all entities it doesn't control. If you were using it, you'd probably preload the hashtable with entries for all of your application's entities. And you'd probably apply intelligence about what requests are really unsafe or your customers would get unhappy. For example, maybe string prefix matches would be used to grant access to certain files inside the firewall (or its DMZ), and only the ones outside that security boundary would be airbrushed out of the picture.
    class MyResolver implements EntityResolver {
        private Hashtable publics, systems;

        MyResolver (Hashtable pub, Hashtable sys)
            { publics = pub; systems = sys; }

        public InputSource resolveEntity (String publicId, String systemId)
        throws IOException, SAXException
        {
            InputSource retval = null;

            if (publicId != null) {
                String value = (String) publics.get (publicId);
                if (value != null) {
                    // use new system ID and original public ID
                    retval = new InputSource (value);
                    retval.setPublicId (publicId);
                }
            }
            if (retval == null) {
                String value = (String) systems.get (systemId);
                if (value != null) {
                    // use new system ID and original public ID
                    retval = new InputSource (value);
                    retval.setPublicId (publicId);
                }
            }
            if (retval == null) {
                // we're sooo paranoid here!!
                System.err.println ("RESOLVER: punt " + systemId
                    + " " + (publicId == null ? "" : publicId));
                retval = new InputSource (new StringReader (""));
                retval.setSystemId (systemId);
                retval.setPublicId (publicId);
            }
            // if we returned null, the systemId would
            // be dereferenced using standard URL handling.
            return retval;
        }
    }
A good rule of thumb is always to use a resolver for any application that reuses a known set of DTDs. Do it, if for no other reason than to avoid accessing the network when you don't need to. Only mission critical servers would likely want to be as paranoid as shown above.
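As a self-contained variation on the same idea, the sketch below serves a DTD from memory so that parsing never touches the network. The LocalDtdResolver class, the example.invalid URL, and the tiny DTD are assumptions for illustration. A real application would more likely map IDs to local files, as described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: answers entity requests from an in-memory table.
class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> bySystemId = new HashMap<>();
    final List<String> requests = new ArrayList<>();

    void register(String systemId, String entityText) {
        bySystemId.put(systemId, entityText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        requests.add(systemId);
        String text = bySystemId.get(systemId);
        // Unknown entities get empty content instead of a network fetch.
        InputSource in = new InputSource(new StringReader(text == null ? "" : text));
        in.setSystemId(systemId);
        in.setPublicId(publicId);
        return in;
    }

    // Parses a document whose DTD "lives" at a URL that cannot be fetched.
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String parseGreeting(LocalDtdResolver resolver) throws Exception {
        resolver.register("http://example.invalid/greeting.dtd",
            "<!ELEMENT greeting (#PCDATA)><!ENTITY who 'world'>");
        String xml = "<!DOCTYPE greeting SYSTEM 'http://example.invalid/greeting.dtd'>"
                   + "<greeting>hello &who;</greeting>";

        final StringBuilder text = new StringBuilder();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setEntityResolver(resolver);
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int length) {
                text.append(buf, offset, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        LocalDtdResolver resolver = new LocalDtdResolver();
        System.out.println(parseGreeting(resolver));
        System.out.println("requests: " + resolver.requests);
    }
}
```

Because the resolver always returns an input source, the parser has no reason to open a connection, and the entity declared in the DTD still expands normally.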
10. Use a Pipelined Processing Model
SAX is made for streaming processing, and the best way to stream your processing is to connect a series of processing components into an event pipeline. One component produces events, the next consumes them and produces new (or maybe filtered) events for yet another component to consume. Often, both your CPU and I/O subsystems can be working on different parts of the pipeline at the same time, minimizing elapsed time.
SAX parsers produce events, but they're not the only way to produce a stream of SAX events. One common practice is to have programs call the SAX event methods directly, perhaps while walking over a data structure as part of converting it to XML. SAX2 defines a way to make a SAX parser that walks a DOM tree, rather than XML text, emitting a stream of SAX events. And toolsets like DOM4J and JDOM haven't neglected such data-to-SAX converters, either. Think of that SAX event stream as an efficient in-memory version of the generic transfer syntax which XML provides between different processes.
Your "ultimate consumer" in a SAX event pipeline could write XML text out (use one of the various XMLWriter classes) or turn the events into an application-optimized data structure. It's easy to build a DOM (or DOM4J, or JDOM) model from a modified SAX event stream, too. And since you have control over what happens, you don't have to build the entire generic tree structure before you begin processing it; if you do it that way, you can garbage collect each chunk of data as soon as you're done processing it, rather than waiting for the whole document to materialize in memory.
If you're using XSLT in Java, you may well be familiar with the javax.xml.transform.sax (TRAX) package. XSLT engines such as SAXON or Xalan support it. You may not know that it's easy to feed SAX events as inputs to an XSLT engine as a SAX pipeline stage, using a TransformerHandler, or to collect XSLT engine output as SAX events using a SAXResult. SAX events in, transformation according to XSLT, and then SAX events out again: those TRAX APIs are essentially wrappers around SAX pipeline stages! It can be very worthwhile to unwrap them and use XSLT for some heavier weight transformations in your SAX pipelines.
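Here is a hedged sketch of an XSLT stage sitting in the middle of a SAX pipeline: parser events go into a TransformerHandler, and the transformation's output arrives as SAX events through a SAXResult. The XsltStage class, the stylesheet, and the sample document are invented for this example. It assumes the JVM's TransformerFactory is a SAXTransformerFactory, which holds for the engine bundled with the JDK but should be checked for others.

```java
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: SAX events in, XSLT, SAX events out.
class XsltStage {
    // Replaces a <poses> document with a count of its <asana> children.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/poses'>"
      + "<count><xsl:value-of select='count(asana)'/></count>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String run(String xml) throws Exception {
        // Assumption: the default factory supports the SAX extensions.
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler stage =
            factory.newTransformerHandler(new StreamSource(new StringReader(XSLT)));

        // Final consumer: turns the output events back into text for display.
        final StringBuilder out = new StringBuilder();
        stage.setResult(new SAXResult(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] buf, int offset, int length) {
                out.append(buf, offset, length);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
        }));

        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        parser.setContentHandler(stage); // the XSLT engine is just another handler
        parser.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<poses><asana>a</asana><asana>b</asana></poses>"));
    }
}
```

The final consumer here could just as well be another pipeline stage, since it only needs to implement ContentHandler.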
I could go on about pipelines, but I'll just mention that SAX2 includes an XMLFilterImpl class, handy for writing some kinds of intermediate pipeline stages, and stop. Pipelines are covered in more detail in that new book that I mentioned. The main thing to remember is that event pipelines are the natural model for components in SAX. You should plan to use them if you're doing anything very substantial.
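As one small illustration of an intermediate stage, this sketch subclasses XMLFilterImpl to drop a named element and everything inside it before the events reach the consumer. The DropElementFilter class, the element names, and the sample document are assumptions, and matching is done on qName only to keep the example short.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical pipeline stage: removes one kind of element, with its content.
class DropElementFilter extends XMLFilterImpl {
    private final String droppedQName;
    private int depth = 0; // greater than zero while inside a dropped element

    DropElementFilter(XMLReader parent, String droppedQName) {
        super(parent);
        this.droppedQName = droppedQName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || droppedQName.equals(qName)) {
            depth++;
            return; // swallow the event
        }
        super.startElement(uri, localName, qName, atts); // pass it downstream
    }

    @Override
    public void characters(char[] buf, int offset, int length) throws SAXException {
        if (depth == 0) {
            super.characters(buf, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    // Builds parser -> filter -> consumer and returns the text the consumer saw.
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String textWithout(String droppedQName, String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DropElementFilter filter =
            new DropElementFilter(XMLReaderFactory.createXMLReader(), droppedQName);
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int length) {
                text.append(buf, offset, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textWithout("draft",
            "<notes><keep>a</keep><draft>b<keep>c</keep></draft><keep>d</keep></notes>"));
    }
}
```

Since a filter is itself an XMLReader, filters can be stacked on one another, and the consumer never needs to know how many stages sit between it and the parser.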
If you've read this far, you deserve a special bonus tip. SAX has its own site, http://www.saxproject.org. Visit the site for the latest information and updated documentation about SAX.
David Brownell, author of SAX2, is a software engineer. He recently worked for three years at JavaSoft, where he provided Sun's XML and DOM software, SSL and public key technologies, the original version of the JavaServer Pages technology, and worked on the Java Servlet API for Web servers. O'Reilly & Associates will soon release (January 2002) SAX2.