An Introduction to XML Processing with Lark

October 2, 1997

Tim Bray

An Introduction to XML Processing with Lark

Tim Bray

Abstract

1. Why Lark?

1.1 Motivations

Lark's creation was driven by the following motivations:

Personal gratification

Writing language processors is fun, particularly when you have the chance to fix the language.

Desire to learn Java

It's about time, and Java seems like the appropriate language for the job.

Test compliance with design goals

In particular, XML Design Principle #4 [1], which states that "It shall be easy to write programs which process XML documents."

Expore the API design space

There is a chance, while XML is young, to make some real progress in the design of processor APIs. The design of Lark makes very few assumptions about the user interface; thus Lark should be useful as an experimental testbed in this area.

$$$

Perhaps Lark will turn out to be useful. I have not the slightest desire to start another software company (been there, done that, got the T-shirts), but it would be nice to figure out a way to get paid for the time I've put in writing it.

1.2 Conclusions

Yes, writing Lark was fun. In particular, none of the innocent-looking things in XML turned out, in practice, to be too horribly difficult.

As for Java: <OldFart>well, maybe there's something to this "Object-Oriented" fad.</OldFart> But it ain't fast.

On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in Perl (don't know about Java) in a day or so. On the other hand, if you want to build a general purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed several weeks. I started sketching out Lark at SGML '96 sometime around the 20th of November, and it took into the first week of January. Mind you, in the interim I returned from Boston, bought a house, traveled to Australia to get married and to Saskatchewan for Christmas, and did some other work. As 1997 has worn on, I have put in a day here and there patching missing pieces and tracking changes in the XML spec.

I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don't mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.

As for the financial aspects, I'm kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML's simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I'm not sure that there's all that much need for a standalone processor; but I'd love to be wrong.

Just in case, for the moment I'm going to be giving away all the .class files, and some of the Java source code, but not the source code for the two classes with the hard bits. In any case, they're sure to be buggy at this stage, and I wouldn't want to be letting them out of my hands without a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.

2. Lark Feature Set Overview

2.1 Compactness

Since an XML processor is often going to run on the client and presumably would need to be delivered over the network, it must be compact. At the moment, the total byte count is just just over 40K, which is not too bad. Example 1 shows a snapshot of the .class files.

There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g., a usable symbol table and better Unicode support.

2.2 Performance

At the moment, Lark, running under the Win95 J++ "Jview" application viewer on an underconfigured P100 notebook, runs at about 20K/second. I regard this as totally unacceptable. There are all sorts of glaring inefficiencies in the current implementation, and as soon as I track down a Java profiler I think I can promise much better performance. An XML processor is, after all, not doing very much.

2.3 Completeness

Lark is a processor only; it does not attempt to validate. It does read the internal subset of the DTD; it processes attribute list declarations (to find default values) and entity declarations. Lark does process parameter entities in the internal subset, but only recognizes them at places that a complete markup declaration could start. Lark's internationalization is incomplete; it reads UCS-2, UTF-16, and ASCII, but not UTF-8 (making use of the Byte Order Marks and Encoding Declarations in the appropriate fashion). Aside from that, Lark is relatively full-featured; it implements (I think) everything in the XML spec, and reports violations of well-formedness.

Lark's error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what's going on), and calling doSyntaxError application callback (see below) to report further such errors.

Example 1

-rwxrwxrwa 810 Jun 26 15:07 Attribute.class

-rwxrwxrwa 1756 Jun 26 15:07 Element.class

-rwxrwxrwa 1022 Jun 26 15:07 Entity.class

-rwxrwxrwa 1698 Jun 26 15:07 Handler.class

-rwxrwxrwa 32774 Jun 26 15:07 Lark.class

-rwxrwxrwa 1686 Jun 26 15:07 Namer.class

-rwxrwxrwa 1934 Jun 26 15:07 Segment.class

-rwxrwxrwa 855 Jun 26 15:07 Text.class

-rwxrwxrwa 997 Jun 26 15:07 XmlInputStream.class

2.4 API

Lark presents as a set of Java classes. Those named Element, Attribute, and Entity are obvious in their function. One lesson of this activity is that it may be possible for such classes to be shared independent of the parser architecture; it would be very handy if all XML-processing Java apps used the same Element class, at least as a basis for subclassing.

Those named Lark and Namer have all the "tricky bits" of parsing and text management logic; they are black boxes for the moment.

The class XmlInputStream is a place to put logic to get access to streams of Unicode characters in various encodings. It is to be hoped that it can soon be replaced by equivalent functions in the Java base libraries.

The Text and Segment classes do Lark's character data management; details are below.

From an application's point of view, the Lark and Handler classes are central. Handler has methods such as doPI, doSTag, doEntityReference, doEtag; the application passes a Handler instance to Lark's readXML method, and Lark calls the routines as it recognizes significant objects in the XML document. The base class provides do-nothing methods; the intent is that an application would subclass Handler to provide the appropriate processing semantics.

Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document's character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.

In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e., for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.

An instance of Lark may be initialized with an optional list of element GIs which are to be considered as those of empty elements, whether or not the XML /> syntax is used. A typical set might begin: "HR", "BR", "IMG", ....

Another initializer allows a set of entities to be predefined.

Lark also provides the application access to the "entity tree"; there is a method that toggles whether Lark attempts to retrieve and parse external entities.

2.5 Error Handling

Lark makes a serious effort to be robust, by providing useful error messages and continuing to do so after the first error. I tested against a selection of HTML files, most notably WD-xml.html, which was produced by my own Perl scripts, and was horrified to find lots of unquoted attribute values and a hideous number of nesting errors, mostly in the HTML tables used to pretty-print the BNF productions. The error handling is good enough that I now use Lark as my primary tool to debug broken XML files.

2.6 Text Segment Management

Probably due to my background in the indexing and search business, Lark pays more attention than is perhaps strictly necessary to the location of objects within the XML file. Lark informs the application of the containing entity and begin/end offsets of each element. A chunk of character data in an element, which may look contiguous to an application, may in fact map to several different byte ranges in different entities, due to the effect of entity references. Also, the use of encodings such as UTF8 may cause the number of bytes of underlying document to differ from the number of bytes that makes up the characters in a chunk of text. Lark represents Text objects as a vector of Segment objects, each of which gives information about the source offset and length, and the number of characters in the segment. If text-saving has not been turned on, the segments contain no character data, but still contain the offset information.

2.7 Entity Handling

Lark does full XML entity handling, and Java provides facilities which make this trivially easy. The application can turn the inclusion of external text entities on and off. Lark makes no use of PUBLIC identifiers, aside from passing them to the Handler in the callback upon recognizing the declaration.

2.8 Concurrency

Lark is thread-safe. Multiple Larks can run in parallel threads, with other threads doing useful processing of under-construction document trees.

3. Lark: Constructors, Methods,
and the Handler

3.1 The Handler Class

The Handler class is just a package of methods that act as callback routines for Lark. No arguments are used in constructing a handler.

Handler has a method named element, which the internal logic uses to construct a new Element object every time this is required (rather than calling new Element() directly). The idea is that this can be subclassed to return, rather than an Element object, a custom-built instance of a subclass of Element.

The first argument of all the other Handler methods is an Entity object, which contains information about the location of the object just encountered and the current state of the entity stack. All of the Handler methods return booleans. If any handler method returnstrue, Lark's readXML method returns control to whatever called it. In the base Handler class, all the methods return false by default, except for the doSyntaxError method, which generates a message via System.err.println(). Example 2 summarizes the methods; the names and types of the arguments should make their meaning self-explanatory.

Example 2

 
        // Detected  Processing Instruction
        public boolean doPI(Entity e, String PI)
        
        // Detected a text segment
        public boolean doText(Entity ent, Element el, char[] text, int length) 
        
        // Doctype declaration; version for SYSTEM-only
        public boolean doDoctype(Entity e, String rootID, String externalSubsetID) 
        
        // version for no SYSTEM or PUBLIC keyword
        public boolean doDoctype(Entity e, String rootID)
        
        // Internal (text) entity declaration
        public boolean doInternalEntity(Entity e, String name, char[] value) 
        
        // External text entity declaration; external ID is a URL 
        public boolean doSystemTextEntity(Entity e, String name, String extID) 
        
        // NDATA entity declaration
        public boolean doSystemBinaryEntity(Entity e, String name, String extID, 
        String notation) 
        
        // End-tag detected; the element should be complete
        public boolean doETag(Entity e, Element element)
        
        // Start tag detected; the element will in general be 
        //  incomplete, unless element.empty() is true
        public boolean doSTag(Entity e, Element element)
        
        // Entity reference detected; e is the containing entity, 
        //  not the one that was detected
        public boolean doEntityReference(Entity e, String name)
        
        // Error condition detected; the entity e will have the 
        //  current locational information filled in
        public boolean doSyntaxError(Entity e, String message)

Notes:

For doPI, the PI argument does not include the opening and closing delimiters, <? and ?>.
For doText, the length argument is provided because the characters may not occupy the whole text buffer, which is reusable; thus if the application wishes to save the text, it must copy them out and store them.
If an element contains entity references embedded character data, Lark generates multiple doText calls, one for each text segment thus created. See the discussion of the Text and Segment classes for a fuller explanation of the motivation for and usage of text segments.
The doInternalEntity, doSystemTextEntity, and doSystemBinaryEntity methods signal the declaration, not the invocation, of these entities.
If tree building is turned off, the Element object passed by doSTag and doETag is reused; the same one passed in each time. In this scenario, the only function of the Element object is as a container for data fields. When tree building is turned on, each element gets its own object, and they are linked together in the tree.

As noted above, the default Handler class methods all simply return false, except for doSyntaxError, which is as follows:

        public boolean doSyntaxError(Entity e, String message)
        {
         System.err.println("Syntax error at line " + 
         e.lineCount() + ":" + e.lineOffset() + ": " + message); 
         return false; 
        }

3.2 The Lark Class

The Lark class actually does the parsing. It has the following constructors:

 
        public Lark() throws Exception
        public Lark(String[] emptyElementTypes) throws
        Exception
        public Lark(String[] entityNames, String[] entityValues) throws 
        Exception 
        public Lark(String[] emptyElementTypes,
        String[] entityNames, 
        String[] entityValues) throws Exception

emptyElementTypes is a list of GIs which Lark considers, when encountered, to be empty, whether or not they use the trailing

/> XML empty-element syntax. Lark ignores case in GIs, and always reports GIs to the handler in uppercase form.
entityNames and entityValues are arrays which respectively give the names and values of entities that Lark is to consider to have

3.3 Controlling Lark's Behavior

Lark has three methods for toggling behaviors:

        public void buildTree(boolean build)
        public void saveText(boolean save)
        public void processExternalEntities(boolean process)

buildTree instructs Lark whether or not to build a parse tree of elements as they are encountered. The default behavior is to build no tree. The tree is built downward from the element current at the time of invocation. When that element ends, tree-building stops.
saveText instructs Lark whether or not to save, in the parse tree, all the character data as it is encountered. This behavior cannot be turned on unless the tree-building has been.
processExternalEntities instructs Lark whether or not to read and process the contents of external text entities. The default behavior is not to read and process them.

Each of these methods can be used from within the Handler callbacks; in particular, Lark calls the Handler doSTag and doEntityReference methods before deciding whether to build the tree, save the text, or read the entity, so a Handler can control this behavior upon notification of the element or entity.

3.4 Initiating Parsing

Lark's readXML methods are used to initiate parsing. There are three as shown in Example 3.

In each case, the caller must provide a Handler (or more likely an instance of a class subclassed from Handler). Lark can read from an XmlInputStream (subclassed from BufferedInputStream--see below) or from a byte array. Perhaps Lark should also be able to read from a character array.

As noted above, Lark returns from readXML whenever one of the Handler methods returns true; in which case it maintains the parsing state and readXML can be reinvoked. It also returns when it encounters end of input on the document entity.

The Element object returned by readXML is not meaningful if tree-building has not been turned on. If tree-building has been turned on, it is topmost node in the tree built by Lark--this may not be the node for the root element if tree-building was not turned on until later in the tree.

Example 3

 
        // no input stream provided - assume current state is valid 
        public Element readXML(Handler handler)
        throws java.io.IOException
        
        // XmlInputStream provided - assume it's new
        public Element readXML(Handler handler, XmlInputStream input)
        throws java.io.IOException
        
        // byte-buffer provided - make an inputstream and put it on stack 
        public Element readXML(Handler handler, byte[] buffer)
        throws java.io.IOException 
        
        Example 4
        class Attribute
        {
        String mName;
        Text   mValue;
        
        // Constructor
        Attribute(String name, String value)
        
        public String name() { return mName; }
        public void setName(String name) { mName = name; } 
        public String value() { return mValue; }
        public void setValue(String value)  { mValue = value; } 
        }

4. The XmlInputStream Class

XmlInputStream is a subclass of BufferedInputStream. It adds one additional method:

        public int getXmlChar() throws IOException

This is intended to retrieve enough bytes from the BufferedInputStream to constitute one Unicode character, and return it. Obviously, the logic is highly dependent on the input encoding.

XmlInputStream has all the same constructors as BufferedInputStream, plus one extra:

        XmlInputStream(char[] charBuf)

This allows getXmlChar() to work out of a character array. Lark uses this facility to support internal text entities.

It is remarkable that Java's built-in BufferedInputStream has no getCharacter() method; with any luck, this oversight will soon be corrected, allowing XmlInputStream to be retired. Java does have facilities for reading characters from a DataInputStream, but their design makes them unusable by XML (or by any other application I can think of). Or perhaps I just didn't understand the documentation.

5. Elements, Attributes, and Entities

The types in Example 4 are self-explanatory, contain essentially no nontrivial logic, and are designed for general use in XML applications. Attribute contains a name and a value, and trivial methods for getting and setting them.

Element is more complex. Readers are urged to look at the source code, which is heavily commented. A few fields and methods of special interest are excerpted in Example 5.

Example 5

public class Element
        {
        Attribute[] mAttrs;   // Attributes
        String mType;
        
        // Children are a Vector because they might be Elements 
        //  and they might be Text objects
        Vector mChildren = new Vector(); 
        
        // Parent element.  null does not mean this is the root, 
        //  it just means this is where Lark started constructing 
        //  the tree.
        Element mParent;
        
        public String type();
        
        public Attribute[] allAttributes()
        public void setAllAttributes(Attribute[] attributes)
        
        // get & set attribute by name
        public Attribute attribute(String name)
        public void setAttribute(String name, String value)
        
        public Vector children()
        public Element parent()
        }

Note that Lark only creates one String object for each unique element type. This means that the String objects returned by the type() method may be simply compared for equality to establish whether two elements are the same type.

The Entity class's fields are shown in Example 6: it has only trivial get and set methods for these fields.

Note that Entities differ from Elements in one important respect: Lark always constructs the Entity tree.

6. Text Handling and Segmentation

The Text object exists mostly to serve as an abstraction layer in front of objects which present as Strings, but may be stored in a segmented fashion. This supports lazy evaluation, the String being (expensively) constructed only when it is requested. Example 7 shows the complete source code.

The Segment class is much more complex, and tries to provide a means of storing information about character data, and optionally the character data itself, efficiently while keeping track of source offset information. Its implementation, particularly the constructors, is fairly complex, but the methods that an application might use are straightforward:

Example 6

  
        // locational info
        int mOffset;
        int mLineCount;
        int mLineOffset; // how far into this line
        
        // not true for internal entities
        boolean mTextIsReal;
        
        XmlInputStream mInput; // read characters for this entity 
        String mDescription;   // metadata
        Entity mParent;        // entity tree; null means doc entity

        public Entity entity()
        public int sourceOffset() 
        public int sourceLength() 
        public int dataOffset() 
        public int dataLength() 
        public String string()

7. To-Do List

Fix up XmlInputStream so it can read UTF-8 as well as ASCII.
Find a profiler and make Lark faster.
Make an applet-ized version of Lark.
Add an Entity member to the Element class.
If any signs of interest in Lark develop, formalize the test suite and config management.

Example 7

class Text
        {
        // This is a Text object, which can appear in the Children() 
        //  of an element.  The text itself is actually stored in a  
        //  Vector of Segments; this object really only exists to 
        //  provide a nice singular object to sit in the parse 
        //  tree, and to avoid recalculating the String value 
        //  of a multi-segment object.
        Vector mSegments = new Vector();
        String mString;
        
        public void addSegment(Segment segment)
        {
        mSegments.addElement(segment);
        mString = null; // invalidate string
        }
        public Vector segments() { return mSegments; }
        
        
        // construct string only on request
        public String string()
        {
        int i;
        Segment seg;
        if (mString == null)
        {
        mString = new String();
        for (i = 0; i < mSegments.size(); i++) 
        {
        // there must be a better way 
        seg = (Segment) mSegments.elementAt(i);
        mString = mString.concat(seg.string());
        }
        }
        return mString;
        }
        }

8. How to Get Lark

All of the classes, and all the source code except Lark, Namer, XmlInputStream, are at [2].

About the Author

Tim Bray

321-3495 Cambie Street

Vancouver, B.C., Canada V5Z 4R3

tbray@textuality.com

Tim Bray is a Canadian. He entered the software profession in 1981; after on-the-job training from Digital and GTE, he became manager of the New Oxford English Dictionary Project at the University of Waterloo in 1986. He co-founded Open Text Corporation in 1989, and started an independent consulting practice under the name Textuality in 1996. He is a Seybold Fellow, editor of the Gilbane Report, and co-editor of the World Wide Web Consortium's "Extensible Markup Language Specification."