An Introduction to XML Processing with Lark
October 2, 1997
An Introduction to XML Processing with Lark
Tim Bray
Abstract
1. Why Lark?
1.1 Motivations
Lark's creation was driven by the following motivations:
Personal gratification
- Writing language processors is fun, particularly when you have the chance to fix the language.
Desire to learn Java
- It's about time, and Java seems like the appropriate language for the job.
Test compliance with design goals
- In particular, XML Design Principle #4 [1], which states that "It shall be easy to write programs which process XML documents."
Expore the API design space
- There is a chance, while XML is young, to make some real progress in the design of processor APIs. The design of Lark makes very few assumptions about the user interface; thus Lark should be useful as an experimental testbed in this area.
$$$
- Perhaps Lark will turn out to be useful. I have not the slightest desire to start another software company (been there, done that, got the T-shirts), but it would be nice to figure out a way to get paid for the time I've put in writing it.
1.2 Conclusions
Yes, writing Lark was fun. In particular, none of the innocent-looking things in XML turned out, in practice, to be too horribly difficult.
As for Java: <OldFart>well, maybe there's something to this "Object-Oriented" fad.</OldFart> But it ain't fast.
On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in Perl (don't know about Java) in a day or so. On the other hand, if you want to build a general purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed several weeks. I started sketching out Lark at SGML '96 sometime around the 20th of November, and it took into the first week of January. Mind you, in the interim I returned from Boston, bought a house, traveled to Australia to get married and to Saskatchewan for Christmas, and did some other work. As 1997 has worn on, I have put in a day here and there patching missing pieces and tracking changes in the XML spec.
I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don't mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.
As for the financial aspects, I'm kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML's simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I'm not sure that there's all that much need for a standalone processor; but I'd love to be wrong.
Just in case, for the moment I'm going to be giving away all the .class files, and some of the Java source code, but not the source code for the two classes with the hard bits. In any case, they're sure to be buggy at this stage, and I wouldn't want to be letting them out of my hands without a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.
2. Lark Feature Set Overview
2.1 Compactness
Since an XML processor is often going to run on the client and presumably would need to be delivered over the network, it must be compact. At the moment, the total byte count is just just over 40K, which is not too bad. Example 1 shows a snapshot of the .class files.
There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g., a usable symbol table and better Unicode support.
2.2 Performance
At the moment, Lark, running under the Win95 J++ "Jview" application viewer on an underconfigured P100 notebook, runs at about 20K/second. I regard this as totally unacceptable. There are all sorts of glaring inefficiencies in the current implementation, and as soon as I track down a Java profiler I think I can promise much better performance. An XML processor is, after all, not doing very much.
2.3 Completeness
Lark is a processor only; it does not attempt to validate. It does read the internal subset of the DTD; it processes attribute list declarations (to find default values) and entity declarations. Lark does process parameter entities in the internal subset, but only recognizes them at places that a complete markup declaration could start. Lark's internationalization is incomplete; it reads UCS-2, UTF-16, and ASCII, but not UTF-8 (making use of the Byte Order Marks and Encoding Declarations in the appropriate fashion). Aside from that, Lark is relatively full-featured; it implements (I think) everything in the XML spec, and reports violations of well-formedness.
Lark's error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what's going on), and calling doSyntaxError application callback (see below) to report further such errors.
Example 1
-rwxrwxrwa 810 Jun 26 15:07 Attribute.class
|
-rwxrwxrwa 1756 Jun 26 15:07 Element.class
|
-rwxrwxrwa 1022 Jun 26 15:07 Entity.class
|
-rwxrwxrwa 1698 Jun 26 15:07 Handler.class
|
-rwxrwxrwa 32774 Jun 26 15:07 Lark.class
|
-rwxrwxrwa 1686 Jun 26 15:07 Namer.class
|
-rwxrwxrwa 1934 Jun 26 15:07 Segment.class
|
-rwxrwxrwa 855 Jun 26 15:07 Text.class
|
-rwxrwxrwa 997 Jun 26 15:07 XmlInputStream.class
|
2.4 API
Lark presents as a set of Java classes. Those named Element, Attribute, and Entity are obvious in their function. One lesson of this activity is that it may be possible for such classes to be shared independent of the parser architecture; it would be very handy if all XML-processing Java apps used the same Element class, at least as a basis for subclassing.
Those named Lark and Namer have all the "tricky bits" of parsing and text management logic; they are black boxes for the moment.
The class XmlInputStream is a place to put logic to get access to streams of Unicode characters in various encodings. It is to be hoped that it can soon be replaced by equivalent functions in the Java base libraries.
The Text and Segment classes do Lark's character data management; details are below.
From an application's point of view, the Lark and Handler classes are central. Handler has methods such as doPI, doSTag, doEntityReference, doEtag; the application passes a Handler instance to Lark's readXML method, and Lark calls the routines as it recognizes significant objects in the XML document. The base class provides do-nothing methods; the intent is that an application would subclass Handler to provide the appropriate processing semantics.
Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document's character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.
In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e., for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.
An instance of Lark may be initialized with an optional list of element GIs which are to be considered as those of empty elements, whether or not the XML /> syntax is used. A typical set might begin: "HR", "BR", "IMG", ....
Another initializer allows a set of entities to be predefined.
Lark also provides the application access to the "entity tree"; there is a method that toggles whether Lark attempts to retrieve and parse external entities.
2.5 Error Handling
Lark makes a serious effort to be robust, by providing useful error messages and continuing to do so after the first error. I tested against a selection of HTML files, most notably WD-xml.html, which was produced by my own Perl scripts, and was horrified to find lots of unquoted attribute values and a hideous number of nesting errors, mostly in the HTML tables used to pretty-print the BNF productions. The error handling is good enough that I now use Lark as my primary tool to debug broken XML files.
2.6 Text Segment Management
Probably due to my background in the indexing and search business, Lark pays more attention than is perhaps strictly necessary to the location of objects within the XML file. Lark informs the application of the containing entity and begin/end offsets of each element. A chunk of character data in an element, which may look contiguous to an application, may in fact map to several different byte ranges in different entities, due to the effect of entity references. Also, the use of encodings such as UTF8 may cause the number of bytes of underlying document to differ from the number of bytes that makes up the characters in a chunk of text. Lark represents Text objects as a vector of Segment objects, each of which gives information about the source offset and length, and the number of characters in the segment. If text-saving has not been turned on, the segments contain no character data, but still contain the offset information.
2.7 Entity Handling
Lark does full XML entity handling, and Java provides facilities which make this trivially easy. The application can turn the inclusion of external text entities on and off. Lark makes no use of PUBLIC identifiers, aside from passing them to the Handler in the callback upon recognizing the declaration.
2.8 Concurrency
Lark is thread-safe. Multiple Larks can run in parallel threads, with other threads doing useful processing of under-construction document trees.
3. Lark: Constructors, Methods,
and the Handler
3.1 The Handler Class
The Handler class is just a package of methods that act as callback routines for Lark. No arguments are used in constructing a handler.
Handler has a method named element, which the internal logic uses to construct a new Element object every time this is required (rather than calling new Element() directly). The idea is that this can be subclassed to return, rather than an Element object, a custom-built instance of a subclass of Element.
The first argument of all the other Handler methods is an Entity object, which contains information about the location of the object just encountered and the current state of the entity stack. All of the Handler methods return booleans. If any handler method returnstrue, Lark's readXML method returns control to whatever called it. In the base Handler class, all the methods return false by default, except for the doSyntaxError method, which generates a message via System.err.println(). Example 2 summarizes the methods; the names and types of the arguments should make their meaning self-explanatory.
Example 2
// Detected Processing Instruction public boolean doPI(Entity e, String PI) // Detected a text segment public boolean doText(Entity ent, Element el, char[] text, int length) // Doctype declaration; version for SYSTEM-only public boolean doDoctype(Entity e, String rootID, String externalSubsetID) // version for no SYSTEM or PUBLIC keyword public boolean doDoctype(Entity e, String rootID) // Internal (text) entity declaration public boolean doInternalEntity(Entity e, String name, char[] value) // External text entity declaration; external ID is a URL public boolean doSystemTextEntity(Entity e, String name, String extID) // NDATA entity declaration public boolean doSystemBinaryEntity(Entity e, String name, String extID, String notation) // End-tag detected; the element should be complete public boolean doETag(Entity e, Element element) // Start tag detected; the element will in general be // incomplete, unless element.empty() is true public boolean doSTag(Entity e, Element element) // Entity reference detected; e is the containing entity, // not the one that was detected public boolean doEntityReference(Entity e, String name) // Error condition detected; the entity e will have the // current locational information filled in public boolean doSyntaxError(Entity e, String message)
Notes:
- For doPI, the PI argument does not include the opening and closing delimiters, <? and ?>.
- For doText, the length argument is provided because the characters may not occupy the whole text buffer, which is reusable; thus if the application wishes to save the text, it must copy them out and store them.
- If an element contains entity references embedded character data, Lark generates multiple doText calls, one for each text segment thus created. See the discussion of the Text and Segment classes for a fuller explanation of the motivation for and usage of text segments.
- The doInternalEntity, doSystemTextEntity, and doSystemBinaryEntity methods signal the declaration, not the invocation, of these entities.
- If tree building is turned off, the Element object passed by doSTag and doETag is reused; the same one passed in each time. In this scenario, the only function of the Element object is as a container for data fields. When tree building is turned on, each element gets its own object, and they are linked together in the tree.
As noted above, the default Handler class methods all simply return false, except for doSyntaxError, which is as follows:
public boolean doSyntaxError(Entity e, String message) { System.err.println("Syntax error at line " + e.lineCount() + ":" + e.lineOffset() + ": " + message); return false; }
3.2 The Lark Class
The Lark class actually does the parsing. It has the following constructors:
public Lark() throws Exception public Lark(String[] emptyElementTypes) throws Exception public Lark(String[] entityNames, String[] entityValues) throws Exception public Lark(String[] emptyElementTypes, String[] entityNames, String[] entityValues) throws Exception
-
emptyElementTypes is a list of GIs which Lark considers, when encountered, to be empty, whether or not they use the trailing
/> XML empty-element syntax. Lark ignores case in GIs, and always reports GIs to the handler in uppercase form.
- entityNames and entityValues are arrays which respectively give the names and values of entities that Lark is to consider to have
3.3 Controlling Lark's Behavior
Lark has three methods for toggling behaviors:
public void buildTree(boolean build) public void saveText(boolean save) public void processExternalEntities(boolean process)
- buildTree instructs Lark whether or not to build a parse tree of elements as they are encountered. The default behavior is to build no tree. The tree is built downward from the element current at the time of invocation. When that element ends, tree-building stops.
- saveText instructs Lark whether or not to save, in the parse tree, all the character data as it is encountered. This behavior cannot be turned on unless the tree-building has been.
- processExternalEntities instructs Lark whether or not to read and process the contents of external text entities. The default behavior is not to read and process them.
Each of these methods can be used from within the Handler callbacks; in particular, Lark calls the Handler doSTag and doEntityReference methods before deciding whether to build the tree, save the text, or read the entity, so a Handler can control this behavior upon notification of the element or entity.
3.4 Initiating Parsing
Lark's readXML methods are used to initiate parsing. There are three as shown in Example 3.
In each case, the caller must provide a Handler (or more likely an instance of a class subclassed from Handler). Lark can read from an XmlInputStream (subclassed from BufferedInputStream--see below) or from a byte array. Perhaps Lark should also be able to read from a character array.
As noted above, Lark returns from readXML whenever one of the Handler methods returns true; in which case it maintains the parsing state and readXML can be reinvoked. It also returns when it encounters end of input on the document entity.
The Element object returned by readXML is not meaningful if tree-building has not been turned on. If tree-building has been turned on, it is topmost node in the tree built by Lark--this may not be the node for the root element if tree-building was not turned on until later in the tree.
Example 3
// no input stream provided - assume current state is valid public Element readXML(Handler handler) throws java.io.IOException // XmlInputStream provided - assume it's new public Element readXML(Handler handler, XmlInputStream input) throws java.io.IOException // byte-buffer provided - make an inputstream and put it on stack public Element readXML(Handler handler, byte[] buffer) throws java.io.IOException Example 4 class Attribute { String mName; Text mValue; // Constructor Attribute(String name, String value) public String name() { return mName; } public void setName(String name) { mName = name; } public String value() { return mValue; } public void setValue(String value) { mValue = value; } }
4. The XmlInputStream Class
XmlInputStream is a subclass of BufferedInputStream. It adds one additional method:
public int getXmlChar() throws IOException
This is intended to retrieve enough bytes from the BufferedInputStream to constitute one Unicode character, and return it. Obviously, the logic is highly dependent on the input encoding.
XmlInputStream has all the same constructors as BufferedInputStream, plus one extra:
XmlInputStream(char[] charBuf)
This allows getXmlChar() to work out of a character array. Lark uses this facility to support internal text entities.
It is remarkable that Java's built-in BufferedInputStream has no getCharacter() method; with any luck, this oversight will soon be corrected, allowing XmlInputStream to be retired. Java does have facilities for reading characters from a DataInputStream, but their design makes them unusable by XML (or by any other application I can think of). Or perhaps I just didn't understand the documentation.
5. Elements, Attributes, and Entities
The types in Example 4 are self-explanatory, contain essentially no nontrivial logic, and are designed for general use in XML applications. Attribute contains a name and a value, and trivial methods for getting and setting them.
Element is more complex. Readers are urged to look at the source code, which is heavily commented. A few fields and methods of special interest are excerpted in Example 5.
Example 5
public class Element { Attribute[] mAttrs; // Attributes String mType; // Children are a Vector because they might be Elements // and they might be Text objects Vector mChildren = new Vector(); // Parent element. null does not mean this is the root, // it just means this is where Lark started constructing // the tree. Element mParent; public String type(); public Attribute[] allAttributes() public void setAllAttributes(Attribute[] attributes) // get & set attribute by name public Attribute attribute(String name) public void setAttribute(String name, String value) public Vector children() public Element parent() }
Note that Lark only creates one String object for each unique element type. This means that the String objects returned by the type() method may be simply compared for equality to establish whether two elements are the same type.
The Entity class's fields are shown in Example 6: it has only trivial get and set methods for these fields.
Note that Entities differ from Elements in one important respect: Lark always constructs the Entity tree.
6. Text Handling and Segmentation
The Text object exists mostly to serve as an abstraction layer in front of objects which present as Strings, but may be stored in a segmented fashion. This supports lazy evaluation, the String being (expensively) constructed only when it is requested. Example 7 shows the complete source code.
The Segment class is much more complex, and tries to provide a means of storing information about character data, and optionally the character data itself, efficiently while keeping track of source offset information. Its implementation, particularly the constructors, is fairly complex, but the methods that an application might use are straightforward:
Example 6
// locational info int mOffset; int mLineCount; int mLineOffset; // how far into this line // not true for internal entities boolean mTextIsReal; XmlInputStream mInput; // read characters for this entity String mDescription; // metadata Entity mParent; // entity tree; null means doc entity
public Entity entity() public int sourceOffset() public int sourceLength() public int dataOffset() public int dataLength() public String string()
7. To-Do List
- Fix up XmlInputStream so it can read UTF-8 as well as ASCII.
- Find a profiler and make Lark faster.
- Make an applet-ized version of Lark.
- Add an Entity member to the Element class.
- If any signs of interest in Lark develop, formalize the test suite and config management.
class Text { // This is a Text object, which can appear in the Children() // of an element. The text itself is actually stored in a // Vector of Segments; this object really only exists to // provide a nice singular object to sit in the parse // tree, and to avoid recalculating the String value // of a multi-segment object. Vector mSegments = new Vector(); String mString; public void addSegment(Segment segment) { mSegments.addElement(segment); mString = null; // invalidate string } public Vector segments() { return mSegments; } // construct string only on request public String string() { int i; Segment seg; if (mString == null) { mString = new String(); for (i = 0; i < mSegments.size(); i++) { // there must be a better way seg = (Segment) mSegments.elementAt(i); mString = mString.concat(seg.string()); } } return mString; } }
8. How to Get Lark
All of the classes, and all the source code except Lark, Namer, XmlInputStream, are at [2].
About the Author
- Tim Bray
- 321-3495 Cambie Street
- Vancouver, B.C., Canada V5Z 4R3
- tbray@textuality.com
Tim Bray is a Canadian. He entered the software profession in 1981; after on-the-job training from Digital and GTE, he became manager of the New Oxford English Dictionary Project at the University of Waterloo in 1986. He co-founded Open Text Corporation in 1989, and started an independent consulting practice under the name Textuality in 1996. He is a Seybold Fellow, editor of the Gilbane Report, and co-editor of the World Wide Web Consortium's "Extensible Markup Language Specification."