RelaxNGCC -- Bridging the Gap Between Schemas and Programs
May 8, 2002
There are several schema languages available for use in XML applications, including W3C XML Schema, Schematron, RELAX Core, and RELAX NG.
The primary reason schema languages exist is validation, a process to determine whether an XML instance meets the constraints imposed by a schema. Although many people only use schemas to validate XML, the capabilities of a schema can exceed validation. In this article, we examine the effectiveness of schemas and how RelaxNGCC extends their ability.
Are you satisfied with your schemas?
Many XML-based applications consist in substantive part of reading, validating, and recognizing the input of XML documents. These applications must determine whether the input meets all constraints imposed by the application and, further, must return convenient messages if the input does not meet the constraints.
No matter which schema language is used, programmers cannot always describe every constraint perfectly. A perfect schema would mean that the application processes the input without any errors whenever validation succeeds. For example, consider the following XML, which describes stocks listed on stock markets around the world.
    <?xml version="1.0" encoding="utf-8"?>
    <stocks>
      <stock>
        <market>Tokyo</market>
        <ticker>6758</ticker>
        <name>Sony</name>
      </stock>
      ...
      <stock>
        <market>NASDAQ</market>
        <ticker>MSFT</ticker>
        <name>Microsoft</name>
      </stock>
    </stocks>
A grammar written in RELAX NG that describes the schema of this example follows.
    <?xml version="1.0" encoding="utf-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0">
      <start>
        <element name="stocks">
          <oneOrMore>
            <element name="stock">
              <element name="market"><text/></element>
              <element name="ticker"><text/></element>
              <element name="name"><text/></element>
            </element>
          </oneOrMore>
        </element>
      </start>
    </grammar>
Although the schema doesn't say so explicitly, the format of a ticker code depends on the market. For instance, NASDAQ tickers consist of capital letters, whereas tickers on the Tokyo Stock Exchange consist of four digits. We can describe these format constraints adequately in a schema, but validation gets harder if we want to examine whether the ticker in the document is actually a valid one. This requires enumerating all valid pairs of market and ticker. The number of pairs is certainly finite, but in practice it is too large to list explicitly. To make matters worse, the list must be updated continuously as IPOs and bankruptcies occur.
Complex constraints like this appear quite often in real-world applications. In such situations, it is impossible to validate the input completely with a schema language alone.
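The market-dependent format rule described above can be sketched in Java. The patterns are assumptions drawn only from the two examples in the text (capital letters for NASDAQ, four digits for Tokyo), and the class and method names are hypothetical. The sketch covers the format check only; deciding whether a well-formed ticker is actually listed still requires external data.

```java
import java.util.Map;
import java.util.regex.Pattern;

public class TickerFormat {
    // Assumed formats, taken from the two examples in the text.
    private static final Map<String, Pattern> FORMATS = Map.of(
        "NASDAQ", Pattern.compile("[A-Z]+"),
        "Tokyo", Pattern.compile("[0-9]{4}"));

    /** Returns true when the ticker has the shape expected for the market. */
    public static boolean hasValidFormat(String market, String ticker) {
        Pattern pattern = FORMATS.get(market);
        // Unknown markets are rejected rather than silently accepted.
        return pattern != null && pattern.matcher(ticker).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasValidFormat("Tokyo", "6758"));  // true
        System.out.println(hasValidFormat("NASDAQ", "6758")); // false
    }
}
```

A ticker such as "ZZZZ" passes the NASDAQ pattern whether or not any company trades under it, which is exactly the gap the paragraph above points out.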
The power of programming languages
To implement validation of complex logic like this, we would naturally choose a programming language such as Java rather than driving a schema language to its limit. In other words, we would write a program that accesses XML via DOM or SAX for validation. By using programming languages, we can employ a range of techniques and tools to deal with XML -- calling external libraries, referring to databases, and so on.
However, the validation against the schema and the program using DOM/SAX are independent of each other. It is troublesome to modify the program when the schema changes. I developed RelaxNGCC in response to this problem.
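A hand-written SAX check of the kind described above might look like the following sketch. The class name, the in-memory set standing in for a database, and the "market/ticker" key format are all assumptions made for illustration. Note that the element names are repeated in the Java code, so a schema change has to be mirrored here by hand.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandWrittenChecker extends DefaultHandler {
    // Hypothetical stand-in for the listing database: keys look like "market/ticker".
    private final Set<String> knownPairs;
    private final List<String> unknown = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String market;
    private String ticker;

    public HandWrittenChecker(Set<String> knownPairs) {
        this.knownPairs = knownPairs;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // These names duplicate what the schema already states.
        // The input is assumed to have been validated against the schema separately.
        switch (qName) {
            case "market": market = text.toString().trim(); break;
            case "ticker": ticker = text.toString().trim(); break;
            case "stock":
                String key = market + "/" + ticker;
                if (!knownPairs.contains(key)) {
                    unknown.add(key);
                }
                break;
            default: break;
        }
    }

    /** Parses the XML and returns the market/ticker pairs missing from knownPairs. */
    public static List<String> findUnknown(String xml, Set<String> knownPairs) throws Exception {
        HandWrittenChecker handler = new HandWrittenChecker(knownPairs);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.unknown;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<stocks>"
            + "<stock><market>Tokyo</market><ticker>6758</ticker><name>Sony</name></stock>"
            + "<stock><market>NASDAQ</market><ticker>XXXX</ticker><name>Nobody</name></stock>"
            + "</stocks>";
        System.out.println(findUnknown(xml, Set.of("Tokyo/6758", "NASDAQ/MSFT")));
    }
}
```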
Introduction to RelaxNGCC
To parse a text stream based on a given grammar, there are tools generally called "compiler compilers", such as yacc, bison, and JavaCC. These tools translate a context-free grammar with embedded code fragments into source code, which in turn parses the text according to the grammar.
What happens if we apply this strategy to XML? The following table compares these tools with RelaxNGCC.
Tool | Data format | Schema language | Programming language |
---|---|---|---|
yacc | plain text | context-free grammar | C |
JavaCC | plain text | context-free grammar | Java |
RelaxNGCC | XML | RELAX NG | Java |
RelaxNGCC is a tool that generates Java source code, which in turn executes the actions embedded in the given RELAX NG grammar. The "CC" in RelaxNGCC stands for "compiler compiler".
Let's return to the stock example above. Let's say the program wants to verify the pair of market and ticker by referring to a database via JDBC.
    <?xml version="1.0" encoding="utf-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0"
             xmlns:cc="http://www.xml.gr.jp/xmlns/relaxngcc">
      <start cc:class="TickerChecker">
        <cc:java-import>
          import java.sql.*;
        </cc:java-import>
        <cc:java-body>
          Connection _connection;
        </cc:java-body>
        <element name="stocks">
          <oneOrMore>
            <element name="stock">
              <element name="market"><text cc:alias="market"/></element>
              <element name="ticker"><text cc:alias="ticker"/></element>
              <cc:java>
                Statement st = _connection.createStatement();
                ResultSet rs = st.executeQuery("SELECT * FROM stocklist WHERE "+
                  "market='"+market+"' AND ticker='"+ticker+"';");
                if(!rs.next()) throw new StockNotFoundException();
                st.close();
              </cc:java>
              <element name="name"><text/></element>
            </element>
          </oneOrMore>
        </element>
      </start>
    </grammar>
The markup specific to RelaxNGCC belongs to the namespace URI http://www.xml.gr.jp/xmlns/relaxngcc, for which this article uses the prefix cc. Here is a short explanation of the markup.
markup | explanation |
---|---|
cc:class | The name of the output Java class (TickerChecker in this example). |
cc:java-import | The content of cc:java-import is placed ahead of the class definition. In most cases, the programmer writes import statements in this element. |
cc:java-body | The content of cc:java-body is placed inside the class definition. In most cases, the programmer writes necessary methods or fields in this element. |
cc:alias | The value of a text or data element is accessed through a variable of this name from code within cc:java elements. |
cc:java | The content of a cc:java element is executed while the XML is being parsed. |
RelaxNGCC translates this RELAX NG grammar with embedded code fragments into Java source code that implements the ContentHandler interface of SAX. Part of the output code looks like the following.
    public void leaveElement(String uri,String localname,String qname) throws SAXException {
        ... (omitted some lines) ...
        else if(_ngcc_current_state==6) {
            if(localname.equals("ticker") && uri.equals(DEFAULT_NSURI)) {
                Statement st = _connection.createStatement();
                ResultSet rs = st.executeQuery("SELECT * FROM stocklist WHERE "+
                    "market='"+market+"' AND ticker='"+ticker+"';");
                if(!rs.next()) throw new StockNotFoundException();
                st.close();
                _ngcc_current_state=5;
            }
            else this.throwUnexpectedElementException(qname);
        ...
    }
The leaveElement method corresponds to the endElement method of ContentHandler. In this example, the body of the cc:java element is executed when a ticker element ends. Validation with RelaxNGCC is effective for reporting specialized error messages and for checking constraints that span two or more elements. Flexible behavior like this is hard to implement purely in a schema, even if external RELAX NG datatype libraries are supplied.
Data binding
Another way to describe RelaxNGCC is as a kind of data binding tool. JAXB and Relaxer are well-known tools in this category. A data binding tool translates the tree structure of XML into Java objects, but sometimes its translation policy is not convenient for programmers. For example, it seems natural to bind a sequence of identical XML elements to a simple collection such as java.util.Vector. In some cases, however, a different type of collection would be more suitable, such as a hash table, a binary tree, or some other ad hoc data structure. Additionally, the programmer may be interested in only part of the input. In that situation, translating all of the input is a waste of resources.
Of course, it is possible to support all those varying binding policies by adding optional features to the tool. But the potential requirements of all possible programming tasks are too diverse to cover in a single tool. Furthermore, a tool cannot decide from the grammar alone whether a certain data structure is appropriate.
Compared with complicating the tool, writing code directly in the grammar is the smarter approach:
    <?xml version="1.0" encoding="utf-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0"
             xmlns:cc="http://www.xml.gr.jp/xmlns/relaxngcc">
      <start cc:class="TickerCollector">
        <cc:java-import>
          import java.util.*;
        </cc:java-import>
        <cc:java-body>
          HashMap _Stocks = new HashMap();
        </cc:java-body>
        <element name="stocks">
          <oneOrMore>
            <element name="stock">
              <element name="market"><text cc:alias="market"/></element>
              <element name="ticker"><text cc:alias="ticker"/></element>
              <element name="name"><text cc:alias="name"/></element>
              <cc:java>
                if(market.equals("NASDAQ"))
                  _Stocks.put(ticker, new Stock(market, ticker, name));
              </cc:java>
            </element>
          </oneOrMore>
        </element>
      </start>
    </grammar>
This example gathers the stocks traded on NASDAQ and puts them into a map so that a program can later look up stocks by ticker. Moreover, the map is built while the input is being parsed, whereas with Relaxer or JAXB a traversal of the object tree is needed after parsing.
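For contrast, the two-step approach (build the whole tree first, then traverse it) can be sketched with plain DOM standing in for the object tree a binding tool would produce. The class and method names are hypothetical, and this is not the actual API of Relaxer or JAXB; it only illustrates the extra pass over a fully materialized tree.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TreeThenMap {
    /** Builds the whole tree first, then walks it to collect NASDAQ names by ticker. */
    public static Map<String, String> nasdaqNamesByTicker(String xml) throws Exception {
        // Step 1: every stock, NASDAQ or not, is loaded into memory.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // Step 2: a separate traversal picks out the entries of interest.
        Map<String, String> result = new HashMap<>();
        NodeList stocks = doc.getElementsByTagName("stock");
        for (int i = 0; i < stocks.getLength(); i++) {
            Element stock = (Element) stocks.item(i);
            if ("NASDAQ".equals(childText(stock, "market"))) {
                result.put(childText(stock, "ticker"), childText(stock, "name"));
            }
        }
        return result;
    }

    // Assumes schema-valid input, so the named child always exists.
    private static String childText(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<stocks>"
            + "<stock><market>Tokyo</market><ticker>6758</ticker><name>Sony</name></stock>"
            + "<stock><market>NASDAQ</market><ticker>MSFT</ticker><name>Microsoft</name></stock>"
            + "</stocks>";
        System.out.println(nasdaqNamesByTicker(xml)); // {MSFT=Microsoft}
    }
}
```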
Weakness of RelaxNGCC
However, if a programmer wants to serialize Java objects into XML, RelaxNGCC is of no use. JAXB and Relaxer can produce XML from a tree of objects. RelaxNGCC focuses only on unmarshaling (from XML into objects); likewise, most compiler compilers cannot restore the input text from the parsed result.
Additionally, RelaxNGCC cannot handle all RELAX NG grammars. Internally, RelaxNGCC treats the given RELAX NG grammar as an automaton driven by SAX events, which serve as its alphabet. Consequently, RelaxNGCC reports an error when the automaton is non-deterministic. Ambiguous grammars, for instance, always yield a non-deterministic automaton. Here is an example of an ambiguous grammar.
    <choice>
      <element name="foo"><text cc:alias="foo1"/></element>
      <element name="foo"><text cc:alias="foo2"/></element>
    </choice>
This restriction is not too serious since ambiguous grammars are rarely written.
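The determinism condition can be illustrated with a small sketch. RelaxNGCC's internal representation is not shown in this article, so the transition model, the event labels, and the class names below are assumptions made purely for illustration. An automaton is treated as deterministic when no state has two different target states for the same event.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeterminismCheck {
    /** One edge: in state `from`, on SAX event `event`, move to state `to`. */
    public static final class Transition {
        final int from;
        final String event;
        final int to;

        public Transition(int from, String event, int to) {
            this.from = from;
            this.event = event;
            this.to = to;
        }
    }

    /** Returns true when no state has two different targets for the same event. */
    public static boolean isDeterministic(List<Transition> transitions) {
        Map<String, Integer> targets = new HashMap<>();
        for (Transition t : transitions) {
            String key = t.from + "|" + t.event;
            Integer previous = targets.putIfAbsent(key, t.to);
            if (previous != null && previous != t.to) {
                return false; // same state, same event, two different targets
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The ambiguous choice: both branches begin with a foo element.
        List<Transition> ambiguous = List.of(
            new Transition(0, "startElement:foo", 1),  // branch aliased foo1
            new Transition(0, "startElement:foo", 2)); // branch aliased foo2
        System.out.println(isDeterministic(ambiguous)); // false

        // The first steps of the stocks grammar: each event has one target.
        List<Transition> stocks = List.of(
            new Transition(0, "startElement:stocks", 1),
            new Transition(1, "startElement:stock", 2));
        System.out.println(isDeterministic(stocks)); // true
    }
}
```

With the ambiguous choice, the parser cannot tell on seeing a foo start tag whether the text should be bound to foo1 or foo2, which is why an error at generation time is the reasonable outcome.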
Summary
yacc is a parser generator based on context-free grammar and text streams; in an analogous way, RelaxNGCC is another parser generator based on RELAX NG and XML. It bridges the gap between XML and the Java object tree by embedding code fragments inside the grammar. With RELAX NG and RelaxNGCC, the power and the flexibility of programming languages enable fine-grained validation and customized conversion of XML.
Finally, RelaxNGCC is free software distributed under the GPL, and the code generated by RelaxNGCC is absolutely free.
Related links
- RelaxNGCC documents and downloads
- RELAX NG official site
- RELAX NG tutorial
- Relaxer (data binding tool based on RELAX Core)
- JAXB (data binding tool based on DTD)