RelaxNGCC -- Bridging the Gap Between Schemas and Programs

May 8, 2002

There are several schema languages available for use in XML applications, including W3C XML Schema, Schematron, RELAX Core, and RELAX NG.

The primary reason schema languages exist is validation: the process of determining whether an XML instance meets the constraints imposed by a schema. Although many people use schemas only to validate XML, a schema can do more than that. In this article, we examine how far schemas can take us and how RelaxNGCC extends their reach.

Are you satisfied with your schemas?

A substantial part of many XML-based applications consists of reading, validating, and interpreting XML input. These applications must determine whether the input meets every constraint the application imposes and, if it does not, must report helpful error messages.

No matter which schema language is used, programmers cannot always describe every constraint perfectly. A perfect schema would guarantee that the application processes the input without any errors whenever validation succeeds. For example, consider the following XML, which describes stocks listed on the world's stock markets.


<?xml version="1.0" encoding="utf-8"?>
<stocks>
  <stock>
    <market>Tokyo</market>
    <ticker>6758</ticker>
    <name>Sony</name>
  </stock>
  ...
  <stock>
    <market>NASDAQ</market>
    <ticker>MSFT</ticker>
    <name>Microsoft</name>
  </stock>
</stocks>

A RELAX NG grammar that describes the format of this example follows.


<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="stocks">
      <oneOrMore>
        <element name="stock">
          <element name="market"><text/></element>
          <element name="ticker"><text/></element>
          <element name="name"><text/></element>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

Although the schema does not say so explicitly, the format of a ticker code depends on the market. For instance, NASDAQ tickers consist of capital letters, whereas tickers on the Tokyo Stock Exchange consist of four digits. A schema can describe these format constraints well enough, but validation becomes harder if we want to check whether a ticker in the document is actually a valid one. That requires enumerating all valid pairs of market and ticker. The number of pairs is certainly finite, but in practice it is too large to list explicitly. To make matters worse, the list must be updated continually as companies go public or go bankrupt.

Complex constraints like this appear quite often in real-world applications. In such situations, perfect validation is impossible with a schema language alone.
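
The difference between the two kinds of check can be sketched in a few lines of Java. This is only an illustration: the class `TickerRules`, its method names, and the tiny in-memory sample data are assumptions made for this sketch, standing in for a real and much larger list of stocks.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: names and sample data are illustrative only.
final class TickerRules {
    // Format rules from the text: capital letters on NASDAQ, four digits in Tokyo.
    private static final Map<String, Pattern> FORMATS = Map.of(
            "NASDAQ", Pattern.compile("[A-Z]+"),
            "Tokyo", Pattern.compile("[0-9]{4}"));

    // Stand-in for the real, constantly changing list of listed stocks.
    private static final Map<String, Set<String>> LISTED = Map.of(
            "NASDAQ", Set.of("MSFT"),
            "Tokyo", Set.of("6758"));

    // A check a schema can express: does the ticker have the right shape?
    static boolean isWellFormed(String market, String ticker) {
        Pattern p = FORMATS.get(market);
        return p != null && p.matcher(ticker).matches();
    }

    // A check that needs external data: is the ticker actually listed?
    static boolean isListed(String market, String ticker) {
        return LISTED.getOrDefault(market, Set.of()).contains(ticker);
    }
}
```

A ticker such as `ZZZZ` on NASDAQ passes the format check but fails the lookup, which is exactly the case a schema alone cannot catch.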

The power of programming languages

To implement validation of complex logic like this, we would naturally choose a programming language such as Java rather than driving a schema language to its limit. In other words, we would write a program that accesses XML via DOM or SAX for validation. By using programming languages, we can employ a range of techniques and tools to deal with XML -- calling external libraries, referring to databases, and so on.
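
As a rough illustration of this hand-written approach, here is a minimal SAX handler that checks each pair of market and ticker as the document is parsed. The class name `HandWrittenStockChecker` and the hard-coded `isKnown` lookup are assumptions made for this sketch; a real program would consult a database or some other external source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hand-written checker; class and method names are illustrative.
class HandWrittenStockChecker extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String market;
    final List<String> errors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (qName.equals("market")) {
            market = value; // remember the market for the ticker that follows
        } else if (qName.equals("ticker") && !isKnown(market, value)) {
            errors.add("unknown ticker " + value + " in market " + market);
        }
    }

    // Placeholder for a real lookup (a database query, for example).
    private boolean isKnown(String market, String ticker) {
        return ("Tokyo".equals(market) && "6758".equals(ticker))
            || ("NASDAQ".equals(market) && "MSFT".equals(ticker));
    }

    // Parses the given XML text and returns the collected error messages.
    static List<String> check(String xml) {
        HandWrittenStockChecker handler = new HandWrittenStockChecker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) { // parser configuration, I/O, or malformed XML
            throw new IllegalStateException(e);
        }
        return handler.errors;
    }
}
```

Note that the element names `market` and `ticker` are repeated in the Java code, so the handler silently duplicates knowledge that already lives in the schema.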

However, validation against the schema and the program that uses DOM or SAX are independent of each other, so whenever the schema changes, the program has to be modified by hand to match. I developed RelaxNGCC in response to this problem.

Introduction to RelaxNGCC

To parse a text stream according to a given grammar, there are tools generally called "compiler compilers", such as yacc, bison, and JavaCC. These tools translate a context-free grammar with embedded code fragments into source code, which in turn parses text according to the grammar.

What happens if we apply this strategy to XML? The following table compares these tools with RelaxNGCC.
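
To give a feel for what "a grammar with embedded code fragments" turns into, here is a small hand-written Java parser for the grammar `sum := number ('+' number)*`. It is not the output of yacc, bison, or JavaCC; the class `SumParser` is a hypothetical sketch in which the line marked as the embedded action plays the role of the code a programmer would attach to a grammar rule.

```java
// Hypothetical illustration; not the output of any real compiler compiler.
final class SumParser {
    private final String input;
    private int pos = 0;
    private int total = 0; // updated by the embedded action

    private SumParser(String input) {
        this.input = input;
    }

    // sum := number ( '+' number )*
    static int parse(String input) {
        SumParser p = new SumParser(input);
        p.number();
        while (p.pos < p.input.length()) {
            p.expect('+');
            p.number();
        }
        return p.total;
    }

    // number := digit+   { total += value; }
    private void number() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("digit expected at position " + pos);
        }
        total += Integer.parseInt(input.substring(start, pos)); // embedded action
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c) {
            throw new IllegalArgumentException("'" + c + "' expected at position " + pos);
        }
        pos++;
    }
}
```

Calling `SumParser.parse("1+20+3")` both checks the input against the grammar and computes the sum in a single pass, which is the same division of labor the tools above automate.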

Tool	Data format	Schema language	Programming language
yacc	plain text	context-free grammar	C
JavaCC	plain text	context-free grammar	Java
RelaxNGCC	XML	RELAX NG	Java

RelaxNGCC is a tool that generates Java source code from a RELAX NG grammar; the generated code parses XML according to the grammar and executes the actions embedded in it. The "CC" in RelaxNGCC stands for "compiler compiler".

Let's return to the stock example above. Suppose the program needs to verify each pair of market and ticker by querying a database via JDBC.


<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:cc="http://www.xml.gr.jp/xmlns/relaxngcc">
  <start cc:class="TickerChecker">
    <cc:java-import>
    import java.sql.*;
    </cc:java-import>
    <cc:java-body>
    Connection _connection;
    </cc:java-body>
    <element name="stocks">
      <oneOrMore>
        <element name="stock">
          <element name="market"><text cc:alias="market"/></element>
          <element name="ticker"><text cc:alias="ticker"/></element>
          <cc:java>
          Statement st = _connection.createStatement();
          ResultSet rs = st.executeQuery("SELECT * FROM stocklist WHERE "+
            "market='"+market+"' AND ticker='"+ticker+"';");
          if(!rs.next())
            throw new StockNotFoundException();
          st.close();
          </cc:java>
          <element name="name"><text/></element>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

The markup specific to RelaxNGCC belongs to the namespace URI http://www.xml.gr.jp/xmlns/relaxngcc. In this article, we use the prefix cc for this namespace URI. Here is a short explanation of the markup.

markup	explanation
`cc:class`	The name of the output Java class is `TickerChecker`.
`cc:java-import`	The content of `cc:java-import` is placed at the top of the generated source, ahead of the class definition. In most cases, the programmer writes `import` statements in this element.
`cc:java-body`	The content of `cc:java-body` is placed inside the class definition. In most cases, the programmer writes the necessary methods or fields in this element.
`cc:alias`	The value of a `text` or `data` element is accessed via a variable of this name from code within `cc:java` elements.
`cc:java`	The content of `cc:java` elements is executed during parsing of XML.

RelaxNGCC translates this RELAX NG grammar with embedded code fragments into Java source code that implements the ContentHandler interface of SAX. Part of the output looks like the following.


public void leaveElement(String uri, String localname, String qname)
  throws SAXException
{
  ...
  (omitted some lines)
  ...
  else if(_ngcc_current_state==6) {
    if(localname.equals("ticker") && uri.equals(DEFAULT_NSURI)) {
      Statement st = _connection.createStatement();
      ResultSet rs = st.executeQuery("SELECT * FROM stocklist WHERE "+
        "market='"+market+"' AND ticker='"+ticker+"';");
      if(!rs.next())
        throw new StockNotFoundException();
      st.close();
      _ngcc_current_state=5;
    }
    else this.throwUnexpectedElementException(qname);
  ...
}

The leaveElement method corresponds to the endElement method of ContentHandler. In this example, the body of the cc:java element is executed when a ticker element ends. Validation with RelaxNGCC is effective for reporting specialized error messages and for checking constraints that span two or more elements. Flexible behavior like this is hard to implement purely in a schema, even when external RELAX NG datatype libraries are supplied.

Data binding

Another way to describe RelaxNGCC is as a kind of data binding tool. JAXB and Relaxer are well-known tools in this category. Such a tool translates the tree structure of XML into Java objects, but sometimes its translation policy is not convenient for programmers. For example, it seems natural to bind a sequence of the same XML elements to a simple collection such as java.util.Vector. In some cases, however, a different type of collection would be more suitable, such as a hash table, a binary tree, or some other ad hoc data structure. In addition, the programmer may be interested in only part of the input, in which case translating all of it is a waste of resources.

Of course, it is possible to support all of these binding policies by adding optional features to the tool. But the potential requirements of all possible programming tasks are too diverse to cover in a single tool. Furthermore, a tool cannot decide from the grammar alone whether a certain data structure is appropriate.

Compared with complicating the tool, writing code directly in the grammar is the smarter approach:


<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
  xmlns:cc="http://www.xml.gr.jp/xmlns/relaxngcc">
  <start cc:class="TickerCollector">
    <cc:java-import>
    import java.util.*;
    </cc:java-import>
    <cc:java-body>
    HashMap _Stocks = new HashMap();
    </cc:java-body>
    <element name="stocks">
      <oneOrMore>
        <element name="stock">
          <element name="market"><text cc:alias="market"/></element>
          <element name="ticker"><text cc:alias="ticker"/></element>
          <element name="name"><text cc:alias="name"/></element>
          <cc:java>
          if(market.equals("NASDAQ"))
            _Stocks.put(ticker, new Stock(market, ticker, name));
          </cc:java>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

This example shows how to gather the stocks traded on NASDAQ and put them into a map, so that a program can later look up a stock by its ticker. Moreover, the programmer obtains the map during the parse itself, whereas with Relaxer or JAXB the object tree must be traversed after parsing to build it.
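
The grammar above assumes that a `Stock` class with a `Stock(market, ticker, name)` constructor exists somewhere in the application. A minimal sketch of such a class, together with the same filtering logic written as a plain Java method, might look as follows. Both `Stock` as written here and the helper `StockLookupDemo.collect` are hypothetical, introduced only to show how the resulting map could be used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value class; the grammar only assumes that this constructor exists.
final class Stock {
    final String market;
    final String ticker;
    final String name;

    Stock(String market, String ticker, String name) {
        this.market = market;
        this.ticker = ticker;
        this.name = name;
    }
}

final class StockLookupDemo {
    // Mirrors the cc:java action: keep NASDAQ stocks only, keyed by ticker.
    static void collect(Map<String, Stock> stocks,
                        String market, String ticker, String name) {
        if (market.equals("NASDAQ")) {
            stocks.put(ticker, new Stock(market, ticker, name));
        }
    }

    // Builds the map for the two stocks shown in the sample document.
    static Map<String, Stock> sample() {
        Map<String, Stock> stocks = new HashMap<>();
        collect(stocks, "Tokyo", "6758", "Sony");
        collect(stocks, "NASDAQ", "MSFT", "Microsoft");
        return stocks;
    }
}
```

After parsing, a lookup such as `stocks.get("MSFT")` returns the matching `Stock` directly, with no intermediate object tree to walk.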

Weakness of RelaxNGCC

However, if a programmer wants to serialize Java objects into XML, RelaxNGCC is of no use. JAXB and Relaxer can produce XML from a tree of objects. RelaxNGCC focuses only on unmarshaling (from XML into objects), just as most compiler compilers cannot restore the input text from the parsed result.

In addition, RelaxNGCC cannot handle every RELAX NG grammar. Internally, RelaxNGCC treats the given RELAX NG grammar as an automaton whose input alphabet is the set of SAX events. As a result, RelaxNGCC reports an error when the automaton is non-deterministic. Ambiguous grammars, for instance, always yield a non-deterministic automaton. Here is an example of an ambiguous grammar.


<choice>
  <element name="foo"><text cc:alias="foo1"/></element>
  <element name="foo"><text cc:alias="foo2"/></element>
</choice>

This restriction is not too serious since ambiguous grammars are rarely written.
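
Why the two `foo` alternatives cause trouble can be sketched with a toy automaton. The class `TinyAutomaton` below is a hypothetical model written for this article's example, not RelaxNGCC's internal representation; the event labels are also made up. It records transitions keyed by state and SAX event and reports every pair that leads to more than one target state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of an automaton over SAX events; not RelaxNGCC's internals.
final class TinyAutomaton {
    // "state event" -> set of possible next states
    private final Map<String, Set<Integer>> transitions = new HashMap<>();

    void add(int from, String event, int to) {
        transitions.computeIfAbsent(from + " " + event, k -> new TreeSet<>()).add(to);
    }

    // A deterministic automaton has at most one target per (state, event) pair.
    List<String> conflicts() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : transitions.entrySet()) {
            if (e.getValue().size() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Models the ambiguous <choice>: both branches start with the same event.
    static TinyAutomaton ambiguousChoice() {
        TinyAutomaton a = new TinyAutomaton();
        a.add(0, "startElement:foo", 1); // first branch, would bind foo1
        a.add(0, "startElement:foo", 2); // second branch, would bind foo2
        a.add(1, "text", 3);
        a.add(2, "text", 3);
        return a;
    }
}
```

For the ambiguous choice, `conflicts()` reports the pair of state 0 and the start of a `foo` element: on seeing `<foo>`, the parser cannot tell whether the text should be bound to `foo1` or `foo2`.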

Summary

yacc is a parser generator based on context-free grammars and text streams; analogously, RelaxNGCC is a parser generator based on RELAX NG and XML. It bridges the gap between XML and the Java object tree by embedding code fragments inside the grammar. With RELAX NG and RelaxNGCC, the power and flexibility of programming languages enable fine-grained validation and customized conversion of XML.

Finally, RelaxNGCC is free software distributed under the GPL, and the code generated by RelaxNGCC is absolutely free.