RELAX NG's Compact Syntax

June 19, 2002

Working with XML Schema is like driving a limousine. It's true that it has some nice appointments (datatypes come to mind), but the wheelbase is a bit on the long side, making it difficult to turn corners easily, and I am inclined to let somebody else do the driving for me. Using RELAX NG, on the other hand, is like driving a sports car. It holds corners amazingly well, and I am much less interested in handing over the keys to anyone. You may prefer to drive a limo over a sports car. But I'll take the sports car any day.

You are probably familiar with XML Schema and RELAX NG. Both are schema languages for XML. The former was released by the W3C in May 2001, while the latter was released in December 2001 by OASIS. RELAX NG, which was developed by a small technical committee led by James Clark, merges Murata Makoto's RELAX and Clark's TREX. It is a simple yet elegant evolution of the DTD, and it is easy to learn. It is modular in design. The core of RELAX NG is focused on validation alone and doesn't modify the infoset in the process of validation; in other words, no PSVI. RELAX NG is also part of an ISO draft standard, ISO/IEC DIS 19757-2.

RELAX NG schemas were originally written in XML, but there's also a compact, non-XML syntax. While this article doesn't contain an exhaustive review of all the features of RELAX NG, it will give you a good idea of how to use the main parts of the compact syntax. If you don't know much about RELAX NG, I suggest that you read Eric van der Vlist's RELAX NG Compared before finishing this article.

I think you'll find the compact syntax quite readable and easy to learn. In some respects, a RELAX NG schema in compact form looks like a context-free grammar, which provides a familiar view of the language, is readily comprehensible, and is amenable to parsing. Also, don't be surprised if the compact syntax bears a resemblance to the syntax of XDuce and XQuery's computed element and attribute constructors.

A Simple Example

The first exhibit is a well-formed XML document which represents an ISO 8601 date:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE date SYSTEM "date.dtd">

<date type="ISO8601">
 <year>2002</year>
 <month>06</month>
 <day>01</day>
</date>

This document is an instance of the following DTD:


<!ELEMENT date (year, month, day)>
<!ATTLIST date type CDATA #IMPLIED>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>

It is also valid with regard to the following RELAX NG schema in XML syntax:


<?xml version="1.0" encoding="UTF-8"?>
<element name="date"
xmlns="http://relaxng.org/ns/structure/1.0">
 <optional>
  <attribute name="type"/>
 </optional>
 <element name="year"><text/></element>
 <element name="month"><text/></element>
 <element name="day"><text/></element>
</element>

And here is a version of the schema in RELAX NG's compact syntax:


element date { attribute type { text }?,
element year { text },
element month { text },
element day { text } }

No pointy brackets! The RELAX NG schema in XML syntax has the same meaning as the compact schema, but the compact schema is less than half the length. For those who develop schemas without the aid of a dedicated editor (not much exists yet for RELAX NG anyway), that's considerably less time tapping the keyboard. Still and all, the compact syntax has more advantages than just concision.

A Little Schema Analysis

In these simple examples, the equivalencies are apparent, but I'll mention a few things about them anyway. Take, for example, the declaration or definition of elements. In the DTD, an element type declaration takes the form:


<!ELEMENT year (#PCDATA)>

In RELAX NG's XML syntax, the same element definition looks like


<element name="year"><text/></element>

This definition is nicely trimmed down in the compact syntax:


element year { text }

Compact attributes, likewise, have gone on a diet. The attribute list declaration for the implied (optional) attribute type looks like this in the DTD:


<!ATTLIST date type CDATA #IMPLIED>

RELAX NG makes an attribute definition like so:


<optional>
 <attribute name="type"/>
</optional>

Which is equivalent to


<optional>
 <attribute name="type"><text/></attribute>
</optional>

The type of the value (was CDATA, now <text/>) is assumed to be text when absent in the XML syntax. The placement of the attribute definition in RELAX NG follows the structure of the XML document instance, which is one reason why the syntax of RELAX NG is rather intuitive. Unlike the RELAX NG XML syntax, the compact definition of the attribute


attribute type { text }?

must use the text token. The ? operator (zero or one, <optional>) descends from regular expression notation by way of DTDs, as do the operators * (zero or more, <zeroOrMore>) and + (one or more, <oneOrMore>). The comma (,), sometimes called the sequence operator, requires the patterns it joins to appear in the order given; a pattern with no occurrence operator must appear exactly once. (Only ? makes sense as an operator for attributes because XML only allows one specification of any given attribute in a start tag.)
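
To see the occurrence operators applied to elements, here is a small sketch. The dates, date, and note elements and the source attribute are hypothetical names invented for this illustration; they are not part of the date schema above:

# Hypothetical schema, for illustration only
element dates {
 attribute source { text }?,
 element date { text }+,
 element note { text }*
}

Under these assumptions, a dates element may carry a source attribute, must contain at least one date child, and may end with any number of note children, in that order.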

Processing Compact Syntax Schema

To translate a compact-syntax schema to XML syntax, use James Clark's Trang. Assuming that you have downloaded both Jing and Trang, have a Java runtime environment in place, and have placed the JAR files in the path and classpath (see instructions on this for Unix or Windows if you are shaky on such terms), you can convert compact syntax to XML with the following command:


java -jar trang.jar date.rnc date.rng

Trang requires two arguments: an input and output file. File formats are inferred by Trang according to the file suffixes (rng for XML syntax, rnc for compact). You can override this behavior with the -i option (rnc or rng) for input files and the -o option for output files (rng or dtd). As you can see, in addition to XML format, you can also specify DTD output from Trang. (Incidentally, you can also use DTDinst to translate a DTD to a schema in RELAX NG XML syntax.) Further, you can indicate Trang's output encoding with the -e option (for example, -e ISO-8859-1); if the -e option is not present, the output file is written in UTF-8 by default. Trang, by the way, automatically inserts the namespace declaration for RELAX NG itself.

With Jing, you can conveniently validate an instance against a compact schema directly, without transforming it to XML syntax, by using the -c option:


java -jar jing.jar -c date.rnc date.xml

Extending the Example

There are a few additional features worth highlighting. The following compact schema contains a definition for a default namespace for elements in an instance, adds a comment (the line prepended with #), and changes the content model of the child elements of date:


default namespace = "http://www.example.com/date"
# RELAX NG schema for a date
element date { attribute type { text },
( element year { text } &
element month { text } &
element day { text } ) }

Notice that the ? operator was dropped from the attribute definition, so the type attribute is now required. The ampersand (&) indicates that adjoining elements are interleaved, that is, these elements can appear in any order in an instance. Parentheses enclose these element definitions. When the schema is translated, the compact comment will appear as a normal XML comment:


<!-- RELAX NG schema for a date -->

The default namespace for the instance will become an ns attribute in the document element of the schema:


<element name="date" ns="http://www.example.com/date"
xmlns="http://relaxng.org/ns/structure/1.0">

In RELAX NG, any element which contains a pattern may also serve as a document element (for example, <grammar>, <element>, and <choice>), so you're not limited to a single element.
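
Sequence and interleave can also be mixed in one content model, as long as parentheses keep the two operators apart. The following sketch is a hypothetical variant of the schema above (without the default namespace, to keep it short) in which year must come first, while month and day may follow in either order:

# Hypothetical variant: year first, then month and day in either order
element date { attribute type { text },
 element year { text },
 ( element month { text } & element day { text } ) }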

XML Schema Datatypes

So far the elements and attributes in the examples have only used text content. RELAX NG, as a matter of fact, has only two built-in datatypes: token and string (which is like text). You can also use XML Schema datatypes.
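
As a rough sketch of the two built-in types (a hypothetical variant of the date schema, not one of the examples discussed so far), the following needs no datatype library declaration at all:

# Hypothetical variant using only the built-in datatypes
element date {
 # string compares the value exactly, whitespace included
 attribute type { string "ISO8601" },
 # token accepts any text; whitespace is normalized when values are compared
 element year { token },
 element month { token },
 element day { token }
}

A bare literal such as "ISO8601" is treated as a token by default, so the explicit string keyword matters only when whitespace is significant.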

In the following compact schema, you can see the datatypes keyword, which declares the value of a datatypeLibrary attribute (http://www.w3.org/2001/XMLSchema-datatypes) and a prefix for the QNames of the datatypes (xsd):


# Using XML Schema datatypes
namespace dt = "http://www.example.com/date"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

## An ISO 8601, US, or European date format
element dt:date { attribute type {"ISO8601" | "US" | "Euro"},
( element dt:day { xsd:string { pattern = "\d{2}" } } &
element dt:month { attribute days { "28" | "29" | "30" | "31" }?,
xsd:string { pattern = "\d{2}" } } &
element dt:year { xsd:string { pattern = "\d{4}" } } )
}

You can specify the datatype library and prefix explicitly as shown, or you can omit this line and allow Trang to insert it automatically during translation. Either way, the datatypeLibrary attribute will appear in the document element:


<element name="dt:date"
xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
xmlns:dt="http://www.example.com/date"
xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

The element <dt:month>, like <dt:day> and <dt:year>, is of type xsd:string. The pattern parameter specifies a regular expression, further constraining the allowed content of <dt:month> to two digits. The optional attribute days provides a choice of literals as values. For example, the value of days in an instance may be one of 28, 29, 30, or 31. Other values are invalid.


<element name="dt:month">
 <optional>
  <attribute name="days">
   <choice>
    <value>28</value>
    <value>29</value>
    <value>30</value>
    <value>31</value>
   </choice>
  </attribute>
 </optional>
 <data type="string">
  <param name="pattern">\d{2}</param>
 </data>
</element>

The compact dt:date schema above also shows how to add an annotation element (<a:documentation>) with the double hash (##). This annotation comes from the RELAX NG compatibility spec and, once translated, looks like


<a:documentation>An ISO 8601, US, or European date
 format</a:documentation>

You can also specify annotations with square brackets in the compact form, like


namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

[ a:documentation [ "An ISO 8601, US, or European date format" ] ]

When you use this form, you must declare a namespace and prefix for the annotation element. You can insert elements and attributes as annotations from any namespace you like -- say, XHTML -- as long as the namespace is declared and the bracketed syntax is used. Default attribute values may also be represented as


namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

element today {

 [ a:defaultValue = "2002" ]
 attribute year { text },

 [ a:defaultValue = "06" ]
 attribute month { text },

 [ a:defaultValue = "01" ]
 attribute day { text },

 empty }

Notice the empty token at the end of the content model. As you can probably guess, this keyword signifies that the element today is an empty element. This syntax is analogous to the DTD syntax:


<!ELEMENT today EMPTY>
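
Returning to datatypes for a moment: pattern facets on xsd:string are not the only way to constrain the date fields. As a hedged sketch, here is a hypothetical variant of the date schema (without the dt namespace, and not one of the schemas discussed above) that uses other XML Schema datatypes and range facets instead:

# Hypothetical variant: typed fields instead of string patterns
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

element date { attribute type { "ISO8601" | "US" | "Euro" },
 ( element year { xsd:gYear } &
   element month { xsd:integer { minInclusive = "1" maxInclusive = "12" } } &
   element day { xsd:integer { minInclusive = "1" maxInclusive = "31" } } ) }

The trade-off is that this version checks ranges but is looser about form: a month written as 6 is accepted as well as 06, which the two-digit pattern would reject.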

Going Context Free

I mentioned earlier in the article that the compact syntax can look like a context-free grammar. The following example uses a start symbol and other symbols that serve as terminals and non-terminals. For example, the symbol year, on the left side of the equals sign, may be considered a non-terminal, and the element definition on the right side, a terminal:


# RELAX NG schema for a date

start = date

date = element date { attribute type { text },
 (year & month & day), limits*}

year = element year { text }
month = element month { text }
day = element day { text }

include "limits.rnc"

The following instance is valid with regard to the foregoing:


<?xml version="1.0" encoding="UTF-8"?>
<date type="US">
 <month>June</month>
 <day>1</day>
 <year>2002</year>
 <limits days="30"/>
</date>

When translated, this example creates a different RELAX NG schema than the ones shown previously, producing a <grammar> and <start> element and several <define> elements, as seen in this incomplete fragment:


<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- RELAX NG schema for a date -->
  <start>
    <ref name="date"/>
  </start>
  <define name="date">
    <element name="date">
      <attribute name="type"/>
      <interleave>
        <ref name="year"/>
        <ref name="month"/>
        <ref name="day"/>
      </interleave>
      <zeroOrMore>
        <ref name="limits"/>
      </zeroOrMore>
    </element>
  </define>

A <grammar> element is a container for definitions. The <start> element indicates the document element for an instance, just as a document type declaration does. The <define> elements contain patterns which can be referenced by name (with a <ref> element) and therefore easily reused.
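
To illustrate the reuse that named patterns allow, here is a minimal sketch in compact syntax. The period, from, and to elements and the ymd pattern are hypothetical names invented for this illustration:

# Hypothetical schema: a period with a start date and an end date
start = element period { from, to }

from = element from { ymd }
to = element to { ymd }

# Defined once, referenced twice
ymd = year, month, day

year = element year { text }
month = element month { text }
day = element day { text }

Each name on the left of an equals sign would become a <define> when translated, and each use of that name on the right would become a <ref>.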

Back in the compact schema, a symbol for the limits pattern (modified with *) was added to the end of the content model for date, but where is it defined? It's defined in the included schema limits.rnc (see the last line of the last compact example), which looks like


# Limits for year, months, and days
limits =
 element limits {
  attribute years { text }?,
  attribute months { text }?,
  attribute days { text }?
 }

When processed, the included compact schema is translated into RELAX NG XML syntax as well. The resulting filename, limits.rng, is inferred from limits.rnc. The included pattern contains a definition for the limits element, which may contain up to three optional attributes. The absence of element children indicates that its content is empty.

Conclusion

It would take several more articles to cover all aspects of RELAX NG in fair detail. This article has only touched lightly on its compact syntax and some of the more commonly used structures of the language. I have neglected some interesting things: for example, lists, name classes, merging grammars, and combining definitions. If you've gotten behind the wheel and tested these examples for yourself, you likely have a good feel for just how easy RELAX NG's compact syntax is to learn and use.