RELAX NG's Compact Syntax
June 19, 2002
Working with XML Schema is like driving a limousine. It's true that it has some nice appointments (datatypes come to mind), but the wheelbase is a bit on the long side, making it difficult to turn corners easily, and I am inclined to let somebody else do the driving for me. Using RELAX NG, on the other hand, is like driving a sports car. It holds corners amazingly well, and I am much less interested in handing over the keys to anyone. You may prefer to drive a limo over a sports car. But I'll take the sports car any day.
You are probably familiar with XML Schema and RELAX NG. Both are schema languages for XML. The former was released by the W3C in May 2001, while the latter was released in December 2001 by OASIS. RELAX NG, which was developed by a small technical committee led by James Clark, merges Murata Makoto's RELAX and Clark's TREX. It is a simple yet elegant evolution of the DTD, and it is easy to learn. It is modular in design. The main core of RELAX NG is focused on validation alone and doesn't modify the infoset in the process of validation; in other words, no PSVI. RELAX NG is also part of an ISO draft standard, ISO/IEC DIS 19757-2.
RELAX NG schemas were originally written in XML, but there's also a compact, non-XML syntax. While this article doesn't contain an exhaustive review of all the features of RELAX NG, it will give you a good idea of how to use the main parts of the compact syntax. If you don't know much about RELAX NG, I suggest that you read Eric van der Vlist's RELAX NG Compared before finishing this article.
I think you'll find the compact syntax quite readable and easy to learn. In some respects, a RELAX NG schema in compact form looks like a context-free grammar, which provides a familiar view of the language, is readily comprehensible, and is amenable to parsing. Also, don't be surprised if the compact syntax bears a resemblance to the syntax of XDuce and XQuery's computed element and attribute constructors.
A Simple Example
The first exhibit is a well-formed XML document which represents an ISO 8601 date:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE date SYSTEM "date.dtd">
<date type="ISO8601">
  <year>2002</year>
  <month>06</month>
  <day>01</day>
</date>
This document is an instance of the following DTD:
<!ELEMENT date (year, month, day)>
<!ATTLIST date type CDATA #IMPLIED>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
It is also valid with regard to the following RELAX NG schema in XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<element name="date" xmlns="http://relaxng.org/ns/structure/1.0">
  <optional>
    <attribute name="type"/>
  </optional>
  <element name="year"><text/></element>
  <element name="month"><text/></element>
  <element name="day"><text/></element>
</element>
And here is a version of the schema in RELAX NG's compact syntax:
element date {
  attribute type { text }?,
  element year { text },
  element month { text },
  element day { text }
}
No pointy brackets! The RELAX NG schema in XML syntax has the same meaning as the compact schema, but the compact schema is less than half the length. For those who develop schemas without the aid of a dedicated editor (not much exists yet for RELAX NG anyway), that's considerably less time tapping the keyboard. Still and all, the compact syntax has more advantages than just concision.
A Little Schema Analysis
In these simple examples, the equivalencies are apparent, but I'll mention a few things about them anyway. Take, for example, the declaration or definition of elements. In the DTD, an element type declaration takes the form:
<!ELEMENT year (#PCDATA)>
In RELAX NG's XML syntax, the same element definition looks like
<element name="year"><text/></element>
This definition is nicely trimmed down in the compact syntax:
element year { text }
Compact attributes, likewise, have gone on a diet. The attribute list declaration for the implied (optional) attribute type looks like this in the DTD:
<!ATTLIST date type CDATA #IMPLIED>
RELAX NG makes an attribute definition like so:
<optional> <attribute name="type"/> </optional>
Which is equivalent to
<optional> <attribute name="type"><text/></attribute> </optional>
The type of the value (was CDATA, now <text/>) is assumed to be text when absent in the XML syntax. The placement of the attribute definition in RELAX NG follows the structure of the XML document instance, which is one reason why the syntax of RELAX NG is rather intuitive. Unlike the RELAX NG XML syntax, the compact definition of the attribute
attribute type { text }?
must use the text token. The ? repetition operator (zero or one, <optional>) descends from regular expression notation by way of DTDs, as do the operators * (<zeroOrMore>) and + (<oneOrMore>). The comma (,), sometimes called the sequence operator, joins patterns in sequence; a pattern with no repetition operator must appear exactly once. (Only ? makes sense as an operator for attributes because XML only allows one specification of any given attribute in a start tag.)
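One way to read the repetition operators is as count ranges. The following Python sketch is only an illustration of that reading and is not taken from any RELAX NG tool; the `OCCURRENCE` table and the `count_allowed` function are hypothetical names introduced here.

```python
# Illustrative only: read each repetition operator as a (min, max) range.
# None means "no upper bound". These names are hypothetical.
OCCURRENCE = {
    "": (1, 1),      # no operator: exactly one
    "?": (0, 1),     # <optional>
    "*": (0, None),  # <zeroOrMore>
    "+": (1, None),  # <oneOrMore>
}


def count_allowed(operator: str, count: int) -> bool:
    """Return True if `count` occurrences satisfy the given operator."""
    minimum, maximum = OCCURRENCE[operator]
    if count < minimum:
        return False
    return maximum is None or count <= maximum
```

Under this reading, an attribute can only ever occur zero times or once in a start tag, which is why ? is the only operator with a distinct effect on attributes.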
Processing Compact Syntax Schema
To translate a compact-syntax schema to XML syntax, use James Clark's Trang. Assuming that you have downloaded both Jing and Trang, have a Java runtime environment in place, and have placed the JAR files in the path and classpath (see instructions on this for Unix or Windows if you are shaky on such terms), you can convert compact syntax to XML with the following command:
java -jar trang.jar date.rnc date.rng
Trang requires two arguments: an input and output file. File formats are inferred by Trang according to the file suffixes (rng for XML syntax, rnc for compact). You can override this behavior with the -i option (rnc or rng) for input files and the -o option for output files (rng or dtd). As you can see, in addition to XML format, you can also specify DTD output from Trang. (Incidentally, you can also use DTDinst to translate a DTD to a schema in RELAX NG XML syntax.) Further, you can indicate Trang's output encoding with the -e option (for example, -e ISO-8859-1); if the -e option is not present, the output file is written in UTF-8 by default. Trang, by the way, automatically inserts the namespace declaration for RELAX NG itself.
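The suffix-based inference described above can be pictured with a short Python sketch. This is not Trang's code, only a guess at the shape of the logic; `SUFFIX_FORMATS` and `infer_format` are hypothetical names, and the `override` argument stands in for an explicit format option.

```python
from pathlib import Path
from typing import Optional

# Suffixes named in the text; this table is an assumption, not Trang's own.
SUFFIX_FORMATS = {".rng": "rng", ".rnc": "rnc", ".dtd": "dtd"}


def infer_format(filename: str, override: Optional[str] = None) -> str:
    """Pick a schema format from an explicit override or the file suffix."""
    if override is not None:
        # An explicit choice wins over whatever the suffix suggests.
        return override
    suffix = Path(filename).suffix.lower()
    if suffix not in SUFFIX_FORMATS:
        raise ValueError(f"cannot infer format from {filename!r}")
    return SUFFIX_FORMATS[suffix]
```

With this sketch, `infer_format("date.rnc")` yields the compact format, while a file with an unrecognized suffix needs the override.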
With Jing, you can conveniently validate an instance against a compact schema directly, without transforming it to XML syntax, by using the -c option:
java -jar jing.jar -c date.rnc date.xml
Extending the Example
There are a few additional features worth highlighting. The following compact schema contains a definition for a default namespace for elements in an instance, adds a comment (the line prepended with #), and changes the content model of the child elements of date:
default namespace = "http://www.example.com/date"

# RELAX NG schema for a date
element date {
  attribute type { text },
  ( element year { text }
  & element month { text }
  & element day { text } )
}
Notice that the ? operator was dropped from the attribute definition, so the type attribute is now required. The ampersand (&) indicates that adjoining elements are interleaved, that is, these elements can appear in any order in an instance. Parentheses enclose these element definitions. When the schema is translated, the compact comment will appear as a normal XML comment:
<!-- RELAX NG schema for a date -->
The default namespace for the instance will become an ns attribute in the document element of the schema:
<element name="date" ns="http://www.example.com/date" xmlns="http://relaxng.org/ns/structure/1.0">
In RELAX NG, any element which contains a pattern may also serve as a document element (for example, <grammar>, <element>, and <choice>), so you're not limited to a single element.
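To make the interleave (&) operator from the extended schema concrete, here is a rough Python sketch that checks the same structure by hand: a required type attribute and the three children in any order, all in the default namespace. It is not a RELAX NG validator, and `matches_date_schema` is a hypothetical helper written only for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.example.com/date"
REQUIRED_CHILDREN = ["year", "month", "day"]


def matches_date_schema(xml_text: str) -> bool:
    """Hand-rolled structural check; a sketch, not a real validator."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{NS}}}date":
        return False
    # The type attribute is required and no other attribute is allowed.
    if set(root.attrib) != {"type"}:
        return False
    # Interleave: each child exactly once, in any order.
    expected = sorted(f"{{{NS}}}{name}" for name in REQUIRED_CHILDREN)
    if sorted(child.tag for child in root) != expected:
        return False
    # Each child holds text only, so it has no element children.
    return all(len(child) == 0 for child in root)
```

Sorting the child names before comparing them is what makes the check order-insensitive, which mirrors the "any order" meaning of interleave for this simple case.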
XML Schema Datatypes
So far the elements and attributes in the examples have only used text content. RELAX NG, as a matter of fact, has only two built-in datatypes: token and string (which is like text). You can also use XML Schema datatypes. In the following compact schema, you can see the datatypes keyword, which declares the value of a datatypeLibrary attribute (http://www.w3.org/2001/XMLSchema-datatypes) and a prefix for the QNames of the datatypes (xsd):
# Using XML Schema datatypes
namespace dt = "http://www.example.com/date"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

## An ISO 8601, US, or European date format
element dt:date {
  attribute type { "ISO8601" | "US" | "Euro" },
  ( element dt:day { xsd:string { pattern = "\d{2}" } }
  & element dt:month {
      attribute days { "28" | "29" | "30" | "31" }?,
      xsd:string { pattern = "\d{2}" }
    }
  & element dt:year { xsd:string { pattern = "\d{4}" } } )
}
You can specify the datatype library and prefix explicitly as shown, or you can omit this line and allow Trang to insert it automatically during translation. Either way, the datatypeLibrary attribute will appear in the document element:
<element name="dt:date"
    xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
    xmlns:dt="http://www.example.com/date"
    xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
The element <dt:month> (like <dt:day> and <dt:year>) is of type xsd:string. The pattern parameter specifies a regular expression, further constraining the allowed content for the element to two digits. The optional attribute days provides a choice of literals as values. For example, the value of days in an instance may be one of 28, 29, 30, or 31. Other values are invalid.
<element name="dt:month">
  <optional>
    <attribute name="days">
      <choice>
        <value>28</value>
        <value>29</value>
        <value>30</value>
        <value>31</value>
      </choice>
    </attribute>
  </optional>
  <data type="string">
    <param name="pattern">\d{2}</param>
  </data>
</element>
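The two constraints on dt:month, a pattern on the content and a choice of literals for days, can be mimicked in a few lines of Python. This is a sketch for intuition only, assuming that Python's `re.fullmatch` is a fair stand-in for the implicitly anchored pattern facet; `month_is_valid` is a hypothetical helper, not something a RELAX NG processor provides.

```python
import re
from typing import Optional

# The pattern parameter: exactly two digits, matched against the whole value.
TWO_DIGITS = re.compile(r"\d{2}")
# The choice of literal values allowed for the optional days attribute.
ALLOWED_DAYS = {"28", "29", "30", "31"}


def month_is_valid(content: str, days: Optional[str] = None) -> bool:
    """Check month content and the optional days attribute (sketch only)."""
    if days is not None and days not in ALLOWED_DAYS:
        return False
    return TWO_DIGITS.fullmatch(content) is not None
```

Note that the pattern constrains only the form of the value: a month of 99 still passes, because any two digits satisfy it.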
This example also shows how to add an annotation element (<a:documentation>) with the double hash (##). This annotation comes from the RELAX NG compatibility spec and, once translated, looks like
<a:documentation>An ISO 8601, US, or European date format</a:documentation>
You can also specify annotations with square brackets in the compact form, like
namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

[ a:documentation [ "An ISO 8601, US, or European date format" ] ]
When you use this form, you must declare a namespace and prefix for the annotation element. You can insert elements and attributes as annotations from any namespace you like -- say, XHTML -- as long as the namespace is declared and the bracketed syntax is used. Default attribute values may also be represented as
namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

element today {
  [ a:defaultValue = "2002" ] attribute year { text },
  [ a:defaultValue = "06" ] attribute month { text },
  [ a:defaultValue = "01" ] attribute day { text },
  empty
}
Notice the empty token at the end of the content model. As you can probably guess, this keyword signifies that the element today is an empty element. This syntax is analogous to the DTD syntax:
<!ELEMENT today EMPTY>
Going Context Free
I mentioned earlier in the article that the compact syntax can look like a context-free grammar. The following example uses a start symbol and other symbols that serve as terminals and non-terminals. For example, the symbol year, on the left side of the equals sign, may be considered a non-terminal, and the element definition on the right side, a terminal:
# RELAX NG schema for a date
start = date
date = element date {
  attribute type { text },
  (year & month & day),
  limits*
}
year = element year { text }
month = element month { text }
day = element day { text }
include "limits.rnc"
The following instance is valid with regard to the foregoing:
<?xml version="1.0" encoding="UTF-8"?>
<date type="US">
  <month>June</month>
  <day>1</day>
  <year>2002</year>
  <limits days="30"/>
</date>
When translated, this example creates a different RELAX NG schema than the ones shown previously, producing a <grammar> and <start> element and several <define> elements, as seen in this incomplete fragment:
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- RELAX NG schema for a date -->
  <start>
    <ref name="date"/>
  </start>
  <define name="date">
    <element name="date">
      <attribute name="type"/>
      <interleave>
        <ref name="year"/>
        <ref name="month"/>
        <ref name="day"/>
      </interleave>
      <zeroOrMore>
        <ref name="limits"/>
      </zeroOrMore>
    </element>
  </define>
A <grammar> element is a container for definitions. The <start> element indicates the document element for an instance, just as a document type declaration does. The <define> elements contain patterns which can be referenced by name (with a <ref> element) and therefore easily reused.
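The relationship between named definitions and references can be pictured as a lookup table. The Python sketch below is a deliberately reduced model, assuming a made-up tuple representation of patterns; it leaves out attributes, interleave, and the limits definition, and it would loop forever on recursive definitions. `DEFINITIONS`, `START`, and `expand` are hypothetical names.

```python
# Hypothetical miniature of a grammar: names map to patterns, and a
# ("ref", name) tuple stands in for <ref name="..."/>.
DEFINITIONS = {
    "date": ("element", "date",
             [("ref", "year"), ("ref", "month"), ("ref", "day")]),
    "year": ("element", "year", ["text"]),
    "month": ("element", "month", ["text"]),
    "day": ("element", "day", ["text"]),
}
START = ("ref", "date")


def expand(pattern):
    """Replace every reference with the pattern it names (sketch only)."""
    if pattern == "text":
        return "text"
    kind = pattern[0]
    if kind == "ref":
        # Look the name up in the table of definitions and keep expanding.
        return expand(DEFINITIONS[pattern[1]])
    if kind == "element":
        _, name, children = pattern
        return ("element", name, [expand(child) for child in children])
    raise ValueError(f"unknown pattern: {pattern!r}")
```

Calling `expand(START)` returns a single nested element pattern with no references left, which is roughly the shape of the earlier, non-grammar schemas.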
Back in the compact schema, a symbol for the limits pattern (modified with *) was added to the end of the content model for date, but where is it defined? It's defined in the included schema limits.rnc (see the last line of the last compact example), which looks like
# Limits for year, months, and days
limits = element limits {
  attribute years { text }?,
  attribute months { text }?,
  attribute days { text }?
}
When processed, the included compact schema is translated into RELAX NG XML syntax as well. The resulting filename, limits.rng, is inferred from limits.rnc.
The included pattern contains a definition for the limits element, which may contain up to three optional attributes. The absence of element children indicates that its content is empty.
Conclusion
It would take several more articles to cover all aspects of RELAX NG in fair detail. This article has only touched lightly on its compact syntax and some of the more commonly used structures of the language. I have neglected some interesting things: for example, lists, name classes, merging grammars, and combining definitions. If you've gotten behind the wheel and tested these examples for yourself, you likely have a good feel for just how easy RELAX NG's compact syntax is to learn and use.
Related Links
"The Design of RELAX NG," a paper by James Clark
RELAX NG 1.0 DTD compatibility specification
RELAX NG compact syntax specification
Jing, James Clark's RELAX NG processor (Java)
Trang, James Clark's RELAX NG compact syntax processor (Java)
Multi-schema Validator (MSV), Sun's schema validator (by Kawaguchi Kohsuke)