Validation by Instance
August 28, 2002
Most people these days develop XML documents and schemas with a visual editor of some sort, perhaps Altova's XML Spy, Tibco's TurboXML, xmlHack from SysOnyx, or Oxygen. Some even use several editors on a single project, depending on the strengths of each.
Others prefer to work closer to the bone. I usually develop my schemas and instances by hand, using the vi editor, along with other Unix utilities (actually, I use Cygwin on a Windows 2000 box). I don't want to make more work for myself, but I prefer to use free, open source tools that allow me to make low-level changes that suit my needs. If you prefer to work this way, you should enjoy this piece.
In this article, I will explore how you can translate an XML document into a Document Type Definition (DTD), then into a RELAX NG schema, and finally into a W3C XML Schema (WXS) schema, in that order. I'll do this with the aid of several open source tools, and I'll also cover a way to validate the original XML instance against the various schemas.
The tools are all Java-based. To get things to work, you will need to have version 1.2 of Java or later installed on your system, have your path and classpath variables set correctly, and be ready to download and install several free tools. I used Java 2 v1.4 when testing the examples in the article. When I use the word install in relation to a JAR file, I mean that it is somewhere on your file system and is within reach of the classpath.
All the schema, instance, and batch files mentioned in this article are stored in a ZIP archive that is available for download.
Translating an XML Document into a DTD
Consider a simple XML document that describes the date of an event in several formats:
<?xml version="1.0" encoding="UTF-8"?>
<event>
  <description>Final sale of property.</description>
  <date type="ISO">
    <year>2002</year>
    <month>08</month>
    <day>28</day>
  </date>
  <date type="Euro">
    <day>28</day>
    <month>August</month>
    <year>2002</year>
  </date>
  <date type="US">
    <month>August</month>
    <day>28</day>
    <year>2002</year>
  </date>
</event>
To translate the XML document into a DTD, I'll use Michael Kay's DTDGenerator. Originally, DTDGenerator was part of the Saxon XSLT processor, but now it is separate. At just 17 KB, it's a pretty small download. DTDGenerator does a fair amount of work for you, but it doesn't produce parameter entities, notation declarations, or entity declarations. It's also not namespace-aware, but DTDs aren't inherently aware of namespaces or qualified names anyway.
With dtdgen.jar in your classpath, enter the following command line:
java -cp dtdgen.jar DTDGenerator event.xml > event.dtd
This command produces the following output, redirecting it to the file event.dtd:
<!ELEMENT date ( day | month | year )* >
<!ATTLIST date type ( Euro | ISO | US ) #REQUIRED >
<!ELEMENT day ( #PCDATA ) >
<!ELEMENT description ( #PCDATA ) >
<!ELEMENT event ( description, date+ ) >
<!ELEMENT month ( #PCDATA ) >
<!ELEMENT year ( #PCDATA ) >
Of course, this is only one possible DTD for the data model in event.xml. DTDGenerator makes educated guesses about the content models it sees in an instance. The result may not be what you intend, but at least you are several rungs up the ladder.
There are a few things to note. First, the element type declarations are listed in alphabetical order, not in the order of appearance in the instance. The content model for the date element allows any number of day, month, or year elements in any order, because the order of these children varies in the instance.
There is only one description element, so the content model in the DTD reflects that. Likewise, because the event element contains more than one date element, the content model allows one or more (date+).
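To make the idea of guessing content models from an instance concrete, here is a minimal sketch in Python (chosen only for brevity). It is not DTDGenerator's actual algorithm: the function names and the three simplified rules (text-only leaf, one fixed sequence, or a repeated choice when the order varies) are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import groupby

# The event instance from this article, without the XML declaration.
EVENT_XML = """
<event>
  <description>Final sale of property.</description>
  <date type="ISO"><year>2002</year><month>08</month><day>28</day></date>
  <date type="Euro"><day>28</day><month>August</month><year>2002</year></date>
  <date type="US"><month>August</month><day>28</day><year>2002</year></date>
</event>
"""


def observe_children(xml_text):
    """Map each element name to the set of child-name sequences seen for it."""
    seen = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        seen[elem.tag].add(tuple(child.tag for child in elem))
    return seen


def guess_content_model(sequences):
    """Guess a DTD content model from observed child sequences (simplified)."""
    # Rule 1 (assumption): no child elements anywhere means text content.
    if all(len(seq) == 0 for seq in sequences):
        return "( #PCDATA )"
    # Rule 2 (assumption): one consistent sequence becomes a sequence model,
    # with consecutive repeats of a name collapsed to name+.
    if len(sequences) == 1:
        (seq,) = sequences
        parts = []
        for name, run in groupby(seq):
            parts.append(name + "+" if len(list(run)) > 1 else name)
        return "( " + ", ".join(parts) + " )"
    # Rule 3 (assumption): varying order becomes a repeated choice.
    names = sorted({name for seq in sequences for name in seq})
    return "( " + " | ".join(names) + " )*"


def infer_dtd_models(xml_text):
    """Return {element name: guessed content model}, in alphabetical order."""
    observed = observe_children(xml_text)
    return {name: guess_content_model(observed[name]) for name in sorted(observed)}


for name, model in infer_dtd_models(EVENT_XML).items():
    print(f"<!ELEMENT {name} {model} >")
```

For this instance, the sketch arrives at the same element content models as the DTD shown earlier. It ignores attributes, mixed content, and optional children, all of which a real generator has to handle.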
The type attribute has enumerated values only because I tweaked some fields in the source code (DTDGenerator.java) and recompiled. MIN_ENUMERATION_INSTANCES represents the minimum number of times an attribute must appear for it to be an enumeration type. Also, an attribute is considered an enumeration only if the number of instances divided by the number of distinct values is greater than or equal to MIN_ENUMERATION_RATIO. Normally, the value of MIN_ENUMERATION_INSTANCES is 10 (I switched it to 0), and the value of MIN_ENUMERATION_RATIO is 3 (now 1). These changes let me control what is considered an enumeration to suit the document. This is why I like working with open source code: It allows me to make changes to the code to meet specific needs.
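The two thresholds can be sketched in a few lines of Python. This covers only the two tests described above; the function name is hypothetical, and the real DTDGenerator source may apply further conditions that I have left out.

```python
def is_enumeration(values, min_instances=10, min_ratio=3):
    """Apply the two threshold tests to the observed values of one attribute.

    min_instances and min_ratio stand in for MIN_ENUMERATION_INSTANCES and
    MIN_ENUMERATION_RATIO; the defaults are the values quoted in the article.
    """
    instances = len(values)
    distinct = len(set(values))
    if distinct == 0:
        return False  # attribute never observed
    return instances >= min_instances and instances / distinct >= min_ratio


# The type attribute appears three times in event.xml, with three distinct values.
type_values = ["ISO", "Euro", "US"]

print(is_enumeration(type_values))                                # False
print(is_enumeration(type_values, min_instances=0, min_ratio=1))  # True
```

With the default thresholds, three occurrences are too few and the ratio of 1 is too low, so type would be declared as a plain attribute. With the thresholds lowered to 0 and 1, both tests pass.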
Now that we have a DTD, I'll use another tool, DTDinst, to convert it to a RELAX NG schema.
Translating the DTD to RELAX NG
James Clark's DTDinst is a Java tool that translates a DTD either into its own XML vocabulary or into a schema in RELAX NG's XML syntax. After downloading and installing dtdinst.jar, you can issue the following command to translate a DTD into RELAX NG:
java -jar dtdinst.jar -i -r rng event.dtd
This command uses the -jar option because the JAR manifest contains the line:
Main-Class: com/thaiopensource/xml/dtd/app/Driver
In other words, the manifest tells the Java interpreter where to find the class that contains the main() method, so you don't have to inform the Java interpreter of that fact directly.
The first argument, the -i option, tells DTDinst to write the RELAX NG attribute elements inline, as children of containing element definitions, rather than as children of define elements. Next, the -r option specifies the directory where the RELAX NG schema should be written. If the directory you name does not exist, it will be created for you. The output file will have the same file name as the DTD, but it will have an rng suffix. The last argument, event.dtd, is of course the DTD that I generated earlier.
The resulting RELAX NG schema event.rng (in the rng directory) looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated by DTDinst version 2002-07-24. -->
<grammar datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
         xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="date">
    <element name="date">
      <attribute name="type">
        <choice>
          <value>Euro</value>
          <value>ISO</value>
          <value>US</value>
        </choice>
      </attribute>
      <zeroOrMore>
        <choice>
          <ref name="day"/>
          <ref name="month"/>
          <ref name="year"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
  <define name="day">
    <element name="day">
      <text/>
    </element>
  </define>
  <define name="description">
    <element name="description">
      <text/>
    </element>
  </define>
  <define name="event">
    <element name="event">
      <ref name="description"/>
      <oneOrMore>
        <ref name="date"/>
      </oneOrMore>
    </element>
  </define>
  <define name="month">
    <element name="month">
      <text/>
    </element>
  </define>
  <define name="year">
    <element name="year">
      <text/>
    </element>
  </define>
  <start>
    <choice>
      <ref name="event"/>
    </choice>
  </start>
</grammar>
As you can tell, a RELAX NG schema is easy to grasp. For example, in the date definition, you can easily see that the date element's required attribute type may have one of three possible values, Euro, ISO, or US. Also, the text element is a dead ringer for #PCDATA. Need I go on?
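Because the schema is itself XML, it is also easy to inspect with ordinary XML tools. The following Python sketch, which is only an illustration and not part of any of the tools discussed here, pulls the permitted values of the type attribute out of a copy of the date definition. The function name and the embedded fragment are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the path expressions below.
RNG_NS = {"rng": "http://relaxng.org/ns/structure/1.0"}

# A copy of the date definition's attribute pattern from event.rng.
DATE_DEFINE = """
<define name="date" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="date">
    <attribute name="type">
      <choice>
        <value>Euro</value>
        <value>ISO</value>
        <value>US</value>
      </choice>
    </attribute>
  </element>
</define>
"""


def allowed_values(define_xml, attribute_name):
    """List the value patterns offered as a choice for the named attribute."""
    root = ET.fromstring(define_xml)
    path = f".//rng:attribute[@name='{attribute_name}']/rng:choice/rng:value"
    return [value.text for value in root.findall(path, RNG_NS)]


print(allowed_values(DATE_DEFINE, "type"))  # ['Euro', 'ISO', 'US']
```

The sketch only handles an attribute whose content is a single choice of value elements, which is the shape DTDinst produced here.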
DTDinst generates a grammar element which is a container for define elements. A grammar element must also contain a start element which indicates the document element for the instance. I think the choice element surrounding the reference to the event definition is unnecessary, so I will delete it in my own version (see new-event.rng).
The schema could be rewritten without define elements and references to those definitions (the ref elements), but the schema is sufficient as it stands. Now I'll take the translation process a step further by adding WXS to our list.
Translating RELAX NG to XML Schema
Trang is another tool written by James Clark. It can take as input a schema written in RELAX NG XML or compact syntax; it can produce RELAX NG XML, RELAX NG compact syntax, DTD, and WXS as output. After downloading Trang (which includes a JAR file for Jing, a RELAX NG validator), unzipping and installing it, you can convert the RELAX NG schema back to a DTD, new-event.dtd, with this command:
java -jar trang.jar rng/event.rng new-event.dtd
The DTD output of Trang is nearly identical to the one produced by DTDGenerator. If the file suffixes used with Trang don't match the implied content of the file, you can also specify the input format with the -i option and the output format with the -o option. You can name either rng or rnc as input, and one of rng, rnc, dtd, or xsd as output. For example, using -i and -o you can issue the preceding command as:
java -jar trang.jar -i rng -o dtd rng/event.rng new-event.dtd
You can also produce XML Schema output with the command:
java -jar trang.jar rng/event.rng event.xsd
Trang's WXS output is still in the alpha stage, so there may be some changes in the future. The WXS output from event.rng follows:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified" version="1.0">
  <xs:element name="date">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="day"/>
        <xs:element ref="month"/>
        <xs:element ref="year"/>
      </xs:choice>
      <xs:attribute name="type" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="Euro"/>
            <xs:enumeration value="ISO"/>
            <xs:enumeration value="US"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="day">
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="description">
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="event">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="description"/>
        <xs:element maxOccurs="unbounded" ref="date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="month">
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="year">
    <xs:complexType mixed="true"/>
  </xs:element>
</xs:schema>
The date element might have been defined as a regular complex type rather than as an anonymous type, but this works nonetheless. Also, this construct occurs four times in the schema:
<xs:element name="day">
  <xs:complexType mixed="true"/>
</xs:element>
This says that the content of the day element (and the content of description, month, and year as well) implicitly allows text node children only. This is a little unclear at first glance. In my version, I changed the element content by hand as follows in all four instances (see it in new-event.xsd):
<xs:element name="day" type="xs:string"/>
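I made this change by hand, but the same rewrite could be scripted. The Python sketch below is one hypothetical way to do it and is not something Trang offers: it replaces every empty mixed complex type with a type="xs:string" attribute. The function name and the trimmed sample schema are assumptions for the example.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Keep the xs prefix on output so that the value xs:string stays meaningful.
ET.register_namespace("xs", XS)

# A trimmed-down schema with one text-only element and one structured element.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="day">
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="event">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="day"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def simplify_text_elements(xsd_text):
    """Rewrite elements whose only content is an empty mixed complexType."""
    root = ET.fromstring(xsd_text)
    # Materialize the list first because the tree is modified in the loop.
    for element in list(root.iter(f"{{{XS}}}element")):
        complex_type = element.find(f"{{{XS}}}complexType")
        if (complex_type is not None
                and complex_type.get("mixed") == "true"
                and len(complex_type) == 0):
            element.remove(complex_type)
            element.set("type", "xs:string")
    return ET.tostring(root, encoding="unicode")


print(simplify_text_elements(SAMPLE_XSD))
```

Only the day declaration is rewritten here; the event declaration keeps its complex type because that type is not mixed and has children.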
Now that I have derived schemas from an XML document in DTD, RELAX NG, and W3C XML Schema, I'll attempt to validate the original instance against all three.
Validating the Instance
There are a number of validators to choose from, but I'll use Sun's Multi-Schema Validator because it can validate against schemas in all three formats: DTD, RELAX NG, and W3C XML Schema. Assuming that you have downloaded MSV and that all the JARs are installed (there are four), here is the command for performing the validation for the DTD:
java -cp xerces.jar;xsdlib.jar;relaxngDatatype.jar;isorelax.jar -jar msv.jar event.dtd event.xml
If you are on the Windows platform, you can use a batch file I created to simplify the command (see msv.bat in the file archive). On Unix, separate the classpath entries with colons rather than semicolons.
To validate against the other schemas, replace event.dtd with the name of some other schema file. You can test all the schemas in the file archive if you like. They are all valid, though MSV issues a warning about elements that have the content:
<xs:complexType mixed="true"/>
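To give a feel for what the validator is checking, here is a hand-rolled Python sketch of the constraints that all three schemas share for this instance. It is in no way a substitute for MSV: the function name and the error messages are my own, and the rules are hard-coded rather than read from a schema.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"Euro", "ISO", "US"}    # enumerated values of date/@type
DATE_CHILDREN = {"day", "month", "year"}  # elements permitted inside date

EVENT_XML = """
<event>
  <description>Final sale of property.</description>
  <date type="ISO"><year>2002</year><month>08</month><day>28</day></date>
  <date type="Euro"><day>28</day><month>August</month><year>2002</year></date>
  <date type="US"><month>August</month><day>28</day><year>2002</year></date>
</event>
"""


def check_event(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(xml_text)
    if root.tag != "event":
        return [f"root element is {root.tag}, expected event"]
    problems = []
    children = [child.tag for child in root]
    # event must hold one description followed by one or more date elements.
    if (children[:1] != ["description"] or len(children) < 2
            or any(tag != "date" for tag in children[1:])):
        problems.append("event must contain description followed by date+")
    for date in root.iter("date"):
        # The type attribute is required and restricted to three values.
        if date.get("type") not in ALLOWED_TYPES:
            problems.append(f"date has invalid type {date.get('type')!r}")
        for child in date:
            if child.tag not in DATE_CHILDREN:
                problems.append(f"date contains unexpected element {child.tag}")
    return problems


print(check_event(EVENT_XML))  # []
```

A date element with type="UK", for example, would be reported as a problem, just as it would fail validation against the enumerated type attribute in any of the three schemas.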
Conclusion
If you work on the Windows platform, I have also written a batch file that performs all the translations (from instance, to DTD, to RELAX NG, and finally to W3C XML Schema) and then validates against them in one simple step. You will find this batch file, validate.bat, in the archive. It calls four other batch files in the same directory (dtd.bat, rng.bat, xsd.bat, and msv.bat).
Using the tools I've described here, you can perform the conversions and validate against the resulting schemas in a matter of seconds. You may still prefer to use a visual editor, but I believe that learning and using these tools can save you time and money.