Validating XML with Schematron
November 22, 2000
Schematron is an XML schema language that can be used to validate XML documents. In this article I show how to use it for validation; I assume the reader is at least familiar with XML 1.0, DTDs, XSLT, and XPath.
The Need for Schemas
XML schemas are necessary for communicating the structure of an XML document type to a machine. For example, consider two XML fragments.
<vehicle name='Harley Davidson' type='motorcycle'>
  <wheel name='Front Tire'/>
  <wheel name='Rear Tire'/>
  <HeadLight name='Front Lamp'/>
  <kickstand/>
</vehicle>

<vehicle name='Mitsubishi 3000 GT' type='car'>
  <wheel name='Front Right Tire'/>
  <wheel name='Front Left Tire'/>
  <wheel name='Rear Right Tire'/>
  <wheel name='Rear Left Tire'/>
  <HeadLight name='Front Right lamp'/>
  <HeadLight name='Front Left lamp'/>
  <SunRoof/>
</vehicle>
A person can easily interpret and understand both XML instances from the words used to describe their components. A person can also verify whether the documents adhere to a set of conventions about how vehicle elements should be used. For example, a person can tell that this XML instance is invalid:
<vehicle name='Harley Davidson' type='motorcycle'>
  <wheel name='Front Tire'/>
  <SunRoof/>
</vehicle>
We know that a motorcycle typically has two wheels and doesn't have a sunroof. A piece of program logic, however, needs an XML schema against which it can validate XML instances.
XML validation is a crucial part of predictable and efficient processing of XML instances. Knowing the structure of an XML document saves the developer from writing unnecessary conditional application logic. Once a document is identified as belonging to a class of documents, many assumptions about its structure can be made.
Document Type Definitions (DTDs)
DTDs were the first standard mechanism for XML validation and, for all practical purposes, remain the standard one. They define the roles and structure of XML elements. DTDs are written in a syntax other than XML's and rely upon post-processing for validation. For simple XML schemas, DTDs are sufficient. However, DTDs lag behind the direction in which XML technologies are evolving: they don't support namespaces, and they use a non-XML syntax.
The most serious problem with DTDs is that they do not support namespaces, a critical flaw since namespaces are a very powerful aspect of XML. The inability to validate DTD-declared XML documents with namespaces prevents XML application developers from taking advantage of namespaces in their business logic.
Most XML technologies (such as RDF, XSLT, and XLink) and schema languages (such as RELAX, XML Schema, and SOX) are themselves expressed in XML. This uniformity helps make these technologies easy to learn, and it means developers are able to leverage existing XML tools. This places DTDs at a disadvantage because developers must learn an additional syntax in order to define their XML schemas. DTDs also have more severe restrictions.
DTDs are somewhat limited in their range of expression; therefore, they cannot be used to validate some XML document structures. Consider the following XML:
<TennisMatch tournament='US Open'>
  <Competition type='Doubles' gender='Female'>
    <Player name='Venus Williams'/>
    <Player name='Serena Williams'/>
    ....
    <Player name='Martina Hingis'/>
    <Player name='Lindsay Davenport'/>
  </Competition>
</TennisMatch>
A DTD cannot declare that a Competition element may contain only an even number of Player elements. Consider another example:
<shortStory author='AUTHOR1'>
  <character name='CHARACTER1'/>
  <character name='CHARACTER2'/>
</shortStory>

<anthology author='AUTHOR1'>
  <shortStory>
    <character name='CHARACTER1'/>
    <character name='CHARACTER2'/>
  </shortStory>
</anthology>
If one constraint on such a document is that a shortStory element may carry an author attribute only if it isn't the child of an anthology element, that constraint cannot be represented in a DTD.
These DTD handicaps aren't going unnoticed, and the W3C is presently developing an XML Schema language (currently a W3C Candidate Recommendation) that is more expressive and powerful than DTDs. The XML Schema language is an XML application and will likely become the standard way XML schemas are formally declared. However, we should take note of the REgular LAnguage description for XML (RELAX), an alternative XML schema language, developed by Murata Makoto, which has been submitted to the International Organization for Standardization (ISO) as a technical report. RELAX has been covered in previous XML.com articles. Until (and after) XML Schema is adopted as the standard for schema definitions, there are alternatives such as RELAX and Schematron. I've found Schematron to be the most promising of these.
Introducing Schematron
Schematron, created by Rick Jelliffe, defines a set of rules and checks that are applied to an XML instance. Schematron takes a unique approach to schemas in that it focuses on validating document instances instead of declaring a schema (as the other schema languages do).
Schematron relies almost entirely on XPath query patterns for defining these rules and checks. With just a subset of XPath, powerful XSLT stylesheets can be created to process very complex XML instances.
Before digging into Schematron, I'll demonstrate how XSLT can easily be used to validate XML instances. Let's go back to our previous example.
<shortStory author='AUTHOR1'>
  <character name='CHARACTER1'/>
  <character name='CHARACTER2'/>
</shortStory>

<anthology author='AUTHOR1'>
  <shortStory>
    <character name='CHARACTER1'/>
    <character name='CHARACTER2'/>
  </shortStory>
</anthology>
A template can be created that returns "Invalid XML" if a shortStory element has an author attribute when it's contained in an anthology element.
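<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Fires for every shortStory element in the instance -->
  <xsl:template match='shortStory'>
    <!-- An author attribute is not allowed when the parent is an anthology -->
    <xsl:if test='parent::anthology and @author'>
      Invalid XML
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>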
You can imagine other combinations of templates that validate more complex XML structures. This is essentially how Schematron works. It takes a Schematron schema definition (in XML) that describes the constraints. A Schematron XSLT stylesheet converts this to another stylesheet -- transforming an instance document with this resultant stylesheet then performs the validation of that instance.
Structure of a Schematron Document
A Schematron XML document consists of a schema element in the Schematron namespace: http://www.ascc.net/xml/schematron. The schema element contains one or more pattern elements. Pattern elements allow the user to group schema constraints logically. Some examples of logical groupings are: Text Only Elements, Valid Root Element, Check for ID Attribute.
Pattern elements have a name attribute. They may also have a see attribute that refers to a URL for user documentation of the schema.
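As a rough illustration of this outer structure, the skeleton below groups constraints into two patterns named after the groupings mentioned above. The see URLs and the individual checks are placeholders invented for this sketch, and the rule and assert elements it uses are described in the next section.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Each pattern groups related constraints under a descriptive name. -->
  <!-- The see URLs are hypothetical; they would point at your own documentation. -->
  <pattern name="Valid Root Element"
           see="http://example.org/docs/vehicle-schema.html#root">
    <rule context="/*">
      <assert test="self::vehicle">The root element should be a vehicle element</assert>
    </rule>
  </pattern>
  <pattern name="Check for ID Attribute"
           see="http://example.org/docs/vehicle-schema.html#ids">
    <rule context="*">
      <assert test="@id">Every element should carry an id attribute</assert>
    </rule>
  </pattern>
</schema>
```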
Rules
Rule elements define a collection of constraints on a particular context in a document instance (for example, on an element or collection of elements). This is very similar to XSLT templates, which are fired with respect to a node or group of nodes returned by an XPath expression. If we go back to the XSLT stylesheet we defined earlier:
<xsl:template match='shortStory'>
The match attribute causes the XSLT processor to evaluate the XPath expression shortStory and then instantiate the template relative to the shortStory element. The contents of a rule element operate within the context of the elements matched by its context attribute.
Rule elements may contain assert and report elements. Both elements are conditionally instantiated depending on the XPath evaluation of their test attribute. The only difference is that assert elements are instantiated if the XPath expression evaluates to false, while report elements are instantiated if it evaluates to true. (The general intent is that assert is used to detect errors, while report can be used to report affirmative qualities of an instance.)
The assert/report mechanism is similar to the XSLT xsl:if element in our example stylesheet above, which also has a test attribute that determines if the contents of the xsl:if element are instantiated in the resulting XML tree.
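To make the comparison concrete, here is a minimal sketch of how the shortStory constraint could be restated as a Schematron rule. The pattern name and message texts are invented for this illustration. Note that the assert test is the negation of the condition an xsl:if would check, because an assert states what should be true.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Short Story Authorship">
    <rule context="shortStory">
      <!-- assert: the message is produced when the test evaluates to false -->
      <assert test="not(parent::anthology and @author)">
        A shortStory inside an anthology should not have an author attribute
      </assert>
      <!-- report: the message is produced when the test evaluates to true -->
      <report test="parent::anthology">
        This shortStory is part of an anthology
      </report>
    </rule>
  </pattern>
</schema>
```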
Note that a node can only be the context of a single rule (the first matching rule the processor comes across) within a pattern. However, a node can be matched multiple times within different patterns. Thus pattern groupings are important. Every match of a context node can be considered a discrete constraint.
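The following sketch, built on the vehicle example from the start of the article, shows why grouping matters. The pattern names and messages are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Wheel Counts">
    <!-- Motorcycles match this rule first... -->
    <rule context="vehicle[@type='motorcycle']">
      <assert test="count(wheel) = 2">A motorcycle should have exactly two wheels</assert>
    </rule>
    <!-- ...so within this pattern only the remaining vehicles reach this rule. -->
    <rule context="vehicle">
      <assert test="count(wheel) &gt; 0">A vehicle should have at least one wheel</assert>
    </rule>
  </pattern>
  <pattern name="Accessories">
    <!-- A different pattern, so motorcycles are matched again here. -->
    <rule context="vehicle[@type='motorcycle']">
      <report test="SunRoof">A motorcycle should not have a SunRoof</report>
    </rule>
  </pattern>
</schema>
```

Applied to the invalid Harley Davidson instance shown earlier (one wheel and a SunRoof), the first rule of the first pattern and the rule of the second pattern would both be expected to produce their messages, while the general vehicle rule would never be applied to it.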
The assert and report elements allow authors of Schematron schemas to provide functional (and human-readable) feedback about invalid XML instances. This user-defined feedback makes Schematron's unique approach to schema declaration more powerful than that of other schema languages.
Finally, assert and report elements may contain a name element, which substitutes the name of an element into the output stream. The name element has an optional path attribute, an XPath expression that selects the node whose tag name will be inserted in place of the name element. If the path attribute isn't specified, the name of the current context node is used instead. The name element is often used to identify the tag name of an offending element within the validation message.
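A small sketch of the name element in use, again based on the vehicle example. The pattern name and message wording are invented for this illustration.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Named Parts">
    <rule context="wheel|HeadLight">
      <!-- <name/> stands for the context element's tag name (wheel or HeadLight). -->
      <!-- <name path=".."/> stands for the tag name of its parent element. -->
      <assert test="@name">
        A <name/> element inside a <name path=".."/> element should have a name attribute
      </assert>
    </rule>
  </pattern>
</schema>
```

For a wheel element without a name attribute inside a vehicle element, the message would be expected to read "A wheel element inside a vehicle element should have a name attribute".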
Powered by XPath
The power of Schematron lies with its use of XPath expressions. They allow XML instances to be queried by powerful patterns, providing validation of constraints beyond the capabilities of DTDs to declare. Let's consider selected portions of the "Structural Validation" pattern inside the RSS Schematron (which can be downloaded).
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Structural Validation">
    <rule context="rss">
      <assert test="@version">
        An RSS version identifier should be supplied
      </assert>
Here the rule context is the rss element. The assert element tests for the existence of a version attribute with the @version XPath expression. If the matched rss element doesn't have a version attribute, the contents of the assert element are instantiated: that is, the text message is created in the output of the stylesheet to alert the user that a version identifier is required.
<report test="@version != 0.91"> This Schematron validator is for RSS 0.91 only </report>
This is an example of a report element whose content is instantiated only if the test expression evaluates to true. In this case, the Schematron is checking for a version number other than 0.91.
<assert test="count(channel) = 1"> An RSS element can only contain a single channel element </assert>
Here we have a more complex constraint. It tests whether the context node (/rss in this case) has only a single channel element. The test expression uses the XPath count function, one of the many powerful XPath functions available to a Schematron.
<rule context="title|description|link">
  <assert test="parent::channel or parent::image or parent::item or parent::textinput">
    A <name/> element can only be contained with a channel, image, item or textinput element.
  </assert>
  <report test="child::*">
    A <name/> element cannot contain sub-elements, remove any additional markup
  </report>
</rule>
This rule element's context node in the example above is either a title, description, or link element. The assert element checks that the context node's parent is either a channel, an image, an item, or a textinput. It uses the parent axis specifier for the check.
The report element ensures that neither the title, description, nor the link element contains a child element. It uses the child axis specifier.
<rule context="image">
  ...
  <assert test="count(width) = count(height)">
    Width and Height elements should be balanced
  </assert>
</rule>
This is another example of the count function being used to express a constraint, and another case of a constraint that a DTD could not express.
<rule context="width">
  <assert test="preceding::height or following::height">
    A width should be accompanied by a height
  </assert>
</rule>
This final excerpt shows how far Schematron's checks can reach. The assert element uses the preceding and following XPath axis specifiers to test whether, if a width element occurs, there is an accompanying height element. Once again Schematron leverages XPath, here its axes, for its schema constraints.
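The same technique covers the TennisMatch example from earlier in the article, which a DTD cannot handle. The following is a minimal sketch; the pattern name and message are invented for illustration.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Competition Players">
    <rule context="Competition">
      <!-- count() and the mod operator together express "an even number of" -->
      <assert test="count(Player) mod 2 = 0">
        A <name/> element should contain an even number of Player elements
      </assert>
    </rule>
  </pattern>
</schema>
```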
Putting a Schematron Schema into Action
After a Schematron schema is defined, a Schematron XSLT stylesheet is used to transform the schema to a validating stylesheet. This stylesheet can then be applied to XML instances for validation purposes. There are several such Schematron stylesheets, each of which provides special functionality. You can find these stylesheets on the Schematron web site.
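To give a feel for what a validating stylesheet does, here is a hand-written approximation for the first two checks of the RSS rule discussed above. It is only a sketch of the idea and not the actual output of any of the Schematron stylesheets, which will differ in structure and detail.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Corresponds to <rule context="rss"> -->
  <xsl:template match="rss">
    <!-- assert: emit the message when the test is false -->
    <xsl:if test="not(@version)">An RSS version identifier should be supplied&#10;</xsl:if>
    <!-- report: emit the message when the test is true -->
    <xsl:if test="@version != 0.91">This Schematron validator is for RSS 0.91 only&#10;</xsl:if>
    <!-- Keep walking the tree so other rules can match descendants -->
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Suppress the instance's own text so only messages are output -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Transforming an RSS instance with a stylesheet like this produces no output when both checks pass, and one line of text per violated check otherwise.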
The schematron-basic stylesheet generates a validating stylesheet that simply returns the text output of the Schematron schema (the text of assert and report elements). As the name suggests, this is the most basic of the Schematron stylesheets.
The schematron-message stylesheet generates validating stylesheets that can be used with an XSLT processor that knows how to handle xsl:message elements and send them to the standard output. This stylesheet is mainly used in conjunction with interactive editors such as Emacs and XED to validate an XML instance as it is being edited.
There are also schematron-report and schematron-pretty stylesheets. These generate validating stylesheets that produce HTML-formatted messages. The schematron-report stylesheet produces output in a two-frame frameset. The top frame contains hyperlinked error messages organized by pattern. The bottom frame displays the offending XML source fragments corresponding to the selected error message. This stylesheet provides a helpful way to interactively review validation errors in an XML instance, and it's particularly useful when the XML instance source is large enough to be a burden to browse separately.
Finally, there is schematron-xml, which generates validation messages in XML. The elements have a location attribute containing an XPath expression that evaluates to the offending element. This Schematron stylesheet allows users to plug Schematron validation into their existing XML application logic.
Several widely used XML schemas are written in Schematron in addition to the RSS Schematron example. One is the schema in Dan Connolly's Web Content Accessibility Checking Service, a service that checks web pages against the Web Content Accessibility Guidelines using the WAI example Schematron, downloadable from Rick Jelliffe's Schematron web page.