Web Content Validation with XML::Schematron
January 23, 2002
Introduction
A fair part of the Web's initial popularity was based on the relative simplicity of HTML authoring. Love it or hate it, HTML offered a standard, ubiquitous markup language that one could expect to be viewable, more or less as intended, by anyone requesting the document. This ubiquity made web-based applications possible. By having a common, albeit limited, language from which to build user interfaces, client-server applications could often abandon platform- and application-specific client-side executables in favor of accessing data and logic on the server through CGI or a Web server extension.
The importance of HTML's ubiquity in web applications is especially noticeable in the class of applications I'll call "in-browser content editors". The details vary widely, but the basic interface and functionality are the same: there is a section of the page that contains a largish <textarea> for entering HTML markup and a preview area where that markup is displayed. When the form is submitted, the preview section is updated. What makes this type of application so popular is that it is drop-dead easy to implement. The same markup entered in the textarea is printed as-is in the preview section, and since you are using an HTML browser to view the HTML content you're authoring, if the contents of the preview section look right then you can be reasonably sure that the document that contains that markup will look right. That is, the validation of the markup content is handled implicitly by virtue of using an application that is specifically designed to render that markup in a predictable way.
Choosing XML to mark up web content knocks that implicit validation into a cocked hat. With the exception of XHTML, XML languages are completely foreign to HTML browsers. You may get a nice colorized tree representing an entire XML document in some, but that is a far cry from the "if it renders correctly here, it will render correctly most anywhere" that goes along with checking HTML markup in an HTML browser.
How then do you ensure that the XML content being authored is correct? There are DTDs, which can be used with validating parsers, but DTDs require that the entire content model be explicitly described, which can be tricky for mixed content (e.g., elements that can contain both character data and other elements). There are W3C Schemas, but there, too, the entire model must be described, and the technology itself seems a bit biased toward the stricter "data transfer" uses of XML rather than the looser models that characterize human communication. DTDs and W3C Schemas have their place, but the learning curve involved in getting them right in order to provide a useful level of content validation makes their use impractical for most applications.
Enter the Schematron. Created by Rick Jelliffe, Schematron is a simple XML application language designed to make validating the structures of XML documents as straightforward and painless as possible. It uses the XPath syntax to define a series of rules that should or should not be true about a given document's structure. Those rules, and the context in which they are evaluated, can be as coarse or as finely-grained as the task at hand requires. Content models may be open or closed; you can declare a document structurally valid based on a single all-important rule; or you can create rules for each and every element and attribute that may appear in the document -- the choice is yours.
This month we will be looking at the Perl implementation of the Schematron: my XML::Schematron.
Writing Schematron Schemas
Before we dig into XML::Schematron, let's take a quick look at a Schematron schema. The basic rules for writing schemas are very simple:
- The schema will contain a single top-level <schema> element.
- The <schema> element will contain one or more <pattern> elements.
- Each <pattern> element will contain one or more <rule> elements.
- Each <rule> element will contain a context attribute consisting of an XPath expression that provides the context for evaluation, and a mix of one or more <assert> or <report> elements.
- Each <assert> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to false.
- Each <report> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to true.
Let's look at a sample schema to see how these rules take shape.
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Example HTML Schematron Schema">
    <rule context="/">
      <assert test="html">
        The root element of an HTML page must be named 'html'.
      </assert>
    </rule>
After declaring the top-level <schema> element, we create a single <pattern> element. Patterns allow for the logical grouping of tests, but our needs are modest in this case so we'll have only one. Next, we have a <rule> element with the required context attribute. This attribute takes an XPath expression that provides the context in which the enclosed <assert> and <report> tests will be evaluated. In this case, the context is "/", the abstract root of the document. Within that rule we have a single <assert> element with the required test attribute. Here, too, the attribute takes an XPath expression. The expression in an assert element says, in essence, "here is some test expression that should be true within the context provided by the enclosing rule, but if it evaluates to false, I'll print my warning message". In this example we are checking the document being validated for the presence of an <html> element in the context of the abstract root. If that is not the case, that is, if the top-level element is called something else, the text contained by the assert element will be delivered to the user as an indication that the rule failed.
    <rule context="html">
      <report test="count(*) != count(head | body)">
        The html element may only contain head and body elements.
      </report>
      <assert test="count(body) = 1">
        The html element must contain a single body element.
      </assert>
    </rule>
After checking in the previous rule that the top-level element is named 'html', we define a rule with that element as the context so that we may examine its contents. Like the <assert> element, the <report> element requires a test attribute that takes an XPath expression. The difference is that the test in an assert element contains an expression that should evaluate to true in the given context for the structure to be valid; a report element's test expression creates a validity rule that should evaluate to false in the given context for it to pass. Here, we want to ensure that the <html> element contains only <head> and <body> elements, so we create a report test that contains the XPath expression count(*) != count(head | body); or, in English, "the number of all child elements, regardless of name, is not equal to the number of child elements named 'head' and 'body'". Remember, this is a report element, so the test expression should evaluate to false for the structure to be valid. Next, we create an <assert> with the test expression count(body) = 1. This ensures that the <html> element contains a <body> element, but only one, since having multiple body sections in a document is likely to drive browsers crazy.
Note that the combination of these two tests creates an open content model. That is, both <head> and <body> elements are allowed, but only the <body> element is required to pass our definition of structural validity.
  </pattern>
</schema>
Finally, we close the <pattern> and <schema> elements to complete the schema.
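To make the assert/report semantics concrete, here is a minimal sketch that evaluates the same three tests by hand with XML::LibXML. It is not how XML::Schematron works internally; the @rules structure and the check function are hypothetical names invented for this illustration, and the contexts are written as absolute XPath selections ('/html') rather than Schematron's pattern-style matching.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, hand-written mirror of the sample schema: each rule has an
# XPath context and a list of [ type, test expression, message ] entries.
my @rules = (
    {   context => '/',
        tests   => [
            [ assert => 'html',
              "The root element of an HTML page must be named 'html'." ],
        ],
    },
    {   context => '/html',
        tests   => [
            [ report => 'count(*) != count(head | body)',
              'The html element may only contain head and body elements.' ],
            [ assert => 'count(body) = 1',
              'The html element must contain a single body element.' ],
        ],
    },
);

# Return the messages a document triggers. An assert delivers its message
# when the test is false; a report delivers its message when it is true.
sub check {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    my @messages;
    for my $rule (@rules) {
        for my $node ( $doc->findnodes( $rule->{context} ) ) {
            for my $test ( @{ $rule->{tests} } ) {
                my ( $type, $expr, $message ) = @$test;
                # boolean() forces XPath truthiness; findvalue() then
                # yields the string 'true' or 'false'.
                my $is_true = $node->findvalue("boolean($expr)") eq 'true';
                my $fires = $type eq 'assert' ? !$is_true : $is_true;
                push @messages, $message if $fires;
            }
        }
    }
    return @messages;
}

# Two body elements: only the count(body) = 1 assert should fire.
print "$_\n" for check('<html><head/><body/><body/></html>');
```

Feeding the function a document whose root is not html fires only the first assert, because the second rule's context then selects no nodes at all.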
This basic schema only hints at Schematron's power. Any valid XPath expression that can be evaluated as true or false can be used to test a document's structure. For example,
<rule context="a">
  <assert test="@href or @name">
    An a element must contain either an 'href' or 'name' attribute.
  </assert>
</rule>
creates a rule that ensures all <a> elements contain either a name or href attribute. And
<rule context="mytag">
  <assert test="@boolean='true' or @boolean='false'">
    The mytag element's boolean attribute must be set to either true or false.
  </assert>
</rule>
verifies that the <mytag> element's boolean attribute contains either true or false.
Now that you have a basic working overview of Schematron, let's get down to business.
Using XML::Schematron
Basic usage of XML::Schematron is very simple and best shown by example. The following script takes a path to a Schematron schema and an XML document and prints any validation errors to STDOUT:
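#!/usr/bin/perl -w
use strict;
use XML::Schematron::LibXSLT;

# Collect the schema and document paths from the command line
my $schema_file = $ARGV[0];
my $xml_file    = $ARGV[1];

die "Usage: perl schematron.pl schemafile XMLfile.\n"
    unless defined $schema_file and defined $xml_file;

# Create the validator and point it at the schema
my $tron = XML::Schematron::LibXSLT->new();
$tron->schema($schema_file);

# In scalar context verify() returns all messages as a single string
my $ret = $tron->verify($xml_file);

# Stay silent when there is nothing to report
print $ret . "\n" if defined $ret and length $ret;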
After collecting the filenames from the command line, this script creates a new instance of XML::Schematron::LibXSLT, then sets the schema to use for validation using that object's schema method, validates the XML file using the verify method, and prints any results to standard output. If the script runs silently, then the document in question is structurally valid by the definition provided by the schema.
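When a check like this runs from a shell script or a build step, an exit status is often more useful than printed text. The following is a minimal sketch of that variation, not part of the module: validation_messages and report_and_status are hypothetical helper names, and it assumes that verify returns one list entry per message when called in list context.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run the schema over the document and return the messages as a list.
sub validation_messages {
    my ( $schema_file, $xml_file ) = @_;
    require XML::Schematron::LibXSLT;    # loaded only when validation runs
    my $tron = XML::Schematron::LibXSLT->new();
    $tron->schema($schema_file);
    # Assumption: list context yields one entry per message
    my @messages = $tron->verify($xml_file);
    # Drop undefined or whitespace-only entries, if any
    return grep { defined($_) && /\S/ } @messages;
}

# Print each message to STDERR and map the outcome to a shell exit
# status: 0 when the document passed, 1 when any message was produced.
sub report_and_status {
    my @messages = @_;
    print STDERR "$_\n" for @messages;
    return @messages ? 1 : 0;
}

# Only act when both file names were supplied on the command line
if ( @ARGV == 2 ) {
    exit report_and_status( validation_messages(@ARGV) );
}
```

Saved under a name of your choosing and called with a schema and a document, the script exits with 0 for a structurally valid document and 1 otherwise, so it can sit in front of any step that should only run on valid content.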
Careful readers will have noticed that we imported the Perl Schematron library with use XML::Schematron::LibXSLT; rather than use XML::Schematron;. The reason is that the Schematron module actually ships with several backends that can be chosen based on the type of processor that you want to use. Schematron's secret is that it is most often implemented as an XSLT stylesheet: the Schematron stylesheet is applied to the schema, and the result of that transformation is in turn applied as a stylesheet to the document being validated. The same is true of most flavors of XML::Schematron, except that the stylesheet is created dynamically and all the details are hidden from view. Currently, the Sablotron and LibXSLT processors are supported, but if you do not have or want an XSLT processor installed, you may use XML::Schematron::XPath, a pure Perl implementation built upon Matt Sergeant's XML::XPath.
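The two-stage transformation described above can be sketched directly with XML::LibXSLT. This illustrates the general technique only, not the code inside XML::Schematron (which builds its stylesheet dynamically). The schematron_via_xslt name is hypothetical, and the skeleton stylesheet path is an assumption: you would have to supply a Schematron-to-XSLT stylesheet yourself.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Stage 1: transform the schema with a Schematron "skeleton" stylesheet,
#          which produces a new stylesheet that encodes the schema's rules.
# Stage 2: apply that generated stylesheet to the document being validated.
sub schematron_via_xslt {
    my ( $skeleton_file, $schema_file, $xml_file ) = @_;
    my $xslt = XML::LibXSLT->new();

    # Stage 1: schema in, validating stylesheet (as a DOM) out
    my $compiler  = $xslt->parse_stylesheet_file($skeleton_file);
    my $generated = $compiler->transform_file($schema_file);

    # Stage 2: compile the generated DOM and run it over the document
    my $validator = $xslt->parse_stylesheet($generated);
    my $result    = $validator->transform_file($xml_file);

    # Serialize according to the generated stylesheet's output settings
    return $validator->output_as_chars($result);
}
```

A call would look like schematron_via_xslt( 'skeleton.xsl', 'schema.xml', 'page.xml' ), where all three file names are placeholders. Seeing the stages spelled out makes it clearer what the backends save you from writing.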
Example -- A Browser-based XML-friendly Content Editor
For our second and final example we will create a browser-based XML content editor that uses XML::Schematron to validate the content being authored. To keep things nice and tidy we will use Christian Glahn's astonishingly cool CGI::XMLApplication, which we learned about last month. First, though, we need to decide on the XML language that we want to use to capture our content. To keep things simple, we will choose a very minimal subset of DocBook-XML which will, nevertheless, provide more semantic richness than plain HTML. Here's the simplified Schematron schema:
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Basic Web Site Content Validator">
    <rule context="/">
      <assert test="article">
        The root element of a content page must be named 'article'.
      </assert>
    </rule>
    <rule context="article">
      <assert test="count(*) = count(title|section|copyright|abstract)">
        Unexpected element(s) found in element 'article'. An article element
        should contain only title, section, copyright, or abstract elements.
      </assert>
      <assert test="title">
        An article element must contain a title element.
      </assert>
      <assert test="section">
        An article element must contain a section element.
      </assert>
      <assert test="copyright">
        An article element must contain a copyright element.
      </assert>
    </rule>
    <rule context="title">
      <assert test="string-length() &gt; 0 and string-length() &lt; 51">
        The title element must contain between 1 and 50 characters.
      </assert>
    </rule>
    <rule context="copyright">
      <assert test="count(*) = count(name | date)">
        Unexpected element(s) found: the copyright element may only contain
        name and date elements.
      </assert>
      <assert test="name">
        A copyright element must contain a name element.
      </assert>
      <assert test="date">
        A copyright element must contain a date element.
      </assert>
    </rule>
  </pattern>
</schema>
Next, we create the CGI::XMLApplication interface that will validate that content and warn the user about any validation errors encountered. To avoid information overload we will focus on the parts that are directly relevant to validating the submitted content and reporting errors. However, the complete working application is available in this month's sample code for you to peruse, install, or extend as desired.
We start with the application's verify_content method.
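sub verify_content {
    my ( $self, $context ) = @_;
    my $content = $context->{CONTENT};
    warn "content $content \n";    # debugging aid: log the submitted content

    my $tron = XML::Schematron::LibXSLT->new( );
    $tron->schema( $context->{SCHEMA} );

    my @messages = ();
    # Trap parse failures so malformed content does not cause a server error
    eval {
        @messages = $tron->verify( $content );
    };
    if ( $@ ) {
        # The content could not be processed at all
        my $error = "Error processing XML document: $@";
        push @{$context->{ERRORS}}, $error;
    }
    else {
        # Well-formed content: keep any structural validation messages
        push @{$context->{ERRORS}}, @messages;
    }
}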
In the verify_content method we create an instance of XML::Schematron::LibXSLT and set the schema to the value contained by the $context->{SCHEMA} field. Then we verify the XML content contained in $context->{CONTENT}. Note that the call to the XML::Schematron::LibXSLT object's verify method is wrapped in an eval block. This ensures that any well-formedness errors encountered can also be captured cleanly and sent to the user without causing a server error. If no parsing errors are encountered, we push any structural validity errors that may have resulted from applying our schema to the document on to the $context->{ERRORS} array reference for later use.
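The reason the eval matters is easiest to see in isolation. The sketch below uses XML::LibXML directly rather than XML::Schematron, on the assumption that the LibXSLT backend fails in the same way when handed malformed markup. The parse_error_for function is a hypothetical name used only here.

```perl
use strict;
use warnings;
use XML::LibXML;

# Try to parse $content. Return undef when it is well-formed, or an error
# string (instead of dying) when the parser rejects it.
sub parse_error_for {
    my ($content) = @_;
    # The trailing 1 makes eval return true only if parsing did not die
    my $ok = eval { XML::LibXML->load_xml( string => $content ); 1 };
    return $ok ? undef : "Error processing XML document: $@";
}

for my $content (
    '<article><title>Hi</title></article>',    # well-formed
    '<article><title>Hi</article>',            # title is never closed
    )
{
    my $error = parse_error_for($content);
    print defined $error ? "rejected\n" : "accepted\n";
}
```

Without the eval, the second document would end the program at the parse call, which in a CGI setting surfaces as a server error rather than a message the author can act on.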
Now we create the requestDOM method that CGI::XMLApplication uses to build the content sent to the browser:
sub requestDOM {
    my ($self, $context) = @_;

    # Start a new DOM with a top-level <document> element
    my $dom  = XML::LibXML::Document->new();
    my $root = $dom->createElement( 'document' );
    $dom->setDocumentElement( $root );

    # add errors if any
    if ( scalar( @{$context->{ERRORS}} ) > 0 ) {
        my $errors = $dom->createElement( 'errors' );
        foreach my $message ( @{$context->{ERRORS}} ) {
            # one <error> child per message
            $errors->appendTextChild( 'error', $message );
        }
        $root->appendChild( $errors );
    }
    return $dom;
}
Here we have created a new DOM tree using XML::LibXML::Document's new method and added a top-level element named 'document'. If any errors were pushed on to $context->{ERRORS} during validation, we create a child of the <document> element called 'errors' and loop over the errors encountered, adding an <error> to that for each error found, and, finally, we return the new DOM tree. The XSLT stylesheet that renders the returned DOM will check for the presence of the <errors> element and print a list of validation errors to the user.
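To see what the stylesheet actually receives, it can help to build the same tree outside the application and serialize it. The sketch below repeats the DOM-building steps in a stand-alone function (errors_dom is a hypothetical name) and then runs the kind of XPath check a stylesheet could use to detect errors.

```perl
use strict;
use warnings;
use XML::LibXML;

# Build the <document>/<errors>/<error> tree from a plain list of messages.
sub errors_dom {
    my @messages = @_;
    my $dom  = XML::LibXML::Document->new();
    my $root = $dom->createElement('document');
    $dom->setDocumentElement($root);
    if (@messages) {
        my $errors = $dom->createElement('errors');
        $errors->appendTextChild( 'error', $_ ) for @messages;
        $root->appendChild($errors);
    }
    return $dom;
}

my $dom = errors_dom(
    'A copyright element must contain a name element.',
    'A copyright element must contain a date element.',
);

# Pretty-print the tree that would be handed to the stylesheet
print $dom->toString(1);

# Count the <error> elements the way an XPath-based check would
my $count = $dom->findnodes('/document/errors/error')->size;
print "errors found: $count\n" if $count;
```

With no messages the function returns a bare <document> element, so the absence of an <errors> child is itself the signal that validation passed.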
Conclusions
Validating XML content does not have to be a painful process. With XML::Schematron and a good working knowledge of the XPath syntax you can add a powerful layer of structural validation to your Perl XML processing in a fraction of the time required by other solutions. Schematron may not completely replace DTDs or W3C Schemas for stricter XML systems, but the value that it provides for the minimal time investment makes it a big winner in my book.