Self-Enhancing Stylesheets
July 2, 2003
Developing new stylesheets can be a chore. It would be nice if you could tell your stylesheet to trace which tags from the source document are not yet processed by xsl:template elements. And why not make your stylesheet write an xsl:template match skeleton for each unhandled tag? Unfortunately, doing this was too hard with XSLT 1.0. But XSLT 2.0 will change this, and with the help of Saxon 7.5 (or greater) you can try it out now.
XSLT gives you two ways of processing XML documents. The first is to directly access parts of the document by XPath expressions. This is what the XSLT 2.0 Working Draft calls pull-processing (§ 2.4). The other way is to walk through the document in document order. Letting the document structure drive the processing sequence is called push processing, and this is what the xsl:template match and xsl:apply-templates mechanisms are for. Usually both kinds of processing are mixed in a stylesheet. When one writes a new stylesheet to process an unknown document, coding typically begins with adding xsl:template match rules for the tags.
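A minimal sketch of the two styles mixed in one stylesheet, assuming a hypothetical input whose entry elements each contain a term child (the HTML layout is purely illustrative). The root template pulls a count with an XPath expression and then pushes the document's children through xsl:apply-templates:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- pull: reach into the document with an XPath expression -->
        <p>Entries: <xsl:value-of select="count(//entry)"/></p>
        <!-- push: let the document structure select the templates -->
        <ul>
          <xsl:apply-templates/>
        </ul>
      </body>
    </html>
  </xsl:template>

  <!-- invoked by xsl:apply-templates whenever an entry element is reached -->
  <xsl:template match="entry">
    <li><xsl:value-of select="term"/></li>
  </xsl:template>
</xsl:stylesheet>
```

Elements without a template of their own fall through to the built-in rules, which is exactly the gap the rest of this article tries to make visible.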
The Simple Approach
The step-by-step way of writing your templates is not a problem unless you have to work on large or deeply structured documents containing many different tags. This was the problem I ran into when I was engaged in transforming the OpenOffice 1.0 file format. I wasn't in the mood for reading the extensive DTD just to pass some element contents to HTML. So I began to implement templates for some easy and self-explanatory tags:
<!-- process headers to h1 .. h6 by text level attribute -->
<xsl:template match="text:h">
  <xsl:element name="{concat('h',@text:level)}">
    <xsl:value-of select="."/>
  </xsl:element>
</xsl:template>

<!-- generic para processing -->
<xsl:template match="text:p">
  <p>
    <xsl:apply-templates/>
  </p>
</xsl:template>
When I asked myself which tags might have passed through my templates unrecognized, I recalled the XSLT default templates and added the following:
<xsl:template match="*">
  <xsl:comment>
    <xsl:value-of select="concat('not processed: ',name())"/>
  </xsl:comment>
  <xsl:apply-templates/>
</xsl:template>
Because XSLT falls back to the most generic matching template for any tag that has no better-fitting template rule, this was extremely simple. It gave me a trace of all tags not processed by my more specific match attributes. If all you want is a log of unhandled tags in your output document, you're done with this solution.
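If you would rather keep the trace out of the result document, one possible variant (a sketch, not part of the original stylesheet) reports unhandled tags through xsl:message instead of xsl:comment. Where the messages end up is up to the processor; many write them to the console.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- catch-all rule: any more specific match pattern takes precedence -->
  <xsl:template match="*">
    <!-- report the element name without touching the result tree -->
    <xsl:message>
      <xsl:value-of select="concat('not processed: ', name())"/>
    </xsl:message>
    <!-- keep descending so nested unhandled elements are reported too -->
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

The catch-all pattern has a lower default priority than a pattern that names an element, so adding it does not change the behavior of templates you have already written.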
Improving the Solution
The idea of letting the stylesheet write the names of the unhandled tags into a separate document is the next obvious step. We will make it write out not just comments, but <xsl:template match...> fragments that match the bypassed tags, and inform us about all of their attributes. And we want to have this code as a stylesheet module that can easily be plugged into any stylesheet we are currently working on.
It is very hard to achieve all this with XSLT 1.0. At a minimum you will have to use processor-specific extensions. For that reason, the following solution requires an XSLT 2.0 processor. The most advanced experimental implementation is Michael Kay's Saxon 7 processor. The version used with these examples was 7.5.1.
In the following we will solve the problems that derive from our requirements step by step. You can find the complete code samples in the self_enhancing_samples.zip download. The basic XML document is named glossary.xml. The main stylesheet, which is under construction, is new_sheet.xslt. To keep things simple, it creates a small HTML file and contains only one template to handle a tag from glossary.xml. It includes nursery_sheet.xslt, where the tracing work is done.
Figure 1 shows the data flow of the described processing.
Figure 1. Data flow between affected documents and stylesheets.
The main stylesheet (new_sheet.xslt) looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:include href="nursery_sheet.xslt"/>
  <xsl:output method="html" version="1.0"/>
  <xsl:template match="/">
    <xsl:call-template name="tag-trace"/>
    <html>
      <head><title>trace tags</title></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="entry">
    <h3><xsl:value-of select="term"/></h3>
  </xsl:template>
</xsl:stylesheet>
That's all that's needed inside new_sheet.xslt. The nursery_sheet.xslt is included, and somewhere later the <xsl:call-template name="tag-trace"/> starts tracing. It's worth noting that we do not care about namespaces yet. If we work on documents containing namespace declarations, they should be defined in this sheet. This is usually done inside the <xsl:stylesheet> tag.
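As an illustration of such a declaration, here is a sketch that binds the text prefix used in the OpenOffice templates earlier. The namespace URI shown is an assumption; copy the actual value from the root element of the document you are processing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The text prefix is bound on the root element so that match patterns
     such as text:p can be resolved. The URI is an assumption and must be
     identical to the one declared in the input document. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="http://openoffice.org/2000/text"
    exclude-result-prefixes="text">
  <xsl:output method="html"/>

  <xsl:template match="text:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the declaration from being copied onto the generated HTML elements.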
Each time we run this transformation the unhandled tags are processed by the tag-trace template contained in nursery_sheet.xslt, which we will look at now.
Writing Multiple Output Documents
The idea of XSLT 1.0 was to transform a single input tree into a single output tree. There was no mechanism to write to multiple output files. Most XSLT processors implemented their own specific solutions for this. The new XSLT 2.0 xsl:result-document element cleans up the jungle of processor-specific tags and allows you to serialize an arbitrary number of output trees to separate documents. If you are curious about the details you may consider looking at XSLT 2.0 and XQuery 1.0 Serialization and §20 of XSLT 2.0. Here we only take a glance at this topic.
The output is controlled by the xsl:output element, which as in XSLT 1.0 remains optional. But if you intend to use multiple output formats there must be multiple xsl:output elements in your stylesheet. An output definition comes into effect when referenced from an xsl:result-document block. Let's have a look at how it works. First we define a named output format as a top level element.
<xsl:output method="xml" name="nursery" standalone="yes" indent="yes"/>
At some other place in the stylesheet where serialization begins we refer to the definition using the format attribute inside the xsl:result-document element.
<xsl:result-document href="not_processed.xml" format="nursery">
Obviously, the href attribute names the document to which the result of the serialization should be written. But there is one thing to remember. We are not able to do file processing in XSLT 2.0 like we can do in most programming languages. What we are doing is tree serializing. This means that we can't use the simple <xsl:template match="*"> to wrap the content creation with an <xsl:result-document> element. Such a template would be triggered during document processing while we are busy constructing the primary result tree, which is serialized to the target (HTML) document. So we are forced to disconnect the tracing mechanism from the recursive descendant processing of the main input document.
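To show how a named output definition and xsl:result-document fit together, here is a small self-contained sketch. The file name element_count.xml and the summary element are made up for illustration, and the syntax assumes a processor that follows the final XSLT 2.0 Recommendation; a draft-level implementation may differ in details.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- unnamed definition: applies to the primary (HTML) result -->
  <xsl:output method="html"/>
  <!-- named definition: used only where it is referenced by format="nursery" -->
  <xsl:output name="nursery" method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- secondary result tree, written once from the root template -->
    <xsl:result-document href="element_count.xml" format="nursery">
      <summary elements="{count(//*)}"/>
    </xsl:result-document>
    <!-- primary result tree -->
    <html>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The secondary document is built in one place, separate from the recursive processing that fills the HTML body, which is the same separation the tag-trace template relies on.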
Analyzing the Stylesheet
What we need to do is to read the current state of our main stylesheet and compare it with the tags found in the input document. So we have to handle two input documents: the main input document, which is one of the input parameters, and the stylesheet we are working on.
To analyze which tags have been handled already, we read all match attributes of xsl:template elements and hold them as a list of tag names in a variable. This can be achieved with the document() function.
<xsl:variable name="handled-tags">
  <xsl:for-each select="document($analyze)//xsl:template/@match">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
  </xsl:for-each>
</xsl:variable>
If we want to keep the tags handled by xsl:value-of as well, we can easily add the following inside the variable definition.
<xsl:for-each select="document($analyze)//xsl:value-of/@select">
  <xsl:value-of select="."/>
  <xsl:if test="position() != last()">, </xsl:if>
</xsl:for-each>
The result is a comma-separated list of tag names handled by xsl:template or xsl:value-of statements.
If we had decided to put the tag-trace template into new_sheet.xslt, we could have used document('') to get the root node of the current stylesheet. But we want to separate it from current work, which is why we have to pass the name of our working stylesheet to the document() function with the $analyze parameter.
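The following stand-alone sketch shows one way the $analyze parameter and the list of handled patterns could be declared. The default value of the parameter is an assumption, and string-join() is used here as a compact alternative to the xsl:for-each loop; it assumes a processor that implements the final XPath 2.0 function library.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- stylesheet to analyze; the default value is an assumption and can be
       overridden from the command line -->
  <xsl:param name="analyze" select="'new_sheet.xslt'"/>

  <!-- all match attributes of the analyzed stylesheet, comma-separated -->
  <xsl:variable name="handled-tags"
      select="string-join(document($analyze)//xsl:template/@match, ', ')"/>

  <!-- print the list so the result can be inspected -->
  <xsl:template match="/">
    <xsl:value-of select="$handled-tags"/>
  </xsl:template>
</xsl:stylesheet>
```

A relative file name is resolved against the location of the stylesheet that contains the document() call, so this sketch is meant to be saved next to new_sheet.xslt. For the new_sheet.xslt shown earlier, the list should consist of the two patterns / and entry.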
Outputting the XSLT Namespace
Before we look at the core functionality of our tag-trace template, we must think about namespace operations. Because we want to generate XSLT, we need to distinguish the XSLT elements that should be interpreted by the processor from those that are only written to the output tree. This is exactly what namespaces are made for. We will use the namespace prefix genxsl to output XSLT reserved names, using the following namespace declaration.
xmlns:genxsl="http://www.xml-web.de/genxsl"
This enables us to write <genxsl:template match="."> to our trace document. To have the genxsl prefix replaced with the xsl prefix during serialization we use the namespace-alias directive:
<xsl:namespace-alias stylesheet-prefix="genxsl" result-prefix="xsl"/>
It is processor dependent what really happens due to this declaration (XSLT 2.0 §11.1.4), but Saxon produces <xsl:template match="."> from <genxsl:template match="."> and that's what we want.
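Put together, a stylesheet that generates another stylesheet might look like the following sketch. It writes a single template skeleton for the document element of whatever input it is given; the choice of the document element is only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:genxsl="http://www.xml-web.de/genxsl">

  <!-- elements written with the genxsl prefix end up in the XSLT namespace -->
  <xsl:namespace-alias stylesheet-prefix="genxsl" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- literal result elements: copied to the output, not executed -->
    <genxsl:stylesheet version="2.0">
      <!-- name(*) is the name of the document element of the input -->
      <genxsl:template match="{name(*)}">
        <genxsl:apply-templates/>
      </genxsl:template>
    </genxsl:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

The result is itself a stylesheet in the XSLT namespace. Which prefix appears in the serialized output is, as noted above, left to the processor.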
Now let's use some new XSLT 2.0 features to collect and compare the tags occurring in our XML input document with the names we keep in the variable $handled-tags.
The tag-trace Template
The main task inside the tag-trace template is to collect all tag names that can be found in the XML input document. To get a list of unique names we use the new xsl:for-each-group element. We choose all element nodes by passing the '//*' expression to the select attribute. Then we tell the group-by attribute to divide the nodes into a collection of sequences of items with identical names.
<xsl:result-document href="not_processed.xml" format="nursery">
  <genxsl:stylesheet version="2.0">
    <!-- get the tag names of the input file -->
    <xsl:for-each-group select="//*" group-by="name()">
      <xsl:sort select="name()" case-order="lower-first"/>
      <!-- keep name unique -->
      <xsl:variable name="cname" select="name(current-group()[1])"/>
      <xsl:if test="not(contains($handled-tags, $cname))">
        <!-- write template for name found -->
        <genxsl:template match="{$cname}">
          <!-- attribute code added later -->
          <genxsl:apply-templates/>
        </genxsl:template>
      </xsl:if>
    </xsl:for-each-group>
  </genxsl:stylesheet>
</xsl:result-document>
Inside the for-each-group element the current-group() function accesses the sequence currently processed. To achieve uniqueness we explicitly take the first item of the sequence and keep its name in the variable $cname. Now it is easy to test whether the current name is contained in the list of handled tags.
<xsl:if test="not(contains($handled-tags, $cname))">
If the condition is true, a template fragment matching the current name is created by <genxsl:template match="{$cname}">.
Something like this is generated for each tag name:
<xsl:template match="para">
  <xsl:apply-templates/>
</xsl:template>
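One caveat of the contains() test is that it compares substrings, so a tag named term would count as handled as soon as a pattern such as terminology appears in the list. The sketch below shows a possible stricter comparison that splits the list with tokenize() and compares whole names. The sample list is hypothetical, and the approach is an alternative suggested here, not the code used in nursery_sheet.xslt.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- hypothetical sample list in the same format as $handled-tags -->
  <xsl:variable name="handled-tags" select="'/, terminology'"/>
  <!-- split the list at each comma, dropping the following whitespace -->
  <xsl:variable name="handled" select="tokenize($handled-tags, ',\s*')"/>

  <xsl:template match="/">
    <xsl:for-each-group select="//*" group-by="name()">
      <xsl:sort select="current-grouping-key()"/>
      <!-- true only if no list entry equals the whole element name -->
      <xsl:if test="not(current-grouping-key() = $handled)">
        <xsl:value-of select="current-grouping-key()"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

This prints one unhandled element name per line. Whole-name comparison has its own limits: a pattern such as entry/term or a | b would no longer mark term, a, or b as handled, so which test fits better depends on how your match patterns are written.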
Now let's add some information about possible attributes and generate an xsl:value-of select statement for each one inside the template definition. We can use the same logic as with the tag names.
<xsl:for-each-group select="//*[name() = $cname]/@*" group-by="name()">
  <xsl:sort select="name()"/>
  <genxsl:value-of select="{concat('@',name(current-group()[1]))}"/>
</xsl:for-each-group>
To get all possible attributes of a specific element we select all attributes from all elements with the same name by the XPath expression "//*[name() = $cname]/@*". As we did with the element names, the attribute names are grouped, sorted by name, and an xsl:value-of select statement is created.
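To inspect the grouping on its own before generating any XSLT, a small report stylesheet can help. The sketch below lists every element name of the input together with the attribute names found on elements of that name. It uses current-grouping-key() and distinct-values() as a shorthand for the nested grouping, and assumes a processor that implements the final XSLT 2.0 and XPath 2.0 specifications.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- one group per distinct element name -->
    <xsl:for-each-group select="//*" group-by="name()">
      <xsl:sort select="current-grouping-key()"/>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>: </xsl:text>
      <!-- attribute names used on any element of the current group -->
      <xsl:value-of select="distinct-values(current-group()/@*/name())"
                    separator=", "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

Each output line has the form of an element name, a colon, and the attribute names; elements without attributes produce an empty list. The order of the attribute names within a line is not guaranteed.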
Note that <genxsl:value-of select...> is not in the XSLT namespace, so its select attribute is not evaluated by the XSLT processor. That's the reason why we have to tell the processor to evaluate the concat() function here. This is done by the attribute value template (AVT) inside the curly brackets. If we do not use an AVT here, the complete string 'concat('@',name(current-group()[1]))' will appear on the output tree, because the processor takes it as the literal attribute value of a literal result element. The nested processing looks like this (line breaks are for formatting reasons):
<xsl:result-document href="not_processed.xml" format="nursery">
  <genxsl:stylesheet version="2.0">
    <!-- get the tag names of the input file -->
    <xsl:for-each-group select="//*" group-by="name()">
      <xsl:sort select="name()" case-order="lower-first"/>
      <!-- keep name unique -->
      <xsl:variable name="cname" select="name(current-group()[1])"/>
      <xsl:if test="not(contains($handled-tags, $cname))">
        <!-- write template for name found -->
        <genxsl:template match="{$cname}">
          <xsl:for-each-group select="//*[name() = $cname]/@*" group-by="name()">
            <xsl:sort select="name()"/>
            <genxsl:value-of select="{concat('@',name(current-group()[1]))}"/>
          </xsl:for-each-group>
          <genxsl:apply-templates/>
        </genxsl:template>
      </xsl:if>
    </xsl:for-each-group>
  </genxsl:stylesheet>
</xsl:result-document>
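The effect of the curly brackets can be seen in isolation with a tiny stylesheet. The element names demo, with-avt and without-avt are invented for this sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <demo>
      <!-- AVT: the expression inside the braces is evaluated -->
      <with-avt select="{concat('@', 'id')}"/>
      <!-- no AVT: the attribute value is copied as plain text -->
      <without-avt select="concat('@', 'id')"/>
    </demo>
  </xsl:template>
</xsl:stylesheet>
```

In the result, the select attribute of with-avt holds @id, while the one on without-avt still holds the unevaluated text concat('@', 'id').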
You will find the code inside the nursery_sheet.xslt a bit longer than discussed here. It contains some additional features: ensuring a processor version, writing out comments to increase readability of the target file, as well as a user defined function already described in my previous article.
Usage Notes
These stylesheets can be invoked from a command line window. You can call saxon7.jar directly:
java -jar saxon7.jar -o test.html glossary.xml new_sheet.xslt
If you don't like coding the name of the calling stylesheet into nursery_sheet.xslt, you can pass the name to the analyze parameter.
java -jar saxon7.jar -o test.html glossary.xml new_sheet.xslt "analyze=new_sheet.xslt"
It is easy to edit the nursery_sheet.xslt in a way that it can be called once for an arbitrary XML input file to generate template fragments for all tags inside. You can find an older version of this approach on my website generate-xslt.zip.
Conclusion
This stylesheet is neither a complete solution for developing stylesheets, nor does it replace the need for good development environments. You have to decide on your own if a specific template fragment is useful in your document processing or whether a simple <xsl:value-of select='tagname'/> would be enough. But this solution keeps us informed about the tags of an unknown document, especially when there is no DTD available. And it shows some of the promising features of XSLT 2.0.