XSLT 2 and Delimited Lists
May 7, 2003
As part of his work as the editor of the XSLT 2.0 specification, Michael Kay has been prototyping the new features of XSLT 2.0 and XPath 2.0 in a separate development branch of his well-known Saxon XSLT processor. As I write this, his most recent prototype release is 7.4. (His recommended stable implementation of XSLT 1.0 is at release 6.5.2; see the project homepage for details on the progress of these two branches.) 7.4 lets us play with many of XSLT 2.0's new features.
The XSLT 2.0 specification is still a Working Draft, so you don't want to build production code around it, but it's still fun to try out some of the new features offered by the next generation of XSLT and XPath. In the next few columns, I'll look at some of these features. Because XPath 2.0 shares them with XQuery, most functions are defined in a specification of their own, XQuery 1.0 and XPath 2.0 Functions and Operators, rather than in the XPath 2.0 spec.
One class of "pervasive changes" from XSLT 1.0 to 2.0 is "support for sequences as a replacement for the node-sets of XPath 1.0." Three functions that take advantage of this let you manipulate tokenized strings: tokenize(), item-at(), and index-of(). In theory, start-tags and end-tags are the only delimiters anyone ever needs in XML, but in practice, plenty of data out there uses other delimiters, if only for size reasons. Compare the following SVG polygon element
<polygon points="100,100 140,220 40,145 160,145 60,220"/>
with one that delimits everything with tags:
<poly>
  <point><x>100</x><y>100</y></point>
  <point><x>140</x><y>220</y></point>
  <point><x>40</x><y>145</y></point>
  <point><x>160</x><y>145</y></point>
  <point><x>60</x><y>220</y></point>
</poly>
The nearly four-fold increase in size makes a big difference for pictures of any complexity. XSLT developers have longed for some equivalent of Perl and Python's split functions, which take a string and an indication of the delimiter to look for and then return an array of the substrings found between the delimiters. While some XSLT processors offered an equivalent as an extension function, the tokenize() function's place on the W3C-specified list of required XSLT 2.0 functions lets us count on wide, consistent implementation of this function.
Let's look at a demonstration of tokenize() and two other new functions that work very nicely with it. The following stylesheet works with any input, because it executes all of its instructions upon seeing the root of the source document and ignores the document's contents. (All sample stylesheets, input, and output are available in this zip file.)
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">

    <xsl:variable name="sampleString">XML,XSLT,XPath,SVG,XPointer</xsl:variable>
    <xsl:variable name="tokenizedSample"
                  select="tokenize($sampleString,',')"/>

    <xsl:for-each select="$tokenizedSample">
      <xsl:value-of select="."/>
      <xsl:text>! </xsl:text>
    </xsl:for-each>

    Second item in tokenizedSample: {<xsl:value-of select="item-at($tokenizedSample,2)"/>}
    Tenth item in tokenizedSample: {<xsl:value-of select="item-at($tokenizedSample,10)"/>}
    Position of SVG in tokenizedSample: {<xsl:value-of select="index-of($tokenizedSample,'SVG')"/>}
    Position of XSL-FO in tokenizedSample: {<xsl:value-of select="index-of($tokenizedSample,'XSL-FO')"/>}
    End of test.

  </xsl:template>

</xsl:stylesheet>
The single template rule stores a comma-delimited list of W3C XML standard names in a variable called sampleString and then passes that as a parameter to the tokenize() function used to create the tokenizedSample variable, which stores a sequence of strings. The second parameter passed to the function, which tells it where to split the string in the first parameter, is a one-character string consisting of a comma. You don't have to pass a single character as the tokenize() function's second parameter; you can even use a regular expression such as "\s+" for "one or more whitespace characters," with an optional third parameter to the function giving you greater control over the regular expression's behavior.
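As a minimal sketch of that optional third parameter, the stylesheet below splits a string wherever the word "and" appears with whitespace on both sides. The sample string is made up for illustration, and the "i" flag (case-insensitive matching) is assumed to be supported as described in the Functions and Operators draft, so check your processor's documentation before relying on it.

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical sample string; note the mixed-case "and" and "AND". -->
    <xsl:variable name="list">XML and XSLT AND XPath</xsl:variable>

    <!-- The delimiter is the word "and" with whitespace on each side.
         The third argument, 'i', asks for case-insensitive matching. -->
    <xsl:for-each select="tokenize($list,'\s+and\s+','i')">
      <xsl:value-of select="."/>
      <xsl:text>! </xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With the flag in place, both "and" and "AND" should be treated as delimiters, leaving the three tokens XML, XSLT, and XPath. Without it, only the lowercase "and" would match.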
The stylesheet's xsl:for-each loop iterates through the string sequence, outputting an exclamation point and a single space after each. The comma does not show up in any of the strings, because the tokenize() function that split the string at the commas throws these delimiters out.
The next two instructions in the stylesheet try to pull specific strings out of the sequence based on their position. As the stylesheet's output below illustrates, the first call to item-at() is successful, returning the string "XSLT". The lack of text between the second pair of curly braces in the output shows that the second call to item-at() returns nothing, because the tokenizedSample sequence has no tenth item.
XML! XSLT! XPath! SVG! XPointer!
Second item in tokenizedSample: {XSLT}
Tenth item in tokenizedSample: {}
Position of SVG in tokenizedSample: {4}
Position of XSL-FO in tokenizedSample: {}
End of test.
The last two instructions in the stylesheet call the index-of() function, which returns a number showing the position of its second parameter within the sequence passed as its first. The first call returns 4, because "SVG" is the fourth string in the input sequence. The second call returns an empty sequence, which shows up as nothing between the curly braces, because "XSL-FO" isn't in the sequence.
The index-of() and item-at() functions aren't limited to sequences of strings. You can also use them with sequences of nodes, making all kinds of element searching and manipulation tasks easier. For example, with the following input,
<colors>
  <color>red</color>
  <color>green</color>
  <color>blue</color>
  <color>yellow</color>
</colors>
this template rule
<xsl:template match="colors">
  {<xsl:value-of select="item-at((color),3)"/>}
  {<xsl:value-of select="index-of((color),'green')"/>}
</xsl:template>
produces the following output, because "(color)" represents the sequence of color elements within the colors context node:
{blue}
{2}
The XPath 2.0 spec has more about the new sequences.
Tokenizing an SVG Attribute
Let's look at a tokenizing example that attacks a more realistic problem, the SVG polygon element shown above.
<xsl:template match="polygon">
  <poly>
    <xsl:for-each select="tokenize(@points,'\s+')">
      <point>
        <x><xsl:value-of select="substring-before(.,',')"/></x>
        <y><xsl:value-of select="substring-after(.,',')"/></y>
      </point>
    </xsl:for-each>
  </poly>
</xsl:template>
The beginning of this column shows the input and output. The input is an SVG polygon element that has a space between each pair of numbers that represent a point of the polygon and a comma between the x and y coordinates of each pair. Instead of storing the tokenized sequence in a separate variable as the previous example did, this example passes the tokenize() call straight to its xsl:for-each loop, which iterates through the sequence of returned strings, outputting the contents of each inside of a point element. The tokenize() function would have worked on the polygon input if the second parameter passed to it had been a simple, one-character string of a single space, but the regular expression "\s+" is even better, because specifying that the delimiter is one or more whitespace characters in a row lets the function handle any combination of carriage returns, tabs, and spacebar spaces between each number pair.
Within the point element, the template could have used the tokenize() function to split apart the x and y values, but it's less code to just use the XPath 1.0 substring-before() and substring-after() functions. The tokenizing function is great when you don't know how long a list is, but when there are always exactly two items, one on each side of a single delimiter, it only takes two function calls to pull them out.
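For comparison, here is a rough sketch of what the tokenize() version of that template might look like. It is a variation for illustration rather than one of the column's tested samples: the coords variable name is made up, and the positional predicates [1] and [2] stand in for item-at(), since both select an item by its position in a sequence.

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="polygon">
    <poly>
      <!-- Outer split: one token per "x,y" pair. -->
      <xsl:for-each select="tokenize(@points,'\s+')">
        <!-- Inner split: break the current pair at its comma. -->
        <xsl:variable name="coords" select="tokenize(.,',')"/>
        <point>
          <x><xsl:value-of select="$coords[1]"/></x>
          <y><xsl:value-of select="$coords[2]"/></y>
        </point>
      </xsl:for-each>
    </poly>
  </xsl:template>

</xsl:stylesheet>
```

The extra variable is the cost of this approach. It would pay off if each point carried more than two comma-separated values.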
Tokenizing Past and Future XML Data
The combination of tokenize(), item-at(), and index-of() lets you take advantage of something that's always been around in XML 1.0, but that you couldn't do much with before: attributes of type NMTOKENS. You could always declare an attribute to be of this type and then store multiple values in it separated by spaces, but splitting up these lists required either the Perl split function, its equivalent in another language, or lots of code to split it up when using a language that didn't offer such a function, like XSLT 1.0. Now a single function can split it for us, another can check the list for a particular value, and another can pull out a particular item from the list based on its order in the list. I know I'll be using these functions often.
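Here is a minimal sketch of that NMTOKENS scenario. The chapter element and its keywords attribute are hypothetical, and the sketch assumes the exists() function from the Functions and Operators draft is available. It also uses a positional predicate in place of item-at() to pull an item out by position.

```xml
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical input: <chapter keywords="xslt xpath sequences"/> -->
  <xsl:template match="chapter">
    <!-- Split the space-delimited attribute value into a sequence. -->
    <xsl:variable name="keywords" select="tokenize(@keywords,'\s+')"/>

    <!-- Check the list for a particular value: index-of() returns
         an empty sequence when the value is absent. -->
    <xsl:if test="exists(index-of($keywords,'xpath'))">
      <xsl:text>xpath is in the list. </xsl:text>
    </xsl:if>

    <!-- Pull out an item by its position in the list. -->
    <xsl:text>First keyword: </xsl:text>
    <xsl:value-of select="$keywords[1]"/>
  </xsl:template>

</xsl:stylesheet>
```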