Regular Expression Matching in XSLT 2
June 4, 2003
Most developers who have worked with Perl, awk, or other utilities with a strong heritage in Unix text processing have learned to love regular expressions because of the flexibility they give you to describe which text you want to manipulate. While nearly any programming language (or for that matter, any word processor or spreadsheet) lets you search for a specific string of text and replace it with another string, regular expressions let you search for something less specific by supplying patterns that describe what you want. For example, to replace all dollars and cents expressions with the string "$xx.xx", you can assemble a regular expression to show that you want to search for any string consisting of a dollar sign followed by one or more numeric digits, a period, and then two more digits. Perl, awk, and their relatives can search for text fitting such a pattern and replace it with "$xx.xx" or any other string. More advanced use of regular expressions lets you incorporate pieces of the found text in the replacement text.
Because XSLT is for manipulating XML documents, and XML documents are text, XSLT developers with any experience in Unix-based utilities often wish that XSLT would let them use regular expressions. XSLT 2.0 grants this wish. XPath 2.0 gives XSLT 2.0 three new functions that can take advantage of regular expression matching, and XSLT 2.0 has three new instructions and an accompanying function for using regular expressions to manipulate strings. While neither of these 2.0 specifications is an approved W3C Recommendation yet, Mike Kay's Saxon XSLT processor has an experimental development branch that lets us play with proposed new XSLT 2.0 features. As I write this, the latest experimental release is 7.5, and the latest release of the branch optimized for stable XSLT 1.0 use is 6.5.2.
Specifying Regular Expression Patterns
The regular expression syntax used to indicate patterns like "one or more numeric digits" or "between three and six white space characters" is cryptic enough that much of the marketing of the Omnimark program, which was popular for transforming SGML and later XML documents, was built around the greater readability of its more English-like syntax for accomplishing many of the same things. Still, like almost any syntax, the overall structure of UNIX-based regular expressions is fairly straightforward to understand once you get used to the most popular parts. The following table gives you an overview of the most important parts, and the on-line appendix to an out-of-print O'Reilly book on CGI programming provides a more complete reference. (There's even an entire O'Reilly book on regular expressions.) Anyone who's worked with DTDs will recognize the use of the question mark, asterisk, and plus sign; the newline character will be familiar to nearly all programmers.
\n | A newline character. |
. | Any single character except \n. |
[a-f] | Any single lower-case letter in the range a through f. |
\d | Any numeric digit. The same as [0-9]. |
\s | Any single whitespace character: a tab, carriage return, linefeed, or spacebar space. |
* | After any of the symbols shown above, this means "zero or more characters fitting this description." |
+ | Like the asterisk, but meaning "one or more characters fitting this description." |
? | Like the asterisk, but meaning "zero or one character fitting this description." |
{4} | Like the asterisk, but meaning "four characters fitting this description." Because curly braces are used in XSLT stylesheets to show which parts of an attribute value template are expressions to be evaluated, be careful when using these in a regular expression specified in an attribute value: escape the curly braces by repeating them (in this case, {{4}}) to tell an XSLT 2.0 processor not to treat the curly braces as attribute value template expression delimiters. |
\+ | A literal plus sign. The backslash character escapes the character after it, telling the processor not to treat it as a special regular expression character. |
XPath 2.0's Regular Expression Functions
XPath 2.0 offers three new functions that use regular expressions:
- tokenize(), which I described last month.
- matches(), which returns a boolean true or false depending on whether the text in its first argument matches the regular expression in its second argument.
- replace(), which searches the string in its first argument for the pattern in its second argument, replacing any found occurrence with the string in the third argument and returning the result.
All three functions support an optional extra argument that lets you specify whether you want the matching to be case-insensitive and whether you want it to operate in multiline mode. The Regular Expression Syntax section of the XQuery 1.0 and XPath 2.0 Functions and Operators specification describes these flags in more detail; it also provides good background on the roots of the XPath regular expression support in the W3C XML Schema specification.
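As a minimal sketch of that extra argument, the following stylesheet assumes a processor that implements the flags as the Functions and Operators specification describes them: "i" for case-insensitive matching and "m" for multiline mode. The flagsDemo, noFlag, iFlag, and mFlag element names are made up for illustration, and the stylesheet ignores the content of whatever input document you give it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Illustrative only: the output element names are hypothetical. -->
  <xsl:template match="/">
    <flagsDemo>
      <!-- No flags: matching is case-sensitive, so "peanut" is not found. -->
      <noFlag>
        <xsl:value-of select='matches("Peanut Butter", "peanut")'/>
      </noFlag>
      <!-- "i" flag: case is ignored, so the same pattern now matches. -->
      <iFlag>
        <xsl:value-of select='matches("Peanut Butter", "peanut", "i")'/>
      </iFlag>
      <!-- "m" flag: ^ and $ match at the start and end of each line,
           so only the second line of the two-line string is replaced. -->
      <mFlag>
        <xsl:value-of select='replace("milk&#10;candy", "^candy$", "xx", "m")'/>
      </mFlag>
    </flagsDemo>
  </xsl:template>
</xsl:stylesheet>
```

The flags go in the third argument of matches() and tokenize() and in the fourth argument of replace(), because replace() already uses its third argument for the replacement string.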
The following template rule copies a p element, making two changes to it. It adds a matchesPattern attribute whose value will be either true or false, depending on whether the element's contents (passed to the function by the period in the first argument to the matches function) matches the regular expression in the second argument. After this new attribute, the contents of the element will be the contents of the input p element with all occurrences of the pattern described by the regular expression replaced by the string "$xx.xx". (All examples in this article are available in this zip file.)
<xsl:template match="p">
  <xsl:copy>
    <xsl:attribute name="matchesPattern">
      <xsl:value-of select='matches(.,".*\$\d+\.\d{2}.*")'/>
    </xsl:attribute>
    <xsl:value-of select='replace(., "\$\d+\.\d{2}","\$xx.xx")'/>
  </xsl:copy>
</xsl:template>
Let's look more closely at the regular expressions passed to both functions. The second, in plain English, says "a dollar sign followed immediately by numeric digits, a period, and two more numeric digits." The regular expression passed to the matches() function says the same thing, but allows for the possibility of text before or after the matched expression, so that the attribute value will be "true" if the matching expression is in the p element at all, regardless of whether there is any text before or after it. To break down this longer regular expression into pieces:
.* | Zero or more of any characters other than a newline. A plus sign instead of an asterisk would have meant "look for one or more characters before we get to the dollar sign," so a paragraph starting with the string "$19.99" wouldn't have matched. |
\$ | A literal dollar sign. A dollar sign has a special meaning in regular expressions, anchoring a match to the end of the string (or, in multiline mode, the end of a line), but we don't want that here, so the backslash is included to show that we really mean the dollar sign character. |
\d+ | One or more numeric digits. |
\. | A single period. In regular expressions, a period can be used to represent any character (as we saw at the very beginning of this regular expression) but we're looking for an actual period here, so it has a backslash like the one before the dollar sign. |
\d{2} | Exactly two digits -- no more or less. |
.* | Zero or more of any characters after those final two digits. |
The regular expression passed to the replace function omits the .* at the beginning and the end, because this function isn't asking "is this pattern in there among the other text?" but instead telling the XSLT processor which text to replace with "$xx.xx". We don't want it to replace all the text in the p element, which would happen if we did put the .* at the beginning and end of the regular expression.
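To see the difference, here is a small sketch that applies both patterns to the text of the first sample paragraph. It assumes an XSLT 2.0 processor; the replaceDemo, narrow, and wide element names are invented for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Illustrative only: the output element names are hypothetical. -->
  <xsl:template match="/">
    <replaceDemo>
      <!-- The pattern covers only the dollar amount,
           so only the dollar amount is replaced. -->
      <narrow>
        <xsl:value-of
          select='replace("The milk costs $1.99.", "\$\d+\.\d{2}", "\$xx.xx")'/>
      </narrow>
      <!-- The leading and trailing .* make the pattern cover the whole
           string, so the whole string is replaced. -->
      <wide>
        <xsl:value-of
          select='replace("The milk costs $1.99.", ".*\$\d+\.\d{2}.*", "\$xx.xx")'/>
      </wide>
    </replaceDemo>
  </xsl:template>
</xsl:stylesheet>
```

The narrow element should contain "The milk costs $xx.xx." while the wide element should contain only "$xx.xx", because the greedy .* on each side swallows the rest of the sentence into the match.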
Here's the sample input that I used to test the template rule above:
<sample>
  <p>The milk costs $1.99.</p>
  <p>The newspaper is $1.</p>
  <p>Peanut butter is $2.49, and the candy bar is $0.65.</p>
</sample>
The second p element has a dollar figure with no figure for cents after the period, so it doesn't fit the pattern. Here's what the template rule above does to this input:
<?xml version="1.0" encoding="UTF-8"?><sample>
  <p matchesPattern="true">The milk costs $xx.xx.</p>
  <p matchesPattern="false">The newspaper is $1.</p>
  <p matchesPattern="true">Peanut butter is $xx.xx, and the candy bar is $xx.xx.</p>
</sample>
Nothing in the second p element matches the pattern, so the inserted matchesPattern value is "false," and no text in that element gets replaced. In the first p element, the pattern was matched once, and in the third, twice, as you can see by the "$xx.xx" strings in each.
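The replace() function can also reuse pieces of the matched text in the replacement string. The sketch below assumes the behavior described in the Functions and Operators specification, where parentheses in the pattern capture text and $1, $2, and so on in the replacement string refer back to it. The stylesheet wrapper and the identity template are additions for illustration and are not part of the example above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other rule handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Spell out each dollars-and-cents figure. The first pair of
       parentheses captures the dollars and the second the cents;
       $1 and $2 in the replacement string refer to those captures. -->
  <xsl:template match="p">
    <xsl:copy>
      <xsl:value-of
        select='replace(., "\$(\d+)\.(\d{2})", "$1 dollars and $2 cents")'/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample input, this should turn the third paragraph into "Peanut butter is 2 dollars and 49 cents, and the candy bar is 0 dollars and 65 cents." Note that an unescaped dollar sign in the replacement string introduces a group reference, which is why the earlier example wrote the literal dollar sign as "\$".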
XSLT 2.0's Regular Expression Instructions
In addition to letting you describe patterns and then finding out if they exist in text, regular expression support in a language like Perl lets you find out exactly what text matched that pattern, and you can then use that text in your program logic. If you put parentheses around any parts of your regular expression, the matching process will remember what text matched the part of the expression in the parentheses; then, in subsequent code, $1 will hold the text from inside the first parentheses, $2 from the second, and so forth.
When using XSLT 2.0's xsl:analyze-string element, you can put parentheses around the parts of the expression that you want to capture and then use the regex-group() function to retrieve a particular matched piece. Because you can add new markup around these pieces before adding them to the result tree, you can actually have richer, more granular markup in your output than you found in your input.
Let's look at an example. In the following document, the addrLine2 elements usually have the city, two-letter state code, and the zip code for a United States address:
<addresses>
  <address>
    <name>Richard Mutt</name>
    <addrLine1>30 Main St.</addrLine1>
    <addrLine2>New Haven, CT 06460</addrLine2>
  </address>
  <address>
    <name>Nanker Phelge</name>
    <addrLine1>1432 Broad St.</addrLine1>
    <addrLine2>Phoenix Arizona</addrLine2>
  </address>
  <address>
    <name>Billy Shears</name>
    <addrLine1>1 Grand View Crest</addrLine1>
    <addrLine2> Lansing, MI 22934-2234 </addrLine2>
  </address>
</addresses>
The data isn't consistently clean. The second address element has the state name spelled out and no comma or zip code, and the third one has extra white space -- even a carriage return, which would present extra problems to a Perl program processing the data, but not to XSLT 2.0.
The following template rule splits up the addrLine2 elements, when it can, into separate city, state, and zip elements.
<xsl:template match="addrLine2">
  <xsl:variable name="elValue" select="."/>
  <xsl:analyze-string select="$elValue"
      regex="\s*(.*)\s*,\s*([A-Z]{{2}})\s+(\d{{5}}(\-\d{{4}})?)\s*">
    <xsl:matching-substring>
      <city><xsl:value-of select="regex-group(1)"/></city>
      <state><xsl:value-of select="regex-group(2)"/></state>
      <zip><xsl:value-of select="regex-group(3)"/></zip>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <addrLine2>
        <xsl:value-of select="$elValue"/>
      </addrLine2>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
As with many regular expressions, "\s*" ("zero or more white space characters") appears several times to account for all the optional white space in the input. The "\s+" accounts for the white space after the state code; there must be at least one white space character and there may be more.
There are three pieces of information that we want to pull out of addrLine2:
- The city name, which we'll assume is all the characters from the beginning (not counting any leading spaces) up to the first comma (not counting any spaces just before the comma).
- The state code, which we'll assume is two capital letters right after the comma (and after any space that may be right after the comma). In Perl, we'd say "two upper-case letters in a row" as "[A-Z]{2}", but in XSLT we must double up the curly braces to tell the XSLT processor that we're not delimiting an attribute value template expression to evaluate.
- The zip code. We'll assume at least one space after the state code, and then we want five numeric digits optionally followed by a hyphen and four more digits. The curly braces used here must also be doubled up.
If we're saving three pieces of information from the matched text, why are there four pairs of parentheses? Because in addition to identifying substrings to save, extra parentheses can serve the same purpose in regular expressions that they do in DTD content models: they can identify sequences that have the asterisk, plus sign, or question mark applied to them. In this case, "(\-\d{{4}})?" means that the entire hyphen-plus-four-digits sequence is optional.
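The following sketch shows how nested parentheses are numbered, using a shortened version of the pattern with only the state and zip code parts. It assumes that groups are numbered by the position of their opening parenthesis and that regex-group() returns an empty string for a group that took no part in the match, as the XSLT 2.0 specification describes. The groupDemo and match element names and the two test strings are made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Illustrative only: element names and test strings are hypothetical. -->
  <xsl:template match="/">
    <groupDemo>
      <xsl:for-each select="('CT 06460', 'MI 22934-2234')">
        <!-- Group 1: state code. Group 2: the whole zip code.
             Group 3: the optional hyphen-plus-four-digits part,
             nested inside group 2. Curly braces are doubled because
             the regex attribute is an attribute value template. -->
        <xsl:analyze-string select="."
            regex="([A-Z]{{2}})\s+(\d{{5}}(\-\d{{4}})?)">
          <xsl:matching-substring>
            <match state="{regex-group(1)}"
                   zip="{regex-group(2)}"
                   plus4="{regex-group(3)}"/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </groupDemo>
  </xsl:template>
</xsl:stylesheet>
```

For the first string the plus4 attribute should be empty, and for the second it should hold "-2234" while zip holds the full "22934-2234". The single curly braces in the match element's attributes are the opposite case: there we do want the processor to evaluate an expression.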
The xsl:analyze-string element has two optional child elements: xsl:matching-substring processes any strings matched in the xsl:analyze-string element's regex attribute, and xsl:non-matching-substring processes strings that don't match. The regex-group() function names which matched string you want to use inside of the xsl:matching-substring element; pass it a 1 to get the first, a 2 to get the second, and so forth. The example above uses it to plug the three matched values inside new city, state, and zip elements created for the output.
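When the pattern matches only parts of the input string, the two child elements take turns: each matching stretch goes to xsl:matching-substring, and the text between matches goes to xsl:non-matching-substring. Here is a minimal sketch of that, applied to the earlier dollars-and-cents sample document. The price element and its dollars and cents attributes are invented for this illustration, and the identity template is there only to make the stylesheet complete.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that no other rule handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <xsl:copy>
      <xsl:analyze-string select="." regex="\$(\d+)\.(\d{{2}})">
        <!-- Each dollar amount: wrap it in a (hypothetical) price
             element. Inside this element, "." is the matched text. -->
        <xsl:matching-substring>
          <price dollars="{regex-group(1)}" cents="{regex-group(2)}">
            <xsl:value-of select="."/>
          </price>
        </xsl:matching-substring>
        <!-- Everything between the amounts: copy it through unchanged. -->
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With this rule, "The milk costs $1.99." should come out with the text "The milk costs " and the final period untouched and the "$1.99" wrapped in a price element whose dollars attribute is "1" and whose cents attribute is "99".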
Let's take a look at the output:
<?xml version="1.0" encoding="UTF-8"?><addresses>
  <address>
    <name>Richard Mutt</name>
    <addrLine1>30 Main St.</addrLine1>
    <city>New Haven</city><state>CT</state><zip>06460</zip>
  </address>
  <address>
    <name>Nanker Phelge</name>
    <addrLine1>1432 Broad St.</addrLine1>
    <addrLine2>Phoenix Arizona</addrLine2>
  </address>
  <address>
    <name>Billy Shears</name>
    <addrLine1>1 Grand View Crest</addrLine1>
    <city>Lansing</city><state>MI</state><zip>22934-2234</zip>
  </address>
</addresses>
For the first and third address elements, the regular expression found all of the relevant parts of the address and output them as their own elements. For the second, it didn't find them, so the xsl:non-matching-substring element just copied the original to the output.
The parsing power that regular expressions add to XSLT lets you output XML with more value than your input XML, because XML that identifies data at a finer-grained level is XML that you can do more with. This example just showed a simple addition of tags around the found data; combining these elements and functions with other XSLT and XPath elements and functions will make for some really impressive stylesheets. I can't wait to play with it more and to see what other XSLT developers do with it.