Getting Started with XQuery
March 2, 2005
Although the W3C's XQuery language for querying XML data sources is still in Working Draft status, the recent XML 2004 conference showed that there's already plenty of interest and many implementations. While the Saxon implementation may not scale up as much as the disk-based versions that use persistent indexes and other traditional database features, you can download the free version of Saxon, install it, and use XQuery so quickly that it's a great way to start playing with the language in order to learn about what this new standard can offer you.
Running a Query
Let's start with a toy example that demonstrates how to tell Saxon which query to run against which XML, and then we'll move on to examples that show useful queries run against real XML data. For our first two queries, we'll use the following document, which is named data1.xml:
<doc>
<p>this is a sample file</p>
<p>this p has <emph>inline</emph> markup</p>
</doc>
When run from a command line, the following tells Saxon to run the query shown and to send the result to standard output. As with XSLT, the curly braces enclose an expression to be evaluated and replaced by the result of the evaluation. Unlike XSLT, curly braces can be nested in XQuery queries as they get more complex. In this particular case, the curly braces have more to do with the Saxon implementation than XQuery syntax, because they indicate to Saxon that the enclosed string is an actual query and not some other command line option.
java net.sf.saxon.Query {doc('data1.xml')//p[emph]}
(On a Linux machine, I also had to put quotation marks around the expression with curly braces.) This query asks for the p elements in the data1.xml file that have an emph child element. Saxon's XQuery processor responds with the following:
<?xml version="1.0" encoding="UTF-8"?>
<p>this p has <emph>inline</emph> markup</p>
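For readers who want to check what the predicate selects before installing Saxon, here is a minimal Python sketch of the same selection. It is not XQuery: it relies on the limited XPath subset in Python's standard xml.etree.ElementTree module, and the function name p_with_emph is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from above (data1.xml), inlined so the sketch is self-contained.
DATA1 = """<doc>
<p>this is a sample file</p>
<p>this p has <emph>inline</emph> markup</p>
</doc>"""


def p_with_emph(xml_text):
    """Return the serialized p elements that have an emph child."""
    root = ET.fromstring(xml_text)
    # ".//p[emph]" plays the role of //p[emph]: p descendants with an emph child.
    matches = root.findall(".//p[emph]")
    # tostring() also emits the whitespace that follows the element, so strip it.
    return [ET.tostring(p, encoding="unicode").strip() for p in matches]


if __name__ == "__main__":
    for line in p_with_emph(DATA1):
        print(line)
```

Only the second p element is selected, because the first has no emph child.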
(In the remaining examples, I'll omit the XML declaration from the output.) A query doesn't have to be much more complex than this one before it's too long to fit on a command line, so Saxon can accept a query stored in a text file. To demonstrate, I put the query above into its own file, called query1.xqy, without the curly braces from above that told Saxon the role of that string on the command line:
(: Here is an XQuery comment. :)
doc('data1.xml')//p[emph]
(I also added a comment to show how XQuery uses parentheses and colons to delimit comments for the query processor to ignore. As a long-time hater of smileys, I can't say I like the XML Query Working Group's choice of comment delimiters much.) With those two lines stored in query1.xqy, the following command has the same result as the previous one:
java net.sf.saxon.Query query1.xqy
While the query above is more concise than the equivalent XSLT stylesheet, the XSLT version of the query would be very simple, and many have debated whether either language makes the other unnecessary. As with many programming language comparisons, the answer is that while both languages may be able to perform the same functions, each makes certain tasks quicker and easier for the developer than the other. Let's look at some of XQuery's strengths.
Looking for Some Sugar
To really test the usefulness of XQuery, I wanted to use real-world data, so I downloaded a collection of recipes from Squirrel's RecipeML archive that conform to the RecipeML DTD. (Because a cookbook is such an obvious candidate for multiple back-of-the-book indexes, I've often wondered why no Topic Map advocates have created a Topic Map from a collection of RecipeML recipes. The availability of XQuery implementations should make it easier.) As with much of the XML available on the internet, we can't assume that these are all clean, well-formed documents, and several recipe files required a little clean-up before I could start running queries against the collection.
Issuing a query against multiple documents at once is an example of a task that, while not impossible in XSLT, is much easier in XQuery when we use the collection function. (Like all functions mentioned in this article, you can use collection in XSLT 2.0 as well as in XQuery, because it's one of the XQuery 1.0 and XPath 2.0 Functions and Operators. Its use with XQuery generally allows more concise requests than it does with XSLT.) In Saxon, the argument for this function is a URI identifying a file that lists the collection's XML documents in this format:
<collection>
<doc href="_Band__Sloppy_Joes.xml"/>
<doc href="_Cheese__Fricadelle.xml"/>
<!-- more doc elements... -->
<doc href="Walton_Mountain_Coffee_Cake.xml"/>
<doc href="Walty's_Dressing.xml"/>
<doc href="Wan_Tan_(Wonton).xml"/>
</collection>
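To make the role of this catalog file concrete, here is a rough Python sketch of what loading such a collection amounts to: read the catalog, then parse every document that a doc element points to. It only imitates the idea and is not how Saxon is implemented. The function names, the two tiny recipe files, and the assumption that relative href values are resolved against the catalog's own directory are all illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def load_collection(catalog_path):
    """Parse every document listed in a <collection> catalog file."""
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    catalog = ET.parse(catalog_path).getroot()
    docs = []
    for entry in catalog.findall("doc"):
        # Assumption: each href is relative to the catalog's own directory.
        path = os.path.join(base_dir, entry.get("href"))
        docs.append((path, ET.parse(path).getroot()))
    return docs


def demo():
    """Build a two-recipe collection in a temporary directory and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names and content, not taken from the recipe archive.
        for name, title in [("a.xml", "Recipe A"), ("b.xml", "Recipe B")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
                f.write("<recipeml><recipe><head><title>%s</title></head>"
                        "</recipe></recipeml>" % title)
        catalog_path = os.path.join(tmp, "docs.xml")
        with open(catalog_path, "w", encoding="utf-8") as f:
            f.write('<collection><doc href="a.xml"/><doc href="b.xml"/></collection>')
        loaded = load_collection(catalog_path)
        return [root.findtext("recipe/head/title") for _, root in loaded]


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the catalog is just a list of pointers: a query against the collection is a query against every document the list names.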
I named this document docs.xml and put it in a recipeml subdirectory with the 290 or so recipe documents that I extracted from the Squirrel Archive zip files that I downloaded. The first query against this collection lists the title value of all recipes that have the string "sugar" in any item child of the ing ("ingredient") element (carriage return added to queries for readability):
collection('recipeml/docs.xml')/recipeml/recipe/
head/title[//ingredients/ing/item[contains(.,'sugar')]]
The output looks like this:
<title>"Band" Sloppy Joes</title>
<title>"Best" Apple Nut Pudding</title>
<!-- more title elements... -->
<title>Waltons Mountain Coffee Cake</title>
<title>Walton Mountain Coffee Cake</title>
Because XPath 2.0 allows function calls as location steps, this query is simply one big XPath expression. Part of the appeal of XQuery to people with more of a traditional database background and less of an XML geek background is that XQuery also offers a more SQL-like syntax, so that you get the same result from your XQuery processor with this query:
for $ingredient in collection('recipeml/docs.xml')//
    ingredients/ing/item[contains(.,'sugar')]
return $ingredient/../../../head/title
The for clause iterates across a collection of nodes, and the return clause creates the result of the iteration by identifying which node(s) in the collection to return in the expression.
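The difference between the two styles may be easier to see in procedural form. The following Python sketch imitates both: one function keeps a title when any item in its document matches, the other visits each matching item and climbs three levels up to the recipe, as item/../../.. does. The sample recipes are hypothetical (only the element names come from the queries above), the function names are invented, and the standard xml.etree.ElementTree module stands in for a real XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature recipes; only the element names come from the queries.
DOCS = [
    "<recipeml><recipe><head><title>Lemon Cake</title></head>"
    "<ingredients><ing><item>sugar</item></ing>"
    "<ing><item>brown sugar</item></ing></ingredients></recipe></recipeml>",
    "<recipeml><recipe><head><title>Plain Toast</title></head>"
    "<ingredients><ing><item>bread</item></ing></ingredients></recipe></recipeml>",
]


def titles_via_predicate(docs, target="sugar"):
    """Single-path style: keep a title if any item in its document matches."""
    titles = []
    for text in docs:
        root = ET.fromstring(text)
        items = root.findall(".//ingredients/ing/item")
        if any(target in "".join(i.itertext()) for i in items):
            titles.extend(t.text for t in root.findall("recipe/head/title"))
    return titles


def titles_via_for(docs, target="sugar"):
    """for/return style: visit each matching item, then climb to its recipe."""
    titles = []
    for text in docs:
        root = ET.fromstring(text)
        # ElementTree has no parent axis, so build a child-to-parent map.
        parent = {child: p for p in root.iter() for child in p}
        for item in root.findall(".//ingredients/ing/item"):
            if target in "".join(item.itertext()):
                recipe = parent[parent[parent[item]]]  # item/../../..
                titles.append(recipe.findtext("head/title"))
    return titles


if __name__ == "__main__":
    print(titles_via_predicate(DOCS))
    print(titles_via_for(DOCS))
```

In this sketch the loop form emits one title per matching item, so a recipe that lists both "sugar" and "brown sugar" shows up twice. A FLWOR expression likewise concatenates the result of each iteration, so it may be worth checking a real processor's output for repeated titles when a recipe has several matching items.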
These two queries each asked for a list of title elements and got the same result. The output, like the query itself (but unlike an XSLT stylesheet), is not a well-formed XML document. You can make the result well-formed easily enough; the following variation on the last query wraps the result in a sweets element and demonstrates some XQuery features that make queries more flexible.
<sweets>
{
  let $target := 'sugar'
  for $ingredient in collection('recipeml/docs.xml')//
      ingredients/ing/item[contains(.,$target)]
  return $ingredient/../../../head/title
}
</sweets>
As I mentioned above, curly braces in XQuery show an expression to be evaluated and replaced by the result. In the case above, the data returned by the multi-line expression between the braces will appear between the sweets start- and end-tag in the result. One part of this expression is another for expression, which tells the XQuery engine to iterate across the specified set of nodes and then return the title element in each node's recipe. The condition specifying the nodes to iterate through is a little more flexible than its equivalent in previous examples; instead of looking for item elements with the hardcoded string "sugar" as a substring, it looks for the value of the $target variable as a substring. The $target variable is set to the value "sugar" by the let expression preceding the for clause, so the for expression has the same effect that it has in the preceding example, but it's easier to customize to make it search for something else.
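As a loose analogy, the let binding behaves like a function parameter and the sweets element constructor like building a wrapper element around the loop's results. The Python sketch below shows that shape with the standard xml.etree.ElementTree module. The sample recipes and the function name sweets are hypothetical, and the sketch collects one title per matching recipe rather than one per matching item.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature recipes; only the element names come from the queries.
DOCS = [
    "<recipeml><recipe><head><title>Lemon Cake</title></head>"
    "<ingredients><ing><item>sugar</item></ing></ingredients></recipe></recipeml>",
    "<recipeml><recipe><head><title>Plain Toast</title></head>"
    "<ingredients><ing><item>bread</item></ing></ingredients></recipe></recipeml>",
]


def sweets(docs, target="sugar"):
    """Wrap the titles of recipes with a matching item in a sweets element."""
    wrapper = ET.Element("sweets")  # plays the role of <sweets>{ ... }</sweets>
    for text in docs:
        root = ET.fromstring(text)
        for recipe in root.findall("recipe"):
            items = recipe.findall("ingredients/ing/item")
            # The target parameter stands in for the $target variable.
            if any(target in "".join(i.itertext()) for i in items):
                wrapper.extend(recipe.findall("head/title"))
    return ET.tostring(wrapper, encoding="unicode")


if __name__ == "__main__":
    print(sweets(DOCS))
    print(sweets(DOCS, target="bread"))
```

Changing the search term means changing one argument, which is the same convenience the $target variable gives the query.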
The for and let keywords give us the first two letters in FLWOR, an umbrella term (pronounced "flower") used in XQuery for expressions that use the keywords for, let, where, order by, and return. In the words of the W3C Working Draft XQuery 1.0: An XML Query Language, "a FLWOR expression ... supports iteration and binding of variables to intermediate results. This kind of expression is often useful for computing joins between two or more documents and for restructuring data." To someone approaching XQuery from the relational database world, these keywords will be more familiar than the axes, node tests, and predicates of XPath expressions, which is why the first "for $ingredient" example above will feel more natural to a typical database administrator than the example that retrieves the title elements with a single XPath expression.
Let's look at a query that uses the where keyword and builds a web page, complete with links to the documents with the target text.
Feeding Multitudes
Which recipes will feed more than 20 people? The following one-line query takes an XPath-oriented approach to listing the recipe titles that meet this condition.
collection('recipeml/docs.xml')/recipeml/recipe/head/
title[../yield > 20]
A more FLWORy approach allows more flexibility. While the query above says "get the title element for each recipe whose yield is greater than 20," the following says "go through all the documents in the collection, and for any with a yield of more than 20, get the title."
for $doc in collection('recipeml/docs.xml')/recipeml
where $doc/recipe/head/yield > 20
return $doc/recipe/head/title
It may not seem like much of a difference, but once we get past that where clause, the $doc variable gives us a handle to each document meeting the where condition, letting us pull all we want out of it: the title and, if we want, even more. The following query wraps the preceding one with a simple HTML document and uses the document-uri function to add a link to each document meeting the where condition.
(: Create an HTML page linking to recipes that serve more than 20 people. :)
<html><head><title>Food for a Crowd</title></head>
<body>
<h1>Food for a Crowd</h1>
{
  for $doc in collection('recipeml/docs.xml')
  where $doc/recipeml/recipe/head/yield > 20
  return
    <p><a href="{document-uri($doc)}">
    {$doc/recipeml/recipe/head/title/text()}
    </a></p>
}
</body></html>
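The same filter-then-build pattern can be sketched procedurally. In the Python version below, a dictionary of URI-to-document pairs stands in for the collection, so the key plays the part of document-uri($doc). The sample recipes, their file names, and the function name crowd_page are all hypothetical. One deliberate difference: the sketch quietly skips a recipe whose yield is not a plain number, whereas an XQuery processor may report an error for such a value instead.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical documents keyed by the URI they would have in the collection.
DOCS = {
    "recipeml/Chili_For_A_Crowd.xml":
        "<recipeml><recipe><head><title>Chili for a Crowd</title>"
        "<yield>24</yield></head></recipe></recipeml>",
    "recipeml/Omelet.xml":
        "<recipeml><recipe><head><title>Omelet</title>"
        "<yield>2</yield></head></recipe></recipeml>",
}


def crowd_page(docs, minimum=20):
    """Build an HTML page linking to recipes that serve more than `minimum`."""
    links = []
    for uri, text in docs.items():       # for $doc in collection(...)
        root = ET.fromstring(text)       # root is the recipeml element
        try:
            servings = float(root.findtext("recipe/head/yield", default=""))
        except ValueError:
            continue                     # skip missing or non-numeric yields
        if servings > minimum:           # where ... > 20
            title = root.findtext("recipe/head/title", default="")
            links.append('<p><a href="%s">%s</a></p>'
                         % (escape(uri, quote=True), escape(title)))
    return ("<html><head><title>Food for a Crowd</title></head>\n"
            "<body>\n<h1>Food for a Crowd</h1>\n%s\n</body></html>"
            % "\n".join(links))


if __name__ == "__main__":
    print(crowd_page(DOCS))
```

Because each loop iteration holds the whole document, adding more to each link (the yield figure, say) only means reading another value from the same root.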
In the future, we can look forward to more server-side XQuery support that lets sites dynamically generate HTML pages using XQuery queries. With XQuery's ability to query combinations of XML and relational databases, it could end up playing a huge role in many dynamically generated web sites.
Extreme Recipes
A let clause can call functions to compute values that you can then use in a where clause or an XPath predicate. The following query checks for the maximum yield value and then pulls out any recipes with that yield figure:
(: Which recipe(s) serves the most people? :)
let $maxYield := max(collection('recipeml/docs.xml')/recipeml/recipe/head/yield)
return collection('recipeml/docs.xml')/recipeml/recipe[head/yield = $maxYield]
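The two-step shape of this query (compute an aggregate, then filter by it) can be sketched in Python as follows. The three sample recipes and the function name biggest_recipes are hypothetical, and the sketch assumes every yield value is a plain number.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipes; two of them tie for the largest yield.
DOCS = [
    "<recipeml><recipe><head><title>Omelet</title>"
    "<yield>2</yield></head></recipe></recipeml>",
    "<recipeml><recipe><head><title>Chili for a Crowd</title>"
    "<yield>30</yield></head></recipe></recipeml>",
    "<recipeml><recipe><head><title>Wedding Punch</title>"
    "<yield>30</yield></head></recipe></recipeml>",
]


def biggest_recipes(docs):
    """Return the titles of the recipe(s) with the maximum yield."""
    recipes = []
    for text in docs:
        for recipe in ET.fromstring(text).findall("recipe"):
            recipes.append((recipe.findtext("head/title"),
                            float(recipe.findtext("head/yield"))))
    # Step 1, like the let clause: compute the maximum once.
    max_yield = max(servings for _, servings in recipes)
    # Step 2, like the predicate: keep every recipe that matches it.
    return [title for title, servings in recipes if servings == max_yield]


if __name__ == "__main__":
    print(biggest_recipes(DOCS))
```

Filtering by equality with the maximum, rather than taking a single top item, is what lets the query return every recipe when several tie for the largest yield.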
In part two of this article, we'll see how XQuery's ability to sort and aggregate data lets us create a list of ingredient headings from the recipe collection, with each heading followed by a list of links to recipes that contain that ingredient. We'll also see how user-defined functions in queries can expand the possibilities for how you select and use the data in your XML documents with XQuery.