Getting Started with XQuery, Part 2
March 23, 2005
In part one of this article, we saw how the free version of the Saxon program lets you jump right in with XQuery, the increasingly popular (although not quite finished) W3C standard for querying XML data sources. We learned how to pass Saxon XQuery queries to execute on the command line or from a disk file, and we looked at queries that went through a collection of RecipeML documents to find which recipes used a certain ingredient, which served more than twenty people, and which served the most people. This week, we'll learn more about how a query can manipulate the XML that it pulls out of a collection, and how user-defined functions can provide even greater flexibility in the sorting and arrangement of that data.
What's in the Fridge
Sometimes, the decision of what to make for dinner is based on what ingredients you already have. A list of all the ingredients used in a cookbook, with each followed by a list of the recipes that use it, can be very handy when planning a meal. The query below demonstrates one approach to creating this list, with URLs of the recipes included in the result. I could have written the query to create an HTML file as output, like I did with the "Food for a Crowd" query in part one of this article, but instead this query creates a new XML file that I would most likely convert to HTML (and perhaps other formats) using XSLT.
(: A unique, sorted list of all unique ingredients in the recipe collection, with URLS to link to the recipes. :) <ingredients> { for $ingr in distinct-values(collection('recipeml/docs.xml')/recipeml/recipe/ingredients/ing/item) order by $ingr return <item name="{$ingr}"> { for $doc in collection('recipeml/docs.xml') where $doc/recipeml/recipe/ingredients/ing/item = $ingr return <title url="{document-uri($doc)}"> {$doc/recipeml/recipe/head/title/text()} </title> } </item> } </ingredients>
The distinct-values
function lets the for
clause
iterate across a collection of the specified nodes with no duplicates, and the order
by
clause sorts the upcoming output alphabetically by the values of the
item
elements that the for
clause finds. The return
keyword following the order by
line doesn't simply return the found nodes, but
takes each one and goes further with it: a nested for
clause searches for
recipes with that ingredient and outputs a title
element with the recipe name
as content and the recipe's URI in a url
attribute.
Note that the contents of each title
element in the output is the
text()
node child of the $doc/recipeml/recipe/head/title
pulled
from the recipe. If I had put the whole $doc/recipeml/recipe/head/title
element
in there, I would have gotten its start- and end-tags as well. I didn't want those
because
the query includes a title
start-tag for the output that includes the new
url
attribute.
An excerpt from the output is shown below. For an HTML version, a simple XSLT
stylesheet could convert the item/@name
values to h2
headers and
each title
element to an HTML p
element containing the recipe name
as a link to the recipe.
<ingredients> <!-- some item elements removed --> <item name="(12-oz) tomato paste"> <title url="file:/C:/dat/recipeml/_Best_Ever__Pizza_Sauce.xml"> "Best Ever" Pizza Sauce</title> </item> <item name="Baking Powder"> <title url="file:/c:/dat/recipeml/_Blondie__Brownies.xml"> "Blondie" Brownies</title> <title url="file:/c:/dat/recipeml/Walnut_Pound_Cake.xml"> Walnut Pound Cake</title> </item> <item name="Baking Soda"> <title url="file:/c:/dat/recipeml/_Faux__Sourdough.xml"> "Faux" Sourdough</title> </item> <item name="Baking potatoes"> <title url="file:/c:/dat/recipeml/_Indian_Chili_.xml"> "Indian Chili"</title> </item> <item name="Baking powder"> <title url="file:/c:/dat/recipeml/_Best__Apple_Nut_Pudding.xml"> "Best" Apple Nut Pudding</title> <title url="file:/c:/dat/recipeml/_Gold_Room__Scones.xml"> "Gold Room" Scones</title> <title url="file:/c:/dat/recipeml/_Outrageous_Chocolate_Chipper.xml"> "Outrageous" Chocolate-Oatmeal Chipper (Cooki</title> </item> <item name="Baking soda"> <title url="file:/c:/dat/recipeml/_First__Ginger_Cookies.xml"> "First" Ginger Molasses Cookies</title> <title url="file:/c:/dat/recipeml/_Foot_in_the_Cake.xml"> "Foot in the Fire" Chocolate Cake</title> </item> <item name="Tomato paste"> <title url="file:/C:/dat/recipeml/Crawfish_Etouff'ee.xml"> "Frank's Place" Crawfish Etouff'ee </title> <title url="file:/C:/dat/recipeml/Hamburger____Ground_Meat_Balti.xml"> "Hamburger" / Ground Meat Balti </title> <title url="file:/C:/dat/recipeml/Indian_Chili_.xml"> "Indian Chili"</title> </item> <!-- some item elements removed --> </ingredients>
The bolded text gives clues to potential improvements. The query treats "Baking
Powder" and "Baking powder" as two different ingredients, but I'd rather see recipes
that
require baking powder appearing in the same list. Another problem is that for many
ingredients in the RecipeML files, the item
element was used to store too much
information, like in the following:
<ing><item> (12-oz) tomato paste </item></ing>
It's too much because RecipeML offers a structure to break down this information, making the data richer. This ingredient could have been broken down like this:
<ing> <amt> <qty>12</qty> <unit>oz</unit> </amt> <item>tomato paste</item> </ing>
As the query output above shows, this decision meant that "(12-oz) tomato paste" and "Tomato paste", even if their case was normalized, would still be treated as two different ingredients.
Anyone familiar with data created by volunteer efforts will recognize these issues and others that can cause inconsistencies, such as the occasional use of more than one space when one would do. Fortunately, XQuery and XSLT 2.0 let you create customized processing by writing and calling user-defined functions.
Normalized Ingredients
Our final XQuery example query declares two new functions to normalize ingredient names so that it can use cleaner data for the subheads that list cookbook ingredients. Because new functions in a query must be in a declared namespace, the query begins by declaring a namespace and assigning a namespace prefix to it, just like user-defined functions in XSLT 2.0 stylesheets.
(: A unique, sorted list of all unique ingredients in the recipe collection, with URLs to link to them. Ingredient names get normalized by functions declared in the query prolog. :) declare namespace sn = "http://www.snee.com/ns/misc/" ; declare function sn:normIngName($ingName) as xs:string { (: Normalize ingredient name. :) (: remove parenthesized expression that may begin string, e.g. in "(10 ozs) Rotel diced tomatoes":) let $normedName := replace($ingName,"^\(.*?\)\s*","") (: convert to all lower-case :) let $normedName := lower-case($normedName) (: replace multiple spaces with a single one :) let $normedName := normalize-space($normedName) return $normedName }; declare function sn:normIngList($ingList) as item()* { (: Normalize a list of ingredient names. :) for $ingName in $ingList return sn:normIngName($ingName) }; <ingredients> { let $normIngNames := sn:normIngList(collection('recipeml/ docs.xml')//ing/item) for $ingr in distinct-values($normIngNames) order by $ingr return <item name="{$ingr}"> { for $doc in collection('recipeml/docs.xml'), $i in $doc/recipeml/recipe/ingredients/ing/item where sn:normIngName($i) = $ingr return <title url="{document-uri($doc)}"> {$doc/recipeml/recipe/head/title/text()} </title> } </item> } </ingredients>
Together, this query's namespace and function declarations comprise a prolog for the query. As you can tell from the previous queries, a prolog is not required, but it gives your query new options by customizing the query's environment before it's executed. Prologs can also include variable declarations, imports of other schema modules, and imports of W3C XSD schemas to assign data types or to validate input and output data. (A schema can even validate referenced names in a query. For example, if you misspell the name of an element as a name not defined by the schema, static analysis of the query identifies the error before the XQuery processor even looks at the source data.) Note how the prolog statements are separated by semicolons.
The sn:normIngName
function uses the replace
function, which accepts regular expressions as arguments, to clean up the ingredient
values.
(See my "Transforming XML" column Regular Expression Matching in XSLT
2 for an introduction to using this function with XSLT 2.0.) In this query, it removes
leading parenthetical expressions and converts the whole string to lower-case, and
then the
XPath 1.0 normalize-space function cleans up whitespace in the string. The second declared
function, sn:normIngList
, uses the first one to normalize the ingredient list
passed to it so that the query's main for
clause can iterate over a normalized
list.
I set and referenced the $normIngNames
variable just to make the
main for
clause more readable. The rest of the query is pretty similar to the
earlier one that created a list of recipes with ingredient subheads, with one important
difference: instead of going through the recipes looking for ingredients that match
the
current subhead, the query goes through the ingredients themselves so that it can
normalize
them with another call to sn:normIngName
before comparing them to the subhead
ingredient. This way, it knows that a recipe with "(12-oz) tomato paste" as an ingredient
does belong under the "tomato paste" subhead.
The for
clause iterates over individual ingredient
item
elements by taking advantage of the XQuery for
clause's
ability to associate multiple variables with expressions; in this case, "$doc" with
"collection('recipeml/docs.xml')" and "$i" with "$doc/recipeml/recipe/ingredients/ing/item".
(Note the comma separating the two associations at the end of the first bolded line
in the
for
clause.) This way, it can check each ing/item
within each
document.
The result of this query resembles the output of the previous query, but has fewer headings, because the normalization has resulted in the combination of the "Baking Powder" and "Baking powder" lists, the "(12-oz) tomato paste" and "Tomato paste" lists, and others whose names were cleaned up by the normalization functions.
What Next
Official W3C XQuery specs include XQuery 1.0: An XML Query Language, XQuery 1.0 and XPath 2.0 Formal Semantics, the XQuery 1.0 and XPath 2.0 Data Model, XSLT 2.0 and XQuery 1.0 Serialization, and XQuery 1.0 and XPath 2.0 Functions and Operators. Except for the last one, they're all pretty high-level, and the Functions and Operators document, while easy enough to follow, won't teach you much about how to use the different structures that you can incorporate into a query. I've learned the most from the W3C's XML Query Use Cases document; instead of being organized around XQuery features, it's organized around the specific tasks that people want to accomplish with an XML query language. Each use case begins with a plain English description, followed by the query itself and the result.
While writing my sample queries, I also found the use case queries to be useful models for formatting style as I added carriage returns and whitespace to XQuery's strange mixture of XML markup, SQL-like keywords, curly braces, and semicolons. Many of my queries could have been expressed other ways, perhaps more efficiently, but they do work as shown, and I was surprised how quickly the syntax I put together for my first queries worked.
Another great place to learn about the workings of XQuery queries is the sample queries page for eXist, the open source native XML database. eXist is also a good tool to try when you want to scale your XQuery use beyond the memory-based storage used by Saxon. At XML 2004, vendors such as MarkLogic, DataDirect, Oracle and IBM were all very happy to talk about XQuery products with varying degrees of price and commercial readiness.
Don't expect too much in the way of commercial readiness just yet; as I write this, all the W3C specifications that affect XQuery are still Working Drafts. For playing around with XQuery to see what role it might play in your tool set, the free version of Saxon will painlessly give you a quick start at a great price.