XQuery, XSLT, and OmniMark: Mixed Content Processing
December 6, 2006
Alexander Boldakov, Maxim Grinev, and Kirill Lisovsky
Document-oriented XML usually has a highly irregular structure in which elements may be mixed in unpredictable ways. Processing such XML requires advanced data-driven facilities: push-style processing enriched with transformation rules, and side-effect-free updates. In this article we highlight such facilities in three XML-native languages: XQuery, XSLT, and OmniMark; and analyze the applicability of these languages and their combinations to document-oriented XML processing. Because in many practical applications the data comes as the result of a database query, we also examine several approaches to combining XQuery with XSLT or OmniMark for document-oriented XML processing over a database system.
What is notable about processing document-oriented XML data is that a particular XML element can appear virtually anywhere in the content (i.e., at any level of the hierarchy of the XML document tree and intermixed with any other elements). When processing such elements, one usually wants to preserve their relative positions among the other elements in the XML document tree. In other words, some elements are to be replaced while others are to be preserved. The replacement for an element may consist of nothing, another element, or a sequence of elements. Below we provide a number of particular examples of such replacements.
XQuery Versus XSLT and OmniMark
The primary approach to processing document-oriented XML data is data-driven transformation (where the order of the output is dictated by the order of the input) as opposed to code-driven transformation (where the order of the output is dictated by XSLT stylesheets, OmniMark rules, or XQuery queries).
Using data-driven transformation, it is very easy to preserve the relative position of elements being processed. In XSLT and OmniMark, data-driven transformations can be naturally expressed in push style using transformation rules.
Let us consider an example. Suppose we need to process a document-oriented XML document (doc.xml) as follows: replace every element named "a" with an element named "b" that contains the content of "a" wrapped in "*" characters. This is how it looks in XSLT.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="a">
    <b>*<xsl:value-of select="text()"/>*</b>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
The same can be expressed in OmniMark as follows.
element a
  output "<b>*" || "%c" || "*</b>"

element #implied
  output "<%q>%c</%q>"

process
  do xml-parse scan file "doc.xml"
    output "%c"
  done
As XQuery has no support for push style--it is a pure pull-style language--the only way to express such a transformation in XQuery is to use a polymorphic recursive function. The function traverses the source document and reconstructs it, replacing only the required elements. The following recursive function implements the same transformation as the previous XSLT example.
declare function local:traverse-replace($n as node()) as node()
{
  typeswitch($n)
    case $a as element(a) return
      <b>*{$a/text()}*</b>
    case $e as element() return
      element { fn:local-name($e) }
              { for $c in $e/(* | text()) return local:traverse-replace($c) }
    case $d as document-node() return
      document { for $c in $d/* return local:traverse-replace($c) }
    default return $n
};
The transformation can be applied to a whole document by invoking the local:traverse-replace function on the root node of the document, as follows:

local:traverse-replace(doc("doc.xml"))
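For concreteness, consider a minimal hypothetical doc.xml such as the following (any document-oriented content with embedded a elements would do); the output shown after it is what each of the implementations above produces from this input.

Example: doc.xml (hypothetical input)

<doc>
  <p>The <a>push</a> style is <a>data-driven</a>.</p>
</doc>

Resulting output

<doc>
  <p>The <b>*push*</b> style is <b>*data-driven*</b>.</p>
</doc>

Note that the a elements are replaced in place, while the surrounding elements and text keep their relative positions, which is exactly the property we want to preserve when processing document-oriented XML.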
Another way to accomplish such a transformation in XQuery has recently been introduced by the W3C in the "XQuery Update Facility." This facility extends XQuery with the transform operator, which allows performing data-driven XML transformations in a way that is very different from all the previous approaches.
- In all the previous examples, we had to express the reconstruction of the whole document, including even those elements that remain unchanged. Using transform, you can avoid reconstructing the elements that remain unchanged.
- Another difference lies in the execution models. Both the XQuery recursive function and the push-style approach (XSLT and OmniMark) inherently imply an execution model based on a sequential scan: the executor scans the whole document in order to process it. In contrast, transform can be implemented using a random-access execution model that avoids sequentially scanning all the data and instead employs alternative ways to access the required data (mainly indices). The possibility of implementing transform via a random-access execution model makes it suitable for efficient support in database systems.
The main idea of the XQuery transform operator is to employ traditional in-place updates for data transformations. The semantics of in-place updates are modified to avoid side effects. That is why we refer to transform as a side-effect-free update. Semantically, instead of modifying the document, updates are evaluated on a new copy of it. Operationally, transform can be implemented without actual data copying (e.g., using a shadow mechanism as proposed in [Rekouts2006]).

The above example can be expressed via XQuery transform as follows.
transform
  copy $new := doc("doc.xml")
  modify
    for $a in $new//a
    do replace $a with <b>*{$a/text()}*</b>
  return $new
It is worth noting that we currently do not know of any implementations of the XQuery transform. Its efficient support is still an open research issue.
Comparing the approaches discussed above, we conclude that the push-style approach powered by the transformation rules of XSLT and OmniMark is the better choice for processing document-oriented XML data. The XQuery recursive-function approach is usually harder to code and maintain than the push-style approach. As for XQuery transform, it remains to be seen how effective the transform approach is, and its usability is especially questionable for complex transformations.
If XSLT and OmniMark seem so well suited to document-oriented XML data transformation, why not use them and forget about XQuery? Growing volumes of XML data (especially in enterprise environments) often require a database for XML data management. For example, replacing XML elements that represent references (or placeholders) might require querying a database that contains a mapping from references (or placeholders) to their substitutes. This means that XQuery is still required as a database query language, and we need to find the right way to combine the transformation languages--XSLT and OmniMark--with a query language--XQuery. In the following sections we analyze two approaches to this combination.
Combining XQuery with XSLT or OmniMark
Before we analyze different ways of combining XQuery/XSLT and XQuery/OmniMark, it is worth noting that OmniMark and XQuery can be integrated directly, as OmniMark provides APIs to XQuery-enabled database systems (i.e., OmniMark plays the role of a host language for XQuery). XSLT/XQuery integration requires a third language to glue them together. If we use command-line XSLT and XQuery processors, a scripting language like Bash can play the role of the glue language; if we use XSLT and XQuery libraries, a general-purpose programming language like Java can be used to glue them.
Tightly Coupled Solution
Let us consider the following example. Suppose there is an XML document that includes placeholders referring to fragments of a book stored in an XML database. We need to publish the document, replacing the placeholders with the corresponding fragments and rendering them. Below is the document.
Example: document.xml

<page>
  ...
  <fragmentref>378</fragmentref>
  ...
  <fragmentref>835</fragmentref>
  ...
</page>
The simple solution is to query the database each time we come across a reference to a fragment during the XSLT or OmniMark transformation. Below is an example in OmniMark and XQuery; it uses the OmniMark API to the Sedna XML database.
Example: process.xom

import "omdb.xmd" prefixed by db.

global db.database moviedb

define string source function div-render (value string source s) as
  ; rendering code is here

element fragmentref
  local db.field result variable
  db.query moviedb statement "doc('book.xml')//div[@id='%c']" into result
  do when db.record-exists result
    output div-render(db.reader of result)
  done

element #implied
  output "<%q>%c</%q>"

process
  set moviedb to db.open-sedna "localhost" dbname "moviedb" user "SYSTEM" password "MANAGER"
  do xml-parse scan file "document.xml"
    output "%c"
  done
  db.close moviedb
This solution is tightly coupled due to the following properties:
- It requires an API to access a database from a transformation language. In the example above, we used the OmniMark API to the Sedna XML database system, which supports XQuery.
- There is a query sent to the database for each reference. This might result in poor performance if there is a large number of references in the document, because each query is a call from one execution environment (OmniMark) to another (the database system), which leads to substantial overhead.
Although the size of the query result (a book fragment) is not known in advance, OmniMark's streaming processing allows the fragment to be handled regardless of its size. As XSLT engines do not support streaming, the size of the query result that an XSLT engine can process is limited by the amount of available memory.
Another problem with an XSLT implementation of this solution is that XSLT engines usually do not provide APIs to XML database systems. This means that an XSLT-based implementation has to call the database via an extension function implemented in a programming language that offers an XQuery API, which overcomplicates the implementation.
To conclude, we would like to emphasize that while this solution can suffer from poor performance because of the many query calls, it does not impose any limitation on the size of the query result and allows for streaming transformation. The solution described in the next section has different properties.
Loosely Coupled Solution
Let us consider a popular example of document-oriented XML processing known as dynamic linking. The idea is that 1) the content is marked up with semantically meaningful XML elements that represent media-neutral links, and 2) these elements are then replaced with media-specific links at the time of content delivery (rendering). Dynamic linking is especially useful in the context of single-source publishing, where the author focuses on content creation and does not have to worry about how the content is delivered.
Consider a project to create a collection of movie reviews with associated information and to produce output for various media. Movie reviews are full of references to other movies, actors, directors, places, times, and themes. All these references are good places to create links to other resources, such as biographies, maps, or histories. Instead of using direct HTML links, which are media-specific, the author marks up references with XML tags. These tags are named so that they describe the type of the reference (e.g., movie, actor, director). They have an attribute, name, which allows for the retrieval of the information required to construct the media-specific links. When we publish reviews on the Web, we might link them to Wikipedia using HTML links; when we publish reviews on CD, we link to local resources instead. Here is an example of a movie review with a reference to a director.
Example: reviews.xml

<reviews>
  <review>
    <title>Titanic</title>
    <genre>romance</genre>
    <text>
      ...
      <p><director name="James Cameron">James Cameron's</director> 194-minute,
      $200 million film of the tragic voyage is in the tradition of the great
      Hollywood epics.</p>
      ...
    </text>
  </review>
  ...
</reviews>
Below is the corresponding fragment of the links mapping (people.xml). The document people.xml contains person elements, which have id attributes and contain biography elements with biography references for the various media. The url element contains the URL of the director's biography, intended for publishing on a web page. The file element holds a path to the biography stored locally on the CD-ROM. The text element provides a brief biography for publishing in print media.
Example: people.xml

<people>
  <person id="James Cameron">
    <biography>
      <url>http://en.wikipedia.org/wiki/James_Cameron</url>
      <file>/biography/james_cameron.html</file>
      <text>
        James Francis Cameron (born August 16, 1954) is a Canadian-born
        American film director noted for his action/science fiction films,
        which are often extremely successful financially...
      </text>
    </biography>
    ...
  </person>
  ...
</people>
This application can be implemented using the tightly coupled approach, but we will try to improve performance by minimizing the number of database queries. This can be achieved by decomposing the application into two separate tasks: database querying and reference processing. This approach reduces the inter-environment communication to a single data transmission and, as a pleasant side effect, does not require an API from the transformation language to the database. Because the two tasks are decoupled in this way, we refer to this solution as loosely coupled.
In general, the loosely coupled solution can be implemented as follows: the query should return each document augmented with all the information that is needed to process (render) it. In our particular example, it means that an XQuery query should return each review augmented with the corresponding subset of the mapping that is required to render the links within the review. Below is an example of how an augmented review can be represented.
Example: review-with-mapping.xml

<catalog>
  <reviews>
    <review>
      <title>Titanic</title>
      <map:mapping xmlns:map="www.linkmapping.com">
        <map:record>
          <map:name>James Cameron</map:name>
          <map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
        </map:record>
      </map:mapping>
      <text>
        ...
        <p><director name="James Cameron">James Cameron's</director> 194-minute,
        $200 million film of the tragic voyage is in the tradition of the great
        Hollywood epics.</p>
        ...
      </text>
    </review>
  </reviews>
</catalog>
This fragment contains a review extended with the mapping from the director names mentioned in the review to the corresponding links. To combine the review with the mapping, XML elements in the www.linkmapping.com namespace are used. You can see that this fragment contains all the information required to render the review, with no need to query the database.
In XQuery we use element constructors to combine the review text with the corresponding mapping. The XQuery query is as follows:
declare namespace map = "www.linkmapping.com";

<catalog>
{
  for $r in doc("reviews.xml")/reviews/review
  return
    <reviews>
      <review>
        <title>{$r/title/text()}</title>
        <map:mapping xmlns:map="www.linkmapping.com">
        {
          for $dir-name in distinct-values($r//director/@name)
          let $dir := doc("people.xml")//person[@id = $dir-name]
          return
            <map:record>
              <map:name>{$dir-name}</map:name>
              <map:link>{$dir/biography/url/text()}</map:link>
            </map:record>
        }
        </map:mapping>
        <text>{$r/text/node()}</text>
      </review>
    </reviews>
}
</catalog>
The fragment can then be processed in XSLT as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:map="www.linkmapping.com">

  <xsl:template match="review">
    <review>
      <xsl:apply-templates>
        <xsl:with-param name="mapping" select="./map:mapping"/>
      </xsl:apply-templates>
    </review>
  </xsl:template>

  <xsl:template match="director">
    <xsl:param name="mapping"/>
    <xsl:variable name="dname" select="@name"/>
    <a href="{$mapping/map:record[map:name=$dname]/map:link/text()}">
      <xsl:value-of select="."/>
    </a>
  </xsl:template>

  <xsl:template match="map:mapping"/>

  <xsl:template match="*">
    <xsl:param name="mapping"/>
    <xsl:element name="{node-name(.)}">
      <xsl:apply-templates>
        <xsl:with-param name="mapping" select="$mapping"/>
      </xsl:apply-templates>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
The same transformation can be expressed in OmniMark as follows:
global string locname
global string locref
global string locmapping variable

group "reference-processing"

element #base mapping
  output "%c"

element #base record
  output "%c"
  do when locmapping hasnt key locname
    set new locmapping{locname} to locref
  done

element #base name
  set locname to "%c"

element #base link
  set locref to "%c"

group #implied

process
  do xml-parse scan file "review-with-mapping.xml"
    output "%c"
  done

xmlns-change when xmlns-name = "www.linkmapping.com"
  using group "reference-processing"
  output "%c"

element director
  output "<a href='" || locmapping{"%v(name)"} || "'>%c</a>"

element #implied
  output "<%q>%c</%q>"
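To make the effect of these transformations concrete, here is roughly what the review-with-mapping.xml fragment above turns into (whitespace and the elided portions of the review aside): the map:mapping element is consumed, and the director reference becomes a media-specific HTML link. The exact attribute quoting differs slightly between the XSLT and OmniMark versions.

<catalog>
  <reviews>
    <review>
      <title>Titanic</title>
      <text>
        ...
        <p><a href="http://en.wikipedia.org/wiki/James_Cameron">James Cameron's</a>
        194-minute, $200 million film of the tragic voyage is in the tradition of
        the great Hollywood epics.</p>
        ...
      </text>
    </review>
  </reviews>
</catalog>

Publishing to CD would use the same pipeline; only the XQuery query changes, selecting the file element instead of url when constructing map:link.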
The properties of the loosely coupled approach may be described as follows:
- As all the relevant data are fetched in advance and there is no need to access the database during the transformation, a loosely coupled solution may be implemented without a database API for the transformation language. While a tightly coupled solution requires such an API, a loosely coupled solution can employ a standalone tool to fetch the data and apply a separate, database-agnostic tool to transform the pre-fetched data.
- The loosely coupled approach minimizes the number of database queries, but it may introduce restrictions on the size of the processed data. In the example considered, the mapping for a given review is restricted by the buffer size, because we have to keep the mapping in memory while we process the review. Since a single review cannot contain very many references and a single link cannot be very large, the loosely coupled solution will work for this application.
- Combining transformation and query languages in a loosely coupled fashion improves modular design. The decoupled tasks may be implemented as separate reusable modules and used in different content processing pipelines.
Comparison of Loosely Coupled and Tightly Coupled Solutions
Following from the above, the general rule for choosing between the loosely coupled and tightly coupled solutions is as follows. When the size of the query result is unpredictable, the only solution that is guaranteed to work properly is the tightly coupled one, as it allows the query result to be processed as a stream. When the size of the query result is known to be small, both tightly coupled and loosely coupled solutions can be used, but the loosely coupled one should work faster.
Let us demonstrate the latter statement with experiments. We compare the tightly coupled and loosely coupled solutions for the movie review example introduced in the previous section. Both solutions are implemented using OmniMark version 8.0 and Sedna version 1.0. The experiments were conducted on Windows XP on a computer with the following configuration: a 1.8 GHz Pentium M with a 4200 RPM hard disk. Sedna buffers were set to 100 MB. There are 3,000 movie reviews stored in the database; each review is about 4 KB in size and includes six director references on average. The mapping (people.xml) is 2.22 GB in size and includes 506,000 people; it is also stored in the database, and the person elements are indexed by the id attribute. The table below contains the average total execution time over five runs.
| Solution | Cold Buffers | Hot Buffers |
| --- | --- | --- |
| Tightly coupled | 7 min 40 sec | 7 min 10 sec |
| Loosely coupled | 21 sec | 12 sec |
This table demonstrates that the loosely coupled solution is an order of magnitude faster, as it minimizes the number of queries. The fact that loosely coupled solutions usually work faster is also discussed in the literature for database practitioners; for instance, see Section 5.4.2, "Minimize the Number of Round-Trips Between the Application and the Database Server," in Database Tuning: Principles, Experiments, and Troubleshooting Techniques by Dennis E. Shasha and Philippe Bonnet (Morgan Kaufmann, 2002).
Conclusion
Processing document-oriented XML in modern content management applications is a challenging task, as it often requires both content transformation and database querying. Domain-specific XML transformation languages (e.g., XSLT and OmniMark) are very good at document-oriented XML processing but require a query language (e.g., XQuery) to access a database. In XQuery, document-oriented XML processing can be implemented via the transform mechanism, but this mechanism is suitable only for simple transformation tasks performed on the database side. To build elegant and efficient document-oriented XML processing applications, we have to combine transformation and query languages. We have described two possible approaches to combining the languages, which we call tightly coupled and loosely coupled, and discussed the pros and cons of each.
One last thing worth mentioning is that XQuery-enabled systems have an advantage over SQL-based ones for loosely coupled solutions. The chunks that carry all the data required for their own processing have a rather complex (hierarchical) structure. SQL does not provide adequate construction facilities for building such structures, as it is designed to deal with simpler (flat) structures; thanks to XML node constructors, they can easily be built in XQuery.
The authors would like to thank Maria Grineva and Patrick Baker for valuable discussions and comments.