XQuery, XSLT, and OmniMark: Mixed Content Processing
December 6, 2006
Alexander Boldakov, Maxim Grinev, and Kirill Lisovsky
Document-oriented XML usually has a highly irregular structure in which elements may be mixed in unpredictable ways. Processing such XML requires advanced data-driven facilities: push-style processing enriched with transformation rules, and side-effect-free updates. In this article we highlight such facilities in three XML-native languages: XQuery, XSLT, and OmniMark; and analyze the applicability of these languages and their combinations to document-oriented XML processing. Because in many practical applications the data comes as the result of a database query, we also examine several approaches to combining XQuery with XSLT or OmniMark for document-oriented XML processing over a database system.
What is notable about processing document-oriented XML data is that a particular XML element can appear virtually anywhere in the content (i.e., at any level of the hierarchy of the XML document tree and intermixed with any other elements). When processing such elements, one usually wants to preserve their relative positions among the other elements in the XML document tree. In other words, some elements are to be replaced while others are to be preserved. The replacement for an element may consist of nothing, another element, or a sequence of elements. Below we provide a number of particular examples of such replacements.
XQuery Versus XSLT and OmniMark
The primary approach to processing document-oriented XML data is data-driven transformation (where the order of the output is dictated by the order of the input) as opposed to code-driven transformation (where the order of the output is dictated by XSLT stylesheets, OmniMark rules, or XQuery queries).
Using data-driven transformation, it is very easy to preserve the relative position of elements being processed. In XSLT and OmniMark, data-driven transformations can be naturally expressed in push style using transformation rules.
Let us consider an example. Suppose we need to process a document-oriented XML document (doc.xml) as follows: replace every element named "a" with an element named "b" that contains the content of "a" wrapped in "*" characters. This is how it looks in XSLT.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="a">
    <b>*<xsl:value-of select="text()"/>*</b>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
The same can be expressed in OmniMark as follows.
element a
  output "<b>*" || "%c" || "*</b>"

element #implied
  output "<%q>%c</%q>"

process
  do xml-parse scan file "doc.xml"
    output "%c"
  done
As XQuery has no support for push style--it is a pure pull-style language--the only way to express such a transformation in XQuery is to use a polymorphic recursive function. The function traverses the source document and reconstructs it, replacing only the required elements. The following recursive function implements the same transformation as the previous XSLT example.
declare function local:traverse-replace($n as node()) as node()
{
  typeswitch($n)
    case $a as element(a) return
      <b>*{$a/text()}*</b>
    case $e as element() return
      element { fn:local-name($e) }
              { for $c in $e/(* | text()) return local:traverse-replace($c) }
    case $d as document-node() return
      document { for $c in $d/* return local:traverse-replace($c) }
    default return $n
};
The transformation can be applied to a whole document by invoking the local:traverse-replace function on the root node of the document, as follows:

local:traverse-replace(doc("doc.xml"))
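For concreteness, consider a minimal hypothetical doc.xml such as the following (any document-oriented content with embedded a elements would do); the output shown after it is what each of the implementations above produces from this input.

Example: doc.xml (hypothetical input)

<doc>
  <p>The <a>push</a> style is <a>data-driven</a>.</p>
</doc>

Resulting output

<doc>
  <p>The <b>*push*</b> style is <b>*data-driven*</b>.</p>
</doc>

Note that the a elements are replaced in place, while the surrounding elements and text keep their relative positions, which is exactly the property we want to preserve when processing document-oriented XML.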
Another way to accomplish such a transformation in XQuery has recently been introduced by the W3C in the "XQuery Update Facility." This facility extends XQuery with the transform operator, which allows performing data-driven XML transformations in a way that is very different from all the previous approaches.
- In all the previous examples, we had to express the reconstruction of the whole document, including even those elements that remain unchanged. Using transform, you can avoid reconstructing the elements that remain unchanged.
- Another difference lies in the execution models. Both the XQuery recursive function and the push-style approach (XSLT and OmniMark) inherently imply an execution model based on a sequential scan: the executor scans the whole document in order to process it. In contrast, transform can be implemented using a random-access execution model that avoids sequentially scanning all the data and instead employs alternative ways to access the required data (mainly indices). The possibility of implementing transform via a random-access execution model makes it suitable for efficient support in database systems.
The main idea of the XQuery transform operator is to employ traditional in-place updates for data transformations. The semantics of in-place updates are modified to avoid side effects. That is why we refer to transform as a side-effect-free update. Semantically, instead of modifying the document, updates are evaluated on a new copy of it. Operationally, transform can be implemented without actual data copying (e.g., using a shadow mechanism as proposed in [Rekouts2006]).

The above example can be expressed via XQuery transform as follows.
transform
  copy $new := doc("doc.xml")
  modify
    for $a in $new//a
    do replace $a with <b>*{$a/text()}*</b>
  return $new
It is worth noting that we currently do not know of any implementations of the XQuery transform. Its efficient support is still an open research issue.
Comparing the approaches discussed above, we conclude that the push-style approach powered by the transformation rules of XSLT and OmniMark is the better choice for processing document-oriented XML data. The XQuery recursive-function approach is usually harder to code and maintain than the push-style approach. As for XQuery transform, it remains to be seen how effective the transform approach is, and its usability is especially questionable for complex transformations.
If XSLT and OmniMark seem so well suited to document-oriented XML data transformation, why not use them and forget about XQuery? Growing volumes of XML data (especially in enterprise environments) often require a database for XML data management. For example, replacing XML elements that represent references (or placeholders) might require querying a database that contains a mapping from references (or placeholders) to their substitutes. This means that XQuery is still required as a database query language, and we need to find the right way to combine the transformation languages--XSLT and OmniMark--with a query language--XQuery. In the following sections we analyze two approaches to this combination.
Combining XQuery with XSLT or OmniMark
Before we analyze different ways of combining XQuery/XSLT and XQuery/OmniMark, it is worth noting that OmniMark and XQuery can be integrated directly, as OmniMark provides APIs to XQuery-enabled database systems (i.e., OmniMark plays the role of a host language for XQuery). XSLT/XQuery integration requires a third language to glue them together. If we use command-line XSLT and XQuery processors, a scripting language like Bash can play the role of the glue language; if we use XSLT and XQuery libraries, a general-purpose programming language like Java can be used to glue them.
Tightly Coupled Solution
Let us consider the following example. Suppose there is an XML document that includes placeholders referring to fragments of a book stored in an XML database. We need to publish the document, replacing the placeholders with the corresponding fragments and rendering them. Below is the document.
Example: document.xml

<page>
  ...
  <fragmentref>378</fragmentref>
  ...
  <fragmentref>835</fragmentref>
  ...
</page>
The simple solution is to query the database each time we come across a reference to a fragment during the XSLT or OmniMark transformation. Below is an example in OmniMark and XQuery; it uses the OmniMark API to the Sedna XML database.
Example: process.xom

import "omdb.xmd" prefixed by db.

global db.database moviedb

define string source function div-render (value string source s) as
  ; rendering code is here

element fragmentref
  local db.field result variable
  db.query moviedb statement "doc('book.xml')//div[@id='%c']" into result
  do when db.record-exists result
    output div-render(db.reader of result)
  done

element #implied
  output "<%q>%c</%q>"

process
  set moviedb to db.open-sedna "localhost" dbname "moviedb" user "SYSTEM" password "MANAGER"
  do xml-parse scan file "document.xml"
    output "%c"
  done
  db.close moviedb
This solution is tightly coupled due to the following properties:
- It requires an API to access a database from a transformation language. In the example above, we used the OmniMark API to the Sedna XML database system, which supports XQuery.
- There is a query sent to the database for each reference. This might result in poor performance if there is a large number of references in the document, because each query is a call from one execution environment (OmniMark) to another (the database system), which leads to substantial overhead.
Although the size of the query result (a book fragment) is not known in advance, OmniMark's streaming processing allows the fragment to be handled regardless of its size. As XSLT engines do not support streaming, the size of the query result that an XSLT engine can process is limited by the amount of available memory.
Another problem with an XSLT implementation of this solution is that XSLT engines usually do not provide APIs to XML database systems. This means that an XSLT-based implementation has to call the database via an extension function implemented in a programming language that offers an XQuery API, which overcomplicates the implementation.
To conclude, we would like to emphasize that while this solution can suffer from poor performance because of the many query calls, it does not impose any limitation on the size of the query result and allows for streaming transformation. The solution described in the next section has different properties.
Loosely Coupled Solution
Let us consider a popular example of document-oriented XML processing known as dynamic linking. The idea is that 1) the content is marked up with semantically meaningful XML elements that represent media-neutral links, and 2) these elements are then replaced with media-specific links at the time of content delivery (rendering). Dynamic linking is especially useful in the context of single-source publishing, where the author focuses on content creation and does not have to worry about how the content is delivered.
Consider a project to create a collection of movie reviews with associated information and to produce output for various media. Movie reviews are full of references to other movies, actors, directors, places, times, and themes. All these references are good places to create links to other resources, such as biographies, maps, or histories. Instead of using direct HTML links, which are media-specific, the author marks up references with XML tags. These tags are named so that they describe the type of the reference (e.g., movie, actor, director). They have an attribute, name, which allows for the retrieval of the information required to construct the media-specific links. When we publish reviews on the Web, we might link them to Wikipedia using HTML links; when we publish reviews on CD, we link to local resources instead. Here is an example of a movie review with a reference to a director.
Example: reviews.xml

<reviews>
  <review>
    <title>Titanic</title>
    <genre>romance</genre>
    <text>
      ...
      <p><director name="James Cameron">James Cameron's</director> 194-minute,
      $200 million film of the tragic voyage is in the tradition of the great
      Hollywood epics.</p>
      ...
    </text>
  </review>
  ...
</reviews>
Below is the corresponding fragment of the links mapping (people.xml). The document people.xml contains person elements, which have id attributes and contain biography elements with biography references for the various media. The url element contains the URL of the director's biography, intended for publishing on a web page. The file element holds a path to the biography stored locally on the CD-ROM. The text element provides a brief biography for publishing in print media.
Example: people.xml

<people>
  <person id="James Cameron">
    <biography>
      <url>http://en.wikipedia.org/wiki/James_Cameron</url>
      <file>/biography/james_cameron.html</file>
      <text>
        James Francis Cameron (born August 16, 1954) is a Canadian-born
        American film director noted for his action/science fiction films,
        which are often extremely successful financially...
      </text>
    </biography>
    ...
  </person>
  ...
</people>
This application can be implemented using the tightly coupled approach, but we will try to improve performance by minimizing the number of database queries. This can be achieved by decomposing the application into two separate tasks: database querying and reference processing. This approach reduces the inter-environment communication to a single data transmission and, as a pleasant side effect, does not require an API from the transformation language to the database. Because the two tasks are decoupled in this way, we refer to this solution as loosely coupled.
In general, the loosely coupled solution can be implemented as follows: the query should return each document augmented with all the information that is needed to process (render) it. In our particular example, it means that an XQuery query should return each review augmented with the corresponding subset of the mapping that is required to render the links within the review. Below is an example of how an augmented review can be represented.
Example: review-with-mapping.xml

<catalog>
  <reviews>
    <review>
      <title>Titanic</title>
      <map:mapping xmlns:map="www.linkmapping.com">
        <map:record>
          <map:name>James Cameron</map:name>
          <map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
        </map:record>
      </map:mapping>
      <text>
        ...
        <p><director name="James Cameron">James Cameron's</director> 194-minute,
        $200 million film of the tragic voyage is in the tradition of the great
        Hollywood epics.</p>
        ...
      </text>
    </review>
  </reviews>
</catalog>
This fragment contains a review extended with the mapping from the director names mentioned in the review to the corresponding links. To combine the review with the mapping, XML elements in the www.linkmapping.com namespace are used. You can see that this fragment contains all the information required to render the review, with no need to query the database.
In XQuery we use element constructors to combine the review text with the corresponding mapping. The XQuery query is as follows:
declare namespace map = "www.linkmapping.com";

<catalog>
{
  for $r in doc("reviews.xml")/reviews/review
  return
    <reviews>
      <review>
        <title>{$r/title/text()}</title>
        <map:mapping xmlns:map="www.linkmapping.com">
        {
          for $dir-name in distinct-values($r//director/@name)
          let $dir := doc("people.xml")//person[@id = $dir-name]
          return
            <map:record>
              <map:name>{$dir-name}</map:name>
              <map:link>{$dir/biography/url/text()}</map:link>
            </map:record>
        }
        </map:mapping>
        <text>{$r/text/node()}</text>
      </review>
    </reviews>
}
</catalog>
The fragment can then be processed in XSLT as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:map="www.linkmapping.com">

  <xsl:template match="review">
    <review>
      <xsl:apply-templates>
        <xsl:with-param name="mapping" select="./map:mapping"/>
      </xsl:apply-templates>
    </review>
  </xsl:template>

  <xsl:template match="director">
    <xsl:param name="mapping"/>
    <xsl:variable name="dname" select="@name"/>
    <a href="{$mapping/map:record[map:name=$dname]/map:link/text()}">
      <xsl:value-of select="."/>
    </a>
  </xsl:template>

  <xsl:template match="map:mapping"/>

  <xsl:template match="*">
    <xsl:param name="mapping"/>
    <xsl:element name="{node-name(.)}">
      <xsl:apply-templates>
        <xsl:with-param name="mapping" select="$mapping"/>
      </xsl:apply-templates>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
The same transformation can be expressed in OmniMark as follows:
global string locname
global string locref
global string locmapping variable

group "reference-processing"

element #base mapping
  output "%c"

element #base record
  output "%c"
  do when locmapping hasnt key locname
    set new locmapping{locname} to locref
  done

element #base name
  set locname to "%c"

element #base link
  set locref to "%c"

group #implied

process
  do xml-parse scan file "review-with-mapping.xml"
    output "%c"
  done

xmlns-change when xmlns-name = "www.linkmapping.com"
  using group "reference-processing"
  output "%c"

element director
  output "<a href='" || locmapping{"%v(name)"} || "'>%c</a>"

element #implied
  output "<%q>%c</%q>"
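To make the effect of these transformations concrete, here is roughly what the review-with-mapping.xml fragment above turns into (whitespace and the elided portions of the review aside): the map:mapping element is consumed, and the director reference becomes a media-specific HTML link. The exact attribute quoting differs slightly between the XSLT and OmniMark versions.

<catalog>
  <reviews>
    <review>
      <title>Titanic</title>
      <text>
        ...
        <p><a href="http://en.wikipedia.org/wiki/James_Cameron">James Cameron's</a>
        194-minute, $200 million film of the tragic voyage is in the tradition of
        the great Hollywood epics.</p>
        ...
      </text>
    </review>
  </reviews>
</catalog>

Publishing to CD would use the same pipeline; only the XQuery query changes, selecting the file element instead of url when constructing map:link.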
The properties of the loosely coupled approach may be described as follows:
- As all the relevant data are fetched in advance and there is no need to access the database during the transformation, a loosely coupled solution may be implemented without a database API for the transformation language. While a tightly coupled solution requires such an API, a loosely coupled solution can employ a standalone tool to fetch the data and apply a separate, database-agnostic tool to transform the pre-fetched data.
- The loosely coupled approach minimizes the number of database queries, but it may introduce restrictions on the size of the processed data. In the example considered, the mapping for a given review is restricted by the buffer size, because we have to keep the mapping in memory while we process the review. Since a single review cannot contain very many references and a single link cannot be very large, the loosely coupled solution will work for this application.
- Combining transformation and query languages in a loosely coupled fashion improves modular design. The decoupled tasks may be implemented as separate reusable modules and used in different content processing pipelines.
Comparison of Loosely Coupled and Tightly Coupled Solutions
Following from the above, the general rule for choosing between the loosely coupled and tightly coupled solutions is as follows. When the size of the query result is unpredictable, the only solution that is guaranteed to work properly is the tightly coupled one, as it allows the query result to be processed as a stream. When the size of the query result is known to be small, both tightly coupled and loosely coupled solutions can be used, but the loosely coupled one should work faster.
Let us demonstrate the latter statement with experiments. We compare the tightly coupled and loosely coupled solutions for the movie review example introduced in the previous section. Both solutions are implemented using OmniMark version 8.0 and Sedna version 1.0. The experiments were conducted on Windows XP on a computer with the following configuration: a 1.8 GHz Pentium M with a 4200 RPM hard disk. Sedna buffers were set to 100 MB. There are 3,000 movie reviews stored in the database; each review is about 4 KB in size and includes six director references on average. The mapping (people.xml) is 2.22 GB in size and includes 506,000 people; it is also stored in the database, and the person elements are indexed by the id attribute. The table below contains the average total execution time over five runs.
| Solution | Cold Buffers | Hot Buffers |
| --- | --- | --- |
| Tightly coupled | 7 min 40 sec | 7 min 10 sec |
| Loosely coupled | 21 sec | 12 sec |
This table demonstrates that the loosely coupled solution is an order of magnitude faster, as it minimizes the number of queries. The fact that loosely coupled solutions usually work faster is also discussed in the literature for database practitioners; for instance, see Section 5.4.2, "Minimize the Number of Round-Trips Between the Application and the Database Server," in Database Tuning: Principles, Experiments, and Troubleshooting Techniques by Dennis E. Shasha and Philippe Bonnet (Morgan Kaufmann, 2002).
Conclusion
Processing document-oriented XML in modern content management applications is a challenging task, as it often requires both content transformation and database querying. Domain-specific XML transformation languages (e.g., XSLT and OmniMark) are very good at document-oriented XML processing but require a query language (e.g., XQuery) to access a database. In XQuery, document-oriented XML processing can be implemented via the transform mechanism, but this mechanism is suitable only for simple transformation tasks performed on the database side. To build elegant and efficient document-oriented XML processing applications, we have to combine transformation and query languages. We have described two possible approaches to combining the languages, which we call tightly coupled and loosely coupled, and discussed the pros and cons of each.
One last thing worth mentioning is that XQuery-enabled systems have an advantage over SQL-based ones for loosely coupled solutions. The chunks that carry all the data required for their own processing have a rather complex (hierarchical) structure. SQL does not provide adequate construction facilities for building such structures, as it is designed to deal with simpler (flat) structures; thanks to XML node constructors, they can easily be built in XQuery.
The authors would like to thank Maria Grineva and Patrick Baker for valuable discussions and comments.