Transforming XML With SAX Filters

October 10, 2001

Introduction

Last month we began our exploration of more advanced SAX topics with a look at how SAX events can be generated from non-XML data. This month, we conclude the series by introducing SAX filters and their use in XML data transformation.

What Is A SAX Filter?

A SAX filter is simply a class that is passed as the event handler to another class that generates SAX events, then forwards all or some of those events on to the next handler (or filter) in the processing chain. A filter may prune the document tree by not forwarding events for elements with a given name (or that meet some other condition), while in other cases, a filter might generate its own new events to add parent or child elements to certain elements in the existing document stream. Also, element attributes can be added or removed, or the character data altered in some way. Really, any class that is able to receive SAX events, then call event methods on another SAX handler in a way that alters the document stream, can be seen as a SAX filter.
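As a minimal sketch of the "generate its own new events" case, the following uses only core Perl. The Recorder and WrapItems classes are hypothetical names invented for this illustration, and the event methods are called by hand in place of a real parser. The filter wraps every item element in a new wrapper parent and forwards everything else unchanged.

```perl
use strict;
use warnings;

# Hypothetical handler that records every event it receives as a string.
package Recorder;
sub new { my $class = shift; return bless { log => [] }, $class }
sub start_element { my ($self, $el) = @_; push @{ $self->{log} }, "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; push @{ $self->{log} }, "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; push @{ $self->{log} }, $ch->{Data} }

# Hypothetical filter: wraps every <item> element in a new <wrapper> parent
# and forwards all other events untouched.
package WrapItems;
sub new { my ($class, %opt) = @_; return bless {%opt}, $class }

sub start_element {
    my ($self, $el) = @_;
    # Emit an extra event that was never in the incoming stream.
    $self->{Handler}->start_element({ Name => 'wrapper', Attributes => {} })
        if $el->{Name} eq 'item';
    $self->{Handler}->start_element($el);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{Handler}->end_element($el);
    # Close the extra parent right after the item it wraps.
    $self->{Handler}->end_element({ Name => 'wrapper' })
        if $el->{Name} eq 'item';
}

sub characters {
    my ($self, $ch) = @_;
    $self->{Handler}->characters($ch);
}

package main;

my $recorder = Recorder->new;
my $filter   = WrapItems->new(Handler => $recorder);

# Stand in for a parser by calling the event methods by hand.
$filter->start_element({ Name => 'list', Attributes => {} });
$filter->start_element({ Name => 'item', Attributes => {} });
$filter->characters({ Data => 'one' });
$filter->end_element({ Name => 'item' });
$filter->end_element({ Name => 'list' });

print join('', @{ $recorder->{log} }), "\n";
# prints: <list><wrapper><item>one</item></wrapper></list>
```

Nothing here depends on where the events come from, which is what makes a filter reusable behind any event generator.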

In practice, SAX filters are like conceptual cousins of many of the standard UNIX tools. By themselves, these tools often perform only a single, simple task, but when piped together they are capable of astonishing feats. In the same way, the real power of SAX filters is derived from the fact that simpler, easy-to-maintain filters may be chained together to produce complex XML data transformations.
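The pipe analogy can be sketched with two deliberately tiny filters, again in core Perl only. TextFilter, SqueezeSpace, UpperCase, and Collector are hypothetical names made up for this example; each filter does one small job and hands its result to whatever handler it was given.

```perl
use strict;
use warnings;

# Hypothetical shared parent for the two text filters below: store the
# handler and pass a transformed copy of each characters event down the chain.
package TextFilter;
sub new { my ($class, %opt) = @_; return bless {%opt}, $class }
sub characters {
    my ($self, $ch) = @_;
    $self->{Handler}->characters({ Data => $self->transform($ch->{Data}) });
}

package SqueezeSpace;   # collapses runs of whitespace to a single space
our @ISA = ('TextFilter');
sub transform { my ($self, $text) = @_; $text =~ s/\s+/ /g; return $text }

package UpperCase;      # upper-cases the text
our @ISA = ('TextFilter');
sub transform { my ($self, $text) = @_; return uc $text }

package Collector;      # end of the chain: keeps the text it receives
sub new { my $class = shift; return bless { text => '' }, $class }
sub characters { my ($self, $ch) = @_; $self->{text} .= $ch->{Data} }

package main;

# Built back to front: each object is handed the next handler in the chain,
# so events flow SqueezeSpace -> UpperCase -> Collector.
my $sink  = Collector->new;
my $upper = UpperCase->new(Handler => $sink);
my $head  = SqueezeSpace->new(Handler => $upper);

$head->characters({ Data => "small   filters,\n  big results" });
print $sink->{text}, "\n";
# prints: SMALL FILTERS, BIG RESULTS
```

Swapping the order of the two filters, or dropping one, only changes the constructor calls at the bottom, much as rearranging a shell pipeline would.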

Transforming Data Within Existing Events

For our first example, we will create a simple SAX filter that transforms the character data passed from XML::Parser::PerlSAX, then hands it on to Michael Koehne's XML::Handler::YAWriter to produce the final XML document.


use strict;

use XML::Parser::PerlSAX;

use XML::Handler::YAWriter;

use IO::File;



my $file = $ARGV[0] || die "Please pass a file name to process\n";

With the necessary modules included, we get to the section that reveals exactly how SAX filters work. Notice that we create a new instance of XML::Handler::YAWriter, then pass that object as the Handler for our custom filter, an instance of which is in turn passed as the Handler to XML::Parser::PerlSAX. When the script is executed, the parser will call its SAX events on the methods in our FilterPorcus class, which, in turn, will call the event methods on the writer class to print the result to STDOUT.

Note that when defining event chains, the objects are created in reverse order: the handler that is created first is the last one to be called when events flow through the chain. This may seem a bit confusing at first but, with a little practice, you will get the hang of it.


my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));

my $filter = FilterPorcus->new(Handler => $writer);

my $parser = XML::Parser::PerlSAX->new(Handler => $filter);



my %parser_args = (Source => {SystemId => $file});

$parser->parse(%parser_args);



# end main

Next we create our custom filter class as an inline Perl package. Pay special attention to the fact that our class inherits from Matt Sergeant's XML::Filter::Base class. This allows us to implement only those handler methods that are relevant to our filter, since XML::Filter::Base automatically forwards, by default, all SAX events to the next handler class in the chain. If our class were not a subclass of XML::Filter::Base, we would have to explicitly forward each and every event that the previous class could potentially generate.


# silly text transformer

package FilterPorcus;

use strict;

use base qw(XML::Filter::Base);



sub new {

  my $class = shift;

  my %options = @_;

  return bless \%options, $class;

}
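To make the default-forwarding idea concrete, here is a rough sketch of what such a base class might do. It is not the code of the real XML::Filter::Base module: MyFilterBase, Shout, and Printer are hypothetical names, and the list of events is an assumption kept short for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a filter base class; NOT the real module's code.
package MyFilterBase;

sub new { my ($class, %opt) = @_; return bless {%opt}, $class }

# Generate a pass-through method for each event name we care about.
for my $event (qw(start_document end_document start_element
                  end_element characters)) {
    no strict 'refs';
    *{"MyFilterBase::$event"} = sub {
        my ($self, $data) = @_;
        my $handler = $self->{Handler};
        # Forward only if the next handler implements this event.
        return $handler->$event($data) if $handler && $handler->can($event);
        return;
    };
}

# A subclass overrides one event and inherits the forwarding for the rest.
package Shout;
our @ISA = ('MyFilterBase');
sub characters {
    my ($self, $ch) = @_;
    $self->{Handler}->characters({ Data => uc $ch->{Data} });
}

package Printer;
sub new { my $class = shift; return bless { out => '' }, $class }
sub start_element { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }

package main;

my $printer = Printer->new;
my $shout   = Shout->new(Handler => $printer);

$shout->start_document({});              # dropped: Printer has no such method
$shout->start_element({ Name => 'p' });  # inherited pass-through
$shout->characters({ Data => 'hello' }); # overridden in Shout
$shout->end_element({ Name => 'p' });    # inherited pass-through

print $printer->{out}, "\n";   # prints: <p>HELLO</p>
```

The subclass contains only the one method that differs from plain forwarding, which is the convenience that inheriting from a filter base class buys.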

Our filter is only interested in transforming the text nodes of the input document, so we will only implement the characters method. After passing the character data to the local porcus subroutine for transformation, we forward the result to the next handler by calling the characters event on that handler.


sub characters {

  my ($self, $chars) = @_;

  my $out = $self->porcus($chars->{Data});

  $self->{Handler}->characters({Data => $out});

}

Finally we get to the porcus method that returns the string passed to it transformed into the desired format using a little regular expression voodoo.


sub porcus {

  my ($self, $chars) = @_;

  $chars =~ tr/A-Z/a-z/;

  $chars =~ s/\b([aeiou])/w$1/g;

  my $cons = q{[bcfghjklmnpqrstvwxz]};

  $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;

  return $chars;

}

Feeding this script a snippet of Larry Wall's latest Perl 6 Apocalypse produces the following result:


<html>

<body>

<p>

  otay emay, oneway ofway ethay ostmay

  agonizingway aspectsway ofway anguagelay

  esignday isway omingcay upway

  ithway away usefulway ystemsay ofway

  operatorsway.  otay otherway

  anguagelay esignersday, isthay aymay

  eemsay ikelay away illysay ingthay

  otay agonizeway overway.  afterway

  allway, ouyay ancay iewvay allway

  operatorsway asway eremay yntacticsay

  ugarsay -- operatorsway

  areway ustjay unnyfay ookinglay

  unctionfay allscay.

</p>

</body>

</html>

Okay, the result is admittedly pretty silly -- there may even be those who would argue that converting Uncle Larry's prose to pig latin is a bit redundant -- but the script does illustrate the basics of creating a simple SAX filter:

It accepts SAX events from a SAX filter or other event generator.
It alters the document stream (in this case, by transforming all text data to pig latin).
It forwards SAX events to the next handler or filter in the chain.
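Before extending the filter, it can help to try the porcus substitutions on their own, outside any SAX machinery. The sketch below copies the three substitutions from the method into a plain function (dropping the $self argument) and runs it over a few phrases chosen for this example.

```perl
use strict;
use warnings;

# Same substitutions as the porcus method, written as a plain function.
sub porcus {
    my ($chars) = @_;
    $chars =~ tr/A-Z/a-z/;                 # lower-case everything
    $chars =~ s/\b([aeiou])/w$1/g;         # vowel-initial words gain a "w"
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    # Move the leading "qu", consonant cluster, or single letter to the end
    # of the word and append "ay".
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
    return $chars;
}

for my $phrase ('To me', 'the string', 'Quick apple') {
    printf "%-12s => %s\n", $phrase, porcus($phrase);
}
```

The three phrases come out as "otay emay", "ethay ingstray", and "ickquay appleway". Note that the temporary "w" added to vowel-initial words is what later turns "apple" into "appleway".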

If we wanted to transform element names, attribute names, and attribute values as well as the text data, we would only need to add the following start_element and end_element handlers.


sub start_element {

  my ($self, $element) = @_;

  my %attrs;



  # Build a new hash instead of adding and deleting keys in the hash we

  # are iterating over; each() gives unspecified results in that case.

  while ( my ($name, $value) = each %{$element->{Attributes}} ) {

    $attrs{ $self->porcus($name) } = $self->porcus($value);

  }



  $element->{Attributes} = \%attrs;

  my $elname = $self->porcus($element->{Name});

  $element->{Name} = $elname;

  $self->{Handler}->start_element($element);

}



sub end_element {

  my ($self, $element) = @_;

  my $elname = $self->porcus($element->{Name});

  $element->{Name} = $elname;

  $self->{Handler}->end_element($element);

}

Again, the principles are the same: accept events, alter the data, then forward that altered data by calling events on the filter's designated handler.

Enough silliness, let's look at a more practical example.

Transforming Document Structure

For our final example, we will demonstrate how a SAX filter can be used to alter the structure of an XML document by creating a filter that partially implements the current version of the W3C's XInclude working draft.

XInclude suggests a compact, DTD- and Schema-agnostic way to include external XML documents (or document fragments) into the current document being processed. For example,


<?xml version="1.0"?>

<article

  xmlns="http://localhost/myns"

  xmlns:xi="http://www.w3.org/2001/XInclude">

  <para>

    All brontosauruses are thin at one end,

    much much thicker in the middle, and

    then thin again at the far end.

  </para>

  <xi:include href="disclaimer.xml"/>

</article>

would signal an XInclude-aware processor to include the contents of the file disclaimer.xml into the current document between the end tag of the para element and the end tag of the top-level article element.

And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft; it will only allow inclusion of complete documents from the local file system. XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.


use strict;

use XML::Parser::PerlSAX;

use XML::Filter::SAX2toSAX1;

use XML::Filter::SAX1toSAX2;

use XML::Handler::YAWriter;

use IO::File;



my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";

After the required imports we are ready to build our SAX filter-handler chain. The chain is more complex in this case since XML::Parser::PerlSAX generates SAX1 events and XML::Handler::YAWriter expects SAX1 events, but our XInclude filter requires the more sophisticated namespace processing provided by SAX2. We work around this by adding the filters XML::Filter::SAX1toSAX2 and XML::Filter::SAX2toSAX1 to the chain immediately before and after our custom filter. This allows for proper namespace processing while ensuring that the other parts of the handler chain are able to generate and receive the data for the given events in the format that each expects.


my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));

$writer->{Pretty}->{NoProlog} = 1;

my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);

my $handler = FilterXInclude->new(Handler => $sax1_filter);

my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);

my $parser = XML::Parser::PerlSAX->new(Handler => $sax2_filter);



my %parser_args = (Source => {SystemId => $file});

$parser->parse(%parser_args);



# end main

We now begin our XInclude filter module. Note that, again, we inherit from XML::Filter::Base to make life a little easier. Also notice that we add a BaseURI property to the filter object. This gives us a place to store the path that provides the context in which to resolve any relative URIs offered by the include elements. We set the default for this property to the current directory that the script is being executed in.


# minimal XInclude Implementation

package FilterXInclude;

use strict;

use base qw(XML::Filter::Base);

use XML::Parser::PerlSAX;

use XML::Filter::SAX2toSAX1;

use XML::Filter::SAX1toSAX2;



sub new {

    my $class = shift;

    my %options = @_;

    $options{BaseURI} ||= './';

    return bless \%options, $class;

}



sub start_element {

  my ($self, $element) = @_;

  my %attrs = %{$element->{Attributes}};

As we begin the start_element handler, we first check for an xml:base attribute in the current element. The xml:base attribute is the recommended way to set the base URI for applications that are expected to cope with relative URIs. If an xml:base attribute is found, we set the filter object's BaseURI property to its value.

It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple HASH reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces since they allow for cases where two attributes may have the same name, but are bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.

After much discussion on the perl-xml mailing list, it was decided that in SAX2 implementations attributes should remain a HASH, but should employ a notation first advanced by James Clark where the insufficient name => value structure is replaced by {namespace_uri}localname => \%attribute_properties. So, in the following block, when we say $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value} this should be understood to mean "give me the 'Value' property of the attribute that is bound to the 'http://www.w3.org/XML/1998/namespace' namespace whose local name is 'base'".


  if (defined $attrs{'{http://www.w3.org/XML/1998/namespace}base'}) {

    $self->{BaseURI} =

        $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value};

    $self->{BaseURI} =~ s|^file://||;

  }
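The key notation is easy to experiment with on a hand-built hash. In the sketch below, clark_key is a hypothetical helper written for this example, and the two example.com namespaces and their values are invented. Only the Value property is taken from the code above.

```perl
use strict;
use warnings;

# Build a "{namespace_uri}localname" key; no namespace gives "{}localname".
sub clark_key {
    my ($ns_uri, $local_name) = @_;
    return '{' . (defined $ns_uri ? $ns_uri : '') . '}' . $local_name;
}

my $xml_ns = 'http://www.w3.org/XML/1998/namespace';

# Hand-built attribute hash. The example.com namespaces and their values
# are made up purely for illustration.
my %attrs = (
    clark_key($xml_ns, 'base')              => { Value => 'file://files/' },
    clark_key(undef, 'href')                => { Value => 'header.xml' },
    clark_key('http://example.com/a', 'id') => { Value => 'a-1' },
    clark_key('http://example.com/b', 'id') => { Value => 'b-1' },
);

# Two attributes share the local name "id" without colliding.
print scalar(keys %attrs), " attributes\n";   # prints: 4 attributes

# Look up xml:base by its key and strip the file:// prefix, as the filter does.
(my $dir = $attrs{ clark_key($xml_ns, 'base') }{Value}) =~ s|^file://||;
my $href = $attrs{'{}href'}{Value};
print "include target: $dir$href\n";   # prints: include target: files/header.xml
```

An attribute with no namespace, such as href, ends up under a key with empty braces, which is why the filter looks for '{}href'.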

Next, we check to see if the current element is in the XInclude namespace and has the local name of 'include' and, if so, we send the value of that element's href attribute off to our include_proc method to include the document at that URI into the current document stream.

Also notice that we do not forward the events for the include elements since we do not want those elements to actually appear in the result document. This, coupled with the results included from the include_proc method, has the effect of replacing the include elements with the documents that they point to.


  if ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'

      and $element->{LocalName} eq 'include') {

      $self->include_proc($attrs{'{}href'}->{Value});

  }

  else {

    $self->{Handler}->start_element($element);

  }

}

It is not enough to exclude the include elements from being forwarded in the start_element handler; we must do the same in the end_element handler. Otherwise, the resulting document would still contain the end tags for the include elements and would no longer be well-formed.


sub end_element {

  my ($self, $element) = @_;

  unless ($element->{NamespaceURI} eq

          'http://www.w3.org/2001/XInclude'

      and $element->{LocalName} eq 'include') {

      $self->{Handler}->end_element($element);

  }

}

I should also point out that if you want to prune elements that may contain character data from a document, you must also implement a characters handler that conditionally blocks the forwarding of text events. Otherwise the text contained by the excluded elements will become part of the text of the nearest parent element, which is not likely to produce the desired result. We need not worry in this case since all of the include elements are empty.
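For elements that are not empty, one way to handle this is a depth counter shared by all three handlers. The following is a sketch under that assumption, in core Perl only. PruneNotes and Printer are hypothetical names, and the events are fired by hand rather than by a parser.

```perl
use strict;
use warnings;

# Hypothetical filter that drops <note> elements and everything inside them.
package PruneNotes;
sub new { my ($class, %opt) = @_; return bless { %opt, Skip => 0 }, $class }

sub start_element {
    my ($self, $el) = @_;
    # Once inside a pruned element, count nesting depth instead of forwarding.
    if ($self->{Skip} or $el->{Name} eq 'note') { $self->{Skip}++; return }
    $self->{Handler}->start_element($el);
}

sub end_element {
    my ($self, $el) = @_;
    if ($self->{Skip}) { $self->{Skip}--; return }
    $self->{Handler}->end_element($el);
}

sub characters {
    my ($self, $ch) = @_;
    return if $self->{Skip};    # text inside a pruned element is blocked too
    $self->{Handler}->characters($ch);
}

package Printer;
sub new { my $class = shift; return bless { out => '' }, $class }
sub start_element { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }

package main;

my $printer = Printer->new;
my $filter  = PruneNotes->new(Handler => $printer);

# Events for: <p>Keep <note>drop <b>this</b></note>me</p>
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'Keep ' });
$filter->start_element({ Name => 'note' });
$filter->characters({ Data => 'drop ' });
$filter->start_element({ Name => 'b' });
$filter->characters({ Data => 'this' });
$filter->end_element({ Name => 'b' });
$filter->end_element({ Name => 'note' });
$filter->characters({ Data => 'me' });
$filter->end_element({ Name => 'p' });

print $printer->{out}, "\n";   # prints: <p>Keep me</p>
```

Without the check in characters, the text "drop this" would leak into the p element even though both of its enclosing elements had been removed.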

Finally we get to the include_proc method which is responsible for parsing and including the requested documents. Here we simply create a new instance of XML::Filter::SAX1toSAX2, passing the current instance of our filter as the handler, then pass that as the handler for a new instance of XML::Parser::PerlSAX, and tell the parser to parse the document passed to the subroutine in the context of the BaseURI property.

The result of this is that the events fired from these included documents are inserted into the current document stream at the precise location previously taken by the include elements.


sub include_proc {

  my ($self, $file) = @_;

  $file = $self->{BaseURI} . $file;

  my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $self);

  my $parser = XML::Parser::PerlSAX->new({Handler => $sax2_filter,

                                          Source => {SystemId => $file}

                                        });

  $parser->parse;

}

Passing the following XML document to this script. . .


<?xml version="1.0"?>

<html xmlns="http://www.w3.org/1999/xhtml"

      xmlns:xi="http://www.w3.org/2001/XInclude"

      xml:base="file://files/">

  <head>

    <title>

      Templating With XInclude and SAX2

    </title>

  </head>

  <body>

   <xi:include href="header.xml"/>

   <hr width="80%"/>

   <xi:include href="content.xml"/>

   <hr width="80%"/>

   <xi:include href="footer.xml"/>

  </body>

</html>

might result in a document like


<html

  xml:base="file://files/"

  xmlns="http://www.w3.org/1999/xhtml"

  xmlns:xi="http://www.w3.org/2001/XInclude">

  <head>

    <title>

      Templating With XInclude and SAX2

    </title>

  </head>

  <body>

<div class="header">

 <h1>Common Header</h1>

</div>

<hr width="80%"></hr>

<div class="content">

 <p>

   Now is the winter of our

   discontent made glorious

   summer by the son of York.

 </p>

</div>

<hr width="80%"></hr>

<div class="footer">

 <p>Common Footer</p>

</div>

  </body>

</html>
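The core of include_proc, running a second parse with the filter itself as the handler, can also be tried without any XML modules installed. The sketch below is a toy under several assumptions: ToyParser walks nested arrays instead of parsing XML, documents live in an in-memory hash rather than on disk, and namespaces and BaseURI handling are left out. ToyParser, ToyInclude, and Printer are hypothetical names.

```perl
use strict;
use warnings;

# Toy event source standing in for a parser. A "document" is a nested array:
# [ name, \%attributes, child, child, ... ] where a child is text or an array.
package ToyParser;
sub new { my ($class, %opt) = @_; return bless {%opt}, $class }
sub parse {
    my ($self, $node) = @_;
    my ($name, $attrs, @children) = @$node;
    $self->{Handler}->start_element({ Name => $name, Attributes => $attrs });
    for my $child (@children) {
        if (ref $child) { $self->parse($child) }
        else            { $self->{Handler}->characters({ Data => $child }) }
    }
    $self->{Handler}->end_element({ Name => $name });
}

# Toy counterpart of the XInclude filter.
package ToyInclude;
sub new { my ($class, %opt) = @_; return bless {%opt}, $class }
sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'include') {
        my $doc = $self->{Docs}{ $el->{Attributes}{href} }
            or die "unknown document: $el->{Attributes}{href}\n";
        # Parse the referenced document with this filter as the handler, so
        # its events land in the stream where the include element was.
        ToyParser->new(Handler => $self)->parse($doc);
        return;
    }
    $self->{Handler}->start_element($el);
}
sub end_element {
    my ($self, $el) = @_;
    return if $el->{Name} eq 'include';
    $self->{Handler}->end_element($el);
}
sub characters { my ($self, $ch) = @_; $self->{Handler}->characters($ch) }

package Printer;
sub new { my $class = shift; return bless { out => '' }, $class }
sub start_element { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }

package main;

my %docs = (
    'main.xml'   => [ 'body', {}, [ 'include', { href => 'header.xml' } ],
                                  [ 'hr', {} ] ],
    'header.xml' => [ 'h1', {}, 'Common Header' ],
);

my $printer = Printer->new;
my $filter  = ToyInclude->new(Handler => $printer, Docs => \%docs);
ToyParser->new(Handler => $filter)->parse($docs{'main.xml'});

print $printer->{out}, "\n";
# prints: <body><h1>Common Header</h1><hr></hr></body>
```

Because the nested parse reports back to the filter rather than straight to the writer, an include element inside an included document would be expanded as well. Neither this toy nor a filter built this way guards against a document that includes itself.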

Conclusions

SAX is an important XML technology that, like Perl, keeps simple things simple and makes hard things possible. Knowing how to generate SAX events from non-XML data and how to use SAX filters to transform existing document streams is key to a mature understanding of the power that SAX offers. We have only scratched the surface of what SAX filters and generators can do, but I hope that we have at least covered the basics well enough to pique your curiosity and provoke experimentation.