Transforming XML With SAX Filters
October 10, 2001
Introduction
Last month we began our exploration of more advanced SAX topics with a look at how SAX events can be generated from non-XML data. This month, we conclude the series by introducing SAX filters and their use in XML data transformation.
What Is A SAX Filter?
A SAX filter is simply a class that is passed as the event handler to another class that generates SAX events, then forwards all or some of those events on to the next handler (or filter) in the processing chain. A filter may prune the document tree by not forwarding events for elements with a given name (or that meet some other condition), while in other cases a filter might generate its own new events to add parent or child elements to certain elements in the existing document stream. Element attributes can also be added or removed, or the character data altered in some way. Really, any class that is able to receive SAX events and then call event methods on another SAX handler in a way that alters the document stream can be seen as a SAX filter.
In practice, SAX filters are like conceptual cousins of many of the standard UNIX tools. By themselves, these tools often perform only a single, simple task, but when piped together they are capable of astonishing feats. In the same way, the real power of SAX filters is derived from the fact that simpler, easy-to-maintain filters may be chained together to produce complex XML data transformations.
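To make the pipe analogy concrete, here is a minimal sketch that uses no CPAN modules at all. The class names (Collector, UpcaseText, RenameBold) are hypothetical, and the main program stands in for a parser by calling the event methods by hand. Each filter does one small job and forwards its events to whatever object it was given as its Handler.

```perl
use strict;
use warnings;

# Hypothetical event sink: records every event it receives as a string.
package Collector;
sub new { my $class = shift; return bless { events => [] }, $class }
sub start_element { my ($self, $el) = @_; push @{ $self->{events} }, "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; push @{ $self->{events} }, "</$el->{Name}>" }
sub characters    { my ($self, $c)  = @_; push @{ $self->{events} }, $c->{Data} }

# Hypothetical filter 1: upper-cases character data, forwards the rest untouched.
package UpcaseText;
sub new { my ($class, %opt) = @_; my $self = { %opt }; return bless $self, $class }
sub start_element { my ($self, $el) = @_; $self->{Handler}->start_element($el) }
sub end_element   { my ($self, $el) = @_; $self->{Handler}->end_element($el) }
sub characters {
    my ($self, $c) = @_;
    $self->{Handler}->characters({ Data => uc $c->{Data} });
}

# Hypothetical filter 2: renames <b> elements to <strong>.
package RenameBold;
sub new { my ($class, %opt) = @_; my $self = { %opt }; return bless $self, $class }
sub renamed {
    my ($el) = @_;
    return $el->{Name} eq 'b' ? { %$el, Name => 'strong' } : $el;
}
sub start_element { my ($self, $el) = @_; $self->{Handler}->start_element(renamed($el)) }
sub end_element   { my ($self, $el) = @_; $self->{Handler}->end_element(renamed($el)) }
sub characters    { my ($self, $c)  = @_; $self->{Handler}->characters($c) }

package main;

# Build the chain back to front: sink first, then the filters that feed it.
my $sink  = Collector->new;
my $chain = RenameBold->new(Handler => UpcaseText->new(Handler => $sink));

# Play the part of the parser by firing events by hand.
$chain->start_element({ Name => 'p' });
$chain->characters({ Data => 'hello ' });
$chain->start_element({ Name => 'b' });
$chain->characters({ Data => 'world' });
$chain->end_element({ Name => 'b' });
$chain->end_element({ Name => 'p' });

my $result = join '', @{ $sink->{events} };
print "$result\n";
```

Neither filter knows about the other; swapping their order or dropping one requires changing only the line that builds the chain.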
Transforming Data Within Existing Events
For our first example we will create a simple SAX filter that transforms the character data passed from XML::Parser::PerlSAX, then hands it on to Michael Koehne's XML::Handler::YAWriter to produce the final XML document.
use strict;
use XML::Parser::PerlSAX;
use XML::Handler::YAWriter;
use IO::File;

my $file = $ARGV[0] || die "Please pass a file name to process\n";
With the necessary modules included, we get to the section that reveals exactly how SAX filters work. Notice that we create a new instance of XML::Handler::YAWriter, then pass that object as the Handler for our custom filter, an instance of which is in turn passed as the Handler to XML::Parser::PerlSAX. When the script is executed, the parser will call its SAX events on the methods in our FilterPorcus class, which, in turn, will call the event methods on the writer class to print the result to STDOUT.
Note that when defining event chains, the classes are created in reverse order, with the first handler being the last class that is actually called. This may seem a bit confusing at first but with a little practice, you will get the hang of it.
my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
my $filter = FilterPorcus->new(Handler => $writer);
my $parser = XML::Parser::PerlSAX->new(Handler => $filter);
my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);

# end main
Next we create our custom filter class as an inline Perl package. Pay special attention to the fact that our class inherits from Matt Sergeant's XML::Filter::Base class. This allows us to implement only those handler methods that are relevant to our filter, since XML::Filter::Base automatically forwards, by default, all SAX events to the next handler class in the chain. If our class were not a subclass of XML::Filter::Base we would have to explicitly forward each and every event that the previous class could potentially generate.
# silly text transformer
package FilterPorcus;
use strict;
use base qw(XML::Filter::Base);

sub new {
    my $class = shift;
    my %options = @_;
    return bless \%options, $class;
}
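The default forwarding that a base class provides can be sketched in a few lines of plain Perl. What follows is a hypothetical stand-in written for illustration only; it is not the source of XML::Filter::Base, and the names PassThrough, Shout, and Recorder are assumptions. The idea is simply that every event gets a default method that hands its data to the next handler, so a subclass overrides only the events it wants to change.

```perl
use strict;
use warnings;

# Hypothetical forwarding base class (NOT the real XML::Filter::Base).
package PassThrough;

sub new {
    my ($class, %options) = @_;
    my $self = { %options };
    return bless $self, $class;
}

# Generate a default method per event that passes the event data straight
# on to the next handler, provided that handler implements the event.
for my $event (qw(start_document end_document start_element end_element characters)) {
    no strict 'refs';
    *{"PassThrough::$event"} = sub {
        my ($self, $data) = @_;
        my $next = $self->{Handler};
        return $next->$event($data) if $next && $next->can($event);
        return;
    };
}

# A filter built on the base class overrides only what it cares about.
package Shout;
our @ISA = ('PassThrough');

sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters({ Data => uc $chars->{Data} });
}

# Hypothetical sink that logs the events it receives.
package Recorder;
sub new { my $class = shift; return bless { log => [] }, $class }
sub start_element { my ($self, $el) = @_; push @{ $self->{log} }, "start:$el->{Name}" }
sub end_element   { my ($self, $el) = @_; push @{ $self->{log} }, "end:$el->{Name}" }
sub characters    { my ($self, $c)  = @_; push @{ $self->{log} }, "text:$c->{Data}" }

package main;

my $recorder = Recorder->new;
my $filter   = Shout->new(Handler => $recorder);

$filter->start_document({});               # Recorder has no such method, so it is skipped
$filter->start_element({ Name => 'p' });   # forwarded by the generated default
$filter->characters({ Data => 'quiet' });  # handled by the override in Shout
$filter->end_element({ Name => 'p' });     # forwarded by the generated default

my $log = join ',', @{ $recorder->{log} };
print "$log\n";
```

Without such a base class, Shout would need its own hand-written start_element and end_element methods just to keep the stream intact.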
Our filter is only interested in transforming the text nodes of the input document, so we will only implement the characters method. After passing the character data to the local porcus subroutine for transformation, we forward the result to the next handler by calling the characters event on that handler.
sub characters {
    my ($self, $chars) = @_;
    my $out = $self->porcus($chars->{Data});
    $self->{Handler}->characters({Data => $out});
}
Finally we get to the porcus method that returns the string passed to it transformed into the desired format using a little regular expression voodoo.
sub porcus {
    my ($self, $chars) = @_;
    $chars =~ tr/A-Z/a-z/;
    $chars =~ s/\b([aeiou])/w$1/g;
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
    return $chars;
}
Feeding this script a snippet of Larry Wall's latest Perl 6 Apocalypse produces the following result:
<html>
<body>
<p>
otay emay, oneway ofway ethay ostmay agonizingway aspectsway ofway anguagelay esignday isway omingcay upway ithway away usefulway ystemsay ofway operatorsway. otay otherway anguagelay esignersday, isthay aymay eemsay ikelay away illysay ingthay otay agonizeway overway. afterway allway, ouyay ancay iewvay allway operatorsway asway eremay yntacticsay ugarsay -- operatorsway areway ustjay unnyfay ookinglay unctionfay allscay.
</p>
</body>
</html>
Okay, the result is admittedly pretty silly -- there may even be those who would argue that converting Uncle Larry's prose to pig latin is a bit redundant -- but the script does illustrate the basics of creating a simple SAX filter:
- It accepts SAX events from a SAX filter or other event generator.
- It alters the document stream (in this case, by transforming all text data to pig latin).
- It forwards SAX events to the next handler or filter in the chain.
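The porcus transformation can also be exercised on its own, outside any SAX machinery, which makes its three steps easier to follow. The sketch below restates the same substitutions as a plain function rather than a method; the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# The same substitutions as the porcus method, written as a plain function.
sub porcus {
    my ($chars) = @_;

    # Step 1: lower-case everything.
    $chars =~ tr/A-Z/a-z/;

    # Step 2: give words that start with a vowel a leading "w".
    $chars =~ s/\b([aeiou])/w$1/g;

    # Step 3: move the leading "qu", consonant cluster (up to three
    # consonants), or single letter to the end of the word and add "ay".
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;

    return $chars;
}

my @samples = ('To me', 'one of the most', 'quick string');
print porcus($_), "\n" for @samples;
```

Note that the character class in step 3 leaves out "d" and "y", so words beginning with those letters fall through to the final single-letter alternative, which moves just the first letter.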
If we also wanted to transform the element and attribute names and values in addition to the text data we would only need to add the following start_element and end_element handlers.
sub start_element {
    my ($self, $element) = @_;

    # Build a fresh hash; adding keys to a hash while iterating over it
    # with each() has unspecified results.
    my %attrs;
    while ( my ($name, $value) = each %{$element->{Attributes}} ) {
        $attrs{ $self->porcus($name) } = $self->porcus($value);
    }
    $element->{Attributes} = \%attrs;

    $element->{Name} = $self->porcus($element->{Name});
    $self->{Handler}->start_element($element);
}

sub end_element {
    my ($self, $element) = @_;
    $element->{Name} = $self->porcus($element->{Name});
    $self->{Handler}->end_element($element);
}
Again, the principles are the same: accept events, alter the data, then forward that altered data by calling events on the filter's designated handler.
Enough silliness; let's look at a more practical example.
Transforming Document Structure
For our final example, we will demonstrate how a SAX filter can be used to alter the structure of an XML document by creating a filter that partially implements the current version of the W3C's XInclude working draft.
XInclude suggests a compact, DTD- and Schema-agnostic way to include external XML documents (or document fragments) into the current document being processed. For example,
<?xml version="1.0"?>
<article xmlns="http://localhost/myns"
         xmlns:xi="http://www.w3.org/2001/XInclude">
  <para>
    All brontosauruses are thin at one end, much much thicker in the
    middle, and then thin again at the far end.
  </para>
  <xi:include href="disclaimer.xml"/>
</article>
would signal an XInclude-aware processor to include the contents of the file disclaimer.xml into the current document between the end tag of the para element and the end tag of the top-level article element.
And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft; it will only allow inclusion of complete documents from the local file system. XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.
use strict;
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;
use XML::Handler::YAWriter;
use IO::File;

my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";
After the required imports we are ready to build our SAX filter-handler chain. The chain is more complex in this case since XML::Parser::PerlSAX generates SAX1 events and XML::Handler::YAWriter expects SAX1 events, but our XInclude filter requires the more sophisticated namespace processing provided by SAX2. We work around this by adding the filters XML::Filter::SAX1toSAX2 and XML::Filter::SAX2toSAX1 to the chain immediately before and after our custom filter. This allows for proper namespace processing while ensuring that the other parts of the handler chain are able to generate and receive the data for the given events in the format that each expects.
my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
$writer->{Pretty}->{NoProlog} = 1;
my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);
my $handler = FilterXInclude->new(Handler => $sax1_filter);
my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);
my $parser = XML::Parser::PerlSAX->new(Handler => $sax2_filter);
my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);

# end main
We now begin our XInclude filter module. Note that, again, we inherit from XML::Filter::Base to make life a little easier. Also notice that we add a BaseURI property to the filter object. This gives us a place to store the path that provides the context in which to resolve any relative URIs offered by the include elements. We set the default for this property to the current directory that the script is being executed in.
# minimal XInclude Implementation
package FilterXInclude;
use strict;
use base qw(XML::Filter::Base);
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;

sub new {
    my $class = shift;
    my %options = @_;
    $options{BaseURI} ||= './';
    return bless \%options, $class;
}

sub start_element {
    my ($self, $element) = @_;
    my %attrs = %{$element->{Attributes}};
As we begin the start_element handler, we first check for an xml:base attribute in the current element. The xml:base attribute is the recommended way to set the base URI for applications that are expected to cope with relative URIs. In this case, if an xml:base attribute is found, we set the value of the filter object's BaseURI property to its value.
It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple HASH reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces since they allow for cases where two attributes may have the same name, but are bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.
After much discussion on the perl-xml mailing list, it was decided that in SAX2 implementations attributes should remain a HASH, but should employ a notation first advanced by James Clark where the insufficient name => value structure is replaced by {namespace_uri}localname => \%attribute_properties. So, in the following block, when we say $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value} this should be understood to mean "give me the 'Value' property of the attribute that is bound to the 'http://www.w3.org/XML/1998/namespace' namespace whose local name is 'base'".
    if (defined $attrs{'{http://www.w3.org/XML/1998/namespace}base'}) {
        $self->{BaseURI} = $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value};
        $self->{BaseURI} =~ s|^file://||;
    }
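As a standalone illustration of the {namespace_uri}localname keys, the following sketch builds an attribute hash by hand in which two attributes share the local name href but belong to different namespaces. The attribute values and the attr_value helper are hypothetical, and only the Value, LocalName, and NamespaceURI properties are filled in; a real SAX2 driver may supply more.

```perl
use strict;
use warnings;

# Two attributes with the same local name, told apart by namespace URI.
# An attribute in no namespace gets the empty "{}" prefix in its key.
my %attrs = (
    '{}href' => {
        LocalName    => 'href',
        NamespaceURI => '',
        Value        => 'local.xml',
    },
    '{http://www.w3.org/1999/xlink}href' => {
        LocalName    => 'href',
        NamespaceURI => 'http://www.w3.org/1999/xlink',
        Value        => 'http://example.com/',
    },
);

# Hypothetical convenience helper: builds the key from its two parts
# and returns the attribute's value, or undef if there is no such attribute.
sub attr_value {
    my ($attrs, $ns_uri, $local_name) = @_;
    my $attr = $attrs->{ '{' . $ns_uri . '}' . $local_name };
    return defined $attr ? $attr->{Value} : undef;
}

print attr_value(\%attrs, '', 'href'), "\n";
print attr_value(\%attrs, 'http://www.w3.org/1999/xlink', 'href'), "\n";
```

With plain name => value pairs, the second href would have overwritten the first; here both survive under distinct keys.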
Next, we check to see if the current element is in the XInclude namespace and has the local name of 'include' and, if so, we send the value of that element's href attribute off to our include_proc method to include the document at that URI into the current document stream.
Also notice that we do not forward the events for the include elements since we do not want those elements to actually appear in the result document. This, coupled with the results included from the include_proc method, has the effect of replacing the include elements with the documents that they point to.
    if ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
        and $element->{LocalName} eq 'include') {
        $self->include_proc($attrs{'{}href'}->{Value});
    }
    else {
        $self->{Handler}->start_element($element);
    }
}
It is not enough to exclude the include elements from being forwarded in the start_element handler; we must do the same in the end_element handler. Otherwise, the resulting document would still contain the end tags for the include elements, and the resulting XML document would not be well-formed.
sub end_element {
    my ($self, $element) = @_;
    unless ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
            and $element->{LocalName} eq 'include') {
        $self->{Handler}->end_element($element);
    }
}
I should also point out that if you want to prune elements that may contain character data from a document, you must also implement a characters handler that conditionally blocks the forwarding of text events. Otherwise the text contained by the excluded elements will become part of the text of the nearest parent element, which is not likely to produce the desired result. We need not worry in this case since all of the include elements are empty.
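For filters that do prune elements with content, one way to block the text is to track how deep the stream currently is inside a pruned element. The sketch below is a self-contained illustration under stated assumptions: the PruneSecret and TextSink classes and the secret element name are hypothetical, the events are fired by hand rather than by a parser, and elements are matched on the SAX1-style Name property for brevity.

```perl
use strict;
use warnings;

# Hypothetical filter: drops every <secret> element, its descendants,
# and any character data found inside it.
package PruneSecret;

sub new {
    my ($class, %options) = @_;
    my $self = { %options, Depth => 0 };
    return bless $self, $class;
}

sub start_element {
    my ($self, $element) = @_;
    # Once inside a pruned element, count nesting so that we know
    # exactly when the pruned subtree has been closed again.
    if ($self->{Depth} or $element->{Name} eq 'secret') {
        $self->{Depth}++;
        return;
    }
    $self->{Handler}->start_element($element);
}

sub end_element {
    my ($self, $element) = @_;
    if ($self->{Depth}) {
        $self->{Depth}--;
        return;
    }
    $self->{Handler}->end_element($element);
}

sub characters {
    my ($self, $chars) = @_;
    return if $self->{Depth};    # swallow text inside pruned elements
    $self->{Handler}->characters($chars);
}

# Hypothetical sink that rebuilds the markup as a string.
package TextSink;
sub new { my $class = shift; return bless { out => '' }, $class }
sub start_element { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $c)  = @_; $self->{out} .= $c->{Data} }

package main;

my $sink   = TextSink->new;
my $filter = PruneSecret->new(Handler => $sink);

$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'public ' });
$filter->start_element({ Name => 'secret' });
$filter->characters({ Data => 'hidden' });
$filter->start_element({ Name => 'b' });
$filter->characters({ Data => 'also hidden' });
$filter->end_element({ Name => 'b' });
$filter->end_element({ Name => 'secret' });
$filter->characters({ Data => 'text' });
$filter->end_element({ Name => 'p' });

my $output = $sink->{out};
print "$output\n";
```

Without the characters handler, both hidden strings would have been forwarded and would have ended up as part of the text of the enclosing p element.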
Finally we get to the include_proc method, which is responsible for parsing and including the requested documents. Here we simply create a new instance of XML::Filter::SAX1toSAX2, passing the current instance of our filter as the handler, then pass that as the handler for a new instance of XML::Parser::PerlSAX, and tell the parser to parse the document passed to the subroutine in the context of the BaseURI property.
The result of this is that the events fired from these included documents are inserted into the current document stream at the precise location previously taken by the include elements.
sub include_proc {
    my ($self, $file) = @_;
    $file = $self->{BaseURI} . $file;
    my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $self);
    my $parser = XML::Parser::PerlSAX->new({
        Handler => $sax2_filter,
        Source  => {SystemId => $file},
    });
    $parser->parse;
}
Passing the following XML document to this script. . .
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xml:base="file://files/">
  <head>
    <title> Templating With XInclude and SAX2 </title>
  </head>
  <body>
    <xi:include href="header.xml"/>
    <hr width="80%"/>
    <xi:include href="content.xml"/>
    <hr width="80%"/>
    <xi:include href="footer.xml"/>
  </body>
</html>
might result in a document like
<html xml:base="file://files/" xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude">
  <head>
    <title> Templating With XInclude and SAX2 </title>
  </head>
  <body>
    <div class="header">
      <h1>Common Header</h1>
    </div>
    <hr width="80%"></hr>
    <div class="content">
      <p> Now is the winter of our discontent made glorious summer by the son of York. </p>
    </div>
    <hr width="80%"></hr>
    <div class="footer">
      <p>Common Footer</p>
    </div>
  </body>
</html>
Conclusions
SAX is an important XML technology that, like Perl, keeps simple things simple and makes hard things possible. Knowing how to generate SAX events from non-XML data and how to use SAX filters to transform existing document streams is key to a mature understanding of the power that SAX offers. We have only scratched the surface of what SAX filters and generators can do, but I hope that we have at least covered the basics well enough to pique your curiosity and provoke experimentation.