Introducing XML::SAX::Machines, Part One

February 13, 2002

Introduction

In recent columns we have seen that SAX provides a modular way to generate and filter XML content. For those just learning how SAX works, though, the task of hooking up to the correct parser-generator-driver and building chains of filters can be tricky. More experienced SAX users may have a clearer picture of how to proceed, but they often find that initializing complex filter chains is tedious and lends itself to lots of duplicated code.

Consider the following simple filter chain script:


use XML::SAX::ParserFactory;

use XML::SAX::Writer;

use My::SAXFilter::One;

use My::SAXFilter::Two;

use My::SAXFilter::Three;



my $writer  = XML::SAX::Writer->new();

my $filter3 = My::SAXFilter::Three->new( Handler => $writer );

my $filter2 = My::SAXFilter::Two->new( Handler => $filter3 );

my $filter1 = My::SAXFilter::One->new( Handler => $filter2 );

my $parser = XML::SAX::ParserFactory->parser( Handler => $filter1 );



$parser->parse_uri( $xml_file );

Not too bad for this tiny example, perhaps, but imagine how it might look in a complex system with 10 or 15 filters all doing their part. Also, new SAX users often stumble over the fact that the handler chain must be built in reverse order ($filter3 has to be initialized before $filter2 so that it can be passed in as $filter2's handler, for example). Yet another potential weakness in this script is that the filters in the chain are hard-coded from the start. While it is possible to make some aspects more flexible, adding the ability to have a dynamic list of filters only adds to the complexity of the script.

Barrie Slaymaker's outstanding new XML::SAX::Machines addresses both the complexity and the tedium of creating SAX systems. Compare the following snippet to the one above.


use XML::SAX::Machines qw( :all );



my $machine = Pipeline(

    "My::SAXFilter::One",

    "My::SAXFilter::Two",

    "My::SAXFilter::Three",

    \*STDOUT

  );



$machine->parse_uri(  $xml_file );

Less verbose, more intuitive (note that the chain is declared in processing order) and, perhaps most importantly, making the filter chain dynamic is as simple as creating a list of strings containing module names:


my $machine = Pipeline(

    @filter_list,

    \*STDOUT

  );

Here, @filter_list is built dynamically elsewhere in the application.
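One way such a list might be assembled is from a configuration value. The following is only a sketch: the filter_list_from_config function and the comma-separated format are assumptions made for illustration, not something XML::SAX::Machines provides. It splits the value, keeps the entries in processing order, and drops anything that does not look like a Perl package name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a comma-separated configuration value into a
# list of filter class names, preserving the order in which they appear.
sub filter_list_from_config {
    my ($config) = @_;
    return () unless defined $config;

    # Trim surrounding whitespace before splitting on commas.
    $config =~ s/\A\s+//;
    $config =~ s/\s+\z//;

    # Keep only entries shaped like package names (Foo, Foo::Bar, ...).
    return grep { /\A[A-Za-z_]\w*(?:::\w+)*\z/ } split /\s*,\s*/, $config;
}

my @filter_list = filter_list_from_config(
    'My::SAXFilter::One, My::SAXFilter::Three, not a module'
);

print "$_\n" for @filter_list;
```

The resulting @filter_list can then be passed to Pipeline() ahead of the output handle, exactly as in the snippet above.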

The story does not end there, however. XML::SAX::Machines and its associated Machine classes provide a small host of options for building easy-to-maintain SAX-based XML processing systems. Over the next two months we will be looking at this inventive distribution, beginning with this month's introduction.

Machine Types

XML::SAX::Machines is a high-level wrapper class that allows its various Machine classes (which may also be used as standalone libraries) to be easily chained together to create complex SAX filtering systems. XML::SAX::Machines currently installs and knows about several Machines by default.

Pipeline

Implemented by XML::SAX::Pipeline, a Pipeline provides a way to set up a linear series of filters (or other Machines) that works like the traditional hand-rolled SAX filter chain that we looked at in the introduction. That is, the events fired go directly to the next filter or handler on the chain with no intervention.


my $machine = Pipeline(

    "My::SAXFilter::One",

    "My::SAXFilter::Two",

    "My::SAXFilter::Three",

    \*STDOUT

  );

In this example, the three filter classes are fired in linear order with the results of My::SAXFilter::One being sent to My::SAXFilter::Two and so on.

Manifold

Manifold Machines provide a way to create multi-pass filters. The events are cached at the beginning of the Manifold's run and duplicate copies of that event stream are sent through the filters one by one and recompiled into a single document upon completion. It is implemented by XML::SAX::Manifold.


my $machine = Pipeline(

    Manifold(

        "My::SAXFilter::A",

        "My::SAXFilter::B",

        "My::SAXFilter::C",

      ),

    \*STDOUT

  );

Here, events fired during parsing are buffered and sent directly to each of the three filters (in order) and the output of each of the filters is merged into a single stream before being handed off to the Writer class.

Tap

Implemented by XML::SAX::Tap, a Tap offers a way to insert a class that examines one or more SAX events, but in no way alters the data passed to the next filter or handler. This can be extremely useful for cases where you want to examine the result of a given filter or other Machine part for debugging purposes. The handler that you use for your Tap need not forward the events as a typical filter would, since the same events will also be sent to the next handler in the chain as if the Tap did not exist. For example:


my $machine = Pipeline(

    "My::SAXFilter::One",

    "My::SAXFilter::Two",

    Tap(

        "My::SAXDumper"

      ),

    "My::SAXFilter::Three",

    \*STDOUT

  );

In this case, we have taken the Pipeline from above and added a Tap to send events fired by My::SAXFilter::Two to our SAXDumper for debugging.
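The article does not show My::SAXDumper itself, so here is a rough sketch of what such a handler could look like. Everything about its internals is an assumption: the Out option, the indentation scheme, and the choice to write it as a plain class with no base class. It only reports the events it receives and forwards nothing, which is the behaviour described above for a Tap handler. The sketch drives the handler by hand rather than through a Tap.

```perl
use strict;
use warnings;

{
    package My::SAXDumper;

    # "Out" is a hypothetical option naming the filehandle to report to.
    sub new {
        my ( $class, %args ) = @_;
        return bless { Out => $args{Out} || \*STDERR, Depth => 0 }, $class;
    }

    # Print the opening tag name, then indent one level deeper.
    sub start_element {
        my ( $self, $el ) = @_;
        print { $self->{Out} } '  ' x $self->{Depth}++, "<$el->{Name}>\n";
    }

    # Step back one level, then print the closing tag name.
    sub end_element {
        my ( $self, $el ) = @_;
        print { $self->{Out} } '  ' x --$self->{Depth}, "</$el->{Name}>\n";
    }

    # Report character data, skipping whitespace-only chunks.
    sub characters {
        my ( $self, $chars ) = @_;
        return unless $chars->{Data} =~ /\S/;
        print { $self->{Out} } '  ' x $self->{Depth}, "text: $chars->{Data}\n";
    }
}

# Feed the handler a few hand-built events to see the report format.
my $dumper = My::SAXDumper->new( Out => \*STDOUT );
$dumper->start_element( { Name => 'p' } );
$dumper->characters( { Data => 'Thanks for stopping by.' } );
$dumper->end_element( { Name => 'p' } );
```

Because the handler never calls a downstream handler, it is only useful in a position such as a Tap, where the events continue along the main chain on their own.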

ByRecord

ByRecord carves up record-oriented XML documents and sends each record through each filter in the ByRecord machine as a separate event stream delimited by start_document and end_document events. All other events (data outside of the records) are forwarded appropriately to the downstream filter or handler. It is implemented by XML::SAX::ByRecord.


my $machine = Pipeline(

    ByRecord(

        "My::RecordFilter::One",

        "My::RecordFilter::Two",

      ),

    "My::SAXFilter::One",

    "My::SAXFilter::Two",

    "My::SAXFilter::Three",

    \*STDOUT

  );

In this case, we have taken the Pipeline from above and added a ByRecord Machine to process the record-oriented parts of the document before beginning the rest of the Pipeline chain.

Now that we have an idea of the various Machines that are currently available, let's get straight to this month's code example.

Example -- Adding Custom Tag Libraries to XHTML

One of the more interesting ideas to emerge in the Web development world in recent years is the notion of custom tag libraries (or taglibs, for short). In a taglib implementation one or more custom tags are defined and the server application evaluates and expands or replaces those tags with the result of running some chunk of code on the server. This allows document authors to add reusable bits of server-side functionality to their pages without the hair loss associated with embedding code in the documents.

For this month's example we will write a mod_perl handler that allows us to create our own custom taglibs. We will do this by creating SAX filters that transform the various tags in our library into the desired results. And we'll use XML::SAX::Machines within our Apache handler to manage the filter chain.

First, we need to define our taglib. To keep the example simple we start off with only two tags: an <include> tag that provides a way to insert the contents of an external document defined by the uri attribute at the location of the tag, and a <fortune> tag that inserts a random quote.

To avoid possible collision with the elements allowed in the documents that will contain the tags from our taglib, we will quarantine them in their own XML namespace and bind that namespace to the prefix "widget".

Here is an example of a simple XHTML document containing our custom tags:


<?xml version="1.0"?>

<html xmlns:widget="http://localhost/saxpages/widget">

  <head>

    <title>My Cool Taglib-Enabled Page</title>

  </head>

  <body>

    <widget:include uri="/path/to/widgets/common_header.xml"/>

    <p>

     Today quote is:

    </p>

      <pre><widget:fortune/></pre>

    <p>

    Thanks for stopping by.

    </p>

    <widget:include uri="/path/to/widgets/common_footer.xml"/>

  </body>

</html>

Now let's create our SAX filters to expand our custom tags. We'll write the filter that includes an external XML document first.


package Widget::Include;

use strict;

use XML::SAX::Base;

use XML::SAX::ParserFactory;



use vars qw(@ISA $WidgetURI);

@ISA = qw(XML::SAX::Base);

$WidgetURI = 'http://localhost/saxpages/widget';

After a bit of initialization we get straight to the SAX event handlers. In the start_element handler we examine the current element's NamespaceURI and LocalName properties to see if we have an "include" element in our widget namespace. If we do, we check for a uri attribute and, if one is present, pass its value to a new parser instance that uses the current filter as its handler.


sub start_element {

    my ( $self, $el ) = @_;



    if ( $el->{NamespaceURI} eq $WidgetURI &&

         $el->{LocalName} eq 'include' ) {



         if ( defined $el->{Attributes}->{'{}uri'} ) {

             my $uri = $el->{Attributes}->{'{}uri'}->{Value};

             # Parse the included document, sending its events through this filter.

             my $parser = XML::SAX::ParserFactory->parser( Handler => $self );

             $parser->parse_uri( $uri );

         }

    }

If we did not get an element with the right name in the right namespace we forward the event to the next filter in the chain.


    else {

        $self->SUPER::start_element( $el );

    }

}
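The '{}uri' key used above follows the Perl SAX2 convention for the Attributes hash: each key is the attribute's namespace URI in curly braces followed by its local name, so an attribute with no namespace has empty braces. A small helper can hide that detail. This is only a sketch; the attribute_value function and the hand-built element hash are assumptions for illustration and are not part of XML::SAX.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of an attribute, or undef if the
# element does not carry it. The namespace URI defaults to "no namespace".
sub attribute_value {
    my ( $el, $local_name, $ns_uri ) = @_;
    $ns_uri = '' unless defined $ns_uri;

    my $attr = $el->{Attributes}{ '{' . $ns_uri . '}' . $local_name };
    return defined $attr ? $attr->{Value} : undef;
}

# A hand-built structure shaped like the element data a SAX2 parser
# passes to start_element for <widget:include uri="..."/>.
my $el = {
    Name         => 'widget:include',
    LocalName    => 'include',
    Prefix       => 'widget',
    NamespaceURI => 'http://localhost/saxpages/widget',
    Attributes   => {
        '{}uri' => {
            Name         => 'uri',
            LocalName    => 'uri',
            Prefix       => '',
            NamespaceURI => '',
            Value        => '/path/to/widgets/common_header.xml',
        },
    },
};

print attribute_value( $el, 'uri' ), "\n";
```

With a helper like this, the nested test in start_element could be reduced to a single call that returns either the URI to include or undef.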

We do a similar test in the end_element event handler, forwarding the events that we are not interested in.


sub end_element {

    my ( $self, $el ) = @_;



    $self->SUPER::end_element( $el ) unless

        $el->{NamespaceURI} eq $WidgetURI and

        $el->{LocalName} eq 'include';



}

That's it. Since this filter inherits from XML::SAX::Base we need only implement the event handlers that are required for the task at hand. All other events will be safely forwarded to the next filter/handler.

The filter that implements the <widget:fortune> tag is very similar. We check to see if the current element is named "fortune" and is bound to the correct namespace. If so, we replace the element with the text returned from a system call to the fortune program. If not, the events are forwarded to the next filter.


package Widget::Fortune;

use strict;

use XML::SAX::Base;



use vars qw(@ISA $WidgetURI);

@ISA = qw(XML::SAX::Base);

$WidgetURI = 'http://localhost/saxpages/widget';



sub start_element {

    my ( $self, $el ) = @_;



    if ( $el->{NamespaceURI} eq $WidgetURI &&

         $el->{LocalName} eq 'fortune' ) {

         my $fortune = `/usr/games/fortune`;

         $self->SUPER::characters( { Data => $fortune } );

    }

    else {

        $self->SUPER::start_element( $el );

    }

}



sub end_element {

    my ( $self, $el ) = @_;



    $self->SUPER::end_element( $el ) unless

        $el->{NamespaceURI} eq $WidgetURI and

        $el->{LocalName} eq 'fortune';



}
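The filter above assumes that the fortune program lives at /usr/games/fortune. On systems where it is missing, the backticks would quietly yield an empty string. A more defensive variant might look like the following sketch; the fortune_text function and its fallback message are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the output of the fortune program, or a
# fallback line when the program is missing or prints nothing.
sub fortune_text {
    my ($program) = @_;
    $program = '/usr/games/fortune' unless defined $program;

    my $fallback = "No fortune today.\n";

    # Only run the program if it exists as an executable file.
    return $fallback unless -f $program && -x _;

    my $text = `$program`;
    return defined $text && length $text ? $text : $fallback;
}

print fortune_text('/no/such/fortune');
```

Inside start_element, the result could be passed along in the same way as before: $self->SUPER::characters( { Data => fortune_text() } ).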

With the filters out of the way we turn to the Apache handler that will make our filters work as expected for the files on our server. The basic Apache handler module that makes our taglibs work is astonishingly small considering what it provides. We simply load XML::SAX::Machines and then, inside the required handler subroutine, create a Pipeline machine, passing in the names of the widget filter classes we just created. Then we send the required HTTP headers and call parse_uri on the file being requested by the client.


package SAXWeb::MachinePages;

use strict;

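use XML::SAX::Machines qw( :all );

use Apache::Constants qw( OK );



sub handler {

    my $r = shift;



    # Widget::Include runs first, so included content also passes through Widget::Fortune.

    my $machine = Pipeline(

        "Widget::Include" =>

        "Widget::Fortune" =>

        \*STDOUT

      );



    $r->content_type('text/html');

    $r->send_http_header;

    $machine->parse_uri( $r->filename );



    # Tell Apache that the request was handled successfully.

    return OK;

}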

Finally, we need to upload the XML documents to the server and add a small bit to one of our Apache configuration files so that our handler is called appropriately. I used the following:


<Directory /www/sites/myhostdocroot >

  <FilesMatch "\.(xml|xhtml)$">

    SetHandler perl-script

    PerlHandler SAXWeb::MachinePages

  </FilesMatch>

</Directory>

After restarting Apache, a request for the XML document we created earlier will return something like the following:


<html xmlns:widget='http://localhost/saxpages/widget'>

  <head>

    <title>My Cool Taglib-Enabled Page</title>

  </head>

  <body>

  <div class='header'>

<h2>MySite.tld</h2>

<hr />

</div>

  <p>

  Today quote is:

  </p>

  <pre>The faster we go, the rounder we get.

		-- The Grateful Dead

</pre>

  <p>

  Thanks for stopping by.

  </p>

  <div class='footer'>

<hr />

<p>Copyright 2000 MySite.tld, Ltd. All rights reserved.</p>

</div>

  </body>

</html>

No Webby awards here, to be sure, but the basic foundation is sound, and implementing new tags for our tag library is a matter of creating new SAX filter classes and adding them to the Pipeline in the Apache handler.

Conclusions

We've only touched the surface of what XML::SAX::Machines can do. Tune in next month when we will delve deeper into the API and show off some of its advanced features.