Writing SAX Drivers for Non-XML Data
September 19, 2001
In a previous column, we covered the basics of the Simple API for XML (SAX) and the modules that implement that interface in Perl. Over the course of the next two months we will move beyond these basic topics to look at two slightly more advanced ones: creating drivers that generate SAX events from non-XML sources and writing custom SAX filters. If you are not familiar with the way SAX works, please read High-Performance XML Parsing With SAX before proceeding.
What Is A SAX Driver, And Why Would You Want One?
SAX is an event-driven API in which the contents of an XML document are accessed through callback subroutines that fire based on various XML parsing events (the beginning of an element, the end of an element, character data, etc.). For the purpose of this article, a SAX driver (sometimes called a SAX generator) can be understood to mean any Perl class that can generate these SAX events.
In the most common case, a SAX driver acts as a proxy between an XML parser and the one or more handler classes written by the developer. The handler methods detailed in the SAX API are called as the parser makes its way through the document, thereby providing access to the contents of that XML document. In fact, this is precisely what SAX was designed for: to provide a simple means to access information stored in XML. As we will see, however, it is often handy to be able to generate these events from data sources other than XML documents.
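To make the idea of "firing events" concrete, here is a minimal sketch that calls handler methods by hand, with no parser involved. The EventLogger class is a hypothetical stand-in written for this illustration (it is not part of any CPAN distribution); it only records which events it receives. The hash-reference arguments follow the Perl SAX1 conventions used in the examples throughout this article.

```perl
use strict;
use warnings;

# Hypothetical handler: records every SAX1 event it receives.
{
    package EventLogger;
    sub new { my $class = shift; return bless { log => [] }, $class }
    sub start_document { my $self = shift; push @{ $self->{log} }, 'start_document' }
    sub end_document   { my $self = shift; push @{ $self->{log} }, 'end_document' }
    sub start_element  { my ($self, $el) = @_; push @{ $self->{log} }, "start_element($el->{Name})" }
    sub end_element    { my ($self, $el) = @_; push @{ $self->{log} }, "end_element($el->{Name})" }
    sub characters     { my ($self, $ch) = @_; push @{ $self->{log} }, "characters($ch->{Data})" }
}

my $handler = EventLogger->new;

# A "driver" is any code that makes these calls in a sensible order.
$handler->start_document();
$handler->start_element({ Name => 'greeting', Attributes => {} });
$handler->characters({ Data => 'hello' });
$handler->end_element({ Name => 'greeting' });
$handler->end_document();

print "$_\n" for @{ $handler->{log} };
```

The handler neither knows nor cares whether the calls came from an XML parser or from ordinary Perl code, which is what makes drivers for non-XML data possible.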
A Simple Example: Dumping A Perl Hash As An XML Document
Before we look at our first example, it's important to note that a SAX driver is useless without a handler that receives the generated events and does something with the data passed. While the basics of writing SAX handlers are quite easy to grasp, the handlers themselves can sometimes be quite complex. Our focus here is on generating events, not handling them; so, for simplicity's sake, we will use Ken MacLeod's XML::Handler::XMLWriter (which takes a SAX event stream and prints it to STDOUT as an XML document) as the default handler throughout this article.
To show how to write a SAX driver we will create a simple inlined class that translates a typical Perl hash into a well-formed XML document where the keys are the element names and the values are the character data contained by those elements.
use strict;
use XML::Handler::XMLWriter;
The main portion of the script consists of nothing more than initialization of the new handler and driver objects and the call to the driver's parse method. We use the parse method to kick off the SAX event stream, passing the hash we wish to dump to XML as the sole argument. In this case we will use Perl's venerable built-in system environment dictionary %ENV.
my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverHash2XML->new(Handler => $writer);
$driver->parse(%ENV);
Next we create our driver class beginning with a typical constructor method.
package SAXDriverHash2XML;
# generate SAX1 events from a simple Perl HASH.
use strict;

# standard constructor
sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}
Finally we get to the substantial part of our driver, the parse method.
# generate the events
sub parse {
    my $self = shift;
    my %passed_hash = @_;
After slurping the sole argument into the local %passed_hash, we begin firing off the necessary SAX events to create our XML document. Recall that we passed a blessed instance of XML::Handler::XMLWriter as the default handler for our driver. Generating the SAX events is as simple as calling the appropriate handler methods on that object and passing the data through as arguments in the format that the handler expects. This is the essence of writing a custom SAX driver.
We begin the SAX event stream by calling the required start_document handler.
$self->{Handler}->start_document();
Now a Perl hash is a list of key-value pairs; but for our XML document to be well-formed, it must have a single top-level element. To meet the well-formedness requirement, we will add a top-level wrapper element named "root".
Pay special attention to the arguments we pass to the start_element handler. Perl SAX1 implementations expect a hash reference of named properties where the Name property is a string containing the element's name, and the Attributes property is another hash reference that contains the XML attributes attached to that element (empty in this case).
$self->{Handler}->start_element({Name => 'root', Attributes => {}});
Next we loop over the entries of %passed_hash using the each function. For each entry, we fire a start_element, a characters, and an end_element event. Note that the argument to the characters handler is a hash reference with a single property (Data) that contains the character data that will become the text content of the surrounding element.
    while (my ($name, $value) = each(%passed_hash)) {
        $name = lc($name); # we like lower-case tag names
        $self->{Handler}->start_element({Name => $name, Attributes => {}});
        $self->{Handler}->characters({Data => $value});
        $self->{Handler}->end_element({Name => $name});
    }
Finally we call the end_element event on the "root" wrapper element, followed by the end_document handler, which signals the handler class that the "parse" is complete.
    $self->{Handler}->end_element({Name => 'root'});
    $self->{Handler}->end_document();
}
Running this script on my machine yields an XML document that is similar to the following:
<?xml version="1.0"?>
<root>
<bash_env>/home/kip/.bashrc</bash_env>
<ostype>linux</ostype>
<histsize>1000</histsize>
<hostname>duchamp.hampton.ws</hostname>
<user>kip</user>
<hosttype>i386</hosttype>
<home>/home/kip</home>
<term>linux</term>
<logname>kip</logname>
<path>/usr/local/bin:/bin:/usr/bin:/usr/x11r6/bin:
/home/kip/bin</path>
<shell>/bin/bash</shell>
<mail>/var/spool/mail/kip</mail>
<lang>en_us</lang>
</root>
Note that I said that the output of this script is similar to the snippet above. Actually, the resulting document puts all the elements and data on a single line. Remember, XML elements can have mixed content (containing both child elements and text) so all character data is important. The spaces and newline characters added to this example to make it more readable here are, in truth, text data contained by the "root" element and would have to be explicitly added via calls to the characters handler to produce an exact match.
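For readers who want to run the whole driver in one piece without installing anything, the fragments above can be assembled roughly as follows. This is a sketch under stated assumptions: TinyWriter is a hypothetical stand-in for XML::Handler::XMLWriter that builds a string and performs no escaping, the keys are sorted only so that the output is repeatable, and parse returns whatever the handler's end_document returns (a convenience added for this example, not something the article's driver does).

```perl
use strict;
use warnings;

# Hypothetical stand-in for XML::Handler::XMLWriter: accumulates a string
# instead of printing to STDOUT, and does no escaping of its own.
{
    package TinyWriter;
    sub new            { my $class = shift; return bless { out => '' }, $class }
    sub start_document { my $self = shift; $self->{out} = '<?xml version="1.0"?>' }
    sub end_document   { my $self = shift; return $self->{out} }
    sub start_element  { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
    sub end_element    { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
    sub characters     { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }
}

# The hash driver from the walkthrough, assembled in one place.
{
    package SAXDriverHash2XML;
    sub new {
        my ($proto, %args) = @_;
        my $class = ref($proto) || $proto;
        my $self  = {%args};
        return bless $self, $class;
    }
    sub parse {
        my ($self, %passed_hash) = @_;
        my $handler = $self->{Handler};
        $handler->start_document();
        $handler->start_element({ Name => 'root', Attributes => {} });
        for my $key (sort keys %passed_hash) {    # sorted for repeatable output
            my $name = lc $key;
            $handler->start_element({ Name => $name, Attributes => {} });
            $handler->characters({ Data => $passed_hash{$key} });
            $handler->end_element({ Name => $name });
        }
        $handler->end_element({ Name => 'root' });
        return $handler->end_document();
    }
}

my $writer = TinyWriter->new;
my $driver = SAXDriverHash2XML->new(Handler => $writer);
my $xml    = $driver->parse(USER => 'kip', SHELL => '/bin/bash');
print "$xml\n";
```

Because no whitespace events are fired, everything lands on a single line, just as described above.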
Simplifying Event Generation
As we have seen, generating SAX events is as simple as calling the appropriate method on the handler object from within our driver class. However, writing $obj->{Handler}->method_name($appropriate_hashref) for each event can be cumbersome and error-prone. Not only is it a lot of typing, it also requires intimate knowledge of the properties that each event expects, and we have to get those properties right each and every time. If we do not mind a little extra overhead, we can make life a little easier by creating wrapper methods within our driver class. These allow us to write our parse() method in a simpler, more Perlish way, while ensuring that the handler receives the data passed by the event in the format that it expects.
For our second and final example we will write a simple driver that produces an XML document from a genetic sequence record stored in the FASTA file format. We will use the Bio::SeqIO module from the bioperl project to read the sequence record, calling convenience methods in our driver class to generate the required SAX events to translate that record into an XML format. Again, we will use XML::Handler::XMLWriter as the default handler for our driver.
The main portion of our script is more or less identical to that of the previous example; we create new instances of the handler and driver classes and call the parse method on the driver object. This time, though, we pass the location of the FASTA file that we want to translate to XML as the sole argument to parse.
use strict;
use XML::Handler::XMLWriter;

my $sequence_file = 'files/seq1.fasta';
my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverFastaFile->new(Handler => $writer);
$driver->parse($sequence_file);
Now we begin our driver class.
package SAXDriverFastaFile;
# generate SAX1 events from a fasta sequence record.
use strict;
use Bio::SeqIO;
use vars qw($AUTOLOAD);

sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}
We have decided that we would rather pass simple structures to the event generators that are used most often (rather than the hash references that the SAX handler expects), so we will implement the start_element, end_element, and characters methods inside our driver class to accept these simpler arguments and forward the data to the handler in the expected format.
sub start_element {
    my $self = shift;
    my $element_name = shift;
    my %attributes = @_;
    $self->{Handler}->start_element({Name => $element_name, Attributes => \%attributes});
}

sub end_element {
    my ($self, $element_name) = @_;
    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}

sub characters {
    my ($self, $data) = @_;
    $self->{Handler}->characters({Data => $data});
}
With these methods in place we can now write:
$obj->start_element('element_name', (attr1 => 'some value', attr2 => 'some other value'))
rather than the more verbose:
$obj->{Handler}->start_element({Name => 'element_name', Attributes => {attr1 => 'some value', attr2 => 'some other value'} })
In analyzing the task at hand, we notice that we often want to produce simple XML data elements in the format <name>value</name>. To make generating these events easier, we will add the following data_element method to our driver class, which will allow us to produce these elements by calling $obj->data_element('name', 'value').
sub data_element {
    my ($self, $element_name, $data) = @_;
    $self->{Handler}->start_element({Name => $element_name, Attributes => {}});
    $self->{Handler}->characters({Data => $data});
    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}
Did you notice the call to the mysterious newline method? The handler for this driver does nothing more than present the SAX events as an XML document, and we have decided that the resulting document should have at least some sort of minimal formatting to make it easier to look at in a text editor. In this case, having each element on a separate line will suffice. Inserting newlines into the document is likely to be very common, so, rather than calling $obj->characters("\n") for every line break, we have created the following newline method that does that for us.
sub newline {
    my $self = shift;
    $self->{Handler}->characters({Data => "\n"});
}
With the convenience methods out of the way, we have only to write the code that translates the FASTA record into XML. To keep things nice and tidy, we will break things up a bit. The parse method initializes the Bio::SeqIO object that processes the file passed from the main section of the script, starts the SAX event stream with the call to start_document, and opens the required top-level element (<fasta_sequence>). After looping over the gene sequences contained in the file, and passing them off to the seq2sax1 method to handle the details, parse then closes the root element and ends the event stream with a call to end_document. Note the calls to our newline method along the way to ensure that the document produced is in the proper format.
sub parse {
    my $self = shift;
    my $seq_file = shift;
    my $seq_in = Bio::SeqIO->new(-file => $seq_file, -format => 'fasta');
    $self->start_document();
    $self->start_element('fasta_sequence');
    $self->newline;
    while (my $seq = $seq_in->next_seq()) {
        $self->seq2sax1($seq->{primary_seq});
    }
    $self->end_element('fasta_sequence');
    $self->newline;
    $self->end_document();
}
The seq2sax1 method is very similar to the parse method from the earlier example. Each sequence is represented as a hash reference of key-value pairs and we need only loop over the elements of that hash, calling our various convenience methods as we go. Note that each sequence is wrapped in a <primary_seq> element to ensure that the resulting XML data reflects the information captured by Bio::SeqIO.
sub seq2sax1 {
    my ($self, $seq) = @_;
    my %attrs;
    $attrs{display_id} = $seq->{display_id};
    $attrs{primary_id} = $seq->{primary_id};
    $self->start_element('primary_seq', %attrs);
    $self->newline;
    while ( my ($name, $value) = each (%{$seq})) {
        next if $name =~ /_id$/; # display_id and primary_id are already attributes
        $self->data_element($name, $value);
    }
    $self->end_element('primary_seq');
}
Careful readers will have noticed that we call the start_document and end_document methods on our driver object but the driver class does not implement these methods. This would normally cause Perl to die with an error about its inability to locate these object methods. We have kept the event generator interface localized to the driver class by using Perl's built-in AUTOLOAD subroutine to forward these methods to the handler for us.
# expensive, but handy
sub AUTOLOAD {
    my $self = shift;
    my $called_sub = $AUTOLOAD;
    $called_sub =~ s/.+:://; # snip pkg name...
    return if $called_sub eq 'DESTROY'; # do not forward object destruction
    if (my $method = $self->{Handler}->can($called_sub)) {
        $method->($self->{Handler}, @_);
    }
    else {
        warn "Method '$called_sub' not implemented by handler $self->{Handler}\n";
    }
}
Below is an abbreviated snippet of the document produced by running this script on the sample FASTA record file that ships with the bioperl distribution. The complete file can be seen in this month's source code.
<?xml version="1.0"?>
<fasta_sequence>
<primary_seq display_id="gi|2981175" primary_id="gi|2981175">
<moltype>protein</moltype>
<desc>deltex</desc>
<seq>MSRPGHGGLMPVNGLGFPPQNVARVVVWECLNEHSRWR...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
<primary_seq display_id="gi|927067" primary_id="gi|927067">
<moltype>protein</moltype>
<desc>longation factor 1-alpha 1</desc>
<seq>MQSERGITIDISLWKFETSKYYVTIIDAPGHRDFIQNM...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
...
</fasta_sequence>
The result is a bit scary to look at, perhaps, but it is accurate. Most importantly, we did not have to reinvent the wheel to translate our data to an XML format.
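The AUTOLOAD forwarding used by the driver can be tried in isolation. The sketch below uses two hypothetical classes, CountingHandler and ForwardingDriver, invented for this illustration. It differs from the driver above in one design choice: it dies rather than warns when the handler lacks the requested method, which makes mistakes harder to overlook. It also skips DESTROY, since Perl routes object destruction through AUTOLOAD when no DESTROY method exists.

```perl
use strict;
use warnings;

# Hypothetical handler that counts how often each event is called.
{
    package CountingHandler;
    sub new            { my $class = shift; return bless { calls => {} }, $class }
    sub start_document { my $self = shift; $self->{calls}{start_document}++ }
    sub end_document   { my $self = shift; $self->{calls}{end_document}++ }
}

# Hypothetical driver with no event methods of its own.
{
    package ForwardingDriver;
    our $AUTOLOAD;
    sub new {
        my ($class, %args) = @_;
        my $self = {%args};
        return bless $self, $class;
    }
    sub AUTOLOAD {
        my $self = shift;
        (my $called_sub = $AUTOLOAD) =~ s/.*:://;    # strip the package name
        return if $called_sub eq 'DESTROY';          # never forward destruction
        my $method = $self->{Handler}->can($called_sub)
            or die "Method '$called_sub' not implemented by handler\n";
        return $method->($self->{Handler}, @_);
    }
}

my $handler = CountingHandler->new;
my $driver  = ForwardingDriver->new(Handler => $handler);

$driver->start_document();    # forwarded to the handler
$driver->end_document();      # forwarded to the handler

# A method the handler does not implement raises an exception.
my $failed = !eval { $driver->no_such_event(); 1 };

print "start_document calls: $handler->{calls}{start_document}\n";
print "unknown method rejected\n" if $failed;
```

As the comment in the driver says, this is expensive: every forwarded call pays for a can() lookup, so methods that are called often are better written out explicitly.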
Avoiding Common Traps
We have seen how easy it can be to produce SAX event streams (and, hence, XML documents) from non-XML data, but there are a few common gotchas to be aware of before you begin writing your own custom SAX drivers.
In the standard SAX model, where an XML parser is the application that fires the events, the parser is responsible for making sure that the incoming data meets the requirements of well-formed XML; that is, that the document contains a single root element, that each start tag has a corresponding end tag, that the tags are nested properly, and so on. When we remove the XML parser from the equation and call the methods of a SAX handler directly, there is no such safety net. It is entirely possible to write SAX drivers whose resulting documents would not meet XML's well-formedness requirements, and driver authors should take care to ensure that the event streams being produced actually meet those requirements.
Similarly, driver developers need to make sure that any characters that XML treats as special are replaced by their corresponding entities or wrapped in CDATA sections before passing the data on to a characters handler. Specifically, the characters &, <, >, and ' should be replaced by &amp;, &lt;, &gt;, and &apos;, respectively (or wrapped in a CDATA section).
The easiest way to ensure that the event streams produced by a given driver are legitimate XML is to hook a writer handler to the driver (as in the examples above), save the results to a file, and attempt to parse the resulting document using your favorite XML parser. For example, if you have the Gnome Project's libxml2 installed, you can check the resulting output by typing xmllint --noout myfile.xml, or, using XML::Parser, perl -MXML::Parser -e 'XML::Parser->new(ErrorContext => 2)->parsefile(q(myfile.xml))' at the command line. In both cases, if the parser does not complain, then you know that your driver is producing well-formed XML.
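Some nesting mistakes can also be caught without a round trip through a file, by temporarily swapping in a handler that checks the event stream itself. The following is a minimal sketch, assuming a hypothetical NestingChecker class written for this illustration. It verifies only that there is exactly one top-level element and that every end_element matches the most recently opened element; it does not check element names, attribute values, or character data, so it complements rather than replaces the parser test described above.

```perl
use strict;
use warnings;

# Hypothetical handler that dies as soon as the event stream stops
# looking like a properly nested document with a single root element.
{
    package NestingChecker;
    sub new { my $class = shift; return bless { stack => [], roots => 0 }, $class }
    sub start_document {
        my $self = shift;
        $self->{stack} = [];
        $self->{roots} = 0;
    }
    sub start_element {
        my ($self, $el) = @_;
        $self->{roots}++ unless @{ $self->{stack} };    # opened at top level
        die "more than one top-level element\n" if $self->{roots} > 1;
        push @{ $self->{stack} }, $el->{Name};
    }
    sub end_element {
        my ($self, $el) = @_;
        my $name = $el->{Name};
        my $open = pop @{ $self->{stack} };
        die "end_element($name) with no open element\n" unless defined $open;
        die "end_element($name) does not match <$open>\n" unless $open eq $name;
    }
    sub characters { }    # character data is not checked in this sketch
    sub end_document {
        my $self = shift;
        if (@{ $self->{stack} }) {
            my $open = $self->{stack}[-1];
            die "unclosed element <$open>\n";
        }
        die "no root element\n" unless $self->{roots};
        return 1;
    }
}

my $checker = NestingChecker->new;
my $ok = eval {
    $checker->start_document();
    $checker->start_element({ Name => 'root', Attributes => {} });
    $checker->start_element({ Name => 'item', Attributes => {} });
    $checker->end_element({ Name => 'root' });    # wrong: <item> is still open
    $checker->end_document();
};
my $error = $@;
print $ok ? "event stream nests correctly\n" : "problem: $error";
```

Passing an instance of such a class as the Handler argument to a driver during development would surface these errors at the point where the offending event is fired.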
Conclusions
I'm certain that there are XML purists out there for whom this technique (using a non-XML class to produce SAX event streams) will seem like heresy. Indeed, you do need to be a bit more careful when letting your own custom module stand in for an XML parser (for the reasons stated above), but, in my opinion, the benefits far outweigh the costs. Writing custom SAX drivers provides a predictable, memory-efficient, and easy way to take advantage of Perl's advanced built-in data handling capabilities and vast collection of non-XML parsers and other data interfaces to create XML document streams. Could bypassing an XML parser and calling a SAX handler's methods directly be considered a hack? Perhaps. If so, it is a darn good and useful one.
If you are intrigued by the notion of custom SAX drivers, or think that you may have a place for them in your work, I strongly encourage you to have a look at the code for Ilya Sterin's XML::SAXDriver::Excel and XML::SAXDriver::CSV, as well as Matt Sergeant's XML::Generator::DBI and Petr Cimprich's XML::Directory::SAXGenerator, for ideas.
Be sure to tune in next month for part two of our advanced SAX series where we will learn how to write our own custom SAX filters.