XML::LibXML - An XML::Parser Alternative
November 14, 2001
Introduction
The vast majority of Perl's XML modules are built on top of XML::Parser, Larry Wall and Clark Cooper's Perl interface to James Clark's expat parser. The expat-XML::Parser combination is not, however, the only full-featured XML parser available in the Perl world. This month we'll look at XML::LibXML, Matt Sergeant and Christian Glahn's Perl interface to Daniel Veillard's libxml2.
Why Would You Want Yet Another XML Parser?
Expat and XML::Parser have proven themselves to be quite capable, but they are not without limitations. Expat was among the first XML parsers available and, as a result, its interfaces reflect the expectations of users at the time it was written. Expat and XML::Parser do not implement the Document Object Model, SAX, or XPath language interfaces (things that most modern XML users take for granted) because each of those interfaces either did not exist yet or was still being heavily evaluated and was not considered "standard" at the time they were written.
The somewhat unfortunate result of this is that most of the available Perl XML modules are built upon one of XML::Parser's non- or not-quite-standard interfaces with the presumption that the input will be some sort of textual representation of an XML document (file, filehandle, string, socket stream) that must be parsed before proceeding. While this works for many simple cases, most advanced XML applications need to do more than one thing with a given document, and that means that for each stage in the process, the document must be serialized to a string and then re-parsed by the next module.
By contrast, libxml2 was written after the DOM, XPath, and SAX interfaces became common, and so it implements all three. In-memory trees can be built by parsing documents stored in files, strings, and so on, or generated from a series of SAX events. Those trees can then be operated on using the W3C DOM and XPath interfaces or used to generate SAX events that are handed off to external event handlers. This added flexibility, which reflects current XML processing expectations, makes XML::LibXML a strong contender for XML::Parser's throne.
Using XML::LibXML
This month's column may be seen as an addendum to the Perl/XML Quickstart Guide published earlier this year, when XML::LibXML was in its infancy, and we'll use the same tests from the Quickstart to put XML::LibXML through its paces. For a detailed overview of the test cases, see the first installment in the Quickstart. To summarize, the two tests illustrate how to extract and print data from an XML document, and how to build and print, programmatically, an XML document from data stored in a Perl hash, using the facilities offered by a given XML module.
Reading
For accessing the data stored in XML documents, XML::LibXML provides a standard W3C DOM interface. Documents are treated as a tree of nodes, and the data those nodes contain is accessed by calling methods on the node objects themselves.

    use strict;
    use XML::LibXML;

    my $file   = 'files/camelids.xml';
    my $parser = XML::LibXML->new();
    my $tree   = $parser->parse_file($file);
    my $root   = $tree->getDocumentElement;

    # collect every <species> element below the root
    my @species = $root->getElementsByTagName('species');

    foreach my $camelid (@species) {
        my $latin_name  = $camelid->getAttribute('name');
        my @name_node   = $camelid->getElementsByTagName('common-name');
        my $common_name = $name_node[0]->getFirstChild->getData;
        my @c_node      = $camelid->getElementsByTagName('conservation');
        my $status      = $c_node[0]->getAttribute('status');
        print "$common_name ($latin_name) $status \n";
    }

One of the more exciting features of XML::LibXML is that, in addition to the DOM interface, it allows you to select nodes using the XPath language. The following illustrates how to achieve the same effect as the previous example using XPath to select the desired nodes:

    use strict;
    use XML::LibXML;

    my $file   = 'files/camelids.xml';
    my $parser = XML::LibXML->new();
    my $tree   = $parser->parse_file($file);
    my $root   = $tree->getDocumentElement;

    # each XPath expression is evaluated relative to the node it is called on
    foreach my $camelid ($root->findnodes('species')) {
        my $latin_name  = $camelid->findvalue('@name');
        my $common_name = $camelid->findvalue('common-name');
        my $status      = $camelid->findvalue('conservation/@status');
        print "$common_name ($latin_name) $status \n";
    }

What makes this exciting is that you can mix and match methods from the DOM and XPath interfaces to best suit the needs of your application, while operating on the same tree of nodes.
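As a minimal sketch of that mixing, the following selects nodes with an XPath expression and then reads them with DOM methods. The inline sample document is hypothetical: its element and attribute names mirror what the listings above look for, but the real camelids.xml is not reproduced here, and the status values are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data (structure inferred from the listings above;
# the status values are invented placeholders).
my $xml = <<'XML';
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary</common-name>
    <conservation status="stable"/>
  </species>
  <species name="Lama guanicoe">
    <common-name>Guanaco</common-name>
    <conservation status="unknown"/>
  </species>
</camelids>
XML

my $parser = XML::LibXML->new();
my $tree   = $parser->parse_string($xml);
my $root   = $tree->getDocumentElement;

my @lines;

# XPath selects the nodes: only <species> that have a <conservation> child
foreach my $camelid ($root->findnodes('species[conservation]')) {
    # DOM methods then read data from the very same node objects
    my $latin_name  = $camelid->getAttribute('name');
    my ($name_node) = $camelid->getElementsByTagName('common-name');
    my $common_name = $name_node->textContent;

    # ...and XPath can be used again wherever it is more convenient
    my $status = $camelid->findvalue('conservation/@status');

    push @lines, "$common_name ($latin_name) $status";
}

print "$_\n" for @lines;
```

Because both interfaces operate on the same node objects, there is no conversion step between the XPath selection and the DOM calls.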
Writing
To create an XML document programmatically with XML::LibXML, you simply use the provided DOM interface:

    use strict;
    use XML::LibXML;

    # %camelid_links is the hash of link data from the Quickstart tests;
    # it is assumed to be declared and filled before this point, with each
    # value holding a hash reference with 'url' and 'description' keys.

    my $doc  = XML::LibXML::Document->new();
    my $root = $doc->createElement('html');
    $doc->setDocumentElement($root);

    my $body = $doc->createElement('body');
    $root->appendChild($body);

    foreach my $item (keys (%camelid_links)) {
        my $link = $doc->createElement('a');
        $link->setAttribute('href', $camelid_links{$item}->{url});
        my $text = XML::LibXML::Text->new($camelid_links{$item}->{description});
        $link->appendChild($text);
        $body->appendChild($link);
    }

    print $doc->toString;

An important difference between XML::LibXML and XML::DOM is that libxml2's object model conforms to the W3C DOM Level 2 interface, which is better able to cope with documents containing XML Namespaces. So, where XML::DOM is limited to:

    @nodeset = $node->getElementsByTagName($element_name);

and

    $node = $doc->createElement($element_name);

XML::LibXML also provides:

    @nodeset = $node->getElementsByTagNameNS($namespace_uri, $element_name);

and

    $node = $doc->createElementNS($namespace_uri, $element_name);

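To make the difference concrete, here is a small sketch that builds one element inside a namespace and one with the same local name outside it, then retrieves only the namespaced one. The namespace URI, the `c` prefix, and the element names are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical namespace URI, used only for this illustration
my $ns = 'http://example.com/camelids';

my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElementNS($ns, 'c:camelids');
$doc->setDocumentElement($root);

# one element in the namespace, one with the same local name outside it
my $in_ns = $doc->createElementNS($ns, 'c:species');
my $no_ns = $doc->createElement('species');
$root->appendChild($in_ns);
$root->appendChild($no_ns);

# The namespace-aware lookup matches on namespace URI plus local name,
# so the un-namespaced <species> is not returned.
my @in_namespace = $root->getElementsByTagNameNS($ns, 'species');

print scalar(@in_namespace), " element(s) found in $ns\n";
print $doc->toString(1);
```

Note that the lookup is keyed on the namespace URI rather than on the prefix, so it keeps working even if a document author picks a different prefix for the same namespace.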
The Joy of SAX
We've seen the DOM and XPath goodness that XML::LibXML provides, but the story does not end there. The libxml2 library also offers a SAX interface that can be used to create DOM trees from SAX events or generate SAX events from DOM trees.
The following creates a DOM tree programmatically from a SAX driver built on XML::SAX::Base. In this example, the initial SAX events are generated from a custom driver implemented in the CamelDriver class that calls the handler events in the XML::LibXML::SAX::Builder class to build the DOM tree.

    use XML::LibXML;
    use XML::LibXML::SAX::Builder;

    # %camelid_links is the hash of link data from the Quickstart tests,
    # assumed to be declared and filled before this point.

    my $builder = XML::LibXML::SAX::Builder->new();
    my $driver  = CamelDriver->new(Handler => $builder);
    my $doc     = $driver->parse(%camelid_links);

    # doc is an XML::LibXML::Document object
    print $doc->toString;

    package CamelDriver;
    use base qw(XML::SAX::Base);

    sub parse {
        my $self  = shift;
        my %links = @_;

        # each SUPER:: call forwards the event to the configured Handler
        $self->SUPER::start_document;
        $self->SUPER::start_element({Name => 'html'});
        $self->SUPER::start_element({Name => 'body'});

        # iterate over the data passed in, not over a global
        foreach my $item (keys (%links)) {
            $self->SUPER::start_element({Name => 'a',
                                         Attributes => {
                                             'href' => $links{$item}->{url}
                                         }
                                        });
            $self->SUPER::characters({Data => $links{$item}->{description}});
            $self->SUPER::end_element({Name => 'a'});
        }

        $self->SUPER::end_element({Name => 'body'});
        $self->SUPER::end_element({Name => 'html'});

        # end_document hands back whatever the handler built (here, the DOM)
        $self->SUPER::end_document;
    }

    1;

You can also generate SAX events from an existing DOM tree using XML::LibXML::SAX::Generator. In the following snippet, the DOM tree created by parsing the file camelids.xml is handed to XML::LibXML::SAX::Generator's generate() method, which in turn calls the event handlers in XML::Handler::XMLWriter to print the document to STDOUT.

    use strict;
    use XML::LibXML;
    use XML::LibXML::SAX::Generator;
    use XML::Handler::XMLWriter;

    my $file   = 'files/camelids.xml';
    my $parser = XML::LibXML->new();
    my $doc    = $parser->parse_file($file);

    my $handler = XML::Handler::XMLWriter->new();
    my $driver  = XML::LibXML::SAX::Generator->new(Handler => $handler);

    # generate SAX events that are captured
    # by a SAX Handler or Filter.
    $driver->generate($doc);

This ability to accept and emit SAX events is especially useful in light of the recent discussion in this column of generating SAX events from non-XML data and writing SAX filter chains. You could, for example, use a SAX driver written in Perl to emit events based on data returned from a database query that creates a DOM object, which is then transformed in C-space for display using XSLT and the mind-numbingly fast libxslt library (which expects libxml2 DOM objects), and then emit SAX events from that transformed DOM tree for further processing using custom SAX filters to provide the finishing touches, all without once having had to serialize the document to a string for re-parsing. Wow.
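The XSLT step of such a pipeline might look like the sketch below, which assumes the separate XML::LibXSLT module (the Perl binding to libxslt) is installed. The input document and the stylesheet are hypothetical stand-ins; in the scenario described above, the input DOM would come from a SAX driver rather than from parse_string.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();

# Hypothetical stand-in for a DOM tree built elsewhere (e.g. from SAX events)
my $doc = $parser->parse_string(
    '<links><link url="http://example.com/llama">Llamas</link></links>'
);

# Hypothetical stylesheet that turns each <link> into an HTML-style anchor
my $style_doc = $parser->parse_string(<<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:template match="/links">
    <html><body><xsl:apply-templates select="link"/></body></html>
  </xsl:template>
  <xsl:template match="link">
    <a href="{@url}"><xsl:value-of select="."/></a>
  </xsl:template>
</xsl:stylesheet>
XSL

my $xslt       = XML::LibXSLT->new();
my $stylesheet = $xslt->parse_stylesheet($style_doc);

# transform() takes a DOM tree and returns a DOM tree: no string round trip
my $result = $stylesheet->transform($doc);

# the result is an ordinary libxml2 document, so XPath works on it directly
my $href = $result->findvalue('/html/body/a/@href');
print "$href\n";

# serialize only at the very end of the pipeline
print $stylesheet->output_string($result);
```

The transformed tree in `$result` could just as well be handed to a SAX generator for further filtering instead of being printed.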
Conclusions
As we have seen, XML::LibXML offers a fast, updated approach to XML processing that may be superior to the first-generation XML::Parser for many cases. Do not misunderstand: XML::Parser and its dependents are still quite useful and well-supported, and they are not likely to go away any time soon. But XML::Parser is not the only game in town, and given the added flexibility that XML::LibXML provides, I would strongly encourage you to give XML::LibXML a closer look before beginning your next Perl/XML project.