Perl XML Quickstart: The Standard XML Interfaces
May 16, 2001
Introduction
This is the second part in a series of articles meant to quickly introduce some of the more popular Perl XML modules. This month we look at the Perl implementations of the standard XML APIs: the Document Object Model, the XPath language, and the Simple API for XML.
As stated in part one, this series is not concerned with comparing the relative merits of the various XML modules. My only goal is to provide enough sample code to help you decide for yourself which module or approach is most appropriate for your situation by showing you how to achieve the same result with each module given two simple tasks. Those tasks are 1) extracting data from an XML document and 2) producing an XML document from a Perl hash. Please see last month's column for a complete description of the sample requirements.
Samples of the Perl Implementations of the Standard XML Interfaces
The Document Object Model (XML::DOM)
The Document Object Model, or DOM for short, provides a language neutral interface to XML data by representing the document's contents as a hierarchical structure of objects whose properties describe the relationships between one object and another. The Perl implementation of the DOM is called, unsurprisingly, XML::DOM.
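To make the "relationships between objects" idea concrete before the full samples, here is a minimal sketch that walks a small tree and reports each element's parent. The inline document and the helper name element_paths are my own assumptions for illustration; they are not part of this month's sample files.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical inline document; the element names echo the camelid
# samples, but this snippet is not one of the article's sample files.
my $xml = '<camelids><species name="Lama glama">'
        . '<common-name>Llama</common-name>'
        . '<conservation status="no special status"/>'
        . '</species></camelids>';

# Recursively describe every element below $node in terms of its parent.
sub element_paths {
    my ($node) = @_;
    my @out;
    for my $child ($node->getChildNodes) {
        # Skip text, comment, and other non-element nodes.
        next unless $child->getNodeType == ELEMENT_NODE;
        push @out, $child->getNodeName . ' is a child of '
                 . $child->getParentNode->getNodeName;
        push @out, element_paths($child);
    }
    return @out;
}

my $doc = XML::DOM::Parser->new->parse($xml);
print "$_\n" for element_paths($doc->getDocumentElement);
$doc->dispose;    # break the circular references XML::DOM creates
```

Each node knows its parent and its children, so the whole traversal needs nothing beyond getChildNodes and getParentNode.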
Reading
    use XML::DOM;

    my $file   = 'files/camelids.xml';
    my $parser = XML::DOM::Parser->new();
    my $doc    = $parser->parsefile($file);

    foreach my $species ($doc->getElementsByTagName('species')) {
        # Text of the first <common-name> child
        print $species->getElementsByTagName('common-name')->item(0)
                      ->getFirstChild->getNodeValue;
        print ' (' . $species->getAttribute('name') . ') ';
        # Value of the status attribute on the first <conservation> child
        print $species->getElementsByTagName('conservation')->item(0)
                      ->getAttribute('status');
        print "\n";
    }
Writing
    use XML::DOM;

    require "files/camelid_links.pl";
    my %camelid_links = get_camelid_data();

    my $doc    = XML::DOM::Document->new;
    my $xml_pi = $doc->createXMLDecl('1.0');
    my $root   = $doc->createElement('html');
    my $body   = $doc->createElement('body');
    $root->appendChild($body);

    foreach my $item ( keys (%camelid_links) ) {
        # One <a href="..."> element per entry in the hash
        my $link = $doc->createElement('a');
        $link->setAttribute('href', $camelid_links{$item}->{url});
        my $text = $doc->createTextNode($camelid_links{$item}->{description});
        $link->appendChild($text);
        $body->appendChild($link);
    }

    print $xml_pi->toString;
    print $root->toString;
XPath (XML::XPath)
Originally developed to provide a node-matching syntax for Extensible Stylesheet Language Transformations (XSLT) and, later, for XPointer projects, the XPath language provides an interface to an XML document's contents using a compact set of expressions and functions that, like the DOM, treats the data as a tree of nodes. XPath differs significantly from the DOM in that it allows developers fine-grained access to a document's contents based on both the structural relationships between nodes (paths) and the properties of those nodes (expression evaluation). For example, in XPath syntax you can say, "give me all the div elements that have a background attribute with the value of blue" by writing //div[@background="blue"].
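As a quick illustration of that expression in action, the sketch below runs it against a small inline document with XML::XPath. The document and the helper name matching_text are assumptions made up for this example.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical inline document, invented for this illustration.
my $html = <<'XML';
<html><body>
  <div background="blue">first</div>
  <div background="red">second</div>
  <section><div background="blue">third</div></section>
</body></html>
XML

# Return the text content of every node matched by $expr.
sub matching_text {
    my ($xml, $expr) = @_;
    my $xp = XML::XPath->new(xml => $xml);
    return map { $_->string_value } $xp->findnodes($expr);
}

print "$_\n" for matching_text($html, '//div[@background="blue"]');
```

The leading // matches div elements at any depth, so the nested div is selected along with the top-level one, while the predicate in square brackets filters out the red one.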
Reading
    use XML::XPath;

    my $file = 'files/camelids.xml';
    my $xp   = XML::XPath->new(filename => $file);

    foreach my $species ($xp->find('//species')->get_nodelist) {
        print $species->find('common-name')->string_value;
        print ' (' . $species->find('@name') . ') ';
        print $species->find('conservation/@status');
        print "\n";
    }
Writing
    use XML::XPath;

    require "files/camelid_links.pl";
    my %camelid_links = get_camelid_data();

    my $xp     = XML::XPath->new();
    my $xml_pi = XML::XPath::Node::PI->new('xml', 'version="1.0"');
    my $root   = XML::XPath::Node::Element->new('html');
    my $body   = XML::XPath::Node::Element->new('body');
    $root->appendChild($body);

    foreach my $item ( keys (%camelid_links) ) {
        my $link = XML::XPath::Node::Element->new('a');
        my $href = XML::XPath::Node::Attribute->new('href', $camelid_links{$item}->{url});
        $link->appendAttribute($href);
        my $text = XML::XPath::Node::Text->new($camelid_links{$item}->{description});
        $link->appendChild($text);
        $body->appendChild($link);
    }

    print $xml_pi->toString;
    print $root->toString;
SAX 1 (XML::Parser::PerlSAX)
The SAX, or Simple API for XML, interface provides access to XML data using an event model in which the contents of an XML document are made available through callback subroutines, which it calls handlers. In contrast to the DOM and XPath APIs, the SAX interface does not build an internal representation of the entire XML document. Instead, data is passed to the handlers in response to the various events (the beginning of an element, the end of an element, etc.) that occur as the document is parsed. This makes SAX extremely fast and memory efficient, but it leaves the task of defining node relationships entirely up to the developer.
Reading
    use XML::Parser::PerlSAX;

    my $file    = "files/camelids.xml";
    my $handler = CamelHandler->new();
    my $parser  = XML::Parser::PerlSAX->new(Handler => $handler);

    $parser->parse(Source => { SystemId => $file });

    package CamelHandler;
    use strict;

    sub new {
        my $type = shift;
        return bless {}, $type;
    }

    # State shared between the callbacks below
    my $current_element = '';
    my $latin_name      = '';
    my $common_name     = '';

    sub start_element {
        my ($self, $element) = @_;
        $current_element = $element->{Name};

        if ($current_element eq 'species') {
            $latin_name = $element->{Attributes}->{'name'};
        }
        elsif ($current_element eq 'conservation') {
            print $common_name . ' (' . $latin_name . ') '
                . $element->{Attributes}->{'status'} . "\n";
        }
    }

    sub end_element {
        my ($self, $element) = @_;
        # SAX 1 events carry the element name in Name (LocalName is SAX 2)
        if ($element->{Name} eq 'species') {
            $common_name = undef;
            $latin_name  = undef;
        }
    }

    sub characters {
        my ($self, $characters) = @_;
        my $text = $characters->{Data};
        $text =~ s/^\s*//;
        $text =~ s/\s*$//;
        return '' unless $text;

        if ($current_element eq 'common-name') {
            $common_name = $text;
        }
    }

    1;
Writing
Unlike DOM and XPath, SAX offers no in-memory representation of an XML document and, consequently, has no API facilities for directly creating such a representation. However, there is theoretically no limit to the logic that can be embedded in the various event handlers, so creating one or more XML documents based on the SAX events generated by another is quite common.
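A rough sketch of that idea follows: a handler that builds a small list document from the species events of another document. The handler name LinkWriter, the helper species_list, and the inline input are all hypothetical; the output is assembled by plain string concatenation, which is only one of several ways to do it.

```perl
use strict;
use warnings;
use XML::Parser::PerlSAX;

package LinkWriter;    # hypothetical handler name

sub new {
    my $class = shift;
    return bless { out => '' }, $class;
}

# Escape the characters that are significant in XML text.
sub escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Open the output document when parsing starts.
sub start_document {
    my ($self) = @_;
    $self->{out} = "<ul>\n";
}

# Emit one <li> for every <species> element in the input.
sub start_element {
    my ($self, $element) = @_;
    return unless $element->{Name} eq 'species';
    $self->{out} .= '  <li>' . escape($element->{Attributes}{name}) . "</li>\n";
}

# Close the output document when parsing ends.
sub end_document {
    my ($self) = @_;
    $self->{out} .= "</ul>\n";
}

package main;

# Parse $xml and return the document the handler produced.
sub species_list {
    my ($xml) = @_;
    my $handler = LinkWriter->new;
    my $parser  = XML::Parser::PerlSAX->new(Handler => $handler);
    $parser->parse(Source => { String => $xml });
    return $handler->{out};
}

print species_list(
    '<camelids><species name="Lama glama"/>'
  . '<species name="Vicugna vicugna"/></camelids>'
);
```

Because the handler never holds more than the current event and the output string, the same pattern scales to input documents far larger than would fit comfortably in a DOM tree.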
SAX 2 (Orchard::SAXDriver::Expat)
The most important difference between the SAX 1 and SAX 2 APIs is SAX 2's support for XML namespaces. A complete SAX 2 implementation is available as part of Ken MacLeod's Orchard project. Since a sample for Orchard::SAXDriver::Expat would look largely the same as the previous SAX 1 example, I omit it here. However, if you are curious, you can browse orchard_saxdriver_read.pl in this month's sample code.
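To give a feel for what namespace support adds to the events themselves, here is a small sketch. Note that it does not use Orchard: I have swapped in the XML::SAX distribution from CPAN as a stand-in SAX 2 implementation, and the handler name NamespaceReporter, the helper namespaced_names, and the example namespace URI are all assumptions of mine.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

package NamespaceReporter;    # hypothetical handler name
use base 'XML::SAX::Base';

# Record the qualified name alongside its namespace URI and local name.
sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{seen} }, sprintf '%s => {%s}%s',
        $element->{Name},
        $element->{NamespaceURI} // '',
        $element->{LocalName};
}

package main;

# Parse $xml and return one description per start-element event.
sub namespaced_names {
    my ($xml) = @_;
    my $handler = NamespaceReporter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string($xml);
    return @{ $handler->{seen} || [] };
}

print "$_\n" for namespaced_names(
    '<zoo:camelids xmlns:zoo="http://example.org/zoo">'
  . '<zoo:species/></zoo:camelids>'
);
```

Where a SAX 1 handler sees only the literal name zoo:species, a SAX 2 handler also receives the namespace URI and the local name as separate fields, so it can match elements reliably regardless of which prefix the document author chose.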
Familiarity with the standard XML APIs, their strengths and weaknesses relative to a given task, is key to a mature understanding of XML technology. Much has been written about the interfaces covered here, and I strongly encourage you to follow the links in this month's "Resources" section for more information.
Up to this point each module we've looked at shares the common goal of providing a generic interface to the contents of any well-formed XML document. Next month we will depart from this pattern a bit by exploring some of the modules that, while perhaps less generically useful, seek to simplify the execution of some specific XML-related task.