Using XML::Twig
March 21, 2001
If your problem is finding a fast, memory-efficient way to handle large XML documents, but the needs of your application make using the SAX interface overly complex, the solution is to use XML::Twig.
Why XML::Twig?
If you've been working with XML for a while, it's often tempting to frame solutions to new problems in the context of the tools you've used successfully in the past. In other words, if you are most familiar with the DOM interface, you're likely to approach new challenges from a more-or-less DOMish perspective. While there's plenty to be said for doing what you know will work, experience shows that there is no one right way to process XML. With this in mind, Michel Rodriguez's XML::Twig embodies Perl's penchant for borrowing the best features of the tools that have come before. XML::Twig combines the efficiency and small footprint of SAX processing with the power of XPath's node selection syntax, and it adds a few clever tricks of its own.
Understanding Twigs
To use XML::Twig successfully, you've got to realize that XML document trees are typically composed of smaller tree-like structures, which are called twigs. Consider the following simplified representation of an XHTML document tree:
              html
             /    \
         head      body
        /    \        \
    script  title     div
                     /   \
                   h1     p
We see that head and body elements are branches (or twigs) connected to the root html element, and, in turn, those elements contain smaller tree-like structures (the script, title and div elements), and so on. XML::Twig lets us operate on all or part of the document tree by accessing the individual twigs themselves. We can operate on twigs by using a subset of the XPath syntax to select only those structures that are relevant to the task at hand. This ability to pick and choose some of the twigs of the larger tree, while passing over the rest, gives XML::Twig its power, speed, and flexibility.
TwigRoots
Passed to XML::Twig's object constructor, the TwigRoots argument accepts a single hash reference, the keys of which are XPath-like expressions that define the elements in the input document one wants to include in the output tree. If one or more TwigRoots are defined, only those elements defined as roots will be included in the result tree.
Let's say, for example, that we need to create a table of contents for the XML version of one of the books available through Project Gutenberg. These electronic books are often quite large, but our table of contents need only include the title of the book and the titles of the various chapters. Fortunately, this is just the sort of task that the TwigRoots option was designed to handle.
First let's look at a simplified excerpt from Homer's Iliad:
    <gutbook>
      ...
      <book>
        <frontmatter>
          <titlepage>
            <title>THE ILIAD</title>
            <author>HOMER</author>
          </titlepage>
        </frontmatter>
        <bookbody>
          <chapter>
            <title>BOOK I</title>
            <para>
              Sing, O goddess, the anger of Achilles son of Peleus, that
              brought countless ills upon the Achaeans. ...
            </para>
            ...
          </chapter>
          ...
        </bookbody>
      </book>
    </gutbook>
To capture all of the <title> elements contained in the document we need only define a single TwigRoot, passing it the expression 'title' as the key.
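    use XML::Twig;

    my $file = $ARGV[0];

    # Keep only <title> elements (and the document root) in the result tree.
    my $twig = XML::Twig->new(TwigRoots => { title => 1 });

    $twig->parsefile($file);   # parse the document, building only the selected twigs
    $twig->print;              # write the pruned tree to STDOUT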
After processing, the output looks like this:
    <gutbook>
      <title>THE ILIAD</title>
      <title>BOOK I</title>
      <title>BOOK II</title>
      <title>BOOK III</title>
      ...
    </gutbook>
This is not a very descriptive table of contents, but it illustrates how TwigRoots allows us to capture only the elements we need in the output tree.
Remember that the expressions that define the TwigRoots are XPath-like, so, for example, if we wanted to build our table of contents from only those <title> elements with a <chapter> element as a parent, we would change the key in our TwigRoots hash to
    TwigRoots => {'chapter/title' => 1}
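To try the 'chapter/title' expression without downloading a whole book, the following is a minimal sketch that parses a small inline document. The sample markup and the @chapter_titles variable are invented for illustration; the sketch also assumes, as the output shown earlier suggests, that the selected twigs end up as direct children of the original root element.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature of the Gutenberg markup shown earlier.
my $xml = <<'XML';
<gutbook>
  <book>
    <frontmatter>
      <titlepage><title>THE ILIAD</title><author>HOMER</author></titlepage>
    </frontmatter>
    <bookbody>
      <chapter><title>BOOK I</title><para>Sing, O goddess ...</para></chapter>
      <chapter><title>BOOK II</title><para>...</para></chapter>
    </bookbody>
  </book>
</gutbook>
XML

# Only <title> elements whose parent is <chapter> are kept as twigs.
my $twig = XML::Twig->new(TwigRoots => { 'chapter/title' => 1 });
$twig->parse($xml);    # parse() takes a string, parsefile() a file name

# Assumption: the kept twigs hang directly off the original root element.
my @chapter_titles = map { $_->text } $twig->root->children('title');
print "$_\n" for @chapter_titles;
```

Because the book's own title sits under <titlepage> rather than <chapter>, it should not appear in @chapter_titles.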
TwigHandlers
In the same way that TwigRoots allows us to prune the output tree to include only those structures that we care about, TwigHandlers allow us to operate on specific subtrees within the document, while leaving the rest of the tree untouched. We achieve this by binding callbacks (subroutine handlers) to the expressions that define the twigs themselves.
Returning to our table of contents script, let's set two callbacks, one for each type of <title> element, that add a descriptive attribute to the element:
    my $twig_handlers = {
        'titlepage/title' => \&book_title,
        'chapter/title'   => \&chapter_title,
    };

    my $twig = XML::Twig->new(
        TwigRoots    => { title => 1 },
        TwigHandlers => $twig_handlers,
    );
    $twig->parsefile($file);
    $twig->print;

    # Called for the book's own title.
    sub book_title {
        my ($twig, $title) = @_;
        $title->set_att('type', 'book');
    }

    # Called for each chapter title.
    sub chapter_title {
        my ($twig, $title) = @_;
        $title->set_att('type', 'chapter');
    }
With this addition, our output will now look something like
    <gutbook>
      <title type="book">THE ILIAD</title>
      <title type="chapter">BOOK I</title>
      <title type="chapter">BOOK II</title>
      ...
    </gutbook>
The entire contents of the twigs are processed before passing them along to the callbacks, so any child elements they may contain (branches within the twig) are also available. So, if we had chosen to define a handler for the <chapter> elements, rather than those matching the path "chapter/title", we could access the chapter's title with
    sub chapter_handler {
        my ($twig_obj, $chapter_element) = @_;
        my $title_element = $chapter_element->first_child('title');
        ...
    }
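As a rough, self-contained illustration of a handler bound to the whole <chapter> twig, the sketch below collects each chapter's title together with a count of its paragraphs. The inline sample document and the @summary variable are hypothetical and exist only to make the example runnable.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input, not taken from a real e-book.
my $xml = '<bookbody>'
        . '<chapter><title>BOOK I</title><para>a</para><para>b</para></chapter>'
        . '<chapter><title>BOOK II</title><para>c</para></chapter>'
        . '</bookbody>';

my @summary;    # one "title: paragraph count" string per chapter

sub chapter_handler {
    my ($twig_obj, $chapter_element) = @_;
    # The whole chapter has been parsed by now, so its children are available.
    my $title_element = $chapter_element->first_child('title');
    my @paras         = $chapter_element->children('para');
    push @summary, sprintf '%s: %d', $title_element->text, scalar @paras;
}

XML::Twig->new(TwigHandlers => { chapter => \&chapter_handler })->parse($xml);
print "$_\n" for @summary;
```

Binding the handler to the chapter rather than to 'chapter/title' trades a slightly larger twig for access to everything the chapter contains.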
Other Handlers and Methods
In addition to TwigHandlers, XML::Twig allows you to set callbacks for handling DTD events, SAX-style (start_element, character, end_element) events, and a host of others. Each element within a twig has a wide range of possible methods available to help make the task of processing as easy and flexible as possible. Unfortunately, space does not permit me to cover these in detail. I encourage you to run perldoc XML::Twig for the complete list of possible handlers and element methods.
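As one small taste of those other callbacks, here is a sketch using the start_tag_handlers option, which, as documented in the module's perldoc, fires as soon as an element's start tag has been read, before its content is available. The option is written in the lowercase_with_underscores spelling that the documentation accepts alongside the CamelCase form used in this article; the sample markup and the @seen variable are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical XHTML fragment with named anchors.
my $xhtml = '<body><h2>BOOK I</h2>'
          . '<p><a name="1.1"/>Sing, O goddess</p>'
          . '<p><a name="1.2"/>...</p></body>';

my @seen;    # anchor names, in document order

my $twig = XML::Twig->new(
    start_tag_handlers => {
        # Attributes are already set when the start tag handler runs,
        # even though the element's content has not been parsed yet.
        a => sub {
            my ($t, $elt) = @_;
            push @seen, $elt->att('name');
        },
    },
);
$twig->parse($xhtml);
print "$_\n" for @seen;
```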
Putting It Together
For our final example, let's use what we've learned so far to build a simple command line tool that will allow us to perform keyword searches on the contents of an e-book. This script presumes that you have already processed the book using the gut2xhtml.pl script, available with this month's sample code, that translates the Gutenberg XML files to simple XHTML and adds named anchors for each chapter and paragraph.
    use XML::Twig;
    use HTML::Entities;

    my ($match_word, $file) = @ARGV;
    my ($current_chapter, $last_chapter, $global_match);

    my $twig = XML::Twig->new(
        TwigHandlers => {
            'p'  => \&paragraph,
            'h2' => \&chapter_heading,
        },
        TwigRoots => { body => 1 },
    );

    $twig->parsefile($file);   # build the twig
    $twig->print;

    warn "Sorry, no matches found for '$match_word'\n" unless $global_match;
So far our search script is similar to the previous examples. We have initialized a few variables and created a new XML::Twig object, setting the <body> element as the sole TwigRoot. We have also set TwigHandlers for all <p> and <h2> elements in the document. Let's move on to the TwigHandler callbacks.
    sub paragraph {
        my ($twig, $para) = @_;
        my $para_text = $para->text;
        $para_text =~ s/\n/ /g;   # flatten the paragraph to a single line

        if ($para_text =~ /\b(.{0,30}\b$match_word.{0,30}\b)/is) {
            my $snippet = $1;
            $snippet = decode_entities($snippet);
            $global_match++;
Here we've copied the paragraph's text into the $para_text variable, then checked to see if $para_text contains the word or phrase that the user passed from the command line. If we have a match, we extract a small snippet of the paragraph text (30 characters to the left and right of the match, if they exist) and increment our global match counter.
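The snippet-extraction step can be tried on its own, outside of any XML processing. The sketch below wraps the same pattern in a hypothetical helper called snippet_for. One deliberate difference from the script is the use of \Q...\E, which makes the search phrase match literally; this is an assumption about what most users would want, whereas the script itself interpolates the phrase as a regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to 30 characters of context on either
# side of $word within $text, or undef when $word does not occur.
sub snippet_for {
    my ($text, $word) = @_;
    (my $flat = $text) =~ s/\n/ /g;    # flatten to one line, as in the script
    # /i ignores case; \Q...\E treats the phrase literally (a change from
    # the script, which lets regex metacharacters in the phrase take effect).
    return $flat =~ /\b(.{0,30}\b\Q$word\E.{0,30}\b)/is ? $1 : undef;
}

my $text = 'Then Antiphus of the gleaming corslet, son of Priam, '
         . 'hurled a spear at Ajax from amid the crowd';
my $snippet = snippet_for($text, 'son of priam');
print " - ...$snippet...\n" if defined $snippet;
```

The \b anchors keep the snippet from starting or ending in the middle of a word, which is why the context on each side may be shorter than 30 characters.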
            my $anchor   = $para->first_child;
            my $para_ref = $anchor->att('name');

            my $link = XML::Twig::Elt->new('a');
            $link->set_att('href', $file . '#' . $para_ref);
            $link->set_text($para_ref);

            $para->set_text(" - ...$snippet...");
            $link->paste('first_child', $para);
Now we've retrieved the value of the paragraph's named anchor attribute and created a new HTML hyperlink element (<a>); we've added an 'href' attribute that points to the paragraph's location in the original XHTML document and set the text of the link to the same value. This link lets users jump directly to the matching paragraph in the original document if they want to view the match in a broader context.
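The element-building calls used here can be exercised in isolation. The following sketch applies them to a single hypothetical paragraph parsed from a string; the file name iliad.html and the placeholder snippet text are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<p><a name="4.36"/>text of the paragraph</p>');

my $para     = $twig->root;
my $para_ref = $para->first_child('a')->att('name');

# Build a brand-new <a> element that is not yet attached to any tree.
my $link = XML::Twig::Elt->new('a');
$link->set_att(href => "iliad.html#$para_ref");    # hypothetical file name
$link->set_text($para_ref);

# set_text replaces the paragraph's existing children (including the
# original named anchor) with a single text node ...
$para->set_text(' - ...snippet...');
# ... and paste then inserts the link in front of that text.
$link->paste(first_child => $para);

my $out = $para->sprint;
print "$out\n";
```

Note the order of operations: the anchor's name must be read before set_text is called, since set_text discards the paragraph's original children.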
            if ((!$last_chapter) || ($last_chapter ne $current_chapter)) {
                my $header = XML::Twig::Elt->new('h2');
                $header->set_text($current_chapter);
                $header->paste('first_child', $para);
            }
            $last_chapter = $current_chapter;
Here we've simply checked whether or not our current match is within the previous chapter; if not, we add a new <h2> heading to keep the result visually organized.
        } else {
            $para->delete;
        }
    }
The last part of the paragraph handler deletes the twig from the result tree if the paragraph didn't contain a match for the specified keyword. This ensures that only those paragraphs containing a match will make it into the final output.
    sub chapter_heading {
        my ($twig, $chapter_heading) = @_;
        $current_chapter = $chapter_heading->text;
        $chapter_heading->delete;
    }
And now we've created a handler for the original document's chapter headings. Here we need only set the global $current_chapter variable for use in the paragraph handler and delete the element from the output.
Saving this script as xhtml_search.pl, let's use it to search our XHTML version of The Iliad for all references to the sons of the Trojan king, Priam.
    $ perl xhtml_search.pl 'son of priam' /home/books/illiad.html
    <html>
    <body>
    <p>
    <h2>BOOK IV</h2>
    <a href="/home/books/illiad.html#4.36">4.36</a> - ... of the gleaming corslet, son of Priam, hurled a spear at Ajax from ...</p>
    <p>
    <h2>BOOK V</h2>
    <a href="/home/books/illiad.html#5.52">5.52</a> - ... besought him, saying, "Son of Priam, let me not be here to fall ... </p>
    ...
    </body>
    </html>
Conclusions
XML::Twig is an excellent example of thinking Perlishly about XML. Developers familiar with the DOM, SAX, or XPath interfaces may struggle a bit with some of XML::Twig's naming conventions, but the power it provides, combined with the ways in which it simplifies tasks that would be troublesome using one of the standard APIs, makes Twig a strong addition to any Perl-XML developer's bag of tricks. If you're intrigued by this short tutorial, I suggest a visit to Michel Rodriguez's xmltwig.com for more information.