Using The Perl XML::Parser Module
September 12, 1998
XML::Parser is a Perl module which acts as an interface to expat, James Clark's XML parser. A prototype was originally created by Larry Wall, and Clark Cooper has continued the development of this useful tool. In this article Clark presents two Perl programs which demonstrate some of XML::Parser's capabilities.
Most Perl applications in need of an XML parser will likely fall into one of two types. The first type of application will process specific applications of XML, for example RDF or MathML. For these, a subclass of XML::parser will need to be written in order to provide a tool conceptually closer to the job at hand. The second type of application will operate on any conforming XML document in order to find or filter out pieces of the document, or to discover things about its structure. This article will discuss two examples of the second type of application, utilities that do useful things with generic XML documents.
Overview of XML::Parser
First, let's go over the current XML::Parser interface. Like James Clark's expat library, upon which it's built, XML::Parser is an event-based parser. Prior to parsing the document, an application registers various event handlers with the parser. Then, as the document is parsed, the handlers are called when the relevant parts are recognized.
Most utilities need only register 3 handlers: start, end, and character handlers. The start handler is called when an XML start tag is recognized; the end handler is called on recognition of an end tag; and the character handler is called for non-markup content inside an element. The first example below uses a default handler. I'll explain it during the discussion of the example.
xmlcomments
The xmlcomments utility prints out all the comments in a given document with the line numbers on which the comment started. At conclusion, it prints out the total number of comments found.
The main part of the program, after checking for the existence of the file given as the first argument, creates the parser object with the ErrorContext option set to 2. This requests that errors in the document be reported with 2 lines of context on either side of an occurrence of an error. Two handlers are registered, the character handler, and the default handler. Then the file is parsed. All the action is in the default_handler function.
#!/usr/local/bin/perl -w use XML::Parser; my $file = shift; die "Can't find file \"$file\"" unless -f $file; my $count = 0; my $parser = new XML::Parser(ErrorContext => 2); $parser->setHandlers(Char => \&char_handler, Default => \&default_handler); $parser->parsefile($file); print "Found $count comments.\n"; ################ ## End of main ################
A registered default handler is called when the parser recognizes a portion of the document for which no handler has been registered (excepting start and end tags). You can't currently register handlers for things like comments and markup declarations. But a registered default handler will be called when these things are recognized. The default handler is also called (other than start and end tags) when there is no other handler registered for the particular event.
sub char_handler { # This is just here to reduce the noise seen by # the default handler } # End of char_handler
We're going to find comments by looking for things that are sent to the default handler beginning with "<!--". This isn't reliable if we're also seeing character data. After all, somebody could have a cdata section that begins like that. So to make sure that character data doesn't get sent to the default handler, we register an empty character handler.
sub default_handler { my ($p, $data) = @_; if ($data =~ /^<!--/) { my $line = $p->current_line; $data =~ s/\n/\n\t/g; print "$line:\t$data\n"; $count++; } } # End of default_handler
In the default handler, when we get data that looks like the beginning of a comment, we get the current line number, replace newlines with a newline followed by a tab. We then print the comment along with the line number and increment the global comment count.
My first cut at writing this example was more complicated, since I didn't know whether or not comments were always delivered with a single call to the handler. After I ran some experiments and looked at the expat code, I found out they were. If expat ever broke up a comment into multiple calls to the handler, we would have had to check whether or not the comment ended in the current call; then we'd have to set a flag indicating that we're inside an open comment; and whether we were looking for the beginning or the end of a comment would depend on the flag.
xmlstats
The second example program, xmlstats, prints out statistics about the structure of an XML document. For each type of element seen in the document, it prints out:
- the number times the element occurred
- the number of times it had a particular element as a parent
- the number of times it had a particular element as a child
- the number of times it had a particular attribute
- the amount of character data that had at least some non-whitespace characters
- whether the element was always empty
The order of the listing is top down, so no element should be listed until at least one of its parents has been listed.
The initial part of the program deals with establishing a lightweight object to hold element information. There will be one of these Elinfo objects created for each element type.
#!/usr/local/bin/perl -w package Elinfo; sub new { my $class = shift; my $self = [0, undef, 0, 0, 1, {}, {}, {}]; bless $self, $class; } package main; # These should be above. But I can't seem to import # them reliably without # Elinfo being in a separate file. sub COUNT () {0;} sub MINLEV () {1;} sub SEEN () {2;} sub CHARS () {3;} sub EMPTY () {4;} sub PTAB () {5;} sub KTAB () {6;} sub ATAB () {7;}
After declaring and setting some variables we'll need later, the main part of the program starts out very similar to our last example. We create a parser object and set some handlers. This time, though, the start handler does most of the heavy lifting and the character handler actually does a little bit of work.
use English; use XML::Parser; my %elements; my $seen = 0; my $root; my $file = shift; my $subform = ' @<<<<<<<<<<<<<<< @>>>>'; die "Can't find file \"$file\"" unless -f $file; my $parser = new XML::Parser(ErrorContext => 2); $parser->setHandlers(Start => \&start_handler, Char => \&char_handler); $parser->parsefile($file);
However, after the parse, there's some work to do stepping through the objects that were created. Let's take a look at the handlers first so that we can see how the objects are generated.
sub start_handler { my $p = shift; my $el = shift; my $elinf = $elements{$el}; if (not defined($elinf)) { $elements{$el} = $elinf = new Elinfo; $elinf->[SEEN] = $seen++; } $elinf->[COUNT]++; my $partab = $elinf->[PTAB]; my $parent = $p->current_element; if (defined($parent)) { $partab->{$parent}++; my $pinf = $elements{$parent}; # Increment our slot in parent's child table $pinf->[KTAB]->{$el}++; $pinf->[EMPTY] = 0; } else { $root = $el; } # Deal with attributes my $atab = $elinf->[ATAB]; while (@_) { my $att = shift; $atab->{$att}++; shift; # Throw away value } } # End start_handler
The first 2 parameters passed to a start handler are the parser object and the name of the element just started. After shifting these two things off the list, we lookup what information we've got for that element in the %elements hashtable. If there's no object there, we create one, and set the SEEN field for it. The $seen variable gets incremented for each new element type we see. We'll use the SEEN field later for sorting the output.
The current_element method returns the name of the enclosing element. If there is no enclosing element, then the undefined value is returned. This can only happen for the root element. Otherwise the name we get is the name of a parent element for this element.
There are three anonymous hashes created when we instantiate an Elinfo object, one each for parents, children, and attributes. When we find a parent, we can increment its slot in our parent table and increment our slot in the parent's child table. Also if this element is contained in some other element, then that element can't be empty.
Finally, we deal with the remaining parameters, which are the attributes to this element passed along as name and value pairs. We shift off the name into the $att variable, which we use to update our attribute table, then we throw away the attribute value. This is done until the parameter list is empty.
sub char_handler { my ($p, $data) = @_; my $inf = $elements{$p->current_element}; $inf->[EMPTY] = 0; if ($data =~ /\S/) { $inf->[CHARS] += length($data); } } # End char_handler
The character handler has two tasks: set the empty boolean to false if we see any content at all for that element, and increment the byte count if the given data we're seeing has any non-whitespace characters. This way of counting content bytes is somewhat bogus. A better way would have been to keep a count of all bytes and a flag that indicates whether or not any non-whitespace has been seen. Then an end handler could summarize for that instance of the element.
Now that we've seen how the element information is generated during the parse, let's go back to the rest of the main program to see how this information is processed.
set_minlev($root, 0); my $el; foreach $el (sort bystruct keys %elements) { my $ref = $elements{$el}; print "\n================\n$el: ", $ref->[COUNT], "\n"; print "Had ", $ref->[CHARS], " bytes of character data\n" if $ref->[CHARS]; print "Always empty\n" if $ref->[EMPTY]; showtab('Parents', $ref->[PTAB], 0); showtab('Children', $ref->[KTAB], 1); showtab('Attributes', $ref->[ATAB], 0); }
The only hard thing here is to establish the order in which we print out the elements. To do this we need to set the MINLEV for each element. This number is the minimum level from the root element that an element occurred. So the root element should have a minlev of zero. The set_minlev subroutine is a recursive subroutine that sets MINLEV for all the elements.
sub set_minlev { my ($el, $lev) = @_; my $elinfo = $elements{$el}; if (! defined($elinfo->[MINLEV]) or $elinfo->[MINLEV] > $lev) { my $newlev = $lev + 1; $elinfo->[MINLEV] = $lev; foreach (keys %{$elinfo->[KTAB]}) { set_minlev($_, $newlev); } } } # End set_minlev
Note that element type structure is not constrained to be a tree. To make sure that we don't get caught in a cycle, we only update MINLEV if it hasn't been previously set or if we've got a lesser level. Whenever an element's MINLEV is updated, all its children's MINLEVs need to be updated too.
Once we've set the MINLEV field for all the element types, we can get an ordered list of elements. The bystruct comparison function is used to sort the elements. We want to sort first by MINLEV and then, for element types that have an equal MINLEV, by order seen.
sub bystruct { my $refa = $elements{$a}; my $refb = $elements{$b}; $refa->[MINLEV] <=> $refb->[MINLEV] or $refa->[SEEN] <=> $refb->[SEEN]; } # End bystruct
Once we've got the element type in proper order, we process them one by one, printing out a divider, the element name and count. Then if there was any character data, the number of bytes is printed. If every occurrence of the element type was empty, then that is printed out. Finally, we print out parent, child, and attribute tables using the showtab subroutine.
sub showtab { my ($title, $table, $dosum) = @_; my @list = sort keys %{$table}; if (@list) { print "\n $title:\n"; my $item; my $sum = 0; foreach $item (@list) { my $cnt = $table->{$item}; $sum += $cnt; formline($subform, $item, $cnt); print $ACCUMULATOR, "\n"; $ACCUMULATOR = ''; } if ($dosum and @list > 1) { print " =====\n"; formline($subform, '', $sum); print $ACCUMULATOR, "\n"; $ACCUMULATOR = ''; } } } # End showtab
This subroutine takes 3 parameters, the title to print out, the table to print, and a boolean value that indicates whether we should print out the total of the table count. It makes use of Perl's built-in formline function which formats strings into the $ACCUMULATOR string.
To see how all this works, when xmlstats is applied to this document:
<foo> <bar> This is a test </bar> <bar> <alpha><bar>Surprise</bar> <junk>Here we go</junk> </alpha> <ref id="me" xxx="there"/> </bar></foo>
we get this output:
================ foo: 1 Children: bar 2 ================ bar: 3 Had 22 bytes of character data Parents: alpha 1 foo 2 Children: alpha 1 ref 1 ===== 2 ================ alpha: 1 Parents: bar 1 Children: bar 1 junk 1 ===== 2 ================ ref: 1 Always empty Parents: bar 1 Attributes: id 1 xxx 1 ================ junk: 1 Had 10 bytes of character data Parents: alpha 1
Summary
These two examples demonstrate how easy it is to put together useful XML processing utilities when using Perl and XML::Parser. XML::Parser is still under development and the API changes as ideas occur to me or as I get feedback from the perl-xml mailing list. When it stabilizes, I intend to post the module to CPAN.