Perl XML Quickstart: Convenience Modules
June 13, 2001
Introduction
This is the third and final part of a series of articles meant to give quick introductions to some of the more popular Perl XML modules. In the last two months we have looked at the modules that implement the standard XML APIs and those that provide more Perlish XML interfaces. This month we will be looking at some of the modules that seek to simplify a specific XML-related task.
Keeping It ( Real | Simple )
Getting started with XML processing in Perl can be a daunting task. A quick CPAN search reveals 77 XML-related distributions containing more than 200 modules. How do you know which one to choose? Selecting a module based on its ability to cover common use cases (usually a useful guide) seems a bit absurd in light of the fact that XML is being used successfully for everything from storing simple configuration data to enabling communications between complex AI systems. If you find yourself wondering where to begin, don't despair. Despite the apparent complexities, you do not need expert knowledge of Perl or XML to put their combined power to work for you.
Unless XML is a significant part of your daily life, chances are good that the more generic XML API modules will seem like overkill. Perhaps they are. If your needs are modest, a module probably exists that will reduce your task to a few method calls. These single-purpose convenience modules are a key entry point to the Perl/XML world, and I have chosen a few of the more popular ones for this month's code samples. In the interest of clarity, we will limit the scope of the examples to the common tasks of creating XML documents from other data sources, converting HTML to XHTML, and comparing the contents of two XML documents.
Creating XML From Other Data Sources
While many of the XML API modules provide a way to create XML documents programmatically based on data from any source, several modules exist that simplify the task of creating XML documents from data stored in other common formats. We'll illustrate how to create XML documents based on data extracted from CSV (Comma Separated Value) files, Excel spreadsheets, and relational databases.
Comma Separated Value - XML::CSV
Arguably the easiest structured data format to use and understand, CSV continues to be very popular. Ilya Sterin's XML::CSV offers a simple way to create XML documents from CSV files.
use XML::CSV;

my $file    = 'addresses.csv';
my @columns = ('first-name', 'last-name', 'email');

my $csv = XML::CSV->new({column_headings => \@columns});
$csv->parse_doc($file);
$csv->declare_xml({version => '1.0', standalone => 'yes'});
$csv->print_xml('address.xml',
                {file_tag => 'address-book', parent_tag => 'entry'});
Running this script produces the following XML document:
<?xml version="1.0"?>
<address-book>
  <entry>
    <first-name>Lister</first-name>
    <last-name>David</last-name>
    <email>curryboy@dwarf.spc</email>
  </entry>
  <entry>
    <first-name>Rimmer</first-name>
    <last-name>Arnold</last-name>
    <email>smeghead@dwarf.spc</email>
  </entry>
  ...
</address-book>
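To make the mapping from columns to elements concrete, here is a minimal sketch in core Perl of the kind of transformation involved. It is not how XML::CSV is implemented; the helper names csv_to_xml and xml_escape are made up for illustration, and the sketch assumes plain comma-separated lines with no quoted fields or embedded commas.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: turn simple CSV lines into an XML string.
# Assumes no quoted fields or embedded commas; real CSV needs a real parser.
sub csv_to_xml {
    my ($lines, $columns, $file_tag, $parent_tag) = @_;
    my $xml = qq{<?xml version="1.0"?>\n<$file_tag>\n};
    for my $line (@$lines) {
        chomp(my $copy = $line);
        next unless length $copy;              # skip blank lines
        my @fields = split /,/, $copy, -1;     # -1 keeps trailing empty fields
        $xml .= "  <$parent_tag>\n";
        for my $i (0 .. $#$columns) {
            my $tag   = $columns->[$i];
            my $value = defined $fields[$i] ? $fields[$i] : '';
            $xml .= "    <$tag>" . xml_escape($value) . "</$tag>\n";
        }
        $xml .= "  </$parent_tag>\n";
    }
    $xml .= "</$file_tag>\n";
    return $xml;
}

my @columns = ('first-name', 'last-name', 'email');
my @lines   = (
    "Lister,David,curryboy\@dwarf.spc\n",
    "Rimmer,Arnold,smeghead\@dwarf.spc\n",
);
print csv_to_xml(\@lines, \@columns, 'address-book', 'entry');
```

Each input line becomes one parent element, and each column heading becomes a child element holding the matching field. For real data, a module that understands CSV quoting rules is the safer choice.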
Excel Spreadsheets - XML::Excel
Identical to XML::CSV in terms of interface, XML::Excel provides the same functionality for those extracting data from Excel spreadsheets.
use XML::Excel;

my $file    = 'addresses.xls';
my @columns = ('first-name', 'last-name', 'email');

my $xls = XML::Excel->new({column_headings => \@columns});
$xls->parse_doc($file);
$xls->declare_xml({version => '1.0', standalone => 'yes'});
$xls->print_xml('address.xml',
                {file_tag => 'address-book', parent_tag => 'entry'});
The output from this script is identical to the output of the XML::CSV example above.
Relational Databases - DBIx::XML_RDB
The following script shows how to translate the contents of a simple MySQL table into an XML document using Matt Sergeant's DBIx::XML_RDB module:
use DBIx::XML_RDB;

# Connection details for the example database.
my $driver   = 'mysql';
my $hostname = 'localhost';
my $database = 'shipdata';
my $user     = 'root';
my $pass     = 's3cr37';
my $dsn      = "DBI:$driver:database=$database;host=$hostname";

my $dbx = DBIx::XML_RDB->new($dsn, 'mysql', $user, $pass);
$dbx->DoSql(qq{ select * from address_book});

# Write the query results out as an XML document.
open(XML, ">addresses.xml") || die "Could not open file for writing: $! \n";
print XML $dbx->GetData;
close XML;
Running this script against our mythical database yields the following:
<?xml version="1.0"?>
<DBI driver="DBI:mysql:database=shipdata;host=localhost">
  <RESULTSET statement=" select * from address_book">
    <ROW>
      <first-name>Cat</first-name>
      <last-name>The</last-name>
      <email>kewlguy@dwarf.spc</email>
    </ROW>
    <ROW>
      <first-name>Holly</first-name>
      <last-name></last-name>
      <email>root@localhost</email>
    </ROW>
    ...
  </RESULTSET>
</DBI>
XML/RDBMS integration is a very broad topic. Please see my earlier column, Using XML and Relational Databases with Perl, for a more complete discussion.
HTML Conversion
The first challenge that Web developers often face when considering the move to XML is how to convert their existing HTML documents to well-formed XHTML. As we saw in a previous column, publishing XML on the Web offers many benefits beyond the ability to separate content from design. The key point, though, is that the documents must be well-formed, and the cost of cleaning them up by hand may not always be offset by the benefits. Here, too, Perl offers several possible solutions, but we will focus on only two of the simpler ones.
XML::PYX
Based on concepts inherited from SGML, the PYX notation is a simple, line-oriented way of accessing the contents of markup documents. The XML::PYX distribution ships with command-line utilities that translate XML documents to and from PYX notation, and one that generates PYX from (possibly malformed) HTML.
Converting bad HTML into a stricter form that is palatable to XML parsers is as simple as typing the following at the command prompt:
$ pyxhtml dirty.html | pyxw > clean.html
Or consider the following snippet that processes a list of files and creates clean XHTML versions in the same directories with a .xhtml file extension:
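foreach my $in_file (@files) {
    # Skip anything that is not an .html file so the input is never overwritten.
    next unless $in_file =~ /\.html$/;
    (my $out_file = $in_file) =~ s/\.html$/.xhtml/;

    # Run the file through the PYX utilities and capture the cleaned markup.
    my $html = `pyxhtml $in_file | pyxw`;

    open(OUT, ">$out_file") || die "could not write to $out_file: $!\n";
    print OUT $html;
    close OUT;
}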
PYX's line-oriented interface also allows standard UNIX tools like grep to work more cleanly on XML documents. See Sean McGrath's introduction to Pyxie for more detail about PYX.
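The line-oriented nature of PYX is easiest to see with a small filter. In PYX, each line begins with a one-character type code: "(" opens an element, ")" closes it, "A" carries an attribute, and "-" carries character data. The sketch below uses core Perl only; the helper name pyx_text_of is made up for illustration, and the sample PYX is hand-written rather than captured from the XML::PYX utilities, whose real output would also include lines for the whitespace between elements.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the text found inside elements named $name
# from a list of PYX lines.
sub pyx_text_of {
    my ($name, @pyx_lines) = @_;
    my $depth = 0;
    my @found;
    for my $line (@pyx_lines) {
        chomp(my $copy = $line);
        next unless length $copy;
        my $type = substr $copy, 0, 1;    # the one-character PYX type code
        my $rest = substr $copy, 1;       # the element name or text that follows
        if ($type eq '(' && $rest eq $name) {
            $depth++;
            push @found, '' if $depth == 1;   # start a new result entry
        }
        elsif ($type eq ')' && $rest eq $name) {
            $depth--;
        }
        elsif ($type eq '-' && $depth > 0) {
            $found[-1] .= $rest;              # character data inside the element
        }
    }
    return @found;
}

# Hand-written PYX for a fragment of the address book used earlier.
my @pyx = split /\n/, <<'PYX';
(address-book
(entry
(first-name
-Lister
)first-name
(email
-curryboy@dwarf.spc
)email
)entry
(entry
(email
-smeghead@dwarf.spc
)email
)entry
)address-book
PYX

print "$_\n" for pyx_text_of('email', @pyx);
```

Because every event sits on its own line, the filter needs only string comparisons and no XML parser, which is the same property that makes grep and similar tools workable on PYX streams.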
XML::Driver::HTML
Michael Koehne's XML::Driver::HTML implements a SAX interface to malformed HTML. It too can be used to create XHTML documents from existing HTML by using XML::Handler::YAWriter as a SAX handler.
use IO::File;
use XML::Driver::HTML;
use XML::Handler::YAWriter;

# Write the cleaned-up document to STDOUT.
my $handler = XML::Handler::YAWriter->new(
    'Output' => IO::File->new(">-"),
    'Pretty' => {
        'NoWhiteSpace'     => 1,
        'NoComments'       => 1,
        'AddHiddenNewline' => 1,
        'AddHiddenAttrTab' => 1,
    }
);

# Read the (possibly malformed) HTML from STDIN.
my $parser = XML::Driver::HTML->new(
    'Handler' => $handler,
    'Source'  => { 'ByteStream' => IO::File->new("<-") }
);

$parser->parse();
Comparing XML Documents
XML::SemanticDiff
The final module we will look at is my own XML::SemanticDiff. In a nutshell, XML::SemanticDiff provides an easy way to report the differences in content and structure between two XML documents. While the module's handler-style architecture allows it to be integrated into more complex XML applications, basic usage is quite simple:
use XML::SemanticDiff;

my $diff = XML::SemanticDiff->new();
my @differences = $diff->compare('old.xml', 'new.xml');

# Each difference is a hash reference with a human-readable message.
foreach my $warning (@differences) {
    print $warning->{message} . "\n";
}
Summing Up
Do not be led astray. Adding a working knowledge of Perl's XML processing facilities to your bag of tricks does not mean spending hours poring over complex specs or committing yourself to a future of mindless buzzword compliance. If you need cutting-edge XML processing capabilities for your applications there are modules that provide everything you will probably need. If the task at hand is simple, choose one of the many convenience modules that hides the moving parts and then just relax. As with most things in the Perl world, simple things are easy, hard things are possible, and there's always more than one (right) way to do it.
We have covered a lot of ground in this series, and it is my sincere hope that the examples I have chosen will help to point the way forward for new Perl-XML users. If, however, you are still wondering which tools are best suited to your needs, please do not hesitate to join the Perl-XML mailing list and ask for help.