Perl XML Quickstart: Convenience Modules

June 13, 2001

Introduction

This is the third and final part of a series of articles meant to give quick introductions to some of the more popular Perl XML modules. In the last two months we have looked at the modules that implement the standard XML APIs and those that provide more Perlish XML interfaces. This month we will be looking at some of the modules that seek to simplify a specific XML-related task.

Keeping It ( Real | Simple )

Getting started with XML processing in Perl can be a daunting task. A quick CPAN search reveals 77 XML-related distributions containing more than 200 modules. How do you know which one to choose? Selecting a module based on its ability to cover common use cases (usually a useful guide) seems a bit absurd in light of the fact that XML is being used successfully for everything from storing simple configuration data to enabling communications between complex AI systems. If you find yourself wondering where to begin, don't despair. Despite the apparent complexities, you do not need expert knowledge of Perl or XML to put their combined power to work for you.

Unless XML is a significant part of your daily life, chances are good that the more generic XML API modules will seem like overkill. Perhaps they are. If your needs are modest, a module probably exists that will reduce your task to a few method calls. These single-purpose convenience modules are a key entry point to the Perl/XML world, and I have chosen a few of the more popular ones for this month's code samples. In the interest of clarity, we will limit the scope of the examples to three common tasks: creating XML documents from other data sources, converting HTML to XHTML, and comparing the contents of two XML documents.

Creating XML From Other Data Sources

While many of the XML API modules provide a way to create XML documents programmatically based on data from any source, several modules exist that simplify the task of creating XML documents from data stored in other common formats. We'll illustrate how to create XML documents based on data extracted from CSV (Comma Separated Value) files, Excel spreadsheets, and relational databases.

Comma Separated Value - XML::CSV

Arguably the easiest structured data format to use and understand, CSV continues to be very popular. Ilya Sterin's XML::CSV offers a simple way to create XML documents from CSV files.

use XML::CSV;

my $file = 'addresses.csv';
my @columns = ('first-name', 'last-name', 'email');

my $csv = XML::CSV->new({column_headings => \@columns});

$csv->parse_doc($file);
$csv->declare_xml({version => '1.0',
                   standalone => 'yes'});

$csv->print_xml('address.xml',
                {file_tag    => 'address-book',
                 parent_tag  => 'entry'}
               );

Running this script produces the following XML document:

<?xml version="1.0"?>
<address-book>
  <entry>
    <first-name>Lister</first-name>
    <last-name>David</last-name>
    <email>curryboy@dwarf.spc</email>
  </entry>
  <entry>
    <first-name>Rimmer</first-name>
    <last-name>Arnold</last-name>
    <email>smeghead@dwarf.spc</email>
  </entry>
  ...
</address-book>
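
To make the transformation itself less mysterious, here is a minimal hand-rolled sketch of the same idea using only core Perl. It is not how XML::CSV is implemented; the csv_to_xml and xml_escape helpers are hypothetical names invented for this illustration, and the sketch assumes plain comma-separated fields with no quoting or embedded commas.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: turn simple CSV lines into an XML string.
# Assumption: fields are never quoted and never contain commas.
sub csv_to_xml {
    my ($lines, $columns, $file_tag, $parent_tag) = @_;
    my $xml = qq{<?xml version="1.0"?>\n<$file_tag>\n};
    foreach my $line (@$lines) {
        chomp(my $record = $line);
        next unless length $record;            # skip empty lines
        my @fields = split /,/, $record, -1;   # -1 keeps trailing empty fields
        $xml .= "  <$parent_tag>\n";
        for my $i (0 .. $#{$columns}) {
            my $tag   = $columns->[$i];
            my $value = defined $fields[$i] ? $fields[$i] : '';
            $xml .= "    <$tag>" . xml_escape($value) . "</$tag>\n";
        }
        $xml .= "  </$parent_tag>\n";
    }
    $xml .= "</$file_tag>\n";
    return $xml;
}

my @columns = ('first-name', 'last-name', 'email');
my @lines   = ("Lister,David,curryboy\@dwarf.spc\n",
               "Rimmer,Arnold,smeghead\@dwarf.spc\n");
print csv_to_xml(\@lines, \@columns, 'address-book', 'entry');
```

Real-world CSV has quoting rules that this sketch ignores, which is a large part of why reaching for a dedicated module is usually the better choice.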

Excel Spreadsheets - XML::Excel

Identical to XML::CSV in terms of interface, XML::Excel provides the same functionality for those extracting data from Excel spreadsheets.

use XML::Excel;

my $file = 'addresses.xls';
my @columns = ('first-name', 'last-name', 'email');

my $xls = XML::Excel->new({column_headings => \@columns});

$xls->parse_doc($file);
$xls->declare_xml({version => '1.0',
                   standalone => 'yes'});

$xls->print_xml('address.xml',
                {file_tag    => 'address-book',
                 parent_tag  => 'entry'}
               );

The output from this script is identical to the output of the XML::CSV example above.

Relational Databases - DBIx::XML_RDB

The following script shows how to translate the contents of a simple MySQL table into an XML document using Matt Sergeant's DBIx::XML_RDB module:

use DBIx::XML_RDB;
my $driver = 'mysql';
my $hostname = 'localhost';
my $database = 'shipdata';
my $user = 'root';
my $pass = 's3cr37';
my $dsn = "DBI:$driver:database=$database;host=$hostname";

my $dbx = DBIx::XML_RDB->new($dsn, 'mysql', $user, $pass);
$dbx->DoSql(qq{ select * from address_book});

open(XML, ">addresses.xml")
     || die "Could not open file for writing: $! \n";

print XML $dbx->GetData;
close XML;

Running this script against our mythical database yields the following:

<?xml version="1.0"?>
<DBI driver="DBI:mysql:database=shipdata;host=localhost">
  <RESULTSET statement=" select * from address_book">
    <ROW>
      <first-name>Cat</first-name>
      <last-name>The</last-name>
      <email>kewlguy@dwarf.spc</email>
    </ROW>
    <ROW>
      <first-name>Holly</first-name>
      <last-name></last-name>
      <email>root@localhost</email>
    </ROW>
    ...
  </RESULTSET>
</DBI>

XML/RDBMS integration is a very broad topic. Please see my earlier column, Using XML and Relational Databases with Perl, for a more complete discussion.

HTML Conversion

The first challenge that Web developers often face when considering the move to XML is how to convert their existing HTML documents to well-formed XHTML. As we saw in a previous column, publishing XML on the Web offers many benefits beyond the ability to separate content from design. The key point, though, is that the documents must be well-formed and the cost associated with cleaning them up by hand may not always be offset by the benefits. Here, too, Perl offers several possible solutions, but we will focus only on two of the simpler ones.

XML::PYX

Based on concepts inherited from SGML, the PYX notation is a simple, line-oriented way of accessing the contents of markup documents. The XML::PYX distribution ships with command-line utilities that translate XML documents to and from PYX notation, and one that generates PYX from (possibly malformed) HTML.

Converting bad HTML into a stricter form that is palatable to XML parsers is as simple as typing the following at the command prompt:


$ pyxhtml dirty.html | pyxw > clean.html

Or consider the following snippet that processes a list of files and creates clean XHTML versions in the same directories with a .xhtml file extension:

foreach my $in_file (@files) {
    my $out_file;
    ($out_file = $in_file) =~ s/\.html$/\.xhtml/;
    my $html = `pyxhtml $in_file | pyxw`;
    open (OUT, ">$out_file") || die "could not write to $out_file: $!\n";
    print OUT $html;
    close OUT;
}

PYX's line-oriented interface also allows standard UNIX tools like grep to work more cleanly on XML documents. See Sean McGrath's introduction to Pyxie for more detail about PYX.
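
For readers who have not met PYX before, the notation puts one markup event on each line, identified by its first character: "(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag, and "?" for a processing instruction. The following sketch shows roughly what any PYX-to-XML writer has to do with such a stream. It is an illustration in core Perl only, not the code behind pyxw, and the pyx_to_xml and xml_escape helpers are hypothetical names made up for this example.

```perl
use strict;
use warnings;

# Escape characters that are special in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: convert a list of PYX lines into an XML string.
sub pyx_to_xml {
    my @lines = @_;
    my $xml   = '';
    my $open  = 0;    # true while a start tag still needs its closing '>'
    foreach my $line (@lines) {
        chomp $line;
        next unless length $line;
        my $type = substr($line, 0, 1);
        my $rest = substr($line, 1);
        if ($type eq 'A') {
            # Attribute lines look like "Aname value".
            my ($name, $value) = split / /, $rest, 2;
            $value = '' unless defined $value;
            $xml .= qq{ $name="} . xml_escape($value) . '"';
            next;
        }
        # Any non-attribute event finishes a pending start tag.
        if ($open) { $xml .= '>'; $open = 0; }
        if    ($type eq '(') { $xml .= "<$rest"; $open = 1; }
        elsif ($type eq ')') { $xml .= "</$rest>"; }
        elsif ($type eq '?') { $xml .= "<?$rest?>"; }
        elsif ($type eq '-') {
            $rest =~ s/\\n/\n/g;    # PYX writes newlines in text as a literal \n
            $xml .= xml_escape($rest);
        }
    }
    return $xml;
}

my @pyx = ('(entry', 'Aid 1', '(email', '-curryboy@dwarf.spc', ')email', ')entry');
print pyx_to_xml(@pyx), "\n";
```

Run as written, this prints <entry id="1"><email>curryboy@dwarf.spc</email></entry>. Because every event sits on its own line, picking out, say, all the character data in a document is just a matter of matching lines that begin with a hyphen.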

XML::Driver::HTML

Michael Koehne's XML::Driver::HTML implements a SAX interface to malformed HTML. It too can be used to create XHTML documents from existing HTML by using XML::Handler::YAWriter as a SAX handler.

use IO::File;
use XML::Driver::HTML;
use XML::Handler::YAWriter;

my $handler = XML::Handler::YAWriter->new(
    'Output' => IO::File->new( ">-" ),
    'Pretty' => {'NoWhiteSpace'     => 1,
                 'NoComments'       => 1,
                 'AddHiddenNewline' => 1,
                 'AddHiddenAttrTab' => 1}
    );

my $parser = XML::Driver::HTML->new(
    'Handler' => $handler,
    'Source' => { 'ByteStream' => IO::File->new( "<-" ) }
    );

$parser->parse();

Comparing XML Documents

XML::SemanticDiff

The final module we will look at is my own XML::SemanticDiff. In a nutshell, XML::SemanticDiff provides an easy way to report the differences in content and structure between two XML documents. While the module's handler-style architecture allows it to be integrated into more complex XML applications, basic usage is quite simple:

use XML::SemanticDiff;

my $diff = XML::SemanticDiff->new();

my @differences = $diff->compare('old.xml', 'new.xml');

foreach my $warning (@differences) {
    print $warning->{message} . "\n";
}
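
One way to picture a report on "differences in content and structure" is as a comparison of two maps from element path to text content. The sketch below illustrates that idea in core Perl only. It is a conceptual illustration, not XML::SemanticDiff's algorithm; the compare_maps helper, the path strings, and the message wording are all hypothetical, and the maps are assumed to have been extracted from the two documents beforehand.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two "element path => text" maps and
# return a list of difference records, one hash reference per finding.
sub compare_maps {
    my ($old, $new) = @_;
    my @differences;
    foreach my $path (sort keys %$old) {
        if (!exists $new->{$path}) {
            # Structural difference: the element disappeared.
            push @differences, { context => $path,
                                 message => "Element missing from new document: $path" };
        }
        elsif ($old->{$path} ne $new->{$path}) {
            # Content difference: same element, different text.
            push @differences, { context => $path,
                                 message => "Text differs in element: $path" };
        }
    }
    foreach my $path (sort keys %$new) {
        next if exists $old->{$path};
        # Structural difference: the element is new.
        push @differences, { context => $path,
                             message => "Unexpected new element: $path" };
    }
    return @differences;
}

# Illustrative data only; the paths are assumed, not parsed from real files.
my %old = ('/entry[1]/first-name[1]' => 'Lister',
           '/entry[1]/email[1]'      => 'curryboy@dwarf.spc');
my %new = ('/entry[1]/first-name[1]' => 'Lister',
           '/entry[1]/email[1]'      => 'dave@dwarf.spc',
           '/entry[1]/rank[1]'       => 'Technician');

foreach my $warning (compare_maps(\%old, \%new)) {
    print $warning->{message} . "\n";
}
```

Comparing by path and content rather than line by line is what lets this kind of check ignore differences in whitespace and formatting that do not change what a document says.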

Summing Up

Do not be led astray. Adding a working knowledge of Perl's XML processing facilities to your bag of tricks does not mean spending hours poring over complex specs or committing yourself to a future of mindless buzzword compliance. If you need cutting-edge XML processing capabilities for your applications, there are modules that provide everything you will probably need. If the task at hand is simple, choose one of the many convenience modules that hide the moving parts and then just relax. As with most things in the Perl world, simple things are easy, hard things are possible, and there's always more than one (right) way to do it.

We have covered a lot of ground in this series, and it is my sincere hope that the examples I have chosen will help to point the way forward for new Perl-XML users. If, however, you are still wondering which tools are best suited to your needs, please do not hesitate to join the Perl-XML mailing list and ask for help.