
Chapter 13. XML and Perl

Contents:

XML Parsing and Validation
XML::Parser Methods
Expat Handlers
XML::Parser Styles
Expat Encodings
XML::Parser::ContentModel Methods

The Extensible Markup Language (XML) is a metalanguage for describing the structure and content of documents and other types of data. XML is based loosely on SGML, but has divorced itself from much of the complexity that made SGML unsuitable for everyday use. Without opening a can of worms about the differences between SGML and XML, suffice it to say that SGML is the predecessor to XML, and XML is a subset of SGML, with extensions.

With XML, you aren't bound by a fixed format, but can mark up a document to make it easily adaptable to whatever final format you later decide to apply it to. In fact, this book is written in XML, to be produced later not only in a print format but in an online format as well.

XML is often associated with web content, but it is much more flexible than that. Lately, XML's application to web services such as SOAP and XML-RPC has given it a chance to flex its muscles and show what it's capable of. XML gives you the structure to hold any content you'd like, whether it be the pages of this book in their rawest form, a list of your favorite recipes, or the ledger from your checkbook. XML is structured so that you can represent any kind of data in XML. XML's openness means that you can implement an XML-based application on any platform.

This chapter focuses on parsing, checking, and delivering XML content. Chapter 14, "SOAP," covers SOAP programming in Perl.

13.1. XML Parsing and Validation

The two most common tasks that you'll perform on your XML content are likely to be parsing and validating. If you've ever combed around on CPAN for XML-related modules, you probably already know that there's no shortage of resources when it comes to Perl and XML. On the contrary, the sheer volume of modules available for Perl and XML is rather daunting, so you might be looking for a place to start.

This chapter covers two Perl/XML modules in particular: XML::Simple and XML::Parser. These modules were selected because they allow you to parse and manipulate most XML. While these modules themselves don't validate XML, we'll resort to a little bit of trickery to show you how you can do exactly that.

XML::Simple provides an easy API that allows you to read and write XML. It is built on top of XML::Parser, which will be covered shortly. As its name suggests, XML::Simple implements only two methods: XMLin( ) and XMLout( ). But don't let its apparent lack of methods fool you; XML::Simple lets you do great things simply, such as parsing XML-written configuration files.

Let's say that your company keeps a log of its Sun Microsystems servers and their respective operating-system versions, IP addresses, and current patch levels. While you could keep this information in your home directory as a delimited file (which you can parse and analyze when you need to, or import into a database, such as PostgreSQL or MySQL), why not just write it in XML? You'll find that by writing this information in an XML document, you'll be able to operate on it just as flexibly as you can with either of the aforementioned strategies. In addition, your information would reside in a structure in which the relationships between items and their meanings are clear.

Let's say that your flat text log file looks like this:

# sunhosts - patches and levels
atlas|solaris|2.8|192.168.0.2|Generic_108528-10|5.6.1
carrie|solaris|2.8|192.168.1.10|Generic_108527-12|5.6.0
not4sun|solaris|2.8|192.168.0.25|Generic_108482-06|5.005_03

While you could parse this configuration file easily, and generate some kind of report from it, your configuration file doesn't really tell you anything about the data that it represents. You can represent the same data in XML like so:

<config servertype="sunhosts" reporttype="patches and levels">
  <server name="atlas" osname="solaris" osversion="2.8">
    <address>192.168.0.2</address>
    <patchlevel>Generic_108528-10</patchlevel>
    <perlversion>5.6.1</perlversion>
  </server>

  <server name="carrie" osname="solaris" osversion="2.8">
    <address>192.168.1.10</address>
    <patchlevel>Generic_108527-12</patchlevel>
    <perlversion>5.6.0</perlversion>
  </server>

  <server name="not4sun" osname="solaris" osversion="2.8">
    <address>192.168.0.25</address>
    <patchlevel>Generic_108482-06</patchlevel>
    <perlversion>5.005_03</perlversion>
  </server>
</config>

In the above XML, all of the entries will be keyed on server so that for each entry in your XML that's called server, you'll be able to view its information for address, patchlevel, and perlversion. The following XML::Simple code does just that:

#!/usr/local/bin/perl -w

use XML::Simple;

my $config = XMLin('./myconfig.xml');

# Simply show us the IP address of each server in our table
foreach my $server (keys %{$config->{server}}) {
    print "$server -> $config->{server}{$server}{address}\n";
}
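
The loop above works because XML::Simple, by default, folds repeated elements that carry a name attribute into a hash keyed on that attribute. The following is a minimal sketch of the resulting structure, assuming the XML is passed to XMLin( ) as a string; the inline $xml variable and the trimmed-down document are illustrative only:

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# A trimmed copy of the configuration, inlined for illustration
my $xml = <<'XML';
<config servertype="sunhosts" reporttype="patches and levels">
  <server name="atlas" osname="solaris" osversion="2.8">
    <address>192.168.0.2</address>
  </server>
  <server name="carrie" osname="solaris" osversion="2.8">
    <address>192.168.1.10</address>
  </server>
</config>
XML

# XMLin() accepts a filename, a filehandle, or a string of XML
my $config = XMLin($xml);

# The repeated <server> elements become a hash keyed on "name"
$Data::Dumper::Sortkeys = 1;
print Dumper($config->{server});

# Attributes and single child elements both appear as plain hash values
print "$config->{server}{atlas}{osversion}\n";   # from an attribute
print "$config->{server}{atlas}{address}\n";     # from a child element
```

Note that the folding only happens when an element repeats; a file with a single server entry would produce a different shape unless you adjust options such as ForceArray.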

Now that you've parsed your configuration file, let's say you want to write XML back to your configuration. To do this, you should use XMLout( ), which is covered in the XML::Simple documentation.
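
As a rough sketch of such a round trip, the example below parses a small document, changes one value, and serializes the result again. RootName and KeyAttr are documented XML::Simple options; the inline document and the new patch level are made up for illustration. One caveat: scalar hash values are written back as attributes by default, so the output carries the same data but not necessarily the original element layout.

```perl
use strict;
use warnings;
use XML::Simple;

# Illustrative document; not the full configuration file
my $xml = <<'XML';
<config servertype="sunhosts">
  <server name="atlas" osversion="2.8">
    <patchlevel>Generic_108528-10</patchlevel>
  </server>
  <server name="carrie" osversion="2.8">
    <patchlevel>Generic_108527-12</patchlevel>
  </server>
</config>
XML

# Fold <server> elements into a hash keyed on their name attribute
my $config = XMLin($xml, KeyAttr => ['name']);

# Hypothetical update: record a newer patch level for atlas
$config->{server}{atlas}{patchlevel} = 'Generic_108528-13';

# RootName names the top-level element; KeyAttr tells XMLout() to turn
# the hash keys under "server" back into name="..." attributes
my $out = XMLout($config, RootName => 'config', KeyAttr => ['name']);
print $out;
```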

While XML::Simple doesn't provide a way to validate XML against a DTD or schema, it does insist that a document be well-formed. If you present XML::Simple with a document that's not syntactically correct, XMLin( ) stops parsing and dies with an error message; wrap the call in eval and the message is available in $@. For example, the following document is missing the closing </server> tag:

<config servertype="sunhosts" reporttype="patches and levels">
  <server name="atlas" osname="solaris" osversion="2.8">
    <address>192.168.0.2</address>
    <patchlevel>Generic_108528-10</patchlevel>
    <perlversion>5.6.1</perlversion>
</config>

The following code would find the error in your XML and exit upon finding it:

#!/usr/local/bin/perl -w

use XML::Simple;

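# XMLin() dies on malformed XML; eval traps the message in $@
my $config = eval { XMLin('./mybadconfig.xml') };
if ($@) {
    print "I found an error in your XML: $@";
    exit 1;
}

# Only reached when the document parsed cleanly
foreach my $server (keys %{$config->{server}}) {
    print "$server -> $config->{server}{$server}{address}\n";
}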

This gives you:

I found an error in your XML: 
mismatched tag at line 6, column 2, byte 241 at
/usr/local/lib/perl5/site_perl/5.6.1/sun4-solaris/XML/Parser.pm line 185

Two components are very useful for parsing XML with Perl: Expat and XML::Parser. Although there are several XML parsing options for Perl, such as GNOME's libxml and offerings from the PerlSAX2 project, we stick to Expat and XML::Parser in this chapter.

Expat is a nonvalidating XML parser (that is, it checks that a document is well-formed but does not validate it against a DTD) that was written by James Clark. XML::Parser, a Perl wrapper around Expat, was originally written by Larry Wall and later developed by Clark Cooper. If you're using ActivePerl, you should be able to install XML::Parser with ppm. Otherwise, you'll have to build Expat first, then link XML::Parser against it.

Each call to one of the XML::Parser parsing methods creates a new instance of XML::Parser::Expat, which is then used to parse the document. Expat options may be provided when the XML::Parser object is created. These options are then passed on to the Expat object on each parse call. They can also be given as extra arguments to the parse methods, in which case they override options given at XML::Parser creation time.
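
To illustrate the precedence, here is a hedged sketch using ProtocolEncoding, one of the documented Expat options. The parser is created expecting US-ASCII, and the second parse call overrides that with ISO-8859-1 for a single document. The sample bytes and variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# A document with no XML declaration containing one Latin-1 byte (0xE9)
my $latin1_doc = "<name>caf\xE9</name>";

# Option given at creation time: applies to every parse call by default
my $parser = XML::Parser->new(ProtocolEncoding => 'US-ASCII');

# With the creation-time option, the non-ASCII byte is rejected
my $ok_default = eval { $parser->parse($latin1_doc); 1 } ? 1 : 0;

# The same option passed to parse() overrides it for this call only
my $ok_override = eval {
    $parser->parse($latin1_doc, ProtocolEncoding => 'ISO-8859-1');
    1;
} ? 1 : 0;

print "default: $ok_default, override: $ok_override\n";
```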

The XML parser takes your XML document and turns it into a data structure that you can operate on. XML::Parser gives you low-level, precise control over that data structure, which makes it a good foundation for building higher-level parsers. XML::Simple is an example of this.

The behavior of the parser is controlled by the Style and Handlers options, by the setHandlers method, or by a combination of these. All of them are mechanisms for setting the handlers that XML::Parser::Expat needs. If neither Style nor Handlers is specified, parsing simply checks whether the document is well-formed.
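
As a small sketch of that last case, a parser created with neither Style nor Handlers can serve as a well-formedness check. The helper name is_well_formed and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Returns 1 if $xml parses cleanly, 0 otherwise.
# On failure, the Expat error message is left in $@.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no Style, no Handlers
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

my $good = '<server name="atlas"><address>192.168.0.2</address></server>';
my $bad  = '<server name="atlas"><address>192.168.0.2</address>';  # no </server>

for my $doc ($good, $bad) {
    if (is_well_formed($doc)) {
        print "well-formed\n";
    } else {
        print "not well-formed: $@";
    }
}
```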

When underlying handlers get called, they receive as their first parameter the Expat object, not the Parser object.
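
A minimal sketch of what this looks like in practice follows. The handler names, the %address hash, and the shortened document are assumptions made for the example; Start, Char, and the current_element method are part of the documented XML::Parser and XML::Parser::Expat interfaces.

```perl
use strict;
use warnings;
use XML::Parser;

# Shortened, illustrative version of the configuration
my $xml = <<'XML';
<config servertype="sunhosts">
  <server name="atlas"><address>192.168.0.2</address></server>
  <server name="carrie"><address>192.168.1.10</address></server>
</config>
XML

my (%address, $current_server);

# Start handler: ($expat, $element, attribute name/value pairs...)
sub handle_start {
    my ($expat, $element, %attr) = @_;
    $current_server = $attr{name} if $element eq 'server';
}

# Char handler: ($expat, $string); may fire more than once per text node
sub handle_char {
    my ($expat, $string) = @_;
    # The Expat object knows which element is currently open
    $address{$current_server} .= $string
        if $expat->current_element eq 'address';
}

my $parser = XML::Parser->new(
    Handlers => { Start => \&handle_start, Char => \&handle_char },
);
$parser->parse($xml);

print "$_ -> $address{$_}\n" for sort keys %address;
```

Because the first argument is the Expat object, context methods such as current_element are called on it rather than on the XML::Parser object you created.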

You can show the relationships between entries in a configuration file with XML::Parser as well. For example:

#!/usr/local/bin/perl -w

use XML::Parser;
use Data::Dumper;

# Simply dump all of the entries and their relationships
my $p1 = XML::Parser->new(Style => 'Tree');
my $tree = $p1->parsefile('myconfig.xml');

print Dumper($tree);
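
The Tree style returns a two-item list reference of the form [tag, content], where content is an array whose first item is a hash of attributes, followed by pairs of either (tag, content) for child elements or (0, text) for character data. The following sketch walks such a tree and builds an indented outline; the walk function, the @lines array, and the shortened document are hypothetical names chosen for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# Shortened, illustrative version of the configuration
my $xml = <<'XML';
<config servertype="sunhosts">
  <server name="atlas"><address>192.168.0.2</address></server>
  <server name="carrie"><address>192.168.1.10</address></server>
</config>
XML

my $tree = XML::Parser->new(Style => 'Tree')->parse($xml);

my @lines;

# Recursively visit one (tag, content) pair from the tree
sub walk {
    my ($tag, $content, $depth) = @_;
    my ($attrs, @children) = @$content;   # first item is the attribute hash
    my $attr_text = join ' ', map { "$_=$attrs->{$_}" } sort keys %$attrs;
    push @lines, ('  ' x $depth) . $tag . ($attr_text ? " [$attr_text]" : '');

    # Children come in pairs: (tag, content) or (0, text)
    while (my ($child_tag, $child_content) = splice(@children, 0, 2)) {
        if ($child_tag eq '0') {
            push @lines, ('  ' x ($depth + 1)) . "\"$child_content\""
                if $child_content =~ /\S/;   # skip whitespace-only text
        } else {
            walk($child_tag, $child_content, $depth + 1);
        }
    }
}

walk(@$tree, 0);
print "$_\n" for @lines;
```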



Copyright © 2002 O'Reilly & Associates. All rights reserved.