Many people, understandably, think of XML as the invention of an evil genius bent on destroying humanity. The embedded markup, with its angle brackets and slashes, is not exactly a treat for the eyes. Add to that the business about nested elements, node types, and DTDs, and you might cower in the corner and whimper for nice, tab-delimited files and a split function.
Here's a little secret: writing programs to process XML is not hard. A whole spectrum of tools that handle the mundane details of parsing and building data structures for you is available, with convenient APIs that get you started in a few minutes. If you really need the complexity of a full-featured XML application, you can certainly get it, but you don't have to. XML scales nicely from simple to bafflingly complex, and if you deal with XML on the simple end of the continuum, you can pick simple tools to help you.
To prove our point, we'll look at a very basic module called XML::Simple, created by Grant McLean. With minimal effort up front, you can accomplish a surprising amount of useful work when processing XML.
A typical program reads in an XML document, makes some changes, and writes it back out to a file. XML::Simple was created to automate this process as much as possible. One subroutine call reads in an XML document and stores it in memory for you, using nested hashes to represent elements and data. After you make whatever changes you need to make, call another subroutine to print it out to a file.
Let's try it out. As with any module, you have to introduce XML::Simple to your program with a use statement like this:
use XML::Simple;
When you do this, XML::Simple exports two subroutines into your namespace:
XMLin( )
This subroutine reads an XML document from a file or string and builds a data structure to contain the data and element structure. It returns a reference to a hash containing the structure.
XMLout( )
Given a reference to a hash containing an encoded document, this subroutine generates XML markup and returns it as a string of text.
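As a minimal sketch of the two calls working together, the following reads a small XML string and writes it back out. The sample document and the variable names are made up for illustration; the same XMLin( ) call would also accept a filename.

```perl
use strict;
use warnings;
use XML::Simple;

# A tiny, made-up document passed as a string rather than a filename
my $xml = '<list owner="boss"><item>eggs</item><item>milk</item></list>';

# XMLin returns a reference to a hash describing the document
my $data = XMLin($xml);

# Attributes and repeated child elements both show up as hash entries
print "owner: $data->{owner}\n";
print "items: @{ $data->{item} }\n";

# XMLout turns the structure back into a string of markup
my $markup = XMLout($data);
print $markup;
```

Note that the root element's name is not stored in the hash by default, so the markup that comes back uses the module's default root name rather than list.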
If you like, you can build the document from scratch by simply creating the data structures from hashes, arrays, and strings. You'd have to do that if you wanted to create a file for the first time. Just be careful to avoid using circular references, or the module will not function properly.
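Building a document from scratch might look something like the sketch below. The structure and its values are invented for illustration, and RootName is used here only to give the outermost element a meaningful name; treat it as one possible approach rather than the required one.

```perl
use strict;
use warnings;
use XML::Simple;

# A hand-built structure: plain scalars become attributes by default,
# while array references become nested elements
my $doc = {
    version  => '3.5',
    customer => [
        {
            'first-name' => ['Joe'],
            surname      => ['Wrigley'],
        },
    ],
};

# RootName sets the name of the outermost element
my $markup = XMLout($doc, RootName => 'spam-document');
print $markup;
```

Because the structure contains no references back to itself, XMLout( ) can walk it from top to bottom without trouble.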
For example, let's say your boss is going to send email to a group of people using the world-renowned mailing list management application, WarbleSoft SpamChucker. Among its features is the ability to import and export XML files representing mailing lists. The only problem is that the boss has trouble reading customers' names as they are displayed on the screen and would prefer that they all be in capital letters. Your assignment is to write a program that can edit the XML datafiles to convert just the names into all caps.
Accepting the challenge, you first examine the XML files to determine the style of markup. Example 1-1 shows such a document.
<?xml version="1.0"?>
<spam-document version="3.5" timestamp="2002-05-13 15:33:45">
<!-- Autogenerated by WarbleSoft Spam Version 3.5 -->
<customer>
  <first-name>Joe</first-name>
  <surname>Wrigley</surname>
  <address>
    <street>17 Beable Ave.</street>
    <city>Meatball</city>
    <state>MI</state>
    <zip>82649</zip>
  </address>
  <email>joewrigley@jmac.org</email>
  <age>42</age>
</customer>
<customer>
  <first-name>Henrietta</first-name>
  <surname>Pussycat</surname>
  <address>
    <street>R.F.D. 2</street>
    <city>Flangerville</city>
    <state>NY</state>
    <zip>83642</zip>
  </address>
  <email>meow@263A.org</email>
  <age>37</age>
</customer>
</spam-document>
Having read the perldoc page describing XML::Simple, you might feel confident enough to craft a little script, shown in Example 1-2.
# This program capitalizes all the customer names in an XML document
# made by WarbleSoft SpamChucker.

# Turn on strict and warnings, for it is always wise to do so (usually)
use strict;
use warnings;

# Import the XML::Simple module
use XML::Simple;

# Turn the file into a hash reference, using XML::Simple's "XMLin"
# subroutine.
# We'll also turn on the 'forcearray' option, so that all elements
# contain arrayrefs.
my $cust_xml = XMLin('./customers.xml', forcearray=>1);

# Loop over each customer sub-hash, all of which are stored in an
# anonymous array under the 'customer' key
for my $customer (@{$cust_xml->{customer}}) {
  # Capitalize the contents of the 'first-name' and 'surname' elements
  # by running Perl's built-in uc( ) function on them
  foreach (qw(first-name surname)) {
    $customer->{$_}->[0] = uc($customer->{$_}->[0]);
  }
}

# print out the hash as an XML document again, with a trailing newline
# for good measure
print XMLout($cust_xml);
print "\n";
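To see why the loop indexes with ->[0], it helps to picture the structure the script walks. The sketch below builds by hand the shape that XMLin( ) is expected to return for one customer when forcearray is on. This is an assumption based on the option's documented behavior, not output captured from the module, and the record is trimmed to three fields. It then applies the same loop using only core Perl.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-written stand-in for the structure returned by XMLin with
# forcearray => 1: every child element is wrapped in an array reference,
# even when it occurs only once. Attributes remain plain scalars.
my $cust_xml = {
    version  => '3.5',
    customer => [
        {
            'first-name' => ['Joe'],
            surname      => ['Wrigley'],
            age          => ['42'],
        },
    ],
};

# The same loop as in the script: element text always lives in slot 0
for my $customer (@{ $cust_xml->{customer} }) {
    for my $field (qw(first-name surname)) {
        $customer->{$field}[0] = uc $customer->{$field}[0];
    }
}

# Dump the modified structure, with sorted keys, to inspect it
$Data::Dumper::Sortkeys = 1;
print Dumper($cust_xml);
```

Forcing arrays everywhere means the code never has to check whether an element appeared once or several times. With that picture in mind, back to the script in Example 1-2.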
Running the program (a little trepidatious, perhaps, since the data belongs to your boss), you get this output:
<opt version="3.5" timestamp="2002-05-13 15:33:45">
  <customer>
    <address>
      <state>MI</state>
      <zip>82649</zip>
      <city>Meatball</city>
      <street>17 Beable Ave.</street>
    </address>
    <first-name>JOE</first-name>
    <email>joewrigley@jmac.org</email>
    <surname>WRIGLEY</surname>
    <age>42</age>
  </customer>
  <customer>
    <address>
      <state>NY</state>
      <zip>83642</zip>
      <city>Flangerville</city>
      <street>R.F.D. 2</street>
    </address>
    <first-name>HENRIETTA</first-name>
    <email>meow@263A.org</email>
    <surname>PUSSYCAT</surname>
    <age>37</age>
  </customer>
</opt>
Congratulations! You've written an XML-processing program, and it worked perfectly. Well, almost perfectly. The output is a little different from what you expected. For one thing, the elements are in a different order, since hashes don't preserve the order of items they contain. Also, the spacing between elements may be off. Could this be a problem?
This scenario brings up an important point: there is a trade-off between simplicity and completeness. As the developer, you have to decide what's essential in your markup and what isn't. Sometimes the order of elements is vital, and then you might not be able to use a module like XML::Simple. Or, perhaps you want to be able to access processing instructions and keep them in the file. Again, this is something XML::Simple can't give you. Thus, it's vital that you understand what a module can or can't do before you commit to using it. Fortunately, you've checked with your boss and tested the SpamChucker program on the modified data, and everyone was happy. The new document is close enough to the original to fulfill the application's requirements.[1] Consider yourself initiated into processing XML with Perl!
[1] Some might say that, disregarding the changes we made on purpose, the two documents are semantically equivalent, but this is not strictly true. The order of elements changed, which is significant in XML. We can say for sure that the documents are close enough to satisfy all the requirements of the software for which they were intended and of the end user.
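Some of the gap between input and output can be narrowed with options that XML::Simple documents, at the cost of a little simplicity. As a sketch (the sample string is made up, and this combination of options is just one reasonable choice), KeepRoot asks XMLin( ) and XMLout( ) to retain the root element's name instead of substituting a default, and XMLDecl asks XMLout( ) to emit an XML declaration. Element order and processing instructions remain out of reach.

```perl
use strict;
use warnings;
use XML::Simple;

# A made-up, cut-down version of a SpamChucker file
my $xml = <<'END_XML';
<?xml version="1.0"?>
<spam-document version="3.5">
  <customer>
    <first-name>Joe</first-name>
    <surname>Wrigley</surname>
  </customer>
</spam-document>
END_XML

# KeepRoot keeps the root element as the top-level hash key
my $doc = XMLin($xml, ForceArray => 1, KeepRoot => 1);

# Passing KeepRoot again makes XMLout reuse that key as the root name;
# XMLDecl => 1 prepends a default XML declaration
my $markup = XMLout($doc, KeepRoot => 1, XMLDecl => 1);
print $markup;
```

Whether the extra options are worth it depends on what the consuming application actually checks, which is the same judgment call described above.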
This is only the beginning of your journey. Most of the book still lies ahead of you, chock full of tips and techniques to wrestle with any kind of XML. Not every XML problem is as simple as the one we just showed you. Nevertheless, we hope we've made the point that there's nothing innately complex or scary about banging XML with your Perl hammer.
Copyright © 2002 O'Reilly & Associates. All rights reserved.