A Template-based approach to XML Parsing in C++


Introduction

XML is a markup based data description language designed to allow a developer 
to create structured documents using descriptive custom tags.  
The intent of XML is to separate the description of the data 
from its intended use and allow the transfer of the data between different
applications in a non-platform or architecture-specific way.

Another useful application of XML is to describe a process 
in a logical and meaningful way that can be carried out by 
the application at run-time.


Parsing XML

In order for an XML file to be parsed successfully the developer must 
first create a file that can be processed by a parser.  A 
parser is a set of shared objects, or a library that reads and 
processes an XML file.  The parser may be one of two types: 
validating or non-validating.  

A validating parser scans the XML file and determines if the document 
is well-formed, as specified, by either an XML schema or the document 
type defintion (DTD).  

A non-validating parser simply reads the file and ignores the format and 
layout as specified by either the XML schema or the DTD.

The most widely used parsers come in two flavors: event driven and tree 
based.  The event driven parser is called SAX, which stands for Simple API 
for XML processing.  Whereas a tree based parser creates a DOM 
(Document Object Model) tree in memory at the time the XML file is read
and parsed.
 
The two differ in their approach to XML parsing and each has advantages 
and disadvantages.  The DOM implementation is difficult to navigate and 
does not allow for a clean mapping between XML elements and Domain 
specific objects.  SAX provides the events to allow the developer to 
create their domain-specific objects at the time the XML file is read and 
parsed.  

This article provides a framework design using the SAX API for
XML parsing.


XML Parsers for C++

The two most commonly used parsers for C++ are Xerces of the 
Apache Project and XML4C created by IBM's AlphaWorks project.
XML4C is based on the open-source Xerces project.  

Both parsers provide essentially the same layout of source and libraries 
and can therefore be used interchangeably.  They also support both 
DOM and SAX based XML parsing.

This document describes an implementation using the SAX parser 
with the Xerces parser.


Installing the Xerces Library

The Xerces source or binaries related to XML parsing can be 
downloaded from the Xerces website [1].


Parsing XML Files using SAX

In order to begin parsing an XML file using the SAX API the layout 
of the SAX C++ object interactions must be understood. 

SAX is designed with the following basic interfaces:


SAXParser						
=========
setDoValidation
setDoNamespace
setDoSchema
setValidationFullSchemaChecking
setDocumentHandler
setErrorHandler
parse


HandlerBase
===========
warning
error
fatalError
startElement
characters
ignorableWhitespace
endElement


Close examination of the methods in the HandlerBase object above will 
reveal two different categories of methods: error handling and 
document processing.  The error handling methods include warning, error 
and fatalError.  Whereas, the parsing methods consist of startElement, 
characters, ignorableWhitespace and endElement.  These behaviors can
be separated into individual objects and is shown later in the article.

The SAXParser class takes care of setting basic properties and the 
desired behavior that is to be enforced at run-time.

The following sample code illustrates the basic steps that need to be 
followed in order to parse an XML file using the SAX parser in C++:


// Create a new instance of the SAX parser
SAXParser parser;

// Initialize the behavior you desire
parser.setDoValidation(true);
parser.setDoNamespaces(true);
parser.setDoSchema(true);
parser.setValidationSchemaFullChecking(true);

// Add handlers for document and error processing
parser.setDocumentHandler(&docHandler);
parser.setErrorHandler(&errorHandler);

// Parse file
parser.parse("MyXMLFile.xml");


At the time the parsing occurs the classes you’ve instantiated, docHandler 
and errorHandler, are forwarded the events that get triggered from 
the parsing.  

Note, these classes are derived from the Xerces base class HandlerBase 
and have overridden the appropriate methods for handling the events 
based on their categorized function.


XML Framework Implentation using the SAX API

Now that we've been exposed to parsing XML using SAX, let's explore 
how our XML framework has been implemented to take advantage of the 
facilities provided within the API.


Policy Classes

A policy class, made popular by Andrei Alexandrescu's book "Modern C++ 
Design" [2], is described as follows:

"A policy defines a class interface or a class template interface.  
The interface consists of one or all of the following: inner type 
definitions, member functions and member variables.

Policies have much in common with traits but differ in that they 
put less emphasis on type and more emphasis on behavior." 

The usefulness of policy classes, in this XML framework, are realized 
when created using a template based C++ design.  A policy allows 
you to parameterize and configure functionality at a fine granularity.  
In this design, policies are created to accomodate the following behavior:

- Document Handling
- Error Handling
- Domain Mapping
- Parsing

Configuring these elements as policies allows the creation of more 
concise code that is easier to maintain by any developer experienced 
in C++ and the use of templates.

The principal class of the XML Parsing framework is that of the 
XMLSAXParser.  It's a custom designed class template that implements 
the XMLParserInterface and includes a SAXParser object as a member 
variable.  The template parameters include policy classes for both the 
document and error handlers.  All parsing is eventually delegated to 
the SAXParser member variable after the various handlers and other 
properties have been set.

Implementing custom handlers, as policy classes, is a relatively trivial 
task using the framework.  The advantage of this type of design is 
that the same framework can be used with different parsing API's and 
different domain-mapping objects by altering one or more of the 
policies - an exercise that is not implemented in this article.

In order to create custom handlers derive newly created custom 
classes from HandlerBase and override the virtual methods of interest.  
The following two types of custom handlers have been created in 
the XMLFactory framework.


XMLSAXHandler
=============
startElement
character
ignorableWhitespace
endElement

XMLSAXErrorHandler
==================
warning
error
fatalError


XMLSAXHandler handles document event processing and XMLSAXErrorHandler
handles the various error callbacks.


Mapping XML Tags to Domain Objects

The next aspect of our XML parsing framework is to convert XML 
tags into Domain related objects that can be used within the 
application.  This is accomplished by using templates and 
a loose definition of policy classes.  

The XMLDomainMap template accepts a single template parameter 
called an XMLNode.  The interface for the domain-mapping 
object is described below.


XMLDomainMap
============
create
add
updateAttribute


The XMLNode acts as both a leaf and a root in a tree structure 
that aggregates its children as a subtree.  The XMLNode's interface 
is described below.


XMLNode
=======
operator==
operator!=
operator=
addChild
hasChildren
numChildren
value
name
getChildCount
getChild
getParent 


The key here is the design of the public interface of the object.  
There are several operator overloads, specifically operator 
equals (operator==), operator not equals (operator!=) and the 
assignment operator (operator=).  

The benefit to this is that the object can now be used with many 
of the standard template library containers and algorithms.  
This allows for the use of advanced features with the C++ language.


Linking our Classes together - An XML Façade

Thus far the focus has been on individual classes and describing 
the templates that have been created for our XML processing framework.  
The next step is to link the disparate interfaces together and 
make them appear to function as a single cohesive unit.  To accomplish 
this step the Facade Design Pattern will be used [3].

The facade design provides a simple and elegant way to delegate 
parsing functionality from an outside client to the internal 
policy class that will be used for performing the parsing.

The XMLProcessor is the facade that has been created.  It is defined 
with the following interface:


 XMLProcessor
 ============
 parse
 getParseEngine


Once all the source has been written an XML file and a test client 
will be needed to run our sample.


Parsing an Actual XML File

The following simple XML file has been created to illustrate 
the use of the framework:


<?xml version="1.0" encoding="iso-8859-1"?>

<customer>
	<name>John Doe</name>
	<account-number>555123</account-number>
</customer>


The above XML file shows a very basic layout of a customer
with a name and an account number.  This example will be used
to demonstrate the simplicity of using the framework.  For now, 
enter file into a text editor and save it as MyXMLFile.xml.


The Public Interface - Writing the Client Application

The framework's functionality will be used as a mechanism to 
provide a succinct interface to the client application.

The primary methods that a client of the framework would make use of
can be described with an actual, albeit small, sample of C++
source code:

// ---------------------------------------
//  Sample source for parsing an XML doc
// ---------------------------------------
#include "XMLProcessor.hpp"
#include "XMLDomainMap.hpp"
#include "XMLSAXParser.hpp"
#include "XMLNode.hpp"
#include "XMLCommand.h"
#include "XMLSAXHandler.hpp"
#include "XMLSAXErrorHandler.hpp"

#include <iostream>
using namespace std;
using namespace XML;

// Let's get the ugly stuff out of the way first
typedef XMLSAXHandler<XMLDomainMap<XMLNode> > DOCHANDLER;
typedef XMLSAXErrorHandler ERRORHANDLER;
typedef XMLSAXParser<DOCHANDLER, ERRORHANDLER> PARSER;
typedef XMLProcessor<PARSER> XMLEngine;

// Create a basic test client
int main(void)
{
	// Define a string object with our file name
	std::string xmlFile = "MyXMLFile.xml";

	// Create an instance of our XMLFactory
	XMLEngine parser(xmlFile);

	// Turn off validation
	parser.doValidation(false);
	
	// Parse our XML file
	parser.parse();

	// Get the tree root
	XMLNode root = parser.getXMLRoot();

	// Print the name of our object
	cout << "Root = " << root.name() << endl;

	return 0;
}
//  end sample source


Now that an instance of an XMLNode object representing the root of
the tree has been parsed, the child elements of the root XMLNode can
be accessed.


Compiling the Test Client

The last step is to compile the client.  Simply perform the 
compile at the command line by entering the following GNU g++ command:

$>g++ -o testClient -I. -I/path/to/xerces/include \
		-I/path/to/xerces/include/xerces \
		testClient.cpp -L/path/to/xerces/lib -lxerces-c

This will compile the client application.  The next step is to
run a test.  Note: set/export your LD_LIBRARY_PATH 
environment variable to point to the location of your Xerces 
installation's lib directory.  

Since the shared libraries from this directory the application 
loader needs a way to import the required symbols at run-time 
in order for everything to function correctly.

When testClient has run from the command line the following output
is expected:


$>testClient
Adding child name
Adding child account-number
Root = customer


You now have a fully functional XML parsing framework using C++
templates that will allow you to incorporate XML into your 
new or existing applications. Enjoy!



References

[1] Open your favorite browser and navigate to 
http://xml.apache.org/xerces-c.  We will install the 
latest Xerces implemenation, which at the time of this 
writing is version 2.1.0.  Download the following file:

http://xml.apache.org/dist/xerces-c/stable/xerces-c-src2_1_0.tar.gz 

for the source code and follow the instructions for compiling and 
installing it provided, on the website, for your particular 
to your distribtution.


[2] Modern C++ Design, Andrei Alexandrescu Addison Wesley © 
2002, page 7-8.


[3] The Façade Design Pattern, Design Patterns (Gamma et al.) page 185.

The intent of the Façade Design Pattern is defined as follows:

"Provide a unified interface to a set of interfaces in a subsystem.  Façade 
defines a higher-level interface that makes the subsystem easier to use."