A Template-based approach to XML Parsing in C++
Introduction
XML is a markup-based data description language designed to allow a
developer to create structured documents using descriptive custom tags.
The intent of XML is to separate the description of the data
from its intended use and allow the transfer of the data between different
applications in a non-platform or architecture-specific way.
Another useful application of XML is to describe a process
in a logical and meaningful way that can be carried out by
the application at run-time.
Parsing XML
Before an XML file can be parsed, the developer must first create a
file that a parser can process. A parser is a library, or a set of
shared objects, that reads and processes an XML file. A parser is one
of two types: validating or non-validating.
A validating parser checks that the XML file is well-formed and also
that it is valid, that is, that it conforms to the structure specified
by either an XML schema or a document type definition (DTD).
A non-validating parser checks only that the file is well-formed and
does not enforce the structure specified by the XML schema or the DTD.
The most widely used parsing APIs come in two flavors: event-driven and
tree-based. The event-driven API is called SAX, which stands for Simple
API for XML. A tree-based parser, by contrast, creates a DOM (Document
Object Model) tree in memory at the time the XML file is read and
parsed.
The two differ in their approach to XML parsing and each has advantages
and disadvantages. A DOM tree can be difficult to navigate and does not
provide a clean mapping between XML elements and domain-specific
objects. SAX delivers events that allow the developer to create
domain-specific objects as the XML file is read and parsed.
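To make the event-driven model concrete, the following minimal sketch
shows a toy scanner that reports start-tag, text and end-tag events to a
handler as it reads its input, without ever building a tree. It is not
Xerces and not a real XML parser (no attributes, entities, comments or
error reporting). The names EventHandler, scan and TraceHandler are
hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical callback interface, loosely modelled on SAX document events.
struct EventHandler {
    virtual ~EventHandler() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
    virtual void endElement(const std::string& name) = 0;
};

// Toy scanner: understands only <tag>, </tag> and plain text.
// It pushes events to the handler as it goes and keeps no tree in memory.
inline void scan(const std::string& input, EventHandler& handler) {
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (input[pos] == '<') {
            const std::size_t close = input.find('>', pos);
            if (close == std::string::npos) return;  // unterminated tag: stop
            const std::string tag = input.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/')
                handler.endElement(tag.substr(1));
            else
                handler.startElement(tag);
            pos = close + 1;
        } else {
            std::size_t next = input.find('<', pos);
            if (next == std::string::npos) next = input.size();
            handler.characters(input.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Example handler that records the sequence of events as text.
struct TraceHandler : EventHandler {
    std::string trace;
    void startElement(const std::string& name) { trace += "start(" + name + ") "; }
    void characters(const std::string& text)   { trace += "text(" + text + ") "; }
    void endElement(const std::string& name)   { trace += "end(" + name + ") "; }
};
```

Passing a TraceHandler to scan and inspecting its trace member shows
the order in which the events arrive. A tree-based parser would instead
hand back a complete tree only once the whole input had been read.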
This article provides a framework design using the SAX API for
XML parsing.
XML Parsers for C++
The two most commonly used parsers for C++ are Xerces, from the
Apache project, and XML4C, created by IBM's alphaWorks project.
XML4C is based on the open-source Xerces project.
Both parsers provide essentially the same layout of source and libraries
and can therefore be used interchangeably. Both also support
DOM and SAX based XML parsing.
This article describes an implementation using the SAX API
of the Xerces parser.
Installing the Xerces Library
The Xerces source or binaries related to XML parsing can be
downloaded from the Xerces website [1].
Parsing XML Files using SAX
Before parsing an XML file with the SAX API, it helps to understand
how the SAX C++ objects interact.
The Xerces SAX implementation is built around the following basic
interfaces:
SAXParser
=========
setDoValidation
setDoNamespaces
setDoSchema
setValidationSchemaFullChecking
setDocumentHandler
setErrorHandler
parse
HandlerBase
===========
warning
error
fatalError
startElement
characters
ignorableWhitespace
endElement
Close examination of the HandlerBase methods above reveals two
categories: error handling and document processing. The error handling
methods are warning, error and fatalError; the document processing
methods are startElement, characters, ignorableWhitespace and
endElement. These two behaviors can be separated into individual
objects, as shown later in the article.
The SAXParser class takes care of setting basic properties and the
behavior that is to be enforced at run-time.
The following sample code illustrates the basic steps needed to parse
an XML file using the SAX parser in C++. It assumes that the Xerces
library has already been initialized with XMLPlatformUtils::Initialize()
and that docHandler and errorHandler are handler objects created
beforehand:
// Create a new instance of the SAX parser
SAXParser parser;
// Initialize the behavior you desire
parser.setDoValidation(true);
parser.setDoNamespaces(true);
parser.setDoSchema(true);
parser.setValidationSchemaFullChecking(true);
// Add handlers for document and error processing
parser.setDocumentHandler(&docHandler);
parser.setErrorHandler(&errorHandler);
// Parse file
parser.parse("MyXMLFile.xml");
While the file is being parsed, the events it triggers are forwarded
to the two objects you have registered, docHandler and errorHandler.
Note that the classes of these objects are derived from the Xerces base
class HandlerBase, and each overrides only the methods for its own
category of events.
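The split between the two handler categories can be sketched as
follows. To keep the example compilable without Xerces, the HandlerBase
below is a stand-in with simplified signatures: the real Xerces methods
take XMLCh strings, an AttributeList and SAXParseException objects
rather than std::string. The derived class names DocHandler and
ErrorCollector are hypothetical.

```cpp
#include <string>
#include <vector>

// Stand-in for the Xerces HandlerBase so this sketch compiles on its own.
// Every callback has a do-nothing default, which is what lets a derived
// class override only the events it cares about.
// Signatures are simplified for illustration.
class HandlerBase {
public:
    virtual ~HandlerBase() {}
    // Document events
    virtual void startElement(const std::string&) {}
    virtual void characters(const std::string&) {}
    virtual void endElement(const std::string&) {}
    // Error events
    virtual void warning(const std::string&) {}
    virtual void error(const std::string&) {}
    virtual void fatalError(const std::string&) {}
};

// Overrides only a document event; error events keep the empty default.
class DocHandler : public HandlerBase {
public:
    std::vector<std::string> elements;  // element names in document order
    void startElement(const std::string& name) { elements.push_back(name); }
};

// Overrides only error events; document events keep the empty default.
class ErrorCollector : public HandlerBase {
public:
    int errorCount;
    ErrorCollector() : errorCount(0) {}
    void error(const std::string&)      { ++errorCount; }
    void fatalError(const std::string&) { ++errorCount; }
};
```

Because both classes share the same base, either one can be handed to
code that expects a HandlerBase, and events outside its category are
silently ignored.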
XML Framework Implementation using the SAX API
Now that we've been exposed to parsing XML using SAX, let's explore
how our XML framework has been implemented to take advantage of the
facilities provided within the API.
Policy Classes
A policy class, made popular by Andrei Alexandrescu's book "Modern C++
Design" [2], is described as follows:
"A policy defines a class interface or a class template interface.
The interface consists of one or all of the following: inner type
definitions, member functions and member variables.
Policies have much in common with traits but differ in that they
put less emphasis on type and more emphasis on behavior."
The usefulness of policy classes in this XML framework is realized
through a template-based C++ design. A policy allows you to
parameterize and configure functionality at a fine granularity.
In this design, policies are created to accommodate the following
behavior:
- Document Handling
- Error Handling
- Domain Mapping
- Parsing
Configuring these elements as policies allows the creation of more
concise code that is easier to maintain by any developer experienced
in C++ and the use of templates.
The principal class of the XML parsing framework is XMLSAXParser.
It is a custom-designed class template that implements
the XMLParserInterface and includes a SAXParser object as a member
variable. The template parameters include policy classes for both the
document and error handlers. All parsing is eventually delegated to
the SAXParser member variable after the various handlers and other
properties have been set.
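A class template of this shape might look like the sketch below. This
is an illustration under stated assumptions, not the framework's actual
source: FakeEngine stands in for the Xerces SAXParser member so that the
example compiles on its own, and every member function signature,
including the assumed shape of XMLParserInterface, is hypothetical.

```cpp
#include <string>

// Stand-in for the Xerces SAXParser member; it only records what it is told.
class FakeEngine {
public:
    FakeEngine() : validate_(true) {}
    void setDoValidation(bool on) { validate_ = on; }
    bool validating() const { return validate_; }
    void parse(const std::string& file) { lastFile_ = file; }
    const std::string& lastFile() const { return lastFile_; }
private:
    bool validate_;
    std::string lastFile_;
};

// Abstract interface implemented by the framework's parsers (assumed shape).
class XMLParserInterface {
public:
    virtual ~XMLParserInterface() {}
    virtual void parse(const std::string& file) = 0;
};

// The handler types are policies: they are chosen at compile time through
// the template parameters instead of being hard-coded into the parser.
template <class DocHandlerPolicy, class ErrorHandlerPolicy>
class XMLSAXParser : public XMLParserInterface {
public:
    void doValidation(bool on) { engine_.setDoValidation(on); }
    void parse(const std::string& file) {
        // A real implementation would register docHandler_ and
        // errorHandler_ with the engine here before delegating.
        engine_.parse(file);
    }
    DocHandlerPolicy& documentHandler() { return docHandler_; }
    ErrorHandlerPolicy& errorHandler() { return errorHandler_; }
    const FakeEngine& engine() const { return engine_; }
private:
    FakeEngine engine_;
    DocHandlerPolicy docHandler_;
    ErrorHandlerPolicy errorHandler_;
};

// Two trivial policies used to instantiate the template.
struct CountingDocHandler {
    int elements;
    CountingDocHandler() : elements(0) {}
};
struct SilentErrorHandler {};
```

Swapping in a different handler then means changing a template
argument, for example XMLSAXParser<CountingDocHandler,
SilentErrorHandler>, rather than editing the parser class itself.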
Implementing custom handlers as policy classes is a relatively simple
task using the framework. The advantage of this type of design is
that the same framework can be used with different parsing APIs and
different domain-mapping objects by altering one or more of the
policies - an exercise that is not implemented in this article.
To create a custom handler, derive a new class from HandlerBase and
override the virtual methods of interest.
The following two types of custom handlers have been created in
the XMLFactory framework.
XMLSAXHandler
=============
startElement
characters
ignorableWhitespace
endElement
XMLSAXErrorHandler
==================
warning
error
fatalError
XMLSAXHandler handles document event processing and XMLSAXErrorHandler
handles the various error callbacks.
Mapping XML Tags to Domain Objects
The next aspect of our XML parsing framework is converting XML
tags into domain-related objects that can be used within the
application. This is accomplished by using templates and
a loose definition of policy classes.
The XMLDomainMap template accepts a single template parameter,
the node type, which in this framework is XMLNode. The interface for
the domain-mapping object is described below.
XMLDomainMap
============
create
add
updateAttribute
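The article lists only the method names of the domain map, so the
sketch below is one plausible reading rather than the framework's
implementation. It assumes that create is called on a start-element
event, updateAttribute once per attribute, and add on an end-element
event. The names DomainMap and SimpleNode, and all signatures, are
hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Minimal node type used only to instantiate the sketch.
struct SimpleNode {
    std::string name;
    std::map<std::string, std::string> attributes;
    std::vector<SimpleNode> children;
    explicit SimpleNode(const std::string& n = std::string()) : name(n) {}
    void addChild(const SimpleNode& c) { children.push_back(c); }
};

// Domain map parameterized on the node type. The stack of open elements
// mirrors the nesting of start/end events from an event-driven parser.
template <class Node>
class DomainMap {
public:
    // Start-element event: begin a new node.
    void create(const std::string& name) { open_.push_back(Node(name)); }

    // One call per attribute of the element currently being built.
    void updateAttribute(const std::string& key, const std::string& value) {
        if (!open_.empty()) open_.back().attributes[key] = value;
    }

    // End-element event: attach the finished node to its parent,
    // or keep it as the root if it has no parent.
    void add() {
        if (open_.empty()) return;
        Node done = open_.back();
        open_.pop_back();
        if (open_.empty()) root_ = done;
        else open_.back().addChild(done);
    }

    const Node& root() const { return root_; }
private:
    std::vector<Node> open_;  // elements started but not yet ended
    Node root_;
};
```

Because the node type is a template parameter, the same mapping logic
can build any node class that offers a constructor taking a name, an
attributes map and an addChild method.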
The XMLNode acts as both a leaf and a root in a tree structure
that aggregates its children as a subtree. The XMLNode's interface
is described below.
XMLNode
=======
operator==
operator!=
operator=
addChild
hasChildren
numChildren
value
name
getChildCount
getChild
getParent
The key here is the design of the object's public interface.
It overloads several operators: equality (operator==), inequality
(operator!=) and assignment (operator=).
The benefit is that XMLNode objects can be used with many of the
standard template library containers and algorithms: containers such
as std::vector copy and assign their elements, and algorithms such as
std::find compare elements with operator==.
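The following sketch shows why those operators matter. Node is a
reduced, hypothetical stand-in for XMLNode, and the choice to compare
nodes by name and value is an assumption, since the article does not
say which fields take part in the comparison. The contains helper is
also hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Reduced stand-in for the article's XMLNode: name, value and children.
class Node {
public:
    Node(const std::string& name = std::string(),
         const std::string& value = std::string())
        : name_(name), value_(value) {}

    // Two nodes compare equal when name and value match (an assumption).
    bool operator==(const Node& rhs) const {
        return name_ == rhs.name_ && value_ == rhs.value_;
    }
    bool operator!=(const Node& rhs) const { return !(*this == rhs); }

    void addChild(const Node& child) { children_.push_back(child); }
    bool hasChildren() const { return !children_.empty(); }
    std::size_t numChildren() const { return children_.size(); }
    const Node& getChild(std::size_t i) const { return children_[i]; }
    const std::string& name() const { return name_; }
    const std::string& value() const { return value_; }

    // std::find relies on operator==; storing Node objects in the vector
    // relies on copy construction and assignment.
    bool contains(const Node& target) const {
        return std::find(children_.begin(), children_.end(), target)
               != children_.end();
    }
private:
    std::string name_;
    std::string value_;
    std::vector<Node> children_;
};
```

Here the copy constructor and assignment operator are the
compiler-generated ones, which is sufficient for a class whose members
are themselves copyable.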
Linking our Classes together - An XML Façade
Thus far the focus has been on individual classes and describing
the templates that have been created for our XML processing framework.
The next step is to link the disparate interfaces together and
make them appear to function as a single cohesive unit. To accomplish
this step the Façade Design Pattern will be used [3].
The façade provides a simple and elegant way to delegate
parsing functionality from an outside client to the internal
policy class that performs the parsing.
The XMLProcessor is the façade that has been created. It is defined
with the following interface:
XMLProcessor
============
parse
getParseEngine
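A façade with these two methods could be sketched as shown below. The
class is named Processor rather than XMLProcessor to make clear that it
is an illustration and not the framework's source. RecordingParser is a
hypothetical stand-in for the parsing policy, and the constructor
signature is an assumption based on the client code in this article.

```cpp
#include <string>

// Stand-in parsing policy; in the framework this role is played by a
// configured parser class. It only records the calls it receives.
class RecordingParser {
public:
    RecordingParser() : parsed_(false) {}
    void parse(const std::string& file) { parsed_ = true; file_ = file; }
    bool parsed() const { return parsed_; }
    const std::string& file() const { return file_; }
private:
    bool parsed_;
    std::string file_;
};

// Facade: the client sees parse() and getParseEngine() and nothing else.
template <class ParserPolicy>
class Processor {
public:
    explicit Processor(const std::string& file) : file_(file) {}

    // Delegates the actual work to the parsing policy.
    void parse() { engine_.parse(file_); }

    // Escape hatch for clients that need policy-specific settings.
    ParserPolicy& getParseEngine() { return engine_; }
private:
    std::string file_;
    ParserPolicy engine_;
};
```

The client depends only on the façade's small interface, so the
parsing policy behind it can change without affecting client code.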
Once all the source has been written, an XML file and a test client
are needed to run our sample.
Parsing an Actual XML File
The following simple XML file has been created to illustrate
the use of the framework:
<customer>
  <name>John Doe</name>
  <account-number>555123</account-number>
</customer>
The above XML file shows a very basic layout of a customer
with a name and an account number. This example will be used
to demonstrate the simplicity of using the framework. For now,
enter the file into a text editor and save it as MyXMLFile.xml.
The Public Interface - Writing the Client Application
The framework's functionality will be used as a mechanism to
provide a succinct interface to the client application.
The primary methods that a client of the framework would make use of
can be described with an actual, albeit small, sample of C++
source code:
// ---------------------------------------
// Sample source for parsing an XML doc
// ---------------------------------------
#include "XMLProcessor.hpp"
#include "XMLDomainMap.hpp"
#include "XMLSAXParser.hpp"
#include "XMLNode.hpp"
#include "XMLCommand.h"
#include "XMLSAXHandler.hpp"
#include "XMLSAXErrorHandler.hpp"
#include <iostream>
#include <string>
using namespace std;
using namespace XML;
// Let's get the ugly stuff out of the way first
typedef XMLSAXHandler<XMLDomainMap<XMLNode> > DOCHANDLER;
typedef XMLSAXErrorHandler ERRORHANDLER;
typedef XMLSAXParser<DOCHANDLER, ERRORHANDLER> PARSER;
typedef XMLProcessor<PARSER> XMLEngine;
// Create a basic test client
int main(void)
{
    // Define a string object with our file name
    std::string xmlFile = "MyXMLFile.xml";
    // Create an instance of our XMLFactory
    XMLEngine parser(xmlFile);
    // Turn off validation
    parser.doValidation(false);
    // Parse our XML file
    parser.parse();
    // Get the tree root
    XMLNode root = parser.getXMLRoot();
    // Print the name of our object
    cout << "Root = " << root.name() << endl;
    return 0;
}
// end sample source
Now that the file has been parsed and an XMLNode object representing
the root of the tree has been obtained, the child elements of the root
XMLNode can be accessed.
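Accessing the children amounts to a recursive walk using the child
count and child accessor from the XMLNode interface. The sketch below
uses TreeNode, a reduced and hypothetical stand-in for XMLNode, and
builds the customer tree by hand instead of parsing the file, so that
it runs without the framework. The dump and dumpSampleTree functions
are likewise illustrative names.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Reduced stand-in for XMLNode exposing only what the traversal needs.
class TreeNode {
public:
    TreeNode(const std::string& name, const std::string& value = std::string())
        : name_(name), value_(value) {}
    void addChild(const TreeNode& c) { children_.push_back(c); }
    std::size_t getChildCount() const { return children_.size(); }
    const TreeNode& getChild(std::size_t i) const { return children_[i]; }
    const std::string& name() const { return name_; }
    const std::string& value() const { return value_; }
private:
    std::string name_;
    std::string value_;
    std::vector<TreeNode> children_;
};

// Depth-first walk that writes one indented line per node.
inline void dump(const TreeNode& node, std::ostream& out, std::size_t depth = 0) {
    out << std::string(depth * 2, ' ') << node.name();
    if (!node.value().empty()) out << " = " << node.value();
    out << '\n';
    for (std::size_t i = 0; i < node.getChildCount(); ++i)
        dump(node.getChild(i), out, depth + 1);
}

// Builds the customer tree from the sample file by hand and dumps it.
inline std::string dumpSampleTree() {
    TreeNode root("customer");
    root.addChild(TreeNode("name", "John Doe"));
    root.addChild(TreeNode("account-number", "555123"));
    std::ostringstream out;
    dump(root, out);
    return out.str();
}
```

With the real framework, the root returned by the parser would be
passed to a function like dump in place of the hand-built tree.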
Compiling the Test Client
The last step is to compile the client. Simply perform the
compile at the command line by entering the following GNU g++ command:
$>g++ -o testClient -I. -I/path/to/xerces/include \
-I/path/to/xerces/include/xerces \
testClient.cpp -L/path/to/xerces/lib -lxerces-c
This will compile the client application. The next step is to
run a test. Note: before running it, set and export your
LD_LIBRARY_PATH environment variable so that it points to the lib
directory of your Xerces installation.
The application is linked against the shared libraries in this
directory, so the loader must be able to find them in order to
resolve the required symbols at run-time.
When testClient is run from the command line the following output
is expected:
$>testClient
Adding child name
Adding child account-number
Root = customer
You now have a fully functional XML parsing framework using C++
templates that will allow you to incorporate XML into your
new or existing applications. Enjoy!
References
[1] Open your favorite browser and navigate to
http://xml.apache.org/xerces-c. We will install the
latest Xerces implementation, which at the time of this
writing is version 2.1.0. Download the following file:
http://xml.apache.org/dist/xerces-c/stable/xerces-c-src2_1_0.tar.gz
for the source code and follow the instructions on the website for
compiling and installing it on your particular distribution.
[2] Modern C++ Design, Andrei Alexandrescu, Addison Wesley, © 2002,
pages 7-8.
[3] The Façade Design Pattern, Design Patterns (Gamma et al.) page 185.
The intent of the Façade Design Pattern is defined as follows:
"Provide a unified interface to a set of interfaces in a subsystem. Façade
defines a higher-level interface that makes the subsystem easier to use."