A Template-based approach to XML Parsing in C++ Introduction XML is a markup based data description language designed to allow a developer to create structured documents using descriptive custom tags. The intent of XML is to separate the description of the data from its intended use and allow the transfer of the data between different applications in a non-platform or architecture-specific way. Another useful application of XML is to describe a process in a logical and meaningful way that can be carried out by the application at run-time. Parsing XML In order for an XML file to be parsed successfully the developer must first create a file that can be processed by a parser. A parser is a set of shared objects, or a library that reads and processes an XML file. The parser may be one of two types: validating or non-validating. A validating parser scans the XML file and determines if the document is well-formed, as specified, by either an XML schema or the document type defintion (DTD). A non-validating parser simply reads the file and ignores the format and layout as specified by either the XML schema or the DTD. The most widely used parsers come in two flavors: event driven and tree based. The event driven parser is called SAX, which stands for Simple API for XML processing. Whereas a tree based parser creates a DOM (Document Object Model) tree in memory at the time the XML file is read and parsed. The two differ in their approach to XML parsing and each has advantages and disadvantages. The DOM implementation is difficult to navigate and does not allow for a clean mapping between XML elements and Domain specific objects. SAX provides the events to allow the developer to create their domain-specific objects at the time the XML file is read and parsed. This article provides a framework design using the SAX API for XML parsing. XML Parsers for C++ The two most commonly used parsers for C++ are Xerces of the Apache Project and XML4C created by IBM's AlphaWorks project. XML4C is based on the open-source Xerces project. Both parsers provide essentially the same layout of source and libraries and can therefore be used interchangeably. They also support both DOM and SAX based XML parsing. This document describes an implementation using the SAX parser with the Xerces parser. Installing the Xerces Library The Xerces source or binaries related to XML parsing can be downloaded from the Xerces website [1]. Parsing XML Files using SAX In order to begin parsing an XML file using the SAX API the layout of the SAX C++ object interactions must be understood. SAX is designed with the following basic interfaces: SAXParser ========= setDoValidation setDoNamespace setDoSchema setValidationFullSchemaChecking setDocumentHandler setErrorHandler parse HandlerBase =========== warning error fatalError startElement characters ignorableWhitespace endElement Close examination of the methods in the HandlerBase object above will reveal two different categories of methods: error handling and document processing. The error handling methods include warning, error and fatalError. Whereas, the parsing methods consist of startElement, characters, ignorableWhitespace and endElement. These behaviors can be separated into individual objects and is shown later in the article. The SAXParser class takes care of setting basic properties and the desired behavior that is to be enforced at run-time. The following sample code illustrates the basic steps that need to be followed in order to parse an XML file using the SAX parser in C++: // Create a new instance of the SAX parser SAXParser parser; // Initialize the behavior you desire parser.setDoValidation(true); parser.setDoNamespaces(true); parser.setDoSchema(true); parser.setValidationSchemaFullChecking(true); // Add handlers for document and error processing parser.setDocumentHandler(&docHandler); parser.setErrorHandler(&errorHandler); // Parse file parser.parse("MyXMLFile.xml"); At the time the parsing occurs the classes you’ve instantiated, docHandler and errorHandler, are forwarded the events that get triggered from the parsing. Note, these classes are derived from the Xerces base class HandlerBase and have overridden the appropriate methods for handling the events based on their categorized function. XML Framework Implentation using the SAX API Now that we've been exposed to parsing XML using SAX, let's explore how our XML framework has been implemented to take advantage of the facilities provided within the API. Policy Classes A policy class, made popular by Andrei Alexandrescu's book "Modern C++ Design" [2], is described as follows: "A policy defines a class interface or a class template interface. The interface consists of one or all of the following: inner type definitions, member functions and member variables. Policies have much in common with traits but differ in that they put less emphasis on type and more emphasis on behavior." The usefulness of policy classes, in this XML framework, are realized when created using a template based C++ design. A policy allows you to parameterize and configure functionality at a fine granularity. In this design, policies are created to accomodate the following behavior: - Document Handling - Error Handling - Domain Mapping - Parsing Configuring these elements as policies allows the creation of more concise code that is easier to maintain by any developer experienced in C++ and the use of templates. The principal class of the XML Parsing framework is that of the XMLSAXParser. It's a custom designed class template that implements the XMLParserInterface and includes a SAXParser object as a member variable. The template parameters include policy classes for both the document and error handlers. All parsing is eventually delegated to the SAXParser member variable after the various handlers and other properties have been set. Implementing custom handlers, as policy classes, is a relatively trivial task using the framework. The advantage of this type of design is that the same framework can be used with different parsing API's and different domain-mapping objects by altering one or more of the policies - an exercise that is not implemented in this article. In order to create custom handlers derive newly created custom classes from HandlerBase and override the virtual methods of interest. The following two types of custom handlers have been created in the XMLFactory framework. XMLSAXHandler ============= startElement character ignorableWhitespace endElement XMLSAXErrorHandler ================== warning error fatalError XMLSAXHandler handles document event processing and XMLSAXErrorHandler handles the various error callbacks. Mapping XML Tags to Domain Objects The next aspect of our XML parsing framework is to convert XML tags into Domain related objects that can be used within the application. This is accomplished by using templates and a loose definition of policy classes. The XMLDomainMap template accepts a single template parameter called an XMLNode. The interface for the domain-mapping object is described below. XMLDomainMap ============ create add updateAttribute The XMLNode acts as both a leaf and a root in a tree structure that aggregates its children as a subtree. The XMLNode's interface is described below. XMLNode ======= operator== operator!= operator= addChild hasChildren numChildren value name getChildCount getChild getParent The key here is the design of the public interface of the object. There are several operator overloads, specifically operator equals (operator==), operator not equals (operator!=) and the assignment operator (operator=). The benefit to this is that the object can now be used with many of the standard template library containers and algorithms. This allows for the use of advanced features with the C++ language. Linking our Classes together - An XML Façade Thus far the focus has been on individual classes and describing the templates that have been created for our XML processing framework. The next step is to link the disparate interfaces together and make them appear to function as a single cohesive unit. To accomplish this step the Facade Design Pattern will be used [3]. The facade design provides a simple and elegant way to delegate parsing functionality from an outside client to the internal policy class that will be used for performing the parsing. The XMLProcessor is the facade that has been created. It is defined with the following interface: XMLProcessor ============ parse getParseEngine Once all the source has been written an XML file and a test client will be needed to run our sample. Parsing an Actual XML File The following simple XML file has been created to illustrate the use of the framework: John Doe 555123 The above XML file shows a very basic layout of a customer with a name and an account number. This example will be used to demonstrate the simplicity of using the framework. For now, enter file into a text editor and save it as MyXMLFile.xml. The Public Interface - Writing the Client Application The framework's functionality will be used as a mechanism to provide a succinct interface to the client application. The primary methods that a client of the framework would make use of can be described with an actual, albeit small, sample of C++ source code: // --------------------------------------- // Sample source for parsing an XML doc // --------------------------------------- #include "XMLProcessor.hpp" #include "XMLDomainMap.hpp" #include "XMLSAXParser.hpp" #include "XMLNode.hpp" #include "XMLCommand.h" #include "XMLSAXHandler.hpp" #include "XMLSAXErrorHandler.hpp" #include using namespace std; using namespace XML; // Let's get the ugly stuff out of the way first typedef XMLSAXHandler > DOCHANDLER; typedef XMLSAXErrorHandler ERRORHANDLER; typedef XMLSAXParser PARSER; typedef XMLProcessor XMLEngine; // Create a basic test client int main(void) { // Define a string object with our file name std::string xmlFile = "MyXMLFile.xml"; // Create an instance of our XMLFactory XMLEngine parser(xmlFile); // Turn off validation parser.doValidation(false); // Parse our XML file parser.parse(); // Get the tree root XMLNode root = parser.getXMLRoot(); // Print the name of our object cout << "Root = " << root.name() << endl; return 0; } // end sample source Now that an instance of an XMLNode object representing the root of the tree has been parsed, the child elements of the root XMLNode can be accessed. Compiling the Test Client The last step is to compile the client. Simply perform the compile at the command line by entering the following GNU g++ command: $>g++ -o testClient -I. -I/path/to/xerces/include \ -I/path/to/xerces/include/xerces \ testClient.cpp -L/path/to/xerces/lib -lxerces-c This will compile the client application. The next step is to run a test. Note: set/export your LD_LIBRARY_PATH environment variable to point to the location of your Xerces installation's lib directory. Since the shared libraries from this directory the application loader needs a way to import the required symbols at run-time in order for everything to function correctly. When testClient has run from the command line the following output is expected: $>testClient Adding child name Adding child account-number Root = customer You now have a fully functional XML parsing framework using C++ templates that will allow you to incorporate XML into your new or existing applications. Enjoy! References [1] Open your favorite browser and navigate to http://xml.apache.org/xerces-c. We will install the latest Xerces implemenation, which at the time of this writing is version 2.1.0. Download the following file: http://xml.apache.org/dist/xerces-c/stable/xerces-c-src2_1_0.tar.gz for the source code and follow the instructions for compiling and installing it provided, on the website, for your particular to your distribtution. [2] Modern C++ Design, Andrei Alexandrescu Addison Wesley © 2002, page 7-8. [3] The Façade Design Pattern, Design Patterns (Gamma et al.) page 185. The intent of the Façade Design Pattern is defined as follows: "Provide a unified interface to a set of interfaces in a subsystem. Façade defines a higher-level interface that makes the subsystem easier to use."