XML in C++ with CodeSynthesis XSD

Synthesis


Processing XML in your C++ programs without the drone of DOM.

By Ben Martin

Software developers often use XML to pass data into and out of programs. Several APIs provide a means for processing XML data. Popular options, such as the Simple API for XML (SAX) and the Document Object Model (DOM), offer standard systems for accessing XML data from within a program. Another interesting option for working with XML data within C++ is CodeSynthesis XSD, an open source data binding compiler (Figure 1). Given an XML schema, CodeSynthesis generates C++ classes that allow you to access the data stored in the XML document "... using types and functions that automatically correspond to your application domain" [1].

Figure 1: CodeSynthesis XSD is maintained and distributed by the South African company CodeSynthesis Tools CC.

Figure 2: CodeSynthesis XSD, Xalan-C++, and XQilla all either rely on or can use the Xerces-C++ XML library, which is available through the Apache website.

CodeSynthesis XSD simplifies the business of XML parsing and eliminates the need for clumsy DOM instances.

CodeSynthesis XSD

CodeSynthesis XSD takes the hassle out of handling XML: No more parsing, no more DOM interaction [2]; instead, you can just work with C++ objects in a normal way. CodeSynthesis XSD can also serialize these C++ objects back into XML again, removing the hassles of dealing with reading and writing XML and letting you concentrate on the application functionality instead of the code details.

CodeSynthesis XSD has two modes of operation: You can use it with an in-memory tree (DOM like) or in parser mode (SAX like). This article will use the tree mode. Although the examples on the CodeSynthesis XSD website all use xsd as the main CodeSynthesis XSD command, on Fedora 11, I found that I had to use xsdcxx to get the CodeSynthesis XSD command. As a running example, I'll use the customers.xml file shown in Listing 1. I'll make minor changes along the way as new features are explored. Any similarity the data in customers.xml might have to the real or animated world is purely coincidental.

Listing 1: customers.xml
<?xml version="1.0"?>
<customers>

  <customer id="1">
    <first-name>Bart</first-name>
    <sir-name>Simpson</sir-name>
    <gender>male</gender>
  </customer>

  <customer id="2">
    <first-name>Charles</first-name>
    <middle-name>Montgomery</middle-name>
    <sir-name>Burns</sir-name>
    <gender>male</gender>
    <dob>1903-11-20T06:30:13</dob>
  </customer>

</customers>

As the name implies, CodeSynthesis XSD is most interested in the .xsd files that provide the XML schema. To create a C++ binding to parse an XML file with CodeSynthesis XSD, you need to have an XML schema file. The customers.xml file complies with the customers.xsd schema file shown in Listing 2. Reading the schema from bottom to top, you see the customers element, which has the type customers_t. The customers_t type contains a list of customer elements, each of type customer_t. Each customer_t has a few names, a gender, and a date of birth (dob). Note that, because the first, middle, and surname (sir-name) entries are just strings, each is defined in a single line in the schema. The gender is restricted to a few choices, so it has to have a simpleType of its own in the schema file. The date of birth and middle name are optional, so they have minOccurs="0" in their schema definitions.

Listing 2: customers.xsd Schema
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="gender_t">
    <xs:restriction base="xs:string">
      <xs:enumeration value="male"/>
      <xs:enumeration value="female"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="customer_t">
    <xs:sequence>
      <xs:element name="first-name" type="xs:string"/>
      <xs:element name="middle-name" type="xs:string" minOccurs="0"/>
      <xs:element name="sir-name" type="xs:string"/>
      <xs:element name="gender" type="gender_t"/>
      <xs:element name="dob" type="xs:dateTime" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:unsignedInt" use="required"/>
  </xs:complexType>

  <xs:complexType name="customers_t">
    <xs:sequence>
      <xs:element name="customer" type="customer_t" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="customers" type="customers_t"/>

</xs:schema>

As you can see, even if you are unfamiliar with XSD schema files, there is nothing particularly difficult about creating a schema for an XML document. You just describe the required elements, their children, and their attributes. Each thing you describe in the schema has a type associated with it; for example, dob is an xs:dateTime, so you can record the exact time of birth from a certificate if the dob is available.
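
To make the gender_t restriction more concrete, the following is a minimal hand-written sketch of what an enumeration restriction means for a program: only the listed strings are accepted. The names gender and parse_gender are hypothetical and chosen for illustration; this is not the code that CodeSynthesis XSD generates.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical C++ counterpart of the gender_t simpleType in customers.xsd.
enum gender { male, female };

// Accept only the values enumerated in the schema and reject anything else,
// much as a validating parser would reject an unexpected <gender> value.
gender parse_gender (const std::string& text)
{
  if (text == "male")   return male;
  if (text == "female") return female;
  throw std::invalid_argument ("not a valid gender_t value: " + text);
}
```

Calling parse_gender("male") yields male, whereas a value that is not enumerated in the schema raises std::invalid_argument.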

Now that I have the XML schema file, the C++ client code shown in Listing 3 becomes straightforward. I have intentionally left a little hiccup in the code: The XML file does not reference the XSD schema file that it follows. Many XML systems have been built without regard to schema files, and as such, they do not reference any schema file. To handle such XML files with CodeSynthesis XSD, you have to tell the system what XSD file the XML file follows.

Listing 3: C++ Client to Parse customers.xml
#include <iostream>
#include <memory>
#include "customers.hxx"

using namespace std;

int
main (int argc, char* argv[])
{
  // The XML file to parse is expected as the first argument
  if( argc < 2 )
  {
    cerr << "usage: " << argv[0] << " customers.xml" << endl;
    return 1;
  }

  try
  {
    xml_schema::properties props;
    props.no_namespace_schema_location ("customers.xsd");
    auto_ptr<customers_t> all_customers( customers( argv[1], 0, props) );

    customers_t::customer_sequence& l = all_customers->customer();
    for( customers_t::customer_sequence::iterator ci = l.begin(); ci != l.end(); ++ci )
    {
      cout << "first name:" << ci->first_name()
        << " sirname:" << ci->sir_name();
      // Optional elements must be tested with present() before use
      if( ci->middle_name().present() )
      {
        cout << " middle:" << ci->middle_name();
      }
      if( ci->dob().present() )
      {
        ::xml_schema::date_time dt = ci->dob().get();
        cout << " dob:" << dt.year() << "/" << dt.month() << "/" << dt.day();
      }
      cout << endl;
    }
  }
  catch (const xml_schema::exception& e)
  {
    cerr << e << endl;
    return 1;
  }
  catch (const std::exception& e)
  {
    cerr << e.what() << endl;
    return 1;
  }
}

The first two lines of the try block set up the props variable with this information, and the third line parses the customers.xml file using the customers.xsd schema. Notice that the all_customers C++ object lets you get at each customer through an STL-like interface. You can iterate from begin() to end() and get at the names using member functions instead of performing tedious and error-prone DOM interaction. Optional fields like middle-name and dob are handled in the code with the present() method, which reports whether they are supplied for the current element.
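
The present()/get() pattern for optional elements can be illustrated with a small stand-alone sketch. The optional_field template, the customer_sketch struct, and the display_name function below are hypothetical stand-ins written for this example; they only mimic the idea and are not the classes that CodeSynthesis XSD generates.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an optional-element container. It only mimics
// the present()/get() idea and is not the generated class.
template <typename T>
class optional_field
{
public:
  optional_field () : present_ (false), value_ () {}

  // True when a value has been supplied.
  bool present () const { return present_; }

  // Store a value and mark the field as supplied.
  void set (const T& v) { value_ = v; present_ = true; }

  // Access the value; callers are expected to test present() first.
  const T& get () const
  {
    if (!present_)
      throw std::logic_error ("optional field is not present");
    return value_;
  }

private:
  bool present_;
  T value_;
};

// Hand-written approximation of one customer record.
struct customer_sketch
{
  std::string first_name;
  optional_field<std::string> middle_name;
  std::string sir_name;
};

// Build a display name, adding the middle name only when it was supplied.
std::string display_name (const customer_sketch& c)
{
  std::string s = c.first_name;
  if (c.middle_name.present ())
    s += " " + c.middle_name.get ();
  return s + " " + c.sir_name;
}
```

The design point is that a missing element is an explicit state rather than an empty string, so code cannot silently confuse "no middle name" with "middle name is empty."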

The code in Listing 3 relies on C++ classes from customers.hxx, which are generated by CodeSynthesis XSD using customers.xsd. The Makefile shown in the following snippet will build an executable from this C++ code and the customers.xsd file. CodeSynthesis XSD itself is a build-time-only requirement; apart from the Xerces-C++ library that the binary links against, no additional shared library is necessary.

main: main.cxx customers.cxx
	g++ main.cxx customers.cxx -lxerces-c -o main

customers.cxx: customers.xsd Makefile
	xsdcxx cxx-tree customers.xsd

Running the generated main binary gives the following results:

$ ./main customers.xml
first name:Bart sirname:Simpson
first name:Charles sirname:Burns middle:Montgomery dob:1903/11/20

From Xerces-C++ DOM Node to C++ Object

Listing 4: New customers.xsd
$ cat customers.xsd
...
      <xs:element name="gender" type="gender_t"/>
      <xs:element name="dob" type="xs:dateTime" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID" use="required"/>
  </xs:complexType>

$ cat customers.xml
<?xml version="1.0"?>
<customers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="customers.xsd" >
...
  <customer id="c2">
    <first-name>Charles</first-name>
    <middle-name>Montgomery</middle-name>
...

You might have an XML file that includes ID attributes for elements, or you might want to find elements using the XML Path Language (XPath) [3]. Once you have found a DOMNode with XPath, you can convert it into a C++ object with CodeSynthesis XSD. If I promote the id attribute to be a proper XML ID attribute [4], I can then directly find the customer using DOMDocument::getElementById() [5].

The XML schema file customers.xsd has to change to indicate that the id attribute is of type xs:ID, instead of just an integer. The customers.xml file has to change to have each ID start with a letter. Notice that the customers.xml file now also includes a link to its XML schema file.
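
The reason each ID has to start with a letter is that an xs:ID value must be a valid XML name, and XML names cannot begin with a digit. The following is a rough sketch of that rule; the function name looks_like_xml_id is hypothetical, and the check is deliberately simplified to ASCII, whereas the real naming rules also permit many non-ASCII characters.

```cpp
#include <cctype>
#include <string>

// Simplified, ASCII-only check (an assumption for illustration): the real
// naming rules behind xs:ID also allow many non-ASCII characters.
bool looks_like_xml_id (const std::string& s)
{
  if (s.empty ())
    return false;

  // The first character must be a letter or an underscore, never a digit.
  unsigned char first = static_cast<unsigned char> (s[0]);
  if (!std::isalpha (first) && first != '_')
    return false;

  // Later characters may also be digits, hyphens, and periods.
  for (std::string::size_type i = 1; i < s.size (); ++i)
  {
    unsigned char c = static_cast<unsigned char> (s[i]);
    if (!std::isalnum (c) && c != '_' && c != '-' && c != '.')
      return false;
  }
  return true;
}
```

Under this check the old value "2" is rejected, whereas the new value "c2" is accepted.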

The C++ code is shown in Listing 5. The XML file is loaded slightly differently. To begin with, the Xerces-C++ runtime is initialized explicitly, so the dont_initialize flag tells CodeSynthesis XSD not to initialize it again, and keep_dom tells CodeSynthesis XSD to keep the DOM around after parsing the XML. During the first loop through the customer names, a pointer is taken to Mr. Burns. After showing Mr. Burns to the user, the DOM is obtained with the use of the _node() method on the C++ object that CodeSynthesis XSD created for the XML.

Listing 5: Finding Monty
#include <iostream>
#include <memory>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include "customers.hxx"

using namespace std;
using namespace xercesc;

int
main (int argc, char* argv[])
{
    // The XML file to parse is expected as the first argument
    if( argc < 2 )
    {
        cerr << "usage: " << argv[0] << " customers.xml" << endl;
        return 1;
    }

    try
    {
        // Parse XML
        XMLPlatformUtils::Initialize ();
        xml_schema::properties props;
        props.no_namespace_schema_location ("customers.xsd");
        auto_ptr<customers_t> all_customers(
            customers(
                argv[1],
                xml_schema::flags::keep_dom | xml_schema::flags::dont_initialize,
                props) );

        // Grab monty
        customer_t* monty = 0;
        customers_t::customer_sequence& l = all_customers->customer();
        for( customers_t::customer_sequence::iterator ci = l.begin(); ci != l.end(); ++ci )
        {
            if( ci->sir_name() == "Burns" && ci->first_name() == "Charles" )
            {
                monty = &(*ci);
                break;
            }
        }

        // Did we get him?
        if( !monty )
        {
            cerr << "Can't find poor Monty! exiting..." << endl;
            return 1;
        }
        cout << "first name:" << monty->first_name()
             << " sirname:"   << monty->sir_name()
             << endl;

        // Find him "by id"
        const xercesc::DOMNode* root = all_customers->_node ();
        if( DOMDocument* dom = root->getOwnerDocument() )
        {
            // transcode() allocates a buffer; release it after the lookup
            XMLCh* id = XMLString::transcode("c2");
            DOMElement* e = dom->getElementById( id );
            XMLString::release( &id );

            if( e )
            {
                // The DOM node carries a pointer back to its C++ tree object
                xml_schema::type& t (
                    *reinterpret_cast<xml_schema::type*> (
                        e->getUserData (xml_schema::dom::tree_node_key)));
                customer_t& montyByID = (dynamic_cast<customer_t&> (t));

                cout << "From ID lookup...." << endl;
                cout << "first name:" << montyByID.first_name()
                     << " sirname:"   << montyByID.sir_name()
                     << endl;

                // Change via one reference, view via the other
                montyByID.first_name() = "fred";
                cout << "monty pointer, first name:" << monty->first_name() << endl;
            }
        }
    }
    catch (const xml_schema::exception& e)
    {
        cerr << e << endl;
        return 1;
    }
}

With the use of this DOMNode, the DOMDocument is obtained, and getElementById() is used to pick Mr. Burns out by his XML ID. Of course, once I have the DOMElement* to Mr. Burns, I could use the DOM API to get more information. However, because I am already using CodeSynthesis XSD, it would be more convenient to get the CodeSynthesis XSD-generated C++ object for Monty from the DOMElement, which is what the reinterpret_cast and dynamic_cast pair do.

To show that the two references to Mr. Burns are actually the same object, I change his first name through the reference obtained from the ID lookup (montyByID), then print it through the pointer that was originally obtained to the C++ object (monty).
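
The round trip from a DOM node back to its typed C++ object can be modeled without any XML library. In the sketch below, toy_node, tree_type, customer, and customer_from_node are all hypothetical stand-ins (they are not Xerces-C++ or CodeSynthesis XSD types); the point is only the two-step cast from an untyped pointer stored on a node to the concrete class, and the fact that both paths reach the same object.

```cpp
#include <map>
#include <string>

// Toy stand-ins (assumptions for illustration, not Xerces-C++ or XSD types).
struct tree_type                      // plays the role of the common base type
{
  virtual ~tree_type () {}
};

struct customer : tree_type           // plays the role of a generated class
{
  std::string first_name;
};

struct toy_node                       // plays the role of a DOM node
{
  std::map<std::string, void*> user_data;

  // Return the untyped pointer stored under key, or 0 if there is none.
  void* get_user_data (const std::string& key) const
  {
    std::map<std::string, void*>::const_iterator i = user_data.find (key);
    return i == user_data.end () ? 0 : i->second;
  }
};

// Recover the typed object that a node points back to, or 0 if there is none.
customer* customer_from_node (const toy_node& n)
{
  // Step 1: the pointer was stored as a tree_type*, so cast back to that type.
  tree_type* t = static_cast<tree_type*> (n.get_user_data ("tree_node"));

  // Step 2: a checked downcast to the concrete class.
  return t ? dynamic_cast<customer*> (t) : 0;
}
```

If a customer is attached to a node with node.user_data["tree_node"] = static_cast<tree_type*>(&c), then customer_from_node(node) returns &c, and a change made through that pointer is visible through c itself. The untyped pointer has to be stored as the base type for the first cast to be valid.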

Instead of just picking elements by their ID, you might want to find one with the use of XPath. The example shown in Listing 6 uses Xalan-C++ [6] to evaluate the XPath expression. To use Xalan-C++ to evaluate an XPath, you have to wrap the Xerces-C++ objects [7] into Xalan-C++ objects.

Listing 6: Xalan-C++ and XPath to C++ Objects
#include <iostream>
#include <memory>
#include "customers.hxx"

#include <xalanc/Include/PlatformDefinitions.hpp>
#include <xalanc/PlatformSupport/XSLException.hpp>
#include <xalanc/DOMSupport/XalanDocumentPrefixResolver.hpp>
#include <xalanc/XPath/XObject.hpp>
#include <xalanc/XPath/XPathEvaluator.hpp>
#include <xalanc/XalanSourceTree/XalanSourceTreeDOMSupport.hpp>
#include <xalanc/XercesParserLiaison/XercesParserLiaison.hpp>
#include <xalanc/XercesParserLiaison/XercesDOMSupport.hpp>
#include <xalanc/XalanTransformer/XercesDOMWrapperParsedSource.hpp>
#include <xalanc/XercesParserLiaison/XercesDocumentWrapper.hpp>

using namespace std;
using namespace xercesc;
using namespace XALAN_CPP_NAMESPACE;

int
main (int argc, char* argv[])
{
    // The XML file to parse is expected as the first argument
    if( argc < 2 )
    {
        cerr << "usage: " << argv[0] << " customers.xml" << endl;
        return 1;
    }

    try
    {
        // Parse XML
        XMLPlatformUtils::Initialize ();
        XPathEvaluator::initialize();
        xml_schema::properties props;
        props.no_namespace_schema_location ("customers.xsd");
        auto_ptr<customers_t> all_customers(
            customers(
                argv[1],
                xml_schema::flags::keep_dom | xml_schema::flags::dont_initialize,
                props) );

        // Create a XalanDocument based on doc.
        DOMDocument* xercesDoc = all_customers->_node ()->getOwnerDocument();
        XercesDOMSupport theDOMSupport;
        XercesDocumentWrapper theWrapper( XalanMemMgrs::getDefaultXercesMemMgr(), xercesDoc );
        XalanNode* xalanContextNode = theWrapper.getDocumentElement();
        XalanDocumentPrefixResolver thePrefixResolver(
            theWrapper.getDocumentElement()->getOwnerDocument() );
        XPathEvaluator theEvaluator;

        cerr << "Evaluating XPath" << endl;
        // Evaluate XPath: select the first customer element
        XalanNode* const        resultXalanNode =
            theEvaluator.selectSingleNode(
                theDOMSupport,
                xalanContextNode,
                XalanDOMString("/customers/customer[1]").c_str(),
                thePrefixResolver );

        //
        // Go back to CodeSynthesis XSD C++ objects
        //
        if( DOMNode* xercesNode = (DOMNode*)theWrapper.mapNode( resultXalanNode ) )
        {
            cerr << "have xerces-c result node:" << xercesNode << endl;
            xml_schema::type& t (
                *reinterpret_cast<xml_schema::type*> (
                    xercesNode->getUserData (xml_schema::dom::tree_node_key)));
            customer_t& customerByXPath = (dynamic_cast<customer_t&> (t));

            cout << "From XPath lookup...." << endl;
            cout << "first name:" << customerByXPath.first_name()
                 << " sirname:"   << customerByXPath.sir_name()
                 << endl;
        }
    }
    catch (const xml_schema::exception& e)
    {
        cerr << e << endl;
        return 1;
    }
}

The boilerplate code in the block leading up to selectSingleNode() is used to achieve the Xalan-C++ wrappers. Once the XPath is evaluated, the XercesDocumentWrapper object is used to convert the XalanNode back into an Xerces-C++ DOMNode. Once I have the DOMNode again, the conversion back into a CodeSynthesis XSD C++ object is the same as for the DOMDocument::getElementById() example.

C++ Objects Back to XML Again

Of course, at some stage, you might also want to stream the C++ objects back out to XML again. The only change needed to support this is to pass --generate-serialization as an argument when running xsdcxx. The C++ code shown in Listing 7 operates on the same customers.xsd and customers.xml files used in the previous examples.

Listing 7: C++ Objects Back to XML Again
#include <iostream>
#include <memory>
#include "customers.hxx"

using namespace std;

int
main (int argc, char* argv[])
{
    // The XML file to parse is expected as the first argument
    if( argc < 2 )
    {
        cerr << "usage: " << argv[0] << " customers.xml" << endl;
        return 1;
    }

    try
    {
        auto_ptr<customers_t> all_customers( customers(argv[1]) );

        customers_t::customer_sequence& l = all_customers->customer();
        for( customers_t::customer_sequence::iterator ci = l.begin(); ci != l.end(); ++ci )
        {
            if( ci->first_name() == "Bart" )
            {
                int year = 1980;
                unsigned short month = 2;
                unsigned short day = 22;
                unsigned short hours = 1;
                unsigned short minutes = 2;
                double seconds = 3;

                // Assigning to the optional dob element makes it present
                ci->dob() = xml_schema::date_time( year, month, day, hours, minutes, seconds );
            }
        }

//      customers (std::cout, *all_customers );

        // Serialize with a reference to the schema in the output
        xml_schema::namespace_infomap map;
        map[""].schema = "customers.xsd";
        customers (std::cout, *all_customers, map);
    }
    catch (const xml_schema::exception& e)
    {
        cerr << e << endl;
        return 1;
    }
}

Because the customers.xml file now links to its xsd file, loading it becomes a single-line operation. When I find Bart, I create a date_time and assign it to his date of birth. The commented-out line will dump all the customers as a valid XML file to stdout.

The slightly longer version will include a link to the customers.xsd file in the output. Because the program expects that the input XML file contains a link to its schema file, it makes sense to have the output XML also include a link to its schema. In this way, the output XML of the program is also valid input XML for the program.
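
For reference, the date of birth assigned to Bart corresponds to the same textual form that Mr. Burns's dob element uses in Listing 1. The following sketch, which uses only the standard library, formats the individual fields in that form. The function name format_date_time is hypothetical, and the sketch assumes whole seconds, a positive four-digit year, and no time zone; it is not the serializer that CodeSynthesis XSD uses.

```cpp
#include <cstdio>
#include <string>

// Format date and time fields as YYYY-MM-DDThh:mm:ss.
// Assumptions for illustration: whole seconds, positive year, no time zone.
std::string format_date_time (int year, unsigned month, unsigned day,
                              unsigned hours, unsigned minutes, unsigned seconds)
{
  char buf[64];
  std::snprintf (buf, sizeof buf, "%04d-%02u-%02uT%02u:%02u:%02u",
                 year, month, day, hours, minutes, seconds);
  return std::string (buf);
}
```

With the values from Listing 7, format_date_time(1980, 2, 22, 1, 2, 3) produces 1980-02-22T01:02:03.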

Conclusion

This article outlined some of the benefits of using CodeSynthesis XSD to integrate XML with C++ rather than relying on a programming interface such as DOM. Of course, the details for how you use CodeSynthesis XSD, DOM, XQilla, and other XML-related tools will depend on your own development methods and the peculiarities of your project.

XQuery: From XQilla to C++ Objects

So far, this article has focused on how to use CodeSynthesis XSD to simplify the process of consuming and generating XML from C++. However, you don't have to give full control to CodeSynthesis XSD; instead, you can find elements from a DOM by your own means and ask CodeSynthesis XSD to give you the C++ object that wraps the XML element you found.

The XQilla project (Figure 3) [8] [9] provides support for both the XPath 2.0 XML path language [3] and the XQuery language [10], which is used to query XML data. XQilla offers a huge amount of power for defining XML elements and generating XML. When you use CodeSynthesis XSD with XQilla, you can bring together XML information and access it as C++ objects.

For those unfamiliar with XQuery, it is a FLWOR (For, Let, Where, Order by, Return) query language [11] similar to the SQL query language. With XQuery, you can embed queries into XML and also generate new XML from intermediate query results.

The following example is an XQuery that will return only a single customer from the customers.xml file:

$ cat customers.xq
<customers>
{
  for $c in //customer
  where $c/middle-name = "Montgomery"
  return $c
}
</customers>

Notice that the customers XML element simply appears literally in the XQuery itself. The query to be executed is encapsulated in { }, and the result of the return statement will completely replace everything inside the braces and the braces themselves.

The for statement operates on the nodes selected by the XPath, in this case inspecting each customer node. The where clause could instead have been folded into the XPath expression as a predicate, giving //customer[middle-name="Montgomery"]. The return statement just returns the entire customer XML element.
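
For readers who think more naturally in C++ than in XQuery, the for/where/return structure of the query corresponds roughly to a filtering loop. The sketch below is an analogy only: customer_row and select_by_middle_name are hypothetical names, and the code works on a plain std::vector rather than on XML.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat record standing in for a customer element.
struct customer_row
{
  std::string first_name;
  std::string middle_name;   // empty when the customer has no middle name
  std::string sir_name;
};

// Rough C++ analogy of: for $c in //customer where ... return $c
std::vector<customer_row>
select_by_middle_name (const std::vector<customer_row>& all,
                       const std::string& wanted)
{
  std::vector<customer_row> result;
  for (std::size_t i = 0; i < all.size (); ++i)   // for $c in //customer
  {
    if (all[i].middle_name == wanted)             // where $c/middle-name = ...
      result.push_back (all[i]);                  // return $c
  }
  return result;
}
```

Applied to the two customers from Listing 1 with "Montgomery" as the wanted middle name, the function returns a vector that holds only Mr. Burns.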

The command shown in Listing 8 will execute the preceding customers.xq XQuery file. Note that the xqilla program does not pretty-print the result, so I pipe it into xmllint to have it formatted for easier human consumption. The -i argument to xqilla tells it to bind the customers.xml file as the default context. You'll notice in the preceding XQuery that no file is mentioned; the query just assumes it can pick off customer(s) somewhere in "the document." This separation is very powerful, in that you can use the same XQuery to process many files. In this case, I want customers.xml, so I bind that as "the document" that the query uses. The result of executing the query on customers.xml is an XML file that validates against the customers.xsd schema file but only contains the single customer: Mr. Burns.

The trick to using CodeSynthesis XSD with XQilla is to generate the DOM with XQilla and pass it to CodeSynthesis XSD to create the C++ object model from a DOM that you continue to own. Given that you have the DOM and C++ objects, you can find particular DOM elements with XQilla and then convert them to C++ objects. See the Linux Magazine website for a listing showing how to convert XQuery results to C++ objects (http://www.linux-magazine.com/Resources/Article-Code).

Figure 3: The XQilla library is available for download under the Apache v2 license.

Listing 8: Executing customers.xq
$ xqilla -i customers.xml customers.xq | xmllint --format -
<?xml version="1.0"?>
<customers>
  <customer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="c2">
    <first-name>Charles</first-name>
    <middle-name>Montgomery</middle-name>
    <sir-name>Burns</sir-name>
    <gender>male</gender>
    <dob>1903-11-20T06:30:13</dob>
  </customer>
</customers>

INFO
[1] CodeSynthesis XSD: http://codesynthesis.com/projects/xsd/
[2] Document object model: http://en.wikipedia.org/wiki/Document_Object_Model
[3] XPath: http://www.w3.org/TR/xpath/
[4] XML ID attribute: http://www.w3.org/TR/xmlschema-2/#ID
[5] DOMDocument::getElementById: http://xerces.apache.org/xerces-c/apiDocs-2/classDOMDocument.html#fb3e89ba1247d689c4570f40003ea5db
[6] Xalan-C++: http://xml.apache.org/xalan-c/
[7] Xerces-C++: http://xerces.apache.org/xerces-c/
[8] XQilla: http://xqilla.sourceforge.net/
[9] XQilla extension functions: http://xqilla.sourceforge.net/ExtensionFunctions
[10] XQuery: http://www.w3.org/TR/xquery/
[11] FLWOR query languages: For, Let, Where, Order by, Return: http://en.wikipedia.org/wiki/FLWOR