LJ Archive CD

XMLC

Reuven M. Lerner

Issue #88, August 2001

Reuven introduces XMLC, part of the Enhydra application server.

Over the last few months, we have looked at a variety of methods for creating web applications using server-side Java. We started with simple servlets and then moved onto JavaServer Pages (JSPs). In order to remove Java code from our JSPs, we began to use JavaBeans, objects whose methods are automatically available to our pages.

But you can only go so far with JavaBeans, which is where custom actions come in. These actions, which look like XML tags and attributes in our JSPs, are tied to the methods of a Java class. In other words, placing a tag in our JSP can effectively invoke one or more methods. Combining custom tags with beans allows us to remove quite a bit of the Java code from our JSPs.

But in the end, what have we accomplished? As we saw last month, intelligent use of custom actions means creating our own mini-language, with its own loops, conditionals and variables. Writing our own tags saves graphic designers from having to use Java and allows us a greater separation between form and content. But it does not go nearly far enough in solving problems.

One clever solution is part of the Enhydra application server, about which I will be writing over the next few months. XMLC, or the XML compiler, turns XML files (including HTML and XHTML files) into Java objects. By invoking methods on these objects, we can modify the HTML that is eventually produced.

XML and XHTML

XML, as you have probably heard by now, is the extensible markup language. What began as a simple and small standard several years ago has ballooned into a veritable alphabet soup of standards and proposed standards.

But the core of XML has remained the same, allowing people to create their own markup languages using a uniform syntax. XML is not meant to be used directly; rather, it is meant to let you create your own markup languages. Because those markup languages are based on XML, they have a well-understood syntax that can be verified by any XML parser. Moreover, if you define a data type definition (DTD) for your markup language, a verifying parser can ensure that the elements and attributes are within accepted norms.

HTML and XML are both standards of the World Wide Web Consortium (W3C), have a similar syntax and are often discussed in the same breath. But in fact, HTML is just one markup language, while XML allows you to create your own languages. More significantly, HTML has a much looser syntax than XML, thanks in no small part to historical factors. The following is thus legal HTML: <img src="foo.png">.

But because every tag must be explicitly closed in XML-derived languages, this would be illegal in an XML document. Instead, we would have to say: <img src="foo.png"/>.

In order to bridge the gap between HTML and XML, the W3C has issued a recommendation known as XHTML, the XML implementation of HTML. While there are indeed various benefits to the use of XHTML, the biggest one is that XML tools will now work on our HTML documents.

Of course, this means that our XHTML documents will look a bit more formal than the HTML documents we might be used to writing. While HTML allows us to be sloppy, using <P> to separate paragraphs, XHTML is much stricter, forcing us to begin paragraphs with <P> and end them with </P>. Attributes must also appear in double quotes, which many people fail to do when working with straight HTML.

While XHTML might be a pain for humans, it actually reduces the load on programs by making the syntax more regular, and thus easier to read and write. But the biggest benefit is the fact that XHTML documents can now be treated as XML documents.

The DOM

XML documents are trees, which should ring a bell for those of you who studied computer science in college. Trees are remarkably easy to work with in theory, but the practice can be a bit tricky sometimes, depending on the way in which the interface is implemented.

There are two popular and cross-platform APIs for working with XML: SAX (the Simple API for XML) is designed to work with incoming streams of XML data, allowing it to be small and efficient. The DOM (document object model), by contrast, gives us access to the entire document tree at once. This allows us to traverse and modify nodes, including adding new nodes and removing old ones. However, it also means that the entire document must be loaded into memory before we can begin to work with documents using the DOM. This makes it more powerful than SAX, but also slower and more resource-intensive.

XMLC works by converting an XML file, normally written in HTML or XHTML, into a Java class that creates and manipulates a DOM tree. You can use standard DOM methods to add, modify and remove nodes on the tree, thus changing the document that will eventually be output.

But the truly clever idea in XMLC is the use of HTML “id” attributes. When the XMLC complier sees an id attribute, it creates methods that allow us to retrieve and modify the text contained within that attribute. The site designers thus work with HTML, identifying areas of dynamic text by giving them unique identifiers. When the designers have finished with their mockup of the original HTML page, they compile it (using XMLC) into a Java class. Developers then create servlets that instantiate that class, use methods to replace the mockup text with dynamically generated content and send the document to the user's browser.

The basic idea is that the designers do not work on hybrids of text and HTML, but rather on mockups of the final output. So long as the id attributes do not change, the HTML file and servlet can evolve in parallel, with neither designers nor developers waiting for their counterparts.

Installing Enhydra

As I mentioned above, XMLC is one element of the Enhydra application server. The 3.x version of Enhydra is considered to be production-ready and includes a copy of XMLC that most users will find more than adequate. Because I am particularly interested in Enhydra for working with Enterprise JavaBeans (EJB), I have been working with the beta version of 4.x, otherwise known as Enhydra Enterprise. By the time you read this, the final release of Enhydra Enterprise should be available, giving web developers an open-source, production-quality J2EE-compliant application server.

To work with XMLC, I downloaded the Enhydra Enterprise beta, a 15.7MB file named enhydra4.0.tar.gz. Open this file, and you will find a wealth of libraries, applications and documentation for the Enhydra application server. We will ignore much of this for now, concentrating on XMLC for the time being.

Almost all of Enhydra is written as Java classes invoked from shell scripts. In order for the shell scripts to find the Java classes, they must be configured for your particular installation. You can do this by entering the Enhydra directory (enhydra4.0 on my system) and running the configure script:

./configure /usr/java/jdk1.3

configure normally takes a single argument—the root directory of your JDK 1.3 installation. While earlier versions of Enhydra (and particularly earlier versions of Enhydra Enterprise) wouldn't work with JDK 1.3, current versions will only work with 1.3. Since JDK 1.3 has a number of other benefits, and a Linux version is supported by Sun, it is probably a good idea to install it.

If you have installed Enhydra somewhere other than /usr/local/enhydra, you should probably set the ENHYDRA environment variable to your installation directory.

Full use of XMLC depends on placing three different .jar files in your CLASSPATH. Since we will be concentrating on XMLC for the rest of this article, we should probably add them now, using bash syntax:

export CLASSPATH=$ENHYDRA/lib/xmlc.jar:\
$ENHYDRA/lib/enhydra.jar:\
$ENHYDRA/lib/xmlc-support.jar

If you're like me, you will want to have a number of items in your CLASSPATH in addition to Enhydra-related items. Here is how I set my CLASSPATH, for instance:

export CLASSPATH=$ENHYDRA/lib/xmlc.jar:\
$ENHYDRA/lib/enhydra.jar:\
$ENHYDRA/lib/xmlc-support.jar:\
$TOMCAT_HOME/classes:\
$TOMCAT_HOME/lib/servlet.jar:\
/usr/share/pgsql/jdbc7.1-1.2.jar:\
.
Notice how I placed the Enhydra .jar files before the others on my system in order to avoid potential problems with conflicts. Since Enhydra has the newest versions of some classes, such as those having to do with the DOM, they should take precedence.

Note that not all three Enhydra-provided .jar files are necessary for each stage of working with XMLC. However, I found it convenient to include all of them at all stages in order to avoid unpleasant surprises later on.

A Simple HTML File

Now that we have installed everything we need to work with XMLC, let's try it with a simple HTML file:

<html>
    <head><title>This is a title</title></head>
    <body>
        <h1>This is a headline.</h1>
        <p id="firstpara">This is a paragraph.</p>
        <img src="foo.gif"/>
        <p>This is a second paragraph.</p>
    </body>
</html>

While XMLC works just fine with straight HTML files, XHTML is a better idea because it stops us from generating files that the DOM cannot represent. For example, XML forbids overlapping tags:

<i><p>Wow</i>, he thought.</p>
The above is tolerable HTML but is illegal XML and XHTML. So while your web browser can somehow handle this HTML and make sense of it, XMLC will generate a warning indicating that it is discarding what it considers to be a useless closing tag. XMLC will often warn you when your HTML is not well formed, helping you to identify potential problems. While you might not have to consider your document's structure when you are writing simple HTML documents, the manipulations that you can perform with XMLC require that you have a clear understanding of how your document will be rendered.

The first paragraph in the previous sample statement is identified with the id attribute “firstpara”. We will soon see how we can manipulate that text from within a Java program, using the id as a lever into the document.

To turn our document into a Java class, we invoke the xmlc program. Assuming that our above HTML file was called foo.html, we can say:

$ENHYDRA/bin/xmlc -parseinfo -verbose -keep foo.html

This turns foo.html into a Java source file called foo.java, which is in turn compiled into foo.class. The -keep argument retains foo.java, rather than deleting it once it has been compiled into foo.class. And while they are unnecessary, I like to use -parseinfo and -verbose when working with xmlc, if only to get some visual feedback on the compilation process.

The Java source code created by XMLC is fairly long and boring, if well-commented. For those of us who want to modify foo.html, the most important parts of foo.java are the getElementFirstpara() and setTextFirstpara() methods. The former returns the text associated with the id “firstpara”, while the latter allows us to swap that text with an arbitrary string.

Listing 1 contains the source code to a small command-line Java class (PrintFoo.java) that prints the contents of the Java-ized version of foo.html. Before printing it, it uses setTextFirstpara() to modify the output:

myfoo.setTextFirstpara("This has been changed");

Once we have made that change, we can display the document:

System.out.print(myfoo.toDocument());
We could traverse the DOM tree ourselves, looking for nodes with a certain id and then modify it manually. However, XMLC's convenience methods make it extremely easy and straightforward to modify such text.

Listing 1. PrintFoo.java

If you have just run PrintFoo, you will notice that the output HTML is displayed without any of the original white space. The resulting document is harder for humans to read but is rendered identically by browsers. That said, I have always tried to keep my HTML documents formatted correctly for easier debugging, and it would be nice for XMLC to include a -preserve-whitespace option.

From what we have seen so far, it would seem that XMLC makes it easy to modify entire paragraphs but difficult to change a single word. However, XMLC takes advantage of the HTML “span” tag, which takes an id attribute and allows us to identify individual words, characters and images that we might want to modify. For example:

<P id="para">This is a paragraph,
    <span id="phrase">and this is a phrase</span>.
</P>

When we compile this HTML using XMLC, we will be able to modify the contents of the entire paragraph using the SetTextPara() method and the individual phrase using the SetTextPhrase() method.

Servlets

Now that we have seen how to work with XMLC from the command line, let's look at a servlet that accomplishes the same task. For starters, our simple PrintFooServlet servlet will receive an HTTP request and will return a copy of the document.

Listing 2 contains a copy of the servlet that displays a foo.html. Like its command-line counterpart, it creates an instance of our “foo” class, modifies some of its text and then writes a textual representation of the XML tree to an output stream. In this particular case, however, the output stream is connected to the user's browser. The user thus sees the modified template without knowing that two Java classes (and an original HTML document) were involved.

Listing 2. PrintFooServlet.java

For our servlet to work, I needed to put a copy of foo.class in a directory located under the Jakarta-Tomcat servlet engine's CLASSPATH environment variable. I chose to put it in $TOMCAT/classes, at the top level. If this were a production class, I would undoubtedly want to put it in a more intelligent place, taking advantage of Java's hierarchical namespace. However, I executed xmlc without specifying a package, meaning that foo.class must be put in the top-level namespace. In order to place foo.class in the il.co.lerner namespace, I would have had to use the -class option:

$ENHYDRA/bin/xmlc -class il.co.lerner.foo\
-parseinfo -verbose -keep foo.html

With foo.class in $TOMCAT/classes, I was able to compile PrintFooServlet.java successfully. Now the only remaining challenge was to execute this servlet and display my modified HTML page. Once again, I needed to modify the CLASSPATH, but this time the CLASSPATH in need of change was that of the Tomcat servlet engine, which executes servlets on our behalf. I modified $TOMCAT/bin/tomcat.sh such that just before it exports its CLASSPATH, we add the three Enhydra-supplied .jar files and restarted Tomcat. Moments after pointing my browser at the servlet, I was delighted to see a modified version of my original HTML file on my screen.

Databases and XMLC

It is easy to see how we could populate a page with information taken from a relational database. For example, here is a small PostgreSQL table that we can use to store a different saying for each calendar day:

CREATE TABLE DailySayings (
    date   TIMESTAMP NOT NULL,
    saying TEXT NOT NULL,
    UNIQUE(date)
)

Now let's insert a number of sayings into our system:

INSERT INTO DailySayings(date, saying)
VALUES (CURRENT_DATE,
     'A bird in the hand is worth two in the bush.');
INSERT INTO DailySayings(date, saying)
VALUES (CURRENT_DATE+1,
     'A penny saved is a penny earned.');
INSERT INTO DailySayings(date, saying)
VALUES (CURRENT_DATE+2,
     'The rain in Spain falls mainly in the plain.');
To retrieve today's saying, we merely need the following query:
SELECT saying
FROM DailySayings
WHERE date = CURRENT_DATE
In order to write a servlet that displays today's saying, we will need two classes: a template that we will create with XMLC (saying.html, which will be compiled into saying.class) and another that will load and manipulate the template (DailySaying.java). We will agree in advance of writing our XMLC document and our manipulation class that the id “saying” will link the two together.

Our XMLC document is fairly straightforward:

<html>
   <head><title>Today's saying</title></head>
   <body>
      <h1>Today's saying</h1>
      <p>And now, as you requested, today's saying:
        <span id="saying">Saying Goes Here</span>.</p>
   </body>
</html>

I compiled this HTML document into the Java class il.co.lerner.saying, keeping around the .java file just for fun:

$ENHYDRA/bin/xmlc -class il.co.lerner.saying\
-parseinfo -verbose -keep saying.html
I then copied the resulting saying.class file into $TOMCAT_HOME/classes/il/co/lerner, where I keep my servlet-related classes.

Once I installed my document, I had to write a manipulation class. This class executes the SQL query that we saw above, retrieving the results and sticking them into our compiled XMLC document. Listing 3 [see Listing 3 at ftp://ftp.linuxjournal.com/pub/lj/listings/issue88/] contains the source code for our servlet, which I compiled and put into an active servlet context on my Tomcat server. After restarting Tomcat and Apache, I was able to retrieve today's saying via my web browser, with the SQL results instantiated into the HTML document.

Is XMLC a Good Solution?

When I first began to look into XMLC, I had my serious doubts about its viability. After years of working with hybrid templates, it just seemed too weird to turn an HTML file into a Java class, only to manipulate that class using the DOM. And indeed, it takes significantly greater resources to fire up a DOM parser than to simply display a file.

As I have begun to work with XMLC, however, I am increasingly aware of its advantages over such templates. In essence, XMLC forces designers and developers to create a contract, or API specification, between their documents and programs. Once this API is in place, it cannot easily be changed, which is not necessarily a bad thing. Most importantly, the stability of the API between designers and developers allows them to work in parallel, barely interfering with each other's work.

Because a Java manipulation class can modify the HTML of a compiled document in any way it chooses, we can easily imagine a situation in which we bring in three classes at a time: a header file, the main body of the document and a footer file. Our class could then use DOM methods to attach the header to the beginning of the document and the footer to the end. In such a way, we could add global formatting to our site without having to copy boilerplate text to the top of each file.

There are, of course, a number of irksome details when working with XMLC. One is that it quickly gets boring and frustrating to write one servlet per HTML file. True, we could write a single servlet that takes the name of a file in its query string, acting almost as a document template for a variety of classes created by XMLC. Perhaps I have not yet explored Enhydra enough to have discovered the answer to this question, and perhaps Enhydra developers quickly get used to creating two Java classes for each page they wish to display. Regardless, this can quickly create an overwhelming number of classes, even on a small- to medium-sized site.

The biggest problem that I see with XMLC is the lack of a high-level API to manipulate HTML (and XML, for that matter). One of the FAQs for XMLC is “How do I add a row to an HTML table?” Such a task, which is trivial to accomplish with standard HTML, quickly becomes a burden with XMLC. You must first find the bottom of the table to which you want to add rows and then add individual nodes (and attributes) to that node. It has a very non-HTML feel to it and forces the developer to think of nodes when he or she would prefer to think in terms of HTML. Given that Enhydra includes an API to create SQL queries using Java methods, I would imagine that a similar API for HTML manipulation wouldn't be too difficult.

Conclusion

XMLC is an intriguing technology that sits at the heart of the Enhydra application server. XMLC forces developers (and designers) to consider how they will interact before they begin working and then allows them to work independently. While this mode of operation might throw experienced template users off balance, it quickly becomes second nature and feels more natural than I ever expected.

Indeed, the fact that Zope's ZPT uses a similar method for separating form from content probably points to a trend within the web development community. We can expect to see more XMLC-like systems in the near future. If we're lucky, perhaps there will even be some standardization of these templates, so that designers can move across systems without having to learn the subtle differences between them.

While XMLC is important, Enhydra has many other features that make it worth investigating. Next month we will continue to look into Enhydra, looking at ways in which it speeds up the writing of server-side database applications.

Resources

Reuven M. Lerner owns a small consulting firm specializing in web and internet technologies. He lives with his wife Shira and daughter Atara Margalit in Modi'in, Israel's “city of the future”. You can reach him at reuven@lerner.co.il, or on the ATF home page, http://www.lerner.co.il/atf/.

LJ Archive CD