2.2. Beginning SAX

This chapter explores SAX through some progressively more functional examples, which build on each other to present the key concepts that are discussed later in more detail. Essential producer and consumer interfaces are presented together to show how they interact, and you'll see how to customize classic SAX configurations. We'll focus first on the producer side, saving most details about consumer-side APIs for a bit later.

2.2.1. How Do the Parts Fit Together?

In the simplest possible example, you (in your role as director) will get an XML parser, which will later produce parsing events. Then you will get a consumer and connect it to the producer for processing the most important events. Finally, you'll ask that parser to produce events, pushing them through to the consumer.

To start, focus on what the different parts are, and how they relate to each other. Example 2-1 is a simple SAX program, which you can compile and run if you like.

Example 2-1. SAX2 application skeleton

import java.io.IOException;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Skeleton {

    // argv[0] must be the absolute URL of an XML document
    public static void main (String argv [])
    {
        XMLReader       producer;
        DefaultHandler  consumer;

        // Get an instance of the default XML parser class
        try {
            producer = XMLReaderFactory.createXMLReader ();
        } catch (SAXException e) {
            System.err.println (
                  "Can't get parser, check configuration: "
                + e.getMessage ());
            return;
        }

        // Set up the consumer
        try {

            // Get a consumer for all the parser events
            consumer = new DefaultHandler ();

            // Connect the most important standard handler
            producer.setContentHandler (consumer);

            // Arrange error handling
            producer.setErrorHandler (consumer);
        } catch (Exception e) {
            // Consumer setup can uncover errors,
            // though this simple one shouldn't
            System.err.println (
                  "Can't set up consumers: "
                + e.getMessage ());
            return;
        }

        // Do the parse!
        try {
            producer.parse (argv [0]);
        } catch (IOException e) {
            System.err.println ("I/O error: ");
            e.printStackTrace ();
        } catch (SAXException e) {
            System.err.println ("Parsing error: ");
            e.printStackTrace ();
        }
    }
}

This is a complete SAX application, though it's sort of boring since it throws away all the data the parser delivers. The only reason this program would print anything at all is if you didn't pass it an argument that was the URL for a well-formed XML file. Other than that, it's fairly typical of how you'll be using SAX2, at least in terms of the basic structure. You can make real programs from this skeleton if you substitute smarter components for the simple ones shown here.

We introduced a few SAX classes and interfaces, so we can add some details to our earlier producer/consumer picture to get Figure 2-2. This producer is an XMLReader, and we're listening to one consumer interface and the ErrorHandler. The whole thing is driven by an application which is pulling the whole document through the reader.

Figure 2-2. Basic SAX roles and components

XMLReader producer;

The most common type of SAX2 event producer is an XML parser. Like most SAX2 event producers, XML parsers implement the XMLReader interface. Whether or not an XMLReader implementation parses actual XML (instead of HTML or something else), it is required to produce events as if it did.

Don't confuse this interface with java.io.Reader, from which you can pull a stream of character data. SAX parsers produce streams of SAX events, which they push to event consumers. Those are rather different models for how to deliver data.

producer = XMLReaderFactory.createXMLReader ();

This is the best all-around SAX2 bootstrap API when you need an XML parser. The only time it should produce any kind of exception is when your environment is misconfigured. For example, you might need to set the org.xml.sax.driver system property to the class name for your parser (see Section 3.2.1, "The XMLReaderFactory Class" in Chapter 3, "Producing SAX2 Events").
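A misconfigured driver shows up as a SAXException from the factory. The sketch below is illustrative only: DriverCheck and readerFor are hypothetical names, and com.example.NoSuchParser in the usage note stands in for whatever class name your configuration supplies. It asks for a named driver with the one-argument createXMLReader call and falls back to the default parser if that class can't be loaded.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class DriverCheck {

    // Hypothetical helper: try a specific driver class, then fall back
    // to whatever the environment provides by default.
    static XMLReader readerFor (String driverClassName)
    throws SAXException
    {
        try {
            return XMLReaderFactory.createXMLReader (driverClassName);
        } catch (SAXException e) {
            // Typically the class is missing or isn't an XMLReader
            System.err.println (
                  "Can't load driver " + driverClassName + ": "
                + e.getMessage ());
            return XMLReaderFactory.createXMLReader ();
        }
    }
}
```

Calling DriverCheck.readerFor ("com.example.NoSuchParser") should report the problem and still hand back a usable parser, provided the environment has a default one.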

You can (and should!) keep reusing this XMLReader, but you should only have one thread touch a parser at a time. That is, parsing is not re-entrant. Parsers are perfectly safe to use with multiple threads, except that two threads can't use the same parser at the same time. (That's a good rule of thumb for most objects in multithreaded code, in all programming languages; it should feel natural to apply that rule to SAX parsers.)
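One way to follow both pieces of advice (reuse the parser, but never share it between threads at the same time) is to keep one XMLReader per thread. This is a minimal sketch, not something SAX itself provides; the ReaderPerThread class and its method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class ReaderPerThread {

    // Each thread lazily gets its own parser and then keeps reusing it
    private static final ThreadLocal<XMLReader> READERS =
        ThreadLocal.withInitial (() -> {
            try {
                return XMLReaderFactory.createXMLReader ();
            } catch (SAXException e) {
                throw new IllegalStateException ("Can't get parser", e);
            }
        });

    static XMLReader get ()
    {
        return READERS.get ();
    }

    // Parse several documents, one after another, with the same parser
    static void parseAll (ContentHandler consumer, String... xmlTexts)
    throws IOException, SAXException
    {
        XMLReader producer = get ();

        producer.setContentHandler (consumer);
        for (String text : xmlTexts)
            producer.parse (new InputSource (new StringReader (text)));
    }
}
```

Handlers set on a reused parser stay in place between parses, so reset them if a later document needs different consumers.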

consumer = new DefaultHandler ();

The DefaultHandler class is particularly handy when you're just starting to use SAX. It implements most of the event consumer interfaces, providing stubbed out (no-op) implementations for each method that's not part of an extension handler. That means it's easy to subclass this class if you need a place to start: just override each stub method to provide real code when you need it. We'll use DefaultHandler to avoid presenting extra callback methods.
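As a small illustration of overriding just the stubs you need, here is a sketch of a subclass that counts elements. The ElementCounter name and its countElements helper are invented for this example; the helper parses in-memory text through an InputSource so that it runs without any files.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class ElementCounter extends DefaultHandler {

    private int count;

    // Reset at the start of each document so the handler is reusable
    @Override
    public void startDocument ()
    {
        count = 0;
    }

    // Called once for every element start tag (or empty element tag)
    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    {
        count++;
    }

    int getCount ()
    {
        return count;
    }

    // Hypothetical convenience method wiring producer and consumer
    static int countElements (String xmlText)
    throws IOException, SAXException
    {
        XMLReader       producer = XMLReaderFactory.createXMLReader ();
        ElementCounter  consumer = new ElementCounter ();

        producer.setContentHandler (consumer);
        producer.parse (new InputSource (new StringReader (xmlText)));
        return consumer.getCount ();
    }
}
```

All the other callbacks keep their inherited no-op behavior, so the rest of the event stream is silently discarded.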

producer.setContentHandler (consumer);

In this chapter, we're only showing the most commonly used consumer interfaces. ContentHandler is used to report elements, attributes, and characters; that's enough to get almost all serious XML work done.

producer.setErrorHandler (consumer);

ErrorHandler lets applications control handling of various kinds of errors, and we'll need it in later examples. We'll usually look at error handling as a specialized kind of task, different from other consumer roles. Even though "handler" is part of its name, it's a different kind of object.
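To make the idea concrete, the following sketch shows one possible policy: print warnings, and abort the parse on any error by rethrowing the exception the parser reports. StrictErrorHandler and parsesCleanly are names invented for this example, and other policies (such as logging errors and continuing) are just as legitimate.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class StrictErrorHandler implements ErrorHandler {

    // Warnings are reported but don't stop the parse
    public void warning (SAXParseException e)
    {
        System.err.println ("Warning, line " + e.getLineNumber ()
            + ": " + e.getMessage ());
    }

    // Recoverable errors (such as validity errors) abort the parse here
    public void error (SAXParseException e)
    throws SAXException
    {
        throw e;
    }

    // Fatal errors (such as well-formedness errors) always abort
    public void fatalError (SAXParseException e)
    throws SAXException
    {
        throw e;
    }

    // Hypothetical helper: true if the text parses without any error
    static boolean parsesCleanly (String xmlText)
    throws IOException, SAXException
    {
        XMLReader producer = XMLReaderFactory.createXMLReader ();

        producer.setContentHandler (new DefaultHandler ());
        producer.setErrorHandler (new StrictErrorHandler ());
        try {
            producer.parse (new InputSource (new StringReader (xmlText)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }
}
```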

producer.parse (argv [0]);

This call tells a parser to read the XML text found at a particular fully qualified URL. There's another call you'll use when you don't have a URL for that text, but most of the time this is the call you ought to use. If you're tempted to pass filenames or relative URIs, just say no! Filenames need to be converted to URLs first (see Section 3.1.3, "Filenames Versus URIs" in Chapter 3, "Producing SAX2 Events"), and relative URIs must be converted to absolute ones.
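The two situations just described can be sketched as follows. This is only one approach, with invented names (Locations, toAbsoluteUrl, fromText): it relies on java.io.File to turn a filename into an absolute file: URL, and on an InputSource wrapping a character stream for text that has no URL at all.

```java
import java.io.File;
import java.io.StringReader;
import org.xml.sax.InputSource;

class Locations {

    // Convert a (possibly relative) filename to an absolute URL string
    // that is suitable for XMLReader.parse (String)
    static String toAbsoluteUrl (String fileName)
    {
        return new File (fileName).getAbsoluteFile ().toURI ().toString ();
    }

    // Wrap in-memory XML text for XMLReader.parse (InputSource);
    // no system ID is set, so relative URIs inside the text can't
    // be resolved unless you call setSystemId () yourself
    static InputSource fromText (String xmlText)
    {
        return new InputSource (new StringReader (xmlText));
    }
}
```

With these helpers, producer.parse (Locations.toAbsoluteUrl (argv [0])) would accept a plain filename, and producer.parse (Locations.fromText (text)) would parse a string.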

Parsing can report exceptions. This is important, and not just because it's the only way that a chunk of code like this (using just an XMLReader) could seem to "do" anything. Normally, those exceptions will be thrown only for fatal errors, such as well-formedness errors in an XML document, or for document I/O problems.

The application thread is "pulling" the XML text through the XMLReader-style producer: the parse() call won't return until the whole document is parsed, or until parsing is aborted by throwing an exception. Until it returns, the thread that called the XMLReader is either blocking on I/O, parsing data that it just read, or "pushing" data into one of the consumer interfaces. That is, from the perspective of event consumers SAX2 is a "push" API: handlers do nothing until they're asked.
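Because parse() returns early only when an exception is thrown, a consumer that has seen enough can stop the parser by throwing a SAXException from a callback. The sketch below assumes that behavior and uses invented names (FirstElementFinder, namesUpTo); it records element names until it meets a target element, then aborts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class FirstElementFinder extends DefaultHandler {

    private final String        target;
    private final List<String>  seen = new ArrayList<> ();
    private boolean             found;

    FirstElementFinder (String target)
    {
        this.target = target;
    }

    // localName is always available when namespace processing is on,
    // which is the SAX2 default
    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    throws SAXException
    {
        seen.add (localName);
        if (localName.equals (target)) {
            found = true;
            throw new SAXException ("found " + target + ", stopping");
        }
    }

    // Hypothetical helper: element names pushed before parsing stopped
    static List<String> namesUpTo (String xmlText, String target)
    throws IOException, SAXException
    {
        XMLReader           producer = XMLReaderFactory.createXMLReader ();
        FirstElementFinder  consumer = new FirstElementFinder (target);

        producer.setContentHandler (consumer);
        try {
            producer.parse (new InputSource (new StringReader (xmlText)));
        } catch (SAXException e) {
            // Only swallow the exception if we aborted on purpose
            if (!consumer.found)
                throw e;
        }
        return consumer.seen;
    }
}
```

Nothing after the target element is delivered to the consumer, since the producer stops pushing as soon as the exception propagates out of parse().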

2.2.2. What Are the SAX2 Event Handlers?

SAX2 events are grouped into several interfaces, which we explore later in more detail. All except two are implemented by DefaultHandler. Each interface encapsulates a set of events; to see those events, applications give parsers objects that implement the handler interfaces they're interested in.

org.xml.sax.ContentHandler

Essentially every significant use of SAX2 involves this handler. The element and character data callbacks (discussed later in this chapter) are defined in this interface, as are callbacks for most other SAX2 events for general-purpose data. Many SAX2 applications will focus primarily on this interface. If you only need the core XML data model (elements, attributes, and text), this could be the only handler you use.

org.xml.sax.ext.DeclHandler

This handler reports DTD declarations that aren't exposed through DTDHandler (or in one case LexicalHandler) callbacks: declarations for elements, attributes, and parsed entities.

Because it is an extension handler, it won't necessarily be recognized by all SAX2 parsers, and DefaultHandler doesn't provide no-op implementations for its callbacks.

org.xml.sax.DTDHandler

This handler reports DTD declarations that the XML 1.0 specification requires all processors to expose: declarations for notations and for unparsed entities. Most applications won't use this interface unless they're connected to SGML-based infrastructure that depends on such tools. This is probably the most exotic SAX handler interface; web-oriented XML applications will use MIME types instead of notations and URIs instead of unparsed entities.

org.xml.sax.ErrorHandler

The events reported by this interface are errors and warnings. These behaviors are part of XML, but not part of the data model, so they don't show up in the Infoset. Grouping these events in one interface lets application code centralize treatment of XML or application data errors. After ContentHandler, it's probably the most important SAX2 handler. It's also usefully managed apart from other handlers, so in this book it's usually not lumped with "real" handlers. (This interface is discussed later in this chapter.)

org.xml.sax.ext.LexicalHandler

This interface mostly exposes information that is intended to be semantically meaningless, such as comments and CDATA section boundaries, as well as entity and DTD boundaries.

Because it is an extension handler, it won't necessarily be recognized by all SAX2 parsers, and DefaultHandler doesn't provide no-op implementations for its callbacks.

With the exception of ErrorHandler, you'll normally want to work with all of these interfaces as a single group: four interfaces, two for content in the document body and two for DTD content. That way, you will work with all the XML data from a document (its Infoset) as part of a cohesive whole. There are SAX2 helper classes (like DefaultHandler and XMLFilterImpl) that group most of these interfaces into classes, but they ignore the two extension handlers (Decl and Lexical handlers in the org.xml.sax.ext package). SAX2 application layers often handle such grouping; for example, you can subclass those helper classes in a different package, adding extension interface support.
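Extension handlers are bound through parser properties rather than dedicated setter methods. The sketch below collects comments through LexicalHandler. It assumes a runtime that bundles org.xml.sax.ext.DefaultHandler2 (a later SAX2 extensions helper that adds no-op extension callbacks to DefaultHandler); CommentCollector and collect are invented names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

class CommentCollector extends DefaultHandler2 {

    private final List<String> comments = new ArrayList<> ();

    // LexicalHandler callback; DefaultHandler2 stubs out the others
    @Override
    public void comment (char ch [], int start, int length)
    {
        comments.add (new String (ch, start, length));
    }

    // Hypothetical helper: returns the text of each comment reported
    static List<String> collect (String xmlText)
    throws IOException, SAXException
    {
        XMLReader         producer = XMLReaderFactory.createXMLReader ();
        CommentCollector  consumer = new CommentCollector ();

        producer.setContentHandler (consumer);
        try {
            // Standard SAX2 property ID for binding a LexicalHandler
            producer.setProperty (
                "http://xml.org/sax/properties/lexical-handler",
                consumer);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Recognition isn't guaranteed; carry on without comments
            System.err.println ("No LexicalHandler support: "
                + e.getMessage ());
        }
        producer.parse (new InputSource (new StringReader (xmlText)));
        return consumer.comments;
    }
}
```

DeclHandler is bound the same way, using the http://xml.org/sax/properties/declaration-handler property.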

The logic behind keeping these interfaces separate, rather than merging all of their methods into one huge interface, is that it's more appropriate for simple applications. You must explicitly ask for bells and whistles; they aren't thrust upon you by default. You can easily prune out certain data by ignoring the interfaces that report it. Most code only uses ContentHandler and ErrorHandler implementations, so the methods in other interfaces are easy to ignore. Plus, from the application perspective, parser recognition of the extension handlers isn't guaranteed. There's a slight awkwardness associated with needing to bind each type of handler separately, but that's a small trade-off for the benefit of having a modular API extension model already in place.

SAX2 defines another important interface beyond these handlers and the XMLReader: parsers use EntityResolver to retrieve external entity text they must parse. That interface is also stubbed out by DefaultHandler. If you want the parser to use local copies of DTDs rather than DTDs accessed from a server that might not be available, you'll want to become familiar with EntityResolver. However, it isn't really a consumer API since it doesn't deal directly with parsed XML data (the Infoset); it deals with accessing raw unparsed text, the same stuff that's given to XMLReader.parse() methods. This book presents it as a producer-side helper for parsers, in Section 3.4, "The EntityResolver Interface" in Chapter 3, "Producing SAX2 Events".
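As a rough sketch of the local-copy idea, the resolver below serves DTD text from an in-memory table keyed by system ID and returns null for anything it doesn't know, which tells the parser to fetch the entity itself. LocalDtdResolver and register are invented names, and a real application would more likely map system IDs to files than to strings.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdResolver implements EntityResolver {

    // Maps absolute system IDs to locally held entity text
    private final Map<String,String> localCopies = new HashMap<> ();

    void register (String systemId, String entityText)
    {
        localCopies.put (systemId, entityText);
    }

    // Called by the parser before it opens any external entity
    public InputSource resolveEntity (String publicId, String systemId)
    {
        String text = localCopies.get (systemId);

        if (text == null)
            return null;        // let the parser use its default lookup

        InputSource source = new InputSource (new StringReader (text));

        // Keep the original IDs so diagnostics and relative URIs work
        source.setPublicId (publicId);
        source.setSystemId (systemId);
        return source;
    }
}
```

To use it, register the local text under the system ID the documents declare, then call producer.setEntityResolver (resolver) before parsing.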

2.2.3. XMLWriter: an Event Consumer

The next part of SAX we show in this overview is really not a part of SAX, except that it uses SAX to do something you'll likely need to do fairly often. (Pretty much everyone does!) As you've seen, SAX2 includes an XMLReader interface, used to turn XML text into a stream of SAX events. But it does not include the corresponding XMLWriter to reverse the process: turning such events back into text and supporting XML for program outputs as well as inputs. SAX isn't only for reading XML. The same APIs are used to write XML too.

It's almost a tradition to show how to write most of such a class as an example when explaining SAX. We avoid that in this book because getting all the XML details right is tricky, and because this class is a clear example of something that should be treated as a reusable SAX library component. There are lots of ways the data needs to be escaped, and sometimes you need to use output encodings (like ASCII) that have problems representing some XML characters.

There's a better solution: use one of several such classes, which are widely available. This book uses the gnu.xml.util.XMLWriter class (bundled with gnujaxp.jar and Ælfred) when it needs XML generation functionality, because it doesn't force applications to discard as much of the XML data. It supports all of the SAX2 handlers, including the extension handlers LexicalHandler and DeclHandler, so it can round-trip almost all XML data. To use such classes, at least in their simple low-fidelity modes, you can modify the skeleton program shown earlier to something like this:

import java.io.FileOutputStream;
import gnu.xml.util.XMLWriter;

public class ... {

    ...
        producer.setContentHandler (
            new XMLWriter (new FileOutputStream ("out.xml"))
            );
    ...
}

In addition to the GNU class used in this book, other versions are available. One is provided with DOM4J (org.dom4j.io.XMLWriter); it supports Content and Lexical handlers and evolved from the com.megginson.sax.XMLWriter class, which supports only ContentHandler. Curiously, neither Crimson nor Xerces includes such SAX-to-text functionality at this time.
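If none of those classes is at hand, a Java runtime that bundles JAXP's transformation API offers another route, although it is not one this book uses: a TransformerHandler from javax.xml.transform.sax is a SAX event consumer that serializes what it receives. The sketch below rests on that assumption, uses invented names (EventsToText, writeGreeting), and feeds the handler events generated by the program rather than by a parser.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class EventsToText {

    // Hypothetical helper: emits one <greeting> element as XML text
    static String writeGreeting (String text)
    throws SAXException, TransformerConfigurationException
    {
        // Assumes the default factory supports SAX, as the JDK's does
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance ();
        TransformerHandler    writer = factory.newTransformerHandler ();
        StringWriter          out = new StringWriter ();

        writer.getTransformer ().setOutputProperty (
            OutputKeys.OMIT_XML_DECLARATION, "yes");
        writer.setResult (new StreamResult (out));

        // The program plays the producer role: no parser is involved
        char chars [] = text.toCharArray ();

        writer.startDocument ();
        writer.startElement ("", "greeting", "greeting",
            new AttributesImpl ());
        writer.characters (chars, 0, chars.length);
        writer.endElement ("", "greeting", "greeting");
        writer.endDocument ();
        return out.toString ();
    }
}
```

The consumer, not the calling code, takes care of escaping characters such as the ampersand, which is one of the details that makes a reusable writer component worthwhile.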

2.2.3.1. Event pipelines

Of course, just parsing and echoing data is not very useful. Such classes are best used to output XML data that you've massaged a bit. We'll look at two ways to do this later. One way is to use an XML pipeline, where consumers produce data for other consumers, as illustrated in Figure 2-3. For example, one stage could filter the event stream from a parser to remove various uninteresting elements, or otherwise transform the data, and then feed the result to an XMLWriter. You can combine several such stages into a "pipeline" and debug them using an XMLWriter to watch data as it flows through particular stages. Remember that XMLReader isn't the only kind of SAX event producer: programs can write events and feed the result to an XMLWriter. Also, the consumer doesn't need to be an XMLWriter; it could construct any kind of useful data structure. In fact we'll look later at doing this with DOM.
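A filtering stage like the one just described can be sketched with the XMLFilterImpl helper class, which passes every event through unchanged unless you override the callback. In this illustration (ElementDropper and namesAfterDropping are invented names), the stage removes a named element and everything inside it, and the final consumer simply records the element names that survive.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

class ElementDropper extends XMLFilterImpl {

    private final String  dropName;
    private int           depth;    // > 0 while inside a dropped element

    ElementDropper (XMLReader parent, String dropName)
    {
        super (parent);
        this.dropName = dropName;
    }

    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    throws SAXException
    {
        if (depth > 0 || localName.equals (dropName))
            depth++;                            // swallow the event
        else
            super.startElement (uri, localName, qName, atts);
    }

    @Override
    public void endElement (String uri, String localName, String qName)
    throws SAXException
    {
        if (depth > 0)
            depth--;
        else
            super.endElement (uri, localName, qName);
    }

    @Override
    public void characters (char ch [], int start, int length)
    throws SAXException
    {
        if (depth == 0)
            super.characters (ch, start, length);
    }

    // Hypothetical helper: parser -> dropper -> name-recording consumer
    static List<String> namesAfterDropping (String xmlText, String drop)
    throws IOException, SAXException
    {
        final List<String> names = new ArrayList<> ();
        ElementDropper     stage = new ElementDropper (
                XMLReaderFactory.createXMLReader (), drop);

        stage.setContentHandler (new DefaultHandler () {
            @Override
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                names.add (localName);
            }
        });
        stage.parse (new InputSource (new StringReader (xmlText)));
        return names;
    }
}
```

Because the filter is itself an XMLReader, the application drives it exactly as it would drive a parser, and the downstream consumer could just as well be an XMLWriter.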

Figure 2-3. Simple SAX2 event pipeline

This kind of processing pipeline is a fundamental model for more advanced uses of SAX and for structuring components that are SAX-aware. We look at pipelines again in Section 4.5, "XML Pipelines" in Chapter 4, "Consuming SAX2 Events". For now, keep in mind that sometimes event consumers will be producing events for later processing components.

2.2.3.2. Concerns when writing XML text

There are several important issues to consider when writing XML output, which should be mentioned in the documentation for the XMLWriter you use. You may even be able to use your XMLWriter to canonicalize output, so you can safely compare processor output or create digital signatures. The GNU class shown earlier handles most of these directly, but that's not true for all such classes.

Such an XMLWriter is part of almost every developer's SAX toolkit, even though it isn't part of SAX itself. As you work with SAX, you'll probably start to collect and develop your own library of such reusable event consumer code.



Copyright © 2002 O'Reilly & Associates. All rights reserved.