Event Consumer Issues (SAX2)

B.2. Event Consumer Issues

You really shouldn't care, but since a Java String can't hold more than about two billion characters (its length is an int), and strings are used to pass certain document data to applications, there's a chance that some documents could cause trouble by overflowing that limit. If you encounter such a document, consult a pathologist; there really isn't much you can do about it.

B.2.1. Structural Issues

The [children] properties are arbitrarily sized, ordered sequences of information items, which SAX2 event callbacks present in document order. Most other collection-valued properties, such as [notations], [unparsed entities], and [attributes], are unordered. Only [children] properties need to be stored in order-preserving data structures.

While most information items are provided through a single callback, some of the more complex ones involve matched, and (except in one case) cleanly nested, pairs of calls that start and end the item, such as startElement() and endElement(). Such items include the Document itself, its Document Type Declaration, Elements, and Namespace Information. To track those items, applications usually maintain some kind of context stack.

The [parent] properties of some information items are implicitly encoded through such SAX2 nested event reports. Except for items that can be direct children of the Document or Document Type Information Items, applications often push stack entries when startElement() is called and pop them when endElement() is called.
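The push-and-pop pattern just described might look something like the following. This is only a minimal sketch: the class name ContextStackHandler, the "#document" marker, and the parentsOf() helper are invented for illustration and are not part of SAX2.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: recovers each element's [parent] from event nesting. */
class ContextStackHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<String>();
    /** One "child <- parent" record per element, in document order. */
    final List<String> parents = new ArrayList<String>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // The element on top of the stack is the [parent] of the one starting now.
        // An empty stack means the parent is the Document itself.
        String parent = open.isEmpty() ? "#document" : open.peek();
        parents.add(qName + " <- " + parent);
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    /** Illustrative helper: parses a string and returns the recorded links. */
    static List<String> parentsOf(String xml) {
        ContextStackHandler handler = new ContextStackHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.parents;
    }
}
```

A real application would normally push something richer than the element name (attributes, namespace context, application state), but the stack discipline stays the same.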

The children of Document and Document Type Information Items have curious restrictions: they don't always match the actual text structure. For example, information items for notations and unparsed entities are found in the Document Information Item, but they're textually part of the Document Type; and comments are stripped out of DTDs. You can use more natural structures in your applications if the descriptive Infoset structure seems awkward.

Other complex information items are implicitly decoded from DTD declarations. To track such items, applications must save declarations during DTD processing, to ensure that they can be correlated with information in the body of a document. Examples of such items include [notation] properties for Unparsed Entities and processing instructions, most properties for Unexpanded Entity References, and [references] properties of attributes.
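As a rough illustration of saving declarations for later correlation, the sketch below records notation and unparsed entity declarations through the DTDHandler callbacks, then looks up processing instruction targets against the saved notations. The class DeclarationTracker, its fields, and the track() helper are hypothetical names; the sketch also assumes the parser accepts a DOCTYPE with an internal subset.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: saves DTD declarations so body events can use them. */
class DeclarationTracker extends DefaultHandler {
    /** Notation name -> system identifier (null if only a public ID was given). */
    final Map<String, String> notations = new HashMap<>();
    /** Unparsed entity name -> name of its notation. */
    final Map<String, String> entityNotations = new HashMap<>();
    /** Targets of processing instructions that matched a declared notation. */
    final List<String> piNotations = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        notations.put(name, systemId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        entityNotations.put(name, notationName);
    }

    @Override
    public void processingInstruction(String target, String data) {
        // A PI has a [notation] only when its target names a declared notation.
        if (notations.containsKey(target)) {
            piNotations.add(target);
        }
    }

    /** Illustrative helper: parses a string and returns the populated tracker. */
    static DeclarationTracker track(String xml) {
        DeclarationTracker tracker = new DeclarationTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), tracker);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return tracker;
    }
}
```

Because DTD declarations are reported before the first startElement() call, the maps are already complete by the time the body of the document is processed.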

B.2.2. Base URIs, xml:base, and Locator Data

Some information items have a [base URI] property that is computed according to xml:base rules. Except for two cases, these rules amount to using Locator.getSystemId() to find the absolute base URI; the producer needs to provide this information. SAX2 effectively augments every information item with this information, as well as line and column location within such entities. (However, applications can cause this information to be lost if they provide InputSource objects without including those base URIs as the system IDs.)
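To make the parenthetical concrete, here is a small sketch of supplying the base URI when you hand the parser a stream or reader of your own. The class SystemIdDemo and its helper method are invented for this example, and the URI in the usage note is a placeholder.

```java
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: remembers the system ID its Locator reports. */
class SystemIdDemo extends DefaultHandler {
    private Locator locator;
    String seenSystemId;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Locator data is only meaningful during a callback, so sample it here.
        if (seenSystemId == null && locator != null) {
            seenSystemId = locator.getSystemId();
        }
    }

    /** Parses the reader's content, attaching baseUri as the system ID if given. */
    static String systemIdSeenFor(Reader in, String baseUri) {
        InputSource source = new InputSource(in);
        if (baseUri != null) {
            source.setSystemId(baseUri); // without this, the base URI is lost
        }
        SystemIdDemo handler = new SystemIdDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.seenSystemId;
    }
}
```

Calling systemIdSeenFor(reader, "http://example.org/docs/a.xml") should hand that URI back through the Locator; passing null for the base URI leaves the parser with nothing useful to report.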

The two exceptional cases are for Elements and for processing instructions within the document element. In these instances, the computation is complex because xml:base attributes can play a role; it is demonstrated in Example 5-1. Consumers must call Locator.getSystemId() in LexicalHandler.startEntity() to get the entity's URI whenever DeclHandler.externalEntityDecl() has shown that entity to be external. They must also maintain a stack of base URIs, augmenting it with xml:base values.
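The URI stack can be sketched in a much-reduced form as follows. This is not the book's Example 5-1: it deliberately ignores external entities (so there is no startEntity() handling), assumes a namespace-aware parser, and assumes the InputSource carried an absolute system ID. The names BaseUriTracker, baseOf, and basesIn() are hypothetical.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: computes each element's base URI from xml:base. */
class BaseUriTracker extends DefaultHandler {
    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";
    private final Deque<URI> bases = new ArrayDeque<URI>();
    private Locator locator;
    /** Element local name -> computed base URI, in document order. */
    final Map<String, String> baseOf = new LinkedHashMap<String, String>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Start from the parent's base, or the document's URI for the root.
        // Assumption: the system ID is present and absolute.
        URI base = bases.isEmpty() ? URI.create(locator.getSystemId()) : bases.peek();
        String xmlBase = atts.getValue(XML_NS, "base");
        if (xmlBase != null) {
            base = base.resolve(xmlBase); // relative values resolve against the parent
        }
        bases.push(base);
        baseOf.put(local, base.toString());
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        bases.pop();
    }

    /** Illustrative helper: parses a string that claims the given system ID. */
    static Map<String, String> basesIn(String xml, String systemId) {
        BaseUriTracker tracker = new BaseUriTracker();
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(systemId);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(source, tracker);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return tracker.baseOf;
    }
}
```

A processing instruction inside the document element would take whatever URI is on top of the stack when processingInstruction() is called.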

Application code should use Locator information to generate meaningful diagnostics. However, conforming applications will use the URI computed with xml:base when absolutizing relative URIs found in attribute values, character data, processing instructions, or (primarily for HTML legacy data models) comments. Except for the startDTD() call, all system identifiers reported through SAX are delivered as absolute URIs. An upcoming extension feature flag will probably let that behavior be changed, so you can choose whether the parser or the application absolutizes the URIs. Meanwhile, you should be aware that some SAX parsers have bugs in how they report such identifiers.
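For the diagnostic side, one plausible approach is a tiny formatting helper like the one below. The class name Diagnostics, the method at(), and the "uri:line:column: message" layout are all choices made for this sketch rather than anything SAX2 prescribes.

```java
import org.xml.sax.Locator;

/** Hypothetical helper: prefixes a message with Locator position data. */
class Diagnostics {
    static String at(Locator locator, String message) {
        if (locator == null) {
            return message; // parsers are not required to supply a Locator
        }
        String where = locator.getSystemId() == null ? "(unknown)" : locator.getSystemId();
        return where + ":" + locator.getLineNumber() + ":"
            + locator.getColumnNumber() + ": " + message;
    }
}
```

The URI shown here is the entity's own location, which is what a person fixing the document needs; it is not necessarily the xml:base-derived URI you would use to resolve relative references.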