Next up to bat is dealing with constraining XML. If there's nothing you get out of this chapter other than the rationale behind constraining XML, then I'm a happy author. Because XML is extensible and can represent data in hundreds and even thousands of ways, constraints on a document provide meaning to those various formats. Without document constraints, it is impossible (in most cases) to tell what the data in a document means. In this section, I'm going to cover the two current standard means of constraining XML: DTDs (included in the XML 1.0 specification) and XML Schema (recently a standard put out by the W3C). Choose the one best suited for you.
An XML document is not very usable without an accompanying DTD (or schema). Just as XML can effectively describe data, the DTD makes this data usable for many different programs in a variety of ways by defining the structure of the data. In this section, I show you the most common constructs used within a DTD. I use the XML representation of a portion of the table of contents for this book as an example again, and go through the process of constructing a DTD for the XML table of contents document.
The DTD defines how data is formatted. It must define each allowed element in an XML document, the allowed attributes and possibly the acceptable attribute values for each element, the nesting and occurrences of each element, and any external entities. DTDs can specify many other things about an XML document, but these basics are what we will focus on. You will learn the constructs that a DTD offers by applying them to and constraining the XML file from Example 2-1. The complete DTD is shown in Example 2-3, which I'll refer to in this section.
    <!ELEMENT book (title, contents, ora:copyright)>
    <!ATTLIST book
        xmlns     CDATA #REQUIRED
        xmlns:ora CDATA #REQUIRED
    >
    <!ELEMENT title (#PCDATA)>
    <!ATTLIST title
        ora:series (C | Java | Linux | Oracle | Perl | Web | Windows) #REQUIRED
    >
    <!ELEMENT contents (chapter+)>
    <!ELEMENT chapter (topic+)>
    <!ATTLIST chapter
        title  CDATA #REQUIRED
        number CDATA #REQUIRED
    >
    <!ELEMENT topic EMPTY>
    <!ATTLIST topic
        name CDATA #REQUIRED
    >
    <!-- Copyright Information -->
    <!ELEMENT ora:copyright (copyright)>
    <!ELEMENT copyright (year, content)>
    <!ATTLIST copyright
        xmlns CDATA #REQUIRED
    >
    <!ELEMENT year EMPTY>
    <!ATTLIST year
        value CDATA #REQUIRED
    >
    <!ELEMENT content (#PCDATA)>
    <!ENTITY OReillyCopyright SYSTEM
        "http://www.newInstance.com/javaxml2/copyright.xml"
    >
The bulk of the DTD is composed of ELEMENT definitions (covered in this section) and attribute definitions, which use the ATTLIST keyword (covered in the next). An element definition begins with the standard <! opening of a DTD declaration, followed by the ELEMENT keyword and then the name of the element. Following that name is the content model of the element. The content model is generally within parentheses, and specifies what content can be included within the element. Take the book element as an example:
<!ELEMENT book (title, contents, ora:copyright)>
This says that every book element contains a title element, a contents element, and an ora:copyright element. Each of these elements is itself declared later in the DTD with its own content model, and so on. You should be aware that in this standard case, the order specified in the content model is the order in which the elements must appear within the document. Additionally, each element must appear once and only once when no modifiers are used (which I'll cover momentarily). In this case, each book element must have a title element, a contents element, and then an ora:copyright element, without exception. If these rules are broken, the document is not considered valid (although it still could be well-formed).
Of course, in many cases you need to specify multiple occurrences of an element, or optional occurrences. You can do this using the recurrence modifiers listed in Table 2-1.
Operator | Description |
---|---|
[Default] | Must appear once and only once (1) |
? | May appear once or not at all (0..1) |
+ | Must appear at least once, with no upper limit (1..N) |
* | May appear any number of times, including not at all (0..N) |
As an example, take a look at the contents element definition:
<!ELEMENT contents (chapter+)>
Here, the contents element must have at least one chapter element within it, but there can be an unlimited number of those chapters.
If an element has character data within it, the #PCDATA keyword is used as its content model:
<!ELEMENT title (#PCDATA)>
If an element should always be an empty element, the EMPTY keyword is used:
<!ELEMENT topic EMPTY>
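If you want to watch these content models being enforced, a validating parser will do it for you. What follows is only a minimal sketch, assuming the JAXP parser that ships with the JDK; the class name DtdElementCheck and its validityErrors method are hypothetical, and the DTD is trimmed down to three element declarations from Example 2-3 and placed in an internal subset so the example needs no external files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; not part of the book's example code.
class DtdElementCheck {

    // Three element declarations from Example 2-3, inlined as an internal DTD subset.
    static final String DOCTYPE =
          "<!DOCTYPE contents [\n"
        + "  <!ELEMENT contents (chapter+)>\n"
        + "  <!ELEMENT chapter (topic+)>\n"
        + "  <!ELEMENT topic EMPTY>\n"
        + "]>\n";

    // Parses DOCTYPE + body with DTD validation on and returns the validity errors reported.
    static List<String> validityErrors(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                // Validity problems arrive here; the document may still be well-formed.
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness problems are fatal and abort the parse.
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // One chapter with two topics satisfies (chapter+) and (topic+).
        System.out.println(validityErrors("<contents><chapter><topic/><topic/></chapter></contents>"));
        // No chapter at all breaks the + modifier on chapter.
        System.out.println(validityErrors("<contents/>"));
        // Character data inside topic breaks the EMPTY content model.
        System.out.println(validityErrors("<contents><chapter><topic>text</topic></chapter></contents>"));
    }
}
```

Notice that the invalid documents still parse; the problems are reported as validity errors rather than as failures of well-formedness, which is exactly the distinction between valid and well-formed.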
Once you've handled the element definition, you'll want to define attributes. These are defined through the ATTLIST keyword. The first value is the name of the element, and then you have various attributes defined. Those definitions involve giving the name of the attribute, the type of attribute, and then whether the attribute is required or implied (which means it is not required, essentially). Most attributes with textual values will simply be of the type CDATA, as shown here:
<!ATTLIST chapter title CDATA #REQUIRED number CDATA #REQUIRED >
You can also specify a set of values that an attribute must take on for the document to be considered valid:
<!ATTLIST title ora:series (C | Java | Linux | Oracle | Perl | Web | Windows) #REQUIRED >
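The same kind of check works for attribute constraints. This is a rough sketch rather than production code, again assuming the JDK's JAXP parser: the class DtdAttributeCheck is hypothetical, and I've dropped the ora prefix from the series attribute to keep namespaces out of the picture.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; not part of the book's example code.
class DtdAttributeCheck {

    // The title declarations from Example 2-3, with the "ora:" prefix removed for simplicity.
    static final String DOCTYPE =
          "<!DOCTYPE title [\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ATTLIST title\n"
        + "    series (C | Java | Linux | Oracle | Perl | Web | Windows) #REQUIRED>\n"
        + "]>\n";

    // Returns the validity errors a validating parser reports for DOCTYPE + body.
    static List<String> validityErrors(String body) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // A value from the enumerated list is accepted.
        System.out.println(validityErrors("<title series='Java'>Java and XML</title>"));
        // A value outside the list is a validity error.
        System.out.println(validityErrors("<title series='Python'>Java and XML</title>"));
        // Leaving out a #REQUIRED attribute is a validity error too.
        System.out.println(validityErrors("<title>Java and XML</title>"));
    }
}
```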
You can specify entity reference resolution in a DTD using the ENTITY keyword. This works a lot like the DOCTYPE reference I talked about earlier, where a public ID and/or system ID may be specified. In the example DTD, I've specified a system ID, a URL, for the OReillyCopyright entity reference to resolve to:
<!ENTITY OReillyCopyright SYSTEM "http://www.newInstance.com/javaxml2/copyright.xml" >
This results in the copyright.xml file at the specified URL being loaded as the value of the O'Reilly copyright entity reference in the sample document. You'll see this in action in the next few chapters.
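To make the resolution step concrete without going out to the network, here is a small sketch that intercepts the system ID and hands the parser a local stand-in. It assumes the JDK's JAXP parser with its default settings; the class name EntityResolutionSketch is hypothetical, and the LOCAL_COPY text is a placeholder I made up, not the content of the real copyright.xml file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the book's example code.
class EntityResolutionSketch {

    // The system ID used for the OReillyCopyright entity in the example DTD.
    static final String SYSTEM_ID = "http://www.newInstance.com/javaxml2/copyright.xml";

    // Placeholder content standing in for the remote file (an assumption for this sketch).
    static final String LOCAL_COPY = "<copyright>placeholder copyright text</copyright>";

    // The entity is declared in an internal subset and referenced once in the body.
    static final String DOCUMENT =
          "<!DOCTYPE book [\n"
        + "  <!ENTITY OReillyCopyright SYSTEM \"" + SYSTEM_ID + "\">\n"
        + "]>\n"
        + "<book>&OReillyCopyright;</book>";

    // Parses DOCUMENT and returns the text of the root element after entity expansion.
    static String expandedText() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Supply the local copy when the parser asks for SYSTEM_ID;
            // returning null would let the parser fetch the URL itself.
            builder.setEntityResolver((publicId, systemId) ->
                SYSTEM_ID.equals(systemId)
                    ? new InputSource(new StringReader(LOCAL_COPY))
                    : null);
            Document doc = builder.parse(new InputSource(new StringReader(DOCUMENT)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedText());
    }
}
```

Without the resolver, the parser would try to load the URL named in the system ID, which is what the entity declaration asks for.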
Now this is hardly an extensive reference on DTDs, but it should give you enough basic knowledge to get going. As I've suggested, have some additional resources specifically on XML available (like XML in a Nutshell) as you go through this book in case you run across something you're unsure about. By assuming that you have that or the online specifications from http://www.w3.org around, I can delve into Java topics more quickly.
XML Schema is a newly finalized recommendation from the W3C. It seeks to improve upon DTDs by adding stronger data typing and quite a few more constructs, and by being written in XML itself. I'm going to spend relatively little time here talking about schemas, because they are a "behind-the-scenes" detail for Java and XML. In the chapters where you'll be working with schemas (Chapter 14, "Content Syndication", for instance), I'll address specific points you need to be aware of. However, the specification for XML Schema is so enormous that it would take up an entire book of explanation on its own (see the book XML Schema on this CD-ROM). Example 2-4 shows the XML Schema constraining Example 2-1.
    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns="http://www.oreilly.com/javaxml2"
               xmlns:ora="http://www.oreilly.com"
               targetNamespace="http://www.oreilly.com/javaxml2"
               elementFormDefault="qualified"
    >
      <xs:import namespace="http://www.oreilly.com"
                 schemaLocation="contents-ora.xsd" />
      <xs:element name="book">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="title" />
            <xs:element ref="contents" />
            <xs:element ref="ora:copyright" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="title">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute ref="ora:series" use="required" />
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="contents">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="chapter" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="topic" maxOccurs="unbounded">
                    <xs:complexType>
                      <xs:attribute name="name" type="xs:string" use="required" />
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
                <xs:attribute name="title" type="xs:string" use="required"/>
                <xs:attribute name="number" type="xs:byte" use="required"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
In addition, you'll need the schema in Example 2-5, for reasons you will soon understand.
    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns="http://www.oreilly.com"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.oreilly.com"
               attributeFormDefault="qualified"
               elementFormDefault="qualified"
    >
      <xs:attribute name="series" type="xs:string"/>
      <xs:element name="copyright" type="xs:string" />
    </xs:schema>
Before diving into the specifics of these schemas, notice that various namespace declarations are made. First, the XML Schema namespace itself is attached to the xs prefix, allowing separation of XML Schema constructs from the elements and attributes being constrained. Next, the default namespace is attached to the namespace of the elements being defined; in Example 2-4 this is the Java and XML namespace, and in Example 2-5 it's the O'Reilly namespace. I've also assigned the targetNamespace attribute this same value. This attribute specifies to the schema the namespace of the elements and attributes being constrained. This is easy to forget, and can wreak a lot of havoc, so be careful to include it. At this point, namespaces are defined for the elements being constrained (the default namespace) and the constructs being used (the XML Schema namespace).
Last, I've specified the value of attributeFormDefault and elementFormDefault as "qualified." This indicates that I'll use fully qualified names for the elements and attributes, rather than just local names. I won't go into detail about this, but I highly recommend you use qualified names at all times. Trying to deal with multiple namespaces and unqualified names at the same time is a mess I wouldn't want to wander into.
Elements are defined with the element construct, which names the element through its name attribute. You'll generally need to define your own data types by nesting a complexType tag within the element element. Take a look at this fragment of Example 2-4:
    <xs:element name="book">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="title" />
          <xs:element ref="contents" />
          <xs:element ref="ora:copyright" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
Here, I've specified that the book element has complex content. Within it there should be three elements: title, contents, and ora:copyright. By using the sequence construct, I've ensured that they appear in the specified order; and with no modifiers, an element must appear once and only once. For each of these other elements, I've used the ref keyword to reference another element definition. This points to the definitions for each of these elements in another part of the schema, and keeps things organized and easy to follow.
Later in the file, the title element is defined:
    <xs:element name="title">
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute ref="ora:series" use="required" />
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>
This element is really just a simple XML Schema string type; however, I've added an attribute to it, so I must define a complexType. Since I'm extending an existing type, I use the simpleContent and extension keywords (as nested elements) to define this type. simpleContent informs the schema that this is a basic type, and extension, with the base of "xs:string", lets the schema know I want to allow just what the XML Schema string type allows, plus the additional attribute defined here (with the attribute keyword). For the attribute itself, I reference the attribute declared elsewhere (in the O'Reilly schema), and specify that it must appear for this element (through use="required"). I realize that this paragraph is a mouthful, and not completely obvious; however, take your time and you'll get it all.
One other thing you'll notice is the use of minOccurs and maxOccurs attributes on the element element; these attributes allow an element to appear a specified number of times other than the default, which is once and only once. For example, specifying minOccurs="0" and maxOccurs="1" allows an element to appear once, or not at all. To allow an element to appear an unlimited number of times, you can use the value of "unbounded" for the maxOccurs attribute, as in Example 2-4.
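Here is a quick way to try out these occurrence rules from Java. Treat it as a sketch under stated assumptions: it uses the javax.xml.validation API from the JDK, the class name OccurrenceCheck is hypothetical, and the schema is a stripped-down, namespace-free cousin of Example 2-4 in which I've invented an optional preface element purely to show minOccurs="0".

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class; not part of the book's example code.
class OccurrenceCheck {

    // "preface" is an invented element: optional (0..1). "chapter" must appear 1..N times.
    static final String SCHEMA =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + " <xs:element name='contents'>"
        + "  <xs:complexType>"
        + "   <xs:sequence>"
        + "    <xs:element name='preface' type='xs:string' minOccurs='0' maxOccurs='1'/>"
        + "    <xs:element name='chapter' type='xs:string' maxOccurs='unbounded'/>"
        + "   </xs:sequence>"
        + "  </xs:complexType>"
        + " </xs:element>"
        + "</xs:schema>";

    // Compiles the schema above; a failure here means the sketch itself is broken.
    static Schema compileSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    // Returns true when the document validates, false when the validator rejects it.
    static boolean isValid(String xml) {
        try {
            compileSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handling throws on the first error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // No preface, one chapter.
        System.out.println(isValid("<contents><chapter>a</chapter></contents>"));
        // One preface, several chapters.
        System.out.println(isValid(
            "<contents><preface>p</preface><chapter>a</chapter><chapter>b</chapter></contents>"));
        // No chapter at all: minOccurs defaults to 1.
        System.out.println(isValid("<contents/>"));
        // Two prefaces: maxOccurs is 1.
        System.out.println(isValid(
            "<contents><preface>p</preface><preface>q</preface><chapter>a</chapter></contents>"));
    }
}
```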
You'll notice that I defined two schemas, though, which may have you puzzled. For each namespace in a document, one schema must be defined. Additionally, you can't use the same external schema for both namespaces, and simply point both at that external schema. As a result, using the ora prefix and namespace requires an additional schema, which I called contents-ora.xsd. You'll also need to use the schemaLocation attribute I talked about earlier to reference this schema; however, don't add another attribute. Instead, you can append another namespace and schema-location pair to the end of the value of the attribute, as shown here:
    <book xmlns="http://www.oreilly.com/javaxml2"
          xmlns:ora="http://www.oreilly.com"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.oreilly.com/javaxml2 XSD/contents.xsd
                              http://www.oreilly.com XSD/contents-ora.xsd"
    >
This essentially says that for the namespace http://www.oreilly.com/javaxml2, definitions should be looked up in the schema called contents.xsd in the XSD/ directory. For the http://www.oreilly.com namespace, use the contents-ora.xsd schema in the same directory. You'll then need to define the two schemas I showed you in Example 2-4 and Example 2-5. Finally, import the O'Reilly schema into the Java and XML one, since the Java and XML schema refers to an element and an attribute declared in the O'Reilly one:
<xs:import namespace="http://www.oreilly.com" schemaLocation="contents-ora.xsd" />
This import is fairly self-explanatory, so I won't dwell on it. You should realize that dealing with multiple namespaces is about the most complex thing you can do in schemas, and can easily trip you up. (It tripped me up, until Eric van der Vlist saved the day.) I also recommend a good XML Schema-capable editor. While I'm generally slow to recommend commercial products, in this case XMLSpy 4.0 (http://www.xmlspy.com) turned out to be wonderfully helpful.
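If you'd like to see the two-schema arrangement work end to end, the following sketch may help. It rests on several assumptions: it uses the JDK's javax.xml.validation API with default settings, the class name TwoSchemaCheck is hypothetical, both schemas are heavily reduced versions of Example 2-4 and Example 2-5, and they are written to a temporary directory (which the sketch does not clean up) so that the relative schemaLocation in the import can be resolved.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class; not part of the book's example code.
class TwoSchemaCheck {

    // Reduced stand-in for Example 2-5: only the copyright element.
    static final String ORA_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + "    targetNamespace='http://www.oreilly.com' elementFormDefault='qualified'>"
        + "  <xs:element name='copyright' type='xs:string'/>"
        + "</xs:schema>";

    // Reduced stand-in for Example 2-4: a book holding a title and an imported ora:copyright.
    static final String MAIN_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + "    xmlns:ora='http://www.oreilly.com'"
        + "    targetNamespace='http://www.oreilly.com/javaxml2' elementFormDefault='qualified'>"
        + "  <xs:import namespace='http://www.oreilly.com' schemaLocation='contents-ora.xsd'/>"
        + "  <xs:element name='book'>"
        + "    <xs:complexType>"
        + "      <xs:sequence>"
        + "        <xs:element name='title' type='xs:string'/>"
        + "        <xs:element ref='ora:copyright'/>"
        + "      </xs:sequence>"
        + "    </xs:complexType>"
        + "  </xs:element>"
        + "</xs:schema>";

    // Writes both schemas side by side and compiles the main one, which pulls in the import.
    static Schema loadSchema() {
        try {
            Path dir = Files.createTempDirectory("xsd");
            Files.write(dir.resolve("contents-ora.xsd"), ORA_XSD.getBytes(StandardCharsets.UTF_8));
            Path main = dir.resolve("contents.xsd");
            Files.write(main, MAIN_XSD.getBytes(StandardCharsets.UTF_8));
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(main.toFile());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("schemas did not load", e);
        }
    }

    // Returns true when the document validates against the combined schemas.
    static boolean isValid(String xml) {
        try {
            loadSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // copyright in the O'Reilly namespace, as the schema requires.
        System.out.println(isValid(
            "<book xmlns='http://www.oreilly.com/javaxml2' xmlns:ora='http://www.oreilly.com'>"
          + "<title>Java and XML</title><ora:copyright>text</ora:copyright></book>"));
        // copyright left in the default namespace, which the schema does not allow.
        System.out.println(isValid(
            "<book xmlns='http://www.oreilly.com/javaxml2'>"
          + "<title>Java and XML</title><copyright>text</copyright></book>"));
    }
}
```

Here the schema is handed to the validator directly, so the documents don't need an xsi:schemaLocation attribute; that attribute matters when you leave it to the parser to locate the schemas from the document itself.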
I've barely scratched the surface of either DTDs or XML Schema, and there are even other constraint models not covered at all! For example, Relax (and Relax NG, which includes what used to be TREX) is gaining a lot of steam, as it's considered a lot easier and more lightweight than XML Schema. You can check out the activity online at http://www.oasis-open.org/committees/relax-ng/. No matter what technology you choose, though, you should be able to find something that helps you constrain your XML documents. With these constraints in place, validation and interoperability become a snap. Consider yourself educated on XML constraints, and get ready to move on to the next topic in this whirlwind tour: XML transformations.
Copyright © 2002 O'Reilly & Associates. All rights reserved.