The Text Encoding Initiative (TEI, http://www.tei-c.org/) is an SGML application designed for the markup of classic literature, such as Virgil's Aeneid or the collected works of Thomas Jefferson. It's a prime example of a narrative-oriented DTD. Since TEI is designed for scholarly analysis of text rather than more casual reading or publishing, it includes elements not only for common document structures (chapter, scene, stanza, etc.) but also for typographical elements, grammatical structure, the position of illustrations on the page, and so forth. These aren't important to most readers, but they are important to TEI's intended audience of humanities scholars. For many academic purposes, one manuscript of the Aeneid is not necessarily the same as the next. Transcription errors and emendations made by various monks in the Middle Ages can be crucial.
As an SGML application, TEI uses several features of SGML not found in XML, including the & connector and tag minimization. However, XML is clearly the wave of the future, so like most evolving SGML applications, TEI is moving toward XML. A light version of the TEI DTD, TEI Lite, is available for authors who prefer to work in pure XML. It's not exactly the same as the SGML version, but it's very close for many practical uses.
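To make those two SGML features concrete, here is a minimal DTD sketch. The element names name, first, last, and para are hypothetical and are not taken from the TEI DTD. Each comment quotes an SGML-style declaration, and the declaration beneath it shows one way the same constraint might be written in XML.

```xml
<!-- SGML: <!ELEMENT name - - (first & last)>
     The & connector requires both children, in either order.
     XML has no & connector, so both orders are spelled out: -->
<!ELEMENT name ((first, last) | (last, first))>

<!-- SGML: <!ELEMENT para - O (#PCDATA)>
     "- O" is tag minimization: the start-tag is required but the
     end-tag may be omitted. XML declarations carry no such flags,
     and every para element needs an explicit end-tag: -->
<!ELEMENT para (#PCDATA)>
```

The fragment is incomplete on purpose; a real DTD would also declare first and last.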
Example 6-1 shows a fairly simple TEI Lite document that uses the XML version of the TEI DTD. The content comes from the book you're reading now. Although a complete TEI-encoded copy of this manuscript would be much longer, this simple example demonstrates the basic features of most TEI documents that represent books. (Besides prose, TEI can be used for plays, poems, missals, and essentially any other written form of literature.)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE TEI.2 SYSTEM "xteilite.dtd">
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
      <publicationStmt><p></p></publicationStmt>
      <sourceDesc><p>Early manuscript draft</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text id="HarXMLi">
    <front>
      <div type='toc'>
        <head>Table Of Contents</head>
        <list>
          <item>Introducing XML</item>
          <item>XML as a Document Format</item>
          <item>XML on the Web</item>
        </list>
      </div>
    </front>
    <body>
      <div1 type="chapter">
        <head>Introducing XML</head>
        <p></p>
      </div1>
      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <p>
          XML is first and foremost a document format. It was always
          intended for web pages, books, scholarly articles, poems,
          short stories, reference manuals, tutorials, texts, legal
          pleadings, contracts, instruction sheets, and other documents
          that human beings would read. Its use as a syntax for computer
          data in applications like syndication, order processing,
          object serialization, database exchange and backup, electronic
          data interchange, and so forth is mostly a happy accident.
        </p>
        <div2 type="section">
          <head>SGML's Legacy</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>TEI</head>
          <p></p>
        </div2>
        <div2 type="section">
          <head>DocBook</head>
          <p>
            DocBook (<hi>http://www.docbook.org/</hi>) is an SGML
            application designed for new documents, not old ones. It's
            especially common in computer documentation. Several
            O'Reilly books have been written in DocBook including
            <bibl><author>Norm Walsh</author>'s <title>DocBook: The
            Definitive Guide</title></bibl>. Much of the
            <abbr expan='Linux Documentation Project'>LDP</abbr>
            (<hi>http://www.linuxdoc.org/</hi>) corpus is written in
            DocBook.
          </p>
        </div2>
      </div1>
      <div1 type="chapter">
        <head>XML on the Web</head>
        <p></p>
      </div1>
    </body>
    <back>
      <div1 type="index">
        <list>
          <head>INDEX</head>
          <item>SGML, 8, 9, 91, 92, 94</item>
          <item>DocBook, 97-101</item>
          <item>TEI, 94-97, 101</item>
          <item>Text Encoding Initiative, See TEI</item>
        </list>
      </div1>
    </back>
  </text>
</TEI.2>
The root element of this and all TEI documents is TEI.2. This root element is always divided into two parts, a header represented by a teiHeader element and the main content of the document represented by a text element. The header contains information about the source document (for instance, exactly which medieval manuscript the text was copied from), the encoding of the document, some keywords describing the document, and so forth.
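As a sketch of how the header can record the source document, the fragment below reuses the fileDesc structure from Example 6-1. The publication statement and the manuscript description are placeholders invented for illustration, not a real catalog entry, and the parts of the header that describe the encoding and keywords are left out.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Aeneid</title>
      <author>Virgil</author>
    </titleStmt>
    <publicationStmt><p>Unpublished transcription</p></publicationStmt>
    <!-- sourceDesc says which source the text was copied from;
         this description is a hypothetical placeholder -->
    <sourceDesc>
      <p>Transcribed from a hypothetical medieval manuscript copy</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```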
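The text element is itself divided into three parts: front matter in a front element, the main content in a body element, and back matter in a back element. In Example 6-1, each part holds divisions (div and div1 elements) whose type attribute identifies them, for instance as a table of contents, a chapter, or an index.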
The divisions may be further subdivided; div1s can contain div2s, div2s can contain div3s, div3s can contain div4s, and so on up to div7. However, for any given work, there is a smallest division. This division contains paragraphs represented by p elements for prose or stanzas represented by lg elements for poetry. Stanzas are further broken up into individual lines represented by l elements.
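The nesting described above can be sketched as follows. This is an illustrative fragment only: the type values, headings, and placeholder text are assumptions made for the example, not content from any encoded work.

```xml
<body>
  <div1 type="book">
    <head>Prose example</head>
    <div2 type="chapter">
      <head>A chapter</head>
      <div3 type="section">
        <head>A section</head>
        <!-- div3 is the smallest division here, so it holds the paragraphs -->
        <p>Placeholder paragraph text.</p>
      </div3>
    </div2>
  </div1>
  <div1 type="poem">
    <head>Verse example</head>
    <!-- each lg is one stanza; each l is one line of verse -->
    <lg>
      <l>Placeholder first line of the first stanza</l>
      <l>Placeholder second line of the first stanza</l>
    </lg>
    <lg>
      <l>Placeholder first line of the second stanza</l>
      <l>Placeholder second line of the second stanza</l>
    </lg>
  </div1>
</body>
```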
Both lines and paragraphs contain mixed content; that is, they contain plain text interspersed with child elements. These elements mark up particular words or characters as people's names (name), corrections (corr), illegible passages (unclear), misspellings (sic), and so on.
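A short sketch of such mixed content, using the four elements just named, might look like this. The sentence itself is invented purely for illustration and does not transcribe any real document.

```xml
<p>
  <name>Thomas Jefferson</name> <sic>recieved</sic> the letter in
  <corr>Philadelphia</corr>, where he read it <unclear>twice</unclear>.
</p>
```

Here sic preserves a misspelling as it appears in the source, corr holds a reading the encoder has corrected, and unclear flags a word that is hard to make out.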
This structure closely reflects the structure of the actual documents being encoded in TEI. The same is true of most narrative-oriented XML applications that need to handle fairly generic documents, which makes TEI a representative example of typical XML document structure.
Copyright © 2002 O'Reilly & Associates. All rights reserved.