Bare Bones TEI: A Very Very Small Subset of the TEI Encoding Scheme

C. M. Sperberg-McQueen
Document No. TEI U6
30 Aug 1994, rev. June 1995

An HTML version of this document may be retrieved from http://www.uic.edu/orgs/tei/intros/teiu6.split.html; it has been split over several files to make retrieval faster. A single-file version for easier printing is at http://www.uic.edu/orgs/tei/intros/teiu6.html.

Preface

Mark Olsen ARTFL Database University of Chicago 1050 E. 59th St. Chicago, Illinois 60637

Dear Mark,

A few months ago, when the TEI published its Guidelines, and you saw the 1300 pages, and hefted the seven pounds, of the Guidelines for Electronic Text Encoding and Interchange, you wrote me, as you may remember, words to the effect: "Your bricks landed on my desk today. Is there a Cliff Notes version? A bare-bones TEI, without any of the optional stuff, just the absolute minimal TEI encoding scheme?"

This is my attempt to provide you what you asked for — but only half of my effort to provide you with what I think you really need. The half I don't think you were asking for, though other people have, is a sort of Pocket-sized Guide to the TEI: a version of the TEI encoding scheme which is small enough to be understood without too much trouble, but large enough to do reasonably serious work with, and powerful enough to suffice for most people's work encoding electronic text, most of the time. Lou and I have discussed this at some length, and a pocket-sized TEI (aka TEI Lite) is now documented in a little paper called An Introduction to TEI Tagging (document number TEI U5).

You didn't ask for a TEI Lite, though: you asked for something even smaller and more austere: you asked me to isolate the absolute minimum set of TEI tags, without which it's difficult to imagine making any useful electronic text nowadays at all. That is what I have done in this document.

Note, however, that what you have in your hands is emphatically not an attempt at a realistic markup scheme for real use in encoding new texts. It is a definition of a toy markup language: the absolute minimum is not necessarily a useful minimum. In particular, although this tag set may conceivably suffice for the translation of ARTFL texts and other pre-existing data into TEI form, still I think that when you set about creating new electronic texts, you would be crazy to limit yourself to the textual features listed here, and I hope that, despite your well publicized antipathy to any rational scheme of text markup, and despite the ample measure of craziness which your friends all know and treasure in you, you won't do such a silly thing.

The tag set defined here is simple enough that you should be able to get familiar with it in half an hour, become proficient in it in an afternoon or so, and outgrow it completely in a day or a week or two. And it is a clean subset of the full TEI encoding scheme, so that when you do outgrow this bare-bones tag set, and start (as I hope) looking at TEI Lite and the full TEI markup language, you will already have a firm grasp of the basics of TEI encoding, and can easily integrate the additional tags into the mental framework you built while assimilating this bare-bones TEI scheme. In order to encourage you, and other readers who share your predilection for craziness, to move eventually to the full TEI markup scheme, I mention periodically in this outline the tags in the TEI header and the TEI core tag set which are not included here, so that you will know what you're missing.

I have to confess that Lou is skeptical about the definition of this bare-bones TEI subset. Like me, he thinks that it won't be useful for serious encoding of real data, but he disagrees with my belief that it may nevertheless be useful to those encountering the TEI for the first time. I hope it will be useful, by (a) reducing the clutter so you can see the basic outlines of the TEI scheme more clearly — the tags included here are the ones everyone is going to need to use — and (b) demonstrating, by a reductio ad absurdum, how reducing a tag set to this size (it's about the same size as the first version of HTML) forces one to omit too much material which can be useful in the encoding of virtually any text, and which is absolutely essential for dealing rationally with some texts. Lou thinks I am dreaming; time will tell.

So: here is the bare-bones TEI subset you asked for — may you read it in good health, and may it prove useful in showing you how to translate your existing data into TEI form, and extending your existing software to handle TEI data. (N.B. maledictions will rain on your head if you implement support for this subset but not for the full TEI DTD. And what's more, you'll deserve every malediction in the book.) Use it to encode some simple exercises in SGML and TEI tagging. A few exercises should suffice to persuade you that you'll need a larger scheme (e.g. the full TEI scheme) for serious encoding of texts you hope anyone will work with. Use TEI Lite instead of standard TEI if you must, but don't limit yourself to the skeletal tag set (perhaps I should say, cadaverous tag set) sketched here. Even you aren't that crazy.

Best regards, Michael

Introduction

This document describes a bare-bones tag set taken from the Guidelines for Electronic Text Encoding and Interchange published in 1994 by the Text Encoding Initiative (TEI). The tags described have been chosen to serve as a simple introduction to the full markup scheme described in the Guidelines; they may suffice in some cases for the creation of simple electronic texts, but serious work will require a larger selection of the TEI tags. The reader is encouraged to use this document as a first introduction to TEI tagging, and to progress, after reading this document and using its tag set for a while, to a study of other TEI documentation, either the document called TEI Lite: An Introduction to TEI Tagging (document number TEI U5), or the full text of the Guidelines themselves (Guidelines for Electronic Text Encoding and Interchange, document number TEI P3).

This document introduces the tags informally, with examples. As an incentive to learn the full TEI tag set, it mentions, from time to time, tags which are in the full tag set but have been omitted here to keep the bare-bones tag set simple. Such references may be ignored on first reading. Fuller discussion of all tags, and their formal descriptions in terms of the Standard Generalized Markup Language (SGML) may be found in the Guidelines.

Bare-Bones SGML

SGML, the Standard Generalized Markup Language, is a formal language for representing text in electronic form. The TEI tag set is defined in terms of SGML, and all TEI-conformant documents must also conform to SGML.

In SGML-based encoding schemes, a document is represented by a combination of content (roughly speaking, the characters of the text, what you see on a printed page when the text is printed out) and markup (roughly speaking, information about the structure of the text, or features important for proper processing of the text, such as its division into chapters and sections, or the fact that a given phrase is a technical term and must be italicized). Non-SGML software, such as proprietary word processors, uses a similar division into content and markup. In sophisticated software, markup is usually invisible to the user unless a reveal-codes function or the like is used to make it visible.

SGML differs from proprietary markup systems in several ways:

SGML is non-proprietary, and fully documented independent of any SGML software, so you can move your documents from one program to another at any time without losing any information. In contrast, conversion among proprietary systems is notoriously difficult and error prone.

All systems use the markup to decide how to process the text, but in SGML systems the markup is typically defined in abstract terms, rather than directly in processing terms. An SGML system is more likely to mark a phrase as a technical term than simply as italic: a separate style sheet is normally used to determine whether technical terms are italicized, bolded on first reference, or displayed in blue, and to specify further processing such as adding an entry for the term to a glossary of technical terms. Proprietary systems nowadays often have style sheet mechanisms, too, but these seldom match SGML software in convenience or power.

SGML-based markup schemes, called document type definitions (DTDs), are typically accompanied by system-independent documentation of the markup. The document you are reading provides precisely this type of system-independent documentation: since you can use these tags with any SGML software, this paper cannot describe what will happen on the screen, or what keys you must press, with the particular program you choose to use. Such matters will vary from program to program, and you should consult the software documentation for help.

Unlike proprietary systems, SGML systems invariably provide facilities for exporting documents in standard (SGML) form, so they can easily be used by other software. Some enthusiasts phrase this as a sort of slogan: "With SGML, you own your documents; without SGML, your documents are owned by people in Orem, Utah, or in Redmond, Washington. Which do you prefer?"

There are other differences, but these will do for now.

SGML markup takes three forms: declarations, entity references, and tags. (I cannot tell a lie: there are actually four forms of markup. The fourth, processing instructions, won't concern us here.)

Declarations are used to define the tags and entity references which are legal in a document type. Since the tags and entities we are concerned with here have all already been defined by the TEI, there is no need to discuss declarations further in this document. You will need to learn about them if you want to customize the TEI tag sets, but that won't be covered here. The only form of declaration you need to know about, to follow the examples below, is the comment, which is preceded by <!-- and followed by -->:

<!-- This is a comment. -->

Entities are named portions of documents, which may be stored separately; entity references show where each entity goes. Among other things, entity references are used to embed special characters in the text when, as often happens, the characters in question are not available on the keyboard. Some entities for special characters are defined in international standards. For example, the entity eacute names the character e with an acute accent (é). When the standard entity sets are in use, the following two examples are identical in meaning:

L'&eacute;tat, c'est moi.

L'état, c'est moi.

(In case this has been corrupted in transmission, or is being rendered on a device without accented characters, that second one is the same as the first, except that the reference to the entity eacute in characters 3-10 of the first example has been replaced with a real e with an acute accent in the system's native character set in character 3 of the second example.)

Entities are also used to handle graphics and other material in non-SGML notations, and to divide a document up into sections stored in separate files for purposes of simpler maintenance, but we won't discuss such uses here.

Tags mark the beginning and ending of parts of the document; the parts themselves are called elements. Normally, tags are marked in the document by angle brackets; end-tags have a slash after the opening angle bracket. In the following example, the sentence is marked as a quotation by the start-tag and end-tag which surround it; quote is an element type defined by the TEI.

<quote>L'état, c'est moi.</quote>

Elements always have a basic type (in the example above, it is quote); they may also have other attributes, which are indicated by special notations inside the start-tag for the element. For example, the TEI defines the attribute lang as applying to every type of element; its value indicates the language of the element's content, using standard two- or three-letter abbreviations (e.g. fra for French).

<quote lang="fra">L'état, c'est moi.</quote>

Some attributes may be restricted to certain types of values. Attributes of type id, for example, must provide a unique name or identifier for the element on which they appear; this identifier can be referred to by other attributes, of type idref (id reference). The TEI defines a global attribute named id, of type id, for use in cross-references and other kinds of hypertext links. (TEI attributes are called global when they apply to every type of element.)

Finally, it should be noted that SGML allows some tags to be omitted from documents, in cases when they are logically redundant and their location can be inferred from that of other tags; in the examples given here, we will not exploit this facility, but always give all tags explicitly. Tag omission is generally of interest only to those working without an SGML editor.

In sum: in SGML, everything is delimited. Elements are delimited by start- and end-tags. Start- and end-tags are delimited by angle brackets. Attribute values are delimited by single or double quotes. Entity references are delimited by ampersand and semicolon.
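
All four delimiter rules can be seen at work in a single line. The following sketch reuses the quotation from the examples above; nothing in it is new except the comment:

```sgml
<!-- element: delimited by a start-tag and an end-tag
     tags: delimited by angle brackets
     attribute value: delimited by double quotes
     entity reference: delimited by ampersand and semicolon -->
<quote lang="fra">L'&eacute;tat, c'est moi.</quote>
```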

That's all there is to it. If you understand the rules just described, you should have no trouble understanding all the SGML examples in this document.

Basic Text Encoding

A TEI-conformant electronic text consists of the text itself (transcribed from some source, or created in electronic form), preceded by a TEI header, which identifies the electronic text and can also document the encoding practices used in creating it. The entire thing is enclosed within a tei.2 element, and preceded by an SGML declaration identifying the document type to be used in validating the document.

The SGML declaration won't be described here. Further below, I'll discuss the TEI header, and the specialized tags for front matter and back matter of the main text. In work with electronic text, however, the vast majority of one's time is spent within the body of the text itself, and so I begin with a description of tags for basic text encoding: paragraphs and other paragraph-like things, character- or phrase-level elements which occur within paragraphs, and so on.

Paragraphs

Mark paragraphs with the tag p. Paragraphs do not nest, and neither may p elements. For example:

<p>I call specific attention to the authority given by the 21st Amendment to the Constitution to prohibit transportation or importation of intoxicating liquors into any State in violation of the laws of such State.</p>

<p>I ask the wholehearted cooperation of all our citizens to the end that this return of individual freedom shall not be accompanied by the repugnant conditions that obtained prior to the adoption of the 18th Amendment and those that have existed since its adoption. Failure to do this honestly and courageously will be a living reproach to us all.</p>

<p>I ask especially that no State shall by law or otherwise authorize the return of the saloon either in its old form or in some modern guise.</p>

This example, like most of the others not otherwise identified, is from Franklin D. Roosevelt's proclamation upon the repeal of Prohibition, in The Public Papers and Addresses of Franklin D. Roosevelt, vol. II (New York: Random House, 1938), pp. 510-514.

Highlighted Phrases

Phrases which are highlighted in the source (or should be highlighted in the output), whether by italics, boldface, small caps, or other special treatment, should be tagged with the hi element. The rend attribute may optionally say how the phrase was highlighted. In the example below, the word whereas and the phrase therefore, I, Franklin D. Roosevelt are printed in small caps in the source:

Whereas the Congress of the United States ...

Whereas Section 217(a) of the Act of Congress entitled "An Act ..." ...

Whereas it appears ...

Now, therefore, I, Franklin D. Roosevelt, President of the United States of America ... do hereby proclaim that the Eighteenth Amendment to the Constitution of the United States was repealed on the fifth day of December, 1933.

The rend attribute may be omitted if the rendering is of no interest, or if all highlighted phrases are rendered the same way. Its values may be chosen arbitrarily by the encoder; the values chosen can then be used in turn to direct processing software to display or process the element correctly.
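
As a sketch, the small-caps passage above might be tagged along the following lines. The rend value "sc" is an assumption chosen for illustration; since the values are up to the encoder, any consistent label would serve.

```sgml
<!-- rend="sc" is an assumed label for small caps, not a fixed TEI value -->
<p><hi rend="sc">Whereas</hi> the Congress of the United States ...</p>
<p><hi rend="sc">Whereas</hi> Section 217(a) of the Act of Congress
entitled "An Act ..." ...</p>
<p><hi rend="sc">Whereas</hi> it appears ...</p>
<p>Now, <hi rend="sc">therefore, I, Franklin D. Roosevelt</hi>, President
of the United States of America ... do hereby proclaim that ...</p>
```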

It is normally preferable to mark phrases with element types indicating why they are highlighted, rather than simply indicating that they are highlighted. The full TEI encoding scheme defines elements which allow typographic highlighting to be identified as marking linguistic emphasis (emph), words in foreign languages (foreign), words in non-standard or specialized languages (distinct), technical terms (term), glosses on terms (gloss), and words mentioned rather than used (mentioned). The generic hi element is normally used only when it is economically or intellectually infeasible to supply one of the more informative alternatives.

Quotations

Mark quotations from other works, or dialog spoken by characters in a narrative, as q (quotation) elements:

Whereas Section 217(a) of the Act of Congress entitled "An Act ..." approved June 16, 1933, provides as follows: Section 217(a) The President shall proclaim the ...

Block quotations and inline quotations are distinguished only by the value of their rend attribute; for the former, use the value block or display, for the latter, use inline.
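
A sketch of how the passage above might be tagged as a block quotation. It assumes that the q element begins at the quoted statute text; the exact extent of the quotation is a guess.

```sgml
<!-- the position of the q start-tag is an assumption -->
<p>Whereas Section 217(a) of the Act of Congress entitled "An Act ..."
approved June 16, 1933, provides as follows:
<q rend="block">Section 217(a) The President shall proclaim the ...</q>
</p>
```

A short quotation running on within the sentence would carry rend="inline" instead.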

The full TEI scheme also provides a quote element which is restricted to real quotations from external sources, and unlike q may not be used for direct discourse and fictive quotations. Also provided there but missing here are cit, for quotations with attached bibliographic references to their sources, and soCalled, for material printed with scare quotes to indicate that the author disclaims full responsibility for it.

Cross References

References to other documents, or to other locations in the current document, should be tagged with the ref tag:

Section 217(a) of the Act of Congress ... approved June 16, 1933, provides as follows: ...

The full scheme defines an empty element called ptr for use when the actual phrase referring to the other document or section can be generated automatically by software, as is usually done in document production systems.

For cross references within the same SGML document, the target attribute may be used to indicate which section is being referred to; its value is the id value assigned to some element in the document. For example, the following cross reference:

Press Conference of October 11, 1933, Item 137, this volume.)

assumes the existence of some element elsewhere in the volume with the identifier given:

Press Conference, 11 October 1933

This example is from the note in the Public Papers which follows the proclamation of the repeal of Prohibition.

The div and head elements used in the example just given are described below.
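
The pairing of a ref with its target might be sketched as follows. The identifier "item137" and the leading word "See" are invented for illustration; only the element and attribute names come from the description above.

```sgml
<!-- the cross reference; "item137" is a hypothetical id value -->
<p>(See <ref target="item137">Press Conference of October 11, 1933,
Item 137, this volume</ref>.)</p>

<!-- elsewhere in the same document: the element being pointed at -->
<div id="item137">
<head>Press Conference, 11 October 1933</head>
<p> ... </p>
</div>
```

The value of target must match the id of exactly one element in the document, which is what allows an SGML parser to check that the reference points somewhere.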

Page Breaks

If the page breaks of the source are of interest, as they generally are for material transcribed from existing printed editions, record them using the pb element. This element is empty: that is, it has neither content nor an end-tag. It does not mark a passage or portion of the text, just a location within the text. The attribute n, defined for all TEI elements, should be used to indicate the page number; if page numbers from more than one edition are transcribed, the attribute ed should be used to distinguish the two paginations:
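
A sketch of such markers, with page numbers and edition labels made up for illustration:

```sgml
<!-- a single pagination: only n is needed -->
<pb n="511">

<!-- two paginations: the ed values "1938" and "rpt" are hypothetical labels -->
<pb ed="1938" n="511">
<pb ed="rpt" n="87">
```

Since pb is empty, there is no end-tag to supply.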

In addition to page breaks, column and line breaks may be of interest; the full TEI scheme defines cb and lb elements for these, as well as a generic milestone element for boundaries and breaks of unforeseen type. Specialized tags in the TEI header can describe how these milestone elements are used in standard reference schemes for the work.

Verse

Individual verse lines should be tagged with l (that's an L); stanzas or other verse structures above the level of the line should be tagged lg (line group). The type attribute of lg may optionally be used to identify the formal structure in question, for retrieval or other purposes:

Awake! for Morning in the Bowl of Night
Has flung the Stone that puts the Stars to Flight:
And Lo! the Hunter of the East has caught
The Sultan's Turret in a Noose of Light.

Example is from Rubáiyát of Omar Khayyám, tr. Edward Fitzgerald (New York: Collier; London: Collier-Macmillan, 1962), first quatrain of the first edition.
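
Tagged, the quatrain might look like this. The type value "quatrain" is an assumed label, since the encoder is free to choose the values.

```sgml
<!-- type="quatrain" is a hypothetical value -->
<lg type="quatrain">
<l>Awake! for Morning in the Bowl of Night</l>
<l>Has flung the Stone that puts the Stars to Flight:</l>
<l>And Lo! the Hunter of the East has caught</l>
<l>The Sultan's Turret in a Noose of Light.</l>
</lg>
```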

When the indentation of the lines is significant, it can be recorded using the global rend attribute, with some suitable value:

And Lo! the Hunter of the East has caught
The Sultan's Turret in a Noose of Light.
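
As a sketch only: the value "indent" is invented, and which of the two lines is actually indented in the source is an assumption here.

```sgml
<!-- rend="indent" is a hypothetical value; its placement is assumed -->
<l rend="indent">And Lo! the Hunter of the East has caught</l>
<l>The Sultan's Turret in a Noose of Light.</l>
```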

Of course, if the verse is quoted from another text, the l elements should be enclosed in a q element.

Drama

Drama should be encoded with the elements sp (speech) and stage (stage direction). Stage directions can occur either within speeches or between them. As may be seen in the example below, the speaker may be indicated with the who attribute on the sp element:

Speak, hands, for me! They stab Caesar. Et tu, Brute? -- then fall, Caesar! Dies.

Example is from a modern student reprint of Julius Caesar, III.i: William Shakespeare, The Tragedy of Julius Caesar (New York: Airmont, 1965).

When the precise form of the speaker attribution in the source is important, the speaker may be identified by a separate speaker element at the beginning of the sp element.

Cas. Speak, hands, for me! They stab Caesar. Caes. Et tu, Brute? -- then fall, Caesar! Dies.
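
The two styles of speaker identification might be sketched as follows. The who values, and the choice of l rather than p for the spoken lines, are assumptions made for illustration.

```sgml
<!-- style 1: speaker given only as an attribute; who values are assumed -->
<sp who="Casca"><l>Speak, hands, for me!</l></sp>
<stage>They stab Caesar.</stage>
<sp who="Caesar"><l>Et tu, Brute? -- then fall, Caesar!</l>
<stage>Dies.</stage></sp>

<!-- style 2: speaker label transcribed as it appears in the source -->
<sp><speaker>Cas.</speaker> <l>Speak, hands, for me!</l></sp>
<stage>They stab Caesar.</stage>
<sp><speaker>Caes.</speaker> <l>Et tu, Brute? -- then fall, Caesar!</l>
<stage>Dies.</stage></sp>
```

Note that the first stage direction falls between speeches and the second within one.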

These tags may also be used for material not written as drama, but presented using dramatic conventions (e.g. transcriptions of speeches, or of press conferences):

[Applause.] and that Governments of the people, by the people, and for the people, shall not perish from the earth. [Long-continued applause.]

Newspaper version of Abraham Lincoln, Address Delivered at the Dedication of the Cemetery at Gettysburg, in The Collected Works of Abraham Lincoln, ed. Roy P. Basler, vol. VII (New Brunswick: Rutgers University Press, 1953), pp. 20-21. Since in this text such stage-directions are always printed in brackets, the encoder might choose to omit the square brackets from the transcription, noting in the header that stage elements are always bracketed.

As with verse, if the drama is quoted from another text, it should be enclosed in a q element.

Bibliographic References

Bibliographic references should normally be enclosed in bibl elements; within such elements, or outside them, title may be used to mark titles of articles, books, journals, etc. Its level attribute takes the values A, M, J, S, or U to show whether the title is an analytic (article) title, a monographic (book) title, the title of a journal, that of a series, or that of unpublished material such as a thesis. For example a reference to: Inaugural Address, March 4, 1933, in The Public Papers and Addresses of Franklin D. Roosevelt, vol. II (New York: Random House, 1938), pp. 11-16 would be encoded thus:

Inaugural Address, March 4, 1933, in The Public Papers and Addresses of Franklin D. Roosevelt, vol. II (New York: Random House, 1938), pp. 11-16.
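
With the tags written out, the reference might read as follows. The sketch uses only the bibl and title elements and the level values A and M described above; the exact extent of each title is an assumption.

```sgml
<bibl><title level="A">Inaugural Address, March 4, 1933</title>, in
<title level="M">The Public Papers and Addresses of Franklin D.
Roosevelt</title>, vol. II (New York: Random House, 1938),
pp. 11-16.</bibl>
```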

Omitted from this bare-bones tag set are tags for other bibliographic elements, such as author, editor, publisher, and so on. Also omitted are the elements biblStruct and biblFull, which require consistently structured bibliographic entries and are useful when all the items in a bibliography must be structured correctly (e.g., for machine processing).

Omissions

If material has been omitted from an electronic text (e.g. because it is illegible or not of interest to the expected users), the omission should normally be indicated using a gap element at the point of omission. The attributes desc, reason, and extent may optionally be used to describe what was omitted, to explain why, and to give an approximate size for it. For example:

Suppose I see two individuals approaching whose rank I wish to ascertain. They are, we will suppose, a Merchant and a Physician, or in other words, an Equilateral Triangle and a Pentagon: how am I to distinguish them?

It will be obvious ...

Example is from Edwin A. Abbott, Flatland: A Romance of Many Dimensions (1884; rpt. New York: Dover, 1992), p. 19, extract from chapter 6, Recognition by Sight. The bare-bones tag set omits the elements defined by the standard TEI tag set for marking other kinds of editorial interventions or authorial alterations to a text, such as cancellations, insertions, corrections or failure to correct errors, normalized spelling, illegible writing or inaudible speech, and the expansion of abbreviations.
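
A sketch of how a gap might be recorded in the Flatland passage. The position of the gap and all three attribute values are invented for illustration; gap is taken here to be an empty element, like pb.

```sgml
<p>Suppose I see two individuals approaching whose rank I wish to
ascertain. ... how am I to distinguish them?</p>
<!-- desc, reason and extent values below are hypothetical -->
<gap desc="diagram" reason="not transcribed" extent="one figure">
<p>It will be obvious ...</p>
```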

Notes

Notes in the text, whether footnotes, endnotes, or inline block notes, should be tagged with the note element. The location may be given, if desired, in the place attribute. Authorial notes may be distinguished from editorial notes by means of the resp attribute, which indicates who is responsible for the note. For example:

IN WITNESS WHEREOF, I have hereunto set my hand and caused the seal of the United States to be affixed.

The 72d Congress, which convened following the 1932 election, passed the Twenty-first Amendment to the Constitution to repeal the Eighteenth Amendment.

...
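
Tagged, the passage might look like this. The place and resp values are assumptions; the encoder chooses whatever labels suit the project.

```sgml
<p>IN WITNESS WHEREOF, I have hereunto set my hand and caused the seal
of the United States to be affixed.</p>
<!-- place="inline" and resp="author" are hypothetical values -->
<note place="inline" resp="author">
<p>The 72d Congress, which convened following the 1932 election, passed
the Twenty-first Amendment to the Constitution to repeal the Eighteenth
Amendment.</p>
<p>...</p>
</note>
```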

Footnotes and endnotes should normally be transcribed at their point of attachment. Their number may optionally be given in the n attribute:

Philadelphia Inquirer has our poor attempts and Chicago Tribune has our poor power. to add or detract.
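
A sketch of a numbered note transcribed at its point of attachment, in the middle of a sentence. The note number and the elided text before the note are assumptions.

```sgml
<!-- n="1" is a hypothetical note number -->
<p>... <note n="1">Philadelphia Inquirer has our poor attempts and
Chicago Tribune has our poor power.</note> to add or detract.</p>
```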

Lists

Lists should be tagged using the list and item elements; a heading or title for the list should be tagged as a head. Lists may be distinguished as ordered (numbered), unordered (bulleted), etc., by means of the type attribute. For example:

the close of the first fiscal year ending June 30 of any year after the year 1933, in which ..., or the repeal of the eighteenth amendment to the Constitution, whichever is the earlier.
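
As a sketch, the two alternatives might be tagged as list items. The type value "ordered", the n values, and the exact boundaries of the items are all assumptions.

```sgml
<!-- type and n values are hypothetical; item boundaries are guessed -->
<list type="ordered">
<item n="1">the close of the first fiscal year ending June 30 of any
year after the year 1933, in which ..., or</item>
<item n="2">the repeal of the eighteenth amendment to the Constitution,
whichever is the earlier.</item>
</list>
```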

The full TEI scheme also defines a label element for use as an alternative to using the n attribute to give item numbers or labels.

What Is Missing?

Notes in the preceding sections have mentioned some of the elements defined in the full TEI scheme's core tag set but omitted from this bare-bones version. In addition to those already mentioned, tags omitted here include those for proper nouns and other references to people and places, addresses, numbers, units of measure and measured quantities, dates, and times of day.

The full scheme also defines optional tag sets for hypertext linking, analysis or interpretation (including both literary and linguistic analysis) of the text, manuscript transcription, text-critical apparatus, tables, figures, and other specialized interests.

Overall Structure of a Text

Front, Body, and Back Matter

Overall, texts are divided into front matter, the body, and back matter, tagged respectively front, body, and back. Front and back matter are distinct only by virtue of their location: they can contain exactly the same kinds of material. The overall structure of a typical book, for example, would be something like this:
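
A minimal sketch of such a structure. The comments merely suggest typical contents and are not prescribed by the scheme.

```sgml
<text>
<front>
<!-- title page, preface, and the like -->
</front>
<body>
<!-- the chapters or other main divisions -->
</body>
<back>
<!-- appendices, index, and the like -->
</back>
</text>
```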

Text Divisions

Within the body, or within the front and back matter, text may be subdivided into text divisions (parts, chapters, sections; act, scene; canto, stanza; etc.). For such divisions, the single element div should be used; subsections are tagged with nested div elements. The type attribute may be used to indicate that the division has a particular name or type; later divisions will take the same type value unless a different value is specified. Within a text division, paragraphs or paragraph-level elements (e.g. note, list) may occur.

The eighteenth article of amendment to the Constitution of the United States is hereby repealed.

The transportation or importation into any State, Territory, or possession of the United States for delivery or use therein of intoxicating liquors, in violation of the laws thereof, is hereby prohibited.

This article shall be inoperative unless it shall have been ratified as an amendment to the Constitution by conventions in the several States, as provided in the Constitution, within seven years from the date of the submission hereof to the States by the Congress.
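
The three sections might be tagged like this. The type value "section" is an assumed label; the n attribute carries the section number.

```sgml
<!-- type="section" is a hypothetical value, given once and inherited -->
<div type="section" n="1">
<p>The eighteenth article of amendment to the Constitution of the
United States is hereby repealed.</p>
</div>
<div n="2">
<p>The transportation or importation ... is hereby prohibited.</p>
</div>
<div n="3">
<p>This article shall be inoperative unless ...</p>
</div>
```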

In cases where text divisions have no headings, or have only headings consisting of their type value and a number, no heading need be given, as shown above. If desired, however, the heading may be given explicitly:

Section 1.

The eighteenth article of amendment to the Constitution of the United States is hereby repealed.

Section 2.

The transportation ...

Section 3.

This article shall be inoperative unless ...
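
With explicit headings, each division simply begins with a head element. As before, the type value is an assumption.

```sgml
<div type="section" n="1">
<head>Section 1.</head>
<p>The eighteenth article of amendment to the Constitution of the
United States is hereby repealed.</p>
</div>
<div n="2">
<head>Section 2.</head>
<p>The transportation ...</p>
</div>
<div n="3">
<head>Section 3.</head>
<p>This article shall be inoperative unless ...</p>
</div>
```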

The headings in the preceding example are fixed text (the word Section followed by the value of the n attribute), which any moderately intelligent SGML software could generate mechanically. In general, document management is more convenient, and results are more consistent, if such material is not transcribed as part of the text, but is generated by software when the text is displayed or printed. Inconsistency in the source, of course, may be of interest, and if so it should be captured explicitly.

The full TEI encoding scheme includes specialized elements for anthologies (texts containing other texts), epigraphs, datelines, bylines, salutations, signatures, and groups of headings, datelines, etc. at the beginning or ending of a text division.

Title Pages

The TEI encoding scheme defines specialized tags for transcribing title pages, in order to ensure that processing software can easily locate and identify the author, title, and date of the document as given on its title page. The title page itself, and its major component parts, are illustrated in this example:

The Public Papers and Addresses of Franklin D. Roosevelt With a special introduction and explanatory notes by President Roosevelt Volume Two The Year of Crisis 1933 Random House New York 1938
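
One plausible tagging of this title page is sketched below. It assumes the element names titlePage, docTitle, docImprint, and docDate from the full TEI scheme alongside titlePart; how the lines are divided among them is a guess.

```sgml
<!-- element choice and grouping are assumptions, not the source's own -->
<titlePage>
<docTitle>
<titlePart>The Public Papers and Addresses of Franklin D.
Roosevelt</titlePart>
<titlePart>With a special introduction and explanatory notes by
President Roosevelt</titlePart>
<titlePart>Volume Two</titlePart>
<titlePart>The Year of Crisis 1933</titlePart>
</docTitle>
<docImprint>Random House New York <docDate>1938</docDate></docImprint>
</titlePage>
```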

The titlePart element is used both for the different parts of the document title (as shown) and also for miscellaneous parts of the title page which are neither document title, nor document author, nor imprint information.

In addition to the tags shown here, the full TEI scheme defines a docEdition element for tagging information like second revised and expanded edition.

The TEI Header

The TEI header allows later users of the etexts you create to find out what the text is, who created the etext (i.e. you), and what source edition(s) you transcribed the etext from. In its full expansion, it also allows a full accounting of your transcription practice (did you correct typos silently? did you expand abbreviations? normalize spelling? etc.) and can also include a detailed characterization of the text itself (demographics of its author and audience, subject matter, genre, etc.) and a full change log, which is important for document management in large projects.

For bare-bones work, however, it's simplest to copy the following TEI header by rote, and replace the text in square brackets with appropriate information about the text being encoded. If the etext is not transcribed from a pre-existing source, but instead is being created in electronic form, the bibl tags within the sourceDesc element should be changed to p.

[Put the title of the electronic text here.]

[Indicate who is publishing this electronic text (i.e. you).]

[Indicate the source from which this etext is transcribed.]
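
Written out with its tags, the template might look like this. The names teiHeader, sourceDesc, and bibl come from the text above; fileDesc, titleStmt, and publicationStmt are assumed here from the structure of the full TEI header.

```sgml
<!-- fileDesc, titleStmt and publicationStmt are assumed element names -->
<teiHeader>
<fileDesc>
<titleStmt>
<title>[Put the title of the electronic text here.]</title>
</titleStmt>
<publicationStmt>
<p>[Indicate who is publishing this electronic text (i.e. you).]</p>
</publicationStmt>
<sourceDesc>
<bibl>[Indicate the source from which this etext is transcribed.]</bibl>
</sourceDesc>
</fileDesc>
</teiHeader>
```

For a text created in electronic form, the bibl start- and end-tags inside sourceDesc would become p tags.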

For example, the TEI header of the document you are reading looks like this:

Bare Bones TEI: A Very Very Small Subset of the TEI Encoding Scheme

Published electronically by the Text Encoding Initiative, Chicago and Oxford, in 1994.

This text was created in electronic form.

What You're Missing

Not described here are facilities in the TEI header for:

fuller bibliographic description of the electronic text (edition labels, series information, full indications of who is responsible for it, identification of publisher, distributors, restrictions on availability, price, etc.)

fuller bibliographic description of the source, including identification of non-textual sources such as tape recordings

documentation of how the text was encoded: nature of the project, sampling policy for corpora, editorial practices, SGML tags used, recognition criteria for tags used, typical rendition of tags used (e.g. unless otherwise indicated, terms in glossary lists are printed in italic and offset into the left margin), documentation of specialized notations used for feature structure annotation, metrical analysis, or text-critical apparatus

characterization of the text in non-bibliographic terms: when and how it was created, languages used, classification in some subject-matter or text-type taxonomy, demographic description of author (or of speakers in spoken materials)

change logs recording modifications to the electronic text

These facilities are all present in the full header; they may not all be defined in the TEI Lite tag set.

Putting It All Together

A TEI-encoded electronic text is always encoded as a tei.2 element, which in turn contains a teiHeader element followed by a text element. The overall structure is thus:
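
The containment just described can be sketched as a skeleton. This is a minimal illustration assembled from the sentence above, with SGML comments standing in for the real header and text content; it is not a valid document until those parts are filled in.

```sgml
<tei.2>
  <teiHeader>
    <!-- header elements go here -->
  </teiHeader>
  <text>
    <!-- front matter, body, and back matter go here -->
  </text>
</tei.2>
```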


The start-tag of the tei.2 element is preceded by an explicit reference to the external file containing the document type definition to be applied to the text by the SGML parser. The stripped-down DTD described here may be invoked with the following document-type declaration:
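
In its general form, such a declaration names the document element and points to the file holding the DTD. The sketch below assumes the DTD is stored in a local file; the file name barebones.dtd is hypothetical and is not the identifier used on the TEI file servers.

```sgml
<!-- Hypothetical system identifier; substitute the location of your copy of the DTD -->
<!DOCTYPE tei.2 SYSTEM "barebones.dtd">
<tei.2>
  <!-- teiHeader and text follow -->
</tei.2>
```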


In some systems, the association of a document with a given document type is handled internally, and no such explicit declaration is visible until the document is exported from the system. In such systems, the user will be asked to select a rules or logic file when the document is first created or imported into the editor.

A Complete Example

The following is a small but complete document encoded using the tag set declared here:

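<tei.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Bare-bones Sample of Bare-bones Tagging</title>
      </titleStmt>
      <publicationStmt>
        <p>An unpublished document.</p>
      </publicationStmt>
      <sourceDesc>
        <p>This document created in electronic form.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The world's shortest TEI document.</p>
    </body>
  </text>
</tei.2>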

A More Interesting Example

A slightly more realistic example of bare-bones tagging is provided by the following abridged transcription of Franklin D. Roosevelt's proclamation that Prohibition (i.e. the prohibition of alcohol, imposed in the U.S. by the adoption of the 18th Amendment to the Constitution) had been repealed. In the following example, the overall structure is what would be used if the entire Public Papers of Roosevelt, or a selection of several of them, were being transcribed.


The header identifies the electronic text and gives the source from which it was made.

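<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Proclamation of the 21st Amendment: an Electronic Version</title>
    </titleStmt>
    <publicationStmt>
      <p>Published by the TEI as a specimen of tagged text.</p>
    </publicationStmt>
    <sourceDesc>
      <bibl><title>The Public Papers and Addresses of Franklin D. Roosevelt</title>, vol. II (New York: Random House, 1938).</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>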

The text element contains the actual transcription.

The Public Papers and Addresses of Franklin D. Roosevelt With a special introduction and explanatory notes by President Roosevelt Volume Two The Year of Crisis 1933 Random House New York 1938


The body of the electronic text is a series of documents, each in a div element.
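
A minimal sketch of that arrangement, assuming each document carries a heading and at least one paragraph. The bracketed text is placeholder material, not the transcription itself.

```sgml
<body>
  <div>
    <head>[Title and date of the first document]</head>
    <p>[Text of the first document]</p>
  </div>
  <div>
    <head>[Title and date of the second document]</head>
    <p>[Text of the second document]</p>
  </div>
  <!-- further documents follow, one div each -->
</body>
```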

Inaugural Address. March 4, 1933

The President Calls the Congress into Extraordinary Session. Proclamation No. 2038. March 5, 1933


The repeal of the 18th Amendment is item no. 175 in this volume.

The President Proclaims the Repeal of the Eighteenth Amendment. Proclamation No. 2065. December 5, 1933

Whereas the Congress of the United States in 2d Session of the 72d Congress, begun at Washington on the fifth day of December in the year one thousand nine hundred and thirty-two, adopted a resolution in the words and figures following, to wit —


At this point the Congressional resolution is quoted in its entirety. It has its own title and paragraphing, and embeds in its turn the full text of yet another document, which became the 21st Amendment. Since FDR is quoting the resolution, we tag it as a q. Within the q is a text element. The q is rendered as a block quote with quotation marks at the beginning and end, and opening quotation marks at the beginning of each paragraph.
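
The nesting described here can be sketched as follows. This is an illustration only: it assumes the embedded text has a body containing a single div, and the bracketed text is placeholder material rather than the words of the resolution.

```sgml
<q>
  <text>
    <body>
      <div>
        <head>[Title of the quoted document]</head>
        <p>[First paragraph of the quoted document]</p>
        <!-- a further q containing its own text element can be nested here -->
      </div>
    </body>
  </text>
</q>
```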

Joint Resolution Proposing an amendment to the Constitution of the United States.

Resolved by the Senate and House of Representatives of the United States of America in Congress assembled (two-thirds of each House concurring therein), That the following article is hereby proposed as an amendment to the Constitution of the United States, which shall be valid to all intents and purposes as part of the Constitution when ratified by conventions in three-fourths of the several States:

The beginning of the embedded text of the amendment here:

Article

The eighteenth article of amendment to the Constitution of the United States is hereby repealed.


The end of the embedded text of amendment here.


And here, the end of the quoted Congressional resolution.

Whereas Section 217(a) of the Act of Congress entitled An Act to encourage national industrial recovery, to foster competition, and to provide for the construction of certain useful public works, and for other purposes approved June 16, 1933, provides as follows:

Here we have a quotation within a paragraph, which itself contains a paragraph with an embedded list.
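
One way to picture that structure is the sketch below. It assumes the quoted paragraph sits directly inside the q and that the list has two items; the bracketed text is placeholder material, not the statute's wording.

```sgml
<p>[Introductory words of the enclosing paragraph]
  <q>
    <p>[Opening of the quoted paragraph]
      <list>
        <item>[First alternative]</item>
        <item>[Second alternative]</item>
      </list>
      [Close of the quoted paragraph]
    </p>
  </q>
</p>
```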

Section 217(a) The President shall proclaim the date of the close of the first fiscal year ending June 30 of any year after the year 1933, during which the total receipts of the United States (excluding public-debt receipts) exceed its total expenditures (excluding public-debt expenditures other than those chargeable against such receipts), or the repeal of the eighteenth amendment to the Constitution, whichever is the earlier.

Whereas it appears from a certificate issued December 5, 1933, by the Acting Secretary of State that official notices have been received by the Department of State that on the fifth day of December, 1933, Conventions in thirty-six States of the United States, constituting three-fourths of the whole number of the States had ratified the said repeal amendment:

Now, therefore, I, Franklin D. Roosevelt, President of the United States of America pursuant to the provisions of Section 217(a) of the said Act of June 16, 1933, do hereby proclaim that the Eighteenth Amendment to the Constitution of the United States was repealed on the fifth day of December, 1933.

Furthermore, I enjoin upon all citizens of the United States and upon others resident within the jurisdiction thereof, to co-operate with the Government in its endeavor to restore greater respect for law and order, by confining such purchases of alcoholic beverages as they may make solely to those dealers or agencies which have been duly licensed by State or Federal license.

I ask especially that no State shall by law or otherwise authorize the return of the saloon either in its old form or in some modern guise.

In witness whereof, I have hereunto set my hand and caused the seal of the United States to be affixed.

The 72d Congress, which convened following the 1932 election, passed the Twenty-first Amendment to the Constitution to repeal the Eighteenth Amendment.


Here is the end of the repeal proclamation. From here, the transcription continues in the same way, to the end of the volume.


What About Software?

Documents using the tag set described here can be created in three ways:

- using an SGML-aware editor such as SoftQuad's Author/Editor, or ArborText's Adept, or DataLogic's EditStation, or the public-domain editor emacs, for which a specialized mode for SGML editing has been constructed (psgml.el). (If you use emacs, you will find psgml.el at the usual sources of emacs material. The other programs are available from their vendors.)
- using a standard editor or word processor, typing in the angle brackets, etc. as you see them in the examples here; working this way requires that you save the file in ASCII or non-document mode; otherwise, the proprietary markup of your word processor will get in the way.
- using a standard word processor with an add-on tool to translate documents from that word processor's native format into SGML

Of these, the first and third are most convenient for most users, and the first and second are most likely to produce valid SGML. The main difficulty with the third method is that mechanical translation from a word processor into SGML is usually possible only for very restricted SGML tag sets, and is only reliable if the documents have been created with an unusually disciplined use of the word processor's style-sheet facility. Any user interested enough in SGML to exercise the necessary discipline would probably do better with a full-fledged SGML editor.

Once created, SGML documents can be processed with a variety of commercial and public-domain tools. No complete listing is possible here; at the time this is written, the most convenient summary of SGML software is the Whirlwind Guide to SGML Tools maintained by Steve Pepper of Oslo, and available on the internet by ftp at ftp.uio.no (if you don't know about ftp, or this whole paragraph appears to be technobabble, consult your local computer center, or one of the numerous recent guides to the Internet for users who lack local computer center support). The most popular public-domain tool is the parser sgmls, written by James Clark on the basis of materials written by Charles Goldfarb. Using sgmls to process SGML documents commonly involves writing programs to read its standard output format, but it can also be used by non-programmers to check the validity of their SGML documents. (If you want to do this, check the TEI file servers for DOS batch files, Unix shell scripts, or the equivalent for your system, which simplify the task of setting up sgmls and running it as a validator. If you run into difficulties, issue a call for help on TEI-L.) An increasing number of SGML tools also use sgmls as a pre-processor, so acquiring a copy of sgmls makes sense even for those who have no intention of writing programs on their own.

Summary of the Bare Bones TEI Subset

Elements in the Bare Bones Tag Set

The tags included in the Bare Bones TEI Subset are:

- TEI header tags (not explained; use by rote): teiHeader, fileDesc, titleStmt, title, publicationStmt, sourceDesc
- Paragraphs and Other Chunk-Sized Elements: p, note, list, item, head
- Verse and Drama: l, lg, sp, stage
- Miscellaneous: hi, q, ref, pb, gap
- Bibliographic References: bibl, title
- Overall Text Structure: tei.2, text, front, body, back, div, head
- Title Pages: titlePage, docTitle, titlePart, docAuthor, docDate, docImprint

Formal Declarations

The bare-bones TEI subset is a clean subset of the TEI encoding scheme as published: bare-bones texts conform to the published TEI DTD. The subset is defined exclusively by suppressing elements which are normally available within TEI documents. This suppression is accomplished by a DTD fragment available from the TEI file servers under the name bb.ent (for bare-bones entities).
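
In the full TEI DTD, each element declaration sits inside a marked section controlled by a parameter entity named after the element, so an element can be switched off by declaring that entity as IGNORE. A fragment of that kind might look like the sketch below. The three element names are chosen only because they do not appear in the tag list above; the actual contents of bb.ent are not reproduced here.

```sgml
<!-- Hypothetical excerpt: suppress elements outside the bare-bones subset -->
<!ENTITY % abbr    'IGNORE' >
<!ENTITY % date    'IGNORE' >
<!ENTITY % foreign 'IGNORE' >
```

Because the fragment only removes declarations and adds none, any document valid against the reduced DTD remains valid against the full one.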