An Introduction to the Text Encoding Initiative <docnum>TEI &sysfnam <date>13 May 1991 <abstract> <p> Rather than summarising the likely content of the TEI Guidelines, the present document attempts to provide input for an informed discussion of some of the theoretical and practical issues which initial reactions to the Guidelines have shown to be of major concern. It begins with a brief re-statement of the goals of the project, placed in the current research context, and then gives a brief description of the nature and scope of SGML, with attention to its potential for the handling of the full diversity of textual materials likely to be of interest to social historians.</abstract> </frontm><body> <h1>What is the Text Encoding Initiative? <p> The Text Encoding Initiative is an international research project, the aim of which is to develop and to disseminate guidelines for the encoding and interchange of machine-readable texts. It is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The project is funded by the U.S. National Endowment for the Humanities, DG XIII of the Commission of the European Communities, and the Andrew W. Mellon Foundation. Equally important has been the donation of time and expertise by the many members of the research community who have served on the TEI's Working Committees and Working Groups. <p> During the first funding cycle of the TEI (June 1989-June 1990), work was carried out in four large committees, which proposed a variety of recommendations in distinct areas of text encoding practices. These found expression in the first draft of the TEI Guidelines, a large (300 page) report which was widely distributed in Europe, North America and elsewhere in November 1990. 
<note><cit>Guidelines for the encoding and interchange of machine-readable texts</cit> edited by Lou Burnard and C.M. Sperberg-McQueen (Chicago and Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1990). Summary articles have appeared in <cit>Humanistiske Data</cit> 3-90; <cit>ACH Newsletter</cit> 12 (3-4); <cit>EPSIG News</cit> 3(3); <cit>SGML Users Group Newsletter</cit> 18; <cit>ACLS Newsletter</cit> 2/4 and elsewhere. A fuller summary is to appear as Burnard <q>The TEI: a progress report</q> in <cit>Proceedings of the 11th ICAME Conference, Berlin, 1990</cit>, ed. G. Leitner. Encoding principles of the TEI are discussed in <q>Texts in the electronic age: textual study and text encoding, with examples from medieval text</q> C.M. Sperberg-McQueen, in <cit>Literary and Linguistic Computing</cit> (6.1, 1991).</note> During the second cycle (June 1990-June 1992) the initial recommendations are being reviewed and extended by about a dozen different specialist working groups and put to the test by a comparable number of affiliated projects. A series of TEI Workshops has been organised, and much introductory and tutorial material drafted. The final deliverables of the project will include a substantial reference manual and a number of tutorial guides. <p> The task of co-ordinating the working groups and committees, and of combining their drafts for publication is carried out by two editors, one European and one American. The project as a whole is managed by a steering committee, with two representatives from each of the three sponsoring organisations. An Advisory Board, with representatives from 15 major learned and professional societies, endorsed the initial work plan at its first meeting in February 1989, and will also (all being well) endorse the final project deliverables in June of 1992. 
<h1>The need for interchange <p> The goal of the TEI is to develop and disseminate a set of Guidelines for the interchange of machine-readable texts among researchers, so as to allow easier and more efficient sharing of resources for textual computing and natural language processing. Such interchange is already perceived as essential in several areas of the research community, in particular within the expanding field of natural language processing, on both economic and methodological grounds. Economically, it is widely accepted that the heavy cost of creating such resources as language corpora and electronic lexica can only be justified if the resources can be re-used by many projects. <note>A recent Eurotra-funded study (Eurotra-7) reported on this and other aspects of `re-usability of lexical resources' in considerable detail: its findings are equally relevant to other disciplines.</note> Methodologically, the repeatability of research results which forms an essential aspect of any empirical research can best be guaranteed by the continued availability of data sets for secondary analysis. Standardisation is easily achieved where there is a broad consensus about the kinds of data to be processed and the particular software packages to be used (as has been, for example, the case for many years in social science research). It is less simple where essentially identical kinds of data resources (such as textual corpora) contain matter of interest to distinct research communities characterised by an immense variety of theoretical positions and methods. <p> The TEI arose from a perceived need within one, comparatively small, research community: that concerned with the encoding and manipulation of purely textual data for purposes of descriptive or corpus linguistics, stylistic analysis, textual editing and other forms of what is broadly called `Literary and Linguistic Computing' (LLC). 
There has recently been an interesting convergence between the needs and abilities of that community and those of the somewhat larger body of researchers concerned with the computational analysis of natural language (NLP), whether for natural language understanding, generation or translation systems. Straddling the two communities are those concerned with the creation of better, objectively derived, models of language in use, whose methods have transformed current practices in lexicography and language teaching. What links all of these researchers is the need to process large amounts of textual data in a wide variety of different styles. What the TEI offers them, and others, is a model for the standardisation of textual data resources for interchange. <p> It is helpful, when considering standardisation of electronic resources, to distinguish the objects of standardisation (the `what') from the particular representation recommended for them (the `how'). Like other standardisation efforts, the TEI Guidelines include both recommendations about which textual features should be distinguished when encoding texts from scratch if the resulting text is to be of maximal usefulness to the research community, and recommendations of specific practices in the encoding of new texts. The `how' chosen by the TEI is based on the international standard SGML, and I therefore discuss this in some detail in section <hdref refid='how'> below. The `what' is rather more difficult to summarise in a short document of this nature, but some general remarks and a few specific examples are provided in section <hdref refid='what'> below. Distinguishing these two aspects of standardisation is particularly important for electronic resources, because of the ease with which their representations may be changed. <p> What is sometimes forgotten is that ease of conversion is crucially dependent on the prior existence of an agreed set of distinctions. 
The TEI attempts to make such an agreed set of distinctions, by proposing an abstract data model consisting of those features for which a consensus can be reached as to their importance in a wide range of automatic analyses. To identify these features, particular software systems may use entirely different representation schemata: different representations will be appropriate for different hardware environments, for different software packages, for archival storage and in particular for the exchange of data across particular networks. <h1 id=how>SGML, or TEI as she is spoke <p> For clarity, the TEI has adopted a single descriptive schema for its abstract data model, expressed using the international standard ISO 8879 Standard Generalised Markup Language (SGML); this does not however imply that other descriptive schemata are inappropriate in particular environments, only that SGML has been found to be a practical way of representing the structural and other aspects of the underlying abstract model. Transduction between the SGML representation and other formats is likely to remain a necessary part of the TEI interchange process for some time, although the rapid acceptance of this international standard within both the commercial and academic text processing communities encourages the belief that this will not long be the case. <h2>Styles of Markup <p> <term>Markup</term> is a convenient word used to describe all the information contained in a computer text file other than the text itself, by means of which computer programs are able to manipulate the text in useful ways; the term is borrowed from the history of printing, where <term>markup</term> referred to the notations made in the margins of a text to guide the compositor in the layout of the text. Markup intended to specify the proper layout or presentation of a text is still the most common type of markup in computer files. 
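As a purely illustrative sketch of such layout-oriented markup, a heading and the start of a paragraph might be prepared for a formatting program roughly as follows; the dot commands shown are invented for this example and are not those of any particular package: <xmp> <![ CDATA [
.centre
.font bold 14pt
The need for interchange
.skip 2
.font roman 10pt
.indent 3
The goal of the TEI is to develop and disseminate ...
]]> </xmp> Each line beginning with a full stop in this sketch tells a program what to do with the words that follow it; none of them records what kind of textual object those words constitute. 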
<p> If computers were used only to print texts out on paper, there would be little point in using SGML. Because, in fact, computers are used for far more than this, markup can be used to guide processing of any type, not just printing. In an important sense, the markup represents a particular interpretation of the text, making explicit something which is only implicit in the data characters stored. In general, markup in the TEI scheme is not intended as a way of controlling any one piece of software. Although convenient, markup of that kind becomes obtrusive as soon as some other program attempts to work on the text. It also makes it difficult to change systematically the way pieces of text of a certain type are processed. It is easier to work flexibly with text, and easier to use many different kinds of software with the same machine-readable text, if (a) the markup in a text is clearly distinguishable from the text itself, and (b) the markup provides not <emph>instructions</emph> for how to process a bit of text but a <emph>description</emph> of the text itself. Markup which gives instructions (<term>procedural markup</term>) makes reuse of the text for a different kind of processing difficult or impossible; markup which gives a description (<term>descriptive markup</term>) means that the processing carried out can be determined independently by each piece of software. A common method is to use a lookup table which associates the generic markup tags of the text with specific processing instructions; by analogy with similar shorthands used in publishing, such tables are often called <term>style sheets</term>. <h2>SGML Markup <p> The Standard Generalized Markup Language (SGML) is a language for defining <term>markup languages</term>, i.e. sets of markup tags with rules defining when they are applicable and how they can interrelate. SGML does not itself define a markup language. It merely allows its users to define one. 
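As a hedged illustration of what such a definition can look like, a user might declare a very small markup language with SGML element declarations of roughly the following kind. The sketch borrows the deliberately meaningless element names used later in this paper; it is not taken from the TEI Guidelines: <xmp> <![ CDATA [
<!-- a farble consists of one or more blorts -->
<!ELEMENT farble    - - (blort+) >
<!-- a blort mixes character data with blortettes -->
<!ELEMENT blort     - - (#PCDATA | blortette)* >
<!-- a blortette contains character data only -->
<!ELEMENT blortette - - (#PCDATA) >
]]> </xmp> Each declaration names an element type, states (with the two hyphens) that neither its start-tag nor its end-tag may be omitted, and gives the content permitted within it. 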
SGML itself provides a simple but powerful formalism for the description of the textual features identified by a variety of markup languages. <p> The TEI encoding scheme uses SGML to associate the set of textual features to be distinguished in a document with a corresponding set of markup tags, and to define how these tags can legally occur within it. The intended meanings of the features themselves are defined by descriptive documents such as the published Guidelines; the syntax and format of the markup tags are defined by a particular SGML construct called a `document type definition' or DTD. <p> There are three characteristics of SGML which distinguish it from other markup languages: it is designed for <term>descriptive</term> rather than <term>procedural</term> markup; it allows one to define distinct <term>document types</term> with distinct rules for their structures and the markup they can contain; and it is independent of any one system for representing characters. Procedural and descriptive markup have already been discussed. The notion of document types allows an SGML processor to verify that the markup in a text actually follows rules laid down (by the user) for that type; equally important, it allows software developers to exploit the knowledge about text structures which is embodied in the rules for different document types, and to create more intelligent software as a result. SGML's independence of specific character sets is important for its role in the interchange of documents among scholars using different types of machine. <h3>Elements and their attributes <p> SGML-based markup languages, including that of the TEI, regard text not as an undifferentiated sequence of words, much less of bytes, but as a consistently arranged hierarchy of many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. 
Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages. Unlike other markup languages which share this view of text as a complex hierarchical structure, SGML and the TEI allow more than one single hierarchical structure to be discerned and marked up in a single text. <p> The technical term used in the SGML standard for a textual unit, viewed as a structural component, is <term>element</term>. Different types of elements are given different names, but SGML provides no formal way of expressing the meaning of a particular type of element, other than its relationship to other element types and its content. That is, all one can say about an element called (for instance) <q>blort</q> is that instances of it may (or may not) occur within elements of type <q>farble</q>, and that it may (or may not) be decomposed into elements of type <q>blortette.</q> It should be stressed that the SGML standard is entirely unconcerned with the semantics of textual elements: these are application dependent. It is up to the creators of SGML conformant tag sets (such as the TEI Guidelines) to choose intelligible names for the elements they identify and to document their proper use in text markup. <p> Within a marked up text (a <term>document instance</term>), each element must be explicitly marked or tagged in some way. The standard provides for a variety of different ways of doing this, the most commonly used being to insert a tag at the beginning of the element (a <term>start-tag</term>) and another at its end (an <term>end-tag</term>). The start- and end-tag pair are used to bracket off the element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, an embedded speech element in a text might be tagged as follows: <xmp> <![ CDATA [ ... Rosalind's remarks <speech>This is the silliest stuff that ere I heard of!</speech> clearly indicate ... 
]]> </xmp> <!> As this example shows, a start-tag takes the form <tag>name</tag>, where <q><</q> is a string indicating the start of the start-tag, <q>name</q> is the generic identifier of the element which is being delimited, and <q>></q> is the string indicating the end of a tag. An end-tag takes the form <tag>/name</tag>, where <q></</q> is a string marking the start of an end-tag, <q>name</q> is the generic identifier of the element being closed and, as before, <q>></q> is the string indicating the end of a tag. <p> The SGML formalism also provides for <term>attributes</term> to be associated with occurrences of tagged elements. The attributes which may be associated with a given type of element form a part of its definition. Individual element occurrences may have different attribute values. For example, it might be convenient to define a tag <tag>name</tag> to identify all proper names in a text, with attributes <att>TYPE</att> and <att>NORMAL</att>. In a particular textual element identified as a name by the <tag>name</tag> tag, one could then specify additionally both the type of name (`personal', `family', `given' etc.) and a normalised form of it. Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup. <p> Attribute names and values are supplied within the start-tag for the element to which they apply, for example <xmp> <![ CDATA [ <name type='personal' normal='SmithJ'>Jack Smyth</name> ]]> </xmp> The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element in the text. <h3>Entity References <p> The only other feature of the SGML formalism which needs definition here is the <term>entity reference</term>. 
SGML entities provide a simple and flexible method of encoding and naming arbitrary strings of characters, that is, parts of a text which have no structural significance. An SGML entity has a name and a definition. When an entity is referred to in an SGML document, its name appears in the document; in the output, the SGML processor replaces the name of the entity with its definition. Entity references are thus a convenient way both of including large quantities of text in a document (for example <q>boilerplate text</q> used in several places in one or more documents) and of handling characters needed in a document but not present on the keyboard, such as special symbols or accented letters. <p> An entity reference conventionally takes the form of a mnemonic name separated from the rest of the text by an ampersand at its start and a semicolon following it. For example, one might represent the astrological symbol for Sagittarius by an entity reference of the form &sagit;. Standard entity names are proposed in ISO 8879 for most of the common symbols found in modern printed materials (accented letters, mathematical symbols etc); as this example indicates, the same mechanism can be easily extended for more esoteric applications. <!> <h1 id=what>Structure and interpretation: the TEI topoi <p> I suggested above that the primary function of markup was to make explicit an interpretation of a text. Any standardisation effort such as the TEI must therefore at some time grasp the nettle of deciding which interpretations are to be favoured over others. To put it another way, the TEI must at least attempt to address the question as to which aspects or features of a text should be made explicit by its markup. <p> For some scholars, this is a simple issue. There are some features of a text which are <q>obvious</q> and <q>objective</q> -- examples usually include major structural subdivisions such as chapters or verse lines or entries in a charter. 
There are others which are equally obviously <q>pure interpretation</q> -- such as whether or not a passage in a prose text belongs to some stylistic category, or is in a foreign language, or is a personal name. As this last list perhaps indicates, for the present writer this is a far from clear cut distinction. In almost every kind of material, and especially in the kinds of materials studied by historians, there is a continuum of categorisations, from things about which almost everyone will agree almost all of the time, down to things which almost no-one will identify in the same way ever. <p> The TEI therefore adopts a liberal policy. It proposes for consideration a set of categories which wide consultation has demonstrated to be of use to a broad consensus of researchers. It proposes ways in which instances of those categories may be marked up (as discussed in the last section). Researchers in agreement as to the use of the categories so defined can thus interchange texts, or (if you wish) interpreted texts. They can do so moreover in a format which allows the disentangling of the interpretation from the text stream, or its enrichment in a controlled way. No claim is made as to the feasibility or desirability of making such interpretations in a given case -- all that the TEI can or does offer is a way of making explicit what has been done. <p> The remainder of this paper discusses some concrete instances of the kinds of textual feature which typify the current TEI proposals. <h2>The structure of a TEI text <h3>The TEI header <p> All TEI-conformant texts contain (a) a <term>TEI header</term> and (b) the transcription of the text proper. The TEI header provides information analogous to that provided by the title page of a printed text. 
It contains a description of the machine-readable text, a description of the way it has been encoded, and a revision history; these are delimited by the <tag>file.description</tag>, the <tag>encoding.declarations</tag>, and the <tag>revision.history</tag> tags, respectively. The first of these identifies the electronic text as an object in its own right, independent of its source or sources (which must however be documented within it). The second supplies details of the particular encoding practices or variations which characterise the text, for example any special codebooks or other values used within the body of the text and descriptions of the referencing scheme or editorial principles applied. The header is, perhaps surprisingly, the <emph>only</emph> part of a TEI text which is mandatory. <h3>Marking Divisions within a Text <p> The TEI recommendations categorise document elements as either <term>structural</term> or <term>floating</term>. Structural elements are constrained as to where they may appear in a document; for example a <tag>head</tag> or heading may not appear in the middle of a <tag>list</tag>. Floating elements, as the name suggests, are less constrained and may appear almost anywhere in a text: examples include <tag>note</tag> or <tag>date</tag>. Intermediate between the two categories are so-called <term>crystals</term>: these are floating features the contents of which have an inherent structure, for example <tag>list</tag> or <tag>citn</tag> elements. <h4>Structural features <p> The current recommendations define a general purpose hierarchic structure, which has been found to be suitable for a very large (perhaps surprisingly large) variety of textual sources. In this, a text is divided into an optional <hi>front</hi>, a <hi>body</hi> and an optional <hi>back</hi>. The body of a text may be a series of paragraphs (marked with <tag>p</tag> ... <tag>/p</tag>), or it may be divided into chapters, sections, subsections, etc. 
In the latter case, the <tag>body</tag> is divided into a series of elements known generically as <term>div</term>s. The largest subdivision of a given text is tagged div1, the next largest div2 and so on. Written prose texts may also be further subdivided into <hi>p</hi>s (paragraphs). For verse texts, metrical lines are tagged with the <tag>l</tag> tag. <h4>Floating features <p>As mentioned above, the current Guidelines propose names and definitions for a wide variety of floating features. Examples include <tag>head</tag> for titles and captions (not properly floating, since they are generally tied to a particular structural element); <tag>q</tag> for quoted matter and direct speech; <tag>list</tag> for lists and <tag>item</tag> for the items within them; <tag>note</tag> for footnotes etc.; <tag>corr</tag> for editorial corrections of the original source made by the encoder; and, optionally, a variety of lexically `awkward' items such as <tag>abbr</tag>eviations, <tag>acronym</tag>s, <tag>number</tag>s, <tag>name</tag>s, <tag>date</tag>s, <tag>citn</tag> for bibliographic or other citations, <tag>address</tag> for street addresses and <tag>foreign</tag> for non-English words or phrases. <h2>Reference scheme <p>The advantage of using a single hierarchic scheme as outlined above is that a referencing scheme based on it can be automatically generated. For example, a given p will acquire a number indicating its sequence within the enclosing div, itself identified by its number within any enclosing div above it, and ultimately within the enclosing text. Thus the value <q>T98.1.9/12</q> might identify the 12th p in chapter 9 of book 1 of the text with number T98. <p> To complement this kind of internal referencing system, the Guidelines provide two distinct methods of marking other reference schemes, such as page and line numbers. 
The hierarchy of volume, page, and line can be neatly expressed with a concurrent markup stream separate from the main markup hierarchy (see P1 section 5.6); for data entry purposes, however, the simpler scheme we describe here may be more convenient. After data entry, this markup can be transformed mechanically into that required for a concurrent markup hierarchy, if that is supported by the software in use. <p> Page breaks, column breaks, and line breaks may be marked with empty <term>milestone</term> elements: that is, tags such as <tag>line.break</tag> or <tag>page.break</tag> which mark a single point in the text, not a span of text, and therefore have no corresponding end-tags. Such tags may have an <att>n</att> attribute to supply the number of the page, column, or line beginning at the tag explicitly, or may give only the number of the first if subsequent ones can be calculated automatically. This mechanism also allows for the pagination etc. of more than one edition to be specified by using an <att>ed</att> attribute. <h2>Descriptive vs presentational markup <p>A matter of considerable controversy (and associated misunderstanding) has been the question of whether or not aspects of a text directly related to its physical appearance can or should be marked up. For some researchers, and in many applications, typographic features such as lineation or font are of little or no importance. For others, they are the very subject of interest. Because SGML focuses attention on <emph>describing</emph> a text, rather than attempting to simulate its appearance, the TEI recommendations have proposed that where it is possible to identify a structural (or floating) feature by its function, then that is what should be primarily tagged. This does not however mean that they provide no support for cases where the exact purpose of some distinctly-rendered part of a text cannot be determined. 
It is recognised that in many cases it may be neither desirable nor possible to interpret changes of rendering in this way. <p> A global attribute <att>RENDITION</att> may be specified for every tag in the TEI scheme, the value of which is a user-specified string descriptive of the way that the current element is rendered in the source being transcribed. <note>This implies of course that the markup describes a single source.</note> In most cases, a change in rendering and a change of element coincide: this mechanism therefore reduces the amount of tagging from what would be required if a separate set of tags were used for rendering. Further reduction in tagging is provided by the fact that the default value for a RENDITION attribute is that of the immediately surrounding element (if any). <p> In cases where a renditional change is not associated with any discernible element, a special tag <tag>highlighted</tag> may be used, the sole function of which is to carry the <att>RENDITION</att> attribute. <p> No recommendations about the form of value to be supplied for rendition attributes have yet been made: these are the subject of current work in two working groups. Similar considerations apply to the use of quotation marks and quoted passages within a text. <h2>Scope and coverage of P1 <p> As an example of the scope and range of facilities which SGML can support, I close with a brief summary of the full contents of the current draft and a more detailed description of a few of the more specialised kinds of textual features for which tags are already proposed in the draft Guidelines. <p> It should be stressed that the first draft of the Guidelines, despite its weighty appearance (nearly 300 pages of closely printed A4), is very much a discussion paper and far from being complete or definitive. Some characteristics of the TEI approach are however already discernible which are unlikely to change. 
One is a focus on the encoding of the content of text, rather than its appearance -- as discussed above, this is also a characteristic of SGML. Another is the rigorous application of Occam's razor: the TEI approach to the immense variety of text types in the real world is to attempt to define a comparatively small number of features which all texts share, and to allow for these to be used in combination with user-definable sets of more specialised features. <note>This has been termed the <q>pizza model</q>, by contrast with either the <q>table d'hôte</q> or the <q>à la carte</q> models. A choice of a small number of bases is offered, each of which may be combined with a large number of toppings. </note> <p> The current draft has eight main sections, which are briefly summarized below. <p> Chapter 1 outlines the purpose and scope of the TEI scheme. As outlined above, its main goals are both to facilitate data interchange and to provide guidance for those creating new texts. The desiderata of simplicity, clarity, formal rigour, sufficient power for research purposes, conformance to international standards, and independence of software, hardware or application alike are stressed. <p> Chapter 2 provides a gentle introduction to the basic concepts of SGML and also contains some more technical information about the ways in which the TEI scheme uses the standard. <p> Chapter 3 addresses the problems of character encoding and translation in a world dominated by the rival claims of ASCII and EBCDIC. If the goal is to provide machine-independent support for all writing systems of all languages, these problems are far from trivial. The specific recommendations made are that only a subset of the ISO-646 character set (sometimes known as ASCII) can currently be relied on for data interchange, and that this should be extended either by using the entity reference mechanism provided by SGML or by using transliteration schemes. 
It proposes a powerful but economical way of documenting such transliteration schemes by a formal Writing System Declaration. <p> Chapter 4 contains recommendations for in-file documentation of electronic texts adequate to the bibliographic needs of researchers, data archivists and librarians. It recommends that a special header be added to each file to perform a function analogous to that of the title page of a non-electronic text, and proposes sets of tags for information about the file itself, the source from which it was derived and how it was encoded. <p> Chapter 5, the largest chapter, attempts to define a set of general-purpose structural and floating tags for continuous prose texts. Its basic idea of text as an ordered hierarchy of objects, within which floating features and crystals may appear, was discussed above. This chapter of the Guidelines also proposes tags for features such as lists, notes, names, abbreviations, numbers, foreign or emphasised phrases, cross references, and hypertextual links. Sections deal with the kinds of textual element commonly found in front and back matter of printed texts, title pages etc. Other sections discuss ways of encoding textual variation and critical apparatus and of recording the rendering of arbitrary textual fragments within this overall framework. There is also some discussion of different ways of maintaining multiple referencing schemes within the same text. <p> Chapter 6 outlines a number of theory-independent mechanisms for representing all kinds of linguistic analyses of running text. It is probably the most daunting chapter for the non-specialist reader, though much of its contents are of very wide relevance. It argues that most, if not all, linguistic analyses can be represented as bundles of named, value-bearing, `feature structures', which may be nested and grouped into sets or lists. 
It proposes ways of supporting multiple and independently aligned analyses, chiefly by means of the ID/IDREF pointer mechanism native to SGML. It also contains some tagsets for such commonly occurring formalisms as tree structures and parts of speech. <p> Chapter 7 considers in more detail particular aspects of some specific types of text. The text-types discussed in this draft are: language corpora and collections; verse, drama, and narrative; dictionaries; and office documents. In each case, an overview of the problems specific to these types of discourse is given, with some preliminary proposals for tags appropriate to them. This chapter is one that will be considerably revised and extended over the coming months, as its initial proposals are firmed up and as its scope is extended to other types of text. <p> Chapter 8 outlines a method by which the current Guidelines may be modified and extended, largely by introducing indirection into the Document Type Definitions (the formal SGML specifications for the TEI encoding scheme). Extension and modification of the TEI proposals is an important design goal, since this is both expected and intended, and the final form of the Guidelines will facilitate it. <p> Preliminary versions of a number of technical appendixes are provided in the current draft. These include annotated examples, illustrating the application of the TEI encoding scheme to a wide range of texts, formal SGML document type declarations (DTDs) for all the tags and groups of tags defined in the TEI scheme, and code pages for some commonly used character sets. Later drafts will extend and improve these initial versions considerably, and will also contain an alphabetical reference section with a summary of each tag, its attributes, its usage, and an example of its use, as well as full Writing System Declarations for a range of commonly used alphabets. 
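<p> The kind of indirection mentioned in connection with Chapter 8 can be illustrated by a minimal sketch using SGML parameter entities. The fragment below is not taken from the Guidelines: the entity names <q>x.phrase</q> and <q>phrase</q>, the content models, and the added element <q>placename</q> are assumptions made purely for illustration. It relies only on the standard SGML rule that, where an entity is declared more than once, the first declaration is the one that takes effect. <xmp> <![ CDATA [
<!-- Hypothetical user extension, read BEFORE the main declarations -->
<!ENTITY % x.phrase "placename |">
<!ELEMENT placename - - (#PCDATA)>

<!-- Hypothetical main declarations: the empty default for x.phrase
     is ignored if the user has already declared it as above -->
<!ENTITY % x.phrase "">
<!ENTITY % phrase "%x.phrase; name | date | foreign">
<!ELEMENT (name | date | foreign) - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA | %phrase;)*>
]]> </xmp> With both sets of declarations in force, a paragraph may contain <tag>placename</tag> elements in addition to the default phrase-level elements, although the main declarations themselves have not been edited.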
<p> Space precludes an exhaustive discussion of the various tags and associated features suggested by the current TEI draft proposals. Further proposals from the specialist working groups currently discussing extensions in a wide range of subject areas will be included in the final TEI report in a year's time. However, it is hoped that enough detail has been provided to give some indication of the general ideas underlying the scheme. <h1>What is a TEI Text? <p> What does it mean to say that a text is <q>TEI conformant</q>? A full answer to this question involves an understanding of the various contexts or environments in which electronic texts may be used. At one extreme, a text may be prepared using a particular version of a particular software package on a particular machine, for use with that software package only. Its users and preparers may never have any intention of sharing the text with others, nor of using any texts prepared elsewhere. At the other extreme, a text may be prepared on many different systems as part of a co-operative data capture exercise, for use by several different people, all with differing objectives and different software systems. Most projects fall between these two extremes, often with different priorities at different times. How does the TEI project help either of them? <p> As suggested above, encoding a text is fundamentally a process of deciding which textual features should be distinguished by markup of some kind, and of deciding on a suitable markup for them. The TEI Guidelines may be thought of as a codification of the distinctions which have been found helpful by most people most of the time when faced with this task. For the most part, these are optional features of a text: clearly, no-one could be expected to make all the distinctions or to capture all the textual features listed in P1 in every text prepared, for no matter how simple a purpose. 
Equally clearly though, every distinction made by P1 is made because for someone that distinction is important. <!> <h2>The notion of conformance <p> Returning to the question of conformance: if the Guidelines do not require that every distinction they specify be made in encoding a text, what in fact do they require? They say, in effect, <emph>if</emph> you wish to distinguish this feature in your text, then <emph>this</emph> is the tag you should use to identify it, and (possibly) this is the way that this textual feature should be related to other textual features in the text. If for example, you wish to distinguish proper names that are embedded in your text, the Guidelines advise you to use the tag <tag>propname</tag> for the purpose: they do <emph>not</emph> propose that all proper names in a text should be marked however. <p> A TEI-conformant text must, as a minimum, be parsable by an SGML processor using one or other of the published TEI document type definitions (DTDs). Strict TEI-conformance additionally involves adherence to various formal rules about the way in which SGML is used in a text (see sections 1.1.2 and 2.2 of the Guidelines for a discussion), of which probably the most significant is that end-tags must be supplied for every element. For interchange purposes, TEI conformance at present implies the use of a very restricted character set. <p> It should be stressed that the purpose of restricting TEI conformance in this way is to ensure that texts can be interchanged between different machines and operating systems without loss of information. Such restrictions make no sense, and are therefore not required, when texts are to be exchanged between the same kind of machine, or when they are not exchanged at all. 
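<p> As a small illustration of the end-tag rule mentioned above, the hypothetical fragment below shows the same list encoded twice: first with every end-tag supplied, as strict conformance requires, and then with the <tag>/item</tag> end-tags left out, as SGML tag minimization may permit when the document type definition in use allows it. The tag names <tag>list</tag> and <tag>item</tag> are those proposed in the Guidelines; the content is invented for the example. <xmp> <![ CDATA [
<!-- every end-tag supplied -->
<list>
<item>parchment</item>
<item>vellum</item>
</list>

<!-- item end-tags omitted: acceptable to an SGML parser only if
     the DTD declares the end-tag of item to be omissible -->
<list>
<item>parchment
<item>vellum
</list>
]]> </xmp> Where minimization is permitted, an SGML parser can infer the missing end-tags, so the first form can be generated mechanically from the second and the shorter form may remain convenient for data entry.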
<h2>Conformance in different environments <p> Strict conformance may be desired or required when you are sending files to someone about whose system you know no details, when you are depositing a text in a text archive, or when you are working with software which accepts only fully-conformant TEI texts. <p> In many cases, a less strict adherence to the rules of the TEI scheme may be appropriate. If you have SGML software, for example, then it is unnecessary to limit yourself, in the work you do on your own machine, to the subset of SGML features allowed in strictly TEI-conformant documents, since it is easy to use SGML software to produce a TEI-conformant version of any SGML document which uses the TEI document type declarations. You may use some other software, on the other hand, which accepts most TEI-conforming documents, but places some further restriction on the SGML features which can be accepted. In this case prudence will dictate that you restrict yourself to the SGML features your software can handle. <p> If you do not have SGML software, you may wish to use some markup scheme designed around the software you use most: Word Perfect or Nota Bene users might develop a set of Word Perfect styles or Nota Bene styles corresponding to the TEI tags they use most often. As long as the mark-up scheme you use makes at least the same set of distinctions as those recommended by the Guidelines, then it will be simple to translate from your local scheme to the TEI scheme, and back. <p> The construction of a sensible local scheme depends entirely on the hardware and software you are using. What makes sense for a Macintosh user who shuttles constantly between Word and Hypercard, will not necessarily be the best approach for a PC user who seldom leaves Nota Bene, and neither will necessarily be apt for someone using a VAX. 
<h2>Character Sets and Conformance <p> Character set incompatibilities pose serious problems for the exchange of machine-readable texts within disparate research communities; many common methods of exchanging texts fail for texts which contain characters other than the twenty-six basic letters of the Latin alphabet, the ten Arabic numerals, and some common punctuation marks. Accented characters, braces and brackets, and many other characters may not arrive at all, or may arrive as undecipherable nonsense. The TEI Guidelines define a <q>safe</q> set of characters for interchange using today's systems, and recommend the use of entity references for all other characters. Because the shortcomings of current systems will not (we hope!) be with us forever, however, adherence to these restrictions is not a necessary part of TEI-conformance, though it may be highly desirable in certain situations. <p> In your own work on your own machine, however, there is <emph>no reason</emph> not to use all the characters available in your machine's character set. When you wish to exchange texts with users of other systems, you can transform any such characters into SGML entity references, for example by using a simple global search-and-replace function. <p> Just as special-purpose programs may be needed to convert from the form in which it is convenient to enter text into a TEI-conformant one, so it is likely that special-purpose programs will be developed to convert a TEI-conformant text into one that can be reliably transported across networks, possibly involving some data compression as well as translation of <q>awkward</q> characters, together with similar programs to do the opposite. Such programs have yet to be written, however. </gdoc>

<h1>The need for interchange <p> The goal of the TEI is to develop and disseminate a set of Guidelines for the interchange of machine-readable texts among researchers, so as to allow easier and more efficient sharing of resources for textual computing and natural language processing. Such interchange is already perceived as essential in several areas of the research community, in particular within the expanding field of natural language processing, on both economic and methodological grounds. Economically, it is widely accepted that the heavy cost of creating such resources as language corpora and electronic lexica can only be justified if the resources can be re-used by many projects. <note>A recent Eurotra-funded study (Eurotra-7) reported on this and other aspects of `re-usability of lexical resources' in considerable detail: its findings are equally relevant to other disciplines.</note> Methodologically, the repeatability of research results which forms an essential aspect of any empirical research can best be guaranteed by the continued availability of data sets for secondary analysis. Standardisation is easily achieved where there is a broad consensus about the kinds of data to be processed and the particular software packages to be used (as has been, for example, the case for many years in social science research). It is less simple where essentially identical kinds of data resources (such as textual corpora) contain matter of interest to distinct research communities characterised by an immense variety of theoretical positions and methods. <p> The TEI arose from a perceived need within one, comparatively small, research community: that concerned with the encoding and manipulation of purely textual data for purposes of descriptive or corpus linguistics, stylistic analysis, textual editing and other forms of what is broadly called `Literary and Linguistic Computing' (LLC). 
There has recently been an interesting convergence of the needs and abilities of that community with those of the somewhat larger body of researchers concerned with the computational analysis of natural language (NLP), whether for natural language understanding, generation or translation systems. Straddling the two communities are those concerned with the creation of better, objectively derived, models of language in use, whose methods have transformed current practices in lexicography and language teaching. What links all of these researchers is the need to process large amounts of textual data in a wide variety of different styles. What the TEI offers them, and others, is a model for the standardisation of textual data resources for interchange. <p> It is helpful, when considering standardisation of electronic resources, to distinguish the objects of standardisation (the `what') from the particular representation recommended for them (the `how'). Like other standardisation efforts, the TEI Guidelines include both recommendations about which textual features should be distinguished when encoding texts from scratch if the resulting text is to be of maximal usefulness to the research community, and recommendations of specific practices in the encoding of new texts. The `how' chosen by the TEI is based on the international standard SGML, and I therefore discuss this in some detail in section <hdref refid='how'> below. The `what' is rather more difficult to summarise in a short document of this nature, but some general remarks and a few specific examples are provided in section <hdref refid='what'> below. Distinguishing these two aspects of standardisation is particularly important for electronic resources, because of the ease with which their representations may be changed. <p> What is sometimes forgotten is that ease of conversion is crucially dependent on the prior existence of an agreed set of distinctions. 
The TEI attempts to make such an agreed set of distinctions, by proposing an abstract data model consisting of those features for which a consensus can be reached as to their importance in a wide range of automatic analyses. To identify these features, particular software systems may use entirely different representation schemata: different representations will be appropriate for different hardware environments, for different software packages, for archival storage and in particular for the exchange of data across particular networks. <h1 id=how>SGML, or TEI as she is spoke <p> For clarity, the TEI has adopted a single descriptive schema for its abstract data model, expressed using the international standard ISO 8879 Standard Generalised Markup Language (SGML); this does not however imply that other descriptive schemata are inappropriate in particular environments, only that SGML has been found to be a practical way of representing the structural and other aspects of the underlying abstract model. Transduction between the SGML representation and other formats is likely to remain a necessary part of the TEI interchange process for some time, although the rapid acceptance of this international standard within both the commercial and academic text processing communities encourages the belief that this will not long be the case. <h2>Styles of Markup <p> <term>Markup</term> is a convenient word used to describe all the information contained in a computer text file other than the text itself, by means of which computer programs are able to manipulate the text in useful ways; the term is borrowed from the history of printing, where <term>markup</term> referred to the notations made in the margins of a text to guide the compositor in the layout of the text. Markup intended to specify the proper layout or presentation of a text is still the most common type of markup in computer files. 
<p> If computers were used only to print texts out on paper, there would be little point in using SGML. Because computers are in fact used for far more than this, markup can be used to guide processing of any type, not just printing. In an important sense, the markup represents a particular interpretation of the text, making explicit something which is only implicit in the data characters stored. In general, markup in the TEI scheme is not intended as a way of controlling any one piece of software. Although convenient, markup tied to a single program becomes obtrusive as soon as some other program attempts to work on the text. It also makes it difficult to change systematically the way pieces of text of a certain type are processed. It is easier to work flexibly with text, and easier to use many different kinds of software with the same machine-readable text, if (a) the markup in a text is clearly distinguishable from the text itself, and (b) the markup provides not <emph>instructions</emph> for how to process a bit of text but a <emph>description</emph> of the text itself. Markup which supplies instructions (<term>procedural markup</term>) makes reuse of the text for a different kind of processing difficult or impossible; markup which supplies a description (<term>descriptive markup</term>) leaves each piece of software free to determine independently what processing is to be carried out. A common method is to use a lookup table which associates the generic markup tags of the text with specific processing instructions; by analogy with similar shorthands used in publishing, such tables are often called <term>style sheets</term>. <h2>SGML Markup <p> The Standard Generalized Markup Language (SGML) is a language for defining <term>markup languages</term>, i.e. sets of markup tags with rules defining when they are applicable and how they can interrelate. SGML does not itself define a markup language. It merely allows its users to define one. 
SGML itself provides a simple but powerful formalism for the description of the textual features identified by a variety of markup languages. <p> The TEI encoding scheme uses SGML to associate the set of textual features to be distinguished in a document with a corresponding set of markup tags, and to define how these tags can legally occur within it. The intended meanings of the features themselves are defined by descriptive documents such as the published Guidelines; the syntax and format of the markup tags are defined by a particular SGML construct called a `document type definition' or DTD. <p> There are three characteristics of SGML which distinguish it from other markup languages: it is designed for <term>descriptive</term> rather than <term>procedural</term> markup; it allows one to define distinct <term>document types</term> with distinct rules for their structures and the markup they can contain; and it is independent of any one system for representing characters. Procedural and descriptive markup have already been discussed. The notion of document types allows an SGML processor to verify that the markup in a text actually follows rules laid down (by the user) for that type; equally important, it allows software developers to exploit the knowledge about text structures which is embodied in the rules for different document types, and to create more intelligent software as a result. SGML's independence of specific character sets is important for its role in the interchange of documents among scholars using different types of machine. <h3>Elements and their attributes <p> SGML-based markup languages, including that of the TEI, regard text not as an undifferentiated sequence of words, much less of bytes, but as a consistently arranged hierarchy of many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. 
Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages. Unlike other markup languages which share this view of text as a complex hierarchical structure, SGML and the TEI allow more than one single hierarchical structure to be discerned and marked up in a single text. <p> The technical term used in the SGML standard for a textual unit, viewed as a structural component, is <term>element</term>. Different types of elements are given different names, but SGML provides no formal way of expressing the meaning of a particular type of element, other than its relationship to other element types and its content. That is, all one can say about an element called (for instance) <q>blort</q> is that instances of it may (or may not) occur within elements of type <q>farble</q>, and that it may (or may not) be decomposed into elements of type <q>blortette.</q> It should be stressed that the SGML standard is entirely unconcerned with the semantics of textual elements: these are application dependent. It is up to the creators of SGML conformant tag sets (such as the TEI Guidelines) to choose intelligible names for the elements they identify and to document their proper use in text markup. <p> Within a marked up text (a <term>document instance</term>), each element must be explicitly marked or tagged in some way. The standard provides for a variety of different ways of doing this, the most commonly used being to insert a tag at the beginning of the element (a <term>start-tag</term>) and another at its end (an <term>end-tag</term>). The start- and end-tag pair are used to bracket off the element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, an embedded speech element in a text might be tagged as follows: <xmp> <![ CDATA [ ... Rosalind's remarks <speech>This is the silliest stuff that ere I heard of!</speech> clearly indicate ... 
]]> </xmp> <!> As this example shows, a start-tag takes the form <tag>name</tag>, where <q><</q> is a string indicating the start of the start-tag, <q>name</q> is the generic identifier of the element which is being delimited, and <q>></q> is the string indicating the end of a tag. An end-tag takes the form <tag>/name</tag>, where <q></</q> is a string marking the start of an end-tag, <q>name</q> is the generic identifier of the element being closed and, as before, <q>></q> is the string indicating the end of a tag. <p> The SGML formalism also provides for <term>attributes</term> to be associated with occurrences of tagged elements. The attributes which may be associated with a given type of element form a part of its definition. Individual element occurrences may have different attribute values. For example, it might be convenient to define a tag <tag>name</tag> to identify all proper names in a text, with attributes <att>TYPE</att> and <att>NORMAL</att>. In a particular textual element identified as a name by the <tag>name</tag> tag, one could then specify additionally both the type of name (`personal', `family', `given' etc.) and a normalised form of it. Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup. <p> Attribute names and values are supplied within the start-tag for the element to which they apply, for example <xmp> <![ CDATA [ <name type='personal' normal='SmithJ'>Jack Smyth</name> ]]> </xmp> The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element in the text. <h3>Entity References <p> The only other feature of the SGML formalism which needs definition here is the <term>entity reference</term>. 
SGML entities provide a simple and flexible method of encoding and naming arbitrary strings of characters, that is, parts of a text which have no structural significance. An SGML entity has a name and a definition. When an entity is referred to in an SGML document, its name appears in the document; in the output, the SGML processor replaces the name of the entity with its definition. Entity references are thus a convenient way both of including large quantities of text in a document (for example <q>boilerplate text</q> used in several places in one or more documents) and of handling characters needed in a document but not present on the keyboard, such as special symbols or accented letters. <p> An entity reference conventionally takes the form of a mnemonic name separated from the rest of the text by an ampersand at its start and a semicolon following it. For example, one might represent the astrological symbol for Sagittarius by an entity reference of the form &sagit;. Standard entity names are proposed in ISO 8879 for most of the common symbols found in modern printed materials (accented letters, mathematical symbols etc); as this example indicates, the same mechanism can be easily extended for more esoteric applications. <!> <h1 id=what>Structure and interpretation: the TEI topoi <p> I suggested above that the primary function of markup was to make explicit an interpretation of a text. Any standardisation effort such as the TEI must therefore at some time grasp the nettle of deciding which interpretations are to be favoured over others. To put it another way, the TEI must at least attempt to address the question as to which aspects or features of a text should be made explicit by its markup. <p> For some scholars, this is a simple issue. There are some features of a text which are <q>obvious</q> and <q>objective</q> -- examples usually include major structural subdivisions such as chapters or verse lines or entries in a charter. 
There are others which are equally obviously <q>pure interpretation</q> -- such as whether or not a passage in a prose text belongs to some stylistic category, or is in a foreign language, or is a personal name. As this last list perhaps indicates, for the present writer this is a far from clear-cut distinction. In almost every kind of material, and especially in the kinds of materials studied by historians, there is a continuum of categorisations, from things about which almost everyone will agree almost all of the time, down to things which almost no-one will identify in the same way ever. <p> The TEI therefore adopts a liberal policy. It proposes for consideration a set of categories which wide consultation has demonstrated to be of use to a broad consensus of researchers. It proposes ways in which instances of those categories may be marked up (as discussed in the last section). Researchers in agreement as to the use of the categories so defined can thus interchange texts, or (if you wish) interpreted texts. They can do so moreover in a format which allows the disentangling of the interpretation from the text stream, or its enrichment in a controlled way. No claim is made as to the feasibility or desirability of making such interpretations in a given case -- all that the TEI can or does offer is a way of making explicit what has been done. <p> The remainder of this paper discusses some concrete instances of the kinds of textual feature which typify the current TEI proposals. <h2>The structure of a TEI text <h3>The TEI header <p> All TEI-conformant texts contain (a) a <term>TEI header</term> and (b) the transcription of the text proper. The TEI header provides information analogous to that provided by the title page of a printed text. 
It contains a description of the machine-readable text, a description of the way it has been encoded, and a revision history; these are delimited by the <tag>file.description</tag>, the <tag>encoding.declarations</tag>, and the <tag>revision.history</tag> tags, respectively. The first of these identifies the electronic text as an object in its own right, independent of its source or sources (which must however be documented within it). The second supplies details of the particular encoding practices or variations which characterise the text, for example any special codebooks or other values used within the body of the text and descriptions of the referencing scheme or editorial principles applied. The header is, perhaps surprisingly, the <emph>only</emph> part of a TEI text which is mandatory. <h3>Marking Divisions within a Text <p> The TEI recommendations categorise document elements as either <term>structural</term> or <term>floating</term>. Structural elements are constrained as to where they may appear in a document; for example a <tag>head</tag> or heading may not appear in the middle of a <tag>list</tag>. Floating elements, as the name suggests, are less constrained and may appear almost anywhere in a text: examples include <tag>note</tag> and <tag>date</tag>. Intermediate between the two categories are so-called <term>crystals</term>: these are floating features the contents of which have an inherent structure, for example <tag>list</tag> or <tag>citn</tag> elements. <h4>Structural features <p> The current recommendations define a general-purpose hierarchic structure, which has been found to be suitable for a very large (perhaps surprisingly large) variety of textual sources. In this, a text is divided into an optional <hi>front</hi>, a <hi>body</hi> and an optional <hi>back</hi>. The body of a text may be a series of paragraphs (marked with <tag>p</tag> ... <tag>/p</tag>), or it may be divided into chapters, sections, subsections, etc. 
In the latter case, the <tag>body</tag> is divided into a series of elements known generically as <term>div</term>s. The largest subdivision of a given text is tagged div1, the next largest div2 and so on. Written prose texts may also be further subdivided into <hi>p</hi>s (paragraphs). For verse texts, metrical lines are tagged with the <tag>l</tag> tag. <h4>Floating features <p>As mentioned above, the current Guidelines propose names and definitions for a wide variety of floating features. Examples include <tag>head</tag> for titles and captions (not properly floating, since they are generally tied to a particular structural element); <tag>q</tag> for quoted matter and direct speech; <tag>list</tag> for lists and <tag>item</tag> for the items within them; <tag>note</tag> for footnotes etc.; <tag>corr</tag> for editorial corrections of the original source made by the encoder; and, optionally, a variety of lexically `awkward' items such as <tag>abbr</tag>eviations, <tag>acronym</tag>s, <tag>number</tag>s, <tag>name</tag>s, <tag>date</tag>s, <tag>citn</tag> for bibliographic or other citations, <tag>address</tag> for street addresses and <tag>foreign</tag> for non-English words or phrases. <h2>Reference scheme <p>The advantage of using a single hierarchic scheme as outlined above is that a referencing scheme based on it can be automatically generated. For example, a given p will acquire a number indicating its sequence within the enclosing div, itself identified by its number within any enclosing div above it, and ultimately within the enclosing text. For example, the value <q>T98.1.9/12</q> might identify the 12th p in chapter 9 of book 1 of the text with number T98. <p> To complement this kind of internal referencing system, the Guidelines provide two distinct methods of marking other reference schemes, such as page and line numbers. 
The hierarchy of volume, page, and line can be neatly expressed with a concurrent markup stream separate from the main markup hierarchy (see P1 section 5.6); for data entry purposes, however, the simpler scheme we describe here may be more convenient. After data entry, this markup can be transformed mechanically into that required for a concurrent markup hierarchy, if that is supported by the software in use. <p> Page breaks, column breaks, and line breaks may be marked with empty <term>milestone</term> elements: that is, tags such as <tag>line.break</tag> or <tag>page.break</tag> which mark a single point in the text, not a span of text, and therefore have no corresponding end-tags. Such tags may have an <att>n</att> attribute to supply the number of the page, column, or line beginning at the tag explicitly, or may give only the number of the first if subsequent ones can be calculated automatically. This mechanism also allows for the pagination etc. of more than one edition to be specified by using an <att>ed</att> attribute. <h2>Descriptive vs presentational markup <p>A matter of considerable controversy (and associated misunderstanding) has been the question of whether or not aspects of a text directly related to its physical appearance can or should be marked up. For some researchers, and in many applications, typographic features such as lineation or font are of little or no importance. For others, they are the very subject of interest. Because SGML focuses attention on <emph>describing</emph> a text, rather than attempting to simulate its appearance, the TEI recommendations have proposed that where it is possible to identify a structural (or floating) feature by its function, then that is what should be primarily tagged. This does not however mean that they provide no support for cases where the exact purpose of some distinctly-rendered part of a text cannot be determined. 
It is recognised that in many cases it may be neither desirable nor possible to interpret changes of rendering in this way. <p> A global attribute <att>RENDITION</att> may be specified for every tag in the TEI scheme, the value of which is a user-specified string descriptive of the way that the current element is rendered in the source being transcribed. <note>This implies of course that the markup describes a single source.</note> In most cases, a change in rendering and a change of element coincide: this mechanism therefore reduces the amount of tagging from what would be required if a separate set of tags were used for rendering. Further reduction in tagging is provided by the fact that the default value for a RENDITION attribute is that of the immediately surrounding element (if any). <p> In cases where a renditional change is not associated with any discernible element, a special tag <tag>highlighted</tag> may be used, the sole function of which is to carry the <att>RENDITION</att> attribute. <p> No recommendations about the form of value to be supplied for rendition attributes have yet been made: these are the subject of current work in two working groups. Similar considerations apply to the use of quotation marks and quoted passages within a text. <h2>Scope and coverage of P1 <p> As an example of the scope and range of facilities which SGML can support, I close with a brief summary of the full contents of the current draft and a more detailed description of a few of the more specialised kinds of textual features for which tags are already proposed in the draft Guidelines. <p> It should be stressed that the first draft of the Guidelines, despite its weighty appearance (nearly 300 pages of closely printed A4), is very much a discussion paper and far from being complete or definitive. Some characteristics of the TEI approach are however already discernible which are unlikely to change. 
One is a focus on the encoding of the content of text, rather than its appearance -- as discussed above, this is also a characteristic of SGML. Another is the rigorous application of Occam's razor: the TEI approach to the immense variety of text types in the real world is to attempt to define a comparatively small number of features which all texts share, and to allow for these to be used in combination with user-definable sets of more specialised features. <note>This has been termed the <q>pizza model</q>, by contrast with either the <q>table d'hôte</q> or the <q>à la carte</q> models. A choice of a small number of bases is offered, each of which may be combined with a large number of toppings. </note> <p> The current draft has eight main sections, which are briefly summarized below. <p> Chapter 1 outlines the purpose and scope of the TEI scheme. As outlined above, its main goals are both to facilitate data interchange and to provide guidance for those creating new texts. The desiderata of simplicity, clarity, formal rigour, sufficient power for research purposes, conformance to international standards, and independence of software, hardware or application alike are stressed. <p> Chapter 2 provides a gentle introduction to the basic concepts of SGML and also contains some more technical information about the ways in which the TEI scheme uses the standard. <p> Chapter 3 addresses the problems of character encoding and translation in a world dominated by the rival claims of ASCII and EBCDIC. If the goal is to provide machine-independent support for all writing systems of all languages, these problems are far from trivial. The specific recommendations made are that only a subset of the ISO-646 character set (sometimes known as ASCII) can currently be relied on for data interchange, and that this should be extended either by using the entity reference mechanism provided by SGML or by using transliteration schemes. 
It proposes a powerful but economical way of documenting such transliteration schemes by a formal Writing System Declaration. <p> Chapter 4 contains recommendations for in-file documentation of electronic texts adequate to the bibliographic needs of researchers, data archivists and librarians. It recommends that a special header be added to each file to perform a function analogous to that of the title page of a non-electronic text, and proposes sets of tags for information about the file itself, the source from which it was derived and how it was encoded. <p> Chapter 5, the largest chapter, attempts to define a set of general-purpose structural and floating tags for continuous prose texts. Its basic idea of text as an ordered hierarchy of objects, within which floating features and crystals may appear, was discussed above. This chapter of the Guidelines also proposes tags for features such as lists, notes, names, abbreviations, numbers, foreign or emphasised phrases, cross references, and hypertextual links. Sections deal with the kinds of textual element commonly found in front and back matter of printed texts, title pages etc. Other sections discuss ways of encoding textual variation and critical apparatus and of recording the rendering of arbitrary textual fragments within this overall framework. There is also some discussion of different ways of maintaining multiple referencing schemes within the same text. <p> Chapter 6 outlines a number of theory-independent mechanisms for representing all kinds of linguistic analyses of running text. It is probably the most daunting chapter for the non-specialist reader, though much of its contents are of very wide relevance. It argues that most, if not all, linguistic analyses can be represented as bundles of named, value-bearing, `feature structures', which may be nested and grouped into sets or lists. 
It proposes ways of supporting multiple and independently aligned analyses, chiefly by means of the ID/IDREF pointer mechanism native to SGML. It also contains some tagsets for such commonly occurring formalisms as tree structures and parts of speech. <p> Chapter 7 considers in more detail particular aspects of some specific types of text. The text-types discussed in this draft are: language corpora and collections; verse, drama, and narrative; dictionaries; and office documents. In each case, an overview of the problems specific to these types of discourse is given, with some preliminary proposals for tags appropriate to them. This chapter is one that will be considerably revised and extended over the coming months, as its initial proposals are firmed up and as its scope is extended to other types of text. <p> Chapter 8 outlines a method by which the current Guidelines may be modified and extended, largely by introducing indirection into the Document Type Definitions (the formal SGML specifications for the TEI encoding scheme). Ease of extension and modification is an important design goal: changes to the TEI proposals are both expected and intended, and the final form of the Guidelines will facilitate them. <p> Preliminary versions of a number of technical appendixes are provided in the current draft. These include annotated examples, illustrating the application of the TEI encoding scheme to a wide range of texts, formal SGML document type declarations (DTDs) for all the tags and groups of tags defined in the TEI scheme, and code pages for some commonly used character sets. Later drafts will extend and improve these initial versions considerably, and will also contain an alphabetical reference section with a summary of each tag, its attributes, its usage, and an example of its use, as well as full Writing System Declarations for a range of commonly used alphabets. 
<p> Space precludes an exhaustive discussion of the various tags and associated features suggested by the current TEI draft proposals. Further proposals from the specialist working groups currently discussing extensions in a wide range of subject areas will be included in the final TEI report in a year's time. However, it is hoped that enough detail has been provided to give some indication of the general ideas underlying the scheme. <h1>What is a TEI Text? <p> What does it mean to say that a text is <q>TEI conformant</q>? A full answer to this question involves an understanding of the various contexts or environments in which electronic texts may be used. At one extreme, a text may be prepared using a particular version of a particular software package on a particular machine, for use with that software package only. Its users and preparers may never have any intention of sharing the text with others, nor of using any texts prepared elsewhere. At the other extreme, a text may be prepared on many different systems as part of a co-operative data capture exercise, for use by several different people, all with differing objectives and different software systems. Most projects fall between these two extremes, often with different priorities at different times. How does the TEI project help either of them? <p> As suggested above, encoding a text is fundamentally a process of deciding which textual features should be distinguished by markup of some kind, and of deciding on a suitable markup for them. The TEI Guidelines may be thought of as a codification of the distinctions which have been found helpful by most people most of the time when faced with this task. For the most part, these are optional features of a text: clearly, no-one could be expected to make all the distinctions or to capture all the textual features listed in P1 in every text prepared, for no matter how simple a purpose. 
Equally clearly though, every distinction made by P1 is made because for someone that distinction is important. <!> <h2>The notion of conformance <p> Returning to the question of conformance: if the Guidelines do not require that every distinction they specify be made in encoding a text, what in fact do they require? They say, in effect, <emph>if</emph> you wish to distinguish this feature in your text, then <emph>this</emph> is the tag you should use to identify it, and (possibly) this is the way that this textual feature should be related to other textual features in the text. If, for example, you wish to distinguish proper names that are embedded in your text, the Guidelines advise you to use the tag <tag>propname</tag> for the purpose: they do <emph>not</emph>, however, propose that all proper names in a text should be marked. <p> A TEI-conformant text must, as a minimum, be parsable by an SGML processor using one or other of the published TEI document type definitions (DTDs). Strict TEI-conformance additionally involves adherence to various formal rules about the way in which SGML is used in a text (see sections 1.1.2 and 2.2 of the Guidelines for a discussion), of which probably the most significant is that end-tags must be supplied for every element. For interchange purposes, TEI conformance at present implies the use of a very restricted character set. <p> It should be stressed that the purpose of restricting TEI conformance in this way is to ensure that texts can be interchanged between different machines and operating systems without loss of information. Such restrictions make no sense, and are therefore not required, when texts are to be exchanged between the same kind of machine, or when they are not exchanged at all. 
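<p> The rule that end-tags must be supplied for every element lends itself to a rough mechanical check. What follows is a minimal sketch in Python, offered only as an illustration: it is not an SGML parser, it does not consult any TEI DTD, and it ignores comments, marked sections and other declarations. The function name unclosed_elements and the list of empty milestone elements (which, as noted earlier, have no end-tags) are assumptions made for the example.

```python
import re

# Matches a start-tag or end-tag and captures the optional slash and the
# element name; any attributes are skipped. Comments, marked sections and
# other SGML declarations are NOT handled: this is a sketch, not a parser.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^<>]*>")

# Hypothetical list of empty "milestone" elements, which take no end-tags.
EMPTY = {"page.break", "line.break"}


def unclosed_elements(text, empty=EMPTY):
    """Return names of elements whose end-tags are missing or unmatched."""
    stack = []     # names of the elements currently open
    problems = []  # names reported as unclosed or unmatched
    for slash, name in TAG.findall(text):
        name = name.lower()
        if name in empty:
            continue
        if not slash:
            stack.append(name)
        elif name in stack:
            # Anything opened after the matching start-tag was never closed.
            while stack[-1] != name:
                problems.append(stack.pop())
            stack.pop()
        else:
            problems.append("/" + name)  # end-tag with no open element
    # Elements still open at the end of the text are also unclosed.
    return problems + stack[::-1]
```

<p> Under these assumptions, an empty result means only that every non-empty element was closed in the proper order; it says nothing about whether the text is valid against a TEI document type definition, for which a real SGML processor is needed.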
<h2>Conformance in different environments <p> Strict conformance may be desired or required when you are sending files to someone about whose system you know no details, when you are depositing a text in a text archive, or when you are working with software which accepts only fully-conformant TEI texts. <p> In many cases, a less strict adherence to the rules of the TEI scheme may be appropriate. If you have SGML software, for example, then it is unnecessary to limit yourself, in the work you do on your own machine, to the subset of SGML features allowed in strictly TEI-conformant documents, since it is easy to use SGML software to produce a TEI-conformant version of any SGML document which uses the TEI document type declarations. You may use some other software, on the other hand, which accepts most TEI-conforming documents, but places some further restriction on the SGML features which can be accepted. In this case prudence will dictate that you restrict yourself to the SGML features your software can handle. <p> If you do not have SGML software, you may wish to use some markup scheme designed around the software you use most: Word Perfect or Nota Bene users might develop a set of Word Perfect styles or Nota Bene styles corresponding to the TEI tags they use most often. As long as the markup scheme you use makes at least the same set of distinctions as those recommended by the Guidelines, then it will be simple to translate from your local scheme to the TEI scheme, and back. <p> The construction of a sensible local scheme depends entirely on the hardware and software you are using. What makes sense for a Macintosh user who shuttles constantly between Word and Hypercard will not necessarily be the best approach for a PC user who seldom leaves Nota Bene, and neither will necessarily be apt for someone using a VAX. 
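<p> The translation between a local scheme and the TEI scheme, and back, can be sketched as follows. This is a hedged illustration in Python which assumes an entirely hypothetical local convention, in which a marked phrase is written as {style:content}; the style names nm, dt and fw and both function names are invented for the example, and no attempt is made to read real Word Perfect or Nota Bene files. Nested markup and attributes are not handled.

```python
import re

# Hypothetical local style names mapped to TEI tags mentioned in the text.
# The mapping is one-to-one, so it can be inverted without loss.
LOCAL_TO_TEI = {"nm": "name", "dt": "date", "fw": "foreign"}
TEI_TO_LOCAL = {tei: local for local, tei in LOCAL_TO_TEI.items()}

# {style:content} in the assumed local scheme (no nesting).
LOCAL_SPAN = re.compile(r"\{(\w+):([^{}]*)\}")
# <tag>content</tag> with no attributes and no nested tags.
TEI_SPAN = re.compile(r"<([\w.]+)>([^<>]*)</\1>")


def local_to_tei(text):
    """Rewrite known local spans as TEI-style tagged elements."""
    def repl(match):
        tag = LOCAL_TO_TEI.get(match.group(1))
        if tag is None:
            return match.group(0)  # unknown style: leave untouched
        return "<%s>%s</%s>" % (tag, match.group(2), tag)
    return LOCAL_SPAN.sub(repl, text)


def tei_to_local(text):
    """Rewrite known TEI-style elements back into the local scheme."""
    def repl(match):
        style = TEI_TO_LOCAL.get(match.group(1))
        if style is None:
            return match.group(0)  # tag with no local equivalent
        return "{%s:%s}" % (style, match.group(2))
    return TEI_SPAN.sub(repl, text)
```

<p> The round trip is lossless here only because each local style corresponds to exactly one tag; a local scheme which made fewer distinctions than the tags in use could be translated in one direction but not back.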
<h2>Character Sets and Conformance <p> Character set incompatibilities pose serious problems for the exchange of machine-readable texts within disparate research communities; many common methods of exchanging texts fail for texts which contain characters other than the twenty-six basic letters of the Latin alphabet, the ten Arabic numerals, and some common punctuation marks. Accented characters, braces and brackets, and many other characters may not arrive at all, or may arrive as undecipherable nonsense. The TEI Guidelines define a <q>safe</q> set of characters for interchange using today's systems, and recommend the use of entity references for all other characters. Because the shortcomings of current systems will not (we hope!) be with us forever, however, adherence to these restrictions is not a necessary part of TEI-conformance, though it may be highly desirable in certain situations. <p> In your own work on your own machine, however, there is <emph>no reason</emph> not to use all the characters available in your machine's character set. When you wish to exchange texts with users of other systems, you can transform any such characters into SGML entity references, by using a simple global search and replace function for example. <p> Just as special purpose programs may be needed to convert from the form in which it is convenient to enter text into a TEI-conformant one, so it is likely that special-purpose programs will be developed to convert a TEI-conformant text into one that can be reliably transported across networks, possibly involving some data compression as well as translation of <q>awkward</q> characters, together with similar programs to do the opposite. Such programs have yet to be written, however. </gdoc>