TEI Conformance, TEI Recommended Practice, TEI Interchange, and Related Issues

C. M. Sperberg-McQueen
University of Illinois at Chicago

Document Number: TEI MLW43
Draft May 30, 1991 (13:56:26)

ABSTRACT

This paper defines the notion of "TEI conformance" and related concepts and describes some areas to which they do and do not apply.

1 INTRODUCTION

The notion of "TEI conformance," like the other concepts introduced in this paper, is intended as an aid in describing the format and contents of a particular document or set of documents. These concepts are expected to be useful in:

* agreements for the interchange of documents among researchers
* agreements for the deposit of texts in archives and their distribution from archives
* describing the documents to be produced by or for a given project
* defining the classes of documents accepted or rejected by a given piece of software

This paper describes the areas in which these terms are defined and specifies their meaning. It also proposes other terms for related concepts and points out some dangers in the careless use or application of these terms.

2 TEI CONFORMANCE

The terms described here should be considered technical terms for users and implementors of the TEI Guidelines and should be used only in the senses given and with the usages described.

TEI-Conformant Document

A document is TEI conformant if it is either in TEI local processing format or in TEI interchange format. A full description of the document should specify which format it is in.

The term TEI conformance does not apply to software: programs can be usefully described as accepting or validating TEI-conformant documents or some subset of TEI-conformant documents, but the TEI defines no required processing model against which software could be measured. Programs are thus not themselves conformant or non-conformant and should not be so described.

TEI Local Processing Format

A document is in TEI local processing format if

1. It is a conforming SGML document with a legal SGML declaration.
2. It uses the document type declarations provided by the TEI, either without modifications or with all modifications effected by inclusion in the DTD subset as described in section "Modifications to TEI Document Type Declarations" below.
3. All modifications to meaning or use of defined tags, and all new tags, are documented in TEI Tag Set Declarations which accompany the document.(1)
4. It includes, in the TEI header, all the elements required by the TEI declarations for the TEI header.
5. It contains no markup other than SGML or declared notations for graphics, tables, figures, etc. That is, unless a declared notation is in use, the semantics of any content character in the document are exhausted by its identity as a graphic character.

A TEI-local-processing-format document may be described as requiring DTD extensions if it modifies the TEI-supplied DTDs or SGML prolog in any of the ways described below under "Modifications to TEI Document Type Declarations".

The following terms are synonymous: document in TEI local processing format, TEI local-processing document, and TEI local-processing-conformant document.

TEI Interchange Format

A document is in TEI interchange format if it conforms to the TEI local-processing format and if further:

1. Its SGML declaration is either the predefined SGML declaration for TEI interchange documents or an SGML declaration which differs from it only in ways allowed by section "Modifications to TEI SGML Declaration" below.
2. It makes no use of any of the following SGML constructs:
   a. short references
   b. the RANK feature
   c. omission of generic identifiers in start- and end-tags(2)
   d. inclusion of a SUBDOC subordinate document by means of an entity reference embedded directly within content data (SUBDOCs must be included by giving the entity reference as the value of an attribute)
   e. use of keywords other than INCLUDE, IGNORE, and CDATA in a marked section
   f. definition of the same entity with different values in different document types. (An entity must have the same value in all document types.)

A TEI-interchange-format document may be described as requiring DTD extensions if its DTD is modified in any of the ways described in section "Modifications to TEI Document Type Declarations" below.

The following terms are synonymous: document in TEI interchange format, TEI interchange document, and TEI interchange-conformant document.

3 TEI PACKED INTERCHANGE FORMAT

A document is in TEI packed interchange format with a given transmission character set and a given transmission entity set if all of the following are true. (The process of making them true by appropriate pre-transmission processing on a TEI-interchange-format document is packing.)

1. All separate entities in the document are packed into a single entity (file) in a manner conforming to ISO 9069 (SDIF) or to some other TEI-authorized form.
2. All characters occurring in SGML names (generic identifiers and attribute names) occur within the transmission character set. (If this is not true, the SGML names must be modified so as to make it true.)
3. All characters in the document content and attribute values either occur within the transmission character set or are represented by an appropriate entity reference using an entity name included in the transmission entity set.
4. The transmission character and entity sets are named in the header of the packed file and in any accompanying paper documentation.

With prior agreement between parties to an exchange, interchange documents may use character code set switching as defined in ISO 2022, its national analogues, or successor standards.

A full description of a document in TEI packed interchange format must specify the transmission character set and the transmission entity set used in the document.

4 TEI RECOMMENDED PRACTICE

A document follows TEI recommended practice if:

1. it is a TEI-conformant document
2. wherever the guidelines say to prefer one tag to another, the preferred tag is used
3. all textual features which the guidelines recommend be captured are in fact encoded
4. no textual features which the guidelines recommend not be captured(3) are in fact encoded

5 TEI ABSTRACT MODEL

A document follows the TEI abstract model if it tags the features specified in the TEI documentation and their structural interrelations agree with those specified in the TEI DTDs.

6 EXTENSIONS AND MODIFICATIONS TO TEI SPECIFICATIONS

Modifications to TEI SGML Declaration

The SGML declaration for TEI interchange documents may differ from that provided in TEI documentation in these ways:

1. The CHARSET clause must be used to define the transmission character set (possibly in connection with the SHUNCHAR specification in the SYNTAX clause)
2. The CAPACITY clause may be used to raise (but not lower) capacities
3. The SYNTAX clause may be used to define the SGML syntax used in the document. Notably:
   a. The SHUNCHAR specification within the SYNTAX clause may be used to restrict the transmission character set
   b. The BASESET and DESCSET specifications within the SYNTAX clause must be used to describe the transmission character set
   c. The DELIM and NAMES specifications may be used to modify the SGML syntax
4. In the FEATURES clause, CONCUR may be set to NO if concurrent markup is not used in the document

The following restrictions apply to the SGML declaration in TEI interchange documents:

1. The CAPACITY and QUANTITY values may be increased but not decreased
2. The SCOPE clause may not be changed
3. No new FEATURES may be turned on

The SGML declaration for TEI-local-format documents may be modified without restriction. Some recommendations for usage are made in document TEI P1, but these recommendations are not normative.
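
The three restrictions on the interchange SGML declaration listed above lend themselves to a mechanical check. The following is a minimal sketch only, not a TEI-defined procedure: the SgmlDeclaration class, the function interchange_violations, and all sample values are hypothetical, and the sketch assumes the declaration has already been parsed into simple tables. The sample numbers are illustrative and are not taken from the TEI interchange declaration.

```python
from dataclasses import dataclass, field


@dataclass
class SgmlDeclaration:
    """Hypothetical, much simplified view of a parsed SGML declaration."""
    capacity: dict[str, int] = field(default_factory=dict)
    quantity: dict[str, int] = field(default_factory=dict)
    scope: str = "DOCUMENT"
    features: dict[str, bool] = field(default_factory=dict)


def interchange_violations(base, modified):
    """List the ways 'modified' breaks the three restrictions relative to 'base'."""
    problems = []
    # Restriction 1: CAPACITY and QUANTITY values may be raised but not lowered.
    for clause in ("capacity", "quantity"):
        base_values = getattr(base, clause)
        new_values = getattr(modified, clause)
        for name, base_value in base_values.items():
            new_value = new_values.get(name, base_value)
            if new_value < base_value:
                problems.append(
                    f"{clause.upper()} {name} lowered from {base_value} to {new_value}"
                )
    # Restriction 2: the SCOPE clause may not be changed.
    if modified.scope != base.scope:
        problems.append(f"SCOPE changed from {base.scope} to {modified.scope}")
    # Restriction 3: no feature that is off in the base may be turned on.
    for name, enabled in modified.features.items():
        if enabled and not base.features.get(name, False):
            problems.append(f"FEATURES {name} turned on")
    return problems


# Illustrative values only (assumed, not the real TEI declaration).
base = SgmlDeclaration(
    capacity={"TOTALCAP": 35000},
    quantity={"NAMELEN": 8},
    scope="DOCUMENT",
    features={"CONCUR": True, "RANK": False},
)
raised = SgmlDeclaration(
    capacity={"TOTALCAP": 70000},
    quantity={"NAMELEN": 32},
    scope="DOCUMENT",
    features={"CONCUR": False, "RANK": False},
)
print(interchange_violations(base, raised))  # no violations expected
```

Turning CONCUR off, as in the second declaration, is not reported, since the text allows it when concurrent markup is not used. The sketch does not examine the CHARSET and SYNTAX clauses, whose permitted changes cannot be reduced to a comparison of numbers.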

Modifications to TEI Document Type Declarations

A TEI-conformant document (whether for local processing or for interchange) may make any change to the TEI-supplied document type declarations which is allowed by SGML and the controlling SGML declaration. All such changes should be effected within the SGML DTD subset.(4) In all TEI-conformant documents, the document type name must be tei1 or as specified in the version of the TEI guidelines in use, and the system data should refer to a file containing the unmodified TEI main DTD fragment:

   ]>

The following must remain true of the DTD after modification:

1. The overall document must contain a single header element and a single text element, in that order; in the case of a corpus or collection the overall collection may have a header followed by a series of documents.
2. The header element must include elements for
   title statement: the title of the machine-readable work and the names of those responsible for it
   publication statement: place and date of publication or distribution of the machine-readable document
   source description: bibliographic description of the copy text or source of the electronic text, including at least title, author, and edition

DTD Extensions

A TEI-conformant document may be said to require DTD extensions if it:

1. defines new elements
2. modifies content model, declared content, or omissibility of any element
3. adds or modifies any attribute definitions
4. renames any elements, attributes, or attribute values
5. defines any new document types
6. declares any non-SGML notations

Without requiring DTD extension, therefore, any TEI document may

1. define entities and parameter entities
2. include processing instructions and comments in its DTD subset

TEI local-processing documents can, without requiring DTD extension, also:

1. include link type declarations etc. in their SGML prolog
2. define short reference mappings and use them in their DTD

TEI interchange documents may not include link type or short reference declarations because the SGML declaration for interchange does not allow them.

It is expected that the notion of DTD extension will be particularly useful in describing the classes of documents accepted or validated by software.

7 TEI PROCESSING MODEL

In a project using the TEI Guidelines, the work might include processing of the following kinds. This section is included for illustrative purposes only; it has no normative status and does not restrict the processing of TEI or other documents. For definitions of terms, see the glossary below.

Document Capture and Reclamation

First, data might be captured by keyboarding into a locally defined data capture format, or by scanning into a locally defined scanner-file format. From these initial forms, transducers might convert the files into a standard local storage format.

Local Storage Format and Application Software

The local storage format might be the input format of some application program used frequently by the project. In this case, transducers might be necessary to prepare data for processing by other applications. Alternatively, the local storage format might be independent of the formats used by application programs; transducers would be needed to prepare data for any processing. Such an independent format is useful if the local storage format needs to contain more information than any single application can conveniently handle.

The local storage format might be SGML-conformant without being TEI-conformant, e.g. because it uses local DTDs instead of the standard TEI DTDs. Or it may use the TEI local processing format.
If SGML software is available, it may be used to validate the TEI local-processing format, to transduce documents into the input formats needed by applications, and when appropriate to transform documents into the TEI interchange format for exchange with other sites.

Finally, the local storage format might use the TEI interchange format. It is not expected that this will be a very common practice, since it is expected that most sites interested in TEI conformance will eventually acquire SGML-conformant software which allows for a more compact local storage than does the interchange format. In the absence of SGML software, however, some projects may find the TEI interchange format (or perhaps a restrictive variant of it) useful, because such a format can be relatively easy to parse with ad hoc software.

Whether the local storage format is strictly TEI conformant or not, it may follow TEI-recommended practice in its selection of textual features to be marked up, in its tag names, in its documentation practices, etc.

Enrichment and Other Processing

Over the course of the project, analysis and processing may produce interim results which may be incorporated into the locally stored copy of the text so that they can be used in later processing. This process of enrichment can be carried out either by manual editing of the documents using conventional text editors, or by application programs which perform (part of) the analysis.

Data Export

When a document is to be exchanged with another site using the TEI Interchange Format, it must first be transduced from the local storage format to TEI interchange form. If local documents are already TEI-conformant, this requires either no processing at all, or a relatively simple normalization which can be handled readily by the normalization facilities of most SGML parsers. If the local storage form is not SGML-conformant, some transducer must be used to transform it into the TEI interchange format.
The TEI-interchange-format document must then be packed for shipping into the TEI packed interchange format, using a packing program. This program will gather the constituent parts (files) of a document into a single file, and ensure that the file contains no characters whose safe passage to the recipient of the data is endangered by the transmission path. If the ultimate recipient of the document is unknown, the set of safe characters is very small. The specific transmission character set, however, is independent of TEI conformance: any convenient set may be used where both parties agree. The packer will ensure that the transmission character set is properly identified.

Data Import

When a document is received from another site using the TEI packed interchange format, it must first be unpacked into a TEI interchange-format document in the local character set. It may then be necessary to naturalize it by translating it into the local storage format; if the local format is TEI- or SGML-conformant, no processing is needed (although some SGML processors may offer a facility for suppressing omissible markup).

TEI Conformance in the Processing Model

The notions of TEI interchange format and TEI packed interchange format are central to the exchange of documents using the TEI guidelines, whether the local storage format is TEI-conformant or not.

The TEI interchange format and the TEI local-processing format may each be used as a local storage format, though the local storage format might well differ from either of these without materially affecting the use of TEI formats for interchange. The TEI interchange format being less flexible than the local-processing format, it is expected that sites using SGML-conformant software may use the latter, while sites without such software may prefer the former.
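
The character-handling part of the packing step described under "Data Export" (every content character either lies in the transmission character set or is replaced by an entity reference from the transmission entity set) can be sketched as follows. This is an illustration under stated assumptions, not a TEI-defined packer: the function name pack_text, the sample transmission character set, and the two-entry entity table are hypothetical. The sketch ignores the gathering of files into a single file and the checking of SGML names, and it treats its input as plain content with no markup in it.

```python
import string


def pack_text(text, transmission_chars, entity_names):
    """Return text with every character outside the transmission
    character set replaced by an entity reference.

    transmission_chars: set of characters assumed safe on the route.
    entity_names: mapping from character to entity name (the assumed
        transmission entity set).
    """
    packed = []
    for ch in text:
        if ch in transmission_chars:
            packed.append(ch)
        elif ch in entity_names:
            packed.append(f"&{entity_names[ch]};")
        else:
            # Neither safe nor representable: the packer cannot proceed.
            raise ValueError(f"no safe representation for {ch!r}")
    return "".join(packed)


# Assumed example sets; the delimiters & and ; must themselves be safe.
safe_chars = set(string.ascii_letters + string.digits + " .,;&")
entities = {"é": "eacute", "ü": "uuml"}

print(pack_text("café", safe_chars, entities))  # caf&eacute;
```

An unpacker would perform the reverse substitution into the local character set. Raising an error for an unrepresentable character, rather than silently dropping it, is a design choice of this sketch; it reflects the requirement that no character be left whose safe passage is endangered.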

The notion of TEI recommended practice, it is hoped, will be relevant to decisions about what textual features should be recorded during data capture and will thus affect data-capture formats and the transducers which render captured files into the local storage format.

The TEI abstract model may be useful in developing local non-SGML markup schemes for data capture or for processing with ad hoc application programs. It is strongly recommended that the TEI recommendations, as well as the TEI abstract model, be used for such development.

8 ASPECTS OF CONFORMANCE AND DOCUMENT DESCRIPTION

Character Sets

Neither the character sets used for local processing nor those used for transmission of interchange documents are restricted by the definition of TEI conformance. For local processing, users will typically use the system character set of their local system or some modification thereof. For exchange with known partners, users should choose any convenient character set; typically the most convenient is the set of all characters which:

1. are transmitted successfully over the existing transmission link
2. occur in both sender's and receiver's local coded character sets

For blind exchange with unknown partners, of course, a conservative choice of transmission set is needed to ensure that characters arrive correctly. How conservative the choice need be depends on the medium.

At this time (1991), the ISO 646 subset defined in TEI P1 version 1 is the only safe set of characters for the regional and international networks most widely used. Silent translation regularly destroys characters in these nets. Over large portions of these networks, however, the full complement of ASCII characters may be used successfully, so the ISO 646 subset is recommended only for fully blind interchange.

In transmission by disk or tape, however, no silent translation is likely to occur, and so larger sets may be successfully used in blind interchange.
The primary danger is a failure of software in the receiving machine to process the characters correctly; at this time (1991), ASCII or 94-character U.S. EBCDIC appear to represent the largest safe choices; other national character sets may of course be used if good internal documentation is also provided.

Note that the transmission character set does not associate specific binary encodings with the characters in the set. In the technical senses, it is a character set, not a coded character set. This means that a document may undergo various automatic translations from one coded character set to another (notably, in the case of transmission over international networks, from ASCII to EBCDIC or vice versa) without leaving the transmission character set.

SGML Declaration

The utility of various SGML constructs is discussed in section 2.2 of document TEI P1 version 1. The restrictions on SGML declarations and SGML usage in TEI interchange documents discussed above under "Modifications to TEI SGML Declaration" are derived from that discussion. No restrictions are made on SGML usage in the local processing format because such usage is best determined locally and has no impact on interchange.

SGML Document Type Declaration

The document type declarations provided by the TEI are intended to cover as wide a variety of document types and processing needs as proved feasible. It is impossible, however, for any finite list of text elements to cover every need of textual research and processing. As a result, extension of the TEI DTDs is defined as having no effect on strict TEI conformance, as long as certain restrictions are observed; these have the effect of ensuring that later users of a file can easily see what changes have been made to the DTDs and what the new tags are intended to mean.

The requirement that all new or modified tags be documented, however, is formally verifiable only to a limited extent.
It is possible for a program to verify that for every tag introduced in a DTD modification, a corresponding record exists in a Tag Set Declaration. It is impossible, however, to verify using formal means that the entry in the tag set declaration makes sense. Purely formal conformance measures, therefore, must be supplemented with human inspection of the documentation.

The concept of DTD extension is introduced to allow the concise description of software which is designed to handle documents encoded using the published DTDs but which is not prepared to deal with tags not included there.(5)

All sections of the TEI DTDs are subject to modification by the user, except that a documentary header must be provided and distinguished from the text itself, and that documentary header must include tagged elements identifying the document encoded and those responsible for the encoding. This ensures that all TEI-conformant documents will have at least this bare minimum of accompanying documentation.

Tag Usage and Feature Marking

The basic design principles of the TEI require the notion of TEI conformance to be applicable to existing electronic documents if they are translated into a proper format, without requiring the insertion of information not captured in the initial preparation of the text.(6)

At the same time, the TEI is charged with formulating advice to those engaged in the creation of new electronic texts and is required to distinguish what is actively recommended for general use from what is merely optional, provided for use by those engaged in a particular sort of work. The notion of TEI recommended practice is introduced to allow the concise description of documents in which not only the requirements, but also the recommendations of the Guidelines are followed.
It is hoped that while projects to convert existing electronic data may content themselves with achieving TEI conformance, projects to produce new electronic texts will produce documents following TEI recommended practice.

To distinguish those projects which follow the TEI's recommendation to use SGML markup from those which capture the same underlying textual features but do so using non-SGML markup, the notion of the TEI abstract model is introduced; it is this which a non-SGML-based encoding can have in common with the TEI.

Non-SGML Markup

In exchanging texts for use by others, the goal of an interchange format is to ensure that the information encoded in an electronic version of a text can be correctly understood and processed by the recipient as well as by the originator of the text. To assure the achievement of this goal, the definition offered here of TEI conformance restricts markup in TEI conformant documents to SGML markup and to properly declared non-SGML notations. The latter are explicitly recommended by version 1 of the guidelines for the encoding of tables, figures, etc. and so cannot reasonably be excluded. (Since they do place a burden on the recipient for proper processing, the use of any such non-SGML notation is defined to fall within the class of DTD extensions.)

Because of the escape clause for graphics, etc., it is in principle possible to create a TEI conformant document by embedding a document using any arbitrary markup into a driver file containing a TEI header and a declaration for the appropriate markup as a non-SGML notation. Though it falls within the letter, such a practice falls outside the spirit of TEI-conformant document interchange.

-------------------------

(1) The components of a tag set declaration are at present defined only in document TEI ED W5; they should receive a fuller definition and documentation as soon as possible.
The ultimate structure of the tag set declaration should agree with the structures used in the TEI's own reference manual for the TEI scheme.

(2) This is one of several abbreviations allowed by the SHORTTAG feature; the others (omission of attribute names under certain circumstances and omission of non-required attribute values) are allowed.

(3) At this writing there are none. -Ed.

(4) Informally, this means TEI-conformant documents must embed the standard TEI DTDs and then override any declarations which are to be modified, rather than embedding a distinct local DTD.

(5) Some will regard such simplifications as useful ways of making it easier to develop software which accepts TEI-conformant documents; others will deplore the failure of such software to accept all TEI-conformant documents including those which extend the TEI DTDs. In providing the notion of DTD extension for describing what documents are and are not accepted by such software, the TEI acts in the belief that such software will in fact be developed; it neither endorses nor deplores its construction or use.

(6) See document TEI PC P1 "The Preparation of Text Encoding Guidelines."

GLOSSARY

accept: to process correctly as input. An application accepts a class of documents if it processes correctly any document within that class. Cf. validate.

application, processing program: any program used to create or manipulate documents

character set: a collection (or repertoire) of characters, independent of their encoding in a particular coded character set

coded character set: a collection (or repertoire) of characters, each assigned to a specific representation in an electronic form. One character set may be represented by any number of coded character sets.

data capture: the process of converting a document from paper form to some local storage format

data reclamation: the process of converting a document from an electronic form to some local storage format

enrichment: the addition of further information to a document (e.g. in the form of analytic markup); in practice, this may involve transduction from a local storage format into the input form of a processing program, processing, and then transduction from the application output format into the local storage format.

export, import: the process of preparing documents for transmission or of rendering them into the local storage format after transmission; may involve canonization and packing, or unpacking and naturalization

impoverishment: the elimination of information from a document (e.g. by omitting some markup)

input format: a form in which a given program accepts input

interchange format: a form in which documents are transmitted to or received from other sites via network, magnetic media, etc.

local character set: the coded character set used in the local storage format within a site or project.

local storage format: a form in which documents are stored within the system. It may be identical to some input or output format of some processing program used; it may not. A given site may have one or several local storage formats.

naturalization: the process of converting a document from an interchange form into a local storage format (in the case of the TEI local processing format, for example, by the suppression of omissible markup)

normalization: the process of converting a document from a local storage format which is not TEI-interchange-conformant into a canonical (normalized) TEI-interchange-conformant document

output format: a form in which a given program produces output

packer, unpacker: transducers which transform documents from a local storage format (e.g. a TEI-conformant form) into an interchange format (e.g.
the TEI packed interchange format) or vice versa

TEI conformant: (of documents only) in the TEI local processing format or the TEI interchange format (as described above)

TEI Interchange Format: an interchange format defined by the TEI (see above, section "TEI Packed Interchange Format").

transducer, translator: a program which transforms a document from one format into another

transmission character set: the character set used during passage of a document from a sender to a receiver. The document may be translated from one coded character set to another along the way, so it is important that the transmission character set contain only characters which will be correctly translated at such times. As a matter of convenience, the transmission character set will typically be the largest set of characters which are known to travel safely over a given transmission route.

validate: to process correctly as input, and to verify adherence to some specification. An application validates a class of documents if it processes correctly any document within that class, while not processing any document which is not a member of the class. Cf. accept.