SGML Declarations for the TEI Guidelines

The SGML declaration used for documents encoded according to these Guidelines is reproduced as appendix . This section discusses some of the issues involved in choosing which SGML features to use for local processing and for interchange, and the relationships among SGML, these Guidelines, and other encoding schemes.

SGML and TEI Conformance

These Guidelines specify an SGML-conformant coding scheme, but they also contain rules and recommendations which cannot be enforced by an SGML parser, or in some instances by any program. No program, for example, can verify that a phrase marked as a quotation is in fact a quotation, though a program might verify that a string of characters does represent a date expressed in some known format. We distinguish TEI-conforming programs, which can correctly process any valid TEI-conforming document, from TEI-validating programs, which can determine whether a document conforms to the TEI Guidelines.

Features identified as part of the TEI encoding scheme (or family of schemes) can fall into one of these classes:

Representable and validatable in SGML
  There is an SGML feature intended for use in the representation of the feature. The most obvious example is that a single structural hierarchy can be easily encoded. An SGML parser can validate that the feature is encoded properly. In these cases the available feature should always be used.

Representable but not validatable in SGML
  There is no SGML feature directly available for representing the feature. However, there is a way (perhaps an obvious way) to encode the feature using SGML, but this involves using SGML according to some convention that is not known to a parser and thus cannot be checked by it. For example, it may be possible to use attribute values to encode information in a way that can be interpreted by an application but cannot be validated by a parser, as in using a string to contain a list of values from some predetermined enumeration. In these cases, the conventions being used are carefully documented for the writers of applications.

Not representable in SGML
  These features are rare. In fact, they are essentially non-existent, in that some SGML encoding could be found for almost anything. The encoding, even if not natural, would have the virtue of being consistent with other such encodings being used in the project.

SGML is a large and complex formal system. There are few features (perhaps none) which cannot be encoded in SGML, although an SGML parser may not be able to validate all of them.
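As a sketch of the difference between the first two classes, consider the declarations and markup below. The element name `word` and the attribute names `class` and `feats` are invented for this illustration and are not taken from the TEI DTDs; the list of feature values is likewise an assumption.

```sgml
<!-- Hypothetical declarations; names are illustrative, not from the TEI DTDs. -->
<!ELEMENT word  - -  (#PCDATA)>
<!ATTLIST word
          class  (noun | verb | adj)  #IMPLIED
          feats  CDATA                #IMPLIED>

<!-- "class" has an enumerated declared value, so a parser rejects any
     value outside the group. -->
<!-- "feats" holds a space-separated list drawn from a documented
     enumeration (say: sing, plur, masc, fem). The parser sees only a
     character string. -->
<word class="noun" feats="plur fem">casas</word>

<!-- Still valid SGML: only an application that knows the convention
     can detect the misspelled "plurr". -->
<word class="noun" feats="plurr fem">casas</word>
```

The first attribute is representable and validatable; the second is representable but not validatable, so its convention has to be documented for application writers.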

Note therefore that SGML parsers using TEI DTDs will be TEI-conforming but not necessarily TEI-validating, as validation may require extra processing. Conversely, programs might be TEI-validating without being strictly SGML-validating or even SGML-conformant. While all TEI documents are valid SGML documents, not all valid SGML documents using the TEI DTDs are necessarily TEI-conformant: some variations permitted by SGML are not allowed by these Guidelines.

Use of SGML Features

The SGML standard includes a wide variety of optional and basic features, some of which are the subject of controversy. This section describes the TEI's usage of these features, summarizing the reasoning which led to the SGML declaration for interchange, which is reproduced in appendix . As noted above (section ), the choice of SGML features for local processing is a local decision, based on requirements quite different from those for interchange, which concern us here.

Use of the following features is restricted in TEI-conformant documents intended for interchange (although some are, as noted, useful for data capture or local processing):

SHORTREF
  This is a search and replace mechanism for undelimited strings. It is useful for data capture but is hard to reconcile with the use of multiple concurrent markup streams. It is not allowed in interchange documents.

DATATAG
  This allows specified strings of data characters to be interpreted as markup. Like SHORTREF, it may be useful for data capture. It is not allowed in interchange documents.

OMITTAG
  This allows start and end tags to be omitted in certain circumstances (where unambiguous in context and allowed by the DTD's tag-omissibility rules). It is convenient for data capture and for some local processing, so meaningful tag omissibility rules are specified in TEI DTDs. However, the formal determination that a tag omission is unambiguous often proves unexpectedly complex, and careless use of the feature can make documents hard to read, even ambiguous. Use of OMITTAG is not allowed in interchange documents.

RANK
  This allows certain abbreviations in the naming of multi-leveled tags. It is forbidden for data interchange and not recommended for local processing.

SHORTTAG
  This allows:

  1. omission of attribute names when the attribute value comes from a closed list of possible values, and
  2. omission of generic identifiers in start and end tags in some circumstances. The rules for determining which generic identifier is supplied vary with which features are turned on in the SGML declaration.

  The use of SHORTTAG is allowed in the first case only.

SUBDOC
  This allows individual parts of an SGML document to have their own document type declarations. Specialized structures can thus be kept separate from the general environment, which helps control DTD complexity. When a SUBDOC is encountered in a document, the current DTD is replaced by that of the SUBDOC, and restored when the end of the SUBDOC is reached. No information can be exchanged between the two contexts.

  SUBDOCs are invoked by entity reference, either simply embedded in the text or supplied as the value of an attribute (e.g. on an include tag). Only the latter form is allowed in interchange of TEI documents.

Marked Sections
  In general, the use of marked-section keywords other than INCLUDE, IGNORE, and CDATA should be avoided.

DTD-specific entity definitions
  When multiple concurrent document type declarations are used, SGML allows the same entity name to have different definitions in each document type. This is forbidden in TEI-conforming documents.
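To make the restrictions on minimization concrete, the fragment below shows the same short list twice: first as it might be keyed in locally with OMITTAG and SHORTTAG in use, then as it would have to appear for interchange. This is only a sketch; the elements `list` and `item` and the attribute `type` are assumptions made for the example, not declarations quoted from the TEI DTDs.

```sgml
<!-- Hypothetical declarations. The "O" in the item declaration marks
     its end tag as omissible. -->
<!ELEMENT list  - -  (item+)>
<!ELEMENT item  - O  (#PCDATA)>
<!ATTLIST list
          type  (ordered | bullets)  bullets>

<!-- Data capture form. The first item has no end tag (OMITTAG); the
     second is closed by an empty end tag (second SHORTTAG case). -->
<list ordered>
<item>First point
<item>Second point</>
</list>

<!-- Interchange form. Every tag is present with its generic
     identifier. The attribute name is still omitted, which is the
     first SHORTTAG case and remains permitted. -->
<list ordered>
<item>First point</item>
<item>Second point</item>
</list>
```

Both forms yield the same parsed document, so a local file in the first form can be normalized to the second before interchange.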

Other features are not restricted in TEI-conformant documents. The following paragraphs describe the uses to which they appear most suited in the context of the TEI.

Attributes are one of the most controversial features of SGML. The controversy is basically this: anything that can be encoded in an attribute can also be encoded as text surrounded by tags. Attributes are thus not strictly necessary, and they complicate the formalism. Concurrent markup with multiple document types, however, is seriously complicated when attributes are not used, since the information they would have carried must then appear as tagged text, and this extra-textual information is introduced into all concurrent views of the document at the same time. Attributes are therefore used in the TEI DTDs without restriction.

Inclusion and exclusion exceptions allow specified elements to be included at any point within a given element, or excluded entirely within that element (even though the content models specify otherwise). (See section .) They can greatly simplify content models, but should be used with care.
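A minimal sketch of both kinds of exception follows. The element names are hypothetical and chosen only to keep the example short; they are not the declarations used in the TEI DTDs.

```sgml
<!-- Hypothetical declarations; element names are illustrative only. -->

<!-- Inclusion: "note" may appear anywhere inside "body", at any depth,
     although no content model below mentions it. -->
<!ELEMENT body  - -  (p+)               +(note)>
<!ELEMENT p     - -  (#PCDATA | emph)*>
<!ELEMENT emph  - -  (#PCDATA)>

<!-- Exclusion: a note may contain paragraphs but never another note,
     even though the inclusion on "body" would otherwise allow one. -->
<!ELEMENT note  - -  (p+)               -(note)>

<!-- A valid instance: the note sits inside a paragraph. -->
<body>
<p>The text proper<note><p>An editorial comment.</p></note> continues.</p>
</body>
```

Without the inclusion, `note` would have to be added to the content model of `p`, of `emph`, and of every other element in which a note might occur. The cost is that the content models alone no longer show where notes are allowed.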

CONCUR allows multiple views (hierarchies) to be marked within a document. It is imperfect, but if we excluded it, we would have to design another mechanism to accomplish the same thing, since the ability to specify multiple views is a necessity in many texts.
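The following sketch shows how two views of one text might be marked with CONCUR: a logical view (paragraphs) and a physical view (pages) whose boundaries do not nest. The document type names `text` and `pages`, the element names, and the system identifiers are all placeholders invented for the example; the two DTDs themselves are not shown, and CONCUR must be enabled in the SGML declaration.

```sgml
<!-- Hypothetical document types; the system identifiers are placeholders. -->
<!DOCTYPE text   SYSTEM "text.dtd">
<!DOCTYPE pages  SYSTEM "pages.dtd">

<!-- Each tag names, in parentheses, the document type it belongs to. -->
<(text)text><(pages)pages>
<(pages)page n="1">
<(text)p>A paragraph that begins on the first page
</(pages)page>
<(pages)page n="2">
and ends on the second.</(text)p>
</(pages)page>
</(pages)pages></(text)text>
```

Read as a `text` document, the markup contains one unbroken paragraph; read as a `pages` document, it contains two pages. Neither hierarchy has to be broken to accommodate the other.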

The reference concrete syntax is used in the TEI DTDs and in all examples in these Guidelines. It is recommended but not required that all TEI-conformant texts use it.

The quantities and capacities specified in the SGML declaration in appendix diverge from the default values given in ISO 8879. Notably, the name length limit has been set to 128 (instead of 8) and the literal length has been set to 1024 (instead of 240). The default nesting level has been retained, but will be increased if any examples are found which require deeper nesting.

The default case sensitivity has been retained: entity names are case sensitive but other SGML identifiers are not. Existing entity names from public entity sets have been used in preference to defining new names for the same graphic characters.
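The quantity and case settings just described might appear in the SYNTAX clause of an SGML declaration roughly as follows. This is a sketch of the relevant parameters only, not the declaration from the appendix. We also assume that the "nesting level" mentioned above corresponds to the TAGLVL quantity.

```sgml
-- Fragments of the SYNTAX clause only; not a complete SGML declaration. --

NAMECASE  GENERAL  YES   -- element and attribute names are folded to upper case --
          ENTITY   NO    -- entity names are left as typed, so case matters --

QUANTITY  SGMLREF        -- start from the reference quantity set, then override --
          NAMELEN  128   -- reference value is 8 --
          LITLEN   1024  -- reference value is 240 --
          -- TAGLVL is not listed, so it keeps its reference value of 24 --
```

Under these settings `<P>` and `<p>` are the same tag, but `&Eacute;` and `&eacute;` are references to two different entities.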

The SGML declaration in appendix specifies a subset of ISO 646 as the default character set; this subset is also a subset of ASCII (ANSI X3.4). For full discussion of this point see chapter . A corresponding subset of EBCDIC characters may also be used for TEI documents. When fuller character sets are used for local processing or private interchange, SGML identifiers may use any legal character in the character set; they are not restricted to the ISO 646 subset.

If the local character set is an extension of ISO 646, problems may arise in interchange of documents. Methods of overcoming these problems are under investigation: see further .