. This section
discusses some of the issues involved in choosing which SGML features to
use for local processing and interchange, and the relationships among
SGML, these guidelines, and other encoding schemes.
SGML and TEI Conformance
These guidelines specify an SGML-conformant coding scheme, but they also
contain rules and recommendations which cannot be enforced by an SGML
parser, or in some instances by any program.
No program, for example, can verify that a phrase marked as
a quotation is in fact a quotation, though programs might verify
that a string of characters does represent a date expressed in
some known format.
We distinguish TEI-conforming programs, which can correctly
process any valid TEI-conforming document, from
TEI-validating programs, which can determine whether a
document conforms to the TEI guidelines.
Features identified as part of the TEI encoding scheme (or family of
schemes) can fall into one of these classes:
Representable and validatable in SGML
There is an SGML feature intended for use in the
representation of the feature. The most obvious example is that
a single structural hierarchy can be easily encoded. An SGML
parser can validate that the feature is encoded properly. In
these cases the available feature should always be used.
Representable but not validatable in SGML
There is no SGML feature directly available for representing
the feature. However, there is a way (perhaps an obvious
way) to encode the feature using SGML, but this involves using
SGML according to some convention that will not be known to a
parser and that thus cannot be checked by a parser. For
example, it may be possible to use attribute values to encode
information in a way that can be interpreted by an application
but cannot be validated by a parser, as in using a string to
contain a list of values from some predetermined enumeration. In
these cases, the conventions being used are carefully documented
for the writers of applications.
Not representable in SGML
These features are rare. In fact, they are essentially
non-existent in that some SGML encoding could be found for
almost anything. The encoding, even if not natural, would have
the virtue of being consistent with other such encodings being
used in the project.
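As a sketch of the second class, the declarations below contrast an
attribute whose values a parser can check with one that only an
application can check. The element name term and the attribute names
status and flags are hypothetical, chosen for illustration; they are not
taken from the TEI DTDs, and the convention for flags is likewise an
assumption.

```sgml
<!-- Hypothetical declarations, for illustration only -->
<!ELEMENT term  - - (#PCDATA)>
<!ATTLIST term
          status (draft | final)  draft     -- closed list: the parser checks it --
          flags  CDATA            #IMPLIED  -- free string: only an application can check it -- >

<!-- The parser rejects this: "finnal" is not in the declared list -->
<term status="finnal">text</term>

<!-- The parser accepts this, although "itallic" breaks the (assumed) convention
     that flags holds a space-separated list drawn from: bold, italic, underline -->
<term flags="bold itallic">text</term>
```

The error in the second instance can be caught only by a program that
knows the documented convention, which is why such conventions need to
be written down for application writers.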
SGML is a large and complex formal system.
There are few features (perhaps none) which cannot be encoded in SGML,
although an SGML parser may not be able to validate all of them.
Note therefore that SGML parsers using TEI DTDs will be TEI-conforming but
not necessarily TEI-validating, as validation may require extra processing.
Conversely, programs might be TEI-validating without being strictly
SGML-validating or even SGML-conformant. While all TEI documents are valid
SGML documents, not all valid SGML documents using the TEI DTDs are
necessarily TEI conformant. Some variations permitted by SGML are not
necessarily allowed by these Guidelines.
Use of SGML Features
The SGML standard includes a wide variety of optional and basic
features, some of which are the subject of controversy. This section
describes the TEI's usage of these features, summarizing the reasoning
which led to the SGML declaration for interchange, which is reproduced
in appendix . As noted above (section ), the choice of SGML features
for local processing is a local decision, based on requirements quite
different from those for interchange, which concern us here.
Use of the following features is restricted in TEI-conformant documents
intended for interchange (although some are, as noted, useful for data
capture or local processing):
SHORTREF This is a search and replace mechanism for
undelimited strings. It is useful for data capture but is hard
to reconcile with the use of multiple concurrent markup streams.
It is not allowed in interchange documents.
DATATAG This allows specified strings of data characters
to be interpreted as markup. Like SHORTREF, it may be useful
for data capture. It is not allowed in interchange documents.
OMITTAG This allows start and end tags to be omitted in
certain circumstances (where unambiguous in context and allowed
by the DTD's tag-omissibility rules). It is convenient for data
capture and for some local processing, so meaningful tag
omissibility rules are specified in TEI DTDs. However, the
formal determination that a tag omission is unambiguous often
proves unexpectedly complex, and careless use of the feature can
make documents hard to read, even ambiguous. Use of OMITTAG is
not allowed in interchange documents.
RANK This allows certain abbreviations in the naming of
multi-leveled tags. It is forbidden for data interchange and
not recommended for local processing.
SHORTTAG This allows:
- omission of attribute names when the attribute value comes from
a closed list of possible values, and
- omission of generic identifiers in start and end tags in
some circumstances. The rules for determining which generic
identifier is supplied vary with which features are turned on
in the SGML declaration.
The use of SHORTTAG is allowed in the first case only.
SUBDOC This allows individual parts of an SGML document
to have their own document type declarations. Specialized
structures can thus be kept separate from the general
environment, which helps control DTD complexity. When a SUBDOC
is encountered in a document, the current DTD is replaced by
that of the SUBDOC, and restored when the end of the SUBDOC is
reached. No information can be exchanged between the two
contexts.
SUBDOCs are invoked by entity reference, either simply embedded
in the text, or supplied as the value of an attribute (e.g. on
an include tag). Only the latter form is allowed in
interchange of TEI documents.
Marked Sections In general, the use of marked-section
keywords other than INCLUDE, IGNORE, and CDATA should be
avoided.
DTD-specific entity definitions When multiple concurrent
document type declarations are used, SGML allows the same entity
name to have different definitions in each document type. This
is forbidden in TEI-conforming documents.
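The effect of the OMITTAG restriction, and of the one permitted use of
SHORTTAG, can be sketched with a small list. The element names list and
item and the attribute name type are hypothetical and are not taken from
the TEI DTDs.

```sgml
<!-- Hypothetical declarations: "- O" makes the end tag of item omissible -->
<!ELEMENT list  - - (item+)>
<!ATTLIST list  type (ordered | bulleted) bulleted>
<!ELEMENT item  - O (#PCDATA)>

<!-- Data capture or local processing: item end tags omitted (OMITTAG) -->
<list ordered>
<item>first point
<item>second point
</list>

<!-- Interchange: every end tag supplied. The attribute name is still
     omitted, which the first SHORTTAG case permits because "ordered"
     comes from a closed list of values. -->
<list ordered>
<item>first point</item>
<item>second point</item>
</list>
```

Even in so small a case, the reader of the first form must consult the
DTD to know where each item ends; the second form can be read without it.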
Other features are not restricted in TEI-conformant documents. The
following paragraphs describe the uses to which they appear most suited
in the context of the TEI.
Attributes are one of the most controversial features of
SGML. The controversy is basically this: anything that can be encoded
in an attribute can be encoded as text surrounded by tags. Attributes
are thus not strictly necessary, and they complicate the formalism.
Concurrent markup with multiple document types, however, is seriously
complicated when attributes are not used, since extra-textual
information (the attribute values) is introduced into all concurrent
views of the document at the same time. Attributes are therefore used
in the TEI DTDs without restriction.
Inclusion and exclusion exceptions allow specified elements
to be included at any point within a given element, or excluded entirely
within that element (even though the content models specify otherwise).
(See section .) They can greatly simplify content
models, but should be used with care.
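A minimal sketch of both kinds of exception follows. The element names
body, p, and note are hypothetical, and the declarations are not taken
from the TEI DTDs.

```sgml
<!-- Hypothetical declarations, for illustration only -->
<!ELEMENT body  - - (p+)       +(note)>  <!-- note may occur anywhere inside body -->
<!ELEMENT p     - - (#PCDATA)>
<!ELEMENT note  - - (#PCDATA)  -(note)>  <!-- but never inside another note -->

<!-- Valid: the content model of p does not mention note,
     yet the inclusion on body allows it here -->
<body>
<p>Main text<note>A remark on the main text.</note> continues.</p>
</body>
```

Without the inclusion, note would have to be named in the content model
of every element in which it may occur. The cost is that the content
model of p no longer tells the whole story of what p may contain.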
CONCUR allows multiple views (hierarchies) to be marked
within a document. It is imperfect, but if we excluded it we would have
to design another mechanism to accomplish the same thing, since the
ability to specify multiple views is a necessity in many texts.
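The fragment below sketches how tags belonging to two concurrent
document types can overlap. The document type names logical and physical
and the element names s and line are hypothetical, and the fragment
omits the document type declarations and enclosing elements that a
complete document would need.

```sgml
<!-- Fragment only: a sentence (logical view) that runs across
     a line break (physical view) -->
<(physical)line><(logical)s>The sentence begins on one line
</(physical)line>
<(physical)line>and ends on the next.</(logical)s></(physical)line>
```

Each tag names the document type to which it belongs, so the s element
and the two line elements are each properly nested within their own
hierarchy even though they overlap one another.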
The concrete reference syntax is used in the TEI DTDs and
all examples in these Guidelines. It is recommended but not required
that all TEI-conformant texts use it.
The quantities and capacities specified in the
SGML declaration in appendix diverge from the default
values given in ISO 8879.
Notably, the name length limit has been set at 128 (instead
of 8) and the literal length has been set to 1024 (instead
of 240). The default nesting level has been retained, but will be
increased if any examples are found which require deeper nesting.
The default case sensitivity has been retained: entity names are case
sensitive but other SGML identifiers are not. Existing entity names
from public entity sets have been used in preference to defining new
names for the same graphic characters.
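The settings described in this paragraph correspond to parameters of the
following kind in the SYNTAX part of an SGML declaration. This is a
sketch built from the values stated above, not a copy of the declaration
in the appendix, and all other parameters are omitted.

```sgml
         NAMECASE GENERAL YES    -- element and attribute names are not case sensitive --
                  ENTITY  NO     -- entity names are case sensitive --

         QUANTITY SGMLREF
                  LITLEN  1024   -- reference value is 240 --
                  NAMELEN 128    -- reference value is 8 --
```

SGMLREF selects the reference quantity set of ISO 8879; only the
quantities listed after it are changed.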
The SGML declaration in appendix specifies a subset of
ISO 646 as the default character set; this subset is also a subset of
ASCII (ANSI X3.4). For full discussion of this point see chapter
. A corresponding subset of EBCDIC characters may also
be used for TEI documents. When fuller character sets are used for
local processing or private interchange, SGML identifiers may use any
legal character in the character set; they are not restricted to the ISO
646 subset.
If the local character set is an extension of ISO 646,
problems may arise in interchange of documents. Methods of overcoming
these problems are under investigation: see further .