Texts occur in many
or Wittgenstein.
The language of a text element generally corresponds to a particular
Because computer systems and their display capabilities vary, the particular character repertoire(s) used to represent each language used in a document must be explicitly specified, in a manner which allows any receiver of data correctly to interpret the author's intent.
This chapter specifies the method by which a TEI-conforming text document (a) denotes changes of language or character repertoire, and (b) documents the meanings of the character codes it includes.
At this time, the specification is intended mainly for languages with writing systems which are phonetically-based (as opposed to syllabically or lexically), and which are relatively easy to render on current computer systems (as opposed to those whose orthographic units are multiply accented, modified on the basis of context, require calligraphic quality, and so on). The specification is, however, intended to be a general mechanism capable of principled extension.
In some few cases more than one language
for
the purposes of encoding. For convenience, the term language
is
used in this section to refer not only to natural languages
Several basic terms need to be defined. We distinguish:
alphabet
or syllabary;
they may also include
other units such as diacritical marks, punctuation, and so on.
A document may include portions in any number of character sets, preferably with one coded character set corresponding to each. Although multiple repertoires may be used for a single language in a single document, this is to be avoided where practical.
In the majority of cases there will be no need for users to define new coded character sets; rather, registered and publicly known coded character sets can and should be used, simplifying the task of information transfer. Section 4.2 below provides information on the declaration and use of coded character sets, transliterations, and entity sets within the TEI encoding scheme.
Coded character sets are defined and standardized at several levels. Formal standards are promulgated by national and international standards bodies. Less formal standards are often developed by industry standards groups. Finally, other "standards" arise by virtue of common usage, often as a consequence of design decisions by individual vendors.
Any formal coded character standard promulgated by a recognized
national or international standards body may be
incorporated by reference in the DOCE, obviating full specification of
the meanings of character codes. Specifying a standard is sufficient,
however, only if the file
ISO has also promulgated ISO 2022,
Any organization wishing to register a new coded character set which conforms to ISO 2022 may propose it. ISO does not apply a fine filter to registration; rather, registration of multiple character sets for the same or analogous purposes is permitted. Once a character set has been registered, however, there will be a standard way to refer to it using the control-character sequences described in ISO 2022, and a description of the set will be publicly available.
There are many additional standards already established, which should be consulted before creating any new ones. Others of interest to some users of these guidelines include ANSI Z39.12-1972,
It is worth noting that the names of standards are often heard
in reference to character sets which only partially resemble
the standards.
For example, the term ASCII
formally refers to one national
standard, but is commonly used to designate any file which is encoded
entirely with printable
characters, and without binary numeric or
program-specific data. If a file either (a) uses numeric character
codes not assigned within the ASCII standard, or (b) uses particular
assigned codes with a meaning or intent not specified in the standard,
then it is non-conforming.
Industry standards bodies have also provided useful standards for character encoding. One such body is the European Computer Manufacturers Association, or ECMA.
De facto standards have been established by common use
of particular kinds of computer hardware.
Many computer vendors have established
Among the more popular vendor-specific character repertoires are the following, all of which differ, and fail to conform to the requirements of ISO 2022, and so cannot be formally registered under those provisions:
If only a subset of a vendor-specific character repertoire is used, and that subset conforms to a registered standard, then the file conforms to the registered standard, and the standard may be specified without further elaboration.
For example, both the IBM PC and the Apple Macintosh intrinsic character repertoires fail to conform to ISO 646, to ASCII, or even to the ISO 2022 rules for extended character repertoires (for example, they define numeric codes 129-159 as graphical characters rather than reserving them as control codes).
The TEI does not currently provide an organization for registration of coded character repertoires. If a coded character set is defined in these guidelines, then it is considered registered for purposes of the guidelines, and its formal name may be used without further elaboration. The reference section of these guidelines includes descriptions of several character repertoires in common use.
Registration and use of new character repertoire standards should be undertaken with great discretion, because they complicate the task of data interchange. TEI reserves the right to refuse registration of any proposed character repertoire; it is anticipated that this right will be exercised primarily when a proposed standard merely permutes an existing standard, or otherwise makes non-functional changes. The guidelines of the following section should be considered before any character repertoire is proposed for registration.
In designing either extensions to an existing standard or an entirely
new standard, the prime criteria should be
Data loss may easily occur when translation is performed during movement of data from one computer system to another. Users cannot safely assume that their data will not be translated; indeed, repeated translation may occur within a single transfer, as is standard practice on multi-vendor networks. Because of this, character repertoires are safe for interchange only when they avoid those characters which are not thoroughly standardized.
The most thoroughly standardized characters
are those of ISO 646 IRV. Even a few characters within ISO 646 IRV are
not entirely safe for data translation; in particular, the translation
between EBCDIC and ISO 646 poses problems (for details see
ISO 646 IRV characters which nevertheless are "dangerous", in the
sense of often failing to pass through network or other transmissions
unscathed, are listed here. In short, they are the national-use
characters, plus exclamation point (2/01). Control characters are also
commonly lost, except perhaps carriage return (0/13) and line feed
(0/10). The consequent problems of information loss apply to all types of
data, including text, programs, electronic mail, and others.
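The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the guidelines: the function name `risky_characters` is hypothetical, and the set of national-use characters is an assumption based on the national-use positions of ISO 646, since the guidelines' own list is not reproduced here.

```python
# Assumption: the twelve national-use positions of ISO 646, shown with
# their IRV/ASCII graphics. The guidelines' own table may differ.
NATIONAL_USE = set("#$@[\\]^`{|}~")

# Per the text: national-use characters plus the exclamation point.
RISKY_GRAPHICS = NATIONAL_USE | {"!"}

# Per the text: carriage return and line feed are the controls most
# likely to survive transfer.
TOLERATED_CONTROLS = {"\r", "\n"}


def risky_characters(text):
    """Return (offset, character) pairs that may not survive transfer."""
    found = []
    for offset, ch in enumerate(text):
        code = ord(ch)
        if code > 127:
            # Outside the 7-bit ISO 646 range altogether.
            found.append((offset, ch))
        elif code < 32 or code == 127:
            # Control characters are commonly lost.
            if ch not in TOLERATED_CONTROLS:
                found.append((offset, ch))
        elif ch in RISKY_GRAPHICS:
            found.append((offset, ch))
    return found
```

A file for which `risky_characters` returns an empty list uses only characters that this sketch treats as safe; anything reported would need re-coding, for example as an entity reference, before interchange.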
Of particular note in translating ISO 646 to EBCDIC are the lack of the
code
extension control characters
specified in ISO 646 section 4.1.3 and
ISO 2022 sections 6.1.6-6.1.7 are also unsafe for transfer.
Of particular note in translating EBCDIC to ISO 646 are three characters
defined only in the former code:
The remaining safe
characters are as shown
below. These characters are likely to survive transmission through
networks, even those which include machines using national variants
of ISO 646 and those using national variants of EBCDIC.
The characters reserved by SGML for its own use are, for the most part, confined to the "safe" set. The significant exception is the exclamation point, which must appear in the opening delimiter of certain SGML constructions. For safe transmission to and from EBCDIC systems, this character must be re-coded in some way.
One solution proposed elsewhere for ISO 646/EBCDIC transfer has been to transfer files in binary, never translating them; all SGML declarations would then be modified to re-define all the meanings of numeric values. That solution is specifically and strongly discouraged by this standard. First, binary file transfer is more complicated and fragile than text transfer, given current software. Second, untranslated files cannot be viewed or edited with standard software on the receiving systems--even many current SGML processors could not support such files. Third, maintaining multiple SGML declarations (one per version of EBCDIC, plus one for ISO 646 and one for each other code) is difficult and error-prone. And fourth, ongoing efforts to resolve differences among variants of EBCDIC and among derivations from ISO 646 would have to be duplicated.
With the concern of preventing data loss, the following
It is important that uniform utilities become available for translating files not conforming to the ISO 646 subset into conforming versions and back, in order to facilitate file transfer. At this time, many such utilities are available, but they themselves have not been standardized; therefore the safest practice is not to attempt to transmit other characters.
upper half). Such files will likely suffer extreme data loss on multi-vendor networks, and will appear drastically different on the default displays of differing vendors. Information loss will occur in transfer to EBCDIC-based or strictly 7-bit hardware, making this level of conformity (or any less-conforming level) highly undesirable, even though it is commonly used, for example in the major personal computer vendors' character repertoires. Note that this level still does not permit use of the control characters (00-31) for graphical character data.
"two-octet" codes used for lexically-based orthographies such as Japanese, and for seemingly
"universal" character repertoires, which combine many character repertoires into one.
At present, interchange performed according to these
guidelines may use
There are two major candidate methods for marking character set changes:
Both methods are legal under these guidelines, but for general use the latter is strongly recommended. Method 1 has these advantages:
The second method (using SGML attributes) is preferred and strongly recommended for the following reasons:
can easily be defined to use a Greek character repertoire if appropriate, whereas with escape sequences this is quite complex, and may prevent use of low-end SGML parsers.
- Attributes marking character repertoire are easy for humans to read and interpret (even with no specialized software); escape and shift-in/shift-out characters, on the other hand, have no commonly accepted graphical representation, and the codes specified by ISO 2022 are convenient only for programs, unlike mnemonic attribute values.
- The characters required are only those already required for transmitting any SGML file, whereas many present-day computer networks do not transmit escape sequences without data loss.
- An SGML parser knows how to parse tags and attributes, and how to inform a conforming application of their presence; but such a parser has no intrinsic knowledge of ISO 2022 conventions, and so cannot fully interpret language changes.
%%font35%%
or \Greek
for a character
set change, share most problems of escape sequences, merely using a
different trigger
character(s). They are to be avoided on the
same grounds, plus the ground that they lack the force of a standard
such as ISO 2022.
A document may include portions in any number of character
repertoires, each of which is a set of
The DOCE must be encoded so as to be transferable across networks without loss. Therefore it must fulfill these requirements
A DOCE DTD provides information about one character repertoire, and all the characters used in it. The information about the character repertoire as a whole includes
The definition of each character includes at least the following
Some of the information which the DOCE provides can be given to an
SGML processor via the CHARSET
portion of the SGML declaration
(see ISO 8879-1986(E), section 13.1). If any characters in the
repertoire are encoded using numeric values not defined as graphical
data characters by the SGML declaration in effect, then an SGML
declaration must be provided which does declare them.
The DOCE, however, must also be provided, because it specifies information which is not encoded in an SGML character set declaration (although, of course, some or all of the additional information may happen to be specified in SGML comments). A DOCE could be programmatically converted to an SGML character set declaration, but not vice-versa.
These guidelines recommend use of an attribute called lang
,
which may be specified on lang
attribute may, however, be specified for an element in
any of these ways
lang attribute on the start-tag for the document element instance. The value is specified as the formal reference name of the character repertoire in which the element's content is encoded, as defined via the
formal attribute in the DOCE.
lang attribute for the element type in the document's DTD.
lang attribute, either explicitly or by default. In this case the character repertoire is to be interpreted as being that of the containing document element.
Option (c) may not be used for the outermost (or "root") document
element. The methods just listed are in priority order: an explicit
value for the lang attribute overrides a default value, which in
turn overrides an inherited value.
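The priority order just described (explicit value, then DTD default, then inheritance from the containing element) can be illustrated with a short Python sketch. The `Element` class, the `effective_lang` function, and the repertoire names used in the example are all hypothetical; a real SGML application would obtain this information from its parser.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Element:
    """A hypothetical, minimal stand-in for a parsed document element."""
    name: str
    lang: Optional[str] = None          # explicit value on the start-tag
    parent: Optional["Element"] = None  # containing element, None for root


def effective_lang(element: Element, dtd_defaults: Dict[str, str]) -> str:
    """Resolve the character repertoire in effect for an element.

    dtd_defaults maps element type names to the default lang value
    declared for that type in the DTD (assumed to be known).
    """
    node = element
    while node is not None:
        if node.lang is not None:
            return node.lang             # (a) explicit value wins
        default = dtd_defaults.get(node.name)
        if default is not None:
            return default               # (b) then the DTD default
        node = node.parent               # (c) otherwise inherit
    # Inheritance is not available to the root element.
    raise ValueError("root element has neither an explicit nor a default lang")
```

For instance, with a root element carrying `lang="english"` and a DTD default of `"greek"` for a `quote` element type, a `quote` with no explicit attribute resolves to `"greek"`, while a plain paragraph inside the root resolves to `"english"`.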
For those cases where character repertoire must be changed, but where
the text for which it must be changed does not constitute a document
element in its own right, the lang
attribute on that tag.
See also section 6.3.6.
For example,
an isolated word in a particular language might be tagged as
However, in many such cases a tag such as
For special characters which occur infrequently (as opposed
to special characters used frequently throughout a document), it is
often more convenient and perspicuous to encode the characters via
SGML entities.
For more information, see section 3.1.7.
An entity is encoded as an ampersand, an entity name, and a semicolon.
For example, in the default SGML entity set (documented in section D.4
of ISO 8879(E))
Their primary disadvantage is verbosity; this problem is relatively
minor, and decreasing in significance with time. This is true because
(a) more advanced computer systems handle encoding and decoding
automatically (so users need never type the entire sequences), and (b)
computer storage decreases continually in cost, making the overhead for
storage decreasingly significant. Also,
Standard entity sets are available for encoding many accented
or non-Latinate characters, for many printer's symbols, mathematical
symbols, and other purposes. These are best used when
When special characters are to be encoded as entities, public
entity sets should be used wherever possible. Section D.4 of ISO
8879(E) defines a wide variety of letters, accented letters, and
special symbols, and should be consulted before any new special character
entities are defined. When the needed symbols are defined in D.4,
the name used there should be used in preference to others.
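As an illustration of encoding special characters via public entity names, the following Python sketch replaces a handful of accented letters with entity references. The function name `encode_entities` is hypothetical, and the table covers only a few names (`Aacute`, `aacute`, `eacute`, `uuml`, `amp`) chosen for the example; a real table would be drawn from the full public entity sets.

```python
# A deliberately tiny table, for illustration only.
ENTITY_NAMES = {
    "\u00c1": "Aacute",  # upper-case A with acute accent
    "\u00e1": "aacute",  # lower-case a with acute accent
    "\u00e9": "eacute",  # lower-case e with acute accent
    "\u00fc": "uuml",    # lower-case u with umlaut
    "&": "amp",          # the ampersand itself must be escaped
}


def encode_entities(text, names=ENTITY_NAMES):
    """Replace special characters with '&name;' entity references."""
    pieces = []
    for ch in text:
        if ch in names:
            pieces.append("&" + names[ch] + ";")
        elif ord(ch) < 128:
            pieces.append(ch)
        else:
            # No numeric fallback is attempted here: an unknown character
            # is reported so that a named entity can be chosen for it.
            raise ValueError("no entity name known for %r" % ch)
    return "".join(pieces)
```

Raising an error for unmapped characters, rather than emitting a numeric reference, is a design choice of this sketch: it forces each special character to be given a readable name.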
Numeric character entity references, of the form
A selection of commonly used entities is included in the reference
section of these guidelines.
The following sections provide information about many of the character
repertoires commonly used to encode data in several languages. The
documentation of a repertoire here does not constitute recommendation or
approval; rather, such documentation is provided for the convenience of
users. However, some of the repertoires shown are marked
Eastern Europe may be incomplete in June 1990.
Recommend
Document
Recommend ISO 8859/5 (though incomplete).
Document
Recommend Beta-code? Problems Accents follow lower-case letters,
but precede upper-case; officially, marks upper case via asterisk prefix,
though few still do this; uses
Document SuperGreek, SMK GreekKeys?
Recommend Michigan-Claremont?
Recommend
Document
Phonetic alphabets tend to be very large; some require novel
compositions of characters, while some require very large sets of
symbols. Also, phonetic representations are not entirely standardized,
and are most commonly used for short texts. For these and other
reasons, it is recommended that phonetic alphabets generally be encoded
via SGML entity references, where possible using the entities defined in
section D.4 of ISO 8879-1986(E). Necessary additional entities should
be created using the same naming conventions, and clearly documented.
Separate encoding of diacritics should be used in preference to
composite encoding (see below), except where points of linguistic theory
or argumentation dictate otherwise.
*Recommend
Document
ISO 8879(E) provides entity sets which allow for at least two different
encodings for accented letters first as
For most languages, pedagogical texts, grammars, and other
descriptions present diacritics as orthographic units in their own
right, which are composed with others. For example, Greek textbooks
treat diacritics as named units which accompany letters, rather than
asserting that Greek has an alphabet of several hundred (composite)
characters. Thus it is generally more natural to encode diacritics
separately from the characters they modify or otherwise adjoin.
This allows much smaller character repertoires: about 60 characters
for Greek, as opposed to about 200 if composite characters are used
(plus digits, punctuation, any reserved codes, ...). It also facilitates
consistency in font design, and accommodates software and hardware
systems limited to 7 or 8 bit codes (i.e., most current systems).
The main disadvantage of separate encoding for diacritics is that some
computer systems do not support over-striking; such systems are becoming
rarer, however, and even they can display accents following letters,
which is at least legible, even if unaesthetic.
Both composite and sequential representations for diacritics are
comparably suitable for sorting and searching. In one, whole characters
may need to be ignored; in the other, characters must be re-mapped to
other values. Neither is convenient, but some such process is required
for adequate natural language text processing. The requirements of
specific interpretive processing, including sorting, are
beyond the scope of this standard.
It is worth noting again that this standard does not constrain internal
representation, keyboarding decisions, and the like; rather, it is an
interchange standard, designed to facilitate transfer and re-use of data
by as wide a range of users as possible. Thus, users can type, store,
or process composite characters on systems where this is necessary or
preferable, yet provide the more perspicuous separate characters for
interchange. This is to be encouraged, as is the development of
software to make representational transformations entirely convenient.
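A representational transformation of the kind encouraged here might look like the following Python sketch, which converts between composite entity references and a letter followed by a separate diacritic entity. The function names and the small mapping table are hypothetical; the entity names shown (`aacute`, `Aacute`, `egrave`, `uuml`, `acute`, `grave`, `uml`) are assumed to follow the public entity sets.

```python
import re

# Composite entity name -> (base letter, diacritic entity name).
# Only a few pairs are listed, for illustration.
COMPOSITES = {
    "aacute": ("a", "acute"),
    "Aacute": ("A", "acute"),
    "egrave": ("e", "grave"),
    "uuml": ("u", "uml"),
}
SEPARATES = {pair: name for name, pair in COMPOSITES.items()}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
LETTER_THEN_DIACRITIC = re.compile(r"([A-Za-z])&([A-Za-z][A-Za-z0-9]*);")


def to_separate(text):
    """Rewrite '&aacute;' as 'a&acute;' (the diacritic follows the letter)."""
    def expand(match):
        pair = COMPOSITES.get(match.group(1))
        if pair is None:
            return match.group(0)  # leave unknown entities untouched
        letter, diacritic = pair
        return letter + "&" + diacritic + ";"
    return ENTITY_REF.sub(expand, text)


def to_composite(text):
    """Rewrite 'a&acute;' as '&aacute;' where a composite name is known."""
    def combine(match):
        name = SEPARATES.get((match.group(1), match.group(2)))
        if name is None:
            return match.group(0)
        return "&" + name + ";"
    return LETTER_THEN_DIACRITIC.sub(combine, text)
```

With this table the two functions are inverses on text that uses only the listed entities, so a system may keep composite forms internally and emit the separate forms for interchange.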
Note: This DTD assumes TEI standard settings, rather than SGML
defaults. For example, it uses tag names longer than the SGML default
NAMELEN setting.
The
The
The
The
The language name must be specified formally within the
The language name may also be specified as the content of the
Notwithstanding the choice of language, the character encoding
of the entire DOCE must conform to ISO 646. Therefore, transliteration
may be necessary, and should follow generally accepted methods for
the language in question.
The
The
The
The
The
The
The
The code is specified as a string rather than as a number, in order to
achieve a slightly greater independence from hardware-specific encoding
choices. Also, a string is permitted in preference to only a single
character, to accommodate those languages which conventionally represent
some single graphemic units by multiple characters.
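To show why string-valued codes are useful, the following Python sketch splits encoded text into graphemic units when some units are written with more than one character. The function name `split_graphemes` and the sample repertoire are hypothetical, and longest-match-first is an assumption of this sketch rather than a rule stated in the guidelines.

```python
def split_graphemes(text, codes):
    """Split text into graphemic units, preferring the longest code.

    codes is the collection of code strings declared for a repertoire;
    some of them may be several characters long.
    """
    # Try longer codes first so that "ch" is preferred over "c" + "h".
    ordered = sorted((c for c in codes if c), key=len, reverse=True)
    units = []
    position = 0
    while position < len(text):
        for code in ordered:
            if text.startswith(code, position):
                units.append(code)
                position += len(code)
                break
        else:
            raise ValueError(
                "no declared code matches at offset %d" % position
            )
    return units
```

For example, with a repertoire declaring `ch` and `ll` alongside the single letters, the string `chollo` is split into four units rather than six characters.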
The
The
This specification should not be taken to preclude the use
of multiple diacritics; it is, however, recommended that all diacritics
be either left-adjoining or right-adjoining, rather than some each
way. Also, it is recommended, though not required, that diacritics
follow their graphemes (LD), rather than precede them (DL).
The
The
The SGML character repertoire support
Entity references
&Aacute;
represents an upper-case
a
with an acute accent. Such entities have the significant
advantages that (a) they can be read on virtually
&#nn;
(see
section B.7.2 of ISO 8879(E)) are to be avoided, because they needlessly
decrease human readability and increase system-dependence.
Recommended or documented character repertoires
Recommended,
and these are recommended by TEI for use.
Latin Scripts for European Languages
Cyrillic Script
Greek Script
1
as an accent (?); doesn't have
much punctuation available; does it provide the older characters?
*
to indicate capitals. It is now common practice to forego the asterisks
(not defined here) and use mixed case.
xgi
as entity,
not to be confused with chi, which looks like
an English 'x'.>
Xgr
as entity,
not to be confused with chi, which looks like
an English 'X'.>
a/A
could associate the diacritic either way, but it is rare
enough that it has not proven to be a problem).
Hebrew
Phonetic Alphabets
Non-alphabetic symbols
Accents and diacritics
&aacute;
), and second as a&acute;
or &acute;a
).
Reference Section: Coded Character Repertoires
Declaration of Character Encoding Document Type Definition
Formal SGML DTD
Related semantic specifications
writing.scheme
Formal
Japanese_katakana_jis
and Greek_4thCenturyTablets_ZSU.
Conform
nat.language
Lgcode
Date
Standard
Type
ISO 646.
Exceptions
Grapheme
Code
Entityname
entityname
attribute specifies the name of an entity drawn
from section D.4 of ISO 8879(E), which is the entity corresponding to
the grapheme being defined. The attribute is omitted if no such entity
exists. Note that this specification is not a reference to the entity,
but only a specification of its name for human readability; therefore it
is not to be interpreted by the SGML parser, and should not be enclosed
between ampersand and semicolon, as an entity reference would be.
Diac
diacritic
attribute specifies whether the grapheme
being defined is an overstriking diacritical mark. This standard does
not provide means to specify the desired relative or absolute placement
of diacritical marks for text formatting; diacritic
merely provides
applications with the ability to determine whether an accent or other
diacritic applies to the preceding or to the following letter. The
attribute's permissible values are:
Letter, diacritic
) indicates that the diacritical
character &x9F;
would be encoded in the character
repertoire by the following sequence:
Diacritic, letter
) indicates that the diacritical
character Letter
) indicates that this character is a
full-fledged unit, diacritic
attribute need not be specified for most characters.
Join
) indicates that the character being defined
is not, indeed, a character with a graphical representation at
all, but serves as a syntactic connector between diacritics and
the graphemes to which they apply. If _
were defined as
a Join
character, the preceding example could appear in
the file in either of two ways:
JD
or DJ
types (see below)
Join, diacritic
): indicates that the character
is a diacritic, and follows the grapheme to which it applies,
but separated by a Join
character (see above). This
specification is optional, in that declaration of the join
character itself as type J
is sufficient to allow correct
interpretation. However, type JD
may be encoded for
diacritics in order to facilitate identification of the
diacritical characters themselves by processing programs.
Diacritic, join
): indicates that the character
is a diacritic, and precedes the grapheme to which it applies,
but separated by a Join
character (see above). This
specification is optional, in that declaration of the join
character itself as type J
is sufficient to allow correct
interpretation. However, type DJ
may be encoded for
diacritics in order to facilitate identification of the
diacritical characters themselves by processing programs.
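The way an application might use these type values can be sketched in Python as follows. Only the `L`, `LD` and `DL` types are handled; the join types (`J`, `JD`, `DJ`) are left out for brevity. The function name `group_units` and the sample type tables are hypothetical, and characters without a declared type are assumed to be of type `L`.

```python
def group_units(text, types):
    """Attach each diacritic to its letter.

    types maps a character to "L", "LD" (the diacritic follows its
    letter) or "DL" (the diacritic precedes its letter). Returns a list
    of (letter, diacritics) pairs.
    """
    units = []    # each entry is [letter, list of diacritics]
    pending = []  # DL diacritics still waiting for their letter
    for ch in text:
        kind = types.get(ch, "L")
        if kind == "LD":
            if not units:
                raise ValueError("diacritic %r has no preceding letter" % ch)
            units[-1][1].append(ch)
        elif kind == "DL":
            pending.append(ch)
        else:
            units.append([ch, pending])
            pending = []
    if pending:
        raise ValueError("diacritics %r have no following letter" % pending)
    return [(letter, "".join(marks)) for letter, marks in units]
```

Given a repertoire in which the apostrophe is declared as `LD`, the text `a'b` yields the letter `a` carrying one diacritic, followed by a bare `b`.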
Dname
Notes
P
Formal Names for some character sets
Formal name Description
x
being the specification number.