4 Characters and Character Sets

4.1 Principles and Definitions

Texts occur in many languages, and many texts include several languages. At one extreme are texts such as multilingual dictionaries, which systematically include elements in differing languages; at the other, basically monolingual texts with occasional words or names in other languages, such as mutatis mutandis or Wittgenstein.

The language of a text element generally corresponds to a particular character repertoire, commonly known as its writing system. The writing system dictates the appearance and the behavior of the characters it defines. For example, word and other boundaries may be determined in a language-dependent way; identical-appearing characters may sort differently in different languages; and so on. It is therefore essential that the language of document elements be explicitly indicated by markup.

Because computer systems and their display capabilities vary, the particular character repertoire(s) used to represent each language used in a document must be explicitly specified, in a manner which allows any receiver of data correctly to interpret the author's intent.

This chapter specifies the method by which a TEI-conforming text document (a) denotes changes of language or character repertoire, and (b) documents the meanings of the character codes it includes.

At this time, the specification is intended mainly for languages with writing systems which are phonetically-based (as opposed to syllabically or lexically), and which are relatively easy to render on current computer systems (as opposed to those whose orthographic units are multiply accented, modified on the basis of context, require calligraphic quality, and so on). The specification is, however, intended to be a general mechanism capable of principled extension.

In some few cases more than one writing system may be used for a single language, as with Japanese, Gothic, and diachronic variations of many other languages. In such cases, each writing system may be treated separately, as if it were a separate language for the purposes of encoding. For convenience, the term language is used in this section to refer not only to natural languages per se, but to particular writing systems for those languages. Since in many cases only one character repertoire need be used per language, these guidelines will consider it understood that in other cases files must identify not merely the language of their elements, but the particular writing system and its repertoire.

Character, Character Repertoire, (Coded) Character Set

Several basic terms need to be defined. These definitions are adapted from those in ANSI X3.4-1986, Coded Character Sets --- 7-bit American National Standard Code for Information Interchange (7-bit ASCII), pp. 7-8. We distinguish:

    character
        A unit in a writing system. This definition allows divergent interpretations of the same writing system. Most would agree that 'a' and 'A' are characters; depending on one's choices, some or all of the following might be as well: ä Ä ¨ and so on. Note also that most discussions of computer character sets distinguish graphic characters from control characters. We are here concerned only with graphic characters.
    character set or repertoire
        A set of graphic characters (that is, an unordered collection of conceptual writing units).
    coded character set or code page
        A set of unambiguous rules establishing a character set and a one-to-one relationship between the characters of the set and their assigned bit combinations.
    Character Code Table
        A table which specifies the meanings to be associated with particular numeric values for computer representation of characters. Commonly, a character code table is presented as a table, with numbers along the sides, and graphical images or other meaning indicators printed in each cell.
    Diacritic
        A mark such as an accent, not usually considered a letter, which appears over, under, or around letters.
    Document Element
        Any node of the structural hierarchy which constitutes a document; for our purposes, document elements are essentially those objects demarcated by SGML tags.
    Grapheme
        A meaningful graphemic unit of a language. In general, these include the conventional members of a language's alphabet or syllabary; they may also include other units such as diacritical marks, punctuation, and so on.

A document may include portions in any number of character sets, preferably with one coded character set corresponding to each. Although multiple repertoires may be used for a single language in a single document, this is to be avoided where practical.

In the majority of cases there will be no need for users to define new coded character sets; rather, registered and publicly known coded character sets can and should be used, simplifying the task of information transfer. Section 4.2 below provides information on the declaration and use of coded character sets, transliterations, and entity sets within the TEI encoding scheme.

Declaration of coded character sets used

All of the coded character sets which are used in a TEI-conforming document must be declared. This allows processing programs to identify the information required for proper rendering of characters. In most cases declaration will be done by referring to registered standards by name. However, a method is described below for defining a non-standard coded character set for use when necessary. Where a document departs from a standard in only a few respects, that standard can be specified, followed by a declaration of all exceptions and additions.

International and national standards

Coded character sets are defined and standardized at several levels. Formal standards are promulgated by national and international standards bodies. Less formal standards are often developed by industry standards groups. Finally, other "standards" arise by virtue of common usage, often as a consequence of design decisions by individual vendors.

Any formal coded character standard promulgated by a recognized national or international standards body may be incorporated by reference in the declaration of character encoding (DOCE, described under Recommendations below), obviating full specification of the meanings of character codes. Specifying a standard is sufficient, however, only if the file fully conforms to it. The formal standards bodies which these guidelines currently recognize include:

    ISO   International Organization for Standardization
    ANSI  American National Standards Institute
    BSI   British Standards Institution
    DIN   French...

Perhaps the best-known coded character standard in this class is ISO 646, from which ASCII and many other national standards are derived, but to which such national standards are not equivalent. ASCII, the US analogue of ISO 646, is defined in ANSI standard X3.4-1986.

ISO has also promulgated ISO 2022, Code Extension Techniques, which provides a means for defining and registering new character sets, subject to certain overall constraints. ANSI has adopted a comparable standard, ANSI X3.41-1974.

Any organization wishing to register a new coded character set which conforms to ISO 2022 may propose it. ISO does not apply a fine filter to registration; rather, registration of multiple character sets for the same or analogous purposes is permitted. Once a character set has been registered, however, there will be a standard way to refer to it using the control-character sequences described in ISO 2022, and a description of the set will be publicly available.

There are many additional standards already established, which should be consulted before creating any new ones. Others of interest to some users of these guidelines include ANSI Z39.12-1972, System for the Romanization of Arabic; ISO/IEC DP 10646, Information processing--Multiple octet coded character set; and ANSI Z39.47-1985, Extended Latin Alphabet Coded Character Set for Bibliographic Use.

It is worth noting that the names of standards are often heard in reference to character sets which only partially resemble the standards. For example, the term ASCII formally refers to one national standard, but is commonly used to designate any file which is encoded entirely with printable characters, and without binary numeric or program-specific data. If a file either (a) uses numeric character codes not assigned within the ASCII standard, or (b) uses particular assigned codes with a meaning or intent not specified in the standard, then it is non-conforming.

Industry standards bodies have also provided useful standards for character encoding. One such body is the European Computer Manufacturers Association, or ECMA.

Vendor-specific standards

De facto standards have been established by common use of particular kinds of computer hardware. Many computer vendors have established de facto character repertoires by building them into hardware. Vendor-specific character repertoires in general do not conform to national or international standards, and so require explicit declarations at least for those characters which are non-standard.

Among the more popular vendor-specific character repertoires are the following, all of which differ, and fail to conform to the requirements of ISO 2022, and so cannot be formally registered under those provisions:

    IBM PC           Also known as IBM code page 437.
    IBM PS/2         Also known as IBM code page 850.
    Apple Macintosh  Varies with font, but is fairly consistent for the natural-language fonts supplied with the machine.
    EBCDIC           This is a large family of coded character sets, which varies from country to country and from one hardware environment to another. One common form corresponds to the TN 10 print train.

If only a subset of a vendor-specific character repertoire is used, and that subset conforms to a registered standard, then the file conforms to the registered standard, and the standard may be specified without further elaboration.

For example, both the IBM PC and the Apple Macintosh intrinsic character repertoires fail to conform to ISO 646, to ASCII, or even to the ISO 2022 rules for extended character repertoires (for example, they define numeric codes 129-159 as graphical characters rather than reserving them as control codes).

Character repertoire registration

The TEI does not currently provide an organization for registration of coded character repertoires. If a coded character set is defined in these guidelines, then it is considered registered for purposes of the guidelines, and its formal name may be used without further elaboration. The reference section of these guidelines includes descriptions of several character repertoires in common use.

Registration and use of new character repertoire standards should be undertaken with great discretion, because they complicate the task of data interchange. TEI reserves the right to refuse registration of any proposed character repertoire; it is anticipated that this right will be exercised primarily when a proposed standard merely permutes an existing standard, or otherwise makes non-functional changes. The guidelines of the following section should be considered before any character repertoire is proposed for registration.

Practical Constraints

In designing either extensions to an existing standard or an entirely new standard, the prime criteria should be prevention of data loss in transfer of the character data across systems and readability of the data across systems. For local use, that is, use which does not involve transfer to different configurations of hardware or software, any agreed-upon representation will of course suffice. However, even seemingly slight differences between systems can render data unusable after transfer. Therefore, for any kind of data transfer or interchange much more rigid constraints must be applied to the choice of character encoding(s).

Data loss may easily occur when translation is performed during movement of data from one computer system to another. Users cannot safely assume that their data will not be translated; indeed, repeated translation may occur within a single transfer, as is standard practice on multi-vendor networks. Because of this, character repertoires are safe for interchange only when they avoid those characters which are not thoroughly standardized.

The most thoroughly standardized characters are those of ISO 646 IRV. Even a few characters within ISO 646 IRV are not entirely safe for data translation; in particular, the translation between EBCDIC and ISO 646 poses problems (for details see ASCII and EBCDIC Character Set and Code Issues in Systems Application Architecture, June 1989, available from SHARE Inc., Chicago, IL). The subset of ISO 646 characters which can be expected to transfer widely without data loss will here be called the ISO 646 subset.

ISO 646 IRV characters which are nevertheless dangerous, in the sense of often failing to pass through network or other transmissions unscathed, are listed in Table 4.1 below. In short, they are the national-use characters, plus exclamation point (2/01). Control characters are also commonly lost, except perhaps carriage return (0/13) and line feed (0/10). The consequent problems of information loss apply to all types of data, including text, programs, electronic mail, and others.

Of particular note in translating ISO 646 to EBCDIC are the lack of the circumflex (ISO 646 code 5/14) and the left and right square brackets (codes 5/11 and 5/13) in EBCDIC; and problems in the interpretation of tab (code 0/9). The code extension control characters specified in ISO 646 section 4.1.3 and ISO 2022 sections 6.1.6-6.1.7 are also unsafe for transfer.

Of particular note in translating EBCDIC to ISO 646 are three characters defined only in the former code: the not-sign (EBCDIC code X'5F'); the cent-sign (X'4A'); and the broken vertical bar (X'6A'). Also, brackets and braces have different codes in different variants of EBCDIC.

Table 4.1 Dangerous characters

    Character  Character description         Code  Entity Name
    !          exclamation point             2/01  excl
    #          number sign                   2/03  num
    ¤          (neutral) currency sign       2/04  curren
    [          left square bracket           5/11  lsqb
    \          backslash (reverse solidus)   5/12  bsol
    ]          right square bracket          5/13  rsqb
    ^          circumflex (inverted hacek)   5/14  circ
    `          grave accent (D.4.3.2)        6/00  grave
    {          left curly brace              7/11  lcub
    |          vertical bar                  7/12  verbar
    }          right curly brace             7/13  rcub
    ¯          overline (D.4.3.2)            7/14  macron
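The EBCDIC-only characters just mentioned can be observed with a short Python sketch. It assumes that Python's built-in cp037 and cp500 codecs are acceptable stand-ins for two EBCDIC variants; these guidelines name no particular variants, and the function name is purely illustrative.

```python
# Illustration only: cp037 and cp500 are two EBCDIC code pages shipped with
# Python; the guidelines themselves do not name specific variants.
ISO646_MAX = 0x7F


def untranslatable(ebcdic_bytes, codec="cp037"):
    """Return the characters in ebcdic_bytes that have no 7-bit equivalent."""
    text = ebcdic_bytes.decode(codec)
    return [ch for ch in text if ord(ch) > ISO646_MAX]


# X'5F', X'4A' and X'6A' are the not-sign, cent-sign and broken bar in cp037.
lost = untranslatable(b"\x5f\x4a\x6a")

# The same character can sit at different code points in different variants.
brackets_differ = "[".encode("cp037") != "[".encode("cp500")
```

A translation utility would have to decide what to do with each member of `lost`; silently dropping or substituting them is exactly the kind of data loss this section warns about.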

The remaining safe characters are as shown below. These characters are likely to survive transmission through networks, even those which include machines using national variants of ISO 646 and those using national variants of EBCDIC.

Table 4.2 ISO 646 subset

a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0 " % & ' ( ) * + , - . / ; < = > ? _ (space)

The characters reserved by SGML for its own use are, for the most part, confined to the "safe" set. The significant exception is the exclamation point, which must appear in the opening delimiter of certain SGML constructions. For safe transmission to and from EBCDIC systems, this character must be re-coded in some way.
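A check against the safe set can be sketched in a few lines of Python. The character list is copied from the table above; treating carriage return and line feed as acceptable is an assumption based on the remark that these two control characters usually survive, and the names `SAFE_CHARS` and `unsafe_positions` are illustrative only.

```python
import string

# Safe characters as listed in the table above, plus (by assumption) the two
# record-separator controls, carriage return and line feed.
SAFE_CHARS = set(
    string.ascii_letters
    + string.digits
    + "\"%&'()*+,-./;<=>?_ "
    + "\r\n"
)


def unsafe_positions(text):
    """Return (index, character) pairs for characters outside the safe subset."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in SAFE_CHARS]


# The exclamation point in an SGML declaration is flagged, as the text predicts.
problems = unsafe_positions("<!DOCTYPE doc>")
```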

One solution proposed elsewhere for ISO 646/EBCDIC transfer has been to transfer files in binary, never translating them; all SGML declarations would then be modified to re-define all the meanings of numeric values. That solution is specifically and strongly discouraged by this standard. First, binary file transfer is more complicated and fragile than text transfer, given current software. Second, untranslated files cannot be viewed or edited with standard software on the receiving systems--even many current SGML processors could not support such files. Third, maintaining multiple SGML declarations (one per version of EBCDIC, plus one for ISO 646 and one for each other code) is difficult and error-prone. And fourth, ongoing efforts to resolve differences among variants of EBCDIC and among derivations from ISO 646 would have to be duplicated.

Character repertoire conformance levels

With the concern of preventing data loss, the following character conformance levels are defined. The lower the character conformance level number, the more portable the file can be expected to be. Given the current state of computers and networks, only a file of conformance level 1 can be expected to transfer freely without data loss or change.

  1. The repertoire includes only the ISO 646 subset defined above. No control characters except carriage return (0/13) and line feed (0/10) may appear; these singly or in combination indicate record separation, and are to be interpreted under the conditions specified in ISO 8879(E), sections 7.6.1 and B.3.3. Note that even such a file may fail to transfer on at least two grounds (given the state of affairs current as of this writing):
    1. Some systems cannot correctly convert line-ends between the three common conventions of (i) line-feed only, (ii) carriage-return only, and (iii) carriage-return plus line-feed.
    2. Some systems discard data beyond some maximum line length such as 80 or 255 characters (these limits may or may not include carriage-return and/or line-feed, and some systems may measure line length incorrectly due to problem 1).

    It is important that uniform utilities become available for translating files not conforming to the ISO 646 subset into conforming versions and back, in order to facilitate file transfer. At this time, many such utilities are available, but they themselves have not been standardized; therefore the safest practice is not to attempt transmitting other characters.

  2. Identical to level 1, except that the file may include the national-use characters defined in ISO 646 and/or the exclamation point character. The meanings of these characters must be declared, either explicitly or via reference to a national or other standard; this is true even if they are used as assigned in ISO 646 IRV. A file of conformance level 2 is likely to transfer without information loss except across national boundaries. It may or may not transfer without information loss to or through systems based on character repertoires fundamentally different from ISO 646, such as EBCDIC.
  3. The repertoire uses characters with numeric codes between (decimal) 128 and 255 (the upper half). Such files will likely suffer extreme data loss on multi-vendor networks, and will appear drastically differently on the default displays of differing vendors. Information loss will occur in transfer to EBCDIC-based or strictly 7-bit hardware, making this level of conformity (or any less-conforming level) highly undesirable, even though it is commonly used, such as with the major personal computer vendors' character repertoires. Note that this level still does not permit use of the control characters (00-31) for graphical character data.
  4. The repertoire may use any or all numeric values from 0 to 255 to represent graphical characters.
  5. The repertoire may use any or all numeric values from 0 to 65535. This level applies, for example, to two-octet codes used for lexically-based orthographies such as Japanese, and for seemingly universal character repertoires, which combine many character repertoires into one.
  6. Numeric values are unbounded. One example of a character repertoire standard which goes beyond two octets is ISO 10646, which defines 32-bit characters.

At present, interchange performed according to these guidelines may use only conformance level 1, unless using fully conforming ISO 2022 data streams.
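The six levels can be read as a classification over the numeric codes a file uses. The following Python sketch takes that reading; it is not a normative test. It assumes that every 7-bit graphic character outside the ISO 646 subset counts as a national-use character or the exclamation point (level 2), and that any control code other than carriage return and line feed forces level 4. The function name is hypothetical.

```python
import string

SAFE = set(string.ascii_letters + string.digits + "\"%&'()*+,-./;<=>?_ ")
RECORD_ENDS = {0x0D, 0x0A}  # carriage return, line feed


def conformance_level(codes):
    """Return the lowest conformance level (1-6) that admits every code."""
    level = 1
    for code in codes:
        if code in RECORD_ENDS or (code < 128 and chr(code) in SAFE):
            needed = 1
        elif 32 < code < 127:
            needed = 2  # assumed: remaining 7-bit graphic characters
        elif 128 <= code <= 255:
            needed = 3  # upper half of an 8-bit code
        elif code <= 255:
            needed = 4  # control codes (0-31, 127) used as data
        elif code <= 65535:
            needed = 5  # two-octet codes
        else:
            needed = 6  # beyond two octets
        level = max(level, needed)
    return level
```

For example, `conformance_level(map(ord, "Hello, world\r\n"))` yields level 1, while adding a single exclamation point raises the result to level 2.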

Marking character repertoire and Language Shifts

Alternative possible methods

There are two major candidate methods for marking character set changes:

  1. Escape sequences defined by ISO 2022 announce the impending use of alternate character sets, and shift-in/shift-out control characters are used to mark actual changes.
  2. An SGML attribute declares, for any element on which it appears, that the element is encoded using a particular (mnemonically named) character repertoire.

Both methods are legal under these guidelines, but for general use the latter is strongly recommended. Method 1 has these advantages:

  1. The SGML standard describes how to prevent accidental recognition of SGML delimiters when using an alternate character repertoire (see section E.3 of ISO 8879-1986(E)). However, this problem arises only because of a basic dissonance between escape/control codes and descriptive markup; it compels opaque syntax, such as hiding character repertoire assignments in SGML comments (see section E.3.1.1(c)), making the SGML parser unable to determine the character repertoire in effect at any time.
  2. Any software which supports ISO 2022 but not SGML can correctly display the text characters of a document. However, little such software appears to exist at this time.
  3. This method allows character repertoire changes to occur without constraint by or relation to the SGML elements. However, this advantage is more apparent than real since a character repertoire change is in fact required before and after every element-start and element-end tag (see E.3.1.1.a). Also, few changes of language are not correlated to document elements.

The second method (using SGML attributes) is preferred and strongly recommended for the following reasons:

  1. SGML is characterized by hierarchical structures of text elements, whereas escape sequences denote only the beginnings of elements, and hence are not hierarchical. Among other advantages, hierarchical structure limits the search required to determine the character set in effect at any chosen point in a document.
  2. Attributes can be assigned default values for particular document element types. Thus, an element type can easily be defined to use a Greek character repertoire if appropriate, whereas with escape sequences this is quite complex, and may prevent use of low-end SGML parsers.
  3. Attributes marking character repertoire are easy for humans to read and interpret (even with no specialized software); escape and shift-in/shift-out characters, on the other hand, have no commonly accepted graphical representation, and the codes specified by ISO 2022 are convenient only for programs, unlike mnemonic attribute values.
  4. The characters required are only those already required for transmitting any SGML file, whereas many present-day computer networks do not transmit escape sequences without data loss.
  5. An SGML parser knows how to parse tags and attributes, and how to inform a conforming application of their presence; but such a parser has no intrinsic knowledge of ISO 2022 conventions, and so cannot fully interpret language changes.

Ad hoc sequences, such as %%font35%% or \Greek for a character set change, share most problems of escape sequences, merely using a different trigger character(s). They are to be avoided on the same grounds, plus the ground that they lack the force of a standard such as ISO 2022.

Recommendations

Definition of Document Character Set

A document may include portions in any number of character repertoires, each of which is a set of characters, which represent the meaningful graphemic units of a language. Although multiple repertoires may be used for a single language in a single document, this is to be avoided where practical. Each character repertoire has a formal name, which is used within an SGML document to indicate that portions of the document are encoded in the character repertoire. Each character repertoire is described in a file called the declaration of character encoding, or DOCE, for the character repertoire.

The DOCE must be encoded so as to be transferable across networks without loss. Therefore it must fulfill these requirements:

  1. It must be a minimal SGML document, per section 15.1.2 of ISO 8879(E), as amended.
  2. It must be encoded using only the ISO 646 subset defined above.
  3. It must conform to the SGML Document Type Definition included below in section , referred to as the DOCE DTD.

A DOCE provides information about one character repertoire, and all the characters used in it. The information about the character repertoire as a whole includes:

  1. A formal name for the character repertoire, by which document elements specify that they are encoded in it.
  2. A specification of the meaning of each character used in the character repertoire, in one of the following forms:
    1. Reference to an international, national, or TEI-registered character repertoire standard. Use of readily accessible formal published standard(s) for character repertoires is strongly encouraged.
    2. A formal declaration of each graphemic unit used in the character repertoire.
    3. Reference to a standard as in (1), with declaration of all exceptions as in (2).

The definition of each character includes at least the following:

  1. The unique code used to represent the character.
  2. Certain special properties, such as being a diacritical mark.
  3. A natural-language description of the character.
  4. If possible, reference to a standard entity which could represent the character (e.g., from ISO 8879(E)).
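The four items above suggest a simple record per character. The Python sketch below models that record and enforces the requirement that each code be unique within a repertoire. The class and field names (`Grapheme`, `Repertoire`, `diacritic`) are hypothetical and chosen for illustration; they are not part of the DOCE DTD.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grapheme:
    code: str                # the unique code representing the character
    description: str         # natural-language description
    entity_name: str = ""    # standard entity name, if one exists
    diacritic: bool = False  # one example of a "special property"


@dataclass
class Repertoire:
    formal_name: str
    graphemes: dict = field(default_factory=dict)

    def add(self, grapheme):
        """Register a grapheme, rejecting a code that is already in use."""
        if grapheme.code in self.graphemes:
            raise ValueError(f"code {grapheme.code!r} is already assigned")
        self.graphemes[grapheme.code] = grapheme


greek = Repertoire("betacode")
greek.add(Grapheme("a", "lower-case alpha", entity_name="agr"))
greek.add(Grapheme("/", "acute accent", entity_name="acute", diacritic=True))
```

Rejecting a duplicate code at registration time is a design choice of this sketch; a declaration that assigns one code to two characters cannot be decoded unambiguously.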

Some of the information which the DOCE provides can be given to an SGML processor via the CHARSET portion of the SGML declaration (see ISO 8879-1986(E), section 13.1). If any characters in the repertoire are encoded using numeric values not defined as graphical data characters by the SGML declaration in effect, then an SGML declaration must be provided which does declare them.

The DOCE, however, must also be provided, because it specifies information which is not encoded in an SGML character set declaration (although, of course, some or all of the additional information may happen to be specified in SGML comments). A DOCE could be programmatically converted to an SGML character set declaration, but not vice-versa.

Use of the attribute method (method 2)

These guidelines recommend use of an attribute called lang, which may be specified on any tag in any TEI tag set, and whose value is the name of a character repertoire. For completeness, a character repertoire must be specified for every document element. The value of the lang attribute may, however, be specified for an element in any of these ways:

  1. By including an explicit lang attribute on the start-tag for the document element instance. The value is specified as the formal reference name of the character repertoire in which the element's content is encoded, as defined via the formal attribute in the DOCE.
  2. By specifying the formal reference name as the default value for the lang attribute for the element type in the document's DTD.
  3. By not specifying any value for the lang attribute, either explicitly or by default. In this case the character repertoire is to be interpreted as being that of the containing document element.

Option 3 may not be used for the outermost (or root) document element. The methods just listed are in priority order: an explicit value for the lang attribute overrides a default value, which in turn overrides an inherited value.
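The priority order can be expressed as a short lookup. This Python sketch assumes a minimal in-memory element tree and a dictionary of DTD default values keyed by element type; the `Element` class and the `effective_lang` function are illustrative and belong to no SGML toolkit.

```python
class Element:
    def __init__(self, gi, lang=None, parent=None):
        self.gi = gi          # generic identifier (element type name)
        self.lang = lang      # explicit lang attribute, or None
        self.parent = parent  # containing element, or None for the root


def effective_lang(element, dtd_defaults):
    """Resolve lang: explicit value, then DTD default, then inheritance."""
    if element.lang is not None:
        return element.lang
    if element.gi in dtd_defaults:
        return dtd_defaults[element.gi]
    if element.parent is None:
        raise ValueError("the root element must not rely on inheritance")
    return effective_lang(element.parent, dtd_defaults)


root = Element("text", lang="English")
para = Element("p", parent=root)
word = Element("foreign", lang="Greek", parent=para)
```

Here `effective_lang(para, {})` falls back to the value inherited from `root`, while `effective_lang(word, {})` returns the explicit value on the element itself.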

For those cases where character repertoire must be changed, but where the text for which it must be changed does not constitute a document element in its own right, the foreign tag must be used, and the encoding specified by the lang attribute on that tag. See also section 6.3.6. For example, an isolated word in a particular language might be tagged as

    <p>The word <foreign lang=Greek>logos</foreign> is short.</p>

However, in many such cases a tag such as (see section 6.3.4) may be more perspicuous. In the rare case that a language change crosses the boundaries of elements, it must be encoded via multiple foreign tags, which individually do not cross boundaries. A language change does not cross the boundaries of elements when a single higher-level element dominates the entire scope of the change, but only in cases such as:

    <p>The word is <foreign lang=Greek>logos</foreign>.</p>
    <p><foreign lang=Greek>Logos</foreign> is a short word.</p>

SGML character repertoire support

Entity references

For special characters which occur infrequently (as opposed to special characters used frequently throughout a document), it is often more convenient and perspicuous to encode the characters via SGML entities. For more information, see section 3.1.7.

An entity reference is encoded as an ampersand, an entity name, and a semicolon. For example, in the default SGML entity set (documented in section D.4 of ISO 8879(E)) &Aacute; represents an upper-case A with an acute accent. Such entities have the significant advantages that (a) they can be read on virtually any display device, and (b) they are unlikely to suffer data loss during transfer between systems.

Their primary disadvantage is verbosity; this problem is relatively minor, and decreasing in significance with time. This is true because (a) more advanced computer systems handle encoding and decoding automatically (so users need never type the entire sequences), and (b) computer storage decreases continually in cost, making the overhead for storage decreasingly significant. Also, data interchange has different priorities from text editing; the two need not use the same representations.

Standard entity sets are available for encoding many accented or non-Latinate characters, for many printer's symbols, mathematical symbols, and other purposes. These are best used when:

  1. relatively little text is to be encoded using the special characters;
  2. the special text is fragmentary, not constituting primary document elements; or
  3. the special text is not representative of a natural language (e.g., isolated special symbols).

When special characters are to be encoded as entities, public entity sets should be used wherever possible. Section D.4 of ISO 8879(E) defines a wide variety of letters, accented letters, and special symbols, and should be consulted before any new special character entities are defined. When the needed symbols are defined in D.4, the name used there should be used in preference to others.

Numeric character entity references, of the form &#nn; (see section B.7.2 of ISO 8879(E)) are to be avoided, because they needlessly decrease human readability and increase system-dependence.
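As a sketch of the practice recommended here, the following Python function replaces special characters with named entity references and refuses to fall back on numeric references. The four entity names are taken from the ISO 8879 public entity sets; the table is deliberately tiny, and the function name is hypothetical.

```python
# A few names from the ISO 8879 public entity sets; extend as needed.
ENTITY_NAMES = {
    "Á": "Aacute",
    "é": "eacute",
    "ü": "uuml",
    "ç": "ccedil",
}


def to_entity_references(text):
    """Replace non-ASCII characters with named SGML entity references."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ENTITY_NAMES:
            out.append("&" + ENTITY_NAMES[ch] + ";")
        else:
            # No numeric fallback: the guidelines advise against &#nn; forms.
            raise KeyError(f"no named entity known for {ch!r}")
    return "".join(out)


encoded = to_entity_references("façade")
```

Raising an error for an unmapped character, rather than emitting a numeric reference, keeps the output readable and forces the encoder to choose or declare a named entity.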

A selection of commonly used entities is included in the reference section of these guidelines.

Recommended or documented character repertoires

The following sections provide information about many of the character repertoires commonly used to encode data in several languages. The documentation of a repertoire here does not constitute recommendation or approval; rather, such documentation is provided for the convenience of users. However, some of the repertoires shown are marked Recommended, and these are recommended by TEI for use.

Latin Scripts for European Languages

Eastern Europe may be incomplete in June 1990.

Recommend

Document The character encoding subcommittee at the Oxford meeting of TEI-REP recommended including descriptions of the IBM PC, PS/2, and Macintosh 8-bit sets. It was also suggested that PostScript specifications be consulted, as to whether they have a recommended assignment of chars 128-255 which should also be documented here.

Cyrillic Script

Recommend ISO 8859/5 (though incomplete).

Document The character encoding subcommittee at the Oxford meeting of TEI-REP recommended consulting David Birnbaum on this.

Greek Script

Recommend Beta-code?

Problems Accents follow lower-case letters, but precede upper-case; officially, marks upper case via asterisk prefix, though few still do this; uses 1 as an accent (?); doesn't have much punctuation available; does it provide the older characters?

Document SuperGreek, SMK GreekKeys?

<! CDATA
<doce formal=<q>betacode</q> conform=1>
<naturallg lgcode=EN>English</naturallg>
<date>March 15, 1990</date>
<standard type=NONE> The beta-code encoding for Greek is widely used by Biblical and Classical scholars for ancient texts. Its original form used only upper-case letters for transliteration, with prefixed asterisks <q>*</q> to indicate capitals. It is now common practice to forego the asterisks (not defined here) and use mixed case. </standard>
<exceptions>
<grapheme code='a' entityname="agr"> <dname>lower-case alpha</></>
<grapheme code='b' entityname="bgr"> <dname>lower-case beta</></>
<grapheme code='g' entityname="ggr"> <dname>lower-case gamma</></>
<grapheme code='d' entityname="dgr"> <dname>lower-case delta</></>
<grapheme code='e' entityname="egr"> <dname>lower-case epsilon</></>
<grapheme code='z' entityname="zgr"> <dname>lower-case zeta</></>
<grapheme code='h' entityname="eegr"> <dname>lower-case eta</></>
<grapheme code='q' entityname="thgr"> <dname>lower-case theta</></>
<grapheme code='i' entityname="igr"> <dname>lower-case iota</></>
<grapheme code='k' entityname="kgr"> <dname>lower-case kappa</></>
<grapheme code='l' entityname="lgr"> <dname>lower-case lambda</></>
<grapheme code='m' entityname="mgr"> <dname>lower-case mu</></>
<grapheme code='n' entityname="ngr"> <dname>lower-case nu</></>
<grapheme code='c' entityname="xgr"> <dname>lower-case ksi</></>
<descr>Note that SGML uses <q>xgr</q> as entity, not to be confused with chi, which looks like an English 'x'.</>
<grapheme code='o' entityname="ogr"> <dname>lower-case omicron</></>
<grapheme code='p' entityname="pgr"> <dname>lower-case pi</></>
<grapheme code='r' entityname="rgr"> <dname>lower-case rho</></>
<grapheme code='s' entityname="sgr"> <dname>lower-case sigma</></>
<grapheme code='t' entityname="tgr"> <dname>lower-case tau</></>
<grapheme code='u' entityname="ugr"> <dname>lower-case upsilon</></>
<grapheme code='f' entityname="phgr"> <dname>lower-case phi</></>
<grapheme code='x' entityname="khgr"> <dname>lower-case chi (shape of Eng 'x')</></>
<grapheme code='y' entityname="psgr"> <dname>lower-case psi</></>
<grapheme code='w' entityname="ohgr"> <dname>lower-case omega</></>
<grapheme code='A' entityname="Agr"> <dname>upper-case alpha</></>
<grapheme code='B' entityname="Bgr"> <dname>upper-case beta</></>
<grapheme code='G' entityname="Ggr"> <dname>upper-case gamma</></>
<grapheme code='D' entityname="Dgr"> <dname>upper-case delta</></>
<grapheme code='E' entityname="Egr"> <dname>upper-case epsilon</></>
<grapheme code='Z' entityname="Zgr"> <dname>upper-case zeta</></>
<grapheme code='H' entityname="EEgr"> <dname>upper-case eta</></>
<grapheme code='Q' entityname="THgr"> <dname>upper-case theta</></>
<grapheme code='I' entityname="Igr"> <dname>upper-case iota</></>
<grapheme code='K' entityname="Kgr"> <dname>upper-case kappa</></>
<grapheme code='L' entityname="Lgr"> <dname>upper-case lambda</></>
<grapheme code='M' entityname="Mgr"> <dname>upper-case mu</></>
<grapheme code='N' entityname="Ngr"> <dname>upper-case nu</></>
<grapheme code='C' entityname="Xgr"> <dname>upper-case ksi</></>
<descr>Note that SGML uses <q>Xgr</q> as entity, not to be confused with chi, which looks like an English 'X'.</>
<grapheme code='O' entityname="Ogr"> <dname>upper-case omicron</></>
<grapheme code='P' entityname="Pgr"> <dname>upper-case pi</></>
<grapheme code='R' entityname="Rgr"> <dname>upper-case rho</></>
<grapheme code='S' entityname="Sgr"> <dname>upper-case sigma</></>
<grapheme code='T' entityname="Tgr"> <dname>upper-case tau</></>
<grapheme code='U' entityname="Ugr"> <dname>upper-case upsilon</></>
<grapheme code='F' entityname="PHgr"> <dname>upper-case phi</></>
<grapheme code='X' entityname="KHgr"> <dname>upper-case chi (shape of Eng 'X')</></>
<grapheme code='Y' entityname="PSgr"> <dname>upper-case psi</></>
<grapheme code='W' entityname="OHgr"> <dname>upper-case omega</></>
<grapheme code='j' entityname="sfgr"> <dname>lower-case word-final sigma</></>
<grapheme code='' entityname="slgr"> <dname>lower-case lunate sigma</></>
<grapheme code='j' entityname="diggr"> <dname>lower-case digamma</></>
<grapheme code='j' entityname="kopgr"> <dname>lower-case koppa</></>
<grapheme code='/' entityname="acute" diac=LD> <dname>acute accent</></>
<grapheme code='&backslash;' entityname="grave" diac=LD> <dname>grave accent</></>
<grapheme code='=' entityname="circ" diac=LD> <dname>circumflex accent</></>
<grapheme code='' entityname="die" diac=LD> <dname>dieresis</></>
<grapheme code='1' entityname="" diac=LD> <dname>iota-subscript</></>
<grapheme code=')' entityname="" diac=LD> <dname>smooth breathing</></>
<grapheme code='(' entityname="" diac=LD> <dname>rough breathing</></>
</exceptions>
<notes>This repertoire does not provide for digits or punctuation. In cases of multiple diacritics, they may occur in any order, but always come after their letter, except for capitals, in which case they come before (this introduces a potential ambiguity, where, for example, <q>a/A</q> could associate the diacritic either way, but it is rare enough that it has not proven to be a problem). </notes>
</doce>
>
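As an informal illustration of how a receiver might apply such a declaration, the following Python sketch maps the lower-case letter codes and the unambiguous diacritic codes from the declaration above onto Greek letters. The function name decode_beta and the choice of Unicode combining marks as output are assumptions made for this sketch; they are not part of the declaration. Diacritics are kept as separate characters following their letter, in line with the separate encoding this chapter prefers.

```python
# Lower-case letter codes taken from the beta-code declaration (code -> Greek letter).
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Diacritic codes declared as following their letter (diac=LD).
# Assumption: each is rendered as a Unicode combining mark.
DIACRITICS = {
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex accent
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
}


def decode_beta(text):
    """Translate lower-case beta-code into Greek letters plus separate diacritics.

    Characters not covered by the two tables pass through unchanged.
    """
    out = []
    for ch in text:
        if ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Capitals (whose diacritics precede the letter), word-final sigma, and the other graphemes of the declaration are deliberately left out of this sketch.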

Hebrew

Recommend Michigan-Claremont?

Phonetic Alphabets

Recommend

Document

Phonetic alphabets tend to be very large; some require novel compositions of characters, while some require very large sets of symbols. Also, phonetic representations are not entirely standardized, and are most commonly used for short texts. For these and other reasons, it is recommended that phonetic alphabets generally be encoded via SGML entity references, where possible using the entities defined in section D.4 of ISO 8879-1986(E). Necessary additional entities should be created using the same naming conventions, and clearly documented. Separate encoding of diacritics should be used in preference to composite encoding (see below), except where points of linguistic theory or argumentation dictate otherwise.

Non-alphabetic symbols

*Recommend

Document

Accents and diacritics

ISO 8879(E) provides entity sets which allow for at least two different encodings for accented letters: first as composite characters (e.g., á), and second as separate sequences of letter and diacritic (e.g., either a´ or ´a).

For most languages, pedagogical texts, grammars, and other descriptions present diacritics as orthographic units in their own right, which are composed with others. For example, Greek textbooks treat diacritics as named units which accompany letters, rather than asserting that Greek has an alphabet of several hundred (composite) characters. Thus it is generally more natural to encode diacritics separately from the characters they modify or otherwise adjoin.

This allows much smaller character repertoires: about 60 characters for Greek, as opposed to about 200 if composite characters are used (plus digits, punctuation, any reserved codes, ...). It also facilitates consistency in font design, and accommodates software and hardware systems limited to 7 or 8 bit codes (i.e., most current systems). The main disadvantage of separate encoding for diacritics is that some computer systems do not support over-striking; such systems are becoming rarer, however, and even they can display accents following letters, which is at least legible, even if unaesthetic.

Both composite and sequential representations for diacritics are comparably suitable for sorting and searching. In one, whole characters may need to be ignored; in the other, characters must be re-mapped to other values. Neither is convenient, but some such process is required for adequate natural language text processing. The requirements of specific interpretive processing, including sorting, are specifically beyond the scope of this standard.
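The two kinds of processing just described can be sketched informally in Python. The sketch assumes Unicode text and the standard unicodedata module, neither of which is part of this standard, and the function names are hypothetical. With separate encoding the diacritic characters are simply ignored; with composite encoding each accented character is first re-mapped (here by decomposition) to its base letter.

```python
import unicodedata


def key_from_separate(text):
    """Separate encoding: ignore every character that is a combining diacritic."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def key_from_composite(text):
    """Composite encoding: re-map accented characters to base letters.

    Decomposition (NFD) splits each composite character into letter plus
    diacritic, after which the diacritics can be ignored as above.
    """
    return key_from_separate(unicodedata.normalize("NFD", text))


# The same word in both encodings yields the same sort/search key.
composite_word = "r\u00e9sum\u00e9"      # composite e-acute characters
separate_word = "re\u0301sume\u0301"     # letter followed by combining acute
same_key = key_from_composite(composite_word) == key_from_separate(separate_word)

# Sorting a small list while disregarding accents.
words = ["rose", "r\u00e9sum\u00e9", "rat"]
ordered = sorted(words, key=key_from_composite)
```

Whether accents should be disregarded at all when sorting depends on the language concerned; the sketch shows only that either representation can be reduced to the same key.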

It is worth noting again that this standard does not constrain internal representation, keyboarding decisions, and the like; rather, it is an interchange standard, designed to facilitate transfer and re-use of data by as wide a range of users as possible. Thus, users can type, store, or process composite characters on systems where this is necessary or preferable, yet provide the more perspicuous separate characters for interchange. This is to be encouraged, as is the development of software to make representational transformations entirely convenient.

Reference Section: Coded Character Repertoires

Declaration of Character Encoding Document Type Definition

Note: This DTD assumes TEI standard settings, rather than SGML defaults. For example, it uses tag names longer than the SGML default NAMELEN setting.

Formal SGML DTD

<!ELEMENT writing.scheme - - (nat.language, date.of.specification, standard, exceptions?, notes?) >
<!ATTLIST writing.scheme
          reference.name     NAME    #REQUIRED
          conformance.level  NUMBER  #REQUIRED >
<!ELEMENT nat.language - - (#PCDATA) >
<!ATTLIST nat.language
          language.code      NAME    #REQUIRED >
<!ELEMENT date.of.specification - - (#PCDATA) >
<!ELEMENT standard - - (#PCDATA) >
<!ATTLIST standard
          type               (ISO | National | TEI)  #REQUIRED >
<!ELEMENT exceptions - - (grapheme*) >
<!ELEMENT grapheme - - (dname, descr?) >
<!ATTLIST grapheme
          code               CDATA   #REQUIRED
          entityname         NAME    #IMPLIED
          diacritic          (DL | LD | L | J | JD | DJ)  L >
<!ELEMENT dname - - (#PCDATA) >
<!ELEMENT descr - - (#PCDATA) >
<!ELEMENT notes - - ((#PCDATA | p)+) >
<!ELEMENT p - - ((#PCDATA)+) >

Related semantic specifications

writing.scheme

The writing.scheme element is the root element for a TEI declaration of character encoding.

Formal

The reference.name attribute specifies the formal reference name which elements using the character repertoire must encode as the value of their lg attribute, whenever an element is encoded in a different character repertoire from the containing document element. The formal name should express (a) the natural or other language which the repertoire is intended for encoding; (b) the particular writing system of that language, if more than one is prevalent; and (c) the particular encoding of the writing system, if more than one has been declared. Two perhaps overly verbose examples are Japanese_katakana_jis and Greek_4thCenturyTablets_ZSU.
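The three-part structure of a formal name can be illustrated with a small Python sketch. The helper name make_reference_name and the use of an underscore as separator are assumptions drawn only from the two examples just given; the standard does not prescribe a construction procedure.

```python
def make_reference_name(language, writing_system=None, encoding=None):
    """Join the parts of a formal name, omitting those that are not needed.

    language        (a) the language the repertoire is intended for
    writing_system  (b) the writing system, if more than one is prevalent
    encoding        (c) the particular encoding, if more than one is declared
    """
    parts = [language, writing_system, encoding]
    return "_".join(part for part in parts if part)
```

For instance, make_reference_name("Japanese", "katakana", "jis") gives Japanese_katakana_jis, while a language with a single writing system needs only the language and encoding parts.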

Conform

The conformance.level attribute specifies the conformance level of the character repertoire, which depends upon the range of numeric character codes included.

nat.language

The nat.language tag contains the name of the natural (human) language in which the contents of the dname and descr elements are written.

The language name must be specified formally within the language.code attribute (see below), using a two-letter code drawn from the set specified by ISO 639, as amended.

The language name may also be specified as the content of the nat.language tag. In this case, it must be specified in at least one of English, French, German, Greek, Italian, or Spanish (transliterated as necessary); the language name may also be provided in additional languages at the author's discretion.

Notwithstanding the choice of language, the character encoding of the entire DOCE must conform to ISO 646. Therefore, transliteration may be necessary, and should follow generally accepted methods for the language in question.

Lgcode

The language.code attribute specifies the formal two-letter code for the language in which the DOCE's natural language portions are written. Creating DOCEs in languages not assigned formal codes is to be avoided, since these are generally languages understood by very few of the potential readers of a DOCE file.

Date

The date.of.specification tag specifies the date on which the DOCE was last changed, and is intended for human interpretation, as a means of ascertaining whether a DOCE is up-to-date.

Standard

The standard tag specifies whether the character repertoire being defined by the DOCE is a public standard (perhaps with defined exceptions), or is fully defined in the DOCE. Its text content may be used to provide an explanation of the origins and range of use of the encoding, if it is a de facto but unregistered standard.

Type

The type attribute of the standard tag specifies what kind of declaration follows. Its values have the following meanings:

Exceptions

The exceptions tag encloses all definitions for orthographic units which are not incorporated without change from a specified standard. If no standard is used as a basis, then all characters in the repertoire must be defined within the exceptions element.

Grapheme

The grapheme tag encloses contents which fully define a single meaningful graphemic unit of the character repertoire.

Code

The code attribute specifies, as a string, the character value or sequence of character values which is being defined.

The code is specified as a string rather than as a number, in order to achieve a slightly greater independence from hardware-specific encoding choices. Also, a string is permitted in preference to only a single character, to accommodate those languages which conventionally represent some single graphemic units by multiple characters.
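Because a code may be a string of several characters, a receiver has to decide where one graphemic unit ends and the next begins. One simple approach, sketched informally below in Python, is to prefer the longest declared code at each position. The function name tokenize and the sample codes are hypothetical; the standard itself does not prescribe a matching strategy.

```python
def tokenize(text, codes):
    """Split text into graphemic units, preferring the longest declared code.

    codes is the collection of code strings declared for the repertoire.
    Characters matching no declared code are passed through one at a time.
    """
    by_length = sorted(codes, key=len, reverse=True)
    units = []
    i = 0
    while i < len(text):
        for code in by_length:
            # Skip empty codes so the loop always advances.
            if code and text.startswith(code, i):
                units.append(code)
                i += len(code)
                break
        else:
            units.append(text[i])
            i += 1
    return units
```

With a hypothetical repertoire declaring both "ch" and "c", tokenize("chico", ["ch", "c", "i", "o"]) treats the initial "ch" as one unit and the later "c" as another.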

Entityname

The entityname attribute specifies the name of an entity drawn from section D.4 of ISO 8879(E), which is the entity corresponding to the grapheme being defined. The attribute is omitted if no such entity exists. Note that this specification is not a reference to the entity, but only a specification of its name for human readability; therefore it is not to be interpreted by the SGML parser, and should not be enclosed between ampersand and semicolon, as an entity reference would be.

Diac

The diacritic attribute specifies whether the grapheme being defined is an overstriking diacritical mark. This standard does not provide means to specify the desired relative or absolute placement of diacritical marks for text formatting; diacritic merely provides applications with the ability to determine whether an accent or other diacritic applies to the preceding or to the following letter. The attribute's permissible values are:

This specification should not be taken to preclude the use of multiple diacritics; it is, however, recommended that all diacritics be either left-adjoining or right-adjoining, rather than some each way. Also, it is recommended, though not required, that diacritics follow their graphemes (LD), rather than precede them (DL).
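How an application might use this information to associate diacritics with letters can be sketched informally in Python. Only the two values named in the text are handled: LD (the diacritic follows its letter) and DL (the diacritic precedes its letter). The function name attach_diacritics and the data layout are assumptions made for this sketch.

```python
def attach_diacritics(units, diacritics):
    """Group graphemic units into (letter, [diacritics]) pairs.

    units       sequence of graphemic units in text order
    diacritics  mapping from a diacritic's code to "LD" or "DL";
                any unit not in the mapping is treated as a letter
    """
    groups = []
    pending = []  # DL diacritics seen but not yet attached to a letter
    for unit in units:
        kind = diacritics.get(unit)
        if kind == "LD":
            if not groups:
                raise ValueError("diacritic %r has no preceding letter" % unit)
            groups[-1][1].append(unit)
        elif kind == "DL":
            pending.append(unit)
        else:
            groups.append((unit, pending))
            pending = []
    if pending:
        raise ValueError("diacritics %r have no following letter" % pending)
    return groups
```

A repertoire that mixes LD and DL diacritics can produce sequences in which a diacritic between two letters could attach either way, which is one reason for the recommendation above that all diacritics adjoin in the same direction.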

Dname

The dname tag encloses a brief descriptive name for the unit, preferably including the name by which speakers of the language in question refer to it. Terse or unclear descriptions are to be avoided.

Notes

The notes tag may be used at the author's discretion to encode any explanatory information deemed necessary or helpful.

P

The p tag merely separates paragraphs within the notes element. It may in certain cases seem useful to give notes an arbitrarily complex internal structure, but in the interest of simplicity such features are foregone.

Formal Names for some character sets

Formal name     Description
IRV             ISO 646 International Reference Version
Latin_IBMPC     IBM PC (not more recent models) intrinsic
Latin_IBMPS2    IBM PS/2 intrinsic
8859/x          Registered repertoires specified by ISO 8859, with x being the specification number.
Latin_MAC       Apple Macintosh, default fonts
Greek_Beta      Greek beta-code (but mixed-case allowed)