MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 2. Version 0.1. Last modified 8 December 1995.
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
The CES uses the "reference concrete syntax" of SGML, which specifies that tags are delimited by the characters "<" and ">" and contain the name of the element (its gi, for generic identifier). In end tags, the gi is preceded by "/". The gi may consist of upper- and lower-case letters and the digits 0-9.
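As a minimal illustration of this syntax, the fragment below marks up one paragraph. The element name p is used only as an example of a gi, and the text content is invented.

```sgml
<!-- start tag <p>, content, end tag </p>; "p" is the gi -->
<p>This is a paragraph of running text.</p>
```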
In its canonical form (UCS-4), UCS encodes each character in four bytes, thus providing a single character set to encode all the world's languages.
However:
For corpora intended for use in language engineering applications, much interchange will be accomplished via CD-ROM or ftp. Ftp allows binary interchange and can be used to safely transmit any 8-bit character set. Moreover, data interchange is becoming increasingly reliable, due to major international standardization efforts such as those surrounding the Internet. For example, TCP/IP and many network applications (e.g., ftp and WWW) are "8-bit clean". In addition, recent standards have been proposed to guarantee delivery by automatically packing and unpacking data as required:
Even when these standards are not yet implemented, files can be safely transferred by using universally available encoding programs such as 'uuencode'. Therefore, we recommend that all data be distributed using the recommendations below for character sets. In the case of blind interchange, data should be encoded using 'uuencode'.
Our recommendation has the merit of being reasonably compatible with UCS, thus facilitating future migration to that standard.
The CES recommendations have been adopted by the EAGLES Tool subgroup for its Guidelines for Linguistic Software Development; see especially Part 1-1: Characters.
The following is a rough list of the languages accommodated in the ISO 8859 series. See also the graphic representation of the code tables.
A list of characters used by a large number of languages is provided in "Characters and character sets for various languages" (Alvestrand, 1995).
See also "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995).
Shortcomings of the ISO 8859 series
The ISO 8859 series lacks the Dutch ligature ij, the French ligature oe, and „German“ quotation marks, as well as several other characters.
Some Bulgarian and Ukrainian characters are also missing from ISO 8859-5.
[THIS SECTION IS UNDER DEVELOPMENT]
The recommendations above do not provide for Asian languages, including Chinese, Japanese, and Korean. Independent standards have been developed for these languages. The CES specifications for these cases are under development.
If it is necessary to encode a text in a language not covered by the ISO 8859-X series, it is required to use
Note that the TEI provides several pre-defined Writing System Declarations, including:
We recommend the use of ISO entities. Standard entity names can be declared by reference to a standard public entity set, e.g.,
<!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN">
%ISOLat2;
<!ENTITY % ISOGrk1 PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN">
%ISOGrk1;
<!ENTITY % ISOGrk2 PUBLIC "ISO 8879-1986//ENTITIES Monotoniko Greek//EN">
%ISOGrk2;
<!ENTITY % ISOCyr1 PUBLIC "ISO 8879-1986//ENTITIES Russian Cyrillic//EN">
%ISOCyr1;
<!ENTITY % ISOCyr2 PUBLIC "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN">
%ISOCyr2;
Many of the characters that commonly need to be represented are included in the ISO entity sets ISOpub and ISOnum. These sets include, for example, entities for the special characters "&" and "<" (&amp; and &lt;), which are part of the SGML markup syntax and cannot be entered as literal text in an SGML document. They also contain entities such as &mdash; (for the dash the width of an "m"), &pound; (for British sterling), etc. The ISOpub and ISOnum entity sets are declared as follows:
<!ENTITY % ISOpub PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN">
%ISOpub;
<!ENTITY % ISOnum PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN">
%ISOnum;
Note that these entity sets are declared in all the CES DTDs.
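Once these sets are declared, the characters are written in running text as entity references, that is, the entity name between "&" and ";". The following is a short sketch only: the sentence is invented and the p element is used purely for illustration.

```sgml
<!-- amp, lt and pound come from ISOnum; mdash comes from ISOpub -->
<p>Smith &amp; Sons &mdash; prices from &pound;5, for x &lt; 10.</p>
```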
If no standard entity name exists or a standard entity is to be renamed, normal SGML syntax can be used to declare an appropriate entity, as follows:
Declarations of entities and entity sets not already included in the DTD for the document are added at the top of the encoded document, as in this example:
<!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN" [
<!ENTITY igcy "i`" --=small i grave, Cyrillic -->
<!ENTITY Igcy "I`" --=capital I grave, Cyrillic -->
<!ENTITY % ISOcyr1 PUBLIC
"ISO 8879-1986//ENTITIES Russian Cyrillic//EN" >
%ISOcyr1;
<!ENTITY % ISOcyr2 PUBLIC
"ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN" >
%ISOcyr2;
]>
<cesDoc>
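Entities declared in this way are then referenced in the text like any standard entity. A minimal sketch, assuming a p element somewhere in the document body (the content is invented, and the elements required between the document element and the text are omitted):

```sgml
<!-- &igcy; and &Igcy; resolve to the replacement text declared in the subset above -->
<p>Lower case: &igcy; Upper case: &Igcy;</p>
```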
Notes:
These implicit methods are useful when there is a systematic mapping between tags and character sets (e.g., a list of words in one character set, with their translations in another).
The CES provides global lang and wsd attributes, as well as appropriate mechanisms to document correspondences between languages or tags with particular character sets in the CES header.
Note that the language tagging mechanism will still be valid with UCS. "Unicode characters do not specify the language of the text they represent; that is, they are completely language neutral. If the language of a character or character string must be known to accomplish a particular type of process (e.g. language sensitive collation), then a higher-level protocol must be used to specify the language." [from Unicode's "Basic Principles"].
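To make the use of the global lang attribute concrete, here is a hedged sketch of a word paired with its translation. The values "en" and "fr" are hypothetical: they are assumed to identify languages documented in the CES header, and the exact form of permitted values is not specified in this section.

```sgml
<!-- hypothetical lang values; each is assumed to be documented in the header -->
<p lang="en">house</p>
<p lang="fr">maison</p>
```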
[THIS SECTION IS UNDER DEVELOPMENT]
The TEI provides a pre-defined Writing System Declaration (WSD) for transcribing the International Phonetic Alphabet. This is distributed by the TEI both as an SGML entity set and as a TEI Writing System Declaration documenting the entity set:
-//TEI P3: 1994//ENTITIES International Phonetic Alphabet//EN
The CES recommends using the SGML entities and providing the TEI WSD (with a reference to it in the <wsdUsage> element in the header) when the IPA system is used in a document.