MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 0. Version 0.2. Last modified 20 December 1995.
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
The MULTEXT project and the EAGLES subgroup on Text Representation have joined efforts to develop a Corpus Encoding Standard (CES) optimally suited for use in language engineering, which can serve as a widely accepted set of encoding standards for European corpus work. The overall goal is the identification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). The CES also provides encoding conventions for more extensive encoding and for linguistic annotation.
The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language). It is based on and in broad agreement with the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The TEI Guidelines were expressly designed to be applicable across a broad range of applications and disciplines; they therefore not only treat a vast array of textual phenomena but are also designed for maximum generality and flexibility. The CES, on the other hand, treats a specific domain and set of applications, and can therefore be more restrictive and prescriptive in its specifications. In addition, because the TEI is not complete, there are some areas of importance for corpus encoding that the TEI Guidelines do not cover. Therefore, the first major task in developing the CES involved evaluating, adapting, selecting from, and in some cases extending the TEI Guidelines to meet the specific needs of corpus-based work in language engineering.
The CES has also been developed taking into account several practical realities surrounding the encoding of corpora intended for use in language engineering research and applications. In particular, at the present time and for the foreseeable future, corpora for language engineering will be adapted from legacy data, that is, pre-existing electronic data encoded in some arbitrary format (typically word-processor, typesetter, or similar formats intended for printing). The vast quantities of data involved and the difficulty (and cost) of translating them into usable formats imply that the CES must be designed so that this translation does not require prohibitively large amounts of manual intervention to achieve minimum conformance to the standard. However, the markup that would be most desirable for the linguist is not achievable by automatic means. Therefore, a major feature of the CES is the provision for a series of increasingly refined encodings of text, beyond the minimum requirements.
Due to the need for massive amounts of data, many corpora intended for use in language engineering applications are currently being created. Electronic texts are obtained by
Corpora are used in language engineering to gather real language evidence, both qualitative and quantitative. Qualitative evidence consists of examples which can be used for the construction of computational lexicons, grammars, and multi-lingual lexicons and term banks, for lexicography, etc. Quantitative information consists of statistics which indicate frequent or characteristic uses of language. These statistics can also be used to guide preference-based parsers, assist in lexicography, determine translation equivalents, etc. In addition, statistics can be used to drive morphological taggers, POS taggers, alignment programs, sense taggers, etc. Common operations on corpora for the purposes of language engineering include extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, generation of lists of linguistic elements, etc.; and the determination of statistics such as frequency information, averages, mutual information scores, etc.
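Three of the operations named above (frequency counts, concordance generation, and mutual information scores) can be illustrated with a short Python sketch. It is not part of the CES: the function names, the whitespace tokenizer, and the sample sentence are assumptions made for illustration only, and the mutual information score is computed here as pointwise mutual information over adjacent word pairs, which is one common choice among several.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return text.lower().split()


def frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_xy = joint / (n - 1)    # there are n - 1 adjacent pairs
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical sample text, used only to exercise the functions.
tokens = tokenize("the cat sat on the mat the cat ran")
counts = frequencies(tokens)
kwic = concordance(tokens, "cat")
score = pmi(tokens, "the", "cat")
```

On real corpora the same operations would normally run over the encoded text elements rather than a whitespace-split string, which is where consistent markup becomes important.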
We do not address corpora intended for other applications, such as stylistic studies, socio-linguistics, historical studies, information retrieval, etc., although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering). Treating a restricted domain enables development of a standard tighter than that of the TEI, by providing specific encoding solutions rather than general or multiple ones, and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).
The CES also covers encoding conventions for linguistic annotation of text and speech, including morphosyntactic tagging, parallel text alignment, prosody, phonetic transcription, etc.
The CES is intended to cover those areas of corpus encoding on which there exists consensus among the language engineering community, or on which consensus can be easily achieved. Areas where no consensus can be reached (for example, sense tagging) are not treated at this time.
The CES provides a Document Type Definition (DTD) that can be used for various levels of primary data encoding. The first is the minimum encoding level required to make the corpus (re)usable across all possible language engineering applications. Succeeding levels provide for increasing enhancement in the amount of encoded information and increasing precision in the identification of text elements. Automatic methods to achieve markup at each level are for the most part increasingly complex, and therefore more costly; the sequence is designed to accommodate a series of increasingly information-rich instantiations of the text at a minimum of cost.
Linguistic annotation is maintained in SGML documents separate from the primary data, to which it is associated by hyper-links. The CES includes a series of DTDs for documents containing the different kinds of annotation information.
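The idea of keeping annotation apart from the primary data and linking back to it can be sketched in Python as follows. This is an illustration of the general stand-off principle only, not of the CES linking mechanism or its DTDs: the identifiers (`w1`, `w2`, ...), the field names `target` and `pos`, and the tag values are all hypothetical.

```python
# Primary data: tokens keyed by identifier. The ids are hypothetical.
primary = {"w1": "The", "w2": "cat", "w3": "sleeps"}

# Annotation stored separately; each record points at a primary id.
# The "target" and "pos" field names and the tag values are assumptions.
annotations = [
    {"target": "w1", "pos": "DET"},
    {"target": "w2", "pos": "NOUN"},
    {"target": "w3", "pos": "VERB"},
    {"target": "w9", "pos": "NOUN"},  # dangling link, for illustration
]


def resolve(primary, annotations):
    """Follow each link; return (token, tag) pairs and unresolved targets."""
    resolved, dangling = [], []
    for record in annotations:
        token = primary.get(record["target"])
        if token is None:
            dangling.append(record["target"])
        else:
            resolved.append((token, record["pos"]))
    return resolved, dangling


resolved, dangling = resolve(primary, annotations)
```

Because the primary data is never modified, several independent annotation documents can point at the same text, and a check for dangling targets such as the one above is a natural validation step.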
We recognize that changes in the specifications present problems for those who have previously implemented the standard. To alleviate this problem, we have adopted the following development strategy:
All current CES documents and DTDs will continue to be available at
<URL: http://www.cs.vassar.edu/CES/>
Anyone actively implementing the standard should consult this site regularly.
In developing the CES we have looked at the work of other TEI-based corpus applications, including in particular the British National Corpus Project and the English-Norwegian Parallel Corpus Project. The various modifications of the TEI which have been developed by these groups and independently in the CES are often very similar, and it is at times difficult to know where an idea or strategy originated. We would therefore like to offer here a general acknowledgement of the work of these other projects and their influence on the CES.
We welcome and encourage user input concerning the CES.