CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 0. Version 0.2. Last modified 20 December 1995.






CES Part 0. Introduction




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.





0.1. Background

The language engineering community has recently revived its interest in the use of empirical methods, thus creating a demand for large-scale corpora. Numerous data-gathering efforts exist on both sides of the Atlantic to provide widespread access to both mono- and bi-lingual resources of sufficient size and coverage for data-oriented work, including the U.S. Linguistic Data Consortium, the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), and, recently, the European Language Resources Association (ELRA). The rapid multiplication of such efforts has made it critical for the language engineering community to create a set of standards for encoding corpora.

The MULTEXT project and the EAGLES subgroup on Text Representation have joined efforts to develop a Corpus Encoding Standard (CES) optimally suited for use in language engineering, which can serve as a widely accepted set of encoding standards for European corpus work. The overall goal is the identification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding conventions for more extensive encoding and for linguistic annotation.

The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language). It is based on and in broad agreement with the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The TEI Guidelines were expressly designed to be applicable across a broad range of applications and disciplines; they therefore not only treat a vast array of textual phenomena but are also designed with an eye toward the maximum of generality and flexibility. The CES, on the other hand, treats a specific domain and set of applications, and can therefore be more restrictive and prescriptive in its specifications. In addition, because the TEI is not complete, there are some areas of importance for corpus encoding that the TEI Guidelines do not cover. Therefore, the first major task in developing the CES involved evaluating, adapting, selecting from, and in some cases extending the TEI Guidelines to meet the specific needs of corpus-based work in language engineering.

The CES has also been developed taking into account several practical realities surrounding the encoding of corpora intended for use in language engineering research and applications. In particular, at the present time and for the foreseeable future, corpora for language engineering will be adapted from legacy data, that is, pre-existing electronic data encoded in some arbitrary format (typically, word processor, typesetter, etc. formats intended for printing). The vast quantities of data involved and the difficulty (and cost) of the translation into usable formats imply that the CES must be designed in such a way that this translation does not require prohibitively large amounts of manual intervention to achieve minimum conformance to the standard. However, the markup that would be most desirable for the linguist is not achievable by automatic means. Therefore, a major feature of the CES is the provision for a series of increasingly refined encodings of text, beyond the minimum requirements.


0.2. Scope of the CES


0.2.1. Text types

The term corpus typically designates a collection of linguistic data, whether written, spoken, or both, in one or multiple languages. In some cases, the term corpus (as opposed to terms such as collection, archive, etc.) is further restricted to apply to collections constructed according to various linguistic criteria such as representativeness and balance across a given domain, set of languages, etc. Here, we use the term corpus to refer to any collection of linguistic data, whether or not it is selected or structured according to some design criteria. According to this definition, a corpus can potentially contain any text type, including not only prose, newspapers, poetry, drama, etc., but also word lists, dictionaries, etc. The CES also covers transcribed spoken data.

Due to the need for massive amounts of data, many corpora intended for use in language engineering applications are currently being created. Electronic texts are obtained by

The third is at present the most usual source of material for inclusion in linguistic corpora. As a result, a wide range of text types must be accommodated by the CES, including law records, technical manuals, transcriptions of debates, etc., as well as newspapers (which are an important source of material for corpora), many of which have irregular formats that require special consideration for encoding.


0.2.2. Languages

The CES applies to monolingual corpora including texts from a variety of western and eastern European languages, as well as multi-lingual corpora and parallel corpora comprising texts in any of these languages.


0.2.3. Applications

The CES is intended to be used for encoding corpora used as a resource in language engineering, including all areas of natural language processing, machine translation, lexicography, etc.

Corpora are used in language engineering to gather real language evidence, both qualitative and quantitative. Qualitative evidence consists of examples which can be used for the construction of computational lexicons, grammars, and multi-lingual lexicons and term banks, for lexicography, etc. Quantitative information consists of statistics which indicate frequent or characteristic uses of language. These statistics can also be used to guide preference-based parsers, assist in lexicography, determine translation equivalents, etc. In addition, statistics can be used to drive morphological taggers, POS taggers, alignment programs, sense taggers, etc. Common operations on corpora for the purposes of language engineering include extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, generation of lists of linguistic elements, etc.; and the determination of statistics such as frequency information, averages, mutual information scores, etc.
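As an illustration of the quantitative operations mentioned above, the following minimal sketch computes word frequencies and pointwise mutual information for adjacent word pairs. It is not part of the CES: the function name `bigram_pmi`, the toy token list, and the choice of simple relative-frequency estimates are all assumptions made for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs.

    Hypothetical helper: probabilities are plain relative frequencies,
    with no smoothing and no minimum-count threshold.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy data standing in for a tokenized corpus (an assumption of this sketch).
tokens = "the cat sat on the mat the cat slept".split()
frequencies = Counter(tokens)   # frequency information
pmi = bigram_pmi(tokens)        # mutual information scores
```

On real corpora, pairs seen only once or twice tend to receive inflated scores, so a minimum-count filter is usually applied before ranking collocation candidates.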

We do not address corpora intended for other applications, such as stylistic studies, socio-linguistics, historical studies, information retrieval, etc., although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering). Treating a restricted domain enables development of a standard tighter than that of the TEI, by providing specific encoding solutions rather than general or multiple ones, and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).


0.2.4. Encoded facts

The CES distinguishes between primary data, which is "unannotated" data in electronic form, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc., and linguistic annotation, which comprises information generated and added to the primary data as a result of some linguistic analysis. The CES covers the encoding of objects in the primary data that are seen to be relevant to corpus-based work in language engineering research and applications, such as

The CES also covers encoding conventions for linguistic annotation of text and speech, including morphosyntactic tagging, parallel text alignment, prosody, phonetic transcription, etc.

The CES is intended to cover those areas of corpus encoding on which there exists consensus among the language engineering community, or on which consensus can be easily achieved. Areas where no consensus can be reached (for example, sense tagging) are not treated at this time.


0.3. Overview of the CES

In its present form, the CES provides the following:

The CES provides a Document Type Definition (DTD) that can be used for various levels of primary data encoding. The first is the minimum encoding level required to make the corpus (re)usable across all possible language engineering applications. Succeeding levels provide for increasing enhancement in the amount of encoded information and increasing precision in the identification of text elements. Automatic methods to achieve markup at each level are for the most part increasingly complex, and therefore more costly; the sequence is designed to accommodate a series of increasingly information-rich instantiations of the text at a minimum of cost.

Linguistic annotation is maintained in SGML documents separate from the primary data, with which it is associated by hyper-links. The CES includes a series of DTDs for documents containing the different kinds of annotation information.
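The separation of annotation from primary data can be pictured with the following minimal sketch. It does not reproduce the CES DTDs or their SGML linking mechanism: the identifiers (`w1`, `w2`, ...), the field names `target` and `pos`, and the function `resolve` are hypothetical and serve only to show annotation records pointing into primary data that is stored separately.

```python
# Primary data: text units keyed by hypothetical identifiers.
primary = {"w1": "The", "w2": "cat", "w3": "sleeps"}

# A separate annotation document: each record points to a unit of the
# primary data by identifier instead of embedding the text itself.
annotations = [
    {"target": "w1", "pos": "DET"},
    {"target": "w2", "pos": "NOUN"},
    {"target": "w3", "pos": "VERB"},
]


def resolve(primary, annotations):
    """Follow each link and pair the primary text with its annotation."""
    return [(primary[a["target"]], a["pos"]) for a in annotations]


tagged = resolve(primary, annotations)
```

Because the primary data is never modified, several annotation documents of different kinds (for example, tagging and alignment) can point to the same text independently.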


0.4. Status of the current document

The current version of the CES is a first draft of the standard. It has not been widely implemented, and the intention is to continue to develop the CES on the basis of input and feedback from users after it is put to greater use. Therefore, this document will continue to evolve and should not be regarded as "final".

We recognize that changes in the specifications present problems for those who have previously implemented the standard. To alleviate this problem, we have adopted the following development strategy:

The current version of the CES has the following immediate and major limitations:

These areas are under development.

All current CES documents and DTDs will continue to be available at

<URL: http://www.cs.vassar.edu/CES/>

Anyone actively implementing the standard should consult this site regularly.

In developing the CES we have looked at the work of other TEI-based corpus applications, including in particular the British National Corpus Project and the English-Norwegian Parallel Corpus Project. The various modifications of the TEI which have been developed by these groups and independently in the CES are often very similar, and it is at times difficult to know where an idea or strategy originated. We would therefore like to offer here a general acknowledgement of the work of these other projects and their influence on the CES.

We welcome and encourage user input concerning the CES.


0.5. Key to tag descriptions

Throughout this document, the tables describing the tags should be interpreted in the following way:

