CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Title page. Version 0.1. Last modified 11 December 1995.






Corpus Encoding Standard

CES Version 0




Nancy Ide and Jean Véronis



Laboratoire Parole et Langage
Centre National de la Recherche Scientifique
29, Avenue Robert Schuman
13621 Aix-en-Provence Cedex 1, France

e-mail: ide@univ-aix.fr, veronis@univ-aix.fr






Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided "as is", without any express or implied warranties. While every effort has been made to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.


Abstract

This document is the first version of the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES has been designed to be optimally suited for use in language engineering research and applications, in order to serve as a widely accepted set of encoding standards for European corpus work. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), strongly influenced by and in broad agreement with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.

The CES is being developed in a bottom-up fashion, starting with minimal specifications and expanding on the basis of feedback from its use and input from the research community in general. We invite and encourage comments on, and discussion of, any aspect of the CES.




Acknowledgements

This document results from the joint effort of the European projects MULTEXT (LRE), MULTEXT-EAST (Copernicus) and EAGLES. CNRS has supported the integration effort.



Contributors
Greg Priest-Dorman
