CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 4. Version 0.1. Last modified 4 December 1995.






CES Part 4. Encoding Primary Data




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.





4.0. Overview

For the foreseeable future, most of the texts that will be encoded already exist in electronic form. Such texts are referred to as legacy data. The vast majority of these documents were originally intended to be printed and therefore already contain markup in the form of typesetter codes, word processing formats, etc., primarily related to visual presentation.

The goal of encoding for corpus linguistics is to describe text structure that is linguistically relevant and mark objects relevant to analysis. Thus, for the purposes of corpus work in language engineering applications, a text (prior to linguistic annotation) is a set of linguistic objects, comprising at least

The text seen as a printed or displayed object, including fonts, layout, etc., and the text seen as a collection of linguistic objects represent two different views of the text. Some of the components of one of these views correspond to components of the other, while others do not. Therefore, the process of preparing a corpus originally existing as legacy data involves

This process is potentially very costly. The cost depends on how directly presentational categories map onto distinct linguistic categories, and on how much additional markup is desired for elements that are not marked in the original or that are not easily distinguishable on the basis of typography.

Because of the potential cost, data preparation is often accomplished by taking the data through a series of transformations, each of which raises the information level to some extent. The final state models the richest possible information state. The transformation process cannot be completely deterministic, since raising the information level often involves deciding which among several possible candidates a given tag maps to, as well as adding structural information that is not present or not fully explicit in the previous state. Therefore, the transformation process is neither fully automatic nor entirely cost-free. However, it is possible to minimize the cost of transforming one information state into the next higher one.
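The staged process described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the legacy italic code `\i{...}`, the intermediate tag `<hi rend="it">`, the target tag `<title>`, and both function names are hypothetical and are not taken from the CES DTD. The first stage is fully automatic; the second raises the information level only where an annotator has supplied a decision, which is where the cost arises.

```python
import re

# Hypothetical legacy typesetter code: \i{...} marks italics.
ITALIC = re.compile(r"\\i\{([^{}]*)\}")
# Intermediate, purely presentational tag produced by stage 1.
HI = re.compile(r'<hi rend="it">([^<]*)</hi>')


def stage_presentational(text: str) -> str:
    """Stage 1 (automatic): carry italics over as a generic highlight tag."""
    return ITALIC.sub(r'<hi rend="it">\1</hi>', text)


def stage_linguistic(text: str, decisions: dict[str, str]) -> str:
    """Stage 2 (not deterministic): replace a highlighted span with the
    category an annotator chose for it. Spans without a decision are
    left at the stage 1 information level."""
    def replace(match: re.Match) -> str:
        content = match.group(1)
        tag = decisions.get(content)
        if tag is None:
            return match.group(0)
        return f"<{tag}>{content}</{tag}>"
    return HI.sub(replace, text)


legacy = r"She read \i{Le Monde} with \i{great} interest."
s1 = stage_presentational(legacy)
# Italics may signal a title, emphasis, a foreign word, etc.;
# here only one span has been decided by hand.
s2 = stage_linguistic(s1, {"Le Monde": "title"})
print(s1)
print(s2)
```

Because undecided spans are passed through unchanged, the text remains usable at the lower information level while the costlier decisions are deferred.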

The CES defines a DTD that can be used in such a process for encoding primary data. It has been designed to represent the text at any of the various stages of information transformation (i.e., translating existing markup into relevant, increasingly information-rich categories). Encoding the text at the first (minimum required) stage can often be accomplished by automatic means and may be nearly cost-free. Users of the cesDoc DTD can encode their texts to conform to intermediate stages, aiming toward a rich representation of relevant linguistic information, depending on cost considerations, application needs, etc.
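As a rough illustration of how a minimal encoding might be produced automatically, the Python sketch below wraps blank-line-separated blocks of plain text in the `<cesDoc>`, `<body>` and `<p>` elements that appear in the skeleton of section 4.2.1. The function name `wrap_paragraphs` and the blank-line paragraph convention are assumptions. The sketch emits explicit end tags, produces no header, and makes no claim that its output satisfies the cesDoc DTD or the Level 1 requirements.

```python
import re
from xml.sax.saxutils import escape


def wrap_paragraphs(plain_text: str) -> str:
    """Wrap blank-line-separated blocks in <p> inside <body> and <cesDoc>."""
    blocks = re.split(r"\n\s*\n", plain_text.strip())
    lines = ["<cesDoc>", "<body>"]
    for block in blocks:
        if block.strip():
            # Collapse internal line breaks and escape &, < and >.
            lines.append("<p>" + escape(" ".join(block.split())) + "</p>")
    lines += ["</body>", "</cesDoc>"]
    return "\n".join(lines)


sample = "First paragraph,\nwrapped over two lines.\n\nAT&T is mentioned here."
result = wrap_paragraphs(sample)
print(result)
```

Escaping is done before the tags are added so that characters such as `&` in the legacy text cannot be mistaken for markup.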


4.1. Levels of encoding for primary data

For the encoding of primary data the CES identifies three levels of encoding:

The following sections provide precise criteria for conformance to each level.


4.2. Level 1 conformance

4.2.1. Requirements

                       <cesDoc>
                         <body>
                            <div1> [optional]
                               <p>
                               <p>
                               <p>
                                ...

4.2.2. Recommendations

4.2.3. Requirements for documents adapted from legacy data

4.2.4. Recommendations for documents adapted from legacy data


4.3. Level 2 conformance

Level 2 conformance requires the following:

4.4. Level 3 conformance

Conformance to this level demands