MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 4. Version 0.1. Last modified 4 December 1995.
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
For the foreseeable future, most texts to be encoded will already exist in
electronic form. Such texts are referred to as legacy data.
The vast majority of these documents were originally intended to be printed and
therefore already contain markup in the form of typesetter codes, word
processing formats, etc., primarily related to visual presentation.
The goal of encoding for corpus linguistics is to describe text structure that
is linguistically relevant and mark objects relevant to analysis. Thus, for
the purposes of corpus work in language engineering applications, a text (prior to
linguistic annotation) is a set of linguistic objects, comprising at least
- large units of discourse, such as paragraphs, chapters, etc., together with
titles, footnotes, etc.;
- basic linguistic objects common to linguistic analyses, such as sentences,
clauses, phrases, words, morphemes, and phonemes, as well as names, dates,
abbreviations, etc.
The text seen as a printed or displayed object,
including fonts, layout, etc., and the text seen as a collection of linguistic
objects represent two different views of the text. Some of the
components of one of these views correspond to components of the other, while
others do not. Therefore, the process of preparing a corpus originally existing
as legacy data involves
- the translation, where relevant, of presentation markup into markup
descriptive of linguistic categories (e.g., the translation of items in bold to
titles, etc.);
- the elimination of presentational markup which does not signify an object
of linguistic relevance;
- possibly, the addition of tags for elements not marked in any way in the
legacy document (e.g., proper names).
This process is potentially very costly, depending on how well presentational
categories map directly into distinct linguistic categories, and on how much
additional markup is desired for elements that are not marked in the original
or are not easily distinguishable by typography.
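The first two steps above (translating presentation markup and eliminating presentational markup of no linguistic relevance) can be sketched roughly as follows. This is only an illustration under stated assumptions: the legacy codes `B`, `I`, `FONT`, and `CENTER`, the decision that bold marks titles, and the `rend="it"` value are all hypothetical and would differ for each legacy source.

```python
import re

# Hypothetical mapping from legacy presentation codes to descriptive tags.
# Both the legacy names and the chosen targets are assumptions for this sketch.
TAG_MAP = {
    "B": ("head", ""),           # assumed: bold marks titles in this source
    "I": ("hi", ' rend="it"'),   # italics kept as a generic highlight
}
# Presentation codes assumed to signal no object of linguistic relevance.
DROP = {"FONT", "CENTER"}

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def translate_markup(text):
    """Translate mapped codes, drop irrelevant ones, leave the rest untouched."""
    def repl(match):
        closing, name = match.group(1), match.group(2).upper()
        if name in DROP:
            return ""
        if name in TAG_MAP:
            tag, attrs = TAG_MAP[name]
            return "</%s>" % tag if closing else "<%s%s>" % (tag, attrs)
        return match.group(0)  # unknown markup is left for manual review
    return TAG_RE.sub(repl, text)

legacy = ('<CENTER><B>Chapter One</B></CENTER>'
          '<FONT size="2">It was <I>very</I> late.</FONT>')
converted = translate_markup(legacy)
print(converted)
```

Unknown codes are deliberately passed through rather than guessed at, since deciding what a presentational code signifies is an interpretive step.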
Because of the potential cost, data preparation is often accomplished by taking
the data through a series of transformations, each of which raises the
information level to some extent. The final state models the richest possible
information state.
The transformation process cannot be completely deterministic,
since raising the information level often involves deciding which among several possible
candidates a given tag maps to, as well as adding structural information that
is not present or fully explicit in the previous state. Therefore, the transformation
process is not fully automatic or entirely cost-free. However, it is possible to
minimize transformation costs from one information state to the next higher
one.
The CES defines a DTD that can be used in such a process for encoding primary
data. It is designed to represent the text at any of the various stages of
information transformation (i.e., translating existing markup into relevant,
increasingly information-rich categories).
Encoding the text at the first (minimum required) level can often be
accomplished by automatic means and may be nearly cost-free. Users of the
cesDoc DTD can encode their texts to conform to intermediate stages, aiming
toward a rich representation of relevant linguistic information, depending on
cost considerations, application needs, etc.
For the encoding of primary data the CES identifies three levels of encoding:
- Level 1
- This is the minimum encoding level required for CES
conformance, requiring markup for gross document structure (major text
divisions), down to the level of the paragraph, conformant to the CES1 DTD.
- Level 2
- This level requires that paragraph-level elements be correctly marked and
that, where possible, the function of rendition information at the
sub-paragraph level be determined and the elements marked accordingly.
- Level 3
- This is the most restrictive and refined level of markup for
primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph level tagging.
The following sections provide precise criteria for conformance to each level.
Level 1 conformance requires the following:
- Two documents must be provided, each with an appropriate reference to the location of the other:
- a document containing the primary data;
- a document containing a CES header describing the primary data.
- The header must provide a full description of all encoding formats utilized in the document.
- The document must not contain foreign markup.
- There should be no information loss for sub-paragraph elements. Sub-paragraph
elements that the original distinguishes by typography not directly
representable in the SGML-encoded version (e.g., a font distinction such as
italics, as opposed to capital letters or quotation marks, which are directly
representable) should be marked, typically using a <hi> tag.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.
- CES-conformant encoding to the paragraph level must be included. However,
note that:
- For Level 1 CES conformance, paragraph-level markup need not be refined.
For example, via automatic means all carriage returns may be changed to
<p> (paragraph) tags; additional work is needed to identify and
mark those situations where the carriage return signals a list, a long quote,
etc. This level of refinement is not required. Documents differentiating only
<p> tags are still compliant with the cesDoc DTD, which (minimally) requires
the following structure:
<cesDoc>
  <body>
    <div1> [optional]
      <p>
      <p>
      <p>
      ...
- Markup of sub-paragraph elements is conformant to CES specifications.
- When the CES-encoded document differs from its original (whether that
original was encoded using another encoding scheme or contained no markup apart
from carriage returns to signal paragraphs, etc.), it must be accompanied by a
copy of the original data or by information, given in the <sourceDesc> element,
specifying where the original can be permanently and readily obtained. The
reasons are as follows:
- It ensures that the encoded text can always be checked against the original.
- Since the rendering of visual presentation classes into more descriptive
markup categories is necessarily an interpretive process, having the original
on hand enables the user to examine the original categories and, potentially,
modify or improve them as necessary.
- Because encoded texts may be gradually enriched by a number of users over
time, it becomes increasingly essential to retain a trace of the "archaeology"
of the document as well as to ensure that the original is permanently preserved.
- All information in the original essential for the recognition of content
is retained in the encoded version. This refers particularly to rendition
information such as italics, etc. that may signal a linguistically relevant
element.
- Information whose sole function is to allow re-creation of
the original printed source should be discarded.
- The original character sequence comprising the document should be
retained, by employing the following principles:
- None of the original sequence of characters (with the possible
exception of rendition text) should be deleted or altered.
- The original data should not be given in attributes, but should
always appear as tag content. Note that data such as list numbers, footnote symbols, etc., can be considered rendition text and placed in attributes on the appropriate tag.
- Apart from the original data, no other data should appear as tag
content.
- The original order of the data should not be changed.
- Line breaks in the original which do not signal logical divisions
(paragraphs, etc.) should be encoded as blanks or, when they break a logically
contiguous unit, ignored.
- The translation process should be documented in the text and/or corpus
header, as appropriate, in the <encodingDesc> element.
- Alignment between the original data and the SGML encoded text should be
provided.
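As a rough illustration of the automatic, unrefined paragraph markup that Level 1 allows, and of the requirement that the original character sequence be retained, consider the sketch below. It assumes, for this example only, that blank lines signal paragraph boundaries in the legacy text; the function names are hypothetical, end tags are written out explicitly, and the output follows the minimal skeleton shown above without being claimed to validate against the cesDoc DTD.

```python
import re

def escape_sgml(text):
    """Escape markup delimiters so original characters remain tag content."""
    return text.replace("&", "&amp;").replace("<", "&lt;")

def to_level1(raw):
    """Wrap each blank-line-separated block in <p> tags (assumed convention)."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        # Line breaks inside a paragraph are not logical divisions:
        # encode them as single blanks.
        parts = [part.strip() for part in block.splitlines() if part.strip()]
        paragraphs.append("<p>%s</p>" % escape_sgml(" ".join(parts)))
    return "<cesDoc>\n<body>\n%s\n</body>\n</cesDoc>" % "\n".join(paragraphs)

def content_of(encoded):
    """Strip tags and undo escaping to recover the character data."""
    text = re.sub(r"<[^>]+>", " ", encoded)
    return text.replace("&lt;", "<").replace("&amp;", "&")

def same_character_sequence(raw, encoded):
    """True if both texts hold the same tokens in order, ignoring whitespace."""
    return raw.split() == content_of(encoded).split()

raw = "First line\nof a paragraph.\n\nSecond paragraph with A & B."
encoded = to_level1(raw)
print(encoded)
print(same_character_sequence(raw, encoded))
```

The comparison ignores whitespace because line breaks that do not signal logical divisions may legitimately become blanks. It is a crude check on order and content, not a substitute for the alignment between original and encoded text mentioned above.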
Level 2 conformance requires the following:
- The requirements for a Level 1 document are satisfied.
- All paragraph-level elements (lists, quotes, etc.) are correctly
identified.
- Where possible, <hi> tags are resolved to more precise tags
(foreign, term, etc.).
- If a sub-paragraph element is marked, every occurrence of that element has
been identified and marked in the text.
- SGML entities replace all special characters (e.g., &mdash; for —,
&pound; for £, etc.).
- Quotation marks are removed and either replaced by appropriate standard SGML
entities, or represented in a rend attribute on a <q> or
<quote> tag.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.
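The two character-level requirements for Level 2 (entities for special characters, and quotation marks moved into a rend attribute) might be approached along the following lines. This is a sketch under assumptions: the entity table is deliberately partial, the `rend="dq"` value is a hypothetical convention rather than one defined by the CES, and only non-nested curly double quotes are handled.

```python
import re

# Partial table of characters and their ISO entity references.
# A real conversion would need a complete table for the character set in use.
ENTITIES = {
    "\u2014": "&mdash;",   # em dash
    "\u00a3": "&pound;",   # pound sign
    "\u00e9": "&eacute;",  # e with acute accent
}

# Non-nested text between curly double quotation marks.
QUOTED = re.compile("\u201c([^\u201c\u201d]*)\u201d")

def encode_level2_characters(text):
    """Move quotation marks into a rend attribute, then substitute entities."""
    text = QUOTED.sub(r'<q rend="dq">\1</q>', text)  # "dq" is hypothetical
    return "".join(ENTITIES.get(ch, ch) for ch in text)

def unmapped_characters(text):
    """List non-ASCII characters that have no entry in the entity table."""
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in ENTITIES})

source = "\u201cIt cost \u00a35 \u2014 too much,\u201d she said."
encoded = encode_level2_characters(source)
print(encoded)
print(unmapped_characters(encoded))
```

Reporting unmapped characters, rather than silently passing them through, makes it easier to confirm that all special characters have in fact been replaced.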
Level 3 conformance requires the following:
- Requirements for a Level 2 document are satisfied.
- The following sub-paragraph elements have been identified and marked in
the text:
- abbreviations
- dates
- numbers
- measures
- names
- times
- titles
- foreign words and phrases
- Where s-units and dialogue are tagged, the
<p> - <s> - <q> hierarchy described in section 4.5 must be followed.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.