CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 5. Version 0.1. Last modified 20 December 1995.






CES Part 5. Encoding Linguistic Annotation




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.





[THIS SECTION IS UNDER DEVELOPMENT]

The contents of this section represent an initial draft serving as a proposal. It is subject to discussion and may undergo significant change.


5.0. Overview

The classical view of a document prepared for use in corpus-based research is one in which annotation is added incrementally to the original as it is generated. The CES adopts a strategy whereby annotation information is not merged with the original, but rather retained in separate SGML documents (with different DTDs) and linked to the original or other annotation documents. Linkage between documents will be accomplished using the HyTime-based TEI addressing mechanisms for element linkage.

The separate markup strategy is in essence a finely linked hypertext format where the links signify a semantic role rather than navigational options. That is, the links signify the locations where markup contained in a given annotation document would appear in the document to which it is linked. As such the annotation information comprises remote markup which is virtually added to the document to which it is linked. In principle, the two documents could be merged to form a single document containing all the markup in each. This approach has several advantages for corpus-based research:

The hyper-document comprising each text in the corpus and its annotations will consist of several documents. The base or "hub" document is the unannotated document containing only primary data markup. The hub document is "read only" and is not modified in the annotation process. Each annotation document is a proper SGML document with a DTD, containing annotation information linked to its appropriate location in the hub document or another annotation document.

All annotation documents are linked to the SGML original (containing the primary data) or other annotation documents using one-way links. The exception is output of the aligner for parallel texts, which will consist of an SGML document containing only two-way links associating locations in two documents in different languages. The two linked documents are two documents containing the relevant structural information, such as sentence or word boundaries. The overall architecture is described by the figure below.

Following this model, the CES provides DTDs for the different types of annotation information, described below.


5.1. Locators

Both HyTime and the TEI provide several different means to specify the location of the endpoints of a link. One of the most common is to point to a unique identifier, specified in the ID attribute on the target tag. However, some annotation documents point into the hub (base) document, which is intended to be read-only, so using this method may require modifying the original as a prior step, by adding ID attributes to each tag. Another method is to use locations in the SGML tree to point to specific elements (e.g., the third child of the second child of the first child of the root). Using this method solves the read-only problem, eliminates the indexing needed for IDs, and provides a semantically meaningful location that might be useful for processing.

For many purposes it is necessary to point to locations that are not the entire content of an SGML element--for example, to a specific character that is the start of a sentence. Both HyTime and TEI allow for mixing pointing mechanisms in the same location indicator. Therefore, we recommend that information in annotation documents be linked to the hub document via one-way links, specified in two steps:

Beyond this, it is necessary to choose between TEI and HyTime formats. TEI locators have the advantage that they are more compact than HyTime location ladders. Additionally, the TEI notation is easily made compatible with HyTime by the use of the HyTime notloc form, in conjunction with an appropriate notation declaration. Therefore, we recommend that TEI locators be used in general. As an example of the recommended addressing method, consider the following text:

       <p>
       L'usine, qui devrait être implantée à Eloyes (Vosges) représente 
       un investissement d'environ 3,7 milliards de yens. Elle fabriquera 
       des pièces détachées pour la filiale de Minolta en RFA.
       <p>

The following two tags point to the first two words inside the <p> element:

       <tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (1)" 
            to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (2)">
       <tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (3)" 
            to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (7)">

The TEI notation is given in two parts:

While this notation is far more compact than the equivalent HyTime notation, it is still more bulky than desired. Therefore, we have developed a more compact notation for this kind of addressing. It is very common to need to refer to an element by indicating its position in the ESIS tree relative to the root, and to then access a particular character within that element. A compact notation such as ESIS (2.1.1.1.2.1\1), equivalent to CHILD (2) (1) (1) (1) (2) (1) STRLOC (1), accomplishes this. The resulting notation is:

       <tok from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
       <tok from="2.1.1.1.2.1\3" to="2.1.1.1.2.1\7">
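
The correspondence between the two notations is mechanical. The following Python sketch expands a compact locator into the TEI form shown earlier. The function name compact_to_tei is hypothetical and not part of the CES, and the sketch assumes that the compact form is a dot-separated tree path optionally followed by a backslash and a character offset, as in the examples above.

```python
def compact_to_tei(locator: str) -> str:
    """Expand a compact locator such as 2.1.1.1.2.1\\1 into TEI CHILD/STRLOC form.

    Assumption: the part before the backslash is a dot-separated list of
    child positions, and the optional part after it is a character offset.
    """
    path, sep, offset = locator.partition("\\")
    steps = path.split(".")
    if not all(step.isdigit() for step in steps):
        raise ValueError(f"bad tree path in locator: {locator!r}")
    tei = "CHILD " + " ".join(f"({step})" for step in steps)
    if sep:
        # A backslash was present, so a character offset must follow it.
        if not offset.isdigit():
            raise ValueError(f"bad character offset in locator: {locator!r}")
        tei += f" STRLOC ({offset})"
    return tei


print(compact_to_tei(r"2.1.1.1.2.1\1"))
print(compact_to_tei(r"2.1.1.1.2.1"))
```

A locator without a character offset, as is typical for alignment links, expands to the CHILD part only.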

Locators for alignment can use the same notation to point to locations in the documents they link. Additionally, it is necessary to indicate the documents associated with the addresses given in the from and to attributes, respectively, since these are different. For this purpose, two attributes, fromdoc and todoc, provide the names of these documents. If these attributes are declared so that they default to the most recently specified value (in SGML jargon, "#CURRENT"), the resulting notation for the locators in the alignment file is similar to the above, since fromdoc and todoc need not be given explicitly on every tag. Note also that in most cases no character offset is specified for aligned data, since alignment is typically between the entire content of SGML elements (sentences, paragraphs, tokens) in the two aligned documents:

       <link from="2.1.1.1.2.1" to="2.1.1.1.2.1">
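
How such a locator might be resolved against a parsed document can be sketched in Python as follows. This is an illustration under stated assumptions, not a CES tool: it assumes that each step of the tree path selects the n-th element child, starting from the root element, that character offsets are 1-based and inclusive (as the token examples above suggest), and that both ends of a span lie in the same element. The function names and the small sample document are hypothetical, and the sample is written as well-formed XML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET


def parse_locator(locator):
    """Split a compact locator into (list of child positions, offset or None)."""
    path, _, offset = locator.partition("\\")
    steps = [int(step) for step in path.split(".")]
    return steps, (int(offset) if offset else None)


def find_element(root, steps):
    """Follow 1-based element-child positions down from the root (assumption)."""
    node = root
    for position in steps:
        node = list(node)[position - 1]
    return node


def resolve_span(root, from_loc, to_loc):
    """Return the text addressed by a from/to pair within a single element."""
    from_steps, start = parse_locator(from_loc)
    to_steps, end = parse_locator(to_loc)
    if from_steps != to_steps:
        raise ValueError("this sketch only handles spans within one element")
    text = "".join(find_element(root, from_steps).itertext())
    # Offsets are treated as 1-based and inclusive; a missing offset means
    # the start (or the end) of the element content.
    return text[(start or 1) - 1 : end]


# Hypothetical miniature document: root <text>, then <body>, then <p>.
doc = ET.fromstring("<text><body><p>L'usine, qui devrait</p></body></text>")
print(resolve_span(doc, r"1.1\1", r"1.1\2"))
print(resolve_span(doc, r"1.1\3", r"1.1\7"))
print(resolve_span(doc, "1.1", "1.1"))
```

Because the hub document is addressed only by position, nothing in it has to be modified for the lookup to work.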


5.1.1. Linking strategies and document verification

There are two ways we might attach tags to a document from another document. One way would be to attach individual tags at particular parts, in much the same way as we can create links freely to a document. This implies that the added tags cannot necessarily be parsed as a legitimate SGML document, because there are no constraints on the applied tags. In particular, the constraints that the tags cover the entire text of the document and that their containment relations form a tree are not present in this model. It is also very difficult to linearize. Fortunately, the external markup that comprises an annotation is an alternative description of the whole document. Two key constraints can be enforced to enable parsing:

The first constraint enables parsing of the several portions of a compound document in parallel, since the order of references will be the same. This should allow for sequential processing with limited memory, while arbitrary re-orderings are likely to require the use of some kind of location table to store the structures of the texts. Some re-ordering could be allowed, if necessary, as long as it is possible to bound the region within which such re-ordering will occur to a finite window (and thus bound the size of the symbol table needed to process the markup). The second constraint can be enforced by making remote tags nest according to the rules of SGML, and providing a DTD for how those tags should be nested. Thus, there will be a DTD for each type of annotation information, against which each annotation document can be validated when considered in isolation.
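
A check of the ordering property can be sketched in Python. The sketch assumes that the first constraint amounts to requiring that locators appear in the annotation document in the same order as their targets appear in the text, and that locators use the compact notation of section 5.1. The function names are hypothetical. Tree paths are compared as tuples of integers, which orders a parent before its children and earlier siblings before later ones.

```python
def locator_key(locator):
    """Map a compact locator to a key that sorts in document order."""
    path, _, offset = locator.partition("\\")
    steps = tuple(int(step) for step in path.split("."))
    # Offsets are compared as integers, so that 14 sorts after 5.
    return steps, (int(offset) if offset else 0)


def first_out_of_order(locators):
    """Return the index of the first locator that precedes its predecessor.

    Returns None when the whole sequence is in document order. Only the
    previous key is kept, so memory use does not grow with the input.
    """
    previous = None
    for index, locator in enumerate(locators):
        key = locator_key(locator)
        if previous is not None and key < previous:
            return index
        previous = key
    return None


print(first_out_of_order([r"1.2.1\1", r"1.2.1\5", r"1.2.1\14"]))
print(first_out_of_order([r"1.2.1\5", r"1.2.1\1"]))
```

Allowing re-ordering within a bounded window would mean keeping the last few keys rather than only the previous one.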


5.2. Encoding conventions for segmentation and grammatical annotation

The cesAna DTD is used for segmentation and grammatical annotation, including:

Allowing more than one possible set of grammatical annotation makes it possible to represent data for which lexical lookup or some other morphosyntactic analysis has been performed, but which has not been disambiguated. When disambiguation has been accomplished, an optional element can be included containing the disambiguated form. The structure of the DTD constituents is based on the overall principle that one or more "chunks" of a text may be included in the annotation document. These chunks may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. For example, it is likely that within any text, only the paragraph content will undergo morphosyntactic analysis, and titles, footnotes, captions, long quotations, etc. will be omitted or analysed separately.


5.2.1. Global attributes

Four global attributes are defined in the cesAna DTD:

Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".

The global attributes are defined at the top of the cesAna DTD and represented by an entity, A.ANA. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.


5.2.2. Top-level Constituents

The top level structure of the cesAna DTD is as follows:


5.2.3. Chunks


5.2.4. Chunk constituents


5.2.5. Token constituents


5.2.6. Lex and Disamb constituents


5.2.7. Example

The following example shows the full use of all the options provided in the cesAna DTD. This set of annotation data could be the final result after tokenization, segmentation, lexical lookup or morphosyntactic analysis, and part of speech disambiguation. All the original options for morphosyntactic class are retained here, and the disambiguated tag is provided in the <disamb> element.

Note that because the attribute doc is specified as #CURRENT, once a value has been specified for this attribute on one instance of a given element, all subsequent occurrences of that element will use this value as the default unless it is re-specified.

     <!doctype cesAna PUBLIC "-//CES//DTD//cesAna//EN" []>
     <cesAna doc=MyText1 header.loc="MyText.hdr">
       <chunkList>
         <chunk doc="MyText1" from='1.2.1\1'>
           <s>
             <tok  doc="MyText1" class='tok' from='1.2.1\1'>
               <orth>Les</orth>
               <disamb>
                   <ctag>DMP</ctag>
               </disamb>         
               <lex>
                   <base>le</base>
                   <msd>Da-fp--d</msd>
                   <ctag>DFP</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Da-mp--d</msd>
                   <ctag>DMP</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Pp3fpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Pp3mpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
             </tok>
             <tok class='tok' from='1.2.1\5'>
               <orth>critères</orth>
               <disamb>
                   <ctag>NCMP</ctag>
               </disamb>         
               <lex>
                   <base>critère</base>
                   <msd>Ncmp-</msd>
                   <ctag>NCMP</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\14'>
               <orth>se</orth>
               <disamb>
                   <ctag>PPJ</ctag>
               </disamb>         
               <lex>
                   <base>se</base>
                   <msd>Pp3msj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3fpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3fsj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3mpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\17'>
               <orth>basent</orth>
               <disamb>
                   <ctag>VM3P</ctag>
               </disamb>         
               <lex>
                   <base>baser</base>
                   <msd>Vmip3p--</msd>
                   <ctag>VM3P</ctag>
               </lex>
               <lex>
                   <base>baser</base>
                   <msd>Vmsp3p--</msd>
                   <ctag>VM3P</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\24'>
               <orth>sur</orth>
               <disamb>
                   <ctag>SP</ctag>
               </disamb>         
               <lex>
                   <base>sur</base>
                   <msd>Afpms-</msd>
                   <ctag>AMS</ctag>
               </lex>
               <lex>
                   <base>sur</base>
                   <msd>Sp</msd>
                   <ctag>SP</ctag>
               </lex>
             </tok>
             ...
           </s>
         </chunk>
       </chunkList>
    </cesAna>

Alternatively, if a more concise set of information is desired, the following could be provided for the first token in the example above:

             <tok  class='tok' from='1.2.1\1'>
               <orth>Les</orth>
               <lex>
                   <base>le</base>
                   <ctag>DMP</ctag>
               </lex>
             </tok>
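
The relation between the <disamb> element and the candidate <lex> entries can be illustrated with a short Python sketch. It is not part of the CES: the function name is hypothetical, and the sketch relies on the fact that this particular <tok> fragment, once written out with all end tags, is also well-formed XML and can be read with the Python standard library. SGML documents in general require an SGML parser.

```python
import xml.etree.ElementTree as ET

# The first token of the example above, with the <lex> entries on one line each.
TOK = r"""
<tok class='tok' from='1.2.1\1'>
  <orth>Les</orth>
  <disamb><ctag>DMP</ctag></disamb>
  <lex><base>le</base><msd>Da-fp--d</msd><ctag>DFP</ctag></lex>
  <lex><base>le</base><msd>Da-mp--d</msd><ctag>DMP</ctag></lex>
  <lex><base>le</base><msd>Pp3fpj-</msd><ctag>PPJ</ctag></lex>
  <lex><base>le</base><msd>Pp3mpj-</msd><ctag>PPJ</ctag></lex>
</tok>
"""


def disambiguated_readings(tok):
    """Return (base, msd, ctag) for the <lex> entries matching <disamb>.

    If the token has no <disamb> element, all candidate readings are returned.
    """
    chosen = tok.findtext("disamb/ctag")
    readings = [
        (lex.findtext("base"), lex.findtext("msd"), lex.findtext("ctag"))
        for lex in tok.findall("lex")
    ]
    if chosen is None:
        return readings
    return [reading for reading in readings if reading[2] == chosen]


tok = ET.fromstring(TOK.strip())
print(tok.findtext("orth"), disambiguated_readings(tok))
```

Note that a corpus tag does not always single out one <lex> entry: for the token "se" in the example, all four candidate readings carry the tag PPJ, so matching on <ctag> alone leaves the morphosyntactic description ambiguous.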


The cesAna DTD



5.3. Encoding conventions for parallel text alignment

The annotation document containing alignment information consists entirely of links between two documents which have been aligned. Alignment may be between primary data documents or between annotation documents containing segmentation information for the aligned units (sentences, tokens etc.). The cesAlign DTD provides the document structure which, like that of the cesAna DTD, is based on the notion of "chunks" that correspond to parts of the document.


5.3.1. Global attributes

Four global attributes are defined in the cesAlign DTD:

Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".

The global attributes are defined at the top of the cesAlign DTD and represented by an entity, A.ALIGN. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.


5.3.2. Top-level Constituents

The top level structure of the cesAlign DTD is as follows:

As in the cesAna DTD, <chunkList> contains one or more occurrences of the element <chunk>, defined as follows:


5.3.3. Links

Links may associate data of two kinds:

In the first instance, the markup required for the linkage can be simplified, since only one location (i.e., the location of the SGML element in the ESIS tree) need be specified for each of the targets of the link. For example:

<link fromLoc="2.1.1.1.2.1" toLoc="2.1.1.1.3.2">

In the second instance, a heavier mechanism is needed, since each target must provide a starting and ending location in each of the aligned documents. Therefore it is necessary to specify something like the following:

     <xptr id=En1 doc=EN104 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
     <xptr id=Fr1 doc=FR413 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
     <link targets="En1 Fr1">

Note that using this mechanism, three or more files can be aligned if desired, since any number of IDs can be specified in the value field of the targets attribute on the <link> element.
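
The indirection through IDs can be sketched in Python. The data below mirrors the <xptr> and <link> elements shown above. The XPtr class, the table, and the resolve_link function are hypothetical names used only for illustration, and the values are entered by hand rather than parsed from SGML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XPtr:
    """One <xptr> element: a document name and a from/to span within it."""
    doc: str
    start: str
    end: str


# Table of <xptr> elements keyed by their id attribute.
xptrs = {
    "En1": XPtr("EN104", r"2.1.1.1.2.1\1", r"2.1.1.1.2.1\2"),
    "Fr1": XPtr("FR413", r"2.1.1.1.2.1\1", r"2.1.1.1.2.1\2"),
}


def resolve_link(targets, table):
    """Resolve the space-separated IDs of a targets attribute to XPtr records."""
    ids = targets.split()
    missing = [identifier for identifier in ids if identifier not in table]
    if missing:
        raise KeyError(f"link refers to undefined xptr ids: {missing}")
    return [table[identifier] for identifier in ids]


for pointer in resolve_link("En1 Fr1", xptrs):
    print(pointer.doc, pointer.start, pointer.end)
```

Since targets is simply a list of IDs, aligning a third document only requires a further <xptr> entry and one more ID in the list.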

The following elements appear in the cesAlign DTD:



Note that because the attributes doc, fromDoc, and toDoc are defined as #CURRENT, once a value has been specified for one of these attributes on one instance of a given element, all subsequent occurrences of that element will use this value as the default unless it is re-specified. Therefore, verbosity can be reduced by placing all the <xptr> elements pointing to a given document sequentially in the alignment document. Similarly, <link> elements appearing together need only specify fromDoc and toDoc on the first appearance of <link>.
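
The effect of #CURRENT defaults can be sketched in Python as a pass over the elements in document order. This is an illustration of the behaviour described above, not an SGML parser: elements are represented as hypothetical (name, attributes) pairs, the function name is an assumption, and the most recent value is tracked per element type, following the description in the preceding paragraph.

```python
def apply_current_defaults(elements, current_attrs=("doc", "fromDoc", "toDoc")):
    """Fill in omitted #CURRENT attributes from the most recent occurrence.

    elements is a list of (element name, attribute dict) pairs in document
    order. A new list is returned; the input is left unchanged.
    """
    last_seen = {}  # (element name, attribute name) -> most recent value
    resolved = []
    for name, attrs in elements:
        filled = dict(attrs)
        for attr in current_attrs:
            if attr in attrs:
                last_seen[(name, attr)] = attrs[attr]
            elif (name, attr) in last_seen:
                filled[attr] = last_seen[(name, attr)]
        resolved.append((name, filled))
    return resolved


# Hypothetical sequence: doc is given only when the target document changes.
pointers = [
    ("xptr", {"id": "En1", "doc": "EN104"}),
    ("xptr", {"id": "En2"}),
    ("xptr", {"id": "Fr1", "doc": "FR413"}),
    ("xptr", {"id": "Fr2"}),
]
for name, attrs in apply_current_defaults(pointers):
    print(name, attrs["id"], attrs["doc"])
```

Grouping the <xptr> elements by document, as recommended above, means the doc attribute appears once per group rather than once per element.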


The cesAlign DTD


