MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 5. Version 0.1. Last modified 20 December 1995.
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
[THIS SECTION IS UNDER DEVELOPMENT]
The contents of this section represent an initial draft serving as a proposal. It is subject to discussion and may undergo significant change.
The classical view of a document prepared for use in corpus-based research is
one in which annotation is added incrementally to the original as it is
generated. The CES adopts a strategy whereby
annotation information is not merged with the original, but rather retained in
separate SGML documents (with different DTDs) and linked to the original or other
annotation
documents. Linkage between documents will be accomplished using the HyTime-based TEI addressing mechanisms for
element linkage.
The separate markup strategy is in essence a finely linked hypertext
format where the links signify a semantic role rather than navigational
options. That is, the links signify the locations where markup contained in a
given annotation document would appear in the document to which it is
linked. As such the annotation information comprises remote markup which
is virtually added to the document to which it is linked. In principle, the two
documents could be merged to form a single document containing all the markup
in each.
This approach has several advantages for corpus-based research:
- it avoids the creation of potentially unwieldy documents--envision, in a
worst case, a single document containing segmentation and part of speech
markup, plus markup for alignment with translations in the eleven EU languages,
plus alignment with the speech recording, plus variant part of speech taggings
from several taggers, etc.!
- the original or hub document remains stable and is not modified by any
process which may add annotation.
- it avoids problems with markup containing overlapping hierarchies, which
are not allowable in SGML.
- different versions of the same kind of annotation (e.g., different POS
annotation) can be associated with the text.
- it is very much in line with what is evolving in the SGML world, and it is
likely that SGML/HyTime software which handles complex linkages will be
available in the near future.
- annotation can be accomplished by associating the SGML original or other
annotation documents with other, pre-existing documents--e.g., instead of
generating a document containing POS markup and linking it to the original,
links could be made directly with lexicon entries.
- it gives easy access to the original SGML document (or, as mentioned
above, any among several versions annotated for certain features) for use by
other applications.
The hyper-document comprising each text in the corpus
and its annotations will consist of several documents. The base or "hub"
document is the unannotated document containing only primary data markup.
The hub document is "read only" and is not modified in the annotation process.
Each annotation document is a proper SGML document with a DTD, containing
annotation information linked to its appropriate location in the hub document or another annotation document.
All annotation documents are linked to the SGML original (containing the primary data) or other
annotation documents using one-way links. The exception is output of the
aligner for parallel texts, which will consist of an SGML document containing
only two-way links associating locations in two documents in different
languages. The two linked documents are two documents containing
the relevant structural information, such as sentence or word boundaries. The overall architecture
is described by the figure below.

Following this model, the CES provides DTDs for the different types of annotation information, described below.
Both HyTime and the TEI provide several different means to specify the location of the endpoints of a link. One of the most common is to point to a unique identifier, specified in the ID attribute on the target tag. However, since some annotation documents are pointing into the hub (base) document, which is intended to be read-only, it may be necessary to modify the original by adding ID attributes to each tag, as a prior step. Another method is to use locations in the SGML tree to point to specific elements (e.g., the third child of the second child of the first child of the root). Using this method solves the read-only problem, eliminates the indexing needed for IDs, and provides a semantically meaningful location that might be useful for processing.
For many purposes it is necessary to point to locations that are not the entire content of an SGML element--for example, to a specific character that is the start of a sentence. Both HyTime and TEI allow for mixing pointing mechanisms in the same location indicator. Therefore, we recommend that information in annotation documents be linked to the hub document via one-way links, specified in two steps:
- tree addressing to point to the nearest enclosing tag for the location in question;
- reference to the position of a specific character from the location given in the first step.
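The two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the hub document is a small hypothetical XML stand-in for an SGML original, only element children are counted when descending the tree, and character positions are 1-based and inclusive. The function name `resolve` and the sample document are not part of the CES.

```python
import xml.etree.ElementTree as ET

# Hypothetical hub document (XML used as a stand-in for SGML).
HUB = "<doc><head/><body><p>L'usine, qui devrait</p></body></doc>"


def resolve(root, child_path, first_char, last_char):
    """Return the text addressed by a tree path plus a character range.

    Step 1: descend from the root, taking the designated child at each level
            (1-based, element children only: an assumption of this sketch).
    Step 2: select characters first_char..last_char (1-based, inclusive)
            from the text content of the element reached in step 1.
    """
    node = root
    for step in child_path:
        node = list(node)[step - 1]
    text = "".join(node.itertext())
    return text[first_char - 1:last_char]


root = ET.fromstring(HUB)
first_token = resolve(root, (2, 1), 1, 2)
second_token = resolve(root, (2, 1), 3, 7)
```

Because the hub document is only read, never modified, this style of addressing leaves the original untouched.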
Beyond this, it is necessary to choose between TEI and HyTime formats. TEI locators have the advantage that they are more compact than HyTime location ladders. Additionally, the TEI notation is easily made compatible with HyTime by the use of the HyTime notloc form, in conjunction with an appropriate notation declaration. Therefore, we recommend that, in general, TEI locators be used.
As an example of the recommended addressing method, consider the following text:
<p>
L'usine, qui devrait être implantée à Eloyes (Vosges) représente
un investissement d'environ 3,7 milliards de yens. Elle fabriquera
des pièces détachées pour la filiale de Minolta en RFA.
</p>
The following two tags point to the first two words inside the <p> element:
<tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (1)"
to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (2)">
<tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (3)"
to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (7)">
The TEI notation is given in two parts:
- CHILD (2) (1) (1) (1) (2) (1) starts by default at the root of the ESIS (Element Structure Information Set) tree (corresponding to the output of the SGML parser) and descends, taking at each node the designated child (i.e., the second child of the root, the first child of this node, etc.). See TEI P3, chapter 14, "Linking, Segmentation, and Alignment" for a complete description and explanation of TEI locators.
- STRLOC (n) gives a character offset within the element referenced above.
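As a rough illustration of how the two parts might be separated by software, the following Python sketch parses only the restricted form used in this section (a CHILD path optionally followed by one STRLOC). It is not a TEI locator parser; the function name `parse_tei_locator` is hypothetical.

```python
import re

# Matches only "CHILD (n) (n) ... [STRLOC (n)]", the subset used in this section.
_LOCATOR = re.compile(r"\s*CHILD((?:\s*\(\d+\))+)\s*(?:STRLOC\s*\((\d+)\))?\s*")


def parse_tei_locator(locator):
    """Split a restricted TEI locator into (child_path, character_offset).

    child_path is a tuple of 1-based child numbers; character_offset is an
    int, or None when no STRLOC part is present.
    """
    match = _LOCATOR.fullmatch(locator)
    if match is None:
        raise ValueError(f"unsupported locator: {locator!r}")
    path = tuple(int(n) for n in re.findall(r"\((\d+)\)", match.group(1)))
    offset = int(match.group(2)) if match.group(2) is not None else None
    return path, offset


path, offset = parse_tei_locator("CHILD (2) (1) (1) (1) (2) (1) STRLOC (1)")
```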
While this notation is far more compact than the equivalent HyTime notation, it is still more bulky than desired. Therefore, we have developed a more compact notation for this kind of addressing. It is very common to need to refer to an element by indicating its position in the ESIS tree relative to the root, and to then access a particular character within that element. A compact notation such as ESIS (2.1.1.1.2.1\1), equivalent to CHILD (2) (1) (1) (1) (2) (1) STRLOC (1), accomplishes this. The resulting notation is:
<tok from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
<tok from="2.1.1.1.2.1\3" to="2.1.1.1.2.1\7">
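The correspondence between the compact notation and the TEI form is mechanical. The Python sketch below illustrates one possible conversion, assuming that the part before the backslash is a dot-separated child path and the part after it is a character offset; the function names are hypothetical.

```python
def parse_compact(locator):
    """Split a compact locator such as r"2.1.1.1.2.1\\1" into (path, offset).

    The offset is None when the locator contains no backslash part.
    """
    path_part, separator, offset_part = locator.partition("\\")
    path = tuple(int(step) for step in path_part.split("."))
    offset = int(offset_part) if separator else None
    return path, offset


def to_tei(locator):
    """Expand a compact locator into the equivalent CHILD ... STRLOC form."""
    path, offset = parse_compact(locator)
    tei = "CHILD " + " ".join(f"({step})" for step in path)
    if offset is not None:
        tei += f" STRLOC ({offset})"
    return tei


expanded = to_tei(r"2.1.1.1.2.1\1")
```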
Locators for alignment can utilize the same notation to point to locations in the documents they link. Additionally, it is necessary to indicate the documents associated with the addresses given in the from and to attributes, respectively, since these are different. For this purpose, two attributes, fromdoc and todoc, are specified to provide the names of these documents. If these attributes are given default (in SGML jargon, "#CURRENT") values, the resulting notation for the locators in the alignment file is similar to the above, since the fromdoc and todoc attributes need not be explicitly given in the tag. Note also that in most cases, a character offset is not specified for aligned data, since alignment is typically between the entire content of SGML elements (sentences, paragraphs, tokens) in the two aligned documents:
<link from="2.1.1.1.2.1" to="2.1.1.1.2.1">
5.1.1. Linking strategies and document verification
There are two ways we might attach tags to a document from another document. One way would be to attach individual tags at particular parts, in much the same way as we can create links freely to a document. This implies that the added tags cannot necessarily be parsed as a legitimate SGML document, because there are no constraints on the applied tags. In particular, the constraints that the tags cover the entire text of the document and that their containment relations form a tree are not present in this model. It is also very difficult to linearize.
Fortunately, the external markup that comprises an annotation is an alternative description of the whole document. Two key constraints can be enforced to enable parsing:
- order invariance (source text will not be reordered)
- well-formed SGML result in the annotation document (i.e., a hierarchy conforming to the element structure defined by a DTD).
The first constraint enables parsing of the several portions of a compound document in parallel, since the order of references will be the same. This should allow for sequential processing with limited memory, whereas arbitrary re-orderings are likely to require the use of some kind of location table to store the structures of the texts. Some re-ordering could be allowed, if necessary, as long as it is possible to bound the region within which such re-ordering will occur to a finite window (and thus bound the size of the symbol table needed to process the markup).
The second constraint can be enforced by making remote tags nest according to the rules of SGML, and providing a DTD for how those tags should be nested. Thus, there will be a DTD for each type of annotation information, against which each annotation document can be validated when considered in isolation.
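The order-invariance constraint lends itself to a simple sequential check. The Python sketch below is one possible illustration, assuming locators in the compact path-plus-offset notation and assuming that document order corresponds to comparing child paths step by step and then comparing character offsets. The function names are hypothetical.

```python
def locator_key(locator):
    """Turn a compact locator into a sortable (path, offset) key.

    A locator without a character offset sorts before any offset
    inside the same element.
    """
    path_part, separator, offset_part = locator.partition("\\")
    path = tuple(int(step) for step in path_part.split("."))
    return path, int(offset_part) if separator else 0


def first_out_of_order(locators):
    """Return the index of the first locator that points before its
    predecessor, or None if the sequence never moves backwards."""
    previous = None
    for index, locator in enumerate(locators):
        key = locator_key(locator)
        if previous is not None and key < previous:
            return index
        previous = key
    return None


in_order = first_out_of_order([r"1.2.1\1", r"1.2.1\5", r"1.2.1\14"])
reordered = first_out_of_order([r"1.2.1\5", r"1.2.1\1", r"1.2.2\1"])
```

Only the previous key is kept in memory, which matches the limited-memory sequential processing that the constraint is meant to permit.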
The cesAna DTD is used for segmentation and grammatical annotation, including:
- sentence boundary markup
- tokens, each of which consists of the following:
- the orthographic form of the token as it appears in the corpus
- grammatical annotation, comprising one or more sets of the following:
- the base form (lemma)
- a morpho-syntactic specification, in the EAGLES annotation style
- a corpus tag
Allowing more than one possible set of grammatical annotation enables representing data for which lexical lookup or some other morphosyntactic analysis has been performed, but which has not been disambiguated. When disambiguation has been accomplished, an optional element can be included containing the disambiguated form.
The structure of the DTD constituents is based on the overall principle that one or more "chunks" of a text may be included in the annotation document. These chunks may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. For example, it is likely that within any text, only the paragraph content will undergo morphosyntactic analysis, and titles, footnotes, captions, long quotations, etc. will be omitted or analysed separately.
Four global attributes are defined in the cesAna DTD:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the document.
- lang
- indicates that the tag's content is in the specified language. The value of the lang attribute, which should be the same as that appearing on a <language> element in the header document which describes that language, is composed of one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- wsd
- indicates that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.
Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".
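The shape of a lang value can be checked mechanically. The Python sketch below assumes lower-case codes joined by a full stop, as in the examples given for the attribute; it checks only the form of the value and does not look the codes up in ISO 639, ISO 639-2, or ISO 3166. The name `has_lang_shape` is hypothetical.

```python
import re

# Two- or three-letter language code, optionally followed by "." and a
# two-letter country code (form only; the code lists are not consulted).
LANG_SHAPE = re.compile(r"[a-z]{2,3}(?:\.[a-z]{2})?")


def has_lang_shape(value):
    """Return True if value has the form of a lang attribute value."""
    return LANG_SHAPE.fullmatch(value) is not None


checked = {value: has_lang_shape(value) for value in ("en", "eng.uk", "english")}
```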
The global attributes are defined at the top of the cesAna DTD and represented by an entity, A.ANA. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.
The top level structure of the cesAna DTD is as follows:
- <cesAna>
- a single annotation document,
containing a <docHead> element, followed by a <chunkList> element. In addition to n and id, this element has the following attributes:
- type
- indicates the type of annotation contained in the document:
- TOK contains tokenized text
- SEG contains segmentation for orthographic sentences
- LEX contains morphosyntactic information for tokens
- DISAMB contains disambiguated morphosyntactic information
(Note that when the document contains more than one type of annotation, a series of values in quotation marks can be given for the attribute, e.g., type="TOK SEG".)
- doc
- provides the location (URL, path/filename, etc.) of the file to which this annotation document is linked.
- version
- provides the version of the cesAna DTD to which this document is compliant.
- header.loc
- provides the location (URL, path/filename, etc.) of the file containing the full CES Header for this annotation document.
- <docHead>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesAna>.
- <chunkList>
- contains one or more "chunks" of annotation.
- <chunk>
- contains either a series of sentences or a series of tokens.
Attributes include:
- type
- indicates the type of data with which the chunk is associated, e.g., paragraph data, titles, etc.
- doc
- provides the location (URL, path/filename, etc.) of the file to which this chunk is linked.
- from
- provides, using the notation outlined above in section 5.1., Locators, the starting location of the chunk in the original document.
- to
- provides, using the notation outlined above in section 5.1., Locators, the ending location of the chunk in the original document. This is optional if it can be computed from the data.
- <s>
- contains a series of tokens; nested sentences may also appear.
Attributes:
- doc
- provides the location (URL, path/filename, etc.) of the file to which this sentence is linked.
- from
- provides, using the notation outlined above in section 5.1., Locators, the starting location of the sentence in the original document.
- to
- provides, using the notation outlined above in section 5.1., Locators, the ending location of the sentence in the original document. This is optional if it can be computed from the data.
- <tok>
- contains a token, consisting of its orthographic form in the original document, followed optionally by a disambiguated corpus tag and/or one or more alternative sets of morphosyntactic information associated with the token.
Attributes:
- class
- gives the class of the token (e.g., name, date, abbr, etc.)
- doc
- provides the location (URL, path/filename, etc.) of the file to which this token is linked.
- from
- provides, using the notation outlined above in section 5.1., Locators, the starting location of the token in the original document.
- to
- provides, using the notation outlined above in section 5.1., Locators, the ending location of the token in the original document. This is optional if it can be computed from the data.
- <orth>
- contains the orthographic form of the token as it appears in the original, and as it may appear in a lexicon, possibly modified by processing (e.g., a compound may appear as "in_spite_of").
- <disamb>
- contains a disambiguated corpus tag associated with the token.
- <lex>
- contains one or more alternative sets of morphosyntactic information associated with the token.
- <base>
- the base or lemmatized form for the morphosyntactic information given in the associated <msd> element.
- <msd>
- the morphosyntactic description, specified in EAGLES-compliant format.
- <ctag>
- contains the corpus tag associated with the morphosyntactic information.
- certainty
- provides the level of certainty associated with this corpus tag assignment for the token in question, usually expressed as a percentage.
The following example shows the full use of all the options provided in the cesAna DTD. This set of annotation data could be the final result after tokenization, segmentation, lexical lookup or morphosyntactic analysis, and part of speech disambiguation. All the original options for morphosyntactic class are retained here, and the disambiguated tag is provided in the <disamb> element.
Note that because the attribute doc is specified as #CURRENT, once a value has been specified for this attribute on one instance of a given element, all subsequent occurrences of that element will use this value as the default unless it is re-specified.
<!doctype cesAna PUBLIC "-//CES//DTD//cesAna//EN" []>
<cesAna doc=MyText1 header.loc="MyText.hdr">
<chunkList>
<chunk doc="MyText1" from='1.2.1\1'>
<s>
<tok doc="MyText1" class='tok' from='1.2.1\1'>
<orth>Les</orth>
<disamb>
<ctag>DMP</ctag>
</disamb>
<lex>
<base>le</base>
<msd>Da-fp--d</msd>
<ctag>DFP</ctag>
</lex>
<lex>
<base>le</base>
<msd>Da-mp--d</msd>
<ctag>DMP</ctag>
</lex>
<lex>
<base>le</base>
<msd>Pp3fpj-</msd>
<ctag>PPJ</ctag>
</lex>
<lex>
<base>le</base>
<msd>Pp3mpj-</msd>
<ctag>PPJ</ctag>
</lex>
</tok>
<tok class='tok' from='1.2.1\5'>
<orth>critères</orth>
<disamb>
<ctag>NCMP</ctag>
</disamb>
<lex>
<base>critère</base>
<msd>Ncmp-</msd>
<ctag>NCMP</ctag>
</lex>
</tok>
<tok class='tok' from='1.2.1\14'>
<orth>se</orth>
<disamb>
<ctag>PPJ</ctag>
</disamb>
<lex>
<base>se</base>
<msd>Pp3msj-</msd>
<ctag>PPJ</ctag>
</lex>
<lex>
<base>se</base>
<msd>Pp3fpj-</msd>
<ctag>PPJ</ctag>
</lex>
<lex>
<base>se</base>
<msd>Pp3fsj-</msd>
<ctag>PPJ</ctag>
</lex>
<lex>
<base>se</base>
<msd>Pp3mpj-</msd>
<ctag>PPJ</ctag>
</lex>
</tok>
<tok class='tok' from='1.2.1\17'>
<orth>basent</orth>
<disamb>
<ctag>VM3P</ctag>
</disamb>
<lex>
<base>baser</base>
<msd>Vmip3p--</msd>
<ctag>VM3P</ctag>
</lex>
<lex>
<base>baser</base>
<msd>Vmsp3p--</msd>
<ctag>VM3P</ctag>
</lex>
</tok>
<tok class='tok' from='1.2.1\24'>
<orth>sur</orth>
<disamb>
<ctag>SP</ctag>
</disamb>
<lex>
<base>sur</base>
<msd>Afpms-</msd>
<ctag>AMS</ctag>
</lex>
<lex>
<base>sur</base>
<msd>Sp</msd>
<ctag>SP</ctag>
</lex>
</tok>
...
</s>
</chunk>
</chunkList>
</cesAna>
Alternatively, if a more concise set of information is desired, the following could be provided for the first token in the example above:
<tok class='tok' from='1.2.1\1'>
<orth>Les</orth>
<lex>
<base>le</base>
<ctag>DMP</ctag>
</lex>
</tok>
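The from values in these examples follow directly from the character positions of the tokens. The Python sketch below shows one way such locators might be generated, assuming simple whitespace tokenization, 1-based inclusive character positions, and an element whose text begins with the words of the example. The function names and the element path passed in are illustrative only.

```python
from xml.sax.saxutils import escape


def tokens_with_locators(text, element_path):
    """Yield (token, from_locator, to_locator) for whitespace-separated tokens.

    Positions are 1-based and inclusive, counted in characters from the
    start of the element identified by element_path.
    """
    position = 0
    for token in text.split():
        start = text.index(token, position)  # 0-based start of this token
        position = start + len(token)        # 0-based end (exclusive)
        yield token, f"{element_path}\\{start + 1}", f"{element_path}\\{position}"


def tok_markup(text, element_path):
    """Render minimal <tok> elements carrying only the orthographic form."""
    return "\n".join(
        f"<tok from='{start}' to='{end}'><orth>{escape(token)}</orth></tok>"
        for token, start, end in tokens_with_locators(text, element_path)
    )


located = list(tokens_with_locators("Les critères se basent sur", "1.2.1"))
```

A real tokenizer would also need to handle punctuation, elision, and compounds, which this sketch ignores.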
The cesAna DTD
The annotation document containing alignment information consists entirely of links between two documents which have been aligned. Alignment may be between primary data documents or between annotation documents containing segmentation information for the aligned units (sentences, tokens etc.).
The cesAlign DTD provides the document structure which,
like that of the cesAna DTD, is based on the notion of "chunks" that correspond to parts of the document.
Four global attributes are defined in the cesAlign DTD:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the document.
- lang
- indicates that the tag's content is in the specified language. The value of the lang attribute, which should be the same as that appearing on a <language> element in the header document which describes that language, is composed of one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- wsd
- indicates that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.
Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".
The global attributes are defined at the top of the cesAlign DTD and represented by an entity, A.ALIGN. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.
The top level structure of the cesAlign DTD is as follows:
- <cesAlign>
- a single annotation document,
containing a <docHead> element, followed by a <chunkList> element. In addition to n and id, this element has the following attributes:
- type
- indicates the type of alignment:
- PAR alignment by paragraphs
- SENT alignment by orthographic sentences
- TOK alignment by tokens
- fromDoc
- provides the location (URL, path/filename, etc.) of the first file containing one set of aligned data.
- toDoc
- provides the location (URL, path/filename, etc.) of the second file containing the other set of aligned data.
- version
- provides the version of the cesAlign DTD to which this document is compliant.
- header.loc
- provides the location (URL, path/filename, etc.) of the file containing the full CES Header for this annotation document.
- <docHead>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesAlign>.
As in the cesAna DTD, <chunkList>
contains one or more occurrences of the element
<chunk>, defined as follows:
- <chunk>
- contains a series of links.
Attributes include:
- type
- indicates the type of data with which the chunk is associated, e.g., paragraph data, titles, etc.
- fromDoc
- provides the location (URL, path/filename, etc.) of the first file containing one set of aligned data.
- toDoc
- provides the location (URL, path/filename, etc.) of the second file containing the other set of aligned data.
- fromLoc
- provides, using the notation outlined above in section 5.1., Locators, the location in the document described in fromDoc that is being linked.
- toLoc
- provides, using the notation outlined above in section 5.1., Locators, the location in the document described in toDoc that is being linked.
Links may associate data of two kinds:
- data which comprises the entire content of an SGML element, such as an <s> or <tok> element;
- data which is not the entire content of an SGML element and therefore must be referenced by the method outlined in section 5.1., Locators, using a combination of ESIS tree location and character offset.
In the first instance, the markup required for the linkage can be simplified, since only one location (i.e., the location of the SGML element in the ESIS tree) need be specified for each of the targets of the link. For example:
<link fromLoc="2.1.1.1.2.1" toLoc="2.1.1.1.3.2">
In the second instance, a heavier mechanism is needed, since each target must provide a starting and ending location in each of the aligned documents. Therefore it is necessary to specify something like the following:
<xptr id=En1 doc=EN104 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
<xptr id=Fr1 doc=FR413 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
<link targets="En1 Fr1">
Note that using this mechanism, three or more files can be aligned if desired, since any number of IDs can be specified in the value field of the targets attribute on the <link> element.
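The indirection through <xptr> identifiers can be illustrated with a short Python sketch. It assumes the pointers have already been read into simple records; the class `XPtr` and the function `resolve_link` are hypothetical names, and the values are those of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XPtr:
    """One <xptr> element: an identified span in an external document."""
    id: str
    doc: str
    start: str  # value of the from attribute
    end: str    # value of the to attribute


def resolve_link(targets, pointers):
    """Return the XPtr records named in a <link> targets attribute.

    targets is a whitespace-separated list of IDs; any number of IDs
    is accepted, so more than two documents can be aligned.
    """
    by_id = {pointer.id: pointer for pointer in pointers}
    resolved = []
    for target_id in targets.split():
        if target_id not in by_id:
            raise KeyError(f"no <xptr> with id {target_id!r}")
        resolved.append(by_id[target_id])
    return resolved


pointers = [
    XPtr("En1", "EN104", r"2.1.1.1.2.1\1", r"2.1.1.1.2.1\2"),
    XPtr("Fr1", "FR413", r"2.1.1.1.2.1\1", r"2.1.1.1.2.1\2"),
]
aligned = resolve_link("En1 Fr1", pointers)
```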
The following elements appear in the cesAlign DTD:
- <link>
- a link specifying the SGML elements in two documents that have been aligned.
Attributes include:
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- fromDoc
- provides the location (URL, path/filename, etc.) of the first (primary) file containing one set of aligned data.
- toDoc
- provides the location (URL, path/filename, etc.) of the second (secondary) file containing the other set of aligned data.
- fromLoc
- provides, using the notation outlined above in section 5.1., Locators, the location in the document described in fromDoc that is being linked. This is used when the target in fromDoc is the entire contents of an SGML element.
- toLoc
- provides, using the notation outlined above in section 5.1., Locators, the location in the document described in toDoc that is being linked. This is used when the target in toDoc is the entire contents of an SGML element.
- targets
- provides the IDs of two <xptr> elements that point to the locations of the aligned data in each of the aligned files.
- certainty
- gives a value indicating the degree of certainty for establishing this link, usually in the form of a percentage.
- <xptr>
- a pointer to a location in an external file
Attributes include the global attributes, but in this case, the value of id may be the target specified on a <link> tag. The following additional attributes are defined:
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- doc
- provides the location (URL, path/filename, etc.) of the file containing the data being pointed to.
- from
- provides, using the notation outlined above in section 5.1., Locators, the starting location of the token in the original document.
- to
- provides, using the notation outlined above in section 5.1., Locators, the ending location of the token in the original document.
Note that because the attributes doc, fromDoc, and toDoc are defined as #CURRENT, once a value has been specified for one of these attributes on one instance of a given element, all subsequent occurrences of that element will use this value as the default unless it is re-specified. Therefore, verbosity can be reduced by placing all the
<xptr> elements pointing to a given document sequentially in the alignment document.
Similarly, <link> elements appearing together need only specify fromDoc and toDoc on the first appearance of <link>.
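The effect of #CURRENT defaults can be modelled with a small Python sketch. It assumes the elements have already been parsed into (name, attributes) pairs in document order and that the default is tracked separately for each element type, as described above. It models only this inheritance behaviour, not an SGML parser, and the function name is hypothetical.

```python
def apply_current_defaults(elements, current_attributes):
    """Fill in omitted #CURRENT attributes from the most recent value
    given on an earlier element of the same type.

    elements: list of (element_name, attribute_dict) in document order.
    current_attributes: names of the attributes declared #CURRENT.
    """
    last_seen = {}
    resolved = []
    for name, attributes in elements:
        filled = dict(attributes)
        for attribute in current_attributes:
            key = (name, attribute)
            if attribute in filled:
                last_seen[key] = filled[attribute]    # re-specified: new default
            elif key in last_seen:
                filled[attribute] = last_seen[key]    # omitted: inherit
        resolved.append((name, filled))
    return resolved


pointers = [
    ("xptr", {"id": "En1", "doc": "EN104"}),
    ("xptr", {"id": "En2"}),
    ("xptr", {"id": "Fr1", "doc": "FR413"}),
    ("xptr", {"id": "Fr2"}),
]
filled = apply_current_defaults(pointers, ["doc"])
```

Grouping the pointers by document, as in this input, is what allows the doc attribute to be written only once per group.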
The cesAlign DTD