CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 4.5. Version 0.1. Last modified 6 December 1995.






CES Part 4.5. The cesDoc DTD for primary data




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.



4.5. The cesDoc DTD: description

This section defines the cesDoc DTD, which is used for Level 1, Level 2, and Level 3 CES-conformant encodings. The cesDoc DTD defines the required structure for marking Level 1 conformant documents down to the paragraph level. It also defines additional elements at the sub-paragraph level which may appear, but are not required, in a Level 1 encoding, and which are used in Level 2 and Level 3 encodings.

The cesDoc DTD specifies rules which determine where the included elements may legally appear in a document conforming to this DTD. The rules are expressed formally in the DTD for the document, which is given at the end of the section. This section also provides informal semantics for the use of the defined elements.


4.5.1. Global attributes

Five global attributes are defined in the cesDoc DTD:

Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".

The global attributes are defined at the top of the cesDoc DTD and represented by an entity, A.GLOBAL. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.


4.5.2. Element classes represented by entities in the cesDoc DTD

For modularity and readability, the cesDoc DTD follows the TEI model of creating element classes for groups of elements which commonly appear together in content models. These element classes differ from the TEI's in two major ways:

Element classes are defined in the cesDoc DTD by declaring an entity that represents a group of elements. This entity can then appear in the content model of some element and indicates that all of the members of that class may appear at a common location.

The cesDoc DTD defines the following element classes (class names consistent with similar TEI classes):


4.5.3. Content models represented by entities in the cesDoc DTD

It is similarly useful to define entities that represent content models which are frequently used in defining elements: shared content models then become readily apparent, and modification is simple. The content models defined by entities in the cesDoc DTD are:


4.5.4. Top-level structure

The top level structure of the cesDoc DTD is as follows:


4.5.5. Text body

The <body> element may contain:

4.5.6. Text divisions

Written texts exhibit a variety of different structural forms. Some have very little organization at levels higher than the paragraphs, while others have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles, etc.

The following elements are used to represent textual divisions of all kinds. They appear inside the <body> element:

The smallest recognized subdivision of a text is tagged <div4>. Structural subdivisions smaller than this but above paragraph level are not distinguished.

If a text has any structural subdivision, then at least those at the highest level (<div1>) are identified. Lower levels of subdivision (i.e. <div2>, <div3> or <div4>) may be indicated, but are not required.

The <divN> elements have the following attributes in common:

The n global attribute can be used to carry an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div1 type=CHAPTER n=5>

The type attribute is used to characterize the division. A set of precise values will be provided by EAGLES/PAROLE.

A sequence of paragraph level elements of arbitrary length may precede the first structural subdivision at any level.


4.5.7. Contents of text divisions

Below the level of text divisions, there are two general groups of elements which may appear:

The content of <divN> tags is defined to consist of one or more division head elements (optional) followed by a sequence of paragraph-level elements.

Division head elements include:

The <keywords> element that appears within the opener can contain terms and lists of terms that may appear at the beginning of a text as identifying material.

The <dateline> element can contain untagged prose intermixed with markup for dates, times, names, addresses, abbreviations, and numbers.


4.5.8. Paragraph-level elements

A number of divisions of text occur at what is called the paragraph level, since the most common division at this level is <p> (paragraph). There are in addition several other elements which may appear directly within structural divisions (that is, not nested within some other element).

The paragraph-level elements are discussed in more detail in the following sub-sections.

NB: only the <p> element is required below the division level for minimal Level 1 CES conformance.



4.5.8.1. Captions

We distinguish between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines, etc.), and <caption> elements, which are logically independent of the position they may have within a textual division (e.g., captions attached to pictures or figures, "pull-quotes" embedded within the text, "by-lines" identifying authorship and provenance of a newspaper or periodical article).

The type attribute may be used to indicate the function of the caption:

A caption can be placed at a point other than where it appears, so as not to interrupt the normal flow of a text, by using it with the <ptr> tag. See the section on Pointing and reference.


4.5.8.2. Quotations

A quotation is a (usually long) extract from some work other than the text itself, embedded within that text. It is typically set off typographically from the paragraphs that surround it, by spacing similar to that for paragraphs (e.g., white space before and after). It may contain paragraphs, s-units, dialogue (marked with <q>) or any other phrase-level element.

In the CES, the use of the <quote> tag is sharply distinguished from that of the <q> tag, which is used to mark quoted material such as dialogue that can be considered to be inside a paragraph.



4.5.8.3. Spoken paragraphs

The <sp> element is used to mark parts of a written text which are intended to be spoken, for example the speeches in a dramatic text, or which comprise the transcription of a speech, interview, debate, etc., typically intended for publication (i.e., which have been transcribed to be read as text). Such parts are generally readily identifiable by the use of conventions such as speaker prefixes (the label supplying the name of the speaker) and stage directions. The <sp> element takes the following attribute:

The <sp> element contains:

The <sp> element is not intended to identify speaker turns in a spoken text, i.e. one which has been transcribed from audio tape. The <sp> element is used only for speaker turns identified as such in a written text.

The <speaker> element is used to tag a label or prefix identifying the speaker or speakers, and is followed by a sequence of paragraphs.

The <stage> element, when it appears, will normally be relocated to the end of a paragraph in which it occurs. The <ptr> element can be used to indicate its original position; see the section on Pointing and reference.



4.5.8.4. Poems

Poems or fragments of verse or song may appear between paragraphs. Where they are distinguished from the surrounding text, they are marked using the <poem> element, which contains an optional series of <head> elements followed by one or more <lg> or <l> (for line) elements; the <l> element is used to mark metrical lines, rather than typographic lines:

Note that the <lg> element may be recursively nested, in order to provide for sub-groupings of lines. In this case, the n attribute should be used to indicate the nesting level (e.g., n=1 for outer level, n=1.1 for nested sub-level, etc.); see section 4.5.10, Reference systems.



4.5.8.5. Lists

A list consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be prefixed by a <label> element:

The <label> element is used to hold the identifier or tag sometimes attached to a list item, for example "(a)", or a word or phrase used for a similar purpose. However, note that for the purposes of corpus-based work, it is usually preferable to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.

The <item> element may appear only inside lists. It contains the same elements as a paragraph, and may therefore contain one or more nested lists.



4.5.8.6. Figures

Figures are marked with the following tag, which enables a reference to a stored image in another file:

The <figure> element contains an optional <head> element for the figure title or heading, followed by an optional sequence of paragraphs for commentary or caption, an optional <figdesc> element, and an optional <body> element for including the graphic itself, where desired. The <figure> element can be empty, serving only to mark the presence of a figure in the text.



4.5.8.7. Annotations (<note> and <bibl>)

Annotations and bibliographic citations or references are marked using the following elements:

Original notes may contain paragraphs, s-units, dialogue, and any other phrase-level element. The global n attribute can be used to indicate the value of a numbered note.

Like captions, notes are often moved from their original location in the original data and placed at another point so as not to interrupt the normal flow of a text, by using the <ptr> tag as follows (see the section on Pointing and reference):

      Here is a text, with a "1" at the end for a 
      footnote. [1].
      <<Then, this note appears at 
      this point in the original.>>
      But we would like to keep the text together.

This can be encoded as

      <p>Here is a text.
      <ptr target=N1 n=1 rend=bracketed>
      But we would like to keep the text together.</p>
      <note id=N1 place=foot>Then, this note appears at 
      this point in the original.</note>
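
The relocation illustrated above can also be sketched programmatically. The following Python fragment is a minimal illustration only, assuming that note call-outs appear in the source as bracketed numbers such as [1] and that the note texts are already available in a dictionary. The function name encode_footnotes and the N-prefixed id scheme are hypothetical and are not defined by the CES.

```python
import re

# Hypothetical input convention: call-outs look like "[1]" in the running text.
_CALLOUT = re.compile(r"\[(\d+)\]")


def encode_footnotes(text, notes):
    """Replace each [n] call-out by a <ptr> and collect the <note> elements.

    `notes` maps the call-out number (as a string) to the note text.
    The notes are emitted after the paragraph so the text stays together.
    """
    note_elements = []

    def replace_callout(match):
        n = match.group(1)
        note_id = f"N{n}"  # assumed id scheme, mirroring the example above
        note_elements.append(
            f"<note id={note_id} place=foot>{notes[n]}</note>"
        )
        return f"<ptr target={note_id} n={n} rend=bracketed>"

    body = _CALLOUT.sub(replace_callout, text)
    return "<p>" + body + "</p>\n" + "\n".join(note_elements)


encoded = encode_footnotes(
    "Here is a text. [1] But we would like to keep the text together.",
    {"1": "Then, this note appears at this point in the original."},
)
print(encoded)
```

The sketch keeps the original call-out number in the n attribute and records its bracketed rendering in rend, as in the encoding shown above.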

Bibliographic citations or references within running texts are marked using the <bibl> element, which can contain any phrase-level element plus the <author> element.


4.5.8.8. Tables

The <table> element is used to include tables in the text. It takes the attributes:


4.5.9. Sub-paragraph (phrase-level) elements

The cesDoc DTD also includes tags for marking sub-paragraph-level elements. Marking sub-paragraph elements is not required for Level 1 documents, but some are required for Level 2 and Level 3 documents.

Certain phrase-level elements are commonly tagged in the early stages of the markup process, since they are signalled by the typography in legacy data or in printed versions serving as the copy. It is therefore desirable to provide some guidance for the inclusion of sub-paragraph markup in Level 1 documents.

The phrase-level elements that are provided for in the cesDoc DTD are selected on the basis of their relevance for corpus-based work. There are four main categories of phrase-level elements:

The cesDoc DTD imposes a relatively strict structure on sub-paragraph elements, intended to disallow options and impose a structure which is most suited to the needs of corpus-handling tools. Adherence to this structure for Level 1 documents is recommended, but not required.


4.5.9.1. Linguistic elements

There have been two main defining forces behind the choice of elements:

The phrase-level elements identifying linguistically relevant elements are:

<abbr> and <num> may contain only PCDATA. The remaining elements may contain PCDATA as well as the <abbr> and <num> elements. Abbreviations and numbers are frequently identified and tagged automatically, and therefore their placement must be relatively free.


4.5.9.2. Rendition information

In general it is not desirable to mark typographic features of a given printing of a text in texts designated for use in corpus-based research. However, there are circumstances under which it is desirable to retain this information. In particular, certain items of linguistic interest may be marked by typography in the original; e.g., linguistic emphasis and foreign words are often rendered in italics. In addition, some applications (e.g., machine translation which attempts to reproduce the format of the original) demand retaining the rendition information.

In the process of up-translation from legacy data, a first step is often to translate relevant typographic information into SGML, with no attempt to interpret the significance of the rendering (e.g., that the italics signify a foreign word). Interpretation is often too costly because the typography is ambiguous (e.g., italics signify not only foreign words, but also emphasis, titles, etc.). In such cases the <hi> element can be used.

Note: Several values from the list may be specified where appropriate, separated by spaces, e.g., "ro it".

When the <hi> tag is used, no claim is made about the reason for the highlighting. This may be the case in a Level 1 encoding, since determining the reasons for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a title, etc.) demands human intervention and is therefore too costly in the early stages of up-translation. Note that typographically highlighted phrases and the kind of highlighting used may be recorded in one of two ways:

The first method specifies an attribute on some element which contains all of and only the highlighted phrase. In this case, the function of the highlighting is clear (for example, to mark a heading), and the boundaries of the highlighted phrase therefore coincide with the boundaries of some other element. The rend attribute is given on the tag for that element, for example

<head rend=bo>The world beyond</head>

The second method inserts a new tag indicating that what it contains is highlighted. It is used

The rend attribute must be supplied on the <hi> element. The rend attribute is optional on all other elements. Both the start and end tag for any SGML element must be contained within the start and end tag of any of its ancestors in the tree for that document. Since by definition <hi> elements can appear only within <p> elements, this means that where, for example, an italicized passage contains more than one paragraph or starts within a paragraph and spans one or more others, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next. For example, an italicized passage which crosses a <p> boundary must be tagged as follows:

That is, the <hi> element is closed before the end of the first paragraph and re-opened at the start of the next. Note that the following encoding is not acceptable:

This second encoding mixes different styles of marking the same feature for a given span of text, which will cause problems for retrieval.
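
The rule that a <hi> element must be closed at a paragraph boundary and re-opened in the next paragraph can be sketched as follows. This Python fragment is an illustration under stated assumptions, not part of the CES: the function name mark_highlight, the representation of a highlighted span as (paragraph index, character offset) pairs, and the default rend value "it" are all chosen here for the example.

```python
def mark_highlight(paragraphs, start, end, rend="it"):
    """Wrap a highlighted span in <hi>, one <hi> element per paragraph.

    `paragraphs` is a list of plain paragraph strings. `start` and `end`
    are (paragraph_index, character_offset) pairs delimiting the span;
    the end offset is exclusive. The span may cross paragraph boundaries.
    """
    (start_par, start_off), (end_par, end_off) = start, end
    encoded = []
    for index, text in enumerate(paragraphs):
        if index < start_par or index > end_par:
            # Paragraph lies entirely outside the highlighted span.
            encoded.append(f"<p>{text}</p>")
            continue
        # Clip the span to this paragraph so <hi> never crosses </p>.
        lo = start_off if index == start_par else 0
        hi = end_off if index == end_par else len(text)
        encoded.append(
            f"<p>{text[:lo]}<hi rend={rend}>{text[lo:hi]}</hi>{text[hi:]}</p>"
        )
    return encoded


result = mark_highlight(["One two", "three four"], (0, 4), (1, 5))
for line in result:
    print(line)
```

Because the span is clipped paragraph by paragraph, every <hi> start tag and end tag falls inside the same <p> element, which is the nesting constraint described above.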


4.5.9.3. Editorial corrections

The following tags are used to mark editorial changes:


4.5.9.4. S-units and quoted dialogue

The segmentation of texts into s-units, or orthographic sentences, is usually accomplished by special tools. The results of such segmentation are, in the CES model, considered as a type of annotation and stored in a separate file, which has advantages for ease of processing. However, in some cases it is desirable to mark s-units and/or quoted dialogue in the primary data. We therefore provide mechanisms for marking these elements.

In some cases only quoted dialogue is marked in the primary data, because the identification of quoted dialogue can be accomplished automatically (by detecting quotation marks etc.).

When s-units are tagged, no split should be made after a colon or semi-colon, even when it is followed by a word beginning with a capital letter (unless there is an end-of-paragraph marker).

When both <s> and <q> are marked, the problem of overlapping hierarchies can arise. For this reason it has been necessary to allow for mutual recursive nesting of <s> and <q> tags in the cesDoc DTD, a practice which is otherwise avoided. This allows all the following encodings:

However, the CES recommends that the <p> - <s> - <q> hierarchy be retained if possible; that is, the hierarchy of <s> elements is treated as primary, and the hierarchy of <q> elements is treated as secondary. In a case such as the one above, this can be accomplished by breaking the quotes and using the next and prev attributes together with the global id attribute to associate the fragments, as follows:

In the following case, this method solves the problem of overlapping hierarchies:

NOTE: The strategy that retains the <p> - <s> - <q> hierarchy is required for Level 3 conformance.
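
The chaining of quote fragments through the id, next and prev attributes can be sketched as follows. This Python fragment is only an illustration: the function name linked_q_fragments and the dotted id scheme are assumptions made for the example, and the fragments are assumed to have been split already so that each one lies inside a single <s> element.

```python
def linked_q_fragments(fragments, id_prefix):
    """Return one <q> element per fragment, chained with next and prev.

    Each fragment receives an id built from `id_prefix` and its 1-based
    position; prev points at the preceding fragment and next at the
    following one, so the broken quotation can be reassembled.
    """
    ids = [f"{id_prefix}.{i}" for i in range(1, len(fragments) + 1)]
    elements = []
    for i, text in enumerate(fragments):
        attributes = [f"id={ids[i]}"]
        if i > 0:
            attributes.append(f"prev={ids[i - 1]}")
        if i < len(fragments) - 1:
            attributes.append(f"next={ids[i + 1]}")
        elements.append(f"<q {' '.join(attributes)}>{text}</q>")
    return elements


chain = linked_q_fragments(["First part.", "Second part."], "q1")
for element in chain:
    print(element)
```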


4.5.9.5. Pointing and reference

References in the text which refer to another part of it can be tagged with

In some cases it is desirable to move an element to another location in the encoded text. This is common for footnotes which occur in-line in the electronic text, but which appear as footnotes, endnotes, etc. in a printed version. It is also common for captions, figures, bibliographic citations, and stage directions.

Examples:

This can be encoded as

The note in the following example originally appeared at the location of the <ptr> tag:

4.5.10. Reference systems

For purposes of alignment or other reference to elements within a text, a reference system can be built up using the id attribute on appropriate elements.

We recommend the following strategy:

          <body id=ORW1>
            <div1 type=part id=ORW1.1>
              <div2 type=chapter id=ORW1.1.1>
                 <div3 type=section id=ORW1.1.1.1>
                 </div3>
              </div2>
            </div1>
          </body>

          <div2 type=chapter id=ORW1.1.1>
               <p id=ORW1.1.1.1.p1></p>
               <p id=ORW1.1.1.1.p2></p>
          </div2>

          <div2 type=chapter id=ORW1.1.1>
               <p id=ORW1.1.1.1.p1>
                 <s id=ORW1.1.1.1.p1.s1></s>
               </p>
          </div2>
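
The construction of such identifiers can be sketched in a few lines of Python. This is a minimal illustration, assuming that each identifier is formed by appending a dot and a 1-based position to the identifier of the enclosing element, with a letter prefix (p, s) for paragraphs and s-units as in the examples above. The function name child_id is hypothetical.

```python
def child_id(parent_id, position, prefix=""):
    """Build the id of a child element from the id of its parent.

    `position` is the 1-based position of the child among its siblings.
    `prefix` is an optional element letter such as "p" or "s".
    """
    return f"{parent_id}.{prefix}{position}"


body_id = "ORW1"
div1_id = child_id(body_id, 1)               # first part
div2_id = child_id(div1_id, 1)               # first chapter of that part
div3_id = child_id(div2_id, 1)               # first section of that chapter
p_id = child_id(div3_id, 1, prefix="p")      # first paragraph
s_id = child_id(p_id, 1, prefix="s")         # first s-unit of that paragraph
print(s_id)
```

Because every identifier contains the identifiers of its ancestors, the position of an element in the text can be read directly from its id.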


4.5.11. Encoding names

When a string of characters is tagged as a name, many corpus-handling tools treat the string as a single token (e.g. some morpho-syntactic taggers) and do not perform additional analysis.


Titles and roles

For English, we can state the following rules:

Where these rules can be used for encoding other languages, they should be followed.


Possessives and inflected forms

In English the possessive is formed by the addition of "'s", which is tokenized separately and should not be encoded as a part of the name:

<name>Winston</name>'s

Inflected forms of names (e.g., adjectival forms such as "Estonian") should not be encoded. In languages where the possessive is formed by internal inflection, the possessive form should not be encoded.


Forms of names with punctuation

Punctuation is normally considered to be a separate token, and should be encoded outside the <name> tag. See the discussion in the next section.

Examples:


Forms not to be tagged as names


4.5.12. Handling punctuation

Punctuation should be left as in the original text, except in the cases noted below.

Note that punctuation and special characters are treated by many corpus-handling tools as separate tokens. For example, a text such as

                  <q>Ignorance is strength.</q>

may be tokenized as

                      TOKEN   Ignorance 
                      TOKEN   is 
                      TOKEN   strength
                      TOKEN   .
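
A tokenization of this kind can be sketched as follows. This Python fragment is an illustration only and does not describe any particular corpus-handling tool: it assumes that markup is simply discarded and that every non-alphanumeric, non-space character forms a token of its own. The function name tokenize_plain is hypothetical.

```python
import re

_TAG = re.compile(r"<[^>]+>")         # any start or end tag
_TOKEN = re.compile(r"\w+|[^\w\s]")   # a word, or one punctuation character


def tokenize_plain(text):
    """Drop markup, then split into words and single punctuation marks."""
    return _TOKEN.findall(_TAG.sub(" ", text))


tokens = tokenize_plain("<q>Ignorance is strength.</q>")
for token in tokens:
    print("TOKEN  ", token)
```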


Full stops and ellipses

The full stop should be kept as both a part of an abbreviation and as an end-of-sentence indicator. The disambiguation of the two uses is accomplished by the marking of abbreviations and/or s-units, when such markup is provided.

Ellipses should be regularized so that the three periods are contiguous, with no spaces in between.
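
The regularization of ellipses can be sketched with a regular expression. This Python fragment is a minimal illustration, assuming that an ellipsis in the source consists of exactly three periods separated by spaces or tabs on the same line; other conventions (for example four periods) would need additional handling. The function name regularize_ellipses is hypothetical.

```python
import re

# Three periods separated by at least one space or tab each.
_SPACED_ELLIPSIS = re.compile(r"\.(?:[ \t]+\.){2}")


def regularize_ellipses(text):
    """Rewrite spaced ellipses such as ". . ." as three contiguous periods."""
    return _SPACED_ELLIPSIS.sub("...", text)


print(regularize_ellipses("Wait . . . what"))
```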

Full stops appearing as a part of abbreviations should not be separated from the rest of the abbreviation string when the abbreviation is marked with the <abbr> tag, even though the full stop may serve a double function (i.e., also signal end-of-sentence).

Example:

I'm back in the U.S.

should be tagged as

I'm back in the <abbr>U.S.</abbr>

even though the period is both part of the abbreviation and a signal of end-of-sentence.


Hyphens and dashes

Line-end (soft) hyphens should be removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed.
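
The decision procedure for line-end hyphens can be sketched as follows. This Python fragment is an illustration under stated assumptions: the set known_words stands in for whatever evidence is available (the rest of the text, or a dictionary), and the function name join_line_end_hyphens is hypothetical. When neither form is attested, the hyphen is retained, as recommended above.

```python
import re

# A word fragment, a hyphen at the end of a line, and the continuation.
_LINE_END_HYPHEN = re.compile(r"(\w+)-\n[ \t]*(\w+)")


def join_line_end_hyphens(text, known_words):
    """Rejoin words split across lines, consulting `known_words`.

    `known_words` is a set of lower-case word forms. The hyphen is
    removed only when the joined form is attested and the hyphenated
    form is not; in all other cases the hyphen is kept.
    """
    def rejoin(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        if hyphenated.lower() in known_words:
            return hyphenated          # hyphen is part of the spelling
        if joined.lower() in known_words:
            return joined              # soft hyphen: remove it
        return hyphenated              # doubt remains: retain the hyphen

    return _LINE_END_HYPHEN.sub(rejoin, text)


lexicon = {"encoding", "well-known"}
print(join_line_end_hyphens("en-\ncoding of a well-\nknown text", lexicon))
```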

Dashes are marked by an entity reference (&mdash;). No distinction should be made between different types of dashes.


Apostrophes

Apostrophes should be left as they are in the original text. Note that the apostrophe can be ambiguous with the single quotation mark (e.g., in English the possessive "Joneses'"). This may be disambiguated by the marking of quotations.


Punctuation and tokens identified by the encoder

There is a small class of tags which mark the presence of tokens that have been isolated and classified by the encoder. Among the elements included in the cesDoc DTD, the following may be used to identify individual tokens:

                      <abbr>
                      <date>
                      <num>
                      <measure>
                      <name>
                      <term>
                      <time> 

For many tools, when such an element is identified in the input stream, it is not desirable to further tokenize the string inside the tag; rather, the string inside the tag can be regarded as a single token (possibly with the type indicated by the tag name). For instance, an element with the tag <name> can be assumed by lexical lookup routines and morpho-syntactic taggers to be a single token with the grammatical category PROPER NOUN (Np). For example,

<name type=person>Big Brother</name>

can be tokenized as

TOKEN(name) Big Brother

Similarly, the string

<date>April 4th, 1984</date>

can be tokenized as

TOKEN(date) April 4th, 1984
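
The treatment of encoder-identified tokens can be sketched as follows. This Python fragment is an illustration only, not a description of any existing tool: it assumes that the seven elements listed above are never nested inside one another, that all other markup is discarded, and that remaining text is split into words and single punctuation marks. The function name tokenize and the (type, string) pair representation are hypothetical.

```python
import re

TOKEN_ELEMENTS = ("abbr", "date", "num", "measure", "name", "term", "time")

_PATTERN = re.compile(
    # A token-level element: its whole content is kept as one token.
    r"<(?P<tag>" + "|".join(TOKEN_ELEMENTS) + r")\b[^>]*>"
    r"(?P<content>.*?)</(?P=tag)>"
    r"|(?P<markup><[^>]+>)"      # any other tag: ignored
    r"|(?P<word>\w+)"            # an ordinary word
    r"|(?P<punct>[^\w\s])",      # a single punctuation character
    re.DOTALL,
)


def tokenize(text):
    """Return (type, string) pairs; tagged tokens are not split further."""
    tokens = []
    for match in _PATTERN.finditer(text):
        if match.group("tag"):
            tokens.append((match.group("tag"), match.group("content")))
        elif match.group("markup"):
            continue
        elif match.group("word"):
            tokens.append(("word", match.group("word")))
        else:
            tokens.append(("punct", match.group("punct")))
    return tokens


for kind, string in tokenize("<name type=person>Big Brother</name> is watching."):
    print(f"TOKEN({kind}) {string}")
```

Punctuation placed inside one of these elements would become part of the single token, which is why punctuation that does not belong to the token is kept outside the tag.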

Therefore, punctuation that is not a part of an identified token should not appear within the tag (except abbreviations--see below). For example, the text

The Ministry of Love, which maintained law and order.

should be encoded as

Other examples:


Punctuation and quotations

When the <q> or <quote> tag is used, any quotation marks or other typographical device for indicating quoted dialogue should be removed from the text. The rend attribute can be used to indicate the means by which the quotation was originally marked in the text (this is not required). In these cases, the value of the rend attribute should be one of the following, which are consistent with entity names in ISOpub and ISOnum:


Note that quotation marks and similar devices marking a quotation must be eliminated in Level 2 and Level 3 conformant encodings, since the rendition conventions for dialogue are language-specific and therefore not a part of the "content" proper.

In principle, encode punctuation as inside or outside the <q> tag according to the position of the quotation marks in the original, as in these examples:

In cases where the <q> tag is used for text that is not enclosed in quotation marks in the original, leave punctuation that is not a part of the actual cited text outside the <q> tags:

Note, however, that the tokenization of the text should not be affected by the position of the punctuation relative to the closing tag; the same set of tokens is ultimately generated in either case.


Punctuation in <s> tags

Sentence terminating punctuation should always appear within an enclosing set of <s> and </s> tags: