CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 4.5. Version 0.1. Last modified 6 December 1995.






CES Part 4.5. The CES DTD for primary data




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.



4.5. The CES DTD

This section defines the CES DTD, which is used for Level 1, Level 2, and Level 3 CES-conformant encodings. The CES DTD defines the required structure for marking Level 1 conformant documents down to the paragraph level. It also defines additional elements at the sub-paragraph level which may appear, but are not required, in a Level 1 encoding, and which are used in Level 2 and Level 3 encodings.

The CES DTD specifies rules which determine where the included elements may legally appear in a document conforming to this DTD. The rules are expressed formally in the DTD for the document, which is given at the end of the section. This section also provides informal semantics for the use of the defined elements.

The top level structure of the CES DTD is as follows:


4.5.1. Global attributes

Four global attributes are defined in the CES DTD:

Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".

The global attributes are defined at the top of the CES DTD and represented by an entity, A.GLOBAL. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.


4.5.2. Element classes represented by entities in the CES DTD

For modularity and readability, the CES DTD follows the TEI model of creating element classes for groups of elements which commonly appear together in content models. The CES1 element classes differ from the TEI's in two major ways:

Element classes are defined in the CES1 DTD by declaring an entity that represents a group of elements. This entity can then appear in the content model of some element and indicates that all of the members of that class may appear at a common location.

The CES DTD defines the following element classes (class names consistent with similar TEI classes):


4.5.3. Content models represented by entities in the CES DTD

It is similarly useful to define entities that represent content models which are frequently used in defining elements, since this makes common content models readily apparent and makes them simple to modify. The content models defined by entities in the CES DTD are:


4.5.4. Text body

The <body> element may contain:

4.5.5. Text divisions

Written texts exhibit a variety of different structural forms. Some have very little organization at levels higher than the paragraph, while others have a complex hierarchy of parts, sections, chapters, etc. Novels are divided into chapters, newspapers into sections, reference works into articles, etc.

The following elements are used to represent textual divisions of all kinds. They appear inside the <body> element:

The smallest recognized subdivision of a text is tagged <div4>. Structural subdivisions smaller than this but above paragraph level are not distinguished.

If a text has any structural subdivision, then at least those at the highest level (<div1>) are identified. Lower levels of subdivision (i.e. <div2>, <div3> or <div4>) may be indicated, but are not required.

The <divN> elements have the following attributes in common:

The n global attribute can be used to carry an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div1 type=CHAPTER n=5>

The type attribute is used to characterize the division. A set of precise values will be provided by EAGLES/PAROLE.

A sequence of paragraph level elements of arbitrary length may precede the first structural subdivision at any level.
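
The numbering of division levels lends itself to a mechanical check. The following Python sketch is illustrative only: the function name check_div_nesting is hypothetical, and it assumes that a <divN> may open only directly inside a <div(N-1)> (or at the top level for <div1>) and that end tags are written out explicitly. Neither assumption is stated in this section; the DTD itself remains the authority on legal nesting.

```python
import re

# Matches <div1 ...>, </div1>, ... <div4 ...>, </div4>; attributes are ignored.
DIV_TAG = re.compile(r"<(/?)div([1-4])\b[^>]*>")

def check_div_nesting(text):
    """Return True if every <divN> opens directly inside a <div(N-1)>
    (or at top level for <div1>) and every end tag matches the
    innermost open division. This nesting rule is an assumption."""
    stack = []  # levels of the currently open divisions, outermost first
    for match in DIV_TAG.finditer(text):
        closing = match.group(1) == "/"
        level = int(match.group(2))
        if closing:
            if not stack or stack[-1] != level:
                return False
            stack.pop()
        else:
            expected = stack[-1] + 1 if stack else 1
            if level != expected:
                return False
            stack.append(level)
    return not stack  # every opened division must have been closed
```

For instance, a <div2> nested in a <div1> passes this check, whereas a <div3> placed directly inside a <div1> does not.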


4.5.6. Contents of text divisions

Below the level of text divisions, there are two general groups of elements which may appear:

The content of <divN> tags is defined to consist of an optional sequence of one or more division head elements followed by a sequence of paragraph-level elements.

Division head elements include:

The <keywords> element that appears within the opener can contain terms and lists of terms that may appear at the beginning of a text as identifying material.

The <dateline> element can contain untagged prose intermixed with markup for dates, times, names, addresses, abbreviations, and numbers.


4.5.7. Paragraph-level elements

A number of divisions of text occur at what is called the paragraph level, since the most common division at this level is <p> (paragraph). There are in addition several other elements which may appear directly within structural divisions (that is, not nested within some other element).

The paragraph-level elements are discussed in more detail in the following sub-sections.

NB: only the <p> element is required below the division level for minimal Level 1 CES conformance.



4.5.7.1. Captions

We distinguish between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines, etc.), and <caption> elements, which are logically independent of the position they may have within a textual division (e.g., captions attached to pictures or figures, "pull-quotes" embedded within the text, or "by-lines" identifying authorship and provenance of a newspaper or periodical article).

The type attribute may be used to indicate the function of the caption:

A caption can be placed at a point other than where it appears, so as not to interrupt the normal flow of a text, by using it with the <ptr> tag. See the section on Pointing and reference.


4.5.7.2. Quotations

A quotation is a (usually long) extract from a work other than the text itself, embedded within that text. It is typically set off typographically from the surrounding paragraphs, by spacing similar to that used for paragraphs (e.g., white space before and after). It may contain paragraphs, s-units, dialogue (marked with <q>) or any other phrase-level element.

In the CES, the use of the <quote> tag is sharply distinguished from that of the <q> tag, which is used to mark quoted material such as dialogue that can be considered to be inside a paragraph.



4.5.7.3. Spoken paragraphs

The <sp> element is used to mark parts of a written text which are intended to be spoken, for example the speeches in a dramatic text, or which comprise the transcription of a speech, interview, debate, etc., typically intended for publication (i.e., which have been transcribed to be read as text). Such parts are generally readily identifiable by the use of conventions such as speaker prefixes (the label supplying the name of the speaker) and stage directions. The <sp> element takes the following attribute:

The <sp> element contains:

The <sp> element is not intended to identify speaker turns in a spoken text, i.e. one which has been transcribed from audio tape. The <sp> element is used only for speaker turns identified as such in a written text.

The <speaker> element is used to tag a label or prefix identifying the speaker or speakers, and is followed by a sequence of paragraphs.

The <stage> element, when it appears, will normally be relocated to the end of a paragraph in which it occurs. The <ptr> element can be used to indicate its original position; see the section on Pointing and reference.



4.5.7.4. Poems

Poems or fragments of verse or song may appear between paragraphs. Where they are distinguished from the surrounding text, they are marked using the <poem> element, which contains an optional series of <head> elements followed by one or more <lg> or <l> elements. The <l> (for line) element is used to mark metrical lines, rather than typographic lines:

<lg>
groups verse lines (marked by <l>), most often into stanzas. Use the type attribute to identify the reason for the grouping.
<l>
a line of verse.

part
indicates whether the verse line is metrically complete.

U*  metricality is not known or inapplicable
Y  the line is metrically complete
N  the line is metrically incomplete

Note that the <lg> element may be recursively nested, in order to provide for sub-groupings of lines. In this case, the n attribute should be used to indicate the nesting level (e.g., n=1 for outer level, n=1.1 for nested sub-level, etc.); see section 4.5.9, Reference systems.
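
One way to derive such n values for nested line groups is sketched below in Python. The nested-list representation of a poem and the function name number_line_groups are assumptions made purely for illustration; the CES does not prescribe how the values are produced.

```python
def number_line_groups(groups, prefix=""):
    """Yield (n_value, lines) pairs for nested line groups.

    `groups` is a list of groups; each group is a list whose members
    are either strings (verse lines) or nested lists (sub-groups).
    This in-memory representation is hypothetical.
    """
    for index, group in enumerate(groups, start=1):
        n_value = f"{prefix}{index}"
        lines = [member for member in group if isinstance(member, str)]
        yield n_value, lines
        # Sub-groups are numbered relative to their parent: 1.1, 1.2, ...
        subgroups = [member for member in group if isinstance(member, list)]
        yield from number_line_groups(subgroups, prefix=n_value + ".")
```

Each yielded n value could then be written on the corresponding <lg> start tag.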



4.5.7.5. Lists

A list consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be prefixed by a <label> element:

The <label> element is used to hold the identifier or tag sometimes attached to a list item, for example "(a)", or a word or phrase used for a similar purpose. However, note that for the purposes of corpus-based work, it is usually preferable to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.

The <item> element may appear only inside lists. It contains the same elements as a paragraph, and may therefore contain one or more nested lists.



4.5.7.6. Figures

Figures are marked with the following tag, which enables a reference to a stored image in another file:

The <figure> element contains an optional <head> element for the figure title or heading, followed by an optional sequence of paragraphs for commentary or caption, an optional <figdesc> element, and an optional <body> element for including the graphic itself, where desired. The <figure> element can be empty, serving only to mark the presence of a figure in the text.



4.5.7.7. Annotations (<note> and <bibl>)

Annotations and bibliographic citations or references are marked using the following elements:

Original notes may contain paragraphs, s-units, dialogue, and any other phrase-level element. The global n attribute can be used to indicate the value of a numbered note.

Like captions, notes are often moved from their original location in the original data and placed at another point so as not to interrupt the normal flow of a text, by using the <ptr> tag as follows (see the section on Pointing and reference):

      Here is a text, with a "1" at the end for a 
      footnote. [1].
      <<Then, this note appears at 
      this point in the original.>>
      But we would like to keep the text together.

This can be encoded as:

      <p>Here is a text.
      <ptr target=N1 n=1 rend=bracketed>
      But we would like to keep the text together.</p>
      <note id=N1 place=foot>Then, this note appears at 
      this point in the original.</note>
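
The relocation shown above can be approximated programmatically. The Python sketch below is a simplified illustration, not part of the CES: it assumes that the notes are already tagged inline as attribute-less <note> elements, the function name relocate_notes and the N1, N2, ... identifier scheme are hypothetical, and it does not generate the n, rend or place attributes used in the example.

```python
import re

# An inline note without attributes; DOTALL lets a note span several lines.
INLINE_NOTE = re.compile(r"<note>(.*?)</note>", re.DOTALL)

def relocate_notes(paragraph_text):
    """Replace each inline <note> with a <ptr> and collect the notes.

    Returns the rewritten paragraph content and the list of relocated
    <note> elements, which can then be placed after the paragraph.
    The N1, N2, ... identifier scheme is an assumption.
    """
    notes = []

    def to_pointer(match):
        note_id = f"N{len(notes) + 1}"
        notes.append(f"<note id={note_id}>{match.group(1)}</note>")
        return f"<ptr target={note_id}>"

    rewritten = INLINE_NOTE.sub(to_pointer, paragraph_text)
    return rewritten, notes
```

The pointer keeps a record of where each note originally stood, so the original layout can be restored if needed.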

Bibliographic citations or references within running texts are marked using the <bibl> element, which can contain any phrase-level element plus the <author> element.


4.5.7.8. Tables

The <table> element is used to include tables in the text. It takes the attributes:


4.5.8. Sub-paragraph (phrase-level) elements

The CES DTD also includes tags for marking sub-paragraph-level elements. Marking sub-paragraph elements is not required for Level 1 documents, but some are required for Level 2 and Level 3 documents.

Certain phrase-level elements are commonly tagged in the early stages of the markup process, since they are signalled by the typography in legacy data or in printed versions serving as the copy. It is therefore desirable to provide some guidance for the inclusion of sub-paragraph markup in Level 1 documents.

The phrase-level elements that are provided for in the CES DTD are selected on the basis of their relevance for corpus-based work. There are four main categories of phrase-level elements:

The CES DTD imposes a relatively strict structure on sub-paragraph elements, intended to disallow options and impose a structure which is most suited to the needs of corpus-handling tools. Adherence to this structure for Level 1 documents is recommended, but not required.


4.5.8.1. Linguistic elements

There have been two main defining forces behind the choice of elements:

The phrase-level elements identifying linguistically relevant elements are:

<abbr> and <num> may contain only PCDATA. The remaining elements may contain PCDATA as well as the <abbr> and <num> elements. Abbreviations and numbers are frequently identified and tagged automatically, and therefore their placement must be relatively free.


4.5.8.2. Rendition information

In general it is not desirable to mark typographic features of a given printing of a text in texts designated for use in corpus-based research. However, there are circumstances under which it is desirable to retain this information. In particular, certain items of linguistic interest may be marked by typography in the original; e.g., linguistic emphasis and foreign words are often rendered in italics. In addition, some applications (e.g., machine translation which attempts to reproduce the format of the original) demand retaining the rendition information.

In the process of up-translation from legacy data, a first step is often to translate relevant typographic information into SGML, with no attempt to interpret the significance of the rendering (e.g., that the italics signify a foreign word). Interpretation is often too costly because it is ambiguous (e.g., italics signify not only foreign words, but also emphasis, titles, etc.). In such cases the <hi> element can be used.

Note: Several values from the list may be specified where appropriate, separated by spaces, e.g., "ro it".
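
A tool reading such a value needs to treat it as a space-separated list rather than a single string. The following Python sketch shows one way to do this; the function names parse_rend and has_rendition are hypothetical, and no assumption is made about which values are legal.

```python
def parse_rend(value):
    """Split a rend attribute value into its individual values.

    Any run of white space is treated as a separator.
    """
    return value.split()

def has_rendition(value, wanted):
    """Return True if `wanted` is one of the values listed in `value`."""
    return wanted in parse_rend(value)
```

With the value "ro it" from the note above, both "ro" and "it" are reported as present.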

When the <hi> tag is used, no claim is made about the reason for the highlighting. This may be the case in a Level 1 encoding, since determining the reasons for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a title, etc.) demands human intervention and is therefore too costly in the early stages of up-translation. Note that typographically highlighted phrases and the kind of highlighting used may be recorded in one of two ways:

<ref>
a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment.

In some cases it is desirable to move an element to another location in the encoded text. This is common for footnotes which occur in-line in the electronic text, but which appear as footnotes, endnotes, etc. in a printed version. It is also common for captions, figures, bibliographic citations, and stage directions.

<ptr>
a pointer to another location in the current document in terms of one or more identifiable elements.
Examples:


Quotations

When the <q> or <quote> tag is used, any quotation marks or other typographical devices for indicating quoted dialogue should be removed from the text. The rend attribute can be used to indicate the means by which the quotation was originally marked in the text (this is not required). In these cases, the value of the rend attribute should be one of the following, which are consistent with entity names in ISOpub and ISOnum:


Note that quotation marks and similar devices marking a quotation must be eliminated in Level 2 and 3 conformant encodings, since the rendition conventions for dialogue are language-specific and therefore not a part of the "content" proper.

In principle, encode punctuation as inside or outside the <q> tag according to the position of the quotation marks in the original, as in these examples:

In cases where the <q> tag is used for text that is not enclosed in quotation marks in the original, leave punctuation that is not a part of the actual cited text outside the <q> tags:

Note, however, that the tokenization of the text should not be affected by the position of the punctuation relative to the closing tag; the same set of tokens is ultimately generated in either case.
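
This invariance can be illustrated with a deliberately naive tokenizer, sketched here in Python. The regular expressions and the function name tokenize are assumptions for illustration only; real tokenization is language-specific and considerably more involved.

```python
import re

TAG = re.compile(r"<[^>]+>")           # any start or end tag
TOKEN = re.compile(r"\w+|[^\w\s]")     # a word, or one punctuation mark

def tokenize(marked_up_text):
    """Drop the markup, then split into word and punctuation tokens."""
    plain = TAG.sub("", marked_up_text)
    return TOKEN.findall(plain)
```

Because the tags are discarded before the text is split, a full stop placed just before or just after </q> yields the same token sequence.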


Punctuation in <s> tags

Sentence-terminating punctuation should always appear within an enclosing set of <s> and </s> tags:
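
A simple consistency check for this rule can be sketched in Python. The function name punctuation_inside_s is hypothetical, and the set of terminators considered (full stop, exclamation mark, question mark) is an assumption that would need adjusting for each language.

```python
import re

# A terminator found immediately after </s> (optionally after white space)
# has been left outside the s-unit. The terminator set is an assumption.
STRAY_TERMINATOR = re.compile(r"</s>\s*[.!?]")

def punctuation_inside_s(text):
    """Return True if no sentence terminator directly follows a </s> tag."""
    return STRAY_TERMINATOR.search(text) is None
```

The check only flags terminators that directly follow an end tag; it does not verify that every sentence actually ends with one.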