MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 4.5. Version 0.1. Last modified 6 December 1995.
CES Part 4.5. The cesDoc DTD for primary data
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
This section defines the cesDoc DTD, which is used for Level 1, Level 2, and Level 3 CES-conformant encodings. The cesDoc DTD defines the required structure for marking Level 1 conformant
documents down to the paragraph level. It also defines additional elements at the sub-paragraph level which may appear, but are not required, in a Level 1 encoding, and which are used in Level 2 and Level 3 encodings.
The cesDoc DTD specifies rules which determine where the included elements may
legally appear in a document conforming to this DTD. The rules are expressed formally in the DTD for
the document, which is given at the end of the section. This section also
provides informal semantics for the use of the defined elements.
Five global attributes are defined in the cesDoc DTD:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the corpus.
- lang
- indicates that the tag's content is in the specified language. The value of the lang attribute should be the same as that appearing on a <language> element in the header document which describes that language, and is composed of one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- wsd
- indicates that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.
- rend
- provides information about rendition in an original printed version. The rend attribute may take one of the following values, although other values are also valid:
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".
The global attributes are defined at the top of the cesDoc DTD and represented by an entity, A.GLOBAL. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.
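As an illustration of how the global attributes combine on one element, the following sketch marks a paragraph carrying all five. The identifier, label, and attribute values are invented for the example, and the lang and wsd values are assumed to match <language> and <writingSystem> elements in the corresponding header.

```sgml
<!-- Hypothetical paragraph: all values below are invented for illustration -->
<p id=P12 n=12 lang=en wsd="ISO-8859-1" rend=IT>A paragraph printed
entirely in italics in the source edition.</p>
```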
For modularity and readability, the cesDoc DTD follows the TEI model of creating
element classes for groups of elements which commonly appear together in
content models. These element classes differ from the TEI's in two major
ways:
- elements are grouped into classes only on the basis of common appearance
in content models, and not on the basis of shared attributes;
- the CES element classes are much simpler, comprising a shallow
hierarchy with no common elements.
Element classes are defined in the cesDoc
DTD by declaring an entity that represents a group of elements. This entity can
then appear in the content model of some element and indicates that all of the
members of that class may appear at a common location.
The cesDoc DTD defines the following element classes (class names consistent with
similar TEI classes):
- M.INTER
- paragraph-level elements, i.e., elements which can appear inside <divN> elements at the paragraph
level, or between paragraphs
- M.PHRASE
- phrase-level elements, i.e., elements which can appear inter-mixed with PCDATA at the sub-paragraph
level
It is similarly useful to define entities that represent content models which
are frequently used in defining elements, since this makes common content models
readily apparent and simplifies modification. The content models defined by
entities in the cesDoc DTD are:
- BASE.SEQ
- the base content model for sub-paragraph level elements, including PCDATA, possibly inter-mixed with <abbr> and <num> elements.
- PHRASE.SEQ
- phrase-level elements, i.e., paragraph content, consisting of PCDATA
inter-mixed with the elements in class M.PHRASE
- PAR.SEQ
- elements that can appear at the paragraph level--i.e., in between
paragraphs, at the same level as <p>. This includes the elements in class M.INTER plus <p> and <sp>.
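The entity mechanism described above can be sketched as follows. This is an illustrative fragment only, not the declarations of the cesDoc DTD itself: the membership given for M.PHRASE is deliberately abbreviated, and the attribute types shown for A.GLOBAL are assumptions. The DTD given at the end of the section remains the authoritative definition.

```sgml
<!-- Illustrative sketch only: these are NOT the actual cesDoc declarations -->
<!ENTITY % A.GLOBAL   'id    ID     #IMPLIED
                       n     CDATA  #IMPLIED
                       lang  CDATA  #IMPLIED
                       wsd   CDATA  #IMPLIED
                       rend  CDATA  #IMPLIED'>
<!-- Assumed, abbreviated membership; the real class contains more elements -->
<!ENTITY % M.PHRASE   'date | name | hi'>
<!ENTITY % PHRASE.SEQ '(#PCDATA | %M.PHRASE;)*'>
<!ELEMENT p  - -  (%PHRASE.SEQ;)>
<!ATTLIST p  %A.GLOBAL;>
```

Declaring the class once and referring to it by entity reference means that adding an element to the class changes every content model that uses it.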
The top level structure of the cesDoc DTD is as follows:
- <cesDoc>
- a single document, either forming part of or derived from a corpus,
containing a <docHead> element, followed by either a <body> element or a <group> element. In addition to n and id, this element has the following attributes:
- type
- indicates the type of document (text, spoken data, etc.); the default is text.
- version
- provides the version of the cesDoc DTD to which this text is compliant.
- header.loc
- provides, using an entity reference, the location of the file containing the full CES Header for this text.
- <docHead>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesDoc>.
- <body>
- contains an individual text. It has the following attribute:
- complete
- specifies whether this text is complete or a
sample.
- Y in principle, all of the original has been transcribed
- N a sample of the original has been taken
- <group>
- groups together a sequence of distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc.
The <body> element may contain:
- a <group> element, grouping together a sequence of distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc.
- an optional sequence of paragraph-level elements of arbitrary length, followed by one or more elements sub-dividing the text. The body may have no structural
divisions within it at all.
Written texts exhibit a variety of different structural forms. Some have very
little organization at levels higher than the paragraphs, while others have a
complex hierarchy of parts, sections, chapters etc. Novels are divided into
chapters, newspapers into sections, reference works into articles, etc.
The following elements are used to represent textual divisions of all kinds.
They appear inside the <body> element:
- <div1>
- major subdivision of a written text, e.g. chapter.
- <div2>
- further subdivision of a written text, entirely contained within a
<div1>, e.g. section.
- <div3>
- further subdivision of a written text, entirely contained within a
<div2>, e.g. subsection.
- <div4>
- smallest possible subdivision of a written text, entirely contained within
a <div3>, e.g. sub-subsection.
The smallest recognized
subdivision of a text is tagged <div4>. Structural subdivisions
smaller than this but above paragraph level are not distinguished.
If a text has any structural subdivision, then at least those at the highest
level (<div1>) are identified. Lower levels of subdivision (i.e.
<div2>, <div3> or <div4>) may be
indicated, but are not required.
The <divN> elements have the following attributes in common:
- type
- categorises the division in some respect, e.g. as a chapter,
section etc.
- complete
- specifies whether or not this division is complete or a
sample.
- Y* the full text of the original has been transcribed
- N a sample of the original text has been taken
The n global attribute can be used to carry an identifying name or
number used within the text for a given division, for example, a chapter
number, as in the following example:
<div1 type=CHAPTER n=5>
The type attribute is used to characterize the division. A set
of precise values will be provided by EAGLES/PAROLE.
A sequence of paragraph level elements of arbitrary length may precede the
first structural subdivision at any level.
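The overall shape of a document using these elements might look like the sketch below. The text, the version value, and the type values are invented for illustration. The header.loc attribute is omitted here because it requires an entity declared for the header file.

```sgml
<!-- Hypothetical document: text and attribute values are invented -->
<cesDoc version="0.1" type=text>
<docHead>Sample of a novel, chapters 1 and 2</docHead>
<body complete=N>
<p>An introductory paragraph preceding the first division.</p>
<div1 type=chapter n=1>
<p>First paragraph of chapter one.</p>
</div1>
<div1 type=chapter n=2>
<p>First paragraph of chapter two.</p>
</div1>
</body>
</cesDoc>
```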
Below the level of text divisions, there are two general groups of elements
which may appear:
- Division head elements
- information such as section titles, bylines, etc. that often appears at the
beginning of text sections.
- Paragraph-level elements
- further division of the text, into paragraphs, etc.
The content of <divN> tags is defined to consist of one
or more division head elements (optional) followed by a sequence of
paragraph-level elements.
Division head elements include:
- <opener>
- groups together any opening material that is not a heading at the
start of a division, including in particular <dateline> and
<keywords>.
- <head>
- contains any heading, for example, the title of a section. This element
can also appear inside the <list> and <poem> elements
to mark the title of a list or poem. It can contain any
phrase-level element.
- <byline>
- contains the primary statement of responsibility given for a work on
its title page or at the head of the work, most often applicable to newspapers.
Can contain any phrase-level element plus the tag <docAuthor> for
the author's name.
The <keywords> element that appears within the opener can contain
terms and lists of terms that may appear at the beginning of a text as
identifying material.
The <dateline> element can contain untagged prose intermixed with markup for dates, times, names, addresses, abbreviations, and numbers.
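A division opening with head elements could be encoded along the following lines. The article text is invented, the type value is an assumption, and the relative order of <head>, <byline>, and <opener> shown here is also an assumption. The DTD determines the order actually permitted.

```sgml
<!-- Hypothetical newspaper article; order of head elements is an assumption -->
<div1 type=article n=3>
<head>Trade talks resume</head>
<byline>By <docAuthor>Jane Doe</docAuthor></byline>
<opener><dateline><name type=place>Geneva</name>,
<date ISO8601="1995-12-06">6 December 1995</date></dateline>
<keywords><term>trade</term></keywords></opener>
<p>Negotiators met again this week.</p>
</div1>
```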
A number of divisions of text occur at what is called the paragraph-level, since
the most common such division at this level is <p> (paragraph).
There are in addition several other elements which may appear directly within
structural divisions (that is, not nested within some other element).
- <p>
- a paragraph in a written text.
- <sp>
- contains material marked as "written to be spoken" or "written as
spoken", usually by the presence of a speaker prefix, for example in a play
script or printed interview.
- <caption>
- (1) a heading, title etc. attached to a picture or diagram (2) a "pull
quote" or other text about or extracted from a text and superimposed upon it
to draw attention to it.
- <quote>
- a quotation from some author other than that of the surrounding text,
usually either embedded or displayed.
- <poem>
- a poem, or an extract from one, embedded or quoted within a text.
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
- <figure>
- indicates the location of a graphic, illustration, or figure.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
- <note>
- any form of note, usually a footnote. This tag is used only for notes
that are a part of the original data, not notes which may be added by the
encoder, etc.
- <table>
- contains text displayed in tabular form, in rows and columns.
The paragraph-level elements are discussed in more detail in the following
sub-sections.
NB: only the <p> element is required below the division
level for minimal Level 1 CES conformance.
We distinguish between <head> elements, which can appear only at
the start of a text division and are logically associated with it (for example,
chapter titles, newspaper headlines etc.) and <caption> elements,
which are logically independent of the position they may have within a textual
division (e.g., captions attached to pictures or figures, "pull-quotes"
embedded within the text, "by-lines" identifying authorship and provenance of
a newspaper or periodical article).
The type attribute may be used to indicate the function of the caption:
- type
- categorizes the caption.
- BYLINE caption containing authorship of an article
- DISPLAY extra-textual caption (displayed box, etc.)
- ATTACHED caption describing a figure,
photograph, etc.
- UNSPEC* not specified or unknown
A caption can be placed at a point other than where it appears, so as not to
interrupt the normal flow of a text, by using it with the <ptr>
tag. See the section on Pointing and reference.
A quotation is a (usually long) extract from some other work than the text
itself which is embedded within it. It is typically set off from the paragraphs
that surround it typographically, by spacing similar to that for paragraphs (e.g., white space before and after). It
may contain paragraphs, s-units, dialogue (marked with <q>) or any
other phrase-level element.
In the CES, the use of the <quote> tag is sharply distinguished
from that of the <q> tag, which is used to mark quoted material
such as dialogue that can be considered to be inside a paragraph.
The <sp> element is used to mark parts of a written text which are
intended to be spoken, for example the speeches in a dramatic text, or which
comprise the transcription of a speech, interview, debates, etc. typically
intended for publication (i.e., which have been transcribed to be read as
text). Such parts are generally readily identifiable by the use of conventions
such as speaker prefixes (the label supplying the name of the speaker) and
stage directions. The <sp> element takes the following attribute:
The <sp> element contains:
- <speaker>
- contains the speech prefix used in the original source to identify the
speaker of a passage written to be spoken.
- <stage>
- contains any kind of stage direction within a dramatic text.
- type
- indicates the kind of stage direction.
The <sp> element is not intended to identify speaker turns
in a spoken text, i.e. one which has been transcribed from audio tape. The
<sp> element is used only for speaker turns identified as such in
a written text.
The <speaker> element is used to tag a label or prefix identifying
the speaker or speakers, and is followed by a sequence of paragraphs.
The <stage> element, when it appears, will normally be relocated
to the end of a paragraph in which it occurs. The <ptr> element can be used
to indicate its original position; see the section on Pointing and reference.
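A short exchange from a play script might be marked as in the sketch below. The dialogue is invented, the type value on <stage> is an assumption, and placing the relocated <stage> directly after the paragraph rather than inside it is likewise an assumption.

```sgml
<!-- Hypothetical play script; the type value on <stage> is an assumption -->
<sp>
<speaker>Anna</speaker>
<p>Who left the door open?</p>
<stage type=entrance>Enter Boris, carrying a lantern.</stage>
</sp>
<sp>
<speaker>Boris</speaker>
<p>I did. The lamp had gone out.</p>
</sp>
```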
Poems or fragments of verse or song may appear between paragraphs. Where they
are distinguished from the surrounding text, they are marked using the
<poem> element, which contains an optional series of <head>
elements followed by one or more <lg> or <l> (for
line) elements. The <l> element is used to mark metrical lines, rather than
typographic lines:
- <lg>
- groups verse lines (marked by <l>), most often into stanzas.
Use the type attribute to identify the reason for the grouping.
- <l>
- a line of verse.
- part
- indicates whether the verse line is metrically complete.
- U* metricality is not known or inapplicable
- Y the line is metrically complete
- N the line is metrically incomplete
Note that the <lg> element may be recursively nested, in order to provide for sub-groupings of lines. In this case, the n attribute should be used to indicate the nesting level (e.g., n=1 for outer level, n=1.1 for nested sub-level, etc.); see the section on 4.5.9. Reference systems.
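Nested line groups with the n attribute recording the nesting level might look like the following sketch. The verse is invented and the type values are assumptions. The last line is marked part=N to show a metrically incomplete line.

```sgml
<!-- Hypothetical verse; type values are assumptions -->
<poem>
<head>Evening Song</head>
<lg type=stanza n=1>
<lg type=couplet n=1.1>
<l part=Y>The lamps are lit along the quay,</l>
<l part=Y>the boats come home across the bay.</l>
</lg>
<lg type=couplet n=1.2>
<l part=Y>The gulls fall silent one by one,</l>
<l part=N>and now the day</l>
</lg>
</lg>
</poem>
```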
A list consists of an optional <head> element, followed by one or
more <item> elements, each of which may optionally be prefixed by
a <label> element:
- <item>
- an item within a list.
- <label>
- an enumerator or other label attached to a list item. Lists may or
may not be marked. Where marked, they may appear within or between
paragraphs.
The <label> element is used to hold the identifier
or tag sometimes attached to a list item, for example "(a)", or a word or
phrase used for a similar purpose.
However, note that for the purposes of corpus-based work, it is usually preferable to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.
The <item> element may appear only inside lists. It contains the
same elements as a paragraph, and may therefore contain one or more nested
lists.
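The two ways of recording list labels can be compared in the sketch below, which encodes the same invented list twice: first with explicit <label> elements, then with the labels carried in the n attribute as recommended for corpus-based work.

```sgml
<!-- Hypothetical list, first with explicit labels -->
<list>
<head>Conditions of entry</head>
<label>(a)</label><item>a valid ticket</item>
<label>(b)</label><item>proof of age</item>
</list>
<!-- The same list with labels recorded in the n attribute instead -->
<list>
<head>Conditions of entry</head>
<item n="(a)">a valid ticket</item>
<item n="(b)">proof of age</item>
</list>
```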
Figures are marked with the following tag, which enables a reference to a
stored image in another file:
- <figure>
- indicates the location of a graphic, illustration, or figure.
- entity
- names the external entity within which the graphic image of
the figure is stored.
- value
- the entity name of some external entity declared either as a
SUBDOC entity or as an entity using a non-SGML notation.
The
<figure> element contains an optional <head> element
for the figure title or heading, followed by an optional sequence of paragraphs
for commentary or caption, an optional <figDesc> element,
and an optional <body> element for including the graphic
itself, where desired. The <figure> element can be empty, serving
only to mark the presence of a figure in the text.
- <figDesc>
- contains a brief prose description of the appearance or content of a
graphic figure, for use when documenting an image without displaying
it.
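A figure with a heading, a commentary paragraph, and a prose description might be encoded as sketched here. The content is invented, and FIG3 is a hypothetical entity name assumed to be declared elsewhere for the image file. The second element shows the empty form that only records the presence of a figure.

```sgml
<!-- Hypothetical figure; FIG3 is assumed to be an external entity
     declared elsewhere for the image file -->
<figure entity=FIG3>
<head>Figure 3. Rainfall by month</head>
<p>Monthly totals for the survey period.</p>
<figDesc>Bar chart with twelve bars, one per month, the tallest
in November.</figDesc>
</figure>
<!-- An empty figure that only marks where an illustration occurred -->
<figure></figure>
```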
Annotations and bibliographic citations or references are marked using the
following elements:
- <note>
- any form of note, usually a footnote. This tag marks only notes that
are a part of the original text, not notes that may be added by the encoder,
etc.
- place
- for a written text, specifies the location of an original
note in the source text.
- FOOT note at foot of page.
- END note at end of current division or text.
- SIDE note in left or right margin.
- UNSPEC* placement unknown or unspecified.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
Original notes may contain paragraphs, s-units, dialogue, and any
other phrase-level element. The global n attribute can be used to indicate the value of a numbered note.
Like captions, notes are often moved from their location in the original data and placed at another point so as not to
interrupt the normal flow of a text, by using the <ptr>
tag as follows
(see the section on Pointing and reference):
Here is a text, with a "1" at the end for a
footnote. [1].
<<Then, this note appears at
this point in the original.>>
But we would like to keep the text together.
This can be encoded as
<p>Here is a text.
<ptr target=N1 n=1 rend=bracketed>
But we would like to keep the text together.</p>
<note id=N1 place=foot>Then, this note appears at
this point in the original.</note>
Bibliographic citations or references within running texts are marked using the
<bibl> element, which can contain any phrase-level element plus
the <author> element.
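A citation could be marked as in the following sketch. The surrounding paragraph is invented, and placing <bibl> between paragraphs rather than inside one is an assumption based on its listing among the paragraph-level elements.

```sgml
<!-- Hypothetical citation; its placement between paragraphs is an assumption -->
<p>The argument of this chapter follows an earlier work closely.</p>
<bibl><author>George Orwell</author>, <title>Nineteen Eighty-Four</title>,
<date ISO8601="1949">1949</date>.</bibl>
```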
The <table> element is used to include tables in the text. It takes the attributes:
- rows
- indicates the number of rows in the table.
- cols
- indicates the number of columns in the table.
The cesDoc DTD also includes tags for marking sub-paragraph-level elements. Marking sub-paragraph elements is not required for Level 1 documents, but some are required for Level 2 and Level 3 documents.
Certain phrase-level elements are commonly tagged in the early stages of the
markup process, since they are signalled by the typography in legacy data or in
printed versions serving as the copy. It is therefore desirable to provide some
guidance for the inclusion of sub-paragraph markup in Level 1 documents.
The phrase-level elements that are provided for in the cesDoc DTD are selected on
the basis of their relevance for corpus-based work. There are five main
categories of phrase-level elements:
- elements of linguistic interest;
- elements indicating editorial changes to the original text;
- the <hi> element for marking typographically distinct words or
phrases, especially when the purpose of the highlighting is not yet determined;
- elements for identifying s-units (typically orthographic sentences) and
quoted dialogue;
- elements for pointing and reference.
The cesDoc DTD imposes a relatively strict structure on sub-paragraph elements,
intended to disallow options and impose a structure which is most suited to the
needs of corpus-handling tools. Adherence to this structure for Level 1
documents is recommended, but not required.
There have been two main defining forces behind the choice of elements:
- the needs of corpus-annotation tools, such as morpho-syntactic taggers,
whose performance can often be improved by pre-identification of elements such
as names, addresses, title, dates, measures, foreign words and phrases, etc.
- the need to identify objects which have intrinsic linguistic interest, or
are often useful for the purposes of translation, text alignment, etc., such as
abbreviations, names, terms, linguistically distinct words and phrases,
etc.
The phrase-level elements identifying linguistically relevant elements are:
- <abbr>
- contains an abbreviation of any sort; expansion may be given in the
expan attribute. Consult Handling Punctuation for guidelines for encoding abbreviations.
- <date>
- contains a date in any format, with ISO 8601 normalized form given in
the ISO8601 attribute.
- <measure>
- contains a number, word, or phrase indicating a quantity. The type
attribute differentiates currency, weight, count, length, area, volume, etc.
For currencies, the ISO 4217 codes for currency representation can be given in
the value attribute.
- <name>
- contains a proper noun or noun phrase. Attributes can indicate its
type.
See Encoding Names.
- <num>
- contains a number, written in any form, with normalized value in the
value attribute.
- <term>
- contains a single-word, multi-word or symbolic designation which is
regarded as a technical term.
- <time>
- contains a phrase defining a time of day in any format, with ISO 8601
normalized form given in the ISO8601 attribute.
- <distinct>
- identifies a word or phrase regarded as linguistically distinct (e.g.,
archaic, technical, dialect, etc.).
- <foreign>
- identifies a word or phrase as belonging to some language other than
that of the surrounding text. The lang attribute indicates the language.
- <mentioned>
- marks words or phrases mentioned, not used.
- <title>
- contains the title of a work, whether article, book, journal, or
series, including any alternative titles or subtitles.
<abbr>
and <num> may contain only PCDATA. The remaining elements may
contain PCDATA as well as the <abbr> and <num> elements. Abbreviations and numbers are frequently
identified and tagged automatically, and therefore their placement must be
relatively free.
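Several of these elements are combined in the sketch below. The sentence is invented. The attribute names are those given in the descriptions above, while the specific values are assumptions. Note that the title "Dr." is kept outside the <name> element.

```sgml
<!-- Hypothetical sentence; attribute values are invented for illustration -->
<p>On <date ISO8601="1995-12-06">6 December 1995</date> at
<time ISO8601="14:30">half past two</time>,
<abbr expan="Doctor">Dr.</abbr> <name type=person>Mary Smith</name> paid
<measure type=currency value=USD>twenty dollars</measure> for
<num value=3>three</num> copies of <title>Il Principe</title>,
a work she called <foreign lang=it>bellissimo</foreign>.</p>
```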
In general it is not desirable to mark typographic features of a given printing of a text in texts designated for use in corpus-based research. However, there are circumstances under which it is desirable to retain this information. In particular, certain items of linguistic interest may be marked by typography in the original; e.g., linguistic emphasis and foreign words are often rendered in italics. In addition, some applications (e.g., machine translation which attempts to reproduce the format of the original) demand retaining the rendition information.
In the process of up-translation from legacy data, a first step is often to translate relevant typographic information into SGML, with no attempt to interpret the significance of the rendering (e.g., that the italics signify a foreign word). Interpretation is often too costly because it is ambiguous (e.g., italics signify not only foreign words, but also emphasis, titles, etc.). In such cases the
<hi>
element can be used.
- <hi>
- marks a word or phrase as graphically distinct from the surrounding
text, for reasons concerning which no claim is made. The rend attribute
should provide the original rendition information when its function has not yet
been determined.
- rend
- describes the rendition or presentation of the highlighted item.
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
Note: Several values from the list may be specified where appropriate,
separated by spaces, e.g., "ro it".
When the <hi> tag is used, no claim about the reason is made. This may be the case in a Level 1 encoding, since determining the reasons for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a title, etc.) demands human intervention and is therefore too costly in the early stages of up-translation. Note that typographically highlighted phrases and the kind of highlighting used may be recorded in one of two ways:
- using the global rend attribute
- using the <hi> element with a rend attribute
The first method specifies an attribute on some element which contains all of and only the highlighted phrase. In this case, the function of the highlighting is clear (for example, to mark a heading), and the boundaries of the highlighted phrase therefore coincide with the boundaries of some other element. The rend attribute is given on the tag for that element, for example
<head rend=bo>The world beyond</head>
The second method inserts a new tag indicating that what it contains is highlighted. It is used
- when the function of the highlighting is not clear;
- where there is no tag identifying the feature concerned;
- where the highlighted phrase is not co-terminous with some other element.
The rend attribute must be supplied on the <hi> element. The rend attribute is optional on all other elements.
Both the start and end tag for any SGML element must be contained within the start and end tag of any of its ancestors in the tree for that document. Since by definition <hi> elements can appear only within <p> elements, this means that where, for example, an italicized passage contains more than one paragraph or starts within a paragraph and spans one or more others, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next. For example, an italicized passage which crosses a <p> boundary must be tagged as follows:
<p>This is the start of a paragraph which <hi rend=it>switches to
italics here and then goes on for several paragraphs</hi></p>
<p><hi rend=it>This second paragraph is all in italics and so has
no "hi" tag</hi></p>
<p><hi rend=it>This is the last bit of italics</hi> and the rest is
in roman.</p>
That is, the <hi> element is closed before the end of the first paragraph and re-opened at the start of the next. Note that the following encoding is not acceptable:
<p>This is the start of a paragraph which <hi rend=it>switches to
italics here and then goes on for several paragraphs</hi></p>
<p rend=it>This second paragraph is all in italics and so has
no "hi" tag</p>
<p><hi rend=it>This is the last bit of italics</hi> and the rest is
in roman.</p>
This second encoding mixes different styles of marking the same feature for a given span of text, which will cause problems for retrieval.
The following tags are used to mark editorial changes:
- <corr>
- contains the correct form of a passage apparently erroneous in the copy
text.
- sic
- gives the original form
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
- <gap>
- indicates a point where material has been omitted in a transcription,
whether for editorial sampling practice, or because the material is
illegible.
- desc
- describes the omitted text
- reason
- gives the reason for the omission (sampling, illegible, etc.)
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
- <reg>
- contains text which has been regularized or normalized in some sense.
- orig
- gives the original form
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
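The editorial elements might be used as in the following sketch. The text is invented, ED1 is a hypothetical editor identifier, the cert value is an assumption since no value set is listed here, and <gap> is assumed to be an empty element marking a point in the text.

```sgml
<!-- Hypothetical corrections; resp and cert values are assumptions,
     and <gap> is assumed to be an empty element -->
<p>The <corr sic="goverment" resp=ED1 cert=high>government</corr>
published its <reg orig="programme" resp=ED1>program</reg> in full.
<gap desc="statistical appendix" reason=sampling resp=ED1></p>
```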
The segmentation of texts into s-units, or orthographic sentences, is
usually accomplished by special tools. The results of such segmentation are, in
the CES model, considered as a type of annotation and stored in a separate
file, which has advantages for ease of processing. However, in some cases it is
desirable to mark s-units and/or quoted dialogue in the primary data. We
therefore provide mechanisms for marking these elements.
In some cases only quoted dialogue is marked in the primary data, because the
identification of quoted dialogue can be accomplished automatically (by
detecting quotation marks etc.).
- <s>
- identifies an s-unit within a document, typically an orthographic
sentence.
- <q>
- contains a quoted dialogue appearing inside a paragraph.
When s-units are tagged, no
split should be made between a colon or semi-colon followed by a word beginning
with a capital initial (unless there is an end-of-paragraph marker).
When both <s> and <q> are marked, the problem of
overlapping hierarchies can arise.
For this reason it has been necessary to allow for mutual recursive nesting of
<s> and <q> tags in the cesDoc DTD, a practice which is otherwise avoided. This allows all the following encodings:
<s><q>Indeed yes,</q>she replied.</s>
<q rend="PRE lsquo POST rsquo"><s>I know precisely what you are
feeling.</s><s>I know all about your contempt, your hatred, your
disgust.</s><s>But don't worry, I am on your
side!</s></q><s>And then the flash of intelligence was
gone...
However, the CES recommends that the <p> - <s> - <q>
hierarchy be retained if possible--that is, the hierarchy of <s> elements is treated as primary, and the hierarchy of <q> elements is treated as secondary. In a case such as the one above, this can be accomplished by breaking the quotes and using the next and prev attributes together
with the global id attribute to associate the fragments, as follows:
<s><q id=q1 type=part next=q2>I know precisely what you are
feeling.</q></s> <s><q id=q2 type=part prev=q1
next=q3>I know all about your contempt, your hatred, your
disgust.</q></s><s><q id=q3 type=part prev=q2>But don't
worry, I am on your side!</q></s> <s>And then the flash of
intelligence was gone...
In the following case, this method solves the problem of overlapping
hierarchies:
<p><s>According to the visiting leader, the economy of the country is
<q id=q1 type=part next=q2>better than ever.</q></s> <q
id=q2 type=part prev=q1><s>It is in fact in very good
shape.</s></q></p>
NOTE: The strategy that retains the <p> - <s> - <q>
hierarchy is required for Level 3 conformance.
References in the text which refer to another part of it can be tagged
with
- <ref>
- a reference to another location in the current document, in terms of
one or more identifiable elements, possibly modified by additional text or
comment.
In some cases it is desirable to move an element to another
location in the encoded text. This is common for footnotes which occur in-line
in the electronic text, but which appear as footnotes, endnotes, etc. in a
printed version. It is also common for captions, figures, bibliographic citations, and stage directions.
- <ptr>
- a pointer to another location in the current document in terms of one
or more identifiable elements.
Example:
The note in the following example originally appeared at the location of the <ptr> tag:
The <name type=org>Ministry of Truth</name>, —
<name type=org lang=ns>Minitrue</name>, in
<name>Newspeak</name><ptr target=N1 rend=asterisk>
— was startlingly different from any other object in sight...</p>
<note place=foot id=N1><name>Newspeak</name> was the
official language of <name type=place>Oceania</name>. For an
account of its structure and etymology see Appendix.</note>
For purposes of alignment or other reference to elements within a text, a
reference system can be built up using the id attribute on appropriate
elements.
We recommend the following strategy:
- supply a unique identifying label in the id attribute of the
<body> tag
- for each nested division, give each unit an identifier which is built up
by successively adding to the identifier of the text; for example
<body id=ORW1>
<div1 type=part id=ORW1.1>
<div2 type=chapter id=ORW1.1.1>
<div3 type=section id=ORW1.1.1.1>
</div3>
</div2>
</div1>
</body>
- for each paragraph, add another layer to the immediately superordinate
identifier, as follows:
<div2 type=chapter id=ORW1.1.1>
<p id=ORW1.1.1.p1></p>
<p id=ORW1.1.1.p2></p>
</div2>
- for each s-unit, add another layer to the superordinate identifier on the
enclosing <p> element:
<div2 type=chapter id=ORW1.1.1>
<p id=ORW1.1.1.p1>
<s id=ORW1.1.1.p1.s1></s>
</p>
</div2>
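The strategy above amounts to appending one layer to the identifier of the enclosing element at each level. A minimal Python sketch follows; the helper name child_id and its prefix parameter are hypothetical and serve only to illustrate the scheme.

```python
def child_id(parent_id, index, prefix=""):
    """Build an identifier by adding one layer to the superordinate one.

    `prefix` distinguishes unit types, e.g. "p" for paragraphs and "s"
    for s-units; divisions use a bare number.
    """
    return f"{parent_id}.{prefix}{index}"


body = "ORW1"
div1 = child_id(body, 1)             # first part
div2 = child_id(div1, 1)             # first chapter of that part
div3 = child_id(div2, 1)             # first section of that chapter
p1 = child_id(div3, 1, prefix="p")   # first paragraph of that section
s1 = child_id(p1, 1, prefix="s")     # first s-unit of that paragraph
print(div1, div2, div3, p1, s1)
```

Because each identifier contains the identifier of its parent, the enclosing unit of any element can be recovered by dropping the last dot-separated layer.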
When a string of characters is tagged as a name, many corpus-handling
tools treat the string as a single token (e.g. some morpho-syntactic
taggers) and do not perform additional analysis.
For English, we can state the following rules:
- Titles such as "Mr." and role names such as "Secretary" are not considered
part of a person name:
Mme. <name>Edith Cresson</name>
(or: <abbr>Mme.</abbr> <name>Edith
Cresson</name>)
President <name>Boris Yeltsin</name>
- Appositives such as "Jr." are considered part of a person name:
<name>Sammy Davis, Jr.</name>
Where these rules can be used for encoding other languages they should be
followed.
In English the possessive is formed by the addition of "'s" which is
tokenized separately, and should not be encoded as a part of the name:
<name>Winston</name>'s
Inflected forms of names (e.g., adjectival forms such as "Estonian") should
not be encoded. In languages where the possessive is formed by internal
inflection, the possessive form should not be encoded.
Punctuation is normally considered to be a separate token, and should be encoded outside the <name> tag. See the discussion in the next section.
Examples:
Jaguar is made in <name type=place>Britain</name>.
<name type=place>France</name>-based
<name type=place>U.S.</name>-<name
type=place>Japan</name> trade negotiations
- Laws, diseases, prizes, etc. named after people or saints, etc. should not
be tagged with <name type=person>.
- Street addresses, street names, and adjectival forms of place names should not
be tagged as <name type=place>.
Punctuation should be left as in the original text, except in the cases noted below.
Note that punctuation and special characters are treated by many corpus-handling
tools as separate tokens. For example, a text such as
<q>Ignorance is strength.</q>
may be tokenized as
TOKEN Ignorance
TOKEN is
TOKEN strength
TOKEN .
Full stops and ellipses
The full stop should be kept as both a part of an abbreviation and as an end-of-sentence indicator. The disambiguation of the two uses is accomplished by the marking of abbreviations and/or s-units, when such markup is provided.
Ellipses should be regularized so that the three periods are contiguous, with no spaces in between.
Full stops appearing as a part of abbreviations should not be separated from
the rest of the abbreviation string when the abbreviation is marked with the <abbr> tag, even though the full stop may serve a double
function (i.e., also signal end-of-sentence).
Example:
I'm back in the U.S.
should be tagged as
I'm back in the <abbr>U.S.</abbr>
even though the period is both part of the abbreviation and a signal of
end-of-sentence.
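The regularization of ellipses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats exactly three full stops separated by spaces or tabs on one line as an ellipsis, and the function name regularize_ellipses is our own.

```python
import re

# Three full stops separated by optional spaces or tabs, e.g. ". . ."
SPACED_ELLIPSIS = re.compile(r"\.[ \t]*\.[ \t]*\.")


def regularize_ellipses(text):
    """Make the three periods of an ellipsis contiguous."""
    return SPACED_ELLIPSIS.sub("...", text)


print(regularize_ellipses("the flash of intelligence was gone. . ."))
```

Abbreviations such as "U.S." are left untouched, since their full stops are separated by letters rather than by white space.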
Hyphens and dashes
Line-end (soft) hyphens should be removed where they are not part of the
regular spelling of the word. In cases of doubt, guidance should be
sought elsewhere in the same text or in dictionaries. If doubt still
remains, a hyphen should be retained rather than removed.
Dashes are marked by an entity reference (&mdash;). No
distinction should be made between different types of dashes.
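As an illustration, dashes of different kinds can be mapped to the single entity reference with a substitution such as the following Python sketch. Which input characters count as dashes is an assumption made here (en dash, em dash, horizontal bar, and runs of two or more ASCII hyphens); an actual source text may use other conventions.

```python
import re

# Assumed dash variants: en dash, em dash, horizontal bar,
# or a run of two or more ASCII hyphens.
DASH_RE = re.compile("[\u2013\u2014\u2015]|-{2,}")


def normalize_dashes(text):
    """Replace every dash variant by the entity reference &mdash;."""
    return DASH_RE.sub("&mdash;", text)


print(normalize_dashes("Minitrue \u2013 in Newspeak -- was different"))
```

Single hyphens, as in "dark-haired", are not affected.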
Apostrophes
Apostrophes should be left as they are in the original text. Note that the apostrophe can be ambiguous with the single quotation mark (e.g., in English the possessive "Joneses'"). This may be disambiguated by the marking of quotations.
Punctuation and tokens identified by the encoder
There is a small class of tags which mark the presence of tokens that have
been isolated and classified by the encoder. Among the elements included in the
cesDoc DTD, the following may be used to identify individual tokens:
<abbr>
<date>
<num>
<measure>
<name>
<term>
<time>
For many tools, when such an element is identified in the input stream, it
is not desirable to further tokenize the string inside the tag; rather, the
string inside the tag can be regarded as a single token (possibly with the type
indicated by the tag name). An element with the tag <name>, for instance,
can be assumed by lexical lookup routines and morpho-syntactic taggers to be a
single token with the grammatical category PROPER NOUN (Np). Thus,
<name type=person>Big Brother</name>
can be tokenized as
TOKEN(name) Big Brother
Similarly, the string
<date>April 4th, 1984</date>
can be tokenized as
TOKEN(date) April 4th, 1984
Therefore, punctuation that is not a part of an identified token should not appear
within the tag (except in abbreviations; see the discussion of full stops above). For example, the text
The Ministry of Love, which maintained law and order.
should be encoded as
The <name type=org>Ministry of Love</name>, which maintained law and order.
Other examples:
<name type=org>Jaguar</name> company in <name type=place>Britain</name>.
...he had been born in <date>1944</date> or
<date>1945</date>; but it...
...the three slogans of the <name
type=org>Party</name>:...
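The behaviour described above, in which an encoder-identified element is passed on as a single typed token while the remaining text is split into words and punctuation, can be sketched as follows. This Python fragment is a simplified illustration and not a description of any existing tool: it relies on regular expressions instead of an SGML parser, ignores all other tags, and does not handle nested token elements or entity references.

```python
import re

# Elements that may be treated as single tokens (from the list above).
TOKEN_TAGS = ("abbr", "date", "num", "measure", "name", "term", "time")

PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(TOKEN_TAGS) + r")\b[^>]*>(?P<content>.*?)</(?P=tag)>"
    r"|(?P<other><[^>]+>)"      # any other tag: skipped
    r"|(?P<word>\w+)"           # a run of word characters
    r"|(?P<punct>[^\w\s])",     # a single punctuation character
    re.DOTALL,
)


def tokenize(text):
    """Return (type, string) pairs; tagged tokens keep their tag name as type."""
    tokens = []
    for m in PATTERN.finditer(text):
        if m.group("tag"):
            # Normalize line breaks and repeated spaces inside the element.
            content = " ".join(m.group("content").split())
            tokens.append((m.group("tag"), content))
        elif m.group("word"):
            tokens.append(("word", m.group("word")))
        elif m.group("punct"):
            tokens.append(("punct", m.group("punct")))
    return tokens


for token in tokenize("The <name type=org>Ministry of Love</name>, which maintained law and order."):
    print(token)
```

With this approach the comma after the closing </name> tag becomes a separate punctuation token, which is why it must be encoded outside the element.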
When the <q> or <quote> tag is used, any quotation marks or other typographical
devices for indicating quoted dialogue should be removed from the text. The rend
attribute can be used to indicate how the quotation was originally marked in the
text (this is not required). In these cases, the value of the rend
attribute should be one of the following, which are consistent with entity
names in ISOpub and ISOnum:
laquo angle quotation mark, left
raquo angle quotation mark, right
lsquo single quotation mark, left
rsquo single quotation mark, right
ldquo double quotation mark, left
rdquo double quotation mark, right
lsquor rising single quote, left (low)
ldquor rising dbl quote, left (low)
rdquor rising dbl quote, right (high)
rsquor rising single quote, right (high)
mdash dash the width of lowercase m
Note that for Level 2 and Level 3 conformant encodings, quotation marks and other devices marking a quotation must be removed, since the rendition conventions for dialogue are language-specific and therefore not a part of the "content" proper.
In principle, encode punctuation as inside or outside the <q>
tag according to the position of the quotation marks in the original, as in
these examples:
- ('dealing on the free market', it was called)
(<q rend="PRE lsquo POST rsquo">dealing on the free
market</q>, it was called)
- The dark-haired girl behind Winston had begun crying out `Swine!
Swine! Swine!'
The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE lsquo
POST rsquo">Swine! Swine! Swine!</q>
- 'I am with you,' O'Brien seemed to be saying to him.
<q rend="PRE lsquo POST rsquo">I am with you,</q> <name
type=person>O'Brien</name> seemed to be saying to him.
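The replacement of quotation marks by a <q> element carrying a rend value of the form used in the examples above can be sketched in Python as follows. This is a hedged illustration only: the function name mark_quotations is our own, nested quotations are not handled, and single quotation marks are deliberately left to the encoder because they are ambiguous with the apostrophe.

```python
import re

# Entity names for the quotation characters handled by this sketch.
QUOTE_ENTITIES = {
    "\u201c": "ldquo", "\u201d": "rdquo",   # double quotation marks
    "\u00ab": "laquo", "\u00bb": "raquo",   # angle quotation marks
}
PAIRS = [("\u201c", "\u201d"), ("\u00ab", "\u00bb")]


def mark_quotations(text):
    """Replace paired quotation marks by <q rend="PRE ... POST ...">."""
    for opener, closer in PAIRS:
        pre, post = QUOTE_ENTITIES[opener], QUOTE_ENTITIES[closer]
        pattern = re.compile(
            re.escape(opener)
            + "([^" + re.escape(opener + closer) + "]*)"
            + re.escape(closer)
        )
        text = pattern.sub(
            lambda m, pre=pre, post=post:
                f'<q rend="PRE {pre} POST {post}">{m.group(1)}</q>',
            text,
        )
    return text


print(mark_quotations("\u201cI am with you,\u201d O'Brien seemed to be saying to him."))
```

Punctuation that stood inside the quotation marks in the original stays inside the <q> element, in line with the examples above.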
In cases where the <q> tag is used for text that is not
enclosed in quotation marks in the original, leave punctuation that is not a
part of the actual cited text outside the <q> tags:
- BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
<q rend=ca type=slogan><name type=person>Big
Brother</name> is watching you</q>, the caption beneath it ran.
- Never mind, it doesn't matter, he thought. ["Never mind, it doesn't
matter" in italics]
<q rend=it>Never mind, it doesn't matter</q>, he
thought.
- Eureka! he shouted. ["Eureka!" in italics]
<q rend=it>Eureka!</q> he
shouted.
Note, however, that the tokenization of the text should not be affected by
the position of the punctuation relative to the closing tag; the same set of
tokens is ultimately generated in either case.
Sentence terminating punctuation should always appear within an enclosing
set of <s> and </s> tags:
- <s><q rend=it>Eureka!</q> he
shouted.</s>
- <s>The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE lsquo POST rsquo">Swine! Swine!
Swine!</q></s>
Because tokenizers typically tokenize the text within tags such as
<hi> and <foreign> in the normal way, punctuation can appear either
inside or outside the closing tag without effect. Therefore, given this
text:
She ordered a croque monsieur. ["croque monsieur" in italics]
either of the two following encodings is acceptable:
She ordered a <foreign rend=it>croque
monsieur</foreign>.
She ordered a <foreign rend=it>croque
monsieur.</foreign>
The cesDoc DTD