MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Annex 4. Version 0.2. Last Modified 16 December 1995
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.
Contents
Elements defined in the cesHeader DTD
- <address>
- contains a postal address of the
distributor. Should include telephone, fax, and email.
- <analytic>
- contains bibliographic elements describing an item (e.g. an article or
poem) published within a monograph, journal, or periodical and not as an
independent publication.
- <annotation>
- gives information about an annotation file associated with the text.
- <annotations>
- groups information about annotation documents associated with the text.
- <author>
- in a bibliographic reference, contains the name of an author
(personal or corporate) of a work; names should be given in a canonical form,
with surnames preceding forenames.
- <availability>
- supplies information about the availability of a text, for example,
any restrictions on its use or distribution, its copyright status, etc.
- <biblFull>
- contains a bibliographic citation for a text which has been previously encoded in electronic form. This element contains the same elements as the
<fileDesc> element, and is intended to include the header of the electronic text from which the current document is derived.
- <biblNote>
- a descriptive note supplying additional information of any kind
relating to a bibliographic item described within a corpus or text header.
- <biblScope>
- defines the scope of a bibliographic reference, for example as a list
of page numbers, or a named subdivision of a larger work.
- <biblStruct>
- contains a structured bibliographic citation, in which only
bibliographic sub-elements appear and in a specified order.
- <byteCount>
- contains the count of bytes in the text.
- <catDesc>
- describes a category within the text typology, in the form of a brief
prose description.
- <catRef>
- specifies one or more defined categories within some taxonomy or text
typology.
- <category>
- gives a standard category name defined by EAGLES. List of
values to be provided.
- <cesHeader>
- contains the descriptive and declarative information making up an
"electronic title page" prefixed to every text, or to the corpus as a
whole.
- <change>
- summarizes a particular change or correction made to a particular
version of an electronic text which is shared between several
researchers.
- <classDecl>
- contains a series of <category> elements, defining the
classification codes used for texts within the corpus.
- <conformance>
- provides the CES level of conformance for the text or corpus.
- <correction>
- specifies a set of correction practices applied in creating one or more
components of the corpus.
- <creation>
- contains information about the creation of a text.
- <date>
- the date expressed as a calendar date in any format, for inclusion in <publication>, <imprint> or <change>.
- <distributor>
- gives the name of the person or institution who distributes the text or corpus.
- <edition>
- provides bibliographic details for an edition of some text.
- <editionStmt>
- contains any additional information relating to a particular version
of a text.
- <editorialDecl>
- provides details of editorial principles and practices applied
during the encoding of a text.
- <encodingDesc>
- documents the relationship between an electronic text and the source
or sources from which it was derived.
- <extent>
- provides the size of the electronic text as stored on
some carrier medium.
- <fileDesc>
- contains a full bibliographic description of the corpus itself or of a
text within it.
- <hyphenation>
- summarizes the way in which end-of-line hyphenation in a source text
has been treated in an encoded version of it.
- <imprint>
- groups information relating to the publication or distribution of a
bibliographic item.
- <item>
- specifies the nature of the change(s). One or more occurrences of
this element may appear within each <change>.
- <keyTerm>
- contains a technical term or phrase, particularly in a
list of descriptive keywords.
- <keywords>
- contains a list of keywords or phrases identifying the topic or
nature of a text, each of which is tagged as a term.
- <langUsage>
- groups information describing the languages, sublanguages, registers,
dialects etc. represented within a text.
- <language>
- characterizes a language, sublanguage, register, dialect,
etc., used within a single text.
- <monogr>
- contains bibliographic elements describing an item (e.g. a book or
journal) published as an independent item (i.e. as a separate physical
object).
- <normalization>
- specifies a set of normalization practices applied in creating one
or more components of the corpus.
- <profileDesc>
- provides further information about various aspects of a text,
specifically the language used, the situation and date of its
production, the participants and their setting, and a descriptive
classification for it.
- <projectDesc>
- describes in detail the purpose for which an electronic file
was encoded.
- <pubPlace>
- place of publication for a book, article, etc.
- <publicationStmt>
- groups information concerning the publication or distribution of the
corpus and its constituent texts.
- <publisher>
- the publisher of the corpus or text expressed as the proper name of a person, place or institution.
- <quotation>
- specifies editorial practice adopted with respect to quotation marks
in the original.
- <refsDecl>
- specifies how canonical references are constructed for this text.
- <respName>
- proper name of a person, place or institution.
- <respStmt>
- supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
- <respType>
- contains a phrase describing the nature of a person's or institution's
intellectual responsibility.
- <revisionDesc>
- is used to record details of any significant change to the
corpus.
- <samplingDecl>
- contains a prose description of the rationale and methods used in
sampling texts in the creation of the corpus.
- <segmentation>
- describes the principles according to which the text has been
segmented, for example into sentences, tone-units, graphemic strata, etc.
- <sourceDesc>
- supplies a bibliographic description of the copy text(s) from which
an electronic text was derived or generated.
- <tagUsage>
- supplies information about the usage of a specific element within the
corpus or text with which this header is associated.
- <tagsDecl>
- provides detailed information about the tagging applied to an SGML
document.
- <textClass>
- groups information which describes the nature or topic of a text in
terms of a standard classification scheme, thesaurus, etc.
- <title>
- the title of a work, including alternative titles or
subtitles.
- <titleStmt>
- groups information concerning the title of the corpus and its
constituent texts, or of an individual text.
- <transduction>
- describes the principles according to which the text has been
transduced, either in transcribing it from audio tape to written form, or in
converting from an electronic original.
- <translation>
- gives information about a translation of the text.
- <translations>
- groups information about existing translations of the text.
- <translator>
- gives the name of the translator.
- <wordCount>
- contains the count of words in the text.
- <writingSystem>
- characterizes a character set used within a single text.
- <wsdUsage>
- groups information describing the character set(s) used within a text.
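As an illustration of how <change>, <date>, and <item> fit together, the following Python sketch parses an invented header fragment and lists the items recorded for each change. The CES DTDs are SGML; the fragment is written as well-formed XML only so that Python's standard xml.etree.ElementTree parser can read it. The placement of <change> directly inside <revisionDesc>, the dates, and the item texts are assumptions made for this example and are not taken from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment, written as well-formed XML.
# Assumption: <change> sits directly inside <revisionDesc>; the glossary
# only states that <date> and <item> may appear within <change>.
HEADER_FRAGMENT = """
<revisionDesc>
  <change>
    <date>1995-12-16</date>
    <item>Corrected typographical errors in chapter 2.</item>
    <item>Added sentence boundaries.</item>
  </change>
  <change>
    <date>1995-11-02</date>
    <item>Initial conversion from the source text.</item>
  </change>
</revisionDesc>
"""


def summarize_changes(xml_text):
    """Return a list of (date, [item texts]), one entry per <change>."""
    root = ET.fromstring(xml_text)
    summary = []
    for change in root.iter("change"):
        date = change.findtext("date", default="")
        items = [item.text for item in change.findall("item")]
        summary.append((date, items))
    return summary


changes = summarize_changes(HEADER_FRAGMENT)
for date, items in changes:
    # Each <change> is expected to carry at least one <item>.
    print(date, len(items))
```

A check such as all(items for _, items in changes) can then flag any <change> recorded without an <item>.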
Elements defined in the cesDoc DTD
- <abbr>
- contains an abbreviation of any sort; expansion may be given in the
expan attribute.
- <author>
- contains the name of the author(s), personal or corporate, of a work; the
primary statement of responsibility for any bibliographic
item.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
- <body>
- contains the body of a single unitary text.
- <byline>
- contains the primary statement of responsibility given for a work on
its title page or at the head of the work, most often applicable to newspapers.
Can contain any phrase-level element plus the tag <docAuthor> for
the author's name.
- <caption>
- (1) a heading, title etc. attached to a picture or diagram (2) a "pull
quote" or other text about or extracted from a text and superimposed upon it
to draw attention to it.
- <cesDoc>
- a single document, either forming part of or derived from a corpus,
containing a <docHead> element, followed by either a <body> element or a <group> element.
- <cell>
- contains one cell in a row of a table.
- <corr>
- contains the correct form of a passage apparently erroneous in the copy
text.
- <docAuthor>
- contains the name of the author of the document, as given on the title page or at the head of the work; may only occur in a <byline>.
- <date>
- contains a date in any format, with ISO 8601 normalized form given in
the ISO8601 attribute.
- <dateline>
- contains a brief description of the place, date, time, etc., of
production of a letter, newspaper story, or other work, prefixed to it as a kind
of heading.
- <distinct>
- identifies a word or phrase regarded as linguistically distinct (e.g.,
archaic, technical, dialect, etc.).
- <div1>
- major subdivision of a written text, e.g. chapter.
- <div2>
- further subdivision of a written text, entirely contained within a
<div1>, e.g. section.
- <div3>
- further subdivision of a written text, entirely contained within a
<div2>, e.g. subsection.
- <div4>
- smallest possible subdivision of a written text, entirely contained within
a <div3>, e.g. sub-subsection.
- <docHead>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesDoc>.
- <figDesc>
- contains a brief prose description of the appearance or content of a
graphic figure, for use when documenting an image without displaying
it.
- <figure>
- indicates the location of a graphic, illustration, or figure.
- <foreign>
- identifies a word or phrase as belonging to some language other than that of the surrounding text. The lang attribute indicates the language.
- <gap>
- indicates a point where material has been omitted in a transcription,
whether for editorial sampling practice, or because the material is illegible.
- <group>
- groups together a sequence of distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc.
- <head>
- contains any heading, for example, the title of a section, heading of a
list, etc. Can contain any phrase-level element.
- <hi>
- marks a word or phrase as graphically distinct from the surrounding
text, for reasons concerning which no claim is made. The rend attribute
should provide the original rendition information when its function has not yet
been determined.
- <item>
- an item within a <list>.
- <keywords>
- contains a list of keywords or phrases identifying the topic or nature
of a text; if the keywords come from a controlled vocabulary, it can
be identified by the scheme attribute.
- <l>
- a line of verse.
- <label>
- an enumerator or other label attached to a list <item>. Lists may or
may not be marked. Where marked, they may appear within or between
paragraphs.
- <lg>
- groups verse lines, most often into stanzas. Use the type attribute
to identify the reason for the grouping.
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
- <measure>
- contains a number, word, or phrase indicating a quantity. The type
attribute differentiates currency, weight, count, length, area, volume, etc.
For currencies, the ISO 4217 codes for currency representation can be given in
the value attribute.
- <mentioned>
- marks words or phrases mentioned, not used.
- <name>
- contains a proper noun or noun phrase. Attributes can indicate its type.
- <note>
- any form of note, usually a footnote. This tag marks only notes that
are a part of the original text, not notes that may be added by the encoder,
etc.
- <num>
- contains a number, written in any form, with normalized value in the
value attribute.
- <opener>
- groups together any opening material that is not a heading at the start
of a division.
- <p>
- a paragraph in a written text.
- <poem>
- a poem, or an extract from one, embedded or quoted within a text.
- <ptr>
- a pointer to another location in the current document in terms of one
or more identifiable elements.
- <q>
- contains a quotation, usually in dialogue, appearing in a paragraph.
- <quote>
- a quotation from some author other than that of the surrounding text,
usually either embedded or displayed.
- <ref>
- a reference to another location in the current document, in terms of
one or more identifiable elements, possibly modified by additional text or
comment.
- <reg>
- contains text which has been regularized or normalized in some sense.
- <row>
- contains one row of a table, consisting of
a <cell> or additional <table>s.
- <s>
- identifies an s-unit within a document, typically an orthographic
sentence.
- <sp>
- contains material marked as "written to be spoken" or "written as
spoken", usually by the presence of a speaker prefix, for example in a play
script or printed interview.
- <speaker>
- contains the speech prefix used in the original source to identify the
speaker of a passage written to be spoken.
- <stage>
- contains any kind of stage direction within a dramatic text.
- <table>
- contains text displayed in tabular form, in rows and columns.
- <term>
- contains a single-word, multi-word or symbolic designation which is
regarded as a technical term.
- <time>
- contains a phrase defining a time of day in any format, with ISO 8601
normalized form given in the ISO8601 attribute.
- <title>
- contains the title of a work, whether article, book, journal, or
series, including any alternative titles or subtitles.
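The containment rules described above for <cesDoc> and the <div1> to <div4> hierarchy can be sketched as a small check in Python. The sample document is invented and written as well-formed XML so that the standard xml.etree.ElementTree parser can read it (the CES DTDs themselves are SGML). The file name in header.loc, the value "fr" for lang, and the reading of "entirely contained within" as "direct child of" are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample document using attributes named in the glossary:
# header.loc on <cesDoc>, expan on <abbr>, ISO8601 on <date>, lang on <foreign>.
DOC_FRAGMENT = """
<cesDoc header.loc="sample.hdr">
  <docHead>Invented sample text</docHead>
  <body>
    <div1>
      <head>Chapter One</head>
      <div2>
        <p>The <abbr expan="Centre National de la Recherche Scientifique">CNRS</abbr>
        released the draft on <date ISO8601="1995-12-16">16 December 1995</date>,
        <foreign lang="fr">bien entendu</foreign>.</p>
      </div2>
    </div1>
  </body>
</cesDoc>
"""


def misplaced_divisions(root):
    """List (child, parent) tag pairs where a divN (N > 1) is not a
    direct child of the division one level up (an assumed reading)."""
    problems = []
    for parent in root.iter():
        for child in parent:
            if child.tag in ("div2", "div3", "div4"):
                expected_parent = "div" + str(int(child.tag[3]) - 1)
                if parent.tag != expected_parent:
                    problems.append((child.tag, parent.tag))
    return problems


doc = ET.fromstring(DOC_FRAGMENT)
top_level = [child.tag for child in doc]           # <docHead>, then <body>
dates = [d.get("ISO8601") for d in doc.iter("date")]
print(top_level, dates, misplaced_divisions(doc))
```

A <div2> placed directly under <body> would be reported by misplaced_divisions as the pair ("div2", "body").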
Elements defined in the cesAna DTD
- <base>
- the base or lemmatized form for the morphosyntactic information given in the associated <msd> element.
- <cesAna>
- a single annotation document,
containing a <docHead> element, followed by a <chunkList> element.
- <chunk>
- contains either a series of sentences or a series of tokens.
- <chunkList>
- contains one or more "chunks" of annotation.
- <ctag>
- contains the corpus tag associated with the morphosyntactic information.
- <disamb>
- contains a disambiguated corpus tag associated with the token.
- <docHeader>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesAna>.
- <lex>
- contains one or more alternative sets of morphosyntactic information associated with the token.
- <msd>
- the morphosyntactic description, specified in EAGLES-compliant format.
- <orth>
- contains the orthographic form of the token as it appears in the original, and as it may appear in a lexicon, possibly modified by processing (e.g., a compound may appear as "in_spite_of").
- <s>
- contains a series of tokens; nested sentences may also appear.
- <tok>
- contains a token, consisting of its orthographic form in the original document, followed optionally by disambiguated corpus tag and/or one or more alternative sets of morphosyntactic information associated with the token.
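The token structure described above (<orth>, then an optional <disamb>, then zero or more <lex> alternatives) can be illustrated with a short Python sketch. The fragment is invented and written as well-formed XML for the standard xml.etree.ElementTree parser. The tag values (VVD, NN1, Vmis, Ncns), the grouping of <base>, <msd>, and <ctag> inside <lex>, and the use of <ctag> inside <disamb> are assumptions made for this example and are not taken from the DTD or from a real tagset file.

```python
import xml.etree.ElementTree as ET

# Invented annotation fragment. Assumptions: <disamb> wraps a <ctag>, and
# each <lex> holds one <base>, <msd>, <ctag> triple. Tag values are made up.
ANA_FRAGMENT = """
<chunk>
  <s>
    <tok>
      <orth>saw</orth>
      <disamb><ctag>VVD</ctag></disamb>
      <lex><base>see</base><msd>Vmis</msd><ctag>VVD</ctag></lex>
      <lex><base>saw</base><msd>Ncns</msd><ctag>NN1</ctag></lex>
    </tok>
    <tok>
      <orth>in_spite_of</orth>
    </tok>
  </s>
</chunk>
"""


def read_tokens(xml_text):
    """Return one dict per <tok>: orthographic form, disambiguated tag
    (None when absent), and the list of alternative readings."""
    root = ET.fromstring(xml_text)
    tokens = []
    for tok in root.iter("tok"):
        readings = [
            (lex.findtext("base"), lex.findtext("msd"), lex.findtext("ctag"))
            for lex in tok.findall("lex")
        ]
        tokens.append({
            "orth": tok.findtext("orth"),
            "disamb": tok.findtext("disamb/ctag"),
            "readings": readings,
        })
    return tokens


tokens = read_tokens(ANA_FRAGMENT)
for token in tokens:
    print(token["orth"], token["disamb"], len(token["readings"]))
```

The second token shows the minimal case: only <orth> is present, so the disambiguated tag is None and the list of readings is empty.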
Elements defined in the cesAlign DTD
- <cesAlign>
- a single annotation document,
containing a <docHead> element, followed by a <chunkList> element.
- <chunk>
- contains a series of links.
- <chunkList>
- contains one or more occurrences of the element
<chunk>.
- <docHeader>
- contains a short (one or two line) description of the text. The full header is contained in the file pointed to in the header.loc attribute on <cesAlign>.
- <link>
- a link specifying the SGML elements in two documents that have been aligned.
- <xptr>
- a pointer to a location in an external file.