MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 3. Version 0.1. Last modified 7 December 1995.
Nancy Ide and Jean Véronis
Copyright (c) Centre National de la Recherche Scientifique, 1995.
This document is only a draft and should be cited as such. Creators of
WWW documents pointing to it are warned that its content and location may change
without notice. This document is provided as is without any express or implied
warranties. While every effort has been taken to ensure the accuracy of the
information contained, the authors assume no responsibility for errors or omissions,
or for damages resulting from the use of the information contained herein.
Permission is granted to make and distribute verbatim copies of this document for
non-commercial purposes provided the copyright notice and this permission notice are
preserved on all copies.
The header provides information about the electronic text that has been encoded, including not only its title, author, etc., but also information about its encoding. The TEI header provided the first means of documenting electronic texts; it has been widely adopted and adapted for use in text and corpus encoding.
The TEI provides an in-line header that is included in the same SGML document as the encoded text. Usually, the header appears in the same file as the text, although this is not obligatory. The TEI also provides an independent header, a header without its attached text, which is intended mainly for cataloguing electronic texts.
The CES adopts the following strategy for headers:
- headers are stored independently of the text, in a central directory.
- each header describes the locations (URL, path/filename, etc.) where the text and its annotations are stored, possibly at a remote site.
- each text contains an indication of the location of its header.
- each text is prefaced by a short formulaic description (e.g., title and author) of the text, instead of the full header.
- a corpus header (see below) is stored in the same central directory as the headers for the texts in the corpus.
This strategy has the following advantages:
- parts or all of a corpus may be stored in different directories or at remote sites, while information about the component texts is retained in a single repository.
- the header can have a DTD which is different from the DTD for the text, which
  - enables a modularity that SGML does not provide, so that it is possible to define the content of elements common to the header and text (e.g., title, author, etc.) in a way which is appropriate to each context, and so that changes to the same element in one context do not affect the other.
  - in those cases where it is appropriate, enables using the TEI header with a CES conformant text.
  - can facilitate processing by corpus-handling tools, for which the header is often irrelevant, since the text can be easily handled separately.
  - conversely, enables using the CES header with an SGML-encoded text which is not CES conformant; this is advantageous in the early stages of corpus preparation, where the text may temporarily be in a freer SGML format such as Rainbow, TEI Lite, FORMEX, etc.
- with an appropriate interface (e.g., the IMS Stuttgart corpus tools) the user does not necessarily need to know where a corpus or text is stored to access it.
The CES has developed a header which is for the most part a subset of the TEI header (see TEI P3, chapter 5, "The Header",
and chapter 23, "Language Corpora"). There are the following exceptions:
- elements have been added for more precision in the specifications;
- attributes have been added to existing elements;
- attribute values have been constrained to allow only a given set of values;
- element content models are simplified, to contain either a sequence of tags in sub-categories, or plain text (PCDATA).
The CES header needs attention to determine exactly which elements and information are appropriate for corpora. We hope to develop a more constrained model with a precise template, to facilitate and regularize the creation of corpus and text headers.
Four global attributes are defined, which may appear on any element in the header:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the corpus.
- lang
- indicates that the tag's content is in the specified language. The value of the lang attribute is composed of one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- wsd
- indicates that the tag's content is encoded in the specified character set (except on <writingSystem>). The value of the attribute is the character set name (ISO-8859-1, etc.).
Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".
The global attributes are defined at the top of the CES Header DTD and represented by an entity, A.GLOBAL. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.
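As a rough illustration of the three forms permitted for the lang value, the following Python sketch checks a string against them. The function name and the regular expression are assumptions made for this illustration. The check tests only the shape of the value (letter counts and the "." separator used in the examples above), not whether the codes are actually registered in ISO 639 or ISO 3166.

```python
import re

# Shape of a lang value as described above: a two- or three-letter
# language code, optionally followed by "." and a two-letter country code.
# Assumption: codes are written in lowercase ASCII letters, as in the examples.
LANG_VALUE = re.compile(r"[a-z]{2,3}(\.[a-z]{2})?")

def is_plausible_lang(value):
    """Return True if value has the shape of a CES lang attribute value."""
    return LANG_VALUE.fullmatch(value) is not None

for candidate in ("en", "eng", "en.uk", "eng.uk", "english"):
    print(candidate, is_plausible_lang(candidate))
```

A stricter check would look the codes up in tables taken from the standards themselves; that is left out here to keep the sketch self-contained.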
Each text in the corpus (i.e. each <cesDoc> element) has its own
header, referred to as a text header. The whole corpus also has
a header, referred to as the corpus header, which contains
information applicable to the whole corpus (possibly with some local
overriding). Both corpus and text headers are represented by
<cesHeader> elements. The type attribute is used to
distinguish the two.
The root of the CES header element tree as defined by the CES header DTD is the <cesHeader> element, defined as follows:
- <cesHeader>
- contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.
- type
- specifies the kind of document to which the header is attached.
- CORPUS the header is attached to the corpus.
- TEXT* the header is attached to a single text.
- creator
- specifies the agency responsible for creating the header.
- text.loc
- provides, in an entity reference, the location (URL, path/filename, etc.) that contains the body of the associated document. This attribute is required.
- version
- specifies the version of the CES header DTD used to encode this header.
- status
- specifies the revision status of the header.
- NEW* this is the first version of the header
- UPDATE header has been updated.
- date.created
- specifies the date on which the header content was created.
- date.updated
- specifies the date on which the header content was last updated.
The <cesHeader> element contains the following four elements:
- <fileDesc>
contains a full bibliographic description of the corpus itself or of a
text within it.
- <encodingDesc>
documents the relationship between an electronic text and the source
or sources from which it was derived.
- <profileDesc>
provides further information about various aspects of a text,
specifically the language used, the situation and date of its production, the
participants and their setting, and a descriptive classification for it.
- <revisionDesc>
summarizes the revision history for a file.
These elements are tagged as follows:
<cesHeader>
<fileDesc></fileDesc>
<encodingDesc></encodingDesc>
<profileDesc></profileDesc>
<revisionDesc></revisionDesc>
</cesHeader>
The file description, represented by the <fileDesc> element, is the first of the four main constituents of the header and the only one that is required.
The file description documents the
electronic file itself, i.e. (in the case of a corpus header) the whole corpus,
or (in the case of a text header) the individual text to which the header applies.
It contains the following elements:
- <titleStmt>
groups information concerning the title of the corpus or the individual text and its
constituent texts.
- <editionStmt>
contains any additional information relating to a particular version
of a text.
- <extent>
provides the size of the electronic text as stored on
some carrier medium.
- <publicationStmt>
groups information concerning the publication or distribution of the
corpus and its constituent texts.
- <sourceDesc>
supplies a bibliographic description of the copy text(s) from which
an electronic text was derived or generated.
Further detail is given in the following subsections. Note that these elements relate only to the electronic file (the corpus text itself); bibliographic and other details of the written or spoken text from which it derives are given in the source description.
Note that the <titleStmt> describes the machine-readable file,
while the source text is specified in the <sourceDesc>. The title
in the <titleStmt> should indicate that this is a machine-readable
version and should not be identical to the title of the source text.
<titleStmt>, <publicationStmt>, and <sourceDesc> are required.
The minimal header has the following structure:
<cesHeader text.loc="/corpus/english/eng01a.tr">
<fileDesc>
<titleStmt>
<title></title>
</titleStmt>
<publicationStmt>
<distributor></distributor>
<address></address>
<availability></availability>
<date></date>
</publicationStmt>
<sourceDesc>
<biblStruct>
<monogr>
<title></title>
<author></author>
<imprint>
<pubPlace></pubPlace>
<publisher></publisher>
<date></date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
</cesHeader>
Note that if the lang or wsd attributes are used on elements in the main text, it is required to include a <profileDesc> element containing
<langUsage> (for use of lang) and/or <wsdUsage> (for use of wsd).
The <titleStmt> element consists of a <title> element followed by zero or
more <respStmt> elements. These sub-elements are used throughout
the header, wherever the title of a work or a statement of responsibility is
required.
- <title>
the title of a work, including alternative titles or
subtitles.
- <respStmt>
supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
<respStmt> in turn contains the following elements:
- <respType>
contains a phrase describing the nature of a person's or institution's
intellectual responsibility.
- <respName>
the publisher of the corpus or text expressed as the proper name of a person, place or institution.
In the corpus header, the version attribute on the <editionStmt> element is used to indicate both a version number and a revision number, in the form "version.revision", where "version" changes if texts are added to or removed from the corpus, and "revision" changes if amendments are made within texts or the corpus header.
In individual text headers, the version attribute carries only a revision number.
This tag can be empty. For example:
<editionStmt version='1'>
This element corresponds to the TEI <editionStmt>, except that its
content is an unstructured note.
This element corresponds to the TEI <extent> element in that it
describes the number of words in the whole corpus or in an individual text. It
differs in that it contains specific tags for specifying the size of the text
or corpus in terms of words and bytes.
- <extent>
describes the approximate size of the electronic text as stored on
some carrier medium, specified in words (corpus header) and additionally in Kb
(corpus texts).
The <extent> tag contains:
- <wordCount>
- contains the count of words in the text.
- <byteCount>
- contains the count of bytes in the text.
- units
- gives the unit in which the byte count is measured.
- BYTES bytes
- KB* kilobytes
- MB megabytes
- GB gigabytes
For the purposes of the word count value, a "word" is considered to be an orthographic word, i.e., a string of characters surrounded by blanks. Punctuation not surrounded by white space is not counted as a word. This sort of count can be achieved fairly simply by automatic means. If any other definition is used, it should be documented in the <wordCount> tag following the word value; e.g.,
<wordCount>45987 words; punctuation marks counted separately.</wordCount>
The <byteCount> tag gives the size of the text including its tags, in its representation as a text file encoded in an 8-bit ISO character set, which is useful for calculating media requirements or file download times.
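The two counts described above can be approximated with a few lines of Python. This is only a sketch: the function names are hypothetical, the word count is applied to whatever string it is given (the standard does not say here how markup is to be treated when counting words), and ISO 8859-1 is assumed as the 8-bit character set.

```python
def word_count(text):
    """Count orthographic words: maximal runs of non-blank characters."""
    # str.split() with no argument splits on any run of white space, so
    # punctuation attached to a word is counted as part of that word, while
    # punctuation standing alone between blanks is counted separately.
    return len(text.split())

def byte_count(tagged_text, charset="iso-8859-1"):
    """Size in bytes of the text, tags included, in an 8-bit character set."""
    # Assumption: every character of the text exists in the chosen charset;
    # otherwise encode() raises UnicodeEncodeError.
    return len(tagged_text.encode(charset))

plain = "She ate croissants, then left ."
tagged = "<p>She ate croissants, then left .</p>"
print(word_count(plain))
print(byte_count(tagged))
```

Dividing the byte count by 1024 gives a figure suitable for the KB unit.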
This corresponds to the TEI <publicationStmt> but has a narrower
focus, since it relates only to the public availability of the electronic
text.
It contains the following sub-elements:
- <distributor>
gives the name of the person or institution who distributes the text or corpus.
- <address>
contains a postal address of the
distributor. Should include telephone, fax, and email.
- <availability>
supplies information about the availability of a text, for example,
any restrictions on its use or distribution, its copyright status, etc.
Attributes include (to be elaborated):
- status
- supplies a code identifying the current availability of the
text. Values (to be filled out):
- RESTRICT the text is not freely available.
- UNKNOWN* the status of the text is unknown.
- FREE the text is freely available.
- region
- specifies the territories within which rights in the
electronic text apply. Suggested values include:
- NOT-US all parts of the world outside the USA
- NOT-NAP all parts of the world other than the USA, Canada, and the
Philippines
- NOT-NA all parts of the world other than USA and Canada
- WORLD* the text is freely available.
- EU European Union only
- NOT-USP all parts of the world other than the USA and the
Philippines
- <date>
the publication date expressed as a calendar date in any format.
- value
- specifies a standard value for this date in ISO 8601 (Representation of dates and times) format
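To illustrate the relation between the free-format content of <date> and its value attribute, the following Python sketch builds such an element. The function name is hypothetical, and the choice of the extended calendar form YYYY-MM-DD is an assumption; ISO 8601 also defines other representations.

```python
from datetime import date

def date_element(day, display_text):
    """Build a <date> element whose value attribute is in ISO 8601 form."""
    # date.isoformat() yields the ISO 8601 extended calendar form YYYY-MM-DD.
    return '<date value="%s">%s</date>' % (day.isoformat(), display_text)

print(date_element(date(1995, 12, 7), "7 December 1995"))
```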
This element corresponds to the TEI <sourceDesc>, except that its
content is constrained to include only the following possible sub-elements:
- <biblStruct>
contains a structured bibliographic citation, in which only
bibliographic sub-elements appear and in a specified order.
- <biblFull>
contains a bibliographic citation for a text which has been previously encoded in electronic form. This element contains the same elements as the
<fileDesc> element, and is intended to include the header of the electronic text from which the current document is derived.
The headers of individual texts will each contain at least one of the above elements to specify their source. When a particular text contains items derived from more than one bibliographic source or recording, all relevant sources for which information is available are listed in the text header, and individual <div>, <div1> or <div2> elements are associated with the correct citation or recording by means of the decls attribute.
If an electronic text has been derived from a previous electronic version of the text, then the source description will contain a <biblFull> element. If this version had itself been derived from another electronic version, then this <biblFull> element could contain yet another <biblFull> element, and so on for as many recursive levels as required. If the electronic text described in any <biblFull> element is derived from a print source, it contains a <biblStruct> element describing that source.
The <biblStruct> element
The <biblStruct> element has the following component sub-elements:
- <analytic>
contains bibliographic elements describing an item (e.g. an article or
poem) published within a monograph, journal, or periodical and not as an
independent publication.
- <monogr>
contains bibliographic elements describing an item (e.g. a book or
journal) published as an independent item (i.e. as a separate physical
object).
At least one <monogr> element must be present in a
<biblStruct> element. It may contain the following elements:
- <title>
the title of a work.
- <author>
in a bibliographic reference, contains the name of an author
(personal or corporate) of a work; names should be given in a canonical form,
with surnames preceding forenames.
- <respStmt>
supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
- <edition>
provides bibliographic details for an edition of some text.
- <imprint>
groups information relating to the publication or distribution of a
bibliographic item.
- <biblScope>
defines the scope of a bibliographic reference, for example as a list
of page numbers, or a named subdivision of a larger work.
- type
- identifies the type of information conveyed by the element.
- PP the element contains a page number or page range.
- VOL the element contains a volume number.
- ISSUE the element contains an issue number, or volume and issue numbers.
- <biblNote>
a descriptive note supplying additional information of any kind
relating to a bibliographic item described within a corpus or text header.
Published texts must contain at least one <imprint> element,
which can contain the following elements:
- <publisher>
proper name of a person, place or institution.
- type
- categorises the name. Legal values are:
- PERSON name of a person
- PLACE name of a place
- ORG name of an organization
- <date>
a calendar date in any format.
- value
- specifies standard value for this date in ISO 8601 format
- <pubPlace>
place of publication for a book, article, etc.
The <analytic> element is used when multiple monographic records are grouped together into single items. When the item described by a bibliographic citation forms a part of some other bibliographic item (as, for example, a newspaper article within a newspaper, or a journal article within a collection), a monographic description should be given for the newspaper or collection, prefixed by an analytic description for the individual component, enclosed within an <analytic> element. This contains a mixture of the elements <author>, <respStmt>, and <title>, in any order and repeated as necessary.
The second major component of the header, the encoding description, contains
information about the relationship between an encoded text and its original
source and describes the editorial and other principles employed throughout the
corpus.
The <encodingDesc> element has the following six components:
- <projectDesc>
describes in detail the purpose for which an electronic file
was encoded.
- <samplingDecl>
contains a prose description of the rationale and methods used in
sampling texts in the creation of the corpus.
- <editorialDecl>
provides details of editorial principles and practices applied
during the encoding of a text.
- <tagsDecl>
provides detailed information about the tagging applied to an SGML
document.
- <refsDecl>
specifies how canonical references are constructed for this text.
- <classDecl>
contains a series of <category> elements, defining the
classification codes used for texts within the corpus.
The <projectDesc> element provides information about the project for and by which the text or corpus was created, together with any other relevant information concerning the process by which it was assembled or collected. The content of this element is an unstructured note. Example:
<projectDesc>
The MULTEXT project is assembling a corpus consisting of
mono-lingual texts in seven Eastern and Western European
languages, together with parallel translations in each of
these languages. The original texts were acquired in various
forms and marked up for conformance with the MULTEXT/EAGLES
Corpus Encoding Standard, to test and validate that scheme.
MULTEXT has also developed a suite of annotation tools which
have been tested on the texts in the corpus.
</projectDesc>
A minimal encoding description can contain only the <projectDesc> element. In this case, a prose description of the encoding methods can be provided. If documentation of encoding principles exists in another location (a manual, etc. in printed form, at a given URL, in an ftp site, etc.) this information should be provided.
If no <conformance> element is provided in an <editorialDecl> element within the encoding description, the CES conformance level must be provided here.
The <samplingDecl> element is also an unstructured note, which contains information about the methods for text sampling in the corpus. This element is relevant only in the corpus header.
This element provides details about the systematic inclusion or exclusion of portions of texts, the rationale, and the means by which this is noted in the encoding, if any. For example (adapted from the English-Norwegian Parallel Corpus Project manual):
<samplingDecl>
The texts of the core corpus are mostly extracts from books.
The extracts are between 10,000 and 15,000 words long (30 - 40
pages), and are taken from the beginning of the texts. The front
matter, prefaces, forewords, list of contents, etc., are not
included in the extracts. In some cases, introductions have been
left out as well, e.g. introductions by scholars to works of
fiction.
Omission of passages in the text may be marked by an
<omit> tag.
</samplingDecl>
The <editorialDecl> element contains the following elements, each
specifying a particular kind of editorial practice used for some portion of the
corpus.
Where the same principles apply across the whole corpus (e.g., for the
<segmentation> element), they can be documented only once within the
corpus header.
Where different parts of the corpus apply different practices (as for example with the <quotation> or <hyphenation> elements), all possible practices can be defined in the corpus header, and particular parts of the corpus can specify the editorial practices applicable to them by using the decls attribute. When this method is used, if a practice is not explicitly associated with a part of the corpus in this way, it is assumed not to apply to it.
- <conformance>
provides the CES level of conformance for the text or corpus.
- level
- gives the level of CES conformance (legal values are 1, 2, or 3).
- <transduction>
describes the principles according to which the text has been
transduced, either in transcribing it from audio tape to written form, or in
converting from an electronic original.
- <correction>
specifies a set of correction practices applied in creating one or more
components of the corpus.
- <quotation>
specifies editorial practice adopted with respect to quotation marks
in the original.
- marks
- indicates whether or not quotation marks are retained as tag
content in the text.
- NONE no quotation marks have been retained
- SOME some quotation marks have been retained
- ALL* all quotation marks have been retained
- form
- specifies how quotation marks are indicated within the
text.
- STD use of quotation marks has been standardized; open and close quote
marks are distinct.
- NONSTD open and close quote marks are represented indiscriminately by the
- UNKNOWN* use of quotation marks is unknown.
- <hyphenation>
summarizes the way in which end-of-line hyphenation in a source text
has been treated in an encoded version of it.
- <segmentation>
describes the principles according to which the text has been
segmented, for example into sentences, tone-units, graphemic strata, etc.
- <normalization>
specifies a set of normalization practices applied in creating one or more
components of the corpus.
- method
- indicates whether normalization was made without notation or marked
by including editorial tags.
- TAGS normalization indicated with tags
- SILENT* normalization made silently
The <tagsDecl> element is used differently in corpus and in text headers. In the corpus header, it is used to list all the element names actually used within the corpus, each with a brief description of its function. In text headers, the same element is used to specify the number of occurrences of each SGML element actually tagged within each text. In both cases it consists of a number of <tagUsage> elements, defined as follows:
- <tagUsage>
- supplies information about the usage of a specific element within the
corpus or text with which this header is associated.
- gi
- the name (generic identifier) of the element indicated by the
tag.
- occurs
- specifies the number of occurrences of this element within
the text.
In the corpus header, each <tagUsage> element
contains a brief description of the element specified by its gi
attribute; the occurs attribute is not supplied. In text
headers, the <tagUsage> elements may be empty, but the
occurs attribute is always supplied.
A typical written text has a tag declaration like the following:
<tagsDecl>
<tagUsage gi=name occurs=256>
<tagUsage gi=div1 occurs=7>
<tagUsage gi=head occurs=7>
<tagUsage gi=p occurs=705>
<tagUsage gi=reg occurs=2>
<tagUsage gi=sic occurs=1>
<tagUsage gi=body occurs=1>
</tagsDecl>
Note that the global attributes lang and wsd can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified language and character set. Therefore the declaration
<tagUsage gi=term occurs=5 wsd="ISO 8859-5">
indicates that the content of all <term> elements is in the ISO 8859-5 character set.
A Perl script to automatically generate <tagUsage> elements
with appropriate values for tags in any SGML text is available at
<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>
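The idea behind such a generator can be sketched in Python as follows. This is not the script referred to above but an independent, minimal illustration with a hypothetical function name. It assumes that every element is opened with an explicit start tag (no SGML tag omission), that "<" never appears as literal data, and that element names are compared case-sensitively.

```python
import re
from collections import Counter

# Matches the generic identifier of a start tag such as <p> or
# <div1 type=part>. End tags (</p>), declarations and comments (<!...)
# and processing instructions (<?...) do not match.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.\-]*)")

def tag_usage(sgml_text):
    """Return a <tagsDecl> listing each element name and its occurrence count."""
    counts = Counter(START_TAG.findall(sgml_text))
    lines = ["<tagsDecl>"]
    for gi, occurs in sorted(counts.items()):
        lines.append("<tagUsage gi=%s occurs=%d>" % (gi, occurs))
    lines.append("</tagsDecl>")
    return "\n".join(lines)

sample = "<body><div1><head>One</head><p>First.</p><p>Second.</p></div1></body>"
print(tag_usage(sample))
```

A production tool would normally rely on an SGML parser rather than a regular expression, so that omitted tags implied by the DTD are counted as well.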
The <refsDecl> element is useful for encoding corpora since it provides information about references which are often used in the alignment of parallel texts. In particular, it is common to use ID values on tags marking paragraphs and sentences as references in links associating two parallel texts. See, for example, the English-Norwegian Parallel Corpus Project and the Lingua Parallel Concordancing Project.
<refsDecl>
A reference system is built up using the identifiers of the
following text units: text, division, paragraph, s-unit.
Each nested division has an identifier which is built up by
successively adding to the identifier of the text. Each
paragraph has an identifier which adds yet another layer to the
immediately superordinate identifier. S-units are numbered
within the nearest division, as shown above. After alignment,
each s-unit in the core corpus has a "corresp"
attribute containing a reference to the corresponding unit(s) in
the parallel text.
Example:
<body id=NN1>
<div1 type=part id=NN1.1>
<div2 type=chapter id=NN1.1.1>
<div3 type=section id=NN1.1.1.1>
<p id=NN1.1.1.1.p1>
<s id=NN1.1.1.1.s1 corresp=NN1T.1.1.1.s1></s>
<s id=NN1.1.1.1.s2 corresp=NN1T.1.1.1.s2></s>
</p>
</div3>
</div2>
</div1>
</body>
</refsDecl>
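The identifier scheme described in the example above can be sketched in Python as follows. The function names are hypothetical. In particular, parallel_id assumes that the translation numbers its units exactly as the original does under its own text identifier (as with NN1 and NN1T in the example); real alignments need not be one-to-one.

```python
def division_id(parent_id, index):
    """Identifier of a nested division: the parent identifier plus an index."""
    return "%s.%d" % (parent_id, index)

def unit_id(division, kind, index):
    """Identifier of a paragraph ("p") or s-unit ("s") within a division."""
    return "%s.%s%d" % (division, kind, index)

def parallel_id(own_id, text_id, parallel_text_id):
    """Map an identifier to the same position in the parallel text."""
    # Assumption: both texts share the same numbering below the text level.
    prefix = text_id + "."
    if not own_id.startswith(prefix):
        raise ValueError("%r does not belong to text %r" % (own_id, text_id))
    return parallel_text_id + own_id[len(text_id):]

part = division_id("NN1", 1)        # part within the text
chapter = division_id(part, 1)      # chapter within the part
section = division_id(chapter, 1)   # section within the chapter
s1 = unit_id(section, "s", 1)
print(unit_id(section, "p", 1))
print(s1, parallel_id(s1, "NN1", "NN1T"))
```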
The following scheme outlines means to define a set of text categories for
classifying texts in the corpus. A standardized set of text categories is under
development by the EAGLES Corpus Working Group on Text Typology, which will, in
most cases, eliminate the need to explicitly provide a descriptive taxonomy in
the corpus header.
The standard text categories and means to use them to classify texts in the
corpus will be specified in the final CES recommendations. The following can be
used to extend that taxonomy where necessary.
The <classDecl> element contains the descriptive taxonomy used to
classify texts within the corpus. It occurs once, in the corpus header, and
consists of a set of <category> elements, each representing a
particular textual classification feature and a value for that feature.
- <category>
contains an individual descriptive category or feature-value pair.
The global id attribute is required for the <category>
element, since it is used to associate a <catRef> within a text
header with the descriptive category appropriate to it. The category element
contains a set of <catDesc> elements:
- <catDesc>
describes a category within the text typology, in the form of a brief
prose description.
The <catDesc> element is used to contain
the value for a feature within a <category>, unless that category
is further subdivided, in which case a nested <category> element
may be used.
Within the <textClass> element of the header for each text, a
<catRef> element is provided, the target attribute of which
lists the identifiers of all <category> elements applicable to
that text.
When a standard set of text categories is developed, it is anticipated that an
attribute on <textClass> will provide the category. Unless the
standard categories are extended, no pointer to <category>
elements in the corpus header will be required.
The third component of the header is the profile description. The
<profileDesc> element has the following components:
- <creation>
contains information about the creation of a text.
- <langUsage>
groups information describing the languages, sublanguages, registers,
dialects etc. represented within a text.
- <wsdUsage>
groups information describing the character set(s) used within a text.
- <textClass>
groups information which describes the nature or topic of a text in
terms of a standard classification scheme, thesaurus, etc.
- <translations>
groups information about existing translations of the text.
- <annotations>
groups information about existing annotation files associated with the text.
These components appear in individual text headers, since they describe features of particular texts.
The <creation> element is used to record the date of first publication of electronic texts, and any details concerning the origination of the text, whether or not covered elsewhere.
The <langUsage> element contains one or more
<language> elements, each identifying a language used in the text:
- <language>
characterizes a language, sublanguage, register, dialect,
etc., used within a single text.
- iso639
- gives the standard language code from ISO 639 in one of the following forms:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- type
- indicates the type of language, e.g., sublanguage, dialect,
etc.
Example:
<langUsage>
<language id="fr" iso639="fr">French</language>
<language id="en" iso639="en">English</language>
<language id="la" iso639="la">Latin</language>
</langUsage>
The value of the id attribute on any <language> element should be given as a value for the global lang attribute when it is used on a tag in the text or header to refer to this language.
For example,
She ate <foreign lang=fr>croissants</foreign>
When more than one character set is used in a text, the wsd attribute should be used on each <language> tag to associate the language with a particular character set.
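The rule that every lang value used in a text should match the id of a declared <language> element lends itself to a simple consistency check. The Python sketch below is one possible way to do it, with hypothetical names; it uses regular expressions rather than an SGML parser, and so assumes that attribute values contain no white space and that ">" does not occur inside attribute values.

```python
import re

# id values of <language> elements in the header (quoted or unquoted).
LANGUAGE_ID = re.compile(r'<language\b[^>]*\bid="?([^"\s>]+)')
# lang values on any start tag in the text (quoted or unquoted).
LANG_ATTR = re.compile(r'<[A-Za-z][^>]*\blang="?([^"\s>]+)')

def undeclared_langs(header, text):
    """Return lang values used in the text that no <language> id declares."""
    declared = set(LANGUAGE_ID.findall(header))
    used = set(LANG_ATTR.findall(text))
    return sorted(used - declared)

header = ('<langUsage>'
          '<language id="fr" iso639="fr">French</language>'
          '<language id="en" iso639="en">English</language>'
          '</langUsage>')
text = ('She ate <foreign lang=fr>croissants</foreign> '
        'and <foreign lang=de>Brot</foreign>')
print(undeclared_langs(header, text))
```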
The <wsdUsage> element contains one or more
<writingSystem> elements, each identifying a character set used in the text:
- <writingSystem>
characterizes a character set used within a single text.
Example:
<wsdUsage>
<writingSystem id="ISO 8859-1">ISO character set for western
European languages</writingSystem>
<writingSystem id="ISO 8859-5">ISO character set for
Cyrillic</writingSystem>
</wsdUsage>
The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set.
For example,
This is a patch of Cyrillic:
<foreign lang=bg wsd="ISO 8859-5">
Големия
брат
те наблюдава
</foreign>
When a Writing System Declaration describing a transcription scheme is provided as an auxiliary document, the value of the wsd attribute on the <writingSystem> element must be an entity that points to this document. Usually, the entity expands to the name of the file in which the Writing System Declaration is stored. Note that for this reason, the type of the wsd attribute on the <writingSystem> element is ENTITY (indicating that its value must be an SGML entity). For all other elements in the header or text, the type of the global wsd attribute is CDATA.
The <textClass> element contains references to the text classification scheme and descriptive keywords which together describe the text concerned. The following elements are used for these purposes:
- <catRef>
specifies one or more defined categories within some taxonomy or text
typology.
- target
- identifies the text category or categories by reference to a
definition in the corpus header.
- category
- gives a standard category name defined by EAGLES. List of
values to be provided.
- <keywords>
contains a list of keywords or phrases identifying the topic or
nature of a text, each of which is tagged as a term. To be provided by EAGLES/PAROLE.
The <keywords> element contains one or more technical terms:
- <keyTerm> contains a technical term or phrase, particularly in a
list of descriptive keywords.
This element groups information about translations of the text which exist, usually within the same corpus. The following elements are used for these purposes:
- <translation>
gives information about a translation of the text. The global lang and wsd attributes are required on this tag. In addition to the four global attributes, this tag has the following optional attribute:
- trans.loc
- provides, in an entity reference, information (path/file name, URL, etc.) about the location of the translation.
- <translator>
gives the name of the translator.
This element groups information about annotation documents associated with the text. The following elements are used for these purposes:
- <annotation>
gives information about an annotation file associated with the text. Attributes:
- type
- indicates the type of annotation. Possible values:
- SEGMENT: annotation file contains segmentation into sentences and words.
- GRAM: annotation file contains morpho-syntactic category information for the words in the text.
- ALIGN: annotation file contains alignment links to a parallel translation.
- ann.loc
- provides, in an entity reference, information (path/file name, URL, etc.) about the location of the annotation file.
- trans.loc
- for annotation files containing alignment information, provides, in an entity reference, information (path/file name, URL, etc.) about the location of the file containing the aligned text.
The revision description is the fourth element in the header. It is used to
record details of any significant change to the corpus. The
<revisionDesc> element has the following component:
- <change>
summarizes a particular change or correction made to a particular
version of an electronic text which is shared between several
researchers.
More than one <change> element may appear, one per change.
Unlike its counterpart in the TEI scheme, the
<change> element here must contain the following elements:
- <date>
- gives the date of the change.
- value
- specifies a standard value for this date in ISO 8601 format.
- <respName>
- specifies the person responsible for the change.
- <item>
- specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element.
When any significant change is made to any component of the corpus, the
following steps should be taken:
- a <change> element is added to the
<revisionDesc> of the text affected
- the update attribute of the text header is changed to the date of
the change
- the value of the status attribute of the text header is set to
UPDATE
- the revision number specified on the version attribute of the
<editionStmt> of the corpus header is incremented
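The four steps above can be sketched as a single update routine. This Python sketch is illustrative only: it models the headers as plain dictionaries rather than SGML documents, the function name record_change and the dictionary layout are hypothetical, and it assumes that the version attribute holds an integer revision number.

```python
import copy

def record_change(text_header, corpus_header, date, resp_name, items):
    """Apply the four update steps and return updated copies of both headers.

    Assumptions (not from the CES): headers are dictionaries, `date` is an
    ISO 8601 string, and the corpus version is an integer.
    """
    text_header = copy.deepcopy(text_header)
    corpus_header = copy.deepcopy(corpus_header)

    # Step 1: add a <change> entry to the <revisionDesc> of the affected text.
    text_header["revisionDesc"].append(
        {"date": date, "respName": resp_name, "items": list(items)}
    )
    # Step 2: set the update attribute of the text header to the change date.
    text_header["update"] = date
    # Step 3: set the status attribute of the text header to UPDATE.
    text_header["status"] = "UPDATE"
    # Step 4: increment the revision number on the corpus <editionStmt>.
    corpus_header["editionStmt"]["version"] += 1

    return text_header, corpus_header
```

Returning copies rather than modifying the inputs keeps the earlier state available for comparison; a real implementation would instead rewrite the stored header files.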
The decls attribute is specified for the element <body> and
the larger division elements (<div1> or
<div2>).
It is used for two purposes:
- to supply a specific title for parts of composite works;
- to specify encoding or other declarations applicable to all or part of a
text where a number of possibilities have been provided for in the
header.
Its value is a list of identifiers, each of which has been supplied
elsewhere in a text or corpus header as the identifier for one of the following
elements: <biblStruct>, <editorialDecl> and its
constituents (<correction>, <hyphenation>,
<quotation>, <segmentation> and
<transduction>), and <textClass>.
For these elements, the corpus header will normally contain several mutually
incompatible options, for example, several editorial declarations. Individual
texts, or portions of texts, specify explicitly which of the available options
applies to them by using the decls attribute. In cases where the set
of declarable elements applies only within portions of a single text, they will
be specified in the text header rather than the corpus header.
Declarable elements, once specified, are inherited by all sub-components. That
is, if the decls attribute of a <body> element specifies a
particular value for some declarable element, that value is understood to apply
to all components of the text unless over-ridden. If the decls attribute
of a <div1> within that text specifies a different value, the new
value applies to the contents of that <div1> only; the value
specified by the <body> applies to all subsequent
<div1> elements in the same text, unless they also specify a
different decls value.
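The inheritance rule can be illustrated with a short Python sketch. It is a simplified model, not part of the CES: the identifiers ed1, ed2 and tc1 and the function names are hypothetical, and the index mapping each identifier to its declarable element type is assumed to have been built from the header beforehand.

```python
def resolve(decls, index):
    """Map each identifier in a decls value to the declarable element it names.

    `decls` is a whitespace-separated list of identifiers; `index` maps each
    identifier to its element type (for example "editorialDecl").
    """
    return {index[identifier]: identifier for identifier in decls.split()}

def effective_decls(body_decls, div_decls_list, index):
    """Return the declarations in force for each <div1>, in document order.

    Each <div1> inherits the <body> values and overrides only the element
    types named in its own decls attribute; an override does not carry over
    to later <div1> elements.
    """
    inherited = resolve(body_decls, index)
    return [{**inherited, **resolve(div_decls, index)}
            for div_decls in div_decls_list]
```

For a <body> with decls "ed1 tc1" followed by one <div1> with decls "ed2" and one <div1> with no decls attribute, the first division uses ed2 and tc1, and the second falls back to ed1 and tc1.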
For non-declarable elements, the header of an individual text will specify only
those respects (if any) in which it differs from the defaults stated in the
corpus header.
This is a simplification of the decls mechanism described in
the TEI Guidelines.
<!doctype cesHeader PUBLIC "-//CES//DTD//cesHeader//EN" []>
<cesHeader text.loc="/usr/multext/corpus/english/ORW23">
<fileDesc>
<titleStmt>
<title>Machine-readable version of 1984, ch. 1</title>
<respStmt>
<respType>typed in and marked with CES tags </respType>
<respName>A. Student</respName>
</respStmt>
</titleStmt>
<extent>
<bytecount>47992</bytecount>
<wordcount>6571</wordcount>
</extent>
<publicationStmt>
<distributor>Laboratoire Parole et Langage, CNRS</distributor>
<address>29, avenue Robert Schuman
Aix-en-Provence, France
tel: +33 42 95 36 33
fax : +33 42 59 50 96
email: phonetic@univ-aix.fr</address>
<availability status=restricted>
internal use only--cannot be distributed</availability>
<date>6571</date>
</publicationStmt>
<sourceDesc>
<biblStruct>
<monogr>
<title>Nineteen Eighty-four</title>
<author>George Orwell</author>
<imprint>
<pubPlace>New York</pubPlace>
<publisher>New American Library</publisher>
<date>1949; reprinted 1961</date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<encodingdesc>
<projectdesc>
This English version of the first chapter of Orwell's 1984 is
encoded for use in the MULTEXT-EAST project. The English is
to serve as the base for the parallel corpus, and will be aligned
to versions of the text in Romanian, Bulgarian, Estonian,
Slovenian, Czech, and Hungarian.
</projectdesc>
<editorialdecl>
<conformance level=1>CES Level 1</conformance>
<correction status=medium method=silent></correction>
<quotation marks=none form=std>Rendition attribute values on Q
and QUOTE tags are adapted from ISOpub and ISOnum standard
entity set names
</quotation>
<segmentation>Marked up to the level of paragraph plus
marking of particular sub-paragraph elements: NAME, DATE,
FOREIGN.
</segmentation>
</editorialdecl>
<tagsdecl>
<tagusage gi=body occurs=1></tagusage>
<tagusage gi=date occurs=5></tagusage>
<tagusage gi=div1 occurs=1></tagusage>
<tagusage gi=div2 occurs=1></tagusage>
<tagusage gi=foreign occurs=4></tagusage>
<tagusage gi=hi occurs=4></tagusage>
<tagusage gi=name occurs=149></tagusage>
<tagusage gi=note occurs=1></tagusage>
<tagusage gi=num occurs=2></tagusage>
<tagusage gi=p occurs=41></tagusage>
<tagusage gi=ptr occurs=1></tagusage>
<tagusage gi=q occurs=22></tagusage>
<tagusage gi=quote occurs=3></tagusage>
</tagsdecl>
</encodingdesc>
<profiledesc>
<langusage>
<language id="fr" iso639="fr">French</language>
<language id="en" iso639="en">English</language>
<language id="la" iso639="la">Latin</language>
<language id="ns">Newspeak</language>
</langusage>
</profiledesc>
</cesHeader>
The CES Header DTD