Special Issues for Language Corpora and Other Collections
A collection is a group of texts encoded according to the
same encoding scheme, though perhaps unrelated in other respects. A
corpus is a body of texts put together in a principled way,
typically in order to construct a sample of a given language or
sublanguage. The term corpus is here applied to
collections of running text and does not apply to word lists,
concordances, collections of citations, and the like.
Many corpora have been compiled for the purposes of linguistic analysis.
Though mixed forms do occur, we may distinguish between:
sample text corpora
full text corpora
monitor text corpora
While the first two are closed and limited in size, the third is
open-ended and expanding (rather like a language archive). Typical
examples of the three types are (1) the Brown Corpus (see Francis and
Kucera 1979), (2) the Leuven Drama Corpus (see Geens et al. 1975), and
(3) the Birmingham Collection of English Texts (see Renouf 1987).
While many sample corpora used to contain text extracts of a given
length (disregarding natural textual divisions), there is now a tendency
among compilers of sample corpora to pay less attention to exact
comparability of sample size and put more emphasis on textual coherence
and continuity.
The coding of individual members of a corpus does not differ in
principle from that of other texts. The corpus as a whole does,
however, present special problems, owing to the need to organize and
encode the texts in a uniform and consistent manner while staying as
close as possible to the original text. This involves setting up (1) a
text identification and classification scheme, (2) a reference system,
and (3) a scheme for the encoding of textual features.
Text Documentation at the Corpus or Collection Level
Like any other encoded text, a corpus or collection must have a TEI
document header providing bibliographic information about the corpus
itself as well as bibliographic identification of the texts included.
See chapter for details. To distinguish the header
for the corpus or collection from the headers for the individual pieces,
the former should carry the attribute value type=corpus or
type=collection instead of the default type=text. Some
aspects of particular interest for corpora and collections are discussed
here:
declarations of encoding practice for corpora and collections,
documenting text classifications,
special note types,
treatment of documentation for multiple subcorpora, and
treatment of multiple sources.
The encoding.declarations section of the TEI header for the
corpus serves as a preface allowing compilers to specify sampling
principles, editorial practices, etc. This may be given in prose, with
sections arranged as suggested by the tags in the list below.
Alternatively, there may be a reference to a printed manual. It is
recommended but not required for TEI conformance that all the tags
specified in the list below be present with appropriate information.
Other relevant information may be encoded using an
other.information tag.
Aim of collection, using the tag aim
Composition history (tagged composition.history). This
element may be repeated if the corpus contains multiple subcorpora
with different histories.
Sampling principles (sampling). This tag may be repeated
if there are multiple subcorpora for which the sampling principles
are to be specified separately. Each sampling element should
then carry an ID attribute, so that individual
samples can refer to the particular declaration which applies to
them.
Editorial principles (editorial.principles)
Inclusion/exclusion of text (textincl)
Correction (correction)
Normalization (normalization)
Analysis and interpretation (analysis)
The editorial.principles group may be repeated if there are
multiple subcorpora which were created according to different
principles. Each editorial.principles element should in
that case bear an ID attribute so that it can be
referred to from individual components of the corpus.
Reference system (reference.system).
This element may be repeated if the corpus uses more than one type
of reference system.
List of contents (contents)
Other material information (other.information)
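As a minimal sketch, the declarations listed above might be laid out as follows. Only element names given in this section are used; the nesting, the bracketed placeholder content, and the identifier values S1 and E1 are assumptions made for illustration, not requirements of the scheme.

```sgml
<encoding.declarations>
  <aim>[Aim of the collection]</aim>
  <composition.history>[How the corpus was assembled]</composition.history>
  <!-- Repeatable; the id (S1 is hypothetical) lets individual samples
       point to the declaration that applies to them. -->
  <sampling id=S1>[Sampling principles]</sampling>
  <!-- Repeatable as a group; E1 is a hypothetical identifier. -->
  <editorial.principles id=E1>
    <textincl>[Inclusion/exclusion of text]</textincl>
    <correction>[Correction policy]</correction>
    <normalization>[Normalization policy]</normalization>
    <analysis>[Analysis and interpretation]</analysis>
  </editorial.principles>
  <reference.system>[Reference system used]</reference.system>
  <contents>[List of contents]</contents>
  <other.information>[Other material information]</other.information>
</encoding.declarations>
```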
Corpora are often constructed around a defined set of text categories,
such as Science Fiction or Learned and Scientific Writing
(to take two text categories from the Brown Corpus). The classification
of texts is notoriously difficult and may vary depending upon the
purpose of the study and the theoretical stance of the analyst. No
specific set of categories is recommended here; those for whom such a
recommendation would be useful should contact the Text Encoding
Initiative with more information about their requirements.
In addition to text categories, corpus texts may also be described with
terms from a subject classification scheme. No specific set of subject
terms is recommended here; those in need of such terms are directed
instead to the large literature on subject classification and the wide
variety of general and specific sets of subject terms for use in library
cataloguing and periodical bibliography.
Where applicable, text categories and subject classifications can be
described in the encoding.principles area of the TEI header
for the corpus, using the tags text.categories and
subject.classification. The text.categories
element contains a list of category.definition elements, each
of which contains a description or definition of one category, in prose,
with an ID attribute to which reference can later be made
from text.category cat=xxx tags in TEI headers for individual
pieces in the corpus. The subject classification declaration contains a
bibliographic reference or a description in prose of the subject
classification terms used, if any, in categorizing the components of the
corpus. Keywords used may be listed, if desired, but this is not
required.
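The category mechanism described above might be sketched as follows. The identifiers SF and LEARNED are hypothetical, the definitions are left as placeholders, and the text.category tag is written in the open form used in the running text, since its content model is not stated here.

```sgml
<!-- In the TEI header for the corpus -->
<text.categories>
  <category.definition id=SF>
    Science Fiction: [prose definition of the category]
  </category.definition>
  <category.definition id=LEARNED>
    Learned and Scientific Writing: [prose definition of the category]
  </category.definition>
</text.categories>
<subject.classification>
  [Bibliographic reference to, or prose description of, the subject terms used]
</subject.classification>

<!-- In the TEI header for an individual piece assigned to the first category -->
<text.category cat=SF>
```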
Information about the language(s) and text type(s) represented in the
corpus should be given in note elements in the TEI header's
notes area, as should information about availability of the
corpus and copyright information. For example:
Language: English (British)
Text types: newspapers, light fiction, technical reports.
Availability: available to research centers for cost of media.
Copyright: all material is public domain.
A corpus may comprise several subcorpora, each constructed on different
sampling principles and prepared with different editorial practices. In
such cases, the descriptions of sampling and editorial principles should
be repeated and each description given a specific identifier by means of
the ID attribute. Each sample can then be associated with
a particular set of principles by referring back to these IDs from the
TEI header for the sample. (See below.)
The bibliographic identification of sources can be given either all at
once, in the source section of the main TEI header (see
section ) or individually with each text in the
corpus. If full bibliographic details of each piece in the corpus are
given in the main TEI header, then a cross reference can be made from
each piece to the appropriate source description.
A TEI header for a corpus with more than one sampling procedure and more
than one set of editorial practices might look like this, if all sources
are described in the main TEI header:
[file description here, with title of corpus, etc.]
[Title of corpus]
[Statement of responsibility for corpus]
[Other bibliographic information about the corpus]
[Bibliographic description of source 1.]
[Bibliographic description of source 2.]
[Bibliographic description of source 3.]
[etc.]
[Notes on the corpus]
[description of aim, etc.]

Samples of the first type comprise exactly 2000 words of
text, beginning at a randomly chosen point.

Samples of the second type comprise at least 5000 words of
text, beginning at the beginning of the document and
continuing through the first paragraph ending after the
5000th word.

All text and diagrams are included. Formulas are represented in eqn
and chemical diagrams in CGM graphics.
No corrections are made. All typos in the original are retained.
No normalization is performed.
No analysis has been performed in this version of the corpus.

Only English-language running text is included. Formulas are replaced
by the tag "formula" and diagrams by the tag "diagram".
Obvious typographic errors are corrected silently.
British spellings have been changed to American spelling where marked
by the "norm" tag.
No analysis has been performed in this version of the corpus.

[version control information ...]
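The markup of the example above can be sketched roughly as follows, with the prose abbreviated to bracketed summaries. This is an illustrative reconstruction under stated assumptions: only element names mentioned in this section appear, the placement of the source descriptions within the header is assumed, and the identifiers SRC1, SRC2, SAMP1, SAMP2, ED1 and ED2 are hypothetical.

```sgml
<TEI.header type=corpus>
  [file description, title, responsibility, other bibliographic information]
  <!-- One identified description per source, so that items can refer back. -->
  <source.description id=SRC1>[Bibliographic description of source 1.]</source.description>
  <source.description id=SRC2>[Bibliographic description of source 2.]</source.description>
  [Notes on the corpus]
  <encoding.declarations>
    <aim>[description of aim, etc.]</aim>
    <!-- Two sampling procedures, each with its own identifier. -->
    <sampling id=SAMP1>[exactly 2000 words from a randomly chosen point]</sampling>
    <sampling id=SAMP2>[at least 5000 words from the start of the document]</sampling>
    <!-- Two sets of editorial practices, each with its own identifier. -->
    <editorial.principles id=ED1>
      [textincl, correction, normalization, analysis for the first set]
    </editorial.principles>
    <editorial.principles id=ED2>
      [textincl, correction, normalization, analysis for the second set]
    </editorial.principles>
  </encoding.declarations>
  [version control information ...]
</TEI.header>
```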
Text Documentation for Individual Items
Individual items may be assigned a reference name or number, which
should be specified on the n attribute of the item's
TEI.doc tag.
The TEI header for an individual item may include a full description of
the encoding and the source of the text, but typically the encoding
practices relevant to the collection as a whole will have been specified
in the main TEI header and the headers for individual items will need
only to refer back to the main header using special tags specified
below. The encoding.principles area of the item's TEI header
may be used to describe and discuss idiosyncratic features of the text.
References to declarations in the main TEI header are possible with the
following tags:
sample.type type=xxx to refer to a sampling
declaration
editorial.type type=xxx to refer to a declaration of
editorial.principles
source.ref target=xxx to refer to a source.description.
This may be followed by a partial bibliographic description giving
details relevant to this specific item. If for example the general
source description relates to a specific newspaper, individual
items taken from that newspaper might have source references to the
general description, followed by specific information about the date
and pages from which the specific item was taken.
text.category cat=xxx to refer to the definition of a specific
category to which the item is assigned.
If subject classifications of the items are assigned, they may be
specified with a list of subject.term tags in the TEI header
for the item.
A text in a corpus might look something like this, if backward
references are used as described:
[Text of corpus sample here.]
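A hedged sketch of such an item, using only the tags introduced in this section, is given below. The reference name A01 and the identifiers SAMP1, ED1, SRC1 and SF are hypothetical and are assumed to match ID values declared in the main TEI header; the reference tags are written in the open form used in the running text, since their content models are not stated here.

```sgml
<TEI.doc n=A01>
  <TEI.header type=text>
    <!-- Backward references to declarations in the main TEI header. -->
    <sample.type type=SAMP1>
    <editorial.type type=ED1>
    <source.ref target=SRC1>
    [Partial bibliographic description: date and pages of this item]
    <text.category cat=SF>
    <subject.term>[subject term]</subject.term>
  </TEI.header>
  <text>
    [Text of corpus sample here.]
  </text>
</TEI.doc>
```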
Reference Schemes in Corpora and Collections
On reference schemes in general, see . The reference
system for a corpus can be built up from the hierarchical structure of
the corpus: name of corpus + name of text category + number of text
sample + location in the sample. S-units dividing the text into
orthographic sentences (on which see section )
may provide suitable points of reference within the sample.
Many corpora impose no unified reference system on their materials,
using instead different styles of reference for different classes of
material. In these cases, the reference.system declaration
in the main TEI header should be repeated and individual items should
refer back to the relevant declaration using the tag ref.type
type=xxx (where xxx is the ID associated with
the applicable reference.system declaration).
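For a corpus with two styles of reference, the declarations and a backward reference to one of them might look like the sketch below. The identifiers R1 and R2 and the bracketed descriptions are hypothetical illustrations; the first description follows the hierarchical pattern given above.

```sgml
<!-- In the main TEI header -->
<reference.system id=R1>
  [Name of corpus + name of text category + number of text sample
   + number of the s-unit within the sample]
</reference.system>
<reference.system id=R2>
  [A different style of reference, used for another class of material]
</reference.system>

<!-- In the TEI header of an item that uses the first style -->
<ref.type type=R1>
```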
Encoding Practice in Corpora and Collections
Corpora require more editorial intervention than individual
machine-readable texts. There are two levels of text documentation:
one for the corpus as a whole and one for the text sample. (To some
extent, this is true of any computer-edition of a printed text. There
must be documentation on the machine-readable edition/version and on the
original printed edition.) There are special problems where text
samples are composite, as in the newspaper categories of the Brown
Corpus. Either the individual sample must be treated as a
mini-collection, with nested headers, or else the item's TEI header must
contain bibliographic information for all the texts in the sample.
Headings, name of the author, etc. should in either case be left as part
of the text and tagged as front matter. This may or may not be feasible
in working with existing corpora.
To achieve the overall aim of uniformity and consistency of coding, the
corpus editor cannot provide tailor-made encoding for each individual
sample. All editorial decisions must be documented: criteria for text
selection, omission of text, emendation, normalization of spelling, etc.
The general policy on these matters should be stated in the
encoding.declarations section referred to above, and
individual instances should be commented on or marked up in the
appropriate place of the text (cf. section ), if this
is judged to be important for the user of the text.
For examples of the treatment of corpora, see the manuals for the
Brown and LOB corpora: Francis and Kucera (1979) and
Johansson et al. (1978). An extract from the LOB Corpus
(re-coded using SGML) is given in the reference section of this
report.
Basic Structure of Corpora and Collections
The internal structure of corpus texts, samples, and texts in
collections need not differ from that of other texts. In simple cases,
the corpus or collection as a whole may be tagged with the
TEI.Corpus or TEI.Collection tag, which must
contain a TEI.header element with information about the
corpus, followed by a series of TEI.doc elements, each
structured as described elsewhere in these Guidelines. Each
TEI.doc element must contain, as usual, a
TEI.header and a text. The TEI.header
section of these TEI.doc elements may be used, as shown
above, to document the source of the individual text and to associate it
with particular declarations made in the TEI header for the corpus or
collection.
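The simple case can be sketched as follows, assuming a corpus of two texts; the bracketed content is placeholder text, and the explicit type=text on the item headers merely spells out the default.

```sgml
<TEI.Corpus>
  <!-- Header for the corpus as a whole -->
  <TEI.header type=corpus>
    [Information about the corpus]
  </TEI.header>
  <!-- Each component carries its own header and text -->
  <TEI.doc>
    <TEI.header type=text>[Source of, and declarations applying to, text 1]</TEI.header>
    <text>[Text 1]</text>
  </TEI.doc>
  <TEI.doc>
    <TEI.header type=text>[Source of, and declarations applying to, text 2]</TEI.header>
    <text>[Text 2]</text>
  </TEI.doc>
</TEI.Corpus>
```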
In more complex cases, special modifications to the document type
declaration may be necessary to allow component texts of different types
to be represented in the same corpus without losing the ability to
validate individual document structures. This topic requires further
work; researchers interested in this topic should contact the Text
Encoding Initiative with information about their requirements.
Newspapers, periodicals and anthologies are similar in structure to
corpora, except that in these cases the computer-editor takes over
collections organized by the editor(s) of the printed version. There is
a general TEI header section followed by a series of individual
articles. Each article consists of a documentation section (the TEI
header) and text.