Special Issues for Language Corpora and Other Collections

A collection is a group of texts encoded according to the same encoding scheme, though perhaps unrelated in other respects. A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage. The term corpus is here applied to collections of running text and does not apply to word lists, concordances, collections of citations, and the like.

Many corpora have been compiled for the purposes of linguistic analysis. Though mixed forms do occur, we may distinguish between:

  1. sample text corpora
  2. full text corpora
  3. monitor text corpora

While the first two are closed and limited in size, the third is open-ended and expanding (rather like a language archive). Typical examples of the three types are: (1) the Brown Corpus (see Francis and Kucera 1979), (2) the Leuven Drama Corpus (see Geens et al. 1975), and (3) the Birmingham Collection of English Texts (see Renouf 1987).

While many sample corpora used to contain text extracts of a given length (disregarding natural textual divisions), there is now a tendency among compilers of sample corpora to pay less attention to exact comparability of sample size and put more emphasis on textual coherence and continuity.

The coding of individual members of a corpus does not differ in principle from that of other texts. The corpus as a whole does, however, present special problems, owing to the need to organize and encode the texts in a uniform and consistent manner while staying as close as possible to the original text. This involves setting up (1) a text identification and classification scheme, (2) a reference system, and (3) a scheme for the encoding of textual features.

Text Documentation at the Corpus or Collection Level

Like any other encoded text, a corpus or collection must have a TEI document header providing bibliographic information about the corpus itself as well as bibliographic identification of the texts included. See chapter for details. To distinguish the header for the corpus or collection from the headers for the individual pieces, the former should carry the attribute value type=corpus or type=collection instead of the default type=text. Some aspects of particular interest for corpora and collections are discussed here: declarations of encoding practice for corpora and collections, documenting text classifications, special note types, treatment of documentation for multiple sub-corpora, and treatment of multiple sources.

The encoding.declarations section of the TEI header for the corpus serves as a preface allowing compilers to specify sampling principles, editorial practices, etc. This may be given in prose, with sections arranged as suggested by the tags in the list below. Alternatively, there may be a reference to a printed manual. It is recommended but not required for TEI conformance that all the tags specified in the list below be present with appropriate information. Other relevant information may be encoded using an other.information tag.

Corpora are often constructed around a defined set of text categories, such as Science Fiction or Learned and Scientific Writing (to take two text categories from the Brown Corpus). The classification of texts is notoriously difficult and may vary depending upon the purpose of the study and the theoretical stance of the analyst. No specific set of categories is recommended here; those for whom such a recommendation would be useful should contact the Text Encoding Initiative with more information about their requirements.

In addition to text categories, corpus texts may also be described with terms from a subject classification scheme. No specific set of subject terms is recommended here; those in need of such terms are directed instead to the large literature on subject classification and the wide variety of general and specific sets of subject terms for use in library cataloguing and periodical bibliography.

Where applicable, text categories and subject classifications can be described in the encoding.principles area of the TEI header for the corpus, using the tags text.categories and subject.classification. The text.categories element contains a list of category.definition elements, each of which contains a description or definition of one category, in prose, with an ID attribute to which reference can later be made from text.category cat=xxx tags in TEI headers for individual pieces in the corpus. The subject classification declaration contains a bibliographic reference or a description in prose of the subject classification terms used, if any, in categorizing the components of the corpus. Keywords used may be listed, if desired, but this is not required.
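
As a rough illustration of the declarations just described, the following fragment sketches how two categories from the Brown Corpus might be declared in the TEI header for the corpus. Only the tag names are taken from the description above; the identifiers scifi and learned are invented for this sketch, and the bracketed text stands for prose that the compiler would supply.

<![ CDATA [
<text.categories>
  <category.definition id=scifi>
    [Prose definition of the category Science Fiction.]
  </category.definition>
  <category.definition id=learned>
    [Prose definition of the category Learned and Scientific Writing.]
  </category.definition>
</text.categories>
<subject.classification>
  [Bibliographic reference to, or prose description of, the subject classification terms used.]
</subject.classification>
]]>

Under these assumptions, the TEI header for an individual science fiction sample would refer back to the first definition with a text.category cat=scifi tag.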

Information about the language(s) and text type(s) represented in the corpus should be given in note elements in the TEI header's notes area, as should information about availability of the corpus and copyright information. For example:

<![ CDATA [
<notes>
  <note>Language: English (British)</note>
  <note>Text types: newspapers, light fiction, technical reports.</note>
  <note>Availability: available to research centers for cost of media.</note>
  <note>Copyright: all material is public domain.</note>
</notes>
]]>

A corpus may comprise several subcorpora, each constructed on different sampling principles and prepared with different editorial practices. In such cases, the descriptions of sampling and editorial principles should be repeated and each description given a specific identifier by means of the ID attribute. Each sample can then be associated with a particular set of principles by referring back to these IDs from the TEI header for the sample. (See below.)

The bibliographic identification of sources can be given either all at once, in the source section of the main TEI header (see section ) or individually with each text in the corpus. If full bibliographic details of each piece in the corpus are given in the main TEI header, then a cross reference can be made from each piece to the appropriate source description.

A TEI header for a corpus with more than one sampling procedure and more than one set of editorial practices might look like this, if all sources are described in the main TEI header:

<![ CDATA [
<TEI.header type=corpus>
  <file.description>
    [file description here, with title of corpus, etc.]
    <title>[Title of corpus]</title>
    <author>[Statement of responsibility for corpus]</author>
    [Other bibliographic information about the corpus]
    <source id=s001>[Bibliographic description of source 1.]</source>
    <source id=s002>[Bibliographic description of source 2.]</source>
    <source id=s003>[Bibliographic description of source 3.]</source>
    [etc.]
    <notes>[Notes on the corpus]</notes>
  </file.description>
  <encoding.declarations>
    [description of aim, etc.]
    <sampling.principles id=rand2000>
      Samples of the first type comprise exactly 2000 words of text, beginning at a randomly chosen point.
    </sampling.principles>
    <sampling.principles id=first5000>
      Samples of the second type comprise at least 5000 words of text, beginning at the beginning of the document and continuing through the first paragraph ending after the 5000th word.
    </sampling.principles>
    <editorial.principles id=edmin>
      <textincl>All text and diagrams are included. Formulas are represented in eqn and chemical diagrams in CGM graphics.</textincl>
      <correction>No corrections are made. All typos in the original are retained.</correction>
      <normalization>No normalization is performed.</normalization>
      <analysis>No analysis has been performed in this version of the corpus.</analysis>
    </editorial.principles>
    <editorial.principles id=edmax>
      <textincl>Only English-language running text is included. Formulas are replaced by the tag "formula" and diagrams by the tag "diagram".</textincl>
      <correction>Obvious typographic errors are corrected silently.</correction>
      <normalization>British spellings have been changed to American spelling where marked by the "norm" tag.</normalization>
      <analysis>No analysis has been performed in this version of the corpus.</analysis>
    </editorial.principles>
  </encoding.declarations>
  [version control information ...]
</TEI.header>
]]>

Text Documentation for Individual Items

Individual items may be assigned a reference name or number, which should be specified on the n attribute of the item's TEI.doc tag.

The TEI header for an individual item may include a full description of the encoding and the source of the text, but typically the encoding practices relevant to the collection as a whole will have been specified in the main TEI header and the headers for individual items will need only to refer back to the main header using special tags specified below. The encoding.principles area of the item's TEI header may be used to describe and discuss idiosyncratic features of the text.

References to declarations in the main TEI header are possible with the following tags:

  1. sample.type type=xxx, to refer to a sampling declaration
  2. editorial.type type=xxx, to refer to a declaration of editorial.principles
  3. source.ref target=xxx, to refer to a source.description. This may be followed by a partial bibliographic description giving details relevant to this specific item. If for example the general source description relates to a specific newspaper, individual items taken from that newspaper might have source references to the general description, followed by specific information about the date and pages from which the specific item was taken.
  4. text.category cat=xxx, to refer to the definition of a specific category to which the item is assigned

If subject classifications of the items are assigned, they may be specified with a list of subject.term tags in the TEI header for the item.
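
The newspaper case mentioned above might be sketched as follows. This is an illustration only. The identifiers s002, rand2000, edmin, and press are hypothetical and assume matching declarations in the main TEI header. The bracketed text after the source reference stands for the partial bibliographic description. The use of subject.term as a container for a single term is also an assumption, since its content is not specified here.

<![ CDATA [
<TEI.header>
  <source.ref target=s002>
  [Date of the issue and pages from which this item was taken.]
  <sample.type type=rand2000>
  <editorial.type type=edmin>
  <text.category cat=press>
  <subject.term>[First subject term]</subject.term>
  <subject.term>[Second subject term]</subject.term>
</TEI.header>
]]>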

A text in a corpus might look something like this, if backward references are used as described:

<![ CDATA [
<TEI.doc id=t001>
  <TEI.header>
    <source.ref target=s001>
    <sample.type type=rand2000>
    <editorial.type type=edmax>
  </TEI.header>
  <frontmatter> </frontmatter>
  <body>
    [Text of corpus sample here.]
  </body>
</TEI.doc>
]]>

Reference Schemes in Corpora and Collections

On reference schemes in general, see . The reference system for a corpus can be built up from the hierarchical structure of the corpus: name of corpus + name of text category + number of text sample + location in the sample. S-units dividing the text into orthographic sentences (on which see section ) may provide suitable points of reference within the sample.

Many corpora impose no unified reference system on their materials, using instead different styles of reference for different classes of material. In these cases, the reference.system declaration in the main TEI header should be repeated, once for each style of reference, and individual items should refer back to the relevant declaration using the tag ref.type type=xxx (where xxx is the ID associated with the applicable reference.system declaration).
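
A corpus with two styles of reference might, for instance, be documented along the following lines. The identifiers refprose and refdrama are invented for this sketch, and the bracketed text stands for whatever description of each reference style the compiler provides; the internal content of the reference.system declaration is not specified here.

<![ CDATA [
[In the main TEI header:]
<reference.system id=refprose>
  [Description of the reference style used for prose samples.]
</reference.system>
<reference.system id=refdrama>
  [Description of the reference style used for drama samples.]
</reference.system>

[In the TEI header of an individual drama sample:]
<ref.type type=refdrama>
]]>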

Encoding Practice in Corpora and Collections

Corpora require more editorial intervention than individual machine-readable texts. There are two levels of text documentation: one for the corpus as a whole and one for the text sample. (To some extent, this is true of any computer-edition of a printed text. There must be documentation on the machine-readable edition/version and on the original printed edition.) There are special problems where text samples are composite, as in the newspaper categories of the Brown Corpus. Either the individual sample must be treated as a mini-collection, with nested headers, or else the item's TEI header must contain bibliographic information for all the texts in the sample. Headings, name of the author, etc. should in either case be left as part of the text and tagged as front matter. This may or may not be feasible in working with existing corpora.

To achieve the overall aim of uniformity and consistency of coding, the corpus editor cannot provide tailor-made encoding for each individual sample. All editorial decisions must be documented: criteria for text selection, omission of text, emendation, normalization of spelling, etc. The general policy on these matters should be stated in the encoding.declarations section referred to above, and individual instances should be commented on or marked up in the appropriate place of the text (cf. section ), if this is judged to be important for the user of the text. For examples of the treatment of corpora, see the manuals for the Brown and LOB corpora: Francis and Kucera (1979) and Johansson et al. (1978). An extract from the LOB Corpus (re-coded using SGML) is given in the reference section of this report.

Basic Structure of Corpora and Collections

The internal structure of corpus texts, samples, and texts in collections need not differ from that of other texts. In simple cases, the corpus or collection as a whole may be tagged with the TEI.Corpus or TEI.Collection tag, which must contain a TEI.header element with information about the corpus, followed by a series of TEI.doc elements, each structured as described elsewhere in these Guidelines. Each TEI.doc element must contain, as usual, a TEI.header and a text. The TEI.header section of these TEI.doc elements may be used, as shown above, to document the source of the individual text and to associate it with particular declarations made in the TEI header for the corpus or collection.
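
In outline, the simple case described above might be laid out as in the following skeleton. It is a sketch only: the identifiers t001 and t002 are illustrative, the bracketed text marks material to be supplied, and the content of each TEI.doc follows the pattern of the sample text shown earlier.

<![ CDATA [
<TEI.Corpus>
  <TEI.header type=corpus>
    [Documentation of the corpus as a whole.]
  </TEI.header>
  <TEI.doc id=t001>
    <TEI.header>
      [Source of this text and references to declarations in the corpus header.]
    </TEI.header>
    <body>
      [Text of first corpus sample here.]
    </body>
  </TEI.doc>
  <TEI.doc id=t002>
    <TEI.header>
      [Source of this text and references to declarations in the corpus header.]
    </TEI.header>
    <body>
      [Text of second corpus sample here.]
    </body>
  </TEI.doc>
  [etc.]
</TEI.Corpus>
]]>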

In more complex cases, special modifications to the document type declaration may be necessary to allow component texts of different types to be represented in the same corpus without losing the ability to validate individual document structures. This topic requires further work; researchers interested in this topic should contact the Text Encoding Initiative with information about their requirements.

Newspapers, periodicals and anthologies are similar in structure to corpora, except that in these cases the computer-editor takes over collections organized by the editor(s) of the printed version. There is a general TEI header section followed by a series of individual articles. Each article consists of a documentation section (the TEI header) and text.