CNRS

MULTEXT/EAGLES - Corpus Encoding Standard
Document MUL/EAG-CES 1. Part 3. Version 0.1. Last modified 7 December 1995.






CES Part 3 - The Header




Nancy Ide and Jean Véronis


Copyright (c) Centre National de la Recherche Scientifique, 1995.

This document is only a draft and should be cited as such. Creators of WWW documents pointing to it are warned that its content and location may change without notice. This document is provided as is without any express or implied warranties. While every effort has been taken to ensure the accuracy of the information contained, the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Permission is granted to make and distribute verbatim copies of this document for non-commercial purposes provided the copyright notice and this permission notice are preserved on all copies.





3. The Header

The header provides information about the electronic text that has been encoded, including not only its title, author, etc. but also information about its encoding. The TEI header provided the first means of documenting electronic texts, and it has been widely adopted and adapted for use in text and corpus encoding.

The TEI provides an in-line header that is included in the same SGML document as the encoded text. Usually, the header appears in the same file as the text, although this is not obligatory. The TEI also provides an independent header, a header without its attached text, which is intended mainly for cataloguing electronic texts.

The CES adopts the following strategy for headers:

This strategy has the following advantages:

The CES has developed a header which is for the most part a subset of the TEI header (see TEI P3, chapter 5, "The Header", and chapter 23, "Language Corpora"). There are the following exceptions:

The CES header needs attention to determine exactly which elements and information are appropriate for corpora. We hope to develop a more constrained model with a precise template, to facilitate and regularize the creation of corpus and text headers.


3.1. Global attributes

Four global attributes are defined, which may appear on any element in the header:

Note that the values for the lang attribute are compatible with the "HyperText Markup Language Specification Version 3.0".

The global attributes are defined at the top of the CES Header DTD and represented by an entity, A.GLOBAL. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.


3.2. Document structure

Each text in the corpus (i.e. each <cesDoc> element) has its own header, referred to as a text header. The whole corpus also has a header, referred to as the corpus header, which contains information applicable to the whole corpus (possibly with some local overriding). Both corpus and text headers are represented by <cesHeader> elements. The type attribute is used to distinguish the two.

The root of the CES header element tree as defined by the CES header DTD is the <cesHeader> element, defined as follows:

<cesHeader>
contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.

type
specifies the kind of document to which the header is attached.

CORPUS  the header is attached to the corpus.
TEXT*  the header is attached to a single text.

creator
specifies the agency responsible for creating the header.

text.loc
provides, in an entity reference, the location (URL, path/filename, etc.) that contains the body of the associated document. This attribute is required.

version
specifies the version of the CES header DTD used to encode this header.

status
specifies the revision status of the header.
NEW*    this is the first version of the header
UPDATE   header has been updated.

date.created
specifies the date on which the header content was created.

date.updated
specifies the date on which the header content was last updated.
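
The attribute list above can be read as a small set of validation rules. The following Python sketch is illustrative only and is not part of the CES DTD. It assumes that the asterisk in the list marks the default value (TEXT for type, NEW for status), which the list suggests but does not state; the function name normalize_header_attrs is hypothetical.

```python
ALLOWED = {
    "type": {"CORPUS", "TEXT"},
    "status": {"NEW", "UPDATE"},
}
# Assumption: the asterisked values in the attribute list are defaults.
DEFAULTS = {"type": "TEXT", "status": "NEW"}


def normalize_header_attrs(attrs):
    """Return a copy of attrs with defaults applied, or raise ValueError."""
    if "text.loc" not in attrs:
        raise ValueError("text.loc is required on <cesHeader>")
    result = dict(DEFAULTS)
    result.update(attrs)
    for name, allowed in ALLOWED.items():
        if result[name] not in allowed:
            raise ValueError(f"invalid value for {name}: {result[name]!r}")
    return result
```

A header that supplies only text.loc would, under these assumptions, be treated as a new text header.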



The <cesHeader> element contains the following four elements:

These elements are tagged as follows:

          <cesHeader>
               <fileDesc></fileDesc>
               <encodingDesc></encodingDesc>
               <profileDesc></profileDesc>
               <revisionDesc></revisionDesc>
          </cesHeader>



3.3. The File description

The file description is the first of the four main constituents of the header and the only one that is required. It is represented by the <fileDesc> element. The file description documents the electronic file itself, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) the individual text to which the header applies.

It contains the following elements:

Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text.

<titleStmt>, <publicationStmt>, and <sourceDesc> are required.

The minimal header has the following structure:


        <cesHeader text.loc="/corpus/english/eng01a.tr">
            <fileDesc>
                 <titleStmt>
                     <title></title>
                 </titleStmt>
                 <publicationStmt>
                     <distributor></distributor>
                     <address></address>
                     <availability></availability>
                     <date></date>
                 </publicationStmt>
                 <sourceDesc>
                     <biblStruct>
                          <monogr>
                               <title></title>
                               <author></author>
                               <imprint>
                                    <pubPlace></pubPlace>
                                    <publisher></publisher>
                                    <date></date>
                               </imprint>
                          </monogr>
                     </biblStruct>
                 </sourceDesc>
            </fileDesc>
        </cesHeader>

Note that if the lang or wsd attributes are used on elements in the main text, the header must include a <profileDesc> element containing <langUsage> (for use of lang) and/or <wsdUsage> (for use of wsd).
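
The requirement that <titleStmt>, <publicationStmt>, and <sourceDesc> be present in the file description can be checked mechanically. The Python sketch below is only an illustration: it assumes an XML-compatible rendition of the header in which every end tag is written out (SGML tag minimization is not handled), and the function name missing_required is hypothetical.

```python
import xml.etree.ElementTree as ET

REQUIRED_IN_FILEDESC = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_required(header_xml):
    """List required <fileDesc> children absent from the header."""
    root = ET.fromstring(header_xml)
    file_desc = root.find("fileDesc")
    if file_desc is None:
        return ["fileDesc"]
    return [name for name in REQUIRED_IN_FILEDESC
            if file_desc.find(name) is None]


# A deliberately incomplete header: <sourceDesc> is missing.
header = """
<cesHeader text.loc="/corpus/english/eng01a.tr">
  <fileDesc>
    <titleStmt><title></title></titleStmt>
    <publicationStmt><distributor></distributor></publicationStmt>
  </fileDesc>
</cesHeader>
"""
print(missing_required(header))  # ['sourceDesc']
```

A full validation would of course be done against the CES header DTD with an SGML parser; this check covers only the three required elements.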


3.3.1. Title statement

This element consists of a <title> element followed by zero or more <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.

<respStmt> in turn contains the following elements:


3.3.2. Edition statement

In the corpus header, the version attribute on the <editionStmt> element is used to indicate both a version number and a revision number, in the form "version.revision", where "version" changes if texts are added to or removed from the corpus, and "revision" changes if amendments are made within texts or the corpus header.

In individual text headers, the version attribute carries only a revision number.

This tag can be empty. For example:

<editionStmt version='1'>

This element corresponds to the TEI <editionStmt>, except that its content is an unstructured note.
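
The two forms of the version attribute can be told apart with a few lines of code. This Python sketch is an illustration under the assumption that both the version and the revision components are plain integers, which the text above does not state explicitly; the function name parse_edition_version is hypothetical.

```python
def parse_edition_version(value, corpus_header=True):
    """Split an <editionStmt> version value into (version, revision).

    In a corpus header the value has the form "version.revision".
    In a text header only a revision number is expected, so the
    version component is returned as None.
    """
    if corpus_header:
        version, revision = value.split(".")
        return int(version), int(revision)
    return None, int(value)
```

For instance, a corpus header value of "2.3" would be read as version 2, revision 3, while the text header value "1" carries revision 1 only.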


3.3.3. Extent statement

This element corresponds to the TEI <extent> element in that it describes the number of words in the whole corpus or in an individual text. It differs in that it contains specific tags for specifying the size of the text or corpus in terms of words and bytes.

The <extent> tag contains:

For the purposes of the word count value, a "word" is considered to be an orthographic word--i.e., a string of characters surrounded by blanks. Punctuation not surrounded by white space is not considered as a word. This sort of count can be achieved fairly simply by automatic means. If any other definition is used it should be documented in the <wordCount> tag following the word value; e.g.,

<wordCount>45987 words; punctuation marks counted separately.</wordCount>

The <byteCount> tag gives the size of the text including its tags, in its representation as a text file encoded in an 8-bit ISO character set, which is useful for calculating media requirements or file download times.
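
Both values can be computed by simple automatic means, as the following Python sketch suggests. It is a minimal illustration: the function names are hypothetical, the word count is taken over text passed in as a string (whether tags are stripped first is left to the caller), and ISO 8859-1 is assumed as the default 8-bit character set.

```python
def word_count(text):
    """Count orthographic words: maximal runs of non-blank characters."""
    return len(text.split())


def byte_count(tagged_text, charset="iso-8859-1"):
    """Size in bytes of the tagged text in an 8-bit ISO character set."""
    return len(tagged_text.encode(charset))


print(word_count("She ate croissants, then left."))  # 5
print(byte_count("<p>café</p>"))  # 11
```

Note that the comma attached to "croissants" does not add to the word count, in line with the definition above.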


3.3.4. Publication statement

This corresponds to the TEI <publicationStmt> but has a narrower focus, since it relates only to the public availability of the electronic text.

It contains the following sub-elements:


3.3.5. Source description

This element corresponds to the TEI <sourceDesc>, except that its content is constrained to include only the following possible sub-elements:

The headers of individual texts will each contain at least one of the above elements to specify their source. When a particular text contains items derived from more than one bibliographic source or recording, all relevant sources for which information is available are listed in the text header, and individual <div>, <div1> or <div2> elements are associated with the correct citation or recording by means of the decls attribute.

If an electronic text has been derived from a previous electronic version of the text, then the source description will contain a <biblFull> element. If this version had itself been derived from another electronic version, then this <biblFull> element could contain yet another <biblFull> element, and so on for as many recursive levels as required. If the electronic text described in any <biblFull> element is derived from a print source, it contains a <biblStruct> element describing that source.


The <biblStruct> element

The <biblStruct> element has the following component sub-elements:

At least one <monogr> element must be present in a <biblStruct> element. It may contain the following elements:

Published texts must contain at least one <imprint> element, which can contain the following elements:

The <analytic> element is used when multiple monographic records are grouped together into single items. When the item described by a bibliographic citation forms a part of some other bibliographic item (as, for example, a newspaper article within a newspaper, or a journal article within a collection), a monographic description should be given for the newspaper or collection, prefixed by an analytic description for the individual component, enclosed within an <analytic> element. This contains a mixture of the elements <author> <respStmt> and <title> in any order and repeated as necessary.


3.4. The Encoding description

The second major component of the header, the encoding description, contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus.

The <encodingDesc> element has the following six components:


3.4.1. The <projectDesc> element

This element provides information about the project for and by which the text or corpus was created, together with any other relevant information concerning the process by which it was assembled or collected. The content of this element is an unstructured note. Example:

      <projectDesc>
           The MULTEXT project is assembling a corpus consisting of
           mono-lingual texts in seven Eastern and Western European
           languages, together with parallel translations in each of
           these languages. The original texts were acquired in various
           forms and marked up for conformance with the MULTEXT/EAGLES
           Corpus Encoding Standard, to test and validate that scheme.
           
           MULTEXT has also developed a suite of annotation tools which
           have been tested on the texts in the corpus. 
      </projectDesc>

A minimal encoding description can contain only the <projectDesc> element. In this case, a prose description of the encoding methods can be provided. If documentation of encoding principles exists in another location (a manual, etc. in printed form, at a given URL, in an ftp site, etc.) this information should be provided.

If no <conformance> element is provided in an <editorialDecl> element within the encoding description, the CES conformance level must be provided here.


3.4.2. The <samplingDecl> element

This is also an unstructured note, which contains information about the methods for text sampling in the corpus. This element is relevant only in the corpus header. This element provides details about the systematic inclusion or exclusion of portions of texts, the rationale, and the means by which this is noted in the encoding, if any. For example (adapted from the English-Norwegian Parallel Corpus Project manual):

      <samplingDecl>
           The texts of the core corpus are mostly extracts from books. 
           The extracts are between 10,000 and 15,000 words long (30 - 40  
           pages), and are taken from the beginning of the texts. The front  
           matter, prefaces, forewords, list of contents, etc., are not  
           included in the extracts. In some cases, introductions have been  
           left out as well, e.g. introductions by scholars to works of  
           fiction.
           
           Omission of passages in the text may be marked by an 
           <omit> tag. 
      </samplingDecl>

3.4.3. The <editorialDecl> element

The <editorialDecl> element contains the following elements, each specifying a particular kind of editorial practice used for some portion of the corpus.

Where the same principles apply across the whole corpus (e.g., for the <segmentation> element), they can be documented only once within the corpus header.

Where different parts of the corpus apply different practices (as for example with the <quotation> or <hyphenation> elements), all possible practices can be defined in the corpus header, and particular parts of the corpus can specify the editorial practices applicable to them by using the decls attribute. When this method is used, if a practice is not explicitly associated with a part of the corpus in this way, it is assumed not to apply to it.


3.4.4. The <tagsDecl> element

This element is used differently in corpus and in text headers. In the corpus header, it is used to list all the element names actually used within the corpus, together with a brief description of each element's function. In text headers, the same element is used to specify the number of SGML elements actually tagged within each text. In both cases it consists of a number of <tagUsage> elements, defined as follows:

In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied. In text headers, the <tagUsage> elements may be empty, but the occurs attribute is always supplied.

A typical written text has a tag declaration like the following:


            <tagsDecl>
               <tagUsage gi=name occurs=256>
               <tagUsage gi=div1 occurs=7>
               <tagUsage gi=head occurs=7>
               <tagUsage gi=p occurs=705>
               <tagUsage gi=reg occurs=2>
               <tagUsage gi=sic occurs=1>
               <tagUsage gi=body occurs=1>
            </tagsDecl>

Note that the global attributes lang and wsd can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified language and character set. Therefore the declaration

<tagUsage gi=term occurs=5 wsd="ISO 8859-5">

indicates that the content of all <term> elements is in the ISO 8859-5 character set.

A Perl script to automatically generate <tagUsage> elements with appropriate values for tags in any SGML text is available at

<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>
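
For illustration, the kind of counting involved in producing <tagUsage> elements for a text header can be sketched in Python as follows. This is not the script referenced above, whose behaviour is not described here; it is a rough sketch that assumes every start-tag is written out explicitly, folds element names to lower case, and ignores comments and marked sections. The function name tag_usage is hypothetical.

```python
import re
from collections import Counter

# Matches the generic identifier of a start-tag. End-tags ("</p>") and
# declarations ("<!doctype ...>") are skipped because the character
# following "<" must be a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")


def tag_usage(sgml_text):
    """Return one <tagUsage> line per element name found in the text."""
    counts = Counter(m.group(1).lower() for m in START_TAG.finditer(sgml_text))
    return [f"<tagUsage gi={gi} occurs={n}>" for gi, n in sorted(counts.items())]


text = "<body><p>a</p><p>b <name>X</name></p></body>"
for line in tag_usage(text):
    print(line)
```

Texts that rely on omitted start-tags would need a real SGML parser for accurate counts.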


3.4.5. The <refsDecl> element

This element is useful for encoding corpora since it provides information about references which are often used in the alignment of parallel texts. In particular, it is common to use ID values on tags marking paragraphs and sentences as references in links associating two parallel texts. See, for example, the English-Norwegian Parallel Corpus Project and The Lingua Parallel Concordancing Project.

     <refsDecl>
          A reference system is built up using the identifiers of the 
          following text units: text, division, paragraph, s-unit.
          Each nested division has an identifier which is built up by 
          successively adding to the identifier of the text. Each  
          paragraph has an identifier which adds yet another layer to the
          immediately superordinate identifier. S-units are numbered  
          within the nearest division, as shown above. After alignment,  
          each s-unit in the core corpus has a "corresp"  
          attribute containing a reference to the corresponding unit(s) in  
          the parallel text. 

          Example:

   
              <body id=NN1>
                <div1 type=part id=NN1.1>
                  <div2 type=chapter id=NN1.1.1>
                    <div3 type=section id=NN1.1.1.1>
                      <p id=NN1.1.1.1.p1>
                        <s id=NN1.1.1.1.s1 corresp=NN1T.1.1.1.s1></s>
                        <s id=NN1.1.1.1.s2 corresp=NN1T.1.1.1.s2></s>
                      </p>
                    </div3>
                  </div2>
                </div1>
              </body>
  
      </refsDecl>
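
The identifier scheme in this example can be expressed as a few string operations. The Python sketch below is illustrative only: the function names are hypothetical, and corresp_id assumes that a unit and its counterpart occupy the same position in both texts, as they happen to do in the example. Real alignments are produced by an alignment procedure and need not be one-to-one.

```python
def division_id(parent_id, n):
    """Identifier of the n-th division nested in parent_id."""
    return f"{parent_id}.{n}"


def unit_id(division, kind, n):
    """Identifier of the n-th paragraph ("p") or s-unit ("s") in a division."""
    return f"{division}.{kind}{n}"


def corresp_id(source_id, source_text, target_text):
    """Swap the text-level prefix to point into the parallel text.

    Assumption: the corresponding unit has the same position in both texts.
    """
    if not source_id.startswith(source_text + "."):
        raise ValueError(f"{source_id} does not belong to {source_text}")
    return target_text + source_id[len(source_text):]


section = division_id(division_id(division_id("NN1", 1), 1), 1)
s2 = unit_id(section, "s", 2)
print(s2)                              # NN1.1.1.1.s2
print(corresp_id(s2, "NN1", "NN1T"))   # NN1T.1.1.1.s2
```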

3.4.6. The <classDecl> element

The following scheme outlines means to define a set of text categories for classifying texts in the corpus. A standardized set of text categories is under development by the EAGLES Corpus Working Group on Text Typology, which will, in most cases, eliminate the need to explicitly provide a descriptive taxonomy in the corpus header.

The standard text categories and means to use them to classify texts in the corpus will be specified in the final CES recommendations. The following can be used to extend that taxonomy where necessary.

The <classDecl> element contains the descriptive taxonomy used to classify texts within the corpus. It occurs once, in the corpus header, and consists of a set of <category> elements, each representing a particular textual classification feature and a value for that feature.

The global id attribute is required for the <category> element, since it is used to associate a <catRef> within a text header with the descriptive category appropriate to it. The category element contains a set of <catDesc> elements:

The <catDesc> element is used to contain the value for a feature within a <category>, unless that category is further subdivided, in which case a nested <category> element may be used.

Within the <textClass> element of the header for each text, a <catRef> element is provided, the target attribute of which lists the identifiers of all <category> elements applicable to that text.

When a standard set of text categories is developed, it is anticipated that an attribute on <textClass> will provide the category. Unless the standard categories are extended, no pointer to <category> elements in the corpus header will be required.
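
The lookup from the target attribute of a <catRef> to the <category> elements of the corpus header can be sketched as follows. This Python fragment is an illustration only: it assumes that the target value is a blank-separated list of identifiers and represents the taxonomy as a dictionary from category identifiers to descriptions. The identifiers and descriptions used in the usage lines are invented for the example.

```python
def categories_for(cat_ref_target, taxonomy):
    """Look up the descriptions of the categories named in a catRef target."""
    ids = cat_ref_target.split()
    missing = [i for i in ids if i not in taxonomy]
    if missing:
        raise KeyError(f"undeclared category id(s): {', '.join(missing)}")
    return [taxonomy[i] for i in ids]


# Hypothetical taxonomy, standing in for the <category> elements
# of a corpus header.
taxonomy = {"fict": "fiction", "news": "newspaper text"}
print(categories_for("fict", taxonomy))  # ['fiction']
```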


3.5. The Profile description

The third component of the header is the profile description. The <profileDesc> element has the following components:

These components appear in individual text headers, since they describe features of particular texts.


3.5.1. The <creation> element

This element is used to record the date of first publication of electronic texts, and any details concerning the origination of the text, whether or not covered elsewhere.


3.5.2. The <langUsage> element

This element contains one or more <language> elements, each identifying a language used in the text:

Example:

      <langUsage>
          <language id="fr" iso639="fr">French</language>
          <language id="en" iso639="en">English</language>
          <language id="la" iso639="la">Latin</language>
      </langUsage>

The value of the id attribute on any <language> element should be given as a value for the global lang attribute when it is used on a tag in the text or header to refer to this language. For example,

              She ate <foreign lang=fr>croissants</foreign>

When more than one character set is used in a text, the wsd attribute should be used on each <language> tag to associate the language with a particular character set.
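
Since every value of the global lang attribute should match the id of a declared <language> element, a consistency check is straightforward. The Python sketch below is illustrative only: the function name undeclared_langs is hypothetical, the set of declared identifiers is passed in directly rather than parsed from the header, and a simple regular expression stands in for proper SGML parsing.

```python
import re

# Finds the value of a lang attribute inside a start-tag, with or
# without quotation marks around the value.
LANG_ATTR = re.compile(r"""<[A-Za-z][^<>]*?\slang=["']?([^\s"'>]+)""")


def undeclared_langs(declared_ids, tagged_text):
    """Return lang values used in the text but not declared in <langUsage>."""
    used = set(LANG_ATTR.findall(tagged_text))
    return sorted(used - set(declared_ids))


declared = {"fr", "en", "la"}
print(undeclared_langs(declared, "She ate <foreign lang=fr>croissants</foreign>"))  # []
```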


3.5.3. The <wsdUsage> element

This element contains one or more <writingSystem> elements, each identifying a character set used in the text:

Example:

      <wsdUsage>
          <writingSystem id="ISO 8859-1">ISO character set for western 
                   European languages</writingSystem>
          <writingSystem id="ISO 8859-5">ISO character set for 
                   Cyrillic</writingSystem>
      </wsdUsage>

The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set. For example,

       This is a patch of Cyrillic: 
       <foreign lang=bg wsd="ISO 8859-5">
         Големия 
         брат 
         те наблюдава
       </foreign>

When a Writing System Declaration describing a transcription scheme is provided as an auxiliary document, the value of the wsd attribute on the <writingSystem> element must be an entity that points to this document. Usually, the entity expands to be the name of the file in which the Writing System Declaration is stored. Note that for this reason, the type of the wsd attribute on the <writingSystem> element is ENTITY (indicating that its value must be an SGML entity). For all other elements in the header or text, the type of the global wsd attribute is CDATA.


3.5.4. The <textClass> element

This element contains references to the text classification scheme and descriptive keywords which together describe the text concerned. The following elements are used for these purposes:

The <keywords> element contains one or more technical terms:

3.5.5. The <translations> element

This element groups information about translations of the text which exist, usually within the same corpus. The following elements are used for these purposes:


3.5.6. The <annotations> element

This element groups information about annotation documents associated with the text. The following elements are used for these purposes:


3.6. The Revision description

The revision description is the fourth element in the header. It is used to record details of any significant change to the corpus. The <revisionDesc> element has the following component:

Multiple <change> elements are provided for; one should appear per change.

Unlike its counterpart in the TEI scheme, the <change> element must here contain

When any significant change is made to any component of the corpus, the following steps should be taken:


3.7. Use of decls attribute

The decls attribute is specified for the element <body> and the larger division elements (<div1> or <div2>).

It is used for two purposes:

Its value is a list of identifiers, each of which has been supplied elsewhere in a text or corpus header as the identifier for one of the following elements: <biblStruct>, <editorialDecl> and its constituents (<correction>, <hyphenation>, <quotation>, <segmentation> and <transduction>), and <textClass>.

For these elements, the corpus header will normally contain several mutually incompatible options, for example, several editorial declarations. Individual texts, or portions of texts, specify explicitly which of the available options applies to them by using the decls attribute. In cases where the set of declarable elements applies only within portions of a single text, they will be specified in the text header rather than the corpus header.

Declarable elements, once specified, are inherited by all sub-components. That is, if the decls attribute of a <body> element specifies a particular value for some declarable element, that value is understood to apply to all components of the text unless over-ridden. If the decls attribute of a <div1> within that text specifies a different value, the new value applies to the contents of that <div1> only; the value specified by the <body> applies to all subsequent <div1> elements in the same text, unless they also specify a different decls value.
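
The inheritance rule can be sketched as a walk from the <body> element down to the element of interest. The Python fragment below is an illustration under one assumption that the text implies but does not spell out: an identifier in an inner decls value overrides an inherited identifier only when both name the same kind of declarable element. The function name effective_decls and the identifiers q1, q2 and h1 are hypothetical.

```python
def effective_decls(path, kind_of):
    """Resolve the declarations in force for the innermost element of path.

    path    -- decls lists from the outermost element (<body>) to the
               innermost, e.g. [["q1", "h1"], ["q2"]]; use [] where the
               decls attribute is absent.
    kind_of -- maps each identifier to the kind of declarable element it
               names, e.g. {"q1": "quotation"}.
    """
    in_force = {}
    for decls in path:
        for ident in decls:
            in_force[kind_of[ident]] = ident  # inner values override outer
    return in_force


kind_of = {"q1": "quotation", "q2": "quotation", "h1": "hyphenation"}
# <body decls="q1 h1"> containing <div1 decls="q2">
print(effective_decls([["q1", "h1"], ["q2"]], kind_of))
```

Under this reading, the <div1> in the usage lines is governed by q2 for quotation and still inherits h1 for hyphenation, while a sibling <div1> without a decls attribute inherits both q1 and h1 from the <body>.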

For non-declarable elements, the header of an individual text will specify only those respects (if any) in which it differs from the defaults stated in the corpus header.

This is a simplification of the decls mechanism described in the TEI Guidelines.


3.8. Example


  <!doctype cesHeader PUBLIC "-//CES//DTD//cesHeader//EN" []>
  <cesHeader text.loc="/usr/multext/corpus/english/ORW23">
      <fileDesc>
           <titleStmt>
               <title>Machine-readable version of 1984, ch. 1</title>
               <respStmt>
                    <respType>typed in and marked with CES tags </respType>
                    <respName>A. Student</respName>
               </respStmt>
           </titleStmt>
           <extent>
               <bytecount>47992</bytecount>
               <wordcount>6571</wordcount>
           </extent>
           <publicationStmt>
               <distributor>Laboratoire Parole et Langage, CNRS</distributor>
               <address>29, avenue Robert Schuman
                        Aix-en-Provence, France
                        tel: +33 42 95 36 33
                        fax : +33 42 59 50 96
                        email: phonetic@univ-aix.fr</address>
               <availability status=restricted>
                   internal use only--cannot be distributed</availability>
               <date>6571</date>
           </publicationStmt>
           <sourceDesc>
               <biblStruct>
                    <monogr>
                         <title>Nineteen Eighty-four</title>
                         <author>George Orwell</author>
                         <imprint>
                              <pubPlace>New York</pubPlace>
                              <publisher>New American Library</publisher>
                              <date>1949; reprinted 1961</date>
                         </imprint>
                    </monogr>
               </biblStruct>
           </sourceDesc>
      </fileDesc>
      <encodingdesc>
           <projectdesc>
             This English version of the first chapter of Orwell's 1984 is 
             encoded for use in the MULTEXT-EAST project. The English is  
             to serve as the base for the parallel corpus, and will be aligned 
             to versions of the text in Romanian, Bulgarian, Estonian,  
             Slovenian, Czech, and Hungarian.
           </projectdesc>
           <editorialdecl>
               <conformance level=1>CES Level 1</conformance>
               <correction status=medium method=silent></correction>
               <quotation marks=none form=std>Rendition attribute values on Q 
                     and QUOTE tags are adapted from ISOpub and ISOnum standard 
                     entity set names
               </quotation>
               <segmentation>Marked up to the level of paragraph plus 
                     marking of particular sub-paragraph elements: NAME, DATE, 
                     FOREIGN.
               </segmentation>
           </editorialdecl>
           <tagsdecl>
               <tagusage gi=body occurs=1></tagusage>
               <tagusage gi=date occurs=5></tagusage>
               <tagusage gi=div1 occurs=1></tagusage>
               <tagusage gi=div2 occurs=1></tagusage>
               <tagusage gi=foreign occurs=4></tagusage>
               <tagusage gi=hi occurs=4></tagusage>
               <tagusage gi=name occurs=149></tagusage>
               <tagusage gi=note occurs=1></tagusage>
               <tagusage gi=num occurs=2></tagusage>
               <tagusage gi=p occurs=41></tagusage>
               <tagusage gi=ptr occurs=1></tagusage>
               <tagusage gi=q occurs=22></tagusage>
               <tagusage gi=quote occurs=3></tagusage>
           </tagsdecl>
      </encodingdesc>    
      <profiledesc>
           <langusage>
               <language id="fr" iso639="fr">French</language>
               <language id="en" iso639="en">English</language>
               <language id="la" iso639="la">Latin</language>
               <language id="ns">Newspeak</language>
           </langusage>
      </profiledesc>
  </cesHeader>



The CES Header DTD


