Basic Non-Structural Features

These are lower-level features which occur freely in texts; they are typically bound to no particular place either in the text as a whole or in some larger structure. Most have no consistent internal structure. Like the structural features dealt with above, however, they are often signalled by typographical conventions such as font shifts, quotation marks, or layout. In general, we recommend tagging the underlying feature, not its realization. Realization features are regularly lost in converting a written text to machine-readable form using our scheme, unless the rendition attribute is consistently used to record the surface realization of each underlying feature. For details on this attribute, see below, especially section .

Paragraphs and Their Contents

At the bottom of the hierarchy dealt with in Section 6.2.2 we find paragraphs. These are tagged p.

Paragraphs have no firm internal structure; they contain prose encoded as a mix of characters, entity references, phrases marked as described in the rest of this chapter, and embedded elements like lists, figures, or tables, which have internal structure, though they are not bound to any particular position in the structure of the document. In the sections which follow, various types of phrase marking are discussed, followed by discussions of various types of simple embedded structures (for which the term crystal is used here). Other embedded structures which can occur within paragraphs are treated separately in this chapter.

If a consistent internal subdivision of paragraphs is desired, the s (segment) tag should be used. For discussion, see section .

Highlighting and Related Features

Highlighted words or phrases are those made visibly different from the rest of the text, typically by shifts in type font, handwriting style, or ink color. They are marked in some way in order to draw the reader's attention to them.

It is recommended that highlighted text be tagged with the underlying feature signaled by the highlighting. The following tags may be used to mark features often realized with highlighting:

  • emph marks emphatic (stressed) words
  • foreign marks words in a foreign language (see further section )
  • cited.word marks words mentioned, not used
  • term marks words highlighted as technical terms
  • title marks titles of books and journals (see further section )

Quotations and glosses (see section ) are usually marked with quotation marks in printed text, but may occasionally be marked with font shifts. The tags q and gloss should be used regardless of how they are rendered; if the rendition is to be recorded, use the rendition attribute. See also section . On the rendition attribute, see further section .

If the underlying feature is unclear, or if presentational markup is used, use the tag highlighted to mark highlighted text. It has one optional attribute, rendition, which specifies how the highlighting is realized, e.g. italic, underline, double underline, etc. For a fuller discussion of this attribute see section .

As an example of the tags defined here, consider the following sentence:

On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.

Using descriptive tagging, the sentence might be encoded like this:

<![CDATA[ On the one hand the <title>Nibelungenlied</title> is associated with the new rise of romance of twelfth-century France, the <foreign>romans d'antiquit&eacute;</foreign>, the romances of Chr&eacute;tien de Troyes, ... ]]>

Using presentational markup, it might look like this:

<![CDATA[ On the one hand the <highlighted rendition=italic>Nibelungenlied</highlighted> is associated with the new rise of romance of twelfth-century France, the <highlighted rendition=italic>romans d'antiquit&eacute;</highlighted>, the romances of Chr&eacute;tien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach. ]]>

Quotations and Related Features

Quotation marks are conventionally used to denote several different features within a text. It is recommended that the underlying feature be tagged, when possible, rather than the simple fact that quotation marks appear in the text.

The most common and important use of quotation marks is to mark quotations. A quotation is a piece of text attributed by the author or narrator to another. The tag q should be used for a quotation, no matter how it appears in the text. If it is desired to record whether the quotation was printed in-line or set off as a display or block quotation, the rendition attribute should be used. See below.

Quotations embedded within quotations are treated in the same way as ordinary quotations. Interruptions of the quotation by a narrator may be tagged with the tag in.quot.

Quotations may be accompanied by a reference to the source or speaker. For a description of how bibliographic references are handled, see Section 6.6. If the source is not given in the text it may be added as the value of the s (source or speaker) attribute.

Examples:

<![CDATA[
Few dictionary makers are likely to forget Dr. Johnson's description of the lexicographer as <q>a harmless drudge.</q>

<p> <q>Who-e debel you?<in.quot>--he at last said--</in.quot>you no speak-e, damme, I kill-e.</q> And so saying, the lighted tomahawk began flourishing about me in the dark.

<q s=Wilson>Spaulding, he came down into the office just this day eight weeks with this very paper in his hand, and he says:--<q s=Spaulding>I wish to the Lord, Mr. Wilson, that I was a red-headed man.</q></q>
]]>

In the second example, the phrase he at last said interrupts the direct quotation; in the third, the speaker (Wilson) quotes another speaker (Spaulding).

The creator of the electronic text must decide whether the quotation marks are replaced by the tags or whether the tags are added and the quotation marks kept. If the quotation marks are removed from the text, the rendition attribute may be used to record the way in which they were rendered in the copy text. This attribute is optional; when it is used for quotations, the following special values may be used to describe quotation-mark styles common in European and American typesetting:

  • 66U double inverted comma
  • 6U single inverted comma
  • 99U double apostrophe
  • 9U single apostrophe
  • 99L double comma
  • 9L single comma
  • << double guillemet open to the right
  • < single guillemet open to the right
  • >> double guillemet open to the left
  • > single guillemet open to the left

These may be combined to show how the quotation was opened and closed. For example:

<![CDATA[ Few dictionary makers are allowed to forget Dr Johnson's description of the lexicographer as <q rendition='6U 9U'>a harmless drudge.</q> ]]>

Other possible values for rendition of quoted material include: display, mdash, and unmarked.

Other features often signaled by quotation marks may be marked with the following tags:

  • cited.word marks words mentioned, not used
  • so.called marks words used in a special or ironic sense (e.g. She hated `good' books.)
  • title.piece marks titles of poems, articles in journals, chapters in books, and other items published as part of larger wholes (see further section )

Like q, these can carry the rendition attribute.

Where the underlying feature is not marked, the tag q.mark should be used to record the presence of quotation marks in the text. Like the other tags just discussed, it may have a rendition attribute.
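
As an illustration, the sentence cited above for so.called could be encoded as follows when the encoder chooses only to record that quotation marks are present, without deciding what they signal. This is a sketch: the rendition value is borrowed from the quotation-mark styles listed earlier, and its use on q.mark in exactly this form is an assumption.

<![CDATA[ She hated <q.mark rendition='6U 9U'>good</q.mark> books. ]]>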

Foreign Words or Expressions

Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is signaled in the text. Where possible, the language shift should be indicated by attaching the attribute lang to an existing element. See section for discussion. Where there is no applicable element, the tag foreign may be inserted, again using the lang attribute to indicate the language of the foreign words. Optionally, the usage attribute may be specified on this tag to indicate how widely used the word or expression is. See section for discussion of this attribute.

Example:

<![CDATA[ John eats a <foreign lang=FR usage=common>croissant</foreign> every morning. ]]>

Do not use foreign to tag words in foreign languages which are mentioned (not used) in the text: use cited.word with the lang attribute.
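
A mentioned foreign word might therefore be encoded along the following lines. The sentence is invented for illustration, and the language code simply follows the form used in the example above.

<![CDATA[ The French word <cited.word lang=FR>croissant</cited.word> means <gloss>crescent</gloss>. ]]>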

Problems of languages, character sets, etc. are dealt with in chapter .

Terms, Cited Words, and Glosses

Technical terms are often italicized or bolded upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in the languages under discussion in italics, followed immediately by a gloss in single quotation marks. Other texts in which individual words or phrases are mentioned may mark them either with italics or with quotation marks, and will gloss them less regularly.

The three tags term, cited.word, and gloss are provided for marking these phenomena in texts. Use term if the word is used in the sentence, cited.word if it is merely mentioned.

Glosses may be separated in the text from the term or cited word they gloss; to specify unambiguously what term is being glossed, the attribute termid may be specified with the gloss tag in free text: its value should be the ID value specified in the term or cited.word tag used to mark the word or phrase being glossed.

Examples:

<![CDATA[
A computational device that infers structure from grammatical strings of words is known as a <term>parser</term>, and much of the history of <abbrev type=acronym full='natural language processing'>NLP</abbrev> over the last 20 years has been occupied with the design of parsers.

There is thus a striking accentual difference between a verbal form like <cited.word lang=Greek id=cw234>eluthemen</cited.word> <gloss termid=cw234>we were released,</gloss> accented on the second syllable of the word, and its participial derivative <cited.word id=cw235 lang=Greek>lutheis</cited.word> <gloss termid=cw235>released,</gloss> accented on the last.

Although Chomsky's decision that all NL sentences are finite objects was never justified by arguments from the attested properties of NLs, it did have a certain <so.called>social</so.called> justification.

It was commonly assumed in works on logic until fairly recently that the notion <cited.word>language</cited.word> is necessarily restricted to finite strings.
]]>

Names

Proper names may be tagged. It is desirable to distinguish between different types of proper names, e.g. names of people and names of places. The tag propname marks a proper name. It has these attributes:

  • type takes values such as person, place, institution, product, acronym
  • referent may be used to supply an identification of the person or thing, using some canonical identifier scheme
  • normalized may be used to supply a normalized form of the name if desired for onomastic or other study

Many proper names consist of more than one word, e.g. John Smith, New York, and should be tagged as sequences. Examples:

<![CDATA[ <propname TYPE = person>John Smith</propname> lives in <propname TYPE = place>New York</propname>. ]]>

This method is adequate for simple applications such as producing registers in a book. More work is needed on this subject to allow for complex methods of handling names; this will be a topic of work during the further development of the project.

Abbreviations

Abbreviations may or may not include full stops (periods). Groups of abbreviated words may be tagged as a sequence with the tag abbrev. This tag has two optional attributes:

  • full gives the expanded form of the abbreviation
  • type classifies the abbreviation using terms such as title, initials, acronym, degree

Abbreviations such as Dr J C in Dr J C Smith may be treated as one or two (Dr and J C) abbreviations. Example:

<![CDATA[ <propname type=person> <abbrev TYPE = title>Dr.</abbrev> <abbrev TYPE = initials>M. </abbrev> Deegan</propname> is the Research Officer of the <abbrev full='Computers in Teaching Initiative' type=acronym>CTI</abbrev> Centre for Literature and Linguistic Studies. ]]>

Lists

A list, denoted by the tag list, is a sequence of text items. The items may be ordered (e.g. numbered or lettered) or unordered (e.g. bulleted). The attribute type is used to specify the type of list; the following special values are suggested for common cases:

  • ordered: list items are numbered or lettered
  • bulleted: list items are marked with a bullet or other printer's dingbat or ornament
  • simple: list items begin at the left margin but are not numbered or bulleted

Other values may be used if needed.

Individual list items are tagged with item. The first item may optionally be preceded by a head, which gives a heading for the list. For ordered lists, the enumerator (either a number or a letter) should be omitted (if the numbering is unremarkable and may be reconstructed by any processing program) or specified with the attribute N. Alternatively, the enumerator may if desired be tagged with enum. The following two examples are equivalent:

<![CDATA[
<list type=ordered>
<item n=1>First item in list.</item>
<item n=2>Second item in list.</item>
<item n=3>Third item in list.</item>
</list>

<list type=ordered>
<enum>1.</enum> <item>First item in list.</item>
<enum>2.</enum> <item>Second item in list.</item>
<enum>3.</enum> <item>Third item in list.</item>
</list>
]]>

The two styles may not be mixed in the same list.
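
A list may also carry a heading and use one of the other suggested type values. The following sketch shows a bulleted list preceded by a head; the heading text and the choice of items are invented for illustration.

<![CDATA[
<list type=bulleted>
<head>Phrase-level tags</head>
<item>emph</item>
<item>foreign</item>
<item>term</item>
</list>
]]>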

In some lists, the individual items have internal structure. In glossary lists, marked by the tag list.gl, each item comprises a term and a gloss, marked with gl.term and gl.gloss. These correspond to the tags term and gloss, which can occur anywhere in prose text. Special heading tags are required for glossary-list headings: a general heading for the list may be given with head, and headings for the term and gloss columns may be given with term.head and gloss.head. Polyglot wordlists may make use of the global lang attribute to specify on the gl.term tag what language the term is from. The standard two-letter language-name abbreviations defined by ISO nnnn should be used, where they provide a code for the language in question. Identifiers for other languages require further investigation.

The following example takes its glosses from A Literary Middle English Reader, ed. Albert Stanburrough Cook (Boston: Ginn, 1915), p. 406. It shows a legal SGML form, but not a legal interchange form: for interchange, the gl.term and gl.gloss elements should be explicitly ended with /gl.term and /gl.gloss tags.

<![CDATA[
<list.gl>
<head>Vocabulary</head>
<term.head>Middle English</term.head>
<gloss.head>New English</gloss.head>
<gl.term>nu <gl.gloss>now
<gl.term>lhude <gl.gloss>loudly
<gl.term>bloweth <gl.gloss>blooms
<gl.term>med <gl.gloss>meadow
<gl.term>wude <gl.gloss>wood
<gl.term>awe <gl.gloss>ewe
<gl.term>lhouth <gl.gloss>lows
<gl.term>sterteth <gl.gloss>bounds, frisks
<gl.term>verteth <gl.gloss><foreign lang=Latin>pedit</foreign>
<gl.term>murie <gl.gloss>merrily
<gl.term>swik <gl.gloss>cease
<gl.term>naver <gl.gloss>never
</list.gl>
]]>

Notes

Notes, footnotes, endnotes, marginalia, etc. are inserted in the text at the point to which they refer and are marked by the tag note. Footnotes and endnotes are tagged in the same way as other notes. These almost always have an identifier or mark in the text, showing exactly where the note applies. Marginalia, by contrast, may not be anchored to an exact location. They may be in a different hand or typeface and may have been added later. Here we recommend a simple system where marginal notes are added before the relevant paragraph. Attributes should indicate whether the notes are in the same format or hand as the original and where on the page they occur. Further work is needed on these questions, and fuller recommendations on them are deferred until a later stage of the project.

The tag note has the following attributes:

type describes the type of note, with values such as: annotation, gloss, explanation, preliminary, temporary.

source identifies the author of the note, if different from the author of the text. The values

  • ed[itor]
  • comp[iler]
  • transcriber
  • author (the default)

are suggested for common cases. Other values may be used as needed. On editorial notes, see further section .

place specifies where the note appears in the copy text. The values

  • foot
  • end
  • inline
  • display (the default)
  • left (for notes in left margin)
  • right (for notes in right margin)

are suggested for common cases. Other values may be used as needed, e.g. app1, app2 to distinguish between annotations in separate apparatus.

anchored indicates whether the copy text shows the exact place of reference for the note (anchored=yes) or not (anchored=no).

If the symbol used in the copy text is to be recorded in the markup, the global N attribute may be used.

Examples:

<![CDATA[ Collections are ensembles of distinct entities or objects of any sort. <note place=foot n=1> We explain below why we use the uncommon term <cited.word rendition='6U 9U'>collection</cited.word> instead of the expected <cited.word rendition='6U 9U'>set</cited.word>. Our usage corresponds to the <cited.word>aggregate</cited.word> of many mathematical writings and to the sense of <cited.word>class</cited.word> found in older logical writings. </note> The elements ... ]]>
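
A marginal note with no exact point of reference might be encoded along the following lines, placed before the paragraph to which it relates. This is only a sketch: the wording of the note is invented, and the attribute values are chosen from those suggested above.

<![CDATA[ <note type=annotation place=right anchored=no>Compare the older logical writings on classes.</note> <p>Collections are ensembles of distinct entities or objects of any sort. ... ]]>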

As regards editorial notes and author notes, see section .

Index Entries

Machine-readable versions of existing texts rarely reproduce any index published with the copy text, but it is convenient to be able to generate a new index from a machine-readable text, whether the text is being written for the first time with the tags here defined or was transcribed from some other source. The index.term tag is provided for this purpose; it may be useful for marking points of particular interest for whatever reason, and not merely for generating printed indices for a printed version of the text.

The tag index.term associates up to four levels of index terms with a specific point in the text. The index terms are supplied in attributes named level1, level2, level3, and level4. An index attribute associates the entry with a particular index, so multiple indices are possible.

All index terms must be supplied as attribute values; no part of the text itself is taken as a term. This may require words or phrases to be repeated, as illustrated below:

<![CDATA[ The students understand procedures for Arabic lemmatisation <index.term level1='Arabic lemmatization'> and are beginning to build parsers. ]]>
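
Subordinate index levels and the index attribute might be combined as in the following sketch. The index name subjects and the term values are invented for illustration; no particular values are prescribed here.

<![CDATA[ The students understand procedures for Arabic lemmatisation <index.term index=subjects level1='Arabic' level2='lemmatization'> and are beginning to build parsers <index.term index=subjects level1='parsers'>. ]]>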

Numbers and Dates

Like names or abbreviations, numbers and dates can occur virtually anywhere in a text. They are special in that they can be written with either letters or digits (twenty-one and 21) and their presentation is language-dependent (e.g. English 5th becomes Greek 5.; English 123,456.78 equals French 123.456,78).

Handling of numbers and dates can be problematic in natural-language processing or machine-translation applications, where fully automatic recognition is normally required. For these applications, some sort of standardization is extremely helpful, since it allows the feature in the text to be delimited and provides an appropriate encoding of its value. The recommendations given here are intended to provide solutions suitable for the basic needs of natural-language processing and machine translation projects. The requirements of other applications concerned with dates (e.g. historical research) are more complex; more work in this area will be performed during the further development of these Guidelines.

All numbers may be marked with the tag num. This tag has two attributes:

type indicates the type of numeric value. Possible values include:

  • cardinal (e.g. 21 or twenty-one)
  • ordinal (e.g. 5th)
  • fraction (e.g. 1/2)
  • percentage (e.g. ten percent)

These values should be used for numbers in these forms. Other values may be used for this attribute as necessary.

value supplies the value of the number in a standard form. The form used for such values is application-dependent and must be declared in the encoding.declarations area of the TEI header if values are supplied. The tag standard.numeric.values should be used to describe the form of standard values or give a bibliographic reference to such a description. See chapter for a discussion of the TEI header and the encoding declarations area.

Examples:

<![CDATA[
<num type=cardinal value='21'>twenty-one</num>
<num type=cardinal value='1,5'>1.5</num>
<num type=cardinal value='1,5'>1,5</num>
<num type=percentage value='10'>ten percent</num>
<num type=percentage value='10'>10%</num>
<num type=ordinal value='5'>5th</num>
<num type=fraction value='0,5'>one half</num>
<num type=fraction value='0,5'>1/2</num>
]]>

Simple dates may be marked with the tag date, which has the following attributes:

  • type indicates the type of date provided. Possible values include: Gregorian, Julian, Roman, Mosaic, Revolutionary, Islamic, and so on.
  • value supplies the value of the date in a standard form. For simple dates, the form used should be that of ISO/R 2014, which prescribes the form yyyy-mm-dd. Such standard dates should usually be given in the Gregorian calendar. If another form or calendar is used for standard-form dates, the standard.date.values tag should be used to describe the standard form and calendar. See chapter for a discussion of the TEI header and the encoding declarations area. On partial dates and date ranges, see below.
  • certainty indicates the degree of certainty (optional) using values such as: approx., ca., after, before, ...

If necessary, the mechanisms described in chapter may be used to add tags for subdividing Gregorian or Julian dates into year, month, and day.

Partial dates (e.g. 1990, September 1990) can be expressed by omitting the missing fields from the value attribute.

Examples:

<![CDATA[
<date value='1980-02-21'>21 Feb 1980</date>

Given on the <date value='1977-06-12'>Twelfth Day of June in the Year of Our Lord One Thousand Nine Hundred and Seventy-seven of the Republic the Two Hundredth and first and of the University the Eighty-Sixth.</date>

<date value='1990'>1990</date>
<date value='1990-09'>September 1990</date>
]]>

These mechanisms are useful primarily for fully specified dates known with certainty. Fully adequate methods for representing date ranges and partially specified dates other than as noted above will require further work during the future development of these Guidelines. As a first step towards the representation of date ranges, the tag date.range may be used. This marks expressions which specify ranges of dates, and takes the following attributes:

  • from: beginning of the date range in standard form
  • to: end of the date range in standard form
  • from.cert: certainty expression for start of range
  • to.cert: certainty expression for end of range

For example:

<![CDATA[ The Eddic poems are preserved in a unique manuscript (Codex Regius 2365, 4o) from <date.range from=1250 to=1300 from.cert=approx to.cert=approx> the second half of the thirteenth century</date.range>, and <title>Hervarar saga</title> dates from <date value=1300 certainty=approx>around 1300</date>. ]]>

Other Crystals

Dates, numbers, bibliographic citations, and names are all crystals: small objects with internal structure containing particular semantically constrained sorts of data. Other crystals which might under some circumstances require marking in text include:

  • addresses (city, country-subdivision, post code, street address, postbox, telephone, etc.)
  • names (subdivided into titles, family and personal names, name suffixes, etc.)
  • organization names (organization name, department, division, address, etc.)
  • meetings and conferences (sponsoring organization, organizer, conference or meeting name, number, date, location, etc.)

More work is needed to develop useful markup for these and similar crystals. No specific recommendations are made at this time; interested encoders are referred to the Formex markup scheme for useful treatments of organizations, addresses and meetings, and to the various codes for descriptive cataloguing in libraries for detailed analyses of personal and corporate names, on the basis of which tags may be added to the TEI scheme using the mechanisms described in chapter .