Basic Non-Structural Features
These are lower-level features which occur freely in texts; they are
typically bound to no particular place either in the text as a whole or
in some larger structure. Most have no consistent internal structure.
Like the structural features dealt with above, however, they are often
signalled by typographical conventions such as font shifts, quotation
marks, or layout. In general, we recommend tagging the underlying
feature, not its realization. Realization features are regularly lost
in converting a written text to machine-readable form using our scheme,
unless the rendition attribute is consistently used to
record the surface realization of each underlying feature. For details
on this attribute, see below, especially section .
Paragraphs and Their Contents
At the bottom of the hierarchy dealt with in Section 6.2.2 we find
paragraphs. These are tagged p.
Paragraphs have no firm internal structure; they contain prose encoded
as a mix of characters, entity references, phrases marked as described
in the rest of this chapter, and embedded elements like lists, figures,
or tables, which have internal structure, though they are not bound to
any particular position in the structure of the document. In the
sections which follow, various types of phrase marking are discussed,
including
- features commonly marked by font shifts
(section ),
- features commonly marked by quotation marks
(section ),
- other marking for words and phrases
(sections , ,
and ),
followed by discussions of various types of simple embedded structures
(for which the term crystal is used here):
- lists (section )
- notes (section )
- index entries (section )
- numbers and dates (section )
- other crystals (section )
Other embedded structures which can occur within paragraphs are treated
separately in this chapter:
- formulas, tables, and figures (section )
- bibliographic citations (section )
- traditional reference systems (section )
- cross references (section )
- apparatus for recording textual variants (section )
If a consistent internal subdivision of paragraphs is desired, the
s (segment) tag should be used. For discussion, see
section .
Highlighting and Related Features
Highlighted words or phrases are those made visibly different from the
rest of the text, typically by shifts in type font, handwriting style,
or ink color. They are marked in some way in order to draw the reader's
attention to them.
It is recommended that highlighted text be tagged with the underlying
feature signaled by the highlighting. The following tags may be used to
mark features often realized with highlighting:
emph
marks emphatic (stressed) words
foreign
marks words in a foreign language
(see further section )
cited.word
marks words mentioned, not used
term
marks words highlighted as technical terms
title
marks titles of books and journals
(see further section )
Quotations and glosses (see section ) are usually
marked with quotation marks in printed text, but may occasionally be
marked with font shifts. The tags q and gloss
should be used regardless of how they are rendered; if the rendition is
to be recorded, use the rendition attribute. See also
section . On the rendition attribute,
see further section .
If the underlying feature is unclear, or if presentational markup is
used, use the tag highlighted to mark highlighted text. It
has one optional attribute, rendition, which specifies how
the highlighting is realized, e.g. italic, underline, double underline,
etc. For a fuller discussion of this attribute see section .
As an example of the tags defined here, consider the following
sentence:
On the one hand the Nibelungenlied is associated with the new
rise of romance of twelfth-century France, the romans
d'antiquité, the romances of Chrétien de Troyes, and
the German adaptations of these works by Heinrich van Veldeke, Hartmann
von Aue, and Wolfram von Eschenbach.
Using descriptive tagging, the sentence might be encoded like this:
Nibelungenlied is associated
with the new rise of romance of twelfth-century France, the
romans d'antiquit&eacute;, the romances of
Chr&eacute;tien de Troyes, ...
Using presentational markup, it might look like this:
Nibelungenlied
is associated with the new rise of romance of twelfth-century
France, the romans
d'antiquit&eacute;, the romances of
Chr&eacute;tien de Troyes, and the German adaptations of these
works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram
von Eschenbach.
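With the tags named in this section, the two encodings might be sketched as follows. This is an illustrative reconstruction, not a normative form: the placement of the tags, the quoting of attribute values, the language code fr, and the rendition value italic are all assumptions.

```sgml
<!-- Descriptive tagging: the underlying features are marked -->
<p>On the one hand the <title>Nibelungenlied</title> is associated
with the new rise of romance of twelfth-century France, the
<foreign lang="fr">romans d'antiquit&eacute;</foreign>, the romances of
Chr&eacute;tien de Troyes, ...</p>

<!-- Presentational tagging: only the font shift is recorded -->
<p>On the one hand the
<highlighted rendition="italic">Nibelungenlied</highlighted>
is associated with the new rise of romance of twelfth-century
France, the <highlighted rendition="italic">romans
d'antiquit&eacute;</highlighted>, the romances of
Chr&eacute;tien de Troyes, ...</p>
```

The descriptive form keeps the distinction between a title and a foreign phrase, which the presentational form cannot recover from the font shift alone.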
Quotations and Related Features
Quotation marks are conventionally used to denote several different
features within a text. It is recommended that the underlying
feature be tagged, when possible, rather than the simple fact that
quotation marks appear in the text.
The most common and important use of quotation marks is to mark
quotations. A quotation is a piece of text attributed by the author or
narrator to another. The tag q should be used for a
quotation, no matter how it appears in the text. If it is desired to
record whether the quotation was printed in-line or set off
as a display or block quotation, the
rendition attribute should be used. See below.
Quotations embedded within quotations are treated in the same way as
ordinary quotations. Interruptions of the quotation by a narrator may
be tagged with the tag in.quot.
Quotations may be accompanied by a reference to the source or
speaker. For a description of how bibliographic references are handled,
see Section 6.6. If the source is not given in the text it may be added
as the value of the s (source or speaker) attribute.
Examples:
a harmless drudge.
Who-e debel you?--he at last said--you
no speak-e, damme, I kill-e.
And so saying, the lighted
tomahawk began flourishing about me in the dark.
Spaulding, he came down into the office just this
day eight weeks with this very paper in his hand, and he
says:--I wish to the Lord, Mr. Wilson, that
I was a red-headed man.
In the second example, the phrase he at last said interrupts the
direct quotation; in the third, the speaker (Wilson) quotes another
speaker (Spaulding).
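The three cases just described might be encoded roughly as below. The tag boundaries are a reconstruction, and the values given for the s attribute (Johnson, Wilson, Spaulding) are supplied here for illustration only; the exact form such values should take is not specified in this section.

```sgml
<!-- Source supplied by the encoder in the s attribute -->
<q s="Johnson">a harmless drudge.</q>

<!-- A narrator's interruption inside a quotation -->
<q>Who-e debel you?<in.quot>--he at last said--</in.quot>you
no speak-e, damme, I kill-e.</q>
And so saying, the lighted
tomahawk began flourishing about me in the dark.

<!-- A quotation embedded within a quotation -->
<q s="Wilson">Spaulding, he came down into the office just this
day eight weeks with this very paper in his hand, and he
says:--<q s="Spaulding">I wish to the Lord, Mr. Wilson, that
I was a red-headed man.</q></q>
```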
The creator of the electronic text must decide whether the quotation
marks are replaced by the tags or whether the tags are added and the
quotation marks kept. If the quotation marks are removed from the text,
the rendition attribute may be used to record the way in
which they were rendered in the copy text. This attribute is optional;
when it is used for quotations, the following special values may be used
to describe quotation-mark styles common in European and American
typesetting:
66U double inverted comma
6U single inverted comma
99U double apostrophe
9U single apostrophe
99L double comma
9L single comma
<< double guillemet open to the right
< single guillemet open to the right
>> double guillemet open to the left
> single guillemet open to the left
These may be combined to show how the quotation was opened and closed.
For example:
a
harmless drudge.
Other possible values for rendition of quoted material
include: display, mdash, and unmarked.
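A brief sketch of these rendition values in use. How two values are combined in one attribute is not shown in the surviving text; separating them with a space is an assumption made for this example.

```sgml
<!-- Opened with a double inverted comma, closed with a double apostrophe -->
<q rendition="66U 99U">a harmless drudge.</q>

<!-- Set off from the running text as a display quotation -->
<q rendition="display">a harmless drudge.</q>
```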
Other features often signaled by quotation marks may be marked with
the following tags:
cited.word
marks words mentioned, not used
so.called
marks words used in a special or ironic sense (e.g. She hated
`good' books.)
title.piece
marks titles of poems, articles in journals, chapters in books, and
other items published as part of larger wholes
(see further section )
Like q, these can carry the rendition attribute.
Where the underlying feature is not marked, the tag q.mark
should be used to record the presence of quotation marks in the text.
Like the other tags just discussed, it may have a rendition
attribute.
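As a sketch, the sentence used above to illustrate so.called could be tagged in either of two ways, depending on whether the encoder commits to an interpretation. The rendition value, combining the single inverted comma and single apostrophe codes with a space, is an assumed form.

```sgml
<!-- The encoder identifies the underlying feature -->
She hated <so.called rendition="6U 9U">good</so.called> books.

<!-- The encoder records only that quotation marks are present -->
She hated <q.mark rendition="6U 9U">good</q.mark> books.
```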
Foreign Words or Expressions
Words or phrases which are not in the main language of the text should
be tagged as such, at least where the fact is signaled in the text.
Where possible, the language shift should be indicated by attaching the
attribute lang to an existing element. See section
for discussion. Where there is no applicable element, the
tag foreign may be inserted, again using the
lang attribute to indicate the language of the foreign
words. Optionally, the usage attribute may be specified on
this tag to indicate how widely used the word or expression is. See
section for discussion of this attribute.
Example:
croissant
every morning.
Do not use foreign to tag words in foreign languages which
are mentioned (not used) in the text: use
cited.word with the lang attribute.
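The difference between a foreign word that is used and one that is mentioned might be sketched as follows. The opening of the first sentence is elided because it does not survive in the example above; the second sentence is hypothetical, invented for this illustration, and the language code fr is an assumption.

```sgml
<!-- Foreign word used in the sentence -->
... <foreign lang="fr">croissant</foreign>
every morning.

<!-- Foreign word mentioned, not used (hypothetical sentence) -->
The word <cited.word lang="fr">croissant</cited.word> means
<gloss>crescent</gloss>.
```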
Problems of languages, character sets, etc. are dealt with in chapter
.
Terms, Cited Words, and Glosses
Technical terms are often italicized or bolded upon first mention in
printed texts; an explanation or gloss is sometimes given in quotation
marks. Linguistic analyses conventionally cite words in languages under
discussion in italics, providing a gloss immediately following marked
with single quotation marks. Other texts in which individual words or
phrases are mentioned may mark them either with italics or with
quotation marks, and will gloss them less regularly.
The three tags term, cited.word, and
gloss are provided for marking these phenomena in texts. Use
term if the word is used in the sentence,
cited.word if it is merely mentioned.
Glosses may be separated in the text from the term or cited word they
gloss; to specify unambiguously what term is being glossed, the
attribute termid may be specified with the gloss
tag in free text: its value should be the ID value
specified in the term or cited.word tag used to
mark the word or phrase being glossed.
Examples:
parser, and much
of the history of NLP over the last 20 years has been occupied
with the design of parsers.
There is thus a striking accentual difference between a
verbal form like eluthemen we
were released, accented on the second
syllable of the word, and its participial derivative
lutheis
released, accented on the last.
Although Chomsky's decision that all NL sentences are finite
objects was never justified by arguments from the attested
properties of NLs, it did have a certain
social justification. It was
commonly assumed in works on logic until fairly recently
that the notion language is
necessarily restricted to finite strings.
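A sketch of how the second example might be tagged, with each gloss tied to its cited word through termid. The identifiers w1 and w2 are hypothetical, the attribute name id is assumed from the reference to an ID value above, and the tag placement is a reconstruction rather than the original encoding.

```sgml
There is thus a striking accentual difference between a
verbal form like <cited.word id="w1">eluthemen</cited.word>
<gloss termid="w1">we were released</gloss>, accented on the second
syllable of the word, and its participial derivative
<cited.word id="w2">lutheis</cited.word>
<gloss termid="w2">released</gloss>, accented on the last.
```

Here the glosses immediately follow the words they explain, so termid is redundant; it earns its place when a gloss is separated from its term.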
Names
Proper names may be tagged. It is desirable to distinguish between
different types of proper names, e.g. names of people and names of
places. The tag propname marks a proper name. It has these
attributes:
type takes values such as person, place, institution, product,
acronym
referent may be used to supply an identification of the person
or thing, using some canonical identifier scheme
normalized may be used to supply a normalized form of the
name if desired for onomastic or other study
Many proper names consist of more than one word, e.g. John Smith, New
York, and should be tagged as sequences.
Examples:
John Smith lives in
New York.
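The example might be tagged along these lines. The type values come from the list above; the value of normalized is a hypothetical inverted form chosen for illustration.

```sgml
<propname type="person" normalized="Smith, John">John Smith</propname>
lives in <propname type="place">New York</propname>.
```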
This method is adequate for simple applications such as producing
registers in a book. More work is needed on this subject to allow for
complex methods of handling names; this will be a topic of work during
the further development of the project.
Abbreviations
Abbreviations may or may not include full stops (periods). Groups of
abbreviated words may be tagged as a sequence with the tag
abbrev. This tag has two optional attributes:
full gives the expanded form of the abbreviation
type classifies the abbreviation using terms such as title,
initials, acronym, degree
Abbreviations such as Dr J C in Dr J C Smith may be
treated as one or two (Dr and J C) abbreviations.
Example:
Dr.
M. Deegan
is the Research Officer of the
CTI
Centre for Literature and Linguistic Studies.
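One way the example might be tagged is sketched below. The tag boundaries are a reconstruction, and the expansions given in the full attribute are supplied for illustration; they are not taken from the original encoding.

```sgml
<abbrev type="title" full="Doctor">Dr.</abbrev>
<abbrev type="initials">M.</abbrev> Deegan
is the Research Officer of the
<abbrev type="acronym" full="Computers in Teaching Initiative">CTI</abbrev>
Centre for Literature and Linguistic Studies.
```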
Lists
A list, denoted by the tag list, is a sequence of text items.
The items may be ordered (e.g. numbered or lettered) or unordered (e.g.
bulleted). The attribute type is used to specify the type
of list; the following special values are suggested for common cases:
ordered list items are numbered or lettered
bulleted list items are marked with a bullet or other
printer's dingbat or ornament
simple list items begin at the left margin but are not
numbered or bulleted
Other values may be used if needed.
Individual list items are tagged with item. The first
item may optionally be preceded by a head, which
gives a heading for the list. For ordered lists, the enumerator (either
a number or a letter) should be omitted (if the numbering is
unremarkable and may be reconstructed by any processing program) or
specified with the attribute N. Alternatively, the
enumerator may if desired be tagged with enum. The following
two examples are equivalent:
- First item in list.
- Second item in list.
- Third item in list.
1. - First item in list.
2. - Second item in list.
3. - Third item in list.
The two styles may not be mixed in the same list.
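The two styles might be written out as follows. This is a sketch: the global N attribute is written here in lower case, and the position of enum immediately before its item is inferred from the example above rather than stated in the text.

```sgml
<!-- Style 1: enumerators carried in the n attribute -->
<list type="ordered">
<item n="1">First item in list.</item>
<item n="2">Second item in list.</item>
<item n="3">Third item in list.</item>
</list>

<!-- Style 2: enumerators tagged explicitly with enum -->
<list type="ordered">
<enum>1.</enum> <item>First item in list.</item>
<enum>2.</enum> <item>Second item in list.</item>
<enum>3.</enum> <item>Third item in list.</item>
</list>
```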
In some lists, the individual items have internal structure. In
glossary lists, marked by the tag list.gl, each
item comprises a term and a gloss, marked with
gl.term and gl.gloss. These correspond to the
tags term and gloss, which can occur anywhere in
prose text. Special heading tags are required for glossary-list
headings: a general heading for the list may be given with
head, and headings for the term and gloss columns may be
given with term.head and gloss.head. Polyglot
wordlists may make use of the global lang attribute to
specify on the gl.term tag what language the term is from.
The standard two-letter language-name abbreviations defined by ISO nnnn
should be used, where they provide a code for the language in question.
Identifiers for other languages require further investigation.
For example:
The glosses are from A Literary Middle English Reader,
ed. Albert Stanburrough Cook (Boston: Ginn, 1915), p. 406. The example
shows a legal SGML form, but not a legal interchange form: for
interchange, the gl.term and gl.gloss elements
should be explicitly ended with /gl.term and
/gl.gloss tags.
Vocabulary
Middle English
New English
nu now
lhude loudly
bloweth blooms
med meadow
wude wood
awe ewe
lhouth lows
sterteth bounds, frisks
verteth pedit
murie merrily
swik cease
naver never
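In interchange form, with the gl.term and gl.gloss elements explicitly ended, the opening of this glossary list might look as follows. Only the first three entries are shown, and the absence of any wrapper element around each term and gloss pair is an assumption.

```sgml
<list.gl>
<head>Vocabulary</head>
<term.head>Middle English</term.head>
<gloss.head>New English</gloss.head>
<gl.term>nu</gl.term>      <gl.gloss>now</gl.gloss>
<gl.term>lhude</gl.term>   <gl.gloss>loudly</gl.gloss>
<gl.term>bloweth</gl.term> <gl.gloss>blooms</gl.gloss>
<!-- remaining entries follow the same pattern -->
</list.gl>
```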
Notes
Notes, footnotes, endnotes, marginalia, etc. are inserted in the text at
the point to which they refer and are marked by the tag note.
Footnotes and endnotes are tagged in the same way as other notes. These
almost always have an identifier or mark in the text, showing exactly
where the note applies. Marginalia, by contrast, may not be anchored to
an exact location. They may be in a different hand or typeface and
may have been added later. Here we recommend a simple system where
marginal notes are added before the relevant paragraph. Attributes
should indicate whether they are in the same format or hand as the
original and where on the page they occur. Further work is needed on
these questions, and fuller recommendations on them are deferred until a
later stage of the project.
The tag note has the following attributes:
type describes the type of note, with values such as:
annotation, gloss, explanation, preliminary, temporary
source identifies the author of the note, if different
from the author of the text. The values
- ed[itor]
- comp[iler]
- transcriber
- author (the default)
are suggested for common cases. Other values may
be used as needed. On editorial notes, see further
section .
place specifies where the note appears in the copy text.
The values
- foot
- end
- inline
- display (the default)
- left (for notes in left margin)
- right (for notes in right margin)
are suggested for common cases. Other values may
be used as needed, e.g. app1, app2 to
distinguish between annotations in separate apparatus.
anchored indicates whether the copy text shows the exact
place of reference for the note (anchored=yes)
or not (anchored=no)
If the symbol used in the copy text is to be recorded in the markup, the
global N attribute may be used.
Examples:
We explain below why we use the uncommon term
collection
instead of the expected
set.
Our usage corresponds to the aggregate
of many mathematical writings and to the sense of
class found in older logical
writings.
The elements ...
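A sketch of how such a note might be encoded at its point of reference. The text preceding the note is elided because it does not survive in the example; the attribute values (an explanatory footnote, anchored, marked 1 in the copy text) and the use of cited.word for the mentioned terms are assumptions made for illustration.

```sgml
... <note type="explanation" place="foot" anchored="yes" n="1">We explain
below why we use the uncommon term
<cited.word>collection</cited.word>
instead of the expected
<cited.word>set</cited.word>.
Our usage corresponds to the <cited.word>aggregate</cited.word>
of many mathematical writings and to the sense of
<cited.word>class</cited.word> found in older logical
writings.</note>
The elements ...
```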
As regards editorial notes and author notes, see section .
Index Entries
Machine-readable versions of existing texts rarely reproduce any index
published with the copy text, but it is convenient to be able to
generate a new index from a machine-readable text, whether the text is
being written for the first time with the tags here defined or was
transcribed from some other source. The index.term tag is
provided for this purpose; it may be useful for marking points of
particular interest for whatever reason, and not merely for generating
printed indices for a printed version of the text.
The tag index.term associates up to four levels of index
terms with a specific point in the text. The index terms are supplied
in attributes named level1, level2,
level3, and level4. An index
attribute associates the entry with a particular index, so multiple
indices are possible.
All index terms must be supplied as attribute values; no part of the
text itself is taken as a term. This may require words or phrases to be
repeated, as illustrated below:
and are beginning to build parsers.
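Because the index terms live entirely in attributes, the tag marks a point in the text and the indexed word is repeated in the attribute value. The sketch below is hypothetical in its details: the index name subject and the two index terms are invented for illustration, and the opening of the sentence is elided as in the example above.

```sgml
... and are beginning to build
<index.term index="subject" level1="parsers" level2="construction">parsers.
```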
Numbers and Dates
Like names or abbreviations, numbers and dates can occur virtually
anywhere in a text. They are special in that they can be written with
either letters or digits (twenty-one and 21) and
their presentation is language-dependent (e.g. English 5th
becomes Greek 5.; English 123,456.78 equals French
123.456,78).
Handling of numbers and dates can be problematic in natural-language
processing or machine-translation applications, where fully automatic
recognition is normally required. For these applications, some sort of
standardization is extremely helpful, since it allows the feature in the
text to be delimited and provides an appropriate encoding of its value.
The recommendations given here are intended to provide solutions
suitable for the basic needs of natural-language processing and machine
translation projects. The requirements of other applications concerned
with dates (e.g. historical research) are more complex; more
work in this area will be performed during the further development of
these Guidelines.
All numbers may be marked with the tag num. This tag has two
attributes:
type indicates the type of numeric value. Possible values
include:
cardinal (e.g. 21 or twenty-one)
ordinal (e.g. 5th)
fraction (e.g. 1/2)
percentage (e.g. ten percent)
These values should be used for numbers in these forms.
Other values may be used for this attribute as
necessary.
value supplies the value of the number in a standard form. The
form used for such values is application-dependent and
must be declared in the encoding.declarations
area of the TEI header if values are supplied. The
tag standard.numeric.values should be used to
describe the form of standard values or give a
bibliographic reference to such a description.
See
chapter for a discussion of the TEI
header and the encoding declarations area.
Examples:
twenty-one
1.5
1,5
ten percent
10%
5th
one half
1/2
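These examples might be tagged as sketched below. Since the form of standard values is application-dependent, the decimal notation used in the value attributes here is only one possible convention and would need to be declared in the header; the type values are chosen from the list above where one clearly applies.

```sgml
<num type="cardinal" value="21">twenty-one</num>
<num value="1.5">1.5</num>
<num value="1.5">1,5</num>
<num type="percentage" value="10">ten percent</num>
<num type="percentage" value="10">10%</num>
<num type="ordinal" value="5">5th</num>
<num type="fraction" value="0.5">one half</num>
<num type="fraction" value="0.5">1/2</num>
```

The value attribute is what lets a program treat 1.5 and 1,5, or one half and 1/2, as the same number regardless of how the copy text writes them.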
Simple dates may be marked with the tag date, which has the
following attributes:
type indicates the type of date provided. Possible values
include: Gregorian, Julian, Roman, Mosaic, Revolutionary,
Islamic, and so on.
value supplies the value of the date in a standard form. For
simple dates, the form used should be that of ISO/R
2014, which prescribes the form yyyy-mm-dd.
Such standard dates should usually be given in the
Gregorian calendar. If another form or calendar is used
for standard-form dates,
the standard.date.values tag should be
used to describe the standard form and calendar.
See
chapter for a discussion of the TEI
header and the encoding declarations area.
On partial dates and date ranges, see below.
certainty indicates the degree of certainty (optional) using
values such as: approx., ca., after, before, ...
If necessary, the mechanisms described in chapter may
be used to add tags for subdividing Gregorian or Julian dates into year,
month, and day.
Partial dates (e.g. 1990, September 1990) can be
expressed by omitting the unspecified fields (month, day) from the
value attribute.
Examples:
21 Feb 1980
Given on the Twelfth Day of June
in the Year of Our Lord One Thousand Nine Hundred and
Seventy-seven of the Republic the Two Hundredth and first
and of the University the Eighty-Sixth.
1990
September 1990
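With value attributes in the yyyy-mm-dd form, these examples might be encoded as follows. The extent of the date tag in the second example, and the truncated forms used for the two partial dates, are assumptions based on the description above.

```sgml
<date value="1980-02-21">21 Feb 1980</date>

Given on the <date value="1977-06-12">Twelfth Day of June
in the Year of Our Lord One Thousand Nine Hundred and
Seventy-seven</date> of the Republic the Two Hundredth and first
and of the University the Eighty-Sixth.

<!-- Partial dates: the unspecified fields are omitted -->
<date value="1990">1990</date>
<date value="1990-09">September 1990</date>
```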
These mechanisms are useful primarily for fully specified dates known
with certainty. Fully adequate methods for representing date ranges and
partially specified dates other than as noted above will require further
work during the future development of these Guidelines. As a first step
towards the representation of date ranges, the tag date.range
may be used. This marks expressions which specify ranges of dates, and
takes the following attributes:
from beginning of the date range in standard form
to end of the date range in standard form
from.cert certainty expression for start of range
to.cert certainty expression for end of range
For example:
the second half of the thirteenth century,
and Hervarar saga dates from
around 1300.
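A sketch of how this sentence might be tagged. The opening of the sentence is elided as in the example above; the year-only values given for from and to, the reading of the second half of the century as 1251 to 1300, and the use of the title tag are assumptions made for illustration.

```sgml
... <date.range from="1251" to="1300">the second half of the
thirteenth century</date.range>,
and <title>Hervarar saga</title> dates from
<date value="1300" certainty="ca.">around 1300</date>.
```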
Other Crystals
Dates, numbers, bibliographic citations, and names are all
crystals: small objects with internal structure containing
particular semantically constrained sorts of data. Other crystals which
might under some circumstances require marking in text include:
- addresses (city, country-subdivision, post code, street address,
postbox, telephone, etc.)
- names (subdivided into titles, family and personal names, name
suffixes, etc.)
- organization names (organization name, department, division, address,
etc.)
- meetings and conferences (sponsoring organization, organizer,
conference or meeting name, number, date, location, etc.)
More work is needed to develop useful markup for these and similar
crystals. No specific recommendations are made at this time; interested
encoders are referred to the Formex markup scheme for useful treatments
of organizations, addresses and meetings, and to the various codes for
descriptive cataloguing in libraries for detailed analyses of personal
and corporate names, on the basis of which tags may be added to the TEI
scheme using the mechanisms described in chapter .