SGML Markup

<!-- TEI P1, sec. 3.1                                           -->
<!-- Title:  A Gentle Introduction to SGML                      -->
<!-- Drafted:  LB, March-April 1990                             -->
<!-- ********************************************************** -->
<!-- Revision History (add lines at top)                        -->
<!-- Date      Who    What                                      -->
<!--  2 May 90 CMSMcQ move fn out of xmp line 219               -->
<!--  1 May 90 CMSMcQ into draft 0                              -->
<!-- ********************************************************** -->
<!-- Date:     Sun, 29 Apr 90  19:22 GMT                        -->
<!-- From:     Lou Burnard <LOU@VAX.OXFORD.AC.UK>               -->
<!-- To:       U35395@UICVM                                     -->
<!-- Subject:  draft for 3.1                                    -->
<!--                                                            -->
<!-- <teidoc00 status=draft>                                    -->
<h1 id=z30>SGML Markup
<h2 id=z31>A gentle introduction to SGML
<p>
SGML<fn>Although widely rumoured to be short for the
surnames of its progenitors, the official expansion of this
abbreviation is <q>Standard Generalized Markup
Language</q></fn> is an international standard for the
description of marked-up electronic text. To be more precise,
SGML is a metalanguage, that is, a means of formally
describing a language, in this case, a markup language, as
described in Chapter <hdref refid=z10>. The present chapter, while
falling far short of the rigour of the international standard
itself<fn><cit>International Organization for Standardization:
ISO 8879 Information processing - Text and office systems -
Standard Generalized Markup Language (SGML), 1986</cit></fn>,
attempts to give an informal introduction to those parts of it
of which a proper understanding is necessary to make best use
of these Guidelines. A short SGML reading list is also
provided, in section <hdref refid=z319> below.
<h3 id=z311>What's special about SGML?
<p>
There are three characteristics of SGML which distinguish it
from other markup languages: its use of descriptive rather
than procedural markup; its <q>document type</q> concept; and
its independence of any one representation system. These three
aspects are discussed briefly below, and then in more depth in
sections <hdref refid=z313> and <hdref refid=z317>.
<h4 id=z3111>Descriptive markup
<p>
A descriptive markup system uses markup codes which simply
assert something about the parts of a document they identify:
<q>the following item is a blort</q>, <q>this is the end of the most
recently begun flapdoodle</q>, etc. By contrast, a procedural
markup system defines what processing is to be carried out at
particular points in a document: <q>call procedure blort with
parameters 1, b and x here</q>, <q>terminate the flapdoodle
procedure here</q>, etc. In SGML, the instructions needed to
process a document for some particular purpose (for example,
to format it) are sharply distinguished from the descriptive
markup which occurs within the document. Usually, they are
collected outside the document in separate procedures or
programs.
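<p>
The contrast may be sketched with a single title marked up in
both styles. The first fragment below is purely illustrative:
the commands <q>centre</q>, <q>font</q> and <q>skip</q> are
invented for the purpose and belong to no particular formatter.
The second fragment uses a descriptive markup code of the kind
discussed in the rest of this chapter.
<xmp><![ CDATA [

	.centre on
	.font bold
	A counterpoint
	.font normal
	.centre off
	.skip 2

	<title>A counterpoint</title>

]]>
</xmp>
The procedural version records only how one formatter is to
print the line; the descriptive version records what the line
is, and leaves each program free to decide what to do with it.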
<p>
This means that the same document can be processed by many
different pieces of software, each of which can apply
different processing instructions to those parts of it which
are considered relevant. For example, a content analysis
program might wish to disregard entirely the footnotes
embedded in an annotated text, while a formatting program
might wish to extract and collect them all together for printing
at a specific point during the processing of the document.
Different sorts of processing instructions can be associated
with the same parts of the file. For example, one program
might wish to extract names of persons and places from a
document to create an index or database, while another,
operating on the same text, might wish to print names of
persons and places in a distinctive typeface.
<h4 id=z3112>Types of document
<p>
Secondly, SGML introduces the notion of a document type, and
hence a <term>document type definition</term> (DTD). Documents are
regarded as having types, just as other objects processed by
computers do. The type of a document is formally defined by its
constituent parts and their structure. The definition of a
report, for example, might be that it consisted of a title and
possibly an author, followed by an abstract and a sequence of
one or more paragraphs. Anything lacking a title, according to
this formal definition, would not formally be a report, and
neither would a sequence of paragraphs followed by an abstract
-  whatever other report-like characteristics these might have
for the human reader.
<p>
If documents are of known types, a special purpose program
(called a <term>parser</term>) can be used to process a
document claiming to be of a particular type and check that
all the elements required for that document type are indeed
present and correctly ordered.  More significantly, different
documents of the same type can be processed in a uniform way.
Programs can be written which take advantage of the knowledge
encapsulated in the document structure information, and which
can thus behave in a more intelligent fashion.
<h4 id=z3113>Data independence
<p>
A basic design goal of SGML was to ensure that documents
encoded according to its provisions should be transportable
from one hardware/software environment to another without loss
of information. The two features discussed so far both address
this requirement at an abstract level: the third feature
addresses it at the level of the strings of bytes (characters)
of which documents are composed. SGML provides a general
purpose mechanism for <q>string substitution</q>, that is, a
simple machine-independent way of stating that a particular
string of characters in the document should be replaced by
some other string when the document is processed. One obvious
application for this mechanism is to ensure consistency of
nomenclature; another, more significant, is to counter the
notorious inability of different computer systems to
understand each other's character sets, or of any one system
to provide all the graphic characters needed for a particular
application, by providing descriptive mappings for
non-portable characters.
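<p>
A minimal sketch of this mechanism follows; SGML calls such
named strings <term>entities</term>. The names <q>tei</q> and
<q>eacute</q> and the replacement strings are chosen here only
for illustration. In particular, the string to which an
accented letter should be mapped depends on the system which
processes the document.
<xmp><![ CDATA [

	<!ENTITY tei    "Text Encoding Initiative">
	<!ENTITY eacute SDATA "[eacute]">

	The &tei; Guidelines mention the word caf&eacute; once.

]]>
</xmp>
The first two lines declare the substitutions. Within the text,
the string &amp;tei; is replaced by the full name when the
document is processed, and &amp;eacute; by whatever
representation of the accented letter the local system provides.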
 
<h3 id=z312>Textual structures
<p>
A text is not an undifferentiated sequence of words, much less
of bytes. For different purposes, it may be divided into many
different units, of different types or sizes. A prose text
such as this one might be divided into sentences, paragraphs,
chapters and sections. A verse text might be divided into
lines, stanzas and cantos. Once printed, sequences of prose
and verse might be divided into pages, gatherings or volumes.
<p>
Such units as these are most often used to identify specific
locations or reference points within a text (the third
sentence of the second paragraph in chapter ten; canto 10,
line 1234; page 412 etc.) but they may also be used to
subdivide a text into meaningful fragments for analytic
purposes (is the average sentence length of section 2
different from that of section 5? how many paragraphs separate
each occurrence of the word <q>blort</q>? how many pages?).
Other structural units are more clearly analytic, in that they
characterise a section of a text. An analysis of a dramatic text
might regard each speech by a different character as a unit of one
kind, and each stage direction or piece of action as a unit of
another kind.
The purpose of such an analysis is less likely to be to locate
parts of the text (<q>the 93rd speech by Horatio in Act 2</q>)
than to facilitate comparisons between the words used by one
character and those of another, or those used by the same
character at different points of the play.
<p>
In a prose text one might similarly wish to regard as units of
different types passages in direct or indirect speech,
passages employing different stylistic registers (narrative,
polemic, commentary, argument etc.), passages of different
authorship and so forth. And for certain types of analysis
(most notably textual criticism) the physical appearance of
one particular printed or manuscript source may be of
importance: paradoxically, one may wish to use descriptive
markup to describe presentational features such as typeface,
linebreaks, use of white space and so forth.
<p>
These textual structures overlap with each other in complex
and unpredictable ways. Particularly when dealing with texts
as instantiated by paper technology, the reader needs to be
aware of both the physical organisation of the book and the
logical structure of the work it contains. Many great works
(Sterne's <q>Tristram Shandy</q> for example) cannot be fully
appreciated without an awareness of the interplay between
narrative units (such as chapters or paragraphs) and page
divisions. For many types of research, it is the interplay
between different levels of analysis which is crucial: the
extent to which syntactic structure and narrative structure
mesh, or fail to mesh, for example, or the extent to which
phonological structures reflect morphology.
 
<h3 id=z313>SGML structures
<p>
SGML provides a simple and consistent mechanism for the markup
or identification of all such textual units, and also a method
of expressing rules which define how combinations of such
units can meaningfully occur in any text. The technical term
used in the SGML Standard for a textual unit, viewed as a
structural component, is <term>element</term>. Different types
of elements are given different names, but SGML provides no
way of expressing the meaning of a particular type of element,
other than its relationship to other element types. That is,
all one can say about any <q>blort</q> element is that
instances of it may (or may not) occur within elements of type
<q>farble</q>, and that it may (or may not) be decomposed into
elements of type <q>blortette</q>. It should be stressed that,
so far as the SGML standard is concerned, the semantics of an
element are entirely in the eye of the beholder. It is up to
the creators of SGML conformant definitions (confusingly known
in the Standard as <term> applications </term>) to choose
names indicative of the intended function of the elements they
identify, hence the technical term for the name of an element
type which is <term>generic identifier</term>, or GI.
<p>
Within a marked up text (or, to use the jargon, a <term>document
instance</term>), each element must be explicitly
marked or tagged in some way. The standard provides for a
variety of different ways of doing this, the most commonly
used being to insert a tag at the beginning of the element (an
<term>open-tag</term>) and another at its end (a
<term>close-tag</term>). The start and end tag pair are used to
bracket off the element occurrences within the running text,
in rather the same way as different types of parentheses or
quotation marks are used in conventional punctuation. For
example, an embedded speech element in a text might be tagged
as follows:
<xmp><![ CDATA [
 
	... Rosalind's remarks <speech>This is the silliest stuff
     that ere I heard of!</speech> clearly indicate ...
 
]]>
</xmp>
As this example shows, an open-tag takes the form
<emph>&lt;name&gt;</emph>, where &lt; is a string indicating the
start of the open tag, <emph>name</emph> is the generic identifier
of the element which is being delimited, and &gt; is the
string indicating the end of a tag. A close-tag takes the form
<emph>&lt;&sol;name&gt;</emph>, where &lt;&sol; is a string marking
the start of a close-tag, <emph>name</emph> is the generic
identifier of the element being closed and as before &gt; is
the string indicating the end of a tag.<fn>The actual
characters used for &lt;, &lt;&sol; and &gt; may be
redefined, but it is conventional to use the characters used in
this description.</fn>
<p>
Elements within a text will usually be nested, that is,
elements of one type will usually be <term>embedded</term>
(contained entirely) within elements of a different type. This
is one reason why the end-tag needs to specify which element
it is closing.
<p>
To illustrate this, we will consider a very simple structural
model. Let us assume that we wish to identify within an
anthology only poems, their titles, and the stanzas and lines
of which they are composed. In SGML terms, our document type
is the anthology, and it consists of a series of poems. Each
poem has embedded within it one element, a title, and several
occurrences of another, a stanza, each stanza having embedded
within it a number of line elements. Fully marked up, a text
conforming to this model might appear as follows:
<xmp><![ CDATA [
 
	<anthology>
		<poem><title>A counterpoint</title>
		<stanza>
			<line>Let me be my own fool</line>
			<line>of my own making, the sum of it</line>
		</stanza>
		<stanza>
			<line>is equivocal.</line>
			<line>One says of the drunken farmer:</line>
		</stanza>
		<stanza>
			<line>leave him lay off it. And this is</line>
			<line>the explanation.</line>
		</stanza>
	</poem>
 
			<!-- more poems go here    -->
 
	</anthology>
 
]]>
</xmp>
<fn>Taken from <cit>Robert Creeley, <q>For Love: poems
1950-1960</q></cit>. Copyright 1962 and used without permission tsk tsk
</fn>
<p>
This example makes no assumptions about the rules governing,
for example, whether or not a title can appear in places other
than preceding the first stanza, or whether lines can appear
which are not included in a stanza: that is why its markup
appears so verbose. In some cases, the beginning and end of
every element must be explicitly marked, because there are no
identifiable rules about which elements can appear where. In
practice, however, rules can usually be hypothesized which
greatly reduce the need for so much tagging. For example,
considering our greatly over-simplified model of a poem, we
could state rules of the following kind:
<ol>
<li id=i0>An anthology contains a number of poems and nothing
else.
<li id=i1>A poem always has a single title element which
precedes the first stanza.
<li id=i2>Every stanza consists of one or more lines and every
line is contained by a stanza.
<li id=i3>Nothing can follow a stanza except another stanza or
the end of a poem.
<li id=i4>Nothing can follow a line except another line or the
start of a new stanza.
</ol>
<p>
From rules <liref refid=i3> and <liref refid=i4> it follows that
we do not need to mark the ends of stanzas or lines
explicitly. From rule <liref refid=i1> it follows that we do
not need to mark the end of the title - it is implied by the
start of the first stanza. Similarly, from rule <liref refid=i0>,
it follows that we need not mark the end of the poem
- it is implied by the start of the next poem, or by the end
of the anthology. From rule <liref refid=i2> it follows that we
do not need to mark the start of the first line in each stanza
- it is implied by the start of a stanza. Applying these
simplifications, we could mark up the same poem as follows:
<xmp><![ CDATA [
	<anthology>
	<poem><title>A counterpoint
		<stanza>Let me be my own fool
			<line>of my own making, the sum of it
		<stanza>is equivocal.
			<line>One says of the drunken farmer:
		<stanza>leave him lay off it. And this is
			<line>the explanation.
	</anthology>
 
]]>
</xmp>
<p>
The ability to use rules stating which elements can be nested
within others to simplify markup is a very important
characteristic of SGML. Before considering these rules
further, you may like to consider how text marked up in the
form above could be processed by a computer for very many
different purposes. A simple indexing program could extract
only the relevant text elements in order to make a list of
titles, or of words used in the poem text; a simple formatting
program could insert blank lines between stanzas, perhaps
indenting the first line of each, or inserting a stanza
number. Different parts of each poem could be typeset in
different ways. A more ambitious analytic program could
determine how many stanzas or lines begin with lower-case
letters and thus (perhaps) in mid-sentence.<fn><p>Note that
this simple example has not addressed the problem of marking
elements such as sentences explicitly; the implications of
this are discussed below in section <hdref refid=z3152></p>
</fn>  Scholars wishing to see the implications of changing
the stanza or line divisions chosen by the editor of this poem
can do so simply by altering the position of the tags. And of
course, the text as presented above can be transported from
one computer to another and processed by any program (or
person) capable of making sense of the tags embedded within it
with no need for the sort of transformations and translations
needed to move word processor files around.
 
<h3 id=z314>Defining the rules
<p>
In specifying rules such as those described above, the
document designer may be as lax or as restrictive as the
occasion warrants. A balance must be struck between  the
convenience of simple rules and the complexity of real texts.
This is particularly the case when the rules being defined
relate to texts which already exist: the designer may have
only the haziest of notions as to an ancient text's original
purpose or meaning and hence find it very difficult to specify
consistent rules about its structure. On the other hand, where
a new text is being prepared to an exact specification, for
example for entry into a textual database of some kind, the
more precisely stated the rules, the better they can be
enforced. Even in the case where an existing text is being
marked up, it may be beneficial to define a restrictive set of
rules relating to one particular view or hypothesis about the
text - if only as a means of testing the usefulness of that
view or hypothesis. It is important to remember that every
document type definition is an interpretation of a text. There
is no single DTD which encompasses any kind of absolute truth
about a text, although it may be convenient to privilege some
DTDs above others for particular types of analysis.
<p>
At present, SGML is most widely used in environments where
uniformity of document structure is a major desideratum. In
the production of technical documentation, for example, it is
of major importance that sections and subsections should be
properly nested, that cross references should be properly
resolved and so forth. In such situations, documents are seen
as raw material to match against pre-defined sets of rules. As
discussed above, however, the use of simple rules can also
greatly simplify the task of tagging accurately elements of
less rigidly constrained texts such as those which concern the
TEI. By making these rules explicit, the scholar reduces his
or her own burdens while also being forced to make explicit an
interpretation of the text being encoded.
<p>
The rules to be used by an SGML parser when interpreting an
encoded text take the form of series of declarations which,
together with other definitions, make up the body of the
document type definition (DTD). A DTD may be attached to a
document, or more usually referred to from within it. For our
simple model of a poem, the following declarations would be
appropriate:
<xmp><![ CDATA [
 
	<!ELEMENT ANTHOLOGY		- -	(POEM+)>
	<!ELEMENT POEM  		- - 	(TITLE?, STANZA+)>
	<!ELEMENT TITLE		- O	(#PCDATA)	>
	<!ELEMENT STANZA		- O	(LINE+)	>
	<!ELEMENT LINE			O O	(#PCDATA)	>
 
]]>
</xmp>
These five lines are examples of SGML formal declarations.
Each declaration begins with &lt;&excl;ELEMENT indicating that
it declares an element, in the technical sense defined above,
and ends with a &gt;. It consists of three parts: a name, or
group of names; optionally two characters specifying
<q>minimisation rules</q>; and a <q>content model</q>. Each of
these parts is discussed further below. Components of the
declaration are separated by white space, that is one or more
blanks, tabs or newlines.
<p>
The first part of each declaration above gives the generic
identifier of the element which is being declared, for example
POEM, TITLE etc. It is possible to declare several elements in
one statement, as discussed below.
<p>
The second part of the declaration is optional. It specifies
what are called <term>minimisation rules</term> for the
element concerned. These rules determine whether or not start
and end tags must be present in every occurrence of the
element concerned. They take the form of a pair of characters,
separated by white space, the first of which relates to the
start tag, and the second to the end tag. In either case,
either a hyphen or a letter o (for optional) must be given;
the hyphen indicating that the tag must be present, and the
letter o that it may be omitted. Thus, in this example, every
element except LINE must have a start tag, while only the
ANTHOLOGY and POEM elements require an end tag as well. If no
minimisation rules are given for an element, then neither start
nor end tags may be omitted.
<p>
The third part of each declaration, enclosed in parentheses,
is called the <term>content model</term> of the element,
because it specifies what element occurrences may legitimately
contain. Contents are specified either in terms of other
elements or using special reserved words. There are several
such reserved words, of which by far the most commonly
encountered is #PCDATA, as in this example. This is an
abbreviation for <q>parsed character data</q>, and it means
that the element being defined may contain any valid character
data but may not contain further embedded elements. It thus
forms the bottom line of most SGML element declarations. In
our example, TITLEs and LINEs are so defined.
<p>
The declaration for STANZA in the example above states that a
stanza consists of one or more lines. It uses an
<term>occurrence indicator</term> (the plus sign) to indicate
how many times the element named in its content model may
occur. There are three occurrence indicators, known in the
standard as <term>plus</term>, <term>opt</term> and
<term>rep</term>. Plus, which is usually represented by a plus
sign, means that there may be one or more occurrences of the
element concerned; opt, usually represented by a question
mark, means that there may be at most one and possibly no
occurrence; rep, usually represented by a star, means that the
element concerned may either be absent or appear one or more
times. Thus, if the content model for STANZA were (LINE*),
stanzas with no lines would be possible as well as those with
one or more lines. If it were (LINE?), again empty stanzas
would be countenanced, but no stanza could have more than a
single line. Similarly, the declaration for POEM in the
example above states that a POEM cannot have more than
one title, but may have none, and that it must have at least
one stanza and may have several.
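<p>
The three possibilities just described for STANZA may be set
side by side as follows. This is a sketch only: the three
declarations are alternatives, not a fragment of a single DTD.
<xmp><![ CDATA [

	<!ELEMENT STANZA	- O	(LINE+)	>   <!-- one or more lines  -->
	<!ELEMENT STANZA	- O	(LINE*)	>   <!-- zero or more lines -->
	<!ELEMENT STANZA	- O	(LINE?)	>   <!-- zero or one line   -->

]]>
</xmp>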
<p>
The content model (TITLE?,STANZA+) contains more than one
component, and thus needs additionally to specify the order in
which these (TITLE and STANZA) may appear. This ordering is
determined by the <term>group connector</term> (the comma)
used between its components. There are three possible group
connectors, known in the standard as <term>seq</term>,
<term>and</term> and <term>or</term>. Seq, which is usually
represented by a comma, means that the components it connects
must both appear in the order specified by the content model.
And, which is usually represented by an ampersand, indicates
that the components it connects must both appear but in any
order. Or, which is usually represented by a vertical bar,
indicates that only one of the components it connects can
appear. If the comma in this example were replaced by an
ampersand, a title could appear either before the stanzas of a
poem or at the end (but not between stanzas). If it were
replaced by a vertical bar, then a poem would consist of
either a title or just stanzas - but not both!
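<p>
The three connectors may be compared in the same way. In the
sketch below, the first three declarations are alternatives
for POEM. The fourth applies the same notation to the report
described earlier in this chapter; its element names (REPORT,
AUTHOR, ABSTRACT and PARA) are assumed here for illustration
and are not declared elsewhere in this chapter.
<xmp><![ CDATA [

	<!ELEMENT POEM	- -	(TITLE?, STANZA+)  >  <!-- title, if any, first     -->
	<!ELEMENT POEM	- -	(TITLE? & STANZA+) >  <!-- title first or last      -->
	<!ELEMENT POEM	- -	(TITLE? | STANZA+) >  <!-- title alone, or stanzas  -->

	<!ELEMENT REPORT	- -	(TITLE, AUTHOR?, ABSTRACT, PARA+) >

]]>
</xmp>
Under the last declaration, a document lacking a title, or one
in which the paragraphs precede the abstract, is not a REPORT.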
<p>
In our example so far, the components of each content model
have been either single elements or #PCDATA. It is quite
permissible however to define content models in which the
components are lists of elements, combined by group
connectors. Such lists, known as <term>model groups</term>, may
also be modified by occurrence indicators and themselves combined
by group connectors. To demonstrate these facilities, let us now expand
our example to include non-stanzaic types of verse.  For the
sake of demonstration, we will categorise poems as one of
stanzaic, couplets or blank.  A blank verse poem consists
simply of lines (we ignore the possibility of verse paragraphs
for the moment<fn><p>It will not have escaped the astute
reader that the fact that verse paragraphs need not start on a
line boundary seriously complicates the issue; see further
section <hdref refid=z3152></p></fn>) so no additional elements
need be defined for it. A couplet is defined as a LINE1
followed by a LINE2.
<xmp><![ CDATA [
 
	<!ELEMENT COUPLET o o (LINE1, LINE2)>
 
]]>
</xmp>
<p>
The elements LINE1 and LINE2 (which are distinguished to
enable studies of rhyme scheme, for example) have exactly the
same content model as the existing LINE element. They can
therefore share the same declaration. In this situation, it is
convenient to supply a <term>name group</term> as the first
component of a single element declaration, rather than give a
series of declarations differing only in the names used. A
name group is a list of GIs connected by the or connector and
enclosed in parentheses, as follows:
<xmp><![ CDATA [
 
	<!ELEMENT (LINE | LINE1 | LINE2) o o (#PCDATA)>
 
]]>
</xmp>
The declaration for the POEM element can now be changed to
include all three possibilities:
<xmp><![ CDATA [
 
	<!ELEMENT POEM - o (TITLE?,
				(STANZA+ | COUPLET+ | LINE+) ) >
 
]]>
</xmp>
That is, a poem consists of an optional title, followed by
one or more stanzas, or one or more couplets, or one or more
lines. Note the difference between this definition and the
following:
<xmp><![ CDATA [
 
	<!ELEMENT POEM - o (TITLE?,
				(STANZA | COUPLET | LINE)+ ) >
 
]]>
</xmp>
The second version, by applying the occurrence indicator to
the group rather than to each element within it, would allow
a single poem to contain a mixture of stanzas, couplets and
lines of blank verse.
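<p>
The difference can be seen in a document instance. The poem
sketched below, whose text is invented for the purpose, mixes
all three kinds of verse. It conforms to the second of the two
declarations above but not to the first, under which a parser
would report the couplet as an error because only further
stanzas may follow a stanza.
<xmp><![ CDATA [

	<poem><title>An invented example
		<stanza><line>first line of a stanza
			<line>second line of a stanza
		<couplet><line1>first line of a couplet
			<line2>second line of a couplet
		<line>a single line of blank verse
	</poem>

]]>
</xmp>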
 
<h3 id=z315>Complicating the issue
<p>
In the simple cases described so far, it has been assumed that
one can identify both the immediately containing element type
and the immediate constituents of every element defined in a
textual  structure. A poem consists of stanzas, and an
anthology consists of poems; stanzas do not float around
unattached to poems or combined into some other unrelated
element; a poem cannot contain an anthology. All the elements
of a given document type may be arranged into a hierarchic
structure, like a family tree with a single ancestor
at the top and many children (mostly the elements containing
#PCDATA) at the bottom. This gross simplification  turns out
to be surprisingly effective for a large number of purposes.
It is not however adequate for the full complexity of real
textual structures. In particular, it does not cater for the
case of more or less freely floating elements that can appear
at almost any hierarchic level in the structure, and it does
not cater for the case where several different trees may be
identified in the same document. To deal with the first case,
SGML provides the <q>exception</q> mechanism; to deal with the
second, SGML permits the definition of <q>concurrent</q>
document structures.
<h4 id=z3151>Exceptions to the content model
<p>
In most documents, there will be some identifiable elements
that can occur at any level of the document structure. Annotations, for
example, might be attached to the whole of a poem, to a
stanza, to a line of a stanza or to a single word within it.
In a textual critical edition, the same might be true of
variant readings. In this simple case, the complexity of
adding an annotation element as an optional component of every
content model is not particularly onerous; in a more
realistically complex model perhaps containing some ten or
twenty levels such an approach is barely workable. It would
not make much sense to include (say) zero or more annotation
elements as a component of every element in even a moderately
complex DTD.
<p>
To cope with this, SGML allows for any content model to be
further modified by means of an <term>exception</term> list.
There are two types of exception: <term>inclusions</term>,
that is, additional elements that can be included at any point
in the model group or any of its constituent elements; and
<term>exclusions</term>, that is, elements that cannot be
included within the current model.
<p>
To extend our declarations further to allow for annotations
and variant readings, which we will assume can appear anywhere
within the text of a poem, we first need to add declarations
for these two elements:
<xmp><![ CDATA [
 
	<!ELEMENT (note | variant) - - (#PCDATA)>
 
]]>
</xmp>
The note and variant elements must have both start and end
tags, since they can appear anywhere. Rather than add them to
the content model for each type of poem, we can add them in
the form of an inclusion list to the poem element, which now
reads:
<xmp><![ CDATA [
 
	<!ELEMENT POEM - o (TITLE?,
				(STANZA+ | COUPLET+ | LINE+) )
				+(NOTE | VARIANT) >
 
]]>
</xmp>
The plus sign at the start of the (NOTE | VARIANT) name list
indicates that this is an inclusion exception. With this
addition, notes or variants can appear at any point in the
content of a poem element, even within elements (such as TITLE)
for which we have defined a content model of #PCDATA. They can
thus also appear within notes or variants! If we wanted for
some reason to prevent notes or variants appearing within
titles, we could add an exclusion exception to the declaration
for TITLE above:
<xmp><![ CDATA [
 
	<!ELEMENT TITLE		- O	(#PCDATA)
				-(NOTE | VARIANT)>
 
]]>
</xmp>
The minus sign at the start of the (NOTE | VARIANT) name list
indicates that this is an exclusion exception. With this
addition, notes and variants will be prohibited from appearing
within titles, notwithstanding their potential inclusion
implied by the previous addition to the content model for
POEM. In the same way, we could prevent notes and variants
from nesting within notes and variants by modifying the
definition above to read
<xmp><![ CDATA [
 
	<!ELEMENT (note | variant) - - (#PCDATA)
						-(NOTE | VARIANT)>
 
]]>
</xmp>
The meticulous reader will note that this precludes both
variants within notes and notes within variants. Inclusion and
exclusion exceptions should be used with care as their
ramifications may not be immediately apparent.
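<p>
As a sketch of the effect, the fragment below attaches a note
to a line of the poem quoted earlier; the wording of the note
is invented for the purpose. Given the inclusion exception on
POEM, a parser would accept the note here even though the
content model for LINE is #PCDATA. Given the exclusion
exception on TITLE, it would reject the same note if it were
placed inside the title.
<xmp><![ CDATA [

	<poem><title>A counterpoint
		<stanza><line>Let me be my own fool<note>a note on this line</note>
			<line>of my own making, the sum of it
	</poem>

]]>
</xmp>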
</h4>
<h4 id=z3152>Concurrent structures
<p>
hic desunt multa
<h3 id=z316>Attributes
<p>
SGML uses the word <q>attribute</q>, like some other words,
in a rather specialised way: here it denotes information
which is in some sense descriptive of specific element occurrences
without being itself regarded as an element. For example, you might
wish to add a <q>status</q> attribute to occurrences of some
elements in a document to indicate their degree of reliability, or
to add an <q>identifier</q> attribute so that you could refer to
particular element occurrences from elsewhere within a document.
If an element has been defined as having attributes, the attribute
values are supplied in the document instance as <term>attribute-
value pairs</term> inside the open-tag for the element occurrence.
For example
<xmp><![ CDATA[
	<poem id=P1 status="draft"> ... </poem>
]]>
</xmp>
The <tag>poem</tag> element has been defined as having two
attributes <q>id</q> and <q>status</q>. For the instance of a
<tag>poem</tag> in this example, represented here by an ellipsis,
the <q>id</q> attribute has the value <q>P1</q> and the
<q>status</q> attribute has the value <q>draft</q>.
<p>
Like elements themselves, attributes are declared in the SGML
document type declaration, using rather similar syntax. As well as
specifying an attribute's name and the element to which it is to be
attached, it is possible to specify (within limits) what kind of
value is acceptable for it and what default value, if any, it takes.
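<p>
As a minimal sketch, the <q>status</q> attribute used in the
example above might be declared as follows. The three permitted
values are assumptions made for the purposes of illustration; only
<q>draft</q> appears in the example.
<xmp><![ CDATA [

	<!ATTLIST POEM STATUS	(draft | revised | final)	draft>

]]>
</xmp>
Under such a declaration a <tag>poem</tag> for which no
<q>status</q> is given would be treated as if <q>draft</q> had
been specified.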
<p>
Some critics, pointing out that almost any information conveyed by
using an attribute could equally well be conveyed by using an
additional element, see attributes as confusing the simplicity
of SGML syntax for no very good reason. Since the reverse is
not always the case - information represented by additional
elements cannot always be represented by using an attribute -
they may be right. However, there are situations in which
attributes seem to provide a convenient way of expressing
information ancillary to a text, whatever their formal
redundancy. The interested reader is referred to <cit> who?
</cit> for a discussion.
<p>
We will discuss two possible uses for attributes: firstly as a
means of including normalised forms of speech prefixes in a
dramatic text, and secondly as a means of providing cross
reference links within a given text.
<p>
In a dramatic text, it is customary to flag the start of each
speech by a brief indication of who is to speak it. For most
types of analysis, we would wish to distinguish this speech
prefix from the speech itself. We might therefore expect to
find an element declaration like the following:
<xmp><![ CDATA [

	<!ELEMENT SPEECH - o (PREFIX, TEXT)>
	<!ELEMENT (PREFIX|TEXT) - o (#PCDATA)>

]]>
</xmp>
and text tagged as follows:
<xmp><![ CDATA [

	<SPEECH><PREFIX>Ferd.
	<TEXT>Couer her face: Mine eyes dazell: she di'd yong.
	<SPEECH><PREFIX>Bos.
	<TEXT>I thinke not so: her infelicitie
	Seem'd to have yeeres too many.
	...
]]>
</xmp>
 
<p>
When encoding from an early printed book or manuscript, it is
not at all unusual to find the same character referred to by
different prefixes in different parts of the play, or
ambiguous prefixes. For the purposes of dramatic analysis it
would be very convenient to normalise all the prefixes for a
given character. One way of doing this might be to define an
additional element, say NORM; this is left to the reader as an
exercise. Instead, feeling that such editorial interventions
should in some sense be distinguished from the encoded text,
we will choose to define an attribute NORM for the existing
PREFIX element. This is done by adding an <term>attribute
definition list declaration</term> to the declarations above,
as follows:
<xmp><![ CDATA [

	<!ATTLIST PREFIX NORM	NMTOKEN	#REQUIRED>

]]>
</xmp>
<p>
The declaration has four main parts. The first specifies the
element (or elements) concerned. The next specifies the name
or names of the attributes to be associated with the element.
The third specifies what kind of values the attribute may
take, that is, what kind of information is to be supplied for
it. The last states whether or not the attribute is optional
and, if it is, what default value can be assumed for it. In
our simple case, we define an attribute NORM to be associated
with the element PREFIX. Its value is a <term>name
token</term>, that is, (loosely) any string of alphabetic
characters not including a space, and the keyword #REQUIRED
indicates that it may not be omitted.
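<p>
Had we wished to make the normalised form optional, a sketch of
the alternative (not the declaration adopted in what follows)
would replace the final keyword:
<xmp><![ CDATA [

	<!ATTLIST PREFIX NORM	NMTOKEN	#IMPLIED>

]]>
</xmp>
With #IMPLIED, a PREFIX for which no NORM value is supplied is
still valid, and it is left to the processing application to
decide what to assume in its place.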
<p>
With this declaration in force, the above passage could be
tagged as follows:
<xmp><![ CDATA [

	<SPEECH><PREFIX NORM=FERDINAND>Ferd.
	<TEXT>Couer her face: Mine eyes dazell: she di'd yong.
	<SPEECH><PREFIX NORM=BOSOLA>Bos.
	<TEXT>I thinke not so: her infelicitie
	Seem'd to have yeeres too many.
	...
]]>
</xmp>
 
Clearly, the same mechanism could be extended for any other
type of normalisation.
<p>
Our second example for the use of attributes is less
controversial, though more complicated. It is sometimes
necessary to refer to an occurrence of one textual element
from within another: an obvious example being phrases such as
<q>see note 6</q> or <q>as discussed in chapter 5</q>. When a
text is being produced the actual numbers associated with the
notes or chapters may not be certain. Moreover, if we have
followed the gospel of descriptive markup, such things as page
or chapter numbers, being entirely matters of presentation,
will not in any case be present in the marked up text: they
will be assigned by whatever processor is operating on the
text (and may indeed differ in different applications). SGML
therefore provides a special mechanism by which any element
occurrence may be given a special identifier, a kind of label,
which may be used to refer to it from anywhere else within
the same text. The cross-reference itself is regarded as an
element occurrence of a specific kind, which must also be
declared in the DTD. In each case, the identifying label
(which may be arbitrary) is supplied as the value of a special
attribute.
 
<p>
Suppose, for example, we wish to include a reference within
the notes on one poem that refers to another poem. We will
first need to provide some way of attaching a label to each
poem: this is done by defining an attribute for the POEM
element, as follows
<xmp><![ CDATA [

	<!ATTLIST POEM PID	ID	#IMPLIED>

]]>
</xmp>
Here we define an attribute PID, the value of which must be of
type ID (this keyword implies that it must be unique within
the document and will be used to identify the element
occurrence in which it is used)  but which may be omitted
(because only poems to which we intend to refer need use this
attribute). For any such poem we can now include in the tag
that opens it a unique identifier, for example
<xmp><![CDATA[

	<POEM PID=Counterpoint>

]]>
</xmp>
Next we need to define a new element for the cross reference
itself. This will not have any content - it is only a
pointer - but it has an attribute, the value of which will be
the identifier of the element pointed at. This is achieved by
the following declarations:
<xmp><![CDATA[

	<!ELEMENT POEMREF - O EMPTY>
	<!ATTLIST POEMREF ID IDREF #REQUIRED>

]]>
</xmp>
The POEMREF element needs no close tag because it has no
content. It has a single attribute, which we choose to call ID
to make obvious what its function is. The value of this
attribute must be of type IDREF (the keyword used for cross
reference pointers of this type) and it must be supplied.
 
<p>
With these declarations in force, we can now encode a
reference to the poem with identifier <q>Counterpoint</q> as follows
 
<xmp><![CDATA[

	<POEMREF ID=Counterpoint>

]]>
</xmp>
 
<p>
When an SGML parser encounters this empty element it will
simply check that a poem exists with the identifier
<q>Counterpoint</q>. Other SGML processors could take any
number of other actions: different formatters might for
example insert a phrase such as "See also " followed by a
number, or the poem title or its first lines. A hypertext
style processor might use this element as a signal to activate
a link to the poem being referred to. The purpose of the SGML
markup is simply to indicate that a cross reference exists: it
does not determine what the processor is to do with it.
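<p>
Putting the pieces of this example together, a sketch of the
declarations and of a document fragment that uses them might
read as follows. It assumes, beyond anything declared above, that
NOTE is redeclared with mixed content so that a POEMREF may
occur inside a note; the wording of the note is invented.
<xmp><![ CDATA [

	<!ELEMENT NOTE    - - (#PCDATA | POEMREF)* >
	<!ELEMENT POEMREF - O EMPTY>
	<!ATTLIST POEM    PID	ID	#IMPLIED>
	<!ATTLIST POEMREF ID	IDREF	#REQUIRED>

	<POEM PID=Counterpoint> ... </POEM>
	<POEM> ...
	<NOTE>Compare <POEMREF ID=Counterpoint></NOTE>
	... </POEM>

]]>
</xmp>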
 
<h3 id=z317>Data Independence
<p>
The aspects of SGML discussed so far are all concerned with
the markup of structural elements within a document. SGML also
provides a simple and flexible method of encoding and naming
arbitrary parts of the actual content of a document in a
portable way. In SGML parlance, an <term>entity</term> is any
named part of a marked up document, irrespective of any
structural considerations. An entity might be a string of
characters or a whole file of text. To include it in a
document, we use a construction known as an <term>entity
reference</term>. For example, the following declaration
<xmp><![CDATA[

	<!ENTITY tei "Text Encoding Initiative">

]]>
</xmp>
defines an entity whose name is <q>tei</q> <fn>By convention
case is significant in entity names, unlike element
names</fn>, and whose value is the string  <q>Text Encoding
Initiative</q>. This is an instance of a <term>general entity
declaration</term>; by contrast, the following is an example
of a <term>system entity declaration</term>
<xmp><![CDATA[

	<!ENTITY GoodBits SYSTEM "c:\tei\hotstuff.txt">

]]>
</xmp>
This defines a system entity whose name is <q>GoodBits</q> and
whose value is a system identifier - in this case, the name of
an operating system file.
<p>
Once an entity has been declared, it may be referenced
anywhere within a document. This is done by supplying its name
prefixed with the <term>ero</term> (entity reference open)
character - normally an ampersand - and terminated by the
<term>refc</term> (reference close) character - normally a
semicolon. The reference close character may be omitted if it
is followed by a space or record end. When an SGML parser
encounters such an <term>entity reference</term>, it
immediately substitutes the value declared for the entity
name. Thus, the passage <q>The labours of the &amp;tei have only
just begun</q> will be interpreted by an SGML processor
exactly as if it read <q>The labours of the Text Encoding
Initiative have only just begun</q>. In the case of a system
entity, it is, of course, the contents of the operating system
file which are substituted, so that the passage <q>The
following text has been suppressed: &amp;GoodBits; </q> will be
expanded to include the whole of whatever the system finds in
the file c:\tei\hotstuff.txt.
<p>
This obviously saves typing, and simplifies the task of
maintaining consistency in a set of documents. If the printing
of a complex document is to be done at many sites, the
document body itself might use an entity reference, such as
&amp;site;, wherever the name of the site is required. Different
entity declarations could then be added at different sites to
supply the appropriate string to be substituted for this name,
with no need to change the text of the document itself.
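<p>
As a sketch of this arrangement, each site might supply its own
declaration for the entity <q>site</q>; the two site names used
here are assumptions chosen only for illustration. At one site
the declaration might read:
<xmp><![ CDATA [

	<!ENTITY site "Oxford">

]]>
</xmp>
and at another:
<xmp><![ CDATA [

	<!ENTITY site "Chicago">

]]>
</xmp>
A sentence such as <q>Printed at &amp;site;.</q> in the shared
document body would then be expanded differently at each site,
although the body itself is identical.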
<p>
This <term>string substitution</term> mechanism (to use the
jargon) has many other applications. It can be used to
circumvent the notorious inadequacies of most computer systems
for representing the full range of graphic characters needed
for the display of modern English (let alone the requirements
of other modern scripts or of ancient languages). So-called
special characters not directly accessible from the keyboard
(or if accessible not correctly translated when transmitted)
may be represented by an entity reference. Suppose, for
example, that we wish to encode the use of ligatures in early
printed texts. The ligatured form of <q>ct</q> might be
distinguished from the non-ligatured form by encoding it as
<q>&amp;ctlig;</q> rather than <q>ct</q>. Other special
typographic features such as leafstops or rules could equally
well be represented by mnemonic entity references in the text.
When processing such texts, an entity declaration would be
added giving the desired representation for such textual
elements. If, for example, ligatured letters are of no
interest, we would simply add a declaration such as
<xmp><![CDATA[

	<!ENTITY ctlig "ct" >

]]>
</xmp>
and the distinction present in the source document would be
removed. If, on the other hand, a formatting program capable
of representing ligatured characters is to be used, we might
replace the entity declaration to give whatever sequence of
characters such a program requires as the expansion. If the
characters to be used in the expansion cannot be typed in
directly, they may be given as <term>character
references</term>, that is, as numeric values. A character
reference is distinguished from other characters in the
replacement string by the fact that it begins with a special
<term>character reference open</term> symbol, usually the
sequence &amp;&num;, and ends with a refc symbol (i.e.
usually a semicolon). For example, if the formatter to be used
represents the ligatured form of ct by the characters c and t
prefixed by the character with decimal value 102, the entity
declaration would read:
<xmp><![CDATA[

	<!ENTITY ctlig "&#102;ct" >

]]>
</xmp>
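<p>
To see the effect of the two alternative declarations, consider
a word encoded with the reference (the word is chosen here only
for illustration):
<xmp><![ CDATA [

	perfe&ctlig;ion

]]>
</xmp>
With the first declaration in force this is read as
<q>perfection</q>; with the second, the formatter receives the
letters <q>perfe</q>, then the character with decimal value 102,
then <q>ction</q>.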
<p>
A list of entity declarations is known as an <term>entity
set</term>. Standard entity sets are provided for use with most
SGML processors, in which the names used will normally be
taken from the lists of such names published as an annex to
the SGML standard and elsewhere. The replacement values are,
of course, highly system dependent.
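<p>
A sketch of a few entries from such a set is given below. The
entity names are among those listed in the annex to the standard
for accented Latin letters; the replacement values are
assumptions, appropriate only to a system using the character
codes of the IBM PC.
<xmp><![ CDATA [

	<!ENTITY eacute "&#130;" >
	<!ENTITY egrave "&#138;" >
	<!ENTITY ccedil "&#135;" >

]]>
</xmp>
On a system with a different character set the same names would
be kept and only the replacement values changed.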
<p>
Useful though the entity reference mechanism is for dealing
with occasional departures from the expected character set,
no-one would consider using it to encode extended passages,
such as quotations in Greek or Russian in an English text. In
such situations, different mechanisms are appropriate. These
are discussed below in chapter 4.
<h3 id=z318>Putting it all together
<p>
An SGML-conformant document has a number of parts, not all of
which have been discussed in this chapter, and many of which
the user of these Guidelines may safely ignore. For
completeness, however, the following summary of how these parts
are inter-related may be found useful.
<p>
An SGML document consists of an SGML prologue and a document
instance. The prologue contains an SGML declaration and a
document type declaration.
<p>
The SGML declaration specifies basic facts about the dialect
of SGML being used such as the character set, the codes used
for SGML delimiters etc. Its content for TEI-conformant
document types is discussed further in chapter <hdref refid=z32>
below; normally the SGML declaration will be held in the form of
compiled tables by the SGML processor and will thus be
invisible to the user.
<p>
The document type declaration contains a base document type
definition and may also include one or more concurrent
document type definitions. The declaration may consist of a
reference to some publicly defined document type declaration,
an explicit document type definition, or some combination of
the two. Entity names to be used in the DTD may be declared in
similar ways in the same part of the prologue.
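<p>
As a sketch of the third possibility, the following document
type declaration refers to an externally held definition and adds
one entity declaration of its own. The document type name POEM
and the file name poem.dtd are assumptions made for illustration.
<xmp><![ CDATA [

	<!DOCTYPE POEM SYSTEM "poem.dtd" [
		<!ENTITY tei "Text Encoding Initiative">
	]>
	<POEM> ... </POEM>

]]>
</xmp>
The document instance begins with the first tag after the close
of the declaration.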
<p>
Combining and extending document type definitions is discussed
further in chapter 9. As with the SGML declaration, most SGML
processors allow the document type declaration to be held
in compiled form and invoked invisibly by the user for one or
more documents.
<p>
The document instance is the content of the document itself,
independent of any declarations but possibly containing
references to other entities to be included within it.
<p>
A variety of software is available to assist in the tasks of
creating, validating and processing SGML documents. At the
heart of most such software is an SGML <term>parser</term>.
Other software functions which SGML processors should provide
include structured editing, formatting and database
management.
<p>
A <term>structured editor</term> is a kind of intelligent
word-processor. It can use information extracted from a processed
DTD to prompt the user with information about which elements are
required at different points in a document as the document is
being created. It can also greatly simplify the task of
preparing a document, for example by inserting tags
automatically.
<p>
A <term>formatter</term> operates on a tagged document instance to
produce a printed form of it. Many typographic distinctions, such
as the use of particular typefaces or sizes, are intimately
related to structural distinctions, and formatters can thus
usefully take advantage of descriptive markup. It is also
possible to define the tagging structure expected by a formatting
program in SGML terms, as a concurrent document structure.
<p>
Text oriented database management systems typically use inverted
file indexes to point into documents, or subdivisions of them. A
search can be made for an occurrence of some word or word pattern
within a document or within a subdivision of one. Meaningful
subdivisions of input documents will of course be closely related
to the subdivisions specified using descriptive markup. It is thus
simple for textual database systems to take advantage of SGML-
tagged documents.
<p>
Hypertext systems improve on older methods of handling texts
by supporting associative links within and across documents.
Again, the basic building block needed for such systems is also a
basic building block of SGML markup: the ability to identify and
to link together individual document elements comes free as a part
of the SGML way of doing things. To load an SGML document into a
hypertext system requires only a processor which can interpret the
SGML tags correctly. (See further section <hdref refid=z6a>.)
 
<h3 id=z319>A short SGML Reading List
<note>
As the current version of TEIDOC0 does not support bibliographic
lists other than in back matter, and not much structure in them
even then, this list has been moved to a separate file (Z319) with
its own dtd.
</note>
