SGML
There are three characteristics of SGML which distinguish it
from other markup languages: its use of descriptive rather
than procedural markup; its document type concept; and its
independence of any one representation system.
A descriptive markup system uses markup codes which simply
assert something of the parts of a document which they
identify, for example "the following item is a blort".
This means that the same document can be processed by many
different pieces of software, each of which can apply
different processing instructions to those parts of it which
are considered relevant. For example, a content analysis
program might wish to disregard entirely the footnotes
embedded in an annotated text, while a formatting program
might wish to extract and collect them all together for printing
at a specific point during the processing of the document.
Different sorts of processing instructions can be associated
with the same parts of the file. For example, one program
might wish to extract names of persons and places from a
document to create an index or database, while another,
operating on the same text, might wish to print names of
persons and places in a distinctive typeface.
Secondly, SGML introduces the notion of a document type, and
hence a document type definition (DTD).
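If documents are of known types, a special purpose program
(called a parser) can be used to check that a document
claiming to be of a particular type does in fact conform to
the rules for that type.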
A basic design goal of SGML was to ensure that documents
encoded according to its provisions should be transportable
from one hardware/software environment to another without loss
of information. The two features discussed so far both address
this requirement at an abstract level: the third feature
addresses it at the level of the strings of bytes (characters)
of which documents are composed. SGML provides a general
purpose mechanism for string substitution.
A text is not an undifferentiated sequence of words, much less
of bytes. For different purposes, it may be divided into many
different units, of different types or sizes. A prose text
such as this one might be divided into sentences, paragraphs,
chapters and sections. A verse text might be divided into
lines, stanzas and cantos. Once printed, sequences of prose
and verse might be divided into pages, gatherings or volumes.
Such units as these are most often used to identify specific
locations or reference points within a text (the third
sentence of the second paragraph in chapter ten; canto 10,
line 1234; page 412 etc.) but they may also be used to
subdivide a text into meaningful fragments for analytic
purposes (is the average sentence length of section 2
different from that of section 5? how many paragraphs separate
each occurrence of the word "blort"? how many pages?).
In a prose text one might similarly wish to regard as units of
different types passages in direct or indirect speech,
passages employing different stylistic registers (narrative,
polemic, commentary, argument etc.), passages of different
authorship and so forth. And for certain types of analysis
(most notably textual criticism) the physical appearance of
one particular printed or manuscript source may be of
importance: paradoxically, one may wish to use descriptive
markup to describe presentational features such as typeface,
linebreaks, use of white space and so forth.
These textual structures overlap with each other in complex
and unpredictable ways. Particularly when dealing with texts
as instantiated by paper technology, the reader needs to be
aware of both the physical organisation of the book and the
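logical structure of the work it contains. Many great works
(Sterne's Tristram Shandy, for example) cannot be fully
appreciated without an awareness of the interplay between
narrative units (such as chapters or paragraphs) and page
divisions.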
SGML provides a simple and consistent mechanism for the markup
or identification of all such textual units, and also a method
of expressing rules which define how combinations of such
units can meaningfully occur in any text. The technical term
used in the SGML Standard for a textual unit, viewed as a
structural component, is element.
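Within a marked up text (or, to use the jargon, a document
instance), each element must be explicitly marked or tagged.
Elements within a text will usually be nested, that is,
elements of one type will usually be embedded within elements
of another type.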
To illustrate this, we will consider a very simple structural
model. Let us assume that we wish to identify within an
anthology only poems, their titles, and the stanzas and lines
of which they are composed. In SGML terms, our document type
is the anthology, and it consists of a series of poems. Each
poem has embedded within it one element, a title, and several
occurrences of another, a stanza, each stanza having embedded
within it a number of line elements. Fully marked up, a text
conforming to this model might appear as follows:
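As a hedged illustration of such markup, a fully tagged anthology
containing a single poem might look like the sketch below. The
sample verse (William Blake's "The Sick Rose") and the lower-case
spelling of the tag names are choices made for this sketch; the
model itself prescribes only the four element types.

```sgml
<anthology>
<poem>
<title>The SICK ROSE</title>
<stanza>
<line>O Rose thou art sick.</line>
<line>The invisible worm,</line>
<line>That flies in the night</line>
<line>In the howling storm:</line>
</stanza>
<stanza>
<line>Has found out thy bed</line>
<line>Of crimson joy:</line>
<line>And his dark secret love</line>
<line>Does thy life destroy.</line>
</stanza>
</poem>
<!-- further poems would follow here -->
</anthology>
```

Every element in the sketch carries both a start tag and an end
tag, so nothing about its boundaries is left to be inferred.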
This example makes no assumptions about the rules governing,
for example, whether or not a title can appear in places other
than preceding the first stanza, or whether lines can appear
which are not included in a stanza: that is why its markup
appears so verbose. In some cases, the beginning and end of
every element must be explicitly marked, because there are no
identifiable rules about which elements can appear where. In
practice, however, rules can usually be hypothesized which
greatly reduce the need for so much tagging. For example,
considering our greatly over-simplified model of a poem, we
could state rules of the following kind:
From rules
The ability to use rules stating which elements can be nested
within others to simplify markup is a very important
characteristic of SGML. Before considering these rules
further, you may like to consider how text marked up in the
form above could be processed by a computer for very many
different purposes. A simple indexing program could extract
only the relevant text elements in order to make a list of
titles, or of words used in the poem text; a simple formatting
program could insert blank lines between stanzas, perhaps
indenting the first line of each, or inserting a stanza
number. Different parts of each poem could be typeset in
different ways. A more ambitious analytic program could
determine how many stanzas or lines begin with lower-case
letters and thus (perhaps) in mid-sentence. Note that
this simple example has not addressed the problem of marking
elements such as sentences explicitly; the implications of
this are discussed below in section
In specifying rules such as those described above, the
document designer may be as lax or as restrictive as the
occasion warrants. A balance must be struck between the
convenience of simple rules and the complexity of real texts.
This is particularly the case when the rules being defined
relate to texts which already exist: the designer may have
only the haziest of notions as to an ancient text's original
purpose or meaning and hence find it very difficult to specify
consistent rules about its structure. On the other hand, where
a new text is being prepared to an exact specification, for
example for entry into a textual database of some kind, the
more precisely stated the rules, the better they can be
enforced. Even in the case where an existing text is being
marked up, it may be beneficial to define a restrictive set of
rules relating to one particular view or hypothesis about the
text - if only as a means of testing the usefulness of that
view or hypothesis. It is important to remember that every
document type definition is an interpretation of a text. There
is no single DTD which encompasses any kind of absolute truth
about a text, although it may be convenient to privilege some
DTDs above others for particular types of analysis.
At present, SGML is most widely used in environments where
uniformity of document structure is a major desideratum. In
the production of technical documentation, for example, it is
of major importance that sections and subsections should be
properly nested, that cross references should be properly
resolved and so forth. In such situations, documents are seen
as raw material to match against pre-defined sets of rules. As
discussed above, however, the use of simple rules can also
greatly simplify the task of tagging accurately elements of
less rigidly constrained texts such as those which concern the
TEI. By making these rules explicit, the scholar reduces his
or her own burdens while also being forced to make explicit an
interpretation of the text being encoded.
The rules to be used by an SGML parser when interpreting an
encoded text take the form of a series of declarations which,
together with other definitions, make up the body of the
document type definition (DTD). A DTD may be attached to a
document, or more usually referred to from within it. For our
simple model of a poem, the following declarations would be
appropriate:
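A set of declarations consistent with the description that
follows might be sketched as below. The content models are taken
from the text; the minimisation flags in the second column and the
lower-case spelling of the names are assumptions made for
illustration only.

```sgml
<!-- Hypothetical element declarations for the simple poem model -->
<!ELEMENT anthology - - (poem+)>
<!ELEMENT poem      - - (title?, stanza+)>
<!ELEMENT title     - O (#PCDATA)>
<!ELEMENT stanza    - O (line+)>
<!ELEMENT line      - O (#PCDATA)>
```

Each declaration names an element, states whether its start and
end tags are required (a hyphen) or may be omitted (the letter O),
and then gives in parentheses what the element may contain.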
The first part of each declaration above gives the generic
identifier of the element which is being declared, for example
POEM, TITLE etc. It is possible to declare several elements in
one statement, as discussed below.
The second part of the declaration is optional. It specifies
what are called minimisation rules for the element concerned.
The third part of each declaration, enclosed in parentheses,
is called the content model of the element.
The declaration for STANZA in the example above states that a
stanza consists of one or more lines. It uses an occurrence
indicator (the plus sign) to say so.
The content model (TITLE?,STANZA+) contains more than one
component, and thus needs additionally to specify the order in
which these (TITLE and STANZA) may appear. This ordering is
determined by the group connector (here the comma) placed
between them.
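The occurrence indicators and group connectors which SGML allows
in a content model can be summarised in a short sketch. The
element names (x1 to x6, a and b) are hypothetical placeholders
with no meaning of their own; only the punctuation matters here.

```sgml
<!-- Occurrence indicators, applied to a single component -->
<!ELEMENT x1 - - (a?)>     <!-- a may occur once or not at all -->
<!ELEMENT x2 - - (a+)>     <!-- a must occur one or more times -->
<!ELEMENT x3 - - (a*)>     <!-- a may occur any number of times -->

<!-- Group connectors, placed between components -->
<!ELEMENT x4 - - (a, b)>   <!-- a followed by b, in that order -->
<!ELEMENT x5 - - (a | b)>  <!-- either a or b, but not both -->
<!ELEMENT x6 - - (a & b)>  <!-- both a and b, in either order -->
```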
In our example so far, the components of each content model
have been either single elements or #PCDATA. It is quite
permissible however to define content models in which the
components are lists of elements, combined by group
connectors. Such lists are known as model groups.
It will not have escaped the astute reader that the fact that
verse paragraphs need not start on a line boundary seriously
complicates the issue; see further section
The elements LINE1 and LINE2 (which are distinguished to
enable studies of rhyme scheme, for example) have exactly the
same content model as the existing LINE element. They can
therefore share the same declaration. In this situation, it is
convenient to supply a name group in place of a single generic
identifier.
In the simple cases described so far, it has been assumed that
one can identify both the immediately containing element type
and the immediate constituents of every element defined in a
textual structure. A poem consists of stanzas, and an
anthology consists of poems; stanzas do not float around
unattached to poems or combined into some other unrelated
element; a poem cannot contain an anthology. All the elements
of a given document type may be arranged into a hierarchic
structure, arranged like a family tree with a single ancestor
at the top and many children (mostly the elements containing
#PCDATA) at the bottom. This gross simplification turns out
to be surprisingly effective for a large number of purposes.
It is not however adequate for the full complexity of real
textual structures. In particular, it does not cater for the
case of more or less freely floating elements that can appear
at almost any hierarchic level in the structure, and it does
not cater for the case where several different trees may be
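identified in the same document. To deal with the first case,
SGML provides the exception mechanism; to deal with the
second, SGML permits the definition of concurrent document
structures.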
In most documents, there will be some identifiable elements
that can occur at any level of their structure. Annotations, for
example, might be attached to the whole of a poem, to a
stanza, to a line of a stanza or to a single word within it.
In a textual critical edition, the same might be true of
variant readings. In this simple case, the complexity of
adding an annotation element as an optional component of every
content model is not particularly onerous; in a more
realistically complex model perhaps containing some ten or
twenty levels such an approach is barely workable. It would
not make much sense to include (say) zero or more annotation
elements as a component of every element in even a moderately
complex DTD.
To cope with this, SGML allows for any content model to be
further modified by means of an exception.
To extend our declarations further to allow for annotations
and variant readings, which we will assume can appear anywhere
within the text of a poem, we first need to add declarations
for these two elements:
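A sketch of what such declarations might look like is given
below. The element names NOTE and VARIANT, their content models
and the minimisation flags are assumptions made for illustration.
The group beginning with a plus sign after the content model of
POEM is an inclusion exception: it permits either element to
appear at any depth within a poem without being named in every
content model.

```sgml
<!-- Hypothetical declarations for the two floating elements -->
<!ELEMENT note    - - (#PCDATA)>
<!ELEMENT variant - - (#PCDATA)>

<!-- POEM as before, with an inclusion exception added -->
<!ELEMENT poem    - - (title?, stanza+) +(note | variant)>
```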
[Much is missing here.]
SGML uses the word "attribute" in a rather specialised way.
Like elements themselves, attributes are declared in the SGML
document type declaration, using rather similar syntax. As well as
specifying its name and the element to which it is to be attached,
it is possible to specify (within limits) what kind of value is
acceptable for an attribute and a default value.
Some critics, pointing out that almost any information conveyed by
using an attribute could equally well be conveyed by using an
additional element, see attributes as confusing the simplicity
of SGML syntax for no very good reason. Since the reverse is
not always the case - information represented by additional
elements cannot always be represented by using an attribute -
they may be right. However, there are situations in which
attributes seem to provide a convenient way of expressing
information ancillary to a text, whatever their formal
redundancy. The interested reader is referred to
We will discuss two possible uses for attributes: firstly as a
means of including normalised forms of speech prefixes in a
dramatic text, and secondly as a means of providing cross
reference links within a given text.
In a dramatic text, it is customary to flag the start of each
speech by a brief indication of who is to speak it. For most
types of analysis, we would wish to distinguish this speech
prefix from the speech itself. We might therefore expect to
find an element declaration like the following:
and text tagged as follows:
When encoding from an early printed book or manuscript, it is
not at all unusual to find the same character referred to by
different prefixes in different parts of the play, or
ambiguous prefixes. For the purposes of dramatic analysis it
would be very convenient to normalise all the prefixes for a
given character. One way of doing this might be to define an
additional element, say NORM; this is left to the reader as an
exercise. Instead, feeling that such editorial interventions
should in some sense be distinguished from the encoded text,
we will choose to define an attribute NORM for the existing
PREFIX element. This is done by adding an attribute list
declaration to the DTD.
The declaration has four main parts. The first specifies the
element (or elements) concerned. The next specifies the name
or names of the attributes to be associated with the element.
The third specifies what kind of values the attribute may
take, that is, what kind of information is to be supplied for
it. The last states whether or not the attribute is optional
and if it is, what default values can be assumed for it. In
our simple case, we define an attribute NORM to be associated
with the element PREFIX, the value of which is a
With this declaration in force, the above passage could be
tagged as follows:
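For illustration only, an attribute list declaration of this kind
and a passage tagged with it might be sketched as below. The
element name SPEECH, the declared value CDATA, the default
#IMPLIED, the prefix spellings and the normalised values are all
assumptions made for this sketch; the words spoken are from
Hamlet, Act 1, Scene 4.

```sgml
<!-- NORM holds an optional normalised form of the speech prefix -->
<!ATTLIST prefix norm CDATA #IMPLIED>

<!-- Two spellings of one prefix share a single normalised form -->
<speech><prefix norm="HAMLET">Ham.</prefix>
The air bites shrewdly; it is very cold.</speech>
<speech><prefix norm="HORATIO">Hor.</prefix>
It is a nipping and an eager air.</speech>
<speech><prefix norm="HAMLET">Haml.</prefix>
What hour now?</speech>
```

A program comparing the speeches of each character could then
group them by the value of NORM and ignore the spelling of the
prefix as it stands in the source.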
Our second example for the use of attributes is less
controversial, though more complicated. It is sometimes
necessary to refer to an occurrence of one textual element
from within another: an obvious example being phrases such as
"see note 6" or "as discussed in chapter 5".
The aspects of SGML discussed so far are all concerned with
the markup of structural elements within a document. SGML also
provides a simple and flexible method of encoding and naming
arbitrary parts of the actual content of a document in a
portable way. In SGML parlance, such a named part is called an
entity.
Once an entity has been declared, it may be referenced
anywhere within a document. This is done by supplying its name
prefixed with the ampersand character and followed by a
semicolon.
This obviously saves typing, and simplifies the task of
maintaining consistency in a set of documents. If the printing
of a complex document is to be done at many sites, the
document body itself might use an entity reference, such as
&site;, wherever the name of the site is required. Different
entity declarations could then be added at different sites to
supply the appropriate string to be substituted for this name,
with no need to change the text of the document itself.
This
A list of entity declarations is known as an entity set.
Useful though the entity reference mechanism is for dealing
with occasional departures from the expected character set,
no-one would consider using it to encode extended passages,
such as quotations in Greek or Russian in an English text. In
such situations, different mechanisms are appropriate. These
are discussed below in chapter 4.
An SGML conformant document has a number of parts, not all of
which have been discussed in this chapter, and many of which
the user of these Guidelines may safely ignore. For
completeness, the following summary of how these parts are
inter-related may however be found useful.
An SGML document consists of an SGML prologue and a document
instance. The prologue contains an SGML declaration and a
document type definition.
The SGML declaration specifies basic facts about the dialect
of SGML being used such as the character set, the codes used
for SGML delimiters etc. Its content for TEI-conformant
document types is discussed further in chapter
The document type declaration contains a base document type
definition and may also include one or more concurrent
document type definitions. The declaration may consist of a
reference to some publicly defined document type declaration,
an explicit document type definition, or some combination of
the two. Entity names to be used in the DTD may be declared in
similar ways in the same part of the prologue.
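How these parts fit together can be sketched as follows. The SGML
declaration is assumed here to be supplied by the system and is
not shown; the element declarations are those assumed for the
simple poem model, and the entity name and the sample text are
hypothetical.

```sgml
<!DOCTYPE anthology [
  <!-- the document type definition, given here explicitly -->
  <!ELEMENT anthology - - (poem+)>
  <!ELEMENT poem      - - (title?, stanza+)>
  <!ELEMENT title     - O (#PCDATA)>
  <!ELEMENT stanza    - O (line+)>
  <!ELEMENT line      - O (#PCDATA)>
  <!-- an entity declared for use in the document instance -->
  <!ENTITY tei "Text Encoding Initiative">
]>
<!-- the document instance begins once the prologue has ended -->
<anthology>
<poem>
<title>A sample title
<stanza>
<line>A first line which mentions the &tei;
<line>and a second line
</poem>
</anthology>
```

Because the end tags of TITLE, STANZA and LINE are declared
omissible in this sketch, the instance leaves them out and relies
on the parser to infer where each element ends.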
Combining and extending document type definitions is discussed
further in chapter 9. As with the SGML declaration, most SGML
processors allow the document type declaration to be held
in compiled form and invoked invisibly by the user for one or
more documents.
The document instance is the content of the document itself,
independent of any declarations but possibly containing
references to other entities to be included within it.
A variety of software is available to assist in the tasks of
creating, validating and processing SGML documents. At the
heart of most such software is an SGML parser.
A
A
Text oriented database management systems typically use inverted
file indexes to point into documents, or subdivisions of them. A
search can be made for an occurrence of some word or word pattern
within a document or within a subdivision of one. Meaningful
subdivisions of input documents will of course be closely related
to the subdivisions specified using descriptive markup. It is thus
simple for textual database systems to take advantage of
SGML-tagged documents.
Hypertext systems improve on older methods of handling texts
by supporting associative links within and across documents.
Again, the basic building block needed for such systems is also a
basic building block of SGML markup: the ability to identify and
to link together individual document elements comes free as a part
of the SGML way of doing things. To load an SGML document into a
hypertext system requires only a processor which can interpret the
SGML tags correctly. (See further section
Standard Generalised Markup Language
What's special about SGML?
document type concept; and its independence of any one
representation system. These three aspects are discussed
briefly below, and then in more depth in sections
Descriptive markup
"the following item is a blort", "this is the end of the most
recently begun flapdoodle" etc. By contrast, a procedural
markup system defines what processing is to be carried out at
particular points in a document: "call procedure blort with
parameters 1, b and x here", "terminate the flapdoodle
procedure here" etc. In SGML, the instructions needed to
process a document for some particular purpose (for example,
to format it) are sharply distinguished from the descriptive
markup which occurs within the document. Usually, they are
collected outside the document in separate procedures or
programs.
Types of document
Data independence
string substitution, that is, a
simple machine-independent way of stating that a particular
string of characters in the document should be replaced by
some other string when the document is processed. One obvious
application for this mechanism is to ensure consistency of
nomenclature; another, more significant, is to counter the
notorious inability of different computer systems to
understand each other's character sets, or of any one system
to provide all the graphic characters needed for a particular
application, by providing descriptive mappings for
non-portable characters.
Textual structures
blort
? how many pages?).
Other structural units are more clearly analytic, in that they
characterise a section of a text. A dramatic text might regard
each speech by a different character as a unit of one kind, and
stage directions or pieces of action as units of another kind.
The purpose of such an analysis is less likely to be to locate
parts of the text (the 93rd speech by Horatio in Act 2)
than to facilitate comparisons between the words used by one
character and those of another, or those used by the same
character at different points of the play.
Tristram Shandy
for example) cannot be fully
appreciated without an awareness of the interplay between
narrative units (such as chapters or paragraphs) and page
divisions. For many types of research, it is the interplay
between different levels of analysis which is crucial: the
extent to which syntactic structure and narrative structure
mesh, or fail to mesh, for example, or the extent to which
phonological structures reflect morphology.
SGML structures
"blort" element is that
instances of it may (or may not) occur within elements of type
"farble", and that it may (or may not) be decomposed into
elements of type "blortette". It should be stressed that,
so far as the SGML standard is concerned, the semantics of an
element are entirely in the eye of the beholder. It is up to
the creators of SGML conformant definitions (confusingly known
in the Standard as
For Love: poems 1950-1960
Defining the rules
minimisation rules; and a content model. Each of
these parts is discussed further below. Components of the
declaration are separated by white space, that is one or more
blanks, tabs or newlines.
parsed character data, and it means
that the element being defined may contain any valid character
data but may not contain further embedded elements. It thus
forms the bottom line of most SGML element declarations. In
our example, TITLEs and LINEs are so defined.
Complicating the issue
exception
mechanism; to deal with the
second, SGML permits the definition of concurrent
document structures.
Exceptions to the content model
Concurrent structures
Attributes
"attribute", as it does some other words, in a rather
specialised way, in this case to describe information which is
in some sense descriptive of specific element occurrences
without being itself regarded as an element. For example, you
might wish to add a "status" attribute to occurrences of some
elements in a document to indicate their degree of reliability,
or to add an "identifier" attribute so that you could refer to
particular element occurrences from elsewhere within a document.
If an element has been defined as having attributes, the
attribute values are supplied in the document instance as
"id" and "status". For the instance of a
"id" attribute has the value "P1" and the "status" attribute
has the value "draft".
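By way of illustration, attribute values are written inside the
start tag of the element occurrence to which they apply. The
sketch below assumes a POEM element declared with an ID attribute
and a STATUS attribute; the element name, the list of permitted
STATUS values and the defaults are assumptions made for this
sketch.

```sgml
<!-- Hypothetical declaration: an identifier and a status for POEM -->
<!ATTLIST poem id     ID              #IMPLIED
               status (draft | final) "draft">

<!-- A start tag supplying both attribute values -->
<poem id="P1" status="draft">
```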
"see note 6" or "as discussed in chapter 5". When a
text is being produced the actual numbers associated with the
notes or chapters may not be certain. Moreover, if we have
followed the gospel of descriptive markup, such things as page
or chapter numbers, being entirely matters of presentation,
will not in any case be present in the marked up text: they
will be assigned by whatever processor is operating on the
text (and may indeed differ in different applications). SGML
therefore provides a special mechanism by which any element
occurrence may be given a special identifier, a kind of label,
which may be used to refer to it from anywhere else within
the same text. The cross-reference itself is regarded as an
element occurrence of a specific kind, which must also be
declared in the DTD. In each case, the identifying label
(which may be arbitrary) is supplied as the value of a special
attribute.
Suppose, for example, we wish to include a reference within
the notes on one poem that refers to another poem. We will
first need to provide some way of attaching a label to each
poem: this is done by defining an attribute for the POEM
element, as follows
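A declaration of this kind might be sketched as below. The
attribute name, its declared value and its default are those
described in the text; the lower-case spelling of the names is a
choice made for this sketch.

```sgml
<!-- PID: an optional identifier, unique within the document -->
<!ATTLIST poem pid ID #IMPLIED>
```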
Here we define an attribute PID, the value of which must be of
type ID (this keyword implies that it must be unique within
the document and will be used to identify the element
occurrence in which it is used) but which may be omitted
(because only poems to which we intend to refer need use this
attribute). For any such poem we can now include in the tag
that opens it a unique identifier, for example
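A start tag carrying such an identifier might look like the
following sketch, in which the value is quoted as a precaution
(an assumption about the tagging style, not a requirement stated
in the text):

```sgml
<poem pid="counterpoint">
```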
Next we need to define a new element for the cross reference
itself. This will not have any content - it is only a
pointer - but it has an attribute, the value of which will be
the identifier of the element pointed at. This is achieved by
the following declarations:
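Declarations matching the description that follows might be
sketched as below. The keywords EMPTY, IDREF and #REQUIRED are
standard SGML; the layout and the lower-case spelling of the
names are choices made for this sketch.

```sgml
<!-- POEMREF has no content, so its end tag may be omitted -->
<!ELEMENT poemref - O EMPTY>
<!-- its ID attribute must be supplied and must point at an ID value -->
<!ATTLIST poemref id IDREF #REQUIRED>
```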
The POEMREF element needs no close tag because it has no
content. It has a single attribute, which we choose to call ID
to make obvious what its function is. The value of this
attribute must be of type IDREF (the keyword used for cross
reference pointers of this type) and it must be supplied.
With these declarations in force, we can now encode a
reference to the poem with id "counterpoint" as follows
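Assuming the POEMREF element and its ID attribute are declared as
described, such a reference might be tagged like this (a sketch;
the quoting of the value is a choice made here):

```sgml
<poemref id="counterpoint">
```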
When an SGML parser encounters this empty element it will
simply check that a poem exists with the identifier
"counterpoint". Other SGML processors could take any number of
other actions: different formatters might for example insert a
phrase such as "See also " followed by a number, or the poem
title or its first lines. A hypertext style processor might
use this element as a signal to activate a link to the poem
being referred to. The purpose of the SGML
markup is simply to indicate that a cross reference exists: it
does not determine what the processor is to do with it.
Data Independence
tei
Text Encoding
Initiative
. This is an instance of a GoodBits
and
whose value is a system identifier - in this case, the name of
an operating system file.
"The labours of the &tei have only just begun" will be
interpreted by an SGML processor exactly as if it read "The
labours of the Text Encoding Initiative have only just begun".
In the case of a system entity, it is, of course, the contents
of the operating system file which are substituted, so that
the passage "The following text has been suppressed:
&goodbits;" will be expanded to include the whole of whatever
the system finds in the file c:\tei\hotstuff.txt.
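Entity declarations which would produce the expansions just
described might be sketched as follows. The entity names, the
replacement string and the file name are those mentioned in the
text; the exact form of the original declarations is an
assumption.

```sgml
<!-- an entity whose value is given directly as a string -->
<!ENTITY tei "Text Encoding Initiative">

<!-- a system entity: its value is the contents of the named file -->
<!ENTITY goodbits SYSTEM "c:\tei\hotstuff.txt">
```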
"ct" might be distinguished from the non-ligatured form by
encoding it as "&ctlig;" rather than "ct". Other special
typographic features such as leafstops or rules could equally
well be represented by mnemonic entity references in the text.
When processing such texts, an entity declaration would be
added giving the desired representation for such textual
elements. If, for example, ligatured letters are of no
interest, we would simply add a declaration such as
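A declaration with this effect might be sketched as below; it
assumes that the plain two-letter sequence is the representation
wanted.

```sgml
<!-- expand every reference to the ligature as the plain letters -->
<!ENTITY ctlig "ct">
```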
and the distinction present in the source document would be
removed. If, on the other hand, a formatting program capable
of representing ligatured characters is to be used, we might
replace the entity declaration to give whatever sequence of
characters such a program requires as the expansion. If the
characters to be used in the expansion cannot be typed in
directly, they may be given as
Putting it all together
A short SGML Reading List