SGML Markup

A gentle introduction to SGML

SGML (the Standard Generalised Markup Language) is an international standard for the description of marked-up electronic text. Although the abbreviation is widely rumoured to be short for the surnames of its progenitors, this is its official expansion. To be more precise, SGML is a metalanguage, that is, a means of formally describing a language, in this case a markup language, as described elsewhere in these Guidelines. The present chapter, while falling far short of the rigour of the international standard itself (International Organisation for Standardisation: ISO 8879, Information processing - Text and office systems - Standard Generalized Markup Language (SGML), 1986), attempts to give an informal introduction to those parts of it of which a proper understanding is necessary to make best use of these Guidelines. A short SGML reading list is also provided below.

What's special about SGML?

There are three characteristics of SGML which distinguish it from other markup languages: its use of descriptive rather than procedural markup; its document type concept; and its independence of any one representation system. These three aspects are discussed briefly below, and then in more depth in later sections.

Descriptive markup

A descriptive markup system uses markup codes which simply assert something of the parts of a document they identify: "the following item is a blort", "this is the end of the most recently begun flapdoodle", and so on. By contrast, a procedural markup system defines what processing is to be carried out at particular points in a document: "call procedure blort with parameters 1, b and x here", "terminate the flapdoodle procedure here", and so on. In SGML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the descriptive markup which occurs within the document. Usually, they are collected outside the document in separate procedures or programs.

This means that the same document can be processed by many different pieces of software, each of which can apply different processing instructions to those parts of it which are considered relevant. For example, a content analysis program might wish to disregard entirely the footnotes embedded in an annotated text, while a formatting program might wish to extract and collect them all together for printing at a specific point during the processing of the document. Different sorts of processing instructions can be associated with the same parts of the file. For example, one program might wish to extract names of persons and places from a document to create an index or database, while another, operating on the same text, might wish to print names of persons and places in a distinctive typeface.

Types of document

Secondly, SGML introduces the notion of a document type, and hence a document type definition (DTD). Documents are regarded as having types, just as other objects processed by computers do. The type of a document is formally defined by its constituent parts and their structure. The definition of a report, for example, might be that it consisted of a title and possibly an author, followed by an abstract and a sequence of one or more paragraphs. Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract - whatever other report-like characteristics these might have for the human reader.
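As a rough sketch, the definition of a report given above might be written down formally as follows. The element names REPORT, AUTHOR, ABSTRACT and PARA are assumptions made purely for illustration, and the notation itself is explained later in this chapter.

```sgml
<!-- hypothetical element names: a report is a title, an optional
     author, an abstract, and one or more paragraphs, in that order -->
<!ELEMENT REPORT - - (TITLE, AUTHOR?, ABSTRACT, PARA+)>
<!ELEMENT (TITLE | AUTHOR | ABSTRACT | PARA) - - (#PCDATA)>
```

Under such a definition, a document with no title, or one in which the abstract followed the paragraphs, would be rejected as not being a report.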

If documents are of known types, a special purpose program (called a parser) can be used to process a document claiming to be of a particular type and check that all the elements required for that document type are indeed present and correctly ordered. More significantly, different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document structure information, and which can thus behave in a more intelligent fashion.

Data independence

A basic design goal of SGML was to ensure that documents encoded according to its provisions should be transportable from one hardware/software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of bytes (characters) of which documents are composed. SGML provides a general purpose mechanism for string substitution, that is, a simple machine-independent way of stating that a particular string of characters in the document should be replaced by some other string when the document is processed. One obvious application for this mechanism is to ensure consistency of nomenclature; another, more significant, is to counter the notorious inability of different computer systems to understand each other's character sets, or of any one system to provide all the graphic characters needed for a particular application, by providing descriptive mappings for non-portable characters.
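In SGML this substitution mechanism is provided by what the standard calls entities. As a minimal sketch only (the entity name tei and its replacement text are invented for illustration; eacute follows the naming used in the public entity sets published with ISO 8879), two declarations and a line of text referring to them might look like this:

```sgml
<!-- a hypothetical abbreviation, expanded wherever the reference appears -->
<!ENTITY tei    "Text Encoding Initiative">
<!-- a descriptive name standing for a character that may not be portable;
     how it is finally rendered is left to the receiving system -->
<!ENTITY eacute SDATA "[eacute]">

The &tei; met in a caf&eacute;.
```

The first declaration serves consistency of nomenclature; the second describes a character by name rather than relying on any one system's character set.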

Textual structures

A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sentences, paragraphs, chapters and sections. A verse text might be divided into lines, stanzas and cantos. Once printed, sequences of prose and verse might be divided into pages, gatherings or volumes.

Such units as these are most often used to identify specific locations or reference points within a text (the third sentence of the second paragraph in chapter ten; canto 10, line 1234; page 412 etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (is the average sentence length of section 2 different from that of section 5? how many paragraphs separate each occurrence of the word blort? how many pages?). Other structural units are more clearly analytic, in that they characterise a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. The purpose of such an analysis is less likely to be to locate parts of the text (the 93rd speech by Horatio in Act 2) than to facilitate comparisons between the words used by one character and those of another, or those used by the same character at different points of the play.

In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, linebreaks, use of white space and so forth.

These textual structures overlap with each other in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organisation of the book and the logical structure of the work it contains. Many great works (Sterne's Tristram Shandy for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and page divisions. For many types of research, it is the interplay between different levels of analysis which is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.

SGML structures

SGML provides a simple and consistent mechanism for the markup or identification of all such textual units, and also a method of expressing rules which define how combinations of such units can meaningfully occur in any text. The technical term used in the SGML Standard for a textual unit, viewed as a structural component, is element. Different types of elements are given different names, but SGML provides no way of expressing the meaning of a particular type of element, other than its relationship to other element types. That is, all one can say about any blort element is that instances of it may (or may not) occur within elements of type farble, and that it may (or may not) be decomposed into elements of type blortette. It should be stressed that, so far as the SGML standard is concerned, the semantics of an element are entirely in the eye of the beholder. It is up to the creators of SGML conformant definitions (confusingly known in the Standard as applications) to choose names indicative of the intended function of the elements they identify; hence the technical term for the name of an element type, which is generic identifier, or GI.

Within a marked up text (or, to use the jargon, a document instance), each element must be explicitly marked or tagged in some way. The standard provides for a variety of different ways of doing this, the most commonly used being to insert a tag at the beginning of the element (an open-tag) and another at its end (a close-tag). The start and end tag pair are used to bracket off the element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, an embedded speech element in a text might be tagged as follows:

<![ CDATA [
... Rosalind's remarks <speech>This is the silliest stuff that ere I heard of!</speech> clearly indicate ...
]]>

As this example shows, an open-tag takes the form <name>, where < is a string indicating the start of the open tag, name is the generic identifier of the element which is being delimited, and > is the string indicating the end of a tag. A close-tag takes the form </name>, where </ is a string marking the start of a close-tag, name is the generic identifier of the element being closed and, as before, > is the string indicating the end of a tag. (The actual characters used for <, </ and > may be redefined, but it is conventional to use the characters used in this description.)

Elements within a text will usually be nested, that is, elements of one type will usually be embedded (contained entirely) within elements of a different type. This is one reason why the end-tag needs to specify which element it is closing.

To illustrate this, we will consider a very simple structural model. Let us assume that we wish to identify within an anthology only poems, their titles, and the stanzas and lines of which they are composed. In SGML terms, our document type is the anthology, and it consists of a series of poems. Each poem has embedded within it one element, a title, and several occurrences of another, a stanza, each stanza having embedded within it a number of line elements. Fully marked up, a text conforming to this model might appear as follows:

<![ CDATA [
<anthology>
<poem><title>A counterpoint</title>
<stanza>
<line>Let me be my own fool</line>
<line>of my own making, the sum of it</line>
</stanza>
<stanza>
<line>is equivocal.</line>
<line>One says of the drunken farmer:</line>
</stanza>
<stanza>
<line>leave him lay off it. And this is</line>
<line>the explanation.</line>
</stanza>
</poem>
<!-- more poems go here -->
</anthology>
]]>

Taken from Robert Creeley, For Love: poems 1950-1960. Copyright 1962 and used without permission.

This example makes no assumptions about the rules governing, for example, whether or not a title can appear in places other than preceding the first stanza, or whether lines can appear which are not included in a stanza: that is why its markup appears so verbose. In some cases, the beginning and end of every element must be explicitly marked, because there are no identifiable rules about which elements can appear where. In practice, however, rules can usually be hypothesized which greatly reduce the need for so much tagging. For example, considering our greatly over-simplified model of a poem, we could state rules of the following kind:

  1. An anthology contains a number of poems and nothing else.
  2. A poem always has a single title element which precedes the first stanza.
  3. Every stanza consists of one or more lines and every line is contained by a stanza.
  4. Nothing can follow a stanza except another stanza or the end of a poem.
  5. Nothing can follow a line except another line or the start of a new stanza.

From rules 4 and 5 it follows that we do not need to mark the ends of stanzas or lines explicitly. From rule 2 it follows that we do not need to mark the end of the title - it is implied by the start of the first stanza. Similarly, from rule 1, it follows that we need not mark the end of the poem - it is implied by the start of the next poem, or by the end of the anthology. From rule 3 it follows that we do not need to mark the start of the first line in each stanza - it is implied by the start of a stanza. Applying these simplifications, we could mark up the same poem as follows:

<![ CDATA [
<anthology>
<poem><title>A counterpoint
<stanza>Let me be my own fool
<line>of my own making, the sum of it
<stanza>is equivocal.
<line>One says of the drunken farmer:
<stanza>leave him lay off it. And this is
<line>the explanation.
</anthology>
]]>

The ability to use rules stating which elements can be nested within others to simplify markup is a very important characteristic of SGML. Before considering these rules further, you may like to consider how text marked up in the form above could be processed by a computer for very many different purposes. A simple indexing program could extract only the relevant text elements in order to make a list of titles, or of words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could determine how many stanzas or lines begin with lower-case letters and thus (perhaps) in mid-sentence.

Note that this simple example has not addressed the problem of marking elements such as sentences explicitly; the implications of this are discussed below.

Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it with no need for the sort of transformations and translations needed to move word processor files around.

Defining the rules

In specifying rules such as those described above, the document designer may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of simple rules and the complexity of real texts. This is particularly the case when the rules being defined relate to texts which already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for example for entry into a textual database of some kind, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text - if only as a means of testing the usefulness of that view or hypothesis. It is important to remember that every document type definition is an interpretation of a text. There is no single DTD which encompasses any kind of absolute truth about a text, although it may be convenient to privilege some DTDs above others for particular types of analysis.

At present, SGML is most widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against pre-defined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of tagging accurately elements of less rigidly constrained texts such as those which concern the TEI. By making these rules explicit, the scholar reduces his or her own burdens while also being forced to make explicit an interpretation of the text being encoded.

The rules to be used by an SGML parser when interpreting an encoded text take the form of a series of declarations which, together with other definitions, make up the body of the document type definition (DTD). A DTD may be attached to a document, or more usually referred to from within it. For our simple model of a poem, the following declarations would be appropriate:

<![ CDATA [
<!ELEMENT ANTHOLOGY - - (POEM+)>
<!ELEMENT POEM      - - (TITLE?, STANZA+)>
<!ELEMENT TITLE     - O (#PCDATA)>
<!ELEMENT STANZA    - O (LINE+)>
<!ELEMENT LINE      O O (#PCDATA)>
]]>

These five lines are examples of SGML formal declarations. Each declaration begins with <!ELEMENT, indicating that it declares an element, in the technical sense defined above, and ends with a >. It consists of three parts: a name, or group of names; optionally, two characters specifying minimisation rules; and a content model. Each of these parts is discussed further below. Components of the declaration are separated by white space, that is, one or more blanks, tabs or newlines.

The first part of each declaration above gives the generic identifier of the element which is being declared, for example POEM, TITLE etc. It is possible to declare several elements in one statement, as discussed below.

The second part of the declaration is optional. It specifies what are called minimisation rules for the element concerned. These rules determine whether or not start and end tags must be present in every occurrence of the element concerned. They take the form of a pair of characters, separated by white space, the first of which relates to the start tag, and the second to the end tag. In each case, either a hyphen or a letter O (for optional) must be given; the hyphen indicates that the tag must be present, and the letter O that it may be omitted. Thus, in this example, every element except LINE must have a start tag, while only the ANTHOLOGY and POEM elements require an end tag as well. If no minimisation rules are given for an element, then neither start nor end tags may be omitted.

The third part of each declaration, enclosed in parentheses, is called the content model of the element, because it specifies what element occurrences may legitimately contain. Contents are specified either in terms of other elements or using special reserved words. There are several such reserved words, of which by far the most commonly encountered is #PCDATA, as in this example. This is an abbreviation for parsed character data, and it means that the element being defined may contain any valid character data but may not contain further embedded elements. It thus forms the bottom line of most SGML element declarations. In our example, TITLEs and LINEs are so defined.

The declaration for STANZA in the example above states that a stanza consists of one or more lines. It uses an occurrence indicator (the plus sign) to indicate how many times the element named in its content model may occur. There are three occurrence indicators, known in the standard as plus, opt and rep. Plus, which is usually represented by a plus sign, means that there may be one or more occurrences of the element concerned; opt, usually represented by a question mark, means that there may be at most one and possibly no occurrence; rep, usually represented by a star, means that the element concerned may either be absent or appear one or more times. Thus, if the content model for STANZA were (LINE*), stanzas with no lines would be possible as well as those with one or more lines. If it were (LINE?), again empty stanzas would be countenanced, but no stanza could have more than a single line. Similarly, the declaration for POEM in the example above states that a POEM cannot have more than one title, but may have none, and that it must have at least one stanza and may have several.
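The three occurrence indicators can be set side by side as follows. This is an illustrative sketch only: the three declarations are alternatives, since an element may be declared only once in any one DTD.

```sgml
<!-- three alternative declarations for STANZA; a DTD would contain only one -->

<!-- plus: one or more lines -->
<!ELEMENT STANZA - O (LINE+)>

<!-- rep: zero or more lines, so an empty stanza is allowed -->
<!ELEMENT STANZA - O (LINE*)>

<!-- opt: at most one line, possibly none -->
<!ELEMENT STANZA - O (LINE?)>
```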

The content model (TITLE?,STANZA+) contains more than one component, and thus needs additionally to specify the order in which these (TITLE and STANZA) may appear. This ordering is determined by the group connector (the comma) used between its components. There are three possible group connectors, known in the standard as seq, and and or. Seq, which is usually represented by a comma, means that the components it connects must both appear in the order specified by the content model. And, which is usually represented by an ampersand, indicates that the components it connects must both appear but in any order. Or, which is usually represented by a vertical bar, indicates that only one of the components it connects can appear. If the comma in this example were replaced by an ampersand, a title could appear either before the stanzas of a poem or at the end (but not between stanzas). If it were replaced by a vertical bar, then a poem would consist of either a title or just stanzas - but not both!
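The effect of each group connector on the POEM content model can be sketched in the same way. Again the declarations are alternatives shown together only for comparison, and the comments simply restate the behaviour described above.

```sgml
<!-- three alternative declarations for POEM; a DTD would contain only one -->

<!-- seq: an optional title, then the stanzas -->
<!ELEMENT POEM - - (TITLE?, STANZA+)>

<!-- and: both components, in either order -->
<!ELEMENT POEM - - (TITLE? & STANZA+)>

<!-- or: either a title or stanzas, never both -->
<!ELEMENT POEM - - (TITLE? | STANZA+)>
```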

In our example so far, the components of each content model have been either single elements or #PCDATA. It is quite permissible, however, to define content models in which the components are lists of elements, combined by group connectors. Such lists, known as model groups, may also be modified by occurrence indicators (provided that their constituent elements are not) and themselves combined by group connectors. To demonstrate these facilities, let us now expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorise poems as one of stanzaic, couplets or blank. A blank verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment), so no additional elements need be defined for it. It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; this is discussed further elsewhere. A couplet is defined as a LINE1 followed by a LINE2.

<![ CDATA [
<!ELEMENT COUPLET o o (LINE1, LINE2)>
]]>

The elements LINE1 and LINE2 (which are distinguished to enable studies of rhyme scheme, for example) have exactly the same content model as the existing LINE element. They can therefore share the same declaration. In this situation, it is convenient to supply a name group as the first component of a single element declaration, rather than give a series of declarations differing only in the names used. A name group is a list of GIs connected by the or connector and enclosed in parentheses, as follows:

<![ CDATA [
<!ELEMENT (LINE | LINE1 | LINE2) o o (#PCDATA)>
]]>

The declaration for the POEM element can now be changed to include all three possibilities:

<![ CDATA [
<!ELEMENT POEM - o (TITLE?, (STANZA+ | COUPLET+ | LINE+) ) >
]]>

That is, a poem consists of an optional title, followed by several stanzas, or several couplets, or several lines. Note the difference between this definition and the following:

<![ CDATA [
<!ELEMENT POEM - o (TITLE?, (STANZA | COUPLET | LINE)+ ) >
]]>

The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow for a single poem to contain a mixture of stanzas, couplets or blank verse.

Complicating the issue

In the simple cases described so far, it has been assumed that one can identify both the immediately containing element type and the immediate constituents of every element defined in a textual structure. A poem consists of stanzas, and an anthology consists of poems; stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure, arranged like a family tree with a single ancestor at the top and many children (mostly the elements containing #PCDATA) at the bottom. This gross simplification turns out to be surprisingly effective for a large number of purposes. It is not however adequate for the full complexity of real textual structures. In particular, it does not cater for the case of more or less freely floating elements that can appear at almost any hierarchic level in the structure, and it does not cater for the case where several different trees may be identified in the same document. To deal with the first case, SGML provides the exception mechanism; to deal with the second, SGML permits the definition of concurrent document structures.

Exceptions to the content model

In most documents, there will be some identifiable elements that can occur at any level of their structure. Annotations, for example, might be attached to the whole of a poem, to a stanza, to a line of a stanza or to a single word within it. In a textual critical edition, the same might be true of variant readings. In this simple case, the complexity of adding an annotation element as an optional component of every content model is not particularly onerous; in a more realistically complex model perhaps containing some ten or twenty levels such an approach is barely workable. It would not make much sense to include (say) zero or more annotation elements as a component of every element in even a moderately complex DTD.

To cope with this, SGML allows for any content model to be further modified by means of an exception list. There are two types of exception: inclusions, that is, additional elements that can be included at any point in the model group or any of its constituent elements; and exclusions, that is, elements that cannot be included within the current model.

To extend our declarations further to allow for annotations and variant readings, which we will assume can appear anywhere within the text of a poem, we first need to add declarations for these two elements:

<![ CDATA [
<!ELEMENT (note | variant) - - (#PCDATA)>
]]>

The note and variant elements must have both start and end tags, since they can appear anywhere. Rather than add them to the content model for each type of poem, we can add them in the form of an inclusion list to the poem element, which now reads:

<![ CDATA [
<!ELEMENT POEM - o (TITLE?, (STANZA+ | COUPLET+ | LINE+) ) +(NOTE | VARIANT) >
]]>

The plus sign at the start of the (NOTE | VARIANT) name list indicates that this is an inclusion exception. With this addition, notes or variants can appear at any point in the content of a poem element - even within those elements (such as TITLE) for which we have defined a content model of #PCDATA. They can thus also appear within notes or variants! If we wanted for some reason to prevent notes or variants appearing within titles, we could add an exclusion exception to the declaration for TITLE above:

<![ CDATA [
<!ELEMENT TITLE - O (#PCDATA) -(NOTE | VARIANT)>
]]>

The minus sign at the start of the (NOTE | VARIANT) name list indicates that this is an exclusion exception. With this addition, notes and variants will be prohibited from appearing within titles, notwithstanding their potential inclusion implied by the previous addition to the content model for POEM. In the same way, we could prevent notes and variants from nesting within notes and variants by modifying the definition above to read:

<![ CDATA [
<!ELEMENT (note | variant) - - (#PCDATA) -(NOTE | VARIANT)>
]]>

The meticulous reader will note that this precludes both variants within notes and notes within variants. Inclusion and exclusion exceptions should be used with care as their ramifications may not be immediately apparent.

Concurrent structures

hic desunt multa

Attributes

SGML uses the word attribute, as it does some other words, in a rather specialised way, in this case to describe information which is in some sense descriptive of specific element occurrences without being itself regarded as an element. For example, you might wish to add a status attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an identifier attribute so that you could refer to particular element occurrences from elsewhere within a document. If an element has been defined as having attributes, the attribute values are supplied in the document instance as attribute-value pairs inside the open-tag for the element occurrence. For example:

<![ CDATA [
<poem id=P1 status="draft"> ... </poem>
]]>

The poem element has been defined as having two attributes, id and status. For the instance of a poem in this example, represented here by an ellipsis, the id attribute has the value P1 and the status attribute has the value draft.

Like elements themselves, attributes are declared in the SGML document type declaration, using rather similar syntax. As well as the name of an attribute and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for it and a default value.
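As a sketch of what such a declaration might look like for the id and status attributes used in the example above (the list of permitted status values, and the choice of DRAFT as the default, are assumptions made for illustration):

```sgml
<!-- a sketch only: the permitted STATUS values are hypothetical -->
<!ATTLIST POEM
          ID      ID                         #IMPLIED
          STATUS  (DRAFT | REVISED | FINAL)  DRAFT>
```

Here the declared value ID requires each identifier to be unique within the document, #IMPLIED allows the attribute to be left out, and the parenthesised list restricts status to one of the named values, with DRAFT assumed when none is given.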

Some critics, pointing out that almost any information conveyed by using an attribute could equally well be conveyed by using an additional element, see attributes as confusing the simplicity of SGML syntax for no very good reason. Since the reverse is not always the case - information represented by additional elements cannot always be represented by using an attribute - they may be right. However, there are situations in which attributes seem to provide a convenient way of expressing information ancillary to a text, whatever their formal redundancy. The interested reader is referred to who? for a discussion.

We will discuss two possible uses for attributes: firstly as a means of including normalised forms of speech prefixes in a dramatic text, and secondly as a means of providing cross reference links within a given text.

In a dramatic text, it is customary to flag the start of each speech by a brief indication of who is to speak it. For most types of analysis, we would wish to distinguish this speech prefix from the speech itself. We might therefore expect to find an element declaration like the following: ]]> and text tagged as follows: Ferd. Couer her face: Mine eyes dazell: she di'd yong. Bos. I thinke not so: her infelicitie Seem'd to have yeeres too many. ... ]]>
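The declaration and the tagged passage referred to here are not reproduced in the present text. As a minimal sketch only, assuming a SPEECH element which contains a PREFIX followed by one or more LINE elements (PREFIX is the name used in the discussion that follows; SPEECH, LINE and the content model are assumptions), they might look something like this:

```sgml
<!-- hypothetical declarations: SPEECH and its content model are assumptions -->
<!ELEMENT SPEECH - - (PREFIX, LINE+)>
<!ELEMENT PREFIX - - (#PCDATA)>
<!ELEMENT LINE   - - (#PCDATA)>

<speech><prefix>Ferd.</prefix>
<line>Couer her face: Mine eyes dazell: she di'd yong.</line></speech>
<speech><prefix>Bos.</prefix>
<line>I thinke not so: her infelicitie</line>
<line>Seem'd to have yeeres too many.</line></speech>
```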

When encoding from an early printed book or manuscript, it is not at all unusual to find the same character referred to by different prefixes in different parts of the play, or ambiguous prefixes. For the purposes of dramatic analysis it would be very convenient to normalise all the prefixes for a given character. One way of doing this might be to define an additional element, say NORM; this is left to the reader as an exercise. Instead, feeling that such editorial interventions should in some sense be distinguished from the encoded text, we will choose to define an attribute NORM for the existing PREFIX element. This is done by adding an attribute definition list declaration to the declarations above, as follows: ]]>
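The attribute definition list declaration itself is not reproduced at this point in the present text. A minimal sketch consistent with the description that follows, and assuming that the SGML keyword NMTOKEN is what is meant there by a name token, might be:

```sgml
<!-- NORM must be supplied on every PREFIX; its value is a single name token -->
<!ATTLIST PREFIX NORM NMTOKEN #REQUIRED>
```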

The declaration has four main parts. The first specifies the element (or elements) concerned. The next specifies the name or names of the attributes to be associated with the element. The third specifies what kind of values the attribute may take, that is, what kind of information is to be supplied for it. The last states whether or not the attribute is optional and, if it is, what default values can be assumed for it. In our simple case, we define an attribute NORM to be associated with the element PREFIX, the value of which is a name token (that is, loosely, any string of alphabetic characters not including a space) and which may not be omitted.

With this declaration in force, each PREFIX tag in the above passage must carry a NORM attribute, so that the prefixes Ferd. and Bos. are each tied to a single normalised name however they happen to be printed. Clearly, the same mechanism could be extended for any other type of normalisation.
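For illustration, the passage might then be tagged along the following lines. The sketch assumes SPEECH and TEXT elements enclosing each speech (names not taken from the text) and the normalised values FERDINAND and BOSOLA, chosen here only as examples.

```sgml
<SPEECH><PREFIX NORM="FERDINAND">Ferd.</PREFIX><TEXT>Couer her face: Mine eyes dazell: she di'd yong.</TEXT></SPEECH>
<SPEECH><PREFIX NORM="BOSOLA">Bos.</PREFIX><TEXT>I thinke not so: her infelicitie
Seem'd to have yeeres too many.</TEXT></SPEECH>
```

A processor can then group speeches by the NORM value while the printed prefix is preserved exactly as it stands in the source.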

Our second example for the use of attributes is less controversial, though more complicated. It is sometimes necessary to refer to an occurrence of one textual element from within another: an obvious example being phrases such as see note 6 or as discussed in chapter 5. When a text is being produced the actual numbers associated with the notes or chapters may not be certain. Moreover, if we have followed the gospel of descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications).

SGML therefore provides a special mechanism by which any element occurrence may be given a special identifier, a kind of label, which may be used to refer to it from anywhere else within the same text. The cross-reference itself is regarded as an element occurrence of a specific kind, which must also be declared in the DTD. In each case, the identifying label (which may be arbitrary) is supplied as the value of a special attribute.

Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is done by defining an attribute for the POEM element. We define an attribute PID, the value of which must be of type ID (this keyword implies that it must be unique within the document and will be used to identify the element occurrence in which it is used) but which may be omitted (because only poems to which we intend to refer need use this attribute). For any such poem we can now include in the tag that opens it a unique identifier, for example counterpoint.

Next we need to define a new element for the cross reference itself. This will not have any content (it is only a pointer) but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by declaring an element POEMREF with empty content, together with an attribute list for it. The POEMREF element needs no close tag because it has no content. It has a single attribute, which we choose to call ID to make obvious what its function is. The value of this attribute must be of type IDREF (the keyword used for cross reference pointers of this type) and it must be supplied. With these declarations in force, we can now encode a reference to the poem with id counterpoint by inserting a POEMREF element whose ID attribute has the value counterpoint.

When an SGML parser encounters this empty element it will simply check that a poem exists with the identifier counterpoint. Other SGML processors could take any number of other actions: different formatters might for example insert a phrase such as "See also " followed by a number, or the poem title or its first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to. The purpose of the SGML markup is simply to indicate that a cross reference exists: it does not determine what the processor is to do with it.
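The declarations and tags discussed in this example might be sketched as follows. The names POEM, PID, POEMREF and ID and the identifier counterpoint come from the discussion above; the exact form of each declaration is an assumption, and the POEM element itself is taken to be declared elsewhere in the DTD.

```sgml
<!-- PID is optional (#IMPLIED) but, where given, must be unique in the document -->
<!ATTLIST POEM    PID  ID     #IMPLIED>

<!-- POEMREF is an empty pointer, so its end-tag is omissible (O) -->
<!ELEMENT POEMREF - O  EMPTY>
<!ATTLIST POEMREF ID   IDREF  #REQUIRED>
```

In the document instance, one poem is labelled and a reference elsewhere points to it:

```sgml
<POEM PID="counterpoint">
<!-- body of the poem being labelled -->
</POEM>

<!-- elsewhere, within the notes on another poem -->
<POEMREF ID="counterpoint">
```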

Data Independence

The aspects of SGML discussed so far are all concerned with the markup of structural elements within a document. SGML also provides a simple and flexible method of encoding and naming arbitrary parts of the actual content of a document in a portable way. In SGML parlance, an entity is any named part of a marked up document, irrespective of any structural considerations. An entity might be a string of characters or a whole file of text. To include it in a document, we use a construction known as an entity reference. For example, a declaration might define an entity whose name is tei (by convention, case is significant in entity names, unlike element names) and whose value is the string Text Encoding Initiative. This is an instance of a general entity declaration. By contrast, a system entity declaration might define an entity whose name is GoodBits and whose value is a system identifier, in this case the name of an operating system file.
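Declarations of the two kinds just described might look like this. The entity names, the replacement string and the file name c:\tei\hotstuff.txt are those used in this section; the exact syntax shown is a sketch and not a quotation from any particular DTD.

```sgml
<!-- general entity: the name, then the replacement text as a quoted literal -->
<!ENTITY tei      "Text Encoding Initiative">

<!-- system entity: the keyword SYSTEM, then a system identifier -->
<!ENTITY GoodBits SYSTEM "c:\tei\hotstuff.txt">
```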

Once an entity has been declared, it may be referenced anywhere within a document. This is done by supplying its name prefixed with the ero (entity reference open) character, normally an ampersand, and terminated by the refc (reference close) character, normally a semicolon. The reference close character may be omitted if it is followed by a space or record end. When an SGML parser encounters such an entity reference, it immediately substitutes the value declared for the entity name. Thus, the passage The labours of the &tei have only just begun will be interpreted by an SGML processor exactly as if it read The labours of the Text Encoding Initiative have only just begun. In the case of a system entity, it is, of course, the contents of the operating system file which are substituted, so that the passage The following text has been suppressed: &GoodBits; will be expanded to include the whole of whatever the system finds in the file c:\tei\hotstuff.txt.

This obviously saves typing, and simplifies the task of maintaining consistency in a set of documents. If the printing of a complex document is to be done at many sites, the document body itself might use an entity reference, such as &site;, wherever the name of the site is required. Different entity declarations could then be added at different sites to supply the appropriate string to be substituted for this name, with no need to change the text of the document itself.
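As a sketch of this arrangement, each site would keep the document body unchanged and supply its own declaration for the entity. The site names below are hypothetical, and only one of the two declarations would be present in any given copy of the prologue.

```sgml
<!-- declaration added at one site -->
<!ENTITY site "Oxford">

<!-- declaration used instead at another site -->
<!ENTITY site "Chicago">
```

At either site the body contains the same text, for example: Printed at &site; from the master copy.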

This string substitution mechanism (to use the jargon) has many other applications. It can be used to circumvent the notorious inadequacies of most computer systems for representing the full range of graphic characters needed for the display of modern English (let alone the requirements of other modern scripts or of ancient languages). So-called special characters not directly accessible from the keyboard (or, if accessible, not correctly translated when transmitted) may be represented by an entity reference.

Suppose, for example, that we wish to encode the use of ligatures in early printed texts. The ligatured form of ct might be distinguished from the non-ligatured form by encoding it as &ctlig; rather than ct. Other special typographic features such as leafstops or rules could equally well be represented by mnemonic entity references in the text. When processing such texts, an entity declaration would be added giving the desired representation for such textual elements. If, for example, ligatured letters are of no interest, we would simply add a declaration giving the plain string ct as the replacement for ctlig, and the distinction present in the source document would be removed. If, on the other hand, a formatting program capable of representing ligatured characters is to be used, we might replace the entity declaration to give whatever sequence of characters such a program requires as the expansion.

If the characters to be used in the expansion cannot be typed in directly, they may be given as character references, that is, as numeric values. A character reference is distinguished from other characters in the replacement string by the fact that it begins with a special character reference open symbol, usually the sequence &#, and ends with a refc symbol (i.e. usually a semicolon). For example, if the formatter to be used represents the ligatured form of ct by the characters c and t prefixed by the character with decimal value 102, the replacement string given in the entity declaration would be &#102;ct.
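The two alternative declarations for the ligature might be sketched as follows; only one of them would be in force for a given processing run. The entity name ctlig and the value 102 are from the text above, and the syntax is an illustrative assumption.

```sgml
<!-- ligatures of no interest: expand to the plain letters -->
<!ENTITY ctlig "ct">

<!-- formatter expects the character with decimal value 102 before c and t -->
<!ENTITY ctlig "&#102;ct">
```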

A list of entity declarations is known as an entity set. Standard entity sets are provided for use with most SGML processors, in which the names used will normally be taken from the lists of such names published as an annex to the SGML standard and elsewhere. The replacement values are, of course, highly system dependent.

Useful though the entity reference mechanism is for dealing with occasional departures from the expected character set, no-one would consider using it to encode extended passages, such as quotations in Greek or Russian in an English text. In such situations, different mechanisms are appropriate. These are discussed below in chapter 4.

Putting it all together

An SGML-conformant document has a number of parts, not all of which have been discussed in this chapter, and many of which the user of these Guidelines may safely ignore. For completeness, however, the following summary of how these parts are inter-related may be found useful.

An SGML document consists of an SGML prologue and a document instance. The prologue contains an SGML declaration and a document type definition.

The SGML declaration specifies basic facts about the dialect of SGML being used such as the character set, the codes used for SGML delimiters etc. Its content for TEI-conformant document types is discussed further in chapter below; normally the SGML declaration will be held in the form of compiled tables by the SGML processor and will thus be invisible to the user.

The document type declaration contains a base document type definition and may also include one or more concurrent document type definitions. The declaration may consist of a reference to some publicly defined document type declaration, an explicit document type definition, or some combination of the two. Entity names to be used in the DTD may be declared in similar ways in the same part of the prologue.

Combining and extending document type definitions is discussed further in chapter 9. As with the SGML declaration, most SGML processors allow the document type declaration to be held in compiled form and invoked, invisibly to the user, for one or more documents.

The document instance is the content of the document itself, independent of any declarations but possibly containing references to other entities to be included within it.
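The overall shape of a document, as summarised above, can be sketched as follows. The SGML declaration is omitted (as noted, it is normally held by the processor); the document type name ANTHOLOGY and the file name anthology.dtd are hypothetical, and only the tei entity is taken from earlier in this chapter.

```sgml
<!-- prologue: the document type declaration -->
<!DOCTYPE ANTHOLOGY SYSTEM "anthology.dtd" [
  <!-- declarations given explicitly, alongside those in the external file -->
  <!ENTITY tei "Text Encoding Initiative">
]>
<!-- document instance -->
<ANTHOLOGY>
<!-- content of the document, possibly containing entity references -->
</ANTHOLOGY>
```

The external file and the bracketed declarations together correspond to the "combination of the two" mentioned above.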

A variety of software is available to assist in the tasks of creating, validating and processing SGML documents. At the heart of most such software is an SGML parser. Other software functions which SGML processors should provide include structured editing, formatting and database management.

A structured editor is a kind of intelligent word-processor. It can use information extracted from a processed DTD to prompt the user with information about which elements are required at different points in a document as the document is being created. It can also greatly simplify the task of preparing a document, for example by inserting tags automatically.

A formatter operates on a tagged document instance to produce a printed form of it. Many typographic distinctions, such as the use of particular typefaces or sizes, are intimately related to structural distinctions, and formatters can thus usefully take advantage of descriptive markup. It is also possible to define the tagging structure expected by a formatting program in SGML terms, as a concurrent document structure.

Text-oriented database management systems typically use inverted file indexes to point into documents, or subdivisions of them. A search can be made for an occurrence of some word or word pattern within a document or within a subdivision of one. Meaningful subdivisions of input documents will of course be closely related to the subdivisions specified using descriptive markup. It is thus simple for textual database systems to take advantage of SGML-tagged documents.

Hypertext systems improve on older methods of handling texts by supporting associative links within and across documents. Again, the basic building block needed for such systems is also a basic building block of SGML markup: the ability to identify and to link together individual document elements comes free as a part of the SGML way of doing things. To load an SGML document into a hypertext system requires only a processor which can interpret the SGML tags correctly. (See further section )

A short SGML Reading List

As the current version of TEIDOC0 does not support bibliographic lists other than in back matter, and not much structure in them even then, this list has been moved to a separate file (Z319) with its own DTD.