Reference Systems

Reference systems are needed to mark a place within a text and to enable other readers to find it again. Traditional referencing systems may use structural units (chapters, paragraphs, sentences; stanza and verse), typographic units (page and line numbers), or divisions created specifically for reference purposes (chapter and verse in Biblical texts). The ID and IDREF attribute types of SGML (discussed above in section ) can provide new methods of reference, or can be used to implement traditional reference systems. Traditional reference schemes and schemes using SGML ID attributes may be more useful than those that rely on the SGML tagging of a text without using ID attributes, since the latter may be garbled if the SGML tagging is revised. (See section for a detailed discussion of the problems of various methods of identifying text segments without using SGML IDs.)

When traditional reference schemes represent a hierarchical structuring of the text, it is recommended that they be marked with hierarchically defined tags. When the hierarchy of the reference scheme mirrors that of the SGML document, the N attribute defined for all tags may be used to indicate the traditional identifier (name, number, or combination) of the relevant structural units. N may also be used to record the numbering of sections or list items in the copy text if the copy-text numbering is important for some reason (e.g. the numbers are out of sequence). When the hierarchy of the SGML-encoded document and that of the traditional scheme diverge (e.g. for reference schemes based on page and line numbers) or when there are several conflicting traditional reference schemes, the reference scheme should be tagged using a concurrent document hierarchy. (See the discussion of the CONCUR feature in section for an introduction to the concept of concurrent hierarchies.)

If concurrent markup is not desired (e.g. because the available SGML parser does not support the CONCUR feature), boundaries between segments in a traditional reference scheme may be specified using the milestone tag described below in section . No SGML validation of the reference scheme is possible using the milestone tag, so it will be the responsibility of the encoder or the application software to ensure that milestone tags occur in a sensible order (e.g. with a page reference before the first line reference).
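Since the parser cannot check milestone order, application software might do so. The following is a minimal sketch in Python, not part of these guidelines, assuming milestone tags that carry ed, unit, and n attributes. The function names and regular expressions are hypothetical, and the only rule checked is that a page milestone precedes the first line or column milestone of the same edition.

```python
import re

# Hypothetical patterns: a milestone start tag, and name=value attribute pairs
# whose values may be double-quoted, single-quoted, or unquoted.
MILESTONE_RE = re.compile(r"<milestone\b([^>]*)>", re.IGNORECASE)
ATTR_RE = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")


def parse_milestones(text):
    """Return one attribute dict per milestone tag, in document order."""
    result = []
    for match in MILESTONE_RE.finditer(text):
        attrs = {}
        for name, dq, sq, bare in ATTR_RE.findall(match.group(1)):
            attrs[name.lower()] = dq or sq or bare
        result.append(attrs)
    return result


def find_order_problems(text):
    """List (index, edition, unit) for line or column milestones that
    occur before any page milestone of the same edition."""
    editions_with_page = set()
    problems = []
    for index, attrs in enumerate(parse_milestones(text)):
        ed, unit = attrs.get("ed"), attrs.get("unit")
        if unit == "page":
            editions_with_page.add(ed)
        elif unit in ("line", "column") and ed not in editions_with_page:
            problems.append((index, ed, unit))
    return problems
```

A check of this kind suits page-and-line schemes only; verse line numbers that are independent of pagination would need a different rule.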

When creating SGML versions of any text, it is recommended that the page boundaries of the source text be marked using a concurrent hierarchy for its pages and lines or using the page.break or milestone tags. (If a page boundary falls in mid-word, the page tag may be moved to the end of the word if desired.) Marking page boundaries is strongly recommended when the text has no traditional referencing scheme or acknowledged reference edition. Line breaks in prose texts may be, but need not be, tagged.

Concurrent Markup for Pages and Lines

Perhaps the most common form of traditional reference system specifies the page and line, or page, column, and line of a passage as it appears in some standard edition. Such references may be specified using a concurrent markup hierarchy which divides the body of a text into pages and lines or into pages, columns, and lines. Volumes may also need to be identified. The document type name should be a short-hand identifier for the edition cited.

Page and line numbers for an edition by Lachmann, for example, might be specified thus:

<![ CDATA [
<(La)page n=223>
<(La)line n=1> [Text from Lachmann, p. 223, line 1]
<(La)line n=2> [Text from Lachmann, p. 223, line 2]
<(La)line n=3> [Text from Lachmann, p. 223, line 3]
<(La)line n=4> [Text from Lachmann, p. 223, line 4]
(etc.)
<(La)page n=224>
<(La)line n=1> [Text from Lachmann, p. 224, line 1]
<(La)line n=2> [Text from Lachmann, p. 224, line 2]
<(La)line n=3> [Text from Lachmann, p. 224, line 3]
<(La)line n=4> [Text from Lachmann, p. 224, line 4]
(etc.)
]]>

The following SGML declarations define such a concurrent markup hierarchy:

<![ CDATA [
<!-- Define "VERSION.NAME" in the document type declaration -->
<!-- subset before calling these declarations.  Sample:     -->
<!--     <!DOCTYPE La system 'plrefs.dec' [                 -->
<!--     <!ENTITY % version.name "La" >                     -->
<!--     ]>                                                 -->
<!--                                                        -->
<!-- N.B. this is not a highly constrained hierarchy.       -->
<!-- A tighter hierarchy may be defined if desired.         -->
<!--                                                        -->
<!ENTITY % version.name "ref">
<!ELEMENT %version.name - - (#PCDATA | vol | page)* >
<!ELEMENT vol           - O (#PCDATA | page)* >
<!ELEMENT page          - O (#PCDATA | line | col)* >
<!ELEMENT col           - O (#PCDATA | line)* >
<!ELEMENT line          - O (#PCDATA) >
<!ATTLIST page          n   CDATA  #IMPLIED
                        id  ID     #IMPLIED >
]]>

This concurrent hierarchy is enabled as shown in the comments; the sequence of lines shown (from DOCTYPE ... to ]>) should be embedded in the document file after the normal document type specification. (See examples in the appendix.) If page and line numbers from more than one standard edition are to be marked, then the relevant lines may be repeated, each time using a different value for the document type and entity definition (where the example has La).

Concurrent Markup for Other Hierarchies

Hierarchies similar to that defined above can be provided for a variety of common hierarchical reference schemes. The document type declarations in the appendix include definitions for three such hierarchies:

Any text with idiosyncratic canonical referencing will require its own DTD, so that appropriately named tags can be created for the reference units. Such DTDs may be modeled on those in the appendix.

Using the ID and N Attributes

In some cases, the canonical reference unit and the content units marked by an SGML tagging may coincide. For example, a reference to Ovid's Amores might be Amores 2.10.7, that is, book 2, poem 10, line 7. Book, poem, and line are structural units of the work and will be tagged in any case. (See section for a discussion of structural units in verse collections.) In such cases, it is convenient to record traditional reference numbers of the structural units using the N attribute. The relevant tags for our example would be:

<![ CDATA [
<h0 n=Amores>
<h1 n=2 type=book>
<h2 n=10 type=poem>
<line n=7>
]]>

This method is not without problems, since some editions may define structural units differently. For example, another edition of the Amores considers poem 10 a continuation of poem 9, and therefore would specify the same line as 2.9.31. In such cases, one must specify the competing schemes in concurrent markup hierarchies, or else use the milestone tags described in section .

If a text has no canonical reference scheme of its own, and was entered without preserving the pagination of its source edition, a reference scheme, if needed, may be derived from the structure of the electronic text, specifically from the SGML markup of the text. As with any reference scheme intended for long-term use, it is important to see the reference as an established, unchanging point in the text. Should the text be revised or rearranged, the reference-scheme identifiers associated with any bit of text must stay with that bit of text, even if it means the reference numbers fall out of sequence. (A new reference scheme may always be created beside the old one if out-of-sequence numbers must be avoided.)

The global attributes N and ID may be used to assign reference identifiers to segments of the text. Identifiers specified by either attribute apply to the entire element for which they are given. SGML enforces uniqueness on ID attributes within a single document, and ID values must begin with a letter. No such restrictions are made on the values of N attributes.
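An SGML parser reports violations of these ID constraints itself. Where identifiers are generated or collected before parsing, a pre-check along the following lines may be useful. This is a minimal Python sketch, not part of these guidelines. The helper name check_ids, the simplified name pattern, and the case-insensitive comparison are all assumptions that would need adjusting to the SGML declaration actually in use.

```python
import re
from collections import Counter

# Assumed, simplified shape of an SGML name: a letter, then letters, digits,
# periods, or hyphens. The real rules depend on the SGML declaration in use.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]*\Z")


def check_ids(ids):
    """Return (duplicates, malformed) for a list of candidate ID values."""
    # Assumption: names are compared without regard to case, so fold first.
    folded = Counter(value.upper() for value in ids)
    duplicates = sorted(v for v, count in folded.items() if count > 1)
    # Values that do not begin with a letter, or contain other characters.
    malformed = sorted({v for v in ids if not NAME_RE.match(v)})
    return duplicates, malformed
```

Values of N need no such check, since they are unrestricted.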

A convenient method of mechanically generating unique values for ID or N attributes, based on the SGML structure of the document, is to use the type path or untyped path method to identify elements within the text segment of a TEI document. The text segment is recommended rather than the TEI.doc as a whole or the body of the text only. No values need usually be generated for the TEI.header section of the document if the reference scheme is intended primarily for the text; values should not usually be restricted to the text body, because front and back matter must also be referred to. This method is convenient, but it is in no way required of anyone creating a reference scheme.

If the ID attribute is used to record the reference identifiers generated, each value should record the entire path. If the N attribute is used, each value may record either the entire path or only the subpath from the SGML parent element.
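The difference between recording the entire path and recording only the subpath can be illustrated with a short Python sketch. The path notation used here is an assumption made for illustration only: each step is the element name followed by its ordinal among same-named siblings, and steps are joined with periods. It is not the notation defined by the type path method. The function typed_paths and the tuple representation of the element tree are likewise hypothetical.

```python
from collections import Counter


def typed_paths(name, children, parent_path="", ordinal=1):
    """Yield (full_path, subpath) for an element and all its descendants.

    Elements are given as (name, [child, ...]) tuples. The full path would
    suit an ID value; the subpath alone would suffice for an N value.
    """
    subpath = f"{name}{ordinal}"
    full_path = f"{parent_path}.{subpath}" if parent_path else subpath
    yield full_path, subpath
    counts = Counter()  # ordinal of each child among siblings of the same name
    for child_name, grandchildren in children:
        counts[child_name] += 1
        yield from typed_paths(child_name, grandchildren, full_path,
                               counts[child_name])


# Illustrative tree only: a text with front matter and a two-division body.
tree = ("text", [("front", []),
                 ("body", [("div", []),
                           ("div", [("p", []), ("p", [])])])])
paths = list(typed_paths(*tree))
```

Because every full path begins with an element name, values generated in this way begin with a letter, as ID values must.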

Milestone Tags

When concurrent markup is not used, checkpoints for any traditional reference scheme may be incorporated into a document using empty tags which can appear at any point and which mark the boundaries between sections in the traditional reference scheme. Page and line boundaries, for example, can be marked using the page.break and line.break tags described in section . For other reference schemes, a single tag is here defined, called milestone. Using these tags, the reference scheme of any one edition can be recreated from a text in which all are marked by simply ignoring all tags that do not describe that edition. The milestone elements have no content, and subdivide the text into regions just as milestones divide a road into segments.

A milestone tag indicates the beginning of some segment marked in a traditional reference system. The specific system, the type of segment marked, and the identifier of the segment are specified using the attributes ed (for edition), unit, and N (for name or number). Each of these attributes can take any character string as its value. N is optional, since an application can keep a count from the start of the document if desired; the others are required.
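The following Python sketch, which is not part of these guidelines, illustrates how an application might recover the segments of one edition by ignoring the milestone tags of all others, and how it might keep its own count where N is omitted. The function name, the regular expressions, and the naive counting rule (one running count per unit) are assumptions.

```python
import re

# Hypothetical patterns: a milestone start tag, and name=value attribute pairs
# whose values may be double-quoted, single-quoted, or unquoted.
TAG_RE = re.compile(r"<milestone\b([^>]*)>", re.IGNORECASE)
ATTR_RE = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")


def segments_for_edition(text, edition):
    """Return (unit, n, text) triples for one edition.

    Milestones of other editions are skipped. Text that precedes the first
    milestone of the chosen edition is not reported.
    """
    segments = []
    counters = {}   # running count per unit, used when n is omitted
    current = None  # (unit, n) of the most recent milestone for this edition
    buffer = []
    pos = 0
    for match in TAG_RE.finditer(text):
        buffer.append(text[pos:match.start()])
        pos = match.end()
        attrs = {name.lower(): dq or sq or bare
                 for name, dq, sq, bare in ATTR_RE.findall(match.group(1))}
        if attrs.get("ed") != edition:
            continue  # another edition's tag: drop it, keep collecting text
        if current is not None:
            segments.append((*current, "".join(buffer).strip()))
        buffer = []
        unit = attrs.get("unit")
        counters[unit] = counters.get(unit, 0) + 1
        current = (unit, attrs.get("n", str(counters[unit])))
    buffer.append(text[pos:])
    if current is not None:
        segments.append((*current, "".join(buffer).strip()))
    return segments
```

A fuller implementation would also track the nesting of units (page within volume, scene within act); this sketch treats every milestone of the chosen edition as a simple boundary.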

For unit the following values are suggested as appropriate:

page       for page breaks
column     for column breaks
line       for physical line of the page (in page / column / line reference systems) or for verse line (in reference systems for verse)
book       for any unit termed book, liber, etc.
poem       for an individual poem
canto      for a canto or major section of a poem
stanza     for a stanza within a poem, book, or canto
act        for an act within a play
scene      for a scene within a play or act
section    for a section of any kind
absent     if it is desired to specify that a given piece of text is not present in the edition in question (such specification is wholly optional)

Other terms may of course be used as desired (e.g. Stephanus to indicate Stephanus numbers in Plato). The encoding.declarations section of the TEI file header should contain an explanation of the reference system(s) used and bibliographic references to their sources, if appropriate, under the rubric reference.system. (See section for a full discussion of the encoding declarations area.)

The value of the N attribute may but need not include the identifiers used for any larger sections. That is, either of the following styles is legitimate:

<![CDATA[
[front matter text ...]
<milestone ed=Riverside unit=act n=1>
<milestone ed=Riverside unit=scene n=1>
[text of Act 1, Scene 1 ... Traditional reference is "1.1"]
<milestone ed=Riverside unit=scene n=2>
[text of Act 1, Scene 2 ... Traditional reference is "1.2"]
(etc. ...)
]]>

or

<![CDATA[
[front matter text ...]
<milestone ed=Riverside unit=act n=1>
<milestone ed=Riverside unit=scene n=1.1>
[text of Act 1, Scene 1 ... Traditional reference is "1.1"]
<milestone ed=Riverside unit=scene n=1.2>
[text of Act 1, Scene 2 ... Traditional reference is "1.2"]
(etc. ...)
]]>

When counting lines on a page for reference purposes, headers, footers and headings are usually ignored. When using milestone tags, line numbers may be supplied for every line or only periodically (every fifth, every tenth line). The latter may be simpler; the former is more reliable. Note that SGML short references may be used to simplify the marking of page and line breaks during data capture. Such short references must be resolved to the fully specified SGML form before TEI-conformant interchange.
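Where line numbers are supplied only periodically, an application must count on from the most recent explicit number. A minimal Python sketch of that counting, assuming the line milestones of a single page have already been reduced to a list in which an omitted number is represented by None (the function name is hypothetical):

```python
def fill_line_numbers(numbers):
    """Fill gaps (None) by counting on from the most recent explicit number.

    `numbers` holds one entry per line milestone on a single page, in
    document order. Counting starts at 1 if the first lines are unnumbered.
    """
    filled = []
    current = 0
    for n in numbers:
        current = n if n is not None else current + 1
        filled.append(current)
    return filled
```

If a line break goes untagged between two explicit numbers, the count silently drifts until the next explicit number corrects it, which is why numbering every line is the more reliable practice.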

The style of numbering used in the values of N is unrestricted: for the example above, I.i and I.ii could have been used equally well if preferred. The special value unnumbered should be reserved for marking sections of text which fall outside the normal numbering scheme (e.g. chapter heads, poem numbers, titles, or speaker attributions in a verse drama).

No hierarchical ordering is or can be defined for the various types of milestone tag; it is the encoder's responsibility to ensure that milestone values are valid.

Because the ed attribute is unrestricted, no change need be made to the document type declaration of a file before adding tags to describe a new reference scheme. (The value of ed may be restricted to a defined set of edition symbols by using the techniques described in chapter .)

The SGML declarations for the milestone tag and its attributes are as follows:

<![ CDATA [
<!-- TEI milestone tags for references to page, column, and  -->
<!-- line of editions.                                       -->
<!--                                                         -->
<!-- These tags mark the point at which a given page, line,  -->
<!-- or column of a given edition or ms begins.  They can    -->
<!-- appear anywhere in the text.  A PAGE tag should appear  -->
<!-- before any COL or LINE tags are used, but this cannot   -->
<!-- be enforced by the SGML parser.  For better validation  -->
<!-- use the concurrent PLREF markup stream.                 -->
<!--                                                         -->
<!ELEMENT milestone - O #EMPTY >
<!-- Attributes:                                             -->
<!-- Use ED attribute to specify which edition or ms the     -->
<!--     reference is for.                                   -->
<!-- Use N attribute to specify the page, line, or column    -->
<!--     number or symbol.                                   -->
<!-- Use UNIT attribute to specify what type of unit is      -->
<!--     being specified:  e.g. page, column, line, book,    -->
<!--     poem, canto, stanza, act, scene, section, etc.      -->
<!--     The special value UNIT=ABSENT indicates a section   -->
<!--     not present in the edition named.                   -->
<!--                                                         -->
<!ATTLIST milestone ed    CDATA  #REQUIRED
                    n     CDATA  #IMPLIED
                    unit  CDATA  #REQUIRED >
]]>