Links and cross references

Terminology

This section concerns the encoding of general-purpose links between one text and another. However, because the term link is used in the SGML standard in a very specific technical sense, which bears no relation to the sense in which the word is commonly used in hypertext technology, we use the neutral term cross-reference throughout this section for any form of link which connects non-adjacent portions of a text. Note also that special tags are proposed for some specific types of cross-reference, notably bibliographic citations, which are discussed in section , and footnotes, which are discussed in section .

Cross-references must be able to refer to distant documents and document portions, and so those document portions must have names, to which the cross-references refer. Once a naming method is established (see <hdref refid=z6a3>), supporting linkage is not difficult, although providing optimal data structures and interfaces is a matter for much continued research.

In some hypertext systems, texts are subdivided in a very simple way into cards or frames. In such systems, it is a trivial matter to identify a document element, e.g. by card number. More usually, however, linguistic and structural elements of different sizes are nested, from book to character. Cross-references in such cases must be able to point not only to document elements per se, but to arbitrarily chosen, perhaps even discontiguous, document portions.

Information constituting a cross-reference

A cross-reference is a way of specifying a location in a document. A cross-reference recorded at its origin location in a document must somehow specify the location of its destination. If, on the other hand, a cross-reference is recorded somewhere other than its origin or destination, as in a database of endpoint pairs, it must specify both. We do not consider this method of implementing cross-references further here. In either case, the central problem is to find a uniform way of identifying target locations in a text. This is discussed in section below.

Cross-references may be categorised in many different ways and may be distinguished by creator, date, and other properties. Users may wish to view or suppress cross-references based on any of these properties. A processor may wish to act upon different cross-references in different ways, either generically or individually.

We propose the tag xref as a simple mechanism for encoding cross-references, which should provide for as many of these options as possible. The tag marks an empty element which should be thought of as the origin of a cross-reference or, in the hypertextual sense, a start link. Like every other element in the TEI scheme, it may be identified by the value of an `id' or an `n' attribute. If a processor is to be able to return easily to the cross-reference origin, obviously at least one of `id' and `n' must be available to it. Since cross-references may originate from any point in a document, the xref may appear anywhere within a text. It has no content and may take the following attributes:

target
    Specifies the destination of the cross-reference. Its value corresponds with the ID of some other element, which may be an anchor if the target is a single point or the starting point of a span, or another element if the target is an element.

target.end
    Specifies the ending point of a span, when the target is a span of text rather than a single point. Its value (and, in this case, that of the target attribute also) must correspond with the ID attribute of an anchor tag.

x.target
    Specifies the destination for an external cross-reference. Its possible values are discussed in section below.

x.target.end
    Specifies the end of an external cross-reference destination which is a span of text rather than a single point. Its possible values are discussed in section below.

type
    Specifies the type of cross-reference. Suggested values are `authorial', `editorial', `analogue' etc.

author
    Specifies the creator of the cross-reference.

date
    Specifies the date the cross-reference was added to the text.

sys.id
    Supplies the system identifier for an external document within which the target is to be sought. This attribute must be present if the target is not the current document. Its value is implementation dependent.
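The attributes of xref can be modelled as a simple record. The following Python sketch is an illustration only and forms no part of the proposal: the class name `XRef`, its field names and the `problems` helper are all assumptions. Only the `sys.id` rule is taken from the text; the other checks are plausible consistency tests added for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class XRef:
    """Illustrative record holding the attributes of one xref element."""
    id: Optional[str] = None             # `id' of the xref itself
    n: Optional[str] = None              # `n' of the xref itself
    target: Optional[str] = None         # IDREF within the current document
    target_end: Optional[str] = None     # `target.end'
    x_target: Optional[str] = None       # `x.target'
    x_target_end: Optional[str] = None   # `x.target.end'
    type: Optional[str] = None           # e.g. authorial, editorial, analogue
    author: Optional[str] = None
    date: Optional[str] = None
    sys_id: Optional[str] = None         # `sys.id'

    def problems(self) -> List[str]:
        """Return a list of constraint violations (empty if none found)."""
        found = []
        # Assumed check: a cross-reference needs some destination.
        if self.target is None and self.x_target is None:
            found.append("no destination: need target or x.target")
        # Assumed checks: an end point is meaningless without a start point.
        if self.target_end is not None and self.target is None:
            found.append("target.end given without target")
        if self.x_target_end is not None and self.x_target is None:
            found.append("x.target.end given without x.target")
        # From the text: sys.id must be present for an external target.
        if self.x_target is not None and self.sys_id is None:
            found.append("external target given without sys.id")
        return found
```

A processor might run such a check when a document is loaded and report any xref for which `problems()` returns a non-empty list.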

Referring to document locations

The `target' and `target.end' attributes can only be used to point to locations in the current document, because their values are defined as IDREF. Where the destination is an arbitrary point or span rather than an existing element, the empty element anchor may be used to mark each point concerned. These attributes cannot, however, be used to indicate locations or spans in other documents, as SGML provides no support for IDs outside the current document. For this purpose we propose two additional attributes: `x.target' and `x.target.end'.

External documents may not already contain suitable ID values or anchor elements, and it may not be possible to insert them, if for example the target text is on a read-only medium. A method of locating parts of a text without changing it is therefore necessary. A number of such methods can be used. They are, for the most part, fragile, in the sense that if the destination documents change (even slowly or occasionally), the cross-references can fail, either by pointing to a location which is no longer there, or by pointing to a location other than that intended.

We list below a number of such methods, in decreasing order of fragility, together with a recommended name for each. The list is not intended to be exhaustive, but the names proposed should be used if the corresponding method is used.

bytes
    The location is specified in terms of the number of bytes from a known point, for example, the start of an enclosing element, or the start of the document. Entity references are expanded and tags are excluded from the count. This method is highly system-specific and very fragile: we recommend it only in extremis.

tokens
    This is similar to the byte offset method, except that the value supplied indicates a number of tokens, i.e. white-space delimited strings, rather than a number of bytes.

path
    This points to an element in a hierarchy by specifying a domain-style address. This takes the form of a series of numbers, separated by dots, one for each level of the hierarchy. The value of each number gives the position of the node in the sequence of nodes at the same level, read in document sequence and irrespective of type. See figure below.

tpath
    The typed path method improves on the path method by including element names in the path. In this case the numbers specify the position of the target element among elements of the specified type at the specified point in the hierarchy. See figure below.

pattern
    This uses any uniquely occurring string or pattern, expressed as a regular expression, to identify a particular place in the text. It leads to the first position in the text where the string or pattern is matched.

ref
    This relies on the fact that many texts already contain referencing strings in a particular format which provide explicit reference points. (See further section ). It contains a canonical reference string.

id
    This method uses the ID attribute of some element in the external document to identify a location within it.

These methods may of course be combined, in which case the value of the `x.target' attribute will be a sequence of method/value pairs which together give a full path to the desired point in the external document. For example, one could point to the third token following the second paragraph in the hierarchy headed by the element with a particular ID.
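A sequence of method/value pairs can be split mechanically before each pair is resolved in turn. The Python sketch below is a minimal illustration, assuming that pairs are separated by white space and that values therefore contain no spaces (which would not hold for every `pattern' value). The function name `parse_target` and the ID `sec4` are hypothetical.

```python
from typing import List, Tuple

# Method names recommended in the list above.
METHODS = {"bytes", "tokens", "path", "tpath", "pattern", "ref", "id"}


def parse_target(spec: str) -> List[Tuple[str, str]]:
    """Split an attribute value into (method, value) pairs.

    Assumes white-space separated items, so values may not contain spaces.
    """
    parts = spec.split()
    if len(parts) % 2 != 0:
        raise ValueError(f"odd number of items in {spec!r}")
    pairs = list(zip(parts[0::2], parts[1::2]))
    for method, _ in pairs:
        if method not in METHODS:
            raise ValueError(f"unknown method {method!r}")
    return pairs


# Hypothetical value: start at the element with ID sec4, take its second
# child, then count three tokens.
pairs = parse_target("id sec4 path 1.2 tokens 3")
```

Each pair would then be resolved relative to the location reached by the pair before it, so that only the first pair is interpreted against the document as a whole.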

Tree Path Method

                        |------C1
                        |
        |-------B1------|------D1
        |               |
    A---|               |------C2
        |
        |-------B2

In this representation, the element A contains two instances of an element B (B1 and B2). B1 contains a mixture of elements of types C and D (C1, D1 and C2 in that order). The string `1.1.3' points to element C2 by the `path' method. The string `A.1.B.1.C.2' points to it by the `tpath' method.
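The two addressing schemes in the figure can be checked with a short program. The Python sketch below is illustrative only: the `Node` class and the two resolver functions are assumptions, and the first step of each address is taken to select the root of the hierarchy, as the figure suggests.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    type: str                      # element type, e.g. "B"
    label: str                     # label used in the figure, e.g. "B1"
    children: List["Node"] = field(default_factory=list)


def resolve_path(root: Node, path: str) -> Node:
    """Follow a dotted `path' value; the first number selects the root."""
    steps = [int(s) for s in path.split(".")]
    if steps[0] != 1:
        raise LookupError("the root is the only node at the top level")
    node = root
    for step in steps[1:]:
        if not 1 <= step <= len(node.children):
            raise LookupError(f"no child number {step} under {node.label}")
        node = node.children[step - 1]   # nth child, irrespective of type
    return node


def resolve_tpath(root: Node, tpath: str) -> Node:
    """Follow a `tpath' value such as A.1.B.1.C.2."""
    parts = tpath.split(".")
    if len(parts) % 2 != 0:
        raise LookupError("tpath must alternate names and numbers")
    pairs = [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]
    name, number = pairs[0]
    if name != root.type or number != 1:
        raise LookupError("first step must select the root")
    node = root
    for name, number in pairs[1:]:
        same_type = [c for c in node.children if c.type == name]
        if not 1 <= number <= len(same_type):
            raise LookupError(f"no {name} number {number} under {node.label}")
        node = same_type[number - 1]     # nth child of the named type
    return node


# The tree from the figure.
tree = Node("A", "A", [
    Node("B", "B1", [Node("C", "C1"), Node("D", "D1"), Node("C", "C2")]),
    Node("B", "B2"),
])

by_path = resolve_path(tree, "1.1.3")
by_tpath = resolve_tpath(tree, "A.1.B.1.C.2")
```

Both calls arrive at C2. If a further D element were inserted before C1, the `path' value for C2 would become `1.1.4', while the `tpath' value would be unchanged, which illustrates why the typed form is the less fragile of the two.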

As a further example of how these methods can be used, consider the following fragment, which is the start of a document with system identifier `foo':

    <book><chapter>
    <ct>Introduction</ct>
    <sec><st>Hypertext and SGML</st>
    <para>Hypertext and SGML are two new technologies.
    <canon id=l1296><para id=p29>They are synergistic.
    </book>

Assume that a cross-reference is to be made whose destination is the second paragraph of this fragment. The methods just described might specify this as shown below (the extra attributes described above for type, etc. are not included):

Methods which depend on the physical structure of the file are almost guaranteed to fail if the file changes. Also, they fail perniciously: the system cannot in general report an error, but will merely return the wrong destination. For archival data on read-only media, this may be less of a problem, in that updates are less frequent and more controlled. Recreating indexes of cross-references is none the less expensive and tedious. For other data the problem will be a constant one. The simple path method is essentially similar, though somewhat less likely to fail.

Methods depending on the use of either SGML IDs or a known canonical reference scheme have clear advantages as regards stability. The other methods may have attractions as regards precision when used in combination with these, in which circumstances they are also less likely to fail. For example, to target the third word (`synergistic') of the second paragraph, <xref sys.id=foo x.target="id p29 token 3"> will be satisfactory, so long as no change is made in the number of tokens in the paragraph concerned (rather than no change in the entire document, as would be required if the token offset were calculated from the start of the document).
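The combination of an ID with a token count can be resolved in two steps: find the element, then count tokens within it. The Python sketch below is a minimal illustration using the paragraph with ID p29 from the fragment above. The function name and the dictionary standing in for document `foo' are assumptions, and the sketch accepts both the spelling `token' used in the example and the spelling `tokens' used in the list of methods.

```python
from typing import Dict


def resolve_id_token(elements: Dict[str, str], spec: str) -> str:
    """Resolve a value of the form "id <ID> token <n>" to a single token."""
    parts = spec.split()
    if (len(parts) != 4 or parts[0] != "id"
            or parts[2] not in ("token", "tokens")):
        raise ValueError(f"unsupported target {spec!r}")
    element_id, count = parts[1], int(parts[3])
    if element_id not in elements:
        raise LookupError(f"no element with ID {element_id!r}")
    tokens = elements[element_id].split()   # white-space delimited strings
    if not 1 <= count <= len(tokens):
        raise LookupError(f"{element_id!r} has no token number {count}")
    return tokens[count - 1]


# Text content of the elements carrying IDs in document `foo' (assumed form).
foo = {"p29": "They are synergistic."}

word = resolve_id_token(foo, "id p29 token 3")
```

Because tokens are delimited by white space only, the token returned includes the full stop which follows the word. Changes elsewhere in the document do not affect the result, since counting starts afresh inside the identified paragraph.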

Wherever possible, however, the ID mechanism should be preferred. The uniqueness of ID attribute values within a document can be enforced by SGML; IDs use an SGML mechanism designed for the purpose; and both intra- and inter-file cross-references are implemented with the same mechanism. Most importantly, these methods are robust in the face of file changes. IDs very rarely change as a document is edited, and cross-references therefore remain valid across edits. If an element is entirely deleted, then of course its ID is absent and cannot be retrieved. Crucially, however, this is not a pernicious error; the system can in general detect and report it, rather than simply retrieving the wrong destination.
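The difference between the two kinds of failure can be made concrete. The Python sketch below is an illustration under simplifying assumptions: character offsets into a plain string stand in for byte offsets, a dictionary stands in for the set of IDs in a document, and both function names are hypothetical.

```python
def by_offset(text: str, start: int, length: int) -> str:
    """Return the span at a fixed offset; it cannot tell if the text moved."""
    return text[start:start + length]


def by_id(elements: dict, element_id: str) -> str:
    """Return the content of the element with this ID, or raise an error."""
    if element_id not in elements:
        raise LookupError(f"no element with ID {element_id!r}")
    return elements[element_id]


old_text = "Hypertext and SGML are two new technologies. They are synergistic."
new_text = "Hypertext and SGML are new technologies. They are synergistic."

start = old_text.index("synergistic")      # offset recorded before the edit
before = by_offset(old_text, start, 11)    # the intended word
after = by_offset(new_text, start, 11)     # different text, and no error

old_ids = {"p29": "They are synergistic."}
new_ids = {}                               # the paragraph has been deleted

try:
    by_id(new_ids, "p29")
    detected = False
except LookupError:
    detected = True                        # the failure can be reported
```

After the edit, the offset lookup still returns a string, but not the one intended, and nothing signals the mistake. The ID lookup either succeeds or fails visibly.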

The disadvantage of these methods is that most files do not have canonical references or unique SGML IDs encoded on all elements. To refer to any element lacking such, either an identifier must be added, or an alternate method of referring must be used. Given the other advantages, it is desirable to encode element IDs on each element of new SGML files as they are created, and to have software ensure IDs are not re-used after deletion.
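One simple way for software to ensure that IDs are not re-used is to draw them from a counter which is never rolled back. The Python sketch below shows the idea; the class name `IdAllocator`, its methods and the prefix-plus-number form of the IDs are assumptions made for the example, not a recommended format.

```python
class IdAllocator:
    """Hand out element IDs and never reissue one, even after deletion."""

    def __init__(self, prefix: str = "e") -> None:
        self.prefix = prefix
        self.next_number = 1     # only ever increases
        self.live = set()        # IDs currently present in the document

    def new_id(self) -> str:
        """Issue a fresh ID and record it as present."""
        element_id = f"{self.prefix}{self.next_number}"
        self.next_number += 1
        self.live.add(element_id)
        return element_id

    def delete(self, element_id: str) -> None:
        """Record that an element has gone; the counter is not rolled back."""
        self.live.remove(element_id)


ids = IdAllocator("p")
first = ids.new_id()
second = ids.new_id()
ids.delete(second)
third = ids.new_id()     # a new value, not the one just freed
```

In practice the counter would have to be stored with the document, so that it survives from one editing session to the next.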

Span-to-span references and discontinuous targets

So far we have discussed cross-referencing methods which start from a single point in a text and point either to another single point (an anchor), or to a span (expressed as a pair of anchors or as a document element). If the starting point of a cross-reference is also to be a span, then a pair of xref tags might be used, with an additional attribute `type' taking values `start' and `end'. Such pairs would need to be co-indexed (using id/idref) so that an application can match them up when cross-references overlap.

A further problem is introduced by the possibility that the target of a reference may not be a continuous segment, and cannot thus be specified simply by a start target and an end target. One method might be to use the alignment map mechanism discussed in section . Another might be to specify cross-references for each segment of the target independently, and then group these cross-references together into a higher-level xref.vector element.

No recommendation is made concerning these two areas, as much &winita;