Critical Apparatus and Parallel Texts

Encoding Textual Variants and Textual Variation

Texts which have been copied by hand or printed in several editions contain variations of many kinds. Scholars will usually wish to record or refer to these variants in encoding such texts. Methods for encoding textual variation may be simple or highly complex: the method of choice should be dictated by the interests and goals of textual scholarship and by the nature of the textual materials. In one case, the goal may be to encode the critical apparatus of a standard edition as a means of producing a slightly improved printed critical edition from the electronic format; in another case, the goal may be to present a text publication of a new fragment of a well-known literary text, showing only the most important variants; in another case, the goal may be to create an electronic text-critical database from a fresh collation of hundreds of manuscripts solely for the purpose of database queries. Because the goals of textual inquiry change over time, it is desirable that general encoding solutions be used to represent the underlying textual variation, whatever the immediate goal of the investigator.

This section presents several methods for encoding information about textual variation in a text. Much work remains to be done in this area: the relative strengths and weaknesses of these methods must be established in practice with various applications, and other approaches remain to be explored. The specific tags and encoding methods described here should be understood as work in progress, intended to provide a basis for public discussion of these problems, and not in any sense as exhaustive or optimal solutions. No recommendations are made at this time as to the particular method to be preferred and none should be inferred from the order of presentation. Researchers with an interest in this area are encouraged to contact the Text Encoding Initiative with comments and with information on systems now in use for encoding variants, their particular strengths and shortcomings, and the requirements of the researcher.

This section touches exclusively on the problem of recording textual variations. Other problems of critical editions, for example the marking of cruxes and editorial interventions of various kinds, are not touched upon; they remain to be addressed in the further development of these Guidelines. It is hoped that a common notation can be found which will be adequate to the varying needs of textual scholars working on widely varying periods, languages, and cultures.

Two Approaches: In-Line Encoding and External Representation

Two general approaches to encoding textual variation may be distinguished: (a) in-line encoding and (b) external representation. In the in-line approach, information about textual variation is encoded within the running text; in the external approach, SGML cross-referencing is used to link the running text with text-critical information held outside of the running text. The choice between these two methods is a matter of individual preference and convenience. The in-line encoding method offers the advantage of allowing the reader to see all the related information at a single locus without special software, at the risk of obscuring the base text with the apparatus. The external encoding method allows convenient separation of text and apparatus, at the cost of slightly more elaborate requirements for linking the two. With proper windowing and/or hypertext software, the differences between the two methods become less significant.

The representation of individual variants need not differ radically between the in-line and external approaches; what does vary is the method used to align individual variant readings with the text used as a base text, and with each other. In the sections which follow, three methods of encoding textual variants are described which differ primarily in the way they align the variants with each other:

  1. parallel segmentation
  2. single end-point attachment
  3. double end-point attachment

The first method assumes in-line encoding of variants; the other two can be implemented with either the in-line or external-storage approach. The different methods are motivated in part by different models of how processors are likely to be designed, or how applications could optimally process the data in batch or real-time operations. The examples provide only rudimentary information in a skeletal format, to help keep the syntax clear. Provisions for other information to be attached to the basic tags are discussed in a separate section ().

If textual variants are encoded using one of these methods, the encoding declarations section of the file header should contain a variant.encoding declaration with a method attribute specifying which encoding method is used. Legal values are parallel.segments, single.attachment, and double.attachment. A style attribute with the possible values inline and external should be used to specify whether encoding is done in-line or externally. Both the method and style attributes are required. As described below, the single end-point attachment method and the in-line style of the double end-point attachment method may also specify whether the variants are attached at the beginning or end of the base-text reading by using the position attribute, whose value can be either beginning or end. The latter is the default. Sample variant-encoding style declarations are thus:
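Declarations of this form can also be generated and checked mechanically. The following Python sketch is illustrative only: the function name variant_encoding_decl is hypothetical, the unquoted attribute syntax is modelled on the examples later in this section, and the rule that parallel segmentation must be in-line is taken from the discussion of that method below. The sketch accepts a position attribute on either attachment method and does not enforce the finer restriction to the in-line style of double end-point attachment.

```python
METHODS = {"parallel.segments", "single.attachment", "double.attachment"}
STYLES = {"inline", "external"}
POSITIONS = {"beginning", "end"}


def variant_encoding_decl(method, style, position=None):
    """Build a variant.encoding declaration string, or raise ValueError.

    Hypothetical helper: it only mirrors the attribute rules stated in
    the prose and is not part of any TEI software.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    if method == "parallel.segments" and style == "external":
        # Parallel segmentation has no external representation.
        raise ValueError("parallel segmentation requires in-line encoding")
    if position is not None:
        if method == "parallel.segments":
            raise ValueError("position applies only to the attachment methods")
        if position not in POSITIONS:
            raise ValueError(f"unknown position: {position!r}")
    attrs = [f"method={method}", f"style={style}"]
    if position is not None:
        # When omitted, the default (end) applies.
        attrs.append(f"position={position}")
    return "<variant.encoding " + " ".join(attrs) + ">"
```

For instance, variant_encoding_decl("parallel.segments", "inline") yields the string <variant.encoding method=parallel.segments style=inline>, while asking for parallel segmentation with the external style raises an error.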

Parallel Segmentation Method

This method segments all versions of the text in parallel: all versions contain the same number of segments in the same sequence. At the boundary of any segment, all versions of the text are synchronized with each other. Because textual variants branch off from each other only at explicitly marked segment boundaries, an application can extract any single version of the text in a single pass over the text simply by scanning the text in sequence, selecting the correct version within each segment. An apparatus can also be generated simply, because all variants for any given segment of the text are stored together. For simple texts, it also permits a human reader to view the textual evidence within the encoded document amidst minimum markup clutter. It may be an optimal method for texts which contain a small number of witnesses and which involve minimal text-critical complexity (witnesses written in the same language; minimal codependency between variant readings).

Consider the following hypothetical set of variant texts, presented here in a Partitur Umschrift (score transcription) format:

  TEXT A: The quick brown fox jumped over the lazy dog.
  TEXT B: The silver wolf jumped over the lazy dog.
  TEXT C: A quick brown fox jumped over the lazy dog at noon.

The parallel-segmentation method could represent this sentence as follows, using the special entity zero.var when a reading in one version corresponds to nothing at all (an empty or zero variant) in another. (The zero-variant entity allows us to distinguish cases where a version is complete and lacks the reading in question from cases where the version is incomplete or imperfect and we know nothing about its reading.)

  [(AB) The (C) A] [(AC) quick (B) &zero.var;] [(AC) brown (B) silver] [(AC) fox (B) wolf] [(ABC) jumped over the lazy dog] [(C) at noon (AB) &zero.var;] [(ABC) .]

Or, displaying each segment on a different line for clarity:

  [(AB) The (C) A]
  [(AC) quick (B) &zero.var;]
  [(AC) brown (B) silver]
  [(AC) fox (B) wolf]
  [(ABC) jumped over the lazy dog]
  [(C) at noon (AB) &zero.var;]
  [(ABC) .]

Using this method, textual variants can be represented with two tags: one to mark the segments (e.g. var) and one to mark the individual variant readings (e.g. rdg). At the bottom level, each text is a series of var units, which may be contained by paragraphs and all the phrase-level tags described elsewhere in this chapter; each var unit is a sequence of rdg units, which may in turn contain phrase-level tags. For simplicity, we assume that no variation is larger than a paragraph. The document type declaration can of course be adjusted to allow for the opposite assumption.

The simple example given here would be represented thus using the var and rdg tags and the parallel-segmentation encoding method:

<![ CDATA [
<p>
<var> <rdg wit=AB>The <rdg wit=C>A</var>
<var> <rdg wit=AC>quick <rdg wit=B>&zero.var;</var>
<var> <rdg wit=AC>brown <rdg wit=B>silver</var>
<var> <rdg wit=AC>fox <rdg wit=B>wolf</var>
<var><rdg wit=ABC>jumped over the lazy dog</var>
<var> <rdg wit=C>at noon <rdg wit=AB>&zero.var;</var>
<var><rdg wit=ABC>.</var>
</p>
]]>
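The single-pass extraction of one version, and the equally simple generation of an apparatus, can be sketched in Python. This is a minimal illustration under stated assumptions, not a processor for the SGML encoding itself: the segments are held in an ad hoc list of dictionaries, each siglum is assumed to be a single character, the empty string stands in for &zero.var;, and the names SEGMENTS, extract_witness and apparatus are hypothetical.

```python
ZERO_VAR = ""  # stand-in for the &zero.var; entity (an assumption of this sketch)

# One dictionary per segment: witness group -> reading shared by that group.
SEGMENTS = [
    {"AB": "The", "C": "A"},
    {"AC": "quick", "B": ZERO_VAR},
    {"AC": "brown", "B": "silver"},
    {"AC": "fox", "B": "wolf"},
    {"ABC": "jumped over the lazy dog"},
    {"C": "at noon", "AB": ZERO_VAR},
    {"ABC": "."},
]


def extract_witness(segments, siglum):
    """Scan the segments once, keeping the reading of one witness in each."""
    parts = []
    for segment in segments:
        for group, reading in segment.items():
            if siglum in group:  # single-character sigla assumed
                if reading != ZERO_VAR:
                    parts.append(reading)
                break
        else:
            raise ValueError(f"witness {siglum!r} missing from a segment")
    # Join with spaces, then close up the space before the final period.
    return " ".join(parts).replace(" .", ".")


def apparatus(segments):
    """Return only the segments in which the witnesses disagree."""
    return [segment for segment in segments if len(segment) > 1]
```

Under these assumptions, extract_witness(SEGMENTS, "B") returns "The silver wolf jumped over the lazy dog.", and apparatus(SEGMENTS) returns the five segments that carry more than one reading.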

A more compact display can be used if

  • it is assumed that all texts are full, so that zero-variants need not be explicitly marked, and
  • the var and rdg tags are omitted around passages attested unanimously in all versions.

Adjacent variants can be merged if they show the same pattern of witnesses, so as to give them a single entry in the apparatus. With these assumptions and changes, the passage looks like this:

<![ CDATA[
<p>
<var> <rdg wit=AB>The</rdg> <rdg wit=C>A</rdg> </var>
<var> <rdg wit=AC>quick brown fox</rdg> <rdg wit=B>silver wolf</rdg> </var>
jumped over the lazy dog
<var> <rdg wit=C>at noon</rdg> </var>.
</p>
]]>

Note that in this example, no reading is designated as a preferred reading. If a judgment is to be rendered concerning a text's typological anteriority, higher antiquity, greater authenticity, etc., this may be done by using the status attribute on the rdg tag, with the value lemma or preferred:

<![ CDATA[
<var> <rdg wit=AB status=preferred>The</rdg> <rdg wit=C>A</rdg> </var>
]]>
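The merging of adjacent variants that show the same pattern of witnesses can be sketched as follows. This Python fragment is only an illustration: it assumes the segments are held as dictionaries keyed by sets of single-character sigla, uses the empty string for a zero variant, and the names SEGMENTS and merge_adjacent are hypothetical.

```python
W = frozenset  # W("AC") is the witness group {A, C}; single-character sigla assumed

SEGMENTS = [
    {W("AB"): "The", W("C"): "A"},
    {W("AC"): "quick", W("B"): ""},  # "" stands in for a zero variant
    {W("AC"): "brown", W("B"): "silver"},
    {W("AC"): "fox", W("B"): "wolf"},
    {W("ABC"): "jumped over the lazy dog"},
    {W("C"): "at noon", W("AB"): ""},
    {W("ABC"): "."},
]


def merge_adjacent(segments):
    """Merge neighbouring segments whose witnesses group in the same way."""
    merged = []
    for segment in segments:
        if merged and merged[-1].keys() == segment.keys():
            previous = merged[-1]
            merged[-1] = {
                group: " ".join(
                    part for part in (previous[group], segment[group]) if part
                )
                for group in previous
            }
        else:
            merged.append(dict(segment))
    return merged
```

With the sample data, the three segments that set A and C against B collapse into a single entry opposing "quick brown fox" to "silver wolf", leaving five segments in all.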

Because the parallel-segmentation method makes no structural distinction among witnesses and has no notion of a base text to which a separate apparatus could be keyed, it requires the in-line encoding of variants. There is no external-representation method corresponding to in-line encoding with parallel segmentation.

Single End-Point Attachment Method

As the number of witnesses and the complexity of the textual variations increase, the parallel-segmentation method places greater demands upon the encoder. Since all versions must have the same segmentation, the addition of new witnesses may require the re-segmentation of all the old witnesses. When many witnesses are involved, perhaps in several languages, and in several genetic strata, it is difficult to segment the text properly or optimally in advance, and equally difficult to change the segmentation with the addition of each new witness. In some cases, the segmentation required to exhibit properly the relation of one pair of variants to each other conflicts with that required for some other pair. Parallel segmentation proves particularly difficult and particularly prone to obscure the relationships among readings when texts contain substantive conflations, long-range transpositions, large quantities of data, or widely different recensions.

In these cases, decisions about text segmentation and multiple overlapping variations can be deferred by using an incompletely segmented or divergent-segmentation method. In these methods, readings of the witnesses are registered as variants of some base text and recorded in an apparatus. The base text is marked with the location at which each variant group is attached, and the other versions are each segmented notionally into portions which agree with the base text and portions which disagree, but each version is segmented separately, not in parallel with all other versions.

Using the incomplete segmentation methods, one can encode textual variation using the following tags:

  • app (apparatus entry): this tag groups variants which are opposed to the same portion of the base text. When variants are encoded in-line, it appears immediately after (or before) the corresponding base-text reading. An app element contains exactly one lemma and an arbitrary number of readings. When several witnesses agree substantially in a reading but vary in detail, app elements may appear nested within the readings or lemma of another app element.
  • lem (lemma): this element repeats the contents of the base text which correspond to the other readings. Like the reading tag, it carries a wit attribute for recording the witnesses which agree in the lemma. Its contents may include other variations, marked by app tags.
  • rdg (reading): this element contains exactly one alternate reading opposed to the lemma of the base text. It carries a wit attribute whose value is the sigla of the witnesses with this reading. A reading may include minor variations, which are marked by nested app tags.

The name app used here for apparatus entry reflects both a convenience and a reminiscence of the historic convention for the layout of critical texts on the printed page, but the textual object may be conceived in a neutral manner as simply an area of text-critical interest. Similarly, the term base text denotes (as here used) only the text in terms of which the apparatus is formulated. The text chosen as base text may indeed represent the editor's or encoder's preferred text, or a historically revered standard text, or it may be a particularly full text chosen merely as a convenience. There is no requirement that the base text correspond throughout to any single witness or edition, though that appears to be the most convenient approach.

The variants are connected to the base text either by being inserted into it at the end of the corresponding reading in the base text (using the in-line approach), or by a pointer from a separate location (using the external-representation approach). In the latter case, the pointer is implemented either with an SGML ID reference or with a pointer of the type described in section . For the reasons given there, the use of IDREFs (as provided for in the document type declarations in the appendix) is recommended over the other, less reliable methods.

In the single end-point attachment method the variants are attached to the end of the corresponding reading in the base text; the beginning of the reading must be found by applications software by comparing the base text with the content of a lemma tag. Alternatively, variants can be attached at the beginning of the base-text reading. If this approach is chosen, the variant.encoding declaration in the encoding declarations area must specify the attribute position with the value beginning. The default value is variant.encoding ... position=end.
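The comparison an application must make to recover the unmarked end of a reading can be sketched in Python. This is a minimal illustration, assuming that the base text is a plain string, that the attachment point is known as a character offset, and that the lemma is spelled out in full and matches the base text character for character; the function name locate_reading is hypothetical.

```python
def locate_reading(base_text, attach_at, lemma, position="end"):
    """Return (start, end) offsets of the base-text reading matched by lemma.

    attach_at is the offset at which the apparatus entry is attached;
    position says whether that is the end or the beginning of the reading.
    """
    if position == "end":
        start, end = attach_at - len(lemma), attach_at
    elif position == "beginning":
        start, end = attach_at, attach_at + len(lemma)
    else:
        raise ValueError(f"unknown position: {position!r}")
    # The lemma must reproduce the base text exactly at this point.
    if start < 0 or base_text[start:end] != lemma:
        raise ValueError("lemma does not match the base text at this point")
    return start, end


BASE = "The quick brown fox jumped over the lazy dog."
```

For example, locate_reading(BASE, 19, "The quick brown fox") returns (0, 19). A real application would also have to cope with nested app elements and abbreviated lemmata, which this sketch ignores.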

The following example illustrates the single end-point attachment method using almost the same example as in the preceding section. The A text is chosen arbitrarily as the base text, and the other readings are recorded in rdg elements within app elements.

<![ CDATA[
TEXT A: The quick brown fox jumped over the lazy dog.
TEXT B: A silver wolf jumped over the lazy dog.
TEXT C: The sleek brown fox jumped over the lazy dog at noon.

<p>
The quick brown fox
<app>
  <lem wit=AC> The <app> <lem wit=A>quick <rdg wit=C>sleek </app> b.f.</lem>
  <rdg wit=B>A silver wolf</rdg>
</app>
jumped over the lazy dog
<app><lem wit=AB>j.o.t.l.d.<rdg wit=C>&plus at noon</app>
.
</p>
]]>

Note that the overall opposition of The quick/sleek brown fox and A silver wolf is recorded in one apparatus entry, with the opposition of quick and sleek nested within it. Such nested app elements provide a convenient method of grouping manuscripts and their readings within single complex variations, but they are of course not required; this particular variation could be analyzed in other ways, and the notation can in each case express the encoder's chosen analysis.

The complete base text occurs outside the app tags. To make explicit which parts of the base text are replaced by the variant, however, the base text (or lemma) is repeated in the lem of each app element. Certain text-critical shorthand conventions may be used in the encoding, placing a corresponding burden on the application to interpret them. These conventions are to be used only if semantically and syntactically non-ambiguous in context and thus fully machine parsable. For example, the use of j.o.t.l.d. as an abbreviation for the lemma given above would have to pass qualifying tests such as: (a) the abbreviation sequence is unique in the region of text to be scanned for the lemma; (b) use of period does not collide with literal period, which would have to be quoted; (c) the abbreviation is built on discrete words and so presupposes fixed word boundaries, which means that word division must not itself be part of the text-critical issue, since the notation would obscure it. No shorthand conventions should be used unless their resolution by a machine is completely predictable. No specific conventions are recommended at this time; the definition of a reliable and usable set of such conventions is a topic for further work.
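To make test (a) concrete, the following Python sketch resolves an abbreviation of the j.o.t.l.d. type against the base text and refuses to do so when the match is not unique. It assumes one purely hypothetical convention, which is not a recommendation: each word of the lemma is reduced to its first letter followed by a period, compared case-sensitively. The function name resolve_abbreviation is likewise hypothetical, and the base text is taken to be already split into words.

```python
def resolve_abbreviation(abbrev, words, end):
    """Expand an initial-letter abbreviation that ends just before index end.

    words is the base text split into words; end is the index one past
    the last word of the reading to which the apparatus entry is attached.
    """
    initials = [part for part in abbrev.split(".") if part]
    if not initials or any(len(part) != 1 for part in initials):
        raise ValueError("abbreviation must be single initials with periods")
    start = end - len(initials)
    if start < 0 or [word[0] for word in words[start:end]] != initials:
        raise ValueError("abbreviation does not match the base text here")
    # Test (a): the same run of initials must not occur anywhere else
    # in the region being scanned.
    span = len(initials)
    matches = [
        i
        for i in range(len(words) - span + 1)
        if [word[0] for word in words[i:i + span]] == initials
    ]
    if matches != [start]:
        raise ValueError("abbreviation is ambiguous in this region")
    return words[start:end]


WORDS = "The quick brown fox jumped over the lazy dog".split()
```

Here resolve_abbreviation("j.o.t.l.d.", WORDS, 9) returns the five words from jumped to dog, while an abbreviation whose initials recur elsewhere in the scanned region is rejected rather than guessed at.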

If variants are to be recorded externally to the base text, the points at which the variants are to be attached must be specified by anchor tags, as defined in section . The app tag must then carry an endpoint attribute which names the anchor point at which this apparatus entry is to be attached to the base text: the endpoint of the base-text reading. Assuming that all apparatus entries are gathered together after the body of the main text itself, the single end-point attachment method of recording our simple example would look something like this:

<![ CDATA [
<teidoc>
<file.header>
  [documentation of the encoding and its source ...]
  <encoding.declarations>
    <variant.encoding method=single.attachment position=end style=external>
    [Other encoding declarations ...]
  </encoding.declarations>
</file.header>
<body>
<p>
The quick <anchor id=a2> brown fox <anchor id=a1> jumped over the lazy dog<anchor id=a3>.
</p>
</body>
<!-- Textual apparatus here -->
<app endpoint=a1>
  <lem wit=AC> The quick <app endpoint=a2> <lem wit=A>quick <rdg wit=C>sleek </app> b.f.</lem>
  <rdg wit=B>A silver wolf</rdg>
</app>
<app endpoint=a3>
  <lem wit=AB>j.o.t.l.d. <rdg wit=C>&plus at noon
</app>
</teidoc>
]]>

If the apparatus entries are to be attached at the beginning and not the end of the base-text reading, the startpoint attribute should be used instead of the endpoint attribute.

Double End-Point Attachment Method

Because the single end-point attachment method explicitly marks only the end (or beginning) of each segment in the base text to which a set of variants is opposed, an application cannot easily tell, for any given word or passage in the base text, whether any variation is open at that point. To find out, the application must scan and analyze each lem element in each app element from the point in question to the end of the document. (If variations are attached at the beginning of the base-text reading, the application must scan from the beginning of the document to the point in question.) Indeed, variations may cross the boundaries of the highest levels of document hierarchy, where one might not normally expect variation to occur (e.g., chapter-level transpositions).

It is easier to find the full range of textual variation on a given portion of the base text if the beginning and ending of each variation are explicitly marked in the base text (as they are in the parallel segmentation method). A processor can then, with a single pass over the text, mark all the portions of the base text which have opposed variants, without scanning each lem element and matching it against the base text.

The double end-point attachment method of encoding textual variants uses fundamentally the same mechanisms as the single end-point attachment method described in the preceding section, but marks both end points, not just a single end-point, of each variation in the text. Each apparatus entry can then point to the starting point of the lemma. As with the single end-point attachment method, the double end-point attachment method can be used with variants attached at the beginning of the base-text reading, if a corresponding variant.encoding declaration is made. In this case, the in-line app tags point at the end-point of the lemma.

Encoded using the double end-point attachment method and in-line variants, the sample text looks like this:

<![ CDATA [
<p>
<anchor id=a1> The quick brown fox
<app startpoint=a1>
  <lem wit=AC> The <anchor id=a2> quick <app startpoint=a2> <lem wit=A>quick <rdg wit=C>sleek </app> b.f.</lem>
  <rdg wit=B>A silver wolf</rdg>
</app>
jumped over the lazy <anchor id=a3> dog
<app startpoint=a3><lem wit=AB>d.<rdg wit=C>&plus at noon</app>
.
</p>
]]>

Encoded using the double end-point attachment method and external representation of variants, the sample text looks like this:

<![ CDATA [
<teidoc>
<file.header>
  [documentation of the encoding and its source ...]
  <encoding.declarations>
    <variant.encoding method=double.attachment position=end style=external>
    [Other encoding declarations ...]
  </encoding.declarations>
</file.header>
<body>
<p>
<anchor id=a4> The <anchor id=a5> quick <anchor id=a2> brown fox <anchor id=a1> jumped over the lazy <anchor id=a6> dog<anchor id=a3>.
</p>
</body>
<!-- Textual apparatus here -->
<app startpoint=a4 endpoint=a1>
  <lem wit=AC> The quick <app startpoint=a5 endpoint=a2> <lem wit=A>quick <rdg wit=C>sleek </app> b.f.</lem>
  <rdg wit=B>A silver wolf</rdg>
</app>
<app startpoint=a6 endpoint=a3>
  <lem wit=AB>d. <rdg wit=C>&plus at noon
</app>
</teidoc>
]]>

Since the lemma can now be read in the base text, the repetition of the lemma in the app entry is strictly speaking redundant. The lem tag, however, is still required, in order to carry the wit attribute. More work is needed to ensure the simplicity and clarity of this notation.
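The single pass that explicit start and end points make possible can be sketched in Python. The fragment below is an illustration under stated assumptions rather than a processor for the SGML encoding: the base text of the external example is held as a flat list of words and anchors, each apparatus entry is reduced to its (startpoint, endpoint) pair, and the names Anchor, BASE, APPS and open_variations are hypothetical.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    """An anchor point in the base text, identified by its id."""
    id: str


# The base text of the external example, as words and anchors in sequence.
BASE = [
    Anchor("a4"), "The", Anchor("a5"), "quick", Anchor("a2"), "brown", "fox",
    Anchor("a1"), "jumped", "over", "the", "lazy", Anchor("a6"), "dog",
    Anchor("a3"), ".",
]

# Each apparatus entry reduced to its (startpoint, endpoint) pair.
APPS = [("a4", "a1"), ("a5", "a2"), ("a6", "a3")]


def open_variations(base, apps):
    """Pair every word of the base text with the variations open at it."""
    starts, ends = {}, {}
    for app in apps:
        starts.setdefault(app[0], []).append(app)
        ends.setdefault(app[1], []).append(app)
    open_now = set()
    result = []
    for item in base:
        if isinstance(item, Anchor):
            # Close variations that end here, then open those that start here.
            open_now.difference_update(ends.get(item.id, []))
            open_now.update(starts.get(item.id, []))
        else:
            result.append((item, sorted(open_now)))
    return result
```

With this data, the word quick falls within two variations, (a4, a1) and (a5, a2), while jumped falls within none; no lem element has to be read or matched against the base text to establish this.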

Fuller Encoding of Text-Critically Relevant Information