Segmentation of Prose and Treatment of Ambiguous Punctuation

In many contexts it is convenient to segment the text systematically into pieces smaller than the paragraph. For this purpose, a special segmentation tag S may be used. Segments of any size or type may be marked using this tag, though it should not be used to tag features for which other tags are provided in these Guidelines. The attribute type may be used to specify what sort of segment is marked; the global ID and N attributes may be used to associate standard reference identifiers with the segment for purposes of citation or retrieval.

This section discusses the use of the S tag to mark orthographic sentences and methods of resolving ambiguities in common punctuation practice.

S-units

Paragraphs contain orthographic sentences, which we may call S-units, to prevent confusion with the linguistic notion sentence, which is commonly taken to apply only to structures containing a subject and a verb. S-units are signalled by initial capitalization and end punctuation (full stop, exclamation mark, question mark). An S-unit can be a full sentence, but also a phrase or even a single word. The main point is its independence of its surroundings, as signalled by typographical conventions. Example:

<![CDATA[ <s>When are you leaving?</s> <s>Tomorrow.</s> ]]>

We recommend keeping initial capitalization and end punctuation, though they are strictly redundant once the S-units have been tagged.

S-units occur not only in paragraphs, but also in notes and lists. They may be embedded in larger S-units, particularly within quotations or direct speech in narrative:

<![CDATA[ <s>She said, <s>"Let's go."</s></s> <s><s>"Let's go,"</s> she said.</s> <s>She said, <s>"Let's go,"</s> and left immediately.</s> <s>The Apostle Paul said concerning some that <s>"By good words and fair speeches they deceived the heart of the simple."</s></s> ]]>

Problems arise where S-units include long embedded elements, as with long quotations and lists (which may even contain headings and paragraphs). For pragmatic reasons, it may be preferable to break the S-unit before long quotations/lists, unless it continues after the list/quotation or is clearly integrated with it in some other manner. The break is usually signalled by a colon:

<![CDATA[ <s>This is due to the following factors:</s> ]]>

Except as illustrated above for embedded quotations or lists, do not mark punctuation as S-unit boundaries unless followed by an initial capital:

<![CDATA[ <s><q><s>Are you crazy?</s></q> he shouted.</s> <s>Ah, happy, happy boughs! that cannot shed Your leaves, nor ever bid the spring adieu; ...</s> ]]>
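The boundary rule above can be sketched in Python. This is a minimal illustration, not part of these Guidelines: the function name tag_s_units and the regular expression are assumptions made for the example, and the input is assumed to be plain text containing no markup. It treats end punctuation as a boundary only when followed by white space and an initial capital, and it does not attempt embedded quotations or abbreviations (so a string such as "Mr. Smith" would be split wrongly).

```python
import re

# Heuristic boundary: end punctuation (. ! ?) followed by white space
# and an initial capital. An assumption for illustration only.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def tag_s_units(paragraph: str) -> str:
    """Wrap each candidate S-unit in <s>...</s>, keeping the original
    capitalization and end punctuation inside the tags."""
    units = [u for u in _BOUNDARY.split(paragraph.strip()) if u]
    return " ".join(f"<s>{u}</s>" for u in units)

print(tag_s_units("When are you leaving? Tomorrow."))
print(tag_s_units("Ah, happy, happy boughs! that cannot shed Your leaves."))
```

In the second call the exclamation mark is followed by a lower-case letter, so the whole line remains a single S-unit, as the rule requires.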

S-units are suitable elements to mark from the point of view of information retrieval and also for reference purposes. The reference system in the example from the LOB Corpus in the appendix is based on S-units. The reference identifier may be given as an attribute value; embedded S-units can be numbered sequentially within the superordinate unit. (On reference systems see also sections and .)
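One possible way of assigning such identifiers is sketched below. It is an illustration under stated assumptions rather than a prescribed scheme: the function name number_s_units, the lower-case attribute name n, and the dotted form of the identifiers (1, 1.1, 2) are all choices made for the example. Embedded S-units are numbered sequentially within their superordinate unit, and wrappers such as q are looked through.

```python
import xml.etree.ElementTree as ET

def number_s_units(parent: ET.Element, prefix: str = "") -> None:
    """Set an n attribute on every <s> below parent.

    Top-level units are numbered 1, 2, 3...; embedded units restart
    at 1 within their superordinate unit (1.1, 1.2, ...).
    The dotted identifier format is a hypothetical convention.
    """
    counter = 0

    def walk(node: ET.Element) -> None:
        nonlocal counter
        for child in node:
            if child.tag == "s":
                counter += 1
                ref = f"{prefix}{counter}"
                child.set("n", ref)
                # Embedded S-units get their own sequence under this one.
                number_s_units(child, prefix=ref + ".")
            else:
                # Look through non-<s> wrappers such as <q>.
                walk(child)

    walk(parent)

p = ET.fromstring("<p><s>She said, <q><s>Go.</s></q></s> <s>Tomorrow.</s></p>")
number_s_units(p)
print(ET.tostring(p, encoding="unicode"))
```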

There are numerous problems of demarcation. If S-units are tagged in a text, a discussion of marking conventions should be included within the encoding declarations section of the TEI header, marked by the tag segment.demarcation.

Ambiguous Punctuation

Punctuation marks cause problems for text markup because they may not be available in the character set used and because they are often ambiguous. In the former case entity names should be used to render the punctuation mark (see ). In the latter case, ambiguous punctuation may be treated as described below.

Full stop (period) may mark (orthographic) sentence boundaries, abbreviations, decimal points, or serve as a visual aid in printing numbers. These usages can be distinguished by tagging S-units, abbreviations, and numbers, as described in sections , , and . There are independent reasons for tagging these, whether or not they are marked by full stops.
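The following sketch shows how the different uses of the full stop might be told apart before tagging. It is a rough heuristic offered for illustration only: the function name classify_full_stops, the label strings, and the short abbreviation list are hypothetical, and a real project would need a much fuller list and manual review of the unresolved cases.

```python
import re

# Hypothetical, deliberately short list of abbreviations (without the stop).
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "etc", "cf"}

def classify_full_stops(text: str) -> list[tuple[int, str]]:
    """Return (index, label) for every full stop in text."""
    results = []
    for m in re.finditer(r"\.", text):
        i = m.start()
        before, after = text[:i], text[i + 1:]
        prev = re.search(r"\w+$", before)
        word = prev.group(0) if prev else ""
        if before[-1:].isdigit() and after[:1].isdigit():
            # Decimal point or digit-group separator inside a number.
            label = "number"
        elif word in ABBREVIATIONS:
            label = "abbreviation"
        elif after == "" or re.match(r"\s+[A-Z]", after):
            # End of text, or followed by an initial capital.
            label = "s-unit boundary"
        else:
            label = "unresolved"
        results.append((i, label))
    return results

for index, label in classify_full_stops("Dr. Smith paid 3.50 today. He left."):
    print(index, label)
```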

Question mark and exclamation mark typically mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (! to express surprise or some other strong feeling, ? to query a word or expression or mark a sentence as dubious in linguistic discussion). These uses may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked.

Hyphens at line-end may or may not indicate permanent (hard) hyphens in the word. Where the lineation of the machine-readable text differs from the original, the editor may eliminate soft (line-end) hyphens or replace them by a reference to the entity SHY (soft hyphen). The solution chosen should be reported in the hyphenation tag of the encoding declarations in the TEI header. (See chapter for discussion of the TEI header and encoding declarations.)

Creators of machine-readable texts are advised to avoid soft hyphens, as one cannot tell whether the hyphens are soft or hard in the case of compounds or prefixed words which might or might not be hyphenated in mid-sentence.
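Where an existing text already contains line-end hyphens, the replacement by an entity reference might be sketched as follows. The function name mark_line_end_hyphens is hypothetical, and the spelling &shy; for the soft-hyphen entity is an assumption of this example. Note that the sketch treats every line-end hyphen as soft, which is precisely the ambiguity described above; compounds and prefixed words would still have to be checked by the editor.

```python
import re

def mark_line_end_hyphens(text: str) -> str:
    """Rejoin words hyphenated at a line end, replacing the hyphen
    and the line break with the entity reference &shy;.

    Hyphens in mid-line are left untouched.
    """
    return re.sub(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)", "&shy;", text)

print(mark_line_end_hyphens("a machine-read-\nable text"))
```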

Dashes are best distinguished in form by using the entity names provided in the public entity sets of ISO 8879: mdash, ndash, and hyphen. Dashes are used for a variety of purposes: insertion, interruption, new speaker (in dialogue), list item. In the last two cases it is preferable to mark the underlying feature using the tags q and item, described in section .

Quotation marks are best replaced by tags indicating begin-quote and end-quote, especially as quotations are not always marked by quotation marks (notably long quotations) or may be marked in a variety of ways; see the discussion of quotation and related features in section .
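As a minimal sketch of this replacement, paired double quotation marks can be converted to begin-quote and end-quote tags. The function name tag_quotations is hypothetical, and the use of q as the tag is an assumption based on the tag named in the discussion of dashes. Only double quotation marks are handled, so that apostrophes are not disturbed; the sketch cannot tell a quotation from other uses of quotation marks, and such cases need separate treatment.

```python
import re

def tag_quotations(text: str) -> str:
    """Replace each pair of double quotation marks with <q>...</q>.

    The quotation marks themselves are dropped; single quotes and
    apostrophes are left as they are.
    """
    return re.sub(r'"([^"]*)"', r"<q>\1</q>", text)

print(tag_quotations('She said, "Let\'s go," and left immediately.'))
```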

The main problem arises where quotation marks are used for other purposes, e.g. to indicate the title of an article, to gloss the meaning of a word, or to indicate that a word is a technical term or is used in a special sense (as in: she hated "good" books). These uses can be dealt with by different forms of tagging, as shown above (see sections and ).

Apostrophes must be distinguished from single quote marks. This is best done by tagging quotations or other uses of quotation marks (see above). However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Disambiguation of these uses belongs to the level of linguistic analysis and interpretation.

Where punctuation marks are disambiguated by tagging the underlying feature they signal, it may be debated whether they should be excluded or left (redundantly) as part of the text. The solution chosen may vary with the feature and with the purpose of the project. It is natural to keep the end punctuation in the case of S-units, as there is a significant contrast between the different punctuation marks. With quotation marks, on the other hand, it is probably most natural to leave them out (unless it is essential to keep a record of exactly how these features were marked in the original text). In either case, the global rendition attribute may be used to record the original form of punctuation.