In many contexts it is convenient to segment the text systematically
into pieces smaller than the paragraph. For this purpose, a special
segmentation tag
This section discusses the use of the
Paragraphs contain orthographic sentences, which we may call S-units, to
prevent confusion with the linguistic notion sentence, which is
commonly taken to apply only to structures containing a subject and a
verb. S-units are signalled by initial capitalization and end
punctuation (full stop, exclamation mark, question mark). An S-unit can
be a full sentence, but also a phrase or even a single word. The main
point is its independence of its surroundings, as signalled by
typographical conventions. Example:
S-units occur not only in paragraphs, but also in notes and lists. They may be embedded in larger S-units, particularly within quotations or direct speech in narrative:
Problems arise where S-units include long embedded elements, as with long quotations and lists (which may even contain headings and paragraphs). For pragmatic reasons, it may be preferable to break the S-unit before long quotations/lists, unless it continues after the list/quotation or is clearly integrated with it in some other manner. The break is usually signalled by a colon:
Except as illustrated above for embedded quotations or lists, do not mark punctuation as S-unit boundaries unless followed by an initial capital:
Are you crazy? he shouted.
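The boundary rule just stated (end punctuation counts as an S-unit boundary only when an initial capital follows) can be sketched in a few lines of Python. This is a minimal illustration, not a recommended segmenter: the function name split_s_units is hypothetical, only ASCII capitals are recognized, and abbreviations such as "Mr. Smith" or closing quotation marks after the end punctuation would need extra handling.

```python
import re

# Candidate boundary: whitespace that is preceded by end punctuation
# (full stop, exclamation mark, question mark) and followed by a capital.
# Assumption: ASCII capitals only; quotes and abbreviations are ignored.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_s_units(text):
    """Split running text into orthographic S-units (hypothetical helper)."""
    return [unit.strip() for unit in _BOUNDARY.split(text) if unit.strip()]


if __name__ == "__main__":
    sample = "Are you crazy? he shouted. Nobody answered. Silence!"
    for unit in split_s_units(sample):
        print(unit)
```

In this sketch the question mark in the sample is not treated as a boundary, because a lower-case letter follows it, while the two full stops before capitals are.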
S-units are suitable elements to mark from the point of view of
information retrieval and also for reference purposes. The reference
system in the example from the LOB Corpus in the appendix is based on
S-units. The reference identifier may be given as an attribute value;
embedded S-units can be numbered sequentially within the superordinate
unit. (On
reference systems see also sections
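As an illustration of numbering embedded S-units sequentially within the superordinate unit, the following Python sketch assigns hierarchical reference identifiers. The SUnit class, the function names, and the dotted identifier format ("2.1") are assumptions made for this example; the text above prescribes no particular format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SUnit:
    """Hypothetical in-memory S-unit: its text plus any embedded S-units."""
    text: str
    children: List["SUnit"] = field(default_factory=list)
    ref: str = ""


def number_units(units, prefix=""):
    """Assign identifiers; embedded units restart at 1 inside their parent."""
    for index, unit in enumerate(units, start=1):
        unit.ref = f"{prefix}{index}"
        number_units(unit.children, prefix=unit.ref + ".")


def list_refs(units):
    """Return (identifier, text) pairs in document order."""
    pairs = []
    for unit in units:
        pairs.append((unit.ref, unit.text))
        pairs.extend(list_refs(unit.children))
    return pairs


if __name__ == "__main__":
    quoted = SUnit("She said: Stop. Go home.",
                   [SUnit("Stop."), SUnit("Go home.")])
    units = [SUnit("It was late."), quoted]
    number_units(units)
    for ref, text in list_refs(units):
        print(ref, text)
```

The identifier stored in ref is what would be written out as the attribute value when the text is tagged.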
There are numerous problems of demarcation. If S-units are tagged in a
text, a discussion of marking conventions should be included within the
encoding declarations section of the TEI header, marked by the tag
Punctuation marks cause problems for text markup because they may not be
available in the character set used and because they are often
ambiguous. In the former case entity names should be used to render the
punctuation mark (see
Creators of machine-readable texts are recommended to avoid soft
hyphens, as one cannot tell whether the hyphens are soft or hard in the
case of compounds or prefixed words which might or might not be
hyphenated in mid-sentence.
(hard) hyphens in the word. Where the lineation of the
machine-readable text differs from the original, the editor may
eliminate soft (line-end) hyphens or replace them by a reference to the
soft hyphen entity. The solution chosen should be reported in the
The main problem arises where quotation marks are used for other
purposes, e.g. to indicate the title of an article, to gloss the meaning
of a word, to indicate that a word is a technical term or is used in a
special sense (as in: she hated good books). These uses can be
dealt with by different forms of tagging, as shown above (see sections
Where punctuation marks are disambiguated by tagging the underlying
feature they signal, it may be debated whether they should be excluded
or left (redundantly) as part of the text. The solution we choose may
vary depending upon the feature and depending upon the purpose of our
project. It is natural to keep the end punctuation in the case of
S-units, as there is a significant contrast between the different
punctuation marks. With quotation marks, on the other hand, it is
probably most often natural to leave them out (unless it is essential to
keep a record of exactly how these features were marked in the original
text). In either case, the global Ambiguous Punctuation
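The two editorial options mentioned for line-end hyphens (eliminating them, or replacing them by a soft hyphen marker) can be sketched as follows. This is a hedged illustration only: the function name and the keep_marker parameter are hypothetical, and the Unicode soft hyphen character U+00AD stands in here for whatever entity reference a project actually uses.

```python
import re

# Assumption: U+00AD (SOFT HYPHEN) is used as the marker in this sketch.
SOFT_HYPHEN = "\u00ad"

# A hyphen at the end of a line, between two word characters, followed by
# optional indentation on the next line.
_LINE_END_HYPHEN = re.compile(r"(?<=\w)-\n[ \t]*(?=\w)")


def join_line_end_hyphens(text, keep_marker=False):
    """Rejoin words split across lines (hypothetical helper).

    With keep_marker=False the hyphen is eliminated; with keep_marker=True
    it is replaced by a soft hyphen marker instead.
    """
    return _LINE_END_HYPHEN.sub(SOFT_HYPHEN if keep_marker else "", text)


if __name__ == "__main__":
    print(join_line_end_hyphens("the seg-\nmentation tag"))
    print(repr(join_line_end_hyphens("the seg-\nmentation tag", keep_marker=True)))
```

Note that the sketch treats every line-end hyphen as soft, so a compound such as "well-known" broken after the hyphen would be joined as "wellknown". That is exactly the ambiguity described above, and it is one reason the chosen policy needs to be documented.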