Notes on the Text-Representation Text-Analysis Interface
TEI-AI-W-20
March 23, 1990
D. Terence Langendoen
Department of Linguistics
University of Arizona
Tucson, AZ 85721
langendt@arizvm1 (bitnet)

Introduction

This paper addresses two problems on the border between our committee's responsibilities and those of the committee on text representation. Some of it is inspired by discussion at the Steering Committee meeting last weekend, in particular Stig Johansson's report and a question I got from Antonio Zampolli about how references to lemmata should be handled. I realized later that I didn't like the answer I had given him.

Punctuation Marks and Pointers Thereto

The question came up in Stig's report of whether certain punctuation marks should be suppressed when tags are supplied that provide the information carried by those marks. It was agreed that for preexisting text that is being tagged, it's probably best to keep the original marks even though they become partially or wholly redundant once tags are added. My suggestion is that we add IDREF attributes to the relevant tags that point to the opening and closing punctuation. Let us call these attributes openp and closep respectively. For initial capitalization of sentences, we can let openp point to the capitalized character. For Spanish questions and exclamations, we'll need to allow for a list of IDREFs, one for the inverted punctuation mark and one for the initial capital letter. A given punctuation mark can be pointed to by more than one pointer. This handles not only punctuation within a direct quotation, which can simultaneously close a quoted sentence and the containing sentence, but also the possibility that a word boundary simultaneously marks the end of the preceding word and the beginning of the following word.
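The Spanish case mentioned above can be made concrete with a small sketch. It is written in Python rather than SGML, and everything in it except the attribute names openp and closep is an assumption made for illustration: the sample sentence, the character IDs c1, c2, and so on, and the helper resolve.

```python
# Hypothetical illustration: the sentence and the IDs are invented for this sketch.
text = "¿Quieres pan?"

# Give every character an ID c1, c2, ... in document order.
chars = {f"c{i}": ch for i, ch in enumerate(text, start=1)}

# openp holds a list of IDREFs (inverted mark, then initial capital);
# closep is kept as a list too, so both attributes resolve the same way.
sentence = {"id": "s1", "openp": ["c1", "c2"], "closep": ["c13"]}


def resolve(idrefs):
    """Return the characters that a list of IDREFs points to."""
    return [chars[ref] for ref in idrefs]


print(resolve(sentence["openp"]))   # ['¿', 'Q']
print(resolve(sentence["closep"]))  # ['?']
```

Treating both attributes as lists is one possible design; a single IDREF is then just a list of length one.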

Suppose the text is the following.

And God said, "Let there be light."

Here is a suggested partial markup, where q is the ordinary SGML inline quotation tag, augmented with an ID attribute, and s, w and c are tags for orthographic sentence, word and character respectively. The s, w and q tags are permitted to carry openp and closep attributes. Note that the final period is pointed to by the closep attributes of both the containing orthographic sentence and the embedded one, while the closep attribute of the quotation points to the closing quotation mark. To avoid confusion between blanks occurring in the text and blanks used for formatting the markup, I use &amper;rbl; to represent the former. In the IBM SGML version this is formatted in, &amper; is the entity that formats as an ampersand and &rbl; is the entity that formats as a required blank, so &amper;rbl; prints as the literal string &rbl;.

< s id=s1 openp=c1 closep=c34>
  < w id=w1 closep=c4>
    < c id=c1 case=upper> A < /c> < c id=c2> n < /c> < c id=c3> d < /c>
  < /w>
  < c id=c4> &amper;rbl; < /c>
  <!-- I suppress most of the remaining word and character tags.-->
  G o d &amper;rbl; s a i d , &amper;rbl;
  < q id=q1 openp=c15 closep=c35>
    < c id=c15> " < /c>
    < s id=s2 openp=c16 closep=c34>
      < w id=w4 closep=c19>
        < c id=c16 case=upper> L < /c> < c id=c17> e < /c> < c id=c18> t < /c>
      < /w>
      &amper;rbl; t h e r e &amper;rbl; b e &amper;rbl; l i g h t
      < c id=c34> . < /c>
    < /s>
    < c id=c35> " < /c>
  < /q>
< /s>
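To check that the pointers in this example behave as intended, here is a minimal Python sketch that inverts them, listing for each character the elements that point to it. The element IDs and the openp and closep values are copied from the markup; the table layout and the name pointed_to_by are assumptions of the sketch, not part of the proposal.

```python
from collections import defaultdict

text = 'And God said, "Let there be light."'

# Character IDs c1..c35, numbered in document order as in the markup.
chars = {f"c{i}": ch for i, ch in enumerate(text, start=1)}

# (element id, openp, closep) for the elements that are spelled out in the markup.
elements = [
    ("s1", "c1", "c34"),
    ("w1", None, "c4"),
    ("q1", "c15", "c35"),
    ("s2", "c16", "c34"),
    ("w4", None, "c19"),
]

# Invert the pointers: which elements point at each character, and how?
pointed_to_by = defaultdict(list)
for elem_id, openp, closep in elements:
    for attr, ref in (("openp", openp), ("closep", closep)):
        if ref is None:
            continue
        if ref not in chars:
            raise KeyError(f"{elem_id}: {attr}={ref} does not resolve")
        pointed_to_by[ref].append((elem_id, attr))

print(chars["c34"], pointed_to_by["c34"])
# . [('s1', 'closep'), ('s2', 'closep')]
```

The final period (c34) turns out to be shared by the two sentence elements, while the quotation closes on c35, the closing quotation mark.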

Word Lemmatization

After my report at the Steering Committee meeting on March 18, Antonio asked me about how we propose to handle references to the base form of words, as many machine-readable files already have this information encoded, and many applications potentially depend on it. My response there was to use entities that abbreviate pieces of feature structure markup, as is being suggested for representing standard part of speech information, among other things. I think, upon sober reflection, that there's a better way, using a lemma tag, which either has content (namely the spelling of the lemma) or a pointer to an entry in a lexicon.

Suppose our text is:

Horses went slower.

Here's how the markup could look with a lemma tag associated with each word.

<!--Character markup is suppressed.-->
< s id=s1 openp=c1 closep=c19>
  < w id=w1 closep=c7> < lemma id=l1> horse < /lemma> Horses < /w>
  &amper;rbl;
  < w id=w2 openp=c7 closep=c12> < lemma id=l2> go < /lemma> went < /w>
  &amper;rbl;
  < w id=w3 openp=c12> < lemma id=l3> slow < /lemma> slower < /w>
  .
< /s>
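As a rough illustration of how an application might read lemmata off such markup, the following Python sketch pulls out each word's ID, text form and lemma with a regular expression. It is not an SGML parser. It assumes a simplified transcription of the example, with no blank after the tag-open delimiter and with the required-blank entities left out, and the names WORD and pairs are invented for the sketch.

```python
import re

# Simplified transcription of the example (assumption: tags written without
# the blank after "<", and the required-blank entities omitted).
markup = """
<s id=s1 openp=c1 closep=c19>
<w id=w1 closep=c7><lemma id=l1>horse</lemma>Horses</w>
<w id=w2 openp=c7 closep=c12><lemma id=l2>go</lemma>went</w>
<w id=w3 openp=c12><lemma id=l3>slow</lemma>slower</w>.
</s>
"""

# One w element: its id, then a lemma element, then the word as it appears in the text.
WORD = re.compile(
    r"<w id=(\w+)[^>]*>\s*<lemma id=\w+>\s*(\w+)\s*</lemma>\s*(\w+)\s*</w>"
)

# (word id, form in the text, lemma)
pairs = [(m.group(1), m.group(3), m.group(2)) for m in WORD.finditer(markup)]
print(pairs)
# [('w1', 'Horses', 'horse'), ('w2', 'went', 'go'), ('w3', 'slower', 'slow')]
```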

Now suppose instead of a lemma tag with content associated with each word, we add an IDREF attribute to the w tag, which points to an entry in a lexicon; the latter, I'll assume for purposes of illustration, is attached to the text. To distinguish the text from the lexicon, I'll mark the beginning of the lexicon with an empty tag.

< s id=s1 openp=c1 closep=c19>
  < w id=w1 lexp=e2 closep=c7> Horses < /w>
  &amper;rbl;
  < w id=w2 lexp=e1 openp=c7 closep=c12> went < /w>
  &amper;rbl;
  < w id=w3 lexp=e3 openp=c12> slower < /w>
  .
< /s>
< lexicon>
< entry id=e1> go < /entry>
< entry id=e2> horse < /entry>
< entry id=e3> slow < /entry>
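The pointer variant amounts to a lookup from each word's lexp value into the lexicon. Here is a minimal Python sketch of that lookup, with the entry IDs and lexp values copied from the markup; the dictionary layout and the names lemma_of and dangling are assumptions of the sketch.

```python
# Lexicon entries keyed by their ID, as in the example.
lexicon = {"e1": "go", "e2": "horse", "e3": "slow"}

# Words of the text with their lexp pointers.
words = [
    {"id": "w1", "form": "Horses", "lexp": "e2"},
    {"id": "w2", "form": "went", "lexp": "e1"},
    {"id": "w3", "form": "slower", "lexp": "e3"},
]


def lemma_of(word):
    """Follow a word's lexp pointer to the spelling of its lexicon entry."""
    return lexicon[word["lexp"]]


# Pointers that resolve to no entry would show up here.
dangling = [w["id"] for w in words if w["lexp"] not in lexicon]

lemmas = [(w["form"], lemma_of(w)) for w in words]
print(lemmas)    # [('Horses', 'horse'), ('went', 'go'), ('slower', 'slow')]
print(dangling)  # []
```

Compared with a lemma tag that repeats its content at every word, the pointer lets any number of words share a single entry.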

To Be Continued