Morphology

This section applies the linguistic-analysis tags described earlier to morphemic analysis, to mixed-level annotation of the sort often practiced by field linguists, and to lemmatization and word-class assignment.

Delimiting Words

In the simple case, the orthographic conventions of a text may be taken as given, and any blank-delimited string treated as a word for purposes of analysis. In more complex cases, separate orthographic words can be grouped together as a single item for morphological or lexical purposes, or a single token may be analyzed into several words for annotation. The earlier discussion of explicit alignment includes examples of both the splitting and the agglutination of orthographic words for analysis.

Need a couple of examples here: Times-Picayune Ledger-Star of New Orleans, and New York-born financier would be good.
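The two cases just described, grouping and splitting, can be sketched with character offsets into the text. This is only an illustration: the segmentation of "New York-born financier" shown here is one possible analysis, and the helper names (token_spans, covering_tokens) are hypothetical, not part of any tag set discussed in this section.

```python
text = "New York-born financier"

def token_spans(text):
    """Return (start, end) offsets of each blank-delimited string."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def covering_tokens(span, tokens):
    """Indices of the orthographic tokens that a given span overlaps."""
    s, e = span
    return [i for i, (ts, te) in enumerate(tokens) if ts < e and s < te]

# Orthographic words: accepted as delimited by blanks.
orthographic = text.split()
tokens = token_spans(text)

# Words for analysis, given as offsets (an assumed segmentation):
# "New York" groups material from two orthographic words, and
# "born" is split out of the single orthographic word "York-born".
analysis_spans = [(0, 8), (9, 13), (14, 23)]
analysis = [text[s:e] for s, e in analysis_spans]
coverage = [covering_tokens(span, tokens) for span in analysis_spans]

print(orthographic)
print(analysis)
print(coverage)
```

Recording the analysis as offsets rather than as re-typed strings keeps the orthographic text untouched, which is the same motivation behind pointing into the base text from an alignment.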

Word Structure

When the morphological analysis can be carried out in full alignment with the orthographic text, it is probably simplest to use the implicit alignment method described above. The use of this method for morphological analysis can be illustrated with a sample sentence from the Eskimo language (northwest Alaska dialect). (The digraphs ng and gh are used here to represent the eng and dotted-g, respectively, of the standard Eskimo orthography.) In interlinear format, the annotated sentence looks like this:

tx: Akutchilighmik-uvva           uqaaqtullangniaqtunga.
at: akut    -chi-ligh-mik  =uvva  uqaaqtu   -llang-niaq-tunga
mr: akutuq  -si -liq -mik  =uvva  uqaaqtuq  -llak -niaq-tunga
mg: icecream-RSL-GER -s.MOD=now   tell story-DUR  -INT -1s.I
wg: about making Eskimo icecream  I am going to tell a story

The two-letter codes at the beginning of each line identify the type of information in that line. They are: tx for baseline text, at for allomorphic transcription (that is, with morph cuts indicated in the surface form of the word), mr for morphemic representation (that is, underlying forms), mg for morpheme gloss, and wg for word gloss. In this example, the elements of the baseline text and the word gloss lines work at the word level. But the middle three lines are further subdivided into a morpheme level of analysis. If one ignores this and treats everything as a word-level annotation, the markup will be something like this:

<![CDATA[
<unit type=word>
  <level type=tx> Akutchilighmik-uvva </level>
  <level type=at> akut-chi-ligh-mik=uvva </level>
  <level type=mr> akutuq-si-liq-mik=uvva </level>
  <level type=mg> icecream-RSL-GER-s.MOD=now </level>
  <level type=wg> about making Eskimo icecream </level>
</unit>
<unit type=word>
  <level type=tx> uqaaqtullangniaqtunga. </level>
  <level type=at> uqaaqtu-llang-niaq-tunga </level>
  <level type=mr> uqaaqtuq-llak-niaq-tunga </level>
  <level type=mg> tell story-DUR-INT-1s.I </level>
  <level type=wg> I am going to tell a story </level>
</unit>
]]>

Note that in this representation we have introduced morpheme break characters in the data so that the human reader can reconstruct the alignment relationships between the parts of the related lines, but we have not encoded those relationships explicitly in the markup. To handle the morphemic substructure in the markup, we would use the annotated unit structure recursively to treat analyzed morphemes as units within analyzed words. When we do this, the base form has only two annotations, a morphemic analysis (for which we use the type code ma) and a word gloss. Ignoring the substructure in the morphemic analysis for the moment, the markup at the word level would look like this:

<![CDATA[
<unit type=word>
  <level type=tx> Akutchilighmik-uvva </level>
  <level type=ma> "morphemic analysis goes here" </level>
  <level type=wg> about making Eskimo icecream </level>
</unit>
<unit type=word>
  <level type=tx> uqaaqtullangniaqtunga. </level>
  <level type=ma> "morphemic analysis goes here" </level>
  <level type=wg> I am going to tell a story </level>
</unit>
]]>

The content of the morphemic analysis annotation becomes a sequence of annotated units, each of which represents a morpheme. Using morpheme as the value of the type attribute is one possibility. We choose instead to use root, suffix, and clitic as the type designators. This serves the purpose of encoding the information represented by the hyphens and equals signs in the original example. (The formatting application would insert a hyphen before suffixes and an equals sign before clitics.)
The markup for the full example then becomes:

<![CDATA[
<unit type=word>
  <level type=tx> Akutchilighmik-uvva </level>
  <level type=ma>
    <unit type=root>   <level type=at> akut </level> <level type=mr> akutuq </level> <level type=mg> icecream </level> </unit>
    <unit type=suffix> <level type=at> chi </level>  <level type=mr> si </level>     <level type=mg> RSL </level>      </unit>
    <unit type=suffix> <level type=at> ligh </level> <level type=mr> liq </level>    <level type=mg> GER </level>      </unit>
    <unit type=suffix> <level type=at> mik </level>  <level type=mr> mik </level>    <level type=mg> s.MOD </level>    </unit>
    <unit type=clitic> <level type=at> uvva </level> <level type=mr> uvva </level>   <level type=mg> now </level>      </unit>
  </level>
  <level type=wg> about making Eskimo icecream </level>
</unit>
<unit type=word>
  <level type=tx> uqaaqtullangniaqtunga. </level>
  <level type=ma>
    <unit type=root>   <level type=at> uqaaqtu </level> <level type=mr> uqaaqtuq </level> <level type=mg> tell story </level> </unit>
    <unit type=suffix> <level type=at> llang </level>   <level type=mr> llak </level>     <level type=mg> DUR </level>        </unit>
    <unit type=suffix> <level type=at> niaq </level>    <level type=mr> niaq </level>     <level type=mg> INT </level>        </unit>
    <unit type=suffix> <level type=at> tunga </level>   <level type=mr> tunga </level>    <level type=mg> 1s.I </level>       </unit>
  </level>
  <level type=wg> I am going to tell a story </level>
</unit>
]]>

The analysis could extend upward to encode these two words as comprising a unit at the sentence level. It would have a base text (which might be a pointer to a sequence of characters in the original text) plus two annotations: a word analysis (consisting of the above sequence of word units) and a sentence translation (namely, "I am going to tell a story about making Eskimo ice cream."). In so doing, the capitalization and punctuation would probably best be separated out as units in the sequence of word analysis or even as annotations on the analyzed words.
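The role of the formatting application mentioned above can be sketched as follows. The dictionary layout is an assumption made for this illustration (only the type codes tx, at, mr, mg, wg, ma and the unit types root, suffix, clitic come from the text), and the function names are hypothetical. The sketch regenerates the hyphens and equals signs from the unit types instead of storing them in the data.

```python
# Assumed in-memory form of the nested markup for the first word.
word = {
    "tx": "Akutchilighmik-uvva",
    "wg": "about making Eskimo icecream",
    "ma": [
        {"type": "root", "at": "akut", "mr": "akutuq", "mg": "icecream"},
        {"type": "suffix", "at": "chi", "mr": "si", "mg": "RSL"},
        {"type": "suffix", "at": "ligh", "mr": "liq", "mg": "GER"},
        {"type": "suffix", "at": "mik", "mr": "mik", "mg": "s.MOD"},
        {"type": "clitic", "at": "uvva", "mr": "uvva", "mg": "now"},
    ],
}

# Boundary character inserted before each kind of morpheme unit.
BOUNDARY = {"root": "", "suffix": "-", "clitic": "="}

def render_level(word, level):
    """Join one annotation level across the morphemes of a word."""
    return "".join(BOUNDARY[m["type"]] + m[level] for m in word["ma"])

def render_interlinear(word):
    """Produce the five interlinear lines for a single word."""
    lines = [f"tx: {word['tx']}"]
    for level in ("at", "mr", "mg"):
        lines.append(f"{level}: {render_level(word, level)}")
    lines.append(f"wg: {word['wg']}")
    return "\n".join(lines)

print(render_interlinear(word))
```

Because the boundary characters are derived from the type attribute, changing the display convention (for instance, a different clitic marker) would require no change to the encoded data.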

Next we give an example of a morphological analysis that cannot be carried out in full alignment with the orthographic text, and which therefore makes use of the explicit alignment mechanism described earlier. The example is the word kutub 'books' in modern Egyptian Arabic. This word may be analyzed as made up of the triconsonantal root ktb 'having to do with writing' and the vowel desinence u 'noun plural', which is intercalated between the first and second and between the second and third consonants of the root.

<![CDATA[
<!-- Analysis of the word 'kutub' in modern Egyptian Arabic. -->
<f.struct id=kutub>
  <feature> <f.name> gloss </f.name>      <f.struct> books </f.struct>  </feature>
  <feature> <f.name> category </f.name>   <f.struct> noun </f.struct>   </feature>
  <feature> <f.name> number </f.name>     <f.struct> plural </f.struct> </feature>
  <feature> <f.name> base </f.name>       <f.ptr target=ktb>            </feature>
  <feature> <f.name> inflection </f.name> <f.ptr target=uu>             </feature>
  <feature> <f.name> transcription </f.name>
    <f.list>
      <f.struct id=ch1> k </f.struct>
      <f.struct id=ch2> u </f.struct>
      <f.struct id=ch3> t </f.struct>
      <f.struct id=ch4> u </f.struct>
      <f.struct id=ch5> b </f.struct>
    </f.list>
  </feature>
</f.struct>

<!-- Analysis of the root 'ktb'. -->
<f.struct id=ktb>
  <feature> <f.name> gloss </f.name>    <f.struct> having to do with writing </f.struct> </feature>
  <feature> <f.name> category </f.name> <f.struct> root </f.struct>                      </feature>
  <feature> <f.name> transcription </f.name>
    <f.list>
      <f.struct id=co1> k </f.struct>
      <f.struct id=co2> t </f.struct>
      <f.struct id=co3> b </f.struct>
    </f.list>
  </feature>
</f.struct>

<!-- Analysis of the desinence 'u'. -->
<!-- Note that we analyze the vowel as spreading into the -->
<!-- positions that it occupies in the word. -->
<f.struct id=uu>
  <feature> <f.name> gloss </f.name>         <f.struct> noun plural </f.struct> </feature>
  <feature> <f.name> category </f.name>      <f.struct> desinence </f.struct>   </feature>
  <feature> <f.name> transcription </f.name> <f.struct id=vo1> u </f.struct>    </feature>
</f.struct>

<!-- Alignment among the components of the word 'kutub'. -->
<alignment>
  <!-- The root is aligned with the consonants of the word. -->
  <al.map>
    <al.ptr target=ktb>
    <al.list> <al.ptr target=ch1> <al.ptr target=ch3> <al.ptr target=ch5> </al.list>
  </al.map>
  <!-- The desinence is aligned with the vowels of the word. -->
  <al.map>
    <al.ptr target=uu>
    <al.list> <al.ptr target=ch2> <al.ptr target=ch4> </al.list>
  </al.map>
  <!-- The individual consonants and vowels of the word are -->
  <!-- aligned with the individual consonants and vowels of -->
  <!-- the root and the desinence. -->
  <al.map> <al.ptr target=ch1> <al.ptr target=co1> </al.map>
  <al.map> <al.ptr target=ch2> <al.ptr target=vo1> </al.map>
  <al.map> <al.ptr target=ch3> <al.ptr target=co2> </al.map>
  <al.map> <al.ptr target=ch4> <al.ptr target=vo1> </al.map>
  <al.map> <al.ptr target=ch5> <al.ptr target=co3> </al.map>
</alignment>
]]>
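The alignment above can be checked mechanically. The following sketch assumes a simple dictionary rendering of the markup (the ids ch1 to ch5, co1 to co3, vo1, ktb and uu come from the example; the dictionary layout and function names are hypothetical). It rebuilds the surface word from the root and desinence by following the segment-level maps, and confirms that the component-level maps say the same thing as the segment-level ones.

```python
# Segments of the word, in order, and of its two components.
word_segments = {"ch1": "k", "ch2": "u", "ch3": "t", "ch4": "u", "ch5": "b"}
component_segments = {"co1": "k", "co2": "t", "co3": "b", "vo1": "u"}

# Which component (root 'ktb' or desinence 'uu') owns each segment.
owner = {"co1": "ktb", "co2": "ktb", "co3": "ktb", "vo1": "uu"}

# Component-level maps: root -> consonant slots, desinence -> vowel slots.
component_maps = {"ktb": ["ch1", "ch3", "ch5"], "uu": ["ch2", "ch4"]}

# Segment-level maps: each position in the word -> its source segment.
# Note that vo1 is the target of two maps (the vowel "spreads").
segment_maps = {"ch1": "co1", "ch2": "vo1", "ch3": "co2",
                "ch4": "vo1", "ch5": "co3"}

def rebuild(word_segments, segment_maps, component_segments):
    """Spell out the word by following each position to its source."""
    return "".join(component_segments[segment_maps[ch]] for ch in word_segments)

def maps_agree(component_maps, segment_maps, owner):
    """Check that the two levels of alignment are mutually consistent."""
    derived = {}
    for ch, seg in segment_maps.items():
        derived.setdefault(owner[seg], []).append(ch)
    return derived == component_maps

print(rebuild(word_segments, segment_maps, component_segments))
print(maps_agree(component_maps, segment_maps, owner))
```

A consistency check of this kind is one reason to keep both levels of alignment: either could be derived from the other, so a mismatch signals an encoding error.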

Word-Class, Lemmatization and Grammatical Features

Corpus linguists have long tagged the words of running text with part of speech, lemma, and other grammatical features.

Parts of EDW12 need to be inserted here. -Ed.

Marking Usage, Special Words and Special Cases

Lexical Ambiguity

Discontinuous Lexical Items

Relating Lexical Analyses to Textual Occurrences

An analysis of a particular word, such as that of kutub above, can be construed either as an analysis of a particular occurrence of that word in a particular text or as an analysis of its lexical structure. The latter interpretation is particularly useful when the word, with that or a closely related analysis, occurs more than once in the text. If the analysis is viewed as that of the word's lexical structure, then the analysis of each actual textual occurrence can be specified as a function of the lexical analysis. In the simplest case the function is identity, and we can simply set a pointer from the textual occurrences of the word to its lexical representation, using an IDREF attribute on whatever tag delimits the occurrences in the text. At this point, we have not settled on the name of this attribute or on the tags with which it can be used, but expect to have a recommendation shortly. The name lexp was suggested in a working paper of the committee. Alternatively, one could use the alignment mechanism to relate all of the textual occurrences of a word to the representation of its lexical structure. This alternative has the advantage of generalizing more readily to situations in which the structure of the textual occurrence of a word is not identical to its lexical structure. We consider briefly here two such situations.
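The identity case can be sketched as a simple pointer lookup. The attribute name lexp is only the suggestion mentioned above, not a settled recommendation, and the dictionary layout and the resolve function are assumptions made for this illustration.

```python
# Lexical representations, keyed by id (as an ID attribute would be).
lexicon = {
    "kutub": {"gloss": "books", "category": "noun", "number": "plural"},
}

# Two textual occurrences, each carrying an IDREF-style pointer.
# "lexp" is a provisional attribute name.
occurrences = [
    {"form": "kutub", "lexp": "kutub"},
    {"form": "kutub", "lexp": "kutub"},
]

def resolve(occurrence, lexicon):
    """Follow the pointer; an IDREF must name an existing id."""
    target = occurrence["lexp"]
    if target not in lexicon:
        raise KeyError(f"dangling pointer: {target}")
    return lexicon[target]

# With the identity function, every occurrence shares one analysis.
first = resolve(occurrences[0], lexicon)
second = resolve(occurrences[1], lexicon)
print(first is second, first["gloss"])
```

The analysis is stored once and shared, so a correction to the lexical representation is reflected at every occurrence that points to it.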

First, if we wish to associate more information with a textual occurrence of a word than is provided in its lexical representation, then we would create a structural representation for it that incorporates the lexical structure as a proper part, presumably using the f.ptr tag, and align the textual occurrence of that word with its structural representation, along with (if desired) its lexical representation. Second, if we wish to associate less information with a textual occurrence of a word than is provided in its lexical representation, then we would create a structural representation for it that incorporates only those parts of the lexical structure for that word that are appropriate for that occurrence. For example, if the lexical representation contained several interpretations of the word (using f.s.OR), and the textual occurrence of the word had only one of these interpretations, then the representation of the structure of the textual occurrence would incorporate only that part of the lexical structure (using f.s.choice). Again, we could align the textual occurrence of that word with its structural representation and (if desired) with its full lexical representation.
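Both situations can be sketched with plain dictionaries. Everything here is hypothetical: the entry "w1", its two placeholder interpretations, the extra feature, and the function names extend and restrict are invented for illustration, and the dictionaries only loosely stand in for structures built with f.ptr, f.s.OR and f.s.choice.

```python
# A lexical representation with alternative interpretations
# (standing in for alternatives grouped with f.s.OR).
lexical = {
    "id": "w1",
    "interpretations": [
        {"gloss": "sense A"},
        {"gloss": "sense B"},
    ],
}

def extend(lexical, **extra):
    """More information: point to the whole lexical structure
    (as f.ptr would) and add occurrence-specific features."""
    return {"lexical": lexical["id"], **extra}

def restrict(lexical, choice):
    """Less information: keep only the interpretation that applies
    to this occurrence (the role of f.s.choice)."""
    return {
        "lexical": lexical["id"],
        "interpretation": lexical["interpretations"][choice],
    }

richer = extend(lexical, position="sentence-initial")
narrower = restrict(lexical, 1)
print(richer)
print(narrower)
```

In both cases the occurrence keeps a reference to the lexical entry, so it can still be aligned with the full lexical representation if desired.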