above.
The use of this method for morphological analysis can be illustrated
with a sample sentence from the Eskimo language (northwest Alaska
dialect). (The digraphs ng and gh are used here to represent the
eng and dotted-g, respectively, of the standard Eskimo
orthography.) In interlinear format, the annotated sentence
looks like this:
tx: Akutchilighmik-uvva                uqaaqtullangniaqtunga.
at: akut     -chi -ligh -mik   =uvva   uqaaqtu    -llang -niaq -tunga
mr: akutuq   -si  -liq  -mik   =uvva   uqaaqtuq   -llak  -niaq -tunga
mg: icecream -RSL -GER  -s.MOD =now    tell story -DUR   -INT  -1s.I
wg: about making Eskimo icecream       I am going to tell a story
The two-letter codes at the beginning of each line identify the types of
information in each line. They are: tx for baseline text, at for
allomorphic transcription (that is, with morph cuts indicated in the
surface form of the word), mr for morphemic representation (that is,
underlying forms), mg for morpheme gloss, and wg for word gloss.
In this example, the elements of the baseline text and the
word gloss lines work at the word level. But the middle three
lines are further subdivided into a morpheme level of analysis.
If one ignores this and treats everything as a word-level
annotation, the markup will be something like this:
Akutchilighmik-uvva
akut-chi-ligh-mik=uvva
akutuq-si-liq-mik=uvva
icecream-RSL-GER-s.MOD=now
about making Eskimo icecream
uqaaqtullangniaqtunga.
uqaaqtu-llang-niaq-tunga
uqaaqtuq-llak-niaq-tunga
tell story-DUR-INT-1s.I
I am going to tell a story
Note that in this representation we have introduced morpheme
break characters in the data so that the human reader can
reconstruct the alignment relationships between the parts of the
related lines, but we have not encoded those relationships
explicitly in the markup.
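The reconstruction that the human reader performs can be sketched in a few lines of Python. This is only an illustration of how the break characters carry the alignment implicitly; the function name split_morphs is our own and is not part of any proposed markup. It assumes that a hyphen or equals sign inside the at, mr and mg lines always marks a morpheme break.

```python
import re

def split_morphs(word):
    """Split a word-level string at morpheme breaks.

    Each break character ('-' before a suffix, '=' before a clitic)
    stays attached to the morph that follows it.
    """
    return re.findall(r"[-=]?[^-=]+", word)

# The three morpheme-level lines for the first word of the example.
at = "akut-chi-ligh-mik=uvva"
mr = "akutuq-si-liq-mik=uvva"
mg = "icecream-RSL-GER-s.MOD=now"

parts = [split_morphs(line) for line in (at, mr, mg)]

# The alignment is recoverable only if every line has the same number of morphs.
if len({len(p) for p in parts}) != 1:
    raise ValueError("lines do not contain the same number of morphs")

# Each tuple groups the allomorph, underlying form and gloss of one morpheme.
aligned = list(zip(*parts))
for allomorph, underlying, gloss in aligned:
    print(allomorph, underlying, gloss)
```

The alignment here exists only as a by-product of string processing, which is exactly the weakness noted above: nothing in the data itself states that the second morph of one line corresponds to the second morph of another.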
To handle the morphemic substructure in the markup, we would
use the annotated unit structure recursively to treat analyzed
morphemes as units within analyzed words. When we do this, the
base form has only two annotations, a morphemic analysis (for
which we use the type code ma) and a word gloss. Ignoring the
substructure in the morphemic analysis for the moment, the markup
at the word level would look like this:
Akutchilighmik-uvva
"morphemic analysis goes here"
about making Eskimo icecream
uqaaqtullangniaqtunga.
"morphemic analysis goes here"
I am going to tell a story
The content of the morphemic analysis annotation becomes a sequence of
annotated units, each of which represents a morpheme. Using
morpheme as the value of the type attribute is one
possibility. We choose instead to use root,
suffix, and clitic as the type designators.
This serves the purpose of encoding the information represented by the
hyphens and equals signs in the original example. (The formatting
application would insert a hyphen before suffixes and an equals sign
before clitics.) The markup for the full example then becomes:
Akutchilighmik-uvva
akut
akutuq
icecream
chi
si
RSL
ligh
liq
GER
mik
mik
s.MOD
uvva
uvva
now
about making Eskimo icecream
uqaaqtullangniaqtunga.
uqaaqtu
uqaaqtuq
tell story
llang
llak
DUR
niaq
niaq
INT
tunga
tunga
1s.I
I am going to tell a story
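To make the role of the formatting application concrete, the following Python sketch models the first word as a nested structure and regenerates the hyphens and equals signs from the type designators. It is a minimal illustration under our own assumptions: the class names Morpheme and Word, the PREFIX table and the render function are hypothetical, and only the codes at, mr and mg and the types root, suffix and clitic come from the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Morpheme:
    kind: str  # "root", "suffix" or "clitic"
    at: str    # allomorphic transcription
    mr: str    # morphemic representation
    mg: str    # morpheme gloss

@dataclass
class Word:
    base: str                  # baseline text of the word
    morphemes: List[Morpheme]  # the morphemic analysis (ma)
    gloss: str                 # word gloss (wg)

# Break character that the formatter inserts before each type of unit.
PREFIX = {"root": "", "suffix": "-", "clitic": "="}

def render(word, line):
    """Rebuild one interlinear line ("at", "mr" or "mg") for a word."""
    return "".join(PREFIX[m.kind] + getattr(m, line) for m in word.morphemes)

word = Word(
    base="Akutchilighmik-uvva",
    morphemes=[
        Morpheme("root", "akut", "akutuq", "icecream"),
        Morpheme("suffix", "chi", "si", "RSL"),
        Morpheme("suffix", "ligh", "liq", "GER"),
        Morpheme("suffix", "mik", "mik", "s.MOD"),
        Morpheme("clitic", "uvva", "uvva", "now"),
    ],
    gloss="about making Eskimo icecream",
)

for line in ("at", "mr", "mg"):
    print(line + ":", render(word, line))
```

Because the break characters are derived from the type of each unit, they no longer need to be stored in the data, and the correspondence between the three values of a morpheme is stated by the structure itself.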
The analysis could extend upward to encode these two words
as comprising a unit at the sentence level. It would have a base
text (which might be a pointer to a sequence of characters in the
original text) plus two annotations: a word analysis (consisting
of the above sequence of word units) and a sentence translation
(namely, "I am going to tell a story about making Eskimo ice
cream."). In so doing, the capitalization and punctuation would
probably best be separated out as units in the sequence of word
analysis or even as annotations on the analyzed words.
Next we give an example of a morphological analysis that cannot
be carried out in full alignment with the orthographic text, and
which therefore makes use of the explicit alignment mechanism in
section . The example is the word kutub 'books' in modern Egyptian
Arabic. This word may be analyzed as made up of the triconsonantal
root ktb 'having to do with writing' and the vowel desinence u 'noun
plural', which is intercalated between the first and second and
between the second and third consonants of the root.
gloss: books
category: noun
number: plural
base
inflection
transcription: k u t u b

gloss: having to do with writing
category: root
transcription: k t b

gloss: noun plural
category: desinence
transcription: u
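The intercalation can also be pictured as an explicit alignment between the segments of the surface form and the segments of the two morphemes. The Python sketch below is a hypothetical encoding of that alignment, not the mechanism defined elsewhere in this document: the tier names, the list of (tier, index) pairs and the function names are all assumptions made for illustration.

```python
# Segments of each morpheme, as given in the analysis above.
tiers = {
    "root": ["k", "t", "b"],   # 'having to do with writing'
    "desinence": ["u"],        # 'noun plural'
}

# One entry per surface segment: which tier it comes from, and which
# segment of that tier it realizes. The single vowel of the desinence
# is realized twice.
alignment = [
    ("root", 0),
    ("desinence", 0),
    ("root", 1),
    ("desinence", 0),
    ("root", 2),
]

def realize(alignment, tiers):
    """Produce the surface segments by following the alignment."""
    return [tiers[tier][index] for tier, index in alignment]

def positions_of(tier, alignment):
    """Surface positions (0-based) occupied by segments of one tier."""
    return [i for i, (t, _) in enumerate(alignment) if t == tier]

print("".join(realize(alignment, tiers)))
print(positions_of("desinence", alignment))
```

The point of the sketch is that neither morpheme occupies a contiguous stretch of the surface word, so a purely nested structure of the kind used for the Eskimo example cannot express the analysis.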
Word-Class, Lemmatization and Grammatical Features
Corpus linguists have long tagged corpora with part-of-speech, lemma,
and other grammatical features of the words of running text.
Parts of EDW12 need to be inserted here. -Ed.
Marking Usage, Special Words and Special Cases
Lexical Ambiguity
Discontinuous Lexical Items
Relating Lexical Analyses to Textual Occurrences
An analysis of a particular word, such as that of kutub above,
can be construed either as an analysis of a
particular occurrence of that word in a particular text, or as an
analysis of its lexical structure. The latter interpretation
would be particularly useful in a situation in which the word
with that or a closely related analysis occurs more than once in
the text. If the analysis is viewed as that of its lexical
structure, then the analysis of the actual textual occurrences of
the word can be specified as a function of its lexical analysis.
In the simplest case the function is identity, and in that case
we can simply set a pointer from the textual occurrences of the
word to its lexical representation using an IDREF attribute on
whatever tag is used to delimit the textual occurrences in the
text.
At this point, we have not settled on the name of this attribute
nor what tags this attribute can be used with, but expect to have
a recommendation shortly. The name lexp was suggested in a working
paper of the committee.
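The identity case can be sketched as a simple lookup. In the Python fragment below, the identifier "lex.kutub", the occurrence records and their position values are invented for illustration; the key lexp follows the name suggested above, which has not been settled.

```python
# Lexical representations, keyed by a unique identifier (the target of the IDREF).
lexicon = {
    "lex.kutub": {"gloss": "books", "category": "noun", "number": "plural"},
}

# Textual occurrences; each points to its lexical representation.
# The position values are placeholders, not taken from any real text.
occurrences = [
    {"position": 3, "form": "kutub", "lexp": "lex.kutub"},
    {"position": 17, "form": "kutub", "lexp": "lex.kutub"},
]

def analysis_of(occurrence, lexicon):
    """Follow the pointer from a textual occurrence to its lexical analysis."""
    return lexicon[occurrence["lexp"]]

for occ in occurrences:
    print(occ["position"], occ["form"], analysis_of(occ, lexicon)["gloss"])
```

Both occurrences resolve to the same lexical representation, so the analysis is recorded once no matter how often the word occurs.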
Alternatively, one could use the alignment mechanism to relate
all of the textual occurrences of a word to the representation of
its lexical structure. This alternative has the advantage of
generalizing more readily to situations in which the structure
of the textual occurrence of a word is not identical to its
lexical structure. We consider briefly here two such situations.
First, if we wish to associate more information with a textual
occurrence of a word than is provided in its lexical
representation, then we would create a structural representation
for it that incorporates the lexical structure as a proper part,
presumably using the f.ptr tag, and align the textual occurrence
of that word with its structural representation, along with (if
desired) its lexical representation. Second, if we wish to
associate less information with a textual occurrence of a word
than is provided in its lexical representation, then we would
create a structural representation for it that incorporates only
those parts of the lexical structure for that word that are
appropriate for that occurrence.
For example, if the lexical representation contained several
interpretations of the word (using f.s.OR), and the
textual occurrence of the word had only one of these
interpretations, then the representation of the structure of the
textual occurrence of that word would incorporate only that
aspect of the lexical structure (using f.s.choice)
that is part of the structure of the textual occurrence.
Again, we could align the textual occurrence of that word with
its structural representation and (if desired) with its full
lexical representation.
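The second situation can be sketched as follows. The Python fragment models a lexical representation with alternative interpretations as a plain list and derives the representation for one occurrence by keeping a single alternative. The data are placeholders ("sense A", "sense B"), the function name restrict is our own, and the fragment does not attempt to reproduce the f.s.OR or f.s.choice tags themselves.

```python
# A lexical representation with two alternative interpretations.
# The form and glosses are placeholders, not real lexical data.
lexical = {
    "form": "w",
    "alternatives": [
        {"id": "a", "gloss": "sense A"},
        {"id": "b", "gloss": "sense B"},
    ],
}

def restrict(lexical, chosen_id):
    """Build the representation of a textual occurrence that carries
    only one of the interpretations listed in the lexical representation."""
    matches = [alt for alt in lexical["alternatives"] if alt["id"] == chosen_id]
    if len(matches) != 1:
        raise ValueError("expected exactly one interpretation with id %r" % chosen_id)
    return {"form": lexical["form"], "interpretation": matches[0]}

occurrence = restrict(lexical, "b")
print(occurrence["interpretation"]["gloss"])
```

The full lexical representation is left untouched, so the occurrence can still be aligned with it as well as with its own reduced representation.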