This chapter focuses on methods for marking the linguistic analysis and interpretation of texts. In the future it is intended to extend its coverage to other types of analysis and interpretation; the initial focus on linguistics reflects both the centrality of linguistic information for all types of textual study and the relative readiness of linguistic analyses to be formalized in the manner required by an explicit markup.
The primary problem has been to find mechanisms which combine clarity and precision with a very wide hospitality to the varying theoretical presuppositions of practicing linguists, as well as the flexibility required to allow use of these markup methods in the widely varying conditions of existing computational practice. These tags are intended to be usable by linguists of any theoretical persuasion for a number of purposes. They may be used in-line, that is, interspersed among the words of the texts, or kept separate from the text.
The interspersing of analytic tags with the text is appropriate
in situations in which one is interested in a relatively
restricted range of linguistic information about the text, such
as the categories or parts of speech of its words, and other
easily specifiable grammatical information. The separation of
analytic markup from text is appropriate in situations in which
one is interested in providing detailed linguistic analyses of
text, or in which one has available one or more databases of
linguistic representations which can be linked to textual
material. We see the creation of linguistically analyzed texts
and the creation of linguistic databases that can be used for the
markup of additional texts as interrelated tasks, which are
facilitated by a markup scheme that enables one to factor out the
common properties of recurring textual material from the
properties that distinguish those occurrences. Some apparently
needless verbosity in the tags of this chapter is motivated by
the intention to allow this sort of cross-reference to linguistic
databases (lexicons, partial grammars, etc., also represented in
SGML); elaboration of such links will occupy the relevant
committee during the revision and extension of this draft.
Finally, the ability to relate the representations of the
analyses of complex expressions to those of their parts in an
explicit way will make the analyzed texts of use not only for
purely linguistic purposes or for purposes of information
retrieval, as important as these are, but also for other
purposes, such as the design of experiments in human language
processing, and for measuring the structural complexity of texts.
The special problems posed by lexical and structural ambiguity
are discussed further below in section 8.3.
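The contrast drawn above between in-line tagging and analyses kept separate from the text can be sketched as follows. This is only an illustration in Python, not the markup of this chapter: the identifiers (`w1`, `lex-sink-n`), the class `Annotation`, the function `analysis_of`, and the category labels are all hypothetical. It shows how a separate analysis can factor the properties shared by recurring material (held once in a lexicon entry) from the properties of a particular occurrence.

```python
from dataclasses import dataclass

# In-line style: each word carries its own small analysis.
# The labels "det", "noun", "verb" are illustrative, not prescribed.
inline = [("the", "det"), ("sinks", "noun"), ("leak", "verb")]

# Separate style: the text holds only identified words.
words = {"w1": "the", "w2": "sinks", "w3": "leak"}

# A hypothetical lexicon stores what recurring forms have in common.
lexicon = {
    "lex-sink-n": {"lemma": "sink", "category": "noun"},
}


@dataclass
class Annotation:
    target: str  # id of the word in the text
    entry: str   # id of the shared lexicon entry
    local: dict  # properties specific to this occurrence


annotations = [Annotation("w2", "lex-sink-n", {"number": "plural"})]


def analysis_of(word_id):
    """Merge shared lexicon properties with occurrence-specific ones."""
    merged = {}
    for a in annotations:
        if a.target == word_id:
            merged.update(lexicon[a.entry])
            merged.update(a.local)
    return merged
```

In this sketch a word with no annotation simply yields an empty analysis, and a second occurrence of the same form would point at the same lexicon entry rather than repeat its content.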
In order to accommodate existing practices in linguistic text
markup, and to provide a scheme which can be used by
practitioners of any linguistic theory whatever, we refrain from
associating specific elements of linguistic theory or description
with specific SGML tags, attributes, and attribute values. For
example, you will find no tags for
"category" and
"noun" in this illustration are content associated respectively with
As can be seen from this illustration, the tags for linguistic analysis
provide a straightforward means of representing linguistic structures as
containing features with values, the latter being either atomic values,
feature structures, or pointers to other feature structures. Thus it
may seem that, contrary to our espousal of theoretical neutrality, we
are favoring those schools of linguistic thought, such as
lexical-functional grammar and generalized phrase-structure grammar,
that typically represent linguistic structures directly in this way, and
disfavoring others, such as government-binding theory and perhaps
categorial grammar, which typically do not. Certainly our proposals
provide a certain advantage to those who think in terms of features and
values; however, the feature-value system of representation is
sufficiently general to represent structures which are formulated in
different ways, such as forests of trees connected by application of
movement transformations, or single trees with associated data
structures (such as chains), which represent the history of
application of movement transformations. To facilitate the use of our
proposals, we have created several special-purpose tag sets to represent
specialized kinds of feature structures, such as annotated tree
structures and derivation trees for categorial grammar applications.
The tag sets we have proposed have also been developed with an eye
toward the eventual development of software that would render the markup
of linguistic structures graphically, more or less in the manner that
linguists now draw them. Finally, we point out that we have also
developed a tag set for representing interlinear glossing of text, to
provide a direct means of encoding the type of linguistic
representations that are currently in most widespread use by linguists.
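The feature-value representation described above, in which a value is an atomic value, a feature structure, or a pointer to another feature structure, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tag set of this chapter: the class names `FeatureStructure` and `Pointer`, the helper `get_path`, and every feature name and value are hypothetical. The example also shows a small tree (a clause with a subject and a head) encoded purely as nested features.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Pointer:
    target: str  # id of another feature structure


@dataclass
class FeatureStructure:
    id: str
    features: Dict[str, "Value"] = field(default_factory=dict)


# A value is an atomic symbol, a nested structure, or a pointer.
Value = Union[str, FeatureStructure, Pointer]


def resolve(value, index):
    """Follow a pointer to the structure it names; return other values as-is."""
    if isinstance(value, Pointer):
        return index[value.target]
    return value


def get_path(fs, path, index):
    """Look up a sequence of feature names, following pointers on the way."""
    current = fs
    for name in path:
        current = resolve(current.features[name], index)
    return current


# One agreement structure, shared by pointer from two places.
agr = FeatureStructure("fs-agr", {"number": "singular", "person": "third"})
noun = FeatureStructure("fs-n", {"category": "noun", "agreement": Pointer("fs-agr")})
verb = FeatureStructure("fs-v", {"category": "verb", "agreement": Pointer("fs-agr")})

# A tree node expressed as features whose values are its daughters.
clause = FeatureStructure(
    "fs-s", {"category": "sentence", "subject": noun, "head": verb}
)

index = {fs.id: fs for fs in (agr, noun, verb, clause)}
```

Here `get_path(clause, ["subject", "agreement", "number"], index)` walks from the clause through its subject and the shared agreement structure to an atomic value. Because both daughters point at the same structure, a change to `agr` is visible from either.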
We strongly recommend that those who create texts with linguistic
analyses for interchange with others document the linguistic ideas which
guided the analysis with an appropriate declaration in the declarations
area. Moreover, if the analyses are the result of the application of a
program, documentation concerning that program (above all, its
algorithm) should also be provided, along with an indication of the
nature of the intervention (if any) that was used when the program was
run. For these purposes, the encoding declarations
area should be used. References of the form "See attached documentation"
are discouraged: if the
documentation is extensive, it should have an accessible treatment;
if it is small it should be incorporated into the machine-readable file.
The tags described here have been carefully designed to allow
underspecification of the analysis as well as full specification of the
linguistic structure. In practice, one may wish to underspecify the
analysis either because it is incomplete and will be worked on further
with other tools, or because one does not wish to make any claim either
way about some issue. Since they do not themselves embody a specific
notion of a complete analysis, the general mechanisms provided
here allow underspecification to take a very simple form. No specific
item of analysis is required, and so any analysis can be left incomplete
by the simple mechanism of leaving out whatever one does not want to
specify.
The flagging and treatment of linguistic anomalies (errors,
particularly interesting cases, interference phenomena, etc.) are
dealt with in part in an earlier chapter.
The general mechanisms for linguistic analysis are described in
the next section of this chapter.
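The underspecification described above, in which an analysis is left incomplete simply by omitting features, can be sketched as a compatibility check between a partial analysis and a fuller one. This is a hedged illustration in Python using plain dictionaries; the function name `is_compatible` and all feature names and values are hypothetical and are not part of the tag set.

```python
def is_compatible(partial, full):
    """Return True if every feature given in `partial` agrees with `full`.

    Features absent from `partial` make no claim either way.
    Nested dictionaries are compared recursively.
    """
    for name, value in partial.items():
        if name not in full:
            return False
        if isinstance(value, dict) and isinstance(full[name], dict):
            if not is_compatible(value, full[name]):
                return False
        elif value != full[name]:
            return False
    return True


# A fuller analysis of some word.
full = {
    "category": "noun",
    "agreement": {"number": "plural", "person": "third"},
}

# Underspecified analyses: each leaves out what it does not claim.
only_category = {"category": "noun"}
only_number = {"agreement": {"number": "plural"}}
conflicting = {"agreement": {"number": "singular"}}
```

Under these assumptions, `only_category` and `only_number` are both compatible with `full`, the empty analysis is compatible with anything, and `conflicting` is not, because it makes a claim that the fuller analysis contradicts.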