Analytic and Interpretive Information

Principles and Definitions

This chapter focuses on methods for marking the linguistic analysis and interpretation of texts. Later revisions are intended to extend its coverage to other types of analysis and interpretation; the initial focus on linguistics reflects both the centrality of linguistic information for all types of textual study and the relative readiness of linguistic analyses to be formalized in the manner required by an explicit markup.

The primary problem has been to find mechanisms which combine clarity and precision with a very wide hospitality to the varying theoretical presuppositions of practicing linguists, as well as the flexibility required to allow use of these markup methods in the widely varying conditions of existing computational practice. These tags are intended to be usable by linguists of any theoretical persuasion for a number of purposes, including the following:

The interspersing of analytic tags with the text is appropriate in situations in which one is interested in a relatively restricted range of linguistic information about the text, such as the categories or parts of speech of its words, and other easily specifiable grammatical information. The separation of analytic markup from text is appropriate in situations in which one is interested in providing detailed linguistic analyses of text, or in which one has available one or more databases of linguistic representations which can be linked to textual material. We see the creation of linguistically analyzed texts and the creation of linguistic databases that can be used for the markup of additional texts as interrelated tasks, which are facilitated by a markup scheme that enables one to factor out the common properties of recurring textual material from the properties that distinguish those occurrences. Some apparently needless verbosity in the tags of this chapter is motivated by the intention to allow this sort of cross-reference to linguistic databases (lexicons, partial grammars, etc., also represented in SGML); elaboration of such links will occupy the relevant committee during the revision and extension of this draft. Finally, the ability to relate the representations of the analyses of complex expressions to those of their parts in an explicit way will make the analyzed texts of use not only for purely linguistic purposes or for purposes of information retrieval, as important as these are, but also for other purposes, such as the design of experiments in human language processing, and for measuring the structural complexity of texts. The special problems posed by lexical and structural ambiguity are discussed further below in section 8.3.

In order to accommodate existing practices in linguistic text markup, and to provide a scheme which can be used by practitioners of any linguistic theory whatever, we refrain from associating specific elements of linguistic theory or description with specific SGML tags, attributes, and attribute values. For example, you will find no tags for noun and verb here. Instead, a more general mechanism is proposed, which allows the expression of linguistic analysis with a small set of tags that represent just the configurations of linguistic structures. To say that a particular expression is a noun, for example, one stipulates that it corresponds to a feature structure (identified by the tag f.struct) which contains a feature (identified by the tag feature), which has a name (identified by f.name) and a value (identified again by the f.struct tag), for example, as follows:

    <f.struct id=sample>
      ...
      <feature>
        <f.name> category </f.name>
        <f.struct> noun </f.struct>
      </feature>
      ...
    </f.struct>

Both category and noun in this illustration are content associated respectively with f.name and f.struct, and nothing in our recommendations requires their use. If certain patterns of tags and content themselves recur in actual practice, then entity definitions can be created for them, to simplify the process of creating the markup, and of interpreting it. Some entity definitions for commonly used parts of speech and other grammatical information are suggested in .
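The nesting of f.struct, feature, and f.name described above can be modeled as a small recursive data structure. The following Python sketch is an illustration only, not part of the recommendations: the class names FStruct and Feature and the function to_markup are hypothetical, and the serializer assumes that an f.struct holds either atomic content or a list of features, as in the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """A feature: a name paired with a value (hypothetical class name)."""
    name: str
    value: "FStruct"


@dataclass
class FStruct:
    """A feature structure: atomic content or a list of features."""
    atom: Optional[str] = None
    features: List[Feature] = field(default_factory=list)
    id: Optional[str] = None


def to_markup(fs: FStruct) -> str:
    """Serialize using only the f.struct, feature, and f.name tags."""
    open_tag = "<f.struct id=%s>" % fs.id if fs.id else "<f.struct>"
    if fs.atom is not None:
        body = fs.atom
    else:
        body = " ".join(
            "<feature> <f.name> %s </f.name> %s </feature>"
            % (f.name, to_markup(f.value))
            for f in fs.features
        )
    return "%s %s </f.struct>" % (open_tag, body)


# "category" and "noun" are ordinary content, as in the illustration above;
# nothing in the tag set itself fixes these names.
sample = FStruct(id="sample",
                 features=[Feature("category", FStruct(atom="noun"))])
print(to_markup(sample))
```

Because the grammatical vocabulary lives entirely in the content, a different theory can substitute its own feature names and values without any change to the classes or to the tags they emit.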

As can be seen from this illustration, the tags for linguistic analysis provide a straightforward means of representing linguistic structures as containing features with values, the latter being either atomic values, feature structures, or pointers to other feature structures. Thus it may seem that, contrary to our espousal of theoretical neutrality, we are favoring those schools of linguistic thought, such as lexical-functional grammar and generalized phrase-structure grammar, that typically represent linguistic structures directly in this way, and disfavoring others, such as government-binding theory and perhaps categorial grammar, which typically do not. Certainly our proposals provide a certain advantage to those who think in terms of features and values; however, the feature-value system of representation is sufficiently general to represent structures which are formulated in different ways, such as forests of trees connected by application of movement transformations, or single trees with associated data structures (such as chains), which represent the history of application of movement transformations. To facilitate the use of our proposals, we have created several special-purpose tag sets to represent specialized kinds of feature structures, such as annotated tree structures and derivation trees for categorial grammar applications. The tag sets we have proposed have also been developed with an eye toward the eventual development of software that would render the markup of linguistic structures graphically, more or less in the manner that linguists now draw them. Finally, we point out that we have also developed a tag set for representing interlinear glossing of text, to provide a direct means of encoding the type of linguistic representations that are currently in most widespread use by linguists.
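To make the claim about generality concrete, the sketch below recasts a small phrase-structure tree as a table of feature structures whose values are either atomic or pointers to other structures. It is an illustration under stated assumptions, not one of the special-purpose tag sets mentioned above: the feature names category, daughters, and form, the pointer convention {"ref": id}, and the identifier scheme n0, n1, ... are all invented for the example.

```python
def tree_to_fs(tree, table, prefix="n"):
    """Store each tree node as a feature structure in `table`; return its id.

    A tree is either a string (a word) or a pair (label, list of subtrees).
    """
    node_id = "%s%d" % (prefix, len(table))
    if isinstance(tree, str):
        table[node_id] = {"form": tree}          # atomic value
        return node_id
    label, children = tree
    table[node_id] = {"category": label}         # reserve the id before recursing
    table[node_id]["daughters"] = [
        {"ref": tree_to_fs(child, table, prefix)}  # pointer to another structure
        for child in children
    ]
    return node_id


def leaves(node_id, table):
    """Follow the pointers from a node and collect the word forms in order."""
    fs = table[node_id]
    if "form" in fs:
        return [fs["form"]]
    words = []
    for pointer in fs["daughters"]:
        words.extend(leaves(pointer["ref"], table))
    return words


table = {}
root = tree_to_fs(("S", [("NP", ["dogs"]), ("VP", ["bark"])]), table)
print(table[root])
print(leaves(root, table))
```

Since daughters are reached through pointers rather than by nesting, two nodes can point to the same structure, which is one way that shared or moved constituents might be recorded.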

We strongly recommend that those who create texts with linguistic analyses for interchange with others document the linguistic ideas which guided the analysis with an appropriate declaration in the declarations area. Moreover, if the analyses are the result of the application of a program, documentation concerning that program (above all, its algorithm) should also be provided, along with an indication of the nature of the human intervention (if any) that was used when the program was run. For these purposes, the linguistic.analysis tag in the encoding declarations area should be used. (See section for a discussion of the encoding declarations area.) This element should typically contain either a prose description of the salient points or a bibliographic reference to a published account of the relevant information. Descriptions of the form "See attached documentation" are discouraged: if the documentation is extensive, it should be available in an accessible published form; if it is brief, it should be incorporated into the machine-readable file.

The tags described here have been carefully designed to allow underspecification of the analysis as well as full specification of the linguistic structure. In practice, one may wish to underspecify the analysis either because it is incomplete and will be worked on further with other tools, or because one does not wish to make any claim either way about some issue. Since they do not themselves embody a specific notion of a complete analysis, the general mechanisms provided here allow underspecification to take a very simple form. No specific item of analysis is required, and so any analysis can be left incomplete by the simple mechanism of leaving out whatever one does not want to specify. Those who desire some more stringent checking of analyses to ensure that they are complete and well-formed in some way are directed to the discussions of SGML in chapters 3 and 9, which make clear what sorts of constraints SGML parsers are in a position to enforce and how to modify the tags here provided to allow tighter constraints to be imposed on an analysis.
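Underspecification by omission can be pictured as a comparison between a partial analysis and a fuller one. The Python sketch below is an illustration only: it represents feature structures as nested dictionaries, and the function name subsumes and the feature names category, agreement, number, and person are assumptions made for the example. The check it performs, that every feature given in the partial analysis appears with a compatible value in the fuller one, corresponds to what the feature-structure literature calls subsumption; it is not a facility of SGML or of the tags described here.

```python
def subsumes(partial, full):
    """Return True if `full` contains everything that `partial` specifies.

    Feature structures are nested dicts; atomic values are plain strings.
    Features that `partial` leaves out place no constraint on `full`.
    """
    if not isinstance(partial, dict):
        return partial == full                 # atomic values must match exactly
    if not isinstance(full, dict):
        return False                           # structure versus atom: incompatible
    return all(
        name in full and subsumes(value, full[name])
        for name, value in partial.items()
    )


# A deliberately incomplete analysis: only the category is stated.
partial = {"category": "noun"}

# A fuller analysis of the same expression.
full = {
    "category": "noun",
    "agreement": {"number": "plural", "person": "3"},
}

print(subsumes(partial, full))
print(subsumes(full, partial))
print(subsumes({"agreement": {"number": "singular"}}, full))
```

An empty analysis subsumes every other one, which mirrors the point that no specific item of analysis is required.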

The flagging and treatment of linguistic anomalies (errors, particularly interesting cases, interference phenomena, etc.) are dealt with in part in an earlier chapter (sec. ). Further consideration of the problem of encoding linguistic anomalies will be taken up by the relevant committee in the cycle of revising and extending this draft.

The general mechanisms for linguistic analysis are described in the next section of this chapter () and are applied to problems of syntax, morphology, and phonology in the remaining sections of the chapter ( through ).