General framework

Feature structures

In encoding linguistic analyses, it is often necessary to represent complex, often recursive relations among the various elements of the analysis. Such relations can be represented by means of feature structures, and it is convenient to do so, not only for purposes of analysis in a great variety of domains, but also for interchanging the results of such analysis. In the sections which deal with phonology, morphology and syntax, feature structures will be presented that are geared specifically towards each of these domains. Analyses not covered by the specific proposals in those sections should be encoded using the declarations presented in this section. For expository reasons, examples of encoding for each subdomain of linguistic analysis will be given in the appropriate subsections.

Feature structures, for our purposes, are bundles or groups of features as linguists commonly use the term. To describe an English noun, for example, one might say that its word-category is noun and its number singular; the feature structure describing this noun would contain the features

    category = noun
    number = singular

or, in a different analysis using slightly different terminology and a different set of features,

    cat = N
    -PLUR
    +SING

The tags described in this section provide a means of expressing these and more complex feature structures. These feature structures should be written with brackets.

Feature structures are marked with the tag f.struct. In the extreme case, feature structures can consist of a single atomic feature value (represented by a string of characters); in general, however, they consist of a feature structure name marked with the f.struct.name tag followed by an arbitrary number of features, each marked with the tag feature. The feature structure name is optional (the examples just given, for example, have no feature structure name), but if a feature structure has no name, it must have at least one feature. Where present, the feature structure name is represented by a string of characters surrounded by the f.struct.name and /f.struct.name tags.

Each feature in turn consists of a feature name (optional, and marked if present with the tag f.name), a specification of the value(s) the feature may assume and an optional restriction on those value(s). In the examples given above, the terms category, number, cat, PLUR, and SING are feature names, while noun, singular, N, -, and + are the corresponding feature values. Simple feature values may be any of the following:

  • the special binary values plus and minus, marked with the empty tags plus and minus,
  • a nested feature structure which itself contains only characters
  • a nested feature structure which contains other features
  • a set of features (represented by the f.set tag)
  • a list of features (represented by the f.list tag)
Note that the feature value is not represented as a simple string of characters; the feature values in the simple examples above are represented as nested feature structures which in turn contain the characters. This underscores the parallelism between simple atomic feature values and internally structured feature values, at the cost of adding extra tags. It is foreseen that the tags of this section may be revised in future to provide a more concise notation.

In addition to simple feature values, one can represent complex Boolean combinations of feature values using the tags f.s.AND, f.s.OR, and f.s.NOT. The f.s.AND and f.s.OR tags contain a series of at least two feature values; the f.s.NOT tag contains exactly one feature value.

Finally, the value of a feature may be specified indirectly by pointing to some other feature structure or feature value. Such feature pointers are represented by an f.ptr tag with an ID reference to the ID of the structure or value being pointed at. Since their only function is to bear the ID reference attribute, feature pointers have no content.

The SGML declarations for the feature structures described above are as follows:

<![ CDATA [
<!doctype ling.analysis [

<!-- Entities -->
<!ENTITY % f.Boolean      "f.s.AND | f.s.OR | f.s.NOT" >
<!ENTITY % f.value.simple "plus | minus | word | f.struct | f.set | f.list | f.ptr" >
<!ENTITY % f.value        "%f.value.simple; | %f.Boolean;" >

<!-- Top-level organization of feature-value specifications -->
<!ELEMENT f.struct      - - (#PCDATA | (f.struct.name, feature*) | feature+ ) >
<!ELEMENT f.struct.name - - (#PCDATA) >
<!ELEMENT feature       - - (f.name?, (%f.value;), f.restriction?) >
<!ELEMENT f.name        - - (#PCDATA) >
<!ELEMENT f.restriction - - (%f.value;) >

<!-- Representations of feature values -->
<!-- Feature values: primitives -->
<!ELEMENT (plus, minus) - O EMPTY >
<!ELEMENT word          - - (#PCDATA) >

<!-- Feature value: pointer to another feature structure or value -->
<!ELEMENT f.ptr         - O EMPTY >
<!ATTLIST f.ptr         target IDREF #REQUIRED >

<!-- Feature values: structured values -->
<!ELEMENT f.set         - - (f.struct | f.set | f.list)+ >
<!ELEMENT f.list        - - (f.struct | f.set | f.list)+ >

<!-- Feature values: Boolean combinations -->
<!ELEMENT f.s.AND       - - ((%f.value;), (%f.value;)+) >
<!ELEMENT f.s.OR        - - ((%f.value;), (%f.value;)+) >
<!ELEMENT f.s.NOT       - - (%f.value;) >

<!-- All feature structures and values can be pointed at with IDref -->
<!ATTLIST (f.struct | %f.value;) ID ID #IMPLIED >
]>
]]>

Using these declarations and tags, the simple examples mentioned at the beginning of this section can be tagged thus:

<![ CDATA [
<f.struct>
  <feature><f.name> category </f.name> <f.struct> noun </f.struct> </feature>
  <feature><f.name> number </f.name> <f.struct> singular </f.struct> </feature>
</f.struct>
]]>

or

<![ CDATA [
<f.struct>
  <feature><f.name> cat </f.name> <f.struct> N </f.struct> </feature>
  <feature><f.name> PLUR </f.name> <minus> </feature>
  <feature><f.name> SING </f.name> <plus> </feature>
</f.struct>
]]>
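As an illustration of how markup of this kind might be generated, the following Python sketch serializes a mapping of feature names to values into the tags of this section. The function names to_fstruct and _value, and the convention of using Python booleans for plus and minus, are assumptions made for this example only; they are not part of the declarations.

```python
def to_fstruct(features, name=None):
    """Serialize a mapping of feature names to values as f.struct markup.

    Assumed conventions (illustrative, not part of the declarations):
    True/False stand for the empty tags <plus>/<minus>, a str is an
    atomic value wrapped in a nested f.struct, and a dict is a nested
    feature structure.
    """
    if name is None and not features:
        # An unnamed feature structure must have at least one feature.
        raise ValueError("unnamed feature structure needs a feature")
    parts = ["<f.struct>"]
    if name is not None:
        parts.append(f"<f.struct.name>{name}</f.struct.name>")
    for fname, value in features.items():
        parts.append(
            f"<feature><f.name>{fname}</f.name>{_value(value)}</feature>"
        )
    parts.append("</f.struct>")
    return "".join(parts)


def _value(value):
    # plus and minus are EMPTY elements, so they take no end tag.
    if value is True:
        return "<plus>"
    if value is False:
        return "<minus>"
    if isinstance(value, dict):
        return to_fstruct(value)
    return f"<f.struct>{value}</f.struct>"


if __name__ == "__main__":
    print(to_fstruct({"cat": "N", "PLUR": False, "SING": True}))
```

Run on the second example, the sketch produces the same tagging as shown above, apart from white space.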

Tree structures

Relationships among linguistic elements are often represented graphically in the form of a tree (or sets of trees). While trees can easily be translated into feature structures and hence be represented in terms of the SGML declarations given above, they are a sufficiently common form of representation to warrant a document type declaration of their own.

A tree consists of zero or more names, a single node (the root of the tree), and an arbitrarily large set of subtrees of that node. This definition allows trees to consist of a single node without any subtrees; this case must be allowed since there are formalisms, e.g. Tree Adjoining Grammar, which require such degenerate trees. Trees, tree names, and nodes are tagged tree, tree.name, and node, respectively. Individual subtrees are tagged as trees, terminals, or else pointers to trees; all subtrees of a tree are grouped together using the subtrees tag. Nodes are defined as feature structures in the sense of the preceding section, allowing them to have very rich and articulated internal structure. When the tags of this section are used, the parent-child relationship among nodes should be expressed by nesting the child nodes as subtrees within the parent's tree, and not by nesting them as feature structures within the parent node's feature structure. The two notations should not be combined to express a distinction: the meaning of markup which uses both the subtrees tag and nested feature structures to express parent-child relationships is not defined. Groups of trees (forests) can be marked using the tag forest, which groups together one or more trees.

The children (subtrees) of a node may be trees, terminal nodes, or pointers to some other element in the analysis. Finally, a single node can be associated with multiple sets of subtrees in order to allow multiple analyses. Each set of subtrees in a tree represents the subtrees of that tree's root in some one analysis. Multiple groups of subtrees can also be used to describe ambiguities such as indefinite attachment of prepositional phrases. If a given subtree (e.g. one representing the prepositional phrase with a telescope in the sentence I saw the man with a telescope) can be attached in either of two places (here: to the node describing saw or to the node describing man), then each potential place of attachment can be given two sets of subtrees, one with and one without the disputed subtree. These subtrees might use tree pointers to avoid specifying the disputed subtree twice in full. Note that at present no mechanism is provided to distinguish the preferred place of attachment. The class attribute on the subtrees tag should be used to identify the analysis which leads to that grouping; similar values for this attribute can be used to allow a single tree to be traced through many nodes. No specific convention for associating various analyses by way of the class attribute is provided here. Such conventions are the responsibility of the application program and should be documented in the encoding declarations area.

The SGML declarations necessary for representing trees (and forests) are given below:

<![ CDATA [
[
<!--Entities and elements from Section 8.2.1.1 not repeated here -->
<!--for reasons of space -->

<!--Declarations relating specifically to representation of trees -->
<!ENTITY % node     "f.struct+" >
<!ELEMENT forest    - - (tree+) >
<!ELEMENT tree      - - (tree.name*, %node;, subtrees*) >
<!ELEMENT tree.name - - (#PCDATA) >
<!ELEMENT subtrees  - - (tree | terminal | n.ptr)* >
<!ELEMENT terminal  - - (#PCDATA) >

<!-- Feature value: pointer to another element -->
<!ELEMENT n.ptr     - O EMPTY >
<!ATTLIST n.ptr     target IDREF #REQUIRED >
]>
]]>

It has been feared that these declarations don't provide a solution to the PP attachment problem. I believe they do, except for lacking a way of expressing preference. Linguists, please examine and test. -Ed.

As an example, consider the tree structure which might be assigned to the sentence I saw the man with the telescope:

        saw
       / | \
      I man with
         |     \
        the  telescope
                  \
                  the

This might be expressed in markup as shown below. For simplicity, we omit all analysis internal to the nodes themselves, letting them be feature structures which contain simply the surface form of the word.

<![ CDATA [
<tree>
  <tree.name>Sample Tree:</tree.name>
  <tree.name>I saw the man with the telescope</tree.name>
  <f.struct> saw </f.struct>
  <subtrees>
    <tree><f.struct> I </f.struct></tree>
    <tree>
      <f.struct> man </f.struct>
      <subtrees>
        <tree><f.struct> the </f.struct></tree>
      </subtrees>
    </tree>
    <tree>
      <f.struct> with </f.struct>
      <subtrees>
        <tree>
          <f.struct> telescope </f.struct>
          <subtrees>
            <tree><f.struct> the </f.struct></tree>
          </subtrees>
        </tree>
      </subtrees>
    </tree>
  </subtrees>
</tree>
]]>

If the sentence's structural ambiguity is to be expressed, the node for saw can be given two sets of subtrees, one in which the tree for with the telescope attaches to saw and one in which it attaches to man. Pointers are used to avoid repeating subtrees in full:

<![ CDATA [
<tree id=t1>
  <tree.name>Sample Tree:</tree.name>
  <tree.name>I saw the man with the telescope</tree.name>
  <f.struct> saw </f.struct>
  <subtrees class='saw with the telescope'>
    <tree id=t11><f.struct> I </f.struct></tree>
    <tree id=t12>
      <f.struct> man </f.struct>
      <subtrees>
        <tree id=t121><f.struct> the </f.struct></tree>
      </subtrees>
    </tree>
    <tree id=t13>
      <f.struct> with </f.struct>
      <subtrees>
        <tree id=t131>
          <f.struct> telescope </f.struct>
          <subtrees>
            <tree><f.struct> the </f.struct></tree>
          </subtrees>
        </tree>
      </subtrees>
    </tree>
  </subtrees>
  <subtrees class='man with the telescope'>
    <n.ptr target=t11>
    <tree id=t14>
      <f.struct> man </f.struct>
      <subtrees>
        <tree id=t141><f.struct> the </f.struct></tree>
        <n.ptr target=t13>
      </subtrees>
    </tree>
  </subtrees>
</tree>
]]>
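The idea of associating several alternative sets of subtrees with one node can also be sketched outside SGML. The following Python fragment is a minimal illustration only: the class Tree, the function bracket, and the use of a dictionary keyed by the class value are all assumptions of this sketch, not part of the declarations. Unlike the markup above, it attaches both alternatives directly to a single man node, and it shares one Python object where the markup would use a pointer.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """A node label plus zero or more alternative sets of subtrees.

    subtree_sets maps a class value (as on the subtrees tag) to a list
    of children; a set shared by all analyses uses the key None.
    """
    node: str
    subtree_sets: dict = field(default_factory=dict)


def bracket(tree, analysis):
    """Render the reading selected by `analysis` as a bracketed string."""
    sets = tree.subtree_sets
    # Prefer the set labelled with the requested analysis; otherwise
    # fall back on the unlabelled set, if there is one.
    children = sets.get(analysis, sets.get(None, []))
    if not children:
        return tree.node
    inner = " ".join(bracket(child, analysis) for child in children)
    return f"[{tree.node} {inner}]"


# The disputed prepositional phrase is built once and shared.
pp = Tree("with", {None: [Tree("telescope", {None: [Tree("the")]})]})
subject = Tree("I")
man = Tree("man", {
    "saw with the telescope": [Tree("the")],
    "man with the telescope": [Tree("the"), pp],
})
saw = Tree("saw", {
    "saw with the telescope": [subject, man, pp],
    "man with the telescope": [subject, man],
})

if __name__ == "__main__":
    for reading in ("saw with the telescope", "man with the telescope"):
        print(bracket(saw, reading))
```

Selecting a class value at the root and following the same value through every node is one possible convention for tracing a single analysis; as noted above, no such convention is prescribed here.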

Alignment of Multiple Analyses

In addition to representing isolated linguistic analyses, it is often necessary to represent multiple analyses of the same text and relate them to each other, a task referred to in what follows as alignment. The analyses in question may be

  • at distinct levels of representation, as when a phonological transcription or the syntactic representation of a text is to be related to its orthographic transcription
  • at the same level of representation, as in the case of structural ambiguity, where two or more syntactic analyses are to be related to the same input string.
Linguists, anthropologists, literary scholars, and others who deal with large corpora of foreign-language text have traditionally used interlinear annotation as a mechanism both for developing and for presenting the analysis (including literal translation) of running text. To deal with the needs of this kind of analysis, we propose a single, recursive markup element for annotated units, which provides implicit alignment for different levels of analysis. Such implicit alignment of levels can be used if each level of analysis divides the text (or another level of analysis) into identical series of segments, or into segments which nest cleanly.

The alignment must be made explicit whenever the different levels of analysis require:

  • incompatible segmentations of the text (crossing segments, non-nesting segments)
  • re-ordering of the segments of the text
  • discontiguous segments (e.g. the separable prefixes of German verbs) which must be re-combined at some level of analysis
Accordingly, methods for implicit and explicit alignment of multiple analyses are provided in this section. Any markup using implicit alignment can be rewritten using explicit alignment.
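The first of the conditions listed above, incompatible segmentation, can be tested mechanically. The following Python sketch is an illustration under stated assumptions: segments are represented as half-open (start, end) character offsets into the base text, and the function names crosses and nests_cleanly are invented for this example. It does not detect re-ordering or discontiguous segments.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a0, a1), (b0, b1) = a, b
    overlap = a0 < b1 and b0 < a1
    nested = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
    return overlap and not nested


def nests_cleanly(level_a, level_b):
    """True if no segment of one level crosses a segment of the other."""
    return not any(crosses(a, b) for a in level_a for b in level_b)


if __name__ == "__main__":
    # "He won't": words versus a finer segmentation into "wo" + "n't".
    words = [(0, 2), (3, 8)]
    morphs = [(0, 2), (3, 5), (5, 8)]
    print(nests_cleanly(words, morphs))

    # Two segments which overlap only partially.
    print(nests_cleanly([(0, 5)], [(3, 8)]))
```

When nests_cleanly returns a false value for two levels, implicit alignment cannot represent them and the explicit mechanism is needed.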

Implicit Alignment for Multiple Analyses

Analyses may be aligned implicitly by treating the running text as a simple series of units, each unit containing one or more levels of content. Typically the levels of content are a base form and any number of annotations of that base form; the contents of the unit at a given level will typically be either a simple text string or a series of (nested) units at the next lower level of analysis. Annotation levels can attach either to the base or to another annotation level.

Each level of content may optionally be described with a type attribute which describes what type of analysis it contains. This attribute can be any text string; typical values would include original transcription, retranscription, word-by-word gloss, allomorphic transcription (i.e. a transcription which indicates morpheme boundaries as cuts within the surface form of a word), morphemic representation, and so on. (Compare the discussion of parallel texts in chapter 6. The unit structure could be used for parallel texts; the values of type might in that case be sigla for the various versions of the text.) Like all other tags, both unit and level may have an ID attribute which assigns a unique identifier to the element. Optionally, any level of annotation may point with a base attribute to the level of content on which it is based. (For example, a morphemic-representation level might point at the allomorphic-representation level, which in turn points at the orthographic level.)

This definition of annotated unit is kept intentionally general. Every individual analyst is likely to want to use a different scheme of analysis, involving different kinds of units and involving different sets of annotations (likely to include completely novel annotations) even when the same kinds of units are used. In view of this the proposed markup scheme imposes no constraints on the specific content of the analysis. The type attribute is provided to allow the user to encode information about the semantic structure of the analysis. Application software could use the type values to process the analyzed data in accordance with that semantic structure. For instance, an editor might use the type of a unit to constrain the types and relative order of its annotations. A formatter could use the annotation types to select font parameters; it would use unit types to select interlinear alignment (for the annotations of low-level units) versus synchronization in parallel columns (for the annotations of high-level units).

The SGML declarations required for the elements described here are:

<![CDATA[
<!ELEMENT unit  - - (level+) >
<!ELEMENT level - O (#PCDATA | unit+) >
<!ATTLIST (unit, level)
          type  CDATA  #IMPLIED
          id    ID     #IMPLIED
          base  IDREF  #CURRENT >
]]>

Strictly speaking, of course, it is desirable to allow all the phrase-level tags described elsewhere in these guidelines to appear within units and levels; the formal document type declarations in the appendix allow anything to occur within a level element which can occur within a paragraph.

Consider, for example, the following simple analysis of a sentence:

    I     saw   the      man      with   the   telescope   .
    PN    VBD   ATD      NN       IN     ATD   NN          .
    Sub   Vb    direct object     prepositional phrase

(It should be stressed that this example is intended to demonstrate the mechanisms of this section in generally comprehensible terms, and not as a model analysis of this sentence.)

This can be represented using the tags of this section as shown below. Note that when different levels of description segment the text differently, dummy levels are introduced to group together all the levels with the finer segmentation.

<![ CDATA [
<unit type=sentence>
  <level type=dummy>
    <unit type=word>
      <level type=ort>I
      <level type=LOB>PN
      <level type=syn>Sub
    </unit>
    <unit type=word>
      <level type=ort>saw
      <level type=LOB>VBD
      <level type=syn>verb
    </unit>
    <unit type=phrase>
      <level type=dummy>
        <unit type=word>
          <level type=ort>the
          <level type=LOB>ATD
        </unit>
        <unit type=word>
          <level type=ort>man
          <level type=LOB>NN
        </unit>
      <level type=syn>direct object
    </unit>
    <unit type=phrase>
      <level type=dummy>
        <unit type=word>
          <level type=ort>with
          <level type=LOB>IN
        </unit>
        <unit type=word>
          <level type=ort>the
          <level type=LOB>ATD
        </unit>
        <unit type=word>
          <level type=ort>telescope
          <level type=LOB>NN
        </unit>
      <level type=syn>prepositional phrase
    </unit>
</unit>
]]>

In general, material tagged in this manner can be translated mechanically into the more complex feature-structure tagging presented earlier. One simple method for such a transformation is:

  1. start and end tags for unit become start and end tags for f.struct.
  2. a type attribute on a unit tag becomes a f.struct.name; its value becomes the content of the f.struct.name element.
  3. start and end tags for level become start and end tags for feature.
  4. a type attribute on a level tag becomes a f.name; its value becomes the content of the f.name element.
  5. if a level element contains character data, its content is enclosed in start and end tags for a f.struct element.
When processed by such an algorithm, the tagging above becomes:

<![ CDATA [
<f.struct><f.struct.name>sentence</f.struct.name>
  <feature><f.name>dummy</f.name>
    <f.struct><f.struct.name>word</f.struct.name>
      <feature><f.name>ort</f.name><f.struct>I</f.struct></feature>
      <feature><f.name>LOB</f.name><f.struct>PN</f.struct></feature>
      <feature><f.name>syn</f.name><f.struct>Sub</f.struct></feature>
    </f.struct>
    <f.struct><f.struct.name>word</f.struct.name>
      <feature><f.name>ort</f.name><f.struct>saw</f.struct></feature>
      <feature><f.name>LOB</f.name><f.struct>VBD</f.struct></feature>
      <feature><f.name>syn</f.name><f.struct>verb</f.struct></feature>
    </f.struct>
    <f.struct><f.struct.name>phrase</f.struct.name>
      <feature><f.name>dummy</f.name>
        <f.struct><f.struct.name>word</f.struct.name>
          <feature><f.name>ort</f.name><f.struct>the</f.struct></feature>
          <feature><f.name>LOB</f.name><f.struct>ATD</f.struct></feature>
        </f.struct>
        <f.struct><f.struct.name>word</f.struct.name>
          <feature><f.name>ort</f.name><f.struct>man</f.struct></feature>
          <feature><f.name>LOB</f.name><f.struct>NN</f.struct></feature>
        </f.struct>
      </feature>
      <feature><f.name>syn</f.name><f.struct>direct object</f.struct></feature>
    </f.struct>
    <f.struct><f.struct.name>phrase</f.struct.name>
      <feature><f.name>dummy</f.name>
        <f.struct><f.struct.name>word</f.struct.name>
          <feature><f.name>ort</f.name><f.struct>with</f.struct></feature>
          <feature><f.name>LOB</f.name><f.struct>IN</f.struct></feature>
        </f.struct>
        <f.struct><f.struct.name>word</f.struct.name>
          <feature><f.name>ort</f.name><f.struct>the</f.struct></feature>
          <feature><f.name>LOB</f.name><f.struct>ATD</f.struct></feature>
        </f.struct>
        <f.struct><f.struct.name>word</f.struct.name>
          <feature><f.name>ort</f.name><f.struct>telescope</f.struct></feature>
          <feature><f.name>LOB</f.name><f.struct>NN</f.struct></feature>
        </f.struct>
      </feature>
      <feature><f.name>syn</f.name><f.struct>prepositional phrase</f.struct></feature>
    </f.struct>
  </feature>
</f.struct>
]]>
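The five steps listed above might be implemented along the following lines. This Python sketch is not a definitive implementation: it assumes the unit/level markup has already been parsed into the hypothetical classes Unit and Level (parsing SGML with omitted end tags is outside its scope), and it uses the element name f.struct.name from the declarations given earlier for the feature structure name.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Level:
    type: Optional[str]                 # value of the type attribute, if any
    content: Union[str, List["Unit"]]   # character data or nested units


@dataclass
class Unit:
    type: Optional[str]
    levels: List[Level]


def unit_to_fstruct(unit):
    """Translate one annotated unit into feature-structure markup."""
    out = ["<f.struct>"]                                    # step 1
    if unit.type is not None:                               # step 2
        out.append(f"<f.struct.name>{unit.type}</f.struct.name>")
    for level in unit.levels:
        out.append("<feature>")                             # step 3
        if level.type is not None:                          # step 4
            out.append(f"<f.name>{level.type}</f.name>")
        if isinstance(level.content, str):                  # step 5
            out.append(f"<f.struct>{level.content}</f.struct>")
        else:
            # Nested units are translated recursively.
            out.extend(unit_to_fstruct(u) for u in level.content)
        out.append("</feature>")                            # step 3
    out.append("</f.struct>")                               # step 1
    return "".join(out)


if __name__ == "__main__":
    phrase = Unit("phrase", [
        Level("dummy", [
            Unit("word", [Level("ort", "the"), Level("LOB", "ATD")]),
            Unit("word", [Level("ort", "man"), Level("LOB", "NN")]),
        ]),
        Level("syn", "direct object"),
    ])
    print(unit_to_fstruct(phrase))
```

The sketch emits no white space between tags; indentation such as that used in the examples of this section would have to be added separately.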

Explicit Alignment of Multiple Analyses

Where analyses differ in their decomposition or ordering of the base text, implicit alignment of the analyses is not possible. In such cases, the alignment must be made explicit. The data structures described in this section allow specification of alignment between arbitrary sets of linguistic analyses with a very simple mechanism. An alignment consists of a series of alignment maps, which in turn each consist of at least two pointers to elements from the analyses being aligned. Normally, two different analyses will be involved, but alignments which link different parts of a single analysis to itself are legal. It is the responsibility of the encoder to specify what such an alignment might mean. Each pointer can be:

  • a simple reference index pointing at an element in an analysis,
  • a list of such indices, or
  • a pair of indices which define a range of elements.
The necessary SGML declarations are as follows:

<![ CDATA [
<!ELEMENT alignment - - (al.map)+ >
<!ELEMENT al.map    - - ((al.ptr | al.list | al.range),
                         (al.ptr | al.list | al.range)+) >
<!ELEMENT al.ptr    - O EMPTY >
<!ATTLIST al.ptr    id   IDREF #REQUIRED >
<!ELEMENT al.list   - - (al.ptr+) >
<!ELEMENT al.range  - O EMPTY >
<!ATTLIST al.range  from IDREF #REQUIRED
                    to   IDREF #REQUIRED >
]]>

The alignment of three distinct syntactic analyses of the sentence He won't hang it up. is illustrated below. (The analyses are intended to serve as illustrations only and have no theoretical status.) The first representation is simply the input text, including punctuation. The indices under the segments in the representation will be referred to in the alignment map.

    A:  He  won't  hang  it  up  .
        --  -----  ----  --  --  --
        A1  A2     A3    A4  A5  A6

The second representation differs from the first in that the contraction won't is split into a sequence of wo and the negative morpheme n't.

    B:  He  wo  n't  hang  it  up  .
        --  --  ---  ----  --  --  --
        B1  B2  B3   B4    B5  B6  B7

Finally, in the third representation, the contraction won't is represented as a sequence of the two words will and not, and the particle verb hang up is represented both as a two-word sequence and as a single lexeme.

    C:  He  will  not  hang up  it
        --  ----  ---  -------  --
        C1  C2    C3   C4       C5
                       ---- --
                       C6   C7

The SGML encoding of these analyses, without yet taking into account the alignments among them, is shown below. The tags sent, w, seg and lex are illustrative labels only and have no standing within the standard.

The segmentation of the orthographic form here should be redone using existing tags from chapter 6.11, if possible. If that is not possible, then we need some extensions. -Ed.

<![ CDATA [
<sent>
  <A>
    <w id = A1> He </w>
    <w id = A2> won't </w>
    <w id = A3> hang </w>
    <w id = A4> it </w>
    <w id = A5> up </w>
    <w id = A6> . </w>
  </A>
  <B>
    <seg id = B1> He </seg>
    <seg id = B2> wo </seg>
    <seg id = B3> n't </seg>
    <seg id = B4> hang </seg>
    <seg id = B5> it </seg>
    <seg id = B6> up </seg>
    <seg id = B7> . </seg>
  </B>
  <C>
    <lex id = C1> He </lex>
    <lex id = C2> will </lex>
    <lex id = C3> not </lex>
    <lex id = C4>
      <lex id = C6> hang </lex>
      <lex id = C7> up </lex>
    </lex>
    <lex id = C5> it </lex>
  </C>
</sent>
]]>

The alignment among the different levels of analysis is given below. As the example shows, alignment need not be fully specified, and it can be declared separately from the encoding of the analyses to which it refers.
<![ CDATA [
<alignment>
  <al.map>
    <al.ptr id = A1>
    <al.ptr id = B1>
  </al.map>
  <al.map>
    <al.ptr id = A2>
    <al.list> <al.ptr id = B2> <al.ptr id = B3> </al.list>
    <al.list> <al.ptr id = C2> <al.ptr id = C3> </al.list>
  </al.map>
  <al.map>
    <al.list> <al.ptr id = A3> <al.ptr id = A5> </al.list>
    <al.list> <al.ptr id = B4> <al.ptr id = B6> </al.list>
    <al.ptr id = C4>
  </al.map>
  <al.map>
    <al.range from = A1 to = A3>
    <al.range from = B1 to = B4>
    <al.range from = C1 to = C6>
  </al.map>
</alignment>
]]>
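An application reading such an alignment has to expand each pointer into the elements it designates. The following Python sketch shows one way this might be done. It rests on assumptions which the text does not settle: a range is taken to mean every element from the first index to the second, inclusive, in document order, and the function name expand and the Python representation of the three pointer kinds are invented for this example.

```python
def expand(pointer, order):
    """Expand a pointer into the list of element IDs it designates.

    pointer is a str (a single reference index), a list of str (an
    al.list), or a 2-tuple of str (an al.range); order lists the IDs of
    one analysis in document order.
    """
    if isinstance(pointer, str):
        return [pointer]
    if isinstance(pointer, tuple):
        start = order.index(pointer[0])
        end = order.index(pointer[1])
        if end < start:
            raise ValueError("range end precedes range start")
        # Assumed reading: both endpoints are included.
        return order[start:end + 1]
    return list(pointer)


if __name__ == "__main__":
    A = ["A1", "A2", "A3", "A4", "A5", "A6"]
    B = ["B1", "B2", "B3", "B4", "B5", "B6", "B7"]

    # The last alignment map in the example: A1..A3 against B1..B4.
    print(expand(("A1", "A3"), A), expand(("B1", "B4"), B))

    # The discontinuous particle verb: a list of two indices.
    print(expand(["A3", "A5"], A))
```

How a range should be expanded when its elements nest, as with C4, C6 and C7 in the third analysis, is left open by this sketch.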