Minutes of the Dictionary Work Group (TEI AI 5)

C. M. Sperberg-McQueen

Document Number: TEI AI5 M5

October 18, 1991 (14:20:18)

Draft October 18, 1991 (14:20:18)

The meeting convened at 9:45 Wednesday, 2 October 1991. Those present were:

   Robert Amsler (RA), chairing
   Nicoletta Calzolari (NC)
   Louise Guthrie (LG)
   Nancy Ide (NI)
   Frank Tompa (FWT)
   Carol Van Ess-Dykema (CVED)

1 AGENDA

The group agreed to the following preliminary agenda:

1. review of Cycle 2 deadlines
2. notes from last SC meeting esp. on documents to be produced (RA)
3. review of Pisa minutes (NC, NI)
4. specify precise targets for the day
5. as specified in the preceding item

2 REVIEW OF CYCLE 2 SCHEDULE AND DEADLINES

MSM, working backwards, summarized the schedule for the remainder of the second development cycle thus:

   Jun 1992: finished First Edition ("final Guidelines")
   May 1992: TEI P3 mailed to Advisory Board
   Mar 1992: input cutoff for TEI P3
   Jan 1992: TEI P2 mailed to AB and public
   30 Nov 1991 (end): input cutoff for P2
   15-18 Nov 1991: joint TR/AI meeting
   8 Nov 1991: deadline for docs for TR/AI meeting

3 NOTES FROM SC MEETING ON DOCUMENTS

MSM described the current plans for TEI publications: a series of tutorials (specialized for different audiences), reference manual, and case book. The reference manual will contain a prose specification of the encoding scheme, an alphabetic reference list of tags and attributes, and formal DTDs. The case book will contain big examples. This WG must produce the relevant section for the prose specification, and plan to work on examples for the case book.

RA suggested that in addition to the tutorial for dictionaries, he could imagine a kind of step-by-step guided tour of the dictionary standard, and finally a sort of case book like the Chicago Manual of Style--it seems to have something for everything you encounter, described in very small paragraphs.

4 REPORTING OF PISA MINUTES

NC reported on the meeting held in Pisa in August (the minutes are document AI5 M4). She recalled the decision in Tempe to have internal subgroups within AI 5 for work on monolingual and polyglot dictionaries. NC and SW thought they could work better together rather than individually, and so suggested an informal meeting; SW suggested NI be included since she too was geographically proximate.

CVED noted that she would have liked to have been notified of this meeting, in order to have been able to provide input. NC apologized for the oversight. MSM noted that in general, the TEI is experiencing more need for bureaucratic routinization of meetings (scheduling, notification, and distribution of documents) than had been anticipated, and suggested that the discussion of the organizational aspects of the Pisa meeting be curtailed.

RA said that a crucial outcome of the Pisa meeting appeared to be the explicit realization that multiple communities are interested in dictionary encoding. If this had been explicitly recognized six months ago, he suggested, it might have proven desirable to split the work group into separate groups for the different communities. Given the single work group, the current task must be to aim at the common core which everyone can agree on. Different communities may want various kinds of extensions to this core, extensions which may not be compatible, and most of which may involve trying to make various kinds of things explicit which are implicit in the print dictionary or in the common core.

NI said that the major outcome was to make explicit the hitherto implicit knowledge that two different objects are involved in dictionary encoding. Some possible users are interested in the printed page (as produced or as to be produced); others are interested only in the informational content of the dictionary. Those interested in the printed page might wish to retain the specific form (e.g.
"delay, -ed, -ing") in which the printed dictionary indicates inflectional information, while those interested in information content might wish to render the same information in a form more convenient for their own processing (e.g. "delay, delayed, delaying"). FWT observed that even when one is inter- ested in having the encoded information, one may find oneself unable to interpret the printed string with enough certainty. RA agreed that many disputes could arise over the implicit meanings of fonts, etc., in all content markup, and urged that the group first try to reach a bare mini- mum level. The scoping of information might not be explicit at that minimum level. Using a chart (not reproduced here), NI resumed the dichotomy between: * printed form (rectangle) * encoded information content (oval) The communities interested in these would include: * philologists, interested in the printed book, but may also want information content * computational linguists, who typically want the encoded information content * publishers, who want the information content in some database so they can in future produce print dictionaries (historically, pub- lishers are beginning now with print dictionaries) RA added the category of * dictionary users, who want to be able to retrieve as in a database but have the results look like the printed book LG distinguished between syntactic content and semantic content of dictionaries. The IBM dictionary research group, for example, reverses ellipses in definitions, etc. -- that's a sort of syntactic informa- tion. Others want to find hypernyms and record them and similar infor- mation. That's semantic information. The group's consensus was that the current need is to work with machine-readable (and "machine- tractable") dictionaries; lexical knowledge bases are beyond the group's current scope. CVED urged that the group simply begin with what everyone has in com- mon and allow extensions, possibly separate, for the different communi- ties. 
NI distinguished four kinds of document grammars for dictionaries:

1. bottom-level (leaf) tags only (rather like Amsler/Tompa: nothing nests except that everything nests within main entry)--this might be called a flat or tiled DTD.(1)

2. leaf tags with groups and subentries (this falls at the other extreme and is almost as anarchic as the Amsler/Tompa proposal). This style of document grammar resembles that proposed in the Pisa minutes (AI5 M4). Subentries can contain almost anything; if one dictionary puts a usage note at one level and another puts it at a different level, the encoder can put it anywhere. There are groups, but each may have virtually any structure. Additionally, NI believed one might be able to define some groups like FORM and GRAM which are in fact tightly constrained. The DTD for this deeply nesting encoding might look thus:

3. a further constrained DTD, which could serve as a template for a specific dictionary, providing much tighter validation than the two anarchic document grammars. The work group could even provide an example.

The group digressed to discuss the inheritance proposals of the Pisa minutes. Such inheritance cannot be specified explicitly in SGML itself, but it is legitimate for the TEI (and thus for AI5) to specify the semantics of nesting this way if we like.

RA suggested that different syntactic classes of tags correspond neatly to the different levels of recommendation: the required tags are leaf tags; grouping tags, which represent an attempt to capture more of the information structure, are recommended but not required; finally there will be tags for specialized information which some but not all communities will want to encode.

FWT agreed on the distinction between leaf tags, marking elements within which no other elements can nest, and grouping tags, within which primarily or exclusively leaf elements will nest.
He disagreed with RA's identification of leaf tags as required and grouping tags as recommended, and denied that for any string in a dictionary there would always be a single leaf tag which applies: etymology, for example, cannot be encoded solely with leaf tags. There was no consensus on this point.

RA suggested that grouping elements should be provided with more or less canonical definitions. For example "Etymology usually contains a, b, c, d. A given dictionary may have defective etymologies, but if a, b, c occur they should be tagged as etymology." FWT disagreed, arguing that all grouping tags should allow any leaf tag and any other grouping tag within them.

MSM asked what RA and FWT meant by required. Did they mean the tag must appear in the document? That the tag must be used when applicable? Or that the tag is recommended to be used when applicable for new documents? He observed that implicitly or explicitly the TEI already distinguished several levels of recommendation and requirement for elements and their tags:

* required (document is not TEI conformant without this element)
* mandatory when applicable (bad faith if not tagged)
* recommended (should be used in new encodings unless there is a good reason not to)
* optional (use if you like)

The work group urged the explicit creation of a category of conditionally required / mandatory: required or mandatory when applicable if a given document type is used. This could be merged with mandatory when applicable or made separate.

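The levels of recommendation listed above can be modelled as a small data structure. The following is a minimal Python sketch and not part of the minutes: the tag names in the table are hypothetical placeholders, and only the "required" level is checked, on the assumption that "mandatory when applicable" depends on a judgment about applicability which a simple inventory of tags cannot make.

```python
from enum import Enum


class Level(Enum):
    """Levels of recommendation as listed by MSM."""
    REQUIRED = "required"
    MANDATORY_WHEN_APPLICABLE = "mandatory when applicable"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


# Hypothetical tag table: these names and assignments are illustrative only,
# not decisions of the work group.
TAG_LEVELS = {
    "entry": Level.REQUIRED,
    "orth": Level.MANDATORY_WHEN_APPLICABLE,
    "etym": Level.RECOMMENDED,
    "usage": Level.OPTIONAL,
}


def missing_required(tags_used, table=TAG_LEVELS):
    """Return, in sorted order, the required tags absent from tags_used."""
    return sorted(
        tag for tag, level in table.items()
        if level is Level.REQUIRED and tag not in tags_used
    )
```

Under these assumptions, `missing_required({"orth"})` reports `entry` as missing, while a document using `entry` and `etym` reports nothing.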
5 TARGETS OF THE DAY'S WORK

NI suggested that since consensus had been achieved on the overall schema, the goals for the day should be:

* devise DTD for anarchic scheme without nesting (flat DTD)
* devise DTD for anarchic scheme with arbitrary nesting
* devise DTD for constrained groups for use in nesting DTD
* think hard about DTD modification to ensure that one can actually make more constrained DTDs

RA suggested also:

* think hard about specific leaf tags and decide how to determine whether a given thing should be made a group or not.
* specify how to decide which group to use, if the elements present could appear in several.
* bear in mind the need for tags applicable to lots of strange animals (what if something is both syllabification and spelling? it will be necessary to generalize both from syllabification and from orthographic form.)

MSM proposed

* agree to distinguish atomic, group, and ENTRY tag
* construct DTDs with abstractions
* make inventory of tags in each abstraction

No explicit decision was made, but the discussion immediately passed over to consideration of the DTD structure as suggested by NI and MSM.

6 DTD STRUCTURE

FWT proposed the following basic DTD:

If this is accepted, FWT continued, the invention of individual tags would simply be the process of populating the following table:

              Required     Recommended     Optional
   (atoms)    (some)       (lots)          (a few)
   (groups)                (many)          (maybe some)

NI objected to FWT's DTD fragment, urging that the subentry element needs to have a separate definition. FWT suggested that it be treated like any other group and allow all groups to have the properties assigned to it by the Pisa minutes. NI argued that eliminating the distinction between subentries and other groups would lose some good properties of the subentry tag and make it unclear how the tagging is to work. FWT suggested keeping the distinction in the examples and the prose, without reifying it in the DTD.

7 TARGETS OF THE DAY'S WORK (RESUMED)

The group then agreed on the following goals for the afternoon:

* decide whether the subentry element requires a separate definition
* clarify the relation between print and information content
* populate the tag table
* decide whether to make one DTD for all or several for different communities
* specify documents to draft, assign them to individuals
* agree on date of next meeting, if any

[At this point the group broke for lunch.]

8 PRINT AND INFORMATION CONTENT

After lunch, the group discussed the different types of encoding possibly needed by different communities, distinguishing:

* fidelity to the information content of the dictionary
* fidelity to all details on the printed page
* fidelity to the linear sequence of characters and their presentation (fonts etc.) (This differs from the preceding as input to a typesetting program may differ from its typeset output.)

Alternative terminology was discussed without final decision:

(1) typographic fidelity (which preserves all decisions made by the typesetter, even those another typesetter might make differently) vs. copy-editor fidelity (which preserves all decisions made by the copy editor, even those which might be made differently by a different copy editor or by the same editor using a different style sheet) vs. editorial fidelity (which preserves all decisions made by the subject editor or author, but not necessarily those of the copy editor or typesetter).

(2) layout fidelity vs. linear fidelity vs. abstract-content fidelity.

(3) pixel fidelity vs. character linear fidelity vs. symbolic linear fidelity.

9 ONE DTD OR MANY?

FWT suggested that any single DTD capable of handling the print forms of a variety of dictionaries would necessarily also handle the encodings of interest to all of these communities. This can be done by accepting NI's model: one DTD for everyone, with constraints to be added if one wishes to limit things. This was accepted by consensus.

MSM observed that the DTD fragment proposed by FWT would allow both deeply nesting and non-nesting encodings of dictionaries, and thus there was no need to distinguish a "flat" DTD from a nesting DTD.

10 INHERITANCE MODEL

FWT elaborated the proposal for special semantics for the <se> (subentry) element (hereinafter the "SE model") on the basis of this example:

   <e>
     <se id=se1>
       <x><y>...</y><z>...</z></x>
       <se id=se12><y>...</y><w>...</w></se>
       <se id=se13>...</se>
     </se>
     <se id=se2>
       <x>...</x>
       <se id=se21>
         <y>...</y>
         <se id=se211><x>...</x><w>...</w></se>
       </se>
     </se>
   </e>

Or in tree form:

                      e
                      |
       +-----------------------------+
       |                             |
   se id=se1                     se id=se2
       |                             |
   +---------+---------+      +------------+
   |         |         |      |            |
   x         se        se     x        se id=se21
   |      id=se12   id=se13                |
   |         |                  +----------+-----------+
 +--+--+   +-+---+              |                      |
 |     |   |     |              y                 se id=se211
 y     z   y     w                                     |
                                                  +----+----+
                                                  |         |
                                                  x         w

All non-<se> children of an <se> are interpreted as features applying to that <se>, and these features are inherited by all child <se>s unless overridden there by the occurrence of children with the same name (generic identifier). In the left branch of this tree, the X Y Z subtree of the first <se> (se1) is interpreted as though it were also written as a child of the first and second third-generation <se>s (se12 and se13) because neither of the latter has an X as a child. In the right branch, the X element within the second-generation <se> (se2) is interpreted as a feature applicable to the third-generation <se> (se21) but not to the fourth-generation <se> (se211), because se211 has an X child of its own, and se21 has none.

Note: Crucial point: please check this for accuracy of substance. -MSM

RA agreed that this model provides a true and plausible effort to replicate the way dictionaries work.

A break ensued, with an informal discussion of inheritance mechanisms and whether it is possible to define an additive inheritance rule: if a parent (e1) has an additive group (i.e. one which simply adds further specification to the inherited information, rather than overriding the inherited information) and a child (c1) overrides the element, does the child override only e1's additive group or the entire specification back to the root?
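The simple override-or-inherit rule of the SE model described above (not the additive variant) can be sketched in Python. This is an illustrative sketch under stated assumptions, not a specification: the `Node` class and the function `effective_features` are hypothetical names, the entry root is treated like a subentry for purposes of inheritance, and when a subentry has several children of the same name only the last is kept. The small tree built at the end mirrors the right branch of the example (se2, se21, se211).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                      # generic identifier, e.g. "se", "x", "y"
    ident: str = ""                # id value; only set on subentries here
    children: list = field(default_factory=list)


def effective_features(root, target_id, se_name="se"):
    """Return {name: Node} for the features that apply to the subentry
    whose id is target_id, or None if no such subentry exists.

    Non-subentry children are features of their parent; a subentry
    inherits its ancestors' features unless it has a child of the
    same name, which overrides the inherited one.
    """
    def walk(node, inherited):
        local = {c.name: c for c in node.children if c.name != se_name}
        merged = {**inherited, **local}   # local names override inherited
        if node.ident == target_id:
            return merged
        for child in node.children:
            if child.name == se_name:
                found = walk(child, merged)
                if found is not None:
                    return found
        return None

    return walk(root, {})


# Right branch of the example: se2 has X; se21 has Y; se211 has its own X and W.
x_outer = Node("x")
x_inner = Node("x")
se211 = Node("se", "se211", [x_inner, Node("w")])
se21 = Node("se", "se21", [Node("y"), se211])
se2 = Node("se", "se2", [x_outer, se21])
entry = Node("e", children=[se2])
```

Here `effective_features(entry, "se21")["x"]` is the X of se2, since se21 has no X of its own, while `effective_features(entry, "se211")["x"]` is se211's own X, which blocks the inherited one.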
No consensus was achieved on this question.

After the break, FWT proposed that the SE model be adopted, and that the subentry element be renamed Group. The only role of this element is to describe inheritance; the inheritance rules must be described very clearly. When the element is not present, inheritance is unspecified.

The group then further discussed possible variations on the simple override-or-inherit semantics described above, including "additive inheritance" or "leaf-inheritance". In the simple SE model, a locally nested subtree of type X in a given subentry blocks the inheritance of any subtree of type X from the subentry's ancestors. In some cases, however, it would be better for the ancestral X subtree to be inherited and its leaves merged with the leaves on the locally nested X subtree. Leaves within the locally nested X subtree would override leaves within the inherited X subtree, based on identity of name [i.e. generic identifier? -MSM]. This seems plausible, but needs to be examined to ensure that it will work correctly with deeply nested trees.

In discussing the advisability of the SE model, FWT argued that this form of inheritance is often useful, usually easy to understand in the dictionary and specify in the encoding, and never prevents one from doing other things, or forces one to encode what one does not wish to encode.

The group agreed to adopt the SE inheritance model, renaming the subentry element as Group. NI was assigned the production of a paper defining both the leaf-overlaying semantics and an additive semantics for the renamed element, with the assistance of FWT and NC.

Consensus: proposal adopted with rename of Group for the subentry element, and additive or leaf-overlaying semantics, to be specified by NI with assistance from FWT and NC.

   NI, with FWT and NC: draft specification of additive and overlaying semantics
   Due: asap
   Document number: AI5 W7

11 TAG TABLE

The group then progressed briefly to discuss the need to decide what tags are needed in addition to those already decided on.
MSM asked what was being counted as having been decided on; the consensus was that tags in section 7.4 of TEI P1 were those already decided on. Material in other documents (Amsler/Tompa, Fought and Van Ess-Dykema, the Acquilex proposal) should be considered but does not count as already accepted.

12 DOCUMENTS TO DRAFT

RA stressed the need for the work group to produce text which can be integrated into TEI P2 to replace the current section 7.4 of TEI P1. MSM noted that P2 was to have less discursive explanation and justification, and should simply state the tags available and their usage. RA observed that members of the work group might wish to annotate their suggested changes to explain why the changes were wanted, but that this justification should be regarded as annotation, not as part of the draft.

For each tag proposed, members should say whether it should be defined as a leaf tag or a grouping tag, and whether it is required, mandatory when applicable, recommended, or purely optional.

RA assigned all members of the work group to work on replacement text for the dictionary section of TEI P2 and distribute their drafts to the entire work group. The style should be roughly the same as in section 7.4 of TEI P1. RA will collect the proposals and interpolate them in the text. The review of this interpolated text would be a topic at the next meeting, if any.

MSM asked that if possible, the group also attempt to prepare short drafts of reference entries for individual tags, giving generic identifier, full name, usual parents, usual children, atomic or group tag, and a brief trivial example.

NC suggested that progress could be made quickly if NI sent a revision of the Pisa minutes to FWT, so he can incorporate his own and RA's comments.

RA will try to tag Collins with the tags of TEI P1.

13 MEETING

No date was set for another meeting. It was agreed, however, to set a meeting up if travel funds permitted and it seemed a useful way of accomplishing more work.

14 OTHER BUSINESS

The group acknowledged receipt of the extensive comments of John Fought on the Pisa minutes. Many of the comments are well taken; the group believes the most crucial points are those about inconsistencies, which (with others' suggestions) have led to the more flexible set of content models proposed in this meeting. Further comments on the tags of TEI P1 will be welcome.

-------------------------

(1) Leaf tags correspond to what Amsler/Tompa call "base" tags, but the term base has another specific technical meaning within discussions of the TEI DTDs, and the term leaf tag is preferred for references to elements with a content model of (#PCDATA). -MSM