Minutes of the Dictionary Work Group (TEI AI 5)
                         C. M. Sperberg-McQueen
                      Document Number:  TEI AI5 M5
                      October 18, 1991 (14:20:18)
                   Draft October 18, 1991 (14:20:18)
 
   The meeting convened at 9:45 Wednesday, 2 October 1991.  Those
present were:
 
    Robert Amsler (RA), chairing
    Nicoletta Calzolari (NC)
    Louise Guthrie (LG)
    Nancy Ide (NI)
    Frank Tompa (FWT)
    Carol Van Ess-Dykema (CVED)
    C. M. Sperberg-McQueen (MSM)
 
 
 
                                   1
 
                                 AGENDA
 
 
   The group agreed to the following preliminary agenda:
 
1.    review of Cycle 2 deadlines
2.    notes from last SC meeting esp. on documents to be produced (RA)
3.    review of Pisa minutes (NC, NI)
4.    specify precise targets for the day
5.    as specified in the preceding item
 
 
 
                                   2
 
                REVIEW OF CYCLE 2 SCHEDULE AND DEADLINES
 
 
   MSM, working backwards, summarized the schedule for the remainder of
the second development cycle thus:
 
Jun 1992:  finished First Edition ("final Guidelines")
 
May 1992:  TEI P3 mailed to Advisory Board
 
Mar 1992:  input cutoff for TEI P3
 
Jan 1992:  TEI P2 mailed to AB and public
 
30 Nov 1991 (end):  input cutoff for P2
 
15-18 Nov 1991:  joint TR/AI meeting
 
8 Nov 1991:  deadline for docs for TR/AI meeting
 
 
 
                                   3
 
                   NOTES FROM SC MEETING ON DOCUMENTS
 
 
   MSM described the current plans for TEI publications:  a series of
tutorials (specialized for different audiences), reference manual, and
case book.  The reference manual will contain a prose specification of
the encoding scheme, an alphabetic reference list of tags and attri-
butes, and formal DTDs.  The case book will contain big examples.  This
WG must produce the relevant section for the prose specification, and
plan to work on examples for the case book.
 
   RA suggested that in addition to the tutorial for dictionaries, he
could imagine a kind of step-by-step guided tour of the dictionary stan-
dard, and finally a sort of case book like the Chicago Manual of Style--
it seems to have something for everything you encounter, described in
very small paragraphs.
 
 
 
                                   4
 
                       REPORTING OF PISA MINUTES
 
 
   NC reported on the meeting held in Pisa in August (the minutes are
document AI5 M4). She recalled the decision in Tempe to have internal
subgroups within AI 5 for work on monolingual and polyglot dictionaries.
NC and SW thought they could work better together rather than individu-
ally, and so suggested an informal meeting; SW suggested NI be included
since she too was geographically proximate.  CVED noted that she would
have liked to have been notified of this meeting, in order to have been
able to provide input.  NC apologized for the oversight.  MSM noted that
in general, the TEI is experiencing more need for bureaucratic routini-
zation of meetings (scheduling, notification, and distribution of docu-
ments) than had been anticipated, and suggested that the discussion of
the organizational aspects of the Pisa meeting be curtailed.
 
   RA said that a crucial outcome of the Pisa meeting appeared to be the
explicit realization that multiple communities are interested in dic-
tionary encoding.  If this had been explicitly recognized six months
ago, he suggested, it might have proven desirable to split the work
group into separate groups for the different communities.  Given the
single work group, the current task must be to aim at the common core
which everyone can agree on.  Different communities may want various
kinds of extensions to this core, extensions which may not be compati-
ble, and most of which may involve trying to make various kinds of
things explicit which are implicit in the print dictionary or in the
common core.
 
   NI said that the major outcome was to make explicit the hitherto
implicit knowledge that two different objects are involved in dictionary
encoding.  Some possible users are interested in the printed page (as
produced or as to be produced); others interested only in the informa-
tional content of the dictionary.  Those interested in the printed page
might wish to retain the specific form (e.g.  "delay, -ed, -ing") in
which the printed dictionary indicates inflectional information, while
those interested in information content might wish to render the same
information in a form more convenient for their own processing (e.g.
"delay, delayed, delaying").  FWT observed that even when one is inter-
ested in having the encoded information, one may find oneself unable to
interpret the printed string with enough certainty.  RA agreed that many
disputes could arise over the implicit meanings of fonts, etc., in all
content markup, and urged that the group first try to reach a bare mini-
mum level.  The scoping of information might not be explicit at that
minimum level.
 
   Using a chart (not reproduced here), NI summarized the dichotomy
between:
 
*   printed form (rectangle)
*   encoded information content (oval)
 
The communities interested in these would include:
 
*   philologists, who are interested in the printed book but may also
    want the information content
*   computational linguists, who typically want the encoded information
    content
*   publishers, who want the information content in some database so
    they can in future produce print dictionaries (historically,
    publishers are beginning now with print dictionaries)

RA added the category of

*   dictionary users, who want to be able to retrieve as in a database
    but have the results look like the printed book
 
   LG distinguished between syntactic content and semantic content of
dictionaries.  The IBM dictionary research group, for example, reverses
ellipses in definitions, etc.  -- that's a sort of syntactic informa-
tion.  Others want to find hypernyms and record them and similar infor-
mation.  That's semantic information.  The group's consensus was that
the current need is to work with machine-readable (and "machine-
tractable") dictionaries; lexical knowledge bases are beyond the group's
current scope.
 
   CVED urged that the group simply begin with what everyone has in com-
mon and allow extensions, possibly separate, for the different communi-
ties.
 
   NI distinguished three kinds of document grammars for dictionaries:
 
1.    bottom-level (leaf) tags only (rather like Amsler/Tompa:  nothing
      nests except that everything nests within main entry)--this might
      be called a flat or tiled DTD.(1)
2.    leaf tags with groups and subentries (this falls at the other
      extreme and is almost as anarchic as the Amsler/Tompa proposal).
      This style of document grammar resembles that proposed in the Pisa
      minutes (AI5 M4). Subentries can contain almost anything; if one
      dictionary puts a usage note at one level and another puts it at a
      different level, the encoder can put it anywhere.  There are
      groups, but each may have virtually any structure.  Additionally,
      NI believed one might be able to define some groups like
      FORM and GRAM which are in fact tightly constrained.  The DTD for
      this deeply nesting encoding might look thus:
 
                   <!ELEMENT me - - (se+)>
                   <!ELEMENT se - - (%anything; | se)* >
 
3.    a further constrained DTD, which could serve as a template for a
      specific dictionary, providing much tighter validation than the
      two anarchic document grammars.  The work group could even provide
      an example.
 
   The group digressed to discuss the inheritance proposals of the Pisa
minutes.  Such inheritance cannot be specified explicitly in SGML
itself, but it is legitimate for the TEI (and thus for AI5) to specify
the semantics of nesting this way if we like.
 
   RA suggested that different syntactic classes of tags correspond
neatly to the different levels of recommendation:  the required tags are
leaf tags; grouping tags, which represent an attempt to capture more of
the information structure, are recommended but not required; finally
there will be tags for specialized information which some but not all
communities will want to encode.
 
   FWT agreed on the distinction between leaf tags, marking elements
within which no other elements can nest, and grouping tags, within which
primarily or exclusively leaf elements will nest.  He disagreed with
RA's identification of leaf tags as required and grouping tags as recom-
mended, and denied that for any string in a dictionary there would
always be a single leaf tag which applies:  etymology, for example, can-
not be encoded solely with leaf tags.  There was no consensus on this
point.
 
   RA suggested that grouping elements should be provided with more or
less canonical definitions.  For example "Etymology usually contains a,
b, c, d.  A given dictionary may have defective etymologies, but if
a, b, c occur they should be tagged as etymology."  FWT disagreed, arguing
that all grouping tags should allow any leaf tag and any other grouping
tag within them.
 
   MSM asked what RA and FWT meant by required.  Did they mean the tag
must appear in the document?  That the tag must be used when applicable?
Or that the tag is recommended to be used when applicable for new
documents?  He observed that implicitly or explicitly the TEI already
distinguished several levels of recommendation and requirement for
elements and their tags:
 
*   required (document is not TEI conformant without this element)
*   mandatory when applicable (bad faith if not tagged)
*   recommended (should be used in new encodings unless there is a good
    reason not to)
*   optional (use if you like)
 
The work group urged the explicit creation of a category of conditional-
ly required / mandatory:  required or mandatory when applicable if a
given document type is used.  This could be merged with mandatory when
applicable or made separate.
 
 
 
                                   5
 
                        TARGETS OF THE DAY'S WORK
 
 
   NI suggested that since consensus had been achieved on the overall
schema, the goals for the day should be:
 
*   devise DTD for anarchic scheme without nesting (flat DTD)
*   devise DTD for anarchic scheme with arbitrary nesting
*   devise DTD for constrained groups for use in nesting DTD
*   think hard about DTD modification to ensure that one can actually
    make more constrained DTDs
 
   RA suggested also:
 
*   think hard about specific leaf tags and decide how to determine
    whether a given thing should be made a group or not.
*   specify how to decide which group to use, if the elements present
    could appear in several.
*   bear in mind the need for tags applicable to lots of strange animals
    (what if something is both syllabification and spelling?  it will be
    necessary to generalize both from syllabification and from ortho-
    graphic form.)
 
   MSM proposed
 
*   agree to distinguish atomic, group, and ENTRY tags
*   construct DTDs with abstractions
*   make inventory of tags in each abstraction
 
No explicit decision was made, but the discussion immediately passed
over to consideration of the DTD structure as suggested by NI and MSM.
 
 
 
                                   6
 
                             DTD STRUCTURE
 
 
   FWT proposed the following basic DTD:
 
     <!-- parameter entities contain lists of elements:
          atomics:  all leaf tags
          groups:   all grouping tags
     -->
     <!ELEMENT dictionary - - (entry | other)+ >
     <!ELEMENT (%atomics;) - - (#PCDATA) >
     <!ELEMENT (%groups;)  - - (%atomics; | %groups; | #PCDATA)* -(entry) >
 
   If this is accepted, FWT continued, the invention of individual tags
would simply be the process of populating the following table:
 
                   Required            Recommended         Optional
    (atoms)        (some)              (lots)              (a few)
    (groups)       <entry>             (many)              (maybe some)
 
   NI objected to FWT's DTD fragment, urging that the <subentry> element
needs to have a separate definition.  FWT suggested that <subentry> be
treated like any other group, and that all groups be allowed to have the
properties assigned to <subentry> by the Pisa minutes.  NI argued that
eliminating the distinction between subentries and other groups would
lose some good properties of the subentry tag and make it unclear how
the tagging is to work.  FWT suggested keeping the distinction in the
examples and the prose, without reifying it in the DTD.
 
 
 
                                   7
 
                  TARGETS OF THE DAY'S WORK (RESUMED)
 
 
   The group then agreed on the following goals for the afternoon:
 
*   decide whether <se> requires separate definition
*   clarify relation between print and information content?
*   populate the tag table
*   decide whether to make one DTD for all or several for different
    communities
*   specify documents to draft, assign them to individuals
*   agree on date of next meeting, if any
 
   [At this point the group broke for lunch.]
 
 
 
                                   8
 
                     PRINT AND INFORMATION CONTENT
 
 
   After lunch, the group discussed the different types of encoding pos-
sibly needed by different communities, distinguishing:
 
*   fidelity to the information content of the dictionary
*   fidelity to all details on the printed page
*   fidelity to the linear sequence of characters and their presentation
    (fonts etc.)  (This differs from the preceding as input to a type-
    setting program may differ from its typeset output.)
 
   Alternative terminology was discussed without final decision:  (1)
typographic fidelity (which preserves all decisions made by the typeset-
ter, even those another typesetter might make differently) vs.  copy-
editor fidelity (which preserves all decisions made by the copy editor,
even those which might be made differently by a different copy editor or
by the same editor using a different style sheet) vs.  editorial fideli-
ty (which preserves all decisions made by the subject editor or author,
but not necessarily those of the copy editor or typesetter).  (2) layout
fidelity vs. linear fidelity vs. abstract-content fidelity.  (3) pixel
fidelity vs. character linear fidelity vs. symbolic linear fidelity.
 
 
 
                                   9
 
                            ONE DTD OR MANY?
 
 
   FWT suggested that any single DTD capable of handling the print forms
of a variety of dictionaries would necessarily also handle the encodings
of interest to all of these communities.  This can be done by accepting
NI's model:  one DTD for everyone, with constraints to be added if one
wishes to limit things.  This was accepted by consensus.
 
   MSM observed that the DTD fragment proposed by FWT would allow both
deeply nesting and non-nesting encodings of dictionaries, and thus there
was no need to distinguish a "flat" DTD from a nesting DTD.
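
   MSM's observation can be checked informally with a short Python
sketch of the content model in FWT's DTD fragment.  This is only an
illustration under stated assumptions:  the tag inventories ATOMICS and
GROUPS, the function name conforms, and the sample entries are invented
here and are not decisions of the work group.  Only <entry>, FORM, and
GRAM are named as groups in the discussion; the other names are
placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag inventories; the minutes leave the real lists open.
ATOMICS = {"orth", "pos", "def"}
GROUPS = {"entry", "form", "gram", "sense"}


def conforms(elem, inside_group=False):
    """Check one element against the atomics/groups content model."""
    if elem.tag in ATOMICS:
        # Leaf elements hold character data only, no child elements.
        return len(elem) == 0
    if elem.tag in GROUPS:
        # The -(entry) exclusion bars <entry> anywhere below a group.
        if elem.tag == "entry" and inside_group:
            return False
        # Groups may mix text, leaf elements, and other groups freely.
        return all(conforms(child, inside_group=True) for child in elem)
    # Anything not in either inventory is rejected.
    return False


FLAT = "<entry><orth>delay</orth><pos>v</pos><def>to put off</def></entry>"
NESTED = (
    "<entry><form><orth>delay</orth></form>"
    "<sense><gram><pos>v</pos></gram><def>to put off</def></sense></entry>"
)
BAD = "<entry><sense><entry><orth>x</orth></entry></sense></entry>"

for label, text in [("flat", FLAT), ("nested", NESTED), ("bad", BAD)]:
    print(label, conforms(ET.fromstring(text)))
```

   Both the flat and the nested encoding of the same entry pass the
check, while an <entry> nested inside another group is rejected.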
 
 
 
                                   10
 
                           INHERITANCE MODEL
 
 
   FWT elaborated the proposal for special semantics for the <se>
(subentry) element (hereinafter the "SE model") on the basis of this
example:
 
     <e>
             <se id=se1>
                     <x>
                             <y> ... </y>
                             <z> ... </z>
                     </x>
                     <se id=se12>
                             <y> ... </y>
                             <w> ... </w>
                     </se>
                     <se id=se13> ... </se>
             </se>
             <se id=se2>
                     <x> ... </x>
                     <se id=se21>
                             <y> ... </y>
                             <se id=se211>
                                     <x> ... </x>
                                     <w> ... </w>
                             </se>
                     </se>
             </se>
     </e>
 
Or in tree form:
 
                                  e
                                  |
             +-----------------------------+
             |                             |
          se id=se1                     se id=se2
             |                              |
         +---------+---------+          +------------+
         |         |         |          |            |
         x         se        se         x        se id=se21
         |      id=se12   id=se13                    |
         |         |                      +----------+-----------+
      +--+--+    +-+---+                  |                      |
      |     |    |     |                  y                  se id=se211
      y     z    y     w                                         |
                                                            +----+----+
                                                            |         |
                                                            x         w
 
   All non-<se> children of an <se> are interpreted as features applying
to that <se> and these features are inherited by all children <se>s
unless overridden there by the occurrence of children with the same name
(generic identifier).  In the left branch of this tree, the X Y Z
subtree of the first <se> (se1) is interpreted as though it were also
written as a child of the first and second third-generation <se>s (se12
and se13) because neither of the latter has an <x> as a child.  In the
right branch, the X element within the second-generation <se> (se2) is
interpreted as a feature applicable to the third-generation <se> (se21)
but not to the fourth-generation <se> (se211), because se211 has an X
child of its own, and se21 has none.
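
   The simple override-or-inherit rule just described can be sketched
in Python.  This is an illustration only, not part of the proposal:  the
function name effective_features and the placeholder leaf text (y1, z1,
and so on) are assumptions made here, and the id values are quoted so
that an XML parser will accept the SGML example.  The sketch records,
for each <se>, which <se> supplies each feature that applies to it.

```python
import xml.etree.ElementTree as ET

# The example entry, with attribute values quoted for an XML parser and
# placeholder text standing in for the elided content.
ENTRY = """
<e>
  <se id="se1">
    <x><y>y1</y><z>z1</z></x>
    <se id="se12"><y>y12</y><w>w12</w></se>
    <se id="se13"/>
  </se>
  <se id="se2">
    <x>x2</x>
    <se id="se21">
      <y>y21</y>
      <se id="se211"><x>x211</x><w>w211</w></se>
    </se>
  </se>
</e>
"""


def effective_features(se, inherited=None, out=None):
    """Map each <se> id to {feature name: id of the <se> supplying it}."""
    features = dict(inherited or {})  # copy, so siblings stay independent
    out = {} if out is None else out
    # Local non-<se> children override inherited features of the same name.
    for child in se:
        if child.tag != "se":
            features[child.tag] = se.get("id")
    out[se.get("id")] = features
    # Nested <se> elements inherit whatever applies at this level.
    for child in se:
        if child.tag == "se":
            effective_features(child, features, out)
    return out


root = ET.fromstring(ENTRY)
features = {}
for top in root.findall("se"):
    effective_features(top, out=features)

for se_id, feats in features.items():
    print(se_id, feats)
```

   On this reading se12 and se13 both take their X feature from se1,
se21 takes it from se2, and se211 supplies its own X while still
inheriting Y from se21 (assuming inheritance passes down through more
than one generation, as the se211 case implies).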
 
   Note:  Crucial point:  please check this for accuracy of substance.
-MSM
 
   RA agreed that this model provides a true and plausible effort to
replicate the way dictionaries work.
 
   A break ensued, with an informal discussion of inheritance mechanisms
and whether it is possible to define an additive inheritance rule:  if a
parent (e1) has an additive <gram> group (i.e. one which simply adds
further specification to the inherited information, rather than overrid-
ing the inherited information) and a child (c1) overrides the <gram>
element, does the child override only e1's additive <gram> group or the
entire <gram> specification back to the root?  No consensus was
achieved.
 
   After the break, FWT proposed that the SE model be adopted, and that
<se> be renamed <group>.  The only role of this element is to describe
inheritance; the inheritance rules must be described very clearly.  When
<group> is not present, inheritance is unspecified.
 
   The group then further discussed possible variations on the simple
override-or-inherit semantics described above, including "additive
inheritance" or "leaf-inheritance".  In the simple SE model, a locally
nested subtree of type X in a given <group> blocks the inheritance of
any subtree of type X from the <group>'s ancestors.  In some cases, how-
ever, it would be better for the ancestral X subtree to be inherited and
its leaves merged with the leaves on the locally nested X subtree.
Leaves within the locally nested X subtree would override leaves within
the inherited X subtree, based on identity of name [i.e.  generic iden-
tifier? -MSM].  This seems plausible, but needs to be examined to ensure
that it will work correctly with deeply nested trees.
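
   The difference between the two semantics can be made concrete with a
small Python sketch.  It is a simplification under stated assumptions:
each feature subtree is modelled as a one-level mapping from leaf name
to value, the function names simple and overlay are invented here, and
the leaf names pos and gender are hypothetical.  It does not address the
deeply nested case, which remains to be examined.

```python
def simple(inherited, local):
    """Simple SE model: a local subtree replaces the inherited one whole."""
    merged = {name: dict(leaves) for name, leaves in inherited.items()}
    for name, leaves in local.items():
        merged[name] = dict(leaves)
    return merged


def overlay(inherited, local):
    """Leaf overlay: local leaves override inherited leaves of the same
    name; other inherited leaves in the same subtree are kept."""
    merged = {name: dict(leaves) for name, leaves in inherited.items()}
    for name, leaves in local.items():
        merged.setdefault(name, {}).update(leaves)
    return merged


# Hypothetical <gram> subtrees on a parent <group> and on its child.
parent = {"gram": {"pos": "n", "gender": "f"}}
child = {"gram": {"gender": "m"}}

print(simple(parent, child))
print(overlay(parent, child))
```

   Under the simple model the child's <gram> carries only its own
gender leaf; under the overlay model it also keeps the inherited pos
leaf.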
 
   In discussing the advisability of the SE model, FWT argued that this
form of inheritance is often useful, usually easy to understand in the
dictionary and specify in the encoding, and never prevents one from
doing other things, or forces one to encode what one does not wish to
encode.
 
   The group agreed to adopt the SE inheritance model, renaming <se> as
<group>.  NI was assigned the production of a paper defining both the
leaf-overlaying semantics and an additive semantics for <group>, with
the assistance of FWT and NC.
 
Consensus:  <se> proposal adopted, with <se> renamed <group>, and with
additive or leaf-overlaying semantics to be specified by NI with
assistance from FWT and NC.
 
      NI, with FWT and NC
 
      draft specification of additive and overlaying semantics
 
      Due:  asap
 
      Document number:  AI5 W7
 
 
 
                                   11
 
                               TAG TABLE
 
 
   The group then progressed briefly to discuss the need to decide what
tags are needed in addition to those already decided on.  MSM asked what
was being counted as having been decided on; the consensus was that tags
in section 7.4 of TEI P1 were those already decided on.  Material in
other documents (Amsler/Tompa, Fought and Van Ess-Dykema, the Acquilex
proposal) should be considered but does not count as already accepted.
 
 
 
                                   12
 
                           DOCUMENTS TO DRAFT
 
 
   RA stressed the need for the work group to produce text which can be
integrated into TEI P2 to replace the current section 7.4 of TEI P1.
MSM noted that P2 was to have less discursive explanation and justifica-
tion, and should simply state the tags available and their usage.  RA
observed that members of the work group might wish to annotate their
suggested changes to explain why the changes were wanted, but that this
justification should be regarded as annotation, not as part of the
draft.
 
   For each tag proposed, members should say whether it should be
defined as a leaf tag or a grouping tag, and whether it is required,
mandatory when applicable, recommended, or purely optional.
 
   RA assigned all members of the work group to work on replacement text
for the dictionary section of TEI P2 and to distribute it to the entire
work group.  The style should be roughly the same as in section 7.4 of
TEI P1.  RA will collect the proposals and interpolate them in the text.
The review of this interpolated text would be a topic at the next meet-
ing, if any.
 
   MSM asked that if possible, the group also attempt to prepare short
drafts of reference entries for individual tags, giving generic identi-
fier, full name, usual parents, usual children, atomic or group tag, and
a brief trivial example.
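
   One way to keep such reference entries uniform is to treat each as a
small record.  The Python sketch below is only a suggestion of what such
a record might hold:  the class and field names are invented here, and
the sample values for <gram> are illustrative, not decisions of the
work group.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    ATOMIC = "atomic"
    GROUP = "group"


class Level(Enum):
    REQUIRED = "required"
    MANDATORY_WHEN_APPLICABLE = "mandatory when applicable"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


@dataclass
class TagEntry:
    gi: str            # generic identifier
    full_name: str
    kind: Kind         # atomic (leaf) or group tag
    level: Level       # level of recommendation
    usual_parents: list = field(default_factory=list)
    usual_children: list = field(default_factory=list)
    example: str = ""


# Illustrative values only.
gram = TagEntry(
    gi="gram",
    full_name="grammatical information",
    kind=Kind.GROUP,
    level=Level.RECOMMENDED,
    usual_parents=["entry", "group"],
    usual_children=["pos"],
    example="<gram><pos>n</pos></gram>",
)
print(gram.gi, gram.kind.value, gram.level.value)
```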
 
   NC suggested that progress could be made quickly if NI sent a
revision of the Pisa minutes to FWT, so that he could incorporate his
own and RA's comments.
 
   RA will try to tag Collins with the tags of TEI P1.
 
 
 
                                   13
 
                                MEETING
 
 
   No date was set for another meeting.  It was agreed, however, to set
a meeting up if travel funds permitted and it seemed a useful way of
accomplishing more work.
 
 
 
                                   14
 
                             OTHER BUSINESS
 
 
   The group acknowledged receipt of the extensive comments of John
Fought on the Pisa minutes.  Many of the comments are well taken; the
group believes the most crucial points are those about inconsistencies,
which (with others' suggestions) have led to the more flexible set of
content models proposed in this meeting.  Further comments on the tags
of TEI P1 will be welcome.
 
-------------------------
 
(1) Leaf tags correspond to what Amsler/Tompa call "base" tags, but the
    term base has another specific technical meaning within discussions
    of the TEI DTDs, and the term leaf tag is preferred for references
    to elements with a content model of (#PCDATA).  -MSM
 