&ML; TEI-ML-M10

Minutes of the meeting held at the Commission of the European Communities, Luxembourg, October 18-20, 1989

Lou Burnard

Present: David Barnard (DB), chair; Lou Burnard (LB); Jean-Pierre Gaspart (JPG); Lynne A. Price (LAP); Michael Sperberg-McQueen (MSM); Nino Varile (NV).

Introductory business

Administrative matters

DB welcomed the committee. A new address list had been prepared [attached to these minutes]; any further alterations should be sent to DB as soon as possible. MSM outlined the limits on and procedure for claiming reimbursement of expenses; it was noted that the situation might change with new funding arrangements. DB asked all committee members to inform him of their likely expenses well in advance of meetings, to enable him to administer the committee's budget effectively.

Minutes of previous meeting [ML M 1]

`DRTD' (page 3) should read `DTD'.

LB confirmed that he and MSM were still working on a tagset for internal TEI usage.

The status of decisions made at the previous meeting was reviewed. On non-substantive issues, the committee agreed to accept the Chair's ruling. On substantive issues, it was agreed that a decision accepted by a two-thirds majority of the whole committee should be binding. Members who had not voted on a particular issue would be given a fixed period after notification of the decision in which to inform the chair of any disagreement; silence would be interpreted as consent. This led to a brief discussion of absenteeism: it was agreed that two unexplained absences from meetings would constitute resignation.

It was agreed that document ML W1 had not yet been adopted by the committee: substantial revisions had been requested at Toronto which had not yet been carried out. Most of these were subsumed into the discussion of the replacement for this working paper, now renumbered as ML W13.

Document numbering

A new document numbering scheme was proposed. All papers before the committee would be consecutively numbered, and each would bear a single-letter prefix (W for working papers, A for agenda lists, M for minutes). There would be a standing agenda item at each meeting to check the status of all documents currently before the committee and to note any new ones. DB agreed to produce a new document list including all existing and proposed papers [attached]. It was also agreed that documents circulated outside the committee should be credited to the whole committee as author or editor; individual authorship should be noted only within the committee.

Statement of Work

The Committee's plan of work, as presented in TEI documents SC/G10, ED/P2 and ML/R2, was reviewed. DB stated that the committee had two main responsibilities: firstly to advise the other committees on optimal usage of SGML, and secondly to survey existing encoding schemes with a view to recasting them. There was some discussion of what was entailed by `recasting': at one extreme it might simply be a lookup table or an SGML short reference map; at the other a much more complex scheme might be needed. Information loss should be avoidable when going from an exogenous scheme into TEI, but would not be when going from TEI to a comparatively impoverished scheme. LAP asked whether software would be produced to do the conversions. DB said this was not the committee's responsibility: it needed only to produce generic specifications for transforming between a representative set of encoding schemes and the tagsets and DTDs defined by the TEI. Political considerations, as well as technical issues, were involved in determining what would be an appropriate set to consider. JPG noted that some candidate encodings were very application-specific and that some consideration of the function of the markup was therefore necessary, citing macro packages such as LaTeX. It was agreed to produce a list of candidate encoding schemes and to revise the Statement of Work.

LB to produce categorised list of candidate markup schemes (ML W12; 19-Oct-89)

DB to revise statement of work (ML W3; 27-Oct-89)

Guidelines

It was noted that a paper setting out general Guidelines on the usage of SGML features was urgently needed by the other committees. The existing draft working paper (TEI ML W1, now ML W13) needed considerable revision and extension. A detailed discussion ensued.

General principles

JPG pointed out that different situations might require different recommendations: in particular, features appropriate for capture or processing might not be appropriate in interchange or storage. Although the CALS and AAP standards had been proposed for interchange only, people still used them for data capture, for which they were less suitable. It was agreed to structure discussion by drawing up a matrix of SGML features categorised by their suitability for data capture, interchange, storage and processing. As a general principle, the committee felt that anything which could be expressed using SGML should be. Similarly, documentation should make clear that only those features which are defined using SGML could be relied on for interchange purposes. DB proposed that the document should also address the importance of using software to check the syntax of SGML-encoded texts. It was recognised that users of the TEI Guidelines might use many intermediate software environments, but the committee agreed that DTDs developed for the project, and documents claiming to be TEI-conformant, would have to be validated by full SGML parsers. It also agreed to caution the other committees against under-estimating the complexity of this process. JPG and LAP both pointed out that producing software capable of checking SGML syntax correctly was a far from trivial task, and LAP agreed to draft a few pages setting out the difficulties of doing so.

LAP to draft a memorandum on the difficulties of SGML parseability (?; ?)

DB to produce new draft of Guidelines (ML W13; 1-Nov-89)

Discussion of specific features

SHORTREF and CONCUR

JPG proposed that SHORTREF should be recommended for use only in data capture; LAP that it was appropriate at all times, and could lead to significant savings in storage costs as well as convenience in input. JPG noted that SHORTREF and SHORTTAG could apply only to the base tagset, suggesting that their use by applications using CONCURrent DTDs might be problematic. MSM said that potentially any element might need to CONCUR. DB agreed that the only alternative to using CONCUR would be to define something else with a similar functionality. LAP proposed that any problems the group identified in using the feature should be passed back to WG8.

Attributes

It was agreed that, although attribute values and element content were formally equivalent, attribute values could be used in preference to element content in order to add information to a view. Even within a single view it might still be difficult to decide what was content and what was process-specific information.

JPG to draft recommendations on the appropriate use of attributes (W13; ?)

Inclusion/Exclusion Exceptions

Recommendations on when these might profitably be used were still needed.

JPG to draft recommendations on the appropriate use of exceptions in content models (W13; ?)

SUBDOC

This feature might provide an alternative to some uses of the CONCUR feature. It allowed an entity reference to be replaced by a subdocument with a distinct environment, entirely replacing that of the base document type, and with no easy way of communicating information (e.g. the target of IDREF attributes) between the two environments. How this feature relates to CAPACITY is not clear. Guidance on its usage would be useful, particularly for the A&I Committee. LAP proposed that entity references the text of which were subdocuments should appear only as attribute values rather than as content, which met with general agreement.

NV to draft a paragraph on the use of SUBDOC (ML W13; ?)

APPINFO

MSM asked whether this might be an appropriate feature to use as a means of providing aliases for tagnames and other GIs. LAP said parameter entities or short refs would be a better solution to this problem. APPINFO merely provided commentary on the environment described by the DTD in which it appeared, and not its application.

OMITTAG

There was continued discussion on appropriate use of this minimization feature, and it was agreed that clear recommendations were not easy to make. Part of the problem was that different recommendations were appropriate for private and public use of documents, a distinction which not all wished to make. LAP circulated some `Notes on Markup Minimization and Attributes' [document ML W9] which would be incorporated in the draft for W13. Further information on the use of SHORTTAG was requested.

LAP to expand her notes to include recommendations on use of SHORTTAG (W13; 1-Nov-89)

LINK

JPG stated that LINK was the only mechanism provided by SGML for relating different document types. As implemented by the SOBEMAP parser, the feature allowed the DTD designer to associate semantic actions with any element, for example to define formatting. LAP was opposed to its use on the grounds that SGML was intended to separate semantics from markup, that the feature was currently the subject of some concern within WG8, and that as currently defined it was defective in several respects. After some discussion, it was agreed to defer decision on the use of this feature.

Concrete syntax

It was agreed that working committees should not deviate from the reference concrete syntax without strong motivation. To do so would require transmission of a default SGML declaration with each document instance. Concern was expressed over the default namelength of the current syntax, which was felt to be inadequate.

Quantity and capacity

The existing defaults were reviewed briefly: it was agreed that we would need to alter only namelength, for which a value of 128 was proposed as default, and possibly the level to which entities could be nested, for which 16 seemed a little low.

Naming rules

It was agreed to encourage consistency in naming rules as far as possible. In particular, the existing defaults for case sensitivity (sensitive in entity names only) should be adhered to and existing defined entity names should be used. The character set used for names should as far as possible be the same as that used for the document: it was recognised that this might give rise to problems in documents using 8-bit character sets. The committee asked the Text Representation Committee to address the problem of representing names in such documents, to consider the SGML syntax status of each character as well as its collating position etc., and to address the best way of defining translations between identical sets of names in different languages. A need for globally transforming names was identified.

In discussion of the FORMAL feature, it was noted that no decision had yet been taken as to whether the TEI scheme would be registered with the relevant standards bodies.

MSM to raise with Steering whether formal registration of the TEI Guidelines was intended (?)

It was agreed that entity names should be used without a doctype qualifier only if they had the same replacement value in all TEI doctypes. The special case of entity references expanding to IGNORE or INCLUDE when used with marked sections was noted: the editors would need to maintain consistency in this context as there was no way of including a doctype identifier in this case.

Conclusions

DB agreed to produce a new set of draft recommendations by November 1st. Fuller justification would be left to a later date. The document would be distributed by fax to JPG and LAP, and by email to other members. Two weeks would be allowed for comments. Committee members were asked to acknowledge receipt of the draft immediately.

SGML Bibliography

Production of this was well advanced. Robin Cover had produced a substantial amount of information which was being merged with DB's current file and would be distributed as a working paper in SGML form shortly.

DB to distribute SGML bibliography (ML W14; 15-Nov-89)

Introductory Guide to SGML

There was some discussion of the need for a very elementary guide to SGML including illustrative material relevant to the TEI's concerns. The diversity of other committee members' backgrounds and expertise in formal language theory was noted. It was suggested that the chapter on SGML in the FORMEX manual might be a suitable basis.

LB to draft "Idiots' Guide" to SGML (ML W15; 1-Dec-89)

NV to check on copyright status of FORMEX Guide (ML W15; ?)

Programmers' Guide to SGML

MSM proposed that a document introducing the major concepts of SGML, aimed at document designers and programmers, would also be useful, analogous to Bryan's book but less biassed to publishing applications. LAP suggested that our working papers would form a good basis for such a text. DB agreed that ML W13 should cover all that was necessary for those wishing to design TEI-conformant DTDs.

LAP to report on the availability of relevant work done previously at Hewlett Packard (ML W16; 24-Oct-89)

LAP, with MSM and David Durand as back-up, to consider writing a technical introduction to the use of the TEI Guidelines (ML W16; ?)

Other Encoding Schemes

LB presented verbally an initial draft of ML W12. The Committee worked through a long list of encoding schemes, categorising each as (a) one on which work towards specifying a transformation would be carried out; (b) one on which such work was not judged necessary; or (c) one on which no further work was currently planned. [The results are presented in the draft of ML W12 appended to these minutes.] It was noted that Nancy Ide (NI) had expressed willingness to work in this area and she was requested to form a working group. NV noted that recoding schemes for a number of wordprocessing systems were already required for the Eurotra project. DB proposed that each scheme on which work was required should be considered on an ad hoc basis. It might be possible simply to specify an SGML declaration and DTD for some; others might need simple string substitutions; for yet others, general-purpose tools for lexical analysis such as YACC and Lex might be required. JPG felt that the issue should not be oversimplified, but the consensus was that simple tasks should be addressed by simple tools. DB noted that there were already several volunteers willing to work in this area and others might be forthcoming. He undertook to ask NI and Frank Tompa (FT) to co-ordinate the work.

NI, with assistance from Frank Tompa, to form working group with responsibility for addressing tasks specified in ML W12 (ML W12; ?)

Stylistic Guide

LB suggested that a separate document summarising TEI `house style' for DTD-definers might be useful. This was agreed, though the difficulty of formulating such rules was noted. It should for example summarise naming conventions, recommend appropriate grouping levels for content models and propose a standard ordering for DTD statements. DB suggested that much of this material might be included in W13. MSM proposed that it should be a separate document and suggested David Durand (DD) as a suitable author.

DB to contact DD to discuss preparation of Stylistic Guidelines (ML W17; ?)

Input to WG8

The committee agreed that a formal mechanism for conveying problems identified with the current SGML standard to the relevant ISO working groups was highly desirable. Two particular problems (namelength and doctype qualifier on marked sections) had already been identified and it seemed likely that there would be more. It was agreed that the Steering Committee should be asked to convey detailed proposals for revisions to ISO 8879, the five-year review of which was due in 1991, and that the Committee would attempt to draft such proposals.

DB to ask Steering to communicate with WG8 (?; ?)

Next Meeting

Planned dates for next meeting are 16-19 February 1990, in Chicago. Date and place to be confirmed by 1 Dec 1989.

MSM to confirm date and time of next meeting (1-Dec-89)

Questions raised by other committees

Due to other commitments, DB had to leave the meeting after the second day. The third day of the meeting was spent discussing in some detail some specific problems raised by members of other working committees. MSM characterised the problem areas on which guidance was needed as follows:
  • Arbitrary segments which did not seem to be really text elements, for example in discourse or stylistic analysis.
  • Discontinuous segments where elements were interrupted by other elements, for example in morphological or conversational analyses.
  • Ambiguity, where more than one structural analysis might apply to a given text segment.
  • Overlap, where two or more overlapping structures were to be tagged across the same text.
  • Synchronous parallel structures.
  • Transcription mapping.
  • Cross references both within a document and outside it.
  • Vagueness, where the feature to be encoded could not be exactly localised.

Arbitrary segments

Examples quoted were: this is where the transcript is inaudible; this is where he moved to the window. JPG suggested that each such segment should be regarded as a separate automaton. CONCUR could only be used if the number of different segment types was in principle bounded and quite small. Alternatively he proposed empty tags marking either end of the segment, with attributes to indicate their type. For example, Nino's moves around the room might be represented in a document with content model of (#PCDATA | move) by elements tagged [(nino)move] .... [/(nino)move] or by elements tagged [move person=nino start] ... [move person=nino end]. It was noted that the first mechanism could be generated from the second, but that the second was more general. The advantage of the first was that the parser could check that tags were properly matched etc. A third possibility was to create the total intersection of all identifiable segments, subdivided at each segment boundary and grouped at a higher level.
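As an illustration only, and not part of the committee's discussion, the observation that the first mechanism could be generated from the second can be sketched in Python. The sketch assumes the informal bracket notation used in these minutes; the regular expression and the function name `milestones_to_paired` are hypothetical, and no real SGML parser is involved. It also performs the matching check which, for the first mechanism, the parser would otherwise provide.

```python
import re

# Hypothetical pattern for the informal bracket notation used in these minutes,
# e.g. [move person=nino start] ... [move person=nino end].
MILESTONE = re.compile(r"\[(\w+) person=(\w+) (start|end)\]")


def milestones_to_paired(text):
    """Rewrite typed empty start/end markers as qualified paired tags."""
    open_segments = set()

    def rewrite(match):
        name, person, edge = match.groups()
        key = (person, name)
        if edge == "start":
            if key in open_segments:
                raise ValueError(f"segment {key} opened twice")
            open_segments.add(key)
            return f"[({person}){name}]"
        if key not in open_segments:
            raise ValueError(f"segment {key} closed before it was opened")
        open_segments.remove(key)
        return f"[/({person}){name}]"

    result = MILESTONE.sub(rewrite, text)
    if open_segments:
        raise ValueError(f"unclosed segments: {sorted(open_segments)}")
    return result
```

Going in the other direction would need no such check, which is one way of reading the remark that the second mechanism was the more general.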

Discontinuous segments

These could be handled by co-indexing using the id/idref mechanism of SGML. To avoid the need to treat the first part of a discontinuous segment differently from the rest, the target of the references could be set up in advance. The example discussed was of discontinuous segments relating to a particular topic. Provided that the set of topics could be predefined, these could be listed in a separate TOPICLIST element, each member of which could be allocated an ID. Each occurrence of a topic in the text itself would then be linked back to the appropriate element by an IDREF.
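The co-indexing described here can be sketched in Python, purely as an illustration under stated assumptions: the topic list is reduced to a list of ID strings, each text segment to an (IDREF, text) pair, and the function name `group_segments` is hypothetical. The sketch reassembles the parts of each discontinuous segment and rejects any IDREF with no matching ID, a check a validating SGML parser would make for attributes declared as IDREF.

```python
from collections import defaultdict


def group_segments(topic_ids, segments):
    """Collect the parts of each discontinuous segment, keyed by topic ID.

    topic_ids: IDs declared in the (predefined) topic list.
    segments:  (idref, text) pairs in document order.
    """
    declared = set(topic_ids)
    grouped = defaultdict(list)
    for idref, text in segments:
        if idref not in declared:
            raise KeyError(f"IDREF {idref!r} matches no ID in the topic list")
        grouped[idref].append(text)
    return dict(grouped)
```

Because every part points back to the topic list, no part of a segment needs to be treated as its head.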

It was agreed that this method was probably not appropriate for all examples of discontinuity. For micro-discontinuities such as those found in morphology, a simpler solution might be to introduce some redundancy into the text. For example, the lemmatised form KTB of the Arabic word al-kaatib might be represented either as an attribute [word root=KTB]al-kaatib[/word] or as an additional element [word][root]KTB[form]al-kaatib[/word]. The first was probably to be preferred, since lemmatisation was a process applied to the text rather than a part of its content.

Ambiguity

Five mechanisms were outlined for dealing with ambiguities such as `I saw the man with the telescope'. The two parse trees for this sentence - ((I) (saw ((the man) with the telescope))) and ((I) ((saw (the man)) with the telescope)) - could be represented independently, but without repeating their points of overlap, by using CONCUR. Or they could be repeated as alternative marked sections.

The third possibility was to think of each parse tree as a graph, in which the nodes were the boundaries between syntactic units and the arcs the syntactic interpretation placed on them. A sentence could then be represented in SGML as an ordered list of words, each with its own ID. The arcs of each syntactic graph for the sentence could then be represented by empty elements, using idrefs to point back to the words. A simpler representation might be to use a special notation (which could not therefore be checked by the SGML parser) for the parse tree value and specify it as an attribute. The fifth possibility was to treat all parse subtrees as arbitrary segments, as outlined above.
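The third possibility can be sketched in Python as an illustration only. The word IDs (w1 to w7), the arc labels and the function name `span_text` are all hypothetical, and the sketch simplifies the description above by letting each arc point at its first and last word rather than at the boundaries between words. Each reading of the sentence is a separate list of arcs over one shared list of words, so the words themselves are never repeated.

```python
# Shared, ordered list of words, each with its own (hypothetical) ID.
WORDS = {
    "w1": "I", "w2": "saw", "w3": "the", "w4": "man",
    "w5": "with", "w6": "the", "w7": "telescope",
}
ORDER = list(WORDS)  # dicts preserve insertion order (Python 3.7+)

# Each arc stands for one empty element: (label, first-word IDREF, last-word IDREF).
# Reading A: the man has the telescope; reading B: the telescope is used for seeing.
READING_A = [("NP", "w3", "w7")]
READING_B = [("NP", "w3", "w4"), ("PP", "w5", "w7")]


def span_text(arc):
    """Return the words covered by one arc, resolving its IDREFs."""
    _label, first, last = arc
    start, end = ORDER.index(first), ORDER.index(last)
    if start > end:
        raise ValueError("arc runs backwards")
    return " ".join(WORDS[w] for w in ORDER[start:end + 1])
```

In this form `span_text(READING_A[0])` covers `the man with the telescope`, while `span_text(READING_B[0])` covers only `the man`.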

The committee felt that more specific examples would be needed before clear recommendations could be given as to which of these mechanisms should be preferred.

Overlap

The example cited -- she took advantage of Joan -- seemed to be an instance of arbitrary sectioning. The A&I Committee was requested to provide precise illustration of the Japanese biclausal analysis referred to in document TEI AI M1.

Cross references

The id/idref mechanism of SGML seemed adequate in the absence of more specific problems.

Synchronisation of multiple transcriptions

JPG pointed out that so long as order was preserved, parallelism between two synchronous structures could be implied. If order was not preserved, parallelism would need to be explicitly tagged by using idrefs.

Vagueness

Various examples of vagueness were discussed inconclusively. In most cases the mechanisms proposed for arbitrary and discontinuous segments seemed adequate. Two different examples were requested, together with some indication of the purpose for which such features might be tagged.

CDATA

LAP gave a detailed clarification of the use of the CDATA keyword: as declared content for an element, as declared value for an attribute, as an entity, or as an effective status keyword in a marked section. JPG described the interaction between the use of CDATA and of marked sections. It was agreed that CDATA and RCDATA should be avoided as declared element content, unless in a non-SGML environment or as a marked section status keyword.

Processing Instructions

It was agreed that these should be avoided, except for a few specific purposes such as returning the system date to a document instance, in which case the value returned should be declared as SDATA to avoid changing the parse state.