&ML;
TEI-ML-M10
Minutes of the meeting held at the Commission of the European
Communities, Luxembourg,
October 18-20, 1989
Lou Burnard
Present: David Barnard (DB), chair; Lou Burnard (LB);
Jean-Pierre Gaspart (JPG); Lynne A. Price (LAP); Michael
Sperberg-McQueen (MSM); Nino Varile (NV).
Introductory business
Administrative matters
DB welcomed the committee. A new address list had been prepared
[attached to these minutes]; any further alterations should be
sent to DB as soon as possible. MSM outlined the limits on and
procedure for
claiming reimbursement of expenses; it was noted that the
situation may change with new funding arrangements. DB asked
all committee members to inform him of their likely expenses
well in advance of meetings to enable him to administer the
budget allocated to the committee effectively.
Minutes of previous meeting [ML M 1]
`DRTD' (page 3) should read `DTD'.
LB confirmed that he and MSM were still working on a tagset
for internal TEI usage.
The status of decisions made at the previous meeting was
reviewed. On non-substantive issues, the committee agreed to
accept the Chair's ruling. On substantive issues, it was
agreed that a decision accepted by a two-thirds majority of
the whole committee should be binding. Members who had not
voted on a particular issue would be given a fixed period
after notification of the decision in which to inform the
chair of any disagreement; silence would be interpreted as
consent. This led to a brief discussion of absenteeism: it
was agreed that two unexplained absences from meetings would
constitute resignation.
It was agreed that document ML W1 had not yet been adopted
by the committee: substantial revisions had been requested at
Toronto which had not yet been carried out. Most of these were
subsumed into the discussion of the replacement for this
working paper, now renumbered as ML W13.
Document numbering
A new document numbering scheme was proposed. All papers
before the committee would be consecutively numbered, and each
would bear a single letter prefix (W for working papers, A for
agenda lists, M for minutes). There would be a standing
agenda item at each meeting to check the status of all
documents currently before the committee and to note any new
ones. DB agreed to produce a new document list including all
existing and proposed papers [attached]. It was also agreed
that documents circulated outside the committee should be
credited to the whole committee as author or editor;
individual authorship should be noted only within the
committee.
Statement of Work
The Committee's plan of work, as presented in TEI documents
SC/G10, ED/P2 and ML/R2, was reviewed. DB stated that the
committee had two main responsibilities: firstly to advise the
other committees on optimal usage of SGML, and secondly to
survey existing encoding schemes with a view to recasting
them. There was some discussion of what was entailed by
`recasting': at one extreme it might simply be a lookup table
or an SGML short reference map; at the other a much more
complex scheme might be needed. Information-loss should be
avoidable when going from one exogenous scheme into TEI, but
would not be when going from TEI to a comparatively
impoverished scheme. LAP asked whether software would be
produced to do the conversions. DB said this was not the
committee's responsibility: it needed only to produce generic
specifications for transforming between a representative set
of encoding schemes and the tagsets and DTDs defined by the
TEI. Political considerations were involved in determining
what would be an appropriate set to consider, as well as
technical issues. JPG noted that some candidate encodings were
very application specific and that hence some consideration of
the function of the markup was necessary, citing macro
packages such as LaTeX. It was agreed to produce a list of
candidate encoding schemes and to revise the Statement of
Work.
LB to produce categorised list of candidate markup schemes
[ML W12, 19-Oct-89]
DB to revise statement of work [ML W3, 27-Oct-89]
Guidelines
It was noted that a paper setting out general Guidelines on
the usage of SGML features was urgently needed by the other
committees. The existing draft working paper (TEI ML W1, now
ML W13) needed considerable revision and extension. A detailed
discussion ensued.
General principles
JPG pointed out that different situations might require
different recommendations: in particular, features appropriate
for capture or processing might not be appropriate in
interchange or storage. Although the CALS and AAP standards
had been proposed for interchange only, people still used them
for data capture, for which they were less suitable. It was
agreed to structure discussion by drawing up a matrix of SGML
features categorised by their suitability for data capture,
interchange, storage and processing.
As a general principle, the committee felt that anything which
could be expressed using SGML should be. Similarly,
documentation should make clear that only those features which
are defined using SGML could be relied on for interchange
purposes.
DB proposed that the document should also address the
importance of using software to check the syntax of SGML
encoded texts. It was recognised that users of the TEI
Guidelines might use many intermediate software environments,
but the committee agreed that DTDs developed for the project,
and documents claiming to be TEI-conformant, would have to be
validated by full SGML parsers, and resolved to caution the other
committees against under-estimating the complexity of this
process. JPG and LAP both pointed out that producing software
capable of checking SGML syntax correctly was a far from
trivial task, and LAP agreed to draft a few pages setting out
the difficulties of doing so.
LAP to draft a memorandum on the difficulties of SGML
parseability [?, ?]
DB to produce new draft of Guidelines [ML W13, 1-Nov-89]
Discussion of specific features
SHORTREF and CONCUR
JPG proposed that SHORTREF should be recommended for use only
in data capture; LAP that it was appropriate at all times, and
could lead to significant savings in storage costs as well as
convenience in input. JPG noted that SHORTREF and SHORTTAG
could apply only to the base tagset, suggesting that their use
by applications using CONCURrent DTDs might be problematic.
MSM said that potentially any element might need to CONCUR. DB
agreed that the only alternative to using CONCUR would be to
define something else with a similar functionality. LAP
proposed that any problems the group identified in using the
feature should be passed back to WG8.
Attributes
It was agreed that, although formally equivalent, attribute
values could be used in preference to element content in order
to add information to a view. Even within a single view it
might still be difficult to decide what was content and what
was process-specific information.
JPG to draft recommendations on the appropriate use of
attributes [W13, ?]
Inclusion/Exclusion Exceptions
Recommendations on when these might profitably be used were
still needed.
JPG to draft recommendations on the appropriate use of
exceptions in content models [W13, ?]
SUBDOC
This feature might provide an alternative to some uses of the
CONCUR feature. It allowed an entity reference to be replaced
by a sub document with a distinct environment, entirely
replacing that of the base document type, and with no
easy way of communicating information (e.g. the target of IDREF
attributes) between the two environments. How this feature
relates to CAPACITY is not clear. Guidance on its usage would
be useful, particularly for the A&I Committee. LAP proposed that entity
references the text of which were subdocuments should appear
only as attribute values rather than as content, which met
with general agreement.
NV to draft a paragraph on the use of SUBDOC [ML W13, ?]
APPINFO
MSM asked whether this might be an appropriate feature to use
as a means of providing aliases for tagnames and other GIs.
LAP said parameter entities or short refs would be a better
solution to this problem. APPINFO merely provided commentary
on the environment described by the DTD in which it appeared,
and not its application.
OMITTAG
There was continued discussion on appropriate use of this
minimization feature, and it was agreed that clear
recommendations were not easy to make. Part of the problem was
that different recommendations were appropriate for private
and public use of documents, a distinction which not all
wished to make. LAP circulated some `Notes on Markup
Minimization and Attributes' [document ML W9] which would be
incorporated in the draft for W13. Further information on the
use of SHORTTAG was requested.
LAP to expand her notes to include recommendations on use of
SHORTTAG [W13, 1-Nov-89]
LINK
JPG stated that LINK was the only mechanism provided by SGML
for relating different document types. As implemented by the
SOBEMAP parser the feature allowed the DTD designer to
associate semantic actions with any element, for example to
define formatting. LAP was opposed to its use on the grounds
that SGML was intended to separate semantics from markup, that
it was currently the subject of some concern within WG8 and
that as currently defined it was defective in several
respects. After some discussion, it was agreed to defer
decision on the use of this feature.
Concrete syntax
It was agreed that working committees should not deviate from
the reference concrete syntax without strong motivation. To do
so would require transmission of a default SGML declaration
with each document instance. Concern was expressed over the
default namelength of the current syntax which was felt to be
inadequate.
Quantity and capacity
The existing defaults were reviewed briefly: it was agreed
that we would need to alter only namelength, for which a value
of 128 was proposed as default, and possibly the level to
which entities could be nested, for which 16 seemed a little
low.
Naming rules
It was agreed to encourage consistency in naming rules as far
as possible. In particular the existing defaults for case
sensitivity (sensitive in entity names only) should be adhered
to and existing defined entity names should be used. The
character set used for names should as far as possible be the
same as that used for the document: it was recognised that
this might give rise to problems in documents using 8-bit
character sets, and the committee asked the Text
Representation Committee to address the problem of
representing names in such documents, to consider the SGML
syntax status of each character as well as its collating
position etc. and to address the best way of defining
translations between identical sets of names in different
languages. A need for globally transforming names was
identified.
In discussion of the FORMAL feature, it was noted that no
decision had yet been taken as to whether the TEI scheme would
be registered with the relevant standards bodies.
MSM to raise with Steering whether formal registration of the
TEI Guidelines was intended [?]
It was agreed that entity names should be used without a
doctype qualifier only if they had the same replacement value
in all TEI doctypes. The special case of entity references
expanding to IGNORE or INCLUDE when used with marked sections
was noted: the editors would need to maintain consistency in
this context as there was no way of including a doctype
identifier in this case.
Conclusions
DB agreed to produce a new set of draft recommendations by
November 1st. Fuller justification would be left to a later
date. The document would be distributed to JPG and LAP by FAX,
by email to other members. Two weeks would be allowed for
comments. Committee members were asked to acknowledge receipt
of the draft immediately.
SGML Bibliography
Production of this was well advanced. Robin Cover had produced
a substantial amount of information which was being merged
with DB's current file and would be distributed as a working
paper in SGML form shortly.
DB to distribute SGML bibliography [ML W14, 15-Nov-89]
Introductory Guide to SGML
There was some discussion of the need for a very elementary
guide to SGML including illustrative material relevant to the
TEI's concerns. The diversity of other committee members'
backgrounds and expertise in formal language theory was noted.
It was suggested that the chapter on SGML in the FORMEX manual
might be a suitable basis.
LB to draft "Idiots' Guide" to SGML [ML W15, 1-Dec-89]
NV to check on copyright status of FORMEX Guide [ML W15, ?]
Programmers' Guide to SGML
MSM proposed that a document introducing the major concepts of
SGML aimed at document designers and programmers would also be
useful, analogous to Bryan's book but less biassed to
publishing applications. LAP suggested that our working papers
would form a good basis for such a text. DB agreed that ML W13
should cover all that was necessary for those wishing to
design TEI-conformant DTDs.
LAP to report on the availability of relevant work done
previously at Hewlett Packard [ML W16, 24-Oct-89]
LAP, with MSM and David Durand as back-up, to consider writing
a technical introduction to the use of the TEI Guidelines
[ML W16, ?]
Other Encoding Schemes
LB presented verbally an initial draft of ML W12. The
Committee worked through a long list of encoding schemes,
categorising each as (a) one on which work towards specifying
a transformation would be carried out (b) one on which such
work was not judged necessary (c) one on which no further work
was currently planned. [The results are presented in the draft
of ML W12 appended to these minutes] It was noted that Nancy
Ide (NI) had expressed willingness to work in this area and
she was requested to form a working group. NV noted that
recoding schemes for a number of wordprocessing systems were
already required for the Eurotra project. DB proposed that each
scheme on which work was required should be considered on an
ad hoc basis. It might be possible simply to specify an SGML
declaration and DTD for some; others might need simple string
substitutions; for yet others general purpose tools for
lexical analysis such as YACC and Lex might be required. JPG
felt that the issue should not be oversimplified but the
consensus was that simple tasks should be addressed by simple
tools. DB noted that there were already several volunteers
willing to work in this area and others might be forthcoming.
He undertook to ask NI and FT to co-ordinate the work.
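The simplest case mentioned, recoding by string substitution from a lookup table, might look roughly like the following Python sketch. The source codes and target tags are hypothetical, chosen only for illustration; no real encoding scheme or TEI tagset is implied.

```python
# Hypothetical source codes on the left, hypothetical target tags on the right.
RECODING_TABLE = {
    "@h1": "[head]",
    "@p": "[p]",
    "@it+": "[emph]",
    "@it-": "[/emph]",
}


def recode(text, table=RECODING_TABLE):
    """Recast a text by replacing each source code with its target tag."""
    # Longest codes first, so a short code cannot clobber part of a longer one.
    for code in sorted(table, key=len, reverse=True):
        text = text.replace(code, table[code])
    return text


print(recode("@h1Title @pSome @it+fine@it- text"))
```

A table of this kind covers only schemes whose codes map one-to-one onto tags; anything context-dependent would need the heavier tools mentioned in the discussion.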
NI, with assistance from Frank Tompa (FT), to form working
group with responsibility for addressing tasks specified in
ML W12 [ML W12, ?]
Stylistic Guide
LB suggested that a separate document summarising TEI `house
style' for DTD-definers might be useful. This was agreed,
though the difficulty of formulating such rules was noted. It
should for example summarise naming conventions, recommend
appropriate grouping levels for content models and propose a
standard ordering for DTD statements. DB suggested that much of
this material might be included in W13. MSM proposed that it
should be a separate document and suggested David Durand (DD)
as a suitable author.
DB to contact DD to discuss preparation of Stylistic
Guidelines [ML W17, ?]
Input to WG8
The committee agreed that a formal mechanism for conveying
problems identified with the current SGML standard to the
relevant ISO working groups was highly desirable. Two
particular problems (namelength and doctype qualifier on
marked sections) had already been identified and it seemed
likely that there would be more. It was agreed that the
Steering Committee should be asked to convey detailed
proposals for revisions to ISO 8879, the five year review of
which was due in 1991, and that the Committee would attempt to
draft such proposals.
DB to ask Steering to communicate with WG8 [?, ?]
Next Meeting
Planned dates for next meeting are 16-19 February 1990, in
Chicago. Date and place to be confirmed by 1 Dec 1989.
MSM to confirm date and time of next meeting [1-Dec-89]
Questions raised by other committees
Due to other commitments, DB had to leave the meeting after
the second day. The third day of the meeting was spent
discussing in some detail some specific problems raised by
members of other working committees. MSM characterised the
problem areas on which guidance was needed as follows:
- Arbitrary segments which did not seem to be really
text elements, for example in discourse or stylistic analysis.
- Discontinuous segments where elements were interrupted by
other elements, for example in morphological or conversational
analyses.
- Ambiguity, where more than one structural analysis might
apply to a given text segment.
- Overlap, where two or more overlapping structures were to
be tagged across the same text.
- Synchronous parallel structures.
- Transcription mapping.
- Cross references both within a document and outside it.
- Vagueness, where the feature to be encoded could not be
exactly localised.
Arbitrary segments
Examples quoted were: this is where the transcript is
inaudible; this is where he moved to the window. JPG suggested
that each such segment should be regarded as a separate
automaton. CONCUR could only be used if the number of
different segment types was in principle bounded and quite
small. Alternatively he proposed empty tags marking either end
of the segment, with attributes to indicate their type. For
example, Nino's moves around the room might be represented in
a document with content model of (#PCDATA | move) by elements
tagged [(nino)move] .... [/(nino)move] or by elements tagged
[move person=nino start] ... [move person=nino end]. It was
noted that the first mechanism could be generated from the
second, but that the second was more general. The advantage of
the first was that the parser could check that tags were
properly matched etc. A third possibility was to create the
total intersection of all identifiable segments, subdivided at
each segment boundary and grouped at a higher level.
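The observation that the first form could be generated from the second can be illustrated with a small Python sketch. It assumes the bracketed notation of the examples above, with unquoted attribute values; the function name and the regular expression are inventions for this illustration. It checks only that every end marker closes a segment that is open, not that segments nest properly.

```python
import re

# Matches the second form: [move person=nino start] or [move person=nino end].
MILESTONE = re.compile(r"\[(\w+) person=(\w+) (start|end)\]")


def milestones_to_paired(text):
    """Rewrite start/end marker tags as paired tags qualified by person."""
    open_segments = []  # (element, person) pairs currently open

    def replace(match):
        name, person, edge = match.groups()
        if edge == "start":
            open_segments.append((name, person))
            return f"[({person}){name}]"
        # An end marker must close a segment that is actually open.
        if (name, person) not in open_segments:
            raise ValueError(f"end marker without start for {person}")
        open_segments.remove((name, person))
        return f"[/({person}){name}]"

    result = MILESTONE.sub(replace, text)
    if open_segments:
        raise ValueError(f"segments never closed: {open_segments}")
    return result


print(milestones_to_paired(
    "[move person=nino start]to the window[move person=nino end]"
))
```

The matching check here is done by the conversion itself; in the first form the same check would fall to the SGML parser.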
Discontinuous segments
These could be handled by co-indexing using the id/idref
mechanism of SGML. To avoid the need to treat the first part
of a discontinuous segment differently from the rest, it could
be set up initially. The example discussed was of
discontinuous segments relating to a particular topic.
Provided that the set of topics could be predefined, these
could be listed in a separate TOPICLIST element, each member
of which could be allocated an ID. Each occurrence of a topic
in the text itself would then be linked back to the
appropriate element by an IDREF.
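A rough Python sketch of the co-indexing idea follows, assuming a predefined topic list and a sequence of text segments each carrying a reference to one topic. The data, the IDs and the function name are all hypothetical.

```python
from collections import defaultdict

# Stand-in for a TOPICLIST element: each predefined topic is allocated an ID.
topic_list = {"t1": "weather", "t2": "travel"}

# Text segments in document order; each carries an IDREF pointing at a topic.
segments = [
    ("t1", "It rained all week,"),
    ("t2", "so we took the train,"),
    ("t1", "and the wind never dropped."),
]


def collect_by_topic(topics, segs):
    """Reassemble discontinuous segments by topic, keeping document order."""
    grouped = defaultdict(list)
    for idref, text in segs:
        if idref not in topics:
            # Corresponds to an IDREF that matches no declared ID.
            raise KeyError(f"IDREF {idref!r} has no matching ID")
        grouped[topics[idref]].append(text)
    return dict(grouped)


print(collect_by_topic(topic_list, segments))
```

Because every occurrence points back to the list, the first part of a discontinuous segment is treated no differently from the rest.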
It was agreed that this method was probably not appropriate
for all examples of discontinuity. For micro discontinuities
such as those found in morphology, a simpler solution might be
to introduce some redundancy into the text. For example, the
lemmatised form KTB of the Arabic word al-kaatib might be
represented either as an attribute [word root=KTB]al-kaatib
[/word] or as an additional element [word] [root]KTB
[form]al-kaatib [/word]. The first was probably to be preferred,
since
lemmatisation was a process applied to the text rather than a
part of its content.
Ambiguity
Five mechanisms were outlined for dealing with ambiguities
such as `I saw the man with the telescope'. The two parse
trees for this sentence - ((I) (saw ((the man) with the
telescope))) and ((I) ((saw (the man)) with the telescope)) -
could be represented independently, but without repeating
their points of overlap, by using CONCUR. Or they could be
repeated as alternative marked sections.
The third possibility was to think of each parse tree as a
graph, in which the nodes were the boundaries between
syntactic units and the arcs the syntactic interpretation
placed on them. A sentence could then be represented in SGML
as an ordered list of words, each with its own ID. The arcs of
each syntactic graph for the sentence could then be
represented by empty elements, using idrefs to point back to
the words. A simpler representation might be to use a special
notation (which could not therefore be checked by the SGML
parser) for the parse tree value and specify it as an
attribute. The fifth possibility was to treat all parse
subtrees as arbitrary segments, as outlined above.
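The graph representation (the third possibility) can be sketched in Python as follows. The word IDs, the category labels NP and VP, and the tuple layout are assumptions made for this illustration; each tuple stands in for an empty element whose attributes point back to words by IDREF.

```python
# The sentence as an ordered list of words, each with its own ID.
WORDS = [
    ("w1", "I"), ("w2", "saw"), ("w3", "the"), ("w4", "man"),
    ("w5", "with"), ("w6", "the"), ("w7", "telescope"),
]
POSITION = {word_id: i for i, (word_id, _) in enumerate(WORDS)}

# Each arc: (label, IDREF of first word, IDREF of last word).
# Reading A: ((I) (saw ((the man) with the telescope)))
READING_A = {
    ("NP", "w1", "w1"), ("NP", "w3", "w4"),
    ("NP", "w3", "w7"), ("VP", "w2", "w7"),
}
# Reading B: ((I) ((saw (the man)) with the telescope))
READING_B = {
    ("NP", "w1", "w1"), ("NP", "w3", "w4"),
    ("VP", "w2", "w4"), ("VP", "w2", "w7"),
}


def span_text(arc):
    """Return the words covered by an arc, in sentence order."""
    _, first, last = arc
    return " ".join(w for _, w in WORDS[POSITION[first]:POSITION[last] + 1])


shared = READING_A & READING_B   # arcs common to both analyses
only_a = READING_A - READING_B   # arcs peculiar to reading A
only_b = READING_B - READING_A   # arcs peculiar to reading B

for arc in sorted(only_a | only_b):
    print(arc[0], "->", span_text(arc))
```

The words are stored once, and the two readings differ only in a single arc each, which is the economy the graph representation was meant to offer.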
The committee felt that more specific examples would be
needed before clear recommendations could be given as to which
of these mechanisms should be preferred.
Overlap
The example cited -- she took advantage of Joan -- seemed to
be an instance of arbitrary sectioning. The A&I Committee was
requested to provide precise illustration of the Japanese
biclausal analysis referred to in document TEI AI M1.
Cross references
The id/idref mechanism of SGML seemed adequate in the absence
of more specific problems.
Synchronisation of multiple transcriptions
JPG pointed out that so long as order was preserved,
parallelism between two synchronous structures could be
implied. If order was not preserved, parallelism would need to
be explicitly tagged by using idrefs.
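The distinction can be sketched in Python, assuming each transcription is held as a list of units. The function names and sample data are hypothetical.

```python
def align_by_order(first, second):
    """Pair units by position; valid only when both keep the same order."""
    if len(first) != len(second):
        raise ValueError("lengths differ; order alone cannot align them")
    return list(zip(first, second))


def align_by_idref(first_by_id, second):
    """Pair units through explicit references when order is not preserved.

    first_by_id maps an ID to a unit; second is a list of (idref, unit).
    """
    return [(first_by_id[idref], unit) for idref, unit in second]


# Order preserved: the pairing is implied by position.
print(align_by_order(["a1", "a2"], ["b1", "b2"]))

# Order not preserved: each unit names its counterpart explicitly.
print(align_by_idref({"x1": "a1", "x2": "a2"}, [("x2", "b2"), ("x1", "b1")]))
```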
Vagueness
Various examples of vagueness were discussed inconclusively.
In most cases the mechanisms proposed for arbitrary and
discontinuous segments seemed adequate. Two different examples
were requested, together with some indication of the purpose
for which such features might be tagged.
CDATA
LAP gave a detailed clarification of the use of the CDATA
keyword, as declared content for an element, declared value
for an attribute, as an entity or as the effective status keyword
in a marked section. JPG described the interaction between the
use of CDATA and of marked sections. It was agreed that CDATA
and RCDATA should be avoided as declared element content,
unless in a non-SGML environment or as a marked section status
keyword.
Processing Instructions
It was agreed that these should be avoided, except for a few
specific purposes such as returning the system date to a
document instance, in which case the value returned should be
declared as SDATA to avoid changing the parse state.