*.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1
.* Document proper begins.
The Text Encoding Initiative: a further report
<author>Lou Burnard Oxford University Computing Services European Editor, TEI
<date>August 1991
</front>
<body>
<div1><head>The story so far
<p>In July 1990, the first phase of the work of the Text Encoding Initiative<note>The Text Encoding Initiative is an international research project, the aim of which is to develop and to disseminate guidelines for the encoding and interchange of machine-readable texts. It is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The project is funded by the U.S. National Endowment for the Humanities, DG XIII of the Commission of the European Communities, the Canadian Social Science and Humanities Research Council and the Andrew W. Mellon Foundation. Equally important has been the donation of time and expertise by the many members of the research community who have served on the TEI's Working Committees and Working Groups.</note> came to an end with the publication of TEI P1: <cit>Guidelines for the encoding and interchange of machine-readable texts</cit> <note>Edited by Lou Burnard and C.M. Sperberg-McQueen (Chicago and Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1990). Summary articles have appeared in <cit>Humanistiske Data</cit> 3-90; <cit>ACH Newsletter</cit> 12 (3-4); <cit>EPSIG News</cit> 3(3); <cit>SGML Users Group Newsletter</cit> 18; <cit>ACLS Newsletter</cit> 2/4 and elsewhere. Fuller discussion articles on the TEI include <q>Texts in the electronic age: textual study and text encoding, with examples from medieval text</q>, C.M. Sperberg-McQueen, in <cit>Literary and Linguistic Computing</cit> (6.1, 1991) and <q>What is the TEI?</q>, Lou Burnard, in D. Greenstein, <cit>Modelling Historical Data</cit> (Goettingen, St.
Katharinen, 1991)</note>. In a presentation at the Berlin ICAME meeting (to be published as <q>The TEI: a progress report</q> in <cit>G. Leitner, Proceedings of the 11th ICAME Conference</cit> (Berlin, 1992)), I reported on the main components of the <q>Guidelines</q>, drawing attention to their relevance for corpus linguistics. The present report briefly summarizes developments during the following year and outlines the plan of work for the remainder of the project, which is due to end in June 1992.
<p>In November 1990, P1 was reprinted with several minor corrections, mostly of a typographic nature. By February 1991, about 400 copies had been distributed in Europe, nearly half of these outside the UK, despite the fact that the report is published in English<note>A translation of the report into Spanish was funded by the EC in the autumn of 1991</note>. About 50 copies were distributed to different sites or individuals in Germany, where an informal TEI discussion group was set up at Goettingen; about 40 each in the Netherlands and in France; 14 in Sweden, 11 in Italy. Copies were also requested from a further 29 countries, including Hong Kong, Bulgaria, Thailand and Kenya. About 500 copies were also distributed in North America during the same period. A second reprint of the Guidelines turned out to be necessary later in 1991: in all, over a thousand copies of what was intended to be an initial draft have now been distributed worldwide.
<p>It was clear from the start that this large and somewhat indigestible volume would not suffice to introduce the basic notions of the TEI to an audience not already immersed in the topic. The production of a tutorial guide, however, presupposes the existence of a more or less stable set of topics. Partly in order to determine what that set of topics should be, and partly in response to pressure from the community out of which the TEI had sprung, a series of introductory workshops was organised in early 1991.
The first of these was held in conjunction with the annual ALLC/ACH Conference in Tempe (Arizona) in April 1991, for which the first version of a short tutorial guide, titled <q>Living with the Guidelines</q>, was drafted. With minor revisions and expansions this was also used at two subsequent public workshops, held at Oxford and Brown Universities, in July 1991.
<p>As noted in the Berlin presentation, the first draft was far from complete or final, and it was recognised that substantial work on reviewing and evaluating the Guidelines would be necessary before the end of the project. To that end, three distinct types of activity were instituted, which form the subject of this report. The first was to solicit detailed comment at an individual or institutional level from all those who had received copies of the draft. The second was to set up an institutional framework within which specific research projects could formally 'affiliate' with the TEI and put its proposals to the test of real life data. The third was to set up a large number of specialist working groups, each charged with a specific area in which it was recognised that the provisions of the first draft were incomplete or nonexistent. On the basis of reports from these three activities, it is planned that the next draft of the Guidelines will be available for a further phase of public comment by early 1992, before publication of the revised Guidelines in June 1992, when the current funding cycle comes to an end.
<div1><head>User Response and Comment
<p>The obvious source for initial comment and review of the Guidelines has been the public at large, which has reacted with surprising and occasionally embarrassing enthusiasm to the original draft proposals. Over a hundred sets of individual comments and criticisms, some quite detailed, have already been received, and it is hoped that this public discussion will continue during the remainder of the project's lifetime.
For some reviewers the proposals have been too much focussed on literary or book-oriented texts; for others too much on linguistic or computational matters; for some they have been too technical, for others too superficial. The distribution of these criticisms seems to indicate that a reasonable degree of balance is being achieved. For almost every reviewer there was some topic not covered which should have been (a problem which is addressed below), but also some topic which seemed irrelevant. A small but vocal body of criticism seemed to be hostile to the very notion of marking up text at all.
<p>It has often been remarked that standards cannot be enforced by fiat: they must be accepted voluntarily if they are to achieve any permanent standing. To that end, the TEI has always been anxious to stimulate informed discussion of its proposals in as many and as diverse forums as there are listeners willing to hear. Presentations and publications such as the present one are one way of achieving that end.
<div1><head>Affiliated Projects
<p>When the TEI was first announced, a number of major research projects expressed willingness to co-operate with it in some loosely defined way. During the phase of the project discussed in this report, a slightly more formal method of carrying out that co-operation was devised, and proposed to some fifteen so-called <q>affiliated projects</q>. This involves, amongst other things, the testing of the Guidelines against realistically sized samples of the various textual resources which each project is engaged in creating. To aid in that task, the TEI provides the project with a <q>consultant</q> who is charged to produce a detailed report on the problems encountered and solutions found in making the project `TEI-conformant'.
<p>The list of research projects which have affiliated with the TEI is an impressive one, and includes several major corpus building initiatives, specifically the British National Corpus (Universities of Oxford and Lancaster, OUP, Longmans, British Library); the ACL Data Collection Initiative; the Bar Ilan Hebrew Database; the Leiden Armenian Database; the Stockholm-Umeå Swedish Text Corpus; and the Network of European Research Corpora. Major text-collecting projects represented include the Chicago-based American Research on the Trésor de la Langue Française project; the Brown University Women Writers Project; and Harvard's Perseus Project. Institutions formally affiliated include the Georgetown Center for Text Technology and the Institute of Formal/Applied Linguistics (Charles University, Prague). Specific and more focussed research projects affiliated with the TEI include the Vassar/CNRS Electronic Dictionary Project, the Electronic Middleton Edition (Brandeis University), the Electronic Milton Project (Ohio University/University of Wales), and the Nietzsche Project (Stanford University).
<p>As a preliminary step, representatives from each project were invited to one of two special three-day workshops, one in Europe and the other in North America, to pool experience and reaction to the Guidelines. These workshops, held in conjunction with the public introductory workshops mentioned above, proved to be a very useful means of identifying both common problems in the use of the Guidelines, and needs for extensions to them. In the short term, the value of the Affiliated Projects to the TEI is their provision of a serious test bed for the original proposals; in the longer term, they will demonstrate the extensibility of the scheme to widely diverging areas.
So far, detailed reports have been forthcoming from only one or two of the Projects, but several have already produced extremely valuable input to the future development of the Guidelines, and it is reasonable to assume that the working relationships set up within the current project will continue beyond its end.
<div1><head>Workgroups
<p>Experience gained during the first phase of the TEI project indicated that detailed proposals are more often obtained from comparatively small, narrowly-focussed groups than from comparatively large and disparate <q>brainstorming</q> groups, although the latter procedure had proved of immense value in setting the TEI proposals on a firm footing. It was moreover reasonably clear in which areas the Guidelines could benefit from expansion or further work. For these reasons, the mode of operation in the second phase of the project changed from that of the first. Of the four drafting committees of phase one, only one, that on Metalanguage and Syntax issues, continued to meet regularly. The committee on Text Documentation, originally funded for one year only, was inactive until the end of 1991, when it met to review how proposals from the rest of the project could best be incorporated in the TEI header. The other two committees became effectively parent committees to a large number of small, more narrowly-focussed work groups.
<p>Each workgroup was set up initially with a specific charge, usually to make recommendations for action in a particular area, and requested to report within a short time scale. Membership of workgroups was kept deliberately small, for the most part, while at the same time attempting to achieve a balance both in terms of geography and in terms of theoretical allegiances. As ever, the object of the TEI was not to legislate in areas of fundamental disagreement, but to formalise a consensus where that can be obtained.
Most workgroups addressed well-defined areas and were consequently able to produce some quite specific recommendations; others (`planning-level groups') were asked specifically to advise on the areas of a given specialisation within which work was needed.
<p>The remainder of this section describes briefly the objectives of each workgroup: most groups were in place by the summer of 1991, though few had made substantial progress at the time of this report.
<list type=gloss>
<label>TR1: Character sets
<item>This workgroup, chaired by Harry Gaylord (University of Groningen), is responsible for testing and extending P1's coverage of character set problems. Its objectives include the creation and testing of TEI-conformant Writing System Declarations for the nine languages of the EC in their modern and ancient forms, and for Slavic, Hebrew and Arabic, and an evaluation of other transliteration schemes for non-Latin alphabet scripts. It will also consider TEI compatibility with both ISO 10646 and UNICODE proposals and make recommendations for the encoding of the International Phonetic Alphabet in a TEI-conformant manner.
<label>TR2: Text Criticism
<item>This workgroup, chaired by Peter Robinson (University of Oxford), is responsible for reviewing and revising P1's recommendations for the encoding of text-critical apparatus. Its objectives also include the provision of sets of tags for such forms of critical apparatus as editorial interventions, dubious readings and cruxes, lacunae, historical-critical commentary, glossarial notes, polyglot versions etc.
<label>TR3: Hypertext and hypermedia
<item>This workgroup, chaired by Steven De Rose (Electronic Book Technology), is responsible for reviewing P1's recommendations with respect to hypertextual links and cross-references. Its objectives include evaluations both of available mechanisms for loading texts marked up according to P1 into existing hypertext systems (e.g.
HyperCard, Guide) and of P1 itself as an interlingua for such systems. Its primary activity will however be to liaise with the ANSI X3V8 committee on Hypermedia (HyTime) with a view to long-term compatibility.
<label>TR4: Mathematical formulae and tables
<item>This workgroup, chaired by Paul Ellison (University of Exeter), is responsible for extending P1's currently rather sparse suggestions for handling mathematical formulae and tables. It will liaise closely with other initiatives (e.g. the American Mathematical Society, the EuroMATH Project, the American Association of Publishers) also working in this area, survey existing schemes and attempt to harmonise these within the P1 framework. It will also propose ways of encoding mathematical formulae descriptively, rather than for presentational purposes, and recommend methods of including graphics within documents in existing software systems.
<label>TR6: Language Corpora
<item>This workgroup, chaired by Douglas Biber (University of Northern Arizona), is responsible for extending the current provisions of P1 with respect to the encoding and organization of language corpora. Its specific objectives include detailed recommendations for text classification (e.g. according to text type, subject, socio-linguistic stratum, etc.), based on a survey of existing text-classification schemes. It will also report on existing schemes for linguistic annotation.
<label>TR8: Physical description of mss and incunabula
<item>This workgroup, chaired by Jacqueline Hamesse (University of Louvain la Neuve), is charged with making recommendations for a substantial area omitted from P1, namely that of the physical description of manuscript materials. Although this area is of clear importance to a significant proportion of the research community, organizational problems delayed the setting up of this workgroup to the point where its recommendations are unlikely to be incorporated in the June 1992 draft.
Similar problems have delayed the setting up of an analogous workgroup to deal with printed materials and with analytic bibliography. It is expected that work in these areas will be continued after the lifetime of the current project.
<label>AI1: General linguistics
<item>This workgroup, chaired by Terry Langendoen (University of Arizona), is responsible for developing and enhancing the basic linguistic annotation tools proposed in P1. In particular it will develop compact and efficient notations for the representation of linguistic analyses as bundled sets of feature-structures, and for the representation of directed acyclic graphs, alignments, trees and other commonly-used linguistic formalisms. It will also propose ways of unifying linguistic description, specifically for part-of-speech and morphological tagging, by means of the feature-structure formalism.
<label>AI2: Spoken texts
<item>This workgroup, chaired by Stig Johansson (University of Oslo), is charged with extending P1 to deal with texts which are transcriptions of spoken language, an area previously not addressed at all. It will do this by first surveying existing practices in some detail, and then proposing ways of encoding transcribed speech for interchange purposes in line with the recommendations of P1.
<label>AI3: Literary studies
<item>This workgroup, chaired by Paul Fortier (University of Manitoba), is a planning level group requested to propose a work plan for the definition of those areas of interest to literary scholars not already catered for in P1, for example narrative structure, thematics, metrics, stylistics, literary influence, authorship attribution, reader response theory etc. The group carried out a detailed opinion survey by means of questionnaire, and produced a report in July 1991 recommending that work groups be set up to provide detailed recommendations in three specific areas.
These groups did not however come into being until after the period covered by this report and are not therefore discussed here in any detail. For completeness, the groups set up were TR10: verse texts (chaired by David Robey, University of Manchester); TR11: performance texts (chaired by Elli Mylonas, Harvard University); and TR12 (chaired by Tom Corns, University of Wales).
<label>AI4: Historical studies
<item>This workgroup, chaired by Daniel Greenstein (University of Glasgow), was another planning level group, organised jointly with the Association for History and Computing. Its objectives included the proposal of lists of source types of particular interest to historians and the definition of the specific textual elements included within them. A volume of very detailed discussion papers arising from the meetings of the group was published<note>Greenstein, op. cit.</note> in August 1991, shortly before the AHC's annual conference at which the proposals were presented in some detail. Again because of organizational constraints, more detailed work will continue beyond the present funding period.
<label>AI5: Machine-readable dictionaries
<item>This workgroup, chaired by Robert Amsler (Mitre Corporation), continues work incomplete at the time of publication of P1 on the provision of a general purpose tagset capable of encoding existing printed monolingual and multilingual dictionaries without loss of information. It was charged also to consider the problems of encoding early modern dictionaries, and etymological, historical and other special purpose dictionaries.
<label>AI6: Computational lexica
<item>This workgroup, chaired by Robert Ingria (BBN), was charged with reporting on the usage and structure of existing machine-readable lexica (that is, lexical data files used in existing natural language processing systems) with a view to assessing the possibility of defining a TEI-conformant interchange format for them.
<label>AI7: Terminological databases
<item>This workgroup, chaired by Alan Melby (Brigham Young University), continues work carried out within the context of INFOTERM and other international efforts concerned with the standardisation of terminology. Its objectives include a report on the structure of existing terminological databases and the proposal of a set of TEI-conformant tags or DTDs for use in the interchange of terminological data.
</list>
<div1><head>Towards P2
<p>It is hoped that reports and recommendations from all of the above-listed workgroups will be available by the end of 1991. A technical review conference is planned for the autumn, in which the heads of all existing work groups will meet to finalise their recommendations and to survey the material available from elsewhere for revision of P1. Initial reports from the affiliated projects should be available within the same timescale. By bringing all of this material together, it is hoped that the next version of the Guidelines, P2, when it appears in early 1992, will reflect the consensus of a larger, more broadly-based research community than was possible for the first draft.
<p>For further and up-to-date information about the progress of the TEI, please subscribe to the public discussion list TEI-L@UICVM.BITNET or get in touch with either of the TEI editors, Michael Sperberg-McQueen (U35395@UICVM.BITNET) or Lou Burnard (LOU@UK.AC.OXFORD.VAX).