TCR01: Report of the TEI Council 2003 – TEI: Text Encoding Initiative

Report of the TEI Council to the Members Meeting 2003

[This report summarizes activities of TEI-funded workgroups and TEI-chartered Task forces reporting to the TEI Technical Council during the year ending December 2003. There will be an informal panel session at the TEI Members Meeting during which Chairs of workgroups and task forces will be pleased to discuss the activities reported on here with members.]

Character Encoding Workgroup

Chaired by Christian Wittern, charged July 2001

This workgroup was charged by the TEI Council to revise those areas of the TEI Guidelines that deal with representation of characters, languages and writing systems, which includes the current Chapter 4 (Languages and Character Sets) and Chapter 25 (Writing System Declaration).

Since its inception in the summer of 2001, the Character Encoding Workgroup has met twice face to face, in October 2001 in Berkeley and in July 2002 in Tübingen. Over the past year, business has been conducted by email, but another face-to-face meeting is planned immediately prior to the Members Meeting in Nancy, November 2003. All draft documents of the workgroup are available from the workgroup area on the TEI homepage.

The workgroup has spent most of its effort this past year drafting and revising an introduction to character set issues, Unicode, and their implications for XML processing. This is available as document CEW01 at the website and was presented to the TEI Council in May. At that time the Council requested some minor changes, which are currently being worked on.

Another area of work has been to develop markup constructs to augment, modify or replace the orthodox Unicode interpretation of characters in specific contexts, as well as to provide for the extension of the character set available to text encoders. This work is reported in working paper CEW06, which was also discussed by the TEI Council. Its underlying model was found to be sound, but considerable revision is required, especially to bring it in line with the recent developments towards P5.

The workgroup also tried to collect some use cases of these mechanisms in different languages and encoding scenarios, and some members worked on testing the markup constructs in real-life projects. The current draft was also presented and discussed, with fruitful results, at a conference on character encoding issues at Academia Sinica, Taipei, in March 2003.

The workgroup plans to wrap up the open issues and submit a final report to the TEI Council at the next council meeting.

Stand-Off Markup Workgroup

Chaired by David Durand, charged May 2002

The Workgroup on Stand-Off Markup and Linking has been working on several working-group documents, discussing linking strategy, linguistic annotations, the TEI canonical reference system, and a summary of the rationale for the larger decisions taken in this process. Progress has been slow, but is ongoing. Some portions of the work have been presented to the Council for feedback, while others are still in preparation. We project that the working papers will be submitted to the TEI Council for approval at the next Council meeting. Any feedback on the work so far is welcomed.

Another activity of the group has been the development of some sample software tools that may prove helpful to projects engaging with the changes in pointing mechanisms. This effort has produced two sample implementations: a Perl script that translates old-format TEI Extended Pointers in TEI documents into the W3C XPointer syntax recommended by the workgroup, and an implementation of the W3C XPointer language.
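To give a feel for the kind of translation involved, here is a minimal sketch in Python. It is not the workgroup's Perl script: it assumes only two kinds of location term, `ID (name)` and `CHILD (n gi)`, assumes a pointer must begin with an `ID` term, and emits an expression in the `xpointer()` scheme. The function name and the supported subset are illustrative assumptions, and element names are passed through unchanged.

```python
import re

# One location term: a keyword followed by parenthesised arguments,
# for example "ID (d1)" or "CHILD (2 p)".
_TERM = re.compile(r"\s*([A-Za-z]+)\s*\(([^)]*)\)")


def extended_pointer_to_xpointer(pointer: str) -> str:
    """Translate a small, assumed subset of extended-pointer terms."""
    text = pointer.strip()
    path = ""
    pos = 0
    while pos < len(text):
        match = _TERM.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse location term at offset {pos}")
        keyword = match.group(1).lower()
        args = match.group(2).split()
        if keyword == "id" and len(args) == 1 and not path:
            # ID (name) selects the element carrying that identifier.
            path = f"id('{args[0]}')"
        elif keyword == "child" and len(args) == 2 and args[0].isdigit() and path:
            # CHILD (n gi) selects the n-th child element named gi.
            path += f"/{args[1]}[{args[0]}]"
        else:
            raise ValueError(f"unsupported location term: {match.group(0).strip()}")
        pos = match.end()
    if not path:
        raise ValueError("empty pointer")
    return f"xpointer({path})"


print(extended_pointer_to_xpointer("ID (d1) CHILD (2 p)"))
# prints: xpointer(id('d1')/p[2])
```

Any other keyword raises an error rather than being guessed at, since a silent mistranslation of a pointer would be harder to detect than a refusal.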

TEI SOW2 technical rationale: An explanation of the rationale for the major changes introduced by this workgroup.
TEI SOW4 Notes on Media formats and XPointer: This document discusses the linking of images and image metadata to TEI documents. The biggest change here is the use of the SVG (Scalable Vector Graphics) DTD to tag included images.
TEI SOW5 Corpus applications: This document describes corpus and other linguistic applications of standoff markup and annotation.
TEI SOW6 Standoff Markup: discusses the use of the W3C XInclude standard to represent standoff markup documents. Revisions are under discussion on the group mailing list.
TEI SOW7: Notes on the representation of graph structures in the TEI; generated as advice to any other workgroup that may deal with these phenomena. Accepted by Council/finished.
TEI SOW8: The document is a replacement for the current “canonical reference” scheme in the TEI, and has been accepted by the TEI Council.
SOW2, 4, and 5 have received most recent attention.
SOW3 (revised chapters) is delayed, awaiting the completion of the other documents.
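As an illustration of the XInclude approach discussed in SOW6, the following sketch resolves an `xi:include` element inside a separate annotation document using Python's standard `xml.etree.ElementInclude` module. The document contents, the `base.xml` name, and the `annotation` and `seg` elements are hypothetical and are not taken from the working paper. The sketch pulls in a whole document, whereas real standoff markup would normally address a fragment of the base text.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical base text, kept free of annotation markup.
SOURCE = "<text><p>Hello standoff world</p></text>"

# Hypothetical standoff document that points at the base text.
ANNOTATION = """<annotation xmlns:xi="http://www.w3.org/2001/XInclude">
  <seg type="greeting"><xi:include href="base.xml"/></seg>
</annotation>"""


def loader(href, parse, encoding=None):
    """Serve the base document from memory instead of the file system."""
    if href == "base.xml" and parse == "xml":
        return ET.fromstring(SOURCE)
    raise OSError(f"cannot load {href!r}")


def resolve(annotation_xml: str) -> ET.Element:
    """Return the annotation tree with its xi:include elements expanded."""
    root = ET.fromstring(annotation_xml)
    ElementInclude.include(root, loader=loader)
    return root


resolved = resolve(ANNOTATION)
print(resolved.find("seg/text/p").text)
# prints: Hello standoff world
```

Keeping the loader in memory makes the sketch self-contained; in practice the default loader would read the referenced file from disk.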

Progress on this workgroup has been slow, and there remain many things yet to do. The work is carried on by email list and conference call. Energetic volunteers are encouraged to contact the chair if they want to participate.

SGML/XML Conversion Workgroup

Chaired by Christine Ruotolo, charged May 2002

The TEI Task Force on SGML to XML Migration was convened in May 2002 and charged with developing recommendations for migrating existing TEI resources from SGML to XML. The Task Force comprises representatives from projects with significant TEI SGML investment, along with selected technical experts and the TEI editors, and has worked for the past 18 months to diagnose and document the problems, methods, and tools necessary to migrate legacy TEI data to XML.

The members of the Task Force met three times over the past year: in October 2002 in Chicago (immediately following the TEI Annual Members Meeting), in February 2003 at the University of Maryland, and in June 2003 at the University of Alicante. Minutes from each of the meetings are available from the Task Force activities page, as is a detailed workplan written after the first meeting.

The primary deliverables of the Task Force are two reports, Strategic Considerations in Migration of TEI Documents from SGML to XML and the Practical Guide to Migration of TEI Documents from SGML to XML. The first report, intended for administrators and project managers, emphasizes the planning and decision-making involved in data migration, while the second report describes the mechanics of conversion in greater detail and is written primarily for the technical staff who will implement the conversion. The specific recommendations in the technical report are augmented by a set of Migration Case Study Reports (MIW06a-j) that discuss individual migration efforts undertaken by members of the Task Force.

As of October 2003, the reports and case studies have been thoroughly reviewed by the Task Force members and edited by a technical writer. The final drafts will be circulated back to the group for a brief review period before being forwarded on to the TEI Council for comment. The reports and case studies will be officially presented to the greater TEI community at the Annual Members Meeting in Nancy (November 2003).

Manuscript Description Task-force

Chaired by Matthew Driscoll, charged February 2003

The work of the Manuscript Description Task-force proceeds apace. Its goal is the reconciliation or merging of the various schemes for encoding manuscript descriptions using TEI-conformant XML (principally MASTER and the TEI-MMSS workgroup, but also the scheme devised for the Sofia-based Repertorium of Old Bulgarian Literature). The members of the task force, Messrs. Bauman, Birnbaum, Burnard, Driscoll (chair) and Ms. Proffitt, met in Reykjavík, Iceland, from Sunday 7 to Tuesday 9 September 2003. Minutes of the meeting are available.

In particular we were keen to provide a mechanism for those wishing to encode legacy data in as simple a way as possible, i.e. without requiring extensive re-writing or re-ordering of the data by the encoder (what could be called the TEI-MMSS approach), while at the same time allowing those wishing to produce highly structured, detailed descriptions, possibly but not necessarily on the basis of existing data (the MASTER approach), to do so. An idea put forward at the TEI Council meeting in Oxford, viz. that this might best be accomplished by defining separate elements, one for structured and the other for unstructured descriptions, on the analogy of the existing biblStruct and bibl, was rejected, as the differences between the two approaches are more a matter of degree than of nature, and a transition from one view to the other should be possible without requiring re-encoding. It was also recognised that the recommendations of the two groups (MASTER and TEI-MMSS) had more in common than not, but that some of the ideas underlying them both (and to an extent also the Repertorium scheme) were in need of revision. It was decided therefore to take as little as possible for granted, and instead to review the entire system, element by element. The result of this review is a greatly improved tagset, much more balanced and robust than either of its predecessors. Work will now concentrate on the production of a reference manual and a tutorial guide.

Metalanguage Workgroup

Chaired by Sebastian Rahtz, charged March 2003

This report should be read in conjunction with the other reports.

The full task force has not yet met face to face, operating only by email, but there have been two fruitful full-day meetings in October 2003: one between Lou Burnard, Sebastian Rahtz, and Norm Walsh, and one between Lou Burnard, Sebastian Rahtz, and Laurent Romary. These resulted in documents MEW 02, MEW 03, MEW 04 and MEW 05. Attention is drawn to the analysis in MEW 02 of the relationship between TEI and DocBook element classes, as this may prove a fertile area of future collaboration. The discussion recorded in MEW 05 about linking TEI names to corresponding concepts in a proposed ISO data category repository is also a very important development for the future.

The timetable of work under the META umbrella is expected to be as follows:

November 2003: release stable alpha version of P5 schemas and Guidelines (in HTML) for comment by members
November 2003: first draft of new TEI module for tag documentation
December 2003: freeze new ODD format, and hand over P5 sources in this form to TEI editors
January 2004: alpha release of Pizza Chef replacement for ad hoc generation of schemas
February 2004: first release of P5 with new/revised modules for manuscript descriptions, feature structures, characters, and linking.

The remaining sections in this report describe the progress on the task force’s jobs of:

Revision of the ODD format to be entirely independent of the SGML/XML notation for DTDs
Combination of the ODD format and the current TEI DTD for tag documentation (TSD) into a single standard TEI tagset.
Basing the new notation on one of the XML schema languages.
Creation of additional processors to make not only XML DTDs (and possibly SGML as well), but also at least one XML schema format.
Conversion of data types to use the datatype library of the W3C.
Rewriting the Pizza Chef to allow user choice of DTD or schema output.

The final result will be a new version of the TEI Guidelines.

Revision of ODD format

The markup used to create the TEI Guidelines, from which both documentation and DTDs/Schemas are derived, has been reviewed several times, and three important sets of changes have been made:

Elements which have content of literal SGML/XML code have been converted to have neutral TEI markup
Element names have been changed to make them more independent of SGML DTD naming, and to make them more consistent with the rest of the TEI. The main reference documentation for element classes, elements, and patterns (entities) has been simplified
Element content models have been converted to use Relax NG syntax

This process should now be complete. The next stage will be to release sample subsets of the TEI source for comment by other working groups (e.g. manuscripts, feature structures, and standoff markup), to check that they are usable for future editing.

New module for tag documentation

The revision of the TSD tagset, to turn it into a proper TEI module and to make it conform to the current Guidelines source, has not yet been started. It is planned to complete this during November 2003.

Schema language

The Guidelines have been converted to express element syntactic constraints using Relax NG. This work is complete, and awaiting more user testing. The Relax NG compact syntax is used for display in the HTML version of the Guidelines.

Processors to generate schemas

Tools to generate Relax NG schemas and DTDs from the new Guidelines have been completed, written as XSLT transforms. The results await user testing.
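The idea of deriving more than one schema notation from a single neutral description can be sketched as follows. This is an illustration in Python only: the actual tools are XSLT transforms working on the Guidelines source, and the `ElementSpec` structure, the function names, and the simplified "zero or more of these, in any order" content model are assumptions made for the example. The Relax NG output is a single named definition in the compact syntax, not a complete grammar.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ElementSpec:
    """Hypothetical notation-neutral description of one element."""
    name: str
    children: Tuple[str, ...] = ()  # permitted child elements, any order
    allows_text: bool = False       # whether character data is permitted


def to_rnc(spec: ElementSpec) -> str:
    """Render the spec as a Relax NG compact-syntax definition."""
    alternatives = (["text"] if spec.allows_text else []) + list(spec.children)
    if not alternatives:
        body = "empty"
    elif alternatives == ["text"]:
        body = "text"
    else:
        body = "(" + " | ".join(alternatives) + ")*"
    return f"{spec.name} = element {spec.name} {{ {body} }}"


def to_dtd(spec: ElementSpec) -> str:
    """Render the same spec as an XML DTD element declaration."""
    alternatives = (["#PCDATA"] if spec.allows_text else []) + list(spec.children)
    if not alternatives:
        model = "EMPTY"
    elif alternatives == ["#PCDATA"]:
        model = "(#PCDATA)"
    else:
        model = "(" + " | ".join(alternatives) + ")*"
    return f"<!ELEMENT {spec.name} {model}>"


p = ElementSpec("p", children=("hi", "name"), allows_text=True)
print(to_rnc(p))  # prints: p = element p { (text | hi | name)* }
print(to_dtd(p))  # prints: <!ELEMENT p (#PCDATA | hi | name)*>
```

Because both renderings read the same specification, a change to the content model needs to be made in one place only.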

A tool to generate W3C Schemas from the Guidelines has not been written. It is intended to produce them on demand from the Relax NG schemas, using James Clark’s trang program.
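An on-demand conversion of this kind might be wrapped as in the sketch below. It assumes that a launcher named `trang` is available on the PATH (many installations instead run `java -jar trang.jar`), and the file names are hypothetical. The `-I` and `-O` options name the input and output formats explicitly; trang can also infer them from the file extensions.

```python
import subprocess
from typing import List


def trang_command(rng_path: str, xsd_path: str, executable: str = "trang") -> List[str]:
    """Build the argument list for a Relax NG (XML syntax) to W3C Schema run."""
    return [executable, "-I", "rng", "-O", "xsd", rng_path, xsd_path]


def convert(rng_path: str, xsd_path: str) -> None:
    """Run the conversion; raises CalledProcessError if trang reports failure."""
    subprocess.run(trang_command(rng_path, xsd_path), check=True)


# Hypothetical file names; the command is printed here rather than executed.
print(" ".join(trang_command("tei.rng", "tei.xsd")))
# prints: trang -I rng -O xsd tei.rng tei.xsd
```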

Datatypes

As described in the May report, 23 datatypes (linked to W3C datatypes where relevant) have been defined, and linked to all the simple attribute cases. Work has not yet started on looking at element content models to see where they would benefit from datatyping.

Pizza Chef rewrite

A prototype processor to replace the Pizza Chef, called roma, was developed for demonstration at XML Europe 2003. It works by offering a fixed set of choices (including exotica like using MathML as the content for formula), but has not yet been revised to work with the latest revision of P5. Work will start on this in January 2004.

Joint ISO/TEI Activity on Feature Structures

Chaired by Kyong Lee, chartered January 2003

TEI and ISO TC37/SC4 are jointly sponsoring an activity which will result in adoption of a version of the TEI proposals for encoding linguistic annotations using the feature structure formalism. The group was chartered by the TEI Council in January 2003 and is chaired by Kyong Lee, with TEI input coming from Laurent Romary. A draft for the standard was circulated in the summer, and there are a number of comments which will be integrated and discussed at a face-to-face meeting to be held in Nancy immediately before the Members Meeting. A further meeting is planned for next spring.

Further documents are available at the Activity’s website.

Training activities directed by the TEI Council

TEI Training Session at MM02: Encoding Literary and Cultural Documents in TEI

The TEI annual members’ meeting in 2002, held in Chicago, Illinois, was accompanied by a one-day training workshop entitled “TEI Training Session: Encoding Literary and Cultural Documents in TEI”.

This training session used a case study model to provide advice and discussion on specific topics in text encoding, based on real-world problems supplied by the participants. The session was aimed at those responsible for designing their project’s encoding system. It provided an opportunity to take a focused look at a particular problem or set of problems, in a group of knowledgeable peers guided by TEI experts. Participants were expected to have some basic familiarity with the TEI. The session focused on the encoding of literary and cultural documents, interpreted broadly.

The session was held from 1 to 6 pm on Thursday, October 10. Each participant was asked to bring a problem or encoding challenge from their own project. The session began with a general discussion of the topics raised, followed by focused attention to each particular case in turn. The instructors addressed each participant’s questions in depth and also drew comparisons among the projects represented. The goal of the session was not only to answer the participants’ specific questions, but also to place them in the context of issues such as retrieval, data interchange, and long-term project goals.

There were 16 participants in the workshop, and three instructors (Julia Flanders, Syd Bauman, and Terry Catapano).

The feedback was positive (I am ashamed to say that I can’t find the actual responses, so I can’t give quotes); people liked the discussion format and the opportunity to address specific encoding questions from their own projects, and they found the perspective on other projects helpful as well. We allowed a few beginners to attend at their request, and they found the session moved too quickly and didn’t provide enough background (which was to be expected, but suggests that either clarifying this point or making sure to offer introductory training as well would be desirable when possible).

TEI Training at ACH/ALLC 2003

Amit Kumar and Susan Schreibman of the Maryland Institute for Technology in the Humanities (MITH) offered two one-day TEI training workshops in conjunction with the 2003 ACH/ALLC Annual Conference in Athens, Georgia, on 28-29 May.

On 28 May an Introduction to TEI and XML workshop was taken by eight students. The workshop consisted of lectures and exercises that introduced students to basic XML and TEI concepts, the TEILite DTD, constructing a TEI Header, and how to use CSS with TEI. Comments were positive: “Thanks for an excellent workshop & plain, down-to-earth explanations/comments”; “this was [an] excellent workshop. I really appreciate about everything (except that the length was too short)”.

On 29 May, an Introduction to XSLT workshop was taken by fourteen students. Again, the format was lectures punctuated by exercises. All the exercises transformed documents encoded in TEI. Students were introduced to XSLT and XPath expressions, XSL templates, node trees, and element/attribute matching, as well as XSLT processing models. Students were generally satisfied with the workshop and comments were positive: “Nice format, excellent instructors, very personable, friendly, receptive”; “there were some typos & the like in some of the handouts, but all the same, one could generally find them & work them through. Well-selected & organized materials – an excellent preparation job!”

Report on Rhodes University Workshop

From 8th to 12th September 2003, sixteen delegates from all over South Africa took part in the five-day workshop ‘Book & Text Studies: Humanities Computing’ at Rhodes University (Grahamstown, South Africa). The workshop was funded by the Association for Literary and Linguistic Computing (ALLC), certified by the TEI, and sponsored by The Department of Computer Science and the Department of English at Rhodes University (SA), The School of Library, Archive, and Information Studies at University College London (UK), and The Centre for Scholarly Editing and Document Studies of the Royal Academy of Dutch Language and Literature (Belgium). The latter two institutions allowed gratis the services of the lecturers Melissa Terras and Edward Vanhoutte & Ron Van den Branden, respectively.

The five-day workshop provided introductory training in the theory and practice of digitizing text and images for those working in book and text studies. Training in these skills for those working in the Humanities is not readily available in South Africa, but both academics and advanced students are increasingly aware of the need to bring their work into line with that of colleagues elsewhere implementing more advanced technologies. Course participants came from institutions as diverse as the National English Literary Museum, Rhodes University, The University of Zululand, The University of Witwatersrand, The National Archives of South Africa, etc. showing the interest in learning the appropriate standards and technologies relevant to their collections.

The five workshop days each consisted of two morning sessions (8:30-10:00 & 10:30-12:00) and two afternoon sessions (1:30-3:00 & 3:30-5:00) of 90 minutes each. After the first session, which introduced the workshop, Humanities Computing as a discipline, and the associations, journals, mailing lists, publications, and institutions involved, Edward Vanhoutte taught five (hands-on) sessions on XML and TEI, followed by four (hands-on) sessions on XSL taught by Ron Van den Branden, and four (hands-on) sessions on digitizing images and textual resources taught by Melissa Terras. The workshop concluded with two sessions on project management taught by Melissa Terras and Edward Vanhoutte. On the evening of the first day, Terry Sanders’s documentary film Into the Future: On the Preservation of Knowledge in the Electronic Age was shown. The course culminated in a course dinner, which cemented friendships made on the course. An additional positive outcome of such a workshop is the contacts made between the various institutions and interested parties taking part, who were unaware of each other’s existence prior to the workshop.

All participants were provided with printouts of the lectures and a copy of the CD-ROM XML, XSL & Digitisation. Tools and Resources, compiled by the instructors for this workshop and sponsored by the Centre for Scholarly Editing and Document Studies.

Judging from the constant interaction with the participants during the workshop, their comments afterwards, and the many letters of thanks and support the local organiser, Professor John Gouws, received, the instructors feel that this was a very successful workshop which filled an urgent need in Africa.


TEI Training activities at Oxford

Sebastian Rahtz and Lou Burnard again taught their four half-day course introducing the fundamentals of the TEI architecture and processing with XSLT at Oxford University in February. Teaching materials from this course and a summary are both available on the TEI web site.

Lou Burnard taught a three-day practical workshop on corpus encoding with the TEI at the Scuola Superiore di Lingue per Interpreti e Traduttore in Forlí, Italy, in April. Teaching materials and an overview are also available on the TEI web site.

Reports from individual council members about their TEI-related activities

Report on TEI-related activities 2002-2003: Tomaž Erjavec

I have been a TEI Council member since November 2001. In that capacity I attended three telephone conferences and one two-day meeting in Oxford.

In the preparation for release of TEI P4 I have read and commented on the P3 chapter on Simple Analytic Mechanisms. I translated into Slovene and circulated the Press Release for TEI P4.

I am a member of the TEI/NEH Workgroup on SGML to XML Migration, where I contributed a sample P3 to P4 migration of the multilingual annotated MULTEXT-East corpus. I am also a member of the ISO/TEI Workgroup on Feature Structures, where I have so far commented on the preliminary draft of ISO TC 37/SC 4 N033 Language Resource Management – Feature Structures. For the Nancy members meeting I have proposed a TEI SIG on Language Technologies.

In connection with TEI training, I have given a course on “Annotation of Language Resources” at the 14th European Summer School in Logic, Language and Information (ESSLLI’02), Trento, August 5-9, 2002.

In my own work, I have in the last two years used TEI to encode various resources. Some of these projects are now completed, some are on-going, and most had their results presented at conferences and in journal papers: the IJS-ELAN Slovene-English linguistically annotated parallel corpus; the ontology-annotated GENIA corpus of MEDLINE abstracts; the MULTEXT multilingual morphosyntactic specifications; the Slovene Dependency Treebank; a Japanese-Slovene learners’ dictionary; and eSlomsek, a text-critical edition of 19th century sermons by St Slomšek.

David J. Birnbaum

For the TEI, I served on the Character Set Work Group, on the Manuscript Description Task Force (which Matthew Driscoll chaired), and on the TEI Council.

Over the past year I published “Computer-Assisted Analysis and Study of the Structure of Mixed Content Miscellanies” in “Scripta & e-Scripta,” vol. 1, 2003, 15-54. I also gave invited lectures or conference presentations on this research in the following venues:

Thirty-Eighth International Congress on Medieval Studies, Kalamazoo, MI, USA, May 2003
2003 Summer Research Laboratory on Russia and Eastern Europe, University of Illinois in Urbana-Champaign, IL, US, June 2003
Graduate School of Library and Information Science Electronic Publishing Research Group, University of Illinois in Urbana-Champaign, IL, US, June 2003
Medieval Slavic Summer Institute, Ohio State University Research Center for Medieval Slavic Studies, Columbus, OH, US, July 2003
Extreme Markup 2003, Montreal, QUE, CA, August 2003
Thirteenth International Congress of Slavists, Ljubljana, Slovenia, August 2003
Seventh International Conference of the Bulgarian Studies Association, Columbus, OH, US, October 2003