[This report summarizes activities of TEI-funded workgroups and TEI-chartered Task forces reporting to the TEI Technical Council during the year ending December 2003. There will be an informal panel session at the TEI Members Meeting during which Chairs of workgroups and task forces will be pleased to discuss the activities reported on here with members.]
Since its inception in the summer of 2001, the Character Encoding Workgroup has met twice face to face, in October 2001 in Berkeley and in July 2002 in Tübingen. Over this past year, business has been conducted by email, but another face-to-face meeting is planned immediately prior to the Members Meeting in Nancy, November 2003. All draft documents of the workgroup are available from the workgroup area on the TEI homepage.
The workgroup has spent most of its effort this past year drafting and revising an introduction to character set issues, Unicode, and its implications for XML processing. This is available as document CEW01 at the website and was presented to the TEI Council in May. At that time the Council requested some minor changes, which are currently being worked on.
Another area of work has been to develop markup constructs to augment, modify or replace the orthodox Unicode interpretation of characters in specific contexts, as well as to provide for the extension of the character set available to text encoders. This work is reported in working paper CEW06, which was also discussed by the TEI Council. Its underlying model was found to be sound, but considerable revision is required, especially to bring it in line with the recent developments towards P5.
The workgroup also tried to collect some use cases of these mechanisms in different languages and encoding scenarios and some members worked on testing the markup constructs in real life projects. The current draft was also presented and discussed with fruitful results at a Conference on Character Encoding issues at Academia Sinica, Taipei in March 2003.
The workgroup plans to wrap up the open issues and submit a final report to the TEI Council at the next council meeting.
Another activity of the group has been the development of some sample software tools that may prove helpful to projects engaging with the changes in pointing mechanisms. This effort has produced two sample implementations: a Perl script that translates old-format TEI Extended Pointers in TEI documents into the W3C XPointer syntax recommended by the workgroup, and an implementation of the W3C XPointer language.
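To give a feel for the kind of translation described above, here is a minimal sketch in Python. It is not the workgroup's Perl script, and the mapping it uses is an assumption made for illustration: it handles only ladders of the form `ID (name)` followed by `CHILD (n gi)` steps, emits an `xpointer()`-style expression, and rejects everything else. The function name `extptr_to_xpointer` is hypothetical.

```python
import re

# A location term is a keyword followed by one or more parenthesised steps,
# e.g. "CHILD (2 div) (3 p)".
TERM = re.compile(r"\s*([A-Za-z]+)((?:\s*\([^()]*\))+)")
STEP = re.compile(r"\(([^()]*)\)")


def extptr_to_xpointer(ladder: str) -> str:
    """Translate a small subset of a TEI extended-pointer location ladder.

    Assumption for this sketch: the ladder starts with ID and may continue
    with CHILD steps that each give a position and an element name.
    """
    path = ""
    pos = 0
    while ladder[pos:].strip():
        term = TERM.match(ladder, pos)
        if term is None:
            raise ValueError(f"unsupported location term at: {ladder[pos:]!r}")
        keyword = term.group(1).lower()
        steps = [step.split() for step in STEP.findall(term.group(2))]
        if keyword == "id" and not path and len(steps) == 1 and len(steps[0]) == 1:
            # ID (name) becomes the XPath id() function call.
            path = f"id('{steps[0][0]}')"
        elif keyword == "child" and path:
            # Each CHILD step (n gi) becomes a positional child step gi[n].
            for step in steps:
                if len(step) != 2 or not step[0].isdecimal() or int(step[0]) < 1:
                    raise ValueError(f"unsupported CHILD step: {step!r}")
                position, name = step
                path += f"/{name}[{int(position)}]"
        else:
            raise ValueError(f"unsupported location term: {term.group(0).strip()!r}")
        pos = term.end()
    if not path:
        raise ValueError("empty location ladder")
    return f"xpointer({path})"
```

Under these assumptions, `extptr_to_xpointer("ID (a27) CHILD (2 div) (3 p)")` returns `xpointer(id('a27')/div[2]/p[3])`. Keywords such as ROOT, DESCENDANT or PATTERN are deliberately left unsupported, since their XPointer equivalents need more care than a sketch can give.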
Working documents updated since last year include:
- TEI SOW2 technical rationale: An explanation of the rationale for the major changes introduced by this workgroup.
- TEI SOW4 Notes on Media formats and XPointer: This document discusses the linking of images and image metadata to TEI documents. The biggest change here is the use of the SVG (Scalable Vector Graphics) DTD to tag included images.
- TEI SOW5 Corpus applications: This document describes corpus and other linguistic applications of standoff markup and annotation.
- TEI SOW6 Standoff Markup: discusses the use of the W3C XInclude standard to represent standoff markup documents. Revisions are under discussion on the group mailing list.
- TEI SOW7: Notes on the representation of graph structures in the TEI; generated as advice to any other workgroup that may deal with these phenomena. Accepted by Council/finished.
- TEI SOW8: The document is a replacement for the current “canonical reference” scheme in the TEI, and has been accepted by the TEI Council.
- SOW2, 4, and 5 have received most recent attention.
- SOW3 (revised chapters) is delayed, awaiting the completion of the other documents.
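As an illustration of the XInclude approach to standoff markup mentioned under SOW6, the following minimal sketch uses the `xml.etree.ElementInclude` module from the Python standard library. It is not taken from the workgroup's documents: the element names, the file name `p1.xml`, and the helper `resolve_standoff` are hypothetical. The standard-library module includes whole documents only and does not support the `xpointer` attribute, so selecting a fragment of the source text is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical source text, kept in memory instead of on disk.
SOURCES = {"p1.xml": "<p>Call me Ishmael.</p>"}

# Hypothetical standoff document: the annotation lives here,
# and the annotated text is pulled in by reference.
STANDOFF = """<annotation xmlns:xi="http://www.w3.org/2001/XInclude">
  <note>opening sentence</note>
  <xi:include href="p1.xml"/>
</annotation>"""


def resolve_standoff(standoff_xml, sources):
    """Expand xi:include elements using documents held in a dictionary."""

    def loader(href, parse, encoding=None):
        # ElementInclude calls the loader with parse="xml" or parse="text".
        text = sources[href]
        return ET.fromstring(text) if parse == "xml" else text

    root = ET.fromstring(standoff_xml)
    ElementInclude.include(root, loader=loader)  # expands includes in place
    return root


resolved = resolve_standoff(STANDOFF, SOURCES)
print(ET.tostring(resolved, encoding="unicode"))
```

After resolution the `xi:include` element has been replaced by the `p` element from the source, so the annotation and the text it refers to can be processed as one tree while remaining separate documents in storage.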
Progress on this workgroup has been slow, and there remain many things yet to do. The work is carried on by email list and conference call. Energetic volunteers are encouraged to contact the chair if they want to participate.
The members of the Task Force met three times over the past year: in October 2002 in Chicago (immediately following the TEI Annual Members Meeting), in February 2003 at the University of Maryland, and in June 2003 at the University of Alicante. Minutes from each of the meetings are available from the Task Force activities page, as is a detailed workplan written after the first meeting.
The primary deliverables of the Task Force are two reports, Strategic Considerations in Migration of TEI Documents from SGML to XML and the Practical Guide to Migration of TEI Documents from SGML to XML. The first report, intended for administrators and project managers, emphasizes the planning and decision-making involved in data migration, while the second report describes the mechanics of conversion in greater detail and is written primarily for the technical staff who will implement the conversion. The specific recommendations in the technical report are augmented by a set of Migration Case Study Reports (MIW06a-j) that discuss individual migration efforts undertaken by members of the Task Force.
As of October 2003, the reports and case studies have been thoroughly reviewed by the Task Force members and edited by a technical writer. The final drafts will be circulated back to the group for a brief review period before being forwarded on to the TEI Council for comment. The reports and case studies will be officially presented to the greater TEI community at the Annual Members Meeting in Nancy (November 2003).
In particular we were keen to provide a mechanism for those wishing to encode legacy data in as simple a way as possible, i.e. without requiring extensive re-writing or re-ordering of the data by the encoder (what could be called the TEI-MMSS approach), while at the same time allowing those wishing to produce highly-structured detailed descriptions, possibly but not necessarily on the basis of existing data (the MASTER approach), to do so. An idea put forward at the TEI Council meeting in Oxford, viz. that this might best be accomplished by defining separate elements, one for structured and the other for unstructured descriptions, on the analogy of the existing biblStruct and bibl, was rejected, as the differences between the two approaches are more a matter of degree than of nature, and a transition from one view to the other should be possible without requiring re-encoding. It was also recognised that the recommendations of the two groups (MASTER and TEI-MMSS) had more in common than not, but that some of the ideas underlying them both (and to an extent also the Repertorium scheme) were in need of revision. It was decided therefore to take as little as possible for granted, but rather to review the entire system, element for element. The result of this review is a greatly improved tagset, much more balanced and robust than either of its predecessors. Work will now be concentrated on the production of a reference manual and a tutorial guide.
The full task force has not yet met face to face, operating only by email, but there have been two fruitful full-day meetings in October 2003: one between Lou Burnard, Sebastian Rahtz, and Norm Walsh, and one between Lou Burnard, Sebastian Rahtz, and Laurent Romary. These resulted in documents MEW 02, MEW 03, MEW 04 and MEW 05. Attention is drawn to the analysis in MEW 02 of the relationship between TEI and DocBook element classes, as this may prove a fertile area of future collaboration. The discussion recorded in MEW 05 about linking TEI names to corresponding concepts in a proposed ISO data category repository is also a very important development for the future.
The timetable of work under the META umbrella is expected to be as follows:
- November 2003: release stable alpha version of P5 schemas and Guidelines (in HTML) for comment by members
- November 2003: first draft of new TEI module for tag documentation
- December 2003: freeze new ODD format, and hand over P5 sources in this form to TEI editors
- January 2004: alpha release of Pizza Chef replacement for ad hoc generation of schemas
- February 2004: first release of P5 with new/revised modules for manuscript descriptions, feature structures, characters, and linking.
The remaining sections in this report describe the progress on the task force’s jobs of:
- Revision of the ODD format to be entirely independent of the SGML/XML notation for DTDs
- Combination of the ODD format and the current TEI DTD for tag documentation (TSD) into a single standard TEI tagset.
- Basing the new notation on one of the XML schema languages.
- Creation of additional processors to make not only XML DTDs (and possibly SGML as well), but also at least one XML schema format.
- Conversion of data types to use the datatype library of the W3C.
- Rewriting the Pizza Chef to allow user choice of DTD or schema output.
The final result will be a new version of the TEI Guidelines.
- Elements which have content of literal SGML/XML code have been converted either to have neutral TEI markup
- Element names have been changed to make them more independent of SGML DTD naming, and to make them more consistent with the rest of the TEI. The main reference documentation for element classes, elements, and patterns (entities) has been simplified
- Element content models have been converted to use Relax NG syntax
This process should now be complete. The next stage will be to release sample subsets of the TEI source for comment by other working groups (e.g. manuscripts, feature structures, and standoff markup), to check that they are usable for future editing.
A tool to generate W3C Schemas from the Guidelines has not been written. It is intended to produce them on demand from the Relax NG schemas, using James Clark’s trang program.
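The on-demand conversion could be scripted along the following lines. This is a minimal sketch, not the task force's tooling: it assumes that a `trang` executable is available on the PATH (trang is often run as a Java jar instead), and the file names `tei.rng` and `tei.xsd` are hypothetical. The `-I` and `-O` options select trang's input and output formats.

```python
import shutil
import subprocess


def trang_command(rng_path, xsd_path):
    """Build the trang command line for a Relax NG to W3C Schema conversion."""
    # -I names the input format (Relax NG XML syntax), -O the output format.
    return ["trang", "-I", "rng", "-O", "xsd", rng_path, xsd_path]


def generate_xsd(rng_path, xsd_path):
    """Run trang if it is installed; return False when it cannot be found."""
    if shutil.which("trang") is None:
        return False
    subprocess.run(trang_command(rng_path, xsd_path), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(" ".join(trang_command("tei.rng", "tei.xsd")))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact invocation before any schema is generated.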
See further documents at the Activity’s website.
Training activities directed by the TEI Council
This training session used a case study model to provide advice and discussion on specific topics in text encoding, based on real-world problems supplied by the participants. The session was aimed at those responsible for designing their project’s encoding system. It provided an opportunity to take a focused look at a particular problem or set of problems, in a group of knowledgeable peers guided by TEI experts. Participants were expected to have some basic familiarity with the TEI. The session focused on the encoding of literary and cultural documents, interpreted broadly.
The session was held from 1 to 6 pm on Thursday, October 10. Each participant was asked to bring a problem or encoding challenge from their own project. The session began with a general discussion of the topics raised, followed by focused attention to each particular case in turn. The instructors addressed each participant’s questions in depth and also drew comparisons among the projects represented. The goal of the session was not only to answer the participants’ specific questions, but also to place them in the context of issues such as retrieval, data interchange, and long-term project goals.
There were 16 participants in the workshop, and three instructors (Julia Flanders, Syd Bauman, and Terry Catapano).
The feedback was positive (I am ashamed to say that I can’t find the actual responses, so I can’t give quotes); people liked the discussion format and the opportunity to address specific encoding questions from their own projects, and they found the perspective on other projects helpful as well. We allowed a few beginners to attend at their request, and they found the session moved too quickly and didn’t provide enough background (which was to be expected, but suggests that either clarifying this point or making sure to offer introductory training as well would be desirable when possible).
TEI Training at ACH/ALLC 2003
Amit Kumar and Susan Schreibman of the Maryland Institute for Technology in the Humanities (MITH) offered two one-day TEI training workshops in conjunction with the 2003 ACH/ALLC Annual Conference in Athens, Georgia, on 28-29 May.
On 28 May an Introduction to TEI and XML workshop was taken by eight students. The workshop consisted of lectures and exercises that introduced students to basic XML and TEI concepts, the TEILite DTD, constructing a TEI Header, and how to use CSS with TEI. Comments were positive: “Thanks for an excellent workshop & plain, down-to-earth explanations/comments”; “this was [an] excellent workshop. I really appreciate about everything (except that the length was too short)”.
On 29 May, an Introduction to XSLT workshop was taken by fourteen students. Again, the format was lectures punctuated by exercises. All the exercises transformed documents encoded in TEI. Students were introduced to XSLT and XPath expressions, XSL templates, node trees, and element/attribute matching, as well as XSLT processing models. Students were generally satisfied with the workshop and comments were positive: “Nice format, excellent instructors, very personable, friendly, receptive”; “there were some typos & the like in some of the handouts, but all the same, one could generally find them & work them through. Well-selected & organized materials – an excellent preparation job!”
The five-day workshop provided introductory training in the theory and practice of digitizing text and images for those working in book and text studies. Training in these skills for those working in the Humanities is not readily available in South Africa, but both academics and advanced students are increasingly aware of the need to bring their work into line with that of colleagues elsewhere implementing more advanced technologies. Course participants came from institutions as diverse as the National English Literary Museum, Rhodes University, The University of Zululand, The University of Witwatersrand, The National Archives of South Africa, etc. showing the interest in learning the appropriate standards and technologies relevant to their collections.
The five workshop days each consisted of two morning sessions (8:30-10:00 & 10:30-12:00) and two afternoon sessions (1:30-3:00 & 3:30-5:00) of 90 minutes each. After the first session, which introduced the workshop, Humanities Computing as a discipline, and the associations, journals, mailing lists, publications, and institutions involved, Edward Vanhoutte taught five (hands-on) sessions on XML and TEI, followed by four (hands-on) sessions on XSL taught by Ron Van den Branden, and four (hands-on) sessions on digitizing images and textual resources taught by Melissa Terras. The workshop concluded with two sessions on project management taught by Melissa Terras and Edward Vanhoutte. On the evening of the first day, Terry Sanders’s documentary film Into the Future: On the Preservation of Knowledge in the Electronic Age was shown. The course culminated in a course dinner, which cemented friendships made on the course: an additional positive outcome of such a workshop is the contacts made between the various institutions and interested parties taking part, who were unaware of each other’s existence prior to the workshop.
All participants were provided with printouts of the lectures and a copy of the CD-ROM XML, XSL & Digitisation. Tools and Resources, compiled by the instructors for this workshop and sponsored by the Centre for Scholarly Editing and Document Studies.
Judging from the constant interaction with the participants during the workshop, their comments afterwards, and the many letters of thanks and support that the local organiser, Professor John Gouws, received, the instructors feel that this was a very successful workshop which filled an urgent need in Africa.
TEI Training activities at Oxford
Sebastian Rahtz and Lou Burnard again taught their course of four half-days introducing the fundamentals of the TEI architecture and processing with XSLT at Oxford University in February. Teaching materials from this course and a summary are both available on the TEI web site.
Lou Burnard taught a three-day practical workshop on corpus encoding with the TEI at the Scuola Superiore di Lingue per Interpreti e Traduttori in Forlì, Italy, in April. Teaching materials and an overview are also available on the TEI web site.
In the preparation for release of TEI P4 I have read and commented on the P3 chapter on Simple Analytic Mechanisms. I translated into Slovene and circulated the Press Release for TEI P4.
I am a member of the TEI/NEH Workgroup on SGML to XML Migration, where I contributed a sample P3 to P4 migration of the multilingual annotated MULTEXT-East corpus. I am also a member of the ISO/TEI Workgroup on Feature Structures, where I have so far commented on the preliminary draft of ISO TC 37/SC 4 N033 Language Resource Management – Feature Structures. For the Nancy members meeting I have proposed a TEI SIG on Language Technologies.
In connection with TEI training, I have given a course on “Annotation of Language Resources” at the 14th European Summer School in Logic, Language and Information (ESSLLI’02), Trento, August 5-9, 2002.
In my own work, I have in the last two years used TEI to encode various resources. Some of these projects are now completed, some are on-going, and most had their results presented at conferences and in journal papers: the IJS-ELAN Slovene-English linguistically annotated parallel corpus; the ontology-annotated GENIA corpus of MEDLINE abstracts; the MULTEXT multilingual morphosyntactic specifications; the Slovene Dependency Treebank; a Japanese-Slovene learners’ dictionary; and eSlomsek, a text-critical edition of 19th century sermons by St Slomšek.
Over the past year I published “Computer-Assisted Analysis and Study of the Structure of Mixed Content Miscellanies” in “Scripta & e-Scripta,” vol. 1, 2003, 15-54. I also gave invited lectures or conference presentations on this research in the following venues:
- Thirty-Eighth International Congress on Medieval Studies, Kalamazoo, MI, USA, May 2003
- 2003 Summer Research Laboratory on Russia and Eastern Europe, University of Illinois in Urbana-Champaign, IL, US, June 2003
- Graduate School of Library and Information Science Electronic Publishing Research Group, University of Illinois in Urbana-Champaign, IL, US, June 2003
- Medieval Slavic Summer Institute, Ohio State University Research Center for Medieval Slavic Studies, Columbus, OH, US, July 2003
- Extreme Markup 2003, Montreal, QUE, CA, August 2003
- Thirteenth International Congress of Slavists, Ljubljana, Slovenia, August 2003
- Seventh International Conference of the Bulgarian Studies Association, Columbus, OH, US, October 2003