TEI Conference and Members' Meeting 2022

September 12 - 16, 2022 | Newcastle, UK


Workshop 1: From a collection of documents to a published edition : how to use an end-to-end publication pipeline [Full Day]

Time: Monday, 12/Sept/2022, 9:30am - 6:00pm

Location: ARMB: 3.38

Armstrong Building: Teaching Room 3.38. Capacity: 59

Presentations

ID: 125 / WS 1: 1
Workshop
Keywords: digital edition, historical manuscripts, encoding pipeline, publication workflow

From a collection of documents to a published edition : how to use an end-to-end publication pipeline

F. Chiffoleau¹,², H. Scheithauer¹

¹Inria, France; ²Le Mans Université, France

In 2021, at the previous TEI Conference, “Next Gen TEI”, I took part in a session where I presented a project I had then been working on for a year and a half. This project, which both relies heavily on the Text Encoding Initiative and benefits its community, focusses on creating a pipeline for the publication of digital scholarly editions. The pipeline was still a work in progress at the time of the 2021 Conference but is now complete. It aims to provide open-source, free, easy-to-use and interoperable tools that support the editorial process from the digitization of a collection of documents to its publication in a machine-readable standard.

Below, I briefly describe the six steps that make up this pipeline, and then explain how I intend to run the workshop around them.

First, the collection of images that makes up the corpus has to be preserved and curated online so that it remains available to the researcher. For this task, we rely on IIIF to ensure sustainability and interoperability.
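As an illustration of what relying on IIIF means in practice, here is a minimal Python sketch of how an image request is formed under the IIIF Image API. The server address and image identifier are hypothetical placeholders, not the ones used by the pipeline, and the helper names are ours.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    The URL pattern is {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}.
    The identifier is percent-encoded so that slashes inside it are not
    mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(base_url, identifier):
    """Build the URL of the info.json document that describes the image."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


if __name__ == "__main__":
    # Hypothetical server and identifier, for illustration only.
    base = "https://iiif.example.org/images/"
    print(iiif_image_url(base, "letters/page 001.jpg"))
    # A cropped detail at a fixed width: region is x,y,w,h in pixels.
    print(iiif_image_url(base, "letters/page 001.jpg",
                         region="0,0,1000,500", size="800,"))
    print(iiif_info_url(base, "letters/page 001.jpg"))
```

Because every compliant server answers the same URL pattern, the same image can be consumed by a transcription tool and by the final publication without copying the files.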

The next three steps (segmentation, transcription and post-OCR correction) are carried out with eScriptorium, an open-source automatic transcription application. It offers various options: image upload, manual and automatic segmentation and transcription, import of models, production of ground truth and model training. If errors remain in the transcription (in the case of automatic transcription), they can either be corrected manually in eScriptorium, or the files can be exported and corrected with the help of purpose-built scripts.
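To give an idea of what a correction script applied to exported transcriptions might look like, here is a minimal Python sketch of a word-level substitution pass. It is not one of the pipeline's scripts: the correction table and the function name are hypothetical, and a real correction step would likely be more elaborate.

```python
import re

# Hypothetical table of recurrent recognition errors and their corrections.
CORRECTIONS = {
    "tbe": "the",
    "aud": "and",
}


def correct_line(line, corrections=CORRECTIONS):
    """Replace known erroneous words in one transcribed line.

    Only whole words are replaced, so "tbe" is corrected while a longer
    word that merely contains "tbe" is left untouched.
    """
    if not corrections:
        return line
    alternatives = "|".join(re.escape(word) for word in corrections)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: corrections[match.group(1)], line)


if __name__ == "__main__":
    print(correct_line("tbe letter aud tbe envelope"))
```

A table-driven approach keeps the corrections reviewable: the list of substitutions can be checked by the editor before it is applied to the whole corpus.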

Once the transcription is ready, we encode it in TEI XML. For this step, we provide several solutions, depending on the transcription file format (PAGE XML, ALTO XML or plain text). We also offer a series of scripts and documentation that help automate and speed up this process.
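The following Python sketch illustrates the general idea of this step for the ALTO case: the text of each TextLine is read from its String elements and written into a TEI fragment with one lb per line. It is a simplified illustration under our own assumptions, not the pipeline's actual conversion script. The function names and the sample input are hypothetical, and the output is only a body fragment, not a complete TEI document with a teiHeader.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def local_name(tag):
    """Strip the namespace from a tag so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the text of every TextLine in an ALTO document, in order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(word for word in words if word))
    return lines


def lines_to_tei_body(lines):
    """Wrap the lines in a TEI body fragment, one numbered lb per line."""
    ET.register_namespace("", TEI_NS)
    body = ET.Element(f"{{{TEI_NS}}}body")
    block = ET.SubElement(body, f"{{{TEI_NS}}}ab")
    for number, text in enumerate(lines, start=1):
        line_break = ET.SubElement(block, f"{{{TEI_NS}}}lb", n=str(number))
        line_break.tail = text  # the line's text follows its lb element
    return ET.tostring(body, encoding="unicode")


# Hypothetical, heavily reduced ALTO input for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Dear"/><SP/><String CONTENT="friend,"/></TextLine>
    <TextLine><String CONTENT="I received your letter."/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(lines_to_tei_body(alto_lines(SAMPLE_ALTO)))
```

Keeping the line breaks as lb elements preserves the link between each encoded line and its position on the page image, which is useful later when text and facsimile are displayed together.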

The edition itself is made available for online consultation with the help of TEI Publisher, an application created to generate custom publications for corpora encoded in TEI XML. On this basis, we have developed and launched a dedicated application for digital scholarly editions (DiScholEd). It is available online together with thorough documentation, and is conceived as an open application: new corpora can always be added to it, and we welcome new collaborations.

The goal of our workshop is to demonstrate how an available corpus can be processed for publication in the DiScholEd application. Participants will experiment with a ready-to-use solution that provides quick and easy online publication of a corpus. They will also get tips and shortcuts to help speed up the creation of a digital edition. By the end of the session, participants will have a visualization of their own corpus, with the transformed text and the original image side by side, showing what can be achieved with TEI in the context of an end-to-end publication pipeline.

The program for this workshop is as follows. It will start with a presentation of the development of the pipeline, its objectives and how it works. The remaining time will then be divided into several slots corresponding to the steps of the pipeline. Each slot will start with a quick presentation of what is expected of the participants and which tools they will need. Participants will then be given time to process their data according to the requirements of that step, since every step requires a certain amount of time. At the end of the day, a 30-minute feedback session will allow the participants and the workshop organizers to assess the benefits of the session and consider possible further collaborations.

Given the number of steps in this pipeline and the time each of them requires, a full day is necessary for this workshop. The number of participants should be limited to 10 to 15, so that the two workshop conveners can provide the necessary technical support during the hands-on parts of the workshop.

To work effectively with the pipeline, participants will need a laptop and the following tools: a command line interface for running the scripts and an XML editor (Oxygen is the best choice). It is also preferable that they create accounts at Huma-Num and eScriptorium beforehand.

GitHub repository of the pipeline:

https://github.com/DiScholEd/pipeline-digital-scholarly-editions
