Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session 5A: Long Papers
Time: Thursday, 15/Sept/2022, 11:30am - 1:00pm

Session Chair: Dario Kampkaspar, Universitäts- und Landesbibliothek Darmstadt
Location: ARMB 2.98 (Armstrong Building, Lecture Room 2.98; capacity 168)

Presentations
ID: 159 / Session 5A: 1
Long Paper
Keywords: TEI XML, Handwritten Text Recognition, HTR, Libraries

Evolving Hands: HTR and TEI Workflows for cultural institutions

J. Cummings1, D. Jakacki2, I. Johnson1, C. Pirmann2, A. Healey1, V. Flex1, E. Jeffrey1

1Newcastle University, United Kingdom; 2Bucknell University, USA

This Long Paper will look at the work of the Evolving Hands project, which is undertaking three case studies across a range of document forms to demonstrate how TEI-based HTR workflows can be iteratively incorporated into curation. These include 19th-20th century handwritten letters and diaries from the UNESCO Gertrude Bell Archive, 18th century German and 20th century French correspondence, and a range of printed materials from the 19th century onward in English and French. A joint case study converts legacy printed material of the Records of Early English Drama (REED) project. By covering a wide variety of periods and document forms, the project has a real opportunity to foster responsible and responsive support for cultural institutions.

See Uploaded Abstract for more information

Cummings-Evolving Hands-159.docx


ID: 109 / Session 5A: 2
Long Paper
Keywords: TEI, text extraction, linguistic annotation, digital edition, mass digitisation

Between automatic and manual encoding: towards a generic TEI model for historical prints and manuscripts

A. Pinche1, K. Christensen2, S. Gabay3

1Ecole nationale des chartes | PSL (France); 2INRIA (France); 3Université de Genève (Switzerland)

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. For textual documents, however, an image alone is not enough: thanks to automatic text recognition technology, it is now possible to extract the content of images on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021), but this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and must therefore be based on strict data structures that can be automated. Yet a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with varying situations.

The Gallic(orpor)a project proposes a solution to this contradiction, focusing on French historical documents produced between the 15th and the 18th century. It aims to enrich the digital facsimiles distributed by the French National Library (BnF) in two ways:

• text extraction, including segmentation of the image (layout analysis) with SegmOnto (Gabay, Camps, et al. 2021) and recognition of the text (handwritten text recognition, HTR), building on already existing models (Pinche and Clérice, 2021);

• linguistic annotation, including lemmatisation, POS tagging (Gabay, Clérice, et al. 2020), named entity recognition and linguistic normalisation (Bawden et al. 2022).

Our TEI document model has two coercive, automatically generated data blocks (sketched after this list):

• the <sourceDoc> with information from the digital facsimile, which computer vision, HTR and segmentation tools produce thanks to machine learning (Scheithauer et al. 2021);

• the <standOff> (Bartz et al. 2021a) with linguistic information produced by natural language processing tools (Gabay, Suarez, et al. 2022) to make it easier to search the corpus (Bartz et al. 2021b).
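
A hedged sketch of these two blocks may help (element structure follows TEI P5; the zone label comes from the SegmOnto vocabulary, while the facsimile URL, coordinates, transcription and annotation values are invented for illustration, and the project’s actual <standOff> layout, per Bartz et al. 2021a, may differ):

  <sourceDoc>
    <surface facs="https://gallica.bnf.fr/iiif/ark:/12148/EXAMPLE/f1">
      <!-- layout analysis: one zone per SegmOnto class -->
      <zone type="MainZone" points="120,150 880,150 880,1250 120,1250">
        <!-- HTR output: one line element per detected text line -->
        <line xml:id="l1">Si tost que le iour fut venu</line>
      </zone>
    </surface>
  </sourceDoc>
  <standOff>
    <!-- linguistic annotations pointing back at the transcription -->
    <spanGrp type="lemma">
      <span target="#l1">si tost que le jour estre venir</span>
    </spanGrp>
    <spanGrp type="pos">
      <span target="#l1">ADV ADV CONJ DET NOUN VERB VERB</span>
    </spanGrp>
  </standOff>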

Two other elements are added that can be customised according to researchers’ specific needs (see the skeleton after this list):

• a pre-filled <teiHeader> with basic bibliographic metadata automatically retrieved from (i) the digital facsimile’s IIIF Image API and (ii) the BnF’s Search/Retrieve via URL (SRU) API. The <teiHeader> can be enriched with additional data, as long as it respects a strict minimum encoding;

• a pre-editorialised <body>, the only element that is entirely free with regard to encoding choices.
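
Putting the four components together, the overall shape of a generated document is roughly as follows (a simplified skeleton, not the project’s exact schema):

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <!-- pre-filled from the IIIF Image API and the BnF SRU API;
           may be enriched as long as the strict minimum encoding is kept -->
    </teiHeader>
    <sourceDoc>
      <!-- coercive block: facsimile, zones, HTR lines -->
    </sourceDoc>
    <standOff>
      <!-- coercive block: linguistic annotations -->
    </standOff>
    <text>
      <body>
        <!-- free block: the editor's philological choices,
             pre-editorialised from the sourceDoc content -->
      </body>
    </text>
  </TEI>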

By restricting certain elements and leaving others customisable, our TEI model can efficiently pivot toward other export formats, including RDF and IIIF. Furthermore, the <sourceDoc> element’s strict and thorough encoding of all of the document’s graphical information allows the TEI document to be converted into PAGE XML and ALTO XML files, which can then be used to train OCR, HTR, and segmentation models. Thus our TEI model’s strict encoding does not limit philological choices, thanks to the free <body>; it even allows us to pre-editorialise the <body> from the content of the <sourceDoc> and, in the near future, the <standOff>.
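
For instance, a <zone>/<line> pair from the <sourceDoc> maps directly onto ALTO’s TextBlock/TextLine/String hierarchy. A hedged sketch of such a conversion (ALTO v4; identifiers and coordinates again invented):

  <alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
    <Layout>
      <Page WIDTH="1000" HEIGHT="1400" PHYSICAL_IMG_NR="1">
        <PrintSpace>
          <!-- the TEI zone becomes a TextBlock -->
          <TextBlock ID="zone1" HPOS="120" VPOS="150" WIDTH="760" HEIGHT="1100">
            <!-- the TEI line becomes a TextLine/String pair -->
            <TextLine ID="line1">
              <String CONTENT="Si tost que le iour fut venu"/>
            </TextLine>
          </TextBlock>
        </PrintSpace>
      </Page>
    </Layout>
  </alto>

ALTO files of this shape are a common training input for segmentation and HTR engines, which is what makes the round trip from TEI back to model training possible.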

Pinche-Between automatic and manual encoding-109.odt


ID: 128 / Session 5A: 3
Long Paper
Keywords: NER, HTR, Correspondence, Digital Scholarly Edition

Dehmel Digital: Pipelines, text as data, and editorial interventions at a distance

D. Maus1, J. Nantke2, S. Bläß2, M. Flüh2

1State and University Library Hamburg, Germany; 2University of Hamburg

Ida and Richard Dehmel were a famous, internationally well-connected artist couple around 1900. Their correspondence, comprehensively preserved in approx. 35,000 documents, has so far remained largely unexplored in the Dehmel Archive of the State and University Library Hamburg. The main reason is the sheer quantity of the material, which makes it difficult to explore using traditional methods of scholarly editing. Yet the corpus is relevant for future research precisely because of its size and variety: it not only contains many letters from important personalities of turn-of-the-century arts and culture, but also documents, on a large scale, personal relationships, central topics, and forms and ways of communication in the cultural life of Germany and Europe before the First World War.

The project Dehmel digital sets out to close this gap by creating a digital scholarly edition of the Dehmels’ correspondence that addresses these quantitative aspects with a combination of state-of-the-art machine learning approaches, namely handwritten text recognition (HTR) and named entity recognition (NER). At the heart of the project is a scalable pipeline that integrates automated and semi-automated text/data processing tasks. In our paper we will introduce and discuss the main steps: 1. importing the results of HTR from Transkribus and OCR4all; 2. applying a trained NER model; 3. disambiguating entities and referencing authority records with OpenRefine; 4. publishing data and metadata to a Linked Open Data web service. Our main focus will be on the pipeline itself, the “glue” that ties together well-established tools (Transkribus, OCR4all, Stanford CoreNLP, OpenRefine), our use of TEI to encode the relevant information, and the particular challenges we observe when treating text as data, i.e. combining automated and semi-automated processes with the need for editorial intervention.
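
To give a flavour of steps 2 and 3, a person name recognised by NER and disambiguated against an authority file might be encoded in TEI roughly like this (an illustrative sketch, not the project’s actual markup; the sentence and the authority identifier are placeholders):

  <!-- letter text after HTR (step 1), with a NER hit (step 2)
       linked to an authority record (step 3), ready for
       Linked Open Data publication (step 4) -->
  <p>Gestern traf ich
    <persName ref="https://d-nb.info/gnd/PLACEHOLDER">Detlev von Liliencron</persName>
    in Hamburg.</p>
  <!-- "Yesterday I met Detlev von Liliencron in Hamburg." -->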

Maus-Dehmel Digital-128.docx