TEI Conference and Members' Meeting 2022

Conference Agenda

Date: Tuesday, 13/Sept/2022

6:15pm - 7:30pm

Opening Keynote: Constance Crompton, "Situated, Partial, Common, Shared: TEI Data as Capta"
Location: ARMB: 2.98
Session Chair: James Cummings, Newcastle University

Starting with: Welcome To Newcastle University, Professor Jennifer Richards, Director of the Newcastle University Humanities Research Institute.

ID: 165 / Opening Keynote: 1
Invited Keynote

Situated, Partial, Common, Shared: TEI Data as Capta

C. Crompton

University of Ottawa, Canada

It has been a decade since Johanna Drucker reminded us that all data are capta in the TEI-encoded pages of Digital Humanities Quarterly. In some ways this may appear to be self-evident in the context of the TEI: for many TEI users, their primary encoded material is text, and the TEI tags are a textual intervention in the sea of primary text – the resulting markup is not data, as in something objectively observed, but rather capta, as in something situated, partial, and contextually freighted (as indeed is all data. All data is capta). That said, Drucker warns her readers against self-evident claims.

Drawing on Drucker's arguments, this keynote explores the tension in several of the TEI's models, and the challenges that arise from our need to have fixed start and end points, bounding boxes, interps, certainty, events, traits (the list goes on!) in order to do our analytical work. Drawing on a number of projects, I argue for the value of our shared markup language and the value it offers us through its data-like behaviour, even as it foregrounds how clearly how much TEI data, and indeed, all data, are capta.

Date: Wednesday, 14/Sept/2022

9:30am - 11:00am

Session 1A: Short-Papers
Location: ARMB: 2.98
Session Chair: Martin Holmes, University of Victoria

ID: 140 / Session 1A: 1
Short Paper
Keywords: text mining, stand-off annotations, models of text, generic services

Standoff-Tools. Generic services for building automatic annotation pipelines around existing tools for plain text analysis

C. Lück

Universität Münster, Germany

TEI XML excels at encoding text. But when it comes to machine-based

analysis of a corpus as data, XML is no good platform. NLP, NER, topic

modelling, text-reuse detection etc. work on plain text; they get very

complicated and slow, if they have to traverse a tree structure. While

extracting plain text from XML is simple, feeding the result back into

XML is tricky. However, having the analysis in XML is desired: Its

result can be related to the internal markup, e.g. for overviews of

names per chapter, ellipsis per verse etc. In my short paper I will

introduce standoff-tools, a suite of generic tools for building

(automatic) annotation pipelines around plain text tools.

standoff-tools implement the extractor *E* and the internalizer

*I*. *E* produces a special flavour of plain text, I term *equidistant

plain text*: The XML tags are replaced by special characters,

e.g. zero-width non-joiner U+200C, i.e. all non-special characters

have the same character offset as in the XML source. This equidistant

plain text can then be fed to an arbitrary tagger *T* designed for

plain text. Its only requirement is to produce positioning

information. *I* inserts tags based on positioning information into

XML. For this purpose, it splits the annotated spans of text, so that

the result is syntactically valid XML without overlapping edges. It

aggregates the splits back together with `@next` and `@from`.

Optionally, a shrinker *S* removes the special characters in the

output of *E* and also produces a map of character positions. This map

of character positions is applied by a corrector *C* to the

positioning information produced by the tagger *T*.

The internalizer can also be used to internalize stand-off markup

produced manually with CATMA, GNU Emacs standoff-mode, etc. into

syntactically correct XML.

ID: 103 / Session 1A: 2
Short Paper
Keywords: TEI, indexes, XQuery

TEI Automatic Enriched List of Names (TAELN): An XQuery-based Open Source Solution for the Automatic Creation of Indexes from TEI and RDF Data

G. Fernandez Riva

Universität Heidelberg, Germany

The annotation of names of persons, place or organizations is a common feature of TEI editions. One way of identifying the annotated individuals is through the use of IDs from authority records like Geonames, Wikidata or the GND.

In this paper I will introduce an open source tool written in XQuery that enables the creation of TEI indexes using a very flexible custom templating language. The TEI Automatic Enriched List of Names (TAELN) uses the ids according to one authority document to create a custom index (model.listLike) with information from one or more RDF endpoints.

TAELN has been developed for the edition of the diaries and travel journals from Albrecht Dürer and his family. People, places and art works are identified with GND-numbers in the TEI edition. The indexes generated with TAELN include some information from GND records, but mostly from duerer.online, a virtual research portal, created with WissKI (https://wiss-ki.eu/), which offers an RDF endpoint.

TAELN relies on an XML-template to indicate how to retrieve information from the different endpoints and how to structure the desired TEI output. The templates use a straight-forward but flexible syntax. Simple use cases are depicted in the following example that retrieves the person name from the GND and the occupation from WissKI (which relies on the so-called »Pathbuilder syntax«).

<persName origin="gnd">preferredNameForThePerson</persName>

<occupation origin="wisski">ecrm:E21_Person -> ecrm:P11i_participated_in -> wvz:WV7_Occupation -> ecrm:P3_has_note</occupation>

</person>

Much more complex outputs can be achieved. TAELN offers editions an out of the box solution to generate TEI indexes by gathering information from different endpoints and it only requires the creation of the corresponding template and the knowledge of how to apply an XQuery transformation. The tool will be published shortly before the date of the TEI conference.

ID: 151 / Session 1A: 3
Short Paper
Keywords: manuscripts, codicology, paleography, XForms

manuForma – A Web Tool for Cataloging Manuscript Data

M. de Molière

University of Munich, Germany

The team of the ERC-funded project "MAJLIS – The Transformation of Jewish Literature in Arabic in the Islamicate World" at the University of Munich needed a software solution for describing manuscripts in TEI that would be easy to learn for non-specialists. After about one year of development, manuForma provides to our manuscript catalogers an accessible platform for entering their data. Users can choose elements and attributes from a list, add them to their catalog file and rearrange them with a mouse click. While manuForma does not spare our catalogers the need to learn the fundamentals of TEI, the restrictions the forms based approach proffers, enhances both TEI conformance and the uniformity of our catalog records. Moreover, our tool eliminates the need to install commercial XML editors on the machine of each and every project member tasked with describing manuscripts. Instead, our tool offers a web interface for the entire editorial process.

At the heart, manuForma uses XForms, which has been modified to allow adding, moving and deleting elements and attributes. A tightly knit schema file controls which elements and attributes can be added and in which situations to ensure conformance to the project's scholarly objectives. As an existDB application, manuForma integrates well with other apps that provide the front end to the manuscript catalog. TEI records can be stored on and retrieved from GitHub, tying the efforts of the entire team together. The web solution is adaptable to other entities by writing a dedicated schema and template file. Moreover, manuForma will be available under an OpenSource licence.

11:30am - 1:00pm

Session 2A: Long Papers
Location: ARMB: 2.98
Session Chair: Elli Bleeker, Huygens Institute for the History of the Netherlands

ID: 131 / Session 2A: 1
Long Paper
Keywords: Herman Melville, genetic criticism, text analysis, R, XPath

Revision, Negation, and Incompleteness in Melville's _Billy Budd_ Manuscript

C. Ohge

School of Advanced Study, University of London, United Kingdom

In 2019, John Bryant, Wyn Kelley, and I released a beta-version of a digital edition of Herman Melville's last work _Billy Budd, Sailor_. This TEI-encoded edition required nearly 10 years of work to complete, mostly owing to the fact that this last, unfinished work by Melville survives in an incredibly complicated manuscript that demonstrates about 8 stages of revision. The digital edition (https://melville.electroniclibrary.org/versions-of-billy-budd) has since been updated, and it presents a fluid-text edition (Bryant 2002) in three versions: a diplomatic transcription of the manuscript, a 'base' (or clean readable) version of the manuscript, and a critical, annotated reading text generated from the base version. Nevertheless, it remained questionable to me how we could effectively use all of the sophisticated descriptive markup of the manuscript transcription for critical purposes. What is missing, in other words, is an effective analysis of the genesis of this work.

In this talk I would like to demonstrate recent work on text analyses on the TEI XML data of the manuscript for a chapter-in-progress of my book-length project entitled _Melville’s Codes: Literature and Computation Across Complex Worlds_ (co-authored with Dennis Mischke, and under contract with Bloomsbury). First I generated and visualised basic statistics of textual phenomena (additions, deletions, and substitutions, e.g.) using XPath expressions combined with the R programming language. I then used the XML2 and TidyText libraries in R to perform more sophisticated analyses of the manuscript in comparison to Melville's oeuvre. Ultimately the analyses show that _Billy Budd_ ought to be read as a testament to incompleteness and negation.

In general, Melville’s use of negations and negative sentiments increased throughout his fictional work. Although this trend drops off in the late poetry, _Billy Budd_ has the highest number of negations in all of Melville’s oeuvre. It also has more acts of deletion than addition in the manuscript. Yet these trends need to be analysed in the context of Melville’s incomplete manuscript, the ‘ragged edges’ of which demonstrate not only a late tendency to increase negative words and ideas, but also, in late revisions, to complicate the main characters of the novel (particularly Captain Vere) who represent justice in the story. Like 'Benito Cereno', the codes of judgment are shown to be inadequate to the task of reckoning with the tragic conditions represented in Melville’s final sea narrative. This inadequacy is illustrated by Vere’s reaction to Billy’s death, which is framed as a computation, an either/or conditional: ‘Captain Vere, either thro stoic self-control or a sort of momentary paralysis induced by emotional shock, stood erectly rigid as a musket in the ship-armorer's rack’ (Chapter 25). This thematic incompleteness is not only a metaphor in the text but a metaphor of the text of this incomplete story.

Christopher Ohge is Senior Lecturer in Digital Approaches to Literature at the School of Advanced Study, University of London. His book _Publishing Scholarly Editions: Archives, Computing, and Experience_ was published in 2021 by Cambridge University Press. He also serves as the Associate Director of the Herman Melville Electronic Library.

ID: 136 / Session 2A: 2
Long Paper
Keywords: digital editions, sentiment analysis, machine learning, literary analysis, corpus annotation

“Un mar de sentimientos”. Sentiment analysis of TEI encoded Spanish periodicals using machine learning

L. Krusic¹, M. Scholger¹, E. Hobisch², Y. Völkl²

¹Institute Centre of Information Modelling (Austrian Centre for Digital Humanities), University of Graz; ²Technical University Graz

Sentiment analysis (SA), one of the most active research areas in NLP for over two decades, focuses on the automatic detection of sentiments, emotions and opinions found in textual data (Liu, 2012). Recently, SA has also gained popularity in the field of Digital Humanities (Schmidt Burghardt & Dennerlein, 2021). This contribution will present the analysis of a TEI encoded digital scholarly edition of Spanish periodicals using a machine learning approach for sentiment analysis as well as the re-implementation of the results into TEI for further retrieval and visualization.

2:30pm - 4:00pm

Session 3A: Long Papers
Location: ARMB: 2.98
Session Chair: Gustavo Fernandez Riva, Universität Heidelberg

ID: 113 / Session 3A: 1
Long Paper
Keywords: Middle Ages, lexicography, glossary, quantitative analysis, Latin

Vocabularium Bruxellense. Towards Quantitative Analysis of Medieval Lexicography

K. Nowak¹, I. Krawczyk¹, R. Alexandre²

¹Institute of Polish Language (Polish Academy of Sciences), Poland; ²Institut de recherche et d'histoire des textes, France

The Vocabularium Bruxellense is a little-known example of medieval Latin lexicography (Weijers 1989). It has survived in a single manuscript dated to the 12th century and currently held at the Royal Library of Belgium in Brussels. In this paper we present the digital edition of the dictionary and the results of a quantitative study of its structure and content based on the TEI-conformant XML annotation.

First, we briefly discuss a number of annotation-related issues. For the most part, they result from the discrepancy between medieval and modern lexicographic practices which are accounted for in the 9th chapter of the TEI Guidelines (TEI Consortium). For example, a single paragraph of a manuscript may contain multiple dictionary entries which are etymologically or semantically related to the headword.

Medieval glossaries are also less consistent in their use of descriptive devices. For instance, the dictionary definitions across the same work may greatly vary as to their form and content. As such, they require fine-grained annotation if the semantics of the TEI elements is not to be strained.

Second, we present the TEI Publisher-based digital edition of the Vocabularium (Reijnders et al. 2022). At the moment, it provides basic browsing and search functionalities, making the dictionary available to the general public for the first time since the Middle Ages.

Thirdly, we demonstrate how the TEI-conformant annotation may enable a thourough quantitative analysis of the text which sheds a light on its place in a long tradition of medieval lexicography. We focus on two major aspects, namely the structure and the sources of the dictionary. As for the first, we present summary statistics of the almost 8,000 entries of the Vocabularium, expressed as a number of entries per letter and per physical page. We show that half of the entries are relatively short: a number among them contain only a one-word gloss and only 25% of entries contain 15 or more tokens.

Based on the the TEI XML annotation of nearly 1200 quotes, we were able to make a number of points concerning the function of quotations in medieval lexicographic works which is hardly limited to attesting specific language use. We observe that quotations are not equally distributed across the dictionary, as they can be found in slightly more than 10% of the entries, whereas nearly 7,000 entries have no quotations at all. The quotes are usually relatively short with only 5% containing 10 or more words. Our analysis shows that the most quoted author is by a wide margin Virgil followed by Horace, Lucan, Juvenal, Ovid, Plautus, and Terence (19). Church Fathers and medieval authors are seldom quoted, we have also discovered only 86 explicit Bible quotations so far.

In conclusions, we argue that systematic quantitative analyses of the existing editions of the medieval glossaries might provide useful insight into the development of this important part of the medieval written production.

ID: 162 / Session 3A: 2
Long Paper
Keywords: standardization, morphology, morphosyntax, ISO, MAF, stand-off annotation

ISO MAF reloaded: new TEI serialization for an old ISO standard

P. Banski¹, L. Romary², A. Witt¹

¹IDS Mannheim, Germany; ²INRIA, France

The ISO Technical Committee TC 37, Language and terminology, Subcommittee SC 4, Language resource management (https://www.iso.org/committee/297592.html, ISO TC37 SC4 henceforth) has been, for nearly 20 years now, the locus of much work focusing on standardization of annotated language resources. Through the subcommittee’s liaison with the TEI-C, many of the standards developed there use customizations of the TEI Guidelines for the purpose of serializing their data models. Such is the case of the feature structure standards (ISO 24610-1:2006, ISO 24610-2:2011), which together form chapter 18 of the Guidelines, as well as the standard on the transcription of the spoken language (ISO 24624:2016, reflected in ch. 8) or the Lexical Markup Framework (LMF) series, where ISO 24613-4:2021 mirrors ch. 9 of the Guidelines.

The Morphosyntactic Annotation Framework (ISO 24611:2012) was initially published with its own serialization format, interwoven with suggestions on how its fragments can be rendered in the TEI. In a recent cyclic revision process, a decision was made to divide the standard in two parts, and to replace the legacy serialization format with a customization of the TEI that makes use of the recent developments in the Guidelines – crucially, the work on the standOff element and the work on the att.linguistic attribute class. The proposed contribution reviews fragments of the revised standard and presents the TEI devices used to encode it. At the time of the conference, ISO/CD 24611-1 “Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model” will have been freshly through the Committee Draft ballot by the national committees mirroring ISO TC37 SC4.

In what follows, we briefly outline the basic properties of the MAF data model and review selected examples of its serialization in the TEI.

ID: 108 / Session 3A: 3
Long Paper
Keywords: lexicography, dictionaries, semantic web

TEI Modelling of the Lexicographic Data in the DARIAH-PL Project

K. Nowak, D. Mika, W. Łukasik

Institute of Polish Language (Polish Academy of Sciences), Poland

The main goal of project “DARIAH-PL Digital Research Infrastructure for the Arts and Humanities” project is building the Dariah.lab infrastructure, which would allow for sharing and integrated access to digital resources and data from various fields of the humanities and arts. Among numerous tasks that the Institute of Polish Language, Polish Academy of Sciences coordinates, we are working towards the integration of our lexicographic data with the LLOD resources (Chiarcos et al. 2012). The essential step of this task is to convert the raw text into TEI-compliant XML format (TEI Consortium).

In this paper we would like to outline the main issues involved in TEI XML modelling of these heterogeneous lexicographic data.

In the first part, we will give a brief overview of the formal and content features of the dictionaries. For the most part, they are paper-born works developed with the research community in mind and as such are rich in information and complex in structure. They cover diachronic development (from medieval Polish and Latin to present day Polish) and its functional variation of Polish (general language vs. dialects, proper names).

On a practical level, this meant that, first, substantial effort had to be put into optimizing the quality of the OCR output. Since, except for grobid-dictionaries (Khemakhem et al. 2018), there are no tools at the moment that would enable easy conversion of lexicographic data, the subsequent phase of structuring of dictionary text had to be applied on a per resource basis.

The TEI XML annotation has three main goals. First, it is a means of preserving the textuality of paper-born dictionaries which makes heavy use of formatting necessary to convey information and employ a complex system of text-based internal cross-references. Second, TEI modelling aims at a better understanding of each resource and its explicit description. The analysis is performed by lexicographers who may, however, come from a lexicographic tradition different from the one embodied in a particular dictionary, and thus need to make their interpretation of the dictionary text explicit. Regardless, in this way we may also detect and correct editorial inconsistencies, which are natural for collective works developed over many years. Third, the annotated text is meant to be the input of the alignment and linking tasks, it is then crucial that functionally equivalent structures were annotated in a systematic and coherent way. As we plan to provide an integrated access to the dictionaries, the TEI XML representation is also where the first phase of data reconciliation takes place. It does not only concern the structural units of a typical dictionary entry, such as <sense/> or <form/>, but also mapping between units of analytical language the dictionaries employ, such as labels, bibliographic reference system etc.

Date: Thursday, 15/Sept/2022

9:30am - 11:00am

Session 4A: Short-Papers
Location: ARMB: 2.98
Session Chair: Peter Stadler, Paderborn University

ID: 126 / Session 4A: 1
Short Paper
Keywords: digital texts, textual studies, born-digital, electronic literature

TEI and the Re-Encoding of Born-Digital and Multi-Format Texts

E. Forget, A. Galey

University of Toronto, Canada

What affordances can TEI encoding offer scholars who work with born-digital, multi-format, and other kinds of texts produced in today’s publishing environments, where the term “digitization” is almost redundant? How can we use TEI and other digitization tools to analyze materials that are already digital? How do we distinguish between a digital text’s multiple editions or formats and its paratexts, and what differences do born-digital texts make to our understanding of markup? Can TEI help with a situation such as the demise of Flash, where the deprecation of a format has left many works of electronic literature newly vulnerable — and, consequently, newly visible as historical artifacts?

These questions take us beyond descriptive metadata and back to digital markup’s origins in electronic typesetting, but also point us toward recent work on electronic literature, digital ephemera, and the textual artifacts of the very recent past (e.g. those described in recent work by Matthew Kirschenbaum, Dennis Tenen, and Richard Hughes Gibson). Drawing from textual studies, publishing studies, book history, disability studies, and game studies, we are experimenting with the re-encoding of born-digital materials, using TEI to encode details of the texts’ form and function as digital media objects. In some cases, we are working from a single digital source, and in others we are working with digital editions of materials that are available in multiple analogue and digital formats. Drawing on our initial encoding and modelling experiments, this paper explores the affordances of using TEI and modelling for born-digital and multi-format textual objects, particularly emerging digital book formats. We reconsider what the term “data” entails when one’s materials are born-digital, and the implications for digital preservation practice and the emerging field of format theory.

ID: 107 / Session 4A: 2
Short Paper
Keywords: online forum, thread structure, social media, computer mediated communication

Capturing the Thread Structure: A Modification of CMC-Core to Account for Characteristics of Online Forums

S. Reimann, L. Rodenhausen, F. Elwert, T. Scheffler

Ruhr-University Bochum, Germany

Representing computer mediated communication (CMC), such as discussions in online forums, according to the guidelines of the Text Encoding Initiative was addressed by the CMC Special Interest Group (SIG). Their latest schema, CMC-core, presents a basic way of representing a wide range of different types of CMC in TEI P5. However, this schema has a general aim and is not specifically tailored for capturing the thread structure of online forums.

In particular, CMC-core is organized centrally by the time stamp of posts (a timeline structure), whereas online forums often split into threads and subthreads, giving less importance to the time of posting. In addition, forums may contain quotes from external sources as well as from other forum posts, which need to be differentiated in an adapted <quote> element. Not only do online forums as a whole differ from other forms of CMC, but there are often also considerable differences between individual online forums. We created a corpus of posts from various religious online forums, including different communities on Reddit, as well as two German forums which specifically focus on the topic of religion, with the purpose of analyzing their structure and textual content. These forums differ in the way threads are structured, how emoticons and emojis are used, and how people are able to react to other posts (for example by voting).

This raises the need for a schema which on the one hand takes the features of online forums as a genre into account, and, on the other hand, is flexible enough to enable the representation of a wide range of different online forums. We present some modifications of the elements in CMC-core in order to guarantee a standardized representation of three substantially different online forums while retaining all their potentially interesting microstructural characteristics.

ID: 111 / Session 4A: 3
Short Paper
Keywords: digital publications, VRE, open access, scholarly communication, web publication

Publishing the grammateus research output with the TEI : how our scholarly texts become data

E. Nury

University of Geneva, Switzerland

The TEI is not exclusively used to encode primary sources: TEI-based scholarly publishing represents a non-negligible portion of TEI-encoded texts (Baillot and Giovacchini 2019). I present here how the encoding of secondary sources such as scholarly texts can benefit researchers, with the example of the grammateus project.

In the grammateus project, we are creating a Virtual Research Environment to present a new way of classifying Greek documentary papyri. This environment comprises a database of papyri, marked up with the standard EpiDoc subset of the TEI. It includes as well the textual research output from the project, such as introductory materials, detailed descriptions of papyri by type, and an explanation on the methodology of the classification. The textual research output was deliberately prepared as an online publication so as to fully take advantage of the interactivity with data offered by a web application, in contrast to a printed book. We are thus experimenting with a new model of scholarly writing and publishing.

In this short paper I will describe how we have used the TEI not only for modeling papyrological data, but also for the encoding of scholarly texts produced in the context of the project, which would have traditionally been material for a monograph or academic articles. I will also demonstrate how this has enabled us later on to enrich our texts with markup for features that have emerged as relevant. We implemented a spiraling encoding process in which methodological documentation and analytical descriptions keep feeding back the editorial encoding of the scholarly texts. Documentation and analytical text therefore become data, within a research process based on a feedback method.

ID: 153 / Session 4A: 4
Short Paper
Keywords: HTR, Transkribus, Citizen Science

Handwritten Text Recognition for heterogeneous collections? The Use Case Gruß & Kuss

S. Büdenbender¹, M. Seltmann², J. Baum¹

¹University of Applied Sciences Darmstadt (h_da), Germany; ²University and State Library Darmstadt, Germany

Gruß & Kuss – Briefe digital. Bürger*innen erhalten Liebesbriefe – a research project funded by BMBF for 36 months – aims to digitize and explore love letters from ordinary persons with the help of dedicated volunteers, also raising the question of how citizens can actively participate in the indexing and encoding of textual sources.

To present, transcriptions are made manually in Transkribus (lite), tackling a corpus consisting of more than 22,000 letters from 52 countries and 345 donators, divided into approximately 750 bundles (i.e., correspondences between usually two writers). The oldest letter dates from 1715, the most recent from 2021, using a very broad concept of letter and including, for instance, notes left on pillows or WhatsApp messages.

The paper investigates the applicability of Handwritten Text Recognition (HTR) to this highly heterogeneous stock in a citizen science context. In an explorative approach, we will investigate at which scope of a bundle, respectively at which number of pages of the same handwriting, HTR becomes worthwhile.

For this purpose, the effort of a manual transcription is first compared to the effort of a model creation in Transkribus (in particular the creation of a training and validation set by double keying), including final corrections. In a second step, we will explore whether a modification of the procedure can be used to process even smaller bundles. Based on given metadata (time of origin, gender, script ...) a first clustering can be created, and existing models can be used as a basis for graphemically similar handwritings, allowing training sets to be kept much smaller while maintaining acceptable error rates. Another possibility is to start off with mixed training sets covering a class of related scripts.

Furthermore, we discuss how manual transcription by citizen scientists can be quantified in relation to the project’s overall resources.

11:30am - 1:00pm

Session 5A: Long Papers
Location: ARMB: 2.98
Session Chair: Dario Kampkaspar, Universitäts- und Landesbibliothek Darmstadt

ID: 159 / Session 5A: 1
Long Paper
Keywords: TEI XML, Handwritten Text Recognition, HTR, Libraries

Evolving Hands: HTR and TEI Workflows for cultural institutions

J. Cummings¹, D. Jakacki², I. Johnson¹, C. Pirmann², A. Healey¹, V. Flex¹, E. Jeffrey¹

¹Newcastle University, United Kingdom; ²Bucknell University, USA

This Long Paper will look at the work of the Evolving Hands project which is undertaking three case studies ranging across document forms to demonstrate how TEI-based HTR workflows can be iteratively incorporated into curation. These range from: 19th-20th century handwritten letters and diaries from the UNESCO Gertrude Bell Archive, 18th century German, 20th century French correspondence, and a range of printed materials from the 19th century onward in English and French. A joint case study converts legacy printed material of the Records of Early English Drama (REED) project. By covering a wide variety of periods and document forms the project has a real opportunity here to foster responsible and responsive support for cultural institutions.

See Uploaded Abstract for more information

ID: 109 / Session 5A: 2
Long Paper
Keywords: TEI, text extraction, linguistic annotation, digital edition, mass digitisation

Between automatic and manual encoding: towards a generic TEI model for historical prints and manuscripts

A. Pinche¹, K. Christensen², S. Gabay³

¹Ecole nationale des chartes | PSL (France); ²INRIA (France); ³Université de Genève (Switzerland)

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images’ content on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations.

The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library in two different ways:

• text extraction, including the segmentation of the image (layout analysis) with SegmOnto (Gabay, Camps, et al. 2021) and the recognition of the text (Handwritten Text Recognition) augmenting already existing models (Pinche and Clérice, 2021);

• linguistic annotation, including lemmatisation, POS tagging (Gabay, Clérice, et al. 2020), named entity recognition and linguistic normalisation (Bawden et al. 2022).

Our TEI document modelling has two strictly coercive automatically generated data blocks:

• the <sourceDoc> with information from the digital facsimile, which computer vision, HTR and segmentation tools produce thanks to machine learning (Scheithauer et al. 2021);

• the <standOff> (Bartz et al. 2021a) with linguistic information produced by natural language processing tools (Gabay, Suarez, et al. 2022) to make it easier to search the corpus (Bartz et al. 2021b).

Two other elements are added that can be customised according to researchers’ specific needs:

• a pre-filled <teiHeader> with basic bibliographic metadata automatically retrieved from (i) the digital facsimile’s IIIF Image API and (ii) the BnF’s Search/Retrieve via URL (SRU) API. The <teiHeader> can be enriched with additional data, as long as it respects a strict minimum encoding;

• a pre-editorialised <body>. It is the only element totally free regarding encoding choices.

By restricting certain elements and allowing others to be customisable, our TEI model can efficiently pivot toward other export formats, including RDF and IIIF. Furthermore, the <sourceDoc> element’s strict and thorough encoding of all of the document’s graphical information allows the TEI document to be converted into PAGE XML and ALTO XML files, which can then be used to train OCR, HTR, and segmentation models. Thus, not only does our TEI model’s strict encoding avoid limiting philological choices, thanks to the <body>, it also allows us to pre-editorialise the <body> via the content of the <sourceDoc> and, in a near future, the <standOff>.

ID: 128 / Session 5A: 3
Long Paper
Keywords: NER, HTR, Correspondence, Digital Scholarly Edition

Dehmel Digital: Pipelines, text as data, and editorial interventions at the distance

D. Maus¹, J. Nantke², S. Bläß², M. Flüh²

¹State and University Library Hamburg, Germany; ²University of Hamburg

Ida and Richard Dehmel were a famous, internationally well-connected artist couple around 1900. The correspondence of the Dehmels, which has been comprehensively preserved in approx. 35,000 documents, has so far remained largely unexplored in the Dehmel Archive of the State and University Library Hamburg. The main reason for this is the quantity of the material that makes it difficult to explore the material using traditional methods of scholarly editing. However, the corpus is relevant for future research precisely because of its size and variety. It not only contains many letters from important personalities from the arts and culture of the turn of the century, but also documents personal relationships, main topics as well as forms and ways of communication in the cultural life of Germany and Europe before the First World War on a large scale.

The project Dehmel digital seeks out to close this gap by creating a digital scholarly edition of the Dehmels’ correspondence that addresses the quantitative aspects with a combination of state-of-the-art machine learning approaches, namely handwritten text recognition (HTR) and named entity recognition (NER). At the heart of the project is a scalable pipeline that integrates automated and semi-automated text/data processing tasks. In our paper we will introduce and discuss the main steps: 1. Importing the result of HTR from Transkribus and OCR4all, 2. Applying a trained NER model; 3. Disambiguating entities and referencing authority records with OpenRefine; 4. Publishing data and metadata to a Linked Open Data web service. Our main focus will be on the pipeline itself, the “glue” that ties together well-established tools (Transkribus, OCR4All, Stanford Core NLP, OpenRefine), our use of TEI to encode relevant information and the special challenges we observe when using text as data, i.e. combining automated and semi-automated processes with the desire of editorial interventions.

2:30pm - 4:00pm

Session 6A: An Interview With ... Lou Burnard
Location: ARMB: 2.98
Session Chair: Diane Jakacki, Bucknell University

An interview session: a short statement piece followed by interview questions, then audience questions.

4:30pm - 6:00pm

TEI Annual General Meeting - All Welcome
Location: ARMB: 2.98
Session Chair: Diane Jakacki, Bucknell University

Date: Friday, 16/Sept/2022

9:30am - 11:00am

Session 7A: Short Papers
Location: ARMB: 2.98
Session Chair: Patricia O Connor, University of Oxford

ID: 118 / Session 7A: 1
Short Paper
Keywords: Spanish literature, Digital library, TEI-Publisher, facsimile, sourceDoc

Encoding Complex Structures: The Case of a Gospel Spanish Chapbook

E. Leblanc, P. Jacsont

University of Geneva, France

The project Untangling the cordel seeks to study and revalue a corpus of Spanish chapbooks dating from the 19th century by creating a digital library (Leblanc and Carta 2021). This corpus of chapbooks, also called pliegos de cordel, is highly heterogeneous in its content and editorial formats, giving rise to multiple reflections on its encoding.

In this short paper, we would like to share our feedback and thoughts on the XML-TEI encoding of a Gospel pliego for its integration into TEI-Publisher.

This pliego is an in-4° containing 16 small columns with extracts from the Four Gospels (John's prologue, Annunciation, Nativity, Mark's finale and the passion according to John; i.e. the same extracts as those in the book of hours (Join-Lambert 2016)) duplicated on both sides. This printout had to be cut in half and then folded to obtain two identical sets of excerpts from the Four Gospels. Whoever acquires it appropriates the object for private devotions or protection: it is therefore not an object kept for reading (the text is written in Latin with small letters) but for apotropaic or curative use (Botrel 2021).

To put forward the interest of this pliego as a devotional object and not strictly as a textual object required much reflection concerning its encoding and its publication on our digital library. Indeed, depending on our choice of encoding, the information conveyed differs: should we favour a diplomatic and formal edition or an encoding that follows the reading?

To determine which encoding would be the most suitable, we decided to test two encoding solutions, one with <facsimile> and another with <sourceDoc>. The visualisation of the two encodings possibilities on TEI-Publisher will allow us to develop the advantages and disadvantages of each method.

ID: 124 / Session 7A: 2
Short Paper
Keywords: Digital Scholarly Edition, Dictionary, Linguistics, Manuscript

Annotating a historical manuscript as a linguistic resource

H.-J. Döhla³, H. Klöter², M. Scholger¹, E. Steiner¹

¹University of Graz; ²Humboldt-Universität zu Berlin; ³Universität Tübingen

The Bocabulario de lengua sangleya por las letraz de el A.B.C. is a historical Chinese-Spanish dictionary held by the British Library (Add ms. 25.317), probably written in 1617. It consists of 223 double-sided folios with about 1400 alphabetically arranged Hokkien Chinese lemmas in the Roman alphabet.

The contribution will introduce our considerations on how to extract and annotate linguistic data from the historical manuscript and the design of a digital scholarly edition (DSE) in order to answer research questions in the fields of linguistics, missionary linguistics and migration (Klöter/Döhla 2022).

ID: 163 / Session 7A: 3
Short Paper
Keywords: text mining, topic modeling, digital scholarly editions, data modeling, data integration

How to Represent Topic Models in Digital Scholarly Editions

U. Henny-Krahmer¹, F. Neuber²

¹University of Rostock, Germany; ²Berlin-Brandenburgische Akademie der Wissenschaften, Germany

Topic modeling (Blei et al. 2003, Blei 2012) as a quantitative text analysis method is not part of the classic editing workflow as it stands for a way of working with text that in many respects contrasts with critical editing. However, for the purpose of a thematic classification of documents, topic modeling can be a useful enhancement to an editorial project. It has the potential to replace the cumbersome manual work that is needed to represent and structure large edition corpora thematically, as has been done for instance in the projects Alfred Escher Briefedition (Jung 2022), Jean Paul – Sämtliche Briefe digital (Miller et al. 2018) or the edition humboldt digital (Ette 2016).

We apply topic modeling to two edition corpora of correspondence of the German-language authors Jean Paul (1763-1825) and Uwe Johnson (1934-1984), compiled at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) and the University of Rostock (Miller et al. 2018, Helbig et al. 2017). In our contribution, we discuss how the results of the topic modeling can be usefully integrated into digital editions. We propose to integrate them into the TEI corpora on three levels: (1) the topic model of a corpus, including the topic words and the parameters of its creation, is modeled as a taxonomy in a separate TEI file, (2) the relevance of the topics for individual documents is expressed in the text classification section of the TEI header of each document in the corpus, and (3) the assignment of individual words in a document to topics is expressed by links from word tokens to the corresponding topic in the taxonomy. Following a TEI encoding workflow as outlined above allows for developing digital editions that include topic modeling as an integral part of their user interface.

ID: 119 / Session 7A: 4
Short Paper
Keywords: Odyssey, heroines, prosopography, women

Analyzing the Catalogue of Heroines through Text Encoding

R. Milio

Bucknell University, United States of America

The Catalogue of Heroines (Odyssey 11.225-330) presents a corpus of prominent mythological women as Odysseus recounts the stories of each woman he encounters in the Underworld. I undertook a TEI close reading of the Catalogue in order to center ancient women in a discussion of the Odyssey and determine how the relationships between the heroines contribute to the Catalogue’s overall purpose. In this short paper I demonstrate first my process: developing my own detailed feminist translation of the Catalogue, applying a TEI close reading to both my translation and the original ancient Greek, and creating a customized schema to best suit my purposes. Then, I detail my analysis of my close reading using cross-language encoding and a prosopography I developed through that reading, which reveals complex connections, both explicit and implied, among characters of the Catalogue. Third, I present the result of this analysis: that through this act of close reading I identified a heretofore unconsidered list of objects within the Catalogue and then demonstrated how these four objects of the Catalogue, ζώνη (girdle), βρόχοs (noose), ἕδνα (bride-price), and χρυσὸs (gold), reveal the ancient Greek stigma surrounding women, sexuality, and fidelity. These objects clearly allude to negative perceptions of women in ancient Greek society and through these objects the Catalogue of Heroines reminds its audience of Odysseus’ concerns regarding the faithfulness of his wife Penelope. Ultimately, by applying and adapting a TEI close reading, I identified patterns within the text that spoke to a greater purpose for the Catalogue and the Odyssey overall, that was able to export for further analysis of this prosopographical data. By the time of the conference, I will be able to present data visualizations that provide pathways that can assist other classicists to center women in ancient texts.

11:30am - 1:00pm

Session 8A: Long Papers
Location: ARMB: 2.98
Session Chair: Meaghan Brown, Independent Scholar

ID: 102 / Session 8A: 1
Long Paper
Keywords: medieval studies; medieval literature; xforms; manuscript; codicology

Codex as Corpus : Using TEI to unlock a 14th-century collection of Old French short texts

S. Dows-Miller

University of Oxford, United Kingdom

Medieval manuscript collections of short texts are, in a sense, materially discrete corpora, offering data that can help scholarship understand the circumstances of their composition and early readership.

This paper will discuss the role played by TEI in an ongoing mixed-method study into a fourteenth-century manuscript written in Old French: Bibliothèque nationale de France, fonds français, 24432. The aim of the project has been to display how fruitful the combination of traditional and more data-driven approaches can be in the holistic study of individual manuscripts.

TEI has been critical to the project so far, and has enabled discoveries about the manuscript which have eluded less technologically enabled generations of scholarship. For example, quantitative analysis of scribal abbreviation, made possible through the manuscript’s encoding, has illuminated the contributions of a number of individuals in the production of the codex. Similarly, analysis of the people and places mentioned in the texts allows for greater localisation of the manuscript than was previously considered possible.

As with any project of this nature, the process of encoding BnF fr. 24432 in TEI has not been without difficulty, and so this paper will also discuss the ways in which attempts have been made to streamline the process through automation and UI tools, most notably in the case of this project through the use of XForms.

ID: 149 / Session 8A: 2
Long Paper
Keywords: ODD, ODD chaining, RELAX NG, schema, XSLT Stylesheets

atop: another TEI ODD processor

S. Bauman¹, H. Bermúdez Sabel², M. Holmes³, D. Maus⁴

¹Northeastern University, United States of America; ²University of Neuchâtel, Switzerland; ³University of Victoria, Canada; ⁴State and University Library Hamburg, Germany

TEI is, among other things, a schema. That schema is written in and customized with the TEI schema language system, ODD. ODD is defined by Chapter 22 of the _Guidelines_, and is also used to _define_ TEI P5. It can also be used to define non-TEI markup languages. The TEI supports a set of stylesheets (called, somewhat unimaginatively, “the Stylesheets”) that, among other things, convert ODD definitions of markup languages (including TEI P5) and customizations thereof into schema languages like RELAX NG and XSD that one can use to validate XML documents.

Holmes and Bauman have been fantasizing for years about re-writing those Stylesheets from scratch. Spurred by Maus’ comment of 2021-03-23[1] Holmes presented a paper last year describing the problems with the current Stylesheets and, in essence, arguing that they should be re-written.[2] Within a few months the TEI Technical Council had charged Bauman with creating a Task Force for the purpose of creating, from scratch, an ODD processor that reads in one or more TEI ODD customization files, merges them with a TEI language (likely, but not necessarily, TEI P5 itself), and generates RELAX NG and Schematron schemas. It is worth noting that this is a distinctly narrower scope than the current Stylesheets,[3] which, in theory, convert most any TEI into any of a variety of formats including DocBook, MS Word, OpenOffice Writer, MarkDown, ePub, LaTeX, PDF, and XSL-FO (and half of those formats into TEI); and convert a TEI ODD customization file into RELAX NG, DTD, XML Schema, ISO Schematron, and HTML documentation. A different group is working on the conversion of a customization ODD into customized documentation using TEIPublisher.[4]

The Task Force, which began meeting in April, comprises the authors. We meet weekly, with the intent of making slow, steady progress. Our main goals are that the deliverables be a utility that can be easily run on GNU/Linux, MacOS, or within oXygen, and that they be programs that can be easily maintained by any programmer knowledgeable about TEI ODD, XSLT, and ant. Of course we also want the program to work properly. Thus we are generating test suites and performing unit testing (with XSpec[5]) as we go, rather than creating tests as an afterthought. We have also developed naming and other coding conventions for ourselves and written constraints (mostly in Schematron) to help enforce them. So, e.g., all XSLT variables must start with the letter ‘v’, and all internal parameters must start with the letter ‘p’ or letters “tp” for tunnel parameters.

We are trying to tackle this enormous project in a sensible, piecemeal approach. We have (conceptually) completely separated the task of assembling one or more customization ODDs with a source ODD into a derived ODD from the task of converting the derived ODD into RELAX NG, and from converting the derived ODD into Schematron. In order to make testing-as-we-go easier, we are starting with the derived ODD→RELAX NG process, and expect to demonstrate some working code at the presentation.

2:30pm - 4:00pm

Closing Keynote: Emmanuel Ngue Um, 'Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages'
Location: ARMB: 2.98
Session Chair: Martina Scholger, University of Graz

With Closing Remarks, Dr James Cummings, Local TEI2022 Conference Organiser

ID: 166 / Closing Keynote: 1
Invited Keynote

Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages

E. Ngue Um

University of Yaoundé 1 & University of Bertoua (Cameroon), Cameroon

Text processing assumes two layers of textual data: a "noisy" layer and a "noiseless" layer. The “noisy” layer is generally considered unsuitable for analysis and is eliminated at the pre-processing stage. In current Natural Language Processing (NLP) technologies like text generation in machine translation, the representation of tones as diacritical symbols in the orthography of Niger-Congo languages leads to these symbols being pre-processed as “noisy” data. As an illustration, none of the 15 Niger-Congo tone languages modules available on Google Translate delivers in a systematic and consistent manner, text data that contains linguistic information encoded through tone melody.

The Text Encoding Initiative (TEI) is a framework which can be used to circumvent the “noisiness” brought about by diacritical tone symbols in the processing of text data of Niger-Congo languages.

In novel work, I propose a markup scheme for tone that encompasses:

a) The markup of tone units within an <m> (morpheme) element; this aims to capture the functional properties of tone units, just like segmental morphemes.

b) The markup of tonal characters (diacritical symbols) within a <g> (glyph) element and the representation of the pitch by hexadecimal data representing the Unicode character code for that pitch; this aims to capture tone marks as autonomous symbols, in contrast with their combining layout when represented as diacritics.

c) The markup of downstep and upstep within an <accid> (accidental) element mirroring musical accidentals such as “sharp” and “flat”; this aims to capture strictly melodic properties of tone on a separate annotation tier.

The objectives of tone encoding within the TEI framework are threefold:

a) To harness quantitative research on tone in Niger-Congo languages.

b) To leverage “clean” language data of Niger-Congo languages that can be used more efficiently in machine learning tasks for tone generation in textual data.

c) To gain better insights into the orthography of tone in Niger-Congo languages.

In this paper, I will show how this novel perspective to the annotation of tone can be applied productively, using a corpus of language data stemming from 120 Niger-Congo languages.

Session Overview
Location: ARMB: 2.98 Armstrong Building: Lecture Room 2.98. Capacity: 168