TEI Conference and Members' Meeting 2022

September 12 - 16, 2022 | Newcastle, UK

JavaScript is Disabled
Your browser's JavaScript functionality is disabled. It has to be enabled to use this function of ConfTool.
Here you can find information on how to enable JavaScript
If you have any problems, please contact the organisers at tei2022@ncl.ac.uk.

Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view (with abstracts and downloads if available).

Session Overview

Date: Monday, 12/Sept/2022

9:00am - 9:30am

Registration - Monday

9:30am - 6:00pm

Workshop 1: From a collection of documents to a published edition : how to use an end-to-end publication pipeline [Full Day]
Location: ARMB: 3.38

ID: 125 / WS 1: 1
Workshop
Keywords: digital edition, historical manuscripts, encoding pipeline, publication workflow

From a collection of documents to a published edition : how to use an end-to-end publication pipeline

F. Chiffoleau^1,2, H. Scheithauer¹

¹Inria, France; ²Le Mans Université, France

In 2021, during the last edition of the TEI Conference “Next Gen TEI”, I took part in a session where I presented a project I had been working on for a year and a half. This project, both relying massively on the Text Encoding Initiative and benefiting its community, focusses on the creation of a pipeline for the publication of digital scholarly editions. Our pipeline, which was still a work in progress at the time of the 2021 Conference, but is now complete, aims at providing open source, free, easy-to-use and interoperable tools; its goal is to support the editorial process from the digitization of a collection of documents to its publication in a machine-readable standard.

In the following, I will succinctly describe the six steps that compose this pipeline, and then move to the way I intend to conduct the workshop based on them.

Firstly, the collection of images that composes the corpus has to be preserved and curated somewhere online to keep them available for the researcher. For this task, we rely on IIIF, to ensure sustainability and interoperability.

The three following steps (segmentation/transcription/post-OCR correction) are conducted with eScriptorium, an open-source automatic transcription application. It offers various options: images uploading, manual and automatic segmentation/transcription, import of models, production of ground truths, model training. Finally, if there are remaining errors in the transcription (in case of automatic transcription), it is possible to either correct them manually in eScriptorium or export the files and correct with the help of specifically designed scripts.

Once the transcription is fully ready, we encode it in TEI XML. For this step, we provide various solutions, depending on the transcription file format (Page XML, XML Alto, Text). We also propose a series of scripts and documentation that help automatize and speed up this process.

The publication itself is made available for online consultation with the help of TEI Publisher, an application created to generate custom publications for corpora encoded in TEI XML. We have developed and launched a dedicated application for digital scholarly editions (DiScholEd) on this basis. It is available online together with a thorough documentation, and is conceived as an open application: new corpora can always be added to it, and we welcome new collaborations.

The goal of our workshop is to demonstrate how an available corpus could be processed for publication on the DiScholEd application. The workshop participants will learn to experiment with a ready-to-use solution that provides an easy and quick online publication of a corpus. They will also get tips and shortcuts to help speed up the creation of a digital edition. Moreover, by the end of the session, this workshop will provide the participants with a visualization of their respective corpus, with side by side transformed text and original image; all of which then showing what can be achieved while working with TEI in the context of an end-to-end publication pipeline.

The program for this workshop is the following: Firstly, it will start with a presentation of the development of the pipeline, its objectives and how it works. Then, the time we have will be divided into several slots corresponding to the work steps of the pipeline. Each slot will start with a quick presentation of what is expected of the participants and what tools they will need to use. Next, they will be allotted some time to process their data according to the requirements of the concerned work step, as they all require a certain amount of time. At the end of the day, a 30mn feedback session will make it possible for each participant as well as for the workshop organizers to assess the benefits of the session and envision further possible collaborations.

Considering the number of steps in this pipeline and the time required for each of these steps, a full day is necessary for this workshop. The number of participants should be 10-15 maximum, in order for the two workshop conveners to be able to provide the necessary technical support in the hands-on parts of the workshop.

In order for the participants to be able to work correctly on the pipeline, they will need a laptop as well as the following tools: a command line interface for the execution of the scripts and an XML editor (Oxygen is the best choice). It is also preferable if, beforehand, they get an account at Huma-Num and eScriptorium.

GitHub repository of the pipeline:

https://github.com/DiScholEd/pipeline-digital-scholarly-editions

Chiffoleau-From a collection of documents to a published edition-125.docx

11:00am - 11:30am

Monday Morning Refreshment Break
Location: ARMB: King's Hall

1:00pm - 2:30pm

Monday Lunch Break
Location: ARMB: King's Hall

2:30pm - 6:00pm

Workshop 2: Creating Digital Editions with FairCopy [Half Day, Afternoon]
Location: ARMB: 1.04

ID: 112 / WS 2: 1
Workshop
Keywords: Digital Humanities Critical Editions Tools IIIF

Creating Digital Editions with FairCopy

N. Laiacona

Performant Software Solutions LLC, United States of America

In this half day workshop, participants will learn how to use FairCopy to transform historical texts into online digital editions. Using crowdsourced transcriptions as a starting point, we will add semantic structure and mark names of people, places, and events. We will then publish our digital editions using Hugo.

The TEI Guidelines have been used by hundreds of scholarly projects and are an essential tool for researching, preserving, and disseminating cultural heritage world-wide. And yet, despite its mission to provide a common vocabulary for describing texts, TEI faces problems of adoption and use in the wider scholarly community. While the basics of TEI XML encoding are simple enough, true fluency in TEI requires institutional support and commitment in the form of training, technical staff, IT infrastructure, and the time and commitment of the individual scholar.

Even within institutions that have these resources, projects often adopt a simpler interface for domain experts to interact with. This interface then translates the scholar’s work into TEI behind the scenes. This is sometimes accomplished technologically, sometimes through a tiered system of labor, or both. These interfaces are more often than not specialized to the needs of the projects which develop them. This current state of affairs leads to a structural problem of access which further limits whose texts can be digitized and preserved.

FairCopy addresses this problem of access by providing a simple editing environment in which anyone can produce valid TEI documents. FairCopy doesn’t hide the complexity of TEI, but rather makes it available for users to explore at their own pace. Users are quickly comfortable with its interface and able to focus on the text, not XML syntax.

FairCopy has support for most of the 500+ elements in TEI and allows users to customize a schema for their particular project. Scholars can seamlessly import and export TEI-XML documents. Additionally, scholars can bring in IIIF images of primary resources and link them to their transcriptions.

In this half day workshop, participants will learn how to use FairCopy to transform historical texts into online digital editions encoded using TEI. Using crowdsourced transcriptions as a starting point, we will add semantic structure and mark names of people, places, and events. We will then publish our digital editions using Hugo.

In the first part of the workshop, we will begin with a demonstration of FairCopy. We will then select texts to work on based on participants interests. Participants are encouraged to bring their own texts. Finally, we will break into small groups.

In the second part, each group will work on encoding a text using FairCopy. Participants will work collaboratively to choose elements and attributes that best suit their selected texts. The presenter will float between groups answering questions.

In the third part, we will export our texts into a pre-made Hugo template that can display both the original IIIF page images and the TEI encoded texts.

Participants in this workshop will need to bring a Mac, Windows, or Linux laptop on which they can install FairCopy for free. No web design or XML skills are required.

Participants in this workshop will learn how to use FairCopy to create a digital edition. They will also learn about using TEI semantics to structure and mark texts. The will also gain familiarity with using IIIF Manifests to interoperate between library collections and digital editions.

Presenter Bio

Nick Laiacona is a partner at Performant Software Solutions LLC. Performant serves clients in the Digital Humanities throughout North America and Europe. Laiacona has developed tools for critical digital editions including: Juxta, Digital Mappa, TextLab, and now FairCopy. Laiacona has helped produce a number of critical editions, including “Secrets of Craft and Nature in Renaissance France” and the “Melville Electronic Library.”

Laiacona-Creating Digital Editions with FairCopy-112.docx

2:30pm - 6:00pm

Workshop 3: A short introduction to Schematron [Half Day, Afternoon]
Location: ARMB: 1.06

ID: 130 / WS 3: 1
Workshop
Keywords: Schematron, Validation, Quality Assurance

A short introduction to Schematron

D. Maus

State and University Library Hamburg, Germany

Schematron is a rule based validation language for structured documents. It was designed by Rick Jelliffe in 1999 (Jelliffe 1999) and standardized as ISO/IEC 19757-3 in 2006 (ISO 2006). The key concepts of Schematron validation are patterns that are the focus of a validation, rules selecting the portions of a document contributing to the pattern, and assertion tests that are run in the context of a rule. Schematron uses XPath both as the language to select the portion of a document and as the language of the assertion tests. This use of XPath gives Schematron the flexibility to validate arbitrary relationships and dependencies of information items in a document.

What also sets Schematron apart from other languages is that it encourages the use of natural language descriptions targeted to human readers. This way validation can be more than just a binary distinction (document valid/invalid) but also support authors of in-progress documents with quick feedback on erroneous or unwanted document structure and content.

The flexibility and (relative) simplicity of Schematron makes it an invaluable tool for XML-based text encoding projects. The range of supported tasks reaches from "hard" validation to enforce constraints on documents to "soft" validation to report potential problems such as Unicode characters from Unicode Private Use Areas to interactive error correction with Schematron extensions like Schematron QuickFix (Kutscherauer and Nadolu 2018).

This half-day workshop will introduce the participants to principle idea of Schematron and practice its application to XML-based text encoding projects. Together we will explore patterns, rules, and assertions as the basic Schematron concepts and touch phases, variables, and abstract patterns as more advanced features of Schematron validation.

From the participants the workshop requires a general understanding of XML document editing and basic knowledge of XPath. The material requirements are a projector and laptops to follow through with the examples given in the workshop. Any operating system with a recent Java Runtime is sufficient

Participants are recommended to bring their own device.

Maus-A short introduction to Schematron-130.docx

4:00pm - 4:30pm

Monday Afternoon Refreshment Break
Location: ARMB: King's Hall

Date: Tuesday, 13/Sept/2022

9:00am - 9:30am

Registration - Tuesday

9:30am - 1:00pm

Workshop 4: Building TEI-powered websites with static site technology. A hands on exploration of the publishing toolkit of the Scholarly Editing Journal [Half Day, Morning]
Location: ARMB: 3.38

ID: 134 / WS 4: 1
Workshop
Keywords: Digital publishing, TEI processing, static sites, programming

Building TEI-powered websites with static site technology. A hands on exploration of the publishing toolkit of the Scholarly Editing Journal

R. Viglianti

University of Maryland, United States of America

This half-day (approximately 3 hours) workshop will introduce TEI publishing with static site generators and front-end technologies, namely React JS and the static site generator Gatsby. It will introduce the attendees to the publishing strategies and tool sets developed for the reboot of the online Scholarly Editing journal (https://scholarlyediting.org/), which publishes, among essay-like content, TEI-based small scale editions. This workshop is aimed at attendees who already have some experience with programming (including XSLT) and the command line; however, all are welcome and will be supported as much as possible throughout the workshop.

The publishing tools presented in this workshop were developed for the reboot of the Scholarly Editing journal, which published its newest issue, volume 39, in April 2022. The previous site, built with Apache Cocoon, was converted into a static site and made accessible as an archive (https://scholarlyediting.org/se.index.issues.html). The new website and journal issues are built using Gatsby, a static site generator that relies on React JS for building user interfaces. The journal’s editors chose to adopt a static site generator because, once built, static sites do not need maintenance and can be easily moved and archived. This requires less infrastructure to publish and keep online the site on the web, which is desirable both for keeping operational costs of the journal low and to ensure its longevity. XML technologies can be and are used to generate static sites; the TEI Guidelines are a notable example. Regardless of how the static site is built, the result has minimal infrastructure requirements. A server is always needed to publish something on the web, but its role is limited to sending files over to the client, essentially just supporting HTTP GET operations. This is cheap and it makes it possible to rely on affordable web hosting, or take advantage of free services, or even use a home server.

During the workshop, participants will create a Gatsby website starting from a provided template that includes the TEI rendering tools gatsby-transformer-ceteicean and gatsby-theme-ceteicean. These tools re-implement principles pioneered by CETEIcean, which relies on the browser’s DOM processing and HTML5 Custom Elements to publish TEI documents as a component pluggable into any HTML structure (Cayless and Viglianti 2018). Example TEI documents to integrate into the website will be provided, but attendees are encouraged to bring their own.

After an introduction on static sites, the motivations for using them, and an open discussion, the workshop will introduce:

How to set up Gatsby and the CETEIcean plugins
How to use built-in behaviors
Customization via CSS (and CSS-in-JS)
Defining custom behaviors as React components
Applying transformation to TEI documents before and after ingestion into Gatsby.

If time allows, we will conclude with open discussion and collaborative experimentation.

Participants must bring their own laptop and be able to install (free) software on it. Internet access will be required. The tutor will require a projector.

References:

Cayless, Hugh, and Raffaele Viglianti. “CETEIcean: TEI in the Browser.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Cayless01.

Biography:

Dr. Raffaele (Raff) Viglianti is a Senior Research Software Developer at the Maryland Institute for Technology in the Humanities, University of Maryland. His research is grounded in digital humanities and textual scholarship, where “text” includes musical notation. He researches new and efficient practices to model and publish textual sources as innovative and sustainable digital scholarly resources. Dr. Viglianti is currently an elected member of the Text Encoding Initiative technical council and the Technical Editor of the Scholarly Editing journal.

Viglianti-Building TEI-powered websites with static site technology A hands-134.docx

9:30am - 1:00pm

Workshop 5: Introduction to XProc [Half Day, Morning]
Location: ARMB: 3.41

ID: 129 / WS 5: 1
Workshop
Keywords: XProc, Automation, Pipeline

Introduction to XProc

D. Maus

State and University Library Hamburg, Germany

XProc is an XML based programming language for processing documents in pipelines. Version 1.0 of the language was published as W3C Recommendation in 2010. The specification of the next version, XProc 3.0, is expected to be published as a community group report in late 2022. While XProc does not seem to have had big adoption in the digital humanities, it is successfully used in various branches of the publishing industry.

This half-day workshop will teach the participants the basic concepts of an XProc processing pipeline (pipelines, steps, ports) and practice their application in a series of excercises. The overall goal of the workshop is to enable to participants to write pipelines that chain common markup manipulation tasks such as loading, transforming, validating that can be used as building blocks for more elaborate steps or as one-off scripts in data maintenance.

Participants are recommended to bring their own device.

Maus-Introduction to XProc-129.docx

9:30am - 1:00pm

Workshop 6: Engaging TEI Editors Through LEAF-Writer [Half Day, Morning]
Location: ARMB: 1.06

ID: 127 / WS 6: 1
Workshop
Keywords: TEI-XML, web-based editor, RDF, named entity recognition

Engaging TEI Editors Through LEAF-Writer

D. Jakacki¹, S. Brown², J. Cummings³, M. Ilovan⁴, L. Frizzera², C. Black⁵, R. Milio¹

¹Bucknell University, United States of America; ²University of Guelph, Canada; ³Newcastle University, United Kingdom; ⁴University of Alberta, Canada; ⁵LAB Cooperative

In this half-day hands-on workshop, participants will learn how to use LEAF-Writer - an open-source, open-access Extensible Markup Language (XML) editor that runs in a web browser and offers scholars and their students a rich textual editing experience without the need to download, install, and configure proprietary software, pay ongoing subscription fees, or learn complex coding languages. This user-friendly editing environment incorporates Text Encoding Initiatives (TEI) and Resource Description Framework (RDF) standards, meaning that texts edited in LEAF-Writer are interoperable with other texts produced by the scholarly editing community and with other materials produced for the Semantic Web.

Participants will learn how to make the most of LEAF-Writer’s extensive capabilities on their own laptops.They will learn how to choose among TEI customizations that best support their work in diplomatic and/or semantic markup, add inline scholarly notes and glosses, and create annotations - tagging named entities and associating them with recognized authorities like VIAF, Wikidata, and Getty - that do double duty as in-text identifiers and potential contributions to the Semantic Web.

Jakacki-Engaging TEI Editors Through LEAF-Writer-127.docx

11:00am - 11:30am

Tuesday Morning Refreshment Break
Location: ARMB: King's Hall

1:00pm - 2:30pm

Tuesday Lunch Break
Location: ARMB: King's Hall

2:30pm - 4:00pm

SIG 1: Manuscripts
Location: ARMB: 3.38

2:30pm - 4:00pm

SIG 2: Ontologies
Location: ARMB: 1.06

2:30pm - 4:00pm

SIG 3: Linguistics
Location: ARMB: 3.41
Session Chair: Piotr Banski, IDS Mannheim

4:00pm - 4:30pm

Tuesday Afternoon Refreshment Break
Location: ARMB: King's Hall

4:30pm - 6:00pm

SIG 4: Correspondence
Location: ARMB: 3.38

4:30pm - 6:00pm

SIG 5: Newspapers and Periodicals
Location: ARMB: 1.06

4:30pm - 6:00pm

SIG 6: [Unbooked]
Location: ARMB: 3.41

6:15pm - 7:30pm

Opening Keynote: Constance Crompton, "Situated, Partial, Common, Shared: TEI Data as Capta"
Location: ARMB: 2.98
Session Chair: James Cummings, Newcastle University

Starting with: Welcome To Newcastle University, Professor Jennifer Richards, Director of the Newcastle University Humanities Research Institute.

ID: 165 / Opening Keynote: 1
Invited Keynote

Situated, Partial, Common, Shared: TEI Data as Capta

C. Crompton

University of Ottawa, Canada

It has been a decade since Johanna Drucker reminded us that all data are capta in the TEI-encoded pages of Digital Humanities Quarterly. In some ways this may appear to be self-evident in the context of the TEI: for many TEI users, their primary encoded material is text, and the TEI tags are a textual intervention in the sea of primary text – the resulting markup is not data, as in something objectively observed, but rather capta, as in something situated, partial, and contextually freighted (as indeed is all data. All data is capta). That said, Drucker warns her readers against self-evident claims.

Drawing on Drucker's arguments, this keynote explores the tension in several of the TEI's models, and the challenges that arise from our need to have fixed start and end points, bounding boxes, interps, certainty, events, traits (the list goes on!) in order to do our analytical work. Drawing on a number of projects, I argue for the value of our shared markup language and the value it offers us through its data-like behaviour, even as it foregrounds how clearly how much TEI data, and indeed, all data, are capta.

7:30pm - 9:00pm

Opening Keynote Reception
Location: ARMB: King's Hall

Date: Wednesday, 14/Sept/2022

9:00am - 9:30am

Registration - Wednesday

9:30am - 11:00am

Session 1A: Short-Papers
Location: ARMB: 2.98
Session Chair: Martin Holmes, University of Victoria

ID: 140 / Session 1A: 1
Short Paper
Keywords: text mining, stand-off annotations, models of text, generic services

Standoff-Tools. Generic services for building automatic annotation pipelines around existing tools for plain text analysis

C. Lück

Universität Münster, Germany

TEI XML excels at encoding text. But when it comes to machine-based

analysis of a corpus as data, XML is no good platform. NLP, NER, topic

modelling, text-reuse detection etc. work on plain text; they get very

complicated and slow, if they have to traverse a tree structure. While

extracting plain text from XML is simple, feeding the result back into

XML is tricky. However, having the analysis in XML is desired: Its

result can be related to the internal markup, e.g. for overviews of

names per chapter, ellipsis per verse etc. In my short paper I will

introduce standoff-tools, a suite of generic tools for building

(automatic) annotation pipelines around plain text tools.

standoff-tools implement the extractor *E* and the internalizer

*I*. *E* produces a special flavour of plain text, I term *equidistant

plain text*: The XML tags are replaced by special characters,

e.g. zero-width non-joiner U+200C, i.e. all non-special characters

have the same character offset as in the XML source. This equidistant

plain text can then be fed to an arbitrary tagger *T* designed for

plain text. Its only requirement is to produce positioning

information. *I* inserts tags based on positioning information into

XML. For this purpose, it splits the annotated spans of text, so that

the result is syntactically valid XML without overlapping edges. It

aggregates the splits back together with `@next` and `@from`.

Optionally, a shrinker *S* removes the special characters in the

output of *E* and also produces a map of character positions. This map

of character positions is applied by a corrector *C* to the

positioning information produced by the tagger *T*.

The internalizer can also be used to internalize stand-off markup

produced manually with CATMA, GNU Emacs standoff-mode, etc. into

syntactically correct XML.

Lück-Standoff-Tools Generic services for building automatic annotation pipelines around existing tools for .odt

ID: 103 / Session 1A: 2
Short Paper
Keywords: TEI, indexes, XQuery

TEI Automatic Enriched List of Names (TAELN): An XQuery-based Open Source Solution for the Automatic Creation of Indexes from TEI and RDF Data

G. Fernandez Riva

Universität Heidelberg, Germany

The annotation of names of persons, place or organizations is a common feature of TEI editions. One way of identifying the annotated individuals is through the use of IDs from authority records like Geonames, Wikidata or the GND.

In this paper I will introduce an open source tool written in XQuery that enables the creation of TEI indexes using a very flexible custom templating language. The TEI Automatic Enriched List of Names (TAELN) uses the ids according to one authority document to create a custom index (model.listLike) with information from one or more RDF endpoints.

TAELN has been developed for the edition of the diaries and travel journals from Albrecht Dürer and his family. People, places and art works are identified with GND-numbers in the TEI edition. The indexes generated with TAELN include some information from GND records, but mostly from duerer.online, a virtual research portal, created with WissKI (https://wiss-ki.eu/), which offers an RDF endpoint.

TAELN relies on an XML-template to indicate how to retrieve information from the different endpoints and how to structure the desired TEI output. The templates use a straight-forward but flexible syntax. Simple use cases are depicted in the following example that retrieves the person name from the GND and the occupation from WissKI (which relies on the so-called »Pathbuilder syntax«).

<persName origin="gnd">preferredNameForThePerson</persName>

<occupation origin="wisski">ecrm:E21_Person -> ecrm:P11i_participated_in -> wvz:WV7_Occupation -> ecrm:P3_has_note</occupation>

</person>

Much more complex outputs can be achieved. TAELN offers editions an out of the box solution to generate TEI indexes by gathering information from different endpoints and it only requires the creation of the corresponding template and the knowledge of how to apply an XQuery transformation. The tool will be published shortly before the date of the TEI conference.

Fernandez Riva-TEI Automatic Enriched List of Names-103.docx

ID: 151 / Session 1A: 3
Short Paper
Keywords: manuscripts, codicology, paleography, XForms

manuForma – A Web Tool for Cataloging Manuscript Data

M. de Molière

University of Munich, Germany

The team of the ERC-funded project "MAJLIS – The Transformation of Jewish Literature in Arabic in the Islamicate World" at the University of Munich needed a software solution for describing manuscripts in TEI that would be easy to learn for non-specialists. After about one year of development, manuForma provides to our manuscript catalogers an accessible platform for entering their data. Users can choose elements and attributes from a list, add them to their catalog file and rearrange them with a mouse click. While manuForma does not spare our catalogers the need to learn the fundamentals of TEI, the restrictions the forms based approach proffers, enhances both TEI conformance and the uniformity of our catalog records. Moreover, our tool eliminates the need to install commercial XML editors on the machine of each and every project member tasked with describing manuscripts. Instead, our tool offers a web interface for the entire editorial process.

At the heart, manuForma uses XForms, which has been modified to allow adding, moving and deleting elements and attributes. A tightly knit schema file controls which elements and attributes can be added and in which situations to ensure conformance to the project's scholarly objectives. As an existDB application, manuForma integrates well with other apps that provide the front end to the manuscript catalog. TEI records can be stored on and retrieved from GitHub, tying the efforts of the entire team together. The web solution is adaptable to other entities by writing a dedicated schema and template file. Moreover, manuForma will be available under an OpenSource licence.

de Molière-manuForma – A Web Tool for Cataloging Manuscript Data-151.docx

9:30am - 11:00am

Session 1B: Long Papers
Location: ARMB: 2.16
Session Chair: Syd Bauman, Northeastern University

ID: 139 / Session 1B: 1
Long Paper
Keywords: intertextuality, bibliography, interface development, customization

Texts All the Way Down: The Intertextual Networks Project

S. Connell, A. Clark

Northeastern University, United States of America

In 2016, the Women Writers Project (WWP) began a new research project on the multivalent ways that early women writers engaged with literate culture, at the center of which were systemic enhancements to a longstanding TEI corpus. The WWP’s flagship publication, Women Writers Online (WWO), collects approximately 450 works from the sixteenth to the nineteenth centuries, a watershed period in which women’s participation in the authorship and consumption of texts expanded dramatically. With generous funding from the National Endowment for the Humanities, we used WWO’s TEI encoding to jumpstart the creation of a standalone bibliography containing and linking to all the works referenced in WWO. This bibliography currently includes 3,431 book-level entries; 942 entries that are parts of larger works, such as individual essays or poems; and 126 simple bibliographic entries (e.g. books of the Bible). The bibliography identifies the genre of each work and the gender of the author, where known. We also expanded WWO’s custom TEI markup in order to say more about “intertextual gestures”—or WWO authors’ engagement with other works—which include not only named titles and quotations but also textual remix, adaptation, and parody. By the end of the grant period, we had identified 11,787 quotations, 5,692 titles, 4,825 biblical references, and 1,968 other bibliographic references, linking the individual instances within the WWO texts to the relevant bibliography entries.

Now, the WWP has published “Women Writers: Intertextual Networks” (https://wwp.northeastern.edu/intertextual-networks), a web interface built on these two sources of rich TEI data: the bibliography and WWO’s newly refined intertextual gestures. In this paper we will discuss the challenge of turning dense, textually-embedded data into an interface. Though the encoded texts themselves can stand alone as complete documents, we built Intertextual Networks with a focus on connective tissue, using faceting and linkages to invite curiosity about how authors and works are in conversation with each other. As the numbers above suggest, this project attempts to enable investigations at scale, but we have also sought to draw out the local, even individual, ways that our writers engaged with other texts and authors. Thus, the interface includes visualizations that show overall patterns of usage (for example, the kinds of intertextual gestures employed by each author), but it also allows the reader to view the complete text of each gesture, reading through quotations, named titles, citations, and so on in full, with filtering and faceting to support exploration of this language.

An important challenge for this project has been to build an interface that can address the multidirectional levels of textual imbrication at stake, allowing researchers to examine patterns among both referenced and referencing texts. This paper will share some key insights for TEI projects seeking to undertake similar markup expansion and interface development initiatives. We will discuss strategies for modeling, enabling discovery, and revealing complex layers of textual data and textuality among not only a primary corpus but also a related collection of texts.

Connell-Texts All the Way Down-139.docx

ID: 116 / Session 1B: 2
Long Paper
Keywords: sex, gender, TEI Guidelines, document data, theory

Revising Sex and Gender in the TEI Guidelines

E. Beshero-Bondar¹, R. Viglianti², H. Bermúdez Sabel³, J. Jenstad⁴

¹Penn State Behrend, United States of America; ²University of Maryland, United States of America; ³University of Neuchâtel, Switzerland; ⁴University of Victoria, Canada

In Spring 2022, the co-authors collaborated in a TEI Technical Council subgroup to introduce a long-awaited <gender> element and attribute. In the process, we wrote new language for the TEI Guidelines on how to approach these concepts. As we submit this abstract, our proposed changes are under review by the Council for introduction in the next release of the TEI Guidelines, slated for October 2022. We wish to discuss this work with the TEI community to validate and address

* the history of the Guidelines' representation of these concepts,

* applications of the new encoding, and

* the extent to which the new specifications preserve backwards compatibility.

We must recognize as digital humanists and textual scholars that coding sex and gender as true "data" from texts significantly risks categorical determinism and normative cultural bias (Sedgwick 1990, 27+). Nevertheless, we believe that the TEI community is well prepared to encounter these risks with diligent study and expertise on the cultures that produce the textual objects being encoded, in that TEI projects are theoretical in their deliberate efforts to model document data (Ramsay and Rockwell 2012). We seek to encourage TEI-driven research on sex and gender by enhancing the Guidelines' expressiveness in these areas. Our revision of the Guidelines therefore provides examples but resists endorsing any single particular standard for specifying values for sex or gender. We recommend that projects encoding sex and/or gender explicitly state the theoretical groundwork for their ontological modeling, such that the encoding articulates a context-appropriate, informed, and thoughtful epistemology.

Gayle Rubin's influential theory of "sex/gender systems" informs some of our new language in the Guidelines “Names and Dates” chapter (Rubin 1975). While updating existing examples for encoding sex and introducing related examples for encoding gender, we mention the “sex/gender systems” concept to suggest that sex and gender may be related, such that a culture's perspective on biological sex gives rise to its notions of gender identity. Unexpectedly, we found ourselves confronting the Guidelines' prioritization of personhood in discussion of sex, likely stemming from the conflation of sex and gender in the current version of the Guidelines. In revising the technical specifications describing sex, we introduced the term "organism" to broaden the application of sex encoding. We leave it to our community to investigate the fluid concepts of gender and sex in their textual manifestations of personhood and biological life.

Encoding of cultural categories, when unquestioned, can entrench biases and do harm, a risk we must face in digital humanities generally. Yet we seek to make the TEI more expressive and adaptable for projects that complicate, question, and theorize sex and gender constructions. We look forward to working with the TEI community, in hopes of continued revisions, examples, and theoretical document data modeling of sex and gender for future projects. In particular, we are eager to learn more from project customizations that “queer” the TEI and theorize about sexed and gendered cultural constructions, and we hope for a lively discussion at the TEI conference and beyond.

Beshero-Bondar-Revising Sex and Gender in the TEI Guidelines-116.docx

ID: 104 / Session 1B: 3
Long Paper
Keywords: TEI, Spanish, Survey, Community, Geopolitics of Knowledge

Where is the Spanish in the TEI?: Insights on a Bilingual Community Survey

G. del Rio Riande¹, S. Allés-Torrent²

¹CONICET, Argentine Republic; ²Unversity of Miami, USA

Who can best define the interests and needs of a community? The members of the community itself.

“Communicating the Text Encoding Initiative to a Multilingual User Community” is a research project financed by the A. Mellon Foundation in which scholars from North and South America are generating linguistic, cultural, didactic and situated educational materials to improve the XML-TEI encoding, editing and publication of Spanish texts.

As part of the project activities, we prepared a bilingual survey (Spanish-English) aimed at inquiring t who uses or has used XML-TEI practices, and where and how they have been applied to Spanish humanistic texts. Bearing in mind that many digital scholarly edition projects of Spanish texts are carried out in Spanish-speaking and Anglophone institutions, we did not focus on a geographical survey, but on the use of XML at a global level. The survey ran between February and April 2022. It is an anonymous survey and consists of 22 questions. It received 104 responses, 77 in Spanish and 28 in English.

Some of the data that we will discuss in this short presentation aims at illustrating the significant differences regarding the organization of projects, collaboration, financing and use of TEI in master's and doctoral research. In broad terms, the survey allowed us to better understand not only the Spanish-speaking community that uses XML-TEI, but also to think of strategies that can contribute with more inclusive practices for scholars from less represented countries and in less favorable contexts inside the global TEI community. Last but not least, we believe the survey will be useful for designing actions that can support a wider range of modes of interaction and collaboration inside the global TEI community.

del Rio Riande-Where is the Spanish in the TEI-104.docx

11:00am - 11:30am

Wednesday Morning Refreshment Break
Location: ARMB: King's Hall

11:30am - 1:00pm

Session 2A: Long Papers
Location: ARMB: 2.98
Session Chair: Elli Bleeker, Huygens Institute for the History of the Netherlands

ID: 131 / Session 2A: 1
Long Paper
Keywords: Herman Melville, genetic criticism, text analysis, R, XPath

Revision, Negation, and Incompleteness in Melville's _Billy Budd_ Manuscript

C. Ohge

School of Advanced Study, University of London, United Kingdom

In 2019, John Bryant, Wyn Kelley, and I released a beta-version of a digital edition of Herman Melville's last work _Billy Budd, Sailor_. This TEI-encoded edition required nearly 10 years of work to complete, mostly owing to the fact that this last, unfinished work by Melville survives in an incredibly complicated manuscript that demonstrates about 8 stages of revision. The digital edition (https://melville.electroniclibrary.org/versions-of-billy-budd) has since been updated, and it presents a fluid-text edition (Bryant 2002) in three versions: a diplomatic transcription of the manuscript, a 'base' (or clean readable) version of the manuscript, and a critical, annotated reading text generated from the base version. Nevertheless, it remained questionable to me how we could effectively use all of the sophisticated descriptive markup of the manuscript transcription for critical purposes. What is missing, in other words, is an effective analysis of the genesis of this work.

In this talk I would like to demonstrate recent work on text analyses on the TEI XML data of the manuscript for a chapter-in-progress of my book-length project entitled _Melville’s Codes: Literature and Computation Across Complex Worlds_ (co-authored with Dennis Mischke, and under contract with Bloomsbury). First I generated and visualised basic statistics of textual phenomena (additions, deletions, and substitutions, e.g.) using XPath expressions combined with the R programming language. I then used the XML2 and TidyText libraries in R to perform more sophisticated analyses of the manuscript in comparison to Melville's oeuvre. Ultimately the analyses show that _Billy Budd_ ought to be read as a testament to incompleteness and negation.

In general, Melville’s use of negations and negative sentiments increased throughout his fictional work. Although this trend drops off in the late poetry, _Billy Budd_ has the highest number of negations in all of Melville’s oeuvre. It also has more acts of deletion than addition in the manuscript. Yet these trends need to be analysed in the context of Melville’s incomplete manuscript, the ‘ragged edges’ of which demonstrate not only a late tendency to increase negative words and ideas, but also, in late revisions, to complicate the main characters of the novel (particularly Captain Vere) who represent justice in the story. Like 'Benito Cereno', the codes of judgment are shown to be inadequate to the task of reckoning with the tragic conditions represented in Melville’s final sea narrative. This inadequacy is illustrated by Vere’s reaction to Billy’s death, which is framed as a computation, an either/or conditional: ‘Captain Vere, either thro stoic self-control or a sort of momentary paralysis induced by emotional shock, stood erectly rigid as a musket in the ship-armorer's rack’ (Chapter 25). This thematic incompleteness is not only a metaphor in the text but a metaphor of the text of this incomplete story.

Christopher Ohge is Senior Lecturer in Digital Approaches to Literature at the School of Advanced Study, University of London. His book _Publishing Scholarly Editions: Archives, Computing, and Experience_ was published in 2021 by Cambridge University Press. He also serves as the Associate Director of the Herman Melville Electronic Library.

Ohge-Revision, Negation, and Incompleteness in Melvilles _Billy Budd_ Manuscript-131.docx

ID: 136 / Session 2A: 2
Long Paper
Keywords: digital editions, sentiment analysis, machine learning, literary analysis, corpus annotation

“Un mar de sentimientos”. Sentiment analysis of TEI encoded Spanish periodicals using machine learning

L. Krusic¹, M. Scholger¹, E. Hobisch², Y. Völkl²

¹Institute Centre of Information Modelling (Austrian Centre for Digital Humanities), University of Graz; ²Technical University Graz

Sentiment analysis (SA), one of the most active research areas in NLP for over two decades, focuses on the automatic detection of sentiments, emotions and opinions found in textual data (Liu, 2012). Recently, SA has also gained popularity in the field of Digital Humanities (Schmidt Burghardt & Dennerlein, 2021). This contribution will present the analysis of a TEI encoded digital scholarly edition of Spanish periodicals using a machine learning approach for sentiment analysis as well as the re-implementation of the results into TEI for further retrieval and visualization.

Krusic-“Un mar de sentimientos” Sentiment analysis of TEI encoded Spanish periodicals using machine lea.docx

11:30am - 1:00pm

Session 2B: Long Papers
Location: ARMB: 2.16
Session Chair: Hugh Cayless, Duke University

ID: 145 / Session 2B: 1
Long Paper
Keywords: collation, information transfer, ecdotics, materiality

TEICollator: a semi-automatic TEI to TEI workflow

M. Gille Levenson

ENS Lyon, France

Automated text comparison has been an area of interest for many years [Nury 2019]: tools such as CollateX allow automated text comparison, and even export to TEI. However, there is no tool today that allows, from transcripts encoded and structured in XML-TEI, to automate the collation of texts and to inject the produced apparatuses into the original files. Working in this way ensures that the contextual and structural information specific to each witness (structure, additions, deletions, line changes, etc) encoded in XML-TEI is not lost. In other words, there is a need of being able to work on textual differences without ignoring the individual, structural and material reality of each text or witness.

Furthermore, the increasing use of Optical Character Recognition (OCR) or Handwritten Text Recognition (HTR) tools [Kiessling 2019], which is interesting both in terms of speed of acquisition and of quality of the preserved information [Camps 2016], have consequences for the ecdotical methods: should we keep collating the text manually, when its acquisition has been done by the computer ?

My work focus on a semi-automatic collation workflow. I will present a complete TEI to TEI processing chain, from single TEI-encoded transcriptions to meaningful collated ones (by the production of typed apparatus, for instance: see [Camps 2018]) that allows to keep the original structural information. This process also identifies the omissions and transpositions, and finally the transformation of the data into documents that present the textual information in the clearest possible way. I will present my work from the perspective of information transfer and pointing out the dialectic between material and textual collation (as carried out by Blekker et al 2018, but using other methods): the latter being the alignment of material features encoded in TEI. Finally, I will outline the limitations and difficulties I face along the processing chain (can the tokenisation of TEI-encoded text be fully automated? What level of textual heterogeneity can manage the worflow ? What quality of lemmatisation is required? what encoding method should be prefered to get the best result posible ?).

I want to show how the TEI standard, the pivot format of this computational method, can be used to describe text as well as to process it. Finally, I will show how the last operation, the transformation from TEI to LaTeX, maybe the most complex task, is fully part of the ecdotic chain, and contributes to produce meaning from the data: in this sense, my work is part of the reflection carried out for several years on Digital Scholarly Editions [Pierazzo 2015; Pierazzo and Driscoll 2016], -- I made a choice to prefer the print/pdf format over a web interface -- thanks to the LaTeX Reledmac package developed and maintained by Maïeul Rouquette [Rouquette 2022].

This paper will be the technical counterpart of a paper presented in La Laguna in July, which will focus on the philological side of the processing chain.

Gille Levenson-TEICollator-145.odt

ID: 148 / Session 2B: 2
Long Paper
Keywords: digital edition, data quality assurance, XSL-FO, software test, PDF

Back to analog: the added value of printing TEI editions

M. Kupreyev

Goethe Universität Frankfurt am Main, Germany

Saale (2017) [1] provides the operational definition of a scholarly digital edition by contrasting its paradigm to that of a print edition. His bottom line is that any “digital edition cannot be given in print without significant loss of content and functionality”. In my talk I will touch upon the challenges of printing TEI XML datasets but also substantiate its positive effects: PDF export, indeed, presents only a part of the encoded information, but it can play essential role in data quality assurance. Creating a printed version of a digital edition can enhance the consistency of encoding and affect the overall production pipeline of the TEI XML data.

At the “School of Salamanca” [2] project the TEI XML of the early modern print editions goes through the restrictive Schema and Schematron check-ups, after which it is exported to HTML and JSON IIIF for web display [3]. Recently, an option of PDF export was added. Considering the complexity and the depth of annotation the solution integrated in Salamanca’s Oxygen workflow was chosen, namely a free Apache FOP processor. Similar results may have been achieved with TEI Publisher or Oxygen PDF Chemistry processor. The PDF export highlighted the issues which pertain to two ontologically different areas:

• Rendering XML elements in a constrained two-dimensional PDF layout.

• Varying XML encoding of semantically identical chunks of information.

The issues of the first type refer, for example, to the representation of marginal notes and their anchors, and to the pagination correlation between XML and IIIF (as representing the original), and PDF (as a print output). The second type embraces different rendering of semantically identical text parts, induced either by errors in the original or by the text editors.

PDF generation was initially intended to be one of the export methods of TEI data. It is now implemented early in the TEI production workflow, as it pinpoints the semantic and structural inconsistencies in the data and allows to correct them before the final XML release. PDF production thus adheres to one of the principles of agile software testing, which states that capturing and eliminating defects in the early stages of RDLC (research data life cycle) is less time-consuming, less resource-intensive and less prone to collateral bugs (Crispin 2008) [4].

[1] Sahle, Patrick. 2017. "What is a Scholarly Digital Edition?" in Digital Scholarly Editing, edited by Matthew James Driscoll and Elena Pierazzo, 19-39. Cambridge: Open Book Publishers.

[2] https://www.salamanca.school/en/index.html , accessed on 20.06.2022.

[3] https://blog.salamanca.school/de/2022/04/27/the-school-of-salamanca-text-workflow-from-the-early-modern-print-to-tei-all/,

https://blog.salamanca.school/de/2020/03/17/deutsch-entwicklung-der-webanwendung-v2-0/ , accessed on 20.06.2022.

[4] Crispin, LIsa. 2008. Agile Testing: A Practical Guide for Testers and Agile Teams. Addison-Wesley.

Kupreyev-Back to analog-148.docx

ID: 106 / Session 2B: 3
Long Paper
Keywords: poetry, rhyme, sound

Encoding sonic devices: what is it good for?

M. Holmes

University of Victoria, Canada

The Digital Victorian Periodical Poetry project[1] has captured metadata and page-images for 15,548 poems from Victorian periodicals, and transcribed and encoded a representative sample of 2,150 poems. Our encoding captures rhyme and other sonic devices such as anaphora, epistrophe, and refrains. This presentation will describe our encoding practices and then discuss what useful outcomes can be gained from this undertaking. Although even TEI P1 specified both a rhyme attribute to capture rhyme-scheme and a rhyme element for "very detailed studies of rhyming" (TEI P1 P172)[2], and all significant TEI tutorials teach the encoding of rhyme (e.g. TEI by Example Module 4), it is difficult to find work which makes explicit use of TEI encoding of rhyme (let alone other sonic devices) in the analysis of English poetry.

Is manual encoding of rhyme still necessary? Chisholm & Robey noted back in 1995 that "much of the analysis which currently requires extensive manual markup will in due course be carried out by electronic means" (100), and much work has been devoted to the automated detection of rhyme (Kavanagh 2008; Kilner & Fitch 2017). However, these tools are not completely successful, and in our own work, there is a consistent subset of cases which generate disagreement and discussion regarding type of rhyme, or even whether a rhyme is intended. We do make use of automated detection of anaphora and epistrophe, but only to generate suggestions for cases that might have been missed after the initial encoding has been done. We therefore believe that manually-curated encoding of sonic devices is a prerequisite for serious literary analysis which depends on that encoding.

[1] DVPP, https://dvpp.uvic.ca/.

[2] See also Chisholm & Robey 1995.

Having invested in careful encoding of sonic devices, what are the potential uses for research? DVPP has begun by making rhyme-scheme discoverable and searchable in our search interface, and this is beginning to generate research questions. We can already test notions such as the claim that irregular rhyme-schemes were more frequently used as the century progressed; a table of the percentage of irregularly-rhymed poems in each decade in our collection (Appendix) shows only the weakest support for this claim.

In addition to tracing trends in poetic practice, and the construction of historical rhyme dictionaries, sonic device encoding might also be used for:

- Dialect detection. For example, our dataset includes a significant subset of poems written in Scots dialect, and others which may or may not be; for problem cases, where other factors such as poet and host publication suggest a dialect poem, but surface features are not persuasive, rhyme patterns may provide more evidence.

- Genre detection. Particular poetic genres, such as sonnets or ballads are characterized by formal structures which include rhyme-scheme.

- Bad poetry. We are particularly interested in the notion of what constitutes bad poetry, and our early work suggests that poetry which subjectively seems to be of poor quality also exhibits features such as monotonous rhyme-schemes and intrusive echoic devices.

- Authorship attribution.

- Diachronic sound-change.

- Historical rhyming dictionaries.

Holmes-Encoding sonic devices-106.odt

1:00pm - 2:30pm

Wednesday Lunch Break
Location: ARMB: King's Hall

2:30pm - 4:00pm

Session 3A: Long Papers
Location: ARMB: 2.98
Session Chair: Gustavo Fernandez Riva, Universität Heidelberg

ID: 113 / Session 3A: 1
Long Paper
Keywords: Middle Ages, lexicography, glossary, quantitative analysis, Latin

Vocabularium Bruxellense. Towards Quantitative Analysis of Medieval Lexicography

K. Nowak¹, I. Krawczyk¹, R. Alexandre²

¹Institute of Polish Language (Polish Academy of Sciences), Poland; ²Institut de recherche et d'histoire des textes, France

The Vocabularium Bruxellense is a little-known example of medieval Latin lexicography (Weijers 1989). It has survived in a single manuscript dated to the 12th century and currently held at the Royal Library of Belgium in Brussels. In this paper we present the digital edition of the dictionary and the results of a quantitative study of its structure and content based on the TEI-conformant XML annotation.

First, we briefly discuss a number of annotation-related issues. For the most part, they result from the discrepancy between medieval and modern lexicographic practices which are accounted for in the 9th chapter of the TEI Guidelines (TEI Consortium). For example, a single paragraph of a manuscript may contain multiple dictionary entries which are etymologically or semantically related to the headword.

Medieval glossaries are also less consistent in their use of descriptive devices. For instance, the dictionary definitions across the same work may greatly vary as to their form and content. As such, they require fine-grained annotation if the semantics of the TEI elements is not to be strained.

Second, we present the TEI Publisher-based digital edition of the Vocabularium (Reijnders et al. 2022). At the moment, it provides basic browsing and search functionalities, making the dictionary available to the general public for the first time since the Middle Ages.

Thirdly, we demonstrate how the TEI-conformant annotation may enable a thourough quantitative analysis of the text which sheds a light on its place in a long tradition of medieval lexicography. We focus on two major aspects, namely the structure and the sources of the dictionary. As for the first, we present summary statistics of the almost 8,000 entries of the Vocabularium, expressed as a number of entries per letter and per physical page. We show that half of the entries are relatively short: a number among them contain only a one-word gloss and only 25% of entries contain 15 or more tokens.

Based on the the TEI XML annotation of nearly 1200 quotes, we were able to make a number of points concerning the function of quotations in medieval lexicographic works which is hardly limited to attesting specific language use. We observe that quotations are not equally distributed across the dictionary, as they can be found in slightly more than 10% of the entries, whereas nearly 7,000 entries have no quotations at all. The quotes are usually relatively short with only 5% containing 10 or more words. Our analysis shows that the most quoted author is by a wide margin Virgil followed by Horace, Lucan, Juvenal, Ovid, Plautus, and Terence (19). Church Fathers and medieval authors are seldom quoted, we have also discovered only 86 explicit Bible quotations so far.

In conclusions, we argue that systematic quantitative analyses of the existing editions of the medieval glossaries might provide useful insight into the development of this important part of the medieval written production.

Nowak-Vocabularium Bruxellense Towards Quantitative Analysis-113.odt

ID: 162 / Session 3A: 2
Long Paper
Keywords: standardization, morphology, morphosyntax, ISO, MAF, stand-off annotation

ISO MAF reloaded: new TEI serialization for an old ISO standard

P. Banski¹, L. Romary², A. Witt¹

¹IDS Mannheim, Germany; ²INRIA, France

The ISO Technical Committee TC 37, Language and terminology, Subcommittee SC 4, Language resource management (https://www.iso.org/committee/297592.html, ISO TC37 SC4 henceforth) has been, for nearly 20 years now, the locus of much work focusing on standardization of annotated language resources. Through the subcommittee’s liaison with the TEI-C, many of the standards developed there use customizations of the TEI Guidelines for the purpose of serializing their data models. Such is the case of the feature structure standards (ISO 24610-1:2006, ISO 24610-2:2011), which together form chapter 18 of the Guidelines, as well as the standard on the transcription of the spoken language (ISO 24624:2016, reflected in ch. 8) or the Lexical Markup Framework (LMF) series, where ISO 24613-4:2021 mirrors ch. 9 of the Guidelines.

The Morphosyntactic Annotation Framework (ISO 24611:2012) was initially published with its own serialization format, interwoven with suggestions on how its fragments can be rendered in the TEI. In a recent cyclic revision process, a decision was made to divide the standard in two parts, and to replace the legacy serialization format with a customization of the TEI that makes use of the recent developments in the Guidelines – crucially, the work on the standOff element and the work on the att.linguistic attribute class. The proposed contribution reviews fragments of the revised standard and presents the TEI devices used to encode it. At the time of the conference, ISO/CD 24611-1 “Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model” will have been freshly through the Committee Draft ballot by the national committees mirroring ISO TC37 SC4.

In what follows, we briefly outline the basic properties of the MAF data model and review selected examples of its serialization in the TEI.

Banski-ISO MAF reloaded-162.odt

ID: 108 / Session 3A: 3
Long Paper
Keywords: lexicography, dictionaries, semantic web

TEI Modelling of the Lexicographic Data in the DARIAH-PL Project

K. Nowak, D. Mika, W. Łukasik

Institute of Polish Language (Polish Academy of Sciences), Poland

The main goal of project “DARIAH-PL Digital Research Infrastructure for the Arts and Humanities” project is building the Dariah.lab infrastructure, which would allow for sharing and integrated access to digital resources and data from various fields of the humanities and arts. Among numerous tasks that the Institute of Polish Language, Polish Academy of Sciences coordinates, we are working towards the integration of our lexicographic data with the LLOD resources (Chiarcos et al. 2012). The essential step of this task is to convert the raw text into TEI-compliant XML format (TEI Consortium).

In this paper we would like to outline the main issues involved in TEI XML modelling of these heterogeneous lexicographic data.

In the first part, we will give a brief overview of the formal and content features of the dictionaries. For the most part, they are paper-born works developed with the research community in mind and as such are rich in information and complex in structure. They cover diachronic development (from medieval Polish and Latin to present day Polish) and its functional variation of Polish (general language vs. dialects, proper names).

On a practical level, this meant that, first, substantial effort had to be put into optimizing the quality of the OCR output. Since, except for grobid-dictionaries (Khemakhem et al. 2018), there are no tools at the moment that would enable easy conversion of lexicographic data, the subsequent phase of structuring of dictionary text had to be applied on a per resource basis.

The TEI XML annotation has three main goals. First, it is a means of preserving the textuality of paper-born dictionaries which makes heavy use of formatting necessary to convey information and employ a complex system of text-based internal cross-references. Second, TEI modelling aims at a better understanding of each resource and its explicit description. The analysis is performed by lexicographers who may, however, come from a lexicographic tradition different from the one embodied in a particular dictionary, and thus need to make their interpretation of the dictionary text explicit. Regardless, in this way we may also detect and correct editorial inconsistencies, which are natural for collective works developed over many years. Third, the annotated text is meant to be the input of the alignment and linking tasks, it is then crucial that functionally equivalent structures were annotated in a systematic and coherent way. As we plan to provide an integrated access to the dictionaries, the TEI XML representation is also where the first phase of data reconciliation takes place. It does not only concern the structural units of a typical dictionary entry, such as <sense/> or <form/>, but also mapping between units of analytical language the dictionaries employ, such as labels, bibliographic reference system etc.

Nowak-TEI Modelling of the Lexicographic Data in the DARIAH-PL Project-108.docx

2:30pm - 4:00pm

Session 3B: Notes from the DEPCHA Field and Beyond: TEI/XML/RDF for Accounting Records
Location: ARMB: 2.16
Session Chair: Syd Bauman, Northeastern University

ID: 154 / Session 3B: 1
Panel
Keywords: accounts, accounting, DEPCHA, bookkeeping ontology

Notes from the DEPCHA Field and Beyond: TEI/XML/RDF for Accounting Records

K. Tomasek¹, O. Bullock¹, L. Hermsen², R. Walker², N. Kokaze³

¹Wheaton College Massachusetts, United States of America; ²Rochester Institute of Technology, United States of America; ³Chiba University, Japan

Notes from the DEPCHA Field and Beyond: TEI/XML/RDF for Accounting Records

Session Proposal—Short Papers

TEI 2022, Newcastle

Abstract:

The short papers in this session focus on questions that arise in the process of editing manuscript account books. Some of these questions result from the “messiness” of accounting practices in contrast to the “rationality” of accounting principles; others arise from efforts to reflect in the markup social and economic relationships beyond those imagined in Chapter 14 of the P5 TEI Guidelines, “Tables, Formulae, Graphics, and Notated Music.” The Bookkeeping Ontology developed by Christopher Pollin for the Digital Edition Publishing Cooperative for Historical Accounts (DEPCHA) in the Graz Asset Management System (GAMS) extends the potential of TEI/XML using RDF.

In “Operating Centre Mills,” Tomasek and Bullock focus on markup for information about the people, materials, and machines used to produce cotton batting at Centre Mills, a textile mill in Norton, Massachusetts, in 1847-48. The general ledger for this enterprise includes store accounts, production records, and tracking of materials used to run the mill. Entries that reflect the costs of mill operation show sources of raw cotton, daily use of materials, and payments for wages and board for a small labor force. Examples in the paper demonstrate flexible use of the <measure> element combined with a draft taxonomy based on Historical Statistics of the United States, a resource for economic history originally published by the U.S. Bureau of the Census. The goal of the edition is to develop additional semantic markup to supplement Pollin’s Bookkeeping Ontology.

“Wages and Hours,” Hermsen and Walker’s paper, emerges from their work on a digital scholarly edition of account books of William Townsend & Sons, Printers, Stationers, and Account Book Manufacturers, Sheffield UK (1830-1910). Volume 3, “Business Guide and Works Manual,” speaks both to book history and to cultural observations about unionization, gender roles, and credit/debit accounting. Parts of this complex manuscript might be considered a nineteenth-century commonplace book; it also contains specific instructions for book binding, including lists of required materials and a recipe for glue.

The financial accounts in this collection are recorded in ambiguous tabular form with in-text page references to nearly indecipherable price keys. For example, Townsend provides a “key” to determine the size of an account book. The formula is figured using imperial standards for the size of a sheet of paper (i.e. Foolscap) and quarto or octavo folds of the sheet and the number of sheets. This formula, along with the type of ruling and binding, provides the necessary numbers for the arithmetic that will determine the price of an account book.

Naoki’s paper, “Stakeholders in the British Ship-Breaking Industry,” develops a set of methods to analyse structured data of historical financial records, taking a disbursement ledger of Thomas W. Ward, the largest British shipbreaker in the twentieth century, as an example. That ledger is held by the Marine Technology Special Collection at Newcastle University, UK. The academic contribution of this research is to critically examine the possibilities and limitations of DEPCHA, the ongoing digital humanities approach for semantic datafication of historical financial records with the TEI and RDF, mainly developed by scholars in the United States and Austria, and to present an original argument in British maritime history, which is to visualise a part of the overall structure of the British shipbreaking industry.

Development of DEPCHA was supported by a joint initiative of the National Historic Publications and Records Commission at the National Archives and Records Administration in the United States and the Andrew W. Mellon Foundation.

Bios:

Kathryn Tomasek is Professor of History at Wheaton College. She has been working on TEI for account books since 2009, and she was PI for the DEPCHA planning award in 2018. She chaired the TEI Board between 2018 and 2021

Olivia Bullock is a senior Creative Writing major at Wheaton College who studies intersectional identities in literature and history.

Lisa Hermsen is Professor and Caroline Werner Gannett Endowed Chair in the College of Liberal Arts at Rochester Institute of Technology.

Rebecca Walker, Digital Humanities Librarian, coordinates large-scale DH projects and supports classroom digital initiatives in the College of Liberal Arts at Rochester Institute of Technology.

Naoki Kokaze is an Assistant Professor at Chiba University, where he leads the design and implementation of DH-related lectures in the government-funded humanities’ graduate education program conducted in collaboration with several Japanese universities. He is a PhD candidate in History at the University of Tokyo, writing his doctoral dissertation focusing on the social, economic, and diplomatic aspects of the disposal of obsolete British Royal Navy’s warships from the mid-nineteenth century through the 1920s.

Tomasek-Notes from the DEPCHA Field and Beyond-154.docx

4:00pm - 4:30pm

Wednesday Afternoon Refreshment Break
Location: ARMB: King's Hall

4:30pm - 6:00pm

Poster Slam, Session, and Reception
Location: ARMB: King's Hall
Session Chair: Syd Bauman, Northeastern University

The Poster Slam and Session will start with a 1 minute - 1 slide presentation by all poster presenters summarising their poster and why you should come see it.

There will be an informal drinks and nibbles reception during the poster session.

ID: 115 / Poster Session: 1
Poster
Keywords: Early modern history, Ottoman, Edition

The QhoD project: A resource on Habsburg-Ottoman diplomatic exchange

S. Kurz, M. Mayer, Y. Yılmaz

Austrian Academy of Sciences, Austria

After having started as a cross-disciplinary (Early modern history, Ottoman studies) in 2020, the Digitale Edition von Quellen zur habsburgisch-osmanischen Diplomatie 1500–1918 (QhoD) project has recently gone public with their TEI based source editions related to the diplomatic exchange between the Ottoman and Habsburg empires.

Unique features of QhoD are:

- QhoD is editing sources from both sides (Habsburg and Ottoman archives), giving complimentary views; Ottoman sources are translated into English

- diversity of source genres (e.g. letters, contracts, travelogues, descriptions and depictions of cultural artefacts in LIDO; protocol register entries, Seyahatnâme, Sefâretnâme, newspapers, etc.)

- openness to outside collaboration (bring your TEI data!)

For Ottoman sources, QhoD is adhering to the İslam Ansiklopedisi Transkripsiyon Alfabesi transcription rules (Arabopersian to Latin transliteration). Transcriptions are aided by using Transkribus HTR mainly for German language sources, with ventures into Ottoman HTR together with other projects. Named entity data is curated in a shared instance of the Austrian Prosopographical Information System (APIS), aligned to GND identifiers.

By the time of writing, <https://qhod.net> features

- by language: 60 German, 42 Ottoman language documents

- by genre: 60 letters, 20 protocol register entries, 16 official records, 5 artefacts, 4 travelogues, 4 reports, 3 instructions.

- by embassy/timeframe: 16 sources related to correspondence between Maximilian II and Selim II (1566–1574); 31 sources on Rudolf Schmid zu Schwarzenhorn’s internuntiature (1649); 61 sources on the mutual grand embassies of Virmont and Ibrahim Pasha (1719–1720)

The poster will describe those sources and the TEI-infused reasoning behind their edition, as well as the technical implementation, which uses the GAMS repository software to archive and disseminate data.

QhoD uses state-of-the-art TEI/XML technology to improve availability of archival material essential for understanding centuries of mutual relations between two large imperial entities.

Kurz-The QhoD project-115.odt

ID: 110 / Poster Session: 2
Poster
Keywords: Semantic Web, Mobility Studies, Travelogues

Building a digital infrastructure for the edition and analysis of historical travelogues

S. Balck

IOS Regensburg, Germany

The core of the project stems from the unpublished records of Franz Xaver Bronner's (1758 - 1850) journey from Aarau, via St. Petersburg, to the university in Kazan (1810) and his way back (1817) via Moscow, Lviv and Vienna. A digital edition of these manuscripts will be created, enhanced by Semantic Web and Linked Data technologies.

The project will use the annotated critical text edition of the work above as a Case Study with the aim of developing a modularly expandable digital research infrastructure. This infrastructure will support digital transcription, annotation and visualisation of travelogues.

In the preliminary stages of the project, the first and more extensive part (the outward journey) of Franz Xaver Bronner's travelogue manuscript has already been transcribed with Transkribus. High-quality digital copies were made for Handwritten Text Recognition and training modules were developed on the basis of the manually transcribed texts. These are to be used for the semi-automatic transcription of other related texts. People, places, travel and other events were annotated with XML markup elements using TEI.

In the next step, visualisations and ontology design patterns for travelogues and itineraries will be developed. This includes a new annotation scheme for linking the TEI annotated text passages to associated database entries. The edition will enable the visualisations of textual information and contextual data.

Balck-Building a digital infrastructure for the edition and analysis-110.docx

ID: 122 / Poster Session: 3
Poster
Keywords: scholarly digital editions, conceptual model, digital philology, textual criticism, text modeling

TEI and Scholarly Digital Editions: how to make philological data easier to retrieve and elaborate

C. Martignano

University of Florence, Italy

In the past few decades the number of TEI-encoded scholarly digital editions (SDEs) has risen significantly, which means that a big amount of philologically edited data is now available in a machine-readable form. One could try to apply computational approaches, in order to further study the linguistic data, the information about the textual transmission, etc. contained in multiple TEI-encoded digital editions. The problem is that retrieving philological data through different TEI-encoded SDEs is not that simple.

Every TEI-encoded edition has its own markup model, designed to respond to the philological and editorial requirements of that particular edition. Hence, it is difficult to share a markup model between various editions.

A possible way to bridge multiple digital editions, despite their different markup solutions, is to map them onto a same model that is able to represent SDEs on a more abstract level.

This kind of mapping would be particularly useful when the markup solutions are more prone to various interpretations or more ambiguous. The TEI guidelines, for example, show how the @type attribute can be used with the <rdg> element to distinguish between different types of variants. However, every edition may have its own set of possible values, beyond “orthographic” and “substantive”, to markup a wider range of phenomena of the textual transmission.

To build a model capable of representing different editions is a challenging task, for “scholarly practice in representing critical editions differs widely across disciplines, time periods, and languages.” However there is common ground that can be used to model scholarly editing: what the editor reads in the source/s, how the editor compares different sources and finally what the editor writes in the edition. Around these three concepts I am building a model that aims at making philological data more visible and easier to further elaborate.

Martignano-TEI and Scholarly Digital Editions-122.docx

ID: 117 / Poster Session: 4
Poster
Keywords: Spanish literature, Digital library, Services, TEI-Publisher

Between Data and Interface, Building a Digital Library for Spanish Chapbooks with TEI-Publisher

E. Leblanc, P. Jacsont

University of Geneva, France

This poster will present the project Untangling the cordel (2020-2023) and its experimentations with TEI-Publisher to develop a digital library (DL) that aims at studying and promoting the Geneva collection of Spanish chapbooks (Leblanc and Carta 2021).

Intended for a wide audience and sold in the streets, chapbooks recount fictitious or real events as well as songs, dramas, or religious writings. Although their contents are varied, they are characterised by their editorial form, i.e. small texts (4 to 8 pages), in in-quarto, arranged in columns and decorated with woodcuts. The interest in chapbooks ranges from literature to art and book history, sociology, linguistics, or musicology. This diversity reflects the hybridity of chapbooks, at the frontier between document, text, image, and orality (Botrel 2001; Gomis and Botrel 2019, 127–30).

An editorial workflow based on XML-TEI to display our corpus online was devised. After transcribing texts with HTR tools, they were 1) converted the transcriptions in XML-TEI via XSLT, 2) stored them in eXist-DB, and 3) published them with TEI-Publisher. Images of the documents are displayed with IIIF. Through this workflow, DL can offer services that stress different aspects of chapbooks.

Working with TEI-Publisher has influenced the way we think about our XML-TEI model. If the choices we have made are mainly driven by data, it appears that part of them have been influenced by the functionalities we wanted to implement, such as the addition of image links or keywords. Thus, our ODD reflects not only the nature of our documents but also the DL services. In this context, the use of TEI-Publisher invites us to reconsider a strict distinction between “data over interface” and “interface over data” (Dillen 2018), as data and interface are here mutually influenced.

Leblanc-Between Data and Interface, Building a Digital Library-117.docx

ID: 157 / Poster Session: 5
Poster
Keywords: software, editors, oxygen, frameworks, annotations

oXbytei and oXbytao. A Stack of Configurable oXygen Frameworks

C. Lück

Universität Münster, Germany

Until recently, adapting author mode frameworks for the oXygen XML

editor was rather limited. A framework was either a base framework

like TEI-C's *TEI P5* framework, or it was based on a base

framework. But since version 23.1+, the mechanism of *.framework files

for configuring frameworks is replaced/supplemented with extension

scripts. This allows us to design arbitrarly tall stacks of

frameworks, not only limited to height level 2. It's now possible to

design base and intermediate frameworks with common functions. Only a

thin layer is required for project specific needs.

oXbytei and oXbytao are such intermediate and higher level frameworks.

oXbytei is based on TEI-C's *TEI P5* framework. Its design idea is to get

as much of its configuration as possible from the TEI document's

header. E.g. depending on the variant encoding declared in the header

it produces a parallel segmentation, double end-point attached or

location referenced apparatus. Since not all information for setting

up the editor is available in the header, oXbytei comes with its own

XML configuration. It ships with Java classes for rather complex

actions. It has a plugin interface for aggregating and selecting

information either from local or remote norm data. It also offers actions

for generating anchor-based annotations, either with TEI `<span>` or

in RDF/OWL with OA.

oXbytao is a level 3 framework based on oXbytei. It offers common

actions that are more biased towards a certain kind of TEI usage,

e.g. for `<corr>` and `<choice>` or for encoding multiple recensions

of the same text within a single TEI document. It defines a template

directory on each oXygen project. CSS styles offer a collapsed and an

expanded view and optional views on the header or for editing through

form controls etc. All styles are fully customizable on a project

basis.

https://github.com/SCDH/oxbytei

https://github.com/SCDH/oxbytao

Lück-oXbytei and oXbytao A Stack of Configurable oXygen Frameworks-157.odt

ID: 141 / Poster Session: 6
Poster
Keywords: automation, validation, continuous integration, continuous deployment, error reports, quality control

Automatic Validation, Packaging and Deployment of TEI Documents. What Continuous Integration can do for us

C. Lück

Universität Münster, Germany

Keeping TEI documents in a Git repository is one way to store

data. However, Git does not only excel in robustness against data

losses, downtime of internet connections and enabling collaboration on

a TEI editions. Git-Servers also leverage automation of re-current

tasks: validation of all our TEI documents, generating human-readable

reports about their validity, assembling them in a data package, and

deploying it on a publication environment. These tasks can be

processed automatically in a continuous integration (CI) pipeline. In

software development, CI has established as a key to quality

assurance. It gets its strength from automation by running tests

*regularly* and *uniformly*. For obvious reasons, CI has been

transferred to quality assurance of research data (in life sciences)

by Cimiano et al. (2021). The poster presentation will be on a data

template for TEI editions, that runs the above listed tasks on a

Gitlab server or on Github and even generates and deploys an

Expath-Package on TEI Publisher:

https://github.com/scdh/edition-data-template-cx

The template extends the data template for TEI Publisher.[^1] It uses

Apache Maven as a pipeline driver because Maven only needs a

configuration file and thus enables us to keep our repository free of

software.[^2] It validates all TEI documents against common RNG and

schematron files. Jing's output is parsed and a human-readable report

is created and deployed on the Git-Server's publication environment

(e.g. gitlab pages). On successful validation, a XAR package is

assembled and deployed on a running TEI publisher instance.

References

Cimiano, Ph. et al. (2021): Studies in Analytic Reproducibility. The

Conquaire Project. U Bielefeld Press. doi: 10.4119/unibi/2942780

[^1]:

[https://github.com/eeditiones/tei-publisher-data-template](https://github.com/eeditiones/tei-publisher-data-template)

[^2]: Using GNU Make would not be portable and XProc lacks incremental

builds. Replacing Maven with Gradle is under development.

Lück-Automatic Validation, Packaging and Deployment of TEI Documents What Continuous Integration can do-141.odt

ID: 137 / Poster Session: 7
Poster
Keywords: braille, bibliography, accessibility, book history, publishing

Adapting TEI for Braille

E. Forget

University of Toronto, Canada

Bibliography as a field has undergone rapid changes to adapt for ever-evolving book formats in the digital age. Methods, tools, and techniques originally meant for manuscripts and printed books have now been adjusted to apply to ebooks (Rowberry 2017; Galey 2012 and 2021), audiobooks (Rubery 2011), and other bookish objects (Pressman 2020). However, there is much less work currently available that considers the bibliographical differences of accessible book formats or, more specifically, braille as a book format. Braille lends itself well to bibliographical analysis due to its tactile nature, but traditional bibliographical methods and tools were not developed with braille in mind and must be adapted to work with braille.

As part of a larger braille bibliography-focussed project, I am adapting TEI to work for analyzing braille books—specifically braille editions of illustrated children’s books. Illustrated books offer additional complexity to textual analysis that is compounded by the forced hierarchy of linear-text tools, and working with braille editions of illustrated books further complicates questions of hierarchy and format descriptions.

This poster will showcase the progress I have made so far in adapting TEI to work with braille, specifically using the multilingual prototype book as an example. The poster will touch on questions of textual hierarchy, line length/breaks, illustration descriptions, braille and format descriptions, and how languages are tagged, and it will include a wish list of TEI needs that I have not successfully adapted yet, as this is a work-in-progress project.

Forget-Adapting TEI for Braille-137.docx

ID: 135 / Poster Session: 8
Poster
Keywords: lexicography, Okinawan, endangered language, multiple writing systems, language revitalization

Okinawan Lexicography in TEI: Challenges for Multiple Writing Systems

S. Miyagawa¹, K. Kato², M. Zlazli³, S. Machida⁴, S. Carlino⁵

¹National Institute for Japanese and Linguistics (NINJAL), Japan; ²Tokyo University of Foreign Studies, Japan; ³SOAS University of London, UK; ⁴University of Hawaiʻi at Hilo, US; ⁵Kyushu University/Hitotsubashi University, Japan

Okinawan is classified as one of the Northern Ryukyuan languages in the Japonic language family. It is primarily spoken in the south and central parts of the Okinawa Island of the Ryukyu Archipelago. It was the official lingua franca of the Ryukyu Kingdom and a literary vehicle; e.g., the Omoro Soshi poetry collection, but currently an endangered language. Okinawan has been recorded in various written forms: A combination of Kanji logograms and Hiragana syllabary with archaic spellings (e.g. Omoro Soshi) or modern spelling variations to approximate actual pronunciation, pure Katakana syllabary (e.g., Bettelheim’s Bible translation), Latin alphabet (mostly by linguists), and pure Hiragana (popular).

The Okinawago Jiten (Okinawan Dictionary; OD), published by the National Institute for Japanese Language and Linguistics (NINJAL) in 1963 and revised in 2001[1], uses the Latin alphabet for each lexical entry. We first added the possible writing forms listed above to the data in CSV format. We then converted the CSV into TEI XML using Python. Figure 1 presents a sample encoding of the TEI file for each entry. Here, we solved the multiple writing forms with <orth> tags with corresponding writing systems in @xml:lang attribute following BCP47[2] (e.g., xml:lang=”ryu-Hira'' for Okinawan words written in Hiragana). We added the International Phonetic Alphabet (IPA) and the accent type to make the pronunciation clearer with the <pron> tags.

Fig. 1 TEI of each lexical entry

Using XSLT, we transformed this TEI file into a static webpage with a user-friendly GUI, as shown in Figure 2. It is anticipated that this digitization of OD and its publication under the open license will benefit key stakeholders, such as Okinawan heritage learners and worldwide Okinawan learners, being the largest Okinawan dictionary available online.

Fig. 2 Webpage rendition of TEI

Miyagawa-Okinawan Lexicography in TEI-135.docx

ID: 155 / Poster Session: 9
Poster
Keywords: 3D scholarly editions, annotation, <souceDoc>, Babylon.js

Text as Object: Encoding the data for 3D annotation in TEI

J. Ogawa¹, K. Nagasaki², I. Ohmukai³, Y. Nakamura³, A. Kitamoto¹

¹Center for Open Data in the Humanities, Japan; ²International Institute for Digital Humanities, Japan; ³University of Tokyo, Graduate School of Humanities and Sociology

This poster will present a way of representing the text on a 3D object and its annotations in TEI. Since the concept of 3D scholarly editions has recently been discussed in the field of Digital Humanities, we try to provide experimentally a practical method contributing to the realization of this concept.

Ogawa-Text as Object-155.docx

ID: 158 / Poster Session: 10
Poster
Keywords: Japanese text, close reading, interface, CETEIcean

Building Interfaces for East Asian/Japanese TEI data

K. Nagasaki¹, S. Nakamura², K. Okada³

¹International Institute for Digital Humanities, Japan; ²Historiographical Insutite, The University of Tokyo; ³Hokkai Gakuen University

Over the past several years, East Asian/Japanese (henceforth, EAJ) TEI data have been created in various fields. Under the situation, one of the issues the authors have been working on is constructing an easy-to-use interface. In this presentation, we will report the activity.

Nagasaki-Building Interfaces for East AsianJapanese TEI data-158.docx

ID: 170 / Poster Session: 11
Poster
Keywords: Natural Language Processing, Explainable AI, Computing, Social Media, Hate Speech

Explainable Supervised Models for Bias Mitigation in Hate Speech Detection: African American English

A. Gabriel, M. Sinclair

Northumbria University

Automated hate speech detection systems have great potential in the realm of social media but have seen their success limited in practice due to their unreliability and inexplicability. Two major obstacles they have yet to overcome is their tendency to underperform when faced with non-standard forms of English and a general lack of transparency in their decision-making process. These issues result in users of low-resource languages (those that have limited data available for training) such as African-American English being flagged for hate speech at a higher rate than users of mainstream English. The cause of the performance disparity in these systems has been traced to multiple issues including social biases held by the human annotators employed to label training data, training data class imbalances caused by insufficient instances of low-resource language text and a lack of sensitivity of machine learning (ML) models to contextual nuances between dialects. All these issues are further compounded by the ‘black-box’ nature of the complex deep learning models used in these systems. This research proposes to consolidate seemingly unrelated recently developed methods in machine learning to resolve the issue of bias and lack of transparency in automated hate speech detection. The research will utilize synthetic text generation to produce a theoretically unlimited amount of low-resource language text training data, machine translation to overcome annotation conflicts caused by contextual nuances between dialects and explainable ML (including integrated gradients and instance-level explanation by simplification). We will attempt to show that when repurposed and integrated into a single system these methods can both significantly reduce bias in hate speech detection tasks whilst also providing interpretable explanations of the system’s decision-making process.

Gabriel-Explainable Supervised Models for Bias Mitigation-170.docx

ID: 105 / Poster Session: 12
Poster
Keywords: manucript studies, palaeography, IIIF, cataloguing

A TEI/IIIF Structure for Adding Palaeographic Examples to Catalogue Entries

S. M. Winslow

University of Graz, Austria

The study of palaeography generally relies on either expert testimony with sparse examples or separate, specialist catalogues imaging and documenting the specific characteristics of each hand. Both practices presumably made much more sense due to the cost, difficulty, and space used by printed catalogues in the past, but with modern practice in cataloguing manuscripts via TEI and disseminating images via IIIF, these difficulties have been largely obviated. Accordingly, it is desirable to have a simple, consistent, and searachable way to embed examples of manuscript hands within the TEI, as a companion to elements from msdescription that describe hand features. This poster will demonstrate a simple and re-useable structure for embedding information about the palaeography of manuscript hands in msdescription and associating it with character examples using IIIF. An example implementation, part of the Hidden Treasures from the Syriac Manuscript Heritage project, will be demonstrated and an ODD containing the new elements and strcuture will be made available.

ID: 123 / Poster Session: 13
Poster
Keywords: Digital edition, projects, cooperation, digital texts, infrastructure

From facsimile to online representation. The Centre for Digital Editions in Darmstadt. An Introduction

K. Fischer, S. Kalmer, D. Kampkaspar, S. Müller, M. Scheffer, M. E.-H. Seltmann, K. Wunsch

University and State Library Darmstadt, Germany

The Centre for Digital Editions in Darmstadt (CEiD) covers all aspects of preparing texts for digital scholarly editions from planning to publication. It not only processes the library's own holdings, but also partners with external institutions.

Workflow

After applying various methods for text recognition (OCR/HTR) the output is used as a starting point for the realisation of the digital edition as an online publication. In addition, a variety of transformation tools is used to convert texts from different formats such as XML, JSON, WORD-DOCX or PDF into a wide range of TEI-based formats (TEI Consortium 2022), thus enabling uniformity across different projects. These texts can be annotated and enriched with metadata. Furthermore, entities can be marked up, which are managed in a central index file. This workflow is not static, but can be adapted according to the needs of the project.

Framework

The XML files are stored in eXist-db (eXist Solutions 2021) and presented in various user-friendly ways with the help of the framework wdbplus (Kampkaspar 2018). By default, the corresponding scan and the transcribed text are presented side by side. Additionally, different forms of presentation are available so that the special needs of individual projects can be taken into account. Further advantages of wdbplus are various APIs, which not only allow the retrieval of individual texts, but also of metadata and further information. Full-text search is realised at project level as well as across projects.

CEiD's portfolio includes several projects in which a multitude of texts are processed. The source material ranges from early modern prints and manuscripts to more recent texts and includes early constitutional texts, religious peace agreements, newspapers and handwritten love letters.

Fischer-From facsimile to online representation The Centre-123.docx

ID: 173 / Poster Session: 14
Poster
Keywords: Software Sustainability, Software Development, DH Communities

From Oxgarage to TEIGarage and MEIGarage

P. Stadler, A. Ferger, D. Röwenstrunk

Paderborn University, Germany

Poster proposal for presenting the history and future development of the OxGarage.

Stadler-From Oxgarage to TEIGarage and MEIGarage-173.odt

ID: 176 / Poster Session: 15
Poster
Keywords: marginalia, Old English, mise-en-page, sourceDoc, facsimile

Towards a digital documentary edition of CCCC41: The TEI and Marginalia-Bearing Manuscripts

P. O Connor

University of Oxford, United Kingdom

The specific aim of this case study is to demonstrate how the TEI Guidelines have transformed the representation of an important corollary of the medieval production process; the annotations, glosses and other textual evidence of an interactive engagement with the text. Cambridge, Corpus Christi College MS 41 (CCCC MS 41) best exemplifies the value of the TEI in this respect as this manuscript is noted for containing a remarkable record of textual engagement from early medieval England. CCCC MS 41 is an early-eleventh century manuscript witness of the vernacular translation of Bede’s Historia ecclesiastica, commonly referred to as the Old English Bede. However, in addition to preserving the earliest historical account of early medieval England, the margins of CCCC MS 41 contain numerous Old English and Latin texts. Of the 490 pages of CCCC MS 41, 108 pages contain marginal texts which span several genres of Old English and Latin literature; and thereby provide the potential for substantial evidence of interaction with the manuscript’s central text.

While the marginalia of CCCC MS 41 continue to excite scholarly attention, the representation of this vast body of textual engagement poses certain challenges to editors of print scholarly editions. This poster emphasises the importance of the transcription process in successfully conveying the mise-en-page of marginalia-bearing manuscripts and explains how adopting the <facsimile> or <sourceDoc> approach encourages further engagement with and a deeper understanding of CCCC MS 41’s marginalia.

O Connor-Towards a digital documentary edition of CCCC41-176.docx

ID: 169 / Poster Session: 16
Poster
Keywords: letters, America, France, transnational, networks

Transatlantic Networks - a Pilot: mapping the correspondence of David Bailie Warden (1772-1845)

J. Orr, S. Howard, J. Cummings

Newcastle University, United Kingdom

The scientific revolution of the nineteenth century is often seen as remediating the early modern republic of letters (Klancher) from the pens of learned individuals to learned Institutions. This project aims to map the transatlantic network of one of the most important hubs in the exchange of literary and scientific correspondence, David Bailie Warden (1772-1845). Warden is known as an Irish political asylum seeker, American diplomat, and respected Parisian scientific writer in his own right, authoring and collaborating in foundational statistical works on America, the burgeoning natural sciences, and anti-slavery. More importantly, his correspondence with at least 3000 individuals and learned institutions reframes our perspective on the scientific revolution, its historical context, and its everyday activities. In addition to traditional close reading methods, this project tests methods from the field of scientific network analysis to enable us to identify other important network nodes, enabling the process of continual discovery. This project seeks to compile not only a ‘who’s who’ of the intellectual community in this period but to identify previously hidden facilitative figures whose importance to the fabric of the republic of letters might not be at first obvious due to a range of marginalising factors including: social class, transnationality, gender, religion, or other liminal identities.

Date: Thursday, 15/Sept/2022

9:00am - 9:30am

Registration - Thursday

9:30am - 11:00am

Session 4A: Short-Papers
Location: ARMB: 2.98
Session Chair: Peter Stadler, Paderborn University

ID: 126 / Session 4A: 1
Short Paper
Keywords: digital texts, textual studies, born-digital, electronic literature

TEI and the Re-Encoding of Born-Digital and Multi-Format Texts

E. Forget, A. Galey

University of Toronto, Canada

What affordances can TEI encoding offer scholars who work with born-digital, multi-format, and other kinds of texts produced in today’s publishing environments, where the term “digitization” is almost redundant? How can we use TEI and other digitization tools to analyze materials that are already digital? How do we distinguish between a digital text’s multiple editions or formats and its paratexts, and what differences do born-digital texts make to our understanding of markup? Can TEI help with a situation such as the demise of Flash, where the deprecation of a format has left many works of electronic literature newly vulnerable — and, consequently, newly visible as historical artifacts?

These questions take us beyond descriptive metadata and back to digital markup’s origins in electronic typesetting, but also point us toward recent work on electronic literature, digital ephemera, and the textual artifacts of the very recent past (e.g. those described in recent work by Matthew Kirschenbaum, Dennis Tenen, and Richard Hughes Gibson). Drawing from textual studies, publishing studies, book history, disability studies, and game studies, we are experimenting with the re-encoding of born-digital materials, using TEI to encode details of the texts’ form and function as digital media objects. In some cases, we are working from a single digital source, and in others we are working with digital editions of materials that are available in multiple analogue and digital formats. Drawing on our initial encoding and modelling experiments, this paper explores the affordances of using TEI and modelling for born-digital and multi-format textual objects, particularly emerging digital book formats. We reconsider what the term “data” entails when one’s materials are born-digital, and the implications for digital preservation practice and the emerging field of format theory.

Forget-TEI and the Re-Encoding of Born-Digital and Multi-Format Texts-126.docx

ID: 107 / Session 4A: 2
Short Paper
Keywords: online forum, thread structure, social media, computer mediated communication

Capturing the Thread Structure: A Modification of CMC-Core to Account for Characteristics of Online Forums

S. Reimann, L. Rodenhausen, F. Elwert, T. Scheffler

Ruhr-University Bochum, Germany

Representing computer mediated communication (CMC), such as discussions in online forums, according to the guidelines of the Text Encoding Initiative was addressed by the CMC Special Interest Group (SIG). Their latest schema, CMC-core, presents a basic way of representing a wide range of different types of CMC in TEI P5. However, this schema has a general aim and is not specifically tailored for capturing the thread structure of online forums.

In particular, CMC-core is organized centrally by the time stamp of posts (a timeline structure), whereas online forums often split into threads and subthreads, giving less importance to the time of posting. In addition, forums may contain quotes from external sources as well as from other forum posts, which need to be differentiated in an adapted <quote> element. Not only do online forums as a whole differ from other forms of CMC, but there are often also considerable differences between individual online forums. We created a corpus of posts from various religious online forums, including different communities on Reddit, as well as two German forums which specifically focus on the topic of religion, with the purpose of analyzing their structure and textual content. These forums differ in the way threads are structured, how emoticons and emojis are used, and how people are able to react to other posts (for example by voting).

This raises the need for a schema which on the one hand takes the features of online forums as a genre into account, and, on the other hand, is flexible enough to enable the representation of a wide range of different online forums. We present some modifications of the elements in CMC-core in order to guarantee a standardized representation of three substantially different online forums while retaining all their potentially interesting microstructural characteristics.

Reimann-Capturing the Thread Structure-107.docx

ID: 111 / Session 4A: 3
Short Paper
Keywords: digital publications, VRE, open access, scholarly communication, web publication

Publishing the grammateus research output with the TEI : how our scholarly texts become data

E. Nury

University of Geneva, Switzerland

The TEI is not exclusively used to encode primary sources: TEI-based scholarly publishing represents a non-negligible portion of TEI-encoded texts (Baillot and Giovacchini 2019). I present here how the encoding of secondary sources such as scholarly texts can benefit researchers, with the example of the grammateus project.

In the grammateus project, we are creating a Virtual Research Environment to present a new way of classifying Greek documentary papyri. This environment comprises a database of papyri, marked up with the standard EpiDoc subset of the TEI. It includes as well the textual research output from the project, such as introductory materials, detailed descriptions of papyri by type, and an explanation on the methodology of the classification. The textual research output was deliberately prepared as an online publication so as to fully take advantage of the interactivity with data offered by a web application, in contrast to a printed book. We are thus experimenting with a new model of scholarly writing and publishing.

In this short paper I will describe how we have used the TEI not only for modeling papyrological data, but also for the encoding of scholarly texts produced in the context of the project, which would have traditionally been material for a monograph or academic articles. I will also demonstrate how this has enabled us later on to enrich our texts with markup for features that have emerged as relevant. We implemented a spiraling encoding process in which methodological documentation and analytical descriptions keep feeding back the editorial encoding of the scholarly texts. Documentation and analytical text therefore become data, within a research process based on a feedback method.

Nury-Publishing the grammateus research output with the TEI-111.docx

ID: 153 / Session 4A: 4
Short Paper
Keywords: HTR, Transkribus, Citizen Science

Handwritten Text Recognition for heterogeneous collections? The Use Case Gruß & Kuss

S. Büdenbender¹, M. Seltmann², J. Baum¹

¹University of Applied Sciences Darmstadt (h_da), Germany; ²University and State Library Darmstadt, Germany

Gruß & Kuss – Briefe digital. Bürger*innen erhalten Liebesbriefe – a research project funded by BMBF for 36 months – aims to digitize and explore love letters from ordinary persons with the help of dedicated volunteers, also raising the question of how citizens can actively participate in the indexing and encoding of textual sources.

To present, transcriptions are made manually in Transkribus (lite), tackling a corpus consisting of more than 22,000 letters from 52 countries and 345 donators, divided into approximately 750 bundles (i.e., correspondences between usually two writers). The oldest letter dates from 1715, the most recent from 2021, using a very broad concept of letter and including, for instance, notes left on pillows or WhatsApp messages.

The paper investigates the applicability of Handwritten Text Recognition (HTR) to this highly heterogeneous stock in a citizen science context. In an explorative approach, we will investigate at which scope of a bundle, respectively at which number of pages of the same handwriting, HTR becomes worthwhile.

For this purpose, the effort of a manual transcription is first compared to the effort of a model creation in Transkribus (in particular the creation of a training and validation set by double keying), including final corrections. In a second step, we will explore whether a modification of the procedure can be used to process even smaller bundles. Based on given metadata (time of origin, gender, script ...) a first clustering can be created, and existing models can be used as a basis for graphemically similar handwritings, allowing training sets to be kept much smaller while maintaining acceptable error rates. Another possibility is to start off with mixed training sets covering a class of related scripts.

Furthermore, we discuss how manual transcription by citizen scientists can be quantified in relation to the project’s overall resources.

Büdenbender-Handwritten Text Recognition for heterogeneous collections The Use Case Gruß & Kuss-153.docx

9:30am - 11:00am

Session 4B: Long Papers
Location: ARMB: 2.16
Session Chair: Elisa Beshero-Bondar, Penn State Behrend

ID: 138 / Session 4B: 1
Long Paper
Keywords: IPIF, Prosopography, Personography, Linked Open Data

From TEI Personography to IPIF data

R. W. J. Hadden¹, G. Vogeler^2,1

¹Austrian Academy of Sciences, Austria; ²University of Graz, Austria

The International Prosopography Interchange Format (IPIF) is an open API and data model for prosopographical data interchange, access, querying and merging, using a regularised format. This paper discusses the challenges for converting TEI personographies into the IPIF format, and more general questions of using the TEI for so-called 'factoid' prospographies.

Hadden-From TEI Personography to IPIF data-138.docx

ID: 147 / Session 4B: 2
Long Paper
Keywords: data modeling, information retrieval, data processing, digital philology, digital editions

TEI as Data: Escaping the Visualization Trap

R. Rosselli Del Turco¹, E. Magnanti², G. Cerretini³

¹Università di Torino, Italy; ²University of Vienna, Austria; ³Università di Pisa, Italy

During the last few years, the TEI Guidelines and schemas have continued growing in terms of capability and expressive power. A well-encoded TEI document constitutes a small treasure trove of textual data that could be queried to quickly derive information of different types. However, access to such data is mainly intended for visualization purposes in many edition browsing tools, e.g. EVT (http://evt.labcd.unipi.it/). Such an approach seems to be hardly compatible with the strategy of setting up databases to query this data, thus leading to a splitting of environments: DSEs to browse edition texts versus databases to perform powerful and sophisticated queries. It would be interesting to expand the capabilities of EVT, and possibly other tools, adding functionalities which would allow them to process TEI documents to answer complex user queries. This requires both an investigation to define the text model in terms of TEI elements and a subsequent implementation of the desired functionality, to be tested on a suitable TEI project that can adequately represent the text model.

The Anglo-Saxon Chronicle stands out as an ideal environment to test such a method. The wealth of information that it records about early medieval England makes it the optimal footing upon which to enhance computational methods for textual criticism, knowledge extraction and data modeling for primary sources. The application of such a method could here prove essential to assist the retrieval of knowledge otherwise difficult to extract from a text that survives in multiple versions. Bridging together, cross-searching and querying information dispersed in all the witnesses of the tradition would allow us to broaden our understanding of the Chronicle in unprecedented ways. Interconnecting the management of a wide spectrum of named entities and realia—which is one of the greatest assets of TEI—with the representation of historical events would make it possible to gain new knowledge about the past. Most importantly, it would lay the groundwork for a Digital Scholarly Edition of the Anglo-Saxon Chronicle, a project never undertaken so far.

Therefore, we decided to implement a new functionality capable of extracting and processing a greater amount of information by cross-referencing various types of TEI/XML-encoded data. We developed a TypeScript library to outline and expose a series of APIs allowing the user to perform complex queries on the TEI document. Besides the cross referencing of people, places and events as hinted above—on the basis of standard TEI elements such as <listPerson>/<person>, <listPlace>/<place>, <listEvent>/<event> etc.—we plan to support ontology-based queries, defining the relationships between different entities by means of RDF-like triples. In a similar way, it will be possible to query textual variants recorded in the critical apparatus by typology and witness distribution. This library will be integrated in EVT to interface directly with its existing data structures, but it is not limited to it. We are currently working on designing a dedicated GUI within EVT to make the query system intuitive and user-friendly.

Rosselli Del Turco-TEI as Data-147.docx

ID: 120 / Session 4B: 3
Long Paper
Keywords: linked data, conversion, reconciliation, software development

LINCS’ Linked Workflow: Creating CIDOC-CRM from TEI

C. Crompton, H. Zafar, A. Defours

University of Ottawa, Canada

TEI data is so often carefully curated without any of the noise and error common to algorithmically created data, that it is a perfect candidate for linked data creation; however, while most small TEI projects boast clean beautifully crafted data, linked data creation is often out of reach both technically and financially for these project teams. This paper reports (following where others have tread ) on the Networked Cultural Scholarship project (LINCS) workflow, mappings, and tools for creating linked data from TEI resources.

The process of creating linked data is far from straightforward since TEI is by nature hierarchical, taking its meaning from the deep nesting of elements. Any one element in TEI may be drawing its meaning from its relationship to a grandparent well up the tree (for example a persName appearing inside a listPerson inside the teiHeader is more likely to be a canonical reference to a person than a persName whose parent is a paragraph). Furthermore, the meaning of TEI elements are not always well-represented in existing ontologies and the time and money required to represented TEI-based information about people, places, time, and cultural production as linked data is out of reach of many small projects.

This paper introduces the LINCS workflow for creating linked data from TEI. We will introduce the named entity recognition and reconciliation service, NSSI (pronounced nessy), and its integration into a TEI-friendly vetting interface, Leaf Writer. Following NSSI reconciliation, Leaf Writer users can download their TEI with the entity uris in idno elements for their own use. If they wish to contribute to LINCS, they may proceed to enter the TEI document they have exported from Leaf Writer into XTriples, a customized version of Mainz’s Digitale Akademie’s XTriples tool of the same name, which converts TEI to CIDOC-CRM for either private use, or for integration into the LINCS repository. We have adopted the XTriples tool because it meets the needs of a very common type of TEI user: the director or team member of a project who is not going to be able to learn the intricacies of CIDOC-CRM, or indeed perhaps not even of linked data principles, but would still like to contribute their data to LINCS. That said, we are keen to get the feedback of the expert users of the TEI community on our workflow, CIDOC-CRM mapping, and tools.

Bodard, Gabriel, Hugh Cayless, Pietro Liuzzo, Chiara Cenati, Alison Cooley, Tom Elliott, Silvia Evangelisti, Achille Felicetti, et al. “Modeling Epigraphy with an Ontology.” Zenodo, March 26, 2021.

Ciotti, Fabio. “A Formal Ontology for the Text Encoding Initiative.” Umanistica Digitale, vol. 2, no. 3, 2018.

Eide, Ø., and C. Ore. “From TEI to a CIDOC-CRM Conforming Model: Towards a Better Integration Between Text Collections and Other Sources of Cultural Historical Documentation.” Digital Humanities, 2007.

Ore, Christian-Emil, and Øyvind Eide. “TEI and Cultural Heritage Ontologies: Exchange of Information?” Literary and Linguistic Computing, vol. 24, no. 2, 2009, pp. 161–72., https://doi.org/10.1093/llc/fqp010.

Crompton-LINCS’ Linked Workflow-120.docx

11:00am - 11:30am

Thursday Morning Refreshment Break
Location: ARMB: King's Hall

11:30am - 1:00pm

Session 5A: Long Papers
Location: ARMB: 2.98
Session Chair: Dario Kampkaspar, Universitäts- und Landesbibliothek Darmstadt

ID: 159 / Session 5A: 1
Long Paper
Keywords: TEI XML, Handwritten Text Recognition, HTR, Libraries

Evolving Hands: HTR and TEI Workflows for cultural institutions

J. Cummings¹, D. Jakacki², I. Johnson¹, C. Pirmann², A. Healey¹, V. Flex¹, E. Jeffrey¹

¹Newcastle University, United Kingdom; ²Bucknell University, USA

This Long Paper will look at the work of the Evolving Hands project which is undertaking three case studies ranging across document forms to demonstrate how TEI-based HTR workflows can be iteratively incorporated into curation. These range from: 19th-20th century handwritten letters and diaries from the UNESCO Gertrude Bell Archive, 18th century German, 20th century French correspondence, and a range of printed materials from the 19th century onward in English and French. A joint case study converts legacy printed material of the Records of Early English Drama (REED) project. By covering a wide variety of periods and document forms the project has a real opportunity here to foster responsible and responsive support for cultural institutions.

See Uploaded Abstract for more information

Cummings-Evolving Hands-159.docx

ID: 109 / Session 5A: 2
Long Paper
Keywords: TEI, text extraction, linguistic annotation, digital edition, mass digitisation

Between automatic and manual encoding: towards a generic TEI model for historical prints and manuscripts

A. Pinche¹, K. Christensen², S. Gabay³

¹Ecole nationale des chartes | PSL (France); ²INRIA (France); ³Université de Genève (Switzerland)

Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source’s information as possible. To take full advantage of textual documents, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images’ content on a large scale. The TEI seems to provide the perfect format to capture both an image’s formal and textual data (Janès et al. 2021). However, this poses a problem. To ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated. But a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with various situations.

The solution proposed by the Gallic(orpor)a project attempted to deal with such a contradiction, focusing on French historical documents produced between the 15th and the 18th c. It aims to enrich the digital facsimiles distributed by the French National Library in two different ways:

• text extraction, including the segmentation of the image (layout analysis) with SegmOnto (Gabay, Camps, et al. 2021) and the recognition of the text (Handwritten Text Recognition) augmenting already existing models (Pinche and Clérice, 2021);

• linguistic annotation, including lemmatisation, POS tagging (Gabay, Clérice, et al. 2020), named entity recognition and linguistic normalisation (Bawden et al. 2022).

Our TEI document modelling has two strictly coercive automatically generated data blocks:

• the <sourceDoc> with information from the digital facsimile, which computer vision, HTR and segmentation tools produce thanks to machine learning (Scheithauer et al. 2021);

• the <standOff> (Bartz et al. 2021a) with linguistic information produced by natural language processing tools (Gabay, Suarez, et al. 2022) to make it easier to search the corpus (Bartz et al. 2021b).

Two other elements are added that can be customised according to researchers’ specific needs:

• a pre-filled <teiHeader> with basic bibliographic metadata automatically retrieved from (i) the digital facsimile’s IIIF Image API and (ii) the BnF’s Search/Retrieve via URL (SRU) API. The <teiHeader> can be enriched with additional data, as long as it respects a strict minimum encoding;

• a pre-editorialised <body>. It is the only element totally free regarding encoding choices.

By restricting certain elements and allowing others to be customisable, our TEI model can efficiently pivot toward other export formats, including RDF and IIIF. Furthermore, the <sourceDoc> element’s strict and thorough encoding of all of the document’s graphical information allows the TEI document to be converted into PAGE XML and ALTO XML files, which can then be used to train OCR, HTR, and segmentation models. Thus, not only does our TEI model’s strict encoding avoid limiting philological choices, thanks to the <body>, it also allows us to pre-editorialise the <body> via the content of the <sourceDoc> and, in a near future, the <standOff>.

Pinche-Between automatic and manual encoding-109.odt

ID: 128 / Session 5A: 3
Long Paper
Keywords: NER, HTR, Correspondence, Digital Scholarly Edition

Dehmel Digital: Pipelines, text as data, and editorial interventions at the distance

D. Maus¹, J. Nantke², S. Bläß², M. Flüh²

¹State and University Library Hamburg, Germany; ²University of Hamburg

Ida and Richard Dehmel were a famous, internationally well-connected artist couple around 1900. The correspondence of the Dehmels, which has been comprehensively preserved in approx. 35,000 documents, has so far remained largely unexplored in the Dehmel Archive of the State and University Library Hamburg. The main reason for this is the quantity of the material that makes it difficult to explore the material using traditional methods of scholarly editing. However, the corpus is relevant for future research precisely because of its size and variety. It not only contains many letters from important personalities from the arts and culture of the turn of the century, but also documents personal relationships, main topics as well as forms and ways of communication in the cultural life of Germany and Europe before the First World War on a large scale.

The project Dehmel digital seeks out to close this gap by creating a digital scholarly edition of the Dehmels’ correspondence that addresses the quantitative aspects with a combination of state-of-the-art machine learning approaches, namely handwritten text recognition (HTR) and named entity recognition (NER). At the heart of the project is a scalable pipeline that integrates automated and semi-automated text/data processing tasks. In our paper we will introduce and discuss the main steps: 1. Importing the result of HTR from Transkribus and OCR4all, 2. Applying a trained NER model; 3. Disambiguating entities and referencing authority records with OpenRefine; 4. Publishing data and metadata to a Linked Open Data web service. Our main focus will be on the pipeline itself, the “glue” that ties together well-established tools (Transkribus, OCR4All, Stanford Core NLP, OpenRefine), our use of TEI to encode relevant information and the special challenges we observe when using text as data, i.e. combining automated and semi-automated processes with the desire of editorial interventions.

Maus-Dehmel Digital-128.docx

11:30am - 1:00pm

Session 5B: Panel - Manuscript catalogues as data for research
Location: ARMB: 2.16
Session Chair: Katarzyna Anna Kapitan, University of Oxford

ID: 144 / Session 5B: 1
Panel
Keywords: Manuscripts, Provenance, Research, Clustering, Linked Data

Manuscript catalogues as data for research

H. E. Jones¹, Y. Faghihi¹, M. Holford², T. Schaßan³, T. Burrows², K. A. Kapitan², N. K. Yavuz⁴

¹Cambridge University, United Kingdom; ²University of Oxford; ³Herzog August Bibliothek; ⁴University of Leeds

Manuscript catalogues present problems and opportunities for researchers, not least the status of manuscript descriptions as both information about texts and texts in themselves. In this panel, we will present three recent projects which have used manuscript catalogues as data for research, and which raise general questions in text encoding, in manuscript studies and in data-driven digital humanities. This will be followed by a panel discussion to further investigate issues and questions raised by the papers.

1. Investigating the Origins of Islamicate Manuscripts Using Computational Methods (Yasmin Faghihi and Huw Jones):

This project evaluated computational methods for the generation of new information about the origins of manuscripts from existing catalogue data. The dataset was the Fihrist Union Catalogue of Manuscripts from the Islamicate World. We derived a set of codicological features from the TEI data, clustered together manuscripts sharing features, and used dated/placed manuscripts to generate hypotheses about the provenance of other manuscripts in the clusters. We aimed to establish a set of base criteria for the dating/placing of manuscripts, to investigate methods of enriching existing datasets with inferred data to form the basis of further research, and to engage critically with the research cycle in relation to computational methods in the humanities.

2. Re-thinking the <provenance> element in TEI Manuscript Description to support graph database transformations (Toby Burrows and Matthew Holford):

This paper reports on the transformation of the Bodleian Library’s online medieval manuscripts catalogue, based on the “Manuscript Description” section of the TEI Guidelines, into RDF graphs using the CIDOC-CRM and FRBROO ontologies. This work was carried out in the context of two Linked Open Data projects: Oxford Linked Open Data and Mapping Manuscript Migrations.

One area of particular focus was the provenance data relating to these manuscripts, which proved challenging to transform effectively from TEI to RDF. An important output from the MMM project was a set of recommendations for re-thinking the structure and encoding of the TEI <provenance> element to enable more effective reuse of the data in graph database environments. These recommendations draw on concepts previously outlined by Ore and Eide (2009), but also take into account the parallel work being done in the art museum and gallery community.

3. The use of TEI in the Handschriftenportal (Torsten Schaßan)

The national manuscript portal for Germany in the making, the Handschriftenportal, is built on TEI encoded data. These include representations for manuscripts, descriptions that have been imported, authority data, and OCR-generated catalogues. In the future, it will be possible to enter descriptions directly into the backend database.

The structure of the descriptive data shall be adopted according to the latest developments in manuscript studies, e.g. the risen importance of material aspects, or the alignment of the description of texts and illuminations.

Especially the latter, the data to be entered in the future, poses several issues to the TEI encoding as currently defined in the Guidelines. This comprises the overall structure of the main components of a description, as well as needs on a more detailed level.

Bios

Dr Toby Burrows is a Digital Humanities researcher at the University of Oxford and the University of Western Australia. His research focuses on the history of cultural heritage collections, and especially medieval and Renaissance manuscripts.

Yasmin Faghihi is Head of the Near and Middle Eastern Department at Cambridge University Library. She is the editor of FIHRIST, the online union catalogue for manuscripts from the Islamicate world.

Matthew Holford is Tolkien Curator of Medieval Manuscripts at the Bodleian Library, Oxford. He has a long-standing research interest in the use of TEI for the description and cataloguing of Western medieval manuscripts.

Huw Jones is Head of the Digital Library at Cambridge University Library, and Director of CDH Labs at Cambridge Digital Humanities. His work spans many aspects of collections-driven digital humanities, from creating and making collections available to their use in a research and teaching context.

Torsten Schaßan is member of the Manuscripts and Special Collections department of the Herzog August Bibliothek Wolfenbüttel. He was involved in many manuscript digitisation and cataloguing projects. In the Handschriftenportal project he is responsible for the definition of schemata and all transformations of data for import into the portal.

Chair: Dr Katarzyna Anna Kapitan is manuscript scholar and digital humanist specialising in Old Norse literature and culture. Currently she is Junior Research Fellow at Linacre College, University of Oxford, where she works on a digital book-historical project, “Virtual Library of Torfæus”, funded by the Carlsberg Foundation.

Respondent: Dr N. Kıvılcım Yavuz works at the intersection of medieval studies and digital humanities, with an expertise in medieval historiography and European manuscript culture. She is especially interested in digitisation of manuscripts as cultural heritage items and creation, collection and interpretation of data and metadata in the context of digital repositories.

Jones-Manuscript catalogues as data for research-144.docx

1:00pm - 2:30pm

Thursday Lunch Break
Location: ARMB: King's Hall

2:30pm - 4:00pm

Session 6A: An Interview With ... Lou Burnard
Location: ARMB: 2.98
Session Chair: Diane Jakacki, Bucknell University

An interview session: a short statement piece followed by interview questions, then audience questions.

4:00pm - 4:30pm

Thursday Afternoon Refreshment Break
Location: ARMB: King's Hall

4:30pm - 6:00pm

TEI Annual General Meeting - All Welcome
Location: ARMB: 2.98
Session Chair: Diane Jakacki, Bucknell University

Date: Friday, 16/Sept/2022

9:00am - 9:30am

Registration - Friday

9:30am - 11:00am

Session 7A: Short Papers
Location: ARMB: 2.98
Session Chair: Patricia O Connor, University of Oxford

ID: 118 / Session 7A: 1
Short Paper
Keywords: Spanish literature, Digital library, TEI-Publisher, facsimile, sourceDoc

Encoding Complex Structures: The Case of a Gospel Spanish Chapbook

E. Leblanc, P. Jacsont

University of Geneva, France

The project Untangling the cordel seeks to study and revalue a corpus of Spanish chapbooks dating from the 19th century by creating a digital library (Leblanc and Carta 2021). This corpus of chapbooks, also called pliegos de cordel, is highly heterogeneous in its content and editorial formats, giving rise to multiple reflections on its encoding.

In this short paper, we would like to share our feedback and thoughts on the XML-TEI encoding of a Gospel pliego for its integration into TEI-Publisher.

This pliego is an in-4° containing 16 small columns with extracts from the Four Gospels (John's prologue, Annunciation, Nativity, Mark's finale and the passion according to John; i.e. the same extracts as those in the book of hours (Join-Lambert 2016)) duplicated on both sides. This printout had to be cut in half and then folded to obtain two identical sets of excerpts from the Four Gospels. Whoever acquires it appropriates the object for private devotions or protection: it is therefore not an object kept for reading (the text is written in Latin with small letters) but for apotropaic or curative use (Botrel 2021).

To put forward the interest of this pliego as a devotional object and not strictly as a textual object required much reflection concerning its encoding and its publication on our digital library. Indeed, depending on our choice of encoding, the information conveyed differs: should we favour a diplomatic and formal edition or an encoding that follows the reading?

To determine which encoding would be the most suitable, we decided to test two encoding solutions, one with <facsimile> and another with <sourceDoc>. The visualisation of the two encodings possibilities on TEI-Publisher will allow us to develop the advantages and disadvantages of each method.

ID: 124 / Session 7A: 2
Short Paper
Keywords: Digital Scholarly Edition, Dictionary, Linguistics, Manuscript

Annotating a historical manuscript as a linguistic resource

H.-J. Döhla³, H. Klöter², M. Scholger¹, E. Steiner¹

¹University of Graz; ²Humboldt-Universität zu Berlin; ³Universität Tübingen

The Bocabulario de lengua sangleya por las letraz de el A.B.C. is a historical Chinese-Spanish dictionary held by the British Library (Add ms. 25.317), probably written in 1617. It consists of 223 double-sided folios with about 1400 alphabetically arranged Hokkien Chinese lemmas in the Roman alphabet.

The contribution will introduce our considerations on how to extract and annotate linguistic data from the historical manuscript and the design of a digital scholarly edition (DSE) in order to answer research questions in the fields of linguistics, missionary linguistics and migration (Klöter/Döhla 2022).

ID: 163 / Session 7A: 3
Short Paper
Keywords: text mining, topic modeling, digital scholarly editions, data modeling, data integration

How to Represent Topic Models in Digital Scholarly Editions

U. Henny-Krahmer¹, F. Neuber²

¹University of Rostock, Germany; ²Berlin-Brandenburgische Akademie der Wissenschaften, Germany

Topic modeling (Blei et al. 2003, Blei 2012) as a quantitative text analysis method is not part of the classic editing workflow as it stands for a way of working with text that in many respects contrasts with critical editing. However, for the purpose of a thematic classification of documents, topic modeling can be a useful enhancement to an editorial project. It has the potential to replace the cumbersome manual work that is needed to represent and structure large edition corpora thematically, as has been done for instance in the projects Alfred Escher Briefedition (Jung 2022), Jean Paul – Sämtliche Briefe digital (Miller et al. 2018) or the edition humboldt digital (Ette 2016).

We apply topic modeling to two edition corpora of correspondence of the German-language authors Jean Paul (1763-1825) and Uwe Johnson (1934-1984), compiled at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) and the University of Rostock (Miller et al. 2018, Helbig et al. 2017). In our contribution, we discuss how the results of the topic modeling can be usefully integrated into digital editions. We propose to integrate them into the TEI corpora on three levels: (1) the topic model of a corpus, including the topic words and the parameters of its creation, is modeled as a taxonomy in a separate TEI file, (2) the relevance of the topics for individual documents is expressed in the text classification section of the TEI header of each document in the corpus, and (3) the assignment of individual words in a document to topics is expressed by links from word tokens to the corresponding topic in the taxonomy. Following a TEI encoding workflow as outlined above allows for developing digital editions that include topic modeling as an integral part of their user interface.

ID: 119 / Session 7A: 4
Short Paper
Keywords: Odyssey, heroines, prosopography, women

Analyzing the Catalogue of Heroines through Text Encoding

R. Milio

Bucknell University, United States of America

The Catalogue of Heroines (Odyssey 11.225-330) presents a corpus of prominent mythological women as Odysseus recounts the stories of each woman he encounters in the Underworld. I undertook a TEI close reading of the Catalogue in order to center ancient women in a discussion of the Odyssey and determine how the relationships between the heroines contribute to the Catalogue’s overall purpose. In this short paper I demonstrate first my process: developing my own detailed feminist translation of the Catalogue, applying a TEI close reading to both my translation and the original ancient Greek, and creating a customized schema to best suit my purposes. Then, I detail my analysis of my close reading using cross-language encoding and a prosopography I developed through that reading, which reveals complex connections, both explicit and implied, among characters of the Catalogue. Third, I present the result of this analysis: that through this act of close reading I identified a heretofore unconsidered list of objects within the Catalogue and then demonstrated how these four objects of the Catalogue, ζώνη (girdle), βρόχοs (noose), ἕδνα (bride-price), and χρυσὸs (gold), reveal the ancient Greek stigma surrounding women, sexuality, and fidelity. These objects clearly allude to negative perceptions of women in ancient Greek society and through these objects the Catalogue of Heroines reminds its audience of Odysseus’ concerns regarding the faithfulness of his wife Penelope. Ultimately, by applying and adapting a TEI close reading, I identified patterns within the text that spoke to a greater purpose for the Catalogue and the Odyssey overall, that was able to export for further analysis of this prosopographical data. By the time of the conference, I will be able to present data visualizations that provide pathways that can assist other classicists to center women in ancient texts.

9:30am - 11:00am

Session 7B: Long Papers
Location: ARMB: 2.16
Session Chair: Gimena del Rio Riande, CONICET

ID: 132 / Session 7B: 1
Long Paper
Keywords: TEI, born-digital heritage, retrocomputing, digitality, materiality

Is it still data? Scholarly Editing of Text from Early Born-Digital Heritage

T. Roeder

Universität Würzburg, Germany

Digital heritage is strongly bound to original devices and displays. Even in today’s standardized environments, text can change its appearance depending on the monitor technology, on the processing software, and on the available fonts on the system: Text as data depends much on technical interpretation.

Creating a scholarly digital edition from born-digital heritage, expecially text, needs to consider the original conditions, like encoding and hardware standards. My question is: Are the encoding guidelines of the TEI suitable for representing born-digital text? How much information is required about the original environment? Can a screenshot serve as facsimile, or it is neccessary to link to emulated states of the display software?

To give an example, I will present a preliminary scholarly TEI-based digital edition of “disk magazines”. These magazines were a special type of periodical that was published mostly on floppy disk mainly in the 1980s and 1990s. Created by home computer enthusiasts for the community, disk magazines are potentially valuable as a historical resource to study the experiences of programmers, users and gamers in the early stage of microcomputing.

In the examples (one of them is available at https://diskmags.github.io/md_87-11.html), the digital texts are decompressed byte sequences of PETSCII code, which is only partially compatible to ASCII. The appearance of the characters could be changed completely by the programmer to display foreign characters or alphabets. Further, it depended on a 40x25 characters layout, where text had to be aligned manually by inserting whitespaces. The once born-digital text – as data – is transformed into readable text – as image – on a screen. The example demonstrated that the connection between textual data and textual display can be very fragile.

For TEI encoding, this would have some consequences. On the one side, there would be a requirement to preserve as much of the original data as possible. On the other side, a scholarly edition needs to represent the semantics of the visible document. It would require an interpretative layer to communicate between these two levels, which could be implemented by different markup strategies; however it needs to be discussed whether classes like “att.global.rendition” are actually suited for this. It also needs to be discussed in which way a digital document (or which instance of it: as stored data, as memory state, as display?) can be interpreted in the same way as a material document – and which implications this would have for TEI encoding of born-digital heritage.

Roeder-Is it still data Scholarly Editing of Text from Early Born-Digital Heritage-132.odt

ID: 152 / Session 7B: 2
Long Paper
Keywords: publishing, LOD, TEI infrastructure

Using Citation Structures

H. Cayless

Duke University, United States of America

This paper is really a follow-up to one I gave at Balisage in 2021.[1] Citation Structures are a TEI feature introduced in version 4.2.0 of the Guidelines, which provide an alternative (and more easily machine-processable) method for declaring their internal structures.[2] This mechanism is important because of the heterogeneity of texts and consequently of the TEI structures used to model them. This heterogeneity necessarily means it is difficult for any system publishing collections of TEI editions to treat their presentation consistently. For example, a citation like “1.2” might mean “poem 1, line 2” in one edition, and “book 1, chapter 2” in another. It might be perfectly sensible to split an edition into chapters, or even small sections, for presentation online, but not at all to split a poem into lines (though maybe groups of lines might be desirable). A publication system otherwise will have to rely on assumptions and guesswork about the items in its purview, and may fail to cope with new material that does not behave as it expects. Worse, there is no guarantee that the internal structures of editions are consistent within themselves. We might consider, for example, Ovid’s ‘Tristia’, in which the primary organizational structure is book, poem, line, but book two is a single, long poem.

Citation structures permit a level of flexibility hard to manage otherwise, by allowing both nested structures and alternative structures at every level. In addition, a key new feature of citation structures over the older reference declaration methods is the ability to attach information that may be used by a processing system to each structural level. The <citeData> element which makes this possible will allow, for example, a structural level to indicate what name it should be given in a table of contents, or even whether or not it should appear in such a feature.

I will discuss the mechanics of creating and using citation structures. Finally, I will present a working system in XSLT that can exploit <citeStructure> declarations to produce tables of contents, split large documents into substructures for presentation on the web, and resolve canonical citations to parts of an edition.

1. https://www.balisage.net/Proceedings/vol26/html/Cayless01/BalisageVol26-Cayless01.html

2. See https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CORS6 and https://tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACRCS.

ID: 160 / Session 7B: 3
Long Paper
Keywords: Manuscript cataloguing, semantic markup, retro-conversion vs. born-digital

Text between data and metadata: An examination of input types and usage of TEI encoded texts

T. Schaßan

Herzog August Bibliothek Wolfenbüttel, Germany

Many texts that have been encoded using the TEI in the past are retro-converted from printed sources: manuscript catalogues and dictionaries are examples for highly structured texts, drama, verse, and performance texts are usually less structured, editions appear somewhere inbetween.

Many of the text types for which the TEI offers specialised elements represent both metadata and data, according to the scenarios in which these texts are used.

In the field of manuscript cataloguing, it has been a question for a long time whether the msdescription module is sufficient for the representation of a retro-converted text of a formerly printed catalogue. One may argue, that a catalogue is first of all a visually structured text, a succession of paragraphs, whose semantics are only loosely connected to the main elements the TEI defines, such as <msContents>, <physDesc>, or <msPart>. On the other hand, on a sub-paragraph level, the TEI offers structures, which may not be align-able with the actual text of the catalogue so that the person who carries out the retro-conversion has to decide whether to change the text according to the TEI schema rules or encode the text semantically wrong or structure the text with much less semantic information as it would be possible.

Now, that the TEI is more and more used to store these kind of texts as born-digitals, the questions is whether the structures offered by the TEI meet all the needs the texts and their authors might have in different scenarios: Is a TEI-encoded text of a given kind equally useful for all search and computational uses, as well as publishing needs? Are the TEI structures flexible enough or do they privilege some uses over others? How much of the semantic information is encoded in the text and how much of it might be realised only in the processing of the sources?

In this paper, manuscript catalogues serve as an example for the more general question about what structures, how much markup and what kind of markup is needed in the time of powerful search engines and artificial intelligence, authority files and the Linked Open Data.

Schaßan-Text between data and metadata-160.odt

11:00am - 11:30am

Friday Morning Refreshment Break
Location: ARMB: King's Hall

11:30am - 1:00pm

Session 8A: Long Papers
Location: ARMB: 2.98
Session Chair: Meaghan Brown, Independent Scholar

ID: 102 / Session 8A: 1
Long Paper
Keywords: medieval studies; medieval literature; xforms; manuscript; codicology

Codex as Corpus : Using TEI to unlock a 14th-century collection of Old French short texts

S. Dows-Miller

University of Oxford, United Kingdom

Medieval manuscript collections of short texts are, in a sense, materially discrete corpora, offering data that can help scholarship understand the circumstances of their composition and early readership.

This paper will discuss the role played by TEI in an ongoing mixed-method study into a fourteenth-century manuscript written in Old French: Bibliothèque nationale de France, fonds français, 24432. The aim of the project has been to display how fruitful the combination of traditional and more data-driven approaches can be in the holistic study of individual manuscripts.

TEI has been critical to the project so far, and has enabled discoveries about the manuscript which have eluded less technologically enabled generations of scholarship. For example, quantitative analysis of scribal abbreviation, made possible through the manuscript’s encoding, has illuminated the contributions of a number of individuals in the production of the codex. Similarly, analysis of the people and places mentioned in the texts allows for greater localisation of the manuscript than was previously considered possible.

As with any project of this nature, the process of encoding BnF fr. 24432 in TEI has not been without difficulty, and so this paper will also discuss the ways in which attempts have been made to streamline the process through automation and UI tools, most notably in the case of this project through the use of XForms.

Dows-Miller-Codex as Corpus-102.docx

ID: 149 / Session 8A: 2
Long Paper
Keywords: ODD, ODD chaining, RELAX NG, schema, XSLT Stylesheets

atop: another TEI ODD processor

S. Bauman¹, H. Bermúdez Sabel², M. Holmes³, D. Maus⁴

¹Northeastern University, United States of America; ²University of Neuchâtel, Switzerland; ³University of Victoria, Canada; ⁴State and University Library Hamburg, Germany

TEI is, among other things, a schema. That schema is written in and customized with the TEI schema language system, ODD. ODD is defined by Chapter 22 of the _Guidelines_, and is also used to _define_ TEI P5. It can also be used to define non-TEI markup languages. The TEI supports a set of stylesheets (called, somewhat unimaginatively, “the Stylesheets”) that, among other things, convert ODD definitions of markup languages (including TEI P5) and customizations thereof into schema languages like RELAX NG and XSD that one can use to validate XML documents.

Holmes and Bauman have been fantasizing for years about re-writing those Stylesheets from scratch. Spurred by Maus’ comment of 2021-03-23[1] Holmes presented a paper last year describing the problems with the current Stylesheets and, in essence, arguing that they should be re-written.[2] Within a few months the TEI Technical Council had charged Bauman with creating a Task Force for the purpose of creating, from scratch, an ODD processor that reads in one or more TEI ODD customization files, merges them with a TEI language (likely, but not necessarily, TEI P5 itself), and generates RELAX NG and Schematron schemas. It is worth noting that this is a distinctly narrower scope than the current Stylesheets,[3] which, in theory, convert most any TEI into any of a variety of formats including DocBook, MS Word, OpenOffice Writer, MarkDown, ePub, LaTeX, PDF, and XSL-FO (and half of those formats into TEI); and convert a TEI ODD customization file into RELAX NG, DTD, XML Schema, ISO Schematron, and HTML documentation. A different group is working on the conversion of a customization ODD into customized documentation using TEIPublisher.[4]

The Task Force, which began meeting in April, comprises the authors. We meet weekly, with the intent of making slow, steady progress. Our main goals are that the deliverables be a utility that can be easily run on GNU/Linux, MacOS, or within oXygen, and that they be programs that can be easily maintained by any programmer knowledgeable about TEI ODD, XSLT, and ant. Of course we also want the program to work properly. Thus we are generating test suites and performing unit testing (with XSpec[5]) as we go, rather than creating tests as an afterthought. We have also developed naming and other coding conventions for ourselves and written constraints (mostly in Schematron) to help enforce them. So, e.g., all XSLT variables must start with the letter ‘v’, and all internal parameters must start with the letter ‘p’ or letters “tp” for tunnel parameters.

We are trying to tackle this enormous project in a sensible, piecemeal approach. We have (conceptually) completely separated the task of assembling one or more customization ODDs with a source ODD into a derived ODD from the task of converting the derived ODD into RELAX NG, and from converting the derived ODD into Schematron. In order to make testing-as-we-go easier, we are starting with the derived ODD→RELAX NG process, and expect to demonstrate some working code at the presentation.

Bauman-atop another TEI ODD processor-149.odt

11:30am - 1:00pm

Session 8B: Demonstrations
Location: ARMB: 2.16
Session Chair: Tiago Sousa Garcia, Newcastle University

ID: 114 / Session 8B: 1
Demonstration
Keywords: Digital Humanities Critical Editions Tools IIIF

Transcribing Primary Sources using FairCopy and IIIF

N. Laiacona

Performant Software Solutions LLC, United States of America

FairCopy is a simple and powerful tool for reading, transcribing, and encoding primary sources using the TEI Guidelines. FairCopy can import IIIF manifests as a starting point for transcription. Users can then highlight zones on each surface and link them to the transcription. FairCopy exports valid TEI-XML which is linked back to the original IIIF endpoints. In this demonstration, we will demonstrate the IIIF functionality in FairCopy and then take a look at the exported TEI-XML and how it provides a consistent interface to images as well as the original IIIF manifest.

Laiacona-Transcribing Primary Sources using FairCopy and IIIF-114.docx

ID: 133 / Session 8B: 2
Demonstration
Keywords: Digital publishing, TEI processing, static sites, programming

Adapting CETEIcean for static site building with React and Gatsby

R. Viglianti

University of Maryland, United States of America

The JavaScript library CETEIcean, written by Hugh Cayless and Raff Viglianti, relies on the DOM processing of web browsers and HTML5 Custom Elements to publish TEI documents as a component pluggable into any HTML structure. This makes it possible to publish and lightly transform TEI documents directly in the user’s browser, doing away with complex server-side infrastructure for TEI publishing. However, CETEIcean provides a fairly bare-bones API for a fully-fledged TEI publishing solution and, without some additional considerations, TEI documents rendered with CETEIcean can be invisible to search engines.

This demonstration will showcase an adaptation of the CETEIcean algorithm as a plugin for the static site generator Gatsby, which relies on the popular framework React for building user interfaces. Two plugins will be shown:

gatsby-transformer-ceteicean (https://www.gatsbyjs.com/plugins/gatsby-transformer-ceteicean/) prepares XML to be registered as HTML5 Custom Elements. It also allows users to apply custom NodeJS transformations before and after processing.

gatsby-theme-ceteicean (https://www.npmjs.com/package/gatsby-theme-ceteicean) implements HTML5 Custom Elements for XML publishing, particularly with TEI. It re-implements parts of CETEIcean excluding behaviors; instead, users can define React components to customize the behavior of specific TEI elements.

The demonstration will show examples from the Scholarly Editing journal (https://scholarlyediting.org), which published TEI-based small-scale editions with these tools alongside other essay-like content.

Viglianti-Adapting CETEIcean for static site building with React and Gatsby-133.docx

ID: 167 / Session 8B: 3
Demonstration
Keywords: TEI, Translation, crowdsourcing

Spec Translator: Enabling translation of TEI Specifications

H. Cayless

Duke University, United States of America

This demonstration will introduce Spec Translator, available from https://translate.tei-c.org/ which enables users to submit pull requests for translations of specification pages from the TEI Guidelines.

ID: 168 / Session 8B: 4
Demonstration
Keywords: TEI, RDF, Online Editors

LEAF-Writer: a TEI + RDF online XML editor

D. Jakacki¹, S. Brown², J. Cummings³

¹Bucknell University, United States of America; ²University of Guelph, Canada; ³Newcastle University, UK

LEAF-Writer is an open-source, open-access Extensible Markup Language (XML) editor that runs in a web browser and offers scholars and students a rich textual editing experience without the need to download, install, and configure proprietary software, pay ongoing subscription fees, or learn complex coding languages. This user-friendly editing environment incorporates Text Encoding Initiative (TEI) and Resource Description Framework (RDF) standards, meaning that texts edited in LEAF-Writer are interoperable with other texts produced by the scholarly editing community and with other materials produced for the Semantic Web. LEAF-Writer is particularly valuable for pedagogical purposes, allowing instructors to teach students best practices for encoding texts without also having to teach students how to code in XML directly. LEAF-Writer is designed to help bridge the gap by providing access to all who want to engage in new and important forms of textual production, analysis, and discovery. LEAF-Writer draws on TEI All as well as other TEI-C-supplied schemas, can use project-specific customized schemas, and offers continuous validation against supported and declared schemas. LEAF-Writer allows users to access and synchronize their documents in GitHub and GitLab, as well as to upload and save documents from their desktop. This prsentation will demonstrate the variety of funcationality and affordances of LEAF-Writer.

1:00pm - 2:30pm

Friday Lunch Break
Location: ARMB: King's Hall

2:30pm - 4:00pm

Closing Keynote: Emmanuel Ngue Um, 'Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages'
Location: ARMB: 2.98
Session Chair: Martina Scholger, University of Graz

With Closing Remarks, Dr James Cummings, Local TEI2022 Conference Organiser

ID: 166 / Closing Keynote: 1
Invited Keynote

Tone as “Noiseless Data”: Insight from Niger-Congo Tone Languages

E. Ngue Um

University of Yaoundé 1 & University of Bertoua (Cameroon), Cameroon

Text processing assumes two layers of textual data: a "noisy" layer and a "noiseless" layer. The “noisy” layer is generally considered unsuitable for analysis and is eliminated at the pre-processing stage. In current Natural Language Processing (NLP) technologies like text generation in machine translation, the representation of tones as diacritical symbols in the orthography of Niger-Congo languages leads to these symbols being pre-processed as “noisy” data. As an illustration, none of the 15 Niger-Congo tone languages modules available on Google Translate delivers in a systematic and consistent manner, text data that contains linguistic information encoded through tone melody.

The Text Encoding Initiative (TEI) is a framework which can be used to circumvent the “noisiness” brought about by diacritical tone symbols in the processing of text data of Niger-Congo languages.

In novel work, I propose a markup scheme for tone that encompasses:

a) The markup of tone units within an <m> (morpheme) element; this aims to capture the functional properties of tone units, just like segmental morphemes.

b) The markup of tonal characters (diacritical symbols) within a <g> (glyph) element and the representation of the pitch by hexadecimal data representing the Unicode character code for that pitch; this aims to capture tone marks as autonomous symbols, in contrast with their combining layout when represented as diacritics.

c) The markup of downstep and upstep within an <accid> (accidental) element mirroring musical accidentals such as “sharp” and “flat”; this aims to capture strictly melodic properties of tone on a separate annotation tier.

The objectives of tone encoding within the TEI framework are threefold:

a) To harness quantitative research on tone in Niger-Congo languages.

b) To leverage “clean” language data of Niger-Congo languages that can be used more efficiently in machine learning tasks for tone generation in textual data.

c) To gain better insights into the orthography of tone in Niger-Congo languages.

In this paper, I will show how this novel perspective to the annotation of tone can be applied productively, using a corpus of language data stemming from 120 Niger-Congo languages.

Ngue Um-Tone as “Noiseless Data”-166.pdf

4:00pm - 5:30pm

Closing Keynote Reception
Location: ARMB: King's Hall

Date: Thursday, 22/Sept/2022

1:00pm - 2:00pm

Virtual Poster Session on Gather.Town
Virtual location: Gather.Town
Session Chair: Martina Scholger, University of Graz

The link to gather.town https://app.gather.town/app/DVLCOOcP1lTL5Zkh/TEI2022 will only work at the time of the session.

ID: 142 / Virtual Poster Session: 1
Virtual Poster
Keywords: digital edition, text processing, data management, tei publisher, ocr

From Archives to TEI Publisher: Digital Edition of German Work Regulations in the Project 'Non-state Law of the Economy'

P. Solonets

Max Planck Institute for Legal History and Legal Theory, Germany

The aim of the project 'Non-state Law of the Economy' is to build a digital collection of primary sources, showing the normative world of industrial relations in the German metal industry of the 19th and 20th centuries. This collection includes various types of textual documents: collective and individual agreements, internal work regulations, rental contracts of company flats, company health or pension insurance, etc. The focus of my poster is on the life cycle of the textual data inside our project, in particular, on showing what stages our data comes through from archives to publication on TEI Publisher. I am sure that this poster could become a valuable contribution to the discussion about effective textual data workflow strategies among the TEI community.

The core of the project’s approach to handling of the sources is a wide use of computer-assisted processing and various digital humanities tools and methods. The current workflow begins in archives, where the textual data is collected with the help of our researcher’s smartphones and a scan tent. As a next step an open-source OCR software is applied to the scans obtained earlier and the output in form of XMLPages format is produced, which requires a subsequent XSLT transformation to a standard TEI XML. Once a TEI-compliant document is produced, it is possible to begin with a basic structural annotation of the text. After that the text goes to the correction, if necessary (in case there are some unclear spots where the OCR programme failed to recognise characters successfully) and then finally to TEI Publisher Platform, where it is subsequently annotated and enriched with more specific meta data.

At the present moment there are around 50 sources published at our instance of TEI Publisher and around 300 documents being processed at different stages.

Solonets-From Archives to TEI Publisher-142.docx

ID: 174 / Virtual Poster Session: 2
Virtual Poster
Keywords: feature structures, character analysis, theater, personography, Alsatian

Feature structures for character social variable annotation and an application to Alsatian theater

P. Ruiz Fabo¹, H. Bermúdez Sabel²

¹Université de Strasbourg, France; ²Université de Neuchâtel, Switzerland

Several works address the computational treatment of dramatic characters. Zöllner-Weber (2008, 2011) presents a character analysis ontology. Galleron (2017) developed a characteriseme (characterization unit) taxonomy based on character lists in French theater between 1630 and 1810, formalized as a TEI feature structure (FS) library (see Romary, 2015). Following Phelan (1989), the taxonomy includes mimetic features, which give characters traits assimilating them to humans, and synthetic ones, describing their role in the plot.

We believe that characteriseme analysis using a common annotation schema can help comparative drama analysis. We successfully adapted Galleron’s FS approach to model characters in a different language (Alsatian) and period (1870-1940). This can help compare the Alsatian tradition to the hegemonic literatures surrounding it (German and French), one of the goals towards which our ongoing MeThAL project contributes (Ruiz et al., 2022).

The poster’s contributions:

- A character feature (characteriseme) taxonomy using feature structures, inspired by Galleron (2017) but providing an improved, more modular implementation, and enabling the description of more recent drama

- A TEI personography where each of our corpus’ 2386 characters is described according to the feature structure

- First characterization analyses in the corpus based on it

Intermediate levels were added in our FS to better group mimetic features into basic traits (age, gender, origin, language), socioeconomic traits (profession, class) and relation-position traits (where a character stands in personal or professional relations, e.g. spouse or manager). Controlled vocabularies were added, including a list of ca. 350 professions and a taxonomy of socioprofessional groups. Personography compliance was ensured with a schema automatically derived from the FS System Declaration (Bermúdez, 2019). The annotations have yielded insight into how female characters are characterized differently by female authors (increased reference to character’s profession) vs. male ones. An interface to navigate the corpus based on the annotations was created.

Ruiz Fabo-Feature structures for character social variable annotation and an application-174.odt

ID: 172 / Virtual Poster Session: 3
Virtual Poster
Keywords: language, script, typography, multilingualization

Multilingualism and multiscriptism in TEI publishing: DH2022

Y. Wang^1,2, K. Nagasaki¹, I. Ohmukai³, M. Shimoda³

¹International Institute for Digital Humanities, Japan; ²Graduate school of the University of Tokyo; ³The University of Tokyo

In the conference DH2022 Tokyo held in this July, the book of abstracts has been published entirely through the XSL-FO pipeline based on ADHO’s DHConvalidator and TEI to PDF Book Creator. The texts of the abstracts were converted by each author with DHConvalidator, which generates a format not always expected by the original TEI to PDF Book Creator. Moreover, while the text body is mostly written in English and some other languages which are accepted in CFP, it also embraces a large number of words and phrases in various Asian languages, reflecting the theme of the conference and authors’ regional backgrounds. Thus, we needed to adapt the original stylesheets to multi-script typography by a large expansion of linguistic and typographical templates as well as extensive annotation.

Our modification involves extraction and annotation of Asian language fragments in TEI documents, locale-oriented typeface differentiation, adjustment for typographical conventions, and mixed script typesetting. We will share our methodology and decisions we applied to the actual book, hoping that it serves as a case study that leads to dissemination of attention to, and better practices in, non-Latin and/or multi-script publication in the TEI community.

Wang-Multilingualism and multiscriptism in TEI publishing-172.docx

ID: 156 / Virtual Poster Session: 4
Virtual Poster
Keywords: Japanese script, East Asian texts, character variation, hentaigana, digital editions

Celebrating Deviation: Encoding Variant Japanese Phonetic Characters known as Hentaigana

H. McGaughey

Hosei University, Japan

In digitalizing the manuscript heritage of secret writings by the Japanese 15th century actor, playwright, producer, and teacher Zeami, I am including the premodern script variations known as hentaigana now available in Unicode. Hentaigana are variant hiragana, phonetic characters that are used to write various Japanese grammatical and function words and often ruby, for which the TEI released elements last year. In 2017, Unicode formally added 285 hentaigana characters in their Kana Supplement and Kana Extended-A code charts. These alternatives are fluid or cursive abbreviations of various “parent characters,” phonetically used Chinese characters (kanji), with varying patterns and degrees of simplification. In encoding both the modern, standardized hiragana and hentaigana, this project makes the manuscripts more accessible to learners of Japanese premodern script. Comparisons of the variants in different text witnesses using such markup might be useful for future analyses of text genealogy.

In this poster, I will present my methods for systematically including hentaigana developed while transcribing manuscript witnesses of Zeami’s writings and explain my inclusion of the “old character forms” (kyūjitai) of kanji using similar markup. I will furthermore share initial orthographical explorations of texts encoded thus far and consider methods for sharing the project with digitally savvy users, noh theater experts with no IT background, and educated non-specialists.

McGaughey-Celebrating Deviation-156.docx

ID: 171 / Virtual Poster Session: 5
Virtual Poster
Keywords: alliteration, XML-TEI, encoding, poetry, translation

Theoretical and practical challenges of automatically identifying and encoding alliteration in texts written in Italian

A. Consalvi¹, S. Fumagalli²

¹Sapienza University of Rome; ²Università Cattolica del Sacro Cuore, Milan

In our proposed presentation, we would like to display the theoretical and practical challenges posed by the creation of a program aimed at automatically identifying and encoding alliteration in texts written in Italian. Furthermore, a reflection on the possibilities offered by the analysis of the above mentioned phenomenon will be presented: from examining the style of a poet to determining if and how this literary device is preserved in translation.

On the theoretical level, alliteration is generally defined as a literary device consisting of the repetition of sounds at the beginning of adjacent words (cf. Beltrami, 2011). But what kind of sounds are we talking about? How long should they be? And what do we mean by ‘adjacent’? These are all crucial interrogatives that must be dealt with at the very beginning of any investigation on alliteration, especially considering that scholars (cf. Valesio, 1967; Lausberg, 1969; Menichetti, 1993; Mortara Garavelli, 1997; Ellero and Redisori, 2001; Ghiazza and Napoli, 2007; Mortara Garavelli, 2010; Arduini and Damiani, 2010; Lavezzi, 2017; Motta, 2020) tend to offer slightly different definitions.

On the practical level, once the rule-based program is created for Italian, it can be easily adapted to languages with a high degree of correspondence between graphemes and phonemes. Given a TXT file, the program is likely to be able to automatically identify the above-mentioned phenomenon. However, in this case the demanding task is the encoding process: a thorough reflection is needed to find a proper way to define an XML-TEI tag that contains all the important information such as the repeated sound and the number of words involved.

Consalvi-Theoretical and practical challenges of automatically identifying and encoding alliteration-171.odt