Conference Agenda

Overview and details of the sessions of this conference, with abstracts and downloads where available.

 
 
Session Overview
Date: Wednesday, 14/Sept/2022
9:00am - 9:30am Registration - Wednesday
9:30am - 11:00am Session 1A: Short Papers
Location: ARMB: 2.98
Session Chair: Martin Holmes, University of Victoria
 
ID: 140 / Session 1A: 1
Short Paper
Keywords: text mining, stand-off annotations, models of text, generic services

Standoff-Tools. Generic services for building automatic annotation pipelines around existing tools for plain text analysis

C. Lück

Universität Münster, Germany

TEI XML excels at encoding text. But when it comes to machine-based analysis of a corpus as data, XML is not a good platform. NLP, NER, topic modelling, text-reuse detection etc. work on plain text; they get very complicated and slow if they have to traverse a tree structure. While extracting plain text from XML is simple, feeding the results back into XML is tricky. Yet having the analysis in XML is desirable: its results can be related to the internal markup, e.g. for overviews of names per chapter, ellipses per verse, etc. In my short paper I will introduce standoff-tools, a suite of generic tools for building (automatic) annotation pipelines around plain-text tools.

standoff-tools implement the extractor *E* and the internalizer *I*. *E* produces a special flavour of plain text, which I term *equidistant plain text*: the XML tags are replaced by special characters, e.g. the zero-width non-joiner U+200C, so that all non-special characters have the same character offset as in the XML source. This equidistant plain text can then be fed to an arbitrary tagger *T* designed for plain text. Its only requirement is to produce positioning information. *I* inserts tags based on positioning information into the XML. For this purpose, it splits the annotated spans of text so that the result is syntactically valid XML without overlapping edges, and it aggregates the splits back together with `@next` and `@from`. Optionally, a shrinker *S* removes the special characters from the output of *E* and also produces a map of character positions. This map is applied by a corrector *C* to the positioning information produced by the tagger *T*.

The internalizer can also be used to internalize stand-off markup produced manually with CATMA, GNU Emacs standoff-mode, etc. into syntactically correct XML.
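A minimal Python sketch of the equidistant-plain-text idea described above (not part of standoff-tools itself; the tag-blanking regex and all names are simplifying assumptions):

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, used as the placeholder character

def extract_equidistant(xml_source: str) -> str:
    """Overwrite every markup construct with placeholder characters of the
    same length, so that all text characters keep their original offsets."""
    def blank(match: re.Match) -> str:
        return ZWNJ * len(match.group(0))
    # naive tag pattern for the sketch; a real implementation would use an XML parser
    return re.sub(r"<[^>]*>", blank, xml_source)

xml = '<p>Call me <persName>Ishmael</persName>.</p>'
plain = extract_equidistant(xml)
assert len(plain) == len(xml)

# A plain-text tagger working on `plain` reports character offsets that are
# directly valid in the XML source, which is what the internalizer relies on.
start = plain.index("Ishmael")
assert xml[start:start + len("Ishmael")] == "Ishmael"
```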



ID: 103 / Session 1A: 2
Short Paper
Keywords: TEI, indexes, XQuery

TEI Automatic Enriched List of Names (TAELN): An XQuery-based Open Source Solution for the Automatic Creation of Indexes from TEI and RDF Data

G. Fernandez Riva

Universität Heidelberg, Germany

The annotation of names of persons, places or organizations is a common feature of TEI editions. One way of identifying the annotated individuals is through the use of IDs from authority records such as Geonames, Wikidata or the GND.

In this paper I will introduce an open-source tool written in XQuery that enables the creation of TEI indexes using a very flexible custom templating language. The TEI Automatic Enriched List of Names (TAELN) uses the IDs from one authority file to create a custom index (model.listLike) with information from one or more RDF endpoints.

TAELN has been developed for the edition of the diaries and travel journals of Albrecht Dürer and his family. People, places and artworks are identified with GND numbers in the TEI edition. The indexes generated with TAELN include some information from GND records, but mostly from duerer.online, a virtual research portal created with WissKI (https://wiss-ki.eu/), which offers an RDF endpoint.

TAELN relies on an XML template to indicate how to retrieve information from the different endpoints and how to structure the desired TEI output. The templates use a straightforward but flexible syntax. A simple use case is depicted in the following example, which retrieves the person's name from the GND and the occupation from WissKI (which relies on the so-called »Pathbuilder syntax«).

<person>
  <persName origin="gnd">preferredNameForThePerson</persName>
  <occupation origin="wisski">ecrm:E21_Person -> ecrm:P11i_participated_in -> wvz:WV7_Occupation -> ecrm:P3_has_note</occupation>
</person>

Much more complex outputs can be achieved. TAELN offers editions an out-of-the-box solution for generating TEI indexes by gathering information from different endpoints; it only requires creating the corresponding template and knowing how to apply an XQuery transformation. The tool will be published shortly before the date of the TEI conference.



ID: 151 / Session 1A: 3
Short Paper
Keywords: manuscripts, codicology, paleography, XForms

manuForma – A Web Tool for Cataloging Manuscript Data

M. de Molière

University of Munich, Germany

The team of the ERC-funded project "MAJLIS – The Transformation of Jewish Literature in Arabic in the Islamicate World" at the University of Munich needed a software solution for describing manuscripts in TEI that would be easy to learn for non-specialists. After about one year of development, manuForma provides our manuscript catalogers with an accessible platform for entering their data. Users can choose elements and attributes from a list, add them to their catalog file and rearrange them with a mouse click. While manuForma does not spare our catalogers the need to learn the fundamentals of TEI, the restrictions imposed by the forms-based approach enhance both TEI conformance and the uniformity of our catalog records. Moreover, our tool eliminates the need to install commercial XML editors on the machine of each and every project member tasked with describing manuscripts. Instead, our tool offers a web interface for the entire editorial process.

At its heart, manuForma uses XForms, modified to allow adding, moving and deleting elements and attributes. A tightly knit schema file controls which elements and attributes can be added, and in which situations, to ensure conformance with the project's scholarly objectives. As an eXist-db application, manuForma integrates well with other apps that provide the front end to the manuscript catalog. TEI records can be stored on and retrieved from GitHub, tying the efforts of the entire team together. The web solution is adaptable to other entities by writing a dedicated schema and template file. Moreover, manuForma will be available under an open-source licence.

 
9:30am - 11:00am Session 1B: Long Papers
Location: ARMB: 2.16
Session Chair: Syd Bauman, Northeastern University
 
ID: 139 / Session 1B: 1
Long Paper
Keywords: intertextuality, bibliography, interface development, customization

Texts All the Way Down: The Intertextual Networks Project

S. Connell, A. Clark

Northeastern University, United States of America

In 2016, the Women Writers Project (WWP) began a new research project on the multivalent ways that early women writers engaged with literate culture, at the center of which were systemic enhancements to a longstanding TEI corpus. The WWP’s flagship publication, Women Writers Online (WWO), collects approximately 450 works from the sixteenth to the nineteenth centuries, a watershed period in which women’s participation in the authorship and consumption of texts expanded dramatically. With generous funding from the National Endowment for the Humanities, we used WWO’s TEI encoding to jumpstart the creation of a standalone bibliography containing and linking to all the works referenced in WWO. This bibliography currently includes 3,431 book-level entries; 942 entries that are parts of larger works, such as individual essays or poems; and 126 simple bibliographic entries (e.g. books of the Bible). The bibliography identifies the genre of each work and the gender of the author, where known. We also expanded WWO’s custom TEI markup in order to say more about “intertextual gestures”—or WWO authors’ engagement with other works—which include not only named titles and quotations but also textual remix, adaptation, and parody. By the end of the grant period, we had identified 11,787 quotations, 5,692 titles, 4,825 biblical references, and 1,968 other bibliographic references, linking the individual instances within the WWO texts to the relevant bibliography entries.

Now, the WWP has published “Women Writers: Intertextual Networks” (https://wwp.northeastern.edu/intertextual-networks), a web interface built on these two sources of rich TEI data: the bibliography and WWO’s newly refined intertextual gestures. In this paper we will discuss the challenge of turning dense, textually-embedded data into an interface. Though the encoded texts themselves can stand alone as complete documents, we built Intertextual Networks with a focus on connective tissue, using faceting and linkages to invite curiosity about how authors and works are in conversation with each other. As the numbers above suggest, this project attempts to enable investigations at scale, but we have also sought to draw out the local, even individual, ways that our writers engaged with other texts and authors. Thus, the interface includes visualizations that show overall patterns of usage (for example, the kinds of intertextual gestures employed by each author), but it also allows the reader to view the complete text of each gesture, reading through quotations, named titles, citations, and so on in full, with filtering and faceting to support exploration of this language.

An important challenge for this project has been to build an interface that can address the multidirectional levels of textual imbrication at stake, allowing researchers to examine patterns among both referenced and referencing texts. This paper will share some key insights for TEI projects seeking to undertake similar markup expansion and interface development initiatives. We will discuss strategies for modeling, enabling discovery, and revealing complex layers of textual data and textuality among not only a primary corpus but also a related collection of texts.



ID: 116 / Session 1B: 2
Long Paper
Keywords: sex, gender, TEI Guidelines, document data, theory

Revising Sex and Gender in the TEI Guidelines

E. Beshero-Bondar1, R. Viglianti2, H. Bermúdez Sabel3, J. Jenstad4

1Penn State Behrend, United States of America; 2University of Maryland, United States of America; 3University of Neuchâtel, Switzerland; 4University of Victoria, Canada

In Spring 2022, the co-authors collaborated in a TEI Technical Council subgroup to introduce a long-awaited <gender> element and attribute. In the process, we wrote new language for the TEI Guidelines on how to approach these concepts. As we submit this abstract, our proposed changes are under review by the Council for introduction in the next release of the TEI Guidelines, slated for October 2022. We wish to discuss this work with the TEI community to validate and address

* the history of the Guidelines' representation of these concepts,

* applications of the new encoding, and

* the extent to which the new specifications preserve backwards compatibility.

We must recognize as digital humanists and textual scholars that coding sex and gender as true "data" from texts significantly risks categorical determinism and normative cultural bias (Sedgwick 1990, 27+). Nevertheless, we believe that the TEI community is well prepared to encounter these risks with diligent study and expertise on the cultures that produce the textual objects being encoded, in that TEI projects are theoretical in their deliberate efforts to model document data (Ramsay and Rockwell 2012). We seek to encourage TEI-driven research on sex and gender by enhancing the Guidelines' expressiveness in these areas. Our revision of the Guidelines therefore provides examples but resists endorsing any single particular standard for specifying values for sex or gender. We recommend that projects encoding sex and/or gender explicitly state the theoretical groundwork for their ontological modeling, such that the encoding articulates a context-appropriate, informed, and thoughtful epistemology.

Gayle Rubin's influential theory of "sex/gender systems" informs some of our new language in the Guidelines “Names and Dates” chapter (Rubin 1975). While updating existing examples for encoding sex and introducing related examples for encoding gender, we mention the “sex/gender systems” concept to suggest that sex and gender may be related, such that a culture's perspective on biological sex gives rise to its notions of gender identity. Unexpectedly, we found ourselves confronting the Guidelines' prioritization of personhood in discussion of sex, likely stemming from the conflation of sex and gender in the current version of the Guidelines. In revising the technical specifications describing sex, we introduced the term "organism" to broaden the application of sex encoding. We leave it to our community to investigate the fluid concepts of gender and sex in their textual manifestations of personhood and biological life.

Encoding of cultural categories, when unquestioned, can entrench biases and do harm, a risk we must face in digital humanities generally. Yet we seek to make the TEI more expressive and adaptable for projects that complicate, question, and theorize sex and gender constructions. We look forward to working with the TEI community, in hopes of continued revisions, examples, and theoretical document data modeling of sex and gender for future projects. In particular, we are eager to learn more from project customizations that “queer” the TEI and theorize about sexed and gendered cultural constructions, and we hope for a lively discussion at the TEI conference and beyond.



ID: 104 / Session 1B: 3
Long Paper
Keywords: TEI, Spanish, Survey, Community, Geopolitics of Knowledge

Where is the Spanish in the TEI?: Insights on a Bilingual Community Survey

G. del Rio Riande1, S. Allés-Torrent2

1CONICET, Argentine Republic; 2University of Miami, USA

Who can best define the interests and needs of a community? The members of the community itself.

“Communicating the Text Encoding Initiative to a Multilingual User Community” is a research project financed by the Andrew W. Mellon Foundation in which scholars from North and South America are generating linguistic, cultural, didactic and situated educational materials to improve the XML-TEI encoding, editing and publication of Spanish texts.

As part of the project activities, we prepared a bilingual survey (Spanish-English) aimed at finding out who uses or has used XML-TEI practices, and where and how they have been applied to Spanish humanistic texts. Bearing in mind that many digital scholarly edition projects of Spanish texts are carried out in Spanish-speaking and Anglophone institutions, we did not focus on a geographical survey, but on the use of XML-TEI at a global level. The survey ran between February and April 2022; it was anonymous and consisted of 22 questions. It received 104 responses, 77 in Spanish and 28 in English.

Some of the data that we will discuss in this short presentation illustrates significant differences regarding the organization of projects, collaboration, financing and the use of TEI in master's and doctoral research. In broad terms, the survey allowed us not only to better understand the Spanish-speaking community that uses XML-TEI, but also to think of strategies that can contribute to more inclusive practices for scholars from less represented countries and in less favorable contexts inside the global TEI community. Last but not least, we believe the survey will be useful for designing actions that can support a wider range of modes of interaction and collaboration inside the global TEI community.

 
11:00am - 11:30am Wednesday Morning Refreshment Break
Location: ARMB: King's Hall
11:30am - 1:00pm Session 2A: Long Papers
Location: ARMB: 2.98
Session Chair: Elli Bleeker, Huygens Institute for the History of the Netherlands
 
ID: 131 / Session 2A: 1
Long Paper
Keywords: Herman Melville, genetic criticism, text analysis, R, XPath

Revision, Negation, and Incompleteness in Melville's _Billy Budd_ Manuscript

C. Ohge

School of Advanced Study, University of London, United Kingdom

In 2019, John Bryant, Wyn Kelley, and I released a beta version of a digital edition of Herman Melville's last work _Billy Budd, Sailor_. This TEI-encoded edition required nearly 10 years of work to complete, mostly because this last, unfinished work by Melville survives in an incredibly complicated manuscript that shows about eight stages of revision. The digital edition (https://melville.electroniclibrary.org/versions-of-billy-budd) has since been updated, and it presents a fluid-text edition (Bryant 2002) in three versions: a diplomatic transcription of the manuscript, a 'base' (or clean readable) version of the manuscript, and a critical, annotated reading text generated from the base version. Nevertheless, it remained an open question how we could effectively use all of the sophisticated descriptive markup of the manuscript transcription for critical purposes. What is missing, in other words, is an effective analysis of the genesis of this work.

In this talk I would like to demonstrate recent text analyses of the TEI XML data of the manuscript for a chapter-in-progress of my book-length project entitled _Melville’s Codes: Literature and Computation Across Complex Worlds_ (co-authored with Dennis Mischke, and under contract with Bloomsbury). First I generated and visualised basic statistics of textual phenomena (e.g. additions, deletions, and substitutions) using XPath expressions combined with the R programming language. I then used the xml2 and tidytext libraries in R to perform more sophisticated analyses of the manuscript in comparison to Melville's oeuvre. Ultimately the analyses show that _Billy Budd_ ought to be read as a testament to incompleteness and negation.
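The analyses described above were carried out in R; purely as a rough illustration of the kind of XPath counting involved, here is a hedged Python/lxml equivalent (the file name and the choice of elements are assumptions, not the author's actual code):

```python
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# hypothetical file name for the TEI transcription of the manuscript
doc = etree.parse("billy-budd-manuscript.xml")

counts = {
    "additions": len(doc.xpath("//tei:add", namespaces=TEI_NS)),
    "deletions": len(doc.xpath("//tei:del", namespaces=TEI_NS)),
    "substitutions": len(doc.xpath("//tei:subst", namespaces=TEI_NS)),
}
print(counts)
```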

In general, Melville’s use of negations and negative sentiments increased throughout his fictional work. Although this trend drops off in the late poetry, _Billy Budd_ has the highest number of negations in all of Melville’s oeuvre. It also has more acts of deletion than addition in the manuscript. Yet these trends need to be analysed in the context of Melville’s incomplete manuscript, the ‘ragged edges’ of which demonstrate not only a late tendency to increase negative words and ideas, but also, in late revisions, to complicate the main characters of the novel (particularly Captain Vere) who represent justice in the story. Like 'Benito Cereno', the codes of judgment are shown to be inadequate to the task of reckoning with the tragic conditions represented in Melville’s final sea narrative. This inadequacy is illustrated by Vere’s reaction to Billy’s death, which is framed as a computation, an either/or conditional: ‘Captain Vere, either thro stoic self-control or a sort of momentary paralysis induced by emotional shock, stood erectly rigid as a musket in the ship-armorer's rack’ (Chapter 25). This thematic incompleteness is not only a metaphor in the text but a metaphor of the text of this incomplete story.

Christopher Ohge is Senior Lecturer in Digital Approaches to Literature at the School of Advanced Study, University of London. His book _Publishing Scholarly Editions: Archives, Computing, and Experience_ was published in 2021 by Cambridge University Press. He also serves as the Associate Director of the Herman Melville Electronic Library.



ID: 136 / Session 2A: 2
Long Paper
Keywords: digital editions, sentiment analysis, machine learning, literary analysis, corpus annotation

“Un mar de sentimientos”. Sentiment analysis of TEI encoded Spanish periodicals using machine learning

L. Krusic1, M. Scholger1, E. Hobisch2, Y. Völkl2

1Centre for Information Modelling (Austrian Centre for Digital Humanities), University of Graz; 2Graz University of Technology

Sentiment analysis (SA), one of the most active research areas in NLP for over two decades, focuses on the automatic detection of sentiments, emotions and opinions found in textual data (Liu, 2012). Recently, SA has also gained popularity in the field of Digital Humanities (Schmidt, Burghardt & Dennerlein, 2021). This contribution will present the analysis of a TEI-encoded digital scholarly edition of Spanish periodicals using a machine-learning approach to sentiment analysis, as well as the re-integration of the results into TEI for further retrieval and visualization.

 
11:30am - 1:00pm Session 2B: Long Papers
Location: ARMB: 2.16
Session Chair: Hugh Cayless, Duke University
 
ID: 145 / Session 2B: 1
Long Paper
Keywords: collation, information transfer, ecdotics, materiality

TEICollator: a semi-automatic TEI to TEI workflow

M. Gille Levenson

ENS Lyon, France

Automated text comparison has been an area of interest for many years [Nury 2019]: tools such as CollateX allow texts to be compared automatically and even export the result to TEI. However, no tool today makes it possible, starting from transcriptions encoded and structured in XML-TEI, to automate the collation of the texts and to inject the resulting apparatus entries back into the original files. Working in this way ensures that the contextual and structural information specific to each witness (structure, additions, deletions, line changes, etc.) encoded in XML-TEI is not lost. In other words, there is a need to be able to work on textual differences without ignoring the individual, structural and material reality of each text or witness.

Furthermore, the increasing use of Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) tools [Kiessling 2019], which are attractive both in terms of speed of acquisition and of the quality of the preserved information [Camps 2016], has consequences for ecdotic methods: should we keep collating the text manually when its acquisition has been done by the computer?

My work focuses on a semi-automatic collation workflow. I will present a complete TEI to TEI processing chain, from single TEI-encoded transcriptions to meaningfully collated ones (through the production of typed apparatus entries, for instance: see [Camps 2018]), which keeps the original structural information. The process also identifies omissions and transpositions, and finally transforms the data into documents that present the textual information as clearly as possible. I will present my work from the perspective of information transfer, pointing out the dialectic between material and textual collation (as carried out by Bleeker et al. 2018, but using other methods), the latter being the alignment of material features encoded in TEI. Finally, I will outline the limitations and difficulties I face along the processing chain (can the tokenisation of TEI-encoded text be fully automated? What level of textual heterogeneity can the workflow manage? What quality of lemmatisation is required? Which encoding method should be preferred to get the best possible result?).

I want to show how the TEI standard, the pivot format of this computational method, can be used to describe text as well as to process it. Finally, I will show how the last operation, the transformation from TEI to LaTeX (perhaps the most complex task), is fully part of the ecdotic chain and contributes to producing meaning from the data. In this sense, my work is part of the reflection carried out for several years on digital scholarly editions [Pierazzo 2015; Pierazzo and Driscoll 2016]: I chose to prefer the print/PDF format over a web interface, relying on the LaTeX reledmac package developed and maintained by Maïeul Rouquette [Rouquette 2022].

This paper will be the technical counterpart of a paper presented in La Laguna in July, which will focus on the philological side of the processing chain.



ID: 148 / Session 2B: 2
Long Paper
Keywords: digital edition, data quality assurance, XSL-FO, software test, PDF

Back to analog: the added value of printing TEI editions

M. Kupreyev

Goethe Universität Frankfurt am Main, Germany

Sahle (2017) [1] provides an operational definition of a scholarly digital edition by contrasting its paradigm with that of a print edition. His bottom line is that any "digital edition cannot be given in print without significant loss of content and functionality". In my talk I will touch upon the challenges of printing TEI XML datasets but also substantiate the positive effects: a PDF export, indeed, presents only a part of the encoded information, but it can play an essential role in data quality assurance. Creating a printed version of a digital edition can enhance the consistency of the encoding and affect the overall production pipeline of the TEI XML data.

At the “School of Salamanca” [2] project, the TEI XML of the early modern print editions goes through restrictive schema and Schematron checks, after which it is exported to HTML and JSON IIIF for web display [3]. Recently, an option for PDF export was added. Considering the complexity and depth of the annotation, a solution integrated into Salamanca’s Oxygen workflow was chosen, namely the free Apache FOP processor. Similar results might have been achieved with TEI Publisher or the Oxygen PDF Chemistry processor. The PDF export highlighted issues which pertain to two ontologically different areas:

• Rendering XML elements in a constrained two-dimensional PDF layout.

• Varying XML encoding of semantically identical chunks of information.

Issues of the first type concern, for example, the representation of marginal notes and their anchors, and the correlation of pagination between XML and IIIF (representing the original) and PDF (as print output). The second type covers the varying rendering of semantically identical text parts, induced either by errors in the original or by the text editors.

PDF generation was initially intended to be just one of the export methods for the TEI data. It is now implemented early in the TEI production workflow, as it pinpoints semantic and structural inconsistencies in the data and allows them to be corrected before the final XML release. PDF production thus adheres to one of the principles of agile software testing, which states that capturing and eliminating defects in the early stages of the RDLC (research data life cycle) is less time-consuming, less resource-intensive and less prone to collateral bugs (Crispin 2008) [4].

[1] Sahle, Patrick. 2017. "What is a Scholarly Digital Edition?" in Digital Scholarly Editing, edited by Matthew James Driscoll and Elena Pierazzo, 19-39. Cambridge: Open Book Publishers.

[2] https://www.salamanca.school/en/index.html , accessed on 20.06.2022.

[3] https://blog.salamanca.school/de/2022/04/27/the-school-of-salamanca-text-workflow-from-the-early-modern-print-to-tei-all/,

https://blog.salamanca.school/de/2020/03/17/deutsch-entwicklung-der-webanwendung-v2-0/ , accessed on 20.06.2022.

[4] Crispin, Lisa. 2008. Agile Testing: A Practical Guide for Testers and Agile Teams. Addison-Wesley.



ID: 106 / Session 2B: 3
Long Paper
Keywords: poetry, rhyme, sound

Encoding sonic devices: what is it good for?

M. Holmes

University of Victoria, Canada

The Digital Victorian Periodical Poetry project[1] has captured metadata and page-images for 15,548 poems from Victorian periodicals, and transcribed and encoded a representative sample of 2,150 poems. Our encoding captures rhyme and other sonic devices such as anaphora, epistrophe, and refrains. This presentation will describe our encoding practices and then discuss what useful outcomes can be gained from this undertaking. Although even TEI P1 specified both a rhyme attribute to capture rhyme-scheme and a rhyme element for "very detailed studies of rhyming" (TEI P1 P172)[2], and all significant TEI tutorials teach the encoding of rhyme (e.g. TEI by Example Module 4), it is difficult to find work which makes explicit use of TEI encoding of rhyme (let alone other sonic devices) in the analysis of English poetry.

Is manual encoding of rhyme still necessary? Chisholm & Robey noted back in 1995 that "much of the analysis which currently requires extensive manual markup will in due course be carried out by electronic means" (100), and much work has been devoted to the automated detection of rhyme (Kavanagh 2008; Kilner & Fitch 2017). However, these tools are not completely successful, and in our own work, there is a consistent subset of cases which generate disagreement and discussion regarding type of rhyme, or even whether a rhyme is intended. We do make use of automated detection of anaphora and epistrophe, but only to generate suggestions for cases that might have been missed after the initial encoding has been done. We therefore believe that manually-curated encoding of sonic devices is a prerequisite for serious literary analysis which depends on that encoding.
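As a rough illustration of the kind of automated suggestion mentioned above, a generic sketch of naive anaphora detection (not the project's actual tooling; tokenisation and the run threshold are assumptions):

```python
def suggest_anaphora(lines, min_run=2):
    """Flag runs of consecutive verse lines opening with the same word,
    as candidate anaphora for an encoder to review."""
    first = [ln.split()[0].strip(".,;:!?").lower() if ln.split() else "" for ln in lines]
    runs, start = [], 0
    for i in range(1, len(lines) + 1):
        if i == len(lines) or first[i] != first[start]:
            if first[start] and i - start >= min_run:
                runs.append((start, i - 1, first[start]))
            start = i
    return runs

stanza = [
    "Break, break, break,",
    "Break on thy cold gray stones, O Sea!",
    "And I would that my tongue could utter",
    "The thoughts that arise in me.",
]
print(suggest_anaphora(stanza))  # [(0, 1, 'break')]
```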

[1] DVPP, https://dvpp.uvic.ca/.

[2] See also Chisholm & Robey 1995.

Having invested in careful encoding of sonic devices, what are the potential uses for research? DVPP has begun by making rhyme-scheme discoverable and searchable in our search interface, and this is beginning to generate research questions. We can already test notions such as the claim that irregular rhyme-schemes were more frequently used as the century progressed; a table of the percentage of irregularly-rhymed poems in each decade in our collection (Appendix) shows only the weakest support for this claim.

In addition to tracing trends in poetic practice, and the construction of historical rhyme dictionaries, sonic device encoding might also be used for:

- Dialect detection. For example, our dataset includes a significant subset of poems written in Scots dialect, and others which may or may not be; for problem cases, where other factors such as poet and host publication suggest a dialect poem, but surface features are not persuasive, rhyme patterns may provide more evidence.

- Genre detection. Particular poetic genres, such as sonnets or ballads are characterized by formal structures which include rhyme-scheme.

- Bad poetry. We are particularly interested in the notion of what constitutes bad poetry, and our early work suggests that poetry which subjectively seems to be of poor quality also exhibits features such as monotonous rhyme-schemes and intrusive echoic devices.

- Authorship attribution.

- Diachronic sound-change.

- Historical rhyming dictionaries.

 
1:00pm - 2:30pm Wednesday Lunch Break
Location: ARMB: King's Hall
2:30pm - 4:00pm Session 3A: Long Papers
Location: ARMB: 2.98
Session Chair: Gustavo Fernandez Riva, Universität Heidelberg
 
ID: 113 / Session 3A: 1
Long Paper
Keywords: Middle Ages, lexicography, glossary, quantitative analysis, Latin

Vocabularium Bruxellense. Towards Quantitative Analysis of Medieval Lexicography

K. Nowak1, I. Krawczyk1, R. Alexandre2

1Institute of Polish Language (Polish Academy of Sciences), Poland; 2Institut de recherche et d'histoire des textes, France

The Vocabularium Bruxellense is a little-known example of medieval Latin lexicography (Weijers 1989). It has survived in a single manuscript dated to the 12th century and currently held at the Royal Library of Belgium in Brussels. In this paper we present the digital edition of the dictionary and the results of a quantitative study of its structure and content based on the TEI-conformant XML annotation.

First, we briefly discuss a number of annotation-related issues. For the most part, they result from the discrepancy between medieval and modern lexicographic practices which are accounted for in the 9th chapter of the TEI Guidelines (TEI Consortium). For example, a single paragraph of a manuscript may contain multiple dictionary entries which are etymologically or semantically related to the headword.

Medieval glossaries are also less consistent in their use of descriptive devices. For instance, the dictionary definitions across the same work may greatly vary as to their form and content. As such, they require fine-grained annotation if the semantics of the TEI elements is not to be strained.

Second, we present the TEI Publisher-based digital edition of the Vocabularium (Reijnders et al. 2022). At the moment, it provides basic browsing and search functionalities, making the dictionary available to the general public for the first time since the Middle Ages.

Third, we demonstrate how the TEI-conformant annotation may enable a thorough quantitative analysis of the text which sheds light on its place in the long tradition of medieval lexicography. We focus on two major aspects, namely the structure and the sources of the dictionary. As for the first, we present summary statistics of the almost 8,000 entries of the Vocabularium, expressed as the number of entries per letter and per physical page. We show that half of the entries are relatively short: a number of them contain only a one-word gloss, and only 25% of entries contain 15 or more tokens.

Based on the TEI XML annotation of nearly 1,200 quotes, we were able to make a number of points concerning the function of quotations in medieval lexicographic works, which is hardly limited to attesting specific language use. We observe that quotations are not equally distributed across the dictionary, as they can be found in slightly more than 10% of the entries, whereas nearly 7,000 entries have no quotations at all. The quotes are usually relatively short, with only 5% containing 10 or more words. Our analysis shows that the most quoted author is by a wide margin Virgil, followed by Horace, Lucan, Juvenal, Ovid, Plautus, and Terence (19). Church Fathers and medieval authors are seldom quoted; we have also discovered only 86 explicit Bible quotations so far.
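As a hedged illustration of how TEI-conformant annotation supports counts of this kind (element names follow the TEI dictionaries module; the file name is an assumption, not the edition's actual layout):

```python
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
doc = etree.parse("vocabularium.xml")  # hypothetical file name for the edition

entries = doc.xpath("//tei:entry", namespaces=TEI_NS)
quote_counts = [len(e.xpath(".//tei:quote", namespaces=TEI_NS)) for e in entries]

print(len(entries), "entries")
print(sum(1 for n in quote_counts if n > 0), "entries contain at least one quotation")
```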

In conclusion, we argue that systematic quantitative analyses of the existing editions of medieval glossaries might provide useful insight into the development of this important part of medieval written production.



ID: 162 / Session 3A: 2
Long Paper
Keywords: standardization, morphology, morphosyntax, ISO, MAF, stand-off annotation

ISO MAF reloaded: new TEI serialization for an old ISO standard

P. Banski1, L. Romary2, A. Witt1

1IDS Mannheim, Germany; 2INRIA, France

The ISO Technical Committee TC 37, Language and terminology, Subcommittee SC 4, Language resource management (https://www.iso.org/committee/297592.html, ISO TC37 SC4 henceforth) has been, for nearly 20 years now, the locus of much work focusing on standardization of annotated language resources. Through the subcommittee’s liaison with the TEI-C, many of the standards developed there use customizations of the TEI Guidelines for the purpose of serializing their data models. Such is the case of the feature structure standards (ISO 24610-1:2006, ISO 24610-2:2011), which together form chapter 18 of the Guidelines, as well as the standard on the transcription of the spoken language (ISO 24624:2016, reflected in ch. 8) or the Lexical Markup Framework (LMF) series, where ISO 24613-4:2021 mirrors ch. 9 of the Guidelines.

The Morphosyntactic Annotation Framework (ISO 24611:2012) was initially published with its own serialization format, interwoven with suggestions on how its fragments can be rendered in the TEI. In a recent cyclic revision process, a decision was made to divide the standard into two parts and to replace the legacy serialization format with a customization of the TEI that makes use of recent developments in the Guidelines – crucially, the work on the standOff element and on the att.linguistic attribute class. The proposed contribution reviews fragments of the revised standard and presents the TEI devices used to encode it. At the time of the conference, ISO/CD 24611-1 “Language resource management — Morphosyntactic annotation framework (MAF) — Part 1: Core model” will have just been through the Committee Draft ballot by the national committees mirroring ISO TC37 SC4.

In what follows, we briefly outline the basic properties of the MAF data model and review selected examples of its serialization in the TEI.



ID: 108 / Session 3A: 3
Long Paper
Keywords: lexicography, dictionaries, semantic web

TEI Modelling of the Lexicographic Data in the DARIAH-PL Project

K. Nowak, D. Mika, W. Łukasik

Institute of Polish Language (Polish Academy of Sciences), Poland

The main goal of the project “DARIAH-PL Digital Research Infrastructure for the Arts and Humanities” is building the Dariah.lab infrastructure, which will allow for sharing and integrated access to digital resources and data from various fields of the humanities and arts. Among the numerous tasks that the Institute of Polish Language, Polish Academy of Sciences coordinates, we are working towards the integration of our lexicographic data with LLOD resources (Chiarcos et al. 2012). The essential step of this task is to convert the raw text into a TEI-compliant XML format (TEI Consortium).

In this paper we would like to outline the main issues involved in TEI XML modelling of these heterogeneous lexicographic data.

In the first part, we will give a brief overview of the formal and content features of the dictionaries. For the most part, they are paper-born works developed with the research community in mind, and as such are rich in information and complex in structure. They cover the diachronic development of Polish (from medieval Polish and Latin to present-day Polish) and its functional variation (general language vs. dialects, proper names).

On a practical level, this meant that, first, substantial effort had to be put into optimizing the quality of the OCR output. Since, except for grobid-dictionaries (Khemakhem et al. 2018), there are currently no tools that would enable easy conversion of lexicographic data, the subsequent phase of structuring the dictionary text had to be applied on a per-resource basis.

The TEI XML annotation has three main goals. First, it is a means of preserving the textuality of paper-born dictionaries, which make heavy use of formatting to convey information and employ a complex system of text-based internal cross-references. Second, TEI modelling aims at a better understanding of each resource and its explicit description. The analysis is performed by lexicographers who may, however, come from a lexicographic tradition different from the one embodied in a particular dictionary, and thus need to make their interpretation of the dictionary text explicit. In this way we may also detect and correct editorial inconsistencies, which are natural for collective works developed over many years. Third, the annotated text is meant to be the input of the alignment and linking tasks; it is therefore crucial that functionally equivalent structures be annotated in a systematic and coherent way. As we plan to provide integrated access to the dictionaries, the TEI XML representation is also where the first phase of data reconciliation takes place. This concerns not only the structural units of a typical dictionary entry, such as <sense/> or <form/>, but also the mapping between units of the analytical language the dictionaries employ, such as labels, bibliographic reference systems, etc.

 
2:30pm - 4:00pm Session 3B: Notes from the DEPCHA Field and Beyond: TEI/XML/RDF for Accounting Records
Location: ARMB: 2.16
Session Chair: Syd Bauman, Northeastern University
 
ID: 154 / Session 3B: 1
Panel
Keywords: accounts, accounting, DEPCHA, bookkeeping ontology

Notes from the DEPCHA Field and Beyond: TEI/XML/RDF for Accounting Records

K. Tomasek1, O. Bullock1, L. Hermsen2, R. Walker2, N. Kokaze3

1Wheaton College Massachusetts, United States of America; 2Rochester Institute of Technology, United States of America; 3Chiba University, Japan


The short papers in this session focus on questions that arise in the process of editing manuscript account books. Some of these questions result from the “messiness” of accounting practices in contrast to the “rationality” of accounting principles; others arise from efforts to reflect in the markup social and economic relationships beyond those imagined in Chapter 14 of the P5 TEI Guidelines, “Tables, Formulae, Graphics, and Notated Music.” The Bookkeeping Ontology developed by Christopher Pollin for the Digital Edition Publishing Cooperative for Historical Accounts (DEPCHA) in the Graz Asset Management System (GAMS) extends the potential of TEI/XML using RDF.

In “Operating Centre Mills,” Tomasek and Bullock focus on markup for information about the people, materials, and machines used to produce cotton batting at Centre Mills, a textile mill in Norton, Massachusetts, in 1847-48. The general ledger for this enterprise includes store accounts, production records, and tracking of materials used to run the mill. Entries that reflect the costs of mill operation show sources of raw cotton, daily use of materials, and payments for wages and board for a small labor force. Examples in the paper demonstrate flexible use of the <measure> element combined with a draft taxonomy based on Historical Statistics of the United States, a resource for economic history originally published by the U.S. Bureau of the Census. The goal of the edition is to develop additional semantic markup to supplement Pollin’s Bookkeeping Ontology.

“Wages and Hours,” Hermsen and Walker’s paper, emerges from their work on a digital scholarly edition of account books of William Townsend & Sons, Printers, Stationers, and Account Book Manufacturers, Sheffield UK (1830-1910). Volume 3, “Business Guide and Works Manual,” speaks both to book history and to cultural observations about unionization, gender roles, and credit/debit accounting. Parts of this complex manuscript might be considered a nineteenth-century commonplace book; it also contains specific instructions for book binding, including lists of required materials and a recipe for glue.

The financial accounts in this collection are recorded in ambiguous tabular form with in-text page references to nearly indecipherable price keys. For example, Townsend provides a “key” to determine the size of an account book. The formula is figured using imperial standards for the size of a sheet of paper (i.e. Foolscap) and quarto or octavo folds of the sheet and the number of sheets. This formula, along with the type of ruling and binding, provides the necessary numbers for the arithmetic that will determine the price of an account book.

Naoki’s paper, “Stakeholders in the British Ship-Breaking Industry,” develops a set of methods to analyse structured data of historical financial records, taking a disbursement ledger of Thomas W. Ward, the largest British shipbreaker in the twentieth century, as an example. That ledger is held by the Marine Technology Special Collection at Newcastle University, UK. The academic contribution of this research is to critically examine the possibilities and limitations of DEPCHA, the ongoing digital humanities approach for semantic datafication of historical financial records with the TEI and RDF, mainly developed by scholars in the United States and Austria, and to present an original argument in British maritime history, which is to visualise a part of the overall structure of the British shipbreaking industry.

Development of DEPCHA was supported by a joint initiative of the National Historic Publications and Records Commission at the National Archives and Records Administration in the United States and the Andrew W. Mellon Foundation.

Bios:

Kathryn Tomasek is Professor of History at Wheaton College. She has been working on TEI for account books since 2009, and she was PI for the DEPCHA planning award in 2018. She chaired the TEI Board between 2018 and 2021.

Olivia Bullock is a senior Creative Writing major at Wheaton College who studies intersectional identities in literature and history.

Lisa Hermsen is Professor and Caroline Werner Gannett Endowed Chair in the College of Liberal Arts at Rochester Institute of Technology.

Rebecca Walker, Digital Humanities Librarian, coordinates large-scale DH projects and supports classroom digital initiatives in the College of Liberal Arts at Rochester Institute of Technology.

Naoki Kokaze is an Assistant Professor at Chiba University, where he leads the design and implementation of DH-related lectures in the government-funded humanities’ graduate education program conducted in collaboration with several Japanese universities. He is a PhD candidate in History at the University of Tokyo, writing his doctoral dissertation focusing on the social, economic, and diplomatic aspects of the disposal of obsolete British Royal Navy’s warships from the mid-nineteenth century through the 1920s.

 
4:00pm - 4:30pm Wednesday Afternoon Refreshment Break
Location: ARMB: King's Hall
4:30pm - 6:00pm Poster Slam, Session, and Reception
Location: ARMB: King's Hall
Session Chair: Syd Bauman, Northeastern University
The Poster Slam and Session will start with a one-minute, one-slide presentation by each poster presenter summarising their poster and why you should come see it.

There will be an informal drinks and nibbles reception during the poster session.
 
ID: 115 / Poster Session: 1
Poster
Keywords: Early modern history, Ottoman, Edition

The QhoD project: A resource on Habsburg-Ottoman diplomatic exchange

S. Kurz, M. Mayer, Y. Yılmaz

Austrian Academy of Sciences, Austria

After starting as a cross-disciplinary project (early modern history, Ottoman studies) in 2020, the Digitale Edition von Quellen zur habsburgisch-osmanischen Diplomatie 1500–1918 (QhoD) project has recently gone public with its TEI-based source editions related to the diplomatic exchange between the Ottoman and Habsburg empires.

Unique features of QhoD are:

- QhoD is editing sources from both sides (Habsburg and Ottoman archives), giving complementary views; Ottoman sources are translated into English

- diversity of source genres (e.g. letters, contracts, travelogues, descriptions and depictions of cultural artefacts in LIDO; protocol register entries, Seyahatnâme, Sefâretnâme, newspapers, etc.)

- openness to outside collaboration (bring your TEI data!)

For Ottoman sources, QhoD is adhering to the İslam Ansiklopedisi Transkripsiyon Alfabesi transcription rules (Arabopersian to Latin transliteration). Transcriptions are aided by using Transkribus HTR mainly for German language sources, with ventures into Ottoman HTR together with other projects. Named entity data is curated in a shared instance of the Austrian Prosopographical Information System (APIS), aligned to GND identifiers.

At the time of writing, <https://qhod.net> features

- by language: 60 German, 42 Ottoman language documents

- by genre: 60 letters, 20 protocol register entries, 16 official records, 5 artefacts, 4 travelogues, 4 reports, 3 instructions.

- by embassy/timeframe: 16 sources related to correspondence between Maximilian II and Selim II (1566–1574); 31 sources on Rudolf Schmid zu Schwarzenhorn’s internuntiature (1649); 61 sources on the mutual grand embassies of Virmont and Ibrahim Pasha (1719–1720)

The poster will describe those sources and the TEI-infused reasoning behind their edition, as well as the technical implementation, which uses the GAMS repository software to archive and disseminate data.

QhoD uses state-of-the-art TEI/XML technology to improve availability of archival material essential for understanding centuries of mutual relations between two large imperial entities.



ID: 110 / Poster Session: 2
Poster
Keywords: Semantic Web, Mobility Studies, Travelogues

Building a digital infrastructure for the edition and analysis of historical travelogues

S. Balck

IOS Regensburg, Germany

The core of the project stems from the unpublished records of Franz Xaver Bronner's (1758 - 1850) journey from Aarau, via St. Petersburg, to the university in Kazan (1810) and his way back (1817) via Moscow, Lviv and Vienna. A digital edition of these manuscripts will be created, enhanced by Semantic Web and Linked Data technologies.

The project will use the annotated critical text edition of the work above as a Case Study with the aim of developing a modularly expandable digital research infrastructure. This infrastructure will support digital transcription, annotation and visualisation of travelogues.

In the preliminary stages of the project, the first and more extensive part (the outward journey) of Franz Xaver Bronner's travelogue manuscript has already been transcribed with Transkribus. High-quality digital copies were made for Handwritten Text Recognition, and training models were developed on the basis of the manually transcribed texts. These are to be used for the semi-automatic transcription of other related texts. People, places, travel and other events were annotated with XML markup elements using TEI.

In the next step, visualisations and ontology design patterns for travelogues and itineraries will be developed. This includes a new annotation scheme for linking the TEI annotated text passages to associated database entries. The edition will enable the visualisations of textual information and contextual data.



ID: 122 / Poster Session: 3
Poster
Keywords: scholarly digital editions, conceptual model, digital philology, textual criticism, text modeling

TEI and Scholarly Digital Editions: how to make philological data easier to retrieve and elaborate

C. Martignano

University of Florence, Italy

In the past few decades the number of TEI-encoded scholarly digital editions (SDEs) has risen significantly, which means that a large amount of philologically edited data is now available in machine-readable form. One could try to apply computational approaches in order to further study the linguistic data, the information about the textual transmission, etc. contained in multiple TEI-encoded digital editions. The problem is that retrieving philological data across different TEI-encoded SDEs is not that simple.

Every TEI-encoded edition has its own markup model, designed to respond to the philological and editorial requirements of that particular edition. Hence, it is difficult to share a markup model between various editions.

A possible way to bridge multiple digital editions, despite their different markup solutions, is to map them onto a common model that is able to represent SDEs on a more abstract level.

This kind of mapping would be particularly useful when the markup solutions are more prone to various interpretations or more ambiguous. The TEI guidelines, for example, show how the @type attribute can be used with the <rdg> element to distinguish between different types of variants. However, every edition may have its own set of possible values, beyond “orthographic” and “substantive”, to markup a wider range of phenomena of the textual transmission.

To build a model capable of representing different editions is a challenging task, for “scholarly practice in representing critical editions differs widely across disciplines, time periods, and languages.” However, there is common ground that can be used to model scholarly editing: what the editor reads in the source(s), how the editor compares different sources, and finally what the editor writes in the edition. Around these three concepts I am building a model that aims at making philological data more visible and easier to process further.



ID: 117 / Poster Session: 4
Poster
Keywords: Spanish literature, Digital library, Services, TEI-Publisher

Between Data and Interface, Building a Digital Library for Spanish Chapbooks with TEI-Publisher

E. Leblanc, P. Jacsont

University of Geneva, Switzerland

This poster will present the project Untangling the cordel (2020-2023) and its experiments with TEI-Publisher to develop a digital library (DL) that aims at studying and promoting the Geneva collection of Spanish chapbooks (Leblanc and Carta 2021).

Intended for a wide audience and sold in the streets, chapbooks recount fictitious or real events as well as songs, dramas, or religious writings. Although their contents are varied, they are characterised by their editorial form, i.e. short texts (4 to 8 pages), in quarto, arranged in columns and decorated with woodcuts. The interest in chapbooks ranges from literature to art and book history, sociology, linguistics, and musicology. This diversity reflects the hybridity of chapbooks, at the frontier between document, text, image, and orality (Botrel 2001; Gomis and Botrel 2019, 127–30).

An editorial workflow based on XML-TEI was devised to display our corpus online. After transcribing the texts with HTR tools, we 1) converted the transcriptions to XML-TEI via XSLT, 2) stored them in eXist-DB, and 3) published them with TEI-Publisher. Images of the documents are displayed with IIIF. Through this workflow, the DL can offer services that stress different aspects of the chapbooks.

Working with TEI-Publisher has influenced the way we think about our XML-TEI model. While the choices we have made are mainly driven by the data, some of them have been influenced by the functionalities we wanted to implement, such as the addition of image links or keywords. Thus, our ODD reflects not only the nature of our documents but also the DL services. In this context, the use of TEI-Publisher invites us to reconsider a strict distinction between “data over interface” and “interface over data” (Dillen 2018), as data and interface are here mutually influenced.



ID: 157 / Poster Session: 5
Poster
Keywords: software, editors, oxygen, frameworks, annotations

oXbytei and oXbytao. A Stack of Configurable oXygen Frameworks

C. Lück

Universität Münster, Germany

Until recently, the possibilities for adapting author-mode frameworks for the oXygen XML editor were rather limited. A framework was either a base framework like TEI-C's *TEI P5* framework, or it was based on a base framework. But since version 23.1+, the mechanism of *.framework files for configuring frameworks has been replaced/supplemented with extension scripts. This allows us to design arbitrarily tall stacks of frameworks, no longer limited to two levels. It is now possible to design base and intermediate frameworks with common functions; only a thin layer is required for project-specific needs.

oXbytei and oXbytao are such intermediate and higher-level frameworks.

oXbytei is based on TEI-C's *TEI P5* framework. Its design idea is to get as much of its configuration as possible from the TEI document's header. E.g. depending on the variant encoding declared in the header, it produces a parallel-segmentation, double-end-point-attached or location-referenced apparatus. Since not all information for setting up the editor is available in the header, oXbytei comes with its own XML configuration. It ships with Java classes for rather complex actions. It has a plugin interface for aggregating and selecting information from either local or remote norm data. It also offers actions for generating anchor-based annotations, either with TEI `<span>` or in RDF/OWL with OA.

oXbytao is a level-3 framework based on oXbytei. It offers common actions that are more biased towards a certain kind of TEI usage, e.g. for `<corr>` and `<choice>` or for encoding multiple recensions of the same text within a single TEI document. It defines a template directory in each oXygen project. CSS styles offer a collapsed and an expanded view as well as optional views on the header or for editing through form controls etc. All styles are fully customizable on a project basis.

https://github.com/SCDH/oxbytei

https://github.com/SCDH/oxbytao



ID: 141 / Poster Session: 6
Poster
Keywords: automation, validation, continuous integration, continuous deployment, error reports, quality control

Automatic Validation, Packaging and Deployment of TEI Documents. What Continuous Integration can do for us

C. Lück

Universität Münster, Germany

Keeping TEI documents in a Git repository is one way to store data. However, Git does not only excel in robustness against data loss and downtime of internet connections, or in enabling collaboration on TEI editions. Git servers also leverage the automation of recurrent tasks: validating all our TEI documents, generating human-readable reports about their validity, assembling them into a data package, and deploying it to a publication environment. These tasks can be processed automatically in a continuous integration (CI) pipeline. In software development, CI has established itself as a key to quality assurance. It gets its strength from automation, by running tests *regularly* and *uniformly*. For obvious reasons, CI has been transferred to the quality assurance of research data (in the life sciences) by Cimiano et al. (2021). The poster presents a data template for TEI editions that runs the tasks listed above on a GitLab server or on GitHub and even generates and deploys an EXPath package on TEI Publisher:

https://github.com/scdh/edition-data-template-cx

The template extends the data template for TEI Publisher.[^1] It uses Apache Maven as a pipeline driver, because Maven only needs a configuration file and thus enables us to keep our repository free of software.[^2] It validates all TEI documents against common RNG and Schematron files. Jing's output is parsed, and a human-readable report is created and deployed on the Git server's publication environment (e.g. GitLab Pages). On successful validation, an XAR package is assembled and deployed to a running TEI Publisher instance.
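
As an illustration (not taken from the template itself), a Schematron rule of the kind such a pipeline might enforce, here requiring every `<persName>` to carry a `@ref` attribute:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <pattern>
    <rule context="tei:persName">
      <!-- fail the CI job if a person reference lacks a pointer to an authority record -->
      <assert test="@ref">persName must point to an authority record via @ref.</assert>
    </rule>
  </pattern>
</schema>
```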

References

Cimiano, Ph. et al. (2021): Studies in Analytic Reproducibility. The Conquaire Project. U Bielefeld Press. doi: 10.4119/unibi/2942780

[^1]: [https://github.com/eeditiones/tei-publisher-data-template](https://github.com/eeditiones/tei-publisher-data-template)

[^2]: Using GNU Make would not be portable and XProc lacks incremental builds. Replacing Maven with Gradle is under development.



ID: 137 / Poster Session: 7
Poster
Keywords: braille, bibliography, accessibility, book history, publishing

Adapting TEI for Braille

E. Forget

University of Toronto, Canada

Bibliography as a field has undergone rapid changes to adapt for ever-evolving book formats in the digital age. Methods, tools, and techniques originally meant for manuscripts and printed books have now been adjusted to apply to ebooks (Rowberry 2017; Galey 2012 and 2021), audiobooks (Rubery 2011), and other bookish objects (Pressman 2020). However, there is much less work currently available that considers the bibliographical differences of accessible book formats or, more specifically, braille as a book format. Braille lends itself well to bibliographical analysis due to its tactile nature, but traditional bibliographical methods and tools were not developed with braille in mind and must be adapted to work with braille.

As part of a larger braille bibliography-focussed project, I am adapting TEI to work for analyzing braille books—specifically braille editions of illustrated children’s books. Illustrated books offer additional complexity to textual analysis that is compounded by the forced hierarchy of linear-text tools, and working with braille editions of illustrated books further complicates questions of hierarchy and format descriptions.

This poster will showcase the progress I have made so far in adapting TEI to work with braille, specifically using the multilingual prototype book as an example. The poster will touch on questions of textual hierarchy, line length/breaks, illustration descriptions, braille and format descriptions, and how languages are tagged, and it will include a wish list of TEI needs that I have not successfully adapted yet, as this is a work-in-progress project.



ID: 135 / Poster Session: 8
Poster
Keywords: lexicography, Okinawan, endangered language, multiple writing systems, language revitalization

Okinawan Lexicography in TEI: Challenges for Multiple Writing Systems

S. Miyagawa1, K. Kato2, M. Zlazli3, S. Machida4, S. Carlino5

1National Institute for Japanese Language and Linguistics (NINJAL), Japan; 2Tokyo University of Foreign Studies, Japan; 3SOAS University of London, UK; 4University of Hawaiʻi at Hilo, US; 5Kyushu University/Hitotsubashi University, Japan

Okinawan is classified as one of the Northern Ryukyuan languages in the Japonic language family. It is primarily spoken in the southern and central parts of Okinawa Island in the Ryukyu Archipelago. It was the official lingua franca of the Ryukyu Kingdom and a literary vehicle (e.g., the Omoro Soshi poetry collection), but it is now an endangered language. Okinawan has been recorded in various written forms: a combination of Kanji logograms and the Hiragana syllabary, either with archaic spellings (e.g. the Omoro Soshi) or with modern spelling variations that approximate the actual pronunciation; pure Katakana syllabary (e.g., Bettelheim’s Bible translation); the Latin alphabet (mostly by linguists); and pure Hiragana (popular).

The Okinawago Jiten (Okinawan Dictionary; OD), published by the National Institute for Japanese Language and Linguistics (NINJAL) in 1963 and revised in 2001[1], uses the Latin alphabet for each lexical entry. We first added the possible written forms listed above to the data in CSV format and then converted the CSV into TEI XML using Python. Figure 1 presents a sample encoding of a single entry. We represented the multiple written forms with <orth> tags, marking the corresponding writing system in the @xml:lang attribute following BCP 47[2] (e.g., xml:lang="ryu-Hira" for Okinawan words written in Hiragana). We also added the International Phonetic Alphabet (IPA) and the accent type in <pron> tags to make the pronunciation clearer.

Fig. 1 TEI of each lexical entry
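
Since Figure 1 cannot be reproduced in this text-only listing, the following is a rough, illustrative approximation of such an entry (the word, forms and gloss are my own example, not taken from the OD):

```xml
<entry xml:id="entry-haisai">
  <form type="lemma">
    <orth xml:lang="ryu-Latn">haisai</orth>
    <orth xml:lang="ryu-Hira">はいさい</orth>
    <orth xml:lang="ryu-Kana">ハイサイ</orth>
    <pron>haisai</pron> <!-- IPA; accent type omitted in this sketch -->
  </form>
  <sense>
    <def xml:lang="en">hello (a common greeting)</def>
  </sense>
</entry>
```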

Using XSLT, we transformed this TEI file into a static webpage with a user-friendly GUI, as shown in Figure 2. It is anticipated that this digitization of the OD and its publication under an open license will benefit key stakeholders, such as Okinawan heritage learners and learners of Okinawan worldwide, as it is the largest Okinawan dictionary available online.

Fig. 2 Webpage rendition of TEI



ID: 155 / Poster Session: 9
Poster
Keywords: 3D scholarly editions, annotation, <sourceDoc>, Babylon.js

Text as Object: Encoding the data for 3D annotation in TEI

J. Ogawa1, K. Nagasaki2, I. Ohmukai3, Y. Nakamura3, A. Kitamoto1

1Center for Open Data in the Humanities, Japan; 2International Institute for Digital Humanities, Japan; 3University of Tokyo, Graduate School of Humanities and Sociology

This poster will present a way of representing the text on a 3D object and its annotations in TEI. Since the concept of 3D scholarly editions has recently been discussed in the field of Digital Humanities, we experimentally provide a practical method that contributes to realizing this concept.



ID: 158 / Poster Session: 10
Poster
Keywords: Japanese text, close reading, interface, CETEIcean

Building Interfaces for East Asian/Japanese TEI data

K. Nagasaki1, S. Nakamura2, K. Okada3

1International Institute for Digital Humanities, Japan; 2Historiographical Institute, The University of Tokyo; 3Hokkai Gakuen University

Over the past several years, East Asian/Japanese (henceforth, EAJ) TEI data have been created in various fields. In this context, one of the issues the authors have been working on is the construction of an easy-to-use interface. In this presentation, we report on this activity.



ID: 170 / Poster Session: 11
Poster
Keywords: Natural Language Processing, Explainable AI, Computing, Social Media, Hate Speech

Explainable Supervised Models for Bias Mitigation in Hate Speech Detection: African American English

A. Gabriel, M. Sinclair

Northumbria University

Automated hate speech detection systems have great potential in the realm of social media but have seen their success limited in practice due to their unreliability and inexplicability. Two major obstacles they have yet to overcome are their tendency to underperform when faced with non-standard forms of English and a general lack of transparency in their decision-making process. These issues result in users of low-resource languages (those that have limited data available for training), such as African-American English, being flagged for hate speech at a higher rate than users of mainstream English. The cause of the performance disparity in these systems has been traced to multiple issues, including social biases held by the human annotators employed to label training data, training-data class imbalances caused by insufficient instances of low-resource language text, and a lack of sensitivity of machine learning (ML) models to contextual nuances between dialects. All these issues are further compounded by the ‘black-box’ nature of the complex deep learning models used in these systems. This research proposes to consolidate seemingly unrelated, recently developed methods in machine learning to resolve the issues of bias and lack of transparency in automated hate speech detection. The research will utilize synthetic text generation to produce a theoretically unlimited amount of low-resource language training data, machine translation to overcome annotation conflicts caused by contextual nuances between dialects, and explainable ML (including integrated gradients and instance-level explanation by simplification). We will attempt to show that, when repurposed and integrated into a single system, these methods can significantly reduce bias in hate speech detection tasks whilst also providing interpretable explanations of the system’s decision-making process.



ID: 105 / Poster Session: 12
Poster
Keywords: manuscript studies, palaeography, IIIF, cataloguing

A TEI/IIIF Structure for Adding Palaeographic Examples to Catalogue Entries

S. M. Winslow

University of Graz, Austria

The study of palaeography generally relies either on expert testimony with sparse examples or on separate, specialist catalogues imaging and documenting the specific characteristics of each hand. Both practices presumably made much more sense given the cost, difficulty, and space required by printed catalogues in the past, but with modern practice in cataloguing manuscripts via TEI and disseminating images via IIIF, these difficulties have been largely obviated. Accordingly, it is desirable to have a simple, consistent, and searchable way to embed examples of manuscript hands within the TEI, as a companion to elements from msdescription that describe hand features. This poster will demonstrate a simple and re-usable structure for embedding information about the palaeography of manuscript hands in msdescription and associating it with character examples using IIIF. An example implementation, part of the Hidden Treasures from the Syriac Manuscript Heritage project, will be demonstrated, and an ODD containing the new elements and structure will be made available.
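
A hedged sketch of what such an association might look like (my illustration; the poster's actual ODD extension may differ), pairing a `<handNote>` with a `<graphic>` whose URL addresses a small region of a page image via the IIIF Image API:

```xml
<handDesc>
  <handNote xml:id="hand1" script="Estrangela" medium="brown-ink">
    Primary hand of the main text.
    <!-- region/size/rotation/quality path segments follow the IIIF Image API; URL illustrative -->
    <graphic url="https://iiif.example.org/ms-syr-1/f12r/1020,340,260,90/full/0/default.jpg"/>
  </handNote>
</handDesc>
```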



ID: 123 / Poster Session: 13
Poster
Keywords: Digital edition, projects, cooperation, digital texts, infrastructure

From facsimile to online representation. The Centre for Digital Editions in Darmstadt. An Introduction

K. Fischer, S. Kalmer, D. Kampkaspar, S. Müller, M. Scheffer, M. E.-H. Seltmann, K. Wunsch

University and State Library Darmstadt, Germany

The Centre for Digital Editions in Darmstadt (CEiD) covers all aspects of preparing texts for digital scholarly editions from planning to publication. It not only processes the library's own holdings, but also partners with external institutions.

Workflow

After applying various methods for text recognition (OCR/HTR), the output is used as a starting point for the realisation of the digital edition as an online publication. In addition, a variety of transformation tools are used to convert texts from different formats such as XML, JSON, Word DOCX or PDF into a wide range of TEI-based formats (TEI Consortium 2022), thus enabling uniformity across different projects. These texts can be annotated and enriched with metadata. Furthermore, entities can be marked up; these are managed in a central index file. This workflow is not static, but can be adapted according to the needs of the project.
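
A minimal sketch of such entity markup (illustrative only; element choices and file names are my assumptions, not CEiD's actual schema): a reference in the edition text points to an entry in the central index file:

```xml
<!-- in an edition text -->
<p>A letter to <persName ref="register.xml#p0042">Luise</persName> was enclosed.</p>

<!-- in the central index file (register.xml) -->
<listPerson>
  <person xml:id="p0042">
    <persName>Luise (illustrative entry)</persName>
  </person>
</listPerson>
```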

Framework

The XML files are stored in eXist-db (eXist Solutions 2021) and presented in various user-friendly ways with the help of the framework wdbplus (Kampkaspar 2018). By default, the corresponding scan and the transcribed text are presented side by side. Additionally, different forms of presentation are available so that the special needs of individual projects can be taken into account. Further advantages of wdbplus are various APIs, which not only allow the retrieval of individual texts, but also of metadata and further information. Full-text search is realised at project level as well as across projects.

CEiD's portfolio includes several projects in which a multitude of texts are processed. The source material ranges from early modern prints and manuscripts to more recent texts and includes early constitutional texts, religious peace agreements, newspapers and handwritten love letters.



ID: 173 / Poster Session: 14
Poster
Keywords: Software Sustainability, Software Development, DH Communities

From Oxgarage to TEIGarage and MEIGarage

P. Stadler, A. Ferger, D. Röwenstrunk

Paderborn University, Germany

A poster presenting the history and future development of OxGarage.



ID: 176 / Poster Session: 15
Poster
Keywords: marginalia, Old English, mise-en-page, sourceDoc, facsimile

Towards a digital documentary edition of CCCC41: The TEI and Marginalia-Bearing Manuscripts

P. O Connor

University of Oxford, United Kingdom

The specific aim of this case study is to demonstrate how the TEI Guidelines have transformed the representation of an important corollary of the medieval production process: the annotations, glosses and other textual evidence of an interactive engagement with the text. Cambridge, Corpus Christi College MS 41 (CCCC MS 41) best exemplifies the value of the TEI in this respect, as this manuscript is noted for containing a remarkable record of textual engagement from early medieval England. CCCC MS 41 is an early eleventh-century manuscript witness of the vernacular translation of Bede’s Historia ecclesiastica, commonly referred to as the Old English Bede. However, in addition to preserving the earliest historical account of early medieval England, the margins of CCCC MS 41 contain numerous Old English and Latin texts. Of the 490 pages of CCCC MS 41, 108 contain marginal texts which span several genres of Old English and Latin literature, and thereby provide the potential for substantial evidence of interaction with the manuscript’s central text.

While the marginalia of CCCC MS 41 continue to excite scholarly attention, the representation of this vast body of textual engagement poses certain challenges to editors of print scholarly editions. This poster emphasises the importance of the transcription process in successfully conveying the mise-en-page of marginalia-bearing manuscripts and explains how adopting the <facsimile> or <sourceDoc> approach encourages further engagement with and a deeper understanding of CCCC MS 41’s marginalia.
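
As a hedged illustration of the `<sourceDoc>` approach (my sketch, not the edition's actual encoding), a page can be transcribed as zones whose coordinates record the mise-en-page, keeping marginal texts spatially distinct from the central text:

```xml
<sourceDoc>
  <surface n="p108" ulx="0" uly="0" lrx="3000" lry="4000">
    <zone type="main" ulx="600" uly="400" lrx="2400" lry="3600">
      <line>Text of the Old English Bede ...</line>
    </zone>
    <zone type="marginalia" ulx="100" uly="400" lrx="550" lry="3600">
      <line>Marginal Old English or Latin text ...</line>
    </zone>
  </surface>
</sourceDoc>
```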



ID: 169 / Poster Session: 16
Poster
Keywords: letters, America, France, transnational, networks

Transatlantic Networks - a Pilot: mapping the correspondence of David Bailie Warden (1772-1845)

J. Orr, S. Howard, J. Cummings

Newcastle University, United Kingdom

The scientific revolution of the nineteenth century is often seen as remediating the early modern republic of letters (Klancher) from the pens of learned individuals to learned institutions. This project aims to map the transatlantic network of one of the most important hubs in the exchange of literary and scientific correspondence, David Bailie Warden (1772-1845). Warden is known as an Irish political asylum seeker, American diplomat, and respected Parisian scientific writer in his own right, authoring and collaborating in foundational statistical works on America, the burgeoning natural sciences, and anti-slavery. More importantly, his correspondence with at least 3,000 individuals and learned institutions reframes our perspective on the scientific revolution, its historical context, and its everyday activities. In addition to traditional close-reading methods, this project tests methods from the field of scientific network analysis to enable us to identify other important network nodes, enabling a process of continual discovery. The project seeks not only to compile a ‘who’s who’ of the intellectual community in this period but also to identify previously hidden facilitative figures whose importance to the fabric of the republic of letters might not be obvious at first, due to a range of marginalising factors including social class, transnationality, gender, religion, or other liminal identities.
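
By way of illustration only (the project's encoding choices are not described in the abstract; the names and date below are placeholders), the metadata of a single letter in such a corpus could be captured with TEI's standard `<correspDesc>`, which network-analysis tooling can then aggregate:

```xml
<correspDesc>
  <correspAction type="sent">
    <persName>David Bailie Warden</persName>
    <placeName>Paris</placeName>
    <date when="1815-03-02"/> <!-- date illustrative -->
  </correspAction>
  <correspAction type="received">
    <persName>A correspondent (placeholder)</persName>
    <placeName>Philadelphia</placeName>
  </correspAction>
</correspDesc>
```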

 

 