TEI Conference and Members' Meeting 2022
September 12 - 16, 2022 | Newcastle, UK
Conference Agenda
Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads if available).
Session Overview
Date: Thursday, 15/Sept/2022
9:00am - 9:30am | Registration - Thursday
9:30am - 11:00am | Session 4A: Short Papers | Location: ARMB: 2.98 | Session Chair: Peter Stadler, Paderborn University
ID: 126 / Session 4A: 1
Short Paper
Keywords: digital texts, textual studies, born-digital, electronic literature
TEI and the Re-Encoding of Born-Digital and Multi-Format Texts
University of Toronto, Canada
What affordances can TEI encoding offer scholars who work with born-digital, multi-format, and other kinds of texts produced in today's publishing environments, where the term "digitization" is almost redundant? How can we use TEI and other digitization tools to analyze materials that are already digital? How do we distinguish between a digital text's multiple editions or formats and its paratexts, and what differences do born-digital texts make to our understanding of markup? Can TEI help with a situation such as the demise of Flash, where the deprecation of a format has left many works of electronic literature newly vulnerable — and, consequently, newly visible as historical artifacts? These questions take us beyond descriptive metadata and back to digital markup's origins in electronic typesetting, but also point us toward recent work on electronic literature, digital ephemera, and the textual artifacts of the very recent past (e.g. those described in recent work by Matthew Kirschenbaum, Dennis Tenen, and Richard Hughes Gibson). Drawing from textual studies, publishing studies, book history, disability studies, and game studies, we are experimenting with the re-encoding of born-digital materials, using TEI to encode details of the texts' form and function as digital media objects. In some cases, we are working from a single digital source, and in others we are working with digital editions of materials that are available in multiple analogue and digital formats. Drawing on our initial encoding and modelling experiments, this paper explores the affordances of using TEI and modelling for born-digital and multi-format textual objects, particularly emerging digital book formats. We reconsider what the term "data" entails when one's materials are born-digital, and the implications for digital preservation practice and the emerging field of format theory.
ID: 107 / Session 4A: 2
Short Paper
Keywords: online forum, thread structure, social media, computer mediated communication
Capturing the Thread Structure: A Modification of CMC-Core to Account for Characteristics of Online Forums
Ruhr-University Bochum, Germany
The representation of computer-mediated communication (CMC), such as discussions in online forums, according to the guidelines of the Text Encoding Initiative has been addressed by the CMC Special Interest Group (SIG). Their latest schema, CMC-core, presents a basic way of representing a wide range of different types of CMC in TEI P5. However, the schema has a general aim and is not specifically tailored to capturing the thread structure of online forums. In particular, CMC-core is organized centrally by the time stamp of posts (a timeline structure), whereas online forums often split into threads and subthreads, giving less importance to the time of posting. In addition, forums may contain quotes both from external sources and from other forum posts, which need to be differentiated in an adapted <quote> element. Not only do online forums as a whole differ from other forms of CMC; there are often also considerable differences between individual forums. We created a corpus of posts from various religious online forums, including different communities on Reddit as well as two German forums which focus specifically on the topic of religion, with the purpose of analyzing their structure and textual content. These forums differ in the way threads are structured, in how emoticons and emojis are used, and in how people are able to react to other posts (for example, by voting). This raises the need for a schema which, on the one hand, takes the features of online forums as a genre into account and, on the other, is flexible enough to enable the representation of a wide range of different online forums. We present some modifications of the elements in CMC-core in order to guarantee a standardized representation of three substantially different online forums while retaining all their potentially interesting microstructural characteristics.
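To make the proposed adaptation concrete, a minimal sketch of a thread-aware encoding follows. It uses the <post> element proposed by the CMC SIG; the nested <div> levels and the @source distinction on <quote> are illustrative assumptions, not the authors' published schema, and all content is invented.

    <div type="thread">
      <!-- opening post of the thread -->
      <post xml:id="post1" who="#userA" when="2022-03-01T10:15:00">
        <p>Has anyone here read Augustine's Confessions?</p>
      </post>
      <div type="subthread">
        <!-- a reply is nested under the post it answers, instead of being ordered on a timeline -->
        <post xml:id="post2" who="#userB" when="2022-03-02T08:40:00">
          <!-- forum-internal quotation: @source points at another post in the corpus -->
          <quote source="#post1">Has anyone here read Augustine's Confessions?</quote>
          <!-- external quotation: @source points outside the forum -->
          <quote source="https://example.org/confessions/1.1">Our heart is restless until it rests in you.</quote>
          <p>Yes, and book one answers your earlier question.</p>
        </post>
      </div>
    </div>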
ID: 111 / Session 4A: 3
Short Paper
Keywords: digital publications, VRE, open access, scholarly communication, web publication
Publishing the grammateus research output with the TEI: how our scholarly texts become data
University of Geneva, Switzerland
The TEI is not used exclusively to encode primary sources: TEI-based scholarly publishing represents a non-negligible portion of TEI-encoded texts (Baillot and Giovacchini 2019). I present here how the encoding of secondary sources such as scholarly texts can benefit researchers, using the example of the grammateus project. In the grammateus project, we are creating a Virtual Research Environment (VRE) to present a new way of classifying Greek documentary papyri. This environment comprises a database of papyri, marked up with the standard EpiDoc subset of the TEI. It also includes the textual research output from the project, such as introductory materials, detailed descriptions of papyri by type, and an explanation of the methodology of the classification. The textual research output was deliberately prepared as an online publication so as to take full advantage of the interactivity with data offered by a web application, in contrast to a printed book. We are thus experimenting with a new model of scholarly writing and publishing. In this short paper I will describe how we have used the TEI not only for modeling papyrological data, but also for encoding the scholarly texts produced in the context of the project, which would traditionally have been material for a monograph or academic articles. I will also demonstrate how this has enabled us, later on, to enrich our texts with markup for features that have emerged as relevant. We implemented a spiraling encoding process in which methodological documentation and analytical descriptions keep feeding back into the editorial encoding of the scholarly texts. Documentation and analytical text therefore become data, within a research process based on a feedback method.
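By way of illustration, a scholarly paragraph encoded this way can point directly into the project's papyrological data. The element choices, identifiers, and the referenced papyrus below are invented for the sketch and do not reproduce the project's actual markup.

    <div type="methodology">
      <head>Classifying the documents</head>
      <!-- the web application resolves @target and shows the papyrus record beside the prose -->
      <p>Documents of the <term ref="#cheirographon">cheirographon</term> type share a
        characteristic opening formula, as in
        <ref target="papyri/p-mich-5-354.xml">P.Mich. V 354</ref>.</p>
    </div>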
ID: 153 / Session 4A: 4
Short Paper
Keywords: HTR, Transkribus, Citizen Science
Handwritten Text Recognition for heterogeneous collections? The Use Case Gruß & Kuss
¹University of Applied Sciences Darmstadt (h_da), Germany; ²University and State Library Darmstadt, Germany
Gruß & Kuss – Briefe digital. Bürger*innen erhalten Liebesbriefe (roughly, "Greetings and kisses – letters digital. Citizens preserve love letters"), a research project funded by the BMBF for 36 months, aims to digitize and explore love letters from ordinary persons with the help of dedicated volunteers, also raising the question of how citizens can actively participate in the indexing and encoding of textual sources. At present, transcriptions are made manually in Transkribus (lite), tackling a corpus of more than 22,000 letters from 52 countries and 345 donors, divided into approximately 750 bundles (i.e., correspondences between usually two writers). The oldest letter dates from 1715, the most recent from 2021; the project uses a very broad concept of the letter, including, for instance, notes left on pillows and WhatsApp messages. The paper investigates the applicability of Handwritten Text Recognition (HTR) to this highly heterogeneous stock in a citizen-science context. In an explorative approach, we investigate at which scope of a bundle, or at which number of pages in the same handwriting, HTR becomes worthwhile. For this purpose, the effort of manual transcription is first compared to the effort of creating a model in Transkribus (in particular the creation of a training and validation set by double keying), including final corrections. In a second step, we explore whether a modification of the procedure can be used to process even smaller bundles. Based on the given metadata (time of origin, gender, script ...), a first clustering can be created, and existing models can be used as a basis for graphemically similar handwritings, allowing training sets to be kept much smaller while maintaining acceptable error rates. Another possibility is to start off with mixed training sets covering a class of related scripts. Furthermore, we discuss how manual transcription by citizen scientists can be quantified in relation to the project's overall resources.
9:30am - 11:00am | Session 4B: Long Papers | Location: ARMB: 2.16 | Session Chair: Elisa Beshero-Bondar, Penn State Behrend
ID: 138 / Session 4B: 1
Long Paper
Keywords: IPIF, Prosopography, Personography, Linked Open Data
From TEI Personography to IPIF data
¹Austrian Academy of Sciences, Austria; ²University of Graz, Austria
The International Prosopography Interchange Format (IPIF) is an open API and data model for prosopographical data interchange, access, querying, and merging, using a regularised format. This paper discusses the challenges of converting TEI personographies into the IPIF format, and more general questions of using the TEI for so-called 'factoid' prosopographies.
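The conversion challenge can be pictured with a small, invented personography record: under the factoid model, each assertion in the entry below would become a separate IPIF factoid linking a person, a source, and a statement.

    <listPerson>
      <person xml:id="pers042">
        <persName>Ada Lovelace</persName>
        <!-- each of the following assertions would yield one IPIF factoid -->
        <birth when="1815-12-10">London</birth>
        <occupation>mathematician</occupation>
        <death when="1852-11-27"/>
      </person>
    </listPerson>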
ID: 147 / Session 4B: 2
Long Paper
Keywords: data modeling, information retrieval, data processing, digital philology, digital editions
TEI as Data: Escaping the Visualization Trap
¹Università di Torino, Italy; ²University of Vienna, Austria; ³Università di Pisa, Italy
During the last few years, the TEI Guidelines and schemas have continued to grow in capability and expressive power. A well-encoded TEI document constitutes a small treasure trove of textual data that can be queried to quickly derive information of different types. However, in many edition-browsing tools, e.g. EVT (http://evt.labcd.unipi.it/), access to such data is mainly intended for visualization purposes. Such an approach is hardly compatible with the strategy of setting up databases to query this data, leading to a split between environments: digital scholarly editions (DSEs) to browse edition texts versus databases to perform powerful and sophisticated queries. It would be interesting to expand the capabilities of EVT, and possibly other tools, adding functionality that would allow them to process TEI documents to answer complex user queries. This requires both an investigation to define the text model in terms of TEI elements and a subsequent implementation of the desired functionality, to be tested on a suitable TEI project that can adequately represent the text model. The Anglo-Saxon Chronicle stands out as an ideal environment in which to test such a method. The wealth of information it records about early medieval England makes it the optimal footing upon which to enhance computational methods for textual criticism, knowledge extraction, and data modeling for primary sources. The application of such a method could here prove essential in assisting the retrieval of knowledge otherwise difficult to extract from a text that survives in multiple versions. Bringing together, cross-searching, and querying information dispersed across all the witnesses of the tradition would allow us to broaden our understanding of the Chronicle in unprecedented ways. Interconnecting the management of a wide spectrum of named entities and realia—which is one of the greatest assets of TEI—with the representation of historical events would make it possible to gain new knowledge about the past. Most importantly, it would lay the groundwork for a Digital Scholarly Edition of the Anglo-Saxon Chronicle, a project never before undertaken. Therefore, we decided to implement a new functionality capable of extracting and processing a greater amount of information by cross-referencing various types of TEI/XML-encoded data. We developed a TypeScript library that outlines and exposes a series of APIs allowing the user to perform complex queries on the TEI document. Besides the cross-referencing of people, places, and events hinted at above—on the basis of standard TEI elements such as <listPerson>/<person>, <listPlace>/<place>, <listEvent>/<event>, etc.—we plan to support ontology-based queries, defining the relationships between different entities by means of RDF-like triples. In a similar way, it will be possible to query textual variants recorded in the critical apparatus by typology and witness distribution. The library will be integrated into EVT to interface directly with its existing data structures, but is not limited to it. We are currently working on designing a dedicated GUI within EVT to make the query system intuitive and user-friendly.
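As a sketch of the kind of interlinked entity data such queries would traverse (the identifiers and wording below are invented, but the elements are the standard TEI ones named above):

    <standOff>
      <listPerson>
        <person xml:id="alfred"><persName>Alfred, king of Wessex</persName></person>
      </listPerson>
      <listPlace>
        <place xml:id="edington"><placeName>Edington</placeName></place>
      </listPlace>
      <listEvent>
        <!-- an event cross-referencing a person and a place: a query API can follow these links -->
        <event xml:id="ev878" when="0878">
          <label>Battle of Edington</label>
          <desc><persName ref="#alfred">Alfred</persName> defeats the Danish army at
            <placeName ref="#edington">Edington</placeName>.</desc>
        </event>
      </listEvent>
    </standOff>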
ID: 120 / Session 4B: 3
Long Paper
Keywords: linked data, conversion, reconciliation, software development
LINCS' Linked Workflow: Creating CIDOC-CRM from TEI
University of Ottawa, Canada
TEI data is often so carefully curated, free of the noise and error common to algorithmically created data, that it is a perfect candidate for linked data creation; however, while most small TEI projects boast clean, beautifully crafted data, linked data creation is often out of reach both technically and financially for these project teams. This paper reports (following where others have trod¹) on the Networked Cultural Scholarship (LINCS) project's workflow, mappings, and tools for creating linked data from TEI resources. The process of creating linked data is far from straightforward, since TEI is by nature hierarchical, taking its meaning from the deep nesting of elements. Any one element in TEI may draw its meaning from its relationship to a grandparent well up the tree (for example, a persName appearing inside a listPerson inside the teiHeader is more likely to be a canonical reference to a person than a persName whose parent is a paragraph). Furthermore, the meanings of TEI elements are not always well represented in existing ontologies, and the time and money required to represent TEI-based information about people, places, time, and cultural production as linked data are out of reach of many small projects. This paper introduces the LINCS workflow for creating linked data from TEI. We will introduce the named entity recognition and reconciliation service, NSSI (pronounced "nessy"), and its integration into a TEI-friendly vetting interface, Leaf Writer. Following NSSI reconciliation, Leaf Writer users can download their TEI with the entity URIs in idno elements for their own use. If they wish to contribute to LINCS, they may proceed to enter the TEI document they have exported from Leaf Writer into XTriples, a customized version of the Digitale Akademie Mainz's tool of the same name, which converts TEI to CIDOC-CRM for either private use or integration into the LINCS repository. We adopted the XTriples tool because it meets the needs of a very common type of TEI user: the director or team member of a project who is not going to be able to learn the intricacies of CIDOC-CRM, or indeed perhaps even of linked data principles, but who would still like to contribute their data to LINCS. That said, we are keen to get the feedback of the expert users of the TEI community on our workflow, CIDOC-CRM mapping, and tools.
References:
1. Bodard, Gabriel, Hugh Cayless, Pietro Liuzzo, Chiara Cenati, Alison Cooley, Tom Elliott, Silvia Evangelisti, Achille Felicetti, et al. "Modeling Epigraphy with an Ontology." Zenodo, March 26, 2021.
Ciotti, Fabio. "A Formal Ontology for the Text Encoding Initiative." Umanistica Digitale, vol. 2, no. 3, 2018.
Eide, Ø., and C. Ore. "From TEI to a CIDOC-CRM Conforming Model: Towards a Better Integration Between Text Collections and Other Sources of Cultural Historical Documentation." Digital Humanities, 2007.
Ore, Christian-Emil, and Øyvind Eide. "TEI and Cultural Heritage Ontologies: Exchange of Information?" Literary and Linguistic Computing, vol. 24, no. 2, 2009, pp. 161–72, https://doi.org/10.1093/llc/fqp010.
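The context-dependence described above is easiest to see side by side. A minimal, invented example of the two cases a TEI-to-CIDOC-CRM mapping has to tell apart:

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc><!-- ... --></fileDesc>
        <profileDesc>
          <particDesc>
            <listPerson>
              <!-- canonical: this persName defines a person entity -->
              <person xml:id="mws"><persName>Mary Wollstonecraft Shelley</persName></person>
            </listPerson>
          </particDesc>
        </profileDesc>
      </teiHeader>
      <text>
        <body>
          <!-- in running prose: a mention, best mapped as a reference to the entity above -->
          <p>The novel was revised by <persName ref="#mws">Shelley</persName> in 1831.</p>
        </body>
      </text>
    </TEI>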
11:00am - 11:30am | Thursday Morning Refreshment Break | Location: ARMB: King's Hall
11:30am - 1:00pm | Session 5A: Long Papers | Location: ARMB: 2.98 | Session Chair: Dario Kampkaspar, Universitäts- und Landesbibliothek Darmstadt
ID: 159 / Session 5A: 1
Long Paper
Keywords: TEI XML, Handwritten Text Recognition, HTR, Libraries
Evolving Hands: HTR and TEI Workflows for Cultural Institutions
¹Newcastle University, United Kingdom; ²Bucknell University, USA
This long paper looks at the work of the Evolving Hands project, which is undertaking three case studies ranging across document forms to demonstrate how TEI-based HTR workflows can be iteratively incorporated into curation. These range from 19th- and 20th-century handwritten letters and diaries from the UNESCO Gertrude Bell Archive, through 18th-century German and 20th-century French correspondence, to a range of printed materials from the 19th century onward in English and French. A joint case study converts legacy printed material of the Records of Early English Drama (REED) project. By covering a wide variety of periods and document forms, the project has a real opportunity to foster responsible and responsive support for cultural institutions. See the uploaded abstract for more information.
ID: 109 / Session 5A: 2
Long Paper
Keywords: TEI, text extraction, linguistic annotation, digital edition, mass digitisation
Between Automatic and Manual Encoding: Towards a Generic TEI Model for Historical Prints and Manuscripts
¹École nationale des chartes | PSL (France); ²INRIA (France); ³Université de Genève (Switzerland)
Cultural heritage institutions today aim to digitise their collections of prints and manuscripts (Bermès 2020) and are generating more and more digital images (Gray 2009). To enrich these images, many institutions work with standardised formats such as IIIF, preserving as much of the source's information as possible. To take full advantage of textual documents, however, an image alone is not enough. Thanks to automatic text recognition technology, it is now possible to extract images' content on a large scale. The TEI seems to provide the perfect format to capture both an image's formal and textual data (Janès et al. 2021). However, this poses a problem: to ensure compatibility with a range of use cases, TEI XML files must guarantee IIIF or RDF exports and therefore must be based on strict data structures that can be automated, yet a rigid structure contradicts the basic principles of philology, which require maximum flexibility to cope with varied situations. The solution proposed by the Gallic(orpor)a project attempts to deal with this contradiction, focusing on French historical documents produced between the 15th and the 18th century. It aims to enrich the digital facsimiles distributed by the French National Library (BnF) in two different ways:
• text extraction, including segmentation of the image (layout analysis) with SegmOnto (Gabay, Camps, et al. 2021) and recognition of the text (Handwritten Text Recognition), augmenting already existing models (Pinche and Clérice 2021);
• linguistic annotation, including lemmatisation, POS tagging (Gabay, Clérice, et al. 2020), named entity recognition, and linguistic normalisation (Bawden et al. 2022).
Our TEI document model has two strictly coercive, automatically generated data blocks:
• the <sourceDoc>, with information from the digital facsimile, which computer vision, HTR, and segmentation tools produce using machine learning (Scheithauer et al. 2021);
• the <standOff> (Bartz et al. 2021a), with linguistic information produced by natural language processing tools (Gabay, Suarez, et al. 2022) to make it easier to search the corpus (Bartz et al. 2021b).
Two other elements are added that can be customised according to researchers' specific needs:
• a pre-filled <teiHeader> with basic bibliographic metadata automatically retrieved from (i) the digital facsimile's IIIF Image API and (ii) the BnF's Search/Retrieve via URL (SRU) API; the <teiHeader> can be enriched with additional data, as long as it respects a strict minimum encoding;
• a pre-editorialised <body>, the only element left entirely free regarding encoding choices.
By restricting certain elements and allowing others to be customised, our TEI model can efficiently pivot toward other export formats, including RDF and IIIF. Furthermore, the <sourceDoc> element's strict and thorough encoding of all the document's graphical information allows the TEI document to be converted into PAGE XML and ALTO XML files, which can then be used to train OCR, HTR, and segmentation models. Thus, not only does our TEI model's strict encoding avoid limiting philological choices, thanks to the free <body>, it also allows us to pre-editorialise the <body> from the content of the <sourceDoc> and, in the near future, the <standOff>.
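A skeleton of the document model just described; the zone type follows the SegmOnto vocabulary, while the URL, identifiers, and text content are placeholders invented for this sketch:

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <!-- strict minimum: bibliographic metadata pre-filled from the IIIF and SRU APIs -->
        <fileDesc><!-- ... --></fileDesc>
      </teiHeader>
      <sourceDoc>
        <!-- coercive block 1: facsimile, layout analysis, and HTR output -->
        <surface xml:id="f1">
          <graphic url="https://gallica.bnf.fr/iiif/.../f1/full/full/0/native.jpg"/>
          <zone type="MainZone">
            <line>Au lecteur</line>
          </zone>
        </surface>
      </sourceDoc>
      <standOff>
        <!-- coercive block 2: lemmas, POS tags, and named entities keyed to the sourceDoc -->
      </standOff>
      <text>
        <body>
          <!-- free block: the pre-editorialised text, open to any philological choice -->
          <p>Au lecteur</p>
        </body>
      </text>
    </TEI>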
ID: 128 / Session 5A: 3
Long Paper
Keywords: NER, HTR, Correspondence, Digital Scholarly Edition
Dehmel Digital: Pipelines, Text as Data, and Editorial Interventions at a Distance
¹State and University Library Hamburg, Germany; ²University of Hamburg, Germany
Ida and Richard Dehmel were a famous, internationally well-connected artist couple around 1900. Their correspondence, comprehensively preserved in approximately 35,000 documents, has so far remained largely unexplored in the Dehmel Archive of the State and University Library Hamburg. The main reason for this is the sheer quantity of material, which makes it difficult to explore using traditional methods of scholarly editing. Yet the corpus is relevant for future research precisely because of its size and variety: it not only contains many letters from important personalities in the arts and culture of the turn of the century, but also documents personal relationships, main topics, and forms and ways of communication in the cultural life of Germany and Europe before the First World War on a large scale. The project Dehmel digital sets out to close this gap by creating a digital scholarly edition of the Dehmels' correspondence that addresses these quantitative aspects with a combination of state-of-the-art machine learning approaches, namely handwritten text recognition (HTR) and named entity recognition (NER). At the heart of the project is a scalable pipeline that integrates automated and semi-automated text and data processing tasks. In our paper we will introduce and discuss its main steps:
1. importing the results of HTR from Transkribus and OCR4all;
2. applying a trained NER model;
3. disambiguating entities and referencing authority records with OpenRefine;
4. publishing data and metadata to a Linked Open Data web service.
Our main focus will be on the pipeline itself, the "glue" that ties together well-established tools (Transkribus, OCR4all, Stanford CoreNLP, OpenRefine), our use of TEI to encode the relevant information, and the particular challenges we observe when using text as data, i.e. combining automated and semi-automated processes with the desire for editorial intervention.
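The TEI output of steps 2 and 3 might look like the following invented fragment, with NER-detected names pointing at disambiguated authority records; the letter text is fabricated and the GND and GeoNames numbers are placeholders:

    <!-- after HTR (step 1), the NER model (step 2) tags names;
         OpenRefine (step 3) supplies the authority URIs -->
    <p>Gestern schrieb ich an
      <persName ref="https://d-nb.info/gnd/XXXXXXXX">Detlev von Liliencron</persName>
      über unser Treffen in
      <placeName ref="https://www.geonames.org/XXXXXXX">Hamburg</placeName>.</p>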
11:30am - 1:00pm | Session 5B: Panel - Manuscript catalogues as data for research | Location: ARMB: 2.16 | Session Chair: Katarzyna Anna Kapitan, University of Oxford
ID: 144 / Session 5B: 1
Panel
Keywords: Manuscripts, Provenance, Research, Clustering, Linked Data
Manuscript Catalogues as Data for Research
¹Cambridge University, United Kingdom; ²University of Oxford; ³Herzog August Bibliothek; ⁴University of Leeds
Manuscript catalogues present problems and opportunities for researchers, not least the status of manuscript descriptions as both information about texts and texts in themselves. In this panel, we will present three recent projects which have used manuscript catalogues as data for research, and which raise general questions in text encoding, in manuscript studies, and in data-driven digital humanities. This will be followed by a panel discussion to further investigate the issues and questions raised by the papers.
1. Investigating the Origins of Islamicate Manuscripts Using Computational Methods (Yasmin Faghihi and Huw Jones): This project evaluated computational methods for the generation of new information about the origins of manuscripts from existing catalogue data. The dataset was the Fihrist union catalogue of manuscripts from the Islamicate world. We derived a set of codicological features from the TEI data, clustered together manuscripts sharing features, and used dated/placed manuscripts to generate hypotheses about the provenance of other manuscripts in the clusters. We aimed to establish a set of base criteria for the dating/placing of manuscripts, to investigate methods of enriching existing datasets with inferred data to form the basis of further research, and to engage critically with the research cycle in relation to computational methods in the humanities.
2. Re-thinking the <provenance> element in TEI Manuscript Description to support graph database transformations (Toby Burrows and Matthew Holford): This paper reports on the transformation of the Bodleian Library's online medieval manuscripts catalogue, based on the "Manuscript Description" section of the TEI Guidelines, into RDF graphs using the CIDOC-CRM and FRBROO ontologies. This work was carried out in the context of two Linked Open Data projects: Oxford Linked Open Data and Mapping Manuscript Migrations (MMM). One area of particular focus was the provenance data relating to these manuscripts, which proved challenging to transform effectively from TEI to RDF. An important output from the MMM project was a set of recommendations for re-thinking the structure and encoding of the TEI <provenance> element to enable more effective reuse of the data in graph database environments. These recommendations draw on concepts previously outlined by Ore and Eide (2009), but also take into account the parallel work being done in the art museum and gallery community.
3. The use of TEI in the Handschriftenportal (Torsten Schaßan): The Handschriftenportal, the German national manuscript portal currently in the making, is built on TEI-encoded data. This includes representations of manuscripts, imported descriptions, authority data, and OCR-generated catalogues. In the future, it will be possible to enter descriptions directly into the backend database. The structure of the descriptive data is to be adapted to the latest developments in manuscript studies, e.g. the increased importance of material aspects, or the alignment of the description of texts and illuminations. Especially the latter, the data to be entered in the future, poses several issues for TEI encoding as currently defined in the Guidelines. These concern the overall structure of the main components of a description, as well as needs at a more detailed level.
Bios:
Dr Toby Burrows is a Digital Humanities researcher at the University of Oxford and the University of Western Australia. His research focuses on the history of cultural heritage collections, and especially medieval and Renaissance manuscripts.
Yasmin Faghihi is Head of the Near and Middle Eastern Department at Cambridge University Library. She is the editor of FIHRIST, the online union catalogue for manuscripts from the Islamicate world.
Matthew Holford is Tolkien Curator of Medieval Manuscripts at the Bodleian Library, Oxford. He has a long-standing research interest in the use of TEI for the description and cataloguing of Western medieval manuscripts.
Huw Jones is Head of the Digital Library at Cambridge University Library and Director of CDH Labs at Cambridge Digital Humanities. His work spans many aspects of collections-driven digital humanities, from creating and making collections available to their use in research and teaching.
Torsten Schaßan is a member of the Manuscripts and Special Collections department of the Herzog August Bibliothek Wolfenbüttel. He has been involved in many manuscript digitisation and cataloguing projects. In the Handschriftenportal project he is responsible for the definition of schemata and all transformations of data for import into the portal.
Chair: Dr Katarzyna Anna Kapitan is a manuscript scholar and digital humanist specialising in Old Norse literature and culture. Currently she is a Junior Research Fellow at Linacre College, University of Oxford, where she works on a digital book-historical project, "Virtual Library of Torfæus", funded by the Carlsberg Foundation.
Respondent: Dr N. Kıvılcım Yavuz works at the intersection of medieval studies and digital humanities, with expertise in medieval historiography and European manuscript culture. She is especially interested in the digitisation of manuscripts as cultural heritage items and in the creation, collection, and interpretation of data and metadata in the context of digital repositories.
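One way to picture the encoding problem discussed in the second paper: a single <provenance> element often bundles several events that a graph transformation would rather see as separate, typed, dated statements. A hedged sketch of the event-per-element style the MMM recommendations point towards; the manuscript history and names below are invented:

    <history>
      <!-- one event per provenance element, typed and dated,
           with the actors marked for entity extraction -->
      <provenance type="acquisition" notBefore="1625" notAfter="1630">
        Acquired by <persName role="owner">Sir Robert Bruce</persName>,
        probably from a London bookseller.
      </provenance>
      <provenance type="gift" when="1700">
        Presented to the nation by <persName role="donor">his grandson</persName>.
      </provenance>
    </history>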
1:00pm - 2:30pm | Thursday Lunch Break | Location: ARMB: King's Hall
2:30pm - 4:00pm | Session 6A: An Interview With ... Lou Burnard | Location: ARMB: 2.98 | Session Chair: Diane Jakacki, Bucknell University
An interview session: a short statement piece followed by interview questions, then audience questions.
4:00pm - 4:30pm | Thursday Afternoon Refreshment Break | Location: ARMB: King's Hall
4:30pm - 6:00pm | TEI Annual General Meeting - All Welcome | Location: ARMB: 2.98 | Session Chair: Diane Jakacki, Bucknell University